The Role of Modularity and Neuro-Regulation for the Production of Multiple Behaviors
Victor Massagué Respall
Institute of Robotics, Innopolis University
Innopolis, [email protected]. I
I. INTRODUCTION
Nowadays, a long-term goal of the robotics and artificial intelligence community is to create agents with the ability to solve different problems by producing multiple behaviors [1], [2]. This task is challenging because, until now, learning a new behavior to solve another problem has required forgetting previously acquired knowledge to accommodate the new one. This happens because the connections of the neural network change when it learns to perform a new behavior or task, leading to the loss of previously trained experience.

This project investigates whether functional specialization, or modularity, can support the development of multiple behaviors. With the term functional specialization we refer to a situation in which one or more neurons, eventually forming a specific sub-part of the neural network policy, are primarily responsible for the production of a specific behavior and are less involved in the production of other behaviors. When the specialized neurons are more strongly connected to the neurons of their own group than to other neurons, they form a specialized neural module.

In principle, modular solutions of this type can facilitate the development of multiple behaviors, since each module is responsible for the production of a different behavior. Consequently, the interference that arises among the neural mechanisms supporting the production of different behaviors can be reduced.

The realization of structural modularity of this type, however, also requires regulatory mechanisms that enhance the activity of the neurons specialized for the production of the behavior that is relevant in the current context and suppress, or filter out, the effect of the neurons specialized for the production of alternative behaviors.

The project involves the implementation of regulatory networks of this type and the realization of experiments involving the production of different behaviors.

The paper is organized as follows: Section 2 describes the system setup for the experiments, which are described in Section 3. Finally, Section 4 briefly discusses the results.
II. EXPERIMENTAL SETUP
In this section the methodology followed to solve the task is described. For this project, the evorobotpy software library [3] was used to obtain the different environments and to train the robots in an evolutionary manner, producing multiple behaviors with the Salimans algorithm [4]. The first step of the project consists in choosing a problem and designing a reward function that encourages the development of multiple behaviors. Two environments of the library were used to evaluate the performance of the neuro-regulation method under different circumstances.

The next step consists in implementing a neural network that includes regulatory neurons. Regulation can be realized by using special neurons that control the impact of other neurons, for example by including in the policy network both standard neurons and logistic regulatory neurons, and by multiplying the output of the standard neurons by the activation of the associated regulatory neurons.

To test whether neuro-regulation can improve the efficacy of the model, the evonet library is modified to support this behavior. More specifically, the layer of internal neurons is divided into two groups, as sketched in the code example after this list:
• The first group, which contains the first half of the internal neurons, is updated with the tanh activation function.
• The second half of the internal neurons is updated with the logistic function, which returns a value between 0.0 and 1.0.
• The activation of the first half is multiplied by the activation of the second half of the internal neurons.
• The activation of the second half of the internal neurons is then set to 0.0. This implies that the connection weights from the second half of the hidden units to the motor neurons are useless, meaning that they have no impact on the output neurons.

This arrangement implies that some internal neurons might be used more during the production of one behavior and other internal neurons more during the production of the second behavior. In other words, neurons can specialize for the production of different behaviors when necessary. The second set of internal neurons performs the regulatory function. This is a possible way to allow a specialization of the neurons.
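The following is a minimal sketch of the regulated hidden-layer update described above, assuming a plain feed-forward policy implemented with NumPy; the function and variable names are illustrative and do not correspond to the actual evonet implementation.

```python
import numpy as np

def regulated_hidden_layer(obs, W_in, b_in, W_out):
    """Sketch of the regulated update: the first half of the hidden layer
    uses tanh, the second half acts as logistic regulatory neurons that
    gate the first half and are then zeroed out (assumes an even number
    of hidden neurons)."""
    h = W_in @ obs + b_in                       # net input of all hidden neurons
    n = len(h) // 2
    standard = np.tanh(h[:n])                   # first half: standard tanh neurons
    regulatory = 1.0 / (1.0 + np.exp(-h[n:]))   # second half: logistic in [0.0, 1.0]
    gated = standard * regulatory               # regulatory neurons gate the standard ones
    hidden = np.concatenate([gated, np.zeros(n)])  # second half set to 0.0, so its
    # weights toward the motor neurons have no effect on the output
    return np.tanh(W_out @ hidden)              # motor activations
```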
A. Hopper
For the first set of experiments a simple problem was selected, Hopper (see Figure 1), in order to obtain many replication results quickly and to compare statistics across replications using different methodologies. The robot has three actuators, the thigh, leg and foot joints; its small number of actions makes it a fairly simple problem to work with.

Fig. 1. The Hopper robot environment, at rest (left) and in motion (right).

The goal for this robot is to learn two differentiated behaviors: one which tries to maximize the forward speed (by jumping forward) and another which maximizes the vertical height (by jumping up).

The original reward function computes the distance covered toward the target destination during a step, which is already one of the behaviors that we want. Equation 1 shows how the reward for the first behavior is calculated:

$$ reward_{bh_1} = \frac{d_{old} - d}{\Delta t} \qquad (1) $$

where d represents the current distance from the robot to the goal and d_old the previous distance to the goal.

A second reward function (see Equation 3) is added to represent the behavior of jumping vertically; it rewards the elevation of the Hopper over the floor and punishes the robot slightly for moving forward. The punishment can be helpful to push the robot to produce differentiated behaviors, but it should not be too strong. Moreover, the second reward is weighted with a constant so that the maximum value an effective robot can gain by jumping vertically is similar to the maximum value an effective robot can gain by jumping forward. This is done to avoid one behavior achieving a much stronger or much weaker reward than the other behavior.

$$ progress_{up} = \left| \frac{h - h_{old}}{\Delta t} \right| \qquad (2) $$

where h is the current height of the Hopper above the ground and h_old is the previous height above the ground.

$$ reward_{bh_2} = 2 \cdot progress_{up} - c \cdot reward_{bh_1} \qquad (3) $$

where progress_up is described in Equation 2, c is the small weighting constant of the penalty, and reward_bh1 corresponds to the penalty for moving forward, calculated as in Equation 1.
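As an illustration, the two reward terms could be computed as in the following sketch; the function and variable names are hypothetical, and the default value of the penalty coefficient c is an illustrative assumption, not the value used in the experiments.

```python
def hopper_rewards(d, d_old, h, h_old, dt, c=0.1):
    """Sketch of the two Hopper reward terms (Equations 1-3).
    d, d_old: current/previous distance to the goal
    h, h_old: current/previous height above the ground
    c: small penalty weight for forward motion (illustrative value)."""
    reward_bh1 = (d_old - d) / dt                     # Eq. 1: progress toward the goal
    progress_up = abs((h - h_old) / dt)               # Eq. 2: vertical progress
    reward_bh2 = 2.0 * progress_up - c * reward_bh1   # Eq. 3: jump up, small forward penalty
    return reward_bh1, reward_bh2
```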
Next, a mechanism is needed to tell the robot which behavior (jumping forward or jumping vertically) it should perform. To do so, the observation vector was extended with two additional input neurons, called Behavior 1 and Behavior 2, which are set to 5.0 and 0.0 when the robot should jump forward and to 0.0 and 5.0 when the robot should jump vertically. The observation vector, which is the input to the neural network, is shown in Table I. The output of the neural network (motor neurons) has size 3 and controls the torque applied to each joint.
TABLE I
OBSERVATION VECTOR OF THE HOPPER ROBOT (INPUT NEURONS)

0 - Height above ground          1 - sin(angle target)
2 - cos(angle target)            3 - Velocity in x
4 - Velocity in y                5 - Velocity in z
6 - Roll angle                   7 - Pitch angle
8,..,13 - Thigh, leg, and foot joint positions and velocities
14 - Feet contact                15 - Behavior 1
16 - Behavior 2
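A sketch of how the behavior-selector inputs could be appended to the standard observation is shown below; the selector magnitude of 5.0 follows the description above, while the helper function itself is illustrative.

```python
import numpy as np

def extend_observation(obs, behavior):
    """Append the two behavior-selector inputs to the observation vector.
    behavior: 1 to request jumping forward, 2 to request jumping up."""
    selector = np.array([5.0, 0.0]) if behavior == 1 else np.array([0.0, 5.0])
    return np.concatenate([obs, selector])  # e.g. 15 sensors + 2 selectors = 17 inputs
```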
The parameters used for the experiments are a feed-forward neural network structure with one hidden layer of 50 neurons. The maximum duration of each episode was set to 500 steps, with the additional termination condition of the robot falling down. Regarding the parameters of the algorithm, a total of 50 million evaluation steps were used, and 11 replications were run to compare the results statistically.
B. Ant
This environment presents a more challenging and complex task with respect to the previous one. In this case the robot is the Ant (see Figure 2), which has 8 actuators, one hip and one ankle joint for each of the four legs. This implies that the space of possible actions is much bigger, thus making the problem more difficult.

Fig. 2. The Ant robot environment at rest.

The goal of this robot is to learn two differentiated behaviors: one which tries to maximize the distance walked at a fixed angle θ to the left with respect to the orientation of the robot (Equation 7), and another which tries to maximize the distance walked at the angle θ to the right (Equation 8). The following equations describe the rewards for each behavior:

$$ xy_{angle} = \mathrm{atan2}(y - y_{old},\; x - x_{old}) \qquad (4) $$
$$ self_{angle} = xy_{angle} - yaw \qquad (5) $$
$$ step_{length} = \sqrt{(x - x_{old})^2 + (y - y_{old})^2} \qquad (6) $$
$$ reward_{bh_1} = step_{length} \cdot \cos(self_{angle} - \theta) \qquad (7) $$
$$ reward_{bh_2} = step_{length} \cdot \cos(self_{angle} + \theta) \qquad (8) $$

Equation 4 calculates the angle of the vector from the previous to the current xy position with respect to the x-axis, where x, x_old, y, y_old correspond to the current and previous x position of the robot and the current and previous y position of the robot, respectively. Equation 5 calculates the difference between the current orientation of the robot, yaw, and the angle walked. Next, Equation 6 gives the distance travelled in that period of time. Finally, for the first behavior, Equation 7 multiplies the distance walked by the cosine of the difference between self_angle and θ, which is maximized when the robot walks exactly θ to the left; for the second behavior, Equation 8, the sum is used instead of the difference, so the reward is maximized when the robot walks exactly −θ.
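The following sketch computes the two Ant reward terms under the same assumptions as before: the names are hypothetical, and the target deviation angle theta is passed in explicitly.

```python
import math

def ant_rewards(x, y, x_old, y_old, yaw, theta):
    """Sketch of the Ant reward terms (Equations 4-8).
    theta: target deviation angle from the robot's heading (radians)."""
    xy_angle = math.atan2(y - y_old, x - x_old)    # Eq. 4: direction of motion
    self_angle = xy_angle - yaw                    # Eq. 5: motion relative to heading
    step_length = math.hypot(x - x_old, y - y_old) # Eq. 6: distance travelled
    reward_bh1 = step_length * math.cos(self_angle - theta)  # Eq. 7: walk theta left
    reward_bh2 = step_length * math.cos(self_angle + theta)  # Eq. 8: walk theta right
    return reward_bh1, reward_bh2
```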
Following the same fashion as for the Hopper robot, the observation vector was extended with two additional neurons, Behavior 1 and Behavior 2, to represent the two behaviors. When they are set to 5.0 and 0.0 respectively, the robot is rewarded for walking θ to the left, and when they are set to 0.0 and 5.0 the robot should walk θ to the right. The resulting observation vector, corresponding to 30 input neurons, is shown in Table II. The output of the neural network has size 8 and controls the torque applied to each joint.

TABLE II
OBSERVATION VECTOR OF THE ANT ROBOT (INPUT NEURONS)

0 - Height above ground          1 - sin(angle target)
2 - cos(angle target)            3 - Velocity in x
4 - Velocity in y                5 - Velocity in z
6 - Roll angle                   7 - Pitch angle
8,..,23 - Hip and ankle joint positions and velocities for each of the four legs
24,..,27 - Feet contact for each of the four legs
28 - Behavior 1                  29 - Behavior 2
The parameters used for the experiments are a feed-forward neural network with one hidden layer of 100 neurons. The maximum duration of each episode is set to 500 steps, with the additional termination condition of the robot falling down. Regarding the parameters of the algorithm, a total of 100 million evaluation steps were used, and 4 replications were run to compare the results statistically.
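For reference, the settings of the two experiments can be summarized as follows; this is a plain illustrative summary, not the actual evorobotpy configuration format.

```python
# Illustrative summary of the two experimental setups (not evorobotpy syntax).
EXPERIMENTS = {
    "Hopper": {"hidden_neurons": 50,  "episode_steps": 500,
               "evaluation_steps": 50_000_000,  "replications": 11},
    "Ant":    {"hidden_neurons": 100, "episode_steps": 500,
               "evaluation_steps": 100_000_000, "replications": 4},
}
```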
III. RESULTS
This section presents and discusses the results obtained. The results are divided into the behaviors obtained with and without neuro-regulation, in order to compare the performance and the influence of the technique.

When testing the version without neuro-regulation, it was found that the robot is not able to learn to perform both behaviors at the same time, even to some extent. Probably what happens is that the robots start to develop one of the two behaviors, and the evolutionary process then tends to consider the robots that are asked to perform that behavior good and the robots that are asked to perform the other behavior bad, somewhat independently of their actual ability. Consequently, the evolutionary process gets stuck attempting to optimize a single behavior only.

Two ways were found that could alleviate the problem; a sketch of both is given below. The first strategy consists in evaluating each robot for two episodes, one for the first behavior and one for the second. The total fitness then corresponds to the sum of the fitness obtained during the two episodes. The second strategy consists in evaluating symmetric individuals for the ability to produce the same type of behavior. The usage of symmetric samples implies that the centroid is moved in the direction of the individual of the couple that achieved the highest fitness and in the opposite direction of the individual that achieved the lowest fitness. If the two robots of a couple are asked to produce different behaviors, the centroid will be moved in the direction of the robot evaluated on the first behavior, independently of its real ability. If the two robots are evaluated on the same behavior, things should work better.
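Here is a minimal sketch of the two evaluation strategies, assuming a generic evaluate(params, behavior) function that returns the fitness of a parameter vector on one behavior; the sampling and update details of the Salimans algorithm are omitted.

```python
import numpy as np

def fitness_strategy_1(params, evaluate):
    """Strategy 1: one episode per behavior; total fitness is the sum."""
    return evaluate(params, behavior=1) + evaluate(params, behavior=2)

def fitness_strategy_2(centroid, perturbation, evaluate, rng):
    """Strategy 2: the two symmetric samples of a pair are evaluated on the
    SAME, randomly chosen behavior, so their fitness difference reflects
    ability rather than which behavior was requested."""
    behavior = rng.choice([1, 2])
    f_plus = evaluate(centroid + perturbation, behavior)
    f_minus = evaluate(centroid - perturbation, behavior)
    return f_plus, f_minus
```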
Fig. 3. Fitness evolution of the Hopper for 11 replications using strategy 1 (left) and strategy 2 (right), where strategy 1 refers to evaluating the robot for two episodes (one for each behavior) and strategy 2 refers to evaluating symmetric individuals.
The comparison of the fitness progress for different seeds is depicted in Figure 3 for both strategies. Strategy 1 refers to evaluating the robot for two episodes, even episodes with behavior 1 and odd episodes with behavior 2. This strategy does not involve randomness, whereas strategy 2 refers to the evaluation of symmetric individuals in the Salimans algorithm, meaning that when an individual needs to be evaluated, a behavior is chosen randomly for the pair.

Moreover, analyzing Figure 4, we can conclude that the second strategy performs better than the first one. The first strategy has an average fitness of 1137.19 with a standard deviation of 139.55. On the other hand, the second strategy has an average fitness of 1382.38 with a standard deviation of 192.29.

Fig. 4. Comparison of the fitness of the Hopper across replications for both strategies, with the average represented in orange and its corresponding standard deviation.

Finally, neuro-regulation is implemented on top of strategy two, since in principle it does not influence the strategy chosen, to see if it leads to significant improvements. Figure 5 depicts the evolution of the fitness during five replications for the Hopper. Compared with the raw standard network, there is no significant improvement; some of the replications even perform worse than the standard network. It can be seen in Figure 6 that there is no big change with the addition of neuro-regulation. The explanation for why modularity helps only a little might be that the performance is already rather good with the standard network.
Fig. 5. Fitness evolution of the Hopper for 5 replications using strategy 2 for the standard network (left) and neuro-regulation (right).

Fig. 6. Comparison of the fitness of the Hopper for 5 replications between the standard network with the first strategy (left) and neuro-regulation (right).
Regarding the Ant robot, Figure 7 shows the evolution of the fitness during the training process of 100 million steps using the standard network and neuro-regulation.

Fig. 7. Fitness evolution of the Ant for 2 replications using strategy 2 for the standard network (left) and neuro-regulation (right).

Moreover, Figure 8 shows the average fitness of the set of replications using the standard network and the neuro-regulation strategies. The standard network seems to obtain better results than the implementation of neuro-regulation. Possibly, more replications are needed to obtain a more robust statistical result.
Fig. 8. Comparison of the fitness of the Ant for 2 replications between the standard network with the first strategy (left) and neuro-regulation (right).
The full source code for each step, as well as the trained files, can be found in the GitHub repository provided (https://github.com/vicmassy/behavioral_robotics/tree/master/project).

IV. CONCLUSIONS
This project analyzes the role of modularity and neuro-regulation in the production of multiple behaviors, by first implementing two strategies that allow the Hopper robot to learn to perform both behaviors to some extent, and then comparing them with the addition of neuro-regulation.

The system does not show significant improvements with the addition of neuro-regulation, due to the fact that the robot already performs well with the standard network, possibly because of the simplicity of the problem.

In future work, more replications must be run for neuro-regulation to obtain a more consistent statistical result to compare with the standard network. Moreover, a harder problem must be chosen to observe the effects on different environments and robot complexities.
REFERENCES

[1] K. O. Ellefsen, J.-B. Mouret, and J. Clune, "Neural modularity helps organisms evolve to learn new skills without forgetting old skills," PLOS Computational Biology, vol. 11, no. 4, pp. 1–24, 04 2015. [Online]. Available: https://doi.org/10.1371/journal.pcbi.1004128
[2] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in International Conference on Machine Learning, 2016, pp. 1329–1338.
[3] S. Nolfi and O. Gigliotta, "Evorobot," in Evolution of Communication and Language in Embodied Agents. Springer, 2010, pp. 297–301.
[4] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv preprint arXiv:1703.03864, 2017.