The Role of Modularity and Neuro-Regulation for the Production of Multiple Behaviors
Victor Massagué Respall
Institute of Robotics, Innopolis University
Innopolis, [email protected]. I
I. INTRODUCTION
Nowadays, a long-term goal of the robotics and artificial intelligence community is to create agents with the ability to solve different problems by producing multiple behaviors [1], [2]. This task is challenging because, until now, learning a new behavior to solve another problem has required forgetting previously acquired knowledge to accommodate the new one. This happens because the connections of the neural network change when it learns to perform a new behavior or task, leading to the loss of previously trained experience.

This project investigates whether functional specialization, or modularity, can support the development of multiple behaviors. With the term functional specialization we refer to a situation in which one or more neurons, eventually forming a specific sub-part of the neural network policy, are primarily responsible for the production of a specific behavior and are less involved in the production of other behaviors. When the specialized neurons are more strongly connected to the neurons of their own group than to other neurons, they form a specialized neural module.

In principle, modular solutions of this type can facilitate the development of multiple behaviors, since each module is responsible for the production of a different behavior. Consequently, the interference that arises among the neural mechanisms supporting the production of different behaviors can be reduced.

The realization of structural modularity of this type, however, also requires regulatory mechanisms that enhance the activity of the neurons specialized for the production of the behavior that is relevant in the current context and suppress, or filter out, the effect of the neurons specialized for the production of alternative behaviors.

The project involves the implementation of regulatory networks of this type and the realization of experiments involving the production of different behaviors.

The paper is organized as follows: Section 2 describes the system setup for the experiments, which are described in Section 3. Finally, Section 4 briefly discusses the results.
II. EXPERIMENTAL SETUP
In this section the methodology followed to solve the task is described. For this project, the evorobotpy software library [3] was used to obtain the different environments and to train the robots in an evolutionary manner, producing multiple behaviors with the Salimans algorithm [4]. The first step of the project consists in choosing a problem and designing a reward function that encourages the development of multiple behaviors. Two environments of the library were used to evaluate the performance of the neuro-regulation method under different circumstances.

The next step consists in implementing a neural network that includes regulatory neurons. Regulation can be realized by using special neurons that control the impact of other neurons, for example by including in the policy network both standard neurons and logistic regulatory neurons, and by multiplying the output of the standard neurons by the activation of the associated regulatory neurons.

To test whether neuro-regulation can improve the efficacy of the model, the evonet library is modified to support this behavior. More specifically, the layer of internal neurons is divided into two groups, as sketched in the code example after this list:
• The first group, which contains the first half of the internal neurons, is updated with the tanh activation function.
• The second half of the internal neurons is updated with the logistic function, which returns a value between 0.0 and 1.0.
• The activation of the first half is multiplied by the activation of the second half of the internal neurons.
• The activation of the second half of the internal neurons is then set to 0.0. This implies that the connection weights from the second half of the hidden units to the motor neurons are useless, meaning that they have no impact on the output neurons.

This arrangement implies that some internal neurons might be used more during the production of one behavior and other internal neurons more during the production of the second behavior. In other words, neurons can specialize for the production of different behaviors when necessary. The second set of internal neurons performs the regulatory function. This is a possible way to allow a specialization of the neurons.
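The following is a minimal sketch of the regulated hidden-layer update described above, assuming a plain feed-forward policy implemented with NumPy; the function and variable names are illustrative and do not correspond to the actual evonet implementation.

```python
import numpy as np

def regulated_hidden_layer(obs, W_in, b_in, W_out):
    """Sketch of the regulated update: the first half of the hidden layer
    uses tanh, the second half acts as logistic regulatory neurons that
    gate the first half and are then zeroed out (assumes an even number
    of hidden neurons)."""
    h = W_in @ obs + b_in                       # net input of all hidden neurons
    n = len(h) // 2
    standard = np.tanh(h[:n])                   # first half: standard tanh neurons
    regulatory = 1.0 / (1.0 + np.exp(-h[n:]))   # second half: logistic in [0.0, 1.0]
    gated = standard * regulatory               # regulatory neurons gate the standard ones
    hidden = np.concatenate([gated, np.zeros(n)])  # second half set to 0.0, so its
    # weights toward the motor neurons have no effect on the output
    return np.tanh(W_out @ hidden)              # motor activations
```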
A. Hopper
For the first set of experiments a simple problem was selected, Hopper (see Figure 1), in order to obtain many replication results quickly and to compare statistics across replications using different methodologies. The robot has three actuators, the thigh, leg and foot joints; its small number of actions makes it a fairly simple problem to work with.

Fig. 1. The Hopper robot environment, at rest (left) and in motion (right).

The goal for this robot is to learn two differentiated behaviors: one which tries to maximize the forward speed (by jumping forward) and another which maximizes the vertical height (by jumping up).

The original reward function computes the distance covered toward the target destination during a step, which is already one of the behaviors that we want. Equation 1 shows how the reward for the first behavior is calculated:

$$ reward_{bh_1} = \frac{d_{old} - d}{\Delta t} \qquad (1) $$

where d represents the current distance from the robot to the goal and d_old the previous distance to the goal.

A second reward function (see Equation 3) is added to represent the behavior of jumping vertically; it rewards the elevation of the Hopper over the floor and punishes the robot slightly for moving forward. The punishment can be helpful to push the robot to produce differentiated behaviors, but it should not be too strong. Moreover, the second reward is weighted with a constant so that the maximum value an effective robot can gain by jumping vertically is similar to the maximum value an effective robot can gain by jumping forward. This is done to avoid one behavior achieving a much stronger or much weaker reward than the other behavior.

$$ progress_{up} = \left| \frac{h - h_{old}}{\Delta t} \right| \qquad (2) $$

where h is the current height of the Hopper above the ground and h_old is the previous height above the ground.

$$ reward_{bh_2} = 2 \cdot progress_{up} - c \cdot reward_{bh_1} \qquad (3) $$

where progress_up is described in Equation 2, c is the small weighting constant of the penalty, and reward_bh1 corresponds to the penalty for moving forward, calculated as in Equation 1.
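As an illustration, the two reward terms could be computed as in the following sketch; the function and variable names are hypothetical, and the default value of the penalty coefficient c is an illustrative assumption, not the value used in the experiments.

```python
def hopper_rewards(d, d_old, h, h_old, dt, c=0.1):
    """Sketch of the two Hopper reward terms (Equations 1-3).
    d, d_old: current/previous distance to the goal
    h, h_old: current/previous height above the ground
    c: small penalty weight for forward motion (illustrative value)."""
    reward_bh1 = (d_old - d) / dt                     # Eq. 1: progress toward the goal
    progress_up = abs((h - h_old) / dt)               # Eq. 2: vertical progress
    reward_bh2 = 2.0 * progress_up - c * reward_bh1   # Eq. 3: jump up, small forward penalty
    return reward_bh1, reward_bh2
```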
Next, a mechanism is needed to tell the robot which behavior (jumping forward or jumping vertically) it should perform. To do so, the observation vector was extended with two additional input neurons, called Behavior 1 and Behavior 2, which are set to 5.0 and 0.0 when the robot should jump forward and to 0.0 and 5.0 when the robot should jump vertically. The observation vector, which is the input to the neural network, is shown in Table I. The output of the neural network (motor neurons) has size 3 and controls the torque applied to each joint.
TABLE I
OBSERVATION VECTOR OF THE HOPPER ROBOT (INPUT NEURONS)

0 - Height above ground          1 - sin(angle target)
2 - cos(angle target)            3 - Velocity in x
4 - Velocity in y                5 - Velocity in z
6 - Roll angle                   7 - Pitch angle
8,..,13 - Thigh, leg, and foot joint positions and velocities
14 - Feet contact                15 - Behavior 1
16 - Behavior 2
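A sketch of how the behavior-selector inputs could be appended to the standard observation is shown below; the selector magnitude of 5.0 follows the description above, while the helper function itself is illustrative.

```python
import numpy as np

def extend_observation(obs, behavior):
    """Append the two behavior-selector inputs to the observation vector.
    behavior: 1 to request jumping forward, 2 to request jumping up."""
    selector = np.array([5.0, 0.0]) if behavior == 1 else np.array([0.0, 5.0])
    return np.concatenate([obs, selector])  # e.g. 15 sensors + 2 selectors = 17 inputs
```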
The parameters used for the experiments are a feed-forward neural network structure with one hidden layer of 50 neurons. The maximum duration of each episode was set to 500 steps, with the additional termination condition of the robot falling down. Regarding the parameters of the algorithm, a total of 50 million evaluation steps were used, and 11 replications were run to compare the results statistically.
B. Ant
This environment presents a more challenging and complex task with respect to the previous one. In this case the robot is the Ant (see Figure 2), which has 8 actuators, one hip and one ankle joint for each of the four legs. This implies that the space of possible actions is much bigger, thus making the problem more difficult.

Fig. 2. The Ant robot environment at rest.

The goal of this robot is to learn two differentiated behaviors: one which tries to maximize the distance walked at a fixed angle θ to the left with respect to the orientation of the robot (Equation 7), and another which tries to maximize the distance walked at the angle θ to the right (Equation 8). The following equations describe the rewards for each behavior:

$$ xy_{angle} = \mathrm{atan2}(y - y_{old},\; x - x_{old}) \qquad (4) $$
$$ self_{angle} = xy_{angle} - yaw \qquad (5) $$
$$ step_{length} = \sqrt{(x - x_{old})^2 + (y - y_{old})^2} \qquad (6) $$
$$ reward_{bh_1} = step_{length} \cdot \cos(self_{angle} - \theta) \qquad (7) $$
$$ reward_{bh_2} = step_{length} \cdot \cos(self_{angle} + \theta) \qquad (8) $$

Equation 4 calculates the angle of the vector from the previous to the current xy position with respect to the x-axis, where x, x_old, y, y_old correspond to the current and previous x position of the robot and the current and previous y position of the robot, respectively. Equation 5 calculates the difference between the current orientation of the robot, yaw, and the angle walked. Next, Equation 6 gives the distance travelled in that period of time. Finally, for the first behavior, Equation 7 multiplies the distance walked by the cosine of the difference between self_angle and θ, which is maximized when the robot walks exactly θ to the left; for the second behavior, Equation 8, the sum is used instead of the difference, so the reward is maximized when the robot walks exactly −θ.
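The following sketch computes the two Ant reward terms under the same assumptions as before: the names are hypothetical, and the target deviation angle theta is passed in explicitly.

```python
import math

def ant_rewards(x, y, x_old, y_old, yaw, theta):
    """Sketch of the Ant reward terms (Equations 4-8).
    theta: target deviation angle from the robot's heading (radians)."""
    xy_angle = math.atan2(y - y_old, x - x_old)    # Eq. 4: direction of motion
    self_angle = xy_angle - yaw                    # Eq. 5: motion relative to heading
    step_length = math.hypot(x - x_old, y - y_old) # Eq. 6: distance travelled
    reward_bh1 = step_length * math.cos(self_angle - theta)  # Eq. 7: walk theta left
    reward_bh2 = step_length * math.cos(self_angle + theta)  # Eq. 8: walk theta right
    return reward_bh1, reward_bh2
```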
Following the same fashion as for the Hopper robot, the observation vector was extended with two additional neurons, Behavior 1 and Behavior 2, to represent the two behaviors. When they are set to 5.0 and 0.0 respectively, the robot is rewarded for walking θ to the left, and when they are set to 0.0 and 5.0 the robot should walk θ to the right. The resulting observation vector, corresponding to 30 input neurons, is shown in Table II. The output of the neural network has size 8 and controls the torque applied to each joint.

TABLE II
OBSERVATION VECTOR OF THE ANT ROBOT (INPUT NEURONS)

0 - Height above ground          1 - sin(angle target)
2 - cos(angle target)            3 - Velocity in x
4 - Velocity in y                5 - Velocity in z
6 - Roll angle                   7 - Pitch angle
8,..,23 - Hip and ankle joint positions and velocities for each of the four legs
24,..,27 - Feet contact for each of the four legs
28 - Behavior 1                  29 - Behavior 2
The parameters used for the experiments are a feed-forward neural network with one hidden layer of 100 neurons. The maximum duration of each episode is set to 500 steps, with the additional termination condition of the robot falling down. Regarding the parameters of the algorithm, a total of 100 million evaluation steps were used, and 4 replications were run to compare the results statistically.
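For reference, the settings of the two experiments can be summarized as follows; this is a plain illustrative summary, not the actual evorobotpy configuration format.

```python
# Illustrative summary of the two experimental setups (not evorobotpy syntax).
EXPERIMENTS = {
    "Hopper": {"hidden_neurons": 50,  "episode_steps": 500,
               "evaluation_steps": 50_000_000,  "replications": 11},
    "Ant":    {"hidden_neurons": 100, "episode_steps": 500,
               "evaluation_steps": 100_000_000, "replications": 4},
}
```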
III. RESULTS
This section presents and discusses the results obtained. The results are divided into the behaviors obtained with and without neuro-regulation, in order to compare the performance and the influence of the technique.

When testing the version without neuro-regulation, it was found that the robot is not able to learn to perform both behaviors at the same time, even to some extent. Probably what happens is that the robots start to develop one of the two behaviors, and the evolutionary process then tends to consider the robots that are asked to perform that behavior good and the robots that are asked to perform the other behavior bad, somewhat independently of their actual ability. Consequently, the evolutionary process gets stuck attempting to optimize a single behavior only.

Two ways were found that could alleviate the problem; a sketch of both is given below. The first strategy consists in evaluating each robot for two episodes, one for the first behavior and one for the second. The total fitness then corresponds to the sum of the fitness obtained during the two episodes. The second strategy consists in evaluating symmetric individuals for the ability to produce the same type of behavior. The usage of symmetric samples implies that the centroid is moved in the direction of the individual of the couple that achieved the highest fitness and in the opposite direction of the individual that achieved the lowest fitness. If the two robots of a couple are asked to produce different behaviors, the centroid will be moved in the direction of the robot evaluated on the first behavior, independently of its real ability. If the two robots are evaluated on the same behavior, things should work better.
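Here is a minimal sketch of the two evaluation strategies, assuming a generic evaluate(params, behavior) function that returns the fitness of a parameter vector on one behavior; the sampling and update details of the Salimans algorithm are omitted.

```python
import numpy as np

def fitness_strategy_1(params, evaluate):
    """Strategy 1: one episode per behavior; total fitness is the sum."""
    return evaluate(params, behavior=1) + evaluate(params, behavior=2)

def fitness_strategy_2(centroid, perturbation, evaluate, rng):
    """Strategy 2: the two symmetric samples of a pair are evaluated on the
    SAME, randomly chosen behavior, so their fitness difference reflects
    ability rather than which behavior was requested."""
    behavior = rng.choice([1, 2])
    f_plus = evaluate(centroid + perturbation, behavior)
    f_minus = evaluate(centroid - perturbation, behavior)
    return f_plus, f_minus
```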
Fig. 3. Fitness evolution of the Hopper for 11 replications using strategy 1 (left) and strategy 2 (right), where strategy 1 refers to evaluating the robot for two episodes (one for each behavior) and strategy 2 refers to evaluating symmetric individuals.
The comparison of the fitness progress for different seeds is depicted in Figure 3 for both strategies. Strategy 1 refers to evaluating the robot for two episodes, even episodes with behavior 1 and odd episodes with behavior 2. This strategy does not involve randomness, whereas strategy 2 refers to the evaluation of symmetric individuals in the Salimans algorithm, meaning that when an individual needs to be evaluated, a behavior is chosen randomly for the pair.

Moreover, analyzing Figure 4, we can conclude that the second strategy performs better than the first one. The first strategy has an average fitness of 1137.19 with a standard deviation of 139.55. On the other hand, the second strategy has an average fitness of 1382.38 with a standard deviation of 192.29.

Fig. 4. Comparison of the fitness of the Hopper across replications for both strategies, with the average represented in orange and its corresponding standard deviation.

Finally, neuro-regulation is implemented on top of strategy two, since in principle it does not influence the strategy chosen, to see if it leads to significant improvements. Figure 5 depicts the evolution of the fitness during five replications for the Hopper. Compared with the raw standard network, there is no significant improvement; some of the replications even perform worse than the standard network. It can be seen in Figure 6 that there is no big change with the addition of neuro-regulation. The explanation for why modularity helps only a little might be that the performance is already rather good with the standard network.
Fig. 5. Fitness evolution of the Hopper for 5 replications using strategy 2 for the standard network (left) and neuro-regulation (right).

Fig. 6. Comparison of the fitness of the Hopper for 5 replications between the standard network with the first strategy (left) and neuro-regulation (right).
Regarding the Ant robot, Figure 7 shows the evolution of the fitness during the training process of 100 million steps using the standard network and neuro-regulation.

Fig. 7. Fitness evolution of the Ant for 2 replications using strategy 2 for the standard network (left) and neuro-regulation (right).

Moreover, Figure 8 shows the average fitness of the set of replications using the standard network and the neuro-regulation strategies. The standard network seems to obtain better results than the implementation of neuro-regulation. Possibly, more replications are needed to obtain a more robust statistical result.
Fig. 8. Comparison of the fitness of the Ant for 2 replications between the standard network with the first strategy (left) and neuro-regulation (right).
The full source code for each step, as well as the trained files, can be found in the GitHub repository provided (https://github.com/vicmassy/behavioral_robotics/tree/master/project).

IV. CONCLUSIONS
This project analyzes the role of modularity and neuro-regulation in the production of multiple behaviors, by first implementing two strategies that allow the Hopper robot to learn to perform both behaviors to some extent, and then comparing them with the addition of neuro-regulation.

The system does not show significant improvements with the addition of neuro-regulation, due to the fact that the robot already performs well with the standard network, possibly because of the simplicity of the problem.

In future work, more replications must be run for neuro-regulation to obtain a more consistent statistical result to compare with the standard network. Moreover, a harder problem must be chosen to observe the effects on different environments and robot complexities.
REFERENCES

[1] K. O. Ellefsen, J.-B. Mouret, and J. Clune, "Neural modularity helps organisms evolve to learn new skills without forgetting old skills," PLOS Computational Biology, vol. 11, no. 4, pp. 1–24, 04 2015. [Online]. Available: https://doi.org/10.1371/journal.pcbi.1004128
[2] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in International Conference on Machine Learning, 2016, pp. 1329–1338.
[3] S. Nolfi and O. Gigliotta, "Evorobot," in Evolution of Communication and Language in Embodied Agents. Springer, 2010, pp. 297–301.
[4] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv preprint arXiv:1703.03864, 2017.