Pseudorehearsal in actor-critic agents with neural network function approximation
Vladimir Marochko, Leonard Johard, Manuel Mazzara, Luca Longo
Vladimir Marochko
LLC Innodata
Innopolis, [email protected]
Leonard Johard
Innopolis University
Innopolis, [email protected]
Manuel Mazzara
Innopolis University
Innopolis, [email protected]
Luca Longo
Dublin Institute of Technology
Dublin, [email protected]
Abstract — Catastrophic forgetting has a significant negative impact in reinforcement learning. The purpose of this study is to investigate how pseudorehearsal can change the performance of an actor-critic agent with neural-network function approximation. We tested the agent in a pole balancing task and compared different pseudorehearsal approaches. We have found that pseudorehearsal can assist learning and decrease forgetting.
Index Terms — reinforcement learning, neural networks, catastrophic forgetting, pseudorehearsal
I. INTRODUCTION
Reinforcement learning (RL) is a growing area of biologically inspired machine learning techniques. RL is used to train agents how to act in unknown environments with unknown optimal behavior. Agents know only the goals they have to reach. Training is based on numeric feedback to the agent's actions. The common practice is to give a positive reward when the final goal is reached or a negative one when the final goal becomes impossible to reach. When the number of states recognizable by the agent is infinite or too large to keep in memory, function approximations are used for state processing. One of the common approximations used in RL is an artificial neural network (ANN). Neural networks are vulnerable to the problem known as catastrophic forgetting (CF), which causes information losses in the network during retraining. In RL, ANN retraining occurs often, from once per step to once per episode, so CF has a significant effect. Pseudorehearsal is one of the methods used to prevent CF, and we use it here to improve the speed and quality of training. We conducted an experiment on a simulation of the pole balancing cart to show that pseudorehearsal significantly improves the performance of an actor-critic agent.

II. BACKGROUND
A. Reinforcement learning
Reinforcement learning is a framework of machine-learning-based applications for action selection, policy improvement and state evaluation. RL is a natural concept used by living creatures. The first research on RL in nature appeared about a century ago. Edward Thorndike found that learning is based on the ability of animals to figure out results from the consequences of their behavior [1]. Later B. F. Skinner researched learning by reinforcement and punishment more deeply [2]. According to R. Sutton, RL as a computer science concept was born in 1979 at the University of Massachusetts. It was the result of analysis, extension and application of the ideas from the work of A. Harry Klopf, "Brain function and adaptive systems: a heterostatic theory" [3]. Growth in computational power and techniques, fast decreasing computer prices and the flexibility of agents make RL a popular and promising area.

All RL algorithms are based on a simple sequence: get an observation of the current state of the environment, apply some rule to choose the next action towards the goal, receive a reward or punishment, and improve the rule. An observation is a representation of the environment that the agent can get and process. A state is an observation at some moment of time. The observed environment is assumed to be a Markov Decision Process (MDP). An MDP is a mathematical framework for modeling decision-making problems. For an environment to be an MDP, the Markov property should be satisfied: each state s_t at any timestep t is conditionally independent of all previous states s_{t-n}, ∀n ∈ N, and actions a_{t-n}, while the next state reached from the current one after some action is applied should be definable.

State-action mapping rules in RL agents are expressed in the policy function π(s_t, a). The policy denotes the probability of choosing action a in state s_t. Policies may vary from simple conditional rules to complex self-sufficient functions. The value function v(s_t) is the way the agent predicts the expected total reward that can be reached from state s_t. The action-value function Q(s_t, a) denotes the expected total reward reached from state s_t if the agent chooses action a for the next step. The value function is used to create the model of the environment. If the MDP is not fully observable and cannot be kept in memory, the value function, the policy, or both can be represented by a function approximation. It approximates the agent's input from the observed sensory input. Learning on approximated states may provide decent learning quality, while an insufficient approximation may lead to serious performance degradation [4]. One of the widely used approximations is an ANN. The network takes the state observation as the input and the update rule computed by the algorithm as the target. It is trained to be both the approximation for the state and the policy or value function for this approximation.

B. Actor-critic algorithms

Actor-critic model-based algorithms are a class of advanced on-policy RL algorithms which can model the environment and construct an optimal policy for the resulting model. It has been proved that with properly chosen learning parameters they can be asymptotically optimal, meaning the agent's total reward is close to the maximal one as the number of steps grows [5], provided the policy convergence is not too fast; choosing these proper parameters, however, can be difficult [6]. Actor-critic agents use a policy-based actor for action selection and train the actor on the critic's evaluation. The critic is a temporal-difference (TD) learning algorithm that models the environment. Actor-critic approaches require minimal computation for action selection and achieve high performance [3].
1) Actor:
Actor-critic algorithms use the actor to choose actions based on the current state. The actor is typically a policy gradient function. Policy gradient is a model-free algorithm which does not use any models and does not evaluate the states. It returns the policy π(s_t, a_t) for the state s_t and updates this state-to-distribution mapping by the rule of gradient ascent: Δπ(s_t, a_t) ∼ ∇J_π, where J is the total expected reward, J = E{Σ_k γ^k r_k}, and γ is the discounting factor, a parameter for the agent's farsightedness. If γ is close to 1, the trained agent can account for reward many steps ahead; if γ is close to zero, the agent pays more attention to the closest actions.
An agent using policy gradient, as a gradient method, is guaranteed to converge to some optimal policy, but policy gradient methods, like any other gradient method, can fall into a local optimum.
Policy gradient actors can be represented by a neural network; in this case a choice function is applied to the ANN output. In this paper we used the softmax function (1):

P(a = a_j | π(s, a)) = e^{π(s, a_j)/τ} / Σ_{i=0}^{|a|} e^{π(s, a_i)/τ}    (1)

where τ is the exploration parameter (here, the temperature): the higher the τ, the more exploratory the policy. The exploration-exploitation dilemma is the problem the agent solves to find a globally optimal policy in decent time. Exploitation is using already known information about the environment; exploration is visiting previously unvisited states. The dilemma is in proper balancing that lets the agent reach the optimal policy quickly but does not make it walk around some local optimum.
Using a single neural network both for the implementation of the policy gradient algorithm and for function approximation simplifies the update rule, and the resulting formula becomes Δθ_π ≈ α ∂ρ/∂θ_π. The policy network's weights are denoted as θ_π, the expected reward as ρ and the learning rate as α [7].
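Equation (1) can be implemented directly. The following is a minimal sketch, assuming the actor's outputs are available as a NumPy array with one entry per action; the function name, the default temperature and the random generator are illustrative choices, not the paper's implementation.

```python
import numpy as np

def softmax_action(actor_outputs, tau=1.0, rng=None):
    """Sample an action from the softmax (Boltzmann) distribution of eq. (1).

    actor_outputs : array of pi(s, a_i) values, one per available action.
    tau           : temperature; a larger tau gives a more exploratory policy.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(actor_outputs, dtype=float) / tau
    scaled -= scaled.max()                      # stabilize the exponentials
    probs = np.exp(scaled) / np.exp(scaled).sum()
    action = rng.choice(len(probs), p=probs)
    return action, probs
```

Subtracting the maximum before exponentiating does not change the resulting distribution but avoids numerical overflow when τ is small.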
2) Critic:
Policy gradient methods have good convergence properties, but they do not store information about the environment, so their performance is limited. The critic estimates the policy which is currently being followed by the actor. It provides a critique which takes the form of a TD error, δ = R + γV(s_{t+1}) − V(s_t), which drives all learning in both the actor and the critic. The TD error is usually computed on value functions and then used for policy evaluation; expressed through the action-value function it becomes δ = R + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t), which is the update of the SARSA learning algorithm. SARSA agents evaluate both states and the current policy. The critic is a state-value function; it keeps information about the environment and predicts action outcomes. On the other hand, the agent then explores the environment for a longer time [8]. After each action selection the critic determines whether things have gone better or worse than expected.
We used neural networks to represent both the actor and the critic. In our case the number of possible actions from each state is small and constant, so the critic's outputs can be used directly as the target for the actor's learning. After some simplifications of the regular gradient update rule we have the following rule for the actor-critic: θ_{t+1} = Γ(θ_t + β δ_t ψ_{s_t,a_t}), where θ_t are the policy gradient weights, β is the policy learning rate, δ_t is the TD error, i.e. the critic's estimation of the last action, ψ_{s_t,a_t} is the approximation function, and Γ(x) is a projection function, which can be ignored assuming the iterations remain bounded [9]. As a single structure is used both for the policy and the approximation, the actor's update is simplified to (2):

Δθ = β δ_t    (2)

So the actor's update is made strictly by the critic's estimation of the action. The policy learning rate β should be significantly smaller than α, the value-function learning rate, so that the agent learns the model and evaluates states much faster than it changes its policy. Consequently, the actor acts independently while exploring the environment, and the critic's remarks help the actor but do not paralyze its initiative.
Such segregation of duties helps to provide fast decision making and a high chance of exploration. The policy converges much more slowly than the model is learned, which allows the agent to act close to the greedy policy in highly evaluated states and look for new ones as long as an improvement is still possible.
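To make the update rules concrete, here is a minimal sketch of one TD actor-critic step with linear function approximators standing in for the two networks. It applies the general rule θ ← θ + βδψ, of which (2) is the simplification used in this paper; the feature map `phi`, the default learning rates and all names are illustrative assumptions.

```python
import numpy as np

def actor_critic_step(theta, w, phi, s, a, r, s_next,
                      alpha=0.1, beta=0.01, gamma=0.9):
    """One TD actor-critic update with linear approximators (illustrative only).

    theta : actor weights, shape (n_actions, n_features), updated in place
    w     : critic weights, shape (n_features,), updated in place
    phi   : feature function mapping a state to a feature vector
    """
    features = phi(s)
    delta = r + gamma * (w @ phi(s_next)) - (w @ features)  # TD error, the critic's critique
    w += alpha * delta * features          # critic: move V(s) toward r + gamma * V(s')
    theta[a] += beta * delta * features    # actor: update driven purely by delta
    return delta
```

With a neural-network actor and critic the same δ would instead be back-propagated as the training target, as described above.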
3) Biological roots of actor-critic algorithm:
The functions of dopamine circuits in the human and animal brain are normally explained as sources of motivation that provide an intrinsic reward for an action learned as leading to reward, even if the external reward was not provided. In this way dopamine neurons accelerate motivation and decision making [10], [11]. Lack of dopamine leads to problems with learning for long-term reward and is often seen as a root of addictions and procrastination, but learning can still occur successfully in the case of an immediate reward. From this point of view dopamine neurons work as the critic in the actor-critic algorithm [12]. This allows animals and people to learn to act for a reward which will be received much later than the actions, without an effort of will but with the pleasure caused by the intrinsic reward produced by the critic. It is interesting to note that the update rules for dopamine critics are based not on actual rewards, but only on differences between the actual total reward and the predicted one [13], so the actor-critic algorithm implements the real function of a brain structure in a similar way.

C. Catastrophic forgetting

1) Causes of catastrophic forgetting:
Catastrophic forgetting (CF) is a common problem in neural networks that is connected with memory consolidation. The problem occurs when a neural network trained to execute some tasks faces changing conditions or learns to execute a new task. Old information might be erased in the process of learning new data. In RL this problem is common and leads to serious performance falls. The network's learning repeats regularly, relearning occurs very often, and the target is usually not the same for the same input, so CF can seriously damage performance [14]. The cause of CF lies in the mathematical nature of neural networks. In linear networks forgetting happens because the networks base their prediction from input data on vector orthogonality [15]. The change of error on the first learned set depends on the dot product between the sets; if the orthogonality is low, the error grows. In the case of a non-linear ANN the situation is not so clear, but tends to be similar: CF becomes significant when the information is very distributed and highly overlapping between sets [16].
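The dependence on the dot product can be seen directly in a one-layer linear model: after one gradient step on a new input b, the prediction for a previously learned input a shifts by an amount proportional to a·b. The sketch below only illustrates this point under that simple setting; it is not the analysis of [15].

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)                          # weights of a linear model y = w . x
a = rng.normal(size=4)                          # a previously learned input pattern

# One new pattern orthogonal to a, one identical to a (maximal overlap).
b_orth = np.array([-a[1], a[0], -a[3], a[2]])   # a . b_orth == 0 by construction
b_same = a.copy()

def step_on(w, x, target=0.0, lr=0.1):
    """One gradient step of the squared error 0.5 * (w.x - target)**2 on pattern x."""
    return w - lr * (w @ x - target) * x

for name, b in [("orthogonal", b_orth), ("overlapping", b_same)]:
    drift = (step_on(w, b) - w) @ a             # change of the model's output on a
    print(f"{name:11s}  a.b = {a @ b: .3f}   output drift on a = {drift: .3f}")
```

The orthogonal pattern leaves the old prediction untouched, while the overlapping one changes it, which is exactly the interference described above.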
2) Methods of avoiding catastrophic forgetting:
The most common simple methods for avoiding catastrophic forgetting can be grouped into two general approaches. The first approach includes rehearsal, pseudorehearsal and similar methods, which create additional patterns and make the agent learn on them together with the new examples. After learning with such a method there are grounds to expect the performance to be as good as if the initial training had occurred on both sets at the same time rather than one after another. Rehearsal methods keep items from the previously learned sets in a rehearsal buffer (see the sketch at the end of this subsection). They fight CF well [17], but require additional memory. Pseudorehearsal (PR) is similar to rehearsal, but instead of real items from the old sets the agent uses generated pseudoitems. More complex methods like transfer learning and dueling networks require creating additional structures.
The second group comprises methods like context biasing and activation sharpening. They update the learning rules for hidden layers to protect some part of the meaningful information from changing. Activation sharpening increases the activation of some of the most active hidden units [18]. Context biasing changes the activations based on the Hamming distance between old and new activation vectors [19]. The newer EWC method explored in [20] is grounded on a similar theoretical basis. Classic overlap-reducing techniques do not really help to avoid CF, but reduce the time needed for relearning on the base set.
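As a point of contrast with the pseudorehearsal method studied below, this is a minimal sketch of plain rehearsal: a bounded buffer of real old items that are mixed into every new training batch. The class and method names, the capacity and the eviction rule are illustrative assumptions.

```python
import random

class RehearsalBuffer:
    """Keep a bounded sample of previously learned (input, target) items."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = []

    def add(self, x, y):
        if len(self.items) >= self.capacity:
            # Evict a random old item so the buffer stays a rough sample of the past.
            self.items.pop(random.randrange(len(self.items)))
        self.items.append((x, y))

    def mixed_batch(self, new_items, n_old=16):
        """Return the new examples plus a random sample of stored old ones."""
        old = random.sample(self.items, min(n_old, len(self.items)))
        return list(new_items) + old
```

Pseudorehearsal replaces the stored real items with generated pseudoitems, removing the extra memory requirement.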
D. Pseudorehearsal
PR is a simple and computationally efficient method for solving the CF problem which has proven successful in unsupervised learning [17], supervised learning problems [21], [16] and sometimes in reinforcement learning as well [22], [14], [23]. It is interesting to note that the results of Baddeley suggest that the widely studied ill conditioning might not be the main bottleneck of reinforcement learning, while CF may be.
PR is a two-step process: the first step is the construction of a set of pseudopatterns and the second is training the network using these pseudopatterns. The optimal way of creating pseudopattern inputs is the subject of additional research. The simple one, proposed by Robins, is to randomly assign each element of the input vector a value of 0 or 1. Feeding these pseudoinputs through the neural network and saving its outputs helps to preserve the internal state of the network. These models have been proven highly effective by [17]; the authors' argument was that, although the input is completely random, the activation distributions on deeper levels of the network will be representative of the previously learned input data. We can use pseudopatterns for the correction of the learning weights with respect to the orthogonality between the learned example and the pseudovectors, or we can use them as batch vectors for batch backpropagation algorithms. The first method was used by Frean and Robins in their work [15]. Working on similar research on Q-learning agents we adapted those equations to make them easier to implement (3):

Δw_i = (err_{b_i} / pr) · Σ_{j=1}^{pr} [ b_i (x_{ij} · x_{ij}) − x_{ij} (x_{ij} · b_i) ] / [ (b_i · b_i)(x_{ij} · x_{ij}) − (b_i · x_{ij})(b_i · x_{ij}) ]    (3)

We also checked whether learning pseudopatterns on the hidden-layer activations improves performance, because the neural networks in [15] were linear and our network is non-linear. In this project we test both ideas, PR on the output only and PR on each layer, to find out whether it gives any improvements. A sketch of the pseudopattern generation and replay step follows below.
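The following is a minimal sketch of the batch-style PR step, assuming a network object with `predict` and `train_on_batch` methods (hypothetical names): random binary pseudoinputs are labelled by the current network's own outputs, and these pseudoitems are then trained together with each new example.

```python
import numpy as np

def make_pseudoitems(net, n_items, input_dim, rng=None):
    """Random 0/1 pseudoinputs labelled by the *current* network's outputs."""
    rng = rng or np.random.default_rng()
    pseudo_x = rng.integers(0, 2, size=(n_items, input_dim)).astype(float)
    pseudo_y = net.predict(pseudo_x)      # snapshot of the mapping the network has learned
    return pseudo_x, pseudo_y

def pr_update(net, x_new, y_new, pseudo_x, pseudo_y):
    """Train on the new example and the pseudoitems together (batch-style PR)."""
    X = np.vstack([np.atleast_2d(x_new), pseudo_x])
    Y = np.vstack([np.atleast_2d(y_new), pseudo_y])
    net.train_on_batch(X, Y)
```

The learning-rate-correcting variant (FR PR) would instead use the pseudoitems inside equation (3); rehearsing hidden-layer activations additionally stores each layer's response to the pseudoinputs, not only the output.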
E. Biological inspiration of pseudorehearsal

The physiological aspect of the PR approach is the enforcement of knowledge consolidation based on random signals. This is similar and related to the processes which take place in the brain during REM sleep, when memory consolidation happens. During that process the brain at the same time learns, moving significant memories to long-term memory, and unlearns, which makes us forget less important memories [24]. Hattori has also shown that the hippocampus structure provides avoidance of CF by a dual-network PR approach using pseudopatterns produced by neocortical networks [25].

III. EXPERIMENT
A. Environment
We apply the PR algorithms to an actor-critic agent executing the single-pole cart balancing problem, a well-known reinforcement learning task described by Sutton [3] and extended further (Fig. 1). The task is to balance a pole installed on a cart for as long as possible by pushing the cart left or right. If the pole falls or the cart leaves the track, the episode is failed and the agent receives the reward R = -1. The output of this experiment is the number of steps the agent balanced the pole in an episode; the bigger this number, the better. The dynamics is simulated by the equations taken from Wieland [26] with the sign of the angular acceleration changed, since otherwise it was directed opposite to its physically expected direction.

Fig. 1. Double-pole cart
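For reference, here is a minimal sketch of single-pole cart dynamics with Euler integration. It uses the commonly cited cart-pole equations; the physical constants, the step size and the sign convention are assumptions and may not match the Wieland simulator used in the experiments.

```python
import math

# Assumed physical constants (not taken from the paper).
G, M_CART, M_POLE, L_HALF, DT = 9.8, 1.0, 0.1, 0.5, 0.02

def cartpole_step(x, x_dot, theta, theta_dot, force):
    """Advance the cart-pole state by one Euler step under the applied force."""
    total_m = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + M_POLE * L_HALF * theta_dot ** 2 * sin_t) / total_m
    theta_acc = (G * sin_t - cos_t * temp) / (
        L_HALF * (4.0 / 3.0 - M_POLE * cos_t ** 2 / total_m))
    x_acc = temp - M_POLE * L_HALF * theta_acc * cos_t / total_m
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)
```

An episode would terminate, with reward -1, once theta or x leaves the allowed range.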
B. Agent
The agent for learning on this task is an actor-critic agent using two feed-forward back-propagation neural networks, one for the actor and the other for the critic. The learning rate of the critic is α = 0. , the learning rate of the actor is β = 0. , and the discounting factor for the intrinsic reward is γ = 0. ; the backpropagation algorithm for the approaches that learn through batch backpropagation is limited by time. We expect all PR approaches to improve the agent's performance significantly in the long run.

C. Observation
The agent receives an observation of the current cart and pole positions, velocities and accelerations. Observations are returned at discrete timesteps. The observation is represented as a real-valued vector, where each i-th observed parameter is written into one of two vector cells: the (2i)-th if the parameter is positive or the (2i+1)-th if the parameter is negative. The other cell assigned to that parameter is set to zero. After that, the linear parameters are divided by 20 and the angular ones by 60 for normalization. As the result relies heavily on all parameters, the actual reward is rare and the environment is highly unstable, we have a low orthogonality of feature vectors from the agent's side. This environment with this observation suffers from a very high amount of CF, and for this reason it is a good testbed for tools that eliminate CF.
For a more evident representation of results we plot graphs of tendencies: vectors where the i-th element is the mean of the i-th to (i+100)-th elements of the original result vector. Tendency graphs help to evaluate the agent's average progress more accurately, without the up-and-down jumps usual for RL agents' performance.
As the results are close to normally distributed, we applied Student's t-test to see whether the results are statistically significant and therefore whether we can make a strong statement based on the research.
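The sign-split encoding and the tendency smoothing can be sketched as follows. The normalization constants 20 and 60 and the window of 100 follow the description above; whether the stored value is the magnitude or the signed value is not stated, so the magnitude here is an assumption, as are the function names and the zero-based indexing.

```python
import numpy as np

def encode_observation(params, n_linear):
    """Write parameter i into cell 2*i (positive) or 2*i+1 (negative), then normalize.

    params[:n_linear] are linear quantities (divided by 20),
    the remaining ones are angular quantities (divided by 60).
    """
    obs = np.zeros(2 * len(params))
    for i, p in enumerate(params):
        scale = 20.0 if i < n_linear else 60.0
        obs[2 * i if p >= 0 else 2 * i + 1] = abs(p) / scale   # magnitude assumed
    return obs

def tendency(results, window=100):
    """i-th element is the mean of elements i .. i+window of the result vector."""
    r = np.asarray(results, dtype=float)
    return np.array([r[i:i + window + 1].mean() for i in range(len(r) - window)])
```

The smoothed-minimums curve used later in the results is built the same way, with the minimum in place of the mean.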
IV. RESULTS

Trying different forces of push, we found that the agent and the PR behave quite differently for different force values. During early training we found that the most demonstrative results are reached at 25 N and 2.5 N. A stronger push corresponds to highly risky actions, a weaker push to low-risky ones. Low-risk environments have a larger and denser subspace of optimal states, i.e. states with at least one possible action that leads to another optimal state. In a high-risk environment optimal states are placed much more sparsely.
A. Learning agent with highly risky actions
An environment with highly risky actions is a very complicated environment for the agent. Few actions lead to success while nearly all others lead the agent to failure. Applying artificial agents to this type of task is very important, because the execution of such tasks by a human is highly stressful, and stress can seriously damage performance on complicated tasks [27]. From this point of view, even modest training results would allow replacing human agents with artificial ones in risky tasks with a performance gain.
1) Choosing a better correction type for pseudorehearsal-based learning:
The first question of interest is which kind of PR with learning rate correction (FR PR) is better in the case of a nonlinear network: fixing the learning rule only on the output layer, or fixing it on each layer with respect to the pseudopatterns' activations on those layers. For testing we took some samples with PR learning on outputs only and some with PR learning on all activations. Samples with the same parameters were compared by computing a difference vector, plotting its graph and plotting the tendency graph of the difference. For all parameters the visual evaluation showed the same thing: FR PR applied to all layers gave a much better result, and all the tendency graphs looked nearly the same. Fig. 2 shows the example for FR PR 30 Rel 10.

Fig. 2. Tendency graph for the difference between the agent with PR correcting learning on outputs and one correcting learning on all layers

As we can see, almost everywhere the tendency curve of the difference is above zero. The performance of the FR PR agent with weight correction applied to all layers is higher. The significance test showed a t-statistic far greater than the one-tail critical value, which means that this difference is statistically significant; therefore FR PR applied to each layer of a non-linear neural network gives a much higher performance in all trials.
The actor-critic agent without PR starts with a poor policy as bad as the free-fall behavior; performance falls quickly, then grows fast and later drops extremely. Then a similar behavioral template repeats again with even lower results. Almost all runs have a lower performance than if the pole were just falling down without any involvement. The convergence occurs at a very low local optimum (Fig. 3). We suppose that CF has such an effect because of the high number of negative rewards: the agent mistakenly evaluates possibly optimal states that are similar to risky ones as risky too, and erases the previously learned overlapping weights.

Fig. 3. Even on the initial graph it is perfectly visible where learning occurs and where convergence to a worse state takes place; the tendency graph just makes it more convenient for visual evaluation
In the tests the performance of all agents was still poor, but better than in the no-PR case, and there were visible differences of behavior connected with the types of PR. The sizes of the pseudosets and the reinitialization frequency show almost no effect on the behavior.
FR PR shows a strong improvement of performance (Fig. 4). After the agent starts, it quickly learns how to reach a fairly high performance, two to three times better than free fall, but then the agent starts to diverge slowly. Finally it diverges to a policy worse than the one initially reached, but still better than the agent without PR has. After reaching that level no serious changes happen. This result shows that FR PR helps the agent to widen the exploration and go further even when meeting highly negative rewards. At first it protects the weights from being immediately erased by an avalanche of negative rewards; because of this protection the agent may continue exploration in these highly negative states for longer than the agent without PR.

Fig. 4. Typical steps-per-episode graph for FR PR with highly risky behaviour

The batch-backpropagation PR approach has shown a good result in avoiding CF. The overall performance is sufficiently high and does not suffer from the slow monotonic degradation seen in the examples with FR PR (Fig. 5). Visual comparison of the PR approaches with the same size of pseudoset and relearning frequency shows that batch backpropagation provides a higher efficiency and a lower stability (Fig. 6). While the batch PR agent has a higher mean value than the weight-correcting agent (107.6 vs 83.6), it has an about three times higher variance (2900 vs 1080). That means that the agent with the batch PR type has a far more aggressive and risky policy. This tendency to visit risky states seems to be caused by deeper and stronger relearning of the previous network's internal state, so after CF occurrences the agent quickly returns to the previous high performance.

Fig. 5. Batch-backpropagation based PR example
Fig. 6. Comparing batch-backpropagation PR and FR PR by tendency graph

Comparing the tendency graphs, one can see that while initially the tendencies of the weight-correcting approach were at least not worse than the batch one, after some number of steps the weight-correcting approach starts to diverge to a less risky suboptimal state, while the batch PR is still at the top (Fig. 7). The result of the experiment shows that the agent oscillates around some quickly found local optimum. The results of these oscillations are much higher than the corresponding results of the FR PR and no-PR approaches. The significance test confirmed that the difference between the approaches has a t-statistic of about 38, a significance so high that it was obvious even before the test. The computational cost of the batch-backpropagation implementation used in the test is one to two orders of magnitude higher than the computational time of the vector multiplications used for learning rate correction.

Fig. 7. Comparing batch backpropagation PR and FR PR

B. Learning agent with low risky actions
In a less risky case there exists a large subspace within the state space where almost all actions will not lead to the falling of the pole. The further the agent moves from the center of this subspace, the higher the chance of performing an action which will drop the pole. When the size of the safe state or states is increased, there is a lower risk that the values learned as safe will be overwritten because of overlapping with a very similar state. In this type of problem it is harder to lose all the collected knowledge because of CF. On the other hand, in the current problem the time period between reaching a failing state and the actual failure of the task is larger.
Acting in a low-risk environment is simpler, and the actor-critic agent without PR reached a high performance fast and kept showing the same high performance for about two thousand steps. The agent also increased its avoidance of dangerous states, marked by the increased lower boundary on the graph. This agent is better at recognizing states with the lowest expected reward and avoiding them. After that sequence of good policy the agent's performance quickly diverges and cannot return to the same high result again (Fig. 8).

Fig. 8. Agent's performance without PR and with low risky actions

PR is used to avoid this performance drop caused by CF. FR PR in this type of problem shows a very interesting picture: not only does it grow coherently, but its lower boundary goes up as well (Fig. 9). We plotted the tendency graph and the graph of smoothed minimums defined by the following rule: the i-th element of the minimums vector is the minimum of the i-th to (i+100)-th elements of the original sample. Both graphs grow coherently and are expected to converge to some optimal policy with high performance (Fig. 10).

Fig. 9. Agent's performance with learning-rate correction PR
Fig. 10. Tendencies of mean and minimal values for the agent with FR PR

The batch PR approach worked far worse than FR PR and worse than in the case of the environment with highly risky actions. The performance was a bit better than free fall, and significantly better than the agent without PR had, but much lower than in the case of FR PR. The mean value is about 140 vs 163, which is 1.15 times lower. The batch PR agent also seems to reach its optimal policy, and neither learning nor relearning occurs any more: it keeps oscillating around the same value for most of the experiment, except for the initial learning at the very beginning of the learning series (Fig. 11). The T-test proved this difference to be significant.

Fig. 11. Performance graph for the batch-backpropagation approach

Applying the learning rate correction smoothed the learning curve and improved learning with a significance (t-statistic) of about 4.5. It is interesting to note that, unlike all previous approaches, it did not improve the mean performance much; the improvement was negligible, from 56.39 to 56.99. This PR approach decreased the variance from 101.7 to 69.9, about 1.45 times. As the picture shows there are fewer high spikes, but the overall performance converges to some local optimum and does not fall lower.
V. CONCLUSION AND FUTURE WORK
We showed that the PR approaches strongly improve the learning of an actor-critic agent with a softmax action-selection function. Different PR approaches bring different kinds and amounts of performance boost, but all of them were statistically significant and none of them led to a worse performance. The possible cause is that the actor-critic agent gets a higher possibility of exploring promising states.
We used a nonlinear neural network with a hidden layer and showed that rehearsing the activations returned by the hidden-layer neurons for pseudoinputs improved performance significantly compared to keeping activations on the output layer only. The statistical significance test has shown 43.77 for the t-statistic, which is a highly significant result.
We found that the effects of different PR approaches vary in different situations, depending on the density of the distribution of optimal subspaces in the multidimensional state space. This dependency was found during the early training when choosing the optimal force of push for the algorithm to make the graphs maximally evident. For environments with highly risky and low-risky actions, the possible cost of a mistake is denoted in the very terms. The experiment has shown that the batch-backpropagation algorithms provide a better performance that does not degrade in high-risk environments. In a low-risk environment FR PR trains faster and the performance continues growing after initial training. These results can help in choosing the PR approach for a concrete task. As this topic is not thoroughly explored, it might lead to many new interesting experiments and discoveries for reinforcement learning agents.
Some issues remain open for research. First of all, the exploration of different RL algorithms in high-risk and low-risk environments; it would help to find important properties of the environment and to choose proper algorithms and parameters for task execution. Secondly, the application of PR to different reinforcement and supervised learning approaches, finding dependencies, similarities and differences between them and creating a mathematical foundation which can help to choose an optimal approach to the current task. Potential application scenarios can be seen in innovative technologies, such as smart houses [28] and smart automotive systems [29]. As it was found that in actor-critic algorithms the PR parameters give no significant difference in the agent's performance, while in other algorithms they might matter much more, it is necessary to find which parameters of the agent or the environment make PR parameters more or less meaningful, and to find a way to easily predict those parameters and reduce their significance.
REFERENCES

[1] E. L. Thorndike, "Review of animal intelligence: An experimental study of the associative processes in animals," 1898.
[2] B. F. Skinner, "Reinforcement today," American Psychologist, vol. 13, no. 3, p. 94, 1958.
[3] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
[4] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 664–671.
[5] S. Manfred, "Estimation and control in discounted stochastic dynamic programming," Stochastics: An International Journal of Probability and Stochastic Processes, vol. 20, no. 1, pp. 51–71, 1987.
[6] V. R. Konda and V. S. Borkar, "Actor-critic-type learning algorithms for Markov decision processes," SIAM Journal on Control and Optimization, vol. 38, no. 1, pp. 94–123, 1999.
[7] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[8] J. Beitelspacher, J. Fager, G. Henriques, and A. McGovern, "Policy gradient vs. value function approximation: A reinforcement learning shootout," School of Computer Science, University of Oklahoma, Tech. Rep., 2006.
[9] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Incremental natural actor-critic algorithms," in NIPS, 2007, pp. 105–112.
[10] M. Talanov, J. Vallverdú, S. Distefano, M. Mazzara, and R. Delhibabu, "Neuromodulating cognitive architecture: Towards biomimetic emotional AI," in , 2015, pp. 587–592.
[11] J. Vallverdú, M. Talanov, S. Distefano, M. Mazzara, A. Tchitchigin, and I. Nurgaliev, "A cognitive architecture for the implementation of emotions in computing systems," Biologically Inspired Cognitive Architectures, vol. 15, Supplement C, pp. 34–40, 2016.
[12] L. Johard and E. Ruffaldi, "A connectionist actor-critic algorithm for faster learning and biological plausibility," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 3903–3909.
[13] K. C. Berridge, "The debate over dopamine's role in reward: the case for incentive salience," Psychopharmacology, vol. 191, no. 3, pp. 391–431, 2007.
[14] A. Cahill, "Catastrophic forgetting in reinforcement-learning environments," Ph.D. dissertation, University of Otago, 2011.
[15] M. Frean and A. Robins, "Catastrophic forgetting in simple networks: an analysis of the pseudorehearsal solution," Network: Computation in Neural Systems, vol. 10, no. 3, pp. 227–236, 1999.
[16] O.-M. Moe-Helgesen and H. Stranden, "Catastrophic forgetting in neural networks," Dept. Comput. & Information Sci., Norwegian Univ. Science & Technology (NTNU), Trondheim, Norway, Tech. Rep., vol. 1, p. 22, 2005.
[17] A. Robins, "Catastrophic forgetting, rehearsal and pseudorehearsal," Connection Science, vol. 7, no. 2, pp. 123–146, 1995.
[18] R. M. French, "Semi-distributed representations and catastrophic forgetting in connectionist networks," Connection Science, vol. 4, no. 3-4, pp. 365–377, 1992.
[19] ——, "Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference," Network, 1994.
[20] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
[21] A. Robins, "Sequential learning in neural networks: A review and a discussion of pseudorehearsal based methods," Intelligent Data Analysis, vol. 8, no. 3, pp. 301–322, 2004.
[22] V. Marochko, L. Johard, and M. Mazzara, "Pseudorehearsal in value function approximation," in , 2017.
[23] B. Baddeley, "Reinforcement learning in continuous time and space: Interference and not ill conditioning is the main problem when using distributed function approximators," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 950–956, 2008.
[24] A. Robins and S. McCallum, "The consolidation of learning during sleep: comparing the pseudorehearsal and unlearning accounts," Neural Networks, vol. 12, no. 7, pp. 1191–1206, 1999.
[25] M. Hattori, "A biologically inspired dual-network memory model for reduction of catastrophic forgetting," Neurocomputing, vol. 134, pp. 262–268, 2014.
[26] A. P. Wieland, "Evolving neural network controllers for unstable systems," in Neural Networks, 1991, IJCNN-91-Seattle International Joint Conference on, vol. 2. IEEE, 1991, pp. 667–673.
[27] L. E. Bourne Jr and R. A. Yaroush, "Stress and cognition: A cognitive psychological perspective," 2003.
[28] I. B. Marco Nalin and M. Mazzara, "A holistic infrastructure to support elderlies' independent living," Encyclopedia of E-Health and Telemedicine, IGI Global, 2016.
[29] R. Gmehlich, K. Grau, A. Iliasov, M. Jackson, F. Loesch, and M. Mazzara, "Towards a formalism-based toolkit for automotive applications," 1st FME Workshop on Formal Methods in Software Engineering (FormaliSE).