Escaping Stochastic Traps with Aleatoric Mapping Agents
Augustine N. Mavor-Parker, Kimberly A. Young, Caswell Barry, Lewis D. Griffin
Affiliations: Artificial Intelligence Centre, University College London, UK; Department of Cell and Developmental Biology, University College London, UK; Center for Systems Neuroscience, Graduate Program for Neuroscience, Boston University, USA; Department of Computer Science, University College London, UK. Caswell Barry and Lewis D. Griffin are joint senior authors. Send correspondence to: [email protected].

Presented at the NeurIPS 2020 Biological and Artificial Reinforcement Learning Workshop.

Abstract
Exploration in environments with sparse rewards is difficult for artificial agents. Curiosity driven learning — using feed-forward prediction errors as intrinsic rewards — has achieved some success in these scenarios, but fails when faced with action-dependent noise sources. We present aleatoric mapping agents (AMAs), a neuroscience inspired solution modeled on the cholinergic system of the mammalian brain. AMAs aim to explicitly ascertain which dynamics of the environment are unpredictable, regardless of whether those dynamics are induced by the actions of the agent. This is achieved by generating separate forward predictions for the mean and variance of future states and reducing intrinsic rewards for those transitions with high aleatoric variance. We show AMAs are able to effectively circumvent action-dependent stochastic traps that immobilise conventional curiosity driven agents. The code for all experiments presented in this paper is open-sourced.
Introduction

Efficient exploration is a central problem in reinforcement learning, with particular pertinence in environments with sparse rewards — requiring agents to navigate with limited guidance (e.g. [1–3], see [4] for a review). A notable exploration method that effectively deals with sparse reward environments is curiosity driven learning, where agents are equipped with a self-supervised forward prediction model that employs prediction errors as intrinsic rewards [2, 5]. Curiosity is built upon the intuition that in unexplored regions of the environment, the forward prediction error of the agent's internal model will be large [2]. As a result, agents are rewarded for visiting regions of the state space that they have not previously occupied. However, if a particular state transition is impossible to predict, it will trap a curious agent [3, 5]. This is referred to as the noisy TV problem (e.g. [3]), the etymology being that a naively curious agent could dwell on the unpredictability of a noisy TV screen.

There are a range of curiosity-like methods [3, 2, 6] that are designed to overcome noisy TVs (or, in the terminology introduced by Shyam, Jaśkowski and Gomez, "stochastic traps" [7]). However, using prediction errors as intrinsic rewards is still an open problem, as current methods fail when stochastic traps are action-dependent, or require a distribution of forward prediction models [2, 6–8]. In this work we develop aleatoric mapping agents (AMAs): intrinsic reward modules, based on the work of Kendall and Gal [9], that are capable of avoiding action-dependent stochastic traps with a single prediction network. The success of our AMA approach and its connection to the work of Yu and Dayan suggests a possible role of acetylcholine in the cortex and hippocampus during exploration [10]. Namely, we hypothesise that acetylcholine could plausibly indicate expected aleatoric uncertainties [11] within a curiosity driven learning framework.

Related Work

Authors often remark that using future state prediction errors as rewards in reinforcement learning — as well as the realisation that this could lead to undesirable behaviours in stochastic environments — predates the contemporary success of Pathak's curiosity driven learning (e.g. [3, 2, 7]). Most notably, Schmidhuber [5], as well as Kaplan and Oudeyer [12], developed an intrinsic reward framework where agents are rewarded for high initial prediction errors, but only if those prediction errors decrease over time.

The seminal work of Pathak et al. (2017) [2] renewed interest in curiosity for deep reinforcement learning. As a result, novel varieties of curiosity driven learning have been developed that are capable of working in stochastic environments. Burda et al. trained a curiosity-like intrinsic reward module to predict the vector output of a randomly initialised convolutional neural network (CNN) for every observation seen during exploration [3]. Exploration naturally arises as it is harder to predict the output of the random network in regions of the environment that have not been visited [3].
Further, as the output vector comes from a CNN, the method generalises to high dimensional state spaces [3]. Subsequent work by Pathak, Gandhi and Gupta addressed curiosity driven exploration using an ensemble of networks to make forward predictions. Monitoring the divergence of the networks' predictions provided a means to avoid noisy TVs, but necessitated increased computational and memory complexity [6]. Count based methods are also known to reliably deal with the noisy TV problem, as their intrinsic rewards are not derived from prediction errors [13]. Count based methods tabulate which states agents have occupied and reward agents for visiting states with low counts [14, 15]. They can be combined with curiosity-like methods, as performed by Raileanu and Rocktäschel [13], to avoid noisy TVs while still utilising dynamics based prediction errors when they are useful.
Epistemic and Aleatoric Uncertainty

The uncertainty of a predictive model can be described as the sum of two theoretically distinct types of uncertainty: epistemic uncertainty and aleatoric uncertainty [16]. Epistemic uncertainty measures the unreliability of a model's prediction that can be minimised with additional experience [16]. As a result, using epistemic uncertainties as intrinsic rewards means an agent will seek out dynamics it has not previously encountered (e.g. [17]). On the other hand, prediction errors due to aleatoric uncertainties are unavoidable. They are, by definition, a result of unpredictable dynamics [16]. Prediction errors due to unpredictable dynamics immobilise curiosity driven agents, as exemplified by the noisy TV problem [8].

Although theoretically attractive from an exploration point of view, quantifying epistemic uncertainty in deep forward prediction models is difficult and computationally expensive [18]. As a result, we aim to tractably approximate epistemic uncertainties by removing the aleatoric component from the total forward prediction error. This is similar to methods that separate epistemic and aleatoric uncertainties, allowing for the construction of policies that are rewarded for exploring their environments and punished for experiencing aleatoric risk [19, 20]. However, as far as we are aware, we are the first to compute aleatoric uncertainties within a curiosity driven learning framework and reduce intrinsic rewards for those state transitions with high variance.
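This subtraction is motivated by the standard additive decomposition of predictive uncertainty. One common formulation, via the law of total variance over the posterior of the model parameters θ (following e.g. [16]), is:

$$\underbrace{\operatorname{Var}(y \mid x)}_{\text{total}} = \underbrace{\operatorname{Var}_{\theta}\big[\mathbb{E}(y \mid x, \theta)\big]}_{\text{epistemic}} + \underbrace{\mathbb{E}_{\theta}\big[\operatorname{Var}(y \mid x, \theta)\big]}_{\text{aleatoric}}$$

so an estimate of the aleatoric term subtracted from the total prediction error leaves (approximately) the epistemic component.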
Acetylcholine

In the mammalian brain, acetylcholine is a neurotransmitter implicated in a wide range of processes including learning and memory, fear, novelty detection, and attention [21–28]. Traditional views — supported by the rapid increase in acetylcholine in response to environmental novelty and demonstrable effects on neural plasticity — emphasised the role of acetylcholine as a learning signal, generating physiological changes that favour encoding of new information over retrieval [26]. We focus in particular on the role of acetylcholine during exploration in uncertain environments [25, 29, 27].

Dayan and Yu propose that in the cortex acetylcholine "signals" the expected uncertainty of top-down predictions, while cortical norepinephrine increases in correlation with experiences containing unexpected uncertainties [11]. This is supported by experimental evidence showing that acetylcholine inhibits feedback connections and strengthens sensory inputs [26], leaving an animal's perception to be more heavily influenced by observations rather than its predictions. However, Yu and Dayan's model does not separate epistemic and aleatoric uncertainties [25]. While the utility of quantifying epistemic uncertainties for exploration has been widely recognised in the RL literature (e.g. [17, 6]), this work demonstrates a potential use of aleatoric uncertainties for exploring agents (biological and artificial). Namely, aleatoric uncertainties can be used to divert attention away from unpredictable dynamics when using prediction errors as intrinsic rewards. As a result, we echo the sentiment of Yu and Dayan [25] that future experimental work should aim to pinpoint what type of uncertainty (epistemic or aleatoric) acetylcholine indicates.

What's more, we note that given acetylcholine could rise with expected aleatoric uncertainties, a reasonable hypothesis is that norepinephrine levels increase when animals are faced with epistemic uncertainties (expected and unexpected). This is related to the model proposed by Parr and Friston, which suggests that acetylcholine indicates expected uncertainties in predicted observations generated from hidden states, while norepinephrine is responsible for indicating epistemic unexpected uncertainties in hidden state transitions [30].
Methods

To compute heteroscedastic aleatoric uncertainty we follow Kendall and Gal, who offer an objective for regressing both the mean and expected variance in a deterministic deep learning model [9]. The method assumes a supervised learning setting, with a dataset of N inputs X = (x_1, x_2, ..., x_N) and N labels Y = (y_1, y_2, ..., y_N) [9]. The labels are defined as Gaussian distributed, with their predicted variances being a function of the inputs to the model (i.e. they are heteroscedastic) [9]. Using maximum a posteriori (MAP) inference with a zero-mean Gaussian prior on the network parameters θ, a proportionality expression for the posterior can be defined [9, 31]:

$$p(\theta, \phi \mid X, Y) \propto \prod_{i=1}^{N} \mathcal{N}\big(y_i;\, f_\theta(x_i),\, g_\phi(x_i)\big)\, \mathcal{N}\big(\theta, \phi;\, 0,\, \alpha^{-1} I\big) \quad (1)$$

where f_θ and g_φ are the neural networks used for forward prediction, θ are the network parameters for the mean prediction, φ are the parameters used for the variance prediction, and α is a constant. Following Kendall and Gal's implementation, the mean and variance prediction heads share feature extracting parameters [9]. The objective function of the model, L(θ, φ), minimises the negative logarithm of the right hand side of Equation 1 with respect to the model parameters [9]:

$$\mathcal{L}(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{1}{2} \exp\big(-\log \hat{\sigma}_i^2\big)\, \lVert y_i - \hat{y}_i \rVert^2 + \lambda \hat{\sigma}_i^2 \right] + \alpha \lVert \theta \rVert^2 \quad (2)$$
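For concreteness, the following is a minimal PyTorch sketch of this objective. It is our own illustration rather than the released implementation: tensor shapes, the default value of λ, and the delegation of the weight decay term α‖θ‖² to the optimiser's `weight_decay` argument are all assumptions.

```python
import torch

def heteroscedastic_loss(y, y_hat, log_var, lam=0.01):
    """Sketch of the AMA training loss (Equation 2).

    y, y_hat, log_var: (batch, D) tensors holding the targets and the
    outputs of the mean and log variance heads. The weight decay term
    alpha * ||theta||^2 is assumed to be handled by the optimiser.
    """
    precision = torch.exp(-log_var)                  # exp(-log sigma^2) = 1 / sigma^2
    weighted_error = 0.5 * precision * (y - y_hat) ** 2
    variance_penalty = lam * torch.exp(log_var)      # blocks exploding predicted variances
    return (weighted_error + variance_penalty).sum(dim=-1).mean()
```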
To reiterate, both the estimated log variance log σ̂² and the estimated mean of the labels ŷ_i are predicted by two separate heads of a deep neural network [9]. The first and third terms of Equation 2 are the familiar mean squared error and weight decay terms, while the second term blocks the explosion of predicted variances [9]. We follow Kendall and Gal's prescription of estimating log σ² instead of σ² to ensure stable optimisation [9]. Furthermore, the hyperparameter λ was added, representing a quantity analogous to an aleatoric risk budget (e.g. [19, 20, 32]) of the model.

Our method operates in the arena of episodic Markov decision processes (MDPs) that consist of states s ∈ S, actions a ∈ A, and rewards r ∈ R at each timestep t [1]. In our experiments the total reward is the sum of the intrinsic reward provided by the agent's intrinsic reward module and the extrinsic reward provided by the environment (e.g. [2, 33]):

$$r_t = \beta r_t^i + r_t^e \quad (3)$$

where superscript i indicates intrinsic rewards, superscript e indicates extrinsic rewards, and β is a hyperparameter that regulates the influence of intrinsic rewards on the policy. The objective of the curiosity driven agent is to learn a stochastic policy π, parametrised by ξ, that generates an action as a function of the current state s_t and maximises the expectation of the sum of discounted future rewards (e.g. [34]):

$$\max_{\pi_\xi} \; \mathbb{E}_{\pi_\xi} \left[ \sum_{k=0}^{T} \gamma^k r_{t+k} \right] \quad (4)$$

where T is the episode length and γ is the discount factor. The forward prediction module should reward the agent for forward prediction errors, unless the prediction error is impossible to reduce due to random dynamics [3]. Relying on this intuition, we craft the following intrinsic reward function for a given (s_{t+1}, ŝ_{t+1}) tuple, in a similar manner to [20]:

$$r_t^i = \frac{1}{D} \sum_{j=1}^{D} \left( (s_{t+1,j} - \hat{s}_{t+1,j})^2 - \hat{\sigma}_{t+1,j}^2 \right) \quad (5)$$

where D is the dimensionality of the state observation. Note that the forward prediction module and the policy network do not share any parameters. The reward function provided by Equation 5 is non-stationary, as stochastic gradient descent of the forward prediction module is performed online. More specifically, a batch of N (s_t, s_{t+1}) tuples is provided by N parallel actors at each timestep t to the forward prediction module, which performs an optimisation step according to Equation 2.
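The sketch below shows how the intrinsic reward of Equation 5 and the total reward of Equation 3 might then be computed; `forward_model` is any network with mean and log variance heads (as in the loss sketch above), and all names are illustrative.

```python
import torch

@torch.no_grad()
def ama_intrinsic_reward(forward_model, s_t, s_next):
    """Equation 5 (sketch): mean over observation dimensions of the squared
    forward prediction error minus the predicted aleatoric variance."""
    mean, log_var = forward_model(s_t)     # two-headed forward prediction
    var = torch.exp(log_var)               # predicted sigma^2 per dimension
    return ((s_next - mean) ** 2 - var).mean(dim=-1)

def total_reward(r_intrinsic, r_extrinsic, beta=1.0):
    """Equation 3: r_t = beta * r^i_t + r^e_t."""
    return beta * r_intrinsic + r_extrinsic
```

In practice (see the minigrid experiment below) the intrinsic rewards are additionally normalised with a running mean and standard deviation before being combined with extrinsic rewards.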
Experiments

We demonstrate the performance of our method on two toy environments. The experiments isolate the problematic effects of stochastic traps on intrinsically motivated agents, both in a purely supervised setting and in the context of deep reinforcement learning.

Noisy MNIST

We begin with a supervised learning test bed, which is very similar to the noisy MNIST environment introduced by Pathak, Gandhi and Gupta [6]. The environment does not elicit any actions from an agent. Instead, the goal of the prediction network is simply to learn the mapping between N input images A = {a_1, a_2, ..., a_N} and output images B = {b_1, b_2, ..., b_N}, where the i-th (a_i → b_i) transition is randomly sampled from two possible transition categories. The first type of transition is completely deterministic and is cued by the input a_i having a 0 label. These transitions map onto themselves: (a_i = 0 → b_i = a_i = 0). In this case, the prediction network simply has to learn an identity mapping. The other type of transition gives 1's as input and transitions onto a random digit from 2-9: (a_i = 1 → b_i ∈ {2, ..., 9}). These transitions are the stochastic trap.

As remarked by [6], a prediction model capable of avoiding noisy TVs should learn to compute equal intrinsic rewards for both types of transitions at convergence. We train two feedforward neural networks, one standard forward prediction network (that we train with mean squared error (MSE) and will refer to as the MSE network) and one AMA network, to complete this task (adapted from an open source repository [35]). The networks are equivalent except for the fact that the AMA network has two prediction heads for the mean and variance of its prediction of the next state. Each network consists of fully connected layers (each with ReLU activations, except for the last layer, which is linear for both the AMA and MSE networks). The hidden layers have 784 units each, except for the final hidden layer, which has 2 × 784 units as it receives a skip connection from the input layer. In the AMA case there is only a skip connection for the mean prediction head. Both networks were optimised with Adam at a learning rate of 0.001 and a batch size of 32 [36]. Our AMA agent is optimised according to the loss function defined in Equation 2, while the MSE network uses a mean squared error loss.

The intrinsic rewards computed via these different prediction modules are presented in Figure 1. The MSE prediction network is unable to reduce prediction errors for the stochastic transitions, leaving the stochastic transitions to produce much larger intrinsic rewards than the deterministic transitions, consistent with [6]. On the other hand, when heteroscedastic aleatoric uncertainties are estimated, the AMA prediction network is able to cut its losses by attributing high variance to the stochastic transitions, making them just as rewarding as the deterministic transitions (green line, Figure 1(b)).

Figure 1: MNIST environment visualisation and MSE and AMA performance. (a) Example transitions from the Noisy MNIST environment along with associated predictions from an AMA. The top two rows are examples of stochastic transitions where the prediction module of the AMA assigns high variance to the majority of the image, allowing errors to be small despite the value of the output being completely unpredictable. (b) Two reward curves for each type of network are plotted, where stochastic is the (a_i = 1 → b_i ∈ {2, ..., 9}) transitions and deterministic is the (a_i = 0 → b_i = a_i = 0) transitions. Shaded area is ±
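For concreteness, the transition-sampling scheme described above can be sketched as follows. The `images_by_digit` structure and the 50/50 mixture between transition types are our own illustrative assumptions; the paper does not state the exact sampling ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_image(images):
    """Pick a random flattened image from an array of one digit class."""
    return images[rng.integers(len(images))]

def make_noisy_mnist_pairs(images_by_digit, n_pairs, p_stochastic=0.5):
    """Build (a_i, b_i) transition pairs for the Noisy MNIST test bed.

    images_by_digit: dict mapping digit label -> array of flattened images
    (e.g. pre-sorted from a standard MNIST download).
    """
    inputs, targets = [], []
    for _ in range(n_pairs):
        if rng.random() < p_stochastic:
            # Stochastic trap: a 1 transitions onto a random digit from 2-9.
            a = sample_image(images_by_digit[1])
            b = sample_image(images_by_digit[int(rng.integers(2, 10))])
        else:
            # Deterministic transition: a 0 maps onto itself (identity mapping).
            a = sample_image(images_by_digit[0])
            b = a
        inputs.append(a)
        targets.append(b)
    return np.stack(inputs), np.stack(targets)
```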
Minigrid

The Gym MiniGrid environment [37] contains a suite of highly-optimised gridworld games for computationally affordable deep reinforcement learning research. We demonstrate the efficacy of our method on the MiniGrid-KeyCorridorS6R3-v0 environment and measure exploration effectiveness by the number of novel states visited during training. A (7 × 7 × 3) observation array tracks the agent's location and orientation as it moves around the larger grid. The channels of the observations represent semantic features (e.g. blue door, grey wall, empty, etc.) of each grid tile. The action space has seven distinct actions, expressed relative to the agent's orientation (turn left, turn right, move forward, pickup, drop, toggle, done), one of which is selected at each timestep.

In our experimental design, a number of alterations were made to the MiniGrid environment. First, the number of frames per episode was reduced to 8. This makes efficient principled exploration more crucial, as the probability of the agent randomly discovering novel states decreases rapidly in relation to the number of frames per episode. Second, an artificial action-dependent noisy TV was added, inspired by other minigrid experiments with noisy TVs [13], consisting of a burst of random noise in the top left corner of the agent's observation array. The values of the noise pixels were uniformly sampled across the range of possible values (representing semantic features) for the suite of minigrid environments. Whenever the agent selects the 'done' action, the noisy TV is activated in the next observation (note that in this particular environment the 'done' action has no consequences for the agent, except for inducing the noisy TV); a sketch of this modification is given below. Third, all extrinsic rewards are turned off. The environment was kept constant across training episodes, but different versions of the environment (with different random seeds) were used to produce the confidence regions in Figure 2.

We perform policy optimization with the synchronous advantage actor critic (A2C) algorithm [34]. Our implementation is adapted from a starter repository recommended in the Gym MiniGrid README [38]. The actor critic architecture begins with 3 (2 × 2) convolutional layers with (16, 32, 64) output channels for each layer and a max pooling layer after the first convolution. A ReLU is applied to all convolutional layers. The extracted features are then fed to two MLPs representing the actor and critic heads, with layer dimensions of (64, 64, 7) and (64, 64, 1) respectively. A Tanh activation is used in the penultimate layer of both actor-critic heads. The actor critic weights were optimised with RMSProp [39] at a learning rate of 0.001. We use 16 parallel actors during training.

The forward prediction module works in the observation space, as opposed to a learned feature space as is implemented in other curiosity driven methods [2, 3, 13]. Using pixel based predictions is not a limitation of our method, but was chosen due to its simplicity of implementation and the natural interpretability of predictions when debugging.
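One way to realise the action-dependent noisy TV described above is as a gym wrapper. The sketch below is our own illustration, not the paper's exact implementation: it assumes the environment already returns the raw image array (e.g. via gym_minigrid's ImgObsWrapper), and the patch size and feature value range are placeholders.

```python
import gym
import numpy as np

class ActionDependentNoisyTV(gym.Wrapper):
    """Overlay uniform noise on the top left corner of the observation
    whenever the agent selects the otherwise inconsequential 'done' action."""

    def __init__(self, env, done_action=6, patch_size=3, max_value=10):
        super().__init__(env)
        self.done_action = done_action   # index of 'done' in MiniGrid's action space
        self.patch_size = patch_size     # side length of the noise patch (assumed)
        self.max_value = max_value       # range of semantic feature values (assumed)
        self.rng = np.random.default_rng()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if action == self.done_action:
            k = self.patch_size
            # Uniformly sample semantic feature values for the noisy TV patch.
            obs[:k, :k, :] = self.rng.integers(
                0, self.max_value, size=(k, k, obs.shape[-1]))
        return obs, reward, done, info
```

Usage might look like `ActionDependentNoisyTV(ImgObsWrapper(gym.make('MiniGrid-KeyCorridorS6R3-v0')))`, though the authors' exact construction may differ [38].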
The forward prediction model consists of a feature extracting CNN architecture (1 layer with 32 output channels), which ingests an observation at s_t, followed by two separate fully-connected network heads predicting the mean and log variance of the pixels of s_{t+1}. The two hidden layers for both heads have (621, 147) hidden units. ReLU is applied throughout the intrinsic reward model, except in the final layer, where no activation is used for the forward prediction and ReLU is used for the log variance prediction. Our architecture is based in part on other intrinsic reward models that performed well on gym-minigrid environments [13]. The intrinsic reward module was optimised with the Adam optimizer [36] at a learning rate of 0.0001 for the AMA agent and 0.001 for the MSE agent.

The intrinsic rewards were scaled by a factor of 10 in both cases. All rewards were normalised with a running mean and standard deviation for both the AMA agent and the MSE agent, as is recommended for intrinsic rewards (e.g. [8, 33]). All of the settings described were selected by hyperparameter tuning (via grid search), tuning the respective agents to visit as many states as possible averaged across environments with and without a noisy TV present. For a full description of all the underlying RL hyperparameters we refer the reader to the public repository [38].

As shown in Figure 2(b), when no noisy TV is present, MSE curiosity provides a performance boost (right panel, Figure 2(b)). However, the presence of the noisy TV has a disastrous effect on the MSE curiosity agent (left panel, Figure 2(b)). On the other hand, the AMA agent maintains a performance boost from its intrinsic rewards both with and without the presence of an action-dependent noisy TV. Table 1 shows the resulting action frequencies: the MSE agent is drawn overwhelmingly to the 'done' action that activates the stochastic trap, while the AMA agent is not.

Table 1: Frequency of the actions selected by the MSE and AMA agents across one run in the Minigrid environment

Action                   Mean Frequency MSE   Mean Frequency AMA
left                     76                   19457
right                    113                  13496
forward                  63                   15459
pickup                   54                   127
drop                     54                   95
toggle                   261                  81
done (stochastic trap)   49427                1336

Figure 2: Deep learning minigrid results. (a) Rendering of the MiniGrid-KeyCorridorS6R3-v0 environment. The highlighted grey rectangle represents the receptive field of the agent. (b) Comparison of different agents averaged over 5 runs (different random seeds), with and without an action-dependent stochastic trap present. All policies are optimised with A2C. The agents differ in how their intrinsic rewards are computed: no rewards for the Baseline agent, mean squared error for the vanilla approach, and our method for AMA. Shaded area is ±
Conclusion

We have shown AMAs are able to avoid action-dependent stochastic traps that destroy the exploration capabilities of conventional curiosity driven agents [8], both in a supervised setting and a deep RL setting. AMAs complete such a feat in a computationally tractable manner by decreasing intrinsic rewards in regions with high estimated aleatoric uncertainty, using the predictions of a single deterministic forward prediction network.

Given the success of AMAs, we observe that there are three curiosity-like deep RL models that might accurately mimic the role of acetylcholine in exploration. The first two possibilities are equivalent to the descriptions provided by Yu and Dayan [25]: specifically, that acetylcholine represents model "ignorance" [25] (expected epistemic uncertainty) or, alternatively, that it represents the unpredictability of the environment (expected aleatoric uncertainty) [10]. A third possibility, suggested by the random network distillation model presented by Burda et al., is consistent with the cholinergic role in novelty signalling [3, 24].

We build on the foundation laid by Yu and Dayan [25], arguing that future experimental work should aim to decipher whether acetylcholine indicates expected epistemic uncertainties, expected aleatoric uncertainties, or simply novelty. Then, direct comparisons can be made between the uncertainties signalled by biological and artificial agents to further research in reinforcement learning and neuroscience alike. In future work, we aim to further verify the efficacy of the AMA method in deep reinforcement learning settings by performing experiments on popular game playing and simulated 3D environments.
Acknowledgments and Disclosure of Funding
Augustine N. Mavor-Parker is supported by the EPSRC project reference 2250955. Caswell Barry was funded by a Wellcome Senior Research Fellowship (212281/Z/18/Z). Augustine N. Mavor-Parker would like to thank Changmin Yu, Andrea Banino, Charles Blundell, Maximillian Mozes, Kimberly Mai and Felix Biggs for their comments on this project.
References

[1] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[2] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pages 2778–2787. PMLR, 2017.
[3] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
[4] Lilian Weng. Exploration strategies in deep reinforcement learning. lilianweng.github.io/lil-log, 2020.
[5] Jürgen Schmidhuber. Adaptive confidence and adaptive curiosity. Institut für Informatik, Technische Universität München, Arcisstr. 21, 800 München 2. Citeseer, 1991.
[6] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. arXiv preprint arXiv:1906.04161, 2019.
[7] Pranav Shyam, Wojciech Jaśkowski, and Faustino Gomez. Model-based active exploration. In International Conference on Machine Learning, pages 5779–5788, 2019.
[8] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
[9] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
[10] Peter Dayan and Angela J Yu. Expected and unexpected uncertainty: ACh and NE in the neocortex. In Advances in Neural Information Processing Systems, pages 173–180, 2003.
[11] Peter Dayan and Angela J Yu. ACh, uncertainty, and cortical inference. In Advances in Neural Information Processing Systems, pages 189–196, 2002.
[12] Frederic Kaplan and Pierre-Yves Oudeyer. In search of the neural circuits of intrinsic motivation. Frontiers in Neuroscience, 1:17, 2007.
[13] Roberta Raileanu and Tim Rocktäschel. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. In International Conference on Learning Representations, 2020.
[14] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
[15] Marc G Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868, 2016.
[16] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: A tutorial introduction. arXiv preprint arXiv:1910.09457, 2019.
[17] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
[18] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 1(3), 2016.
[19] Stefan Depeweg, Jose-Miguel Hernandez-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pages 1184–1193. PMLR, 2018.
[20] William R Clements, Benoît-Marie Robaglia, Bastien Van Delft, Reda Bahi Slaoui, and Sébastien Toth. Estimating risk and uncertainty in deep reinforcement learning. arXiv preprint arXiv:1905.09638, 2019.
[21] Charan Ranganath and Gregor Rainer. Neural mechanisms for detecting and remembering novel events. Nature Reviews Neuroscience, 4(3):193–202, 2003.
[22] Giancarlo Pepeu and Maria Grazia Giovannini. Changes in acetylcholine extracellular levels during cognitive processes. Learning & Memory, 11(1):21–27, 2004.
[23] Elio Acquas, Catriona Wilson, and Hans C Fibiger. Conditioned and unconditioned stimuli increase frontal cortical and hippocampal acetylcholine release: effects of novelty, habituation, and fear. Journal of Neuroscience, 16(9):3089–3096, 1996.
[24] Caswell Barry, James G Heys, and Michael E Hasselmo. Possible role of acetylcholine in regulating spatial novelty effects on theta rhythm and grid cells. Frontiers in Neural Circuits, 6:5, 2012.
[25] Angela J Yu and Peter Dayan. Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692, 2005.
[26] Michael E Hasselmo. The role of acetylcholine in learning and memory. Current Opinion in Neurobiology, 16(6):710–715, 2006.
[27] MG Giovannini, A Rakovska, RS Benton, M Pazzagli, L Bianchi, and G Pepeu. Effects of novelty and habituation on acetylcholine, GABA, and glutamate release from the frontal cortex and hippocampus of freely moving rats. Neuroscience, 106(1):43–53, 2001.
[28] Vinay Parikh, Rouba Kozak, Vicente Martinez, and Martin Sarter. Prefrontal acetylcholine release controls cue detection on multiple timescales. Neuron, 56(1):141–154, 2007.
[29] CM Thiel, JP Huston, and RKW Schwarting. Hippocampal acetylcholine and habituation learning. Neuroscience, 85(4):1253–1262, 1998.
[30] Thomas Parr and Karl J Friston. Uncertainty, epistemics and active inference. Journal of The Royal Society Interface, 14(136):20170376, 2017.
[31] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006. p. 30.
[32] Hannes Eriksson and Christos Dimitrakakis. Epistemic risk-sensitive reinforcement learning. arXiv preprint arXiv:1906.06273, 2019.
[33] Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, et al. Never give up: Learning directed exploration strategies. arXiv preprint arXiv:2002.06038, 2020.
[34] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR, 2016.