Deep active inference agents using Monte-Carlo methods
Zafeirios Fountas, Noor Sajid, Pedro A.M. Mediano, Karl Friston
Zafeirios Fountas∗
Emotech Labs & WCHN, University College London

Noor Sajid
WCHN, University College London

Pedro A.M. Mediano
University of Cambridge

Karl Friston
WCHN, University College London

∗Corresponding author. Preprint. Under review.
Abstract
Active inference is a Bayesian framework for understanding biological intelligence. The underlying theory brings together perception and action under one single imperative: minimizing free energy. However, despite its theoretical utility in explaining intelligence, computational implementations have been restricted to low-dimensional and idealized situations. In this paper, we present a neural architecture for building deep active inference agents operating in complex, continuous state-spaces using multiple forms of Monte-Carlo (MC) sampling. For this, we introduce a number of techniques, novel to active inference. These include: i) selecting free-energy-optimal policies via MC tree search, ii) approximating this optimal policy distribution via a feed-forward 'habitual' network, iii) predicting future parameter belief updates using MC dropouts and, finally, iv) optimizing state transition precision (a high-end form of attention). Our approach enables agents to learn environmental dynamics efficiently, while maintaining task performance, in relation to reward-based counterparts. We illustrate this in a new toy environment, based on the dSprites data-set, and demonstrate that active inference agents automatically create disentangled representations that are apt for modeling state transitions. In a more complex Animal-AI environment, our agents (using the same neural architecture) are able to simulate future state transitions and actions (i.e., plan), and to evince reward-directed navigation despite temporary suspension of visual input. These results show that deep active inference, equipped with MC methods, provides a flexible framework to develop biologically-inspired intelligent agents, with applications in both machine learning and cognitive science.

1 Introduction

A common goal in cognitive science and artificial intelligence is to emulate biological intelligence, to gain new insights into the brain and build more capable machines. A widely-studied neuroscience proposition for this is the free-energy principle, which views the brain as a device performing variational (Bayesian) inference [1, 2]. Specifically, this principle provides a framework for understanding biological intelligence, termed active inference, by bringing together perception and action under a single objective: minimizing free energy across time [3–7]. However, despite the potential of active inference for modeling intelligent behavior, computational implementations have been largely restricted to low-dimensional, discrete state-space tasks [8–11].

Recent advances have seen deep active inference agents solve more complex, continuous state-space tasks, including Doom [12], the mountain car problem [13–15], and several tasks based on the MuJoCo environment [16], many of which use amortization to scale up active inference [13–15, 17]. A common limitation of these applications is a deviation from vanilla active inference in their ability to plan. For instance, Millidge [17] introduced an approximation of the agent's expected free energy (EFE), the quantity that drives action selection, based on bootstrap samples, while Tschantz et al. [16] employed a reduced version of EFE. Additionally, since all current approaches tackle low-dimensional problems, it is unclear how they would scale up to more complex domains.
Here, we propose an extension of previous formulations that is closely aligned with active inference [4, 9] by estimating all EFE summands using a single deep neural architecture.

Our implementation of deep active inference focuses on ensuring both scalability and biological plausibility. We accomplish this by introducing MC sampling, at several levels, into active inference. For planning, we propose the use of MC tree search (MCTS) for selecting a free-energy-optimal policy. This is consistent with planning strategies employed by biological agents and provides an efficient way to select actions. Next, we approximate the optimal policy distribution using a feed-forward 'habitual' network. This is inspired by biological habit formation when acting in familiar environments, which relieves the computational burden of planning in commonly-encountered situations. Additionally, for both biological consistency and a reduced computational burden, we predict model parameter belief updates using MC dropout, a problem previously tackled with network ensembles. Lastly, inspired by neuromodulatory mechanisms in biological agents, we introduce a top-down mechanism that modulates precision over state transitions, which enhances learning of latent representations.

In what follows, we briefly review active inference. This is followed by a description of our deep active inference agent. We then evaluate the performance of this agent. Finally, we discuss the potential implications of this work.

2 Active inference

Agents defined under active inference: A) sample their environment and calibrate their internal generative model to best explain sensory observations (i.e., reduce surprise) and B) perform actions under the objective of reducing their uncertainty about the environment. A more formal definition requires a set of random variables: $s_t$ to represent the hidden state of the world at time $t$, $o_t$ as the corresponding observation, $\pi = \{a_t, a_{t+1}, \ldots, a_T\}$ as a sequence of actions (typically referred to as a 'policy' in the active inference literature) up to a given time horizon $T \in \mathbb{N}^+$, and $P(o_t, s_t; \theta)$ as the agent's generative model parameterized by $\theta$. From this, the agent's surprise at time $t$ can be defined as the negative log-likelihood $-\log P(o_t; \theta)$.

To address objective A) under this formulation, the surprise of current observations can be indirectly minimized by optimizing the parameters $\theta$, using as a loss function the tractable expression
$$-\log P(o_t; \theta) \le \mathbb{E}_{Q(s_t)}\big[\log Q(s_t) - \log P(o_t, s_t; \theta)\big], \qquad (1)$$
where $Q(s)$ is an arbitrary distribution over $s$. The right-hand side of this inequality is the variational free energy at time $t$. This quantity is commonly referred to as the negative evidence lower bound [18] in variational inference. Furthermore, to realize objective B), the expected surprise of future observations $-\log P(o_\tau \mid \theta)$, with $\tau > t$, can be minimized by selecting the policy that is associated with the lowest EFE, $G$ [19]:
$$G(\pi, \tau) = \mathbb{E}_{P(o_\tau \mid s_\tau, \theta)}\,\mathbb{E}_{Q(s_\tau, \theta \mid \pi)}\big[\log Q(s_\tau, \theta \mid \pi) - \log P(o_\tau, s_\tau, \theta \mid \pi)\big]. \qquad (2)$$
Finally, the process of action selection in active inference is realized as sampling from the distribution
$$P(\pi) = \sigma\big(-\gamma\, G(\pi)\big) = \sigma\Big(-\gamma \sum_{\tau > t} G(\pi, \tau)\Big), \qquad (3)$$
where $\gamma$ is a temperature parameter and $\sigma(\cdot)$ is the standard softmax function.
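As a minimal illustration of Eq. (3), the following sketch maps accumulated EFE values for a set of candidate policies to a probability distribution over policies. The EFE values and the temperature are placeholder numbers, not outputs of the agent described below.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def policy_distribution(efe_per_policy, gamma=1.0):
    """Eq. (3): P(pi) = softmax(-gamma * G(pi)).

    `efe_per_policy` holds the accumulated EFE G(pi) = sum_tau G(pi, tau)
    for each candidate policy; lower EFE means higher probability.
    """
    return softmax(-gamma * np.asarray(efe_per_policy))

# Example: three candidate policies with illustrative accumulated EFE values.
print(policy_distribution([2.3, 0.7, 1.5], gamma=1.0))
```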
3 Deep active inference model

In this section, we introduce a deep active inference model using neural networks, based on amortization and MC sampling.

Figure 1: A: Schematic of the model architecture and networks used during the learning process. Black arrows represent the generative model (P), orange arrows the recognition model (Q), and the green arrow the top-down attention (ω). B: Relevant quantities for the calculation of the EFE G, computed by simulating the future using the generative model and ancestral sampling. Where appropriate, expectations are taken with a single MC sample. C: MCTS scheme used for planning and acting, using the habitual network to selectively explore new tree branches.

Throughout this section, we denote the parameters of the generative and recognition densities with θ and φ, respectively. The parameters are partitioned as follows: θ = {θ_o, θ_s}, where θ_o parameterizes the observation function P(o_t | s_t; θ_o) and θ_s parameterizes the transition function P(s_τ | s_t, a_t; θ_s). For the recognition density, φ = {φ_s, φ_a}, where φ_s denotes the amortization parameters of the approximate posterior Q_{φ_s}(s_t) (i.e., the state encoder), and φ_a the amortization parameters of the approximate posterior Q_{φ_a}(a) (i.e., our habitual network).

3.1 Variational free energy and expected free energy

First, we extend the probabilistic graphical model (as defined in Sec. 2) to include the action sequences π and factorize the model based on Fig. 1A. We then exploit standard variational inference machinery to calculate the free energy for each time-step t as
$$F_t = -\mathbb{E}_{Q(s_t)}\big[\log P(o_t \mid s_t; \theta_o)\big] + D_{KL}\big[Q_{\phi_s}(s_t) \,\|\, P(s_t \mid s_{t-1}, a_{t-1}; \theta_s)\big] + \mathbb{E}_{Q(s_t)}\Big[D_{KL}\big[Q_{\phi_a}(a_t) \,\|\, P(a_t)\big]\Big], \qquad (4)$$
where
$$P(a) = \sum_{\pi : a_1 = a} P(\pi) \qquad (5)$$
is the summed probability of all policies that begin with action a. We assume that s_t is normally distributed and o_t is Bernoulli distributed, with all parameters given by a neural network, parameterized by θ_o, θ_s, and φ_s for the observation, transition, and encoder models, respectively (see Sec. 3.2 for details about Q_{φ_a}). Under this assumption, all terms here are standard log-likelihood and KL terms that are easy to compute for Gaussian and Bernoulli distributions. The expectations over Q_{φ_s}(s_t) are taken via MC sampling, using a single sample from the encoder.
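For concreteness, the per-time-step free energy of Eq. (4) can be sketched as follows, under the stated assumptions of a diagonal Gaussian state posterior, a Gaussian transition prior and Bernoulli observations. The function signatures and the single-sample approximation mirror the description above, but this is an illustrative sketch rather than the exact implementation.

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, var_q) || N(mu_p, var_p)] for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def bernoulli_log_likelihood(o, p, eps=1e-6):
    """log P(o | s) for Bernoulli pixel probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.sum(o * np.log(p) + (1.0 - o) * np.log(1.0 - p))

def categorical_kl(q, p, eps=1e-12):
    """KL[Q(a) || P(a)] for discrete action distributions."""
    q, p = np.asarray(q) + eps, np.asarray(p) + eps
    return np.sum(q * np.log(q / p))

def free_energy_t(o_t, decoded_probs, mu_q, var_q, mu_prior, var_prior, q_a, p_a):
    """Eq. (4): negative accuracy + state complexity + action complexity.
    The expectation over Q(s_t) is approximated by the single encoder
    sample that produced `decoded_probs`."""
    return (-bernoulli_log_likelihood(o_t, decoded_probs)
            + gaussian_kl(mu_q, var_q, mu_prior, var_prior)
            + categorical_kl(q_a, p_a))
```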
Next, we consider the EFE. At time-step t and for a time horizon up to time T, the EFE is defined as [4]
$$G(\pi) = \sum_{\tau=t}^{T} G(\pi, \tau) = \sum_{\tau=t}^{T} \mathbb{E}_{\tilde{Q}}\Big[\log Q(s_\tau, \theta \mid \pi) - \log \tilde{P}(o_\tau, s_\tau, \theta \mid \pi)\Big], \qquad (6)$$
where $\tilde{Q} = Q(o_\tau, s_\tau, \theta \mid \pi)$ and $\tilde{P}(o_\tau, s_\tau, \theta \mid \pi) = P(o_\tau)\, Q(s_\tau \mid o_\tau)\, P(\theta \mid s_\tau, o_\tau)$. Following Schwartenbeck et al. [20], the EFE of a single time instance τ can be further decomposed as
$$G(\pi, \tau) = -\mathbb{E}_{\tilde{Q}}\big[\log P(o_\tau \mid \pi)\big] \qquad (7a)$$
$$\quad + \mathbb{E}_{\tilde{Q}}\big[\log Q(s_\tau \mid \pi) - \log P(s_\tau \mid o_\tau, \pi)\big] \qquad (7b)$$
$$\quad + \mathbb{E}_{\tilde{Q}}\big[\log Q(\theta \mid s_\tau, \pi) - \log P(\theta \mid s_\tau, o_\tau, \pi)\big]. \qquad (7c)$$

Interestingly, each term constitutes a conceptually meaningful expression. The term (7a) corresponds to the likelihood assigned to the desired observations o_τ, and plays an analogous role to the notion of reward in the reinforcement learning literature [21]. The term (7b) corresponds to the mutual information between the agent's beliefs about its latent representation of the world, before and after making a new observation; hence, it reflects a drive to explore areas of the environment that resolve state uncertainty. Similarly, the term (7c) describes the tendency of active inference agents to reduce their uncertainty about model parameters via new observations, and is usually referred to in the literature as active learning [3], novelty, or curiosity [20].

However, two of the three terms that constitute the EFE cannot be easily computed as written in Eq. (7). To make computation practical, we re-arrange these expressions and make further use of MC sampling to render them tractable, re-writing Eq. (7) as
$$G(\pi, \tau) = -\mathbb{E}_{Q(\theta \mid \pi)\, Q(s_\tau \mid \theta, \pi)\, Q(o_\tau \mid s_\tau, \theta, \pi)}\big[\log P(o_\tau \mid \pi)\big] \qquad (8a)$$
$$\quad + \mathbb{E}_{Q(\theta \mid \pi)}\Big[\mathbb{E}_{Q(o_\tau \mid \theta, \pi)} H(s_\tau \mid o_\tau, \pi) - H(s_\tau \mid \pi)\Big] \qquad (8b)$$
$$\quad + \mathbb{E}_{Q(\theta \mid \pi)\, Q(s_\tau \mid \theta, \pi)} H(o_\tau \mid s_\tau, \theta, \pi) - \mathbb{E}_{Q(s_\tau \mid \pi)} H(o_\tau \mid s_\tau, \pi), \qquad (8c)$$
where these expressions can be calculated from the deep neural network illustrated in Fig. 1B. The derivation of Eq. (8) can be found in the supplementary material. To calculate the terms (8a) and (8b), we sample θ, s_τ and o_τ sequentially (through ancestral sampling), and o_τ is then compared with the prior distribution log P(o_τ | π) = log P(o_τ). The parameters of the neural network, θ, are sampled from Q(θ) using the MC dropout technique [22]. Similarly, to calculate the expectation of H(o_τ | s_τ, π), the same drawn θ is used again and s_τ is re-sampled N times while, for H(o_τ | s_τ, θ, π), the set of parameters θ is also re-sampled N times. Finally, all entropies can be computed using the standard formulas for multivariate Gaussian and Bernoulli distributions.
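The following sketch illustrates how a single-step MC estimate of Eq. (8) can be assembled by ancestral sampling, with MC dropout standing in for samples of θ ~ Q(θ). The toy linear 'networks', the dropout rate, the preference prior and the encoder-entropy callable are all placeholders introduced for illustration; they are not the architecture of Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the trained networks: linear 'transition' and 'decoder'
# maps whose weights are perturbed by dropout masks, so that re-drawing a
# mask plays the role of sampling theta ~ Q(theta) [22].
S, O = 10, 64                              # latent and observation dimensions (toy values)
W_trans = rng.normal(size=(S, S))
W_dec = rng.normal(size=(O, S))

def sample_theta(p_drop=0.1):
    """One MC-dropout draw: binary masks over the weights."""
    return (rng.random(W_trans.shape) > p_drop, rng.random(W_dec.shape) > p_drop)

def transition(s, a, theta):
    mu = np.tanh((W_trans * theta[0]) @ s + a)
    var = np.full(S, 0.1)                  # fixed transition variance, for the sketch only
    return mu, var

def decode(s, theta):
    logits = (W_dec * theta[1]) @ s
    return 1.0 / (1.0 + np.exp(-logits))   # Bernoulli pixel probabilities

def gauss_H(var):
    return 0.5 * np.sum(np.log(2 * np.pi * np.e * var))

def bern_H(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))

def efe_step(s, a, log_prior_pref, encoder_entropy, N=5):
    """Single-sample MC estimate of Eq. (8) for one imagined time-step."""
    theta = sample_theta()                                   # theta ~ Q(theta) via MC dropout
    mu, var = transition(s, a, theta)                        # s_tau ~ Q(s_tau | theta, pi)
    s_tau = mu + np.sqrt(var) * rng.normal(size=S)
    o_tau = decode(s_tau, theta)                             # o_tau ~ Q(o_tau | s_tau, theta, pi)

    term_a = -log_prior_pref(o_tau)                          # (8a): extrinsic, reward-like value
    term_b = encoder_entropy(o_tau) - gauss_H(var)           # (8b): state epistemic value
    resample_s = lambda: mu + np.sqrt(var) * rng.normal(size=S)
    H_fixed_theta = np.mean([bern_H(decode(resample_s(), theta)) for _ in range(N)])
    H_redrawn_theta = np.mean([bern_H(decode(resample_s(), sample_theta())) for _ in range(N)])
    term_c = H_redrawn_theta - H_fixed_theta                 # (8c): parameter epistemic value
    return term_a + term_b + term_c

# Usage with trivial placeholders for the preference prior and the encoder's posterior entropy:
G_step = efe_step(np.zeros(S), np.ones(S),
                  log_prior_pref=lambda o: -np.sum((o - 1.0) ** 2),   # prefers bright frames
                  encoder_entropy=lambda o: gauss_H(np.full(S, 0.05)))
```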
3.2 Action selection and the habitual network

In active inference, agents choose actions based on their EFE. In particular, any given action is selected with probability proportional to the accumulated negative EFE of the corresponding policies, G(π) (see Eq. (3) and Ref. [19]). However, computing G across all policies is costly, since it involves making an exponentially-increasing number of predictions for T steps into the future and computing all the terms in Eq. (8). To solve this problem, we employ two methods operating in tandem. First, we employ standard MCTS [23–25], a search algorithm in which different potential future state trajectories are explored in the form of a search tree (Fig. 1C), giving emphasis to the most likely future trajectories. This algorithm is used to calculate the distribution over actions P(a), defined in Eq. (5), and controls the agent's final decisions. Second, we make use of amortized inference through a habitual neural network that directly approximates the distribution over actions, which we parameterize by φ_a and denote Q_{φ_a}(a). In essence, Q_{φ_a}(a) acts as a variational posterior that approximates P(a | s_t), with a prior P(a) calculated by MCTS (see Fig. 1A). During learning, this network is trained to reproduce the last executed action a_{t-1} (selected by sampling P(a)) from the last state s_{t-1}. Since both tasks used in this paper (Sec. 4) have discrete action spaces A, we define Q_{φ_a}(a) as a neural network with parameters φ_a and |A| softmax output units.

During the MCTS process, the agent iteratively generates a weighted search tree that is later sampled during action selection. In each single MCTS loop, one plausible state-action trajectory (s_t, a_t, s_{t+1}, a_{t+1}, ..., s_τ, a_τ), starting from the present time-step t, is calculated. For states that are explored for the first time, the distribution P(s_{t+1} | s_t, a_t; θ_s) is used. States that have already been explored are stored in the buffered search tree and accessed during later loops of the same planning process. The weights of the search tree, G̃(a, s), represent the agent's best estimate of the EFE after taking action a from state s. An upper confidence bound for G(a, s) is defined as
$$U(s, a) = \tilde{G}(s, a) + c_{explore} \cdot Q_{\phi_a}(a \mid s) \cdot \frac{1}{1 + N(a, s)}, \qquad (9)$$
where N(a, s) is the number of times that a was explored from state s, and c_explore is a hyper-parameter that controls exploration. In each round, the EFE of the newly-explored parts of the trajectory is calculated and back-propagated to all visited nodes of the search tree. Additionally, actions are sampled in two ways: actions from states that have been explored are sampled from σ(U(a, s_t)), while actions from new states are sampled from Q_{φ_a}(a).

Finally, the actions a_i that assemble the selected policy are drawn from $P(a) = \frac{N(a_i, s)}{\sum_j N(a_j, s)}$. In our implementation, the planning loop stops either when the process has identified a clear option (i.e., if $\max_a P(a) - 1/|\mathcal{A}| > T_{dec}$) or when the maximum number of allowed loops has been reached.

Through the combination of the approximation Q_{φ_a}(a) and the MCTS, our agent has two methods of action selection at its disposal. We refer to Q_{φ_a}(a) as the habitual network, as it corresponds to a form of fast decision-making that quickly evaluates and selects an action, in contrast with the more deliberative system that includes future imagination via MC tree traversals [26].
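A compact sketch of the planning loop described above is given below. The EFE evaluator, transition and habitual callables are placeholders and, as a convention of this sketch only, G̃ stores the negative of the accumulated EFE, so that the softmax over U in Eq. (9) favours low-EFE branches, in line with Eq. (3).

```python
import numpy as np
from collections import defaultdict

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mcts_plan(s0, n_actions, efe_step, transition, habitual, rng,
              c_explore=1.0, max_loops=100, depth=3, t_dec=0.5):
    """Sketch of the planning loop: grow a search tree guided by the habitual
    prior via Eq. (9) and return P(a) from normalized visit counts at the root.
    All callables are placeholders for the trained networks."""
    G = defaultdict(float)                 # running mean of -EFE per (state, action)
    N = defaultdict(int)                   # visit counts N(a, s)
    key0 = tuple(np.round(s0, 3))

    for _ in range(max_loops):
        s, path = s0, []
        for _ in range(depth):
            key = tuple(np.round(s, 3))
            prior = habitual(s)                              # Q_phi_a(a | s)
            if any(N[(key, a)] for a in range(n_actions)):   # expanded node: use Eq. (9)
                U = np.array([G[(key, a)] + c_explore * prior[a] / (1 + N[(key, a)])
                              for a in range(n_actions)])
                a = rng.choice(n_actions, p=softmax(U))
            else:                                            # unexplored node: habitual prior
                a = rng.choice(n_actions, p=prior)
            g = efe_step(s, a)                               # MC estimate of G(pi, tau)
            path.append((key, a, g))
            s = transition(s, a)
        # Back-propagate the accumulated EFE to every visited node.
        cum = np.cumsum([g for _, _, g in path][::-1])[::-1]
        for (key, a, _), g_total in zip(path, cum):
            N[(key, a)] += 1
            G[(key, a)] += (-g_total - G[(key, a)]) / N[(key, a)]
        # Stop early once one action clearly dominates at the root.
        counts = np.array([N[(key0, a)] for a in range(n_actions)], dtype=float)
        if counts.sum() and counts.max() / counts.sum() - 1.0 / n_actions > t_dec:
            break
    counts = np.array([N[(key0, a)] for a in range(n_actions)], dtype=float)
    return counts / counts.sum()
```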
3.3 Top-down precision

One of the key elements of our framework is the state transition model P(s_t | s_{t-1}, a_{t-1}; θ_s), which belongs to the agent's generative model. In our implementation, we take s_t ∼ N(µ, σ/ω), where µ and σ come from the linear and softplus units (respectively) of a neural network with parameters θ_s applied to s_{t-1} and, importantly, ω is a precision factor (cf. Fig. 1A) modulating the uncertainty of the agent's estimate of the hidden state of the environment [8]. We model the precision factor as a simple function of the belief update about the agent's current policy,
$$\omega = a\left(1 - \frac{1}{1 + e^{-(D - b)/c}}\right) + d, \qquad (10)$$
where D = D_KL[Q_{φ_a}(a) ‖ P(a)] and {a, b, c, d} are fixed hyper-parameters. Note that ω is a monotonically decreasing function of D, such that when the posterior belief about the current policy is similar to the prior, precision is high.

In cognitive terms, ω can be thought of as a means of top-down attention [27] that regulates which transitions should be learnt in detail and which can be learnt less precisely. This attention mechanism acts as a form of resource allocation: if D_KL[Q_{φ_a}(a) ‖ P(a)] is high, then a habit has not yet been formed, reflecting a generic lack of knowledge. Therefore, the precision of the prior P(s_t | s_{t-1}, a_{t-1}; θ_s) (i.e., the belief about the current state before a new observation o_t has been received) is low, and less effort is spent learning Q_{φ_s}(s_t).

In practice, the effect of ω is to incentivize disentanglement in the latent state representation s_t: the precision factor ω is somewhat analogous to the β parameter in β-VAE [28], effectively pushing the state encoder Q_{φ_s}(s_t) to have independent dimensions (since P(s_t | s_{t-1}, a_{t-1}; θ_s) has a diagonal covariance matrix). As training progresses and the habitual network becomes a better approximation of P(a), ω gradually increases, implementing a natural form of precision annealing.
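A small sketch of this mechanism is given below, assuming the sigmoidal form of Eq. (10) reconstructed above; the default hyper-parameter values follow the supplementary material where available and are otherwise placeholders.

```python
import numpy as np

def categorical_kl(q, p, eps=1e-12):
    q, p = np.asarray(q) + eps, np.asarray(p) + eps
    return float(np.sum(q * np.log(q / p)))

def omega(D, a=1.0, b=25.0, c=5.0, d=1.0):
    """Eq. (10): precision as a monotonically decreasing function of the
    habit-vs-planner belief update D = KL[Q_phi_a(a) || P(a)]."""
    return a * (1.0 - 1.0 / (1.0 + np.exp(-(D - b) / c))) + d

# The transition prior then becomes N(mu, sigma / omega): the larger the
# precision, the tighter the prior that regularizes the state encoder.
q_a = [0.7, 0.1, 0.1, 0.1]       # habitual network output (illustrative)
p_a = [0.6, 0.2, 0.1, 0.1]       # MCTS action distribution (illustrative)
w = omega(categorical_kl(q_a, p_a))
mu, sigma = np.zeros(10), np.ones(10)
prior_var = sigma / w
```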
4 Experiments and results

First, we present the two environments that were used to validate our agent's performance.

Dynamic dSprites
We defined a simple 2D environment based on the dSprites dataset [29, 28]. This was used to i) quantify the agent's behavior against ground-truth state-spaces and ii) evaluate the agent's ability to disentangle state representations. This is feasible because the dSprites data are designed for characterizing disentanglement, using a set of interpretable, independent ground-truth latent factors. In this task, which we call object sorting, the agent controls the position of the object via four actions (right, left, up or down) and is required to sort single objects based on their shape (a latent factor). The agent receives a reward when it moves the object across the bottom border, and the reward value depends on the shape and location, as depicted in Fig. 2A. For the results presented in Section 4, the agent was trained in an on-policy fashion with a batch size of 50.
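For illustration, a toy environment in the spirit of the object-sorting task could look as follows; the grid size, single-pixel rendering and exact reward rule are assumptions made for brevity, not the task specification used in our experiments.

```python
import numpy as np

class ObjectSortingEnv:
    """Toy version of the object-sorting task: move a shaped object on a grid;
    crossing the bottom border ends the round and yields a shape- and
    location-dependent reward. Grid size and reward values are illustrative."""
    ACTIONS = ("right", "left", "up", "down")

    def __init__(self, size=32, rng=None):
        self.size = size
        self.rng = rng or np.random.default_rng()
        self.reset()

    def reset(self):
        self.shape = self.rng.integers(3)               # square / ellipse / heart
        self.pos = np.array([self.rng.integers(self.size), 0])
        return self._obs()

    def _obs(self):
        frame = np.zeros((self.size, self.size), dtype=np.float32)
        frame[self.pos[1], self.pos[0]] = 1.0           # object rendered as a single pixel here
        return frame

    def step(self, action):
        dx, dy = {"right": (1, 0), "left": (-1, 0), "up": (0, -1), "down": (0, 1)}[action]
        self.pos = np.clip(self.pos + [dx, dy], 0, self.size - 1)
        done = self.pos[1] == self.size - 1             # reached the bottom border
        # Illustrative rule: each shape is rewarded for exiting in its own third of the border.
        target_third = self.pos[0] * 3 // self.size
        reward = float(done) * (1.0 if target_third == self.shape else -1.0)
        return self._obs(), reward, done
```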
Animal-AI

We used a variation of the 'preferences' task from the Animal-AI environment [30]. The complexity of this partially observable, 3D environment makes it an ideal test-bed for showcasing the agent's reward-directed exploration of the environment, whilst avoiding negative rewards or getting stuck in corners. In addition, to test the agent's ability to rely on its internal model, we used a 'lights-off' variant of this task, in which visual input is temporarily suspended at any given time-step with probability R. For the results presented in Section 4, the agent was trained in an off-policy fashion due to computational constraints. The training data for this were created using a simple rule: move in the direction of the greenest pixels.

Figure 2: A: The proposed object sorting task based on the dSprites dataset. The agent can perform 4 actions, changing the position of the object along both axes. A reward is received when an object crosses the bottom border, and it differs for the 3 object shapes. B: Prediction of the visual observations under motion when input is hidden, in both (i) Animal-AI and (ii) dynamic dSprites environments.

In the experiments that follow, we encode the actual reward from both environments as the prior distribution of future expected observations, log P(o_τ), or, in active inference terms, the expected outcomes. We optimized the networks using ADAM [31], with the loss given in Eq. (4) and an extra regularization term D_KL[Q_{φ_s}(s_t) ‖ N(0, 1)]. The explicit training procedure is detailed in the supplementary material. The complete source code, data, and pre-trained agents are available on GitHub (https://github.com/zfountas/deep-active-inference-mc).
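The 'lights-off' manipulation described above amounts to occasionally suppressing the visual input; a minimal sketch, with an illustrative probability R and an all-zero masked frame, is:

```python
import numpy as np

def maybe_blackout(observation, R=0.1, rng=None):
    """'Lights-off' variant: with probability R the visual input at this
    time-step is suppressed (here replaced by zeros), forcing the agent to
    rely on its transition model P(s_{t+1} | s_t, a_t). The value of R and the
    use of an all-zero frame are illustrative choices."""
    rng = rng or np.random.default_rng()
    if rng.random() < R:
        return np.zeros_like(observation), False   # (masked frame, visible flag)
    return observation, True
```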
We initially show, through a simple visual demonstration (Fig. 2B), that agents learn the environment dynamics with or without consistent visual input, for both dynamic dSprites and Animal-AI. This is further investigated, for dynamic dSprites, by evaluating task performance (Fig. 3A-C), as well as the reconstruction loss for both predicted visual input and reward (Fig. 3D-E) during training.

To explore the effect of different EFE functionals on behavior, we trained and compared active inference agents under three different formulations, all of which used the implicit reward function log P(o_τ), against a baseline reward-maximizing agent. These include i) beliefs about the latent states (i.e., terms a, b from Eq. 7), ii) beliefs about both the latent states and model parameters (i.e., the complete Eq. 7), and iii) beliefs about the latent states with a down-weighted reward signal. We found that, although all agents exhibit similar performance in collecting rewards (Fig. 3B), active inference agents have a clear tendency to explore the environment (Fig. 3C). Interestingly, our results also demonstrate that all three formulations are better at reconstructing the expected reward than a reward-maximizing baseline (Fig. 3D). Additionally, our agents are capable of reconstructing the current observation, as well as predicting 5 time-steps into the future, for all formulations of EFE, with loss similar to the baseline (Fig. 3E).

Disentanglement of latent spaces leads to lower-dimensional temporal dynamics that are easier to predict [32]. Thus, generating a disentangled latent space s can be beneficial for learning the parameters of the transition function P(s_τ | s_t, a_t; θ_s). Given the similarity between the precision term ω and the hyper-parameter β in β-VAE [28], discussed in Sec. 3.3, we hypothesized that ω could play an important role in regulating transition learning. To explore this hypothesis, we compared the total correlation (as a metric for disentanglement [33]) of latent state beliefs between i) agents that had been trained with the different EFE functionals, ii) the baseline (reward-maximizing) agent, iii) an agent trained without top-down attention (although the average value of ω was maintained), as well as iv) a simple variational autoencoder that received the same visual inputs. As seen in Fig. 3F, all active inference agents using ω generated structures with significantly more disentanglement (see traversals in the supplementary material). Indeed, the performance ranking here is the same as in Fig. 3D, pointing to disentanglement as a possible reason for the performance difference in predicting rewards.
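Total correlation can be estimated in several ways; one convenient sketch, which fits a Gaussian to a batch of latent samples, is shown below. This is an illustration of the metric, not necessarily the estimator used for Fig. 3F.

```python
import numpy as np

def total_correlation_gaussian(z):
    """Gaussian estimate of the total correlation of latent samples z
    (shape: [n_samples, latent_dim]): sum of marginal entropies minus the
    joint entropy, i.e. 0.5 * (sum_i log var_i - log det(cov))."""
    cov = np.cov(z, rowvar=False)
    marginal = np.sum(np.log(np.diag(cov)))
    _, joint = np.linalg.slogdet(cov)
    return 0.5 * (marginal - joint)

# Usage: z could be a batch of posterior means from the state encoder Q_phi_s.
z = np.random.randn(1000, 10) @ np.diag(np.linspace(0.5, 2.0, 10))
print(total_correlation_gaussian(z))   # close to 0 for independent dimensions
```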
Figure 3: Agent's performance during on-policy training in the object sorting task. A: Comparison of different action selection strategies for the agent driven by the full Eq. (8). B-C: Comparison of agents driven by different functionals, limited to state estimations of a single step into the future. In C, the violin plots represent behavior driven by P(a) (the planner) and the gray box plots behavior driven by the habitual network Q_{φ_a}(a). D-F: Reconstruction loss and total correlation during learning for 4 different functionals.

The training process in the dynamic dSprites environment revealed two types of behavior. Initially, we see epistemic exploration (i.e., curiosity), which is overtaken by reward seeking (i.e., goal-directed behavior) once the agent is reasonably confident about the environment. An example of this can be seen in the left trajectory plot in Fig. 4Ai, where the untrained agent, with no concept of reward, deliberates between multiple options and chooses the path that enables it to quickly move to the next round. The same agent, after further learning iterations, can optimally plan where to move the current object in order to maximize the potential reward, log P(o_τ). We next investigated the agent's sensitivity when deciding, by changing the threshold T_dec. Changing the threshold has clear implications for the distribution of explored alternative trajectories, i.e., the number of simulated states (Fig. 4Aii). This plays an important role in performance, with the maximum found at an intermediate value of T_dec (Fig. 4Aiii).

Agents trained in the Animal-AI environment also exhibit interesting (and intelligent) behavior. Here, the agent is able to make complex plans, avoiding obstacles with negative reward and approaching expected outcomes (red and green objects, respectively; Fig. 4Bi). Maximum performance is found for particular settings of the number of MCTS loops and the decision threshold T_dec (Fig. 4Bii; details in the supplementary material). When deployed in 'lights-off' experiments, the agent can successfully maintain an accurate representation of the world state and simulate future plans despite temporary suspension of visual input (Fig. 2B). This is particularly interesting because P(s_{t+1} | s_t, a_t; θ_s) is defined as a feed-forward network, without the ability to maintain a memory of states before t. As expected, the agent's ability to operate in this set-up becomes progressively worse the longer the visual input is removed, while shorter decision thresholds are found to preserve performance for longer (Fig. 4Biii).

Figure 4: Agent's planning performance. A: Dynamic dSprites. i) Example planned trajectory plots with the number of visits per state (blue-pink color map) and the selected policy (black lines). ii) The effect of the decision threshold T_dec on the number of simulated states and iii) on the agent's performance. B: Animal-AI. i) Same as in A. ii) System performance over hyper-parameters and iii) in the lights-off task. Error bars in Aiii denote standard deviation and in B the standard error of the mean.
5 Discussion

The attractiveness of active inference inherits from the biological plausibility of the framework [4, 34, 35]. Accordingly, we focused on scaling up active inference inspired by the neurobiological structure and function that supports intelligence. This is reflected in the hierarchical generative model, where the higher-level policy network contextualizes lower-level state representations. This speaks to a separation of temporal scales afforded by cortical hierarchies in the brain and provides a flexible framework to develop biologically-inspired intelligent agents.

We introduced MCTS for tackling planning problems with vast search spaces [23, 36, 24, 37, 38]. This approach builds upon Çatal et al.'s [39] deep active inference proposal to use tree search to recursively re-evaluate EFE for each policy, but is computationally more efficient. Additionally, using MCTS offers an Occam's window for policy pruning; that is, we stop evaluating a policy path if its EFE becomes much higher than a particular upper confidence bound.
This pruning drastically reduces the number of paths one has to evaluate. It is also consistent with biological planning, where agents adopt brute-force exploration of possible paths in a decision tree up to a resource-limited finite depth [40]. This could be due to imprecise evidence about different future trajectories [41], where environmental constraints subvert evaluation accuracy [42, 43] or alleviate computational load [44]. Previous work addressing the depth of possible future trajectories considered by human subjects under changing conditions shows that both increased cognitive load [43] and time constraints [45, 46, 42] reduce search depth. Huys et al. [44] highlighted that in tasks involving alleviated computational load, subjects might evaluate only subsets of decision trees. This is consistent with our experiments, as the agent selects to evaluate only particular trajectories based on their prior probability of occurring.

We have shown that the precision factor ω can be used to incorporate uncertainty over the prior and that it enhances disentanglement by encouraging statistical independence between features [47–50]. This is precisely why it has been associated with attention [51]; a signal that shapes uncertainty [52]. Attention enables flexible modulation of neural activity that allows behaviorally relevant sensory data to be processed more efficiently [53, 54, 27]. The neural realizations of this have been linked with neuromodulatory systems, e.g., cholinergic and noradrenergic [55–59]. In active inference, they have been associated specifically with noradrenaline for modulating uncertainty about state transitions [8], noradrenergic modulation of visual attention [60], and dopamine for policy selection [4, 60].

A limitation of this work lies in its comparison to reward-maximizing agents. That is, if the specific goal is to maximize reward, then it is not clear whether deep active inference (i.e., the full specification of EFE) has any performance benefits over simple reward-seeking agents (i.e., using only Eq. 7a). We emphasize, however, that the primary purpose of the active inference framework is to serve as a model of biological cognition, and not as an optimal solution for reward-based tasks. Therefore, we have deliberately not focused on bench-marking performance gains against state-of-the-art reinforcement learning agents, although we hypothesize that insights from active inference could prove useful in complex environments where either reward maximization is not the objective, or where direct reward maximization leads to sub-optimal performance.

There are several extensions that could be explored, such as testing whether performance would increase with more complex, larger neural networks, e.g., using LSTMs to model state transitions. One could also assess whether including episodic memory would finesse EFE evaluation over a longer time horizon, without increasing computational complexity. Future work should also test how performance shifts if the objective of the task changes. Lastly, it might be neurobiologically interesting to see whether the generated disentangled latent structures are apt for understanding functional segregation in the brain.

Acknowledgments

The authors would like to thank Sultan Kenjeyev for his valuable contributions and comments on early versions of the model presented in the current manuscript, and the Emotech team for their great support throughout the project.
NS was funded by the Medical Research Council (MR/S502522/1). PM and KJF were funded by the Wellcome Trust (Ref: 210920/Z/18/Z, PM; Ref: 088130/Z/09/Z, KJF).
References

[1] Karl J Friston. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
[2] Karl J Friston. A free energy principle for a particular physics. arXiv preprint arXiv:1906.10184, 2019.
[3] Karl J Friston, Thomas FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, John O'Doherty, and Giovanni Pezzulo. Active inference and learning. Neuroscience & Biobehavioral Reviews, 68:862–879, 2016.
[4] Karl J Friston, Thomas FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, and Giovanni Pezzulo. Active inference: A process theory. Neural Computation, 29(1):1–49, 2017.
[5] Karl J Friston, Thomas Parr, and Bert de Vries. The graphical brain: Belief propagation and active inference. Network Neuroscience, 1(4):381–414, 2017.
[6] Giovanni Pezzulo, Francesco Rigoli, and Karl J Friston. Hierarchical active inference: A theory of motivated control. Trends in Cognitive Sciences, 22(4):294–306, 2018.
[7] Lancelot Da Costa, Thomas Parr, Noor Sajid, Sebastijan Veselic, Victorita Neacsu, and Karl Friston. Active inference on discrete state-spaces: A synthesis. arXiv preprint arXiv:2001.07203, 2020.
[8] Thomas Parr and Karl J Friston. Uncertainty, epistemics and active inference. Journal of The Royal Society Interface, 14(136):20170376, 2017.
[9] Karl J Friston, Richard Rosch, Thomas Parr, Cathy Price, and Howard Bowman. Deep temporal models and active inference. Neuroscience and Biobehavioral Reviews, 90:486–501, 2018.
[10] Noor Sajid, Philip J Ball, and Karl J Friston. Active inference: Demystified and compared. arXiv preprint arXiv:1909.10863, 2019.
[11] Casper Hesp, Ryan Smith, Micah Allen, Karl J Friston, and Maxwell Ramstead. Deeply felt affect: The emergence of valence in deep active inference. PsyArXiv, 2019.
[12] Maell Cullen, Ben Davey, Karl J Friston, and Rosalyn J Moran. Active inference in OpenAI Gym: A paradigm for computational investigations into psychiatric illness. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3(9):809–818, 2018.
[13] Karl J Friston, Jean Daunizeau, and Stefan J Kiebel. Reinforcement learning or active inference? PLoS ONE, 4(7):e6421, 2009.
[14] Kai Ueltzhöffer. Deep active inference. Biological Cybernetics, 112(6):547–573, 2018.
[15] Ozan Çatal, Johannes Nauta, Tim Verbelen, Pieter Simoens, and Bart Dhoedt. Bayesian policy selection using active inference. arXiv preprint arXiv:1904.08149, 2019.
[16] Alexander Tschantz, Manuel Baltieri, Anil Seth, Christopher L Buckley, et al. Scaling active inference. arXiv preprint arXiv:1911.10601, 2019.
[17] Beren Millidge. Deep active inference as variational policy gradients. Journal of Mathematical Psychology, 96:102348, 2020.
[18] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[19] Thomas Parr and Karl J Friston. Generalised free energy and active inference. Biological Cybernetics, 113(5-6):495–513, 2019.
[20] Philipp Schwartenbeck, Johannes Passecker, Tobias U Hauser, Thomas HB FitzGerald, Martin Kronbichler, and Karl J Friston. Computational mechanisms of curiosity and goal-directed exploration. eLife, 8:e41703, 2019.
[21] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[22] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[23] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
[24] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[25] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[26] Matthijs Van Der Meer, Zeb Kurth-Nelson, and A David Redish. Information processing in decision-making systems. The Neuroscientist, 18(4):342–359, 2012.
[27] Anna Byers and John T Serences. Exploring the relationship between perceptual learning and top-down attentional control. Vision Research, 74:30–39, 2012.
[28] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations, 2(5):6, 2017.
[29] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
[30] Matthew Crosby, Benjamin Beyret, and Marta Halina. The Animal-AI Olympics. Nature Machine Intelligence, 1(5):257–257, 2019.
[31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[32] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, pages 517–526, 2018.
[33] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
[34] Takuya Isomura and Karl J Friston. In vitro neural networks minimise variational free energy. Scientific Reports, 8(1):1–14, 2018.
[35] Rick A Adams, Stewart Shipp, and Karl J Friston. Predictions not commands: Active inference in the motor system. Brain Structure and Function, 218(3):611–643, 2013.
[36] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.
[37] Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, pages 3338–3346, 2014.
[38] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[39] Ozan Çatal, Tim Verbelen, Johannes Nauta, Cedric De Boom, and Bart Dhoedt. Learning perception and planning with deep active inference. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3952–3956, 2020.
[40] Joseph Snider, Dongpyo Lee, Howard Poizner, and Sergei Gepshtein. Prospective optimization with limited resources. PLoS Computational Biology, 11(9), 2015.
[41] Alec Solway and Matthew M Botvinick. Evidence integration in model-based tree search. Proceedings of the National Academy of Sciences, 112(37):11708–11713, 2015.
[42] Bas van Opheusden, Gianni Galbiati, Zahy Bnaya, Yunqi Li, and Wei Ji Ma. A computational model for decision tree search. In CogSci, 2017.
[43] Dennis H Holding. Counting backward during chess move choice. Bulletin of the Psychonomic Society, 27(5):421–424, 1989.
[44] Quentin JM Huys, Neir Eshel, Elizabeth O'Nions, Luke Sheridan, Peter Dayan, and Jonathan P Roiser. Bonsai trees in your head: How the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS Computational Biology, 8(3), 2012.
[45] Bruce D Burns. The effects of speed on skilled chess performance. Psychological Science, 15(7):442–447, 2004.
[46] Frenk Van Harreveld, Eric-Jan Wagenmakers, and Han LJ Van Der Maas. The effects of time pressure on chess skill: An investigation into fast and slow processes underlying expert performance. Psychological Research, 71(5):591–597, 2007.
[47] Emile Mathieu, Tom Rainforth, N Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. arXiv preprint arXiv:1812.02833, 2018.
[48] Minyoung Kim, Yuting Wang, Pritish Sahu, and Vladimir Pavlovic. Bayes-Factor-VAE: Hierarchical Bayesian deep auto-encoder models for factor disentanglement. In IEEE International Conference on Computer Vision, pages 2979–2987, 2019.
[49] Hadi Fatemi Shariatpanahi and Majid Nili Ahmadabadi. Biologically inspired framework for learning and abstract representation of attention control. In Attention in Cognitive Systems: Theories and Systems from an Interdisciplinary Viewpoint, pages 307–324, 2007.
[50] Alexander Mott, Daniel Zoran, Mike Chrzanowski, Daan Wierstra, and Danilo J Rezende. Towards interpretable reinforcement learning using attention augmented agents. In Advances in Neural Information Processing Systems, pages 12329–12338, 2019.
[51] Thomas Parr, David A Benrimoh, Peter Vincent, and Karl J Friston. Precision and false perceptual inference. Frontiers in Integrative Neuroscience, 12:39, 2018.
[52] Peter Dayan, Sham Kakade, and Read P Montague. Learning and selective attention. Nature Neuroscience, 3(11):1218–1223, 2000.
[53] Farhan Baluch and Laurent Itti. Mechanisms of top-down attention. Trends in Neurosciences, 34(4):210–224, 2011.
[54] Yuka Sasaki, Jose E Nanez, and Takeo Watanabe. Advances in visual perceptual learning and plasticity. Nature Reviews Neuroscience, 11(1):53–60, 2010.
[55] Michael I Posner and Steven E Petersen. The attention system of the human brain. Annual Review of Neuroscience, 13(1):25–42, 1990.
[56] Peter Dayan and Angela J Yu. ACh, uncertainty, and cortical inference. In Advances in Neural Information Processing Systems, pages 189–196, 2002.
[57] Q Gu. Neuromodulatory transmitter systems in the cortex and their role in cortical plasticity. Neuroscience, 111(4):815–835, 2002.
[58] Angela J Yu and Peter Dayan. Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692, 2005.
[59] Rosalyn J Moran, Pablo Campo, Mkael Symmonds, Klaas E Stephan, Raymond J Dolan, and Karl J Friston. Free energy, precision and learning: The role of cholinergic neuromodulation. Journal of Neuroscience, 33(19):8227–8236, 2013.
[60] Thomas Parr. The Computational Neurology of Active Vision. PhD thesis, University College London, 2019.
Supplementary material
Derivation of Eq. (8)

Here, we provide the steps needed to derive Eq. (8) from Eq. (7). The term (7b) can be re-written as
$$\mathbb{E}_{\tilde{Q}}\big[\log Q(s_\tau \mid \pi) - \log Q(s_\tau \mid o_\tau, \pi)\big] = \mathbb{E}_{Q(\theta \mid \pi)\, Q(s_\tau \mid \theta, \pi)\, Q(o_\tau \mid s_\tau, \theta, \pi)}\big[\log Q(s_\tau \mid \pi) - \log Q(s_\tau \mid o_\tau, \pi)\big]$$
$$= \mathbb{E}_{Q(\theta \mid \pi)}\Big[\mathbb{E}_{Q(s_\tau \mid \theta, \pi)} \log Q(s_\tau \mid \pi) - \mathbb{E}_{Q(s_\tau, o_\tau \mid \theta, \pi)} \log Q(s_\tau \mid o_\tau, \pi)\Big]$$
$$= \mathbb{E}_{Q(\theta \mid \pi)}\Big[\mathbb{E}_{Q(o_\tau \mid \theta, \pi)} H(s_\tau \mid o_\tau, \pi) - H(s_\tau \mid \pi)\Big],$$
where we have only used the definition $\tilde{Q} = Q(o_\tau, s_\tau, \theta \mid \pi)$ and the definition of the standard (and conditional) Shannon entropy.

Next, the term (7c) can be re-written as
$$\mathbb{E}_{\tilde{Q}}\big[\log Q(\theta \mid s_\tau, \pi) - \log Q(\theta \mid s_\tau, o_\tau, \pi)\big] = \mathbb{E}_{Q(s_\tau, \theta, o_\tau \mid \pi)}\big[\log Q(o_\tau \mid s_\tau, \pi) - \log Q(o_\tau \mid s_\tau, \theta, \pi)\big]$$
$$= \mathbb{E}_{Q(s_\tau \mid \pi)\, Q(o_\tau \mid s_\tau, \pi)} \log Q(o_\tau \mid s_\tau, \pi) - \mathbb{E}_{Q(\theta \mid \pi)\, Q(s_\tau \mid \theta, \pi)\, Q(o_\tau \mid s_\tau, \theta, \pi)} \log Q(o_\tau \mid s_\tau, \theta, \pi)$$
$$= \mathbb{E}_{Q(\theta \mid \pi)\, Q(s_\tau \mid \theta, \pi)} H(o_\tau \mid s_\tau, \theta, \pi) - \mathbb{E}_{Q(s_\tau \mid \pi)} H(o_\tau \mid s_\tau, \pi),$$
where the first equality is obtained via a standard Bayesian inversion, and the second via the factorization of $\tilde{Q}$.

Training details

The model presented here was implemented in Python using TensorFlow 2.0. We initialized 3 different ADAM optimizers, used in parallel, to allow learning parameters with different rates. The networks Q_{φ_s} and P_{θ_o} were optimized using the first two terms of Eq. (4) as a loss function. In experiments where regularization was used, the loss function used by this optimizer was adjusted to
$$L_{\phi_s, \theta_o} = -\mathbb{E}_{Q(s_t)}\big[\log P(o_t \mid s_t; \theta_o)\big] + \gamma D_{KL}\big[Q_{\phi_s}(s_t) \,\|\, P(s_t \mid s_{t-1}, a_{t-1}; \theta_s)\big] + (1 - \gamma)\, D_{KL}\big[Q_{\phi_s}(s_t) \,\|\, \mathcal{N}(0, 1)\big], \qquad (11)$$
where γ is a hyper-parameter that starts at a low value and is gradually increased during training. In our experiments, we found that the effect of this regularization is only to improve the speed of convergence, and not the behavior of the agent; it can therefore be safely omitted.

The parameters of the network P_{θ_s} were optimized with a separate learning rate, using only the second term of Eq. (4) as a loss. Finally, the parameters of Q_{φ_a} were optimized with their own learning rate, using only the final term of Eq. (4) as a loss. For all presented experiments and learning curves, the batch size was set to 50. A learning iteration is defined as 1000 optimization steps with new data generated from the corresponding environment.

In order to learn to plan further into the future, the agents were trained to map transitions every 5 simulation time-steps in dynamic dSprites and every 3 simulation time-steps in Animal-AI.
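The following TensorFlow 2 sketch illustrates this training scheme: three ADAM optimizers updating different parameter groups with different losses, including the γ-weighted regularizer of Eq. (11). The toy single-layer networks, the learning rates and the cross-entropy form of the habitual loss are illustrative placeholders, not the architecture of Fig. 5.

```python
import tensorflow as tf

# Minimal stand-ins for the encoder / decoder / transition / habitual networks.
latent, obs_dim, n_actions = 10, 64, 4
encoder = tf.keras.Sequential([tf.keras.layers.Dense(2 * latent)])       # -> mu, log-var
decoder = tf.keras.Sequential([tf.keras.layers.Dense(obs_dim)])          # -> Bernoulli logits
transition = tf.keras.Sequential([tf.keras.layers.Dense(2 * latent)])    # -> mu, log-var
habitual = tf.keras.Sequential([tf.keras.layers.Dense(n_actions)])       # -> action logits

# Three optimizers, one per parameter group; learning rates are placeholders.
opt_vae, opt_trans, opt_hab = (tf.keras.optimizers.Adam(1e-4),
                               tf.keras.optimizers.Adam(1e-4),
                               tf.keras.optimizers.Adam(1e-4))

def gauss_kl(mu_q, logvar_q, mu_p, logvar_p):
    return 0.5 * tf.reduce_sum(logvar_p - logvar_q
                               + (tf.exp(logvar_q) + (mu_q - mu_p) ** 2) / tf.exp(logvar_p)
                               - 1.0, -1)

def train_step(o_t, s_prev_a, p_a, gamma=0.5):
    """One optimization step. o_t: observations; s_prev_a: concatenated
    [s_{t-1}, a_{t-1}] input to the transition net; p_a: MCTS action
    distribution P(a) used as the habitual target."""
    with tf.GradientTape(persistent=True) as tape:
        mu_q, logvar_q = tf.split(encoder(o_t), 2, axis=-1)
        s = mu_q + tf.exp(0.5 * logvar_q) * tf.random.normal(tf.shape(mu_q))
        nll = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
            labels=o_t, logits=decoder(s)), -1)
        mu_p, logvar_p = tf.split(transition(s_prev_a), 2, axis=-1)
        # Eq. (11): gamma-weighted mixture of the transition prior and a unit-Gaussian prior.
        loss_vae = tf.reduce_mean(nll
                                  + gamma * gauss_kl(mu_q, logvar_q, mu_p, logvar_p)
                                  + (1 - gamma) * gauss_kl(mu_q, logvar_q, 0.0, 0.0))
        # Transition network: second term of Eq. (4) only.
        loss_trans = tf.reduce_mean(gauss_kl(tf.stop_gradient(mu_q),
                                             tf.stop_gradient(logvar_q), mu_p, logvar_p))
        # Habitual network: final term of Eq. (4), here as a cross-entropy towards P(a).
        loss_hab = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            labels=p_a, logits=habitual(tf.stop_gradient(s))))
    vae_vars = encoder.trainable_variables + decoder.trainable_variables
    opt_vae.apply_gradients(zip(tape.gradient(loss_vae, vae_vars), vae_vars))
    opt_trans.apply_gradients(zip(tape.gradient(loss_trans, transition.trainable_variables),
                                  transition.trainable_variables))
    opt_hab.apply_gradients(zip(tape.gradient(loss_hab, habitual.trainable_variables),
                                habitual.trainable_variables))
    del tape
    return loss_vae, loss_trans, loss_hab

# Example call with random data (batch of 8):
o = tf.cast(tf.random.uniform((8, obs_dim)) > 0.5, tf.float32)
train_step(o, tf.random.normal((8, latent + n_actions)), tf.fill((8, n_actions), 0.25))
```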
Finally, the runtime of the results presented here is as follows. For the agents in the dynamic dSprites environment, training the final version of the agents (on-policy, 700 learning iterations) took on the order of hours per version using an NVIDIA Titan RTX GPU. Producing the learning and performance curves in Fig. 3 took hours per agent when the 1-step and habitual strategies were employed, and several days when the full MCTS planner was used (Fig. 3A). For the Animal-AI environment, off-policy training took hours per agent, on-policy training took days, and the results presented in Fig. 4 took several days, using an NVIDIA GeForce GTX 1660 Super GPU (CPU: i7-4790K, RAM: 16 GB DDR3).

Model parameters

In both simulated environments, the network structure used was almost identical, consisting of convolutional, deconvolutional, fully-connected and dropout layers (Fig. 5). In both cases, the dimensionality of the latent space s was 10. For the top-down attention mechanism, the hyper-parameters {a, b, c, d} were fixed per environment (e.g., a = 2 and d = 5 for Animal-AI; a = 1, b = 25 and c = 5 for dynamic dSprites). The action space was |A| = 3 for Animal-AI and |A| = 4 for dynamic dSprites. Finally, with respect to the planner, we set c_explore = 1 in both cases and used a fixed value of T_dec (when another value is not specifically mentioned); the depth of the MCTS simulation rollouts and the maximum number of MCTS loops were likewise fixed, with different values for dynamic dSprites and Animal-AI.

Figure 5: Neural network parameters used for the dynamic dSprites experiments. For the Animal-AI experiments, the only differences are: i) the input layer of the network used for Q_{φ_s}(s) and the output layer for P(o | s; θ_o) have shape (32, 32, 3), ii) the input layer of P(s_{t+1} | s_t, a_t; θ_s) has shape (10 + 3), and iii) the output layer of Q_{φ_a}(a) has shape (3), corresponding to the three actions forward, left and right.
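For convenience, the hyper-parameters reported in this section are collected below; entries set to None are left unspecified here and must be chosen by the user, i.e., they are not reported values.

```python
# Hyper-parameters reported in the text, gathered for convenience.
CONFIG = {
    "latent_dim": 10,
    "batch_size": 50,
    "c_explore": 1.0,
    "dsprites": {
        "n_actions": 4,
        "omega": {"a": 1.0, "b": 25.0, "c": 5.0, "d": None},
        "transition_skip": 5,      # transitions mapped every 5 simulation steps
    },
    "animal_ai": {
        "n_actions": 3,            # forward, left, right
        "omega": {"a": 2.0, "b": None, "c": None, "d": 5.0},
        "transition_skip": 3,
    },
}
```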
Figure 6: Examples of consecutive plans in the dynamic dSprites environment during a single experiment (A: learning iteration 2; B: learning iteration 700).

Figure 7: Examples of plans in the Animal-AI environment. Examples were picked randomly.

Examples of traversals

Figure 8: Latent space traversals for the full active inference agent trained in the dynamic dSprites environment. Histograms represent the distribution of values for 1000 random observations. The graphs in the right column represent Spearman's correlation between each dimension of s and the 5 continuous ground-truth features of the environment. This includes the 4 continuous features of the dSprites dataset and the reward, encoded in the top pixels shown in s. Although correlation with shape types is not shown, as this is a categorical feature, the traversals of s