The relationship between dynamic programming and active inference: the discrete, finite-horizon case
Lancelot Da Costa, Noor Sajid, Thomas Parr, Karl Friston, Ryan Smith
Lancelot Da Costa [email protected]
Department of MathematicsImperial College LondonLondon, SW7 2BU, UK
Noor Sajid [email protected]
Wellcome Centre for Human NeuroimagingUniversity College LondonLondon, WC1N 3AR, UK
Thomas Parr [email protected]
Wellcome Centre for Human NeuroimagingUniversity College LondonLondon, WC1N 3AR, UK
Karl Friston [email protected]
Wellcome Centre for Human NeuroimagingUniversity College LondonLondon, WC1N 3AR, UK
Ryan Smith [email protected]
Laureate Institute for Brain ResearchTulsa, OK 74136, United States
Abstract
Active inference is a normative framework for generating behaviour based upon the free energy principle, a theory of self-organisation. This framework has been successfully used to solve reinforcement learning and stochastic control problems, yet the formal relation between active inference and reward maximisation has not been fully explicated. In this paper, we consider the relation between active inference and dynamic programming under the Bellman equation, which underlies many approaches to reinforcement learning and control. We show that, on partially observable Markov decision processes, dynamic programming is a limiting case of active inference. In active inference, agents select actions to minimise expected free energy. In the absence of ambiguity about states, this reduces to matching expected states with a target distribution encoding the agent's preferences. When target states correspond to rewarding states, this maximises expected reward, as in reinforcement learning. When states are ambiguous, active inference agents will choose actions that simultaneously minimise ambiguity. This allows active inference agents to supplement their reward-maximising (or exploitative) behaviour with novelty-seeking (or exploratory) behaviour. This clarifies the connection between active inference and reinforcement learning, and how both frameworks may benefit from each other.
Keywords:
Active inference, reward maximisation, reinforcement learning, approximate Bayesian inference, stochastic optimal control.
1. Introduction
Active inference is a normative framework for explaining behaviour under the free energy principle—a global theory of self-organisation in the neurosciences (Friston, 2010, 2019; Friston et al., 2006; Parr et al., 2020)—by assuming that the brain performs approximate Bayesian inference (Bishop, 2006; Jordan et al., 1998; Sengupta et al., 2016; Wainwright and Jordan, 2007). Within the active inference framework, there is a collection of belief updating schemes or algorithms for modelling perception, learning, and behaviour in the context of both continuous and discrete state spaces (Friston et al., 2020, 2017c). Within each scheme, active inference treats agents as systems that self-organise to some (non-equilibrium) steady-state (Da Costa et al., 2020; Pavliotis, 2014); that is, an active inference agent acts upon the world so that its predicted states match a target distribution encoding its characteristic or preferred states. Building active inference agents requires: equipping the agent with a (generative) model of the environment; fitting the model to observations through approximate Bayesian inference by minimising variational free energy (Beal, 2003; Bishop, 2006; Jordan et al., 1998; Wainwright and Jordan, 2007); and selecting actions that minimise expected free energy, a quantity that can be decomposed into risk (i.e., the divergence between predicted and preferred states) and information gain, leading to context-specific combinations of exploratory and exploitative behaviour (Parr et al., 2020; Schwartenbeck et al., 2019). Exploitative behaviour ensures that predicted states match preferred states in a probabilistic sense or in the sense of maximising expected reward (Da Costa et al., 2020). This framework has been used to simulate intelligent behaviour in neuroscience (Adams et al., 2013b; Cullen et al., 2018; Kaplan and Friston, 2018; Mirza et al., 2018, 2019; Parr and Friston, 2019), artificial intelligence (Çatal et al., 2019, 2020a; Fountas et al., 2020; Millidge, 2019, 2020; Sajid et al., 2020; Tschantz et al., 2019, 2020a; Ueltzhöffer, 2018) and robotics (Çatal et al., 2020b; Lanillos et al., 2020; Pezzato et al., 2020; Pio-Lopez et al., 2016; Sancaktar et al., 2020). Given the prevalence of reinforcement learning (RL) and stochastic optimal control in these fields, it is useful to understand the relationship between active inference and these established approaches to modelling purposeful behaviour.

Stochastic control traditionally calls on strategies that evaluate different actions on a carefully handcrafted forward model of stochastic dynamics and then select the reward-maximising action. RL has a broader and more ambitious scope. Loosely speaking, RL is a collection of methods that learn reward-maximising actions from data and seek to maximise reward in the long run. Many RL algorithms are model-free, which means that agents learn a reward-maximising state-action mapping, based on updating cached state-action pair values, through initially random actions that do not consider future state transitions. In contrast, model-based RL algorithms attempt to extend model-based control approaches by learning the dynamics and reward function from data. Because RL is a data-driven field, particular algorithms are selected based on how well they perform on benchmark problems. This has yielded a zoo of diverse algorithms, many designed to solve specific problems and each with their own strengths and limitations. This makes RL difficult to characterise as a whole.
Thankfully, many RL algorithms and approaches to solving control problems originate from, or otherwise build upon, dynamic programming under the Bellman equation (Bellman and Dreyfus, 2015; Bertsekas and Shreve, 1996), a collection of methods that maximise cumulative reward (although this often becomes computationally intractable in real-world problems) (Barto and Sutton, 1992). In what follows, we consider the relationship between active inference and dynamic programming, and discuss its implications in the broader context of RL.

This leads us to discuss the apparent differences between active inference and RL. First, while RL agents select actions to maximise cumulative reward (e.g., the solution to the Bellman equation (Bellman and Dreyfus, 2015)), active inference agents select actions so that predicted states match a target distribution encoding preferred states. In fact, active inference also builds upon previous work on the duality between inference and control (Kappen et al., 2012; Rawlik et al., 2013; Todorov, 2008; Toussaint, 2009) to solve motor control problems via approximate inference (Friston et al., 2012a, 2009; Millidge et al., 2020b). Treating the control problem as an inference problem in this fashion is also known as planning as inference (Attias, 2003; Botvinick and Toussaint, 2012). Second, active inference agents always embody a generative (i.e., forward) model of their environment, while RL comprises both model-based algorithms as well as simpler model-free algorithms. Third, modelling exploratory behaviour—which can improve reward maximisation in the long run (especially in volatile environments)—is implemented differently in the two approaches. In most cases, RL implements a simple form of exploration by incorporating randomness in decision-making (Tokic and Palm, 2011; Wilson et al., 2014), where the level of randomness may or may not change over time as a function of uncertainty. In other cases, RL incorporates ad-hoc "information bonus" terms to build in goal-directed exploratory drives. In contrast, goal-directed exploration emerges naturally within active inference through interactions between the reward and information gain terms in the expected free energy (Da Costa et al., 2020; Schwartenbeck et al., 2019). Although not covered in detail here, active inference can accommodate a principled form of random exploration (a.k.a. matching behaviour) by sampling actions from a posterior belief distribution over actions, whose precision is itself optimised—such that action selection becomes more random when the expected outcomes of actions are more uncertain (Schwartenbeck et al., 2015a). Finally, traditional RL approaches have usually focused on cases where agents know their current state with certainty, and thus eschew uncertainty in state estimation (although RL schemes can be supplemented with Bayesian state-estimation algorithms, leading to Bayesian RL). In contrast, active inference integrates state estimation, learning, decision-making, and motor control under the single objective of minimising free energy (Da Costa et al., 2020).

Despite these well-known differences, the relationship between active inference and RL, and particularly between the objectives of free energy minimisation and reward maximisation, has not been thoroughly explicated.
Their relationship has become increasingly important to understand, as a growing body of research has begun to: 1) compare the performance of active inference and RL models in simulated environments (Cullen et al., 2018; Millidge, 2020; Sajid et al., 2020), 2) apply active inference to study human behaviour on reward learning tasks (Smith et al., 2020a,b,d), and 3) consider the complementary predictions and interpretations they each offer in computational neuroscience, psychology, and psychiatry (Cullen et al., 2018; Schwartenbeck et al., 2015a, 2019; Tschantz et al., 2020b). In what follows, we try to clarify the relationship between RL and active inference and identify the conditions under which they are equivalent.

Despite apparent differences, we show that there is a formal relationship between active inference and RL that is most clearly seen with model-based RL. Specifically, we will see that dynamic programming under the Bellman equation is a limiting case of active inference on finite-horizon partially observable Markov decision processes (POMDPs). Equivalently, we show that a limiting case of active inference maximises reward on finite-horizon POMDPs. However, active inference also covers scenarios that do not involve reward maximisation, as it can be used to solve any problem that can be cast in terms of reaching and maintaining a target distribution on a suitable state-space (see (Da Costa et al., 2020, Appendix B)). In brief, active inference reduces to dynamic programming when the target distribution is a (uniform mixture of) Dirac distribution(s) over reward-maximising trajectories. Note that, in infinite-horizon POMDPs, active inference will not necessarily furnish the solution to the Bellman equation, as it plans only up to finite temporal horizons.

In what follows, we first review dynamic programming on finite-horizon Markov decision processes (MDPs; Section 2). Next, we introduce active inference for finite-horizon MDPs (Section 3). Third, we demonstrate how active inference reduces to dynamic programming in a limiting case (Section 4). Finally, we show how these results generalise to POMDPs (Section 4.4). We conclude with a discussion of the implications of these results and future directions (Section 5).
2. Dynamic programming on finite horizon MDPs
In this section, we recall the fundamentals of discrete-time dynamic programming.
Markov decision processes (MDPs) are a class of models specifying environmental dynamics widely used in dynamic programming, model-based RL, and more broadly in engineering and artificial intelligence (Barto and Sutton, 1992; Stone, 2019). They have been used to simulate sequential decision-making tasks with the objective of maximising a reward or utility function. An MDP specifies the environmental dynamics in discrete time and space given the actions pursued by an agent.
Definition 1 (Finite horizon MDP). A finite horizon MDP comprises the following tuple:

• $S$, a finite set of states.
• $\mathbb{T} = \{0, \ldots, T\}$, a finite set which stands for discrete time. $T$ is the temporal horizon or planning horizon.
• $A$, a finite set of actions.
• $P(s_t = s' \mid s_{t-1} = s, a_{t-1} = a)$, the probability that action $a \in A$ in state $s \in S$ at time $t-1$ will lead to state $s' \in S$ at time $t$. The $s_t$ are random variables over $S$, which correspond to the state being occupied at time $t = 0, \ldots, T$.
• $P(s_0 = s)$, the probability of being at state $s \in S$ at the start of the trial.
• $R(s)$, the finite reward received by the agent when at state $s \in S$.

The dynamics afforded by a finite horizon MDP can be written globally as a probability distribution over trajectories $s_{0:T} := (s_0, \ldots, s_T)$, given a sequence of actions $a_{0:T-1} := (a_0, \ldots, a_{T-1})$. This factorises as follows:

$$P(s_{0:T} \mid a_{0:T-1}) = P(s_0) \prod_{\tau=1}^{T} P(s_\tau \mid s_{\tau-1}, a_{\tau-1}).$$

These MDP dynamics can be regarded as a Markov chain on the state-space $S$, given a sequence of actions (see Figure 1).
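To make Definition 1 concrete, the following is a minimal Python sketch of a finite horizon MDP encoded as arrays, together with a sampler for the trajectory distribution above. The specific state space, transition tensor, reward vector, and all variable names are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# A toy finite-horizon MDP in the sense of Definition 1 (illustrative values).
rng = np.random.default_rng(0)

n_states, n_actions, T = 3, 2, 4                  # |S|, |A|, temporal horizon
P0 = np.array([1.0, 0.0, 0.0])                    # P(s_0)
# P_trans[a, s, s'] = P(s_t = s' | s_{t-1} = s, a_{t-1} = a)
P_trans = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = np.array([0.0, 1.0, 10.0])                    # reward R(s) for each state

def sample_trajectory(actions):
    """Sample s_{0:T} given actions a_{0:T-1}, following
    P(s_{0:T} | a_{0:T-1}) = P(s_0) * prod_tau P(s_tau | s_{tau-1}, a_{tau-1})."""
    s = rng.choice(n_states, p=P0)
    traj = [s]
    for a in actions:
        s = rng.choice(n_states, p=P_trans[a, s])
        traj.append(s)
    return traj

traj = sample_trajectory([0, 1, 0, 1])
print(traj, "cumulative reward:", sum(R[s] for s in traj))
```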
Figure 1: Finite horizon Markov decision process. This is a Markov decision process illustrated using a Bayesian network (Jordan et al., 1998; Pearl, 1998). A finite horizon MDP comprises a finite sequence of states, indexed in time. The transition from one state to the next state depends on action. As such, the dynamics of the MDP can be regarded as a Markov chain on state-space, given the action sequence. Thus, the only degree of freedom is the action that is selected. This selection can be specified in terms of a state-action policy, $\Pi$: a probabilistic mapping from space-time to actions.

Remark 2 (On the definition of reward). More generally, the reward function can be taken to be dependent on the previous action and previous state: $R_a(s' \mid s)$ is the reward received after transitioning from state $s$ to state $s'$, due to action $a$ (Barto and Sutton, 1992; Stone, 2019). However, given an MDP with such a reward function, we can recover our simplified setting without loss of generality. We define a new MDP where the states comprise the previous action, previous state, and current state in the original MDP. By inspection, the resulting reward function on the new MDP depends only on the current state (i.e., $R(s)$).

Remark 3 (Admissible actions). In general, it is possible that not all actions are available at every state. Thus, $A_s$ is defined to be the finite set of (allowable) actions available from state $s \in S$. All the results in this paper concerning MDPs can be extended to this setting.

Given an MDP, the agent transitions from one state to the next as time unfolds. The transitions depend on the agent's actions. The goal under RL is to select actions that maximise expected reward. To formalise what it means to choose actions, we introduce the notion of a state-action policy.
Definition 4 (State-action policy). A state-action policy is a probability distribution over actions that depends on the state that the agent occupies, and on time. Explicitly, it is a function $\Pi$ that satisfies:

$$\Pi : A \times S \times \mathbb{T} \to [0,1], \quad (a, s, t) \mapsto \Pi(a \mid s, t), \qquad \forall (s,t) \in S \times \mathbb{T} : \sum_{a \in A} \Pi(a \mid s, t) = 1.$$

When $s_t = s$, we will write $\Pi(a \mid s_t) := \Pi(a \mid s, t)$. Additionally, the action at time $T$ is redundant, as no further reward can be reaped from the environment. Therefore, one often specifies state-action policies only up to time $T - 1$. This is equivalent to defining a state-action policy as $\Pi : A \times S \times \{0, \ldots, T-1\} \to [0,1]$.

The state-action policy – as defined here – is stochastic and can be regarded as a generalisation of a deterministic policy that assigns a probability of $1$ to one of the available actions, and $0$ otherwise (Puterman, 2014).

Remark 5 (Conflicting terminologies: policy in active inference). In active inference, a policy is defined as a sequence of actions indexed in time. To avoid terminological confusion, we use "sequence of actions" to denote a policy under active inference.
As previously mentioned, the goal for an RL agent at time $t$ is to choose actions that maximise future cumulative reward:

$$R(s_{t+1:T}) := \sum_{\tau=t+1}^{T} R(s_\tau).$$

More precisely, the goal is to follow a state-action policy $\Pi$ that maximises the state-value function:

$$v_\Pi(s, t) := \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s], \quad \forall (s,t) \in S \times \mathbb{T}.$$

The state-value function scores the expected cumulative reward if the agent pursues state-action policy $\Pi$ from state $s_t = s$. When $s_t = s$ is clear from context, we will often write $v_\Pi(s_t) := v_\Pi(s, t)$. Loosely speaking, we will call the expected reward the return.

Remark 6 (Notation: $\mathbb{E}_\Pi$). Whilst standard in RL (Barto and Sutton, 1992; Stone, 2019), the notation $\mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s]$ can be misleading. It denotes the expected reward, under the transition probabilities of the MDP, for a particular state-action policy $\Pi$:

$$\mathbb{E}_{P(s_{t+1:T} \mid a_{t:T-1}, s_t = s)\, \Pi(a_{t:T-1} \mid s_{t+1:T-1}, s_t = s)}[R(s_{t+1:T})].$$

It is important to keep this correspondence in mind, as we will use both notations depending on context.
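As an illustration of the state-value function, the sketch below evaluates $v_\Pi$ exactly by recursing backwards from the horizon, using the identity $v_\Pi(s,t) = \sum_a \Pi(a \mid s, t) \sum_{s'} P(s' \mid s, a)\,[R(s') + v_\Pi(s', t+1)]$ with $v_\Pi(\cdot, T) = 0$, which follows from the definition above. The toy MDP and the uniformly random policy are assumptions for illustration only.

```python
import numpy as np

# Exact policy evaluation on a toy finite-horizon MDP: v[s, t] = v_Pi(s, t),
# the expected cumulative reward R(s_{t+1:T}) given s_t = s.
n_states, n_actions, T = 3, 2, 4
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
R = np.array([0.0, 1.0, 10.0])                                    # R(s)
pi = np.full((T, n_states, n_actions), 1.0 / n_actions)           # Pi(a | s, t)

def evaluate(pi):
    v = np.zeros((n_states, T + 1))  # v(s, T) = 0: no reward beyond the horizon
    for t in reversed(range(T)):
        for s in range(n_states):
            # v(s, t) = sum_a Pi(a|s,t) sum_s' P(s'|s,a) [R(s') + v(s', t+1)]
            v[s, t] = sum(pi[t, s, a] * (P[a, s] @ (R + v[:, t + 1]))
                          for a in range(n_actions))
    return v

print(evaluate(pi)[:, 0])   # expected return from each possible initial state
```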
Remark 7 (Temporal discounting). In infinite horizon MDPs (i.e., when $T$ is infinite), we often add a temporal discounting term $\gamma \in (0,1)$ (Barto and Sutton, 1992; Bertsekas and Shreve, 1996; Stone, 2019) such that the infinite sum

$$v_\Pi(s, t) := \mathbb{E}_\Pi\left[\sum_{\tau=t+1}^{\infty} \gamma^{\tau} R(s_\tau) \,\middle|\, s_t = s\right]$$

converges. However, under the finite temporal horizons considered here, the expected reward converges regardless of $\gamma$, which eschews the need to include temporal discounting when evaluating expected reward. Thus, in what follows, we set $\gamma = 1$.

We want to rank state-action policies in terms of their expected reward. To do this, we introduce a partial ordering on state-action policies, such that a state-action policy is better than another when it yields higher expected rewards in any situation:

$$\Pi \geq \Pi' \iff \forall (s,t) \in S \times \mathbb{T} : v_\Pi(s,t) \geq v_{\Pi'}(s,t).$$

Similarly, a state-action policy $\Pi$ is strictly better than $\Pi'$ if

$$\Pi > \Pi' \iff \Pi \geq \Pi' \text{ and } \exists (s,t) \in S \times \mathbb{T} : v_\Pi(s,t) > v_{\Pi'}(s,t).$$

Definition 8 (Bellman optimal state-action policy). A state-action policy $\Pi^*$ is said to be Bellman optimal if, and only if, $\Pi^* \geq \Pi$, $\forall \Pi$. That is, if it maximises the state-value function $v_\Pi(s,t)$ for any state $s$ at time $t$.
In other words, a state-action policy is Bellman optimal if it is better than all alternatives. It is important to show that this concept is not vacuous. For this, we prove a classical result (Bertsekas and Shreve, 1996; Puterman, 2014):
Proposition 9 (Existence of Bellman optimal state-action policies). Given a finite horizon MDP as specified in Definition 1, there exists a Bellman optimal state-action policy $\Pi^*$.

Note that uniqueness of the Bellman optimal state-action policy is not implied by Proposition 9. Indeed, it is a general feature of MDPs that there can be multiple Bellman optimal state-action policies (Bertsekas and Shreve, 1996; Puterman, 2014).
Proof
Note that a Bellman optimal state-action policy $\Pi^*$ is a maximal element according to the partial ordering $\geq$. Existence thus consists of a simple application of Zorn's lemma. Zorn's lemma states that if any increasing chain

$$\Pi_1 \leq \Pi_2 \leq \Pi_3 \leq \ldots \quad (1)$$

has an upper bound that is a state-action policy, then there is a maximal element $\Pi^*$.

Given the chain (1), we construct an upper bound. We enumerate $A \times S \times \mathbb{T}$ by $(\alpha_1, \sigma_1, t_1), \ldots, (\alpha_N, \sigma_N, t_N)$. Then the state-action policy sequence $\Pi_n(\alpha_1 \mid \sigma_1, t_1)$, $n = 1, 2, 3, \ldots$ is bounded within $[0,1]$. By the Bolzano–Weierstrass theorem, there exists a subsequence $\Pi_{n_k}(\alpha_1 \mid \sigma_1, t_1)$, $k = 1, 2, 3, \ldots$ that converges. Similarly, $\Pi_{n_k}(\alpha_2 \mid \sigma_2, t_2)$ is also a bounded sequence, and by Bolzano–Weierstrass it has a subsequence $\Pi_{n_{k_j}}(\alpha_2 \mid \sigma_2, t_2)$ that converges. We repeatedly take subsequences until $N$. To ease notation, call the resulting subsequence $\Pi_m$, $m = 1, 2, 3, \ldots$ With this, we define $\hat\Pi = \lim_{m\to\infty} \Pi_m$. It is straightforward to see that $\hat\Pi$ is a state-action policy:

$$\hat\Pi(\alpha \mid \sigma, t) = \lim_{m\to\infty} \Pi_m(\alpha \mid \sigma, t) \in [0,1], \quad \forall (\alpha, \sigma, t) \in A \times S \times \mathbb{T},$$
$$\sum_{\alpha \in A} \hat\Pi(\alpha \mid \sigma, t) = \lim_{m\to\infty} \sum_{\alpha \in A} \Pi_m(\alpha \mid \sigma, t) = 1, \quad \forall (\sigma, t) \in S \times \mathbb{T}.$$

To show that $\hat\Pi$ is an upper bound, take any $\Pi$ in the original chain of state-action policies (1). Then, by the definition of an increasing sequence, there exists an index $M \in \mathbb{N}$ such that $\forall k \geq M : \Pi_k \geq \Pi$. Since limits commute with finite sums, we have

$$v_{\hat\Pi}(s,t) = \lim_{m\to\infty} v_{\Pi_m}(s,t) \geq v_{\Pi_k}(s,t) \geq v_\Pi(s,t)$$

for any $(s,t) \in S \times \mathbb{T}$. Thus, by Zorn's lemma there exists a Bellman optimal state-action policy $\Pi^*$.

Now that we know that Bellman optimal state-action policies exist, we can characterise them recursively as a return-maximising action followed by a Bellman optimal state-action policy.

Proposition 10 (Characterisation of Bellman optimal state-action policies). For a state-action policy $\Pi$, the following are equivalent:

1. $\Pi$ is Bellman optimal.
2. $\Pi$ is
(a) Bellman optimal when restricted to $\{1, \ldots, T\}$. In other words, for any state-action policy $\Pi'$ and $(s,t) \in S \times \{1, \ldots, T\}$: $v_\Pi(s,t) \geq v_{\Pi'}(s,t)$.
(b) At time $0$, such that $\Pi$ selects actions that maximise return:

$$\Pi(a \mid s, 0) > 0 \iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a], \quad \forall s \in S. \quad (2)$$

Note that this characterisation offers a recursive way to construct Bellman optimal state-action policies by backwards induction (i.e., by successively selecting the best action), as specified by Equation 2, starting from $T$ and inducting backwards (Puterman, 2014).

Proof
(1 ⇒ 2):
We only need to show assertion (b). By contradiction, suppose that $\exists (s, \alpha) \in S \times A$ such that $\Pi(\alpha \mid s, 0) > 0$ and

$$\mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = \alpha] < \max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a].$$

We let $\alpha'$ be the Bellman optimal action at state $s$ and time $0$, defined as $\alpha' := \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a]$. Then, we let $\Pi'$ be the same state-action policy as $\Pi$, except that $\Pi'(\cdot \mid s, 0)$ assigns $\alpha'$ deterministically. Then,

$$\begin{aligned}
v_\Pi(s, 0) &= \sum_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a]\, \Pi(a \mid s, 0) \\
&< \max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a] \\
&= \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_0 = s, a_0 = \alpha']\, \Pi'(\alpha' \mid s, 0) \\
&= \sum_{a \in A} \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_0 = s, a_0 = a]\, \Pi'(a \mid s, 0) = v_{\Pi'}(s, 0).
\end{aligned}$$

So $\Pi$ is not Bellman optimal, which is a contradiction.
(1 ⇐ 2):
We only need to show that $\Pi$ maximises $v_\Pi(s, 0)$, $\forall s \in S$. By contradiction, suppose there exists a state-action policy $\Pi'$ and a state $s \in S$ such that

$$v_\Pi(s, 0) < v_{\Pi'}(s, 0) \iff \sum_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a]\, \Pi(a \mid s, 0) < \sum_{a \in A} \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_0 = s, a_0 = a]\, \Pi'(a \mid s, 0).$$

By (b), the left-hand side equals $\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a]$. Unpacking the expression on the right-hand side:

$$\begin{aligned}
\sum_{a \in A} \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_0 = s, a_0 = a]\, \Pi'(a \mid s, 0)
&= \sum_{a \in A} \sum_{\sigma \in S} \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_1 = \sigma]\, P(s_1 = \sigma \mid s_0 = s, a_0 = a)\, \Pi'(a \mid s, 0) \\
&= \sum_{a \in A} \sum_{\sigma \in S} \{\mathbb{E}_{\Pi'}[R(s_{2:T}) \mid s_1 = \sigma] + R(\sigma)\}\, P(s_1 = \sigma \mid s_0 = s, a_0 = a)\, \Pi'(a \mid s, 0) \\
&= \sum_{a \in A} \sum_{\sigma \in S} \{v_{\Pi'}(\sigma, 1) + R(\sigma)\}\, P(s_1 = \sigma \mid s_0 = s, a_0 = a)\, \Pi'(a \mid s, 0). \quad (3)
\end{aligned}$$

Since $\Pi$ is Bellman optimal when restricted to $\{1, \ldots, T\}$, we have $v_{\Pi'}(\sigma, 1) \leq v_\Pi(\sigma, 1)$, $\forall \sigma \in S$. Therefore,

$$\sum_{a \in A} \sum_{\sigma \in S} \{v_{\Pi'}(\sigma, 1) + R(\sigma)\}\, P(s_1 = \sigma \mid s_0 = s, a_0 = a)\, \Pi'(a \mid s, 0) \leq \sum_{a \in A} \sum_{\sigma \in S} \{v_\Pi(\sigma, 1) + R(\sigma)\}\, P(s_1 = \sigma \mid s_0 = s, a_0 = a)\, \Pi'(a \mid s, 0).$$

Repeating the steps above (3), but in reverse order, yields

$$\sum_{a \in A} \mathbb{E}_{\Pi'}[R(s_{1:T}) \mid s_0 = s, a_0 = a]\, \Pi'(a \mid s, 0) \leq \sum_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a]\, \Pi'(a \mid s, 0).$$

However,

$$\sum_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a]\, \Pi'(a \mid s, 0) \leq \max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a],$$

which is a contradiction.

Proposition 10 suggests a straightforward recursive algorithm to construct Bellman optimal state-action policies, known as backward induction (Puterman, 2014). Backward induction entails reasoning backwards in time, from a goal state at the end of a problem or solution, to determine a sequence of Bellman optimal actions. It proceeds by first considering the last time at which a decision might be made and choosing what to do in any situation at that time. Using this information, one can then determine what to do at the second-to-last decision time. This process continues backwards until one has determined the best action for every possible situation or state at every point in time. This algorithm has a long history. It was developed by the German mathematician Zermelo in 1913 to prove that chess has Bellman optimal strategies (Zermelo, 1913). In stochastic control, backward induction is one of the main methods for solving the Bellman equation (Adda and Cooper, 2003; Miranda and Fackler, 2002; Sargent, 2000). In game theory, the same method is used to compute sub-game perfect equilibria in sequential games (Fudenberg and Tirole, 1991).

Proposition 11 (Backward induction: construction of Bellman optimal state-action policies). Backward induction

$$\begin{aligned}
\Pi(a \mid s, T-1) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}[R(s_T) \mid s_{T-1} = s, a_{T-1} = a], \quad \forall s \in S \\
\Pi(a \mid s, T-2) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{T-1:T}) \mid s_{T-2} = s, a_{T-2} = a], \quad \forall s \in S \\
&\;\;\vdots \\
\Pi(a \mid s, 0) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a], \quad \forall s \in S \quad (4)
\end{aligned}$$

defines a Bellman optimal state-action policy $\Pi$. Furthermore, this characterisation is complete: all Bellman optimal state-action policies satisfy the backward induction relation (4).

Intuitively, this recursive scheme (4) consists of planning backwards, by starting from the end goal and working out the actions needed to achieve the goal.
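Proposition 11 translates directly into a short dynamic-programming loop. The following is a minimal sketch, assuming numpy and a toy MDP (all names and values are illustrative): it sweeps from $T-1$ back to $0$, at each step selecting a return-maximising action as in (4).

```python
import numpy as np

# Backward induction (Proposition 11): construct a deterministic Bellman
# optimal state-action policy by working from the horizon back to time 0.
n_states, n_actions, T = 3, 2, 4
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
R = np.array([0.0, 1.0, 10.0])

v = np.zeros(n_states)                    # value-to-go, initialised at time T
policy = np.zeros((T, n_states), dtype=int)
for t in reversed(range(T)):
    # q[a, s] = E[R(s_{t+1}) + v(s_{t+1}) | s_t = s, a_t = a]
    q = P @ (R + v)                       # shape (n_actions, n_states)
    policy[t] = q.argmax(axis=0)          # best action for each state at time t
    v = q.max(axis=0)                     # optimal value-to-go at time t

print("optimal first actions per state:", policy[0])
print("optimal expected return per initial state:", v)
```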
Example 1.
To give a concrete example of this kind of planning, the present scheme would consider the example actions below in the following order:

1. Desired goal: I would like to go to the grocery store.
2. Intermediate action: I need to drive to the store.
3. Current best action: I should put my shoes on.

Proof [Proof of Proposition 11]

• We first prove that state-action policies $\Pi$ defined as in (4) are Bellman optimal, by induction on $T$. When $T = 1$:

$$\Pi(a \mid s, 0) > 0 \iff a \in \arg\max_{a \in A} \mathbb{E}[R(s_1) \mid s_0 = s, a_0 = a], \quad \forall s \in S$$

is a Bellman optimal state-action policy, as it maximises the total reward possible in the MDP. Let $T > 1$ be finite and suppose that the Proposition holds for MDPs with a temporal horizon of $T - 1$. This means that

$$\begin{aligned}
\Pi(a \mid s, T-1) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}[R(s_T) \mid s_{T-1} = s, a_{T-1} = a], \quad \forall s \in S \\
\Pi(a \mid s, T-2) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{T-1:T}) \mid s_{T-2} = s, a_{T-2} = a], \quad \forall s \in S \\
&\;\;\vdots \\
\Pi(a \mid s, 1) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{2:T}) \mid s_1 = s, a_1 = a], \quad \forall s \in S
\end{aligned}$$

is a Bellman optimal state-action policy on the MDP restricted to times $1$ to $T$. Therefore, since

$$\Pi(a \mid s, 0) > 0 \iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a], \quad \forall s \in S,$$

Proposition 10 allows us to deduce that $\Pi$ is Bellman optimal.

• We now show that any Bellman optimal state-action policy satisfies the backward induction algorithm (4). Suppose by contradiction that there exists a state-action policy $\Pi$ that is Bellman optimal but does not satisfy (4). Say, $\exists (a, s, t) \in A \times S \times \mathbb{T}$, $t < T$, such that $\Pi(a \mid s, t) > 0$ and $a \notin \arg\max_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha]$. This implies

$$\mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = a] < \max_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha].$$

Let $\tilde a \in \arg\max_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha]$. Let $\tilde\Pi$ be a state-action policy such that $\tilde\Pi(\cdot \mid s, t)$ assigns $\tilde a \in A$ deterministically, and such that $\tilde\Pi = \Pi$ otherwise. Then we can contradict the Bellman optimality of $\Pi$ as follows:

$$\begin{aligned}
v_\Pi(s, t) &= \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s] \\
&= \sum_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha]\, \Pi(\alpha \mid s, t) \\
&< \max_{\alpha \in A} \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha] \\
&= \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s, a_t = \tilde a] \\
&= \mathbb{E}_{\tilde\Pi}[R(s_{t+1:T}) \mid s_t = s, a_t = \tilde a] \\
&= \sum_{\alpha \in A} \mathbb{E}_{\tilde\Pi}[R(s_{t+1:T}) \mid s_t = s, a_t = \alpha]\, \tilde\Pi(\alpha \mid s, t) = v_{\tilde\Pi}(s, t).
\end{aligned}$$

This concludes our discussion of dynamic programming on finite horizon MDPs.
3. Active inference on finite horizon MDPs
We now turn to active inference agents on finite horizon MDPs. Here, the agent's generative model of its environment is modelled using the previously defined finite horizon MDP (see Definition 1). This means that we assume that the transition probabilities are known. We do not consider the general case where the transitions have to be learned, but comment on it in the discussion (also see Appendix A).

In what follows, we fix a time $t \geq 0$ and suppose that the agent has been in states $s_0, \ldots, s_t$. To ease notation, we let $\vec s := s_{t+1:T}$, $\vec a := a_{t:T-1}$ be the future states and future actions. Let $Q$ be the predictive distribution of the agent; that is, the distribution specifying the next actions and states that the agent encounters and pursues when at state $s_t$:

$$Q(\vec s, \vec a \mid s_t) := \prod_{\tau=t}^{T-1} Q(s_{\tau+1} \mid a_\tau, s_\tau)\, Q(a_\tau \mid s_\tau).$$

In active inference, perception entails inferences about future, past, and current states given observations and a sequence of actions. This is done through variational Bayesian inference by minimising (variational) free energy (a.k.a. an evidence bound in machine learning) (Beal, 2003; Bishop, 2006; Wainwright and Jordan, 2007). See (Da Costa et al., 2020) for details on active inference in the partially observable setting.

In the MDP setting, past and current states are known, hence it is only necessary to infer future states, given the current state and a sequence of actions, $P(\vec s \mid \vec a, s_t)$. These posteriors are known in virtue of the fact that the agent knows the transition probabilities of the MDP; hence variational inference becomes exact Bayesian inference:

$$Q(\vec s \mid \vec a, s_t) := P(\vec s \mid \vec a, s_t) = \prod_{\tau=t}^{T-1} P(s_{\tau+1} \mid s_\tau, a_\tau). \quad (5)$$

Remark 12 (Unknown transition probabilities). When the transition probabilities or reward are unknown to the agent, the problem is one of RL (Shoham et al., 2003). Although we do not consider this scenario here, when the model is unknown, we simply equip the agent's generative model with a prior, and the model is then updated via variational Bayesian inference to fit the observed data. Depending on the specific learning problem and generative model structure, this can involve updating the transition probabilities (i.e., the probability of transitioning to a rewarding state under each action) and/or the target distribution $C$ (to be defined later); in POMDPs it can also involve updating the probabilities of observations under each state. See Appendix A for further details on how active inference implements reward learning and potential connections to representative RL approaches; and see (Da Costa et al., 2020) for details on modelling learning in active inference more generally.

Now that the agent has inferred future states given different sequences of actions, we must score these sequences using the goodness of the resulting state trajectories (in terms of $C$). The expected free energy does exactly this: it is the objective that active inference agents minimise in order to select the best possible action. Under active inference, agents minimise expected free energy in order to maintain a steady-state distribution $C$ over the state-space $S$. This steady-state specifies the agent's preferences, or the characteristic states it returns to after being perturbed. The expected free energy is defined as a functional of this steady-state distribution.
In the absence of any observed or latent states, the expected free energy reduces to the following form (which is a special case of the expected free energy for partially observed MDPs—see Section 4.4).

Definition 13 (Expected free energy). In MDPs, the expected free energy of an action sequence $\vec a$ starting from $s_t$ is defined as (Da Costa et al., 2020):

$$G(\vec a \mid s_t) = D_{KL}[Q(\vec s \mid \vec a, s_t) \,\|\, C(\vec s)]. \quad (6)$$

Therefore, minimising expected free energy corresponds to making the distribution over predicted states close to the distribution $C$ that encodes prior preferences. The expected free energy may be rewritten as

$$G(\vec a \mid s_t) = \underbrace{\mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[-\log C(\vec s)]}_{\text{Expected surprise}} - \underbrace{\mathrm{H}[Q(\vec s \mid \vec a, s_t)]}_{\text{Entropy of future states}}. \quad (7)$$

Hence, minimising expected free energy minimises the expected surprise of states¹ according to $C$ and maximises the entropy of Bayesian beliefs over future states (a maximum entropy principle (Jaynes, 1957a,b)). Evaluating the expected free energy of courses of action corresponds to planning as inference (Attias, 2003; Botvinick and Toussaint, 2012). This follows from the fact that the expected free energy scores the goodness of inferred future states.

Remark 14 (Numerical tractability). The expected free energy is straightforward to compute using linear algebra. Given an action sequence $\vec a$, $C(\vec s)$ and $Q(\vec s \mid \vec a, s_t)$ are categorical distributions over $S^{T-t}$. Let their parameters be $\mathbf{c}, \mathbf{s}_{\vec a} \in [0,1]^{|S|^{(T-t)}}$, where $|\cdot|$ denotes the cardinality of a set. Then:

$$G(\vec a \mid s_t) = \mathbf{s}_{\vec a}^\top (\log \mathbf{s}_{\vec a} - \log \mathbf{c}). \quad (8)$$

Notwithstanding, (8) is expensive to evaluate repeatedly when all possible action sequences are considered. In practice, one can adopt a temporal mean field approximation over future states (Millidge et al., 2020a):
$$Q(\vec s \mid \vec a, s_t) \approx \prod_{\tau=t+1}^{T} Q(s_\tau \mid \vec a, s_t),$$

which yields the simplified expression

$$G(\vec a \mid s_t) \approx \sum_{\tau=t+1}^{T} D_{KL}[Q(s_\tau \mid \vec a, s_t) \,\|\, C(s_\tau)]. \quad (9)$$

Expression (9) is much easier to handle: for each action sequence $\vec a$, 1) one evaluates the summands sequentially for $\tau = t+1, \ldots, T$, and 2) if and when the sum up to $\tau$ becomes significantly higher than the lowest expected free energy encountered during planning, $G(\vec a \mid s_t)$ is set to an arbitrarily high value. Setting $G(\vec a \mid s_t)$ to a high value is equivalent to pruning away unlikely trajectories. This bears some similarity to decision tree pruning procedures used in RL (Huys et al., 2012). It finesses exploration of the decision-tree in full depth and provides an Occam's window for selecting action sequences.

There are complementary approaches to make planning tractable. For example, hierarchical generative models factorise decisions into multiple levels. By abstracting information at a higher level, lower levels entertain fewer actions (Friston et al., 2018)—which reduces the depth of the decision tree by orders of magnitude. Another approach is to use algorithms that search the decision-tree selectively, such as Monte-Carlo tree search (Coulom, 2006; Silver et al., 2016), or to amortise the expected free energy minimisation using artificial neural networks (i.e., learning to plan (Çatal et al., 2020a)).

¹The surprise of states $-\log C(s)$ is an information-theoretic term (Stone, 2015) that scores the extent to which an observation is unusual under $C$. It does not mean that the agent consciously experiences surprise.
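As a concrete illustration of Remark 14 and expression (9), the sketch below scores every action sequence with the mean-field expected free energy and returns the minimiser. The toy transition model, the preference distribution built from an assumed reward vector, and all names are illustrative, not taken from the paper.

```python
import numpy as np
from itertools import product

# Mean-field expected free energy on a toy MDP, as in equation (9):
# G(a_vec | s_t) ~= sum_tau KL[ Q(s_tau | a_vec, s_t) || C(s_tau) ].
n_states, n_actions, T = 3, 2, 3
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
R = np.array([0.0, 1.0, 10.0])
lam = 1.0
C = np.exp(lam * R) / np.exp(lam * R).sum()       # preferences over states

def expected_free_energy(actions, s_t):
    """G(a_vec | s_t) under the mean-field approximation (9)."""
    q = np.eye(n_states)[s_t]                      # point mass on the known state
    G = 0.0
    for a in actions:
        q = q @ P[a]                               # Q(s_{tau+1} | a_vec, s_t)
        G += np.sum(q * (np.log(q + 1e-16) - np.log(C)))   # KL[q || C]
    return G

# Score every action sequence and pick the minimiser (planning as inference).
seqs = list(product(range(n_actions), repeat=T))
best = min(seqs, key=lambda seq: expected_free_energy(seq, s_t=0))
print("EFE-minimising action sequence:", best)
```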
4. Reward maximisation as active inference
In the following, we show how active inference can solve the reward maximisation problem.
From the definition of expected free energy (6), active inference can be thought of as reaching and remaining at a target distribution $C$ over state-space. This distribution encodes the agent's preferences. In short, simulating active inference can be regarded as engineering a stationary process (Pavliotis, 2014), where the stationary distribution encodes the agent's preferences.

The idea that underwrites the rest of this paper is that when the stationary distribution has all of its mass on reward maximising states, then the agent will maximise reward. To illustrate this, we define a distribution $C_\lambda$, $\lambda > 0$, encoding the agent's preferences over state-space $S$, such that rewarding states become preferred states:

$$C_\lambda(\sigma) := \frac{\exp(\lambda R(\sigma))}{\sum_{\varsigma \in S} \exp(\lambda R(\varsigma))} \propto \exp(\lambda R(\sigma)), \quad \forall \sigma \in S$$
$$\iff -\log C_\lambda(\sigma) = -\lambda R(\sigma) - c(\lambda), \quad \forall \sigma \in S, \text{ for some } c(\lambda) \in \mathbb{R} \text{ constant w.r.t. } \sigma.$$

The parameter $\lambda > 0$ is an inverse temperature parameter, which scores how motivated the agent is to occupy reward maximising states. Note that states $s \in S$ that maximise the reward $R(s)$ maximise $C_\lambda(s)$ and minimise $-\log C_\lambda(s)$ for any $\lambda > 0$.

Using the additive property of the reward function, we can extend $C_\lambda$ to a probability distribution over trajectories $\vec\sigma := (\sigma_1, \ldots, \sigma_T) \in S^T$. Specifically, $C_\lambda$ scores to what extent a trajectory is preferred over another trajectory:

$$C_\lambda(\vec\sigma) := \frac{\exp(\lambda R(\vec\sigma))}{\sum_{\vec\varsigma \in S^T} \exp(\lambda R(\vec\varsigma))} = \prod_{\tau=1}^{T} \frac{\exp(\lambda R(\sigma_\tau))}{\sum_{\varsigma \in S} \exp(\lambda R(\varsigma))} = \prod_{\tau=1}^{T} C_\lambda(\sigma_\tau), \quad \forall \vec\sigma \in S^T$$
$$\iff -\log C_\lambda(\vec\sigma) = -\lambda R(\vec\sigma) - c'(\lambda) = -\sum_{\tau=1}^{T} \lambda R(\sigma_\tau) - c'(\lambda), \quad \forall \vec\sigma \in S^T, \quad (10)$$

where $c'(\lambda) := c(\lambda)\, T \in \mathbb{R}$ is constant w.r.t. $\vec\sigma$.

When the preferences are defined in this way, the zero temperature limit $\lambda \to +\infty$ is the case where the preferences are non-zero only for states or trajectories that maximise reward. In this case, $\lim_{\lambda \to +\infty} C_\lambda$ is a uniform mixture of Dirac distributions over reward maximising trajectories:

$$\lim_{\lambda \to +\infty} C_\lambda \propto \sum_{\vec s \in I^{T-t}} \mathrm{Dirac}_{\vec s}, \qquad I := \arg\max_{s \in S} R(s). \quad (11)$$

This is because, for a reward maximising state $\sigma$, $\exp(\lambda R(\sigma))$ will converge to $+\infty$ more quickly than $\exp(\lambda R(\sigma'))$ for a non-reward maximising state $\sigma'$. Since $C_\lambda$ is constrained to be normalised to $1$ (as it is a probability distribution), $C_\lambda(\sigma') \to 0$ as $\lambda \to +\infty$. Hence, in the limit $\lambda \to +\infty$, $C_\lambda$ is non-zero (and uniform) only on reward maximising states.

We now show how reaching preferred states can be formulated as reward maximisation:

Lemma 15.
The sequence of actions that minimises expected free energy also maximises expected reward in the limiting case $\lambda \to +\infty$:

$$\lim_{\lambda \to +\infty} \arg\min_{\vec a} G(\vec a \mid s_t) \subseteq \arg\max_{\vec a} \mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)].$$

Furthermore, of those action sequences that maximise expected reward, the expected free energy minimisers will be those that maximise the entropy of future states $\mathrm{H}[Q(\vec s \mid \vec a, s_t)]$.

Proof

$$\begin{aligned}
\lim_{\lambda \to +\infty} \arg\min_{\vec a} D_{KL}[Q(\vec s \mid \vec a, s_t) \,\|\, C_\lambda(\vec s)]
&= \lim_{\lambda \to +\infty} \arg\min_{\vec a} -\mathrm{H}[Q(\vec s \mid \vec a, s_t)] + \mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[-\log C_\lambda(\vec s)] \\
&= \lim_{\lambda \to +\infty} \arg\min_{\vec a} -\mathrm{H}[Q(\vec s \mid \vec a, s_t)] - \lambda\, \mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)] \\
&= \lim_{\lambda \to +\infty} \arg\max_{\vec a} \mathrm{H}[Q(\vec s \mid \vec a, s_t)] + \lambda\, \mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)] \\
&\subseteq \lim_{\lambda \to +\infty} \arg\max_{\vec a} \lambda\, \mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)] \\
&= \arg\max_{\vec a} \mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)].
\end{aligned}$$

The inclusion follows from the fact that, as $\lambda \to +\infty$, a minimiser of the expected free energy has to maximise $\mathbb{E}_{Q(\vec s \mid \vec a, s_t)}[R(\vec s)]$. Among such action sequences, the expected free energy minimisers are those that maximise the entropy of future states $\mathrm{H}[Q(\vec s \mid \vec a, s_t)]$.

In the zero temperature limit $\lambda \to +\infty$, minimising expected free energy corresponds to choosing the action sequence $\vec a$ such that $Q(\vec s \mid \vec a, s_t)$ has most mass on reward maximising states or trajectories. See Figure 2 for an illustration. Of those candidates with the same amount of mass, the maximiser of the entropy of future states $\mathrm{H}[Q(\vec s \mid \vec a, s_t)]$ will be chosen.
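The zero temperature limit (11) is easy to visualise numerically. The snippet below, an illustration with an assumed reward vector, shows $C_\lambda$ concentrating onto a uniform distribution over the reward maximising states as $\lambda$ grows.

```python
import numpy as np

# Numerical illustration of the zero temperature limit (11): as lambda grows,
# C_lambda concentrates on reward maximising states. Values are illustrative.
R = np.array([0.0, 1.0, 10.0, 10.0])          # two reward maximising states

def C(lam):
    w = np.exp(lam * (R - R.max()))           # subtract max for stability
    return w / w.sum()

for lam in [0.1, 1.0, 10.0, 100.0]:
    print(f"lambda = {lam:6.1f} -> C_lambda = {np.round(C(lam), 3)}")
# As lambda -> +inf, C_lambda tends to a uniform mixture of Dirac
# distributions over argmax_s R(s) (here, states 2 and 3).
```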
Figure 2: Reaching preferences and the zero temperature limit. We illustrate how active inference selects actions such that $Q(\vec s \mid \vec a, s_t)$ most closely matches the preference distribution $C_\lambda$ (top-right). In this example, the discrete state-space is a discretisation of a continuous interval in $\mathbb{R}$, and the preferences and predictive distributions over states have a Gaussian shape. The predictive distribution $Q$ is assumed to have a fixed variance with respect to action sequences, such that the only parameter that can be optimised by action selection is its mean. Crucially, in the zero temperature limit (11), $\lim_{\lambda \to +\infty} C_\lambda$ becomes a Dirac distribution over the reward maximising state (bottom). Thus, minimising expected free energy corresponds to selecting the action such that the predicted states assign most mass to the reward maximising state (bottom-right). $Q^* := Q(\vec s \mid \vec a^*, s_t)$ denotes the predictive distribution over states given the action sequence that minimises expected free energy, $\vec a^* = \arg\min_{\vec a} G(\vec a \mid s_t)$.

4.2 Bellman optimality on a temporal horizon of 1

In this section, we first consider the case of a single-step decision problem (i.e., a temporal horizon of $T = 1$) and demonstrate how one simple active inference scheme maximises reward on this problem in the limit $\lambda \to +\infty$. This will act as an important building block for when we subsequently consider the more general multi-step decision problems that are addressed by both generic dynamic programming and active inference. In the multi-step case ($T > 1$), we will show that this simple active inference scheme is not guaranteed to maximise reward. However, when considering this more general class of decision problems, it is important to emphasise that, similar to RL, active inference is a broad normative framework that encompasses multiple algorithms or schemes. Thus, when we subsequently address multi-step decision problems, we will also show how a second, more sophisticated active inference scheme does maximise reward in the limit $\lambda \to +\infty$. These two schemes differ only in how the agent forms beliefs about the best possible courses of action when minimising expected free energy.²

The most common action selection procedure consists of assigning the probability of action sequences to be the softmax of the negative expected free energy (Da Costa et al., 2020; Friston et al., 2017a):

$$Q(\vec a \mid s_t) \propto \exp(-G(\vec a \mid s_t)).$$

Action selection under active inference usually involves selecting the most likely action under $Q(\vec a \mid s_t)$:

$$\begin{aligned}
a_t &\in \arg\max_{a \in A} Q(a_t = a \mid s_t) \\
&= \arg\max_{a \in A} \sum_{\vec a} Q(a_t = a \mid \vec a)\, Q(\vec a \mid s_t) \\
&= \arg\max_{a \in A} \sum_{\vec a} Q(a_t = a \mid \vec a)\, \exp(-G(\vec a \mid s_t)) \\
&= \arg\max_{a \in A} \sum_{\vec a : (\vec a)_t = a} \exp(-G(\vec a \mid s_t)).
\end{aligned}$$
This means that if one action is part of anaction sequence with very low expected free energy, this score is exponentiated and adds a largecontribution to the selection of that particular action.See Table 1 for a summary of this scheme.Process ComputationPerceptualinference Q ( (cid:126)s | (cid:126)a, s t ) = P ( (cid:126)s | (cid:126)a, s t ) = (cid:81) T − τ = t P ( s τ +1 | s τ , a τ ) Planning asinference G ( (cid:126)a | s t ) = D KL [ Q ( (cid:126)s | (cid:126)a, s t ) (cid:107) C ( (cid:126)s )] Decision-making Q ( (cid:126)a | s t ) ∝ exp( − G ( (cid:126)a | s t ))
2. The additional degree of freedom one has in POMDPs is specifying the family of distributions to optimise varia-tional free energy over, in order to infer states from observations. See (Da Costa et al., 2020; Heskes, 2006; Parret al., 2019; Schwöbel et al., 2018) for details. a t ∈ arg max a ∈ A [ Q ( a t = a | s t ) = (cid:80) (cid:126)a Q ( a t = a | (cid:126)a ) Q ( (cid:126)a | s t )] Table 1: Example of an active inference scheme on finite horizonMDPs.
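Before stating the optimality result, here is a minimal sketch of the Table 1 scheme: it scores every action sequence by $\exp(-G)$ and marginalises onto the first action. The toy model is an illustrative assumption, and the expected free energy is computed with the mean-field form (9) used in the earlier sketch.

```python
import numpy as np
from itertools import product

# A sketch of the scheme in Table 1 on a toy MDP (illustrative values).
n_states, n_actions, T = 3, 2, 3
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
R = np.array([0.0, 1.0, 10.0])
lam = 10.0
C = np.exp(lam * R) / np.exp(lam * R).sum()       # preferences C_lambda

def G(seq, s_t):
    """Mean-field expected free energy (9) of an action sequence."""
    q = np.eye(n_states)[s_t]
    g = 0.0
    for a in seq:
        q = q @ P[a]
        g += np.sum(q * (np.log(q + 1e-16) - np.log(C)))
    return g

def select_action(s_t):
    scores = np.zeros(n_actions)
    for seq in product(range(n_actions), repeat=T):
        scores[seq[0]] += np.exp(-G(seq, s_t))    # Q(a_t = a | s_t), unnormalised
    return scores.argmax()

print("action chosen at s_t = 0:", select_action(0))
```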
Theorem 16.
In the zero temperature limit $\lambda \to +\infty$, the state-action policy defined as in Table 1,

$$a_t \in \lim_{\lambda \to +\infty} \arg\max_{a \in A} \sum_{\vec a : (\vec a)_t = a} \exp(-G(\vec a \mid s_t)), \qquad G(\vec a \mid s_t) = D_{KL}[Q(\vec s \mid \vec a, s_t) \,\|\, C_\lambda(\vec s)], \quad (12)$$

is Bellman optimal for the temporal horizon $T = 1$.
When $T = 1$, the only action is $a_0$. We fix an arbitrary initial state $s_0 = s \in S$. By Proposition 10, a Bellman optimal state-action policy is fully characterised by an action $a^*$ that maximises immediate reward:

$$a^* \in \arg\max_{a \in A} \mathbb{E}[R(s_1) \mid s_0 = s, a_0 = a].$$

Recall that by Remark 6, this expectation stands for return under the transition probabilities of the MDP:

$$a^* \in \arg\max_{a \in A} \mathbb{E}_{P(s_1 \mid a_0 = a, s_0 = s)}[R(s_1)].$$

Since transition probabilities are assumed to be known (5), this reads

$$a^* \in \arg\max_{a \in A} \mathbb{E}_{Q(s_1 \mid a_0 = a, s_0 = s)}[R(s_1)].$$

On the other hand,

$$a_0 \in \lim_{\lambda \to +\infty} \arg\max_{a \in A} \exp(-G(a \mid s_0)) = \lim_{\lambda \to +\infty} \arg\min_{a \in A} G(a \mid s_0).$$

By Lemma 15, this implies $a_0 \in \arg\max_{a \in A} \mathbb{E}_{Q(s_1 \mid a_0 = a, s_0 = s)}[R(s_1)]$, which concludes the proof.

This scheme falls short in terms of Bellman optimality with a planning horizon of $T > 1$; this rests upon the fact that it does not coincide with backward induction. Recall that backward induction offers a complete description of Bellman optimal state-action policies (Proposition 11). In other words, active inference plans by adding weighted expected free energies of each possible future course of action, as opposed to only considering those future courses of action that will subsequently minimise expected free energy, given subsequently encountered states.

4.3 Bellman optimality on finite temporal horizons

To achieve Bellman optimality on finite temporal horizons, we turn to the expected free energy of an action given future actions that also minimise expected free energy. To do this, we can write the expected free energy recursively, as: the immediate expected free energy, plus the expected free energy that one would obtain by subsequently selecting actions that minimise expected free energy (Friston et al., 2020). The resulting scheme consists of minimising an expected free energy defined recursively, from the last time step to the current time step. In finite horizon MDPs this reads

$$\begin{aligned}
G(a_{T-1} \mid s_{T-1}) &= D_{KL}[Q(s_T \mid a_{T-1}, s_{T-1}) \,\|\, C_\lambda(s_T)] \\
G(a_\tau \mid s_\tau) &= D_{KL}[Q(s_{\tau+1} \mid a_\tau, s_\tau) \,\|\, C_\lambda(s_{\tau+1})] + \mathbb{E}_{Q(a_{\tau+1}, s_{\tau+1} \mid a_\tau, s_\tau)}[G(a_{\tau+1} \mid s_{\tau+1})], \quad \tau = t, \ldots, T-2,
\end{aligned}$$

where at each time-step, actions are chosen to minimise expected free energy:

$$Q(a_{\tau+1} \mid s_{\tau+1}) > 0 \iff a_{\tau+1} \in \arg\min_{a \in A} G(a \mid s_{\tau+1}). \quad (13)$$

To make sense of this formulation, we unpack the recursion:

$$\begin{aligned}
G(a_t \mid s_t) &= D_{KL}[Q(s_{t+1} \mid a_t, s_t) \,\|\, C_\lambda(s_{t+1})] + \mathbb{E}_{Q(a_{t+1}, s_{t+1} \mid a_t, s_t)}[G(a_{t+1} \mid s_{t+1})] \\
&= D_{KL}[Q(s_{t+1} \mid a_t, s_t) \,\|\, C_\lambda(s_{t+1})] + \mathbb{E}_{Q(a_{t+1}, s_{t+1} \mid a_t, s_t)} D_{KL}[Q(s_{t+2} \mid a_{t+1}, s_{t+1}) \,\|\, C_\lambda(s_{t+2})] + \mathbb{E}_{Q(a_{t+1:t+2}, s_{t+1:t+2} \mid a_t, s_t)}[G(a_{t+2} \mid s_{t+2})] \\
&= \ldots = \sum_{\tau=t}^{T-1} \mathbb{E}_{Q(\vec a, \vec s \mid a_t, s_t)} D_{KL}[Q(s_{\tau+1} \mid a_\tau, s_\tau) \,\|\, C_\lambda(s_{\tau+1})] \\
&= \mathbb{E}_{Q(\vec a, \vec s \mid a_t, s_t)} D_{KL}[Q(\vec s \mid \vec a, s_t) \,\|\, C_\lambda(\vec s)], \quad (14)
\end{aligned}$$

which shows that this expression is exactly the expected free energy under action $a_t$, if one is to pursue future actions that minimise expected free energy (13). The crucial improvement over the vanilla active inference scheme is that planning is now performed based on subsequent counterfactual actions that minimise expected free energy, as opposed to considering all future courses of action.
Translating this into the language of state-action policies yields, $\forall s \in S$:

$$\begin{aligned}
a_{T-1}(s) &\in \arg\min_{a \in A} G(a \mid s_{T-1} = s) \\
a_{T-2}(s) &\in \arg\min_{a \in A} G(a \mid s_{T-2} = s) \\
&\;\;\vdots \\
a_1(s) &\in \arg\min_{a \in A} G(a \mid s_1 = s) \\
a_0(s) &\in \arg\min_{a \in A} G(a \mid s_0). \quad (15)
\end{aligned}$$

Perceptual inference: $Q(s_{\tau+1} \mid a_\tau, s_\tau) = P(s_{\tau+1} \mid a_\tau, s_\tau)$
Planning as inference: $G(a_\tau \mid s_\tau) = D_{KL}[Q(s_{\tau+1} \mid a_\tau, s_\tau) \,\|\, C_\lambda(s_{\tau+1})] + \mathbb{E}_{Q(a_{\tau+1}, s_{\tau+1} \mid a_\tau, s_\tau)}[G(a_{\tau+1} \mid s_{\tau+1})]$
Decision-making: $Q(a_\tau \mid s_\tau) > 0 \iff a_\tau \in \arg\min_{a \in A} G(a \mid s_\tau)$
Action selection: $a_t \sim Q(a_t \mid s_t)$

Table 2: Alternative active inference scheme on finite horizon MDPs.

Equation (15) is strikingly similar to the backward induction algorithm (Proposition 11), and indeed we recover backward induction in the limit $\lambda \to +\infty$.

Theorem 17 (Backward induction as active inference). In the zero temperature limit $\lambda \to +\infty$, the scheme of Table 2,

$$Q(a_\tau \mid s_\tau) > 0 \iff a_\tau \in \lim_{\lambda \to +\infty} \arg\min_{a \in A} G(a \mid s_\tau), \qquad G(a_\tau \mid s_\tau) = D_{KL}[Q(s_{\tau+1} \mid a_\tau, s_\tau) \,\|\, C_\lambda(s_{\tau+1})] + \mathbb{E}_{Q(a_{\tau+1}, s_{\tau+1} \mid a_\tau, s_\tau)}[G(a_{\tau+1} \mid s_{\tau+1})], \quad (16)$$

satisfies the backward induction relation (4). Therefore, it is Bellman optimal on any finite temporal horizon. Furthermore, if there are multiple action choices that maximise future reward, that which is selected by active inference also maximises $a \mapsto \mathrm{H}[Q(\vec s \mid \vec a, a_0 = a, s_0)]$.

Remark 18.
Note that, again, if there are multiple action choices that maximise future reward, theaction selected by active inference also maximises the expected entropy of future states—that is, thechosen action can be thought of as also "keeping one’s options open" (Klyubin et al., 2008) in thesense that the agent commits the least to a specified sequence of states.
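Before turning to the proof, here is a minimal sketch of the recursive scheme of Table 2 / equation (16): the expected free energy of each action is computed backwards from the horizon, assuming subsequent actions also minimise expected free energy. The toy model, and a large but finite $\lambda$ standing in for the zero temperature limit, are illustrative assumptions.

```python
import numpy as np

# Recursive expected free energy (Table 2 / equation (16)) on a toy MDP.
n_states, n_actions, T = 3, 2, 4
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
R = np.array([0.0, 1.0, 10.0])
lam = 50.0                                   # large lambda approximates the limit
C = np.exp(lam * (R - R.max()))
C /= C.sum()

def kl_to_C(q):
    return np.sum(q * (np.log(q + 1e-16) - np.log(C)))

# G_next[s] = min_a G(a | s_{t+1} = s); zero at the horizon, so the last
# sweep reduces to the pure KL term, as in the first line of the recursion.
G_next = np.zeros(n_states)
policy = np.zeros((T, n_states), dtype=int)
for t in reversed(range(T)):
    G_t = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            q = P[a, s]                      # Q(s_{t+1} | a_t = a, s_t = s)
            G_t[s, a] = kl_to_C(q) + q @ G_next
    policy[t] = G_t.argmin(axis=1)           # EFE-minimising action per state
    G_next = G_t.min(axis=1)

print("recursive-EFE policy at t = 0:", policy[0])
```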
Proof
We prove this result by induction on the temporal horizon $T$ of the MDP. The proof of the Theorem when $T = 1$ can be seen from the proof of Theorem 16. Now suppose that $T > 1$ is finite and that the Theorem holds for MDPs with a temporal horizon of $T - 1$. $Q(a_\tau \mid s_\tau)$, as defined in (16), is a Bellman optimal state-action policy on the MDP restricted to times $\tau = 1, \ldots, T$ by induction. Therefore, by Proposition 10, we only need to show that the action $a_0$ selected under active inference satisfies

$$a_0 \in \arg\max_{a \in A} \mathbb{E}_Q[R(\vec s) \mid s_0, a_0 = a].$$

This is simple to show, as:

$$\begin{aligned}
\arg\max_{a \in A} \mathbb{E}_Q[R(\vec s) \mid s_0, a_0 = a]
&= \arg\max_{a \in A} \mathbb{E}_{P(\vec s \mid a_{1:T-1}, a_0 = a, s_0)\, Q(\vec a \mid s_{1:T-1})}[R(\vec s)] \quad \text{(by Remark 6)} \\
&= \arg\max_{a \in A} \mathbb{E}_{Q(\vec s, \vec a \mid a_0 = a, s_0)}[R(\vec s)] \quad \text{(as the transitions are known)} \\
&= \lim_{\lambda \to +\infty} \arg\max_{a \in A} \mathbb{E}_{Q(\vec s, \vec a \mid a_0 = a, s_0)}[\lambda R(\vec s)] \\
&\supseteq \lim_{\lambda \to +\infty} \arg\max_{a \in A} \mathbb{E}_{Q(\vec s, \vec a \mid a_0 = a, s_0)}[\lambda R(\vec s)] - \mathrm{H}[Q(\vec s \mid \vec a, a_0 = a, s_0)] \\
&= \lim_{\lambda \to +\infty} \arg\min_{a \in A} \mathbb{E}_{Q(\vec s, \vec a \mid a_0 = a, s_0)}[-\log C_\lambda(\vec s)] - \mathrm{H}[Q(\vec s \mid \vec a, a_0 = a, s_0)] \quad \text{(by (10))} \\
&= \lim_{\lambda \to +\infty} \arg\min_{a \in A} \mathbb{E}_{Q(\vec s, \vec a \mid a_0 = a, s_0)} D_{KL}[Q(\vec s \mid \vec a, a_0 = a, s_0) \,\|\, C_\lambda(\vec s)] \\
&= \lim_{\lambda \to +\infty} \arg\min_{a \in A} G(a_0 = a \mid s_0) \quad \text{(by (14))}.
\end{aligned}$$

Hence, the action $a_0$ selected under active inference conforms to a Bellman optimal state-action policy on finite temporal horizons. Furthermore, the inclusion follows from the fact that, if there are multiple actions that maximise expected reward, that which is selected under active inference maximises the entropy of beliefs about future states.

4.4 Generalisation to partially observed MDPs

Partially observable Markov decision processes (POMDPs) (Aström, 1965) differ from MDPs only in that the agent observes a modality $o_t$, which carries incomplete information about the current state $s_t$. In POMDPs, under active inference, states are inferred from observations through variational Bayesian inference. Let $\vec s := s_{0:T}$, $\vec a := a_{0:T-1}$ be all states and actions (past, current, and future), let $\tilde o := o_{0:t}$ be the observations available up to time $t$, and let $\vec o := o_{t+1:T}$ be the future observations. The agent has a predictive distribution over states given actions that is continuously updated following new observations:

$$Q(\vec s \mid \vec a, \tilde o) := \prod_{\tau=0}^{T-1} Q(s_{\tau+1} \mid a_\tau, s_\tau, \tilde o).$$

To infer states from observations, the agent engages in variational Bayesian inference—an optimisation procedure over a space of probability distributions $Q(\cdot \mid \vec a, \tilde o)$ called the variational family—to find the variational free energy minimum:

$$\arg\min_Q F_{\vec a}[Q(\vec s \mid \vec a, \tilde o)] = \arg\min_Q D_{KL}[Q(\vec s \mid \vec a, \tilde o) \,\|\, P(\vec s \mid \vec a, \tilde o)], \qquad F_{\vec a}[Q(\vec s \mid \vec a, \tilde o)] := D_{KL}[Q(\vec s \mid \vec a, \tilde o) \,\|\, P(\tilde o, \vec s \mid \vec a)]. \quad (17)$$

Here, $P(\tilde o, \vec s \mid \vec a)$ is a prior that is supplied to the agent (also a POMDP), and $P(\vec s \mid \vec a, \tilde o)$ is the posterior estimate over states (which is usually intractable to compute directly). Note that equipping the agent with a prior has been one of the main difficulties in scaling active inference.
Toward this end, recent research has explored learning the agent's generative model with deep neural networks, allowing enough flexibility for the model to adapt to the environment at hand (Çatal et al., 2020b; Millidge, 2020; Ueltzhöffer, 2018). Crucially, when the free energy minimum (17) is reached, the inference is exact:

$$Q(\vec s \mid \vec a, \tilde o) = P(\vec s \mid \vec a, \tilde o). \quad (18)$$

For numerical tractability, the variational family may be constrained to a parametric family of distributions, in which case equality is not guaranteed: $Q(\vec s \mid \vec a, \tilde o) \approx P(\vec s \mid \vec a, \tilde o)$. In POMDPs, the expected free energy reads (Da Costa et al., 2020):

$$G(\vec a \mid \tilde o) = \underbrace{D_{KL}[Q(\vec s \mid \vec a, \tilde o) \,\|\, C_\lambda(\vec s)]}_{\text{Risk}} + \underbrace{\mathbb{E}_{Q(\vec s \mid \vec a, \tilde o)} \mathrm{H}[P(\vec o \mid \vec s)]}_{\text{Ambiguity}}.$$

The expected free energy is the quantity that agents need to optimise in order to remain at steady-state (Pavliotis, 2014), which means that the states that they visit are distributed according to $C_\lambda$ (see (Da Costa et al., 2020, Appendix B)). The expected free energy has an important historical pedigree in terms of the quantities that it subsumes, and has been carved in different ways throughout the literature to showcase its various interpretations (Da Costa et al., 2020; Friston et al., 2017a; Millidge et al., 2020a; Schwartenbeck et al., 2019).

Note that the expected free energy for POMDPs is the expected free energy for MDPs plus an extra term called ambiguity. This ambiguity term accommodates the uncertainty implicit in partially observed problems. The reason that this resulting functional is called expected free energy is because it comprises a relative entropy and expected energy; namely, the first (risk) and second (ambiguity) terms, respectively. By analogy with variational free energy, the risk corresponds to expected complexity cost and ambiguity corresponds to expected inaccuracy. See Figure 3 for details.

Crucially, in the limit $\lambda \to +\infty$, and provided that the free energy minimum is reached (18), all of our Bellman optimality results translate to the POMDP case.

Proposition 19 (Reward maximisation on POMDPs). In POMDPs, provided that the free energy minimum is reached (18), the sequence of actions that minimises expected free energy also maximises expected reward in the limiting case $\lambda \to +\infty$ (11):

$$\lim_{\lambda \to +\infty} \arg\min_{\vec a} G(\vec a \mid \tilde o) \subseteq \arg\max_{\vec a} \mathbb{E}_{Q(\vec s \mid \vec a, \tilde o)}[R(\vec s)].$$

Furthermore, when $\lambda \to +\infty$, the action sequences that minimise expected free energy: 1) maximise expected reward, and 2) maximise $\mathrm{H}[Q(\vec s \mid \vec a, \tilde o)] - \mathbb{E}_{Q(\vec s \mid \vec a, \tilde o)}[\mathrm{H}[P(\vec o \mid \vec s)]]$, that is, the entropy of future states minus the (expected) entropy of outcomes given states.
Proof [Proof of Proposition 19]
$$\begin{aligned}
\lim_{\lambda \to +\infty} \operatorname*{arg\,min}_{\vec{a}} G(\vec{a} \mid \tilde{o})
&= \lim_{\lambda \to +\infty} \operatorname*{arg\,min}_{\vec{a}} \; D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{a}, \tilde{o}) \,\|\, C_{\lambda}(\vec{s})] + \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})} \mathrm{H}[P(\vec{o} \mid \vec{s})] \\
&= \lim_{\lambda \to +\infty} \operatorname*{arg\,min}_{\vec{a}} \; -\mathrm{H}[Q(\vec{s} \mid \vec{a}, \tilde{o})] + \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[-\log C_{\lambda}(\vec{s})] + \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})} \mathrm{H}[P(\vec{o} \mid \vec{s})] \\
&= \lim_{\lambda \to +\infty} \operatorname*{arg\,min}_{\vec{a}} \; -\mathrm{H}[Q(\vec{s} \mid \vec{a}, \tilde{o})] - \lambda \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[R(\vec{s})] + \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})} \mathrm{H}[P(\vec{o} \mid \vec{s})] \quad \text{(by (10))} \\
&\subseteq \lim_{\lambda \to +\infty} \operatorname*{arg\,max}_{\vec{a}} \; \lambda \, \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[R(\vec{s})] \\
&= \operatorname*{arg\,max}_{\vec{a}} \; \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[R(\vec{s})].
\end{aligned}$$
The inclusion follows from the fact that, as $\lambda \to +\infty$, a minimiser of the expected free energy has first and foremost to maximise $\mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[R(\vec{s})]$. Among such action sequences, the expected free energy minimisers are those that maximise the entropy of (beliefs about) future states $\mathrm{H}[Q(\vec{s} \mid \vec{a}, \tilde{o})]$ and resolve ambiguity about future outcomes by minimising $\mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})} \mathrm{H}[P(\vec{o} \mid \vec{s})]$.

We have seen in Proposition 19 that, among those action sequences that maximise reward, those chosen by active inference maximise
$$\underbrace{\mathrm{H}[Q(\vec{s} \mid \vec{a}, \tilde{o})]}_{\text{Entropy of future states}} - \underbrace{\mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[\mathrm{H}[P(\vec{o} \mid \vec{s})]]}_{\text{Expected entropy of observations given states}},$$
which means that they maximise the number of future states that can be visited, and minimise ambiguity by ensuring that their expected observations carry as much information as possible about their future states, so that these may be inferred with higher accuracy.

In addition, the schemes of Tables 1 & 2 exist in the POMDP setting (e.g., (Da Costa et al., 2020; Friston et al., 2020)), and the results of Theorems 16 & 17 translate to this setting as well, under the condition that the free energy minimum is reached (18). Of course, when $Q(\vec{s} \mid \vec{a}, \tilde{o}) \approx P(\vec{s} \mid \vec{a}, \tilde{o})$, the extent to which Proposition 19 holds depends entirely on the goodness of the approximation.

In particular, consider an agent in a partially observed environment. The limit $\lambda \to +\infty$, which turns the active inference agent into fully exploitative mode, is only appropriate for gathering maximum reward once an environment has been sufficiently explored, that is, once the agent may infer states from observations with high accuracy. More generally, in environments that need to be learned, it is preferable to consider finite values of $\lambda$.

Figure 3: Expected free energy. This figure illustrates the various ways in which minimising expected free energy may be unpacked. The upper panel casts perception and action as the minimisation of variational and expected free energy, respectively. Crucially, expected free energy subsumes several special cases that predominate in the psychological, machine learning, and economics literature. These special cases are disclosed when one removes particular sources of uncertainty from the implicit optimisation problem. For example, if we ignore prior preferences, then the expected free energy reduces to information gain (Lindley, 1956; MacKay, 2003) or intrinsic motivation (Barto et al., 2013; Deci and Ryan, 1985; Oudeyer and Kaplan, 2009). This is mathematically equivalent to the expected Bayesian surprise and mutual information that underwrite salience in visual search (Itti and Baldi, 2009; Sun et al., 2011) and the organisation of our visual apparatus (Barlow, 1961, 1974; Linsker, 1990; Optican and Richmond, 1987). If we now reinstate prior preferences but remove risk, we can treat hidden and observed (sensory) states as isomorphic. This leads to risk-sensitive state-action policies in economics (Fleming and Sheu, 2002; Kahneman and Tversky, 1988) or KL control in engineering (van den Broek et al., 2010). Here, minimising risk corresponds to aligning predicted outcomes to preferred outcomes. If we then remove ambiguity and the relative risk of action (i.e., intrinsic value), we are left with expected utility in economics (Von Neumann and Morgenstern, 1944), which underwrites RL and behavioural psychology (Barto and Sutton, 1992). Bayesian formulations of maximising expected utility under uncertainty are also the basis of Bayesian decision theory (Berger, 1985). Finally, if we only consider a fully observed environment with uninformative priors, minimising expected free energy corresponds to a maximum entropy principle over future states (Jaynes, 1957a,b). Note that here $C(o)$ denotes the preferences over observations derived from the preferences over states. These are related by the compatibility relation $C(s)P(o \mid s) = C(o)P(s \mid o)$.

Remark 20 (Explore-exploit). The expected free energy is an objective that subsumes exploratory and exploitative behaviour (Friston et al., 2017a; Schwartenbeck et al., 2019; Tschantz et al., 2020b), and which may offer a principled balance between both imperatives. The relationship between the exploration-exploitation trade-off afforded by active inference and Bayes-adaptive RL (Guez et al., 2013a,b; Ross et al., 2008; Zintgraf et al., 2020) remains to be explored.
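To make the risk–ambiguity decomposition and the zero-temperature limit of Proposition 19 concrete, the following minimal sketch evaluates the expected free energy on a toy discrete POMDP. It is illustrative only: the likelihood matrix `A`, the reward function `R`, and the per-action predictive posteriors are hypothetical stand-ins for quantities that would ordinarily come from the generative model and variational inference.

```python
import numpy as np

def expected_free_energy(q_s, A, log_C):
    """G(a | o) = risk + ambiguity for one action's predictive posterior.

    q_s   : (S,) predicted distribution over states under an action sequence
    A     : (O, S) likelihood matrix; column s holds P(o | s)
    log_C : (S,) log of the target distribution C_lambda(s)
    """
    risk = np.sum(q_s * (np.log(q_s + 1e-16) - log_C))  # KL[Q(s|a,o) || C_lambda(s)]
    H_A = -np.sum(A * np.log(A + 1e-16), axis=0)        # H[P(o|s)] for each state s
    ambiguity = q_s @ H_A                               # E_Q H[P(o|s)]
    return risk + ambiguity

# Toy problem: 3 states, reward concentrated on state 2, two candidate actions.
R = np.array([0.0, 0.0, 1.0])
A = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.8, 0.2],
              [0.1, 0.1, 0.6]])
q_per_action = {"a0": np.array([0.1, 0.2, 0.7]),  # predicted states under each action
                "a1": np.array([0.4, 0.4, 0.2])}

for lam in [0.1, 1.0, 10.0, 100.0]:
    # Target distribution C_lambda(s) proportional to exp(lambda * R(s)), cf. (10)-(11).
    log_C = lam * R - np.log(np.sum(np.exp(lam * R)))
    G = {a: expected_free_energy(q, A, log_C) for a, q in q_per_action.items()}
    best = min(G, key=G.get)
    print(f"lambda={lam:6.1f}  G(a0)={G['a0']:.3f}  G(a1)={G['a1']:.3f}  argmin G = {best}")
```

As $\lambda$ grows, the $-\lambda \mathbb{E}_Q[R(\vec{s})]$ contribution dominates the risk term, so the expected free energy minimiser coincides with the expected reward maximiser, with the entropy and ambiguity terms only breaking ties, exactly as Proposition 19 describes.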
5. Discussion
In this paper, we have examined the relationship between active inference and dynamic programming. We have discussed a particular notion of optimality—the Bellman optimality principle of maximising a reward function—and have shown that active inference can be Bellman optimal. Indeed, in the limiting case where the steady-state or target distribution in active inference concentrates its mass on reward-maximising trajectories, active inference maximises reward. In particular, different levels of Bellman-optimal performance can be reached depending on the particular active inference scheme that is used (see Theorems 16 and 17). Interestingly, we have also shown that, in this limiting case, one (sophisticated) active inference scheme reduces to the backward induction algorithm from dynamic programming (Theorem 17). These results highlight important relationships between active inference and dynamic programming, as well as conditions under which they would and would not be expected to behave differently (e.g., environments with multiple reward-maximising trajectories, environments where performance would benefit from directed exploratory drives, environments with sparse rewards (Tschantz et al., 2020a), etc.). Yet, it is important to note that exact differences in the performance of any active inference and RL schemes will likely remain an empirical question, which will need to be investigated through comparative simulation on different environments. (See Appendix A for a brief discussion of how reward learning can be practically implemented within active inference schemes and how this relates to representative reward learning approaches in RL.)

These results build on previous work addressing other points of contact between active inference and RL, which have to date remained mostly qualitative. For example, a few studies previously demonstrated how active inference approaches can obtain similar performance to RL models on the mountain-car problem—a benchmark problem in dynamic programming (Çatal et al., 2019; Friston et al., 2009; Ueltzhöffer, 2018). More recent simulations have also compared active inference models to representative model-based and model-free RL algorithms. For example, in the "FrozenLake" OpenAI Gym environment, Sajid et al. (2020) found similar performance in static environments but superior performance by active inference in changing environments. Another study used a modified expected free energy functional (the free energy of the expected future (Millidge et al., 2020a)) and showed robust performance on several other challenging benchmark RL tasks (Tschantz et al., 2020a). Neuroscience-related work using active inference has also considered the classic interpretation of dopamine responses in the brain as reward prediction error signals (Bayer and Glimcher, 2005) and demonstrated the viability of an alternative perspective—that dopamine may encode expected confidence in policies that generate preferred outcomes (FitzGerald et al., 2015; Friston et al., 2012b; Schwartenbeck et al., 2015a). Our results here may offer insights about the underlying basis of these previous interpretations.

Beyond RL, an important question in decision neuroscience is whether human decisions in fact maximise reward, minimise expected free energy, or optimise some other objective. This question can be addressed by comparing the evidence for different models based on their fit to empirical data (e.g., see (Smith et al., 2020a,b,d)).
Current empirical evidence suggests that humans are not purely reward-maximising agents, and that they also engage in both random and goal-directed exploration (Daw et al., 2006; Mirza et al., 2018; Wilson et al., 2014) and keep their options open (Schwartenbeck et al., 2015b). Note that behavioural evidence favouring models that do not solely maximise reward within reward maximisation tasks (i.e., where "maximise reward" is the explicit instruction) is not a contradiction. Rather, exploratory decisions can help to gather more reward in the long run (e.g., see (Cullen et al., 2018; Sajid et al., 2020)). Conversely, it is worth noting that some popular RL algorithms also follow a maximum entropy principle (or cast RL as inference); i.e., they go beyond reward maximisation (Eysenbach and Levine, 2019; Haarnoja et al., 2017, 2018; Levine, 2018; Todorov, 2008; Ziebart et al., 2008). For example, the model-free Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018) maximises both expected reward and entropy, and outperforms other state-of-the-art algorithms in continuous control environments. This has been shown to be more sample efficient than its purely reward-maximising counterparts (Haarnoja et al., 2018), and it can solve a broad class of control problems (Eysenbach and Levine, 2019). Formally, these methods differ from active inference in the way the target distribution is factorised, with conditional independence between actions and outcomes—such that, unlike in active inference, the mutual information between them is not maximised. However, empirical comparisons between these methods and active inference (to evaluate differences in the agents' epistemic drives) remain an interesting avenue for research.

When comparing RL and active inference approaches generally, one outstanding issue for active inference is scaling to more complex problems (Çatal et al., 2020a,b; Millidge, 2020; Tschantz et al., 2019, 2020a). This is because planning ahead by evaluating all or many possible sequences of actions is computationally prohibitive in most real-world situations. To the best of our knowledge, there are three complementary approaches to solving this problem: 1) employing hierarchical generative models that factorise decisions into multiple levels and reduce the size of the decision tree by orders of magnitude (Friston et al., 2018), 2) efficiently searching the decision tree using algorithms like Monte Carlo tree search (Coulom, 2006; Fountas et al., 2020; Silver et al., 2016), and 3) amortising planning using artificial neural networks (Çatal et al., 2020a). Another issue rests upon the fact that active inference is a Bayesian scheme: it needs to optimise the free energy over a space of generative models. For toy problems, a flexible class of models can easily be crafted by hand; however, for larger problems, more work needs to be done on plausible algorithms for learning the structure of generative models themselves (Gershman and Niv, 2010; Smith et al., 2020c; Tervo et al., 2016). This is an important research avenue in generative modelling, called structure learning, which consists of finding the model with the highest evidence given the available data (Gershman and Niv, 2010; Tervo et al., 2016). Currently, a popular approach to this problem is the use of generative adversarial networks (GANs) (Cisse et al., 2017; Genevay et al., 2018; Goodfellow et al., 2014); however, working with GANs in this context is difficult, as one then needs to derive appropriate rules to perform inference on the model.
Note that these issues are not unique to active inference. Model-based RL algorithms deal with the same combinatorial explosion when evaluating deep decision trees, which is one primary motivation for developing efficient model-free RL algorithms. However, other heuristics have also been developed for efficiently searching and pruning decision trees in model-based RL that can account for human behaviour (e.g., see (Huys et al., 2012; Lally et al., 2017)), each with their own costs and benefits. RL may have much to offer active inference in terms of efficient implementation and the identification of methods to scale to more complex, real-world problems.

Active inference does offer several advantages, however. As we have shown, it affords greater generality when modelling behaviour, and it subsumes the dynamic programming foundations of control and RL on finite horizon tasks. More specific advantages of active inference include: 1) it can accommodate deep hierarchical models in discrete and continuous state-spaces (Buckley et al., 2017; Da Costa et al., 2020; Friston et al., 2018); 2) all processes, including perception, planning, learning, and motor control, can be formulated and unified as inference problems (Adams et al., 2013a; Attias, 2003; Botvinick and Toussaint, 2012; Kappen et al., 2012; Rawlik et al., 2013); 3) the expected free energy effectively addresses the explore-exploit dilemma and confers the agent with artificial curiosity (Friston et al., 2017b; Schmidhuber, 2010; Schwartenbeck et al., 2019; Still and Precup, 2012), as opposed to requiring the addition of ad hoc information bonus terms (Tokic and Palm, 2011); and 4) the expected free energy has further unifying potential, in that it subsumes many other constructs used within decision-making approaches in the physical, engineering, and life sciences (see Figure 3 and (Da Costa et al., 2020; Friston et al., 2020)).

Finally, active inference allows one to move beyond the state-action policies that predominate in traditional RL, furnishing a uniform account of state-action policies and sequential policy optimisation. In sequential policy optimisation, one relaxes the assumption that the same action is optimal given a particular state, and acknowledges that the sequential order of actions may matter. This is similar to the linearly solvable MDP formulation presented by Todorov (2007, 2009), where transition probabilities directly determine actions and an optimal policy specifies transitions that minimise some divergence cost. This way of approaching policies is perhaps most apparent in terms of exploration and foraging. Put simply, it is clearly better to explore and then exploit than to exploit and then explore. Because expected free energy is a functional of belief states, novelty and exploration become an integral part of optimisation (by reducing uncertainty)—in contrast with traditional RL schemes that try to optimise a reward function of states. In other words, active inference agents will always explore until uncertainty is resolved, after which reward-maximising, goal-seeking imperatives start to predominate.

Such advantages should motivate future research to better characterise the tasks in which these properties may offer the most useful advantages—such as contexts where performance benefits from learning and planning at multiple temporal scales, and from the ability to select policies that resolve both state and parameter uncertainty.
6. Conclusion
In summary, we have shown that on finite horizon MDPs and POMDPs, in the zero-temperature limit and under the assumption that the preferences of the agent are to maximise reward:

1. Active inference selects reward-maximising actions that are Bellman optimal. In addition, when there are multiple reward-maximising actions, active inference agents will select the action that maximises the entropy of future states (a maximum entropy principle) and minimises the ambiguity associated with future observations.

2. We recover the well-known backward induction algorithm from dynamic programming with a particular (sophisticated) active inference scheme.

Thus, dynamic programming on finite temporal horizons, and in discrete space-time, is a limiting case of active inference.
Acknowledgements
The authors thank Dimitrije Markovic and Quentin Huys for providing constructive feedback during the preparation of the manuscript.
Funding information
LD is supported by the Fonds National de la Recherche, Luxembourg (Project code: 13568875). NS is funded by the Medical Research Council (MR/S502522/1). KF is funded by a Wellcome Trust Principal Research Fellowship (Ref: 088130/Z/09/Z). RS is supported by the Stewart G. Wolf Fellowship and the William K. Warren Foundation.
Author contributions
LD: conceptualisation, proofs, writing – first draft, review and editing. NS, TP, KF, RS: conceptualisation, writing – review and editing.

Appendix A. Reward learning
Given the focus on relating active inference to the RL objective of maximising reward, it is worth briefly illustrating how active inference can learn the reward function from data, and its potential connections to representative RL approaches. Active inference can learn a reward function, just as any RL algorithm can. To do this in practice, one common approach (e.g., (Smith et al., 2020d)) is to set the preferences to be on observations rather than states, which corresponds to assuming that the inference is good enough:
$$\underbrace{D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{a}, \tilde{o}) \,\|\, C(\vec{s})]}_{\text{Risk (states)}} = \underbrace{D_{\mathrm{KL}}[Q(\vec{o} \mid \vec{a}, \tilde{o}) \,\|\, C(\vec{o})]}_{\text{Risk (outcomes)}} + \underbrace{\mathbb{E}_{Q(\vec{o} \mid \vec{a}, \tilde{o})}\left[D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{o}, \tilde{o}, \vec{a}) \,\|\, P(\vec{s} \mid \vec{o})]\right]}_{\approx\, 0} \approx \underbrace{D_{\mathrm{KL}}[Q(\vec{o} \mid \vec{a}, \tilde{o}) \,\|\, C(\vec{o})]}_{\text{Risk (outcomes)}},$$
and equality holds whenever the free energy minimum is reached (18). One then sets the preference distribution such that the observations designated as rewards are most preferred. In the zero-temperature limit (11), preferences only assign mass to reward-maximising observations. Note that, when formulated in this way, the reward signal is treated as sensory data, as opposed to a separate signal from the environment.

When one sets the allowable actions (controllable state transitions) to be fully deterministic, such that the selection of each action will transition the agent to a given state with certainty, the emerging dynamics are such that the agent chooses actions to resolve uncertainty about the probability of observing reward under each state. Thus, learning the reward probabilities of available actions amounts to learning the likelihood matrix $P(o_t \mid s_t) := o_t \cdot A s_t$, where $A$ is a stochastic matrix. This is done by setting a prior $a$ over $A$, i.e., a matrix of non-negative components, the columns of which are Dirichlet priors over the columns of $A$. The agent then learns by accumulating Dirichlet parameters. Explicitly, at the end of a trial or episode, one sets
$$a \leftarrow a + \sum_{\tau=0}^{T} o_\tau \otimes Q(s_\tau \mid o_{0:T}). \tag{19}$$
In (19), $Q(s_\tau \mid o_{0:T})$ is seen as a vector of probabilities over the state-space $S$, corresponding to the probability of having been in one or another state at time $\tau$ after having gathered observations throughout the trial (see (Da Costa et al., 2020; Friston et al., 2016) for a derivation of this rule). This rule simply amounts to counting observed state-outcome pairs (state-reward pairs when the observation modalities correspond to reward).

One should not conflate this approach with the naive update rule consisting of accumulating state-observation counts in the likelihood matrix,
$$A \leftarrow A + \sum_{\tau=0}^{T} o_\tau \otimes Q(s_\tau \mid o_{0:T}), \tag{20}$$
and then normalising its columns to sum to one when computing probabilities. The latter simply approximates the likelihood matrix $A$ by accumulating the number of observed state-outcome pairs. This is distinct from the approach outlined above, which encodes uncertainty over the matrix $A$ as a probability distribution over possible distributions $P(o_t \mid s_t)$. The agent is initially very unconfident about $A$, which means that it does not place high probability mass on any particular specification of $P(o_t \mid s_t)$.
This uncertainty is gradually resolved by observing state-observation (or state-reward) pairs. Computationally, it is a general fact of Dirichlet priors that an increase in the elements of $a$ causes the entropy of $P(o_t \mid s_t)$ to decrease. As the terms added in (19) are always non-negative, one choice of distribution $P(o_t \mid s_t)$ is ultimately singled out (the one that best matches the available data and prior beliefs). In other words, the likelihood mapping is learned.

Note that, as always when working with partially observed environments, we cannot guarantee that the true likelihood mapping will be learned in practical applications (see (Smith et al., 2019) for examples of where, although not in an explicit reward-learning context, learning the likelihood can be more or less successful in different situations). Learning the true likelihood fails when the inference over states is inaccurate, e.g., when using too severe mean-field approximations to the free energy (Blei et al., 2017; Parr et al., 2019; Tanaka, 1999), which causes the agent to misinfer states and thereby accumulate Dirichlet parameters in the wrong places. Intuitively, this amounts to jumping to conclusions too quickly.
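As an illustration of the accumulation rule (19), the following sketch learns a likelihood matrix from simulated trials. Everything in it is a hypothetical stand-in: the true likelihood `A_true`, the uniformly sampled state sequences, and, in particular, the smoothed posterior $Q(s_\tau \mid o_{0:T})$, which is idealised here as a delta on the true state (i.e., exact inference).

```python
import numpy as np

rng = np.random.default_rng(0)

S, O, T = 3, 3, 50
A_true = np.array([[0.90, 0.10, 0.30],   # true likelihood P(o|s);
                   [0.05, 0.80, 0.30],   # columns sum to one
                   [0.05, 0.10, 0.40]])

a = np.ones((O, S))                      # flat Dirichlet prior over the columns of A

for episode in range(20):
    # Hypothetical trial: states drawn uniformly, outcomes from the true likelihood.
    states = rng.integers(0, S, size=T)
    outcomes = np.array([rng.choice(O, p=A_true[:, s]) for s in states])

    for t in range(T):
        o_vec = np.eye(O)[outcomes[t]]   # one-hot outcome o_tau
        q_s = np.eye(S)[states[t]]       # stand-in for Q(s_tau | o_{0:T}): a delta
        a += np.outer(o_vec, q_s)        # rule (19): accumulate Dirichlet counts

A_est = a / a.sum(axis=0, keepdims=True)  # posterior mean of the likelihood
print(np.round(A_est, 2))
```

With exact (delta) posteriors, the Dirichlet rule (19) and the naive count rule (20) produce the same point estimate; the difference is that the Dirichlet parameters $a$ retain a full distribution over $A$, whose entropy shrinks as counts accumulate, and it is precisely this parameter uncertainty that drives ambiguity-resolving, exploratory behaviour.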
Remark 21. It is also worth noting that reward learning in active inference can be equivalently formulated as learning transition probabilities $P(s_{t+1} \mid s_t, a_t)$. In this alternative setup (e.g., as exemplified in (Sales et al., 2019)), the mappings between reward states and reward outcomes in $A$ are set as identity matrices, and the agent instead learns the probability of transitioning to states that deterministically generate preferred (rewarding) observations given the choice of each policy. Note that, barring any additional sources of state uncertainty in a task, when formulated in this way one could also use an MDP instead of a POMDP, with a preference distribution instead specified over states and no need to include any state-outcome uncertainty. The transition probabilities of the model are learned in a similar fashion as above (19), by accumulating counts on a Dirichlet prior over $P(s_{t+1} \mid s_t, a_t)$. See (Da Costa et al., 2020, Appendix A) for details.

It is also worth briefly noting some connections to other common RL algorithms. For example, the naive update rule consisting of accumulating state-observation counts in the likelihood matrix (20) (i.e., not incorporating Dirichlet priors) is analogous to off-policy learning in Q-learning. In Q-learning, the objective is to find the best action given the current observed state. For this, the Q-learning agent accumulates values for state-action pairs through repeated observation of rewarding or punishing action outcomes—much like state-observation counts. This allows it to learn the Q-value function that defines a reward-maximising policy.

Given the Bayesian, model-based foundations of active inference, more direct links can be made between the active inference approach to reward learning described above and other Bayesian model-based RL approaches. For such links to be realised, the Bayesian RL agent would be required to have a prior over a prior (e.g., a prior over the reward function prior or transition function prior). One way to implicitly incorporate this is through Thompson sampling (Ghavamzadeh et al., 2016; Russo and Van Roy, 2014, 2016; Russo et al., 2017). Specifically, Thompson sampling provides a way to maintain an appropriate balance between exploiting what is known to maximise immediate performance and accumulating new information that may improve future performance (Russo et al., 2017). It does this by specifying a distribution over a particular function, parameterised by a prior distribution over it. This reduces to optimising the dual objectives of reward maximisation and information gain, which is similar to active inference for reward maximisation (Section 4). Empirically, Sajid et al. (2020) have demonstrated that a Bayesian model-based RL agent using Thompson sampling and an active inference agent exhibit similar behaviour when preferences are defined as a function over outcomes. They also highlighted that, by completely removing the reward signal from the environment, the two agents both select policies that maximise some form of information gain. Whilst not the focus of this paper, future work could further elucidate the formal links between active inference and Bayesian model-based RL schemes.
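For intuition, here is a generic Beta–Bernoulli Thompson sampling loop of the kind alluded to above. It is a textbook sketch under assumed bandit dynamics, not code from any of the cited studies; the reward probabilities `p_true` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

p_true = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli reward probabilities
n_arms = len(p_true)
alpha = np.ones(n_arms)              # Beta(1, 1) prior over each arm's reward rate
beta = np.ones(n_arms)

for t in range(1000):
    theta = rng.beta(alpha, beta)    # sample one model of the world from the posterior
    arm = int(np.argmax(theta))      # act greedily with respect to the sampled model
    reward = rng.random() < p_true[arm]
    alpha[arm] += reward             # posterior update; exploration fades as
    beta[arm] += 1 - reward          # the posterior concentrates

print("posterior means:", np.round(alpha / (alpha + beta), 2))
```

Sampling a full model from the posterior and acting greedily with respect to that sample automatically balances the two imperatives: arms whose posteriors are still diffuse get chosen often enough to resolve the uncertainty about them, echoing the dual reward-maximisation and information-gain reading above.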
References
Rick A. Adams, Stewart Shipp, and Karl J. Friston. Predictions not commands: Active inference in the motor system. Brain Structure & Function, 218(3):611–643, May 2013a. doi: 10.1007/s00429-012-0475-5.
Rick A. Adams, Klaas Enno Stephan, Harriet R. Brown, Christopher D. Frith, and Karl J. Friston. The Computational Anatomy of Psychosis. Frontiers in Psychiatry, 4, 2013b. doi: 10.3389/fpsyt.2013.00047.
Jerome Adda and Russell W. Cooper. Dynamic Economics: Quantitative Methods and Applications. MIT Press, 2003.
K. J. Åström. Optimal Control of Markov Processes with Incomplete State Information. Journal of Mathematical Analysis and Applications, 10, 1965.
Hagai Attias. Planning by Probabilistic Inference. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, 2003.
H. B. Barlow. Possible Principles Underlying the Transformations of Sensory Messages. The MIT Press, 1961.
H. B. Barlow. Inductive Inference, Coding, Perception, and Language. Perception, 3(2):123–134, June 1974. doi: 10.1068/p030123.
Andrew Barto and Richard Sutton. Reinforcement Learning: An Introduction. 1992.
Andrew Barto, Marco Mirolli, and Gianluca Baldassarre. Novelty or Surprise? Frontiers in Psychology, 4, 2013. doi: 10.3389/fpsyg.2013.00907.
Hannah M. Bayer and Paul W. Glimcher. Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal. Neuron, 47(1):129–141, July 2005. doi: 10.1016/j.neuron.2005.05.020.
Matthew James Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.
Richard E. Bellman and Stuart E. Dreyfus. Applied Dynamic Programming. Princeton University Press, December 2015.
James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer-Verlag, New York, second edition, 1985. doi: 10.1007/978-1-4757-4286-2.
Dimitri P. Bertsekas and Steven E. Shreve. Stochastic Optimal Control: The Discrete Time Case. Athena Scientific, 1996.
Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York, 2006.
David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859–877, April 2017. doi: 10.1080/01621459.2017.1285773.
Matthew Botvinick and Marc Toussaint. Planning as inference. Trends in Cognitive Sciences, 16(10):485–488, October 2012. doi: 10.1016/j.tics.2012.08.006.
Christopher L. Buckley, Chang Sub Kim, Simon McGregor, and Anil K. Seth. The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology, 81:55–79, December 2017. doi: 10.1016/j.jmp.2017.09.004.
Ozan Çatal, Johannes Nauta, Tim Verbelen, Pieter Simoens, and Bart Dhoedt. Bayesian policy selection using active inference. arXiv:1904.08149 [cs], April 2019.
Ozan Çatal, Tim Verbelen, Johannes Nauta, Cedric De Boom, and Bart Dhoedt. Learning Perception and Planning With Deep Active Inference. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3952–3956, May 2020a. doi: 10.1109/ICASSP40776.2020.9054364.
Ozan Çatal, Samuel Wauthier, Tim Verbelen, Cedric De Boom, and Bart Dhoedt. Deep Active Inference for Autonomous Robot Navigation. arXiv:2003.03220 [cs], March 2020b.
Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval Networks: Improving Robustness to Adversarial Examples. In International Conference on Machine Learning, pages 854–863, July 2017.
Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings of Computers and Games 2006. Springer-Verlag, 2006.
Maell Cullen, Ben Davey, Karl J. Friston, and Rosalyn J. Moran. Active Inference in OpenAI Gym: A Paradigm for Computational Investigations Into Psychiatric Illness. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3(9):809–818, September 2018. doi: 10.1016/j.bpsc.2018.06.010.
Lancelot Da Costa, Thomas Parr, Noor Sajid, Sebastijan Veselic, Victorita Neacsu, and Karl Friston. Active inference on discrete state-spaces: A synthesis. arXiv:2001.07203 [q-bio], January 2020.
Nathaniel D. Daw, John P. O'Doherty, Peter Dayan, Ben Seymour, and Raymond J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879, June 2006. doi: 10.1038/nature04766.
Edward Deci and Richard M. Ryan. Intrinsic Motivation and Self-Determination in Human Behavior. Perspectives in Social Psychology. Springer US, 1985. doi: 10.1007/978-1-4899-2271-7.
Benjamin Eysenbach and Sergey Levine. If MaxEnt RL is the answer, what is the question? arXiv preprint arXiv:1910.01913, 2019.
Thomas H. B. FitzGerald, Raymond J. Dolan, and Karl Friston. Dopamine, reward learning, and active inference. Frontiers in Computational Neuroscience, 9, November 2015. doi: 10.3389/fncom.2015.00136.
W. H. Fleming and S. J. Sheu. Risk-sensitive control and an optimal investment model II. The Annals of Applied Probability, 12(2):730–767, May 2002. doi: 10.1214/aoap/1026915623.
Zafeirios Fountas, Noor Sajid, Pedro A. M. Mediano, and Karl Friston. Deep active inference agents using Monte-Carlo methods. arXiv:2006.04176 [cs, q-bio, stat], June 2020.
Karl Friston. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, February 2010. doi: 10.1038/nrn2787.
Karl Friston. A free energy principle for a particular physics. arXiv:1906.10184 [q-bio], June 2019.
Karl Friston, James Kilner, and Lee Harrison. A free energy principle for the brain. Journal of Physiology-Paris, 100(1-3):70–87, July 2006. doi: 10.1016/j.jphysparis.2006.10.001.
Karl Friston, Spyridon Samothrakis, and Read Montague. Active inference and agency: Optimal control without cost functions. Biological Cybernetics, 106(8):523–541, October 2012a. doi: 10.1007/s00422-012-0512-8.
Karl Friston, Thomas FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, John O'Doherty, and Giovanni Pezzulo. Active inference and learning. Neuroscience & Biobehavioral Reviews, 68:862–879, September 2016. doi: 10.1016/j.neubiorev.2016.06.022.
Karl Friston, Thomas FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, and Giovanni Pezzulo. Active Inference: A Process Theory. Neural Computation, 29(1):1–49, January 2017a. doi: 10.1162/NECO_a_00912.
Karl Friston, Lancelot Da Costa, Danijar Hafner, Casper Hesp, and Thomas Parr. Sophisticated Inference. arXiv:2006.04120 [cs, q-bio], June 2020.
Karl J. Friston, Jean Daunizeau, and Stefan J. Kiebel. Reinforcement Learning or Active Inference? PLoS ONE, 4(7):e6421, July 2009. doi: 10.1371/journal.pone.0006421.
Karl J. Friston, Tamara Shiner, Thomas FitzGerald, Joseph M. Galea, Rick Adams, Harriet Brown, Raymond J. Dolan, Rosalyn Moran, Klaas Enno Stephan, and Sven Bestmann. Dopamine, Affordance and Active Inference. PLoS Computational Biology, 8(1), January 2012b. doi: 10.1371/journal.pcbi.1002327.
Karl J. Friston, Marco Lin, Christopher D. Frith, Giovanni Pezzulo, J. Allan Hobson, and Sasha Ondobaka. Active Inference, Curiosity and Insight. Neural Computation, 29(10):2633–2683, October 2017b. doi: 10.1162/neco_a_00999.
Karl J. Friston, Thomas Parr, and Bert de Vries. The graphical brain: Belief propagation and active inference. Network Neuroscience, 1(4):381–414, December 2017c. doi: 10.1162/NETN_a_00018.
Karl J. Friston, Richard Rosch, Thomas Parr, Cathy Price, and Howard Bowman. Deep temporal models and active inference. Neuroscience & Biobehavioral Reviews, 90:486–501, July 2018. doi: 10.1016/j.neubiorev.2018.04.004.
Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, 1991.
Aude Genevay, Gabriel Peyre, and Marco Cuturi. Learning Generative Models with Sinkhorn Divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617, March 2018.
Samuel J. Gershman and Yael Niv. Learning latent structure: Carving nature at its joints. Current Opinion in Neurobiology, 20(2):251–256, April 2010. doi: 10.1016/j.conb.2010.02.008.
Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian reinforcement learning: A survey. arXiv preprint arXiv:1609.04436, 2016.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
A. Guez, D. Silver, and P. Dayan. Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search. Journal of Artificial Intelligence Research, 48:841–883, November 2013a. doi: 10.1613/jair.4117.
Arthur Guez, David Silver, and Peter Dayan. Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. arXiv:1205.3109 [cs, stat], December 2013b.
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.
T. Heskes. Convexity Arguments for Efficient Minimization of the Bethe and Kikuchi Free Energies. Journal of Artificial Intelligence Research, 26:153–190, June 2006. doi: 10.1613/jair.1933.
Quentin J. M. Huys, Neir Eshel, Elizabeth O'Nions, Luke Sheridan, Peter Dayan, and Jonathan P. Roiser. Bonsai Trees in Your Head: How the Pavlovian System Sculpts Goal-Directed Choices by Pruning Decision Trees. PLoS Computational Biology, 8(3):e1002410, March 2012. doi: 10.1371/journal.pcbi.1002410.
Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention. Vision Research, 49(10):1295–1306, May 2009. doi: 10.1016/j.visres.2008.09.007.
E. T. Jaynes. Information Theory and Statistical Mechanics. Physical Review, 106(4):620–630, May 1957a. doi: 10.1103/PhysRev.106.620.
E. T. Jaynes. Information Theory and Statistical Mechanics. II. Physical Review, 108(2):171–190, October 1957b. doi: 10.1103/PhysRev.108.171.
Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An Introduction to Variational Methods for Graphical Models. In Michael I. Jordan, editor, Learning in Graphical Models, pages 105–161. Springer Netherlands, Dordrecht, 1998. doi: 10.1007/978-94-011-5014-9_5.
Daniel Kahneman and Amos Tversky. Prospect Theory: An Analysis of Decision under Risk. In Decision, Probability, and Utility: Selected Readings. Cambridge University Press, New York, NY, US, 1988. doi: 10.1017/CBO9780511609220.014.
Raphael Kaplan and Karl J. Friston. Planning and navigation as active inference. Biological Cybernetics, 112(4):323–343, August 2018. doi: 10.1007/s00422-018-0753-2.
Hilbert J. Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, May 2012. doi: 10.1007/s10994-012-5278-7.
Alexander S. Klyubin, Daniel Polani, and Chrystopher L. Nehaniv. Keep Your Options Open: An Information-Based Driving Principle for Sensorimotor Systems. PLOS ONE, 3(12):e4018, December 2008. doi: 10.1371/journal.pone.0004018.
Níall Lally, Quentin J. M. Huys, Neir Eshel, Paul Faulkner, Peter Dayan, and Jonathan P. Roiser. The Neural Basis of Aversive Pavlovian Guidance during Planning. Journal of Neuroscience, 37(42):10215–10229, October 2017. doi: 10.1523/JNEUROSCI.0085-17.2017.
Pablo Lanillos, Jordi Pages, and Gordon Cheng. Robot self/other distinction: Active inference meets neural networks learning in a mirror. arXiv:2004.05473 [cs], April 2020.
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
D. V. Lindley. On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956.
R. Linsker. Perceptual Neural Organization: Some Approaches Based on Network Models and Information Theory. Annual Review of Neuroscience, 13(1):257–281, 1990. doi: 10.1146/annurev.ne.13.030190.001353.
David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge, UK; New York, September 2003.
Beren Millidge. Implementing Predictive Processing and Active Inference: Preliminary Steps and Results. Preprint, PsyArXiv, March 2019.
Beren Millidge. Deep active inference as variational policy gradients. Journal of Mathematical Psychology, 96:102348, June 2020. doi: 10.1016/j.jmp.2020.102348.
Beren Millidge, Alexander Tschantz, and Christopher L. Buckley. Whence the Expected Free Energy? arXiv:2004.08128 [cs], April 2020a.
Beren Millidge, Alexander Tschantz, Anil K. Seth, and Christopher L. Buckley. On the Relationship Between Active Inference and Control as Inference. arXiv:2006.12964 [cs, stat], June 2020b.
Mario J. Miranda and Paul L. Fackler. Applied Computational Economics and Finance. The MIT Press, Cambridge, MA; London, September 2002.
M. Berk Mirza, Rick A. Adams, Christoph Mathys, and Karl J. Friston. Human visual exploration reduces uncertainty about the sensed world. PLOS ONE, 13(1):e0190429, 2018. doi: 10.1371/journal.pone.0190429.
M. Berk Mirza, Rick A. Adams, Thomas Parr, and Karl Friston. Impulsivity and Active Inference. Journal of Cognitive Neuroscience, 31(2):202–220, February 2019. doi: 10.1162/jocn_a_01352.
L. M. Optican and B. J. Richmond. Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. Journal of Neurophysiology, 57(1):162–178, January 1987. doi: 10.1152/jn.1987.57.1.162.
Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1, 2009. doi: 10.3389/neuro.12.006.2007.
Thomas Parr and Karl J. Friston. The computational pharmacology of oculomotion. Psychopharmacology, April 2019. doi: 10.1007/s00213-019-05240-0.
Thomas Parr, Dimitrije Markovic, Stefan J. Kiebel, and Karl J. Friston. Neuronal message passing using Mean-field, Bethe, and Marginal approximations. Scientific Reports, 9(1):1889, December 2019. doi: 10.1038/s41598-018-38246-3.
Thomas Parr, Lancelot Da Costa, and Karl Friston. Markov blankets, information geometry and stochastic thermodynamics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 378(2164):20190159, February 2020. doi: 10.1098/rsta.2019.0159.
Grigorios A. Pavliotis. Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Volume 60 in Texts in Applied Mathematics. Springer, New York, 2014.
Judea Pearl. Graphical Models for Probabilistic and Causal Reasoning. In Philippe Smets, editor, Quantified Representation of Uncertainty and Imprecision, Handbook of Defeasible Reasoning and Uncertainty Management Systems, pages 367–389. Springer Netherlands, Dordrecht, 1998. doi: 10.1007/978-94-017-1735-9_12.
Corrado Pezzato, Riccardo Ferrari, and Carlos Hernández Corbato. A Novel Adaptive Controller for Robot Manipulators Based on Active Inference. IEEE Robotics and Automation Letters, 5(2):2973–2980, April 2020. doi: 10.1109/LRA.2020.2974451.
Léo Pio-Lopez, Ange Nizard, Karl Friston, and Giovanni Pezzulo. Active inference and robot control: A case study. Journal of The Royal Society Interface, 13(122):20160616, September 2016. doi: 10.1098/rsif.2016.0616.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, August 2014.
Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference. In Twenty-Third International Joint Conference on Artificial Intelligence, June 2013.
Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, and Pierre Kreitmann. A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes. Journal of Machine Learning Research, 12, 2011.
Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-Adaptive POMDPs. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1225–1232. Curran Associates, Inc., 2008.
Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.
Noor Sajid, Philip J. Ball, and Karl J. Friston. Active inference: Demystified and compared. arXiv:1909.10863 [cs, q-bio], January 2020.
Anna C. Sales, Karl J. Friston, Matthew W. Jones, Anthony E. Pickering, and Rosalyn J. Moran. Locus Coeruleus tracking of prediction errors optimises cognitive flexibility: An Active Inference model. PLOS Computational Biology, 15(1):e1006267, January 2019. doi: 10.1371/journal.pcbi.1006267.
Cansu Sancaktar, Marcel van Gerven, and Pablo Lanillos. End-to-End Pixel-Based Deep Active Inference for Body Perception and Action. arXiv:2001.05847 [cs, q-bio], May 2020.
R. W. H. Sargent. Optimal control. Journal of Computational and Applied Mathematics, 124(1):361–371, December 2000. doi: 10.1016/S0377-0427(00)00418-0.
Jürgen Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, September 2010. doi: 10.1109/TAMD.2010.2056368.
Philipp Schwartenbeck, Thomas H. B. FitzGerald, Christoph Mathys, Ray Dolan, and Karl Friston. The Dopaminergic Midbrain Encodes the Expected Certainty about Desired Outcomes. Cerebral Cortex, 25(10):3434–3445, October 2015a. doi: 10.1093/cercor/bhu159.
Philipp Schwartenbeck, Thomas H. B. FitzGerald, Christoph Mathys, Ray Dolan, Martin Kronbichler, and Karl Friston. Evidence for surprise minimization over value maximization in choice behavior. Scientific Reports, 5:16575, November 2015b. doi: 10.1038/srep16575.
Philipp Schwartenbeck, Johannes Passecker, Tobias U. Hauser, Thomas H. B. FitzGerald, Martin Kronbichler, and Karl J. Friston. Computational mechanisms of curiosity and goal-directed exploration. eLife, 2019.
Sarah Schwöbel, Stefan Kiebel, and Dimitrije Marković. Active Inference, Belief Propagation, and the Bethe Approximation. Neural Computation, 30(9):2530–2567, September 2018. doi: 10.1162/neco_a_01108.
Biswa Sengupta, Arturo Tozzi, Gerald K. Cooray, Pamela K. Douglas, and Karl J. Friston. Towards a Neuronal Gauge Theory. PLOS Biology, 14(3):e1002400, March 2016. doi: 10.1371/journal.pbio.1002400.
Yoav Shoham, Rob Powers, and Trond Grenager. Multi-agent reinforcement learning: A critical survey. Technical report, 2003.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016. doi: 10.1038/nature16961.
Ryan Smith, Philipp Schwartenbeck, Thomas Parr, and Karl J. Friston. An active inference model of concept learning. bioRxiv, page 633677, May 2019. doi: 10.1101/633677.
Ryan Smith, Namik Kirlic, Jennifer L. Stewart, James Touthang, Rayus Kuplicki, Sahib S. Khalsa, Martin P. Paulus, Tulsa 1000 Investigators, and Robin Aupperle. Greater decision uncertainty characterizes a transdiagnostic patient sample during approach-avoidance conflict: A computational modeling approach. Preprint, PsyArXiv, April 2020a.
Ryan Smith, Rayus Kuplicki, Justin Feinstein, Katherine L. Forthman, Jennifer L. Stewart, Martin P. Paulus, Tulsa 1000 Investigators, and Sahib S. Khalsa. An active inference model reveals a failure to adapt interoceptive precision estimates across depression, anxiety, eating, and substance use disorders. medRxiv, page 2020.06.03.20121343, June 2020b. doi: 10.1101/2020.06.03.20121343.
Ryan Smith, Philipp Schwartenbeck, Thomas Parr, and Karl J. Friston. An Active Inference Approach to Modeling Structure Learning: Concept Learning as an Example Case. Frontiers in Computational Neuroscience, 14, May 2020c. doi: 10.3389/fncom.2020.00041.
Ryan Smith, Philipp Schwartenbeck, Jennifer L. Stewart, Rayus Kuplicki, Hamed Ekhtiari, Tulsa 1000 Investigators, and Martin P. Paulus. Imprecise Action Selection in Substance Use Disorder: Evidence for Active Learning Impairments When Solving the Explore-Exploit Dilemma. Preprint, PsyArXiv, April 2020d.
Susanne Still and Doina Precup. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, September 2012. doi: 10.1007/s12064-011-0142-z.
James V. Stone. Information Theory: A Tutorial Introduction. Sebtel Press, England, first edition, February 2015.
James V. Stone. Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning. 2019.
Yi Sun, Faustino Gomez, and Juergen Schmidhuber. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments. arXiv:1103.5708 [cs, stat], March 2011.
Toshiyuki Tanaka. A Theory of Mean Field Approximation. In Advances in Neural Information Processing Systems, 1999.
D. Gowanlock R. Tervo, Joshua B. Tenenbaum, and Samuel J. Gershman. Toward the neural implementation of structure learning. Current Opinion in Neurobiology, 37:99–105, April 2016. doi: 10.1016/j.conb.2016.01.014.
Emanuel Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376, 2007.
Emanuel Todorov. General duality between optimal control and estimation. In 47th IEEE Conference on Decision and Control, pages 4286–4292, 2008. doi: 10.1109/CDC.2008.4739438.
Emanuel Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478–11483, 2009.
Michel Tokic and Günther Palm. Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax. In Joscha Bach and Stefan Edelkamp, editors, KI 2011: Advances in Artificial Intelligence, Lecture Notes in Computer Science, pages 335–346, Berlin, Heidelberg, 2011. Springer. doi: 10.1007/978-3-642-24455-1_33.
Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1049–1056, Montreal, Quebec, Canada, June 2009. Association for Computing Machinery. doi: 10.1145/1553374.1553508.
Alexander Tschantz, Manuel Baltieri, Anil K. Seth, and Christopher L. Buckley. Scaling active inference. arXiv:1911.10601 [cs, eess, math, stat], November 2019.
Alexander Tschantz, Beren Millidge, Anil K. Seth, and Christopher L. Buckley. Reinforcement Learning through Active Inference. In ICLR, February 2020a.
Alexander Tschantz, Anil K. Seth, and Christopher L. Buckley. Learning action-oriented models through active inference. PLOS Computational Biology, 16(4):e1007805, April 2020b. doi: 10.1371/journal.pcbi.1007805.
Kai Ueltzhöffer. Deep Active Inference. Biological Cybernetics, 112(6):547–573, December 2018. doi: 10.1007/s00422-018-0785-7.
Bart van den Broek, Wim Wiegerinck, and Bert Kappen. Risk sensitive path integral control. UAI, 2010.
J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, US, 1944.
Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2007. doi: 10.1561/2200000001.
Robert C. Wilson, Andra Geana, John M. White, Elliot A. Ludvig, and Jonathan D. Cohen. Humans Use Directed and Random Exploration to Solve the Explore-Exploit Dilemma. Journal of Experimental Psychology: General, 143(6):2074–2081, December 2014. doi: 10.1037/a0038199.
Ernst Zermelo. Über eine Anwendung der Mengenlehre auf die Theorie des Schachspiels. 1913.
Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.
Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning. arXiv:1910.08348 [cs, stat], 2020.