Bayesian Policy Reuse
Benjamin Rosman · Majd Hawasly⋆ · Subramanian Ramamoorthy
Abstract
A long-lived autonomous agent should be able to respond online to novel instances of tasks from a familiar domain. Acting online requires 'fast' responses, in terms of rapid convergence, especially when the task instance has a short duration such as in applications involving interactions with humans. These requirements can be problematic for many established methods for learning to act. In domains where the agent knows that the task instance is drawn from a family of related tasks, albeit without access to the label of any given instance, it can choose to act through a process of policy reuse from a library, in contrast to policy learning. In policy reuse, the agent has prior experience from the class of tasks in the form of a library of policies that were learnt from sample task instances during an offline training phase. We formalise the problem of policy reuse and present an algorithm for efficiently responding to a novel task instance by reusing a policy from this library of existing policies, where the choice is based on observed 'signals' which correlate to policy performance. We achieve this by posing the problem as a Bayesian choice problem with a corresponding notion of an optimal response, but the computation of that response is in many cases intractable. Therefore, to reduce the computation cost of the posterior, we follow a Bayesian optimisation approach and define a set of policy selection functions, which balance exploration in the policy library against exploitation of previously tried policies, together with a model of expected performance of the policy library on their corresponding task instances. We validate our method in several simulated domains of interactive, short-duration episodic tasks, showing rapid convergence in unknown task variations.
⋆ The first two authors contributed equally to this paper.
Benjamin Rosman: Mobile Intelligent Autonomous Systems (MIAS), Council for Scientific and Industrial Research (CSIR), South Africa, and the School of Computer Science and Applied Mathematics, University of the Witwatersrand, South Africa. E-mail: [email protected]
Majd Hawasly: School of Informatics, University of Edinburgh, UK. E-mail: [email protected]
Subramanian Ramamoorthy: School of Informatics, University of Edinburgh, UK. E-mail: [email protected]
Keywords
Policy Reuse · Reinforcement Learning · Online Learning · Online Bandits · Transfer Learning · Bayesian Optimisation · Bayesian Decision Theory.
1 Introduction

As robots and software agents are becoming more ubiquitous in many applications involving human interactions, greater numbers of scenarios require new forms of decision making that allow fast responses to situations that may drift or change from their nominal descriptions.

For example, online personalisation (Mahmud et al., 2014) is becoming a core concept in human-computer interaction (HCI), driven largely by a proliferation of new sensors and input devices which allow for a more natural means of communicating with hardware. Consider, for example, an interactive interface in a public space like a museum that aims to provide information or services to users through normal interaction means such as natural speech or body gestures. The difficulty in this setting is that the same device may be expected to interact with a wide and diverse pool of users, who differ both at the low level of interaction speeds and faculties, and at the higher level of which expressions or gestures seem appropriate for particular commands. The device should autonomously calibrate itself to the class of user, and a mismatch in that could result in a failed interaction. On the other hand, taking too long to calibrate is likely to frustrate the user (Rosman et al., 2014), who may then abandon the interaction.

This problem, characterised as a short-term interactive adaptation to a new situation (the user), also appears in interactive situations other than HCI. As an example, consider a system for localising and monitoring poachers in a large wildlife reserve¹ that comprises an intelligent base station which can deploy light-weight airborne drones to scan particular locations in the reserve for unusual activity. While the tactics followed by the poachers in every trial would be different, the possible number of drone deployments in a single instance of this adversarial problem is limited, as the poachers can be expected to spend a limited time stalking their target before leaving.

In this paper, we formalise and propose a solution to the general problem inspired by these real world examples. To this end, we present a number of simulated scenarios to investigate different facets of this problem, and contrast the proposed solution with related approaches from the literature.

The key component of this problem is the need for efficient decision making, in the sense that the agent is required to adapt or respond to scenarios which exist only for short durations. As a result, solution methods are required to have both low convergence time and low regret. To this end, the key intuition we employ is that nearly-optimal solutions computed within a small computational and time budget are preferred to those that are optimal but unbounded in time and computation. Building on this, the question we address in this paper is how to act well (not necessarily optimally) in an efficient manner (for short duration tasks) in a large space of qualitatively-similar tasks.

While it is unreasonable to expect that an arbitrary task instance could be solved from scratch in a short duration task (where, in general, the interaction length is unknown), it is plausible to consider seeding the process with a set of policies of previously solved, related task instances, in what can be seen as a strategy for transfer learning (Taylor and Stone, 2009).

¹ Poaching of large mammals such as rhinoceroses is a major problem throughout Africa and Asia (Amin et al., 2006).
In this sense, we prefer toquickly select a nearly-optimal pre-learnt policy, rather than learn an optimal onefor each new task. For our previous examples, the interactive interface may shipwith a set of different classes of user profiles which have been acquired offline, andthe monitoring system may be equipped with a collection of pre-learnt behavioursto navigate the reserve when a warning is issued.We term this problem of short-lived sequential policy selection for a new in-stance the policy reuse problem, which differs slightly from other uses of that term(see Section 1.1), and we define it as follows.
Definition 1 (Policy Reuse Problem)
Let an agent be a decision making entity in a specific domain, equipped with a policy library Π for some tasks in that domain. The agent is presented with an unknown task which must be solved within a limited, and small, number of trials. At the beginning of each trial episode, the agent can select one policy from Π to execute for the full episode. The goal of the agent is thus to select policies for the new task from Π to minimise the total regret, with respect to the performance of the best alternative from Π in hindsight, incurred in the limited task duration.

The online choice from a set of alternatives for minimal regret could be posed as a multi-armed bandit problem. Here, each arm corresponds to a pre-learnt policy, and our problem becomes that of a sequential, finite-horizon, optimal selection from a fixed set of policies. Solving this problem in general is difficult as it maps into the intractable finite-horizon online bandit problem (Niño-Mora, 2011). On the other hand, traditional approaches to solve the multi-armed bandit problem involve testing each available arm on the new task in order to gauge its performance, which may be a very costly procedure from a convergence rate point of view.

Instead, one can exploit knowledge which has been acquired offline to improve online response times. These more informed approaches to the multi-armed bandit problem exploit background domain information (e.g. contexts in contextual bandits (Strehl et al., 2006; Langford and Zhang, 2008)) or analytical forms of reward (e.g. correlated bandits (Pandey et al., 2007)) to share the credit of pulling an arm between many possible arms. This however requires prior knowledge of the domain and its metrics, how possible domain instances relate to the arms, and in some cases to be able to determine this side information for any new instance.

We propose a solution for policy reuse that neither requires complete knowledge of the space of possible task instances nor a metric in that space, but rather builds a surrogate model of this space from offline-captured correlations between the policies when tested under canonical operating scenarios. Our solution then maintains a Bayesian belief over the nature of the new task instance in relation to the previously-solved ones. Then, executing a policy provides the agent with information which, when combined with the model, is not only useful to evaluate the policy chosen but also to gauge the suitability of other policies. This information updates the belief, which facilitates choosing the next policy to execute.

We treat the policy selection in the policy reuse problem as one of optimisation of the response surface of the new task instance, although over a finite library
of policies. Because we are dealing with tasks which we assume are of limited duration and which do not allow extensive experimenting, and in order to use information from previous trials to maintain belief distributions over the task space, we draw inspiration from the Bayesian optimisation/efficient global optimisation literature (Brochu et al., 2010) for an approach to this problem that is efficient in the number of policy executions, corresponding to function evaluations in the classical optimisation setting.

1.1 Other Definitions of Policy Reuse

A version of the policy reuse problem was described by Mahmud et al. (2013), where it is used to test a set of landmark policies retrieved through clustering in the space of MDPs. Additionally, the term 'policy reuse' has been used by Fernández and Veloso (2006) in a different context. There, a learning agent is equipped with a library of previous policies to aid in exploration, as they enable the agent to collect relevant information quickly to accelerate learning. In our case, we do not expect to have enough time to learn a full policy, and so instead rely on aggressive knowledge transfer using our proposed policy reuse framework to achieve the objective of the agent.

1.2 Contributions and Paper Organisation

The primary contributions made in this paper are as follows:
1. We introduce Bayesian Policy Reuse (BPR) as a general Bayesian framework for solving the policy reuse problem as defined in Definition 1 (Section 2).
2. We present several specific instantiations of BPR using different policy selection mechanisms (Section 3.1), and compare them on an online personalisation domain (Section 4.2) as well as a domain modelling a surveillance problem (Section 4.3).
3. We provide an empirical analysis of the components of our model, considering different classes of observation signal, and the trade-off between library size and convergence rate.
2 Bayesian Policy Reuse

We now pose the policy reuse transfer problem within a Bayesian framework. Bayesian Policy Reuse (BPR) builds on the intuition that, in many cases, performance of a specific policy is better, relative to the other policies in the library, in tasks within some neighbourhood of the task for which it is known to be optimal. Thus, a model that measures the similarity between a new task and other known tasks may provide indications as to which policies may be the best to reuse. We learn such a model from offline experience, and then use it online as a Bayesian prior over the task space, which is updated with new observations from the current task. Note that in this work we consider the general case where we do not have a parametrisation of the task space that allows constructing that model explicitly (e.g. da Silva et al. (2012)). This may be the case where aspects of the model may vary qualitatively (e.g. different personality types), or where the agent has not been exposed to enough variations of the task to learn the underlying parametric model sufficiently.

2.1 Notation

Let the space of task instances be X, and let a task instance x ∈ X be specified by a Markov Decision Process (MDP).
An MDP is defined as a tuple µ = (S, A, T, R, γ), where S is a finite set of states; A is a finite set of actions which can be taken by the agent; T : S × A × S → [0, 1] is the state transition function, where T(s, a, s′) gives the probability of transitioning from state s to state s′ after taking action a; R : S × A × S → ℝ is the reward function, where R(s, a, s′) is the reward received by the agent when transitioning from state s to s′ with action a; and finally, γ ∈ [0, 1] is a discounting factor.² Denote the space of all MDPs by M. We will consider episodic tasks, i.e. tasks that have a bounded time horizon.

² T is a probability function, Σ_{s′∈S} T(s, a, s′) = 1, ∀a ∈ A, ∀s ∈ S.
A policy π : S × A → [0, 1] for an MDP is a distribution over states and actions, defining the probability of taking any action from a state. The return, or utility, generated from running the policy π in an episode of a task instance is the accumulated discounted reward, U^π = Σ_{i=0}^{k} γ^i r_i, with k being the length of the episode and r_i being the reward received at step i. We refer to U^π generated from a policy π in a task instance simply as the policy's performance. Solving an MDP µ is to acquire an optimal policy π* = argmax_π U^π which maximises the total expected return of µ. For a reinforcement learning agent, T and R are typically unknown. We denote a collection of policies possessed by the agent by Π, and refer to it as the policy library.

We complete the discussion of the formulation of a task with the definition of signals. The aim of signals is to provide the agent with auxiliary information that hints toward identifying the nature of the new task instance in the context of the previously-solved instances.

Definition 2 (Signal) A signal σ ∈ Σ is any information which is correlated with the performance of a policy and which is provided to the agent in an online execution of the policy on a task.

The most straightforward signal is the performance itself, unless this is not directly observable (e.g. in cases where the payoff may only be known after some time horizon). The information content and richness of a signal determines how easily an agent can identify the type of the new task with respect to the previously-solved types. This is discussed in more detail in Section 2.8.

Throughout the discussion, we adopt the following notational convention: P(·) refers to a probability, E[·] refers to an expectation, H(·) refers to entropy, and ∆(·) is a distribution.
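To keep the notation concrete, the following minimal Python sketch shows an episodic task and the discounted return U^π = Σ_{i=0}^{k} γ^i r_i. The dictionary-based MDP representation and the deterministic policy interface are assumptions made only for illustration.

import random
from dataclasses import dataclass

@dataclass
class MDP:
    """A finite MDP (S, A, T, R, gamma)."""
    states: list
    actions: list
    T: dict       # T[s][a] -> {s_next: probability}
    R: dict       # R[(s, a, s_next)] -> reward
    gamma: float  # discounting factor

def sample_next_state(mdp, s, a):
    """Draw s' with probability T(s, a, s')."""
    next_states, probs = zip(*mdp.T[s][a].items())
    return random.choices(next_states, probs)[0]

def episodic_return(mdp, policy, s0, horizon):
    """Accumulated discounted reward U^pi over one episode of bounded length."""
    s, utility = s0, 0.0
    for i in range(horizon):
        a = policy(s)                  # policy: state -> action (deterministic for simplicity)
        s_next = sample_next_state(mdp, s, a)
        utility += (mdp.gamma ** i) * mdp.R[(s, a, s_next)]
        s = s_next
    return utility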
Given a set of previously-solved tasks X and a set of policies Π, Bayesian Policy Reuse involves two key probability models:

– The first, P(U | X, Π), where U ∈ ℝ is utility, is the performance model: a probability model over performance of the library of policies Π on the set of previously-solved tasks X. This information is available in an offline phase.
– The second key component is the observation model, defined as a probability distribution P(Σ | X, Π) over Σ, the space of possible observation signals. Any kind of information that can be observed online and that is correlated with the performance can be used as an observation signal. When performance information is directly observable online (e.g., not delayed), performance can be used as the signal, and in this case the observation and the performance models can be the same.

A caricature of the BPR problem for a one-dimensional task space is shown in Figure 1, where, given a new task x* ∈ X, the agent is required to select the best policy π* ∈ Π in as few trials as possible, whilst minimising the accumulated regret in the interim. As shown in this example, the agent has prior knowledge in the form of performance models for each policy in Π on a set of tasks from X. The agent additionally has observation models of the signals generated by each task-policy pair, but these are not depicted in Figure 1.

Fig. 1 A simplified depiction of the Bayesian Policy Reuse problem. The agent has access to a library of policies (π1, π2 and π3), and has previously experienced a set of task instances (τ1, τ2, τ3, τ4), as well as samples of the utilities of the library policies on these instances (the black dots indicate the means of these estimates, while the agent maintains distributions of the utility as illustrated by P(U | τ, π) in grey). The agent is presented with a new unknown task instance (x*), and it is asked to select the best policy from the library (optimising between the red hollow points) without having to try every individual option (in less than 3 trials in this example). The agent has no knowledge about the complete curves, where the task instances occur in the problem space, or where the new task is located in comparison to previous tasks. This is inferred from utility similarity. For clarity, only performance is shown while the observation models are not depicted.

After each episode, the agent uses the received observation signal to update a probability distribution over the previously-solved tasks (the belief). This belief informs the selection of a policy at the next trial in an attempt to optimise expected performance. This is the core step in the operation of Bayesian Policy Reuse.

We present the general form of Bayesian Policy Reuse (BPR) in Algorithm 1. The policy selection step (line 3) is described in detail in Section 3, the models of observation signals (line 5) are described in Section 2.7, and the belief update (line 6) is discussed further in Section 2.9.
Algorithm 1 Bayesian Policy Reuse (BPR)
Require:
Problem space X, policy library Π, observation space Σ, prior over the problem space P(X), observation model P(Σ | X, Π), performance model P(U | X, Π), number of episodes K.
1: Initialise beliefs: β⁰(X) ← P(X).
2: for episodes t = 1 . . . K do
3:   Select a policy π_t ∈ Π using the current belief β_{t−1} and the performance model P(U | X, π_t).
4:   Apply π_t on the task instance.
5:   Obtain an observation signal σ_t from the environment.
6:   Update the belief β_t(X) ∝ P(σ_t | X, π_t) β_{t−1}(X).
7: end for
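To make Algorithm 1 concrete, here is a minimal Python sketch of the BPR loop using a greedy selection rule in line 3. The dictionary-based performance and observation models and the environment hook run_episode are illustrative assumptions, not part of the formal algorithm.

def bayesian_policy_reuse(tasks, policies, prior, perf_mean, obs_model, run_episode, K):
    """Bayesian Policy Reuse (Algorithm 1) with greedy policy selection.

    tasks:       previously-solved tasks (or types)
    prior:       dict task -> prior probability P(task)
    perf_mean:   dict (task, policy) -> expected utility E[U | task, policy]
    obs_model:   dict (task, policy) -> function sigma -> P(sigma | task, policy)
    run_episode: function policy -> observation signal from the unknown task instance
    """
    belief = dict(prior)                                            # line 1
    for t in range(K):                                              # line 2
        # line 3: policy maximising expected utility under the current belief
        policy = max(policies, key=lambda pi: sum(belief[tau] * perf_mean[(tau, pi)]
                                                  for tau in tasks))
        sigma = run_episode(policy)                                 # lines 4-5
        # line 6: Bayes update of the belief using the observation model
        unnormalised = {tau: obs_model[(tau, policy)](sigma) * belief[tau] for tau in tasks}
        z = sum(unnormalised.values())
        belief = {tau: p / z for tau, p in unnormalised.items()}
    return belief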
We use the notion of library regret as the criterion for policy selection to be optimised by Bayesian Policy Reuse.

Definition 3 (Library Regret) For a library of policies Π and for a policy selection algorithm ξ : X′ → Π that selects a policy for the new task instance x* ∈ X′, the library regret of ξ is defined as

  R_Π(ξ, x*) = U^{π*}_{x*} − U^{ξ(x*)}_{x*},

where U^π_x is the utility of policy π when applied to task x, and π* = argmax_{π∈Π} U^π_{x*} is the best policy in hindsight in the library for the task instance x*.
Definition 4 (Average Library Regret) For a library of policies Π and for a policy selection algorithm ξ : X′ → Π, the average library regret of ξ over K trials is defined as the average of the library regrets for the individual trials,

  R^K_Π(ξ) = (1/K) Σ_{t=1}^{K} R_Π(ξ, x_t),

for a sequence of task instances x_1, x_2, . . . , x_K ∈ X′.
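To illustrate Definitions 3 and 4, the following small Python sketch computes the library regret and its average over a sequence of trials; the utility table used in the example is a made-up placeholder.

def library_regret(utility, policies, chosen, task):
    """R_Pi(xi, x) = max_pi U[pi][x] - U[chosen][x] for a single task instance."""
    best_in_hindsight = max(utility[pi][task] for pi in policies)
    return best_in_hindsight - utility[chosen][task]

def average_library_regret(utility, policies, choices, task_sequence):
    """Average of the per-trial library regrets over K trials."""
    regrets = [library_regret(utility, policies, pi_t, x_t)
               for pi_t, x_t in zip(choices, task_sequence)]
    return sum(regrets) / len(regrets)

# Hypothetical 2-policy, 2-task utility table:
U = {"pi1": {"x1": 1.0, "x2": 0.2}, "pi2": {"x1": 0.5, "x2": 0.9}}
print(average_library_regret(U, ["pi1", "pi2"], ["pi1", "pi1"], ["x1", "x2"]))  # 0.35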
The metric we minimise in BPR is the average library regret R^K_Π(·) for K trials. That is, the goal of BPR is to not only find the right solution at the end of the K trials, possibly through expensive exploration, but also to optimise performance even when exploring in the small number of trials of the task. We will refer to this metric simply as 'regret' throughout the rest of the paper.

2.5 Types

When the problem space of BPR is a large task space M, modelling the true belief distribution over the complete space would typically require a large number of samples (point estimates), which would hence be expensive to maintain and use computationally. In many applications, there is a natural notion of clustering in M whereby many tasks, modelled as MDPs, are similar with only minor variations in transition dynamics or reward structures. In the context of MDPs, previous work has regarded classes of MDPs as probability distributions over task parameters (Wilson et al., 2007). A more recent work explored explicitly discovering the clustering in a space of tasks (Mahmud et al., 2013). Similar intuitions have been developed in the multi-armed bandits literature, by examining ways of clustering bandit machines in order to allow for faster convergence and better credit assignment, e.g. Pandey et al. (2007); Bui et al. (2012); Maillard and Mannor (2014). In this work we do not explicitly investigate methods of task clustering, but the algorithms presented herein are most efficient when such a cluster-based structure exists in the task space.

We encode the concept of task clustering by introducing a notion of task types as ε-balls in the space of tasks, where the tasks are clustered with respect to the performance of a collection of policies executed on them.
Definition 5 (Type) A type τ is a subset of tasks such that for any two tasks µ_i, µ_j from a single type τ, and for all policies π in a set of policies Π, the difference in utility is upper-bounded by some ε ∈ ℝ:

  µ_i, µ_j ∈ τ ⇔ |U^π_i − U^π_j| ≤ ε, ∀π ∈ Π,

where U^π_i ∈ ℝ is the utility from executing policy π on task µ_i. Then, µ_i and µ_j are ε-equivalent under the policies Π.

This definition of types is similar to the concept of similarity-based contextual bandits (Slivkins, 2014), where a distance function can be defined in the joint space of contexts and arms given by an upper bound of reward differences. In our setting, we cluster the instances (contexts) that are less than ε-different under all the policies in the library (arms). We do not however assume any prior knowledge of the metrics in the task or policy spaces.

Figure 1 depicted four example types, where each accounts for an ε-ball in performance space (only explicitly shown for one of the types). Note that the definition does not assume that the types need to be disjoint, i.e. there may exist tasks that belong to multiple types. We denote the space of types with T.

In the case of disjoint types, the type space T can be used as the problem space of BPR, inducing a hierarchical structure in the space M. The environment can then be represented with the generative model shown in Figure 2(a), where a type τ is drawn from a hyperprior τ ∼ G, and then a task is drawn from that type, µ ∼ ∆_τ(µ), where ∆_τ(·) is some probability distribution over the tasks of type τ.
Fig. 2 Problem space abstraction model under disjoint types. (a) Tasks µ are related by types τ, with a generating distribution G over them. (b) A simplification of the hierarchical structure under ε-equivalence. The tasks of each type are represented by a single task µ_τ.

By definition, the set of MDPs generated by a single type are ε-equivalent under Π, hence BPR regret cannot be more than ε if we represent all the MDPs in τ with any one of them. Let that chosen MDP be a landmark MDP of type τ, and denote this by µ_τ. This reduces the hierarchical structure into the simpler model shown in Figure 2(b), where the prior acts immediately on a set of landmark MDPs µ_τ, τ ∈ T. The benefit of this for BPR is that each alternative is representative for a region in the original task space, as defined by a maximum loss of ε. Maintaining only this reduced set of landmarks removes near-duplicate tasks from consideration, thereby reducing the cost of maintaining the belief.

For the remainder of this paper, we use the type space T as the problem space, although we note that the methods proposed herein do not prevent the alternative use of the full task space M.
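As an illustration of Definition 5, this Python sketch groups tasks into candidate types by ε-equivalence of their utilities under a policy library. The greedy single-pass grouping is an assumption made for illustration; it is not the clustering procedure used in the paper.

def epsilon_equivalent(u_i, u_j, eps):
    """True if two tasks differ by at most eps in utility under every policy in the library.

    u_i, u_j: dict policy -> utility of that policy on the task."""
    return all(abs(u_i[pi] - u_j[pi]) <= eps for pi in u_i)

def group_into_types(task_utilities, eps):
    """Greedily group tasks into epsilon-balls (candidate types)."""
    types = []                                  # each type is a list of task identifiers
    for task, utils in task_utilities.items():
        for group in types:
            if all(epsilon_equivalent(utils, task_utilities[other], eps) for other in group):
                group.append(task)
                break
        else:
            types.append([task])
    return types

# Hypothetical utilities of two policies on three tasks:
tasks = {"mu1": {"pi1": 0.90, "pi2": 0.20},
         "mu2": {"pi1": 0.85, "pi2": 0.25},
         "mu3": {"pi1": 0.10, "pi2": 0.80}}
print(group_into_types(tasks, eps=0.1))         # [['mu1', 'mu2'], ['mu3']]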
2.6 Performance Model

One of the key components of BPR is the performance model of policies on previously-solved task instances, which describes the distribution of returns from each policy on the previously-solved tasks. A performance model represents the variability in return under the various tasks in a type.

Definition 6 (Performance Model) For a policy π and a type τ, the performance model P(U | τ, π) is a probability distribution over the utility of π when applied to all tasks µ ∈ τ.

Figure 1 depicts the performance profile of one policy on one type in the form of a Gaussian distribution. Recall that for a single type, and each policy, the domain of the performance model would be at most of size ε. The agent maintains performance models for all the policies it has in the library Π and for all the types it has experienced.
2.7 Observation Model

Definition 7 (Observation Model) For a policy π and type τ and for a choice of signal space Σ, the observation model F^τ_π(σ) = P(σ | τ, π) is a probability distribution over the signals σ ∈ Σ that may result by applying the policy π to the type τ.

We consider the following offline procedure to learn the signal models for a policy library Π (a minimal code sketch of this procedure, for episodic-return signals, is given at the end of Section 2.8):
1. The type label τ is announced.
2. A set of tasks are generated from the type τ.
3. The agent runs all the policies from the library Π on all the instances of τ, and observes the resultant sampled signals σ̃ ∈ Σ.
4. Empirical distributions F^τ_π = ∆(σ̃) are fitted to the data, for each type τ and policy π.

The benefit of these models is that they provide a connection between the observable online information and the latent type label, the identification of which leads to better reuse from the policy library.

2.8 Candidate Signals for Observation Models

The BPR algorithm requires that some signal information is generated from policy execution on a task, although the form of this signal remains unspecified. Here we describe the three most typical examples of information that can be used as signals in BPR, but note that this list is not exhaustive.

State-action-state sequences. The most detailed information signal which could be accrued by the agent is the history of all (s, a, s′) tuples encountered during the execution of a policy. Thus, the observation model in this case is an empirical estimate of the expected transition function of the MDPs under the type τ. The expressiveness of this signal does have a drawback, in that it is expensive to learn and maintain these models for every possible type. Additionally, this may not generalise well, in cases with sparse sampling. On the other hand, this form of signal is useful in cases where some environmental factors may affect the behaviour of the agent in a way that does not directly relate to attaining an episodic goal. As an example, consider an aerial agent which may employ different navigation strategies under different wind conditions.

Immediate rewards. Another form of information is the instantaneous reward r ∈ ℝ received during the execution of a policy for some state-action pair. Then, the observation model is an empirical estimate of the expected reward function for the MDPs in the type. Although this is a more abstract signal than the state-action-state tuples, it may still provide a relatively fine-grained knowledge on the behaviour of the task when intermediate rewards are informative. It is likely to be useful in scenarios where the task has a number of subcomponents which individually contribute to overall performance, for example in assembly tasks.
Episodic returns. An example of a sparser kind of signal is the total utility U^π_τ ∈ ℝ accrued over the full episode of using a policy in a task. The observation model of such a scalar signal is much more compact, and thereby easier to learn and reason with, than the previous two proposals. We also note that for our envisioned applications, the execution of a policy cannot be terminated prematurely, meaning that an episodic return signal is always available to the agent before selecting a new policy. This signal is useful for problems of delayed reward, where intermediate states cannot be valued easily, but the extent to which the task was successfully completed defines the return. In our framework, using episodic returns as signals has the additional advantage that this information is already captured in the performance model, which relieves the agent from maintaining two separate models, as in this case P(U | τ, π) = F^τ_π(U) for all π and τ.
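Following the offline procedure of Section 2.7, here is a minimal Python sketch that fits empirical observation models F^τ_π from episodic-return signals by discretising returns into bins. The bin edges, the Laplace smoothing and the sampling interface sample_return are assumptions made for illustration.

from collections import Counter

def fit_observation_models(types, policies, sample_return, n_samples=100, bin_edges=None):
    """Fit empirical distributions over discretised episodic returns for each (type, policy).

    sample_return(tau, pi): runs policy pi on a task drawn from type tau during the
    offline training phase and returns the episodic utility U."""
    if bin_edges is None:
        bin_edges = [-float("inf"), -10.0, 0.0, 10.0, float("inf")]   # assumed edges

    def to_bin(u):
        return next(i for i in range(len(bin_edges) - 1)
                    if bin_edges[i] <= u < bin_edges[i + 1])

    models = {}
    n_bins = len(bin_edges) - 1
    for tau in types:
        for pi in policies:
            counts = Counter(to_bin(sample_return(tau, pi)) for _ in range(n_samples))
            total = sum(counts.values())
            # Laplace smoothing keeps every signal possible under every (type, policy) pair
            models[(tau, pi)] = [(counts[b] + 1) / (total + n_bins) for b in range(n_bins)]
    return models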
2.9 Belief over Types

Definition 8 (Type Belief) For a set of previously-solved types T and a new instance x*, the type belief β(·) is a probability distribution over T that measures the extent to which x* matches the types of T in their observation signals.

The type belief, or belief for short, is a surrogate measure of similarity in type space. It approximates where a new instance may be located in relation to the known types which act as a basis of the unknown type space. The belief is initialised with the prior probability over the types, labelled G in Figure 2.

In episode t, the environment provides an observation signal σ_t for executing a policy π_t on the new task instance. This signal is used to update β (line 6 in Algorithm 1). The posterior over the task space is computed using Bayes' rule:

  β_t(τ) = P(σ_t | τ, π_t) β_{t−1}(τ) / Σ_{τ′∈T} P(σ_t | τ′, π_t) β_{t−1}(τ′)      (1)
         = η F^τ_{π_t}(σ_t) β_{t−1}(τ),   ∀τ ∈ T,                                   (2)

where β_{t−1} is the belief after episode t−1 and η is a normalisation constant. We use β to refer to β_t whenever this is not ambiguous. Note how the belief is updated using the observation model.
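The update of Equations (1)-(2) is a standard Bayes rule over the known types. A minimal Python sketch, reusing the dictionary-based observation models of the earlier sketches, might look as follows.

def update_belief(belief, obs_model, policy, sigma):
    """Posterior type belief after observing signal sigma from executing policy.

    belief:    dict type -> probability (beta_{t-1})
    obs_model: dict (type, policy) -> function sigma -> P(sigma | type, policy)"""
    unnormalised = {tau: obs_model[(tau, policy)](sigma) * p for tau, p in belief.items()}
    z = sum(unnormalised.values())
    if z == 0.0:
        return dict(belief)   # signal has zero likelihood under all types; keep previous belief
    return {tau: u / z for tau, u in unnormalised.items()}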
3 Policy Selection

The selection of a policy for each episode (line 3 in Algorithm 1) is a critical step in BPR. Given the current type belief, the agent is required to choose a policy for the next episode to fulfil two concurrent purposes: acquire useful information about the new (current) task instance, and at the same time avoid accumulating additional regret.

At the core of this policy selection problem is the trade-off between exploration and exploitation. When a policy is executed at some time t, the agent receives both some utility as well as information about the true type of the new task (the signal). The agent is required to gain as much information about the task as possible, so as to choose policies optimally³ in the future, but at the same time minimise performance losses arising from sub-optimal policy choices.

³ Note that we denote by optimal policy the best policy in the library for a specific instance, as we are considering policy reuse problems in which learning the actual optimal policy is not feasible.

Our problem can be mapped to a finite-horizon total-reward multi-armed bandits setting in which the arms represent the policies, the finite horizon is defined by the limited number of episodes, and the metric to optimise is the total reward. For this kind of setting Lai and Robbins (1985) show that index-based methods achieve optimal performance asymptotically. In our case, however, we are interested in the cumulative performance over a small number of episodes.

Clearly, a purely greedy policy selection mechanism would fail to choose exploratory options to elicit what is needed for the belief to converge to the closest type, and may result in the agent becoming trapped in a local maximum of the utility function. On the other hand, a purely exploratory policy selection mechanism could be designed to ensure that all the possible information is elicited in expectation, but this would not make an effort to improve performance instantly and thereby incur additional regret. We thus require a mechanism to explore as well as exploit: find a better policy to maximise asymptotic utility, and exploit the current estimates of which are good policies to maximise myopic utility.

Multiple proposals have been widely considered in the multi-armed bandits (MAB) literature for these heuristics, ranging from early examples like the Gittins index for infinite horizon problems (Gittins and Jones, 1974) to more recent methods such as the knowledge gradient (Powell, 2010). Here we describe several approximate policy selection mechanisms that we use for dealing with the policy reuse problem.

– A first approach is through ε-greedy exploration, where with probability 1 − ε we select the policy which maximises the expected utility under the belief β,

    π̂ = argmax_{π∈Π} Σ_{τ∈T} β(τ) ∫_{U∈ℝ} U P(U | τ, π) dU = argmax_{π∈Π} Σ_{τ∈T} β(τ) E[U | τ, π],

  and with probability ε we select a policy from the policy library uniformly at random. This additional random exploration component perturbs the belief from local minima.
– A second approach is through sampling the belief β. This involves sampling a type according to its probability in the belief, τ̂ ∼ β, and playing the best response to that type from the policy library,

    π̂ = argmax_{π∈Π} E[U | τ̂, π].
  In this case, the sampled type acts as an approximation of the true unknown type, and exploration is achieved through the sampling process.
– The third approach is through employing what we call exploration heuristics, which are functions that estimate a value for each policy which measures the extent to which it balances exploitation with a limited degree of look-ahead for exploration. This is the prevalent approach in Bayesian optimisation, where, instead of directly maximising the objective function itself (here, utility), a surrogate function that takes into account both the expected utility and a notion of the utility variance (uncertainty) is maximised (see, e.g., Brochu et al. (2010)). Each such heuristic scores a policy independently of the other policies.

3.1 Bayesian Policy Reuse with Exploration Heuristics

By incorporating the notion of an exploration heuristic that computes an index ν_π for a policy π into Algorithm 1, we obtain the proto-algorithm Bayesian Policy Reuse with Exploration Heuristics (BPR-EH) described in Algorithm 2.
Algorithm 2 Bayesian Policy Reuse with Exploration Heuristics (BPR-EH)
Require:
Type space T, policy library Π, observation space Σ, prior over the type space G, observation model P(Σ | T, Π), performance model P(U | T, Π), number of episodes K, exploration heuristic V.
1: Initialise beliefs: β⁰ ← G.
2: for episodes t = 1 . . . K do
3:   Compute ν_π = V(π, β_{t−1}) for all π ∈ Π.
4:   π_t ← argmax_{π∈Π} ν_π.
5:   Apply π_t to the task instance.
6:   Obtain the observation signal σ_t from the environment.
7:   Update the belief β_t using σ_t by Equation (1).
8: end for

Note that we are now using G, the hyper-prior, as the prior in line 1 because we are using T as the problem space. We now define the exploration heuristics V that are used in line 3, and to this end we define four variants of the BPR-EH algorithm:
– BPR-PI using probability of improvement (Section 3.1.1),
– BPR-EI using expected improvement (Section 3.1.1),
– BPR-BE using belief entropy (Section 3.1.2), and
– BPR-KG using knowledge gradient (Section 3.1.3).
3.1.1 Probability of Improvement and Expected Improvement

The first heuristic for policy selection utilises the probability with which a specific policy can achieve a hypothesised increase in performance. Assume that U⁺ ∈ ℝ is some utility which is larger than the best estimate under the current belief, U⁺ > Ū = max_{π∈Π} Σ_{τ∈T} β(τ) E[U | τ, π]. The probability of improvement (PI) principle chooses the policy that maximises the term

  π̂ = argmax_{π∈Π} Σ_{τ∈T} β(τ) P(U⁺ | τ, π),

thereby selecting the policy most likely to achieve the utility U⁺.

The choice of U⁺ is not straightforward, and this choice is the primary factor affecting the performance of this exploration principle. One approach to addressing this choice is through the related idea of expected improvement (EI). This exploration heuristic integrates over all the possible values of improvement Ū < U⁺ < U_max, and the policy is chosen with respect to the best potential. That is,

  π̂ = argmax_{π∈Π} ∫_{Ū}^{U_max} Σ_{τ∈T} β(τ) P(U⁺ | τ, π) dU⁺
     = argmax_{π∈Π} Σ_{τ∈T} β(τ) ∫_{Ū}^{U_max} P(U⁺ | τ, π) dU⁺
     = argmax_{π∈Π} Σ_{τ∈T} β(τ) (1 − F(Ū | τ, π))
     = argmin_{π∈Π} Σ_{τ∈T} β(τ) F(Ū | τ, π),

where F(U | τ, π) = ∫_{−∞}^{U} P(u | τ, π) du is the cumulative distribution function of U for π and τ. This heuristic therefore selects the policy most likely to result in any improvement to the expected utility.

3.1.2 Belief Entropy

Both PI and EI principles select a policy which has the potential to achieve higher utility. An alternate approach is to select the policy which will have the greatest effect in reducing the uncertainty over the type space.

The belief entropy (BE) exploration heuristic seeks to estimate the effect of each policy in reducing uncertainty over type space, represented by the entropy of the belief. For each policy π ∈ Π, estimate the expected entropy of the belief after executing π as

  H(β | π) = −Σ_{τ∈T} β^π(τ) log β^π(τ),

where β^π is the updated belief after seeing the signal expected from running π, given as

  β^π(τ) = E_{σ∈Σ}[η F^τ_π(σ) β(τ)]                              (3)
         = ∫_{σ∈Σ} F^β_π(σ) [η F^τ_π(σ) β(τ)] dσ,                (4)

where F^β_π(σ) is the probability of observing σ under the current belief β when using π, and η is the normalisation constant as before.

Then, selecting the policy

  π̂ = argmin_{π∈Π} H(β | π)

reduces the most uncertainty in the belief in expectation. This is however a purely exploratory policy. To incorporate exploitation of the current state of knowledge, we rather select

  π̂ = argmax_{π∈Π} ( Ũ(π) − κ H(β | π) ),

where κ ∈ ℝ is a positive constant controlling the exploration-exploitation trade-off, and Ũ(π) is the expected utility of π under the current belief,

  Ũ(π) = Σ_{τ∈T} β(τ) E[U | π, τ].                               (5)
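As an illustration, here is a minimal Python sketch of the expected improvement rule derived above, assuming Gaussian performance models N(µ, σ²) for each (type, policy) pair; the Gaussian assumption is made only for this sketch, since the heuristic itself only needs the cumulative distribution F(Ū | τ, π).

from math import erf, sqrt

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))

def select_policy_ei(belief, perf, policies):
    """Expected improvement: argmin_pi sum_tau beta(tau) F(U_bar | tau, pi).

    belief: dict type -> probability
    perf:   dict (type, policy) -> (mean, std) of the performance model P(U | type, policy)"""
    # U_bar: best expected utility achievable under the current belief
    u_bar = max(sum(belief[tau] * perf[(tau, pi)][0] for tau in belief) for pi in policies)
    def score(pi):
        return sum(belief[tau] * normal_cdf(u_bar, *perf[(tau, pi)]) for tau in belief)
    return min(policies, key=score)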
3.1.3 Knowledge Gradient

The final exploration heuristic we describe is the knowledge gradient (Powell, 2010), which aims to balance exploration and exploitation through optimising myopic return whilst maintaining asymptotic optimality. The principle behind this approach is to estimate a one step look-ahead, and select the policy which maximises utility over both the current time step and the next in terms of the information gained.

To select a policy using the knowledge gradient, we choose the policy which maximises the online knowledge gradient at time t,

  π̂ = argmax_{π∈Π} ( Ũ(π) + (K − t) ν^t_π ),

trading-off between the expected utility Ũ(π), given in Equation 5, and ν^t_π, the offline knowledge gradient of π for a horizon of K trials, weighted by the remaining number of trials. The offline knowledge gradient essentially measures the performance of a one-step look-ahead in the process, given as

  ν^t_π = E_β[ max_{π′} Ũ^π(π′) − max_{π″} Ũ(π″) ],              (6)

where Ũ(π) is the expected utility of π under the current belief (Equation 5),

  Ũ^π(π′) = Σ_{τ∈T} β^π(τ) E[U | τ, π′],                          (7)

and β^π is the expected updated belief after playing policy π and receiving a suitable signal, as defined in Equation 4. That is, the offline knowledge gradient is the difference in expectation, with respect to β, of the best performance of any policy at t+1 if π was played at t, compared to that of the best policy at t (which may be different from π).

4 Experiments

4.1 Golf Club Selection

We first illustrate BPR with a simple example of a robot golfer that must choose a club to play an unfamiliar hole. It is not possible to reliably estimate the distance to the hole, as for this example we are considering a robot with weak sensors that are not in themselves sufficient to reliably measure distance. The robot is only allowed to take K = 3 shots, which is less than the number of available clubs, from a fixed position from the hole. The task is evaluated by the stopping distance of the ball to the hole. The robot can choose any of the available clubs, and we assume that the robot uses a fixed, canonical stroke with each club.

In this setting, we consider the type space T to be a set of different golfing experiences the robot had before, each defined for simplicity by how far the target was (other factors, e.g. weather conditions, could be factored into this as well). The performance of a club for some hole is defined as the negative of the absolute distance of the end location of the ball from the hole, such that this quantity must be maximised.

Then, the choice of a club corresponds to a choice of a policy. For each, the robot has a performance profile (distribution over final distance of the ball from the hole) for the different courses that the robot experienced. We assume a small selection of four clubs, with properties shown in Table 1 for the robot's canonical stroke. The distances shown in this table are the ground truth values, and are not explicitly known to the robot.
Club          Average Yardage    Standard Deviation of Yardage
π1 = 3-wood        215                    8.0
π2 = 3-iron        180                    7.2
π3 = 6-iron        150                    6.0
π4 = 9-iron        115                    4.4

Table 1 Statistics of the ranges (yardage) of the four clubs used in the golf club selection experiment. We choose to model the performance of each club by a Gaussian distribution. We assume the robot is competent with each club, and so the standard deviation is small, but related to the distance hit.
Owing to the difficulty of the outdoor perception problem over large distances, the robot cannot measure exact distances in the field, but for a feedback signal, it can crudely estimate a qualitative description of the result of a shot as falling into one of several broad categories (corresponding to concepts such as quite near and very far), which define the observation space, as shown in Figure 3. Note that this is not the performance itself, but a weaker observation correlated with performance. The distributions over these qualitative categories (the observation models) are known to the robot for each club on each of the training types it has encountered. For this example, we assume the robot has extensive training on four particular holes, with distances τ1 = 110 yds, τ2 = 150 yds, τ3 = 170 yds and τ4 = 220 yds. The observation models are shown in Figure 4.
Fig. 3 The distance ranges used to provide observation signals. The robot is only able to identify which of these bins corresponds to the end location of the ball after a swing.
Fig. 4
Performance-correlated signal models for the four golf clubs in Table 1 on four training holes with distances 110 yds, 150 yds, 170 yds and 220 yds. The models capture the probabilities of the ball landing in the corresponding distance categories. The width of each category bin has been scaled to reflect the distance range it signifies. The x-axis is the distance to the hole, such that negative values indicate under-shooting, and positive distances over-shooting the hole.
When the robot faces a new hole, BPR allows the robot to overcome its inability to judge the distance to the hole by using the feedback from an arbitrary shot as a signal. The feedback signal updates an estimate of the most similar previous task (the belief), using the distributions in Figure 4. This belief enables the robot to choose the club/clubs which would have been the best choice for the most similar previous task/tasks.

For a worked example, consider a hole 179 yards away. If a coarse estimate of the distance is feasible, it can be incorporated as a prior over T. Otherwise, an uninformed prior is used. Assume the robot is using greedy policy selection, and assume that it selects π1 for the first shot due to a uniform prior, and that this resulted in an over-shot by 35 yards. The robot cannot gauge this error more accurately than that it falls into the category corresponding to 'over-shooting in the range of 20 to 50 yards'. This signal will update the belief of the robot over the four types, and by Figure 4, the closest type to produce such a behaviour would be τ3 = 170 yards. The new belief dictates that the best club to use for anything like τ3 is π2. Using π2, the hole is over-shot by 13 yards, corresponding to the category with the range 5 to 20 yards. Using the same calculation, the most similar previous type is again τ3, keeping the best club as π2, and allowing belief to converge. Indeed, given the ground truth in Table 1, this is the best choice for the 179 yard task. Table 2 describes this process over the course of 8 consecutive shots taken by the robot.
Shot        1        2        3       4       5       6        7       8
Club        1        2        2       2       2       2        2       2
Error       35.3657  13.1603  4.2821  6.7768  2.0744  11.0469  8.1516  2.4527
Signal      20–50    5–20     -5–5    5–20    -5–5    5–20     5–20    -5–5
β entropy   1.3863   0.2237   0.0000  0.0000  0.0000  0.0000   0.0000  0.0000

Table 2 The 179 yard example. For each of 8 consecutive shots: the choice of club, the true error in distance to the hole (in yards), the coarse category within which this error lies (the signal received by the agent), and the entropy of the belief. This shows convergence after the third shot, although the correct club was used from the second shot onwards. The oscillating error is a result of the variance in the club yardage. Although the task length was K = 3 strokes, we show these results for longer to illustrate convergence.

Figure 5 shows the performance of BPR with greedy policy selection in the golf club selection task averaged over 100 unknown golf course holes, with ranges randomly selected between 120 and 220 yards. This shows that on average, by the second shot, the robot will have selected a club capable of bringing the ball within 10–15 yards of the hole.
Fig. 5
Performance of BPR on the golf club example, with results averaged over 100 unknown holes, showing the decrease in entropy of the belief β and average distance to the hole (lower is better). Performance of the pure four clubs (average performance of each single club over all 100 holes), as well as the best club for each hole in retrospect, is shown for regret comparison. Although the task length is K = 3 strokes, we show these results for longer to illustrate convergence. Shaded regions denote one standard deviation. Error bars on the individual clubs have been omitted for clarity, but their average standard deviations are 26…66 m respectively.
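To tie the example together, here is a hedged Python sketch of greedy BPR on the golf domain. The construction of the observation models (integrating each club's Gaussian over the distance categories), the outermost bin edges, and the use of −|expected error| as a proxy for E[U | τ, π] are assumptions made for illustration, so the exact shot-by-shot behaviour may differ from Table 2.

import random
from math import erf, sqrt

CLUBS = {"3-wood": (215, 8.0), "3-iron": (180, 7.2), "6-iron": (150, 6.0), "9-iron": (115, 4.4)}
TYPES = {"tau1": 110, "tau2": 150, "tau3": 170, "tau4": 220}      # training hole distances
BINS = [-float("inf"), -50, -20, -5, 5, 20, 50, float("inf")]     # assumed category edges (yards)

def cdf(x, mean, std):
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))

def signal_probs(hole, club):
    """P(category | type, club), with shot error assumed Gaussian around club range - hole."""
    mean, std = CLUBS[club]
    return [cdf(BINS[i + 1], mean - hole, std) - cdf(BINS[i], mean - hole, std)
            for i in range(len(BINS) - 1)]

def category(error):
    return next(i for i in range(len(BINS) - 1) if BINS[i] <= error < BINS[i + 1])

def greedy_bpr_golf(true_hole=179, shots=3):
    belief = {t: 1.0 / len(TYPES) for t in TYPES}
    for _ in range(shots):
        # greedy choice: -|club range - training distance| as a proxy for E[U | tau, club]
        club = max(CLUBS, key=lambda c: sum(belief[t] * -abs(CLUBS[c][0] - TYPES[t])
                                            for t in TYPES))
        error = random.gauss(CLUBS[club][0] - true_hole, CLUBS[club][1])
        obs = category(error)
        unnormalised = {t: signal_probs(TYPES[t], club)[obs] * belief[t] for t in TYPES}
        z = sum(unnormalised.values())
        belief = {t: p / z for t, p in unnormalised.items()}
        print(club, round(error, 1), {t: round(p, 3) for t, p in belief.items()})

greedy_bpr_golf()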
4.2 Online Personalisation

We next consider an online personalisation domain in a telephone banking setting, where an automated agent must interact with a caller using one of a number of language models. Let each user i be defined by a preference model of language, λ_i ∈ {1, . . . , L}, where L is the number of such models. The policy π executed by the telephonic agent also corresponds to a choice of language model, i.e. π ∈ {1, . . . , L}. The goal of the agent is to identify the user preference λ_i, whilst minimising frustration to the user.

Assume that each telephonic interaction proceeds using the transition system through the six states given in Figure 6. In every state, there is only one action which can be taken by the system, being to use the chosen language model. At the beginning of the call, the user is in the start state. We assume the system can identify the user by caller ID, and selects a language model. If, at any point, the system can deal with the user's request, the call ends successfully. If not, we assume the user becomes gradually more irritated with the system, passing through states frustrated and annoyed. If the user reaches state angry and still has an unresolved request, she is transferred to a human operator. This counts as an unsuccessful interaction. Alternatively, at any point the user may hang up the call, which also terminates the interaction unsuccessfully.

The transition dynamics of this problem depend on a parameter

  ρ = 1 − |π − λ_i| / L,

which describes how well the selected language model π can be understood by a user of type λ_i, such that ρ = 1 if the chosen model matches the user's, and it is 0 if it is the worst possible choice. An additional parameter η governs the trade-off between the user becoming gradually more frustrated and simply hanging up when the system does not respond as expected. In our experiments, we fix η to a small positive constant, except when π = λ_i, where we set η = 0.

To allow the use of different observation signals in this example, the domain was designed in a way such that the transition dynamics and the rewards of this domain, shown in Figure 6, allow only two total utility values for any instance: U = 10 for a successful completion of the task, and a single negative value otherwise, i.e. any state that transitions to the unsuccessful outcome angry receives the same reward as for a transition to the unsuccessful outcome hang up. Finally, all transition probabilities between the states start, frustrated, annoyed, and angry are independent of ρ, and thus, of the type. This set up mirrors the fact that in general the state sequence given by the signal (s, a, s′) is more informative than the reward sequence (s, a, r), which is in turn more informative than the total utility signal U.⁴

⁴ We note that in many applications U might be the only one of these signals available to the agent. For example, in the current scenario, it may not be easy or feasible to accurately gauge the frustration of the caller, making the states and the immediate rewards unobservable.
Fig. 6 Transition system describing the online telephonic personalisation example. Circles are states, thick bordered circles are terminal states, small black edge labels in square brackets are the transition rewards, and large blue edge labels are transition probabilities. See text for a description of the parameters ρ and η.

The results shown in Figure 7 were generated from 1,000 call interactions which proceeded according to the model in Figure 6. In this experiment, the correct language model for each user was randomly drawn from a set of 20 language models. Figure 7 shows comparative performance of BPR with the sampling-the-belief selection mechanism when the three kinds of signals are used. As expected, the lowest regret (and variance in regret) is achieved using the most-informative (s, a, s′) signal, followed by the (s, a, r) signal, and finally the total performance signal U. We do note, however, that all three signals eventually converge to zero regret if given enough time.

Fig. 7 Regret, showing comparative performance of BPR on the telephone banking domain, using (s, a, s′), (s, a, r), and U as signals.

4.3 Surveillance Domain

The surveillance domain models the monitoring problem laid out in the introduction. Assume a base station is tasked with monitoring a wildlife reserve spread out over some large geographical region. The reserve suffers from poaching and so the base station is required to detect and respond to poachers on the ground. The base station has a fixed location, and so it monitors the region by deploying a low-flying, light-weight autonomous drone to complete particular surveillance tasks using different strategies. The episodic commands issued by the base station may be to deploy to a specific location, scan for unusual activity in the targeted area and then report back. After completing each episode, the drone communicates with the base some information of whether or not there was any suspicious activity in the designated region. The base station is required to use that information to better decide on the next strategy for the drone.
Concretely, we consider a 26 × 26 cell grid world, which represents the wildlife reserve, and the base station is situated at a fixed location in one corner.
We assume that there are 68 target locations of interest, being areas with a particularly high concentration of wildlife. These areas are arranged around four 'hills', the tops of which provide better vantage points. Figure 8 depicts this setting.
Fig. 8
Example of the surveillance domain. The red cell in the lower corner is the location of the base station, green cells correspond to surveillance locations, and blue cells are hill tops. Visibility between locations is indicated with a green edge. The base station is tasked with deploying drones to find poachers who may infiltrate at one of the surveillance locations.
At each episode, the base station deploys the drone to one of the 68 locations. The interpretation of this in BPR is that these 68 target locations each correspond to a different poacher type, or task. For each type, we assume that there is a pre-learnt policy for reaching and surveying that area while dealing with local wind perturbations and avoiding obstacles such as trees.

The observation signal that the base station receives after each drone deployment is noise-corrupted information related to the success in identifying an intruder at the target location or somewhere nearby (in a diagonally adjacent cell). One exception is when surveying the hill centres which, by corresponding to a high vantage point, provide a weak signal stating that the intruder is in the larger area around the hill. For a distance d between the region surveyed and the region occupied by the poachers, the signal R received by the agent is

  R ← −d + ψ   if the agent surveys a hilltop and d lies within the larger area around the hill,
  R ← −d + ψ   if the agent surveys any other location and d lies within a diagonally adjacent cell,
  R ← ψ        otherwise,
where ψ ∼ N(10, 20) is Gaussian noise. A higher signal indicates more confidence in having observed a poacher in the region surrounding the target surveillance point.
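A minimal Python sketch of this observation signal is given below. The distance thresholds for 'around the hill' and 'diagonally adjacent', and the reading of N(10, 20) as mean 10 and standard deviation 20, are assumptions for illustration, since the text leaves these values unspecified.

import random

HILL_RADIUS = 5.0     # assumed radius of the larger area around a hill
NEAR_RADIUS = 1.5     # assumed radius covering a diagonally adjacent cell

def observation_signal(d, surveyed_hilltop):
    """Noisy signal R for distance d between the surveyed region and the poachers."""
    psi = random.gauss(10.0, 20.0)
    radius = HILL_RADIUS if surveyed_hilltop else NEAR_RADIUS
    return -d + psi if d <= radius else psi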
Figure 9 presents a comparison between six variants of the BPR algorithm. Four use the exploration heuristics proposed in Section 3, namely BPR-KG, BPR-BE, BPR-PI and BPR-EI, in addition to sampling the belief β, and ε-greedy selection with ε = 0.3. These six variants were run on the domain and averaged over 10 random tasks, with standard deviations of the regret shown in Table 3.
Fig. 9
Comparison of the six policy selection heuristics on the 68-task surveillance domain, averaged over 10 random tasks. (a) The entropy of the belief after each episode. (b) The regret after each episode. Error bars are omitted for clarity, but standard deviations of the regret are shown in Table 3.
episode   ε-greedy   sampling   BPR-KG    BPR-BE    BPR-PI    BPR-EI
5         72.193     103.07     76.577    96.529    15.695    27.801
10        97.999     86.469     75.268    91.288    62.4      33.112
20        83.21      18.834     7.1152    17.74     72.173    16.011
50        86.172     10.897     18.489    13.813    101.69    11.142
Table 3 Standard deviations of the regret for the six BPR variants shown in Figure 9, after episodes 5, 10, 20 and 50.
Note in Figure 9(a) that BPR-BE, BPR-KG, and BPR with sampling the belief all converge in about 15 episodes, which is approximately a quarter of the number that would be required by a brute force strategy which involved testing every policy in turn. Both BPR-PI and BPR with ε-greedy selection fail to converge within the allotted 50 episodes. BPR-EI shows the most rapid convergence.

We now compare the performance of BPR to other approaches from the literature. We choose two frameworks, multi-armed bandits for which we use UCB1 (Auer et al., 2002), and Bayesian optimisation where we use GP-UCB (Srinivas et al., 2009). We note upfront that although these frameworks share many elements with our own framework in terms of the problems they solve, the assumptions they place on the problem space are different, and thus so is the information they use.

The results of comparing performance of these approaches are presented in Figure 10 on the surveillance domain, averaged over 50 tasks. We use BPR-EI in this experiment as it was the best performing BPR variant as seen in Figure 9.
Fig. 10
Comparison of the episodic regret with time, averaged over 50 random tasks, of BPR-EI, a multi-armed bandits approach (UCB1), and a Bayesian optimisation approach (GP-UCB) on the 68 task surveillance domain. Shaded regions represent one standard deviation.
For UCB1, we treat each existing policy in the library as a different arm of the bandit. 'Pulling' an arm corresponds to executing that policy, and the appropriate reward is obtained. We additionally provide UCB1 with a prior in the form of expected performance of each policy given the task distribution G(τ), which we assumed to be uniform in this case. This alleviates UCB1 from having to test each arm first on the new task (which would require 68 episodes) before it can begin the decision making process. It is still slower to converge than BPR, as information from each episode only allows UCB1 to update the performance estimate of a single policy, whereas BPR can make global updates over the policy space.

On the other hand, an optimisation-based approach such as GP-UCB is better suited to this problem, as it operates with the same requirement as BPR of maintaining low sample complexity. This algorithm treats the set of policies as an input space, and is required to select the point in this space which achieves the best performance on the current task. However, unlike BPR, this approach requires a metric in policy space. This information is not known in this problem, but we approximate this from performance in the training tasks. As a result of this approximation, sampling a single point in GP-UCB (corresponding to executing a policy) again only provides information about a local neighbourhood in policy space, whereas selecting the same action would allow BPR to update beliefs over the entire task space.

Further discussion of the differences between BPR and both bandits and optimisation approaches is provided in Sections 5.2 and 5.3.1 respectively.

Finally, we explore the trade-off between library size and sample complexity with respect to the regret of BPR-EI, BPR-PI, BPR-BE, and BPR-KG. This is shown in Figure 11 where, for each method, the horizontal axis shows the ratio of the library size to the full task space size, the vertical axis shows the number of episodes allowed for each new instance, and regret is represented by colour. For each combination of a library size and a sample complexity, we average the regret results over 200 trials. In each of these trials, a random subset of the full task space is used as the offline policy library and the online task is drawn from the full task space. That is, tasks in the online phase include both previously-solved and new tasks.

As can be seen from the figure, regret can be decreased by either increasing the library size or the time allocated (in terms of number of episodes) to complete the new task. Usually, the task specification dictates the maximum allowed number of episodes, and hence, this suggests a suitable library size to be acquired in the offline phase to attain a specific regret rate. This figure also confirms the previous findings that BPR-EI is able to exceed the other variants in terms of performance.

Fig. 11 Average episodic regret for running BPR-EI, BPR-PI, BPR-BE, and BPR-KG on the 68 task surveillance domain, with different library sizes (as a proportion of the full task space size) and number of episodes (sample complexity), averaged over 200 random tasks.

5 Related Work

5.1 Relation to Transfer Learning

Bayesian Policy Reuse aims to select a policy in a library Π for transferring to a new, initially unknown, instance. The criterion for this choice is that it is the best policy for the type most similar to the type of the new instance. One transfer approach that considers the similarity between source and target tasks is by Lazaric (2008), where generated (s, a, r, s′) samples from the target task are
Specifically, Bayesian Policy Reuse aims to select a policy from a library Π for transfer to a new, initially unknown, instance. The criterion for this choice is that it is the best policy for the type most similar to the type of the new instance. One transfer approach that considers the similarity between source and target tasks is that of Lazaric (2008), where (s, a, r, s′) samples generated from the target task are used to estimate similarity to source tasks, measured by the average probability of the generated transitions occurring under the source task. Samples from the more similar source tasks are then used to seed the learning of the target task, while less similar tasks are avoided, thereby escaping negative transfer. More recently, Brunskill and Li (2013) consider using the (s, a, r, s′) similarity to compute confidence intervals for where, in a collection of MDP classes, a new instance best fits. The classes are acquired from experience by grouping together MDPs that do not differ in their transition dynamics or rewards by more than a certain amount. Once the class is determined, the previous knowledge of that class, in the form of dynamics and rewards, is borrowed to inform planning. Bayesian Policy Reuse does not assume that learning is feasible, but relies on transferring a useful policy immediately. Also, we use a Bayesian measure of task similarity which allows exploiting prior knowledge of the task space, quickly incorporating observed signals for a faster response, and, by maintaining beliefs, keeping open the possibility of new MDPs that do not cleanly fit into any of the discovered classes.
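As a rough sketch of this kind of sample-based similarity measure (our own illustration under an assumed model interface, not Lazaric's implementation), each source task can be scored by the average likelihood it assigns to transitions sampled from the target task:

```python
def transition_similarity(target_samples, source_models):
    """Score each source task by the mean probability it assigns to
    (s, a, r, s') samples drawn from the target task.

    target_samples : list of (s, a, r, s_next) tuples from the target task
    source_models  : dict mapping source-task id to a model exposing
                     prob(s, a, r, s_next) -> probability of that transition
                     and reward under the source task (assumed interface)
    """
    scores = {}
    for task_id, model in source_models.items():
        probs = [model.prob(s, a, r, s_next) for (s, a, r, s_next) in target_samples]
        scores[task_id] = sum(probs) / len(probs)   # average transition likelihood
    return scores  # higher score = more similar source task
```

Source tasks with high scores would then contribute samples to learning the target task, while low-scoring tasks are left out.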
5.2 Relation to Multi-Armed Bandits

The policy reuse problem is closely related to the multi-armed bandit setting, in which an agent repeatedly chooses between a fixed set of arms with unknown payoffs. In our problem setting, the arms correspond to policies, and the new task instance corresponds to the new bandit 'machine' that generates utilities per arm pull (policy execution).

In the correlated bandits literature, the form of the correlation between the arms is known to the agent, usually as the functional form of the reward curve. The agent's task is then to identify the parameters of that curve, so that the hypothesis of the best arm moves in the reward curve's parameter space. For example, in Response Surface Bandits (Ginebra and Clayton, 1995), there is a known prior over the parameters of the reward curve and the metric on the policy space is known. More recently, Mersereau et al. (2009) present a greedy policy which takes advantage of the correlation between the arms in their reward functions, assuming a linear form with one parameter and a known prior. In our work, we approach a space of tasks from a sampling point of view, where an agent experiences sample tasks and uses these to build models of the domain. Thus we do not assume any functional form for the response surface, and we do not require the metric on the policy space to be known.

In our framework, we only assume continuity and smoothness of the surface. We treat the known types as a set of learnt bandit machines with known behaviour for each of the different arms. These behaviours define local 'kernels' on the response surface, which we then approximate by a sparse kernel machine, and we track a hypothesis of the best arm in that space. This is to some extent similar to the Gaussian process framework, but in our case the lack of a metric on the policy space prevents the definition of the covariance functions needed there. This point is elaborated further in Section 5.3.1.

In another thread, Dependent Bandits (Pandey et al., 2007) assume that the arms in a multi-armed bandit can be clustered into different groups, such that the members of each group have correlated reward distribution parameters. Each cluster is then represented by one representative arm, and the algorithm proceeds in two steps: a cluster is first chosen by a variant of UCB1 (Auer et al., 2002) applied to the set of representative arms, and then the same method is used again to choose between the arms of the chosen cluster. We assume in our work that the set of previously-solved tasks spans and represents the space well, but we do not dwell on how this set of tasks can be selected. Clustering is one good candidate, and one particular example of identifying the important types in a task space can be seen in the work of Mahmud et al. (2013).

In Contextual Bandits (Strehl et al., 2006; Langford and Zhang, 2008), the agent is able to observe side information (or context labels) related to the nature of the bandit machine, and the question becomes one of selecting the best arm for each possible context. Mapping this setting to our problem, a context represents the type, whereas the arms represent the policies. The difference is that in our case the context information (the type label) is latent, and the space of types is not fully known, meaning that a bounded set of hypotheses of policy correlation under types cannot be constructed. In addition, in our setting the response of the arms to contexts is captured only through limited offline sampling, although the agent is able to engage with the same context for multiple rounds.

Another related treatment is that of latent bandits (Maillard and Mannor, 2014) where, in the single-cluster arrival case, the experienced bandit machine is drawn from a single cluster with known reward distributions, and in the agnostic case the instances are drawn from many unknown clusters with unknown reward distributions. Our setting fits between these two extremes, as the instances are drawn from a single, but unknown, cluster with an unknown reward distribution.
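To make the two-step selection of Dependent Bandits described above concrete, a minimal sketch of one decision round might look as follows; the choice of the highest-mean arm as a cluster's representative is our own simplifying assumption, not necessarily that of Pandey et al. (2007).

```python
import math

def ucb_index(mean, count, total):
    """Standard UCB1 index for one arm."""
    return mean + math.sqrt(2.0 * math.log(total) / count)

def dependent_bandit_step(clusters, stats, t):
    """One decision of a two-level UCB1 scheme over clustered arms.

    clusters : dict cluster_id -> list of arm ids
    stats    : dict arm_id -> (mean, count), with count >= 1 for every arm
    t        : total number of pulls so far (>= number of arms)
    Returns the arm to pull next.
    """
    # Step 1: choose a cluster via UCB1 applied to its representative arm
    # (here taken to be the arm with the highest empirical mean in the cluster).
    def representative(cid):
        return max(clusters[cid], key=lambda a: stats[a][0])
    cid = max(clusters, key=lambda c: ucb_index(*stats[representative(c)], total=t))

    # Step 2: choose an arm within the selected cluster, again by UCB1.
    pulls_in_cluster = sum(stats[a][1] for a in clusters[cid])
    return max(clusters[cid], key=lambda a: ucb_index(*stats[a], total=pulls_in_cluster))
```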
5.3 Relation to Bayesian Approaches

5.3.1 Bayesian Optimisation

If the problem of Bayesian Policy Reuse is treated as an instance of Bayesian optimisation, we consider the objective

    π* = argmax_{π ∈ Π} E[U | x*, π],        (8)

where x* ∈ X is the unknown process with which the agent is interacting, and E[U | x*, π] is the expected performance when playing π on x*. This optimisation involves a selection from a discrete set of alternative policies (π ∈ Π), corresponding to sampling the performance function at a discrete set of locations. However, sampling from this function is expensive (corresponding to executing a policy for an episode), and as a result the performance function must be optimised in as few samples as possible.

A Bayesian optimisation solution requires the target function to be modelled as a Gaussian process (GP). There are two issues here:
1. Observations in BPR need not be the performance itself (see Section 2.8), while the GP model is appropriate only when the two coincide.
2. BPR does not assume knowledge of a metric in policy space. This is, however, required for Bayesian optimisation in order to define a kernel function for the Gaussian process. An exception is the case where the policies all belong to a parametrised family of behaviours, so that the metric in parameter space can act as a proxy for a metric in policy space.

Still, we assume smoothness and continuity of the response surface for similar tasks and policies, which is also a standard assumption in Gaussian process regression. Bayesian Policy Reuse uses a belief that tracks the most similar previously-solved type, and then reuses the best policy for that type. This belief can be understood as the mixing coefficient in a mixture model that represents the response surface.
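The following sketch illustrates this belief-based approximation of Equation 8: a posterior over known types is maintained from observed signals, and policies are scored by their belief-weighted expected utility. The function names and the greedy selection rule are illustrative; the exploration heuristics defined in the paper refine this selection step by also accounting for uncertainty in the belief.

```python
def update_type_belief(belief, policy, signal, obs_model):
    """Bayes update of the belief over known types after one episode.

    belief    : dict type_id -> probability (sums to 1)
    policy    : the policy that was executed
    signal    : the observation received (e.g. performance, or a reward signal)
    obs_model : callable(type_id, policy, signal) -> P(signal | type, policy),
                learnt offline for each known type (assumed interface)
    """
    posterior = {t: belief[t] * obs_model(t, policy, signal) for t in belief}
    norm = sum(posterior.values())
    return {t: p / norm for t, p in posterior.items()} if norm > 0 else belief


def select_policy(belief, policies, perf_model):
    """Greedy selection: the policy with highest belief-weighted expected utility.

    perf_model : callable(type_id, policy) -> expected utility of that policy
                 on that type, learnt offline (assumed interface)
    """
    def expected_utility(pi):
        return sum(belief[t] * perf_model(t, pi) for t in belief)
    return max(policies, key=expected_utility)
```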
To see how the belief acts as a mixing coefficient, consider Figure 12, which shows an example 2D response surface. Each type is represented by a 'band' on that surface: a set of curves that are only precisely known (in terms of means and variances) at their intersections with a small collection of known policies. Projecting the intersections of some type into performance space results in a probabilistic description of the performance of the different policies on that type (the Gaussian processes in the figure), which is the kind of function we are trying to optimise in Equation 8. Each of these projections would be a component of the mixture model that represents the response surface, and would be that type's contribution to it.

Fig. 12 An example 2D response surface. The 'bands' on the surface show two types, and the lines that run through the surface from left to right are policy performances for all types. The agent only has access to the intersections of the types' bands with the policy curves (the black dots). Shown on the left are the performance curves of the two types under all policies, represented as Gaussian processes in the Policies–Performance plane. Note that Fig. 1 is a projection of this response surface.

Any new task instance corresponds to an unknown curve on the surface, and correspondingly to a probabilistic model in performance space. Given that the only knowledge the agent possesses of the surface is these Gaussian processes for each known type, Bayesian Policy Reuse implicitly assumes that they act as a basis that spans the space of possible curves, so that the performance under any new task can be represented as a weighted average of the responses of the previously-solved types. (Note that this will create a bias in the agent's estimated model of the type space toward the types that have been seen more often before. We assume that the environment is benign and that the offline phase is long enough to experience the necessary types.) To this extent, the performance for the new task instance is approximately identified by a vector of weights, which in our treatment of BPR we refer to as the type belief. Thus, the BPR algorithm is one that fits a probabilistic model to an unknown performance curve (Equation 8) through sampling and weight adjustment in an approximate mixture of Gaussian processes.

5.3.2 Bayesian Reinforcement Learning

Bayesian Reinforcement Learning (BRL) is a paradigm of reinforcement learning that handles the uncertainty in an unknown MDP in a Bayesian manner, by maintaining a probability distribution over the space of possible MDPs and updating that distribution using the observations generated from the MDP as the interaction continues (Dearden et al., 1999). In work by Wilson et al. (2007), the problem of multi-task reinforcement learning over a possibly-infinite stream of MDPs is handled in a Bayesian framework. The authors model the MDP generative process using a hierarchical infinite mixture model, in which any MDP is assumed to be generated from one of a set of initially-unknown classes, and a hyper-prior controls the distribution of the classes.

Bayesian Policy Reuse can be regarded as a special instance of Bayesian multi-task reinforcement learning with the following construction. Assume a Markov decision process that has a chain of K identical states (representing the trials) and a collection of viable actions that connect each state to the next in the chain. The set of actions is given by Π, the policy library. The processes are parametrised by their type label τ. At each decision step, the agent takes an action (a policy π ∈ Π) and the process returns a performance signal, U^π_τ. The task of the agent is to infer the best 'policy' for this process (a permutation of K policies from Π, π_0, ..., π_{K−1}) that achieves the fastest convergence of the values U, and thus low convergence time and low regret. The performance/observation models act as the Bayesian prior over rewards required in Bayesian reinforcement learning.
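Under this construction, interacting with a new instance is a K-step process whose 'actions' are the library policies. A minimal simulation, reusing the update_type_belief and select_policy sketches from above (again with hypothetical names and a greedy selection rule in place of the paper's exploration heuristics), could look as follows:

```python
def run_bpr_on_instance(policies, prior, obs_model, perf_model, run_episode, K):
    """Play K episodes on a new instance of hidden type, reusing library policies.

    prior       : initial belief over known types, e.g. uniform
    run_episode : callable(policy) -> (signal, utility) observed on the new instance
    Returns the utilities collected over the K trials.
    """
    belief = dict(prior)                 # copy the prior belief
    utilities = []
    for _ in range(K):
        pi = select_policy(belief, policies, perf_model)          # choose a library policy
        signal, u = run_episode(pi)                               # one episode on the instance
        belief = update_type_belief(belief, pi, signal, obs_model)
        utilities.append(u)
    return utilities
```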
Engel and Ghavamzadeh (2007) introduce a Bayesian treatment of the policy gradient method in reinforcement learning. The gradient in a parametrised policy space is modelled as a Gaussian process, and paths sampled from the MDP (completed episodes) are used to compute the posteriors and to optimise the policy by moving in the direction of the performance gradient. The use of Gaussian processes in policy space is similar to the interpretation of our approach, but there they are used to model the gradient rather than the performance itself.

When no gradient information is available to guide the search, Wingate et al. (2011) propose using MCMC to search in a space of policies endowed with a prior, and discuss various kinds of hierarchical priors that can be used to bias the search. In our work, we choose policies using exploration heuristics based on offline-acquired performance profiles rather than kernels and policy priors. Furthermore, we have access only to a small set of policies to search through, in order to optimise the time of response.

5.4 Storage Complexity

As described in Section 2.8, the use of different signals entails different observation models and hence different storage complexities. Assume that |S| is the size of the state space, |A| is the size of the action space, |Π| is the size of the policy library, N is the number of previously-solved types, |R| is the size of the reward space, T is the duration of an episode, and B is the number of bits needed to store one probability value. For the performance signal, the storage complexity of the observation model is upper bounded by SC_U = |Π| N |R| B for the average reward case, and SC_{U,γ} = |Π| N ((1 − γ^T)/(1 − γ)) |R| B for the discounted reward case. For the state–action–state signal we have SC_{s′} = |Π| N |R| |S| |A| B, and for the immediate reward signal we have SC_r = |Π| N |S| |A| B. In applications where |R| > |S|, we obtain the ordering SC_U < SC_r < SC_{s′}.
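As a quick arithmetic illustration of these bounds (the domain sizes below are made up for the example), one can evaluate the expressions directly:

```python
def storage_bounds(S, A, Pi, N, R, B=32):
    """Upper bounds (in bits) on observation-model storage for each signal type,
    following the expressions above."""
    sc_perf = Pi * N * R * B                 # episodic performance signal
    sc_reward = Pi * N * S * A * B           # immediate reward signal
    sc_transition = Pi * N * R * S * A * B   # state-action-state signal
    return sc_perf, sc_reward, sc_transition

# Example: 100 states, 5 actions, a library of 10 policies, 8 known types,
# 20 distinguishable reward levels, 32-bit probability values.
print(storage_bounds(S=100, A=5, Pi=10, N=8, R=20))
```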
6 Conclusion

In this paper we address the policy reuse problem, which involves responding to an unknown task instance by selecting between a number of policies available to the agent, so as to minimise regret with respect to the best policy in the set within a small number of episodes. This problem is motivated by many application domains where tasks have short durations, such as human interaction and personalisation, or monitoring tasks.

We introduce Bayesian Policy Reuse, a Bayesian framework for solving this problem. The algorithm tracks a probability distribution (belief) over a set of known tasks, capturing their similarity to the new instance that the agent is solving. The belief is updated with the aid of side information (signals) available to the agent: observation signals acquired online for the new instance, and signal models acquired offline for each policy. To balance the trade-off between exploration and exploitation, several mechanisms for selecting policies from the belief (exploration heuristics) are also described, giving rise to different variants of the core algorithm.

This approach is empirically evaluated in three simulated domains, where we compare the different variants of BPR and contrast their performance with related approaches. In particular, we compare the performance of BPR with a multi-armed bandit algorithm (UCB1) and a Bayesian optimisation method (GP-UCB). We also show the effect of using different kinds of observation signals on the convergence of the belief, and we illustrate the trade-off between library size and the sample complexity required to achieve a required level of performance on a task.

The problem of policy reuse as defined in this paper has many connections with related settings from the literature, especially in multi-armed bandit research. However, it also has certain features that do not allow it to be reduced exactly to any one of them. The contributed problem definition and the proposed Bayesian approach are first steps toward a practical solution that can be applied to real-world scenarios where traditional learning approaches are not feasible.
Acknowledgements
This work has taken place in the Robust Autonomy and Decisions group within the School of Informatics, University of Edinburgh. This research has benefitted from support by the UK Engineering and Physical Sciences Research Council (grant number EP/H012338/1) and the European Commission (TOMSY and SmartSociety grants).
References
R. Amin, K. Thomas, R.H. Emslie, T.J. Foose, and N. Strien. An overview of the conservation status of and threats to rhinoceros species in the wild. International Zoo Yearbook, 40(1):96–117, 2006.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
Eric Brochu, Vlad M. Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
Emma Brunskill and Lihong Li. Sample complexity of multi-task reinforcement learning. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
Loc Bui, Ramesh Johari, and Shie Mannor. Clustered bandits. CoRR, abs/1206.4169, 2012.
B.C. da Silva, G.D. Konidaris, and A.G. Barto. Learning parameterized skills. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, June 2012.
R. Dearden, N. Friedman, and D. Andre. Model based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 150–159. Morgan Kaufmann Publishers Inc., 1999.
Yaakov Engel and Mohammad Ghavamzadeh. Bayesian policy gradient algorithms. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, volume 19, page 457. MIT Press, 2007.
F. Fernández and M. Veloso. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 720–727. ACM, 2006.
Josep Ginebra and Murray K. Clayton. Response surface bandits. Journal of the Royal Statistical Society. Series B (Methodological), pages 771–784, 1995.
J. C. Gittins and D. Jones. A dynamic allocation index for the discounted multi-armed bandit problem. Progress in Statistics, pages 241–266, 1974.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In J.C. Platt, D. Koller, Y. Singer, and S.T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 817–824. Curran Associates, Inc., 2008.
Alessandro Lazaric. Knowledge transfer in reinforcement learning. PhD thesis, Politecnico di Milano, 2008.
M. M. Hassan Mahmud, Majd Hawasly, Benjamin Rosman, and Subramanian Ramamoorthy. Clustering Markov decision processes for continual transfer. arXiv preprint arXiv:1311.3959, 2013.
M. M. Hassan Mahmud, Benjamin Rosman, Subramanian Ramamoorthy, and Pushmeet Kohli. Adapting interaction environments to diverse users through online action set selection. In AAAI 2014 Workshop on Machine Learning for Interactive Systems, 2014.
Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In Proceedings of The 31st International Conference on Machine Learning, pages 136–144, 2014.
Adam J. Mersereau, Paat Rusmevichientong, and John N. Tsitsiklis. A structured multiarmed bandit problem and the greedy policy. IEEE Transactions on Automatic Control, 54(12):2787–2802, 2009.
José Niño-Mora. Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing, 23(2):254–267, 2011.
Sandeep Pandey, Deepayan Chakrabarti, and Deepak Agarwal. Multi-armed bandit problems with dependent arms. In Proceedings of the 24th International Conference on Machine Learning, pages 721–728. ACM, 2007.
Warren B. Powell. The knowledge gradient for optimal learning. Wiley Encyclopedia of Operations Research and Management Science, 2010.
Benjamin Rosman, Subramanian Ramamoorthy, M. M. Hassan Mahmud, and Pushmeet Kohli. On user behaviour adaptation under interface change. In International Conference on Intelligent User Interfaces, 2014.
Aleksandrs Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568, 2014.
Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
Alexander L. Strehl, Chris Mesterharm, Michael L. Littman, and Haym Hirsh. Experience-efficient learning in associative bandit problems. In Proceedings of the 23rd International Conference on Machine Learning, pages 889–896. ACM, 2006.
M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.
A. Wilson, A. Fern, S. Ray, and P. Tadepalli. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pages 1015–1022. ACM, 2007.
David Wingate, Noah D. Goodman, Daniel M. Roy, Leslie P. Kaelbling, and Joshua B. Tenenbaum. Bayesian policy search with policy priors. In