DISCOVERING A SET OF POLICIES FOR THE WORST-CASE REWARD

Tom Zahavy*, Andre Barreto, Daniel J Mankowitz, Shaobo Hou, Brendan O'Donoghue, Iurii Kemaev and Satinder Singh
DeepMind
*[email protected]

Published as a conference paper at ICLR 2021

ABSTRACT
We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators like generalized policy improvement. Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and the algorithm only stops when there does not exist a policy that leads to improved performance. We empirically evaluate our algorithm on a grid world and also on a set of domains from the DeepMind control suite. We confirm our theoretical results regarding the monotonically improving performance of our algorithm. Interestingly, we also show empirically that the sets of policies computed by the algorithm are diverse, leading to different trajectories in the grid world and very distinct locomotion skills in the control suite.
1 INTRODUCTION
Reinforcement learning (RL) is concerned with building agents that can learn to act so as to maximize reward through trial-and-error interaction with the environment. There are several reasons why it can be useful for an agent to learn about multiple ways of behaving, i.e., learn about multiple policies. The agent may want to achieve multiple tasks (or subgoals) in a lifelong learning setting and may learn a separate policy for each task, reusing them as needed when tasks reoccur. The agent may have a hierarchical architecture in which many policies are learned at a lower level while an upper-level policy learns to combine them in useful ways, such as to accelerate learning on a single task or to transfer efficiently to a new task. Learning about multiple policies in the form of options (Sutton et al., 1999a) can be a good way to achieve temporal abstraction; again this can be used to quickly plan good policies for new tasks. In this paper we abstract away from these specific scenarios and ask the following question: what set of policies should the agent pre-learn in order to guarantee good performance under the worst-case reward? A satisfactory answer to this question could be useful in all the scenarios discussed above and potentially many others.

There are two components to the question above: (i) what policies should be in the set, and (ii) how to compose a policy to be used on a new task from the policies in the set. To answer (ii), we propose the concept of a set improving policy (SIP). Given any set of $n$ policies, a SIP is any composition of these policies whose performance is at least as good as, and generally better than, that of all of the constituent policies in the set. We present two policy composition (or improvement) operators that lead to a SIP. The first is called set-max policy (SMP). Given a distribution over states, a SMP chooses from the $n$ policies the one that leads to the highest expected value. The second SIP operator is generalized policy improvement (Barreto et al., 2017, GPI). Given a set of $n$ policies and their associated action-value functions, GPI is a natural extension of regular policy improvement in which the agent acts greedily in each state with respect to the maximum over the set of action-value functions. Although SMP provides weaker guarantees than GPI (we will show this below), it is more amenable to analysis and thus we will use it exclusively for our theoretical results. However, since SMP's performance serves as a lower bound to GPI's, the results we derive for the former also apply to the latter. In our illustrative experiments we will show this result empirically.

Now that we have fixed the answer to (ii), i.e., how to compose pre-learned policies for a new reward function, we can leverage it to address (i): what criterion to use to pre-learn the policies. Here, one can appeal to heuristics such as the ones advocating that the set of pre-learned policies should be as diverse as possible (Eysenbach et al., 2018; Gregor et al., 2016; Grimm et al., 2019; Hansen et al., 2019). In this paper we will use the formal criterion of robustness, i.e., we will seek a set of policies that do as well as possible in the worst-case scenario. Thus, the problem of interest to this paper is as follows: how to define and discover a set of $n$ policies that maximize the worst possible performance of the resulting SMP across all possible tasks?
Interestingly, as we will discuss, the solution to this robustness problem naturally leads to a diverse set of policies.

To solve the problem posed above we make two assumptions: (A1) that tasks differ only in their reward functions, and (A2) that reward functions are linear combinations of known features. These two assumptions allow us to leverage the concept of successor features (SFs) and work in apprenticeship learning. As our main contribution in this paper, we present an algorithm that iteratively builds a set of policies such that SMP's performance with respect to the worst-case reward provably improves in each iteration, stopping when no such greedy improvement is possible. We also provide a closed-form expression to compute the worst-case performance of our algorithm at each iteration. This means that, given tasks satisfying Assumptions A1 and A2, we are able to provably construct a SIP that can quickly adapt to any task with guaranteed worst-case performance.
Related Work. The proposed approach has interesting connections with hierarchical RL (HRL) (Sutton et al., 1999b; Dietterich, 2000). We can think of SMP (and GPI) as a higher-level policy-selection mechanism that is fixed a priori. Under this interpretation, the problem we are solving can be seen as the definition and discovery of lower-level policies that will lead to a robust hierarchical agent.

There are interesting parallels between robustness and diversity. For example, diverse stock portfolios have less risk. In robust least squares (El Ghaoui & Lebret, 1997; Xu et al., 2009), the goal is to find a solution that will perform well with respect to (w.r.t) data perturbations. This leads to a min-max formulation, and there are known equivalences between solving a robust (min-max) problem and the diversity of the solution (via regularization) (Xu & Mannor, 2012). Our work is also related to robust Markov decision processes (MDPs) (Nilim & El Ghaoui, 2005), but our focus is on a different aspect of the problem. While in robust MDPs the uncertainty is w.r.t the dynamics of the environment, here we focus on uncertainty w.r.t the reward and assume that the dynamics are fixed. More importantly, we are interested in the hierarchical aspect of the problem – how to discover and compose a set of policies. In contrast, solutions to robust MDPs are typically composed of a single policy.

In Apprenticeship Learning (AL; Abbeel & Ng, 2004) the goal is also to solve a min-max problem in which the agent is expected to perform as well as an expert w.r.t any reward. If we ignore the expert, AL algorithms can be used to find a single policy that performs well w.r.t any reward. The solution to this problem (when there is no expert) is the policy whose SFs have the smallest possible norm. When the SFs are in the simplex (as in tabular MDPs) the vector with the smallest $\ell_2$ norm puts equal probabilities on its coordinates, and is therefore "diverse" (making an equivalence between the robust min-max formulation and the diversity perspective). In that sense, our problem can be seen as a modified AL setup where: (a) no expert demonstrations are available; (b) the agent is allowed to observe the reward at test time; and (c) the goal is to learn a set of constituent policies.
2 PRELIMINARIES

We will model our problem of interest using a family of Markov Decision Processes (MDPs). An MDP is a tuple $M \triangleq (S, A, P, r, \gamma, D)$, where $S$ is the set of states, $A$ is the set of actions, $P = \{P^a \mid a \in A\}$ is the set of transition kernels, $\gamma \in [0, 1)$ is the discount factor and $D$ is the initial state distribution. The function $r: S \times A \times S \mapsto \mathbb{R}$ defines the rewards, and thus the agent's objective; here we are interested in multiple reward functions, as we explain next.

Let $\phi(s, a, s') \in [0, 1]^d$ be an observable vector of features (our analysis only requires the features to be bounded; we use $[0, 1]$ for ease of exposition). We are interested in the set of tasks induced by all possible linear combinations of the features $\phi$. Specifically, for any $\mathbf{w} \in \mathbb{R}^d$, we can define a reward function $r_\mathbf{w}(s, a, s') = \mathbf{w} \cdot \phi(s, a, s')$. Given $\mathbf{w}$, the reward $r_\mathbf{w}$ is well defined and we will use the terms $\mathbf{w}$ and $r_\mathbf{w}$ interchangeably to refer to the RL task induced by it. Formally, we are interested in the following set of MDPs:

$$M^\phi \triangleq \{(S, A, P, r_\mathbf{w}, \gamma, D) \mid \mathbf{w} \in \mathcal{W}\}. \quad (1)$$

In general, $\mathcal{W}$ is any convex set, but we will focus on the $\ell_2$ $d$-dimensional ball denoted by $\mathcal{W} = B_2$. This choice is not restricting, since the optimal policy in an MDP is invariant with respect to the scale of the rewards and the $\ell_2$ ball contains all the directions.

A policy in an MDP $M \in M^\phi$, denoted by $\pi \in \Pi$, is a mapping $\pi: S \to \mathcal{P}(A)$, where $\mathcal{P}(A)$ is the space of probability distributions over $A$. For a policy $\pi$ we define the successor features (SFs) as

$$\psi^\pi(s, a) \triangleq (1 - \gamma) \cdot \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t \phi(s_t, a_t, s_{t+1}) \mid P, \pi, s_t = s, a_t = a\Big]. \quad (2)$$

The multiplication by $1 - \gamma$, together with the fact that the features $\phi$ are in $[0, 1]$, assures that $\psi^\pi(s, a) \in [0, 1]^d$ for all $(s, a) \in S \times A$.¹ We also define SFs that are conditioned on the initial state distribution $D$ and the policy $\pi$ as $\psi^\pi \triangleq \mathbb{E}[\psi^\pi(s, a) \mid D, \pi] = \mathbb{E}_{s \sim D, a \sim \pi(s)} \psi^\pi(s, a)$. It should be clear that the SFs are conditioned on $D$ and $\pi$ whenever they are not written as a function of states and actions like in Eq. (2). Note that, given a policy $\pi$, $\psi^\pi$ is simply a vector in $[0, 1]^d$. Since we will be dealing with multiple policies, we will use superscripts to refer to them—that is, we use $\pi^i$ to refer to the $i$-th policy. To keep the notation simple, we will refer to the SFs of policy $\pi^i$ as $\psi^i$. We define the action-value function (or $Q$-function) of policy $\pi$ under reward $r_\mathbf{w}$ as

$$Q^\pi_\mathbf{w}(s, a) \triangleq (1 - \gamma) \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t \phi(s_t, a_t, s_{t+1}) \cdot \mathbf{w} \mid P, \pi, s_t = s, a_t = a\Big] = \psi^\pi(s, a) \cdot \mathbf{w}.$$

We define the value of a policy $\pi$ as $v^\pi_\mathbf{w} \triangleq (1 - \gamma) \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t \mathbf{w} \cdot \phi(s_t) \mid \pi, P, D\big] = \psi^\pi \cdot \mathbf{w}$. Note that $v^\pi_\mathbf{w}$ is a scalar, corresponding to the expected value of policy $\pi$ under the initial state distribution $D$, given by

$$v^\pi_\mathbf{w} = \mathbb{E}[Q^\pi_\mathbf{w}(s, a) \mid D, \pi] = \mathbb{E}_{s \sim D, a \sim \pi(s)} Q^\pi_\mathbf{w}(s, a). \quad (3)$$

¹While we focus on the most common, discounted RL criterion, all of our results hold in the finite-horizon and average-reward criteria (see, for example, Puterman (1984)). Concretely, in these scenarios there exist normalizations for the SFs whose effect is equivalent to that of the multiplication by $1 - \gamma$. In the finite-horizon case we can simply multiply the SFs by $1/H$. In the average-reward case, there is no multiplication (Zahavy et al., 2020b) and the value function is measured under the stationary distribution (instead of $D$).
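Because $\psi^\pi$ is just a $d$-dimensional vector, evaluating a policy on any task $\mathbf{w}$ reduces to a dot product. The following is a minimal sketch of a Monte-Carlo estimate of the SFs and of Eq. (3); the `env`, `policy`, and `phi` interfaces are illustrative assumptions, not part of the paper.

```python
import numpy as np

def estimate_sfs(env, policy, phi, gamma, n_rollouts=100, horizon=1000):
    """Monte-Carlo estimate of psi^pi = (1 - gamma) E[sum_t gamma^t phi_t] under D.
    Assumed interfaces: env.reset() -> s, env.step(a) -> (s_next, done),
    policy(s) -> a, phi(s, a, s_next) -> np.ndarray of shape (d,)."""
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset()                  # s ~ D, the initial state distribution
        discount = 1.0
        for _ in range(horizon):
            a = policy(s)
            s_next, done = env.step(a)
            total = total + discount * phi(s, a, s_next)
            discount *= gamma
            s = s_next
            if done:
                break
    return (1.0 - gamma) * total / n_rollouts

def task_value(psi, w):
    """Eq. (3): the value of a policy on task w is simply psi . w."""
    return float(np.dot(psi, w))
```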
3 COMPOSING POLICIES TO SOLVE A SET OF MDPS

As described, we are interested in solving all the tasks $\mathbf{w} \in \mathcal{W}$ in the set of MDPs $M^\phi$ defined in (1). We will approach this problem by learning policies associated with specific rewards $\mathbf{w}$ and then composing them to build a higher-level policy that performs well across all the tasks. We call this higher-level policy a generalized policy, defined as (Barreto et al., 2020):

Definition 1 (Generalized policy). Given a set of MDPs $M^\phi$, a generalized policy is a function $\pi: S \times \mathcal{W} \mapsto \mathcal{P}(A)$ that maps a state $s$ and a task $\mathbf{w}$ onto a distribution over actions.

We can think of a generalized policy as a regular policy parameterized by a task, since for a fixed $\mathbf{w}$ we have $\pi(\cdot\,; \mathbf{w}): S \mapsto \mathcal{P}(A)$. We now focus our attention on a specific class of generalized policies that are composed of other policies:

Definition 2 (SIP). Given a set of MDPs $M^\phi$ and a set of $n$ policies $\Pi^n = \{\pi^i\}_{i=1}^n$, a set improving policy (SIP) $\pi^{\mathrm{SIP}}$ is any generalized policy such that:

$$v^{\mathrm{SIP}}_{\Pi^n, \mathbf{w}} \geq v^i_\mathbf{w} \ \text{ for all } \pi^i \in \Pi^n \text{ and all } \mathbf{w} \in \mathcal{W}, \quad (4)$$

where $v^{\mathrm{SIP}}_{\Pi^n, \mathbf{w}}$ and $v^i_\mathbf{w}$ are the value functions of $\pi^{\mathrm{SIP}}_{\Pi^n}(\cdot\,; \mathbf{w})$ and the policies $\pi^i \in \Pi^n$ under reward $r_\mathbf{w}$.

We have been deliberately vague about the specific way the policies $\pi^i \in \Pi^n$ are combined to form a SIP to have as inclusive a concept as possible. We now describe two concrete ways to construct a SIP.

Definition 3 (SMP). Let $\Pi^n = \{\pi^i\}_{i=1}^n$ be a set of $n$ policies and let $v^i$ be the corresponding value functions defined analogously to (3) for an arbitrary reward. A set-max policy (SMP) is defined as $\pi^{\mathrm{SMP}}_{\Pi^n}(s; \mathbf{w}) = \pi^k(s)$, with $k = \arg\max_{i \in [1, \ldots, n]} v^i_\mathbf{w}$.

SMP can be combined with SFs to define a generalized policy for the set of MDPs $M^\phi$. Given the SFs of the policies $\pi^i \in \Pi^n$, $\{\psi^i\}_{i=1}^n$, we can quickly compute a generalized SMP as

$$\pi^{\mathrm{SMP}}_{\Pi^n}(s; \mathbf{w}) = \pi^k(s), \ \text{ with } k = \arg\max_{i \in [1, \ldots, n]} \{\mathbf{w} \cdot \psi^i\}. \quad (5)$$

Since the value of a SMP under reward $\mathbf{w}$ is given by $v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}} = \max_{i \in [1, \ldots, n]} v^i_\mathbf{w}$, it trivially qualifies as a SIP as per Definition 2. In fact, the generalized policy $\pi^{\mathrm{SMP}}_{\Pi^n}$ defined in (5) is in some sense the most conservative SIP possible, as it will always satisfy (4) with equality. This means that any other SIP will perform at least as well as the SIP induced by SMP. We formalize this notion below:
Lemma 1. Let $\pi^{\mathrm{SMP}}_{\Pi^n}$ be a SMP defined as in (5) and let $\pi: S \times \mathcal{W} \mapsto \mathcal{P}(A)$ be any generalized policy. Then, given a set of $n$ policies $\Pi^n$, $\pi$ is a SIP if and only if $v^\pi_{\Pi^n, \mathbf{w}} \geq v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}$ for all $\mathbf{w} \in \mathcal{W}$.

Due to space constraints, all the proofs can be found in the supplementary material. Lemma 1 allows us to use SMP to derive results that apply to all SIPs. For example, a lower bound for $v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}$ automatically applies to all possible $v^{\mathrm{SIP}}_{\Pi^n, \mathbf{w}}$. Lemma 1 also allows us to treat SMP as a criterion to determine whether a given generalized policy qualifies as a SIP. We illustrate this by introducing a second candidate to construct a SIP called generalized policy improvement (Barreto et al., 2017; 2018; 2020, GPI):

Definition 4 (GPI policy). Given a set of $n$ policies $\Pi^n = \{\pi^i\}_{i=1}^n$ and corresponding $Q$-functions $Q^i_\mathbf{w}$ computed under an arbitrary reward $\mathbf{w}$, the GPI policy is defined as $\pi^{\mathrm{GPI}}_{\Pi^n}(s; \mathbf{w}) = \arg\max_a \max_i Q^i_\mathbf{w}(s, a)$.

Again, we can combine GPI and SFs to build a generalized policy. Given the SFs of the policies $\pi^i \in \Pi^n$, $\{\psi^i\}_{i=1}^n$, we can quickly compute the generalized GPI policy as $\pi^{\mathrm{GPI}}_{\Pi^n}(s; \mathbf{w}) = \arg\max_a \max_i \psi^i(s, a) \cdot \mathbf{w}$. Note that the maximization in GPI is performed in each state and uses the $Q$-functions of the constituent policies. In contrast, SMP maximizes over value functions (not $Q$-functions), with an expectation over states taken with respect to the initial state distribution $D$. For this reason, GPI is a stronger composition than SMP. We now formalize this intuition:
Lemma 2. For any reward $\mathbf{w} \in \mathcal{W}$ and any set of policies $\Pi^n$, we have that $v^{\mathrm{GPI}}_{\Pi^n, \mathbf{w}} \geq v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}$.

Lemma 2 implies that for any set of policies it is always better to use a GPI policy rather than an SMP (as we will confirm in the experiments). As a consequence, it also certifies that the generalized GPI policy $\pi^{\mathrm{GPI}}_{\Pi^n}(s; \mathbf{w})$ qualifies as a SIP (Lemma 1).

We have described two ways of constructing a SIP by combining SMP and GPI with SFs. Other similar strategies might be possible, for example by using local SARSA (Russell & Zimdars, 2003; Sprague & Ballard, 2003) as the basic mechanism to compose a set of value functions. We also note that in some cases it is possible to define a generalized policy (Definition 1) that is not necessarily a SIP (Eq. (5)), but is guaranteed to perform better than any SIP in expectation. For example, a combination of maximization, randomization and local search has been shown to be optimal in expectation among generalized policies in tabular MDPs with collectible rewards (Zahavy et al., 2020c). That said, we note that some compositions of policies that may at first seem like a SIP do not qualify as such. For example, a mixed policy is a linear (convex) combination of policies that assigns probabilities to the policies in the set and samples from them. When the mixed policy mixes the best policy in the set with a less performant policy, the result is not as good as the best single policy in the set (Zahavy et al., 2020c).
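To make the distinction between the two compositions concrete, here is a minimal sketch of both given the SFs as arrays; the array shapes and helper names are our own illustration, not notation from the paper.

```python
import numpy as np

def smp_policy(sfs, policies, w):
    """Set-max policy (Eq. 5): commit to the constituent policy with the
    highest value psi^i . w on task w, then act with it in every state.
    sfs: (n, d) array whose rows are the SF vectors psi^i."""
    k = int(np.argmax(sfs @ w))
    return policies[k]

def gpi_action(sf_sa, w, s):
    """GPI policy (Definition 4): in each state, act greedily w.r.t. the
    maximum Q-value over all constituent policies.
    sf_sa: (n, n_states, n_actions, d) array with psi^i(s, a)."""
    q = sf_sa[:, s] @ w                # (n, n_actions) Q-values at state s
    return int(np.argmax(q.max(axis=0)))
```

The SMP picks one policy per task while GPI can switch between policies state by state, which is why its value dominates SMP's (Lemma 2).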
Problem formulation. We are now ready to formalize the problem we are interested in. Given a set of MDPs $M^\phi$, as defined in (1), we want to construct a set of $n$ policies $\Pi^n = \{\pi^i\}_{i=1}^n$ such that the SMP defined on that set, $\pi^{\mathrm{SMP}}_{\Pi^n}$, has the optimal worst-case performance over all rewards $\mathbf{w} \in \mathcal{W}$. That is, we want to solve the following problem:

$$\arg\max_{\Pi^n \subseteq \Pi} \min_\mathbf{w} v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}. \quad (6)$$

Note that, since $v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}} \leq v^{\mathrm{SIP}}_{\Pi^n, \mathbf{w}}$ for any SIP, $\Pi^n$ and $\mathbf{w}$, as shown in Lemma 1, by finding a good set for (6) we are also improving the performance of all SIPs (including GPI).

4 AN ITERATIVE METHOD TO CONSTRUCT A SET-MAX POLICY
We now present and analyze an iterative algorithm to solve problem (6). We begin by defining the worst-case, or adversarial, reward associated with the generalized SMP policy:
Definition 5 (Adversarial reward for an SMP). Given a set of policies $\Pi^n$, we denote by $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n} = \arg\min_{\mathbf{w} \in B_2} v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}$ the worst-case reward w.r.t the SMP $\pi^{\mathrm{SMP}}_{\Pi^n}$ defined in (5). In addition, the value of the SMP w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n}$ is defined by $\bar{v}^{\mathrm{SMP}}_{\Pi^n} = \min_{\mathbf{w} \in B_2} v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}$.

We are interested in finding a set of policies $\Pi^n$ such that the performance of the resulting SMP will be optimal w.r.t its adversarial reward $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n}$. This leads to a reformulation of (6) as a max-min-max optimization for discovering robust policies:

$$\arg\max_{\Pi^n \subseteq \Pi} \bar{v}^{\mathrm{SMP}}_{\Pi^n} = \arg\max_{\Pi^n \subseteq \Pi} \min_{\mathbf{w} \in B_2} v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}} = \arg\max_{\Pi^n \subseteq \Pi} \min_{\mathbf{w} \in B_2} \max_{i \in [1, \ldots, n]} \psi^i \cdot \mathbf{w}. \quad (7)$$
Algorithm 1 SMP worst-case policy iteration

Initialize: Sample $\mathbf{w}_0 \sim N(\bar{0}, \bar{1})$; $\Pi_0 \leftarrow \{\}$; $\pi^1 \leftarrow \arg\max_{\pi \in \Pi} \mathbf{w}_0 \cdot \psi^\pi$; $t \leftarrow 1$; $\bar{v}^{\mathrm{SMP}}_{\Pi_0} \leftarrow -\|\psi^1\|$
repeat
    $\Pi_t \leftarrow \Pi_{t-1} + \{\pi^t\}$
    $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t} \leftarrow$ solution to (8)
    $\pi^{t+1} \leftarrow$ solution of the RL task $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$
    $t \leftarrow t + 1$
until $v^{\pi^t}_{\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_{t-1}}} \leq \bar{v}^{\mathrm{SMP}}_{\Pi_{t-1}}$
return $\Pi_{t-1}$

The order in which the maximizations and the minimization are performed in (7) is important. (i) The inner maximization over policies (or SFs), by the SMP, is performed last. This means that, for a fixed set of policies $\Pi^n$ and a fixed reward $\mathbf{w}$, SMP selects the best policy in the set. (ii) The minimization over rewards $\mathbf{w}$ happens second, that is, for a fixed set of policies $\Pi^n$, we compute the value of the generalized SMP $\pi^{\mathrm{SMP}}_{\Pi^n}(\cdot\,; \mathbf{w})$ for any reward $\mathbf{w}$, and then minimize the maximum of these values. (iii) Finally, for any set of policies, there is an associated worst-case reward for the SMP, and we are looking for policies that maximize this value.

The inner maximization (i) is simple: it comes down to computing $n$ dot-products $\psi^i \cdot \mathbf{w}$, $i = 1, 2, \ldots, n$, and comparing the resulting values. The minimization problem (ii) is slightly more complicated, but fortunately easy to solve. To see this, note that this problem can be rewritten as:

$$\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n} = \arg\min_\mathbf{w} \max_{i \in [1, \ldots, n]} \{\mathbf{w} \cdot \psi^1, \ldots, \mathbf{w} \cdot \psi^n\} \quad \text{s.t. } \|\mathbf{w}\|_2 - 1 \leq 0. \quad (8)$$

Eq. (8) is a convex optimization problem that can be easily solved using standard techniques, like gradient descent, and off-the-shelf solvers (Diamond & Boyd, 2016; Boyd et al., 2004). We note that the minimizer of Eq. (8) is a function of the policy set. As a result, the set forces the worst-case reward to make a trade-off – it has to "choose" the coordinates it "wants" to be more adversarial for. This trade-off is what encourages the worst-case reward to be diverse across iterations (w.r.t different sets). We note that this property holds since we are optimizing over $B_2$, but it will not necessarily be the case for other convex sets. For example, in the case of $B_\infty$ the internal minimization problem above has a single solution – a vector with $-1$ in all of its coordinates.

The outer maximization problem (iii) can be difficult to solve if we are searching over all possible sets of policies $\Pi^n \subseteq \Pi$. Instead, we propose an incremental approach in which policies $\pi^i$ are successively added to an initially empty set $\Pi_0$. This is possible because the solution $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n}$ of (8) gives rise to a well-defined RL problem in which the rewards are given by $r_\mathbf{w}(s, a, s') = \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n} \cdot \phi(s, a, s')$. This problem can be solved using any standard RL algorithm. So, once we have a solution $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n}$ for (8), we solve the induced RL problem using any algorithm and add the resulting policy $\pi^{n+1}$ to $\Pi^n$ (or, rather, the associated SFs $\psi^{n+1}$).

Algorithm 1 has a step-by-step description of the proposed method. The algorithm is initialized by adding a policy $\pi^1$ that maximizes a random reward vector $\mathbf{w}_0$ to the set $\Pi_1$, such that $\Pi_1 = \{\pi^1\}$. At each subsequent iteration $t$ the algorithm computes the worst-case reward $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$ w.r.t the current set $\Pi_t$ by solving (8). The algorithm then finds a policy $\pi^{t+1}$ that solves the task induced by $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$. If the value of $\pi^{t+1}$ w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$ is strictly larger than $\bar{v}^{\mathrm{SMP}}_{\Pi_t}$, the algorithm continues for another iteration, with $\pi^{t+1}$ added to the set. Otherwise, the algorithm stops.
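Since the paper solves Eq. (8) with an off-the-shelf solver such as CVXPY (Diamond & Boyd, 2016), a direct transcription might look like the following; the function name and the $(n, d)$ array layout for the SFs are our own assumptions for this sketch.

```python
import cvxpy as cp
import numpy as np

def worst_case_reward(sfs):
    """Eq. (8): min_w max_i w . psi^i subject to ||w||_2 <= 1.
    sfs: (n, d) array of SF vectors. Returns (w_bar, v_bar)."""
    _, d = sfs.shape
    w = cp.Variable(d)
    # The pointwise max of linear functions is convex, so this is a convex program.
    objective = cp.Minimize(cp.max(sfs @ w))
    problem = cp.Problem(objective, [cp.norm(w, 2) <= 1])
    problem.solve()
    return w.value, problem.value
```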
As mentioned before, the set of policies $\Pi_t$ computed by Algorithm 1 can also be used with GPI. The resulting GPI policy will do at least as well as the SMP counterpart on any task $\mathbf{w}$ (Lemma 2); in particular, the GPI's worst-case performance will be lower bounded by $\bar{v}^{\mathrm{SMP}}_{\Pi^n}$.
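Putting the pieces together, a minimal sketch of the whole of Algorithm 1 might look as follows, reusing `worst_case_reward` from the sketch above; `solve_rl_task` and `estimate_sf` stand in for whatever RL algorithm and SF-estimation procedure one uses (both are assumed oracles, not part of the paper).

```python
import numpy as np

def worst_case_policy_iteration(solve_rl_task, estimate_sf, d, max_iters=20, tol=1e-8):
    """Sketch of Algorithm 1: grow a policy set until the worst-case SMP
    value can no longer be strictly improved."""
    w0 = np.random.randn(d)                    # random initial task
    sfs = [estimate_sf(solve_rl_task(w0))]     # SFs of the first policy
    while len(sfs) < max_iters:
        w_bar, v_bar = worst_case_reward(np.stack(sfs))   # Eq. (8)
        psi_new = estimate_sf(solve_rl_task(w_bar))       # best response to w_bar
        if psi_new @ w_bar <= v_bar + tol:     # no strict improvement: stop
            break
        sfs.append(psi_new)
    return sfs
```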
4.1 THEORETICAL ANALYSIS

Algorithm 1 produces a sequence of policy sets $\Pi_1, \Pi_2, \ldots$ The definition of SMP guarantees that enlarging a set of policies always leads to a soft improvement in performance, so $\bar{v}^{\mathrm{SMP}}_{\Pi_{t+1}} \geq \bar{v}^{\mathrm{SMP}}_{\Pi_t} \geq \ldots \geq \bar{v}^{\mathrm{SMP}}_{\{\pi^1\}}$. We now show that the improvement in each iteration of our algorithm is in fact strict.

Theorem 1 (Strict improvement). Let $\Pi_1, \ldots, \Pi_t$ be the sets of policies constructed by Algorithm 1. We have that the worst-case performance of the SMP induced by these sets is strictly improving in each iteration, that is: $\bar{v}^{\mathrm{SMP}}_{\Pi_{t+1}} > \bar{v}^{\mathrm{SMP}}_{\Pi_t}$. Furthermore, when the algorithm stops, there does not exist a single policy $\pi^{t+1}$ such that adding it to $\Pi_t$ will result in improvement: $\nexists \pi^{t+1} \in \Pi$ s.t. $\bar{v}^{\mathrm{SMP}}_{\Pi_t + \{\pi^{t+1}\}} > \bar{v}^{\mathrm{SMP}}_{\Pi_t}$.

In general we cannot say anything about the value of the SMP returned by Algorithm 1. However, in some special cases we can upper bound it. One such case is when the SFs lie in the simplex.
Lemma 3 (Impossibility result). For the special case where the SFs associated with any policy are in the simplex, the value of the SMP w.r.t the worst-case reward for any set of policies is less than or equal to $-1/\sqrt{d}$. In addition, there exists an MDP where this upper bound is attainable.
One example where the SFs are in the simplex is when the features $\phi$ are "one-hot vectors", that is, they only have one nonzero element. This happens for example in a tabular representation, in which case the SFs correspond to stationary state distributions. Another example is the features induced by state aggregation, since these are simple indicator functions associating states to clusters (Singh et al., 1995). We will show in our experiments that when state aggregation is used our algorithm achieves the upper bound of Lemma 3 in practice.

Finally, we observe that not all the policies in the set $\Pi_t$ are needed at each point in time, and we can guarantee strict improvement even if we remove the "inactive" policies from $\Pi_t$, as we show below.

Definition 6 (Active policies). Given a set of $n$ policies $\Pi^n$ and an associated worst-case reward $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n}$, the subset of active policies $\Pi^a(\Pi^n)$ are the policies in $\Pi^n$ that achieve $\bar{v}^{\mathrm{SMP}}_{\Pi^n}$ w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n}$:

$$\Pi^a(\Pi^n) = \big\{\pi \in \Pi^n : \psi^\pi \cdot \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n} = \bar{v}^{\mathrm{SMP}}_{\Pi^n}\big\}.$$

Theorem 2 (Sufficiency of active policies). For any set of policies $\Pi^n$, $\pi^{\mathrm{SMP}}_{\Pi^a(\Pi^n)}$ achieves the same value w.r.t the worst-case reward as $\pi^{\mathrm{SMP}}_{\Pi^n}$, that is, $\bar{v}^{\mathrm{SMP}}_{\Pi^n} = \bar{v}^{\mathrm{SMP}}_{\Pi^a(\Pi^n)}$.

Theorem 2 implies that once we have found $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n}$ we can remove the inactive policies from the set and still guarantee the same worst-case performance. Furthermore, we can continue with Algorithm 1 to find the next policy by maximizing $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n}$ and guarantee strict improvement via Theorem 1. This is important in applications that have memory constraints, since it allows us to store fewer policies.
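In code, pruning to the active set is a one-liner once the adversarial reward is known; the tolerance and function name below are our own choices for this sketch.

```python
import numpy as np

def active_policies(sfs, w_bar, v_bar, tol=1e-6):
    """Definition 6: keep only policies whose value under the adversarial
    reward w_bar attains the worst-case SMP value v_bar (up to tolerance)."""
    values = sfs @ w_bar                       # value of each policy on w_bar
    return sfs[np.abs(values - v_bar) <= tol]  # rows of the active policies
```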
5 EXPERIMENTS

We begin with a 2D grid-world environment (Fig. 1(d)), where the agent starts in a random place in the grid (marked in black) and gains/loses reward from collecting items (marked in white). Each item belongs to one of $d - 1$ classes (here with $d = 5$) and is associated with a marker (e.g., O, X, or Y). In addition, there is one "no item" feature (marked in gray). The features are one-hot vectors, i.e., for $i \in [1, d-1]$, $\phi_i(s)$ equals one when item $i$ is in state $s$ and zero otherwise (similarly, $\phi_d(s)$ equals one when there is no item in state $s$). The objective of the agent is to pick up the "good" objects and avoid "bad" objects, depending on the weights of the vector $\mathbf{w}$.

In Fig. 1(a) we report the performance of the SMP $\pi^{\mathrm{SMP}}_{\Pi_t}$ w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$ for $d = 5$. At each iteration (x-axis) of Algorithm 1 we train a policy to maximize $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$, then compute the SFs of that policy using additional environment steps and evaluate it w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$.

As we can see, the performance of SMP strictly improves as we add more policies to the set (as we stated in Theorem 1). In addition, we compare the performance of SMP with that of GPI, defined on the same sets of policies ($\Pi_t$) that were discovered by Algorithm 1. Since we do not know how to compute $\bar{\mathbf{w}}^{\mathrm{GPI}}_{\Pi_t}$ (the worst-case reward for GPI), we evaluate GPI w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_n}$ (the blue line in Fig. 1(a)). Inspecting Fig. 1(a), we can see that the GPI policy indeed performs better than the SMP, as Lemma 2 indicates. We note that the blue line (in Fig. 1(a)) does not correspond to the worst-case performance of the GPI policy. Instead, we can get a good approximation for it because we have that $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n} \cdot \psi(\pi^{\mathrm{SMP}}_{\Pi^n}) \leq \bar{\mathbf{w}}^{\mathrm{GPI}}_{\Pi^n} \cdot \psi(\pi^{\mathrm{GPI}}_{\Pi^n}) \leq \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi^n} \cdot \psi(\pi^{\mathrm{GPI}}_{\Pi^n})$; i.e., the worst-case performance of GPI (in the middle) is guaranteed to be between the green and blue lines in Fig. 1(a). This also implies that the upper bound in Lemma 3 does not apply to the blue line.

Figure 1: Experimental results in a 2D grid world. Fig. 1(a) presents the performance of the SMP and GPI w.r.t the worst-case reward. Fig. 1(b) compares Algorithm 1 with two baselines, where we show the worst-case performance, relative to the upper bound, in a logarithmic scale. Fig. 1(c) visualizes the SFs of the policies in the set and Fig. 1(d) presents trajectories that were taken by different policies.

We also compare our algorithm to two baselines in Fig. 1(b) (for $d = 10$): (i) Orthogonal – at iteration $t$ we train policy $\pi^t$ to maximize the reward $\mathbf{w} = e_t$ (a vector of zeros with a one in the $t$-th coordinate), such that a matrix with the vectors $\mathbf{w}$ in its columns forms the identity matrix; (ii) Random – at iteration $t$ we train policy $\pi^t$ to maximize a reward $\mathbf{w} \sim \tilde{N}(\bar{0}, \bar{1})$, i.e., we sample a vector of dimension $d$ from a normal Gaussian distribution and normalize it to have a norm of $1$. While all the methods improve as we add policies to the set, Algorithm 1 clearly outperforms the baselines.

In Fig. 1(c) and Fig. 1(d) we visualize the policies that were discovered by Algorithm 1. Fig. 1(c) presents the SFs of the discovered policies, where each row (color) corresponds to a different policy and the columns correspond to the different features. We do not enumerate the features from $1$ to $d$, but instead we label them with markers that correspond to specific items (the x-axis labels). In Fig. 1(d) we present a trajectory from each policy. We note that both the colors and the markers match between the two figures: the red color corresponds to the same policy in both figures, and the item markers in Fig. 1(d) correspond to the coordinates in the x-axis of Fig. 1(c).

Inspecting the figures we can see that the discovered policies are qualitatively diverse: in Fig. 1(c) we can see that the SFs of the different policies have different weights for different items, and in Fig. 1(d) we can see that the policies visit different states. For example, we can see that the teal policy has a larger weight for the no-item feature (Fig. 1(c)) and visits only no-item states (Fig. 1(d)), and that the green policy has higher weights for the 'Y' and 'X' items (Fig. 1(c)) and indeed visits them (Fig. 1(d)).

Figure 2: Experimental results with regularized $\mathbf{w}$.

Finally, in Fig. 2, we compare the performance of our algorithm with that of the baseline methods over a test set of rewards. The only difference is in how we evaluate the algorithms. Specifically, we sampled reward signals from the uniform distribution over the unit ball. Recall that at iteration $t$ each algorithm has a set of policies $\Pi_t$, so we evaluate the SMP defined on this set, $\pi^{\mathrm{SMP}}_{\Pi_t}$, w.r.t each one of the test rewards. Then, for each method, we report the mean value obtained over the test rewards and repeat this procedure for different seeds. Finally, we report the mean and the confidence interval over the seeds. Note that the performance in this experiment will necessarily be better than in Fig. 1(a) because here we evaluate average performance rather than worst-case performance. Also note that our algorithm was not designed to optimize the performance on this "test set", but to optimize the performance w.r.t the worst case.
Therefore it is not necessarily expected to outperform the baselines when measured on this metric.

Inspecting Fig. 2(a) we can see that our algorithm (denoted by SMP) performs better than the two baselines. This is a bit surprising for the reasons mentioned above, and suggests that optimizing for the worst case also improves the performance w.r.t the entire distribution (a transfer learning result). At first glance, the relative gain in performance might seem small, so the baselines might seem preferable to some users due to their simplicity. However, recall that the computational cost of computing the worst-case reward is small compared to that of finding the policy that maximizes it, and therefore the relative cost of the added complexity is low.

The last observation suggests that we should care about how many policies are needed by each method to achieve the same value. We present these results in Fig. 2(b). Note that we use exactly the same data as in Fig. 2(a) but present it in a different manner. Inspecting the figure, we can see that the baselines consistently require more policies than the SMP to achieve the same value.
DeepMind Control Suite. Next, we conducted a set of experiments in the DM Control Suite (Tassa et al., 2018). We focused on the setup where the agent learns from feature observations corresponding to the positions and velocities of the "body" in the task (pixels were only used for visualization). We considered the following six domains: 'Acrobot', 'Cheetah', 'Fish', 'Hopper', 'Pendulum', and 'Walker'. In each of these tasks we do not use the extrinsic reward defined by the task, but instead consider rewards that are linear in the observations (of dimensions 6, 17, 21, 15, 3, and 24, respectively). At each iteration of Algorithm 1 we train a policy with an actor-critic agent (specifically STACX (Zahavy et al., 2020d)) to maximize $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$, add it to the set, and compute a new $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_{t+1}}$.
[Figure 3 panels: (a) Convergence – SMP value and number of active policies per iteration for Acrobot, Cheetah, Fish, Hopper, Pendulum, and Walker; (b) Diversity – PCA scatter plots of the SFs of the active policies per domain, with explained variance between 0.967 and 1.0.]
Figure 3: Experimental results in the DeepMind Control Suite.

Fig. 3(a) presents the performance of SMP in each iteration w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$. As we can see, our algorithm is indeed improving in each iteration. In addition, we present the average number of active policies (Definition 6) in each iteration with bars. All the results are averaged over multiple seeds and presented with Gaussian confidence intervals. Fig. 3(b) presents the SFs of the active policies at the end of training (the seed with the maximum number of active policies was selected). We perform PCA dimensionality reduction such that each point in the scatter plot corresponds to the SFs of one of the active policies. We also report the variance explained by PCA: values close to 1 indicate that the dimensionality reduction has preserved the original variance. Examining the figures we can see that our algorithm is strictly improving (as Theorem 1 predicts) and that the active policies in the set are indeed diverse; we can also see that adding more policies is correlated with improving performance.

Finally, in Fig. 4(a), Fig. 4(b) and Fig. 4(c) we visualize the trajectories of the discovered policies in the Cheetah, Hopper and Walker environments. Although the algorithm was oblivious to the extrinsic reward of the tasks, it was still able to discover different locomotion skills, postures, and even some "yoga poses" (as noted by the label we gave each policy on the left). The other domains (Acrobot, Pendulum and Fish) have simpler bodies and exhibited simpler movement in various directions and velocities, e.g., the Pendulum learned to balance itself up and down. The supplementary material contains videos from all the bodies.
6 CONCLUSION

We have presented an algorithm that incrementally builds a set of policies to solve a collection of tasks defined as linear combinations of known features. The policies returned by our algorithm can be composed in multiple ways. We have shown that when the composition is a SMP, its worst-case performance on the set of tasks will strictly improve at each iteration of our algorithm. More generally, the performance guarantees we have derived also serve as a lower bound for any composition of policies that qualifies as a SIP. The composition of policies has many applications in RL, for example to build hierarchical agents or to tackle a sequence of tasks in a continual learning scenario. Our algorithm provides a simple and principled way to build a diverse set of policies that can be used in these and potentially many other scenarios.
ACKNOWLEDGEMENTS
We would like to thank Remi Munos and Will Dabney for their comments and feedback on this paper.

[Figure 4 panels: (a) Cheetah – Head Stand, Standing Tall, Walk Backward, Run Forward; (b) Hopper – Quad Stretch, V Crunch, Plow Pose, Head Stand; (c) Walker – Kneel, Salta Jump, Pistol Squat, Break-dance, Crawl.]
Figure 4: Experimental results in the DeepMind Control Suite.

REFERENCES
Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.
André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4055–4065, 2017.
André Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Mankowitz, Augustin Zidek, and Rémi Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, pp. 501–510. PMLR, 2018.
André Barreto, Shaobo Hou, Diana Borsa, David Silver, and Doina Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 2020.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.
Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.
Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
Christopher Grimm, Irina Higgins, Andre Barreto, Denis Teplyashin, Markus Wulfmeier, Tim Hertweck, Raia Hadsell, and Satinder Singh. Disentangled cumulants help successor representations transfer to new tasks. arXiv preprint arXiv:1911.10866, 2019.
Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35(1):110–119, 1986.
Steven Hansen, Will Dabney, Andre Barreto, Tom Van de Wiele, David Warde-Farley, and Volodymyr Mnih. Fast task inference with variational intrinsic successor features. arXiv preprint arXiv:1906.05030, 2019.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. 2013.
Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1984.
Stuart J. Russell and Andrew Zimdars. Q-decomposition for reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 656–663, 2003.
Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pp. 361–368, 1995.
Nathan Sprague and Dana Ballard. Multiple-goal reinforcement learning with modular sarsa(0). 2003.
Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999a.
Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, August 1999b. doi: 10.1016/S0004-3702(99)00052-1.
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
Philip Wolfe. Convergence theory in nonlinear programming. Integer and Nonlinear Programming, pp. 1–36, 1970.
Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.
Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and lasso. In Advances in Neural Information Processing Systems, pp. 1801–1808, 2009.
Tom Zahavy, Alon Cohen, Haim Kaplan, and Yishay Mansour. Apprenticeship learning via Frank-Wolfe. AAAI, 2020a.
Tom Zahavy, Alon Cohen, Haim Kaplan, and Yishay Mansour. Average reward reinforcement learning with unknown mixing times. The Conference on Uncertainty in Artificial Intelligence (UAI), 2020b.
Tom Zahavy, Avinatan Hasidim, Haim Kaplan, and Yishay Mansour. Planning in hierarchical reinforcement learning: Guarantees for using local policies. In Algorithmic Learning Theory, pp. 906–934, 2020c.
Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, and Satinder Singh. A self-tuning actor-critic algorithm. Advances in Neural Information Processing Systems, 2020d.
A PROOFS
Lemma 1. Let $\pi^{\mathrm{SMP}}_{\Pi^n}$ be a SMP defined as in (5) and let $\pi: S \times \mathcal{W} \mapsto \mathcal{P}(A)$ be any generalized policy. Then, given a set of $n$ policies $\Pi^n$, $\pi$ is a SIP if and only if $v^\pi_{\Pi^n, \mathbf{w}} \geq v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}$ for all $\mathbf{w} \in \mathcal{W}$.

Proof. We first show that the fact that $\pi$ is a SIP implies that $v^\pi_{\Pi^n, \mathbf{w}} \geq v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}$ for all $\mathbf{w}$. For any $\mathbf{w} \in \mathcal{W}$, we have

$$v^\pi_{\Pi^n, \mathbf{w}} \geq v^i_\mathbf{w} \ \text{ for all } \pi^i \in \Pi^n \quad \text{(SIP as in Definition 2)}$$
$$\geq \max_{i \in [1, \ldots, n]} v^i_\mathbf{w} = v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}.$$

We now show the converse:

$$v^\pi_{\Pi^n, \mathbf{w}} \geq v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}} = \max_{i \in [1, \ldots, n]} v^i_\mathbf{w} \quad \text{(SMP as in Definition 3)}$$
$$\geq v^i_\mathbf{w} \ \text{ for all } \pi^i \in \Pi^n. \qquad \square$$
Lemma 2. For any reward $\mathbf{w} \in \mathcal{W}$ and any set of policies $\Pi^n$, we have that $v^{\mathrm{GPI}}_{\Pi^n, \mathbf{w}} \geq v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}$.

Proof. We know from previous results in the literature (Barreto et al., 2017) that $Q^{\mathrm{GPI}}_{\Pi^n}(s, a) \geq Q^\pi(s, a)$ for all $(s, a) \in S \times A$ and any $\pi \in \Pi^n$. Thus, we have that for all $s \in S$:

$$v^{\mathrm{GPI}}_{\Pi^n, \mathbf{w}}(s) = Q^{\mathrm{GPI}}_{\Pi^n, \mathbf{w}}(s, \pi^{\mathrm{GPI}}(s)) \geq \max_{\pi \in \Pi^n, a \in A} Q^\pi_\mathbf{w}(s, a) \geq \max_{\pi \in \Pi^n} \mathbb{E}_{a \sim \pi}[Q^\pi_\mathbf{w}(s, a)] = \max_{\pi \in \Pi^n} v^\pi_\mathbf{w}(s) = v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}(s),$$

where the second inequality is due to Jensen's inequality. Therefore:

$$v^{\mathrm{GPI}}_{\Pi^n, \mathbf{w}}(s) \geq v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}(s) \ \Rightarrow\ \mathbb{E}_D[v^{\mathrm{GPI}}_{\Pi^n, \mathbf{w}}(s)] \geq \mathbb{E}_D[v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}(s)] \ \Rightarrow\ v^{\mathrm{GPI}}_{\Pi^n, \mathbf{w}} \geq v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}}. \qquad \square$$
Proof.
For the impossibility result. we have that max Π n ⊆ Π min w ∈ B v SMP Π n , w = min w ∈ B max π ∈ Π v πw (9) = min w ∈ B max π ∈ Π ψ ( π ) · w ≤ min w ∈ B max ψ ∈ ∆ d − ψ ( π ) · w (10) = min w ∈ B max i w i (11) = − √ d . (12)The equality in Eq. (9) follows from the fact that Π is set of all possible policies and therefore thelargest possible subset (the maximizer of the first maximization). In that case the second maximization(by the SMP) is equivalent to selecting the optimal policy in the MDP. Notice that the order ofmaximization-minimization here is in the reversed when compared to AL, i.e., for each reward theSMP chooses the best policy in the MDP, while in AL the reward is chosen to be the worst possiblew.r.t any policy. The inequality in Eq. (10) follows from the fact that we increase the size of theoptimization set in the inner loop, and the equality in Eq. (12) follows from the fact that a maximizerin the inner loop puts maximal distribution on the largest component of w . Feasibility.
Feasibility. To show the feasibility of the upper bound in the previous impossibility result we give an example of an MDP in which a set of $d$ policies achieves the upper bound. The $d$ policies are chosen such that their stationary distributions form an orthogonal basis:

$$\min_{\mathbf{w} \in B_2} v^{\mathrm{SMP}}_{\Pi^n, \mathbf{w}} = \min_{\mathbf{w} \in B_2} \max_{\psi \in \{\psi^1, \ldots, \psi^d\}} \mathbf{w} \cdot \psi = \min_{\mathbf{w} \in B_2} \max_{\psi \in \Delta^{d-1}} \mathbf{w} \cdot \psi = -\frac{1}{\sqrt{d}}, \quad (13)$$

which follows from the fact that the maximization over the simplex is equivalent to a maximization over pure strategies. $\square$

Lemma 4 (Reformulation of the worst-case reward for an SMP). Let $\{\psi^i\}_{i=1}^n$ be $n$ successor feature vectors. Let $\mathbf{w}^*$ be the adversarial reward w.r.t the SMP defined given these successor features. That is, $\mathbf{w}^*$ is the solution for

$$\arg\min_\mathbf{w} \max_{i \in [1, \ldots, n]} \{\mathbf{w} \cdot \psi^1, \ldots, \mathbf{w} \cdot \psi^n\} \quad \text{s.t. } \|\mathbf{w}\|_2 - 1 \leq 0. \quad (14)$$

Let $\mathbf{w}^*_i$ be the solution to the following problem for $i \in [1, \ldots, n]$:

$$\arg\min_\mathbf{w} \ \mathbf{w} \cdot \psi^i \quad \text{s.t. } \|\mathbf{w}\|_2 - 1 \leq 0, \ \ \mathbf{w} \cdot (\psi^j - \psi^i) \leq 0. \quad (15)$$

Then, $\mathbf{w}^* = \arg\min_i \mathbf{w}^*_i$.
Proof. For any solution $\mathbf{w}^*$ to Eq. (8) there is some policy $i$ in the set that is one of its maximizers. Since it is the maximizer w.r.t $\mathbf{w}^*$, its value w.r.t $\mathbf{w}^*$ is greater than or equal to that of any other policy in the set. Since we are checking the solution among all $i \in [1, \ldots, n]$, one of them must be the solution. $\square$

Theorem 2 (Sufficiency of active policies). For any set of policies $\Pi^n$, $\pi^{\mathrm{SMP}}_{\Pi^a(\Pi^n)}$ achieves the same value w.r.t the worst-case reward as $\pi^{\mathrm{SMP}}_{\Pi^n}$, that is, $\bar{v}^{\mathrm{SMP}}_{\Pi^n} = \bar{v}^{\mathrm{SMP}}_{\Pi^a(\Pi^n)}$.

Proof. Let $\Pi^n = \{\pi^i\}_{i=1}^n$. Denote by $J$ a subset of the indices $[1, \ldots, n]$ that corresponds to the indices of the active policies, such that $\Pi^a(\Pi^n) = \{\pi^j\}_{j \in J}$. We can rewrite problem Eq. (14) as follows:

$$\text{minimize } \gamma \quad \text{s.t. } \gamma \geq \mathbf{w} \cdot \psi^i, \ i = 1, \ldots, n; \quad \|\mathbf{w}\|_2 \leq 1. \quad (16)$$

Let $(\gamma^\star, \mathbf{w}^\star)$ be any optimal points. The set of inactive policies $i \notin J$ satisfy $\gamma^\star > \mathbf{w}^\star \cdot \psi^i$. Since these constraints are not binding we can drop them from the formulation and maintain the same optimal objective value, i.e.,

$$\text{minimize } \gamma \quad \text{s.t. } \gamma \geq \mathbf{w} \cdot \psi^j, \ j \in J; \quad \|\mathbf{w}\|_2 \leq 1, \quad (17)$$

has the same optimal objective value, $\bar{v}^{\mathrm{SMP}}_{\Pi^n}$, as the full problem. This in turn can be rewritten as

$$\text{minimize } \max_{j \in J} \mathbf{w} \cdot \psi^j \quad \text{s.t. } \|\mathbf{w}\|_2 \leq 1, \quad (18)$$

with optimal value $\bar{v}^{\mathrm{SMP}}_{\Pi^a(\Pi^n)}$, which is therefore equal to $\bar{v}^{\mathrm{SMP}}_{\Pi^n}$. $\square$
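For completeness, Lemma 4's decomposition is itself straightforward to implement; the sketch below (our own naming, assuming an $(n, d)$ NumPy array of SFs) solves Eq. (15) once per policy with CVXPY and keeps the minimizer with the smallest objective.

```python
import cvxpy as cp
import numpy as np

def worst_case_reward_decomposed(sfs):
    """Lemma 4: solve Eq. (15) for each i and return the overall minimizer."""
    n, d = sfs.shape
    best_w, best_val = None, np.inf
    for i in range(n):
        w = cp.Variable(d)
        constraints = [cp.norm(w, 2) <= 1]
        # Policy i must be the SMP's argmax under w: w . psi^j <= w . psi^i.
        constraints += [(sfs[j] - sfs[i]) @ w <= 0 for j in range(n) if j != i]
        problem = cp.Problem(cp.Minimize(sfs[i] @ w), constraints)
        problem.solve()
        if problem.value is not None and problem.value < best_val:
            best_val, best_w = problem.value, w.value
    return best_w, best_val
```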
Lemma 5 ($\kappa$ is binding). At any optimal solution of Eq. (16), the constraint $\|\mathbf{w}\|_2 \leq 1$ is binding, i.e., $\|\mathbf{w}\|_2 = 1$ and the associated dual variable satisfies $\kappa > 0$.

Proof. Denote by $\dot{\mathbf{w}}$ a possible solution where the constraint $\|\dot{\mathbf{w}}\|_2 \leq 1$ is not binding, i.e., $\|\dot{\mathbf{w}}\|_2 < 1$, $\dot{\kappa} = 0$. In addition, denote the primal objective for $\dot{\mathbf{w}}$ by $\dot{v} = \max_{i \in [1, \ldots, n]} \{\dot{\mathbf{w}} \cdot \psi^i\}$. To prove the lemma, we inspect two cases: (i) $\dot{v} \geq 0$ and (ii) $\dot{v} < 0$. For each of these two cases we will show that there exists another feasible solution $\tilde{\mathbf{w}}$ that achieves a lower value $\tilde{v}$ for the primal objective ($\tilde{v} < \dot{v}$), and therefore $\dot{\mathbf{w}}$ is not the minimizer.

For the first case, $\dot{v} \geq 0$, consider the vector $\tilde{\mathbf{w}} = (-1, -1, \ldots, -1)/\sqrt{d}$. $\tilde{\mathbf{w}}$ is a feasible solution to the problem, since $\|\tilde{\mathbf{w}}\|_2 = 1$. Since all the SFs have positive coordinates, we have that if they are not all exactly $0$, then the primal objective evaluated at $\tilde{\mathbf{w}}$ is strictly negative: $\max_{i \in [1, \ldots, n]} \{\tilde{\mathbf{w}} \cdot \psi^1, \ldots, \tilde{\mathbf{w}} \cdot \psi^n\} < 0$.

We now consider the second case, $\dot{v} < 0$. Notice that multiplying $\dot{\mathbf{w}}$ by a positive constant $c$ would not change the maximizer, i.e., $\arg\max_{i \in [1, \ldots, n]} \{c\dot{\mathbf{w}} \cdot \psi^i\} = \arg\max_{i \in [1, \ldots, n]} \{\dot{\mathbf{w}} \cdot \psi^i\}$. Since $\dot{v} < 0$, it means that $\dot{\mathbf{w}}/\|\dot{\mathbf{w}}\|_2$ (i.e., $c = 1/\|\dot{\mathbf{w}}\|_2$) is a feasible solution and a better minimizer than $\dot{\mathbf{w}}$. Therefore $\dot{\mathbf{w}}$ is not the minimizer.

We conclude that the constraint is always binding, i.e., $\|\mathbf{w}\|_2 = 1$, $\kappa > 0$. $\square$

Theorem 1 (Strict improvement). Let $\Pi_1, \ldots, \Pi_t$ be the sets of policies constructed by Algorithm 1. We have that the worst-case performance of the SMP induced by these sets is strictly improving in each iteration, that is: $\bar{v}^{\mathrm{SMP}}_{\Pi_{t+1}} > \bar{v}^{\mathrm{SMP}}_{\Pi_t}$. Furthermore, when the algorithm stops, there does not exist a single policy $\pi^{t+1}$ such that adding it to $\Pi_t$ will result in improvement: $\nexists \pi^{t+1} \in \Pi$ s.t. $\bar{v}^{\mathrm{SMP}}_{\Pi_t + \{\pi^{t+1}\}} > \bar{v}^{\mathrm{SMP}}_{\Pi_t}$.

Proof. We have that

$$\bar{v}^{\mathrm{SMP}}_{\Pi_t} = \min_{\mathbf{w} \in B_2} \max_{\psi \in \Psi_t} \psi \cdot \mathbf{w} \leq \max_{\psi \in \Psi_t} \psi \cdot \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_{t+1}} \leq \max_{\psi \in \Psi_{t+1}} \psi \cdot \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_{t+1}} = \bar{v}^{\mathrm{SMP}}_{\Pi_{t+1}}. \quad (19)$$

The first inequality is true because we replace the minimization over $\mathbf{w}$ with $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_{t+1}}$, and the second inequality is true because we add a new policy to the set. Thus, we will focus on showing that the first inequality is strict. We do it in two steps. In the first step, we will show that the problem $\min_{\mathbf{w} \in B_2} \max_{\psi \in \Psi_t} \psi \cdot \mathbf{w}$ has a unique solution $\mathbf{w}^\star_t$. Thus, for the first inequality to hold with equality it must be that $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_{t+1}} = \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$. However, we know that, since the algorithm did not stop, $\psi^{t+1} \cdot \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t} > \bar{v}^{\mathrm{SMP}}_{\Pi_t}$, hence a contradiction.

We will now show that $\min_{\mathbf{w} \in B_2} \max_{\psi \in \Psi_t} \psi \cdot \mathbf{w}$ has a unique solution. Before we begin, we refer the reader to Lemma 4 and Theorem 2, where we reformulate the problem to a form that is simpler to analyze. We begin by looking at the partial Lagrangian of Eq. (17):

$$L(\mathbf{w}, \gamma, \kappa, \lambda) = \gamma + \sum_{j \in J} \lambda_j (\psi^j \cdot \mathbf{w} - \gamma) + \kappa(\|\mathbf{w}\|_2^2 - 1),$$

where $\kappa \geq 0$ is associated with the constraint $\|\mathbf{w}\|_2 \leq 1$. Denote by $(\lambda^\star, \kappa^\star)$ any optimal dual variables and note that by complementary slackness we know that either $\kappa^\star > 0$ and $\|\mathbf{w}\|_2 = 1$, or $\kappa^\star = 0$ and $\|\mathbf{w}\|_2 < 1$. Lemma 5 above guarantees that the constraint is in fact binding – only solutions with $\kappa^\star > 0$ and $\|\mathbf{w}\|_2 = 1$ are possible solutions. Notice that this is correct due to the fact that the SFs have positive coordinates and not all of them are $0$ (as in our problem formulation). Consequently we focus on the case where $\kappa^\star > 0$, under which the Lagrangian is strongly convex in $\mathbf{w}$, and therefore the problem $\min_{\mathbf{w}, \gamma} L(\mathbf{w}, \gamma, \lambda^\star, \kappa^\star)$ has a unique solution. Every optimizer of the original problem must also minimize the Lagrangian evaluated at an optimal dual value, and since this minimizer is unique, it implies that the minimizer of the original problem is unique (Boyd et al., 2004, Sect. 5.5.5).

For the second part of the proof, notice that if the new policy $\pi^{t+1}$ does not achieve better reward w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$ than the policies in $\Pi_t$ then we have that:

$$\bar{v}^{\mathrm{SMP}}_{\Pi_{t+1}} = \min_{\mathbf{w} \in B_2} \max_{\pi \in \Pi_{t+1}} \psi(\pi) \cdot \mathbf{w} \leq \max_{\pi \in \Pi_{t+1}} \psi(\pi) \cdot \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t} = \max_{\pi \in \Pi_t} \psi(\pi) \cdot \bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t} = \bar{v}^{\mathrm{SMP}}_{\Pi_t};$$

thus, it is necessary that the policy $\pi^{t+1}$ achieves better reward w.r.t $\bar{\mathbf{w}}^{\mathrm{SMP}}_{\Pi_t}$ to guarantee strict improvement. $\square$

B AL
In AL there is no reward signal, and the goal is to observe and mimic an expert. The literature on AL is quite vast and dates back to the work of Abbeel & Ng (2004), who proposed a novel framework for AL. In this setting, an expert demonstrates a set of trajectories that are used to estimate the SFs of its policy $\pi_E$, denoted by $\psi^E$. The goal is to find a policy $\pi$ whose SFs are close to this estimate, and hence will have a similar return with respect to any weight vector $\mathbf{w}$, given by

$$\arg\max_\pi \min_{\mathbf{w} \in B_2} \mathbf{w} \cdot \big(\psi^\pi - \psi^E\big) = \arg\max_\pi -\|\psi^\pi - \psi^E\|_2 = \arg\min_\pi \|\psi^\pi - \psi^E\|_2. \quad (20)$$

The projection algorithm (Abbeel & Ng, 2004) solves this problem in the following manner. The algorithm starts with an arbitrary policy $\pi_0$ and computes its feature expectations $\psi^0$. At step $t$, the reward function is defined using the weight vector $\mathbf{w}_t = \psi^E - \bar{\psi}_{t-1}$ and the algorithm finds a policy $\pi_t$ that maximizes it, where $\bar{\psi}_t$ is a convex combination of the SFs of previous (deterministic) policies, $\bar{\psi}_t = \sum_{j=1}^t \alpha_j \psi^j$. In order to get $\|\bar{\psi}_T - \psi^E\|_2 \leq \epsilon$, the authors show that it suffices to run the algorithm for $T = O\big(\tfrac{k}{(1-\gamma)^2 \epsilon^2} \log \tfrac{k}{(1-\gamma)\epsilon}\big)$ iterations.

Recently, it was shown that this algorithm can be viewed as a Frank-Wolfe method, also known as the Conditional Gradient (CG) algorithm (Zahavy et al., 2020a). The idea is that solving Eq. (20) can be seen as a constrained convex optimization problem, where the optimization variable is the SFs, the objective is convex, and the SFs are constrained to be in the SFs polytope $K$, given as the following convex set:

Definition 7 (The SFs polytope). $K = \big\{x : x = \sum_{i=1}^{k+1} a_i \psi^i, \ a_i \geq 0, \ \sum_{i=1}^{k+1} a_i = 1, \ \pi^i \in \Pi\big\}$.

In general, convex optimization problems can be solved via the more familiar projected gradient descent algorithm. This algorithm takes a step in the reverse gradient direction, $z_{t+1} = x_t - \alpha_t \nabla h(x_t)$, and then projects $z_{t+1}$ back into $K$ to obtain $x_{t+1}$. However, in some cases, computing this projection may be computationally hard. In our case, projecting into $K$ is challenging since it has $|A|^{|S|}$ vertices (feature expectations of deterministic policies). Thus, computing the projection explicitly and then finding a $\pi$ whose feature expectations are close to this projection is computationally prohibitive.

The CG algorithm (Frank & Wolfe, 1956) (Algorithm 2) avoids this projection by finding a point $y_t \in K$ that has the largest correlation with the negative gradient. In AL, this step is equivalent to finding a policy whose SFs have the maximal inner product with the current gradient, i.e., solving an MDP whose reward vector $\mathbf{w}$ is the negative gradient. This is a standard RL (planning) problem and can be solved efficiently, for example, with policy iteration. We also know that there exists at least one optimal deterministic policy for it and that PI will return a solution that is a deterministic policy (Puterman, 1984).
Algorithm 2 The CG method (Frank & Wolfe, 1956)

Input: a convex set $K$, a convex function $h$, learning rate schedule $\alpha_t$.
Initialization: let $x_0 \in K$
for $t = 1, \ldots, T$ do
    $y_t = \arg\max_{y \in K} -\nabla h(x_{t-1}) \cdot y$
    $x_t = (1 - \alpha_t) x_{t-1} + \alpha_t y_t$
end for

For smooth functions, CG converges at a rate of $O(1/t)$, hence it requires $O(1/\epsilon)$ iterations to find an $\epsilon$-optimal solution to Eq. (20). This gives a logarithmic improvement on the result of Abbeel & Ng (2004). In addition, it was shown in (Zahavy et al., 2020a) that since the optimization objective is strongly convex, and the constraint set is a polytope, it is possible to use a variant of the CG algorithm, known as Away-steps Conditional Gradient (ASCG) (Wolfe, 1970). ASCG attains a linear rate of convergence when the set is a polytope (Guélat & Marcotte, 1986; Garber & Hazan, 2013; Jaggi, 2013), i.e., it converges after $O(\log(1/\epsilon))$ iterations. See (Zahavy et al., 2020a) for the exact constants and analysis.

There are some interesting relations between our problem and AL with "no expert", that is, solving

$$\arg\min_\pi \|\psi^\pi\|_2. \quad (21)$$

In terms of optimization, this problem is equivalent to Eq. (20), and the same algorithms can be used to solve them.

Both AL with "no expert" and our algorithm can be used to solve the same goal: achieve good performance w.r.t the worst-case reward. However, AL is concerned with finding a single policy, while our algorithm is explicitly designed to find a set of policies. There is no direct connection between the policies that are discovered by following these two processes. This is because the intrinsic rewards that are maximized by each algorithm are essentially different. Another way to think about this is that since the policy that is returned by AL is a mixed policy, its goal is to return a set of policies that are similar to the expert, but not diverse from one another. From a geometric perspective, the policies returned by AL are the nodes of the face in the polytope that is closest to the demonstrated SFs. Even more concretely, if the SFs of the expert are given exactly (instead of being approximated from trajectories), then the AL algorithm would return a single vertex (policy). Finally, while a mixed policy can be viewed as a composition of policies, it is not a SIP. Therefore, it does not encourage diversity in the set.
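To make the reduction concrete, here is a minimal sketch of CG specialized to the AL objective $h(\psi) = \|\psi - \psi^E\|_2^2$; the `best_response` oracle (an RL solver returning the SFs of a policy maximizing a given linear reward) is an assumed interface.

```python
import numpy as np

def conditional_gradient(best_response, psi_expert, psi_init, T=100):
    """Frank-Wolfe / CG for Eq. (20): each iteration solves an MDP whose
    reward is the negative gradient, then takes a convex-combination step."""
    psi = np.array(psi_init, dtype=float)
    for t in range(1, T + 1):
        grad = 2.0 * (psi - psi_expert)   # gradient of ||psi - psi_E||^2
        y = best_response(-grad)          # linear maximization over the SFs polytope
        alpha = 2.0 / (t + 2.0)           # standard CG step-size schedule
        psi = (1.0 - alpha) * psi + alpha * y
    return psi
```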
EGULARIZING W
In this section we experimented with constraining the set of rewards to include only vectors w whosemean is zero. Since we are using CVXPY (Diamond & Boyd, 2016) to optimize for w (Eq. (8)), thisrequires adding a simple constraint (cid:80) di =1 w i = 0 to the minimization problem. Note that constrainingthe mean to be zero does not change the overall problem qualitatively, but it does potentially increasethe difference in the relative magnitude of the elements in w . Since it makes the resulting w ’s havemore zero elements, i.e., it makes the w ’s more sparse, it can also be viewed as a method to regularizethe worst case reward. Adding this constraint increased the number of w ’s (and correspondingpolicies) that made a difference to the optimal value (Definition 5). To see this, note that the greencurve in Fig. 5(a) converges to the optimal value in iterations while the the green curve in Fig. 1(a))does so in iterations. As a result, the policies that were discovered by the algorithm are morediverse. To see this observe that the SFs in Fig. 5(b) are more focused on specific items than the SFsin Fig. 1(c). In Fig. 5(c) and Fig. 5(d) we verified that this increased diversity continues to be the casewhen we increase the feature dimension d . 16ublished as a conference paper at ICLR 2021 (a) SMP vs. GPI, d=5 (b) SFs, d=5 (c) SMP vs. GPI, d=9 (d) SFs, d=9 Figure 5: Experimental results with regularized ww