oIRL: Robust Adversarial Inverse Reinforcement Learning with Temporally Extended Actions
David Venuto, Jhelum Chakravorty, Leonard Boussioux, Junhao Wang, Gavin McCracken, Doina Precup
Abstract
Explicit engineering of reward functions for given environments has been a major hindrance to reinforcement learning methods. While Inverse Reinforcement Learning (IRL) is a solution to recover reward functions from demonstrations only, these learned rewards are generally heavily entangled with the dynamics of the environment and therefore not portable or robust to changing environments. Modern adversarial methods have yielded some success in reducing reward entanglement in the IRL setting. In this work, we leverage one such method, Adversarial Inverse Reinforcement Learning (AIRL), to propose an algorithm that learns hierarchical disentangled rewards with a policy over options. We show that this method has the ability to learn generalizable policies and reward functions in complex transfer learning tasks, while yielding results in continuous control benchmarks that are comparable to those of the state-of-the-art methods.
1. Introduction
Reinforcement learning (RL) has been able to learn policies in complex environments, but it usually requires designing suitable reward functions for successful learning. This can be difficult and may lead to learning sub-optimal policies with unsafe behavior (Amodei et al., 2016) in the case of poor engineering. Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004) can facilitate such reward engineering by learning an expert's reward function from expert demonstrations. IRL, however, comes with many difficulties.

*Equal contribution. Department of Computer Science, McGill University, Montreal, Canada; Mila, Montreal, Canada; Department of Operations Research, MIT, Cambridge, USA; DeepMind, Montreal, Canada. Correspondence to: David Venuto.
2. Preliminaries
Markov Decision Processes (MDPs) are defined by a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is a set of states, $A$ is the set of actions available to the agent, $P$ is the transition kernel giving a probability over next states given the current state and action, $R : S \times A \to [0, R_{\max}]$ is a reward function and $\gamma \in [0, 1)$ is a discount factor. $s_t$ and $a_t$ are respectively the state and action of the expert at time instant $t$. We define a policy $\pi$ as the probability distribution over actions conditioned on the current state, $\pi : S \times A \to [0, 1]$. A policy is modeled by a Gaussian distribution $\pi_\theta \sim \mathcal{N}(\mu, \sigma)$, where $\theta$ denotes the policy parameters. The value of a policy is defined as $V^\pi(s) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s]$, where $\mathbb{E}$ denotes the expectation. An agent follows a policy $\pi$ and receives reward from the environment. The state-action value function is $Q^\pi(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s, a]$, and the advantage is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. $r(s, a)$ denotes a one-step reward.

Options ($\omega \in \Omega$) are defined as a triplet $(I_\omega, \pi_\omega, \beta_\omega)$, where $\pi_\omega$ is the intra-option policy, $I_\omega \subseteq S$ is the initiation set of states and $\beta_\omega : S \to [0, 1]$ is the termination function. The policy over options is denoted $\pi_\Omega$. An option has a reward $r_\omega$ and an option policy $\pi_\omega$. The policy over options is parameterized by $\zeta$, the intra-option policies by $\alpha$ (one per option), the reward approximator by $\theta$, and the option termination probabilities by $\delta$. In the one-step case, selecting an option using the policy over options can be viewed as a mixture of completely specialized experts. The overall policy can be written as $\pi_\Theta(a \mid s) = \sum_{\omega \in \Omega} \pi_\Omega(\omega \mid s)\, \pi_\omega(a \mid s)$.

Disentangled Rewards are formally defined as a reward function $r^*_\theta(s, a, s')$ that is disentangled with respect to (w.r.t.) a ground-truth reward and a set of environment dynamics $\mathcal{T}$ such that, under all possible dynamics $T \in \mathcal{T}$, the optimal policy computed w.r.t. the reward function is the same.
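As a concrete illustration of this mixture-of-experts view, the sketch below evaluates $\pi_\Theta(a \mid s)$ for discrete options with Gaussian intra-option policies. It is a minimal sketch, not the paper's implementation; the two-option parameters are invented for the example.

```python
import numpy as np

def gaussian_pdf(a, mu, sigma):
    # Density of a 1-D Gaussian intra-option policy pi_omega(a | s).
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def policy_over_options_density(a, s, pi_omega_params, pi_Omega):
    """Overall action density pi_Theta(a|s) = sum_w pi_Omega(w|s) * pi_w(a|s).

    pi_omega_params: list of (mu(s), sigma(s)) callables, one per option.
    pi_Omega: callable mapping a state to a probability vector over options.
    """
    option_probs = pi_Omega(s)                      # shape (num_options,)
    density = 0.0
    for w, (mu_fn, sigma_fn) in enumerate(pi_omega_params):
        density += option_probs[w] * gaussian_pdf(a, mu_fn(s), sigma_fn(s))
    return density

# Toy example with two options (all numbers are illustrative only).
pi_omega_params = [(lambda s: 1.0, lambda s: 0.5), (lambda s: -1.0, lambda s: 0.3)]
pi_Omega = lambda s: np.array([0.7, 0.3])
print(policy_over_options_density(a=0.8, s=None, pi_omega_params=pi_omega_params, pi_Omega=pi_Omega))
```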
3. Related Work
Generative Adversarial Networks (GANs) learn a generator distribution $p_g$ and a discriminator $D_{\theta_D}(x)$. They use a prior distribution over input noise variables $p(z)$. Given these input noise variables, the mapping $G_{\theta_g}(z)$ is learned, which maps the noise variables to the data space. $G$ is a neural network. Another neural network, $D_{\theta_D}(x)$, learns to estimate the probability that $x$ came from the data set rather than from the generator $p_g$. In this two-player adversarial training procedure, $D$ is trained to maximize the probability of assigning the correct labels to the data set and the generated samples, while $G$ is trained to minimize $\log(1 - D_{\theta_D}(G_{\theta_G}(z)))$, which causes it to generate samples that are more likely to fool the discriminator.

Policy Gradient methods optimize a parameterized policy $\pi_\theta$ using gradient ascent. Given a discounting term, the objective to be optimized is $p(\theta, s_0) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_\theta(s_t) \mid s_0]$. Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a policy gradient method that relies on the policy gradient theorem, which states
$$\frac{\partial p(\theta, s_0)}{\partial \theta} = \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0) \sum_a \frac{\partial \pi(a \mid s)}{\partial \theta}\, Q^{\pi_\theta}(s, a).$$
PPO has been adapted to the option-critic architecture (PPOC) (Klissarov et al., 2017).

Inverse Reinforcement Learning (IRL) is a form of imitation learning in which the agent learns the expert's policy by observing expert demonstrations (Ng & Russell, 2000): the expert's reward is estimated from demonstrations, and forward RL is then applied to the estimated reward to find the optimal policy. Generative Adversarial Imitation Learning (GAIL) directly extracts optimal policies from expert demonstrations (Ho & Ermon, 2016). IRL, in contrast, infers a reward function from expert demonstrations, which is then used to optimize a generator policy. In IRL, an agent observes a set of state-action trajectories from an expert demonstrator. We let $\mathcal{T}_D = \{\tau_{E_1}, \tau_{E_2}, \ldots, \tau_{E_n}\}$ be the state-action trajectories of the expert, $\tau_{E_i} \sim \tau_D$, where $\tau_{E_i} = \{s_1, a_1, s_2, a_2, \ldots, s_k, a_k\}$. We wish to find the reward function $r(s, a)$ given the set of demonstrations $\mathcal{T}_D$. It is assumed that the demonstrations are drawn from the optimal policy $\pi^*(a \mid s)$. The Maximum Likelihood Estimation (MLE) objective of the IRL problem is therefore
$$\max_\theta J(\theta) = \max_\theta \mathbb{E}_{\tau \sim \tau_E}[\log(p_\theta(\tau))], \qquad (1)$$
with $p_\theta(\tau) \propto p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\, \exp(\gamma^t r_\theta(s_t, a_t))$.

Adversarial Inverse Reinforcement Learning (AIRL) is based on GAN-Guided Cost Learning (Finn et al., 2016a), which casts the MLE objective as a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) optimization problem over trajectories.
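As a rough illustration of the adversarial training objective described at the start of this section, the following computes the discriminator and generator losses on a toy batch. It is a generic GAN example with made-up linear stand-ins for the networks, not the setup used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in "networks": a linear generator G(z) and a linear discriminator D(x).
theta_g = rng.normal(size=2)             # generator parameters
theta_d = rng.normal(size=2)             # discriminator parameters

def G(z):
    return theta_g[0] * z + theta_g[1]

def D(x):
    return sigmoid(theta_d[0] * x + theta_d[1])

z = rng.normal(size=64)                  # input noise drawn from p(z)
x_data = rng.normal(loc=2.0, size=64)    # samples from the data distribution
x_fake = G(z)                            # generated samples

# Discriminator: maximize log D(x_data) + log(1 - D(G(z))) (written here as a loss to minimize).
d_loss = -(np.mean(np.log(D(x_data))) + np.mean(np.log(1.0 - D(x_fake))))
# Generator: minimize log(1 - D(G(z))), i.e. try to fool the discriminator.
g_loss = np.mean(np.log(1.0 - D(x_fake)))
print(d_loss, g_loss)
```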
In AIRL (Fu et al., 2018), the discriminator probability $D_\theta$ is evaluated using the state-action pairs from the generator (agent), as given by
$$D_\theta(s, a) = \frac{\exp(f_\theta(s, a))}{\exp(f_\theta(s, a)) + \pi(a \mid s)}. \qquad (2)$$
The agent tries to maximize $R(s, a) = \log(D_\theta(s, a)) - \log(1 - D_\theta(s, a))$, where $f_\theta(s, a)$ is a learned function and $\pi$ is pre-computed. This formulation is similar to GAIL, but with a recoverable reward function, since GAIL outputs 0.5 for the reward of all states and actions at optimality. The discriminator function is then formulated as $f_{\theta,\Phi}(s, a, s') = g_\theta(s, a) + \gamma h_\Phi(s') - h_\Phi(s)$, given a shaping function $h_\Phi$ and a reward approximator $g_\theta$. Under deterministic dynamics, it is shown in AIRL that there is a state-only reward approximator, $f^*(s, a, s') = r^*(s) + \gamma V^*(s') - V^*(s) = A^*(s, a)$, where the reward is invariant to the transition dynamics and is therefore disentangled.

Hierarchical Inverse Reinforcement Learning learns policies with high-level temporally extended actions using IRL. OptionGAN (Henderson et al., 2018) provides an adversarial IRL objective function for the discriminator with a policy over options. It is formulated such that $L_{\text{reg}}$ defines the regularization terms on the mixture of experts so that they converge to options. The discriminator objective in OptionGAN takes state-only input and is formulated as
$$L_\Omega = \mathbb{E}_\omega[\pi_{\Omega,\zeta}(\omega \mid s)\, L_{\alpha,\omega}] + L_{\text{reg}}, \quad \text{where} \quad L_{\alpha,\omega} = \mathbb{E}_{\tau_N}[\log(r_{\theta,\omega}(s))] + \mathbb{E}_{\tau_E}[\log(1 - r_{\theta,\omega}(s))]. \qquad (3)$$
Directed-Info GAIL (Sharma et al., 2019) implements GAIL in a policy-over-options framework. Work such as (Krishnan et al., 2016) solves the hierarchical problem of segmenting expert demonstration transitions by analyzing the changes in local linearity w.r.t. a kernel function. It has been suggested that decomposing the reward function alone is not enough (Henderson et al., 2018). Other works learn the latent dimension along with the policy for this task (Hausman et al., 2017; Wang et al., 2017). In this formulation, the latent structure is encoded in an unsupervised manner, so that the desired latent variable does not need to be provided. Our work parallels many hierarchical IRL methods, but with recoverable robust rewards.
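A minimal sketch of the AIRL-style discriminator in Eq. (2) and the reward it induces, assuming the learned function $f_\theta(s,a)$ and the policy density $\pi(a \mid s)$ are available as scalar values (the numbers below are placeholders):

```python
import numpy as np

def airl_discriminator(f_value, pi_a_given_s):
    # Eq. (2): D_theta(s, a) = exp(f_theta(s, a)) / (exp(f_theta(s, a)) + pi(a | s)).
    ef = np.exp(f_value)
    return ef / (ef + pi_a_given_s)

def airl_reward(f_value, pi_a_given_s):
    # R(s, a) = log D - log(1 - D), which simplifies to f_theta(s, a) - log pi(a | s).
    d = airl_discriminator(f_value, pi_a_given_s)
    return np.log(d) - np.log(1.0 - d)

# Illustrative values: f_theta(s, a) = 0.3 and pi(a | s) = 0.2.
print(airl_reward(0.3, 0.2))          # equals 0.3 - log(0.2)
print(0.3 - np.log(0.2))              # same value, via the simplified form
```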
4. MLE Objective for IRL Over Options
Let $(s_0, a_0, \ldots, s_T, a_T) \in \tau_{E_i}$ be an expert trajectory of state-action pairs. Denote by $(s_0, a_0, \omega_0, \ldots, s_T, a_T, \omega_T) \in \tau_{\pi_\Theta, t}$ a novice trajectory generated by the policy over options $\pi_{\Theta,t}$ of the generator at iteration $t$.

Given a trajectory of state-action pairs, we first define an option transition probability given a state and an option. Similar transition probabilities given state, action or option information are defined in Appendix A.
$$P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_{a \in A} \pi_{\omega,\alpha}(a \mid s_t)\, P(s_{t+1} \mid s_t, a)\big((1 - \beta_{\omega_t,\delta}(s_{t+1}))\,\mathbb{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\delta}(s_{t+1})\, \pi_{\Omega,\zeta}(\omega_{t+1} \mid s_{t+1})\big). \qquad (4)$$

We can similarly define a discounted return recursively. Consider the policy over options based on the probabilities of terminating or continuing the option policies, given a reward approximator $\hat{r}_\theta(s, a)$ for the state-action reward:
$$R_{\theta,\delta}(s, \omega, a) := \mathbb{E}\Big[\hat{r}_{\theta,\omega}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, R_{\theta,\alpha,\delta}(s', \omega)\big)\Big]. \qquad (5)$$
Here $\omega$ is selected according to $\pi_{\zeta,\Omega}(\omega \mid s)$. The expressions for all relevant discounted returns appearing in the analysis are given in Appendix B. A suitable parameterization of the discounted return $R$ can be found by maximizing the causal entropy $\mathbb{E}_{\tau \sim \mathcal{D}}[\log(p_\theta(\tau))]$ w.r.t. the parameter $\theta$. For a trajectory $\tau$ with $T$ time-steps we then have
$$p_\theta(\tau) \approx p(s_0, \omega_0) \prod_{t=0}^{T-1} P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t, a_t)\, e^{R_{\theta,\delta}(s_t, \omega_t, a_t)}. \qquad (6)$$

Similar to (Fu et al., 2018) and (Finn et al., 2016a), we define the MLE objective for the generator $p_\theta$ as
$$J(\theta) = \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)\Big]. \qquad (7)$$
Note that we may or may not know the option trajectories in our expert demonstrations; instead, they are estimated according to the policy over options. The gradient of (7) w.r.t. $\theta$ (see Appendix B for detailed derivations) is given by
$$\frac{\partial}{\partial \theta} J(\theta) = \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} \log(p_\theta(\tau))\Big] \approx \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big].$$
We define $p_{\theta,t}(s_t, a_t) = \int_{s_{t'}, a_{t'}:\, t' \neq t} p_\theta(\tau)\, ds_{t'}\, da_{t'}$ as the state-action marginal at time $t$. This allows us to examine the trajectory from step $t$, as defined similarly in (Fu et al., 2018). Consequently, we have
$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_{\theta,t}}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big]. \qquad (8)$$
Since $p_\theta$ is difficult to draw samples from, we estimate it using an importance sampling distribution over the generator density. We then compute an importance sampling estimate using a mixture policy $\mu_{t,\omega}(\tau)$ for each option $\omega$: we sample from a mixture policy $\mu_\omega(a \mid s)$ defined as $\frac{1}{2}\pi_\omega(a \mid s) + \frac{1}{2}\hat{p}_\omega(a \mid s)$, where $\hat{p}_\omega(a \mid s)$ is a density estimate trained on the demonstrations.
We wish to minimize $D_{KL}(\pi_\omega(\tau) \,\|\, p_\omega(\tau))$ to reduce the variance of the importance sampling distribution, where $D_{KL}$ is the Kullback-Leibler divergence (Kullback & Leibler, 1951) between two probability distributions. Applying the aforementioned density estimates in (8), we can express the gradient of the MLE objective $J$ as follows:
$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{\mu_t}\Big[\sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, \frac{p_{\theta,t,\omega}(s_t, a_t)}{\mu_{t,\omega}(s_t, a_t)}\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big], \qquad (9)$$
where
$$\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s) = \mathbb{E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\Big[\sum_{a \in A} \pi_{\omega,\alpha}(a \mid s)\Big(\frac{\partial}{\partial \theta} \hat{r}_\theta(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta} R_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big]\Big]. \qquad (10)$$
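To make the option transition probability of Eq. (4) concrete, the sketch below evaluates it in a small tabular setting; the sizes and the random placeholder distributions are assumptions made only for this example.

```python
import numpy as np

n_states, n_actions, n_options = 4, 3, 2
rng = np.random.default_rng(1)

# Placeholder tabular models (each distribution row is normalized).
pi_option = rng.dirichlet(np.ones(n_actions), size=(n_options, n_states))   # pi_{omega,alpha}(a|s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))            # P(s'|s,a)
beta = rng.uniform(size=(n_options, n_states))                              # beta_{omega,delta}(s')
pi_Omega = rng.dirichlet(np.ones(n_options), size=n_states)                 # pi_{Omega,zeta}(omega|s)

def option_transition_prob(s, w, s_next, w_next):
    """P(s_{t+1}, w_{t+1} | s_t, w_t), following Eq. (4)."""
    total = 0.0
    for a in range(n_actions):
        continue_term = (1.0 - beta[w, s_next]) * (1.0 if w == w_next else 0.0)
        switch_term = beta[w, s_next] * pi_Omega[s_next, w_next]
        total += pi_option[w, s, a] * P[s, a, s_next] * (continue_term + switch_term)
    return total

# Sanity check: summing over (s_next, w_next) should give 1 for any (s, w).
s, w = 0, 1
print(sum(option_transition_prob(s, w, sp, wp)
          for sp in range(n_states) for wp in range(n_options)))
```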
5. Discriminator Objective
In this section we formulate the discriminator, parameterized by $\theta$, as the odds ratio between the policy and the exponentiated reward distribution for option $\omega$. We have a discriminator $D_{\theta,\omega}$ for each option $\omega$ and a sampled generator option policy $\pi_\omega$, defined as follows:
$$D_{\theta,\omega}(s, a) = \frac{\exp(f_{\theta,\omega}(s, a))}{\exp(f_{\theta,\omega}(s, a)) + \pi_\omega(a \mid s)}. \qquad (11)$$
The discriminator $D_{\theta,\omega}$ is trained by minimizing the cross-entropy loss between expert demonstrations and generated examples, assuming we have the same number of options in the generated and expert trajectories. We define the per-step loss function $l_\theta$ as follows:
$$l_\theta(s, a, \omega) = -\mathbb{E}_{\mathcal{D}}[\log(D_{\theta,\omega}(s, a))] - \mathbb{E}_{\pi_{\Theta,t}}[\log(1 - D_{\theta,\omega}(s, a))]. \qquad (12)$$
The parameterized total loss for the entire trajectory, $L_{\theta,\delta}(s, a, \omega)$, can be expressed recursively as follows by taking expectations over the next options and states:
$$L_{\theta,\delta}(s, a, \omega) = l_\theta(s, a, \omega) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, L^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, L_{\theta,\alpha,\delta}(s', \omega)\big) \qquad (13)$$
$$L_{\theta,\alpha,\delta}(s, \omega) := \mathbb{E}_{a \in A}[L_{\theta,\delta}(s, \omega, a)] \qquad (14)$$
$$L^\Omega_{\zeta,\theta,\delta}(s, a) := \mathbb{E}_{\omega \in \Omega}[L_{\theta,\delta}(s, \omega, a)] \qquad (15)$$
$$L^\Omega_{\zeta,\theta,\alpha,\delta}(s) := \mathbb{E}_{\omega \in \Omega}[L_{\theta,\alpha,\delta}(s, \omega)]. \qquad (16)$$
The agent wishes to minimize $L_{\theta,\alpha,\delta}$ to find its optimal policy. For a given option $\omega$, define the reward function $\hat{R}_{\theta,\delta}(s, \omega, a)$, which is to be maximised. We write a negative discriminator loss $(-L_D)$ to turn our loss minimization problem into a maximization problem, as follows:
$$-L_D = \hat{R}_{\theta,\delta}(s, \omega, a) = \log(D_{\theta,\omega}(s, a)) - \log(1 - D_{\theta,\omega}(s, a)). \qquad (17)$$
We use a mixture of expert and novice observations, denoted $\bar{\mu}$, in our gradient. We then take the derivative of the negative discriminator loss as
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\exp(-L_{\theta,\delta}(s_t, \omega, a_t))}{\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)}\Big)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big]. \qquad (18)$$
We can multiply the numerator and denominator of the fraction in the mixture expectation by the state marginal $\pi_\omega(s_t) = \int_{a \in A} \pi_\omega(s_t, a_t)\, da$. This allows us to write $\hat{p}_{\theta,t,\omega}(s_t, a_t) = \exp(-L_{\theta,\delta}(s_t, \omega, a_t))\, \pi_{\omega,t}(s_t)$. Using this, we can derive an importance sampling distribution in our loss:
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\hat{p}_{\theta,t,\omega}(s_t, a_t)}{\hat{\mu}_{t,\omega}(s_t, a_t)}\Big)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big]. \qquad (19)$$
The gradient of this parameterized reward function corresponds to the negative of the discriminator loss gradient:
$$\frac{\partial}{\partial \theta} \hat{R}_{\theta,\delta}(s, \omega, a) \approx \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s, \omega, a)\big) = \mathbb{E}\Big[\frac{\partial}{\partial \theta} r_{\theta,\omega}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\Big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta}\big(-L^\Omega_{\zeta,\theta,\alpha,\delta}(s')\big) + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big]. \qquad (20)$$
See Appendix C for the detailed derivations of the terms appearing in (20). Substituting (20) into (19), one can see that (9) (the derivative of the MLE objective) and (10) have the same form as (19) (the derivative of the discriminator objective) and (20).
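As a rough sketch of the per-option discriminator update behind Eqs. (11), (12) and (17), the code below computes the cross-entropy loss and the induced reward from discriminator logits; the batch and the logit values are placeholders, not the paper's implementation.

```python
import numpy as np

def per_option_discriminator(f_value, log_pi_a_given_s):
    # Eq. (11): D_{theta,omega}(s, a) = exp(f) / (exp(f) + pi_omega(a|s)), evaluated in log space.
    logit = f_value - log_pi_a_given_s          # equals log D - log(1 - D)
    return 1.0 / (1.0 + np.exp(-logit))

def cross_entropy_loss(d_expert, d_novice):
    # Eq. (12): expert samples should be classified as 1, novice (generator) samples as 0.
    return -np.mean(np.log(d_expert)) - np.mean(np.log(1.0 - d_novice))

def option_reward(d):
    # Eq. (17): reward handed to the policy optimizer, log D - log(1 - D).
    return np.log(d) - np.log(1.0 - d)

# Toy batch for one option (values are illustrative only).
rng = np.random.default_rng(0)
f_expert, f_novice = rng.normal(0.5, 1.0, size=32), rng.normal(-0.5, 1.0, size=32)
log_pi = np.log(0.3)
d_expert = per_option_discriminator(f_expert, log_pi)
d_novice = per_option_discriminator(f_novice, log_pi)
print(cross_entropy_loss(d_expert, d_novice), option_reward(d_novice).mean())
```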
6. Learning Disentangled State-only Rewards with Options
In this section, we provide our main algorithm for learning robust rewards with options. Similar to AIRL, we implement our algorithm with a discriminator update that considers the rollouts of a policy over options. We perform this update with $(s, a, s')$ triplets and a discriminator function of the form $f_{\theta,\omega}(s, a, s')$, as given in (21). This allows us to formulate the discriminator with state-only rewards in terms of option-value function estimates, from which we can compute an option-advantage estimate. Since the reward function only requires the state, we learn a reward function and corresponding policy that are disentangled from the environmental transition dynamics.
$$f_{\omega,\theta}(s, a, s') = \hat{r}_{\omega,\theta}(s) + \gamma \hat{V}_\Omega(s') - \hat{V}_\Omega(s) = \hat{A}(s, a, \omega), \qquad (21)$$
where $Q(s, \omega) = \sum_{a \in A} \pi_{\omega,\alpha}(a \mid s)\big[r_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_{\delta,\omega}(s'))\, Q(s', \omega) + \beta_{\delta,\omega}(s')\, V_\Omega(s')\big)\big]$ and $V_\Omega(s) = \sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\, Q(s, \omega)$.

Our discriminator model must learn a parameterization of the reward function and the value function for each option, given the total loss function in (37). These parameterized models are learned with a multi-layer perceptron. For each option, the termination functions $\beta_{\omega,\delta}$ and option policies $\pi_{\omega,\alpha}$ are learned using PPOC.

Our main algorithm, oIRL, is given by Algorithm 1. Here, we iteratively train a discriminator from expert and novice sampled trajectories using the derived discriminator objective. This allows us to obtain reward function estimates for each option. We then use any policy optimization method for a policy over options given these estimated rewards. We can also use discriminator input in the state-only format described in (21). It is important to note that in our recursive loss, we recursively simulate a trajectory to compute the loss a finite number of times (and return if the state is terminal). We show the adversarial architecture of this algorithm in Appendix D.
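A minimal sketch of the state-only discriminator logit of Eq. (21), with stand-in multi-layer perceptrons for the per-option reward and the option-value shaping term; all sizes are assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp(in_dim, hidden, out_dim):
    """Tiny two-layer perceptron returning a forward function (weights are random stand-ins)."""
    w1, b1 = rng.normal(scale=0.1, size=(in_dim, hidden)), np.zeros(hidden)
    w2, b2 = rng.normal(scale=0.1, size=(hidden, out_dim)), np.zeros(out_dim)
    return lambda x: np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

state_dim, n_options, gamma = 8, 2, 0.99
reward_nets = [mlp(state_dim, 32, 1) for _ in range(n_options)]   # hat{r}_{omega,theta}(s), one per option
value_net = mlp(state_dim, 32, 1)                                  # hat{V}_Omega(s), shared shaping term

def f_state_only(s, s_next, option):
    # Eq. (21): f_{omega,theta}(s, a, s') = r_omega(s) + gamma * V_Omega(s') - V_Omega(s).
    return reward_nets[option](s) + gamma * value_net(s_next) - value_net(s)

s, s_next = rng.normal(size=state_dim), rng.normal(size=state_dim)
print(f_state_only(s, s_next, option=0))   # advantage-like logit fed to the per-option discriminator
```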
7. Convergence Analysis
In this section we explain the gist of the convergence analysis of oIRL. The detailed proofs can be found in Appendices E and F.

We first show that the true reward function is recovered (up to a constant) by the reward estimators: for each option's reward estimator $g_{\theta,\omega}(s)$, we have $g^*_\omega(s) = r^*(s) + c_\omega$, where $c_\omega$ is a finite constant. Using the fact that $g_{\theta,\omega}(s) \to g^*_\omega(s) = r^*(s) + c_\omega$, and by using the Cauchy-Schwarz inequality with the sup-norm, we prove that the TD-error update is a contraction, i.e.,
$$\max_{s'', \omega''} |Q_{\pi_\Omega, t}(s, \omega) - Q^*(s, \omega)| \leq \epsilon + \max_{\omega \in \Omega} c_\omega. \qquad (22)$$
In order to prove asymptotic convergence to the optimal option-value $Q^*$, we show using the contraction argument that $g_{\theta,\omega}(s) + \gamma Q(s', \omega)$ converges to $Q^*$ by establishing the following inequality:
$$\big|\mathbb{E}[g_{\theta,\omega}(s)] + \gamma\, \mathbb{E}[Q(s', \omega) \mid s] - Q^*(s', \omega)\big| \leq \big(\max_{\omega \in \Omega} c_\omega\big)\big(\epsilon + \max_{\omega \in \Omega} c_\omega\big)\gamma. \qquad (23)$$
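The following toy numerical illustration of the contraction argument runs value iteration with learned rewards that are offset from the true reward by a per-option constant, as in the recoverability result above; the tiny MDP, the option-level transitions and the offsets are made up for the illustration and are a simplification of the full option model.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_options, gamma = 5, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_options))   # P(s'|s, omega), option-level for simplicity
r_true = rng.uniform(size=n_states)                                 # true state-only reward r*(s)
c = np.array([0.2, -0.1])                                           # per-option offsets c_omega
g = r_true[None, :] + c[:, None]                                    # learned rewards g*_omega(s) = r*(s) + c_omega

def value_iteration(reward):
    # reward[w, s]: option-conditioned reward; standard Q-iteration over (s, omega).
    Q = np.zeros((n_states, n_options))
    for _ in range(500):
        V = Q.max(axis=1)
        Q = np.array([[reward[w, s] + gamma * P[s, w] @ V for w in range(n_options)]
                      for s in range(n_states)])
    return Q

Q_star = value_iteration(np.tile(r_true, (n_options, 1)))
Q_hat = value_iteration(g)
# The sup-norm gap stays bounded in terms of the reward offsets (cf. the constant in Eq. (22)).
print(np.abs(Q_hat - Q_star).max(), np.abs(c).max() / (1 - gamma))
```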
8. Experiments

oIRL learns disentangled reward functions for each option policy, which facilitates policy generalizability and is instrumental in transfer learning. Transfer learning can be described as using information learned by solving one problem and then applying it to a different but related problem. In the RL sense, it means taking a policy trained on one environment and then using that policy to solve a similar task in a different, previously unseen environment.

We run experiments in different environments to address the following questions:

• Does learning a policy over options with the AIRL framework improve policy generalization and reward robustness in transfer learning tasks where the environmental dynamics are manipulated?

• Can the policy-over-options framework match or exceed benchmarks for imitation learning on complex continuous control tasks?
Algorithm 1 IRL Over Options with Robust Rewards (oIRL)

Require: Expert trajectories $\{\tau_{E_1}, \ldots, \tau_{E_n}\} \in T_D$; initial parameters $(\theta_0, \zeta_0, \delta_0, \alpha_0)$; discount $\gamma$
Initialize policies $\pi_{\omega,\alpha}$, $\pi_{\Omega,\zeta}$, discriminators $D_{\theta_0,\omega}$ and terminations $\beta_{\omega,\delta}$ for all $\omega \in \Omega$
for step $t = 0, 1, 2, \ldots, T$ do
    Collect trajectories $\tau_i = (s_0, a_0, \omega_0, \ldots)$ from $\pi_{\omega,\alpha_t}$, $\pi_{\Omega,\zeta_t}$, $\beta_{\omega,\delta_t}$
    Train the discriminator $D_{\theta_t,\omega}$:
    for step $k = 0, 1, 2, \ldots$ do
        Sample $(s_k, a_k, s'_k, \omega_k) \sim \tau_{i,t}$
        if $s'_k$ is not a terminal state then
            Sample $\omega'_k \sim \pi_{\Omega,\zeta_t}(\omega \mid s'_k)$, $a'_{k,1} \sim \pi_{\omega'_k,\alpha_t}(a \mid s'_k)$, $a'_{k,2} \sim \pi_{\omega_k,\alpha_t}(a \mid s'_k)$
            Observe $s''_{k,1}, s''_{k,2}$ from the environment
            $L_k(s_k, a_k, s'_k, \omega_k) = -\mathbb{E}_D[\log(D_{\theta_t,\omega_k}(s_k, a_k, s'_k))] - \mathbb{E}_{\pi_{\Theta,t}}[\log(1 - D_{\theta_t,\omega_k}(s_k, a_k, s'_k))]$
            Optimize model parameters w.r.t. $-L_D = L_k + \gamma\big(\beta_{\delta_t,\omega_k}(s'_k)\, L(s'_k, a'_{k,1}, s''_{k,1}, \omega'_k) + (1 - \beta_{\delta_t,\omega_k}(s'_k))\, L(s'_k, a'_{k,2}, s''_{k,2}, \omega_k)\big)$
        end if
    end for
    Obtain the reward $r_{\theta_t,\omega}(s, a, s') \leftarrow \log(D_{\theta_t,\omega}(s, a, s')) - \log(1 - D_{\theta_t,\omega}(s, a, s'))$
    Update $\pi_{\omega,\alpha_t}$, $\beta_{\omega,\delta_t}$ for all $\omega \in \Omega$ and $\pi_{\Omega,\zeta_t}$ with any policy optimization method (e.g. PPOC)
end for

To answer these questions, we compare our model against AIRL (the current state of the art for transfer learning) on a transfer task, by learning in an ant environment and modifying the physical structure of the ant, and we compare our method on various benchmark IRL continuous control tasks. We wish to see whether learning disentangled rewards for sub-tasks through the options framework is more portable.

We train a policy using each of the baseline methods and our method on these expert demonstrations for 500 time steps on the gait environments and 500 time steps on the hierarchical ones. Then we take the trained policy (the parameterized distribution), use this policy on the transfer environments, and observe the reward obtained. Such a method of transferring the policy is called a direct policy transfer. For the transfer learning tasks, we use
Transfer Environments for MuJoCo (Chu & Arnold, 2018), a set of gym environments for studying potential improvements in transfer learning tasks. The task involves an Ant agent, which optimizes a gait to crawl sideways across the landscape. The expert demonstrations are obtained from the optimal policy in the basic Ant environment. We disable the agent ant in two ways for two transfer learning tasks. In the BigAnt tasks, the length of all legs is doubled, though no extra joints are added. The Amputated Ant task modifies the agent by shortening a single leg to disable it. These transfer tasks require learning a true disentangled reward of walking sideways instead of directly imitating and learning a reward specific to the gait movements. These manipulations are shown in Figure 1.
Table 1. The mean reward obtained (higher is better) over 100 runs for the Gait transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.

                      BigAnt     Amputated Ant
AIRL (primitive)      -11.6      134.32
2 Options oIRL
4 Options oIRL        -1.7
Ground Truth
Figure 1. MuJoCo Ant Gait transfer learning task environments: (a) the Ant environment, (b) the Big Ant environment, (c) the Amputated Ant environment. When the ant is disabled, it must position itself correctly to crawl forward. This requires a different initial policy than in the original environment, where the ant must only crawl sideways.
Table 1 shows the results in terms of reward achieved for the ant gait transfer tasks. As we can see, in both experiments our algorithm performs better than AIRL. Note that the ground truth is obtained with PPO after 2 million iterations, and is therefore much less sample efficient than IRL.
We also create transfer learning environments in a 2D Maze environment with lava blockades. The goal of the agent is to go through the opening in a row of lava cells and reach a goal on the other end. For the transfer learning task, we train the agent on an environment where the "crossing" path requires the agent to go through the middle (LavaCrossing-M), and then the policy is directly transferred and used on a GridWorld of the same size where the crossing is on the right end of the room (LavaCrossing-R). An additional task involves changing a blockade in a Maze (FlowerMaze-(R,T)). The environments are shown in Figure 2. We can think of two sub-tasks in this environment: going to the lava crossing and then going to the goal.

In all of these environments, the rewards are sparse. The agent receives a non-zero reward only after completing the mission, and the magnitude of the reward is $1 - 0.9 \cdot n / n_{\max}$, where $n$ is the length of the successful episode and $n_{\max}$ is the maximum number of steps that we allow for completing the episode, different for each mission.

Figure 2. The MiniGrid transfer learning task set 1: (a) LavaCrossing-M, (b) LavaCrossing-R, (c) FlowerMaze-R, (d) FlowerMaze-T. Here the policy is trained on (a) or (c) using our method and the baseline methods, and then transferred to environment (b) or (d). The green cell is the goal.
We show the mean reward after 10 runs using direct policy transfers on these environments in Table 2. The 4-option oIRL achieved the highest reward on the LavaCrossing tasks. The FlowerMaze task was quite difficult, with most algorithms obtaining very low reward; options still yield a large improvement.
Table 2. The mean reward obtained (higher is better) over 10 runs for the Maze transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.

                      LavaCrossing    FlowerMaze
AIRL (primitive)      0.64            0.112
2 Options oIRL        0.67            0.204
4 Options oIRL
Ground Truth
In addition, we adopt more complex hierarchical environments that require both locomotion and object interaction. In the first environment, the ant must interact with a large movable block; this is the Ant-Push environment (Duan et al., 2016). To reach the goal, the ant must complete two successive processes: first, it must move to the left of the block and then push the block right, which clears the path towards the target location. There is a maximum of 500 timesteps. These can be thought of as hierarchical tasks with pushing to the left, pushing to the right and going to the goal as sub-goals.

We also utilize an Ant-Maze environment (Florensa et al., 2017), where we have a simple maze with a goal at the end. The agent receives a reward of +1 if it reaches the goal and 0 elsewhere. The ant must learn to make two turns in the maze: the first is down the hallway for one step, followed by a turn towards the goal. Again, we see hierarchical behavior in this task: we can think of sub-goals consisting of learning to exit the first hall of the maze, then making the turn, and finally going down the final hall towards the goal. The two complex environments are shown in Figure 3.

Figure 3. MuJoCo Ant Complex Gait transfer learning task environments: (a) the Ant-Maze environment, (b) the Ant-Push environment. We perform these transfer learning tasks with the Big Ant and the Amputated Ant.
Table 3 shows that oIRL performs better than AIRL in all of the complex hierarchical transfer tasks. In some tasks, such as the Maze environment, AIRL has few or no successful runs, while our method achieves reasonably high reward. In the BigAnt Push task, AIRL achieves only very minimal reward, whereas oIRL succeeds at the task in some cases.
Table 3. The mean reward obtained (higher is better) over 100 runs for the MuJoCo Ant Complex Gait transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.

                      BigAnt Maze    Amputated Ant Maze    BigAnt Push    Amputated Ant Push
AIRL (primitive)      0.28           0.14                  0.02           0.172
2 Options oIRL
4 Options oIRL        0.55
Ground Truth
Figure 4.
MuJoCo continuous control locomotion tasks, showing the mean reward (higher is better) achieved over 500 iterations of the benchmark algorithms for 10 random seeds. The shaded area represents the standard deviation.
We also test our algorithm on a number of robotic continuous control benchmark tasks. These tasks do not involve transfer. We show the plots of the average reward for each iteration during training in Figure 4. Achieving a higher reward in fewer iterations is better in these experiments. We examine the Ant, Half Cheetah and Walker MuJoCo gait/locomotion tasks. We run these experiments with 10 random seeds. The results are quite similar across the benchmarks; using a policy over options shows reasonable improvements on each task.
9. Discussion
This work presents Option-Inverse Reinforcement Learning (oIRL), the first hierarchical IRL algorithm with disentangled rewards. We validate oIRL on a wide variety of tasks, including transfer learning tasks, locomotion tasks, complex hierarchical transfer RL environments and GridWorld transfer navigation tasks, and compare our results with the state-of-the-art algorithm. Combining options with a disentangled IRL framework results in highly portable policies. Our empirical studies show clear and significant improvements for transfer learning. The algorithm is also shown to perform well in continuous control benchmark tasks.

For future work, we wish to test other sampling methods (e.g., Markov chain Monte Carlo) to estimate the implicit distribution of the discriminator-generator pair in our GAN, such as the Metropolis-Hastings GAN (Turner et al., 2019). We also wish to investigate methods to reduce the computational complexity of computing the recursive loss function, which requires simulating short trajectories, and to lower its variance. Analyzing our algorithm with physical robotic tests on tasks that require multiple sub-tasks would be an interesting future course of research.
References
Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In AAAI, 2017.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

Chu, E. and Arnold, S. Transfer environments for MuJoCo. GitHub, 2018. URL https://github.com/seba-1511/shapechanger.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control, 2016.

Finn, C., Christiano, P., Abbeel, P., and Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. In NeurIPS, 2016a.

Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, 2016b.

Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. Reverse curriculum generation for reinforcement learning. In CoRL, 2017.

Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

Hausman, K., Chebotar, Y., Schaal, S., Sukhatme, G., and Lim, J. J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In NeurIPS, 2017.

Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J., and Precup, D. OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In AAAI, 2018.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In NeurIPS, 2016.

Klissarov, M., Bacon, P.-L., Harb, J., and Precup, D. Learning options end-to-end for continuous action tasks. CoRR, abs/1712.00004, 2017.

Krishnan, S., Garg, A., Liaw, R., Miller, L., Pokorny, F. T., and Goldberg, K. Y. HIRL: Hierarchical inverse reinforcement learning for long-horizon tasks with delayed rewards. arXiv, abs/1604.06508, 2016.

Kullback, S. and Leibler, R. A. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

Ng, A. Y. and Russell, S. Algorithms for inverse reinforcement learning. In ICML, 2000.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sharma, M., Sharma, A., Rhinehart, N., and Kitani, K. M. Directed-Info GAIL: Learning hierarchical policies from unsegmented demonstrations using directed information. In ICLR, 2019.

Sutton, R., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.

Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 2009.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IROS, 2012.

Turner, R., Hung, J., Frank, E., Saatchi, Y., and Yosinski, J. Metropolis-Hastings generative adversarial networks. In ICML, 2019.

Wang, Z., Merel, J., Reed, S., Wayne, G., de Freitas, N., and Heess, N. Robust imitation of diverse behaviors. In NeurIPS, 2017.
A. Option Transition Probabilities
It is useful to redefine transition probabilities in terms of options. At each step there is an additional consideration: we can continue following the policy of the current option, or terminate the option with some probability, sample a new option from a state-dependent stochastic policy, and follow that option's policy. We have
$$P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_{a \in \mathcal{A}} \pi_{\omega,\alpha}(a \mid s_t)\, P(s_{t+1} \mid s_t, a)\big((1 - \beta_{\omega_t,\delta}(s_{t+1}))\,\mathbb{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\delta}(s_{t+1})\, \pi_{\Omega,\zeta}(\omega_{t+1} \mid s_{t+1})\big) \qquad (24)$$
$$P(s_{t+1}, \omega_{t+1} \mid s_t) = \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t) \sum_{a \in \mathcal{A}} \pi_{\omega,\alpha}(a \mid s_t)\, P(s_{t+1} \mid s_t, a)\big((1 - \beta_{\omega_t,\delta}(s_{t+1}))\,\mathbb{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\delta}(s_{t+1})\, \pi_{\Omega,\zeta}(\omega_{t+1} \mid s_{t+1})\big) \qquad (25)$$
$$P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t, a_t) = P(s_{t+1} \mid s_t, a_t)\big((1 - \beta_{\omega_t,\delta}(s_{t+1}))\,\mathbb{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\delta}(s_{t+1})\, \pi_{\Omega,\zeta}(\omega_{t+1} \mid s_{t+1})\big) \qquad (26)$$

B. MLE Objective for IRL Over Options
We can define a discounted return recursively for a policy over options, in a similar manner to the transition probabilities. Consider the policy over options based on the probabilities of terminating or continuing the option policies, given a reward approximator $\hat{r}_\theta(s, a)$ for the state-action reward:
$$R^\Omega_{\zeta,\theta,\alpha,\delta}(s) = \mathbb{E}_{\omega \in \Omega}[R_{\theta,\alpha,\delta}(s, \omega)]$$
$$R_{\theta,\alpha,\delta}(s, \omega) = \mathbb{E}_{a \in A}[R_{\theta,\delta}(s, \omega, a)]$$
$$R^\Omega_{\zeta,\theta,\delta}(s, a) = \mathbb{E}_{\omega \in \Omega}[R_{\theta,\delta}(s, \omega, a)]$$
$$R_{\theta,\delta}(s, \omega, a) = \mathbb{E}\Big[\hat{r}_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a, \omega)\big(\beta_{\omega,\delta}(s')\, R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, R_{\theta,\alpha,\delta}(s', \omega)\big)\Big]. \qquad (27)$$
These formulations of the reward function account for option transition probabilities, including the probability of terminating the current option and therefore selecting a new one according to the policy over options.

With $\omega$ selected according to $\pi_{\zeta,\Omega}(\omega \mid s)$, we can define a parameterization of the discounted return $R$ in the style of a maximum causal entropy RL problem with objective $\max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}[\log(p_\theta(\tau))]$, where
$$p_\theta(\tau) \sim p(s_0, \omega_0) \prod_{t=0}^{T-1} P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t, a_t)\, e^{R_{\theta,\delta}(s_t, \omega_t, a_t)}. \qquad (28)$$

MLE Derivative
We can write out the MLE objective for our generator. We may or may not know the option trajectories in our expert demonstrations, but they are estimated below according to the policy over options. This is defined similarly to (Fu et al., 2018) and (Finn et al., 2016a) as $J(\theta) = \mathbb{E}_{\tau \sim \tau_E}[\sum_{t=0}^{T} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)] - \mathbb{E}_{p_\theta}[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)]$. The full derivation is shown as (with generator $p_\theta$):
$$\begin{aligned} J(\theta) &= \mathbb{E}_{\tau \sim \tau_E}[\log(p_\theta(\tau))] \\ &= \mathbb{E}_{\tau \sim \tau_E}\big[R_{\theta,\delta}(s_t, \omega, a_t)\big] - \log(Z_\theta) \\ &= \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_\theta(s_t, \omega, a_t)\Big] - \log(Z_\theta) \\ &\approx \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)\Big] \\ &= \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] \qquad (29) \end{aligned}$$
We go from line 4 to line 5 using $R^\Omega_{\zeta,\theta,\delta}(s_t, a_t) = \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)$.

Now, taking the gradient of the MLE objective w.r.t. $\theta$ yields
$$\begin{aligned} \frac{\partial}{\partial \theta} J(\theta) &= \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} \log(p_\theta(\tau))\Big] \\ &= \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big] - \frac{\partial}{\partial \theta} \log(Z_\theta) \\ &\approx \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] \qquad (30) \end{aligned}$$
Recall that we define $p_{\theta,t}(s_t, a_t) = \int_{s_{t'}, a_{t'}:\, t' \neq t} p_\theta(\tau)\, ds_{t'}\, da_{t'}$ as the state-action marginal at time $t$:
$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_{\theta,t}}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] \qquad (31)$$
We perform importance sampling over the hard-to-estimate generator density. We construct an importance sampling distribution $\mu_{t,\omega}(\tau)$ for option $\omega$: we sample from a mixture policy $\mu_\omega(a \mid s)$ defined as $\frac{1}{2}\pi_\omega(a \mid s) + \frac{1}{2}\hat{p}_\omega(a \mid s)$, where $\hat{p}_\omega(a \mid s)$ is a rough density estimate trained on the demonstrations. We wish to minimize $D_{KL}(\pi_\omega(\tau) \,\|\, p_\omega(\tau))$, where $D_{KL}$ is the Kullback-Leibler divergence between two probability distributions. Our new gradient is
$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{\mu_t}\Big[\sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, \frac{p_{\theta,t,\omega}(s_t, a_t)}{\mu_{t,\omega}(s_t, a_t)}\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big]. \qquad (32)$$
Taking the derivative of the discounted option return results in
$$\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s) = \mathbb{E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\Big[\sum_{a \in A} \pi_{\omega,\alpha}(a \mid s)\Big(\frac{\partial}{\partial \theta} \hat{r}_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta} R_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big]\Big] \qquad (33)$$
$$\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s, a) = \mathbb{E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\Big(\frac{\partial}{\partial \theta} \hat{r}_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta} R_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big] \qquad (34)$$

C. Discriminator Objective
We formulate the discriminator as the odds ratio between the policy and the exponentiated reward distribution for option $\omega$, as in AIRL, parameterized by $\theta$. We have a discriminator for each option $\omega$ and a generator option policy $\pi_\omega$:
$$D_{\theta,\omega}(s, a) = \frac{\exp(f_{\theta,\omega}(s, a))}{\exp(f_{\theta,\omega}(s, a)) + \pi_\omega(a \mid s)}. \qquad (35)$$

C.1. Recursive Loss Formulation
We minimize the cross-entropy loss between expert demonstrations and generated examples, assuming we have the same number of options in the generated and expert trajectories. We define the per-step loss function $l_\theta$ as follows:
$$l_\theta(s, a, \omega) = -\mathbb{E}_{\mathcal{D}}[\log(D_{\theta,\omega}(s, a))] - \mathbb{E}_{\pi_{\Theta,t}}[\log(1 - D_{\theta,\omega}(s, a))]. \qquad (36)$$
The total loss for the entire trajectory can be expressed recursively as follows by taking expectations over the next options or states:
$$\begin{aligned} L_{\theta,\delta}(s, a, \omega) &= l_\theta(s, a, \omega) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, L^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, L_{\theta,\alpha,\delta}(s', \omega)\big) \\ L_{\theta,\alpha,\delta}(s, \omega) &= \mathbb{E}_{a \in A}[L_{\theta,\delta}(s, \omega, a)] \\ L^\Omega_{\zeta,\theta,\delta}(s, a) &= \mathbb{E}_{\omega \in \Omega}[L_{\theta,\delta}(s, \omega, a)] \\ L^\Omega_{\zeta,\theta,\alpha,\delta}(s) &= \mathbb{E}_{\omega \in \Omega}[L_{\theta,\alpha,\delta}(s, \omega)] \end{aligned} \qquad (37)$$
The agent wishes to minimize $L_{\theta,\delta}$ to find its optimal policy. Letting the cost function be $f_{\theta,\omega}(s, a) = -L_{\theta,\delta}(s, \omega, a)$, as in AIRL, we have
$$D_{\theta,\omega} = \frac{\exp(-L_{\theta,\delta}(s, \omega, a))}{\exp(-L_{\theta,\delta}(s, \omega, a)) + \pi_\omega(a \mid s)}. \qquad (38)$$

C.2. Optimization Criteria
For a given option $\omega$, we can write the reward function $\hat{R}_{\theta,\delta}(s, \omega, a)$ to be maximised as follows. Note that $\theta$ parameterizes the state-action reward function estimate for option $\omega$, and $-L_D$ is the negative discriminator loss; we therefore turn our minimization problem into a maximization problem. We define our objective similarly to the GAN objective of AIRL:
$$-L_D = \hat{R}_{\theta,\delta}(s, \omega, a) = \log(D_{\theta,\omega}(s, a)) - \log(1 - D_{\theta,\omega}(s, a)). \qquad (39)$$
Now we can write out the reward function in terms of the optimal discriminator:
$$\hat{R}_{\theta,\delta}(s, \omega, a) = \log\Big(\frac{\exp(-L_{\theta,\delta}(s, \omega, a))}{\exp(-L_{\theta,\delta}(s, \omega, a)) + \pi_\omega(a \mid s)}\Big) - \log\Big(\frac{\pi_\omega(a \mid s)}{\exp(-L_{\theta,\delta}(s, \omega, a)) + \pi_\omega(a \mid s)}\Big) = -L_{\theta,\delta}(s, \omega, a) - \log(\pi_\omega(a \mid s)). \qquad (40)$$
The derivative of this reward function can now be computed as follows:
$$\begin{aligned} \frac{\partial}{\partial \theta} \hat{R}_{\theta,\delta}(s, \omega, a) &\approx \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s, \omega, a)\big) \\ &= \mathbb{E}\Big[\frac{\partial}{\partial \theta} r_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\Big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta}\big(-L^\Omega_{\zeta,\theta,\alpha,\delta}(s')\big) + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big] \\ &= \mathbb{E}\Big[\frac{\partial}{\partial \theta} r_{\omega,\theta}(s, a)\Big] + \mathbb{E}\Big[\gamma \sum_{s' \in S} P(s' \mid s, a)\Big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta}\big(-L^\Omega_{\zeta,\theta,\alpha,\delta}(s')\big) + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big] \qquad (41) \end{aligned}$$
Writing out our discriminator objective yields
$$\begin{aligned} -L_D &= \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log(D_{\theta,\omega}(s_t, a_t))\Big] + \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log(1 - D_{\theta,\omega}(s_t, a_t))\Big] \\ &= \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\Big(\frac{\exp(-L_{\theta,\delta}(s_t, \omega, a_t))}{\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)}\Big)\Big] + \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\Big(\frac{\pi_\omega(a_t \mid s_t)}{\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)}\Big)\Big] \\ &= \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\big(\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)\big)\Big] \\ &\quad + \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log(\pi_\omega(a_t \mid s_t))\Big] - \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\big(\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)\big)\Big] \qquad (42) \end{aligned}$$
We set a mixture of expert and novice observations as $\bar{\mu}$:
$$-L_D = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] + \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log(\pi_\omega(a_t \mid s_t))\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\big(\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)\big)\Big] \qquad (43)$$
We can take the derivative w.r.t. $\theta$ (the state-action reward function estimate parameter):
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\exp(-L_{\theta,\delta}(s_t, \omega, a_t))}{\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)}\Big)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] \qquad (44)$$
We can multiply the numerator and denominator of the fraction in the mixture expectation by the state marginal $\pi_\omega(s_t) = \int_{a \in A} \pi_\omega(s_t, a_t)\, da$. This allows us to write $\hat{p}_{\theta,t,\omega}(s_t, a_t) = \exp(-L_{\theta,\delta}(s_t, \omega, a_t))\, \pi_{\omega,t}(s_t)$, giving an importance sampling form:
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\hat{p}_{\theta,t,\omega}(s_t, a_t)}{\hat{\mu}_{t,\omega}(s_t, a_t)}\Big)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] \qquad (45)$$
It is now easy to see that we have the same form as our MLE objective: the loss (the function we approximate with the GAN) is the discounted reward for a state-action pair, with the expectation over options. We change the loss functions to reward functions to show this, as they are defined equivalently:
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\hat{p}_{\theta,t,\omega}(s_t, a_t)}{\hat{\mu}_{t,\omega}(s_t, a_t)}\Big)\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big] \qquad (46)$$
In addition, we can decompose the reward into a state-action reward and a future discounted sum of rewards, considering the policy over options:
$$\begin{aligned} \frac{\partial}{\partial \theta}(-L_D) &= \sum_{t=0}^{T} \underbrace{\mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta} r_{\omega,\theta}(s_t, a_t)\Big]}_{\text{state-action reward}} \\ &\quad + \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \gamma \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\Big(\beta_{\omega,\delta}(s_{t+1})\, \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s_{t+1}) + (1 - \beta_{\omega,\delta}(s_{t+1}))\, \frac{\partial}{\partial \theta} R_{\theta,\alpha,\delta}(s_{t+1}, \omega)\Big)\Big] \\ &\quad - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\hat{p}_{\theta,t,\omega}(s_t, a_t)}{\hat{\mu}_{t,\omega}(s_t, a_t)}\Big)\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big] \qquad (47) \end{aligned}$$
We are given a mixture of expert and generated policies as $\bar{\mu}_t$ and perform importance sampling with respect to this distribution.

D. GAN Architecture
The architecture for our GAN-IRL framework is described in Figure 5.
E. Proof of Recoverable Rewards
A substantial amount of this proof is derived from (Fu et al., 2018).
Lemma 1: $f_{\theta,\omega}(s, a)$ recovers the advantage.

Proof: It is known that when $\pi_\omega = \pi_{E_\omega}$, we have achieved the global minimum of the discriminator objective. The discriminator must then output 0.5 for all state-action pairs. This results in $\exp(f_{\theta,\omega}(s, a)) = \pi_{E_\omega}(a \mid s)$. Equivalently, we have $f^*_\omega(s, a) = \log \pi_{E_\omega}(a \mid s) = A^*(s, a, \omega)$.

Figure 5. Architecture of the GAN-IRL framework.
Definition 1: Decomposability condition. We first define two states $s_1, s_2$ as 1-step linked under dynamics $T(s' \mid s, a)$ if there exists a state $s$ that can reach $s_1$ and $s_2$ with non-zero probability in one timestep. The transitivity property holds for the linked relationship: if $s_1$ and $s_2$ are linked, and $s_2$ and $s_3$ are linked, then $s_1$ and $s_3$ must also be linked. The decomposability condition for transition dynamics $T$ holds if all states in the MDP are linked with all other states.

Lemma 2:
Consider an MDP where the decomposability condition holds for all dynamics, and arbitrary functions $a(s), b(s), c(s), d(s)$. If for all $s$ and $s'$
$$a(s) + b(s') = c(s) + d(s'), \qquad (48)$$
then for all $s$
$$a(s) = c(s) + \text{const}_s, \qquad (49)$$
$$b(s) = d(s) + \text{const}_s, \qquad (50)$$
where $\text{const}_s$ denotes a constant with respect to the state $s$.

Proof:
If we rearrange Equation (48), we obtain the equality $a(s) - c(s) = b(s') - d(s')$. Now define $f(s) = a(s) - c(s)$. Given our equality, we have $f(s) = a(s) - c(s) = b(s') - d(s')$, which holds for some function dependent on $s$. For this to hold, $b(s') - d(s')$ must be equal to a constant (with the constant's value dependent on the state $s$) for all one-step successor states $s'$ of $s$. Under decomposability, all one-step successor states $s'$ of $s$ must agree through the transitivity property, so $b(s') - d(s')$ must be a constant with respect to the state $s$. Therefore, we can write $a(s) = c(s) + \text{const}_s$ for an arbitrary state $s$ and functions $b$ and $d$. Substituting this into Equation (48), we obtain $b(s) = d(s) + \text{const}_s$. This completes the proof.

Inductive proof for any successor state
Consider, for any MDP and arbitrary functions $a(\cdot), b(\cdot), c(\cdot), d(\cdot)$,
$$a(s) + b(S^{(k)}) = c(s) + d(S^{(k)}), \qquad (51)$$
where $S^{(k)}$ is the $k$-th successor state reached in $k$ time-steps from the current state. Let us denote by $T^{\pi,(k)}(s, S^{(k)})$ the probability of transitioning from state $s$ to $S^{(k)}$ in $k$ steps using policy $\pi$. Then we can express $T^{\pi,(k)}(s, S^{(k)})$ recursively as follows:
$$T^{\pi,(k)}(s, S^{(k)}) = \sum_{s' \in S} T^{\pi,(k-1)}(s, s')\, T^\pi(s', S^{(k)}), \qquad (52)$$
where $T^\pi(s', S^{(k)})$ is the one-step transition probability from state $s'$ to state $S^{(k)}$ (by definition of the Bellman operator). Denote by $P(S^{(k)})$ the probability of landing in state $S^{(k)}$ in $k$ steps from any current state. We can write $P(S^{(k)})$ using (52) as follows:
$$P(S^{(k)}) := \sum_{s \in S} T^{\pi,(k)}(s, S^{(k)})\, \mu(s), \qquad (53)$$
where $\mu$ is the state distribution. The unbiased estimator $\hat{s}^{(k)}$ of an unknown successor state $S^{(k)}$ is given by
$$\hat{s}^{(k)} := \mathbb{E}[S^{(k)}] = \sum_{s^{(k)} \in S} s^{(k)}\, P(S^{(k)}), \qquad (54)$$
where $P(S^{(k)})$ is given in (53). Now, replacing $S^{(k)}$ in (51) with its unbiased estimator $\hat{s}^{(k)}$ as given by (54), we have
$$a(s) - c(s) = b(\hat{s}^{(k)}) - d(\hat{s}^{(k)}) \overset{(a)}{=} f(k), \qquad (55)$$
for some function $f$, where $(a)$ holds since $\hat{s}^{(k)}$ depends only on $k$. Thus, we get $a(s) = c(s) + \text{const.}$ and $b(s) = d(s) + \text{const.}$, where the constant is with respect to the state $s$.

Theorem 1:
Suppose we have, for an MDP where the decomposability condition holds,
$$f_{\theta,\omega}(s, a, s') = g_\omega(s, a) + \gamma h_\Phi(s') - h_\Phi(s), \qquad (56)$$
where $h_\Phi$ is a shaping term, and suppose we obtain the optimal $f^*_{\theta,\omega}(s, a, s')$ with a reward approximator $g^*_\omega(s, a)$. Under deterministic dynamics the following holds:
$$g^*_\omega(s, a) + \gamma h^*_\Phi(s') - h^*_\Phi(s) = r^*_\omega(s) + \gamma V^*_\Omega(s') - V^*_\Omega(s) \qquad (57)$$
and
$$g^*_\omega(s) = r^*_\omega(s) + c_\omega. \qquad (58)$$

Proof:
We know $f^*_\omega(s, a, s') = A^*(s, a, \omega) = Q^*(s, a, \omega) - V^*_\Omega(s) = r^*_\omega(s) + \gamma V^*_\Omega(s') - V^*_\Omega(s)$. We can substitute the definition of $f^*_\omega(s, a, s')$ to obtain the first claim. Here
$$Q(s, \omega) = \sum_{a \in \mathcal{A}} \pi_{\omega,\alpha}(a \mid s)\Big[r_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_{\delta,\omega}(s'))\, Q(s', \omega) + \beta_{\delta,\omega}(s')\, V_\Omega(s')\big)\Big],$$
$$V_\Omega(s) = \sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\, Q(s, \omega),$$
$$Q(s, a, \omega) = \pi_{\omega,\alpha}(a \mid s)\Big[r_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_{\delta,\omega}(s'))\, Q(s', \omega) + \beta_{\delta,\omega}(s')\, V_\Omega(s')\big)\Big],$$
which holds for all $s$ and $s'$. Now we apply Lemma 2. We set $a(s) = g^*_\omega(s) - h^*_\Phi(s)$, $b(s') = \gamma h^*_\Phi(s')$, $c(s) = r(s) - V^*_\Omega(s)$ and $d(s') = \gamma V^*_\Omega(s')$, and rearrange according to Lemma 2. We therefore have the result that $g^*_\omega(s) = r_\omega(s) + c_\omega$, where $c_\omega$ is a constant.

F. Proof of Convergence
Definition 2: Reward Approximator Error.
From Theorem 1, we see that our reward approximator satisfies $g^*_\omega(s) = r_\omega(s) + c_\omega$. We define a reward approximator error over all options as $\delta_r = \sum_{\omega \in \Omega} \pi_\Omega(\omega)\, |g^*_\omega(s) - r^*(s)|$. This error is bounded as
$$\delta_r = \sum_{\omega \in \Omega} \pi_\Omega(\omega)\, |g^*_\omega(s) - r^*(s)| \leq \max_{\omega \in \Omega} c_\omega \qquad (59)$$
by definition of $g^*_\omega(s)$.

Lemma 3:
The Bellman operator for options in the IRL problem is a contraction.
Proof:
We prove this using the Cauchy-Schwarz inequality and the definition of the sup-norm. We state the inequality in terms of the IRL problem, where we have a reward estimator $\hat{g}_{\theta_\omega}(s)$ under our learned parameter $\theta$ and an optimal reward $r^*(s)$.
$$\begin{aligned} \|Q_{\pi_\Omega,t}(s, \omega) - Q^*(s, \omega)\|_\infty &= \Big\|\hat{g}_\theta(s) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta(s'))\, Q_{\pi_\Omega,t}(s', \omega) + \beta(s')\, \max_{\omega \in \Omega} Q_{\pi_\Omega,t}(s', \omega)\big) \\ &\qquad - r^*(s) - \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta(s'))\, Q^*(s', \omega) + \beta(s')\, \max_{\omega \in \Omega} Q^*(s', \omega)\big)\Big\|_\infty \\ &= \Big\|\hat{g}_\theta(s) - r^*(s) + \gamma \sum_{s' \in S} P(s' \mid s, a)\Big[(1 - \beta(s'))\big(Q_{\pi_\Omega,t}(s', \omega) - Q^*(s', \omega)\big) + \beta(s')\big(\max_{\omega \in \Omega} Q_{\pi_\Omega,t}(s', \omega) - \max_{\omega \in \Omega} Q^*(s', \omega)\big)\Big]\Big\|_\infty \\ &\leq \Big\|\gamma \sum_{s' \in S} P(s' \mid s, a)\Big[(1 - \beta(s'))\big(Q_{\pi_\Omega,t}(s', \omega) - Q^*(s', \omega)\big) + \beta(s')\big(\max_{\omega \in \Omega} Q_{\pi_\Omega,t}(s', \omega) - \max_{\omega \in \Omega} Q^*(s', \omega)\big)\Big]\Big\|_\infty + \max_{\omega \in \Omega} c_\omega \\ &\leq \gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{s'', \omega''} \|Q_{\pi_\Omega,t}(s'', \omega'') - Q^*(s'', \omega'')\|_\infty + \max_{\omega \in \Omega} c_\omega \\ &\leq \gamma \max_{s'', \omega''} \|Q_{\pi_\Omega,t}(s'', \omega'') - Q^*(s'', \omega'')\|_\infty + \max_{\omega \in \Omega} c_\omega \qquad (60) \end{aligned}$$
This follows from (Sutton et al., 1999) [Theorem 3], giving our result $\max_{s'', \omega''} |Q_{\pi_\Omega,t}(s, \omega) - Q^*(s, \omega)| \leq \epsilon + \max_{\omega \in \Omega} c_\omega$ for $\epsilon \in \mathbb{R}_{>0}$.

Theorem 2: $g_\theta(s) + \gamma Q(s', \omega)$ converges to $Q^*$.

Proof:
We know $g_\theta(s) \to g^*_\theta(s) = r^*(s) + \text{const}$. Given this, we can show by Cauchy-Schwarz:
$$\begin{aligned} &\big|\mathbb{E}[g_\theta(s)] + \gamma\, \mathbb{E}[Q(s', \omega) \mid s] - Q^*(s', \omega)\big| \\ &= \Big|\mathbb{E}[g_\theta(s)] + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_\omega(s'))\, Q(s', \omega) + \beta_\omega(s')\, V_\Omega(s')\big) - r^*(s) - \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_\omega(s'))\, Q^*(s', \omega) + \beta_\omega(s')\, \max_{\omega \in \Omega} Q^*(s', \omega)\big)\Big| \\ &= \Big|\mathbb{E}[g_\theta(s)] - r^*(s) + \gamma \sum_{s' \in S} P(s' \mid s, a)\Big[\beta_\omega(s')\big(\max_{\omega \in \Omega} Q(s', \omega) - \max_{\omega \in \Omega} Q^*(s', \omega)\big) + (1 - \beta_\omega(s'))\big(Q(s', \omega) - Q^*(s', \omega)\big)\Big]\Big| \\ &\overset{(a)}{\leq} \big(\max_{\omega \in \Omega} c_\omega\big)\, \Big|\gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{s'', \omega''} |Q(s'', \omega'') - Q^*(s'', \omega'')|\Big| \\ &\overset{(b)}{\leq} \big(\max_{\omega \in \Omega} c_\omega\big)\big(\epsilon + \max_{\omega \in \Omega} c_\omega\big)\, \gamma \sum_{s' \in S} P(s' \mid s, a) \\ &\leq \big(\max_{\omega \in \Omega} c_\omega\big)\big(\epsilon + \max_{\omega \in \Omega} c_\omega\big)\, \gamma, \qquad (61) \end{aligned}$$
where $(a)$ follows from Lemma 3 and $(b)$ holds since $\sum_{s' \in S} P(s' \mid s, a) \leq 1$.
G. Parameters for Experiments
G.1. MuJoCo Tasks
For these experiments, we use PPO to obtain an optimal policy given the ground truth rewards, trained for 2 million iterations (20 million on the complex tasks). This is used to obtain the expert demonstrations; we sample 50 expert trajectories. PPOC is used for the policy optimization step for the policy over options, and the deliberation cost hyper-parameter for PPOC is tuned via cross-validation. We also use state-only rewards for the policy transfer tasks. The hyperparameters for our policy optimization are given in Table 4.

Our discriminator is a neural network with the optimal architecture of 2 linear layers of 50 hidden units, each with ReLU activation, followed by a single-node linear output layer. We also tried other numbers of hidden units, including 100 and 25, and tanh activations during our hyperparameter optimization step using cross-validation.

The policy network has 2 layers of 64 hidden units. A batch size of 64 or 32 is used for 1 option and for any number of options greater than 1, respectively. No mini-batches are used in the discriminator, since the recursive loss must be computed. There are 2048 timesteps per batch. Generalized Advantage Estimation is used to compute advantage estimates. We list additional network parameters in the next section. The output of the policy network gives the Gaussian mean and the standard deviation, following the same procedure as in (Schulman et al., 2017).
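A rough PyTorch-style sketch of the discriminator architecture described above (two 50-unit ReLU layers followed by a single-node linear output); the input dimension and option count are placeholders, and this is an illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

# obs_dim is a placeholder; it depends on the environment's observation space.
obs_dim = 111

def make_discriminator(input_dim):
    # Per-option discriminator body: 2 linear layers of 50 hidden units with ReLU,
    # followed by a single-node linear output (the logit f_{theta,omega}).
    return nn.Sequential(
        nn.Linear(input_dim, 50),
        nn.ReLU(),
        nn.Linear(50, 50),
        nn.ReLU(),
        nn.Linear(50, 1),
    )

num_options = 4
discriminators = [make_discriminator(obs_dim) for _ in range(num_options)]
print(discriminators[0](torch.zeros(1, obs_dim)).shape)   # torch.Size([1, 1])
```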
Table 4. Policy optimization parameters for MuJoCo
Parameter                              Value
Discriminator Adam learning rate
Adam epsilon
PPOC Adam learning rate
GAE lambda
Entropy coefficient
Value loss coefficient                 0.5
Discount                               0.99
Batch size for PPO                     64 or 32
PPO epochs                             10
Entropy coefficient
Clip parameter                         0.2

G.2. MuJoCo Continuous Control Tasks
In this section, we describe the structure of the agents used in the continuous control gait benchmarks and their reward functions. For the transfer learning tasks, we use the same reward function described here for the Ant.
Walker: The walker is a planar biped. There are 7 rigid links, comprising the legs and a torso, with 6 actuated joints. This task is particularly prone to falling. The state space has 21 dimensions; the observations include joint angles, joint velocities and the coordinates of the center of mass. The reward function is the forward velocity minus a control cost, r(s, a) = v_x − c‖a‖², where c is a small control-cost coefficient. The episode terminates when z_body drops below a lower threshold, exceeds an upper threshold, or |θ_y| exceeds a threshold.

Half-Cheetah: The half-cheetah is a planar biped, like the Walker. There are 9 rigid links, comprising 9 actuated joints, a leg and a torso. The state space has 20 dimensions; the observations include joint angles, the coordinates of the center of mass, and joint velocities. The reward function is again r(s, a) = v_x − c‖a‖². There is no termination condition.

Ant: The ant has four legs with 13 rigid links in its structure; the legs have 8 actuated joints. The state space has 125 dimensions, including joint angles, joint velocities, the coordinates of the center of mass, the rotation matrix for the body, and a vector of contact forces. The reward function is r(s, a) = v_x − c‖a‖² − C_contact + a survival bonus, where C_contact is a penalty proportional to ‖F_contact‖² for contacts with the ground, with F_contact the contact force; its values are clipped to be between 0 and 1. The episode terminates when z_body leaves a fixed healthy range.
Figure 6.
Architecture of the actor-critic policies on MiniGrid. Conv denotes a convolutional layer, with filter sizes as described below; FC is a fully connected layer.
G.3. MiniGrid Tasks
For these experiments, we used the PPOC algorithm with parallelized data collection and GAE; 0.1 is the optimal deliberation cost. Each environment is run with 10 random network initializations. As before, in Table 5 we show some of the policy optimization parameters for the MiniGrid tasks. We rely on an actor-critic network architecture for these tasks. Since the state space is relatively large and spatial features are relevant, we use 3 convolutional layers in the network. The network architecture is detailed in Figure 6; n and m are defined by the grid dimensions.

The discriminator network is again a neural network, with the optimal architecture of 3 linear layers of 150 hidden units, each with ReLU activation, followed by a single-node linear output layer.
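A rough PyTorch-style sketch of the shape of the MiniGrid actor-critic network (a three-layer convolutional encoder feeding actor and critic heads). The channel counts, kernel sizes, head widths and grid size are assumptions for illustration only, since the exact values appear in Figure 6 rather than in the text.

```python
import torch
import torch.nn as nn

class MiniGridActorCritic(nn.Module):
    """Actor-critic with a 3-layer convolutional encoder (layer sizes are placeholders)."""

    def __init__(self, n, m, in_channels=3, num_actions=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat_dim = self.encoder(torch.zeros(1, in_channels, n, m)).shape[1]
        self.actor = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, num_actions))
        self.critic = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs):
        z = self.encoder(obs)
        return self.actor(z), self.critic(z)

net = MiniGridActorCritic(n=7, m=7)
logits, value = net(torch.zeros(1, 3, 7, 7))
print(logits.shape, value.shape)   # torch.Size([1, 7]) torch.Size([1, 1])
```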
Table 5. Policy optimization parameters for benchmark tasks in MiniGrid
Parameter                              Value
Adam optimizer learning rate
Adam epsilon
Entropy coefficient
Value loss coefficient                 0.5
Discount                               0.99
Maximum norm of gradient in PPO        0.5
Number of PPO epochs                   4
Batch size for PPO                     256
Entropy coefficient