oIRL: Robust Adversarial Inverse Reinforcement Learning with Temporally Extended Actions
David Venuto, Jhelum Chakravorty, Leonard Boussioux, Junhao Wang, Gavin McCracken, Doina Precup
Abstract
Explicit engineering of reward functions for given environments has been a major hindrance to reinforcement learning methods. While Inverse Reinforcement Learning (IRL) is a solution to recover reward functions from demonstrations only, these learned rewards are generally heavily entangled with the dynamics of the environment and therefore not portable or robust to changing environments. Modern adversarial methods have yielded some success in reducing reward entanglement in the IRL setting. In this work, we leverage one such method, Adversarial Inverse Reinforcement Learning (AIRL), to propose an algorithm that learns hierarchical disentangled rewards with a policy over options. We show that this method has the ability to learn generalizable policies and reward functions in complex transfer learning tasks, while yielding results in continuous control benchmarks that are comparable to those of the state-of-the-art methods.
1. Introduction
Reinforcement learning (RL) has been able to learn policies in complex environments, but it usually requires designing suitable reward functions for successful learning. This can be difficult and may lead to learning sub-optimal policies with unsafe behavior (Amodei et al., 2016) in the case of poor engineering. Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004) can facilitate such reward engineering by learning an expert's reward function from expert demonstrations. IRL, however, comes with many difficulties.

*Equal contribution. Department of Computer Science, McGill University, Montreal, Canada; Mila, Montreal, Canada; Department of Operations Research, MIT, Cambridge, USA; DeepMind, Montreal, Canada. Correspondence to: David Venuto.
2. Preliminaries
Markov Decision Processes (MDPs) are defined by a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is a set of states, $A$ is the set of actions available to the agent, $P$ is the transition kernel giving a probability over next states given the current state and action, $R : S \times A \to [0, R_{\max}]$ is a reward function and $\gamma \in [0, 1)$ is a discount factor. $s_t$ and $a_t$ are respectively the state and action of the expert at time instant $t$. We define a policy $\pi$ as the probability distribution over actions conditioned on the current state, $\pi : S \times A \to [0, 1]$. A policy is modeled by a Gaussian distribution $\pi_\theta \sim \mathcal{N}(\mu, \sigma)$, where $\theta$ denotes the policy parameters. The value of a policy is defined as $V^\pi(s) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s]$, where $\mathbb{E}$ denotes the expectation. An agent follows a policy $\pi$ and receives reward from the environment. The state-action value function is $Q^\pi(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s, a]$, and the advantage is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. $r(s, a)$ denotes a one-step reward.

Options ($\omega \in \Omega$) are defined as a triplet $(I_\omega, \pi_\omega, \beta_\omega)$, where $\pi_\omega$ is the intra-option policy, $I_\omega \subseteq S$ is the initiation set of states and $\beta_\omega : S \to [0, 1]$ is the termination function. The policy over options is denoted $\pi_\Omega$. An option has a reward $r_\omega$ and an option policy $\pi_\omega$. The policy over options is parameterized by $\zeta$, the intra-option policies by $\alpha$ (one per option), the reward approximator by $\theta$, and the option termination probabilities by $\delta$. In the one-step case, selecting an option using the policy over options can be viewed as a mixture of completely specialized experts. The overall policy can be written as $\pi_\Theta(a \mid s) = \sum_{\omega \in \Omega} \pi_\Omega(\omega \mid s)\, \pi_\omega(a \mid s)$.

Disentangled Rewards are formally defined as a reward function $r^*_\theta(s, a, s')$ that is disentangled with respect to (w.r.t.) a ground-truth reward and a set of environment dynamics $\mathcal{T}$ such that, under all possible dynamics $T \in \mathcal{T}$, the optimal policy computed w.r.t. the reward function is the same.
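As a concrete illustration of this mixture-of-experts view, the sketch below evaluates $\pi_\Theta(a \mid s)$ for discrete options with Gaussian intra-option policies. It is a minimal sketch, not the paper's implementation; the two-option parameters are invented for the example.

```python
import numpy as np

def gaussian_pdf(a, mu, sigma):
    # Density of a 1-D Gaussian intra-option policy pi_omega(a | s).
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def policy_over_options_density(a, s, pi_omega_params, pi_Omega):
    """Overall action density pi_Theta(a|s) = sum_w pi_Omega(w|s) * pi_w(a|s).

    pi_omega_params: list of (mu(s), sigma(s)) callables, one per option.
    pi_Omega: callable mapping a state to a probability vector over options.
    """
    option_probs = pi_Omega(s)                      # shape (num_options,)
    density = 0.0
    for w, (mu_fn, sigma_fn) in enumerate(pi_omega_params):
        density += option_probs[w] * gaussian_pdf(a, mu_fn(s), sigma_fn(s))
    return density

# Toy example with two options (all numbers are illustrative only).
pi_omega_params = [(lambda s: 1.0, lambda s: 0.5), (lambda s: -1.0, lambda s: 0.3)]
pi_Omega = lambda s: np.array([0.7, 0.3])
print(policy_over_options_density(a=0.8, s=None, pi_omega_params=pi_omega_params, pi_Omega=pi_Omega))
```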
3. Related Work
Generative Adversarial Networks (GANs) learn a generator distribution $p_g$ and a discriminator $D_{\theta_D}(x)$. They use a prior distribution over input noise variables $p(z)$. Given these input noise variables, the mapping $G_{\theta_g}(z)$ is learned, which maps the noise variables to the data space. $G$ is a neural network. Another neural network, $D_{\theta_D}(x)$, learns to estimate the probability that $x$ came from the data set rather than from the generator $p_g$. In this two-player adversarial training procedure, $D$ is trained to maximize the probability of assigning the correct labels to the data set and the generated samples, while $G$ is trained to minimize $\log(1 - D_{\theta_D}(G_{\theta_G}(z)))$, which causes it to generate samples that are more likely to fool the discriminator.

Policy Gradient methods optimize a parameterized policy $\pi_\theta$ using gradient ascent. Given a discounting term, the objective to be optimized is $p(\theta, s_0) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_\theta(s_t) \mid s_0]$. Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a policy gradient method that relies on the policy gradient theorem, which states
$$\frac{\partial p(\theta, s_0)}{\partial \theta} = \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0) \sum_a \frac{\partial \pi(a \mid s)}{\partial \theta}\, Q^{\pi_\theta}(s, a).$$
PPO has been adapted to the option-critic architecture (PPOC) (Klissarov et al., 2017).

Inverse Reinforcement Learning (IRL) is a form of imitation learning in which the agent learns the expert's policy by observing expert demonstrations (Ng & Russell, 2000): the expert's reward is estimated from demonstrations, and forward RL is then applied to the estimated reward to find the optimal policy. Generative Adversarial Imitation Learning (GAIL) directly extracts optimal policies from expert demonstrations (Ho & Ermon, 2016). IRL, in contrast, infers a reward function from expert demonstrations, which is then used to optimize a generator policy. In IRL, an agent observes a set of state-action trajectories from an expert demonstrator. We let $\mathcal{T}_D = \{\tau_{E_1}, \tau_{E_2}, \ldots, \tau_{E_n}\}$ be the state-action trajectories of the expert, $\tau_{E_i} \sim \tau_D$, where $\tau_{E_i} = \{s_1, a_1, s_2, a_2, \ldots, s_k, a_k\}$. We wish to find the reward function $r(s, a)$ given the set of demonstrations $\mathcal{T}_D$. It is assumed that the demonstrations are drawn from the optimal policy $\pi^*(a \mid s)$. The Maximum Likelihood Estimation (MLE) objective of the IRL problem is therefore
$$\max_\theta J(\theta) = \max_\theta \mathbb{E}_{\tau \sim \tau_E}[\log(p_\theta(\tau))], \qquad (1)$$
with $p_\theta(\tau) \propto p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\, \exp(\gamma^t r_\theta(s_t, a_t))$.

Adversarial Inverse Reinforcement Learning (AIRL) is based on GAN-Guided Cost Learning (Finn et al., 2016a), which casts the MLE objective as a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) optimization problem over trajectories.
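As a rough illustration of the adversarial training objective described at the start of this section, the following computes the discriminator and generator losses on a toy batch. It is a generic GAN example with made-up linear stand-ins for the networks, not the setup used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in "networks": a linear generator G(z) and a linear discriminator D(x).
theta_g = rng.normal(size=2)             # generator parameters
theta_d = rng.normal(size=2)             # discriminator parameters

def G(z):
    return theta_g[0] * z + theta_g[1]

def D(x):
    return sigmoid(theta_d[0] * x + theta_d[1])

z = rng.normal(size=64)                  # input noise drawn from p(z)
x_data = rng.normal(loc=2.0, size=64)    # samples from the data distribution
x_fake = G(z)                            # generated samples

# Discriminator: maximize log D(x_data) + log(1 - D(G(z))) (written here as a loss to minimize).
d_loss = -(np.mean(np.log(D(x_data))) + np.mean(np.log(1.0 - D(x_fake))))
# Generator: minimize log(1 - D(G(z))), i.e. try to fool the discriminator.
g_loss = np.mean(np.log(1.0 - D(x_fake)))
print(d_loss, g_loss)
```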
In AIRL (Fu et al., 2018), the discriminator probability $D_\theta$ is evaluated using the state-action pairs from the generator (agent), as given by
$$D_\theta(s, a) = \frac{\exp(f_\theta(s, a))}{\exp(f_\theta(s, a)) + \pi(a \mid s)}. \qquad (2)$$
The agent tries to maximize $R(s, a) = \log(D_\theta(s, a)) - \log(1 - D_\theta(s, a))$, where $f_\theta(s, a)$ is a learned function and $\pi$ is pre-computed. This formulation is similar to GAIL, but with a recoverable reward function, since GAIL outputs 0.5 for the reward of all states and actions at optimality. The discriminator function is then formulated as $f_{\theta,\Phi}(s, a, s') = g_\theta(s, a) + \gamma h_\Phi(s') - h_\Phi(s)$, given a shaping function $h_\Phi$ and a reward approximator $g_\theta$. Under deterministic dynamics, it is shown in AIRL that there is a state-only reward approximator, $f^*(s, a, s') = r^*(s) + \gamma V^*(s') - V^*(s) = A^*(s, a)$, where the reward is invariant to the transition dynamics and is therefore disentangled.

Hierarchical Inverse Reinforcement Learning learns policies with high-level temporally extended actions using IRL. OptionGAN (Henderson et al., 2018) provides an adversarial IRL objective function for the discriminator with a policy over options. It is formulated such that $L_{\text{reg}}$ defines the regularization terms on the mixture of experts so that they converge to options. The discriminator objective in OptionGAN takes state-only input and is formulated as
$$L_\Omega = \mathbb{E}_\omega[\pi_{\Omega,\zeta}(\omega \mid s)\, L_{\alpha,\omega}] + L_{\text{reg}}, \quad \text{where} \quad L_{\alpha,\omega} = \mathbb{E}_{\tau_N}[\log(r_{\theta,\omega}(s))] + \mathbb{E}_{\tau_E}[\log(1 - r_{\theta,\omega}(s))]. \qquad (3)$$
Directed-Info GAIL (Sharma et al., 2019) implements GAIL in a policy-over-options framework. Work such as (Krishnan et al., 2016) solves the hierarchical problem of segmenting expert demonstration transitions by analyzing the changes in local linearity w.r.t. a kernel function. It has been suggested that decomposing the reward function alone is not enough (Henderson et al., 2018). Other works learn the latent dimension along with the policy for this task (Hausman et al., 2017; Wang et al., 2017). In this formulation, the latent structure is encoded in an unsupervised manner, so that the desired latent variable does not need to be provided. Our work parallels many hierarchical IRL methods, but with recoverable robust rewards.
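A minimal sketch of the AIRL-style discriminator in Eq. (2) and the reward it induces, assuming the learned function $f_\theta(s,a)$ and the policy density $\pi(a \mid s)$ are available as scalar values (the numbers below are placeholders):

```python
import numpy as np

def airl_discriminator(f_value, pi_a_given_s):
    # Eq. (2): D_theta(s, a) = exp(f_theta(s, a)) / (exp(f_theta(s, a)) + pi(a | s)).
    ef = np.exp(f_value)
    return ef / (ef + pi_a_given_s)

def airl_reward(f_value, pi_a_given_s):
    # R(s, a) = log D - log(1 - D), which simplifies to f_theta(s, a) - log pi(a | s).
    d = airl_discriminator(f_value, pi_a_given_s)
    return np.log(d) - np.log(1.0 - d)

# Illustrative values: f_theta(s, a) = 0.3 and pi(a | s) = 0.2.
print(airl_reward(0.3, 0.2))          # equals 0.3 - log(0.2)
print(0.3 - np.log(0.2))              # same value, via the simplified form
```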
4. MLE Objective for IRL Over Options
Let $(s_0, a_0, \ldots, s_T, a_T) \in \tau_{E_i}$ be an expert trajectory of state-action pairs. Denote by $(s_0, a_0, \omega_0, \ldots, s_T, a_T, \omega_T) \in \tau_{\pi_\Theta, t}$ a novice trajectory generated by the policy over options $\pi_{\Theta,t}$ of the generator at iteration $t$.

Given a trajectory of state-action pairs, we first define an option transition probability given a state and an option. Similar transition probabilities given state, action or option information are defined in Appendix A.
$$P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_{a \in A} \pi_{\omega,\alpha}(a \mid s_t)\, P(s_{t+1} \mid s_t, a)\big((1 - \beta_{\omega_t,\delta}(s_{t+1}))\,\mathbb{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\delta}(s_{t+1})\, \pi_{\Omega,\zeta}(\omega_{t+1} \mid s_{t+1})\big). \qquad (4)$$

We can similarly define a discounted return recursively. Consider the policy over options based on the probabilities of terminating or continuing the option policies, given a reward approximator $\hat{r}_\theta(s, a)$ for the state-action reward:
$$R_{\theta,\delta}(s, \omega, a) := \mathbb{E}\Big[\hat{r}_{\theta,\omega}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, R_{\theta,\alpha,\delta}(s', \omega)\big)\Big]. \qquad (5)$$
Here $\omega$ is selected according to $\pi_{\zeta,\Omega}(\omega \mid s)$. The expressions for all relevant discounted returns appearing in the analysis are given in Appendix B. A suitable parameterization of the discounted return $R$ can be found by maximizing the causal entropy $\mathbb{E}_{\tau \sim \mathcal{D}}[\log(p_\theta(\tau))]$ w.r.t. the parameter $\theta$. For a trajectory $\tau$ with $T$ time-steps we then have
$$p_\theta(\tau) \approx p(s_0, \omega_0) \prod_{t=0}^{T-1} P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t, a_t)\, e^{R_{\theta,\delta}(s_t, \omega_t, a_t)}. \qquad (6)$$

Similar to (Fu et al., 2018) and (Finn et al., 2016a), we define the MLE objective for the generator $p_\theta$ as
$$J(\theta) = \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)\Big]. \qquad (7)$$
Note that we may or may not know the option trajectories in our expert demonstrations; instead, they are estimated according to the policy over options. The gradient of (7) w.r.t. $\theta$ (see Appendix B for detailed derivations) is given by
$$\frac{\partial}{\partial \theta} J(\theta) = \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} \log(p_\theta(\tau))\Big] \approx \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big].$$
We define $p_{\theta,t}(s_t, a_t) = \int_{s_{t'}, a_{t'}:\, t' \neq t} p_\theta(\tau)\, ds_{t'}\, da_{t'}$ as the state-action marginal at time $t$. This allows us to examine the trajectory from step $t$, as defined similarly in (Fu et al., 2018). Consequently, we have
$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_{\theta,t}}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big]. \qquad (8)$$
Since $p_\theta$ is difficult to draw samples from, we estimate it using an importance sampling distribution over the generator density. We then compute an importance sampling estimate using a mixture policy $\mu_{t,\omega}(\tau)$ for each option $\omega$: we sample from a mixture policy $\mu_\omega(a \mid s)$ defined as $\frac{1}{2}\pi_\omega(a \mid s) + \frac{1}{2}\hat{p}_\omega(a \mid s)$, where $\hat{p}_\omega(a \mid s)$ is a density estimate trained on the demonstrations.
We wish to minimize $D_{KL}(\pi_\omega(\tau) \,\|\, p_\omega(\tau))$ to reduce the variance of the importance sampling distribution, where $D_{KL}$ is the Kullback-Leibler divergence (Kullback & Leibler, 1951) between two probability distributions. Applying the aforementioned density estimates in (8), we can express the gradient of the MLE objective $J$ as follows:
$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{\mu_t}\Big[\sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, \frac{p_{\theta,t,\omega}(s_t, a_t)}{\mu_{t,\omega}(s_t, a_t)}\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big], \qquad (9)$$
where
$$\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s) = \mathbb{E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\Big[\sum_{a \in A} \pi_{\omega,\alpha}(a \mid s)\Big(\frac{\partial}{\partial \theta} \hat{r}_\theta(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta} R_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big]\Big]. \qquad (10)$$
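To make the option transition probability of Eq. (4) concrete, the sketch below evaluates it in a small tabular setting; the sizes and the random placeholder distributions are assumptions made only for this example.

```python
import numpy as np

n_states, n_actions, n_options = 4, 3, 2
rng = np.random.default_rng(1)

# Placeholder tabular models (each distribution row is normalized).
pi_option = rng.dirichlet(np.ones(n_actions), size=(n_options, n_states))   # pi_{omega,alpha}(a|s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))            # P(s'|s,a)
beta = rng.uniform(size=(n_options, n_states))                              # beta_{omega,delta}(s')
pi_Omega = rng.dirichlet(np.ones(n_options), size=n_states)                 # pi_{Omega,zeta}(omega|s)

def option_transition_prob(s, w, s_next, w_next):
    """P(s_{t+1}, w_{t+1} | s_t, w_t), following Eq. (4)."""
    total = 0.0
    for a in range(n_actions):
        continue_term = (1.0 - beta[w, s_next]) * (1.0 if w == w_next else 0.0)
        switch_term = beta[w, s_next] * pi_Omega[s_next, w_next]
        total += pi_option[w, s, a] * P[s, a, s_next] * (continue_term + switch_term)
    return total

# Sanity check: summing over (s_next, w_next) should give 1 for any (s, w).
s, w = 0, 1
print(sum(option_transition_prob(s, w, sp, wp)
          for sp in range(n_states) for wp in range(n_options)))
```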
5. Discriminator Objective
In this section we formulate the discriminator, parameterized by $\theta$, as the odds ratio between the policy and the exponentiated reward distribution for option $\omega$. We have a discriminator $D_{\theta,\omega}$ for each option $\omega$ and a sampled generator option policy $\pi_\omega$, defined as follows:
$$D_{\theta,\omega}(s, a) = \frac{\exp(f_{\theta,\omega}(s, a))}{\exp(f_{\theta,\omega}(s, a)) + \pi_\omega(a \mid s)}. \qquad (11)$$
The discriminator $D_{\theta,\omega}$ is trained by minimizing the cross-entropy loss between expert demonstrations and generated examples, assuming we have the same number of options in the generated and expert trajectories. We define the per-step loss function $l_\theta$ as follows:
$$l_\theta(s, a, \omega) = -\mathbb{E}_{\mathcal{D}}[\log(D_{\theta,\omega}(s, a))] - \mathbb{E}_{\pi_{\Theta,t}}[\log(1 - D_{\theta,\omega}(s, a))]. \qquad (12)$$
The parameterized total loss for the entire trajectory, $L_{\theta,\delta}(s, a, \omega)$, can be expressed recursively as follows by taking expectations over the next options and states:
$$L_{\theta,\delta}(s, a, \omega) = l_\theta(s, a, \omega) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, L^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, L_{\theta,\alpha,\delta}(s', \omega)\big) \qquad (13)$$
$$L_{\theta,\alpha,\delta}(s, \omega) := \mathbb{E}_{a \in A}[L_{\theta,\delta}(s, \omega, a)] \qquad (14)$$
$$L^\Omega_{\zeta,\theta,\delta}(s, a) := \mathbb{E}_{\omega \in \Omega}[L_{\theta,\delta}(s, \omega, a)] \qquad (15)$$
$$L^\Omega_{\zeta,\theta,\alpha,\delta}(s) := \mathbb{E}_{\omega \in \Omega}[L_{\theta,\alpha,\delta}(s, \omega)]. \qquad (16)$$
The agent wishes to minimize $L_{\theta,\alpha,\delta}$ to find its optimal policy. For a given option $\omega$, define the reward function $\hat{R}_{\theta,\delta}(s, \omega, a)$, which is to be maximised. We write a negative discriminator loss $(-L_D)$ to turn our loss minimization problem into a maximization problem, as follows:
$$-L_D = \hat{R}_{\theta,\delta}(s, \omega, a) = \log(D_{\theta,\omega}(s, a)) - \log(1 - D_{\theta,\omega}(s, a)). \qquad (17)$$
We use a mixture of expert and novice observations, denoted $\bar{\mu}$, in our gradient. We then take the derivative of the negative discriminator loss as
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\exp(-L_{\theta,\delta}(s_t, \omega, a_t))}{\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)}\Big)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big]. \qquad (18)$$
We can multiply the numerator and denominator of the fraction in the mixture expectation by the state marginal $\pi_\omega(s_t) = \int_{a \in A} \pi_\omega(s_t, a_t)\, da$. This allows us to write $\hat{p}_{\theta,t,\omega}(s_t, a_t) = \exp(-L_{\theta,\delta}(s_t, \omega, a_t))\, \pi_{\omega,t}(s_t)$. Using this, we can derive an importance sampling distribution in our loss:
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\hat{p}_{\theta,t,\omega}(s_t, a_t)}{\hat{\mu}_{t,\omega}(s_t, a_t)}\Big)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big]. \qquad (19)$$
The gradient of this parameterized reward function corresponds to the negative of the discriminator loss gradient:
$$\frac{\partial}{\partial \theta} \hat{R}_{\theta,\delta}(s, \omega, a) \approx \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s, \omega, a)\big) = \mathbb{E}\Big[\frac{\partial}{\partial \theta} r_{\theta,\omega}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\Big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta}\big(-L^\Omega_{\zeta,\theta,\alpha,\delta}(s')\big) + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big]. \qquad (20)$$
See Appendix C for the detailed derivations of the terms appearing in (20). Substituting (20) into (19), one can see that (9) (the derivative of the MLE objective) and (10) have the same form as (19) (the derivative of the discriminator objective) and (20).
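As a rough sketch of the per-option discriminator update behind Eqs. (11), (12) and (17), the code below computes the cross-entropy loss and the induced reward from discriminator logits; the batch and the logit values are placeholders, not the paper's implementation.

```python
import numpy as np

def per_option_discriminator(f_value, log_pi_a_given_s):
    # Eq. (11): D_{theta,omega}(s, a) = exp(f) / (exp(f) + pi_omega(a|s)), evaluated in log space.
    logit = f_value - log_pi_a_given_s          # equals log D - log(1 - D)
    return 1.0 / (1.0 + np.exp(-logit))

def cross_entropy_loss(d_expert, d_novice):
    # Eq. (12): expert samples should be classified as 1, novice (generator) samples as 0.
    return -np.mean(np.log(d_expert)) - np.mean(np.log(1.0 - d_novice))

def option_reward(d):
    # Eq. (17): reward handed to the policy optimizer, log D - log(1 - D).
    return np.log(d) - np.log(1.0 - d)

# Toy batch for one option (values are illustrative only).
rng = np.random.default_rng(0)
f_expert, f_novice = rng.normal(0.5, 1.0, size=32), rng.normal(-0.5, 1.0, size=32)
log_pi = np.log(0.3)
d_expert = per_option_discriminator(f_expert, log_pi)
d_novice = per_option_discriminator(f_novice, log_pi)
print(cross_entropy_loss(d_expert, d_novice), option_reward(d_novice).mean())
```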
6. Learning Disentangled State-only Rewards with Options
In this section, we provide our main algorithm for learning robust rewards with options. Similar to AIRL, we implement our algorithm with a discriminator update that considers the rollouts of a policy over options. We perform this update with $(s, a, s')$ triplets and a discriminator function of the form $f_{\theta,\omega}(s, a, s')$, as given in (21). This allows us to formulate the discriminator with state-only rewards in terms of option-value function estimates, from which we can compute an option-advantage estimate. Since the reward function only requires the state, we learn a reward function and corresponding policy that are disentangled from the environmental transition dynamics.
$$f_{\omega,\theta}(s, a, s') = \hat{r}_{\omega,\theta}(s) + \gamma \hat{V}_\Omega(s') - \hat{V}_\Omega(s) = \hat{A}(s, a, \omega), \qquad (21)$$
where $Q(s, \omega) = \sum_{a \in A} \pi_{\omega,\alpha}(a \mid s)\big[r_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_{\delta,\omega}(s'))\, Q(s', \omega) + \beta_{\delta,\omega}(s')\, V_\Omega(s')\big)\big]$ and $V_\Omega(s) = \sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\, Q(s, \omega)$.

Our discriminator model must learn a parameterization of the reward function and the value function for each option, given the total loss function in (37). These parameterized models are learned with a multi-layer perceptron. For each option, the termination functions $\beta_{\omega,\delta}$ and option policies $\pi_{\omega,\alpha}$ are learned using PPOC.

Our main algorithm, oIRL, is given by Algorithm 1. Here, we iteratively train a discriminator from expert and novice sampled trajectories using the derived discriminator objective. This allows us to obtain reward function estimates for each option. We then use any policy optimization method for a policy over options given these estimated rewards. We can also use discriminator input in the state-only format described in (21). It is important to note that in our recursive loss, we recursively simulate a trajectory to compute the loss a finite number of times (and return if the state is terminal). We show the adversarial architecture of this algorithm in Appendix D.
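A minimal sketch of the state-only discriminator logit of Eq. (21), with stand-in multi-layer perceptrons for the per-option reward and the option-value shaping term; all sizes are assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp(in_dim, hidden, out_dim):
    """Tiny two-layer perceptron returning a forward function (weights are random stand-ins)."""
    w1, b1 = rng.normal(scale=0.1, size=(in_dim, hidden)), np.zeros(hidden)
    w2, b2 = rng.normal(scale=0.1, size=(hidden, out_dim)), np.zeros(out_dim)
    return lambda x: np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

state_dim, n_options, gamma = 8, 2, 0.99
reward_nets = [mlp(state_dim, 32, 1) for _ in range(n_options)]   # hat{r}_{omega,theta}(s), one per option
value_net = mlp(state_dim, 32, 1)                                  # hat{V}_Omega(s), shared shaping term

def f_state_only(s, s_next, option):
    # Eq. (21): f_{omega,theta}(s, a, s') = r_omega(s) + gamma * V_Omega(s') - V_Omega(s).
    return reward_nets[option](s) + gamma * value_net(s_next) - value_net(s)

s, s_next = rng.normal(size=state_dim), rng.normal(size=state_dim)
print(f_state_only(s, s_next, option=0))   # advantage-like logit fed to the per-option discriminator
```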
7. Convergence Analysis
In this section we explain the gist of the convergence analysis of oIRL. The detailed proofs can be found in Appendices E and F.

We first show that the true reward function is recovered (up to a constant) by the reward estimators: for each option's reward estimator $g_{\theta,\omega}(s)$, we have $g^*_\omega(s) = r^*(s) + c_\omega$, where $c_\omega$ is a finite constant. Using the fact that $g_{\theta,\omega}(s) \to g^*_\omega(s) = r^*(s) + c_\omega$, and by using the Cauchy-Schwarz inequality with the sup-norm, we prove that the TD-error update is a contraction, i.e.,
$$\max_{s'', \omega''} |Q_{\pi_\Omega, t}(s, \omega) - Q^*(s, \omega)| \leq \epsilon + \max_{\omega \in \Omega} c_\omega. \qquad (22)$$
In order to prove asymptotic convergence to the optimal option-value $Q^*$, we show using the contraction argument that $g_{\theta,\omega}(s) + \gamma Q(s', \omega)$ converges to $Q^*$ by establishing the following inequality:
$$\big|\mathbb{E}[g_{\theta,\omega}(s)] + \gamma\, \mathbb{E}[Q(s', \omega) \mid s] - Q^*(s', \omega)\big| \leq \big(\max_{\omega \in \Omega} c_\omega\big)\big(\epsilon + \max_{\omega \in \Omega} c_\omega\big)\gamma. \qquad (23)$$
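The following toy numerical illustration of the contraction argument runs value iteration with learned rewards that are offset from the true reward by a per-option constant, as in the recoverability result above; the tiny MDP, the option-level transitions and the offsets are made up for the illustration and are a simplification of the full option model.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_options, gamma = 5, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_options))   # P(s'|s, omega), option-level for simplicity
r_true = rng.uniform(size=n_states)                                 # true state-only reward r*(s)
c = np.array([0.2, -0.1])                                           # per-option offsets c_omega
g = r_true[None, :] + c[:, None]                                    # learned rewards g*_omega(s) = r*(s) + c_omega

def value_iteration(reward):
    # reward[w, s]: option-conditioned reward; standard Q-iteration over (s, omega).
    Q = np.zeros((n_states, n_options))
    for _ in range(500):
        V = Q.max(axis=1)
        Q = np.array([[reward[w, s] + gamma * P[s, w] @ V for w in range(n_options)]
                      for s in range(n_states)])
    return Q

Q_star = value_iteration(np.tile(r_true, (n_options, 1)))
Q_hat = value_iteration(g)
# The sup-norm gap stays bounded in terms of the reward offsets (cf. the constant in Eq. (22)).
print(np.abs(Q_hat - Q_star).max(), np.abs(c).max() / (1 - gamma))
```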
8. Experiments

oIRL learns disentangled reward functions for each option policy, which facilitates policy generalizability and is instrumental in transfer learning. Transfer learning can be described as using information learned by solving one problem and then applying it to a different but related problem. In the RL sense, it means taking a policy trained on one environment and then using that policy to solve a similar task in a different, previously unseen environment.

We run experiments in different environments to address the following questions:

• Does learning a policy over options with the AIRL framework improve policy generalization and reward robustness in transfer learning tasks where the environmental dynamics are manipulated?

• Can the policy-over-options framework match or exceed benchmarks for imitation learning on complex continuous control tasks?
Algorithm 1 IRL Over Options with Robust Rewards (oIRL)

Require: Expert trajectories $\{\tau_{E_1}, \ldots, \tau_{E_n}\} \in T_D$; initial parameters $(\theta_0, \zeta_0, \delta_0, \alpha_0)$; discount $\gamma$
Initialize policies $\pi_{\omega,\alpha}$, $\pi_{\Omega,\zeta}$, discriminators $D_{\theta_0,\omega}$ and terminations $\beta_{\omega,\delta}$ for all $\omega \in \Omega$
for step $t = 0, 1, 2, \ldots, T$ do
    Collect trajectories $\tau_i = (s_0, a_0, \omega_0, \ldots)$ from $\pi_{\omega,\alpha_t}$, $\pi_{\Omega,\zeta_t}$, $\beta_{\omega,\delta_t}$
    Train the discriminator $D_{\theta_t,\omega}$:
    for step $k = 0, 1, 2, \ldots$ do
        Sample $(s_k, a_k, s'_k, \omega_k) \sim \tau_{i,t}$
        if $s'_k$ is not a terminal state then
            Sample $\omega'_k \sim \pi_{\Omega,\zeta_t}(\omega \mid s'_k)$, $a'_{k,1} \sim \pi_{\omega'_k,\alpha_t}(a \mid s'_k)$, $a'_{k,2} \sim \pi_{\omega_k,\alpha_t}(a \mid s'_k)$
            Observe $s''_{k,1}, s''_{k,2}$ from the environment
            $L_k(s_k, a_k, s'_k, \omega_k) = -\mathbb{E}_D[\log(D_{\theta_t,\omega_k}(s_k, a_k, s'_k))] - \mathbb{E}_{\pi_{\Theta,t}}[\log(1 - D_{\theta_t,\omega_k}(s_k, a_k, s'_k))]$
            Optimize model parameters w.r.t. $-L_D = L_k + \gamma\big(\beta_{\delta_t,\omega_k}(s'_k)\, L(s'_k, a'_{k,1}, s''_{k,1}, \omega'_k) + (1 - \beta_{\delta_t,\omega_k}(s'_k))\, L(s'_k, a'_{k,2}, s''_{k,2}, \omega_k)\big)$
        end if
    end for
    Obtain the reward $r_{\theta_t,\omega}(s, a, s') \leftarrow \log(D_{\theta_t,\omega}(s, a, s')) - \log(1 - D_{\theta_t,\omega}(s, a, s'))$
    Update $\pi_{\omega,\alpha_t}$, $\beta_{\omega,\delta_t}$ for all $\omega \in \Omega$ and $\pi_{\Omega,\zeta_t}$ with any policy optimization method (e.g. PPOC)
end for

To answer these questions, we compare our model against AIRL (the current state of the art for transfer learning) on a transfer task, by learning in an ant environment and modifying the physical structure of the ant, and we compare our method on various benchmark IRL continuous control tasks. We wish to see whether learning disentangled rewards for sub-tasks through the options framework is more portable.

We train a policy using each of the baseline methods and our method on these expert demonstrations for 500 time steps on the gait environments and 500 time steps on the hierarchical ones. Then we take the trained policy (the parameterized distribution), use this policy on the transfer environments, and observe the reward obtained. Such a method of transferring the policy is called a direct policy transfer. For the transfer learning tasks, we use
Transfer Environments for MuJoCo (Chu & Arnold, 2018), a set of gym environments for studying potential improvements in transfer learning tasks. The task involves an Ant agent, which optimizes a gait to crawl sideways across the landscape. The expert demonstrations are obtained from the optimal policy in the basic Ant environment. We disable the agent ant in two ways for two transfer learning tasks. In the BigAnt tasks, the length of all legs is doubled, though no extra joints are added. The Amputated Ant task modifies the agent by shortening a single leg to disable it. These transfer tasks require learning a true disentangled reward of walking sideways instead of directly imitating and learning a reward specific to the gait movements. These manipulations are shown in Figure 1.
Table 1. The mean reward obtained (higher is better) over 100 runs for the Gait transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.

                      BigAnt     Amputated Ant
AIRL (primitive)      -11.6      134.32
2 Options oIRL
4 Options oIRL        -1.7
Ground Truth
Figure 1. MuJoCo Ant Gait transfer learning task environments: (a) the Ant environment, (b) the Big Ant environment, (c) the Amputated Ant environment. When the ant is disabled, it must position itself correctly to crawl forward. This requires a different initial policy than in the original environment, where the ant must only crawl sideways.
Table 1 shows the results in terms of reward achieved for the ant gait transfer tasks. As we can see, in both experiments our algorithm performs better than AIRL. Note that the ground truth is obtained with PPO after 2 million iterations, and is therefore much less sample efficient than IRL.
We also create transfer learning environments in a 2D Maze environment with lava blockades. The goal of the agent is to go through the opening in a row of lava cells and reach a goal on the other end. For the transfer learning task, we train the agent on an environment where the "crossing" path requires the agent to go through the middle (LavaCrossing-M), and then the policy is directly transferred and used on a GridWorld of the same size where the crossing is on the right end of the room (LavaCrossing-R). An additional task involves changing a blockade in a Maze (FlowerMaze-(R,T)). The environments are shown in Figure 2. We can think of two sub-tasks in this environment: going to the lava crossing and then going to the goal.

In all of these environments, the rewards are sparse. The agent receives a non-zero reward only after completing the mission, and the magnitude of the reward is $1 - 0.9 \cdot n / n_{\max}$, where $n$ is the length of the successful episode and $n_{\max}$ is the maximum number of steps that we allow for completing the episode, different for each mission.

Figure 2. The MiniGrid transfer learning task set 1: (a) LavaCrossing-M, (b) LavaCrossing-R, (c) FlowerMaze-R, (d) FlowerMaze-T. Here the policy is trained on (a) or (c) using our method and the baseline methods, and then transferred to environment (b) or (d). The green cell is the goal.
We show the mean reward after 10 runs using direct policy transfers on these environments in Table 2. The 4-option oIRL achieved the highest reward on the LavaCrossing tasks. The FlowerMaze task was quite difficult, with most algorithms obtaining very low reward; options still yield a large improvement.
Table 2. The mean reward obtained (higher is better) over 10 runs for the Maze transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.

                      LavaCrossing    FlowerMaze
AIRL (primitive)      0.64            0.112
2 Options oIRL        0.67            0.204
4 Options oIRL
Ground Truth
In addition, we adopt more complex hierarchical environments that require both locomotion and object interaction. In the first environment, the ant must interact with a large movable block; this is the Ant-Push environment (Duan et al., 2016). To reach the goal, the ant must complete two successive processes: first, it must move to the left of the block and then push the block right, which clears the path towards the target location. There is a maximum of 500 timesteps. These can be thought of as hierarchical tasks with pushing to the left, pushing to the right and going to the goal as sub-goals.

We also utilize an Ant-Maze environment (Florensa et al., 2017), where we have a simple maze with a goal at the end. The agent receives a reward of +1 if it reaches the goal and 0 elsewhere. The ant must learn to make two turns in the maze: the first is down the hallway for one step, followed by a turn towards the goal. Again, we see hierarchical behavior in this task: we can think of sub-goals consisting of learning to exit the first hall of the maze, then making the turn, and finally going down the final hall towards the goal. The two complex environments are shown in Figure 3.

Figure 3. MuJoCo Ant Complex Gait transfer learning task environments: (a) the Ant-Maze environment, (b) the Ant-Push environment. We perform these transfer learning tasks with the Big Ant and the Amputated Ant.
Table 3 shows that oIRL performs better than AIRL in all of the complex hierarchical transfer tasks. In some tasks, such as the Maze environment, AIRL has few or no successful runs, while our method achieves reasonably high reward. In the BigAnt Push task, AIRL achieves only very minimal reward, whereas oIRL succeeds at the task in some cases.
Table 3. The mean reward obtained (higher is better) over 100 runs for the MuJoCo Ant Complex Gait transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.

                      BigAnt Maze    Amputated Ant Maze    BigAnt Push    Amputated Ant Push
AIRL (primitive)      0.28           0.14                  0.02           0.172
2 Options oIRL
4 Options oIRL        0.55
Ground Truth
Figure 4.
MuJoCo continuous control locomotion tasks, showing the mean reward (higher is better) achieved over 500 iterations of the benchmark algorithms for 10 random seeds. The shaded area represents the standard deviation.
We also test our algorithm on a number of robotic continuous control benchmark tasks. These tasks do not involve transfer. We show the plots of the average reward for each iteration during training in Figure 4. Achieving a higher reward in fewer iterations is better in these experiments. We examine the Ant, Half Cheetah and Walker MuJoCo gait/locomotion tasks. We run these experiments with 10 random seeds. The results are quite similar across the benchmarks; using a policy over options shows reasonable improvements on each task.
9. Discussion
This work presents Option-Inverse Reinforcement Learning (oIRL), the first hierarchical IRL algorithm with disentangled rewards. We validate oIRL on a wide variety of tasks, including transfer learning tasks, locomotion tasks, complex hierarchical transfer RL environments and GridWorld transfer navigation tasks, and compare our results with the state-of-the-art algorithm. Combining options with a disentangled IRL framework results in highly portable policies. Our empirical studies show clear and significant improvements for transfer learning. The algorithm is also shown to perform well in continuous control benchmark tasks.

For future work, we wish to test other sampling methods (e.g., Markov chain Monte Carlo) to estimate the implicit distribution of the discriminator-generator pair in our GAN, such as the Metropolis-Hastings GAN (Turner et al., 2019). We also wish to investigate methods to reduce the computational complexity of computing the recursive loss function, which requires simulating short trajectories, and to lower its variance. Analyzing our algorithm with physical robotic tests on tasks that require multiple sub-tasks would be an interesting future course of research.
References
Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In AAAI, 2017.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

Chu, E. and Arnold, S. Transfer environments for MuJoCo. GitHub, 2018. URL https://github.com/seba-1511/shapechanger.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control, 2016.

Finn, C., Christiano, P., Abbeel, P., and Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. In NeurIPS, 2016a.

Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, 2016b.

Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. Reverse curriculum generation for reinforcement learning. In CoRL, 2017.

Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

Hausman, K., Chebotar, Y., Schaal, S., Sukhatme, G., and Lim, J. J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In NeurIPS, 2017.

Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J., and Precup, D. OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In AAAI, 2018.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In NeurIPS, 2016.

Klissarov, M., Bacon, P.-L., Harb, J., and Precup, D. Learning options end-to-end for continuous action tasks. CoRR, abs/1712.00004, 2017.

Krishnan, S., Garg, A., Liaw, R., Miller, L., Pokorny, F. T., and Goldberg, K. Y. HIRL: Hierarchical inverse reinforcement learning for long-horizon tasks with delayed rewards. arXiv, abs/1604.06508, 2016.

Kullback, S. and Leibler, R. A. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

Ng, A. Y. and Russell, S. Algorithms for inverse reinforcement learning. In ICML, 2000.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sharma, M., Sharma, A., Rhinehart, N., and Kitani, K. M. Directed-Info GAIL: Learning hierarchical policies from unsegmented demonstrations using directed information. In ICLR, 2019.

Sutton, R., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.

Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 2009.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IROS, 2012.

Turner, R., Hung, J., Frank, E., Saatchi, Y., and Yosinski, J. Metropolis-Hastings generative adversarial networks. In ICML, 2019.

Wang, Z., Merel, J., Reed, S., Wayne, G., de Freitas, N., and Heess, N. Robust imitation of diverse behaviors. In NeurIPS, 2017.
A. Option Transition Probabilities
It is useful to redefine transition probabilities in terms of options. At each step there is an additional consideration: we can continue following the policy of the current option, or terminate the option with some probability, sample a new option from a state-dependent stochastic policy, and follow that option's policy. We have
$$P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_{a \in \mathcal{A}} \pi_{\omega,\alpha}(a \mid s_t)\, P(s_{t+1} \mid s_t, a)\big((1 - \beta_{\omega_t,\delta}(s_{t+1}))\,\mathbb{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\delta}(s_{t+1})\, \pi_{\Omega,\zeta}(\omega_{t+1} \mid s_{t+1})\big) \qquad (24)$$
$$P(s_{t+1}, \omega_{t+1} \mid s_t) = \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t) \sum_{a \in \mathcal{A}} \pi_{\omega,\alpha}(a \mid s_t)\, P(s_{t+1} \mid s_t, a)\big((1 - \beta_{\omega_t,\delta}(s_{t+1}))\,\mathbb{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\delta}(s_{t+1})\, \pi_{\Omega,\zeta}(\omega_{t+1} \mid s_{t+1})\big) \qquad (25)$$
$$P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t, a_t) = P(s_{t+1} \mid s_t, a_t)\big((1 - \beta_{\omega_t,\delta}(s_{t+1}))\,\mathbb{1}_{\omega_t = \omega_{t+1}} + \beta_{\omega_t,\delta}(s_{t+1})\, \pi_{\Omega,\zeta}(\omega_{t+1} \mid s_{t+1})\big) \qquad (26)$$

B. MLE Objective for IRL Over Options
We can define a discounted return recursively for a policy over options, in a similar manner to the transition probabilities. Consider the policy over options based on the probabilities of terminating or continuing the option policies, given a reward approximator $\hat{r}_\theta(s, a)$ for the state-action reward:
$$R^\Omega_{\zeta,\theta,\alpha,\delta}(s) = \mathbb{E}_{\omega \in \Omega}[R_{\theta,\alpha,\delta}(s, \omega)]$$
$$R_{\theta,\alpha,\delta}(s, \omega) = \mathbb{E}_{a \in A}[R_{\theta,\delta}(s, \omega, a)]$$
$$R^\Omega_{\zeta,\theta,\delta}(s, a) = \mathbb{E}_{\omega \in \Omega}[R_{\theta,\delta}(s, \omega, a)]$$
$$R_{\theta,\delta}(s, \omega, a) = \mathbb{E}\Big[\hat{r}_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a, \omega)\big(\beta_{\omega,\delta}(s')\, R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, R_{\theta,\alpha,\delta}(s', \omega)\big)\Big]. \qquad (27)$$
These formulations of the reward function account for option transition probabilities, including the probability of terminating the current option and therefore selecting a new one according to the policy over options.

With $\omega$ selected according to $\pi_{\zeta,\Omega}(\omega \mid s)$, we can define a parameterization of the discounted return $R$ in the style of a maximum causal entropy RL problem with objective $\max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}[\log(p_\theta(\tau))]$, where
$$p_\theta(\tau) \sim p(s_0, \omega_0) \prod_{t=0}^{T-1} P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t, a_t)\, e^{R_{\theta,\delta}(s_t, \omega_t, a_t)}. \qquad (28)$$

MLE Derivative
We can write out the MLE objective for our generator. We may or may not know the option trajectories in our expert demonstrations, but they are estimated below according to the policy over options. This is defined similarly to (Fu et al., 2018) and (Finn et al., 2016a) as $J(\theta) = \mathbb{E}_{\tau \sim \tau_E}[\sum_{t=0}^{T} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)] - \mathbb{E}_{p_\theta}[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)]$. The full derivation is shown as (with generator $p_\theta$):
$$\begin{aligned} J(\theta) &= \mathbb{E}_{\tau \sim \tau_E}[\log(p_\theta(\tau))] \\ &= \mathbb{E}_{\tau \sim \tau_E}\big[R_{\theta,\delta}(s_t, \omega, a_t)\big] - \log(Z_\theta) \\ &= \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_\theta(s_t, \omega, a_t)\Big] - \log(Z_\theta) \\ &\approx \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)\Big] \\ &= \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] \qquad (29) \end{aligned}$$
We go from line 4 to line 5 using $R^\Omega_{\zeta,\theta,\delta}(s_t, a_t) = \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, R_{\theta,\delta}(s_t, \omega, a_t)$.

Now, taking the gradient of the MLE objective w.r.t. $\theta$ yields
$$\begin{aligned} \frac{\partial}{\partial \theta} J(\theta) &= \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} \log(p_\theta(\tau))\Big] \\ &= \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big] - \frac{\partial}{\partial \theta} \log(Z_\theta) \\ &\approx \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{t=0}^{T} \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_\theta}\Big[\sum_{t=0}^{T} \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] \qquad (30) \end{aligned}$$
Recall that we define $p_{\theta,t}(s_t, a_t) = \int_{s_{t'}, a_{t'}:\, t' \neq t} p_\theta(\tau)\, ds_{t'}\, da_{t'}$ as the state-action marginal at time $t$:
$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{p_{\theta,t}}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] \qquad (31)$$
We perform importance sampling over the hard-to-estimate generator density. We construct an importance sampling distribution $\mu_{t,\omega}(\tau)$ for option $\omega$: we sample from a mixture policy $\mu_\omega(a \mid s)$ defined as $\frac{1}{2}\pi_\omega(a \mid s) + \frac{1}{2}\hat{p}_\omega(a \mid s)$, where $\hat{p}_\omega(a \mid s)$ is a rough density estimate trained on the demonstrations. We wish to minimize $D_{KL}(\pi_\omega(\tau) \,\|\, p_\omega(\tau))$, where $D_{KL}$ is the Kullback-Leibler divergence between two probability distributions. Our new gradient is
$$\frac{\partial}{\partial \theta} J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{\mu_t}\Big[\sum_{\omega \in \Omega} \pi_{\zeta,\Omega}(\omega \mid s_t)\, \frac{p_{\theta,t,\omega}(s_t, a_t)}{\mu_{t,\omega}(s_t, a_t)}\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big]. \qquad (32)$$
Taking the derivative of the discounted option return results in
$$\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s) = \mathbb{E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\Big[\sum_{a \in A} \pi_{\omega,\alpha}(a \mid s)\Big(\frac{\partial}{\partial \theta} \hat{r}_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta} R_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big]\Big] \qquad (33)$$
$$\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s, a) = \mathbb{E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\Big(\frac{\partial}{\partial \theta} \hat{r}_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta} R_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big] \qquad (34)$$

C. Discriminator Objective
We formulate the discriminator as the odds ratio between the policy and the exponentiated reward distribution for option $\omega$, as in AIRL, parameterized by $\theta$. We have a discriminator for each option $\omega$ and a generator option policy $\pi_\omega$:
$$D_{\theta,\omega}(s, a) = \frac{\exp(f_{\theta,\omega}(s, a))}{\exp(f_{\theta,\omega}(s, a)) + \pi_\omega(a \mid s)}. \qquad (35)$$

C.1. Recursive Loss Formulation
We minimize the cross-entropy loss between expert demonstrations and generated examples, assuming we have the same number of options in the generated and expert trajectories. We define the per-step loss function $l_\theta$ as follows:
$$l_\theta(s, a, \omega) = -\mathbb{E}_{\mathcal{D}}[\log(D_{\theta,\omega}(s, a))] - \mathbb{E}_{\pi_{\Theta,t}}[\log(1 - D_{\theta,\omega}(s, a))]. \qquad (36)$$
The total loss for the entire trajectory can be expressed recursively as follows by taking expectations over the next options or states:
$$\begin{aligned} L_{\theta,\delta}(s, a, \omega) &= l_\theta(s, a, \omega) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big(\beta_{\omega,\delta}(s')\, L^\Omega_{\zeta,\theta,\alpha,\delta}(s') + (1 - \beta_{\omega,\delta}(s'))\, L_{\theta,\alpha,\delta}(s', \omega)\big) \\ L_{\theta,\alpha,\delta}(s, \omega) &= \mathbb{E}_{a \in A}[L_{\theta,\delta}(s, \omega, a)] \\ L^\Omega_{\zeta,\theta,\delta}(s, a) &= \mathbb{E}_{\omega \in \Omega}[L_{\theta,\delta}(s, \omega, a)] \\ L^\Omega_{\zeta,\theta,\alpha,\delta}(s) &= \mathbb{E}_{\omega \in \Omega}[L_{\theta,\alpha,\delta}(s, \omega)] \end{aligned} \qquad (37)$$
The agent wishes to minimize $L_{\theta,\delta}$ to find its optimal policy. Letting the cost function be $f_{\theta,\omega}(s, a) = -L_{\theta,\delta}(s, \omega, a)$, as in AIRL, we have
$$D_{\theta,\omega} = \frac{\exp(-L_{\theta,\delta}(s, \omega, a))}{\exp(-L_{\theta,\delta}(s, \omega, a)) + \pi_\omega(a \mid s)}. \qquad (38)$$

C.2. Optimization Criteria
For a given option $\omega$, we can write the reward function $\hat{R}_{\theta,\delta}(s, \omega, a)$ to be maximised as follows. Note that $\theta$ parameterizes the state-action reward function estimate for option $\omega$, and $-L_D$ is the negative discriminator loss; we therefore turn our minimization problem into a maximization problem. We define our objective similarly to the GAN objective of AIRL:
$$-L_D = \hat{R}_{\theta,\delta}(s, \omega, a) = \log(D_{\theta,\omega}(s, a)) - \log(1 - D_{\theta,\omega}(s, a)). \qquad (39)$$
Now we can write out the reward function in terms of the optimal discriminator:
$$\hat{R}_{\theta,\delta}(s, \omega, a) = \log\Big(\frac{\exp(-L_{\theta,\delta}(s, \omega, a))}{\exp(-L_{\theta,\delta}(s, \omega, a)) + \pi_\omega(a \mid s)}\Big) - \log\Big(\frac{\pi_\omega(a \mid s)}{\exp(-L_{\theta,\delta}(s, \omega, a)) + \pi_\omega(a \mid s)}\Big) = -L_{\theta,\delta}(s, \omega, a) - \log(\pi_\omega(a \mid s)). \qquad (40)$$
The derivative of this reward function can now be computed as follows:
$$\begin{aligned} \frac{\partial}{\partial \theta} \hat{R}_{\theta,\delta}(s, \omega, a) &\approx \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s, \omega, a)\big) \\ &= \mathbb{E}\Big[\frac{\partial}{\partial \theta} r_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\Big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta}\big(-L^\Omega_{\zeta,\theta,\alpha,\delta}(s')\big) + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big] \\ &= \mathbb{E}\Big[\frac{\partial}{\partial \theta} r_{\omega,\theta}(s, a)\Big] + \mathbb{E}\Big[\gamma \sum_{s' \in S} P(s' \mid s, a)\Big(\beta_{\omega,\delta}(s')\, \frac{\partial}{\partial \theta}\big(-L^\Omega_{\zeta,\theta,\alpha,\delta}(s')\big) + (1 - \beta_{\omega,\delta}(s'))\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\alpha,\delta}(s', \omega)\big)\Big)\Big] \qquad (41) \end{aligned}$$
Writing out our discriminator objective yields
$$\begin{aligned} -L_D &= \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log(D_{\theta,\omega}(s_t, a_t))\Big] + \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log(1 - D_{\theta,\omega}(s_t, a_t))\Big] \\ &= \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\Big(\frac{\exp(-L_{\theta,\delta}(s_t, \omega, a_t))}{\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)}\Big)\Big] + \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\Big(\frac{\pi_\omega(a_t \mid s_t)}{\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)}\Big)\Big] \\ &= \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\big(\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)\big)\Big] \\ &\quad + \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log(\pi_\omega(a_t \mid s_t))\Big] - \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\big(\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)\big)\Big] \qquad (42) \end{aligned}$$
We set a mixture of expert and novice observations as $\bar{\mu}$:
$$-L_D = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] + \mathbb{E}_{\pi_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log(\pi_\omega(a_t \mid s_t))\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \log\big(\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)\big)\Big] \qquad (43)$$
We can take the derivative w.r.t. $\theta$ (the state-action reward function estimate parameter):
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\exp(-L_{\theta,\delta}(s_t, \omega, a_t))}{\exp(-L_{\theta,\delta}(s_t, \omega, a_t)) + \pi_\omega(a_t \mid s_t)}\Big)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] \qquad (44)$$
We can multiply the numerator and denominator of the fraction in the mixture expectation by the state marginal $\pi_\omega(s_t) = \int_{a \in A} \pi_\omega(s_t, a_t)\, da$. This allows us to write $\hat{p}_{\theta,t,\omega}(s_t, a_t) = \exp(-L_{\theta,\delta}(s_t, \omega, a_t))\, \pi_{\omega,t}(s_t)$, giving an importance sampling form:
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\hat{p}_{\theta,t,\omega}(s_t, a_t)}{\hat{\mu}_{t,\omega}(s_t, a_t)}\Big)\, \frac{\partial}{\partial \theta}\big(-L_{\theta,\delta}(s_t, \omega, a_t)\big)\Big] \qquad (45)$$
It is now easy to see that we have the same form as our MLE objective: the loss (the function we approximate with the GAN) is the discounted reward for a state-action pair, with the expectation over options. We change the loss functions to reward functions to show this, as they are defined equivalently:
$$\frac{\partial}{\partial \theta}(-L_D) = \sum_{t=0}^{T} \mathbb{E}_{\tau \sim \tau_E}\Big[\frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\delta}(s_t, a_t)\Big] - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\hat{p}_{\theta,t,\omega}(s_t, a_t)}{\hat{\mu}_{t,\omega}(s_t, a_t)}\Big)\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big] \qquad (46)$$
In addition, we can decompose the reward into a state-action reward and a future discounted sum of rewards, considering the policy over options:
$$\begin{aligned} \frac{\partial}{\partial \theta}(-L_D) &= \sum_{t=0}^{T} \underbrace{\mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \frac{\partial}{\partial \theta} r_{\omega,\theta}(s_t, a_t)\Big]}_{\text{state-action reward}} \\ &\quad + \mathbb{E}_{\tau \sim \tau_E}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \gamma \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\Big(\beta_{\omega,\delta}(s_{t+1})\, \frac{\partial}{\partial \theta} R^\Omega_{\zeta,\theta,\alpha,\delta}(s_{t+1}) + (1 - \beta_{\omega,\delta}(s_{t+1}))\, \frac{\partial}{\partial \theta} R_{\theta,\alpha,\delta}(s_{t+1}, \omega)\Big)\Big] \\ &\quad - \mathbb{E}_{\bar{\mu}_t}\Big[\sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s_t)\, \Big(\frac{\hat{p}_{\theta,t,\omega}(s_t, a_t)}{\hat{\mu}_{t,\omega}(s_t, a_t)}\Big)\, \frac{\partial}{\partial \theta} R_{\theta,\delta}(s_t, \omega, a_t)\Big] \qquad (47) \end{aligned}$$
We are given a mixture of expert and generated policies as $\bar{\mu}_t$ and perform importance sampling with respect to this distribution.

D. GAN Architecture
The architecture for our GAN-IRL framework is described in Figure 5.
E. Proof of Recoverable Rewards
A substantial amount of this proof is derived from (Fu et al., 2018).
Lemma 1: $f_{\theta,\omega}(s, a)$ recovers the advantage.

Proof: It is known that when $\pi_\omega = \pi_{E_\omega}$, we have achieved the global minimum of the discriminator objective. The discriminator must then output 0.5 for all state-action pairs. This results in $\exp(f_{\theta,\omega}(s, a)) = \pi_{E_\omega}(a \mid s)$. Equivalently, we have $f^*_\omega(s, a) = \log \pi_{E_\omega}(a \mid s) = A^*(s, a, \omega)$.

Figure 5. Architecture of the GAN-IRL framework.
Definition 1: Decomposability condition. We first define two states $s_1, s_2$ as 1-step linked under dynamics $T(s' \mid s, a)$ if there exists a state $s$ that can reach $s_1$ and $s_2$ with non-zero probability in one timestep. The transitivity property holds for the linked relationship: if $s_1$ and $s_2$ are linked, and $s_2$ and $s_3$ are linked, then $s_1$ and $s_3$ must also be linked. The decomposability condition for transition dynamics $T$ holds if all states in the MDP are linked with all other states.

Lemma 2:
Consider an MDP where the decomposability condition holds for all dynamics, and arbitrary functions $a(s), b(s), c(s), d(s)$. If for all $s$ and $s'$
$$a(s) + b(s') = c(s) + d(s'), \qquad (48)$$
then for all $s$
$$a(s) = c(s) + \text{const}_s, \qquad (49)$$
$$b(s) = d(s) + \text{const}_s, \qquad (50)$$
where $\text{const}_s$ denotes a constant with respect to the state $s$.

Proof:
If we rearrange Equation (48), we obtain the equality $a(s) - c(s) = b(s') - d(s')$. Now define $f(s) = a(s) - c(s)$. Given our equality, we have $f(s) = a(s) - c(s) = b(s') - d(s')$, which holds for some function dependent on $s$. For this to hold, $b(s') - d(s')$ must be equal to a constant (with the constant's value dependent on the state $s$) for all one-step successor states $s'$ of $s$. Under decomposability, all one-step successor states $s'$ of $s$ must agree through the transitivity property, so $b(s') - d(s')$ must be a constant with respect to the state $s$. Therefore, we can write $a(s) = c(s) + \text{const}_s$ for an arbitrary state $s$ and functions $b$ and $d$. Substituting this into Equation (48), we obtain $b(s) = d(s) + \text{const}_s$. This completes the proof.

Inductive proof for any successor state
Consider, for any MDP and arbitrary functions $a(\cdot), b(\cdot), c(\cdot), d(\cdot)$,
$$a(s) + b(S^{(k)}) = c(s) + d(S^{(k)}), \qquad (51)$$
where $S^{(k)}$ is the $k$-th successor state reached in $k$ time-steps from the current state. Let us denote by $T^{\pi,(k)}(s, S^{(k)})$ the probability of transitioning from state $s$ to $S^{(k)}$ in $k$ steps using policy $\pi$. Then we can express $T^{\pi,(k)}(s, S^{(k)})$ recursively as follows:
$$T^{\pi,(k)}(s, S^{(k)}) = \sum_{s' \in S} T^{\pi,(k-1)}(s, s')\, T^\pi(s', S^{(k)}), \qquad (52)$$
where $T^\pi(s', S^{(k)})$ is the one-step transition probability from state $s'$ to state $S^{(k)}$ (by definition of the Bellman operator). Denote by $P(S^{(k)})$ the probability of landing in state $S^{(k)}$ in $k$ steps from any current state. We can write $P(S^{(k)})$ using (52) as follows:
$$P(S^{(k)}) := \sum_{s \in S} T^{\pi,(k)}(s, S^{(k)})\, \mu(s), \qquad (53)$$
where $\mu$ is the state distribution. The unbiased estimator $\hat{s}^{(k)}$ of an unknown successor state $S^{(k)}$ is given by
$$\hat{s}^{(k)} := \mathbb{E}[S^{(k)}] = \sum_{s^{(k)} \in S} s^{(k)}\, P(S^{(k)}), \qquad (54)$$
where $P(S^{(k)})$ is given in (53). Now, replacing $S^{(k)}$ in (51) with its unbiased estimator $\hat{s}^{(k)}$ as given by (54), we have
$$a(s) - c(s) = b(\hat{s}^{(k)}) - d(\hat{s}^{(k)}) \overset{(a)}{=} f(k), \qquad (55)$$
for some function $f$, where $(a)$ holds since $\hat{s}^{(k)}$ depends only on $k$. Thus, we get $a(s) = c(s) + \text{const.}$ and $b(s) = d(s) + \text{const.}$, where the constant is with respect to the state $s$.

Theorem 1:
Suppose we have, for an MDP where the decomposability condition holds,
$$f_{\theta,\omega}(s, a, s') = g_\omega(s, a) + \gamma h_\Phi(s') - h_\Phi(s), \qquad (56)$$
where $h_\Phi$ is a shaping term, and suppose we obtain the optimal $f^*_{\theta,\omega}(s, a, s')$ with a reward approximator $g^*_\omega(s, a)$. Under deterministic dynamics the following holds:
$$g^*_\omega(s, a) + \gamma h^*_\Phi(s') - h^*_\Phi(s) = r^*_\omega(s) + \gamma V^*_\Omega(s') - V^*_\Omega(s) \qquad (57)$$
and
$$g^*_\omega(s) = r^*_\omega(s) + c_\omega. \qquad (58)$$

Proof:
We know $f^*_\omega(s, a, s') = A^*(s, a, \omega) = Q^*(s, a, \omega) - V^*_\Omega(s) = r^*_\omega(s) + \gamma V^*_\Omega(s') - V^*_\Omega(s)$. We can substitute the definition of $f^*_\omega(s, a, s')$ to obtain the first claim. Here
$$Q(s, \omega) = \sum_{a \in \mathcal{A}} \pi_{\omega,\alpha}(a \mid s)\Big[r_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_{\delta,\omega}(s'))\, Q(s', \omega) + \beta_{\delta,\omega}(s')\, V_\Omega(s')\big)\Big],$$
$$V_\Omega(s) = \sum_{\omega \in \Omega} \pi_{\Omega,\zeta}(\omega \mid s)\, Q(s, \omega),$$
$$Q(s, a, \omega) = \pi_{\omega,\alpha}(a \mid s)\Big[r_{\omega,\theta}(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_{\delta,\omega}(s'))\, Q(s', \omega) + \beta_{\delta,\omega}(s')\, V_\Omega(s')\big)\Big],$$
which holds for all $s$ and $s'$. Now we apply Lemma 2. We set $a(s) = g^*_\omega(s) - h^*_\Phi(s)$, $b(s') = \gamma h^*_\Phi(s')$, $c(s) = r(s) - V^*_\Omega(s)$ and $d(s') = \gamma V^*_\Omega(s')$, and rearrange according to Lemma 2. We therefore have the result that $g^*_\omega(s) = r_\omega(s) + c_\omega$, where $c_\omega$ is a constant.

F. Proof of Convergence
Definition 2: Reward Approximator Error.
From Theorem 1, we see that our reward approximator satisfies $g^*_\omega(s) = r_\omega(s) + c_\omega$. We define a reward approximator error over all options as $\delta_r = \sum_{\omega \in \Omega} \pi_\Omega(\omega)\, |g^*_\omega(s) - r^*(s)|$. This error is bounded as
$$\delta_r = \sum_{\omega \in \Omega} \pi_\Omega(\omega)\, |g^*_\omega(s) - r^*(s)| \leq \max_{\omega \in \Omega} c_\omega \qquad (59)$$
by definition of $g^*_\omega(s)$.

Lemma 3:
The Bellman operator for options in the IRL problem is a contraction.
Proof:
We prove this using the Cauchy-Schwarz inequality and the definition of the sup-norm. We state the inequality in terms of the IRL problem, where we have a reward estimator $\hat{g}_{\theta_\omega}(s)$ under our learned parameter $\theta$ and an optimal reward $r^*(s)$.
$$\begin{aligned} \|Q_{\pi_\Omega,t}(s, \omega) - Q^*(s, \omega)\|_\infty &= \Big\|\hat{g}_\theta(s) + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta(s'))\, Q_{\pi_\Omega,t}(s', \omega) + \beta(s')\, \max_{\omega \in \Omega} Q_{\pi_\Omega,t}(s', \omega)\big) \\ &\qquad - r^*(s) - \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta(s'))\, Q^*(s', \omega) + \beta(s')\, \max_{\omega \in \Omega} Q^*(s', \omega)\big)\Big\|_\infty \\ &= \Big\|\hat{g}_\theta(s) - r^*(s) + \gamma \sum_{s' \in S} P(s' \mid s, a)\Big[(1 - \beta(s'))\big(Q_{\pi_\Omega,t}(s', \omega) - Q^*(s', \omega)\big) + \beta(s')\big(\max_{\omega \in \Omega} Q_{\pi_\Omega,t}(s', \omega) - \max_{\omega \in \Omega} Q^*(s', \omega)\big)\Big]\Big\|_\infty \\ &\leq \Big\|\gamma \sum_{s' \in S} P(s' \mid s, a)\Big[(1 - \beta(s'))\big(Q_{\pi_\Omega,t}(s', \omega) - Q^*(s', \omega)\big) + \beta(s')\big(\max_{\omega \in \Omega} Q_{\pi_\Omega,t}(s', \omega) - \max_{\omega \in \Omega} Q^*(s', \omega)\big)\Big]\Big\|_\infty + \max_{\omega \in \Omega} c_\omega \\ &\leq \gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{s'', \omega''} \|Q_{\pi_\Omega,t}(s'', \omega'') - Q^*(s'', \omega'')\|_\infty + \max_{\omega \in \Omega} c_\omega \\ &\leq \gamma \max_{s'', \omega''} \|Q_{\pi_\Omega,t}(s'', \omega'') - Q^*(s'', \omega'')\|_\infty + \max_{\omega \in \Omega} c_\omega \qquad (60) \end{aligned}$$
This follows from (Sutton et al., 1999) [Theorem 3], giving our result $\max_{s'', \omega''} |Q_{\pi_\Omega,t}(s, \omega) - Q^*(s, \omega)| \leq \epsilon + \max_{\omega \in \Omega} c_\omega$ for $\epsilon \in \mathbb{R}_{>0}$.

Theorem 2: $g_\theta(s) + \gamma Q(s', \omega)$ converges to $Q^*$.

Proof:
We know $g_\theta(s) \to g^*_\theta(s) = r^*(s) + \text{const}$. Given this, we can show by Cauchy-Schwarz:
$$\begin{aligned} &\big|\mathbb{E}[g_\theta(s)] + \gamma\, \mathbb{E}[Q(s', \omega) \mid s] - Q^*(s', \omega)\big| \\ &= \Big|\mathbb{E}[g_\theta(s)] + \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_\omega(s'))\, Q(s', \omega) + \beta_\omega(s')\, V_\Omega(s')\big) - r^*(s) - \gamma \sum_{s' \in S} P(s' \mid s, a)\big((1 - \beta_\omega(s'))\, Q^*(s', \omega) + \beta_\omega(s')\, \max_{\omega \in \Omega} Q^*(s', \omega)\big)\Big| \\ &= \Big|\mathbb{E}[g_\theta(s)] - r^*(s) + \gamma \sum_{s' \in S} P(s' \mid s, a)\Big[\beta_\omega(s')\big(\max_{\omega \in \Omega} Q(s', \omega) - \max_{\omega \in \Omega} Q^*(s', \omega)\big) + (1 - \beta_\omega(s'))\big(Q(s', \omega) - Q^*(s', \omega)\big)\Big]\Big| \\ &\overset{(a)}{\leq} \big(\max_{\omega \in \Omega} c_\omega\big)\, \Big|\gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{s'', \omega''} |Q(s'', \omega'') - Q^*(s'', \omega'')|\Big| \\ &\overset{(b)}{\leq} \big(\max_{\omega \in \Omega} c_\omega\big)\big(\epsilon + \max_{\omega \in \Omega} c_\omega\big)\, \gamma \sum_{s' \in S} P(s' \mid s, a) \\ &\leq \big(\max_{\omega \in \Omega} c_\omega\big)\big(\epsilon + \max_{\omega \in \Omega} c_\omega\big)\, \gamma, \qquad (61) \end{aligned}$$
where $(a)$ follows from Lemma 3 and $(b)$ holds since $\sum_{s' \in S} P(s' \mid s, a) \leq 1$.
G. Parameters for Experiments
G.1. MuJoCo Tasks
For these experiments, we use PPO to obtain an optimal policy given the ground truth rewards, trained for 2 million iterations (20 million on the complex tasks). This is used to obtain the expert demonstrations; we sample 50 expert trajectories. PPOC is used for the policy optimization step for the policy over options, and the deliberation cost hyper-parameter for PPOC is tuned via cross-validation. We also use state-only rewards for the policy transfer tasks. The hyperparameters for our policy optimization are given in Table 4.

Our discriminator is a neural network with the optimal architecture of 2 linear layers of 50 hidden units, each with ReLU activation, followed by a single-node linear output layer. We also tried other numbers of hidden units, including 100 and 25, and tanh activations during our hyperparameter optimization step using cross-validation.

The policy network has 2 layers of 64 hidden units. A batch size of 64 or 32 is used for 1 option and for any number of options greater than 1, respectively. No mini-batches are used in the discriminator, since the recursive loss must be computed. There are 2048 timesteps per batch. Generalized Advantage Estimation is used to compute advantage estimates. We list additional network parameters in the next section. The output of the policy network gives the Gaussian mean and the standard deviation, following the same procedure as in (Schulman et al., 2017).
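A rough PyTorch-style sketch of the discriminator architecture described above (two 50-unit ReLU layers followed by a single-node linear output); the input dimension and option count are placeholders, and this is an illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

# obs_dim is a placeholder; it depends on the environment's observation space.
obs_dim = 111

def make_discriminator(input_dim):
    # Per-option discriminator body: 2 linear layers of 50 hidden units with ReLU,
    # followed by a single-node linear output (the logit f_{theta,omega}).
    return nn.Sequential(
        nn.Linear(input_dim, 50),
        nn.ReLU(),
        nn.Linear(50, 50),
        nn.ReLU(),
        nn.Linear(50, 1),
    )

num_options = 4
discriminators = [make_discriminator(obs_dim) for _ in range(num_options)]
print(discriminators[0](torch.zeros(1, obs_dim)).shape)   # torch.Size([1, 1])
```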
Table 4. Policy optimization parameters for MuJoCo
Parameter                              Value
Discriminator Adam learning rate
Adam epsilon
PPOC Adam learning rate
GAE lambda
Entropy coefficient
Value loss coefficient                 0.5
Discount                               0.99
Batch size for PPO                     64 or 32
PPO epochs                             10
Entropy coefficient
Clip parameter                         0.2

G.2. MuJoCo Continuous Control Tasks
In this section, we describe the structure of the agents used in the continuous control gait benchmarks and their reward functions. For the transfer learning tasks, we use the same reward function described here for the Ant.
Walker: The walker is a planar biped. There are 7 rigid links, comprising the legs and a torso, with 6 actuated joints. This task is particularly prone to falling. The state space has 21 dimensions; the observations include joint angles, joint velocities and the coordinates of the center of mass. The reward function is the forward velocity minus a control cost, r(s, a) = v_x − c‖a‖², where c is a small control-cost coefficient. The episode terminates when z_body drops below a lower threshold, exceeds an upper threshold, or |θ_y| exceeds a threshold.

Half-Cheetah: The half-cheetah is a planar biped, like the Walker. There are 9 rigid links, comprising 9 actuated joints, a leg and a torso. The state space has 20 dimensions; the observations include joint angles, the coordinates of the center of mass, and joint velocities. The reward function is again r(s, a) = v_x − c‖a‖². There is no termination condition.

Ant: The ant has four legs with 13 rigid links in its structure; the legs have 8 actuated joints. The state space has 125 dimensions, including joint angles, joint velocities, the coordinates of the center of mass, the rotation matrix for the body, and a vector of contact forces. The reward function is r(s, a) = v_x − c‖a‖² − C_contact + a survival bonus, where C_contact is a penalty proportional to ‖F_contact‖² for contacts with the ground, with F_contact the contact force; its values are clipped to be between 0 and 1. The episode terminates when z_body leaves a fixed healthy range.
Figure 6.
Architecture of the actor-critic policies on MiniGrid. Conv denotes a convolutional layer, with filter sizes as described below; FC is a fully connected layer.
G.3. MiniGrid Tasks
For these experiments, we used the PPOC algorithm with parallelized data collection and GAE; 0.1 is the optimal deliberation cost. Each environment is run with 10 random network initializations. As before, in Table 5 we show some of the policy optimization parameters for the MiniGrid tasks. We rely on an actor-critic network architecture for these tasks. Since the state space is relatively large and spatial features are relevant, we use 3 convolutional layers in the network. The network architecture is detailed in Figure 6; n and m are defined by the grid dimensions.

The discriminator network is again a neural network, with the optimal architecture of 3 linear layers of 150 hidden units, each with ReLU activation, followed by a single-node linear output layer.
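A rough PyTorch-style sketch of the shape of the MiniGrid actor-critic network (a three-layer convolutional encoder feeding actor and critic heads). The channel counts, kernel sizes, head widths and grid size are assumptions for illustration only, since the exact values appear in Figure 6 rather than in the text.

```python
import torch
import torch.nn as nn

class MiniGridActorCritic(nn.Module):
    """Actor-critic with a 3-layer convolutional encoder (layer sizes are placeholders)."""

    def __init__(self, n, m, in_channels=3, num_actions=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat_dim = self.encoder(torch.zeros(1, in_channels, n, m)).shape[1]
        self.actor = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, num_actions))
        self.critic = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs):
        z = self.encoder(obs)
        return self.actor(z), self.critic(z)

net = MiniGridActorCritic(n=7, m=7)
logits, value = net(torch.zeros(1, 3, 7, 7))
print(logits.shape, value.shape)   # torch.Size([1, 7]) torch.Size([1, 1])
```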
Table 5. Policy optimization parameters for benchmark tasks in MiniGrid
Parameter                              Value
Adam optimizer learning rate
Adam epsilon
Entropy coefficient
Value loss coefficient                 0.5
Discount                               0.99
Maximum norm of gradient in PPO        0.5
Number of PPO epochs                   4
Batch size for PPO                     256
Entropy coefficient