VILD: Variational Imitation Learning with Diverse-quality Demonstrations

Voot Tangkaratt, Bo Han, Mohammad Emtiyaz Khan, and Masashi Sugiyama
RIKEN AIP, Japan; The University of Tokyo, Japan
Abstract
The goal of imitation learning (IL) is to learn a good policy from high-quality demonstrations. However, the quality of demonstrations in reality can be diverse, since it is easier and cheaper to collect demonstrations from a mix of experts and amateurs. IL in such situations can be challenging, especially when the level of demonstrators' expertise is unknown. We propose a new IL method called variational imitation learning with diverse-quality demonstrations (VILD), where we explicitly model the level of demonstrators' expertise with a probabilistic graphical model and estimate it along with a reward function. We show that a naive approach to estimation is not suitable for large state and action spaces, and fix its issues by using a variational approach which can be easily implemented using existing reinforcement learning methods. Experiments on continuous-control benchmarks demonstrate that VILD outperforms state-of-the-art methods. Our work enables scalable and data-efficient IL under more realistic settings than before.
1 Introduction

The goal of sequential decision making is to learn a policy that makes good decisions (Puterman, 1994). As an important branch of sequential decision making, imitation learning (IL) (Russell, 1998; Schaal, 1999) aims to learn such a policy from demonstrations (i.e., sequences of decisions) collected from experts. However, high-quality demonstrations can be difficult to obtain in reality, since such experts may not always be available and are sometimes too costly (Osa et al., 2018). This is especially true when the quality of decisions depends on specific domain knowledge not typically available to amateurs, e.g., in applications such as robot control (Osa et al., 2018), autonomous driving (Silver et al., 2012), and the game of Go (Silver et al., 2016).

In practice, demonstrations are often diverse in quality, since it is cheaper to collect them from mixed demonstrators, containing both experts and amateurs (Audiffren et al., 2015). Unfortunately, IL in such settings tends to perform poorly, since low-quality demonstrations often negatively affect performance (Shiarlis et al., 2016; Lee et al., 2016). For example, demonstrations for robotics can be cheaply collected via a robot simulation (Mandlekar et al., 2018), but demonstrations from amateurs who are not familiar with the robot may damage the robot, which is catastrophic in the real world (Shiarlis et al., 2016). Similarly, demonstrations for autonomous driving can be collected from drivers on public roads (Fridman et al., 2017), but these low-quality demonstrations may also cause traffic accidents.

When the level of demonstrators' expertise is known, multi-modal IL (MM-IL) may be used to learn a good policy with diverse-quality demonstrations (Li et al., 2017; Hausman et al., 2017; Wang et al., 2017). More specifically, MM-IL aims to learn a multi-modal policy where each mode of the policy represents the decision making of one demonstrator.
When the level of demonstrators' expertise is known, good policies can be obtained by selecting the modes that correspond to the decision making of high-expertise demonstrators. However, in reality it is difficult to determine the level of expertise beforehand. Without knowing the level of demonstrators' expertise, it is difficult to distinguish the decision making of experts from that of amateurs, and thus learning a good policy is quite challenging.

To overcome this issue of MM-IL, existing works have proposed to estimate the quality of each demonstration using additional information from experts (Audiffren et al., 2015; Wu et al., 2019; Brown et al., 2019). Specifically, Audiffren et al. (2015) proposed a method that infers the quality using similarities between diverse-quality demonstrations and high-quality demonstrations, where the latter are collected in a small number from experts. In contrast, Wu et al. (2019) proposed to estimate the quality using a small number of demonstrations with confidence scores. The values of these scores are proportional to the quality and are given by an expert. Similarly, the quality can be estimated using demonstrations that are ranked according to their relative quality by an expert (Brown et al., 2019). These methods rely on additional information from experts, namely high-quality demonstrations, confidence scores, and rankings. In practice, these pieces of information can be scarce or noisy, which leads to poor performance of these methods.

In this paper, we consider a novel but realistic setting of IL where only diverse-quality demonstrations are available, while the level of demonstrators' expertise and additional information from experts are fully absent. To tackle this challenging setting, we propose a new method called variational imitation learning with diverse-quality demonstrations (VILD). The central idea of VILD is to model the level of expertise via a probabilistic graphical model, and to learn it along with a reward function that represents the intention of the expert's decision making. To scale up our model to large state and action spaces, we leverage the variational approach (Jordan et al., 1999), which can be implemented using reinforcement learning (RL) (Sutton & Barto, 1998). To further improve data-efficiency when learning the reward function, we utilize importance sampling to re-weight a sampling distribution according to the estimated level of expertise.

(Contacts: [email protected]; [email protected])
Experiments on continuous-control benchmarks demonstrate that VILD is robust against diverse-quality demonstrations and outperforms existing methods significantly. Empirical results also show that VILD is a scalable and data-efficient method for realistic settings of IL.

2 Related Work

In this section, we first discuss a related area of supervised learning with diverse-quality data. Then, we discuss existing IL methods that use the variational approach.
Supervised learning with diverse-quality data.
In supervised learning, diverse-quality data has been studied extensively under the setting of classification with noisy labels (Angluin & Laird, 1988). This classification setting assumes that human labelers may assign incorrect class labels to training inputs. With such labelers, the obtained dataset consists of high-quality data with correct labels and low-quality data with incorrect labels. To handle this challenging setting, many methods have been proposed (Raykar et al., 2010; Natarajan et al., 2013; Han et al., 2018). The methods most related to ours are probabilistic modeling methods, which aim to infer the correct labels and the level of labelers' expertise (Raykar et al., 2010; Khetan et al., 2018). Specifically, Raykar et al. (2010) proposed a method based on a two-coin model which enables estimating the correct labels and the level of expertise. Recently, Khetan et al. (2018) proposed a method based on weighted loss functions, where the weights are determined by the estimated labels and level of expertise.

Methods for supervised learning with diverse-quality data may be used to learn a policy in our setting. However, they tend to perform poorly due to the issue of compounding error (Ross & Bagnell, 2010). Specifically, supervised learning methods generally assume that the data distributions during training and testing are identical. However, the data distributions during training and testing differ in IL, since they depend on policies (Ng & Russell, 2000). A discrepancy between the data distributions causes compounding errors during testing, where prediction errors increase further in future predictions. Due to this issue, supervised-learning-based methods often perform poorly in IL (Ross & Bagnell, 2010). The issue becomes even worse with diverse-quality demonstrations, since the data distributions of different demonstrators tend to be highly different. For these reasons, methods for supervised learning with diverse-quality data are not suitable for IL.
Variational approach in IL.
The variational approach (Jordan et al., 1999) has been previously utilized in IL to perform MM-IL and to reduce over-fitting. Specifically, MM-IL aims to learn a multi-modal policy from diverse demonstrations collected by many experts (Li et al., 2017), where each mode of the policy represents the decision making of one expert. (We emphasize that diverse demonstrations are different from diverse-quality demonstrations: the former are collected by experts who execute equally good policies, and thus are equally high-quality but diverse in behavior, while the latter are collected by mixed demonstrators and are diverse in both quality and behavior.) A multi-modal policy is commonly represented by a context-dependent policy, where each context represents one mode of the policy. The variational approach has been used to learn a distribution of such contexts, i.e., by learning a variational auto-encoder (Wang et al., 2017) and by maximizing a variational lower-bound of mutual information (Li et al., 2017; Hausman et al., 2017). Meanwhile, the variational information bottleneck (VIB) (Alemi et al., 2017) has been used to reduce over-fitting in IL (Peng et al., 2019). Specifically, VIB aims to compress information flow by minimizing a variational bound of mutual information. This compression filters out irrelevant signals, which leads to less over-fitting. Unlike these existing works, we utilize the variational approach to aid in computing integrals over large state-action spaces, and do not use a variational auto-encoder or a variational bound of mutual information.
3 Background

Before delving into our main contribution, we first give the minimal background on RL and IL. Then, we formulate a new setting of IL with diverse-quality demonstrations, discuss its challenges, and reveal the deficiencies of existing methods.
Reinforcement learning.
Reinforcement learning (RL) (Sutton & Barto, 1998) aims to learn an optimal policy for a sequential decision making problem, which is often mathematically formulated as a Markov decision process (MDP) (Puterman, 1994). We consider a finite-horizon MDP with continuous state and action spaces defined by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p(s_{t+1}|s_t, a_t), p(s_1), r(s_t, a_t))$ with a state $s_t \in \mathcal{S} \subseteq \mathbb{R}^{d_s}$, an action $a_t \in \mathcal{A} \subseteq \mathbb{R}^{d_a}$, an initial state density $p(s_1)$, a transition probability density $p(s_{t+1}|s_t, a_t)$, and a reward function $r\colon \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$, where the subscript $t \in \{1, \dots, T\}$ denotes the time step. A sequence of states and actions, $(s_{1:T}, a_{1:T})$, is called a trajectory. The decision making of an agent is determined by a policy $\pi(a_t|s_t)$, which is a conditional probability density of an action given a state. RL seeks an optimal policy $\pi^\star(a_t|s_t)$ which maximizes the expected cumulative reward, i.e., $\pi^\star = \operatorname{argmax}_{\pi} \mathbb{E}_{p_\pi(s_{1:T}, a_{1:T})}[\sum_{t=1}^{T} r(s_t, a_t)]$, where $p_\pi(s_{1:T}, a_{1:T}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t) \pi(a_t|s_t)$ is the trajectory probability density induced by $\pi$. RL has shown great successes recently, especially when combined with deep neural networks (Mnih et al., 2015; Silver et al., 2017). However, a major limitation of RL is that it relies on the reward function, which may be unavailable in practice (Russell, 1998).
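The expected cumulative reward above can be estimated by Monte Carlo rollouts. The following sketch uses a hypothetical 1-D MDP (made-up dynamics, reward, and policies, not from the paper) to illustrate that a policy closer to optimal attains a larger estimated return:

```python
import numpy as np

def expected_return(policy, T=20, n_rollouts=2000, seed=0):
    """Monte Carlo estimate of E_{p_pi}[sum_t r(s_t, a_t)] for a toy 1-D MDP.

    Illustrative choices (not from the paper):
      s_1 ~ N(0, 1),  s_{t+1} = s_t + a_t + N(0, 0.1^2),  r(s, a) = -s^2 - 0.1 a^2.
    """
    rng = np.random.default_rng(seed)
    returns = np.zeros(n_rollouts)
    for n in range(n_rollouts):
        s = rng.normal(0.0, 1.0)                 # s_1 ~ p(s_1)
        total = 0.0
        for t in range(T):
            a = policy(s, rng)
            total += -s**2 - 0.1 * a**2          # reward r(s_t, a_t)
            s = s + a + rng.normal(0.0, 0.1)     # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
        returns[n] = total
    return returns.mean()

# A proportional controller drives the state toward 0, so it earns a larger
# expected return than a policy that ignores the state.
good = expected_return(lambda s, rng: -0.8 * s + rng.normal(0.0, 0.05))
bad = expected_return(lambda s, rng: rng.normal(0.0, 0.05))
print(good > bad)
```

This mirrors how performance is measured later in the experiments: by the cumulative ground-truth reward of trajectories generated by a learned policy.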
Imitation learning.

To address the above limitation of RL, imitation learning (IL) was proposed (Schaal, 1999; Ng & Russell, 2000). Without using the reward function, IL aims to learn the optimal policy from demonstrations that encode information about the optimal policy. A common assumption in most IL methods is that demonstrations are collected by $K \geq 1$ demonstrators who execute actions $a_t$ drawn from $\pi^\star(a_t|s_t)$ for every state $s_t$. A graphical model describing this data collection process is depicted in Figure 1(a), where a random variable $k \in \{1, \dots, K\}$ denotes each demonstrator's identification number and $p(k)$ denotes the probability of collecting a demonstration from the $k$-th demonstrator. Under this assumption, demonstrations $\{(s_{1:T}, a_{1:T}, k)_n\}_{n=1}^{N}$ (i.e., the observed random variables in Figure 1(a)) are called expert demonstrations and are regarded as drawn independently from a probability density $p^\star(s_{1:T}, a_{1:T}) p(k) = p(k) p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t) \pi^\star(a_t|s_t)$. We note that the variable $k$ does not affect the trajectory density $p^\star(s_{1:T}, a_{1:T})$ and can be omitted. In this paper, we make the common assumption that $p(s_1)$ and $p(s_{t+1}|s_t, a_t)$ are unknown but that we can sample states from them.

IL has shown great successes in benchmark settings (Ho & Ermon, 2016; Fu et al., 2018; Peng et al., 2019). However, practical applications of IL in the real world are relatively few (Schroecker et al., 2019). One of the main reasons is that most IL methods aim to learn from expert demonstrations. In practice, such demonstrations are often too costly to obtain due to the limited number of experts, and even when we obtain them, the number of demonstrations is often too small to accurately learn the optimal policy (Audiffren et al., 2015; Wu et al., 2019; Brown et al., 2019).

New setting: Diverse-quality demonstrations.
To improve practicality, we consider a new problem called
IL with diverse-quality demonstrations, where demonstrations are collected from demonstrators with different levels of expertise. Compared to expert demonstrations, diverse-quality demonstrations can be collected more cheaply, e.g., via crowdsourcing (Mandlekar et al., 2018). The graphical model in Figure 1(b) depicts the process of collecting such demonstrations from $K > 1$ demonstrators. Formally, we select the $k$-th demonstrator according to a probability distribution $p(k)$. After selecting $k$, for each time step $t$, the $k$-th demonstrator observes state $s_t$ and samples action $a_t$ using the optimal policy $\pi^\star(a_t|s_t)$. However, the demonstrator may not execute $a_t$ in the MDP if this demonstrator is not an expert. Instead, he/she may sample an action $u_t \in \mathcal{A}$ from another probability density $p(u_t|s_t, a_t, k)$ and execute it. Then, the next state $s_{t+1}$ is observed with probability density $p(s_{t+1}|s_t, u_t)$, and the demonstrator continues making decisions until time step $T$. We repeat this process $N$ times to collect diverse-quality demonstrations $\mathcal{D}_d = \{(s_{1:T}, u_{1:T}, k)_n\}_{n=1}^{N}$.

[Figure 1: Graphical models describing (a) expert demonstrations and (b) diverse-quality demonstrations. Shaded and unshaded nodes indicate observed and unobserved random variables, respectively. Plate notation indicates that the sampling process is repeated $N$ times. $s_t \in \mathcal{S}$ is a state with transition density $p(s_{t+1}|s_t, a_t)$, $a_t \in \mathcal{A}$ is an action with density $\pi^\star(a_t|s_t)$, $u_t \in \mathcal{A}$ is a noisy action with density $p(u_t|s_t, a_t, k)$, and $k \in \{1, \dots, K\}$ is an identification number with distribution $p(k)$.]

These demonstrations are regarded as drawn independently from a probability density
$$p_d(s_{1:T}, u_{1:T} | k) p(k) = p(k) p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, u_t) \int_{\mathcal{A}} \pi^\star(a_t|s_t) p(u_t|s_t, a_t, k) \,\mathrm{d}a_t. \quad (1)$$
We refer to $p(u_t|s_t, a_t, k)$ as the noisy policy of the $k$-th demonstrator, since it is used to execute a noisy action $u_t$. Our goal is to learn the optimal policy $\pi^\star$ using the diverse-quality demonstrations $\mathcal{D}_d$.
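This data-collection process can be simulated directly. The sketch below uses a hypothetical 1-D MDP, a deterministic stand-in for $\pi^\star$, and Gaussian noisy policies with per-demonstrator noise levels $\sigma_k$; all of these concrete choices are illustrative assumptions, not values from the paper:

```python
import numpy as np

def collect_diverse_demos(n_demos=5, T=10, seed=0):
    """Sample trajectories following the graphical model in Figure 1(b).

    Illustrative toy choices: optimal policy a_t = -0.8 s_t, linear-Gaussian
    dynamics, and noisy policies p(u_t | s_t, a_t, k) = N(u_t; a_t, sigma_k^2).
    """
    rng = np.random.default_rng(seed)
    sigmas = np.array([0.01, 0.1, 1.0])            # expertise: small sigma_k = expert
    p_k = np.full(len(sigmas), 1.0 / len(sigmas))  # p(k): uniform demonstrator choice
    demos = []
    for _ in range(n_demos):
        k = rng.choice(len(sigmas), p=p_k)         # select the k-th demonstrator
        s = rng.normal(0.0, 1.0)                   # s_1 ~ p(s_1)
        states, noisy_actions = [], []
        for t in range(T):
            a = -0.8 * s                            # intended action from pi*(a_t | s_t)
            u = a + rng.normal(0.0, sigmas[k])      # executed u_t ~ N(a_t, sigma_k^2)
            states.append(s)
            noisy_actions.append(u)
            s = s + u + rng.normal(0.0, 0.1)        # s_{t+1} ~ p(s_{t+1} | s_t, u_t)
        demos.append((np.array(states), np.array(noisy_actions), k))
    return demos

demos = collect_diverse_demos()
print(len(demos))  # trajectories of the form (s_{1:T}, u_{1:T}, k)
```

Note that, exactly as in Eq. (1), only the noisy actions $u_t$ (not the intended actions $a_t$) appear in the recorded demonstrations.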
The deficiency of existing methods.

We conjecture that existing IL methods are not suitable for learning with diverse-quality demonstrations drawn from $p_d$. Specifically, these methods treat observed demonstrations as if they were drawn from $p^\star$. By comparing $p^\star$ and $p_d$, we can see that existing methods would learn $\pi(u_t|s_t)$ such that $\pi(u_t|s_t) \approx \sum_{k=1}^{K} p(k) \int_{\mathcal{A}} \pi^\star(a_t|s_t) p(u_t|s_t, a_t, k) \,\mathrm{d}a_t$. In other words, they learn a policy that averages over the decisions of all demonstrators. This is problematic when amateurs are present, as the averaged decisions of all demonstrators can be highly different from those of experts. Worse yet, the state distributions of amateurs and experts tend to be highly different, which often leads to unstable learning. For these reasons, we believe that existing methods tend to learn policies that achieve only average performance and are not suitable for the setting of diverse-quality demonstrations.

4 Variational Imitation Learning with Diverse-quality Demonstrations (VILD)

This section describes VILD, a robust method for tackling the challenges posed by diverse-quality demonstrations. Specifically, we build a probabilistic model that explicitly describes the level of demonstrators' expertise and a reward function (Section 4.1), and estimate its parameters by a variational approach (Section 4.2), which can be implemented using RL (Section 4.3). We also improve data-efficiency by using importance sampling (Section 4.4). Mathematical derivations are provided in Appendix A.
4.1 Model with diverse-quality demonstrations

This section presents a model which enables estimating the level of demonstrators' expertise. We first describe a naive model, whose parameters can be estimated trivially via supervised learning but which suffers from the issue of compounding error. Then, we describe our proposed model, which avoids the issue of the naive model by learning a reward function.
Naive model.
Based on $p_d$, one of the simplest models for handling diverse-quality demonstrations is $p_{\theta,\omega}(s_{1:T}, u_{1:T}, k) = p(k) p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, u_t) \int_{\mathcal{A}} \pi_\theta(a_t|s_t) p_\omega(u_t|s_t, a_t, k) \,\mathrm{d}a_t$, where $\theta$ and $\omega$ are real-valued parameter vectors. These parameters can be learned by, e.g., minimizing the Kullback-Leibler (KL) divergence from the data distribution to the model: $\min_{\theta,\omega} \mathrm{KL}(p_d(s_{1:T}, u_{1:T}|k) p(k) \,\|\, p_{\theta,\omega}(s_{1:T}, u_{1:T}, k))$. This naive model can be regarded as a regression extension of the two-coin model proposed by Raykar et al. (2010) for classification with noisy labels. As discussed in Section 2, such a model suffers from the issue of compounding error and is not suitable for our IL setting.

Proposed model.

To avoid the issue of compounding error, our method utilizes the inverse RL (IRL) approach (Ng & Russell, 2000), where we aim to learn a reward function from diverse-quality demonstrations. IL problems can be solved by a combination of IRL and RL: we learn a reward function by IRL and then learn a policy from the reward function by RL. This combination avoids the issue of compounding error, since the policy is learned by RL, which generalizes to states not present in the demonstrations.

Specifically, our proposed model is based on the model of maximum entropy IRL (MaxEnt-IRL) (Ziebart et al., 2010). Briefly speaking, MaxEnt-IRL learns a reward function from expert demonstrations by using a model $p_\phi(s_{1:T}, a_{1:T}) \propto p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t) \exp(r_\phi(s_t, a_t))$. Based on this model, we propose to learn the reward function and the level of expertise with the model
$$p_{\phi,\omega}(s_{1:T}, u_{1:T}, k) = p(k) p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, u_t) \int_{\mathcal{A}} \exp(r_\phi(s_t, a_t)) p_\omega(u_t|s_t, a_t, k) \,\mathrm{d}a_t \,/\, Z_{\phi,\omega}, \quad (2)$$
where $\phi$ and $\omega$ are the parameters of the model and $Z_{\phi,\omega}$ is the normalization term.
By comparing the proposed model $p_{\phi,\omega}(s_{1:T}, u_{1:T}, k)$ to the data distribution $p_d$, we see that the reward parameter $\phi$ should be learned so that the exponentiated cumulative reward is proportional to the probability density of actions under the optimal policy, i.e., $\exp(\sum_{t=1}^{T} r_\phi(s_t, a_t)) \propto \prod_{t=1}^{T} \pi^\star(a_t|s_t)$. In other words, the cumulative reward is large for trajectories induced by the optimal policy $\pi^\star$, and therefore $\pi^\star$ can be learned by maximizing the cumulative reward. Meanwhile, the density $p_\omega(u_t|s_t, a_t, k)$ is learned to estimate the noisy policy $p(u_t|s_t, a_t, k)$. In the remainder, we refer to $\omega$ as the expertise parameter.

To learn the parameters of this model, we propose to minimize the KL divergence from the data distribution to the model: $\min_{\phi,\omega} \mathrm{KL}(p_d(s_{1:T}, u_{1:T}|k) p(k) \,\|\, p_{\phi,\omega}(s_{1:T}, u_{1:T}, k))$. By rearranging terms and ignoring constants, minimizing this KL divergence is equivalent to solving the optimization problem $\max_{\phi,\omega} f(\phi, \omega) - g(\phi, \omega)$, where $f(\phi, \omega) = \mathbb{E}_{p_d(s_{1:T}, u_{1:T}|k) p(k)}[\sum_{t=1}^{T} \log(\int_{\mathcal{A}} \exp(r_\phi(s_t, a_t)) p_\omega(u_t|s_t, a_t, k) \,\mathrm{d}a_t)]$ and $g(\phi, \omega) = \log Z_{\phi,\omega}$. To solve this optimization problem, we need to compute integrals over both the state space $\mathcal{S}$ and the action space $\mathcal{A}$. Computing these integrals is feasible for small state and action spaces, but infeasible for large ones. To scale up our model to MDPs with large state and action spaces, we leverage a variational approach in the following.

4.2 Variational approach

The central idea of the variational approach is to lower-bound an integral by the Jensen inequality and a variational distribution (Jordan et al., 1999).
The main benefit of the variational approach is that the integral can be computed indirectly via the lower-bound, given an optimal variational distribution. However, finding the optimal distribution often requires solving a sub-optimization problem.

Before we proceed, notice that $f(\phi, \omega) - g(\phi, \omega)$ is not a jointly concave function of the integrals, which prohibits applying the Jensen inequality directly. However, we can use the Jensen inequality to separately lower-bound $f(\phi, \omega)$ and $g(\phi, \omega)$, since each is a concave function of its corresponding integral. Specifically, let $l_{\phi,\omega}(s_t, a_t, u_t, k) = r_\phi(s_t, a_t) + \log p_\omega(u_t|s_t, a_t, k)$. By using a variational distribution $q_\psi(a_t|s_t, u_t, k)$ with parameter $\psi$, we obtain the inequality $f(\phi, \omega) \geq F(\phi, \omega, \psi)$, where
$$F(\phi, \omega, \psi) = \mathbb{E}_{p_d(s_{1:T}, u_{1:T}|k) p(k)}\Big[\sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t, u_t, k)}[l_{\phi,\omega}(s_t, a_t, u_t, k)] + \mathcal{H}_t(q_\psi)\Big], \quad (3)$$
and $\mathcal{H}_t(q_\psi) = -\mathbb{E}_{q_\psi(a_t|s_t, u_t, k)}[\log q_\psi(a_t|s_t, u_t, k)]$. It is trivial to verify that the equality $f(\phi, \omega) = \max_\psi F(\phi, \omega, \psi)$ holds (Murphy, 2013), where the maximizer $\psi^\star$ of the lower-bound yields $q_{\psi^\star}(a_t|s_t, u_t, k) \propto \exp(l_{\phi,\omega}(s_t, a_t, u_t, k))$. Therefore, the function $f(\phi, \omega)$ can be substituted by $\max_\psi F(\phi, \omega, \psi)$. Meanwhile, by using a variational distribution $q_\theta(a_t, u_t|s_t, k)$ with parameter $\theta$, we obtain the inequality $g(\phi, \omega) \geq G(\phi, \omega, \theta)$, where
$$G(\phi, \omega, \theta) = \mathbb{E}_{\tilde{q}_\theta(s_{1:T}, u_{1:T}, a_{1:T}, k)}\Big[\sum_{t=1}^{T} l_{\phi,\omega}(s_t, a_t, u_t, k) - \log q_\theta(a_t, u_t|s_t, k)\Big], \quad (4)$$
and $\tilde{q}_\theta(s_{1:T}, u_{1:T}, a_{1:T}, k) = p(k) p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, u_t) q_\theta(a_t, u_t|s_t, k)$. The lower-bound $G$ resembles maximum entropy RL (MaxEnt-RL) (Ziebart et al., 2010). By using the optimality results of MaxEnt-RL (Levine, 2018), we have the equality $g(\phi, \omega) = \max_\theta G(\phi, \omega, \theta)$.
Therefore, the function $g(\phi, \omega)$ can be substituted by $\max_\theta G(\phi, \omega, \theta)$.

(We emphasize that IRL (Ng & Russell, 2000) is different from RL, since RL learns an optimal policy from a known reward function.)
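As a numerical sanity check (a hypothetical one-dimensional toy, not an experiment from the paper), the Jensen lower-bound behaves exactly as described: any variational distribution gives a lower bound on a log-integral of the form appearing in $f(\phi, \omega)$, and the maximizer $q \propto \exp(l)$ makes the bound tight. Here $r(a) = -a^2$ and $p(u|a) = \mathcal{N}(u; a, \sigma^2)$ are illustrative choices:

```python
import numpy as np

# 1-D check of the bound log ∫ exp(r(a)) p(u|a) da >= E_q[l(a)] + H(q).
u, sigma = 0.5, 0.4
grid = np.linspace(-6.0, 6.0, 20001)
da = grid[1] - grid[0]

def log_normal(x, mean, var):
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

l = -grid**2 + log_normal(u, grid, sigma**2)     # l(a) = r(a) + log p(u|a)
true_value = np.log(np.sum(np.exp(l)) * da)       # log-integral by quadrature

def lower_bound(m, v):
    """E_q[l(a)] + H(q) for a Gaussian variational q(a) = N(m, v)."""
    q = np.exp(log_normal(grid, m, v))
    entropy = -np.sum(q * log_normal(grid, m, v)) * da
    return np.sum(q * l) * da + entropy

loose = lower_bound(m=1.5, v=1.0)                 # an arbitrary q: strict lower bound
prec = 2.0 + 1.0 / sigma**2                       # here exp(l) is itself Gaussian,
tight = lower_bound(m=(u / sigma**2) / prec,      # so q* ∝ exp(l) is available in
                    v=1.0 / prec)                 # closed form and attains equality
print(loose < true_value, abs(tight - true_value) < 1e-6)
```

In this toy case $\exp(l)$ is Gaussian, so the optimal $q$ is available analytically; in VILD the optimal variational distributions are instead approximated by parameterized functions, as discussed next.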
By using these lower-bounds, we have $\max_{\phi,\omega} f(\phi, \omega) - g(\phi, \omega) = \max_{\phi,\omega,\psi} F(\phi, \omega, \psi) - \max_\theta G(\phi, \omega, \theta) = \max_{\phi,\omega,\psi} \min_\theta F(\phi, \omega, \psi) - G(\phi, \omega, \theta)$. Solving this max-min problem is often feasible even for large state and action spaces, since $F(\phi, \omega, \psi)$ and $G(\phi, \omega, \theta)$ are defined as expectations and can be optimized straightforwardly. In practice, we represent the variational distributions by parameterized functions and solve the sub-optimization problems (w.r.t. $\psi$ and $\theta$) by stochastic optimization methods. In this scenario, however, the equalities $f(\phi, \omega) = \max_\psi F(\phi, \omega, \psi)$ and $g(\phi, \omega) = \max_\theta G(\phi, \omega, \theta)$ may not hold, for two reasons. First, the optimal variational distributions may not lie in the space of our parameterized functions. Second, stochastic optimization methods may yield local solutions. Nonetheless, when the variational distributions are represented by deep neural networks, the obtained variational distributions are often reasonably accurate and the equalities approximately hold (Ranganath et al., 2014).

4.3 Model specification and implementation by RL

In practice, we are required to specify models for $q_\theta(a_t, u_t|s_t, k)$ and $p_\omega(u_t|s_t, a_t, k)$. We propose to use $q_\theta(a_t, u_t|s_t, k) = q_\theta(a_t|s_t) \mathcal{N}(u_t|a_t, \Sigma)$ and $p_\omega(u_t|s_t, a_t, k) = \mathcal{N}(u_t|a_t, C_\omega(k))$. As shown below, the choice for $q_\theta(a_t, u_t|s_t, k)$ enables us to solve the sub-optimization w.r.t. $\theta$ by using RL with the reward function $r_\phi$.
Meanwhile, the choice for $p_\omega(u_t|s_t, a_t, k)$ incorporates our prior knowledge that the noisy policy tends to be Gaussian, which is a reasonable assumption for actual human motor behavior (van Beers et al., 2004). Under these model specifications, solving $\max_{\phi,\omega,\psi} \min_\theta F(\phi, \omega, \psi) - G(\phi, \omega, \theta)$ is equivalent to solving $\max_{\phi,\omega,\psi} \min_\theta H(\phi, \omega, \psi, \theta)$, where
$$H(\phi, \omega, \psi, \theta) = \mathbb{E}_{p_d(s_{1:T}, u_{1:T}|k) p(k)}\Big[\sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t, u_t, k)}\big[r_\phi(s_t, a_t) - \|u_t - a_t\|^2_{C^{-1}_\omega(k)}\big] + \mathcal{H}_t(q_\psi)\Big] - \mathbb{E}_{\tilde{q}_\theta(s_{1:T}, a_{1:T})}\Big[\sum_{t=1}^{T} r_\phi(s_t, a_t) - \log q_\theta(a_t|s_t)\Big] + T\, \mathbb{E}_{p(k)}\big[\mathrm{Tr}(C^{-1}_\omega(k) \Sigma)\big]. \quad (5)$$
Here, $\tilde{q}_\theta(s_{1:T}, a_{1:T}) = p(s_1) \prod_{t=1}^{T} \big(\int p(s_{t+1}|s_t, a_t + \epsilon_t) \mathcal{N}(\epsilon_t|0, \Sigma) \,\mathrm{d}\epsilon_t\big) q_\theta(a_t|s_t)$ is a noisy trajectory density induced by the policy $q_\theta(a_t|s_t)$, where $\mathcal{N}(\epsilon_t|0, \Sigma)$ can be regarded as an approximation of the noisy policy in Figure 1(b). Minimizing $H$ w.r.t. $\theta$ resembles solving a MaxEnt-RL problem with reward function $r_\phi(s_t, a_t)$, except that trajectories are collected according to the noisy trajectory density. In other words, this minimization problem can be solved using RL, and $q_\theta(a_t|s_t)$ can be regarded as an approximation of the optimal policy. The hyper-parameter $\Sigma$ determines the quality of this approximation: a smaller value of $\Sigma$ gives a better approximation. Therefore, by choosing a reasonably small value of $\Sigma$, solving the max-min problem yields a reward function $r_\phi(s_t, a_t)$ and a policy $q_\theta(a_t|s_t)$. This policy imitates the optimal policy, which is the goal of IL.

We note that the model assumption for $p_\omega$ incorporates our prior knowledge about the noisy policy $p(u_t|s_t, a_t, k)$. Namely, $p_\omega(u_t|s_t, a_t, k) = \mathcal{N}(u_t|a_t, C_\omega(k))$ assumes that the noisy policy tends to be Gaussian, where the covariance $C_\omega(k)$ gives the estimated expertise of the $k$-th demonstrator: high-expertise demonstrators have a small $C_\omega(k)$, and vice versa for low-expertise demonstrators. VILD is not restricted to this choice; different choices of $p_\omega$ incorporate different prior knowledge. For example, we may use a Laplace distribution to incorporate prior knowledge about demonstrators who tend to execute outlier actions (Murphy, 2013). In such a case, the squared error in $H$ is simply replaced by the absolute error (see Appendix A.3).

It should be mentioned that $q_\psi(a_t|s_t, u_t, k)$ is a maximum-entropy probability density which maximizes the immediate reward at time $t$ and minimizes the weighted squared error between $u_t$ and $a_t$. The trade-off between the reward and the squared error is determined by the covariance $C_\omega(k)$. Specifically, for demonstrators with a small $C_\omega(k)$ (i.e., high-expertise demonstrators), the squared error has a large magnitude and $q_\psi$ tends to minimize the squared error. Meanwhile, for demonstrators with a large $C_\omega(k)$ (i.e., low-expertise demonstrators), the squared error has a small magnitude and $q_\psi$ tends to maximize the immediate reward.

In practice, we include a regularization term $L(\omega) = T\, \mathbb{E}_{p(k)}[\log |C^{-1}_\omega(k)|]/2$ to penalize large covariances. Without this regularization, the covariance can become overly large, which makes learning degenerate. We note that $H$ already includes such a penalty via the trace term $\mathbb{E}_{p(k)}[\mathrm{Tr}(C^{-1}_\omega(k) \Sigma)]$. However, the strength of this penalty tends to be too small, since we choose $\Sigma$ to be small.
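The reward-versus-imitation trade-off mediated by $C_\omega(k)$ can be illustrated numerically in one dimension with $q_\psi(a|s, u, k) \propto \exp(r(a) - (u - a)^2 / C(k))$; the reward function and action values below are illustrative assumptions, not from the paper:

```python
import numpy as np

# Toy 1-D illustration: reward peaks at a = 2, observed noisy action u = 0.
grid = np.linspace(-10.0, 10.0, 40001)
da = grid[1] - grid[0]
u = 0.0

def posterior_mean(c):
    """Mean of q_psi(a) ∝ exp(r(a) - (u - a)^2 / c) with r(a) = -(a - 2)^2."""
    log_q = -(grid - 2.0) ** 2 - (u - grid) ** 2 / c
    q = np.exp(log_q - log_q.max())
    q /= q.sum() * da
    return (grid * q).sum() * da

# Small C(k) (high expertise): the squared error dominates, so q_psi stays
# near the observed action u. Large C(k) (low expertise): the reward dominates,
# so q_psi moves toward the reward peak at a = 2.
expert_mean = posterior_mean(c=0.01)
amateur_mean = posterior_mean(c=100.0)
print(expert_mean, amateur_mean)
```

This matches the discussion above: for high-expertise demonstrators $q_\psi$ trusts the recorded action, while for low-expertise demonstrators it relies on the learned reward instead.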
4.4 Importance sampling for data-efficient reward learning

To improve the convergence rate of VILD when updating the reward parameter $\phi$, we use importance sampling (IS). Specifically, by analyzing the gradient $\nabla_\phi H = \nabla_\phi \{\mathbb{E}_{p_d(s_{1:T}, u_{1:T}|k) p(k)}[\sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t, u_t, k)}[r_\phi(s_t, a_t)]] - \mathbb{E}_{\tilde{q}_\theta(s_{1:T}, a_{1:T})}[\sum_{t=1}^{T} r_\phi(s_t, a_t)]\}$, we can see that the reward function is updated to maximize the expected cumulative rewards obtained by the demonstrators and $q_\psi$, while minimizing the expected cumulative rewards obtained by $q_\theta$. However, low-quality demonstrations often have low reward values. For this reason, stochastic gradients estimated from these demonstrations tend to be uninformative, which leads to slow convergence and poor data-efficiency.

To avoid estimating such uninformative gradients, we use IS to estimate the gradients using high-quality demonstrations, which are sampled with high probability. Briefly, IS is a technique for estimating an expectation over one distribution using samples from a different distribution (Robert & Casella, 2005). For VILD, we propose to sample $k$ from the distribution $\tilde{p}(k) = z_k / \sum_{k'=1}^{K} z_{k'}$, where $z_k = \|\mathrm{vec}(C^{-1}_\omega(k))\|$. This distribution assigns high probabilities to demonstrators with a high estimated level of expertise. With this distribution, the estimated gradients tend to be more informative, which leads to faster convergence. To reduce the sampling bias, we use a truncated importance weight $w(k) = \min(p(k)/\tilde{p}(k), 1)$ (Ionides, 2008).

Algorithm 1  VILD: Variational Imitation Learning with Diverse-quality demonstrations
  Input: diverse-quality demonstrations $\mathcal{D}_d = \{(s_{1:T}, u_{1:T}, k)_n\}_{n=1}^{N}$ and a replay buffer $\mathcal{B} = \emptyset$.
  while not converged do
    while $|\mathcal{B}| < B$ for batch size $B$ do  (collect samples from $\tilde{q}_\theta(s_{1:T}, a_{1:T})$)
      Sample $a_t \sim q_\theta(a_t|s_t)$ and $\epsilon_t \sim \mathcal{N}(\epsilon_t|0, \Sigma)$.
      Execute $a_t + \epsilon_t$ in the environment and observe the next state $s_{t+1} \sim p(s_{t+1}|s_t, a_t + \epsilon_t)$.
      Include $(s_t, a_t, s_{t+1})$ in the replay buffer $\mathcal{B}$ and set $t \leftarrow t + 1$.
    Update $q_\psi$ by an estimate of $\nabla_\psi H(\phi, \omega, \psi, \theta)$.
    Update $p_\omega$ by an estimate of $\nabla_\omega H(\phi, \omega, \psi, \theta) + \nabla_\omega L(\omega)$.
    Update $r_\phi$ by an estimate of $\nabla_\phi H_{\mathrm{IS}}(\phi, \omega, \psi, \theta)$.
    Update $q_\theta$ by an RL method (e.g., TRPO or SAC) with reward function $r_\phi$.
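The sampling distribution $\tilde{p}(k)$ and truncated weights $w(k)$ used in Algorithm 1 can be sketched as follows; the precision values standing in for $\mathrm{vec}(C^{-1}_\omega(k))$ are illustrative assumptions:

```python
import numpy as np

# Estimated precisions per demonstrator (illustrative stand-ins for C_omega^{-1}(k)).
C_inv = np.array([100.0, 10.0, 1.0, 0.1])

z = np.abs(C_inv)                              # z_k = ||vec(C_omega^{-1}(k))||
p_tilde = z / z.sum()                          # ~p(k): favors high-expertise demonstrators
p = np.full(len(z), 1.0 / len(z))              # p(k): assumed uniform
w = np.minimum(p / p_tilde, 1.0)               # truncated weight w(k) = min(p(k)/~p(k), 1)

rng = np.random.default_rng(0)
ks = rng.choice(len(z), size=10000, p=p_tilde)  # demonstrators sampled for the gradient
print(p_tilde, w)
```

With these values, the most expert-like demonstrator dominates the sampling, and truncation caps the weights of over-sampled demonstrators at one, which reduces the variance of the IS gradient at the cost of a small bias.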
The distribution $\tilde{p}(k)$ and the importance weight $w(k)$ lead to the IS gradient $\nabla_\phi H_{\mathrm{IS}} = \nabla_\phi \{\mathbb{E}_{p_d(s_{1:T}, u_{1:T}|k) \tilde{p}(k)}[w(k) \sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t, u_t, k)}[r_\phi(s_t, a_t)]] - \mathbb{E}_{\tilde{q}_\theta(s_{1:T}, a_{1:T})}[\sum_{t=1}^{T} r_\phi(s_t, a_t)]\}$. Computing the importance weight requires $p(k)$, which can be estimated accurately since $k$ is a discrete random variable. For simplicity, we assume that $p(k)$ is a uniform distribution. A pseudo-code of VILD with IS is given in Algorithm 1, and more details of our implementation are given in Appendix B.

5 Experiments

In this section, we experimentally evaluate the performance of VILD (with and without IS) on Mujoco tasks from OpenAI Gym (Brockman et al., 2016). Performance is evaluated using cumulative ground-truth rewards along trajectories (i.e., higher is better), computed using 10 test trajectories generated by the learned policies (i.e., $q_\theta(a_t|s_t)$). We repeat each experiment for 5 trials with different random seeds and report the mean and standard error.

Baselines & data generation.
We compare VILD against GAIL (Ho & Ermon, 2016), AIRL (Fu et al., 2018), VAIL (Peng et al., 2019), MaxEnt-IRL (Ziebart et al., 2010), and InfoGAIL (Li et al., 2017). These are online IL methods which collect transition samples to learn policies. We use trust-region policy optimization (TRPO) (Schulman et al., 2015) to update policies, except for the Humanoid task, where we use soft actor-critic (SAC) (Haarnoja et al., 2018). To generate demonstrations from $\pi^\star$ (pre-trained by TRPO) according to Figure 1(b), we use two types of noisy policy $p(u_t|s_t, a_t, k)$: a Gaussian noisy policy $\mathcal{N}(u_t|a_t, \sigma_k I)$ and a time-signal-dependent (TSD) noisy policy $\mathcal{N}(u_t|a_t, \mathrm{diag}(b_k(t) \times \|a_t\|))$, where $b_k(t)$ is sampled from a noise process. We use 10 demonstrators with different $\sigma_k$ and noise processes for $b_k(t)$. Notice that for TSD, the noise variance depends on the time and the magnitude of actions; this characteristic has been observed in human motor control (van Beers et al., 2004). More details of data generation are given in Appendix C.

Results against online IL methods.
Figure 2 shows learning curves of VILD and existing methods against the number of transition samples in HalfCheetah and Ant, whereas Table 1 reports the performance achieved in the last 100 update iterations. We can see that VILD with IS outperforms existing methods in terms of both data-efficiency and final performance, i.e., VILD with IS learns better policies using fewer transition samples. VILD without IS tends to outperform existing methods in terms of the final performance. However, it is less data-efficient than VILD with IS, except on Humanoid with the Gaussian noisy policy, where VILD without IS achieves better final performance than VILD with IS. We conjecture that this is because IS slightly biases gradient estimation, which

Learning curves of other tasks are given in Appendix D.

[Figure 2: learning curves of cumulative rewards on HalfCheetah (TRPO) and Ant (TRPO) for VILD (with IS), VILD (without IS), AIRL, GAIL, VAIL, MaxEnt-IRL, and InfoGAIL. (a) Performance when demonstrations are generated using the Gaussian noisy policy. (b) Performance when demonstrations are generated using the TSD noisy policy.]
Figure 2: Performance averaged over 5 trials in terms of the mean and standard error. Demonstrations are generated by 10 demonstrators using (a) Gaussian and (b) TSD noisy policies. Horizontal dotted lines indicate the performance of individual demonstrators. IS denotes importance sampling.

may have a negative effect on the performance. Nonetheless, the overall good performance of VILD with IS suggests that it is an effective method for handling diverse-quality demonstrations.

On the contrary, existing methods perform poorly overall. We found that InfoGAIL, which learns a context-dependent policy, can achieve good performance when the policy is conditioned on specific contexts. However, its performance is quite poor on average when using contexts sampled from a (uniform) prior distribution. These results support our conjecture that existing methods are not suitable for diverse-quality demonstrations when the level of demonstrators' expertise is unknown.

It can be seen that VILD without IS performs better for the Gaussian noisy policy than for the TSD noisy policy. This is because the model of VILD is correctly specified for the Gaussian noisy policy but misspecified for the TSD noisy policy; a misspecified model indeed reduces performance. Nonetheless, VILD with IS still performs well for both types of noisy policy. This is perhaps because the negative effects of a misspecified model are not too severe for learning the expertise parameters, which are required to compute $\tilde{p}(k)$. We also conduct the following evaluations. Due to space limitation, figures are given in Appendix D.

Results against offline IL methods.
We compare VILD against offline IL methods based on supervised learning, namely behavior cloning (BC) (Pomerleau, 1988), Co-teaching, which is based on a noisy-label learning method (Han et al., 2018), and BC from diverse-quality demonstrations (BC-D), which optimizes the naive model described in Section 4.1. Results in Figure 5 show that these methods perform worse than VILD overall; BC performs the worst since it severely suffers from both the compounding error and low-quality demonstrations. BC-D and Co-teaching are quite robust against low-quality demonstrations, but they perform poorly due to the issue of compounding error.
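For concreteness, behavior cloning reduces imitation learning to supervised regression from states to demonstrated actions. The following is a minimal linear-policy sketch (an illustration only, not the neural-network setup used in the experiments):

```python
import numpy as np

def behavior_cloning_linear(states, actions, reg=1e-3):
    """Minimal behavior cloning: fit a ridge-regularized linear policy
    a ~ s @ W by least-squares regression on demonstration pairs.
    This is pure supervised learning, hence it inherits the compounding
    error of BC discussed in the text."""
    S = states
    return np.linalg.solve(S.T @ S + reg * np.eye(S.shape[1]), S.T @ actions)

rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 2))     # hypothetical linear expert policy
S = rng.normal(size=(500, 4))        # demonstration states
A = S @ W_true                       # noise-free expert actions
W_hat = behavior_cloning_linear(S, A)
```

On clean, high-quality demonstrations the regression recovers the expert policy; with diverse-quality data the same objective fits the noise as well, which is the failure mode discussed above.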
Accuracy of estimated expertise parameter.
To evaluate the accuracy of the estimated expertise parameters, we compare the ground-truth value of $\sigma_k$ under the Gaussian noisy policy against the learned covariance $C_\omega(k)$. Figure 6 shows that VILD learns an accurate ranking of demonstrators' expertise. The values of these parameters are also quite accurate compared to the ground truth, except for demonstrators with low levels of expertise. A reason for this phenomenon is that low-quality demonstrations are highly dissimilar, which makes learning the expertise more challenging.

Conclusion

In this paper, we explored a practical setting of IL where demonstrations have diverse quality. We showed the deficiency of existing methods, and proposed a robust method called VILD which learns both the reward function and the level of demonstrators' expertise by using the variational approach. Empirical results demonstrated that our work enables scalable and data-efficient IL under this practical setting. In the future, we will explore approaches other than the variational approach to efficiently estimate the parameters of the proposed model.

Table 1: Performance in the last 100 iterations in terms of the mean and standard error of cumulative rewards (higher is better). (G) denotes the Gaussian noisy policy and (TSD) denotes the time-signal-dependent noisy policy. Boldfaces indicate the best and comparable methods according to a t-test with p-value 0.01. The performance of VAIL is similar to that of GAIL and is omitted.
Task             VILD (IS)   VILD (w/o IS)   AIRL   GAIL   MaxEnt-IRL   InfoGAIL
HalfCheetah (G)
Ant (G)
Humanoid (TSD)   203 (31)
References
Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In
International Conference on Learning Representations (ICLR) , 2017.Dana Angluin and Philip Laird. Learning from noisy examples.
Machine Learning, 2(4):343–370, 1988. ISSN 0885-6125. Devansh Arpit, Stanislaw K. Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In
ICML , volume 70 of
Proceedings of Machine Learning Research, pp. 233–242. PMLR, 2017. Julien Audiffren, Michal Valko, Alessandro Lazaric, and Mohammad Ghavamzadeh. Maximum entropy semi-supervised inverse reinforcement learning. In
IJCAI, pp. 3315–3321. AAAI Press, 2015. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym.
CoRR, abs/1606.01540, 2016. Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In
Proceedings of the 36th International Conference on Machine Learning, ICML, pp. 783–792, 2019. Chelsea Finn, Paul F. Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models.
CoRR , abs/1611.03852,2016a. URL http://arxiv.org/abs/1611.03852 .Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control viapolicy optimization. In
Proceedings of the 33nd International Conference on Machine Learning , pp.49–58, 2016b. URL http://jmlr.org/proceedings/papers/v48/finn16.html .Lex Fridman, Daniel E. Brown, Michael Glazer, William Angell, Spencer Dodd, Benedikt Jenik, JackTerwilliger, Julia Kindelsberger, Li Ding, Sean Seaman, Hillary Abraham, Alea Mehler, AndrewSipperley, Anthony Pettinato, Bobbie Seppelt, Linda Angell, Bruce Mehler, and Bryan Reimer. MITautonomous vehicle technology study: Large-scale deep learning based analysis of driver behavior andinteraction with automation.
CoRR, abs/1711.06976, 2017. Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. 2018. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative Adversarial Nets. In
Advances in Neural Information Processing Systems 27, pp. 2672–2680, 2014. Shixiang (Shane) Gu, Timothy Lillicrap, Richard E. Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.),
Advances in Neural Information Processing Systems 30 , pp.3846–3855. Curran Associates, Inc., 2017.Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. ImprovedTraining of Wasserstein GANs. In
Advances in Neural Information Processing Systems 30 , pp. 5769–5779,2017.Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximumentropy deep reinforcement learning with a stochastic actor. In
Proceedings of the 35th InternationalConference on Machine Learning, ICML , pp. 1856–1865, 2018.Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and MasashiSugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In
Advances in Neural Information Processing Systems 31 , pp. 8536–8546, 2018.Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav S. Sukhatme, and Joseph J. Lim. Multi-modalimitation learning from unstructured demonstrations using generative adversarial nets. In
Advances inNeural Information Processing Systems 30 , pp. 1235–1245, 2017.Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning. In
Advances in NeuralInformation Processing Systems 29 , pp. 4565–4573, 2016.Matthew D. Hoffman and David M. Blei. Stochastic structured variational inference. In
Proceedings ofthe Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS , 2015.Edward L Ionides. Truncated importance sampling.
Journal of Computational and Graphical Statistics ,17(2):295–311, 2008.E. T. Jaynes. Information theory and statistical mechanics.
Physical Review , 106, 1957.Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction tovariational methods for graphical models.
Machine Learning, 37(2):183–233, November 1999. ISSN 0885-6125. Ashish Khetan, Zachary C. Lipton, and Animashree Anandkumar. Learning from noisy singly-labeled data, 2018. Kyungjae Lee, Sungjoon Choi, and Songhwai Oh. Inverse reinforcement learning with leveraged Gaussian processes. In
IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS , pp.3907–3912, 2016.Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review.
CoRR ,abs/1805.00909, 2018. URL http://arxiv.org/abs/1805.00909 .Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end Training of Deep VisuomotorPolicies.
Journal of Machine Learning Research , 17(1):1334–1373, January 2016. ISSN 1532-4435.Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visualdemonstrations. In
Advances in Neural Information Processing Systems 30: Annual Conference onNeural Information Processing Systems 2017 , pp. 3815–3825, 2017.Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, JohnEmmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. ROBOTURK: A crowdsourcingplatform for robotic skill learning through imitation. In
CoRL , volume 87 of
Proceedings of MachineLearning Research , pp. 879–893. PMLR, 2018.Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare,Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie,Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, andDemis Hassabis. Human-Level Control Through Deep Reinforcement Learning.
Nature , 518(7540):529–533, February 2015. ISSN 00280836.Kevin P. Murphy.
Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, Mass., 2013. Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels, 2013. URL http://papers.nips.cc/paper/5073-learning-with-noisy-labels.pdf . Andrew Y. Ng and Stuart J. Russell. Algorithms for Inverse Reinforcement Learning. In
Proceedings ofthe 17th International Conference on Machine Learning , pp. 663–670, 2000.Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. Analgorithmic perspective on imitation learning.
Foundations and Trends in Robotics , 7(1-2):1–179, 2018.Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminatorbottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. In
International Conference on Learning Representations (ICLR) , 2019.Dean Pomerleau. ALVINN: an autonomous land vehicle in a neural network. In
Advances in NeuralInformation Processing Systems 1, [NIPS Conference, Denver, Colorado, USA, 1988] , pp. 305–313,1988.Martin L. Puterman.
Markov Decision Processes: Discrete Stochastic Dynamic Programming . John Wiley& Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0-471-61977-9.Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In
Proceedingsof the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS , pp.814–822, 2014.Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni,and Linda Moy. Learning from crowds.
Journal of Machine Learning Research , 11:1297–1322, 2010.Christian P. Robert and George Casella.
Monte Carlo Statistical Methods . Springer-Verlag, Berlin,Heidelberg, 2005. ISBN 0387212396.Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh andMike Titterington (eds.),
Proceedings of the 13th International Conference on Artificial Intelligence andStatistics, AISTATS , volume 9 of
Proceedings of Machine Learning Research , pp. 661–668, Chia LagunaResort, Sardinia, Italy, 13–15 May 2010. PMLR.Stuart Russell. Learning agents for uncertain environments (extended abstract). In
Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, pp. 101–103. ACM, 1998. ISBN 1-58113-057-0. Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999. Yannick Schroecker, Mel Vecerik, and Jon Scholz. Generative predecessor models for sample-efficient imitation learning. In
International Conference on Learning Representations , 2019. URL https://openreview.net/forum?id=SkeVsiAcYm .John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust Region PolicyOptimization. In
Proceedings of the 32nd International Conference on Machine Learning, July 6-11,2015, Lille, France , 2015.Kyriacos Shiarlis, João V. Messias, and Shimon Whiteson. Inverse Reinforcement Learning from Failure.In
Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems , pp.1060–1068, 2016.David Silver, J. Andrew Bagnell, and Anthony Stentz. Learning autonomous driving styles and maneu-vers from expert demonstration. In
Experimental Robotics - The 13th International Symposium onExperimental Robotics, ISER , pp. 371–386, 2012.David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman,Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach,Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the Game of Go with Deep NeuralNetworks and Tree Search.
Nature, 529(7587):484–489, 2016. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the Game of Go Without Human Knowledge.
Nature , 550(7676):354–359, October 2017. ISSN 00280836.Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning - an Introduction . Adaptive computationand machine learning. MIT Press, 1998.Umar Syed, Michael H. Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming.In
Proceedings of the 25th International Conference on Machine Learning , pp. 1032–1039, 2008. doi:10.1145/1390156.1390286.Istvan Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight explorationcomplexity bounds. In
Proceedings of the 27th International Conference on Machine Learning ICML ,pp. 1031–1038, 2010.G. E. Uhlenbeck and L. S. Ornstein. On the theory of the brownian motion.
Physical Review, 36:823–841, 1930. doi: 10.1103/PhysRev.36.823. Robert J. van Beers, Patrick Haggard, and Daniel M. Wolpert. The role of execution noise in movement variability.
Journal of Neurophysiology , 91(2):1050–1063, 2004. doi: 10.1152/jn.00652.2003. URL https://doi.org/10.1152/jn.00652.2003 . PMID: 14561687.Ziyu Wang, Josh Merel, Scott E. Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. Robustimitation of diverse behaviors. In
Advances in Neural Information Processing Systems 30 , pp. 5320–5329,2017.Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imitationlearning from imperfect demonstration. In
Proceedings of the 36th International Conference on MachineLearning, ICML , 2019.Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling Interaction via the Principle ofMaximum Causal Entropy. In
Proceedings of the 27th International Conference on Machine Learning,June 21-24, 2010, Haifa, Israel , 2010.
A Derivations
This section derives the lower bounds of $f(\phi,\omega)$ and $g(\phi,\omega)$ presented in the paper. We also derive the objective function $H(\phi,\omega,\psi,\theta)$ of VILD.

A.1 Lower-bound of f

Let $l_{\phi,\omega}(s_t,a_t,u_t,k) = r_\phi(s_t,a_t) + \log p_\omega(u_t|s_t,a_t,k)$. We have that $f(\phi,\omega) = \mathbb{E}_{p_d(s_T,u_T|k)p(k)}\big[\sum_{t=1}^{T} f_t(\phi,\omega)\big]$, where $f_t(\phi,\omega) = \log \int_{\mathcal{A}} \exp(l_{\phi,\omega}(s_t,a_t,u_t,k)) \,\mathrm{d}a_t$. By using a variational distribution $q_\psi(a_t|s_t,u_t,k)$ with parameter $\psi$, we can bound $f_t(\phi,\omega)$ from below via Jensen's inequality as follows:
$$f_t(\phi,\omega) = \log\Big( \int_{\mathcal{A}} \exp(l_{\phi,\omega}(s_t,a_t,u_t,k)) \frac{q_\psi(a_t|s_t,u_t,k)}{q_\psi(a_t|s_t,u_t,k)} \,\mathrm{d}a_t \Big) \geq \int_{\mathcal{A}} q_\psi(a_t|s_t,u_t,k) \log\Big( \frac{\exp(l_{\phi,\omega}(s_t,a_t,u_t,k))}{q_\psi(a_t|s_t,u_t,k)} \Big) \mathrm{d}a_t = \mathbb{E}_{q_\psi(a_t|s_t,u_t,k)}\big[ l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_\psi(a_t|s_t,u_t,k) \big] = F_t(\phi,\omega,\psi). \quad (6)$$
Then, by the linearity of expectation, we obtain the lower bound of $f(\phi,\omega)$ as follows:
$$f(\phi,\omega) \geq \mathbb{E}_{p_d(s_T,u_T|k)p(k)}\Big[ \sum_{t=1}^{T} F_t(\phi,\omega,\psi) \Big] = \mathbb{E}_{p_d(s_T,u_T|k)p(k)}\Big[ \sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t,u_t,k)}\big[ l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_\psi(a_t|s_t,u_t,k) \big] \Big] = F(\phi,\omega,\psi). \quad (7)$$
To verify that $f(\phi,\omega) = \max_\psi F(\phi,\omega,\psi)$, we maximize $F_t(\phi,\omega,\psi)$ w.r.t. $q_\psi$ under the constraint that $q_\psi$ is a valid probability density, i.e., $q_\psi(a_t|s_t,u_t,k) > 0$ and $\int_{\mathcal{A}} q_\psi(a_t|s_t,u_t,k) \,\mathrm{d}a_t = 1$. Setting the derivative of the Lagrangian of $F_t(\phi,\omega,\psi)$ w.r.t. $q_\psi$ to zero, we obtain
$$q_\psi(a_t|s_t,u_t,k) = \exp(l_{\phi,\omega}(s_t,a_t,u_t,k) - 1 + \lambda) = \frac{\exp(l_{\phi,\omega}(s_t,a_t,u_t,k))}{\int_{\mathcal{A}} \exp(l_{\phi,\omega}(s_t,a_t,u_t,k)) \,\mathrm{d}a_t},$$
where $\lambda$ is the Lagrange multiplier and the last equality follows from the constraint $\int_{\mathcal{A}} q_\psi(a_t|s_t,u_t,k) \,\mathrm{d}a_t = 1$.
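As a numerical sanity check (not part of the original derivation), the Jensen bound in Eq. (6) and its tightness at the optimal variational distribution can be verified in a discrete-action analogue, where the integral over $\mathcal{A}$ becomes a sum:

```python
import numpy as np

def log_partition(l):
    """Discrete analogue of f_t = log int exp(l) da_t: log-sum-exp over actions."""
    return float(np.log(np.sum(np.exp(l))))

def variational_bound(l, q):
    """F_t(q) = E_q[l(a) - log q(a)], the Jensen lower bound of Eq. (6)."""
    return float(np.sum(q * (l - np.log(q))))

l = np.array([0.3, -1.2, 2.0, 0.5])        # toy values of l over 4 actions
q_uniform = np.full(4, 0.25)               # an arbitrary valid q
q_star = np.exp(l) / np.sum(np.exp(l))     # the optimal q derived above
```

For any valid `q` the bound holds, and plugging in `q_star` makes it tight, mirroring $f_t = \max_\psi F_t$.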
To show that this is indeed the maximizer, we substitute $q_{\psi^\star}(a_t|s_t,u_t,k) = \frac{\exp(l_{\phi,\omega}(s_t,a_t,u_t,k))}{\int_{\mathcal{A}} \exp(l_{\phi,\omega}(s_t,a_t,u_t,k))\,\mathrm{d}a_t}$ into $F_t(\phi,\omega,\psi)$:
$$F_t(\phi,\omega,\psi^\star) = \mathbb{E}_{q_{\psi^\star}(a_t|s_t,u_t,k)}\big[ l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_{\psi^\star}(a_t|s_t,u_t,k) \big] = \log\Big( \int_{\mathcal{A}} \exp(l_{\phi,\omega}(s_t,a_t,u_t,k)) \,\mathrm{d}a_t \Big).$$
This equality verifies that $f_t(\phi,\omega) = \max_\psi F_t(\phi,\omega,\psi)$. Finally, by the linearity of expectation, we have that $f(\phi,\omega) = \max_\psi F(\phi,\omega,\psi)$.

A.2 Lower-bound of g

Next, we derive the lower bound of $g(\phi,\omega)$ presented in the paper. We first derive a trivial lower bound using a general variational distribution over trajectories and reveal its issues. Then, we derive the lower bound presented in the paper by using a structured variational distribution. Recall that the function $g(\phi,\omega) = \log Z_{\phi,\omega}$ is
$$g(\phi,\omega) = \log\Big( \sum_{k=1}^{K} p(k) \int\!\!\cdots\!\!\int_{(\mathcal{S}\times\mathcal{A}\times\mathcal{A})^T} p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t,u_t) \exp(l_{\phi,\omega}(s_t,a_t,u_t,k)) \,\mathrm{d}s_T \,\mathrm{d}u_T \,\mathrm{d}a_T \Big).$$

Lower-bound via a variational distribution.
A lower bound of $g$ can be obtained by using a variational distribution $\bar{q}_\beta(s_T,u_T,a_T,k)$ with parameter $\beta$. We note that this variational distribution allows any dependency between the random variables $s_T$, $u_T$, $a_T$, and $k$. By using this distribution, we have the lower bound
$$g(\phi,\omega) \geq \mathbb{E}_{\bar{q}_\beta(s_T,u_T,a_T,k)}\Big[ \log p(k)p(s_1) + \sum_{t=1}^{T} \big\{ \log p(s_{t+1}|s_t,u_t) + l_{\phi,\omega}(s_t,a_t,u_t,k) \big\} - \log \bar{q}_\beta(s_T,u_T,a_T,k) \Big] = \bar{G}(\phi,\omega,\beta). \quad (8)$$
The main issue with this lower bound is that $\bar{G}(\phi,\omega,\beta)$ can be computed or approximated only when we have access to the transition probability $p(s_{t+1}|s_t,u_t)$. In many practical tasks, the transition probability is unknown and needs to be approximated. However, approximating the transition probability for large state and action spaces is known to be highly challenging (Szita & Szepesvári, 2010). For these reasons, this lower bound is not suitable for our method.

Lower-bound via a structured variational distribution.
To avoid the above issue, we use the structured variational approach (Hoffman & Blei, 2015), where the key idea is to pre-define conditional dependencies to ease computation. Specifically, we use a variational distribution $q_\theta(a_t,u_t|s_t,k)$ with parameter $\theta$ and define dependencies between states according to the transition probability of the MDP. With this variational distribution, we lower-bound $g$ as follows:
$$g(\phi,\omega) \geq \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_\theta(a_t,u_t|s_t,k) \Big] = G(\phi,\omega,\theta), \quad (9)$$
where $\tilde{q}_\theta(s_T,u_T,a_T,k) = p(k)p(s_1)\prod_{t=1}^{T} p(s_{t+1}|s_t,u_t) q_\theta(a_t,u_t|s_t,k)$. The optimal variational distribution $q_{\theta^\star}(a_t,u_t|s_t,k)$ can be found by maximizing $G(\phi,\omega,\theta)$ w.r.t. $q_\theta$. Solving this maximization problem is identical to solving a maximum entropy RL (MaxEnt-RL) problem (Ziebart et al., 2010) for an MDP defined by a tuple $\mathcal{M} = (\mathcal{S}\times\mathbb{N}_+, \mathcal{A}\times\mathcal{A}, p(s_{t+1}|s_t,u_t)\mathbb{I}_{k_t=k_{t+1}}, p(s_1)p(k_1), l_{\phi,\omega})$. Specifically, this MDP is defined with a state variable $(s_t,k_t) \in \mathcal{S}\times\mathbb{N}_+$, an action variable $(a_t,u_t) \in \mathcal{A}\times\mathcal{A}$, a transition probability density $p(s_{t+1}|s_t,u_t)\mathbb{I}_{k_t=k_{t+1}}$, an initial state density $p(s_1)p(k_1)$, and a reward function $l_{\phi,\omega}(s_t,a_t,u_t,k)$. Here, $\mathbb{I}_{a=b}$ is the indicator function, which equals 1 if $a=b$ and 0 otherwise. By adopting the optimality results of MaxEnt-RL (Ziebart et al., 2010; Levine, 2018), we have $g(\phi,\omega) = \max_\theta G(\phi,\omega,\theta)$, where the optimal variational distribution is
$$q_{\theta^\star}(a_t,u_t|s_t,k) = \exp\big( Q(s_t,k,a_t,u_t) - V(s_t,k) \big).$$
(10)

The functions $Q$ and $V$ are soft value functions defined as
$$Q(s_t,k,a_t,u_t) = l_{\phi,\omega}(s_t,a_t,u_t,k) + \mathbb{E}_{p(s_{t+1}|s_t,u_t)}\big[ V(s_{t+1},k) \big], \quad (11)$$
$$V(s_t,k) = \log \iint_{\mathcal{A}\times\mathcal{A}} \exp\big( Q(s_t,k,a_t,u_t) \big) \,\mathrm{d}a_t \,\mathrm{d}u_t. \quad (12)$$

A.3 Objective function H of VILD

This section derives the objective function $H(\phi,\omega,\psi,\theta)$ from $F(\phi,\omega,\psi) - G(\phi,\omega,\theta)$. Specifically, we substitute the models $p_\omega(u_t|s_t,a_t,k) = \mathcal{N}(u_t|a_t, C_\omega(k))$ and $q_\theta(a_t,u_t|s_t,k) = q_\theta(a_t|s_t)\mathcal{N}(u_t|a_t,\Sigma)$. We also give an example of using a Laplace distribution for $p_\omega(u_t|s_t,a_t,k)$ instead of the Gaussian distribution.

First, we substitute $q_\theta(a_t,u_t|s_t,k) = q_\theta(a_t|s_t)\mathcal{N}(u_t|a_t,\Sigma)$ into $G$:
$$G(\phi,\omega,\theta) = \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} l_{\phi,\omega}(s_t,a_t,u_t,k) - \log \mathcal{N}(u_t|a_t,\Sigma) - \log q_\theta(a_t|s_t) \Big] = \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} l_{\phi,\omega}(s_t,a_t,u_t,k) + \tfrac{1}{2}\| u_t - a_t \|^2_{\Sigma^{-1}} - \log q_\theta(a_t|s_t) \Big] + c,$$
where $c$ is a constant corresponding to the log-normalization term of the Gaussian distribution. Next, by using the reparameterization trick, we rewrite $\tilde{q}_\theta(s_T,u_T,a_T,k)$ as
$$\tilde{q}_\theta(s_T,u_T,a_T,k) = p(k)p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t + \Sigma^{1/2}\epsilon_t)\, \mathcal{N}(\epsilon_t|0,I)\, q_\theta(a_t|s_t),$$
where we use $u_t = a_t + \Sigma^{1/2}\epsilon_t$ with $\epsilon_t \sim \mathcal{N}(\epsilon_t|0,I)$. With this, the expectation of $\sum_{t=1}^{T} \tfrac{1}{2}\| u_t - a_t \|^2_{\Sigma^{-1}}$ over $\tilde{q}_\theta(s_T,u_T,a_T,k)$ can be written as
$$\mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} \tfrac{1}{2}\| \Sigma^{1/2}\epsilon_t \|^2_{\Sigma^{-1}} \Big] = \frac{T d_a}{2},$$
and $G$ can be expressed as
$$G(\phi,\omega,\theta) = \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_\theta(a_t|s_t) \Big] + c + \frac{T d_a}{2}.$$
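The Gaussian quadratic-form identities used in this appendix, $\mathbb{E}_{\mathcal{N}(\epsilon|0,I)}\big[\|\Sigma^{1/2}\epsilon\|^2_{\Sigma^{-1}}\big] = d_a$ per step and, more generally, $\mathbb{E}_{\mathcal{N}(\epsilon|0,I)}\big[\|\Sigma^{1/2}\epsilon\|^2_{C^{-1}}\big] = \mathrm{Tr}(C^{-1}\Sigma)$, can be verified by Monte Carlo with toy diagonal matrices (the matrices below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.diag([0.1, 0.2, 0.3])           # toy diagonal Sigma (d_a = 3)
C = np.diag([0.5, 1.0, 2.0])               # toy diagonal C_omega(k)
C_inv = np.linalg.inv(C)

eps = rng.normal(size=(200000, 3))         # eps ~ N(0, I)
x = eps @ np.sqrt(Sigma)                   # x = Sigma^{1/2} eps (Sigma diagonal)
# ||x||^2_{A} denotes x^T A x; estimate its expectation over eps.
mc_self = np.einsum('ni,ij,nj->n', x, np.linalg.inv(Sigma), x).mean()
mc_gen = np.einsum('ni,ij,nj->n', x, C_inv, x).mean()
exact_gen = float(np.trace(C_inv @ Sigma))  # Tr(C^{-1} Sigma) = 0.55 here
```

The first estimate concentrates around $d_a = 3$ and the second around $\mathrm{Tr}(C^{-1}\Sigma)$, matching the constants that appear in the derivation.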
By ignoring the constant, the optimization problem $\max_{\phi,\omega,\psi} \min_{\theta} F(\phi,\omega,\psi) - G(\phi,\omega,\theta)$ is equivalent to
$$\max_{\phi,\omega,\psi} \min_{\theta} \; \mathbb{E}_{p_d(s_T,u_T,k)}\Big[ \sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t,u_t,k)}\big[ l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_\psi(a_t|s_t,u_t,k) \big] \Big] - \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_\theta(a_t|s_t) \Big]. \quad (13)$$
Our next step is to substitute $p_\omega(u_t|s_t,a_t,k)$ by our choice of model. First, let us consider a Gaussian distribution $p_\omega(u_t|s_t,a_t,k) = \mathcal{N}(u_t|a_t, C_\omega(s_t,k))$, where the covariance depends on the state. With this model, the second term in Eq. (13) is given by
$$\mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_\theta(a_t|s_t) \Big] = \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} r_\phi(s_t,a_t) + \log \mathcal{N}(u_t|a_t, C_\omega(s_t,k)) - \log q_\theta(a_t|s_t) \Big] = \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} r_\phi(s_t,a_t) - \tfrac{1}{2}\| u_t - a_t \|^2_{C_\omega^{-1}(s_t,k)} - \tfrac{1}{2}\log|C_\omega(s_t,k)| - \log q_\theta(a_t|s_t) \Big] + c,$$
where $c = -\frac{T d_a}{2}\log 2\pi$ is a constant. By using the reparameterization trick, we write the expectation of $\sum_{t=1}^{T} \tfrac{1}{2}\| u_t - a_t \|^2_{C_\omega^{-1}(s_t,k)}$ as follows:
$$\mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} \tfrac{1}{2}\| u_t - a_t \|^2_{C_\omega^{-1}(s_t,k)} \Big] = \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} \tfrac{1}{2}\| \Sigma^{1/2}\epsilon_t \|^2_{C_\omega^{-1}(s_t,k)} \Big].$$
Using this equality, the second term in Eq. (13) is given by
$$\mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} r_\phi(s_t,a_t) - \log q_\theta(a_t|s_t) - \tfrac{1}{2}\big( \| \Sigma^{1/2}\epsilon_t \|^2_{C_\omega^{-1}(s_t,k)} + \log|C_\omega(s_t,k)| \big) \Big]. \quad (14)$$
Maximizing this quantity w.r.t. $\theta$ has the following implication: $q_\theta(a_t|s_t)$ is a maximum entropy policy which maximizes the expected cumulative rewards while avoiding states that are difficult for the demonstrators. Specifically, a large value of $\mathbb{E}_{p(k)}[\log|C_\omega(s_t,k)|]$ indicates that, according to our estimated covariance, the demonstrators have a low level of expertise for state $s_t$ on average. In other words, it is difficult for the demonstrators on average to accurately execute optimal actions in this state. Since the policy $q_\theta(a_t|s_t)$ should minimize $\mathbb{E}_{p(k)}[\log|C_\omega(s_t,k)|]$, the policy should avoid states that are difficult for the demonstrators. We expect that this property may improve the exploration-exploitation trade-off. Still, such a property is not in the scope of this paper, and we leave it for future work.

In this paper, we assume that the covariance does not depend on the state: $C_\omega(s_t,k) = C_\omega(k)$. This model enables us to simplify Eq.
(14) as follows:
$$\mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} r_\phi(s_t,a_t) - \log q_\theta(a_t|s_t) - \tfrac{1}{2}\big( \| \Sigma^{1/2}\epsilon_t \|^2_{C_\omega^{-1}(k)} + \log|C_\omega(k)| \big) \Big] = \mathbb{E}_{\tilde{q}_\theta(s_T,u_T,a_T,k)}\Big[ \sum_{t=1}^{T} r_\phi(s_t,a_t) - \log q_\theta(a_t|s_t) \Big] - \frac{T}{2}\mathbb{E}_{p(k)\mathcal{N}(\epsilon|0,I)}\Big[ \| \Sigma^{1/2}\epsilon \|^2_{C_\omega^{-1}(k)} + \log|C_\omega(k)| \Big] = \mathbb{E}_{\tilde{q}_\theta(s_T,a_T)}\Big[ \sum_{t=1}^{T} r_\phi(s_t,a_t) - \log q_\theta(a_t|s_t) \Big] - \frac{T}{2}\mathbb{E}_{p(k)}\big[ \mathrm{Tr}(C_\omega^{-1}(k)\Sigma) + \log|C_\omega(k)| \big],$$
where $\tilde{q}_\theta(s_T,a_T) = p(s_1)\prod_{t=1}^{T} \int p(s_{t+1}|s_t, a_t+\epsilon_t)\mathcal{N}(\epsilon_t|0,\Sigma)\,\mathrm{d}\epsilon_t\, q_\theta(a_t|s_t)$. The last line follows from the quadratic-form identity $\mathbb{E}_{\mathcal{N}(\epsilon_t|0,I)}\big[ \| \Sigma^{1/2}\epsilon_t \|^2_{C_\omega^{-1}(k)} \big] = \mathrm{Tr}(C_\omega^{-1}(k)\Sigma)$.

Next, we substitute $p_\omega(u_t|s_t,a_t,k) = \mathcal{N}(u_t|a_t,C_\omega(k))$ into the first term of Eq. (13):
$$\mathbb{E}_{p_d(s_T,u_T,k)}\Big[ \sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t,u_t,k)}\big[ l_{\phi,\omega}(s_t,a_t,u_t,k) - \log q_\psi(a_t|s_t,u_t,k) \big] \Big] = \mathbb{E}_{p_d(s_T,u_T,k)}\Big[ \sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t,u_t,k)}\big[ r_\phi(s_t,a_t) - \tfrac{1}{2}\| u_t - a_t \|^2_{C_\omega^{-1}(k)} - \tfrac{1}{2}\log|C_\omega(k)| - \log q_\psi(a_t|s_t,u_t,k) \big] \Big] - \frac{T d_a}{2}\log 2\pi. \quad (15)$$
Lastly, by ignoring constants, Eq. (13) is equivalent to $\max_{\phi,\omega,\psi}\min_{\theta} H(\phi,\omega,\psi,\theta)$, where
$$H(\phi,\omega,\psi,\theta) = \mathbb{E}_{p_d(s_T,u_T,k)}\Big[ \sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t,u_t,k)}\Big[ r_\phi(s_t,a_t) - \tfrac{1}{2}\| u_t - a_t \|^2_{C_\omega^{-1}(k)} - \log q_\psi(a_t|s_t,u_t,k) \Big] \Big] - \mathbb{E}_{\tilde{q}_\theta(s_T,a_T)}\Big[ \sum_{t=1}^{T} r_\phi(s_t,a_t) - \log q_\theta(a_t|s_t) \Big] + \frac{T}{2}\mathbb{E}_{p(k)}\big[ \mathrm{Tr}(C_\omega^{-1}(k)\Sigma) \big].$$
This concludes the derivation of VILD.

As mentioned, distributions other than the Gaussian distribution can be used for $p_\omega$. For instance, let us consider a multivariate-independent Laplace distribution: $p_\omega(u_t|s_t,a_t,k) = \prod_{d=1}^{d_a} \frac{1}{2 c_k^{(d)}} \exp\big( -\big| \frac{u_t^{(d)} - a_t^{(d)}}{c_k^{(d)}} \big| \big)$, where the division of a vector by a vector denotes element-wise division. The Laplace distribution has heavier tails than the Gaussian distribution, which makes it more suitable for modeling demonstrators who tend to execute outlier actions. By using the Laplace distribution for $p_\omega(u_t|s_t,a_t,k)$, we obtain an objective
$$H_{\mathrm{Lap}}(\phi,\omega,\psi,\theta) = \mathbb{E}_{p_d(s_T,u_T,k)}\Big[ \sum_{t=1}^{T} \mathbb{E}_{q_\psi(a_t|s_t,u_t,k)}\Big[ r_\phi(s_t,a_t) - \Big\| \frac{u_t - a_t}{c_k} \Big\|_1 - \log q_\psi(a_t|s_t,u_t,k) \Big] \Big] - \mathbb{E}_{\tilde{q}_\theta(s_T,a_T)}\Big[ \sum_{t=1}^{T} r_\phi(s_t,a_t) - \log q_\theta(a_t|s_t) \Big] + T\sqrt{\tfrac{2}{\pi}}\,\mathbb{E}_{p(k)}\Big[ \mathrm{Tr}\big( C_\omega^{-1/2}(k)\Sigma^{1/2} \big) \Big].$$
We can see that the differences between $H_{\mathrm{Lap}}$ and $H$ are the absolute error in place of the squared error and the scaling of the trace term.

B Implementation details
We implement VILD using the PyTorch deep learning framework. For all function approximators, we use neural networks with 2 hidden layers of 100 tanh units, except for the Humanoid task where we use neural networks with 2 hidden layers of 100 relu units. We optimize the parameters $\phi$, $\omega$, and $\psi$ by Adam with mini-batch size 256. To optimize the policy parameter $\theta$, we use trust-region policy optimization (TRPO) (Schulman et al., 2015) with batch size 1000, except on the Humanoid task where we use soft actor-critic (SAC) (Haarnoja et al., 2018) with mini-batch size 256. TRPO is an on-policy RL method that uses only trajectories collected by the current policy, while SAC is an off-policy RL method that uses trajectories collected by previous policies. On-policy methods are generally more stable than off-policy methods, while off-policy methods are generally more data-efficient (Gu et al., 2017). We use SAC for Humanoid mainly due to its high data-efficiency. When SAC is used, we also use trajectories collected by previous policies to approximate the expectation over the trajectory density $\tilde{q}_\theta(s_T,a_T)$.

For the distribution $p_\omega(u_t|s_t,a_t,k) = \mathcal{N}(u_t|a_t, C_\omega(k))$, we use diagonal covariances $C_\omega(k) = \mathrm{diag}(c_k)$, where $\omega = \{c_k\}_{k=1}^{K}$ with $c_k \in \mathbb{R}^{d_a}_{+}$ are parameter vectors to be learned. For the distribution $q_\psi(a_t|s_t,u_t,k)$, we use a Gaussian distribution with diagonal covariance, where the mean and the logarithm of the standard deviation are the outputs of neural networks. Since $k$ is a discrete variable, we represent $q_\psi(a_t|s_t,u_t,k)$ by neural networks that have $K$ output heads and take input vectors $(s_t,u_t)$; the $k$-th output head corresponds to (the mean and log-standard-deviation of) $q_\psi(a_t|s_t,u_t,k)$. We also pre-train the mean function of $q_\psi(a_t|s_t,u_t,k)$ by performing least-squares regression for a number of gradient steps with target value $u_t$.
This pre-training is done to obtain reasonable initial predictions. For the policy $q_\theta(a_t|s_t)$, we use a Gaussian policy with diagonal covariance, where the mean and logarithm of the standard deviation are outputs of neural networks. We use a fixed $\Sigma$ proportional to the identity matrix in experiments.

To control the exploration-exploitation trade-off, we use a fixed entropy coefficient $\alpha$ in TRPO. In SAC, we tune the value of $\alpha$ by optimization, as described in the SAC paper. Note that including $\alpha$ in VILD is equivalent to rescaling quantities in the model by $\alpha$, i.e., $\exp(r_\phi(s_t,a_t)/\alpha)$ and $(p_\omega(u_t|s_t,a_t,k))^{\alpha}$. A discount factor $0 < \gamma < 1$ may be included similarly, and we use a fixed value of $\gamma$ in experiments.

For all methods, we regularize the reward/discriminator function by the gradient penalty (Gulrajani et al., 2017), since it was previously shown to improve the performance of generative adversarial learning methods. For methods that learn a reward function, namely VILD, AIRL, and MaxEnt-IRL, we apply a sigmoid function to the output of the reward function to control its bounds. We found that without controlling the bounds, reward values can be highly negative in the early stage of learning, which makes learning the policy by RL very challenging. A possible explanation is that, in MDPs with large state and action spaces, the distribution of demonstrations and the distribution of the agent's trajectories do not overlap in the early stage of learning. In such a scenario, it is trivial to learn a reward function that tends to positive-infinity values for demonstrations and negative-infinity values for the agent's trajectories. While the gradient penalty regularizer slightly remedies this issue, we found that the regularizer alone is insufficient to prevent this scenario.

C Experimental Details
In this section, we describe experimental settings and data generation. We also give brief reviews of the methods compared against VILD in the experiments.
C.1 Settings and data generation
We evaluate VILD on four continuous control tasks from the OpenAI Gym platform (Brockman et al., 2016) with the MuJoCo physics simulator: HalfCheetah, Ant, Walker2d, and Humanoid. To obtain the optimal policy for generating demonstrations, we use the ground-truth reward function of each task to pre-train $\pi^\star$ with TRPO. We generate diverse-quality demonstrations by using $K = 10$ demonstrators according to the graphical model in Figure 1(b). We consider two types of the noisy policy $p(u_t|s_t,a_t,k)$: a Gaussian noisy policy and a time-signal-dependent (TSD) noisy policy.

Gaussian noisy policy.
We use a Gaussian noisy policy $\mathcal{N}(u_t \mid a_t, \sigma_k^2 I)$ with a constant covariance. The value of $\sigma_k$ for each of the 10 demonstrators ranges from 0.01 to 1.0; the individual values are listed in the first column of Table 2. Note that our model assumption on $p_\omega$ corresponds to this Gaussian noisy policy. Table 2 shows the performance of demonstrators (in terms of cumulative ground-truth rewards) with this Gaussian noisy policy.

Figure 3: Samples $b_k(t)$ drawn from noise processes used for the TSD noisy policy.

TSD noisy policy.
To make learning more challenging, we generate demonstrations by simulating characteristics of human motor control (van Beers et al., 2004), where actuator noise is proportional to the magnitude of actuation, and the noise strength increases with execution time. Specifically, we generate demonstrations using a Gaussian distribution $\mathcal{N}(u_t \mid a_t, \mathrm{diag}(b_k(t) \times \|a_t\|^2 / d_a))$, where the covariance is proportional to the magnitude of the action and depends on the time step. We call this policy the time-signal-dependent (TSD) noisy policy. Here, $b_k(t)$ is a sample of a noise process whose noise variance increases over time, as shown in Figure 3. We obtain this noise process for the $k$-th demonstrator by reversing Ornstein–Uhlenbeck (OU) processes with parameters $\theta$ and $\sigma = \sigma_k$ (Uhlenbeck & Ornstein, 1930); the OU process is commonly used to generate time-correlated noise whose variance decays towards zero, and we reverse this process along the time axis so that the noise variance grows over time. The value of $\sigma_k$ for each demonstrator is listed in the first column of Table 3, which shows the performance of demonstrators with this TSD noisy policy. Learning from demonstrations generated by TSD is challenging.

σ_k     Cheetah   Ant     Walker   Humanoid
(π*)    4624      4349    4963     5093
0.01    4311      3985    4434     4315
0.05    3978      3861    3486     5140
0.1     4019      3514    4651     5189
0.25    1853      536     4362     3628
0.40    1090      227     467      5220
0.6     567       -73     523      2593
0.7     267       -208    332      1744
0.8     -45       -979    283      735
0.9     -399      -328    255      538
1.0     -177      -203    249      361

Table 3: Performance of the optimal policy and demonstrators with the TSD noisy policy.
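The two data-generation processes can be sketched as follows. This is a minimal NumPy illustration; the Euler discretization of the OU process and the parameter values are our assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

def gaussian_noisy_action(a, sigma_k, rng):
    # Gaussian noisy policy: u_t ~ N(a_t, sigma_k^2 I)
    return a + sigma_k * rng.normal(size=a.shape)

def reversed_ou(T, theta, sigma, dt=1.0, seed=0):
    # Simulate a mean-reverting OU process started away from zero, then
    # reverse it in time so the noise level grows toward the end of the
    # trajectory, giving a b_k(t)-style schedule.
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    x[0] = sigma  # start away from the mean (illustrative choice)
    for t in range(1, T):
        x[t] = x[t - 1] + theta * (0.0 - x[t - 1]) * dt \
               + sigma * np.sqrt(dt) * rng.normal()
    return x[::-1]

def tsd_noisy_action(a, b_kt, rng):
    # TSD noisy policy: u_t ~ N(a_t, diag(b_k(t) * ||a_t||^2 / d_a))
    var = b_kt * np.sum(a ** 2) / a.shape[0]
    return a + np.sqrt(np.maximum(var, 0.0)) * rng.normal(size=a.shape)
```

Because the TSD variance depends on both the action magnitude and the time step, it cannot be matched exactly by the constant diagonal covariance assumed in $p_\omega$.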
σ_k     Cheetah   Ant     Walker   Humanoid
(π*)    4624      4349    4963     5093
0.01    4362      3758    4695     5130
0.05    4015      3623    4528     5099
0.1     3741      3368    2362     5195
0.25    1301      873     644      1675
0.40    -203      231     302      610
0.6     -230      -51     29       249
0.7     -249      -37     24       221
0.8     -416      -567    14       191
0.9     -389      -751    7        178
1.0     -424      -269    4        169

Table 2: Performance of the optimal policy and demonstrators with the Gaussian noisy policy.

The Gaussian model of $p_\omega$ cannot perfectly model the TSD noisy policy, since the ground-truth variance is a function of actions and time steps.

C.2 Comparison methods
Here, we briefly review the methods compared against VILD in our experiments. We first review online IL methods, which learn a policy by RL and require additional transition samples from the MDP.
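The online methods reviewed in this section all alternate between a reward/discriminator update and an RL policy update. This shared structure can be sketched schematically; every function name here is a hypothetical placeholder, not a specific method's API:

```python
def train_online_il(demos, rollout, reward_update, policy_update,
                    policy, reward, n_iters):
    """Schematic skeleton of online adversarial IL / IRL training."""
    for _ in range(n_iters):
        # 1. Collect fresh trajectories with the current policy.
        agent_trajs = rollout(policy)
        # 2. Update the reward/discriminator to separate demonstrations
        #    from the agent's trajectories.
        reward = reward_update(reward, demos, agent_trajs)
        # 3. Improve the policy by RL on the learned reward.
        policy = policy_update(policy, agent_trajs, reward)
    return policy, reward
```

MaxEnt-IRL, GAIL, AIRL, VAIL, and InfoGAIL differ mainly in step 2 (which discriminator or reward objective is optimized) and in the reward signal passed to step 3.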
MaxEnt-IRL.
Maximum entropy IRL (MaxEnt-IRL) (Ziebart et al., 2010) is a well-known IRL method. The original derivation of the method is based on the maximum entropy principle (Jaynes, 1957) and uses a linear-in-parameter reward function: $r_\phi(s_t,a_t) = \phi^\top b(s_t,a_t)$ with a basis function $b$. Here, we consider an alternative derivation which is applicable to non-linear reward functions (Finn et al., 2016b,a). Briefly speaking, MaxEnt-IRL learns a reward parameter by minimizing a KL divergence from a data distribution $p^\star(s_{1:T},a_{1:T})$ to a model
$$p_\phi(s_{1:T},a_{1:T}) = \frac{1}{Z_\phi}\, p(s_1) \prod_{t=1}^T p(s_{t+1}|s_t,a_t)\exp(r_\phi(s_t,a_t)/\alpha),$$
where $Z_\phi$ is the normalization term. Minimizing this KL divergence is equivalent to solving $\max_\phi \mathbb{E}_{p^\star(s_{1:T},a_{1:T})}\big[\sum_{t=1}^T r_\phi(s_t,a_t)\big] - \log Z_\phi$. To compute $\log Z_\phi$, we can use the variational approach as done in VILD, which leads to a max-min problem
$$\max_\phi \min_\theta \mathbb{E}_{p^\star(s_{1:T},a_{1:T})}\Big[\sum_{t=1}^T r_\phi(s_t,a_t)\Big] - \mathbb{E}_{q_\theta(s_{1:T},a_{1:T})}\Big[\sum_{t=1}^T r_\phi(s_t,a_t) - \alpha\log q_\theta(a_t|s_t)\Big],$$
where $q_\theta(s_{1:T},a_{1:T}) = p(s_1)\prod_{t=1}^T p(s_{t+1}|s_t,a_t)\, q_\theta(a_t|s_t)$. The policy $q_\theta(a_t|s_t)$ maximizes the learned reward function and is the solution of IL.

As we mentioned, the proposed model in VILD is based on the model of MaxEnt-IRL. By comparing the max-min problem of MaxEnt-IRL and the max-min problem of VILD, we can see that the main differences are the variational distribution $q_\psi$ and the noisy policy model $p_\omega$. If we assume that $q_\psi$ and $p_\omega$ are Dirac delta functions, $q_\psi(a_t|s_t,u_t,k) = \delta_{a_t = u_t}$ and $p_\omega(u_t|a_t,s_t,k) = \delta_{u_t = a_t}$, then the max-min problem of VILD reduces to the max-min problem of MaxEnt-IRL. In other words, if we assume that all demonstrators execute the same optimal policy and have an equal level of expertise, then VILD reduces to MaxEnt-IRL.

GAIL.
Generative adversarial IL (GAIL) (Ho & Ermon, 2016) is an IL method that performs occupancy measure matching (Syed et al., 2008) via generative adversarial networks (GAN) (Goodfellow et al., 2014). Specifically, GAIL finds a parameterized policy $\pi_\theta$ such that the occupancy measure $\rho_{\pi_\theta}(s,a)$ of $\pi_\theta$ is similar to the occupancy measure $\rho_{\pi^\star}(s,a)$ of $\pi^\star$. To measure the similarity, GAIL uses the Jensen-Shannon divergence, which is estimated and minimized by the following generative-adversarial training objective:
$$\min_\theta \max_\phi \mathbb{E}_{\rho_{\pi^\star}}[\log D_\phi(s,a)] + \mathbb{E}_{\rho_{\pi_\theta}}[\log(1 - D_\phi(s,a)) + \alpha \log \pi_\theta(a_t|s_t)],$$
where $D_\phi(s,a) = \frac{d_\phi(s,a)}{d_\phi(s,a) + 1}$ is called a discriminator. The minimization problem w.r.t. $\theta$ is achieved using RL with a reward function $-\log(1 - D_\phi(s,a))$.

AIRL.
Adversarial IRL (AIRL) (Fu et al., 2018) was proposed to overcome a limitation of GAIL regarding the reward function: GAIL does not learn the expert reward function, since GAIL has $D_\phi(s,a) = 0.5$ at the saddle point for every state and action. To overcome this limitation while taking advantage of generative-adversarial training, AIRL learns a reward function by solving
$$\max_\phi \mathbb{E}_{p^\star(s_{1:T},a_{1:T})}\Big[\sum_{t=1}^T \log D_\phi(s,a)\Big] + \mathbb{E}_{q_\theta(s_{1:T},a_{1:T})}\Big[\sum_{t=1}^T \log(1 - D_\phi(s,a))\Big],$$
where $D_\phi(s,a) = \frac{\exp(r_\phi(s,a))}{\exp(r_\phi(s,a)) + q_\theta(a|s)}$. The policy $q_\theta(a_t|s_t)$ is learned by RL with a reward function $r_\phi(s_t,a_t)$. Fu et al. (2018) showed that the gradient of this objective w.r.t. $\phi$ is equivalent to the gradient of MaxEnt-IRL w.r.t. $\phi$. The authors also proposed an approach to disentangle the reward function, which leads to better performance in transfer learning settings. Nonetheless, this disentangling approach is general and can be applied to other IRL methods, including MaxEnt-IRL and VILD. We do not evaluate AIRL with the disentangled reward function.

We note that, based on the relation between MaxEnt-IRL and VILD, we can extend VILD to use the training procedure of AIRL. Specifically, by using the same derivation from MaxEnt-IRL to AIRL by Fu et al. (2018), we can derive a variant of VILD which learns a reward parameter by solving
$$\max_\phi \mathbb{E}_{p_d(s_{1:T},u_{1:T}|k)p(k)}\Big[\sum_{t=1}^T \mathbb{E}_{q_\psi(a_t|s_t,u_t,k)}[\log D_\phi(s,a)]\Big] + \mathbb{E}_{\tilde q_\theta(s_{1:T},a_{1:T})}\Big[\sum_{t=1}^T \log(1 - D_\phi(s,a))\Big].$$
We do not evaluate this variant of VILD in our experiments.

VAIL.
Variational adversarial imitation learning (VAIL) (Peng et al., 2019) improves upon GAIL by using the variational information bottleneck (VIB) (Alemi et al., 2017). VIB aims to compress the information flow by minimizing a variational bound of mutual information. This compression filters out irrelevant signals, which leads to less over-fitting. To achieve this in GAIL, VAIL learns the discriminator $D_\phi$ by an optimization problem
$$\min_{\phi,E}\, \max_{\beta \ge 0}\; \mathbb{E}_{\rho_{\pi^\star}}\big[\mathbb{E}_{E(z|s,a)}[-\log D_\phi(z)]\big] + \mathbb{E}_{\rho_{\pi_\theta}}\big[\mathbb{E}_{E(z|s,a)}[-\log(1 - D_\phi(z))]\big] + \beta\,\mathbb{E}_{(\rho_{\pi^\star}+\rho_{\pi_\theta})/2}\big[\mathrm{KL}(E(z|s,a)\,\|\,p(z)) - I_c\big],$$
where $z$ is an encoded vector, $E(z|s,a)$ is an encoder, $p(z)$ is a prior distribution of $z$, $I_c$ is the target value of mutual information, and $\beta > 0$ is a Lagrange multiplier. With this discriminator, the policy $\pi_\theta(a_t|s_t)$ is learned by RL with a reward function $-\log(1 - D_\phi(\mathbb{E}_{E(z|s,a)}[z]))$.

It might be expected that the compression can make VAIL robust against diverse-quality demonstrations, since irrelevant signals in low-quality demonstrations are filtered out via the encoder. However, we find that this is not the case, and VAIL does not improve much upon GAIL in our experiments. This is perhaps because VAIL compresses information from both demonstrations and the agent's trajectories, whereas in our setting irrelevant signals are generated only by demonstrators. Therefore, the information bottleneck may also filter out relevant signals in the agent's trajectories by chance, which leads to poor performance.

InfoGAIL.
Information maximizing GAIL (InfoGAIL) (Li et al., 2017) is an extension of GAIL for learning a multi-modal policy in MM-IL. The key idea of InfoGAIL is to introduce a context variable $z$ into the GAIL formulation and learn a context-dependent policy $\pi_\theta(a|s,z)$, where each context represents each mode of the multi-modal policy. To ensure that the context is not ignored during learning, InfoGAIL regularizes GAIL's objective so that the mutual information between contexts and state-action variables is maximized. This mutual information is maximized indirectly via maximizing a variational lower-bound of the mutual information. By doing so, InfoGAIL solves a min-max problem
$$\min_{\theta,Q} \max_\phi \mathbb{E}_{\rho_{\pi^\star}}[\log D_\phi(s,a)] + \mathbb{E}_{\rho_{\pi_\theta}}[\log(1 - D_\phi(s,a)) + \alpha\log\pi_\theta(a|s,z)] + \lambda L_I(\pi_\theta, Q),$$
where $L_I(\pi_\theta, Q) = \mathbb{E}_{p(z)\pi_\theta(a|s,z)}[\log Q(z|s,a) - \log p(z)]$ is a lower-bound of the mutual information, $Q(z|s,a)$ is an encoder neural network, and $p(z)$ is a prior distribution of contexts. In our experiments, the number of contexts $z$ is set to be the number of demonstrators $K$. As discussed in Section 1, if we know the level of demonstrators' expertise, then we can choose contexts that correspond to high-expertise demonstrators. In other words, we can hand-craft the prior $p(z)$ so that the probability of a context is proportional to the level of the demonstrator's expertise. For a fair comparison in experiments, we do not use this oracle knowledge about the level of demonstrators' expertise, and set $p(z)$ to be a uniform distribution.

Next, we review offline IL methods. These methods learn a policy based on supervised learning and do not require additional transition samples from the MDP.

BC. Behavior cloning (BC) (Pomerleau, 1988) is perhaps the simplest IL method.
BC treats an IL problem as a supervised learning problem and ignores the dependency between state distributions and the policy. For a continuous action space, BC solves a least-squares regression problem to learn a parameter $\theta$ of a deterministic policy $\pi_\theta(s_t)$: $\min_\theta \mathbb{E}_{p^\star(s_{1:T},a_{1:T})}\big[\sum_{t=1}^T \|a_t - \pi_\theta(s_t)\|_2^2\big]$.

BC-D.
BC with Diverse-quality demonstrations (BC-D) is a simple extension of BC for handling diverse-quality demonstrations. This method is based on the naive model in Section 4.1, and we consider it mainly for evaluation purposes. BC-D uses supervised learning to learn a policy parameter $\theta$ and expertise parameter $\omega$ of a model
$$p_{\theta,\omega}(s_{1:T},u_{1:T},k) = p(k)\,p(s_1)\prod_{t=1}^T p(s_{t+1}|s_t,u_t)\int_{\mathcal{A}} \pi_\theta(a_t|s_t)\,p_\omega(u_t|s_t,a_t,k)\,\mathrm{d}a_t.$$
To learn the parameters, we minimize the KL divergence from the data distribution to the model. By using the variational approach to handle the integration over the action space, BC-D solves an optimization problem
$$\max_{\theta,\omega,\nu} \mathbb{E}_{p_d(s_{1:T},u_{1:T}|k)p(k)}\Bigg[\sum_{t=1}^T \mathbb{E}_{q_\nu(a_t|s_t,u_t,k)}\Big[\log\frac{\pi_\theta(a_t|s_t)\,p_\omega(u_t|s_t,a_t,k)}{q_\nu(a_t|s_t,u_t,k)}\Big]\Bigg],$$
where $q_\nu(a_t|s_t,u_t,k)$ is a variational distribution with parameters $\nu$. We note that the model $p_{\theta,\omega}(s_{1:T},u_{1:T},k)$ of BC-D can be regarded as a regression extension of the two-coin model proposed by Raykar et al. (2010) for classification with noisy labels.

Co-teaching.
Co-teaching (Han et al., 2018) is a state-of-the-art method for classification with noisy labels. This method trains two neural networks such that mini-batch samples are exchanged under a small-loss criterion. We extend this method to learn a policy by least-squares regression. Specifically, let $\pi_{\theta_1}(s_t)$ and $\pi_{\theta_2}(s_t)$ be two neural networks representing policies, and $\nabla_\theta L(\theta, B) = \nabla_\theta \sum_{(s,a)\in B} \|a - \pi_\theta(s)\|_2^2$ be the gradient of a least-squares loss estimated using a mini-batch $B$. The parameters $\theta_1$ and $\theta_2$ are updated by gradient iterates:
$$\theta_1 \leftarrow \theta_1 - \eta \nabla_{\theta_1} L(\theta_1, B_{\theta_1}), \qquad \theta_2 \leftarrow \theta_2 - \eta \nabla_{\theta_2} L(\theta_2, B_{\theta_2}).$$
The mini-batch $B_{\theta_1}$ for updating $\theta_1$ is obtained such that $B_{\theta_1}$ incurs a small loss when using predictions from $\pi_{\theta_2}$, i.e., $B_{\theta_1} = \mathrm{argmin}_B\, L(\theta_2, B)$. Similarly, the mini-batch $B_{\theta_2}$ for updating $\theta_2$ is obtained such that $B_{\theta_2}$ incurs a small loss when using predictions from $\pi_{\theta_1}$. For evaluating the performance, we use the first policy network $\pi_{\theta_1}$.

D More experimental results
Results against online IL methods.
Figure 4 shows the learning curves of VILD and existing online IL methods against the number of transition samples. It can be seen that for both types of noisy policy, VILD with and without IS outperforms existing methods overall, in terms of both final performance and data-efficiency.
Results against offline IL methods.
Figure 5 shows the learning curves of offline IL methods, namely BC, BC-D, and Co-teaching. For comparison, the figure also shows the final performance of VILD with and without IS, according to Table 1. We can see that these offline methods do not perform well, especially on the high-dimensional Humanoid task. The poor performance of these methods is due to the issues of compounding error and low-quality demonstrations. Specifically, BC performs the worst, since it suffers from both issues. Still, BC may learn well in the early stage of learning, but its performance sharply degrades, as seen in Ant and Walker2d. This phenomenon can be explained as an empirical effect of memorization in deep neural networks (Arpit et al., 2017). Namely, deep neural networks learn to remember samples with simple patterns first (i.e., high-quality demonstrations from experts), but as learning progresses the networks overfit to samples with difficult patterns (i.e., low-quality demonstrations from amateurs). Co-teaching is the state-of-the-art method to avoid this effect, and we can see that it performs better than BC overall. Meanwhile, BC-D, which learns the policy and the level of demonstrators' expertise, also performs better than BC and is comparable to Co-teaching. However, due to the presence of compounding error, the performance of Co-teaching and BC-D is still worse than that of VILD with IS.

Accuracy of estimated expertise parameter.
Figure 6 shows the estimated parameters $\omega = \{c_k\}_{k=1}^K$ of $\mathcal{N}(u_t \mid a_t, \mathrm{diag}(c_k))$ and the ground-truth variances $\{\sigma_k^2\}_{k=1}^K$ of the Gaussian noisy policy $\mathcal{N}(u_t \mid a_t, \sigma_k^2 I)$. The results show that VILD learns an accurate ranking of the variances compared to the ground truth. The values of these parameters are also quite accurate compared to the ground truth, except for demonstrators with low levels of expertise. A possible reason for this phenomenon is that low-quality demonstrations are highly dissimilar, which makes learning the expertise more challenging. We can also see that the difference between the parameters of VILD with IS and VILD without IS is small and negligible.

(a) Performance of online IL methods when demonstrations are generated using the Gaussian noisy policy. (b) Performance of online IL methods when demonstrations are generated using the TSD noisy policy.
Figure 4: Performance averaged over 5 trials of online IL methods against the number of transition samples. Horizontal dotted lines indicate the performance of five of the demonstrators.

(a) Performance of offline IL methods when demonstrations are generated using the Gaussian noisy policy. (b) Performance of offline IL methods when demonstrations are generated using the TSD noisy policy.
Figure 5: Performance averaged over 5 trials of offline IL methods against the number of gradient update steps. For VILD with and without IS, we report the final performance in Table 1.
Figure 6: Expertise parameters $\omega = \{c_k\}_{k=1}^K$ learned by VILD and the ground-truth $\{\sigma_k^2\}_{k=1}^K$ for the Gaussian noisy policy. For VILD, we report the value of $\|c_k\|/d_a$.
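The kind of ranking comparison behind Figure 6 can be reproduced with a small sketch: summarize each learned diagonal-variance vector $c_k$ by $\|c_k\|/d_a$ and compare its ordering against the ground-truth variances. The function names below are illustrative, not the evaluation code used in the paper:

```python
import numpy as np

def expertise_summary(c):
    """Summarize a learned diagonal-variance vector c_k by ||c_k|| / d_a."""
    return np.linalg.norm(c) / c.shape[0]

def same_ranking(learned, ground_truth):
    """Check whether learned summaries order demonstrators the same way
    as the ground-truth variances (rank accuracy, not value accuracy)."""
    return np.array_equal(np.argsort(learned), np.argsort(ground_truth))
```

A correct ranking suffices for the importance sampling scheme to prefer high-expertise demonstrators, even when the variance values themselves are imprecise for low-expertise demonstrators.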