When Will Generative Adversarial Imitation Learning Algorithms Attain Global Convergence
Ziwei Guan, Tengyu Xu, Yingbin Liang
Department of Electrical and Computer Engineering, The Ohio State University
{guan.283, xu.3260, liang.889}@osu.edu
Abstract
Generative adversarial imitation learning (GAIL) is a popular inverse reinforcement learning approach for jointly optimizing policy and reward from expert trajectories. A primary question about GAIL is whether applying a certain policy gradient algorithm to GAIL attains a global minimizer (i.e., yields the expert policy), for which existing understanding is very limited. Such global convergence has been shown only for the linear (or linear-type) MDP and linear (or linearizable) reward. In this paper, we study GAIL under general MDP and for nonlinear reward function classes (as long as the objective function is strongly concave with respect to the reward parameter). We characterize the global convergence with a sublinear rate for a broad range of commonly used policy gradient algorithms, all of which are implemented in an alternating manner with stochastic gradient ascent for reward update, including projected policy gradient (PPG)-GAIL, Frank-Wolfe policy gradient (FWPG)-GAIL, trust region policy optimization (TRPO)-GAIL and natural policy gradient (NPG)-GAIL. This is the first systematic theoretical study of GAIL for global convergence.
Introduction

In reinforcement learning (RL), the reward function generally plays an important role in guiding the design of policy optimization to attain the best long-term accumulative reward. However, a reward function may not be known in many situations, and imitation learning Osa et al. (2018) aims to find a desirable policy in such cases, which produces behaviors as close as possible to expert demonstrations. Two popular classes of approaches for imitation learning have been developed. The first approach is behavioral cloning (BC) Pomerleau (1991), which directly provides a mapping strategy from the state space to the action space based on supervised learning to match expert demonstrations. The BC method often suffers from high sample complexity due to covariate shift Ross and Bagnell (2010); Ross et al. (2011) for achieving the desired performance, which is mitigated by improved algorithms such as DAgger Ross et al. (2011) and Dart Laskey et al. (2017) that require further interaction with the expert's demonstration. The second approach is the so-called inverse reinforcement learning (IRL) Ng and Russell (2000); Russell (1998), which attempts to recover the unknown reward function based on the expert's trajectories, and then find an optimal policy by using such a reward function.

A popular IRL method has been developed in Finn et al. (2016); Fu et al. (2018); Ho and Ermon (2016), which leverages the connection of IRL to the training of generative adversarial networks (GANs) Goodfellow et al. (2014). In particular, the generative adversarial imitation learning (GAIL) framework Ho and Ermon (2016) formulates a min-max optimization problem as in the GAN training. The maximization is over the reward function (which serves as a discriminator) to best distinguish between the trajectories generated by the expert and the learner, and the minimization is then over the learner's policy (which serves as a generator) to best match the expert's trajectories. Since the policy optimization in GAIL is nonconvex, its joint optimization with the reward function in GAIL in general can be guaranteed to converge only to a stationary point. Such a type of result was recently established in Chen et al. (2020), which studied GAIL under general MDP model and reward function class, and showed that the gradient-descent and gradient-ascent algorithm converges to a stationary point (not necessarily the global minimum).
Table 1: Comparison among GAIL algorithms studied in this paper

| Algorithms | Convergence rate | Total complexity |
| PPG-GAIL | $\mathcal{O}\big(\tfrac{1}{(1-\gamma)\sqrt{T}}\big)$ | $\tilde{\mathcal{O}}(\epsilon^{-4})$ |
| FWPG-GAIL | $\mathcal{O}\big(\tfrac{1}{(1-\gamma)\sqrt{T}}\big)$ | $\tilde{\mathcal{O}}(\epsilon^{-4})$ |
| TRPO-GAIL (unregularized) | $\mathcal{O}\big(\tfrac{1}{(1-\gamma)\sqrt{T}}\big)$ | $\tilde{\mathcal{O}}(\epsilon^{-3})$ |
| TRPO-GAIL (regularized) | $\tilde{\mathcal{O}}\big(\tfrac{1}{(1-\gamma)T}\big)$ | $\tilde{\mathcal{O}}(\epsilon^{-2})$ |
| NPG-GAIL | $\mathcal{O}\big(\tfrac{1}{(1-\gamma)\sqrt{T}}\big)$ | $\tilde{\mathcal{O}}(\epsilon^{-4})$ |

Total complexity refers to the total number of samples needed to achieve an ε-accurate globally optimal point. $\tilde{\mathcal{O}}(\cdot)$ does not include the logarithmic terms.

More recently, it has been shown that some popular policy gradient algorithms Agarwal et al. (2019); Liu et al. (2019); Shani et al. (2020); Wang et al. (2019); Xu et al. (2020a) can converge to a globally optimal policy under certain policy parameterizations. Then a natural question to ask is whether such global convergence continues to hold in GAIL when these algorithms are further implemented in an alternating fashion with the reward optimization in GAIL. The global convergence does not necessarily hold in general, because the policy optimization is still over a nonconvex objective function, which can induce complicated and undesirable geometries jointly with the reward optimization as a min-max problem in GAIL. Thus, existing exploration on this topic in Cai et al. (2019); Zhang et al. (2020), which established global convergence for GAIL, requires restrictive conditions: (1) linear (but possibly infinite-dimensional) MDP and (2) linear reward function or linearizable reward function such as overparameterized ReLU neural networks.

This paper aims to substantially expand the aforementioned global convergence results as follows.

• We allow general MDP models, not necessarily linear MDP. We study nonlinear reward functions as long as the resulting objective function is strongly concave with respect to the reward parameter. This is a much bigger class than linear reward, and is satisfied easily by incorporating a strongly concave regularizer, which has been commonly used in GAIL practice.

• In addition to the projected gradient and NPG that have been studied in Cai et al. (2019); Zhang et al. (2020) for GAIL, we also study Frank-Wolfe policy gradient, which is easier to implement than projected policy gradient, and TRPO, which is widely adopted in GAIL in practice.

• Existing convergence characterizations for GAIL assumed that the samples are either identically and independently distributed (i.i.d.) as in Chen et al. (2020); Zhang et al. (2020) or follow the LQR dynamics as in Cai et al. (2019), whereas here we assume that samples follow a general Markovian distribution.
In this paper, we establish the first global convergence guarantee for GAIL under the general MDP model and the nonlinear reward function class (as long as the objective function is strongly concave with respect to the reward parameter). We provide the convergence rate for three major types of algorithms, all of which alternate between gradient ascent (for reward update) and policy gradient descent (for policy update), respectively being (a) projected policy gradient (PPG)-GAIL and Frank-Wolfe policy gradient (FWPG)-GAIL (with direct policy parameterization); (b) trust region policy optimization (TRPO)-GAIL (with direct policy parameterization); and (c) natural policy gradient (NPG)-GAIL (with general nonlinear policy parameterization). We show that all these alternating algorithms converge to the global minimum with a sublinear rate. We summarize our results on the convergence performance of the GAIL algorithms in Table 1. Comparing among these algorithms indicates that TRPO-GAIL with regularized MDP achieves the best convergence rate, and TRPO-GAIL with regularized and unregularized MDP outperforms the other algorithms in terms of the overall sample complexity.

Technically, the global convergence guarantee for GAIL does not follow from the existing min-max optimization theory. In fact, the GAIL problem here falls into the nonconvex-strongly-concave min-max optimization framework, for which existing optimization theory does not provide the global convergence in general. Thus, our establishment of global convergence for GAIL develops several new properties specifically for GAIL. Furthermore, in contrast to conventional min-max optimization, which is under i.i.d. sampling by a certain static distribution, GAIL is under Markovian sampling by time-varying distributions due to the policy update. Thus, the convergence analysis for GAIL is more challenging than that for min-max optimization.
Related Work

Due to the significant growth of studies in imitation learning, this section focuses only on those studies that are highly relevant to the theoretical analysis of the convergence of GAIL algorithms.
Theory for IRL via adversarial training:
The idea of generative adversarial training Goodfellow et al. (2014) has motivated a popular approach for IRL problems Finn et al. (2016); Fu et al. (2018); Ho and Ermon (2016). Among these studies, GAIL Ho and Ermon (2016) formulated a min-max problem for jointly optimizing the reward and policy, where the reward and policy serve analogous roles to the discriminator and the generator in GANs. Naturally, such an approach has been explored from the divergence minimization perspective in Ghasemipour et al. (2019); Ke et al. (2019), by leveraging GAN training Nowozin et al. (2016). Moreover, the generalization performance and sample complexity have been studied for the setting where the expert's demonstrations include only the states but no actions.

Most relevant to our study are the recent studies Cai et al. (2019); Chen et al. (2020); Zhang et al. (2020) on the convergence rate of the algorithms developed for GAIL. Among these studies, Chen et al. (2020) studied GAIL under the general MDP model and reward function class, and showed that the gradient-descent and gradient-ascent algorithm converges to a stationary point (not necessarily the global minimum). Cai et al. (2019); Zhang et al. (2020) provided global convergence results. More specifically, Cai et al. (2019) studied GAIL under linear quadratic regulator (LQR) dynamics and the linear reward function class, and showed that the alternating gradient algorithm converges to the unique saddle point. Zhang et al. (2020) studied GAIL under a type of linear but infinite-dimensional MDP and with overparameterized neural networks for parameterizing the policy and reward function, and showed that the alternating algorithm between gradient ascent (for reward update) and NPG (for policy update) converges to the neighborhood of a globally optimal point, where the representation power of the neural networks determines the convergence error. Our study here establishes global convergence of GAIL for general MDP and the nonlinear reward function class.
Difference from conventional min-max problems:
Although the GAIL framework is formulated as a min-max optimization problem, the stochastic algorithms that we use for solving such a problem have the following major differences from the conventional min-max optimization problem. First, since these algorithms continuously update the policy, the samples used in the iterations are drawn by time-varying policies, whereas the conventional min-max problem typically has a fixed sampling distribution. Second, since the samples are obtained following an MDP process, the samples are correlated rather than i.i.d. as in conventional optimization. These two differences make the convergence analysis more complicated for GAIL than for the conventional min-max problem. Furthermore, the min-max problem that we encounter here for GAIL is nonconvex-strongly-concave, for which the conventional min-max optimization Lin et al. (2020); Nouiehed et al. (2019) has been shown to converge only to a stationary point, whereas this paper exploits further properties of GAIL and establishes the global convergence guarantee.
Connection to policy gradient algorithms:
In the GAIL framework, the policy optimization is performed jointly with the reward optimization via a min-max optimization. Thus, the variation of the reward function during the algorithm execution continuously changes the objective function for the policy optimization. Hence, even if the policy gradient algorithms (running for a fixed objective function) converge globally, for example, PPG Agarwal et al. (2019), NPG Agarwal et al. (2019), and TRPO Shani et al. (2020), the global convergence is generally not guaranteed if these algorithms are executed in an alternating fashion with reward iterations. Two special cases have been shown to retain such global convergence, namely, the LQR model in Cai et al. (2019) and overparameterized neural networks for a linear-type MDP in Zhang et al. (2020). This paper significantly expands this set of cases by establishing the global convergence guarantee for more general MDP and reward classes and a broader range of algorithms.
Problem Formulation and Preliminaries
The imitation learning framework that we study is based on the Markov decision process (MDP) denoted by (S, A, P, r, γ). We assume that both the state space S ⊂ R^d and the action space A are finite, and use s ∈ S and a ∈ A to denote a state and an action, respectively. A policy π describes the probability of taking an action a ∈ A at each state s ∈ S in terms of the conditional probability π(a|s). The system then moves to a next state s′ ∈ S governed by the probability transition kernel P(s′|s, a) and receives a reward r_t = r(s, a), which is assumed to be bounded by R_max.

Suppose the initial state follows a distribution ζ. For a given policy π and a reward function r, we define the average value function as

$$V(\pi, r) = \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\,\Big|\, s_0\sim\zeta,\ a_t\sim\pi(\cdot|s_t),\ s_{t+1}\sim\mathsf{P}(\cdot|s_t,a_t)\Big] = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim\nu_\pi}[r(s,a)],$$

where γ ∈ (0, 1) is a discount factor and ν_π(s, a) := (1 − γ) Σ_{t=0}^∞ γ^t P(s_t = s, a_t = a) is the state-action visitation distribution. It has been shown in Konda (2002) that ν_π(s, a) is the stationary distribution of the Markov chain with the transition kernel P̃(·|s, a) = (1 − γ)ζ(·) + γP(·|s, a) and policy π if the Markov chain is ergodic. Thus P̃ is used in sampling for estimating the value function.

For imitation learning, in which the reward function is not known, GAIL Ho and Ermon (2016) is a framework to jointly learn the reward function and optimize the policy. We parameterize the reward function by α ∈ Λ ⊂ R^q, which takes the form r_α(s, a) at the state-action pair (s, a). We assume that Λ is a bounded closed set, i.e., ‖α₁ − α₂‖₂ ≤ C_α for all α₁, α₂ ∈ Λ.

We let π_E represent the expert policy, and let the learner's policy be parameterized by θ ∈ Θ and be denoted as π_θ. In this paper, we consider two types of parameterization for the learner's policy. The first is the direct parameterization, where θ = {θ_{s,a}, s ∈ S, a ∈ A}, π_θ(a|s) = θ_{s,a}, and θ ∈ Θ_p := {θ : θ_{s,a} ≥ 0, Σ_{a∈A} θ_{s,a} = 1, for all s ∈ S, a ∈ A}. The second is the general nonlinear policy class, which satisfies certain smoothness conditions as given in Assumption 5.

The GAIL framework is formulated as the following min-max optimization problem:

$$\min_{\theta\in\Theta}\max_{\alpha\in\Lambda} F(\theta,\alpha) := V(\pi_E, r_\alpha) - V(\pi_\theta, r_\alpha) - \psi(\alpha), \qquad (1)$$

where the objective function is given by the discrepancy of the accumulated rewards between the expert's and learner's policies, regularized by a function ψ(α) of the reward parameter. Thus, the maximization in eq. (1) aims to find the reward function that best distinguishes between the expert's and the learner's policies, and the minimization aims to find the learner's policy that matches the expert's policy as closely as possible. Such a formulation is analogous to the GANs, with the reward serving as a discriminator and the policy serving as a generator.
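To make the sampling model concrete, the following minimal sketch (a hypothetical tabular setting; the arrays P, zeta, policy, r_alpha and the scalar psi_alpha are illustrative inputs, not objects defined in the paper) draws samples from the kernel P̃ defined above and forms a Monte Carlo estimate of the objective F(θ, α) in eq. (1).

```python
import numpy as np

def sample_visitation(P, zeta, policy, gamma, n_samples, rng):
    """Draw (s, a) pairs whose long-run distribution is the visitation distribution
    nu_pi, by simulating the modified kernel
    P_tilde(.|s, a) = (1 - gamma) * zeta(.) + gamma * P(.|s, a)."""
    pairs, s = [], rng.choice(len(zeta), p=zeta)
    for _ in range(n_samples):
        a = rng.choice(policy.shape[1], p=policy[s])
        pairs.append((s, a))
        if rng.random() < 1.0 - gamma:          # restart from the initial distribution zeta
            s = rng.choice(len(zeta), p=zeta)
        else:                                   # otherwise follow the original dynamics P
            s = rng.choice(P.shape[2], p=P[s, a])
    return pairs

def gail_objective(r_alpha, psi_alpha, expert_pairs, learner_pairs, gamma):
    """Monte Carlo estimate of F(theta, alpha) in eq. (1):
    V(pi_E, r_alpha) - V(pi_theta, r_alpha) - psi(alpha),
    using V(pi, r) = E_{(s,a) ~ nu_pi}[r(s, a)] / (1 - gamma)."""
    v_expert = np.mean([r_alpha[s, a] for s, a in expert_pairs]) / (1.0 - gamma)
    v_learner = np.mean([r_alpha[s, a] for s, a in learner_pairs]) / (1.0 - gamma)
    return v_expert - v_learner - psi_alpha
```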
In this paper, we study four GAIL algorithms, all of which follow the nested-loop framework described in Algorithm 1. Namely, at each time step t (associated with one outer loop), there is an entire inner loop that updates the reward parameter α_t to a certain accuracy and one update step of the policy parameter θ_t. Specifically, α_t is updated by the stochastic projected gradient ascent given by

$$\alpha_t^{k+1} = \mathcal{P}_\Lambda\Big(\alpha_t^{k} + \beta\,\hat\nabla_\alpha F(\theta_t,\alpha_t^{k})\Big),$$

where the gradient estimator $\hat\nabla_\alpha F(\theta_t,\alpha_t^{k})$ is obtained via a Markovian sample trajectory. Then the policy parameter θ_t is updated for one step, determined by any of the four policy gradient algorithms, namely, PPG in eq. (4), FWPG in eq. (5), TRPO in eq. (7) and NPG in eq. (8). The samples are obtained over a single trajectory path for the entire algorithm execution.

Algorithm 1 Nested-loop GAIL framework
Input: outer-loop length T, inner-loop length K, stepsizes η, β
for t = 0, 1, ..., T − 1 do
    Randomly pick α_t^0 ∈ Λ
    for k = 0, 1, ..., K − 1 do
        Query a length-B trajectory (s_i^E, a_i^E) ∼ P̃_{π_E} and a length-B mini-batch (s_i^θ, a_i^θ) ∼ P̃_{π_{θ_t}}
        $\hat\nabla_\alpha F(\theta_t,\alpha_t^k) = \frac{1}{(1-\gamma)B}\sum_{i=0}^{B-1}\big[\nabla_\alpha r_{\alpha_t^k}(s_i^E,a_i^E) - \nabla_\alpha r_{\alpha_t^k}(s_i^\theta,a_i^\theta)\big] - \nabla_\alpha\psi(\alpha_t^k)$
        $\alpha_t^{k+1} = \mathcal{P}_\Lambda\big(\alpha_t^{k} + \beta\,\hat\nabla_\alpha F(\theta_t,\alpha_t^{k})\big)$
    end for
    α_t = α_t^K
    θ_{t+1} = Options: PPG in eq. (4); FWPG in eq. (5); TRPO in eq. (7); NPG in eq. (8)
end for
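For illustration, the following Python sketch (with hypothetical helper callables sample_batches, grad_alpha_F, and policy_step, and a ball-shaped Λ chosen only for simplicity) summarizes the control flow of Algorithm 1.

```python
import numpy as np

def project_ball(alpha, radius):
    """Illustrative projection onto a bounded reward-parameter set Lambda
    (modeled here as a Euclidean ball; the true Lambda is problem-specific)."""
    norm = np.linalg.norm(alpha)
    return alpha if norm <= radius else alpha * (radius / norm)

def nested_loop_gail(theta0, alpha0, T, K, eta, beta, radius,
                     sample_batches, grad_alpha_F, policy_step):
    """Control flow of Algorithm 1: an inner loop of stochastic projected gradient
    ascent on the reward parameter alpha, followed by one policy-gradient step on
    theta (PPG, FWPG, TRPO, or NPG). The callables sample_batches, grad_alpha_F,
    and policy_step are user-supplied stand-ins, not part of the paper."""
    theta, alpha = theta0, alpha0
    for t in range(T):
        for k in range(K):
            expert_batch, learner_batch = sample_batches(theta)   # Markovian mini-batches of size B
            g_alpha = grad_alpha_F(theta, alpha, expert_batch, learner_batch)
            alpha = project_ball(alpha + beta * g_alpha, radius)  # reward ascent step
        theta = policy_step(theta, alpha, eta)                    # one policy update
    return theta, alpha
```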
For the GAIL problem in eq. (1) to be well posed, we assume that max_{α∈Λ} F(θ, α) exists for any θ ∈ Θ, and define the marginal-maximum function of F(θ, α) as

$$g(\theta) := \max_{\alpha\in\Lambda} F(\theta,\alpha). \qquad (2)$$

We further define the corresponding optimizer α_op(θ) := argmax_{α∈Λ} F(θ, α). If there exists more than one optimizer, α_op(θ) denotes an element of the corresponding optimizer set.

Definition 1.
Let θ* = argmin_{θ∈Θ} g(θ). The output θ̄ of an algorithm is said to attain ε-global convergence if g(θ̄) − g(θ*) ≤ ε holds for a prescribed accuracy ε ∈ (0, 1).

As remarked in Zhang et al. (2020), ε-global convergence further implies

$$\max_{\alpha\in\Lambda}\big[V(\pi_E, r_\alpha) - V(\pi_{\bar\theta}, r_\alpha)\big] \le \max_{\alpha\in\Lambda}\psi(\alpha) + \epsilon.$$

Hence, as long as ψ(α) is chosen properly (for example, with a small regularization coefficient), π_θ̄ is guaranteed to be sufficiently close to the expert policy.

In this paper, we make the following standard assumptions for our analysis.

Assumption 1.
The regularizer function ψ(α) is differentiable with gradient Lipschitz constant L_ψ.

Assumption 1 captures a basic design property of the regularizer and can be easily satisfied.
Assumption 2.
For any given θ, the objective function F(θ, α) in eq. (1) is μ-strongly concave with respect to α.

Assumption 2 includes the linear reward function class as a special case. In practice, a strongly convex regularizer ψ(α) is often used to guarantee the strong concavity of F(θ, α).

Assumption 3 (Ergodicity). For any policy parameter θ ∈ Θ, consider the MDP with policy π_θ and transition kernel P(·|s, a) or P̃(·|s, a) = γP(·|s, a) + (1 − γ)ζ(·). There exist constants C_M > 0 and 0 < ρ < 1 such that, for all t ≥ 0,

$$\sup_{s\in\mathcal{S}} d_{TV}\big(\mathbb{P}(s_t\in\cdot\,|\,s_0=s),\,\chi_\theta\big) \le C_M\rho^t,$$

where χ_θ is the stationary distribution of the given transition kernel P(·|s, a) or P̃(·|s, a) under policy π_θ, and d_TV(·, ·) is the total variation distance.

Assumption 3 holds for any time-homogeneous Markov chain with finite state space or any uniformly ergodic Markov chain with general state space.
Assumption 4.
The reward parameterization satisfies the following requirements:

(1) Bounded gradient: there exists C_r ∈ R such that for all α ∈ Λ,
$$\|\nabla_\alpha r_\alpha\|_{\infty,2} := \sqrt{\sum_{i=1}^{q}\Big\|\frac{\partial r_\alpha}{\partial\alpha_i}\Big\|_\infty^2} \le C_r.$$
(2) Gradient Lipschitz: there exists L_r ∈ R such that for all s ∈ S, a ∈ A and all α₁, α₂ ∈ Λ,
$$\|\nabla_\alpha r_{\alpha_1}(s,a) - \nabla_\alpha r_{\alpha_2}(s,a)\|_2 \le L_r\|\alpha_1-\alpha_2\|_2.$$

We next provide the following Lipschitz properties, which are vital for the analysis of convergence, and were often taken as assumptions in the literature of min-max optimization Jin et al. (2019); Nouiehed et al. (2019).
Proposition 1.
Suppose Assumptions 1, 3 and 4 hold. Then the GAIL min-max problem in eq. (1) with direct parameterization satisfies the following Lipschitz conditions: for all θ₁, θ₂ ∈ Θ and all α₁, α₂ ∈ Λ,

$$\|\nabla_\theta F(\theta_1,\alpha_1) - \nabla_\theta F(\theta_2,\alpha_2)\|_2 \le L_1\|\theta_1-\theta_2\|_2 + L_2\|\alpha_1-\alpha_2\|_2,$$
$$\|\nabla_\alpha F(\theta_1,\alpha_1) - \nabla_\alpha F(\theta_2,\alpha_2)\|_2 \le L_3\|\theta_1-\theta_2\|_2 + L_4\|\alpha_1-\alpha_2\|_2,$$

where
$$L_1 = \frac{\sqrt{|\mathcal{A}|}\,C_r C_\alpha}{1-\gamma}\Big(1+\big\lceil\log_\rho C_M^{-1}\big\rceil+(1-\rho)^{-1}\Big),\quad L_2 = \frac{\sqrt{|\mathcal{A}|}\,C_r}{1-\gamma},\quad L_3 = \frac{\sqrt{|\mathcal{A}|}\,C_r}{1-\gamma}\Big(1+\big\lceil\log_\rho C_M^{-1}\big\rceil+(1-\rho)^{-1}\Big),\quad L_4 = \frac{\sqrt{q}\,L_r}{1-\gamma}+L_\psi.$$

Furthermore, if θ₁ = θ₂, the second bound above holds with a general parameterization for the policy.

In this section, we provide the global convergence guarantee for four GAIL algorithms.
In this section, we study the PPG-GAIL and FWPG-GAIL algorithms, both of which take the general framework in Algorithm 1 and update the policy parameter θ respectively based on projected policy gradient (PPG) and Frank-Wolfe policy gradient (FWPG).

We take the direct parameterization for the policy. At each time t of the outer loop, both PPG-GAIL and FWPG-GAIL first estimate the stochastic policy gradient by drawing a mini-batch sample trajectory of length b as (s_i, a_i) ∼ P̃_{π_{θ_t}} as follows:

$$\hat\nabla_\theta F(\theta_t,\alpha_t)(s,a) = -\frac{\hat Q(s,a)}{b(1-\gamma)}\sum_{i=0}^{b-1}\mathbf{1}\{s_i=s\}, \qquad (3)$$

for all s ∈ S, a ∈ A, where Q̂(s, a) applies EstQ in Zhang et al. (2019) (see Appendix A) with the reward function r_{α_t}(s, a). Then, PPG-GAIL updates θ_t as

$$\theta_{t+1} = \mathcal{P}_{\Theta_p}\big(\theta_t - \eta\,\hat\nabla_\theta F(\theta_t,\alpha_t)\big), \qquad (4)$$

where Θ_p is the probability simplex defined in Section 2.2. Differently from PPG-GAIL, FWPG-GAIL updates θ_t based on the Frank-Wolfe gradient as given by

$$\hat v_t = \mathop{\mathrm{argmax}}_{\theta\in\Theta_p}\,\big\langle \theta,\, -\hat\nabla_\theta F(\theta_t,\alpha_t)\big\rangle,\qquad \theta_{t+1} = \theta_t + \eta(\hat v_t - \theta_t). \qquad (5)$$
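For concreteness, the following sketch (assuming θ and the gradient estimate are stored as |S| × |A| arrays; the simplex-projection routine is a standard construction, not taken from the paper) illustrates one PPG step of eq. (4) and one FWPG step of eq. (5).

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex
    (standard sort-based algorithm)."""
    n = v.shape[-1]
    u = np.sort(v, axis=-1)[..., ::-1]
    css = np.cumsum(u, axis=-1) - 1.0
    idx = np.arange(1, n + 1)
    rho = np.count_nonzero(u - css / idx > 0, axis=-1, keepdims=True)
    tau = np.take_along_axis(css, rho - 1, axis=-1) / rho
    return np.maximum(v - tau, 0.0)

def ppg_step(theta, grad, eta):
    """Projected policy gradient step (eq. (4)): descend on F and project each
    per-state action distribution back onto the simplex."""
    return project_simplex(theta - eta * grad)

def fwpg_step(theta, grad, eta):
    """Frank-Wolfe policy gradient step (eq. (5)): the linear maximizer over the
    product of simplices puts all mass, per state, on the action with the most
    negative gradient entry (argmax of <theta, -grad>)."""
    v = np.zeros_like(theta)
    best = np.argmin(grad, axis=-1)
    v[np.arange(theta.shape[0]), best] = 1.0
    return theta + eta * (v - theta)
```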
To analyze the convergence, we first define the gradient dominance property.

Definition 2. A function f(θ) satisfies the gradient dominance property if there exists a positive constant C such that

$$f(\theta) - f(\theta^\ast) \le C \max_{\bar\theta\in\Theta}\big\langle \theta-\bar\theta,\, \nabla_\theta f(\theta)\big\rangle$$

for any given θ ∈ Θ, where θ* := argmin_{θ∈Θ} f(θ).

The following proposition facilitates proving global convergence for PPG-GAIL and FWPG-GAIL.
Proposition 2.
The function g(θ) given in eq. (2) satisfies the gradient dominance property.

The following theorem characterizes the global convergence of PPG-GAIL.

Theorem 1.
Suppose Assumptions 1 to 4 hold. Consider PPG-GAIL with the θ-update stepsize $\eta = \big(L_1 + \tfrac{L_2 L_3}{\mu}\big)^{-1}$ and the α-update stepsize $\beta = \tfrac{\mu}{L_4^2}$, where L₁, L₂, L₃ and L₄ are given in Proposition 1. Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[g(\theta_t)] - g(\theta^\ast) \le \mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{T}}\Big) + \mathcal{O}\big(e^{-(1-\gamma)K}\big) + \mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{B}}\Big) + \mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{b}}\Big). \qquad (6)$$

Theorem 1 implies that if we set $T = \mathcal{O}(\epsilon^{-2})$, $K = \mathcal{O}(\log(1/\epsilon))$, $B = \mathcal{O}(\epsilon^{-2})$ and $b = \mathcal{O}(\epsilon^{-2})$, then PPG-GAIL converges to an ε-accurate globally optimal value with an overall sample complexity $T(KB + b) = \tilde{\mathcal{O}}(\epsilon^{-4})$. Due to the Markovian sampling for updating both the reward and policy parameters α and θ, our analysis bounds the two corresponding bias error terms by $\mathcal{O}(1/\sqrt{B})$ and $\mathcal{O}(1/\sqrt{b})$, as shown in eq. (6). Hence, the choices of the mini-batch sizes B and b trade off between the convergence error and the computational complexity. To achieve a given accuracy ε, this tradeoff yields the overall complexity of $\tilde{\mathcal{O}}(\epsilon^{-4})$. We also note that the result here provides the first convergence rate for projected stochastic gradient with non-i.i.d. sampling.
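To spell out the sample count, the following short calculation (a worked illustration of the parameter choices above, with c denoting a generic positive constant) shows how the overall complexity arises from eq. (6):

```latex
% Each error term in eq. (6) is driven below the target accuracy \epsilon:
\frac{1}{\sqrt{T}}\le\epsilon \;\Rightarrow\; T=\mathcal{O}(\epsilon^{-2}),\qquad
\frac{1}{\sqrt{B}}\le\epsilon \;\Rightarrow\; B=\mathcal{O}(\epsilon^{-2}),\qquad
\frac{1}{\sqrt{b}}\le\epsilon \;\Rightarrow\; b=\mathcal{O}(\epsilon^{-2}),\qquad
e^{-cK}\le\epsilon \;\Rightarrow\; K=\mathcal{O}(\log(1/\epsilon)).
% Each outer iteration uses K reward mini-batches of size B plus one policy mini-batch of size b:
T\,(KB+b)=\mathcal{O}\!\big(\epsilon^{-2}(\epsilon^{-2}\log(1/\epsilon)+\epsilon^{-2})\big)=\tilde{\mathcal{O}}(\epsilon^{-4}).
```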
We next provide the following theorem, which characterizes the global convergence of FWPG-GAIL.

Theorem 2. Suppose Assumptions 1 to 4 hold. Consider FWPG-GAIL with the θ-update stepsize $\eta = \tfrac{1-\gamma}{\sqrt{T}}$ and the α-update stepsize $\beta = \tfrac{\mu}{L_4^2}$, where L₄ is given in Proposition 1. Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[g(\theta_t)] - g(\theta^\ast) \le \mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{T}}\Big) + \mathcal{O}\big(e^{-(1-\gamma)K}\big) + \mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{B}}\Big) + \mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{b}}\Big).$$

Theorem 2 implies that if we let $T = \mathcal{O}(\epsilon^{-2})$, $K = \mathcal{O}(\log(1/\epsilon))$, $B = \mathcal{O}(\epsilon^{-2})$ and $b = \mathcal{O}(\epsilon^{-2})$, then FWPG-GAIL converges to an ε-accurate globally optimal value with overall sample complexity $T(KB + b) = \tilde{\mathcal{O}}(\epsilon^{-4})$, which is the same as that of PPG-GAIL. The analysis of FWPG-GAIL also needs to bound the two bias terms due to the Markovian sampling for updating the reward and policy parameters. This is the first analysis that provides the convergence rate for stochastic Frank-Wolfe gradient with non-i.i.d. sampling.

In this section, we study the TRPO-GAIL algorithm, which takes the general framework in Algorithm 1 and updates the policy parameter θ based on TRPO under the λ-regularized MDP. At each time t of the outer loop, TRPO-GAIL adopts the update rule in Shani et al. (2020) for updating θ_t as follows:

$$\pi_{\theta_{t+1}}(\cdot|s) \in \mathop{\mathrm{argmin}}_{\pi\in\Delta_{\mathcal{A}}}\ \big\langle -\hat Q^{\pi_{\theta_t}}_{\lambda,\alpha_t}(s,\cdot) + \lambda\nabla\omega(\pi_{\theta_t}(\cdot|s)),\ \pi - \pi_{\theta_t}(\cdot|s)\big\rangle + \eta_t^{-1}\,B_\omega\big(\pi,\pi_{\theta_t}(\cdot|s)\big),$$

where $\hat Q^{\pi_{\theta_t}}_{\lambda,\alpha_t}$ denotes the estimate of the Q-function based on EstQ Zhang et al. (2019) (see Appendix A), the regularized reward is $r_{\lambda,\alpha_t}(s,a) := r_{\alpha_t}(s,a) + \lambda\omega(\pi_\theta(\cdot|s))$, the negative entropy function is $\omega(\pi(\cdot|s)) := \sum_{a\in\mathcal{A}}\pi(a|s)\log\pi(a|s) + \log|\mathcal{A}|$, and the Bregman distance $B_\omega(x,y) := \omega(x) - \omega(y) - \langle\nabla\omega(y), x-y\rangle$ associated with ω(x) is the KL-divergence here. We consider the direct parameterization for the policy, and hence the update for the policy parameter θ can be analytically computed Shani et al. (2020) as follows. For each (s, a) ∈ S × A,

$$\theta_{t+1}(s,a) = \frac{\theta_t(s,a)\exp\big(\eta_t(\hat Q^{\pi_{\theta_t}}_{\lambda,\alpha_t}(s,a) - \lambda\log\theta_t(s,a))\big)}{\sum_{a'\in\mathcal{A}}\theta_t(s,a')\exp\big(\eta_t(\hat Q^{\pi_{\theta_t}}_{\lambda,\alpha_t}(s,a') - \lambda\log\theta_t(s,a'))\big)}. \qquad (7)$$
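A minimal sketch of the closed-form update in eq. (7) is given below (assuming the current policy table has strictly positive entries and that an estimate q_hat of $\hat Q^{\pi_{\theta_t}}_{\lambda,\alpha_t}$ is already available, e.g., from EstQ in Appendix A).

```python
import numpy as np

def trpo_step(theta, q_hat, eta, lam):
    """Closed-form TRPO / mirror-descent update of eq. (7) under direct
    parameterization. theta (current policy, strictly positive entries) and q_hat
    (estimated regularized Q-values) are |S| x |A| arrays; lam = 0 recovers the
    unregularized update."""
    logits = eta * (q_hat - lam * np.log(theta))
    weights = theta * np.exp(logits - logits.max(axis=-1, keepdims=True))  # shift for stability
    return weights / weights.sum(axis=-1, keepdims=True)
```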
The following theorem provides the global convergence of TRPO-GAIL under the unregularized MDP, where λ = 0.

Theorem 3. Suppose Assumptions 1 to 4 hold. Consider unregularized TRPO-GAIL (λ = 0) with the θ-update stepsize $\eta_t = \tfrac{1-\gamma}{\sqrt{T}}$ and the α-update stepsize $\beta = \tfrac{\mu}{L_4^2}$, where L₄ is given in Proposition 1. Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[g(\theta_t)] - g(\theta^\ast) \le \mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{T}}\Big) + \mathcal{O}\big(e^{-(1-\gamma)K}\big) + \mathcal{O}\Big(\frac{1}{(1-\gamma)B}\Big).$$
We further consider the regularized MDP, where λ > 0.

Theorem 4.
Suppose Assumptions 1 to 4 hold. Consider regularized TRPO-GAIL (λ > 0) with the θ-update stepsize $\eta_t = \tfrac{1}{\lambda(t+2)}$ and the α-update stepsize $\beta = \tfrac{\mu}{L_4^2}$, where L₄ is given in Proposition 1. Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[g(\theta_t)] - g(\theta^\ast) \le \tilde{\mathcal{O}}\Big(\frac{1}{(1-\gamma)T}\Big) + \mathcal{O}\big(e^{-(1-\gamma)K}\big) + \mathcal{O}\Big(\frac{1}{(1-\gamma)B}\Big).$$

Theorem 3 indicates that if we set $T = \mathcal{O}(\epsilon^{-2})$, $K = \mathcal{O}(\log(1/\epsilon))$ and $B = \mathcal{O}(\epsilon^{-1})$, then TRPO-GAIL with unregularized MDP converges to an ε-accurate globally optimal value with a total sample complexity $TKB = \tilde{\mathcal{O}}(\epsilon^{-3})$. Theorem 4 indicates that if we let $T = \tilde{\mathcal{O}}(\epsilon^{-1})$, $K = \mathcal{O}(\log(1/\epsilon))$, and $B = \mathcal{O}(\epsilon^{-1})$, then TRPO-GAIL with regularized MDP converges to an ε-accurate globally optimal value with an overall sample complexity $TKB = \tilde{\mathcal{O}}(\epsilon^{-2})$. The regularized MDP changes the objective function with a λ-regularized perturbation and yields an orderwise better sample complexity. Moreover, the sample complexity here is with respect to the convergence in expectation, which improves upon the high-probability convergence in Shani et al. (2020) by a factor of $\tilde{\mathcal{O}}(1/\epsilon)$.

In this section, we study the NPG-GAIL algorithm, which takes the general framework in Algorithm 1 and updates the policy parameter θ based on natural policy gradient (NPG).

We consider the general nonlinear parameterization for the policy, so that the state space may not be finite and, for example, can be R^d. At each time t of the outer loop, NPG-GAIL ideally should update θ_t via a regularized natural gradient $-(F(\theta_t) + \lambda I)^{-1}\nabla_\theta V(\pi_{\theta_t}, r_{\alpha_t})$, where $F(\theta) = \mathbb{E}_{(s,a)\sim\nu_{\pi_\theta}}\big[\nabla_\theta\log\pi_\theta(a|s)\,\nabla_\theta\log\pi_\theta(a|s)^\top\big]$ is the Fisher information matrix, and λ is the regularization coefficient for avoiding singularity. In practice, we estimate such a natural gradient via solving the problem

$$\min_{w\in\mathbb{R}^d}\ \mathbb{E}_{(s,a)\sim\nu_{\pi_\theta}}\Big[\big(\nabla_\theta\log\pi_\theta(a|s)^\top w - A^{\pi_\theta}_\alpha(s,a)\big)^2\Big]$$

using the mini-batch linear stochastic approximation (SA) algorithm over a Markovian sampled trajectory, where $A^{\pi_\theta}_\alpha(s,a) := Q^{\pi_\theta}_\alpha(s,a) - V^{\pi_\theta}_\alpha(s)$ is the advantage function under reward r_α. More details are provided in Algorithm 3 in Appendix A. Suppose such an algorithm provides an output w_t. Then the policy parameter is updated as

$$\theta_{t+1} = \theta_t - \eta\, w_t. \qquad (8)$$
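The following sketch illustrates this NPG update, assuming that sampled score features (rows of $\nabla_\theta\log\pi_\theta(a|s)$) and advantage estimates are already available as arrays; a closed-form regularized least-squares solve is used here purely for illustration in place of the paper's mini-batch SA iteration (Algorithm 3).

```python
import numpy as np

def npg_direction(score_feats, advantages, lam):
    """Estimate the regularized natural-gradient direction w by least squares:
    minimize E[(score^T w - A)^2] + lam * ||w||^2, i.e.
    w = (S^T S / n + lam * I)^{-1} (S^T A / n),
    where each row of score_feats is grad_theta log pi_theta(a|s) at a sampled pair."""
    n, d = score_feats.shape
    gram = score_feats.T @ score_feats / n + lam * np.eye(d)
    target = score_feats.T @ advantages / n
    return np.linalg.solve(gram, target)

def npg_step(theta, score_feats, advantages, eta, lam):
    """One NPG-GAIL policy update (eq. (8)): theta <- theta - eta * w."""
    w = npg_direction(score_feats, advantages, lam)
    return theta - eta * w
```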
Since we take the general nonlinear parameterization for the policy, we make the following assumptions for the policy parameterization, which are standard in the literature Agarwal et al. (2019); Kumar et al. (2019); Xu et al. (2020b); Zhang et al. (2019).

Assumption 5. For any θ, θ′ ∈ Θ and any state-action pair (s, a) ∈ S × A, there exist positive constants L_π, L_φ, C_φ and C_π such that the following bounds hold:

(1) ‖∇_θ log π_θ(a|s) − ∇_θ log π_{θ′}(a|s)‖₂ ≤ L_φ ‖θ − θ′‖₂,
(2) ‖∇_θ log π_θ(a|s)‖₂ ≤ C_φ,
(3) ‖π_θ(·|s) − π_{θ′}(·|s)‖_{TV} ≤ C_π ‖θ − θ′‖₂,

where ‖·‖_{TV} denotes the total-variation norm.
Next, we provide the following theorem, which characterizes the global convergence of NPG-GAIL.
Theorem 5.
Suppose Assumptions 1 to 5 hold. Consider NPG-GAIL with the θ-update stepsize $\eta = \tfrac{1-\gamma}{\sqrt{T}}$, the α-update stepsize $\beta = \tfrac{\mu}{L_4^2}$, and the SA-update stepsize $\beta_W = \tfrac{\lambda}{(C_\phi^2+\lambda)^2}$, where L₄ is given in Proposition 1. Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[g(\theta_t)]-g(\theta^\ast) \le \mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{T}}\Big)+\mathcal{O}\big(e^{-(1-\gamma)K}\big)+\mathcal{O}\Big(\frac{1}{(1-\gamma)B}\Big)+\mathcal{O}\big(e^{-T_c}\big)+\mathcal{O}\Big(\frac{\zeta'}{(1-\gamma)^{3/2}}\Big)+\mathcal{O}\Big(\frac{\lambda}{1-\gamma}\Big)+\mathcal{O}\Big(\frac{1}{(1-\gamma)\sqrt{M}}\Big),$$

where $\zeta' = \max_{\theta\in\Theta,\alpha\in\Lambda}\min_{w\in\mathbb{R}^d}\sqrt{\mathbb{E}_{\nu_{\pi_\theta}}\big[(\nabla_\theta\log\pi_\theta(a|s)^\top w - A^{\pi_\theta}_\alpha(s,a))^2\big]}$, and T_c and M are defined in Algorithm 3 in Appendix A.

Theorem 5 indicates that if we set $T = \mathcal{O}(\epsilon^{-2})$, $K = \mathcal{O}(\log(1/\epsilon))$, $B = \mathcal{O}(\epsilon^{-1})$, $T_c = \mathcal{O}(\log(1/\epsilon))$, $\lambda = \mathcal{O}(\zeta')$ and $M = \mathcal{O}(\epsilon^{-2})$, then NPG-GAIL converges to an $(\epsilon + \mathcal{O}(\zeta'))$-accurate globally optimal value with an overall sample complexity of $T(KB + T_c M) = \tilde{\mathcal{O}}(\epsilon^{-4})$, which is the same as PPG-GAIL and FWPG-GAIL. Comparison of Theorem 3 and Theorem 5 indicates that TRPO-GAIL has a better sample complexity than NPG-GAIL, mainly because TRPO can update the policy parameter in an analytical form, which saves the samples that NPG uses for estimating the natural gradient by solving the quadratic optimization problem.

Conclusion

In this paper, we study four GAIL algorithms, each of which is implemented in an alternating fashion between a popular policy gradient algorithm for the policy update and a gradient ascent algorithm for the reward update. Our focus is on investigating whether the incorporation of these policy gradient algorithms into the GAIL framework still enjoys a global convergence guarantee. We show that all these GAIL algorithms converge globally as long as the objective function is properly regularized (to be strongly concave) with respect to the reward parameter. We also anticipate that the analysis tools that we develop here will benefit future theoretical studies of similar problems, including GANs, min-max optimization, and bi-level optimization algorithms.
Acknowledgments
The work was supported in part by the U.S. National Science Foundation under the grants CCF-1761506, CCF-1801846, and CCF-1909291.

References
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
Beck, A. (2017). First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA.
Cai, Q., Hong, M., Chen, Y., and Wang, Z. (2019). On the global convergence of imitation learning: A case for linear quadratic regulator. arXiv preprint arXiv:1901.03674.
Chen, M., Wang, Y., Liu, T., Yang, Z., Li, X., Wang, Z., and Zhao, T. (2020). On computation and generalization of generative adversarial imitation learning. In Proc. International Conference on Learning Representations (ICLR).
Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016). A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852.
Fu, J., Luo, K., and Levine, S. (2018). Learning robust rewards with adversarial inverse reinforcement learning. In Proc. International Conference on Learning Representations (ICLR).
Ghasemipour, S. K. S., Zemel, R., and Gu, S. (2019). A divergence minimization perspective on imitation learning methods. In Proc. Conference on Robot Learning (CoRL).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems (NIPS).
Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Proc. Advances in Neural Information Processing Systems (NIPS).
Jin, C., Netrapalli, P., and Jordan, M. I. (2019). What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618.
Ke, L., Barnes, M., Sun, W., Lee, G., Choudhury, S., and Srinivasa, S. (2019). Imitation learning as f-divergence minimization. arXiv preprint arXiv:1905.12888.
Konda, V. (2002). Actor-critic algorithms. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
Kumar, H., Koppel, A., and Ribeiro, A. (2019). On the sample complexity of actor-critic method for reinforcement learning with function approximation. arXiv preprint arXiv:1910.08412.
Laskey, M., Lee, J., Hsieh, W., Liaw, R., Mahler, J., Fox, R., and Goldberg, K. (2017). Iterative noise injection for scalable imitation learning. In Proc. 1st Conference on Robot Learning (CoRL).
Lin, T., Jin, C., and Jordan, M. (2020). Near-optimal algorithms for minimax optimization. arXiv preprint arXiv:2002.02417.
Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
Ng, A. Y. and Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proc. International Conference on Machine Learning (ICML).
Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., and Razaviyayn, M. (2019). Solving a class of non-convex min-max games using iterative first order methods. In Proc. Advances in Neural Information Processing Systems (NeurIPS).
Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. In Proc. Advances in Neural Information Processing Systems (NIPS).
Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., and Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179.
Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97.
Ross, S. and Bagnell, D. (2010). Efficient reductions for imitation learning. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS).
Ross, S., Gordon, G. J., and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS).
Russell, S. (1998). Learning agents for uncertain environments. In Proc. Eleventh Annual Conference on Computational Learning Theory.
Shani, L., Efroni, Y., and Mannor, S. (2020). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proc. AAAI Conference on Artificial Intelligence (AAAI).
Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
Xu, T., Wang, Z., and Liang, Y. (2020a). Improving sample complexity bounds for actor-critic algorithms. arXiv preprint arXiv:2004.12956.
Xu, T., Wang, Z., Zhou, Y., and Liang, Y. (2020b). Reanalysis of variance reduced temporal difference learning. arXiv preprint arXiv:2001.01898.
Zhang, K., Koppel, A., Zhu, H., and Başar, T. (2019). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.
Zhang, Y., Cai, Q., Yang, Z., and Wang, Z. (2020). Generative adversarial imitation learning with neural networks: Global optimality and convergence rate. arXiv preprint arXiv:2003.03709.

Supplementary Materials

A Q-sampling and NPG-GAIL Algorithms
In this section, we provide the formal description of the algorithm EstQ in Algorithm 2, which returns an unbiased estimate of the state-action value function (Q-value), and the policy-update algorithm of NPG-GAIL in Algorithm 3.
Algorithm 2
EstQ Zhang et al. (2019)
Input: s, a, θ
Initialize Q̂ = 0, s_0^q = s, a_0^q = a
Draw T ∼ Geom(1 − γ^{1/2})
for t = 0, 1, ..., T − 1 do
    Collect reward R(s_t^q, a_t^q) and update the Q-function: Q̂ ← Q̂ + γ^{t/2} R(s_t^q, a_t^q)
    Sample s_{t+1}^q ∼ P(·|s_t^q, a_t^q) and a_{t+1}^q ∼ π_θ(·|s_{t+1}^q)
end for
Collect reward R(s_T^q, a_T^q) and update the Q-function: Q̂ ← Q̂ + γ^{T/2} R(s_T^q, a_T^q)
Output: Q̂^{π_θ} ← Q̂
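For illustration, a minimal Python sketch of this estimator is given below (the callables env_step, policy_sample, and reward_fn are hypothetical stand-ins for P(·|s, a), π_θ(·|s), and r_α(s, a); the geometric draw is shifted to {0, 1, 2, ...} so that the γ^{t/2} weights yield an unbiased estimate).

```python
import numpy as np

def est_q(env_step, policy_sample, reward_fn, s, a, gamma, rng):
    """Sketch of EstQ (Algorithm 2): roll out for a geometric random horizon and
    weight rewards by gamma^{t/2}. The shifted geometric horizon satisfies
    P(T >= t) = gamma^{t/2}, so E[sum_t gamma^{t/2} r_t] = sum_t gamma^t r_t."""
    horizon = int(rng.geometric(1.0 - np.sqrt(gamma))) - 1   # support {0, 1, 2, ...}
    q_hat, state, action = 0.0, s, a
    for t in range(horizon + 1):
        q_hat += gamma ** (t / 2.0) * reward_fn(state, action)
        state = env_step(state, action)        # s_{t+1} ~ P(.|s_t, a_t)
        action = policy_sample(state)          # a_{t+1} ~ pi_theta(.|s_{t+1})
    return q_hat
```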
Policy update in NPG-GAIL
Input:
Policy parameter θ_t, reward parameter α_t, SA stepsize β_W, policy stepsize η, batch size M and trajectory length T_c
for i = 0, ..., M T_c do
    Sample s_i ∼ P̃(·|s_{i−1}, a_{i−1})
    Sample a_i and a′_i independently from π_{θ_t}(·|s_i)
end for
Initialize W_0 = 0
for k = 0, ..., T_c − 1 do
    for i = kM, ..., (k+1)M − 1 do
        Obtain the Q-function estimate Q̂(s_i, a_i) with reward function r_{α_t} by Algorithm 2
        $\hat g_i = \big(-\nabla_{\theta_t}\log\pi_{\theta_t}(a_i|s_i)^\top W_k + \hat Q(s_i,a_i)\big)\nabla_{\theta_t}\log\pi_{\theta_t}(a_i|s_i) - \hat Q(s_i,a_i)\nabla_{\theta_t}\log\pi_{\theta_t}(a'_i|s_i) - \lambda W_k$
    end for
    $\hat G_k = \frac{1}{M}\sum_{i=kM}^{(k+1)M-1}\hat g_i$
    $W_{k+1} = W_k + \beta_W\hat G_k$
end for
w_t = W_{T_c}
Return: θ_{t+1} = θ_t − η w_t

B Proof of Proposition 1
In this section, we first provide two useful lemmas, which establish the smoothness properties of the visitation distribution and the Q-function.
Lemma 1. ((Xu et al., 2020a, Lemma 3)) Consider the initial distribution ξ(·) and the transition kernel P(·|s, a). Let ξ(·) be ζ(·) or P(·|ŝ, â) for any given ŝ ∈ S, â ∈ A. Denote ν_{π_θ,ξ} as the state-action visitation distribution of the MDP with policy π_θ and initialization distribution ξ. Suppose Assumption 3 holds. Then we have, under direct parameterization, for any θ₁, θ₂ ∈ Θ_p,

$$\|\nu_{\pi_{\theta_1},\xi} - \nu_{\pi_{\theta_2},\xi}\|_{TV} \le C_\nu \|\theta_1-\theta_2\|_2,$$

where $C_\nu = \sqrt{|\mathcal{A}|}\big(\lceil\log_\rho C_M^{-1}\rceil + (1-\rho)^{-1}\big)$.

Lemma 2. ((Xu et al., 2020a, Lemma 4)) Suppose Assumptions 3 and 4 hold. Let Q^π_α denote the Q-function of policy π under the reward function r_α. For any state-action pair (s, a) ∈ S × A, α ∈ Λ and θ₁, θ₂ ∈ Θ_p (under direct parameterization), we have

$$\big|Q^{\pi_{\theta_1}}_\alpha(s,a) - Q^{\pi_{\theta_2}}_\alpha(s,a)\big| \le L_Q\|\theta_1-\theta_2\|_2,$$

where $L_Q = \frac{C_r C_\alpha C_\nu}{1-\gamma}$ and C_ν is defined in Lemma 1.

Denote d_π(s) = (1 − γ) Σ_{t=0}^∞ γ^t P{s_t = s | π} as the state visitation distribution induced by policy π. We next prove Proposition 1 to characterize the Lipschitz constants L₁, L₂, L₃ and L₄, respectively.

Proof of Proposition 1.
We consider the first inequality in Proposition 1: k∇ θ F ( θ , α ) − ∇ θ F ( θ , α ) k = k∇ θ F ( θ , α ) − ∇ θ F ( θ , α ) + ∇ θ F ( θ , α ) − ∇ θ F ( θ , α ) k ≤ k∇ θ F ( θ , α ) − ∇ θ F ( θ , α ) k | {z } T + k∇ θ F ( θ , α ) − ∇ θ F ( θ , α ) k | {z } T . (9)Next, we upper-bound the terms T and T in eq. (9), respectively. Upper-bounding T : For any given state-action pair ( s, a ) ∈ S × A , we have (cid:12)(cid:12)(cid:12) ( ∇ θ F ( θ , α ) − ∇ θ F ( θ , α )) s,a (cid:12)(cid:12)(cid:12) ( i ) = (cid:12)(cid:12)(cid:12)(cid:12) − γ (cid:0) d π θ ( s ) Q π θ α ( s, a ) − d π θ ( s ) Q π θ α ( s, a ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) − γ (cid:0) ( d π θ ( s ) − d π θ ( s )) Q π θ α ( s, a ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) − γ (cid:0) d π θ ( s )( Q π θ α ( s, a ) − Q π θ α ( s, a )) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) ( ii ) ≤ R max (1 − γ ) | d π θ ( s ) − d π θ ( s ) | + L Q − γ d π θ ( s ) k θ − θ k , (10)where ( i ) follows from the fact that ∂F ( θ,α ) ∂θ s,a = − ∂V ( π θ ,α ) ∂θ s,a = − − γ d π θ ( s ) Q π θ α ( s, a ) , and ( ii ) follows fromLemma 2. Then, we proceed as follows: k∇ θ F ( θ , α ) − ∇ θ F ( θ , α ) k = sX s,a (cid:12)(cid:12)(cid:12) ( ∇ θ F ( θ , α ) − ∇ θ F ( θ , α )) s,a (cid:12)(cid:12)(cid:12) i ) ≤ vuutX s,a (cid:18) R max (1 − γ ) (cid:12)(cid:12) d π θ ( s ) − d π θ ( s ) (cid:12)(cid:12) + L Q − γ d π θ ( s ) k θ − θ k (cid:19) ≤ p |A| vuutX s (cid:18) R max (1 − γ ) (cid:12)(cid:12) d π θ ( s ) − d π θ ( s ) (cid:12)(cid:12)(cid:19) + p |A| vuutX s (cid:18) L Q − γ d π θ ( s ) k θ − θ k (cid:19) ii ) ≤ p |A| X s R max (1 − γ ) (cid:12)(cid:12) d π θ ( s ) − d π θ ( s ) (cid:12)(cid:12) + X s L Q − γ d π θ ( s ) k θ − θ k ! ( iii ) ≤ √ |A| C r C α (1 − γ ) (cid:0) (cid:6) log ρ C − M (cid:7) + (1 − ρ ) − (cid:1) k θ − θ k , where ( i ) follows from eq. (10), ( ii ) follows from the fact that k x k ≤ k x k , and ( iii ) follows from Lemma 1and from the facts R max ≤ C r C α and X s ∈S (cid:12)(cid:12) d π θ ( s ) − d π θ ( s ) (cid:12)(cid:12) = 2 (cid:13)(cid:13) d π θ − d π θ (cid:13)(cid:13) T V ≤ (cid:13)(cid:13) ν π θ − ν π θ (cid:13)(cid:13) T V . pper-bounding T : For any given state-action pair ( s, a ) ∈ S × A , we have (cid:12)(cid:12)(cid:12) ( ∇ θ F ( θ , α ) − ∇ θ F ( θ , α )) s,a (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) − γ (cid:0) d π θ ( s ) Q π θ α ( s, a ) − d π θ ( s ) Q π θ α ( s, a ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) ( i ) = 11 − γ d π θ ( s ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − γ X ˆ s, ˆ a ν π θ ,s,a (ˆ s, ˆ a )( r α (ˆ s, ˆ a ) − r α (ˆ s, ˆ a )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( ii ) ≤ − γ ) d π θ ( s ) C r k α − α k , where in ( i ) we denote ν π θ ,s,a (ˆ s, ˆ a ) as the visitation distribution of the Markov chain with initial distri-bution P ( ·| s = s, a = a ) and policy π θ , and ( ii ) follows from the fact that | r α (ˆ s, ˆ a ) − r α (ˆ s, ˆ a ) | = | h∇ α r α ′ (ˆ s, ˆ a ) , α − α i | ≤ k∇ α r α ′ (ˆ s, ˆ a ) k k α − α k ≤ C r k α − α k , for some α ′ ∈ [ α , α ] . 
The inequal-ity above implies that k∇ θ F ( θ , α ) − ∇ θ F ( θ , α ) k = sX s,a (cid:12)(cid:12)(cid:12) ( ∇ θ F ( θ , α ) − ∇ θ F ( θ , α )) s,a (cid:12)(cid:12)(cid:12) ≤ vuutX s,a (cid:18) − γ ) d π θ ( s ) C r k α − α k (cid:19) = p |A| C r (1 − γ ) k α − α k sX s (cid:0) d π θ ( s ) (cid:1) i ) ≤ p |A| C r (1 − γ ) k α − α k , where ( i ) follows from the fact that qP s (cid:0) d π θ ( s ) (cid:1) ≤ (cid:13)(cid:13) d π θ (cid:13)(cid:13) = 1 .Therefore we obtain the upper bound of eq. (9) as follows: k∇ θ F ( θ , α ) − ∇ θ F ( θ , α ) k ≤ √ |A| C r C α (1 − γ ) (cid:0) (cid:6) log ρ C − M (cid:7) + (1 − ρ ) − (cid:1) k θ − θ k + p |A| C r (1 − γ ) k α − α k , which determines the constants L and L .We then proceed to prove the second inequality in Proposition 1. k∇ α F ( θ , α ) − ∇ α F ( θ , α ) k ≤ k∇ α F ( θ , α ) − ∇ α F ( θ , α ) + ∇ α F ( θ , α ) − ∇ α F ( θ , α ) k ≤ k∇ α F ( θ , α ) − ∇ α F ( θ , α ) k | {z } T + k∇ α F ( θ , α ) − ∇ α F ( θ , α ) k | {z } T . (11)Next, we upper-bound T and T in eq. (11), respectively. Upper-bounding T : For any given ≤ i ≤ q , we have | ( ∇ α F ( θ , α ) − ∇ α F ( θ , α )) i | = | ( ∇ α V ( π E , r α ) − ∇ α V ( π θ , r α ) − ∇ α ψ ( α ) − ( ∇ α V ( π E , r α ) − ∇ α V ( π θ , r α ) − ∇ α ψ ( α ))) i | = | ( ∇ α V ( π θ , r α ) − ∇ α V ( π θ , r α )) i | = 11 − γ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X s,a ( ν π θ ( s, a ) − ν π θ ( s, a ))( ∇ α r α ) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:13)(cid:13) ν π θ − ν π θ (cid:13)(cid:13) (cid:13)(cid:13)(cid:13) ∂r α ∂α i (cid:13)(cid:13)(cid:13) ∞ − γ ( i ) ≤ C ν k θ − θ k (cid:13)(cid:13)(cid:13) ∂r α ∂α i (cid:13)(cid:13)(cid:13) ∞ − γ , ( i ) follows from Lemma 1 and the fact that k p − q k = 2 k p − q k T V . The inequality above furtherimplies that k∇ α F ( θ , α ) − ∇ α F ( θ , α ) k ≤ C ν k θ − θ k − γ vuut q X i =1 (cid:13)(cid:13)(cid:13)(cid:13) ∂r α ∂α i (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ C r p |A| − γ (cid:0) (cid:6) log ρ C − M (cid:7) + (1 − ρ ) − (cid:1) k θ − θ k . Upper-bounding T : We provide a proof for the general parameterization of policy, which includes thedirect parameterization of policy as a special case and covers the last claim of Proposition 1. We proceed asfollows: k∇ α F ( θ , α ) − ∇ α F ( θ , α ) k ≤ k∇ α V ( π E , r α ) − ∇ α V ( π θ , r α ) − ∇ α ψ ( α ) − ( ∇ α V ( π E , r α ) − ∇ α V ( π θ , r α ) − ∇ α ψ ( α )) k ≤ − γ (cid:0)(cid:13)(cid:13)R ( ∇ α r α − ∇ α r α ) dν π E (cid:13)(cid:13) + (cid:13)(cid:13)R ( ∇ α r α − ∇ α r α ) dν π θ (cid:13)(cid:13) (cid:1) + k∇ α ψ ( α ) − ∇ α ψ ( α ) k = − γ (cid:18)qP qi =1 (cid:0)R ( ∇ α r α ( s, a ) − ∇ α r α ( s, a )) i dν π E (cid:1) + qP qi =1 (cid:0)R ( ∇ α r α ( s, a ) − ∇ α r α ( s, a )) i dν π θ (cid:1) (cid:19) + k∇ α ψ ( α ) − ∇ α ψ ( α ) k i ) ≤ (cid:18) √ qL r − γ + L ψ (cid:19) k α − α k , where ( i ) follows from Assumption 1 and further because for any ( s, a ) and i , we have | ( ∇ α r α ( s, a ) − ∇ α r α ( s, a )) i | ≤ k∇ α r α ( s, a ) − ∇ α r α ( s, a ) k ≤ L r k α − α k . Therefore, we obtain the following upper bound in eq. (11) k∇ α F ( θ , α ) − ∇ α F ( θ , α ) k ≤ C r p |A| − γ (cid:0) (cid:6) log ρ C − M (cid:7) + (1 − ρ ) − (cid:1) k θ − θ k + (cid:18) √ qL r − γ + L ψ (cid:19) k α − α k , which determines L and L . C Proof of Proposition 2
We define θ_op(α) := argmin_{θ∈Θ_p} F(θ, α). If there exist multiple optimal points, then θ_op(α) can be any optimal point.

We first provide a lemma, which characterizes the gradient dominance property of the function F(θ, α) with a fixed reward parameter α.

Lemma 3. ((Agarwal et al., 2019, Lemma 4.1)) For any given α ∈ Λ, F(θ, α) defined in eq. (1) with direct parameterization satisfies

$$F(\theta,\alpha) - F(\theta_{\mathrm{op}}(\alpha),\alpha) \le C_d \max_{\tilde\theta\in\Theta_p}\big\langle \theta-\tilde\theta,\, \nabla_\theta F(\theta,\alpha)\big\rangle,$$

where $C_d = \frac{1}{(1-\gamma)\min_s\{\zeta(s)\}}$.

We then provide the proof of Proposition 2.

Proof of Proposition 2.
We proceed as follows:

$$
\begin{aligned}
g(\theta) - g(\theta^\ast) &= F(\theta,\alpha_{\mathrm{op}}(\theta)) - F(\theta^\ast,\alpha_{\mathrm{op}}(\theta^\ast))\\
&= F(\theta,\alpha_{\mathrm{op}}(\theta)) - F(\theta_{\mathrm{op}}(\alpha_{\mathrm{op}}(\theta)),\alpha_{\mathrm{op}}(\theta)) + F(\theta_{\mathrm{op}}(\alpha_{\mathrm{op}}(\theta)),\alpha_{\mathrm{op}}(\theta)) - F(\theta^\ast,\alpha_{\mathrm{op}}(\theta^\ast))\\
&\overset{(i)}{\le} F(\theta,\alpha_{\mathrm{op}}(\theta)) - F(\theta_{\mathrm{op}}(\alpha_{\mathrm{op}}(\theta)),\alpha_{\mathrm{op}}(\theta))\\
&\overset{(ii)}{\le} C_d\max_{\bar\theta\in\Theta_p}\big\langle\theta-\bar\theta,\nabla_\theta F(\theta,\alpha_{\mathrm{op}}(\theta))\big\rangle
\overset{(iii)}{=} C_d\max_{\bar\theta\in\Theta_p}\big\langle\theta-\bar\theta,\nabla g(\theta)\big\rangle,
\end{aligned}
$$

where (i) follows from the fact that

$$F(\theta_{\mathrm{op}}(\alpha_{\mathrm{op}}(\theta)),\alpha_{\mathrm{op}}(\theta)) - F(\theta^\ast,\alpha_{\mathrm{op}}(\theta^\ast)) = \underbrace{F(\theta_{\mathrm{op}}(\alpha_{\mathrm{op}}(\theta)),\alpha_{\mathrm{op}}(\theta)) - F(\theta^\ast,\alpha_{\mathrm{op}}(\theta))}_{\le 0} + \underbrace{F(\theta^\ast,\alpha_{\mathrm{op}}(\theta)) - F(\theta^\ast,\alpha_{\mathrm{op}}(\theta^\ast))}_{\le 0} \le 0,$$

(ii) follows from Lemma 3, and (iii) follows because ∇g(θ) = ∇_θ F(θ, α)|_{α=α_op(θ)}.

D Supporting Lemmas for GAIL Framework
In this section, we establish two supporting lemmas that are useful for the proof of our main theorems.
Lemma 4.
Suppose Assumption 3 holds. Consider the gradient approximation in the nested-loop GAIL framework (Algorithm 1). For any k and t with 0 ≤ k ≤ K − 1 and 0 ≤ t ≤ T − 1, we have

$$\mathbb{E}\Big[\big\|\hat\nabla_\alpha F(\theta_t,\alpha_t^k) - \nabla_\alpha F(\theta_t,\alpha_t^k)\big\|_2^2\Big] \le \frac{C_r^2}{(1-\gamma)^2}\Big(1+\frac{C_M}{1-\rho}\Big)\frac{1}{B}.$$
Proof of Lemma 4.
We denote d π ( s ) := (1 − γ ) P ∞ t =0 γ t P { s t = s } as the state visitation distribution ofthe Markov chain with initial distribution ζ ( · ) , transition kernel P ( ·| s, a ) and policy π . Both trajectories ( s E , a E , s E , a E , · · · , s Ei , a Ei ) and ( s θ , a θ , s θ , a θ , · · · , s Ei , a Ei ) are sampled under the transition kernel ˜ P ( ·| s, a ) = γ P ( ·| s, a ) + (1 − γ ) ζ ( · ) . Recall that it has been shown in Konda (2002) that the stationary distribution ofthe Markov chain with transition kernel and policy π is d π .By definition, we have, E (cid:20)(cid:13)(cid:13)(cid:13) b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) (cid:13)(cid:13)(cid:13) (cid:21) = E (cid:20)(cid:13)(cid:13)(cid:13) − γ ) B (cid:16)P B − i =0 ∇ α tk r α tk ( s Ei , a Ei ) − ∇ α tk r α tk ( s θi , a θi ) (cid:17) − − γ (cid:16) E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i − E ( s,a ) ∼ ν πθt h ∇ α tk r α tk ( s, a ) i(cid:17)(cid:13)(cid:13)(cid:13) (cid:21) ≤ − γ ) B E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) B − X i =0 (cid:16) ∇ α tk r α tk ( s Ei , a Ei ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i(cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | {z } T + − γ ) B E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) B − X i =0 (cid:16) ∇ α tk r α tk ( s θi , a θi ) − E ( s,a ) ∼ ν πθt h ∇ α tk r α tk ( s, a ) i(cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) | {z } T . (12)We first provide an upper bound on the term T in eq. (12), and proceed as follows: T = P B − i =0 E (cid:13)(cid:13)(cid:13) ∇ α tk r α tk ( s Ei , a Ei ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i(cid:13)(cid:13)(cid:13) P i = j E D ∇ α tk r α tk ( s Ei , a Ei ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i , ∇ α tk r α tk ( s Ej , a Ej ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) iE ≤ BC r + P i = j E D ∇ α tk r α tk ( s Ei , a Ei ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i , ∇ α tk r α tk ( s Ej , a Ej ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) iE (13)Define the filtration F i = σ ( s E , a E , s E , a E , · · · , s Ei , a Ei ) . We continue to bound the second term in eq. 
(13)as follows: E hD ∇ α tk r α tk ( s Ei , a Ei ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i , ∇ α tk r α tk ( s Ej , a Ej ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) iEi = E h E hD ∇ α tk r α tk ( s Ei , a Ei ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i , ∇ α tk r α tk ( s Ej , a Ej ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) iE(cid:12)(cid:12)(cid:12) F i ii = E hD ∇ α tk r α tk ( s Ei , a Ei ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i , E h ∇ α tk r α tk ( s Ej , a Ej ) (cid:12)(cid:12)(cid:12) F i i − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) iEi ≤ E h(cid:13)(cid:13)(cid:13) ∇ α tk r α tk ( s Ei , a Ei ) − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i(cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13) E h ∇ α tk r α tk ( s Ej , a Ej ) (cid:12)(cid:12)(cid:12) F i i − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i(cid:13)(cid:13)(cid:13) i ≤ C r E h(cid:13)(cid:13)(cid:13) E h ∇ α tk r α tk ( s Ej , a Ej ) (cid:12)(cid:12)(cid:12) F i i − E ( s,a ) ∼ ν πE h ∇ α tk r α tk ( s, a ) i(cid:13)(cid:13)(cid:13) i = C r E (cid:13)(cid:13)(cid:13)R s ∼ P ( s j ∈·| s Ei ,a Ei ) ,a ∼ π E ( ·| s ) ∇ α tk r α tk ( s, a ) dsda − R s ∼ χ θ ,a ∼ π E ( ·| s ) ∇ α tk r α tk ( s, a ) dsda (cid:13)(cid:13)(cid:13) = C r E rP ql =1 (cid:16)R s ∼ P ( s j ∈·| s Ei ,a Ei ) ,a ∼ π E ( ·| s ) ∂r α ∂α l | α = α tk ( s, a ) dsda − R s ∼ χ θ ,a ∼ π E ( ·| s ) ∂r α ∂α l | α = α tk ( s, a ) dsda (cid:17) ( i ) ≤ C r E vuut q X l =1 (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) ∂r α ∂α i (cid:13)(cid:13)(cid:13)(cid:13) ∞ d T V (cid:0) P ( s j ∈ ·| s i = s Ei , a i = a Ei ) , χ π E π E (cid:1)(cid:19) , (14)where ( i ) follows from the fact that | R f dµ − R f dν | ≤ k f k ∞ d T V ( µ, ν ) . We next derive a bound on the totalvariation distance in the above equation as follows. d T V (cid:0) P ( s j ∈ · , a j ∈ ·| s i = s Ei , a i = a Ei ) , χ π E π E (cid:1) = d T V (cid:0) P ( s j ∈ ·| s i = s Ei , a i = a Ei ) , χ π E (cid:1) = d T V (cid:18)Z s P ( s j ∈ ·| s i +1 = s ) d ˜ P ( s | s i = s Ei , a i = a Ei ) , χ π E (cid:19) ≤ Z s d T V ( P ( s j ∈ ·| s i +1 = s ) , χ π E ) d ˜ P ( s | s i = s Ei , a i = a Ei ) ( i ) ≤ Z s C M ρ j − i − d ˜ P ( s | s i = s Ei , a i = a Ei ) = C M ρ j − i − , (15)where ( i ) follows from Assumption 3. Substituting eq. (15) into eq. (14) and then further into eq. (13) yieldsthe following upper-bound on T T ≤ BC r + 2 B − X i =0 B − X j = i +1 C M C r ρ j − i − ≤ BC r (1 + C M − ρ ) . (16)By following steps similar to those from eqs. (13) to (16), we can show that T ≤ BC r (1 + C M − ρ ) . Therefore, we have E (cid:20)(cid:13)(cid:13)(cid:13) b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) (cid:13)(cid:13)(cid:13) (cid:21) ≤ C r (1 − γ ) (cid:18) C M − ρ (cid:19) B . emma 5. Suppose Assumptions 3 and 4 hold. Consider Algorithm 1 with α -update stepsize β = µ L . Forany ≤ t ≤ T − , we have E h(cid:13)(cid:13) α tK − α op ( θ t ) (cid:13)(cid:13) i ≤ C α e − µ L K + 48 C r µ (1 − γ ) (1 + C M − ρ ) 1 B .
Let K ≥ L µ log C α ∆ α and B ≥ C r µ (1 − γ ) (cid:16) C M − ρ (cid:17) α , we have E h k α tK − α op ( θ t ) k i ≤ ∆ α . The expectedtotal computational complexity is given by KB = O (cid:18) − γ ) ∆ α log (cid:18) α (cid:19)(cid:19) . Proof of Lemma 5.
We proceed as follows: (cid:13)(cid:13) α tk +1 − α op ( θ t ) (cid:13)(cid:13) i ) ≤ (cid:13)(cid:13)(cid:13) α tk + β b ∇ α F ( θ t , α tk ) − α op ( θ t ) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13) α tk − α op ( θ t ) (cid:13)(cid:13) + β (cid:13)(cid:13)(cid:13) b ∇ α F ( θ t , α tk ) (cid:13)(cid:13)(cid:13) + 2 β D b ∇ α F ( θ t , α tk ) , α tk − α op ( θ t ) E ( ii ) ≤ (cid:13)(cid:13) α tk − α op ( θ t ) (cid:13)(cid:13) + 2 β (cid:13)(cid:13) ∇ α F ( θ t , α tk ) (cid:13)(cid:13) + 2 β (cid:13)(cid:13)(cid:13) b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) (cid:13)(cid:13)(cid:13) + 2 β (cid:10) ∇ α F ( θ t , α tk ) , α tk − α op ( θ t ) (cid:11) + 2 β D b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) , α tk − α op ( θ t ) E ( iii ) ≤ (1 − βµ + 2 β L ) (cid:13)(cid:13) α tk − α op ( θ t ) (cid:13)(cid:13) + 2 β (cid:13)(cid:13)(cid:13) b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) (cid:13)(cid:13)(cid:13) + 2 β D b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) , α tk − α op ( θ t ) E ( iv ) ≤ (1 + 2 β L − µβ ) (cid:13)(cid:13) α tk − α op ( θ t ) (cid:13)(cid:13) + (2 β + β/µ ) (cid:13)(cid:13)(cid:13) b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) (cid:13)(cid:13)(cid:13) v ) ≤ (cid:18) − µ L (cid:19) (cid:13)(cid:13) α tk − α op ( θ t ) (cid:13)(cid:13) + 38 L (cid:13)(cid:13)(cid:13) b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) (cid:13)(cid:13)(cid:13) , (17)where ( i ) follows from the non-expansive property of the projection operator, ( ii ) follows because k A + B k ≤ k A k +2 k B k , ( iii ) follows from Proposition 1 and the fact h∇ α F ( θ t , α tk ) , α tk − α op ( θ t ) i ≤ − µ k α tk − α op ( θ t ) k , ( iv ) follows because h b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) , α tk − α op ( θ t ) i≤ µ (cid:13)(cid:13) α tk − α op ( θ t ) (cid:13)(cid:13) + 12 µ k b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) k , and ( v ) follows by letting β = µ L and because µ ≤ L .Applying eq. (17) recursively and using the fact − x ≤ e − x , we obtain (cid:13)(cid:13) α tK − α op ( θ t ) (cid:13)(cid:13) ≤ e − µ L K (cid:13)(cid:13) α t − α op ( θ t ) (cid:13)(cid:13) + 38 L K − X k =0 (cid:18) − µ L (cid:19) K − − k (cid:13)(cid:13)(cid:13) b ∇ α F ( θ t , α tk ) − ∇ α F ( θ t , α tk ) (cid:13)(cid:13)(cid:13) . Then, taking expectation on both sides of above inequality and applying Lemma 4 yield E h(cid:13)(cid:13) α tK − α op ( θ t ) (cid:13)(cid:13) i ≤ C α e − µ L K + 38 L K − X k =0 (cid:18) − µ L (cid:19) K − − k C r (1 − γ ) (1 + C M − ρ ) 1 B ≤ C α e − µ L K + 48 C r µ (1 − γ ) (1 + C M − ρ ) 1 B , which completes the proof. 18
E Proof of Theorems 1 and 2: Global Convergence of PPG-GAIL and FWPG-GAIL
In this section, we provide the proof of Theorems 1 and 2. We first provide three supporting lemmas. Specifically, Lemmas 6 and 7 establish the smoothness condition of the globally optimal α_op(θ) and the gradient ∇g(θ). A similar property has also been established in Lin et al. (2020); Nouiehed et al. (2019). Lemma 8 provides the upper bound on the bias and variance errors introduced by the stochastic gradient estimator of ∇_θ F(θ_t, α_t).

E.1 Supporting Lemmas
Lemma 6.
Suppose Assumptions 1 to 4 hold and the policy takes the direct parameterization specified in Section 2.2. We have ‖α_op(θ₁) − α_op(θ₂)‖₂ ≤ (L₃/μ)‖θ₁ − θ₂‖₂, where α_op(θ) is the unique global optimizer that satisfies α_op(θ) = argmax_{α∈Λ} F(θ, α).

Proof of Lemma 6. Since F(θ₁, α) is strongly concave in α, the following two inequalities hold for all α ∈ Λ:

$$F(\theta_1,\alpha_{\mathrm{op}}(\theta_1)) - F(\theta_1,\alpha) \ge \frac{\mu}{2}\|\alpha-\alpha_{\mathrm{op}}(\theta_1)\|_2^2, \qquad (18)$$
$$F(\theta_1,\alpha_{\mathrm{op}}(\theta_1)) - F(\theta_1,\alpha) \le \frac{\|\nabla_\alpha F(\theta_1,\alpha)\|_2^2}{2\mu}. \qquad (19)$$

In eqs. (18) and (19), letting α = α_op(θ₂) and using the gradient Lipschitz condition established in Proposition 1, we have

$$\frac{\mu}{2}\|\alpha_{\mathrm{op}}(\theta_2)-\alpha_{\mathrm{op}}(\theta_1)\|_2^2 \le \frac{\|\nabla_\alpha F(\theta_1,\alpha_{\mathrm{op}}(\theta_2))\|_2^2}{2\mu} \le \frac{L_3^2\|\theta_1-\theta_2\|_2^2}{2\mu},$$

which implies ‖α_op(θ₁) − α_op(θ₂)‖₂ ≤ (L₃/μ)‖θ₁ − θ₂‖₂.

Lemma 7.
Suppose Assumptions 1 to 4 hold and the policy takes the direct parameterization specified inSection 2.2. Then we have ∇ θ g ( θ ) = ∇ θ F ( θ, α ) | α = α op ( θ ) , and for any θ , θ ∈ Θ p , k∇ θ g ( θ ) − ∇ θ g ( θ ) k ≤ ( L + ( L L ) /µ ) k θ − θ k , where L , L and L are defined in Proposition 1.Proof of Lemma 7. Taking the directional derivative of g ( θ ) with respect to the direction ℓ , we have ∂g ( θ ) ∂ℓ = lim ǫ → g ( θ + ǫℓ ) − g ( θ ) ǫ = lim ǫ → F ( θ + ǫℓ, α op ( θ + ǫℓ )) − F ( θ, α op ( θ )) ǫ = lim ǫ → F ( θ + ǫℓ, α op ( θ + ǫℓ )) − F ( θ + ǫℓ, α op ( θ )) + F ( θ + ǫℓ, α op ( θ )) − F ( θ, α op ( θ )) ǫ ( i ) = lim ǫ → ℓ ⊤ ∇ α F ( θ, α ′ ǫ ) + ℓ ⊤ ∇ θ F ( θ, α op ( θ )) ( ii ) = ℓ ⊤ ∇ θ F ( θ, α op ( θ )) , (20)where α ′ ǫ in ( i ) is a point between α op ( θ + ǫℓ ) and α op ( θ ) , and ( ii ) follows from Lemma 6 and hence wehave lim ǫ → ∇ α F ( θ, α ′ ǫ ) = ∇ α F ( θ, α op ( θ )) = 0 . Since eq. (20) holds for all directions ℓ , we have ∇ θ g ( θ ) = ∇ θ F ( θ, α op ( θ )) .We then proceed to prove the gradient Lipschitz condition of g ( θ t ) . For any given θ , θ ∈ Θ p , we have k∇ θ g ( θ ) − ∇ θ g ( θ ) k k∇ θ F ( θ , α op ( θ )) − ∇ θ F ( θ , α op ( θ )) k = k∇ θ F ( θ , α op ( θ )) − ∇ θ F ( θ , α op ( θ )) + ∇ θ F ( θ , α op ( θ )) − ∇ θ F ( θ , α op ( θ )) k ≤ k∇ θ F ( θ , α op ( θ )) − ∇ θ F ( θ , α op ( θ )) k + k∇ θ F ( θ , α op ( θ )) − ∇ θ F ( θ , α op ( θ )) k ≤ L k α op ( θ ) − α op ( θ ) k + L k θ − θ k i ) ≤ ( L + L L µ ) k θ − θ k , where ( i ) follows from Lemma 6. Lemma 8.
Suppose Assumption 3 holds. For the policy gradient estimation specified in eq. (3) , in eachiteration t , ≤ t ≤ T − , we have E (cid:20)(cid:13)(cid:13)(cid:13) b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) (cid:13)(cid:13)(cid:13) (cid:21) ≤ |A| R max (1 − γ / ) (1 − γ ) (cid:18) C M ρ − ρ (cid:19) b . Let the sample trajectory size b ≥ |A| R max (1 − γ / ) (1 − γ ) (cid:16) C M ρ − ρ (cid:17) θ , we have E (cid:20)(cid:13)(cid:13)(cid:13) b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) (cid:13)(cid:13)(cid:13) (cid:21) ≤ ∆ θ . Proof of Lemma 8.
We define the vector g i ∈ R |S|·|A| with each entry given by ( g i ) s,a = − ˆ Q ( s,a )1 − γ { s i = s } .Then, we proceed as follows: E (cid:20)(cid:13)(cid:13)(cid:13) b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) (cid:13)(cid:13)(cid:13) (cid:21) = E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) b b − X i =0 ( g i − ∇ θ F ( θ t , α t )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 1 b E b − X i =0 E k g i − ∇ θ F ( θ t , α t ) k + X i = j E h g i − ∇ θ F ( θ t , α t ) , g j − ∇ θ F ( θ t , α t ) i ( i ) ≤ |A| R max b (1 − γ / ) (1 − γ ) + 2 b b − X i =1 b − X j = i +1 E [ h g i − ∇ θ F ( θ t , α t ) , g j − ∇ θ F ( θ t , α t ) i ] | {z } T , (21)where ( i ) follows from the facts that k g i k = (cid:12)(cid:12)(cid:12)(cid:12) √ |A| ˆ Q ( s i ,a i )1 − γ (cid:12)(cid:12)(cid:12)(cid:12) ≤ √ |A| R max (1 − γ / )(1 − γ ) and k∇ θ F ( θ t , α t ) k ≤ √ |A| R max (1 − γ ) ≤√ |A| R max (1 − γ / )(1 − γ ) .Define the filtration F i = σ ( s , s , · · · , s i ) . For the term T in eq. (21) with i < j , we have E [ h g i − ∇ θ F ( θ t , α t ) , g j − ∇ θ F ( θ t , α t ) i ] (22) = E [ E [ h g i − ∇ θ F ( θ t , α t ) , g j − ∇ θ F ( θ t , α t ) i|F i ]]= E [ h g i − ∇ θ F ( θ t , α t ) , E [ g j − ∇ θ F ( θ t , α t ) |F i ] i ] ≤ E (cid:2) k g i − ∇ θ F ( θ t , α t ) k k E [ g j − ∇ θ F ( θ t , α t ) |F i ] k (cid:3) ≤ R max p |A| (1 − γ )(1 − γ / ) E k E [ g j |F i ] − ∇ θ F ( θ t , α t ) k ≤ R max p |A| (1 − γ )(1 − γ / ) E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)vuutX s,a (cid:18) P { s j = s | s i } Q ( s, a )1 − γ − d π θt ( s ) Q ( s, a )1 − γ (cid:19) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ R max p |A| (1 − γ ) (1 − γ / ) sX s,a (cid:0) P { s j = s | s i } − d π θt ( s ) (cid:1) i ) = 2 R max |A| (1 − γ ) (1 − γ / ) (cid:13)(cid:13) P { s j = ·| s i } − χ π θt (cid:13)(cid:13) ii ) ≤ C M R max |A| (1 − γ ) (1 − γ / ) ρ j − i , (23)where ( i ) follows because χ π θt = d π θt , and ( ii ) follows from Assumption 3 and because d π θt = χ θ t and (cid:13)(cid:13) P { s j = ·| s i } − d π θt (cid:13)(cid:13) ≤ (cid:13)(cid:13) P { s j = ·| s i } − d π θt (cid:13)(cid:13) = 2d T V (cid:0) P { s j = ·| s i } , d π θt (cid:1) . Substituting eq. (23) into eq. (21), we obtain E (cid:20)(cid:13)(cid:13)(cid:13) b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) (cid:13)(cid:13)(cid:13) (cid:21) ≤ |A| R max b (1 − γ / ) (1 − γ ) + 2 b b − X i =1 b − X j = i +1 C M |A| R max (1 − γ / ) (1 − γ ) ρ j − i ≤ |A| R max b (1 − γ / ) (1 − γ ) (cid:18) C M ρ − ρ (cid:19) b . The second claim can be easily checked.
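The estimator analyzed in Lemma 8 averages b per-sample vectors g_i whose population mean is the true gradient, so its mean-squared error decays with the batch size. The sketch below mimics this with i.i.d. states drawn from a stand-in visitation distribution (the lemma itself handles Markovian samples through the mixing assumption); all sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
d_visit = rng.dirichlet(np.ones(S))               # stand-in for the visitation distribution of the policy
q_hat = rng.random(size=(S, A))                   # stand-in for the estimated Q-values

exact = -d_visit[:, None] * q_hat / (1 - gamma)   # population mean of the per-sample vectors g_i

def grad_estimate(b):
    """Average of b vectors g_i with (g_i)[s, a] = -Q̂(s_i, a)/(1-γ) on the sampled row s_i."""
    g = np.zeros((S, A))
    for s in rng.choice(S, size=b, p=d_visit):
        g[s] -= q_hat[s] / (1 - gamma)
    return g / b

for b in (10, 100, 1000):
    mse = np.mean([np.linalg.norm(grad_estimate(b) - exact) ** 2 for _ in range(50)])
    print(b, mse)                                  # the error shrinks roughly like 1/b
```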
E.2 Proof of Theorem 1
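Before the proof, it may help to fix ideas with a minimal sketch of the update analyzed in this subsection: under the direct parameterization, Θ is a product of probability simplices, and a PPG step moves each state's action distribution along the negative estimated gradient and projects it back onto the simplex. The sizes, gradient values, and step size below are illustrative stand-ins, not the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, eta = 4, 3, 0.1                          # sizes and step size (illustrative)

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    tau = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

theta = rng.dirichlet(np.ones(A), size=S)      # direct parameterization: one distribution per state
grad = rng.normal(size=(S, A))                 # stand-in for the stochastic gradient estimate

# One PPG step: descend on the policy player, then project back onto Θ row by row.
theta_next = np.vstack([project_simplex(row) for row in theta - eta * grad])
print(theta_next.sum(axis=1), theta_next.min())   # rows sum to one and stay nonnegative
```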
Based on the projection property, we have D θ t − η b ∇ θ F ( θ t , α t ) − θ t +1 , θ − θ t +1 E ≤ , ∀ θ ∈ Θ . (24)Next we use eq. (24) to upper bound on E h k θ t +1 − θ t k i . Letting θ = θ t and rearranging eq. (24) yield D b ∇ θ F ( θ t , α t ) , θ t +1 − θ t E ≤ − η − k θ t +1 − θ t k . (25)According to the gradient Lipschitz condition established in Lemma 7, we have g ( θ t +1 ) ≤ g ( θ t ) + h∇ θ g ( θ t ) , θ t +1 − θ t i + (cid:18) L L L µ (cid:19) k θ t +1 − θ t k = g ( θ t ) + D b ∇ θ F ( θ t , α t ) , θ t +1 − θ t E − h∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) , θ t +1 − θ t i− D b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) , θ t +1 − θ t E + (cid:18) L L L µ (cid:19) k θ t +1 − θ t k i ) ≤ g ( θ t ) − (cid:18) L L L µ (cid:19) k θ t +1 − θ t k − h∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) , θ t +1 − θ t i− D b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) , θ t +1 − θ t E , where ( i ) follows from eq. (25) and the fact that η = (cid:16) L + L L µ (cid:17) − .Rearranging the above inequality, we obtain k θ t +1 − θ t k ≤ (cid:18) L L L µ (cid:19) − ( g ( θ t ) − g ( θ t +1 )) − (cid:18) L L L µ (cid:19) − h∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) , θ t +1 − θ t i− (cid:18) L L L µ (cid:19) − D b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) , θ t +1 − θ t E ( i ) ≤ (cid:18) L L L µ (cid:19) − ( g ( θ t ) − g ( θ t +1 ))+ (cid:18) L L L µ (cid:19) − k∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) k + 14 k θ t +1 − θ t k (cid:18) L L L µ (cid:19) − k b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) k + 14 k θ t +1 − θ t k , where ( i ) follows from Young’s inequality.Taking expectation on both sides of the above inequality yields E (cid:2) k θ t +1 − θ t k (cid:3) ( i ) ≤ µµL + L L E [ g ( θ t ) − g ( θ t +1 )] + 8 µ L ( µL + L L ) E h k α t − α op ( θ t ) k i + 8 µ ( µL + L L ) E (cid:20)(cid:13)(cid:13)(cid:13) b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) (cid:13)(cid:13)(cid:13) (cid:21) , (26)where ( i ) follows from the gradient Lipschitz condition established in Proposition 1Next, rearranging eq. (24), we obtain h θ t − θ t +1 , θ − θ t +1 i≤ η D b ∇ θ F ( θ t , α t ) , θ − θ t +1 E = η D b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) , θ − θ t +1 E + η h∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) , θ − θ t +1 i + η h∇ θ g ( θ t , α t ) , θ − θ t i + η h∇ θ g ( θ t , α t ) , θ t − θ t +1 i . Letting η = (cid:16) L + L L µ (cid:17) − and rearranging the above inequality yield h∇ θ g ( θ t ) , θ − θ t i ≥ (cid:18) L + L L µ (cid:19) h θ t − θ t +1 , θ − θ t +1 i − h∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) , θ − θ t +1 i− D b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) , θ − θ t +1 E − h∇ θ g ( θ t ) , θ t − θ t +1 i ( i ) ≥ − (cid:18) L + L L µ (cid:19) k θ t − θ t +1 k · R − p |A| R max (1 − γ ) k θ t +1 − θ t k − R ( k b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) k + k∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) k ) , (27)where ( i ) follows from the Cauchy-Schwartz inequality and the boundness properties of Θ p ( R := max θ ∈ Θ p {k θ k } )and because k∇ θ g ( θ t ) k = k∇ θ F ( θ t , α op ( θ t )) k ≤ √ |A| R max (1 − γ ) .Applying the gradient dominance property of g ( θ ) established in Proposition 2, we obtain g ( θ t ) − g ( θ ∗ ) ≤ C d max θ ∈ Θ h∇ θ g ( θ t ) , θ t − θ i ( i ) ≤ C d µL + L L ) Rµ + p |A| R max (1 − γ ) ! k θ t − θ t +1 k + 2 RC d k b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) k + 2 RC d k∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) k , where ( i ) follows by multiplying − on both sides of eq. 
(27) and taking the maximum over all θ ∈ Θ p .Taking expectation on both sides of above inequality and telescoping, we have T T − X t =0 E [ g ( θ t )] − g ( θ ∗ ) ≤ C d µL + L L ) Rµ + p |A| R max (1 − γ ) ! T T − X t =0 E [ k θ t − θ t +1 k ]+ 2 RC d T T − X t =0 E h k b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) k i + 2 RC d T T − X t =0 E [ k∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) k ] i ) ≤ C d µL + L L ) Rµ + p |A| R max (1 − γ ) ! vuut E " T T − X t =0 k θ t − θ t +1 k + 2 RC d T T − X t =0 E h k b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) k i + 2 RC d T T − X t =0 E [ k∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) k ] ( ii ) ≤ µL + L L ) Rµ + p |A| R max (1 − γ ) ! C d s µµL + L L E [ g ( θ ) − g ( θ T )] T + µL + L L ) Rµ + p |A| R max (1 − γ ) ! C d s µ L ( µL + L L ) E h k α t − α op ( θ t ) k i + µL + L L ) Rµ + p |A| R max (1 − γ ) ! C d s µ ( µL + L L ) E (cid:20)(cid:13)(cid:13)(cid:13) b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) (cid:13)(cid:13)(cid:13) (cid:21) + 2 RC d T T − X t =0 E h k b ∇ θ F ( θ t , α t ) − ∇ θ F ( θ t , α t ) k i + 2 RC d T T − X t =0 E [ k∇ θ F ( θ t , α t ) − ∇ θ g ( θ t ) k ] ( iii ) ≤ µL + L L ) Rµ + p |A| R max (1 − γ ) ! C d s µµL + L L R max (1 − γ ) T + p |A| R max (1 − γ ) µµL + L L + 5 R ! L C d s C α e − µ L K + 48 C r µ (1 − γ ) (1 + C M − ρ ) 1 B + p |A| R max (1 − γ ) µµL + L L + 5 R ! C d s |A| R max b (1 − γ / ) (1 − γ ) (cid:18) C M ρ − ρ (cid:19) b ( iv ) ≤ O (cid:18) − γ ) √ T (cid:19) + O (cid:16) e − (1 − γ ) K (cid:17) + O (cid:18) − γ ) √ B (cid:19) + O (cid:18) − γ ) √ b (cid:19) , where ( i ) follows because E [ X ] ≤ p E [ X ] holds for any random variable X , ( ii ) follows by telescopingeq. (26) and further because √ a + b ≤ √ a + √ b holds, for all a, b > , ( iii ) follows from Lemmas 5 and 8and because E [ X ] ≤ p E [ X ] holds for any random variable X , and ( iv ) follows because L = O (cid:16) − γ ) (cid:17) , L = O (cid:16) − γ ) (cid:17) , L = O (cid:16) − γ (cid:17) , L = O (cid:16) − γ (cid:17) , C d = O (cid:16) − γ (cid:17) and O (cid:16) − γ / (cid:17) ≤ O (cid:16) − γ (cid:17) . E.3 Proof of Theorem 2
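Before the proof, here is a minimal sketch of the Frank–Wolfe step analyzed in this subsection: the linear oracle maximizes a linear function over Θ, which for per-state simplices simply puts all probability on one action per state, and the iterate moves a fraction η toward it, so it never leaves the feasible set. All sizes and values below are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, eta = 4, 3, 0.1                           # sizes and step size (illustrative)
theta = rng.dirichlet(np.ones(A), size=S)       # current policy, one distribution per state
grad = rng.normal(size=(S, A))                  # stand-in for the stochastic gradient estimate

# Linear oracle over the per-state simplices: place all mass on the smallest gradient entry.
v = np.zeros_like(theta)
v[np.arange(S), grad.argmin(axis=1)] = 1.0

theta_next = theta + eta * (v - theta)          # convex combination stays inside the simplex
print(theta_next.sum(axis=1), theta_next.min())
```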
By the gradient Lipschitz condition (established in Lemma 7) of g ( θ ) , we have g ( θ t +1 ) ≤ g ( θ t ) + h∇ θ g ( θ t ) , θ t +1 − θ t i + (cid:18) L L L µ (cid:19) k θ t +1 − θ t k = g ( θ t ) + η h∇ θ g ( θ t ) , ˆ v t − θ t i + (cid:18) L L L µ (cid:19) η k ˆ v t − θ t k i ) ≤ g ( θ t ) + η D b ∇ θ F ( θ t , α t ) , ˆ v t − θ t E + η D ∇ θ g ( θ t ) − b ∇ θ F ( θ t , α t ) , ˆ v t − θ t E + (cid:18) L + 2 L L µ (cid:19) η R ii ) ≤ g ( θ t ) + η D b ∇ θ F ( θ t , α t ) , v t − θ t E + η D ∇ θ g ( θ t ) − b ∇ θ F ( θ t , α t ) , ˆ v t − θ t E + (cid:18) L + 2 L L µ (cid:19) η R = g ( θ t ) + η h∇ θ g ( θ t ) , v t − θ t i + η D ∇ θ g ( θ t ) − b ∇ θ F ( θ t , α t ) , ˆ v t − v t E (cid:18) L + 2 L L µ (cid:19) η R , (28)where ( i ) follows because k ˆ v t − θ t k ≤ R , and ( ii ) follows by definition of ˆ v t in eq. (5) (recall that ˆ v t :=argmax θ ∈ Θ p h θ, − b ∇ θ F ( θ t , α t ) i ), and further we define v t := argmax θ ∈ Θ h θ, −∇ θ g ( θ t ) i . We continue the proofas follows: max θ ∈ Θ h∇ θ g ( θ t ) , θ t − θ i ( i ) = h∇ θ g ( θ t ) , θ t − v t i ( ii ) ≤ η − ( g ( θ t ) − g ( θ t +1 )) + (cid:18) L + 2 L L µ (cid:19) ηR + h∇ θ g ( θ t ) − ∇ θ F ( θ t , α t ) , ˆ v t − v t i + D ∇ θ F ( θ t , α t ) − b ∇ θ F ( θ t , α t ) , ˆ v t − v t E ≤ η − ( g ( θ t ) − g ( θ t +1 )) + (cid:18) L + 2 L L µ (cid:19) ηR + 2 R k∇ θ g ( θ t ) − ∇ θ F ( θ t , α t ) k + 2 R (cid:13)(cid:13)(cid:13) ∇ θ F ( θ t , α t ) − b ∇ θ F ( θ t , α t ) (cid:13)(cid:13)(cid:13) , (29)where ( i ) follows by definition v t := argmax θ ∈ Θ h θ, −∇ θ g ( θ t ) i , and ( ii ) follows by rearranging eq. (28).Finally, we complete the proof as follows: T T − X t =0 E [ g ( θ t )] − g ( θ ∗ ) ( i ) ≤ C d · T T − X t =0 E (cid:20) max θ ∈ Θ h∇ θ g ( θ t ) , θ t − θ i (cid:21) ( ii ) ≤ C d E [ g ( θ ) − g ( θ T )] ηT + C d (cid:18) L + 2 L L µ (cid:19) ηR + 2 RC d T T − X t =0 E k∇ θ g ( θ t ) − ∇ θ F ( θ t , α t ) k + 2 RC d T T − X t =0 E (cid:13)(cid:13)(cid:13) ∇ θ F ( θ t , α t ) − b ∇ θ F ( θ t , α t ) (cid:13)(cid:13)(cid:13) iii ) ≤ C d · R max + 2(1 − γ ) (cid:0) L + L L µ − (cid:1) R (1 − γ ) √ T + 2 RC d s |A| R max b (1 − γ / ) (1 − γ ) (cid:18) C M ρ − ρ (cid:19) b + 2 RC d L s C α e − µ L K + 48 C r (1 − γ ) µ (1 + C M − ρ ) 1 B ( iv ) ≤ O (cid:18) − γ ) √ T (cid:19) + O (cid:16) e − (1 − γ ) K (cid:17) + O (cid:18) − γ ) √ B (cid:19) + O (cid:18) − γ ) √ b (cid:19) , where ( i ) follows from Proposition 2, ( ii ) follows from telescoping eq. (29), ( iii ) follows from Lemmas 5and 8 and because η = − γ √ T and E [ X ] ≤ p E [ X ] holds for any random variable X , and ( iv ) followsbecause L = O (cid:16) − γ ) (cid:17) , L = O (cid:16) − γ ) (cid:17) , L = O (cid:16) − γ (cid:17) , L = O (cid:16) − γ (cid:17) , C d = O (cid:16) − γ (cid:17) and O (cid:16) − γ / (cid:17) ≤ O (cid:16) − γ (cid:17) . F Proof of Theorems 3 and 4: Global Convergence of TRPO-GAIL
In this section, we add the subscript λ to the notation of the Q-function Q^π_α(s, a), the value function V(π, r_α), the objective function F(θ, α) and g(θ), in order to emphasize that these functions are derived under the λ-regularized MDP.

F.1 Supporting Lemmas

In this subsection, we introduce several useful lemmas.
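The λ-regularized value and Q-functions that Lemmas 10 and 11 below refer to can be computed exactly on a small MDP, which makes the notation concrete. The sketch below takes ω to be the negative entropy (consistent with the KL Bregman distance used later in this section); the MDP itself, the policy, and the regularization weight are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma, lam = 3, 2, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))      # transition kernel P(s'|s,a)
r = rng.random(size=(S, A))                     # stand-in for the reward r_α(s,a)
pi = rng.dirichlet(np.ones(A), size=S)          # a fixed policy π(a|s)

# λ-regularized value of π: each visited state also pays λ ω(π(·|s)),
# with ω the negative entropy (assumption); solve the linear Bellman system.
omega = (pi * np.log(pi)).sum(axis=1)
r_pi = (pi * r).sum(axis=1) - lam * omega
P_pi = np.einsum('sab,sa->sb', P, pi)
v_lam = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)    # regularized value function
q_lam = r + gamma * P @ v_lam                              # regularized Q-function
print(v_lam, q_lam)
```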
Lemma 9. ((Beck, 2017, Lemma 9.1)) Consider a proper closed convex function ω : E → ( −∞ , ∞ ] . Let dom ( ∂ω ) denote the subset of E where ω is differentiable and dom ( ω ) denote the subset of E where the valueof ω is finite. Assume a, b ∈ dom ( ∂ω ) and c ∈ dom ( ω ) . Then the following inequality holds: h∇ ω ( b ) − ∇ ω ( a ) , c − a i = B ω ( c, a ) + B ω ( a, b ) − B ω ( c, b ) , where B ω ( · , · ) denotes the Bregman distance associated with ω ( · ) . Lemma 10. ((Shani et al., 2020, Lemma 25)) Consider the Q-function estimation in Algorithm 2. For any t ∈ { , , · · · , T − } , we have (cid:13)(cid:13)(cid:13) − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) (cid:13)(cid:13)(cid:13) ∞ ≤ C ω ( t ; λ ) , where ˆ Q π θt λ,α t is the Q-function estimated under the reward function r α t and policy π θ t , and C ω ( t ; λ ) ≤O (cid:16) C r C α (1+ { λ =0 } log t )1 − γ / (cid:17) . Lemma 11.
For any policies π, π′ ∈ ∆_A and α ∈ Λ, the following equality holds:

(V_λ(π, r_α) − V_λ(π′, r_α))(1 − γ) = Σ_{s∈S} d_{π′}(s) (⟨−Q^π_{λ,α}(s, ·) + λ∇ω(π(·|s)), π′(·|s) − π(·|s)⟩ + λ B_ω(π′(·|s), π(·|s))),

where V_λ(π, r_α) is the value function under the λ-regularized MDP with the reward function r_α, averaged over the initial state distribution, and d_{π′} is the state visitation distribution of π′.

Proof of Lemma 11. Following (Shani et al., 2020, Lemma 24), for any s ∈ S, we have

⟨−Q^π_{λ,α}(s, ·) + λ∇ω(π(·|s)), π′(·|s) − π(·|s)⟩ = −(T^{π′}_λ V^π_{λ,α}(s) − V^π_{λ,α}(s)) − λ B_ω(π′(·|s), π(·|s)),   (30)

where T^{π′}_λ is the Bellman operator under the λ-regularized MDP, i.e.,

T^{π′}_λ V^π_{λ,α}(s) = Σ_{a∈A} π′(a|s) (r_{α,λ}(s, a) + γ Σ_{s′∈S} P(s′|s, a) V^π_{λ,α}(s′)).

Furthermore, we have

V_λ(π′, r_α) − V_λ(π, r_α) = Σ_s ζ(s) (V^{π′}_{λ,α}(s) − V^π_{λ,α}(s))
(i) = 1/(1 − γ) Σ_{s∈S} d_{π′}(s) (T^{π′}_λ V^π_{λ,α}(s) − V^π_{λ,α}(s))
(ii) = −1/(1 − γ) Σ_{s∈S} d_{π′}(s) (⟨−Q^π_{λ,α}(s, ·) + λ∇ω(π(·|s)), π′(·|s) − π(·|s)⟩ + λ B_ω(π′(·|s), π(·|s))),

where (i) follows from (Shani et al., 2020, Lemma 29) and (ii) follows by multiplying eq. (30) by d_{π′}(s) and taking the summation over S.

F.2 Proof of Theorems 3 and 4
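Before the proof, a small sketch of the policy update analyzed in this subsection may be useful: with ω the negative entropy, B_ω is the KL divergence and the mirror-descent step has a closed form, namely an exponentiated update of the current policy by the estimated Q-values (shown here for the unregularized case λ = 0). The action count, step size, and Q-values are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)
A, eta = 4, 0.5                                 # action count and step size (illustrative)
pi = np.full(A, 1.0 / A)                        # current policy at one state
q_hat = rng.normal(size=A)                      # stand-in for the estimated Q-values at that state

# Mirror-descent step with B_ω = KL and λ = 0:
# argmin_π <-Q̂, π - π_t> + (1/η) KL(π || π_t)  =  π_t(·) exp(η Q̂(·)) / normalizer.
pi_next = pi * np.exp(eta * q_hat)
pi_next /= pi_next.sum()
print(pi_next)
```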
Since the unregularized MDP can be viewed as a special case of the regularized MDP, i.e., λ = 0 , in thissubsection, we first develop our proof for the general regularized MDP up to a certain step, and then specializeto the case with λ = 0 for proving Theorem 3 and continue to keep λ general for proving Theorem 4.25o we start the proof, recall that the update of θ t specified in eq. (7) satisfies, π θ t +1 ( ·| s ) ∈ argmin π ∈ ∆ A ( D − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) , π − π θ t ( ·| s ) E + η − t B ω ( π, π θ t ( ·| s )) | {z } := f ( π ) ) . Following from the first-order optimality condition, we have ∇ π f ( π θ t +1 ( ·| s )) ⊤ ( π − π θ t +1 ( ·| s )) ≥ , ∀ π ∈ ∆ A , which together with the fact ∇ π f ( π ) = − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) + η − t ( ∇ ω ( π ) − ∇ ω ( π θ t ( ·| s ))) , implies that D − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) + η − t ( ∇ ω ( π θ t +1 ( ·| s )) − ∇ ω ( π θ t ( ·| s ))) , π − π θ t +1 ( ·| s ) E ≥ (31)holds for any π .Taking π = π θ ∗ ( ·| s ) in eq. (31), we obtain ≤ η t D − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) , π θ ∗ ( ·| s ) − π θ t ( ·| s ) E + η t D − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) , π θ t ( ·| s ) − π θ t +1 ( ·| s ) E + (cid:10) ∇ ω ( π θ t +1 ( ·| s )) − ∇ ω ( π θ t ( ·| s )) , π θ ∗ ( ·| s ) − π θ t +1 ( ·| s ) (cid:11) ( i ) ≤ η t D − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) , π θ ∗ ( ·| s ) − π θ t ( ·| s ) E + η t (cid:13)(cid:13)(cid:13) − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) (cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13) π θ t ( ·| s ) − π θ t +1 ( ·| s ) (cid:13)(cid:13) B ω ( π θ ∗ ( ·| s ) , π θ t ( ·| s )) − B ω ( π θ ∗ ( ·| s ) , π θ t +1 ( ·| s )) − B ω ( π θ t +1 ( ·| s ) , π θ t ( ·| s )) ( ii ) ≤ η t D − ˆ Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) , π θ ∗ ( ·| s ) − π θ t ( ·| s ) E + η t C ω ( t ; λ ) B ω ( π θ ∗ ( ·| s ) , π θ t ( ·| s )) − B ω ( π θ ∗ ( ·| s ) , π θ t +1 ( ·| s )) , (32)where ( i ) follows from Hölder’s inequality and Lemma 9, and ( ii ) follows from the Lemma 10 and Pinsker’sinequality given by (cid:13)(cid:13) π θ t ( ·| s ) − π θ t +1 ( ·| s ) (cid:13)(cid:13) ≤ KL (cid:0) π θ t +1 ( ·| s ) (cid:13)(cid:13) π θ t ( ·| s ) (cid:1) = B ω ( π θ t +1 ( ·| s ) , π θ t ( ·| s )) , where KL ( ·k· ) denotes the KL-divergence.Taking expectation conditioned on F t = σ ( θ , θ , · · · , θ t ) over eq. (32), we have ≤ η t D − Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) , π θ ∗ ( ·| s ) − π θ t ( ·| s ) E + η t C ω ( t ; λ ) B ω ( π θ ∗ ( ·| s ) , π θ t ( ·| s )) − E (cid:2) B ω ( π θ ∗ ( ·| s ) , π θ t +1 ( ·| s )) (cid:12)(cid:12) F t (cid:3) . (33)Since eq. (33) holds for any state, we multiply it by d π θ ∗ ( s ) for each state s and take the summation over S .Then we rearrange the resulting bound and obtain η t C ω ( t ; λ ) X s ∈S d π θ ∗ ( s ) B ω ( π θ ∗ ( ·| s ) , π θ t ( ·| s )) − X s ∈S d π θ ∗ ( s ) E (cid:2) B ω ( π θ ∗ ( ·| s ) , π θ t +1 ( ·| s )) (cid:12)(cid:12) F t (cid:3) ≥ − η t X s ∈S d π θ ∗ ( s ) D − Q π θt λ,α t ( s, · ) + λ ∇ ω ( π θ t ( ·| s )) , π θ ∗ ( ·| s ) − π θ t ( ·| s ) E i ) = η t (1 − γ )( V λ ( π θ ∗ , r α t ) − V λ ( π θ t , r α t )) + η t λ X s ∈S d π θ ∗ ( s ) B ω ( π θ ∗ ( ·| s ) , π θ t ( ·| s )) , (34)where ( i ) follows from applying Lemma 11 with π = π θ t and π ′ = π θ ∗ . Rearranging eq. 
(34), we obtain V λ ( π θ ∗ , r α t ) − V λ ( π θ t , r α t ) ≤ η t (1 − γ ) X s ∈S d π θ ∗ ( s )(1 − λη t ) E [ B ω ( π θ ∗ ( ·| s ) , π θ t ( ·| s ))] − η t (1 − γ ) X s ∈S d π θ ∗ ( s ) E (cid:2) B ω ( π θ ∗ ( ·| s ) , π θ t +1 ( ·| s )) (cid:3) + η t C ω ( t, λ ) − γ ) . (35)Furthermore, we proceed the proof as follows: E [ g λ ( θ t )] − g λ ( θ ∗ )= E [ g λ ( θ t ) − F λ ( θ t , α t )] + E [ F λ ( θ t , α t ) − g λ ( θ ∗ )] ( i ) ≤ E [ g λ ( θ t ) − F λ ( θ t , α t )] + E [ F λ ( θ t , α t ) − F λ ( θ ∗ , α t )] ( ii ) = E [ g λ ( θ t ) − F λ ( θ t , α t )] + E [ V λ ( π θ ∗ , α t ) − V λ ( π θ t , α t )] ( iii ) ≤ L E h k α t − α op ( θ t ) k i + 1 η t (1 − γ ) X s ∈S d π θ ∗ ( s )(1 − λη t ) E [ B ω ( π θ ∗ ( ·| s ) , π θ t ( ·| s ))] − η t (1 − γ ) X s ∈S d π θ ∗ ( s ) E (cid:2) B ω ( π θ ∗ ( ·| s ) , π θ t +1 ( ·| s )) (cid:3) + η t C ω ( t, λ ) − γ ) , (36)where ( i ) follows because g λ ( θ ∗ ) ≥ F λ ( θ ∗ , α op ( θ t )) , ( ii ) follows from the definition of F λ ( θ, α ) , and ( iii ) follows from the gradient Lipschitz condition of α in Proposition 1 and eq. (35).Next, to prove Theorem 3, we let λ = 0 and recall η t = − γ √ T . Telescoping eq. (36), we obtain T T − X t =0 E [ g ( θ t )] − g ( θ ∗ ) ≤ − γ ) √ T X s ∈S d π θ ∗ ( s ) E [ B ω ( π θ ∗ ( ·| s ) , π θ ( ·| s )) − B ω ( π θ ∗ ( ·| s ) , π θ T ( ·| s ))]+ L T T − X t =0 E h k α t − α op ( θ t ) k i + C ω √ T ( i ) ≤ L C α e − µ L K + 48 C r L µ (1 − γ ) (1 + C M − ρ ) 1 B + (1 − γ ) C ω + 2 log |A| − γ ) √ T ( ii ) ≤ O (cid:18) − γ ) √ T (cid:19) + O (cid:16) e − (1 − γ ) K (cid:17) + O (cid:18) − γ ) B (cid:19) , where ( i ) follows from Lemma 5 and because ≤ B ω ( π , π ) ≤ log |A| for any θ , θ and ( ii ) follows because L = O (cid:16) − γ (cid:17) and C ω = O (cid:16) − γ / (cid:17) ≤ O (cid:16) − γ (cid:17) . This completes the proof of Theorem 3.To prove the Theorem 4, let η t = λ ( t +2) . Then, telescoping eq. (36) and applying Lemma 5, we obtain T T − X t =0 E [ g λ ( θ t )] − g λ ( θ ∗ ) ≤ L C α e − µ L K + 48 C r L µ (1 − γ ) (1 + C M − ρ ) 1 B + C ω ( T, λ )2(1 − γ ) λ log( T + 1) T + λ P s d π θ ∗ ( s ) E [ B ω ( π θ ∗ ( ·| s ) , π θ ( ·| s )) − ( T + 1) B ω ( π θ ∗ ( ·| s ) , π θ T ( ·| s ))](1 − γ ) T i ) ≤ O (cid:18) − γ ) T (cid:19) + O (cid:16) e − (1 − γ ) K (cid:17) + O (cid:18) − γ ) B (cid:19) , where ( i ) follows because ≤ B ω ( π , π ) ≤ log( |A| ) for any π , π , L = O (cid:16) − γ (cid:17) and C ω ( T, λ ) =˜ O (cid:16) − γ / (cid:17) ≤ ˜ O (cid:16) − γ (cid:17) . This completes the proof of Theorem 4. G Proof of Theorem 5: Global Convergence of NPG-GAIL
To prove the theorem, we first define some notation. Let λ_P := min_{θ∈Θ} {λ_min(F(θ) + λI)}, W^{λ∗}_{θ,α} := (F(θ) + λI)^{−1} E_{(s,a)∼ν_{π_θ}}[A^{π_θ}_α(s, a) ∇_θ log π_θ(a|s)] and W^∗_{θ,α} := F(θ)^† E_{(s,a)∼ν_{π_θ}}[A^{π_θ}_α(s, a) ∇_θ log π_θ(a|s)]. For brevity, we denote W^{λ∗}_t = W^{λ∗}_{θ_t,α_t} and W^∗_t = W^∗_{θ_t,α_t}.

G.1 Supporting Lemmas
In this subsection, we give several useful lemmas.
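The performance-difference identity stated as Lemma 12 below can be verified numerically on a small synthetic MDP by computing the value functions and the discounted visitation distribution exactly; the sketch below does this with random stand-in dynamics, rewards, and policies.

```python
import numpy as np

rng = np.random.default_rng(6)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))        # transition kernel P(s'|s,a)
r = rng.random(size=(S, A))                        # stand-in for the reward r_α(s,a)
zeta = np.full(S, 1.0 / S)                         # initial state distribution

def value_q(pi):
    P_pi = np.einsum('sab,sa->sb', P, pi)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
    return v, r + gamma * P @ v                    # V(s) and Q(s,a)

def occupancy(pi):                                 # normalized discounted state-action visitation
    P_pi = np.einsum('sab,sa->sb', P, pi)
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, zeta)
    return d[:, None] * pi

pi1, pi2 = rng.dirichlet(np.ones(A), size=S), rng.dirichlet(np.ones(A), size=S)
v1, _ = value_q(pi1)
v2, q2 = value_q(pi2)
lhs = zeta @ (v1 - v2)                                            # V(π, r_α) − V(π', r_α)
rhs = (occupancy(pi1) * (q2 - v2[:, None])).sum() / (1 - gamma)   # advantage term of Lemma 12
print(lhs, rhs)                                                   # the two quantities coincide
```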
Lemma 12. ((Agarwal et al., 2019, Lemma 3.2)) For any policies π and π′ and reward function r_α, we have

V(π, r_α) − V(π′, r_α) = 1/(1 − γ) E_{(s,a)∼ν_π}[A^{π′}_α(s, a)].

Lemma 13. ((Xu et al., 2020a, Lemma 6)) For any θ and α, we have ‖W^{λ∗}_{θ,α} − W^∗_{θ,α}‖ ≤ C_λ λ, where 0 < C_λ < ∞ is a constant only depending on the policy class.

Lemma 14.
Suppose Assumptions 3 and 5 hold. Consider the policy update of NPG-GAIL (Algorithm 3)with β W = λ P C φ + λ ) . Then, for all t = 0 , , · · · , T − , we have E [ (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) ] ≤ exp ( − λ P T c C φ + λ ) ) R max C φ λ P (1 − γ ) + λ P + λ P C φ + λ ) ! R max C φ [( C φ + λ ) + 4 λ P ][1 + ( C M − ρ ](1 − ρ )(1 − γ ) λ P M .
Proof of Lemma 14.
At iteration t , W , W , · · · , W T c follows the linear SA iteration rule defined in (Xu et al.,2020a, eq. (3)) with α = β W , A = − ( F ( θ t ) + λI ) , b = E ( s,a ) ∼ ν πθt (cid:2) A π θt α t ( s, a ) ∇ θ t log π θ t ( a | s ) (cid:3) and θ ∗ = − A − b = W λ ∗ t with (cid:13)(cid:13) W λ ∗ t (cid:13)(cid:13) ≤ R θ = C φ R max λ A (1 − γ ) . It is easy to check that the Assumption 3 in Xu et al. (2020a)holds. Namely, ( i ) , k A k F ≤ C φ + λ and k b k ≤ R max C φ − γ ; ( ii ) , for any w ∈ R d , (cid:10) w − W λ ∗ t , A ( w − W λ ∗ t ) (cid:11) ≤− λ p (cid:13)(cid:13) w − W λ ∗ t (cid:13)(cid:13) ; ( iii ) , The ergodicity of MDP is assumed here. Thus, applying (Xu et al., 2020a, Theorem4) completes the proof. G.2 Proof of Theorem 5
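Before the proof, the quantity controlled by Lemma 14 can be made concrete with a short sketch: the damped natural-gradient direction W^{λ∗} solves (F(θ) + λI)w = E[A ∇ log π], and the inner loop approximates it by linear stochastic-approximation iterates built from one sampled state–action pair at a time. The feature vectors, advantages, damping, step size, and iteration count below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(7)
d, lam, beta_w, T_c = 6, 0.1, 0.05, 2000          # dimension, damping λ, step size, inner iterations (assumed)
phi = rng.normal(size=(500, d))                    # stand-ins for the scores ∇_θ log π_θ(a|s)
adv = rng.normal(size=500)                         # stand-ins for the advantages A(s,a)

F = phi.T @ phi / len(phi)                         # empirical Fisher matrix E[φ φᵀ]
b = phi.T @ adv / len(phi)                         # E[A φ]
w_target = np.linalg.solve(F + lam * np.eye(d), b) # damped direction W^{λ∗}

w = np.zeros(d)                                    # linear SA iterates W_0, ..., W_{T_c}
for _ in range(T_c):
    i = rng.integers(len(phi))                     # one sampled (s,a) pair per iteration
    w += beta_w * (adv[i] * phi[i] - (phi[i] @ w) * phi[i] - lam * w)
print(np.linalg.norm(w - w_target))                # converges to a neighborhood of W^{λ∗}
```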
Define D ( θ ) = E s ∼ d πθ ∗ [KL ( π θ ∗ ( ·| s ) k π θ ( ·| s ))] . Then we have D ( θ t ) − D ( θ t +1 ) = E ν πθ ∗ (cid:2) log( π θ t +1 ( ·| s )) − log( π θ t ( ·| s )) (cid:3) ( i ) ≥ E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( θ t +1 − θ t ) − L φ k θ t +1 − θ t k , ( i ) follows from the gradient Lipschitz condition on log( π θ ( ·| s )) in Assumption 5.Recall that the update rule in NPG-GAIL (Algorithm 3) is given by θ t +1 = θ t − ηw t . Then we have D ( θ t ) − D ( θ t +1 ) ≥ η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ w t − L φ η k w t k = η E ν πθ ∗ (cid:2) A π θt α t ( s, a ) (cid:3) + η E ν πθ ∗ (cid:2) ∇ θ log( π θ t ( a | s )) ⊤ W ∗ t − A π θt α t ( s, a ) (cid:3) + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( W λ ∗ t − W ∗ t ) + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( w t − W λ ∗ t ) − L φ η k w t k i ) = (1 − γ ) η ( V ( π θ ∗ , r α t ) − V ( π θ t , r α t )) + η E ν πθ ∗ (cid:2) ∇ θ log( π θ t ( a | s )) ⊤ W ∗ t − A π θt α t ( s, a ) (cid:3) + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( W λ ∗ t − W ∗ t ) + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( w t − W λ ∗ t ) − L φ η k w t k ii ) ≥ (1 − γ ) η ( V ( π θ ∗ , r α t ) − V ( π θ t , r α t )) − L φ η k w t k + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( W λ ∗ t − W ∗ t ) + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( w t − W λ ∗ t ) − η q E ν πθ ∗ (cid:2) ( ∇ θ log( π θ t ( a | s )) ⊤ W ∗ t − A π θt α t ( s, a )) (cid:3) ( iii ) ≥ (1 − γ ) η ( V ( π θ ∗ , r α t ) − V ( π θ t , r α t )) − L φ η k w t k + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( W λ ∗ t − W ∗ t ) + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( w t − W λ ∗ t ) − η q C d E ν πθt (cid:2) ( ∇ θ log( π θ t ( a | s )) ⊤ W ∗ t − A π θt α t ( s, a )) (cid:3) , (37)where ( i ) follows from Lemma 12, ( ii ) follows from the concavity of f ( x ) = √ x and Jensen’s inequality, and ( iii ) follows from the fact that ( ∇ θ log( π θ t ( a | s )) ⊤ W ∗ t − A π θt α t ( s, a )) ≥ and (cid:13)(cid:13)(cid:13)(cid:13) ν πθ ∗ ν πθt (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ − γ ) min { ζ ( s ) } := C d .Continuing to bound eq. (37), we have D ( θ t ) − D ( θ t +1 ) ( i ) ≥ (1 − γ ) η ( V ( π θ ∗ , r α t ) − V ( π θ t , r α t )) − L φ η k w t k − η p C d ζ ′ + η E ν πθ ∗ [ ∇ θ log( π θ t ( a | s ))] ⊤ ( W λ ∗ t − W ∗ t ) + η E ν πE [ ∇ θ log( π θ t ( a | s ))] ⊤ ( w t − W λ ∗ t ) ( ii ) ≥ (1 − γ ) η ( V ( π θ ∗ , r α t ) − V ( π θ t , r α t )) − η p C d ζ ′ − ηC φ C λ λ − ηC φ (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) − L φ η k w t k iii ) ≥ (1 − γ ) η ( V ( π θ ∗ , r α t ) − V ( π θ t , r α t )) − η p C d ζ ′ − ηC φ C λ λ − ηC φ (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) − L φ η (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) − L φ η (cid:13)(cid:13) W λ ∗ t (cid:13)(cid:13) iv ) ≥ (1 − γ ) η ( V ( π θ ∗ , r α t ) − V ( π θ t , r α t )) − η p C d ζ ′ − ηC φ C λ λ − ηC φ (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) − L φ η (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) − L φ η λ P k∇ θ V ( θ t , r α t ) k , (38)where ( i ) follows from the definition of ζ ′ in the statement of Theorem 5, ( ii ) follows from the upper boundon k∇ θ π θ ( a | s ) k in Assumption 5, Lemma 13 and Cauchy-Schwartz inequality, ( iii ) follows from the fact k A + B k ≤ k A k + 2 k B k , and ( iv ) follows from the definition of W λ ∗ t and because λ P I (cid:22) F ( θ t ) + λI .29earranging eq. 
(38), we obtain V ( π θ ∗ , r α t ) − V ( π θ t , r α t ) ≤ D ( θ t ) − D ( θ t +1 ) η (1 − γ ) + √ C d ζ ′ − γ + C φ C λ λ − γ + C φ − γ (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) + L φ η − γ (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) + L φ ηλ P (1 − γ ) k∇ θ V ( θ t , r α t ) k . (39)Finally, we complete the proof as follows: T T − X t =0 E [ g ( θ t )] − g ( θ ∗ )= 1 T T − X t =0 E [ g ( θ t ) − F ( θ t , α t )] + 1 T T − X t =0 E [ F ( θ t , α t ) − g ( θ ∗ )] ( i ) ≤ T T − X t =0 E [ g ( θ t ) − F ( θ t , α t )] + 1 T T − X t =0 ( F ( θ t , α t ) − F ( θ ∗ , α t ))= 1 T T − X t =0 E [ g ( θ t ) − F ( θ t , α t )] + 1 T T − X t =0 ( V ( π θ ∗ , r α t ) − V ( π θ t , r α t )) ( ii ) ≤ T T − X t =0 E [ g ( θ t ) − F ( θ t , α t )] + D ( θ ) − D ( θ T )(1 − γ ) ηT + √ C d ζ ′ − γ + C φ C λ λ − γ + C φ (1 − γ ) T T − X t =0 (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) + L φ η (1 − γ ) T T − X t =0 (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) + L φ ηR max C φ (1 − γ ) λ P ( iii ) ≤ L C α e − µ L K + 48 C r L µ (1 − γ ) (1 + ρC M − ρ ) 1 B + E [ D ( θ ) − D ( θ T )](1 − γ ) √ T + √ C d ζ ′ − γ + C φ C λ λ − γ + C φ (1 − γ ) T T − X t =0 (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) + L φ T / T − X t =0 (cid:13)(cid:13) w t − W λ ∗ t (cid:13)(cid:13) + L φ R max C φ (1 − γ ) λ P √ T ( iv ) ≤ L C α e − µ L K + 48 C r L µ (1 − γ ) (1 + ρC M − ρ ) 1 B + E [ D ( θ ) − D ( θ T )](1 − γ ) √ T + √ C d ζ ′ − γ + C φ C λ λ − γ + C φ (1 − γ ) r exp n − λ P T c C φ + λ ) o R max C φ λ P (1 − γ ) + (cid:16) λ P + λ P C φ + λ ) (cid:17) R max C φ [( C φ + λ ) +4 λ P ][1+( C M − ρ ](1 − ρ )(1 − γ ) λ P M + L φ √ T (cid:18) exp n − λ P T c C φ + λ ) o R max C φ λ P (1 − γ ) + (cid:16) λ P + λ P C φ + λ ) (cid:17) R max C φ [( C φ + λ ) +4 λ P ][1+( C M − ρ ](1 − ρ )(1 − γ ) λ P M (cid:19) + L φ R max C φ (1 − γ ) λ P √ T ( v ) ≤ O (cid:18) − γ ) √ T (cid:19) + O (cid:16) e − (1 − γ ) K (cid:17) + O (cid:18) − γ ) B (cid:19) + O (cid:18) ζ ′ (1 − γ ) / (cid:19) + O (cid:18) λ − γ (cid:19) + O (cid:0) e − T c (cid:1) + O (cid:18) − γ ) √ M (cid:19) , where ( i ) follows because g ( θ ∗ ) = F ( θ ∗ , α op ( θ ∗ )) ≥ F ( θ ∗ , α t ) and ( ii ) follows from eq. (39) and because k∇ θ V ( θ t , α t ) k ≤ R max C φ − γ , ( iii ) follows from Proposition 1 and Lemma 5, and the fact η = − γ √ T , ( iv ) followsfrom Lemma 14, and ( v ) follows because L = O (cid:16) − γ (cid:17) and C d = O (cid:16) − γ (cid:17)(cid:17)