RL for Latent MDPs: Regret Guarantees and a Lower Bound
Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
Department of Electrical and Computer Engineering, University of Texas at Austin; Microsoft Research, New York; Department of Electrical Engineering, Technion
February 10, 2021
Abstract
In this work, we consider the regret minimization problem for reinforcement learning in latent Markov decision processes (LMDPs). In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least $\Omega((SA)^M)$ episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires only a polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., a sublinear regret guarantee when we are given a good initialization. Finally, given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be removed entirely.

1 Introduction

Reinforcement learning (RL) [46] is a central problem in artificial intelligence which tackles sequential decision making in an unknown dynamic environment. The agent interacts with the environment by receiving feedback on its actions in the form of a state-dependent reward and a new observation. The goal of the agent is to find a policy that maximizes the long-term reward from the interaction.

Partially observable Markov decision processes (POMDPs) [41] give a general framework for describing partially observable sequential decision problems. In POMDPs, the underlying dynamics satisfy the Markov property, but the observations give only partial information on the identity of the underlying states. With the generality of this framework comes a high computational and statistical price: POMDPs are hard, primarily because optimal policies depend on the entire history of the process. But for many important problems, this full generality is overkill and, in particular, offers no way to leverage special structure. We are interested in settings where the hidden or latent (unobserved) variables have slow dynamics, or are even static in each episode. This model is important for diverse applications, from serving a user in a dynamic web application [17], to medical decision making [44], to transfer learning across different RL tasks [8]. Yet, as we explain below, even this area remains little understood, and challenges abound.

Thus, in this work, we consider reinforcement learning for a special type of POMDP which we call a latent Markov decision process (LMDP). LMDPs consist of some (perhaps large) number $M$ of MDPs with joint state space $\mathcal{S}$ and actions $\mathcal{A}$. In episodic LMDPs with finite time-horizon $H$, a static latent (hidden) variable that selects one of the $M$ MDPs is randomly drawn at the beginning of each episode, yet is not revealed to the agent. The agent then interacts with the chosen MDP throughout the episode (see Definition 1 for the formal description).

The LMDP framework has previously been introduced under many different names, e.g., hidden-model MDP [11], multitask RL [8], contextual MDP [17], multi-model Markov decision process [44], and concurrent MDP [9]. Learning in LMDPs is a challenging problem due to the unobservability of the latent contexts. For instance, the exact planning problem is PSPACE-hard [44], inheriting the hardness of planning from the general POMDP framework.
Nevertheless, the lack of dynamics of the latent variable offers some hope. For example, if the number of contexts $M$ is bounded, then the planning problem can at least be approximately solved (e.g., by point-based value iteration (PBVI) [38], or mixed-integer programming (MIP) [44]).

The most closely related work studying LMDPs is in the context of multitask RL [47, 8, 33, 17]. In this line of work, a common approach is to cluster trajectories according to different contexts, an approach that guided us in designing the algorithms in Section 3.4. However, previous work requires a very long time-horizon $H \gg SA$ in order to guarantee that every state-action pair can be visited multiple times in a single episode. In contrast, we consider a significantly shorter time-horizon that scales poly-logarithmically with the number of states, i.e., $H = \mathrm{poly}\log(MSA)$. This short time-horizon results in a significantly different learning strategy, even when we get feedback on the true context at the end of an episode. We refer the reader to Section 1.1 for additional discussion of related work.

Main Results.
To the best of our knowledge, none of the previous literature has obtained sample complexity guarantees or studied regret bounds in the LMDP setting. This paper addresses precisely this problem. We ask the following:
Is there a sample efficient RL algorithm for LMDPs, with sublinear regret?
The answer turns out to be not so simple. Our results comprise a first impossibility result, followed by positive algorithmic results under additional assumptions. Specifically:

• First, we find that for a general LMDP, polynomial sample complexity cannot be attained without further assumptions. That is, to find an approximately optimal policy we need at least $\Omega((SA)^M)$ samples, i.e., at least exponential in the number of contexts $M$ (Section 3.1). This lower bound applies even to instances with deterministic MDPs.

• We find that there are several natural assumptions under which optimal policies can be learned with polynomial sample complexity. Similarly to mixture problems without dynamics, the key link is a notion of separation between the MDPs. With sufficient separation, we show that there is a planning-oracle-efficient RL algorithm with polynomial sample complexity. A critical development is adapting the principle of optimism as in UCB to the partially observed setting, where value iteration cannot be directly applied, and thus neither can the UCRL algorithm for MDPs.

• Under additional statistical sufficiency assumptions that are common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be entirely removed.

• Finally, we perform an empirical evaluation of the suggested algorithms on toy problems (Section 4), focusing on the importance of the assumptions made.

1.1 Related Work

Due to the vast volume of literature on RL, we review only the research most closely related to our problem.

Previous Study on LMDPs
As mentioned earlier, LMDPs have been previously introduced under different names. In [11, 44, 9], the authors study the planning problem in LMDPs when the true parameters of the model are given. The authors of [44] have shown that, as for POMDPs [37], it is PSPACE-hard to find an exact optimal policy, and NP-hard to find an optimal memoryless policy of an LMDP. On the positive side, several heuristics have been proposed for practically finding optimal memoryless policies [44, 9].

LMDPs have also been studied in the context of multitask RL [47, 8, 33, 17]. In this line of work, a common approach is to cluster trajectories according to different contexts under some separation assumption, an approach that guided us in designing the algorithms in Section 3.4. However, in this line of work, the authors assume a time-horizon long enough to visit every state-action pair multiple times in a single episode; to satisfy such an assumption, the time-horizon must be at least $H \geq \Omega(SA)$. In contrast, we consider a significantly shorter time-horizon that scales poly-logarithmically with the number of states, i.e., $H = \mathrm{poly}\log(MSA)$. This short time-horizon results in a significantly different learning strategy, even when we get feedback on the true context at the end of an episode.

Approximate Planning in POMDPs
The study of learning in partially observable domains has a long history. Unlike in MDPs, finding the optimal policy for a POMDP is PSPACE-hard even with known parameters [41]. Even finding a memoryless policy is known to be NP-hard [30]. Due to the computational intractability of exact planning, various approximate algorithms and heuristics within a policy class of interest have been developed [20, 36, 42, 43, 38, 29]. Since an LMDP is a special case of a POMDP, any of these methods can be applied to solve LMDPs. We will assume that the planning oracle achieves some approximation guarantee with respect to the maximum long-term reward obtained by the optimal policy. We show that when the context is identifiable in hindsight, we can quickly perform as well as the policy obtained by the planning oracle with true parameters.
Spectral Methods for POMDPs
Previous studies of partially observed decision problems assumed that the number of observations is larger than the number of hidden states, and that a set of single observations forms a sufficient statistic to learn the hidden structure [5, 16]. With such assumptions, one can apply tensor-decomposition methods by constructing multi-view models [2, 1] and recovering POMDP parameters under a uniform-ergodicity (or stationarity) assumption on the environment [5, 16]. Our work is differentiated from the mentioned works in two respects. First, for LMDPs, the observation space is smaller than the hidden space; therefore, constructing a multi-view model from a set of single observations is not enough to learn the hidden structure of the system. Second, we are not aware of any natural conditions under which tensor-decomposition methods are applicable to learning LMDPs. Therefore, we do not pursue the application of tensor methods in this work.
Predictive State Representation
Since the introduction of PSRs [32, 40], they have become one major alternative to POMDPs for modeling partially observed environments. The philosophy of PSRs is to express the internal state only through a set of observable experiments, or tests. Various techniques have been developed for learning PSRs with statistical consistency and global optimality guarantees [19, 7, 18]. We use the PSR framework to obtain an initial estimate of the system. However, the sample complexity of learning PSRs is quite high [23] (see also our finite-sample analysis of the spectral learning technique in Section D.1). Therefore, we only learn the PSR up to some desired accuracy and convert it to LMDP parameters by clustering trajectories (Section D.2), which warm-starts the optimal policy learning.
Other Related Work
In the relatively well-studied setting of contextual decision processes (CDPs), the context is always given as side information at the beginning of the episode [22, 35]. This makes the problem a fully observed decision problem. LMDPs are different since the context is hidden; the main challenge comes from the partial observability, which results in significant differences in analysis from CDPs. Another line of work on decision making with latent contexts considers the problem of latent bandits [34, 15, 14]. It would be interesting to understand whether any previous results on latent bandits can be extended to latent MDPs. Another line of theoretical research on partial observability considers environments with rich observations [25, 12, 13]. The rich-observation setting assumes that each observation can be generated by only one internal state, which removes the need to consider histories; thus the nature of that problem is different from our setting. Recent work gives a sample-efficient algorithm for undercomplete POMDPs [24], i.e., when the observation space is larger than the hidden state space and a set of single observations is statistically sufficient to learn the hidden structure. In contrast, our problem is a special case of POMDPs where the observation space is smaller than the hidden state space.
2 Preliminaries

In Section 2.1 we define LMDPs precisely, as well as the notion of regret relevant to this paper. Then, in Section 2.2.1, we discuss predictive state representations, which we need for our results in Section 3.4.
2.1 Latent Markov Decision Processes

We consider latent Markov decision processes (LMDPs) in episodic reinforcement learning with a finite time-horizon $H$.

Definition 1 (Latent Markov Decision Process (LMDP))
Suppose we are given a set of MDPs $\mathcal{M}$ with joint state space $\mathcal{S}$ and joint action space $\mathcal{A}$, and a finite time horizon $H$. Let $M = |\mathcal{M}|$, $S = |\mathcal{S}|$, and $A = |\mathcal{A}|$. Each MDP $\mathcal{M}_m \in \mathcal{M}$ is a tuple $(\mathcal{S}, \mathcal{A}, T_m, R_m, \nu_m)$, where $T_m : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is a transition probability measure that maps a state-action pair and a next state to a probability, $R_m : \mathcal{S} \times \mathcal{A} \times \{0,1\} \to [0,1]$ is a probability measure for rewards that maps a state-action pair and a binary reward to a probability, and $\nu_m$ is an initial state distribution. Let $w_1, \ldots, w_M$ be the mixing weights of the LMDP such that, at the start of every episode, one MDP $\mathcal{M}_m \in \mathcal{M}$ is randomly chosen with probability $w_m$.

We assume for simplicity that one of the MDPs is chosen uniformly at the start of each episode, i.e., $w_1 = \ldots = w_M = 1/M$. The goal is to find a (possibly non-Markovian) policy $\pi$ in a policy class $\Pi$ that maximizes the expected return:
$$V^*_{\mathcal{M}} := \max_{\pi \in \Pi} \sum_{m=1}^M w_m \, \mathbb{E}^\pi_m \Big[ \sum_{t=1}^H r_t \Big],$$
where $\mathbb{E}^\pi_m[\cdot]$ is the expectation taken over the $m$th MDP under a policy $\pi$. The policy $\pi : (\mathcal{S}, \mathcal{A}, \{0,1\})^* \times \mathcal{S} \to \mathcal{A}$ maps a history to an action. When the parameters of the model $\mathcal{M}$ are given, we can use sufficient statistics of histories, a.k.a. belief states, via the following recursive formulation (a minimal implementation sketch is given below):
$$b_1(m) = \frac{w_m \nu_m(s_1)}{\sum_{m'} w_{m'} \nu_{m'}(s_1)}, \qquad b_{t+1}(m) = \frac{b_t(m)\, T_m(s_{t+1}|s_t,a_t)\, R_m(r_t|s_t,a_t)}{\sum_{m'} b_t(m')\, T_{m'}(s_{t+1}|s_t,a_t)\, R_{m'}(r_t|s_t,a_t)}.$$

We now define the notion of regret used in this work. Suppose we have a planning oracle that satisfies the following approximation guarantee:
$$V^\pi_{\mathcal{M}} \ge \rho_1 V^*_{\mathcal{M}} - \rho_2, \qquad (1)$$
where $\pi$ is the policy returned when $\mathcal{M}$ is given to the planning oracle, and $\rho_1, \rho_2$ are multiplicative and additive approximation constants, respectively, such that $0 < \rho_1 \le 1$ and $0 \le \rho_2$.
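A minimal sketch of the belief-state recursion above, assuming hypothetical parameter arrays `T[m, s, a, s']`, `R[m, s, a, r]`, `nu[m, s]`, and weights `w[m]` (the names and layout are ours, not the paper's):

```python
import numpy as np

def initial_belief(w, nu, s1):
    # b_1(m) is proportional to w_m * nu_m(s_1)
    b = w * nu[:, s1]
    return b / b.sum()

def belief_update(b, T, R, s, a, r, s_next):
    # b_{t+1}(m) is proportional to b_t(m) * T_m(s'|s,a) * R_m(r|s,a)
    b_new = b * T[:, s, a, s_next] * R[:, s, a, r]
    return b_new / b_new.sum()
```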
Example 1. Point-based value iteration (PBVI) [38] with discretization level $\epsilon_d$ in the belief space over MDPs returns an $\epsilon_d H$-additive approximate policy. That is, equation (1) is satisfied with $\rho_1 = 1$ and $\rho_2 = \epsilon_d H$, with the policy class $\Pi$ the set of all possible history-dependent policies. The running time of the PBVI algorithm is $O\big(\epsilon_d^{-O(M)} HSA\big)$.
Example 2. The optimal memoryless policy gives at least a $1/M$-approximation to the optimal history-dependent policy, i.e., $\rho_1 = 1/M$ and $\rho_2 = 0$ with the policy class $\Pi$ the set of all possible history-dependent policies. We may instead restrict $\Pi$ to deterministic memoryless policies, i.e., the class of policies mapping states to actions. The mixed-integer programming (MIP) formulation in [44] may be used to find the optimal memoryless policy; in this case, we have $\rho_1 = 1$, $\rho_2 = 0$.

We define the regret with respect to the best approximation guarantee that the planning oracle can achieve:
$$\mathrm{Regret}(K) = \sum_{k=1}^K \big( \rho_1 V^*_{\mathcal{M}} - \rho_2 - V^{\pi_k}_{\mathcal{M}} \big), \qquad (2)$$
where $\pi_k$ is the policy executed in the $k$th episode.

2.2 Predictive State Representations

In general, a partially observable dynamical system can be viewed as a model that generates a sequence of observations from an observation space $\mathcal{O}$ under (controlled) actions from an action space $\mathcal{A}$. A predictive state representation (PSR) is a compact description of a dynamical system in terms of a set of observable experiments, or tests [40]. Specifically, a test of length $t$ is a sequence of action-observation pairs $\tau = a^\tau_1 o^\tau_1 \ldots a^\tau_t o^\tau_t$. A history $h = a^h_1 o^h_1 \ldots a^h_t o^h_t$ is a sequence of action-observation pairs that has been generated prior to a given time. A prediction $P(\tau | h) = P(o^\tau_{1:t} \mid h \,\|\, \mathrm{do}\, a^\tau_{1:t})$ denotes the probability of seeing the test's observation sequence from a given history, given that we intervene to take the actions $a^\tau_1 a^\tau_2 \ldots a^\tau_t$. In latent MDPs, the observation space can be taken to be pairs of next states and rewards, i.e., $\mathcal{O} = \mathcal{S} \times \{0,1\}$ and $o_t = (s_{t+1}, r_t)$.

As we work with a special class of POMDPs, we customize the PSR formulation for LMDPs. The set of histories consists of subsets of histories ending at different states, i.e., $\mathcal{H} = \bigcup_s \mathcal{H}_s$, where each element $h \in \mathcal{H}_s$ is a short sequence of state-action-rewards of length at most $l = O(1)$ that ends at state $s$:
$$h = s^h_1 a^h_1 r^h_1 s^h_2 \ldots s^h_{l-1} a^h_{l-1} r^h_{l-1} s = \big( (s,a,r)^h_{1:l-1}, s \big).$$
We define $P^\pi_m(\mathcal{H}_s)$ as a vector of probabilities, where each coordinate is the probability of sampling the corresponding history in $\mathcal{H}_s$ in the $m$th MDP under a policy $\pi$. Likewise, each element of the test set $\tau \in \mathcal{T}$ is a short sequence of action-reward-next-states of length at most $l$:
$$\tau = a^\tau_1 r^\tau_1 s^\tau_2 \ldots a^\tau_l r^\tau_l s^\tau_{l+1} = (a, r, s')^\tau_{1:l}.$$
We denote by $P_m(\mathcal{T} | s)$ a vector of probabilities, where each coordinate is the success probability of the corresponding test in the $m$th MDP starting from state $s$. That is,
$$P_m(\mathcal{T}|s)_i = P_m(\tau_i | s) = P_m\big( r^{\tau_i}_1 s^{\tau_i}_2 \ldots r^{\tau_i}_l s^{\tau_i}_{l+1} \,\big|\, s \,\|\, \mathrm{do}\, a^{\tau_i}_1 \ldots a^{\tau_i}_l \big).$$

2.2.1 Spectral Learning of PSRs in LMDPs

In spectral learning, we build a set of observable matrices that contain the (joint) probabilities of histories and tests; we can then extract parameters from these matrices by performing a singular value decomposition (SVD) and regressions [7]. In order to apply spectral learning techniques, we need the following technical conditions on the statistical sufficiency of histories and tests. The first condition is a rank non-degeneracy condition on sufficient tests:
Condition 1 (Full-Rank Condition for Tests)
For all $s \in \mathcal{S}$, for the test set $\mathcal{T}$ starting from state $s$, let $L_s = [P_1(\mathcal{T}|s) \,|\, P_2(\mathcal{T}|s) \,|\, \ldots \,|\, P_M(\mathcal{T}|s)]$. Then $\sigma_M(L_s) \ge \sigma_\tau$ for all $s \in \mathcal{S}$, for some $\sigma_\tau > 0$.

A second technical condition for the spectral learning method is a rank non-degeneracy condition for sufficient histories:
Condition 2 (Full-Rank Condition for Histories)
For all $s \in \mathcal{S}$, for the history set $\mathcal{H}_s$ ending at state $s$ under a sampling policy $\pi$, let $H_s = [P^\pi_1(\mathcal{H}_s) \,|\, P^\pi_2(\mathcal{H}_s) \,|\, \ldots \,|\, P^\pi_M(\mathcal{H}_s)]^\top$. Then $\sigma_M(L_s H_s) \ge P^\pi(\text{end state} = s) \cdot \sigma_h$ for all $s \in \mathcal{S}$, for some $\sigma_h > 0$.

Here $P^\pi(\text{end state} = s)$ is the probability of sampling a history ending at $s$. Condition 2 is stated relative to a given sampling policy $\pi$ with which we can sample a set of short trajectories. Together with the rank condition for tests, pairs of histories and tests can be thought of as many short snapshots of long trajectories obtained from external experts or some exploration policy (e.g., a random policy in uniformly ergodic MDPs).

Conditions 1 and 2 ensure that the sets of tests and histories are statistically sufficient [7]. Following the notation of [7], let $P_{\mathcal{T},\mathcal{H}_s} = L_s H_s$ and $P_{\mathcal{T},(s',r)a,\mathcal{H}_s} = L_{s'} D_{(s',r),a,s} H_s$, where $D_{(s',r),a,s} = \mathrm{diag}\big( P_1(s',r|a,s), \ldots, P_M(s',r|a,s) \big)$. Since these matrices can be estimated from observable sequences, we can construct their empirical counterparts and apply spectral learning methods. The goal of the spectral learning algorithm is to output PSR parameters $(b_{1,s}, B_{(s',r),a,s}, b_{\infty,s})$, which are used to compute $\hat P^\pi(\tau|h)$, the estimated probability of any future observations (tests $\tau$) conditioned on any history $h$ under a sampling policy $\pi$. We describe the detailed spectral learning procedure in Appendix D.1; a brief code sketch is given at the end of these preliminaries.

Notation. We denote the underlying LMDP with true parameters by $\mathcal{M}^*$. With a slight abuse of notation, we denote the $l_1$ distance between two probability distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ over a random variable $X$, conditioned on an event $E$, as
$$\|(P_{X\sim\mathcal{D}_1} - P_{X\sim\mathcal{D}_2})(X|E)\|_1 = \sum_{x \in \mathcal{X}} \big| P_{X\sim\mathcal{D}_1}(x|E) - P_{X\sim\mathcal{D}_2}(x|E) \big|,$$
where $\mathcal{X}$ is the support of $X$. When we do not condition on any event, we omit $E$. When we measure a transition or reward probability at a state-action pair $(s,a)$, we use $T$ or $R$ instead of $P$. We write $P_m$ for the probability of any event measured in the $m$th context (i.e., in the $m$th MDP); in particular, $P_m(s',r|s,a) = T_m(s'|s,a) R_m(r|s,a)$. For instance, the $l_1$ distance between the transition probabilities at $(s,a)$ in the $m_1$th and $m_2$th MDPs is denoted
$$\|(T_{m_1} - T_{m_2})(s'|s,a)\|_1 = \sum_{s' \in \mathcal{S}} \big| T_{m_1}(s'|s,a) - T_{m_2}(s'|s,a) \big|.$$
If we use $P$ without any subscript, it is the probability of an event measured without knowledge of the context, i.e., $P(\cdot) = \sum_{m=1}^M w_m P_m(\cdot)$. If the probability of an event depends on a policy $\pi$, we add a superscript $\pi$ to $P$. Similarly, $\mathbb{E}_m[\cdot]$ is the expectation taken in the $m$th context, and $\pi$ is added as a superscript when the expectation depends on $\pi$. We use hats to denote estimated quantities. $a \lesssim b$ means $a$ is less than $b$ up to constant and logarithmic factors. We use $\preceq$ for a coordinate-wise inequality between vectors. When the norm $\|\cdot\|$ is used without a subscript, we mean the $l_1$-norm for vectors and the operator norm for matrices. We interchangeably use $o$, an observation, for a pair of next state and immediate reward $(s', r)$ to simplify notation. We occasionally express a length-$t$ history $(s_1, a_1, r_1, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)$ compactly as $((s,a,r)_{1:t-1}, s_t)$.
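To make the Section 2.2.1 procedure concrete, the following is a minimal sketch of its SVD-plus-regression step in the style of [7]. The matrix names (`P_TH` for $P_{\mathcal{T},\mathcal{H}_s}$, `P_To_H[o]` for the shifted matrices $P_{\mathcal{T},(s',r)a,\mathcal{H}_s}$, and `P_H` for the history-probability vector) are ours, and the recipe is a simplified stand-in for the full procedure in Appendix D.1:

```python
import numpy as np

def spectral_psr_params(P_TH, P_To_H, P_H, M):
    U, _, _ = np.linalg.svd(P_TH, full_matrices=False)
    U = U[:, :M]                                # rank-M range of the tests/histories matrix
    pinv = np.linalg.pinv(U.T @ P_TH)           # (U^T P_{T,H})^+
    b_inf = pinv.T @ P_H                        # normalizer: b_inf^T (U^T P_TH) ≈ P_H^T
    B = {o: U.T @ P_To_H[o] @ pinv for o in P_To_H}   # observable operators B_o
    return b_inf, B
```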
3 Main Results

In this section, we first investigate the sample complexity of the LMDP framework and obtain a hardness result for the general case. We then develop sample- and computationally efficient algorithms under additional assumptions.
3.1 Lower Bound

We first study the fundamental limits of the problem. In particular, we are interested in whether we can learn the optimal policy after interacting with the LMDP for a number of episodes polynomial in the problem parameters. We prove a worst-case lower bound, exhibiting an LMDP instance that requires at least $\Omega((SA)^M)$ episodes:

Theorem 3.1 (Lower Bound)
There exists an LMDP such that finding an $\epsilon$-optimal policy $\pi_\epsilon$, for which $V^{\pi_\epsilon}_{\mathcal{M}} \ge V^*_{\mathcal{M}} - \epsilon$, requires at least $\Omega\big( (SA/M)^M / \epsilon^2 \big)$ episodes.

The hard instance consists of MDPs with deterministic transitions and possibly stochastic rewards, indicating an exponential lower bound in the number of contexts even for the easiest types of LMDPs. The example is constructed such that, in the absence of knowing the true contexts, all wrong action sequences of length $M$ provide no information and zero reward, whereas the single correct action sequence obtains a total reward of 1 under one specific context. Nevertheless, we note that Theorem 3.1 does not imply an exponential lower bound in $H$ when the number of contexts $M$ is fixed. The construction of the lower bound example is given in Appendix A.1.

Theorem 3.1 precludes the design of efficient algorithms with a growing number of contexts. To the best of our knowledge, this is the first lower bound of its kind for LMDPs. In the following subsections, we investigate natural assumptions that allow efficient algorithms when only a polynomial number of episodes is available.

3.2 Learning When Contexts Are Revealed in Hindsight

Suppose the true context of the underlying MDP is revealed to the agent at the end of each episode. We do not require any assumptions on the environment in this scenario. Note that this scenario is different from the fully observable setting (i.e., knowing the true context at the beginning of an episode); in the latter, we would simply have $M$ decoupled RL problems in standard MDPs. While this can be considered a "warm-up" for the sequel, it is motivated by real-world examples, and the key technical insight here will prove important for the sequel as well.

Knowing contexts in hindsight allows us to construct a confidence set for the parameters:
$$\mathcal{C} = \Big\{ \mathcal{M} \;\Big|\; \|(T_m - \hat T_m)(s'|s,a)\|_1 \le \sqrt{c_T / N_m(s,a)},\; \|(R_m - \hat R_m)(r|s,a)\|_1 \le \sqrt{c_R / N_m(s,a)},\; \|(\nu_m - \hat\nu_m)(s)\|_1 \le \sqrt{c_\nu / N(m)},\; \forall m, s, a \Big\}, \qquad (3)$$
where $N_m(s,a)$ is the number of times the state-action pair $(s,a)$ has been visited in the $m$th MDP, and $N(m)$ is the number of episodes in which we have interacted with the $m$th MDP.

Algorithm 1 Latent Upper Confidence Reinforcement Learning (L-UCRL)

Initialize visit counts $N_m(s,a)$, $N(m)$ and empirical estimates of the LMDP $(\hat T_m, \hat R_m, \hat\nu_m)$ properly.
for each $k$th episode do
  Construct an optimistic model $\widetilde{\mathcal{M}}$ from the empirical estimates using Proposition 3.2
  Get an (approximately) optimal policy $\pi_k$ for $\widetilde{\mathcal{M}}$
  Play policy $\pi_k$ and get the trajectory $\tau = (s_1, a_1, r_1, \ldots, s_H, a_H, r_H)$
  Get an estimated belief over contexts $\hat b$ at the end of the episode with either Algorithm 2 (when contexts are given) or Algorithm 3 (when we infer contexts)
  Update the empirical parameters:
  for $m = 1, \ldots, M$ do
    for $t = 1, \ldots, H$ do
      $N_m(s_{t+1} | s_t, a_t) \leftarrow N_m(s_{t+1} | s_t, a_t) + \hat b(m)$
      $N_m(r_t | s_t, a_t) \leftarrow N_m(r_t | s_t, a_t) + \hat b(m)$
    end for
    $N_m(s_1) \leftarrow N_m(s_1) + \hat b(m)$
    for all $(s', r, a, s) \in \mathcal{S} \times \{0,1\} \times \mathcal{A} \times \mathcal{S}$ do
      Let $N_m(s,a) = \max\big(1, \sum_{x \in \mathcal{S}} N_m(x|s,a)\big)$
      $\hat T_m(s'|s,a) = N_m(s'|s,a) / N_m(s,a)$, $\quad \hat R_m(r|s,a) = N_m(r|s,a) / N_m(s,a)$
      $\hat\nu_m(s) = N_m(s) / \max\big(1, \sum_{x \in \mathcal{S}} N_m(x)\big)$
    end for
  end for
end for
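A minimal sketch of the belief-weighted update step of Algorithm 1, assuming hypothetical pseudo-count arrays `trans_counts[m, s, a, s']`, `rew_counts[m, s, a, r]` with $r \in \{0,1\}$, and `init_counts[m, s]` (layout is ours):

```python
import numpy as np

def update_counts(trans_counts, rew_counts, init_counts, traj, b_hat):
    """traj is a list of (s, a, r, s_next) tuples from one episode;
    b_hat[m] is the (estimated) posterior over contexts for that episode."""
    for (s, a, r, s_next) in traj:
        trans_counts[:, s, a, s_next] += b_hat   # N_m(s'|s,a) += b_hat(m)
        rew_counts[:, s, a, r] += b_hat          # N_m(r|s,a)  += b_hat(m)
    init_counts[:, traj[0][0]] += b_hat          # N_m(s_1)    += b_hat(m)

def empirical_model(trans_counts, rew_counts, init_counts):
    # N_m(s,a) = max(1, sum_x N_m(x|s,a)), as in Algorithm 1
    n_sa = np.maximum(trans_counts.sum(axis=3, keepdims=True), 1.0)
    T_hat = trans_counts / n_sa
    R_hat = rew_counts / n_sa
    nu_hat = init_counts / np.maximum(init_counts.sum(axis=1, keepdims=True), 1.0)
    return T_hat, R_hat, nu_hat
```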
With properly set constants $c_T, c_R, c_\nu > 0$ (depending on the problem parameters) for the confidence intervals, $\mathcal{M}^* \in \mathcal{C}$ with high probability for all $K$ episodes.

With this construction of confidence sets, it is natural to try to design an optimistic RL algorithm, as in UCRL [21]. An obvious optimistic value in light of (3) is $\max_{\pi, \mathcal{M} \in \mathcal{C}} V^\pi_{\mathcal{M}}$. However, solving this optimization problem is more general than solving an LMDP. In fully observable settings, one can replace the complex optimization problem by adding a proper exploration bonus to obtain an optimistic value function [4]. In partially observable environments, the notion of value iteration is defined only over belief states and not the observed states. For this reason, existing techniques based solely on value iteration cannot be directly applied to LMDPs. Yet we find that a proper analysis of the Bellman update rule over the belief state reveals that an empirical LMDP with properly adjusted hidden rewards is optimistic:
Proposition 3.2
We construct an optimistic LMDP $\widetilde{\mathcal{M}}$ whose parameters are given by
$$\widetilde T_m(s'|s,a) = \hat T_m(s'|s,a), \quad \widetilde R^{obs}_m(r|s,a) = \hat R_m(r|s,a), \quad \widetilde R^{hid}_m(s,a) = H \min\Big( 2, \sqrt{(c_R + c_T)/N_m(s,a)} \Big),$$
$$\widetilde\nu_m(s) = \hat\nu_m(s), \quad \widetilde R^{hid}_{init}(m) = \min\Big( 1, \sqrt{c_\nu / N(m)} \Big),$$
where $\widetilde R^{hid}_{init}(m)$ is an initial hidden reward given when starting an episode with context $m$, and $\widetilde R^{obs}_m(s,a)$ is a probability measure of an observable immediate reward $r$, whereas $\widetilde R^{hid}_m(s,a)$ is a hidden immediate reward (not visible to the agent) for a state-action pair $(s,a)$ in context $m$. Then, for any policy $\pi$, the expected long-term reward is optimistic, i.e., $V^\pi_{\widetilde{\mathcal{M}}} \ge V^\pi_{\mathcal{M}^*}$.
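A minimal sketch of the optimistic construction in Proposition 3.2. Note the caps inside the $\min(\cdot)$ terms were garbled in extraction and are reconstructed here, and the constants `c_T`, `c_R`, `c_nu` are schematic placeholders:

```python
import numpy as np

def optimistic_hidden_rewards(trans_counts, episode_counts, H,
                              c_T=1.0, c_R=1.0, c_nu=1.0):
    """trans_counts[m, s, a, s'] are pseudo-counts; episode_counts[m] = N(m).
    Returns the per-(m,s,a) hidden rewards and per-m initial hidden rewards."""
    n_sa = np.maximum(trans_counts.sum(axis=3), 1.0)           # N_m(s,a)
    r_hid = H * np.minimum(2.0, np.sqrt((c_R + c_T) / n_sa))   # R~hid_m(s,a)
    r_init = np.minimum(1.0, np.sqrt(c_nu / np.maximum(episode_counts, 1.0)))
    return r_hid, r_init
```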
Algorithm 2 Access to True Contexts

Input: the true context $m^*$, revealed in hindsight.
Output: an encoded belief $\hat b$ over contexts:
$$\hat b(m) = \begin{cases} 1, & m = m^*, \\ 0, & m \ne m^*. \end{cases}$$

To establish this result, we make use of the "alpha vector" representation [41] of the value function for general POMDPs. Using this representation, we establish that every policy on the constructed LMDP has an optimistic value, namely, a value no smaller than its value on the true LMDP. The detailed analysis is deferred to Appendix B.1.

With the optimistic model constructed in Proposition 3.2, a planning-oracle-efficient algorithm based on the optimism principle is straightforward. The resulting latent upper confidence reinforcement learning (L-UCRL) algorithm is summarized in Algorithm 1. We note that most existing planning algorithms can incorporate the hidden-reward structure without changes; for instance, the PBVI algorithm [38] can be executed as is in the planning step. Hence, in each episode, we build one optimistic model as in Proposition 3.2 and call the planning oracle to get a policy for the episode. We then simply run the policy and update the model parameters in a straightforward manner. The algorithm can be implemented efficiently as long as an efficient (approximate) planning algorithm is available.

Based on the established optimism in Proposition 3.2, and by carefully bounding the on-policy errors, we arrive at the following regret guarantee for L-UCRL.
Theorem 3.3
The regret of Algorithm 1 is bounded by:
$$\mathrm{Regret}(K) \le \sum_{k=1}^K \big( V^{\pi_k}_{\widetilde{\mathcal{M}}_k} - V^{\pi_k}_{\mathcal{M}^*} \big) \lesssim HS\sqrt{MAN},$$
where $N = HK$ is the total number of actions taken.

The proof of Theorem 3.3 is given in Appendix B.4. The central result of this section, Theorem 3.3, leads to the following observation: polynomial sample complexity is possible for the LMDP model when the context of the underlying MDP is supplied at the end of each episode. In the next sections, we explore ways to relax this assumption while still providing a polynomial sample complexity guarantee.
3.3 Learning Strongly Separated LMDPs

Without explicit access to the true context at the end of an episode, it is natural to estimate the context from the sampled trajectory. This requires the MDPs to be well separated; one sufficient condition that ensures such separation is the following:
Assumption 1 ($\delta$-Strongly Separated MDPs). For all $m_1, m_2 \in [M]$ such that $m_1 \ne m_2$, and for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, the $l_1$ distance between the probability distributions over observations $o = (s', r)$ of the two MDPs is at least $\delta > 0$, i.e., $\|(P_{m_1} - P_{m_2})(o|s,a)\|_1 \ge \delta$ for some constant $\delta > 0$.

In order to reliably infer the true contexts, separation between the MDPs alone is not sufficient, since we must estimate the contexts from the current empirical estimates of the LMDP. To do so reliably, we need a well-initialized empirical model of the LMDP:
$$\|(\hat T_m - T_m)(s'|s,a)\|_1,\; \|(\hat\nu_m - \nu_m)(s)\|_1,\; \|(\hat R_m - R_m)(r|s,a)\|_1 \le \epsilon_{init}, \quad \forall (s,a), \qquad (4)$$
for some initialization error $\epsilon_{init} > 0$. Note that while the initialization error is relatively small, it can still be insufficient for obtaining a near-optimal policy (i.e., it would by itself result in linear regret). One can think of this as if every state-action pair had been visited at least $N$ times, with $N = c_T / \epsilon_{init}^2$, in each context in a fully observable setting.
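A minimal sketch that computes the separation constant of Assumption 1 for given parameter arrays; the same per-$(s,a)$ $l_1$ computation, applied to $(\hat T, \hat R)$ versus $(T, R)$, can check the initialization condition (4). The array layout follows the earlier sketches:

```python
import numpy as np

def min_pairwise_separation(T, R):
    """Returns min over context pairs and (s,a) of ||(P_m1 - P_m2)(s',r|s,a)||_1,
    where P_m(s',r|s,a) = T_m(s'|s,a) * R_m(r|s,a)."""
    M = T.shape[0]
    # Joint observation law P_m(o = (s', r) | s, a), shape (M, S, A, S, 2).
    P = T[..., :, None] * R[:, :, :, None, :]
    sep = np.inf
    for m1 in range(M):
        for m2 in range(m1 + 1, M):
            d = np.abs(P[m1] - P[m2]).sum(axis=(2, 3)).min()  # worst (s, a)
            sep = min(sep, d)
    return sep   # Assumption 1 holds with delta = sep, provided sep > 0
```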
Algorithm 3 Inference of Contexts

Input: trajectory $\tau = (s_1, a_1, r_1, \ldots, s_H, a_H, r_H)$
Output: an estimated belief over contexts $\hat b$:
$$\hat p_m(\tau) = \prod_{t=1}^H \big( \alpha + (1 - \alpha S)\, \hat P_m(s_{t+1}, r_t \mid s_t, a_t) \big), \qquad \hat b(m) = \frac{\hat p_m(\tau)}{\sum_{m'=1}^M \hat p_{m'}(\tau)}.$$

Once the initialization is given, along with separation between the MDPs, we can run Algorithm 1 and update the empirical estimate of the LMDP using the estimated belief over contexts computed by Algorithm 3. Note that when we update the model parameters, we increase the visit count of a state-action pair $(s,a)$ in the $m$th MDP by $\hat b(m)$. Under Assumption 1, this approximately adds a count for the correctly estimated context; but even without Assumption 1, the update steps can still be applied. In fact, this is equivalent to an implementation of the (online) expectation-maximization (EM) algorithm [10] for latent MDPs. Thus, Algorithm 1 together with Algorithm 3 essentially amounts to combining L-UCRL with the EM algorithm.
In terms of performance guarantees, using Algorithm 3 as a sub-routine of L-UCRL gives the same order of regret as in Theorem 3.3, as long as the true context can be almost correctly inferred with high probability for all $K$ episodes:

Theorem 3.4. Suppose Assumption 1 holds with
$$H > C \cdot \delta^{-2} \log(1/\alpha) \log(N/\eta)$$
for an absolute constant $C > 0$ and a smoothing parameter $\alpha > 0$ such that $\alpha \ln(1/\alpha) \le \delta^2/(200 S)$. If the initialization parameters satisfy equation (4) with initialization error $\epsilon_{init} \le \delta^2/(200 \ln(1/\alpha))$, then with probability at least $1 - \eta$, the regret of Algorithm 1 is bounded by
$$\mathrm{Regret}(K) \lesssim HS\sqrt{MAN}.$$
The proof of Theorem 3.4 is given in Appendix C.3. While our provable guarantees currently cover only well-separated LMDPs, we empirically evaluate Algorithm 1 as a function of separation and initialization (see Figure 1b).

An interesting consequence of Assumption 1 is that the length of an episode can be logarithmic in the problem parameters. With much longer time-horizons $H \ge \Omega(S^2 A/\delta^2)$, [8, 17] assumed a similar $\delta$-separation for only some $(s,a)$ pairs. While Assumption 1 requires the stronger condition of $\delta$-separation for all state-action pairs, the requirement on the time-horizon can be significantly weaker. For a slightly more general separation condition, see Appendix C.1.

3.4 Initialization via Predictive State Representations

Finally, we discuss efficient initialization under some additional assumptions. Clustering trajectories is the cornerstone of all our technical results, as this allows us to estimate the parameters of each hidden MDP and then apply the techniques of Section 3.2. The challenge is how to cluster when we have short trajectories and no good initialization.

The key is again Assumption 1. In Section 3.3, we use a good initialization to obtain accurate estimates of the belief states. These can then be clustered, thanks to Assumption 1, allowing us to obtain the true label in hindsight. Without initialization, we cannot accurately compute the belief state, so this avenue is blocked.
Algorithm 4 (Informal) Recovery of LMDP Parameters

Learn PSR parameters up to precision $o(\delta)$
Get clusters $\{\hat T_m(\cdot|s,a), \hat R_m(\cdot|s,a)\}_{(s,a) \in \mathcal{S}\times\mathcal{A},\, m \in [M]}$ with the learned PSR parameters
Build each MDP model by correctly assigning contexts to the estimated transition and reward probabilities
Return: a well-initialized model $\{\hat T_m, \hat R_m\}_{m \in [M]}$

Instead, our key idea is to leverage a predictive state representation (PSR) of the POMDP dynamics, and then to show that Assumption 1 also allows us to cluster in this space.

Algorithm 4 gives our approach. We first explain the high-level idea and subsequently detail some of the more subtle points. Suppose we have PSR parameters allowing us to estimate $P(o \mid h \,\|\, \mathrm{do}\, a)$ (the probability of any future observation $o = (s',r)$ given a history $h$ and intervening action $a$) to within accuracy $o(\delta)$. We then show that we can again apply Assumption 1 to (almost) perfectly cluster the trajectories by true context at the end of the episode (a minimal sketch of such a clustering rule follows below). After we collect transition probabilities at all states near the end of the episode, we can construct a full transition model for each MDP.
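Under Assumption 1, prediction vectors that are estimated to accuracy $o(\delta)$ and belong to different contexts remain roughly $\delta$ apart in $l_1$, so a simple greedy threshold rule can serve as the clustering step. A minimal sketch, where the threshold $\delta/2$ and the input format are our choices rather than the paper's:

```python
import numpy as np

def greedy_cluster(pred_vectors, delta):
    """pred_vectors: list of 1-D arrays (estimated P_m(o|s,a) vectors, one per
    trajectory). Returns one cluster label per vector."""
    centers, labels = [], []
    for v in pred_vectors:
        dists = [np.abs(v - c).sum() for c in centers]
        if centers and min(dists) < delta / 2.0:
            labels.append(int(np.argmin(dists)))
        else:
            centers.append(v)                 # open a new cluster (new context)
            labels.append(len(centers) - 1)
    return labels
```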
Learning the PSR to sufficient accuracy requires an additional assumption. We show that the following standard assumption on the statistical sufficiency of histories and tests suffices for our purposes (see also Section 2.2.1 and Appendix D.1):

Assumption 2 (Sufficient Tests/Histories). Let $\mathcal{T}$ and $\mathcal{H}$ be the sets of all possible tests and histories of length $l = O(1)$, respectively, with a given sampling policy $\pi$ (e.g., the uniformly random policy) for the histories $\mathcal{H}$. $\mathcal{T}$ and $\mathcal{H}$ satisfy Conditions 1 and 2, respectively.

While a worst-case instance may require $l \ge M$ to satisfy the full-rank conditions, we assume that the length of sufficient tests/histories is $l = O(1)$. In fact, $l = 1$ has been the (implicit) common assumption in the literature on learning POMDPs [19, 5, 16, 24]. Empirically, we observe that the more the MDPs differ, the more easily they satisfy Assumption 2; see Figure 2. At this point, we do not know whether sample-efficient learning is possible with only Assumption 1.

Though the main idea and key assumption are given above, a few important details and technical assumptions remain to complete this story. The primary remaining requirement is access to an exploration policy with sufficient mixing, guaranteeing that we can collect all the information required to perform the PSR-based clustering. The following assumption ensures that an additional $\tilde O(M/\alpha_3)$ sample trajectories obtained with the exploration policy $\pi$ provide $M$ clusters of estimated one-step predictions $P_m(o|s,a)$ for every state $s$ and intervening action $a$.

Assumption 3 (Reachability of States)
There exists an a priori known exploration policy $\pi$ such that, for all $m \in [M]$ and $s \in \mathcal{S}$, we have $P^\pi_m(s_{H-1} = s) \ge \alpha_3$ for some $\alpha_3 > 0$.

A subtle point is that we still face an ambiguity in the ordering of contexts (labels) assigned at different states, which prevents us from directly recovering the full model for each context. In Appendix D.2, we describe an approach that resolves this ambiguity assuming each MDP is connected.
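Assumption 3 can be checked empirically by Monte-Carlo rollouts of the exploration policy. A minimal sketch, where `sample_episode` is a hypothetical simulator handle and $\alpha_3$ follows our reconstruction of the (extraction-garbled) subscripts:

```python
import numpy as np

def check_reachability(sample_episode, M, S, H, n_rollouts=10000):
    """sample_episode(m, H) -> list of visited states s_1..s_H under pi.
    Returns the estimated min_{m,s} P^pi_m(s_{H-1} = s), a lower bound for alpha_3."""
    hits = np.zeros((M, S))
    for m in range(M):
        for _ in range(n_rollouts):
            states = sample_episode(m, H)
            hits[m, states[H - 2]] += 1   # s_{H-1} in the paper's 1-indexing
    return (hits / n_rollouts).min()
```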
We conclude this section with an end-to-end guarantee.

Theorem 3.5
Let Assumption 2 hold for an LMDP instance with a sampling policy $\pi$. Furthermore, assume the LMDP satisfies Assumptions 1 and 3. We learn the PSR parameters with $n_1$ short trajectories of length $l+1$, where
$$n_1 = \mathrm{poly}\big( A^l, S, \epsilon_c^{-1}, \sigma_h^{-1}, \sigma_\tau^{-1}, \alpha_3^{-1}, \alpha_4^{-1}, H, M \big),$$
where $\epsilon_c < \min(\delta, \epsilon_{init})$ is a desired accuracy for the estimated predictions, and $\alpha_4 > 0$ is a parameter related to the connectivity of the MDPs (see Assumption 4 in Appendix D.2). Let the number of additional episodes with time-horizon $H \ge C \cdot \delta^{-2} \log(1/\alpha) \log(N/\eta)$ (as in Theorem 3.4) used for the clustering be
$$n_2 = C \cdot M^3 A \log(MS) / (\alpha_3 \alpha_4),$$
for some absolute constant $C > 0$. Then, with constant probability, Algorithm 4 (see the full algorithm described in Algorithm 6) returns an initialization of the LMDP parameters that satisfies the initialization condition (4).

Figure 1: (a) True context vs. EM with poor/good initialization. We compare the performance of L-UCRL (Algorithm 1) when true contexts are revealed in hindsight (Algorithm 2) and when we infer the contexts with the EM algorithm (Algorithm 3). Without proper initialization, EM may converge to a local optimum, which in turn results in a sub-optimal policy. (b) EM and L-UCRL with good initialization. Performance of EM + L-UCRL under different levels of separation and horizon. The larger $\delta$ and the longer $H$, the better separated the MDPs are, which results in fast convergence of EM. For smaller $\delta$ or $H$, the convergence speed of EM slows down. L-UCRL finds the optimal policy for all levels of $\delta$ and $H$.

Theorem 3.5 completes the entire pipeline for learning in latent MDPs: we initialize the parameters via the estimated PSR and clustering (see Appendix D) up to some accuracy, and then we run L-UCRL to refine the model and policy up to arbitrary accuracy (Algorithm 1). Note that the constant-probability guarantee can be boosted to an arbitrarily high level $1-\eta$ by repeating Algorithm 6 $O(\log(1/\eta))$ times and selecting a model via majority vote. We mention that we have not optimized polynomial factors, as our focus is on avoiding the exponential lower bound under additional assumptions. The proof of Theorem 3.5 is given in Appendix D.2.1.

4 Experiments

In this section, we evaluate the proposed algorithms on synthetic data. Our first two experiments illustrate the performance of L-UCRL (Algorithm 1) for various levels of separation and quality of initialization. Then, we empirically study the performance of the PSR-and-clustering algorithm on randomly generated LMDPs for different levels of separation and time-horizon.
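All three experiments use randomly generated LMDP instances with a prescribed separation level. The paper does not spell out its generator, so the following is only one plausible sketch: a shared base kernel plus context-specific perturbations.

```python
import numpy as np

def random_lmdp(M, S, A, delta, rng=np.random.default_rng(0)):
    """Shared base kernel plus a per-context perturbation of weight delta.
    Pairwise per-(s,a) l1 distances are then at most 2*delta and typically on
    the order of delta; the paper further enforces delta <= dist <= 2*delta,
    which could be added as a rejection step."""
    base = rng.dirichlet(np.ones(S), size=(S, A))       # shared dynamics
    bump = rng.dirichlet(np.ones(S), size=(M, S, A))    # per-context part
    return (1.0 - delta) * base[None] + delta * bump    # rows still sum to 1
```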
We first study the importance of getting the true contexts in hindsight for the approach analyzed in this work, by comparing Algorithm 1 when using Algorithm 2 or Algorithm 3 as a sub-routine. We generate random LMDP instances of size $M = 7$, $S = 15$, $A = 3$ and set the time-horizon $H = 30$. The reward distribution is set to 0 for most state-action pairs. We compare giving the true context to the algorithm (Algorithm 2) against inferring the context with random or good initialization (Algorithm 3); note that the latter is equivalent to running the EM algorithm for model estimation.

We measure the model estimation error by summing the $l_1$ differences in the probabilities of rewards and transitions:
$$\mathrm{error} := \min_{\sigma \in \mathrm{Perm}_M} \sum_{(m,s,a)} \big\| (P_m - \hat P_{\sigma(m)})(s', r \mid s, a) \big\|_1,$$
where $\mathrm{Perm}_M$ denotes the set of all permutations over $[M]$ (a direct computation is sketched below). The performance of a policy is measured by averaging the total rewards over the last thousand episodes. For the planning algorithm, we find that the Q-MDP heuristic [31] shows good performance. The measured errors are averaged over 10 independent experiments.

The experimental results are given in Figure 1a. When the true context is given at the end of the episode (Algorithm 2), L-UCRL converges to the optimal policy, as our theory suggests. On the other hand, if the true context is not given (Algorithm 3), the quality of the initialization becomes crucial: when the model is poorly initialized, the estimated model converges to a local optimum, which leads to a sub-optimal policy. When the model is well initialized, L-UCRL performs as well as when true contexts are given in hindsight.

In our second experiment, we focus on the performance of L-UCRL (Algorithm 1) together with Algorithm 3 under different levels of separation ($\delta$ in Assumption 1) when approximately good model parameters are given. For various levels of $\delta$, we generate the transition probabilities randomly while keeping the distance between different MDPs within $\delta \le \|(T_{m_1} - T_{m_2})(s'|s,a)\|_1 \le 2\delta$ for $m_1 \ne m_2$. As in the previous experiment, we test the algorithms on random LMDP instances of size $M = 7$, $S = 15$, $A = 3$.

We show the error in the estimated model and the average long-term rewards in Figure 1b. When the separation is sufficient (larger $\delta$ or $H$), the estimated model converges quickly to the true parameters. When the separation gets smaller (smaller $\delta$ or $H$), the convergence slows down. This type of transition in the convergence speed of EM (the update of model parameters with Algorithm 3) is observed both in theory and in practice when the overlap between mixture components grows (e.g., [28]). On the other hand, the policy steadily improves regardless of the level of separation. We conjecture that this is because the optimal policy only needs the model to be accurate in total-variation distance, not in the actual estimated parameters.

In the third experiment, we evaluate the initialization algorithm (Algorithm 4) on randomly generated LMDP instances. Since PSR learning requires a (relatively) large number of short sample trajectories, we evaluate this step on smaller instances with $S = 7$, $A = 2$, $M = 3$. The LMDP instances are generated as in the second experiment, with different levels of $\delta$ and $H$. The reward and initial-state distributions are set the same across all MDPs.

To learn the parameters of the PSR, we run episodes with $H = 4$. We assume histories and tests of length 1 are statistically sufficient under the uniformly random policy. In the clustering step, we run additional episodes to obtain longer trajectories of length $H = 20$ and $H = 30$. We report the experimental results in Figure 2.

Figure 2: Performance of the PSR-learning-and-clustering algorithm for different levels of separation. Left: convergence of the belief state for different levels of $\delta$ and $H$. Middle: $M$th singular value of the sufficient histories/tests matrix for various $\delta$. Right: accuracy of the estimated model produced by the PSR-clustering algorithm under various $\delta$ and $H$; it failed for small $\delta$ due to the small $M$th singular value of $P_{\mathcal{T},\mathcal{H}}$.

We first observe how the level of separation $\delta$ between the MDPs impacts trajectory separation, i.e., belief state vs. true label (left panel). Recall that this separation property is the key to clustering trajectories.
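The permutation-minimized model error defined above can be computed directly; a minimal sketch, feasible for the small $M$ used in these experiments (the minimization runs over all $M!$ matchings):

```python
import itertools
import numpy as np

def model_error(T_true, R_true, T_hat, R_hat):
    """Arrays indexed as [m, s, a, .]; returns
    min over permutations sigma of sum_{m,s,a} ||(P_m - P_hat_sigma(m))(s',r|s,a)||_1."""
    M = T_true.shape[0]
    P_true = T_true[..., :, None] * R_true[:, :, :, None, :]   # P_m(s', r | s, a)
    P_hat = T_hat[..., :, None] * R_hat[:, :, :, None, :]
    best = np.inf
    for sigma in itertools.permutations(range(M)):
        err = sum(np.abs(P_true[m] - P_hat[sigma[m]]).sum() for m in range(M))
        best = min(best, err)
    return best
```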
We then examine the performance of Algorithm 4 (see the full Algorithm 6) for various levels of separation. Empirically, it succeeds in obtaining a good initialization of the LMDP model when there is sufficient separation. As the separation level decreases, the algorithm fails to recover good enough LMDP parameters (right panel). There are two possible sources of failure: (1) the belief state is far from the true context, and (2) the similarity between the MDPs drives down the $M$th singular value of $P_{\mathcal{T},\mathcal{H}}$ (middle panel). We can compensate for (1) with a longer time-horizon for inferring true contexts, as in the leftmost graph. For (2), if the $M$th singular value of $P_{\mathcal{T},\mathcal{H}}$ drops, we require more samples to estimate the PSR parameters. In our experiments, as we decreased $\delta$, we found that failure in the spectral learning step was the more significant of the two.

5 Conclusion

We establish the first theoretical results for RL with latent contexts. We first established a lower bound for general LMDPs, showing that the necessary number of episodes can be exponential in the number of contexts. Then, we found that sample-efficient RL is possible when the true contexts of the interacting MDPs are revealed in hindsight. Building on this observation, we proposed a sample-efficient algorithm for a class of well-separated LMDP instances under additional technical assumptions. We also evaluated the proposed algorithms on synthetic data. The proposed EM + L-UCRL algorithm performed very well on random instances once initialized, whereas the spectral learning and clustering method was sensitive to the amount of separation between contexts.

There are several interesting research avenues in continuation of this work. One interesting direction is to study RL algorithms for LMDPs with no underlying assumptions. Although our lower bound shows that such an algorithm necessarily suffers an exponential dependence on the number of contexts, if this number is small, such dependence might be acceptable to an algorithm designer. Specifically, we conjecture the following:
Open Question 1 (Upper Bound)
Can we learn an $\epsilon$-optimal policy for LMDPs with sample complexity at most $\mathrm{poly}\big( (HSA)^M, \epsilon^{-1} \big)$, without any assumptions?

In Appendix A.2, we show that exponential dependence on $M$ is sufficient when the MDPs are fully deterministic. The case of general LMDPs is an interesting open question. Furthermore, a needed empirical advancement is to design efficient ways to learn the set of sufficient histories/tests for learning a predictive state representation of LMDPs; this could dramatically improve the performance of our algorithms when a sufficiently good initial model needs to be learned.

References

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models.
Journal of Machine Learning Research, 15:2773–2832, 2014.
[2] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, pages 33–1, 2012.
[3] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[4] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.
[5] K. Azizzadenesheli, A. Lazaric, and A. Anandkumar. Reinforcement learning of POMDPs using spectral methods. In Conference on Learning Theory, pages 193–256, 2016.
[6] B. Boots and G. J. Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[7] B. Boots, S. M. Siddiqi, and G. J. Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
[8] E. Brunskill and L. Li. Sample complexity of multi-task reinforcement learning. In Uncertainty in Artificial Intelligence, page 122. Citeseer, 2013.
[9] P. Buchholz and D. Scheftelowitsch. Computation of weighted sums of rewards for concurrent MDPs. Mathematical Methods of Operations Research, 89(1):1–42, 2019.
[10] O. Cappé and E. Moulines. On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.
[11] I. Chadès, J. Carwardine, T. Martin, S. Nicol, R. Sabbadin, and O. Buffet. MOMDPs: a solution for modelling adaptive management problems. In Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12), 2012.
[12] C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, pages 1422–1432, 2018.
[13] S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674, 2019.
[14] C. Gentile, S. Li, P. Kar, A. Karatzoglou, G. Zappella, and E. Etrue. On context-dependent clustering of bandits. In International Conference on Machine Learning, pages 1253–1262. PMLR, 2017.
[15] C. Gentile, S. Li, and G. Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757–765, 2014.
[16] Z. D. Guo, S. Doroudi, and E. Brunskill. A PAC RL algorithm for episodic POMDPs. In Artificial Intelligence and Statistics, pages 510–518, 2016.
[17] A. Hallak, D. Di Castro, and S. Mannor. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.
[18] A. Hefny, C. Downey, and G. J. Gordon. Supervised learning for dynamical system learning. In Advances in Neural Information Processing Systems, pages 1963–1971, 2015.
[19] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
[20] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems, pages 345–352, 1995.
[21] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
[22] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
[23] N. Jiang, A. Kulesza, and S. Singh. Improving predictive state representations via gradient descent. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[24] C. Jin, S. M. Kakade, A. Krishnamurthy, and Q. Liu. Sample-efficient reinforcement learning of undercomplete POMDPs. arXiv preprint arXiv:2006.12484, 2020.
[25] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.
[26] J. Kwon and C. Caramanis. The EM algorithm gives sample-optimality for learning mixtures of well-separated Gaussians. In Conference on Learning Theory, pages 2425–2487, 2020.
[27] J. Kwon and C. Caramanis. EM converges for a mixture of many linear regressions. In International Conference on Artificial Intelligence and Statistics, pages 1727–1736, 2020.
[28] J. Kwon, N. Ho, and C. Caramanis. On the minimax optimality of the EM algorithm for learning two-component mixed linear regression. arXiv preprint arXiv:2006.02601, 2020.
[29] Y. Li, B. Yin, and H. Xi. Finding optimal memoryless policies of POMDPs under the expected average reward criterion. European Journal of Operational Research, 211(3):556–567, 2011.
[30] M. L. Littman. Memoryless policies: Theoretical limitations and practical results. In From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior, volume 3, page 238. Cambridge, MA, 1994.
[31] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Machine Learning Proceedings 1995, pages 362–370. Elsevier, 1995.
[32] M. L. Littman and R. S. Sutton. Predictive representations of state. In Advances in Neural Information Processing Systems, pages 1555–1561, 2002.
[33] Y. Liu, Z. Guo, and E. Brunskill. PAC continuous state online multitask reinforcement learning with identification. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 438–446, 2016.
[34] O.-A. Maillard and S. Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144, 2014.
[35] A. Modi, N. Jiang, S. Singh, and A. Tewari. Markov decision processes with continuous side information. In Algorithmic Learning Theory, pages 597–618, 2018.
[36] A. Y. Ng and M. Jordan. PEGASUS: a policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 406–415, 2000.
[37] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[38] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.
[39] S. Ross, M. Izadi, M. Mercer, and D. Buckeridge. Sensitivity analysis of POMDP value functions. Pages 317–323. IEEE, 2009.
[40] S. Singh, M. R. James, and M. R. Rudary. Predictive state representations: a new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 512–519, 2004.
[41] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088, 1973.
[42] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 520–527, 2004.
[43] M. T. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.
[44] L. N. Steimle, D. L. Kaufman, and B. T. Denton. Multi-model Markov decision processes. 2018.
[45] G. W. Stewart. Matrix perturbation theory. 1990.
[46] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[47] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
[48] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
Appendix A Guarantees for Latent Deterministic MDPs
In this section, we provide lower and upper bounds for LMDP instances with deterministic MDPs. The lower bound for latent deterministic MDPs implies the lower bound for general LMDP instances, proving Theorem 3.1. The upper bound for latent deterministic MDPs supports our conjecture on the sample complexity of learning general LMDPs (Open Question 1) and may be of independent interest.
A.1 Lower Bound (Theorem 3.1)
We consider the following construction with $M$ deterministic MDPs, $H = M$, $S = M + 1$ states, and $A$ actions:

1. At the start of an episode, one of the $M$ MDPs $\{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_M\}$ is chosen with probability $1/M$.
2. At each time step, each MDP either goes to the next state or to the SINK state, depending on the action chosen at that time step. Once we fall into the SINK state, we stay in the SINK state throughout the episode without any reward.

3. The rewards of all state-action pairs are 0, except at time step $t = M$ with the right action choice $a_M$, and only in the first MDP $\mathcal{M}_1$.

4. At time step $t = 1$, there are three state-transition possibilities:
• $\mathcal{M}_1$: for all actions $a \in \mathcal{A}$ except $a_1$, we go to the SINK state; for the action $a_1$, we go to the next state.
• $\mathcal{M}_M$: for all actions $a \in \mathcal{A}$ except $a_1$, we go to the next state; for the action $a_1$, we go to the SINK state.
• $\mathcal{M}_2, \ldots, \mathcal{M}_{M-1}$: for all actions $a \in \mathcal{A}$, we go to the next state.

5. At time step $t = 2$, we again have three cases, but now $\mathcal{M}_1$ and $\mathcal{M}_M$ look the same:
• $\mathcal{M}_1, \mathcal{M}_M$: for all actions $a \in \mathcal{A}$ except $a_2$, we go to the SINK state; for the action $a_2$, we go to the next state.
• $\mathcal{M}_{M-1}$: for all actions $a \in \mathcal{A}$ except $a_2$, we go to the next state; for the action $a_2$, we go to the SINK state.
• $\mathcal{M}_2, \ldots, \mathcal{M}_{M-2}$: for all actions $a \in \mathcal{A}$, we go to the next state.
...

6. At time step $t = M - 1$:
• $\mathcal{M}_1, \mathcal{M}_3, \ldots, \mathcal{M}_M$: for all actions $a \in \mathcal{A}$ except $a_{M-1}$, we go to the SINK state; for the action $a_{M-1}$, we go to the next state.
• $\mathcal{M}_2$: for all actions $a \in \mathcal{A}$ except $a_{M-1}$, we go to the next state; for the action $a_{M-1}$, we go to the SINK state.

7. At time step $t = M$, there are two possibilities for rewards:
• $\mathcal{M}_1$: for the action $a_M \in \mathcal{A}$, we get reward 1; for all other actions, we get no reward.
• $\mathcal{M}_2, \ldots, \mathcal{M}_M$: for all actions $a \in \mathcal{A}$, we get no reward.

Note that the right action sequence is $a^* = (a_1, a_2, \ldots, a_M)$. However, without information on the true context, the system dynamics under any of the $A^M - 1$ wrong action sequences look exactly like Figure 3, with zero rewards, i.e.,
$$P\big(s_{1:H}, r_{1:H} \,\|\, \mathrm{do}\, a^{(1)}_{1:H}\big) = P\big(s_{1:H}, r_{1:H} \,\|\, \mathrm{do}\, a^{(2)}_{1:H}\big),$$
for any two wrong action sequences $a^{(1)} = a^{(1)}_{1:H}$ and $a^{(2)} = a^{(2)}_{1:H}$ with $a^{(1)}, a^{(2)} \ne a^*$. The probability distribution of observation sequences under any wrong action sequence is the same as the distribution of sequences generated by the surrogate Markov chain in Figure 3. Therefore, we cannot gain any information from executing a wrong action sequence besides eliminating that one sequence. Note that there are $A^M$ possible action sequences; hence, the problem reduces to finding one specific sequence among $A^M$ possibilities, without any other information on the correct action sequence. It follows that until we have tried most of the $A^M$ action sequences, we cannot find the correct one.

Figure 3: External view of the system dynamics under wrong action sequences, without context information. Arrows indicate the transition probabilities of the surrogate Markov chain that represents this external view.
Figure 4: Effectively amplifying the number of actions.
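To make the indistinguishability argument concrete, the following is a minimal Python sketch of the construction above (our own illustration, not part of the original analysis; the 0-indexed encoding of contexts and the `anti`/`require` bookkeeping are assumptions reconstructing items 4–6): it simulates the latent combination lock and checks exhaustively that any two wrong action sequences induce the same observation distribution.

```python
from collections import Counter
from itertools import product

M, A = 4, 3          # number of contexts / actions (small, for enumeration)
H = M                # horizon equals the number of contexts
astar = [0] * H      # w.l.o.g. the correct sequence plays action 0 each step
SINK = -1

def episode(m, actions):
    """Deterministic trajectory of context m (m = 0 encodes the rewarding
    MDP M_1) under an action sequence; returns the (state, reward) sequence."""
    obs, state = [], 0
    for t, a in enumerate(actions, start=1):
        if state == SINK:
            obs.append((SINK, 0))
            continue
        if t < H:
            anti = M - t                       # context deviating at step t
            require = (m == 0) or (m > anti)   # contexts that demand a_t
            if m == anti:
                state = SINK if a == astar[t - 1] else state + 1
            elif require:
                state = state + 1 if a == astar[t - 1] else SINK
            else:
                state += 1                     # pass-through contexts
            obs.append((state, 0))
        else:                                  # t == H: reward step
            r = 1 if (m == 0 and a == astar[-1]) else 0
            obs.append((state, r))
    return tuple(obs)

def mixture(actions):
    """Observation distribution under the uniform mixture of contexts."""
    return Counter(episode(m, actions) for m in range(M))

wrong = [a for a in product(range(A), repeat=H) if list(a) != astar]
ref = mixture(wrong[0])
assert all(mixture(a) == ref for a in wrong)
print("all", len(wrong), "wrong sequences induce the same distribution")
```

Running the sketch confirms that every wrong sequence produces the same multiset of observation sequences (exactly one context sinks at each step and one survives with zero reward), so only the correct sequence is statistically distinguishable.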
Now the above argument can easily be extended to a lower bound of $\Omega\big((SA/M)^M\big)$ by effectively amplifying the number of actions up to $O(SA/M)$. That is, we can amplify the effective number of actions by considering a big state consisting of a tree of states of depth $O(\log_A S)$ (see Figure 4). Since we have $M$ such big states, the total number of states is $O(MS)$ in our lower bound example with the amplified number of actions; conversely, if the total number of states is $S$, then the effective number of actions is $O(SA/M)$ per big state. This gives the $\Omega\big((SA/M)^M\big)$ lower bound.

Finally, we can include an $\epsilon$-additive approximation factor to get the lower bound $\Omega(A^M/\epsilon^2)$ by properly adjusting the reward distribution at the last time step $t = H$ as follows:
- $\mathcal{M}_1$: for the action $a_M \in \mathcal{A}$ at time step $H = M$, we get a reward from the Bernoulli distribution $\mathrm{Ber}(1/2 + \epsilon)$. For all other actions, we get a reward from $\mathrm{Ber}(1/2)$.
- $\mathcal{M}_2, \ldots, \mathcal{M}_M$: for all actions $a \in \mathcal{A}$, we get a reward from $\mathrm{Ber}(1/2)$.

Then the distribution of the final reward under $a^*$ is $\frac{1}{M}\mathrm{Ber}(1/2 + \epsilon) + \frac{M-1}{M}\mathrm{Ber}(1/2)$, whereas the distribution under all other action sequences is $\mathrm{Ber}(1/2)$. As above, for any wrong action sequence the probability of observations is identical. Hence, identifying the optimal action sequence $a^*$ among all $A^M$ action sequences requires $\Omega(A^M/\epsilon^2)$ trials, i.e., it amounts to identifying an $\epsilon$-optimal arm among $A^M$ arms.

A.2 Upper Bound for Learning Deterministic LMDPs
Although the lower bound is exponential in $M$, it would not be a disaster for instances with a small number of contexts. In this appendix, we briefly discuss whether the exponential dependence on $M$ is sufficient for learning deterministic MDPs with latent contexts.

Algorithm 5: Exploring Deterministic MDPs with Latent Contexts
Initialization:
For $O(M \log M)$ episodes, observe the possible initial states (discard all other information).
If only one initial state $s_1$ is seen over the $O(M \log M)$ episodes, then set $\mathcal{C}_1 = \{(\phi, s_1)\}$. Otherwise, let $\mathcal{C}_1$ be the set of all observed initial states, $\{(\{(\mathrm{init}, s)\}, s),\ \forall s \text{ observed}\}$.
for $t = 1, \ldots, H$ do
  Let $\mathcal{C}_{t+1} = \{\}$.
  for each $(C, s) \in \mathcal{C}_t$ do
    for each $a \in \mathcal{A}$ do
      Let $O = \{\}$.
      1. Find any action sequence $a_1, \ldots, a_{t-1}$ that can result in state $s$ with distinguishing observations $C$.
      2. For $O(M \log M)$ episodes, run the action sequence $a_1, \ldots, a_{t-1}$ (execute any policy for the remaining time steps). If we arrive at $s$ with distinguishing observations $C$, then run action $a$ and get a new observation of the next state and reward $(s', r)$. Update $O \leftarrow O \cup \{(s', r)\}$ and record the probability $p$ of observing $(s', r)$ conditioned on $C$ and $s$.
      if $|O| = 1$ then
        For $(s', r) \in O$, update $\mathcal{C}_{t+1} \leftarrow \mathcal{C}_{t+1} \cup \{(C, s')\}$. Record that there is a path from $(C, s)$ to $(C, s')$ by taking action $a$ with reward $r$.
      else
        For each $(s', r) \in O$, let $C' = C \cup \{(s, a, s', r)\}$. Update $\mathcal{C}_{t+1} \leftarrow \mathcal{C}_{t+1} \cup \{(C', s')\}$. Record that there is a path from $(C, s)$ to $(C', s')$ by taking action $a$ with reward $r$ and recorded probability $p$.
      end if
    end for
  end for
end for

If that is the case, then we can exclude the possibility of an $\Omega(A^H)$ lower bound for deterministic LMDP instances. Intuitively, an exponential dependence on the time horizon is unlikely in LMDPs for the following reason: under certain regularity assumptions, if the time horizon is extremely long, $H \gg SA$, so that every state-action pair can be visited sufficiently many times, then each trajectory can be easily clustered and the recovery of the model is straightforward. The following theorem shows that we do not suffer from $\Omega(A^H)$ sample complexity for deterministic LMDPs:

Theorem A.1 (Upper Bound for Deterministic LMDPs)
For any LMDP with a set of deterministic MDPs, there exists an algorithm that finds the optimal policy after at most $O\big(H(SA)^M \cdot \mathrm{poly}(M) \log M\big)$ episodes.

The algorithm achieving this upper bound is given in Algorithm 5. While the upper bound for deterministic LMDPs does not imply an upper bound for general stochastic LMDPs, we have shown that an exponential lower bound higher than $O((SA)^M)$ cannot be obtained via deterministic examples. We leave as future work the study of the fundamental limits of general instances of LMDPs, and in particular, whether the problem is learnable with $\tilde{O}((SA)^M)$ sample complexity, which can be promising when the number of contexts is small enough (e.g., $M = 2, 3$).

A.2.1 Proof of Theorem A.1

Algorithm 5 is essentially a pure exploration algorithm which searches over all possible states. After the pure exploration phase, we model the entire system as one large MDP with $\mathrm{poly}(M) \cdot (SA)^M$ states. The optimal policy can be found by solving this large MDP.

The core idea behind the algorithm is that, since the system is deterministic, whenever there exists more than one possible observation (a pair of reward and next state) from the same state and action, at least one MDP behaves differently from the others at that state-action pair. Therefore, each such observation can be considered a new distinguishing observation that separates at least one MDP from the others. Afterwards, we can consider a sub-problem of exploration in the remaining time steps, given the distinguishing observation in the history and the current state. The same argument applies within sub-problems, which leads to the concept of conditioning on a set of distinguishing observations and the current state.

On the other hand, if an action results in the same observation for all MDPs given a set of distinguishing observations and a state, then we only ever see one possibility. In this case, this state-action pair does not reveal any information on the context, and can be ignored for future decision making.

Algorithm 5 implements the above principles: for each time step $t$, we construct the set of all reachable states paired with sets of distinguishing observations in their histories. In order to find all possibilities, for each observation set, state, and action, we first find an action sequence by which we can reach the desired state (with the target distinguishing observation set). Since all MDPs are deterministic, the existence of such a path means that at least one MDP always reaches the desired state under that action sequence. The sequence can be found by the induction hypothesis that we are given all possible transitions and observations at previous time steps $1, \ldots, t-1$. By a coupon-collecting argument, if we try the same action sequence for $O(M \log M)$ episodes, we see all the different transitions that the MDPs consistent with the target observation set and state can produce. By doing this for all reachable states and observation sets, we find all possibilities that can happen at time step $t$. The procedure repeats until $t = H$, and eventually we find all possible outcomes of all action sequences.

An important question is how many different possibilities we encounter in this procedure. Note that as we find a new distinguishing observation, we cut out the possibility of at least one MDP conditioned on that new observation.
Since there are only $M$ possible MDPs, the size of a distinguishing observation set cannot be larger than $M - 1$. Based on this observation, the number of all possible combinations of observation set and state is less than $\binom{MSA}{M-1} \cdot S$. Note that $MSA$ is the total number of possible state-action-observation $(s, a, s', r)$ tuples, since each of the $M$ deterministic MDPs produces exactly one observation per state-action pair. Hence in each time step, the iteration complexity does not exceed $\binom{MSA}{M-1} \cdot SA$ times the number of episodes per possible state and observation set. Since we loop this procedure for $H$ steps, the total number of episodes is bounded by
$$O\Big(HSA \binom{MSA}{M-1} \cdot M \log M\Big),$$
which results in the sample complexity of $O\big(H(SA)^M \cdot \mathrm{poly}(M) \log M\big)$.

Appendix B Analysis of L-UCRL when True Contexts are Revealed
In this section, we prove the optimism lemma (Lemma 3.2) and the regret guarantee (Theorem 3.3) achieved by Algorithm 1 when true contexts are given in hindsight.
B.1 Analysis of Optimism in Alpha-Vectors
We start with the important observation that an upper confidence bound (UCB) type algorithm can be implemented in the belief-state space. Even though exact planning in a belief-state space is not implementable, we can still describe how value iteration is performed in partially observable domains. Let $h$ be an entire history at time $t$, and denote by $b(h)$ the belief state over the $M$ MDPs corresponding to a history $h$. The value iteration with a (history-dependent) policy $\pi$ is given as
$$Q_t^\pi(h, s, a) = b(h)^\top \bar{R}(s, a) + \mathbb{E}_{s', r \mid h, s, a}\big[V_{t+1}^\pi(h', s')\big],$$
for $t = 1, \ldots, H$, where $h' = ha(rs')$ is the concatenated history. Here $Q_t^\pi(h, s, a)$ and $V_t^\pi(h, s)$ are the state-action-value and state-value functions at time step $t$, respectively, given a history $h$ and a policy $\pi$. $\bar{R}(s, a) \in \mathbb{R}^M$ is a vector whose $m^{th}$ coordinate $\bar{R}_m(s, a)$ is the expected immediate reward at $(s, a)$ in the $m^{th}$ MDP, i.e., $\bar{R}_m(s, a) = \mathbb{E}_{r \sim R_m(r|s,a)}[r]$. In case there exists a hidden reward $R_m^{hid}(s, a)$, we define $\bar{R}_m(s, a) = \mathbb{E}_{r \sim R_m(r|s,a)}[r] + R_m^{hid}(s, a)$. At the end of the episode, we set $V_{H+1}^\pi = 0$. We first need the following lemma on the policy evaluation procedure of a POMDP.

Lemma B.1
For any history $h$ at time $t$, the value function for a policy $\pi$ can be written as
$$V_t^\pi(h, s) = b(h)^\top \alpha_{t,s}^{h,\pi}, \qquad (5)$$
for some $\alpha_{t,s}^{h,\pi} \in \mathbb{R}^M$ uniquely decided by $t, s, h$ and $\pi$.

Proof. We will show that the value of $\alpha_{t,s}^{h,\pi}$ is decided only by the history and the policy, and is not affected by the history-to-belief-state mapping. The Bayesian update for $h'$ is given by
$$b_m(ha(rs')) = \frac{b_m(h)\, T_m(s'|s,a)\, R_m(r|s,a)}{\sum_{m'} b_{m'}(h)\, T_{m'}(s'|s,a)\, R_{m'}(r|s,a)} = \frac{b_m(h)\, P_m(s', r \mid s, a)}{P(s', r \mid h, s, a)}.$$
Thus, the value iteration for policy evaluation in LMDPs can be written as
$$Q_t^\pi(h, s, a) = b(h)^\top \bar{R}(s, a) + \sum_{(s', r)} \sum_m b_m(h)\, \alpha_{t+1, s'}^{ha(rs'), \pi}(m)\, P_m(s', r \mid s, a), \qquad V_t^\pi(h, s) = \sum_a \pi(a \mid h)\, Q_t^\pi(h, s, a). \qquad (6)$$
Let us explain how the alpha vectors [41] can be constructed recursively from time step $H + 1$. Note that $V_{H+1}(h, s) = 0$ for any $h$ and $s$; therefore $\alpha_{H+1, s}^{h,\pi} = 0$. Then we can define the set of alpha vectors recursively such that
$$\alpha_{t,s}^{h,a,*,\pi}(m) = \bar{R}_m(s, a), \qquad \alpha_{t,s}^{h,a,(s',r),\pi}(m) = P_m(s', r \mid s, a)\, \alpha_{t+1, s'}^{ha(s'r), \pi}(m) \quad \forall (s, a, r, s'). \qquad (7)$$
Finally, the alpha vector for the value with respect to $h$ is constructed as
$$\alpha_{t,s}^{h,\pi}(m) = \sum_a \pi(a \mid h) \Big( \alpha_{t,s}^{h,a,*,\pi}(m) + \sum_{s', r} \alpha_{t,s}^{h,a,(s',r),\pi}(m) \Big).$$
Note that in the construction of the alpha vectors, the mapping from history to belief state is not involved, and the value function can be represented as $V_t^\pi(h, s) = b(h)^\top \alpha_{t,s}^{h,\pi}$. $\Box$

Now consider the optimistic model defined in Lemma 3.2. For the optimistic model, the intermediate alpha vectors are constructed with the following recursive equations:
$$\tilde{\alpha}_{t,s}^{h,a,*,\pi}(m) = \mathbb{E}_{r \sim \tilde{R}_m^{obs}(r|s,a)}[r] + \tilde{R}_m^{hid}(s, a), \qquad \tilde{\alpha}_{t,s}^{h,a,(s',r),\pi}(m) = \tilde{T}_m(s'|s,a)\, \tilde{R}_m^{obs}(r|s,a)\, \tilde{\alpha}_{t+1, s'}^{ha(s'r), \pi}(m) \quad \forall (s, a, r, s'). \qquad (8)$$
From the constructions of alpha vectors above, we can show optimism in the alpha vectors:

Lemma B.2 Let $\alpha_{t,s}^{h,\pi}$ and $\tilde{\alpha}_{t,s}^{h,\pi}$ be alpha vectors constructed with $M^*$ and $\widetilde{M}$ respectively. Then for all $t, s, h, \pi$, we have $\tilde{\alpha}_{t,s}^{h,\pi} \succeq \alpha_{t,s}^{h,\pi}$.

The lemma implies that if a history is mapped to the same belief state in both models, then we also have optimism in the value functions. Note that in general, different models will map a history to different belief states. At the initial time step, however, we start from similar belief states, and we can claim Lemma 3.2. The remaining proof of Lemma 3.2 is given in Section B.3.
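To illustrate the recursion in (6)–(7), here is a minimal Python sketch (our own illustration on a toy LMDP; the memoryless policy and binary rewards are simplifying assumptions — with a memoryless policy the alpha vectors lose their history dependence, which keeps the example short): it computes alpha vectors backward in time and evaluates $V_t^\pi(h, s) = b(h)^\top \alpha_{t,s}^{h,\pi}$ without forming beliefs inside the recursion.

```python
import numpy as np

# Toy LMDP: M contexts, S states, A actions, binary rewards {0, 1}.
rng = np.random.default_rng(0)
M, S, A, H = 2, 3, 2, 4
T = rng.dirichlet(np.ones(S), size=(M, S, A))   # T[m, s, a, s']
Rprob = rng.uniform(size=(M, S, A))             # P(r = 1 | m, s, a)

def alpha(t, s, policy):
    """Alpha vector at (t, s) for a memoryless policy; equation (7)."""
    if t == H + 1:
        return np.zeros(M)
    out = np.zeros(M)
    for a in range(A):
        acc = Rprob[:, s, a].copy()             # expected reward per context
        for s_next in range(S):
            for r in (0, 1):
                # P_m(s', r | s, a) = T_m(s'|s,a) * R_m(r|s,a)
                p = T[:, s, a, s_next] * np.where(r, Rprob[:, s, a],
                                                  1 - Rprob[:, s, a])
                acc += p * alpha(t + 1, s_next, policy)
        out += policy(t)[a] * acc               # mix over actions, eq. (6)
    return out

def value(t, s, belief, policy):
    """V_t(h, s) = b(h)^T alpha_{t,s}; equation (5)."""
    return belief @ alpha(t, s, policy)

uniform = lambda t: np.full(A, 1.0 / A)
b0 = np.full(M, 1.0 / M)                        # prior belief over contexts
print(value(1, 0, b0, uniform))
```

The belief enters only once, through the inner product at the end; this is exactly why the optimism argument can be carried out coordinate-wise on the alpha vectors.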
B.2 Proof of Lemma B.2
We show this by mathematical induction, moving backwards in time from $t = H$. The inequality is trivial when $t = H + 1$ since $\alpha_{H+1,s}^{h,\pi} = \tilde{\alpha}_{H+1,s}^{h,\pi} = 0$ for any $h, \pi, s$. Now we investigate $\alpha_{t,s}^{h,\pi}(m)$. It is sufficient to show that for all $a \in \mathcal{A}$,
$$\alpha_{t,s}^{h,a,*,\pi}(m) + \sum_{s',r} \alpha_{t,s}^{h,a,(s',r),\pi}(m) \le \tilde{\alpha}_{t,s}^{h,a,*,\pi}(m) + \sum_{s',r} \tilde{\alpha}_{t,s}^{h,a,(s',r),\pi}(m).$$
Recalling equations (7), (8) for the alpha vectors,
$$\tilde{\alpha}_{t,s}^{h,a,*,\pi}(m) + \sum_{s',r} \tilde{\alpha}_{t,s}^{h,a,(s',r),\pi}(m) - \alpha_{t,s}^{h,a,*,\pi}(m) - \sum_{s',r} \alpha_{t,s}^{h,a,(s',r),\pi}(m)$$
$$\ge \Big( \mathbb{E}_{r \sim \tilde{R}_m^{obs}(r|s,a)}[r] - \mathbb{E}_{r \sim R_m(r|s,a)}[r] \Big) + \tilde{R}_m^{hid}(s,a) + \sum_{s',r} \Big( \tilde{T}_m(s'|s,a)\, \tilde{R}_m^{obs}(r|s,a)\, \tilde{\alpha}_{t+1,s'}^{ha(s'r),\pi}(m) - T_m(s'|s,a)\, R_m(r|s,a)\, \alpha_{t+1,s'}^{ha(s'r),\pi}(m) \Big)$$
$$\ge \Big( \mathbb{E}_{r \sim \tilde{R}_m^{obs}(r|s,a)}[r] - \mathbb{E}_{r \sim R_m(r|s,a)}[r] \Big) + H \min\Big(1, \sqrt{(c_R + c_T)/N_m(s,a)}\Big) + \sum_{s',r} \Big( \tilde{T}_m(s'|s,a)\, \tilde{R}_m^{obs}(r|s,a) - T_m(s'|s,a)\, R_m(r|s,a) \Big)\, \alpha_{t+1,s'}^{ha(s'r),\pi}(m),$$
where the last inequality comes from the induction hypothesis. On the other hand, note that $\tilde{R}_m^{obs}$ and $\tilde{T}_m$ are simply empirical estimates after visiting the state-action pair $N_m(s,a)$ times. Thus, it is easy to see that with high probability,
$$\Big| \mathbb{E}_{r \sim \tilde{R}_m^{obs}(r|s,a)}[r] - \mathbb{E}_{r \sim R_m(r|s,a)}[r] \Big| \le \|(\hat{R}_m - R_m)(\cdot \mid s,a)\|_1 \le \sqrt{c_R / N_m(s,a)},$$
$$\sum_{s',r} \Big| \hat{T}_m(s'|s,a)\, \hat{R}_m(r|s,a)\, \alpha_{t+1,s'}^{h,a,(s',r),\pi}(m) - T_m(s'|s,a)\, R_m(r|s,a)\, \alpha_{t+1,s'}^{h,a,(s',r),\pi}(m) \Big| \le H \sum_{s',r} \Big| \hat{T}_m(s'|s,a)\, \hat{R}_m(r|s,a) - T_m(s'|s,a)\, R_m(r|s,a) \Big| \le H \Big( \|(\hat{T}_m - T_m)(\cdot \mid s,a)\|_1 + \|(\hat{R}_m - R_m)(\cdot \mid s,a)\|_1 \Big) \le H \Big( \sqrt{c_R / N_m(s,a)} + \sqrt{c_T / N_m(s,a)} \Big),$$
where we used that all alpha vectors in the original system satisfy $\|\alpha_{t,s}^{h,\pi}\|_\infty \le H$ for all $t, s, h, \pi$. This completes the proof of Lemma B.2.

B.3 Proof of Lemma 3.2

The remaining step is to show optimism at the initial time. When $t = 1$, the history $h$ is simply the initial state $s$. The belief state after observing the initial state is given by
$$b_m(s) = \frac{w_m \nu_m(s)}{\sum_{m'} w_{m'} \nu_{m'}(s)}, \qquad \tilde{b}_m(s) = \frac{w_m \tilde{\nu}_m(s)}{\sum_{m'} w_{m'} \tilde{\nu}_{m'}(s)}.$$
The expected long-term reward with $\pi$ for each model is therefore
$$V_{M^*}^\pi = \sum_s P(s_1 = s)\, V_1(s) = \sum_s P(s_1 = s)\, b(s)^\top \alpha_{1,s}^{s,\pi} = \sum_s \sum_m w_m \nu_m(s)\, \alpha_{1,s}^{s,\pi}(m), \qquad V_{\widetilde{M}}^\pi = \sum_s \sum_m w_m \tilde{\nu}_m(s)\, \tilde{\alpha}_{1,s}^{s,\pi}(m).$$
Following similar arguments, we have
$$V_{\widetilde{M}}^\pi - V_{M^*}^\pi \ge H \sum_m w_m \sqrt{c_\nu / N(m)} - H \sum_m w_m \sum_s |\nu_m(s) - \tilde{\nu}_m(s)| \ge 0,$$
which proves the claim of Lemma 3.2.

B.4 Proof of Theorem 3.3
Let us define some notation. Suppose $M = (\mathcal{S}, \mathcal{A}, T_m, R_m, \nu_m)$ is an LMDP in which a context $m$ is randomly chosen at the start of each episode following a probability distribution $(w_1, w_2, \ldots, w_M)$. Let $\bar{R}_m(s,a) = \mathbb{E}_{r \sim R_m(r|s,a)}[r]$ be the expected (observable) reward of taking action $a$ at $s$ in the $m^{th}$ MDP. With a slight abuse of notation, we use $\mathbb{E}_{\pi, M}[\cdot]$ to simplify $\mathbb{E}_{m \sim (w_1, \ldots, w_M)}\big[\mathbb{E}_{\pi, T_m, R_m, \nu_m}[\cdot \mid m]\big] = \sum_{m=1}^M w_m \mathbb{E}_m^\pi[\cdot]$.

We start with the following lemma on the difference in values in terms of the difference in parameters.

Lemma B.3
Let $M_1 = (\mathcal{S}, \mathcal{A}, T_m^1, R_m^1, \nu_m^1)$ and $M_2 = (\mathcal{S}, \mathcal{A}, T_m^2, R_m^2, \nu_m^2)$ be two latent MDPs with different transition, reward and initial distributions. Then for any history-dependent policy $\pi$,
$$|V_{M_1}^\pi - V_{M_2}^\pi| \le H \cdot \mathbb{E}_{\pi, M_1}\big[\|(\nu_m^1 - \nu_m^2)(s_1)\|_1\big] + \sum_{t=1}^H \mathbb{E}_{\pi, M_1}\big[|\bar{R}_m^1(s_t, a_t) - \bar{R}_m^2(s_t, a_t)|\big] + H \cdot \sum_{t=1}^H \mathbb{E}_{\pi, M_1}\big[\|(P_m^1 - P_m^2)(s', r \mid s_t, a_t)\|_1\big]. \qquad (9)$$
Lemma B.3 is proven in Section B.4.1.

Equipped with Lemmas 3.2 and B.3, we can now prove the main theorem. We first define a few new notations. Let $n_k(m, s, a)$ be the count of visits to $(s, a)$ in the $m^{th}$ MDP when running the policy $\pi_k$ chosen at the $k^{th}$ episode. Let $N_m^k(s,a)$ be the total number of visits to $(s,a)$ in the $m^{th}$ MDP before the beginning of the $k^{th}$ episode, i.e., $N_m^k(s,a) = \sum_{k'=1}^{k-1} n_{k'}(m, s, a)$. Let $\mathcal{F}_k$ be the filtration of events after running $k$ episodes. Let $\tilde{V}_k^\pi$ be the value of the optimistic model chosen at the $k^{th}$ episode with a policy $\pi$. Let $\pi^*$ be the optimal policy for the true LMDP $M^*$. Finally, let us denote by $(\tilde{\cdot})^k$ the model parameters of the optimistic model at the $k^{th}$ episode.

The expected reward $\tilde{\bar{R}}_m(s,a)$ in the optimistic model equals $\tilde{R}_m^{hid}(s,a) + \mathbb{E}_{r \sim \tilde{R}_m^{obs}(\cdot|s,a)}[r]$. Using Lemma B.3, the total regret can be rephrased as follows:
$$\sum_{k=1}^K V_{M^*}^{\pi^*} - V_{M^*}^{\pi_k} \le \sum_{k=1}^K V_{\widetilde{M}_k}^{\pi^*} - V_{M^*}^{\pi_k} \le \sum_{k=1}^K V_{\widetilde{M}_k}^{\pi_k} - V_{M^*}^{\pi_k}$$
$$\le \sum_{k=1}^K \sum_{(m,s,a)} \mathbb{E}_{\pi_k, M^*}[n_k(m,s,a)] \cdot \Big( H \cdot \|(\tilde{P}_m^k - P_m)(s', r \mid s,a)\|_1 + \big|\tilde{R}_m^{hid,k}(s,a)\big| + \big|\mathbb{E}_{r \sim \tilde{R}_m^{obs,k}(r|s,a)}[r] - \mathbb{E}_{r \sim R_m(r|s,a)}[r]\big| \Big)$$
$$+ H \cdot \sum_{k=1}^K \sum_m \Big( \|(\tilde{\nu}_m^k - \nu_m^*)(s)\|_1\, \mathbb{E}_{M^*}[n_k(m)] + \sqrt{c_\nu / N^k(m)}\; \mathbb{E}_{M^*}[n_k(m)] \Big).$$
Note that $\tilde{R}_m^{hid,k}(s,a) = H \min\big(1, \sqrt{(c_T + c_R)/N_m^k(s,a)}\big) \ge H \|(\tilde{P}_m^k - P_m)(s', r \mid s,a)\|_1$, and this is the dominating term. Therefore, the upper bound reduces to
$$\sum_{k=1}^K V_{\widetilde{M}_k}^{\pi_k} - V_{M^*}^{\pi_k} \lesssim H \sum_{k=1}^K \sum_{(m,s,a)} \sqrt{(c_T + c_R)/N_m^k(s,a)}\; \mathbb{E}_{\pi_k, M^*}[n_k(m,s,a)] + 2H \sum_{k=1}^K \sum_m \sqrt{c_\nu / N^k(m)}\; \mathbb{E}_{M^*}[n_k(m)].$$
Observe that the expected value of $N_m^k(s,a)$ is $\sum_{k'=1}^{k-1} \mathbb{E}_{\pi_{k'}, M^*}[n_{k'}(m,s,a)]$; call this quantity $\mathbb{E}[N_m^k]$. We can check that $\mathrm{Var}(n_k(m,s,a) \mid \mathcal{F}_{k-1}) \le H\, \mathbb{E}_{\pi_k, M^*}[n_k(m,s,a)]$. From Bernstein's inequality for martingales, for any $(s,a)$ (ignoring constants),
$$N_m^k(s,a) \ge \mathbb{E}[N_m^k(s,a)] - c_1 \sqrt{H\, \mathbb{E}[N_m^k(s,a)] \log(MSAK/\eta)} - c_2 H \log(MSAK/\eta),$$
for some absolute constants $c_1, c_2 > 0$ and for all $k$ and $(m,s,a)$, with probability at least $1 - \eta$.
From this, we can show that
$$H \sum_{k=1}^K \sum_{(m,s,a)} \sqrt{(c_T + c_R)/N_m^k(s,a)}\; \mathbb{E}[n_k(m,s,a)] \le H \sum_{(m,s,a)} \Big( \sum_{k=1}^{k_0} \mathbb{E}[n_k(m,s,a)] + \sum_{k=k_0+1}^{K} \sqrt{(c_T + c_R)/N_m^k(s,a)}\; \mathbb{E}[n_k(m,s,a)] \Big)$$
$$\lesssim H \sum_{(m,s,a)} \Big( H \log(MSAK/\eta) + 2 \sum_{k=k_0+1}^{K} \sqrt{(c_T + c_R)/\mathbb{E}[N_m^k(s,a)]}\; \mathbb{E}[n_k(m,s,a)] \Big),$$
where $k_0$ is the threshold episode at which the expected number of visits to $(m,s,a)$ exceeds $H \log(MSAK/\eta)$. Note that after this point we can assume, with high probability, that $N_m^k(s,a) \ge \mathbb{E}[N_m^k(s,a)]/2$. To bound the remaining summation, for a fixed $(m,s,a)$, denote $X_k = \mathbb{E}[N_m^k(s,a)]/H$ and $x_k = \mathbb{E}[n_k(m,s,a)]/H$. Note that $X_{k+1} = X_k + x_k$ and $x_k \le 1$. Then,
$$\sum_{k=k_0+1}^K \sqrt{1/X_k}\; x_k \le \int_{X_{k_0}}^{X_K} \sqrt{x^{-1}}\, dx \le 2\sqrt{X_K}.$$
Hence,
$$H \sum_{(m,s,a)} \Big( H \log(MSAK/\eta) + 2 \sum_{k=k_0+1}^{K} \sqrt{(c_T + c_R)/\mathbb{E}[N_m^k(s,a)]}\; \mathbb{E}[n_k(m,s,a)] \Big) \le H^2 MSA \log(MSAK/\eta) + 4H \sum_{(m,s,a)} \sqrt{(c_T + c_R)\, N_m^K(s,a)}$$
$$\le H^2 MSA \log(MSAK/\eta) + 4H \sqrt{(c_T + c_R)\, H M SA K},$$
where in the last step we used the Cauchy-Schwarz inequality with $\sum_{(m,s,a)} N_m^K(s,a) = HK$. Similarly, we can show that
$$H \sum_{k=1}^K \sum_m \sqrt{c_\nu / N^k(m)}\; \mathbb{E}[n_k(m)] \lesssim HM \log(MSAK/\eta) + 4H\sqrt{c_\nu M K}.$$
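The integral comparison above is the standard pigeonhole step. As a quick sanity check (our own illustration, not part of the proof; it assumes $X_{k_0} \ge 1$, which holds after the burn-in phase), the following snippet verifies numerically that $\sum_k x_k/\sqrt{X_k} \le 2\sqrt{X_K}$ for arbitrary increments $x_k \le 1$:

```python
import random

random.seed(0)
for _ in range(1000):
    X, total = 1.0, 0.0            # X_{k_0} >= 1 after the burn-in phase
    for _ in range(500):
        x = random.random()        # increment x_k in [0, 1]
        total += x / X ** 0.5      # summand x_k / sqrt(X_k)
        X += x                     # X_{k+1} = X_k + x_k
    assert total <= 2 * X ** 0.5   # matches the bound 2 * sqrt(X_K)
```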
Our choice of the confidence parameter $c_T$ for the transition probabilities is $c_T = O(S \log(MSAK/\eta))$, and this is the dominating factor. Thus, the total regret is dominated by
$$H\sqrt{c_T\, H M SA K} \lesssim HS \sqrt{M A N \log(MSAK/\eta)},$$
which in turn gives a total regret bound of $O\big(HS\sqrt{MAN \log(MSAN/\eta)}\big)$, where $N = HK$.

B.4.1 Proof of Lemma B.3
Proof.
We first observe that
$$V_{M_1}^\pi - V_{M_2}^\pi = \sum_{m=1}^M w_m \Big( \mathbb{E}_{m,1}^\pi\Big[\sum_{t=1}^H r_t\Big] - \mathbb{E}_{m,2}^\pi\Big[\sum_{t=1}^H r_t\Big] \Big) = \sum_{m=1}^M w_m \sum_{t=1}^H \sum_{(s_1, a_1, r_1, \ldots, s_t, a_t, r_t)} r_t\, \mathbb{P}_{m,1}^\pi(s_1, \ldots, r_t) - r_t\, \mathbb{P}_{m,2}^\pi(s_1, \ldots, r_t),$$
where, for $i = 1, 2$,
$$\mathbb{P}_{m,i}^\pi(s_1, a_1, r_1, \ldots, r_{t-1}, s_t) := \nu_m^i(s_1)\, \Pi_{j=1}^{t-1} \pi(a_j \mid s_1, \ldots, r_{j-1}, s_j)\, T_m^i(s_{j+1} \mid s_j, a_j)\, R_m^i(r_j \mid s_j, a_j).$$
We decompose the main difference as
$$\sum_{(s,a,r)_{1:t}} r_t \big( \mathbb{P}_{m,1}^\pi((s,a,r)_{1:t}) - \mathbb{P}_{m,2}^\pi((s,a,r)_{1:t}) \big)$$
$$= \sum_{(s,a,r)_{1:t}} r_t \big( R_m^1(r_t \mid s_t,a_t) - R_m^2(r_t \mid s_t,a_t) \big)\, \mathbb{P}_{m,1}^\pi((s,a,r)_{1:t-1}, s_t, a_t) + \sum_{(s,a,r)_{1:t}} r_t\, R_m^2(r_t \mid s_t,a_t)\, \big( \mathbb{P}_{m,1}^\pi - \mathbb{P}_{m,2}^\pi \big)((s,a,r)_{1:t-1}, s_t, a_t)$$
$$\le \sum_{s_t, a_t} \big| \mathbb{E}_{r_t \sim R_m^1(\cdot \mid s_t,a_t)}[r_t] - \mathbb{E}_{r_t \sim R_m^2(\cdot \mid s_t,a_t)}[r_t] \big|\, \mathbb{P}_{m,1}^\pi(s_t, a_t) + \big\| \big( \mathbb{P}_{m,1}^\pi - \mathbb{P}_{m,2}^\pi \big)((s,a,r)_{1:t-1}, s_t, a_t) \big\|_1.$$
Now we bound the total variation distance between the length-$t$ histories. For notational convenience, let us denote $|\mathbb{P}_1 - \mathbb{P}_2|(\cdot) = |\mathbb{P}_1(\cdot) - \mathbb{P}_2(\cdot)|$ for any probability measures $\mathbb{P}_1, \mathbb{P}_2$. Then,
$$\sum_{(s,a,r)_{1:t-1}, s_t, a_t} |\mathbb{P}_{m,1}^\pi - \mathbb{P}_{m,2}^\pi|((s,a,r)_{1:t-1}, s_t, a_t) = \sum_{(s,a,r)_{1:t-1}, s_t} |\mathbb{P}_{m,1}^\pi - \mathbb{P}_{m,2}^\pi|((s,a,r)_{1:t-1}, s_t) \sum_{a_t} \pi(a_t \mid (s,a,r)_{1:t-1}, s_t)$$
$$= \sum_{(s,a,r)_{1:t-1}, s_t} |\mathbb{P}_{m,1}^\pi - \mathbb{P}_{m,2}^\pi|((s,a,r)_{1:t-1}, s_t)$$
$$\le \sum_{(s,a,r)_{1:t-1}, s_t} |\mathbb{P}_{m,1}^\pi - \mathbb{P}_{m,2}^\pi|((s,a,r)_{1:t-2}, s_{t-1}, a_{t-1})\, \mathbb{P}_m^1(s_t, r_{t-1} \mid s_{t-1}, a_{t-1}) + \sum_{(s,a,r)_{1:t-1}, s_t} \mathbb{P}_{m,2}^\pi((s,a,r)_{1:t-2}, s_{t-1}, a_{t-1})\, |\mathbb{P}_m^1 - \mathbb{P}_m^2|(s_t, r_{t-1} \mid s_{t-1}, a_{t-1})$$
$$\le \big\|(\mathbb{P}_{m,1}^\pi - \mathbb{P}_{m,2}^\pi)((s,a,r)_{1:t-2}, s_{t-1}, a_{t-1})\big\|_1 + \sum_{s_{t-1}, a_{t-1}} \big\|(\mathbb{P}_m^1 - \mathbb{P}_m^2)(s', r \mid s_{t-1}, a_{t-1})\big\|_1\, \mathbb{P}_{m,2}^\pi(s_{t-1}, a_{t-1}).$$
We can apply the same expansion recursively to bound the total variation for length-$(t-1)$ histories, terminating with the initial-distribution term $\sum_{s_1} |\nu_m^1(s_1) - \nu_m^2(s_1)|$. Plugging this relation into the regret bound, we have
$$|V_{M_1}^\pi - V_{M_2}^\pi| \le \sum_{m=1}^M w_m \sum_{(s,a)} \sum_{t=1}^H \big| \mathbb{E}_{r \sim R_m^1(\cdot \mid s,a)}[r] - \mathbb{E}_{r \sim R_m^2(\cdot \mid s,a)}[r] \big|\, \mathbb{P}_{m,1}^\pi(s_t = s, a_t = a)$$
$$+ \sum_{m=1}^M w_m \sum_{t=1}^H \Big( \sum_{s_1} |\nu_m^1(s_1) - \nu_m^2(s_1)| + \sum_{(s,a)} \sum_{t'=1}^t \big\|(\mathbb{P}_m^1 - \mathbb{P}_m^2)(s', r \mid s,a)\big\|_1\, \mathbb{P}_{m,1}^\pi(s_{t'} = s, a_{t'} = a) \Big)$$
$$\le \sum_{m=1}^M w_m \sum_{t=1}^H \mathbb{E}_{m,1}^\pi\big[ |\bar{R}_m^1(s_t, a_t) - \bar{R}_m^2(s_t, a_t)| \big] + H \cdot \sum_{m=1}^M w_m \Big( \|(\nu_m^1 - \nu_m^2)(s_1)\|_1 + \sum_{t=1}^H \mathbb{E}_{m,1}^\pi\big[ \|(P_m^1 - P_m^2)(s', r \mid s_t, a_t)\|_1 \big] \Big),$$
giving equation (9) as claimed. $\Box$

Appendix C Learning with Separation and Good Initialization
C.1 Well-Separated Condition for MDPs
In this subsection, we formalize a condition for clusterable mixtures of MDPs: the overlap of trajectories from different MDPs should be small in order to correctly infer the true contexts from sampled trajectories. We call the underlying MDPs well-separated if they satisfy the following separation condition:

Condition 3 (Well-Separated MDPs)
If a trajectory $\tau$ of length $H$ is sampled from MDP $\mathcal{M}_{m^*}$ by running any policy $\pi \in \Pi$, we have
$$\mathbb{P}_{\tau \sim \mathcal{M}_{m^*}, \pi}\left( \frac{\mathbb{P}_{\tau \sim \mathcal{M}_m, \pi}(\tau)}{\mathbb{P}_{\tau \sim \mathcal{M}_{m^*}, \pi}(\tau)} > (\epsilon_p / M)^{c_1} \right) < (\epsilon_p / M)^{c_2} \qquad \forall m \neq m^*, \qquad (10)$$
for a target failure probability $\epsilon_p > 0$, where $c_1, c_2 \ge 1$ are some universal constants.

Here, $\mathbb{P}_{\tau \sim \mathcal{M}_m, \pi}$ is the probability of getting a trajectory from the context $m$ with policy $\pi$. One sufficient condition that ensures the well-separated condition (10) is Assumption 1, as guaranteed by the following lemma:

Lemma C.1
Under Assumption 1 with a constant $\delta = \Theta(1)$, if the time horizon is sufficiently long such that $H > C \cdot \delta^{-4} \log^2(1/\alpha) \log(M/\epsilon_p)$ for some absolute constant $C > 0$ and $\alpha = \delta^2/(200 S)$, then the well-separated condition (10) holds with $c_1, c_2 \ge 1$.

The proof of Lemma C.1 is given in Appendix C.2. We remark that we have not optimized the requirement on the time horizon $H$ for satisfying Condition 3, and we conjecture it can be improved. We also mention that the required time horizon can be much shorter if the KL-divergence between the distributions is larger, even though the $l_1$ distance remains the same. Finally, we remark that Assumption 1 is only a sufficient condition, and can be relaxed as long as Condition 3 is satisfied.

C.2 Proof of Lemma C.1
In this proof, all probabilistic events are taken with respect to the true context $m^*$: unless specified otherwise, $\mathbb{P}(\cdot)$ and $\mathbb{E}[\cdot]$ are measured under context $m^*$.

Suppose a trajectory $\tau$ is obtained from MDP $\mathcal{M}_{m^*}$. Let us denote the probability of getting $\tau$ from the $m^{th}$ MDP by running policy $\pi$ as $\mathbb{P}_{\tau \sim \mathcal{M}_m, \pi}(\tau) = \mathbb{P}_m(\tau)$. It is enough to show that
$$\ln\left( \frac{\mathbb{P}_{m^*}(\tau)}{\mathbb{P}_m(\tau)} \right) > c_1 \ln(M/\epsilon_p), \qquad \forall m \neq m^*,$$
with probability $1 - (\epsilon_p/M)^{c_2}$. Note that for any history-dependent policy $\pi$,
$$\ln\left( \frac{\mathbb{P}_{m^*}(\tau)}{\mathbb{P}_m(\tau)} \right) = \sum_{t=1}^H \ln\left( \frac{\mathbb{P}_{m^*}(s_{t+1}, r_t \mid s_t, a_t)}{\mathbb{P}_m(s_{t+1}, r_t \mid s_t, a_t)} \right).$$
For simplicity, let us compactly denote $(s', r)$ as $o$, and $(s_{t+1}, r_t)$ as $o_t$. Note that in general, $\ln\big(\mathbb{P}_{m^*}(\tau)/\mathbb{P}_m(\tau)\big)$ can be unbounded due to zero probability assignments. Thus we consider a relaxed MDP that assigns non-zero probability to all observations. Let $\alpha > 0$ be sufficiently small such that $\alpha \ln(1/\alpha) < \delta^2/(200 S)$. We define smoothed probability distributions
$$\hat{\mathbb{P}}_m(o \mid s, a) = \alpha + (1 - \alpha S)\, \mathbb{P}_m(o \mid s, a),$$
which satisfy $\|(\mathbb{P}_m - \hat{\mathbb{P}}_m)(\cdot \mid s, a)\|_1 \le 2S\alpha$. We split the original quantity into two terms and bound each of them:
$$\ln\left( \frac{\mathbb{P}_{m^*}(\tau)}{\mathbb{P}_m(\tau)} \right) = \ln\left( \frac{\mathbb{P}_{m^*}(\tau)}{\hat{\mathbb{P}}_m(\tau)} \right) + \ln\left( \frac{\hat{\mathbb{P}}_m(\tau)}{\mathbb{P}_m(\tau)} \right).$$
For the first term, we investigate its expectation first:
$$\mathbb{E}\left[ \sum_{t=1}^H \ln\left( \frac{\mathbb{P}_{m^*}(o_t \mid s_t, a_t)}{\hat{\mathbb{P}}_m(o_t \mid s_t, a_t)} \right) \right] = \mathbb{E}\left[ \sum_{t=1}^H \mathbb{E}\left[ \ln\left( \frac{\mathbb{P}_{m^*}(o_t \mid s_t, a_t)}{\hat{\mathbb{P}}_m(o_t \mid s_t, a_t)} \right) \,\Bigg|\, s_1, a_1, r_1, \ldots, r_{t-1}, s_t, a_t \right] \right]$$
$$= \mathbb{E}\left[ \sum_{t=1}^H \sum_{o_t} \mathbb{P}_{m^*}(o_t \mid s_t, a_t) \ln\left( \frac{\mathbb{P}_{m^*}(o_t \mid s_t, a_t)}{\hat{\mathbb{P}}_m(o_t \mid s_t, a_t)} \right) \right] = \mathbb{E}\left[ \sum_{t=1}^H D_{KL}\big( \mathbb{P}_{m^*}(o_t \mid s_t, a_t), \hat{\mathbb{P}}_m(o_t \mid s_t, a_t) \big) \right] \ge \frac{H\delta^2}{2},$$
where in the last step we applied Pinsker's inequality.

Now we want to apply Chernoff-type concentration inequalities for martingales. We need the following lemma on the sub-exponential property of $\ln(1/\mathbb{P}(X))$ for a general random variable $X$:

Lemma C.2
Suppose $X$ is an arbitrary discrete random variable on a finite support $\mathcal{X}$. Then $\ln(1/\mathbb{P}(X))$ is a sub-exponential random variable [48] with Orlicz norm $\|\ln(1/\mathbb{P}(X))\|_{\psi_1} = 1/e$.

Proof. Following the definition of the sub-exponential norm [48], we show $\|\ln(1/\mathbb{P}(X))\|_{\psi_1} = O(1)$:
$$\|\ln(1/\mathbb{P}(X))\|_{\psi_1} = \sup_{q \ge 1} q^{-1}\, \mathbb{E}_X\big[\ln^q(1/\mathbb{P}(X))\big]^{1/q} = \sup_{q \ge 1} q^{-1} \left( \sum_{X \in \mathcal{X}} \mathbb{P}(X) \ln^q(1/\mathbb{P}(X)) \right)^{1/q}.$$
For any $q \ge 1$, let us first find the maximum value of $p \ln^q(1/p)$ for $0 \le p \le 1$. Taking a log and differentiating with respect to $p$ yields
$$\frac{1}{p} + q\, \frac{(-1/p)}{\ln(1/p)} = \frac{1}{p}\big( 1 - q/\ln(1/p) \big).$$
Hence $p \ln^q(1/p)$ attains its maximum at $p = e^{-q}$, with value $(q/e)^q$. This gives a bound on the sub-exponential norm:
$$\|\ln(1/\mathbb{P}(X))\|_{\psi_1} = \sup_{q \ge 1} q^{-1} \left( \sum_{X \in \mathcal{X}} \mathbb{P}(X) \ln^q(1/\mathbb{P}(X)) \right)^{1/q} \le \sup_{q \ge 1} q^{-1} (q/e) = 1/e. \qquad \Box$$

With the above lemma and the sum of sub-exponential martingales, it is easy to verify (see Proposition 5.16 in [48]) that
$$\mathbb{P}\big( \ln(\mathbb{P}_{m^*}(\tau)) \le \mathbb{E}[\ln(\mathbb{P}_{m^*}(\tau))] - H\epsilon_1 \big) \le \exp\big( -c \cdot \min(\epsilon_1^2, \epsilon_1)\, H \big),$$
where $c > 0$ is some absolute constant, since $\ln(\mathbb{P}_{m^*}(\tau)) = \sum_{t=1}^H \ln(\mathbb{P}_{m^*}(o_t \mid s_t, a_t))$ is a sum of $H$ sub-exponential martingale differences. We can also apply the Azuma-Hoeffding inequality to control the statistical deviation of $\ln(\hat{\mathbb{P}}_m(\tau))$:
$$\mathbb{P}\Big( \ln\big(\hat{\mathbb{P}}_m(\tau)\big) \ge \mathbb{E}\big[\ln\big(\hat{\mathbb{P}}_m(\tau)\big)\big] + H\epsilon_2 \Big) \le \exp\left( -\frac{2H\epsilon_2^2}{\ln^2(1/\alpha)} \right),$$
since each increment of $\ln \hat{\mathbb{P}}_m(\tau)$ is bounded by $\ln(1/\alpha)$.

Now let $\epsilon_0 = \epsilon_1 + \epsilon_2 = c' \cdot \log(1/\alpha)\sqrt{\log(M/\epsilon_p)/H}$ for some absolute constant $c' > 0$. If the time horizon satisfies $H \ge C \delta^{-4} \log^2(1/\alpha) \log(M/\epsilon_p)$ for some sufficiently large constant $C > 0$, then simple algebra shows that
$$\ln\left( \frac{\mathbb{P}_{m^*}(\tau)}{\hat{\mathbb{P}}_m(\tau)} \right) \ge \frac{H\delta^2}{2} - H\epsilon_0 \ge \frac{H\delta^2}{4},$$
with probability at least $1 - (\epsilon_p/M)^{c_2}$.

Finally, we bound the extra terms caused by using the smoothed probabilities. We note that
$$\ln\left( \frac{\hat{\mathbb{P}}_m(o \mid s, a)}{\mathbb{P}_m(o \mid s, a)} \right) \ge -2\alpha S, \qquad \forall (o, s, a),$$
given that $\alpha S$ is sufficiently small. Therefore, for any trajectory we have $\ln\big(\hat{\mathbb{P}}_m(\tau)/\mathbb{P}_m(\tau)\big) \ge -2\alpha S H \ge -H\delta^2/8$. Thus $\ln\big(\mathbb{P}_{m^*}(\tau)/\mathbb{P}_m(\tau)\big) \ge H\delta^2/8 \ge c_1 \ln(M/\epsilon_p)$ with probability at least $1 - (\epsilon_p/M)^{c_2}$, which satisfies Condition 3.

C.3 Proof of Theorem 3.4
The key component is the following lemma on the correct estimation of the belief over contexts.
Lemma C.3
Let a trajectory be sampled from the $m^{*th}$ MDP. Under Assumption 1, with a good initialization $\epsilon_{init} < \delta^2/(200\ln(1/\alpha))$ in (4) and $H > C \cdot \delta^{-4}\log^2(1/\alpha)\log(N/\eta)$ for some universal constant $C > 0$, we have
$$\hat{b}(m^*) \ge 1 - (N/\eta)^{-2},$$
with probability at least $1 - (N/\eta)^{-2}$.

Since the estimated belief is almost exactly correct over $O(N)$ episodes with $\epsilon_p = O(1/N)$, we now have the following confidence intervals for the transition matrices and rewards:

Corollary 1
With probability at least $1 - 1/N$, for all rounds of episodes, we have
$$\|(\hat{T}_m - T_m^*)(\cdot \mid s,a)\|_1 \le \sqrt{c_T/N_m(s,a)} + 1/N, \qquad \|(\hat{R}_m - R_m^*)(\cdot \mid s,a)\|_1 \le \sqrt{c_R/N_m(s,a)} + 1/N, \qquad \|(\hat{\nu}_m - \nu_m^*)(\cdot)\|_1 \le \sqrt{c_\nu/N(m)} + 1/N,$$
for all $s, a, r, s'$.

The corollary is straightforward since the estimation error accumulated from errors in beliefs throughout the $K$ episodes is at most $1/N$. If we build an optimistic model with the estimated parameters as in Lemma 3.2, the optimistic value of any policy in the model satisfies
$$V_{\widetilde{M}}^\pi \ge V_{M^*}^\pi - H^2/N. \qquad (11)$$
Equation (11) is a consequence of Lemma 3.2 and an LMDP version of the sensitivity analysis in partially observable environments [39], which can also be inferred from Lemma B.3. Following the same argument as in the proof of Theorem 3.3, we can also show that the estimated visit counts at $(s,a)$ satisfy
$$N_m(s,a) \ge \mathbb{E}[N_m(s,a)] - c_1\sqrt{H\,\mathbb{E}[N_m(s,a)]\log(MSAK/\eta)} - c_2 H\log(MSAK/\eta) - 1/N,$$
for some absolute constants $c_1, c_2 > 0$, for all $(s,a)$, with probability at least $1 - \eta$. The additional regret caused by the small errors in the belief estimates is therefore bounded by
$$SH^2/N \cdot N + H^2 MSA/N \le \sqrt{N},$$
assuming $N = HK \gg H^4 S^2 M^2 A$. The remaining steps are identical to the proof of Theorem 3.3.

We note here that the convergence guarantee for the online EM might be extended to allow some small probability of incorrectly inferred contexts. Such a scenario can happen if $H$ does not scale logarithmically with the total number of episodes $K$. It would be analogous to the local convergence guarantee for mixtures of well-separated Gaussian distributions [27, 26]. The situation is even more complicated since we may run a possibly different policy in each episode. It is an interesting question whether the online EM implementation would eventually reach a good converged policy and model parameters in more general settings.

C.4 Proof of Lemma C.3
Proof.
The proof of Lemma C.3 is a simple replication of the proof of Lemma C.1. We show that
$$\sum_{t=1}^H \ln\left( \frac{\alpha + (1 - \alpha S)\hat{\mathbb{P}}_{m^*}(o_t \mid s_t, a_t)}{\alpha + (1 - \alpha S)\hat{\mathbb{P}}_m(o_t \mid s_t, a_t)} \right) \ge 3\ln(N/\eta), \qquad (12)$$
with probability at least $1 - (N/\eta)^{-2}$ for all $m \neq m^*$.

Let $Q_m = \alpha + (1 - \alpha S)\hat{\mathbb{P}}_m$ for all $m$. Note that $\|Q_m - Q_{m^*}\|_1 \ge \delta/2$ due to the initialization condition. Furthermore, $|\ln(Q_m(o \mid s,a)/Q_{m^*}(o \mid s,a))| \le \ln(1/\alpha)$. Hence we can apply the Azuma-Hoeffding inequality to get
$$\sum_{t=1}^H \ln\left( \frac{Q_{m^*}(o_t \mid s_t, a_t)}{Q_m(o_t \mid s_t, a_t)} \right) \ge \mathbb{E}\left[ \sum_{t=1}^H \ln\left( \frac{Q_{m^*}(o_t \mid s_t, a_t)}{Q_m(o_t \mid s_t, a_t)} \right) \right] - \ln(1/\alpha)\sqrt{H\log(N/\eta)},$$
with probability at least $1 - (MN)^{-2}$. To lower bound the expectation, we proceed as before:
$$\mathbb{E}\left[ \sum_{t=1}^H \ln\left( \frac{Q_{m^*}(o_t \mid s_t, a_t)}{Q_m(o_t \mid s_t, a_t)} \right) \right] = \mathbb{E}\left[ \sum_{t=1}^H \sum_{o_t} \mathbb{P}_{m^*}(o_t \mid s_t, a_t) \ln\left( \frac{Q_{m^*}(o_t \mid s_t, a_t)}{Q_m(o_t \mid s_t, a_t)} \right) \right]$$
$$= \mathbb{E}\left[ \sum_{t=1}^H \sum_{o_t} Q_{m^*}(o_t \mid s_t, a_t) \ln\left( \frac{Q_{m^*}(o_t \mid s_t, a_t)}{Q_m(o_t \mid s_t, a_t)} \right) \right] + \mathbb{E}\left[ \sum_{t=1}^H \sum_{o_t} (\mathbb{P}_{m^*} - Q_{m^*})(o_t \mid s_t, a_t) \ln\left( \frac{Q_{m^*}(o_t \mid s_t, a_t)}{Q_m(o_t \mid s_t, a_t)} \right) \right]$$
$$\ge \mathbb{E}\left[ \sum_{t=1}^H D_{KL}\big( Q_{m^*}(o_t \mid s_t, a_t), Q_m(o_t \mid s_t, a_t) \big) \right] - \mathbb{E}\left[ \sum_{t=1}^H \|\mathbb{P}_{m^*} - Q_{m^*}\|_1 \right] \ln(1/\alpha)$$
$$\ge H\delta^2/8 - H(2\alpha S + \epsilon_{init})\ln(1/\alpha).$$
As long as $\alpha S \ln(1/\alpha) \le \delta^2/200$ and $\epsilon_{init}\ln(1/\alpha) \le \delta^2/200$, we have
$$\mathbb{E}\left[ \sum_{t=1}^H \ln\left( \frac{Q_{m^*}(o_t \mid s_t, a_t)}{Q_m(o_t \mid s_t, a_t)} \right) \right] \ge H\delta^2/16.$$
Hence, if $H \ge C\delta^{-4}\ln^2(1/\alpha)\log(N/\eta)$ for a sufficiently large constant $C > 0$, then (12) holds with probability at least $1 - (N/\eta)^{-2}$. The implication of the lemma is
$$\hat{b}(m^*) \ge 1 - (N/\eta)^{-3}\cdot M \ge 1 - (N/\eta)^{-2},$$
which proves the claimed lemma. $\Box$

Appendix D Algorithm Details for Initialization
This section provides a detailed algorithm for efficient initialization which is deferred from Section 3.4.
D.1 Spectral Learning of PSRs
In this subsection, we describe the spectral algorithm for learning the PSR in detail. Recall that we defined $P_{\mathcal{T}, \mathcal{H}_s} = L_s H_s$ in Conditions 1, 2, such that
$$(P_{\mathcal{T}, \mathcal{H}_s})_{i,j} = \mathbb{P}^\pi(\tau_i, h_{s,j}) = (L_s)_{i,:} (H_s)_{:,j},$$
where $P_{\mathcal{T}, \mathcal{H}_s} \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{H}_s|}$ is the matrix of joint probabilities of tests and histories ending with $s$. Let the top-$M$ left and right singular vectors of $P_{\mathcal{T}, \mathcal{H}_s}$ be $U_s$ and $V_s$ respectively. Note that under the rank conditions, $U_s^\top P_{\mathcal{T}, \mathcal{H}_s} V_s$ is invertible. We also consider the matrix of joint probabilities of histories, intermediate action-reward-next-state pairs, and tests,
$$P_{\mathcal{T}, (s',r)a, \mathcal{H}_s} = L_{s'} D_{(s',r),a,s} H_s, \qquad \text{where} \quad D_{(s',r),a,s} = \mathrm{diag}\big( \mathbb{P}_1(s', r \mid a, s), \ldots, \mathbb{P}_M(s', r \mid a, s) \big).$$
For simplicity of notation, we occasionally replace $(s', r)$ by the single letter $o$. The transformed PSR parameters of the LMDP can be computed as
$$B_{o,a,s} = U_{s'}^\top P_{\mathcal{T}, oa, \mathcal{H}_s} V_s (U_s^\top P_{\mathcal{T}, \mathcal{H}_s} V_s)^{-1} = (U_{s'}^\top L_{s'})\, D_{o,a,s}\, (U_s^\top L_s)^{-1}.$$
The initial and normalization parameters can be computed as
$$b_{1,s} = U_s^\top P(\mathcal{T}, s_1 = s) = U_s^\top P(\mathcal{T} \mid s)(w \cdot \nu)(s) = (U_s^\top L_s)(w \cdot \nu)(s), \qquad b_{\infty,s}^\top = P_{\mathcal{H}_s}^\top V_s (U_s^\top P_{\mathcal{T}, \mathcal{H}_s} V_s)^{-1},$$
where $P_{\mathcal{H}_s} \in \mathbb{R}^{|\mathcal{H}_s|}$ is the vector of probabilities of sampling each history in $\mathcal{H}_s$, and $(w \cdot \nu)(s)$ is the $M$-dimensional vector whose $m^{th}$ entry is $w_m \nu_m(s)$. For the normalization factor, note that $P_{\mathcal{H}_s}^\top = \mathbf{1}^\top H_s$, therefore
$$b_{\infty,s}^\top = \mathbf{1}^\top H_s V_s (U_s^\top P_{\mathcal{T}, \mathcal{H}_s} V_s)^{-1} = \mathbf{1}^\top (U_s^\top L_s)^{-1} (U_s^\top L_s H_s V_s)(U_s^\top P_{\mathcal{T}, \mathcal{H}_s} V_s)^{-1} = \mathbf{1}^\top (U_s^\top L_s)^{-1}.$$
It is easy to verify that
$$\mathbb{P}((s,a,r)_{1:t-1}, s_t) = b_{\infty,s_t}^\top B_{o_{t-1},a_{t-1},s_{t-1}} \cdots B_{o_1,a_1,s_1} b_{1,s_1} = \mathbf{1}^\top D_{o_{t-1},a_{t-1},s_{t-1}} \cdots D_{o_1,a_1,s_1} (w \cdot \nu)(s_1).$$
With Assumption 2, we assume that the sets of histories and tests $\mathcal{H}, \mathcal{T}$ contain all possible observations of a fixed length $l$. Furthermore, we assume that the short trajectories are collected such that each history is sampled from the sampling policy $\pi$ and then the intervening action sequence for the test is selected uniformly at random. We estimate the joint probability matrices from $N$ short trajectories such that
$$(\hat{P}_{\mathcal{H}_s})_i = \frac{1}{N} \#(h_{s,i}), \qquad (\hat{P}_{\mathcal{T},\mathcal{H}_s})_{i,j} = \frac{A^l}{N} \#(\tau_i, h_{s,j}), \qquad (\hat{P}_{\mathcal{T},oa,\mathcal{H}_s})_{i,j} = \frac{A^{l+1}}{N} \#(\tau_i, oa, h_{s,j}),$$
where $\#(\cdot)$ denotes the number of occurrences of the event when we sample histories with the sampling policy $\pi$. For instance, $\#(\tau_i, h_{s,j})$ is the number of occurrences of the $j^{th}$ history in $\mathcal{H}_s$ followed by the outcome of the $i^{th}$ test in $\mathcal{T}$. The factors $A^l$ and $A^{l+1}$ are importance sampling weights for the intervening actions. The initial PSR states are estimated separately, $(\hat{P}(\mathcal{T}, s_1 = s))_i = \frac{A^l}{N}\#(\tau_i, s_1 = s)$, assuming we get $N$ sample trajectories from the beginning of each episode.

Now let $\hat{U}_s, \hat{V}_s$ be the top-$M$ left and right singular vectors of $\hat{P}_{\mathcal{T},\mathcal{H}_s}$. Then the spectral learning algorithm outputs the PSR parameters
$$\hat{B}_{o,a,s} = \hat{U}_{s'}^\top \hat{P}_{\mathcal{T},oa,\mathcal{H}_s} \hat{V}_s (\hat{U}_s^\top \hat{P}_{\mathcal{T},\mathcal{H}_s} \hat{V}_s)^{-1}, \qquad \hat{b}_{\infty,s}^\top = \hat{P}_{\mathcal{H}_s}^\top \hat{V}_s (\hat{U}_s^\top \hat{P}_{\mathcal{T},\mathcal{H}_s} \hat{V}_s)^{-1}, \qquad \hat{b}_{1,s} = \hat{U}_s^\top \hat{P}(\mathcal{T}, s_1 = s). \qquad (13)$$
Then the estimated probability of a sequence under any history-dependent policy $\pi$ is given by
$$\hat{\mathbb{P}}^\pi((s,a,r)_{1:t-1}, s_t) = \Pi_{i=1}^{t-1} \pi(a_i \mid (s,a,r)_{1:i-1}, s_i) \cdot \hat{b}_{\infty,s_t}^\top \hat{B}_{o_{t-1},a_{t-1},s_{t-1}} \cdots \hat{B}_{o_1,a_1,s_1} \hat{b}_{1,s_1}. \qquad (14)$$
The update of PSR states and the prediction of the next observation are given as follows:
$$\hat{b}_1 = \hat{b}_{1,s_1}, \qquad \hat{b}_t = \frac{\hat{B}_{o_{t-1},a_{t-1},s_{t-1}} \hat{b}_{t-1}}{\hat{b}_{\infty,s_t}^\top \hat{B}_{o_{t-1},a_{t-1},s_{t-1}} \hat{b}_{t-1}}, \qquad (15)$$
$$\hat{\mathbb{P}}(s', r \mid (s,a,r)_{1:t-1}, s_t \,\|\, \mathrm{do}\ a) = \hat{b}_{\infty,s'}^\top \hat{B}_{(s',r),a,s_t} \hat{b}_t. \qquad (16)$$
From the above procedure, we can establish a formal guarantee on the estimation of probabilities of length-$t > l$ trajectories obtained with any history-dependent policy:

Theorem D.1
Suppose the LMDP and the sets of histories $\mathcal{H}$ and tests $\mathcal{T}$ satisfy Assumption 2. If the number of short trajectories $N = n_1$ satisfies
$$n_1 \ge C \cdot \frac{M A^{l+1}}{p_\pi\, \sigma_\tau^2\, \sigma_h^2} \cdot \frac{t^2}{\epsilon_t^2} \left( S + \frac{A^l}{\sigma_h^2} \right) \log(SA/\eta),$$
where $C > 0$ is a universal constant and $p_\pi = \min_s \mathbb{P}^\pi(\text{end state} = s)$, then for any (history-dependent) policy $\pi$, with probability at least $1 - \eta$,
$$\|(\mathbb{P}^\pi - \hat{\mathbb{P}}^\pi)((s,a,r)_{1:t-1}, s_t)\|_1 \le \epsilon_t.$$

We mention that a formal finite-sample guarantee for PSR learning previously existed only for hidden Markov models [19]; the extension to LMDPs requires re-deriving the proof to include the effect of arbitrary decision-making policies. For completeness, we provide the proof of Theorem D.1 in Appendix E.1. As a result of the spectral learning of the PSR (see the detailed procedure in Appendix D.1), we obtain the key ingredient for clustering longer trajectories to recover the original LMDP model, as we show in the next subsection.
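Before stating the clustering guarantee, here is a minimal numerical sketch of the estimator in (13) (our own illustration on synthetic low-rank matrices; the toy sizes, the shared $L$ across test blocks, and all variable names are assumptions, not the paper's code): it takes a rank-$M$ SVD of the joint-probability matrix and builds the transformed operator.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 2                     # latent rank (number of contexts)
nT, nH = 8, 8             # numbers of tests / histories (toy sizes)

# Synthetic low-rank ground truth P_{T,H} = L @ H_mat, single observation o.
L = rng.uniform(size=(nT, M))
H_mat = rng.uniform(size=(M, nH))
D_o = np.diag(rng.uniform(size=M))        # diag of P_m(o | a, s)
P_TH = L @ H_mat
P_ToH = L @ D_o @ H_mat                   # toy: test block shares L (same end state)
P_H = H_mat.sum(axis=0)                   # history marginals, P_H^T = 1^T H

# Rank-M truncated SVD of P_{T,H}; then the estimator of equation (13).
U, svals, Vt = np.linalg.svd(P_TH)
U, V = U[:, :M], Vt[:M, :].T
core = U.T @ P_TH @ V                     # U^T P_{T,H} V, invertible by rank condition
B_o = U.T @ P_ToH @ V @ np.linalg.inv(core)
b_inf = np.linalg.solve(core.T, V.T @ P_H)   # b_inf^T = P_H^T V (U^T P_{T,H} V)^{-1}

# Sanity check: B_o equals (U^T L) D_o (U^T L)^{-1}, as derived above.
UL = U.T @ L
assert np.allclose(B_o, UL @ D_o @ np.linalg.inv(UL), atol=1e-8)
print("operator recovered up to similarity; eigenvalues:", np.linalg.eigvals(B_o))
```

The eigenvalues of the recovered operator coincide with the diagonal of $D_o$, reflecting that the transformed operators are similar to the true diagonal observation operators.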
Theorem D.2
Suppose we have successfully estimated the PSR parameters via the spectral learning procedure of Section D.1, such that we have the following guarantee on the estimated probabilities of trajectories under any history-dependent policy $\pi$:
$$\|(\mathbb{P}^\pi - \hat{\mathbb{P}}^\pi)((s,a,r)_{1:t-1}, s_t)\|_1 \le \epsilon_t,$$
for sufficiently small $\epsilon_t > 0$. Suppose we execute a policy $\pi$ for $t$ time steps, observe a history $((s,a,r)_{1:t-1}, s_t)$, and then estimate the probabilities of all possible future observations (or tests $o_{t:t+l-1}$) under an intervening action sequence $a^\tau_{t:t+l-1}$. Then we have the following guarantee on the conditional probabilities with target accuracy $\epsilon_c > 0$:
$$\|(\mathbb{P}^\pi - \hat{\mathbb{P}}^\pi)(o_{t:t+l-1} \mid (s,a,r)_{1:t-1}, s_t \,\|\, \mathrm{do}\ a^\tau_{t:t+l-1})\|_1 \le \epsilon_c,$$
with probability at least $1 - \epsilon_t/\epsilon_c$.

Algorithm 6: Recovery of LMDP parameters
Input:
A set of short histories $\mathcal{H}$ and tests $\mathcal{T}$ for learning the PSR, and tests $\mathcal{T}'$ for clustering
// Learn PSR parameters up to precision $o(\delta)$
Estimate PSR parameters $\{\hat{b}_{1,s}, \hat{b}_{\infty,s}, \hat{B}_{o,a,s},\ \forall o, a, s\}$ following (13) in Appendix D.1 up to precision $o(\delta)$
// Get clusters $\{\hat{T}_m(\cdot \mid s,a), \hat{R}_m(\cdot \mid s,a)\}_{(s,a) \in \mathcal{S}\times\mathcal{A},\, m \in [M]}$ with the learned PSR parameters
Initialize $V_s = \{\}$ for all $s \in \mathcal{S}$
for $n_2/3$ episodes do
  Play the exploration policy $\pi$ and get a trajectory $h = (s_1, a_1, r_1, \ldots, s_H, a_H, r_H)$
  Get the PSR state $\hat{b}_{H-1}$ at time step $H-1$ using equation (15)
  Compute $p_{H-1}(\mathcal{T}') = \hat{\mathbb{P}}(\mathcal{T}' \mid (s,a,r)_{1:H-2}, s_{H-1})$ using equation (16)
  Add $p_{H-1}$ to $V_{s_{H-1}}$
end for
for all $s \in \mathcal{S}$ do
  Find $M$ cluster centers $C_s$ that cover all points in $V_s$ (e.g., with $k$-means++ [3])
end for
// Build each MDP model by correctly assigning contexts to estimated transition and reward probabilities
for $n_2/3$ episodes do
  Play the exploration policy $\pi$ until time step $H-1$ and get the PSR state $\hat{b}_{H-1}$ at time step $H-1$
  Play a uniformly sampled action $a$ and get the PSR state $\hat{b}_H$ at time step $H$
  Compute $p_{H-1}(\mathcal{T}')$ and $p_H(\mathcal{T}')$
  Find centers (labels) $c_{H-1} \in C_{s_{H-1}}$ and $c_H \in C_{s_H}$ such that $c_{H-1}$ and $c_H$ are closest to $p_{H-1}$ and $p_H$ respectively
  If $s_{H-1}$ and $s_H$ are different, record that the two centers $c_{H-1}$, $c_H$ are in the same context
end for
If the reordering of contexts is inconsistent, return FAIL
Otherwise, construct $\hat{T}_m$ and $\hat{R}_m$ from the cluster centers $\{C_s\}_{s \in \mathcal{S}}$
for $n_2/3$ episodes do
  Play the exploration policy $\pi$ and get the PSR state $\hat{b}_H$ at time step $H$
  Compute $p_H(\mathcal{T}')$ and find the center $c_H \in C_{s_H}$ that is closest to $p_H(\mathcal{T}')$
  Get the context $m$ to which $c_H$ belongs, and update the initial state distribution $\hat{\nu}_m$ of the $m^{th}$ MDP
end for
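As a rough illustration of the clustering and context-matching steps in Algorithm 6 above (our own sketch, not the paper's implementation; the prediction vectors are taken as inputs and union-find is one concrete way to realize the "same context" bookkeeping across ending states):

```python
import numpy as np

def kmeanspp_cluster(points, M, rng):
    """Tiny k-means++-style seeding: pick M well-separated centers.
    Under delta_psr-separation, every point lies close to one center."""
    pts = np.asarray(points)
    centers = [pts[rng.integers(len(pts))]]
    while len(centers) < M:
        d2 = np.min([np.sum((pts - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(pts[rng.choice(len(pts), p=d2 / d2.sum())])
    return np.stack(centers)

def label(center_sets, s, p):
    """Assign the closest center (label) at state s to prediction vector p."""
    return int(np.argmin(np.sum((center_sets[s] - p) ** 2, axis=1)))

# Union-find over (state, label) pairs: merging realizes "c_{H-1} and c_H
# are in the same context" for trajectories with s_{H-1} != s_H.
parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
def union(x, y):
    parent[find(x)] = find(y)

rng = np.random.default_rng(3)
# Fake per-state prediction vectors: 2 contexts x 3 states, separated means.
means = rng.uniform(size=(3, 2, 5))                     # means[s, m] in R^5
centers = {s: kmeanspp_cluster(
    [means[s, m] + 0.01 * rng.standard_normal(5)
     for m in range(2) for _ in range(20)], 2, rng) for s in range(3)}
# A trajectory ending at (s_{H-1}=0, s_H=1) from one context ties the labels:
union((0, label(centers, 0, means[0, 1])), (1, label(centers, 1, means[1, 1])))
print(find((0, label(centers, 0, means[0, 1]))) ==
      find((1, label(centers, 1, means[1, 1]))))        # True: merged group
```

Once the union-find components connect every state, each component corresponds to one latent context, which is exactly the consistent reordering Algorithm 6 checks for.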
D.2 Clustering with PSR Parameters and Separation
We begin with the high-level idea of the algorithm, which works as follows: suppose we have a new trajectory of length $H$ whose last two states are $s_{H-1}, s_H$, drawn from an unknown context $m^*$. We first consider the true conditional probability given a history $h = (s, a, o)_{1:H-2}$. Here $H > C \cdot \delta^{-4}\log^2(1/\alpha)\log(N/\eta)$ is the length of episodes, which satisfies the required condition for inferring the context (see Lemma C.1), and $N$ is the total number of episodes to be run with L-UCRL (Algorithm 1). Under Condition 3 with failure probability $\epsilon_p = O(1/N)$, the true belief state over contexts $b$ at time step $O(H)$ satisfies $b(m^*) \ge 1 - (\eta/N)^2$.

Figure 5: Connected graph constructed from an MDP satisfying Assumption 4.

With the PSR parameters, we can estimate the prediction probabilities at time step $H-1$ for any given history. This in turn implies that for any intervening actions $a_1^\tau, \ldots, a_{l'}^\tau$ of length $l'$, the prediction probability given the history of length $H-1$ is close to the prediction in the $m^{*th}$ MDP:
$$\|(\mathbb{P} - \mathbb{P}_{m^*})(o_1^\tau \cdots o_{l'}^\tau \mid h \,\|\, \mathrm{do}\ a_1^\tau \cdots a_{l'}^\tau)\|_1 \le (\eta/N)^2,$$
with probability at least $1 - (\eta/N)^2$. On the other hand, note that in the $m^{*th}$ MDP,
$$\mathbb{P}_{m^*}(o_1^\tau \cdots o_{l'}^\tau \mid h \,\|\, \mathrm{do}\ a_1^\tau \cdots a_{l'}^\tau) = \mathbb{P}_{m^*}(o_1^\tau \cdots o_{l'}^\tau \mid s_{H-1} \,\|\, \mathrm{do}\ a_1^\tau \cdots a_{l'}^\tau).$$
Therefore, combining with Theorem D.2, we have
$$\|(\mathbb{P} - \hat{\mathbb{P}})(o_1^\tau \cdots o_{l'}^\tau \mid h \,\|\, \mathrm{do}\ a_1^\tau \cdots a_{l'}^\tau)\|_1 \le \epsilon_c, \qquad \forall\, a_1^\tau \cdots a_{l'}^\tau \in \mathcal{A}^{l'},$$
with probability at least $1 - A^{l'}\epsilon_t/\epsilon_c$. In other words, the prediction probabilities estimated with the PSR parameters are correct within error $(\eta/N)^2 + 4\epsilon_c$ with probability at least $1 - A^{l'}\epsilon_t/\epsilon_c$.

In a slightly more general setting, let $\mathcal{T}'$ be the set of all tests of length $l'$ with all possible $A^{l'}$ intervening action sequences, where $1 \le l' \le l$. The core idea of the clustering is to make the error $\epsilon_c$ in the prediction probabilities smaller than the separation between the predictions of different MDPs. Let $\delta_{psr}$ be the average $l_1$ distance between the predictions of all length-$l'$ tests, i.e.,
$$\sum_{a_1^\tau \cdots a_{l'}^\tau \in \mathcal{A}^{l'}} \|(\mathbb{P}_{m_1} - \mathbb{P}_{m_2})(o_1^\tau \cdots o_{l'}^\tau \mid s \,\|\, \mathrm{do}\ a_1^\tau \cdots a_{l'}^\tau)\|_1 \ge A^{l'} \cdot \delta_{psr}, \qquad \forall s \in \mathcal{S},\ \forall m_1 \neq m_2 \in [M]. \qquad (17)$$
For instance, Assumption 2 alone implies that equation (17) holds with $l' = l$ and $A^{l'} \cdot \delta_{psr} \ge \sigma_\tau$, since
$$\sum_{a_1^\tau \cdots a_{l'}^\tau \in \mathcal{A}^{l'}} \|(\mathbb{P}_{m_1} - \mathbb{P}_{m_2})(o_1^\tau \cdots o_{l'}^\tau \mid s \,\|\, \mathrm{do}\ a_1^\tau \cdots a_{l'}^\tau)\|_1 \ge \|L_s(e_{m_1} - e_{m_2})\|_1 \ge \|L_s(e_{m_1} - e_{m_2})\|_2 \ge \sqrt{2}\, \sigma_\tau,$$
where $e_m$ is the standard basis vector in $\mathbb{R}^M$ with $1$ at the $m^{th}$ position. If the MDPs satisfy Assumption 1, then equation (17) holds with $l' = 1$ and $\delta_{psr} = \delta$; the discussion in Section 3.4 applies to this case. Once equation (17) holds with some $\delta_{psr} = \Theta(1)$, with high probability we can identify the context by grouping trajectories with the same ending state and similar $l'$-step predictions at time step $H-1$. Hence the prediction at the $(H-1)^{th}$ time step serves as a label for each trajectory.

We are then left with recovering the full LMDP models.
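The belief concentration driving this labeling step can be made concrete with a small simulation (our own illustration with made-up two-context dynamics, not from the paper): the posterior over contexts, computed from per-step transition likelihoods as in the Bayesian update of Lemma B.1, concentrates on the true context at a rate governed by the per-step separation.

```python
import numpy as np

rng = np.random.default_rng(2)
M, S, A, H = 2, 4, 2, 60
# Two well-separated contexts: independent random transition kernels.
T = rng.dirichlet(np.ones(S), size=(M, S, A))   # T[m, s, a, s']

def posterior_after_episode(m_true):
    """Run one episode under context m_true with uniform actions and
    return the posterior over contexts given the observed transitions."""
    log_post = np.log(np.full(M, 1.0 / M))      # uniform prior over contexts
    s = 0
    for _ in range(H):
        a = rng.integers(A)
        s_next = rng.choice(S, p=T[m_true, s, a])
        log_post += np.log(T[:, s, a, s_next] + 1e-12)  # per-step likelihoods
        s = s_next
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# The posterior mass on the true context approaches 1 as H grows,
# mirroring the b-hat(m*) >= 1 - (N/eta)^{-2} guarantee of Lemma C.3.
print(np.mean([posterior_after_episode(0)[0] for _ in range(200)]))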
Even though we can cluster trajectories according to predictions conditioned on length-$(H-1)$ histories, if two trajectories land in two different states at the $(H-1)^{th}$ time step, we have no means to combine them, even if they come from the same context. In order to resolve this, our approach requires the following assumption:

Assumption 4
For all $m \in [M]$, let $G_m$ be an undirected graph where each node of $G_m$ corresponds to a state $s \in \mathcal{S}$. Suppose we connect $(s, s')$ in $G_m$ (assign an edge between $s$ and $s'$) for $s \neq s'$ if there exists at least one action $a \in \mathcal{A}$ such that $T_m(s' \mid s, a) \ge \alpha_1$ for some $\alpha_1 > 0$. Then $G_m$ is connected, i.e., from any state there exists a path to any other state in $G_m$.

This assumption lets us relate labels at different states $s, s'$ in $G_m$, so that we can assign the same label to trajectories that come from the same context but end at different states. With Assumption 4, if we have a trajectory that ends with last two states $(s_{H-1}, s_H) = (s, s')$ where $s \neq s'$, then we can label this trajectory according to the two different labeling rules at states $s$ and $s'$. Hence, we can associate the labels assigned by predictions at the two different states $s, s'$. Afterwards, even if we have two trajectories ending at different states from the same context, we can assign the same label to both trajectories once we have seen a connection between $(s, s')$. In other words, this step connects labels belonging to the same context across different states $s, s'$. Note that even if there is no direct connection, we can infer the identical context whenever there is a path in the graph, by crossing over states that have direct connections.

Remark 1
Assumption 4 is satisfied if, for instance, each MDP has a finite diameter $D > 0$ [21], where
$$D = \min_\pi \max_{m,\, s \neq s'} \mathbb{E}_m^\pi[\#\text{ of steps } (s \to s')],$$
i.e., $D$ is the minimum required number of expected steps in any MDP (under some deterministic memoryless policy $\pi$) to move from any state $s$ to any other state $s'$. In this case, each $G_m$ is connected with $\alpha_1 \ge 1/D$, since if there were some disconnected groups of states in $G_m$, then the diameter could not be smaller than $1/\alpha_1$ (see also Figure 5). Note that in general, we only need $\alpha_1$ to be bounded below to make each graph $G_m$ connected over all states. With the connectivity of $G_m$, we can associate labels across all the different states in a consistent way, resolving the ambiguity in the ordering of contexts.

As we collect more trajectories that end with various $s_{H-1}$ and $s_H$, whenever $s_{H-1} \neq s_H$ we can associate labels across more states and recover more connections (edges in $G_m$). Then, once every node in $G_m$ is connected for each context $m$, we can recover the full transition and reward models for context $m$, since we have resolved the ambiguity in the ordering of the labels at all states. After we recover the transition and reward models, we recover the initial distribution of each MDP with a few more length-$H$ trajectories. The full clustering procedure is summarized in Algorithm 6.

To reliably estimate the parameters with Algorithm 6 so that they serve as a good initialization for Algorithm 1, we require
$$\epsilon_c \le \tfrac{1}{4}\min(\epsilon_{init}, \delta_{psr}), \qquad (\eta/N)^2 + A^{l'}\epsilon_t/\epsilon_c \le 0.1/n_2,$$
which in turn implies the desired accuracy in total variation distance between full length-$t$ trajectories: $\epsilon_t \ll A^{-l'}\epsilon_c/n_2$. In summary, the total sample complexity we need for the initialization is
$$n_1 \ge C \cdot \frac{H^2 M n_2^2}{\epsilon_c^2} \cdot \frac{A^{l+2l'+1}}{p_\pi\, \sigma_\tau^2\, \sigma_h^2} \left( S + \frac{A^l}{\sigma_h^2} \right) \mathrm{poly}\log(N/\eta),$$
for a sufficiently large absolute constant $C > 0$.

D.2.1 Proof of Theorem 3.5
Proof.
Let $n_2 \ge C \cdot \log(n_2)\, M A/(\alpha_0\alpha_1)$ and $\epsilon_c = c \cdot \min(\delta_{psr}, \epsilon_{init})$ for some sufficiently large constant $C > 0$ and sufficiently small constant $c > 0$. Let $\epsilon_t = \epsilon_c/(10\, n_2 A^{l'})$. Plugging this into Theorem D.1 and Theorem D.2, if we use $n_1$ short trajectories for learning the PSR, where
$$n_1 \ge C \cdot \frac{H^2 M}{\epsilon_c^2\, \alpha_0 \alpha_1} \cdot \frac{A^{l+2l'+3}}{p_\pi\, \sigma_\tau^2\, \sigma_h^2} \left( S + \frac{A^l}{\sigma_h^2} \right) \mathrm{poly}\log(N/\eta),$$
then the estimated $l'$-step prediction probabilities are accurate within $\epsilon_c$ with probability at least $9/10$ for all $n_2$ trajectories (over the randomness of the new trajectories).

With Assumption 3, with $n_2 \gg (M/\alpha_0)\log(MS)$, we can visit all states in all MDPs at least once at time step $H-1$ after $n_2/3$ episodes, with probability larger than $9/10$. Furthermore, for all $n_2/3$ trajectories $h_1, \ldots, h_{n_2/3}$ up to the $(H-1)^{th}$ time step, we have
$$\|(\mathbb{P}^\pi - \hat{\mathbb{P}}^\pi)(\mathcal{T}' \mid h_i)\|_1 \le A^{l'}\epsilon_c, \qquad \forall i \in [n_2/3],$$
with probability at least $9/10$ by a union bound. Let $m_i$ and $s_i$ be the true context and ending state of $h_i$. With Assumption 1 and the separation Lemma C.1, we also have, with probability at least $1 - \eta$, that
$$\|\mathbb{P}^\pi(\mathcal{T}' \mid h_i) - \mathbb{P}^\pi_{m_i}(\mathcal{T}' \mid s_i)\|_1 \le A^{l'}\cdot(\eta/N)^2, \qquad \forall i \in [n_2/3],$$
where $N \gg n_2$ is the number of episodes to be run after initialization with Algorithm 1. Since the prediction probabilities are $\delta_{psr}$-separated, and Theorem D.2 ensures that all possible sets of $l'$-step predictions are estimated within error $\epsilon_c \ll \delta_{psr}$, we are guaranteed that all $h_i$'s whose estimated $\hat{\mathbb{P}}^\pi(\mathcal{T}' \mid h_i)$ are within $A^{l'}\epsilon_c$-error of each other are generated from the same context. Note that with Assumption 1, we have $l' = 1$ and $\delta_{psr} = \delta$.

Suppose now that we have Assumption 1. In this case, we set $\mathcal{T}'$ to be the set of all possible observations of length 1. We are now left with the recovery of the full transition and reward models for each context. Note that the same guarantees of the previous paragraph hold for predictions at time step $H$ with probability $9/10$. With Assumption 4 (see also Remark 1), we build a connection graph for each context. That is, with $n_2 = O(MA\log(MS)/(\alpha_0\alpha_1))$ episodes (since we need to see at least one occurrence of every edge in every context, i.e., all $(m, s)$ with edges to neighboring states $s'$ via some action $a$), we obtain pairs $(s_{H-1}, s_H)$ within the same trajectory, and these pairs are sufficient to recover all edges in all graphs $G_m$. Note that each edge occurs with probability at least $\Omega(\alpha_0\alpha_1/(MA))$ and there are at most $MS$ edges, which gives the required number of trajectories for the clustering.

More specifically, by associating the 1-step predictions at time steps $H-1$ and $H$ within the same trajectory, we can connect the labels found at $s_{H-1}$, with estimated quantity $\mathbb{P}_m(\cdot \mid s_{H-1}, a)$, to those found at $s_H$, as we confirm these labels belong to the same true context $m$ of the trajectory. We aggregate more sample trajectories until we recover all edges in the connection graph $G_m$. As long as this association results in a consistent reordering of contexts in all states, we can recover the full transition models (as well as the rewards and initial distributions) for all contexts.

Now, we visit every state $s$ with probability at least $\alpha_0$ at time step $H-1$ by Assumption 3. Then, by taking a uniform action $a$ at time step $H-1$, with probability at least $\alpha_1/A$ we reveal a connection from $s$ to some other state $s'$ (which is essential for the consistent reordering of contexts) at time step $H$, by Assumption 4. If we repeat this process for $n_2 = C \cdot MA\log(MS)/(\alpha_0\alpha_1)$ episodes, we can collect all the necessary information for the reordering of contexts in all different states. In conclusion, Algorithm 6 recovers $T_m$ and $R_m$ up to $\epsilon_c$-accuracy for all $m, s, a$ (not necessarily in the same order in $m$). The initial state distributions of all contexts can be recovered similarly. The entire process succeeds with probability at least $7/10$. $\Box$

Appendix E Proofs for Spectral Learning of PSR
In this section, we provide the deferred proofs of the lemmas used in Appendix D.1. When the norm $\|\cdot\|$ is used without a subscript, we mean the $l_2$ norm for vectors and the operator norm for matrices.

E.1 Proof of Theorem D.1
Let us define some notation before getting into the details. Let $p_s = \mathbf{1}^\top P_{\mathcal{H}_s} = \mathbb{P}(\text{end state} = s)$, with empirical counterpart $\hat{p}_s = \mathbf{1}^\top \hat{P}_{\mathcal{H}_s}$, denote the probability of sampling a history ending with $s$. First, we normalize the joint probability matrices:
$$P_{\mathcal{T},\mathcal{H}|s} = \frac{P_{\mathcal{T},\mathcal{H}_s}}{p_s}, \qquad P_{\mathcal{T},oa,\mathcal{H}|s} = \frac{P_{\mathcal{T},oa,\mathcal{H}_s}}{p_s}, \qquad \hat{P}_{\mathcal{T},\mathcal{H}|s} = \frac{\hat{P}_{\mathcal{T},\mathcal{H}_s}}{\hat{p}_s}, \qquad \hat{P}_{\mathcal{T},oa,\mathcal{H}|s} = \frac{\hat{P}_{\mathcal{T},oa,\mathcal{H}_s}}{\hat{p}_s}.$$
We occasionally express unnormalized PSR states, given a history $(s,a,o)_{1:t-1}$, with the PSR parameters $\{(b_{\infty,s}, B_{o,a,s}, b_{1,s})\}$ as
$$b_{t,s} = B_{(o,a,s)_{t-1}} B_{(o,a,s)_{t-2}} \cdots B_{(o,a,s)_1} b_{1,s_1} = B_{(o,a,s)_{1:t-1}} b_{1,s_1}.$$
The empirical counterparts are defined similarly with $\hat{\cdot}$ on top. We often write $h_t$ for $(s,a,o)_{1:t-1} = (s,a,r)_{1:t-1} s_t$, and we represent the probability of choosing actions $a_1, \ldots, a_{t-1}$ given the history $h_{t-1}$ as
$$\pi(a_{1:t-1} \| h_{t-1}) = \pi(a_1 \mid h_1)\, \pi(a_2 \mid h_2) \cdots \pi(a_{t-1} \mid h_{t-1}).$$
Now suppose that the empirical estimates of the probability matrices satisfy
$$\|P_{\mathcal{H}_s} - \hat{P}_{\mathcal{H}_s}\| \le \epsilon_{1,s}, \qquad \|P(\mathcal{T}, s_1 = s) - \hat{P}(\mathcal{T}, s_1 = s)\| \le \epsilon_{2,s}, \qquad \|P_{\mathcal{T},\mathcal{H}|s} - \hat{P}_{\mathcal{T},\mathcal{H}|s}\| \le \epsilon_{3,s}, \qquad \|P_{\mathcal{T},oa,\mathcal{H}|s} - \hat{P}_{\mathcal{T},oa,\mathcal{H}|s}\| \le \epsilon_{4,oas},$$
for all $s, a, o$. The following lemma shows how the error in the estimated matrices affects the accuracy of the PSR parameters.

Lemma E.1
Let the true transformed PSR parameters computed with $\hat{U}_s, \hat{V}_s$ be
$$\tilde{B}_{o,a,s} = \hat{U}_{s'}^\top P_{\mathcal{T},oa,\mathcal{H}_s} \hat{V}_s (\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}_s} \hat{V}_s)^{-1}, \qquad \tilde{b}_{\infty,s} = P_{\mathcal{H}_s}^\top \hat{V}_s (\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}_s} \hat{V}_s)^{-1}, \qquad \tilde{b}_{1,s} = \hat{U}_s^\top P(\mathcal{T}, s_1 = s),$$
for all $s, a, o$. Let $\sigma_{M,s} = \sigma_M(\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}|s} \hat{V}_s)$ be the minimum ($M^{th}$) singular value. Then we have
$$\|\tilde{B}_{o,a,s} - \hat{B}_{o,a,s}\| \le \frac{\epsilon_{4,oas}}{\sigma_{M,s}} + \sqrt{A^l\, \mathbb{P}^\pi(o \mid s \,\|\, \mathrm{do}\ a)}\; \frac{2\epsilon_{3,s}}{\sigma_{M,s}^2}, \qquad \|\tilde{b}_{\infty,s} - \hat{b}_{\infty,s}\| \le \frac{\epsilon_{1,s}}{p_s\, \sigma_{M,s}} + \frac{2\epsilon_{3,s}}{\sigma_{M,s}^2}, \qquad \|\tilde{b}_{1,s} - \hat{b}_{1,s}\| \le \epsilon_{2,s},$$
where $\mathbb{P}^\pi(\cdot)$ is the probability of events when we sample histories with the exploration policy $\pi$.

The proofs of the helper lemmas are given at the end of this subsection. We define the following quantities with error bounds, similarly to [19]:
$$\delta_{\infty,s} = \|L_s^\top \hat{U}_s(\tilde{b}_{\infty,s} - \hat{b}_{\infty,s})\|_\infty \le \sqrt{A^l}\, \|\tilde{b}_{\infty,s} - \hat{b}_{\infty,s}\|,$$
$$\delta_{1,s} = \|(\hat{U}_s^\top L_s)^{-1}(\tilde{b}_{1,s} - \hat{b}_{1,s})\|_1 \le \sqrt{M}\, \|\tilde{b}_{1,s} - \hat{b}_{1,s}\| / \sigma_M(\hat{U}_s^\top L_s),$$
$$\Delta_{o,a,s} = \|(\hat{U}_s^\top L_s)^{-1}(\tilde{B}_{o,a,s} - \hat{B}_{o,a,s})(\hat{U}_s^\top L_s)\|_1 \le \sqrt{M}\, \|\tilde{B}_{o,a,s} - \hat{B}_{o,a,s}\| / \sigma_M(\hat{U}_s^\top L_s),$$
$$\Delta = \max_{a,s} \sum_o \Delta_{o,a,s}, \qquad \delta_\infty = \max_s \delta_{\infty,s}, \qquad \delta_1 = \max_s \delta_{1,s}. \qquad (18)$$
We let $\epsilon_t = \delta_\infty + (1 + \delta_\infty)\big((1 + \Delta)^t \delta_1 + (1 + \Delta)^t - 1\big)$. We first note that for any fixed action sequence $a_{1:t-1}$, it holds that
$$\sum_{(s,o)_{1:t-1}} \big| \tilde{b}_{\infty,s_t}^\top \tilde{B}_{(o,a,s)_{1:t-1}} \tilde{b}_{1,s_1} - \hat{b}_{\infty,s_t}^\top \hat{B}_{(o,a,s)_{1:t-1}} \hat{b}_{1,s_1} \big| \le \delta_\infty + (1 + \delta_\infty)\big((1 + \Delta)^t\delta_1 + (1 + \Delta)^t - 1\big).$$
This equation is a direct consequence of Lemma 12 in [19]. However, here we aim to obtain the bound for all history-dependent policies, hence we need to establish the theorem by re-deriving the induction hypothesis while accounting for the policy. We now bound the original quantity. Observe first that
$$\sum_{(s,a,r)_{1:t-1}, s_t} |\mathbb{P}^\pi((s,a,r)_{1:t-1}, s_t) - \hat{\mathbb{P}}^\pi((s,a,r)_{1:t-1}, s_t)| = \sum_{(s,a,r)_{1:t-1}, s_t} \pi(a_{1:t-1} \| h_{t-1})\, \big| \tilde{b}_{\infty,s_t}^\top \tilde{B}_{(o,a,s)_{1:t-1}} \tilde{b}_{1,s_1} - \hat{b}_{\infty,s_t}^\top \hat{B}_{(o,a,s)_{1:t-1}} \hat{b}_{1,s_1} \big|.$$
Following the steps in [19], for each $s$, we prove the following lemma:

Lemma E.2
Following the steps in [19], for each $s$, we prove the following lemma.

Lemma E.2 For any $t$, we have
\[
\sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \,\big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \big( \tilde{B}_{(o,a,s)_{t-1:1}} \tilde{b}_{1,s} - \hat{B}_{(o,a,s)_{t-1:1}} \hat{b}_{1,s} \big) \big\|_1 \le (1+\Delta)^t \delta_1 + (1+\Delta)^t - 1. \qquad (19)
\]
We are now ready to prove the original claim. Let us denote $\tilde{b}_{t,s} = \tilde{B}_{(o,a,s)_{t-1:1}} \tilde{b}_{1,s}$ and $\hat{b}_{t,s} = \hat{B}_{(o,a,s)_{t-1:1}} \hat{b}_{1,s}$. The remaining step is to account for the error in $\hat{b}_{\infty,s_t}$. Following similar steps, we decompose the summation as
\begin{align*}
\sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big| \tilde{b}_{\infty,s_t}^\top \tilde{B}_{(o,a,s)_{t-1:1}} \tilde{b}_{1,s} - \hat{b}_{\infty,s_t}^\top \hat{B}_{(o,a,s)_{t-1:1}} \hat{b}_{1,s} \big| &\le \sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big| (\tilde{b}_{\infty,s_t} - \hat{b}_{\infty,s_t})^\top (\hat{U}_{s_t}^\top L_{s_t}) (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \big| \\
&\quad + \sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big| (\tilde{b}_{\infty,s_t} - \hat{b}_{\infty,s_t})^\top (\hat{U}_{s_t}^\top L_{s_t}) (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big| \\
&\quad + \sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big| \tilde{b}_{\infty,s_t}^\top (\hat{U}_{s_t}^\top L_{s_t}) (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big|.
\end{align*}
For the first term,
\begin{align*}
\sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big| (\tilde{b}_{\infty,s_t} - \hat{b}_{\infty,s_t})^\top (\hat{U}_{s_t}^\top L_{s_t}) (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \big| &\le \sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big\| (\tilde{b}_{\infty,s_t} - \hat{b}_{\infty,s_t})^\top (\hat{U}_{s_t}^\top L_{s_t}) \big\|_\infty \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \big\|_1 \\
&\le \sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1})\, \delta_{\infty,s_t} \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \big\|_1 \\
&\le \delta_\infty \sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1})\, P(h_t | a_{1:t-1}) \le \delta_\infty.
\end{align*}
Following similar steps, the second term is bounded by $\delta_\infty \big( (1+\Delta)^t \delta_1 + (1+\Delta)^t - 1 \big)$. For the last term, note that $\tilde{b}_{\infty,s_t}^\top (\hat{U}_{s_t}^\top L_{s_t}) = \mathbf{1}^\top$. Therefore,
\[
\sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big| \tilde{b}_{\infty,s_t}^\top (\hat{U}_{s_t}^\top L_{s_t}) (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big| \le \sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 \le (1+\Delta)^t \delta_1 + (1+\Delta)^t - 1.
\]
Therefore, we conclude that
\[
\sum_{(s,a,o)_{1:t-1}} \big| P^\pi((s,a,o)_{1:t-1}) - \hat{P}^\pi((s,a,o)_{1:t-1}) \big| \le \delta_\infty + (1 + \delta_\infty)\big( (1+\Delta)^t \delta_1 + (1+\Delta)^t - 1 \big).
\]
Finally, in order to make the error term smaller than $\epsilon_t$, it suffices to have
\[
\delta_\infty \le \epsilon_t / 4, \qquad \Delta \le \epsilon_t / (4t), \qquad \delta_1 \le \epsilon_t / 4.
\]
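As a quick sanity check (not part of the proof), the closed form in equation (19) is exactly the solution of the one-step error recursion used in the induction below, and the condition $\Delta \le \epsilon_t/(4t)$ keeps the compounding factor $(1+\Delta)^t$ bounded. A small numeric verification, with arbitrary illustrative constants:

```python
# Sanity check: f_t = Delta + (1 + Delta) * f_{t-1}, with f_0 = delta_1, has the
# closed form f_t = (1 + Delta)^t * delta_1 + (1 + Delta)^t - 1.
delta_1, Delta, T = 0.01, 0.005, 20

f = delta_1
for t in range(1, T + 1):
    f = Delta + (1 + Delta) * f
    closed = (1 + Delta) ** t * delta_1 + (1 + Delta) ** t - 1
    assert abs(f - closed) < 1e-12   # closed form matches the recursion

# With Delta <= eps_t / (4t), the compounded term stays O(eps_t):
eps_t = 0.1
Delta = eps_t / (4 * T)
print((1 + Delta) ** T - 1)          # ~ eps_t / 4, up to second-order terms
```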
We need the following lemma on the finite-sample error in the estimated probability matrices.

Lemma E.3
For a sufficiently large constant $C > 0$, the errors in the empirical estimates of the probability matrices are bounded as
\begin{align*}
\|P_{\mathcal{H}_s} - \hat{P}_{\mathcal{H}_s}\|_2 &\le C \sqrt{\frac{p_s}{N} \log(SA/\eta)}, \\
\|P_{\mathcal{T},\mathcal{H}_s} - \hat{P}_{\mathcal{T},\mathcal{H}_s}\|_2 &\le C A^l \sqrt{\frac{p_s}{N} \log(SA/\eta)}, \\
\|P(\mathcal{T}, s_1 = s) - \hat{P}(\mathcal{T}, s_1 = s)\|_2 &\le C A^l \sqrt{\frac{P(s_1 = s)}{N} \log(SA/\eta)}, \\
\|P_{\mathcal{T},oa,\mathcal{H}_s} - \hat{P}_{\mathcal{T},oa,\mathcal{H}_s}\|_2 &\le C A^{l+1} \left( \sqrt{\frac{P^\pi(o | s \,\|\, \text{do } a)\, p_s}{N A} \log(SA/\eta)} + \frac{\log(SA/\eta)}{N} \right),
\end{align*}
for all $s, a, o$, with probability at least $1 - \eta$.

This lemma follows the same concentration argument as Proposition 19 in [19], using McDiarmid's inequality. The proofs of the three lemmas are given at the end of this subsection. With Lemma E.1, Lemma E.3, and equation (18), we now decide the sample size. For $\Delta$,
\[
\Delta \le \sum_o \Delta_{o,a,s} \le \frac{\sqrt{M}}{\sigma_M(\hat{U}_s^\top L_s)} \left( \sum_o \frac{\epsilon_{4,oas}}{\sigma_{M,s}} + \sqrt{A^l} \sum_o P^\pi(o | s \,\|\, \text{do } a) \cdot \frac{2\epsilon_{3,s}}{\sigma_{M,s}^2} \right) \le \frac{\sqrt{M}}{\sigma_M(\hat{U}_s^\top L_s)} \left( \frac{\sum_o \epsilon_{4,oas}}{\sigma_{M,s}} + \sqrt{A^l} \cdot \frac{2\epsilon_{3,s}}{\sigma_{M,s}^2} \right).
\]
$\sum_o \epsilon_{4,oas}$ is bounded by
\[
\sum_o \epsilon_{4,oas} \le C A^{l+1} \sum_o \left( \sqrt{\frac{P^\pi(o | s \,\|\, \text{do } a)}{N A p_s} \log(SA/\eta)} + \frac{\log(SA/\eta)}{N p_s} \right) \le C A^{l+1} \left( \sqrt{\frac{S}{N A p_s} \log(SA/\eta)} + \frac{2S \log(SA/\eta)}{N p_s} \right).
\]
Also note that
\[
\epsilon_{3,s} \le C A^l \sqrt{\frac{\log(SA/\eta)}{N p_s}}.
\]
In order to have $\Delta < \epsilon_t / (4t)$, the sample size should be at least
\[
N \ge C' \cdot \frac{M t^2}{\epsilon_t^2} \left( \frac{A^{l+1} S}{p_s\, \sigma_M(\hat{U}_s^\top L_s)^2\, \sigma_{M,s}^2} + \frac{A^{l+1}}{p_s\, \sigma_M(\hat{U}_s^\top L_s)^2\, \sigma_{M,s}^4} \right) \log(SA/\eta),
\]
for some large constant $C' > 0$.

Finally, $\sigma_{M,s} = \sigma_M(\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}|s} \hat{V}_s) \ge (1 - \epsilon_0)\, \sigma_M(P_{\mathcal{T},\mathcal{H}|s})$, where $\epsilon_0 = \epsilon_{3,s} / \big( (1 - \epsilon_t)\, \sigma_M(P_{\mathcal{T},\mathcal{H}|s}) \big)$, by applying Lemma E.9 twice. Hence, as long as $N \gg 1 / \sigma_M(P_{\mathcal{T},\mathcal{H}|s})^2$, it holds that $\sigma_{M,s} \ge \sigma_M(P_{\mathcal{T},\mathcal{H}|s}) / 2$. Similarly, we have $\sigma_M(\hat{U}_s^\top L_s) \ge \sigma_M(L_s) / 2$. Plugging these inequalities into the sample complexity completes the proof of Theorem D.1.
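This last step is the standard transfer of singular-value lower bounds from the true matrix to its empirical estimate (Lemma E.9 at the end of this appendix). A small NumPy sketch of the phenomenon on a synthetic low-rank matrix (illustrative only; dimensions and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, M = 60, 40, 3

# Rank-M "true" matrix A and a perturbation E with ||E||_2 = 0.05 * sigma_M(A).
A = rng.normal(size=(m, M)) @ rng.normal(size=(M, n))
sigma_M = np.linalg.svd(A, compute_uv=False)[M - 1]
E = rng.normal(size=(m, n))
E *= 0.05 * sigma_M / np.linalg.norm(E, 2)
A_hat = A + E

# Top-M singular subspaces of the *estimated* matrix, as the algorithm uses.
U_full, _, Vt_full = np.linalg.svd(A_hat)
U_hat, V_hat = U_full[:, :M], Vt_full[:M].T

# sigma_M(U_hat^T A_hat V_hat) stays within a constant factor of sigma_M(A).
sigma_M_hat = np.linalg.svd(U_hat.T @ A_hat @ V_hat, compute_uv=False)[M - 1]
print(sigma_M_hat >= sigma_M / 2)     # True when the noise is small enough
```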
E.2 Proof of Theorem D.2

Proof.
We first define an extended policy $\pi'$ which runs the given policy $\pi$ for $t$ steps and then plays the intervening action sequence $a_t, \ldots, a_{t+l-1}$. Let us denote $o = (r, s')$ to represent a pair of reward and next state compactly. A simple corollary of Theorem D.1 is the following lemma.

Lemma E.4
With the estimated PSR parameters in Theorem D.1, for any given trajectory $(s,a,o)_{1:t-1}$, the following holds:
\[
\sum_{a_t, r_t, \ldots, s_{t+l}} \pi(a_{1:t-1} | h_{t-1}) \big| b_{\infty,s_{t+l}}^\top B_{(o,a,s)_{t+l-1:1}} b_{1,s} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:1}} \hat{b}_{1,s} \big| \le \epsilon_l\, P^\pi((s,a,o)_{1:t-1}) + 2\, \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1.
\]
On top of this lemma, we also have the following lemma, which bounds the probability of the bad events in which the error in the estimated probability can be arbitrarily large:
Lemma E.5
For any history-dependent policy $\pi$, with the PSR parameters guaranteed in Theorem D.1, we have
\begin{align*}
\big| \hat{P}^{\pi'}((s,a,o)_{1:t-1}) - P^\pi((s,a,o)_{1:t-1}) \big| &\le \epsilon_c\, P^\pi((s,a,o)_{1:t-1}), \\
\pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 &\le \epsilon_c\, P^\pi((s,a,o)_{1:t-1}),
\end{align*}
with probability at least $1 - 2\epsilon_t / \epsilon_c$.
By the definition of conditional test probability, note that
\[
\big| \hat{P}^{\pi'}(\tau \mid (s,a,o)_{1:t-1}) - P^{\pi'}(\tau \mid (s,a,o)_{1:t-1}) \big| = \left| \frac{\hat{P}^{\pi'}(\tau, (s,a,o)_{1:t-1})}{\hat{P}^\pi((s,a,o)_{1:t-1})} - \frac{P^{\pi'}(\tau, (s,a,o)_{1:t-1})}{P^\pi((s,a,o)_{1:t-1})} \right|,
\]
which is less than
\[
\frac{1 + \epsilon_c}{P^\pi((s,a,o)_{1:t-1})} \big| \hat{P}^{\pi'}(\tau, (s,a,o)_{1:t-1}) - P^{\pi'}(\tau, (s,a,o)_{1:t-1}) \big| + \epsilon_c\, \frac{P^{\pi'}(\tau, (s,a,o)_{1:t-1})}{P^\pi((s,a,o)_{1:t-1})},
\]
with probability at least $1 - 2\epsilon_t / \epsilon_c$. Now we sum over all possible trajectories $\tau$ in $\mathcal{O}^l$ with intervening actions $a_1, \ldots, a_l$ after observing $(s,a,o)_{1:t-1}$. Under the good event guaranteed in Lemma E.5, the summation over all possible future trajectories is less than
\[
\frac{1 + \epsilon_c}{P^\pi((s,a,o)_{1:t-1})} \Big( (\epsilon_l + 2\epsilon_c)\, P^\pi((s,a,o)_{1:t-1}) + \epsilon_c\, P^\pi((s,a,o)_{1:t-1}) \Big) \lesssim \epsilon_c,
\]
from Lemmas E.4 and E.5. Therefore, for a fixed intervening action sequence $a_t, \ldots, a_{t+l-1}$, we can conclude that
\[
\big\| P^\pi(\mathcal{O}^l \mid (s,a,o)_{1:t-1} \,\|\, \text{do } a_{t:t+l-1}) - \hat{P}^\pi(\mathcal{O}^l \mid (s,a,o)_{1:t-1} \,\|\, \text{do } a_{t:t+l-1}) \big\|_1 \lesssim \epsilon_c,
\]
with probability at least $1 - 2\epsilon_t / \epsilon_c$. $\square$

E.3 Proof of Lemma E.1
Proof.
The proof of the lemma follows by unfolding the expressions:
\begin{align*}
\|\tilde{B}_{o,a,s} - \hat{B}_{o,a,s}\|_2 &= \big\| \hat{U}_{s'}^\top P_{\mathcal{T},oa,\mathcal{H}|s} \hat{V}_s (\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}|s} \hat{V}_s)^{-1} - \hat{U}_{s'}^\top \hat{P}_{\mathcal{T},oa,\mathcal{H}|s} \hat{V}_s (\hat{U}_s^\top \hat{P}_{\mathcal{T},\mathcal{H}|s} \hat{V}_s)^{-1} \big\|_2 \\
&\le \big\| \hat{U}_{s'}^\top (P_{\mathcal{T},oa,\mathcal{H}|s} - \hat{P}_{\mathcal{T},oa,\mathcal{H}|s}) \hat{V}_s (\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}|s} \hat{V}_s)^{-1} \big\|_2 + \big\| \hat{U}_{s'}^\top \hat{P}_{\mathcal{T},oa,\mathcal{H}|s} \hat{V}_s \big( (\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}|s} \hat{V}_s)^{-1} - (\hat{U}_s^\top \hat{P}_{\mathcal{T},\mathcal{H}|s} \hat{V}_s)^{-1} \big) \big\|_2 \\
&\le \frac{\epsilon_{4,oas}}{\sigma_{M,s}} + \|P_{\mathcal{T},oa,\mathcal{H}|s}\|_2 \cdot \frac{2\epsilon_{3,s}}{\sigma_{M,s}^2} \le \frac{\epsilon_{4,oas}}{\sigma_{M,s}} + P^\pi(o | s \,\|\, \text{do } a)\, \sqrt{A^l} \cdot \frac{2\epsilon_{3,s}}{\sigma_{M,s}^2},
\end{align*}
where we used Lemma E.10 from matrix perturbation theory for the second inequality, and
\begin{align*}
\|P_{\mathcal{T},oa,\mathcal{H}_s}\|_2 &\le \sqrt{ \sum_{\tau \in \mathcal{T},\, h \in \mathcal{H}_s} P^\pi(o\, o_1^\tau \cdots o_l^\tau \mid h \,\|\, \text{do } a\, a_1^\tau \cdots a_l^\tau)^2\, P^\pi(h)^2 } \\
&\le \sqrt{ \sum_{a_1, \ldots, a_l} \sum_{o_1, \ldots, o_l} \sum_{h \in \mathcal{H}_s} P^\pi(o\, o_1 \cdots o_l \mid h \,\|\, \text{do } a\, a_1 \cdots a_l)^2\, P^\pi(h)^2 } \\
&\le \sqrt{ \sum_{a_1, \ldots, a_l} P^\pi(o \mid h \,\|\, \text{do } a)^2 \sum_{o_1, \ldots, o_l} \sum_{h \in \mathcal{H}_s} P^\pi(o_1 \cdots o_l \mid hao \,\|\, \text{do } a_1 \cdots a_l)^2\, P^\pi(h)^2 } \\
&\le P^\pi(o \mid h \,\|\, \text{do } a) \sqrt{ \sum_{a_1, \ldots, a_l} \sum_{h \in \mathcal{H}_s} P^\pi(h)^2 } = P^\pi(o \mid h \,\|\, \text{do } a)\, \sqrt{A^l} \sqrt{ \sum_{h \in \mathcal{H}_s} P^\pi(h)^2 } \\
&\le P^\pi(o \mid h \,\|\, \text{do } a)\, \sqrt{A^l} \sum_{h \in \mathcal{H}_s} P^\pi(h) = P^\pi(o \mid h \,\|\, \text{do } a)\, \sqrt{A^l}\, p_s,
\end{align*}
so that $\|P_{\mathcal{T},oa,\mathcal{H}|s}\|_2 \le P^\pi(o \mid h \,\|\, \text{do } a)\, \sqrt{A^l}$ for the last inequality. For the initial and normalization parameters,
\begin{align*}
\|\tilde{b}_{\infty,s} - \hat{b}_{\infty,s}\|_2 &\le \big\| (P_{\mathcal{H}_s} - \hat{P}_{\mathcal{H}_s})^\top \hat{V}_s (\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}_s} \hat{V}_s)^{-1} \big\|_2 + \big\| P_{\mathcal{H}_s}^\top \hat{V}_s \big( (\hat{U}_s^\top P_{\mathcal{T},\mathcal{H}_s} \hat{V}_s)^{-1} - (\hat{U}_s^\top \hat{P}_{\mathcal{T},\mathcal{H}_s} \hat{V}_s)^{-1} \big) \big\|_2 \\
&\le \frac{\epsilon_{1,s}}{\sigma_M(P_{\mathcal{T},\mathcal{H}_s})} + \|P_{\mathcal{H}_s} / \hat{p}_s\|_2 \cdot \frac{2\epsilon_{3,s}}{\sigma_M(P_{\mathcal{T},\mathcal{H}|s})^2} \le \frac{\epsilon_{1,s}}{p_s\, \sigma_M(P_{\mathcal{T},\mathcal{H}|s})} + \frac{2\epsilon_{3,s}}{\sigma_M(P_{\mathcal{T},\mathcal{H}|s})^2},
\end{align*}
and $\|\tilde{b}_{1,s} - \hat{b}_{1,s}\|_2 \le \epsilon_{2,s}$. $\square$

E.4 Proof of Lemma E.2
Proof.
We show this lemma by induction on $t$. For $t = 1$, we bound $\|(\hat{U}_{s_1}^\top L_{s_1})^{-1} (\tilde{b}_{1,s} - \hat{b}_{1,s})\|_1 \le \delta_{1,s}$ by definition. Now assume the claim holds for $t - 1$, and check the induction hypothesis:
\begin{align*}
\sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{B}_{(o,a,s)_{t-1:1}} \tilde{b}_{1,s} - \hat{B}_{(o,a,s)_{t-1:1}} \hat{b}_{1,s}) \big\|_1 &= \sum_{(s,a,o)_{1:t-2}} \sum_{a_{t-1}, o_{t-1}} \pi(a_{1:t-2} | h_{t-2})\, \pi(a_{t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{B}_{(o,a,s)_{t-1:1}} \tilde{b}_{1,s} - \hat{B}_{(o,a,s)_{t-1:1}} \hat{b}_{1,s}) \big\|_1 \\
&= \sum_{a_{t-1}} \pi(a_{t-1} | h_{t-1}) \sum_{o_{t-1}} \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1.
\end{align*}
We investigate the inner sum by decomposing $\|(\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s})\|_1$ as
\begin{align*}
\big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 &\le \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{B}_{(o,a,s)_{t-1}} - \hat{B}_{(o,a,s)_{t-1}}) (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}}) \big\|_1 \big\| (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} \tilde{b}_{t-1,s} \big\|_1 \\
&\quad + \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{B}_{(o,a,s)_{t-1}} - \hat{B}_{(o,a,s)_{t-1}}) (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}}) \big\|_1 \big\| (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} (\tilde{b}_{t-1,s} - \hat{b}_{t-1,s}) \big\|_1 \\
&\quad + \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{B}_{(o,a,s)_{t-1}} (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}}) (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} (\tilde{b}_{t-1,s} - \hat{b}_{t-1,s}) \big\|_1.
\end{align*}
For the first term,
\begin{align*}
\sum_{o_{t-1}} \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{B}_{(o,a,s)_{t-1}} - \hat{B}_{(o,a,s)_{t-1}}) (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}}) \big\|_1 \big\| (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} \tilde{b}_{t-1,s} \big\|_1 &= \sum_{o_{t-1}} \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2})\, \Delta_{(o,a,s)_{t-1}} \big\| (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} \tilde{b}_{t-1,s} \big\|_1 \\
&\le \Delta \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2}) \big\| (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} \tilde{b}_{t-1,s} \big\|_1 \\
&= \Delta \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2})\, P(h_{t-1}) = \Delta,
\end{align*}
where we used the definition $\tilde{b}_{t-1,s} = \hat{U}_{s_{t-1}}^\top L_{s_{t-1}} P((s,a,o)_{1:t-2})$. For the second term, by the induction hypothesis,
\begin{align*}
\sum_{o_{t-1}} \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2})\, \Delta_{(o,a,s)_{t-1}} \big\| (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} (\tilde{b}_{t-1,s} - \hat{b}_{t-1,s}) \big\|_1 &\le \Delta \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2}) \big\| (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} (\tilde{b}_{t-1,s} - \hat{b}_{t-1,s}) \big\|_1 \\
&= \Delta \big( (1+\Delta)^{t-1} \delta_1 + (1+\Delta)^{t-1} - 1 \big).
\end{align*}
The last term can be handled following the same argument as in [19]. It gives
\begin{align*}
\sum_{o_{t-1}} \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{B}_{(o,a,s)_{t-1}} (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}}) (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} (\tilde{b}_{t-1,s} - \hat{b}_{t-1,s}) \big\|_1 &\le \sum_{o_{t-1}} \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2}) \big\| D_{(o,a,s)_{t-1}} (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} (\tilde{b}_{t-1,s} - \hat{b}_{t-1,s}) \big\|_1 \\
&\le \sum_{o_{t-1}} \sum_{(s,a,o)_{1:t-2}} \pi(a_{1:t-2} | h_{t-2}) \big\| D_{(o,a,s)_{t-1}} \big\|_1 \big\| (\hat{U}_{s_{t-1}}^\top L_{s_{t-1}})^{-1} (\tilde{b}_{t-1,s} - \hat{b}_{t-1,s}) \big\|_1 \\
&\le (1+\Delta)^{t-1} \delta_1 + (1+\Delta)^{t-1} - 1.
\end{align*}
Now combining these three bounds, we get
\begin{align*}
\sum_{a_{t-1}} \pi(a_{t-1} | h_{t-1}) \sum_{o_{t-1}} \sum_{(a,r,s')_{1:t-2}} \pi(a_{1:t-2} | h_{t-2}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 &\le \sum_{a_{t-1}} \pi(a_{t-1} | h_{t-1}) \Big( \Delta + (1+\Delta)\big( (1+\Delta)^{t-1} \delta_1 + (1+\Delta)^{t-1} - 1 \big) \Big) \\
&\le \sum_{a_{t-1}} \pi(a_{t-1} | h_{t-1}) \big( (1+\Delta)^t \delta_1 + (1+\Delta)^t - 1 \big) = (1+\Delta)^t \delta_1 + (1+\Delta)^t - 1. \quad \square
\end{align*}

E.5 Proof of Lemma E.3
Proof.
For the first inequality, we note that
\[
\|P_{\mathcal{H}_s} - \hat{P}_{\mathcal{H}_s}\|_2 \le p_s\, \|P_{\mathcal{H}|s} - \hat{P}_{\mathcal{H}|s}\|_2 + |p_s - \hat{p}_s|\, \|P_{\mathcal{H}|s}\|_2.
\]
Let $N_s = \hat{p}_s N$. For the first term, we can use McDiarmid's inequality, since changing one sample among the $N_s$ samples (conditioned on the test starting from $s$) changes the quantity by at most $\sqrt{2}/N_s$:
\[
\|P_{\mathcal{H}|s} - \hat{P}_{\mathcal{H}|s}\|_2 - \mathbb{E}\big[ \|P_{\mathcal{H}|s} - \hat{P}_{\mathcal{H}|s}\|_2 \big] \lesssim \sqrt{\frac{\ln(1/\eta)}{N_s}},
\]
with probability at least $1 - \eta/2$. Let $n(h_{s,i})$ be the count of a history $h_{s,i}$ after seeing $N_s$ histories that end with $s$. Then,
\[
\mathbb{E}\big[ \|P_{\mathcal{H}|s} - \hat{P}_{\mathcal{H}|s}\|_2 \big] \le \sqrt{\mathbb{E}\big[ \|P_{\mathcal{H}|s} - \hat{P}_{\mathcal{H}|s}\|_2^2 \big]} \le \sqrt{ \sum_{i=1}^{|\mathcal{H}_s|} \mathrm{Var}\left( \frac{n(h_{s,i})}{N_s} \,\Big|\, s \right) } \le \sqrt{ \frac{1}{N_s} \sum_{i=1}^{|\mathcal{H}_s|} P(h_{s,i} | s) } \le \sqrt{\frac{1}{N_s}},
\]
and therefore $\|P_{\mathcal{H}|s} - \hat{P}_{\mathcal{H}|s}\|_2 \lesssim \sqrt{\ln(1/\eta)/N_s}$. On the other hand, we can show that
\[
|p_s - \hat{p}_s| \lesssim \sqrt{\frac{p_s}{N} \log(1/\eta)} + \frac{\log(1/\eta)}{N}
\]
via a simple application of Bernstein's inequality. Note that our sample complexity guarantees $N \gg \log(1/\eta)/p_s$. Hence,
\[
\|P_{\mathcal{H}_s} - \hat{P}_{\mathcal{H}_s}\|_2 \le p_s\, \|P_{\mathcal{H}|s} - \hat{P}_{\mathcal{H}|s}\|_2 + |p_s - \hat{p}_s|\, \|P_{\mathcal{H}|s}\|_2 \lesssim p_s \sqrt{\frac{\log(1/\eta)}{N_s}} + \sqrt{\frac{p_s}{N} \log(1/\eta)}\, \|P_{\mathcal{H}|s}\|_2 \lesssim \sqrt{\frac{p_s}{N} \log(1/\eta)}.
\]
Similarly, we can show that
\[
\mathbb{E}\big[ \|P_{\mathcal{T},\mathcal{H}|s} - \hat{P}_{\mathcal{T},\mathcal{H}|s}\|_2 \big] \le \sqrt{\mathbb{E}\big[ \|P_{\mathcal{T},\mathcal{H}|s} - \hat{P}_{\mathcal{T},\mathcal{H}|s}\|_F^2 \big]} \le \sqrt{ \sum_{j \in [|\mathcal{T}|],\, i \in [|\mathcal{H}_s|]} \mathrm{Var}\left( \frac{A^l\, n(\tau_j, h_{s,i})}{N_s} \,\Big|\, s \right) } \le A^l \sqrt{ \frac{1}{N_s} \sum_{j,i} P(\tau_j, h_{s,i} | s) } \le A^l \sqrt{\frac{1}{N_s}}.
\]
Following the same argument with McDiarmid's inequality, we get the second inequality. The remaining inequalities can be shown through similar arguments. Taking union bounds over all $s, a, o$ gives the lemma. $\square$
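A quick Monte Carlo illustration (not part of the proof) of the $\sqrt{1/N_s}$ rate for the $\ell_2$ error of an empirical distribution, which is the quantity McDiarmid's inequality concentrates here:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50                                 # number of distinct histories in H_s
p = rng.dirichlet(np.ones(K))          # a "true" conditional distribution P_{H|s}

for N_s in (1_000, 10_000, 100_000):
    errs = [np.linalg.norm(rng.multinomial(N_s, p) / N_s - p)
            for _ in range(200)]
    # The mean L2 error tracks sqrt(1/N_s), matching E[||.||_2] <= sqrt(1/N_s).
    print(N_s, round(float(np.mean(errs)), 5), round((1 / N_s) ** 0.5, 5))
```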
E.6 Proof of Lemma E.4
Proof.
As in the proof of Theorem D.1, let us denote $\tilde{b}_{t,s} = \tilde{B}_{(o,a,s)_{t-1:1}} \tilde{b}_{1,s}$. Then we can decompose the terms as before:
\begin{align*}
\sum_{r_t, s_{t+1}, \ldots, s_{t+l}} \big| b_{\infty,s_{t+l}}^\top B_{(o,a,s)_{t+l-1:1}} b_{1,s} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:1}} \hat{b}_{1,s} \big| &= \sum_{r_t, \ldots, s_{t+l}} \big| \tilde{b}_{\infty,s_{t+l}}^\top \tilde{B}_{(o,a,s)_{t+l-1:t}} \tilde{b}_{t,s} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:t}} \hat{b}_{t,s} \big| \\
&= \sum_{r_t, \ldots, s_{t+l}} \big| \tilde{b}_{\infty,s_{t+l}}^\top \tilde{B}_{(o,a,s)_{t+l-1:t}} (\hat{U}_{s_t}^\top L_{s_t}) (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:t}} (\hat{U}_{s_t}^\top L_{s_t}) (\hat{U}_{s_t}^\top L_{s_t})^{-1} \hat{b}_{t,s} \big| \\
&\le \sum_{r_t, \ldots, s_{t+l}} \big\| (\tilde{b}_{\infty,s_{t+l}}^\top \tilde{B}_{(o,a,s)_{t+l-1:t}} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_\infty \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \big\|_1 \\
&\quad + \sum_{r_t, \ldots, s_{t+l}} \big\| (\tilde{b}_{\infty,s_{t+l}}^\top \tilde{B}_{(o,a,s)_{t+l-1:t}} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_\infty \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 \\
&\quad + \sum_{r_t, \ldots, s_{t+l}} \big\| \tilde{b}_{\infty,s_{t+l}}^\top \tilde{B}_{(o,a,s)_{t+l-1:t}} (\hat{U}_{s_t}^\top L_{s_t}) \big\|_\infty \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1.
\end{align*}
We follow the same induction procedure, starting by showing the following equation:
\[
\sum_{s_t, r_t, \ldots, s_{t+l-1}} \big\| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-1:t}} - \hat{B}_{(o,a,s)_{t+l-1:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \le (1+\Delta)^l - 1.
\]
We show this equation by induction as in the previous proof. If $l = 1$, then
\[
\sum_{o_t} \big\| (\hat{U}_{s_{t+1}}^\top L_{s_{t+1}})^{-1} (\tilde{B}_{(o,a,s)_t} - \hat{B}_{(o,a,s)_t}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \le \sum_{o_t} \Delta_{o_t, a_t, s_t} \le \Delta,
\]
by the definition of $\Delta$. Now we assume it holds for sequences of length less than $l$, and prove the induction hypothesis for $l$. We again split the term into three parts:
\begin{align*}
&\sum_{r_t, \ldots, s_{t+l}} \big\| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-1:t}} - \hat{B}_{(o,a,s)_{t+l-1:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \\
&\le \sum_{s_t, r_t, \ldots, s_{t+l}} \big\| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-1}} - \hat{B}_{(o,a,s)_{t+l-1}}) (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}}) \big\|_1 \cdot \big\| (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}})^{-1} \tilde{B}_{(o,a,s)_{t+l-2:t}} (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \\
&\quad + \sum_{r_t, \ldots, s_{t+l}} \big\| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-1}} - \hat{B}_{(o,a,s)_{t+l-1}}) (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}}) \big\|_1 \cdot \big\| (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-2:t}} - \hat{B}_{(o,a,s)_{t+l-2:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \\
&\quad + \sum_{r_t, \ldots, s_{t+l}} \big\| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} \tilde{B}_{(o,a,s)_{t+l-1}} (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}}) \big\|_1 \cdot \big\| (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-2:t}} - \hat{B}_{(o,a,s)_{t+l-2:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \\
&\le \Delta + \Delta \sum_{r_t, \ldots, s_{t+l-1}} \big\| (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-2:t}} - \hat{B}_{(o,a,s)_{t+l-2:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 + \sum_{r_t, \ldots, s_{t+l-1}} \big\| (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-2:t}} - \hat{B}_{(o,a,s)_{t+l-2:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1,
\end{align*}
where in the last step we used $\sum_{r_t, \ldots, s_{t+l}} \| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} \tilde{B}_{(o,a,s)_{t+l-1:t}} (\hat{U}_{s_t}^\top L_{s_t}) \|_1 = 1$, as well as $\sum_{r_{t+l-1}, s_{t+l}} \| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} \tilde{B}_{(o,a,s)_{t+l-1}} (\hat{U}_{s_{t+l-1}}^\top L_{s_{t+l-1}}) \|_1 = 1$. By the induction hypothesis, this is bounded by
\[
\Delta + \Delta\big( (1+\Delta)^{l-1} - 1 \big) + (1+\Delta)^{l-1} - 1 = (1+\Delta)^l - 1.
\]
With this lemma, we can verify that
\begin{align*}
\sum_{r_t, \ldots, s_{t+l}} \big\| (\tilde{b}_{\infty,s_{t+l}}^\top \tilde{B}_{(o,a,s)_{t+l-1:t}} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_\infty &\le \sum_{r_t, \ldots, s_{t+l}} \big\| (\tilde{b}_{\infty,s_{t+l}} - \hat{b}_{\infty,s_{t+l}})^\top (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}}) \big\|_\infty \big\| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-1:t}} - \hat{B}_{(o,a,s)_{t+l-1:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \\
&\quad + \sum_{r_t, \ldots, s_{t+l}} \big\| \tilde{b}_{\infty,s_{t+l}}^\top (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}}) \big\|_\infty \big\| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} (\tilde{B}_{(o,a,s)_{t+l-1:t}} - \hat{B}_{(o,a,s)_{t+l-1:t}}) (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \\
&\quad + \sum_{r_t, \ldots, s_{t+l}} \big\| (\tilde{b}_{\infty,s_{t+l}} - \hat{b}_{\infty,s_{t+l}})^\top (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}}) \big\|_\infty \big\| (\hat{U}_{s_{t+l}}^\top L_{s_{t+l}})^{-1} \tilde{B}_{(o,a,s)_{t+l-1:t}} (\hat{U}_{s_t}^\top L_{s_t}) \big\|_1 \\
&\le \delta_\infty + (\delta_\infty + 1)\big( (1+\Delta)^l - 1 \big).
\end{align*}
Let $\epsilon_l := \delta_\infty + (\delta_\infty + 1)\big( (1+\Delta)^l - 1 \big)$. From the above, we can conclude that
\begin{align*}
\sum_{a_t, r_t, \ldots, s_{t+l}} \pi(a_{1:t} | h_t) \big| b_{\infty,s_{t+l}}^\top B_{(o,a,s)_{t+l-1:1}} b_{1,s} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:1}} \hat{b}_{1,s} \big| &= \sum_{a_t, r_t, \ldots, s_{t+l}} \pi(a_{1:t-1} | h_{t-1}) \big| b_{\infty,s_{t+l}}^\top B_{(o,a,s)_{t+l-1:t}} \tilde{b}_{t,s} - \hat{b}_{\infty,s_{t+l}}^\top \hat{B}_{(o,a,s)_{t+l-1:t}} \hat{b}_{t,s} \big| \\
&\le \epsilon_l\, \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \big\|_1 + \epsilon_l\, \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 \\
&\quad + \sum_{a_t, r_t, \ldots, s_{t+l}} \pi(a_{1:t-1} | h_{t-1}) \big\| \tilde{b}_{\infty,s_{t+l}}^\top \tilde{B}_{(o,a,s)_{t+l-1:t}} (\hat{U}_{s_t}^\top L_{s_t}) \big\|_\infty \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 \\
&\le \epsilon_l\, P^\pi((s,a,o)_{1:t-1}) + (1 + \epsilon_l)\, \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1,
\end{align*}
where we used $\pi(a_{1:t-1} | h_{t-1}) \| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \|_1 = \pi(a_{1:t-1} | h_{t-1})\, \mathbf{1}^\top (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} = P^\pi((s,a,o)_{1:t-1})$, and
\[
\sum_{r_t, \ldots, s_{t+l}} \big\| \tilde{b}_{\infty,s_{t+l}}^\top \tilde{B}_{(o,a,s)_{t+l-1:t}} (\hat{U}_{s_t}^\top L_{s_t}) \big\|_\infty = \sum_{s_t, r_t, \ldots, s_{t+l}} \big\| \mathbf{1}^\top \tilde{D}_{(o,a,s)_{t+l-1:t}} \big\|_\infty = 1.
\]
Since $\epsilon_l \le 1$, we get the lemma. $\square$

E.7 Proof of Lemma E.5
Proof.
Note that from equation (19) in Lemma E.2, we have
\[
\sum_{s_1, a_1, r_1, \ldots, r_{t-1}, s_t} \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 \le \epsilon_t.
\]
Let $\mathcal{E}_b$ be the bad event in which, for a sampled trajectory $s_1, a_1, r_1, \ldots, r_{t-1}, s_t$, the difference in estimated probability is larger than $\epsilon_c\, P^\pi(s_1, a_1, r_1, \ldots, r_{t-1}, s_t)$, i.e.,
\[
\pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 \ge \epsilon_c\, P^\pi(s_1, a_1, r_1, \ldots, r_{t-1}, s_t).
\]
Note that $P^\pi(s_1, a_1, r_1, \ldots, r_{t-1}, s_t) = \pi(a_{1:t-1} | h_{t-1})\, \mathbf{1}^\top (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} = \pi(a_{1:t-1} | h_{t-1}) \| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \|_1$. If $P^\pi(\mathcal{E}_b) > \epsilon_t / \epsilon_c$, then
\begin{align*}
\sum_{(s,a,o)_{1:t-1}} \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 &\ge \sum_{(s,a,o)_{1:t-1} \in \mathcal{E}_b} \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} (\tilde{b}_{t,s} - \hat{b}_{t,s}) \big\|_1 \\
&\ge \sum_{(s,a,o)_{1:t-1} \in \mathcal{E}_b} \epsilon_c\, \pi(a_{1:t-1} | h_{t-1}) \big\| (\hat{U}_{s_t}^\top L_{s_t})^{-1} \tilde{b}_{t,s} \big\|_1 \ge \epsilon_c\, P^\pi(\mathcal{E}_b) > \epsilon_t,
\end{align*}
which is a contradiction. Similarly, by Theorem D.1, we have
\[
\sum_{(s,a,r)_{1:t-1}, s_t} \big| P^\pi((s,a,r)_{1:t-1}, s_t) - \hat{P}^\pi((s,a,r)_{1:t-1}, s_t) \big| \le \epsilon_t.
\]
Following the same argument, we reach a contradiction if the first inequality of Lemma E.5 fails with probability larger than $\epsilon_t / \epsilon_c$. $\square$
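The argument above is a Markov-inequality step in disguise: if a nonnegative error sums to at most $\epsilon_t$, the trajectories whose error exceeds an $\epsilon_c$ fraction of their own probability carry at most $\epsilon_t/\epsilon_c$ probability mass. A toy numeric illustration (synthetic numbers, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
P = rng.dirichlet(np.ones(n))                 # P^pi over trajectories
err = np.abs(rng.normal(scale=1e-6, size=n))  # per-trajectory estimation error

eps_t = err.sum()          # total error, as bounded by equation (19)
eps_c = 0.01
bad = err >= eps_c * P     # bad event E_b: error >= eps_c * P^pi(trajectory)

# On the bad set, P_i <= err_i / eps_c, so P(E_b) <= eps_t / eps_c always holds.
assert P[bad].sum() <= eps_t / eps_c
print(P[bad].sum(), "<=", eps_t / eps_c)
```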
E.8 Auxiliary Lemmas for Spectral Learning
For completeness, we include the following lemmas, taken from Appendix B of [19] and from [45].
Lemma E.6 (Theorem 4.1 in [45])
Let $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, and $\hat{A} = A + E$ for some $E \in \mathbb{R}^{m \times n}$. If the singular values of $A$ and $\hat{A}$ are $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n$ and $\hat{\sigma}_1 \ge \hat{\sigma}_2 \ge \ldots \ge \hat{\sigma}_n$, respectively, then
\[
|\sigma_i - \hat{\sigma}_i| \le \|E\|_2, \qquad \forall i \in [n].
\]
The next lemma is the well-known $\sin\Theta$ theorem.

Lemma E.7 (Theorem 4.4 in [45])
Let $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, with singular value decomposition (SVD) $(U_1, U_2, U_3, \Sigma_1, \Sigma_2, V_1, V_2)$ such that
\[
A = \begin{bmatrix} U_1 & U_2 & U_3 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix}.
\]
Similarly, $\hat{A} = A + E$ for some $E \in \mathbb{R}^{m \times n}$ has an SVD $(\hat{U}_1, \hat{U}_2, \hat{U}_3, \hat{\Sigma}_1, \hat{\Sigma}_2, \hat{V}_1, \hat{V}_2)$. Let $\Phi$ be the matrix of canonical angles between $\mathrm{range}(U_1)$ and $\mathrm{range}(\hat{U}_1)$, and let $\Theta$ be the matrix of canonical angles between $\mathrm{range}(V_1)$ and $\mathrm{range}(\hat{V}_1)$. If there exist $\alpha, \delta > 0$ such that $\sigma_{\min}(\hat{\Sigma}_1) \ge \alpha + \delta$ and $\sigma_{\max}(\Sigma_2) \le \alpha$, then
\[
\max\{ \|\sin \Phi\|_2, \|\sin \Theta\|_2 \} \le \|E\|_2 / \delta.
\]

Lemma E.8 (Corollary 22 in [19])
Suppose $A \in \mathbb{R}^{m \times n}$ with rank $k \le n$ and $m \ge n$, and $\hat{A} = A + E$ with $E \in \mathbb{R}^{m \times n}$. Let $\sigma_k(A)$ be the $k^{th}$ singular value of $A$, and assume $\|E\|_2 \le \epsilon \cdot \sigma_k(A)$ for some small $\epsilon < 1/2$. Let $\hat{U}$ be the top-$k$ left singular vectors of $\hat{A}$, $\hat{U}_\perp$ the remaining left singular vectors, and $U$ the top-$k$ left singular vectors of $A$. Then,

• $\sigma_k(\hat{A}) \ge (1 - \epsilon)\, \sigma_k(A)$;
• $\|\hat{U}_\perp^\top U\|_2 \le \|E\|_2 / \sigma_k(\hat{A})$.

Proof. The first inequality follows from Lemma E.6, and the second inequality follows from Lemma E.7 by plugging in $\alpha = 0$ and $\delta = \hat{\sigma}_k$. $\square$

Lemma E.9 (Lemma 9 in [19])
Suppose $A \in \mathbb{R}^{m \times n}$ with rank $k \le n$, and $\hat{A} = A + E$ with $E \in \mathbb{R}^{m \times n}$ and $\|E\|_2 \le \epsilon \cdot \sigma_k(A)$ for some small $\epsilon < 1/2$. Let $\epsilon_1 = \epsilon / (1 - \epsilon)$, and let $\hat{U}$ be the top-$k$ left singular vectors of $\hat{A}$. Then,

• $\sigma_k(\hat{U}^\top \hat{A}) \ge (1 - \epsilon) \cdot \sigma_k(A)$;
• $\sigma_k(\hat{U}^\top A) \ge \sqrt{1 - \epsilon_1^2} \cdot \sigma_k(A)$.

Proof. The first item is immediate, since $\sigma_k(\hat{U}^\top \hat{A}) = \sigma_k(\hat{A})$. Let $U$ be the top-$k$ left singular vectors of $A$. If the top-$k$ SVD of $A$ is $A = U \Sigma V^\top$, then
\[
\sigma_k(\hat{U}^\top U \Sigma V^\top) \ge \sigma_{\min}(\hat{U}^\top U) \cdot \sigma_k(\Sigma) \ge \sqrt{1 - \|\hat{U}_\perp^\top U\|_2^2} \cdot \sigma_k(A) \ge \sqrt{1 - \epsilon_1^2} \cdot \sigma_k(A),
\]
where the first inequality holds since $V$ is orthonormal and $\hat{U}^\top U$ is full-rank, and the final inequality follows from Lemma E.8. $\square$

Lemma E.10 (Theorem 3.8 in [45])
Let $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, and let $\tilde{A} = A + E$ with $E \in \mathbb{R}^{m \times n}$. Then,
\[
\|\tilde{A}^\dagger - A^\dagger\|_2 \le \frac{1 + \sqrt{5}}{2} \cdot \max\{ \|A^\dagger\|_2^2, \|\tilde{A}^\dagger\|_2^2 \} \cdot \|E\|_2.
\]
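As a closing sanity check, both perturbation bounds are easy to probe numerically. The sketch below verifies Weyl's inequality (Lemma E.6) and the pseudo-inverse bound (Lemma E.10) on one random instance; dimensions and noise scale are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 30, 20
A = rng.normal(size=(m, n))           # generic full-rank matrix
E = 1e-3 * rng.normal(size=(m, n))    # small perturbation
A_tilde = A + E

# Lemma E.6 (Weyl): each singular value moves by at most ||E||_2.
s = np.linalg.svd(A, compute_uv=False)
s_tilde = np.linalg.svd(A_tilde, compute_uv=False)
assert np.max(np.abs(s - s_tilde)) <= np.linalg.norm(E, 2) + 1e-12

# Lemma E.10: pseudo-inverse perturbation with constant (1 + sqrt(5)) / 2.
pinv_A, pinv_At = np.linalg.pinv(A), np.linalg.pinv(A_tilde)
lhs = np.linalg.norm(pinv_At - pinv_A, 2)
rhs = ((1 + np.sqrt(5)) / 2
       * max(np.linalg.norm(pinv_A, 2), np.linalg.norm(pinv_At, 2)) ** 2
       * np.linalg.norm(E, 2))
assert lhs <= rhs
print("both perturbation bounds hold on this random instance")
```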