An Efficient, Expressive and Local Minima-free Method for Learning Controlled Dynamical Systems
Ahmed Hefny
Carnegie Mellon University
Carlton Downey
Carnegie Mellon University
Geoffrey Gordon
Carnegie Mellon University
Abstract
We propose a framework for modeling and estimating the state of controlled dynamical systems, where an agent can affect the system through actions and receives partial observations. Based on this framework, we propose Predictive State Representations with Random Fourier Features (RFF-PSRs). A key property of RFF-PSRs is that the state estimate is represented by a conditional distribution of future observations given future actions. RFF-PSRs combine this representation with moment-matching, kernel embedding and local optimization to achieve a method that enjoys several favorable qualities: it can represent controlled environments which can be affected by actions; it has an efficient and theoretically justified learning algorithm; and it uses a non-parametric representation with the expressive power to represent continuous non-linear dynamics. We provide a detailed formulation, a theoretical analysis, and an experimental evaluation that demonstrates the effectiveness of our method.
Controlled dynamical systems, where an agent can influence an environment through actions and receive partial observations, emerge in numerous applications in robotics and automatic control. Modeling and learning these systems from data is of great importance in these fields.

The general problem of learning dynamical systems from data (also known as system identification) has been extensively studied and several methods have been proposed to tackle it. However, an expressive, efficient and consistent method for non-linear controlled systems remains an open problem.

Many system identification methods rely on likelihood-based optimization or sampling using EM, MCMC or gradient descent, which makes them prone to poor local optima. There is another class of methods that alleviates the local optima problem and offers a tractable and statistically consistent approach to system identification. These methods, usually referred to as spectral algorithms, have two key properties in common: predictive representation and method of moments. Instead of the state being a latent variable, they represent the estimated state by the expectation of sufficient statistics (or features) of future observations; and they use
method of moments to learn model parameters from data. (This is an extended version of a paper published at AAAI-18.)

Initially introduced for linear-Gaussian systems (van Overschee and de Moor, 1996), these algorithms have been extended to discrete systems (Hsu, Kakade, and Zhang, 2009; Siddiqi, Boots, and Gordon, 2010; Boots, Siddiqi, and Gordon, 2011) and then to general smooth continuous systems (Boots, Gretton, and Gordon, 2013). More recently, it has been shown that a wide class of spectral learning algorithms for uncontrolled systems are instances of a two-stage regression framework (Hefny, Downey, and Gordon, 2015), where system identification is reduced to solving a set of regression problems. This framework allows for seamless integration of compressing non-linearities, sparsity (Xia, 2016) and online learning (Venkatraman et al., 2016) into system identification, and for establishing theoretical guarantees by leveraging the rich literature on supervised regression.

Unfortunately, the formulation in (Hefny, Downey, and Gordon, 2015) is limited to uncontrolled systems. In contrast, we are interested in controlled systems, where the user can affect the system through actions. This gives rise to a key issue: the policy that determines the actions can change at test time. For this reason, the representation of the predictive state must be independent of the training policy and therefore must encode a conditional distribution of future observations given future actions. To adopt such a representation into a practical method that retains the benefits of the two-stage regression formulation, a number of challenges need to be tackled.

First, we need a suitable state representation and dynamics model that can represent a wide class of controlled dynamical systems while ensuring that the learning problem remains tractable.
Second, we would like to benefit from the two-stage regression view of (Hefny, Downey, and Gordon, 2015) to facilitate model formulation. However, a key assumption in that work is that future observations provide an unbiased estimate of the predictive state, which is not true when the state is a conditional distribution. Third, having a different state representation, and having the action policy play a key role in determining the training data, require a different theoretical analysis than the one in (Hefny, Downey, and Gordon, 2015). Fourth, because they are based on the method of moments, two-stage regression models are statistically inefficient. Having the ability to refine the model using local optimization can lead to significant gains in predictive performance.

(Footnote: There is a class of spectral algorithms that maintains the latent-variable view, exemplified by tensor decomposition methods (Anandkumar et al., 2014).)

Method                            | Actions | Continuous | Non-linear | Partially observable | Scalable | Consistent
Non-linear ARX                    |    ✓    |     ✓      |     ✓      |          ×           |    ✓     |     ✓
N4SID for Kalman Filter           |    ✓    |     ✓      |     ×      |          ✓           |    ✓     |     ✓
Non-convex optimization (e.g. EM) |    ✓    |     ✓      |     ✓      |          ✓           |    ✓     |     ×
Gram-Matrix (e.g. HSE-PSR)        |    ✓    |     ✓      |     ✓      |          ✓           |    ×     |     ✓
Spectral PSR/POMDP                |    ✓    |     ×      |     ✓      |          ✓           |    ✓     |     ✓
Reduction to Supervised Learning  |    ×    |     ✓      |     ✓      |          ✓           |    ✓     |     ✓
RFF-PSR                           |    ✓    |     ✓      |     ✓      |          ✓           |    ✓     |     ✓

Table 1: Comparison between the proposed RFF-PSR and existing system identification methods in terms of the type of systems they can model as well as their computational efficiency and statistical consistency. The table should be interpreted as follows: for each method there exists an instantiation that simultaneously satisfies all properties marked with ✓, but there is no instantiation that is guaranteed to satisfy the properties marked with ×. A method is scalable if computational and memory costs scale at most linearly with the number of training examples. For RFF-based methods, consistency is up to an approximation error that is controllable by the number of features (Rahimi and Recht, 2008).

In this work we address these challenges by combining ideas from two-stage regression, kernel embedding and approximation, and gradient descent with backpropagation through time to develop RFF-PSRs. Overall, RFF-PSRs enjoy a number of advantages that, to our knowledge, are not attained by existing system identification methods. We summarize these advantages in Table 1.

In summary, the contributions of this work are as follows: (1) We develop a two-stage regression framework for controlled dynamical systems that admits tractable learning (Sections 3-4). (2) Through the two-stage regression view, we provide theoretical guarantees on learning the parameters of a controlled system (Section 4.4). (3) We use the extended formulation to construct RFF-PSRs, an efficient approximation of kernel-based predictive state representations (HSE-PSRs) (Section 5). (4) We provide a means to refine the parameters of a controlled dynamical system and apply it to our proposed RFF-PSR model (Section 5.5).
(5) We demonstrate the advantages of our proposed method through synthetic and robot simulation experiments (Section 6).

Related Work

Developing tractable and consistent algorithms for latent state dynamical systems dates back to spectral subspace identification algorithms for Kalman filters (van Overschee and de Moor, 1996). At their heart, these algorithms represent the state as a prediction of future observations conditioned on history and future actions, and use matrix factorization to obtain a basis for the state.

This notion of the state as a prediction is the basis of predictive state representations (PSRs) (Singh, James, and Rudary, 2004), where the state is represented by the success probabilities of a number of tests. A test succeeds if a specified sequence of test observations is observed when administering a specified sequence of test actions.

Noting that the state and parameters of a PSR are defined up to a similarity transformation has led to a family of tractable and consistent spectral algorithms for learning PSRs (Rosencrantz and Gordon, 2004). More recently, Boots, Gretton, and Gordon (2013) proposed a generalization of PSRs in a reproducing kernel Hilbert space (RKHS). This Hilbert space embedding of PSRs (HSE-PSRs) is able to represent systems with continuous observations and actions while still offering a tractable and consistent learning algorithm. HSE-PSRs, however, use a Gram matrix formulation, whose computational and storage requirements can grow rapidly with the size of the training data. A finite-dimensional approximation for non-linear PSRs was proposed by Boots and Gordon (2011). However, it can be thought of as an approximation of HSE-HMMs (Song et al., 2010) with actions, a method that has poor theoretical guarantees (Boots, Gretton, and Gordon, 2013). In addition, Boots and Gordon (2011) did not provide examples of how to apply the proposed model to controlled processes with continuous actions.
In contrast, the model we propose is an approximation of HSE-PSRs, which is a more principled generalization of PSRs as it performs true Bayesian inference in the RKHS. In addition, our proposed learning algorithm incorporates a local optimization procedure that we demonstrate to be very effective.

We use a reduction of system identification to supervised regression. Similar reductions have been proposed in the literature (Langford, Salakhutdinov, and Zhang, 2009; Hefny, Downey, and Gordon, 2015; Boots and Gordon, 2011; Venkatraman et al., 2016; Sun et al., 2016). (Implicit reductions do exist in the system identification literature (van Overschee and de Moor, 1996), but they assume linear systems.) These reductions, however, assume uncontrolled systems, where future observation statistics constitute an unbiased representation of the predictive state. Modeling controlled systems is more subtle since the state of the system is a conditional distribution of observations given actions.

Another related work is the spectral learning algorithm for POMDPs proposed by Azizzadenesheli, Lazaric, and Anandkumar (2016). This method uses tensor factorization to recover POMDP parameters from examples collected by a non-blind memoryless policy. However, this method is limited to discrete POMDPs. Also, PSRs have more representational capacity than POMDPs and can compactly represent more sophisticated systems (Singh, James, and Rudary, 2004). There are other classes of dynamical system learning algorithms that are based on local optimization or sampling approaches (Fox et al., 2009; Frigola et al., 2013), but they do not offer consistency guarantees.

Predictive State Controlled Models

We define a class of models that extends the predictive state models of Hefny, Downey, and Gordon (2015) to controlled systems. We first introduce some notation: We denote by
Pr[x | do(Y = y)] the probability of x given that we intervene by setting Y to y. This is different from Pr[x | Y = y], which denotes conditioning on observing Y = y; in the former case, we ignore all effects on Y by other variables. We denote by V_{A|B;c} the linear operator that satisfies

E[A | B = b, C = c] = V_{A|B;c} b   for all b, c.

In other words, for each c, V_{A|B;c} is a conditional expectation operator from B to A. In the discrete case, V_{A|B;c} is just a conditional probability table.

When dealing with multiple variables, we will use tensor notation; e.g., V_{A,B|C,D} is a 4-mode tensor. We will use

V_{A,B|C,D} ×_C c ×_D d

to denote multiplying V_{A,B|C,D} by c along the mode corresponding to C and by d along the mode corresponding to D. If c is a matrix then the multiplication is performed along the first dimension of c. We will also use ‖·‖_F to denote the Frobenius norm, a ⊗ b to denote the Kronecker product of two vectors, and A ⋆ B to denote the Khatri-Rao product of two matrices (columnwise Kronecker product).
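These tensor operations can be made concrete with a small numpy sketch (array shapes and variable names here are illustrative only, not part of the paper's notation):

```python
import numpy as np

# A 4-mode tensor V with modes (A, B, C, D), standing in for V_{A,B|C,D}
V = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)
c = np.ones(4)  # vector multiplied along the C mode
d = np.ones(5)  # vector multiplied along the D mode

# V x_C c x_D d: contract the C and D modes, leaving a tensor with modes (A, B)
W = np.einsum('abcd,c,d->ab', V, c, d)
assert W.shape == (2, 3)

def khatri_rao(A, B):
    # Columnwise Kronecker product of two matrices with the same column count
    assert A.shape[1] == B.shape[1]
    return np.stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])], axis=1)

A, B = np.random.randn(2, 3), np.random.randn(4, 3)
assert khatri_rao(A, B).shape == (2 * 4, 3)
```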
We will consider k-observable systems, where the posterior belief state given all previous observations and actions is uniquely identified by the conditional distribution Pr[o_{t:t+k−1} | do(a_{t:t+k−1})].

Following Hefny, Downey, and Gordon (2015), we denote by ψ^o_t, ψ^a_t, ξ^o_t and ξ^a_t sufficient features of future observations o_{t:t+k−1}, future actions a_{t:t+k−1}, extended future observations o_{t:t+k} and extended future actions a_{t:t+k} at time t, respectively. We also use h^∞_t ≡ (o_{1:t−1}, a_{1:t−1}) to denote the entire history of observations and actions at time t, and use ψ^h_t ≡ ψ^h(o_{1:t−1}, a_{1:t−1}) to denote finite features of previous observations and actions before time t. We are now ready to define the class of systems we are interested in.
Definition 1.
(Footnote: Often, but not always, ψ^h_t is computed from a fixed-size window of previous observations and actions ending at t − 1.)

A dynamical system is said to conform to a predictive state controlled model (PSCM) if it satisfies the following properties:

• For each time t, there exists a linear operator Q_t = V_{ψ^o_t | do(ψ^a_t); h^∞_t} (referred to as the predictive state) such that E[ψ^o_t | do(a_{t:t+k−1}), h^∞_t] = Q_t ψ^a_t.

• For each time t, there exists a linear operator P_t = V_{ξ^o_t | do(ξ^a_t); h^∞_t} (referred to as the extended state) such that E[ξ^o_t | do(a_{t:t+k}), h^∞_t] = P_t ξ^a_t.

• There exists a linear map W_sys (referred to as the system parameter map) such that, for each time t,

P_t = W_sys(Q_t).   (1)

• There exists a filtering function f_filter such that, for each time t, Q_{t+1} = f_filter(P_t, o_t, a_t). f_filter is typically non-linear but known in advance.

It follows that a PSCM is specified by the tuple (Q_1, W_sys, f_filter), where Q_1 denotes the initial belief state. There are a number of aspects of PSCMs that warrant discussion. First, unlike latent state models, the state Q_t is represented by a conditional distribution of observed quantities. Second, Q_t is a deterministic function of the history h^∞_t. It represents the belief state that one should maintain after observing the history in order to make optimal predictions. Third, a PSCM specifies a recursive filter where, given an action a_t and an observation o_t, the state update equation is given by

Q_{t+1} = f_filter(W_sys(Q_t), o_t, a_t).   (2)

This construction allows us to have a linear map W_sys and still use it to build models with non-linear state updates, including IO-HMMs (Bengio and Frasconi, 1995), Kalman filters with inputs (van Overschee and de Moor, 1996) and HSE-PSRs (Boots, Gretton, and Gordon, 2013). As we see in Section 4, avoiding latent variables and having a linear W_sys enable the formulation of a consistent learning algorithm.
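As a sketch, the recursive state update Q_{t+1} = f_filter(W_sys(Q_t), o_t, a_t) can be rolled out generically. Everything below (the dimensions, the toy f_filter, the random inputs) is an illustration we made up, not part of the model specification:

```python
import numpy as np

def run_filter(Q1, W_sys, f_filter, observations, actions):
    # Roll out the PSCM state update: Q_{t+1} = f_filter(W_sys(Q_t), o_t, a_t)
    Q, states = Q1, []
    for o, a in zip(observations, actions):
        P = W_sys @ Q          # linear system parameter map, Eq. (1)
        Q = f_filter(P, o, a)  # known (typically non-linear) filtering step
        states.append(Q)
    return states

# Toy instantiation with made-up dimensions and a stand-in update rule
d, e, T = 4, 6, 10
rng = np.random.default_rng(0)
W_sys = rng.normal(size=(e, d))
f_filter = lambda P, o, a: np.tanh(P[:d] + o + a)  # hypothetical non-linear update
states = run_filter(np.zeros(d), W_sys, f_filter,
                    observations=rng.normal(size=(T, d)),
                    actions=rng.normal(size=(T, d)))
assert len(states) == T and states[-1].shape == (d,)
```

The point of the sketch is the division of labor: W_sys stays linear (which is what makes consistent moment-based learning possible), while all non-linearity lives in the known function f_filter.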
We assume that the extended features ξ^o_t and ξ^a_t are chosen such that f_filter is known. The parameters to learn are thus W_sys and Q_1. We also assume that a fixed blind (open-loop) policy is used to collect training data, and so we can treat causal conditioning on an action do(a_t) as ordinary conditioning on a_t. It is possible, however, that a different (possibly non-blind) policy is used at test time. (One way to deal with non-blind training policies is to assign importance weights to training examples to correct the bias resulting from non-blindness (Bowling et al., 2006; Boots, Siddiqi, and Gordon, 2011). This, however, requires knowledge of the data collection policy and can result in high variance of the estimated parameters. We defer the case of an unknown non-blind policy to future work.)

To learn model parameters, we will adapt the two-stage regression method of Hefny, Downey, and Gordon (2015). Let ¯Q_t ≡ E[Q_t | ψ^h_t] (resp. ¯P_t ≡ E[P_t | ψ^h_t]) be the expected state (resp. expected extended state) conditioned on finite history features ψ^h_t. For brevity, we might refer to ¯Q_t simply as the (predictive) state when the distinction from Q_t is clear. It follows from linearity of expectation that E[ψ^o_t | ψ^a_t, ψ^h_t] = ¯Q_t ψ^a_t and E[ξ^o_t | ξ^a_t, ψ^h_t] = ¯P_t ξ^a_t, and it follows from the linearity of W_sys that

¯P_t = W_sys(¯Q_t).

So, we train regression models (referred to as S1 regression models) to estimate ¯Q_t and ¯P_t from ψ^h_t. Then, we train another (S2) regression model to estimate W_sys from ¯Q_t and ¯P_t. Because they are conditional distributions, estimating ¯Q_t and ¯P_t from ψ^h_t is more subtle than in uncontrolled systems, since we cannot use observation features as estimates of the state. We describe two methods to construct an S1 regression model to estimate ¯Q_t; the same methods apply to ¯P_t. As we show below, instances of both methods exist in the system identification literature.
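The two stages can be sketched with ordinary ridge regression on vectorized states. This is a simplified finite-dimensional illustration with synthetic stand-in data; the dimensions and the exactly-linear data-generating process are assumptions made for the sketch:

```python
import numpy as np

def ridge(X, Y, lam):
    # Solve min_W ||X W - Y||^2 + lam ||W||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

T, dh, dq, dp = 200, 5, 6, 8
rng = np.random.default_rng(0)
psi_h = rng.normal(size=(T, dh))           # history features, rows = time steps
Q_vec = psi_h @ rng.normal(size=(dh, dq))  # stand-ins for vec(Q_t) targets
P_vec = psi_h @ rng.normal(size=(dh, dp))  # stand-ins for vec(P_t) targets

# S1: regress (vectorized) states and extended states on history features
W_q = ridge(psi_h, Q_vec, lam=1e-6)        # psi_h -> E[Q_t | psi_h]
W_p = ridge(psi_h, P_vec, lam=1e-6)        # psi_h -> E[P_t | psi_h]
Q_bar, P_bar = psi_h @ W_q, psi_h @ W_p    # denoised state estimates

# S2: regress extended states on states to recover the linear map W_sys
W_sys = ridge(Q_bar, P_bar, lam=1e-6)      # P_bar ~= Q_bar @ W_sys
assert W_sys.shape == (dq, dp)
assert np.linalg.norm(P_bar - Q_bar @ W_sys) < 1e-2 * np.linalg.norm(P_bar)
```

In the actual method the S1 targets are not observed directly; the joint and conditional approaches described next are two ways of obtaining them.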
Joint S1 Approach: Let ψ^{oa}_t denote a sufficient statistic of the joint observation/action distribution Pr(ψ^o_t, ψ^a_t | ψ^h_t). This distribution is fixed for each value of ψ^h_t since we assume a fixed model and policy. We use an S1 regression model to learn the map f : ψ^h_t ↦ E[ψ^{oa}_t | ψ^h_t] by solving the optimization problem

argmin_{f∈F} Σ_{t=1}^T ℓ(f(ψ^h_t), ψ^{oa}_t) + R(f)

for some suitable Bregman divergence loss ℓ (e.g., square loss) and regularization R. Once we learn f, we can estimate ¯Q_t by first estimating the joint distribution Pr(ψ^o_t, ψ^a_t | ψ^h_t) and then deriving the conditional operator ¯Q_t. By the continuous mapping theorem, a consistent estimator of f results in a consistent estimator of ¯Q_t. An example of applying this method is using the kernel Bayes rule (Fukumizu, Song, and Gretton, 2013) to estimate states in an HSE-PSR (Boots, Gretton, and Gordon, 2013).

Conditional S1 Approach: In this method, instead of estimating the joint distribution represented by E[ψ^{oa}_t | ψ^h_t], we directly estimate the conditional distribution ¯Q_t. We exploit the fact that each training example ψ^o_t is an unbiased estimate of ¯Q_t ψ^a_t = E[ψ^o_t | ψ^a_t, ψ^h_t]. We can formulate the S1 regression problem as learning a function f : ψ^h_t ↦ ¯Q_t that best matches the training examples, i.e., we solve the problem

argmin_{f∈F} Σ_{t=1}^T ℓ(f(ψ^h_t) ψ^a_t, ψ^o_t) + R(f)   (3)

for some suitable Bregman divergence loss ℓ (e.g., square loss) and regularization R. An example of applying this method is the oblique projection method used in spectral system identification (van Overschee and de Moor, 1996). It is worth emphasizing that both the joint and conditional S1 approaches assume the state to be a conditional distribution; they differ only in the way they estimate that distribution.

Given S1 regression models to estimate ¯Q_t and ¯P_t, learning a controlled dynamical system proceeds as shown in Algorithm 1.

Algorithm 1
Two-stage regression for predictive state controlled models
Input: ψ^h_{n,t}, ψ^o_{n,t}, ψ^a_{n,t}, ξ^o_{n,t}, ξ^a_{n,t} for 1 ≤ n ≤ N, 1 ≤ t ≤ T_n (N is the number of trajectories, T_n is the length of the n-th trajectory)
Output:
Dynamics matrix Ŵ_sys and initial state Q̂_1
  Use S1A regression to estimate ¯Q_{n,t}.
  Use S1B regression to estimate ¯P_{n,t}.
  Let Ŵ_sys be the (regularized) least squares solution to the system of equations ¯P_{n,t} ≈ W_sys(¯Q_{n,t}) for all n, t.
  if N is sufficiently large then
    Let Q̂_1 be the (regularized) least squares solution to the system of equations ψ^o_{n,1} ≈ Q_1 ψ^a_{n,1} for all n.
  else
    Set Q̂_1 to the average of ¯Q_{n,t}.
  end if

It is worth noting that Algorithm 1 is still an instance of the two-stage regression framework described in (Hefny, Downey, and Gordon, 2015) and hence retains its theoretical guarantees: mainly, that we can bound the error in estimating the dynamics matrix W_sys in terms of S1 regression error bounds, assuming that we collect examples from the stationary distribution of a blind policy with sufficient exploration. A blind policy provides sufficient exploration if it has a stationary distribution that (1) visits a sufficient history set, such that the set of equations E[P_t | ψ^h_t] = W_sys(E[Q_t | ψ^h_t]) is sufficient for estimating W_sys, and (2) provides training data to estimate E[Q_t | ψ^h_t] and E[P_t | ψ^h_t] with increasing accuracy.

Theorem 2.
Let π be a blind data collection policy with a stationary distribution. If history, action and observation features have bounded norms, π provides sufficient exploration, and ridge regression is used with regularization parameters λ_1 and λ_2 for S1 and S2 regression respectively, then for all valid states Q the following is satisfied with probability at least 1 − δ:

‖(Ŵ_sys − W_sys)(Q)‖ ≤ O( η_{δ,N} ((1/λ_1) + (1/λ_2)) √(log(1/δ)/N) ) + O( (log(1/δ)/√N) (λ_1 + 1/λ_2) ) + O(√λ_2),

where η_{δ,N} = O_p( 1/√N + λ_1^c + λ_1 ) and c > 0 is a problem-dependent constant.

We provide proofs and a discussion of the sufficient exploration condition in the supplementary material.
Predictive State Controlled Models with Random Fourier Features
Having a general framework for learning controlled dynamical systems, we now focus on the HSE-PSR (Boots, Gretton, and Gordon, 2013) as a non-parametric instance of that framework based on Hilbert space embeddings of distributions. We first describe HSE-PSR learning as a two-stage regression method. Then we demonstrate how to obtain a finite-dimensional approximation using random Fourier features (RFF) (Rahimi and Recht, 2008). Before describing the HSE-PSR, we give some necessary background on Hilbert space embeddings and random Fourier features.
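As a concrete reference point for the RFF approximation used later, the Gaussian RBF kernel k(x, x′) = exp(−‖x − x′‖²/(2σ²)) can be approximated by an explicit random cosine feature map (Rahimi and Recht, 2008). A minimal sketch (feature count and test points are arbitrary choices for illustration):

```python
import numpy as np

def rff_map(X, D=2000, sigma=1.0, seed=0):
    # z(x) = sqrt(2/D) cos(x^T W + b), with W ~ N(0, sigma^-2 I), b ~ U[0, 2pi],
    # so that z(x) . z(y) ~= exp(-||x - y||^2 / (2 sigma^2))
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Z = rff_map(X, D=20000)
approx = Z @ Z.T                                  # feature-map inner products
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact = np.exp(-sq_dists / 2.0)                   # RBF kernel with sigma = 1
assert np.abs(approx - exact).max() < 0.05        # close up to Monte Carlo error
```

Because z(x) is an ordinary finite vector, all RKHS operations below (outer products, regression, inverses) reduce to standard linear algebra on these features.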
We briefly describe the concept of Hilbert space embeddings of distributions and refer the reader to (Smola et al., 2007) for more details. Hilbert space embeddings of distributions provide non-parametric generalizations of the marginal, joint and conditional probability tables of discrete variables to continuous domains: namely, mean maps, covariance operators and conditional operators.

Let k be a kernel associated with a feature map φ(x) such that k(x, x′) = ⟨φ(x), φ(x′)⟩. A special case for discrete variables is the delta kernel, where φ(x) maps x to an indicator vector. For a random variable X, the mean map μ_X is defined as E[φ_X(X)]. Note that μ_X is an element of the reproducing kernel Hilbert space (RKHS) associated with k. The uncentered covariance operator of two variables X and Y is C_{XY} = E[φ_X(X) ⊗ φ_Y(Y)]. For universal kernels k_X and k_Y, C_{XY} is a sufficient representation of the joint distribution Pr(
X, Y). In this paper, we will use C_{XY|z} to denote the covariance of X and Y given that Z = z. Under smoothness assumptions, Song et al. (2009) show that V_{φ_X(X)|φ_Y(Y)} = C_{XY} C_{XX}^{−1}, where the conditional operator V is as defined in Section 3. More generally, V_{φ_X(X)|φ_Y(Y);z} = C_{XY|z} C_{XX|z}^{−1}.

The HSE-PSR is a generalization of the IO-HMM that has proven successful in practice (Boots, Gretton, and Gordon, 2013; Boots and Fox, 2013). It is suitable for high-dimensional and continuous observations and/or actions. The HSE-PSR uses kernel feature maps as sufficient statistics of observations and actions. We define four kernels k_O, k_A, k_o, k_a over future observation features, future action features, individual observations and individual actions, respectively. We can then define ψ^o_t = φ_O(o_{t:t+k−1}) and similarly ψ^a_t = φ_A(a_{t:t+k−1}). We will also use φ^o_t and φ^a_t as shorthands for φ_o(o_t) and φ_a(a_t). The extended future features are then defined as ξ^o_t = ψ^o_{t+1} ⊗ φ^o_t and ξ^a_t = ψ^a_{t+1} ⊗ φ^a_t.

Under the assumption of a blind learning policy, the operators Q_t and P_t are defined to be

Q_t = V_{ψ^o_t | ψ^a_t; h^∞_t}   (4)

P_t = (P^ξ_t, P^o_t) = ( V_{ψ^o_{t+1} ⊗ φ^o_t | ψ^a_{t+1} ⊗ φ^a_t; h^∞_t}, V_{φ^o_t ⊗ φ^o_t | φ^a_t; h^∞_t} )   (5)

Therefore, Q_t specifies the state of the system as a conditional distribution of future observations given future actions, while P_t is a tuple of two operators that allow us to condition on the pair (a_t, o_t) to obtain Q_{t+1}. In more detail, filtering in an HSE-PSR is carried out as follows:

• From o_t and a_t, obtain φ^o_t and φ^a_t.
• Compute C_{o_t o_t | h^∞_t, a_t} = V_{φ^o_t ⊗ φ^o_t | φ^a_t; h^∞_t} ×_{φ^a_t} φ^a_t.

• Multiply by the inverse observation covariance to change "predicting φ^o_t" into "conditioning on φ^o_t": V_{ψ^o_{t+1} | ψ^a_{t+1}, φ^o_t, φ^a_t; h^∞_t} = V_{ψ^o_{t+1} ⊗ φ^o_t | ψ^a_{t+1}, φ^a_t; h^∞_t} ×_{φ^o_t} (C_{o_t o_t | h^∞_t, a_t} + λI)^{−1}.

• Condition on φ^o_t and φ^a_t to obtain the shifted state: Q_{t+1} ≡ V_{ψ^o_{t+1} | ψ^a_{t+1}; φ^o_t, φ^a_t, h^∞_t} = V_{ψ^o_{t+1} | ψ^a_{t+1}, φ^o_t, φ^a_t; h^∞_t} ×_{φ^o_t} φ^o_t ×_{φ^a_t} φ^a_t.

Thus, in the HSE-PSR, the parameter W_sys is composed of two linear maps f_o and f_ξ such that P^ξ_t = f_ξ(Q_t) and P^o_t = f_o(Q_t). In the following section we show how to estimate ¯Q_t and ¯P_t from data; estimation of f_ξ and f_o can then be carried out using kernel regression.

Learning and filtering in an HSE-PSR can be implicitly carried out in the RKHS using a Gram matrix formulation. We will describe learning in terms of the RKHS elements and refer the reader to (Boots, Gretton, and Gordon, 2013) for details of the Gram matrix formulation. As we discuss in Section 5, random Fourier features provide a scalable approximation to operating in the RKHS.

As discussed in Section 4, we can use a joint or conditional approach for S1 regression. We now demonstrate how these two approaches apply to the HSE-PSR.
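With finite-dimensional (e.g. RFF-approximated) features, the conditioning steps above reduce to tensor contractions and one regularized matrix inverse. The sketch below uses hypothetical dense tensor layouts for the two operators; it illustrates the update, not the paper's implementation:

```python
import numpy as np

def hse_psr_filter(W_ext, W_oo, phi_o, phi_a, lam=1e-3):
    """One filtering step with illustrative mode layouts:
       W_ext: V_{psi^o_{t+1} (x) phi^o_t | psi^a_{t+1}, phi^a_t}, shape (ds, do, da_s, da)
       W_oo:  V_{phi^o_t (x) phi^o_t | phi^a_t},                   shape (do, do, da)"""
    # Step 1: predicted observation covariance C_{o_t o_t | h, a_t}
    C_oo = np.einsum('ija,a->ij', W_oo, phi_a)
    # Step 2: regularized inverse turns "predicting phi^o_t" into "conditioning on it"
    C_inv = np.linalg.inv(C_oo + lam * np.eye(C_oo.shape[0]))
    W_cond = np.einsum('iojb,ok->ikjb', W_ext, C_inv)
    # Step 3: contract with the realized phi^o_t and phi^a_t to get the shifted state
    return np.einsum('ikjb,k,b->ij', W_cond, phi_o, phi_a)

rng = np.random.default_rng(0)
ds, do, da_s, da = 4, 3, 5, 2
Q_next = hse_psr_filter(rng.normal(size=(ds, do, da_s, da)),
                        rng.normal(size=(do, do, da)),
                        phi_o=rng.normal(size=do), phi_a=rng.normal(size=da))
assert Q_next.shape == (ds, da_s)  # operator mapping psi^a_{t+1} to psi^o_{t+1}
```

The ridge term lam * I plays the same role as the λI regularizer in the update above: it keeps the observation covariance invertible when estimated from finite data.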
Joint S1 Regression for HSE-PSR
This is the method used in (Boots, Gretton, and Gordon, 2013). In this approach we exploit the fact that

¯Q_t = V_{ψ^o_t | ψ^a_t; ψ^h_t} = C_{ψ^o_t ψ^a_t | ψ^h_t} (C_{ψ^a_t ψ^a_t | ψ^h_t} + λI)^{−1}.

So, we learn two linear maps T_oa and T_a such that T_oa(ψ^h_t) ≈ C_{ψ^o_t ψ^a_t | ψ^h_t} and T_a(ψ^h_t) ≈ C_{ψ^a_t ψ^a_t | ψ^h_t}. The training examples for T_oa and T_a consist of pairs (ψ^h_t, ψ^o_t ⊗ ψ^a_t) and (ψ^h_t, ψ^a_t ⊗ ψ^a_t), respectively. Once we learn these maps, we can estimate C_{ψ^o_t ψ^a_t | ψ^h_t} and C_{ψ^a_t ψ^a_t | ψ^h_t} and consequently estimate ¯Q_t.

Conditional S1 Regression for HSE-PSR
It is also possible to apply the conditional S1 regression formulation of Section 4.2. Specifically, let F be the set of 3-mode tensors with modes corresponding to ψ^o_t, ψ^a_t and ψ^h_t. We estimate a tensor T* by optimizing

T* = argmin_{T∈F} Σ_t ‖(T ×_{ψ^h_t} ψ^h_t ×_{ψ^a_t} ψ^a_t) − ψ^o_t‖² + λ‖T‖²_HS,

where ‖·‖_HS is the Hilbert-Schmidt norm, which translates to the Frobenius norm in finite-dimensional Euclidean spaces. We can then use ¯Q_t = T* ×_{ψ^h_t} ψ^h_t.

For both regression approaches, the same procedure can be used to estimate the extended state ¯P_t by replacing the features ψ^o_t and ψ^a_t with their extended counterparts ξ^o_t and ξ^a_t.

A Gram matrix formulation of the HSE-PSR has computational and memory requirements that grow rapidly with the number of training examples. To alleviate this problem, we resort to kernel approximation; that is, we replace RKHS vectors such as ψ^o_t and ψ^a_t with finite-dimensional vectors that approximately preserve inner products. We use random Fourier features (RFF) (Rahimi and Recht, 2008) as an approximation, but it is possible to use other approximation methods. Unfortunately, the RFF approximation can typically require the number of features D to be prohibitively large. Therefore, we apply principal component analysis (PCA) to the feature maps to reduce their dimension to p ≪ D. We apply PCA again to quantities that require p² space, such as the extended features ξ^o_t, ξ^a_t and states ¯Q_t, reducing them to p dimensions. We map them back to p² dimensions when needed (e.g., for filtering). We also employ randomized SVD (Halko, Martinsson, and Tropp, 2011) for fast computation of PCA, resulting in an algorithm that scales linearly with N and D.

A common practice is to use the output of a moment-based algorithm to initialize a non-convex optimization algorithm such as EM (Belanger and Kakade, 2015) or gradient descent (Jiang, Kulesza, and Singh, 2016).
Since EM is not directly applicable to RFF-PSRs, we propose a gradient descent approach. Observe that filtering in an RFF-PSR defines a recurrent structure given by

q_{t+1} = f_filter(W_sys q_t, o_t, a_t),
E[o_t | q_t] = W_pred(q_t ⊗ φ(a_t)),

where W_pred is a linear operator that predicts the next observation. (The linearity of W_pred is a valid assumption for a universal kernel.) If f_filter is differentiable, we can improve our estimates of W_sys and W_pred using backpropagation through time (BPTT) (Werbos, 1990). In particular, we optimize the error in predicting (features of) a window of observations. In our experiments, we learn to predict o_{t:t+k−1} given a_{t:t+k−1}. We provide pseudo-code in the supplementary material; MATLAB source code is available at https://github.com/ahefnycmu/rffpsr.

Synthetic data: We use the benchmark synthetic non-linear system of Boots, Gretton, and Gordon (2013):

ẋ_1(t) = x_2(t) − 0.1 cos(x_1(t)) (5x_1(t) − 4x_1(t)³ + x_1(t)⁵) − 0.5 cos(x_1(t)) a(t)
ẋ_2(t) = −65x_1(t) + 50x_1(t)³ − 15x_1(t)⁵ − x_2(t) − 100a(t)
o(t) = x_1(t)

The input a is generated as zero-order-hold white noise, uniformly distributed between −0.5 and 0.5. We collected 20 trajectories of 100 observations and actions at 20 Hz and split them into 10 training, 5 validation and 5 test trajectories. The prediction target for this experiment is o(t).

TORCS: In this experiment we used the TORCS car simulation server, which outputs 64x64 images (see Figure 2). The observations are produced by converting the images to greyscale and projecting them to 200 dimensions via PCA. The car is controlled by a built-in controller that controls acceleration, while the external actions control steering. We collected 50 trajectories by applying a sine wave with random starting phase to the steering control and letting the simulator run until the car goes off the track.
We used 40 trajectories for training, 5 for validation and 5 for testing. The prediction target is the projected image.

Figure 2: An example of the windshield view output by TORCS.
Swimmer: We consider the 3-link simulated swimmer robot from the open-source package RLPy (Geramifard et al., 2013). The 2-d action consists of torques applied at the two joints of the links. The observation model returns the angles of the joints and the position of the nose (in body coordinates). The measurements are contaminated with Gaussian noise whose standard deviation is 5% of the true signal standard deviation. To collect the data, we use an open-loop policy that selects actions uniformly at random. We collected 25 trajectories of length 100 each and use 24 for training and 1 for validation. We generate test trajectories using a mixed policy: with probability p_blind, we sample a uniformly random action, while with probability 1 − p_blind, we sample an action from a pre-specified deterministic policy that seeks a goal point. We generate two sets of 10 test trajectories each, one with p_blind = 0.8 and another with p_blind = 0.2. The prediction target is the position of the nose.

Figure 1: Mean square error for 10-step prediction on (from left to right) the synthetic model, the TORCS car simulator, the swimming robot simulation with an 80% blind test policy, and the swimming robot with a 20% blind test policy. Randomly initialized RFF-PSRs obtained significantly worse MSE and are not shown for clarity. A comparison with HSE-PSR on the TORCS and swimmer datasets was not possible as it required prohibitively large memory.

We tested three different initializations of RFF-PSR (with a Gaussian RBF kernel): random initialization, two-stage regression with joint S1, and two-stage regression with conditional S1 (Section 5.3). For each initialization, we tested the model before and after refinement. For refinement we used BPTT with a decreasing step size: the step size is reduced by half if the validation error increases. Early stopping occurs if the step size becomes too small or if the relative change in validation error is insignificant. We also test the following baselines.
HSE-PSR: We implemented the Gram matrix HSE-PSR as described in (Boots, Gretton, and Gordon, 2013).
N4SID: We used MATLAB's implementation of subspace identification of linear dynamical systems.
Non-linear Auto Regression (RFF-ARX): We implemented a version of auto-regression where the predictor variable is the RFF representation of future actions together with a finite history of previous observations and actions, and the target variable is future observations.

Models were trained with a future length of 10 and a history length of 20. For RFF-PSR and RFF-ARX we used 10000 random features and applied PCA to project the features onto 20 dimensions. Kernel bandwidths were set to the median of the distance between training points (median trick). For evaluation, we perform filtering on the data and estimate the prediction target of the experiment at test time t given the history o_{1:t−H}, a_{1:t}, where H is the prediction horizon. We report the mean square error across all times t for each value of H ∈ {1, 2, ..., 10}.

The results are shown in Figure 1. There are a number of important observations:
• In general, joint S1 training closely matches or outperforms conditional S1 training, with and without refinement.
• Local refinement significantly improves predictive performance for all initialization methods.
• Local refinement, on its own, is not sufficient to produce a good model. The two-stage regression provides a good initialization of the refinement procedure.
• Even without refinement, RFF-PSR outperforms HSE-PSR. This could be attributed to the dimensionality reduction step, which adds an appropriate inductive bias.
• Compared to other methods, RFF-PSR has better performance with non-blind test policies.
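The feature pipeline described above (random Fourier features with median-trick bandwidths, followed by PCA projection) can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's exact implementation, and all function names are ours:

```python
import numpy as np

def median_bandwidth(X, n_pairs=1000, seed=0):
    """Median trick: set the kernel bandwidth to the median distance
    between randomly chosen pairs of training points."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, X.shape[0], size=n_pairs)
    j = rng.integers(0, X.shape[0], size=n_pairs)
    d = np.linalg.norm(X[i] - X[j], axis=1)
    return np.median(d[d > 0])

def rff_features(X, n_features, bandwidth, seed=0):
    """Random Fourier features approximating a Gaussian RBF kernel
    exp(-||x - y||^2 / (2 bandwidth^2)) (Rahimi and Recht, 2008)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / bandwidth, size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def project_pca(Phi, p):
    """Project feature vectors (rows of Phi) onto their top-p principal subspace."""
    _, _, Vt = np.linalg.svd(Phi, full_matrices=False)
    return Phi @ Vt[:p].T
```

With enough random features, the inner product of two feature vectors closely approximates the RBF kernel value, which is what justifies the PCA step operating on the explicit feature matrix instead of a Gram matrix.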
We proposed a framework to learn controlled dynamical systems using two-stage regression. We then applied this framework to develop a scalable method for controlled non-linear system identification: using an RFF approximation of HSE-PSR together with a refinement procedure to enhance the model after a two-stage regression initialization. We have demonstrated promising results for the proposed method in terms of predictive performance. As future work, we would like to use this framework for further tasks such as imitation learning and reinforcement learning.
Acknowledgements
The authors gratefully acknowledge support from ONR (grant number N000141512365), DARPA (grant number FA87501720152) and NSF EAGER (grant number IIS1450543). The authors would like to thank Wen Sun and Yuxiang Wang for the helpful discussions.
References
Anandkumar, A.; Ge, R.; Hsu, D.; Kakade, S. M.; and Telgarsky, M. 2014. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res.
Azizzadenesheli, K.; Lazaric, A.; and Anandkumar, A. 2016. Reinforcement learning of POMDPs using spectral methods. CoRR abs/1602.07764.
Belanger, D., and Kakade, S. M. 2015. A linear dynamical system model for text. In ICML.
Bengio, Y., and Frasconi, P. 1995. An input output HMM architecture. In NIPS.
Boots, B., and Fox, D. 2013. Learning dynamic policies from demonstration. NIPS Workshop on Advances in Machine Learning for Sensorimotor Control.
Boots, B., and Gordon, G. 2011. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In AAAI.
Boots, B.; Gretton, A.; and Gordon, G. J. 2013. Hilbert space embeddings of predictive state representations. In UAI.
Boots, B.; Siddiqi, S.; and Gordon, G. 2011. Closing the learning planning loop with predictive state representations. I. J. Robotic Research 30:954–956.
Bowling, M.; McCracken, P.; James, M.; Neufeld, J.; and Wilkinson, D. 2006. Learning predictive state representations using non-blind policies. In ICML.
Fox, E.; Sudderth, E. B.; Jordan, M. I.; and Willsky, A. S. 2009. Nonparametric Bayesian learning of switching linear dynamical systems. In NIPS.
Frigola, R.; Lindsten, F.; Schön, T. B.; and Rasmussen, C. E. 2013. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 26. Curran Associates, Inc. 3156–3164.
Fukumizu, K.; Song, L.; and Gretton, A. 2013. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research.
Hefny, A.; Downey, C.; and Gordon, G. J. 2015. Supervised learning for dynamical system learning. In NIPS.
Hsu, D.; Kakade, S. M.; and Zhang, T. 2009. A spectral algorithm for learning hidden Markov models. In COLT.
Jiang, N.; Kulesza, A.; and Singh, S. P. 2016. Improving predictive state representations via gradient descent. In AAAI.
Langford, J.; Salakhutdinov, R.; and Zhang, T. 2009. Learning nonlinear dynamic models. In ICML.
Rahimi, A., and Recht, B. 2008. Random features for large-scale kernel machines. In NIPS.
Rosencrantz, M., and Gordon, G. 2004. Learning low dimensional predictive representations. In ICML, 695–702.
Siddiqi, S.; Boots, B.; and Gordon, G. J. 2010. Reduced-rank hidden Markov models. In AISTATS.
Singh, S.; James, M. R.; and Rudary, M. R. 2004. Predictive state representations: A new theory for modeling dynamical systems. In UAI.
Smola, A.; Gretton, A.; Song, L.; and Schölkopf, B. 2007. A Hilbert space embedding for distributions. In Algorithmic Learning Theory: 18th International Conference.
Song, L.; Huang, J.; Smola, A. J.; and Fukumizu, K. 2009. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In ICML.
Song, L.; Boots, B.; Siddiqi, S. M.; Gordon, G. J.; and Smola, A. J. 2010. Hilbert space embeddings of hidden Markov models. In ICML.
Sun, W.; Venkatraman, A.; Boots, B.; and Bagnell, J. A. 2016. Learning to filter with predictive state inference machines. In ICML.
Tropp, J. A. 2015. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn.
Van Overschee, P., and De Moor, B. 1996. Subspace identification for linear systems: theory, implementation, applications. Kluwer Academic Publishers.
Venkatraman, A.; Sun, W.; Hebert, M.; Bagnell, J. A.; and Boots, B. 2016. Online instrumental variable regression with applications to online linear system identification. In AAAI.
Werbos, P. J. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE.
Xia, G. G. 2016. Expressive Collaborative Music Performance via Machine Learning. Ph.D. Dissertation, Carnegie Mellon University.
RFF-PSR Learning Algorithm
For ease of exposition, we assume that RFF features are computed prior to PCA. In our implementation, we compute the RFF features on the fly while performing PCA to reduce the required memory footprint. Here we use A ⋆ B to denote the Khatri-Rao product of two matrices (columnwise Kronecker product).
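As a quick reference, here is a minimal NumPy sketch of the Khatri-Rao product and of the closed-form ridge regression used in the S2 step of Algorithm 2 below; the function names are ours, not part of the algorithm:

```python
import numpy as np

def khatri_rao(A, B):
    """Khatri-Rao product A * B: column t of the result is the
    Kronecker product of column t of A with column t of B."""
    assert A.shape[1] == B.shape[1], "A and B need the same number of columns"
    # Outer product of matching columns, then flatten each outer
    # product into one long column.
    return np.einsum('it,jt->ijt', A, B).reshape(-1, A.shape[1])

def s2_regression(Q, P, lam):
    """Ridge regression W = argmin_W ||P - W Q||_F^2 + lam ||W||_F^2,
    solved in closed form: W = P Q^T (Q Q^T + lam I)^{-1}."""
    return P @ Q.T @ np.linalg.inv(Q @ Q.T + lam * np.eye(Q.shape[0]))
```

With a small regularizer and enough columns, `s2_regression` recovers the linear map relating the stacked state matrices, which is exactly the role of the S2 weights Ŵ_ξ and Ŵ_o.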
Algorithm 2 Learning Predictive State Representation with Random Fourier Features (LEARN-RFF-PSR)
Input: Matrices Φ^h, Φ^o, Φ^a of history, observation and action features (each column corresponds to a time step); matrices Ψ^o, Ψ^a, Ψ^{o′}, Ψ^{a′} of test observations, test actions, shifted test observations and shifted test actions.
Output: S2 regression weights Ŵ_ξ and Ŵ_o.
Subroutines: SVD(X, p) returns the tuple (U, U^⊤ X), where U consists of the top p left singular vectors of X.

{Feature projection using PCA}
U_h, Φ^h ← SVD(Φ^h, p); U_o, Φ^o ← SVD(Φ^o, p); U_a, Φ^a ← SVD(Φ^a, p)
U_{oψ}, Ψ^o ← SVD(Ψ^o, p); U_{aψ}, Ψ^a ← SVD(Ψ^a, p)
U_{oξ}, Ξ^o ← SVD((U_{oψ}^⊤ Ψ^{o′}) ⋆ Φ^o, p); U_{aξ}, Ξ^a ← SVD(Φ^a ⋆ (U_{aψ}^⊤ Ψ^{a′}), p)
U_{oo}, Φ^{oo} ← SVD(Φ^o ⋆ Φ^o, p)

{S1 Regression and State Projection}
Estimate Q̄_t, P̄^ξ_t, P̄^o_t for each time t using one of the S1 methods in Section 5.3.
Reshape Q̄_t, P̄_t as column vectors for each t and stack the resulting vectors into matrices Q, P^ξ and P^o.
U_q, Q ← SVD(Q, p)

{S2 Regression}
Ŵ_ξ ← argmin_{W ∈ R^{p×p}} ‖P^ξ − W Q‖²_F + λ‖W‖²_F
Ŵ_o ← argmin_{W ∈ R^{p×p}} ‖P^o − W Q‖²_F + λ‖W‖²_F

B Examples of Predictive State Controlled Models
Here we discuss the IO-HMM and the Kalman filter with inputs, showing that they are instances of PSCMs. We do this for each model by defining the predictive state, showing that it satisfies the condition P_t = W Q_t, and describing an S1 regression method.

B.1 IO-HMM
Let T be the transition tensor such that T ×_s s_t ×_a a_t = E[s_{t+1} | a_t, s_t] and O be the observation tensor such that O ×_s s_t ×_a a_t = E[o_t | a_t, s_t]. Define O^k to be the extended observation tensor where

O^k ×_s s_t ×_a a_{t:t+k−1} = E[o_{t:t+k−1} | a_{t:t+k−1}, s_t].

As a shortcut, we will denote by T_{ij} the product T ×_s e_i ×_a e_j. For k = 1, we have O^1 = O. For k > 1 we can think of a_{t:t+k−1} as the outer product a_t ⊗ a_{t+1:t+k−1}, so we can define O^k such that

O^k ×_s e_i ×_a (e_j ⊗ e_l) = vec(O_{ij} ⊗ (O^{k−1} ×_a e_l ×_s T_{ij}))    (B.1)

In words, start from state e_i and apply an action e_j followed by a sequence of k − 1 actions denoted by the indicator e_l. The expected indicator of the next k observations is the outer product of the expected observation o_t (given by O_{ij}) with the expected indicator of the observations o_{t+1:t+k−1} as predicted by O^{k−1}. Note that the two expectations being multiplied are conditionally independent given the state e_i and the action sequence.

Given the tensor O^k, the predictive states Q_t and P_t are defined to be Q_t = O^k ×_s s_t and P_t = O^{k+1} ×_s s_t. Now, to show that (1) holds, let Õ^k be a reshaping of O^k into a matrix such that vec(Q_t) = Õ^k s_t. It follows that P_t = O^{k+1} ×_s s_t = O^{k+1} ×_s ((Õ^k)^+ vec(Q_t)), which is linear in Q_t.

S1 Regression
Let s_t = s(h^∞_t) be the belief state at time t. Note that s_t is a deterministic function of the entire history. Under a fixed-policy assumption, an indicator vector of the joint observation and action assignment is an unbiased estimate of the joint probability table P[ψ^a_t, ξ^a_t | h^∞_t]. An S1 regression model can be used to learn the mapping ψ^h_t ↦ P[ψ^a_t, ξ^a_t | ψ^h_t]. It is then easy to estimate the conditional probability table Q̄_t from the joint probability table P[ψ^a_t, ξ^a_t | ψ^h_t].

We can also use the conditional S1 approach, exploiting the fact that ψ^o_t is an unbiased estimate of a single column of Q_t, namely the column corresponding to ψ^a_t. We can use (3) to learn a function f : h_t ↦ Q̄_t that best matches the training examples.

B.2 Kalman Filter with inputs
The Kalman filter with inputs is given by

x_t = A x_{t−1} + B u_t + ε_t,    o_t = C x_t + ν_t.

Given a belief state s_t ≡ E[x_{t−1} | h^∞_t], we can write the predictive state as

E[o_{t:t+k−1} | s_t, a_{t:t+k−1}] = Γ_k s_t + U_k a_{t:t+k−1},

where Γ_k stacks the blocks CA, CA², ..., CA^k vertically, and U_k is the block lower-triangular Toeplitz matrix

U_k = [ CB
        CAB        CB
        ⋮                     ⋱
        CA^{k−1}B  ⋯  CAB  CB ].
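These block matrices can be checked numerically. The following sketch (dimensions and values are ours, not from the paper) builds Γ_k and U_k and verifies the predictive state formula against a noiseless simulation of the system:

```python
import numpy as np

def gamma_u(A, B, C, k):
    """Build Gamma_k (stacked C A^i for i = 1..k) and the block
    lower-triangular U_k whose (i, j) block is C A^{i-j} B."""
    du, do = B.shape[1], C.shape[0]
    Ap = [np.linalg.matrix_power(A, i) for i in range(k + 1)]
    Gamma = np.vstack([C @ Ap[i] for i in range(1, k + 1)])
    U = np.zeros((do * k, du * k))
    for i in range(k):
        for j in range(i + 1):
            U[do*i:do*(i+1), du*j:du*(j+1)] = C @ Ap[i - j] @ B
    return Gamma, U

# A hypothetical 3-state, 2-input, 2-output system.
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), rng.normal(size=(2, 3))
k = 4
Gamma, U = gamma_u(A, B, C, k)

s = rng.normal(size=3)        # belief state E[x_{t-1} | history]
a = rng.normal(size=2 * k)    # stacked actions a_{t:t+k-1}

# Roll the noiseless system forward; the expected observations
# should match Gamma_k s + U_k a exactly.
x, obs = s, []
for i in range(k):
    x = A @ x + B @ a[2*i:2*(i+1)]
    obs.append(C @ x)
print(np.allclose(np.concatenate(obs), Gamma @ s + U @ a))  # True
```

The extra power of A in Γ_k (the first block is CA rather than C) reflects the convention that the state is the belief about x_{t−1}, not x_t.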
The extended predictive state has a similar form, with Γ_k and U_k replaced by Γ_{k+1} and U_{k+1}. Since U_k is fixed, keeping track of the state amounts to keeping track of Q_t ≡ Γ_k s_t. It follows that

P_t = Γ_{k+1} s_t = Γ_{k+1} Γ_k^+ Q_t = W Q_t.

If h_t is a linear projection of h^∞_t (e.g. a stacking of a finite window of observations and actions), it can also be shown (van Overschee and de Moor, 1996) that E[Q_t | h_t] = Γ̃_k h_t for some matrix Γ̃_k.

S1 Regression
Let F be the set of functions of the form f(ψ^h) ψ^a_t = Γ ψ^h_t + B ψ^a_t. The oblique projection method (van Overschee and de Moor, 1996) uses linear regression to estimate Γ and B (essentially solving (3)). With B fixed, the conditional operator is determined by Γ h_t through an affine transformation. Therefore we can use Q̄_t = Γ h_t.

C Theoretical Analysis
Let H = {h_i}_{i=1}^N be a set of histories generated from an i.i.d. distribution. We use Q(ψ^h) to denote E[Q | ψ^h]. The main theorem in Hefny, Downey, and Gordon (2015) bounds the parameter estimation error in terms of the S1 regression error. This implies that we need to analyze the properties of S1 regression to prove Theorem 2. We will look at multiple scenarios, where in each scenario we develop sufficient exploration conditions and provide an S1 error bound under these conditions. The i.i.d. property is achieved if we can restart the system or if the data collection policy induces an ergodic process with a stationary distribution. In the latter case, we assume the examples are sufficiently spaced in time to allow the process to mix. In practice, however, we use all examples, as this only makes the error smaller.

Definition C.1 (Sufficient history set). Consider a PSCM that satisfies P_t = W_sys(Q_t). A set of histories H = {h_i}_{i=1}^M is called a sufficient history set if it is sufficient to estimate W_sys using E[Q_t | ψ^h_t = h] and E[P_t | ψ^h_t = h] for each h ∈ H.

Note that W_sys may not be unique; we care about estimating W_sys Q for any valid Q. From the above definition, it follows that a data collection policy provides sufficient exploration if it allows estimating E[Q | ψ^h_t = h] and E[P | ψ^h_t = h] for a sufficient history set with increasing accuracy.

C.1 Case 1: Discrete Observations and Actions
Consider a discrete system where H, A, A⁺, O, O⁺ are the sets of all possible histories, future action sequences, extended future action sequences, future observation sequences and extended future observation sequences, respectively.

Theorem C.2. Assume a discrete system where the data collection policy induces an i.i.d. distribution over histories. If the policy generates each possible extended future action sequence starting from each possible history M times, then it generates an S2 training dataset of size N = M|H||A⁺| with S1 error bound

η_{δ,N} = √( (|H||A⁺||O⁺| / M) log(|H||A⁺||O⁺| / δ) ).
Proof. The proof follows immediately from Hoeffding's inequality, which bounds the error in estimating the probability of an event by averaging. Note that we need to estimate |H||A||O| probabilities to estimate Q and |H||A⁺||O⁺| probabilities to estimate P. Therefore we divide δ by |H||A⁺||O⁺| to correct for multiple probability estimates.

Remark C.3.
Assume the system to be 1-observable, where the history and future are of length 1. Then a consistent estimate of Q and P can be obtained from a consistent estimate of the joint probability table P(o_{t−1:t+1}, a_{t−1:t+1}).

C.2 Case 2: Continuous System
Definition C.4 (Range and span of a policy). Let π be a data collection policy with a stationary distribution. For a random vector X_t = f(h^∞_t, o_{t:∞}, a_{t:∞}), the range of π on X is the support of the stationary distribution of X_t induced by the policy π (i.e. the set of all possible values of X_t that can be generated by the stationary distribution). The span of π on X is the subspace spanned by the range of π on X. When referring to the policy range or span, we may omit the variable name when it is clear from the context.
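In practice, the dimension of a policy's span can be probed from samples by checking the numerical rank of a matrix of sampled feature vectors. A small sketch with hypothetical scalar features (the example and names are ours, not from the paper):

```python
import numpy as np

def empirical_span_rank(samples, tol=1e-8):
    """Estimate the dimension of the policy span from sampled
    feature vectors (one sample per row): the numerical rank of
    the sample matrix, via its singular values."""
    s = np.linalg.svd(np.asarray(samples), compute_uv=False)
    return int((s > tol * s[0]).sum())

# Hypothetical features psi = (h, a, h*a) of a scalar history
# statistic h and a scalar action a drawn at random: a blind
# (history-independent) random policy spans all three directions.
rng = np.random.default_rng(0)
H, A = rng.normal(size=100), rng.normal(size=100)
feats = np.stack([H, A, H * A], axis=1)
print(empirical_span_rank(feats))  # 3
```

A deterministic policy a = g(h), by contrast, would collapse some of these directions, which is exactly the failure mode the action span conditions below rule out.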
Condition C.5 (Action span for joint S1). Let π be a data collection policy and let H be the range of π on histories. The action span condition for joint S1 is the requirement that:
1. H is a sufficient history set.
2. For any ψ^h ∈ H, the conditional covariance Σ_{ψ^a | ψ^h} is full rank.

Condition C.6 (Action span for conditional S1). Let π be a data collection policy and let H be the range of π on histories. The action span condition for conditional S1 is the requirement that:
1. H is a sufficient history set.
2. For any ψ^h ∈ H and any future action feature vector ψ^a, the quantity (ψ^h ⊗ ψ^a) is in the policy span.

Remark C.7.
Condition C.5 implies Condition C.6.
Assumption C.8 (Bounded features) . We assume that (cid:107) ψ h (cid:107) < c h for all h ∈ H . Also, we assume that (cid:107) ψ o (cid:107) ≤ c O and (cid:107) ψ a (cid:107) ≤ c A for any valid future observation sequence and action sequence respectively. Theorem C.9.
Let π be a data collection policy and let H be the range of π on histories. If Assumption C.8 and Condition C.6 are satisfied and conditional S1 regression is used with a linear model as the correct model, then π provides sufficient exploration and, for all h ∈ H and any δ ∈ (0, 1) such that N > 2 c_h² c_A² log(2 d_h d_A / δ) / λ_min(Σ_{ψ^h ⊗ ψ^a}), the following holds with probability at least 1 − 3δ:

‖Q̂(ψ^h) − Q(ψ^h)‖ ≤ c_h [ √( λ_max(Σ_{ψ^o}) / λ_min(Σ_{ψ^h⊗ψ^a}) ) · ( √(λ_min(Σ_{ψ^h⊗ψ^a})) Δ₁ + λ ) / ( λ_min(Σ_{ψ^h⊗ψ^a})(1 − Δ₃) + λ ) + Δ₂ / ( λ_min(Σ_{ψ^h⊗ψ^a})(1 − Δ₃) + λ ) ],

where

Δ₁ = 2 c_h c_A √( log(2 d_h d_A/δ)/N ) + (2 log(2 d_h d_A/δ)/(3N)) ( c_h² c_A² / √(λ_min(Σ_{ψ^h⊗ψ^a})) + c_h c_A ),
Δ₂ = 2 c_O c_h c_A √( log((d_O + d_h d_A)/δ)/N ) + 4 c_O c_h c_A log((d_O + d_h d_A)/δ)/(3N),
Δ₃ = √( 2 c_h² c_A² log(2 d_h d_A/δ) / (λ_min(Σ_{ψ^h⊗ψ^a}) N) ).

In the following section we provide a proof sketch for the asymptotic form in Theorem 2 for joint S1.
Remark C.10 (Conditioning). It is known that linear regression converges faster if the problem is well-conditioned. In two-stage regression we need good conditioning of both stages; that is:
• The set of training histories results in a problem P̄_t = W Q̄_t that is well conditioned (S2 conditioning).
• The S1 regression problem is well conditioned.
The second requirement ensures that we converge quickly to good estimates of Q̄_t and P̄_t. Designing exploration policies that result in well-conditioned two-stage regression problems is an interesting direction for future work.

D Proofs of theorems
In this section we provide a proof of Theorem C.9. The asymptotic statement in Theorem 2 follows directly from the main theorem in (Hefny, Downey, and Gordon, 2015). We also provide a proof sketch for the joint S1 case. The proof strategy is as follows: first, we use matrix concentration bounds to analyze the effect of using estimated covariance matrices. Then, we analyze the effect of the error in the covariance matrices on the regression weights. By combining the results of both analyses, we prove the desired theorems.
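The concentration behavior this strategy relies on is easy to observe numerically: the operator-norm error of an empirical covariance of bounded i.i.d. samples shrinks roughly like 1/√N. A minimal Monte-Carlo sketch (ours, not part of the proofs):

```python
import numpy as np

def cov_error(d, N, trials, seed=0):
    """Average operator-norm error ||Sigma_hat - Sigma|| for the
    empirical (uncentered) covariance of N bounded i.i.d. samples."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.uniform(-1.0, 1.0, size=(N, d))  # bounded samples, ||x|| <= sqrt(d)
        Sigma = np.eye(d) / 3.0                  # true E[x x^T] for U(-1, 1) coords
        Sigma_hat = X.T @ X / N
        errs.append(np.linalg.norm(Sigma_hat - Sigma, 2))
    return np.mean(errs)

# Multiplying N by 16 should cut the error by roughly 4 = sqrt(16),
# consistent with the 1/sqrt(N) leading terms in the bounds below.
e1, e2 = cov_error(5, 200, 50), cov_error(5, 3200, 50)
print(round(e1 / e2, 1))
```

This is only an illustration of the rate; the lemmas that follow make the constants and failure probabilities precise.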
Lemma D.1 (Matrix Chernoff Inequality (Tropp, 2015)). Consider a finite sequence {S_k} of independent, random, Hermitian matrices with common dimension d. Assume that 0 ≤ λ_min(S_k) and λ_max(S_k) ≤ L for each index k. Introduce the random matrix Z = Σ_k S_k and define μ_min ≡ λ_min(E[Z]). Then, for any ε ∈ [0, 1),

Pr(λ_min(Z) ≤ (1 − ε) μ_min) ≤ d [ e^{−ε} / (1 − ε)^{1−ε} ]^{μ_min/L} ≤ d e^{−ε² μ_min / (2L)}.

Corollary D.2 (Minimum eigenvalue of empirical covariance). Let X be a random variable of dimensionality d such that ‖X‖ ≤ c. Let {x_k}_{k=1}^N be N i.i.d. samples of the distribution of X. Define Σ_X ≡ E[XX^⊤] and Σ̂_X = (1/N) Σ_{k=1}^N x_k x_k^⊤. For any δ ∈ (0, 1) such that N > 2c² log(2d/δ) / λ_min(Σ_X), the following holds with probability at least 1 − δ:

λ_min(Σ̂_X) ≥ ( 1 − √( 2c² log(2d/δ) / (λ_min(Σ_X) N) ) ) λ_min(Σ_X).

Proof. Define S_k = (1/N) x_k x_k^⊤. It follows that λ_max(S_k) ≤ L = c²/N and μ_min = λ_min(Σ_X). Define δ ≡ d e^{−ε² N λ_min(Σ_X)/(2c²)}, which implies that

ε = √( 2c² log(d/δ) / (λ_min(Σ_X) N) ) ≤ √( 2c² log(2d/δ) / (λ_min(Σ_X) N) ).

It follows from Lemma D.1 that Pr(λ_min(Σ̂_X) ≤ (1 − ε) μ_min) ≤ δ.

Lemma D.3 (Matrix Bernstein Inequality (Tropp, 2015)). Consider a finite sequence {S_k} of independent random matrices with common dimensions d₁ × d₂. Assume that E[S_k] = 0 and ‖S_k‖ ≤ L for each index k. Introduce the random matrix Z = Σ_k S_k and let v(Z) be the matrix variance statistic of the sum:

v(Z) = max{ ‖E(ZZ^⊤)‖, ‖E(Z^⊤Z)‖ }.

Then

Pr(‖Z‖ ≥ t) ≤ (d₁ + d₂) exp( −(t²/2) / (v(Z) + Lt/3) ).

Corollary D.4 (Error in empirical cross-covariance). With probability at least 1 − δ,

‖Σ̂_{YX} − Σ_{YX}‖ ≤ √( 2 log((d_X + d_Y)/δ) v / N ) + 2 log((d_X + d_Y)/δ) L / (3N),

where

L = c_y c_x + ‖Σ_{YX}‖ ≤ 2 c_y c_x,
v = max( c_x² ‖Σ_Y‖, c_y² ‖Σ_X‖ ) + ‖Σ_{YX}‖² ≤ 2 c_y² c_x².

Proof.
Define S_k = y_k x_k^⊤ − Σ_{YX}; it follows that E[S_k] = 0 and

‖S_k‖ = ‖y_k x_k^⊤ − Σ_{YX}‖ ≤ ‖y_k‖‖x_k‖ + ‖Σ_{YX}‖ ≤ 2 c_y c_x,

‖E[ZZ^⊤]‖ = ‖ Σ_{i,j} ( E[y_i x_i^⊤ x_j y_j^⊤] − Σ_{YX} Σ_{XY} ) ‖
 = ‖ Σ_i ( E[‖x_i‖² y_i y_i^⊤] − Σ_{YX} Σ_{XY} ) + Σ_{i, j≠i} ( E[y_i x_i^⊤] E[x_j y_j^⊤] − Σ_{YX} Σ_{XY} ) ‖
 ≤ N ( c_x² ‖Σ_Y‖ + ‖Σ_{YX}‖² ),

‖E[Z^⊤Z]‖ ≤ N ( c_y² ‖Σ_X‖ + ‖Σ_{YX}‖² ).

Applying Lemma D.3 we get

δ = Pr(‖Z‖ ≥ N t) ≤ (d_X + d_Y) exp( −(N t²/2) / (v + Lt/3) ),

and hence

t² − (2 log((d_X + d_Y)/δ) L / (3N)) t − 2 log((d_X + d_Y)/δ) v / N ≤ 0.

This quadratic inequality implies

t ≤ log((d_X + d_Y)/δ) L / (3N) + √( log²((d_X + d_Y)/δ) L² / (9N²) + 2 log((d_X + d_Y)/δ) v / N ).

Using the fact that √(a + b) ≤ √|a| + √|b| we get

t ≤ 2 log((d_X + d_Y)/δ) L / (3N) + √( 2 log((d_X + d_Y)/δ) v / N ).

Corollary D.5 (Normalized error in empirical covariance). With probability at least 1 − δ,

‖Σ_X^{−1/2}(Σ̂_X − Σ_X)‖ ≤ 2c √( log(2d/δ)/N ) + 2 log(2d/δ) L / (3N),

where L = c² / √(λ_min(Σ_X)) + c.

Proof.
Define S_k = Σ_X^{−1/2} x_k x_k^⊤ − Σ_X^{1/2}; it follows that E[S_k] = 0 and

‖S_k‖ ≤ ‖Σ_X^{−1/2}‖ ‖x_k‖² + ‖Σ_X^{1/2}‖ ≤ c² / √(λ_min(Σ_X)) + c,

‖E[Z^⊤Z]‖ = ‖E[ZZ^⊤]‖ = ‖ Σ_{i,j} ( Σ_X^{−1/2} E[x_i x_i^⊤ x_j x_j^⊤] Σ_X^{−1/2} − Σ_X ) ‖
 = ‖ Σ_i ( E[‖x_i‖² Σ_X^{−1/2} x_i x_i^⊤ Σ_X^{−1/2}] − Σ_X ) + Σ_{i, j≠i} ( Σ_X^{−1/2} E[x_i x_i^⊤] E[x_j x_j^⊤] Σ_X^{−1/2} − Σ_X ) ‖
 ≤ N ( c² + ‖Σ_X‖ ) ≤ 2N c².

Applying Lemma D.3 we get

δ = Pr(‖Z‖ ≥ N t) ≤ 2d exp( −(N t²/2) / (2c² + Lt/3) ),

and, similar to the proof of Corollary D.4, we can show that

t ≤ 2 log(2d/δ) L / (3N) + 2c √( log(2d/δ)/N ).

Lemma D.6.
Let Σ̂_{YX} = Σ_{YX} + Δ_{YX} and Σ̂_X = Σ_X + Δ_X, where E[Δ_{YX}] and E[Δ_X] are not necessarily zero and Σ̂_X is symmetric positive semidefinite. Define W = Σ_{YX} Σ_X^{−1} and Ŵ = Σ̂_{YX} (Σ̂_X + λI)^{−1}. It follows that

‖Ŵ − W‖ ≤ √( λ_max(Σ_Y) / λ_min(Σ_X) ) · ( √(λ_min(Σ_X)) ‖Σ_X^{−1/2} Δ_X‖ + λ ) / ( λ_min(Σ̂_X) + λ ) + ‖Δ_{YX}‖ / ( λ_min(Σ̂_X) + λ ).

Proof.

Ŵ − W = Σ_{YX} ( (Σ_X + Δ_X + λI)^{−1} − Σ_X^{−1} ) + Δ_{YX} (Σ_X + Δ_X + λI)^{−1} = T₁ + T₂.

It follows that ‖T₂‖ ≤ ‖Δ_{YX}‖ / (λ_min(Σ̂_X) + λ). As for T₁, using the matrix inverse lemma B^{−1} − A^{−1} = B^{−1}(A − B)A^{−1} and the fact that Σ_{YX} = Σ_Y^{1/2} V Σ_X^{1/2}, where V is a correlation matrix satisfying ‖V‖ ≤ 1, we get

T₁ = −Σ_{YX} Σ_X^{−1} (Δ_X + λI)(Σ_X + Δ_X + λI)^{−1} = −Σ_Y^{1/2} V Σ_X^{−1/2} (Δ_X + λI)(Σ_X + Δ_X + λI)^{−1},

and hence

‖T₁‖ ≤ √(λ_max(Σ_Y)) ( ‖Σ_X^{−1/2} Δ_X‖ + λ ‖Σ_X^{−1/2}‖ ) / ( λ_min(Σ̂_X) + λ )
 = √( λ_max(Σ_Y) / λ_min(Σ_X) ) ( √(λ_min(Σ_X)) ‖Σ_X^{−1/2} Δ_X‖ + λ ) / ( λ_min(Σ̂_X) + λ ).

Corollary D.7.
Let {x_k}_{k=1}^N and {y_k}_{k=1}^N be i.i.d. samples from two random variables X and Y with dimensions d_X and d_Y and (uncentered) covariances Σ_X and Σ_Y, respectively. Assume ‖X‖ ≤ c_x and ‖Y‖ ≤ c_y. Let Σ̂_{YX} = (1/N) Σ_{k=1}^N y_k x_k^⊤ and Σ̂_X = (1/N) Σ_{k=1}^N x_k x_k^⊤. Define W = Σ_{YX} Σ_X^{−1} and Ŵ = Σ̂_{YX} (Σ̂_X + λI)^{−1}. For any δ ∈ (0, 1) such that N > 2 c_x² log(2 d_X/δ) / λ_min(Σ_X), the following holds with probability at least 1 − 3δ:

‖Ŵ − W‖ ≤ √( λ_max(Σ_Y) / λ_min(Σ_X) ) ( √(λ_min(Σ_X)) Δ₁ + λ ) / ( λ_min(Σ_X)(1 − Δ₃) + λ ) + Δ₂ / ( λ_min(Σ_X)(1 − Δ₃) + λ ),

where

Δ₁ = 2 c_x √( log(2 d_X/δ)/N ) + (2 log(2 d_X/δ)/(3N)) ( c_x² / √(λ_min(Σ_X)) + c_x ),
Δ₂ = 2 c_y c_x √( log((d_Y + d_X)/δ)/N ) + 4 c_y c_x log((d_Y + d_X)/δ)/(3N),
Δ₃ = √( 2 c_x² log(2 d_X/δ) / (λ_min(Σ_X) N) ).

Proof.
This corollary follows simply from applying Corollaries D.2, D.4 and D.5 to Lemma D.6. The 1 − 3δ bound follows from the union bound, since we have three probabilistic bounds, each of which holds with probability 1 − δ.

Lemma D.8.
Let Σ̂_{YX} = Σ_{YX} + Δ_{YX} and Σ̂_X = Σ_X + Δ_X, where E[Δ_{YX}] and E[Δ_X] are not necessarily zero and Σ̂_X is symmetric but not necessarily positive semidefinite. Define W = Σ_{YX} Σ_X^{−1} and Ŵ = Σ̂_{YX} Σ̂_X (Σ̂_X² + λI)^{−1}. It follows that

‖Ŵ − W‖ ≤ √( λ_max(Σ_Y) / λ_min(Σ_X) ) ( ‖Δ_X‖² + 2 λ_max(Σ_X) ‖Δ_X‖ + λ ) / ( λ_min(Σ̂_X)² + λ ) + ( ‖Σ_{YX}‖‖Δ_X‖ + ‖Δ_{YX}‖‖Σ_X‖ + ‖Δ_{YX}‖‖Δ_X‖ ) / ( λ_min(Σ̂_X)² + λ ).

Proof.

Ŵ − W = (Σ_{YX} + Δ_{YX})(Σ_X + Δ_X)((Σ_X + Δ_X)² + λI)^{−1} − Σ_{YX} Σ_X Σ_X^{−2}
 = Σ_{YX} Σ_X ( ((Σ_X + Δ_X)² + λI)^{−1} − Σ_X^{−2} ) + (Σ_{YX} Δ_X + Δ_{YX} Σ_X + Δ_{YX} Δ_X)((Σ_X + Δ_X)² + λI)^{−1}
 = T₁ + T₂.

Using the matrix inverse lemma B^{−1} − A^{−1} = B^{−1}(A − B)A^{−1} and the fact that Σ_{YX} = Σ_Y^{1/2} V Σ_X^{1/2}, where V is a correlation matrix satisfying ‖V‖ ≤ 1, we get

T₁ = −Σ_Y^{1/2} V Σ_X^{−1/2} (Δ_X² + Σ_X Δ_X + Δ_X Σ_X + λI)((Σ_X + Δ_X)² + λI)^{−1},

‖T₁‖ ≤ √( λ_max(Σ_Y) / λ_min(Σ_X) ) ( ‖Δ_X‖² + 2 λ_max(Σ_X) ‖Δ_X‖ + λ ) / ( λ_min(Σ̂_X)² + λ ),

‖T₂‖ ≤ ( ‖Σ_{YX}‖‖Δ_X‖ + ‖Δ_{YX}‖‖Σ_X‖ + ‖Δ_{YX}‖‖Δ_X‖ ) / ( λ_min(Σ̂_X)² + λ ).

D.1 Proof of Theorem C.9
Proof.
In the linear case, we estimate a tensor T with modes corresponding to ψ^h, ψ^a and ψ^o by solving the minimization problem in Section 5.3. Equivalently, we estimate a matrix T_r of size d_O × d_h d_A where an input ψ^h ⊗ ψ^a is mapped to an output E[ψ^o | h, ψ^a]. Note that

Q(ψ^h) ψ^a = T ×_h ψ^h ×_A ψ^a = T_r (ψ^h ⊗ ψ^a).

For any history h ∈ H and future action feature vector ψ^a we have

‖Q̂(ψ^h) − Q(ψ^h)‖ = max_{ψ^a} ‖(Q̂(ψ^h) − Q(ψ^h)) ψ^a‖ / ‖ψ^a‖ = max_{ψ^a} ‖(T̂_r − T_r)(ψ^h ⊗ ψ^a)‖ / ‖ψ^a‖ ≤ ‖T̂_r − T_r‖ ‖ψ^h‖.

Note that Condition C.6 implies that ψ^h ⊗ ψ^a will eventually be in the span of the training examples. This rules out the case where the inequality is satisfied only because (ψ^h ⊗ ψ^a) is incorrectly in the null space of T̂_r and T_r. The theorem is proven by applying Corollary D.7 to bound ‖T̂_r − T_r‖.

D.2 Sketch Proof for Joint S1
Let T_A be a tensor such that Σ_{ψ^a | ψ^h} = T_A ×_h ψ^h. In order to prove Theorem 2 for joint S1, note that

‖Σ̂_{ψ^a | ψ^h} − Σ_{ψ^a | ψ^h}‖ ≤ ‖T̂_A − T_A‖ ‖ψ^h‖,
‖Σ̂_{ψ^o ψ^a | ψ^h} − Σ_{ψ^o ψ^a | ψ^h}‖ ≤ ‖T̂_{OA} − T_{OA}‖ ‖ψ^h‖.

From Lemma D.6, we obtain high-probability bounds on ‖T̂_A − T_A‖ and ‖T̂_{OA} − T_{OA}‖. We then apply these bounds to Lemma D.8 to obtain an error bound on Q(ψ^h).
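The estimator analyzed in Corollary D.7 can be sanity-checked numerically: on a hypothetical noiseless linear model (names, dimensions and values are ours, not from the paper), the regularized estimate Ŵ = Σ̂_YX(Σ̂_X + λI)⁻¹ converges to W as N grows and λ shrinks. A minimal sketch:

```python
import numpy as np

def ridge_w_error(N, lam, seed=0):
    """Operator-norm error ||W_hat - W|| for the estimator
    W_hat = Sigma_hat_YX (Sigma_hat_X + lam I)^{-1} on a
    hypothetical noiseless linear model y = W0 x, for which
    W = Sigma_YX Sigma_X^{-1} = W0."""
    rng = np.random.default_rng(seed)
    W0 = rng.normal(size=(2, 3))
    X = rng.uniform(-1.0, 1.0, size=(N, 3))   # bounded i.i.d. inputs
    Y = X @ W0.T                              # noiseless targets
    S_yx = Y.T @ X / N
    S_x = X.T @ X / N
    W_hat = S_yx @ np.linalg.inv(S_x + lam * np.eye(3))
    return np.linalg.norm(W_hat - W0, 2)

# Growing N while shrinking lam drives the error toward zero,
# matching the shape of the bound in Corollary D.7.
e_small, e_large = ridge_w_error(100, 1e-2), ridge_w_error(10000, 1e-4)
print(e_large < e_small)  # True
```

In this noiseless setting the residual error is dominated by the regularization bias λ term of the bound, which is why λ must decay with N for consistency.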