Barrier-Certified Adaptive Reinforcement Learning with Applications to Brushbot Navigation
Motoya Ohnishi, Li Wang, Gennaro Notomista, Magnus Egerstedt
Abstract — This paper presents a safe learning framework that employs an adaptive model learning algorithm together with barrier certificates for systems with possibly nonstationary agent dynamics. To extract the dynamic structure of the model, we use a sparse optimization technique. We use the learned model in combination with control barrier certificates which constrain policies (feedback controllers) in order to maintain safety, which refers to avoiding particular undesirable regions of the state space. Under certain conditions, recovery of safety in the sense of Lyapunov stability after violations of safety due to the nonstationarity is guaranteed. In addition, we reformulate an action-value function approximation to make any kernel-based nonlinear function estimation method applicable to our adaptive learning framework. Lastly, solutions to the barrier-certified policy optimization are guaranteed to be globally optimal, ensuring the greedy policy improvement under mild conditions. The resulting framework is validated via simulations of a quadrotor, which has previously been used under stationarity assumptions in the safe learning literature, and is then tested on a real robot, the brushbot, whose dynamics is unknown, highly complex and nonstationary.
Index Terms — Safe learning, control barrier certificate, sparse optimization, kernel adaptive filter, brushbot
I. INTRODUCTION
By exploring and interacting with an environment, reinforcement learning can determine the optimal policy with respect to the long-term rewards given to an agent [1], [2]. Whereas the idea of determining the optimal policy in terms of a cost over some time horizon is standard in the controls literature [3], reinforcement learning is aimed at learning the long-term rewards by exploring the states and actions. As such, the agent dynamics is no longer explicitly taken into account, but rather is subsumed by the data.

If no information about the agent dynamics is available, however, an agent might end up in certain regions of the state space that must be avoided while exploring. Avoiding such
This work was sponsored in part by the U.S. National Science Foundation under Grant No. 1531195. The work of M. Ohnishi was supported in part by the Scandinavia-Japan Sasakawa Foundation under Grant GA17-JPN-0002 and the Travel Grant of the School of Electrical Engineering, Royal Institute of Technology. M. Ohnishi is with the School of Electrical Engineering, Royal Institute of Technology, 11428 Stockholm, Sweden, the Georgia Robotics and Intelligent Systems Laboratory, Georgia Institute of Technology, Atlanta, GA 30332 USA, and also with the RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan (e-mail: [email protected]). L. Wang and M. Egerstedt are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]; [email protected]). G. Notomista is with the School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30313 USA (e-mail: [email protected]).

regions of the state space is referred to as safety. Safety includes collision avoidance, boundary-transgression avoidance, connectivity maintenance in teams of mobile robots, and other mandatory constraints, and this tension between exploration and safety becomes particularly pronounced in robotics, where safety is crucial.

In this paper, we address this safety issue by employing model learning in combination with barrier certificates. In particular, we focus on learning for systems with discrete-time nonstationary (or time-varying) agent dynamics. Nonstationarity comes, for example, from failures of actuators, battery degradations, or sudden environmental disturbances. The result is a method that adapts to nonstationary agent dynamics and, under certain conditions, ensures recovery of safety in the sense of Lyapunov stability even after violations of safety due to the nonstationarity occur.
We also propose discrete-time barrier certificates that guarantee global optimality of solutions to the barrier-certified policy optimization, and we use the learned model for barrier certificates.

Over the last decade, the safety issue has been addressed under the name of safe learning, and plenty of solutions have been proposed [4]–[13]. To ensure safety while exploring, an initial knowledge of the agent dynamics, an initial safe policy, or a teacher advising the agent is necessary [4], [14]. To obtain a model of the agent dynamics, human operators may maneuver the agent and record its trajectories [12], [15]. It is also possible that an agent continues exploring without entering the states with low long-term risks (e.g., [11], [16]). Due to the inherent uncertainty, the worst-case scenario (e.g., possible lowest rewards) is typically taken into account [13], [17], and the set of safe policies can be expanded by exploring the states [4], [5]. To address the issue of this uncertainty for nonlinear-model estimation tasks, Gaussian process regression [18] is a strong tool, and many safe learning studies have taken advantage of its properties (e.g., [4], [6], [7], [10], [13]).

Nevertheless, when the agent dynamics is nonstationary and the long-term rewards vary accordingly, the assumptions often made in the safe learning literature no longer hold, and violations of safety become inevitable. In such cases, we wish to ensure that the agent is at least successfully brought back to the set of safe states and that the negative effect of an unexpected violation of safety is mitigated. Moreover, the long-term rewards must also be learned in an adaptive manner. These are the core motivations of this paper.

To constrain the states within a desired safe region while exploring, we employ control barrier functions (cf. [19]–[24]). When the exact model of the agent dynamics is available, control barrier certificates ensure that an agent remains in the
set of safe states for all time by constraining the instantaneous control input at each time. Also, an agent outside of the set of safe states is forced back to safety (Proposition III.1). A useful property of control barrier certificates is that they modify policies only when violations of safety are truly imminent [22].

If no nominal model (or simulation) of the possibly nonstationary agent dynamics is available, on the other hand, violations of safety are inevitable. Therefore, we wish to adaptively learn the agent dynamics, and eventually bring the agent back to safety. To this end, we propose a learning framework for possibly nonstationary agent dynamics, which recovers safety in the sense of Lyapunov stability under some conditions. This learning framework ties adaptive algorithms with control barrier certificates by focusing on set-theoretical aspects and monotonicity (or non-expansivity). By augmenting the state with the estimate of the agent dynamics, Lyapunov stability with respect to the set of augmented safe states is guaranteed (Theorem IV.1). Also, to efficiently enforce control barrier certificates, we employ adaptive sparse optimization techniques to extract dynamic structures (e.g., control-affine dynamics) by identifying truly active structural components (see Sections III-C and IV-B).

In addition, the long-term rewards need to be adaptively estimated when the agent dynamics is nonstationary. To this end, we reformulate the action-value function approximation problem so that, even if the action-value function varies, it can be adaptively estimated in the same functional space by employing an adaptive supervised learning algorithm in that space. Consequently, resetting the learning whenever the agent dynamics varies becomes unnecessary. Moreover, we present a barrier-certified policy update strategy by employing control barrier functions to effectively constrain policies.
Because the global optimality of solutions to the constrained policy optimization is necessary to ensure the greedy improvement of a policy, we propose a discrete-time control barrier certificate that ensures the global optimality under some mild conditions (see Section IV-C and Theorem IV.4 therein). This is an improvement of the previously proposed discrete-time control barrier certificate [24].

To validate and clarify our learning framework, we first conduct experiments of quadrotor simulations. Then, we conduct real-robotics experiments on a brushbot, whose dynamics is unknown, highly complex and nonstationary, to test the efficacy of our framework in the real world (see Section V). This is challenging due to many uncertainties and the lack of simulators often used in applications of reinforcement learning in robotics (see [25] for example).

II. PRELIMINARIES
In this section, we present some of the related work and the system model considered in this paper. Throughout, R, Z_{≥0} and Z_{>0} are the sets of real numbers, nonnegative integers and positive integers, respectively. Let ‖·‖_H be the norm induced by the inner product ⟨·,·⟩_H in an inner-product space H. In particular, define ⟨x, y⟩_{R^L} := x^T y for L-dimensional real vectors x, y ∈ R^L, and ‖x‖_{R^L} := √⟨x, x⟩_{R^L}, where (·)^T stands for transposition. We define [x; y] as [x^T, y^T]^T, and let x_n ∈ X ⊂ R^{n_x} and u_n ∈ U ⊂ R^{n_u}, for n_x, n_u ∈ Z_{>0}, denote the state and the control input at time instant n ∈ Z_{≥0}, respectively.

A. Related Work
The primary focus of this paper is the safety issue while exploring. Typically, some initial knowledge, such as an initial safe policy or a model of the agent dynamics, is required to address the safety issue while exploring; therefore, model learning is often employed together. We introduce some related work on model learning and kernel-based action-value function approximation.
1) Model Learning for Safe Maneuver:
The recent work in [13], [7], and [4] assumes an initial conservative set of safe policies, which is gradually expanded as more data become available. These approaches are designed for stationary agent dynamics, and Gaussian processes (GPs) are employed to obtain the confidence interval of the model. To ensure safety, control barrier functions and control Lyapunov functions are employed in [13] and [4], respectively. On the other hand, the work in [10] uses a trajectory optimization based on receding horizon control and model learning by GPs, which is computationally expensive when the model is highly nonlinear.

In this paper, we aim at tying adaptive model learning algorithms and control barrier certificates together by focusing on set-theoretical aspects and monotonicity (or non-expansivity). Hence, we employ an adaptive filter with the monotone approximation property, which shares similar ideas with stable online learning for adaptive control based on Lyapunov stability (cf. [26]–[29], for example).
2) Learning Dynamic Structures in Reproducing Kernel Hilbert Spaces:
An approach that learns dynamics in reproducing kernel Hilbert spaces (RKHSs) so that the resulting model satisfies the Euler-Lagrange equation was proposed in [30], while our paper proposes a learning framework that adaptively captures control-affine structure in RKHSs to efficiently enforce control barrier certificates.
3) Reinforcement Learning in Reproducing Kernel Hilbert Spaces:
We introduce, briefly, ideas of existing action-value function approximation techniques. Given a policy φ : X → U, the action-value function Q^φ associated with the policy φ is defined as

Q^φ(x, φ(x)) = V^φ(x) := Σ_{n=0}^{∞} γ^n R(x_n, φ(x_n)),   (II.1)

where γ ∈ (0, 1) is the discount factor, (x_n)_{n∈Z_{≥0}} is a trajectory of the agent starting from x_0 = x, and R(x, u) ∈ R is the immediate reward. It is known that the action-value function follows the Bellman equation (cf. [2, Equation (66)]):

Q^φ(x_n, u_n) = γ Q^φ(x_{n+1}, φ(x_{n+1})) + R(x_n, u_n).   (II.2)

For robotics applications, where the states and controls are continuous, some form of function approximator is required to approximate the action-value function (and/or policies). Nonparametric learning such as a kernel method is often desirable when a priori knowledge about a suitable set of basis functions for learning is unavailable. Kernel-based reinforcement learning has been studied in the literature, e.g., [31]–[44]. Due to the property of reproducing kernels, the framework of linear learning algorithms is directly applied to nonlinear function estimation tasks in a possibly infinite-dimensional functional space, namely a reproducing kernel Hilbert space.

Definition II.1 ([45, page 343]). Given a nonempty set Z and a Hilbert space H of functions defined on Z, the function κ(z, w) of z is called a reproducing kernel of H if
1) for every w ∈ Z, κ(z, w) as a function of z ∈ Z belongs to H, and
2) it has the reproducing property, i.e., the following holds for every w ∈ Z and every ϕ ∈ H: ϕ(w) = ⟨ϕ, κ(·, w)⟩_H.

If H has a reproducing kernel, H is called a reproducing kernel Hilbert space (RKHS).

One example of a kernel is the Gaussian kernel

κ(x, y) := (2πσ²)^{−L/2} exp(−‖x − y‖²_{R^L} / (2σ²)),  x, y ∈ R^L, σ > 0,

with which any continuous function on a compact subset of R^L can be approximated with arbitrary accuracy. Another widely used kernel is the polynomial kernel κ(x, y) := (x^T y + c)^d, c ≥ 0, d ∈ Z_{>0}.

In contrast to these existing approaches, we explicitly define a so-called reproducing kernel Hilbert space (RKHS) so that adaptive supervised learning of action-value functions can be conducted in the same space without having to reset the learning. Consequently, we can also conduct an action-value function approximation in the same RKHS even after the agent dynamics changes or policies are updated (see the remark below Theorem IV.3 and Section V-A.2). The GP SARSA can also be reproduced by employing a GP in the explicitly defined RKHS, as discussed in Appendix I. Specifically, in this paper, a possibly nonstationary agent dynamics is considered, as detailed below.

B. System Model
In this paper, we consider the following discrete-time deterministic nonlinear model of the nonstationary agent dynamics:

x_{n+1} − x_n = p(x_n, u_n) + f(x_n) + g(x_n) u_n,   (II.3)

where p : X × U → R^{n_x}, f : X → R^{n_x}, and g : X → R^{n_x × n_u} are continuous. Hereafter, we regard X × U as the same as Z ⊂ R^{n_x + n_u} under the one-to-one correspondence between z := [x; u] ∈ Z and (x, u) ∈ X × U if there is no confusion. We consider an agent with dynamics given in (II.3), and the goal is to find an optimal policy which drives the agent to a desirable state while remaining in the set of safe states (or the safe set) C ⊂ X defined as

C := {x ∈ X | B(x) ≥ 0},   (II.4)

where B : X → R. An optimal policy is a policy φ that attains an optimal value Q^φ(x, φ(x)) for every state x ∈ X. Note that the value associated with a policy varies when the dynamics is nonstationary, and that a quadruple (x_n, u_n, x_{n+1}, R(x_n, u_n)) is available at each time instant n.

With these preliminaries in place, we can present our safe learning framework.

III. SAFE LEARNING FRAMEWORK
Under possibly nonstationary dynamics, our safe learning framework adaptively estimates the long-term rewards to update policies with safety constraints. Also, recovery of safety in the sense of Lyapunov stability during exploration is guaranteed under certain conditions. Define ψ : Z → R^{n_x} as ψ(x, u) := p(x, u) + f(x) + g(x) u, and suppose that the estimate of ψ at time instant n, denoted by ψ̂_n, is approximated by the model parameter h_n ∈ R^r, r ∈ Z_{>0}, in the linear form as

ψ̂_n(z_n) := h_n^T k(z_n).

Here, k(z_n) ∈ R^r is the output of basis functions at z_n. If the model parameter is accurately estimated (or the exact agent dynamics is available), the safe set C becomes forward invariant and asymptotically stable by enforcing control barrier certificates at each time instant n.

A. Discrete-time Control Barrier Functions
The idea of control barrier functions is similar to Lyapunov functions; they require no explicit computations of the forward reachable set while ensuring certain properties by constraining the instantaneous control input. In particular, control barrier functions guarantee that an agent starting from the safe set remains safe (i.e., forward invariance), and that an agent outside of the safe set is forced back to safety (i.e., Lyapunov stability with respect to the safe set). To make barrier certificates compatible with model learning and reinforcement learning, we employ discrete-time control barrier certificates.
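As a minimal illustration of how such a certificate constrains the instantaneous control input, consider a discrete-time safety filter that keeps only the candidate inputs whose predicted next state satisfies B(x_{n+1}) − B(x_n) ≥ −η B(x_n), and among those picks the one closest to a nominal input. The single-integrator agent, disk-shaped safe set, and candidate grid below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Hypothetical single-integrator agent: x_{n+1} = x_n + u_n (f = 0, g = I).
# Safe set C = {x : B(x) >= 0} with B(x) = r^2 - ||x - c||^2 (stay inside a disk).
c, r, eta = np.array([0.0, 0.0]), 1.0, 0.5

def B(x):
    return r**2 - np.sum((x - c)**2)

def barrier_filter(x, u_nom, candidates):
    """Return the candidate closest to u_nom satisfying the exponential
    barrier condition B(x_{n+1}) - B(x_n) >= -eta * B(x_n)."""
    feasible = [u for u in candidates if B(x + u) - B(x) >= -eta * B(x)]
    if not feasible:
        return None
    return min(feasible, key=lambda u: np.linalg.norm(u - u_nom))

# Agent near the boundary of the safe set, nominal input pushing it out.
x = np.array([0.9, 0.0])
u_nom = np.array([0.3, 0.0])
grid = np.linspace(-0.3, 0.3, 13)
candidates = [np.array([a, b]) for a in grid for b in grid]
u = barrier_filter(x, u_nom, candidates)
assert B(x + u) - B(x) >= -eta * B(x)  # next state respects the certificate
```

Note how the filter leaves the nominal input untouched whenever it is already feasible, mirroring the "little modification" property discussed below.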
Definition III.1 ([24, Definition 4]). A map B : X → R is a discrete-time exponential control barrier function if there exists a control input u_n ∈ U such that

B(x_{n+1}) − B(x_n) ≥ −η B(x_n),  ∀n ∈ Z_{≥0},  0 < η ≤ 1.   (III.1)

Note that we intentionally removed the condition B(x_0) ≥ 0.

Proposition III.1. The set C defined in (II.4) for a valid discrete-time exponential control barrier function B : X → R is forward invariant when B(x_0) ≥ 0, and is asymptotically stable when B(x_0) < 0.

Proof. See Appendix A.

Proposition III.1 implies that an agent remains in the safe set defined in (II.4) for all time if B(x_0) ≥ 0. Two properties of control barrier certificates are particularly relevant here:

a) Little modification of policies: control barrier functions modify policies only when violations of safety are imminent. Consequently, an inaccurate or rough estimation of the model causes less negative effect on (model-free) reinforcement learning.
b) Asymptotic stability of the safe set: an agent outside of the safe set is brought back to the safe set. In addition to Proposition III.1, this robustness property is analyzed in [19]. This property, together with the adaptive model learning algorithm presented in the next subsection, is particularly important when safety is violated due to the nonstationarity of the agent dynamics.

Under possibly nonstationary agent dynamics, we can no longer guarantee that the current estimate of the model parameter is sufficiently accurate to enforce the inequality (III.1) or forward invariance of C. Nevertheless, we are still able to show that safety is recovered in the sense of Lyapunov stability under certain conditions by adaptively learning the model.

Fig. III.1. An illustration of the monotone approximation property. The estimate h_n monotonically approaches the set Ω of optimal vectors h* by sequentially minimizing the distance between h_n and Ω_n. Here, Ω_n := argmin_{h∈R^r} Θ_n(h), where Θ_n(h) is the cost function at time instant n.

B. Adaptive Model Learning Algorithms with Monotone Approximation Property
At each time instant, an input-output pair (z_n, δ_n), where z_n := [x_n; u_n] and δ_n := x_{n+1} − x_n, is available for model learning. Under possibly nonstationary agent dynamics, it is vital for the model parameter estimation to be stable even after the agent dynamics changes. In this paper, we employ an adaptive algorithm with the monotone approximation property. Note that this approach shares a similar idea with stable online learning based on Lyapunov-like conditions.

Suppose that the estimate of the model parameter at time instant n is given by h_n ∈ R^r, r ∈ Z_{>0}. Given a cost function Θ_n(h) at time instant n, we update the parameter h_n so as to satisfy the strictly monotone approximation property

‖h_{n+1} − h*_n‖_{R^r} < ‖h_n − h*_n‖_{R^r},  ∀h*_n ∈ Ω_n := argmin_{h∈R^r} Θ_n(h),

if h_n ∉ Ω_n ≠ ∅, where ∅ is the empty set. Then, if Ω := ∩_{n∈Z_{≥0}} Ω_n is nonempty and if h_n ∉ Ω_n, it follows that ‖h_{n+1} − h*‖_{R^r} < ‖h_n − h*‖_{R^r}, ∀h* ∈ Ω, n ∈ Z_{≥0}. This is illustrated in Figure III.1. Under mild conditions, we can also design algorithms (e.g., the adaptive projected subgradient method [47]) that satisfy ‖h_n − h*_n‖_{R^r} − ‖h_{n+1} − h*_n‖_{R^r} ≥ ρ dist(h_n, Ω_n) for all h*_n ∈ Ω_n and for some ρ > 0, where dist(h_n, Ω_n) := inf{‖h_n − h*_n‖_{R^r} | h*_n ∈ Ω_n}. (See [47] for more detailed arguments, for example.)

At each time instant, we use the current estimate of the model to constrain control inputs so that they satisfy

B(x̂_{n+1}) − B(x_n) ≥ −η B(x_n) + ρ,  ∀n ∈ Z_{≥0},  0 < η ≤ 1,

for some margin ρ > 0, where x̂_{n+1} is the predicted output of the current estimate h_n at x_n and u_n. Then, under certain conditions, we can guarantee Lyapunov stability of the system for the augmented state [x_n; h_n] ∈ R^{n_x + r} with respect to the forward invariant set C × Ω ⊂ R^{n_x + r}, as illustrated in Figure III.2. In Sections IV-A and V, we will theoretically and experimentally show that the system for the augmented state is stable on the set of augmented safe states.

Fig. III.2. An illustration of Lyapunov stability of the system for the augmented state [x; h] ∈ R^{n_x + r} with respect to the forward invariant set C × Ω ⊂ R^{n_x + r}.

To efficiently constrain policies by using control barrier functions, the learned model is preferred to be affine in control (see Section IV-C and Theorem IV.4 therein). As such, outputs of the learned model should have preferred dynamic structures while capturing the true agent dynamics.

C. Learning Dynamic Structure via Sparse Optimizations
Control-affine dynamics is given by (II.3) with p = 0, where 0 denotes the null function. Therefore, the simplest way is to learn the agent dynamics with the constraint p = 0. In practice, however, it is unrealistic to assume that p = 0. If p is negligibly small, we can consider p to be a system noise added to a control-affine dynamics. To encourage the term p to be as small as possible while capturing the true input-output relations of the agent dynamics, we use adaptive sparse optimization techniques. In particular, motivated by the monotone approximation property due to the convexity of the formulations, we use (sparse) kernel adaptive filters for the systems with nonlinear dynamics. Specifically, we take the following steps to extract the control-affine structure:

1) Assume for simplicity that n_x = 1. We suppose that p ∈ H_p, f ∈ H_f, and g^{(1)}, g^{(2)}, ..., g^{(n_u)} ∈ H_g, where H_p, H_f and H_g are RKHSs, and g(x) = [g^{(1)}(x), g^{(2)}(x), ..., g^{(n_u)}(x)].
2) Let H_u be the RKHS associated with the reproducing kernel κ_u(u, v) := u^T v, u, v ∈ U, and H_c the set of constant functions on U. Estimate the function ψ in the RKHS H_ψ := H_p + H_f ⊗ H_c + H_g ⊗ H_u (see Section IV-B and Theorem IV.2 therein).
3) Define the cost Θ_n so as to promote sparsity of the model parameter. If the underlying true dynamics is affine in control, a control-affine model (i.e., one in which the estimate of p, denoted by p̂, becomes null) is expected to be extracted.

The resulting control-affine part of the estimated dynamics is used in combination with control barrier certificates in order to efficiently constrain policies while and after learning an optimal policy (see Theorem IV.1 and Theorem IV.4 for more details).

D. Barrier-certified Policy Update
Lastly, we present the barrier-certified policy update strategy. To update policies, we use the long-term rewards, which need to be adaptively estimated for systems with possibly nonstationary agent dynamics.
1) Adaptive Action-value Function Approximation in RKHSs:
Again, motivated by the monotone approximation property (see Corollary IV.1) and the flexibility of nonparametric learning, which requires no fixed set of basis functions, we employ kernel-based adaptive algorithms to estimate the action-value function. One of the issues arising when applying a kernel-based method to an action-value function approximation is that the output Q^φ(x_n, u_n) of the action-value function Q^φ ∈ H_Q associated with a policy φ, where H_Q is assumed to be an RKHS, is unobservable. Nevertheless, we know that the action-value function follows the Bellman equation (II.2). Hence, by defining a function ψ_Q : Z̄ → R, where R^{2(n_x + n_u)} ⊃ Z̄ := Z × Z, as

ψ_Q([z; w]) := Q^φ(x, u) − γ Q^φ(y, v),   (III.2)
x, y ∈ X, u, v ∈ U, z = [x; u], w = [y; v],

the Bellman equation in (II.2) is solved via iterative nonlinear function estimation with the input-output pairs {([x_n; u_n; x_{n+1}; φ(x_{n+1})], R(x_n, u_n))}_{n∈Z_{≥0}}. In fact, the function ψ_Q is an element of a properly constructed RKHS H_{ψ_Q} (see Section IV-C and Theorem IV.3 therein). Because the domain of H_{ψ_Q} is defined as Z × Z instead of Z, the RKHS H_{ψ_Q} does not depend on the agent dynamics. Therefore, we do not have to reset learning even after the dynamics changes or the policy is updated, and we can analyze convergence and/or the monotone approximation property of an action-value function approximation in the same RKHS (see Section V-A.2, for example).
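A minimal batch sketch of this idea (not the paper's adaptive algorithm, and not its exact RKHS H_{ψ_Q}, which is constructed in Section IV-C): treat each quadruple as an input-output pair for the residual function ψ_Q in (III.2) and fit it by kernel ridge regression with the residual features Φ_n = φ(z_n) − γφ(w_n), recovering a Q estimate from the same expansion. The toy data, Gaussian kernel, and rewards below are illustrative assumptions.

```python
import numpy as np

gamma, sigma, lam = 0.9, 0.5, 1e-6

def kappa(a, b):
    # Gaussian kernel on stacked state-action vectors z = [x; u]
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b))**2) / (2 * sigma**2))

# Hypothetical quadruples: z_n = [x_n; u_n], w_n = [x_{n+1}; phi(x_{n+1})], reward R_n.
Z = [np.array([float(i), 0.0]) for i in range(5)]
W = [np.array([float(i) + 0.5, 0.0]) for i in range(5)]
Rwd = np.array([1.0 - z[0]**2 for z in Z])

# Gram matrix of the residual features Phi_n = phi(z_n) - gamma*phi(w_n),
# expanded by the kernel trick:
N = len(Z)
Kbar = np.array([[kappa(Z[i], Z[j]) - gamma * kappa(Z[i], W[j])
                  - gamma * kappa(W[i], Z[j]) + gamma**2 * kappa(W[i], W[j])
                  for j in range(N)] for i in range(N)])
alpha = np.linalg.solve(Kbar + lam * np.eye(N), Rwd)

def Q_hat(z):
    # Q estimate induced by the fitted residual expansion
    return sum(a * (kappa(z, zi) - gamma * kappa(z, wi))
               for a, zi, wi in zip(alpha, Z, W))

# Bellman consistency on the data: Q(z_n) - gamma*Q(w_n) should match R_n.
residuals = np.array([Q_hat(Z[i]) - gamma * Q_hat(W[i]) for i in range(N)])
assert np.allclose(residuals, Rwd, atol=1e-2)
```

Because the regression operates on the stacked input [z; w], nothing in the fit depends on the agent dynamics itself, which is the point of defining ψ_Q on Z × Z.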
2) Policy Update:
For a current policy φ : X → U, assume that the action-value function Q^φ with respect to φ at time instant n is available. Given a discrete-time exponential control barrier function B and 0 < η ≤ 1, the barrier-certified safe control space is defined as

S(x_n) := {u_n ∈ U | B(x_{n+1}) − B(x_n) ≥ −η B(x_n)}.

From Proposition III.1, the set C defined in (II.4) is forward invariant and asymptotically stable if u_n ∈ S(x_n) for all n ∈ Z_{≥0}. Then, the updated policy φ⁺ given by

φ⁺(x) := argmax_{u∈S(x)} [Q^φ(x, u)],   (III.3)

is well known (e.g., [48], [49]) to satisfy Q^φ(x, φ(x)) ≤ Q^{φ⁺}(x, φ⁺(x)), where Q^{φ⁺} is the action-value function with respect to φ⁺. In practice, we use the estimate of Q^φ because the exact function Q^φ is unavailable. For example, the action-value function is estimated over N_f ∈ Z_{>0} iterations, and the policy is updated every N_f iterations.

IV. ANALYSIS OF BARRIER-CERTIFIED ADAPTIVE REINFORCEMENT LEARNING
In the previous section, we presented our barrier-certified adaptive reinforcement learning framework. In this section, we present a theoretical analysis of our framework to further strengthen the arguments.
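Before turning to the analysis, the barrier-certified greedy update (III.3) can be sketched concretely: maximize Q over a discretized safe control space S(x). The scalar agent, interval safe set, and stand-in action-value function below are illustrative assumptions, not the paper's estimated Q^φ.

```python
import numpy as np

eta = 0.5

def B(x):                      # barrier: stay in the interval [-1, 1]
    return 1.0 - x**2

def step(x, u):                # hypothetical scalar agent x_{n+1} = x_n + u_n
    return x + u

def Q(x, u):                   # stand-in for the estimated action-value function
    return -(x + u)**2         # rewards being near the origin

def phi_plus(x, candidates):
    """Greedy policy over the barrier-certified safe control space
    S(x) = {u : B(step(x, u)) - B(x) >= -eta * B(x)}, cf. (III.3)."""
    safe = [u for u in candidates if B(step(x, u)) - B(x) >= -eta * B(x)]
    return max(safe, key=lambda u: Q(x, u)) if safe else None

candidates = np.linspace(-0.5, 0.5, 21)
u = phi_plus(0.8, candidates)
assert u is not None and B(step(0.8, u)) - B(0.8) >= -eta * B(0.8)
```

Near the boundary, the certificate excludes the outward candidates before the argmax is taken, so the greedy improvement never trades safety for value.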
A. Safety Recovery: Adaptive Model Learning and Control Barrier Certificates
The monotone approximation property of model parameters is closely related to Lyapunov stability. In fact, by augmenting the state vector with the model parameter, we can construct a Lyapunov function which guarantees stability with respect to the safe set under certain conditions. We first make the following assumptions.
Assumption IV.1.
1) Finite-dimensional model parameter: the dimension of the model parameter h remains finite, and is r ∈ Z_{>0}.
2) Boundedness of the basis functions: all of the basis functions (or kernel functions) are bounded over X.
3) Lipschitz continuity of the control barrier function: the control barrier function B is Lipschitz continuous over X with Lipschitz constant ν_B.
4) Validity of barrier certificates: there exists a control input u_n ∈ U satisfying, for a sufficiently small ρ > 0,

B(x̂_{n+1}) − B(x_n) ≥ −η B(x_n) + ρ,  ∀n ∈ Z_{≥0},  0 < η ≤ 1,   (IV.1)

where x̂_{n+1} is the predicted output of the current estimate h_n at x_n and u_n.
5) Appropriate cost functions: if h_n ∈ Ω_n := argmin_{h∈R^r} Θ_n(h), where Θ_n(h) is the continuous cost function at time instant n, then ‖x_{n+1} − x̂_{n+1}‖_{R^{n_x}} ≤ ρ/ν_B.
6) Model learning with monotone approximation property: the model parameter h_n is updated as h_{n+1} = T_n(h_n), where T_n : R^r → R^r is continuous and has the monotone approximation property: if h_n ∉ Ω_n, then dist(h_n, Ω_n) ≥ ρ₁ and ‖h_n − h*_n‖_{R^r} − ‖h_{n+1} − h*_n‖_{R^r} ≥ ρ₂ dist(h_n, Ω_n), for all h*_n ∈ Ω_n, and for some ρ₁, ρ₂ > 0. If h_n ∈ Ω_n, then h_{n+1} = h_n.
7) Data consistency: the set Ω := ∩_{n∈Z_{≥0}} Ω_n is nonempty.

Remark IV.1 (On Assumption IV.1.1). Assumption IV.1.1 is made so that Lyapunov stability can be analyzed in a Euclidean space, and it is reasonable if polynomial kernels are employed for learning or if the input space Z := X × U is compact.

Remark IV.2 (On Assumptions IV.1.2 and IV.1.3). Assumptions IV.1.2 and IV.1.3 ensure that the predicted value of the barrier function is close to its true value if the current estimate of the model parameter is close to the true parameter.

Remark IV.3 (On Assumption IV.1.4). Assumption IV.1.4 implies that we can enforce barrier certificates for the current estimate of the dynamics with a sufficiently small margin ρ. This assumption is necessary to implicitly bound the growth of B(x_{n+1}) and to robustly enforce barrier certificates whenever h_n ∈ Ω_n. Although this assumption is somewhat restrictive, it is still reasonable if the initial estimate does not largely deviate from the true dynamics.

Remark IV.4 (On Assumption IV.1.5). Assumption IV.1.5 implies that the set Ω_n, or equivalently the cost Θ_n, is designed so that the predicted output x̂_{n+1} for h_n ∈ Ω_n is sufficiently close to the true output x_{n+1}. Such a cost can be easily designed. This assumption is necessary to render the set C × Ω forward invariant.

Remark IV.5 (On Assumptions IV.1.6 and IV.1.7). To apply theories of Lyapunov stability, Assumption IV.1.6 is needed to make sure that the dynamical system for the augmented state is continuous. Moreover, the cost (or the set Ω_n) is designed so that h_n ∈ Ω_n or dist(h_n, Ω_n) ≥ ρ₁. See the work in [47] for a class of algorithms that satisfy this property, for example. Unless there exist some adversarial data (or inappropriate costs) that do not reflect the true agent dynamics, Assumption IV.1.7 is valid and ensures that the set of augmented safe states is nonempty.

Let the augmented state be [x; h] ∈ R^{n_x + r}. Then, the following theorem states that the system for the augmented state is (asymptotically) stable with respect to the set of augmented safe states even after a violation of safety due to an abrupt and unexpected change of the agent dynamics occurs.

Theorem IV.1.
Suppose that a triple (x_n, u_n, x_{n+1}) is available at time instant n + 1. Suppose also that a control input u_n satisfying (IV.1) is employed for all n ∈ Z_{≥0}. Then, under Assumption IV.1, the system for the augmented state is stable with respect to the set of augmented safe states C × Ω ⊂ R^{n_x + r}. If, in addition, h_n ∉ Ω_n for all n ∈ Z_{≥0} such that [x_n; h_n] ∉ C × Ω, then the system is uniformly globally asymptotically stable with respect to C × Ω ⊂ R^{n_x + r}.

Proof. See Appendix B.
Remark IV.6 (On Theorem IV.1). Theorem IV.1 implies that how much the current estimate gets closer to the true dynamics depends on how much the next state of the agent deviates from the predicted next state. Therefore, barrier certificates and model learning work together to guarantee stability. If the model learning algorithm satisfies Assumption IV.1, then Theorem IV.1 claims that safety is recovered successfully. When GPs or kernel ridge regressions are employed for model learning, for example, introducing forgetting factors or letting the sample size grow as time advances will make the algorithms adaptive to time-varying systems; in such cases, we need to make sure that the algorithms satisfy Assumption IV.1 to guarantee safety recovery. Numerical simulations of safety recovery are given in Section V-A.

If the agent dynamics keeps changing, or if we know that there are multiple modes of the dynamics, then we may have separate model learning processes as proposed in [50], and the augmented state can be regarded as following a hybrid system. Hence, stability should be analyzed under additional assumptions in this case. We leave such an analysis as future work.
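As a concrete illustration of an update satisfying the monotone approximation property of Assumption IV.1.6, the sketch below uses a projection-type step in the spirit of projection-based adaptive filters such as [47]: relaxed projection of h_n toward the hyperplane Ω_n = {h : h^T k(z_n) = δ_n} never increases the distance to any h* ∈ Ω_n when the step size lies in (0, 2). The synthetic data and stationary true parameter are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def nlms_step(h, k, delta, step=0.5):
    """Relaxed projection toward Omega_n = {h : h^T k = delta}.
    For step in (0, 2) the distance to any h* in Omega_n never
    increases (monotone approximation); step = 1 is the exact
    metric projection onto the hyperplane."""
    err = delta - h @ k
    return h + step * err * k / (k @ k)

r = 8
h_true = rng.normal(size=r)            # hypothetical "true" model parameter
h = np.zeros(r)                        # current estimate h_n
for _ in range(200):
    k = rng.normal(size=r)             # basis-function outputs k(z_n)
    delta = h_true @ k                 # observed increment x_{n+1} - x_n
    d_before = np.linalg.norm(h - h_true)
    h = nlms_step(h, k, delta)
    # monotone approximation: distance to Omega (here, to h_true) never grows
    assert np.linalg.norm(h - h_true) <= d_before + 1e-12
```

Under a consistent data stream, the distance to the intersection Ω shrinks at every informative step, which is exactly the set-theoretic mechanism that the Lyapunov argument of Theorem IV.1 exploits.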
B. Structured Model Learning
We have seen that, by employing model learning with the monotone approximation property under Assumption IV.1, the agent is stabilized on the set of augmented safe states even after an abrupt and unexpected change of the agent dynamics. Here, we show that a control-affine dynamics can be learned via sparse optimizations satisfying the monotone approximation property in a properly defined RKHS. We assume that n_x = 1 for simplicity; we can use n_x approximators if n_x > 1. First, we show that H_c (see Section III-C) is an RKHS.

Lemma IV.1.
The space H_c is an RKHS associated with the reproducing kernel κ_c(u, v) := 1, ∀u, v ∈ U, with the inner product defined as ⟨α, β⟩_{H_c} := αβ, α, β ∈ R.

Proof.
See Appendix C.

Then, the following lemma implies that ψ can be approximated in the sum space of RKHSs denoted by H_ψ.

Lemma IV.2 ([51, Theorem 13]). Let H_1 and H_2 be two RKHSs associated with the reproducing kernels κ_1 and κ_2. Then the completion of the tensor product of H_1 and H_2, denoted by H_1 ⊗ H_2, is an RKHS associated with the reproducing kernel κ_1 ⊗ κ_2.

From Lemmas IV.1 and IV.2, we can now assume that ˆf ∈ H_f ⊗ H_c and ˆg̃ ∈ H_g ⊗ H_u, where ˆg̃ is an estimate of g̃(x, u) := g(x)u. As such, ψ can be approximated in the RKHS H_ψ := H_p + H_f ⊗ H_c + H_g ⊗ H_u. Therefore, we can employ a kernel adaptive filter working in the sum space H_ψ. Second, the following theorem ensures that ψ can be uniquely decomposed into p, f, and g̃ in the RKHS H_ψ.

Theorem IV.2.
Assume that X and U have nonempty interiors. Assume also that H_p is a Gaussian RKHS. Then, H_ψ is the direct sum of H_p, H_f ⊗ H_c, and H_g ⊗ H_u, i.e., the intersection of any two of the RKHSs H_p, H_f ⊗ H_c, and H_g ⊗ H_u is {0}.

Proof.
See Appendix D.
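To make the role of the unique decomposition concrete, here is a small numerical sketch with hypothetical features and data (not the paper's actual RKHS machinery): a sparsity-promoting least-squares fit over separate p-, f-, and g̃-feature blocks drives the p-block coefficient to (near) zero when the true dynamics is control-affine, mirroring the drop-off behavior discussed in Remark IV.7 for the coefficient vector h_n.

```python
import numpy as np

# Hypothetical 1-D example: the true dynamics is control-affine,
# y = f(x) + g(x) u with f(x) = sin x and g(x) = 1 + 0.5 x.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))               # columns: state x, input u
y = np.sin(X[:, 0]) + (1 + 0.5 * X[:, 0]) * X[:, 1]

def features(x, u):
    # p-block: a joint nonlinear term (irrelevant for control-affine dynamics);
    # f-block: state-only terms; g-block: state terms multiplied by u.
    return np.concatenate([[np.exp(-x * x - u * u)],      # p
                           [x, x**2, np.sin(x)],          # f
                           [u, x * u, np.sin(x) * u]])    # g~(x, u) = g(x) u

Phi = np.array([features(x, u) for x, u in X])

def ista(Phi, y, lam=1e-3, n_iter=3000):
    """Iterative soft-thresholding for the lasso: min ||Phi h - y||^2 / 2n + lam ||h||_1."""
    h = np.zeros(Phi.shape[1])
    L = np.linalg.norm(Phi, 2) ** 2 / len(y)        # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ h - y) / len(y)
        h = h - grad / L
        h = np.sign(h) * np.maximum(np.abs(h) - lam / L, 0.0)  # prox of lam ||.||_1
    return h

h = ista(Phi, y)
print(abs(h[0]))   # p-block coefficient: expected to be (near) zero
```

The decomposition in Theorem IV.2 is what makes this behavior meaningful: because the blocks do not overlap, a sparse solver cannot hide control-affine structure inside the p term.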
Remark
IV.7 (On Theorem IV.2). Because only the control-affine part of the learned model is used in combination with barrier certificates (see Assumption IV.2 and Theorem IV.4), and the term p is assumed to be a system noise added to the control-affine dynamics, the unique decomposition is crucial; if the unique decomposition did not hold, the term p might end up estimating the overall dynamics, including the control-affine terms.

By using a sparse optimization for the coefficient vector h_n ∈ R^r, we wish to extract the structure of the model; from Theorem IV.2, the term p̂_n is expected to drop off when the true agent dynamics is affine in control.

In order to use the learned model in combination with control barrier functions, each entry of the vector ĝ_n(x_n) is required. Assume, without loss of generality, that {e_i}_{i∈{1,2,...,n_u}} ⊂ U (this is always possible for U ≠ ∅ by transforming the coordinates of the control inputs and reducing the dimension n_u if necessary). Then, the i-th entry of the vector ĝ_n(x_n) is given by ĝ_n(x_n)e_i = ˆg̃_n(x_n, e_i). As such, we can use the learned model to constrain control inputs efficiently by using control barrier functions for explorations as well as policy updates. We analyze an adaptive action-value function approximation with barrier-certified policy updates in the next subsection.

C. Adaptive Action-value Function Approximation with Barrier-certified Policy Updates
In this subsection, we analyze the proposed adaptive action-value function approximation with barrier-certified policy updates. We showed in Section III-D.1 that the Bellman equation in (II.2) is solved via iterative nonlinear function estimation with the input-output pairs {([x_n; u_n; x_{n+1}; φ(x_{n+1})], R(x_n, u_n))}_{n∈Z_{≥0}}. The following theorem states that the function ψ_Q defined in (III.2) can be estimated in a properly constructed RKHS.

Theorem IV.3.
Suppose that H_Q is an RKHS associated with the reproducing kernel κ_Q(·, ·): Z × Z → R. Define, for γ ∈ (0, 1),

H_ψQ := {ϕ | ϕ([z; w]) = ϕ_Q(z) − γϕ_Q(w), ∃ϕ_Q ∈ H_Q, ∀z, w ∈ Z}.

Then, the operator U: H_Q → H_ψQ defined by U(ϕ_Q)([z; w]) := ϕ_Q(z) − γϕ_Q(w), ∀ϕ_Q ∈ H_Q, is bijective. Moreover, H_ψQ is an RKHS with the inner product defined by

⟨ϕ_1, ϕ_2⟩_{H_ψQ} := ⟨ϕ_{Q1}, ϕ_{Q2}⟩_{H_Q},  (IV.2)

where ϕ_i([z; w]) := ϕ_{Qi}(z) − γϕ_{Qi}(w), ∀z, w ∈ Z, i ∈ {1, 2}. The reproducing kernel of the RKHS H_ψQ is given by

κ([z; w], [z̃; w̃]) := (κ_Q(z, z̃) − γκ_Q(z, w̃)) − γ(κ_Q(w, z̃) − γκ_Q(w, w̃)), z, w, z̃, w̃ ∈ Z.  (IV.3)

Proof.
See Appendix E.

From Theorem IV.3, we can use any kernel-based method by assuming that the action-value function is in H_Q. The estimate of Q_φ, denoted by ˆQ_φ, is obtained by U^{−1}(ˆψ_Q), where ˆψ_Q is the estimate of ψ_Q ∈ H_ψQ. For instance, suppose that the estimate of ψ_Q for an input [z; w] at time instant n is given by

ˆψ_Qn([z; w]) := h_Qn^T k([z; w]),

where h_Qn ∈ R^r is the model parameter, and k([z; w]) := [κ([z; w], [z̃_1; w̃_1]); κ([z; w], [z̃_2; w̃_2]); ···; κ([z; w], [z̃_r; w̃_r])] ∈ R^r for {z̃_j}_{j∈{1,2,...,r}}, {w̃_j}_{j∈{1,2,...,r}} ⊂ Z and for κ(·, ·) defined by (IV.3). Then, the estimate of Q_φ(z) for an input z at time instant n is given by

ˆQ_φn(z) := h_Qn^T k_Q(z),  (IV.4)

where k_Q(z) := [U^{−1}(κ(·, [z̃_1; w̃_1]))(z); ···; U^{−1}(κ(·, [z̃_r; w̃_r]))(z)] ∈ R^r.

Remark IV.8 (On Theorem IV.3). As discussed in Appendix I, the GP SARSA is reproduced by applying a GP in the space H_ψQ, although the GP SARSA and other kernel-based action-value function approximations are ad hoc, designed for estimating the action-value function associated with a fixed policy under a stationary agent dynamics. When the parameter h_Qn for the estimator ˆψ_Qn monotonically approaches an optimal point h_Q* in the Euclidean-norm sense, so does the model parameter for the action-value function, because the same parameter is used to estimate ψ_Q and Q_φ. Suppose we employ a method which monotonically brings ˆψ_Qn closer to an optimal function ψ_Q* in the Hilbertian-norm sense. Then, the following corollary implies that an estimator of the action-value function also satisfies the monotonicity.
Corollary IV.1.
Let H_ψQ ∋ ˆψ_Qn([z; w]) := ˆQ_φn(z) − γ ˆQ_φn(w) and H_ψQ ∋ ψ_Q*([z; w]) := Q_φ*(z) − γ Q_φ*(w), z, w ∈ Z, where ˆQ_φn, Q_φ* ∈ H_Q. Then, if ˆψ_Qn is approaching ψ_Q*, i.e., ‖ˆψ_{Q,n+1} − ψ_Q*‖_{H_ψQ} ≤ ‖ˆψ_Qn − ψ_Q*‖_{H_ψQ}, it follows that ‖ˆQ_{φ,n+1} − Q_φ*‖_{H_Q} ≤ ‖ˆQ_φn − Q_φ*‖_{H_Q}.

Proof.
See Appendix F.

Note that the use of action-value functions enables us to use random control inputs instead of the target policy φ for exploration, and we require no model of the agent dynamics for policy updates, as discussed below. To obtain analytical solutions to (III.3), we follow the arguments in [37]. Suppose that ˆQ_φn is given by (IV.4). We define the reproducing kernel κ_Q of H_Q as the tensor kernel given by

κ_Q([x; u], [y; v]) := κ_x(x, y) κ_u(u, v),  (IV.5)

where κ_u(u, v) is, for example, defined by κ_u(u, v) := 1 + u^T v. Then, (III.3) becomes

φ_+(x) := argmax_{u ∈ S(x)} [h_Qn^T k_Q([x; u])],  (IV.6)

where the target value being maximized is linear in u at each x. Therefore, if the set S(x) ⊂ U is convex, an optimal solution to (IV.6) is guaranteed to be globally optimal, ensuring the greedy improvement of the policy. As pointed out in [24], S(x) ⊂ U is not a convex set in general. Instead, we consider a convex subset of S(x) under the following moderate assumptions:

Assumption IV.2.
1) The set U is convex.
2) Existence of a Lipschitz continuous gradient of the barrier function: given R := {(1 − t)x_n + t(ˆf_n(x_n) + ĝ_n(x_n)u + x_n) | t ∈ [0, 1], u ∈ U}, there exists a constant ν ≥ 0 such that the gradient of B, denoted by ∂B(x)/∂x, satisfies ‖∂B(a)/∂x − ∂B(b)/∂x‖_{R^{n_x}} ≤ ν‖a − b‖_{R^{n_x}}, ∀a, b ∈ R.

Then, the following theorem holds.
Theorem IV.4.
Under Assumptions IV.1.3 and IV.2, assume also that ‖x_{n+1} − (ˆf_n(x_n) + ĝ_n(x_n)u_n + x_n)‖_{R^{n_x}} ≤ ρ/ν_B. Then, inequality (III.1) is satisfied at time instant n ∈ Z_{≥0} if u_n satisfies the following:

∂B(x_n)/∂x (ˆf_n(x_n) + ĝ_n(x_n)u_n) ≥ −ηB(x_n) + (ν/2)‖ˆf_n(x_n) + ĝ_n(x_n)u_n‖²_{R^{n_x}} + ρ.  (IV.7)

Moreover, (IV.7) defines a convex constraint for u_n.

Proof.
See Appendix G.
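To illustrate how a constraint of the form (IV.7) yields a tractable policy update, the following sketch maximizes a linear surrogate objective over the convex set the constraint defines, using a generic solver. The barrier B(x) = 1 − x, all numbers, and the box input set are hypothetical stand-ins, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def barrier_certified_greedy(q, dBdx, f_hat, g_hat, B_x, eta, nu, rho, u_max):
    """Maximize the linear surrogate q^T u over the convex set defined by (IV.7)."""
    def shift(u):                        # predicted one-step shift f_hat + g_hat u
        return f_hat + g_hat @ u
    cons = ({"type": "ineq",             # (IV.7) rearranged so that fun(u) >= 0
             "fun": lambda u: dBdx @ shift(u) + eta * B_x
                              - 0.5 * nu * shift(u) @ shift(u) - rho},)
    res = minimize(lambda u: -q @ u, np.zeros(len(q)),
                   bounds=[(-u_max, u_max)] * len(q),
                   constraints=cons, method="SLSQP")
    return res.x

# toy 1-D instance: B(x) = 1 - x at x = 0.5, so B_x = 0.5 and dB/dx = -1;
# the certificate then reduces to -0.1 u + 0.05 >= 0, i.e., u <= 0.5
u_star = barrier_certified_greedy(q=np.array([1.0]), dBdx=np.array([-1.0]),
                                  f_hat=np.zeros(1), g_hat=np.array([[0.1]]),
                                  B_x=0.5, eta=0.1, nu=0.0, rho=0.0, u_max=1.0)
print(u_star)   # largest input still certified
```

Because the objective is linear in u and the constraint is convex, any local optimum returned by the solver is the global greedy choice, which is exactly the property the theorem is after.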
Remark
IV.9. When ∂B(x_n)/∂x ĝ_n(x_n) ≠ 0 and U admits sufficiently large values of each entry of u_n, there always exists a u_n that satisfies (IV.7).

Theorem IV.4 essentially implies that, even when the gradient of B along the shift of x_n decreases steeply, inequality (III.1) holds if (IV.7) is satisfied. From Theorem IV.4, the set Ŝ_n(x_n), defined as

Ŝ_n(x_n) := {u_n ∈ U | ∂B(x_n)/∂x (ˆf_n(x_n) + ĝ_n(x_n)u_n) ≥ −ηB(x_n) + (ν/2)‖ˆf_n(x_n) + ĝ_n(x_n)u_n‖²_{R^{n_x}} + ρ} ⊂ S(x_n),  (IV.8)

is convex under Assumption IV.2.

As witnessed in the literature (e.g., [22]), an agent might encounter deadlock situations when control barrier certificates are employed, where the constrained control keeps the agent in the same state. It is even possible that there is no safe control driving the agent away from those states. However, an elaborate design of control barrier functions remedies this issue, as shown in the following example.

Example IV.1.
If the agent is nonholonomic, turning inward safe regions when approaching their boundary might be infeasible. To reduce the risk of such deadlock situations, control barrier functions may be designed as

B(x) = B̃(x) − υ Γ(|θ − atan2{∂B̃(x)/∂y, ∂B̃(x)/∂x}|), υ > 0,

where the state x = [x; y; θ] consists of the X position x, the Y position y, and the orientation θ of the agent in the world frame, {x ∈ X | B̃(x) ≥ 0} is the original safe region, and Γ is a strictly increasing function. If this control barrier function exists, then the agent is forced to turn inward the original safe region before reaching its boundaries, because the control barrier function also depends on θ and takes a larger value when the agent is facing inward the safe region. An illustration of this example is given in Figure IV.1.

Fig. IV.1. An illustration of how a nonholonomic agent avoids deadlocks. When the orientation of the agent is not considered (i.e., B̃(x) is the barrier function), there might be no safe control driving the agent away from those states, as the left figure shows. By taking the orientation into account (i.e., B(x) is the barrier function), the agent turns inward the safe region before reaching its boundaries, as the right figure shows.

Algorithm 1 Barrier-certified adaptive reinforcement learning

Requirement: Assumptions IV.1 and IV.2; κ_Q defined as (IV.5); x_0 ∈ X and u_0 ∈ U; λ ∈ (0, 1), µ ≥ 0, s ∈ Z_{>0}
Output: ˆQ_φn(z_n) ◁ (IV.4)
for n ∈ Z_{≥0} do
  - Sample x_n, x_{n+1} ∈ X, u_n ∈ S, and R(x_n, u_n) ∈ R
  - Obtain φ(x_{n+1}) ∈ Ŝ_{n+1}(x_{n+1}) ◁ (IV.8)
  if Random Exploration then
    Select a (uniformly) random control input u_{n+1} ∈ Ŝ_{n+1}(x_{n+1}) ◁ (IV.8)
  else
    Use the current policy: u_{n+1} = φ(x_{n+1})
  end if
  - Model update: h_{n+1} = T_n(h_n) ◁ e.g., (H.3)
  - Update ˆQ_φn by updating ˆψ_Qn in H_ψQ (e.g., by a kernel adaptive filter):
    h_{Q,n+1} = prox_{λµ}[(1 − λ)I + (λ/s) Σ_{ι=n−s+1}^{n} P_{C_ι}](h_Qn) ◁ Theorem IV.3 and (H.3)
  if n mod N_f = 0 then
    φ_+(x) = argmax_{u ∈ Ŝ_n(x)} [ˆQ_φn(x, u)] ◁ (IV.8) and (III.3)
    Let φ ← φ_+
  end if
end for

The resulting barrier-certified adaptive reinforcement learning framework is summarized in Algorithm 1.

V. EXPERIMENTAL RESULTS
For the sake of reproducibility and to clarify each contribution, we first validate the proposed learning framework on simulations of the vertical movements of a quadrotor, which have been used in the safe learning literature under stationarity assumptions (e.g., [7]). Then, we test the proposed learning framework on a real robot called the brushbot, whose dynamics is unknown, highly complex, and nonstationary. The experiments on the brushbot were conducted at the Robotarium, a remotely accessible robot testbed at the Georgia Institute of Technology [52].

A. Validations of the Safe Learning Framework via Simulations of a Quadrotor
In this experiment, we empirically validate Theorem IV.1 (i.e., Lyapunov stability of the set of augmented safe states after an unexpected and abrupt change of the agent dynamics) and the motivations for using an online kernel method working in the RKHS H_ψQ (see Section IV-C) for action-value function approximation. We also test the proposed framework on simulated vertical movements of a quadrotor. We use a parametric model for the agent dynamics and a nonparametric model for the action-value function in this experiment. The discrete-time dynamics of the vertical movement of a quadrotor is given by

x_{n+1} = Ξ(z_n)h* := h*_1 ξ_1(z_n) + h*_2 ξ_2(z_n) + h*_3 ξ_3(z_n)
        := h*_1 [1, ∆t; 0, 1] x_n + h*_2 [−∆t²/2; −∆t] + h*_3 [−∆t²/2; −∆t] u_n,
h* := [h*_1; h*_2; h*_3] ∈ R³, ξ_i: Z → R², i ∈ {1, 2, 3}, z_n := [x_n; u_n] ∈ Z, x_n := [x_n; ẋ_n],

where ∆t ∈ (0, ∞) denotes the time interval, and x_n and ẋ_n are the vertical position and the vertical velocity of the quadrotor at time instant n, respectively. For a quadrotor of mass m, h*_1 = 1, h*_2 = 9.81, and h*_3 = 1/m. We let ∆t be 0.02 seconds for the simulations, and the control input u_n takes values in a bounded interval. We employ the following two barrier functions: B_t(x) = 1 − x and B_b(x) = x + 1, and we use the barrier-certificate parameter η = 0.01 (see (III.1)) in this experiment. Note that the safe set is equivalently expressed by C = [−1, 1] = {x ∈ X | B_t(x) ≥ 0 ∧ B_b(x) ≥ 0}, and the barrier functions satisfy Assumption IV.2.2 with the Lipschitz constant ν = 0. The immediate reward is a negative quadratic in the position x_n and the velocity ẋ_n plus a positive constant, ∀n ∈ Z_{≥0}, where the constant is added to prevent the resulting value of explored states from becoming negative, i.e., lower than the value outside of the safe set.

(The dynamics of the brushbot depends on the body structure, the conditions of the brushes, the floor, and many other factors. Thus, simulators of the brushbot are unavailable.)
Fig. V.1. Trajectories of the vector [x; h_2; h_3] for the GP-based learning and for the adaptive model learning algorithm with barrier certificates after the change of the dynamics. The adaptive algorithm steers the augmented state back to C × Ω, while the GP-based learning seems to approach the safe set only slowly; safety recovery is not theoretically supported in that setting.
1) Stability of the Safe Set:
In terms of safety recovery, we compare a GP-based approach, which tends to be less adaptive to time-varying systems, and a set-theoretic adaptive model learning algorithm with the monotone approximation property. Random explorations by uniformly random control inputs are conducted for the first 20 seconds, corresponding to 1000 iterations, under the dynamics h* = [1; 9.81; 1/m], where m is the mass of the quadrotor. Then, we change the simulated dynamics and observe whether the quadrotor is stabilized on the set of augmented safe states. To clearly visualize the difference between the GP-based approach and the adaptive model learning algorithm, we let the new agent dynamics be h* = [1; 9.81; 5/m], which is an extreme situation where the maximum input generates a very large acceleration.

We define the update rule of model learning as

h_{n+1} = h_n − λ Ξ^T(z_n)(Ξ(z_n)Ξ^T(z_n))^{−1}(Ξ(z_n)h_n − x_{n+1}),

which satisfies the monotone approximation property, where λ ∈ (0, 1) is the step size. In this experiment, we used λ = 0.01, and let the prior covariance of the parameter vector h be 25I.

(This update can be viewed as the relaxed projection of the current parameter onto the affine set in which any element h*_n satisfies Ξ(z_n)h*_n − x_{n+1} = 0, and hence it follows that ‖h_n − h*_n‖²_{R^r} − ‖h_{n+1} − h*_n‖²_{R^r} ≥ ρ_1 dist²(h_n, Ω_n), ∀h*_n ∈ Ω_n := argmin_{h ∈ R^r} ‖Ξ(z_n)h − x_{n+1}‖. See [47] for more detailed arguments.)

The trajectories of the vector [x; h_2; h_3] for the GP-based learning and for the adaptive model learning algorithm after the change of the dynamics are shown in Figure V.1. The adaptive algorithm steers the augmented state back to C × Ω, while the GP-based learning seems to be slowly approaching the safe set; safety recovery is not theoretically supported in the current settings.

TABLE V.1
SUMMARY OF THE PARAMETER SETTINGS OF THE SIMULATED VERTICAL MOVEMENTS OF A QUADROTOR (KERNEL ADAPTIVE FILTER)

Parameter | Description | Value
λ | step size | 0.
s | data size | 5
µ | regularization parameter | 0.
ε | precision parameter | 0.
ε̄ | large-normalized-error | 0.
r_max | maximum-dictionary-size | 600
σ | scale parameters | { , , , , , }
γ | discount factor | 0.
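The relaxed-projection model update described above can be checked numerically for the monotone approximation property: the distance to the true parameter never increases and shrinks toward zero. The regressors and the target parameter below are hypothetical, and the observation is taken as noiseless.

```python
import numpy as np

rng = np.random.default_rng(1)
h_true = np.array([1.0, 9.81, 2.0])   # hypothetical stand-in for h*
h = np.zeros(3)
lam = 0.5                             # step size; lam in (0, 2) preserves monotonicity
dists = [np.linalg.norm(h - h_true)]

for n in range(200):
    Xi = rng.normal(size=(2, 3))      # regressor Xi(z_n): parameters -> 2-D next state
    x_next = Xi @ h_true              # observed next state (noiseless)
    err = Xi @ h - x_next
    # relaxed projection onto the affine set {h : Xi h = x_next}
    h = h - lam * Xi.T @ np.linalg.solve(Xi @ Xi.T, err)
    dists.append(np.linalg.norm(h - h_true))

print(dists[-1] < 1e-6,
      all(b <= a + 1e-12 for a, b in zip(dists, dists[1:])))   # -> True True
```

Each iteration moves the estimate part-way toward an affine set that contains the true parameter, so the Euclidean distance to it is non-increasing, which is exactly the property Assumption IV.1 asks of the model learner.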
2) Adaptive Action-value Function Approximation:
We also validate our action-value function approximation framework by employing a GP (i.e., the GP SARSA) and a kernel adaptive filter in the same RKHS. The parameter settings for the kernel adaptive filter are summarized in Table V.1. Please refer to Appendix H for the notations that are not in the main text. Six Gaussian kernels with different scale parameters σ are employed for the kernel adaptive filter (i.e., M = 6; see also Appendix H for more detail about the multikernel adaptive filter). For the GP SARSA, we employ a Gaussian kernel with scale parameter 3, which achieved sufficiently good performance, with a small output-noise variance (see Appendix I for the definition of Σ). Other parameters are the same as those of the kernel adaptive filter. In addition, we also test the GP SARSA in another setting, where kernel functions are added only in the first 600 iterations (i.e., the dimension of the parameter becomes r = 600 and is fixed afterwards); we refer to this setting as the GP SARSA 2.

Random explorations by uniformly random control inputs are conducted for the first 200 seconds, corresponding to 10000 iterations, under the dynamics h* = [1; 9.81; 1/m], and the dynamics then changes so that h*_2 = 11.81 and h*_3 decreases (i.e., an additional downward acceleration and a degradation of batteries, for example).

(Note that the GP SARSA was originally designed for stationary agent dynamics. In this experiment, "GP SARSA" refers to a GP working in the RKHS H_ψQ.)
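The RKHS H_ψQ that both methods operate in can be written down concretely. The following sketch builds the composite kernel (IV.3) from an assumed Gaussian base kernel κ_Q and checks the identity κ([z; w], [z̃; w̃]) = (U^{−1}κ(·, [z̃; w̃]))(z) − γ(U^{−1}κ(·, [z̃; w̃]))(w) that underlies the feature vector k_Q in (IV.4); the discount factor and the test points are arbitrary choices.

```python
import numpy as np

GAMMA = 0.9          # discount factor (hypothetical value)

def kq(z, zp, sigma=1.0):
    """Assumed base reproducing kernel kappa_Q on Z (Gaussian)."""
    d = np.asarray(z, float) - np.asarray(zp, float)
    return float(np.exp(-d @ d / (2.0 * sigma**2)))

def kappa(z, w, zt, wt):
    """Composite reproducing kernel (IV.3) of H_psiQ."""
    return (kq(z, zt) - GAMMA * kq(z, wt)) - GAMMA * (kq(w, zt) - GAMMA * kq(w, wt))

def u_inv_section(z, zt, wt):
    """(U^{-1} kappa(., [zt; wt]))(z) = kappa_Q(z, zt) - gamma * kappa_Q(z, wt)."""
    return kq(z, zt) - GAMMA * kq(z, wt)

# consistency check: kappa factors through U as phi_Q(z) - gamma * phi_Q(w)
z, w, zt, wt = [0.1, 0.2], [0.3, -0.1], [0.0, 0.0], [0.5, 0.5]
lhs = kappa(z, w, zt, wt)
rhs = u_inv_section(z, zt, wt) - GAMMA * u_inv_section(w, zt, wt)
print(abs(lhs - rhs) < 1e-12)   # -> True
```

This identity is what lets any kernel machine trained on ψ_Q be read back as an action-value estimate ˆQ_φn through (IV.4).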
Fig. V.2. The learning curves of the normalized mean squared errors (NMSEs) of action-value function approximation for the GP SARSA, the kernel adaptive filter, and the GP SARSA 2.

TABLE V.2
THE EXPECTED VALUES OF THE GP SARSA AND THE KERNEL ADAPTIVE FILTER

GP SARSA | kernel adaptive filter | GP SARSA 2
65. ± .42 | 64. ± .39 | 63. ± .

Because no kernel functions are added after the first 600 iterations, the GP SARSA 2 could not adapt to the new policy or new dynamics. The expected values E[V_φ(x)] for the GP SARSA, the kernel adaptive filter, and the GP SARSA 2, associated with the policies obtained at the end of learning, are summarized in Table V.2, where V_φ is defined in (II.1).

Among the 15 runs for the kernel adaptive filter, we extracted the seventh run, which was successful. The left figure of Figure V.3 illustrates the action-value function obtained in this run, and the right figure of Figure V.3 plots the corresponding positions and velocities over iterations.
3) Discussion:
The control barrier certificates with an adaptive model learning algorithm recovered safety even in an extreme situation where the control inputs start generating very large accelerations. As long as the model learning algorithm satisfies Assumption IV.1, safety recovery is guaranteed.

Reinforcement learning with the GP SARSA and with the kernel adaptive filter in the RKHS H_ψQ worked sufficiently well. If no kernel functions are newly added, GP-based learning cannot adapt to new policies or agent dynamics. Therefore, we need to sequentially add new kernel functions or use a sparse adaptive filter to prune redundant kernel functions (see also Appendix H for a sparse adaptive filter). We mention that identifying the RKHS H_ψQ enabled us to employ GPs for nonstationary agent dynamics without having to reset learning. Consequently, we can effectively reuse the previous estimate of the target function if the new target function is close to the previous one.

Our safe learning framework, validated by these simulations, is now ready to be applied to a real robot called the brushbot, as presented below.

Fig. V.3. The left figure illustrates the action-value function over the position x and the velocity ẋ; the right figure shows the positions and velocities over iterations.

Fig. V.4. A picture of the brushbot used in the experiment. Vibrations of the two motors propagate to the two brushes, driving the brushbot. Control inputs are of two dimensions, each of which corresponds to the rotational speed of a motor.
B. Real-Robotics Experiments on the Brushbot
Next, we apply our safe learning framework, which was validated by the simulations, to the brushbot, which has highly nonlinear, nonholonomic, and nonstationary dynamics (see Figure V.4). The objective of this experiment is to find a policy driving the brushbot to the origin while restricting the region of exploration. The experiment is conducted at the Robotarium, a remotely accessible robot testbed at the Georgia Institute of Technology [52].
1) Experimental Condition:
The experimental conditions for model learning, reinforcement learning, and control barrier functions, together with their parameter settings, are presented below.

a) Model learning: The state x = [x; y; θ] consists of the X position x, the Y position y, and the orientation θ ∈ [−π, π] of the brushbot in the world frame. The exact positions and the orientation are recorded by motion capture systems at a fixed sampling interval. The control input u is of two dimensions, each of which corresponds to the rotational speed of a motor. To improve the learning efficiency and reduce the total learning time required, we identify the most significant dimensions and reduce the dimensions to learn. The sole input variable of p, f, and g for the shifts of x and y is assumed to be θ. The shift of θ is assumed to be constant over the state, and hence depends on nothing but the control inputs (see Section V-B.1.d). The brushbot used in the present study is nonholonomic, i.e., it can only go forward, and positive control inputs basically drive the brushbot in the same way as negative control inputs. As such, we use the rotational speeds of the motors as the control inputs. Moreover, to eliminate the effect of static frictions on the model, we assume that the zero control input given to the algorithm actually generates some minimum control input u_δ to the motors, i.e., the actual maximum control inputs to the motors are given by u_max + u_δ, where u_max is the maximum control input fed to the algorithm.

b) Reinforcement learning: The state for action-value function approximation consists of the distance ‖[x; y]‖_{R²} from the origin and the orientation θ − atan2(y, x), which is wrapped to the interval [−π, π]. The immediate reward is given by the negative distance −‖[x_n; y_n]‖_{R²} plus a positive constant, ∀n ∈ Z_{≥0}, where the constant is added to prevent the resulting value of explored states from becoming negative, namely, lower than the value outside of the region of exploration.

c) Discrete-time control barrier certificates: Control barrier certificates are used to limit the region of exploration to the rectangular area x ∈ [−x_max, x_max], y ∈ [−y_max, y_max], where x_max > 0 and y_max > 0. Because the brushbot can only go forward, we employ the following four barrier functions:

B_1(x) = x_max − x − υ|θ + π|,
B_2(x) = x + x_max − υ|θ|,
B_3(x) = y_max − y − υ|θ + π/2|,
B_4(x) = y + y_max − υ|θ − π/2|,

where the angle arguments are wrapped to [−π, π] (see Example IV.1 for the motivations of using the above control barrier functions). Note that those functions satisfy Assumption IV.2.2, and the Lipschitz constant ν is zero except around θ = −π/2, 0, π/2, and ±π. (Although we could employ globally Lipschitz functions for a more rigorous treatment, we use the above functions for simplicity.)

d) Parameter settings: The parameter settings are summarized in Table V.3. Please refer to Appendix H for the notations that are not in the main text. Five Gaussian kernels with different scale parameters σ are employed in action-value function approximation (i.e., M = 5; see also Appendix H for more detail about the multikernel adaptive filter), and six Gaussian kernels are employed in model learning for x and y (i.e., M = 6). For θ, we define H_p, H_f, and H_g as sets of constant functions. The kernels of H_p and H_f are weighted by a factor τ.

e) Procedure: The time interval (i.e., the duration of one iteration) for learning is fixed, and only the control-affine part of the learned model, ˆf_n(x) + ĝ_n(x)u, is used in combination with barrier certificates. Although the barrier functions employed in the experiment reduce deadlock situations, the brushbot is forced to turn inward the region of exploration when a deadlock is detected. Note that the barrier certificates are intentionally violated in such a case. The policy is updated every 50 seconds. After 300 seconds, we stop learning the model and the action-value function, and the learned policy replaces random explorations. The brushbot is forced to stop when it enters a small circle around the origin.
2) Results:
Figure V.5 plots p̂_n([x; 0; 0]), f̂_n(x), ĝ^(1)_n(x), and ĝ^(2)_n(x) for the shifts of x and y at the end of learning, where ĝ^(i)_n is the estimate of g^(i) at time instant n. Recall that these functions depend only on θ in this experiment, to improve the learning efficiency. For the shift of θ, the estimators are constant over the state, and the result is ĝ^(1)_n(x) = 0.38, a negative value for ĝ^(2)_n(x), and p̂_n([x; 0; 0]) = f̂_n(x) = 0. The estimate p̂_n([x; 0; 0]) is almost zero and so is f̂_n(x), implying that the proposed algorithm successfully dropped off irrelevant structural components of the model.

Figure V.6 plots the trajectory of the brushbot while exploring. The brushbot remained in the region of exploration (x ∈ [−x_max, x_max] and y ∈ [−y_max, y_max]) most of the time. Moreover, the values of the barrier functions B_i, i ∈ {1, 2, 3, 4}, for the whole trajectory are plotted in Figure V.7. Even though some violations of safety are seen in the figure, the brushbot returned to the safe region before large violations occurred. Despite the unknown, highly complex, and nonstationary system, the proposed safe learning framework was shown to work efficiently.

Fig. V.5. Estimated output of the model estimator plotted over θ. Irrelevant structures such as p̂_n and f̂_n dropped off successfully.

Figure V.8 plots the trajectories of the optimal policy learned by the brushbot. Once the optimal policy replaced random explorations, the brushbot repeatedly returned to the origin. Figure V.9 illustrates the shape of the action-value function Q̂_φn([‖[x; y]‖_{R²}; 0], [0; 0]) over the X, Y positions.
TABLE V.3
SUMMARY OF THE PARAMETER SETTINGS

General settings:
Parameter | Description | Value
x_max | maximum X position | 1.
y_max | maximum Y position | 1.
η | barrier-function parameter | 0.
υ | coefficient in barrier functions | 0.
u_δ | actual minimum control | 0.
u_max | maximum control input | 0.

Parameter | Description | Model learning (x and y) | Model learning (θ) | Action-value function approximation
λ | step size | 0. | 0.03 | 0.
s | data size | 5 | 10 | 10
µ | regularization parameter | 0. | 0. | 0.
ε | precision parameter | 0.001 | 0.01 | 0.
ε̄ | large-normalized-error | 0. | 0. | 0.
r_max | maximum-dictionary-size | 500 | 3 | 2000
σ | scale parameters | { , , , , . , . } | – | { , , , , . }
γ | discount factor | – | – | 0.

Fig. V.6. The left figure shows the trajectory of the brushbot while exploring, and the right figure shows the X, Y positions over iterations. The region of exploration is limited to x ∈ [−x_max, x_max] and y ∈ [−y_max, y_max]. The brushbot remained in the region most of the time.

Fig. V.7. The values of the four control barrier functions employed in the experiment for the whole trajectory. Even though some violations of safety were seen, the brushbot returned to the safe region before large violations occurred. The nonholonomic brushbot adaptively learned a model to turn inward the region of exploration before reaching its boundaries.

3) Discussion:

One of the challenges of the experiments is that no initial data or simulators were available. Although the brushbot, with its highly complex dynamics, had to learn an optimal policy while maintaining safety by employing an adaptive model learning algorithm, the proposed learning framework worked well in the real world. The brushbot is powered by brushes, and its dynamics highly depends on the conditions of the floor and brushes. Possible changes of the agent dynamics thus led to some violations of safety. Nevertheless, our learning framework recovered safety quickly. In addition, the agent learned a good policy within a quite short period. One reason for these successes in adaptivity and data efficiency is the convex-analytic formulation.

On the other hand, because no initial nominal model or policy is available and our framework is fully adaptive, i.e., we do not collect data to conduct batch model learning and/or reinforcement learning, we need to reduce the dimensions of input vectors to speed up and robustify learning. This can be
ONCLUSION
The learning framework presented in this paper success-fully tied model learning, reinforcement learning, and barriercertificates, enabling barrier-certified reinforcement learning for unknown, highly nonlinear, nonholonomic, and possiblynonstationary agent dynamics. The proposed model learningalgorithm captures a structure of the agent dynamics byemploying a sparse optimization. The resulting model haspreferable structure for preserving efficient computations ofbarrier certificates. In addition, recovery of safety after an -1.5 -1 -0.5 0 0.5 1 1.5 x -1.5-1-0.500.511.5 y -1.5 -1 -0.5 0 0.5 1 1.5 x -1.5-1-0.500.511.5 y -1.5 -1 -0.5 0 0.5 1 1.5 x -1.5-1-0.500.511.5 y -1.5 -1 -0.5 0 0.5 1 1.5 x -1.5-1-0.500.511.5 y Iteration -1.5-1-0.500.511.5 P o s iti on X position Y position
unexpected and abrupt change of the agent dynamics was guaranteed by employing barrier certificates and a model learning algorithm with the monotone approximation property under certain conditions. For possibly nonstationary agent dynamics, the action-value function approximation problem was appropriately reformulated so that kernel-based methods, including the kernel adaptive filter, can be directly applied in an RKHS. Lastly, certain conditions were also presented to render the set of safe policies convex, thereby guaranteeing the global optimality of solutions to the policy update and ensuring the greedy improvement of a policy. The experimental results show the efficacy of the proposed learning framework in the real world.

Fig. V.8. Trajectories of the optimal policy learned by the brushbot. Once the optimal policy replaced random explorations, the brushbot returned to the origin.

Fig. V.9. The shape of the action-value function over the X, Y positions at the control input u = [0; 0].

APPENDIX A
PROOF OF PROPOSITION
III.1

See [24, Proposition 4] for the proof of forward invariance. The set C ⊂ X is asymptotically stable as

lim_{n→∞} B(x_n) ≥ lim_{n→∞} (1 − η)^n B(x_0) = 0,

where the inequality holds from [24, Proposition 1].

APPENDIX B
PROOF OF THEOREM
IV.1

From Assumptions IV.1.1, IV.1.2, and IV.1.5, and from the facts that the estimated output is linear in the model parameter at a fixed input and that ‖x_{n+1} − x̂_{n+1}‖_{R^{n_x}} ≥ 0, we obtain

‖x_{n+1} − x̂_{n+1}‖_{R^{n_x}} − ρ/ν_B ≤ ρ_2 dist(h_n, Ω_n),

for some bounded ρ_2 ≥ 0. From Assumption IV.1.3, we also obtain that

|B(x_{n+1}) − B(x̂_{n+1})| ≤ ν_B ‖x_{n+1} − x̂_{n+1}‖_{R^{n_x}}.  (B.1)

Fig. V.10. Two trajectories of the brushbot returning to the origin by using the saved action-value function.

Therefore, from Assumptions IV.1.6 and IV.1.7, and from ν_B ≥ 0, we obtain for h*_n ∈ {h ∈ Ω | dist(h_n, Ω) = ‖h_n − h‖_{R^r}} that

|B(x_{n+1}) − B(x̂_{n+1})| − ρ ≤ ν_B ‖x_{n+1} − x̂_{n+1}‖_{R^{n_x}} − ρ
 ≤ ν_B ρ_2 dist(h_n, Ω_n)
 ≤ (ν_B ρ_2/√ρ_1) (‖h_n − h*_n‖²_{R^r} − ‖h_{n+1} − h*_n‖²_{R^r})^{1/2}
 ≤ (ν_B ρ_2/√ρ_1) [dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}.

If B(x_{n+1}) < B(x̂_{n+1}), then we obtain

B(x_{n+1}) − B(x̂_{n+1}) ≥ −ρ − (ν_B ρ_2/√ρ_1) [dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}.  (B.2)

This inequality also holds in the case when B(x_{n+1}) ≥ B(x̂_{n+1}). Because of the continuity of the cost function Θ_n and the barrier function B (Assumptions IV.1.3 and IV.1.5), the set C × Ω is closed. We show that there exists a Lyapunov function V_{C×Ω} with respect to the closed set C × Ω for the augmented state [x; h]. A candidate function is given by

V_{C×Ω}([x; h]) = 0 if [x; h] ∈ C × Ω,
V_{C×Ω}([x; h]) = −min(B(x), 0) + (2ν_B ρ_2/(√ρ_1 ρ_3)) dist²(h, Ω) if [x; h] ∉ C × Ω.

Since −min(B(x), 0) + (2ν_B ρ_2/(√ρ_1 ρ_3)) dist²(h, Ω) = 0 for all [x; h] ∈ ∂(C × Ω), where ∂(C × Ω) is the boundary of the set C × Ω, from Assumption IV.1.3 the function V_{C×Ω} is continuous. It also holds that V_{C×Ω}([x; h]) > 0 for [x; h] ∉ C × Ω. Under Assumption IV.1.6, we obtain

V_{C×Ω}([x_{n+1}; h_{n+1}]) − V_{C×Ω}([x_n; h_n])
 = −min(B(x_{n+1}), 0) + (2ν_B ρ_2/(√ρ_1 ρ_3)) dist²(h_{n+1}, Ω) + min(B(x_n), 0) − (2ν_B ρ_2/(√ρ_1 ρ_3)) dist²(h_n, Ω)
 ≤ −(ν_B ρ_2/(√ρ_1 ρ_3)) [dist²(h_n, Ω) − dist²(h_{n+1}, Ω)] ≤ 0,  (B.3)

for all n ∈ Z_{≥0}. To show that the first inequality holds, we first show that

−min(B(x_{n+1}), 0) + min(B(x_n), 0) ≤ (ν_B ρ_2/√ρ_1) [dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}.

(a) For B(x_n) ≥ 0: from (IV.1), (B.2), and 0 < η ≤ 1, we obtain B(x̂_{n+1}) ≥ ρ and B(x_{n+1}) ≥ −(ν_B ρ_2/√ρ_1)[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}, from which it follows that −min(B(x_{n+1}), 0) ≤ (ν_B ρ_2/√ρ_1)[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}.

(b) For B(x_n) < 0 and B(x_{n+1}) ≥ 0: it is straightforward to see that −min(B(x_{n+1}), 0) + min(B(x_n), 0) = B(x_n) < 0 ≤ (ν_B ρ_2/√ρ_1)[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}.

(c) For B(x_n) < 0 and B(x_{n+1}) < 0: from (IV.1), (B.2), and 0 < η ≤ 1, we obtain B(x̂_{n+1}) ≥ B(x_n) + ρ and B(x_{n+1}) − B(x_n) ≥ −(ν_B ρ_2/√ρ_1)[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}, from which it follows that −min(B(x_{n+1}), 0) + min(B(x_n), 0) = B(x_n) − B(x_{n+1}) ≤ (ν_B ρ_2/√ρ_1)[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}.

If h_n ∉ Ω_n, under Assumption IV.1.6 we obtain ρ_3 ≤ [dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2}, from which it follows that [dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2} ≤ (1/ρ_3)[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)] and

V_{C×Ω}([x_{n+1}; h_{n+1}]) − V_{C×Ω}([x_n; h_n])
 ≤ (ν_B ρ_2/√ρ_1)[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]^{1/2} − (2ν_B ρ_2/(√ρ_1 ρ_3))[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)]
 ≤ −(ν_B ρ_2/(√ρ_1 ρ_3))[dist²(h_n, Ω) − dist²(h_{n+1}, Ω)],

and the first inequality of (B.3) holds. The inequality also holds for h_n ∈ Ω_n. Moreover, if [x_n; h_n] ∈ C × Ω, then h_n remains in Ω because of the monotone approximation property. From (B.2), the control barrier certificate (III.1) is thus ensured by a control input satisfying (IV.1) under Assumption IV.1.4, and the set C × Ω is forward invariant. Therefore, the system for the augmented state is stable with respect to the set C × Ω. If h_n ∉ Ω_n for all n ∈ Z_{≥0} such that [x_n; h_n] ∉ C × Ω, it follows that

V_{C×Ω}([x_{n+1}; h_{n+1}]) − V_{C×Ω}([x_n; h_n]) < 0,  (B.4)

and [53, Theorem 1] applies, i.e., the system for the augmented state is uniformly globally asymptotically stable with respect to the set C × Ω.

APPENDIX C
PROOF OF LEMMA
IV.1

Since κ(u, v) := 1, ∀u, v ∈ U, is a positive definite kernel, it defines the unique RKHS given by span{1}, which is complete because it is a finite-dimensional space. For any ϕ := α1 ∈ H_c, ⟨ϕ, ϕ⟩_{H_c} = α² ≥ 0, and ⟨ϕ, ϕ⟩_{H_c} = 0 if and only if α = 0, or equivalently, ϕ = 0. The symmetry and the linearity also hold, and hence ⟨·,·⟩_{H_c} defines an inner product. For any u ∈ U, it holds that ⟨ϕ, κ(·, u)⟩_{H_c} = ⟨α1, 1⟩_{H_c} = α = ϕ(u). Therefore, the reproducing property is satisfied. ∎

APPENDIX D
PROOF OF THEOREM
IV.2

The following lemmas are used to prove the theorem.
Lemma D.1 ([54, Theorem 2]). Let X ⊂ R^{n_x} be any set with nonempty interior. Then, the RKHS associated with the Gaussian kernel for an arbitrary scale parameter σ > 0 does not contain any polynomial on X, including the nonzero constant function.

Lemma D.2. Assume that X ⊂ R^{n_x} and U ⊂ R^{n_u} have nonempty interiors. Then, the intersection of the RKHS H_u associated with the kernel κ(u, v) := u^T v, u, v ∈ U, and the RKHS H_c is {0}, i.e., H_c ∩ H_u = {0}.

Proof. It is obvious that the function ϕ(u) = 0, ∀u ∈ U, is an element of both of the RKHSs (vector spaces) H_u and H_c. Therefore, it is sufficient to show that, for any ϕ ∈ H_u \ {0}, there exists u ∈ U satisfying ϕ(u) ≠ ϕ(u_int), where u_int ∈ int(U) and int(U) denotes the interior of U. Assume that ϕ(v) ≠ 0 for some v ∈ U. From [51, Theorem 3], the RKHS H_u is expressed as H_u = span{κ(·, u)}_{u∈U}, which is finite dimensional, implying that any function in H_u is linear. Since there exists u = u_int + ρv ∈ U for some ρ > 0, it is proved that ϕ(u) = ϕ(u_int + ρv) = ϕ(u_int) + ρϕ(v) ≠ ϕ(u_int). ∎

Lemma D.3 ([55, Proposition 1.3]). If H = H₁ ⊕ H₂ for given vector spaces H₁ and H₂, then H₁ ⊗ H₃ ∩ H₂ ⊗ H₃ = {0}, i.e., H ⊗ H₃ = (H₁ ⊗ H₃) ⊕ (H₂ ⊗ H₃).

Lemma D.4. Given X ⊂ R^{n_x} and U ⊂ R^{n_u}, let H₁, H₂, and H₃ be associated with the Gaussian kernels

κ₁(x, y) := (1/(√(2π)σ)^{n_x}) exp(−‖x − y‖²_{R^{n_x}}/(2σ²)),  x, y ∈ X,
κ₂(u, v) := (1/(√(2π)σ)^{n_u}) exp(−‖u − v‖²_{R^{n_u}}/(2σ²)),  u, v ∈ U,
κ₃([x; u], [y; v]) := (1/(√(2π)σ)^{n_x+n_u}) exp(−‖[x; u] − [y; v]‖²_{R^{n_x+n_u}}/(2σ²)),  x, y ∈ X, u, v ∈ U,

respectively, for an arbitrary σ > 0. Then, by regarding a function in H₁ ⊗ H₂ as a function over the input space X × U ⊂ R^{n_x+n_u}, it holds that H₃ = H₁ ⊗ H₂.

Proof. H₁ ⊗ H₂ has the reproducing kernel defined by

κ_⊗([x; u], [y; v]) := κ₁(x, y) κ₂(u, v)
  = (1/((√(2π)σ)^{n_x} (√(2π)σ)^{n_u})) exp(−‖x − y‖²_{R^{n_x}}/(2σ²)) exp(−‖u − v‖²_{R^{n_u}}/(2σ²))
  = (1/(√(2π)σ)^{n_x+n_u}) exp(−(‖x − y‖²_{R^{n_x}} + ‖u − v‖²_{R^{n_u}})/(2σ²))
  = κ₃([x; u], [y; v]).

This verifies the claim. ∎

We are now ready to prove Theorem IV.2.
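As a quick numerical sanity check of the kernel identity in Lemma D.4 (the product of the normalized Gaussian kernels on X and U equals the normalized Gaussian kernel on X × U), the following sketch uses illustrative dimensions n_x = 3, n_u = 2 and an arbitrary σ; the function name and test points are ours, not the paper's:

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Normalized Gaussian kernel on R^d (d inferred from the inputs)."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    d = a.size
    coef = 1.0 / (np.sqrt(2.0 * np.pi) * sigma) ** d
    return coef * np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
sigma = 0.7
x, y = rng.normal(size=3), rng.normal(size=3)   # points in X ⊂ R^3
u, v = rng.normal(size=2), rng.normal(size=2)   # points in U ⊂ R^2

lhs = gaussian_kernel(x, y, sigma) * gaussian_kernel(u, v, sigma)  # κ1 · κ2
rhs = gaussian_kernel(np.concatenate([x, u]),
                      np.concatenate([y, v]), sigma)               # κ3 on [x;u], [y;v]
assert np.isclose(lhs, rhs)
```

The identity holds exactly (up to floating point) because the squared norm on R^{n_x+n_u} splits as the sum of the squared norms on R^{n_x} and R^{n_u}.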
Proof of Theorem IV.2. By Lemmas D.2 and D.3, it is derived that H_f ⊗ H_c ∩ H_g ⊗ H_u = {0}. By Lemmas D.1, D.3, and D.4, it holds that H_p ∩ (H_f ⊗ H_c) = {0} and H_p ∩ (H_g ⊗ H_u) = {0}. ∎

APPENDIX E
PROOF OF THEOREM
IV.3

We show that the operator U : H_Q → H_{ψQ}, which maps ϕ_Q ∈ H_Q to the function ϕ ∈ H_{ψQ} defined by ϕ([z; w]) := ϕ_Q(z) − γϕ_Q(w), where γ ∈ (0, 1) and z, w ∈ Z, is bijective. Because the mapping U is surjective by definition, we show that it is also injective. For any ϕ_{Q,1}, ϕ_{Q,2} ∈ H_Q,

U(ϕ_{Q,1} + ϕ_{Q,2})([z; w]) = (ϕ_{Q,1} + ϕ_{Q,2})(z) − γ(ϕ_{Q,1} + ϕ_{Q,2})(w)
  = (ϕ_{Q,1}(z) − γϕ_{Q,1}(w)) + (ϕ_{Q,2}(z) − γϕ_{Q,2}(w))
  = U(ϕ_{Q,1})([z; w]) + U(ϕ_{Q,2})([z; w]),  ∀z, w ∈ Z,

and

U(αϕ_Q)([z; w]) = αϕ_Q(z) − γαϕ_Q(w) = α(ϕ_Q(z) − γϕ_Q(w)) = αU(ϕ_Q)([z; w]),  ∀α ∈ R, ∀z, w ∈ Z,

from which the linearity holds. Therefore, it is sufficient to show that ker(U) = {0}. For any ϕ_Q ∈ ker(U), we obtain U(ϕ_Q)([z; z]) = (1 − γ)ϕ_Q(z) = 0, ∀z ∈ Z, which implies that ϕ_Q = 0.

Next, we show that H_{ψQ} is an RKHS. The space H_{ψQ} with the inner product defined in (IV.2) is isometric to the RKHS H_Q, and hence is a Hilbert space. Because κ_Q(·, z) − γκ_Q(·, w) ∈ H_Q, it is true that κ(·, [z; w]) ∈ H_{ψQ}. Moreover, it holds that

⟨κ(·, [z; w]), κ(·, [z̃; w̃])⟩_{H_{ψQ}} = ⟨κ_Q(·, z) − γκ_Q(·, w), κ_Q(·, z̃) − γκ_Q(·, w̃)⟩_{H_Q}
  = (κ_Q(z, z̃) − γκ_Q(z, w̃)) − γ(κ_Q(w, z̃) − γκ_Q(w, w̃)) = κ([z; w], [z̃; w̃]),

and that

⟨ϕ, κ(·, [z; w])⟩_{H_{ψQ}} = ⟨ϕ_Q, κ_Q(·, z) − γκ_Q(·, w)⟩_{H_Q} = ϕ_Q(z) − γϕ_Q(w) = ϕ([z; w]),  ∀ϕ ∈ H_{ψQ}.

Therefore, κ(·,·) : (Z × Z) × (Z × Z) → R is the reproducing kernel with which the RKHS H_{ψQ} is associated. ∎

APPENDIX F
PROOF OF COROLLARY
IV.1

From the definition of the inner product in the RKHS H_{ψQ}, it follows that

‖Q̂^φ_{n+1} − Q^{φ*}‖_{H_Q} = ‖ψ̂^Q_{n+1} − ψ^{Q*}‖_{H_{ψQ}} ≤ ‖ψ̂^Q_n − ψ^{Q*}‖_{H_{ψQ}} = ‖Q̂^φ_n − Q^{φ*}‖_{H_Q}. ∎

APPENDIX G
PROOF OF THEOREM
IV.4

The line integral of ∂B(x)/∂x is path independent because it is the gradient of the scalar field B [57]. Let x(t) := (1 − t)x_n + t x̂_{n+1} = x_n + t(f̂_n(x_n) + ĝ_n(x_n)u_n), where t ∈ [0, 1] parameterizes the line path between x_n and x̂_{n+1} := x_n + f̂_n(x_n) + ĝ_n(x_n)u_n; then

dB(x(t))/dt = (∂B(x(t))/∂x)(f̂_n(x_n) + ĝ_n(x_n)u_n).

Therefore, for any path A from x_n to x̂_{n+1}, it holds under Assumption IV.2.2 that

B(x̂_{n+1}) − B(x_n) = ∫_A (∂B(x)/∂x) · dx = ∫₀¹ (dB(x(t))/dt) dt
  ≥ ∫₀¹ ((∂B(x_n)/∂x) − νt(f̂_n(x_n) + ĝ_n(x_n)u_n)^T)(f̂_n(x_n) + ĝ_n(x_n)u_n) dt
  = (∂B(x_n)/∂x)(f̂_n(x_n) + ĝ_n(x_n)u_n) − (ν/2)‖f̂_n(x_n) + ĝ_n(x_n)u_n‖²_{R^{n_x}}.   (G.1)

The inequality implies that B(x̂_{n+1}) − B(x_n) is greater than or equal to its value in the case when ∂B(x)/∂x decreases along the line path at the maximum rate. Therefore, when (IV.7) is satisfied, it holds from (G.1) that B(x̂_{n+1}) − B(x_n) ≥ −ηB(x_n) + ρ̄, which is the control barrier certificate defined in (IV.1). Hence, (III.1) is satisfied by the same argument as in the proof of Theorem IV.1 under Assumption IV.1.3. Equation (IV.7) can be rewritten as

(∂B(x_n)/∂x)(f̂_n(x_n) + ĝ_n(x_n)u_n) − (ν/2)‖f̂_n(x_n) + ĝ_n(x_n)u_n‖²_{R^{n_x}} ≥ −ηB(x_n) + ρ̄.   (G.2)

The first term on the left-hand side of (G.2) is affine in u_n, and the second term is the composition of a concave function −(ν/2)‖·‖²_{R^{n_x}} and an affine function of u_n, which is concave. Therefore, the left-hand side of (G.2) is a concave function of u_n, and the inequality (G.2) defines a convex constraint under Assumption IV.2.1. ∎

APPENDIX H
KERNEL ADAPTIVE FILTER WITH MONOTONE APPROXIMATION PROPERTY
The kernel adaptive filter [58] is an adaptive extension of kernel ridge regression [59], [60] or of GPs. The multikernel adaptive filter [61] exploits multiple kernels to conduct learning in the sum space of the RKHSs associated with each kernel. Let M ∈ Z_{>0} be the number of kernels employed. Here, we only discuss the case in which the dimension of the model parameter h is fixed, for simplicity. Denote by D_m := {κ_m(·, z̃_{m,j})}_{j∈{1,2,...,r_m}}, m ∈ {1, 2, ..., M}, r_m ∈ Z_{>0}, the time-dependent set of functions, referred to as a dictionary, at time instant n for the m-th kernel κ_m(·,·). The current estimator ψ̂_n is evaluated at the current input z_n, in a linear form, as

ψ̂_n(z_n) := h_n^T k(z_n) = Σ_{m=1}^{M} h_{m,n}^T k_m(z_n),

where h_n := [h_{1,n}; h_{2,n}; ...; h_{M,n}] := [h_1; h_2; ...; h_r] ∈ R^r, r := Σ_{m=1}^{M} r_m, is the coefficient vector, and k(z_n) := [k_1(z_n); k_2(z_n); ...; k_M(z_n)] ∈ R^r, k_m(z_n) := [κ_m(z_n, z̃_{m,1}); κ_m(z_n, z̃_{m,2}); ...; κ_m(z_n, z̃_{m,r_m})] ∈ R^{r_m}. To obtain a sparse model parameter, we define the cost at time instant n as

Θ_n(h) := (1/(2s)) Σ_{ι=n−s+1}^{n} dist²(h, C_ι) + µ‖h‖₁,   (H.1)

where ι ∈ {n−s+1, ..., n} ⊂ Z_{≥0}, s ∈ Z_{>0}, and

C_ι := {h ∈ R^r : |h^T k(z_ι) − δ_ι| ≤ ε},  ε ≥ 0,   (H.2)

which is the set of coefficient vectors h satisfying the instantaneous-error-zero condition with a precision parameter ε. Here, δ_n ∈ R is the output at time instant n, and the ℓ₁-norm regularizer ‖h‖₁ := Σ_{i=1}^{r} |h_i| with a parameter µ ≥ 0 promotes sparsity of h.
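The pieces of (H.1) can be sketched in code: the metric projection onto a hyperslab C_ι, the soft-thresholding operator (the proximity operator of the ℓ₁ term), and one relaxed forward-backward step combining averaged projections with that prox, in the spirit of the adaptive proximal forward-backward splitting of [62] given next in (H.3). This is a minimal illustration under our own naming and parameter choices, not the authors' implementation:

```python
import numpy as np

def project_hyperslab(h, k, delta, eps):
    """Metric projection of h onto C = {h : |h^T k - delta| <= eps}."""
    err = h @ k - delta
    if abs(err) <= eps:
        return h.copy()                      # already inside the hyperslab
    shift = err - np.sign(err) * eps         # signed violation of the slab
    return h - (shift / (k @ k)) * k

def soft_threshold(h, tau):
    """Proximity operator of tau*||.||_1 (componentwise soft-thresholding)."""
    return np.sign(h) * np.maximum(np.abs(h) - tau, 0.0)

def apfbs_step(h, K_recent, deltas, eps, lam, mu):
    """One step: relaxed average of hyperslab projections, then the l1 prox."""
    avg_proj = np.mean([project_hyperslab(h, k, d, eps)
                        for k, d in zip(K_recent, deltas)], axis=0)
    forward = (1.0 - lam) * h + lam * avg_proj
    return soft_threshold(forward, lam * mu)
```

Iterating `apfbs_step` over the s most recent data pairs drives h toward a sparse vector consistent with the hyperslabs, mirroring the update (H.3) below.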
The update rule of the adaptive proximal forward-backward splitting [62], which is an adaptive filter designed for sparse optimization, for the cost (H.1) is given by

h_{n+1} = prox_{λµ‖·‖₁} [ (1 − λ)I + (λ/s) Σ_{ι=n−s+1}^{n} P_{C_ι} ](h_n),   (H.3)

where λ ∈ (0, 2) is the step size, I is the identity operator, and

prox_{λµ‖·‖₁}(h) = Σ_{i=1}^{r} sgn(h_i) max{|h_i| − λµ, 0} e_i,

where sgn(·) is the sign function and e_i is the i-th standard basis vector of R^r. Then, the strictly monotone approximation property [62],

‖h_{n+1} − h*_n‖_{R^r} < ‖h_n − h*_n‖_{R^r},  ∀h*_n ∈ Ω_n := argmin_{h∈R^r} Θ_n(h),

holds if h_n ∉ Ω_n ≠ ∅.

Dictionary Construction:
If the dictionary is insufficient, we can employ two novelty conditions when adding the kernel functions {κ_m(·, z_n)}_{m∈{1,2,...,M}} to the dictionary: (i) the maximum-dictionary-size condition r ≤ r_max, r_max ∈ Z_{>0}, and (ii) the large-normalized-error condition |δ_n − ψ̂_n(z_n)| > ε̃|ψ̂_n(z_n)|, ε̃ ≥ 0. By using sparse optimization, nonactive structural components represented by some kernel functions can be removed, and the dictionary is refined as time goes by. To effectively achieve a compact representation of the model, it might be required to appropriately weight the kernel functions so as to include some preferences on the structure of the model. The following lemma implies that the resulting kernels are still reproducing kernels.
Lemma H.1 ([63, Theorem 2]). Let κ : Z × Z → R be the reproducing kernel of an RKHS (H, ⟨·,·⟩_H). Then, τκ(z, w), z, w ∈ Z, for an arbitrary τ > 0 is the reproducing kernel of the RKHS (H^τ, ⟨·,·⟩_{H^τ}) with the inner product ⟨ϕ, φ⟩_{H^τ} := τ^{−1}⟨ϕ, φ⟩_H, ϕ, φ ∈ H.

APPENDIX I
COMPARISON TO PARAMETRIC APPROACHES AND THE
GP SARSA

If a suitable set of basis functions for approximating action-value functions is available, we can adopt a parametric approach to action-value function approximation. Suppose that an estimate of the action-value function at time instant n is given by Q̂^φ_n(z) = h_n^T ζ(z), where ζ : Z → R^r is fixed for all time. In this parametric case, given an input-output pair ([z_n; z_{n+1}], R(x_n, u_n)), we can update the estimate of the action-value function as

h_{n+1} = h_n − λ[h_n^T(ζ(z_n) − γζ(z_{n+1})) − R(x_n, u_n)](ζ(z_n) − γζ(z_{n+1})).

Then, stable tracking is achieved if the step size λ is properly selected, even after the dynamics or the policy changes. On the other hand, when employing kernel-based learning, it is not trivial how to update the estimate in a theoretically formal manner. Because the output of the action-value function is not directly observable, the kernel expansion Σ_{i=1}^{n} α_i κ_Q(·, z_i) (where κ_Q is the reproducing kernel of the RKHS containing the action-value function) can no longer be validated by the representer theorem [64]. By defining the RKHS H_{ψQ} as in Theorem IV.3, however, we can view action-value function approximation as supervised learning in the RKHS H_{ψQ}, and can overcome the aforementioned issue. We mention that, when an adaptive filter is employed in the RKHS H_{ψQ}, we do not have to reset learning even after policies are updated or the dynamics changes, since the domain of H_{ψQ} is Z × Z instead of Z. The example below indicates that our approach is general.

As discussed in Section II-A, the least squares temporal difference algorithm has been extended to kernel-based methods, including the GP SARSA [37].
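The GP-SARSA posterior reviewed next ((I.1)–(I.2), built around the finite-difference matrix H) and its agreement with a GP regression carried out directly in H_{ψQ} can be illustrated numerically. The sketch below uses illustrative assumptions of our own: a Gaussian κ_Q, synthetic transitions, and a diagonal noise covariance Σ:

```python
import numpy as np

rng = np.random.default_rng(1)

def kq(a, b, s=1.0):
    """Gaussian kernel on Z (here Z ⊂ R^2); stands in for kappa_Q."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))

gamma, noise = 0.9, 0.1
Z = rng.normal(size=(6, 2))          # z_0, ..., z_5 (N_d = 5 transitions)
N = len(Z) - 1
R = rng.normal(size=N)               # observed immediate rewards

# Finite-difference matrix H: row i encodes Q(z_i) - gamma * Q(z_{i+1}).
H = np.zeros((N, N + 1))
for i in range(N):
    H[i, i], H[i, i + 1] = 1.0, -gamma

KQ = np.array([[kq(a, b) for b in Z] for a in Z])
Sigma = noise ** 2 * np.eye(N)

z_star = rng.normal(size=2)
k_tilde = np.array([kq(z_star, z) for z in Z])

# (I.1): GP-SARSA posterior mean of Q at z_star.
m1 = k_tilde @ H.T @ np.linalg.solve(H @ KQ @ H.T + Sigma, R)

# Same mean via H_psiQ: its Gram matrix is K = H KQ H^T, and the cross-vector
# has entries kq(z*, z_{i-1}) - gamma * kq(z*, z_i) = (H k_tilde)_i.
K_psi = H @ KQ @ H.T
k_psi = H @ k_tilde
m2 = k_psi @ np.linalg.solve(K_psi + Sigma, R)

assert np.isclose(m1, m2)
```

The two routes coincide because the Gram matrix and cross-vector of the H_{ψQ} kernel are exactly H K_Q H^T and H k̃, which is the equivalence stated at the end of this appendix.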
Given a set of input data {z_n}_{n=0,1,...,N_d}, z_n := [x_n; u_n], N_d ∈ Z_{>0}, the posterior mean m_Q and variance µ_Q of Q̂^φ_{N_d} at a point z* ∈ Z are given by

m_Q(z*) = k̃_{N_d}^T H^T (H K_Q H^T + Σ)^{−1} R_{N_d−1},   (I.1)
µ_Q(z*) = κ_Q(z*, z*) − k̃_{N_d}^T H^T (H K_Q H^T + Σ)^{−1} H k̃_{N_d},   (I.2)

where R_{N_d−1} ~ N([R(x_0, u_0); R(x_1, u_1); ...; R(x_{N_d−1}, u_{N_d−1})], Σ) is the vector of immediate rewards, κ_Q is the reproducing kernel of H_Q, k̃_{N_d} := [κ_Q(z*, z_0); κ_Q(z*, z_1); ...; κ_Q(z*, z_{N_d})], the (i, j) entry of K_Q ∈ R^{(N_d+1)×(N_d+1)} is κ_Q(z_{i−1}, z_{j−1}), and Σ ∈ R^{N_d×N_d} is the covariance matrix of R_{N_d−1}. Here, the matrix H is defined by

H := [ 1  −γ   0  ⋯   0
       0   1  −γ  ⋯   0
       ⋮       ⋱   ⋱
       0   ⋯   0   1  −γ ] ∈ R^{N_d×(N_d+1)}.

If we employ a GP for learning ψ_Q in H_{ψQ} defined in Theorem IV.3, the posterior mean m_{ψQ} and variance µ_{ψQ} of ψ̂^Q_{N_d} at a point [z*; w*] ∈ Z × Z are given by

m_{ψQ}([z*; w*]) = k_{N_d}^T (K + Σ)^{−1} R_{N_d−1},
µ_{ψQ}([z*; w*]) = κ([z*; w*], [z*; w*]) − k_{N_d}^T (K + Σ)^{−1} k_{N_d},

where k_{N_d} := [κ([z*; w*], [z_0; z_1]); ...; κ([z*; w*], [z_{N_d−1}; z_{N_d}])], and the (i, j) entry of K ∈ R^{N_d×N_d} is κ([z_{i−1}; z_i], [z_{j−1}; z_j]). Then, the posterior mean m_Q and variance µ_Q of Q̂^φ_{N_d} at a point z* ∈ Z are given by

m_Q(z*) = U^{−1}(m_{ψQ}(·))(z*) = (k^Q_{N_d})^T (K + Σ)^{−1} R_{N_d−1},
µ_Q(z*) = κ_Q(z*, z*) − (k^Q_{N_d})^T (K + Σ)^{−1} k^Q_{N_d},

where the i-th entry of k^Q_{N_d} is κ_Q(z*, z_{i−1}) − γκ_Q(z*, z_i), which result in the same values as (I.1) and (I.2).

ACKNOWLEDGMENTS
M. Ohnishi thanks all of those who have given him insightful comments on this work, including the members of the Georgia Robotics and Intelligent Systems Laboratory. The authors thank all of the anonymous reviewers for their constructive suggestions.

REFERENCES
[1] R. S. Sutton and A. G. Barto,
Reinforcement learning: An introduction. MIT Press, 1998.
[2] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, 2009.
[3] D. Liberzon, Calculus of variations and optimal control theory: a concise introduction. Princeton University Press, 2011.
[4] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, "Safe model-based reinforcement learning with stability guarantees," in Proc. NIPS, 2017.
[5] F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, "Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes," in Proc. CDC, 2016, pp. 4661–4666.
[6] J. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, and M. Toussaint, "Safe exploration for active learning with Gaussian processes," in Proc. ECML PKDD, 2015, pp. 133–149.
[7] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, "Reachability-based safe learning with Gaussian processes," in Proc. CDC, 2014, pp. 1424–1431.
[8] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," arXiv preprint arXiv:1610.03295, 2016.
[9] H. B. Ammar, R. Tutunov, and E. Eaton, "Safe policy search for lifelong reinforcement learning with sublinear regret," in Proc. ICML, 2015, pp. 2361–2369.
[10] B. van Niekerk, A. Damianou, and B. Rosman, "Online constrained model-based reinforcement learning," in Proc. AUAI, 2017.
[11] J. Achiam, D. Held, A. Tamar, and P. Abbeel, "Constrained policy optimization," in Proc. ICML, 2017.
[12] P. Abbeel and A. Y. Ng, "Exploration and apprenticeship learning in reinforcement learning," in Proc. ICML, 2005, pp. 1–8.
[13] L. Wang, E. A. Theodorou, and M. Egerstedt, "Safe learning of quadrotor dynamics using barrier certificates," in IEEE Proc. ICRA, 2018, pp. 2460–2465.
[14] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437–1480, 2015.
[15] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[16] P. Geibel, "Reinforcement learning for MDPs with constraints," in Proc. ECML, vol. 4212, 2006, pp. 646–653.
[17] S. P. Coraluppi and S. I. Marcus, "Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes," Automatica, vol. 35, no. 2, pp. 301–309, 1999.
[18] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning. MIT Press, Cambridge, 2006, vol. 1.
[19] X. Xu, P. Tabuada, J. W. Grizzle, and A. D. Ames, "Robustness of control barrier functions for safety critical control," in Proc. IFAC, vol. 48, no. 27, 2015, pp. 54–61.
[20] P. Wieland and F. Allgöwer, "Constructive safety using control barrier functions," in Proc. IFAC, vol. 40, no. 12, 2007, pp. 462–467.
[21] P. Glotfelter, J. Cortés, and M. Egerstedt, "Nonsmooth barrier functions with applications to multi-robot systems," IEEE Control Systems Letters, vol. 1, no. 2, pp. 310–315, 2017.
[22] L. Wang, A. D. Ames, and M. Egerstedt, "Safety barrier certificates for collisions-free multirobot systems," IEEE Trans. Robotics, 2017.
[23] A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, "Control barrier function based quadratic programs for safety critical systems," IEEE Trans. Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2017.
[24] A. Agrawal and K. Sreenath, "Discrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation," in Proc. RSS, 2017.
[25] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[26] V. M. Janakiraman, X. L. Nguyen, and D. Assanis, "A Lyapunov based stable online learning algorithm for nonlinear dynamical systems using extreme learning machines," in IEEE Proc. IJCNN, 2013, pp. 1–8.
[27] M. French and E. Rogers, "Non-linear iterative learning by an adaptive Lyapunov technique," International Journal of Control, vol. 73, no. 10, pp. 840–850, 2000.
[28] M. M. Polycarpou, "Stable adaptive neural control scheme for nonlinear systems," IEEE Trans. Automatic Control, vol. 41, no. 3, pp. 447–451, 1996.
[29] K. J. Åström and B. Wittenmark, Adaptive control. Courier Corporation, 2013.
[30] C. A. Cheng and H. P. Huang, "Learn the Lagrangian: A vector-valued RKHS approach to identifying Lagrangian systems," IEEE Trans. Cybernetics, vol. 46, no. 12, pp. 3247–3258, 2016.
[31] D. Ormoneit and P. Glynn, "Kernel-based reinforcement learning in average-cost problems," IEEE Trans. Automatic Control, vol. 47, no. 10, pp. 1624–1636, 2002.
[32] X. Xu, D. Hu, and X. Lu, "Kernel-based least squares policy iteration for reinforcement learning," IEEE Trans. Neural Networks, vol. 18, no. 4, pp. 973–992, 2007.
[33] G. Taylor and R. Parr, "Kernelized value function approximation for reinforcement learning," in Proc. ICML, 2009, pp. 1017–1024.
[34] W. Sun and J. A. Bagnell, "Online Bellman residual and temporal difference algorithms with predictive error guarantees," in Proc. IJCAI, 2016.
[35] Y. Nishiyama, A. Boularias, A. Gretton, and K. Fukumizu, "Hilbert space embeddings of POMDPs," in Proc. UAI, 2012.
[36] S. Grunewalder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton, "Modelling transition dynamics in MDPs with RKHS embeddings," in Proc. ICML, 2012.
[37] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in Proc. ICML, 2005, pp. 201–208.
[38] A. Barreto, D. Precup, and J. Pineau, "Practical kernel-based reinforcement learning," J. Mach. Learn. Res., vol. 17, no. 1, pp. 2372–2441, 2016.
[39] A. S. Barreto, D. Precup, and J. Pineau, "Reinforcement learning using kernel-based stochastic factorization," in Proc. NIPS, 2011, pp. 720–728.
[40] B. Kveton and G. Theocharous, "Kernel-based reinforcement learning on representative states," in Proc. AAAI, 2012.
[41] J. Bae, P. Chhatbar, J. T. Francis, J. C. Sanchez, and J. C. Principe, "Reinforcement learning via kernel temporal difference," in IEEE Proc. EMBC, 2011.
[42] J. Reisinger, P. Stone, and R. Miikkulainen, "Online kernel selection for Bayesian reinforcement learning," in Proc. ICML, 2008, pp. 816–823.
[43] Y. Cui, T. Matsubara, and K. Sugimoto, "Kernel dynamic policy programming: Applicable reinforcement learning to robot systems with high dimensional states," Neural Networks, vol. 94, pp. 13–23, 2017.
[44] H. van Hoof, J. Peters, and G. Neumann, "Learning of non-parametric control policies with high-dimensional state features," in Artificial Intelligence and Statistics, 2015, pp. 995–1003.
[45] N. Aronszajn, "Theory of reproducing kernels," Trans. Amer. Math. Soc., vol. 68, no. 3, pp. 337–404, May 1950.
[46] I. Steinwart, "On the influence of the kernel on the consistency of support vector machines," J. Mach. Learn. Res., vol. 2, pp. 67–93, 2001.
[47] I. Yamada and N. Ogura, "Adaptive projected subgradient method for asymptotic minimization of sequence of nonnegative convex functions," Numerical Functional Analysis and Optimization, vol. 25, no. 7&8, pp. 593–617, 2004.
[48] M. L. Puterman and S. L. Brumelle, "On the convergence of policy iteration in stationary dynamic programming," Mathematics of Operations Research, vol. 4, no. 1, pp. 60–69, 1979.
[49] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 2005, vol. 1, no. 3.
[50] C. D. McKinnon and A. P. Schoellig, "Experience-based model selection to enable long-term, safe control for repetitive tasks under changing conditions," in IEEE Proc. IROS, 2018, pp. 2977–2984.
[51] A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics. Kluwer, 2004.
[52] D. Pickem, P. Glotfelter, L. Wang, M. Mote, A. Ames, E. Feron, and M. Egerstedt, "The Robotarium: A remotely accessible swarm robotics research testbed," in IEEE Proc. ICRA, 2017, pp. 1699–1706.
[53] Z. P. Jiang and Y. Wang, "A converse Lyapunov theorem for discrete-time systems with disturbances," Systems & Control Letters, vol. 45, no. 1, pp. 49–58, 2002.
[54] H. Q. Minh, "Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory," Constructive Approximation, vol. 32, no. 2, pp. 307–338, 2010.
[55] R. A. Ryan, Introduction to tensor products of Banach spaces. Springer Science & Business Media, 2013.
[56] G. Strang, Introduction to linear algebra. Wellesley-Cambridge Press, Wellesley, MA, 1993, vol. 3.
[57] L. V. Ahlfors, "Complex analysis: an introduction to the theory of analytic functions of one complex variable," New York, London, p. 177, 1953.
[58] W. Liu, J. Príncipe, and S. Haykin, Kernel adaptive filtering. New Jersey: Wiley, 2010.
[59] K. R. Müller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Networks, vol. 12, no. 2, pp. 181–201, 2001.
[60] B. Schölkopf and A. Smola, Learning with kernels. MIT Press, Cambridge, 2002.
[61] M. Yukawa, "Multikernel adaptive filtering," IEEE Trans. Signal Processing, vol. 60, no. 9, pp. 4672–4682, Sept. 2012.
[62] Y. Murakami, M. Yamagishi, M. Yukawa, and I. Yamada, "A sparse adaptive filtering using time-varying soft-thresholding techniques," in Proc. IEEE ICASSP, 2010, pp. 3734–3737.
[63] M. Yukawa, "Adaptive learning in Cartesian product of reproducing kernel Hilbert spaces," IEEE Trans. Signal Processing, vol. 63, no. 22, pp. 6037–6048, Nov. 2015.
[64] G. Kimeldorf and G. Wahba, "Some results on Tchebycheffian spline functions,"