A Receding Horizon Approach for Simultaneous Active Learning and Control using Gaussian Processes
Viet-Anh Le and Truong X. Nghiem
School of Informatics, Computing, and Cyber Systems
Northern Arizona University
{vl385, truong.nghiem}@nau.edu

Abstract — This paper proposes a receding horizon active learning and control problem for dynamical systems in which Gaussian Processes (GPs) are utilized to model the system dynamics. The active learning objective in the optimization problem is represented by the exact conditional differential entropy of GP predictions at multiple steps ahead, which is equivalent to the log determinant of the GP predictive covariance matrix. The resulting non-convex and complex optimization problem is solved by the Successive Convex Programming algorithm, which exploits first-order approximations of the non-convex functions. Simulation results of an autonomous racing car example verify that the proposed method can significantly improve data quality for model learning, while the solving time is highly promising for real-time applications.
I. INTRODUCTION
Modeling the system dynamics plays a pivotal role in the performance of model-based control techniques such as Receding Horizon Control (RHC, also known as Model Predictive Control). Nevertheless, for many complex dynamical systems, the obtained mathematical models are often insufficiently accurate due to the existence of uncertainties and ignored dynamical parts. This challenge motivates learning-based models for control, which leverage Machine Learning (ML) techniques to model dynamical systems from data and certain prior knowledge. Gaussian Processes (GPs) [1] have been applied for dynamics and control [2] due to several important advantages. Firstly, GPs can learn complex models with fewer parameters than other ML techniques, hence they are particularly suitable for applications with small datasets. More importantly, GP models provide an estimate of prediction uncertainty in the form of the predictive variance, which can be used for evaluating and ensuring the accuracy of model predictions.

One of the most fundamental but challenging problems in learning the system dynamics using GPs is how to obtain a training dataset such that the models learned from it can efficiently capture the actual dynamics. This is because, in most control applications, the experiments for data collection are limited by time, cost, or constraints of the environment, whereas using only historical data is not suitable due to the lack of input excitation. For instance, consider a motivating example of an autonomous racing car. In this example, the experiments are constrained by narrow and sharp racing tracks, while the historical data obtained from manual control or simple automatic control techniques such as pure pursuit [3] do not have sufficient excitation. A better approach would be simultaneous learning and control, where the car explores the state space for learning as quickly as possible while satisfying other control objectives such as maintaining safety and tracking a racing trajectory. The learning objective requires that the system dynamics be sampled at the states associated with the most informative GP inputs. Given the current models, this goal can be achieved by driving the dynamical system to the state where the information from the data collected at that state can minimize the prediction uncertainty in the region of interest. Once the system is controlled to the new state, the GP inputs and outputs are collected, and then the model is retrained. This repetitive procedure is referred to as active learning for dynamical systems.

The general active learning framework has been applied to various domains (see [4] for a survey) to address the problem of optimal training data collection. Nonetheless, contrary to other domains where the most informative locations of the state space can be easily sampled, in dynamical control systems, where the system is constrained by the dynamics and other input and state constraints, the active learning problem is more challenging. As a result, the active learning problem for dynamical systems has only been studied recently [5]–[9]. In [5], the authors considered a dual control problem (i.e., simultaneous learning and control) in which an optimization problem to find the best control actions was formulated subject to a one-step prediction of the system dynamics. The active learning term was included to minimize the approximated Shannon entropy of GP estimates over a grid of points of interest.
A greedy scheme for choosing the next data point in Optimal Experiment Design (OED) was proposed in [6] by exploiting the Maximum Variance and Information Gain methods. The models learned from the experiment were then used for an RHC problem. Meanwhile, active learning using multi-step lookahead was considered in three recent works [7]–[9], which show better performance than single-step approaches. Particularly, in [7], based on the conditional differential entropy of multi-step GP predictions, the authors first determined the most informative data point within the region of interest by a greedy algorithm, then steered the system towards that state using RHC. The paper [8] considered an RHC problem for dual control in which the objective function consists of a control objective and a knowledge-gain (information-gain) objective for learning the dynamics. The knowledge-gain objective was obtained using concepts from information theory, such as mutual information or relative entropy. Likewise, in [9], an RHC problem for OED including both an active learning objective, represented by the differential entropy of GP predictions, and dynamic constraints was formulated. However, to limit the computational burden, only an upper or lower bound of the differential entropy in [9] and of the estimated knowledge gain in [8] of the multi-step GP predictions was utilized. That is, the information-gain metrics for multi-step GP predictions were replaced by the sum of individual metrics for step-wise predictions over the horizon. As a result, the problems can be solved in the continuous domain by nonlinear programming solvers instead of grid-based methods as in the previous studies. However, the computation time for solving the optimization problems was not reported in those papers, so it is unclear whether the methods used are suitable for real-time control.

In this paper, we present a Receding Horizon Active Learning and Control (RHALC) problem for dynamical systems using GP regression. The presented problem formulation covers both the dual control problem in [8] and the problem for experiment design in [9]. However, instead of approximate information-gain metrics, we take the exact conditional differential entropy of multi-step GP predictions into account. The resulting optimization problem, involving the GP dynamics and the log determinants of posterior covariance matrices, is thus non-convex and highly complex, which may prevent the success of general-purpose nonlinear programming solvers. To overcome this challenge, in this work we apply the Successive Convex Programming (SCP) method [10] to efficiently address the problem by performing first-order approximations of the GP means and the GP posterior covariance matrices. The effectiveness of the proposed method is validated by simulations of an autonomous racing car example. The results on the trajectory tracking control performance of the racing car and the prediction accuracy of the learned models show that the active learning term can improve data quality for model learning in both offline learning (experiment design) and simultaneous learning and control. In addition, the reported fast computation time of the SCP algorithm demonstrates its capability for real-time implementation.

The remainder of this paper is organized as follows. The RHALC problem formulation is introduced in Section II. Section III provides the SCP algorithm, while the simulation results are reported and discussed in Section IV. Finally, Section V concludes the paper with a summary and future directions.
II. PROBLEM FORMULATION
In this section, we begin with a brief introduction to GPs, followed by a general Receding Horizon Active Learning and Control (RHALC) formulation in which GPs are employed to model the system dynamics while the exact conditional differential entropy is used as an optimization metric for active learning.
A. Gaussian Processes for Dynamical Systems
Consider a latent function f : ℝⁿ → ℝ and N > 0 noisy observations y⁽ⁱ⁾ of it, y⁽ⁱ⁾ = f(x⁽ⁱ⁾) + ε⁽ⁱ⁾, at inputs x⁽ⁱ⁾ ∈ ℝⁿ and with i.i.d. Gaussian noise ε⁽ⁱ⁾ ~ N(0, σ_n²), for i = 1, ..., N. We will use X = [x⁽¹⁾, ..., x⁽ᴺ⁾] ∈ ℝ^{n×N} to denote the collection of all input vectors and Y = [y⁽¹⁾, ..., y⁽ᴺ⁾] ∈ ℝᴺ to denote the collection of the corresponding observed outputs. Let D = (X, Y) be the set of observation data of f. A GP of f, which will be denoted by G_f, is a probability distribution over all possible realizations of f and can be formally defined as a collection of random variables, any finite number of which have a joint Gaussian distribution [1]. It is fully specified by a covariance function k(x, x′; θ) = E[(f(x) − E[f(x)])(f(x′) − E[f(x′)])] and a mean function m(x; θ) = E[f(x)], parameterized by hyperparameters collected in a vector θ. The mean function is employed to include prior knowledge about the unknown function. In this paper, the mean function is assumed to be zero without loss of generality.

At M new inputs x⋆ = [x_{1,⋆}, ..., x_{M,⋆}], the joint prediction at x⋆, f⋆ = f(x⋆), of G_f is a random variable f⋆ ~ N(μ⋆, Σ⋆), in which the predictive mean vector μ⋆ and the M × M posterior covariance matrix Σ⋆ are computed as follows:

$$\mu_\star = \mu_{\mathcal{G}_f}(x_\star) = K_\star (K + \sigma_n^2 I)^{-1} Y \tag{1a}$$
$$\Sigma_\star = \Sigma_{\mathcal{G}_f}(x_\star) = K_{\star\star} - K_\star (K + \sigma_n^2 I)^{-1} K_\star^T \tag{1b}$$

where σ_n² is the Gaussian noise variance, I is an identity matrix of appropriate dimensions, K⋆⋆ ∈ ℝ^{M×M} is the covariance matrix at x⋆, K⋆ ∈ ℝ^{M×N} is the cross-covariance matrix between x⋆ and X, and K ∈ ℝ^{N×N} is the covariance matrix at X, in which the elements K_ij of each matrix are computed by K_ij = k(x⁽ⁱ⁾, x⁽ʲ⁾). If x⋆ is a single input, the prediction consists of a scalar mean μ⋆ and a scalar variance σ⋆². Given the training data D, the hyperparameters θ are often found by maximizing the likelihood: θ⋆ = argmax_θ Pr(Y | X, θ). Note that in this paper, we only utilize the GP means, without uncertainty propagation, to represent the predicted values of the nonlinear dynamics. For more details on GPs and their usage in control, readers are referred to [1], [2].
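To make (1a) and (1b) concrete, the following minimal Julia sketch (Julia being the language the authors report using for their implementation; the squared-exponential kernel and all function names here are our assumptions, not the paper's code) computes the joint prediction at M new inputs.

```julia
using LinearAlgebra

# Squared-exponential covariance function; an assumed kernel choice,
# since the paper keeps k(x, x'; θ) generic.
sqexp(x, xp; l=1.0, sf=1.0) = sf^2 * exp(-sum(abs2, x .- xp) / (2 * l^2))

# Covariance matrix between two collections of inputs stored as columns.
covmat(A, B; k=sqexp) = [k(A[:, i], B[:, j]) for i in 1:size(A, 2), j in 1:size(B, 2)]

# Joint GP prediction at M new inputs Xs (n × M), following (1a)-(1b).
function gp_predict(Xs, X, Y, sn2)
    K   = covmat(X, X)            # N × N covariance at the training inputs
    Ks  = covmat(Xs, X)           # M × N cross-covariance
    Kss = covmat(Xs, Xs)          # M × M covariance at the new inputs
    A   = K + sn2 * I             # K + σn² I
    mu  = Ks * (A \ Y)            # predictive mean (1a)
    Sig = Kss - Ks * (A \ Ks')    # posterior covariance (1b)
    return mu, Sig
end
```

For a single test input (an n × 1 matrix Xs), the returned 1 × 1 covariance is the scalar predictive variance σ⋆².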
B. Receding Horizon Active Learning and Control with Gaussian Processes

We define the control input vector as u ∈ ℝ^{n_u}, the vector of GP output variables y ∈ ℝ^{n_y}, and the vector of non-GP variables z ∈ ℝ^{n_z}. For any variable □, where □ is y, z, or u, let □_k denote its value at time step k. The state of the system at time step k comprises y_k and z_k. The GP dynamics of the system express y as y_k ~ G(x_k; m, k), where G(·; m, k) is a GP with mean function m and covariance function k. The input vector x_k, or features, of the GP is formed from current and past values of the control inputs u_τ and non-GP states z_τ, for τ ≤ k, as well as from past GP outputs y_τ, for τ < k. If y is multivariate, i.e., n_y > 1, its GP dynamics can be represented by n_y independent single-output GPs or by a multi-output GP [2]. To simplify the notation and formulation, we assume that y is scalar; however, our results can be readily extended to multivariate y. Given an input x_k, let ȳ_k = μ(x_k) denote the predictive mean of the GP model G(·; m, k) at x_k.

Let H > 0 be the length of the control horizon, t be the current time step, and I_t = {t, ..., t + H − 1} be the set of all time steps in the control horizon at time step t. Denote Ȳ_t = {ȳ_k | k ∈ I_t}, Z_t = {z_k | k ∈ I_t}, U_t = {u_k | k ∈ I_t}, and X_t = {x_k | k ∈ I_t} as the sets collecting the predictive GP output means, the non-GP states, the control inputs, and the GP inputs over the control horizon. To simplify the mathematical notation, we will use [X] to denote the vector concatenation of all vectors in a set X (e.g., [X_t] = [x_k^T]_{k∈I_t}^T).

Given a GP model trained on the data generated up to the current time step, the most informative GP regressor vectors in the next horizon X_t can be determined by maximizing the conditional differential entropy of the GP predictions at these vectors, which is computed from the log determinant of the covariance matrix [11]. In other words, the optimal sampling states for the dynamical system in the next H time steps can be obtained by solving the following optimization problem

$$X_t^\star = \underset{X_t}{\operatorname{argmax}} \ \log\det\big(\Sigma_{\mathcal{G}_f}([X_t])\big) \tag{2}$$

where Σ_{G_f}([X_t]) is the H × H posterior covariance matrix of the GP predictions at the H input vectors in the set X_t and can be computed by (1b).

The RHALC problem is thus formulated as follows:

$$\begin{aligned}
\underset{\{U_t, Z_t\}}{\text{minimize}} \quad & J(\bar{Y}_t, U_t, Z_t) - \gamma H(X_t) \\
\text{subject to} \quad & \bar{y}_k = \mu_f(x_k), \ \forall k \in \mathcal{I}_t \\
& g_j(\bar{Y}_t, U_t, Z_t) \le 0, \ \forall j \in \mathcal{J}_{\mathrm{ieq}} \\
& h_j(\bar{Y}_t, U_t, Z_t) = 0, \ \forall j \in \mathcal{J}_{\mathrm{eq}}
\end{aligned} \tag{3}$$

where H(X_t) = log det(Σ([X_t])) is the active learning term, J(Ȳ_t, U_t, Z_t) is a control objective function, g_j(Ȳ_t, U_t, Z_t) ≤ 0 and h_j(Ȳ_t, U_t, Z_t) = 0 are inequality and equality constraints, and J_ieq and J_eq are the sets of inequality and equality constraint indices, respectively. In problem (3), γ is a positive constant representing a tradeoff between the learning and control objectives. Similar to [12], we make the following assumption about problem (3).
Assumption 1: Suppose that J is convex, each g_j is convex, and each h_j is affine in the optimization variables U_t and Z_t. In other words, the non-convexity of problem (3) results from the GP dynamics and the log determinant of the GP predictive covariance matrix.

Assumption 1 holds in many applications of RHC because the non-convexity usually comes solely from the system dynamics [6].
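Before turning to the algorithm, here is a minimal sketch of how the active learning term H(X_t) in (2)-(3) can be evaluated, reusing gp_predict from the listing above (again an illustrative assumption, not the paper's code):

```julia
# Active-learning term H(X_t) of (2)-(3): the log determinant of the
# H × H posterior covariance of the joint GP prediction at the horizon
# inputs Xhorizon (n × H), given training data (X, Y) and noise variance sn2.
function active_learning_term(Xhorizon, X, Y, sn2)
    _, Sig = gp_predict(Xhorizon, X, Y, sn2)
    return logdet(Symmetric(Sig))   # Symmetric() guards against round-off asymmetry
end
```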
Algorithm 1: Sequential Convex Programming for RHALC
Require: U_t^(0), Z_t^(0), ρ^(0) > 0, 0 < r_0 < r_1 < r_2 < 1, β_fail < 1, β_succ > 1, ε > 0, j_max > 0
1:  Simulate G_f with U_t^(0), obtain Y_t^(0)
2:  φ^(0) ← φ(Y_t^(0), U_t^(0), Z_t^(0))
3:  for j = 0, ..., j_max − 1 do
4:      Form convex subproblem (7) by using (4) and (5)
5:      Solve problem (7) to get Ỹ_t, Ũ_t, Z̃_t
6:      Simulate G_f with Ũ_t to obtain Y_t
7:      δ^(j) ← φ^(j) − φ(Y_t, Ũ_t, Z̃_t)
8:      δ̃^(j) ← φ^(j) − φ̃(Ỹ_t, Ũ_t, Z̃_t)
9:      if |δ̃^(j)| ≤ ε then stop and return U_t^(j)
10:     r^(j) ← δ^(j) / δ̃^(j)
11:     if r^(j) < r_0 then
12:         Keep current solution: U_t^(j+1) ← U_t^(j), Z_t^(j+1) ← Z_t^(j), Y_t^(j+1) ← Y_t^(j)
13:         ρ^(j+1) ← β_fail ρ^(j)
14:     else
15:         Accept solution: U_t^(j+1) ← Ũ_t, Z_t^(j+1) ← Z̃_t, Y_t^(j+1) ← Y_t, φ^(j+1) ← φ(Y_t, Ũ_t, Z̃_t)
16:         if r^(j) < r_1 then ρ^(j+1) ← β_fail ρ^(j)
17:         else if r^(j) < r_2 then ρ^(j+1) ← ρ^(j)
18:         else ρ^(j+1) ← β_succ ρ^(j)
19: return U_t^(j_max)
III. SUCCESSIVE CONVEX PROGRAMMING FOR RECEDING HORIZON ACTIVE LEARNING AND CONTROL
The problem (3) is highly nonconvex due to the active learning objective and the GP dynamics. Moreover, the complexity of the objective function, involving the log determinant of the GP predictive covariance matrix, makes problem (3) computationally intractable; for instance, the popular nonlinear programming solver Ipopt [13] failed to solve it. In this section, we employ the Sequential Convex Programming (SCP) approach [10] to effectively address this nonconvex and complex problem.

Suppose that nominal feasible control inputs are given in U_t⋆ = {u_k⋆ | k ∈ I_t}. We then simulate the GP G_f over the RHC horizon to obtain the nominal output means Y_t⋆ = {y_k⋆ | k ∈ I_t}. The nominal regressor vectors in X_t⋆ = {x_k⋆ | k ∈ I_t} can be obtained from these values. Consider small perturbations to the nominal control inputs, u_k = u_k⋆ + Δu_k, which are collected in ΔU_t = {Δu_k | k ∈ I_t}. They will cause perturbations to the predictive output means and regressor vectors over the MPC horizon: ΔY_t = {Δy_k = y_k − y_k⋆ | k ∈ I_t} and ΔX_t = {Δx_k = x_k − x_k⋆ | k ∈ I_t}. Using these perturbation variables, the RHALC problem (3) can be reformulated equivalently. This equivalent formulation can then be approximated locally around the nominal values by replacing the GP predictive means ȳ_k = μ(x_k) and the log determinant of the GP predictive covariance matrix with their first-order approximations, as follows.

Define ỹ_k as the first-order approximation of y_k around the nominal solution x_k⋆, which can be computed as

$$\tilde{y}_k = \mu_f(x_k^\star) + \nabla\mu_f(x_k^\star)^T \Delta x_k \tag{4}$$

where from (1a) we have

$$\mu_f(x_k^\star) = k(x_k^\star, X)(K + \sigma_n^2 I)^{-1} Y, \qquad \nabla\mu_f(x_k^\star) = K^{(1)}(x_k^\star, X)^T (K + \sigma_n^2 I)^{-1} Y$$

with K^{(1)} = ∇_x k being the gradient of k with respect to its first argument. We also define Ỹ_t as the collection of ỹ_k for k ∈ I_t. Meanwhile, the first-order approximation of H(X_t) = log det(Σ([X_t])) around a nominal solution [X_t⋆] is computed by

$$\tilde{H}(\Delta X_t) = \log\det\big(\Sigma([X_t^\star])\big) + \nabla \log\det\big(\Sigma([X_t^\star])\big)^T [\Delta X_t] \tag{5}$$

Note that the derivative of the log determinant of the GP predictive covariance matrix with respect to each element ν_j of a vector ν can be computed by

$$\frac{\partial \log\det(\Sigma(\nu))}{\partial \nu_j} = \operatorname{tr}\!\left(\Sigma^{-1}(\nu)\, \frac{\partial \Sigma(\nu)}{\partial \nu_j}\right)$$

where from (1b) we have

$$\frac{\partial \Sigma(\nu)}{\partial \nu_j} = \frac{\partial K_{\star\star}}{\partial \nu_j} - \frac{\partial K_\star}{\partial \nu_j}(K + \sigma_n^2 I)^{-1} K_\star^T - K_\star (K + \sigma_n^2 I)^{-1} \left(\frac{\partial K_\star}{\partial \nu_j}\right)^T$$

With a slight abuse of notation, we will write U_t = U_t⋆ + ΔU_t. We now obtain the convexified RHALC problem, as stated below.

$$\begin{aligned}
\underset{\Delta U_t,\, Z_t}{\text{minimize}} \quad & J(\tilde{Y}_t, U_t^\star + \Delta U_t, Z_t) - \gamma \tilde{H}(\Delta X_t) && \text{(6a)} \\
\text{subject to} \quad & \tilde{y}_k = \mu_f(x_k^\star) + \nabla\mu_f(x_k^\star)^T \Delta x_k, \ \forall k \in \mathcal{I}_t && \text{(6b)} \\
& \|\Delta u_k\|_\infty \le \rho, \ \|\Delta x_k\|_\infty \le \rho, \ \forall k \in \mathcal{I}_t && \text{(6c)} \\
& g_j(\tilde{Y}_t, U_t^\star + \Delta U_t, Z_t) \le 0, \ \forall j \in \mathcal{J}_{\mathrm{ieq}} && \text{(6d)} \\
& h_j(\tilde{Y}_t, U_t^\star + \Delta U_t, Z_t) = 0, \ \forall j \in \mathcal{J}_{\mathrm{eq}} && \text{(6e)}
\end{aligned}$$

Constraints (6c) specify a trust region, that is, the local neighborhood around the nominal solution in which the local convexified subproblem is valid.
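A sketch of the two linearization ingredients follows, assuming the squared-exponential kernel from the first listing (the paper keeps the kernel generic); the finite-difference derivative in the second function stands in for the analytic kernel derivatives ∂K⋆/∂ν_j, and all names are our assumptions.

```julia
# GP mean and its gradient at a nominal input xs: equations (1a) and (4).
function gp_mean_and_grad(xs, X, Y, sn2; l=1.0, sf=1.0)
    N  = size(X, 2)
    w  = (covmat(X, X) + sn2 * I) \ Y                    # (K + σn² I)⁻¹ Y
    ks = [sqexp(xs, X[:, i]; l=l, sf=sf) for i in 1:N]
    # For the SE kernel, ∇ₓ k(x, xᵢ) = -k(x, xᵢ) (x - xᵢ) / l².
    dk = reduce(hcat, [-ks[i] * (xs - X[:, i]) / l^2 for i in 1:N])
    return dot(ks, w), dk * w                            # μ_f(xs), ∇μ_f(xs)
end

# Gradient of ν ↦ log det Σ(ν) via the trace identity
# ∂ logdet Σ / ∂ν_j = tr(Σ⁻¹ ∂Σ/∂ν_j), with ∂Σ/∂ν_j approximated here by
# central finite differences instead of the analytic kernel derivatives.
function logdet_grad(Sigma_of, nu; h=1e-6)
    Sig = Sigma_of(nu)
    g = zeros(length(nu))
    for j in eachindex(nu)
        e = zeros(length(nu)); e[j] = h
        dSig = (Sigma_of(nu + e) - Sigma_of(nu - e)) / (2h)
        g[j] = tr(Sig \ dSig)                            # tr(Σ⁻¹ ∂Σ/∂ν_j)
    end
    return g
end
```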
To avoid the artificial infeasibility [10] of the problem due to the approximations (4) and (5), the inequality and equality constraints in (6) are encoded into the objective function by exact penalty functions [10], leading to the penalized convex problem corresponding to (6):

$$\begin{aligned}
\underset{\Delta U_t,\, Z_t}{\text{minimize}} \quad & J(\tilde{Y}_t, U_t^\star + \Delta U_t, Z_t) - \gamma \tilde{H}(\Delta X_t) \\
& + \sum_{j \in \mathcal{J}_{\mathrm{ieq}}} \tau_j \max\big(0,\, g_j(\tilde{Y}_t, U_t^\star + \Delta U_t, Z_t)\big) \\
& + \sum_{j \in \mathcal{J}_{\mathrm{eq}}} \lambda_j \big|h_j(\tilde{Y}_t, U_t^\star + \Delta U_t, Z_t)\big| && \text{(7a)} \\
\text{subject to} \quad & \tilde{y}_k = \mu_f(x_k^\star) + \nabla\mu_f(x_k^\star)^T \Delta x_k, \ \forall k \in \mathcal{I}_t && \text{(7b)} \\
& \|\Delta u_k\|_\infty \le \rho, \ \|\Delta x_k\|_\infty \le \rho, \ \forall k \in \mathcal{I}_t && \text{(7c)}
\end{aligned}$$

where τ_j, ∀j ∈ J_ieq, and λ_j, ∀j ∈ J_eq, are large positive penalty weights. Under Assumption 1, the penalized subproblem (7) is convex and can usually be solved efficiently by convex solvers. The SCP algorithm for solving the RHALC problem is outlined in Algorithm 1 and briefly described below. For further details on the SCP method and its application to the RHC problem with GP dynamics, the reader is referred to [10] and [12], respectively.

At algorithmic iteration j, for k ∈ I_t, let u_k^(j) and z_k^(j) be the current solution, y_k^(j) be the corresponding outputs, x_k^(j) be constructed from those values, ρ^(j) be the current trust region radius, and φ^(j) be the current exact penalized cost. The first-order approximations of the GP predictive means and the log determinant of the predictive covariance matrix are computed to form the penalized convex subproblem (7). The convex subproblem is then solved, resulting in the approximate solution (Ỹ_t, Ũ_t = U_t^(j) + ΔŨ_t, Z̃_t). The exact Y_t and X_t are calculated by simulating the GP model with the inputs Ũ_t. The actual cost reduction δ^(j) and the predicted cost reduction δ̃^(j) are respectively computed in steps 7 and 8, where φ(Y_t, Ũ_t, Z̃_t) is the cost value of the original problem (3) while φ̃(Ỹ_t, Ũ_t, Z̃_t) is the cost value of the penalized convex subproblem (7). If |δ̃^(j)| ≤ ε for a predefined tolerance ε > 0, the solution is considered converged and the algorithm terminates. Otherwise, depending on the ratio r^(j) = δ^(j)/δ̃^(j) in comparison with three predefined thresholds 0 < r_0 < r_1 < r_2 < 1, the solution is accepted or rejected and the trust region is adapted by two predefined factors β_fail < 1 and β_succ > 1, as presented in steps 10–18.
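The acceptance and trust-region logic of steps 10–18 can be summarized in a few lines; the threshold and scaling defaults below are illustrative placeholders for r_0 < r_1 < r_2, β_fail, and β_succ, not the paper's values.

```julia
# Trust-region update of Algorithm 1: decide whether to accept the SCP step
# and how to rescale the radius ρ from the ratio r = δ/δ̃ of actual to
# predicted cost reduction.
function trust_region_step(r, rho; r0=0.01, r1=0.25, r2=0.9, bfail=0.5, bsucc=2.0)
    accept = r >= r0                 # reject the step when agreement is too poor
    if r < r1
        rho = bfail * rho            # poor agreement: shrink the trust region
    elseif r >= r2
        rho = bsucc * rho            # excellent agreement: enlarge it
    end                              # for r in [r1, r2), keep rho unchanged
    return accept, rho
end
```

Each SCP iteration computes r^(j) and calls a rule of this form to decide whether to keep (Ỹ_t, Ũ_t, Z̃_t) and how to set ρ^(j+1).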
IV. SIMULATION

This section validates the advantages of the RHALC problem in two scenarios, experiment design and simultaneous learning and control, as well as the effectiveness of the SCP algorithm in solving the problems in real time, through a numerical simulation of an autonomous racing car example.
A. Autonomous racing car example
We revisit the example of an autonomous racing car mentioned in Section I. The simulation consists of two phases: a learning phase and a racing phase. During the learning phase, starting with initial GP models trained on a few historical data points, the controller collects and adds new data points and retrains the GP models. Once sufficient data for learning accurate models have been obtained, the learning phase is disabled and the obtained models are utilized to perform the tracking control task in the racing phase.

The dynamics of the vehicle used in the simulation are described by the following continuous-time kinematic bicycle model:

$$\dot{x} = v \cos(\theta + \beta), \quad \dot{y} = v \sin(\theta + \beta), \quad \dot{\theta} = \frac{v}{l_r} \sin(\beta), \quad \dot{v} = a \tag{8}$$

where l_f and l_r are the distances from the center of mass of the vehicle to the front and rear axles, respectively; β = tan⁻¹((l_r / (l_f + l_r)) tan(α)) is the angle of the current velocity of the center of mass with respect to the longitudinal axis of the car; (x, y) is the position vector of the vehicle on a two-dimensional plane; θ is the heading angle; v is the speed of the vehicle; and the two control inputs a and α are respectively the linear acceleration and the steering angle of the vehicle. The full description and analysis of the kinematic bicycle model can be found in [14].

The vehicle's dynamics are discretized with a sampling time ΔT > 0, leading to the following discrete-time form:

$$x_{k+1} = x_k + \Delta x_k, \quad y_{k+1} = y_k + \Delta y_k, \quad \theta_{k+1} = \theta_k + \Delta\theta_k, \quad v_{k+1} = v_k + \Delta T\, a_k \tag{9}$$

in which the one-step changes Δx_k, Δy_k, and Δθ_k are nonlinear in the other variables. In this example, these nonlinear components are learned by three GP models. Specifically, Δx_k ~ G_Δx(𝐱_{p,k}), Δy_k ~ G_Δy(𝐱_{p,k}), and Δθ_k ~ G_Δθ(𝐱_{a,k}), in which the vectors of GP inputs 𝐱_{p,k} and 𝐱_{a,k} are

$$\mathbf{x}_{p,k} = [\cos\theta_k, \sin\theta_k, v_k, \alpha_k]^T, \qquad \mathbf{x}_{a,k} = [v_k, \alpha_k]^T.$$

Note that the GP input vectors, written in bold, are different from the vehicle's position x_k and have different notations. The GP models result in the following GP dynamical equations:

$$\Delta\bar{x}_k = \mu_{\Delta x}(\mathbf{x}_{p,k}), \quad \Delta\bar{y}_k = \mu_{\Delta y}(\mathbf{x}_{p,k}), \quad \Delta\bar{\theta}_k = \mu_{\Delta\theta}(\mathbf{x}_{a,k}). \tag{10}$$

The RHALC formulation for this example is given by

$$\begin{aligned}
\underset{\{a_k, \alpha_k\}_{k \in \mathcal{I}_t}}{\text{minimize}} \quad & J - \gamma (H_x + H_y + H_\theta) && \text{(11a)} \\
\text{subject to} \quad & \text{(9) and (10)} && \text{(11b)} \\
& v_{\min} \le v_k \le v_{\max} && \text{(11c)} \\
& a_{\min} \le a_k \le a_{\max}, \quad \alpha_{\min} \le \alpha_k \le \alpha_{\max} && \text{(11d)} \\
& a_k \begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix} \le b_k && \text{(11e)}
\end{aligned}$$

where the constraints hold for all k ∈ I_t, (11c) are velocity bound constraints, (11d) are bound constraints on the control actions, and (11e) are affine constraints on the location of the vehicle (e.g., collision avoidance with the borders of the racing track). The objective of this problem consists of two parts: trajectory tracking control and active learning for the dynamics. The tracking control part aims to track a given reference trajectory, while the active learning objective drives the vehicle to the locations that provide the most informative data for learning the system dynamics. The positive constant γ balances the tradeoff between the active learning and control objectives.

Fig. 1: Constrained regions (shaded area) constituted by half-planes (gray lines) for time steps in the horizon to avoid crashes into the border (red lines). The blue dashed line denotes the reference trajectory, while black dots denote the reference points at those time steps.
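For reference, one step of the discrete-time model (9) can be obtained from (8) by forward-Euler integration; the Euler discretization is our assumption (the paper only states that the dynamics are discretized with sampling time ΔT), and in the controller the increments are replaced by the GP means (10). The ΔT, l_f, and l_r values follow Section IV-A.

```julia
# One forward-Euler step of the kinematic bicycle model (8), producing the
# increments Δx, Δy, Δθ that the GP models of (10) are trained to predict.
function bicycle_step(x, y, theta, v, a, alpha; dT=0.2, lf=0.205, lr=0.386)
    beta = atan(lr / (lf + lr) * tan(alpha))   # velocity angle w.r.t. the car axis
    xn = x + dT * v * cos(theta + beta)        # Δx_k = ΔT v cos(θ + β)
    yn = y + dT * v * sin(theta + beta)        # Δy_k = ΔT v sin(θ + β)
    thetan = theta + dT * v / lr * sin(beta)   # Δθ_k = ΔT (v / l_r) sin(β)
    vn = v + dT * a                            # v_{k+1} = v_k + ΔT a_k
    return xn, yn, thetan, vn
end
```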
The control objective function J is given by

$$J = \sum_{k=t}^{t+H-1} \left\| \begin{bmatrix} a_k \\ \alpha_k \end{bmatrix} \right\|_R^2 + \left\| \begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix} - r_{k+1} \right\|_Q^2$$

where r_{k+1} denotes the reference at time step k + 1. Given a vector ν and a positive semidefinite matrix M, we define ‖ν‖²_M = νᵀMν. The active learning goals for the GP models are captured by H_x, H_y, and H_θ, given by

$$H_x = \log\det\big(\Sigma_{\mathcal{G}_{\Delta x}}(\mathbf{x}_{p,t+1:t+H})\big), \quad H_y = \log\det\big(\Sigma_{\mathcal{G}_{\Delta y}}(\mathbf{x}_{p,t+1:t+H})\big), \quad H_\theta = \log\det\big(\Sigma_{\mathcal{G}_{\Delta\theta}}(\mathbf{x}_{a,t+1:t+H})\big)$$

where 𝐱_{p,t+1:t+H} and 𝐱_{a,t+1:t+H} denote the concatenated vectors of GP inputs from time t + 1 to t + H. To avoid collisions with the border of the track, we employ a scheme presented in [15] that limits the movement of the car at each time step in the horizon to lie between two parallel half-planes, one for the right border and one for the left border, as illustrated in Figure 1. As a result, the collision avoidance scheme can be represented as the set of linear inequality constraints (11e).
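A sketch of how one pair of half-plane constraints in (11e) could be constructed from the local track geometry, in the spirit of the corridor scheme of [15]; the tangent/normal construction and all names are our assumptions rather than the paper's implementation.

```julia
# Two parallel half-planes bounding the car's position at one horizon step.
# d is the unit tangent of the reference at that step; pl and pr are points
# on the left and right borders, respectively.
function corridor_halfplanes(d, pl, pr)
    n = [-d[2], d[1]]                   # left-pointing normal to the track direction
    A = [n'; -n']                       # rows of the matrix a_k in constraint (11e)
    b = [dot(n, pl), -dot(n, pr)]       # n·p ≤ n·pl and n·p ≥ n·pr
    return A, b                         # enforce A * [x_{k+1}; y_{k+1}] ≤ b
end
```

Stacking one such (A, b) pair per time step in the horizon yields the affine constraints a_k [x_{k+1}; y_{k+1}] ≤ b_k of (11e).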
The sampling time ΔT is chosen to be 200 ms, while the length H of the control horizon is 5. The system specifications of the racing car are: l_r = 0.386 m, l_f = 0.205 m, v_min = 0 m/s, v_max = 2 m/s, α_min/max = ±π/… , and a_min/max = ±… , while the weights in the objective function are: Q = diag([10, …]), R = diag([10^{-…}, …]), and γ = 10. Depending on the configuration of the simulation, where the tracking control or active learning objective is omitted, these weights may be set to zero. All the penalty weights in (7) are chosen to be … . The parameters of the SCP algorithm are chosen as: ρ^(0) = 0.…, β_fail = 0.…, β_succ = 2.…, r_0 = 0.…, r_1 = 0.…, r_2 = 0.…, j_max = 100, and ε = 10^{-…}.

In what follows, we present the two scenarios for the autonomous racing car example considered in this paper: Offline Learning and Simultaneous Learning and Control (Online Learning). While in the former scenario a simulated experiment is conducted to generate a training dataset, the latter considers the more challenging problem where the system is required to simultaneously learn the GP models and perform the tracking control task. Two different virtual racing tracks, developed by the University of California, Berkeley (UCB) [16] and the University of Pennsylvania (UPenn) [17], are used in the simulations.
1) Offline learning:
Assume that a large area without obstacles is available for experiments, i.e., the vehicle can move freely in the experimental space. We design an experiment in the learning phase to obtain an optimal dataset for training the GP dynamics prior to the race. This procedure is also referred to as Optimal Experiment Design (OED); this simulation is therefore similar to the problem setup in [9]. In this scenario, only the active learning term is included in the objective function of the OED problem, while the control objective J is removed, i.e., Q = R = 0. Note that the constraints (11e) on the locations are simplified to bound constraints on x and y, x_min ≤ x_{k+1} ≤ x_max and y_min ≤ y_{k+1} ≤ y_max, for all k ∈ I_t, to ensure that the car stays within the experimental space.

We compare the optimal experiment design using RHALC with a randomized experiment design in which random control input signals satisfying all the input constraints are applied to the system. In the optimal experiment design, three simple GP models with 25 initial data points are utilized; the RHALC problem is then applied for 50 time steps to collect new data points, i.e., final GP models with 75 data observations are obtained from this optimal experiment. The models learned from the experimental data are used to perform a reference tracking control task in the racing phase. The receding horizon reference tracking problem can be derived by removing the active learning term in (11), i.e., γ = 0. Note that, contrary to the receding horizon active learning problem, the receding horizon reference tracking problem is not computationally intractable given the learned GP models with small training datasets, and thus can be solved in real time by nonlinear programming solvers (e.g., Ipopt [13]). However, in this simulation, the SCP algorithm is utilized to solve the problem.
2) Simultaneous Learning and Control:
In applications where a free area for experiments is not available, it is required to perform Simultaneous Active Learning and Control, or Online Learning, in the learning phase. In other words, the vehicle is expected to efficiently collect online data to update the GP dynamics while following the racing track and avoiding collisions with the borders. Hence, this scenario covers the dual control problem in [8].

To validate the benefits of active learning, we consider two simulations depending on the effect of the active learning term. In the first simulation, with γ > 0, the active learning objective is included to explore the state space, whereas in the second simulation active learning is disabled, i.e., γ = 0, and the vehicle only tracks the reference and collects online data along the racing track. Since on the racing track the car faces a more complicated control task than in the free area of the offline learning simulation, the initial GP models need more training data points. Particularly, three initial GP models trained on 50 initial data points are adopted to control the system at the beginning; then 100 new data samples are collected online. As a result, three final GP models with 150 data points are obtained at the end of the learning phase. The simultaneous learning and control simulation is conducted on the UPenn track, while the UCB track is utilized for a testing race given the models learned from the previous race. Similar to the Offline Learning simulation, the SCP algorithm is utilized to solve the optimization problems in both the learning and racing phases.

B. Results and discussions
The trajectories of the autonomous racing car in the offline learning simulation are shown in Figure 2. Using the GP models from the optimal experiment, the receding horizon tracking controller can accurately track the reference and avoid collisions with the left and right borders. In contrast, the GP models from the randomized experiment are apparently not accurate enough for model predictive control, since the car cannot complete one racing lap without crashing into the borders. In particular, the car initially tracks the reference well, but collisions with the borders happen when the car needs to turn sharply. Hence, the models obtained from the optimal experiment outperform those from the randomized experiment.

Fig. 2: The trajectories (blue lines) of the autonomous vehicle in two racing tracks (red lines denote racing borders, dashed black lines denote the references), with the offline GP models from the optimal experiment ((a) and (c)) and the randomized experiment ((b) and (d)).

Figure 3 shows the trajectories of the racing car in the Simultaneous Learning and Control simulation. As can be seen from the figure, with active learning in the objective function, the vehicle initially fluctuates around the reference to explore the informative states; hence the car does not perfectly track the reference in the first 100 time instants, but the safety condition of the car is still guaranteed. However, once the learning phase is completed, the obtained GP models are accurate enough that the vehicle is able to track the reference trajectory for the rest of the racing track as well as on a new testing track. Meanwhile, without active learning, at the beginning of the race, where the racing track is relatively simple, the car can track the reference. However, since the learned models do not have enough excitation, the car deviates from the reference trajectory when the track becomes more complicated (i.e., when the car must turn sharply), leading to crashes into the border.

Furthermore, the prediction accuracy of the GP models obtained from all simulations is shown in Table I. We compare four types of GP models, OE, RE, AL, and Non-AL, which are correspondingly obtained from the offline optimal experiment, the offline randomized experiment, the learning and control simulation with active learning, and the learning and control simulation without active learning. For each model, two validation metrics are considered: the root mean squared error (RMSE) and the maximum absolute error (MAE). These validation metrics are computed using the GP predictions and the corresponding latent nonlinear functions in (9). We test the predictions on a grid of GP inputs in which 20 linearly spaced points in the region of interest for each input are utilized. According to the table, with the same number of training data points, the three GP models generated from the optimal experiment show better prediction accuracy than those from the randomized experiment. Likewise, based on the metrics for the AL and Non-AL models, it is clear that active learning can improve GP precision in the simultaneous learning and control framework.
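A minimal sketch of how such grid-based validation metrics can be computed; the grid ranges and the two-dimensional input (matching 𝐱_{a,k} = [v, α]) are our assumptions, since the paper does not list the region of interest explicitly.

```julia
# RMSE and maximum absolute error of a GP mean against the latent function,
# evaluated on a grid of 20 linearly spaced points per input dimension.
function validate(gp_mean, latent; lo=[0.0, -0.5], hi=[2.0, 0.5], npts=20)
    errs = Float64[]
    for v in range(lo[1], hi[1]; length=npts), a in range(lo[2], hi[2]; length=npts)
        push!(errs, gp_mean([v, a]) - latent([v, a]))
    end
    rmse = sqrt(sum(abs2, errs) / length(errs))
    mae  = maximum(abs, errs)
    return rmse, mae
end
```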
Regarding the computation time: in the optimal experiment, the SCP algorithm takes 0.095 s on average for each time instant, while in the race it takes 0.067 s and 0.061 s on the UCB and UPenn racing tracks, respectively. Meanwhile, in the online learning simulation that includes active learning, the SCP algorithm takes 0.154 s on average, whereas in the control phase it takes 0.094 s for each time step. We do not collect the computation time in the other simulations, where the car is not able to complete a lap of the racing track. Note that all simulations in this work are performed on a Dell computer with an Intel Core i5 CPU, and the Julia programming language is used for the implementation.

TABLE I: Validation metrics for the different types of obtained GP models.

              G_Δx             G_Δy             G_Δθ
           RMSE    MAE      RMSE    MAE      RMSE    MAE
  OE       0.050   0.151    0.058   0.189    0.006   0.017
  RE       0.139   0.443    0.084   0.397    0.020   0.046
  AL       0.036   0.192    0.041   0.161    0.015   0.070
  Non-AL   0.063   0.296    0.114   0.498    0.038   0.238

V. CONCLUSION
In this paper, a receding horizon active learning and control problem for dynamical systems using Gaussian Processes was taken into consideration. We first developed a problem formulation subject to the Gaussian Process dynamics, in which the exact conditional differential entropy was employed as an optimization metric for active learning. The resulting complex and non-convex problem was then solved by the Successive Convex Programming algorithm. The proposed method was validated by numerical simulations of an autonomous racing car example. The control algorithm not only guarantees the tracking performance in both offline and online learning scenarios but also can be executed in a reasonable amount of time, which promises its potential practicality for real-time implementation. Our future work will focus on applying the proposed approach to real-world systems.

Fig. 3: The trajectories (blue lines) of the autonomous vehicle in two racing tracks (red lines denote racing borders, dashed black lines denote the references), with the GP models updated online with active learning ((a) and (c)) and without active learning ((b) and (d)).

REFERENCES
[1] C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
[2] J. Kocijan, Modelling and Control of Dynamic Systems Using Gaussian Process Models. Springer, 2016.
[3] M. Samuel, M. Hussein, and M. B. Mohamad, "A review of some pure-pursuit based path tracking techniques for control of autonomous vehicle," International Journal of Computer Applications, vol. 135, no. 1, pp. 35–38, 2016.
[4] C. C. Aggarwal, X. Kong, Q. Gu, J. Han, and S. Y. Philip, "Active learning: A survey," in Data Classification: Algorithms and Applications. CRC Press, 2014, pp. 571–605.
[5] T. Alpcan, "Dual control with active learning using Gaussian process regression," arXiv preprint arXiv:1105.2211, 2011.
[6] A. Jain, T. Nghiem, M. Morari, and R. Mangharam, "Learning and control using Gaussian processes," in Proc. ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS). IEEE, 2018, pp. 140–149.
[7] A. Capone, G. Noske, J. Umlauft, T. Beckers, A. Lederer, and S. Hirche, "Localized active learning of Gaussian process state space models," in Learning for Dynamics and Control. PMLR, 2020, pp. 490–499.
[8] T. Alpcan and I. Shames, "An information-based learning approach to dual control," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 11, pp. 2736–2748, 2015.
[9] M. Buisson-Fenet, F. Solowjow, and S. Trimpe, "Actively learning Gaussian process dynamics," in Learning for Dynamics and Control. PMLR, 2020, pp. 5–15.
[10] Y. Mao, D. Dueri, M. Szmuk, and B. Açıkmeşe, "Successive convexification of non-convex optimal control problems with state constraints," IFAC-PapersOnLine, vol. 50, no. 1, pp. 4063–4069, 2017.
[11] L. V. Nguyen, S. Kodagoda, R. Ranasinghe, and G. Dissanayake, "Information-driven adaptive sampling strategy for mobile robotic wireless sensor network," IEEE Transactions on Control Systems Technology, vol. 24, no. 1, pp. 372–379, 2015.
[12] T. X. Nghiem, "Linearized Gaussian processes for fast data-driven model predictive control," in Proc. American Control Conference (ACC). IEEE, 2019, pp. 1629–1634.
[13] A. Wächter and L. T. Biegler, "On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming," Mathematical Programming, vol. 106, no. 1, pp. 25–57, 2006.
[14] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, "Kinematic and dynamic vehicle models for autonomous driving control design," in Proc. IEEE Intelligent Vehicles Symposium (IV). IEEE, 2015, pp. 1094–1099.
[15] A. Liniger, A. Domahidi, and M. Morari, "Optimization-based autonomous racing of 1:43 scale RC cars," Optimal Control Applications and Methods, vol. 36, no. 5, pp. 628–647, 2015.
[16] U. Rosolia and F. Borrelli, "Learning how to autonomously race a car: A predictive control approach," IEEE Transactions on Control Systems Technology, 2019.
[17] M. O'Kelly, H. Zheng, D. Karthik, and R. Mangharam, "F1TENTH: An open-source evaluation environment for continuous control and reinforcement learning," in Proceedings of the NeurIPS 2019 Competition and Demonstration Track. PMLR, 2020.