Active Model Learning using Informative Trajectories for Improved Closed-Loop Control on Real Robots
Weixuan Zhang, Marco Tognon, Lionel Ott, Roland Siegwart, Juan Nieto
Abstract — Model-based controllers on real robots require accurate knowledge of the system dynamics to perform optimally. For complex dynamics, first-principles modeling is not sufficiently precise, and data-driven approaches can be leveraged to learn a statistical model from real experiments. However, efficient and effective data collection for such a data-driven system on real robots is still an open challenge. This paper introduces an optimization problem formulation to find an informative trajectory that allows for efficient data collection and model learning. We present a sampling-based method that computes an approximation of the trajectory that minimizes the prediction uncertainty of the dynamics model. This trajectory is then executed, collecting the data to update the learned model. In experiments we demonstrate the capabilities of our proposed framework when applied to a complex omnidirectional flying vehicle with tiltable rotors. Using our informative trajectories results in models that outperform models obtained from non-informative trajectories by 13.3% with the same amount of training data. Furthermore, we show that the model learned from informative trajectories generalizes better than the one learned from non-informative trajectories, achieving better tracking performance on different tasks.
I. INTRODUCTION
Model-based controllers have been shown to be useful in various robotics applications. Especially when accurate models are available, these controllers can exhibit impressive performance [1], [2]. However, it can be hard to obtain a good dynamical model for complex systems such as humanoid robots [3], race cars on uneven terrains [4], soft robots [5], and novel fully actuated multi-rotor flying vehicles [6] like the one considered in this work (see Fig. 1).

One approach to solve the modeling problem is to rely on learning techniques: through interaction with the real world and data collection, a statistical dynamics model is trained, which is either directly fed into a model-based controller, e.g., [4], [7]–[9], or used in simulation to train a control policy [10]. One challenge for these approaches is that the training data often has a different distribution than the test data, for several reasons: first, model uncertainties and the feedback controller might lead the system to a state not encountered in a previous data collection routine. Second, given partial model knowledge, the region of the data that leads to the best performance is a priori unknown. Finally, the closed-loop dynamics change as the model used by the controller gets updated. One could perform a large number of experiments to cover as much of the input space as possible during training.
This work was supported by the NCCR Robotics, NCCR Digital Fabrication and Armasuisse. The authors are with the Autonomous Systems Lab, ETH Zürich, Leonhardstrasse 21, 8092 Zurich, Switzerland. E-mail: [email protected]
Fig. 1: The omnidirectional flying vehicle (omav) used to experimentally validate our method.

However, for robotic systems with high-dimensional and continuous state spaces, the search space is typically too large to be searched exhaustively. Furthermore, the dynamics can change significantly between consecutive experiments, e.g., a crash of a flying vehicle that damages its motors, which invalidates the previous training data. When considering a specific task, a good model is only required in the working area of the state and input spaces. Thus, it is desirable to have an efficient scheme to collect training data locally around the desired task.

One idea is to use the statistical information learned from training data to infer the region in which to sample data and thus improve the sampling efficiency. This is a well-known approach in the machine learning community called active learning [11]. In this paper we make use of this idea: we rely on the previously learned statistical model to get an estimate of the region of interest. We then generate an informative trajectory that reduces the overall uncertainty in the estimated region. This trajectory is then executed in the real world to collect the data.

More specifically, in a first step, possible informative locations are inferred from the previously learned model in simulation. Then, different informative trajectories are sampled and evaluated according to a cost metric, which is defined as the integral of the predictive uncertainty over these possible locations. The most informative trajectory is then selected and executed on the real robot to collect the data. As a result, the model learned from this informative trajectory should result in improved control performance and better generalization. The latter is achieved because the informative trajectory reduces the uncertainty over a large region of the state and input space.

The contributions of this paper are as follows:
• A formal mathematical formulation of the problem of efficient data collection for learning a dynamics model.
• A practical strategy to efficiently collect task-relevant data that improves the control performance when used to update the learned model.
• Real experimental results conducted on a complex over-actuated omnidirectional flying system with nonlinear dynamics and 18 actuators.
A. Related work
Active learning in robotics is mostly defined in a regression setting: a regression mapping between an input and an output space is to be learned while the sample complexity is minimized. The exploration of the sample space is typically driven by some metric, often consisting of variants of the expected information gain.

Considering active dynamics model learning, existing work includes the use of information gain on parameter estimates ([12], [13]), Gaussian processes ([14]–[16]), and neural networks [17]. These works typically generate trajectories that minimize a defined metric, trading off between exploration and exploitation. Aside from the parameter-estimation approaches, little work has been done on real robots.

These approaches can also be categorized depending on whether the trajectory generation is performed online or offline. The online approaches often operate in a receding-horizon fashion [18], where trajectories are regenerated at a certain frequency on the fly during experiments. This constant update helps to reduce the distance between the desired inputs and the achieved ones. However, this approach is computationally intensive: while exploring a state of interest, the robot cannot always stay stationary waiting for a newly planned trajectory. To date, this method exists only in theoretical works validated in simulation [14], [15], [19].

The offline approach has the shortcoming that the planned trajectory has a larger distance to the executed one, but it is applicable on real robots. In [17], the trajectory generation is formulated as a variable-constrained problem and validated on a simulated overactuated robotic spacecraft. In [16], the input trajectories are parametrized by consecutive trajectory sections and the most informative and safe trajectory is then executed; their formulation does not take closed-loop control into account. The method is applied to a high-pressure fluid injection system. Our investigation belongs to this category: we make use of the previously learned model and simulations to reduce the deviation of the executed trajectory from the desired one. In this work, we demonstrate that this approach works for complex robots and efficiently improves the control performance.

II. MODELING AND PROBLEM STATEMENT
We consider a generic system whose dynamics in the discrete time domain is described by:

  x[k+1] = f(x[k], u[k]),    (1)

where f(·,·) is a Lipschitz-continuous function and represents the true dynamics¹. x[k] ∈ X ⊂ R^n and u[k] ∈ U ⊂ R^m describe the state and the control input of the dynamical system at time k ∈ N_{≥0}. Considering the state and input limitations of real systems, we take X and U to be compact sets. To simplify the notation, x[k] denotes x(kT), where T ∈ R_{>0} is the sampling time. We remark that f(·,·) is in general not available; we might have only an estimate of it, denoted by f̂(·,·).

¹This is a common assumption that does not limit the validity of the work, since most of the considered robotic systems have Lipschitz-continuous dynamics.

As is common in robotics, the considered task consists of a trajectory tracking problem. Given a desired task state trajectory defined by the sequence of state values X_task = (x_task[0], ..., x_task[N]) over the time horizon N ∈ N_{>0}, we assume that a model-based controller π(·) based on f̂ is provided such that, if

  u[k] = π(x_task[k], x[k], f̂),    (2)

x[0] = x_task[0], and f̂(x, u) = f(x, u) for every (x, u) ∈ Z = X × U, then x[k+1] = f(x[k], π(x_task[k], x[k], f̂)) = x_task[k+1] for every k = 0, ..., N−1. This condition describes perfect tracking of the desired trajectory. We define the sequence of inputs that provides perfect tracking as U_task = (u_task[0], ..., u_task[N]), called the task input trajectory. It seems that to achieve perfect tracking, we must know the true dynamics for every state and input pair.
Objective 1. Considering the closed-loop system (1) and (2), our objective is to define an active learning method aiming at optimizing the data collection process to
• make it more efficient (fewer experiments and data points),
• improve the precision of the learned model,
• improve the generalizability of the learned model,
• minimize the tracking error.

We shall show how the learning problem can be reformulated to address these objectives. Without loss of generality, we can decompose the true dynamics into two components:

  f(x[k], u[k]) = h(x[k], u[k]) + g(x[k], u[k]),    (3)

where
• h(·,·), called the first-principles dynamics, corresponds to a "first principles" model reflecting physical laws such as mass balance, energy balance, heat transfer relations, and so on. We consider h(·,·) to be known;
• g(·,·), called the residual dynamics, corresponds to all other elements not modeled by h. g is assumed unknown, and we only have an estimate of it, denoted by ĝ.

This modeling allows us to exploit the knowledge we already have about the system, reducing the learning effort and making it possible to employ several model-based controllers. Once again, it is clear that, considering the control law (2) with f̂ = h + ĝ, the closed-loop system achieves perfect tracking if ĝ(x, u) = g(x, u) for every (x, u) ∈ Z. We assume that a Bayesian prior model [20] of the residual dynamics g(x[k], u[k]) is given. That is, for a given test point (x, u), the value of g(x, u) is modeled with a Gaussian probability distribution N_{x,u}(·):

  g(x, u) ∼ N_{x,u}(·).    (4)

We denote the mean and variance of N_{x,u}(·) as µ(x, u) ∈ R^n and σ²(x, u) ∈ R^{n×n}, respectively. Note that the distribution is a function of the test point (x, u). The variance is an indication of the model confidence at the test point.

We consider the estimate of the residual dynamics to be ĝ(x, u) = µ(x, u), which leads to f̂(x, u) = h(x, u) + µ(x, u). As before, using the control law (2), it is clear that improving the knowledge of µ(x, u) around the working point would improve the tracking performance. To do so, suitable data must be collected. A possible solution is to simply run the task trajectory, over and over, until the data are sufficient to obtain a good model around (X_task, U_task). However, this would require many trials before the collected data are informative enough.

Departing from this basic approach, here we aim at designing an algorithm that automatically derives reference state trajectories X_inf, called informative state trajectories, that efficiently collect data to improve the prior model, thus reducing the tracking error when the model is employed in the control law (2). This is formalized in the following problem.
Problem 1. Find X_inf as the solution of:

  min_{X_inf}  Σ_{k=0}^{N} ‖x_task[k] − x[k]‖
  s.t.  x[k+1] = f(x[k], u[k])
        u[k] = π(x_task[k], x[k], f̂)
        f̂ = h + ĝ
        ĝ(x, u) = µ(x, u | (X̃_inf, Ũ_inf)),    (5)

where µ(x, u | (X̃_inf, Ũ_inf)) is the mean of the posterior model at (x, u), i.e., the updated model based on the data (X̃_inf, Ũ_inf) collected by letting the closed-loop system evolve using X_inf as the reference trajectory. In detail,

  x̃_inf[k+1] = f(x̃_inf[k], ũ_inf[k])    (6)
  ũ_inf[k] = π(x_inf[k], x̃_inf[k], f̂),    (7)

with x̃_inf[0] = x_inf[0].
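To make the Bayesian residual model (3)–(4) concrete, the following is a minimal sketch, not the authors' implementation: a Gaussian process is fitted to observed residuals g = f − h on a toy scalar system, and its posterior mean and variance play the roles of µ and σ². The dynamics, kernel choice, and all names are illustrative assumptions.

```python
# Minimal sketch of the residual model (3)-(4) on a toy scalar system;
# h, the data, and the kernel are illustrative choices, not the paper's.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def h(x, u):
    # First-principles part: a trivial stand-in linear model.
    return 0.9 * x + 0.1 * u

rng = np.random.default_rng(0)
Z_train = rng.uniform(-1.0, 1.0, size=(50, 2))        # columns: (x, u)
x_next = h(Z_train[:, 0], Z_train[:, 1]) + 0.05 * np.sin(3.0 * Z_train[:, 0])
g_train = x_next - h(Z_train[:, 0], Z_train[:, 1])    # residuals g = f - h

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(Z_train, g_train)

# Posterior mean mu(x, u) gives g_hat, so f_hat = h + mu as in the text;
# the predictive std quantifies the model confidence at the test point.
z = np.array([[0.2, -0.3]])
mu, std = gp.predict(z, return_std=True)
f_hat = h(z[0, 0], z[0, 1]) + mu[0]
print(f"f_hat = {f_hat:.4f}, sigma = {std[0]:.4f}")
```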
III. GENERATION OF INFORMATIVE TRAJECTORIES
This section shows how we can turn (5) into a simpler optimization problem aiming at minimizing an informative cost metric. The problem is solved by practical approximations leading to a sampling-based trajectory generation algorithm.
A. Minimization of the informative cost
Solving (5) is definitely not a trivial problem, even using sampling-based methods. In fact, since we do not know f, solving (5) would require running two experiments for every sampled informative state trajectory X_inf, using as reference first X_inf and then X_task.

In order to make the problem feasible from a practical point of view, let us recall that using the model-based controller (2), we can achieve perfect tracking by improving the knowledge of ĝ, or equivalently of N_{x,u}(·), for all (x, u) ∈ Z_task, where

  Z_task = {(x, u) ∈ Z | ∃ k ∈ (0, ..., N) s.t. (x, u) = (x_task[k], u_task[k])}    (8)

contains the state/input pairs that achieve perfect tracking of the task trajectory. A possible idea is that we can improve the model by minimizing the uncertainty of the prior model, i.e., σ²(x, u) for all (x, u) ∈ Z_task. We then reformulate (5) as

  min_{X_inf}  Σ_{(x,u) ∈ Z_task} σ²(x, u | (X̃_inf, Ũ_inf)).    (9)

Recall that (X̃_inf, Ũ_inf) are computed as in (6) and (7). Notice that we focus on reducing the informative cost on the space relevant to the task instead of the entire state/input space Z. However, from experimental considerations, we remark that improving the model only in Z_task is not enough to achieve good tracking performance. In fact, initial errors, noisy measurements, and external disturbances might make the system deviate from (X_task, U_task), visiting state/input pairs not included in Z_task for which the model could be imprecise. Therefore, to achieve good tracking also in these non-ideal and more realistic conditions, we propose to improve the learning of the model by solving (9) not only for the points in Z_task, but also for the ones that are sufficiently close², i.e., for all (x, u) ∈ Z_∆task, where

  Z_∆task = {(x, u) ∈ Z | ∃ (x_task, u_task) ∈ Z_task s.t. ‖(x, u) − (x_task, u_task)‖ ≤ ε},    (10)

with ε ∈ R_{≥0} being a heuristic that can be tuned to control the exploratory behavior of the informative trajectory. Problem (9) becomes:

  min_{X_inf}  ∫_{Z_∆task} σ²(x, u | (X̃_inf, Ũ_inf)) d(x, u).    (11)

From now on, we refer to the objective function to be minimized as the informative cost.

²With an abuse of notation, we consider ‖(x, u) − (x⋆, u⋆)‖ = ‖[x⊤ u⊤]⊤ − [x⋆⊤ u⋆⊤]⊤‖. A weighted norm can also be used to normalize the components of the state and input vectors.

The problem cannot be solved in closed form. Thus, we propose to use a sampling-based optimization method [21] that consists in sampling different informative state trajectories X_inf and choosing the one that shows the smallest informative cost. However, this approach cannot be directly employed due to some practical issues:
• To compute the informative cost for every sampled informative trajectory, we should theoretically run an experiment. This is clearly time consuming and does not meet the goals of Objective 1.
• It is not straightforward how to compute the integral of the posterior variance over Z_∆task.
• We do not know U_task: by its definition, we would need to know f to compute U_task given X_task. Therefore, we cannot directly compute Z_∆task.
• It is not straightforward how to efficiently sample informative trajectories.

B. Approximations of the optimization problem

1) Approximation of the dynamical constraints:
Given a candidate informative state trajectory, instead of computing the posterior variance based on the data collected as a result of a real experiment, (X̃_inf, Ũ_inf), we compute it based on the data collected as a result of a simulation of the system, (X̄_inf, Ū_inf). In detail, (X̄_inf, Ū_inf) is the output of the simulated closed-loop system using X_inf as the reference trajectory, i.e.,

  x̄_inf[k+1] = h(x̄_inf[k], ū_inf[k]) + g′
  ū_inf[k] = π(x_inf[k], x̄_inf[k], f̂),    (12)

where g′ is a sample of the Bayesian model of the residual dynamics, drawn from the probability distribution N_{x̄_inf[k], ū_inf[k]}(·).
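As an illustration of (12), a possible rollout routine is sketched below: the closed loop is propagated with the known part h plus a residual sampled from the current GP posterior at each step. The controller pi, the model h, and the single-output residual are assumed stand-ins, not the paper's implementation; the sampled residuals are also returned so they can later serve as simulated training targets.

```python
# Minimal sketch of the simulated closed-loop rollout (12); pi and h are
# assumed callables, gp follows scikit-learn's GaussianProcessRegressor
# interface, and a single-output residual is used for brevity.
import numpy as np

def rollout(x_inf_ref, h, pi, gp, rng):
    """Track the candidate reference x_inf_ref in simulation.

    Returns the simulated states X_bar, inputs U_bar, and the residual
    samples g' drawn along the trajectory.
    """
    xs, us, gs = [x_inf_ref[0]], [], []        # x_bar_inf[0] = x_inf[0]
    for k in range(len(x_inf_ref) - 1):
        u = pi(x_inf_ref[k], xs[-1])           # u_bar_inf[k]
        z = np.atleast_2d(np.append(xs[-1], u))
        mu, std = gp.predict(z, return_std=True)
        g_prime = rng.normal(mu[0], std[0])    # sample of the residual model
        xs.append(h(xs[-1], u) + g_prime)      # x_bar_inf[k+1], cf. (12)
        us.append(u)
        gs.append(g_prime)
    return np.array(xs), np.array(us), np.array(gs)
```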
2) Approximation of the informative cost:
In (11), the integral over Z_∆task can be approximately solved using numerical integration such as Monte-Carlo integration: we uniformly sample M pairs (x_i, u_i) in Z_∆task, where i = 1, ..., M, creating the set Z′_∆task. We then approximate the informative cost in (11) as

  (V/M) Σ_{(x_i, u_i) ∈ Z′_∆task} σ²(x_i, u_i | (X̃_inf, Ũ_inf)),    (13)

where V is the volume ∫_{Z_∆task} d(x, u). Notice that for the different sampled informative trajectories, V and M remain constant and can therefore be omitted in the optimization.
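A minimal sketch of the Monte-Carlo approximation (13) could look as follows: the candidate's simulated data are appended to the current training set, a posterior GP is refitted, and its variance is summed over the sampled points. The constant factor V/M is dropped, as noted above; all names are illustrative and follow the earlier sketches.

```python
# Minimal sketch of the informative cost (13), dropping the constant V/M.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def informative_cost(kernel, Z_train, g_train, Z_sim, g_sim, Z_samples):
    """Sum of posterior variances over sampled (x, u) points.

    Z_train, g_train : current training inputs/targets of the prior model
    Z_sim, g_sim     : simulated data (X_bar_inf, U_bar_inf) and residuals
    Z_samples        : (M, d) uniform samples approximating Z_delta_task
    """
    gp_post = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp_post.fit(np.vstack([Z_train, Z_sim]),
                np.concatenate([g_train, g_sim]))      # model update
    _, std = gp_post.predict(Z_samples, return_std=True)
    return float(np.sum(std**2))                       # sum of sigma^2
```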
3) Approximation of Z_∆task: Since we do not know f, we cannot compute U_task, and therefore neither Z_∆task. In this section we show how we can obtain an estimate of Z_∆task, denoted by Ẑ_∆task, exploiting the current estimate of f. We first uniformly sample the state and input spaces, X and U, creating the sets X′ ⊂ X and U′ ⊂ U, respectively. We then simulate the closed-loop system M times using X_task as reference, with M different samples of the residual dynamics denoted by g′_i, i = 1, ..., M. We obtain M state and input trajectories (X̄ⁱ_task, Ūⁱ_task), where

  x̄ⁱ_task[k+1] = h(x̄ⁱ_task[k], ūⁱ_task[k]) + g′_i
  ūⁱ_task[k] = π(x_task[k], x̄ⁱ_task[k], f̂).    (14)

Finally, we compute Ẑ_∆task as

  Ẑ_∆task = {(x, u) ∈ X′ × U′ | ∃ k ∈ (0, ..., N) and i ∈ (1, ..., M) s.t. ‖(x, u) − (x̄ⁱ_task, ūⁱ_task)‖ ≤ ε}.    (15)

Similar to (10), the threshold ε is a heuristic that controls the exploration of the informative trajectory: with a large ε, the optimal trajectory shows a more exploratory behavior.
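The set-membership test in (15) reduces to a nearest-neighbour query: a uniformly sampled candidate is kept if it lies within ε of any point of the simulated task rollouts. A minimal sketch, assuming an unweighted Euclidean norm:

```python
# Minimal sketch of (15): filter uniform samples of X' x U' by their
# distance to the simulated task rollouts (14). A k-d tree makes the
# nearest-neighbour queries cheap.
import numpy as np
from scipy.spatial import cKDTree

def estimate_Z_delta_task(Z_candidates, task_rollouts, eps):
    """Z_candidates: (K, d) samples; task_rollouts: list of (N, d) arrays
    stacking the (x_bar, u_bar) pairs of each simulated rollout."""
    tree = cKDTree(np.vstack(task_rollouts))
    dists, _ = tree.query(Z_candidates, k=1)
    return Z_candidates[dists <= eps]
```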
4) Parametrization of the informative trajectory:
Since we want to improve the knowledge of the model in Z_∆task, it is natural to think that the informative state trajectory X_inf should be "close" to the task state trajectory X_task. Therefore, for a generic k, we define x_inf[k] such that

  x_inf[k] = x_task[k] + δx[k].    (16)

Now, sampling informative state trajectories means sampling "deviations" from the task state trajectory. To reduce the sampling space, which has the same dimension as X, we parametrize δx using the Discrete Fourier Transform (DFT):

  δx[k] = (1/P) Σ_{p=0}^{P−1} Θ_x⊤ e_p e^{j2πpk/P},    (17)

where P ∈ N_{>0}, j is the complex operator, e_p ∈ R^P is a vector with 1 in place p and 0 elsewhere, and Θ_x ∈ O_x ⊂ R^{n×P} is the state parameter matrix. Instead of sampling every state of the informative trajectory, we now sample only a few parameters. Furthermore, the rationale behind the use of the DFT parametrization is that it gives us more intuitive control over the frequencies of excitation. We can use a few parameters to generate excitation signals that are spread over the frequencies of interest. Intuitively, the deviation signal can be seen as an excitation signal added around the task state trajectory. As a result, the algorithm inherently explores locally around the task trajectory.
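A minimal sketch of the parametrization (16)–(17): a deviation signal is synthesized from a small complex coefficient matrix Θ, and the real part is taken so the excitation stays real-valued. This shortcut (instead of a conjugate-symmetric Θ) is an assumption of the sketch, not the paper's choice.

```python
# Minimal sketch of the DFT parametrization (17); taking the real part is
# a pragmatic shortcut of this sketch to obtain a real excitation signal.
import numpy as np

def deviation_signal(Theta, N):
    """Theta: (n, P) complex coefficients; returns delta_x of shape (N, n)."""
    n, P = Theta.shape
    k = np.arange(N)
    delta = np.zeros((N, n), dtype=complex)
    for p in range(P):
        # Theta^T e_p picks the p-th coefficient column, cf. (17).
        delta += np.outer(np.exp(2j * np.pi * p * k / P), Theta[:, p]) / P
    return delta.real

# Usage: excite two low harmonics around the task trajectory, cf. (16).
rng = np.random.default_rng(1)
Theta = np.zeros((3, 8), dtype=complex)          # n = 3 axes, P = 8 bins
Theta[:, 1] = rng.normal(size=3) + 1j * rng.normal(size=3)
Theta[:, 2] = rng.normal(size=3) + 1j * rng.normal(size=3)
delta_x = deviation_signal(Theta, N=200)         # x_inf = x_task + delta_x
```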
C. Sampling-based optimization algorithm

Considering the previous simplifications, (11) becomes

  min_{Θ_x}  Σ_{(x_i, u_i) ∈ Ẑ_∆task} σ²(x_i, u_i | (X̄_inf, Ū_inf))
  s.t.  Ẑ_∆task as in (15)
        (x̄_inf[k], ū_inf[k]) as in (12)
        x_inf[k] = x_task[k] + δx[k]
        δx[k] as in (17).    (18)

Practically, to solve (18) we use a Monte-Carlo sampling-based method. The algorithm follows the steps below, which require only simulations of the system (a sketch composing the previous snippets is given after the Remark):
1) Uniformly sample a set of parameters Θ_x ∈ O_x and compute the corresponding informative state trajectories as in (16) and (17);
2) Simulate the system multiple times, with the residual model g′ sampled according to the prior model. Each informative state trajectory computed at step 1 is used as reference;
3) For every simulation, collect the data relative to the performed trajectory, (X̄_inf, Ū_inf), and update the Bayesian model of the residual dynamics;
4) Compute Ẑ_∆task as explained in Section III-B.3;
5) Evaluate the informative cost in Ẑ_∆task associated with every updated model;
6) Select the informative state trajectory corresponding to the minimum informative cost.

Once the informative state trajectory expected to provide the best model update is selected, it is used as reference in a real experiment. The collected data, (X̃_inf, Ũ_inf), are then employed to update the prior model. The full process can be repeated from step 1) to find a new informative state trajectory that allows to further improve the model accuracy, and in turn to reduce the tracking error.

Remark: Note that the quality of the approximate solution depends on the quality of the prior model. Therein lies the purpose of this algorithm: within each iteration, the quality of the prior model improves, and the solution to the approximated problem converges towards the true optimum. Consequently, each iteration also helps to further improve the prior model.
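Tying steps 1–6 together, a compact sketch of the sampling loop, composing the illustrative rollout, informative_cost, and deviation_signal routines from the earlier sketches (Ẑ_∆task is assumed precomputed via (14)–(15); a single-output residual GP is again assumed for brevity):

```python
# Minimal sketch of steps 1-6, composing the earlier illustrative
# routines; gp is the fitted prior model and Z_hat approximates (15).
import numpy as np

def select_informative_trajectory(x_task, h, pi, gp, kernel, Z_train,
                                  g_train, Z_hat, P, n_cand, rng):
    N, n = x_task.shape
    best_cost, best_ref = np.inf, None
    for _ in range(n_cand):
        # Step 1: sample DFT parameters and build a candidate reference.
        Theta = rng.normal(size=(n, P)) + 1j * rng.normal(size=(n, P))
        x_inf = x_task + deviation_signal(Theta, N)          # (16)-(17)
        # Steps 2-3: simulate the closed loop and collect the data.
        X_bar, U_bar, g_sim = rollout(x_inf, h, pi, gp, rng)
        Z_sim = np.column_stack([X_bar[:-1], U_bar])
        # Step 5: informative cost of the updated model over Z_hat.
        cost = informative_cost(kernel, Z_train, g_train,
                                Z_sim, g_sim, Z_hat)
        if cost < best_cost:                                  # step 6
            best_cost, best_ref = cost, x_inf
    return best_ref, best_cost
```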
IV. APPLICATION TO AN AERIAL ROBOT: THE OMAV
This section shows how the above framework is applied to an omnidirectional flying vehicle, called omav [6]. The omav (Fig. 1) is an overactuated omnidirectional flying vehicle with six tiltable arms in a hexagonal arrangement. A coaxial rotor configuration is rigidly attached to the end of each arm. The rotation of each arm can be actively controlled by a servo motor, which results in a total of 18 actuators. Although this setup enhances the motion and interaction capabilities, aerodynamic disturbances among the rotors, unknown servo dynamics, backlash, and other mechanical inaccuracies are very difficult to model and include in a standard model-based controller. This makes the omav a suitable testbed to validate the proposed method for active model learning.

The state of the omav is given by x = [p⊤ η⊤ ṗ⊤ ω⊤]⊤ ∈ X ⊂ R^{12}. In order, x includes the position, attitude (expressed in Euler angles), linear velocity, and angular velocity of the vehicle. As input of the system we consider the commanded wrench, i.e., the total force and moment commanded to the vehicle³, u = [f_cmd⊤ τ_cmd⊤]⊤ ∈ U ⊂ R^6. We assume that an allocation policy is implemented to transform u into low-level commands for the servos and the motors [6]. Finally, the dynamics of the omav can be written as in (3), where h is derived using the standard Newton–Euler equations. Notice that h is linear with respect to the input and can be written as

  h(x, u) = l(x) + u,    (19)

where l(x) includes all the terms that do not depend on u. On the other hand, g includes all previously mentioned unmodeled dynamic behaviors that cannot easily be captured with first principles. Considering the last six rows of the dynamics (the linear and angular accelerations), we can consider g(x, u) as the mismatch between the commanded wrench and the actuated one.

³For simplicity, we consider force and moment scaled by mass and inertia, respectively.

The controller implements a feedback linearization control law with a PID action on the position and attitude errors on top. In particular, given a reference task trajectory X_task and a prior model for g, the controller π(x_task[k], x[k], f̂) tries to find the input u[k] that solves the following optimization problem:

  min_{u[k]}  ‖x⋆[k] − l(x[k]) − u[k] − ĝ(u[k])‖,    (20)

where x⋆[k] = K(x_task[k] − x[k]) + K_I ∫(x_task[k] − x[k]) is the PID action, with K, K_I positive definite gain matrices. For the details about the implementation of such an optimization, we refer the interested reader to [8].

From experimental observations, we remark that the residual dynamics regarding the differential kinematics and linear acceleration (first nine rows) is almost negligible with respect to the one regarding the angular acceleration (last three rows). In other words, the mismatch between commanded and actual force is much smaller than the one between commanded and actual torque. For this reason, in this first work, we focus our attention on the attitude dynamics, applying the proposed active dynamics learning only to the last three rows of the system dynamics. These mismatches are modeled as three independent single-output Gaussian processes with u as the training input and the achieved torque as the training output. We neglect the rotational drag torque acting on the vehicle; thus ĝ is modeled independent of the state.
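For illustration, the inner optimization (20) can be sketched with a generic numerical solver; the residual model ĝ, the dimensions, and the BFGS solver are assumptions of this sketch, not the implementation of [8].

```python
# Minimal sketch of the control-allocation step (20); g_hat, l_x, and the
# solver choice are illustrative stand-ins (see [8] for the real method).
import numpy as np
from scipy.optimize import minimize

def wrench_command(x_star, l_x, g_hat, u0):
    """x_star: 6-D PID target; l_x = l(x); g_hat maps u to a 6-D residual."""
    objective = lambda u: np.sum((x_star - l_x - u - g_hat(u))**2)
    return minimize(objective, u0, method="BFGS").x

# Toy residual: a mild loss on the commanded torque components only.
g_hat = lambda u: np.concatenate([np.zeros(3), -0.1 * u[3:]])
u_cmd = wrench_command(np.ones(6), np.zeros(6), g_hat, np.zeros(6))
```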
V. EXPERIMENTAL RESULTS

The experimental platform is the omnidirectional aerial vehicle in Fig. 1: the omav. The omav is equipped with a NUC i7 computer and a PixHawk flight controller. This configuration allows all the necessary algorithms, implemented in a ROS framework, to run onboard. A motion capture system provides pose estimates at 100 Hz.
For a more complete description of the testbed, see [6]. As stated in Section IV, the proposed method has been implemented and evaluated focusing on the rotational dynamics. For the learned Gaussian process model, data points are subsampled from the experimental data using the k-medoids algorithm [22], with the squared Euclidean distance between the inputs as the distance metric. Throughout the experiments, squared exponential kernels are used. The deviation δx[k] is sampled around the x, y, z axes at the angular acceleration level, with the frequency content constrained to be below 2 Hz. Note that this is equivalent to specifying δx[k] at the angular velocity level. For simplicity, we limit the number of frequencies P to 2 and allow the frequency locations to be sampled as well. This yields a total of 12 coefficients to be sampled. The simulation framework is set up using the RotorS Gazebo simulator [23]. In this section we use "non-informative trajectory" to describe the case where the task trajectory is used to collect the data to update the model.
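As an illustration of this subsampling step, the sketch below approximates the k-medoids selection by snapping k-means centers to their nearest data points; this surrogate is an assumption of the sketch, not the paper's exact algorithm.

```python
# Minimal sketch of medoid-style subsampling of the experimental data;
# k-means centers snapped to data points approximate true k-medoids.
import numpy as np
from sklearn.cluster import KMeans

def subsample(Z, y, k, seed=0):
    """Z: (N, d) GP training inputs, y: (N,) targets; keep k representatives."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z)
    idx = np.unique([int(np.argmin(np.sum((Z - c)**2, axis=1)))
                     for c in km.cluster_centers_])
    return Z[idx], y[idx]
```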
A. Correlation between informative cost and tracking error

An experiment is conducted to investigate whether the tracking error defined in problem (5) is correlated with the informative cost defined in (11). The omav is asked to follow a pitching trajectory up to 60 degrees in pitch and 1 rad/s² in pitch angular acceleration, similar to previous work [8]. A prior model is built by collecting the data from executing the task trajectory. Next, five sampled trajectories and the task trajectory are executed and six learned models are built accordingly. They are then evaluated on the test data generated by the prior model. ε is heuristically tuned in simulation by computing the average distance between the optimal control inputs and the nominally computed control inputs. The tracking performance of the angular acceleration⁴ along the y-axis using these models is shown in Fig. 2. It can be seen that there is a clear correspondence between the informative cost and the tracking error. Furthermore, the model learned from the task trajectory does not yield the lowest tracking error.

⁴Notice that evaluating the angular acceleration tracking is equivalent to evaluating the error between the actual and commanded torque, which strongly depends on the model accuracy.

Fig. 2: A comparison of the tracking performance using the models learned from sampled trajectories and from the task trajectory (informative cost vs. absolute angular acceleration error [rad/s²]).

Fig. 3: Tracking of a body-fixed unit vector, plotted on a unit sphere, for the non-informative trajectory, the informative trajectory, and the reference orientation.

B. Comparison between informative and task trajectory
To compare the efficiency of the informative and non-informative trajectories, a figure-8 in attitude (with roll and pitch up to 26 degrees) at constant position is given as the task trajectory (see Fig. 3). We compute the prior model by running the task trajectory a first time. Then 20 trajectories are randomly generated and evaluated in simulation, as explained in Section III-C, using the prior model. The most informative trajectory (lowest informative cost) and the task trajectory are then executed, and the data are recorded for both trajectories. We subsampled 20, 40, 60, and 80 data points from the experiments running each trajectory and built a model for each of these combinations by augmenting the prior model with these data points. The hyperparameters of the Gaussian processes are reoptimized. The models are then used in the controller to track the task trajectory in real experiments for validation. The tracking performance is evaluated in Fig. 4 as the average of the absolute angular acceleration error over all three axes. It can be noted that for the same number of data points, the informative trajectory always outperforms the non-informative trajectory⁵ in terms of both mean tracking error and the corresponding variance. On average, the performance of the informative trajectories outperforms the non-informative ones by 13.3%.

⁵By performance of a trajectory we mean the tracking performance using the updated controller with the data collected from that trajectory.
Fig. 4: A comparison of the tracking performance between the informative trajectory and the task trajectory for the same number of data points (number of data points vs. absolute angular acceleration error [rad/s²]).
Fig. 5: Phase plots of the task trajectory and the modified task trajectory (commanded torque about the x- and y-axes [Nm] vs. roll and pitch angles [deg]). It can be observed that although the modified trajectory extends beyond the task trajectory, the model learned from the informative trajectory helps to reduce the tracking error there.

                   x-axis   y-axis   z-axis
  non-informative  38.4%    41.7%    23%
  informative      43.2%    57.9%    62%

TABLE I: Angular acceleration tracking error reduction, in percent, with respect to the case without model learning.
C. Comparison of the generalizability
To test the generalizability of the model learned from the informative trajectory, a modified figure-8 trajectory with higher pitch and roll reference angles (up to 43 degrees) is used. As can be seen in the phase plot in Fig. 5, the state/input pairs of the modified figure-8 extend up to twice the range of the original one. In this case, both the model from the informative trajectory and the one from the non-informative trajectory have 100 data points. It can be seen from Table I that the model learned from the informative trajectory yields better tracking performance, especially around the z-axis.

VI. CONCLUSION
This work presents a practical framework that effectively and efficiently collects data points for learning models used at the control level, significantly improving tracking performance on real robots. We experimentally demonstrate the validity of the method on an overactuated aerial robot, the omav, whose dynamics is complex and difficult to learn. Experimental results show that learning from informative trajectories is efficient in data collection and that the learned model generalizes to modified trajectories.
REFERENCES
[1] P. Abbeel, A. Coates, and A. Y. Ng, "Autonomous helicopter aerobatics through apprenticeship learning," The International Journal of Robotics Research, vol. 29, no. 13, pp. 1608–1639, 2010.
[2] M. Kamel, T. Stastny, K. Alexis, and R. Siegwart, "Model predictive control for trajectory tracking of unmanned aerial vehicles using robot operating system," in Robot Operating System (ROS). Springer, 2017, pp. 3–39.
[3] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake, "Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot," Autonomous Robots, vol. 40, no. 3, pp. 429–455, 2016.
[4] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, "Learning-based nonlinear model predictive control to improve vision-based mobile robot path-tracking in challenging outdoor environments," in IEEE International Conference on Robotics and Automation, 2014.
[5] M. T. Gillespie, C. M. Best, E. C. Townsend, D. Wingate, and M. D. Killpack, "Learning nonlinear dynamic models of soft robots for model predictive control with neural networks," in IEEE International Conference on Soft Robotics (RoboSoft), 2018, pp. 39–45.
[6] K. Bodie, M. Brunner, M. Pantic, S. Walser, P. Pfändler, U. Angst, R. Siegwart, and J. Nieto, "An omnidirectional aerial manipulation platform for contact-based inspection," in Robotics: Science and Systems XV, 2019.
[7] J. Kabzan, L. Hewing, A. Liniger, and M. N. Zeilinger, "Learning-based model predictive control for autonomous racing," IEEE Robotics and Automation Letters, 2019.
[8] W. Zhang, M. Brunner, L. Ott, M. Kamel, R. Siegwart, and J. Nieto, "Learning dynamics for improving control of overactuated flying systems," IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5283–5290, 2020.
[9] D. Nguyen-Tuong and J. Peters, "Using model knowledge for learning inverse dynamics," in IEEE International Conference on Robotics and Automation, 2010.
[10] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, "Learning agile and dynamic motor skills for legged robots," Science Robotics, 2019.
[11] B. Settles, "Active learning literature survey," University of Wisconsin–Madison, Department of Computer Sciences, Tech. Rep., 2009.
[12] P. Schrangl, P. Tkachenko, and L. del Re, "Iterative model identification of nonlinear systems of unknown structure: Systematic data-based modeling utilizing design of experiments," IEEE Control Systems Magazine, vol. 40, no. 3, pp. 26–48, 2020.
[13] A. D. Wilson, J. A. Schultz, A. R. Ansari, and T. D. Murphey, "Dynamic task execution using active parameter identification with the Baxter research robot," IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 391–397, 2016.
[14] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, "Learning-based model predictive control for safe exploration," in IEEE Conference on Decision and Control (CDC), 2018, pp. 6059–6066.
[15] M. Buisson-Fenet, F. Solowjow, and S. Trimpe, "Actively learning Gaussian process dynamics," arXiv preprint arXiv:1911.09946, 2019.
[16] C. Zimmer, M. Meister, and D. Nguyen-Tuong, "Safe active learning for time-series modeling with Gaussian processes," in Advances in Neural Information Processing Systems, 2018, pp. 2730–2739.
[17] Y. K. Nakka, A. Liu, G. Shi, A. Anandkumar, Y. Yue, and S.-J. Chung, "Chance-constrained trajectory optimization for safe exploration and learning of nonlinear systems," arXiv preprint arXiv:2005.04374, 2020.
[18] F. Borrelli, A. Bemporad, and M. Morari, Predictive Control for Linear and Hybrid Systems. Cambridge University Press, 2017.
[19] A. Capone, G. Noske, J. Umlauft, T. Beckers, A. Lederer, and S. Hirche, "Localized active learning of Gaussian process state space models," in Learning for Dynamics and Control. PMLR, 2020, pp. 490–499.
[20] P. Congdon, Bayesian Statistical Modelling. John Wiley & Sons, 2007, vol. 704.
[21] T. Homem-de-Mello and G. Bayraksan, "Monte Carlo sampling-based methods for stochastic optimization," Surveys in Operations Research and Management Science, vol. 19, no. 1, pp. 56–85, 2014.
[22] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, Oakland, CA, USA, 1967, pp. 281–297.
[23] F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, Robot Operating System (ROS): The Complete Reference (Volume 1). Cham: Springer International Publishing, 2016, ch. RotorS—A Modular Gazebo MAV Simulator Framework, pp. 595–625. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-26054-9_23