Active Model Learning using Informative Trajectories for Improved Closed-Loop Control on Real Robots
Weixuan Zhang, Marco Tognon, Lionel Ott, Roland Siegwart, Juan Nieto
Abstract — Model-based controllers on real robots require accurate knowledge of the system dynamics to perform optimally. For complex dynamics, first-principles modeling is not sufficiently precise, and data-driven approaches can be leveraged to learn a statistical model from real experiments. However, efficient and effective data collection for such a data-driven system on real robots is still an open challenge. This paper introduces an optimization problem formulation to find an informative trajectory that allows for efficient data collection and model learning. We present a sampling-based method that computes an approximation of the trajectory that minimizes the prediction uncertainty of the dynamics model. This trajectory is then executed, collecting the data to update the learned model. In experiments we demonstrate the capabilities of our proposed framework when applied to a complex omnidirectional flying vehicle with tiltable rotors. Using our informative trajectories results in models that outperform models obtained from non-informative trajectories by 13.3% with the same amount of training data. Furthermore, we show that the model learned from informative trajectories generalizes better than the one learned from non-informative trajectories, achieving better tracking performance on different tasks.
I. INTRODUCTION
Model-based controllers have been shown to be useful in various robotics applications. Especially when accurate models are available, these controllers can exhibit impressive performance [1], [2]. However, it can be hard to obtain a good dynamical model for complex systems such as humanoid robots [3], race cars on uneven terrains [4], soft robots [5], and novel fully actuated multi-rotor flying vehicles [6] like the one considered in this work (see Fig. 1).

One approach to solve the modeling problem is to rely on learning techniques: through interaction with the real world and data collection, a statistical dynamics model is trained, which is either directly fed into a model-based controller, e.g., [4], [7]–[9], or used in simulation to train a control policy [10]. One challenge for these approaches is that the training data often has a different distribution than the test data, for several reasons: first, model uncertainties and the feedback controller might lead the system to a state not encountered in a previous data collection routine. Second, given partial model knowledge, the region of the data that leads to the best performance is a priori unknown. Finally, the closed-loop dynamics change as the model used by the controller gets updated. One could perform a large number of experiments to cover as much of the input space as possible during training.
This work was supported by the NCCR Robotics, NCCR Digital Fabrication and Armasuisse. The authors are with the Autonomous Systems Lab, ETH Zürich, Leonhardstrasse 21, 8092 Zurich, Switzerland. E-mail: [email protected]
Fig. 1: The omnidirectional flying vehicle (omav) used to experimentally validate our method.

However, for robotic systems with high-dimensional and continuous state spaces, the search space is typically too large to be searched exhaustively. Furthermore, the dynamics can change significantly between consecutive experiments, e.g., a crash of a flying vehicle that damages its motors, which invalidates the previous training data. When considering a specific task, a good model is only required in the working area of the state and input spaces. Thus, it is desirable to have an efficient scheme to collect training data locally around the desired task.

One idea is to use the statistical information learned from training data to infer the region in which to sample data and thus improve the sampling efficiency. This is a well-known approach in the machine learning community called active learning [11]. In this paper we make use of this idea: we rely on the previously learned statistical model to get an estimate of the region of interest. We then generate an informative trajectory that reduces the overall uncertainty in the estimated region. This trajectory is then executed in the real world to collect the data.

More specifically, in a first step, possible informative locations are inferred from the previously learned model in simulation. Then, different informative trajectories are sampled and evaluated according to a cost metric, which is defined as the integral of the predictive uncertainty over these possible locations. The most informative trajectory is then selected and executed on the real robot to collect the data. As a result, the model learned from this informative trajectory should result in improved control performance and better generalization. The latter is achieved because the informative trajectory reduces the uncertainty over a large region of the state and input space.

The contributions of this paper are as follows:
• A formal mathematical formulation of the problem of efficient data collection for learning a dynamics model.
• A practical strategy to efficiently collect task-relevant data that improves the control performance when used to update the learned model.
• Real experimental results conducted on a complex over-actuated omnidirectional flying system with nonlinear dynamics and 18 actuators.
A. Related work
Active learning in robotics is mostly defined in a regression setting: a regression mapping between an input and an output space is to be learned while the sample complexity is minimized. The exploration of the sample space is typically driven by some metric, often consisting of variants of the expected information gain.

Considering active dynamics model learning, existing work includes the use of information gain on parameter estimates ([12], [13]), Gaussian processes ([14]–[16]), and neural networks [17]. These works typically generate trajectories that minimize a defined metric, trading off between exploration and exploitation. Aside from the parameter-estimation approaches, little work has been done on real robots.

These approaches can also be categorized depending on whether the trajectory generation is performed online or offline. The online approaches often operate in a receding-horizon fashion [18], where trajectories are regenerated at a certain frequency on the fly during experiments. This constant update helps to reduce the distance between the desired inputs and the achieved ones. However, this approach is computationally intensive: while exploring a state of interest, the robot cannot always stay stationary waiting for a newly planned trajectory. To date, this method exists only in theoretical works validated in simulation [14], [15], [19].

The offline approach has the shortcoming that the planned trajectory has a larger distance to the executed one, but it is applicable on real robots. In [17], the trajectory generation is formulated as a variable-constrained problem and validated on a simulated overactuated robotic spacecraft. In [16], the input trajectories are parametrized by consecutive trajectory sections and the most informative and safe trajectory is then executed; their formulation does not take closed-loop control into account. The method is applied to a high-pressure fluid injection system. Our investigation belongs to this category: we make use of the previously learned model and simulations to reduce the deviation of the executed trajectory from the desired one. In this work, we demonstrate that this approach works for complex robots and efficiently improves the control performance.

II. MODELING AND PROBLEM STATEMENT
We consider a generic system whose dynamics in the discrete time domain is described by:

  x[k+1] = f(x[k], u[k]),    (1)

where f(·,·) is a Lipschitz-continuous function and represents the true dynamics¹. x[k] ∈ X ⊂ R^n and u[k] ∈ U ⊂ R^m describe the state and the control input of the dynamical system at time k ∈ N_{≥0}. Considering the state and input limitations of real systems, we take X and U to be compact sets. To simplify the notation, x[k] denotes x(kT), where T ∈ R_{>0} is the sampling time. We remark that f(·,·) is in general not available; we might have only an estimate of it, denoted by f̂(·,·).

¹This is a common assumption that does not limit the validity of the work, since most of the considered robotic systems have Lipschitz-continuous dynamics.

As is common in robotics, the considered task consists of a trajectory tracking problem. Given a desired task state trajectory defined by the sequence of state values X_task = (x_task[0], ..., x_task[N]) over the time horizon N ∈ N_{>0}, we assume that a model-based controller π(·) based on f̂ is provided such that, if

  u[k] = π(x_task[k], x[k], f̂),    (2)

x[0] = x_task[0], and f̂(x, u) = f(x, u) for every (x, u) ∈ Z = X × U, then x[k+1] = f(x[k], π(x_task[k], x[k], f̂)) = x_task[k+1] for every k = 0, ..., N−1. This condition describes perfect tracking of the desired trajectory. We define the sequence of inputs that provides perfect tracking as U_task = (u_task[0], ..., u_task[N]), called the task input trajectory. It seems that to achieve perfect tracking, we must know the true dynamics for every state and input pair.
Objective 1. Considering the closed-loop system (1) and (2), our objective is to define an active learning method aiming at optimizing the data collection process to
• make it more efficient (fewer experiments and data points),
• improve the precision of the learned model,
• improve the generalizability of the learned model,
• minimize the tracking error.

We shall show how the learning problem can be reformulated to address these objectives. Without loss of generality, we can decompose the true dynamics into two components:

  f(x[k], u[k]) = h(x[k], u[k]) + g(x[k], u[k]),    (3)

where
• h(·,·), called the first-principles dynamics, corresponds to a "first principles" model reflecting physical laws such as mass balance, energy balance, heat transfer relations, and so on. We consider h(·,·) to be known;
• g(·,·), called the residual dynamics, corresponds to all other elements not modeled by h. g is assumed unknown, and we only have an estimate of it, denoted by ĝ.

This modeling allows us to exploit the knowledge we already have about the system, reducing the learning effort and making it possible to employ several model-based controllers. Once again, it is clear that, considering the control law (2) with f̂ = h + ĝ, the closed-loop system achieves perfect tracking if ĝ(x, u) = g(x, u) for every (x, u) ∈ Z. We assume that a Bayesian prior model [20] of the residual dynamics g(x[k], u[k]) is given. That is, for a given test point (x, u), the value of g(x, u) is modeled with a Gaussian probability distribution N_{x,u}(·):

  g(x, u) ∼ N_{x,u}(·).    (4)

We denote the mean and variance of N_{x,u}(·) as µ(x, u) ∈ R^n and σ²(x, u) ∈ R^{n×n}, respectively. Note that the distribution is a function of the test point (x, u). The variance is an indication of the model confidence at the test point.

We consider the estimate of the residual dynamics to be ĝ(x, u) = µ(x, u), which leads to f̂(x, u) = h(x, u) + µ(x, u). As before, using the control law (2), it is clear that improving the knowledge of µ(x, u) around the working point would improve the tracking performance. To do so, suitable data must be collected. A possible solution is to simply run the task trajectory, over and over, until the data are sufficient to obtain a good model around (X_task, U_task). However, this would require many trials before the collected data are informative enough.

Departing from this basic approach, here we aim at designing an algorithm that automatically derives reference state trajectories X_inf, called informative state trajectories, that efficiently collect data to improve the prior model, thus reducing the tracking error when the model is employed in the control law (2). This is formalized in the following problem.
Problem 1. Find X_inf as the solution of:

  min_{X_inf}  Σ_{k=0}^{N} ‖x_task[k] − x[k]‖
  s.t.  x[k+1] = f(x[k], u[k])
        u[k] = π(x_task[k], x[k], f̂)
        f̂ = h + ĝ
        ĝ(x, u) = µ(x, u | (X̃_inf, Ũ_inf)),    (5)

where µ(x, u | (X̃_inf, Ũ_inf)) is the mean of the posterior model at (x, u), i.e., the updated model based on the data (X̃_inf, Ũ_inf) collected by letting the closed-loop system evolve using X_inf as the reference trajectory. In detail,

  x̃_inf[k+1] = f(x̃_inf[k], ũ_inf[k])    (6)
  ũ_inf[k] = π(x_inf[k], x̃_inf[k], f̂),    (7)

with x̃_inf[0] = x_inf[0].
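To make the Bayesian residual model (3)–(4) concrete, the following is a minimal sketch, not the authors' implementation: a Gaussian process is fitted to observed residuals g = f − h on a toy scalar system, and its posterior mean and variance play the roles of µ and σ². The dynamics, kernel choice, and all names are illustrative assumptions.

```python
# Minimal sketch of the residual model (3)-(4) on a toy scalar system;
# h, the data, and the kernel are illustrative choices, not the paper's.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def h(x, u):
    # First-principles part: a trivial stand-in linear model.
    return 0.9 * x + 0.1 * u

rng = np.random.default_rng(0)
Z_train = rng.uniform(-1.0, 1.0, size=(50, 2))        # columns: (x, u)
x_next = h(Z_train[:, 0], Z_train[:, 1]) + 0.05 * np.sin(3.0 * Z_train[:, 0])
g_train = x_next - h(Z_train[:, 0], Z_train[:, 1])    # residuals g = f - h

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(Z_train, g_train)

# Posterior mean mu(x, u) gives g_hat, so f_hat = h + mu as in the text;
# the predictive std quantifies the model confidence at the test point.
z = np.array([[0.2, -0.3]])
mu, std = gp.predict(z, return_std=True)
f_hat = h(z[0, 0], z[0, 1]) + mu[0]
print(f"f_hat = {f_hat:.4f}, sigma = {std[0]:.4f}")
```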
III. GENERATION OF INFORMATIVE TRAJECTORIES
This section shows how we can turn (5) into a simpler optimization problem aiming at minimizing an informative cost metric. The problem is solved by practical approximations leading to a sampling-based trajectory generation algorithm.
A. Minimization of the informative cost
Solving (5) is definitely not a trivial problem, even using sampling-based methods. In fact, since we do not know f, solving (5) would require running two experiments for every sampled informative state trajectory X_inf, using as reference first X_inf and then X_task.

In order to make the problem feasible from a practical point of view, let us recall that using the model-based controller (2), we can achieve perfect tracking by improving the knowledge of ĝ, or equivalently of N_{x,u}(·), for all (x, u) ∈ Z_task, where

  Z_task = {(x, u) ∈ Z | ∃ k ∈ (0, ..., N) s.t. (x, u) = (x_task[k], u_task[k])}    (8)

contains the state/input pairs that achieve perfect tracking of the task trajectory. A possible idea is that we can improve the model by minimizing the uncertainty of the prior model, i.e., σ²(x, u) for all (x, u) ∈ Z_task. We then reformulate (5) as

  min_{X_inf}  Σ_{(x,u) ∈ Z_task} σ²(x, u | (X̃_inf, Ũ_inf)).    (9)

Recall that (X̃_inf, Ũ_inf) are computed as in (6) and (7). Notice that we focus on reducing the informative cost on the space relevant to the task instead of the entire state/input space Z. However, from experimental considerations, we remark that improving the model only in Z_task is not enough to achieve good tracking performance. In fact, initial errors, noisy measurements, and external disturbances might make the system deviate from (X_task, U_task), visiting state/input pairs not included in Z_task for which the model could be imprecise. Therefore, to achieve good tracking also in these non-ideal and more realistic conditions, we propose to improve the learning of the model by solving (9) not only for the points in Z_task, but also for the ones that are sufficiently close², i.e., for all (x, u) ∈ Z_∆task, where

  Z_∆task = {(x, u) ∈ Z | ∃ (x_task, u_task) ∈ Z_task s.t. ‖(x, u) − (x_task, u_task)‖ ≤ ε},    (10)

with ε ∈ R_{≥0} being a heuristic that can be tuned to control the exploratory behavior of the informative trajectory. Problem (9) becomes:

  min_{X_inf}  ∫_{Z_∆task} σ²(x, u | (X̃_inf, Ũ_inf)) d(x, u).    (11)

From now on, we refer to the objective function to be minimized as the informative cost.

²With an abuse of notation, we consider ‖(x, u) − (x⋆, u⋆)‖ = ‖[x⊤ u⊤]⊤ − [x⋆⊤ u⋆⊤]⊤‖. A weighted norm can also be used to normalize the components of the state and input vectors.

The problem cannot be solved in closed form. Thus, we propose to use a sampling-based optimization method [21] that consists in sampling different informative state trajectories X_inf and choosing the one that shows the smallest informative cost. However, this approach cannot be directly employed due to some practical issues:
• To compute the informative cost for every sampled informative trajectory, we should theoretically run an experiment. This is clearly time consuming and does not meet the goals of Objective 1.
• It is not straightforward how to compute the integral of the posterior variance over Z_∆task.
• We do not know U_task: by its definition, we would need to know f to compute U_task given X_task. Therefore, we cannot directly compute Z_∆task.
• It is not straightforward how to efficiently sample informative trajectories.

B. Approximations of the optimization problem

1) Approximation of the dynamical constraints:
Given a candidate informative state trajectory, instead of computing the posterior variance based on the data collected as a result of a real experiment, (X̃_inf, Ũ_inf), we compute it based on the data collected as a result of a simulation of the system, (X̄_inf, Ū_inf). In detail, (X̄_inf, Ū_inf) is the output of the simulated closed-loop system using X_inf as the reference trajectory, i.e.,

  x̄_inf[k+1] = h(x̄_inf[k], ū_inf[k]) + g′
  ū_inf[k] = π(x_inf[k], x̄_inf[k], f̂),    (12)

where g′ is a sample of the Bayesian model of the residual dynamics, drawn from the probability distribution N_{x̄_inf[k], ū_inf[k]}(·).
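As an illustration of (12), a possible rollout routine is sketched below: the closed loop is propagated with the known part h plus a residual sampled from the current GP posterior at each step. The controller pi, the model h, and the single-output residual are assumed stand-ins, not the paper's implementation; the sampled residuals are also returned so they can later serve as simulated training targets.

```python
# Minimal sketch of the simulated closed-loop rollout (12); pi and h are
# assumed callables, gp follows scikit-learn's GaussianProcessRegressor
# interface, and a single-output residual is used for brevity.
import numpy as np

def rollout(x_inf_ref, h, pi, gp, rng):
    """Track the candidate reference x_inf_ref in simulation.

    Returns the simulated states X_bar, inputs U_bar, and the residual
    samples g' drawn along the trajectory.
    """
    xs, us, gs = [x_inf_ref[0]], [], []        # x_bar_inf[0] = x_inf[0]
    for k in range(len(x_inf_ref) - 1):
        u = pi(x_inf_ref[k], xs[-1])           # u_bar_inf[k]
        z = np.atleast_2d(np.append(xs[-1], u))
        mu, std = gp.predict(z, return_std=True)
        g_prime = rng.normal(mu[0], std[0])    # sample of the residual model
        xs.append(h(xs[-1], u) + g_prime)      # x_bar_inf[k+1], cf. (12)
        us.append(u)
        gs.append(g_prime)
    return np.array(xs), np.array(us), np.array(gs)
```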
2) Approximation of the informative cost:
In (11), the integral over Z_∆task can be approximately solved using numerical integration such as Monte-Carlo integration: we uniformly sample M pairs (x_i, u_i) in Z_∆task, where i = 1, ..., M, creating the set Z′_∆task. We then approximate the informative cost in (11) as

  (V/M) Σ_{(x_i, u_i) ∈ Z′_∆task} σ²(x_i, u_i | (X̃_inf, Ũ_inf)),    (13)

where V is the volume ∫_{Z_∆task} d(x, u). Notice that for the different sampled informative trajectories, V and M remain constant and can therefore be omitted in the optimization.
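A minimal sketch of the Monte-Carlo approximation (13) could look as follows: the candidate's simulated data are appended to the current training set, a posterior GP is refitted, and its variance is summed over the sampled points. The constant factor V/M is dropped, as noted above; all names are illustrative and follow the earlier sketches.

```python
# Minimal sketch of the informative cost (13), dropping the constant V/M.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def informative_cost(kernel, Z_train, g_train, Z_sim, g_sim, Z_samples):
    """Sum of posterior variances over sampled (x, u) points.

    Z_train, g_train : current training inputs/targets of the prior model
    Z_sim, g_sim     : simulated data (X_bar_inf, U_bar_inf) and residuals
    Z_samples        : (M, d) uniform samples approximating Z_delta_task
    """
    gp_post = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp_post.fit(np.vstack([Z_train, Z_sim]),
                np.concatenate([g_train, g_sim]))      # model update
    _, std = gp_post.predict(Z_samples, return_std=True)
    return float(np.sum(std**2))                       # sum of sigma^2
```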
3) Approximation of Z_∆task: Since we do not know f, we cannot compute U_task, and therefore neither Z_∆task. In this section we show how we can obtain an estimate of Z_∆task, denoted by Ẑ_∆task, exploiting the current estimate of f. We first uniformly sample the state and input spaces, X and U, creating the sets X′ ⊂ X and U′ ⊂ U, respectively. We then simulate the closed-loop system M times using X_task as reference, with M different samples of the residual dynamics denoted by g′_i, i = 1, ..., M. We obtain M state and input trajectories (X̄ⁱ_task, Ūⁱ_task), where

  x̄ⁱ_task[k+1] = h(x̄ⁱ_task[k], ūⁱ_task[k]) + g′_i
  ūⁱ_task[k] = π(x_task[k], x̄ⁱ_task[k], f̂).    (14)

Finally, we compute Ẑ_∆task as

  Ẑ_∆task = {(x, u) ∈ X′ × U′ | ∃ k ∈ (0, ..., N) and i ∈ (1, ..., M) s.t. ‖(x, u) − (x̄ⁱ_task, ūⁱ_task)‖ ≤ ε}.    (15)

Similar to (10), the threshold ε is a heuristic that controls the exploration of the informative trajectory: with a large ε, the optimal trajectory shows a more exploratory behavior.
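The set-membership test in (15) reduces to a nearest-neighbour query: a uniformly sampled candidate is kept if it lies within ε of any point of the simulated task rollouts. A minimal sketch, assuming an unweighted Euclidean norm:

```python
# Minimal sketch of (15): filter uniform samples of X' x U' by their
# distance to the simulated task rollouts (14). A k-d tree makes the
# nearest-neighbour queries cheap.
import numpy as np
from scipy.spatial import cKDTree

def estimate_Z_delta_task(Z_candidates, task_rollouts, eps):
    """Z_candidates: (K, d) samples; task_rollouts: list of (N, d) arrays
    stacking the (x_bar, u_bar) pairs of each simulated rollout."""
    tree = cKDTree(np.vstack(task_rollouts))
    dists, _ = tree.query(Z_candidates, k=1)
    return Z_candidates[dists <= eps]
```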
4) Parametrization of the informative trajectory:
Since we want to improve the knowledge of the model in Z_∆task, it is natural to think that the informative state trajectory X_inf should be "close" to the task state trajectory X_task. Therefore, for a generic k, we define x_inf[k] such that

  x_inf[k] = x_task[k] + δx[k].    (16)

Now, sampling informative state trajectories means sampling "deviations" from the task state trajectory. To reduce the sampling space, which has the same dimension as X, we parametrize δx using the Discrete Fourier Transform (DFT):

  δx[k] = (1/P) Σ_{p=0}^{P−1} Θ_x⊤ e_p e^{j2πpk/P},    (17)

where P ∈ N_{>0}, j is the complex operator, e_p ∈ R^P is a vector with 1 in place p and 0 elsewhere, and Θ_x ∈ O_x ⊂ R^{n×P} is the state parameter matrix. Instead of sampling every state of the informative trajectory, we now sample only a few parameters. Furthermore, the rationale behind the use of the DFT parametrization is that it gives us more intuitive control over the frequencies of excitation. We can use a few parameters to generate excitation signals that are spread over the frequencies of interest. Intuitively, the deviation signal can be seen as an excitation signal added around the task state trajectory. As a result, the algorithm inherently explores locally around the task trajectory.
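A minimal sketch of the parametrization (16)–(17): a deviation signal is synthesized from a small complex coefficient matrix Θ, and the real part is taken so the excitation stays real-valued. This shortcut (instead of a conjugate-symmetric Θ) is an assumption of the sketch, not the paper's choice.

```python
# Minimal sketch of the DFT parametrization (17); taking the real part is
# a pragmatic shortcut of this sketch to obtain a real excitation signal.
import numpy as np

def deviation_signal(Theta, N):
    """Theta: (n, P) complex coefficients; returns delta_x of shape (N, n)."""
    n, P = Theta.shape
    k = np.arange(N)
    delta = np.zeros((N, n), dtype=complex)
    for p in range(P):
        # Theta^T e_p picks the p-th coefficient column, cf. (17).
        delta += np.outer(np.exp(2j * np.pi * p * k / P), Theta[:, p]) / P
    return delta.real

# Usage: excite two low harmonics around the task trajectory, cf. (16).
rng = np.random.default_rng(1)
Theta = np.zeros((3, 8), dtype=complex)          # n = 3 axes, P = 8 bins
Theta[:, 1] = rng.normal(size=3) + 1j * rng.normal(size=3)
Theta[:, 2] = rng.normal(size=3) + 1j * rng.normal(size=3)
delta_x = deviation_signal(Theta, N=200)         # x_inf = x_task + delta_x
```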
C. Sampling-based optimization algorithm

Considering the previous simplifications, (11) becomes

  min_{Θ_x}  Σ_{(x_i, u_i) ∈ Ẑ_∆task} σ²(x_i, u_i | (X̄_inf, Ū_inf))
  s.t.  Ẑ_∆task as in (15)
        (x̄_inf[k], ū_inf[k]) as in (12)
        x_inf[k] = x_task[k] + δx[k]
        δx[k] as in (17).    (18)

Practically, to solve (18) we use a Monte-Carlo sampling-based method. The algorithm follows the steps below, which require only simulations of the system (a sketch composing the previous snippets is given after the Remark):
1) Uniformly sample a set of parameters Θ_x ∈ O_x and compute the corresponding informative state trajectories as in (16) and (17);
2) Simulate the system multiple times, with the residual model g′ sampled according to the prior model. Each informative state trajectory computed at step 1 is used as reference;
3) For every simulation, collect the data relative to the performed trajectory, (X̄_inf, Ū_inf), and update the Bayesian model of the residual dynamics;
4) Compute Ẑ_∆task as explained in Section III-B.3;
5) Evaluate the informative cost in Ẑ_∆task associated with every updated model;
6) Select the informative state trajectory corresponding to the minimum informative cost.

Once the informative state trajectory expected to provide the best model update is selected, it is used as reference in a real experiment. The collected data, (X̃_inf, Ũ_inf), are then employed to update the prior model. The full process can be repeated from step 1) to find a new informative state trajectory that allows to further improve the model accuracy, and in turn to reduce the tracking error.

Remark: Note that the quality of the approximate solution depends on the quality of the prior model. Therein lies the purpose of this algorithm: within each iteration, the quality of the prior model improves, and the solution to the approximated problem converges towards the true optimum. Consequently, each iteration also helps to further improve the prior model.
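Tying steps 1–6 together, a compact sketch of the sampling loop, composing the illustrative rollout, informative_cost, and deviation_signal routines from the earlier sketches (Ẑ_∆task is assumed precomputed via (14)–(15); a single-output residual GP is again assumed for brevity):

```python
# Minimal sketch of steps 1-6, composing the earlier illustrative
# routines; gp is the fitted prior model and Z_hat approximates (15).
import numpy as np

def select_informative_trajectory(x_task, h, pi, gp, kernel, Z_train,
                                  g_train, Z_hat, P, n_cand, rng):
    N, n = x_task.shape
    best_cost, best_ref = np.inf, None
    for _ in range(n_cand):
        # Step 1: sample DFT parameters and build a candidate reference.
        Theta = rng.normal(size=(n, P)) + 1j * rng.normal(size=(n, P))
        x_inf = x_task + deviation_signal(Theta, N)          # (16)-(17)
        # Steps 2-3: simulate the closed loop and collect the data.
        X_bar, U_bar, g_sim = rollout(x_inf, h, pi, gp, rng)
        Z_sim = np.column_stack([X_bar[:-1], U_bar])
        # Step 5: informative cost of the updated model over Z_hat.
        cost = informative_cost(kernel, Z_train, g_train,
                                Z_sim, g_sim, Z_hat)
        if cost < best_cost:                                  # step 6
            best_cost, best_ref = cost, x_inf
    return best_ref, best_cost
```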
IV. APPLICATION TO AN AERIAL ROBOT: THE OMAV
This section shows how the above framework is applied to an omnidirectional flying vehicle, called omav [6]. The omav (Fig. 1) is an overactuated omnidirectional flying vehicle with six tiltable arms in a hexagonal arrangement. A coaxial rotor configuration is rigidly attached to the end of each arm. The rotation of each arm can be actively controlled by a servo motor, which results in a total of 18 actuators. Although this setup enhances the motion and interaction capabilities, aerodynamic disturbances among the rotors, unknown servo dynamics, backlash, and other mechanical inaccuracies are very difficult to model and include in a standard model-based controller. This makes the omav a suitable testbed to validate the proposed method for active model learning.

The state of the omav is given by x = [p⊤ η⊤ ṗ⊤ ω⊤]⊤ ∈ X ⊂ R^{12}. In order, x includes the position, attitude (expressed in Euler angles), linear velocity, and angular velocity of the vehicle. As input of the system we consider the commanded wrench, i.e., the total force and moment commanded to the vehicle³, u = [f_cmd⊤ τ_cmd⊤]⊤ ∈ U ⊂ R^6. We assume that an allocation policy is implemented to transform u into low-level commands for the servos and the motors [6]. Finally, the dynamics of the omav can be written as in (3), where h is derived using the standard Newton–Euler equations. Notice that h is linear with respect to the input and can be written as

  h(x, u) = l(x) + u,    (19)

where l(x) includes all the terms that do not depend on u. On the other hand, g includes all previously mentioned unmodeled dynamic behaviors that cannot easily be captured with first principles. Considering the last six rows of the dynamics (the linear and angular accelerations), we can consider g(x, u) as the mismatch between the commanded wrench and the actuated one.

³For simplicity, we consider force and moment scaled by mass and inertia, respectively.

The controller implements a feedback linearization control law with a PID action on the position and attitude errors on top. In particular, given a reference task trajectory X_task and a prior model for g, the controller π(x_task[k], x[k], f̂) tries to find the input u[k] that solves the following optimization problem:

  min_{u[k]}  ‖x⋆[k] − l(x[k]) − u[k] − ĝ(u[k])‖,    (20)

where x⋆[k] = K(x_task[k] − x[k]) + K_I ∫(x_task[k] − x[k]) is the PID action, with K, K_I positive definite gain matrices. For the details about the implementation of such an optimization, we refer the interested reader to [8].

From experimental observations, we remark that the residual dynamics regarding the differential kinematics and linear acceleration (first nine rows) is almost negligible with respect to the one regarding the angular acceleration (last three rows). In other words, the mismatch between commanded and actual force is much smaller than the one between commanded and actual torque. For this reason, in this first work, we focus our attention on the attitude dynamics, applying the proposed active dynamics learning only to the last three rows of the system dynamics. These mismatches are modeled as three independent single-output Gaussian processes with u as the training input and the achieved torque as the training output. We neglect the rotational drag torque acting on the vehicle; thus ĝ is modeled independent of the state.
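For illustration, the inner optimization (20) can be sketched with a generic numerical solver; the residual model ĝ, the dimensions, and the BFGS solver are assumptions of this sketch, not the implementation of [8].

```python
# Minimal sketch of the control-allocation step (20); g_hat, l_x, and the
# solver choice are illustrative stand-ins (see [8] for the real method).
import numpy as np
from scipy.optimize import minimize

def wrench_command(x_star, l_x, g_hat, u0):
    """x_star: 6-D PID target; l_x = l(x); g_hat maps u to a 6-D residual."""
    objective = lambda u: np.sum((x_star - l_x - u - g_hat(u))**2)
    return minimize(objective, u0, method="BFGS").x

# Toy residual: a mild loss on the commanded torque components only.
g_hat = lambda u: np.concatenate([np.zeros(3), -0.1 * u[3:]])
u_cmd = wrench_command(np.ones(6), np.zeros(6), g_hat, np.zeros(6))
```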
V. EXPERIMENTAL RESULTS

The experimental platform is the omnidirectional aerial vehicle in Fig. 1: the omav. The omav is equipped with a NUC i7 computer and a PixHawk flight controller. This configuration allows all the necessary algorithms, implemented in a ROS framework, to run onboard. A motion capture system provides pose estimates at 100 Hz.
For a more complete description of the testbed, see [6]. As stated in Section IV, the proposed method has been implemented and evaluated focusing on the rotational dynamics. For the learned Gaussian process model, data points are subsampled from the experimental data using the k-medoids algorithm [22], with the squared Euclidean distance between the inputs as the distance metric. Throughout the experiments, squared exponential kernels are used. The deviation δx[k] is sampled around the x, y, z axes at the angular acceleration level, with the frequency content constrained to be below 2 Hz. Note that this is equivalent to specifying δx[k] at the angular velocity level. For simplicity, we limit the number of frequencies P to 2 and allow the frequency locations to be sampled as well. This yields a total of 12 coefficients to be sampled. The simulation framework is set up using the RotorS Gazebo simulator [23]. In this section we use "non-informative trajectory" to describe the case where the task trajectory is used to collect the data to update the model.
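As an illustration of this subsampling step, the sketch below approximates the k-medoids selection by snapping k-means centers to their nearest data points; this surrogate is an assumption of the sketch, not the paper's exact algorithm.

```python
# Minimal sketch of medoid-style subsampling of the experimental data;
# k-means centers snapped to data points approximate true k-medoids.
import numpy as np
from sklearn.cluster import KMeans

def subsample(Z, y, k, seed=0):
    """Z: (N, d) GP training inputs, y: (N,) targets; keep k representatives."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z)
    idx = np.unique([int(np.argmin(np.sum((Z - c)**2, axis=1)))
                     for c in km.cluster_centers_])
    return Z[idx], y[idx]
```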
A. Correlation between informative cost and tracking error

An experiment is conducted to investigate whether the tracking error defined in problem (5) is correlated with the informative cost defined in (11). The omav is asked to follow a pitching trajectory up to 60 degrees in pitch and 1 rad/s² in pitch angular acceleration, similar to previous work [8]. A prior model is built by collecting the data from executing the task trajectory. Next, five sampled trajectories and the task trajectory are executed and six learned models are built accordingly. They are then evaluated on the test data generated by the prior model. ε is heuristically tuned in simulation by computing the average distance between the optimal control inputs and the nominally computed control inputs. The tracking performance of the angular acceleration⁴ along the y-axis using these models is shown in Fig. 2. It can be seen that there is a clear correspondence between the informative cost and the tracking error. Furthermore, the model learned from the task trajectory does not yield the lowest tracking error.

⁴Notice that evaluating the angular acceleration tracking is equivalent to evaluating the error between the actual and commanded torque, which strongly depends on the model accuracy.

Fig. 2: A comparison of the tracking performance using the models learned from sampled trajectories and from the task trajectory (informative cost vs. absolute angular acceleration error [rad/s²]).

Fig. 3: Tracking of a body-fixed unit vector, plotted on a unit sphere, for the non-informative trajectory, the informative trajectory, and the reference orientation.

B. Comparison between informative and task trajectory
To compare the efficiency of the informative and non-informative trajectories, a figure-8 in attitude (with roll and pitch up to 26 degrees) at constant position is given as the task trajectory (see Fig. 3). We compute the prior model by running the task trajectory a first time. Then 20 trajectories are randomly generated and evaluated in simulation, as explained in Section III-C, using the prior model. The most informative trajectory (lowest informative cost) and the task trajectory are then executed, and the data are recorded for both trajectories. We subsampled 20, 40, 60, and 80 data points from the experiments running each trajectory and built a model for each of these combinations by augmenting the prior model with these data points. The hyperparameters of the Gaussian processes are reoptimized. The models are then used in the controller to track the task trajectory in real experiments for validation. The tracking performance is evaluated in Fig. 4 as the average of the absolute angular acceleration error over all three axes. It can be noted that for the same number of data points, the informative trajectory always outperforms the non-informative trajectory⁵ in terms of both mean tracking error and the corresponding variance. On average, the performance of the informative trajectories outperforms the non-informative ones by 13.3%.

⁵By performance of a trajectory we mean the tracking performance using the updated controller with the data collected from that trajectory.
Fig. 4: A comparison of the tracking performance between the informative trajectory and the task trajectory for the same number of data points (number of data points vs. absolute angular acceleration error [rad/s²]).
Fig. 5: Phase plots of the task trajectory and the modified task trajectory (commanded torque about the x- and y-axes [Nm] vs. roll and pitch angles [deg]). It can be observed that although the modified trajectory extends beyond the task trajectory, the model learned from the informative trajectory helps to reduce the tracking error there.

                   x-axis   y-axis   z-axis
  non-informative  38.4%    41.7%    23%
  informative      43.2%    57.9%    62%

TABLE I: Angular acceleration tracking error reduction, in percent, with respect to the case without model learning.
C. Comparison of the generalizability
To test the generalizability of the model learned from the informative trajectory, a modified figure-8 trajectory with higher pitch and roll reference angles (up to 43 degrees) is used. As can be seen in the phase plot in Fig. 5, the state/input pairs of the modified figure-8 extend up to twice the range of the original one. In this case, both the model from the informative trajectory and the one from the non-informative trajectory have 100 data points. It can be seen from Table I that the model learned from the informative trajectory yields better tracking performance, especially around the z-axis.

VI. CONCLUSION
This work presents a practical framework that effectively and efficiently collects data points for learning models used at the control level, significantly improving tracking performance on real robots. We experimentally demonstrate the validity of the method on an overactuated aerial robot, the omav, whose dynamics is complex and difficult to learn. Experimental results show that learning from informative trajectories is efficient in data collection and that the learned model generalizes to modified trajectories.
REFERENCES
[1] P. Abbeel, A. Coates, and A. Y. Ng, "Autonomous helicopter aerobatics through apprenticeship learning," The International Journal of Robotics Research, vol. 29, no. 13, pp. 1608–1639, 2010.
[2] M. Kamel, T. Stastny, K. Alexis, and R. Siegwart, "Model predictive control for trajectory tracking of unmanned aerial vehicles using robot operating system," in Robot Operating System (ROS). Springer, 2017, pp. 3–39.
[3] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake, "Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot," Autonomous Robots, vol. 40, no. 3, pp. 429–455, 2016.
[4] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, "Learning-based nonlinear model predictive control to improve vision-based mobile robot path-tracking in challenging outdoor environments," in IEEE International Conference on Robotics and Automation, 2014.
[5] M. T. Gillespie, C. M. Best, E. C. Townsend, D. Wingate, and M. D. Killpack, "Learning nonlinear dynamic models of soft robots for model predictive control with neural networks," in IEEE International Conference on Soft Robotics (RoboSoft), 2018, pp. 39–45.
[6] K. Bodie, M. Brunner, M. Pantic, S. Walser, P. Pfändler, U. Angst, R. Siegwart, and J. Nieto, "An omnidirectional aerial manipulation platform for contact-based inspection," in Robotics: Science and Systems XV, 2019.
[7] J. Kabzan, L. Hewing, A. Liniger, and M. N. Zeilinger, "Learning-based model predictive control for autonomous racing," IEEE Robotics and Automation Letters, 2019.
[8] W. Zhang, M. Brunner, L. Ott, M. Kamel, R. Siegwart, and J. Nieto, "Learning dynamics for improving control of overactuated flying systems," IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5283–5290, 2020.
[9] D. Nguyen-Tuong and J. Peters, "Using model knowledge for learning inverse dynamics," in IEEE International Conference on Robotics and Automation, 2010.
[10] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, "Learning agile and dynamic motor skills for legged robots," Science Robotics, 2019.
[11] B. Settles, "Active learning literature survey," University of Wisconsin–Madison, Department of Computer Sciences, Tech. Rep., 2009.
[12] P. Schrangl, P. Tkachenko, and L. del Re, "Iterative model identification of nonlinear systems of unknown structure: Systematic data-based modeling utilizing design of experiments," IEEE Control Systems Magazine, vol. 40, no. 3, pp. 26–48, 2020.
[13] A. D. Wilson, J. A. Schultz, A. R. Ansari, and T. D. Murphey, "Dynamic task execution using active parameter identification with the Baxter research robot," IEEE Transactions on Automation Science and Engineering, vol. 14, no. 1, pp. 391–397, 2016.
[14] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, "Learning-based model predictive control for safe exploration," in IEEE Conference on Decision and Control (CDC), 2018, pp. 6059–6066.
[15] M. Buisson-Fenet, F. Solowjow, and S. Trimpe, "Actively learning Gaussian process dynamics," arXiv preprint arXiv:1911.09946, 2019.
[16] C. Zimmer, M. Meister, and D. Nguyen-Tuong, "Safe active learning for time-series modeling with Gaussian processes," in Advances in Neural Information Processing Systems, 2018, pp. 2730–2739.
[17] Y. K. Nakka, A. Liu, G. Shi, A. Anandkumar, Y. Yue, and S.-J. Chung, "Chance-constrained trajectory optimization for safe exploration and learning of nonlinear systems," arXiv preprint arXiv:2005.04374, 2020.
[18] F. Borrelli, A. Bemporad, and M. Morari, Predictive Control for Linear and Hybrid Systems. Cambridge University Press, 2017.
[19] A. Capone, G. Noske, J. Umlauft, T. Beckers, A. Lederer, and S. Hirche, "Localized active learning of Gaussian process state space models," in Learning for Dynamics and Control. PMLR, 2020, pp. 490–499.
[20] P. Congdon, Bayesian Statistical Modelling. John Wiley & Sons, 2007, vol. 704.
[21] T. Homem-de-Mello and G. Bayraksan, "Monte Carlo sampling-based methods for stochastic optimization," Surveys in Operations Research and Management Science, vol. 19, no. 1, pp. 56–85, 2014.
[22] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, Oakland, CA, USA, 1967, pp. 281–297.
[23] F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, Robot Operating System (ROS): The Complete Reference (Volume 1). Cham: Springer International Publishing, 2016, ch. RotorS—A Modular Gazebo MAV Simulator Framework, pp. 595–625. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-26054-9_23