Discontinuity-Sensitive Optimal Control Learning by Mixture of Experts
Gao Tang
Department of Mechanical Engineering and Material Science, Duke University, Durham, North Carolina 27708. Email: [email protected]
Kris Hauser
Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina 27708. Email: [email protected]
Abstract—This paper proposes a discontinuity-sensitive approach to learn the solutions of parametric optimal control problems with high accuracy. Many tasks, ranging from model predictive control to reinforcement learning, may be solved by learning optimal solutions as a function of problem parameters. However, nonconvexity, discrete homotopy classes, and control switching cause discontinuity in the parameter-solution mapping, thus making learning difficult for traditional continuous function approximators. A mixture of experts (MoE) model composed of a classifier and several regressors is proposed to address such an issue. The optimal trajectories of different parameters are clustered such that in each cluster the trajectories are continuous functions of the problem parameters. Numerical examples on benchmark problems show that training the classifier and regressors individually outperforms joint training of MoE. With suitably chosen clusters, this approach not only achieves lower prediction error with less training data and fewer model parameters, but also leads to dramatic improvements in the reliability of trajectory tracking compared to traditional universal function approximation models (e.g., neural networks).
I. INTRODUCTION
Nonlinear Optimal Control Problems (OCPs) are critical to solve to obtain high performance in many engineering applications. For example, model predictive control (MPC) requires an OCP to be solved in every control loop [1], while kinodynamic motion planners rely on solving OCPs between sampled states [3]. However, they are generally difficult to solve to global optimality quickly and with high confidence due to inherent nonconvexity. This has led to an intense interest in using learning to obtain approximations of optimal control policies, either using supervised learning [9, 12] or reinforcement learning [13].

In this paper, we highlight the problem that function approximators such as standard neural networks (SNN) perform poorly near discontinuities that are prevalent in many nonlinear OCPs. Fig. 1 shows the results of using a multilayer SNN to learn a pendulum swingup task from optimal trajectories. The optimal trajectories have three possible goal states, so the parameter-solution mapping is discontinuous. Although neural networks are quite useful for approximating nonlinear functions [7], near the region where the optimal goal state switches, they tend to predict a final state that interpolates between two goal states.

Fig. 1: Illustration of the dataset and the prediction of a selected state from SNN and MoE for the pendulum swingup task. (a) Samples of optimal pendulum swingup trajectories from different initial states; the red circles are possible target states. (b) Prediction of a selected state by SNN trained using the data in (a); the solid and dashed lines denote the optimal and predicted trajectories, respectively. (c) Samples of clustered optimal trajectories, where each color denotes one cluster; trajectories are clustered according to final state. (d) Prediction by MoE for the same state as (b).

This paper addresses this problem by adapting the Mixture of Experts (MoE) [8, 11, 17] model to learn the solutions to parametric OCPs. The model structure uses a classifier (gating network) to select a regressor (expert) which makes the final prediction (Fig. 2). We intend to train a model such that each regressor works in a region of the parameter space where the parameter-solution mapping is continuous. This is reminiscent of a divide-and-conquer approach, which has already been widely used in the control community for controller design [16]. Fig. 1 illustrates that the pendulum swingup dataset can be divided into three regions, and by classifying them and approximating them separately, MoE makes better predictions than SNN, particularly near the discontinuity. Considerable care must be taken during MoE training.

Fig. 2: Illustration of MoE. The classifier selects a model which makes the final prediction.
Although MoE is generally trained using backpropagation [17] or expectation maximization [11], training can be unstable. We propose an approach specially designed for parametric OCPs. The training set consists of solutions to a sampling of parametric OCPs, and we first partition the data into several clusters. Then the classifier is trained to predict the identity of the partition, and a separate regressor is trained for each partition. Each component is trained individually using backpropagation. Interestingly, although joint training leads to a model with lower prediction error (loss), it tends to worsen the trajectory tracking success rate. Moreover, clustering the dataset appropriately is nontrivial and fundamental to our approach. Rather than using general methods of input partitioning [18], we propose certain features of optimal trajectories that tend to work well empirically.

Experiments on toy underactuated control problems and agile vehicle control problems demonstrate that suitably trained MoE models can learn near-optimal trajectories suitable for trajectory tracking with remarkably high success rates (99.5+%).

II. RELATED WORK
Nonconvex OCPs are generally difficult to solve to global optimality, despite much work to enlarge the convergence domain, e.g., [10]. Moreover, numerical trajectory optimization [2] techniques are, in general, too computationally expensive for highly reactive motions.

As a result, machine learning approaches have been proposed to solve OCPs approximately but in real time. Reinforcement learning learns the optimal policy by interacting with the environment, and deep neural network policy approximators have been shown to solve complex control problems [13]. Another approach uses supervised learning to learn from precomputed optimal solutions to solve novel problems, and has seen successful application in trajectory optimization [9, 19, 20] and global nonlinear optimization [6]. In [9] precomputed optimal motions are used in a regression to predict trajectories for novel situations in order to speed up subsequent optimization. In [19] the nearest-neighbor optimal control (NNOC) method is proposed, with a multiple-restart method to handle discontinuities. In both of these works the techniques work faster than optimizing from scratch, but still require some amount of optimization for their predicted trajectories. This paper also learns optimal trajectories instead of optimal policies, which has the advantage that trajectories can be tracked using a stabilizing feedback controller to handle model uncertainties and disturbances. It should be noted that the predicted trajectory might not fully satisfy the system dynamics constraints. However, if learning is sufficiently accurate, this should not be an issue because a feedback controller can correct for such violations.

The discontinuity of the solutions to parametric OCPs as a function of problem parameters has long been known [4], a fact that has been underappreciated in the control learning community. Under certain assumptions, this function is piecewise continuous, and discontinuity-tolerant methods have been proposed for learning from optimal solutions [6, 19]. However, these approaches do not explicitly try to partition the space into regions. In contrast, the discontinuity-sensitive approach proposed here does indeed segment the dataset according to estimated discontinuities.

The most related work is previous research on MoE [11, 17, 18]. This paper proposes several modifications to MoE to make it suitable for learning optimal control. We use hard classification boundaries to avoid predicting an average of both sides, and we also modify the training approach. Traditionally MoE is trained using either backpropagation [17] or expectation maximization [11], so that the gating function and experts are both updated. However, we train the classifier and regressors individually, and experiments suggest that this is fundamental to achieving high trajectory tracking accuracy.

III. PROBLEM FORMULATION
In this section, the problem of learning from optimal control is formulated and the key components are analyzed. The proposed approach first formulates a parametric OCP and then performs the following procedure:
1) Input: collect a dataset of solutions to parametric OCPs at sampled parameters.
2) Cluster: select a clustering approach to cluster the trajectories and partition the parameter space.
3) Train: the weights of the classifier and regressors are trained individually using backpropagation.
4) Validate: predict optimal trajectories for novel states and validate the learned model by trajectory rollout.
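To make the four-step procedure concrete, the following minimal sketch runs the same pipeline on a synthetic one-dimensional problem with a single jump discontinuity. All names and the toy map are illustrative, not the paper's code: the "classifier" is a nearest-centroid rule and each "regressor" a least-squares line.

```python
# Minimal end-to-end sketch of the four-step procedure on a synthetic
# 1-D problem with one jump discontinuity (illustrative only).

def zstar(p):
    # discontinuous parameter-solution map standing in for an OCP solver
    return p + 1.0 if p >= 0 else p - 1.0

# 1) Input: dataset of (parameter, solution) pairs on a grid
P = [i / 10.0 for i in range(-10, 11)]
Z = [zstar(p) for p in P]

# 2) Cluster: by the sign of the solution (a trajectory feature)
labels = [0 if z < 0 else 1 for z in Z]

# 3) Train individually: nearest-centroid classifier ...
cents = [sum(p for p, l in zip(P, labels) if l == c) /
         sum(1 for l in labels if l == c) for c in (0, 1)]

def classify(p):
    return min((abs(p - m), c) for c, m in enumerate(cents))[1]

# ... and one least-squares line fit per cluster as the "regressor"
def fit_line(ps, zs):
    mp, mz = sum(ps) / len(ps), sum(zs) / len(zs)
    b = sum((p - mp) * (z - mz) for p, z in zip(ps, zs)) / \
        sum((p - mp) ** 2 for p in ps)
    return b, mz - b * mp

regs = [fit_line([p for p, l in zip(P, labels) if l == c],
                 [z for z, l in zip(Z, labels) if l == c]) for c in (0, 1)]

# 4) Validate: hard classification picks one regressor (no averaging)
def predict(p):
    b, a = regs[classify(p)]
    return b * p + a
```

A single regressor fit to all of the data would predict near the average of the two branches at the jump; the hard gate avoids this averaging, which is the central point of the approach.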
A. Parametric Optimal Control
A system is governed by the dynamical equations
$$\dot{x} = f(t, x, u, p) \quad (1)$$
where $t$ is time; $x \in \mathbb{R}^n$ is the state variable; $u \in \mathbb{R}^m$ is the control variable; and $p \in \mathbb{R}^l$ is the vector of problem parameters, which captures the variability of the studied problems. The vector $p$ may specify the initial state, model parameters, and modifications to costs or constraints. We use subscripts $0$ and $f$ to denote the variables at the initial and final time, respectively. The goal is to control the system from some state $x_0$ to some state $x_f$ while minimizing the cost function
$$J = \varphi(t_0, x_0, t_f, x_f, p) + \int_{t_0}^{t_f} L(t, x(t), u(t), p)\,dt \quad (2)$$
where $\varphi$ depends only on the initial and final states, and $L$ depends on the state and control variables within $[t_0, t_f]$. Practical OCPs may have state, control, and terminal set constraints that have to be satisfied, and we refer to [2] for details.

Parametric OCPs are generally difficult to solve analytically [14], but for any given parameter, numerical methods may be used to solve the resulting OCP [2]. In this work we employ a direct transcription method, which transforms the OCP into a nonlinear optimization problem and solves it using SNOPT [5]. The solution trajectory is a sequence of state and control variables along a time grid, denoted as $z \equiv \{t_i; x_i; u_i\}_{i=0}^{N}$ where $N$ is the grid size for discretization. Stacking the elements of $z$ into a vector, our goal is to approximate the map from problem parameters to optimal trajectories $z^\star(p)$.

B. Optimal Trajectory Database Generation
To train and test models we generate a database of optimal trajectories $z_1, \ldots, z_M$ for sampled problems $p_1, \ldots, p_M \in \mathbb{R}^l$. Due to non-convexity, even finding a global optimum for a single problem can be difficult. One practical approach is to pick the best local optimum from a multi-start method. However, a local optimum can also be quite difficult to find if the provided initial guess is not close to the optimum. We adopt a nearest-neighbor approach [19] to help generate large databases quickly. We first sample some number of problems (fewer than $M$ but much larger than the number of expected partitions) and use an exhaustive random-restart approach to solve them. These solutions form the initial database. Then we sample more parameters, and for each new problem we attempt local optimization from each of its $k$-nearest neighbors to find $k$ local optima. The best solution is kept in the database. We note that this process is done completely offline and is parallelizable.

C. Mixture of Experts
The MoE model is composed of a classifier and $r$ regressors, as shown in Fig. 2. In this paper both models are chosen as multilayer perceptrons (MLP). The goal is to learn a function $z : \mathbb{R}^l \to \mathbb{R}^R$ that approximates $z^\star(p)$, where $R$ is the length of the vector $z$. Each regressor takes input $p \in \mathbb{R}^l$ and makes a prediction $y_i(p, w_i) \in \mathbb{R}^R$, $i = 1, \ldots, r$, where $w_i$ specifies the weights of that regressor. The classifier, with weights $w_c$, takes input $p$ and predicts $r$ values $\{c_i\}_{i=1}^{r}$. The outputs of the classifier are combined with softmax to assign probabilities to each model, i.e.,
$$P_i = \frac{\exp c_i}{\sum_{j=1}^{r} \exp c_j} \quad (3)$$
or with argmax to select one model only (in this case, $P_k = 1$ for $k = \arg\max_i c_i$ and $P_k = 0$ otherwise). The final prediction is
$$z(p) = \sum_{i=1}^{r} P_i(p, w_c)\, y_i(p, w_i). \quad (4)$$
The target is to find $w_c$ and $\{w_i\}_{i=1}^{r}$ in order to minimize
$$\mathcal{L} = \mathbb{E}_{p \sim P_{\mathrm{data}}}\, \mathrm{loss}(z(p), z^\star(p)) \quad (5)$$
where $P_{\mathrm{data}}$ is a distribution over problems and $\mathrm{loss}(\cdot, \cdot)$ is any regression loss function.

The most straightforward way to train MoE is to treat it as an SNN, randomly initialize the weights, and minimize (5) using backpropagation. Although several heuristics have been proposed to train MoE using backpropagation, such as [17], training may still be unstable. If softmax is used, all the data is used to train each regressor, with weights equal to the probabilities predicted by the classifier. In the case of argmax, each regressor is only trained using the data assigned to it by the classifier, and there is no gradient for the classifier to update its weights. Softmax, on the other hand, still provides a gradient to update the weights of the classifier.

To perform joint training, since argmax is the limit of softmax when $\{c_i\}_{i=1}^{r}$ is scaled by a large positive scalar, we introduce a temperature $\varepsilon \in (0, \infty)$ by which the outputs of the classifier are divided before applying softmax, i.e.,
$$P_i = \frac{\exp(c_i/\varepsilon)}{\sum_{j=1}^{r} \exp(c_j/\varepsilon)}. \quad (6)$$
As $\varepsilon \to 0$, the softmax weights approach the argmax function. Hence, $\varepsilon$ must be gradually lowered to balance between updating the weights of the classifier and restricting the mixture of outputs from multiple regressors. As we shall show later, joint training of MoE may improve the loss function compared to decoupled training, but appears to be detrimental to trajectory tracking performance.

D. Parameter Space Partition
Clustering has been shown to be effective at avoiding some instability in MoE training [18] by training the classifier and regressors of MoE individually on subsets of the data. We adopt the same approach here, and study how to partition the parameter space such that in each region the parameter-solution mapping is continuous.

The dataset $\{(p_j, z_j)\}_{j=1}^{M}$ is divided into $r$ groups $C_1, \ldots, C_r$, ideally so that $z^\star(p)$ is a continuous function for all $p$ in a given region. This can be formulated as a clustering problem in which each cluster denotes a region of the partitioned parameter space. The classifier is trained to predict $P_i(p_j, w_c) = 1$ if $p_j \in C_i$, and the $i$'th regressor is trained as usual, restricted to the examples in $C_i$. We call this process (decoupled) pretraining.

Parametric OCPs have rich features that can be used to find appropriate clusters. We note that this partition cannot be done using the problem parameters alone, since the target is to find the discontinuities in the solutions. Discontinuity comes from switching from one family of local optima to another. Hence, although the objective function values and the problem parameters at these discontinuities are similar, the trajectories may not be. For example, a car might reverse first or move forward first, and a quadcopter might avoid an obstacle from above or below.

Hence, we experiment with using the distance between optimal trajectories to classify the family of solutions. The simplest approach is to apply standard clustering techniques, such as the k-Means algorithm, in the trajectory vector space. To do so, we first normalize the state and control variables to zero mean and unit variance. After choosing a number of clusters $k$, the k-Means algorithm is run from random initial centers.

Our experiments observe that k-Means is for some problems successful at predicting discontinuities, but can also group trajectories poorly when $k$ is small.
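As a sketch of this clustering step, normalization followed by k-Means might look like the following. The toy dataset and the deterministic spread-out initialization (in place of the random centers used in the paper) are illustrative assumptions.

```python
# Sketch: normalize stacked trajectory vectors to zero mean / unit
# variance per dimension, then cluster with a small k-Means loop.
# Toy data and deterministic initialization are illustrative choices.

def normalize(X):
    n, d = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / n for j in range(d)]
    sd = [((sum((x[j] - mu[j]) ** 2 for x in X) / n) ** 0.5) or 1.0
          for j in range(d)]
    return [[(x[j] - mu[j]) / sd[j] for j in range(d)] for x in X]

def kmeans(X, k, iters=20):
    # spread-out initialization keeps this sketch deterministic
    cents = [list(X[i * (len(X) - 1) // max(k - 1, 1)]) for i in range(k)]
    labels = [0] * len(X)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: sum((xi - ci) ** 2
                                        for xi, ci in zip(x, cents[c])))
                  for x in X]
        # update step: move each center to the mean of its points
        for c in range(k):
            pts = [x for x, l in zip(X, labels) if l == c]
            if pts:
                cents[c] = [sum(p[j] for p in pts) / len(pts)
                            for j in range(len(X[0]))]
    return labels

# two well-separated families of stacked trajectory vectors (toy data)
trajs = [[0.1 * i, 0.0] for i in range(5)] + \
        [[10.0 + 0.1 * i, 5.0] for i in range(5)]
labels = kmeans(normalize(trajs), k=2)
```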
On the other hand, when $k$ is large, each cluster contains less training data, causing the regressors to overfit and making the job of the classifier harder.

We also propose custom clustering criteria that are based on a system designer's intuition and inspection of the datasets. As an example, the periodicity of angles is a useful feature when an angle is in the state space and the optimal trajectories have distinct final angles; in other words, the trajectories lie in distinct homotopy classes. This is useful for the pendulum swingup problem as well as the ground vehicle control problem we consider later. Another applicable approach is to examine the Lagrange multipliers of the constraints at optimal solutions, since they provide rich information about how constraints influence the trajectory's shape. For example, in quadcopter obstacle avoidance the shortest path might go on either side of the obstacle, so the gradient of the active constraint will have a different sign.

E. Discussion and Preliminary Experimentation
The usual approach to MoE is to first perform pretraining before (coupled) retraining by minimizing (5). The rationale is that pretraining provides a good initialization; but if the data is clustered badly, i.e., one cluster contains a discontinuity, the loss function may be large. Moreover, even if the clustering is perfect, a pretrained model does not necessarily minimize (5) due to misclassification. In this section we experimentally demonstrate and discuss why this may be a poor approach for parametric OCPs.

We study a toy pendulum swingup task, where the task is to reach the upright position. Details on the system and the neural network models are given in Sec. IV-A. We compare on two metrics: 1) test error (smoothed L1 loss) and 2) rollout success rate after trajectory tracking. In trajectory tracking, we simulate trajectory execution under an LQR controller, which compensates for errors and dynamic constraint violations. About each state along the predicted trajectory, we compute an LQR solution for a linear dynamics model and a quadratic cost obtained by Taylor expansion. After trajectory tracking is complete, the simulation switches to a stabilizing controller about the origin. If after 5 seconds the norm of the state error is within a certain threshold (0.1), we count the rollout as a success. (We note that for the car problem, only the first stage is implemented since the final state is not controllable.)

The following variations are considered:
1) SNN vs. MoE,
2) MoE with random weights against k-means clustering on trajectories, and against custom clustering, and
3) retraining vs. no retraining.

The SNN is chosen as an MLP of size (2, 300, 75), where the first number denotes the size of the input layer, the last number denotes the size of the output layer, and the intermediate numbers indicate the sizes of the hidden layers. We experimented with SNNs with more hidden layers or more neurons in the hidden layer, but they result in similar or larger test error.
Specifically, MLPs of size (2, 50, 20, 75), (2, 20, 50, 75), and (2, 30, 30, 30, 75) yield test errors of 0.258, 0.170, and 0.232, respectively. The size (2, 300, 75) network, on the other hand, has a test error of 0.058.

For MoE, the classifier is of size (2, 50, $r$) and the $r$ regressors are all of size (2, 20, 75). Custom MoE and random-weight MoE use 3 experts. The custom clustering divides the data into 3 clusters based on the final angle. We also use k-means with 3, 4, and 10 clusters solely on trajectories, with the same design of network size.

Fig. 3a plots the prediction error on $\theta_f$ and Fig. 3b plots the state error after trajectory tracking. The validation error and rollout success for each model are also listed in Tab. I. Row 1 shows that SNN has difficulty making predictions in regions near the discontinuity, averaging between both sides. MoE also makes some inaccurate predictions, but these are caused by misclassification, and the prediction is a locally optimal trajectory belonging to another cluster. Hence, they are suboptimal but still reach the vertical position as desired, since the difference in $\theta_f$ is $2\pi$. The suboptimality is not too great, because near the boundaries the two families of solutions have similar objective function values. MoE trained from random initialization does achieve lower prediction error than SNN, but is not very successful. This indicates that training by simply descending (5) is unable to guide the classifier to the appropriate clusters.

Row 2 tests MoE with k-Means and various cluster sizes ($k = 3, 4, 10$), which are shown in Fig. 4. $k = 10$ clusters finds the discontinuity successfully, and the resulting MoE achieves a high success rate.

Row 3 of Fig. 3 shows various methods of retraining after pretraining MoE with custom clustering. In all cases this approach decreases the regression error but also the rollout success rate. In (vii) argmax is used following the output layer of the classifier. The classifier has no gradient to update itself, so only the regressors are updated. Due to classification error, the regressors are trained with trajectories from other clusters. As a result, predictions near the boundaries tend towards the average of two clusters. In (viii) and (ix) we use softmax with different $\varepsilon$. In these cases, the classifier is updated but the regressors will predict towards the average. As shown in Tab. I, retraining does decrease the prediction error at the cost of a lower rollout success rate.

Fig. 3: Comparing several models for learning the pendulum swing-up task. (a) Prediction error of $\theta_f$. (b) State error after trajectory tracking. Models: SNN; MoE Custom; MoE Rand; k-means-3; k-means-4; k-means-10; Retrain Argmax; Softmax 1.0; Softmax 0.1.

Fig. 4: Choices of clusters for the pendulum problem. Different colors denote different clusters. Panels: 1) custom 3 clusters, 2) k-Means with 3 clusters, 3) k-Means with 4 clusters, 4) k-Means with 10 clusters.

These experiments suggest that proper clustering is important for MoE training. Moreover, rollout success is a better metric to use in practice, while testing error can be misleading. Due to misclassifications, a lower testing error can be achieved by averaging at discontinuities, but this leads to severe failures. We also observe that coupled retraining is detrimental to performance. This is because imperfect classification causes the individual regressors to be provided with discontinuous training data, again leading to averaging artifacts.

IV. NUMERICAL EXAMPLES
We run experiments on the pendulum task and three dynamic vehicle problems; the details are given below. Results are summarized in Tab. II. In each case, the training sets contained 80% of the examples specified in Dataset size, and the testing sets contained the remaining 20%. Validation sets (of size Validation size) are generated separately.

SNN test error indicates the testing error when training is terminated. SNN hyperparameters (SNN size) were tuned to achieve low test error. Validation error (SNN/MoE validation) indicates the loss on the validation set, while rollout error (SNN/MoE rollout) indicates the success rate during trajectory tracking. Except for the car problem, this involves the stabilizing LQR approach described in Sec. III-E. Details on the car rollout success criteria are specified below.

Details on the MoE network design are listed in the rows giving the number of clusters, the resulting cluster sizes, and the network hyperparameters (Classifier/Regressor size). The Regressor Test Error row indicates how well the MoE regressors fit the clustered data, showing that each regressor has quite small error when fit on a continuous region.

In all of these experiments, the hidden layers use LeakyReLU with $\alpha = 0.2$. The output layer of each regressor is a linear layer without a nonlinear activation function. The loss function is the smooth L1 loss for the regressors and the cross-entropy loss for the classifier.
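For reference, the activation and regression loss named above can be written out explicitly. This is a standalone sketch, not the authors' training code; in practice these would come from a deep learning library.

```python
# Scalar LeakyReLU and the smooth L1 (Huber-like) regression loss,
# written out explicitly for reference.

def leaky_relu(x, alpha=0.2):
    # LeakyReLU with negative slope alpha = 0.2, as in the experiments
    return x if x >= 0 else alpha * x

def smooth_l1(pred, target):
    # per-component: 0.5 e^2 if |e| < 1, else |e| - 0.5; averaged
    total = 0.0
    for p, t in zip(pred, target):
        e = abs(p - t)
        total += 0.5 * e * e if e < 1.0 else e - 0.5
    return total / len(pred)
```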
A. Pendulum Swing-up

1) Problem Setup:
The system's dynamic equations are
$$\dot{\theta} = \omega, \quad \dot{\omega} = u - \sin\theta \quad (7)$$
where $\theta, \omega$ are the angle and angular velocity of the pendulum, and $u$ is the bounded control torque. The problem parameters are the initial states. The target state is the straight-up state, i.e., $\omega_f = 0$, $\mathrm{mod}(\theta_f, 2\pi) = \pi$. The cost function is a weighted sum of time and control energy, i.e., $J = w(t_f - t_0) + r\int_{t_0}^{t_f} u^2\,dt$ with weights $w$ and $r$.

TABLE I: Comparison of prediction error and rollout success rate on the pendulum problem

Model             | SNN   | MoE
Clustering        | —     | Custom | Rand. | k-means-3 | k-means-4 | k-means-10 | Custom | Custom      | Custom
Retrain           | —     | —      | —     | —         | —         | —          | argmax | softmax 1.0 | softmax 0.1
Validation error  | 0.046 | 0.030  | 0.035 | 0.039     | 0.029     | 0.051      | 0.027  | 0.028       | 0.026
Success (of 1000) | 717   | 998    | 829   | 970       | 1000      | 1000       | 941    | 896         | 969

Fig. 5: Left: samples of optimal trajectories for the car problem. Each color corresponds to one cluster of trajectories; the black circle is the target. Right: a selected state for which SNN makes a worse prediction than MoE; it also shows that states near this state may belong to three different trajectory clusters. SNN predicts a trajectory with an incorrect final angle.
2) Data Generation and Training:
The parameter space is a subset of $\mathbb{R}^2$, and we directly sample parameters on a uniform grid. Specifically, we use a grid size of 61 × 21. The validation set is sampled at random. Samples of optimal trajectories are shown in Fig. 1. The custom clustering partitions the trajectories by $\theta_f$.

B. Ground Vehicle

1) Problem Setup:
We use a planar car with the dynamic equations
$$\dot{x} = v\sin\theta, \quad \dot{y} = v\cos\theta, \quad \dot{\theta} = u_\theta v, \quad \dot{v} = u_v \quad (8)$$
where the state $x = [x, y, \theta, v]$ includes the planar coordinates, orientation, and velocity of the vehicle, and the control $u = [u_\theta, u_v]$ includes the control variables that change the steering angle and velocity, respectively. The problem parameters are the initial states, as listed in Tab. II, and the goal is to control the system to the origin with zero velocity and $\mathrm{mod}(\theta_f, 2\pi) = 0$. The cost function is $J = w(t_f - t_0) + \int_{t_0}^{t_f} (r_1 u_\theta^2 + r_2 u_v^2)\,dt$ with weights $w$, $r_1$, and $r_2$.
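The car dynamics (8) can be rolled out with a standard RK4 integrator, as in the following sketch. The constant control and step size are illustrative assumptions; in the paper a predicted trajectory is tracked with LQR rather than holding the control fixed.

```python
import math

# RK4 rollout of the ground-vehicle dynamics (8).
# Constant control and step size are illustrative choices.

def car_dynamics(state, u):
    x, y, th, v = state
    u_th, u_v = u
    return [v * math.sin(th), v * math.cos(th), u_th * v, u_v]

def rk4_step(state, u, dt):
    def shift(s, k, h):  # s + h*k, elementwise
        return [si + h * ki for si, ki in zip(s, k)]
    k1 = car_dynamics(state, u)
    k2 = car_dynamics(shift(state, k1, dt / 2), u)
    k3 = car_dynamics(shift(state, k2, dt / 2), u)
    k4 = car_dynamics(shift(state, k3, dt), u)
    return [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

state = [0.0, 0.0, 0.0, 1.0]      # at origin, heading +y, unit speed
for _ in range(100):              # 1 s of coasting (u = 0)
    state = rk4_step(state, [0.0, 0.0], 0.01)
```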
2) Data Generation and Training:
The data is generated by uniformly sampling the parameter space. Fig. 5 shows a few samples of the optimal trajectories. Similar to the pendulum swingup problem, the constraint on $\theta_f$ makes it possible to reach the goal with different $\theta_f$. The custom clustering is developed by inspection, whereby we first divide the dataset into three groups based on the final angle. We then find that for trajectories with the same $\theta_f$, the car can either go forward or backward to reach the origin, i.e., with positive or negative velocities. This is illustrated in Fig. 6. Hence, we divide the dataset into 6 clusters. We note that the cluster sizes are bimodal, and we use a larger regression network for the clusters with larger size.

Fig. 6: Samples of trajectories in each cluster for the car problem. Each column corresponds to one cluster; each row shows one state variable.
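A minimal sketch of this 6-way custom clustering follows, assuming three candidate final angles at multiples of $2\pi$ and using the net sign of the velocity profile to distinguish forward from backward motion; both assumptions are illustrative, not taken verbatim from the paper.

```python
import math

# Sketch of the 6-way car clustering: 3 groups by final angle
# (distinct multiples of 2*pi reaching the same physical heading)
# times 2 groups by net sign of the velocity (forward vs. backward).
# The candidate final angles are illustrative assumptions.

FINAL_ANGLES = [-2 * math.pi, 0.0, 2 * math.pi]

def car_cluster(theta_f, velocities):
    angle_id = min(range(3),
                   key=lambda i: abs(theta_f - FINAL_ANGLES[i]))
    forward = sum(velocities) >= 0   # net forward vs. backward motion
    return 2 * angle_id + (1 if forward else 0)
```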
3) Trajectory tracking:
Because this problem is not controllable at the origin, a stabilizing LQR controller may not be used at the trajectory endpoint. Instead, we simply perform LQR rollout on the predicted trajectory, and stop when the end time is reached. To determine success, we check whether the norm of the final state error is within 0.5.
4) Results and Discussion:
The data in Tab. II show similar trends to the pendulum problem; in particular, MoE yields lower validation error and a higher rollout success rate than SNN. Moreover, the custom clustering outperforms k-Means, which in turn outperforms SNN. In Fig. 5 we show the predictions from SNN and MoE for a selected parameter as well as the optimal trajectories of its neighbors. It is clearly shown that SNN may fail to predict $\theta_f$ correctly.

The histogram in Fig. 7a shows the norm of the error in the predicted final state, indicating that SNN has higher prediction error. Fig. 7b also shows that paths predicted by SNN violate the system dynamics more than those of MoE. The reason the tracking error is actually much larger than predicted is that the predicted trajectory violates the system dynamics, so path tracking diverges.

C. Quadcopter with Collision Avoidance

1) Problem Setup:
The system has state $x = (x, y, z, v_x, v_y, v_z, \phi, \theta, \psi, p, q, r) \in \mathbb{R}^{12}$ and control $u \in \mathbb{R}^4$.

TABLE II: Summary of experimental results for SNN and MoE

                 | pendulum               | ground vehicle         | quadcopter                        | quadcopter-obstacle
State dims       | 2                      | 4                      | 12                                | 12
Control dims     | 1                      | 2                      | 4                                 | 4
Problem param.   | $x_0 \in \mathbb{R}^2$ | $x_0 \in \mathbb{R}^4$ | initial position, $\mathbb{R}^3$  | initial position and obstacle, $\mathbb{R}^7$ (a)
Param range      | …                      | …                      | …                                 | …
Dataset size     | …                      | …                      | …                                 | …
SNN validation   | …                      | …                      | …                                 | …
SNN rollout (b)  | …                      | …                      | …                                 | …
MoE validation   | …                      | …                      | …                                 | …
MoE rollout (b)  | …                      | …                      | …                                 | -0.167

(a) The obstacle is sampled such that it always collides with the optimal obstacle-free trajectory.
(b) For the quadcopter problems, the average of the largest constraint violations based on trajectory rollout; all states can be controlled to the target. See the histogram in Fig. 11 for the distribution.

Fig. 7: Histograms of prediction and tracking results for MoE and SNN on the car problem. (a) Prediction error of $x_f$. (b) Tracking error of $x_f$.
We refer to [15] for the details. The goal is to control the quadcopter from any equilibrium state, with position within a bounded box and all other states zero, to the goal state $\mathbf{0}$. The cost function is a weighted sum of time, control energy, and a penalty on the states, i.e., $J = w(t_f - t_0) + \int_{t_0}^{t_f} (x^T Q x + u^T R u)\,dt$ with weight $w$ and diagonal weight matrices $Q$ and $R$.

The quadcopter-obstacle case imposes additional path constraints on the state variables. The obstacle is a sphere with varying position and radius; obstacles are randomly placed in space with radius within a bounded range. We are interested in how the obstacles influence the trajectory.
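For diagonal $Q$ and $R$, the running cost $x^T Q x + u^T R u$ reduces to weighted sums of squares; a short sketch (the actual diagonal weights from the paper are not reproduced here, so the arguments are placeholders):

```python
# Running cost x^T Q x + u^T R u for diagonal Q and R.
# q_diag and r_diag hold the diagonal entries (placeholder values).

def running_cost(x, u, q_diag, r_diag):
    return sum(qi * xi * xi for qi, xi in zip(q_diag, x)) + \
           sum(ri * ui * ui for ri, ui in zip(r_diag, u))
```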
2) Data Generation and Training:
In the obstacle-free case, initial positions are sampled at random, and k-Means is used for clustering.

The obstacle problem is more challenging because it has a higher-dimensional parameter space (dimension 7). The OCP is also more challenging to solve due to the non-convexity of the obstacle avoidance constraint. We want to focus on problem instances with significant obstacles, so our dataset only includes examples where the optimal collision-free trajectory would collide with an obstacle. To generate this dataset, we collect obstacle-free trajectories and then sample obstacles that collide with the trajectory. We then re-optimize for the sampled obstacles. Samples of trajectories are shown in Fig. 8 and Fig. 9.

Fig. 8: Trajectories of problems with parameters close to the problem with a sphere at (4, 4, 4), radius 3, and initial position (8, 8, 8). Each color corresponds to trajectories from one cluster. It shows that the trajectories can be quite different even for close problem parameters.

Fig. 9: Samples of optimal trajectories for the quadcopter problem. Each color corresponds to one trajectory cluster.

The discontinuity of the parameter-solution mapping in this problem comes from avoiding the obstacle from different directions: in some regions one family of solutions outperforms the others, and vice versa. One feature that describes how the obstacle-free trajectory is affected by the obstacles is the gradient of the active constraints with respect to the state variables. Since the obstacles are spheres, the gradient is essentially the vector from the center of the sphere to the point on the surface where the constraint is active. Its direction clearly shows in which direction the trajectory has to change for collision avoidance. For trajectories that have more than one active constraint, we use the multipliers as weights and take the average. In this way, a 3D vector is calculated for each trajectory and used as a feature to divide the problem space. We divide the dataset into 8 groups based on the sign of each element of the 3D vector.
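This feature construction can be sketched as follows. The function names are illustrative, and the multiplier-weighted average assumes at least one active constraint with a positive multiplier sum.

```python
# Sketch of the 8-way partition from the multiplier-weighted average
# gradient of the active sphere constraints: one group per sign
# pattern of the 3-D feature vector.

def gradient_feature(grads, multipliers):
    # multiplier-weighted average of active-constraint gradients
    w = sum(multipliers)
    return [sum(m * g[j] for g, m in zip(grads, multipliers)) / w
            for j in range(3)]

def sign_cluster(feature):
    # three sign bits -> cluster id in {0, ..., 7}
    return sum((1 << j) for j, v in enumerate(feature) if v > 0)
```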
3) Results and Discussion:
Results show that both SNN and MoE control the quadcopter to a stabilizable state in highly reliable fashion without obstacles. Hence, for validation we focus more on the amount of collision avoidance violation, i.e., $\min_{i=1}^{N}\{\|\boldsymbol{x}_i - \boldsymbol{c}_o\| - r_o\}$, where $r_o$ and $\boldsymbol{c}_o$ are respectively the radius and center of the obstacle. With obstacles, MoE with custom clustering also significantly outperforms the others. A histogram of the constraint violation is shown in Fig. 11, indicating that MoE yields much lower violation of constraints than SNN. Fig. 10 shows examples of optimal trajectories and predictions from SNN and MoE. As the initial state moves along the z direction, the optimal trajectories turn from going above to going below the obstacle. SNN is unable to handle such discontinuity and predicts a trajectory that violates the constraints. However, MoE is able to detect such discontinuity and predicts the corresponding trajectories. It is important to note, however, that MoE still creates grazing collisions, so to successfully avoid an obstacle in practice, either a margin of error should be added to the modeled obstacle, or local collision avoidance should be added to the trajectory tracker.

Fig. 10:
Optimal trajectories and predictions from SNN and MoE for two nearby initial states. The green sphere is an obstacle centered at (0, 4, 4) with radius 3. The solid, dashed and dotted lines are the optimal trajectories, the prediction of MoE, and the prediction of SNN, respectively. It shows that SNN predicts a trajectory that violates the obstacle avoidance constraints.
Fig. 11:
Rollout constraint violation for quadcopter-obstacle.
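The rollout violation metric defined above is the worst signed clearance between the tracked positions and the obstacle surface. A minimal sketch, with the function name and array layout assumed by us:

```python
import numpy as np

def worst_violation(rollout, obstacle_center, obstacle_radius):
    """Worst-case signed obstacle clearance along a rollout.

    rollout: (N, 3) array of tracked positions x_i.
    Returns min_i { ||x_i - c_o|| - r_o }; a negative value means the
    rollout penetrated the obstacle by that depth.
    """
    dists = np.linalg.norm(rollout - obstacle_center, axis=1)
    return float(np.min(dists - obstacle_radius))
```

A histogram of this quantity over many validation rollouts is exactly the kind of plot shown in Fig. 11: mass below zero corresponds to collisions, mass just above zero to grazing passes.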
V. CONCLUSION
In this paper we demonstrate that optimal trajectories can be learned with high accuracy if we take into account the special structure of optimal control problems. The mixture of experts model is designed such that each expert approximates a smooth region in the problem-optimum map, and the classifier handles discontinuities without averaging. It is important to train MoE with the correct clusters, and curiously, coupled training of the regressors and classifier tends to be detrimental to tracking performance. We also argue that test error is not a good metric to judge learning models; rather, rollout success rate under trajectory tracking control is preferable.

Future work includes developing more sophisticated clustering algorithms that automatically find the best partition strategy. For certain OCPs, differential flatness can be used such that the predicted trajectory satisfies dynamical constraints. Further work also includes proving the stability of the predicted trajectories, and scaling up to handle larger problems, e.g., with sensor data or model uncertainties.

ACKNOWLEDGMENTS
This work is supported by NSF grant
REFERENCES

[1] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos. The explicit solution of model predictive control via multiparametric quadratic programming. In Proc. American Control Conf., volume 1–6, pages 872–876, 2000.
[2] John T. Betts. Survey of numerical methods for trajectory optimization. Journal of Guidance, Control, and Dynamics, 21(2):193–207, 1998.
[3] Bruce Donald, Patrick Xavier, John Canny, and John Reif. Kinodynamic motion planning. Journal of the ACM (JACM), 40(5):1048–1066, 1993.
[4] Anthony V. Fiacco. Introduction to Sensitivity and Stability Analysis in Nonlinear Programming. 1983.
[5] Philip E. Gill, Walter Murray, and Michael A. Saunders. SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Review, 47(1):99–131, 2005.
[6] K. Hauser. Learning the problem-optimum map: Analysis and application to global optimization in robotics. IEEE Trans. Robotics, 33(1):141–152, February 2017.
[7] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[8] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
[9] Nikolay Jetchev and Marc Toussaint. Fast motion planning from experience: trajectory prediction for speeding up movement generation. Autonomous Robots, 34(1-2):111–127, January 2013.
[10] Fanghua Jiang, Hexi Baoyin, and Junfeng Li. Practical techniques for low-thrust trajectory optimization with homotopic approach. J. Guid. Control Dynam., 35(1):245–258, 2012.
[11] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[12] Roberto Lampariello, Duy Nguyen-Tuong, Claudio Castellini, Gerd Hirzinger, and Jan Peters. Trajectory planning for optimal robot catching in real-time. In , pages 3719–3726. IEEE.
[13] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[14] Helmut Maurer and Dirk Augustin. Sensitivity analysis and real-time control of parametric optimal control problems using boundary value methods. In Online Optimization of Large Scale Systems, pages 17–55. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
[15] Daniel Mellinger, Nathan Michael, and Vijay Kumar. Trajectory generation and control for precise aggressive maneuvers with quadrotors. The International Journal of Robotics Research, 31(5):664–674, April 2012.
[16] Roderick Murray-Smith and T. Johansen. Multiple Model Approaches to Nonlinear Modelling and Control. CRC Press, 1997.
[17] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[18] Bin Tang, Malcolm I. Heywood, and Michael Shepherd. Input partitioning to mixture of experts. In Neural Networks, 2002. IJCNN'02. Proceedings of the 2002 International Joint Conference on, volume 1, pages 227–232. IEEE, 2002.
[19] Gao Tang and Kris Hauser. A data-driven indirect method for nonlinear optimal control. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages –. IEEE, 2017.
[20] Teodor Tomić, Moritz Maier, and Sami Haddadin. Learning quadrotor maneuvers from optimal control and generalizing in real-time. In