An Optimal Control Approach to Sequential Machine Teaching
Laurent Lessard    Xuezhou Zhang    Xiaojin Zhu
University of Wisconsin–Madison
Abstract
Given a sequential learning algorithm and a target model, sequential machine teaching aims to find the shortest training sequence to drive the learning algorithm to the target model. We present the first principled way to find such shortest training sequences. Our key insight is to formulate sequential machine teaching as a time-optimal control problem. This allows us to solve sequential teaching by leveraging key theoretical and computational tools developed over the past 60 years in the optimal control community. Specifically, we study the Pontryagin Maximum Principle, which yields a necessary condition for optimality of a training sequence. We present analytic, structural, and numerical implications of this approach on a case study with a least-squares loss function and gradient descent learner. We compute optimal training sequences for this problem, and although the sequences seem circuitous, we find that they can vastly outperform the best available heuristics for generating training sequences.
1 INTRODUCTION

Machine teaching studies optimal control on machine learners (Zhu et al., 2018; Zhu, 2015). In controls language, the plant is the learner, the state is the model estimate, and the input is the (not necessarily i.i.d.) training data. The controller wants to use the fewest training items, a quantity known as the teaching dimension (Goldman and Kearns, 1995), to force the learner to learn a target model. For example, in adversarial learning, an attacker may minimally poison the training data to force a learner to learn a nefarious model (Biggio et al., 2012; Mei and Zhu, 2015). Conversely, a defender may immunize the learner by injecting adversarial training examples into the training data (Goodfellow et al., 2014). In education systems, a teacher may optimize the training curriculum to enhance learning by a student modeled as a learning algorithm (Sen et al., 2018; Patil et al., 2014).

Machine teaching problems are either batch or sequential, depending on the learner. The majority of prior work studied batch machine teaching, where the controller performs one-step control by giving the batch learner an input training set. Modern machine learning, however, extensively employs sequential learning algorithms. We thus study sequential machine teaching: what is the shortest training sequence that forces a learner to go from an initial model w_0 to a target model w*? Formally, at times t = 0, 1, ... the controller chooses an input (x_t, y_t) from an input set U. The learner then updates the model according to its learning algorithm. This forms a dynamical system f:

    w_{t+1} = f(w_t, x_t, y_t).    (1a)
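The sequential teaching setup above can be sketched in a few lines of Python, using the gradient-descent least-squares learner studied in this paper as a concrete f. The teacher here (always aim a unit-norm x_t at the target and set y_t = R_y) is a naive placeholder for illustration, not one of the methods analyzed later:

```python
import math

def learner_step(w, x, y, eta):
    """Gradient-descent learner update: w <- w - eta * (w^T x - y) * x."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - eta * err * xi for wi, xi in zip(w, x)]

def teach(w0, w_star, eta=0.01, R_y=1.0, tol=0.05, max_steps=5000):
    """Drive the learner from w0 toward w_star with a naive teacher."""
    w = list(w0)
    for t in range(max_steps):
        d = math.dist(w_star, w)
        if d < tol:
            return w, t                      # target reached (within tol)
        # aim the unit-norm input at the target (R_x = 1); y = R_y
        x = [(ws - wi) / d for ws, wi in zip(w_star, w)]
        w = learner_step(w, x, R_y, eta)
    return w, max_steps

w_final, steps = teach([0.0, 1.0], [1.0, 0.0])
```

This naive teacher eventually reaches the target, but, as the paper shows, it can take far more steps than a time-optimal sequence.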
The controller has full knowledge of w_0, w*, f, and U, and wants to minimize the terminal time T subject to w_T = w*. As a concrete example, we focus on teaching a gradient descent learner of least squares:

    f(w_t, x_t, y_t) = w_t − η (w_t^T x_t − y_t) x_t    (1b)

with w ∈ R^n and the input set ‖x‖ ≤ R_x, |y| ≤ R_y.

We caution the reader not to trivialize the problem: (1) is a nonlinear dynamical system due to the interaction between w_t and x_t. A previous best attempt to solve this control problem by Liu et al. (2017) employs a greedy control policy, which at step t optimizes (x_t, y_t) to minimize the distance between w_{t+1} and w*. One of our observations is that this greedy policy can be substantially suboptimal. Figure 1 shows three teaching problems and the number of steps T needed to arrive at w* using different methods. Our optimal control method NLP found shorter teaching sequences than the greedy policy (lengths 151, 153, 259 for NLP vs. 219, 241, 310 for GREEDY, respectively). This and other experiments are discussed in Section 4.

Figure 1: The shortest teaching trajectories found by different methods. All teaching tasks use the same terminal point w*, with w_0 = (0, 1) (left panel: GREEDY T = 219, STRAIGHT T = 172, NLP T = 151, CNLP t_f = 1.52 s), w_0 = (0, 0.5) (middle panel: GREEDY T = 241, STRAIGHT T = 166, NLP T = 153, CNLP t_f = 1.53 s), and w_0 = (−0.5, 0.5) (right panel: GREEDY T = 310, STRAIGHT T = 305, NLP T = 259, CNLP t_f = 2.59 s). The learner is the least squares gradient descent algorithm (1) with η = 0.01 and R_x = R_y = 1. The total number of steps T to arrive at w* is indicated for each method.

Our main contribution is to show how tools from optimal control theory may be brought to bear on the machine teaching problem. Specifically, we show that:

1. The Pontryagin optimality conditions reveal deep structural properties of optimal teaching sequences. For example, we show that the least-squares case (1) is fundamentally a 2D problem, and we provide a structural characterization of solutions. These results are detailed in Section 3.

2. Optimal teaching sequences can be vastly more efficient than what may be obtained via common heuristics. We present two optimal approaches: an exact method (NLP) and a continuous approximation (CNLP). Both agree when the step size η is small, but CNLP is more scalable because its runtime does not depend on the length of the training sequence. These results are shown in Section 4.

We begin with a survey of the relevant optimal control theory and algorithms literature in Section 2.

2 TIME-OPTIMAL CONTROL

To study the structure of optimal control, we consider the continuous gradient flow approximation of gradient descent, which holds in the limit η →
0. In this section, we present the corresponding canonical time-optimal control problem and summarize some of the key theoretical and computational tools that have been developed over the past 60 years to address it. For a more detailed exposition of the theory, we refer the reader to modern references on the topic (Kirk, 2012; Liberzon, 2011; Athans and Falb, 2013).

This section is self-contained, and we will use notation consistent with the control literature (x instead of w, u instead of (x, y), t_f instead of T). We revert back to machine learning notation in Section 3. Consider the following boundary value problem:

    ẋ = f(x, u) with x(0) = x_0 and x(t_f) = x_f.    (2)

The function x : R_+ → R^n is called the state and u : R_+ → U is called the input. Here, U ⊆ R^m is a given constraint set that characterizes admissible inputs. The initial and terminal states x_0 and x_f are fixed, but the terminal time t_f is free. If an admissible u together with a state x satisfy the boundary value problem (2) for some choice of t_f, we call (x, u) a trajectory of the system. The objective in a time-optimal control problem is to find an optimal trajectory, which is a trajectory with minimal t_f.

Established approaches for solving time-optimal control problems can be grouped into three broad categories: dynamic programming, indirect methods, and direct methods. We now summarize each approach.

2.1 Dynamic Programming

Consider the value function V : R^n → R_+, where V(x) is the minimum time required to reach x_f starting from the initial state x. The Hamilton–Jacobi–Bellman (HJB) equation gives necessary and sufficient conditions for optimality and takes the form:

    min_{ũ ∈ U} ∇V(x)^T f(x, ũ) + 1 = 0 for all x ∈ R^n    (3)

together with the boundary condition V(x_f) = 0.
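A coarse way to approximate a solution of (3) numerically is value iteration on a grid, a crude relative of the sweeping schemes cited in this section. The sketch below does this for the 2D least-squares teaching dynamics used later in the paper (ẇ = (y − w^T x)x with ‖x‖ ≤ 1, |y| ≤ 1); the grid size, time step, and target tolerance are arbitrary illustrative choices, not values used by the authors:

```python
import math

# grid over [-1.5, 1.5]^2; V[i][j] approximates min time-to-target from a cell
N, LO, HI = 17, -1.5, 1.5
H = (HI - LO) / (N - 1)
DT = 0.06
BIG = 1e3                                   # finite stand-in for "unreached"
TARGET, R_TGT = (1.0, 0.0), 0.25

coords = [LO + i * H for i in range(N)]
V = [[0.0 if math.dist((coords[i], coords[j]), TARGET) < R_TGT else BIG
      for j in range(N)] for i in range(N)]

# candidate inputs: 8 directions for x on the unit circle, y = +/- 1
INPUTS = [((math.cos(a), math.sin(a)), y)
          for k in range(8) for a in [2 * math.pi * k / 8] for y in (-1.0, 1.0)]

def interp(V, a, b):
    """Bilinear interpolation of V at the point (a, b), clamped to the grid."""
    a = min(max(a, LO), HI); b = min(max(b, LO), HI)
    i = min(int((a - LO) / H), N - 2); j = min(int((b - LO) / H), N - 2)
    s = (a - coords[i]) / H; t = (b - coords[j]) / H
    return ((1 - s) * (1 - t) * V[i][j] + s * (1 - t) * V[i + 1][j]
            + (1 - s) * t * V[i][j + 1] + s * t * V[i + 1][j + 1])

for _ in range(80):                         # Jacobi sweeps of the HJB backup
    newV = [row[:] for row in V]
    for i in range(N):
        for j in range(N):
            if V[i][j] == 0.0:
                continue                    # target cells stay at zero
            w = (coords[i], coords[j])
            best = BIG
            for x, y in INPUTS:
                e = y - (w[0] * x[0] + w[1] * x[1])  # dynamics: dw/dt = e * x
                nxt = (w[0] + DT * e * x[0], w[1] + DT * e * x[1])
                best = min(best, DT + interp(V, *nxt))
            newV[i][j] = min(V[i][j], best)
    V = newV
```

After the sweeps, V is finite on states from which the target is reachable, and greedy descent of V recovers a feedback policy in the spirit of (4).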
If the solution to this differential equation is V*, then the optimal input is given by the minimizer:

    u(x) ∈ argmin_{ũ ∈ U} ∇V*(x)^T f(x, ũ) for all x ∈ R^n.    (4)

A nice feature of this solution is that the optimal input u depends on the current state x. In other words, HJB produces an optimal feedback policy.

Unfortunately, the HJB equation (3) is generally difficult to solve. Even if the minimization has a closed-form solution, the resulting differential equation is often intractable. We remark that the optimal V* may not be differentiable. For this reason, one looks for so-called viscosity solutions, as described by Liberzon (2011); Tonon et al. (2017) and references therein. Numerical approaches for solving HJB include the fast-marching method (Tsitsiklis, 1995) and Lax–Friedrichs sweeping (Kao et al., 2004). The latter reference also contains a detailed survey of other numerical schemes.

2.2 Indirect Methods

Also known as "optimize then discretize", indirect approaches start with necessary conditions for optimality obtained via the Pontryagin Maximum Principle (PMP). The PMP may be stated and proved in several different ways, most notably using the Hamiltonian formalism from physics or using the calculus of variations. Here is a formal statement.
Theorem 2.1 (PMP). Consider the boundary value problem (2), where f and its Jacobian with respect to x are continuous on R^n × U. Define the Hamiltonian H : R^n × R^n × U → R as H(x, p, u) := p^T f(x, u) + 1. If (x*, u*) is an optimal trajectory, then there exists some function p* : R_+ → R^n (called the "co-state") such that the following conditions hold.

a) x* and p* satisfy the following system of differential equations for t ∈ [0, t_f], with boundary conditions x*(0) = x_0 and x*(t_f) = x_f:

    ẋ*(t) = ∂H/∂p (x*(t), p*(t), u*(t)),    (5a)
    ṗ*(t) = −∂H/∂x (x*(t), p*(t), u*(t)).    (5b)

b) For all t ∈ [0, t_f], an optimal input u*(t) satisfies:

    u*(t) ∈ argmin_{ũ ∈ U} H(x*(t), p*(t), ũ).    (6)

c) The Hamiltonian is zero along optimal trajectories:

    H(x*(t), p*(t), u*(t)) = 0 for all t ∈ [0, t_f].    (7)

In comparison to HJB, which needs to be solved for all x ∈ R^n, the PMP only applies along optimal trajectories. Although the differential equations (5) may still be difficult to solve, they are simpler than the HJB equation and therefore tend to be more amenable to both analytical and numerical approaches. Solutions to HJB and PMP are related via ∇V*(x*(t)) = p*(t).

The PMP is only necessary for optimality, so solutions of (5)–(7) are not necessarily optimal. Moreover, the PMP does not produce a feedback policy; it only produces optimal trajectory candidates. Nevertheless, the PMP can provide useful insight, as we will explore in Section 3. If the PMP cannot be solved analytically, a common numerical approach is the shooting method, where we guess p*(0) and propagate the equations (5)–(6) forward via numerical integration.
Then p*(0) is refined and the process is repeated until the trajectory reaches x_f.

2.3 Direct Methods

Also known as "discretize then optimize", direct approaches solve a sparse nonlinear program whose variables are the state and input evaluated at a discrete set of timepoints. An example is collocation methods, which use different basis functions, such as piecewise polynomials, to interpolate the state between timepoints. For contemporary surveys of direct and indirect numerical approaches, see Rao (2009); Betts (2010).

If the dynamics are already discrete, as in (1), we may directly formulate a nonlinear program. We refer to this approach as NLP. Alternatively, we can take the continuous limit and then discretize, which we call CNLP. We discuss the advantages and disadvantages of both approaches in Section 4.
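As a toy illustration of the "discretize then optimize" idea, the sketch below treats the inputs x_0, ..., x_{T−1} of the discrete learner (1) as decision variables and drives the terminal miss ‖w_T − w*‖² down with a simple random local search. This is a crude stand-in for the sparse NLP solvers discussed above, and the horizon, step size, and search parameters are arbitrary illustrative choices:

```python
import math, random

ETA, RX, RY = 0.4, 1.0, 1.0
W0, WSTAR, T = (0.0, 1.0), (1.0, 0.0), 5

def rollout(flat):
    """Unroll w_{t+1} = w_t - eta (w_t^T x_t - R_y) x_t with y_t = R_y fixed,
    projecting each decision variable x_t onto the ball ||x|| <= R_x."""
    w = list(W0)
    for k in range(0, 2 * T, 2):
        x = list(flat[k:k + 2])
        n = math.hypot(*x)
        if n > RX:
            x = [xi * RX / n for xi in x]
        err = w[0] * x[0] + w[1] * x[1] - RY
        w = [w[0] - ETA * err * x[0], w[1] - ETA * err * x[1]]
    return w

def miss(flat):
    """Squared terminal miss: the quantity a direct method drives to zero."""
    wT = rollout(flat)
    return (wT[0] - WSTAR[0]) ** 2 + (wT[1] - WSTAR[1]) ** 2

random.seed(0)
best = [random.uniform(-1, 1) for _ in range(2 * T)]
best_J = miss(best)
for it in range(4000):                      # random local search over inputs
    sigma = 0.3 / (1 + it / 1000)           # slowly annealed step size
    cand = [v + random.gauss(0, sigma) for v in best]
    J = miss(cand)
    if J < best_J:
        best, best_J = cand, J
```

A real direct method would instead hand the same decision variables and constraints to a sparse NLP solver, and wrap the fixed-T feasibility problem in a search over T.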
3 TEACHING LEAST SQUARES

In this section, we specialize time-optimal control to least squares. To recap, our goal is to find the minimum number of steps T such that there exists a control sequence (x_t, y_t), t = 0, ..., T−1, that drives the learner (1) from the initial state w_0 to the target state w*. The constraint set is U = {(x, y) | ‖x‖ ≤ R_x, |y| ≤ R_y}. This is a nonlinear discrete-time time-optimal control problem, for which no closed-form solution is available.

For the corresponding continuous-time control problem, applying Theorem 2.1 yields the following necessary conditions for optimality for all t ∈ [0, t_f]:

    w(0) = w_0,  w(t_f) = w*    (8a)
    ẇ(t) = (y(t) − w(t)^T x(t)) x(t)    (8b)
    ṗ(t) = (p(t)^T x(t)) x(t)    (8c)
    (x(t), y(t)) ∈ argmin_{‖x̂‖ ≤ R_x, |ŷ| ≤ R_y} (ŷ − w(t)^T x̂)(p(t)^T x̂)    (8d)
    0 = (y(t) − w(t)^T x(t))(p(t)^T x(t)) + 1.    (8e)

(State, co-state, and input in Theorem 2.1 are (x, p, u), which is conventional controls notation. For this problem, we use (w, p, (x, y)), which is machine learning notation.)

We can simplify (8) by setting y(t) = R_y, as described in Proposition 3.1 below.

Proposition 3.1.
For any trajectory (w, p, x, y) satisfying (8), there exists another trajectory of the form (w, p, x̃, R_y). So we may set y(t) = R_y without any loss of generality.

Proof.
Since (8d) is linear in ŷ, the optimal ŷ occurs at a boundary, so ŷ = ±R_y. Changing the sign of ŷ is equivalent to changing the sign of x̂, so we may assume without loss of generality that ŷ = R_y. These changes leave (8b)–(8c) and (8e) unchanged, so w and p are unchanged as well.

In fact, Proposition 3.1 also holds for trajectories of the discrete system (1). For a proof, see the appendix.

Applying Proposition 3.1, the conditions (8d) and (8e) may be combined to yield the following quadratically constrained quadratic program (QCQP):

    min_{‖x‖ ≤ R_x} (R_y − w^T x)(p^T x) = −1,    (9)

where we have dropped the dependence on t for clarity. Note that (9) constrains the possible tuples (w, p, x) that can occur along an optimal trajectory: in addition to solving the minimization on the left-hand side to find x, we must also ensure that the minimum value equals −
1. We will now characterize the solutions of (9) by examining five distinct regimes of the solution space, which depend on the relationship between w and p, and we determine which regime transitions are admissible.

Regime I (origin): w = 0 and p ≠ 0. This regime occurs when the teaching trajectory passes through the origin. Here, closed-form solutions are available: x = −(R_x/‖p‖) p and ‖p‖ = 1/(R_x R_y). In this regime, both ẇ and ṗ are parallel to p, so the trajectory passes straight through the origin. Regime I therefore necessarily transitions from Regime II into Regime III, provided it occurs neither at the beginning nor at the end of the teaching trajectory.

Regime II (positive alignment): w = αp with p ≠ 0 and α > 0. This regime occurs when w and p are positively aligned. Again we have closed-form solutions: x* = −(R_x/‖w‖) w and α = R_x‖w‖(R_y + R_x‖w‖). In this regime, both ẇ and ṗ are parallel to w, with ẇ pointing toward the origin; thus Regime II necessarily transitions into Regime I and can never be entered from any other regime.

Regime III (negative alignment inside the origin-centered ball): w = −αp with p ≠ 0, α > 0, and ‖w‖ ≤ R, where R := R_y/(2R_x). This regime occurs when w and p are negatively aligned and w lies inside the ball of radius R centered at the origin. Again, closed-form solutions exist: x* = (R_x/‖w‖) w and α = R_x‖w‖(R_y − R_x‖w‖). Regime III necessarily transitions from Regime I and into Regime IV.

Regime IV (negative alignment outside the origin-centered ball): w = −αp with p ≠ 0, α > 0, and ‖w‖ > R. In this case, the solutions satisfy α = R_y²/4, so p is uniquely determined by w. However, the optimal x* is not unique: any solution of w^T x = R_y/2 with ‖x‖ ≤ R_x can be chosen. Regime IV can only be entered from Regime III and cannot transition into any other regime. In other words, once the teaching trajectory enters Regime IV, it cannot escape. Another interesting property of Regime IV is that we know exactly how fast the norm of w changes: since w^T x = R_y/2, one can derive that d(‖w‖²)/dt = R_y²/2. As a result, once the trajectory enters Regime IV, we know exactly how long it will take to reach w*, if it is able to reach it at all.

Regime V (general positions): w and p are linearly independent.
This case covers the remaining possibilities for the state and co-state variables. To characterize the solutions in this regime, we first introduce some new coordinates. Define {ŵ, û} to be an orthonormal basis for span{w, p} such that w = γŵ and p = αŵ + βû for some α, β, γ ∈ R. Note that β ≠ 0 because w and p are assumed to be linearly independent in this regime. We can therefore express any input uniquely as x = wŵ + uû + zẑ, where ẑ is an out-of-plane unit vector orthogonal to both ŵ and û, and w, u, z ∈ R are suitably chosen scalars. Substituting these definitions, (9) becomes

    min_{w² + u² + z² ≤ R_x²} (R_y − γw)(αw + βu) = −1.    (10)

Now observe that the objective is linear in u and does not depend on z. The coefficient of u is nonzero because β ≠ 0 and R_y − γw ≠ 0 (otherwise the entire objective would be zero rather than −1). Since the feasible set is convex and the objective is linear in u, the optimum must occur at the boundary of the feasible set in the variables w and u; therefore z = 0. This is profound, because it implies that in Regime V the optimal solution necessarily lies in the 2D plane span{w, p}. In light of this fact, we can pick a more convenient parametrization. Let w = R_x cos θ and u = R_x sin θ. Equation (10) becomes:

    min_θ R_x (R_y − γ R_x cos θ)(α cos θ + β sin θ) = −1.    (11)

This objective function has at most four critical points, of which only one is the global minimum, and we can find it numerically. Last but not least, Regime V does not transition from or into any other regime.

Figure 2: Optimal trajectories for w* = (1, 0) for different choices of w_0. Trajectories are colored according to the regime to which they belong, and a directed graph shows all possible regime transitions. The optimal trajectories are symmetric about the x-axis. For implementation details, see Section 4.

Intrinsic low-dimensional structure of the optimal control solution.
As hinted at in the analysis of Regime V, the optimal control x sometimes lies in the 2D subspace spanned by w and p. In fact, this holds not only in Regime V but for the whole problem. In particular, we make the following observation.

Theorem 3.2.
There always exists a globally optimal trajectory of (8) that lies in a 2D subspace of R^n.

The detailed proof can be found in the appendix. An immediate consequence of Theorem 3.2 is that if w_0 and w* are linearly independent, we only need to consider trajectories confined to the subspace span{w_0, w*}. When w_0 and w* are aligned, trajectories are still 2D, and any subspace containing w_0 and w* is equivalent, so an arbitrary choice can be made. This insight is extremely important because it enables us to restrict our attention to 2D trajectories even though the dimensionality n of the original problem may be huge. It allows us not only to obtain a more elegant and accurate solution when solving the necessary conditions induced by the PMP, but also to parametrize the direct and indirect approaches (see Sections 2.2 and 2.3) to solve this intrinsically 2D problem more efficiently.

Multiplicity of solution candidates.
The PMP conditions are only necessary for optimality. Therefore, the optimality conditions (8) need not have a unique solution. We illustrate this phenomenon in Figure 3: using a shooting approach (Section 2.2) with w_0 = (−1, 1), we propagated many choices of p*(0) forward in time. It turns out that two of these choices lead to trajectories that end at w*, and they do not have equal total times (t_f = 2.58 s versus t_f = 3.26 s). So in general, the PMP identifies optimal trajectory candidates, which can be thought of as local minima of this highly nonlinear optimization problem.

Figure 3: Trajectories found using a shooting approach (Section 2.2) with w_0 = (−1, 1): the shooting trajectories and the two candidates that reach w*, with t_f = 2.58 s and t_f = 3.26 s.

4 NUMERICAL METHODS AND EXPERIMENTS

While the PMP yields necessary conditions for time-optimal control, as detailed in Section 3, there is no closed-form solution in general. We now present and discuss four numerical methods: CNLP and NLP are different implementations of time-optimal control, while GREEDY and STRAIGHT are heuristics.
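A minimal version of the shooting computation behind Figure 3 can be sketched as follows: guess p(0), then integrate the state/co-state equations (8b)–(8c) forward with y(t) = R_y (Proposition 3.1) and the input chosen to minimize the Hamiltonian, here by brute-force sampling of the boundary of the input disc. Forward Euler, the grid resolution, and the particular p(0) guess are illustrative choices only; a full implementation would also refine p(0) until the trajectory hits w*:

```python
import math

RX, RY = 1.0, 1.0

def hamiltonian_input(w, p, K=360):
    """Approximate argmin over ||x|| <= R_x of (R_y - w^T x)(p^T x) by
    sampling the circle ||x|| = R_x (for this bilinear-quadratic objective
    the minimum is always attained on the boundary, not necessarily uniquely)."""
    best_val, best_x = float("inf"), (RX, 0.0)
    for k in range(K):
        a = 2 * math.pi * k / K
        x = (RX * math.cos(a), RX * math.sin(a))
        val = (RY - (w[0] * x[0] + w[1] * x[1])) * (p[0] * x[0] + p[1] * x[1])
        if val < best_val:
            best_val, best_x = val, x
    return best_x

def shoot(w0, p0, dt=2e-3, steps=1500):
    """Forward-Euler integration of (8b)-(8c) from a guessed co-state p(0)."""
    w, p = list(w0), list(p0)
    traj = [tuple(w)]
    for _ in range(steps):
        x = hamiltonian_input(w, p)
        e = RY - (w[0] * x[0] + w[1] * x[1])   # (8b): dw/dt = e * x
        s = p[0] * x[0] + p[1] * x[1]          # (8c): dp/dt = s * x
        w = [w[0] + dt * e * x[0], w[1] + dt * e * x[1]]
        p = [p[0] + dt * s * x[0], p[1] + dt * s * x[1]]
        traj.append(tuple(w))
    return traj

traj = shoot((-1.0, 1.0), (0.3, -0.5))      # p(0) here is an arbitrary guess
```

Sweeping over many p(0) guesses and keeping the trajectories that pass near w* reproduces the multiplicity phenomenon of Figure 3.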
CNLP:
This approach solves the continuous gradient flow limit of the machine teaching problem using a direct approach (Section 2.3). Specifically, we used the NLOptControl package (Febbo, 2017), an implementation of the hp-pseudospectral method GPOPS-II (Patterson and Rao, 2014), written in the Julia programming language using the JuMP modeling language (Dunning et al., 2017) and the IPOPT interior-point solver (Wächter and Biegler, 2006). The main tuning parameters for this software are the integration scheme and the number of mesh points. We selected the trapezoidal integration rule with 100 mesh points for most simulations. We used CNLP to produce the trajectories in Figures 1 and 2.

NLP:
A naive approach to optimal control is to find the minimum T for which there is a feasible input sequence that drives the learner to w*. Fixing T, the feasibility subproblem is a nonlinear program over the 2Tn-dimensional variables x_0, ..., x_{T−1} and w_1, ..., w_T, constrained by the learner dynamics. Recall that w_0 is given, and one can fix y_t = R_y for all t by Proposition 3.1. For our learner (1), the feasibility problem is

    find    w_1, ..., w_T, x_0, ..., x_{T−1}    (12)
    s.t.    w_T = w*
            w_{t+1} = w_t − η (w_t^T x_t − R_y) x_t
            ‖x_t‖ ≤ R_x,  for all t = 0, ..., T − 1.

As in the CNLP case, we modeled and solved the subproblems (12) using JuMP and IPOPT. We also tried Knitro, a state-of-the-art commercial solver (Byrd et al., 2006), and it produced similar results. We stress that such feasibility problems are difficult; IPOPT and Knitro can handle moderately sized T. For our specific learner (1), the optimal control and state trajectories are 2D and lie in span{w_0, w*}, as discussed in Section 3. Therefore, we reparametrized (12) to work in 2D. On top of this, we run a binary search over the positive integers to find the minimum T for which the subproblem (12) is feasible. Subject to solver numerical stability, this minimum T and its feasible solution x_0, ..., x_{T−1} constitute the time-optimal control. While NLP is conceptually simple and correct, it requires solving many subproblems with 2T variables and 2T constraints, making it less stable and scalable than CNLP.

GREEDY:
We restate the greedy control policy initially proposed by Liu et al. (2017). It has the advantage of being computationally efficient and readily applicable to different learning algorithms (i.e., dynamics). Specifically, for the least squares learner (1) and given the current state w_t, GREEDY solves the following optimization problem to determine the next teaching example (x_t, y_t):

    min_{(x_t, y_t) ∈ U}    ‖w_{t+1} − w*‖    (13)
    s.t.    w_{t+1} = w_t − η (w_t^T x_t − y_t) x_t.

The procedure repeats until w_{t+1} = w*. We used the Matlab function fmincon to solve the above program iteratively. We point out that this optimization problem is not convex. Moreover, the optimal one-step update w_{t+1} does not necessarily point in the direction of w*. This is evident in Figures 1 and 6.

STRAIGHT:
We describe an intuitive control policy: at each step, move w in a straight line toward w*, as far as possible subject to the constraint U. This policy is less greedy than GREEDY because it may not reduce ‖w_{t+1} − w*‖ as much at each step. The per-step optimization in x is a 1D line search:

    min_{a, y_t ∈ R}    ‖w_{t+1} − w*‖    (14)
    s.t.    x_t = a (w* − w_t)/‖w* − w_t‖
            (x_t, y_t) ∈ U
            w_{t+1} = w_t − η (w_t^T x_t − y_t) x_t.

The line search (14) can be solved in closed form. In particular, one obtains

    a = min{ R_x, R_y ‖w* − w‖ / ((w* − w)^T w) }   if (w* − w)^T w > 0,
    a = R_x                                          otherwise.

We ran a number of experiments to study the behavior of these numerical methods. In all experiments, the learner is gradient descent on least squares (1), and the control constraint set is ‖x‖ ≤ 1, |y| ≤ 1. Our first observation is that CNLP has a number of advantages:

1. CNLP's continuous optimal state trajectory matches NLP's discrete state trajectories, especially on learners with small η. This is expected, since the continuous optimal control problem is obtained asymptotically from the discrete one as η →
0. Figure 4 shows the three teaching tasks of Figure 1 solved for several values of η. The NLP optimal teaching sequences vary drastically in length T, but their state trajectories quickly overlap with CNLP's optimal trajectory.

2. CNLP is quick to compute, while NLP's runtime grows as the learner's η decreases. Table 1 presents the wall-clock times. With a small η, the optimal control takes more steps (larger T). Consequently, NLP must solve a nonlinear program with more variables and constraints. In contrast, CNLP's runtime does not depend on η.

3. CNLP can be used to approximately compute the "teaching dimension", i.e., the minimum number of sequential teaching steps T for the discrete problem. Recall that CNLP produces an optimal terminal time t_f. When the learner's η is small, the discrete teaching dimension is related by T ≈ t_f/η. This is also supported by Table 1.

That said, it is not trivial to extract a discrete control sequence from CNLP's continuous control function. This hinders CNLP's utility as an optimal teacher.

Figure 4: Comparison of CNLP vs. NLP. All teaching tasks use the same terminal point w*, with w_0 = (0, 1) (left panel: CNLP t_f = 1.52 s; NLP T = 4 for η = 0.4, T = 76 for η = 0.02, T = 1515 for η = 0.001), w_0 = (0, 0.5) (middle panel: CNLP t_f = 1.53 s; NLP T = 4, 76, 1531 for the same η values), and w_0 = (−0.5, 0.5) (right panel: CNLP t_f = 2.59 s; NLP T = 6, 129, 2594). We observe that the NLP trajectories on learners with smaller η quickly converge to the CNLP trajectory.

Table 1: Teaching sequence length and wall-clock time comparison. NLP teaches three learners with different η values. The target w* and initial states are as in Figure 4.

    w_0           NLP (η = 0.4)   NLP (η = 0.02)   NLP (η = 0.001)   CNLP
    (0, 1)        3               75               1499              t_f = 1.52 s
    (0, 0.5)      5               76               1519              t_f = 1.53 s
    (−0.5, 0.5)   6               128              2570              t_f = 2.59 s

We fixed η = 0.
01 in all cases.

Table 2: Teaching sequence length comparison.

    w_0      w*        NLP   STRAIGHT   GREEDY
    (0, 1)   (2, 0)    148   161        233
    (0, 2)   (4, 0)    221   330        721
    (0, 4)   (8, 0)    292   867        2667
    (0, 8)   (16, 0)   346   2849       10581

Our second observation is that NLP, being the discrete-time optimal control, produces shorter teaching sequences than GREEDY or STRAIGHT. This is not surprising, and we have already presented three teaching tasks in Figure 1 where NLP has the smallest T. In fact, there exist teaching tasks on which GREEDY and STRAIGHT perform arbitrarily worse than the optimal teaching sequence found by NLP. A case study is presented in Table 2. In this set of experiments, we set w_0 = (0, a) and w* = (2a, 0). As a increases, the ratio of teaching sequence length between STRAIGHT and NLP, and between GREEDY and NLP, grows at an exponential rate.

Figure 5: Points reachable in one step of gradient descent (with η = 0.
1) on a least-squares objective, starting from each of the black dots. There is circular symmetry about the origin (red dot).

We now dig deeper and present an intuitive explanation of why GREEDY requires more teaching steps than NLP. The fundamental issue is the nonlinearity of the learner dynamics (1) in x. For any w, let us define the one-step reachable set {w − η(w^T x − y)x | (x, y) ∈ U}. Figure 5 shows a sample of such reachable sets. The key observation is that the starting w is quite close to the boundary of most reachable sets. In other words, there is often a compressed direction, from w to the closest boundary of its reachable set, along which w makes minimal progress. The GREEDY scheme falls victim to this phenomenon.

Figure 6: Reachable sets along the trajectory of NLP (T = 14, left panel) and GREEDY (T = 21, right panel). To minimize clutter, we only show every 3rd reachable set. For this simulation, we used η = 0.
1. The greedy approach makes fast progress initially, but slows down later on.

Figure 7: Trajectories of the input sequence {x_t} for the GREEDY, STRAIGHT, and NLP methods, and the corresponding x(t) for CNLP. The teaching task is w_0 = (−0.5, 0.5) with w* as in Figure 1, and η = 0.01. Markers show every 10 steps. The input constraint is ‖x‖ ≤ 1.

In Figure 6, GREEDY initially makes rapid progress toward w*; unfortunately, it also arrives at the x-axis. For w on the x-axis, the compressed direction is horizontally outward. Therefore, subsequent GREEDY moves are relatively short, leading to a large number of steps to reach w*. Interestingly, STRAIGHT is often better than GREEDY because it avoids the x-axis compressed direction for general w_0.

We illustrate the optimal inputs in Figure 7, which compares the {x_t} produced by STRAIGHT, GREEDY, and NLP and the x(t) produced by CNLP. The heuristic approaches eventually take smaller-magnitude steps as they approach w*, while NLP and CNLP maintain a maximal input norm the whole way.

5 CONCLUDING REMARKS

Techniques from optimal control are under-utilized in machine teaching, yet they have the power to provide better-quality solutions as well as useful insight into their structure. As seen in Section 3, optimal trajectories for the least squares learner are fundamentally 2D. Moreover, there is a taxonomy of regimes that dictates their behavior. We also saw in Section 4 that the continuous CNLP solver provides a good approximation to the true discrete trajectory when η is small. CNLP is also more scalable than solving the discrete NLP directly, because NLP becomes computationally intractable as T gets large (or η gets small), whereas the runtime of CNLP is independent of η.

A drawback of both NLP and CNLP is that they produce trajectories rather than policies. In practice, using an open-loop teaching sequence (x_t, y_t) will not yield exactly the w_t we expect, due to the accumulation of small numerical errors as we iterate.
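This open-loop fragility is easy to demonstrate: record the input sequence of a state-feedback teacher on the nominal problem, then replay it from a slightly perturbed initial model. The feedback teacher below is a STRAIGHT-like stand-in (aim unit-norm inputs at the target), and the perturbation and tolerances are arbitrary illustrative values:

```python
import math

ETA, RY = 0.01, 1.0

def step(w, x, y=RY):
    """Learner update (1b) in 2D."""
    err = w[0] * x[0] + w[1] * x[1] - y
    return [w[0] - ETA * err * x[0], w[1] - ETA * err * x[1]]

def feedback_teach(w0, wstar, tol=0.02, max_steps=5000):
    """Closed loop: recompute the input from the *current* state each step."""
    w, xs = list(w0), []
    for _ in range(max_steps):
        d = math.dist(wstar, w)
        if d < tol:
            break
        x = [(a - b) / d for a, b in zip(wstar, w)]
        xs.append(x)
        w = step(w, x)
    return w, xs

wstar = [1.0, 0.0]
w_nom, xs = feedback_teach([0.0, 1.0], wstar)   # nominal plan, inputs recorded

w0_pert = [0.1, 1.1]                            # perturbed initial model
w_open = list(w0_pert)
for x in xs:                                    # open-loop replay of the plan
    w_open = step(w_open, x)
w_closed, _ = feedback_teach(w0_pert, wstar)    # closed-loop re-planning

open_err = math.dist(w_open, wstar)
closed_err = math.dist(w_closed, wstar)
```

The replayed open-loop sequence leaves a residual error of roughly the size of the initial perturbation, while the closed-loop teacher still converges to the target tolerance; this is the motivation for the feedback and MPC ideas discussed next.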
In order to find a control policy, which is a map from the state w_t to an input (x_t, y_t), we discussed the possibility of solving HJB (Section 2.1), which is computationally expensive. An alternative to solving HJB is to pre-compute the desired trajectory via CNLP and then use model-predictive control (MPC) to find a policy that tracks the reference trajectory as closely as possible. Such an approach is used in Liniger et al. (2015), for example, to design controllers for autonomous race cars, and it would be an interesting avenue of future work for the machine teaching problem.

Finally, this paper presents only a glimpse of what is possible using optimal control. For example, the PMP is not restricted to solving time-optimal control problems. It is possible to analyze problems with state- and input-dependent running costs, state and input pointwise or integral constraints, conditional constraints, and even problems where the goal is to reach a target set rather than a target point.

References

M. Athans and P. L. Falb.
Optimal control: An introduction to the theory and its applications. Courier Corporation, 2013.

J. T. Betts. Practical methods for optimal control and estimation using nonlinear programming, volume 19. SIAM, 2010.

B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012.

R. H. Byrd, J. Nocedal, and R. A. Waltz. Knitro: An integrated package for nonlinear optimization. In Large Scale Nonlinear Optimization, pages 35–59. Springer Verlag, 2006.

I. Dunning, J. Huchette, and M. Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.

H. Febbo. NLOptControl.jl, 2017. URL https://github.com/JuliaMPC/NLOptControl.jl.

S. Goldman and M. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ArXiv e-prints, Dec. 2014.

C. Y. Kao, S. Osher, and J. Qian. Lax–Friedrichs sweeping scheme for static Hamilton–Jacobi equations. Journal of Computational Physics, 196(1):367–391, 2004.

D. E. Kirk. Optimal control theory: An introduction. Courier Corporation, 2012.

D. Liberzon. Calculus of variations and optimal control theory: A concise introduction. Princeton University Press, 2011.

A. Liniger, A. Domahidi, and M. Morari. Optimization-based autonomous racing of 1:43 scale RC cars. Optimal Control Applications and Methods, 36(5):628–647, 2015.

W. Liu, B. Dai, A. Humayun, C. Tay, C. Yu, L. B. Smith, J. M. Rehg, and L. Song. Iterative machine teaching. In International Conference on Machine Learning, pages 2149–2158, 2017.

S. Mei and X. Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In The Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

K. Patil, X. Zhu, L. Kopec, and B. Love. Optimal teaching for limited-capacity human learners. In Advances in Neural Information Processing Systems (NIPS), 2014.

M. A. Patterson and A. V. Rao. GPOPS-II: A MATLAB software for solving multiple-phase optimal control problems using hp-adaptive Gaussian quadrature collocation methods and sparse nonlinear programming. ACM Transactions on Mathematical Software (TOMS), 41(1):1, 2014.

A. V. Rao. A survey of numerical methods for optimal control. Advances in the Astronautical Sciences, 135(1):497–528, 2009.

A. Sen, P. Patel, M. A. Rau, B. Mason, R. Nowak, T. T. Rogers, and X. Zhu. Machine beats human at sequencing visuals for perceptual-fluency practice. In Educational Data Mining, 2018.

D. Tonon, M. S. Aronna, and D. Kalise.
Optimal con-trol: Novel directions and applications , volume 2180.Springer, 2017.J. N. Tsitsiklis. Efficient algorithms for globally opti-mal trajectories.
IEEE Transactions on AutomaticControl , 40(9):1528–1538, 1995.A. W¨achter and L. T. Biegler. On the implementationof a primal-dual interior point filter line search algo-rithm for large-scale nonlinear programming.
Math-ematical Programming , 106(1):25–57, 2006.X. Zhu. Machine teaching: an inverse problem to ma-chine learning and an approach toward optimal ed-ucation. In
The Twenty-Ninth AAAI Conferenceon Artificial Intelligence (AAAI “Blue Sky” SeniorMember Presentation Track) , 2015.X. Zhu, A. Singla, S. Zilles, and A. N. Rafferty. AnOverview of Machine Teaching.
ArXiv e-prints , Jan.2018. https://arxiv.org/abs/1801.05927.
Appendix
Proof of modified Proposition 3.1.
In this version, we assume (w, x, y) is a trajectory of (1) rather than a trajectory of (8). All we need to show is that for any pair (x, y), there exists another pair (x̃, R_y) that produces the same update. In particular, we set x̃ = a x and show that there always exists an a ∈ [−1, 1] such that

(y − wᵀx) x = (R_y − wᵀ(a x)) a x.

This simplifies to

g(a) := (wᵀx) a² − R_y a + (y − wᵀx) = 0.   (15)

The discriminant of the quadratic (15) is

R_y² − 4(wᵀx)(y − wᵀx) = R_y² − 4(wᵀx) y + 4(wᵀx)² ≥ R_y² − 4|wᵀx| R_y + 4(wᵀx)² = (R_y − 2|wᵀx|)² ≥ 0,

where we used |y| ≤ R_y, so g has real roots a ∈ ℝ. Moreover, g(−1) = R_y + y ≥ 0 and g(1) = −R_y + y ≤ 0, so there must be a real root in [−1, 1].

Proof of Theorem 3.2.
We showed in Section 3 that Regime V trajectories are 2D. We also argued that solutions that reach w⋆ via Regimes III–IV are not unique and need not be 2D. We will now show that it is always possible to construct a 2D solution.

We begin by characterizing the set of w⋆ reachable via Regimes III–IV. Recall from Section 3 that the transition between III and IV occurs when ‖w‖ = R := R_y/R_x. If t_1 is the time at which this transition occurs, then for 0 ≤ t ≤ t_1 the solution is x = (R_x/‖w‖) w, which leads to a straight-line trajectory from w_0 to w(t_1).

Now consider the part of the trajectory in Regime IV, where t_1 ≤ t ≤ t_f. As derived in Section 3, Regime IV trajectories satisfy ẇ = x with wᵀx = R_y. These lead to d‖w‖/dt = R_y/‖w‖, which means that ‖w‖ grows at the same rate regardless of x. If our trajectory reaches w(t_f) = w⋆, then we can deduce via integration that

(‖w⋆‖² − ‖w(t_1)‖²)/2 = R_y (t_f − t_1).   (16)

Suppose (w(t), x(t)) for t_1 ≤ t ≤ t_f is a trajectory that reaches w⋆ (refer to Figure 8). The reachable set at time t_f is a spherical sector whose boundary is attained by a trajectory that maximizes curvature. We will now derive this fact. Let θ_max be the largest possible angle between w(t_1) and any reachable w(t_f) = w⋆, where we have fixed t_f. Define θ(t) to be the angle between w(t) and w(t_f).
θ(t_1) = ∫_{t_1}^{t_f} θ̇ dt ≤ ∫_{t_1}^{t_f} |θ̇| dt.

Figure 8: If a reachable w⋆ is contained in the concave funnel shape, which is the reachable set in Regime IV, it can be reached by some trajectory (w(t), x(t)) lying entirely in the 2D subspace defined by span{w_0, w⋆}: follow the max-curvature solution until t_2 and then transition to a radial solution until t_f.

An alternative expression for this rate of change is the projection of ẇ onto the orthogonal complement of w:

|θ̇| = ‖ẇ − (ẇᵀw/‖w‖²) w‖ / ‖w‖ = ‖x − (R_y/‖w‖²) w‖ / ‖w‖,

where we used the fact that ẇ = x and wᵀx = R_y in Regime IV. Now,

θ_max = max_{x : wᵀx = R_y, ‖x‖ ≤ R_x} θ(t_1)
      ≤ max_{x : wᵀx = R_y, ‖x‖ ≤ R_x} ∫_{t_1}^{t_f} ‖x − (R_y/‖w‖²) w‖ / ‖w‖ dt
      ≤ ∫_{t_1}^{t_f} √(R_x² − (R_y/‖w‖)²) / ‖w‖ dt.   (17)

In the final step, we maximized over x. Notice that the integrand in (17) is an upper bound that depends only on t and ‖w⋆‖ but not on x. One can also verify that this upper bound is achieved by the choice

x = (R_y/‖w‖) ŵ + √(R_x² − (R_y/‖w‖)²) · (w⋆ − (ŵᵀw⋆) ŵ) / ‖w⋆ − (ŵᵀw⋆) ŵ‖,

where ŵ := w/‖w‖ and w⋆ is any vector that satisfies (16) at angle θ_max from w(t_1).
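The saturation properties of this choice of x are easy to verify numerically. Below is a small sketch; the values of R_x, R_y, w, and w⋆ are arbitrary test data (not taken from the paper's experiments), chosen so that ‖w‖ > R_y/R_x and w⋆ is not parallel to w.

```python
import numpy as np

# Arbitrary test values (assumed for illustration):
Rx, Ry = 1.0, 0.4
w = np.array([1.0, 0.5, -0.3])       # current state, with ||w|| > Ry/Rx
w_star = np.array([-0.5, 1.5, 0.2])  # target direction, not parallel to w

w_hat = w / np.linalg.norm(w)
# Radial part enforces the Regime IV constraint w^T x = Ry.
radial = (Ry / np.linalg.norm(w)) * w_hat
# Tangential part points toward the component of w_star orthogonal to w.
tang = w_star - (w_hat @ w_star) * w_hat
tang /= np.linalg.norm(tang)
x = radial + np.sqrt(Rx**2 - (Ry / np.linalg.norm(w))**2) * tang

assert np.isclose(w @ x, Ry)              # w^T x = Ry is satisfied
assert np.isclose(np.linalg.norm(x), Rx)  # ||x|| = Rx is saturated
# x stays in span{w, w_star}, so the trajectory remains 2D:
assert np.linalg.matrix_rank(np.stack([w, w_star, x])) == 2
```

The rank check makes the 2D claim explicit: because the radial and tangential parts of x both lie in span{w, w⋆}, the state never leaves that plane.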
Any w⋆ with this norm but angle θ_f < θ_max can also be reached by using the max-curvature control until time t_2, where t_2 is chosen such that

θ_f = ∫_{t_1}^{t_2} √(R_x² − (R_y/‖w‖)²) / ‖w‖ dt,

and then using x = (R_y/‖w‖²) w for t_2 ≤ t ≤ t_f. This piecewise path is illustrated in Figure 8. Our constructed optimal trajectory lies in the 2D span of w_0 and w⋆. This shows that all reachable w⋆