Learning Controller Gains on Bipedal Walking Robots via User Preferences
Noel Csomay-Shanklin, Maegan Tucker, Min Dai, Jenna Reher, Aaron D. Ames
Abstract—Experimental demonstration of complex robotic behaviors relies heavily on finding the correct controller gains. This painstaking process is often completed by a domain expert, requiring deep knowledge of the relationship between parameter values and the resulting behavior of the system. Even when such knowledge is possessed, it can take significant effort to navigate the nonintuitive landscape of possible parameter combinations. In this work, we explore the extent to which preference-based learning can be used to optimize controller gains online by repeatedly querying the user for their preferences. This general methodology is applied to two variants of control Lyapunov function based nonlinear controllers framed as quadratic programs, which have nice theoretic properties but are challenging to realize in practice. These controllers are successfully demonstrated both on the planar underactuated biped, AMBER, and on the 3D underactuated biped, Cassie. We experimentally evaluate the performance of the learned controllers and show that the proposed method is repeatably able to learn gains that yield stable and robust locomotion.
I. INTRODUCTION
Achieving robust and stable performance for physical robotic systems relies heavily on careful gain tuning, regardless of the implemented controller. Navigating the space of possible parameter combinations is a challenging endeavor, even for domain experts. To combat this challenge, researchers have developed systematic ways to tune gains for specific controller types [1]–[5]. For controllers where the input/output relationship between parameters and the resulting behavior is less clear, this can be prohibitively difficult. These difficulties are especially prevalent in the setting of bipedal locomotion, due to the extreme sensitivity of the stability of the system with respect to controller gains.

It was shown in [6] that control Lyapunov functions (CLFs) are capable of stabilizing locomotion through the hybrid zero dynamics (HZD) framework, with [7] demonstrating how this can be implemented as a quadratic program (QP), allowing the problem to be solved in a pointwise-optimal fashion even in the face of feasibility constraints. However, achieving robust walking behavior on physical bipeds can be difficult due to complexities such as compliance, underactuation, and narrow domains of attraction. One such controller that has recently demonstrated stable locomotion on the 22 degree of freedom (DOF) Cassie biped, as shown in Figure 1, is the ID-CLF-QP+ [8].

This research was supported by NSF NRI award 1924526, NSF award 1932091, NSF CMMI award 1923239, NSF Graduate Research Fellowship No. DGE-1745301, and the Caltech Big Ideas and ZEITLIN Funds. Authors are with the Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125. Authors are with the Department of Mechanical and Civil Engineering, California Institute of Technology, Pasadena, CA 91125.
Fig. 1: The two experimental platforms investigated in this work: the planar AMBER-3M point-foot [9] robot (left), and the 3D Cassie robot [10] (right).

Synthesizing a controller capable of accounting for the complexities of underactuated locomotion, such as the ID-CLF-QP+, necessitates the addition of numerous control parameters, exacerbating the issue of gain tuning. The relationship between the control parameters and the resulting behavior of the robot is extremely nonintuitive and results in a landscape that requires dedicated time to navigate, even for domain experts. For example, the implementation of the ID-CLF-QP+ in [8] entailed 2 dedicated months of hand-tuning around 60 control parameters.

Recently, machine learning techniques have been implemented to alleviate the process of hand-tuning gains in a controller-agnostic way by systematically navigating the entire parameter space [11]–[13]. However, these techniques rely on a carefully constructed predefined reward function. Moreover, it is often the case that different desired properties of the robotic behavior conflict, such that they cannot both be optimized simultaneously.

To alleviate the gain tuning process and enable the use of complicated controllers for naïve users, we propose a preference-based learning framework that relies only on subjective user feedback, mainly pairwise preferences, to systematically search the parameter space and realize stable and robust experimental walking. Preferences are a particularly useful feedback mechanism for parameter tuning because they are able to capture the notion of "general goodness" without a predefined reward function. Preference-based learning has been previously used towards selecting essential constraints of an HZD gait generation framework, which resulted in stable and robust experimental walking on a planar biped with unmodeled compliance at the ankle [14].

Fig. 2: Configuration of the 22 DOF (floating base) Cassie robot [10] (left) and configuration of the 5 DOF (pinned model) planar robot AMBER-3M [9] (right).

In this work, we apply a similar preference-based learning framework as [14] towards learning gains of a CLF-QP+ controller on the AMBER bipedal robot, as well as an ID-CLF-QP+ controller on the Cassie bipedal robot. This application requires extending the learning framework to a much higher-dimensional space, which led to unique challenges. First, more user feedback was required to navigate the larger action space. This was accomplished by sampling actions continuously on hardware, which led to more efficient feedback collection. Second, to increase the speed of the learning, ordinal labels were added as a feedback mechanism.

II. PRELIMINARIES ON DYNAMICS AND CONTROL
A. Modeling and Gait Generation
Following a floating-base convention [15], we begin with a general definition of a bipedal robot as a branched-chain collection of rigid linkages subjected to intermittent contact with the environment. We define the configuration space as
$\mathcal{Q} \subset \mathbb{R}^n$, where $n$ is the unconstrained number of degrees of freedom (DOF). Let $q = (p_b, \phi_b, q_l) \in \mathcal{Q} := \mathbb{R}^3 \times SO(3) \times \mathcal{Q}_l$, where $p_b$ is the global Cartesian position of the body-fixed frame attached to the base linkage (the pelvis), $\phi_b$ is its global orientation, and $q_l \in \mathcal{Q}_l \subset \mathbb{R}^{n_l}$ are the local coordinates representing rotational joint angles. Further, the state space $\mathcal{X} = T\mathcal{Q} \subset \mathbb{R}^{2n}$ has coordinates $x = (q^\top, \dot{q}^\top)^\top$. The robot is subject to various holonomic constraints, which can be summarized by an equality constraint $h(q) \equiv 0$ with $h(q) \in \mathbb{R}^h$. Differentiating $h(q)$ twice and applying D'Alembert's principle to the Euler–Lagrange equations for the constrained system, the dynamics can be written as:

$$D(q)\ddot{q} + H(q,\dot{q}) = Bu + J(q)^\top \lambda \tag{1}$$
$$J(q)\ddot{q} + \dot{J}(q,\dot{q})\dot{q} = 0 \tag{2}$$

where $D(q) \in \mathbb{R}^{n \times n}$ is the mass-inertia matrix, $H(q,\dot{q})$ contains the Coriolis, gravity, and additional non-conservative forces, $B \in \mathbb{R}^{n \times m}$ is the actuation matrix, $J(q) \in \mathbb{R}^{h \times n}$ is the Jacobian matrix of the holonomic constraint, and $\lambda \in \mathbb{R}^h$ is the constraint wrench. The system of equations (1) for the dynamics can also be written in the control-affine form:

$$\dot{x} = \underbrace{\begin{bmatrix} \dot{q} \\ -D(q)^{-1}\left(H(q,\dot{q}) - J(q)^\top \lambda\right) \end{bmatrix}}_{f(x)} + \underbrace{\begin{bmatrix} 0 \\ D(q)^{-1}B \end{bmatrix}}_{g(x)} u.$$

The mappings $f : T\mathcal{Q} \to \mathbb{R}^{2n}$ and $g : T\mathcal{Q} \to \mathbb{R}^{2n \times m}$ are assumed to be locally Lipschitz continuous.

Dynamic and underactuated walking consists of periods of continuous motion followed by discrete impacts, which can be accurately modeled within a hybrid framework [16]. If we consider a bipedal robot undergoing domains of motion with only one foot in contact (either the left ($L$) or right ($R$)), with the domain transition triggered at footstrike, then we can define:

$$\mathcal{D}_{SS}^{\{L,R\}} = \left\{ (q,\dot{q}) : p_{swf}^z(q) \ge 0 \right\},$$
$$\mathcal{S}_{L\to R, R\to L} = \left\{ (q,\dot{q}) : p_{swf}^z(q) = 0,\ \dot{p}_{swf}^z(q,\dot{q}) < 0 \right\},$$

where $p_{swf}^z : \mathcal{Q} \to \mathbb{R}$ is the vertical position of the swing foot, $\mathcal{D}_{SS}^{\{L,R\}}$ is the continuous domain on which our dynamics (1) evolve, and a transition from one stance leg to the next is triggered by the switching surface $\mathcal{S}_{L\to R, R\to L}$. When this domain transition is triggered, the robot undergoes an impact with the ground, yielding a hybrid model:

$$\mathcal{HC} = \begin{cases} \dot{x} = f(x) + g(x)u, & x \notin \mathcal{S}_{L\to R, R\to L} \\ \dot{x}^+ = \Delta(x^-), & x \in \mathcal{S}_{L\to R, R\to L} \end{cases} \tag{3}$$

where $\Delta$ is a plastic impact model [15] applied to the pre-impact states, $x^-$, such that the post-impact states, $x^+$, respect the holonomic constraints of the subsequent domain.
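Eliminating the constraint wrench $\lambda$ between (1) and (2) gives $f(x)$ and $g(x)$ explicitly. The following is a minimal numerical sketch of this elimination; the model functions `D_fn`, `H_fn`, `J_fn`, and `dJdq_fn` are hypothetical placeholders for robot-specific code, not part of the paper:

```python
import numpy as np

def control_affine(q, dq, D_fn, H_fn, B, J_fn, dJdq_fn):
    """Assemble f(x), g(x) of the constrained dynamics by eliminating
    the constraint wrench lambda from (1)-(2).

    D_fn, H_fn, J_fn, dJdq_fn are placeholder model functions returning
    D(q), H(q, dq), J(q), and dJ(q, dq) @ dq for the robot at hand.
    """
    D, H, J, dJdq = D_fn(q), H_fn(q, dq), J_fn(q), dJdq_fn(q, dq)
    Dinv = np.linalg.inv(D)
    Lam = np.linalg.inv(J @ Dinv @ J.T)      # constraint-space inertia
    # From J*ddq + dJ*dq = 0: lambda = lam0 + lam1 @ u
    lam0 = Lam @ (J @ Dinv @ H - dJdq)
    lam1 = -Lam @ (J @ Dinv @ B)
    ddq_drift = Dinv @ (-H + J.T @ lam0)     # u-independent acceleration
    ddq_input = Dinv @ (B + J.T @ lam1)      # acceleration per unit input
    f = np.concatenate([dq, ddq_drift])      # drift vector field f(x)
    g = np.vstack([np.zeros((len(q), B.shape[1])), ddq_input])
    return f, g
```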
B. Hybrid Zero Dynamics

In this work, we use the hybrid zero dynamics (HZD) framework [16] in order to design stable periodic walking for underactuated bipeds. At the core of this method is the regulation of virtual constraints, or outputs:

$$y(x) = y^a(x) - y^d(\tau, \alpha), \tag{4}$$

with the goal of driving $y \to 0$, where $y^a : T\mathcal{Q} \to \mathbb{R}^p$ and $y^d : T\mathcal{Q} \times \mathbb{R} \times \mathbb{R}^a \to \mathbb{R}^p$ are smooth functions, and $\alpha$ represents a set of Bézier polynomial coefficients that can be shaped to encode stable locomotion.

If we assume the existence of a feedback controller $u^*(x)$ that can effectively stabilize this output tracking problem, then we can write the closed-loop dynamics:

$$\dot{x} = f_{cl}(x) = f(x) + g(x)u^*(x). \tag{5}$$

Additionally, by driving the outputs to zero, this controller renders the zero dynamics manifold:

$$\mathcal{Z} = \left\{ (q,\dot{q}) \in \mathcal{D} \mid y(x,\tau) = 0,\ L_{f_{cl}} y(x,\tau) = 0 \right\} \tag{6}$$

forward invariant and attractive. However, because our system is represented as a hybrid system (3), we must also ensure that (6) is shaped such that the walking is stable through impact. We thus wish to enforce an impact invariance condition when we intersect the switching surface:

$$\Delta(\mathcal{Z} \cap \mathcal{S}) \subset \mathcal{Z}. \tag{7}$$

In order to enforce this condition, the Bézier polynomials for the desired outputs can be shaped through the parameters $\alpha$. To generate walking behaviors using the HZD approach, we utilize the optimization library FROST [17] to transcribe the walking problem into a nonlinear program (NLP):

$$(\alpha, X)^* = \underset{\alpha, X}{\operatorname{argmin}}\ \mathcal{J}(X) \tag{8}$$
$$\text{s.t.}\quad \text{closed-loop dynamics (5),}$$
$$\qquad\quad \text{HZD condition (7),}$$
$$\qquad\quad \text{physical feasibility,}$$

where $X = (x_0, \ldots, x_N, T)$ is the collection of all decision variables, with $x_i$ the state at the $i$th discretization and $T$ the duration. The NLP (8) was solved with the optimizer IPOPT. This was done first for AMBER, in which one walking gait was designed using a pinned model of the robot [9], and then on Cassie for 3D locomotion using the motion library found in [18], consisting of walking gaits sampled at fixed increments over a grid of sagittal speeds $v_x$ and coronal speeds $v_y$.
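The desired outputs $y^d(\tau, \alpha)$ in (4) are standard Bernstein-basis Bézier polynomials in the phase variable $\tau$; a minimal sketch of their evaluation (coefficient layout assumed, one row per output):

```python
import numpy as np
from math import comb

def bezier(alpha, tau):
    """Evaluate the desired outputs y_d(tau, alpha) of (4) as Bezier
    polynomials in the phase variable tau, scaled to [0, 1].

    alpha: (p, M+1) coefficient array, one row per output, as returned
    by the HZD optimization (8).
    """
    M = alpha.shape[1] - 1
    basis = np.array([comb(M, k) * tau**k * (1 - tau)**(M - k)
                      for k in range(M + 1)])
    return alpha @ basis

# e.g. two outputs encoded with degree-5 Bezier polynomials:
alpha = np.zeros((2, 6))
print(bezier(alpha, 0.5))
```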
C. Control Lyapunov Functions

Control Lyapunov functions (CLFs), and specifically rapidly exponentially stabilizing control Lyapunov functions (RES-CLFs), were introduced as methods for achieving (rapidly) exponential stability on walking robots [19]. This control approach has the benefit of yielding a control framework that can provably stabilize periodic orbits for hybrid system models of walking robots, and can be realized in a pointwise-optimal fashion. In this work, we consider only outputs which are of vector relative degree 2. Thus, differentiating (4) twice with respect to the dynamics results in:

$$\ddot{y}(x) = L_f^2 y(x) + L_g L_f y(x)\, u.$$

Assuming that the system is feedback linearizable, we can invert the decoupling matrix, $L_g L_f y(x)$, to construct a preliminary control input:

$$u = \left(L_g L_f y(x)\right)^{-1}\left(\nu - L_f^2 y(x)\right), \tag{9}$$

which renders the output dynamics to be $\ddot{y} = \nu$. With the auxiliary input $\nu$ appropriately chosen, the nonlinear system can be made exponentially stable. Assuming the preliminary controller (9) has been applied to our system, and defining $\eta = [y^\top, \dot{y}^\top]^\top$, we have the following output dynamics [20]:

$$\dot{\eta} = \underbrace{\begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix}}_{F}\eta + \underbrace{\begin{bmatrix} 0 \\ I \end{bmatrix}}_{G}\nu. \tag{10}$$

With the goal of constructing a CLF using (10), we evaluate the continuous-time algebraic Riccati equation (CARE):

$$F^\top P + PF - PGR^{-1}G^\top P + Q = 0, \tag{CARE}$$

which has a solution $P \succ 0$ for any $Q = Q^\top \succ 0$ and $R = R^\top \succ 0$. From the solution of (CARE), we can construct a rapidly exponentially stabilizing CLF (RES-CLF) [19]:

$$V(\eta) = \eta^\top \underbrace{I_\varepsilon P I_\varepsilon}_{P_\varepsilon}\, \eta, \qquad I_\varepsilon = \begin{bmatrix} \frac{1}{\varepsilon}I & 0 \\ 0 & I \end{bmatrix}, \tag{11}$$

where $0 < \varepsilon < 1$ is a tunable parameter that drives the (rapidly) exponential convergence. Any feedback controller, $u$, which can satisfy the convergence condition:

$$\dot{V}(\eta) = L_F V(\eta) + L_G V(\eta)\nu = L_f V(\eta) + L_g V(\eta) u \le -\underbrace{\frac{\lambda_{\min}(Q)}{\lambda_{\max}(P)}}_{\gamma}\frac{1}{\varepsilon} V(\eta), \tag{12}$$

will then render rapidly exponential stability for the output dynamics (4). In the context of RES-CLFs, we can then define:

$$K_\varepsilon(x) = \left\{ u \in U : L_f V(x) + L_g V(x)u + \frac{\gamma}{\varepsilon}V(x) \le 0 \right\},$$

describing an entire class of controllers which result in (rapidly) exponential convergence. This leads naturally to the consideration of an optimization-based approach to enforcing (12). One such approach is to pose the CLF problem within a quadratic program (CLF-QP), with (12) as an inequality constraint [7]. When implementing this controller on physical systems, which are often subject to additional constraints such as torque limits or friction limits, a weighted relaxation term, $\delta$, is added to (12) in order to maintain feasibility.
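For reference, the CARE-based construction of (11) is only a few lines numerically; a minimal sketch assuming SciPy, with $p$ outputs so that $\eta \in \mathbb{R}^{2p}$:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def res_clf(Q, R, eps):
    """Build the RES-CLF (11) from the CARE solution for the output
    dynamics (10): Q = Q^T > 0 (2p x 2p), R = R^T > 0 (p x p),
    and 0 < eps < 1."""
    p = Q.shape[0] // 2
    F = np.block([[np.zeros((p, p)), np.eye(p)],
                  [np.zeros((p, p)), np.zeros((p, p))]])
    G = np.vstack([np.zeros((p, p)), np.eye(p)])
    P = solve_continuous_are(F, G, Q, R)
    I_eps = np.block([[np.eye(p) / eps, np.zeros((p, p))],
                      [np.zeros((p, p)), np.eye(p)]])
    P_eps = I_eps @ P @ I_eps
    gamma = np.linalg.eigvalsh(Q).min() / np.linalg.eigvalsh(P).max()
    V = lambda eta: float(eta @ P_eps @ eta)   # V(eta) = eta^T P_eps eta
    return V, P_eps, gamma

V, P_eps, gamma = res_clf(np.eye(4), np.eye(2), eps=0.1)
```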
The resulting relaxed problem is the CLF-QP-$\delta$:

$$u^* = \underset{u \in \mathbb{R}^m}{\operatorname{argmin}}\ \left\| L_f^2 y(x) + L_g L_f y(x)u \right\|^2 + w_{\dot{V}}\,\delta^2 \tag{13}$$
$$\text{s.t.}\quad \dot{V}(x) = L_f V(x) + L_g V(x)u \le -\frac{\gamma}{\varepsilon}V + \delta$$
$$\qquad\quad u_{\min} \preceq u \preceq u_{\max}$$

Because this relaxation term is penalized in the cost, we could also move the inequality constraint completely into the cost as an exact penalty function [8]:

$$J_\delta = \left\| L_f^2 y(x) + L_g L_f y(x)u \right\|^2 + w_{\dot{V}}\left\| g^+(x,u) \right\|^2$$

where:

$$g(x,u) := L_f V(x) + L_g V(x)u + \frac{\gamma}{\varepsilon}V(x), \qquad g^+(x,u) \triangleq \max(g(x,u), 0).$$

One of the downsides to using this approach is that the cost term $\|g^+(x,u)\|^2$ will intermittently trigger and cause a jump to occur in the commanded torque. Instead, we can allow $g(x,u)$ to go negative, meaning that the controller will always drive convergence even when the inequality (12) is not triggered [21]. This leads to the following relaxed CLF-QP with incentivized convergence in the cost, the CLF-QP+:

$$u^* = \underset{u \in \mathbb{R}^m}{\operatorname{argmin}}\ \left\| L_f^2 y(x) + L_g L_f y(x)u \right\|^2 + w_{\dot{V}}\,\dot{V}(x,u) \tag{14}$$
$$\text{s.t.}\quad u_{\min} \preceq u \preceq u_{\max}$$

In order to avoid computationally expensive inversions of the model-sensitive mass-inertia matrix, and to allow for a variety of costs and constraints to be implemented, a variant of the CLF-QP termed the ID-CLF-QP was introduced in [21]. This controller is used on the Cassie biped, with the decision variables $\mathcal{X} = [\ddot{q}^\top, u^\top, \lambda^\top]^\top \in \mathbb{R}^{n+m+h}$, yielding the ID-CLF-QP+:

$$\mathcal{X}^* = \underset{\mathcal{X} \in \mathbb{R}^{n+m+h}}{\operatorname{argmin}}\ \left\| A(x)\mathcal{X} - b(x) \right\|^2 + \dot{V}(q,\dot{q},\ddot{q}) \tag{15}$$
$$\text{s.t.}\quad D(q)\ddot{q} + H(q,\dot{q}) = Bu + J(q)^\top \lambda$$
$$\qquad\quad u_{\min} \preceq u \preceq u_{\max}$$
$$\qquad\quad \lambda \in \mathcal{AC}(\mathcal{X}) \tag{16}$$

where (2) has been moved into the cost terms $A(x)$ and $b(x)$ as a weighted soft constraint, in addition to a feedback-linearizing cost and a regularization for the nominal $\mathcal{X}^*(\tau)$ from the HZD optimization. Interested readers are referred to [8], [21] for the full ID-CLF-QP+ formulation.
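A minimal sketch of the CLF-QP+ (14) for a single control step is given below. The Lie-derivative terms are assumed precomputed, and cvxpy is used here purely for illustration; the paper's controllers run at kHz rates with dedicated QP solvers:

```python
import numpy as np
import cvxpy as cp

def clf_qp_plus(Lf2y, LgLfy, LfV, LgV, w_Vdot, u_min, u_max):
    """Solve the CLF-QP+ (14): feedback-linearization cost plus the
    incentivized convergence term w_Vdot * Vdot(x, u), subject only to
    torque limits. All arguments are evaluated at the current state x."""
    m = LgLfy.shape[1]
    u = cp.Variable(m)
    Vdot = LfV + LgV @ u                       # Vdot(x, u), affine in u
    cost = cp.sum_squares(Lf2y + LgLfy @ u) + w_Vdot * Vdot
    cp.Problem(cp.Minimize(cost), [u >= u_min, u <= u_max]).solve()
    return u.value

# toy 2-output, 2-input instance:
u = clf_qp_plus(np.ones(2), np.eye(2), 0.5, np.ones(2), 1.0,
                -10 * np.ones(2), 10 * np.ones(2))
```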
D. Parameterization of CLF-QP

For the following discussion, let $a = [a_1, \ldots, a_v] \in A \subset \mathbb{R}^v$ be an element of a $v$-dimensional parameter space, termed an action. We let $Q = Q(a)$, $\varepsilon = \varepsilon(a)$, and $w_{\dot{V}} = w_{\dot{V}}(a)$ denote a parameterization of our control tuning variables, which will subsequently be learned. Each gain $a_i$ for $i = 1, \ldots, v$ is discretized into $d_i$ values, leading to an overall search space of actions given by the set $A$ with cardinality $|A| = \prod_{i=1}^{v} d_i$. In this work, experiments are conducted on two separate experimental platforms: the planar biped AMBER, and the 3D biped Cassie. For AMBER, $v$ is taken to be 6 with discretizations $d = [4, \ldots]$, resulting in the following parameterization:

$$Q(a) = \begin{bmatrix} Q_1 & 0 \\ 0 & Q_2 \end{bmatrix}, \quad Q_1 = \mathrm{diag}([a_1, a_1, a_2, a_2]), \quad Q_2 = \mathrm{diag}([a_3, a_3, a_4, a_4]),$$
$$\varepsilon(a) = a_5, \quad w_{\dot{V}}(a) = a_6,$$

which satisfies $Q(a) \succ 0$, $0 < \varepsilon(a) < 1$, and $w_{\dot{V}}(a) > 0$ for the choice of bounds, as summarized in Table I. Because of the simplicity of AMBER, we were able to tune all associated gains for the CLF-QP+ controller. For Cassie, however, the complexity of the ID-CLF-QP+ controller warranted only a subset of parameters to be selected. Namely, $v$ is taken to be 12 and each $d_i$ to be 8, resulting in:

$$Q = \begin{bmatrix} Q_1 & 0 \\ 0 & Q_2 \end{bmatrix}, \quad Q_1 = \mathrm{diag}([a_1, \ldots, a_{12}]), \quad Q_2 = \bar{Q},$$

with $\bar{Q}$, $\varepsilon$, and $w_{\dot{V}}$ remaining fixed and predetermined by a domain expert. From this definition of $Q$, we can split our output coordinates $\eta = (\eta_t, \eta_{nt})$ into tuned and not-tuned components, where $\eta_t \in \mathbb{R}^{12}$ and $\eta_{nt} \in \mathbb{R}^6$ correspond to the $Q_1$ and $Q_2$ blocks in $Q$.

TABLE I: Learned Parameters

CASSIE
  Output                                        Pos. Bounds      Vel. Bounds
  Q Pelvis Roll ($\phi_x$)                      [2000, 12000]    [5, 200]
  Q Pelvis Pitch ($\phi_y$)                     [2000, 12000]    [5, 200]
  Q Stance Leg Length ($\|\phi_{st}\|$)         [4000, 15000]    [50, 500]
  Q Swing Leg Length ($\|\phi_{sw}\|$)          [4000, 20000]    [50, 500]
  Q Swing Leg Angle ($\theta_{swhp}$)           [1000, 10000]    [10, 200]
  Q Swing Leg Roll ($\theta_{swhr}$)            [1000, 8000]     [5, 150]

AMBER
  Output     Pos. Bounds    Vel. Bounds    Param            Bounds
  Q Knees    [100, 1500]    [10, 300]      $\varepsilon$    [0.08, 0.2]
  Q Hips     [100, 1500]    [10, 300]      $w_{\dot{V}}$    [1, 5]

Fig. 3: The experimental procedure, notably the communication between the controller, physical robot, human operator, and learning framework.
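To make the action-space construction concrete, the following sketches the AMBER parameterization from Table I. The per-dimension discretization counts beyond $d_1 = 4$, and the assignment of indices to knees versus hips, are illustrative assumptions, as they are not fully reproduced above:

```python
import numpy as np
from itertools import product

# bounds from Table I (AMBER); discretization counts are illustrative
bounds = [(100, 1500), (100, 1500), (10, 300), (10, 300), (0.08, 0.2), (1, 5)]
d = [4, 4, 4, 4, 4, 4]
grid = [np.linspace(lo, hi, n) for (lo, hi), n in zip(bounds, d)]
actions = np.array(list(product(*grid)))     # |A| = prod(d_i) actions

def unpack(a):
    """Map an action a to the tuning variables (Q(a), eps(a), w_Vdot(a))."""
    Q1 = np.diag([a[0], a[0], a[1], a[1]])   # output position weights
    Q2 = np.diag([a[2], a[2], a[3], a[3]])   # output velocity weights
    Q = np.block([[Q1, np.zeros((4, 4))], [np.zeros((4, 4)), Q2]])
    return Q, a[4], a[5]

Q, eps, w_Vdot = unpack(actions[0])
```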
III. LEARNING FRAMEWORK

In this section, we present the preference-based learning framework used in this work, specifically aimed at tuning controller gains. We assume that the user has some unknown underlying utility function $U : A \to \mathbb{R}$, which maps actions to a personal rating of how good the experimental walking seems to them. The goal of the framework is to identify the user-preferred action, $a^* = \operatorname{argmax}_a U(a)$, in as few iterations as possible.

In general, Bayesian optimization is a probabilistic approach towards identifying $a^*$ by selecting $\hat{a}^*$, the action believed to be optimal, which minimizes $\|\hat{a}^* - a^*\|$. Typically, Bayesian optimization is used on problems where the underlying function is difficult to evaluate but can be obtained. Recent work extended Bayesian optimization to the preference setting [22], where the action maximizing the user's underlying utility function $U(a)$ is obtained using only pairwise preferences between sampled actions. We refer to this setting as "preference-based learning". In this work, we utilize a more recent preference-based learning algorithm, LineCoSpar [23], with the addition of ordinal labels inspired by [24], which maintains the posterior over only a subset of the entire action space to increase computational tractability; more details can be found in [14]. The resulting learning framework iteratively applies Thompson sampling to navigate a high-dimensional Bayesian landscape of user preferences.

A. Summary of Learning Method

A summary of the learning method is as follows. At each iteration, the user is queried for their preference between the most recently sampled action, $a_i$, and the previous action, $a_{i-1}$. We define a likelihood function based on preferences:

$$P(a_i \succ a_{i-1} \mid U(a_i), U(a_{i-1})) = \begin{cases} 1 & \text{if } U(a_i) \ge U(a_{i-1}) \\ 0 & \text{otherwise} \end{cases}$$

where $a_i \succ a_{i-1}$ denotes a preference of action $a_i$ over action $a_{i-1}$. In other words, the likelihood function states that the user has utility $U(a_i) \ge U(a_{i-1})$ with probability 1 given that they return a preference $a_i \succ a_{i-1}$. This is a strong assumption on the ability of the user to give noise-free feedback; to account for noisy preferences we instead use:

$$P(a_i \succ a_{i-1} \mid U(a_i), U(a_{i-1})) = \phi\left(\frac{U(a_i) - U(a_{i-1})}{c_p}\right),$$

where $\phi : \mathbb{R} \to (0,1)$ is a monotonically-increasing link function, and $c_p > 0$ represents the amount of noise expected in the preferences. In this work, we select the heavy-tailed sigmoid distribution $\phi(x) := \frac{1}{1 + e^{-x}}$.

Inspired by [24], we supplement preference feedback with ordinal labels. Because ordinal labels are expected to be noisy, the ordinal categories are limited to only "very bad", "neutral", and "very good". Ordinal labels are obtained each iteration for the corresponding action $a_i$ and are assumed to be assigned based on $U(a_i)$. Just as with preferences, a likelihood function is created for ordinal labels:

$$P(o = r \mid U(a_i)) = \begin{cases} 1 & \text{if } b_{r-1} < U(a_i) < b_r \\ 0 & \text{otherwise} \end{cases}$$

where $\{b_0, \ldots, b_N\}$ are arbitrary thresholds that dictate which latent utility ranges correspond to which ordinal label, assuming ideal noise-free feedback. In our work, these thresholds were selected with $b_0 = -\infty$ and $b_N = \infty$. Again, the likelihood function is modified to account for noise by a link function $\phi$ and the expected noise in the ordinal labels $c_o > 0$:

$$P(o = r \mid U(a)) = \phi\left(\frac{b_r - U(a)}{c_o}\right) - \phi\left(\frac{b_{r-1} - U(a)}{c_o}\right).$$
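These two likelihood models are simple to implement; the following is a minimal sketch, where the interior threshold values are illustrative placeholders since their exact values are not reproduced above:

```python
import numpy as np

def sigmoid(x):
    # heavy-tailed sigmoid link function phi(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def pref_likelihood(U_i, U_prev, c_p=1.0):
    """P(a_i preferred over a_{i-1} | U) under the noisy preference model."""
    return sigmoid((U_i - U_prev) / c_p)

def ordinal_likelihood(U_i, r, b, c_o=1.0):
    """P(o = r | U) for ordinal thresholds b[0] < ... < b[N]."""
    return sigmoid((b[r] - U_i) / c_o) - sigmoid((b[r - 1] - U_i) / c_o)

# three categories: "very bad" (r=1), "neutral" (r=2), "very good" (r=3);
# the interior thresholds -1 and 1 are illustrative placeholders
b = np.array([-np.inf, -1.0, 1.0, np.inf])
print(pref_likelihood(0.4, -0.2), ordinal_likelihood(0.4, 2, b))
```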
After every sampled action $a_i$, the human operator is queried for both a pairwise preference between $a_{i-1}$ and $a_i$ as well as an ordinal label for $a_i$. This user feedback is added to the respective datasets $D_p = \{a_{k_1(i)} \succ a_{k_2(i)} \mid i = 1, \ldots, n\}$ and $D_o = \{o_i \mid i = 1, \ldots, n\}$, with the total dataset of user feedback denoted as $D = D_p \cup D_o$.

To infer the latent utilities of the sampled actions $U = [U(a_1), \ldots, U(a_N)]^\top$ using $D$, we apply the preference-based Gaussian process model to the posterior distribution $P(U \mid D)$ as in [25]. First, we model the posterior distribution as proportional to the likelihoods multiplied by the Gaussian prior using Bayes' rule:

$$P(U \mid D_p, D_o) \propto P(D_o, D_p \mid U)\, P(U), \tag{17}$$

where the Gaussian prior over $U$ is given by:

$$P(U) = \frac{1}{(2\pi)^{|V|/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}U^\top \Sigma^{-1} U\right),$$

with $\Sigma \in \mathbb{R}^{|V| \times |V|}$, $\Sigma_{ij} = \mathcal{K}(a_i, a_j)$, and $\mathcal{K}$ a kernel. Assuming conditional independence of queries, we can split $P(D_o, D_p \mid U) = P(D_o \mid U)\, P(D_p \mid U)$, where:

$$P(D_p \mid U) = \prod_{i=1}^{K} P\left(a_{k_1(i)} \succ a_{k_2(i)} \mid U(a_{k_1(i)}), U(a_{k_2(i)})\right),$$
$$P(D_o \mid U) = \prod_{i=1}^{M} P\left(o_i = r_i \mid U(a_i)\right).$$

The posterior (17) is then estimated via the Laplace approximation as in [25], which yields a multivariate Gaussian $\mathcal{N}(\mu, \hat{\Sigma})$. The mean $\mu$ can be interpreted as our estimate of the latent utilities $U$, with uncertainty $\sqrt{\mathrm{diag}(\hat{\Sigma})}$.

Fig. 4: Simulated learning results averaged over 10 runs, demonstrating the capability of preference-based learning to optimize over large action spaces, specifically the one used for experiments with Cassie. Standard error is shown by the shaded region.

To select new actions to query in each iteration, we apply a Thompson sampling approach. Specifically, at each iteration we draw a random sample from $U \sim \mathcal{N}(\mu, \hat{\Sigma})$ and select the action which maximizes the sampled utility:

$$a = \operatorname{argmax}_a U(a). \tag{18}$$

This action is then given an ordinal label, and a preference is collected between it and the previous action. This process is repeated for as many iterations as desired. The best action after the iterations have been completed is $\hat{a}^* = \operatorname{argmax}_a \mu(a)$, where $\mu$ is the mean function of the multivariate Gaussian.
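A sketch of the Thompson-sampling step (18) over the set of visited actions, assuming the Laplace-approximated posterior mean and covariance are available:

```python
import numpy as np

def thompson_step(mu, Sigma, rng):
    """Draw U ~ N(mu, Sigma) over visited actions and return the index of
    the maximizer, per (18). In LineCoSpar the candidate set is further
    restricted to a random line through the current best action."""
    U_sample = rng.multivariate_normal(mu, Sigma)
    return int(np.argmax(U_sample))

rng = np.random.default_rng(0)
mu, Sigma = np.zeros(6), np.eye(6)           # placeholder posterior
next_idx = thompson_step(mu, Sigma, rng)     # action to deploy next
best_idx = int(np.argmax(mu))                # final report: argmax of mean
```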
B. Expected Learning Behavior

To demonstrate the expected behavior of the learning algorithm, a toy example was constructed with the same dimensionality as the controller parameter space being investigated on Cassie ($v = 12$, $d_i = 8$), where the utility was modeled as $U(a) = -\|a - a^*\|$ for some $a^*$. Feedback was automatically generated for both ideal noise-free feedback as well as for noisy feedback (correct feedback given with probability 0.9).

The results of the simulated algorithm, illustrated in Fig. 4, show that the learning framework is capable of decreasing the error in the believed optimal action $\hat{a}^*$ even for an action space as large as the one used in the experiments with Cassie. The simulated results also show that ordinal labels allow for faster convergence to the optimal action, even in the case of noise, motivating their use in the final experiment. Lastly, the preference-based learning framework was compared to random sampling, where the only difference in the algorithm was that actions were selected randomly. In comparison, the random sampling method leads to minimal improvement relative to preference-based learning. From these simulation results, it can clearly be seen that the proposed method is an effective mechanism for exploring high-dimensional parameter spaces.

Fig. 5: Gait tiles for AMBER (left) and Cassie (right). (a) The behavior corresponding to a very low utility (top) and to the maximum posterior utility (bottom). (b) The robustness (top) and tracking (bottom) of the walking with the learned optimal gains is demonstrated through gait tiles.

Fig. 6: Experimental walking behavior of the CLF-QP+ (left) and the ID-CLF-QP+ (right) with the learned gains. (a) Phase portraits for AMBER experiments. (b) Output error of $\eta_t$ (left) and $\eta_{nt}$ (right) for the Cassie experiment.
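The automatically generated feedback in this toy example can be reproduced in a few lines; a sketch under the stated model $U(a) = -\|a - a^*\|$ with flip probability 0.1 in the noisy runs:

```python
import numpy as np

rng = np.random.default_rng(1)
U = lambda a, a_star: -np.linalg.norm(a - a_star)   # toy utility

def simulated_preference(a_i, a_prev, a_star, p_correct=0.9):
    """Synthetic pairwise preference: correct with probability p_correct
    (0.9 in the noisy-feedback runs), flipped otherwise."""
    correct = U(a_i, a_star) >= U(a_prev, a_star)
    return correct if rng.random() < p_correct else not correct

a_star = rng.integers(0, 8, size=12)     # optimum on the v=12, d_i=8 grid
print(simulated_preference(rng.integers(0, 8, 12), rng.integers(0, 8, 12), a_star))
```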
IV. LEARNING TO WALK IN EXPERIMENTS

Preference-based learning applied to tuning control parameters was experimentally implemented on two separate robotic platforms: the 5 DOF planar biped AMBER, and the 22 DOF 3D biped Cassie, as can be seen in the video [26]. A visualization of the experimental procedure is given in Figure 3. The experiments had four main components: the physical robot (either AMBER or Cassie), the controller running on a real-time PC, a human operating the robot who gave their preferences, and a secondary PC running the learning algorithm. The user feedback provided to the learning algorithm included pairwise preferences and ordinal labels. For the pairwise preferences, the human operator was asked "Do you prefer this behavior more or less than the last behavior?". For the ordinal labels, the human was asked to provide a label of either "very bad", "neutral", or "very good".

User feedback was obtained after each sampled action was experimentally deployed on the robot. Each action was tested for approximately 30 seconds to 1 minute, during which the behavior of the robot was evaluated in terms of both performance and robustness. After user feedback was collected for the sampled controller gains, the posterior was inferred over all of the uniquely sampled actions, which took up to 0.5 seconds. The experiment with AMBER was conducted for 50 iterations, lasting approximately one hour, and the experiment with Cassie was conducted for 100 iterations, lasting one hour for the domain expert and roughly two hours for the naïve user.
A. Results with AMBER
The preference-based learning framework is first demonstrated on tuning the gains associated with the CLF-QP+ for the AMBER bipedal robot. The CLF-QP+ controller was implemented on an off-board i7-6700HQ CPU @ 2.6 GHz with 16 GB RAM, which solved for desired torques and communicated them to the ELMO motor drivers on the AMBER robot. The motor driver communication and CLF-QP+ controller ran at 2 kHz. During the first half of the experiment, the algorithm sampled a variety of gains causing behavior ranging from instantaneous torque chatter to induced tripping due to inferior output tracking. By the end of the experiment, the algorithm had sampled 3 gains which were deemed "very good", and which resulted in stable walking behavior. Gait tiles for an action deemed "very bad", as well as for the learned best action, are shown in Figure 5a. Additionally, tracking performance for the two sets of gains is seen in Figure 6a, where the learned best action tracks the desired behavior to a better degree.

Fig. 7: Phase plots and torques commanded by the ID-CLF-QP+ in the naïve user experiments with Cassie. For torques, each colored line corresponds to a different joint, with the black dotted lines being the feedforward torque. The gains corresponding to a "very bad" action (top) yield poor joint tracking and torque chatter. On the other hand, the gains corresponding to the learned optimal action (bottom) exhibit much better tracking and no torque chatter.

The importance of the relative weight of the parameters can be seen by looking at the learned best action: $\hat{a}^* = [750, \ldots]$. Interestingly, the knees are weighted higher than the hips in the $Q$ matrix, which is reflected in the desired convergence of these outputs when constructing the Lyapunov function. Also, the values of $\varepsilon$ and $w_{\dot{V}}$ are in the middle of the given range, suggesting that undesirable behavior results from these values being either too high or too low. In the end, applying preference-based learning to tuning the gains of the CLF-QP+ on AMBER resulted in stable walking and in one of the few instantiations of a CLF-QP running on hardware.
To test the capability of the learning method towards tuning more complex controllers, preference-based learning was applied to tuning the gains of the ID-CLF-QP+ controller for the Cassie bipedal robot. To demonstrate repeatability, the experiment was conducted twice: once with a domain expert, and once with a naïve user. In both experiments, a subset of the $Q$ matrix from (CARE) was tuned, with coarse bounds given by a domain expert, as reported in Table I. These specific outputs were chosen because they were deemed to have a large impact on the performance of the controller. Additionally, the regularization terms in (15) were lowered when compared to the baseline controller for both experiments, so that the effect of the outputs would be more noticeable. Although lower regularization terms encourage faster convergence of the outputs to the zero dynamics surface, they induce increased torque chatter and lead to a more challenging gain tuning process.

The controller was implemented on the on-board Intel NUC computer, which was running a PREEMPT_RT kernel. The software runs on two ROS nodes, one of which communicates state information and joint torques over UDP to the Simulink Real-Time xPC, and one of which runs the controller. Each node is given a separate core on the CPU, and is elevated to real-time priority. Preference-based learning was run on an external computer connected to the ROS master over Wi-Fi. Actions were updated continuously, with no break in between each walking motion. To accomplish this real-time update, once an action was selected it was sent to Cassie via a rosservice call, where, upon receipt, the robot immediately updated the corresponding gains. Because rosservice calls are blocking, multithreading their receipt and parsing was necessary in order to maintain real-time performance.

For both experiments, preferences were dictated by the following criteria (ordered by importance): no torque chatter, no drift in the floating-base frame, responsiveness to desired directional input, and no violent impacts. At the start of the experiments, there was significant torque chatter and wandering, with the user having to regularly intervene to recenter the global frame. As the experiments continued, the walking generally improved, but not strictly. At the conclusion of 100 iterations, the posterior was inferred over all uniquely visited actions. The action corresponding with the maximum utility, believed by the algorithm to result in the most user-preferred walking behavior, was further evaluated for tracking and robustness. In the end, this learned best action coincided with the walking behavior that the user preferred the most, and the domain expert found the learned gains to be "objectively good". The optimal gains identified by the framework are: $a^* = [2400, \ldots]$.

Features of this optimal action, compared to a worse action sampled at the beginning of the experiments, are outlined in Figure 6. In terms of quantifiable improvement, the difference in tracking performance is shown in Figure 6b. For the sake of presentation, the outputs are split into $\eta = (\eta_t, \eta_{nt})$, where $\eta_t$ are the 12 outputs whose parameters were tuned by the learning algorithm and $\eta_{nt}$ are the remaining 6 outputs. The magnitude of $\eta_t$ illustrates the improvement that preference-based learning attained in tracking the outputs it intended to. At the same time, the tracking error of $\eta_{nt}$ shows that the outputs that were not tuned remained unaffected by the learning process.
This quantifiable improvement is further illustrated by the commanded torques in Figure 7, which show that the optimal gains result in much less torque chatter and better tracking as compared to the other gains.

Limitations.
The main limitation of the current formulation of preference-based learning towards tuning controller gains is that the action space bounds must be predefined, and these bounds are often difficult to know a priori. Future work to address this problem involves modifications to the learning framework to shift the action space based on the user's preferences. Furthermore, the current framework limits the set of potential new actions to the set of actions discretized by $d_i$ for each dimension $i$. As such, future work also includes adapting the granularity of the action space based on the uncertainty in specific regions.

V. CONCLUSION
Navigating the complex landscape of controller gains is a challenging process that often requires significant knowledge and expertise. In this work, we demonstrated that preference-based learning is an effective mechanism towards systematically exploring a high-dimensional controller parameter space. Furthermore, we experimentally demonstrated the power of this method on two different platforms with two different controllers, showing the application-agnostic nature of the framework. In all experiments, the robots went from stumbling to walking in a matter of hours. Additionally, the learned best gains in both experiments corresponded with the walking trials most preferred by the human operator. In the end, the robots had improved tracking performance and were robust to external disturbances. Future work includes addressing the aforementioned limitations, extending this methodology to other robotic platforms, coupling preference-based learning with metric-based optimization techniques, and addressing multi-layered parameter tuning tasks.
REFERENCES

[1] L. Zheng, "A practical guide to tune of proportional and integral (PI) like fuzzy controllers," in [1992 Proceedings] IEEE International Conference on Fuzzy Systems. IEEE, 1992, pp. 633–640.
[2] Y. Zhao, W. Xie, and X. Tu, "Performance-based parameter tuning method of model-driven PID control systems," ISA Transactions, vol. 51, no. 3, pp. 393–399, 2012.
[3] H. Hjalmarsson and T. Birkeland, "Iterative feedback tuning of linear time-invariant MIMO systems," in Proceedings of the 37th IEEE Conference on Decision and Control (Cat. No. 98CH36171), vol. 4. IEEE, 1998, pp. 3893–3898.
[4] S. W. Sung and I.-B. Lee, "Limitations and countermeasures of PID controllers," Industrial & Engineering Chemistry Research, vol. 35, no. 8, pp. 2596–2610, 1996.
[5] P. F. Odgaard, L. F. Larsen, R. Wisniewski, and T. G. Hovgaard, "On using Pareto optimality to tune a linear model predictive controller for wind turbines," Renewable Energy, vol. 87, pp. 884–891, 2016.
[6] A. D. Ames and M. Powell, "Towards the unification of locomotion and manipulation through control Lyapunov functions and quadratic programs," in Control of Cyber-Physical Systems. Springer, 2013, pp. 219–240.
[7] K. Galloway, K. Sreenath, A. D. Ames, and J. W. Grizzle, "Torque saturation in bipedal robotic walking through control Lyapunov function-based quadratic programs," IEEE Access, vol. 3, pp. 323–332, 2015.
[8] J. Reher and A. D. Ames, "Control Lyapunov functions for compliant hybrid zero dynamic walking," IEEE Transactions on Robotics and Automation, in preparation, 2021.
[9] E. Ambrose, W.-L. Ma, C. Hubicki, and A. D. Ames, "Toward benchmarking locomotion economy across design configurations on the modular robot: AMBER-3M," in 2017 IEEE Conference on Control Technology and Applications (CCTA). IEEE, 2017.
[10] Agility Robotics. [Online]. Available: http://www.agilityrobotics.com
[11] M. Birattari, Tuning Metaheuristics: A Machine Learning Perspective. Springer, 2009, vol. 197.
[12] M. Jun and M. G. Safonov, "Automatic PID tuning: An application of unfalsified control," in Proceedings of the 1999 IEEE International Symposium on Computer Aided Control System Design (Cat. No. 99TH8404). IEEE, 1999, pp. 328–333.
[13] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe, "Automatic LQR tuning based on Gaussian process global optimization," in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 270–277.
[14] M. Tucker, N. Csomay-Shanklin, W.-L. Ma, and A. D. Ames, "Preference-based learning for user-guided HZD gait generation on bipedal walking robots," 2020.
[15] J. W. Grizzle, C. Chevallereau, R. W. Sinnet, and A. D. Ames, "Models, feedback control, and open problems of 3D bipedal robotic walking," Automatica, vol. 50, no. 8, pp. 1955–1988, 2014.
[16] E. R. Westervelt, J. W. Grizzle, C. Chevallereau, J. H. Choi, and B. Morris, Feedback Control of Dynamic Bipedal Robot Locomotion. CRC Press, 2018.
[17] A. Hereid and A. D. Ames, "FROST: Fast robot optimization and simulation toolkit," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 719–726.
[18] J. Reher and A. D. Ames, "Inverse dynamics control of compliant hybrid zero dynamic walking," 2020.
[19] A. D. Ames, K. Galloway, K. Sreenath, and J. W. Grizzle, "Rapidly exponentially stabilizing control Lyapunov functions and hybrid zero dynamics," IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 876–891, 2014.
[20] A. Isidori, Nonlinear Control Systems, Third Edition, ser. Communications and Control Engineering. Springer, 1995. [Online]. Available: https://doi.org/10.1007/978-1-84628-615-5
[21] J. Reher, C. Kann, and A. D. Ames, "An inverse dynamics approach to control Lyapunov functions," 2020.
[22] Y. Sui, M. Zoghi, K. Hofmann, and Y. Yue, "Advancements in dueling bandits," in IJCAI, 2018, pp. 5502–5510.
[23] M. Tucker, M. Cheng, E. Novoseller, R. Cheng, Y. Yue, J. W. Burdick, and A. D. Ames, "Human preference-based learning for high-dimensional optimization of exoskeleton walking gaits," arXiv preprint arXiv:2003.06495, 2020.
[24] K. Li, M. Tucker, E. Bıyık, E. Novoseller, J. W. Burdick, Y. Sui, D. Sadigh, Y. Yue, and A. D. Ames, "ROIAL: Region of interest active learning for characterizing exoskeleton gait preference landscapes," arXiv preprint arXiv:2011.04812, 2020.
[25] W. Chu and Z. Ghahramani, "Preference learning with Gaussian processes," in Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.