Towards Expedited Impedance Tuning of a Robotic Prosthesis for Personalized Gait Assistance by Reinforcement Learning Control
Minhan Li, Yue Wen, Xiang Gao, Jennie Si, Fellow, IEEE, and He (Helen) Huang, Senior Member, IEEE
Abstract—Personalizing medical devices such as lower limb wearable robots is challenging. While the initial feasibility of automating the control parameter tuning of a prosthesis has been demonstrated in a principled way, the next critical issue is to make such a tuning process efficient, so that a human subject can be served safely and quickly. We therefore propose a novel idea under the framework of approximate policy iteration in reinforcement learning. The resulting algorithm, Policy Iteration with Constraint Embedded (PICE), includes two complementary considerations in solving for the value function: a simplified approximation of the value function, and a reduced policy evaluation error achieved by imposing realistic constraints. PICE is an off-policy learning approach and can be trained with both offline and online samples. Tests were performed on both able-bodied and amputee subjects. Results show that PICE provided convergent and effective policies, and significantly reduced users' tuning time when combining offline training with online samples. Furthermore, for the first time, we experimentally evaluated and demonstrated the robustness of the deployed policies by applying them to different tasks and users. Taken together, PICE has shown its potential towards truly automating the process of control parameter tuning of robotic knees for users in practice.
Index Terms—Rehabilitation robotics, knee prosthesis, reinforcement learning, policy iteration, policy evaluation, impedance control.
I. INTRODUCTION

Robotic prostheses have emerged with recent breakthroughs in mechanical design, control theory and biomechanics [1]–[3]. These robotic prostheses have manifested exceptional potential to benefit lower limb amputees in various ways, such as reducing metabolic consumption [4], enhancing balance and stability [5], augmenting adaptability to varying walking speeds and inclines [6], and enabling seamless walking on changing terrains [7]–[9]. The finite-state machine impedance control (FSM-IC) has been the most adopted control framework for these devices [10]–[12], because studies have suggested that the human nervous system modulates the impedance of lower limb joints in order to realize stable and robust dynamics when walking on various terrains [13], [14].
This work was supported in part by the National Science Foundation. (Corresponding authors: He (Helen) Huang; Jennie Si.)
M. Li, Y. Wen, and H. Huang are with the NCSU/UNC Department of Biomedical Engineering, NC State University, Raleigh, NC 27695-7115 USA, and the University of North Carolina at Chapel Hill, Chapel Hill, NC 27599 USA (e-mail: [email protected]; [email protected]; [email protected]). X. Gao and J. Si are with the Department of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ 85281 USA (e-mail: [email protected]; [email protected]).
The challenges in applying FSM-IC in robotic prostheses are: 1) the control consists of a large number of impedance parameters (e.g., 12-15 in a knee prosthesis for level walking alone) in order to achieve safe human-machine-environment interaction and sufficient characterization of limb movement within a gait cycle [1], [12], [15], [16]; 2) these control parameters must be personalized to assist the individual amputee's gait. To find a set of suitable parameters, in current clinical practice, highly trained prosthetists need to spend hours arduously hand-tuning the parameters for each amputee user and each locomotion mode (e.g., level walking, ramp ascent/descent), based mainly on subjective observations of the user's gait performance [17]. Not only does this lack precision, but it also requires intensive time and effort due to the inability of humans to tune high-dimensional parameters simultaneously.

Given a growing need for facilitating assistive device personalization, the research community has developed various solutions to automate the process. The concept of optimization was adopted to tune lower limb exoskeletons in order to minimize the metabolic cost of walking, and has been validated in low-dimensional control parameter spaces (the number of tunable parameters was no greater than four) on able-bodied individuals [18], [19]. Beyond merely identifying a set of optimal parameters, a couple of studies have attempted to learn optimal sequential decision-making for optimizing high-dimensional prosthesis control parameters. Employing knowledge and skills from experienced prosthetists, an expert system was developed to encode human decisions as automatic tuning rules [20]. The method was challenged by the lack of sufficient data collected from prosthetists in device tuning. Alternatively, an actor-critic reinforcement learning (RL) based method (i.e., direct heuristic dynamic programming, dHDP for short) was designed to directly obtain the impedance tuning policy via interaction with the human-prosthesis system in an online manner [21], [22], without requiring a closed-form model of the system.

Although the aforementioned studies have demonstrated the feasibility of applying automatic tuning to wearable robots with a human in the loop, little attention has been paid to the efficiency and robustness of the tuning algorithms from a user's perspective. First, the efficiency of a tuning algorithm (i.e., the ability to safely complete the online tuning quickly) is critical for the clinical translation of a new method because the patient is in the loop. Second, the robustness of the tuning algorithm quantifies whether the optimal control parameters or learned prosthesis tuning policy can handle situations where the walking condition (e.g., treadmill walking vs. level-ground walking) or the user has changed. Additionally, a robust policy is expected to alleviate the computational burden in online learning or continued customization, improve user safety during automated prosthesis tuning, and expedite the tuning process in clinics. Therefore, the objective of this study was to develop a novel, automatic control tuning method for a robotic prosthetic knee to reproduce near-normal knee kinematics during walking. The tuning goal stemmed from the fact that having amputees walk with normal appearance, as able-bodied people do, has been widely used as a design goal or evaluation criterion for knee prosthesis control [16], [23].
To this end, we addressed our study objective by: 1) developing a novel learning algorithm to directly enhance data and time efficiency; 2) investigating the robustness of policies against changes of tasks and users.

To address the efficiency in learning a control parameter tuning policy for robotic prostheses, policy iteration, a classical RL method, lends itself as a promising candidate. This is because, like the general RL-based control framework, it has an excellent capability of learning optimal sequential decisions in high-dimensional spaces [24]. In addition, as the process of customizing prosthesis control parameter design for a human user does not render an abundance of data, which is a necessary feature of deep RL applications [25]–[28], the policy iteration framework is a suitable approach. Furthermore, previous evidence suggests that policy iteration has the advantage of fast convergence over other classical RL algorithms, such as value iteration and gradient-based policy search [29], [30] (including our previously reported dHDP [31], which is a stochastic gradient method). The idea of policy iteration is to iteratively improve the policy by alternately carrying out policy evaluation and policy improvement [32], [33]. Under such a concept, as the efficiency of the policy evaluation step significantly influences the overall learning algorithm efficiency, the problem boils down to improving the efficiency of policy evaluation.

Therefore, in the present study, based upon the policy iteration framework, we developed a novel algorithm, namely policy iteration with constraint embedded (PICE). The PICE algorithm is expected to enhance policy training efficiency, especially in policy evaluation, because such enhancement was enabled by the following innovations in the algorithm design. First, PICE as an off-policy method can be implemented using both offline and online data, or first trained with offline data and then updated online. Second, we opted for an approximator with simple quadratic bases for the policy evaluation rather than complex neural networks. Additionally, we utilized a real-time stage cost function, as in the traditional linear quadratic regulator, to effectively account for control performance as a result of adjusting the impedance parameters. Such a cost structure is more informative than the discrete and subjective measures used in previous studies [21], [22]. Finally, to mitigate the value function approximation error, we imposed a positive semi-definite constraint on the approximated value function, thus improving the resulting policy. The PICE algorithm was implemented and tested on human-prosthesis systems; the efficiency and robustness of the tuning policies were evaluated quantitatively.

The main contributions of this study are as follows:
1) We proposed an innovative idea based on the policy iteration reinforcement learning framework, which resulted in the PICE algorithm to enhance the efficiency of robotic knee prosthesis controller tuning. For the central issue of value function approximation error, we provided a proof of boundedness by a constant;
2) The proposed PICE algorithm was implemented and tested in real time on human subjects for prosthesis control parameter tuning. We successfully demonstrated its efficiency, effectiveness and convergence in experiments involving human subjects;
3) The robustness of the learned control parameter tuning policies against changes of tasks and users was tested on human subjects.
The successful demonstration of the robustness of PICE suggests its potential value for clinical application.

The remainder of this paper is organized as follows. Section II describes the problem to be solved and shows how it relates to the theory of optimal sequential decision-making. Section III presents details of the proposed RL algorithm for improving data and time efficiency. Section IV elaborates on considerations regarding the implementation of the algorithm. Results are presented in Section V. Finally, we discuss these results and the limitations of the study in Section VI and conclude in Section VII.

II. PROBLEM FORMULATION
In this study, the proposed PICE algorithm aims at determining optimal control parameter tuning policies to supplement the impedance controller of a robotic knee prosthesis, in order for its user to restore near-normal knee motion. The algorithm is implemented within a well-established FSM-IC framework [1], [12], as shown in Fig. 1.
A. Finite-state Machine Impedance Controller (FSM-IC)
As depicted in the FSM-IC block of Fig. 1, a single gait cycle during walking is decomposed into 4 distinct phases in the FSM-IC: stance flexion (STF), stance extension (STE), swing flexion (SWF) and swing extension (SWE). The major gait events determining the phase transitions are identified by utilizing measurements of the knee angle and ground reaction force together using the Dempster-Shafer theory, as described in [12].

For each phase in a single gait cycle, the FSM selects the corresponding set of impedance parameters for the impedance controller to generate a torque τ at the prosthetic knee joint based on the impedance control law,

τ = K(θ_e − θ) − Cω    (1)

where the impedance controller consists of three control parameters: the stiffness K, the equilibrium angle θ_e and the damping C. Real-time sensor feedback includes the knee joint angle θ and angular velocity ω. Therefore, a total of 12 impedance parameters need to be regulated in a gait cycle.

Fig. 1. The schematic illustration of the RL-based impedance tuning for a robotic knee prosthesis system within the finite-state machine impedance control (FSM-IC) framework. Red dashed lines denote inputs and outputs in the impedance update loop, whereas blue dash-dotted lines stand for those in the policy update loop. A tuning policy acts to adjust impedance parameters, according to the state of the human-prosthesis system, in the FSM-IC to regulate the interaction force with users. The policy can be obtained from and further updated by the proposed PICE algorithm. In the FSM-IC block, STF, STE, SWF and SWE stand for stance flexion, stance extension, swing flexion and swing extension, respectively.
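To make the control law concrete, the following minimal Python sketch evaluates Eq. (1) with one impedance parameter set per FSM phase. The numeric values are placeholders of our own choosing, not the parameters used in the study.

```python
# Placeholder impedance sets, one per FSM phase (illustrative values only;
# the study tunes 4 phases x 3 parameters = 12 parameters per gait cycle).
IMPEDANCE = {
    "STF": {"K": 3.0, "theta_e": 10.0, "C": 0.05},
    "STE": {"K": 2.5, "theta_e": 5.0,  "C": 0.04},
    "SWF": {"K": 0.8, "theta_e": 55.0, "C": 0.03},
    "SWE": {"K": 0.6, "theta_e": 5.0,  "C": 0.06},
}

def knee_torque(phase: str, theta: float, omega: float) -> float:
    """Impedance control law of Eq. (1): tau = K*(theta_e - theta) - C*omega."""
    p = IMPEDANCE[phase]
    return p["K"] * (p["theta_e"] - theta) - p["C"] * omega
```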
B. Dynamic Process of Impedance Update
As shown in Fig. 1, the impedance update loop is executed by specified policies to adjust impedance parameters for the FSM-IC. Without loss of generality, the following formulation describing the dynamic process of impedance update for a robotic prosthesis is applicable to all four phases in the FSM-IC. This is owing to the fact that, despite sharing the identical framework for learning the tuning policy, each phase is associated with an independent tuning policy running in parallel.

We consider the human-prosthesis system as a discrete-time system with unknown dynamics f, which was also studied in [22], [34],

x^{(k+1)} = f(x^{(k)}, u^{(k)}), k = 0, 1, . . .
u^{(k)} = π(x^{(k)})    (2)

where k denotes the discrete index in the impedance update loop in Fig. 1. We denote x and u as the state and action variables of the process, respectively, while the tuning policy π represents a mapping that determines actions according to current states.

In the context of impedance update, the above state variables are defined based on features extracted from the knee kinematic profiles for each segmented phase in the FSM-IC. Specifically, the continuous knee profile within a single gait cycle (from the heel strike to the next heel strike of the same foot) is characterized by 4 discrete points, each of which is a local extremum in the corresponding phase along the profile, as shown in Fig. 2. Each point is associated with two features: the angle feature P and the duration feature D. Similarly, target features (P_d and D_d) in each phase can be determined from representative data of knee kinematics in the able-bodied population [35]. Consequently, state variables x ∈ ℝ² are defined as the differences between measured features and target features (referred to as the peak error and the duration error, respectively) in a specific phase at each impedance update as follows,

x = [P − P_d, D − D_d]^T.    (3)

Meanwhile, action variables u ∈ ℝ³ are defined in the following form,

u = [ΔK, Δθ_e, ΔC]^T,    (4)

where ΔK, Δθ_e, ΔC are the adjustments of the impedance parameters for the corresponding phase at each instance of impedance update.

Fig. 2. Features of knee kinematics in a single gait cycle. The subscript numbers 1 through 4 denote the respective phases (i.e., STF, STE, SWF and SWE) of a gait cycle, to which the features correspond.
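As a sketch of how the state and action of Eqs. (3) and (4) might be assembled in code (the function and variable names are ours, not from the paper):

```python
import numpy as np

def phase_state(P: float, D: float, P_d: float, D_d: float) -> np.ndarray:
    """State of Eq. (3): peak error and duration error for one phase."""
    return np.array([P - P_d, D - D_d])

def apply_action(K: float, theta_e: float, C: float, u: np.ndarray):
    """Action of Eq. (4): additive adjustments to the phase's impedance set."""
    dK, dtheta_e, dC = u
    return K + dK, theta_e + dtheta_e, C + dC
```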
C. Policy Update
The policy update loop is carried out by the proposed PICE algorithm, as shown in Fig. 1, to progressively approach optimal policies with respect to specified objectives. In this study, the objective is to regulate the states with minimal control energy expenditure over the process of impedance update, in order to keep the peak error and duration error as close to zero as possible. Hence, at each instance of impedance update, we assign a scalar stage cost with a quadratic form,

g(x^{(k)}, u^{(k)}) = x^{(k)T} R_s x^{(k)} + u^{(k)T} R_a u^{(k)}    (5)

where R_s ∈ ℝ^{2×2} and R_a ∈ ℝ^{3×3} are both positive semi-definite matrices. Thereby, given a policy π_i after the i-th policy update, the corresponding discounted cost-to-go over an infinite horizon, namely the action-dependent value function Q^{(i)}, is given by

Q^{(i)}(x^{(k)}, u^{(k)}) = g(x^{(k)}, u^{(k)}) + Σ_{t=k+1}^{∞} α^{t−k} g(x^{(t)}, u^{(t)})
= g(x^{(k)}, u^{(k)}) + α Q^{(i)}(x^{(k+1)}, π_i(x^{(k+1)}))    (6)

where α is the discount factor. This value function is non-negative due to the definition of g(x^{(k)}, u^{(k)}) in (5), and it reflects a measure of the performance when action u^{(k)} is applied at state x^{(k)} and the control policy π_i is followed thereafter.

The goal of PICE, as in value-based RL algorithms [32], is to seek an optimal policy that minimizes the cost-to-go by solving the Bellman optimality equation approximately,

Q*(x^{(k)}, u^{(k)}) = min_π Q(x^{(k)}, u^{(k)})
= g(x^{(k)}, u^{(k)}) + α min_{u^{(k+1)}} Q*(x^{(k+1)}, u^{(k+1)})
= g(x^{(k)}, u^{(k)}) + α Q*(x^{(k+1)}, π*(x^{(k+1)}))    (7)

where π* and Q* denote the optimal tuning policy and the associated optimal value, respectively.
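A minimal sketch of the stage cost of Eq. (5) and a truncated estimate of the discounted cost-to-go of Eq. (6) follows; the diagonal entries of R_s and R_a and the discount factor are illustrative assumptions, not the study's values.

```python
import numpy as np

R_s = np.diag([1.0, 0.5])        # assumed 2x2 PSD state penalty
R_a = np.diag([0.1, 0.1, 0.1])   # assumed 3x3 PSD action penalty

def stage_cost(x: np.ndarray, u: np.ndarray) -> float:
    """Quadratic stage cost of Eq. (5)."""
    return float(x @ R_s @ x + u @ R_a @ u)

def cost_to_go(xs, us, alpha: float = 0.9) -> float:
    """Finite-horizon estimate of the discounted value in Eq. (6),
    computed along an observed trajectory of (state, action) pairs."""
    return sum(alpha**t * stage_cost(x, u) for t, (x, u) in enumerate(zip(xs, us)))
```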
III. POLICY ITERATION WITH CONSTRAINT EMBEDDED

To solve the above Bellman equation approximately and effectively, we propose the PICE procedure. Instead of achieving a close approximation to the value function, as in most general policy iteration algorithms, PICE makes use of simple quadratic polynomial basis functions, which are expected to provide simplified, albeit approximate, and thus efficient solutions during policy evaluation. The second step of PICE, however, aims at improving the policy evaluation results by imposing a positive semi-definite (PSD) constraint on the approximated value function, to alleviate the problem associated with inaccurate approximation.
A. Value Function Approximation
To represent the approximate value function Q̂^{(i)} associated with the policy π_i, a linear parametric combination of basis functions is often used in classic policy iteration, because of its virtues of being easy to implement and exhibiting fairly transparent behavior [29], [30]:

Q̂^{(i)}(x, u) = φ(x, u)^T r^{(i)}    (8)

where φ(x, u) ∈ ℝ^m is the vector of fixed basis functions of states and actions, and the weight parameter vector r^{(i)} ∈ ℝ^m varies as the policy updates. Hereafter, we drop the subscript k on states and actions and write them as x and u, respectively, for clear presentation. Instead of the usual universal approximators, such as multi-layer perceptron neural networks, radial basis functions, and splines, we opt for a simple structure of quadratic polynomials as the basis functions, thereby potentially reducing the uncertainties associated with the large number of free parameters used in the approximation. To compensate for the consequent approximation errors due to simple basis functions, an additional PSD constraint stemming from insights on the formulated problem is imposed (details in the next subsection).

As a result, the approximate value function can be rewritten in the following equivalent form of a weighted inner product, which yields all possible quadratic basis functions of states and actions,

Q̂^{(i)}(x, u) = [x; u]^T H^{(i)} [x; u] = [x; u]^T [H^{(i)}_xx, H^{(i)}_xu; H^{(i)}_ux, H^{(i)}_uu] [x; u]    (9)

where H^{(i)} is a PSD matrix, and H^{(i)}_xx, H^{(i)}_xu, H^{(i)}_ux and H^{(i)}_uu are submatrices of H^{(i)} with proper dimensions. By rearranging and grouping like terms in (9), we can convert the weight parameter vector r^{(i)} to the matrix H^{(i)} and vice versa.
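The quadratic basis and its matrix form in Eq. (9) can be realized as in the sketch below; flattening the outer product of z = [x; u] is one convention (ours, not necessarily the paper's) under which φ(x, u)ᵀr is identical to zᵀHz when r holds the entries of H.

```python
import numpy as np

def phi(x: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Quadratic basis of Eq. (8): all pairwise products of z = [x; u].
    With r = H.ravel(), phi(x, u) @ r equals z^T H z as in Eq. (9).
    Here dim(z) = 2 + 3 = 5, so m = 25 (symmetric terms appear twice)."""
    z = np.concatenate([x, u])
    return np.outer(z, z).ravel()

def q_value(x: np.ndarray, u: np.ndarray, H: np.ndarray) -> float:
    """Weighted inner-product form of Eq. (9)."""
    z = np.concatenate([x, u])
    return float(z @ H @ z)
```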
B. Policy Iteration Under Constraint

Two iterative procedures, policy evaluation and policy improvement, are included in a standard approximate policy iteration. Policy evaluation is to find an approximated value function Q̂^{(i)} satisfying the Bellman equation under the target policy π_i as follows,

Q̂^{(i)}(x, u) = g(x, u) + α Q̂^{(i)}(f(x, u), π_i(f(x, u))) ≜ B(Q̂^{(i)}(x, u))    (10)

where B is called the Bellman operator under the target policy. Replacing the approximate function Q̂^{(i)} with the parametric basis functions in (8), we obtain the following equivalent form,

φ(x, u)^T r^{(i)} = B(φ(x, u)^T r^{(i)}).    (11)

In vector-matrix form, the above equation becomes

Φ r^{(i)} = B(Φ r^{(i)})    (12)

where the matrix Φ contains the values of all basis functions for all possible state-action sample pairs.

Directly solving for approximate value functions based on the above standard policy iteration framework, we obtained a preliminary proof-of-concept result from offline learning [36]. To be more efficient and also to accommodate both offline and online learning, we propose the novel PICE algorithm with the following details.

As a result of the approximation error in (8), some of the value functions solved from (11) can be negative definite. This clearly indicates poor approximation. We therefore impose a PSD constraint on the value functions solved from (11) towards an improved solution. Specifically, we seek an approximated value function Q̂^{(i)} satisfying the following projected Bellman equation, which is to be solved by PICE,

Φ r^{(i)} = proj_{S+} B(Φ r^{(i)})    (13)

where proj_{S+} denotes the operator of projection onto a closed convex subset S+. The closed convex subset S+ is contained in a subspace spanned by the columns of Φ,

S+ = Φ R+    (14)

where R+ ⊂ ℝ^m is the PSD cone in the vector space ℝ^m. The idea of solving the projected Bellman equation was also used in the least squares policy iteration (LSPI) algorithm [29]. The PICE algorithm, however, imposes a tighter constraint and thus results in a different projected Bellman equation. Specifically, PICE requires the value function solved from the projected Bellman equation to be positive semi-definite. For PICE as off-policy learning, where the policy being evaluated (target policy) is different from the policy regulating actions (behavior policy), we show that the value function approximation error is bounded by a constant (see Appendix for details).

Inspired by established results [37], we convert the problem of solving (13) to one that corresponds to the solution of the following variational inequality,

(Φ r^{(i)} − B(Φ r^{(i)}))^T Ξ (Φ r − Φ r^{(i)}) ≥ 0, ∀r ∈ R+    (15)

where Ξ is a diagonal matrix with the steady-state probabilities of the Markov chain under the behavior policy along its diagonal.

In real applications, we use observational data to approximate inequality (15) by introducing the matrix Â^{(i)} and the vector b̂^{(i)}, turning the problem into solving the following approximated inequality [33], [37],

(Â^{(i)} r^{(i)} − b̂^{(i)})^T (r − r^{(i)}) ≥ 0, ∀r ∈ R+    (16)

where Â^{(i)} and b̂^{(i)} are computed from

Â^{(i)} = 1/(N+1) Σ_{n=0}^{N} φ(s_n) (φ(s_n) − α (p_{s_n,s'_n}/q_{s_n,s'_n}) φ(s'_n))^T,
b̂^{(i)} = 1/(N+1) Σ_{n=0}^{N} (p_{s_n,s'_n}/q_{s_n,s'_n}) φ(s_n) g(s_n),    (17)

where n and N denote the sample index and the sample size of the collected data, respectively.
The variable s_n ≜ (x_n, u_n) denotes a state-action pair, and the variable s'_n ≜ (x'_n, u'_n) denotes the next sample pair following s_n in the sequence of impedance updates. In addition, the ratio between p_{s_n,s'_n} and q_{s_n,s'_n} can be further simplified as

p_{s_n,s'_n} / q_{s_n,s'_n} = δ(u'_n = π_i(x'_n)) / ν(u'_n | x'_n)    (18)

where δ(·) denotes the indicator function (i.e., it equals 1 if u'_n = π_i(x'_n) and 0 otherwise), and ν(·) denotes the conditional probability of the behavior policy taking action u in state x.

Once the weight parameter vector r^{(i)} is obtained by solving the variational inequality (16), the policy evaluation of the given target policy π_i is followed by the policy improvement. With the choice of quadratic basis functions, the improvement is equivalent to solving a quadratic programming (QP) problem. The equivalence can be easily observed by formulating the minimization problem of (9) over the actions for any given state. As such, the equivalent QP problem can be readily written as follows,

π_{i+1}(x) = argmin_{u∈U} Q̂^{(i)}(x, u) = argmin_{u∈U} { u^T H^{(i)}_uu u + 2 x^T H^{(i)}_xu u }    (19)

where U is the admissible action space.
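Because H_uu is PSD but possibly singular, the box-constrained QP of Eq. (19) can be approximated with projected gradient descent, a simple stand-in for a dedicated QP solver; the box bounds and iteration count below are our assumptions.

```python
import numpy as np

def improve_policy(x: np.ndarray, H_xu: np.ndarray, H_uu: np.ndarray,
                   lo: float = -1.0, hi: float = 1.0, steps: int = 200) -> np.ndarray:
    """Policy improvement of Eq. (19): minimize u^T H_uu u + 2 x^T H_xu u
    over the box U = [lo, hi]^dim(u) by projected gradient descent."""
    u = np.zeros(H_uu.shape[0])
    # Conservative step size from the largest curvature of the objective.
    lr = 0.5 / max(float(np.linalg.eigvalsh(H_uu).max()), 1e-6)
    for _ in range(steps):
        grad = 2.0 * (H_uu @ u + H_xu.T @ x)  # gradient of the QP objective
        u = np.clip(u - lr * grad, lo, hi)    # project back onto the box
    return u
```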
C. Iterative Approach for Solving Policy Evaluation

To solve the variational inequality (16), the following iterative approach has been proposed in previous studies [37], [38] to approximate r^{(i)} with r̂^{(i)}_j,

r̂^{(i)}_{j+1} = proj_E [ r̂^{(i)}_j − γ_j (Â^{(i)}_j r̂^{(i)}_j − b̂^{(i)}_j) ],    (20)

where j denotes the iterative step and E is the constraint set for the solution. To yield a convergent sequence, the approach also requires the constraint set E to be closed, bounded and convex [38]. In our case, however, the PSD cone is not bounded. To address this issue, we construct a convex set as the intersection of the PSD cone and a Euclidean ball as follows,

E = R+ ∩ Z_δ    (21)

where Z_δ denotes a closed Euclidean ball centered at the origin with radius δ; the radius can be chosen as large as needed to cover a sufficient subset of the original PSD cone. Since equation (20) involves a projection onto the intersection of two convex sets, Dykstra's projection algorithm is applied [39]. Furthermore, the step size γ_j also needs to be on the order of 1/j to guarantee that (20) yields a convergent sequence, that is,

(γ_j − γ_{j+1}) / γ_j = O(1/j).    (22)
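A sketch of the projected iteration of Eq. (20) follows, with the projection onto E = R+ ∩ Z_δ of Eq. (21) computed by Dykstra's algorithm; r is handled in its equivalent symmetric-matrix form H from Eq. (9). The iteration counts and radius are illustrative, not the study's settings.

```python
import numpy as np

def proj_psd(H: np.ndarray) -> np.ndarray:
    """Projection onto the PSD cone: clip negative eigenvalues to zero."""
    w, V = np.linalg.eigh((H + H.T) / 2.0)
    return (V * np.clip(w, 0.0, None)) @ V.T

def proj_ball(H: np.ndarray, radius: float) -> np.ndarray:
    """Projection onto the Frobenius-norm ball Z_delta of Eq. (21)."""
    n = np.linalg.norm(H)
    return H if n <= radius else H * (radius / n)

def proj_E(H: np.ndarray, radius: float, iters: int = 50) -> np.ndarray:
    """Dykstra's algorithm for the intersection of the two convex sets."""
    x, p, q = H.copy(), np.zeros_like(H), np.zeros_like(H)
    for _ in range(iters):
        y = proj_psd(x + p)
        p = x + p - y
        x = proj_ball(y + q, radius)
        q = y + q - x
    return x

def policy_evaluation(A_hat: np.ndarray, b_hat: np.ndarray,
                      dim: int = 5, radius: float = 10.0, iters: int = 500):
    """Iteration of Eq. (20): r_{j+1} = proj_E[r_j - gamma_j (A r_j - b)],
    with gamma_j = 1/(j+1) as in Algorithms 1 and 2. Uses the flattened
    H convention from the earlier basis-function sketch (m = dim*dim)."""
    r = np.zeros(dim * dim)
    for j in range(iters):
        gamma = 1.0 / (j + 1)
        r = r - gamma * (A_hat @ r - b_hat)
        r = proj_E(r.reshape(dim, dim), radius).ravel()
    return r
```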
IV. IMPLEMENTATION

To apply PICE for tuning the impedance control parameters of a prosthetic knee on human subjects, some practical issues need to be considered during implementation. Hereafter, an experimental trial with a human subject, namely a trial, refers to a single experiment with impedance and policy initializations that allows the prosthetic knee control parameters to adapt until reaching a stopping criterion.
A. Human Variability and Stopping Criterion
Due to variations in the physical conditions and fatigue of human subjects, measurement noise, and other uncertainties associated with the environment, data recorded from the human-prosthesis system need to be processed, and the tuning target set needs to be realistically specified. Specifically, an impedance update is set to take place every 4 gait cycles to reduce the noise introduced by human stride-to-stride variance. That is to say, the knee features are averaged over the 4 gait cycles within a single impedance update to form a state-action pair to be used in the policy update. In addition, we introduce a target set as tolerance levels for the peak error and the duration error to account for the inherent walking variability [35], [40]. Consequently, we consider an impedance parameter tuning procedure in a given phase a success if the errors stay within the target set for 8 out of 10 consecutive impedance updates. If all four phases become successful, a trial is successful and is considered to have reached the stopping criterion.
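The phase-success rule ("8 of the last 10 updates inside the target set") can be tracked with a small sliding window, e.g.:

```python
from collections import deque

def make_success_checker(window: int = 10, required: int = 8):
    """Returns a callable that records, per impedance update, whether a
    phase's errors fell inside the target set, and reports success once at
    least `required` of the last `window` updates did (Subsection IV-A)."""
    history = deque(maxlen=window)

    def update(in_target_set: bool) -> bool:
        history.append(in_target_set)
        return len(history) == window and sum(history) >= required
    return update
```

A trial then meets the stopping criterion when the checkers of all four phases have returned True.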
B. Safety Bounds for Impedance Tuning

In each trial, a set of initial impedance parameters is selected for the prosthetic knee, and then the subject experiences a series of impedance updates guided by the tuning policies for each phase of the FSM. While the initial impedance parameters are randomly selected, they need to be feasible for walking. Such a feasible initial impedance parameter setting is validated prior to the start of a trial and is verified by the experimenter, either via visual inspection of whether the subject is capable of walking without holding onto a handrail, or via the subject's verbal expression. To avoid any potential harm to human subjects caused by unsafe parameters and the associated knee kinematics, we set a safety range within which the peak error is allowed to vary. Once a peak error goes beyond the safety range, the impedance parameters are reset to the initial ones, which are known to be safe. Herein, the peak error bounds are chosen to cover two standard deviations of the knee kinematic features in normal walking among different test subjects [35], and are applied to all four phases.

C. Implementations of PICE
Given the nature of off-policy learning, the PICE algorithm can be implemented in both offline and online training manners. The detailed procedures of both implementations are described as pseudocode in Algorithm 1 and Algorithm 2.

Feature scaling was first performed on the state and action variables for all four phases to normalize them to a comparable unit magnitude. Specifically, fixed scaling factors were selected for the peak error and duration error in (3), and for the adjustments of stiffness, equilibrium angle and damping in (4); the only exception was the scaling for the equilibrium angle in the SWF, which was set differently in consideration of phase differences. Meanwhile, to keep actions within a reasonable range, we set the admissible space U in (19) for the normalized action variables to the range between −1 and 1.

Other parameter values in the PICE implementation are as follows. The penalty matrices for states, R_s, and actions, R_a, were set to fixed positive semi-definite diagonal matrices, together with a fixed discount factor α and Euclidean ball radius δ. The batch size N_b was selected as 15, and the tolerance for offline training, ε_a, was set to a small fixed threshold.

The main differences between the implementations of Algorithm 1 and Algorithm 2 lie in how the training data and termination condition are formulated. Specifically, offline training utilizes a fixed set of available data collected from previous studies, likely generated by various policies, to perform policy evaluation. Online training, in contrast, collects its own data for training via interactions with the human-prosthesis system in real time during impedance updates. These data are prepared in batches, and they are updated with newly generated samples under the same policy being evaluated.

We used an early-stopping termination condition to deactivate an online training process, not only to prevent overtraining, but also to take into consideration that human subjects can only walk for about 30-60 minutes during an experimental trial due to physical constraints. Specifically, during online training, we analyze the trend in the evolution of the stage cost under the current policy every time we have newly collected N_b samples of state-action pairs. When either of the two conditions listed below is fulfilled, online training is deactivated and the rest of the impedance updates are carried out with the current policy until a policy update is triggered again.
1) The case of no impedance reset due to hitting the safety bounds. We fit a linear regression model between the time series of the stage cost and that of the impedance updates. From the model, we obtain a confidence interval around the slope of the regression line. If the interval falls below zero, which signals a rigorously decreasing stage cost as the impedance parameters are updated according to the current policy, we deactivate the online training;
2) The case of using stage costs. We average the stage costs over the samples being analyzed. If the average is smaller than a threshold value ε_b, online training is deactivated. We set the threshold to 0.043, which is equivalent to the stage cost of the largest tolerated errors within the target set.
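A sketch of the early-stopping check over one batch of stage costs is given below; the 95% confidence level is our assumption, since the paper's exact level did not survive extraction.

```python
import numpy as np
from scipy import stats

def should_stop(costs, eps_b: float = 0.043, conf: float = 0.95) -> bool:
    """Early-stopping conditions of Subsection IV-C over the latest batch.
    Condition 2: mean stage cost below the threshold eps_b.
    Condition 1: the slope of a linear fit of cost vs. update index has a
    confidence interval entirely below zero (rigorously decreasing cost)."""
    c = np.asarray(costs, dtype=float)
    if c.mean() < eps_b:                          # condition 2
        return True
    t = np.arange(len(c), dtype=float)            # condition 1
    slope, intercept = np.polyfit(t, c, 1)
    resid = c - (slope * t + intercept)
    s_xx = ((t - t.mean()) ** 2).sum()
    se = np.sqrt(resid @ resid / (len(c) - 2) / s_xx)  # std. error of slope
    upper = slope + stats.t.ppf(0.5 + conf / 2, df=len(c) - 2) * se
    return upper < 0.0
```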
Algorithm 1 Offline PICE

Input: tolerance for offline training ε_a; training dataset {(x_n, u_n, g_n, x'_n) | n = 1, 2, . . . , N}
Initialization: policy update index i ← 0; weight vector r^{(0)} ← r_initial
repeat
    Compute next actions u'_n with π_i(x'_n), n ∈ [1, N]
    Reset initial guess r̂^{(i)}_0 ← 0 and step size γ_0 ← 1
    for j = 0, 1, . . . , N do
        Compute Â^{(i)}_j and b̂^{(i)}_j with the training dataset
        Compute r̂^{(i)}_{j+1} by (20)
        Update step size γ_{j+1} = 1/(j+1)
    end for
    Approximate Q̂^{(i)}(x, u) with r^{(i)} ← r̂^{(i)}_N
    Update policy to π_{i+1} by (19)
    i ← i + 1
until ‖r^{(i)} − r^{(i−1)}‖ ≤ ε_a
Output: final value function Q̂*(x, u) and policy π̂*
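Putting the pieces together, the outer loop of Algorithm 1 might look like the following sketch; `evaluate` and `improve` stand in for the solves of Eqs. (20) and (19), and the tolerance value is an assumption rather than the study's setting.

```python
import numpy as np

def offline_pice(dataset, evaluate, improve, eps_a: float = 1e-4,
                 max_policy_updates: int = 50):
    """Outer loop of Algorithm 1: alternate policy evaluation on a fixed
    offline dataset with policy improvement until the weight vector of the
    approximate value function stops changing."""
    policy = lambda x: np.zeros(3)       # initial (e.g., zero-action) policy
    r_prev = None
    for _ in range(max_policy_updates):
        r = evaluate(dataset, policy)    # solve Eq. (16) via Eq. (20)
        policy = improve(r)              # greedy step of Eq. (19)
        if r_prev is not None and np.linalg.norm(r - r_prev) <= eps_a:
            break
        r_prev = r
    return policy, r
```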
V. EXPERIMENTS AND RESULTS
We performed three tests involving four human subjects (two able-bodied and two amputees) to evaluate the performance of the proposed PICE.
A. Hardware Setup
A prototype robotic knee prosthesis designed in our lab was used in this study [12]. The prosthesis utilizes a slider-crank mechanism, in which the slider is driven by the rotation of a DC motor (Maxon, Switzerland) through a ball screw (THK, Japan), and the crank rotation mimics the knee motion.
Algorithm 2 Online PICE

Input: batch size N_b
Initialization: policy update index i ← 0; impedance update index k ← 0; weight vector r^{(0)} ← r_initial
repeat
    Measure current state x^{(k)} and stage cost g(x^{(k)})
    Take action u^{(k)} = π_i(x^{(k)})
    if k = (i+1) N_b then
        Reset initial guess r̂^{(i)}_0 ← 0
        Reload samples with index k ∈ [i N_b, (i+1) N_b)
        for j = 0, 1, . . . , N_b do
            Compute Â^{(i)}_j and b̂^{(i)}_j with the loaded samples
            Compute r̂^{(i)}_{j+1} by (20)
            Update step size γ_{j+1} = 1/(j+1)
        end for
        Approximate Q̂^{(i)}(x, u) with r^{(i)} ← r̂^{(i)}_{N_b}
        Update policy to π_{i+1} by (19)
        i ← i + 1
    end if
    k ← k + 1
until the early-stopping termination condition is fulfilled
Output: final value function Q̂*(x, u) and policy π̂*

The whole mechanism is integrated with a pylon, as shown in Fig. 3. A maximum of 80 Nm torque output at the joint is ensured by this design. The rotational motion of the prosthetic knee joint is recorded by a potentiometer (ALPS, Japan). A load cell (Bertec, USA) is attached to the pylon to measure the ground reaction force. All analog readings are converted to digital signals through a DAQ board (NI, USA) and then fed back to the control system, which is implemented in LabVIEW and MATLAB on a desktop PC.
Fig. 3. Hardware setup for the prototype of robotic knee prosthesis.
B. Participants
We recruited four male subjects, two able-bodied individuals (AB1 and AB2) and two transfemoral amputees (TF1 and TF2), for this study. An L-shaped adapter (see Fig. 3) and a daily socket were used by the AB and TF subjects, respectively, to allow them to walk with the knee prosthesis. The alignment of the prosthesis for each subject was done by a certified prosthetist. All subjects received training with the powered prosthesis until they could walk comfortably and confidently without holding the handrail. All subjects provided written informed consent before any procedures, and the study was approved by the Institutional Review Board of the University of North Carolina at Chapel Hill.
C. Experimental Protocols
We carried out three experimental tests to validate and analyze the performance of the proposed PICE, associated respectively with the following three goals: 1) to experimentally assess convergence properties during offline training and the effect of training data size; 2) to quantitatively assess the potential gains from using an offline pre-trained policy as the initial policy for online training, and compare its performance to randomly initialized online training; 3) to investigate the robustness of a set of well-trained policies as tasks and users change.
1) Test of Offline Training:
We used five sets of offline data, all collected from AB1, to perform offline policy training and obtained five sets of policies accordingly. The numbers of data samples in the five sets were 15, 45, 75, 105 and 135, respectively; each sample was a 4-tuple (x_n, u_n, g_n, x'_n). We then evaluated each policy in five independent trials using AB1 as the test subject. AB1 walked on a treadmill at a speed of 0.6 m/s while the offline-trained policy adjusted the prosthesis impedance. The same set of initial impedance parameters was applied in all five trials. To eliminate the confounding effect of fatigue resulting from prolonged walking, in each tuning trial AB1 performed several 3-minute walking segments, each followed by a rest period. Additionally, a maximum of 135 impedance updates was allowed in consideration of the subject's limited walking endurance. If tuning did not complete within this limit, the trial was considered a failure.

Two outcome measures were captured in each trial. The first measure was the L2 distance (i.e., ‖r^{(i)} − r^{(i−1)}‖) between two consecutive weight parameter vectors in (16), used to quantify changes in the series of value outcomes in (10). The second measure explored the relationship between the number of offline training samples and the number of phases in which success, as defined in Subsection IV-A, was reached without any online policy updates beyond the offline training.
2) Test of Online Training:
We conducted online training under two different initial policy conditions: 1) randomly initialized; 2) offline pre-trained. Two subjects, AB1 and TF1, were asked to perform the treadmill walking task at a speed of 0.6 m/s. We used their own available offline data to obtain the pre-trained policies; for both subjects, the offline training data had 105 samples. The same pre-trained policies served as initial policies across trials for each subject, while the randomly initialized policies varied. Several blocks of experimental sessions were conducted, each including two online training trials for comparison purposes. Specifically, in each block, we randomly selected the initial impedance parameters, with the only requirement being feasibility for walking. Then two online training trials with the two different initial policy conditions were performed. For AB1, three blocks of experimental sessions (each of which used different initial impedance parameters) were conducted. For TF1, one block was tested. The same walk-rest experimental protocol and the same maximum number of impedance updates were applied as in the first test.

The evaluation for the test of online training covered efficiency, effectiveness, and impedance tuning convergence. Herein, the efficiency of online training was quantified by: 1) the number of phases that needed online policy updates beyond the initial policies until meeting the stopping criterion defined in Subsection IV-A; 2) the number of impedance updates needed to meet the stopping criterion for prosthesis tuning. To understand the effectiveness of tuning the prosthesis control for producing the desired knee motion, the knee kinematics were measured to reflect how the prosthetic knee joint moved as it interacted with the human users while the impedance varied under the guidance of the policies. Finally, the impedance tuning convergence was analyzed by checking the evolution of the peak errors and duration errors of the knee kinematics (states) and the prosthesis impedance values (control parameters) during tuning.
3) Test of Policy Robustness:
We tested the robustness of a set of well-trained policies and investigated how well they behaved against changes of task and human subject. In this test, the policies to be tested were trained and validated with AB2 in the treadmill walking task at a speed of 0.6 m/s prior to the trials in this test. We applied them as initial policies for two new trials, performed by subject AB2 in a self-paced level-ground walking task and by subject TF2 in the treadmill walking task (0.6 m/s), respectively. To investigate how the policies acted when applied to different tasks or subjects, we monitored the sequences of both impedance and policy updates in each trial and the associated evolution of the stage cost.
D. Experimental Results
Fig. 4. Result of offline training evaluation. (a) Evolution of the L2 distance between weight parameter vectors in two consecutive value function updates (i.e., ‖r^{(i)} − r^{(i−1)}‖) in a representative trial of the offline training test. At each instance of offline policy update, the vector is updated correspondingly. (b) The number of phases able to reach a success increases with the amount of training data.
1) Offline Training Assessment:
Since similar results were obtained from all trials in the test of offline training, we only present representative results (trained with 105 samples collected from AB1) in Fig. 4(a). As shown there, the changes in the weight parameter vectors of the approximate value function in (16) reduced to within the tolerance ε_a in 6 offline policy updates, which is equivalent to 10 seconds of computation.
TABLE I
EFFICIENCY COMPARISONS OF ONLINE TRAINING BY USING PRE-TRAINED VS. RANDOM INITIAL POLICIES

Experimental     Phase Requiring            Overall Number of
Block Number*    Policy Updates†            Impedance Updates
                 Pre-trained  Random        Pre-trained  Random
B1               2            1, 2, 3, 4    39           126
B2               2            1, 2, 3, 4    43           94
B3               2            1, 2, 3, 4    42           111
B4               4            1, 2, 3, 4    28           53

* Experimental blocks B1 through B3 were tested with subject AB1, while block B4 was associated with subject TF1.
† The numbers 1 through 4 in these columns represent STF, STE, SWF and SWE, respectively, in the FSM-IC.
The results suggest that the approximate value functions, as well as the policies, were convergent given the offline training data.

For the effect of the size of the training dataset, we see in Fig. 4(b) that the number of phases able to reach a success increased as the amount of training data increased. Particularly, when we performed the offline training with 135 samples, the number of phases reached four and no more policy updates were needed to accomplish the tuning. The evidence implies that, with the offline implementation alone, the proposed PICE algorithm is able to obtain policies ready to deploy when sufficient offline data of good quality are available.
2) Online Training Assessment:
We first looked into the improvement in tuning efficiency obtained by employing the pre-trained initial policies from offline training. As shown in Table I, the comparison results reveal that online training trials starting with pre-trained policies obtained from offline training, albeit not perfect, were significantly more efficient than those starting with random policies. On average, the former cases resulted in only 1 phase that required online policy updates, whereas the number amounted to 4 in the latter cases. Meanwhile, the pre-trained cases were observed to need fewer impedance updates overall than the random cases to meet the stopping criterion, by an average of 58 updates, which was equivalent to about 7 minutes of walking time of a subject.

Apart from the efficiency, for the trials starting with pre-trained initial policies, we studied the tuning effectiveness. Fig. 5 displays the overall effect of tuning by comparing the knee profiles generated by the initial impedance parameters before tuning with those produced by the adjusted parameters at the end of the tuning trials. We noted that, though they differed in shape, the initial knee profiles in all trials deviated from the targets, especially in the peak angle features. However, going through the tuning process under the guidance of the final policies, the final parameters enabled the knee profiles to approach the targets.

To inspect the impedance tuning convergence, we present representative results here (the experimental block B1 with pre-trained initial policies in Table I) as similar results across trials were observed. As revealed in Fig. 6, no matter how large the initial errors were, they all progressively converged into the tolerance range of errors. We also observed that the impedance parameters converged to constant values at the end of the trial (i.e., the last ten updates) for most phases, except for the STF, where the momentum of impedance adjustment lingered. The difference may be attributed to varying perturbations introduced by the more dynamical interactions occurring in the STF among the human, the robotic prosthesis and the ground. As a result, the final policy for the STF needed to respond by adjusting the impedance to accommodate such disturbances and stabilize the errors within the tolerances.

Fig. 5. Prosthetic knee kinematics with initial and final tuned impedance parameters in the test of online training. (a) to (d) Trials with pre-trained initial policies in experimental blocks B1, B2, B3 and B4, respectively. Time series of kinematics are divided and normalized to multiple profiles in individual gait cycles based on the timing of heel strike. Shaded areas along profiles indicate the real motion ranges across 4 gait cycles performed by subjects walking with the same impedance parameters. The associated lines (dashed and solid) denote the averaged kinematics.
Fig. 6. Evolution of states as impedance parameters were updated. (a) Peakerrors, (b) Duration errors.
3) Robustness Investigation:
As seen in Fig. 8(a), subjectAB2 used a pre-trained policy obtained from his own treadmillwalking to perform level-ground walking with no difficulty asno further policy update was needed, and after 46 impedanceupdates (about 6 minutes of subject walking time), the subjectknee kinematics met stopping criterion. As for the trial of TF2 treadmill walking using the same pre-trained policy, despitenot being completely successful in deploying policies to allfour phases, only three updates of policy occurred in the STFphase, as shown in Fig. 8(b). Although 72 impedance updates(about 9 minutes) were needed to meet the stopping criterion,it only took 45 updates of impedance (about 6 minutes) toobtain the final policy refined for the STF phase of the newsubject.Note that a cyclic pattern of change in the cost wasdisplayed in Fig. 8(b). This was caused by following initial orintermediate policies, which led to cost value sloping upwarduntil the safety bound was hit, thereby triggering the reset ofimpedance parameters and getting the cost drop back to theinitial value. The results suggest that policies we obtained fromAB2 treadmill walking were, to some extent, robust against thechanges of tasks and subjects.VI. D
ISCUSSIONS
In this study, we proposed an innovative algorithm, PICE,for tuning high-dimensional robotic knee prosthesis controlparameters in order to provide efficient personalized assistancein walking. The tuning efficiency stemmed partly from ourinnovation that enables offline policy training, beside onlinetraining, via policy iteration. To our knowledge, few studieshave successfully addressed wearable robot personalization insuch an offline-online manner.Our proposed RL method, as suggested in Fig. 4, demon-strated the feasibility of obtaining policies from a sufficientamount of existing offline data by the offline training, whichcan be deployed directly without interacting with the realhuman-prosthesis system. Clearly the offline implementationof PICE enables a maximal utility of existing parameter-performance data and is a new way to improve trainingefficiency in obtaining prosthesis tuning policies. Neverthe-less, pure offline implementation has no guarantees to obtainaccurate and robust policies despite being convergent in thesense of offline training, especially when the training dataquality is poor. Note that the quality of data has two meanings,which include the amount of data and the extent of mismatchin data distribution [41]. In this study, however, we onlyinvestigated the influence of the amount of data on the offlinetraining. Therefore, the number of training data we examinedin this study might not be applicable to other datasets due tothe confounding effect caused by data distributions, and it isactually difficult to determine the exact number in practice.Hence, an RL algorithm that is capable of performing offline-online learning, such as our proposed PICE, became espe-cially intriguing in order to ensure efficiency, effectiveness,and convergence of auto-tuning algorithm for learning theprosthesis tuning policy. In this paper, we demonstrated thatwhen offline learned policies cannot handle realistic human-prosthesis interaction or were not robust enough to handle thevariation across human users, as shown in Fig. 8(b), PICE cantrigger online training that further update the policy to achievethe desired tuning goal.In addition, the investigation of robustness associated withpolicies learned by the proposed algorithm showed other indi-rect benefits to potentially scale up the training outcome. As Number of Impedance Updates(a) D a m p i n g ( N m s / d e g ) E q u i l i b r i u m ( d e g ) S t i ff n e ss ( N m / d e g ) Number of Impedance Updates(b)
Number of Impedance Updates(c)
Number of Impedance Updates(d)
Fig. 7. Evolution of the impedance parameters (i.e., stiffness, equilibrium angle and damping) in different phases. (a) STF, (b) STE, (c) SWF, (d) SWE.
Fig. 8. Normalized stage cost over the number of impedance updates in testing policy robustness. The vertical lines indicate the instances where policy updates took place. The areas highlighted in yellow denote the periods of time when online learning was activated and intermediate policies were employed; the areas highlighted in purple describe the phases where online learning was deactivated and final policies were deployed. Remaining areas without highlights indicate that the initial policies trained from the task of AB2 treadmill walking were sufficient for successful impedance tuning and were applied. (a) Four phases in a trial of AB2 level-ground walking. (b) Four phases in a trial of TF2 treadmill walking.

As shown in Fig. 8, most deployed policies possessed exceptional robustness, in spite of the fact that refinements happened to the policy in the STF through online training to further accommodate the changes in users. A potential explanation for this phenomenon is that the underlying physical principles in prosthesis control do not change drastically across different subjects and walking tasks. This promising discovery may enable us to collect data and obtain pre-trained initial policies, albeit not optimal ones, from more available users and relatively easier tasks. From there, further user-specific or task-specific refinements, if needed, could be accomplished by online training. As opposed to learning from scratch, such an approach is more likely to result in higher training efficiency, and it is thus of great clinical value when applied at scale.

As a generic and efficient learning framework, the proposed PICE could also potentially shed light on similar problems for other assistive wearable machines, such as exoskeletons and neuroprostheses. These devices are also in need of identifying the optimal control parameters for individual users with motor deficits [42], [43]. By unleashing the potential demonstrated in this study, translations of the proposed approach into other human-machine systems are expected to be valuable, because they all call for high training efficiency and model-free operation due to the patient being in the loop. However, specific modifications regarding problem formulations or implementations need to be properly considered before such translations, such as how to define the states, actions and costs for each application.

The successful implementation of the proposed PICE in the human-prosthesis system would encourage future studies to explore more application-specific solutions for an efficient approximation of the value function in reinforcement learning. We demonstrated in this study that leveraging simple basis functions (e.g., quadratic basis functions), informed by insights into the control problem (e.g., the PSD constraint for the value function presented in this paper), is likely to yield a satisfying approximation of the value function with a limited amount of data. This is because, with fewer unknown parameters to estimate, such a choice alleviates the high demands for persistent excitation [44] or data richness [45] required by generic basis functions, which are often difficult to meet in practice; meanwhile, the pre-structured treatment is able to compensate for the approximation errors caused by the lack of data.

Our proposed design and study, although promising, also had several limitations.
The primary limitation of the study was the limited evaluation of the algorithm on human subjects, because the focus of the study was on developing a novel automatic prosthesis tuning algorithm and demonstrating its promising advantages. Systematic evaluation of the proposed tuning algorithm on more human subjects and with designed experiments is needed in order to show its clinical value in the future. Another limitation arose from the feature extraction of the continuous knee profile. We selected four discrete points, as a means of dimension reduction, to characterize the knee kinematics in each phase of a single gait cycle. Such a selection dropped the kinematic information between these points, and thus we had little control over the entire profile except at the feature points. To truly reproduce target knee kinematics, we need to explore more advanced feature extraction methods that can better characterize a continuous profile.

VII. CONCLUSION
In this paper, we proposed an innovative RL-based algorithm, PICE, to efficiently learn impedance tuning policies for a robotic knee prosthesis. The tuning objective was to reproduce near-normal knee kinematics during walking tasks. The PICE algorithm benefited from its ability to perform offline-online training and from a deliberate compromise in using a simplified value function approximation structure. The resulting approximation error due to the simple value function approximation structure was compensated for by imposing a PSD constraint to keep the approximated value function from becoming qualitatively incorrect. Therefore, the algorithm has a great advantage in improving the efficiency of policy training.

We directly tested the proposed idea on human subjects. Our results showed that PICE successfully provided impedance tuning policies for the prosthetic knee with a human in the loop, and it significantly reduced policy training time, especially for online training after initializing with an offline pre-trained policy. In addition, the deployed policies were robust across human subjects and modifications in tasks. These promising results suggest great potential for future clinical application of our proposed methods in automatically personalizing assistive wearable robots.
APPENDIX
Q-VALUE FUNCTION CONVERGENCE
Inspired by the previous work [46], we let B̄ and T̄ be the Bellman operator and the transition probability matrix, respectively, associated with the behavior policy π̄. B and B̄ herein are both contractions of modulus β ∈ [0, 1) with respect to the weighted Euclidean norm ‖·‖_Ξ. To proceed with the analysis, we also made the following assumptions.

(A1) The behavior policy π̄ generates training samples of the Markov chain with uniform steady-state probabilities.
(A2) The discrepancy between the target and behavior policies is bounded by a constant λ (i.e., ‖T − T̄‖_Ξ ≤ λ).
(A3) The approximate value function for the target policy (i.e., Q̂_π) has a bounded value for any given state-action pair, which is met in this study because the admissible states and actions are bounded, as is the constrained weight parameter vector.
(A4) The true value functions of both the target and behavior policies are bounded.

We can derive an upper bound for the policy evaluation error between the approximate and the true value functions of policy π as follows:

‖Q̂_π − Q_π‖_Ξ
≤ ‖Q̂_π − proj_{S+} Q_π̄‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π‖_Ξ
= ‖proj_{S+} B(Q̂_π) − proj_{S+} B̄(Q_π̄)‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π‖_Ξ
≤ ‖B(Q̂_π) − B̄(Q_π̄)‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π‖_Ξ
≤ ‖B(Q̂_π) − B̄(Q̂_π)‖_Ξ + ‖B̄(Q̂_π) − B̄(Q_π̄)‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π‖_Ξ
= α‖T Q̂_π − T̄ Q̂_π‖_Ξ + ‖B̄(Q̂_π) − B̄(Q_π̄)‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π‖_Ξ
≤ αλ‖Q̂_π‖_Ξ + β‖Q̂_π − Q_π̄‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π‖_Ξ
≤ αλ‖Q̂_π‖_Ξ + β‖Q̂_π − Q_π‖_Ξ + ‖Q_π − Q_π̄‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π‖_Ξ
≤ αλ‖Q̂_π‖_Ξ + β‖Q̂_π − Q_π‖_Ξ + 2‖Q_π − Q_π̄‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π̄‖_Ξ.

Rearranging the above inequality and under assumptions (A3) and (A4), we obtain a bounded evaluation error with an upper bound denoted by ξ:

‖Q̂_π − Q_π‖_Ξ ≤ 1/(1−β) (αλ‖Q̂_π‖_Ξ + 2‖Q_π − Q_π̄‖_Ξ + ‖proj_{S+} Q_π̄ − Q_π̄‖_Ξ)
≤ 1/(1−β) (αλ‖Q̂_π‖_Ξ + (2αλ‖Q_π̄‖_Ξ)/(1−β) + ‖Q_π̄‖_Ξ) ≜ ξ.

Let ξ_max be the largest upper bound of the evaluation errors over all policy evaluation iterations. According to [29], [46], when the policy improvement is performed exactly without incurring errors (which is guaranteed in this problem setting due to the use of a quadratic programming solution), the following bound can be obtained:

lim sup_{i→∞} ‖Q̂^{(i)} − Q*‖ ≤ 2αξ_max / (1−α)².

The bound implies that the iterative sequence eventually produces an approximate value function whose performance is at most a constant away from the truly optimal value function, and hence the proposed PICE is a convergent algorithm.
REFERENCES

[1] F. Sup, A. Bohara, and M. Goldfarb, "Design and control of a powered transfemoral prosthesis," Int. J. Robot. Res., vol. 27, no. 2, pp. 263–273, 2008.
[2] T. Lenzi, M. Cempini, L. Hargrove, and T. Kuiken, "Design, development, and testing of a lightweight hybrid robotic knee prosthesis," Int. J. Robot. Res., vol. 37, no. 8, pp. 953–976, 2018.
[3] S. Au, M. Berniker, and H. Herr, "Powered ankle-foot prosthesis to assist level-ground and stair-descent gaits," Neural Netw., vol. 21, no. 4, pp. 654–666, 2008.
[4] H. M. Herr and A. M. Grabowski, "Bionic ankle–foot prosthesis normalizes walking gait for persons with leg amputation," Proc. Biol. Sci., vol. 279, pp. 457–464, 2012.
[5] B. E. Lawson, H. A. Varol, and M. Goldfarb, "Standing stability enhancement with an intelligent powered transfemoral prosthesis," IEEE Trans. Biomed. Eng., vol. 58, no. 9, pp. 2617–2624, 2011.
[6] D. Quintero, D. J. Villarreal, D. J. Lambert, S. Kapp, and R. D. Gregg, "Continuous-phase control of a powered knee–ankle prosthesis: Amputee experiments across speeds and inclines," IEEE Trans. Robot., vol. 34, no. 3, pp. 686–701, 2018.
[7] L. J. Hargrove et al., "Robotic leg control with EMG decoding in an amputee with nerve transfers," N. Engl. J. Med., vol. 369, no. 13, pp. 1237–1242, 2013.
[8] M. Liu, D. Wang, and H. Huang, "Development of an environment-aware locomotion mode recognition system for powered lower limb prostheses," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 24, no. 4, pp. 434–443, 2016.
[9] F. Zhang and H. Huang, "Source selection for real-time user intent recognition toward volitional control of artificial legs," IEEE J. Biomed. Health Inform., vol. 17, no. 5, pp. 907–914, 2013.
[10] S. K. Au, J. Weber, and H. Herr, "Powered ankle–foot prosthesis improves walking metabolic economy," IEEE Trans. Robot., vol. 25, no. 1, pp. 51–66, 2009.
[11] A. H. Shultz, B. E. Lawson, and M. Goldfarb, "Running with a powered knee and ankle prosthesis," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 23, no. 3, pp. 403–412, 2015.
[12] M. Liu, F. Zhang, P. Datseris, and H. Huang, "Improving finite state impedance control of active-transfemoral prosthesis using Dempster-Shafer based state transition rules," J. Intell. Robot. Syst., vol. 76, no. 3, pp. 461–474, 2014.
[13] E. Burdet, R. Osu, D. W. Franklin, T. E. Milner, and M. Kawato, "The central nervous system stabilizes unstable dynamics by learning optimal impedance," Nature, vol. 414, pp. 446–449, 2001.
[14] N. Hogan, "Adaptive control of mechanical impedance by coactivation of antagonist muscles," IEEE Trans. Autom. Control, vol. 29, no. 8, pp. 681–690, 1984.
[15] D. A. Winter, Biomechanics and Motor Control of Human Gait: Normal, Elderly and Pathological. Waterloo, ON, Canada: University of Waterloo Press, 1991.
[16] E. J. Rouse, L. M. Mooney, and H. M. Herr, "Clutchable series-elastic actuator: Implications for prosthetic knee design," Int. J. Robot. Res., vol. 33, no. 13, pp. 1611–1625, 2014.
[17] A. M. Simon et al., "Configuring a powered knee and ankle prosthesis for transfemoral amputees within five specific ambulation modes," PLoS One, vol. 9, pp. 1–10, 2014.
[18] J. Zhang et al., "Human-in-the-loop optimization of exoskeleton assistance during walking," Science, vol. 356, no. 6344, pp. 1280–1284, 2017.
[19] Y. Ding, M. Kim, S. Kuindersma, and C. J. Walsh, "Human-in-the-loop optimization of hip assistance with a soft exosuit during walking," Sci. Robot., vol. 3, no. 15, p. eaar5438, 2018.
[20] H. Huang, D. L. Crouch, M. Liu, G. S. Sawicki, and D. Wang, "A cyber expert system for auto-tuning powered prosthesis impedance control parameters," Ann. Biomed. Eng., vol. 44, no. 5, pp. 1613–1624, 2016.
[21] Y. Wen, J. Si, X. Gao, S. Huang, and H. H. Huang, "A new powered lower limb prosthesis control framework based on adaptive dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 9, pp. 2215–2220, 2017.
[22] Y. Wen, J. Si, A. Brandt, X. Gao, and H. Huang, "Online reinforcement learning control for the personalization of a robotic knee prosthesis," IEEE Trans. Cybern., 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8613842
[23] E. C. Martinez-Villalpando and H. Herr, "Agonist-antagonist active knee prosthesis: A preliminary study in level-ground walking," J. Rehabil. Res. Dev., vol. 46, pp. 361–374, 2009.
[24] J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming. Piscataway, NJ, USA: Wiley Press, 2004.
[25] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser, "Learning synergies between pushing and grasping with self-supervised deep reinforcement learning," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 4238–4245.
[26] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, pp. 3389–3396.
[27] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, "DeepMimic: Example-guided deep reinforcement learning of physics-based character skills," ACM Trans. Graph., vol. 37, no. 4, p. 143, 2018.
[28] J. Hwangbo et al., "Learning agile and dynamic motor skills for legged robots," Sci. Robot., vol. 4, no. 26, p. eaau5872, 2019.
[29] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," J. Mach. Learn. Res., vol. 4, no. 6, pp. 1107–1149, 2003.
[30] L. Buşoniu, R. Babuška, B. de Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL, USA: CRC Press, 2010.
[31] J. Si and Y.-T. Wang, "Online learning control by association and reinforcement," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 264–276, 2001.
[32] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[33] D. P. Bertsekas, "Approximate policy iteration: A survey and some new methods," J. Control Theory Appl., vol. 9, no. 3, pp. 310–335, 2011.
[34] X. Gao, J. Si, Y. Wen, M. Li, and H. Huang, "Knowledge-guided reinforcement learning control for robotic lower limb prosthesis," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), to be published.
[35] M. P. Kadaba, H. K. Ramakrishnan, and M. E. Wootten, "Measurement of lower extremity kinematics during level walking," J. Orthop. Res., vol. 8, pp. 383–392, 1990.
[36] M. Li, X. Gao, Y. Wen, J. Si, and H. Huang, "Offline policy iteration based reinforcement learning controller for online robotic knee prosthesis parameter tuning," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2019, pp. 2831–2837.
[37] D. P. Bertsekas, "Temporal difference methods for general projected equations," IEEE Trans. Autom. Control, vol. 56, no. 9, pp. 2128–2139, 2011.
[38] H. Yu, "Least squares temporal difference methods: An analysis under general conditions," SIAM J. Control Optim., vol. 50, no. 6, pp. 3310–3343, 2012.
[39] R. Escalante and M. Raydan, Alternating Projection Methods. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2011.
[40] D. A. Winter, "Kinematic and kinetic patterns in human gait: Variability and compensating effects," Hum. Mov. Sci., vol. 3, no. 1, pp. 51–76, 1984.
[41] Y. Liu et al., "Representation balancing MDPs for off-policy policy evaluation," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2018, pp. 2649–2658.
[42] A. U. Pehlivan, D. P. Losey, and M. K. O'Malley, "Minimal assist-as-needed controller for upper limb robotic rehabilitation," IEEE Trans. Robot., vol. 32, no. 1, pp. 113–124, 2016.
[43] A. J. Bergquist, J. Clair, O. Lagerquist, C. S. Mang, Y. Okuma, and D. F. Collins, "Neuromuscular electrical stimulation: Implications of the electrically evoked sensory volley," Eur. J. Appl. Physiol., vol. 111, pp. 2409–2426, 2011.
[44] K. G. Vamvoudakis and F. L. Lewis, "Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, pp. 878–888, 2010.
[45] H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, "Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems," Automatica, vol. 50, no. 1, pp. 193–202, 2014.
[46] D. P. Bertsekas,