Learning User-Preferred Mappings for Intuitive Robot Control
Mengxi Li, Dylan P. Losey, Jeannette Bohg, and Dorsa Sadigh

The authors are affiliated with the Computer Science Department, Stanford University, Stanford, CA 94305. e-mail: {mengxili, dlosey, bohg, dorsa}@stanford.edu
Abstract — When humans control drones, cars, and robots, we often have some preconceived notion of how our inputs should make the system behave. Existing approaches to teleoperation typically assume a one-size-fits-all approach, where the designers pre-define a mapping between human inputs and robot actions, and every user must adapt to this mapping over repeated interactions. Instead, we propose a personalized method for learning the human's preferred or preconceived mapping from a few robot queries. Given a robot controller, we identify an alignment model that transforms the human's inputs so that the controller's output matches their expectations. We make this approach data-efficient by recognizing that human mappings have strong priors: we expect the input space to be proportional, reversible, and consistent. Incorporating these priors ensures that the robot learns an intuitive mapping from few examples. We test our learning approach in robot manipulation tasks inspired by assistive settings, where each user has different personal preferences and physical capabilities for teleoperating the robot arm. Our simulated and experimental results suggest that learning the mapping between inputs and robot actions improves objective and subjective performance when compared to manually defined alignments or learned alignments without intuitive priors. The supplementary video showing these user studies can be found at: https://youtu.be/rKHka0_48-Q
I. INTRODUCTION
Humans are extremely good at developing tools and systems. Each of us interacts with these systems every day, from driving a car to work to playing video games at home. Unfortunately, we are not always good at these interactions. Using a joystick to play a kart racing computer game seems relatively easy, and we can do it without thinking much about it. But actually learning how to drive a real-life car does require practice. Even more challenging is flying a helicopter: this is highly demanding, and requires professional training. In these systems the human must adapt to the robot's controller across multiple rounds of interaction. Rather than forcing the human to adapt to the robot, we wonder whether the robot could instead intelligently adapt to our preferences, making control easier and more intuitive.

Fig. 1: The human has in mind a preferred mapping between their joystick inputs and the robot's actions. The robot learns this mapping (i.e., $\theta$) offline from labelled data and intuitive priors, so that the robot's online actions are correctly aligned with the human's intentions.

Alongside rapid advancements in the field of robotics, intuitive control is increasingly in demand, particularly since a wide spectrum of users beyond just engineers now operate robots [1]. User-friendliness and comfort are crucial in many application domains, including surgical and assistive robots [2], [3], mobile robots [4], and even more abstract smart home systems [5]. At the heart of these systems is seamless teleoperation. For teleoperation, there are typically two main categories of control methods.
One is to use control interfaces such as joysticks or sip-and-puff devices [6], [7]. These controllers are lightweight but low-dimensional, which makes control of robots with many degrees of freedom (DoF) challenging. To make up for this limitation when teleoperating robot arms, [8] propose that the controller should switch modes between different DoFs. Other work focuses on capturing high-DoF human body movement with wearable devices, cameras, or sensors [9], and then mapping it to robot actions. While these methods provide more accurate measurements and natural control interfaces, they typically involve larger and more expensive systems that might not be available to everyday users.

Within this work, we are interested in the first category of teleoperation methods. We present a human-centered, adaptive system for robotic manipulation so that human users can smoothly and intuitively teleoperate high-DoF robots with simple, low-DoF controllers. We envision a model that provides a general framework for learning intuitive input mappings, agnostic to both the controller and the dynamics of the underlying robotic system (see Fig. 1). Based on a general architecture of feedforward neural networks, the presented framework can be applied to different robots and controllers for various tasks, with only a few demonstrations needed from the human users. Our key insight is:

By incorporating the intuitive priors that humans expect, including proportionality, reversibility, and consistency, we can quickly learn how humans prefer to control robots.
Our approach first asks the human user to label a set of robot actions with their preferred inputs. For example, the robot moves towards the cup, and then asks what joystick direction should be associated with this action. Based on a small number of these labeled state-action pairs, we learn an alignment model that maps from human inputs to robot inputs such that the resulting robot actions match the human's preferences. Ensuring a small number of questions is critical for real-world implementation: we cannot ask the human to label hundreds of example actions! Our insight, incorporating intuitive priors, enables the robot to generalize across the workspace, leading to offline learning from a limited amount of labeled data.

More precisely, we make the following contributions:

Formalizing Priors for Human Control.
We formalize the properties that humans expect to have when controlling robotic systems, including proportionality, reversibility, and consistency. These priors are inspired by [10], and we here analyze their relative importance when learning the human's preferred input mapping.
Generalizable Data-Efficient Learning.
We build a general framework for learning human control preferences. We develop this data-efficient method by incorporating the intuitive priors that humans expect, so that users only need to provide a few examples of their desired mappings. Our proposed method is a lightweight solution that can generalize across different controllers, dynamics, and tasks.
Evaluating Learned Human Alignment.
We implement our approach both in simulation and on robot manipulation tasks, and collect subjective feedback from users. Our tasks are inspired by assistive robotics, where users living with physical disabilities have different capabilities and preferences. Results in simulation as well as in user studies suggest that our algorithm successfully learns the human-preferred alignment between the low-DoF control interface and high-DoF robot actions. In practice, this learned alignment led to improved control and more seamless interaction.

II. RELATED WORK
Intuitive Control.
When controllers are intuitive, the effects of user inputs match that user's expectations. This property is very important for robotic control, especially in human-robot teleoperation, and many research efforts are devoted to tackling this problem [11]–[13]. Some use wearable devices to capture human motion patterns [14], some leverage augmented reality and gesture-based systems to enable intuitive human-robot teaming [15], and others rely on various sensors to measure and understand humans' haptic feedback [16]. These previous works require additional devices to infer the human intent behind their actions, and then provide a universal solution for intuitive control. Instead of directly enforcing a one-size-fits-all control strategy on human users, we separately query each user for their individual preferences. With no additional hardware requirements, we arrive at a lightweight solution for intuitive control that can generalize across different platforms and tasks.
Assistive Robots.
Intelligent assistive robot systems (e.g., wheelchair-mounted robot arms) are one important setting in which intuitive control is essential for widespread use [17]. Because users are constrained by person-specific capabilities, one-size-fits-all controllers are insufficient. Moreover, the problem of "dimensionality mismatch" arises when humans try to teleoperate these high-DoF robots by manipulating a low-DoF controller, e.g., a 2-axis joystick. Prior works tackle this with a mode switching mechanism [18], [19]; however, this forces users to constantly switch modes when performing complex tasks. Other works mitigate mode-switching with model-based optimization [8] or in a data-driven manner [20], but the underlying mode switching remains discontinuous. An alternative was recently proposed by [21], in which the robot learns a continuous mapping between a low-DoF latent action space and the high-DoF robot action space using autoencoders. While this approach enables continuous control of the high-DoF robot, it still fails to provide an intuitive control mapping to the user.

In this paper, we use assistive robot manipulation as the test domain for our framework. We consider the teleoperation problem from the user's perspective, and attempt to learn a mapping between the user's inputs and the assistive robot's actions that caters to their preferences.

III. FORMALISM FOR MODELING PREFERENCE ALIGNMENT
We formulate a robot manipulation task as a discrete-time Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma, \rho)$. Here $\mathcal{S} \subseteq \mathbb{R}^n$ is the state space, $\mathcal{A} \subseteq \mathbb{R}^m$ is the robot action space, $\mathcal{T}(s, a)$ is the transition function, $R$ is the reward, $\gamma$ is the discount factor, and $\rho$ is the initial state distribution.

A. Problem Statement
Assume humans control a robot using the teleoperation interface $a_t = \phi(z_t, s_t)$, where $t$ denotes the timestep, $s_t \in \mathcal{S}$ is the system state, $z_t \in \mathbb{R}^d$ is the $d$-dimensional system input, and $a_t \in \mathcal{A}$ is the action executed by the robot. We use $h_t \in \mathbb{R}^d$ to denote the human's input. For instance, when the human is using a joystick to teleoperate the robot arm, their two-DoF joystick input corresponds to $h \in \mathbb{R}^2$.

Traditionally, the human's input $h$ is directly used as the system input $z$, so that $z_t = h_t$, and the robot takes action $a_t = \phi(h_t, s_t)$. However, for complex or non-intuitive systems, it may be hard for human users to directly interact with the controller $\phi$. For instance, using a low-DoF joystick to control a high-DoF assistive robot can be difficult, since we do not know how to coordinate the robot's joints. Moreover, different users also have different preferences for controlling their robot, which may or may not match the system controller. Put another way, the pre-defined mapping of the controller $a_t = \phi(h_t, s_t)$ is often quite different from what users want!

Fig. 2: Visualization of the training process of the alignment model $z = f_\theta(h, s)$, parametrized by $\theta$. Here the example task is for the robot to move in a plane, and the human would like the robot's end-effector motion to align with their joystick inputs ($h_1$ and $h_2$). We take snapshots at three different points during training, and plot how the robot actually moves when the human presses up, down, left, and right. Note that this alignment is state dependent. As training progresses, the robot learns the alignment $\theta$, and the robot's motions are gradually and consistently pushed to match the human's preferences.

Our goal is to make the robotic system easier to control: instead of forcing the human to adapt to $\phi$, we want the robot to adapt to the user's preferences (without fundamentally changing the controller). We therefore propose to learn an alignment function $z_t = f_\theta(h_t, s_t)$, parameterized by $\theta$, which maps the human input $h_t \in \mathbb{R}^d$ and the robot state $s_t \in \mathcal{S}$ to the controller input $z_t \in \mathbb{R}^d$. In this way, we construct a new, two-step mapping between the action of the robotic system $a_t$ and the human's input $h_t$:
$$a_t = \phi(z_t, s_t) = \phi(f_\theta(h_t, s_t), s_t) \quad (1)$$
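To make Eq. (1) concrete, here is a minimal sketch of the two-step mapping, assuming `f_theta` (the alignment model) and `phi` (the fixed controller) are already-trained callables; the names are illustrative, not the authors' released code:

```python
import torch

def act(f_theta, phi, h_t: torch.Tensor, s_t: torch.Tensor) -> torch.Tensor:
    """One control step of Eq. (1): a_t = phi(f_theta(h_t, s_t), s_t)."""
    z_t = f_theta(h_t, s_t)  # transform the human input into the controller's input space
    a_t = phi(z_t, s_t)      # the controller itself is left untouched
    return a_t
```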
State Conditioning.

Consider the person in Fig. 1, who is using a 2-axis joystick to control a high-DoF assistive robot arm to reach and pour a cup. The user's preferred way to control the robot is unclear: what does the user mean if they push the joystick forward? When the robot is far from the cup, the user might intend to move the robot towards the cup; but when the robot is holding the cup, pressing forward now indicates that the robot should rotate, and pour the cup into a bowl! This mapping from user input to intended action is not only person dependent, but also state dependent. In practice, this state dependency prevents us from learning a single transformation to uniformly apply across the robot's workspace: we need an intelligent strategy for understanding the human's preferences in different contexts.
B. Background: Latent Action Embeddings
We will leverage the assistive robot controller proposed in [21] as the main test domain for our alignment framework. Here a conditional autoencoder is trained from demonstrations of related tasks, learning the control function $\phi$ in Eq. (1). More formally, $\phi : \mathcal{Z} \times \mathcal{S} \mapsto \mathcal{A}$ is a decoder that recovers a high-DoF robot action $a_t \in \mathcal{A}$ given the system input $z_t \in \mathcal{Z}$ and current context $s_t \in \mathcal{S}$. Overall, $\phi$ is a suitable test domain for our work since it is not immediately clear what the system inputs map to, i.e., there is no prior over how $z_t$ is aligned with $a_t$ at different states. We also point out that this controller captures the state conditioning described above. Since $\phi$ depends on $s$, the robot's action $a$ changes based on the current context: the same input $z$ moves the robot towards the cup when the gripper is empty, and then rotates to pour when holding the cup. This enables easy switching between tasks, like reaching and pouring.
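As a point of reference, the decoder $\phi$ could be sketched as a small state-conditioned network. The architecture below is our assumption for illustration; [21] trains this decoder as half of a conditional autoencoder over task demonstrations:

```python
import torch
import torch.nn as nn

class LatentDecoder(nn.Module):
    """Hypothetical sketch of phi: Z x S -> A, conditioned on the state s."""
    def __init__(self, z_dim=2, s_dim=7, a_dim=7, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + s_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, a_dim),
        )

    def forward(self, z, s):
        # conditioning on s lets the same z mean different actions in
        # different contexts (e.g., reach vs. pour)
        return self.net(torch.cat([z, s], dim=-1))
```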
IV. APPROACH

A. Constructing Alignment Model
Human Alignment.
As described in Sec. III, we seek to obtain a mapping $z_t = f_\theta(h_t, s_t)$ that converts the human input $h_t$ to the controller input $z_t$ at timestep $t$. Using this transformed system input, the controller then executes action $a_t$ on the robot, as in Eq. (1).
Model.

We leverage a function $f_\theta(h_t, s_t)$ with parameters $\theta$ to capture this alignment. Here $s_t$, the current state of the robot, is also passed as part of the input, since the human's alignment could be state-dependent. In this work, we utilize a general Multi-Layer Perceptron (MLP) to represent $f_\theta$, where $\theta$ corresponds to the weights of the network.
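A minimal sketch of this alignment model, matching the 2-hidden-layer MLP reported in our experiments; the layer widths, activations, and dimensions below (a 2-DoF joystick and 7-DoF arm) are assumptions:

```python
import torch
import torch.nn as nn

class AlignmentModel(nn.Module):
    """f_theta(h, s): maps a human input and robot state to a controller input z."""
    def __init__(self, h_dim=2, s_dim=7, z_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(h_dim + s_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, h, s):
        # the state is part of the input, so the learned mapping can
        # change across the workspace
        return self.net(torch.cat([h, s], dim=-1))
```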
Objective Function.

Given the alignment model $f_\theta$, we apply supervised learning to match the output of this model to the true human preferences (as visualized in Fig. 2). Formally, with the action $a_t$ computed by Eq. (1), the robot state $s_{t+1}$ at the next timestep is given by the transition model $s_{t+1} = \mathcal{T}(s_t, a_t)$. We therefore denote the overall mapping between the current human input $h_t$ at state $s_t$ and the next state $s_{t+1}$ using a function $\mathcal{T}_\theta$ with parameters $\theta$:
$$s_{t+1} = \mathcal{T}_\theta(h_t, s_t) = \mathcal{T}(s_t, \phi(f_\theta(h_t, s_t), s_t)) \quad (2)$$
Because we are ultimately interested in how well the robot's action $a_t$ matches the human's preference, we define our objective as the distance $d$ between the robot's actual next state $s_{t+1}$ and the ground truth state $s^*_{t+1}$, which is where the human really intended to go (we query the user for this intended state, as explained in Section IV-C). Taking the expectation, we get the supervised loss $L_{sup}$:
$$L_{sup}(\theta) = \mathbb{E}[d(s^*_{t+1}, s_{t+1})] = \mathbb{E}[d(s^*_{t+1}, \mathcal{T}_\theta(h_t, s_t))] \quad (3)$$
We leverage the Euclidean distance as our metric between states, $d(s_1, s_2) = \|s_1 - s_2\|_2$, although other metrics are also possible. To approximate the expectation in Eq. (3), we use Monte Carlo sampling, stochastically picking $N$ data points from the state distribution:
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} d(s^{*,i}_{t+1}, \mathcal{T}_\theta(h^i_t, s^i_t)) \quad (4)$$
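The Monte Carlo estimate in Eq. (4) reduces to a few lines; here `T_theta` stands for the one-step model of Eq. (2) (alignment model, controller, and transition composed), and the batch layout is an assumption:

```python
import torch

def supervised_loss(T_theta, s_t, h_t, s_next_star):
    """Eq. (4): mean Euclidean distance between intended and reached states."""
    s_next = T_theta(h_t, s_t)                              # predicted next states
    dist = torch.linalg.norm(s_next_star - s_next, dim=-1)  # d(s1, s2) = ||s1 - s2||_2
    return dist.mean()                                      # average over the N samples
```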
B. Ensuring Data-Efficiency with Intuitive Priors

Recall that we are learning the human's preferred mapping with a function approximator. In general, training these models requires a large number of labeled data samples, particularly in complex scenarios where the mapping changes in different states. However, since we need annotations from humans to identify their intended states, it is impractical for the robot to collect a large labeled dataset. Accordingly, to tackle the challenge of insufficient labeled data, we employ a semi-supervised learning method [22]. Our contribution here is to formulate the intuitive priors that humans have over their control mappings as loss terms, which can then be leveraged within this semi-supervised learning.

Formally, for a given robotic arm, we let $s$ be the robot's joint position, and we denote the forward kinematics as $\Psi$. For a human control input $h$ at state $s$, the next state is given by $\mathcal{T}_\theta(h, s)$ as in Eq. (2). The corresponding end-effector pose change is $\Delta x(s) = \Psi(\mathcal{T}_\theta(h, s)) - \Psi(s)$. With these definitions in mind, we argue an intuitive control mechanism should satisfy the following properties (see the sketch after this list):

1) Proportionality – The amount of change in the 3D pose of the robot's end effector should be proportional to the scale of the human input, i.e.,
$$\alpha \cdot |\Psi(\mathcal{T}_\theta(h, s)) - \Psi(s)| = |\Psi(\mathcal{T}_\theta(\alpha \cdot h, s)) - \Psi(s)|$$
where $\alpha \in \mathbb{R}$ is the scaling factor. Proportionality is quite common and intuitive in system design: it indicates that humans expect the system to be linearly interpolatable. We define the proportionality loss $L_{prop}$ as:
$$L_{prop} = \big[\Psi(\mathcal{T}_\theta(\alpha \cdot h, s)) - (\Psi(s) + \alpha \Delta x(s))\big]^2$$
where $\alpha$ is independently sampled from a uniform distribution $\alpha \sim \mathcal{U}(-1, 1)$, since the control input is bounded.

2) Reversibility – If an action $h$ makes the robot transition from state $s_1$ to $s_2$, the opposite action, denoted by the negation of $h$, i.e., $-h$, should make the robot transition from $s_2$ back to $s_1$:
$$s_2 = \mathcal{T}_\theta(h, s_1) \;\rightarrow\; s_1 = \mathcal{T}_\theta(-h, s_2)$$
This property ensures recoverability when users mistakenly operate the system, so that people can undo their mistakes and return to the original state. We define the reversibility loss as:
$$L_{reverse} = \big[\Psi(s) - \Psi(\mathcal{T}_\theta(-h, \mathcal{T}_\theta(h, s)))\big]^2$$
where $\Psi(s)$ is the current 3D pose of the robot end effector, while $\Psi(\mathcal{T}_\theta(-h, \mathcal{T}_\theta(h, s)))$ is the 3D pose of the end effector after executing human input $h$ followed by the opposite input $-h$.

3) Consistency – The same action taken at nearby states should lead to similar amounts of change in the 3D pose of the robot's end effector, i.e., with
$$\Delta x_1 = |\Psi(\mathcal{T}_\theta(h, s_1)) - \Psi(s_1)|, \quad \Delta x_2 = |\Psi(\mathcal{T}_\theta(h, s_2)) - \Psi(s_2)|$$
we require
$$\forall \epsilon > 0, \; \exists \delta > 0, \; \text{s.t.} \; |s_1 - s_2| < \delta \;\rightarrow\; |\Delta x_1 - \Delta x_2| < \epsilon.$$
Consistency encourages the control to be smooth, so that the mapping does not discontinuously change alignment. We define the consistency loss as:
$$L_{con} = \exp(-\gamma \|s_1 - s_2\|^2) \cdot \big(\Delta x(s_1) - \Delta x(s_2)\big)^2$$
We use the weight term $\exp(-\gamma \|s_1 - s_2\|^2)$ to gauge the similarity between states $s_1$ and $s_2$, where $\gamma > 0$ is a hyperparameter controlling the temperature. A large weight is assigned to the state pair $(s_1, s_2)$ if the difference between them is small.

During the training process, we minimize the supervised loss for labeled data combined with the semi-supervised loss for unlabeled data based on these intuitive priors:
$$L = L_{sup} + \lambda_1 L_{prop} + \lambda_2 L_{reverse} + \lambda_3 L_{con} \quad (5)$$
Here $L_{sup}$ is the supervised loss term from Eq. (3), and $\lambda_1, \lambda_2, \lambda_3$ are constant coefficients.
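A sketch of these three prior losses and the combined objective of Eq. (5), evaluated on unlabeled states; `T_theta` is the one-step model of Eq. (2), `Psi` is the forward kinematics, and the shapes, sampling ranges, and default coefficients are assumptions:

```python
import torch

def proportionality_loss(T_theta, Psi, h, s):
    alpha = torch.empty(1).uniform_(-1.0, 1.0)   # alpha ~ U(-1, 1): inputs are bounded
    dx = Psi(T_theta(h, s)) - Psi(s)             # pose change caused by input h
    pred = Psi(T_theta(alpha * h, s))            # pose after the scaled input
    return ((pred - (Psi(s) + alpha * dx)) ** 2).sum()

def reversibility_loss(T_theta, Psi, h, s):
    s_back = T_theta(-h, T_theta(h, s))          # apply h, then try to undo it with -h
    return ((Psi(s) - Psi(s_back)) ** 2).sum()

def consistency_loss(T_theta, Psi, h, s1, s2, gamma=1.0):
    dx1 = Psi(T_theta(h, s1)) - Psi(s1)
    dx2 = Psi(T_theta(h, s2)) - Psi(s2)
    weight = torch.exp(-gamma * ((s1 - s2) ** 2).sum())  # nearby state pairs weigh more
    return weight * ((dx1 - dx2) ** 2).sum()

def total_loss(L_sup, L_prop, L_rev, L_con, lams=(1.0, 1.0, 1.0)):
    # Eq. (5): supervised loss on labeled data plus weighted prior losses
    return L_sup + lams[0] * L_prop + lams[1] * L_rev + lams[2] * L_con
```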
Importantly, incorporating these different loss terms, which are inspired by human priors over controllable spaces [10], enables the robot to generalize from the labeled human data (on which it performs supervised learning) to unlabeled states (on which it can now perform semi-supervised learning)!
C. Data Collection by Querying Users

To train our algorithm, we need to collect labeled data tuples $(s_t, h_t, s^*_{t+1})$ as well as unlabeled data tuples $(s_t, s^*_{t+1})$, where $s_t$ is the robot's current state, $h_t$ is the human's low-dimensional input, and $s^*_{t+1}$ is the intended next robot state corresponding to $h_t$. We emphasize that the robot never needs to detect the human's intended state; rather, the human labels robot actions with their preferred inputs. Our data collection procedure is as follows (a sketch of this procedure appears after the list):
• Step 0: Implement the controller $a = \phi(s, z)$. This step is optional depending on the type of the controller; we want to emphasize that the controller will remain fixed, and will not be altered by our alignment model.
• Step 1: For the task of interest, we sample a valid state $s_t$ from the state distribution and randomly sample a valid controller input $z_t$ at state $s_t$.
• Step 2: At the sampled state $s_t$, we apply system input $z_t$ to the controller $\phi$ in order to get the robot action $a_t = \phi(s_t, z_t)$.
• Step 3: We record the subsequent robot state $s_{t+1}$ after executing the action $a_t$ at the state $s_t$.
• Step 4: Steps 1, 2, and 3 are repeated for $N$ iterations to get the unlabeled dataset $\{(s^i_t, s^{*,i}_{t+1})\}_{i=1}^N$, consisting of $N$ unlabeled (state, next-state) tuples.
• Step 5: We randomly sample $(s_t, s^*_{t+1})$ from the unlabeled dataset collected in Step 4, and then query the human for the corresponding label $h_t$. Here the user provides examples to the robot of what controls they would find intuitive to move from $s_t$ to $s^*_{t+1}$.
• Step 6: Step 5 is repeated for $K$ iterations to get the labeled dataset $\{(s^i_t, h^i_t, s^{*,i}_{t+1})\}_{i=1}^K$, consisting of $K$ labeled tuples.
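An illustrative sketch of Steps 1 through 6; `sample_state`, `sample_input`, `step`, and `ask_human` are hypothetical helpers standing in for the task's state distribution, the simulator, and the user query interface:

```python
import random

def collect_data(sample_state, sample_input, phi, step, ask_human, N, K):
    unlabeled = []
    for _ in range(N):                       # Steps 1-4: roll out the fixed controller
        s_t = sample_state()                 # valid state from the state distribution
        z_t = sample_input(s_t)              # random valid controller input
        a_t = phi(z_t, s_t)                  # Step 2: controller produces the action
        unlabeled.append((s_t, step(s_t, a_t)))  # Step 3: record the next state
    labeled = []
    for s_t, s_next in random.sample(unlabeled, K):  # Steps 5-6: query the user
        h_t = ask_human(s_t, s_next)         # "what input would you give for this motion?"
        labeled.append((s_t, h_t, s_next))
    return labeled, unlabeled
```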
V. SIMULATIONS

To test our proposed alignment algorithm, we conducted simulations on a Franka Emika Panda robot arm for three manipulation tasks of increasing complexity. We will first discuss the details that are consistent across all experiments, and then elaborate on each task respectively.

Setup. We used a simulated Panda robot arm. The state $s \in \mathcal{S}$ denotes the robot's joint position, and the action $a \in \mathcal{A}$ refers to the robot's joint velocity. The system transition function is $s' = s + a \cdot dt$, where $dt$ is the step size. We trained a conditional autoencoder [21] for each task by collecting trajectory demonstrations on the simulated robot. The learned decoder maps the low-dimensional system input $z \in \mathbb{R}^2$ to the high-dimensional robot action $a \in \mathbb{R}^7$; this decoder is used as the controller $\phi$ in our experiments. The alignment model $z = f_\theta(h, s)$ aims to identify the correspondence between control inputs $z$ and human inputs $h$ at each state $s$. Applying a simulated virtual human $\mathcal{H}_{sim}$ that is separately defined for each task, we generated data for training and testing using the procedure described in Section IV-C.

Tasks. We considered three tasks with increasing levels of complexity; these tasks roughly correspond to our user study tasks shown in Fig. 4. Across all tasks the simulated user was given a 2-DoF joystick, such that $h \in \mathbb{R}^2$.
1) Plane: The simulated robot arm moves its end-effector in a 2-dimensional horizontal plane. In this task, the simulated human's preference is to use one dimension of the joystick for controlling the movement along the x axis and the other dimension for controlling the movement along the y axis (also see Fig. 2).
2) Pour: The simulated robot arm moves and rotates its end-effector along the vertical axis. Here the simulated human's preference is to use one dimension of the joystick for controlling the position and the other dimension for controlling the rotation of the end-effector.
3) Reach & Pour: The simulated robot arm reaches for a bottle and then lifts and rotates the bottle to pour it into a bowl. In this task, the simulated human's preference is divided into two parts, based on the previous tasks. For reaching the bottle in the 2D plane, the preference is defined as in the task Plane. For pouring, the preference for joystick control is the same as in the task Pour.

Data Efficiency. The simulated human provides 10 labeled data samples $\{(s^i_t, h^i_t, s^{*,i}_{t+1})\}_{i=1}^{10}$, and the robot collects 1000 unlabeled data samples $\{(s^j_t, s^{*,j}_{t+1})\}_{j=1}^{1000}$ for the tasks Plane and Pour. For the sequential task Reach & Pour, we have 20 labeled samples and 2000 unlabeled samples.

Model Details. The teleoperation controller in our experiments follows the structure in [21], and the human alignment model is constructed as in Section IV-A. Our human alignment model is a Multi-Layer Perceptron (MLP) network with 2 hidden layers. For supervised training, the loss function we adopt is $L = L_{trans} + \lambda L_{rot}$, where $L_{trans}$ is the $L_2$ loss between the predicted and ground truth position, and $L_{rot}$ is the $L_2$ loss defined on the quaternion representation between the predicted and actual rotation. In the Plane task, $\lambda = 0$ because no rotation is involved; for the other two tasks, $\lambda = 1$. During semi-supervised training, the loss function is a combination of supervised loss terms and semi-supervised loss terms as described in Eq. (5).

Independent Variables. Within each simulated task, we varied (a) the type of alignment model and (b) the noise level of the simulated human oracle $\mathcal{H}_{sim}$. We compared against two baselines that did not learn an alignment: no align, where $z = h$, and manual align, where we applied an affine transformation that best matched the data across all states. To understand which priors are useful, we also performed an ablation study where the robot learned the mapping with one prior at a time. Finally, we included a no priors baseline, which only leveraged supervised training. To test the model's robustness to noisy human labels, where the human incorrectly matches $h$ to $a$, we set the coefficient of variation $\sigma/\mu \in \{0, 0.1, 0.5\}$ for the simulated human.

Dependent Measures. For both our approach and the baselines, we measured position and rotation errors. For positions, we computed the relative distance error $E_d = \frac{\|x^*_{t+1} - x_{t+1}\|}{\|x^*_{t+1} - x_t\|}$, where $x_t$ is the current end-effector position before movement, $x^*_{t+1} = \Psi(s^*_{t+1})$ is the desired end-effector position corresponding to the intended next state $s^*_{t+1}$, and $x_{t+1}$ is the actual end-effector position the robot reaches after executing action $a_t = \phi(s_t, f_\theta(h_t, s_t))$. We also measured the distance between rotations, $E_r = 2\arccos(|\langle q_{t+1}, q^*_{t+1}\rangle|)$, where $q_{t+1}$ and $q^*_{t+1}$ are the quaternion representations of the predicted and ground truth orientation, respectively. For each experimental setting, we report the mean and standard deviation of these performance metrics over 10 total runs (a sketch of the two error metrics appears below).
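A minimal sketch of these two error metrics, assuming NumPy arrays and unit quaternions:

```python
import numpy as np

def position_error(x_t, x_next, x_next_star):
    """Relative distance error E_d, normalized by the intended displacement."""
    return np.linalg.norm(x_next_star - x_next) / np.linalg.norm(x_next_star - x_t)

def rotation_error(q_next, q_next_star):
    """E_r = 2 arccos(|<q, q*>|): geodesic distance, invariant to q vs. -q."""
    inner = np.clip(abs(np.dot(q_next, q_next_star)), 0.0, 1.0)
    return 2.0 * np.arccos(inner)
```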
Hypotheses. We have the following three hypotheses:

H1. With abundant labeled data, the alignment model will accurately learn the human's preferences.

H2. Compared to the fully-supervised baseline, our semi-supervised alignment models that leverage intuitive priors will achieve similar performance with far less human data.

H3. Semi-supervised training with proportional, reversible, and consistent priors will outperform alignment models trained with only one of these priors.

Results & Analysis. The simulations demonstrate that our alignment method successfully learned the human's preference with a limited amount of labeled data. As shown in Fig. 3, each of the proposed alignment models significantly outperformed the one-size-fits-all baselines.

Fig. 3: Quantitative results from the simulation experiments. We tested three different tasks with increasing complexity, and here we display the results of the easiest (Plane) and the hardest (Reach + Pour). Alignment Error refers to a weighted sum of the relative positional and rotational error in end-effector space. To explore the robustness of our method, we additionally varied how noisy the human was when providing labels. Across different tasks and levels of human noise, including all priors consistently outperformed the other methods, and almost matched an ideal alignment learned from abundant data.

For models that do not leverage data or learning, i.e., no align and manual align, the error is significantly higher than the learning alternatives. With abundant data and noise-free human annotations, ideal align provided the best-case performance, indicating that our parametrization of the alignment model is capable of capturing the human's control preferences.

Of course, in practice the amount of human feedback is limited. We therefore focus on models that learned from only a small number of labeled datapoints (10–20 examples). Here our proposed priors were critical: semi-supervised models that included at least one intuitive prior performed twice as well as no priors, the supervised baseline. Across different noise levels, our model that leverages all priors consistently demonstrated the lowest mean error and standard deviation. This was particularly noticeable when the human oracle was noisy, suggesting that the three priors are indeed complementary, and that including all of them together brings a performance boost!

Comparing the easier Plane task to the more complex Reach & Pour task, we also saw that using priors became more important as the task got harder. This suggests that, in complex scenarios, simply relying on a few labeled examples may lead to severe overfitting. Our intuitive priors effectively mitigate this problem.

Summary. Viewed together, the results of our simulations strongly support hypotheses H1, H2, and H3. Our proposed alignment model successfully learned the mapping between human actions and the system input space (H1). In settings with limited labels, our proposed alignment model with intuitive control priors reached results that almost match supervised training with abundant data (H2). Finally, our ablation studies showed that combining all three proposed priors leads to superior performance and greater training stability than training with a single prior (H3).

VI. USER STUDIES

Within this section we present the results of a user study evaluating our framework across three robot manipulation tasks based on assistive settings.
Participants teleoperated a 7-DoF robotic arm (Panda, Franka Emika) through a handheld 2-axis joystick controller. As in our simulations, we use [21] to learn a decoder that enables low-DoF control over the high-DoF robot arm. The robot learned the user's individual preferences for how this controller should behave by asking for a limited number of examples and then generalizing with our intuitive priors.

Tasks. Similar to our simulation experiments, we considered three different tasks. These tasks are visualized in Fig. 4.
1) Avoid: The Panda robot arm moves its end-effector within a 2-dimensional horizontal plane. Users are asked to guide the robot around an obstacle without colliding with it. The task ended once users completed one clockwise rotation followed by one counter-clockwise rotation.
2) Pour: The Panda robot arm is holding a cup, and users want to pour this cup into two bowls. Users are asked to first pour into the farther bowl, before moving the cup back to the start and pouring into the closer bowl.
3) Reach & Pour: This is the most complex task. Users start by guiding the robot towards a cup and then pick it up. Once the users reach and grasp the cup, they are asked to take the cup to a target bowl, and finally pour into it.

Independent and Dependent Variables. For each of the tasks described above, we compared three different alignment models: making no adjustments to the original controller (no align), an alignment model trained just using the human's labeled data (no priors), and an alignment model trained using our proposed semi-supervised approach (all priors). Our semi-supervised learning model should generalize the human's preferences by enforcing intuitive priors such as proportionality, reversibility, and consistency.

To evaluate the effectiveness of these different alignment strategies, we recorded quantitative measures including task completion time and trajectory length. We also calculated the percentage of the time that users undo their actions by significantly changing the joystick direction; undoing suggests that the alignment is not quite right, and that the human is still adapting to the robot's control strategy (one plausible implementation of this measure is sketched below). Besides these objective measures, we also collected subjective feedback from the participants through 7-point Likert scale surveys.

Hypothesis. An alignment model learned from user-specific feedback and generalized through intuitive priors makes it easier for humans to control the robot and perform assistive manipulation tasks.

Experimental Procedures. We recruited 10 volunteers that provided informed written consent (3 female, ages . ± . ). Participants used a 2-axis joystick to teleoperate the 7-DoF robot arm, and completed three manipulation tasks inspired by assistive settings. At the start of each task, we showed the user a set of robot movements and asked them to provide their preferred input on the joystick, i.e., "if you wanted the robot to perform the movement you just saw, what joystick input would you provide?" Users answered a set of queries for task Avoid, for task Pour, and for task Reach & Pour. After the queries finished, the users performed the tasks sequentially using each of the alignment strategies. The order of alignment strategies was counterbalanced.
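One plausible implementation of the undo measure mentioned above: count the fraction of consecutive joystick inputs whose direction reverses sharply. The 90-degree threshold and the dead-zone are our assumptions; the paper does not report its exact criterion:

```python
import numpy as np

def undo_percentage(inputs: np.ndarray, cos_thresh: float = 0.0) -> float:
    """inputs: (T, 2) array of joystick samples in [-1, +1] along each axis."""
    h_prev, h_next = inputs[:-1], inputs[1:]
    norms = np.linalg.norm(h_prev, axis=1) * np.linalg.norm(h_next, axis=1)
    valid = norms > 1e-6                     # ignore near-zero (released) inputs
    if not valid.any():
        return 0.0
    cos = (h_prev * h_next).sum(axis=1)[valid] / norms[valid]
    return float((cos < cos_thresh).mean())  # direction changed by more than 90 degrees
```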
Fig. 4: Example end-effector trajectories for the Avoid, Pour, and Reach + Pour tasks during our user studies. Participants teleoperated the 7-DoF Panda robot arm without any alignment model (no align), with an alignment model trained only on their supervised feedback (no priors), and with our proposed method, where the robot generalizes the human's feedback using intuitive priors (all priors). For both baselines, we can see examples of the human getting confused, counteracting themselves, or failing to complete the task.

Fig. 5: Heatmaps of the participants' joystick inputs during task Avoid. For no align in the upper right, people primarily used the cardinal directions. For no priors in the bottom left, the joystick inputs were not clearly separated, and no clear pattern was established. For our all priors model on the bottom right, however, we observed that the human inputs were evenly distributed. This indicates that the users smoothly completed the task by continuously manipulating the joystick in the range [−1, +1] along both axes.

Results & Analysis. The objective results of our user study are summarized in Fig. 6. Across tasks and metrics, our model with all priors outperforms the two baselines. In addition, our model not only has the best average performance, but also demonstrates the least variance. Similar to our results in simulation, when the task is difficult (i.e., Reach & Pour), the performance of the alignment model trained with only the supervised loss and no priors drops significantly compared to simpler tasks, reinforcing the importance of priors in learning complex alignment models!

Fig. 6: Objective results from our user study. Left: average time taken to complete each task. Middle: average trajectory length as measured in end-effector space. Right: percentage of the time people spent undoing their actions. Error bars show the standard deviation across participants, and colors match Fig. 4. Asterisks denote statistically significant pairwise comparisons between the two marked strategies ($p < .05$).

We also illustrate our survey responses in Fig. 7. Across the board, we found that users exhibited a clear preference for our proposed method. Specifically, they perceived all priors as resulting in better alignment and more natural, accurate, and effortless control, and they would elect to use it again. These subjective results highlight the importance of personalization when controlling high-DoF systems; we contrast these results with [21], where participants perceived the unaligned controller as confusing and unintuitive.

Fig. 7: Results from our 7-point Likert-scale survey. Higher ratings indicate agreement. Users thought that our learned model with intuitive priors aligned with their preferences, was easy to control, and improved efficiency; plus, they would choose to use it again! Pairwise comparisons between our approach and the baselines are statistically significant across the board.

To better visualize the user experiences, we also display example robot end-effector trajectories from one of the participants in Fig. 4. Here we observe that the trajectories of our model (in orange) are smooth and do not detour during the tasks, while the trajectories for no align (in grey) and no priors (in blue) have many movements that counteract themselves, indicating that this user was struggling to understand and align with the control strategy. In the worst case, participants were unable to complete the task with the no priors model (see the Avoid task in Fig. 4) because no joystick inputs mapped to their intended direction, effectively causing them to get stuck at undesirable states.

To further validate that our model is learning the human's preferences, we illustrate heatmaps over user inputs for the Avoid task in Fig. 5. Recall that this task requires moving the robot around an obstacle. Without the correct alignment, users default to the four cardinal directions (no align), or warped circle-like motions (no priors). By contrast, under all priors the users smoothly moved the joystick around an even distribution, taking full advantage of the joystick's [−1, +1] workspace along both axes.

Summary. Taken together, these objective measurements as well as subjective results empirically support our hypothesis. Using the alignment model learned with our intuitive priors, users were able to complete the manipulation tasks more efficiently, and in a way that matches their personal preferences.
VII. DISCUSSION

Summary. We developed a framework for learning personalized mappings between human inputs and robot actions. Since no two humans are the same, we doubt that any one-size-fits-all approach will work well with everyday users. But asking humans about their preferences, and getting their feedback in every situation, is prohibitively time consuming. We therefore proposed a semi-supervised approach, where the robot has access to a few examples of the human's desired mapping, and must generalize that mapping across unlabeled data. We achieve this generalization by recognizing that humans have strong priors over how controllers should operate: we expect the input space to be proportional, reversible, and consistent. By incorporating these priors, our proposed approach learned the human's preferences from a few queries. Importantly, this proposed method does not affect the system controller itself, is lightweight and agnostic to the underlying robot dynamics, and does not require any additional hardware or software for intent recognition.

Limitations and Future Work. The robot currently queries the human at random states. Future work will incorporate active learning, so that the robot intelligently selects informative states at which to ask for human preferences.

ACKNOWLEDGEMENTS

We acknowledge funding from a Fanuc Fellowship, a Qualcomm Innovation Fellowship, and an NSF Award.

REFERENCES

[1] J. Arata, "Intuitive control in robotic manipulation," in Human Inspired Dexterity in Robotic Manipulation. Elsevier, 2018, pp. 53–60.
[2] B. D. Argall, "Autonomy in rehabilitation robotics: An intersection," Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 441–463, 2018.
[3] D. W. Robinson, T. R. Nixon, M. Hanuschik, R. P. Goldberg, J. Hemphill, D. Q. Larkin, and P. Millman, "Adaptable integrated energy control system for electrosurgical tools in robotic surgical systems," Jun. 28 2016, US Patent 9,375,288.
[4] C. Melidis, H. Iizuka, and D. Marocco, "Intuitive control of mobile robots: an architecture for autonomous adaptive dynamic behaviour integration," Cognitive Processing, vol. 19, no. 2, pp. 245–264, 2018.
[5] X.-M. Huang and C.-R. Zhang, "Over-the-air manipulation: An intuitive control system for smart home," IEEE, 2017, pp. 18–21.
[6] R. Boboc, H. Moga, and D. Talaba, "A review of current applications in teleoperation of mobile robots," Bulletin of the Transilvania University of Brasov. Engineering Sciences. Series I, vol. 5, no. 2, p. 9, 2012.
[7] S. Javdani, H. Admoni, S. Pellegrinelli, S. S. Srinivasa, and J. A. Bagnell, "Shared autonomy via hindsight optimization for teleoperation and teaming," The International Journal of Robotics Research, vol. 37, no. 7, pp. 717–742, 2018.
[8] L. V. Herlant, R. M. Holladay, and S. S. Srinivasa, "Assistive teleoperation of robot arms via automatic time-optimal mode switching," in The Eleventh ACM/IEEE International Conference on Human Robot Interaction. IEEE Press, 2016, pp. 35–42.
[9] J. Rebelo, T. Sednaoui, E. B. den Exter, T. Krueger, and A. Schiele, "Bilateral robot teleoperation: A wearable arm exoskeleton featuring an intuitive user interface," IEEE Robotics & Automation Magazine, vol. 21, no. 4, pp. 62–69, 2014.
[10] R. Jonschkowski and O. Brock, "State representation learning in robotics: Using prior knowledge about physical interaction," in Robotics: Science and Systems, 2014.
[11] V. Duchaine and C. Gosselin, "Safe, stable and intuitive control for physical human-robot interaction," IEEE, 2009, pp. 3383–3388.
[12] N. Guenard, T. Hamel, and V. Moreau, "Dynamic modeling and intuitive control strategy for an 'X4-flyer'," vol. 1. IEEE, 2005, pp. 141–146.
[13] M. T. Ciocarlie and P. K. Allen, "Hand posture subspaces for dexterous robotic grasping," The International Journal of Robotics Research, vol. 28, no. 7, pp. 851–867, 2009.
[14] R. Ribeiro, J. Ramos, D. Safadinho, and A. M. de Jesus Pereira, "UAV for everyone: An intuitive control alternative for drone racing competitions," IEEE, 2018, pp. 1–8.
[15] J. M. Gregory, C. Reardon, K. Lee, G. White, K. Ng, and C. Sims, "Enabling intuitive human-robot teaming using augmented reality and gesture control," arXiv preprint arXiv:1909.06415, 2019.
[16] I. Hussain, L. Meli, C. Pacchierotti, G. Salvietti, and D. Prattichizzo, "Vibrotactile haptic feedback for intuitive control of robotic extra fingers," in World Haptics, 2015, pp. 394–399.
[17] K. Yamazaki, Y. Watanabe, K. Nagahama, K. Okada, and M. Inaba, "Recognition and manipulation integration for a daily assistive robot working on kitchen environments," IEEE, 2010, pp. 196–201.
[18] M. Nuttin, D. Vanhooydonck, E. Demeester, and H. Van Brussel, "Selection of suitable human-robot interaction techniques for intelligent wheelchairs," in Proceedings, 11th IEEE International Workshop on Robot and Human Interactive Communication. IEEE, 2002, pp. 146–151.
[19] K. Tsui, H. Yanco, D. Kontak, and L. Beliveau, "Development and evaluation of a flexible interface for a wheelchair mounted robotic arm," in Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction. ACM, 2008, pp. 105–112.
[20] S. Jain and B. Argall, "Robot learning to switch control modes for assistive teleoperation," in RSS 2016 Workshop on Planning for Human-Robot Interaction: Shared Autonomy and Collaborative Robotics, 2016.
[21] D. P. Losey, K. Srinivasan, A. Mandlekar, A. Garg, and D. Sadigh, "Controlling assistive robots with learned latent actions," arXiv preprint arXiv:1909.09674, 2019.
[22] O. Chapelle, B. Schölkopf, and A. Zien, Eds., Semi-Supervised Learning. MIT Press, 2006.