LESS is More: Rethinking Probabilistic Models of Human Behavior
Andreea Bobu, Dexter R.R. Scobee, Jaime F. Fisac, S. Shankar Sastry, Anca D. Dragan
Andreea Bobu∗, University of California, Berkeley, [email protected]
Dexter R.R. Scobee∗, University of California, Berkeley, [email protected]
Jaime F. Fisac, University of California, Berkeley, [email protected]
S. Shankar Sastry, University of California, Berkeley, [email protected]
Anca D. Dragan, University of California, Berkeley, [email protected]
ABSTRACT
Robots need models of human behavior for both inferring human goals and preferences, and predicting what people will do. A common model is the Boltzmann noisily-rational decision model, which assumes people approximately optimize a reward function and choose trajectories in proportion to their exponentiated reward. While this model has been successful in a variety of robotics domains, its roots lie in econometrics, and in modeling decisions among different discrete options, each with its own utility or reward. In contrast, human trajectories lie in a continuous space, with continuous-valued features that influence the reward function. We propose that it is time to rethink the Boltzmann model, and design it from the ground up to operate over such trajectory spaces. We introduce a model that explicitly accounts for distances between trajectories, rather than only their rewards. Rather than each trajectory affecting the decision independently, similar trajectories now affect the decision together. We start by showing that our model better explains human behavior in a user study. We then analyze the implications this has for robot inference, first in toy environments where we have ground truth and find more accurate inference, and finally for a 7DOF robot arm learning from user demonstrations.
KEYWORDS
human decision modeling, robot inference and prediction
ACM Reference Format:
Andreea Bobu, Dexter R.R. Scobee, Jaime F. Fisac, S. Shankar Sastry, and Anca D. Dragan. 2020. LESS is More: Rethinking Probabilistic Models of Human Behavior. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (HRI '20), March 23–26, 2020, Cambridge, United Kingdom. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3319502.3374811

∗Both authors contributed equally to this research. This research is supported by the Air Force Office of Scientific Research (AFOSR), the NSF grant IIS1734633 (SCHooL), and the NSF grant CNS1545126 (VeHICaL).
Figure 1: (Top) Contrary to Boltzmann, when adding more options to the right, LESS (right) does not drastically reduce the probability of selecting the left option. (Bottom) We test LESS on learning from user demonstrations for a 7DOF arm.
What we do depends on our intent – our goals and our preferences. When robots collaborate with us, they need to be able to observe our behavior and infer our intent from it, so that they can help us achieve it. They also need to anticipate or predict our future behavior given what they have inferred, so that they can seamlessly coordinate their behavior with ours. Both inference and prediction thus require a model of human behavior conditioned on intent.

A very popular such model is Boltzmann rationality [2, 22]. It formalizes intent via a reward function, and models the human as selecting trajectories in proportion to their (exponentiated) reward. Boltzmann rationality has seen great successes in a variety of robotic domains, from mobile robots [9, 12, 18, 21, 27] to autonomous cars [11, 25, 26] to manipulation [4, 6, 10, 16, 17], in both inference [1, 6, 9, 10, 12, 13, 19, 21, 26] and prediction [11, 16–18, 27].

Despite its widespread use, Boltzmann predictions are not always the most natural. At the core of the Boltzmann model is the view that behavior is a choice among available alternatives; the probability of any trajectory thus heavily depends on the available alternatives. This has some unforeseen side-effects. One of the simplest examples is at the top of Figure 1. Imagine first that there are two possible trajectories to a goal, left and right, both equally good. Boltzmann would predict a .5 probability for each. But if we add two more trajectories on the right, nearly identical to the original one, Boltzmann now predicts a .25 probability each, and estimates going left with only .25 probability instead of .5. Should adding a few similar options on the right really change the probability of going left that much?

This example seems artificial – when are we going to have a) a group of similar trajectories, and b) an imbalance in the number of similar trajectories for each option, so that Boltzmann shows this side-effect? Unfortunately, it is quite representative of real-world trajectory spaces. Spaces of trajectories are continuous and bounded, so they naturally contain a continuum of alternatives of varying similarity to each other, just like the right-side trajectories in our example. Further, trajectories will have varying amounts of similarity to the rest of the space: just like our left-side trajectory was dissimilar from the other alternatives, in the real world, trajectories closer to joint limits or that squeeze in between two nearby obstacles will be dissimilar from the rest of the trajectory space.

Unfortunately, the Boltzmann model was not designed to handle such spaces. It has its roots in the Luce axiom of choice from econometrics and mathematical psychology [14, 15], which models decisions among discrete and different options. When we move to trajectory spaces, the options now are all connected to some degree. Our insight is that we need to rethink how to generalize the Luce axiom to trajectory spaces, and account for how similarity in trajectories should influence their probability.
We take a first step towards this goal by introducing an alternative to the Boltzmann model that accounts not just for the reward of each trajectory, but also for the feature-space similarity each trajectory has with all other alternatives. We name our model LESS, as it is Limiting Errors due to Similar Selections. We start by testing that our model does better at predicting human decisions (Section 3), and then move on to analyze its implications for inference. We first conduct experiments in simulation, with ground truth reward functions, to show that we can make more accurate inferences using our model (Section 4). Finally, we test inference on real manipulation tasks with a 7DOF arm, where we learn from user demonstrations (Section 5); though we no longer have ground truth, we show that we can improve the robustness of the inference if we use LESS.
Motivated by human prediction and reward inference for robotics, we seek an improved human behavior model, explicitly designed for trajectory spaces rather than abstract discrete decisions. To develop this theory, we first turn to the literature on human decision making.
One of the preeminent theories of human decision making in mathematical psychology is based on Luce's axiom of choice [14, 15]. In this formulation, we consider a set of options $\mathcal{O}$, and we seek to quantify the likelihood that a human will select any particular option $o \in \mathcal{O}$. The desirability of each option can be modeled by a function $v : \mathcal{O} \to \mathbb{R}^+$, where $v$ produces higher values for more desirable options. As a consequence of Luce's choice axiom, the probability of selecting an option $o$ is given by

$$P(o) = \frac{v(o)}{\sum_{\bar{o} \in \mathcal{O}} v(\bar{o})}. \qquad (1)$$

If we further assume that each option $o$ has some underlying reward $R(o) \in \mathbb{R}$, and we allow desirability to be an exponential function of this reward, then we recover the Luce-Shepard choice rule [20]:

$$P(o) = \frac{e^{R(o)}}{\sum_{\bar{o} \in \mathcal{O}} e^{R(\bar{o})}}. \qquad (2)$$

When the options being chosen by the human are trajectories $\xi \in \Xi$, i.e. sequences of (potentially continuous-valued) actions, we refer to (2) as the Boltzmann model of noisily-rational behavior [2, 22]. The reward $R$ is typically a function of a feature vector $\phi : \Xi \to \mathbb{R}^k$, giving the probability density $p$ over continuous $\Xi$ as

$$p(\xi) = \frac{e^{R(\phi(\xi))}}{\int_{\Xi} e^{R(\phi(\bar{\xi}))} \, d\bar{\xi}}. \qquad (3)$$

Since the introduction of the Luce choice axiom, related works [5, 7] have pointed out its duplicates problem, where inserting a duplicate of any option $o$ into $\mathcal{O}$ has an undue influence on selection probabilities. To address this drawback, various extensions of the Luce model have been proposed which attempt to group together identical or similar options [3, 23]. Further extending these ideas, Gul et al. [7] recently introduced the attribute rule, which reinterprets options as bundles of attributes but maintains Luce's idea that choice is governed by desirability values.

Analogous to [7], let $\mathcal{X}$ be the set of all attributes, let $\mathcal{X}_o \subseteq \mathcal{X}$ be the set of attributes belonging to $o$, and let $\mathcal{X}_{\mathcal{O}} \subseteq \mathcal{X}$ be the set of attributes which belong to at least one option $o \in \mathcal{O}$. Define an attribute value, $w : \mathcal{X} \to \mathbb{R}^+$, that maps attributes to their desirability, and an attribute intensity, $s : \mathcal{X} \times \mathcal{O} \to \mathbb{N}$, that maps pairs of attributes and options to natural numbers, usually 0 or 1, to indicate the degree to which an attribute is expressed. For instance, an attribute could be the property "green" and $s(\text{"green"}, o)$ could return 1 if option $o$, say one of a set of cars, is green, and 0 otherwise. According to the attribute rule, the probability of choosing $o$ is

$$P(o) = \sum_{x \in \mathcal{X}_o} \frac{w(x)}{\sum_{\bar{x} \in \mathcal{X}_{\mathcal{O}}} w(\bar{x})} \cdot \frac{s(x, o)}{\sum_{\bar{o} \in \mathcal{O}} s(x, \bar{o})}, \qquad (4)$$

which describes a process where the human first chooses an attribute $x \in \mathcal{X}_{\mathcal{O}}$ according to a Luce-like rule, then an option $o \in \mathcal{O}$ with that attribute according to another Luce-like rule. Note that (4) reduces to (1) if no pair of options in $\mathcal{O}$ shares any attributes; for example, if each $o$ has a single unique attribute, the first sum in (4) disappears, and the second fraction evaluates to 1. In this work, we want to take advantage of the attribute rule's graceful handling of duplicates while extending its functionality to trajectories with continuous-valued features and not only categorical attributes.

In this paper, we take inspiration from the attribute rule to derive a novel model of human decision making in continuous spaces. Key to our approach is introducing a similarity measure on trajectories. This could be directly in the trajectory space, but more generally it is in feature space, where features could, in one extreme, be the trajectory itself.
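To ground (2) and (4) before adapting them, here is a minimal sketch (in Python with NumPy; the option set, rewards, and attribute assignments are illustrative, not from the paper) of the Luce-Shepard rule and the attribute rule applied to the duplicates problem from Figure 1:

```python
import numpy as np

def luce_shepard(rewards):
    """Luce-Shepard / Boltzmann choice rule (2): P(o) proportional to e^R(o)."""
    exp_r = np.exp(rewards - np.max(rewards))  # shift for numerical stability
    return exp_r / exp_r.sum()

def attribute_rule(w, s):
    """Attribute rule (4): pick an attribute by desirability (Luce-like rule),
    then an option expressing that attribute (another Luce-like rule).
    w: desirability of each attribute, shape (n_attr,)
    s: attribute intensities s(x, o), shape (n_attr, n_opt)."""
    expressed = s.sum(axis=1) > 0              # restrict to attributes in X_O
    w, s = w[expressed], s[expressed]
    p_attr = w / w.sum()                       # first Luce-like choice
    p_opt_given_attr = s / s.sum(axis=1, keepdims=True)  # second choice
    return p_attr @ p_opt_given_attr

# One "left" option vs. three near-duplicates of "right", all equally good:
rewards = np.array([1.0, 1.0, 1.0, 1.0])
print(luce_shepard(rewards))                   # [0.25 0.25 0.25 0.25]: left drops

# Attribute rule with one attribute per distinct behavior:
w = np.exp(np.array([1.0, 1.0]))               # e^R for "left" and "right"
s = np.array([[1, 0, 0, 0],                    # "left" belongs to option 0
              [0, 1, 1, 1]])                   # "right" belongs to options 1-3
print(attribute_rule(w, s))                    # [0.5 0.167 0.167 0.167]: left keeps 0.5
```

The attribute rule leaves the left option at .5 because the three right-side options all express the same attribute and must share its selection probability.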
We first instantiate the attribute rule with features as the attributes, and then soften it to account for feature similarity. Indeed, the Boltzmann rationality model given by (3) already assigns selection probabilities based only on trajectory features, so we look to modify the decision space to depend directly on features as well.
We derive our model by starting from (4) and defining the set of attributes to be $\Phi$, the set of all possible feature vectors. Accordingly, the set of attributes that belong to $\xi$ is a single element $\Phi_\xi = \{\phi(\xi)\}$, and the attributes represented in a set $\Xi' \subseteq \Xi$ are $\Phi_{\Xi'} = \{\phi(\xi') \mid \xi' \in \Xi'\}$. Combining this convention with the reward model (3), the modified attribute rule for trajectories over a finite subset $\Xi_f \subset \Xi$ becomes

$$P(\xi) = \frac{e^{R(\phi(\xi))}}{\sum_{\bar{\phi} \in \Phi_{\Xi_f}} e^{R(\bar{\phi})}} \cdot \frac{s(\phi(\xi), \xi)}{\sum_{\bar{\xi} \in \Xi_f} s(\phi(\xi), \bar{\xi})}. \qquad (5)$$

In the original attribute rule, the attribute intensity $s$ mapped to the natural numbers. A convenient mapping in this context would be to use $s$ as an indicator function, where $s(x, \xi)$ evaluates to 1 only if $x = \phi(\xi)$. With this formulation, if all trajectories have a unique feature vector, then the rightmost term of (5) is identically 1 and we recover the Boltzmann model (3), as applied to a finite sample of trajectories $\Xi_f$. If, on the other hand, multiple trajectories share the exact same feature vector, then they will effectively be considered as a single option, and the selection probability will be distributed equally among them. This effect is desirable: since the features $\phi(\xi)$ capture all the relevant inputs to the reward, trajectories with the same features should be considered practically equivalent.

We suggest that such a notion of attribute intensity is too stringent for continuous spaces, and we redefine $s$ to be a soft similarity metric $s : \Phi \times \Xi \to \mathbb{R}^+$, which should be symmetric ($s(\phi(\xi), \bar{\xi}) = s(\phi(\bar{\xi}), \xi)$) and positive semidefinite ($s(x, \xi) \geq 0$), with $s(\phi(\xi), \xi) = \max_{x \in \Phi, \bar{\xi} \in \Xi} s(x, \bar{\xi})$ for all $\xi \in \Xi$. Using this redefined similarity metric $s$, we extend (5) to be a probability density on the continuous trajectory space $\Xi$, as in (3):

$$p(\xi) = \frac{e^{R(\phi(\xi))} \big/ \int_{\Xi} s(\phi(\xi), \bar{\xi}) \, d\bar{\xi}}{\int_{\Xi} e^{R(\phi(\hat{\xi}))} \big/ \int_{\Xi} s(\phi(\hat{\xi}), \bar{\xi}) \, d\bar{\xi} \, d\hat{\xi}} \;\propto\; \frac{e^{R(\phi(\xi))}}{\int_{\Xi} s(\phi(\xi), \bar{\xi}) \, d\bar{\xi}}, \qquad (6)$$

where $s(\phi(\xi), \xi)$ and the integral over $\Phi_\Xi$ are omitted because they are constant over $\Xi$ and cancel out during normalization.

Under this new formulation, the likelihood of selecting a trajectory is inversely proportional to its feature-space similarity with other trajectories. This de-weighting of trajectories that are similar to others is precisely the effect we seek, and we adopt the probability given by (6) as our LESS model of human decision making.

The main innovation that differentiates our model from previously proposed rules is the use of a similarity metric that reweights trajectory likelihoods based on the presence of other trajectories that are nearby in feature space. We note that the integral of this similarity over trajectories, the denominator of (6), is akin to a measure of trajectory density in feature space. We estimate similarity as a density by selecting our similarity metric as a kernel function and performing Kernel Density Estimation (KDE). There are many choices of kernel functions, each parametrized by some notion of bandwidth. In our experiments, we used a radial basis function, which peaks when $x = \phi(\xi)$, then exponentially decreases the farther away $x$ and $\phi(\xi)$ are from one another in feature space:

$$s(x, \xi) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{k} \exp\left(-\frac{\|x - \phi(\xi)\|^2}{2\sigma^2}\right), \qquad (7)$$

where the bandwidth $\sigma$ is an important parameter that dictates, for a given feature difference between two trajectories, how much that difference affects the ultimate similarity evaluation.
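In practice we work with a finite sample $\Xi_f$, replacing the integrals in (6) with sums over the sample. Below is a minimal sketch of this sampled version, assuming each trajectory is represented by a precomputed feature vector; the array shapes are illustrative:

```python
import numpy as np

def rbf_similarity(phi_a, phi_b, sigma):
    """RBF similarity (7) between feature vectors (broadcasts over leading dims)."""
    k = phi_a.shape[-1]
    sq_dist = np.sum((phi_a - phi_b) ** 2, axis=-1)
    return (1.0 / (sigma * np.sqrt(2 * np.pi))) ** k * np.exp(-sq_dist / (2 * sigma ** 2))

def less_probabilities(features, rewards, sigma):
    """LESS model (6) over a finite trajectory sample.
    features: (N, k) feature vectors phi(xi); rewards: (N,) values R(phi(xi)).
    Each exponentiated reward is divided by the trajectory's summed similarity
    to the whole sample, a KDE-style estimate of feature-space density."""
    sim = rbf_similarity(features[:, None, :], features[None, :, :], sigma)  # (N, N)
    density = sim.sum(axis=1)                  # sampled denominator of (6)
    weights = np.exp(rewards - np.max(rewards)) / density
    return weights / weights.sum()

def boltzmann_probabilities(rewards):
    """Boltzmann model (3) over the same finite sample, for comparison."""
    exp_r = np.exp(rewards - np.max(rewards))
    return exp_r / exp_r.sum()
```

On the Figure 1 example (one isolated left trajectory and three nearly identical right trajectories, all with equal reward), `boltzmann_probabilities` assigns every trajectory roughly .25, while `less_probabilities` keeps the left trajectory near .5.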
A higher $\sigma$ widens the kernel and makes all trajectories look more similar to one another. We find an optimal bandwidth $\sigma^*$ automatically by using a finite set of samples $\Xi_f \subset \Xi$ and maximizing the sum of the log of their summed similarities, which is equivalent to maximizing their likelihood under a probability density estimate produced by KDE:

$$\sigma^* = \arg\max_{\sigma \in \mathbb{R}} \sum_{\xi \in \Xi_f} \log\left(\sum_{\bar{\xi} \in \Xi_f} s(\phi(\xi), \bar{\xi})\right). \qquad (8)$$

Let $\theta \in \Theta$ parametrize the reward function $R$. To predict what the human will do given a belief $b(\theta)$, we marginalize over $\theta$:

$$p(\xi) = \int_{\Theta} b(\theta) \, p(\xi \mid \theta) \, d\theta, \qquad (9)$$

with $p(\xi \mid \theta)$ given by (6). To perform inference over $\theta$ given a human trajectory, we update our belief using Bayesian inference:

$$b'(\theta) = \frac{b(\theta) \, p(\xi \mid \theta)}{\int_{\Theta} b(\bar{\theta}) \, p(\xi \mid \bar{\theta}) \, d\bar{\theta}}. \qquad (10)$$

In practice, calculating the integrals in the denominators of (10) and (6) can be intractable, so we use a discretized set of $\theta$ parameters and finite trajectory sample sets in our experiments. The specific sampling of the trajectory choice space can significantly impact inference, and we explore its implications in Section 5.
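A sketch of both steps, reusing `rbf_similarity` and `less_probabilities` from the previous sketch. Here $\sigma$ is selected by a simple grid search (the candidate grid and the linear-in-features reward are illustrative assumptions); the sketch also drops each sample's self-similarity term, a standard leave-one-out refinement for KDE likelihoods:

```python
import numpy as np

def select_bandwidth(features, candidates=np.logspace(-2, 1, 50)):
    """Grid-search version of (8): maximize the sum over the sample of the
    log of summed similarities (the KDE log-likelihood of the sample)."""
    scores = []
    for sigma in candidates:
        sim = rbf_similarity(features[:, None, :], features[None, :, :], sigma)
        np.fill_diagonal(sim, 0.0)   # leave-one-out: drop self-similarity
        scores.append(np.sum(np.log(sim.sum(axis=1))))
    return candidates[int(np.argmax(scores))]

def bayes_update(belief, thetas, features, demo_idx, sigma):
    """Bayesian belief update (10) for one demonstrated trajectory, given by
    its index into the sampled trajectory set, with p(xi | theta) from (6).
    Assumes, for illustration, a reward linear in features: R = phi . theta."""
    posterior = np.empty_like(belief)
    for i, theta in enumerate(thetas):
        rewards = features @ theta               # R(phi(xi)) for every sample
        p_xi = less_probabilities(features, rewards, sigma)
        posterior[i] = belief[i] * p_xi[demo_idx]
    return posterior / posterior.sum()
```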
We start by testing the hypothesis that LESS is a better model for human decision making than the standard Boltzmann model. We design a browser-based user study in which we ask participants to make behavior decisions, and measure which model best characterizes these decisions. We select a simple navigation task as our domain, where different behaviors correspond to different ways of traversing the grid from start to goal, as shown in Figure 2.
The key difficulty in designing such a study is that both models require access to a ground truth reward function, i.e. user preferences over trajectories. Even though we can provide participants with some criteria – in our case optimizing for path length while avoiding the obstacle – this does not mean our criteria are the only ones they care about. For instance, people might implicitly prefer trajectories that go closer to or further from the obstacle, or that go around the obstacle to the left or right.
Figure 2: The human decision model experiment. (a) Control trial and (b) experimental trial show the trajectories used for the two trials. In (c), distributions for Left and Right: LESS predictions more closely match the observed Left-Right distribution. In (d), distributions within Right: both models miss that users demonstrate a slight preference for R2 (the trajectory which visits the most states in the rightmost column in (b)).
Our design idea is to introduce a control trial in which we gather data about relative preferences among two dissimilar options: left and right. These relative preferences then enable us to make predictions, under each model, about the experimental trial, where we add trajectories similar to the option on the right.

For the control trial, participants saw the grid world shown in Figure 2a with one obstacle in the middle and three trajectories travelling between the start and goal. Two of the trajectories traversed an equal amount of tiles (optimal) and were symmetric along the diagonal of the grid (left and right), and a third trajectory went through the obstacle and visited more tiles than the others (not optimal). We were only interested in which specific optimal trajectory people chose (Left versus Right), and we used the third suboptimal trajectory as an attention test to check if subjects had paid attention to the instructions. We chose the two optimal trajectories to be symmetric and of the same color to reduce possible confounds, such as bias people might have for extraneous features like number of turns, distance from obstacle, color, etc.

For the experimental trial, shown in Figure 2b, we had the same setup as in the control, with the addition of two other optimal trajectories on the right. They had the same color, number of turns, and number of tiles traversed as the original right-side trajectory. In this setup, there were two visible clusters of options: one trajectory on the left, and three clustered on the right, which we denote as the Left and Right groups, respectively.
We manipulated the model used for decision-making in the experimental trial to be Boltzmann vs. LESS. Having access to the ratio λ at which participants chose the left trajectory over the right in the control trial means that regardless of their reward function R(ξ), e^{R(ξ_left)} = λ e^{R(ξ_right)}, according to (3). This enables us to make predictions using both models as a function of λ for the experimental trial, despite not knowing R itself. For these computations, we assumed that all trajectories in the Right group had the same reward, that the rewards of trajectories in the Left and Right groups would be equal to those estimated from the control trial, and (for LESS) that the Left trajectory had density one while the Right trajectories had density three.

Under the Boltzmann model, the addition of two trajectories similar to the one on the right decreases the probability that the trajectory on the left gets chosen. This is most obvious when λ = 1: P(ξ_left) would go from .5 to .25, as there are now 4 good options. On the other hand, LESS accounts for the similarity of the trajectories on the right and keeps P(ξ_left) closer to the control value.

Our measure is the selection proportion of each trajectory in the experimental trial, which enables us to compute agreement between each model and the users' decisions.
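Concretely, the two predictions follow from λ as below (a short sketch reproducing the computation just described; the densities of one and three are the assumptions stated above):

```python
# lam is estimated from the control trial: e^{R(left)} = lam * e^{R(right)}
lam = 0.475 / 0.525        # users chose Left 47.5% of the time in the control

# Boltzmann (3): four distinct options, one Left and three Right
p_left_boltzmann = lam / (lam + 3.0)                        # ~0.23

# LESS (6): exponentiated reward divided by feature-space density
# (Left has density 1; each Right trajectory has density 3)
p_left_less = (lam / 1.0) / (lam / 1.0 + 3 * (1.0 / 3.0))   # ~0.475
```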
We recruited 80 participants (24 female, 56 male, with ages between 18 and 65) from Amazon Mechanical Turk (AMT) using the psiTurk experimental framework [8]. We excluded 3 participants for failing our attention test. All participants were from the United States and had a minimum approval rating of 95%. The treatment trial was assigned between-subjects: participants saw only one of the sets of trajectory options.
H1: For the experimental trial, the Boltzmann proportion prediction is significantly different from the observed proportion.
H2: For the experimental trial, the LESS proportion prediction is equivalent to the observed proportions.
In the control trial, users chose the Left trajectory 47.5% of the time. Figure 2 plots the observed proportions for the experimental trial, along with each model's predictions. The experimental trial resulted in an observed probability of .41 for the Left trajectory, whereas Boltzmann predicts .23 and LESS predicts .475. The models both predict a uniform distribution among the Right trajectories.

We performed a chi-square test of goodness of fit to see if the observed distribution of left vs. right from the experimental group differed from the predicted distributions. In line with our hypotheses, we found a significant difference between the observed values and the Boltzmann prediction (p < .05), but no significant difference between the observed values and the LESS prediction. We additionally tested for equivalence within a margin ϵ (where the distance is computed by taking each distribution to be a vector in [0, 1]^k, where k is the number of trajectories represented by the distribution). We do not have an a priori estimate for which values of ϵ are practically insignificant in this vector space of probability distributions, so we instead invert the test to find the minimum ϵ for which the observed distribution matches the predicted distribution at a significance level of α = .05. We found that the minimum ϵ bound for equivalence at the α = .05 level was 0.22 for the LESS prediction and 0.39 for the Boltzmann prediction.
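The goodness-of-fit portion of this analysis looks roughly as follows (a sketch using SciPy; the observed counts below are placeholders, not the study's raw data, and the ϵ-inversion is only outlined in a comment):

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([31, 45])          # placeholder Left/Right counts (~.41 Left)
n = observed.sum()

predictions = {
    "Boltzmann": np.array([0.23, 0.77]),
    "LESS": np.array([0.475, 0.525]),
}

# Chi-square goodness of fit of the observed counts against each prediction
for name, pred in predictions.items():
    stat, p_value = chisquare(observed, f_exp=pred * n)
    print(f"{name}: chi2 = {stat:.2f}, p = {p_value:.3f}")

# For the equivalence analysis, distributions are compared as vectors in
# [0,1]^k; the minimum epsilon is found by growing epsilon until an
# equivalence test at alpha = .05 passes for the distance below.
def distribution_distance(p, q):
    return np.linalg.norm(p - q)
```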
The results across all trajectories are analogous, albeit slightly weaker because users tended to favor one of the three Right trajectories more than the other two. The chi-square test revealed a significant difference with the Boltzmann predictions (p < .05), but no significant difference between the observations and the LESS prediction. At the α = .05 level, the minimum ϵ bound is 0.29 for LESS, and 0.36 for Boltzmann. Despite LESS' tighter ϵ, neither prediction aligns perfectly with the empirical data in Figure 2d. This discrepancy is likely due to some unmodeled features (e.g. distance from the obstacle), which may influence participants' preferences. However, while unknown features may affect both Boltzmann's and LESS' performance, LESS still corrects Boltzmann's errors from mishandling similarity. We explore the specific effects of feature misspecification further in Section 4.3. Overall, although neither model is a perfect predictor of behavior, we find that LESS is a better fit: Boltzmann is significantly different from the observed, and LESS provides a tighter equivalence bound.

In Section 3, we provided evidence supporting that LESS can more accurately capture human decisions. This has direct implications for how robots predict behavior – increasing the model accuracy by definition increases the robot's prediction accuracy. We now hypothesize that it also has implications for how robots infer human preferences from behavior: namely, that using a higher accuracy model when performing inference leads to more accurate inference.
We first design an experiment to test that if people do act according to the LESS distribution, modeling them as such leads to better inference than modeling them via Boltzmann. To control for potential confounds, we also verify the opposite: if instead people acted according to Boltzmann (which Section 3 does not support), then modeling them as Boltzmann would instead be better for inference.

In this experiment, we created a grid world environment with two objects, where humans have to teach a robot to navigate from a start to a goal and learn preferences for whether to stay close or far from the objects. We simulated hypothetical human demonstrations Ξ_D by sampling trajectories according to LESS and Boltzmann. To do so, we fixed a particular objective θ∗ and a confidence parameter β, and randomly chose trajectories according to probabilities given by either (6), for LESS, or (3), for Boltzmann. We then utilized these trajectories as "human" demonstrations and performed inference using either Boltzmann or LESS as the underlying choice model. Our goal was to analyze how each model's inference quality depends on the sampling model used across a range of objectives θ∗.
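A sketch of the demonstration-sampling step, reusing the model functions from the earlier sketches (the linear reward and the role of β as a reward scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_demonstrations(features, theta_star, beta, sigma, n_demos, model="LESS"):
    """Draw simulated 'human' demonstrations from (6) (LESS) or (3) (Boltzmann),
    assuming for illustration a linear reward R = beta * (phi(xi) . theta*)."""
    rewards = beta * (features @ theta_star)
    if model == "LESS":
        probs = less_probabilities(features, rewards, sigma)
    else:
        probs = boltzmann_probabilities(rewards)
    return rng.choice(len(features), size=n_demos, p=probs)
```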
Figure 3: TruePosterior results for the inference comparison experiment in Section 4.1. (a) TruePosterior metric for the LESS sampling model. (b) TruePosterior metric for the Boltzmann sampling model. Legends indicate which inference method was employed for those results. We found a significant interaction effect between sampling method and inference method, which can be seen in the change of relative performance for LESS and Boltzmann between (a) and (b).
We used a 2-by-2 factorial design. We manipulated the sampling model with two levels, Boltzmann and LESS, as well as the inference model, Boltzmann and LESS.
We tested inference quality across eight different θ∗ values for more variation and insight. We also used 150 random seeds for sampling demonstrations. For a given sampling method, the combination of a θ∗ and a seed determines the demonstration set that the inference will use. Therefore, we generated 1200 demonstration sets for each sampling method.

To analyze each model's inference quality, we employ two objective metrics.

Accuracy of a-posteriori inference: once we obtain a posterior probability induced by the sampled Ξ_D, we verify that the maximum a-posteriori θ_MAP matches the original θ∗. Thus, we define a binary variable that takes value 1 if they match and 0 otherwise: TrueMatch = 𝟙{θ_MAP = θ∗}.

Magnitude of posterior θ∗ probability: this metric provides a softened, continuous indication of inference performance by capturing the posterior probability mass assigned to the correct θ∗: TruePosterior = P(θ∗ | Ξ_D).

Figure 4: Visualizations of Ξ_L and Ξ_B along with the LESS and Boltzmann inferred posteriors over θ. (a) Ξ_L and the resulting inferred posterior: LESS learns the correct θ, whereas Boltzmann under-learns. (b) Ξ_B and the resulting inferred posterior: Boltzmann learns the correct θ, while LESS is split between avoiding both obstacles vs. avoiding the top one but being ambivalent about the bottom one.
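Both metrics are straightforward to compute from a posterior over the discretized Θ (a small sketch; `thetas` is the discrete grid of reward parameters):

```python
import numpy as np

def true_match(posterior, thetas, theta_star):
    """TrueMatch: 1 if the MAP parameter recovers theta*, 0 otherwise."""
    theta_map = thetas[int(np.argmax(posterior))]
    return int(np.array_equal(theta_map, theta_star))

def true_posterior(posterior, thetas, theta_star):
    """TruePosterior: posterior probability mass assigned to the true theta*."""
    idx = next(i for i, th in enumerate(thetas) if np.array_equal(th, theta_star))
    return float(posterior[idx])
```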
H3: When human input is generated using LESS, inference quality is significantly higher with LESS than with Boltzmann.
H4: When human input is generated using Boltzmann, inference quality is significantly higher with Boltzmann than with LESS.
Figure 3 summarizes the results by showing how TruePosterior varies by inference method for each of our sampling methods. To analyze these results, we ran a factorial repeated measures ANOVA. We found a significant interaction effect between the sampling and inference methods (p < .05). The TrueMatch results also revealed a significant interaction between sampling method and inference method (p < .05). Post-hoc comparisons showed that TruePosterior was significantly higher when the inference method matched the sampling method (p < .001 for both), and logistic regressions similarly showed that the probability of TrueMatch = 1 was significantly higher when the methods matched (p < .001 for both).

These results strongly support both H3 and H4, as they reveal that inference performance is superior when the inference method agrees with the sampling method. Given that the experiment in Section 3 suggests that LESS can be a better model of human sampling behavior, these results provide evidence that using LESS-based inference could give better performance when learning from humans.
Based on what we have seen thus far, LESS clearly leads to different robot inferences. In this section we provide some qualitative intuition about what contributes to this difference.
Figure 5: Left: actual feature density (gray), adjusted by LESS (orange). The Ξ_L points (red) are in dense areas, thus Boltzmann inference under-learns. The Ξ_B points are in sparse areas, but two of them are in a slightly more dense area, which makes Boltzmann reduce their relative influence and ignore the θ they suggest. Right: 2D density with Ξ_B, Ξ_L overlaid.

The important change from Boltzmann to LESS is the strength of the inference as a function of the feature density at the demonstrated trajectory. If a demonstrated trajectory lies in a high-density area, i.e. its features are similar to those of many other possible trajectories, Boltzmann inference will under-learn. This is because there are many high-reward alternatives in the normalizer of (3), which lowers the probability of the demonstration. For the analogous reason, if a demonstration lies in a low-density area, Boltzmann inference will over-learn. Because our LESS method weighs each trajectory ξ by the inverse of the density at its location in feature space φ(ξ), the resulting weighted density will be approximately uniform, not allowing the feature density to influence the strength of the inference: the presence of other options with similar features does not skew the probability as much anymore.

To visualize this, we chose two sets of demonstrations from the previous experiment. One set, Ξ_B, comes from one of the ground truth rewards for which Boltzmann performed better (Figure 3a). The other set, Ξ_L, comes from one for which LESS performed better (Figure 3b). Figure 4 shows the sampled trajectories in Ξ_L and Ξ_B, along with the inference for each model. For Ξ_L, LESS confidently identifies the ground truth, whereas Boltzmann's posterior is higher entropy. Figure 5 shows that Ξ_L does fall in a high-density region, which indeed leads to Boltzmann under-learning and finding many alternative explanations.

For Ξ_B, on the other hand, something very interesting happens. Looking at where the samples lie (blue dots in Figure 5), two of them are in relatively high-density areas (call them Ξ_B^dense), whereas the others are in a very sparse region (call them Ξ_B^sparse). Ξ_B^dense are the two with lower φ in Figure 5 (right). They correspond, in Figure 4b, to the two trajectories that go closer to the bottom obstacle. To the LESS inference, which is more agnostic to the feature density, this gives evidence for two hypotheses: Ξ_B^dense support the hypothesis that the robot should stay far from the top obstacle, but be ambivalent about the bottom one, whereas the other trajectories, Ξ_B^sparse, support that the robot should stay far from both obstacles. This is why we see two hypotheses inferred by LESS in Figure 4b. The Boltzmann inference, however, learns much more from the trajectories that lie in the low-density area, essentially ignoring Ξ_B^dense. This is what leads to the very confident inference of only one of the hypotheses. In this case, this happens to be the correct hypothesis. In general though, the opposite could have happened – had the two trajectories that go closer to the obstacle been the ones to lie in a sparse area, Boltzmann would have confidently inferred the wrong objective. In summary, Boltzmann, by being sensitive to feature densities, can under- or over-learn.

LESS uses information from features to compute similarity, even when those features do not affect the reward. For example, if the reward is solely about efficiency, LESS captures that people treat "right-of-the-obstacle" options as similar.
But what if the robot does not have access to these additional features?
We again generate demonstrations using LESS, but we include two additional features: the average x and average y coordinate of the trajectory. The two new features do not influence the trajectories' reward values, but they do influence the similarity metric. To induce a misspecification, the robot performing inference is unaware of these new features. For this experiment, we only manipulate the inference model: LESS vs. Boltzmann.
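A sketch of the two feature maps used to induce the misspecification (the `task_features` stand-in is an assumption for illustration):

```python
import numpy as np

def task_features(traj):
    """Stand-in for the reward-relevant features; here just path length.
    traj: (T, 2) array of (x, y) waypoints."""
    return np.array([np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))])

def human_features(traj):
    """Features the simulated human uses for similarity: the task features
    plus the reward-irrelevant average x and average y coordinates."""
    return np.concatenate([task_features(traj), traj.mean(axis=0)])

def robot_features(traj):
    """The robot's misspecified feature space: task features only."""
    return task_features(traj)
```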
H5: When the robot's feature space is misspecified, inference quality with LESS is still superior to inference quality with Boltzmann for LESS-sampled demonstrations.
For TruePosterior, we performed a one-way repeated measures ANOVA, and as hypothesized, the test revealed that LESS inference was still significantly better than Boltzmann, in spite of the feature misspecification (p < .05). For TrueMatch, a logistic regression revealed that the odds of having TrueMatch = 1 were also significantly higher with LESS (p < .05).

Section 4 teased that Boltzmann inference performance is highly dependent on the structure of the environment, and, more precisely, the feature space density induced by all possible trajectories. However, we demonstrated this on a toy task with simulated human data and ground truth access. We now put the same hypothesis to test in a real world high-dimensional scenario with a 7DoF robotic manipulator and real human demonstrations, where one cannot have access to the full trajectory space, nor the ground truth reward.
Since for such an environment calculating the denominator in (3) exactly is intractable, practitioners typically sample the space of trajectories, obtaining varying subsets.
Figure 6: Results for the laptop task in the robustness analysis experiments. (a) KLAggregate metric (averaged over subjects) for the single inference comparison across sample sizes S: LESS significantly outperforms Boltzmann at low sample sizes, but they converge for the largest sample sizes. (b) log(KLAggregate) metric across S for the batch inference comparison: Boltzmann outperforms LESS at the lowest sample size, but the two methods converge towards zero as sample size increases.

Figure 7: Single-demonstration (blue) inference posteriors for the table task with two different trajectory sets of 100 samples. The distributions reveal that both Boltzmann and LESS produce the same θ_MAP, but there is less variability between the LESS posteriors, leading to lower KLAggregate.
Given the Boltzmann model's high dependency on the feature space density, we speculate that different sample sets would result in vastly varying inference results. In this section, we investigate how LESS can mitigate this effect and help inference robustness. We collect demonstrations from participants for different tasks, and run inference using different sets of trajectories for computing the normalizer.

We used a 2-by-5 factorial design. We manipulated the inference model with two levels, Boltzmann and LESS, as well as the size S of the sampled trajectory sets used for inference, with five levels: 10, 30, 100, 300, and 1000. We sample 10 different trajectory sets of each size.

We tested our hypothesis across three household manipulation tasks where the robot learned to carry a coffee mug from a start position to a goal according to the person's preferences. In the first task, which we dub table, the participants were asked to move the robot arm from start to goal while maintaining the end-effector close to the table, to prevent the mug from breaking in case of a slip. In the second task, dubbed laptop, the participants were instructed to avoid spilling the coffee over a laptop by providing a demonstration that keeps the robot's end-effector away from the electronic device. Lastly, in the third task, dubbed human, we asked the participants to keep the end-effector away from their body, to avoid spilling coffee on their clothes.

In all scenarios, the robot performs inference by reasoning over three features: one feature of interest (distance from the table, distance from the laptop, and distance from the human, respectively), a second feature drawn from that set, and an efficiency feature computed as the sum of squared velocities across the trajectory.
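The efficiency feature, for instance, can be computed directly from the demonstrated waypoints (a sketch assuming uniformly-timed waypoints in joint space):

```python
import numpy as np

def efficiency_feature(waypoints, dt=1.0):
    """Sum of squared velocities across the trajectory.
    waypoints: (T, 7) joint configurations; dt: assumed uniform timestep."""
    velocities = np.diff(waypoints, axis=0) / dt
    return float(np.sum(velocities ** 2))
```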
In total, for each task T, sample size S, inference method M, and user i, we obtained 10 posterior distributions $P^{T,i}_{M,S}(\hat{\theta} \mid \xi^{T,i})$ constituting a set $\mathcal{P}^{T,i}_{M,S}$. Our goal was to test how robust (or consistent) each method's inference result was across the ten different trajectory sets. We used an aggregate Kullback-Leibler divergence as a measure of how much the posterior distributions $P \in \mathcal{P}^{T,i}_{M,S}$ differ from one another:

$$\mathit{KLAggregate} = -\sum_{P \in \mathcal{P}^{T,i}_{M,S}} \sum_{Q \in \mathcal{P}^{T,i}_{M,S}} \sum_{\hat{\theta} \in \Theta} P(\hat{\theta} \mid \xi^{T,i}) \log\left(\frac{Q(\hat{\theta} \mid \xi^{T,i})}{P(\hat{\theta} \mid \xi^{T,i})}\right).$$
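Computed over the ten posteriors obtained for a given task, sample size, method, and user, this is simply a sum of pairwise KL divergences (a minimal sketch; assumes strictly positive posterior entries):

```python
import numpy as np

def kl_aggregate(posteriors):
    """KLAggregate: sum of pairwise KL divergences among posteriors obtained
    from different trajectory sample sets.
    posteriors: (n_sets, n_thetas) array, each row summing to 1."""
    total = 0.0
    for p in posteriors:
        for q in posteriors:
            total -= np.sum(p * np.log(q / p))   # equals adding KL(p || q)
    return total
```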
H6: Performing single inference with LESS across multiple trajectory sets results in higher robustness and, thus, a lower KLAggregate measure than inference with Boltzmann.
We recruited 12 users (3 female, 9 male, aged 18-30) from the campus community to physically interact with a JACO 7DOF robotic arm and provide demonstrations for three tasks. Figure 7 (left) illustrates the demonstrations collected for the table task. Before giving any demonstrations, each person was allowed a period of training with the robot in gravity compensation mode, in order to get accustomed to interacting with the robot.
As seen in Figure 7, given two different trajectory sets, inference with each method can have drastically different outcomes. With LESS (top), we see that the resulting posterior distributions are fairly similar, whereas with Boltzmann inference (bottom), they differ in entropy/confidence.

For each sample task T, we performed a factorial repeated-measures ANOVA. The results for the laptop task are summarized in Figure 6a. As the trend in the figure indicates, we found a significant interaction effect between inference method and sample size (p < .05). Post-hoc comparisons showed that LESS resulted in significantly lower KLAggregate than Boltzmann for S = 10, 30, and 100 (p < .001 for all), but there was no significant difference found for S = 300 or 1000 (p ≈ 1.00 for both). This trend supports our hypothesis that LESS provides more robust single-demonstration inference, and it reveals that the difference in KLAggregate between LESS and Boltzmann disappears with increasing sample size. Results from the table task also support this trend, with a significant main effect of inference method. While the human task did reveal a significant interaction between inference method and sample size (p < .05), it stands apart from the other two: a post-hoc Tukey HSD test only found a difference for sample size 1000 (p < .05).

We repeated the same experiment, except this time we run inference by aggregating all users' demonstrations for a task (batch inference). This would happen in practice if we were interested in teaching the robot about what the average user wants, rather than focusing on customizing the behavior to each user. Here, we found the opposite results, also shown in Figure 6b: LESS has higher divergence (lower robustness). We attribute this to the phenomenon described in Section 4.2. When we had only one demonstration before, Boltzmann was not robust because, depending on the set of samples, the demonstration could fall in low- or high-density regions, thus leading to different Boltzmann inferences for different sets. Now, with 12 demonstrations at once, the chances of one demonstration falling in a low-density area are much higher. As we've seen in Section 4.2, when there are multiple demonstrations, Boltzmann inference will be dominated by those lying in low-density areas. This leads to a more consistent posterior distribution, so long as the low-density demonstrations suggest the same reward function.

We propose a new probabilistic human behavior model and present compelling evidence that it better captures human decision making and attenuates inference errors that arise due to similar selections, increasing accuracy and robustness.

One limitation of our method is its reliance on a pre-specified set of robot features for similarity selection, which makes feature misspecification a possible limitation. Although our experiments in Section 4.3 reveal that LESS still performs better inference than Boltzmann, it is unclear whether this outcome is due to the effect of hypothesis H3 or if our method is truly unaffected by misspecification. Further experiments are needed for complete clarification. Our 12-person aggregate inference results in Section 5 show that LESS can lead to less robust inference. We attributed this outcome to the phenomenon in Section 4.2, but it remains unclear whether this leads to less accurate inference, or whether Boltzmann is actually preferable in situations with enough varied demonstrations. Lastly, the Mechanical Turk study in Section 3, although compelling, illustrates simplistic datasets of human choices. Further studies on human behavior in more realistic settings would be useful, but complicated by lack of access to the "ground truth" reward.

Despite these limitations, Boltzmann rationality has become so fundamental to how robots do inference and prediction, that designing a counterpart for continuous robotics domains is sorely needed. We are excited to have taken a step in this direction.
REFERENCES
[1] N. Aghasadeghi and T. Bretl. 2011. Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 1561–1566. https://doi.org/10.1109/IROS.2011.6094679
[2] Chris Baker, Joshua B. Tenenbaum, and Rebecca R. Saxe. 2007. Goal inference as inverse planning. In Proceedings of the Annual Meeting of the Cognitive Science Society.
[3] Moshe Ben-Akiva. 1973. Structure of Passenger Travel Demand Models. Transportation Research Record 526 (1973).
[4] Andreea Bobu, Andrea Bajcsy, Jaime F. Fisac, and Anca D. Dragan. 2018. Learning under Misspecified Objective Spaces. In CoRL.
[5] Gerard Debreu. 1960. Review of R. D. Luce, Individual Choice Behavior: A Theoretical Analysis. The American Economic Review 50 (1960), 186–188.
[6] Chelsea Finn, Sergey Levine, and Pieter Abbeel. 2016. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (ICML'16). JMLR.org, 49–58. http://dl.acm.org/citation.cfm?id=3045390.3045397
[7] Faruk Gul, Paulo Natenzon, and Wolfgang Pesendorfer. 2014. Random Choice as Behavioral Optimization. Econometrica 82, 5 (2014), 1873–1912.
[8] Todd M. Gureckis, Jay Martin, John McDonnell, Alexander S. Rich, Doug Markant, Anna Coenen, David Halpern, Jessica B. Hamrick, and Patricia Chan. 2016. psiTurk: An open-source framework for conducting replicable behavioral experiments online. Behavior Research Methods 48, 3 (2016), 829–842. https://doi.org/10.3758/s13428-015-0642-8
[9] P. Henry, C. Vollmer, B. Ferris, and D. Fox. 2010. Learning to navigate through crowded environments. In 2010 IEEE International Conference on Robotics and Automation (ICRA). 981–986. https://doi.org/10.1109/ROBOT.2010.5509772
[10] M. Kalakrishnan, P. Pastor, L. Righetti, and S. Schaal. 2013. Learning objective functions for manipulation. In 2013 IEEE International Conference on Robotics and Automation (ICRA). 1331–1336. https://doi.org/10.1109/ICRA.2013.6630743
[11] Kris M. Kitani, Brian D. Ziebart, James Andrew Bagnell, and Martial Hebert. 2012. Activity Forecasting. In Computer Vision – ECCV 2012, Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 201–214.
[12] Henrik Kretzschmar, Markus Spies, Christoph Sprunk, and Wolfram Burgard. 2016. Socially Compliant Mobile Robot Navigation via Inverse Reinforcement Learning. Int. J. Rob. Res. 35, 11 (Sept. 2016), 1289–1307. https://doi.org/10.1177/0278364915619772
[13] Sergey Levine and Vladlen Koltun. 2012. Continuous Inverse Optimal Control with Locally Optimal Examples. In Proceedings of the 29th International Conference on International Conference on Machine Learning (ICML'12). Omnipress, USA, 475–482. http://dl.acm.org/citation.cfm?id=3042573.3042637
[14] R. Duncan Luce. 1977. The choice axiom after twenty years. Journal of Mathematical Psychology 15, 3 (1977), 215–233. https://doi.org/10.1016/0022-2496(77)90032-3
[15] R. Duncan Luce. 1959. Individual Choice Behavior. John Wiley, Oxford, England.
[16] J. Mainprice and D. Berenson. 2013. Human-robot collaborative manipulation planning using early prediction of human motion. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 299–306. https://doi.org/10.1109/IROS.2013.6696368
[17] J. Mainprice, R. Hayne, and D. Berenson. 2015. Predicting human reaching motion in collaborative tasks using Inverse Optimal Control and iterative re-planning. In 2015 IEEE International Conference on Robotics and Automation (ICRA). 885–892. https://doi.org/10.1109/ICRA.2015.7139282
[18] M. Pfeiffer, U. Schwesinger, H. Sommer, E. Galceran, and R. Siegwart. 2016. Predicting actions to act predictably: Cooperative partial motion planning with maximum entropy models. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2096–2101. https://doi.org/10.1109/IROS.2016.7759329
[19] Deepak Ramachandran and Eyal Amir. 2007. Bayesian Inverse Reinforcement Learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2586–2591. http://dl.acm.org/citation.cfm?id=1625275.1625692
[20] Roger N. Shepard. 1957. Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika 22, 4 (1957), 325–345. https://doi.org/10.1007/BF02288967
[21] D. Vasquez, B. Okal, and K. O. Arras. 2014. Inverse Reinforcement Learning algorithms and features for robot navigation in crowds: An experimental comparison. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 1341–1346. https://doi.org/10.1109/IROS.2014.6942731
[22] John Von Neumann and Oskar Morgenstern. 1945. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ.
[23] Peter Vovsha. 1997. Application of Cross-Nested Logit Model to Mode Choice in Tel Aviv, Israel, Metropolitan Area. Transportation Research Record (1997).
[24] Stefan Wellek. 2010. Testing Statistical Hypotheses of Equivalence and Noninferiority. Chapman and Hall/CRC.
[25] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. 2015. Maximum Entropy Deep Inverse Reinforcement Learning. arXiv preprint arXiv:1507.04888.
[26] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3 (AAAI'08). AAAI Press, 1433–1438. http://dl.acm.org/citation.cfm?id=1620270.1620297
[27] Brian D. Ziebart, Nathan Ratliff, Garratt Gallagher, Christoph Mertz, Kevin Peterson, J. Andrew Bagnell, Martial Hebert, Anind K. Dey, and Siddhartha Srinivasa. 2009. Planning-based Prediction for Pedestrians. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).