Choice Set Misspecification in Reward Inference
Presented at the IJCAI-PRICAI 2020 Workshop on Artificial Intelligence Safety
Rachel Freedman∗
University of California, Berkeley

Rohin Shah
DeepMind

Anca Dragan
University of California, Berkeley

∗ [email protected]. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Specifying reward functions for robots that operate in environments without a natural reward signal can be challenging, and incorrectly specified rewards can incentivise degenerate or dangerous behavior. A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, like demonstrations or corrections. To interpret this feedback, robots treat as approximately optimal a choice the person makes from a choice set, like the set of possible trajectories they could have demonstrated or possible corrections they could have made. In this work, we introduce the idea that the choice set itself might be difficult to specify, and analyze choice set misspecification: what happens as the robot makes incorrect assumptions about the set of choices from which the human selects their feedback. We propose a classification of different kinds of choice set misspecification, and show that these different classes lead to meaningful differences in the inferred reward and resulting performance. While we would normally expect misspecification to hurt, we find that certain kinds of misspecification are neither helpful nor harmful (in expectation). However, in other situations, misspecification can be extremely harmful, leading the robot to believe the opposite of what it should believe. We hope our results will allow for better prediction and response to the effects of misspecification in real-world reward inference.
1 Introduction

Specifying reward functions for robots that operate in environments without a natural reward signal can be challenging, and incorrectly specified rewards can incentivise degenerate or dangerous behavior [14, 13]. A promising alternative to manually specifying reward functions is to design techniques that allow robots to infer them from observing and interacting with humans.

These techniques typically model humans as optimal or noisily optimal. Unfortunately, humans tend to deviate from optimality in systematically biased ways [12, 5]. Recent work improves upon these models by modeling pedagogy [9], strategic behavior [23], risk aversion [15], hyperbolic discounting [7], or indifference between similar options [4]. However, given the complexity of human behavior, our human models will likely always be at least somewhat misspecified [22].

One way to formally characterize misspecification is as a misalignment between the real human and the robot's assumptions about the human. Recent work in this vein has examined incorrect assumptions about the human's hypothesis space of rewards [3], their dynamics model of the world [19], and their level of pedagogic behavior [16]. In this work, we identify another potential source of misalignment: what if the robot is wrong about what feedback the human could have given?
Consider the situation illustrated in Figure 1, in which the robot observes the human going grocery shopping. While the grocery store contains two packages of peanuts, the human only notices the more expensive version with flashy packaging, and so buys that one. If the robot doesn't realize that the human was effectively unable to evaluate the cheaper package on its merits, it will learn that the human values flashy packaging.

We formalize this in the recent framework of reward-rational implicit choice (RRiC) [11] as misspecification in the human choice set, which specifies what feedback the human could have given. Our core contribution is to categorize choice set misspecification into several formally and empirically distinguishable "classes", and find that different types have significantly different effects on performance. As we might expect, misspecification is usually harmful; in the most extreme case the choice set is so misspecified that the robot believes the human feedback was the worst possible feedback for the true reward, and so updates strongly towards the opposite of the true reward. Surprisingly, we find that under other circumstances misspecification is provably neutral: it neither helps nor hurts performance in expectation. Crucially, these results suggest that not all misspecification is equivalently harmful to reward inference: we may be able to minimize negative impact by systematically erring toward particular misspecification classes defined in this work. Future work will explore this possibility.

Figure 1: Example choice set misspecification: The human chooses a pack of peanuts at the supermarket. They only notice the expensive one because it has flashy packaging, so that's the one they buy. However, the robot incorrectly assumes that the human can see both the expensive flashy one and the cheap one with dull packaging but extra peanuts. As a result, the robot incorrectly infers that the human likes flashy packaging, paying more, and getting fewer peanuts.

2 Background: Reward-Rational Implicit Choice

There are many ways that a human can provide feedback to a robot: demonstrations [18, 1, 24], comparisons [20, 6], natural language [8], corrections [2], the state of the world [21], proxy rewards [10, 17], etc. Jeon et al. [11] propose a unifying formalism for reward inference, called reward-rational (implicit) choice (RRiC), that captures all of these possible feedback modalities. Rather than study each feedback modality separately, we study misspecification in this general framework.

RRiC consists of two main components: the human's choice set, which corresponds to what the human could have done, and the grounding function, which converts choices into (distributions over) trajectories so that rewards can be computed. For example, in the case of learning from comparisons, the human chooses which of two trajectories is better; the human's choice set is simply the set of trajectories they are comparing, and the grounding function is the identity. A more complex example is learning from the state of the world, in which the robot is deployed in an environment where a human has already acted for T timesteps, and must infer the human's preferences from the current world state. In this case, the robot can interpret the human as choosing between different possible states: the choice set is the set of possible states that the human could reach in T timesteps, and the grounding function maps each such state to the set of trajectories that could have produced it.
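To make these two components concrete, here is one minimal way to represent them in code. This is our own illustrative sketch: the type aliases, the function names, and the uniform weighting in `state_of_world_grounding` are assumptions, not constructs defined by RRiC.

```python
from typing import Callable, Dict, List

# Trajectories and choices are abstracted as opaque identifiers here.
Trajectory = str
Choice = str
# A grounding function maps each choice to a distribution over trajectories,
# represented as {trajectory: probability}.
Grounding = Callable[[Choice], Dict[Trajectory, float]]

def identity_grounding(choice: Choice) -> Dict[Trajectory, float]:
    """For demonstrations and comparisons, the choice *is* a trajectory."""
    return {choice: 1.0}

def state_of_world_grounding(state: Choice,
                             producers: Dict[Choice, List[Trajectory]]
                             ) -> Dict[Trajectory, float]:
    """For state-of-the-world feedback, a chosen state grounds out to the
    trajectories that could have produced it (weighted uniformly here,
    purely for illustration)."""
    trajectories = producers[state]
    return {xi: 1.0 / len(trajectories) for xi in trajectories}
```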
Let ξ denote a trajectory and Ξ denote the set of all possible trajectories. Given a choice set C for the human and a grounding function ψ : C → (Ξ → [0, 1]), Jeon et al. define a procedure for reward learning. They assume that the human is Boltzmann-rational with rationality parameter β, so that the probability of choosing any particular piece of feedback is given by:

$$P(c \mid \theta, C) = \frac{\exp\big(\beta \cdot \mathbb{E}_{\xi \sim \psi(c)}[r_\theta(\xi)]\big)}{\sum_{c' \in C} \exp\big(\beta \cdot \mathbb{E}_{\xi \sim \psi(c')}[r_\theta(\xi)]\big)} \quad (1)$$

From the robot's perspective, every piece of feedback c is an observation about the true reward parameterization θ*, so the robot can use Bayesian inference to infer a posterior over θ. Given a prior over reward parameters P(θ), the RRiC inference procedure is defined as:

$$P(\theta \mid c, C) \propto \frac{\exp\big(\beta \cdot \mathbb{E}_{\xi \sim \psi(c)}[r_\theta(\xi)]\big)}{\sum_{c' \in C} \exp\big(\beta \cdot \mathbb{E}_{\xi \sim \psi(c')}[r_\theta(\xi)]\big)} \cdot P(\theta) \quad (2)$$

Since we care about misspecification of the choice set C, we focus on learning from demonstrations, where we restrict the set of trajectories that the expert can demonstrate. This enables us to have a rich choice set while allowing for a simple grounding function (the identity). In future work, we aim to test choice set misspecification with other feedback modalities as well.
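The inference procedure above is simple to implement for a discrete hypothesis space of rewards. Below is a minimal sketch of Eqs. 1 and 2; the scaffolding (NumPy arrays, the `expected_return` callback computing E_{ξ∼ψ(c)}[r_θ(ξ)], the function name) is our own, while the arithmetic follows the equations.

```python
import numpy as np

def rric_posterior(choice_idx, choice_set, thetas, prior, expected_return, beta=1.0):
    """Eq. 2: posterior over reward parameters given one piece of feedback.

    choice_idx: index into choice_set of the feedback the human actually gave.
    choice_set: the assumed human choice set C (a list of choices).
    thetas: list of candidate reward parameterizations theta.
    prior: array of shape (len(thetas),) holding P(theta).
    expected_return: function (c, theta) -> E_{xi ~ psi(c)}[r_theta(xi)].
    """
    # returns[i, j] = expected return of choice i under hypothesis theta_j
    returns = np.array([[expected_return(c, th) for th in thetas]
                        for c in choice_set])
    logits = beta * returns
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability only
    # Eq. 1: Boltzmann-rational likelihood P(c | theta, C), one column per theta.
    likelihood = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    posterior = likelihood[choice_idx] * np.asarray(prior)  # Bayes rule (Eq. 2)
    return posterior / posterior.sum()
```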
3 Choice Set Misspecification

For many common forms of feedback, including demonstrations and proxy rewards, the RRiC choice set is implicit. The robot knows which element of feedback the human provided (e.g. which demonstration they performed), but must assume which elements of feedback the human could have provided, based on its model of the human. This assumption could easily be incorrect: the robot may assume that the human has capabilities that they do not, or may fail to account for cognitive biases that blind the human to particular feedback options, such as the bias towards the most visually attention-grabbing choice in Fig 1.

To model such effects, we assume that the human selects feedback c ∈ C_Human according to P(c | θ, C_Human), while the robot updates its belief assuming a different choice set C_Robot, to get P(θ | c, C_Robot). Note that C_Robot is the robot's assumption about what the human's choice set is; this is distinct from the robot's action space. When C_Human ≠ C_Robot, we get choice set misspecification.

It is easy to detect such misspecification when the human chooses feedback c ∉ C_R. In this case, the robot observes a choice that it believes to be impossible, which should certainly be grounds for reverting to some safe baseline policy. We therefore only consider the case where the human's choice c is also present in C_R (which requires C_H and C_R to have at least one element in common).

Within these constraints, we propose a classification of types of choice set misspecification in Table 1. On the vertical axis, misspecification is classified according to the location of the optimal element of feedback c* = argmax_{c ∈ C_R ∪ C_H} E_{ξ∼ψ(c)}[r_θ*(ξ)]. If c* is available to the human (in C_H), then the class code begins with A. We only consider the case where c* is also in C_R: the case where it is in C_H but not C_R is uninteresting, as the robot would observe the "impossible" event of the human choosing c*, which immediately reveals the misspecification, at which point the robot should revert to some safe baseline policy. If c* ∉ C_H, then we must have c* ∈ C_R (since it was chosen from C_H ∪ C_R), and the class code begins with B.
On the horizontal axis, misspecification is classified according to the relationship between C_R and C_H: C_R may be a subset (code 1), superset (code 2), or intersecting set (code 3) of C_H. For example, class A1 describes the case in which the robot's choice set is a subset of the human's (perhaps because the human is more versatile), but both choice sets contain the optimal choice (perhaps because it is obvious).

                  C_R ⊂ C_H    C_R ⊃ C_H    C_R ∩ C_H
c* ∈ C_R ∩ C_H       A1           A2           A3
c* ∈ C_R \ C_H   (impossible)     B2           B3

Table 1: Choice set misspecification classification, where C_R is the robot's assumed choice set, C_H is the human's actual choice set, and c* is the optimal element of C_R ∪ C_H. B1 is omitted because if C_R ⊂ C_H, then C_R \ C_H is empty and cannot contain c*.
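The taxonomy in Table 1 is mechanical enough to express directly in code. A sketch, under the same assumptions as the text (the pair is genuinely misspecified, the sets overlap, and c* ∈ C_R); all names are ours:

```python
def misspecification_class(C_R: set, C_H: set, c_star) -> str:
    """Classify a <C_R, C_H> pair according to Table 1.

    Assumes C_R != C_H, that the sets overlap, and that c_star (the optimal
    element of C_R | C_H) lies in C_R, since all other cases reveal
    themselves as 'impossible' observations."""
    assert C_R != C_H and (C_R & C_H) and c_star in C_R
    row = "A" if c_star in C_H else "B"
    if C_R < C_H:
        column = "1"
    elif C_R > C_H:
        column = "2"
    else:
        column = "3"  # overlapping, but neither set contains the other
    assert (row, column) != ("B", "1")  # B1 is impossible (see caption)
    return row + column
```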
Figure 2: The set of four gridworlds used in randomized experiments, with the lava feature marked in red.

4 Experiments

To determine the effects of misspecification class, we artificially generated C_R and C_H with the properties of each particular class, simulated human feedback, ran RRiC reward inference, and then evaluated the robot's resulting belief distribution and optimal policy.

Environments

To isolate the effects of misspecification and allow for computationally tractable Bayesian inference, we ran experiments in toy environments: the four gridworlds shown in Fig 2. Each square in environment x is a state s_x with features (lava, goal). lava ∈ [0, 1] is a continuous feature, while goal ∈ {0, 1} is a binary feature set to 1 in the lower-right square of each grid and 0 everywhere else. The true reward function r_θ* is a linear combination of these features and a constant stay-alive cost incurred at each timestep, parameterized by θ = (w_lava, w_goal, w_alive). Each episode begins with the robot in the upper-left corner and ends once the robot reaches the goal state or the episode length reaches the horizon of 35 timesteps. Robot actions A_R move the robot one square in a cardinal or diagonal direction; actions that would move the robot off of the grid cause it to remain in place. The transition function T is deterministic. Environment x defines an MDP M_x = ⟨S_x, A_R, T, r_θ*⟩.

Inference

While the RRiC framework enables inference from many different types of feedback, we use demonstration feedback here because demonstrations have an implicit choice set and a straightforward deterministic grounding. Only the human knows the true reward parameterization θ*. The robot begins with a uniform prior P(θ) over reward parameters in which w_lava and w_alive vary but w_goal is held fixed; P(θ) contains θ*. RRiC inference proceeds as follows for each choice set tuple ⟨C_R, C_H⟩ and environment x. First, the simulated human selects the best demonstration in their choice set with respect to the true reward: c_H = argmax_{c ∈ C_H} E_{ξ∼ψ(c)}[r_θ*(ξ)]. Then, the simulated robot uses Eq. 2 to infer a "correct" distribution over reward parameterizations, B_H(θ) ≜ P(θ | c, C_H), using the true human choice set, and a "misspecified" distribution, B_R(θ) ≜ P(θ | c, C_R), using the misspecified human choice set. To evaluate the effects of each distribution on robot behavior, we define new MDPs M_xH = ⟨S_x, A_R, T, r_{E[B_H(θ)]}⟩ and M_xR = ⟨S_x, A_R, T, r_{E[B_R(θ)]}⟩ for each environment, solve them using value iteration, and then evaluate the rollouts of the resulting deterministic policies according to the true reward function r_θ*. We ran experiments with randomized choice set selection for each misspecification class to evaluate the effects of class on entropy change and regret; one such trial is sketched below.
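Putting the pieces together, a single trial of this procedure might look like the following sketch. It reuses `rric_posterior` from the sketch in Section 2; `solve_mdp` (value iteration returning an optimal trajectory for a given reward parameter vector) and `true_return` (return under r_θ*) are assumed helpers, and the two reported measures are defined precisely under Measures below.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def run_trial(C_H, C_R, thetas, prior, expected_return, solve_mdp, true_return,
              beta=1.0):
    """One misspecification trial. C_H and C_R are lists of demonstrations with
    the human's choice in both; thetas is an (n, d) array of candidate reward
    parameter vectors."""
    # The simulated human demonstrates the best element of *their* choice set.
    c_H = max(C_H, key=true_return)
    # Correct and misspecified posteriors (Eq. 2).
    B_H = rric_posterior(C_H.index(c_H), C_H, thetas, prior, expected_return, beta)
    B_R = rric_posterior(C_R.index(c_H), C_R, thetas, prior, expected_return, beta)
    delta_H = entropy(B_H) - entropy(B_R)  # >0: overconfident; <0: underconfident
    # Plan under the mean reward of each belief and compare true returns.
    xi_H = solve_mdp(np.average(thetas, weights=B_H, axis=0))
    xi_R = solve_mdp(np.average(thetas, weights=B_R, axis=0))
    regret = true_return(xi_H) - true_return(xi_R)
    return delta_H, regret
```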
Conditions
The experimental conditions are the five classes of choice set misspecification in Table 1: A1, A2, A3, B2, and B3. We tested each misspecification class on each environment, then averaged across environments to evaluate each class. For each environment x, we first generated a master set C_xM of all demonstrations that are optimal w.r.t. at least one reward parameterization θ. For each experimental class, we then randomly generated 6 valid ⟨C_R, C_H⟩ tuples with C_R, C_H ⊆ C_xM; duplicate tuples, and tuples in which c_H ∉ C_R, were not considered. One way to implement this generation step is sketched below.
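A simple realization is rejection sampling against the class definition. The sketch below reuses `misspecification_class` from Section 3, with `true_return` again standing in for evaluation under r_θ*; all names are ours, and deduplication across the 6 tuples is omitted for brevity.

```python
import random

def sample_tuple(target_class, C_M, true_return, max_tries=10_000):
    """Rejection-sample one valid <C_R, C_H> pair of the requested class
    from the master set C_M."""
    elems = list(C_M)
    for _ in range(max_tries):
        C_R = set(random.sample(elems, random.randint(1, len(elems))))
        C_H = set(random.sample(elems, random.randint(1, len(elems))))
        if C_R == C_H or not (C_R & C_H):
            continue
        c_star = max(C_R | C_H, key=true_return)
        if c_star not in C_R:
            continue  # detectable 'impossible' case, excluded by design
        c_H = max(C_H, key=true_return)  # the demonstration the human would give
        if c_H not in C_R:
            continue  # tuples with c_H outside C_R are discarded
        if misspecification_class(C_R, C_H, c_star) == target_class:
            return C_R, C_H
    raise RuntimeError(f"no valid tuple found for class {target_class}")
```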
Measures

There are two key experimental measures: entropy change and regret. Entropy change is the difference in entropy between the correct distribution B_H and the misspecified distribution B_R: ΔH = H(B_H) − H(B_R). If entropy change is positive, then misspecification induces overconfidence; if it is negative, then misspecification induces underconfidence.

Regret is the difference in return between the optimal solution to M_xH, with the correctly-inferred reward parameterization, and the optimal solution to M_xR, with the incorrectly-inferred parameterization, summed across all 4 environments. If ξ*_xH is an optimal trajectory in M_xH and ξ*_xR is an optimal trajectory in M_xR, then regret = Σ_{x=0}^{3} [r_θ*(ξ*_xH) − r_θ*(ξ*_xR)]. Note that we measure regret relative to the optimal action under the correctly specified belief, rather than the optimal action under the true reward. As a result, it is possible for regret to be negative, e.g. if the misspecification makes the robot more confident in the true reward than it would be under correct specification, and so execute a better policy.

We also ran an experiment in a fifth gridworld where we selected the human choice set according to a realistic human bias, to illustrate how choice set misspecification may arise in practice. In this experiment the human only considers demonstrations that end at the goal state because, to humans, the word "goal" can be synonymous with "end" (Fig 3a). To the robot, however, the goal is merely one of multiple features in the environment. The robot has no reason to privilege it over the other features, so it considers every demonstration that is optimal w.r.t. some possible reward parameterization (Fig 3b). The trajectory that only the robot considers is marked in blue. We ran RRiC inference using this ⟨C_R, C_H⟩ and evaluated the results using the same measures described above.

Figure 3: Human and robot choice sets with a human goal bias: (a) C_H; (b) C_R. Because the human only considers trajectories that terminate at the goal, they don't consider the blue trajectory in C_R.

5 Results

We summarize the aggregated measures, discuss the realistic human bias result, then examine two interesting findings: symmetry between classes A1 and A2, and high regret in class B3.

5.1 Aggregate results

Entropy change

Entropy change varied significantly across misspecification classes. As shown in Fig 4, the interquartile ranges (IQRs) of classes A1 and A3 did not overlap with the IQRs of A2 and B2. Moreover, A1 and A3 had positive medians, suggesting a tendency toward overconfidence, while A2 and B2 had negative medians, suggesting a tendency toward underconfidence. B3 was less distinctive, with an IQR that overlapped with those of all other classes. Notably, the distributions over entropy change of classes A1 and A2 are precisely symmetric about 0.

Regret
Regret also varied as a function of misspecification class. Each class had a median regret of 0, suggesting that misspecification commonly did not induce a large enough shift in belief for the robot to learn a different optimal policy. However, the mean regret, plotted as green lines in Fig 5, did vary markedly across classes. Regret was sometimes so high in class B3 that outliers skewed the mean regret beyond the whiskers of the boxplot. Again, classes A1 and A2 are precisely symmetric. We discuss this symmetry in Section 5.3, then discuss the poor performance of B3 in Section 5.4.

Figure 4: Entropy change (N=24). The box is the IQR, the whiskers are the range, and the blue line is the median. There are no outliers.

Figure 5: Regret (N=24). The box is the IQR, the whiskers are the most distant points within 1.5 times the IQR, and the green line is the mean. Multiple outliers are omitted.

5.2 Realistic human bias

The human bias of only considering demonstrations that terminate at the goal leads to very poor inference in this environment. Because the human does not consider the blue demonstration from Fig 3b, which avoids the lava altogether, they are forced to provide the demonstration in Fig 6a, which terminates at the goal but is long and encounters lava. As a result, the robot infers the very incorrect belief distribution in Fig 6b. Not only is this distribution underconfident (negative entropy change), but it also induces poor performance (positive regret). This result shows that a small incorrect assumption, in this case that the human considered and rejected demonstrations that don't terminate at the goal, can have an outsized negative impact on robot reward inference.

Figure 6: Human feedback and the resulting misspecified robot belief with a human goal bias: (a) feedback c_H; (b) P(θ | c_H, C_R). Because the feedback that the biased human provides is poor, the robot learns a very incorrect distribution over rewards.

5.3 Symmetry between A1 and A2

Intuitively, misspecification should lead to worse performance in expectation. Surprisingly, when we combine misspecification classes A1 and A2, their impact on entropy change and regret is actually neutral. The key to this is their symmetry: if we switch the contents of C_Robot and C_Human in an instance of class A1 misspecification, we get an instance of class A2 with exactly the opposite performance characteristics. Thus, if a pair in A1 is harmful, then the analogous pair in A2 must be helpful, meaning that it is better for performance than having the correct belief about the human's choice set. We show below that this is always the case under certain symmetry conditions that apply to A1 and A2.

Assume that there is a master choice set C_M containing all possible elements of feedback for MDP M, and that choice sets are sampled from a symmetric distribution over pairs of subsets D : 2^{C_M} × 2^{C_M} → [0, 1] with D(C_x, C_y) = D(C_y, C_x), where 2^{C_M} is the set of subsets of C_M. Let ER(r_θ, M) be the expected return from maximizing the reward function r_θ in M. A reward parameterization is chosen from a shared prior P(θ), and C_H and C_R are sampled from D. The human chooses the optimal element of feedback in their choice set, c_{C_H} = argmax_{c ∈ C_H} E_{ξ∼ψ(c)}[r_θ*(ξ)].
Theorem 1. Let M and D be defined as above. Assume that for all C_x, C_y ∼ D we have c_{C_x} = c_{C_y}; that is, the human would pick the same feedback regardless of which choice set she sees. If the robot follows RRiC inference according to Eq. 2 and acts to maximize expected reward under the inferred belief, then

$$\mathbb{E}_{C_H, C_R \sim D}\left[\mathrm{Regret}(C_H, C_R)\right] = 0$$
Proof. Define R(C_x, c) to be the return achieved when the robot follows RRiC inference with choice set C_x and feedback c, then acts to maximize r_{E[B_x(θ)]}, keeping β fixed. Since the human's choice is the same across D, for any C_x, C_y ∼ D, regret is anti-symmetric:

$$\begin{aligned}
\mathrm{Regret}(C_x, C_y) &= R(C_x, c_{C_x}) - R(C_y, c_{C_x}) \\
&= R(C_x, c_{C_y}) - R(C_y, c_{C_y}) \\
&= -\mathrm{Regret}(C_y, C_x)
\end{aligned}$$

Since D is symmetric, ⟨C_x, C_y⟩ is as likely as ⟨C_y, C_x⟩. Combined with the anti-symmetry of regret, this implies that the expected regret must be zero:

$$\begin{aligned}
\mathbb{E}_{C_x, C_y \sim D}\left[\mathrm{Regret}(C_x, C_y)\right]
&= \tfrac{1}{2}\,\mathbb{E}_{C_x, C_y}\left[\mathrm{Regret}(C_x, C_y)\right] + \tfrac{1}{2}\,\mathbb{E}_{C_x, C_y}\left[\mathrm{Regret}(C_y, C_x)\right] \\
&= \tfrac{1}{2}\,\mathbb{E}_{C_x, C_y}\left[\mathrm{Regret}(C_x, C_y)\right] - \tfrac{1}{2}\,\mathbb{E}_{C_x, C_y}\left[\mathrm{Regret}(C_x, C_y)\right] \\
&= 0
\end{aligned}$$

An analogous proof works for any anti-symmetric measure (including entropy change).
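Theorem 1 is also easy to sanity-check numerically. The toy below is entirely our own construction: "trajectories" are 2-D feature vectors, returns are linear in θ, the robot "plans" by picking the best item from a fixed menu under its posterior-mean reward, and the symmetric distribution D is uniform over ordered pairs of subsets that share a human-optimal element (the theorem's c_{C_x} = c_{C_y} condition).

```python
import itertools
import numpy as np

# Toy instantiation: "trajectories" are feature vectors; return is theta . phi.
PHIS = {"xi0": np.array([0.0, 1.0]), "xi1": np.array([1.0, 0.0]),
        "xi2": np.array([0.5, 0.5]), "xi3": np.array([0.2, 0.9])}
THETAS = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # hypothesis space
THETA_STAR = np.array([0.0, 1.0])                        # true reward
BETA = 5.0
MENU = list(PHIS)                                        # robot's option set

def post(c, C):
    """Eq. 2 with a uniform prior over THETAS and identity grounding."""
    returns = np.array([[th @ PHIS[x] for th in THETAS] for x in C])
    lik = np.exp(BETA * returns)
    lik /= lik.sum(axis=0, keepdims=True)
    p = lik[C.index(c)]
    return p / p.sum()

def best_for(C):
    return max(C, key=lambda x: THETA_STAR @ PHIS[x])

def plan(C, c):
    """Pick the menu item maximizing the posterior-mean reward."""
    theta_hat = np.average(THETAS, weights=post(c, C), axis=0)
    return max(MENU, key=lambda x: theta_hat @ PHIS[x])

def regret(C_h, C_r):
    c = best_for(C_h)                                    # the human's feedback
    return THETA_STAR @ PHIS[plan(C_h, c)] - THETA_STAR @ PHIS[plan(C_r, c)]

# Symmetric D: enumerate every ordered pair of distinct choice sets whose
# human-optimal element agrees.
subsets = [list(s) for r in (2, 3) for s in itertools.combinations(MENU, r)]
pairs = [(a, b) for a in subsets for b in subsets
         if set(a) != set(b) and best_for(a) == best_for(b)]
print(np.mean([regret(a, b) for a, b in pairs]))         # ~0 up to float error
```

Each ordered pair appears together with its swap, so the anti-symmetric regrets cancel exactly, mirroring the proof.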
Class    Mean     Std      Q1       Q3
A1       0.256    0.2265   0.1153   0.4153
A2      -0.256    0.2265  -0.4153  -0.1153

Table 2: Entropy change is symmetric across classes A1 and A2.

Class    Mean     Std      Q1       Q3
A1       0.04     0.4906   0.0      0.1664
A2      -0.04     0.4906  -0.1664   0.0

Table 3: Regret is symmetric across classes A1 and A2.

5.4 High regret in B3

As shown in Table 4, class B3 misspecification can induce regret an order of magnitude worse than the maximum regret induced by classes A3 and B2, which each differ from B3 along a single axis. This is because the worst-case inference in RRiC occurs when the human feedback c_H is the worst element of C_R, and this is only possible in class B3. In class B2, C_R contains all of C_H, so as long as |C_H| > 1, C_R must contain at least one element worse than c_H. In class A3, c_H = c*, so C_R cannot contain any elements better than c_H. In class B3, however, C_R need not contain any elements worse than c_H, in which case the robot updates its belief in the opposite direction from the ground truth.

For example, consider the sample human choice set in Fig 7a. Both trajectories are particularly poor, but the human chooses the demonstration c_H in Fig 7b because it encounters slightly less lava and so has a marginally higher reward. Fig 8a shows a potential corresponding robot choice set C_R from B2, containing both trajectories from the human choice set as well as a few others. Fig 8b shows P(θ | c_H, C_R). The axes represent the weights on the lava and alive features, and the space of possible parameterizations lies on the circle where w_lava² + w_alive² = 1. The opacity of the gold line is proportional to the weight that P(θ) places on each parameter combination. The true reward has w_lava, w_alive < 0, whereas the peak of this distribution has w_lava < 0 but w_alive > 0. This is because C_R contains shorter trajectories that encounter the same amount of lava, and so the robot infers that c_H must be preferred in large part due to its length.

Fig 9a shows an example robot choice set C_R from B3, and Fig 9b shows the inferred P(θ | c_H, C_R). Note that the peak of this distribution has w_lava, w_alive > 0. Since c_H is the longest and highest-lava trajectory in C_R, and alternative shorter and lower-lava trajectories exist in C_R, the robot infers that the human is attempting to maximize both trajectory length and lava encountered: the opposite of the truth. Unsurprisingly, maximizing expected reward under this belief leads to high regret. The key difference between B2 and B3 is that in B3, c_H can be the lowest-reward element in C_R, resulting in the robot updating directly away from the true reward.

Figure 7: Example human choice set and corresponding feedback: (a) C_H; (b) c_H.

Figure 8: Robot choice set and resulting misspecified belief in B2: (a) C_R; (b) P(θ | c_H, C_R).

Figure 9: Robot choice set and resulting misspecified belief in B3: (a) C_R; (b) P(θ | c_H, C_R).
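Reusing `post`, `PHIS`, and `THETA_STAR` from the toy above, the B3 flip can be reproduced in miniature. In this (again entirely artificial) example the human's best available demonstration, xi2, is the worst element of the robot's assumed choice set under θ*:

```python
C_H = ["xi1", "xi2"]         # the human never considered xi0 or xi3
C_R = ["xi2", "xi0", "xi3"]  # overlaps C_H but contains only *better* options
c_H = max(C_H, key=lambda x: THETA_STAR @ PHIS[x])  # the human demonstrates xi2

print(post(c_H, C_H))  # mass concentrates on theta* = (0, 1): correct inference
print(post(c_H, C_R))  # mass concentrates on theta = (1, 0): opposite of theta*
```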
Class    Mean     Std       Max       Min
A3      -0.001    0.5964    1.1689   -1.1058
B2       0.228    0.6395    1.6358   -0.9973
B3       2.059    6.3767   24.7252   -0.9973

Table 4: Regret comparison showing that class B3 has much higher regret than neighboring classes.

6 Discussion

Summary

In this work, we highlighted the problem of choice set misspecification in generalized reward inference, where a human gives feedback selected from choice set C_Human but the robot assumes that the human was choosing from choice set C_Robot. As expected, such misspecification on average induces suboptimal behavior, resulting in regret. However, a different story emerged once we distinguished between misspecification classes. We defined five distinct classes varying along two axes: the relationship between C_Human and C_Robot, and the location of the optimal element of feedback c*. We empirically showed that different classes lead to different types of error, with some classes leading to overconfidence, some to underconfidence, and one to particularly high regret. Surprisingly, under certain conditions the expected regret under choice set misspecification is actually 0, meaning that in expectation, misspecification does not hurt in these situations.
Implications

There is wide variance across the different types of choice set misspecification: some may have particularly detrimental effects, and others may not be harmful at all. This suggests strategies for designing robot choice sets to minimize the impact of misspecification. For example, we find that regret tends to be negative (that is, misspecification is helpful) when the optimal element of feedback is in both C_Robot and C_Human and C_Robot ⊃ C_Human (class A2). Similarly, worst-case inference occurs when the optimal element of feedback is in C_Robot only, and C_Human contains elements that are not in C_Robot (class B3). This suggests that erring on the side of specifying a large C_Robot, which makes A2 more likely and B3 less likely, may lead to more benign misspecification. Moreover, it may be possible to design protocols for the robot to identify unrealistic choice set-feedback combinations and verify its choice set with the human, reducing the likelihood of misspecification in the first place. We plan to investigate this in future work.
Limitations and future work

In this paper, we primarily sampled choice sets randomly from the master choice set of all possibly-optimal demonstrations. However, this is not a realistic model. In future work, we plan to select human choice sets based on actual human biases to improve ecological validity. We also plan to test this classification and our resulting conclusions in more complex and realistic environments. Eventually, we plan to work on active learning protocols that allow the robot to identify when its choice set is misspecified and alter its beliefs accordingly.
Acknowledgements
We thank colleagues at the Center for Human-Compatible AI for discussion and feedback. This work was partially supported by an ONR YIP.
References

[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
[2] Andrea Bajcsy, Dylan P Losey, Marcia K O'Malley, and Anca D Dragan. Learning robot objectives from physical human interaction. Proceedings of Machine Learning Research, 78:217–226, 2017.
[3] Andreea Bobu, Andrea Bajcsy, Jaime F Fisac, Sampada Deglurkar, and Anca D Dragan. Quantifying hypothesis space misspecification in learning from human–robot demonstrations and physical corrections. IEEE Transactions on Robotics, 2020.
[4] Andreea Bobu, Dexter RR Scobee, Jaime F Fisac, S Shankar Sastry, and Anca D Dragan. Less is more: Rethinking probabilistic models of human behavior. arXiv preprint arXiv:2001.04465, 2020.
[5] Syngjoo Choi, Shachar Kariv, Wieland Müller, and Dan Silverman. Who is (more) rational? American Economic Review, 104(6):1518–50, 2014.
[6] Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4300–4308, 2017.
[7] Owain Evans, Andreas Stuhlmueller, and Noah D Goodman. Learning the preferences of ignorant, inconsistent agents. arXiv, 2015.
[8] Prasoon Goyal, Scott Niekum, and Raymond J Mooney. Using natural language for reward shaping in reinforcement learning. arXiv preprint arXiv:1903.02020, 2019.
[9] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
[10] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6765–6774, 2017.
[11] Hong Jun Jeon, Smitha Milli, and Anca D Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. arXiv preprint arXiv:2002.04833, 2020.
[12] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–292, 1979.
[13] Victoria Krakovna. Specification gaming examples in AI, April 2018.
[14] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv, 2018.
[15] Anirudha Majumdar, Sumeet Singh, Ajay Mandlekar, and Marco Pavone. Risk-sensitive inverse reinforcement learning via coherent risk models. In Robotics: Science and Systems, 2017.
[16] Smitha Milli and Anca D Dragan. Literal or pedagogic human? Analyzing human model misspecification in objective learning. arXiv preprint arXiv:1903.03877, 2019.
[17] Sören Mindermann, Rohin Shah, Adam Gleave, and Dylan Hadfield-Menell. Active inverse reward design. arXiv preprint arXiv:1809.03060, 2018.
[18] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.
[19] Sid Reddy, Anca Dragan, and Sergey Levine. Where do you think you're going?: Inferring beliefs about dynamics from behavior. In Advances in Neural Information Processing Systems, pages 1454–1465, 2018.
[20] Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems, 2017.
[21] Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and Anca Dragan. Preferences implicit in the state of the world. arXiv preprint arXiv:1902.04198, 2019.
[22] Jacob Steinhardt and Owain Evans. Model mis-specification and inverse reinforcement learning, February 2017.
[23] Kevin Waugh, Brian D Ziebart, and J Andrew Bagnell. Computational rationalization: The inverse equilibrium problem. arXiv preprint arXiv:1308.3506, 2013.
[24] Brian D Ziebart.