Value Alignment Verification
Daniel S. Brown∗ (University of California, Berkeley) [email protected]
Jordan Schneider∗, Scott Niekum (University of Texas at Austin) {joschnei,sniekum}@cs.utexas.edu
∗Equal contribution. NeurIPS 2020 Workshop on Human And Machine in-the-Loop Evaluation and Learning Strategies (HAMLETS).
Abstract
As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important that humans can verify these agents' trustworthiness and efficiently evaluate their performance and correctness. In this paper we formalize the problem of value alignment verification: how can we efficiently test whether the goals and behavior of another agent are aligned with a human's values? We explore several different value alignment verification settings and provide foundational theory for value alignment verification. We study alignment verification problems with an idealized human that has an explicit reward function, as well as value alignment verification problems where the human has implicit values. Our theoretical and empirical results in both a discrete grid navigation domain and a continuous autonomous driving domain demonstrate that it is possible to synthesize highly efficient and accurate value alignment verification tests for certifying the alignment of autonomous agents.
If we desire autonomous agents that can interact with and assist humans and other agents in performing complex, potentially risky tasks, then it is important that humans can verify that other agents' policies are aligned with what is expected and desired. This alignment is often termed value alignment and is defined in the Asilomar AI Principles as follows: "Highly autonomous AI systems should be designed so that their goals and behaviors can be assured to align with human values throughout their operation" (https://futureoflife.org/ai-principles/). We note that it is also important that non-human agents be mutually value aligned in multi-agent settings so that they can assist each other and collaborate under shared norms and preferences. In this paper, we propose and explore the problem of efficient value alignment verification: how can a human efficiently test whether a robot is aligned with the human's values?
The goal of value alignment verification is to construct a kind of "driver's test" that a human can give to another agent, one which can verify value alignment and consists of only a small number of queries. For the purposes of this paper we define values in the reinforcement learning sense, i.e., with respect to a value function or reward/utility function. We say that a robot is perfectly value aligned with a human if the robot's policy is optimal under the human's reward function. The two agents in a value alignment verification problem (human and robot) will likely have different communication mechanisms and different value introspection abilities. Thus, value alignment verification will take different forms depending on whether the human and robot have explicit access to their values (i.e., being able to write down a value function or reward function) or implicit access (i.e., only able to answer preference queries or to sample actions from a policy). For example, artificial agents typically have explicit value functions or policies, while humans typically have implicit values. Despite these differences, we would like to perform value alignment verification regardless of whether an agent has explicit or implicit values. In Section 4.2.1 we examine methods for provable value alignment verification in an idealized setting where both the human and robot have explicit values. Then, in Section 4.2.2 we discuss how we can use this test under several different conditions, including when the robot may have implicit values and can only answer preference queries. Finally, in Section 4.3 we propose an approximation algorithm for value alignment verification (depicted in Figure 1) that is applicable in cases where the human tester has implicit values.

Figure 1: Value alignment verification with a human tester with implicit values. The tester's values are distilled into a succinct alignment test via preference elicitation. This test can then be applied to any number of agents to verify their alignment with the human's values.

Prior work on value alignment often focuses on loose definitions of value alignment, qualitative evaluation of trust [20], or asymptotic alignment of an agent's performance via interactions and active learning [17, 18, 30]. In contrast, our work seeks to build trust between agents by formally defining value alignment and seeking efficient tests for value alignment verification that are applicable when two or more agents have already learned a policy or reward function and want to quickly test compatibility. Related work seeks to provide high-confidence bounds on the performance of a reinforcement learning agent [19, 34] or an imitation learning agent [12, 14]. However, these approaches typically require full access to the parameterized policies of both agents and involve evaluating the robot's policy over significant amounts of historical data or extensive counterfactual computations. To the best of our knowledge, we are the first to address the general problem of algorithmic value alignment verification.
In particular, we propose exact, approximate, and heuristic tests that one agent can use to quickly and efficiently verify value alignment with another agent. The contributions of this work are the following: (1) we formally define value alignment verification; (2) we analyze the complexity of value alignment verification and show that in an idealized setting it can be much more efficient than active reward learning, requiring only a constant number of queries; (3) we propose exact and heuristic value alignment verification methods that are applicable under a wide range of test queries; (4) we propose an approximation algorithm for value alignment verification that works with a human tester with implicit values; and (5) we provide empirical results demonstrating the efficacy of exact and approximate value alignment verification in both a discrete grid navigation domain and a continuous autonomous driving domain.
Value Alignment:
Most work on value alignment focuses on how to iteratively train a learning agent such that its final behavior is aligned with a user's intentions [5, 22, 28]. One example is cooperative inverse reinforcement learning (CIRL) [18], which formulates value alignment as a game between a human and a robot, where both try to maximize a shared reward function that is only known by the human. CIRL and other research on value alignment focus on ensuring the learning agent asymptotically converges to the same values as the human teacher, but do not provide a way to check when value alignment has been achieved. By contrast, we are instead interested in value alignment verification: testing whether an agent is currently value aligned. We do not assume a cooperative setting: the robot is not assumed to have the same payoff as the human. Instead, we assume the agent being tested has already learned a policy/reward function via some black-box optimization process and the human wants to efficiently test for alignment.
Active Reward Learning:
Value alignment verification is closely related to the problem of active preference learning [2, 8, 10, 14, 17, 20, 24], where an AI system seeks to efficiently determine the reward function of a human expert via queries for expert demonstrations or preferences over trajectories; however, value alignment verification only seeks to answer the question of whether two agents are aligned, without concern for the exact reward function of the robot. We prove in Section 4.1 that value alignment verification can sometimes be performed in a constant number of queries, whereas active learning requires a logarithmic number of queries. We also demonstrate that when the human has implicit values, active reward learning can be used to automatically generate a high-confidence value alignment test with respect to these implicit values.
Machine Teaching:
Machine teaching [36, 37] is the inverse problem to machine learning. In machine teaching, a teacher seeks a minimal set of training data such that a student (running a particular learning algorithm) learns a desired set of model parameters. Value alignment verification is related and can be seen as a form of machine testing rather than teaching. Machine teaching algorithms typically search for a minimal set of training data that will teach a learner a specific model, whereas we seek a minimal set of questions that will allow a tester to verify another agent's model. Thus, in machine teaching the teacher provides examples and their answers, whereas in machine testing the tester provides examples and then queries the student for the answers. Machine teaching has been previously applied to sequential decision making problems [13, 15], but has not been used to directly address the problem of machine testing. Other related work has proposed to use pedagogic examples as a way to enable robots to express their capabilities [21] and values [20] to a human. Our work is similarly motivated by building trust between agents via verification testing.
Policy Evaluation:
Policy evaluation [33] can be seen as a form of value alignment, but aims to answer the harder question of "How much return would the other agent achieve according to my values?" By focusing on the simpler question, "Is the robot value aligned with the human?", our work provides sample-efficient tests for exact and approximate value alignment. Off-policy evaluation (OPE) seeks to perform policy evaluation without executing the testee's policy [27, 34, 35]. OPE is often sample-inefficient or provides high-variance estimates, and typically assumes explicit access to the tester's reward function, explicit access to the tester and testee policies, and a large dataset of rollouts from the tester's policy with corresponding returns. By contrast, value alignment verification is applicable in settings where the policies and reward functions of both agents may be implicit and only accessible indirectly. High-confidence policy evaluation has also been investigated in the imitation learning setting [1, 10, 12], where an agent has access to demonstrations from an expert and seeks to evaluate its policy loss with respect to the teacher's unknown reward function. Rather than considering a learner that receives demonstrations from a teacher, we consider a tester who seeks to design a test that can (approximately) verify the value alignment of any other agent.
Because we are interested in agents that have different reward functions, we adopt notation proposed by Amin et al. [3], where a Markov decision process (MDP) $M$ consists of an environment $E = (\mathcal{S}, \mathcal{A}, P, S_0, \gamma)$ and a reward function $R : \mathcal{S} \rightarrow \mathbb{R}$. An environment has a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$, a discount factor $\gamma \in [0,1)$, and a distribution over initial states $S_0$. A policy $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$ is a mapping from states to a distribution over actions. The state and state-action values of a policy $\pi$ are $V^{\pi}_R(s) = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s]$ and $Q^{\pi}_R(s,a) = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s, a_0 = a]$ for $s \in \mathcal{S}$ and $a \in \mathcal{A}$. We denote $V^*_R(s) = \max_{\pi} V^{\pi}_R(s)$ and $Q^*_R(s,a) = \max_{\pi} Q^{\pi}_R(s,a)$. The expected value of a policy is denoted by $V^{\pi}_R = \mathbb{E}_{s_0 \sim S_0}[V^{\pi}_R(s_0)]$. As is common [7, 14, 26, 38], we will often assume that the reward function can be expressed as a linear combination of features $\phi : \mathcal{S} \rightarrow \mathbb{R}^k$, so that $R(s) = w^T \phi(s)$, where $w \in \mathbb{R}^k$. Thus, we use $R$ and $w$ interchangeably. Note that this assumption of a linear reward function is not restrictive, as these features can be arbitrarily complex nonlinear functions of the state and can be obtained via unsupervised learning from raw state observations [14, 16, 32]. Given this assumption, the state-action value function can be written in terms of discounted expectations over features as $Q^{\pi}_R(s,a) = w^T \Phi^{(s,a)}_{\pi}$, where $\Phi^{(s,a)}_{\pi} = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s, a_0 = a]$. A minimal sketch of these quantities is given below.
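To make the notation concrete, the following is a minimal Python sketch (not the implementation used in our experiments) that estimates the feature expectations $\Phi^{(s,a)}_{\pi}$ by Monte Carlo rollouts in a generic tabular MDP; the tensor layout, function names, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def feature_expectations(P, phi, pi, s0, a0, gamma=0.95,
                         horizon=200, n_rollouts=500, rng=None):
    """Monte Carlo estimate of Phi_pi^{(s,a)} = E_pi[sum_t gamma^t phi(s_t) | s_0=s0, a_0=a0].

    P:   (S, A, S) transition tensor, P[s, a, s'] = Pr(s' | s, a)
    phi: (S, k) feature matrix, phi[s] = feature vector of state s
    pi:  (S,) deterministic policy, pi[s] = action taken in state s
    """
    rng = rng or np.random.default_rng(0)
    S, _, _ = P.shape
    total = np.zeros(phi.shape[1])
    for _ in range(n_rollouts):
        s, a, disc = s0, a0, 1.0
        for _ in range(horizon):
            total += disc * phi[s]          # accumulate discounted features of s_t
            s = rng.choice(S, p=P[s, a])    # sample the next state
            a = pi[s]                       # follow pi after the first action a0
            disc *= gamma
    return total / n_rollouts

# Q-value of (s, a) under a linear reward R(s) = w^T phi(s):
# Q_pi(s, a) = w @ feature_expectations(P, phi, pi, s, a)
```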
In this section we first explicitly define value alignment and value alignment verification. Next, we discuss how assuming rationality of the robot agent enables highly efficient provable value alignment verification. We then present results for value alignment verification when the human has full control over the environment, and also in the case where the environment is fixed. We conclude this section by presenting a method for approximate value alignment verification when the tester is a human with implicit values.

We first formalize value alignment. Consider two agents: a human and a robot. We will assume that the human has a (possibly implicit) reward function that provides the ground truth for determining value alignment verification of the robot. We define exact value alignment as follows:

Definition 1.
Given reward function $R$, policy $\pi'$ is value aligned in environment $E$ if and only if

$\pi' \in \mathrm{OPT}(R)$,  (1)

where $\mathrm{OPT}(R) = \{\pi \mid \pi(a \mid s) > 0 \Rightarrow a \in \arg\max_a Q^*_R(s,a)\}$ is the set of all optimal (potentially stochastic) policies in the MDP $(E, R)$, and $\arg\max_x f(x) := \{x \mid f(y) \le f(x), \forall y\}$.

In complex environments, or for robots with bounded rationality or computation, expecting exact alignment may be unreasonable. Thus, we also define $\epsilon$-value alignment:
Definition 2.
Given reward function $R$, policy $\pi'$ is $\epsilon$-value aligned in environment $E$ if and only if

$V^*_R - V^{\pi'}_R \le \epsilon$.  (2)

Note that Definition 1 is a special case of Definition 2 when $\epsilon = 0$. We are interested in the problem of value alignment verification, which we define as follows:

Definition 3.
Value Alignment Verification: Given an environment $E$, reward function $R$, policy $\pi'$, and a threshold $\epsilon$, solve the decision problem: is $\pi'$ $\epsilon$-value aligned with $R$ in environment $E$?

To verify value alignment without checking alignment at every state, it needs to be the case that the robot preferring an action in one state implies something about its preferences in another. Any such implication requires both that the states have some relationship to one another and that the agent's preferences are consistent with this relationship. In our case, we assume states have known reward features and that agents act rationally with respect to a linear reward in these features. While we require these assumptions for our theoretical analysis, we will later show that many of our proposed methods for value alignment verification can be used as heuristics for building trust even if the subject agent is not rational.

A rational agent is one that picks actions to maximize its utility [29]. Thus, given a reward function $R'$, a rational agent's policy $\pi'$ is of the form:

$\pi'(s) \in \arg\max_a Q^*_{R'}(s,a)$.  (3)

Consider two rational agents with reward functions $R$ and $R'$. Because there are infinitely many reward functions that lead to the same optimal policy [25], determining that $\exists s \in \mathcal{S}, R(s) \ne R'(s)$ is not sufficient to verify misalignment. Instead, we formalize value alignment for reward functions with arbitrary shaping or scale via the following lemma, which follows directly from Definition 1 and the definition of a rational agent in Equation (3).

Lemma 1.
A rational robot with reward function $R'$ is value aligned with a human with reward function $R$ in environment $E$ if and only if $\mathrm{OPT}(R') \subseteq \mathrm{OPT}(R)$.

Proof. This follows directly from Definition 1 and the definition of a rational agent in Eq. (3).

Thus, a rational robot is aligned with a human if all optimal policies under the robot's reward function are also optimal policies under the human's reward function. A brute-force check of this condition in a small tabular MDP is sketched below.
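The following sketch illustrates Lemma 1 in a small tabular MDP where both reward functions are explicit: compute the optimal action sets under each reward via value iteration and test the inclusion $\mathrm{OPT}(R') \subseteq \mathrm{OPT}(R)$. All names and tolerances are hypothetical.

```python
import numpy as np

def optimal_action_sets(P, R, gamma=0.95, tol=1e-8):
    """Return, for each state, the set of optimal actions under reward R
    (value iteration on a tabular MDP with P[s, a, s'] transitions)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R[:, None] + gamma * P @ V       # Q[s, a]; reward depends on state only
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return [set(np.flatnonzero(Q[s] >= Q[s].max() - 1e-6)) for s in range(S)]

def exactly_aligned(P, R_human, R_robot, gamma=0.95):
    """Lemma 1: aligned iff every robot-optimal action is human-optimal in every state."""
    opt_h = optimal_action_sets(P, R_human, gamma)
    opt_r = optimal_action_sets(P, R_robot, gamma)
    return all(opt_r[s] <= opt_h[s] for s in range(len(opt_h)))
```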
$\epsilon$-Alignment Verification via Omnipotent Testing

We first consider the theoretical setting of an omnipotent testing agent: one that is able to construct a set of arbitrary test MDPs to verify value alignment across a family of environments that share the same reward function. We assume that the human has explicit access to their reward function, but only assume that the robot has implicit values which allow it to answer preference queries. Amin and Singh [4] prove under these assumptions that an omnipotent active learner can determine the reward function of another agent within $\epsilon$ precision via $O(\log|\mathcal{S}| + \log(1/\epsilon))$ active queries. These queries take the form of asking for the entire policy of the robot. In Appendix A.2, we extend this result to the case of value alignment testing, where we prove that if the human is able to query the robot for preferences over policies, then the sample complexity of $\epsilon$-value alignment verification is only $O(1)$.

Theorem 1.
Given a testing reward $R$, there exists a two-query test (complexity $O(1)$) that determines $\epsilon$-value alignment of a rational agent over all MDPs that share the same state space and reward function $R$, but may differ in actions, transitions, discount factors, and initial state distributions.

Proof. See Appendix A.2.

This illustrates the benefit of having a verification test versus running active reward learning, and confirms related work showing that methods related to machine teaching are much more sample efficient than active learning methods [13, 36]. While creating an arbitrary synthetic testing world may work in some cases, it is often the case that nature provides the environment in which we would like to guarantee verification. In the rest of this paper we focus on this setting, where the testing environment is fixed and cannot be arbitrarily constructed or changed.
In this section we develop theoretical results regarding provable exact alignment verification ($\epsilon = 0$) of a rational robot when the tester does not have full control over the testing environment.
We seek an efficient value alignment verification test which enables a human to query the robot to determine alignment according to Lemma 1. As demonstrated by Theorem 2 below, due to the linearity of $R$, a sufficient condition for value alignment verification is to test whether the rational robot's reward function lies in the following geometric object.

Definition 4.
Given an MDP $M$ composed of environment $E$ and reward function $R$, the aligned reward polytope (ARP) is defined as the following set of reward functions:

$\mathrm{ARP}(R) = \{R' \mid \mathrm{OPT}(R') \subseteq \mathrm{OPT}(R)\}$.  (4)

We now present a sufficient test for provable exact value alignment. As a reminder, given a linear reward function we can write the state-action value function as $Q^{\pi}_R(s,a) = w^T \Phi^{(s,a)}_{\pi}$, where $\Phi^{(s,a)}_{\pi} = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s, a_0 = a]$.

Theorem 2.
Given an MDP $M = (E, R)$, if the human's reward function $R$ and robot's reward function $R'$ can be represented as linear combinations of features $\phi(s) \in \mathbb{R}^k$, i.e., $R(s) = w^T \phi(s)$ and $R'(s) = w'^T \phi(s)$, then a sufficient condition for testing value alignment is to test whether

$w' \in \bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b} \subseteq \mathrm{ARP}(R)$,  (5)

where $H^R_{s,a,b} = \{w \mid w^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0\}$ if $a \in \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$ and $b \notin \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$, and $H^R_{s,a,b} = \mathbb{R}^k$ (i.e., non-constraining) otherwise.

Proof. See Appendix A.1.
Given Theorem 2, we can now design an efficient value alignment verification test when we have an idealized human and robot that both have explicit representations of their reward functions. Our analysis provides theoretical insight into the value alignment verification problem, and the resulting tests for exact alignment in this section will motivate our approximation algorithm for value alignment verification when one or both of the agents have implicit values. We also note that our results are of practical interest if there are two robots that need to collaborate but were trained by different organizations and have different reward functions and/or policies [6, 31]; running a value alignment test with explicit values is an efficient way to verify whether the robots can work together.

We propose an approach that directly queries the robot for its reward function weights $w'$. Later we will show that many different types of queries reduce to this type of test. A direct result of Theorem 2 is that we can test for value alignment via the test $T = \{H_{s,a,b} \mid (s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}\}$, where the questions are defined as

$w'^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0$, if $a \in \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$, $b \notin \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$, $\forall s \in \mathcal{S}$.  (6)

All constraints of this form can be checked simultaneously via a single matrix-vector multiplication $\Phi_{\mathrm{ARP}} w' > 0$, where $\Phi_{\mathrm{ARP}}$ is a matrix in which each row corresponds to a unique feature count difference in Equation (6). A minimal sketch of this matrix test is given below.
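Under the same assumptions, the following sketch stacks one halfspace normal per (state, optimal action, suboptimal action) triple into $\Phi_{\mathrm{ARP}}$ and checks all strict inequalities at once. It reuses the hypothetical helpers optimal_action_sets and feature_expectations from the earlier sketches; the margin parameter is an illustrative numerical tolerance.

```python
import numpy as np

def build_arp_matrix(P, phi, w_human, gamma=0.95):
    """Stack one halfspace normal per (s, a_opt, b_subopt) triple:
    row = Phi_{pi*_R}^{(s,a)} - Phi_{pi*_R}^{(s,b)} (Theorem 2)."""
    opt = optimal_action_sets(P, phi @ w_human, gamma)   # from the earlier sketch
    pi_star = np.array([min(acts) for acts in opt])      # any fixed optimal policy
    S, A, _ = P.shape
    rows = []
    for s in range(S):
        for a in opt[s]:
            for b in set(range(A)) - opt[s]:
                mu_a = feature_expectations(P, phi, pi_star, s, a, gamma)
                mu_b = feature_expectations(P, phi, pi_star, s, b, gamma)
                rows.append(mu_a - mu_b)
    return np.array(rows)

def passes_arp_test(Phi_ARP, w_robot, margin=1e-8):
    """Certify alignment if every halfspace constraint holds strictly."""
    return bool(np.all(Phi_ARP @ w_robot > margin))
```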
The above test assumes that the human can query directly for the robot's reward function weights $w'$. In Appendix C, we show that similar tests can be formulated under more restrictive query assumptions, including preference queries answered via implicit values:

Proposition 1. Under the assumption of a rational robot that shares the same linear reward features as the human, efficient exact value alignment verification is possible in the following query settings: (1) query access to reward function weights $w'$, (2) query access to samples of the reward function $R'(s)$, (3) query access to $Q^*_{R'}(s,a)$, and (4) query access to preferences over trajectories.

Proof. See Appendix C.

Proposition 1 assumes the human can either directly query the robot's reward or value function or query the robot for its preferences over trajectories. However, sometimes the human may only have query access to the robot's policy $\pi'$. In this case, we can resort to heuristics for value alignment via policy queries that have high verification accuracy in practice, but may occasionally have false positives where a non-aligned agent is certified as aligned, as we discuss in the next section.

We now discuss how to perform value alignment verification when the human and/or robot only have implicit values. In this setting, the goal is to distill a human's intent or values into a verification test that can be used to quickly check the value alignment of any agent. For example, a regulatory body may want a sample-efficient test to validate proprietary autonomous driving software. The testing agency may change its regulations periodically, and a value alignment test could be used to check whether existing proprietary software still meets the new guidelines. A well-designed value alignment verification test could also be useful as a replacement for exhaustively backtesting an agent during development, to ensure updates to software do not violate safety constraints. In some cases, such as an AI tutoring system, the robot could even be the tester and the human the testee; for example, a robot that comes preprogrammed from a factory to perform household chores may want to first quickly verify whether the human's preferences are aligned with its preprogrammed behavior.

Without an explicit representation of the human's values, we cannot directly compute the ARP as described in the previous section. Instead, we propose the approach outlined in Figure 1, where we use an AI system as a test generator to enable the creation of an alignment test. The test generator first performs preference elicitation to distill the human's internal value function into an efficient alignment test. This test can then be reused to test any other agent, human or robot, for value alignment. As is common for many active reward learning algorithms [8, 17, 30], we assume that the preference elicitation algorithm outputs both a set of trajectory preferences $\mathcal{P} = \{(\xi_i, \xi_j) : \xi_i \succ \xi_j\}$ and a set of sample reward weights $\{w_i\}$ from the posterior distribution $P(w \mid \mathcal{P})$. Given $\mathcal{P}$ and $P(w \mid \mathcal{P})$, the aligned reward polytope of the human's implicit reward function can be approximated as

$\mathrm{ARP} = \bigcap_{(\xi_i, \xi_j) \in \mathcal{P}} \{w \mid w^T(\Phi(\xi_i) - \Phi(\xi_j)) > 0\}$,  (7)

which generalizes the definition of the ARP to MDPs with continuous states and actions. To test the alignment of agents with bounded rationality or slightly misspecified reward functions, we consider $\epsilon$-value alignment (Definition 2). In particular, we synthesize a test by computing a $(1-\delta)$-confidence $\epsilon$-ARP. As each sample $w_i$ has a probability mass associated with it, we can create a high-confidence version of the $\epsilon$-ARP by only testing using trajectory pairs $(\xi_i, \xi_j) \in \mathcal{P}$ such that $\Pr(w^T(\Phi(\xi_i) - \Phi(\xi_j)) > \epsilon) > 1 - \delta$ under $P(w \mid \mathcal{P})$. Finally, we remove redundant constraints [13]. The result is a succinct, high-confidence test $T$ for $\epsilon$-value alignment verification that consists of a minimal set of informative preference queries (see Appendix for details). The alignment test consists of asking the robot for preferences over trajectories in $T$ and checking whether they match the preference labels given by the human tester. A sketch of this distillation step follows.
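The sketch below illustrates this distillation step under the stated assumptions: given feature-count differences for each elicited preference and posterior samples of the reward weights, it keeps only the constraints that hold with margin $\epsilon$ under at least a $(1-\delta)$ fraction of the samples. The linear-programming redundancy removal [13] is omitted for brevity, and all names are hypothetical.

```python
import numpy as np

def distill_alignment_test(pref_deltas, w_samples, eps=0.1, delta=0.05):
    """Keep preference constraints that hold with margin eps under at least
    (1 - delta) of the posterior reward samples (the (1-delta)-confidence eps-ARP).

    pref_deltas: (n, k) array, row i = Phi(xi_i) - Phi(xi_j) for preference xi_i > xi_j
    w_samples:   (m, k) posterior samples of reward weights from preference elicitation
    """
    margins = pref_deltas @ w_samples.T              # (n, m): margin of each pair per sample
    confident = (margins > eps).mean(axis=1) > 1 - delta
    test = pref_deltas[confident]
    # (redundant halfspaces would additionally be pruned with a linear program [13])
    return test

def robot_passes(test, robot_prefers):
    """robot_prefers(delta) -> True if the robot prefers the first trajectory of the
    pair whose feature-count difference is delta."""
    return all(robot_prefers(row) for row in test)
```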
In this section, we evaluate the performance of our proposed exact value alignment verification test (Section 4.2.2) in two forms: querying for the weight vector of the robot (ARP-w) and preference queries (ARP-pref). We also consider three heuristic alignment tests designed to work with black-box agents where the tester can only ask policy action queries. We briefly discuss the three black-box heuristics here and include full details in Appendix D. Our first heuristic is inspired by Huang et al.'s notion of critical states: states where $Q^*_R(s, \pi^*_R(s)) - \frac{1}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} Q^*_R(s,a) > t$, for some user-defined threshold $t$ [20]. We adapt this idea to form a critical state alignment heuristic (CS) that computes critical states under the human's reward function $R$, then queries the robot's policy at each critical state and tests whether the robot's action is optimal under the human's policy $\pi^*_R$. Our second heuristic adapts the Set Cover Optimal Teaching algorithm (SCOT) proposed by Brown and Niekum [13] into a value alignment heuristic. SCOT generates a set of maximally informative state-action trajectories designed to efficiently teach a reward function to a maximum likelihood IRL agent. We turn this into an alignment verification test by generating maximally informative trajectories, querying the robot's policy at each state in the teaching trajectories, and then testing whether the sampled actions are optimal under the human's reward function $R$. Our third heuristic takes inspiration from the definition of the ARP to define a black-box (action-only-query) alignment heuristic (ARP-bb). ARP-bb first computes $\mathrm{ARP}(R)$, removes redundant half-space constraints via linear programming, queries the robot's policy for an action in each state $s$ that defines a non-redundant halfspace constraint $w^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0$ in $\mathrm{ARP}(R)$, and finally checks whether the sampled actions are optimal under $R$. A sketch of the CS heuristic is given below.
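The following is a minimal sketch of the CS heuristic as described above, assuming tabular Q-values under the tester's reward and black-box query access to the robot's policy; the names and the tie-breaking tolerance are illustrative.

```python
import numpy as np

def critical_state_test(Q_star, pi_robot_action, t=1.0):
    """CS heuristic: find states where acting matters (Q gap above threshold t),
    query the robot's policy there, and check its action is optimal under Q_star.

    Q_star: (S, A) optimal Q-values under the tester's reward R
    pi_robot_action: callable s -> action sampled from the robot's policy
    """
    gap = Q_star.max(axis=1) - Q_star.mean(axis=1)      # Q*(s, pi*(s)) - mean_a Q*(s, a)
    critical = np.flatnonzero(gap > t)
    for s in critical:
        a = pi_robot_action(s)
        if Q_star[s, a] < Q_star[s].max() - 1e-6:       # robot action not tester-optimal
            return False, critical
    return True, critical
```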
The secondpreference verifies that the agent would rather visit white cells than blue cells, and prefers an indirectpath to the goal state (green) rather than a more direct path that visits a blue cell. Shown in figures 2d,2e, and 2f are the query states for ARP-bb, SCOT, and CS heuristics, respectively. In each of thesetests the agent being tested is asked what action its policy would take in each of the states markedwith a question mark. To pass the test, the agent must respond with an action that is optimal actionunder the tester’s policy in each of these states. ARP-bb chooses two states where the halfspacesdefined by the expected feature counts of following the optimal policy versus taking a suboptimalaction and following the optimal policy fully define the ARP. SCOT asks queries for maximallyinformative trajectory that starts near the water. CS only reasons about Q-value differences and asksmany redundant queries. In Appendix E we show similar results for the lava world environment [23].7 a) Optimal policy (b) Preference query 1 (c) Preference query 2(d) ARP-bb queries (e) SCOT queries (f) CS queries Figure 2: Example value alignment verification tests for the island navigation domain.
We also evaluated the exact value alignment verification methods across a suite of grid navigation domains with varying numbers of states and reward features. We summarize our results here and refer the reader to Appendix F for the full details. By construction, ARP-w requires only one query (querying for $w'$) to achieve perfect accuracy. Using trajectory preferences to define the ARP (ARP-pref) also has perfect accuracy, but requires more queries to the robot. SCOT has sample complexity that is lower than the critical state methods, but much higher than directly querying reward function weights, since it queries for actions at states along each machine teaching trajectory. We found empirically that SCOT has nearly perfect accuracy, but occasionally has false positives. The ARP-inspired heuristic (ARP-bb) has low sample complexity and high accuracy, but sometimes has false positives. CS has significantly higher sample cost than the other methods and requires careful tuning of the threshold $t$ to obtain good performance. These results give evidence that the testing method of choice depends on the capability of the robot and the complexity of the environment relative to the robot's reward function. If the robot can report a ground-truth reward weight vector, then ARP-w has the best performance. If the robot can only answer trajectory preference queries, then ARP-pref should be used. When only given query access to the robot's policy, ARP-bb is preferable in domains where query costs are high and a few false positives are acceptable; if query costs are not an issue, then SCOT is preferable, since we found it to achieve fewer false positives in practice.

We next applied our approximate value alignment verification test to the continuous autonomous driving domain shown in Figure 3b, where we only assume implicit values for the human and robot [9, 30]. We tested the pipeline shown in Figure 1 by eliciting preferences from a simulated human and filtering the resulting questions for duplication, epsilon value gaps, and redundancy; the test's false positive rate (FPR) is then computed. Our 10 simulated humans are randomly generated reward weight vectors with unit $L_2$ norm in the "Driver" environment [9]. For preference elicitation we use a batch method proposed by Biyik and Sadigh [9]: a pair of trajectories that best restricts the remaining space of possible rewards is generated, and the simulated human gives its preference. This preference induces a posterior distribution over reward weights, which is then used to compute the next maximally informative pair of trajectories. Each of the 10 experiments consists of 1,000 pairs of trajectories and preferences. All other parameters are as in Biyik and Sadigh [9]. These preferences are then filtered for duplicates, for a difference in expected value of at least $\epsilon$ under $(1-\delta)$ of the posterior reward distribution, and for redundancy; see Appendices D and E.2 for details. The remaining preferences form our alignment test. If none of the constraints meets the $\epsilon$-$(1-\delta)$ value difference criterion, then we say that all agents pass the test. To evaluate these tests we uniformly sample 10,000 reward weights with unit $L_2$ norm, use all constraints that meet the $\epsilon$-$(1-\delta)$ criterion to determine ground-truth alignment of each reward, and then report the false positive rate for different values of $\epsilon$ in Figure 3. The largest average test size for any value of $\epsilon$ represented a 72x reduction from the 1,000 initial queries used to build the test. We additionally analyze our method with different human query budgets and on preferences generated according to the noise assumptions in Biyik and Sadigh [9], both with and without an additional noise-filtering step (see Appendix H for full results).

Figure 3: Approximate value alignment verification for a continuous autonomous driving domain. (a) $\epsilon$ is the maximum value error an agent can make before being considered misaligned; the average value gap under the ground-truth reward is 0.04, with a 5th percentile of 0.0003 and a 95th percentile of 0.13. (b) A preference query: the yellow trajectory of the white car is fixed, and the human is asked whether they prefer the blue or the red trajectory. (c) Accuracy of test questions from human preferences for different numbers of human queries and values of $\epsilon$; high accuracy is achieved for multiple values of $\epsilon$ across human preference budgets (see Appendix H.4 for more details).

Additionally, a small pilot study was run using actual human preferences. We elicited preferences from the authors using the information gain criterion from Biyik et al. [8]. These preferences were distilled into a test as above. Reward functions were then sampled randomly from a diagonal Gaussian distribution centered at the mean posterior reward, with a standard deviation chosen to provide a roughly balanced number of aligned and misaligned agents. An optimal trajectory under each reward function was generated and manually judged to be either aligned or misaligned by the authors. We evaluate our method by computing the accuracy of the test relative to these manual judgments. Our method correctly determines alignment of most of the reward functions for a range of $\epsilon$ values close to 1.0; more complete results are in Figure 3c. Eventually, $\epsilon$ becomes so large that most half-space constraints are not included in the test, resulting in many false positives.

We proposed and explored the novel problem of value alignment verification of autonomous agents, in which a human wants to verify the alignment of a robot's policy with respect to the human's reward function. Value alignment verification seeks to enable humans to verify and build trust in AI systems by designing a test that probes another agent via queries to see whether it conforms to the human's values. Distilling a human's preferences into a test allows humans to efficiently evaluate the performance of an autonomous agent according to either explicit or implicit human values. We developed a theoretical foundation for value alignment verification and proved sufficient conditions for verifying the alignment of a rational agent. Our theoretical results demonstrate that value alignment verification can be performed in a constant number of queries, as opposed to the logarithmic number required for active reward learning. Our empirical results demonstrate that heuristics based on machine teaching and value alignment provide good sample complexity and high accuracy while only requiring black-box access to an agent's policy. When the human has only implicit access to their values, active preference learning algorithms can be leveraged to automatically construct a high-confidence approximate value alignment test that can efficiently test a large number of agents. Future work includes relaxing rationality assumptions, empirically testing value alignment verification in more complex domains, and performing a full study using actual human preferences.

References
[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
[2] Abhijin Adiga, Sarit Kraus, Oleg Maksimov, and S. S. Ravi. Boolean games: Inferring agents' goals using taxation queries. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), 2020.
[3] Kareem Amin, Nan Jiang, and Satinder Singh. Repeated inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 1815–1824, 2017.
[4] Kareem Amin and Satinder Singh. Towards resolving unidentifiability in inverse reinforcement learning. arXiv preprint arXiv:1601.06569, 2016.
[5] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
[6] Yoram Bachrach, Richard Everett, Edward Hughes, Angeliki Lazaridou, Joel Z. Leibo, Marc Lanctot, Michael Johanson, Wojciech M. Czarnecki, and Thore Graepel. Negotiating team formation using deep reinforcement learning. Artificial Intelligence, 288:103356, 2020.
[7] André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4055–4065, 2017.
[8] Erdem Bıyık, Malayandi Palan, Nicholas C. Landolfi, Dylan P. Losey, and Dorsa Sadigh. Asking easy questions: A user-friendly approach to active reward learning. In Conference on Robot Learning (CoRL), 2019.
[9] Erdem Bıyık and Dorsa Sadigh. Batch active preference-based learning of reward functions. PMLR, 2018.
[10] Daniel S. Brown, Yuchen Cui, and Scott Niekum. Risk-aware active inverse reinforcement learning. In Proceedings of the 2nd Annual Conference on Robot Learning (CoRL), 2018.
[11] Daniel S. Brown, Wonjoon Goo, and Scott Niekum. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning (CoRL), 2019.
[12] Daniel S. Brown and Scott Niekum. Efficient probabilistic performance bounds for inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.
[13] Daniel S. Brown and Scott Niekum. Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7749–7758, 2019.
[14] Daniel S. Brown, Scott Niekum, Russell Coleman, and Ravi Srinivasan. Safe imitation learning via fast Bayesian reward inference from preferences. In International Conference on Machine Learning, 2020.
[15] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.
[16] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.
[17] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
[18] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems 29, pages 3909–3917, 2016.
[19] Josiah Hanna, Scott Niekum, and Peter Stone. Importance sampling policy evaluation with an estimated behavior policy. In Proceedings of the 36th International Conference on Machine Learning (ICML), June 2019.
[20] Sandy H. Huang, Kush Bhatia, Pieter Abbeel, and Anca D. Dragan. Establishing appropriate trust via critical states. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3929–3936. IEEE, 2018.
[21] Sandy H. Huang, David Held, Pieter Abbeel, and Anca D. Dragan. Enabling robots to communicate their objectives. In Robotics: Science and Systems, 2017.
[22] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.
[23] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
[24] Manuel Lopes, Francisco Melo, and Luis Montesano. Active learning for reward estimation in inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 31–46. Springer, 2009.
[25] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
[26] Matteo Pirotta and Marcello Restelli. Inverse reinforcement learning through policy gradient minimization. In AAAI, 2016.
[27] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
[28] Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015.
[29] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
[30] Dorsa Sadigh, Anca D. Dragan, S. Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Proceedings of Robotics: Science and Systems (RSS), July 2017.
[31] Peter Stone, Gal A. Kaminka, Sarit Kraus, Jeffrey S. Rosenschein, et al. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In AAAI, 2010.
[32] Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforcement learning. arXiv preprint arXiv:2009.08319, 2020.
[33] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
[34] Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In AAAI, pages 3000–3006, 2015.
[35] Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pages 9665–9675, 2019.
[36] Xiaojin Zhu. Machine teaching for Bayesian learners in the exponential family. In Advances in Neural Information Processing Systems, pages 1905–1913, 2013.
[37] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. arXiv preprint arXiv:1801.05927, 2018.
[38] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
A Theory and Proofs
A.1 Aligned Reward Polytopes

Theorem 1.
Given an MDP $M = (E, R)$, if the tester's reward function $R$ and subject's reward function $R'$ can be represented as linear combinations of features $\phi(s) \in \mathbb{R}^k$, i.e., $R(s) = w^T \phi(s)$, $R'(s) = w'^T \phi(s)$, then a sufficient condition for testing value alignment ($R' \in \mathrm{ARP}(R)$) is to test whether

$w' \in \bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b} \subseteq \mathrm{ARP}(R)$,  (9)

where

$H^R_{s,a,b} = \{w \mid w^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0\}$, if $a \in \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$ and $b \notin \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$, and $H^R_{s,a,b} = \mathbb{R}^k$ (i.e., non-constraining) otherwise.  (10)

Proof.
We will prove that $\bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b} \subseteq \mathrm{ARP}(R)$. Assume that $w' \in \bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b}$. This implies that, for all states $s \in \mathcal{S}$, optimal actions under $R$ have higher expected utility than suboptimal actions when evaluated under $w'$. Thus, there exists an optimal policy under $R$, call it $\pi^*_R$, that is also optimal under $w'$.

Now consider an optimal policy under $w'$, call it $\pi^*_{R'}$; we need to show that $\pi^*_{R'} \in \mathrm{OPT}(R)$. To do this, we prove by contradiction that $\pi^*_{R'}$ is optimal under $R$. The key idea is to compare the feature counts of $\pi^*_R$ and $\pi^*_{R'}$ after one step and notice that they must look equally appealing under $w'$. Assume for contradiction that $\pi^*_{R'} \notin \mathrm{OPT}(R)$. We know that $\pi^*_R \in \mathrm{OPT}(R')$. Thus, there must exist a state $s$ and actions $a, b$ such that $a \in \pi^*_R(s)$, $a \in \pi^*_{R'}(s)$, $b \in \pi^*_{R'}(s)$, but $b \notin \pi^*_R(s)$. Thus,

$w'^T(\Phi^{(s,a)}_{\pi^*_{R'}} - \Phi^{(s,b)}_{\pi^*_{R'}}) = 0$  (11)
$\Rightarrow w'^T \Phi^{(s,a)}_{\pi^*_{R'}} = w'^T \Phi^{(s,b)}_{\pi^*_{R'}}$.  (12)

By the construction of $H^R_{s,a,b}$, we also have

$w'^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0$  (13)
$\Rightarrow w'^T \Phi^{(s,a)}_{\pi^*_R} > w'^T \Phi^{(s,b)}_{\pi^*_R}$.  (14)

We have previously shown that both $\pi^*_R$ and $\pi^*_{R'}$ are optimal policies under $R'$. This means that, for all states $j$ and actions $k$,

$Q^{\pi^*_R}_{R'}(j,k) = Q^{\pi^*_{R'}}_{R'}(j,k)$  (15)
$\Rightarrow w'^T \Phi^{(j,k)}_{\pi^*_R} = w'^T \Phi^{(j,k)}_{\pi^*_{R'}}$,  (16)

and so in particular

$w'^T \Phi^{(s,a)}_{\pi^*_R} = w'^T \Phi^{(s,a)}_{\pi^*_{R'}}$  (17)
$w'^T \Phi^{(s,b)}_{\pi^*_R} = w'^T \Phi^{(s,b)}_{\pi^*_{R'}}$.  (18)

By substituting Equations (17) and (18) into Equation (12), we arrive at

$w'^T \Phi^{(s,a)}_{\pi^*_R} = w'^T \Phi^{(s,b)}_{\pi^*_R}$,  (19)

which contradicts Equation (14) and yields the desired contradiction. We have made only one assumption, that there is a state where there is an action taken by $\pi^*_{R'}$ but not $\pi^*_R$, so it must be the case that at all states, every action taken by $\pi^*_{R'}$ is also taken by $\pi^*_R$. This means that all optimal actions under $\pi^*_{R'}$ are also optimal under $\pi^*_R$. Therefore, $\arg\max_a Q^*_{R'}(s,a) \subseteq \arg\max_a Q^*_R(s,a)$. This proves that $R' \in \mathrm{ARP}_M(R)$ as desired. Thus, $\bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b} \subseteq \mathrm{ARP}_M(R)$.

A.2 $\epsilon$-Alignment Verification via Omnipotent Testing

In this section, we consider the case where the testing agent is able to construct a set of arbitrary test MDPs to verify value alignment across a family of environments that may have different transitions, actions, initial state distributions, and discount factors, but that share the same reward function over states. Amin and Singh [4] prove that an omnipotent active learner can determine the reward function of another agent within $\epsilon$ precision via $O(\log(|\mathcal{S}|) + \log(1/\epsilon))$ active policy queries. We extend this result to the case of value alignment testing. We first prove that if two agents' reward functions are sufficiently similar, then we can guarantee $\epsilon$-value alignment.

Lemma 1. If $\|R(s) - R'(s)\|_\infty \le \epsilon(1-\gamma)/2$, where $\gamma$ is the discount factor and $\epsilon$ is any non-negative error term, then rational agents that have reward functions $R(s)$ and $R'(s)$ are $\epsilon$-value aligned across all MDPs that share the reward function $R(s)$.

Proof.
To be $\epsilon$-value aligned we must have $V^{\pi^*_R}_R - V^{\pi'}_R \le \epsilon$, where $\pi'$ is optimal under $R'$. To prove the lemma we must show that an adversary that can change the reward function from $R$ to $R'$, within the constraint $\|R(s) - R'(s)\|_\infty \le \epsilon(1-\gamma)/2$, cannot make $V^{\pi^*_R}_R - V^{\pi'}_R > \epsilon$ under any MDP.

To make value alignment adversarially bad, we want to maximize $V^{\pi^*_R}_R - V^{\pi'}_R$. Writing this out in terms of expectations over rewards, we have:

$V^{\pi^*_R}_R - V^{\pi'}_R = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_t \sim \pi^*_R] - \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_t \sim \pi']$.  (20)

To create an adversarial MDP we wish to find a reward function $R'$ such that $V^{\pi^*_R}_R > V^{\pi'}_R$ and $V^{\pi^*_R}_{R'} < V^{\pi'}_{R'}$. The intuition is that we want to adversarially construct $R'$ such that it makes $\pi' = \pi^*_{R'}$ look better than $\pi^*_R$ under $R'$ while forcing the true policy loss ($V^{\pi^*_R}_R - V^{\pi'}_R$) to be as large as possible.

We now consider the maximal possible perturbation via an adversarial reward function $R'$. Given the constraint $\|R'(s) - R(s)\|_\infty \le \epsilon(1-\gamma)/2$, the maximal difference at each state between $R'$ and $R$ is $\epsilon(1-\gamma)/2$. In the worst case, the adversary creates $R'$ by subtracting $\epsilon(1-\gamma)/2$ from the true reward ($R'(s) = R(s) - \epsilon(1-\gamma)/2$) at states visited by $\pi^*_R$, to make them look as bad as possible, and makes the states visited by $\pi'$ look as good as possible by adding $\epsilon(1-\gamma)/2$ to the true reward at those states ($R'(s) = R(s) + \epsilon(1-\gamma)/2$). Thus, we have in the worst case

$V^{\pi^*_R}_{R'} = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R'(s_t) \mid s_t \sim \pi^*_R]$  (21)
$= \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t (R(s_t) + R'(s_t) - R(s_t)) \mid s_t \sim \pi^*_R]$  (22)
$= \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t (R(s_t) - \epsilon(1-\gamma)/2) \mid s_t \sim \pi^*_R]$  (23)
$= V^{\pi^*_R}_R - \frac{\epsilon(1-\gamma)}{2(1-\gamma)}$  (24)
$= V^{\pi^*_R}_R - \epsilon/2$.  (25)

Similarly, we have in the worst case

$V^{\pi'}_{R'} = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R'(s_t) \mid s_t \sim \pi']$  (26)
$= \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t (R(s_t) + R'(s_t) - R(s_t)) \mid s_t \sim \pi']$  (27)
$= \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t (R(s_t) + \epsilon(1-\gamma)/2) \mid s_t \sim \pi']$  (28)
$= V^{\pi'}_R + \frac{\epsilon(1-\gamma)}{2(1-\gamma)}$  (29)
$= V^{\pi'}_R + \epsilon/2$.  (30)

The adversarial perturbation of the reward function will only be successful if, as noted previously, we have $V^{\pi^*_R}_R > V^{\pi'}_R$ and $V^{\pi^*_R}_{R'} < V^{\pi'}_{R'}$.
Substituting the values above, we have in the worst case that

$V^{\pi^*_R}_{R'} < V^{\pi'}_{R'}$  (31)
$\Rightarrow V^{\pi^*_R}_R - \epsilon/2 < V^{\pi'}_R + \epsilon/2$  (32)
$\Rightarrow V^{\pi^*_R}_R < V^{\pi'}_R + \epsilon$  (33)
$\Rightarrow V^{\pi^*_R}_R - V^{\pi'}_R < \epsilon$.  (34)

Thus, we have shown that under the assumption $\|R(s) - R'(s)\|_\infty \le \epsilon(1-\gamma)/2$, the subject agent with reward function $R'$ is $\epsilon$-value aligned with the tester's reward function $R$ under all possible MDPs that share the reward function $R$.

Note that if we scale the reward of an agent by a positive constant or shift it by a constant vector, we can make the difference look arbitrarily large even if the two rewards lead to the same optimal policy. This is undesirable for computing value alignment in terms of reward differences. Thus, it only makes sense to compare rewards if they are similarly normalized. We utilize a canonical form for reward functions defined by the transformation $(R(s) - \min_s R(s)) / (\max_s R(s) - \min_s R(s))$, such that the values of the reward function are scaled to be between 0 and 1 [4]. Following the notation of Amin and Singh [4], we use $[R]$ to denote the canonical form of reward function $R$.

Given the ability to construct arbitrary testing environments, we can guarantee $\epsilon$-value alignment over all MDPs that share the reward function $R$. The following theorem is inspired by Amin and Singh [4], who prove an analogous theorem for the case of actively querying an expert to approximate the expert's reward function. The proof of Amin and Singh [4] relies on binary search, and the query algorithm they derive has query complexity $O(\log(|\mathcal{S}|) + \log(1/\epsilon))$, where each query requires the expert to specify a complete policy for a new MDP. In contrast, our proof is based instead on machine teaching (the tester knows what it is testing for), and we prove that in the case of value alignment verification we only require $O(1)$ queries. In fact, we only need two test MDPs, where for each test MDP we query the agent whether it prefers one of two different policies in that test MDP.

Theorem 2.
Given a testing reward $R$, there exists a two-query test (complexity $O(1)$) that determines $\epsilon$-value alignment of a rational agent over all MDPs that share the same state space and reward function $R$, but may differ in actions, transitions, discount factors, and initial state distributions.

Proof. By Lemma 1 we want a test that guarantees $\|[R'] - [R]\|_\infty \le \epsilon(1-\gamma)/2$. Thus we need

$|[R'](s) - [R](s)| < \epsilon(1-\gamma)/2, \forall s \in \mathcal{S}$  (35)
$\Leftrightarrow [R](s) - \epsilon(1-\gamma)/2 < [R'](s) < [R](s) + \epsilon(1-\gamma)/2, \forall s \in \mathcal{S}$.  (36)

We use the notation $[R]$ and $[R']$ to represent the canonical versions of $R$ and $R'$, the tester's and subject's reward functions, respectively. If we can directly query for $R'$, then we simply compute $\|R - R'\|_\infty$ and check whether it is less than $\epsilon(1-\gamma)/2$. We now consider the case where we can only query the agent about policy preferences. We define $s_{\max} = \arg\max_s R(s)$, $s_{\min} = \arg\min_s R(s)$, $s'_{\max} = \arg\max_s R'(s)$, and $s'_{\min} = \arg\min_s R'(s)$, and we assume that $s_{\max}$ and $s_{\min}$ are unique.

We now create a testing environment $E$ such that from each state there is an action $a_0$ that self-transitions and an action $a_1$ that goes from each state to the max-reward state with probability $\alpha_s$ and to the min-reward state with probability $(1 - \alpha_s)$, except in states $s_{\min}$ and $s_{\max}$, in which all transitions via $a_0$ and $a_1$ are self-transitions. Thus, taking action $a_1$ represents a gamble between the states with minimum and maximum reward under the tester's reward function $R$.

For $s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$, we design two different transition dynamics with the parameters $\alpha^U$ and $\alpha^L$ such that

$\alpha^L_s = \max([R](s) - \epsilon(1-\gamma)/2, 0)$ and $\alpha^U_s = \min([R](s) + \epsilon(1-\gamma)/2, 1)$.

Then we construct two test environments $E_L$ and $E_U$: $E_L$ has $\alpha^L$ as the transitions and $E_U$ has $\alpha^U$ as the transitions. We then design two test questions:

1. Is $\pi_{a_0} \succ \pi_{a_1}$ in MDP $E_L$?
2. Is $\pi_{a_1} \succ \pi_{a_0}$ in MDP $E_U$?

where $\pi_{a_i}$ is the policy that always takes action $a_i$.

If the agent answers "YES" to the first question, then $\forall s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$ we know that $a_0$ is preferred to $a_1$. Thus the agent prefers to self-transition at a state rather than take action $a_1$, which probabilistically transitions to $s_{\max}$ and $s_{\min}$. Thus, under the subject agent's unknown reward $R'$, the following inequality holds for all $s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$:

$\alpha^L_s R'(s_{\max}) + (1 - \alpha^L_s) R'(s_{\min}) < R'(s)$  (37)
$\Leftrightarrow \alpha^L_s R'(s_{\max}) + (1 - \alpha^L_s) R'(s_{\min}) - R'(s'_{\min}) < R'(s) - R'(s'_{\min})$  (38)
$\Leftrightarrow \alpha^L_s (R'(s_{\max}) - R'(s'_{\min})) + (1 - \alpha^L_s)(R'(s_{\min}) - R'(s'_{\min})) < R'(s) - R'(s'_{\min})$  (39)
$\Leftrightarrow \alpha^L_s \frac{R'(s_{\max}) - R'(s'_{\min})}{R'(s'_{\max}) - R'(s'_{\min})} + (1 - \alpha^L_s) \frac{R'(s_{\min}) - R'(s'_{\min})}{R'(s'_{\max}) - R'(s'_{\min})} < \frac{R'(s) - R'(s'_{\min})}{R'(s'_{\max}) - R'(s'_{\min})}$  (40)
$\Leftrightarrow \alpha^L_s [R'](s_{\max}) + (1 - \alpha^L_s)[R'](s_{\min}) < [R'](s)$,  (41)
and similarly, if the agent answers "YES" to question 2, we have

$R'(s) < \alpha^U_s R'(s_{\max}) + (1 - \alpha^U_s) R'(s_{\min})$  (42)
$\Leftrightarrow [R'](s) < \alpha^U_s [R'](s_{\max}) + (1 - \alpha^U_s)[R'](s_{\min})$.  (43)

These inequalities hold for all $s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$. We now prove that answering "YES" to both questions 1 and 2 also means that $s'_{\max} = \arg\max_s R'(s) = \arg\max_s R(s) = s_{\max}$. We prove this by contradiction. Assume that $s_{\max} \ne s'_{\max}$; then $s'_{\max}$ is one of the states for which the subject answered question 2 in the affirmative. Thus, we know that

$[R'](s'_{\max}) < \alpha^U_s [R'](s_{\max}) + (1 - \alpha^U_s)[R'](s_{\min})$  (44)
$\Rightarrow 1 < \alpha^U_s [R'](s_{\max}) + (1 - \alpha^U_s)[R'](s_{\min})$  (45)
$\Rightarrow 1 < \alpha^U_s + (1 - \alpha^U_s) = 1$,  (46)

where the second line uses the fact that $[R'](s'_{\max}) = 1$, and the third uses the fact that, by definition, $[R'](s) \le 1, \forall s \in \mathcal{S}$. Thus $1 < 1$, which provides the desired contradiction. Therefore, we must have $s'_{\max} = s_{\max}$.

Similarly, we prove that $s'_{\min} = \arg\min_s R'(s) = \arg\min_s R(s) = s_{\min}$ by contradiction. Assume that $s_{\min} \ne s'_{\min}$; then $s'_{\min}$ is one of the states for which the subject answered question 1 in the affirmative. Thus, we know that

$\alpha^L_s [R'](s_{\max}) + (1 - \alpha^L_s)[R'](s_{\min}) < [R'](s'_{\min})$  (47)
$\Rightarrow \alpha^L_s [R'](s_{\max}) + (1 - \alpha^L_s)[R'](s_{\min}) < 0$  (48)
$\Rightarrow 0 < 0$.  (49)

The second line uses the fact that, by definition, $[R'](s'_{\min}) = 0$. The third line uses the fact that, by definition, $[R'](s_{\max}) \ge 0$ and $[R'](s_{\min}) \ge 0$. This provides the desired contradiction, so we must have $s_{\min} = s'_{\min}$.

Combining the above results, we have (assuming the subject answers "YES" to questions 1 and 2) that $[R](s_{\max}) = [R'](s_{\max}) = 1$ and $[R](s_{\min}) = [R'](s_{\min}) = 0$. Additionally, we know for all $s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$ that

$\alpha^L_s R'(s_{\max}) + (1 - \alpha^L_s) R'(s_{\min}) < R'(s) < \alpha^U_s R'(s_{\max}) + (1 - \alpha^U_s) R'(s_{\min})$  (50)
$\Rightarrow \alpha^L_s [R'](s_{\max}) + (1 - \alpha^L_s)[R'](s_{\min}) < [R'](s) < \alpha^U_s [R'](s_{\max}) + (1 - \alpha^U_s)[R'](s_{\min})$
$\Rightarrow \alpha^L_s < [R'](s) < \alpha^U_s$  (51)
$\Rightarrow \max([R](s) - \epsilon(1-\gamma)/2, 0) < [R'](s) < \min([R](s) + \epsilon(1-\gamma)/2, 1)$  (52)
$\Rightarrow |[R'](s) - [R](s)| < \epsilon(1-\gamma)/2$.  (53)

Thus, we have $\|[R'] - [R]\|_\infty < \epsilon(1-\gamma)/2$, so by Lemma 1 we have verified $\epsilon$-value alignment via two policy preference queries, as desired. A code sketch of this construction is given below.
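A minimal sketch of this construction, assuming a finite state space and a reward given as a vector over states: it computes the gamble probabilities $\alpha^L_s$ and $\alpha^U_s$ that define the two test MDPs $E_L$ and $E_U$. The surrounding MDP plumbing is omitted, and all names are illustrative.

```python
import numpy as np

def canonical(R):
    """Scale a reward vector to [0, 1] (the canonical form [R])."""
    return (R - R.min()) / (R.max() - R.min())

def omnipotent_test_gambles(R, eps, gamma):
    """Transition probabilities alpha^L_s, alpha^U_s for the two test MDPs E_L, E_U.
    In each test MDP, action a0 self-transitions and action a1 moves to s_max
    w.p. alpha_s and to s_min w.p. 1 - alpha_s."""
    Rc = canonical(R)
    slack = eps * (1 - gamma) / 2
    alpha_L = np.maximum(Rc - slack, 0.0)
    alpha_U = np.minimum(Rc + slack, 1.0)
    # Query 1: does the agent prefer pi_{a0} over pi_{a1} in E_L?
    # Query 2: does the agent prefer pi_{a1} over pi_{a0} in E_U?
    # Two YES answers certify eps-value alignment (Theorem 2).
    return alpha_L, alpha_U
```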
B Relationship of the ARP to Ng and Russell's Consistent Reward Sets

In this section we discuss the relationship between our approach and the foundational work on IRL by Ng and Russell [25]. We define the set of rewards consistent with an optimal policy as follows:
Definition 1.
Given an environment $E$, the consistent reward set (CRS) of a policy $\pi$ in environment $E$ is defined as the set of reward functions under which $\pi$ is optimal:
$$\text{CRS}(\pi) = \{w \in \mathbb{R}^k \mid \pi \text{ is optimal with respect to } R(s) = w^T \phi(s)\}. \tag{54}$$
The fundamental theorem of inverse reinforcement learning [25] characterizes the set of all consistent reward functions as a set of linear inequalities for finite MDPs.

Proposition 1. [25] Given an environment $E$ with finite state and action spaces, $R \in \text{CRS}(\pi)$ if and only if
$$(P_\pi - P_a)(I - \gamma P_\pi)^{-1} R \geq 0, \quad \forall a \in \mathcal{A}, \tag{55}$$
where $P_a$ is the transition matrix associated with always taking action $a$, $P_\pi$ is the transition matrix associated with policy $\pi$, and $R$ is the column vector of rewards for each state in the MDP. This condition can be checked numerically, as sketched below.
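As an illustration, the inequality in Proposition 1 can be verified directly for tabular MDPs. The sketch below assumes dense numpy transition matrices; the function name and the floating-point tolerance are illustrative rather than part of our method.

```python
import numpy as np

def in_crs(R, P_pi, P_actions, gamma):
    """Check Proposition 1: is the reward vector R consistent with policy pi?

    R         : (n,) reward per state
    P_pi      : (n, n) transition matrix of the policy being checked
    P_actions : list of (n, n) transition matrices, one per action
    """
    n = len(R)
    # Discounted state-occupancy operator (I - gamma * P_pi)^{-1}
    occ = np.linalg.inv(np.eye(n) - gamma * P_pi)
    # R is in the CRS iff every action's inequality holds (small tolerance)
    return all(np.all((P_pi - P_a) @ occ @ R >= -1e-9) for P_a in P_actions)
```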
When the reward function is a linear combination of features, we obtain the following corollary.

Corollary 1. [13, 25] Given an environment $E$, $\text{CRS}(\pi)$ is given by the following intersection of half-spaces:
$$\{w \in \mathbb{R}^k \mid w^T(\Phi_\pi^{(s,a)} - \Phi_\pi^{(s,b)}) \geq 0, \ \forall a \in \text{support}(\pi(s)),\ b \in \mathcal{A},\ s \in \mathcal{S}\}. \tag{56}$$

Proof.
In every state $s$ there are one or more optimal actions $a \in \text{support}(\pi(s))$. For each such optimal action, we have by the definition of optimality that
$$Q^*(s,a) \geq Q^*(s,b), \quad \forall b \in \mathcal{A}. \tag{57}$$
Rewriting this in terms of expected discounted feature counts, we have
$$w^T \Phi_\pi^{(s,a)} \geq w^T \Phi_\pi^{(s,b)}, \quad \forall b \in \mathcal{A}. \tag{58}$$
Thus, the entire feasible region is the intersection of the half-spaces
$$w^T(\Phi_\pi^{(s,a)} - \Phi_\pi^{(s,b)}) \geq 0, \tag{59}$$
$$\forall a \in \text{support}(\pi(s)),\ b \in \mathcal{A},\ s \in \mathcal{S}, \tag{60}$$
and the feasible region is therefore convex. □

The consistent reward set of a demonstration from an optimal policy can be defined similarly:

Corollary 2. [13] Given a set of demonstrations $\mathcal{D}$ from a policy $\pi$, $\text{CRS}(\mathcal{D} \mid \pi)$ is given by the following intersection of half-spaces:
$$w^T(\Phi_\pi^{(s,a)} - \Phi_\pi^{(s,b)}) \geq 0, \quad \forall (s,a) \in \mathcal{D},\ b \in \mathcal{A}. \tag{61}$$

Proof.
The proof follows from the proof of Corollary 1 by only considering the half-spaces corresponding to the optimal $(s,a)$ pairs in the demonstration. □

It is important to note that while Corollary 1 may seem to solve the alignment verification problem, it only provides a necessary, not a sufficient, condition. Just because a reward function lies within the CRS of a policy does not mean the agents are aligned. Consider the all-zero reward: it lies in the CRS of every policy; however, an agent optimizing the zero reward can end up with any policy. Even ignoring the all-zero reward, there can be rewards on the boundary of the CRS polytope that are consistent with a policy but not value aligned, since they admit more than one optimal policy, one or more of which may not be optimal under the tester's reward function. A sketch of assembling these half-space constraints from feature expectations follows.
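For concreteness, the half-space constraints of Corollaries 1 and 2 can be built directly from expected discounted feature counts. The following sketch assumes a tabular array layout for the feature expectations; the array shape and names are illustrative assumptions, not our implementation.

```python
import numpy as np

def crs_halfspaces(Phi, pi_opt):
    """Build the half-space normals of Corollary 1 from feature expectations.

    Phi    : (n_states, n_actions, k) array; Phi[s, a] is the expected
             discounted feature count of taking action a in s and then
             following pi
    pi_opt : list of sets; pi_opt[s] = indices of optimal actions in s
    Returns an (m, k) array H such that CRS = {w : H @ w >= 0}.
    """
    normals = []
    n_states, n_actions, _ = Phi.shape
    for s in range(n_states):
        for a in pi_opt[s]:               # a in support(pi(s))
            for b in range(n_actions):    # all alternative actions b
                normals.append(Phi[s, a] - Phi[s, b])
    return np.array(normals)
```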
C Value Alignment Verification with Explicit Values

Proposition 2.
Under the assumption of a rational subject agent that shares the same linear reward features as the tester, efficient exact value alignment verification is possible in the following query settings: (1) query access to the reward function weights $w'$, (2) query access to samples of the reward function $R'(s)$, (3) query access to $Q^*_{R'}(s,a)$, and (4) query access to preferences over trajectories.

Proof. The proof of case (1) follows directly from Theorem 2.

In case (2), the tester can query for samples of the reward function $R'(s)$. If the tester only has query access to $R'(s)$, the same test can be used, since the tester knows the features $\phi(s)$ and can solve a system of linear equations to recover the weight vector $w'$ after sampling a sufficient number of $R'(s)$ values. Note that this also works for rewards that are functions of $(s,a)$ and $(s,a,s')$. The number of required samples is equal to $\text{rank}(\Phi)$, where $\Phi$ is the matrix whose rows are the features $\phi(s)$ of each unique state. Thus, in the worst case we need only $k$ samples from the subject's reward function, giving a system with $k$ unknowns and $k$ equations. If there is noise in the sampling procedure, then linear regression can be used to efficiently estimate the subject's weight vector $w'$. After recovering the weight vector, the same value alignment test used for case (1) can be applied.

In case (3), the tester has access to the value function of the subject. If the tester can query the subject agent's value function, then $w'$ can be recovered by solving a linear system of equations, since for any agent
$$R(s) = w^T \phi(s) = Q(s,a) - \gamma\, \mathbb{E}_{s' \mid s,a}\left[\max_{a'} Q(s', a')\right] \tag{62}$$
and the tester knows $\phi(s)$ and can query for $Q(s,a)$. As in case (2), we need only $\text{rank}(\Phi) \leq k$ queries to the subject's value function, and linear regression can be used if there is noise in the sampling process. The tester can then verify value alignment via the reward-function value alignment test used in case (1).

In case (4), the tester only has access to the subject's values via preference queries over trajectories. If the subject agent can answer pairwise preferences over trajectories, then a value alignment test can be constructed via the ARP. Each preference over trajectories $\xi_A \prec \xi_B$ induces the constraint $w^T(\Phi(\xi_B) - \Phi(\xi_A)) > 0$. Thus, given a test $T$ consisting of preferences over trajectories, we can guarantee value alignment if
$$\{w \mid w^T(\Phi(\xi_B) - \Phi(\xi_A)) > 0, \ \forall (\xi_A, \xi_B) \in T\} \subseteq \text{ARP}(w). \tag{63}$$
Note that a single trajectory in general will not exactly match the successor features of a stochastic policy. However, by synthesizing arbitrary trajectories we can create more half-space constraints than are used to define the ARP, since these trajectories do not need to be the product of a rational policy. As more trajectory queries are asked, the estimate of the ARP approaches the true ARP. Brown et al. [11] proved that, given random half-space constraints, the volume of the constrained polytope decreases exponentially; thus only a logarithmic number of queries is needed to accurately approximate the ARP. A sketch of the weight-recovery step shared by cases (2) and (3) follows.
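The weight-recovery step in cases (2) and (3) is ordinary least squares. A minimal sketch, assuming the tester has stacked its queried feature vectors and the subject's responses into arrays (names are illustrative):

```python
import numpy as np

def recover_weights(phi, r_samples):
    """Recover the subject's linear reward weights from sampled values.

    phi       : (m, k) feature matrix, one row phi(s) per queried state
    r_samples : (m,) sampled rewards R'(s), or in case (3) the Bellman
                residuals Q(s,a) - gamma * E[max_a' Q(s',a')] from Eq. (62)
    Least squares also handles noisy samples, acting as linear regression.
    """
    w, *_ = np.linalg.lstsq(phi, r_samples, rcond=None)
    return w
```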
D Value Alignment Verification Heuristics

In this section we discuss the value alignment verification heuristics in more detail.
D.1 Critical State-Action Value Alignment Heuristic
Prior work by Huang et al. [20] seeks to build human-agent trust by asking an agent for its critical states, where critical states are defined as those satisfying
$$Q^*_{R^*}(s, \pi^*_{R^*}(s)) - \frac{1}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} Q^*_{R^*}(s,a) > t \tag{64}$$
for some user-defined threshold $t$. If $t = 0$, then all states are critical states. On the other hand, for large $t$, no states are critical. Thus, $t$ must be carefully tuned to the scale of the reward function and to the particulars of the MDP. Huang et al. [20] also proposed finding critical states in terms of states with policy entropy below some threshold $t$, but found that state-action value critical states performed better. Furthermore, using entropy would label every state as critical for a deterministic policy. State-action value critical states can be computed for both deterministic and stochastic policies, so we only compare against state-action value critical states.

One possible way to use critical states for a value alignment heuristic would be to ask an agent for its critical states and then check whether those match the tester's critical states. However, this is problematic: reward scale is not fixed and there are infinitely many reward functions that lead to the same policy [25], so the gap in Q-values can be arbitrarily large. Thus $t$ would have to be carefully constructed and tuned for both the tester and the agent, making this impractical. Instead, we simply calculate the critical states for the tester under a tester-defined $t$ and then test whether the action that the agent being tested would take in each of the tester's critical states is also optimal under the tester's value function.

This results in the following value alignment heuristic (sketched in code below):
(1) Find critical states in the true MDP for a threshold $t \geq 0$ and save the $(s,a)$ pairs.
(2) For each critical state-action $(s,a)$, query the subject for its action in state $s$ and check whether it is an optimal action under the tester's reward function.
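A minimal sketch of this heuristic, assuming a tabular array of the tester's optimal Q-values and a hypothetical `query_subject_action` callable standing in for the subject agent:

```python
import numpy as np

def critical_state_test(Q, t, query_subject_action):
    """Critical-state heuristic (a sketch, not a guarantee of alignment).

    Q                    : (n_states, n_actions) tester's optimal Q-values
    t                    : tester-defined criticality threshold
    query_subject_action : callable s -> subject's action in state s
    Returns True if the subject passes the test.
    """
    advantage = Q.max(axis=1) - Q.mean(axis=1)  # gap from Eq. (64)
    critical = np.flatnonzero(advantage > t)    # tester's critical states
    for s in critical:
        a_subject = query_subject_action(s)
        # Pass only if the subject's action is optimal under the tester's Q
        if not np.isclose(Q[s, a_subject], Q[s].max()):
            return False
    return True
```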
D.2 Aligned Reward Polytope Black-Box Heuristic

For this heuristic, the tester computes $\text{ARP}(R)$ for its reward function $R$ and then finds the state-action successor features whose induced constraints form the minimal set of constraints defining the ARP. In other words, we find the minimal set of constraints using linear programming, as discussed in Section H.2. To run a verification test, we simply take the set of states corresponding to the minimal set of constraints. Each of these constraints has the form
$$w^T(\Phi_{\pi^*}^{(s,a)} - \Phi_{\pi^*}^{(s,b)}) > 0 \tag{65}$$
for all $a \in \arg\max_{a'} Q^*(s,a')$. The test then consists of asking the agent being tested which action it would take in state $s$ and checking whether that action is optimal under the tester's reward function.
D.3 SCOT Trajectory-Based Heuristic

We also adapt the set cover optimal teaching (SCOT) algorithm [13] for value alignment verification. As in the original paper [13], we first compute feature expectations and then calculate the minimal set of constraints that define the consistent reward set (CRS) using Corollary 1. We then roll out $m$ trajectories from each initial state using the teacher's policy and calculate the CRS of the rollouts using Corollary 2. Finally, we run set cover to find the minimum set of rollouts of length $H$ that implicitly covers the CRS; a greedy sketch of this covering step is given at the end of this subsection.

Given the machine teaching demonstrations from SCOT, we mask the actions and ask the agent being tested what action it would take in each state, then compare this action with the machine teaching action. In particular, we implement this by querying the subject agent for an action at each state $s$ and checking whether this action is optimal under the tester's reward function.

[Figure 4: Example value alignment verification tests for the lava world domain. Panels: (a) optimal policy, (b) preference query 1, (c) preference query 2, (d) ARP black-box queries, (e) SCOT queries, (f) critical state queries.]

Note that none of the heuristics above is guaranteed to verify value alignment, and each may give false positives. However, all are designed to never give a false negative.
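The covering step can be implemented with the standard greedy set-cover approximation. The sketch below is illustrative; the mapping from rollouts to the CRS constraints they cover is assumed to be precomputed.

```python
def greedy_set_cover(universe, candidate_sets):
    """Greedy set cover, a sketch of SCOT's covering step.

    universe       : set of constraint indices defining the CRS
    candidate_sets : dict mapping rollout id -> set of constraint indices
                     covered by that rollout's (s, a) pairs
    Returns a list of rollout ids that together cover the universe.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the rollout covering the most still-uncovered constraints
        best = max(candidate_sets,
                   key=lambda r: len(candidate_sets[r] & uncovered))
        gained = candidate_sets[best] & uncovered
        if not gained:  # remaining constraints cannot be covered
            break
        chosen.append(best)
        uncovered -= gained
    return chosen
```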
E Case Study Continued
To illustrate the types of test queries found via value alignment verification, we consider two domains inspired by the AI safety grid worlds [23]. The first domain, island navigation, is shown in Section 5.1.1. We now discuss another domain inspired by the AI safety gridworlds: lava world. This domain is shown in Figure 4. Figure 4a shows the optimal policy under the tester's reward function
$$R(s) = 50 \cdot \text{green}(s) - c_1 \cdot \text{white}(s) - c_2 \cdot \text{red}(s), \quad 0 < c_1 < c_2, \tag{66}$$
where $\text{color}(s)$ is an indicator feature for the color of the grid cell, so white cells incur a small penalty and red (lava) cells a large one. Shown in Figures 4b and 4c are the two preference queries generated by ARP-pref. In both cases the query consists of two trajectories (shown in black and orange for visualization), and the agent taking the test must decide which trajectory is preferable (we chose the colors such that the black trajectory is preferable to orange). Preference query 1 verifies that the agent would rather move to the terminal state (green) than visit white cells. The second preference query verifies that the agent would rather visit white cells than red cells, and would rather take an indirect path to the goal state (green) than a more direct path that visits a blue cell. Note that the black trajectory in preference query 2 first goes up, which results in a self transition, then goes left to get out of the lava. Shown in Figures 4d, 4e, and 4f are the query states for the ARP-bb, SCOT, and CS heuristics, respectively. In each of these tests, the agent being tested is asked what action its policy would take in each of the states marked with a question mark. To pass the test, the agent must respond with an action that is optimal under the tester's reward function in each of these states. ARP-bb chooses two states where the half-spaces defined by the expected feature counts of following the optimal policy versus taking a suboptimal action and then following the optimal policy fully define the ARP.
F Value Alignment Verification with Idealized Human Tester

We compare the heuristics above with the exact alignment tests described previously that query for the robot's reward function (ARP-w) and query for preferences over trajectories (ARP-pref). Because the tests are designed to always pass aligned agents, we evaluate them on misaligned agents: we constructed a suite of grid navigation domains with varying numbers of states and reward features, and generated 50 different misaligned agents by sampling random reward functions and comparing the resulting optimal policies to the optimal policy under a randomly chosen ground-truth reward function. Figure 5 (a) and (b) show that, for a fixed number of features, the size of the test generated via the critical state heuristic with threshold $t = 0.2$ (CS-0.2) scales poorly with the size of the grid world, even though the complexity of the reward function stays constant. The threshold $t$ has a large impact on performance: small $t$ results in better accuracy at the cost of significantly more queries, and larger $t$ results in significantly more false positives. We chose $t = 0.2$ to minimize false positives while also attempting to keep the test size small. In Figure 5 (c) and (d) we plot how the number of constraints grows as the reward function dimension increases while the MDP size is fixed. The plot for ARP-bb shows that the number of constraints grows with the size of the reward weight vector, as expected. Conversely, the number of critical states has the undesirable property of growing with the size of the MDP, regardless of the complexity of the underlying reward function.

[Figure 5: Queries vs. accuracy (1 − false positive rate) for value alignment testing of misaligned agents. Exact alignment tests (ARP-w and ARP-pref) achieve good efficiency and perfect accuracy.]

By construction, ARP-w requires only one query (querying for $w'$) to achieve perfect accuracy. Using trajectory preferences to define the ARP (ARP-pref) also has perfect accuracy, but requires more queries to the robot. SCOT has sample complexity that is lower than the critical state methods, but much higher than querying directly for reward function weights, since it queries for actions at states along each machine teaching trajectory. We found empirically that SCOT has nearly perfect accuracy, but occasionally produces false positives. The ARP-inspired heuristic (ARP-bb) has low sample complexity and high accuracy, but sometimes produces false positives. These results suggest that the testing method of choice depends on the capability of the robot and the complexity of the environment relative to the robot's reward function. If the robot can report its reward weight vector, then ARP-w has the best performance. If the robot can only answer trajectory preference queries, then ARP-pref should be used. The heuristics (ARP-bb, SCOT, and CS) have higher query costs and lower accuracy, but are applicable when we only have query access to the robot's policy and the robot may not be perfectly rational.
G Details on Value Alignment Verification with Human Tester

This method can be extended to test for $\epsilon$-value alignment. In continuous or complex environments, some trajectories may be too close in value for the subject to correctly tell the difference. This may be because the subject being tested has a reward function not exactly within the ARP of the tester, or because the agent being tested is not perfectly rational. To test the alignment of agents like these, we compute a $(1-\delta)$-confidence $\epsilon$-ARP.

As each $w$ has a probability mass associated with it, we can compute a $(1-\delta)$-confidence bound by taking any $(\xi_i, \xi_j) \in P$ and checking whether
$$Pr\left(w^T(\Phi(\xi_i) - \Phi(\xi_j)) < \epsilon\right) < \delta. \tag{67}$$
We then throw away all constraints for which fewer than $1 - \delta$ of the weights imply a difference in return at least as large as $\epsilon$, taking into account the human's preference for that half-plane. Duplicate, noise, and redundancy filters are then applied to obtain a minimal high-confidence $\epsilon$-ARP that is robust to noise in the human preferences. The trajectory pairs that make up this set of minimal constraints form the test $T$. If the subject agent is a robot with an explicit reward function, then we can use the $\epsilon$-ARP in the same way we used the ARP in the main text, and simply check whether $w'$ is in the intersection defined by Theorem 2. If the agent does not have explicit access to its reward function (e.g., if the subject agent is human), then we can verify alignment by asking the subject agent for preferences over trajectories and checking whether they match the preferences given by the human tester.

We then perform a series of post-processing steps on these preferences to create an efficient, robust test. Trajectory pairs with near-identical feature differences are removed. Humans often make mistakes when giving preferences, so preferences whose half-space constraints contain less than $p_{thresh} = 70\%$ of the mass of the preference elicitation algorithm's reward posterior are filtered out. Additionally, many half-space constraints are implied by a more restrictive constraint, so to reduce the number of questions, linear programming is used to find a minimal set of constraints. A sketch of estimating the confidence check in Eq. (67) from posterior samples appears below.
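The check in Eq. (67) is straightforward to estimate with Monte Carlo samples from the reward posterior. A minimal sketch, assuming an array of posterior weight samples (names are illustrative):

```python
import numpy as np

def keep_constraint(w_samples, phi_i, phi_j, eps, delta):
    """Monte Carlo version of the confidence check in Eq. (67).

    w_samples : (n, k) samples from the reward posterior over weights w
    phi_i     : (k,) feature counts of the preferred trajectory xi_i
    phi_j     : (k,) feature counts of the dispreferred trajectory xi_j
    Keeps the constraint only if at least (1 - delta) of the posterior
    mass implies a return gap of at least eps.
    """
    gaps = w_samples @ (phi_i - phi_j)
    return np.mean(gaps >= eps) >= 1.0 - delta
```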
H Experiment Details

H.1 Exact vs Heuristics Grid Domains
In all grid domains the transition dynamics are deterministic, and actions corresponding to movement up, down, left, and right are available in every state. Actions that would lead the agent off of the grid result in the agent staying in the same state. We ran experiments over different-sized grid worlds with different numbers of features. For each grid world size and number of features, we generated 50 random MDPs with features placed randomly and with a random ground-truth reward function. We then sampled 50 different reward function weight vectors $w$ from the unit hypersphere; this bounds the Q-values of states and allowed us to tune $t$ over a bounded interval for the critical state-action value alignment heuristic. For each reward function we computed an optimal policy to create different agents for verification. Duplicate policies were removed. A sketch of the weight-sampling step is shown below.
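Normalizing standard Gaussian draws is a standard way to sample uniformly from the unit hypersphere; the sketch below illustrates this step (names and the seed are illustrative):

```python
import numpy as np

def sample_unit_weights(n_agents, k, seed=0):
    """Sample n_agents reward weight vectors uniformly from the unit
    hypersphere in R^k by normalizing standard Gaussian draws."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_agents, k))
    return w / np.linalg.norm(w, axis=1, keepdims=True)
```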
H.2 Filtering

All experiments (gridworlds and Driver) perform duplication and redundancy filtering. Duplicate constraints are detected by computing the cosine distance between the half-plane normal vectors; any normal vector within a small threshold (0.0001) of another is deduplicated arbitrarily. Trivial (all-zero) constraints are also removed. Redundant constraints are then removed using the procedure from Brown et al. [13], which we briefly summarize.

A redundant constraint is one that can be removed without changing the interior of the intersection of half-spaces. We can find redundant constraints efficiently using linear programming. To check whether a constraint $a^T x \leq b$ is binding, we remove that constraint and solve the linear program with $\max_x a^T x$ as the objective. If the optimal solution is still constrained to be less than or equal to $b$ even when the constraint is removed, then the constraint can be removed. However, if the optimal value is greater than $b$, then the constraint is non-redundant. Thus, all redundant constraints can be removed by making one pass through the constraints, removing each constraint immediately if it is redundant. A sketch of this check follows at the end of this subsection.

There are two optional filtering steps that are applied when eliciting preferences from a human or when operating in a continuous environment: preferences can be filtered for noise in the human's elicited preference, and preferences can be filtered to allow for $\epsilon$-value alignment. In the main experiments, only $\epsilon$-alignment filtering was performed, as the elicited preferences from the human are exact and consistent. During noise filtering, 1000 rewards are sampled from the posterior distribution implied by the full set of constraints. A preference is considered noisy if the fraction of the rewards outside the constraint implied by that preference is greater than some threshold (0.7); this removes preferences that are likely to be violated under the posterior. $\epsilon$-alignment filtering starts by sampling a new 1000 rewards from the posterior implied by the (possibly noise-filtered) remaining set of constraints. We aim to ensure that $P(w^T(\Phi(\xi_A) - \Phi(\xi_B)) > \epsilon) > 1 - \delta$, and we estimate this probability for each constraint using the samples from the reward posterior. If this condition does not hold for the sampled reward posterior, the constraint is removed.

[Figure 6: False positive and negative rates on noisy data with noise filtering (left) and without noise filtering (right). All false negative rates with noise filtering are exactly 0.]
[Figure 7: Detailed breakdown of mistakes from the human pilot study.]
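A minimal sketch of the one-pass LP redundancy check, using `scipy.optimize.linprog`; the tolerance and the handling of unbounded subproblems are illustrative assumptions (in practice the weights can additionally be confined to the unit ball):

```python
import numpy as np
from scipy.optimize import linprog

def remove_redundant(A, b):
    """One-pass LP redundancy removal for constraints A @ x <= b,
    a sketch of the procedure from Brown et al. [13].
    Returns the indices of the non-redundant constraints."""
    keep = list(range(len(A)))
    for i in range(len(A)):
        rest = [j for j in keep if j != i]
        # Maximize a_i^T x subject to the remaining constraints
        # (linprog minimizes, so negate the objective).
        res = linprog(-A[i], A_ub=A[rest], b_ub=b[rest],
                      bounds=[(None, None)] * A.shape[1])
        # If the maximum still cannot exceed b_i, constraint i is redundant
        # and is dropped immediately; unbounded LPs mark it non-redundant.
        if res.status == 0 and -res.fun <= b[i] + 1e-9:
            keep.remove(i)
    return keep
```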
H.3 Noise Ablation Experiments for Driving Domain

When eliciting preferences between trajectories, humans sometimes report their preferences incorrectly. Biyik and Sadigh [9] assume that a human with true reward $w$ reports the preference $\xi_A \prec \xi_B$ with probability $\min(1, \exp(w^T(\Phi(\xi_B) - \Phi(\xi_A))))$. We run ten simulations of 1000 noisily reported human preferences using this model, and evaluate our active test generation pipeline with and without the noise filtering specified in Section H.2. Roughly speaking, this filtering removes constraints that exclude too much of the posterior reward distribution.

These experiments find that noise filtering does reduce the false-negative rate to 0, but at the cost of sometimes large increases in the false positive rate. The relative cost of false positives and false negatives depends on the specific application, but even without noise filtering, false negatives are rare enough that most practitioners will not want to perform noise filtering on elicited preferences. These experiments are highly idealized, based on a specific model of noise in elicited human preferences that may not hold for actual humans, and further study is needed to determine whether the false positive rate of alignment tests generated from elicited human preferences is high enough to justify noise filtering of some kind. A sketch of the simulated noise model appears below.
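A minimal sketch of this noise model as we simulate it (function and variable names are illustrative):

```python
import numpy as np

def noisy_preference(w, phi_A, phi_B, rng=None):
    """Simulate the Biyik and Sadigh noise model (a sketch): report
    xi_A < xi_B with probability min(1, exp(w^T (Phi(xi_B) - Phi(xi_A)))),
    otherwise report the opposite preference."""
    if rng is None:
        rng = np.random.default_rng()
    p = min(1.0, np.exp(w @ (phi_B - phi_A)))
    return rng.random() < p  # True means the human reports xi_A < xi_B
```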
H.4 Human Pilot Study

As $\epsilon$ increases, more of the questions are removed from the test. This necessarily increases the number of positive judgements the test produces, all else being equal. The accuracy initially increases with $\epsilon$ because the test has fewer false negatives as more noisy questions are removed. At around $\epsilon = 1.0$