Value Alignment Verification
Daniel S. Brown∗ (University of California, Berkeley) [email protected]
Jordan Schneider∗, Scott Niekum (University of Texas at Austin) {joschnei,sniekum}@cs.utexas.edu
∗Equal contribution. NeurIPS 2020 Workshop on Human And Machine in-the-Loop Evaluation and Learning Strategies (HAMLETS).
Abstract
As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important that humans can verify these agents' trustworthiness and efficiently evaluate their performance and correctness. In this paper we formalize the problem of value alignment verification: how can we efficiently test whether the goals and behavior of another agent are aligned with a human's values? We explore several different value alignment verification settings and provide foundational theory for value alignment verification. We study alignment verification problems with an idealized human that has an explicit reward function, as well as value alignment verification problems where the human has implicit values. Our theoretical and empirical results in both a discrete grid navigation domain and a continuous autonomous driving domain demonstrate that it is possible to synthesize highly efficient and accurate value alignment verification tests for certifying the alignment of autonomous agents.
If we desire autonomous agents that can interact with and assist humans and other agents in performing complex, potentially risky tasks, then it is important that humans can verify that other agents' policies are aligned with what is expected and desired. This alignment is often termed value alignment and is defined in the Asilomar AI Principles as follows: "Highly autonomous AI systems should be designed so that their goals and behaviors can be assured to align with human values throughout their operation" (https://futureoflife.org/ai-principles/). We note that it is also important that non-human agents be mutually value aligned in multi-agent settings so that they can assist each other and collaborate under shared norms and preferences. In this paper, we propose and explore the problem of efficient value alignment verification: how can a human efficiently test whether a robot is aligned with the human's values?
The goal of value alignment verification is to construct a kind of "driver's test" that a human can give to another agent, one which can verify value alignment and consists of only a small number of queries. For the purposes of this paper we define values in the reinforcement learning sense, i.e., with respect to a value function or reward/utility function. We say that a robot is perfectly value aligned with a human if the robot's policy is optimal under the human's reward function. The two agents in a value alignment verification problem (human and robot) will likely have different communication mechanisms and different value introspection abilities. Thus, value alignment verification will take different forms depending on whether the human and robot have explicit access to their values (i.e., being able to write down a value function or reward function) or implicit access (i.e., only able to answer preference queries or to sample actions from a policy). For example, artificial agents typically have explicit value functions or policies, while humans typically have implicit values. Despite these differences, we would like to perform value alignment verification regardless of whether an agent has explicit or implicit values. In Section 4.2.1 we examine methods for provable value alignment verification in an idealized setting where both the human and robot have explicit values. Then, in Section 4.2.2 we discuss how we can use this test under several different conditions, including when the robot may have implicit values and can only answer preference queries. Finally, in Section 4.3 we propose an approximation algorithm for value alignment verification (depicted in Figure 1) that is applicable in cases where the human tester has implicit values.

Figure 1: Value alignment verification with a human tester with implicit values. The tester's values are distilled into a succinct alignment test via preference elicitation. This test can then be applied to any number of agents to verify their alignment with the human's values.

Prior work on value alignment often focuses on loose definitions of value alignment, qualitative evaluation of trust [20], or asymptotic alignment of an agent's performance via interactions and active learning [17, 18, 30]. In contrast, our work seeks to build trust between agents by formally defining value alignment and seeking efficient tests for value alignment verification that are applicable when two or more agents have already learned a policy or reward function and want to quickly test compatibility. Related work seeks to provide high-confidence bounds on the performance of a reinforcement learning agent [19, 34] or an imitation learning agent [12, 14]. However, these approaches typically require full access to the parameterized policies of both agents and involve evaluating the robot's policy over significant amounts of historical data or extensive counterfactual computations. To the best of our knowledge, we are the first to address the general problem of algorithmic value alignment verification.
In particular, we propose exact, approximate, and heuristic tests that one agent can use to quickly and efficiently verify value alignment with another agent. The contributions of this work are the following: (1) we formally define value alignment verification; (2) we analyze the complexity of value alignment verification and show that in an idealized setting it can be much more efficient than active reward learning, requiring only a constant number of queries; (3) we propose exact and heuristic value alignment verification methods that are applicable under a wide range of test queries; (4) we propose an approximation algorithm for value alignment verification that works with a human tester with implicit values; and (5) we provide empirical results demonstrating the efficacy of exact and approximate value alignment verification in both a discrete grid navigation domain and a continuous autonomous driving domain.
Value Alignment:
Most work on value alignment focuses on how to iteratively train a learning agent such that its final behavior is aligned with a user's intentions [5, 22, 28]. One example is cooperative inverse reinforcement learning (CIRL) [18], which formulates value alignment as a game between a human and a robot, where both try to maximize a shared reward function that is only known by the human. CIRL and other research on value alignment focus on ensuring the learning agent asymptotically converges to the same values as the human teacher, but do not provide a way to check when value alignment has been achieved. By contrast, we are instead interested in value alignment verification: testing whether an agent is currently value aligned. We do not assume a cooperative setting: the robot is not assumed to have the same payoff as the human. Instead, we assume the agent being tested has already learned a policy/reward function via some black-box optimization process and the human wants to efficiently test for alignment.
Active Reward Learning:
Value alignment verification is closely related to the problem of active preference learning [2, 8, 10, 14, 17, 20, 24], where an AI system seeks to efficiently determine the reward function of a human expert via queries for expert demonstrations or preferences over trajectories; however, value alignment verification only seeks to answer the question of whether two agents are aligned, without concern for the exact reward function of the robot. We prove in Section 4.1 that value alignment verification can sometimes be performed in a constant number of queries, whereas active learning requires a logarithmic number of queries. We also demonstrate that when the human has implicit values, active reward learning can be used to automatically generate a high-confidence value alignment test with respect to these implicit values.
Machine Teaching:
Machine teaching [36, 37] is the inverse problem to machine learning. In machine teaching, a teacher seeks a minimal set of training data such that a student (running a particular learning algorithm) learns a desired set of model parameters. Value alignment verification is related and can be seen as a form of machine testing rather than teaching. Machine teaching algorithms typically search for a minimal set of training data that will teach a learner a specific model, whereas we seek a minimal set of questions that will allow a tester to verify another agent's model. Thus, in machine teaching the teacher provides examples and their answers, whereas in machine testing the tester provides examples and then queries the student for the answers. Machine teaching has been previously applied to sequential decision making problems [13, 15], but has not been used to directly address the problem of machine testing. Other related work has proposed to use pedagogic examples as a way to enable robots to express their capabilities [21] and values [20] to a human. Our work is similarly motivated by building trust between agents via verification testing.
Policy Evaluation:
Policy evaluation [33] can be seen as a form of value alignment, but aims to answer the harder question of "How much return would the other agent achieve according to my values?" By focusing on the simpler question, "Is the robot value aligned with the human?", our work provides sample-efficient tests for exact and approximate value alignment. Off-policy evaluation (OPE) seeks to perform policy evaluation without executing the testee's policy [27, 34, 35]. OPE is often sample-inefficient or provides high-variance estimates, and typically assumes explicit access to the tester's reward function, explicit access to the tester and testee policies, and a large dataset of rollouts from the tester's policy with corresponding returns. By contrast, value alignment verification is applicable in settings where the policies and reward functions of both agents may be implicit and only accessible indirectly. High-confidence policy evaluation has also been investigated in the imitation learning setting [1, 10, 12], where an agent has access to demonstrations from an expert and seeks to evaluate its policy loss with respect to the teacher's unknown reward function. Rather than considering a learner that receives demonstrations from a teacher, we consider a tester who seeks to design a test that can (approximately) verify the value alignment of any other agent.
Because we are interested in agents that have different reward functions, we adopt notation proposed by Amin et al. [3], where a Markov decision process (MDP) $M$ consists of an environment $E = (\mathcal{S}, \mathcal{A}, P, S_0, \gamma)$ and a reward function $R : \mathcal{S} \rightarrow \mathbb{R}$. An environment has a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$, a discount factor $\gamma \in [0,1)$, and a distribution over initial states $S_0$. A policy $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$ is a mapping from states to a distribution over actions. The state and state-action values of a policy $\pi$ are $V^{\pi}_R(s) = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s]$ and $Q^{\pi}_R(s,a) = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s, a_0 = a]$ for $s \in \mathcal{S}$ and $a \in \mathcal{A}$. We denote $V^*_R(s) = \max_{\pi} V^{\pi}_R(s)$ and $Q^*_R(s,a) = \max_{\pi} Q^{\pi}_R(s,a)$. The expected value of a policy is denoted by $V^{\pi}_R = \mathbb{E}_{s_0 \sim S_0}[V^{\pi}_R(s_0)]$. As is common [7, 14, 26, 38], we will often assume that the reward function can be expressed as a linear combination of features $\phi : \mathcal{S} \rightarrow \mathbb{R}^k$, so that $R(s) = w^T \phi(s)$, where $w \in \mathbb{R}^k$. Thus, we use $R$ and $w$ interchangeably. Note that this assumption of a linear reward function is not restrictive, as these features can be arbitrarily complex nonlinear functions of the state and can be obtained via unsupervised learning from raw state observations [14, 16, 32]. Given this assumption, the state-action value function can be written in terms of discounted expectations over features as $Q^{\pi}_R(s,a) = w^T \Phi^{(s,a)}_{\pi}$, where $\Phi^{(s,a)}_{\pi} = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s, a_0 = a]$. A minimal sketch of these quantities is given below.
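To make the notation concrete, the following is a minimal Python sketch (not the implementation used in our experiments) that estimates the feature expectations $\Phi^{(s,a)}_{\pi}$ by Monte Carlo rollouts in a generic tabular MDP; the tensor layout, function names, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def feature_expectations(P, phi, pi, s0, a0, gamma=0.95,
                         horizon=200, n_rollouts=500, rng=None):
    """Monte Carlo estimate of Phi_pi^{(s,a)} = E_pi[sum_t gamma^t phi(s_t) | s_0=s0, a_0=a0].

    P:   (S, A, S) transition tensor, P[s, a, s'] = Pr(s' | s, a)
    phi: (S, k) feature matrix, phi[s] = feature vector of state s
    pi:  (S,) deterministic policy, pi[s] = action taken in state s
    """
    rng = rng or np.random.default_rng(0)
    S, _, _ = P.shape
    total = np.zeros(phi.shape[1])
    for _ in range(n_rollouts):
        s, a, disc = s0, a0, 1.0
        for _ in range(horizon):
            total += disc * phi[s]          # accumulate discounted features of s_t
            s = rng.choice(S, p=P[s, a])    # sample the next state
            a = pi[s]                       # follow pi after the first action a0
            disc *= gamma
    return total / n_rollouts

# Q-value of (s, a) under a linear reward R(s) = w^T phi(s):
# Q_pi(s, a) = w @ feature_expectations(P, phi, pi, s, a)
```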
In this section we first explicitly define value alignment and value alignment verification. Next, we discuss how assuming rationality of the robot agent enables highly efficient provable value alignment verification. We then present results for value alignment verification when the human has full control over the environment, and also in the case where the environment is fixed. We conclude this section by presenting a method for approximate value alignment verification when the tester is a human with implicit values.

We first formalize value alignment. Consider two agents: a human and a robot. We will assume that the human has a (possibly implicit) reward function that provides the ground truth for determining value alignment verification of the robot. We define exact value alignment as follows:

Definition 1.
Given reward function $R$, policy $\pi'$ is value aligned in environment $E$ if and only if

$\pi' \in \mathrm{OPT}(R)$,  (1)

where $\mathrm{OPT}(R) = \{\pi \mid \pi(a \mid s) > 0 \Rightarrow a \in \arg\max_a Q^*_R(s,a)\}$ is the set of all optimal (potentially stochastic) policies in the MDP $(E, R)$, and $\arg\max_x f(x) := \{x \mid f(y) \le f(x), \forall y\}$.

In complex environments, or for robots with bounded rationality or computation, expecting exact alignment may be unreasonable. Thus, we also define $\epsilon$-value alignment:
Definition 2.
Given reward function $R$, policy $\pi'$ is $\epsilon$-value aligned in environment $E$ if and only if

$V^*_R - V^{\pi'}_R \le \epsilon$.  (2)

Note that Definition 1 is a special case of Definition 2 when $\epsilon = 0$. We are interested in the problem of value alignment verification, which we define as follows:

Definition 3.
Value Alignment Verification: Given an environment $E$, reward function $R$, policy $\pi'$, and a threshold $\epsilon$, solve the decision problem: is $\pi'$ $\epsilon$-value aligned with $R$ in environment $E$?

To verify value alignment without checking alignment at every state, it needs to be the case that the robot preferring an action in one state implies something about its preferences in another. Any such implication requires both that the states have some relationship to one another and that the agent's preferences are consistent with this relationship. In our case, we assume states have known reward features and that agents act rationally with respect to a linear reward in these features. While we require these assumptions for our theoretical analysis, we will later show that many of our proposed methods for value alignment verification can be used as heuristics for building trust even if the subject agent is not rational.

A rational agent is one that picks actions to maximize its utility [29]. Thus, given a reward function $R'$, a rational agent's policy $\pi'$ is of the form:

$\pi'(s) \in \arg\max_a Q^*_{R'}(s,a)$.  (3)

Consider two rational agents with reward functions $R$ and $R'$. Because there are infinitely many reward functions that lead to the same optimal policy [25], determining that $\exists s \in \mathcal{S}, R(s) \ne R'(s)$ is not sufficient to verify misalignment. Instead, we formalize value alignment for reward functions with arbitrary shaping or scale via the following lemma, which follows directly from Definition 1 and the definition of a rational agent in Equation (3).

Lemma 1.
A rational robot with reward function $R'$ is value aligned with a human with reward function $R$ in environment $E$ if and only if $\mathrm{OPT}(R') \subseteq \mathrm{OPT}(R)$.

Proof. This follows directly from Definition 1 and the definition of a rational agent in Eq. (3).

Thus, a rational robot is aligned with a human if all optimal policies under the robot's reward function are also optimal policies under the human's reward function. A brute-force check of this condition in a small tabular MDP is sketched below.
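The following sketch illustrates Lemma 1 in a small tabular MDP where both reward functions are explicit: compute the optimal action sets under each reward via value iteration and test the inclusion $\mathrm{OPT}(R') \subseteq \mathrm{OPT}(R)$. All names and tolerances are hypothetical.

```python
import numpy as np

def optimal_action_sets(P, R, gamma=0.95, tol=1e-8):
    """Return, for each state, the set of optimal actions under reward R
    (value iteration on a tabular MDP with P[s, a, s'] transitions)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R[:, None] + gamma * P @ V       # Q[s, a]; reward depends on state only
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return [set(np.flatnonzero(Q[s] >= Q[s].max() - 1e-6)) for s in range(S)]

def exactly_aligned(P, R_human, R_robot, gamma=0.95):
    """Lemma 1: aligned iff every robot-optimal action is human-optimal in every state."""
    opt_h = optimal_action_sets(P, R_human, gamma)
    opt_r = optimal_action_sets(P, R_robot, gamma)
    return all(opt_r[s] <= opt_h[s] for s in range(len(opt_h)))
```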
$\epsilon$-Alignment Verification via Omnipotent Testing

We first consider the theoretical setting of an omnipotent testing agent: one that is able to construct a set of arbitrary test MDPs to verify value alignment across a family of environments that share the same reward function. We assume that the human has explicit access to their reward function, but only assume that the robot has implicit values which allow it to answer preference queries. Amin and Singh [4] prove under these assumptions that an omnipotent active learner can determine the reward function of another agent within $\epsilon$ precision via $O(\log|\mathcal{S}| + \log(1/\epsilon))$ active queries. These queries take the form of asking for the entire policy of the robot. In Appendix A.2, we extend this result to the case of value alignment testing, where we prove that if the human is able to query the robot for preferences over policies, then the sample complexity of $\epsilon$-value alignment verification is only $O(1)$.

Theorem 1.
Given a testing reward $R$, there exists a two-query test (complexity $O(1)$) that determines $\epsilon$-value alignment of a rational agent over all MDPs that share the same state space and reward function $R$, but may differ in actions, transitions, discount factors, and initial state distributions.

Proof. See Appendix A.2.

This illustrates the benefit of having a verification test versus running active reward learning, and confirms related work showing that methods related to machine teaching are much more sample efficient than active learning methods [13, 36]. While creating an arbitrary synthetic testing world may work in some cases, it is often the case that nature provides the environment in which we would like to guarantee verification. In the rest of this paper we focus on this setting, where the testing environment is fixed and cannot be arbitrarily constructed or changed.
In this section we develop theoretical results regarding provable exact alignment verification ($\epsilon = 0$) of a rational robot when the tester does not have full control over the testing environment.
We seek an efficient value alignment verification test which enables a human to query the robot to determine alignment according to Lemma 1. As demonstrated by Theorem 2 below, due to the linearity of $R$, a sufficient condition for value alignment verification is to test whether the rational robot's reward function lies in the following geometric object.

Definition 4.
Given an MDP $M$ composed of environment $E$ and reward function $R$, the aligned reward polytope (ARP) is defined as the following set of reward functions:

$\mathrm{ARP}(R) = \{R' \mid \mathrm{OPT}(R') \subseteq \mathrm{OPT}(R)\}$.  (4)

We now present a sufficient test for provable exact value alignment. As a reminder, given a linear reward function we can write the state-action value function as $Q^{\pi}_R(s,a) = w^T \Phi^{(s,a)}_{\pi}$, where $\Phi^{(s,a)}_{\pi} = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s, a_0 = a]$.

Theorem 2.
Given an MDP $M = (E, R)$, if the human's reward function $R$ and robot's reward function $R'$ can be represented as linear combinations of features $\phi(s) \in \mathbb{R}^k$, i.e., $R(s) = w^T \phi(s)$ and $R'(s) = w'^T \phi(s)$, then a sufficient condition for testing value alignment is to test whether

$w' \in \bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b} \subseteq \mathrm{ARP}(R)$,  (5)

where $H^R_{s,a,b} = \{w \mid w^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0\}$ if $a \in \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$ and $b \notin \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$, and $H^R_{s,a,b} = \mathbb{R}^k$ (i.e., non-constraining) otherwise.

Proof. See Appendix A.1.
Given Theorem 2, we can now design an efficient value alignment verification test when we have an idealized human and robot that both have explicit representations of their reward functions. Our analysis provides theoretical insight into the value alignment verification problem, and the resulting tests for exact alignment in this section will motivate our approximation algorithm for value alignment verification when one or both of the agents have implicit values. We also note that our results are of practical interest if there are two robots that need to collaborate but were trained by different organizations and have different reward functions and/or policies [6, 31]; running a value alignment test with explicit values is an efficient way to verify whether the robots can work together.

We propose an approach that directly queries the robot for its reward function weights $w'$. Later we will show that many different types of queries reduce to this type of test. A direct result of Theorem 2 is that we can test for value alignment via the test $T = \{H_{s,a,b} \mid (s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}\}$, where the questions are defined as

$w'^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0$, if $a \in \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$, $b \notin \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$, $\forall s \in \mathcal{S}$.  (6)

All constraints of this form can be checked simultaneously via a single matrix-vector multiplication $\Phi_{\mathrm{ARP}} w' > 0$, where $\Phi_{\mathrm{ARP}}$ is a matrix in which each row corresponds to a unique feature count difference in Equation (6). A minimal sketch of this matrix test is given below.
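Under the same assumptions, the following sketch stacks one halfspace normal per (state, optimal action, suboptimal action) triple into $\Phi_{\mathrm{ARP}}$ and checks all strict inequalities at once. It reuses the hypothetical helpers optimal_action_sets and feature_expectations from the earlier sketches; the margin parameter is an illustrative numerical tolerance.

```python
import numpy as np

def build_arp_matrix(P, phi, w_human, gamma=0.95):
    """Stack one halfspace normal per (s, a_opt, b_subopt) triple:
    row = Phi_{pi*_R}^{(s,a)} - Phi_{pi*_R}^{(s,b)} (Theorem 2)."""
    opt = optimal_action_sets(P, phi @ w_human, gamma)   # from the earlier sketch
    pi_star = np.array([min(acts) for acts in opt])      # any fixed optimal policy
    S, A, _ = P.shape
    rows = []
    for s in range(S):
        for a in opt[s]:
            for b in set(range(A)) - opt[s]:
                mu_a = feature_expectations(P, phi, pi_star, s, a, gamma)
                mu_b = feature_expectations(P, phi, pi_star, s, b, gamma)
                rows.append(mu_a - mu_b)
    return np.array(rows)

def passes_arp_test(Phi_ARP, w_robot, margin=1e-8):
    """Certify alignment if every halfspace constraint holds strictly."""
    return bool(np.all(Phi_ARP @ w_robot > margin))
```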
The above test assumes that the human can query directly for the robot's reward function weights $w'$. In Appendix C, we show that similar tests can be formulated under more restrictive query assumptions, including preference queries answered via implicit values:

Proposition 1. Under the assumption of a rational robot that shares the same linear reward features as the human, efficient exact value alignment verification is possible in the following query settings: (1) query access to reward function weights $w'$, (2) query access to samples of the reward function $R'(s)$, (3) query access to $Q^*_{R'}(s,a)$, and (4) query access to preferences over trajectories.

Proof. See Appendix C.

Proposition 1 assumes the human can either directly query the robot's reward or value function or query the robot for its preferences over trajectories. However, sometimes the human may only have query access to the robot's policy $\pi'$. In this case, we can resort to heuristics for value alignment via policy queries that have high verification accuracy in practice, but may occasionally have false positives where a non-aligned agent is certified as aligned, as we discuss in the next section.

We now discuss how to perform value alignment verification when the human and/or robot only have implicit values. In this setting, the goal is to distill a human's intent or values into a verification test that can be used to quickly check the value alignment of any agent. For example, a regulatory body may want a sample-efficient test to validate proprietary autonomous driving software. The testing agency may change its regulations periodically, and a value alignment test could be used to check whether existing proprietary software still meets the new guidelines. A well-designed value alignment verification test could also be useful as a replacement for exhaustively backtesting an agent during development, to ensure updates to software do not violate safety constraints. In some cases, such as an AI tutoring system, the robot could even be the tester and the human the testee; for example, a robot that comes preprogrammed from a factory to perform household chores may want to first quickly verify whether the human's preferences are aligned with its preprogrammed behavior.

Without an explicit representation of the human's values, we cannot directly compute the ARP as described in the previous section. Instead, we propose the approach outlined in Figure 1, where we use an AI system as a test generator to enable the creation of an alignment test. The test generator first performs preference elicitation to distill the human's internal value function into an efficient alignment test. This test can then be reused to test any other agent, human or robot, for value alignment. As is common for many active reward learning algorithms [8, 17, 30], we assume that the preference elicitation algorithm outputs both a set of trajectory preferences $\mathcal{P} = \{(\xi_i, \xi_j) : \xi_i \succ \xi_j\}$ and a set of sample reward weights $\{w_i\}$ from the posterior distribution $P(w \mid \mathcal{P})$. Given $\mathcal{P}$ and $P(w \mid \mathcal{P})$, the aligned reward polytope of the human's implicit reward function can be approximated as

$\mathrm{ARP} = \bigcap_{(\xi_i, \xi_j) \in \mathcal{P}} \{w \mid w^T(\Phi(\xi_i) - \Phi(\xi_j)) > 0\}$,  (7)

which generalizes the definition of the ARP to MDPs with continuous states and actions. To test the alignment of agents with bounded rationality or slightly misspecified reward functions, we consider $\epsilon$-value alignment (Definition 2). In particular, we synthesize a test by computing a $(1-\delta)$-confidence $\epsilon$-ARP. As each sample $w_i$ has a probability mass associated with it, we can create a high-confidence version of the $\epsilon$-ARP by only testing using trajectory pairs $(\xi_i, \xi_j) \in \mathcal{P}$ such that $\Pr(w^T(\Phi(\xi_i) - \Phi(\xi_j)) > \epsilon) > 1 - \delta$ under $P(w \mid \mathcal{P})$. Finally, we remove redundant constraints [13]. The result is a succinct, high-confidence test $T$ for $\epsilon$-value alignment verification that consists of a minimal set of informative preference queries (see Appendix for details). The alignment test consists of asking the robot for preferences over trajectories in $T$ and checking whether they match the preference labels given by the human tester. A sketch of this distillation step follows.
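The sketch below illustrates this distillation step under the stated assumptions: given feature-count differences for each elicited preference and posterior samples of the reward weights, it keeps only the constraints that hold with margin $\epsilon$ under at least a $(1-\delta)$ fraction of the samples. The linear-programming redundancy removal [13] is omitted for brevity, and all names are hypothetical.

```python
import numpy as np

def distill_alignment_test(pref_deltas, w_samples, eps=0.1, delta=0.05):
    """Keep preference constraints that hold with margin eps under at least
    (1 - delta) of the posterior reward samples (the (1-delta)-confidence eps-ARP).

    pref_deltas: (n, k) array, row i = Phi(xi_i) - Phi(xi_j) for preference xi_i > xi_j
    w_samples:   (m, k) posterior samples of reward weights from preference elicitation
    """
    margins = pref_deltas @ w_samples.T              # (n, m): margin of each pair per sample
    confident = (margins > eps).mean(axis=1) > 1 - delta
    test = pref_deltas[confident]
    # (redundant halfspaces would additionally be pruned with a linear program [13])
    return test

def robot_passes(test, robot_prefers):
    """robot_prefers(delta) -> True if the robot prefers the first trajectory of the
    pair whose feature-count difference is delta."""
    return all(robot_prefers(row) for row in test)
```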
In this section, we evaluate the performance of our proposed exact value alignment verification test (Section 4.2.2) in two forms: querying for the weight vector of the robot (ARP-w) and preference queries (ARP-pref). We also consider three heuristic alignment tests designed to work with black-box agents where the tester can only ask policy action queries. We briefly discuss the three black-box heuristics here and include full details in Appendix D. Our first heuristic is inspired by Huang et al.'s notion of critical states: states where $Q^*_R(s, \pi^*_R(s)) - \frac{1}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} Q^*_R(s,a) > t$, for some user-defined threshold $t$ [20]. We adapt this idea to form a critical state alignment heuristic (CS) that computes critical states under the human's reward function $R$, then queries the robot's policy at each critical state and tests whether the robot's action is optimal under the human's policy $\pi^*_R$. Our second heuristic adapts the Set Cover Optimal Teaching algorithm (SCOT) proposed by Brown and Niekum [13] into a value alignment heuristic. SCOT generates a set of maximally informative state-action trajectories designed to efficiently teach a reward function to a maximum likelihood IRL agent. We turn this into an alignment verification test by generating maximally informative trajectories, querying the robot's policy at each state in the teaching trajectories, and then testing whether the sampled actions are optimal under the human's reward function $R$. Our third heuristic takes inspiration from the definition of the ARP to define a black-box (action-only-query) alignment heuristic (ARP-bb). ARP-bb first computes $\mathrm{ARP}(R)$, removes redundant half-space constraints via linear programming, queries the robot's policy for an action in each state $s$ that defines a non-redundant halfspace constraint $w^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0$ in $\mathrm{ARP}(R)$, and finally checks whether the sampled actions are optimal under $R$. A sketch of the CS heuristic is given below.
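The following is a minimal sketch of the CS heuristic as described above, assuming tabular Q-values under the tester's reward and black-box query access to the robot's policy; the names and the tie-breaking tolerance are illustrative.

```python
import numpy as np

def critical_state_test(Q_star, pi_robot_action, t=1.0):
    """CS heuristic: find states where acting matters (Q gap above threshold t),
    query the robot's policy there, and check its action is optimal under Q_star.

    Q_star: (S, A) optimal Q-values under the tester's reward R
    pi_robot_action: callable s -> action sampled from the robot's policy
    """
    gap = Q_star.max(axis=1) - Q_star.mean(axis=1)      # Q*(s, pi*(s)) - mean_a Q*(s, a)
    critical = np.flatnonzero(gap > t)
    for s in critical:
        a = pi_robot_action(s)
        if Q_star[s, a] < Q_star[s].max() - 1e-6:       # robot action not tester-optimal
            return False, critical
    return True, critical
```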
The secondpreference verifies that the agent would rather visit white cells than blue cells, and prefers an indirectpath to the goal state (green) rather than a more direct path that visits a blue cell. Shown in figures 2d,2e, and 2f are the query states for ARP-bb, SCOT, and CS heuristics, respectively. In each of thesetests the agent being tested is asked what action its policy would take in each of the states markedwith a question mark. To pass the test, the agent must respond with an action that is optimal actionunder the tester’s policy in each of these states. ARP-bb chooses two states where the halfspacesdefined by the expected feature counts of following the optimal policy versus taking a suboptimalaction and following the optimal policy fully define the ARP. SCOT asks queries for maximallyinformative trajectory that starts near the water. CS only reasons about Q-value differences and asksmany redundant queries. In Appendix E we show similar results for the lava world environment [23].7 a) Optimal policy (b) Preference query 1 (c) Preference query 2(d) ARP-bb queries (e) SCOT queries (f) CS queries Figure 2: Example value alignment verification tests for the island navigation domain.
We also evaluated the exact value alignment verification methods across a suite of grid navigation domains with varying numbers of states and reward features. We summarize our results here and refer the reader to Appendix F for the full details. By construction, ARP-w requires only one query (querying for $w'$) to achieve perfect accuracy. Using trajectory preferences to define the ARP (ARP-pref) also has perfect accuracy, but requires more queries to the robot. SCOT has sample complexity that is lower than the critical state methods, but much higher than directly querying reward function weights, since it queries for actions at states along each machine teaching trajectory. We found empirically that SCOT has nearly perfect accuracy, but occasionally has false positives. The ARP-inspired heuristic (ARP-bb) has low sample complexity and high accuracy, but sometimes has false positives. CS has significantly higher sample cost than the other methods and requires careful tuning of the threshold $t$ to obtain good performance. These results give evidence that the testing method of choice depends on the capability of the robot and the complexity of the environment relative to the robot's reward function. If the robot can report a ground-truth reward weight vector, then ARP-w has the best performance. If the robot can only answer trajectory preference queries, then ARP-pref should be used. When only given query access to the robot's policy, ARP-bb is preferable in domains where query costs are high and a few false positives are acceptable; if query costs are not an issue, then SCOT is preferable, since we found it to achieve fewer false positives in practice.

We next applied our approximate value alignment verification test to the continuous autonomous driving domain shown in Figure 3b, where we only assume implicit values for the human and robot [9, 30]. We tested the pipeline shown in Figure 1 by eliciting preferences from a simulated human and filtering the resulting questions for duplication, epsilon value gaps, and redundancy; the test's false positive rate (FPR) is then computed. Our 10 simulated humans are randomly generated reward weight vectors with unit $L_2$ norm in the "Driver" environment [9]. For preference elicitation we use a batch method proposed by Biyik and Sadigh [9]: a pair of trajectories that best restricts the remaining space of possible rewards is generated, and the simulated human gives its preference. This preference induces a posterior distribution over reward weights, which is then used to compute the next maximally informative pair of trajectories. Each of the 10 experiments consists of 1,000 pairs of trajectories and preferences. All other parameters are as in Biyik and Sadigh [9]. These preferences are then filtered for duplicates, for a difference in expected value of at least $\epsilon$ under $(1-\delta)$ of the posterior reward distribution, and for redundancy; see Appendices D and E.2 for details. The remaining preferences form our alignment test. If none of the constraints meets the $\epsilon$-$(1-\delta)$ value difference criterion, then we say that all agents pass the test. To evaluate these tests we uniformly sample 10,000 reward weights with unit $L_2$ norm, use all constraints that meet the $\epsilon$-$(1-\delta)$ criterion to determine ground-truth alignment of each reward, and then report the false positive rate for different values of $\epsilon$ in Figure 3. The largest average test size for any value of $\epsilon$ represented a 72x reduction from the 1,000 initial queries used to build the test. We additionally analyze our method with different human query budgets and on preferences generated according to the noise assumptions in Biyik and Sadigh [9], both with and without an additional noise-filtering step (see Appendix H for full results).

Figure 3: Approximate value alignment verification for a continuous autonomous driving domain. (a) $\epsilon$ is the maximum value error an agent can make before being considered misaligned; the average value gap under the ground-truth reward is 0.04, with a 5th percentile of 0.0003 and a 95th percentile of 0.13. (b) A preference query: the yellow trajectory of the white car is fixed, and the human is asked whether they prefer the blue or the red trajectory. (c) Accuracy of test questions from human preferences for different numbers of human queries and values of $\epsilon$; high accuracy is achieved for multiple values of $\epsilon$ across human preference budgets (see Appendix H.4 for more details).

Additionally, a small pilot study was run using actual human preferences. We elicited preferences from the authors using the information gain criterion from Biyik et al. [8]. These preferences were distilled into a test as above. Reward functions were then sampled randomly from a diagonal Gaussian distribution centered at the mean posterior reward, with a standard deviation chosen to provide a roughly balanced number of aligned and misaligned agents. An optimal trajectory under each reward function was generated and manually judged to be either aligned or misaligned by the authors. We evaluate our method by computing the accuracy of the test relative to these manual judgments. Our method correctly determines alignment of most of the reward functions for a range of $\epsilon$ values close to 1.0; more complete results are in Figure 3c. Eventually, $\epsilon$ becomes so large that most half-space constraints are not included in the test, resulting in many false positives.

We proposed and explored the novel problem of value alignment verification of autonomous agents, in which a human wants to verify the alignment of a robot's policy with respect to the human's reward function. Value alignment verification seeks to enable humans to verify and build trust in AI systems by designing a test that probes another agent via queries to see whether it conforms to the human's values. Distilling a human's preferences into a test allows humans to efficiently evaluate the performance of an autonomous agent according to either explicit or implicit human values. We developed a theoretical foundation for value alignment verification and proved sufficient conditions for verifying the alignment of a rational agent. Our theoretical results demonstrate that value alignment verification can be performed in a constant number of queries, as opposed to the logarithmic number required for active reward learning. Our empirical results demonstrate that heuristics based on machine teaching and value alignment provide good sample complexity and high accuracy while only requiring black-box access to an agent's policy. When the human has only implicit access to their values, active preference learning algorithms can be leveraged to automatically construct a high-confidence approximate value alignment test that can efficiently test a large number of agents. Future work includes relaxing rationality assumptions, empirically testing value alignment verification in more complex domains, and performing a full study using actual human preferences.

References
[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
[2] Abhijin Adiga, Sarit Kraus, Oleg Maksimov, and S. S. Ravi. Boolean games: Inferring agents' goals using taxation queries. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), 2020.
[3] Kareem Amin, Nan Jiang, and Satinder Singh. Repeated inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 1815–1824, 2017.
[4] Kareem Amin and Satinder Singh. Towards resolving unidentifiability in inverse reinforcement learning. arXiv preprint arXiv:1601.06569, 2016.
[5] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
[6] Yoram Bachrach, Richard Everett, Edward Hughes, Angeliki Lazaridou, Joel Z. Leibo, Marc Lanctot, Michael Johanson, Wojciech M. Czarnecki, and Thore Graepel. Negotiating team formation using deep reinforcement learning. Artificial Intelligence, 288:103356, 2020.
[7] André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4055–4065, 2017.
[8] Erdem Bıyık, Malayandi Palan, Nicholas C. Landolfi, Dylan P. Losey, and Dorsa Sadigh. Asking easy questions: A user-friendly approach to active reward learning. In Conference on Robot Learning (CoRL), 2019.
[9] Erdem Bıyık and Dorsa Sadigh. Batch active preference-based learning of reward functions. PMLR, 2018.
[10] Daniel S. Brown, Yuchen Cui, and Scott Niekum. Risk-aware active inverse reinforcement learning. In Proceedings of the 2nd Annual Conference on Robot Learning (CoRL), 2018.
[11] Daniel S. Brown, Wonjoon Goo, and Scott Niekum. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning (CoRL), 2019.
[12] Daniel S. Brown and Scott Niekum. Efficient probabilistic performance bounds for inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.
[13] Daniel S. Brown and Scott Niekum. Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7749–7758, 2019.
[14] Daniel S. Brown, Scott Niekum, Russell Coleman, and Ravi Srinivasan. Safe imitation learning via fast Bayesian reward inference from preferences. In International Conference on Machine Learning, 2020.
[15] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.
[16] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.
[17] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
[18] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems 29, pages 3909–3917, 2016.
[19] Josiah Hanna, Scott Niekum, and Peter Stone. Importance sampling policy evaluation with an estimated behavior policy. In Proceedings of the 36th International Conference on Machine Learning (ICML), June 2019.
[20] Sandy H. Huang, Kush Bhatia, Pieter Abbeel, and Anca D. Dragan. Establishing appropriate trust via critical states. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3929–3936. IEEE, 2018.
[21] Sandy H. Huang, David Held, Pieter Abbeel, and Anca D. Dragan. Enabling robots to communicate their objectives. In Robotics: Science and Systems, 2017.
[22] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.
[23] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
[24] Manuel Lopes, Francisco Melo, and Luis Montesano. Active learning for reward estimation in inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 31–46. Springer, 2009.
[25] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
[26] Matteo Pirotta and Marcello Restelli. Inverse reinforcement learning through policy gradient minimization. In AAAI, 2016.
[27] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
[28] Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015.
[29] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
[30] Dorsa Sadigh, Anca D. Dragan, S. Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Proceedings of Robotics: Science and Systems (RSS), July 2017.
[31] Peter Stone, Gal A. Kaminka, Sarit Kraus, Jeffrey S. Rosenschein, et al. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In AAAI, 2010.
[32] Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforcement learning. arXiv preprint arXiv:2009.08319, 2020.
[33] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
[34] Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In AAAI, pages 3000–3006, 2015.
[35] Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pages 9665–9675, 2019.
[36] Xiaojin Zhu. Machine teaching for Bayesian learners in the exponential family. In Advances in Neural Information Processing Systems, pages 1905–1913, 2013.
[37] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. arXiv preprint arXiv:1801.05927, 2018.
[38] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
A Theory and Proofs
A.1 Aligned Reward Polytopes

Theorem 1.
Given an MDP $M = (E, R)$, if the tester's reward function $R$ and subject's reward function $R'$ can be represented as linear combinations of features $\phi(s) \in \mathbb{R}^k$, i.e., $R(s) = w^T \phi(s)$, $R'(s) = w'^T \phi(s)$, then a sufficient condition for testing value alignment ($R' \in \mathrm{ARP}(R)$) is to test whether

$w' \in \bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b} \subseteq \mathrm{ARP}(R)$,  (9)

where

$H^R_{s,a,b} = \{w \mid w^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0\}$, if $a \in \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$ and $b \notin \arg\max_{a' \in \mathcal{A}} Q^*_R(s,a')$, and $H^R_{s,a,b} = \mathbb{R}^k$ (i.e., non-constraining) otherwise.  (10)

Proof.
We will prove that $\bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b} \subseteq \mathrm{ARP}(R)$. Assume that $w' \in \bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b}$. This implies that, for all states $s \in \mathcal{S}$, optimal actions under $R$ have higher expected utility than suboptimal actions when evaluated under $w'$. Thus, there exists an optimal policy under $R$, call it $\pi^*_R$, that is also optimal under $w'$.

Now consider an optimal policy under $w'$, call it $\pi^*_{R'}$; we need to show that $\pi^*_{R'} \in \mathrm{OPT}(R)$. To do this, we prove by contradiction that $\pi^*_{R'}$ is optimal under $R$. The key idea is to compare the feature counts of $\pi^*_R$ and $\pi^*_{R'}$ after one step and notice that they must look equally appealing under $w'$. Assume for contradiction that $\pi^*_{R'} \notin \mathrm{OPT}(R)$. We know that $\pi^*_R \in \mathrm{OPT}(R')$. Thus, there must exist a state $s$ and actions $a, b$ such that $a \in \pi^*_R(s)$, $a \in \pi^*_{R'}(s)$, $b \in \pi^*_{R'}(s)$, but $b \notin \pi^*_R(s)$. Thus,

$w'^T(\Phi^{(s,a)}_{\pi^*_{R'}} - \Phi^{(s,b)}_{\pi^*_{R'}}) = 0$  (11)
$\Rightarrow w'^T \Phi^{(s,a)}_{\pi^*_{R'}} = w'^T \Phi^{(s,b)}_{\pi^*_{R'}}$.  (12)

By the construction of $H^R_{s,a,b}$, we also have

$w'^T(\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0$  (13)
$\Rightarrow w'^T \Phi^{(s,a)}_{\pi^*_R} > w'^T \Phi^{(s,b)}_{\pi^*_R}$.  (14)

We have previously shown that both $\pi^*_R$ and $\pi^*_{R'}$ are optimal policies under $R'$. This means that, for all states $j$ and actions $k$,

$Q^{\pi^*_R}_{R'}(j,k) = Q^{\pi^*_{R'}}_{R'}(j,k)$  (15)
$\Rightarrow w'^T \Phi^{(j,k)}_{\pi^*_R} = w'^T \Phi^{(j,k)}_{\pi^*_{R'}}$,  (16)

and so in particular

$w'^T \Phi^{(s,a)}_{\pi^*_R} = w'^T \Phi^{(s,a)}_{\pi^*_{R'}}$  (17)
$w'^T \Phi^{(s,b)}_{\pi^*_R} = w'^T \Phi^{(s,b)}_{\pi^*_{R'}}$.  (18)

By substituting Equations (17) and (18) into Equation (12), we arrive at

$w'^T \Phi^{(s,a)}_{\pi^*_R} = w'^T \Phi^{(s,b)}_{\pi^*_R}$,  (19)

which contradicts Equation (14) and yields the desired contradiction. We have made only one assumption, that there is a state where there is an action taken by $\pi^*_{R'}$ but not $\pi^*_R$, so it must be the case that at all states, every action taken by $\pi^*_{R'}$ is also taken by $\pi^*_R$. This means that all optimal actions under $\pi^*_{R'}$ are also optimal under $\pi^*_R$. Therefore, $\arg\max_a Q^*_{R'}(s,a) \subseteq \arg\max_a Q^*_R(s,a)$. This proves that $R' \in \mathrm{ARP}_M(R)$ as desired. Thus, $\bigcap_{(s,a,b) \in \mathcal{S} \times \mathcal{A} \times \mathcal{A}} H^R_{s,a,b} \subseteq \mathrm{ARP}_M(R)$.

A.2 $\epsilon$-Alignment Verification via Omnipotent Testing

In this section, we consider the case where the testing agent is able to construct a set of arbitrary test MDPs to verify value alignment across a family of environments that may have different transitions, actions, initial state distributions, and discount factors, but that share the same reward function over states. Amin and Singh [4] prove that an omnipotent active learner can determine the reward function of another agent within $\epsilon$ precision via $O(\log(|\mathcal{S}|) + \log(1/\epsilon))$ active policy queries. We extend this result to the case of value alignment testing. We first prove that if two agents' reward functions are sufficiently similar, then we can guarantee $\epsilon$-value alignment.

Lemma 1. If $\|R(s) - R'(s)\|_\infty \le \epsilon(1-\gamma)/2$, where $\gamma$ is the discount factor and $\epsilon$ is any non-negative error term, then rational agents that have reward functions $R(s)$ and $R'(s)$ are $\epsilon$-value aligned across all MDPs that share the reward function $R(s)$.

Proof.
To be $\epsilon$-value aligned we must have $V^{\pi^*_R}_R - V^{\pi'}_R \le \epsilon$, where $\pi'$ is optimal under $R'$. To prove the lemma we must show that an adversary that can change the reward function from $R$ to $R'$, within the constraint $\|R(s) - R'(s)\|_\infty \le \epsilon(1-\gamma)/2$, cannot make $V^{\pi^*_R}_R - V^{\pi'}_R > \epsilon$ under any MDP.

To make value alignment adversarially bad, we want to maximize $V^{\pi^*_R}_R - V^{\pi'}_R$. Writing this out in terms of expectations over rewards, we have:

$V^{\pi^*_R}_R - V^{\pi'}_R = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_t \sim \pi^*_R] - \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_t \sim \pi']$.  (20)

To create an adversarial MDP we wish to find a reward function $R'$ such that $V^{\pi^*_R}_R > V^{\pi'}_R$ and $V^{\pi^*_R}_{R'} < V^{\pi'}_{R'}$. The intuition is that we want to adversarially construct $R'$ such that it makes $\pi' = \pi^*_{R'}$ look better than $\pi^*_R$ under $R'$ while forcing the true policy loss ($V^{\pi^*_R}_R - V^{\pi'}_R$) to be as large as possible.

We now consider the maximal possible perturbation via an adversarial reward function $R'$. Given the constraint $\|R'(s) - R(s)\|_\infty \le \epsilon(1-\gamma)/2$, the maximal difference at each state between $R'$ and $R$ is $\epsilon(1-\gamma)/2$. In the worst case, the adversary creates $R'$ by subtracting $\epsilon(1-\gamma)/2$ from the true reward ($R'(s) = R(s) - \epsilon(1-\gamma)/2$) at states visited by $\pi^*_R$, to make them look as bad as possible, and makes the states visited by $\pi'$ look as good as possible by adding $\epsilon(1-\gamma)/2$ to the true reward at those states ($R'(s) = R(s) + \epsilon(1-\gamma)/2$). Thus, we have in the worst case

$V^{\pi^*_R}_{R'} = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R'(s_t) \mid s_t \sim \pi^*_R]$  (21)
$= \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t (R(s_t) + R'(s_t) - R(s_t)) \mid s_t \sim \pi^*_R]$  (22)
$= \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t (R(s_t) - \epsilon(1-\gamma)/2) \mid s_t \sim \pi^*_R]$  (23)
$= V^{\pi^*_R}_R - \frac{\epsilon(1-\gamma)}{2(1-\gamma)}$  (24)
$= V^{\pi^*_R}_R - \epsilon/2$.  (25)

Similarly, we have in the worst case

$V^{\pi'}_{R'} = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R'(s_t) \mid s_t \sim \pi']$  (26)
$= \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t (R(s_t) + R'(s_t) - R(s_t)) \mid s_t \sim \pi']$  (27)
$= \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t (R(s_t) + \epsilon(1-\gamma)/2) \mid s_t \sim \pi']$  (28)
$= V^{\pi'}_R + \frac{\epsilon(1-\gamma)}{2(1-\gamma)}$  (29)
$= V^{\pi'}_R + \epsilon/2$.  (30)

The adversarial perturbation of the reward function will only be successful if, as noted previously, we have $V^{\pi^*_R}_R > V^{\pi'}_R$ and $V^{\pi^*_R}_{R'} < V^{\pi'}_{R'}$.
Substituting the values above, we have in the worst case that

$V^{\pi^*_R}_{R'} < V^{\pi'}_{R'}$  (31)
$\Rightarrow V^{\pi^*_R}_R - \epsilon/2 < V^{\pi'}_R + \epsilon/2$  (32)
$\Rightarrow V^{\pi^*_R}_R < V^{\pi'}_R + \epsilon$  (33)
$\Rightarrow V^{\pi^*_R}_R - V^{\pi'}_R < \epsilon$.  (34)

Thus, we have shown that under the assumption $\|R(s) - R'(s)\|_\infty \le \epsilon(1-\gamma)/2$, the subject agent with reward function $R'$ is $\epsilon$-value aligned with the tester's reward function $R$ under all possible MDPs that share the reward function $R$.

Note that if we scale the reward of an agent by a positive constant or shift it by a constant vector, we can make the difference look arbitrarily large even if the two rewards lead to the same optimal policy. This is undesirable for computing value alignment in terms of reward differences. Thus, it only makes sense to compare rewards if they are similarly normalized. We utilize a canonical form for reward functions defined by the transformation $(R(s) - \min_s R(s)) / (\max_s R(s) - \min_s R(s))$, such that the values of the reward function are scaled to be between 0 and 1 [4]. Following the notation of Amin and Singh [4], we use $[R]$ to denote the canonical form of reward function $R$.

Given the ability to construct arbitrary testing environments, we can guarantee $\epsilon$-value alignment over all MDPs that share the reward function $R$. The following theorem is inspired by Amin and Singh [4], who prove an analogous theorem for the case of actively querying an expert to approximate the expert's reward function. The proof of Amin and Singh [4] relies on binary search, and the query algorithm they derive has query complexity $O(\log(|\mathcal{S}|) + \log(1/\epsilon))$, where each query requires the expert to specify a complete policy for a new MDP. In contrast, our proof is based instead on machine teaching (the tester knows what it is testing for), and we prove that in the case of value alignment verification we only require $O(1)$ queries. In fact, we only need two test MDPs, where for each test MDP we query the agent whether it prefers one of two different policies in that test MDP.

Theorem 2.
Given a testing reward $R$, there exists a two-query test (complexity $O(1)$) that determines $\epsilon$-value alignment of a rational agent over all MDPs that share the same state space and reward function $R$, but may differ in actions, transitions, discount factors, and initial state distributions.

Proof. By Lemma 1 we want a test that guarantees $\|[R'] - [R]\|_\infty \le \epsilon(1-\gamma)/2$. Thus we need

$|[R'](s) - [R](s)| < \epsilon(1-\gamma)/2, \forall s \in \mathcal{S}$  (35)
$\Leftrightarrow [R](s) - \epsilon(1-\gamma)/2 < [R'](s) < [R](s) + \epsilon(1-\gamma)/2, \forall s \in \mathcal{S}$.  (36)

We use the notation $[R]$ and $[R']$ to represent the canonical versions of $R$ and $R'$, the tester's and subject's reward functions, respectively. If we can directly query for $R'$, then we simply compute $\|R - R'\|_\infty$ and check whether it is less than $\epsilon(1-\gamma)/2$. We now consider the case where we can only query the agent about policy preferences. We define $s_{\max} = \arg\max_s R(s)$, $s_{\min} = \arg\min_s R(s)$, $s'_{\max} = \arg\max_s R'(s)$, and $s'_{\min} = \arg\min_s R'(s)$, and we assume that $s_{\max}$ and $s_{\min}$ are unique.

We now create a testing environment $E$ such that from each state there is an action $a_0$ that self-transitions and an action $a_1$ that goes from each state to the max-reward state with probability $\alpha_s$ and to the min-reward state with probability $(1 - \alpha_s)$, except in states $s_{\min}$ and $s_{\max}$, in which all transitions via $a_0$ and $a_1$ are self-transitions. Thus, taking action $a_1$ represents a gamble between the states with minimum and maximum reward under the tester's reward function $R$.

For $s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$, we design two different transition dynamics with the parameters $\alpha^U$ and $\alpha^L$ such that

$\alpha^L_s = \max([R](s) - \epsilon(1-\gamma)/2, 0)$ and $\alpha^U_s = \min([R](s) + \epsilon(1-\gamma)/2, 1)$.

Then we construct two test environments $E_L$ and $E_U$: $E_L$ has $\alpha^L$ as the transitions and $E_U$ has $\alpha^U$ as the transitions. We then design two test questions:

1. Is $\pi_{a_0} \succ \pi_{a_1}$ in MDP $E_L$?
2. Is $\pi_{a_1} \succ \pi_{a_0}$ in MDP $E_U$?

where $\pi_{a_i}$ is the policy that always takes action $a_i$.

If the agent answers "YES" to the first question, then $\forall s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$ we know that $a_0$ is preferred to $a_1$. Thus the agent prefers to self-transition at a state rather than take action $a_1$, which probabilistically transitions to $s_{\max}$ and $s_{\min}$. Thus, under the subject agent's unknown reward $R'$, the following inequality holds for all $s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$:

$\alpha^L_s R'(s_{\max}) + (1 - \alpha^L_s) R'(s_{\min}) < R'(s)$  (37)
$\Leftrightarrow \alpha^L_s R'(s_{\max}) + (1 - \alpha^L_s) R'(s_{\min}) - R'(s'_{\min}) < R'(s) - R'(s'_{\min})$  (38)
$\Leftrightarrow \alpha^L_s (R'(s_{\max}) - R'(s'_{\min})) + (1 - \alpha^L_s)(R'(s_{\min}) - R'(s'_{\min})) < R'(s) - R'(s'_{\min})$  (39)
$\Leftrightarrow \alpha^L_s \frac{R'(s_{\max}) - R'(s'_{\min})}{R'(s'_{\max}) - R'(s'_{\min})} + (1 - \alpha^L_s) \frac{R'(s_{\min}) - R'(s'_{\min})}{R'(s'_{\max}) - R'(s'_{\min})} < \frac{R'(s) - R'(s'_{\min})}{R'(s'_{\max}) - R'(s'_{\min})}$  (40)
$\Leftrightarrow \alpha^L_s [R'](s_{\max}) + (1 - \alpha^L_s)[R'](s_{\min}) < [R'](s)$,  (41)
and similarly, if the agent answers "YES" to question 2, we have

$R'(s) < \alpha^U_s R'(s_{\max}) + (1 - \alpha^U_s) R'(s_{\min})$  (42)
$\Leftrightarrow [R'](s) < \alpha^U_s [R'](s_{\max}) + (1 - \alpha^U_s)[R'](s_{\min})$.  (43)

These inequalities hold for all $s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$. We now prove that answering "YES" to both questions 1 and 2 also means that $s'_{\max} = \arg\max_s R'(s) = \arg\max_s R(s) = s_{\max}$. We prove this by contradiction. Assume that $s_{\max} \ne s'_{\max}$; then $s'_{\max}$ is one of the states for which the subject answered question 2 in the affirmative. Thus, we know that

$[R'](s'_{\max}) < \alpha^U_s [R'](s_{\max}) + (1 - \alpha^U_s)[R'](s_{\min})$  (44)
$\Rightarrow 1 < \alpha^U_s [R'](s_{\max}) + (1 - \alpha^U_s)[R'](s_{\min})$  (45)
$\Rightarrow 1 < \alpha^U_s + (1 - \alpha^U_s) = 1$,  (46)

where the second line uses the fact that $[R'](s'_{\max}) = 1$, and the third uses the fact that, by definition, $[R'](s) \le 1, \forall s \in \mathcal{S}$. Thus $1 < 1$, which provides the desired contradiction. Therefore, we must have $s'_{\max} = s_{\max}$.

Similarly, we prove that $s'_{\min} = \arg\min_s R'(s) = \arg\min_s R(s) = s_{\min}$ by contradiction. Assume that $s_{\min} \ne s'_{\min}$; then $s'_{\min}$ is one of the states for which the subject answered question 1 in the affirmative. Thus, we know that

$\alpha^L_s [R'](s_{\max}) + (1 - \alpha^L_s)[R'](s_{\min}) < [R'](s'_{\min})$  (47)
$\Rightarrow \alpha^L_s [R'](s_{\max}) + (1 - \alpha^L_s)[R'](s_{\min}) < 0$  (48)
$\Rightarrow 0 < 0$.  (49)

The second line uses the fact that, by definition, $[R'](s'_{\min}) = 0$. The third line uses the fact that, by definition, $[R'](s_{\max}) \ge 0$ and $[R'](s_{\min}) \ge 0$. This provides the desired contradiction, so we must have $s_{\min} = s'_{\min}$.

Combining the above results, we have (assuming the subject answers "YES" to questions 1 and 2) that $[R](s_{\max}) = [R'](s_{\max}) = 1$ and $[R](s_{\min}) = [R'](s_{\min}) = 0$. Additionally, we know for all $s \in \mathcal{S} \setminus \{s_{\max}, s_{\min}\}$ that

$\alpha^L_s R'(s_{\max}) + (1 - \alpha^L_s) R'(s_{\min}) < R'(s) < \alpha^U_s R'(s_{\max}) + (1 - \alpha^U_s) R'(s_{\min})$  (50)
$\Rightarrow \alpha^L_s [R'](s_{\max}) + (1 - \alpha^L_s)[R'](s_{\min}) < [R'](s) < \alpha^U_s [R'](s_{\max}) + (1 - \alpha^U_s)[R'](s_{\min})$
$\Rightarrow \alpha^L_s < [R'](s) < \alpha^U_s$  (51)
$\Rightarrow \max([R](s) - \epsilon(1-\gamma)/2, 0) < [R'](s) < \min([R](s) + \epsilon(1-\gamma)/2, 1)$  (52)
$\Rightarrow |[R'](s) - [R](s)| < \epsilon(1-\gamma)/2$.  (53)

Thus, we have $\|[R'] - [R]\|_\infty < \epsilon(1-\gamma)/2$, so by Lemma 1 we have verified $\epsilon$-value alignment via two policy preference queries, as desired. A code sketch of this construction is given below.
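A minimal sketch of this construction, assuming a finite state space and a reward given as a vector over states: it computes the gamble probabilities $\alpha^L_s$ and $\alpha^U_s$ that define the two test MDPs $E_L$ and $E_U$. The surrounding MDP plumbing is omitted, and all names are illustrative.

```python
import numpy as np

def canonical(R):
    """Scale a reward vector to [0, 1] (the canonical form [R])."""
    return (R - R.min()) / (R.max() - R.min())

def omnipotent_test_gambles(R, eps, gamma):
    """Transition probabilities alpha^L_s, alpha^U_s for the two test MDPs E_L, E_U.
    In each test MDP, action a0 self-transitions and action a1 moves to s_max
    w.p. alpha_s and to s_min w.p. 1 - alpha_s."""
    Rc = canonical(R)
    slack = eps * (1 - gamma) / 2
    alpha_L = np.maximum(Rc - slack, 0.0)
    alpha_U = np.minimum(Rc + slack, 1.0)
    # Query 1: does the agent prefer pi_{a0} over pi_{a1} in E_L?
    # Query 2: does the agent prefer pi_{a1} over pi_{a0} in E_U?
    # Two YES answers certify eps-value alignment (Theorem 2).
    return alpha_L, alpha_U
```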
B Relationship of the ARP to Ng and Russell's Consistent Reward Sets

In this section we discuss the relationship between our approach and the foundational work on IRL by Ng and Russell [25]. We define the set of rewards consistent with an optimal policy as follows:
Definition 1.
Given an environment $E$, the consistent reward set (CRS) of a policy $\pi$ in environment $E$ is defined as the set of reward functions under which $\pi$ is optimal:
$$\text{CRS}(\pi) = \{w \in \mathbb{R}^k \mid \pi \text{ is optimal with respect to } R(s) = w^T \phi(s)\}. \tag{54}$$
The fundamental theorem of inverse reinforcement learning [25] characterizes the set of all consistent reward functions as a set of linear inequalities for finite MDPs.

Proposition 1. [25] Given an environment $E$ with finite state and action spaces, $R \in \text{CRS}(\pi)$ if and only if
$$(P_\pi - P_a)(I - \gamma P_\pi)^{-1} R \geq 0, \quad \forall a \in \mathcal{A}, \tag{55}$$
where $P_a$ is the transition matrix associated with always taking action $a$, $P_\pi$ is the transition matrix associated with policy $\pi$, and $R$ is the column vector of rewards for each state in the MDP. This condition can be checked numerically, as sketched below.
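As an illustration, the inequality in Proposition 1 can be verified directly for tabular MDPs. The sketch below assumes dense numpy transition matrices; the function name and the floating-point tolerance are illustrative rather than part of our method.

```python
import numpy as np

def in_crs(R, P_pi, P_actions, gamma):
    """Check Proposition 1: is the reward vector R consistent with policy pi?

    R         : (n,) reward per state
    P_pi      : (n, n) transition matrix of the policy being checked
    P_actions : list of (n, n) transition matrices, one per action
    """
    n = len(R)
    # Discounted state-occupancy operator (I - gamma * P_pi)^{-1}
    occ = np.linalg.inv(np.eye(n) - gamma * P_pi)
    # R is in the CRS iff every action's inequality holds (small tolerance)
    return all(np.all((P_pi - P_a) @ occ @ R >= -1e-9) for P_a in P_actions)
```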
When the reward function is a linear combination of features, we obtain the following corollary.

Corollary 1. [13, 25] Given an environment $E$, $\text{CRS}(\pi)$ is given by the following intersection of half-spaces:
$$\{w \in \mathbb{R}^k \mid w^T(\Phi_\pi^{(s,a)} - \Phi_\pi^{(s,b)}) \geq 0, \ \forall a \in \text{support}(\pi(s)),\ b \in \mathcal{A},\ s \in \mathcal{S}\}. \tag{56}$$

Proof.
In every state $s$ there are one or more optimal actions $a \in \text{support}(\pi(s))$. For each such optimal action, we have by the definition of optimality that
$$Q^*(s,a) \geq Q^*(s,b), \quad \forall b \in \mathcal{A}. \tag{57}$$
Rewriting this in terms of expected discounted feature counts, we have
$$w^T \Phi_\pi^{(s,a)} \geq w^T \Phi_\pi^{(s,b)}, \quad \forall b \in \mathcal{A}. \tag{58}$$
Thus, the entire feasible region is the intersection of the half-spaces
$$w^T(\Phi_\pi^{(s,a)} - \Phi_\pi^{(s,b)}) \geq 0, \tag{59}$$
$$\forall a \in \text{support}(\pi(s)),\ b \in \mathcal{A},\ s \in \mathcal{S}, \tag{60}$$
and the feasible region is therefore convex. □

The consistent reward set of a demonstration from an optimal policy can be defined similarly:

Corollary 2. [13] Given a set of demonstrations $\mathcal{D}$ from a policy $\pi$, $\text{CRS}(\mathcal{D} \mid \pi)$ is given by the following intersection of half-spaces:
$$w^T(\Phi_\pi^{(s,a)} - \Phi_\pi^{(s,b)}) \geq 0, \quad \forall (s,a) \in \mathcal{D},\ b \in \mathcal{A}. \tag{61}$$

Proof.
The proof follows from the proof of Corollary 1 by only considering the half-spaces corresponding to the optimal $(s,a)$ pairs in the demonstration. □

It is important to note that while Corollary 1 may seem to solve the alignment verification problem, it only provides a necessary, not a sufficient, condition. Just because a reward function lies within the CRS of a policy does not mean the agents are aligned. Consider the all-zero reward: it lies in the CRS of every policy; however, an agent optimizing the zero reward can end up with any policy. Even ignoring the all-zero reward, there can be rewards on the boundary of the CRS polytope that are consistent with a policy but not value aligned, since they admit more than one optimal policy, one or more of which may not be optimal under the tester's reward function. A sketch of assembling these half-space constraints from feature expectations follows.
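For concreteness, the half-space constraints of Corollaries 1 and 2 can be built directly from expected discounted feature counts. The following sketch assumes a tabular array layout for the feature expectations; the array shape and names are illustrative assumptions, not our implementation.

```python
import numpy as np

def crs_halfspaces(Phi, pi_opt):
    """Build the half-space normals of Corollary 1 from feature expectations.

    Phi    : (n_states, n_actions, k) array; Phi[s, a] is the expected
             discounted feature count of taking action a in s and then
             following pi
    pi_opt : list of sets; pi_opt[s] = indices of optimal actions in s
    Returns an (m, k) array H such that CRS = {w : H @ w >= 0}.
    """
    normals = []
    n_states, n_actions, _ = Phi.shape
    for s in range(n_states):
        for a in pi_opt[s]:               # a in support(pi(s))
            for b in range(n_actions):    # all alternative actions b
                normals.append(Phi[s, a] - Phi[s, b])
    return np.array(normals)
```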
C Value Alignment Verification with Explicit Values

Proposition 2.
Under the assumption of a rational subject agent that shares the same linear reward features as the tester, efficient exact value alignment verification is possible in the following query settings: (1) query access to the reward function weights $w'$, (2) query access to samples of the reward function $R'(s)$, (3) query access to $Q^*_{R'}(s,a)$, and (4) query access to preferences over trajectories.

Proof. The proof of case (1) follows directly from Theorem 2.

In case (2), the tester can query for samples of the reward function $R'(s)$. If the tester only has query access to $R'(s)$, the same test can be used, since the tester knows the features $\phi(s)$ and can solve a system of linear equations to recover the weight vector $w'$ after sampling a sufficient number of $R'(s)$ values. Note that this also works for rewards that are functions of $(s,a)$ and $(s,a,s')$. The number of required samples is equal to $\text{rank}(\Phi)$, where $\Phi$ is the matrix whose rows are the features $\phi(s)$ of each unique state. Thus, in the worst case we need only $k$ samples from the subject's reward function, giving a system with $k$ unknowns and $k$ equations. If there is noise in the sampling procedure, then linear regression can be used to efficiently estimate the subject's weight vector $w'$. After recovering the weight vector, the same value alignment test used for case (1) can be applied.

In case (3), the tester has access to the value function of the subject. If the tester can query the subject agent's value function, then $w'$ can be recovered by solving a linear system of equations, since for any agent
$$R(s) = w^T \phi(s) = Q(s,a) - \gamma\, \mathbb{E}_{s' \mid s,a}\left[\max_{a'} Q(s', a')\right] \tag{62}$$
and the tester knows $\phi(s)$ and can query for $Q(s,a)$. As in case (2), we need only $\text{rank}(\Phi) \leq k$ queries to the subject's value function, and linear regression can be used if there is noise in the sampling process. The tester can then verify value alignment via the reward-function value alignment test used in case (1).

In case (4), the tester only has access to the subject's values via preference queries over trajectories. If the subject agent can answer pairwise preferences over trajectories, then a value alignment test can be constructed via the ARP. Each preference over trajectories $\xi_A \prec \xi_B$ induces the constraint $w^T(\Phi(\xi_B) - \Phi(\xi_A)) > 0$. Thus, given a test $T$ consisting of preferences over trajectories, we can guarantee value alignment if
$$\{w \mid w^T(\Phi(\xi_B) - \Phi(\xi_A)) > 0, \ \forall (\xi_A, \xi_B) \in T\} \subseteq \text{ARP}(w). \tag{63}$$
Note that a single trajectory in general will not exactly match the successor features of a stochastic policy. However, by synthesizing arbitrary trajectories we can create more half-space constraints than are used to define the ARP, since these trajectories do not need to be the product of a rational policy. As more trajectory queries are asked, the estimate of the ARP approaches the true ARP. Brown et al. [11] proved that, given random half-space constraints, the volume of the constrained polytope decreases exponentially; thus only a logarithmic number of queries is needed to accurately approximate the ARP. A sketch of the weight-recovery step shared by cases (2) and (3) follows.
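The weight-recovery step in cases (2) and (3) is ordinary least squares. A minimal sketch, assuming the tester has stacked its queried feature vectors and the subject's responses into arrays (names are illustrative):

```python
import numpy as np

def recover_weights(phi, r_samples):
    """Recover the subject's linear reward weights from sampled values.

    phi       : (m, k) feature matrix, one row phi(s) per queried state
    r_samples : (m,) sampled rewards R'(s), or in case (3) the Bellman
                residuals Q(s,a) - gamma * E[max_a' Q(s',a')] from Eq. (62)
    Least squares also handles noisy samples, acting as linear regression.
    """
    w, *_ = np.linalg.lstsq(phi, r_samples, rcond=None)
    return w
```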
D Value Alignment Verification Heuristics

In this section we discuss the value alignment verification heuristics in more detail.
D.1 Critical State-Action Value Alignment Heuristic
Prior work by Huang et al. [20] seeks to build human-agent trust by asking an agent for its critical states, where critical states are defined as those satisfying
$$Q^*_{R^*}(s, \pi^*_{R^*}(s)) - \frac{1}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} Q^*_{R^*}(s,a) > t \tag{64}$$
for some user-defined threshold $t$. If $t = 0$, then all states are critical states. On the other hand, for large $t$, no states are critical. Thus, $t$ must be carefully tuned to the scale of the reward function and to the particulars of the MDP. Huang et al. [20] also proposed finding critical states in terms of states with policy entropy below some threshold $t$, but found that state-action value critical states performed better. Furthermore, using entropy would label every state as critical for a deterministic policy. State-action value critical states can be computed for both deterministic and stochastic policies, so we only compare against state-action value critical states.

One possible way to use critical states for a value alignment heuristic would be to ask an agent for its critical states and then check whether those match the tester's critical states. However, this is problematic: reward scale is not fixed and there are infinitely many reward functions that lead to the same policy [25], so the gap in Q-values can be arbitrarily large. Thus $t$ would have to be carefully constructed and tuned for both the tester and the agent, making this impractical. Instead, we simply calculate the critical states for the tester under a tester-defined $t$ and then test whether the action that the agent being tested would take in each of the tester's critical states is also optimal under the tester's value function.

This results in the following value alignment heuristic (sketched in code below):
(1) Find critical states in the true MDP for a threshold $t \geq 0$ and save the $(s,a)$ pairs.
(2) For each critical state-action $(s,a)$, query the subject for its action in state $s$ and check whether it is an optimal action under the tester's reward function.
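A minimal sketch of this heuristic, assuming a tabular array of the tester's optimal Q-values and a hypothetical `query_subject_action` callable standing in for the subject agent:

```python
import numpy as np

def critical_state_test(Q, t, query_subject_action):
    """Critical-state heuristic (a sketch, not a guarantee of alignment).

    Q                    : (n_states, n_actions) tester's optimal Q-values
    t                    : tester-defined criticality threshold
    query_subject_action : callable s -> subject's action in state s
    Returns True if the subject passes the test.
    """
    advantage = Q.max(axis=1) - Q.mean(axis=1)  # gap from Eq. (64)
    critical = np.flatnonzero(advantage > t)    # tester's critical states
    for s in critical:
        a_subject = query_subject_action(s)
        # Pass only if the subject's action is optimal under the tester's Q
        if not np.isclose(Q[s, a_subject], Q[s].max()):
            return False
    return True
```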
D.2 Aligned Reward Polytope Black-Box Heuristic

For this heuristic, the tester computes $\text{ARP}(R)$ for its reward function $R$ and then finds the state-action successor features whose induced constraints form the minimal set of constraints defining the ARP. In other words, we find the minimal set of constraints using linear programming, as discussed in Section H.2. To run a verification test, we simply take the set of states corresponding to the minimal set of constraints. Each of these constraints has the form
$$w^T(\Phi_{\pi^*}^{(s,a)} - \Phi_{\pi^*}^{(s,b)}) > 0 \tag{65}$$
for all $a \in \arg\max_{a'} Q^*(s,a')$. The test then consists of asking the agent being tested which action it would take in state $s$ and checking whether that action is optimal under the tester's reward function.
D.3 SCOT Trajectory-Based Heuristic

We also adapt the set cover optimal teaching (SCOT) algorithm [13] for value alignment verification. As in the original paper [13], we first compute feature expectations and then calculate the minimal set of constraints that define the consistent reward set (CRS) using Corollary 1. We then roll out $m$ trajectories from each initial state using the teacher's policy and calculate the CRS of the rollouts using Corollary 2. Finally, we run set cover to find the minimum set of rollouts of length $H$ that implicitly covers the CRS; a greedy sketch of this covering step is given at the end of this subsection.

Given the machine teaching demonstrations from SCOT, we mask the actions and ask the agent being tested what action it would take in each state, then compare this action with the machine teaching action. In particular, we implement this by querying the subject agent for an action at each state $s$ and checking whether this action is optimal under the tester's reward function.

[Figure 4: Example value alignment verification tests for the lava world domain. Panels: (a) optimal policy, (b) preference query 1, (c) preference query 2, (d) ARP black-box queries, (e) SCOT queries, (f) critical state queries.]

Note that none of the heuristics above is guaranteed to verify value alignment, and each may give false positives. However, all are designed to never give a false negative.
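The covering step can be implemented with the standard greedy set-cover approximation. The sketch below is illustrative; the mapping from rollouts to the CRS constraints they cover is assumed to be precomputed.

```python
def greedy_set_cover(universe, candidate_sets):
    """Greedy set cover, a sketch of SCOT's covering step.

    universe       : set of constraint indices defining the CRS
    candidate_sets : dict mapping rollout id -> set of constraint indices
                     covered by that rollout's (s, a) pairs
    Returns a list of rollout ids that together cover the universe.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the rollout covering the most still-uncovered constraints
        best = max(candidate_sets,
                   key=lambda r: len(candidate_sets[r] & uncovered))
        gained = candidate_sets[best] & uncovered
        if not gained:  # remaining constraints cannot be covered
            break
        chosen.append(best)
        uncovered -= gained
    return chosen
```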
E Case Study Continued
To illustrate the types of test queries found via value alignment verification, we consider two domains inspired by the AI safety grid worlds [23]. The first domain, island navigation, is shown in Section 5.1.1. We now discuss another domain inspired by the AI safety gridworlds: lava world. This domain is shown in Figure 4. Figure 4a shows the optimal policy under the tester's reward function
$$R(s) = 50 \cdot \text{green}(s) - c_1 \cdot \text{white}(s) - c_2 \cdot \text{red}(s), \quad 0 < c_1 < c_2, \tag{66}$$
where $\text{color}(s)$ is an indicator feature for the color of the grid cell, so white cells incur a small penalty and red (lava) cells a large one. Shown in Figures 4b and 4c are the two preference queries generated by ARP-pref. In both cases the query consists of two trajectories (shown in black and orange for visualization), and the agent taking the test must decide which trajectory is preferable (we chose the colors such that the black trajectory is preferable to orange). Preference query 1 verifies that the agent would rather move to the terminal state (green) than visit white cells. The second preference query verifies that the agent would rather visit white cells than red cells, and would rather take an indirect path to the goal state (green) than a more direct path that visits a blue cell. Note that the black trajectory in preference query 2 first goes up, which results in a self transition, then goes left to get out of the lava. Shown in Figures 4d, 4e, and 4f are the query states for the ARP-bb, SCOT, and CS heuristics, respectively. In each of these tests, the agent being tested is asked what action its policy would take in each of the states marked with a question mark. To pass the test, the agent must respond with an action that is optimal under the tester's reward function in each of these states. ARP-bb chooses two states where the half-spaces defined by the expected feature counts of following the optimal policy versus taking a suboptimal action and then following the optimal policy fully define the ARP.
F Value Alignment Verification with Idealized Human Tester

We compare the heuristics above with the exact alignment tests described previously that query for the robot's reward function (ARP-w) and query for preferences over trajectories (ARP-pref). Because the tests are designed to always pass aligned agents, we evaluate them on misaligned agents: we constructed a suite of grid navigation domains with varying numbers of states and reward features, and generated 50 different misaligned agents by sampling random reward functions and comparing the resulting optimal policies to the optimal policy under a randomly chosen ground-truth reward function. Figure 5 (a) and (b) show that, for a fixed number of features, the size of the test generated via the critical state heuristic with threshold $t = 0.2$ (CS-0.2) scales poorly with the size of the grid world, even though the complexity of the reward function stays constant. The threshold $t$ has a large impact on performance: small $t$ results in better accuracy at the cost of significantly more queries, and larger $t$ results in significantly more false positives. We chose $t = 0.2$ to minimize false positives while also attempting to keep the test size small. In Figure 5 (c) and (d) we plot how the number of constraints grows as the reward function dimension increases while the MDP size is fixed. The plot for ARP-bb shows that the number of constraints grows with the size of the reward weight vector, as expected. Conversely, the number of critical states has the undesirable property of growing with the size of the MDP, regardless of the complexity of the underlying reward function.

[Figure 5: Queries vs. accuracy (1 − false positive rate) for value alignment testing of misaligned agents. Exact alignment tests (ARP-w and ARP-pref) achieve good efficiency and perfect accuracy.]

By construction, ARP-w requires only one query (querying for $w'$) to achieve perfect accuracy. Using trajectory preferences to define the ARP (ARP-pref) also has perfect accuracy, but requires more queries to the robot. SCOT has sample complexity that is lower than the critical state methods, but much higher than querying directly for reward function weights, since it queries for actions at states along each machine teaching trajectory. We found empirically that SCOT has nearly perfect accuracy, but occasionally produces false positives. The ARP-inspired heuristic (ARP-bb) has low sample complexity and high accuracy, but sometimes produces false positives. These results suggest that the testing method of choice depends on the capability of the robot and the complexity of the environment relative to the robot's reward function. If the robot can report its reward weight vector, then ARP-w has the best performance. If the robot can only answer trajectory preference queries, then ARP-pref should be used. The heuristics (ARP-bb, SCOT, and CS) have higher query costs and lower accuracy, but are applicable when we only have query access to the robot's policy and the robot may not be perfectly rational.
G Details on Value Alignment Verification with Human Tester

This method can be extended to test for $\epsilon$-value alignment. In continuous or complex environments, some trajectories may be too close in value for the subject to correctly tell the difference. This may be because the subject being tested has a reward function not exactly within the ARP of the tester, or because the agent being tested is not perfectly rational. To test the alignment of agents like these, we compute a $(1-\delta)$-confidence $\epsilon$-ARP.

As each $w$ has a probability mass associated with it, we can compute a $(1-\delta)$-confidence bound by taking any $(\xi_i, \xi_j) \in P$ and checking whether
$$Pr\left(w^T(\Phi(\xi_i) - \Phi(\xi_j)) < \epsilon\right) < \delta. \tag{67}$$
We then throw away all constraints for which fewer than $1 - \delta$ of the weights imply a difference in return at least as large as $\epsilon$, taking into account the human's preference for that half-plane. Duplicate, noise, and redundancy filters are then applied to obtain a minimal high-confidence $\epsilon$-ARP that is robust to noise in the human preferences. The trajectory pairs that make up this set of minimal constraints form the test $T$. If the subject agent is a robot with an explicit reward function, then we can use the $\epsilon$-ARP in the same way we used the ARP in the main text, and simply check whether $w'$ is in the intersection defined by Theorem 2. If the agent does not have explicit access to its reward function (e.g., if the subject agent is human), then we can verify alignment by asking the subject agent for preferences over trajectories and checking whether they match the preferences given by the human tester.

We then perform a series of post-processing steps on these preferences to create an efficient, robust test. Trajectory pairs with near-identical feature differences are removed. Humans often make mistakes when giving preferences, so preferences whose half-space constraints contain less than $p_{thresh} = 70\%$ of the mass of the preference elicitation algorithm's reward posterior are filtered out. Additionally, many half-space constraints are implied by a more restrictive constraint, so to reduce the number of questions, linear programming is used to find a minimal set of constraints. A sketch of estimating the confidence check in Eq. (67) from posterior samples appears below.
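The check in Eq. (67) is straightforward to estimate with Monte Carlo samples from the reward posterior. A minimal sketch, assuming an array of posterior weight samples (names are illustrative):

```python
import numpy as np

def keep_constraint(w_samples, phi_i, phi_j, eps, delta):
    """Monte Carlo version of the confidence check in Eq. (67).

    w_samples : (n, k) samples from the reward posterior over weights w
    phi_i     : (k,) feature counts of the preferred trajectory xi_i
    phi_j     : (k,) feature counts of the dispreferred trajectory xi_j
    Keeps the constraint only if at least (1 - delta) of the posterior
    mass implies a return gap of at least eps.
    """
    gaps = w_samples @ (phi_i - phi_j)
    return np.mean(gaps >= eps) >= 1.0 - delta
```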
H Experiment Details

H.1 Exact vs Heuristics Grid Domains
In all grid domains the transition dynamics are deterministic, and actions corresponding to movement up, down, left, and right are available in every state. Actions that would lead the agent off of the grid result in the agent staying in the same state. We ran experiments over different-sized grid worlds with different numbers of features. For each grid world size and number of features, we generated 50 random MDPs with features placed randomly and with a random ground-truth reward function. We then sampled 50 different reward function weight vectors $w$ from the unit hypersphere; this bounds the Q-values of states and allowed us to tune $t$ over a bounded interval for the critical state-action value alignment heuristic. For each reward function we computed an optimal policy to create different agents for verification. Duplicate policies were removed. A sketch of the weight-sampling step is shown below.
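Normalizing standard Gaussian draws is a standard way to sample uniformly from the unit hypersphere; the sketch below illustrates this step (names and the seed are illustrative):

```python
import numpy as np

def sample_unit_weights(n_agents, k, seed=0):
    """Sample n_agents reward weight vectors uniformly from the unit
    hypersphere in R^k by normalizing standard Gaussian draws."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_agents, k))
    return w / np.linalg.norm(w, axis=1, keepdims=True)
```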
H.2 Filtering

All experiments (gridworlds and Driver) perform duplication and redundancy filtering. Duplicate constraints are detected by computing the cosine distance between the half-plane normal vectors; any normal vector within a small threshold (0.0001) of another is deduplicated arbitrarily. Trivial (all-zero) constraints are also removed. Redundant constraints are then removed using the procedure from Brown et al. [13], which we briefly summarize.

A redundant constraint is one that can be removed without changing the interior of the intersection of half-spaces. We can find redundant constraints efficiently using linear programming. To check whether a constraint $a^T x \leq b$ is binding, we remove that constraint and solve the linear program with $\max_x a^T x$ as the objective. If the optimal solution is still constrained to be less than or equal to $b$ even when the constraint is removed, then the constraint can be removed. However, if the optimal value is greater than $b$, then the constraint is non-redundant. Thus, all redundant constraints can be removed by making one pass through the constraints, removing each constraint immediately if it is redundant. A sketch of this check follows at the end of this subsection.

There are two optional filtering steps that are applied when eliciting preferences from a human or when operating in a continuous environment: preferences can be filtered for noise in the human's elicited preference, and preferences can be filtered to allow for $\epsilon$-value alignment. In the main experiments, only $\epsilon$-alignment filtering was performed, as the elicited preferences from the human are exact and consistent. During noise filtering, 1000 rewards are sampled from the posterior distribution implied by the full set of constraints. A preference is considered noisy if the fraction of the rewards outside the constraint implied by that preference is greater than some threshold (0.7); this removes preferences that are likely to be violated under the posterior. $\epsilon$-alignment filtering starts by sampling a new 1000 rewards from the posterior implied by the (possibly noise-filtered) remaining set of constraints. We aim to ensure that $P(w^T(\Phi(\xi_A) - \Phi(\xi_B)) > \epsilon) > 1 - \delta$, and we estimate this probability for each constraint using the samples from the reward posterior. If this condition does not hold for the sampled reward posterior, the constraint is removed.

[Figure 6: False positive and negative rates on noisy data with noise filtering (left) and without noise filtering (right). All false negative rates with noise filtering are exactly 0.]
[Figure 7: Detailed breakdown of mistakes from the human pilot study.]
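A minimal sketch of the one-pass LP redundancy check, using `scipy.optimize.linprog`; the tolerance and the handling of unbounded subproblems are illustrative assumptions (in practice the weights can additionally be confined to the unit ball):

```python
import numpy as np
from scipy.optimize import linprog

def remove_redundant(A, b):
    """One-pass LP redundancy removal for constraints A @ x <= b,
    a sketch of the procedure from Brown et al. [13].
    Returns the indices of the non-redundant constraints."""
    keep = list(range(len(A)))
    for i in range(len(A)):
        rest = [j for j in keep if j != i]
        # Maximize a_i^T x subject to the remaining constraints
        # (linprog minimizes, so negate the objective).
        res = linprog(-A[i], A_ub=A[rest], b_ub=b[rest],
                      bounds=[(None, None)] * A.shape[1])
        # If the maximum still cannot exceed b_i, constraint i is redundant
        # and is dropped immediately; unbounded LPs mark it non-redundant.
        if res.status == 0 and -res.fun <= b[i] + 1e-9:
            keep.remove(i)
    return keep
```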
H.3 Noise Ablation Experiments for Driving Domain

When eliciting preferences between trajectories, humans sometimes report their preferences incorrectly. Biyik and Sadigh [9] assume that a human with true reward $w$ reports the preference $\xi_A \prec \xi_B$ with probability $\min(1, \exp(w^T(\Phi(\xi_B) - \Phi(\xi_A))))$. We run ten simulations of 1000 noisily reported human preferences using this model, and evaluate our active test generation pipeline with and without the noise filtering specified in Section H.2. Roughly speaking, this filtering removes constraints that exclude too much of the posterior reward distribution.

These experiments find that noise filtering does reduce the false-negative rate to 0, but at the cost of sometimes large increases in the false positive rate. The relative cost of false positives and false negatives depends on the specific application, but even without noise filtering, false negatives are rare enough that most practitioners will not want to perform noise filtering on elicited preferences. These experiments are highly idealized, based on a specific model of noise in elicited human preferences that may not hold for actual humans, and further study is needed to determine whether the false positive rate of alignment tests generated from elicited human preferences is high enough to justify noise filtering of some kind. A sketch of the simulated noise model appears below.
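A minimal sketch of this noise model as we simulate it (function and variable names are illustrative):

```python
import numpy as np

def noisy_preference(w, phi_A, phi_B, rng=None):
    """Simulate the Biyik and Sadigh noise model (a sketch): report
    xi_A < xi_B with probability min(1, exp(w^T (Phi(xi_B) - Phi(xi_A)))),
    otherwise report the opposite preference."""
    if rng is None:
        rng = np.random.default_rng()
    p = min(1.0, np.exp(w @ (phi_B - phi_A)))
    return rng.random() < p  # True means the human reports xi_A < xi_B
```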
H.4 Human Pilot Study

As $\epsilon$ increases, more of the questions are removed from the test. This necessarily increases the number of positive judgements the test produces, all else being equal. The accuracy initially increases with $\epsilon$ because the test has fewer false negatives as more noisy questions are removed. At around $\epsilon = 1.0$