Mitigating Negative Side Effects via Environment Shaping
Sandhya Saisubramanian and Shlomo Zilberstein
College of Information and Computer Sciences, University of Massachusetts Amherst
Abstract
Agents operating in unstructured environments often produce negative side effects (NSE), which are difficult to identify at design time. While the agent can learn to mitigate the side effects from human feedback, such feedback is often expensive and the rate of learning is sensitive to the agent's state representation. We examine how humans can assist an agent, beyond providing feedback, and exploit their broader scope of knowledge to mitigate the impacts of NSE. We formulate this problem as a human-agent team with decoupled objectives. The agent optimizes its assigned task, during which its actions may produce NSE. The human shapes the environment through minor reconfiguration actions so as to mitigate the impacts of the agent's side effects, without affecting the agent's ability to complete its assigned task. We present an algorithm to solve this problem and analyze its theoretical properties. Through experiments with human subjects, we assess the willingness of users to perform minor environment modifications to mitigate the impacts of NSE. Empirical evaluation of our approach shows that the proposed framework can successfully mitigate NSE, without affecting the agent's ability to complete its assigned task.
Introduction
Deployed AI agents often require complex design choices to support safe operation in the open world. During design and initial testing, the system designer typically ensures that the agent's model includes all the necessary details relevant to its assigned task. Inherently, many other details of the environment that are unrelated to this task may be ignored. Due to this model incompleteness, the deployed agent's actions may create negative side effects (NSE) (Amodei et al. 2016; Saisubramanian, Zilberstein, and Kamar 2020). For example, consider an agent that needs to push a box to a goal location as quickly as possible (Figure 1(a)). Its model includes accurate information essential to optimizing its assigned task, such as the reward for pushing the box. But details such as the impact of pushing the box over a rug may not be included in the model if the issue is overlooked at system design. Consequently, the agent may push a box over the rug, dirtying it as a side effect. Mitigating such NSE is critical to improve trust in deployed AI systems.

Figure 1: Example configurations for the boxpushing domain: (a) denotes the initial setting in which the actor dirties the rug when pushing the box over it; (b) denotes a modification (a protective sheet over the rug) that avoids the NSE, without affecting the actor's policy.

It is practically impossible to identify all such NSE during system design since agents are deployed in varied settings. Deployed agents often do not have any prior knowledge about NSE, and therefore they do not have the ability to minimize NSE.
How can we leverage human assistance and the broader scope of human knowledge to mitigate negative side effects, when agents are unaware of the side effects and the associated penalties?
A common solution approach in the existing literature is to update the agent's model and policy by learning about NSE from feedback (Hadfield-Menell et al. 2017; Zhang, Durfee, and Singh 2018; Saisubramanian, Kamar, and Zilberstein 2020). The knowledge about the NSE can be encoded by constraints (Zhang, Durfee, and Singh 2018) or by updating the reward function (Hadfield-Menell et al. 2017; Saisubramanian, Kamar, and Zilberstein 2020). These approaches have three main drawbacks. First, the agent's state representation is assumed to have all the necessary features to learn to avoid NSE (Saisubramanian, Kamar, and Zilberstein 2020; Zhang, Durfee, and Singh 2018). In practice, the model may include only the features related to the agent's task. This may affect the agent's learning and adaptation since the NSE could potentially be non-Markovian with respect to this limited state representation. Second, extensive model revisions will likely require suspension of operation and exhaustive evaluation before the system can be redeployed. This is inefficient when the NSE is not safety-critical, especially for complex systems that have been carefully designed and tested for critical safety aspects. Third, updating the agent's policy may not mitigate NSE that are unavoidable. For example, in the boxpushing domain, NSE are unavoidable if the entire floor is covered with a rug.

The key insight of this paper is that agents often operate in environments that are configurable, which can be leveraged to mitigate NSE. We propose environment shaping for deployed agents: a process of applying modest modifications to the current environment to make it more agent-friendly and minimize the occurrence of NSE. Real-world examples show that modest modifications can substantially improve agents' performance. For example, Amazon improved the performance of its warehouse robots by optimizing the warehouse layouts, covering skylights to reduce the glare, and repositioning the air conditioning units to blow out to the side (Simon 2019). Recent studies show the need for infrastructure modifications such as traffic signs with barcodes and redesigned roads to improve the reliability of autonomous vehicles (Agashe and Chapman 2019; Hendrickson, Biehler, and Mashayekh 2014; McFarland 2020). Simple modifications to the environment have also accelerated agent learning (Randløv 2000) and goal recognition (Keren, Gal, and Karpas 2014). These examples show that environment shaping can improve an agent's performance with respect to its assigned task. We target settings in which environment shaping can reduce NSE, without significantly degrading the performance of the agent in completing its assigned task.

Our formulation consists of an actor and a designer. The actor agent computes a policy that optimizes the completion of its assigned task and has no knowledge about NSE of its actions. The designer agent, typically a human, shapes the environment through minor modifications so as to mitigate the NSE of the actor's actions, without affecting the actor's ability to complete its task. The environment is described using a configuration file, such as a map, which is shared between the actor and the designer. Given an environment configuration, the actor computes and shares sample trajectories of its optimal policy with the designer. The designer measures the NSE associated with the actor's policy and modifies the environment by updating the configuration file, if necessary.
We target settings in which the completion of the actor's task is prioritized over minimizing NSE. The sample trajectories and the environment configuration are the only knowledge shared between the actor and the designer.

The advantages of this decoupled approach to minimize NSE are: (1) robustness: it can handle settings in which the actor is unaware of the NSE and does not have the necessary state representation to effectively learn about NSE; and (2) bounded performance: by controlling the space of valid modifications, bounded performance of the actor with respect to its assigned task can be guaranteed.

Our primary contributions are: (1) introducing a novel framework for environment shaping to mitigate the undesirable side effects of agent actions; (2) presenting an algorithm for shaping and analyzing its properties; (3) providing the results of a human subjects study to assess the attitudes of users towards negative side effects and their willingness to shape the environment; and (4) empirical evaluation of the approach on two domains.

Actor-Designer Framework
The problem of mitigating NSE is formulated as a collaborative actor-designer framework with decoupled objectives. The problem setting is as follows. The actor agent operates based on a Markov decision process (MDP) $M_a$ in an environment that is configurable and described by a configuration file $E$, such as a map. The agent's model $M_a$ includes the necessary details relevant to its assigned task, which is its primary objective $o_P$. A factored state representation is assumed. The actor computes a policy $\pi$ that optimizes $o_P$. Executing $\pi$ may lead to NSE, unknown to the actor. The environment designer, typically the user, measures the impact of NSE associated with the actor's $\pi$ and shapes the environment, if necessary. The actor and the environment designer share the configuration file of the environment, which is updated by the environment designer to reflect the modifications. Optimizing $o_P$ is prioritized over minimizing the NSE. Hence, shaping is performed in response to the actor's policy. In the rest of the paper, we refer to the environment designer simply as the designer, not to be confused with the designer of the agent itself.

Each modification is a sequence of design actions. An example is $\{move(table, l_1, l_2), remove(rug)\}$, which moves the table from location $l_1$ to $l_2$ and removes the rug in the current setting. We consider the set of modifications to be finite since the environment is generally optimized for the user and the agent's primary task (Shah et al. 2019) and the user may not be willing to drastically modify it. Additionally, the set of modifications for an environment is included in the problem specification since it is typically controlled by the user and rooted in the NSE they want to mitigate.

We make the following assumptions about the nature of NSE and the modifications: (1) NSE are undesirable but not safety-critical, and their occurrence does not affect the actor's ability to complete its task; (2) the start and goal conditions of the actor are fixed and cannot be altered, so that the modifications do not alter the agent's task; and (3) modifications are applied tentatively for evaluation purposes and the environment is reset if the reconfiguration affects the actor's ability to complete its task or the actor's policy in the modified setting does not minimize the NSE.
Definition 1. An actor-designer framework to mitigate negative side effects (AD-NSE) is defined by $\langle E_0, \mathcal{E}, M_a, M_d, \delta_A, \delta_D \rangle$ with:
• $E_0$ denoting the initial environment configuration;
• $\mathcal{E}$ denoting a finite set of possible reconfigurations of $E_0$;
• $M_a = \langle S, A, T, R, s_0, s_G \rangle$ is the actor's MDP with a discrete and finite state space $S$, discrete actions $A$, transition function $T$, reward function $R$, start state $s_0 \in S$, and a goal state $s_G \in S$;
• $M_d = \langle \Omega, \Psi, C, N \rangle$ is the model of the designer with
  – $\Omega$ denoting a finite set of valid modifications that are available for $E_0$, including $\emptyset$ to indicate that no changes are made;
  – $\Psi: \mathcal{E} \times \Omega \rightarrow \mathcal{E}$ determines the resulting environment configuration after applying a modification $\omega \in \Omega$ to the current configuration, and is denoted by $\Psi(E, \omega)$;
  – $C: \mathcal{E} \times \Omega \rightarrow \mathbb{R}$ is a cost function that specifies the cost of applying a modification to an environment, denoted by $C(E, \omega)$, with $C(E, \emptyset) = 0, \forall E \in \mathcal{E}$;
  – $N = \langle \pi, E, \zeta \rangle$ is a model specifying the penalty for negative side effects in environment $E \in \mathcal{E}$ for the actor's policy $\pi$, with $\zeta$ mapping states in $\pi$ to $E$ for severity estimation;
• $\delta_A \geq 0$ is the actor's slack, denoting the maximum allowed deviation from the optimal value of $o_P$ in $E_0$ when recomputing its policy in a modified environment; and
• $\delta_D \geq 0$ indicates the designer's NSE tolerance threshold.
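To make the formal tuple concrete, the following is a minimal Python sketch of how an AD-NSE specification could be represented. The class and field names (ActorMDP, DesignerModel, ADNSE) are illustrative conventions rather than part of the definition, and the callable interfaces standing in for T, R, Psi, C, and N are simplifying assumptions.

from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

State = Tuple                   # a factored state, e.g., (x, y, has_box)
Action = str
Modification = FrozenSet[str]   # a set of design actions, e.g., {"remove(rug)"}
EnvConfig = FrozenSet[str]      # features describing the environment (the "map")

@dataclass
class ActorMDP:
    """M_a = <S, A, T, R, s0, sG>: the actor's task model."""
    states: List[State]
    actions: List[Action]
    transition: Callable[[State, Action], Dict[State, float]]  # T(s, a) -> {s': prob}
    reward: Callable[[State, Action], float]                   # R(s, a)
    s0: State
    sG: State

@dataclass
class DesignerModel:
    """M_d = <Omega, Psi, C, N>: the designer's shaping model."""
    modifications: List[Modification]                               # Omega (contains frozenset() = no change)
    apply: Callable[[EnvConfig, Modification], EnvConfig]           # Psi(E, w)
    cost: Callable[[EnvConfig, Modification], float]                # C(E, w), with C(E, frozenset()) = 0
    nse_penalty: Callable[[Dict[State, Action], EnvConfig], float]  # N: penalty of policy pi in E

@dataclass
class ADNSE:
    """<E0, E-set, M_a, M_d, delta_A, delta_D>; the reconfiguration set is induced by apply()."""
    E0: EnvConfig
    actor: ActorMDP
    designer: DesignerModel
    delta_A: float = 0.0   # actor's slack
    delta_D: float = 0.0   # designer's NSE tolerance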
Decoupled objectives
The actor's objective is to compute a policy $\pi$ that maximizes its expected reward for $o_P$, from the start state $s_0$ in the current environment configuration $E$:
$$\max_{\pi \in \Pi} V^{\pi}_P(s_0 \mid E), \qquad V^{\pi}_P(s) = R(s, \pi(s)) + \sum_{s' \in S} T(s, \pi(s), s')\, V^{\pi}_P(s'), \ \forall s \in S.$$
When the environment is modified, the actor recomputes its policy and may end up with a longer path to its goal. The slack $\delta_A$ denotes the maximum allowed deviation from the optimal expected reward in $E_0$, $V^*_P(s_0 \mid E_0)$, to facilitate minimizing the NSE via shaping. A policy $\pi'$ in a modified environment $E'$ satisfies the slack $\delta_A$ if $V^*_P(s_0 \mid E_0) - V^{\pi'}_P(s_0 \mid E') \leq \delta_A$.

Given the actor's $\pi$ and environment configuration $E$, the designer first estimates the corresponding NSE and the associated penalty, denoted by $N^E_{\pi}$. The environment is modified if $N^E_{\pi} > \delta_D$, where $\delta_D$ is the designer's tolerance threshold for NSE, assuming $\pi$ is fixed. Given $\pi$ and a set of valid modifications $\Omega$, the designer selects a modification that maximizes its utility:
$$\max_{\omega \in \Omega} U_{\pi}(\omega), \qquad U_{\pi}(\omega) = \underbrace{\left(N^E_{\pi} - N^{\Psi(E,\omega)}_{\pi}\right)}_{\text{reduction in NSE}} - \ C(E, \omega). \quad (1)$$
Typically the designer is the user of the system, and their preferences and tolerance for NSE are captured by the utility of a modification. The designer's objective is to balance the tradeoff between minimizing the NSE and the cost of applying the modification. It is assumed that the cost of the modification is measured in the same units as the NSE penalty. The cost of a modification may be amortized over episodes of the actor performing the task in the environment.
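The decoupled objectives reduce to two simple computations: a slack check for the actor and the utility of Equation 1 for the designer. The sketch below illustrates both, reusing the hypothetical DesignerModel interface from the earlier sketch; the policy is held fixed when evaluating the NSE of a candidate modification, as assumed above.

def satisfies_slack(v_opt_E0: float, v_pi_E_new: float, delta_A: float) -> bool:
    """Slack check: V*_P(s0|E0) - V^pi'_P(s0|E') <= delta_A."""
    return v_opt_E0 - v_pi_E_new <= delta_A

def designer_utility(nse_before: float, nse_after: float, mod_cost: float) -> float:
    """Equation 1: U_pi(w) = (N^E_pi - N^{Psi(E,w)}_pi) - C(E, w)."""
    return (nse_before - nse_after) - mod_cost

def best_modification(problem, policy, env):
    """Pick a utility-maximizing modification for the designer (ties broken arbitrarily)."""
    d = problem.designer
    nse_before = d.nse_penalty(policy, env)
    best, best_u = frozenset(), 0.0          # the empty modification has utility 0
    for w in d.modifications:
        nse_after = d.nse_penalty(policy, d.apply(env, w))
        u = designer_utility(nse_before, nse_after, d.cost(env, w))
        if u > best_u:
            best, best_u = w, u
    return best

The following properties are used to guarantee bounded performance of the actor when shaping the environment to minimize the impacts of NSE.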
Definition 2. Shaping is admissible if it results in an environment configuration in which (1) the actor can complete its assigned task, given $\delta_A$, and (2) the NSE does not increase, relative to $E_0$.
Definition 3. Shaping is proper if (1) it is admissible and (2) it reduces the actor's NSE to be within $\delta_D$.

Figure 2: Actor-designer approach for mitigating NSE.
Definition 4. A robust shaping results in an environment configuration $E$ where all valid policies of the actor, given $\delta_A$, produce NSE within $\delta_D$. That is, $N^E_{\pi} \leq \delta_D$, $\forall \pi: V^*_P(s_0 \mid E) - V^{\pi}_P(s_0 \mid E) \leq \delta_A$.
Shared knowledge
The actor and the designer do not have details about each other's model. Since the objectives are decoupled, knowledge about the exact model parameters of the other agent is not required. Instead, the two key details that are necessary to achieve collaboration are shared: the configuration file describing the environment and the actor's policy. The shared configuration file, such as the map of the environment, allows the designer to effectively communicate the modifications to the actor. The actor's policy is required for the designer to shape the environment. Compact representation of the actor's policy is a practical challenge in large problems (Guestrin, Koller, and Parr 2001). Therefore, instead of sharing the complete policy, the actor provides a finite number of demonstrations $D = \{\tau_1, \tau_2, \ldots, \tau_n\}$ of its optimal policy for $o_P$ in the current environment configuration. Each demonstration $\tau$ is a trajectory from start to goal, following $\pi$. Using $D$, the designer can extract the actor's policy by associating states with actions observed in $D$, and measure its NSE. The designer does not have knowledge about the details of the agent's model but is assumed to be aware of the agent's objectives and general capabilities. This makes it straightforward to construe the agent's trajectories. Naturally, increasing the number and diversity of sample trajectories that cover the actor's reachable states helps the designer improve the accuracy of estimating the actor's NSE and select an appropriate modification. If $D$ does not starve any reachable state, following $\pi$, then the designer can extract the actor's complete policy.
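The policy-extraction step the designer performs on the shared demonstrations can be illustrated in a few lines of Python. This is a sketch under the assumption that each trajectory is a list of (state, action) pairs; states never visited in D remain unmapped, matching the discussion above.

from collections import Counter, defaultdict

def extract_policy(demonstrations):
    """Recover a (partial) policy from sampled trajectories.

    Each trajectory is a list of (state, action) pairs from start to goal.
    With enough diverse trajectories, the extracted policy approaches the
    actor's policy over its reachable states.
    """
    counts = defaultdict(Counter)
    for trajectory in demonstrations:
        for state, action in trajectory:
            counts[state][action] += 1
    # For a deterministic actor every visited state has a single action;
    # the majority vote simply guards against noisy demonstrations.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}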
Solution Approach
Our solution approach for solving AD-NSE, described in Algorithm 1, proceeds in two phases: a planning phase and a shaping phase. In the planning phase (Line 4), the actor computes its policy $\pi$ for $o_P$ in the current environment configuration and generates a finite number of sample trajectories $D$, following $\pi$. The planning phase ends with disclosing $D$ to the designer. The shaping phase (Lines 7-15) begins with the designer associating states with actions observed in $D$ to extract a policy $\hat{\pi}$, and estimating the corresponding NSE penalty, denoted by $N^E_{\hat{\pi}}$. If $N^E_{\hat{\pi}} > \delta_D$, the designer applies a utility-maximizing modification and updates the configuration file. The planning and shaping phases alternate until the NSE is within $\delta_D$ or until all possible modifications have been tested. Figure 2 illustrates this approach.

Algorithm 1 Environment shaping to mitigate NSE
Require: ⟨E0, ℰ, Ma, Md, δ_A, δ_D⟩: AD-NSE
Require: d: Number of sample trajectories
Require: b: Budget for evaluating modifications
 1: E* ← E0    ▷ Initialize best configuration
 2: E ← E0
 3: n ← ∞
 4: D ← Solve Ma for E0 and sample d trajectories
 5: Ω̄ ← Diverse modifications(b, Ω, Md, E0)
 6: while |Ω̄| > 0 do    ▷ Shaping phase
 7:   π̂ ← Extract policy from D
 8:   if N^E_π̂ < n then
 9:     n ← N^E_π̂
10:     E* ← E
11:   end if
12:   if N^E_π̂ ≤ δ_D then break
13:   ω* ← arg max_{ω ∈ Ω̄} U_π̂(ω)
14:   if ω* = ∅ then break
15:   Ω̄ ← Ω̄ \ ω*
16:   D' ← Solve Ma for Ψ(E0, ω*) and sample d trajectories
17:   if D' ≠ {} then
18:     D ← D'
19:     E ← Ψ(E0, ω*)
20:   end if
21: end while
22: return E*

The actor returns $D = \{\}$ when the modification affects its ability to reach the goal, given $\delta_A$. Modifications are applied tentatively for evaluation and the environment is reset if the actor returns $D = \{\}$ or if the reconfiguration does not minimize the NSE. Therefore, all modifications are applied to $E_0$ and it suffices to test each $\omega$ without replacement, as the actor always calculates the corresponding optimal policy. When $M_a$ is solved approximately, without bounded guarantees, it is non-trivial to verify if $\delta_A$ is violated, but the designer selects a utility-maximizing $\omega$ corresponding to this policy. Algorithm 1 terminates when at least one of the following conditions is satisfied: (1) the NSE impact is within $\delta_D$; (2) every $\omega$ has been tested; or (3) the utility-maximizing option is no modification, $\omega^* = \emptyset$, indicating that the cost exceeds the reduction in NSE or that no modification can reduce the NSE further. When no modification reduces the NSE to be within $\delta_D$, the configuration with the least NSE is returned.

Though the designer calculates the utility for all modifications, using $N$, there is an implicit pruning in the shaping phase since only utility-maximizing modifications are evaluated (Line 16). However, when multiple modifications have the same cost and produce similar environment configurations, the algorithm will alternate multiple times between planning and shaping to evaluate all these modifications, which is $|\Omega|$ in the worst case.
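For readers who prefer code to pseudocode, the following is a compact Python rendering of the Algorithm 1 loop. It assumes a hypothetical helper solve_and_sample(actor, env, d, delta_A) that solves the actor's MDP in a configuration and returns d trajectories (or an empty list when the slack cannot be met), and it reuses extract_policy, designer_utility, and diverse_modifications (sketched in the next subsection). It illustrates the control flow under these assumptions and is not the implementation used in our experiments.

def shape_environment(problem, solve_and_sample, d=10, b=None):
    """Alternate planning and shaping (Algorithm 1); return the best configuration found."""
    designer = problem.designer
    E_star, E, best_nse = problem.E0, problem.E0, float("inf")
    D = solve_and_sample(problem.actor, problem.E0, d, problem.delta_A)   # planning phase
    candidates = diverse_modifications(b or len(designer.modifications),
                                       designer.modifications, designer, problem.E0)
    while candidates:                                                     # shaping phase
        pi_hat = extract_policy(D)
        nse = designer.nse_penalty(pi_hat, E)
        if nse < best_nse:                       # remember the configuration with least NSE
            best_nse, E_star = nse, E
        if nse <= problem.delta_D:               # NSE within the designer's tolerance
            break
        w = max(candidates, key=lambda w: designer_utility(
            nse, designer.nse_penalty(pi_hat, designer.apply(problem.E0, w)),
            designer.cost(problem.E0, w)))
        if not w:                                # the empty modification maximizes utility
            break
        candidates.remove(w)                     # test each modification without replacement
        D_new = solve_and_sample(problem.actor, designer.apply(problem.E0, w), d, problem.delta_A)
        if D_new:                                # slack respected: keep the modification
            D, E = D_new, designer.apply(problem.E0, w)
    return E_star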
To minimize the number of evaluations in settings with a large $\Omega$ consisting of multiple similar modifications, we present a greedy approach to identify and evaluate diverse modifications.

Selecting diverse modifications
Let $0 < b \leq |\Omega|$ denote the maximum number of modifications the designer is willing to evaluate. When $b < |\Omega|$, it is beneficial to evaluate $b$ diverse modifications. Algorithm 2 presents a greedy approach to select $b$ diverse modifications. If two modifications result in similar environment configurations, the algorithm prunes the modification with the higher cost. The similarity threshold is controlled by $\epsilon$. This process is repeated until $b$ modifications are identified. Measures such as the Jaccard distance or embeddings may be used to estimate the similarity between two environment configurations.

Algorithm 2 Diverse modifications(b, Ω, Md, E0)
  Ω̄ ← Ω
  if b ≥ |Ω| then return Ω
  for each ω1, ω2 ∈ Ω̄ do
    if similarity(ω1, ω2) ≤ ε then
      Ω̄ ← Ω̄ \ arg max_ω(C(E0, ω1), C(E0, ω2))
    if |Ω̄| = b then return Ω̄
  end for
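A possible instantiation of Algorithm 2, using Jaccard distance over the feature sets of the resulting configurations as the similarity measure, is sketched below; the eps threshold and the set-based configuration representation are illustrative assumptions, not fixed by the algorithm.

def jaccard_distance(config_a, config_b):
    """1 - |A ∩ B| / |A ∪ B| over the feature sets of two environment configurations."""
    union = config_a | config_b
    if not union:
        return 0.0
    return 1.0 - len(config_a & config_b) / len(union)

def diverse_modifications(b, modifications, designer, E0, eps=0.1):
    """Greedy pruning (Algorithm 2): among pairs whose resulting configurations are
    within eps of each other, drop the costlier modification, until at most b remain."""
    kept = list(modifications)
    if b >= len(kept):
        return kept
    for w1 in list(kept):
        for w2 in list(kept):
            if w1 is w2 or w1 not in kept or w2 not in kept:
                continue
            if jaccard_distance(designer.apply(E0, w1), designer.apply(E0, w2)) <= eps:
                costlier = max((w1, w2), key=lambda w: designer.cost(E0, w))
                kept.remove(costlier)
            if len(kept) == b:
                return kept
    return kept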
Theoretical Properties
For the sake of clarity of the theoretical analysis, we assume that the designer shapes the environment based on the actor's exact $\pi$, without having to extract it from $D$, and with $b = |\Omega|$. These assumptions allow us to examine the properties without approximation errors and sampling biases.
Proposition 1. Algorithm 1 produces admissible shaping.

Proof. Algorithm 1 ensures that a modification does not negatively impact the actor's ability to complete its task, given $\delta_A$ (Lines 16-19), and stores the configuration with the least NSE (Line 10). Therefore, Algorithm 1 is guaranteed to return an $E^*$ with NSE equal to or lower than that of the initial configuration $E_0$. Hence, shaping using Algorithm 1 is admissible.

Shaping using Algorithm 1 is not guaranteed to be proper because (1) no $\omega \in \Omega$ may be able to reduce the NSE below $\delta_D$; or (2) the cost of such a modification may exceed the corresponding reduction in NSE. With each reconfiguration, the actor is required to recompute its policy. We now show that there exists a class of problems for which the actor's policy is unaffected by shaping.
Definition 5. A modification is policy-preserving if the actor's policy is unaltered by environment shaping.
This property induces policy invariance before and after shaping and is therefore considered to be backward-compatible. Additionally, the actor is guaranteed to complete its task with $\delta_A = 0$ when a modification is policy-preserving with respect to the actor's optimal policy in $E_0$. Figure 3 illustrates the dynamic Bayesian network for this class of problems, where $r_P$ denotes the reward associated with $o_P$ and $r_N$ denotes the penalty for NSE. Let $F$ denote the set of features in the environment, with the actor's policy depending on $F_P \subset F$ and $F_D \subset F$ altered by the designer's modifications. Let $\vec{f}$ and $\vec{f}'$ be the feature values before and after environment shaping. Given the actor's $\pi$, a policy-preserving $\omega$ follows $F_D = F \setminus F_P$, ensuring $\vec{f}_P = \vec{f}'_P$.

Figure 3: A dynamic Bayesian network description of a policy-preserving modification.
Proposition 2. Given an environment configuration $E$ and the actor's policy $\pi$, a modification $\omega \in \Omega$ is guaranteed to be policy-preserving if $F_D = F \setminus F_P$.

Proof. We prove by contradiction. Let $\omega \in \Omega$ be a modification consistent with $F_D \cap F_P = \emptyset$ and $F = F_P \cup F_D$. Let $\pi$ and $\pi'$ denote the actor's policy before and after shaping with $\omega$ such that $\pi \neq \pi'$. Since the actor always computes an optimal policy for $o_P$ (assuming a fixed strategy for tie-breaking), a difference in the policy indicates that at least one feature in $F_P$ is affected by $\omega$, $\vec{f}_P \neq \vec{f}'_P$. This is a contradiction since $F_D \cap F_P = \emptyset$. Therefore $\vec{f}_P = \vec{f}'_P$ and $\pi = \pi'$. Thus, $\omega \in \Omega$ is policy-preserving when $F_D = F \setminus F_P$.
Corollary 1. A policy-preserving modification identified by Algorithm 1 guarantees admissible shaping.
Furthermore, a policy-preserving modification results in robust shaping when $Pr(F_D = \vec{f}_D) = 0$, $\forall \vec{f} = \vec{f}_P \cup \vec{f}_D$. The policy-preserving property is sensitive to the actor's state representation. However, many real-world problems exhibit this feature independence. In the boxpushing example, the actor's state representation may not include details about the rug since it is not relevant to the task. Moving the rug to a different part of the room that is not on the actor's path to the goal is an example of a policy-preserving and admissible shaping. Modifications such as removing the rug or covering it with a protective sheet are policy-preserving and robust shaping since there is no direct contact between the box and the rug, for all policies of the actor (Figure 1(b)).
Relation to Game Theory
Our solution approach with decoupled objectives can be viewed as a game between the actor and the designer. The action profile or the set of strategies for the actor is the policy space $\Pi$, with respect to $o_P$ and $E$, with payoffs defined by its reward function $R$. The action profile for the designer is the set of all modifications $\Omega$, with payoffs defined by its utility function $U$. In each round of the game, the designer selects a modification that is a best response to the actor's policy $\pi$:
$$b_D(\pi) = \{\omega \in \Omega \mid \forall \omega' \in \Omega,\ U_{\pi}(\omega) \geq U_{\pi}(\omega')\}. \quad (2)$$
The actor's best response is its optimal policy for $o_P$ in the current environment configuration, given $\delta_A$:
$$b_A(E) = \{\pi \in \Pi \mid \forall \pi' \in \Pi,\ V^{\pi}_P(s_0) \geq V^{\pi'}_P(s_0) \ \wedge\ V^{\pi_0}_P(s_0) - V^{\pi}_P(s_0) \leq \delta_A\}, \quad (3)$$
where $\pi_0$ denotes the optimal policy in $E_0$ before initiating environment shaping. These best responses are pure strategies since there exists a deterministic optimal policy for a discrete MDP and a modification is selected deterministically. Proposition 3 shows that AD-NSE is an extensive-form game, which is useful to understand the link between two often distinct fields, decision theory and game theory. This also opens the possibility for future work on game-theoretic approaches for mitigating NSE.
Proposition 3. An AD-NSE with policy constraints induced by Equations 2 and 3 induces an equivalent extensive-form game with incomplete information, denoted by $N$.

Proof Sketch. Our solution approach alternates between the planning phase and the shaping phase. This induces an extensive-form game $N$ with strategy profiles $\Omega$ for the designer and $\Pi$ for the actor, given start state $s_0$. However, each agent selects a best response based on the information available to it and its payoff matrix, and is unaware of the strategies and payoffs of the other agent. The designer is unaware of how its modifications may impact the actor's policy until the actor recomputes its policy, and the actor is unaware of the NSE of its actions. Hence this is an extensive-form game with incomplete information.

Using the Harsanyi transformation, $N$ can be converted to a Bayesian extensive-form game, assuming the players are rational (Harsanyi 1967). This converts $N$ to a game with complete but imperfect information as follows. Nature selects a type $\tau_i$, actor or designer, for each player, which determines its available actions and its payoffs, parameterized by $\theta_A$ and $\theta_D$ for the actor and designer, respectively. Each player knows its type but does not know $\tau$ and $\theta$ of the other player. If the players maintain a probability distribution over the possible types of the other player and update it as the game evolves, then the players select a best response, given their strategies and beliefs. This satisfies the consistency and sequential rationality properties, and therefore there exists a perfect Bayesian equilibrium for $N$.
Extension to Multiple Actors
Our approach can be extended to settings with multiple actors and one designer, with slight modifications to Equation 1 and Algorithm 1. This models large environments such as infrastructure modifications to cities. The actors are assumed to be homogeneous in their capabilities, tasks, and models, but may have different start and goal states. For example, consider multiple self-driving cars that have the same capabilities, task, and model but are navigating between different locations in the environment. When optimizing travel time, the vehicles may travel at a high velocity through potholes, resulting in a bumpy ride for the passenger as a side effect. Driving fast through deep potholes may also damage the car. If the road is rarely used, the utility-maximizing design may be to reduce the speed limit for that segment. If the road segment with potholes is frequently used by multiple vehicles, filling the potholes reduces the NSE for all actors. The designer's utility for an $\omega$, given $K$ actors, is:
$$U_{\vec{\pi}}(\omega) = \sum_{i=1}^{K} \left(N^E_{\pi_i} - N^{\Psi(E,\omega)}_{\pi_i}\right) - C(E, \omega),$$
with $\vec{\pi}$ denoting the policies of the $K$ actors. More complex ways of estimating the utility may be considered. The policy is recomputed for all the actors when the environment is modified. It is assumed that shaping does not introduce a multi-agent coordination problem that did not pre-exist. If a modification impacts the agent coordination for $o_P$, it will be reflected in the slack violation. Shaping for multiple actors requires preserving the slack guarantees of all actors. A policy-preserving modification induces policy invariance for all the actors in this setting.
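The multi-actor utility is a direct extension of Equation 1. A minimal sketch, assuming the same hypothetical designer interface as in the earlier sketches, sums the NSE reduction over the actors' policies and subtracts the single (shared) modification cost.

def multi_actor_utility(designer, policies, E, w):
    """Designer utility with K actors: summed per-actor NSE reduction minus the
    modification cost, which is shared (amortized) across all actors."""
    E_new = designer.apply(E, w)
    reduction = sum(designer.nse_penalty(pi, E) - designer.nse_penalty(pi, E_new)
                    for pi in policies)
    return reduction - designer.cost(E, w)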
Experiments
We present two sets of experimental results. First, we report the results of our user study, conducted specifically to validate two key assumptions of our work: (1) users are willing to engage in shaping and (2) users want to mitigate NSE even when they are not safety-critical. Second, we evaluate the effectiveness of shaping in mitigating NSE on two domains in simulation. The user study complements the experiments that evaluate our algorithmic contribution to solve this problem.
User Study
Setup
We conducted two IRB-approved surveys focusing on two domains similar to the settings in our simulation experiments. The first is a household robot with capabilities such as cleaning the floor and pushing boxes. The second is an autonomous driving domain. We surveyed the participants to understand their attitudes to NSE such as the household robot spraying water on the wall when cleaning the floor, the autonomous vehicle (AV) driving fast through potholes, which results in a bumpy ride for the users, and the AV slamming the brakes to halt at stop signs, which results in sudden jerks for the passengers. We recruited 500 participants on Amazon MTurk to complete a pre-survey questionnaire to assess their familiarity with AI systems and fluency in English. Based on the pre-survey responses, we invited 300 participants aged above 30 to complete the main survey, since they are less likely to game the system (Downs et al. 2010). Participants were required to select the option that best describes their attitude towards NSE occurrence. Responses that were incomplete or with a survey completion time of less than one minute were discarded, and 180 valid responses for each domain are analyzed.
Tolerance to NSE
We surveyed participants to understand their attitude towards NSE that are not safety-critical and whether such NSE affect the user's trust in the system. For the household robot setting, … of the participants indicated willingness to tolerate the NSE. For the driving domain, … were willing to tolerate milder NSE such as bumpiness when the AV drives fast through potholes, and … were willing to tolerate relatively severe NSE such as hard braking at a stop sign. Participants were also required to enter a tolerance score on a scale of 1 to 5, with 5 being the highest. Figure 4 shows the distribution of the user tolerance scores for the two domains in our human subjects study. The mean tolerance scores, along with the confidence intervals, are … for the household robot domain and … for the AV domain. For the household robot domain, … voted a score of at least …, and … voted a score of at least … for the AV domain. Furthermore, … of the respondents of the household robot survey voted that their trust in the system's abilities is unaffected by the NSE and … indicated that their trust may be affected if the NSE are not mitigated over time. Similarly, for the AV driving domain, … voted that their trust is unaffected by the NSE and … voted that NSE may affect their trust if the NSE are not mitigated over time. These results suggest that (1) individual preferences and tolerance of NSE vary and depend on the severity of NSE; (2) users are generally willing to tolerate NSE that are not severe or safety-critical, but prefer to reduce them as much as possible; and (3) mitigating NSE is important to improve trust in AI systems.

Figure 4: User tolerance score.
Willingness to perform shaping
We surveyed participants to determine their willingness to perform environment shaping to mitigate the impacts of NSE. In the household robot domain, shaping involved adding a protective sheet on the surface. In the driving domain, shaping involved installing a pothole-detection sensor that detects potholes and limits the vehicle's speed. Our results show that … of the participants are willing to engage in shaping for the household robot domain and … for the AV domain. Among the respondents who were willing to install the sheet in the household robot domain, … are willing to purchase the sheet ($10) if it is not provided by the manufacturer. Similarly, … were willing to purchase the sensor for the AV, which costs $50. These results indicate that (1) users are generally willing to engage in environment shaping to mitigate the impacts of NSE; and (2) many users are willing to pay for environment shaping as long as it is generally affordable, relative to the cost of the AI system.
Evaluation in Simulation
We evaluate the effectiveness of shaping in two domains: boxpushing (Figure 1) and AV driving (Figure 5). The effectiveness of shaping is studied in settings with avoidable NSE, unavoidable NSE, and with multiple actors.

Figure 5: AV environment shaping with multiple actors.
Baselines
The performance of shaping with budget is compared with the following baselines. First is the Initial approach, in which the actor's policy is naively executed and does not involve shaping or any form of learning to mitigate NSE. Second is shaping with exhaustive search to select a modification, $b = |\Omega|$. Third is the Feedback approach, in which the agent performs a trajectory of its optimal policy for $o_P$ and the human approves or disapproves the observed actions based on the NSE occurrence (Saisubramanian, Kamar, and Zilberstein 2020). The agent then disables all the disapproved actions and recomputes a policy for execution. If the updated policy violates the slack, the actor ignores the feedback and executes its initial policy since $o_P$ is prioritized. We consider two variants of the feedback approach: with and without generalizing the gathered information to unseen situations. In Feedback w/ generalization, the actor uses the human feedback as training data to learn a predictive model of NSE occurrence. The actor disables the actions disapproved by the human and those predicted by the model, and recomputes a policy.

We use Jaccard distance to measure the similarity between environment configurations. A random forest classifier from the sklearn Python package is used for learning a predictive model. The actor's MDP is solved using value iteration. The algorithms were implemented in Python and tested on a computer with 16GB of RAM. We optimize action costs, which are the negation of rewards. Values averaged over 100 trials of planning and execution, along with the standard errors, are reported for the following domains.
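For reference, a standard value-iteration sketch for the actor's MDP is shown below; it treats rewards as negated action costs with an absorbing goal state. The ActorMDP interface is the illustrative one from the framework sketch, not the actual experimental code.

def value_iteration(mdp, gamma=1.0, epsilon=1e-6):
    """Solve the actor's MDP by value iteration; return (values, greedy policy).

    Rewards are the negated action costs, so maximizing value minimizes expected cost;
    the goal state is treated as absorbing with value 0.
    """
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s == mdp.sG:
                continue
            best = max(
                mdp.reward(s, a) + gamma * sum(p * V[s2] for s2, p in mdp.transition(s, a).items())
                for a in mdp.actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            break
    policy = {}
    for s in mdp.states:
        if s == mdp.sG:
            continue
        policy[s] = max(
            mdp.actions,
            key=lambda a: mdp.reward(s, a) + gamma * sum(p * V[s2] for s2, p in mdp.transition(s, a).items()))
    return V, policy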
Boxpushing
We consider a boxpushing domain (Saisubramanian, Kamar, and Zilberstein 2020) in which the actor is required to minimize the expected time taken to push a box to the goal location (Figure 1). Each action costs one time unit. The actions succeed with probability … and may slide right with probability …. Each state is represented as $\langle x, y, b \rangle$, where $x, y$ denote the agent's location and $b$ is a boolean variable indicating whether the agent is pushing the box. Pushing the box over the rug or knocking over a vase on its way results in NSE, incurring a penalty of 5. The designers are the system users. We experiment with a grid of size … × ….
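As an illustration of the designer's NSE model $N$ in this domain, the following hypothetical estimator charges the stated penalty of 5 whenever an observed trajectory pushes the box over a rug cell or enters a vase cell, and averages over the shared demonstrations; the cell sets and the averaging scheme are illustrative assumptions supplied by the user, not part of the domain specification.

def boxpushing_nse_penalty(trajectories, rug_cells, vase_cells, penalty=5.0):
    """Estimate the NSE penalty of the actor's behavior from demonstrations.

    Each trajectory is a list of ((x, y, has_box), action) pairs; a penalty is
    charged when the box is pushed over a rug cell or a vase cell is entered.
    """
    total = 0.0
    for trajectory in trajectories:
        for (x, y, has_box), _action in trajectory:
            if has_box and (x, y) in rug_cells:
                total += penalty
            if (x, y) in vase_cells:
                total += penalty
    return total / max(len(trajectories), 1)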
Driving
Our second domain is based on simulated autonomous driving (Saisubramanian, Kamar, and Zilberstein 2020) in which the actor's objective is to minimize the expected cost of navigation from a start to a goal location, during which it may encounter some potholes. Each state is the agent's location, represented by $\langle x, y \rangle$. We consider a grid of size … × …. The actor can move in all four directions at low and high speeds, with costs 2 and 1 respectively. Driving fast through shallow potholes results in a bumpy ride for the user, which is a mild NSE with a penalty of …. Driving fast through a deep pothole may damage the car in addition to the unpleasant experience for the rider, and is therefore a severe NSE with a penalty of …. Infrastructure management authorities are the designers in this setting, and the state space is divided into four zones, similar to the geographical divisions of cities for urban planning.
Available Modifications
We consider 24 modifications for the boxpushing domain, such as adding a protective sheet over the rug, moving the vase to corners of the room, removing the rug, and blocking access to the rug and vase area, among others. Removing the rug costs … per unit area covered by the rug, moving the vase costs …, and all other modifications cost … per unit. Except for blocking access to the rug and vase area, all the other modifications are policy-preserving for the actor state representation described earlier. The modifications considered for the driving domain are: reduce the speed limit in zone $i$, $1 \leq i \leq 4$; reduce speed limits in all four zones; fill all potholes; fill deep potholes in all zones; reduce the speed limit in zones with shallow potholes; and fill deep potholes. Reducing the speed limit costs … units per pothole in that zone and filling each pothole costs … units. Reducing the speed limit disables the 'move fast' action for the actor. Filling the potholes is a policy-preserving modification.
Effectiveness of shaping
The effectiveness of shaping is evaluated in terms of the average NSE penalty incurred and the expected value of $o_P$ after shaping. Figure 6 plots the results with $\delta_A = 0$, $\delta_D = 0$, and $b = 3$, as the number of observed actor trajectories is increased. The results are plotted for the boxpushing domain with one actor in settings with avoidable and unavoidable NSE. The feedback budget is 500. Feedback without generalization does not minimize NSE and performs similar to the Initial baseline, because the actor has no knowledge about the undesirability of actions not included in the demonstration. As a result, when an action is disapproved, the actor may execute an alternate action that is equally bad or worse than the initial action in that state, in terms of NSE. Generalizing the feedback overcomes this drawback and mitigates the NSE relatively. With at most five trajectories, the designer is able to select a policy-preserving modification that avoids NSE. Shaping w/ budget $b = 3$ performs similar to shaping by evaluating all modifications. The trend in the relative performances of the different techniques is similar for both avoidable and unavoidable NSE. Table 6(f) shows that shaping w/ budget reduces the number of shaping evaluations by ….

Figure 6: Results on the boxpushing domain with avoidable NSE (a-b) and unavoidable NSE (c-d): (a, c) average NSE penalty; (b, d) expected cost of $o_P$.
Effect of slack on shaping
We vary $\delta_A$ and $\delta_D$, and plot the resulting NSE penalty in Figure 7. We report results for the driving domain with a single actor and designer, and shaping based on 100 observed trajectories of the actor. We vary $\delta_A$ between 0 and … of $V^*_P(s_0 \mid E_0)$ and $\delta_D$ between 0 and … of the NSE penalty of the actor's policy in $E_0$. Figure 7 shows that increasing the slack helps reduce the NSE, as expected. In particular, when $\delta_D \geq$ …, the NSE penalty is considerably reduced with $\delta_A = 15\%$. Overall, increasing $\delta_A$ is most effective in reducing the NSE. We also tested the effect of slack on the cost for $o_P$. The results showed that the cost was predominantly affected by $\delta_A$. We observed similar performance for all values of $\delta_D$ for a fixed $\delta_A$ and therefore do not include that plot. This is expected behavior since $o_P$ is prioritized and shaping is performed in response to $\pi$.

Figure 7: Effect of $\delta_A$ and $\delta_D$ on the average NSE penalty.
Shaping with multiple actors
We measure the cumulative NSE penalty incurred by varying the number of actors in the environment from 10 to 300. Figure 8 shows the results for the driving domain, with $\delta_D = 0$ and $\delta_A = 25\%$ of the optimal cost for $o_P$, for each actor. The start and goal locations were randomly generated for the actors. Shaping and feedback are based on 100 observed trajectories of each actor. Shaping w/ budget results are plotted with $b = 4$, which is $50\%$ of $|\Omega|$. The feedback budget for each actor is 500. Although the feedback approach is sometimes comparable to shaping when there are fewer actors, it requires overseeing each actor and providing individual feedback, which is impractical. Overall, as we increase the number of agents, a substantial reduction in NSE is observed with shaping. The time taken for shaping and shaping w/ budget is comparable up to 100 actors, beyond which we see considerable time savings from using shaping w/ budget. For 300 actors, the average time taken for shaping is 752.27 seconds and the time taken for shaping w/ budget is 490.953 seconds.

Figure 8: Results with multiple actors.
Discussion and Future Work
We present an actor-designer framework to mitigate the impacts of NSE by shaping the environment. The general idea of environment modification to influence the behaviors of the acting agents has been previously explored in other contexts (Zhang, Chen, and Parkes 2009), such as to accelerate agent learning (Randløv 2000), to quickly infer the goals of the actor (Keren, Gal, and Karpas 2014), and to maximize the agent's reward (Keren et al. 2017). We study how environment shaping can mitigate NSE. We also identify the conditions under which the actor's policy is unaffected by shaping and show that our approach guarantees bounded performance of the actor.

Our approach is well-suited for settings where the agent is well-trained to perform its assigned task and must perform it repeatedly, but NSE are discovered after deployment. Altering the agent's model after deployment requires the environment designer to be (or have the knowledge of) the designer of the AI system. Otherwise, the system will have to be suspended and returned to the manufacturer. Our framework provides a tool for the users to mitigate NSE without making any assumptions about the agent's model, its learning capabilities, the representation format for the agent to effectively use new information, or the designer's knowledge about the agent's model. We argue that in many contexts, it is much easier for users to reconfigure the environment than to accurately update the agent's model. Our algorithm guides the users in the shaping process.

In addition to the greedy approach described in Algorithm 2, we also tested a clustering-based approach to identify diverse modifications. In our experiments, the clustering approach performed comparably to the greedy approach in identifying diverse modifications but had a longer run time. It is likely that the benefits of the clustering approach would be more evident in settings with a much larger $\Omega$, since it can quickly group similar modifications. In the future, we aim to experiment with a large set of modifications, say $|\Omega| >$ …, and compare the performance of the clustering approach with that of Algorithm 2.

When the actor's model $M_a$ has a large state space, existing algorithms for solving large MDPs may be leveraged. When $M_a$ is solved approximately, without bounded guarantees, slack guarantees cannot be established. Extending our approach to multiple actors with different models is an interesting future direction.

We conducted experiments with human subjects to validate their willingness to perform environment shaping. In our user study, we focused on scenarios that are more accessible to participants from diverse backgrounds so that we can obtain reliable responses regarding their overall attitude to NSE and shaping. In the future, we aim to evaluate whether Algorithm 1 aligns with human judgment in selecting the best modification. Additionally, we will examine ways to automatically identify useful modifications in a large space of valid modifications.
Acknowledgments
Support for this work was provided in part by the Semiconductor Research Corporation under grant ….
References
Agashe, N.; and Chapman, S. 2019. Traffic Signs in the Evolving World of Autonomous Vehicles. Technical report, Avery Dennison.

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565.

Downs, J. S.; Holbrook, M. B.; Sheng, S.; and Cranor, L. F. 2010. Are Your Participants Gaming the System? Screening Mechanical Turk Workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2399–2402.

Guestrin, C.; Koller, D.; and Parr, R. 2001. Max-norm Projections for Factored MDPs. In Proceedings of the 17th International Joint Conference on Artificial Intelligence.

Hadfield-Menell, D.; Milli, S.; Abbeel, P.; Russell, S. J.; and Dragan, A. 2017. Inverse Reward Design. In Advances in Neural Information Processing Systems, 6765–6774.

Harsanyi, J. C. 1967. Games with Incomplete Information Played by "Bayesian" Players, Part I. The Basic Model. Management Science.

Hendrickson, C.; Biehler, A.; and Mashayekh, Y. 2014. Pennsylvania Department of Transportation.

Keren, S.; Gal, A.; and Karpas, E. 2014. Goal Recognition Design. In Proceedings of the 24th International Conference on Automated Planning and Scheduling.

Keren, S.; Pineda, L.; Gal, A.; Karpas, E.; and Zilberstein, S. 2017. Equi-Reward Utility Maximizing Design in Stochastic Environments. In Proceedings of the 26th International Joint Conference on Artificial Intelligence.

Randløv, J. 2000. In Proceedings of the 17th International Conference on Machine Learning.

Saisubramanian, S.; Kamar, E.; and Zilberstein, S. 2020. A Multi-Objective Approach to Mitigate Negative Side Effects. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, 354–361.

Saisubramanian, S.; Zilberstein, S.; and Kamar, E. 2020. Avoiding Negative Side Effects due to Incomplete Knowledge of AI Systems. CoRR abs/2008.12146.

Shah, R.; Krasheninnikov, D.; Alexander, J.; Abbeel, P.; and Dragan, A. 2019. Preferences Implicit in the State of the World. In Proceedings of the 7th International Conference on Learning Representations (ICLR).

Zhang, H.; Chen, Y.; and Parkes, D. C. 2009. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2002–2014.

Zhang, S.; Durfee, E. H.; and Singh, S. P. 2018. Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes.