Mitigating Negative Side Effects via Environment Shaping
Sandhya Saisubramanian and Shlomo Zilberstein
College of Information and Computer Sciences, University of Massachusetts Amherst
Abstract
Agents operating in unstructured environments often produce negative side effects (NSE), which are difficult to identify at design time. While the agent can learn to mitigate the side effects from human feedback, such feedback is often expensive and the rate of learning is sensitive to the agent's state representation. We examine how humans can assist an agent, beyond providing feedback, and exploit their broader scope of knowledge to mitigate the impacts of NSE. We formulate this problem as a human-agent team with decoupled objectives. The agent optimizes its assigned task, during which its actions may produce NSE. The human shapes the environment through minor reconfiguration actions so as to mitigate the impacts of the agent's side effects, without affecting the agent's ability to complete its assigned task. We present an algorithm to solve this problem and analyze its theoretical properties. Through experiments with human subjects, we assess the willingness of users to perform minor environment modifications to mitigate the impacts of NSE. Empirical evaluation of our approach shows that the proposed framework can successfully mitigate NSE, without affecting the agent's ability to complete its assigned task.
Introduction
Deployed AI agents often require complex design choices to support safe operation in the open world. During design and initial testing, the system designer typically ensures that the agent's model includes all the necessary details relevant to its assigned task. Inherently, many other details of the environment that are unrelated to this task may be ignored. Due to this model incompleteness, the deployed agent's actions may create negative side effects (NSE) (Amodei et al. 2016; Saisubramanian, Zilberstein, and Kamar 2020). For example, consider an agent that needs to push a box to a goal location as quickly as possible (Figure 1(a)). Its model includes accurate information essential to optimizing its assigned task, such as the reward for pushing the box. But details such as the impact of pushing the box over a rug may not be included in the model if the issue is overlooked at system design. Consequently, the agent may push a box over the rug, dirtying it as a side effect. Mitigating such NSE is critical to improve trust in deployed AI systems.

Figure 1: Example configurations for the boxpushing domain: (a) denotes the initial setting in which the actor dirties the rug when pushing the box over it; (b) denotes a modification (a protective sheet over the rug) that avoids the NSE, without affecting the actor's policy.

It is practically impossible to identify all such NSE during system design since agents are deployed in varied settings. Deployed agents often do not have any prior knowledge about NSE, and therefore they do not have the ability to minimize NSE.
How can we leverage human assistance and the broader scope of human knowledge to mitigate negative side effects, when agents are unaware of the side effects and the associated penalties?
A common solution approach in the existing literature is to update the agent's model and policy by learning about NSE from feedback (Hadfield-Menell et al. 2017; Zhang, Durfee, and Singh 2018; Saisubramanian, Kamar, and Zilberstein 2020). The knowledge about the NSE can be encoded by constraints (Zhang, Durfee, and Singh 2018) or by updating the reward function (Hadfield-Menell et al. 2017; Saisubramanian, Kamar, and Zilberstein 2020). These approaches have three main drawbacks. First, the agent's state representation is assumed to have all the necessary features to learn to avoid NSE (Saisubramanian, Kamar, and Zilberstein 2020; Zhang, Durfee, and Singh 2018). In practice, the model may include only the features related to the agent's task. This may affect the agent's learning and adaptation since the NSE could potentially be non-Markovian with respect to this limited state representation. Second, extensive model revisions will likely require suspension of operation and exhaustive evaluation before the system can be redeployed. This is inefficient when the NSE is not safety-critical, especially for complex systems that have been carefully designed and tested for critical safety aspects. Third, updating the agent's policy may not mitigate NSE that are unavoidable. For example, in the boxpushing domain, NSE are unavoidable if the entire floor is covered with a rug.

The key insight of this paper is that agents often operate in environments that are configurable, which can be leveraged to mitigate NSE. We propose environment shaping for deployed agents: a process of applying modest modifications to the current environment to make it more agent-friendly and minimize the occurrence of NSE. Real-world examples show that modest modifications can substantially improve agents' performance. For example, Amazon improved the performance of its warehouse robots by optimizing the warehouse layouts, covering skylights to reduce the glare, and repositioning the air conditioning units to blow out to the side (Simon 2019). Recent studies show the need for infrastructure modifications such as traffic signs with barcodes and redesigned roads to improve the reliability of autonomous vehicles (Agashe and Chapman 2019; Hendrickson, Biehler, and Mashayekh 2014; McFarland 2020). Simple modifications to the environment have also accelerated agent learning (Randløv 2000) and goal recognition (Keren, Gal, and Karpas 2014). These examples show that environment shaping can improve an agent's performance with respect to its assigned task. We target settings in which environment shaping can reduce NSE, without significantly degrading the performance of the agent in completing its assigned task.

Our formulation consists of an actor and a designer. The actor agent computes a policy that optimizes the completion of its assigned task and has no knowledge about NSE of its actions. The designer agent, typically a human, shapes the environment through minor modifications so as to mitigate the NSE of the actor's actions, without affecting the actor's ability to complete its task. The environment is described using a configuration file, such as a map, which is shared between the actor and the designer. Given an environment configuration, the actor computes and shares sample trajectories of its optimal policy with the designer. The designer measures the NSE associated with the actor's policy and modifies the environment by updating the configuration file, if necessary.
We target settings in which the completion of the actor's task is prioritized over minimizing NSE. The sample trajectories and the environment configuration are the only knowledge shared between the actor and the designer.

The advantages of this decoupled approach to minimize NSE are: (1) robustness: it can handle settings in which the actor is unaware of the NSE and does not have the necessary state representation to effectively learn about NSE; and (2) bounded performance: by controlling the space of valid modifications, bounded performance of the actor with respect to its assigned task can be guaranteed.

Our primary contributions are: (1) introducing a novel framework for environment shaping to mitigate the undesirable side effects of agent actions; (2) presenting an algorithm for shaping and analyzing its properties; (3) providing the results of a human subjects study to assess the attitudes of users towards negative side effects and their willingness to shape the environment; and (4) empirical evaluation of the approach on two domains.

Actor-Designer Framework
The problem of mitigating NSE is formulated as a collaborative actor-designer framework with decoupled objectives. The problem setting is as follows. The actor agent operates based on a Markov decision process (MDP) $M_a$ in an environment that is configurable and described by a configuration file $E$, such as a map. The agent's model $M_a$ includes the necessary details relevant to its assigned task, which is its primary objective $o_P$. A factored state representation is assumed. The actor computes a policy $\pi$ that optimizes $o_P$. Executing $\pi$ may lead to NSE, unknown to the actor. The environment designer, typically the user, measures the impact of NSE associated with the actor's $\pi$ and shapes the environment, if necessary. The actor and the environment designer share the configuration file of the environment, which is updated by the environment designer to reflect the modifications. Optimizing $o_P$ is prioritized over minimizing the NSE. Hence, shaping is performed in response to the actor's policy. In the rest of the paper, we refer to the environment designer simply as the designer, not to be confused with the designer of the agent itself.

Each modification is a sequence of design actions. An example is $\{move(table, l_1, l_2), remove(rug)\}$, which moves the table from location $l_1$ to $l_2$ and removes the rug in the current setting. We consider the set of modifications to be finite since the environment is generally optimized for the user and the agent's primary task (Shah et al. 2019) and the user may not be willing to drastically modify it. Additionally, the set of modifications for an environment is included in the problem specification since it is typically controlled by the user and rooted in the NSE they want to mitigate.

We make the following assumptions about the nature of NSE and the modifications: (1) NSE are undesirable but not safety-critical, and their occurrence does not affect the actor's ability to complete its task; (2) the start and goal conditions of the actor are fixed and cannot be altered, so that the modifications do not alter the agent's task; and (3) modifications are applied tentatively for evaluation purposes and the environment is reset if the reconfiguration affects the actor's ability to complete its task or the actor's policy in the modified setting does not minimize the NSE.
Definition 1. An actor-designer framework to mitigate negative side effects (AD-NSE) is defined by $\langle E_0, \mathcal{E}, M_a, M_d, \delta_A, \delta_D \rangle$ with:
• $E_0$ denoting the initial environment configuration;
• $\mathcal{E}$ denoting a finite set of possible reconfigurations of $E_0$;
• $M_a = \langle S, A, T, R, s_0, s_G \rangle$ is the actor's MDP with a discrete and finite state space $S$, discrete actions $A$, transition function $T$, reward function $R$, start state $s_0 \in S$, and a goal state $s_G \in S$;
• $M_d = \langle \Omega, \Psi, C, N \rangle$ is the model of the designer with
  – $\Omega$ denoting a finite set of valid modifications that are available for $E_0$, including $\emptyset$ to indicate that no changes are made;
  – $\Psi: \mathcal{E} \times \Omega \rightarrow \mathcal{E}$ determines the resulting environment configuration after applying a modification $\omega \in \Omega$ to the current configuration, and is denoted by $\Psi(E, \omega)$;
  – $C: \mathcal{E} \times \Omega \rightarrow \mathbb{R}$ is a cost function that specifies the cost of applying a modification to an environment, denoted by $C(E, \omega)$, with $C(E, \emptyset) = 0, \forall E \in \mathcal{E}$;
  – $N = \langle \pi, E, \zeta \rangle$ is a model specifying the penalty for negative side effects in environment $E \in \mathcal{E}$ for the actor's policy $\pi$, with $\zeta$ mapping states in $\pi$ to $E$ for severity estimation;
• $\delta_A \geq 0$ is the actor's slack, denoting the maximum allowed deviation from the optimal value of $o_P$ in $E_0$ when recomputing its policy in a modified environment; and
• $\delta_D \geq 0$ indicates the designer's NSE tolerance threshold.
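To make the formal tuple concrete, the following is a minimal Python sketch of how an AD-NSE specification could be represented. The class and field names (ActorMDP, DesignerModel, ADNSE) are illustrative conventions rather than part of the definition, and the callable interfaces standing in for T, R, Psi, C, and N are simplifying assumptions.

from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

State = Tuple                   # a factored state, e.g., (x, y, has_box)
Action = str
Modification = FrozenSet[str]   # a set of design actions, e.g., {"remove(rug)"}
EnvConfig = FrozenSet[str]      # features describing the environment (the "map")

@dataclass
class ActorMDP:
    """M_a = <S, A, T, R, s0, sG>: the actor's task model."""
    states: List[State]
    actions: List[Action]
    transition: Callable[[State, Action], Dict[State, float]]  # T(s, a) -> {s': prob}
    reward: Callable[[State, Action], float]                   # R(s, a)
    s0: State
    sG: State

@dataclass
class DesignerModel:
    """M_d = <Omega, Psi, C, N>: the designer's shaping model."""
    modifications: List[Modification]                               # Omega (contains frozenset() = no change)
    apply: Callable[[EnvConfig, Modification], EnvConfig]           # Psi(E, w)
    cost: Callable[[EnvConfig, Modification], float]                # C(E, w), with C(E, frozenset()) = 0
    nse_penalty: Callable[[Dict[State, Action], EnvConfig], float]  # N: penalty of policy pi in E

@dataclass
class ADNSE:
    """<E0, E-set, M_a, M_d, delta_A, delta_D>; the reconfiguration set is induced by apply()."""
    E0: EnvConfig
    actor: ActorMDP
    designer: DesignerModel
    delta_A: float = 0.0   # actor's slack
    delta_D: float = 0.0   # designer's NSE tolerance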
Decoupled objectives
The actor's objective is to compute a policy $\pi$ that maximizes its expected reward for $o_P$, from the start state $s_0$ in the current environment configuration $E$:
$$\max_{\pi \in \Pi} V^{\pi}_P(s_0 \mid E), \qquad V^{\pi}_P(s) = R(s, \pi(s)) + \sum_{s' \in S} T(s, \pi(s), s')\, V^{\pi}_P(s'), \ \forall s \in S.$$
When the environment is modified, the actor recomputes its policy and may end up with a longer path to its goal. The slack $\delta_A$ denotes the maximum allowed deviation from the optimal expected reward in $E_0$, $V^*_P(s_0 \mid E_0)$, to facilitate minimizing the NSE via shaping. A policy $\pi'$ in a modified environment $E'$ satisfies the slack $\delta_A$ if $V^*_P(s_0 \mid E_0) - V^{\pi'}_P(s_0 \mid E') \leq \delta_A$.

Given the actor's $\pi$ and environment configuration $E$, the designer first estimates the corresponding NSE and the associated penalty, denoted by $N^E_{\pi}$. The environment is modified if $N^E_{\pi} > \delta_D$, where $\delta_D$ is the designer's tolerance threshold for NSE, assuming $\pi$ is fixed. Given $\pi$ and a set of valid modifications $\Omega$, the designer selects a modification that maximizes its utility:
$$\max_{\omega \in \Omega} U_{\pi}(\omega), \qquad U_{\pi}(\omega) = \underbrace{\left(N^E_{\pi} - N^{\Psi(E,\omega)}_{\pi}\right)}_{\text{reduction in NSE}} - \ C(E, \omega). \quad (1)$$
Typically the designer is the user of the system, and their preferences and tolerance for NSE are captured by the utility of a modification. The designer's objective is to balance the tradeoff between minimizing the NSE and the cost of applying the modification. It is assumed that the cost of the modification is measured in the same units as the NSE penalty. The cost of a modification may be amortized over episodes of the actor performing the task in the environment.
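The decoupled objectives reduce to two simple computations: a slack check for the actor and the utility of Equation 1 for the designer. The sketch below illustrates both, reusing the hypothetical DesignerModel interface from the earlier sketch; the policy is held fixed when evaluating the NSE of a candidate modification, as assumed above.

def satisfies_slack(v_opt_E0: float, v_pi_E_new: float, delta_A: float) -> bool:
    """Slack check: V*_P(s0|E0) - V^pi'_P(s0|E') <= delta_A."""
    return v_opt_E0 - v_pi_E_new <= delta_A

def designer_utility(nse_before: float, nse_after: float, mod_cost: float) -> float:
    """Equation 1: U_pi(w) = (N^E_pi - N^{Psi(E,w)}_pi) - C(E, w)."""
    return (nse_before - nse_after) - mod_cost

def best_modification(problem, policy, env):
    """Pick a utility-maximizing modification for the designer (ties broken arbitrarily)."""
    d = problem.designer
    nse_before = d.nse_penalty(policy, env)
    best, best_u = frozenset(), 0.0          # the empty modification has utility 0
    for w in d.modifications:
        nse_after = d.nse_penalty(policy, d.apply(env, w))
        u = designer_utility(nse_before, nse_after, d.cost(env, w))
        if u > best_u:
            best, best_u = w, u
    return best

The following properties are used to guarantee bounded performance of the actor when shaping the environment to minimize the impacts of NSE.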
Definition 2. Shaping is admissible if it results in an environment configuration in which (1) the actor can complete its assigned task, given $\delta_A$, and (2) the NSE does not increase, relative to $E_0$.
Definition 3. Shaping is proper if (1) it is admissible and (2) it reduces the actor's NSE to be within $\delta_D$.

Figure 2: Actor-designer approach for mitigating NSE.
Definition 4. A robust shaping results in an environment configuration $E$ where all valid policies of the actor, given $\delta_A$, produce NSE within $\delta_D$. That is, $N^E_{\pi} \leq \delta_D$, $\forall \pi: V^*_P(s_0 \mid E) - V^{\pi}_P(s_0 \mid E) \leq \delta_A$.
Shared knowledge
The actor and the designer do not have details about each other's model. Since the objectives are decoupled, knowledge about the exact model parameters of the other agent is not required. Instead, the two key details that are necessary to achieve collaboration are shared: the configuration file describing the environment and the actor's policy. The shared configuration file, such as the map of the environment, allows the designer to effectively communicate the modifications to the actor. The actor's policy is required for the designer to shape the environment. Compact representation of the actor's policy is a practical challenge in large problems (Guestrin, Koller, and Parr 2001). Therefore, instead of sharing the complete policy, the actor provides a finite number of demonstrations $D = \{\tau_1, \tau_2, \ldots, \tau_n\}$ of its optimal policy for $o_P$ in the current environment configuration. Each demonstration $\tau$ is a trajectory from start to goal, following $\pi$. Using $D$, the designer can extract the actor's policy by associating states with actions observed in $D$, and measure its NSE. The designer does not have knowledge about the details of the agent's model but is assumed to be aware of the agent's objectives and general capabilities. This makes it straightforward to construe the agent's trajectories. Naturally, increasing the number and diversity of sample trajectories that cover the actor's reachable states helps the designer improve the accuracy of estimating the actor's NSE and select an appropriate modification. If $D$ does not starve any reachable state, following $\pi$, then the designer can extract the actor's complete policy.
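The policy-extraction step the designer performs on the shared demonstrations can be illustrated in a few lines of Python. This is a sketch under the assumption that each trajectory is a list of (state, action) pairs; states never visited in D remain unmapped, matching the discussion above.

from collections import Counter, defaultdict

def extract_policy(demonstrations):
    """Recover a (partial) policy from sampled trajectories.

    Each trajectory is a list of (state, action) pairs from start to goal.
    With enough diverse trajectories, the extracted policy approaches the
    actor's policy over its reachable states.
    """
    counts = defaultdict(Counter)
    for trajectory in demonstrations:
        for state, action in trajectory:
            counts[state][action] += 1
    # For a deterministic actor every visited state has a single action;
    # the majority vote simply guards against noisy demonstrations.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}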
Solution Approach
Our solution approach for solving AD-NSE, described in Algorithm 1, proceeds in two phases: a planning phase and a shaping phase. In the planning phase (Line 4), the actor computes its policy $\pi$ for $o_P$ in the current environment configuration and generates a finite number of sample trajectories $D$, following $\pi$. The planning phase ends with disclosing $D$ to the designer. The shaping phase (Lines 7-15) begins with the designer associating states with actions observed in $D$ to extract a policy $\hat{\pi}$, and estimating the corresponding NSE penalty, denoted by $N^E_{\hat{\pi}}$. If $N^E_{\hat{\pi}} > \delta_D$, the designer applies a utility-maximizing modification and updates the configuration file. The planning and shaping phases alternate until the NSE is within $\delta_D$ or until all possible modifications have been tested. Figure 2 illustrates this approach.

Algorithm 1 Environment shaping to mitigate NSE
Require: ⟨E0, ℰ, Ma, Md, δ_A, δ_D⟩: AD-NSE
Require: d: Number of sample trajectories
Require: b: Budget for evaluating modifications
 1: E* ← E0    ▷ Initialize best configuration
 2: E ← E0
 3: n ← ∞
 4: D ← Solve Ma for E0 and sample d trajectories
 5: Ω̄ ← Diverse modifications(b, Ω, Md, E0)
 6: while |Ω̄| > 0 do    ▷ Shaping phase
 7:   π̂ ← Extract policy from D
 8:   if N^E_π̂ < n then
 9:     n ← N^E_π̂
10:     E* ← E
11:   end if
12:   if N^E_π̂ ≤ δ_D then break
13:   ω* ← arg max_{ω ∈ Ω̄} U_π̂(ω)
14:   if ω* = ∅ then break
15:   Ω̄ ← Ω̄ \ ω*
16:   D' ← Solve Ma for Ψ(E0, ω*) and sample d trajectories
17:   if D' ≠ {} then
18:     D ← D'
19:     E ← Ψ(E0, ω*)
20:   end if
21: end while
22: return E*

The actor returns $D = \{\}$ when the modification affects its ability to reach the goal, given $\delta_A$. Modifications are applied tentatively for evaluation and the environment is reset if the actor returns $D = \{\}$ or if the reconfiguration does not minimize the NSE. Therefore, all modifications are applied to $E_0$ and it suffices to test each $\omega$ without replacement, as the actor always calculates the corresponding optimal policy. When $M_a$ is solved approximately, without bounded guarantees, it is non-trivial to verify if $\delta_A$ is violated, but the designer selects a utility-maximizing $\omega$ corresponding to this policy. Algorithm 1 terminates when at least one of the following conditions is satisfied: (1) the NSE impact is within $\delta_D$; (2) every $\omega$ has been tested; or (3) the utility-maximizing option is no modification, $\omega^* = \emptyset$, indicating that the cost exceeds the reduction in NSE or that no modification can reduce the NSE further. When no modification reduces the NSE to be within $\delta_D$, the configuration with the least NSE is returned.

Though the designer calculates the utility for all modifications, using $N$, there is an implicit pruning in the shaping phase since only utility-maximizing modifications are evaluated (Line 16). However, when multiple modifications have the same cost and produce similar environment configurations, the algorithm will alternate multiple times between planning and shaping to evaluate all these modifications, which is $|\Omega|$ in the worst case.
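For readers who prefer code to pseudocode, the following is a compact Python rendering of the Algorithm 1 loop. It assumes a hypothetical helper solve_and_sample(actor, env, d, delta_A) that solves the actor's MDP in a configuration and returns d trajectories (or an empty list when the slack cannot be met), and it reuses extract_policy, designer_utility, and diverse_modifications (sketched in the next subsection). It illustrates the control flow under these assumptions and is not the implementation used in our experiments.

def shape_environment(problem, solve_and_sample, d=10, b=None):
    """Alternate planning and shaping (Algorithm 1); return the best configuration found."""
    designer = problem.designer
    E_star, E, best_nse = problem.E0, problem.E0, float("inf")
    D = solve_and_sample(problem.actor, problem.E0, d, problem.delta_A)   # planning phase
    candidates = diverse_modifications(b or len(designer.modifications),
                                       designer.modifications, designer, problem.E0)
    while candidates:                                                     # shaping phase
        pi_hat = extract_policy(D)
        nse = designer.nse_penalty(pi_hat, E)
        if nse < best_nse:                       # remember the configuration with least NSE
            best_nse, E_star = nse, E
        if nse <= problem.delta_D:               # NSE within the designer's tolerance
            break
        w = max(candidates, key=lambda w: designer_utility(
            nse, designer.nse_penalty(pi_hat, designer.apply(problem.E0, w)),
            designer.cost(problem.E0, w)))
        if not w:                                # the empty modification maximizes utility
            break
        candidates.remove(w)                     # test each modification without replacement
        D_new = solve_and_sample(problem.actor, designer.apply(problem.E0, w), d, problem.delta_A)
        if D_new:                                # slack respected: keep the modification
            D, E = D_new, designer.apply(problem.E0, w)
    return E_star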
To minimize the number of evaluations in settings with a large $\Omega$ consisting of multiple similar modifications, we present a greedy approach to identify and evaluate diverse modifications.

Selecting diverse modifications
Let $0 < b \leq |\Omega|$ denote the maximum number of modifications the designer is willing to evaluate. When $b < |\Omega|$, it is beneficial to evaluate $b$ diverse modifications. Algorithm 2 presents a greedy approach to select $b$ diverse modifications. If two modifications result in similar environment configurations, the algorithm prunes the modification with the higher cost. The similarity threshold is controlled by $\epsilon$. This process is repeated until $b$ modifications are identified. Measures such as the Jaccard distance or embeddings may be used to estimate the similarity between two environment configurations.

Algorithm 2 Diverse modifications(b, Ω, Md, E0)
  Ω̄ ← Ω
  if b ≥ |Ω| then return Ω
  for each ω1, ω2 ∈ Ω̄ do
    if similarity(ω1, ω2) ≤ ε then
      Ω̄ ← Ω̄ \ arg max_ω(C(E0, ω1), C(E0, ω2))
    if |Ω̄| = b then return Ω̄
  end for
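A possible instantiation of Algorithm 2, using Jaccard distance over the feature sets of the resulting configurations as the similarity measure, is sketched below; the eps threshold and the set-based configuration representation are illustrative assumptions, not fixed by the algorithm.

def jaccard_distance(config_a, config_b):
    """1 - |A ∩ B| / |A ∪ B| over the feature sets of two environment configurations."""
    union = config_a | config_b
    if not union:
        return 0.0
    return 1.0 - len(config_a & config_b) / len(union)

def diverse_modifications(b, modifications, designer, E0, eps=0.1):
    """Greedy pruning (Algorithm 2): among pairs whose resulting configurations are
    within eps of each other, drop the costlier modification, until at most b remain."""
    kept = list(modifications)
    if b >= len(kept):
        return kept
    for w1 in list(kept):
        for w2 in list(kept):
            if w1 is w2 or w1 not in kept or w2 not in kept:
                continue
            if jaccard_distance(designer.apply(E0, w1), designer.apply(E0, w2)) <= eps:
                costlier = max((w1, w2), key=lambda w: designer.cost(E0, w))
                kept.remove(costlier)
            if len(kept) == b:
                return kept
    return kept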
Theoretical Properties
For the sake of clarity of the theoretical analysis, we assume that the designer shapes the environment based on the actor's exact $\pi$, without having to extract it from $D$, and with $b = |\Omega|$. These assumptions allow us to examine the properties without approximation errors and sampling biases.
Proposition 1. Algorithm 1 produces admissible shaping.

Proof. Algorithm 1 ensures that a modification does not negatively impact the actor's ability to complete its task, given $\delta_A$ (Lines 16-19), and stores the configuration with the least NSE (Line 10). Therefore, Algorithm 1 is guaranteed to return an $E^*$ with NSE equal to or lower than that of the initial configuration $E_0$. Hence, shaping using Algorithm 1 is admissible.

Shaping using Algorithm 1 is not guaranteed to be proper because (1) no $\omega \in \Omega$ may be able to reduce the NSE below $\delta_D$; or (2) the cost of such a modification may exceed the corresponding reduction in NSE. With each reconfiguration, the actor is required to recompute its policy. We now show that there exists a class of problems for which the actor's policy is unaffected by shaping.
Definition 5. A modification is policy-preserving if the actor's policy is unaltered by environment shaping.
This property induces policy invariance before and after shaping and is therefore considered to be backward-compatible. Additionally, the actor is guaranteed to complete its task with $\delta_A = 0$ when a modification is policy-preserving with respect to the actor's optimal policy in $E_0$. Figure 3 illustrates the dynamic Bayesian network for this class of problems, where $r_P$ denotes the reward associated with $o_P$ and $r_N$ denotes the penalty for NSE. Let $F$ denote the set of features in the environment, with the actor's policy depending on $F_P \subset F$ and $F_D \subset F$ altered by the designer's modifications. Let $\vec{f}$ and $\vec{f}'$ be the feature values before and after environment shaping. Given the actor's $\pi$, a policy-preserving $\omega$ follows $F_D = F \setminus F_P$, ensuring $\vec{f}_P = \vec{f}'_P$.

Figure 3: A dynamic Bayesian network description of a policy-preserving modification.
Proposition 2. Given an environment configuration $E$ and the actor's policy $\pi$, a modification $\omega \in \Omega$ is guaranteed to be policy-preserving if $F_D = F \setminus F_P$.

Proof. We prove by contradiction. Let $\omega \in \Omega$ be a modification consistent with $F_D \cap F_P = \emptyset$ and $F = F_P \cup F_D$. Let $\pi$ and $\pi'$ denote the actor's policy before and after shaping with $\omega$ such that $\pi \neq \pi'$. Since the actor always computes an optimal policy for $o_P$ (assuming a fixed strategy for tie-breaking), a difference in the policy indicates that at least one feature in $F_P$ is affected by $\omega$, $\vec{f}_P \neq \vec{f}'_P$. This is a contradiction since $F_D \cap F_P = \emptyset$. Therefore $\vec{f}_P = \vec{f}'_P$ and $\pi = \pi'$. Thus, $\omega \in \Omega$ is policy-preserving when $F_D = F \setminus F_P$.
Corollary 1. A policy-preserving modification identified by Algorithm 1 guarantees admissible shaping.
Furthermore, a policy-preserving modification results in robust shaping when $Pr(F_D = \vec{f}_D) = 0$, $\forall \vec{f} = \vec{f}_P \cup \vec{f}_D$. The policy-preserving property is sensitive to the actor's state representation. However, many real-world problems exhibit this feature independence. In the boxpushing example, the actor's state representation may not include details about the rug since it is not relevant to the task. Moving the rug to a different part of the room that is not on the actor's path to the goal is an example of a policy-preserving and admissible shaping. Modifications such as removing the rug or covering it with a protective sheet are policy-preserving and robust shaping since there is no direct contact between the box and the rug, for all policies of the actor (Figure 1(b)).
Relation to Game Theory
Our solution approach with decoupled objectives can be viewed as a game between the actor and the designer. The action profile or the set of strategies for the actor is the policy space $\Pi$, with respect to $o_P$ and $E$, with payoffs defined by its reward function $R$. The action profile for the designer is the set of all modifications $\Omega$, with payoffs defined by its utility function $U$. In each round of the game, the designer selects a modification that is a best response to the actor's policy $\pi$:
$$b_D(\pi) = \{\omega \in \Omega \mid \forall \omega' \in \Omega,\ U_{\pi}(\omega) \geq U_{\pi}(\omega')\}. \quad (2)$$
The actor's best response is its optimal policy for $o_P$ in the current environment configuration, given $\delta_A$:
$$b_A(E) = \{\pi \in \Pi \mid \forall \pi' \in \Pi,\ V^{\pi}_P(s_0) \geq V^{\pi'}_P(s_0) \ \wedge\ V^{\pi_0}_P(s_0) - V^{\pi}_P(s_0) \leq \delta_A\}, \quad (3)$$
where $\pi_0$ denotes the optimal policy in $E_0$ before initiating environment shaping. These best responses are pure strategies since there exists a deterministic optimal policy for a discrete MDP and a modification is selected deterministically. Proposition 3 shows that AD-NSE is an extensive-form game, which is useful to understand the link between two often distinct fields, decision theory and game theory. This also opens the possibility for future work on game-theoretic approaches for mitigating NSE.
Proposition 3. An AD-NSE with policy constraints induced by Equations 2 and 3 induces an equivalent extensive-form game with incomplete information, denoted by $N$.

Proof Sketch. Our solution approach alternates between the planning phase and the shaping phase. This induces an extensive-form game $N$ with strategy profiles $\Omega$ for the designer and $\Pi$ for the actor, given start state $s_0$. However, each agent selects a best response based on the information available to it and its payoff matrix, and is unaware of the strategies and payoffs of the other agent. The designer is unaware of how its modifications may impact the actor's policy until the actor recomputes its policy, and the actor is unaware of the NSE of its actions. Hence this is an extensive-form game with incomplete information.

Using the Harsanyi transformation, $N$ can be converted to a Bayesian extensive-form game, assuming the players are rational (Harsanyi 1967). This converts $N$ to a game with complete but imperfect information as follows. Nature selects a type $\tau_i$, actor or designer, for each player, which determines its available actions and its payoffs, parameterized by $\theta_A$ and $\theta_D$ for the actor and designer, respectively. Each player knows its type but does not know $\tau$ and $\theta$ of the other player. If the players maintain a probability distribution over the possible types of the other player and update it as the game evolves, then the players select a best response, given their strategies and beliefs. This satisfies the consistency and sequential rationality properties, and therefore there exists a perfect Bayesian equilibrium for $N$.
Extension to Multiple Actors
Our approach can be extended to settings with multiple actors and one designer, with slight modifications to Equation 1 and Algorithm 1. This models large environments such as infrastructure modifications to cities. The actors are assumed to be homogeneous in their capabilities, tasks, and models, but may have different start and goal states. For example, consider multiple self-driving cars that have the same capabilities, task, and model but are navigating between different locations in the environment. When optimizing travel time, the vehicles may travel at a high velocity through potholes, resulting in a bumpy ride for the passenger as a side effect. Driving fast through deep potholes may also damage the car. If the road is rarely used, the utility-maximizing design may be to reduce the speed limit for that segment. If the road segment with potholes is frequently used by multiple vehicles, filling the potholes reduces the NSE for all actors. The designer's utility for an $\omega$, given $K$ actors, is:
$$U_{\vec{\pi}}(\omega) = \sum_{i=1}^{K} \left(N^E_{\pi_i} - N^{\Psi(E,\omega)}_{\pi_i}\right) - C(E, \omega),$$
with $\vec{\pi}$ denoting the policies of the $K$ actors. More complex ways of estimating the utility may be considered. The policy is recomputed for all the actors when the environment is modified. It is assumed that shaping does not introduce a multi-agent coordination problem that did not pre-exist. If a modification impacts the agent coordination for $o_P$, it will be reflected in the slack violation. Shaping for multiple actors requires preserving the slack guarantees of all actors. A policy-preserving modification induces policy invariance for all the actors in this setting.
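The multi-actor utility is a direct extension of Equation 1. A minimal sketch, assuming the same hypothetical designer interface as in the earlier sketches, sums the NSE reduction over the actors' policies and subtracts the single (shared) modification cost.

def multi_actor_utility(designer, policies, E, w):
    """Designer utility with K actors: summed per-actor NSE reduction minus the
    modification cost, which is shared (amortized) across all actors."""
    E_new = designer.apply(E, w)
    reduction = sum(designer.nse_penalty(pi, E) - designer.nse_penalty(pi, E_new)
                    for pi in policies)
    return reduction - designer.cost(E, w)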
Experiments
We present two sets of experimental results. First, we report the results of our user study, conducted specifically to validate two key assumptions of our work: (1) users are willing to engage in shaping and (2) users want to mitigate NSE even when they are not safety-critical. Second, we evaluate the effectiveness of shaping in mitigating NSE on two domains in simulation. The user study complements the experiments that evaluate our algorithmic contribution to solve this problem.
User Study
Setup
We conducted two IRB-approved surveys focusing on two domains similar to the settings in our simulation experiments. The first is a household robot with capabilities such as cleaning the floor and pushing boxes. The second is an autonomous driving domain. We surveyed the participants to understand their attitudes to NSE such as the household robot spraying water on the wall when cleaning the floor, the autonomous vehicle (AV) driving fast through potholes, which results in a bumpy ride for the users, and the AV slamming the brakes to halt at stop signs, which results in sudden jerks for the passengers. We recruited 500 participants on Amazon MTurk to complete a pre-survey questionnaire to assess their familiarity with AI systems and fluency in English. Based on the pre-survey responses, we invited 300 participants aged above 30 to complete the main survey, since they are less likely to game the system (Downs et al. 2010). Participants were required to select the option that best describes their attitude towards NSE occurrence. Responses that were incomplete or with a survey completion time of less than one minute were discarded, and 180 valid responses for each domain are analyzed.
Tolerance to NSE
We surveyed participants to understand their attitude towards NSE that are not safety-critical and whether such NSE affect the user's trust in the system. For the household robot setting, … of the participants indicated willingness to tolerate the NSE. For the driving domain, … were willing to tolerate milder NSE such as bumpiness when the AV drives fast through potholes, and … were willing to tolerate relatively severe NSE such as hard braking at a stop sign. Participants were also required to enter a tolerance score on a scale of 1 to 5, with 5 being the highest. Figure 4 shows the distribution of the user tolerance scores for the two domains in our human subjects study. The mean tolerance scores, along with the confidence intervals, are … for the household robot domain and … for the AV domain. For the household robot domain, … voted a score of at least …, and … voted a score of at least … for the AV domain. Furthermore, … of the respondents of the household robot survey voted that their trust in the system's abilities is unaffected by the NSE and … indicated that their trust may be affected if the NSE are not mitigated over time. Similarly, for the AV driving domain, … voted that their trust is unaffected by the NSE and … voted that NSE may affect their trust if the NSE are not mitigated over time. These results suggest that (1) individual preferences and tolerance of NSE vary and depend on the severity of NSE; (2) users are generally willing to tolerate NSE that are not severe or safety-critical, but prefer to reduce them as much as possible; and (3) mitigating NSE is important to improve trust in AI systems.

Figure 4: User tolerance score.
Willingness to perform shaping
We surveyed participants to determine their willingness to perform environment shaping to mitigate the impacts of NSE. In the household robot domain, shaping involved adding a protective sheet on the surface. In the driving domain, shaping involved installing a pothole-detection sensor that detects potholes and limits the vehicle's speed. Our results show that … of the participants are willing to engage in shaping for the household robot domain and … for the AV domain. Among the respondents who were willing to install the sheet in the household robot domain, … are willing to purchase the sheet ($10) if it is not provided by the manufacturer. Similarly, … were willing to purchase the sensor for the AV, which costs $50. These results indicate that (1) users are generally willing to engage in environment shaping to mitigate the impacts of NSE; and (2) many users are willing to pay for environment shaping as long as it is generally affordable, relative to the cost of the AI system.
Evaluation in Simulation
We evaluate the effectiveness of shaping in two domains: boxpushing (Figure 1) and AV driving (Figure 5). The effectiveness of shaping is studied in settings with avoidable NSE, unavoidable NSE, and with multiple actors.

Figure 5: AV environment shaping with multiple actors.
Baselines
The performance of shaping with budget is compared with the following baselines. First is the Initial approach, in which the actor's policy is naively executed and does not involve shaping or any form of learning to mitigate NSE. Second is shaping with exhaustive search to select a modification, $b = |\Omega|$. Third is the Feedback approach, in which the agent performs a trajectory of its optimal policy for $o_P$ and the human approves or disapproves the observed actions based on the NSE occurrence (Saisubramanian, Kamar, and Zilberstein 2020). The agent then disables all the disapproved actions and recomputes a policy for execution. If the updated policy violates the slack, the actor ignores the feedback and executes its initial policy since $o_P$ is prioritized. We consider two variants of the feedback approach: with and without generalizing the gathered information to unseen situations. In Feedback w/ generalization, the actor uses the human feedback as training data to learn a predictive model of NSE occurrence. The actor disables the actions disapproved by the human and those predicted by the model, and recomputes a policy.

We use Jaccard distance to measure the similarity between environment configurations. A random forest classifier from the sklearn Python package is used for learning a predictive model. The actor's MDP is solved using value iteration. The algorithms were implemented in Python and tested on a computer with 16GB of RAM. We optimize action costs, which are the negation of rewards. Values averaged over 100 trials of planning and execution, along with the standard errors, are reported for the following domains.
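For reference, a standard value-iteration sketch for the actor's MDP is shown below; it treats rewards as negated action costs with an absorbing goal state. The ActorMDP interface is the illustrative one from the framework sketch, not the actual experimental code.

def value_iteration(mdp, gamma=1.0, epsilon=1e-6):
    """Solve the actor's MDP by value iteration; return (values, greedy policy).

    Rewards are the negated action costs, so maximizing value minimizes expected cost;
    the goal state is treated as absorbing with value 0.
    """
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s == mdp.sG:
                continue
            best = max(
                mdp.reward(s, a) + gamma * sum(p * V[s2] for s2, p in mdp.transition(s, a).items())
                for a in mdp.actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            break
    policy = {}
    for s in mdp.states:
        if s == mdp.sG:
            continue
        policy[s] = max(
            mdp.actions,
            key=lambda a: mdp.reward(s, a) + gamma * sum(p * V[s2] for s2, p in mdp.transition(s, a).items()))
    return V, policy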
Boxpushing
We consider a boxpushing domain (Saisubramanian, Kamar, and Zilberstein 2020) in which the actor is required to minimize the expected time taken to push a box to the goal location (Figure 1). Each action costs one time unit. The actions succeed with probability … and may slide right with probability …. Each state is represented as $\langle x, y, b \rangle$, where $x, y$ denote the agent's location and $b$ is a boolean variable indicating whether the agent is pushing the box. Pushing the box over the rug or knocking over a vase on its way results in NSE, incurring a penalty of 5. The designers are the system users. We experiment with a grid of size … × ….
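As an illustration of the designer's NSE model $N$ in this domain, the following hypothetical estimator charges the stated penalty of 5 whenever an observed trajectory pushes the box over a rug cell or enters a vase cell, and averages over the shared demonstrations; the cell sets and the averaging scheme are illustrative assumptions supplied by the user, not part of the domain specification.

def boxpushing_nse_penalty(trajectories, rug_cells, vase_cells, penalty=5.0):
    """Estimate the NSE penalty of the actor's behavior from demonstrations.

    Each trajectory is a list of ((x, y, has_box), action) pairs; a penalty is
    charged when the box is pushed over a rug cell or a vase cell is entered.
    """
    total = 0.0
    for trajectory in trajectories:
        for (x, y, has_box), _action in trajectory:
            if has_box and (x, y) in rug_cells:
                total += penalty
            if (x, y) in vase_cells:
                total += penalty
    return total / max(len(trajectories), 1)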
Driving
Our second domain is based on simulated autonomous driving (Saisubramanian, Kamar, and Zilberstein 2020) in which the actor's objective is to minimize the expected cost of navigation from a start to a goal location, during which it may encounter some potholes. Each state is the agent's location, represented by $\langle x, y \rangle$. We consider a grid of size … × …. The actor can move in all four directions at low and high speeds, with costs 2 and 1 respectively. Driving fast through shallow potholes results in a bumpy ride for the user, which is a mild NSE with a penalty of …. Driving fast through a deep pothole may damage the car in addition to the unpleasant experience for the rider, and is therefore a severe NSE with a penalty of …. Infrastructure management authorities are the designers in this setting, and the state space is divided into four zones, similar to the geographical divisions of cities for urban planning.
Available Modifications
We consider 24 modifications for the boxpushing domain, such as adding a protective sheet over the rug, moving the vase to corners of the room, removing the rug, and blocking access to the rug and vase area, among others. Removing the rug costs … per unit area covered by the rug, moving the vase costs …, and all other modifications cost … per unit. Except for blocking access to the rug and vase area, all the other modifications are policy-preserving for the actor state representation described earlier. The modifications considered for the driving domain are: reduce the speed limit in zone $i$, $1 \leq i \leq 4$; reduce speed limits in all four zones; fill all potholes; fill deep potholes in all zones; reduce the speed limit in zones with shallow potholes; and fill deep potholes. Reducing the speed limit costs … units per pothole in that zone and filling each pothole costs … units. Reducing the speed limit disables the 'move fast' action for the actor. Filling the potholes is a policy-preserving modification.
Effectiveness of shaping
The effectiveness of shaping is evaluated in terms of the average NSE penalty incurred and the expected value of $o_P$ after shaping. Figure 6 plots the results with $\delta_A = 0$, $\delta_D = 0$, and $b = 3$, as the number of observed actor trajectories is increased. The results are plotted for the boxpushing domain with one actor in settings with avoidable and unavoidable NSE. The feedback budget is 500. Feedback without generalization does not minimize NSE and performs similar to the Initial baseline, because the actor has no knowledge about the undesirability of actions not included in the demonstration. As a result, when an action is disapproved, the actor may execute an alternate action that is equally bad or worse than the initial action in that state, in terms of NSE. Generalizing the feedback overcomes this drawback and mitigates the NSE relatively. With at most five trajectories, the designer is able to select a policy-preserving modification that avoids NSE. Shaping w/ budget $b = 3$ performs similar to shaping by evaluating all modifications. The trend in the relative performances of the different techniques is similar for both avoidable and unavoidable NSE. Table 6(f) shows that shaping w/ budget reduces the number of shaping evaluations by ….

Figure 6: Results on the boxpushing domain with avoidable NSE (a-b) and unavoidable NSE (c-d): (a, c) average NSE penalty; (b, d) expected cost of $o_P$.
Effect of slack on shaping
We vary $\delta_A$ and $\delta_D$, and plot the resulting NSE penalty in Figure 7. We report results for the driving domain with a single actor and designer, and shaping based on 100 observed trajectories of the actor. We vary $\delta_A$ between 0 and … of $V^*_P(s_0 \mid E_0)$ and $\delta_D$ between 0 and … of the NSE penalty of the actor's policy in $E_0$. Figure 7 shows that increasing the slack helps reduce the NSE, as expected. In particular, when $\delta_D \geq$ …, the NSE penalty is considerably reduced with $\delta_A = 15\%$. Overall, increasing $\delta_A$ is most effective in reducing the NSE. We also tested the effect of slack on the cost for $o_P$. The results showed that the cost was predominantly affected by $\delta_A$. We observed similar performance for all values of $\delta_D$ for a fixed $\delta_A$ and therefore do not include that plot. This is expected behavior since $o_P$ is prioritized and shaping is performed in response to $\pi$.

Figure 7: Effect of $\delta_A$ and $\delta_D$ on the average NSE penalty.
Shaping with multiple actors
We measure the cumulative NSE penalty incurred by varying the number of actors in the environment from 10 to 300. Figure 8 shows the results for the driving domain, with $\delta_D = 0$ and $\delta_A = 25\%$ of the optimal cost for $o_P$, for each actor. The start and goal locations were randomly generated for the actors. Shaping and feedback are based on 100 observed trajectories of each actor. Shaping w/ budget results are plotted with $b = 4$, which is $50\%$ of $|\Omega|$. The feedback budget for each actor is 500. Although the feedback approach is sometimes comparable to shaping when there are fewer actors, it requires overseeing each actor and providing individual feedback, which is impractical. Overall, as we increase the number of agents, a substantial reduction in NSE is observed with shaping. The time taken for shaping and shaping w/ budget is comparable up to 100 actors, beyond which we see considerable time savings from using shaping w/ budget. For 300 actors, the average time taken for shaping is 752.27 seconds and the time taken for shaping w/ budget is 490.953 seconds.

Figure 8: Results with multiple actors.
Discussion and Future Work
We present an actor-designer framework to mitigate the impacts of NSE by shaping the environment. The general idea of environment modification to influence the behaviors of the acting agents has been previously explored in other contexts (Zhang, Chen, and Parkes 2009), such as to accelerate agent learning (Randløv 2000), to quickly infer the goals of the actor (Keren, Gal, and Karpas 2014), and to maximize the agent's reward (Keren et al. 2017). We study how environment shaping can mitigate NSE. We also identify the conditions under which the actor's policy is unaffected by shaping and show that our approach guarantees bounded performance of the actor.

Our approach is well-suited for settings where the agent is well-trained to perform its assigned task and must perform it repeatedly, but NSE are discovered after deployment. Altering the agent's model after deployment requires the environment designer to be (or have the knowledge of) the designer of the AI system. Otherwise, the system will have to be suspended and returned to the manufacturer. Our framework provides a tool for the users to mitigate NSE without making any assumptions about the agent's model, its learning capabilities, the representation format for the agent to effectively use new information, or the designer's knowledge about the agent's model. We argue that in many contexts, it is much easier for users to reconfigure the environment than to accurately update the agent's model. Our algorithm guides the users in the shaping process.

In addition to the greedy approach described in Algorithm 2, we also tested a clustering-based approach to identify diverse modifications. In our experiments, the clustering approach performed comparably to the greedy approach in identifying diverse modifications but had a longer run time. It is likely that the benefits of the clustering approach would be more evident in settings with a much larger $\Omega$, since it can quickly group similar modifications. In the future, we aim to experiment with a large set of modifications, say $|\Omega| >$ …, and compare the performance of the clustering approach with that of Algorithm 2.

When the actor's model $M_a$ has a large state space, existing algorithms for solving large MDPs may be leveraged. When $M_a$ is solved approximately, without bounded guarantees, slack guarantees cannot be established. Extending our approach to multiple actors with different models is an interesting future direction.

We conducted experiments with human subjects to validate their willingness to perform environment shaping. In our user study, we focused on scenarios that are more accessible to participants from diverse backgrounds so that we can obtain reliable responses regarding their overall attitude to NSE and shaping. In the future, we aim to evaluate whether Algorithm 1 aligns with human judgment in selecting the best modification. Additionally, we will examine ways to automatically identify useful modifications in a large space of valid modifications.
Acknowledgments
Support for this work was provided in part by the Semiconductor Research Corporation under grant ….
References
Agashe, N.; and Chapman, S. 2019. Traffic Signs in the Evolving World of Autonomous Vehicles. Technical report, Avery Dennison.

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565.

Downs, J. S.; Holbrook, M. B.; Sheng, S.; and Cranor, L. F. 2010. Are Your Participants Gaming the System? Screening Mechanical Turk Workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2399–2402.

Guestrin, C.; Koller, D.; and Parr, R. 2001. Max-norm Projections for Factored MDPs. In Proceedings of the 17th International Joint Conference on Artificial Intelligence.

Hadfield-Menell, D.; Milli, S.; Abbeel, P.; Russell, S. J.; and Dragan, A. 2017. Inverse Reward Design. In Advances in Neural Information Processing Systems, 6765–6774.

Harsanyi, J. C. 1967. Games with Incomplete Information Played by "Bayesian" Players, Part I. The Basic Model. Management Science.

Hendrickson, C.; Biehler, A.; and Mashayekh, Y. 2014. Pennsylvania Department of Transportation.

Keren, S.; Gal, A.; and Karpas, E. 2014. Goal Recognition Design. In Proceedings of the 24th International Conference on Automated Planning and Scheduling.

Keren, S.; Pineda, L.; Gal, A.; Karpas, E.; and Zilberstein, S. 2017. Equi-Reward Utility Maximizing Design in Stochastic Environments. In Proceedings of the 26th International Joint Conference on Artificial Intelligence.

Randløv, J. 2000. In Proceedings of the 17th International Conference on Machine Learning.

Saisubramanian, S.; Kamar, E.; and Zilberstein, S. 2020. A Multi-Objective Approach to Mitigate Negative Side Effects. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, 354–361.

Saisubramanian, S.; Zilberstein, S.; and Kamar, E. 2020. Avoiding Negative Side Effects due to Incomplete Knowledge of AI Systems. CoRR abs/2008.12146.

Shah, R.; Krasheninnikov, D.; Alexander, J.; Abbeel, P.; and Dragan, A. 2019. Preferences Implicit in the State of the World. In Proceedings of the 7th International Conference on Learning Representations (ICLR).

Zhang, H.; Chen, Y.; and Parkes, D. C. 2009. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2002–2014.

Zhang, S.; Durfee, E. H.; and Singh, S. P. 2018. Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes.