On the Complexity of Sequential Incentive Design
Yagiz Savas, Vijay Gupta, and Ufuk Topcu
Abstract—In many scenarios, a principal dynamically interacts with an agent and offers a sequence of incentives to align the agent's behavior with a desired objective. This paper focuses on the problem of synthesizing an incentive sequence that, once offered, induces the desired agent behavior even when the agent's intrinsic motivation is unknown to the principal. We model the agent's behavior as a Markov decision process, express its intrinsic motivation as a reward function, which belongs to a finite set of possible reward functions, and consider the incentives as additional rewards offered to the agent. We first show that the behavior modification problem (BMP), i.e., the problem of synthesizing an incentive sequence that induces a desired agent behavior at minimum total cost to the principal, is PSPACE-hard. Moreover, we show that by imposing certain restrictions on the incentive sequences available to the principal, one can obtain two NP-complete variants of the BMP. We also provide a sufficient condition on the set of possible reward functions under which the BMP can be solved via linear programming. Finally, we propose two algorithms to compute globally and locally optimal solutions to the NP-complete variants of the BMP.
I. INTRODUCTION
Consider a scenario in which a principal offers a sequence of incentives to influence the behavior of an agent with unknown intrinsic motivation. For example, when an online retailer (principal) interacts with a customer (agent), the retailer may aim to convince the customer to purchase a number of products over time by offering a sequence of discounts (incentives). The customer's willingness to purchase new products may depend on the ones purchased in the past, e.g., willingness to purchase a video game depends on whether the customer already owns a game console. Moreover, the retailer typically does not know what discount rates will encourage the customer to shop more. In this paper, we study the problem of synthesizing an incentive sequence that, once offered, induces the desired agent behavior regardless of the agent's intrinsic motivation while minimizing the worst-case total cost to the principal. The design of incentives that align an agent's behavior with a principal's objective is a classical problem that has been studied from the perspective of control theory [1], game theory [2], contract theory [3], and mechanism design [4] under various assumptions on related problem parameters. Despite the long history, to the best of our knowledge, there are only a few studies, e.g., [5]–[7], that analyze the computational complexity of incentive design problems.
This work is supported in part by the grants DARPA D19AP00004 and ARL W911NF-17-2-0181. We thank Melkior Ornik and Lillian Ratliff for their contributions to an earlier version of this paper. We thank Mustafa Karabag for giving the idea of the motion planning example and for insightful discussions. Y. Savas and U. Topcu are with the Department of Aerospace Engineering, University of Texas at Austin, TX, USA. E-mail: {yagiz.savas, utopcu}@utexas.edu. V. Gupta is with the Department of Electrical Engineering, University of Notre Dame, IN, USA. E-mail: [email protected]

Such analyses are crucial to understand the computational challenges that naturally arise in incentive design problems and to develop effective algorithms for the synthesis of incentives. Here, we present a comprehensive complexity analysis of a particular class of sequential incentive design problems and propose several algorithms to solve them. We model the agent's behavior as a Markov decision process (MDP) [8]. MDPs model sequential decision-making under uncertainty and have been used to build recommendation systems [9], to develop sales promotion strategies [10], and to design autonomous driving algorithms [11]. By modelling the agent's behavior as an MDP, one can represent uncertain outcomes of the agent's actions. For instance, when a retailer makes a discount offer for purchases exceeding a certain amount, the actual amount the agent will spend upon accepting the offer can be expressed through a probability distribution. We express the agent's intrinsic motivation as a reward function and consider the incentives as additional nonnegative rewards offered for the agent's actions. Similar to the classical adverse selection problem [3], we assume that the agent's reward function is private information and the principal only knows that the agent's reward function belongs to a finite set of possible reward functions. The finite set assumption is standard in the incentive design literature [12]; using certain techniques, e.g., clustering [13], the principal may infer from historical data that agents can be categorized into a finite number of types, each of which is associated with a certain reward function. We consider a principal whose objective is to lead the agent to a desired set of target states in the MDP with maximum probability while minimizing its worst-case expected total cost to induce such a behavior. In the considered setting, the worst-case scenario corresponds to a principal that interacts with an agent type that maximizes the incurred total cost over all possible agent types. The target set may represent, for example, the collection of products purchased by the customer. Then, the principal's objective corresponds to maximizing its profit while guaranteeing that the agent purchases the desired group of products regardless of its type. This paper has four main contributions. First, we show that the behavior modification problem (BMP), i.e., the problem of synthesizing an incentive sequence that leads the agent to a desired target set with maximum probability while minimizing the worst-case total cost to the principal, is PSPACE-hard. Second, we show that by preventing the principal from adapting its incentive offers in time according to its history of interactions with the agent, one can obtain two "easier" variants of the BMP which are NP-complete.
Third, we show that, when the set of possible agent types contains a type that always demands an incentive offer that is higher than the ones demanded by any other type, the resulting BMP can be solved in polynomial time via linear programming. Finally, we present two algorithms to solve the NP-complete variants of the BMP. The first algorithm computes a globally optimal solution to the considered problems based on a mixed-integer linear program. The second algorithm, on the other hand, computes a locally optimal solution by resorting to a variation [14] of the so-called convex-concave procedure [15] to solve a nonlinear optimization problem with bilinear constraints.

Related Work:
Some of the results presented in Section VII of this paper have previously appeared in [16], where we study a sequential incentive design problem in which the agent has a known intrinsic motivation. In this paper, we present an extensive study of the sequential incentive design problem for agents with unknown intrinsic motivation, which is a significantly different problem. In particular, we show that, when the agent's intrinsic motivation is unknown to the principal, it is, in general, PSPACE-hard to synthesize an optimal incentive sequence that induces the desired behavior. On the other hand, for an agent with known intrinsic motivation, an optimal incentive sequence can be synthesized in polynomial time. MDPs have been recently used to study the synthesis of sequential incentive offers in [7], [17]–[19]. In [7], the authors consider a principal with a limited budget who aims to induce an agent behavior that maximizes the principal's utility. Similarly, in [18], a principal with a limited budget who aims to induce a specific agent policy through incentive offers is considered. In [17], the authors consider a multi-armed bandit setting in which a principal sequentially offers incentives to an agent with the objective of modifying its behavior. The paper [19] studies an online mechanism design problem in which a principal interacts with multiple agents and aims to maximize the expected social utility. Unlike the above mentioned studies, in this paper, we consider a principal with no budget constraints that aims to modify the behavior of a single agent at minimum worst-case total cost. There are only a few results in the literature on the complexity of incentive design problems. In [7], the authors consider a principal with budget constraints and prove, by a reduction from the Knapsack problem, that the considered problem is NP-hard even when the agent's reward function is known to the principal. Since we consider a principal with no budget constraints, the reduction techniques employed here are significantly different from the one used in [7]. In [6], a number of NP-hardness results are presented for static
Stackelberg game settings in which the principal interacts with the agent only once. In [5], the authors prove the complexity of several mechanism design problems in which the principal is allowed to condition its incentive offers on agent types. Even though the reduction techniques used in the above mentioned references are quite insightful, they cannot be applied to prove the complexity of the BMP, which concerns a sequential setting in which the incentives are not conditioned on the agent types. The complexity results presented in this paper are also closely related to the complexity of synthesizing optimal policies in robust MDPs with unstructured uncertainty sets [20], multi-model MDPs [21], and partially observable MDPs [22]. In particular, the policy synthesis problems for all these models are PSPACE-hard. The main difference of the reduction technique we use in this paper from the ones used in the above mentioned references is the construction of the reward functions for each agent type.

II. PRELIMINARIES
Notation:
For a set S, we denote its cardinality by |S|. Additionally, N := {1, 2, ...}, R := (−∞, ∞), and R_{≥0} := [0, ∞).

A. Markov decision processes
Definition 1: A Markov decision process (MDP) is a tuple M := (S, s_1, A, P) where S is a finite set of states, s_1 ∈ S is an initial state, A is a finite set of actions, and P : S × A × S → [0, 1] is a transition function such that Σ_{s' ∈ S} P(s, a, s') = 1 for all s ∈ S and a ∈ A. We denote the transition probability P(s, a, s') by P_{s,a,s'}, and the set of available actions in s ∈ S by A(s). A state s ∈ S is absorbing if P_{s,a,s} = 1 for all a ∈ A(s). The size of an MDP is the number of triplets (s, a, s') ∈ S × A × S such that P_{s,a,s'} > 0.

Definition 2:
For an MDP M, a policy is a sequence π := (d_1, d_2, d_3, ...) where each d_t : S → A is a decision rule such that d_t(s) ∈ A(s) for all s ∈ S. A stationary policy is of the form π = (d, d, d, ...). We denote the set of all policies and all stationary policies by Π(M) and Π^S(M), respectively. We denote the action a ∈ A(s) taken by the agent in a state s ∈ S under a stationary policy π by π(s).

B. Incentive sequences
For an MDP M, an information sequence I_t describes the information available to the principal at stage t ∈ N and is recursively defined as follows. At the first stage, an information sequence comprises only the agent's current state, e.g., I_1 = (s_1). The principal makes incentive offers δ_1(I_1, a) to the agent for some actions, the agent takes an action a_1 ∈ A(s_1) and transitions to a state s_2 ∈ S. At the second stage, the information sequence becomes I_2 = (s_1, γ_1, a_1, s_2). The principal makes incentive offers δ_2(I_2, a) to the agent for some actions, the agent takes an action a_2 ∈ A(s_2) and transitions to a state s_3 ∈ S. The information sequence at stage t ∈ N is then recursively defined as I_t = (s_1, γ_1, a_1, s_2, ..., s_{t−1}, γ_{t−1}, a_{t−1}, s_t). We denote the set of all possible information sequences available to the principal at stage t ∈ N by I_t.

Definition 3:
For an MDP M, an incentive sequence is a sequence γ := (δ_1, δ_2, δ_3, ...) of incentive offers δ_t : I_t × A → R_{≥0}. A stationary incentive sequence is of the form γ = (δ, δ, δ, ...) where δ : S × A → R_{≥0}. A stationary deterministic incentive sequence is a stationary incentive sequence such that for all s ∈ S, Σ_{a' ∈ A(s)} δ(s, a') = δ(s, a) for some a ∈ A(s). For an MDP M, we denote the set of all incentive sequences, all stationary incentive sequences, and all stationary deterministic incentive sequences by Γ(M), Γ^S(M), and Γ^{SD}(M), respectively. For a stationary incentive sequence γ ∈ Γ^S(M), we denote the incentive offer for a state-action pair (s, a) by γ(s, a).

C. Reachability
An infinite sequence ϱ^π = s_1 s_2 s_3 ... of states generated in M under a policy π, which starts from the initial state s_1 and satisfies P_{s_t, d_t(s_t), s_{t+1}} > 0 for all t ∈ N, is called a path. We denote the set of all paths in M generated under the policy π by Paths^π_M. We use the standard probability measure over the outcome set Paths^π_M [23]. Let ϱ^π[t] := s_t denote the state visited at the t-th stage along the path ϱ^π. We define

  Pr^π_M(Reach[B]) := Pr{ϱ^π ∈ Paths^π_M : ∃ t ∈ N, ϱ^π[t] ∈ B}

as the probability with which the paths generated in M under π reach the set B ⊆ S. We denote the maximum probability of reaching the set B under any policy π ∈ Π(M) by

  R_max(M, B) := max_{π ∈ Π(M)} Pr^π_M(Reach[B]).

The existence of the maximum in the above definition follows from Lemma 10.102 in [23], and the value of R_max(M, B) can be computed via linear programming [23, Chapter 10].
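The quantity R_max(M, B) can also be obtained by a simple fixed-point iteration instead of the linear program cited above. The sketch below is a minimal value-iteration routine for the maximum reachability probability; the dictionary-based MDP encoding (P[s][a] mapping successor states to probabilities) and the function name are assumptions made for illustration, not part of the paper.

```python
def max_reach_probability(P, states, target, eps=1e-9, max_iter=10_000):
    """Iteratively approximate R_max(M, B): the max probability of reaching `target`.

    P[s][a] is assumed to be a dict {s_next: prob}; `target` is a set of states.
    This is a value-iteration sketch; the paper instead cites an LP formulation.
    """
    v = {s: (1.0 if s in target else 0.0) for s in states}
    for _ in range(max_iter):
        new_v = {}
        for s in states:
            if s in target:
                new_v[s] = 1.0
                continue
            # The best action maximizes the expected reachability value of successors.
            new_v[s] = max(
                (sum(p * v[s2] for s2, p in P[s][a].items()) for a in P[s]),
                default=0.0,
            )
        if max(abs(new_v[s] - v[s]) for s in states) < eps:
            return new_v
        v = new_v
    return v


# Example: action 'a1' stays in 's1', action 'a2' moves to the target 's2'.
P = {"s1": {"a1": {"s1": 1.0}, "a2": {"s2": 1.0}}, "s2": {"a1": {"s2": 1.0}}}
print(max_reach_probability(P, {"s1", "s2"}, {"s2"})["s1"])  # -> 1.0
```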
III. PROBLEM STATEMENT

We consider an agent whose behavior is modeled as an MDP M and a principal that provides a sequence γ ∈ Γ(M) of incentives to the agent in order to induce a desired behavior. We assume that the agent's intrinsic motivation is unknown to the principal, but it can be expressed by one of finitely many reward functions. Formally, let Θ be a finite set of agent types, and for each θ ∈ Θ, let R_θ : S × A → R be the associated reward function. The principal knows the function R_θ associated with each type θ ∈ Θ, but does not know the true agent type θ* ∈ Θ. The sequence of interactions between the principal and the agent is as follows. At stage t ∈ N, the agent occupies a state s_t ∈ S. The information sequence I_t, defined in Section II-B, describes the information available to the principal about the agent. The principal offers an incentive δ_t(I_t, a) to the agent for each action a ∈ A, and the agent chooses an action a_t ∈ A(s_t) to maximize its immediate total reward, i.e.,

  a_t ∈ argmax_{a ∈ A(s_t)} [ R_{θ*}(s_t, a) + δ_t(I_t, a) ].   (1)

Depending on the action taken by the agent, the principal pays δ_t(I_t, a_t) to the agent. Finally, the agent transitions to the next state s_{t+1} ∈ S with probability P_{s_t, a_t, s_{t+1}}, the principal updates its information sequence to I_{t+1} = (I_t, δ_t, a_t, s_{t+1}), and so on. The principal offers a sequence of incentives to the agent so that the agent reaches a target state with maximum probability regardless of its type and the worst-case expected total cost of inducing the desired behavior to the principal is minimized.

Problem 1: (Behavior modification problem (BMP))
For an MDP M, a set B ⊆ S of absorbing target states, and a set Θ of possible agent types, synthesize an incentive sequence γ ∈ Γ(M) that leads the agent to a target state with maximum probability while minimizing the expected total cost, i.e.,

  minimize_{γ ∈ Γ(M)}  max_{θ ∈ Θ}  E^{π*}[ Σ_{t=1}^{∞} δ_t(I_t, A_t) | θ ]   (2a)
  subject to:  π* = (d*_1, d*_2, d*_3, ...)   (2b)
  ∀ t ∈ N, ∀ s ∈ S,  d*_t(s) ∈ argmax_{a ∈ A(s)} [ R_θ(s, a) + δ_t(I_t, a) ]   (2c)
  Pr^{π*}_M(Reach[B]) = R_max(M, B).   (2d)

The expectation in (2a) is taken over the paths that are induced by the policy π*, i.e., the principal pays the offered incentive if and only if the agent takes the incentivized action.

Remark 1:
The decision-making model (1) describes a myopic agent which aims to maximize only its immediate rewards. Clearly, this model is restrictive in terms of its capability of explaining sophisticated agent behaviors. The main reason that we use such a simple model is to avoid the notational burden that comes with more expressive alternatives. In Appendix A, we show that the behavior modification of an agent with a finite decision horizon is computationally not easier than the behavior modification of a myopic agent. We also show how the solution techniques developed for the behavior modification of a myopic agent can be utilized to modify the behavior of an agent with a finite decision horizon.
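As a concrete illustration of the myopic decision rule (1), the following sketch returns the set of immediately optimal actions at a single state; the dictionary encodings of R_{θ*}(s, ·) and δ_t(I_t, ·) are assumptions made for readability. Ties are returned as a set because Section V resolves them adversarially.

```python
def myopic_action(rewards_s, incentives_s):
    """Return the actions maximizing R_theta(s, a) + delta_t(I_t, a) at one state.

    rewards_s and incentives_s are assumed to map action -> value for the current
    state; actions without an explicit offer receive an incentive of zero.
    """
    totals = {a: rewards_s[a] + incentives_s.get(a, 0.0) for a in rewards_s}
    best = max(totals.values())
    return {a for a, v in totals.items() if abs(v - best) < 1e-12}


# The agent prefers to stay (reward 0) unless the incentive on 'a2' exceeds 1.
print(myopic_action({"a1": 0.0, "a2": -1.0}, {"a2": 1.0 + 1e-6}))  # -> {'a2'}
```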
Remark 2:
The BMP requires all target states s ∈ B to be absorbing. We impose such a restriction just to avoid the additional definitions and notations required to consider non-absorbing target states. In Appendix A, we show how the results presented in this paper can be applied to MDPs with non-absorbing target states.

Remark 3:
The BMP inherently assumes that the principal knows the transition function P of the given MDP M. It is clear that, by allowing P to belong to an uncertainty set, one cannot obtain a computationally easier problem.

IV. SUMMARY OF THE RESULTS
In this section, we provide a brief summary of the presented results. We keep the exposition non-technical to improve readability; precise statements are provided in later sections. Table I illustrates an overview of the considered problems, their complexity, and the proposed solution techniques. We first show that the BMP is PSPACE-hard. That is, unless P=PSPACE, an event considered to be less likely than the event P=NP [22], an efficient algorithm to solve the BMP does not exist. Due to this discouraging result, instead of studying the BMP in its full generality, we focus on its variants in which we restrict the incentive sequences available to the principal. As the first restriction, we allow the principal to use only stationary incentive sequences to modify the agent's behavior and refer to the resulting problem as the non-adaptive behavior modification problem (N-BMP). We show that the decision problem associated with the N-BMP is NP-complete even when the given MDP has only deterministic transitions. To find a globally optimal solution to the N-BMP, we formulate a mixed-integer linear program (MILP) in which the integer variables correspond to actions taken by different agent types in a given state. We also show that the N-BMP can be formulated as a nonlinear optimization problem with bilinear constraints for which a locally optimal solution can be obtained using the so-called convex-concave procedure [14]. As the second restriction, we allow the principal to use only stationary deterministic incentive sequences to modify the agent's behavior and refer to the resulting problem as the non-adaptive single-action behavior modification problem (NS-BMP). We prove that the decision problem associated with the NS-BMP is NP-complete even when the agent has state-independent reward functions. We also show that globally and locally optimal solutions to the NS-BMP can be computed by slightly modifying the methods developed to solve the N-BMP. Finally, we consider the case in which the set of agent types includes a dominant type which always demands the principal to offer an incentive amount that is higher than the ones demanded by any other agent type. We prove that solving the BMP instances that involve a dominant type is equivalent to modifying the behavior of the dominant type. We show that the behavior modification problem of a dominant type (BMP-D) is in P, and present an approach based on a linear program (LP) to solve the BMP-D.

TABLE I: A summary of the presented results.

Problem | Complexity | Globally optimal solution | Locally optimal solution
Behavior modification (BMP) | PSPACE-hard | — | —
Non-adaptive behavior modification (N-BMP) | NP-complete | MILP | Convex-concave procedure
Non-adaptive single-action behavior modification (NS-BMP) | NP-complete | MILP | Convex-concave procedure
Behavior modification of a dominant type (BMP-D) | P | LP | —

V. OPTIMAL AGENT BEHAVIOR AND OPTIMAL INCENTIVE SEQUENCES
In the BMP, we aim to synthesize an incentive sequence that induces an optimal agent policy that reaches a target set with maximum probability while minimizing the expected total cost to the principal. To obtain a well-defined BMP, in this section, we first precisely specify how the agent behaves when there are multiple optimal policies. We then show that, in general, an incentive sequence that minimizes the expected total cost to the principal may not exist. Hence, we focus on ε-optimal incentive sequences where ε > 0 is an arbitrarily small constant.

A. Agent's behavior when multiple optimal policies exist
For a given incentive sequence, the agent's optimal policy may not be unique. In the presence of multiple optimal policies, there are only two possible cases: either (i) there exists an optimal policy that violates the reachability constraint in (2d) or (ii) all optimal policies satisfy the constraint in (2d). We first analyze case (i). Consider the MDP given in Fig. 1 (left). Let the agent's reward function be R_{θ*}(s_1, a_1) = 0 and R_{θ*}(s_1, a_2) = −1. That is, in the absence of incentives, it is optimal for the agent to stay in state s_1 under action a_1. Suppose that the principal offers the stationary incentives γ(s_1, a_1) = 0 and γ(s_1, a_2) = 1 to the agent. Then, we have

  {a_1, a_2} = argmax_{a ∈ A(s_1)} [ R_{θ*}(s_1, a) + γ(s_1, a) ],

which implies that the agent has multiple optimal policies. Note that under the optimal stationary policy π(s_1) = a_2, the agent reaches the target state s_2 with probability 1, whereas under the optimal stationary policy π(s_1) = a_1, it reaches the target state s_2 with probability 0. We assume that, in such a scenario, the agent behaves adversarially against the principal.

Assumption:
Under the provided incentive sequence, if there exists an optimal policy following which the agent can violate the constraint in (2d), then the agent follows such a policy.

Fig. 1: MDP examples to illustrate the agent's behavior in the existence of multiple optimal policies. Solid lines represent deterministic transitions; dashed lines represent transitions with equal probability. The initial state is s_1. The type set Θ satisfies |Θ| = 1, i.e., the principal knows the true agent type. The target set is B = {s_2}. The tuples (a, r) next to the arrows indicate the action a and the reward r.

We now analyze case (ii). Consider the MDP given in Fig. 1 (right). In this MDP, in addition to the actions a_1 and a_2, the agent can take a third action a_3 which leads the agent to the states s_1 and s_2 with equal probability. Let the reward function be R_{θ*}(s_1, a_1) = 0 and R_{θ*}(s_1, a_2) = R_{θ*}(s_1, a_3) = −1. Suppose that the principal offers the stationary incentives γ(s_1, a_1) = 0, γ(s_1, a_2) = 2, and γ(s_1, a_3) = 2. Then, we have

  {a_2, a_3} = argmax_{a ∈ A(s_1)} [ R_{θ*}(s_1, a) + γ(s_1, a) ],

and the agent has multiple optimal policies. Under all its optimal policies, the agent reaches the target state s_2 with probability 1. However, if the agent follows the stationary policy π(s_1) = a_2, the expected total cost to the principal is equal to 2, whereas the stationary policy π(s_1) = a_3 incurs the expected total cost of 4. In such a scenario, we again assume that the agent behaves adversarially against the principal.

Assumption:
Under the provided incentive sequence, if all optimal agent policies satisfy the constraint in (2d), then the agent follows the policy that maximizes the expected total cost to the principal.
B. Non-existence of optimal incentive sequences
We now illustrate with an example that, in general, there may exist no incentive sequence γ ∈ Γ(M) that attains the minimum in (2a)-(2d). Consider again the MDP given in Fig. 1 (left). In this example, without loss of generality, we can focus on stationary incentive sequences to minimize the expected total cost to the principal. The feasible set of stationary incentive sequences is described by the set

  { γ(s, a) ≥ 0 | γ(s_1, a_2) − γ(s_1, a_1) ≥ 1 + ε̃, ε̃ > 0 },

which is not a compact set due to the condition ε̃ > 0. The incentive offers satisfying γ(s_1, a_2) − γ(s_1, a_1) = 1, i.e., ε̃ = 0, do not belong to the feasible set because under such incentive sequences the agent adversarially follows the stationary policy π(s_1) = a_1, which violates the constraint in (2d). Motivated by the above example, in the rest of the paper, we focus on obtaining an ε-optimal solution to the BMP.

Definition 4:
Let D be a non-empty set, h : D → R_{≥0} be a function, and ε > 0 be a constant. A point x ∈ D is said to be an ε-optimal solution to the problem min_{x ∈ D} h(x) if

  h(x) ≤ min_{x ∈ D} h(x) + ε.
VI. COMPLEXITY OF BEHAVIOR MODIFICATION
In this section, we present the results on the computational complexity of the BMP and its two variants. Before proceeding with the complexity results, we first show that a feasible solution to the BMP can be computed efficiently.
A. A feasible solution to the behavior modification problem
Let C : S × A → R_{≥0} be defined as

  C(s, a) := max_{θ ∈ Θ} ( max_{a' ∈ A(s)} R_θ(s, a') − R_θ(s, a) + ε ).   (3)

The value of C(s, a) denotes an incentive amount under which the agent takes the incentivized action a ∈ A(s) in state s ∈ S regardless of its type. To see this, note that the term inside the parentheses denotes an amount that, once offered, makes the action a ∈ A(s) uniquely optimal for the agent type θ ∈ Θ. Hence, by taking the maximum over all agent types, we ensure that, under the stationary incentive offer γ(s, a) = C(s, a), the agent takes the action a ∈ A(s) regardless of its type. We can obtain a feasible solution to the BMP as follows. First, compute a policy π* ∈ Π(M) that reaches the desired target set B with maximum probability R_max(M, B) by solving a linear program [23, Chapter 10]. Then, offer the stationary incentive sequence

  γ(s, a) = C(s, a) if π*(s) = a,  and 0 otherwise   (4)

to the agent. Under the above incentive sequence, the agent follows the desired policy π* regardless of its type; hence, it reaches the target set B with maximum probability. There are two important aspects of the incentive sequence synthesized above. First, it is computed in time polynomial in the size of M and Θ. Second, it conservatively incentivizes the same actions for all agent types, ignoring the principal's knowledge of the reward functions R_θ. In the next section, we show that, in order to exploit the principal's knowledge of the reward functions and minimize the incurred total cost, one should solve a more challenging computational problem.
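A minimal sketch of this feasible (but conservative) solution is given below: it evaluates C(s, a) from (3) for the actions prescribed by a maximum-reachability policy and offers nothing elsewhere, as in (4). The dictionary encodings of the reward functions and the policy, as well as the function name, are assumptions made for illustration.

```python
def conservative_incentives(states, actions, R, policy, eps=1e-3):
    """Stationary incentives realizing (3)-(4): pay C(s, a) on the incentivized action.

    R is assumed to map (theta, s, a) -> reward over a finite type set; `policy`
    maps each state to the action the principal wants to incentivize.
    """
    types = {theta for theta, _, _ in R}
    gamma = {}
    for s in states:
        a = policy[s]
        # C(s, a): the worst-case (over types) regret of action a, plus eps.
        gamma[(s, a)] = max(
            max(R[(theta, s, a2)] for a2 in actions[s]) - R[(theta, s, a)] + eps
            for theta in types
        )
    return gamma  # state-action pairs not listed receive an offer of 0


# Two types disagree on how costly action 'a2' is in state 's1'.
R = {("t1", "s1", "a1"): 0.0, ("t1", "s1", "a2"): -1.0,
     ("t2", "s1", "a1"): 0.0, ("t2", "s1", "a2"): -3.0}
print(conservative_incentives(["s1"], {"s1": ["a1", "a2"]}, R, {"s1": "a2"}))
# -> {('s1', 'a2'): 3.001}
```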
B. Complexity results

We first present a discouraging result which shows that, unless P=PSPACE, solving the BMP in time polynomial in the size of M and Θ is not possible.

Theorem 1:
For N ∈ N, deciding whether there exists a feasible solution to the BMP with the objective value less than or equal to N is PSPACE-hard. We provide a proof for the above result in Appendix B where we show that the PSPACE-complete quantified satisfiability problem (QSAT) [24] can be reduced to the BMP. The reduction closely follows the reduction of QSAT to partially observable MDP problems [22] in which the objective is to synthesize an observation-based controller to maximize the expected total reward collected by an agent. Next, we introduce two variants of the BMP in which the principal is allowed to use only a subset of the set Γ(M) of all incentive sequences.

Definition 5:
When the principal is restricted to use stationary incentive sequences, i.e., the minimization in (2a) is performed over the set Γ^S(M), the resulting BMP is referred to as the non-adaptive BMP (N-BMP). In the N-BMP, the principal does not utilize its past experiences with the agent, i.e., the information sequence I_t, to refine the incentive offers in the future. Therefore, in a sense, the N-BMP describes an offline incentive design problem. The following result shows that the N-BMP is computationally "easier" than the BMP. However, it is still not likely to be solved efficiently even for MDPs with deterministic transition functions.

Theorem 2:
For N ∈ N, deciding whether there exists a feasible solution to the N-BMP with the objective value less than or equal to N is NP-complete even when the transition function P satisfies P_{s,a,s'} ∈ {0, 1} for all s, s' ∈ S and all a ∈ A. A proof for the above result is provided in Appendix B where we reduce the NP-complete Euclidean path traveling salesman problem [25] to the N-BMP. The reduction demonstrates that, to lead the agent to a target state at minimum total cost, the principal may need to reveal the agent's true type.

Definition 6:
When the principal is restricted to use stationary deterministic incentive sequences, i.e., the minimization in (2a) is performed over the set Γ^{SD}(M), the resulting BMP is referred to as the non-adaptive single-action BMP (NS-BMP). In the NS-BMP, the principal offers incentives only for a single action in a given state. Such a restriction can be encountered in practice, for example, when a retailer offers discounts only for the purchases that exceed a certain total sum. The following result shows that the NS-BMP is a challenging problem even when the agent has a state-independent reward function.

Theorem 3:
For N ∈ N, deciding whether there exists a feasible solution to the NS-BMP with the objective value less than or equal to N is NP-complete even when R_θ(s, a) = R_θ(t, a) for all s, t ∈ S, all a ∈ A, and all θ ∈ Θ. A proof for the above result is provided in Appendix B where we reduce the NP-complete set cover problem [22] to the NS-BMP. The main idea in the reduction is that, to lead the agent to a target state, the principal may need to reveal the true agent type by the end of a fixed number of stages.

VII. A SUFFICIENT CONDITION FOR POLYNOMIAL-TIME SOLVABILITY
In this section, we derive a sufficient condition on the structure of the type set Θ under which the BMP can be solved by assuming that the true agent type is known to the principal. We define the BMP of a dominant type (BMP-D) as the BMP instances that satisfy the derived sufficient condition and show that an ε-optimal solution to the BMP-D can be computed in time polynomial in the size of M and Θ.

A. Behavior modification of a dominant type: formulation
Let f : Γ(M) × Θ → R be a function such that

  f(γ, θ) := E^{π*}[ Σ_{t=1}^{∞} δ_t(I_t, A_t) | θ ]   (5)

is the expected total cost incurred by the principal when the sequence of incentive offers is γ ∈ Γ(M) and the agent type is θ ∈ Θ. In (5), π* ∈ Π(M) is the agent's optimal policy under the incentive sequence γ. Moreover, let Γ_{R,θ}(M) ⊆ Γ(M) be the set of incentive sequences under which the agent type θ ∈ Θ reaches the target set B ⊆ S with maximum probability. Precisely, we have γ ∈ Γ_{R,θ}(M) if and only if the optimal policy π* of the agent type θ under the sequence γ ∈ Γ(M) satisfies the equality in (2d). Finally, let

  Γ_{R,Θ}(M) := ∩_{θ ∈ Θ} Γ_{R,θ}(M)   (6)

be the set of incentive sequences under which the agent reaches the target set with maximum probability regardless of its type. Then, the BMP in (2a)-(2d) can equivalently be written as

  min_{γ ∈ Γ_{R,Θ}(M)} max_{θ ∈ Θ} f(γ, θ).   (7)

Using weak duality [26], together with the fact that Γ_{R,Θ} ⊆ Γ_{R,θ} for all θ ∈ Θ, we have

  min_{γ ∈ Γ_{R,Θ}(M)} max_{θ ∈ Θ} f(γ, θ) ≥ max_{θ ∈ Θ} min_{γ ∈ Γ_{R,θ}(M)} f(γ, θ).   (8)

The above inequality implies that a principal that knows the true agent type θ* can always modify the agent's behavior at an expected total cost that is less than or equal to the worst-case expected total cost incurred by a principal that does not know the true agent type. We now provide a sufficient condition under which the knowledge of the true agent type θ* does not help the principal to strictly decrease its expected total cost, i.e., the inequality in (8) becomes an equality. For each θ ∈ Θ and s ∈ S, let R_θ^max(s) := max_{a' ∈ A(s)} R_θ(s, a').

Theorem 4:
For a given MDP M and a type set Θ, if there exists θ_d ∈ Θ such that

  R_{θ_d}^max(s) − R_{θ_d}(s, a) ≥ R_θ^max(s) − R_θ(s, a)   (9)

for all θ ∈ Θ, s ∈ S, and a ∈ A, then Γ_{R,θ_d}(M) = Γ_{R,Θ}(M) and

  min_{γ ∈ Γ_{R,Θ}(M)} max_{θ ∈ Θ} f(γ, θ) = min_{γ ∈ Γ_{R,θ_d}(M)} f(γ, θ_d).
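Condition (9) can be checked directly by enumerating the finite type set. The sketch below returns a dominant type if one exists; the dictionary encoding of the reward functions is an assumption made for illustration.

```python
def find_dominant_type(states, actions, R, types):
    """Return a type theta_d satisfying condition (9), or None if no such type exists.

    R is assumed to map (theta, s, a) -> reward. Condition (9) asks that theta_d's
    regret R_theta^max(s) - R_theta(s, a) is the largest one for every pair (s, a).
    """
    def regret(theta, s, a):
        return max(R[(theta, s, a2)] for a2 in actions[s]) - R[(theta, s, a)]

    for cand in types:
        if all(
            regret(cand, s, a) >= regret(theta, s, a)
            for s in states for a in actions[s] for theta in types
        ):
            return cand
    return None


# With the two types from the earlier example, 't2' has the larger regret everywhere.
R = {("t1", "s1", "a1"): 0.0, ("t1", "s1", "a2"): -1.0,
     ("t2", "s1", "a1"): 0.0, ("t2", "s1", "a2"): -3.0}
print(find_dominant_type(["s1"], {"s1": ["a1", "a2"]}, R, ["t1", "t2"]))  # -> t2
```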
Definition 7: When there exists a dominant type θ_d ∈ Θ that satisfies the condition in (9), the resulting BMP is referred to as the BMP of a dominant type (BMP-D).

Theorem 4 implies that one can solve the BMP instances that satisfy the condition in (9) by assuming that the true agent type θ* is the type θ_d. Hence, solving the BMP-D is equivalent to finding an incentive sequence γ* ∈ Γ(M) such that

  γ* ∈ argmin_{γ ∈ Γ_{R,θ_d}(M)} f(γ, θ_d).   (10)

B. Behavior modification of a dominant type: reformulation
In general, the minimum in (10) is not attainable by any feasible incentive sequence. Hence, we aim to find an ε-optimal solution as defined in Section V-B. In this section, we show that finding an ε-optimal solution to the BMP-D is equivalent to solving a certain constrained MDP problem. For a given MDP M, we first partition the set S of states into three disjoint sets as follows. Let B ⊆ S be the set of target states and S_0 ⊆ S be the set of states that have zero probability of reaching the states in B under any policy. Finally, we let S_r = S \ (B ∪ S_0). These sets can be found in time polynomial in the size of the MDP using graph search algorithms [23]. For ε ≥ 0, let φ_ε : S × A → R_{≥0} be the cost of control function

  φ_ε(s, a) := R_{θ_d}^max(s) − R_{θ_d}(s, a) + ε  if s ∈ S_r and a ∈ A(s),  and 0 otherwise.

Additionally, let Π_{R,θ_d}(M) be the set of policies under which the dominant agent type θ_d reaches the target set B with maximum probability, i.e., π' ∈ Π_{R,θ_d}(M) if and only if

  Pr^{π'}_M(Reach[B]) = R_max(M, B).

For any incentive sequence γ ∈ Γ(M), there exists a corresponding optimal policy π ∈ Π(M) for the agent type θ_d. Moreover, for the principal to induce an agent policy π = (d_1, d_2, ...) such that π ∈ Π_{R,θ_d}(M), it is necessary that, for any information sequence I_t = (s_1, γ_1, a_1, ..., s_t), we have δ_t(I_t, a) ≥ φ_0(s_t, a) if d_t(s_t) = a. Otherwise, we have

  R_{θ_d}(s_t, a) + δ_t(I_t, a) < R_{θ_d}^max(s_t),

which implies that the action a ∈ A(s_t) cannot be optimal for the agent in state s_t ∈ S at stage t ∈ N. Let g_ε : Π(M) → R_{≥0} be the expected total cost of control function

  g_ε(π) := E^π[ Σ_{t=1}^{∞} φ_ε(S_t, A_t) | θ_d ]   (11)

where the expectation is taken over the paths induced by the dominant agent type's policy π. The necessary condition described above implies that we have

  min_{γ ∈ Γ_{R,θ_d}(M)} f(γ, θ_d) ≥ min_{π ∈ Π_{R,θ_d}(M)} g_0(π).   (12)

Similarly, for the principal to induce an agent policy π = (d_1, d_2, ...) such that π ∈ Π_{R,θ_d}(M), it is sufficient that, for any information sequence I_t = (s_1, γ_1, a_1, ..., s_t) where s_t ∈ S_r, we have δ_t(I_t, a) = φ_ε(s_t, a) for some ε > 0 if d_t(s_t) = a. This follows from the fact that, if δ_t(I_t, a) = φ_ε(s_t, a), then

  R_{θ_d}(s_t, a) + δ_t(I_t, a) = R_{θ_d}^max(s_t) + ε > R_{θ_d}(s_t, a')

for all a' ∈ A(s_t) \ {a}. Hence, the action a ∈ A(s_t) is uniquely optimal for the agent in state s_t ∈ S_r at stage t ∈ N. The sufficient condition described above implies that, for any ε > 0,

  min_{π ∈ Π_{R,θ_d}(M)} g_ε(π) ≥ min_{γ ∈ Γ_{R,θ_d}(M)} f(γ, θ_d).   (13)

We now show that finding an ε-optimal solution to the BMP-D is equivalent to solving the optimization problem on the left hand side of (13) for some ε > 0. Let Π^S_{R,θ_d}(M) ⊆ Π_{R,θ_d}(M) be the set of stationary policies under which the dominant agent type θ_d reaches the target set with maximum probability.

Lemma 1: [27] For any ε ≥ 0,

  min_{π ∈ Π_{R,θ_d}(M)} g_ε(π) = min_{π ∈ Π^S_{R,θ_d}(M)} g_ε(π).   (14)

The above result, which is proven in [27], implies that stationary policies are sufficient to minimize the expected total cost function (11) subject to the reachability constraint in (2d).

Theorem 5:
For any given ε > 0, there exists ε̄ > 0 such that

  min_{π ∈ Π^S_{R,θ_d}(M)} g_ε̄(π) ≤ min_{π ∈ Π^S_{R,θ_d}(M)} g_0(π) + ε.   (15)

A proof of the above result and the explicit form of the constant ε̄ in terms of ε can be found in [16]. The above result, together with (12) and (13), implies that one can find an ε-optimal solution to the BMP-D in two steps as follows. First, solve the optimization problem on the left hand side of (13) for the corresponding ε̄ > 0. Let π* ∈ Π^S_{R,θ_d}(M) be the optimal stationary policy, which exists by Lemma 1. Second, provide the agent the stationary incentive sequence

  γ(s, a) = φ_ε̄(s, a) if π*(s) = a,  and 0 otherwise.   (16)

C. Behavior modification of a dominant type: an algorithm
In the previous section, we showed that to find an ε-optimal solution to the BMP-D, one can compute a stationary policy

  π* ∈ argmin_{π ∈ Π^S_{R,θ_d}(M)} g_ε̄(π)   (17)

and provide the agent the incentive sequence given in (16). In this section, we present an algorithm that solves the problem in (17) in time polynomial in the size of M. The optimization problem in (17) is an instance of the so-called constrained MDP problem in which the objective is to synthesize a policy that maximizes the expected total reward while ensuring that the expected total cost is below a certain threshold [28]. It is known that, in general, deterministic policies π ∈ Π_{R,θ_d}(M) are not sufficient to solve the constrained MDP problem [28]. However, in what follows, we show that the problem in (17) indeed admits an optimal solution in the set of stationary deterministic policies. Let r : S × A → R_{≥0} be a function such that

  r(s, a) := Σ_{s' ∈ B} P_{s,a,s'}  if s ∈ S_r,  and 0 otherwise.   (18)

By Theorem 10.100 in [23], for any π ∈ Π(M), we have

  E^π[ Σ_{t=1}^{∞} r(S_t, A_t) ] = Pr^π_M(Reach[B]).

Hence, the problem in (17) can be equivalently written as

  min_{π ∈ Π^S(M)}  E^π[ Σ_{t=1}^{∞} φ_ε̄(S_t, A_t) ]   (19a)
  subject to:  E^π[ Σ_{t=1}^{∞} r(S_t, A_t) ] = R_max(M, B).   (19b)

We synthesize an optimal stationary policy π* ∈ Π^S(M) that solves the problem in (19a)-(19b) by solving two linear programs (LPs). First, we solve the LP

  minimize_{x(s,a)}  Σ_{s ∈ S_r} Σ_{a ∈ A} x(s, a) φ_ε̄(s, a)   (20a)
  subject to:  Σ_{s ∈ S_r} Σ_{a ∈ A} x(s, a) r(s, a) = R_max(M, B)   (20b)
  ∀ s ∈ S_r,  Σ_{a ∈ A(s)} x(s, a) − Σ_{s' ∈ S_r} Σ_{a ∈ A(s')} P_{s',a,s} x(s', a) = α(s)   (20c)
  ∀ s ∈ S_r, ∀ a ∈ A(s),  x(s, a) ≥ 0   (20d)

where α : S → {0, 1} is a function such that α(s_1) = 1 and α(s) = 0 for all s ∈ S \ {s_1}, i.e., the initial state distribution. The variable x(s, a) corresponds to the expected residence time in the state-action pair (s, a) [8], [29]. The constraint in (20b) ensures that the probability of reaching the set B is maximized, and the constraints in (20c) represent the balance between the "inflow" to and "outflow" from states. Let υ* be the optimal value of the LP in (20a)-(20d). Next, we solve the LP

  minimize_{x(s,a)}  Σ_{s ∈ S_r} Σ_{a ∈ A} x(s, a)   (21a)
  subject to:  Σ_{s ∈ S_r} Σ_{a ∈ A} x(s, a) φ_ε̄(s, a) = υ*   (21b)
  (20b) − (20d).   (21c)

Let {x*(s, a) : s ∈ S, a ∈ A} be a set of optimal decision variables for the problem in (21a)-(21c). Moreover, for a given state s ∈ S, let A*(s) := {a ∈ A(s) : x*(s, a) > 0} be the set of active actions. An optimal stationary policy π* that satisfies the condition in (17) can be generated from this set as

  π*(s) = a  for an arbitrary a ∈ A*(s).   (22)
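The LP (20a)-(20d) can be passed directly to an off-the-shelf solver. Below is a sketch using scipy.optimize.linprog; the dictionary-based MDP encoding and the argument names are assumptions made for illustration, and the second LP (21a)-(21c) is obtained analogously by replacing the objective with Σ x(s, a) and adding the constraint (21b).

```python
import numpy as np
from scipy.optimize import linprog


def solve_first_lp(S_r, A, P, phi, r, R_max, s_init):
    """Solve the LP (20a)-(20d) over expected residence times x(s, a).

    A[s] lists actions, P[(s, a)] is a dict of successor probabilities, and phi
    and r map (s, a) to costs/rewards; these encodings are assumed for illustration.
    """
    pairs = [(s, a) for s in S_r for a in A[s]]
    idx = {sa: i for i, sa in enumerate(pairs)}
    c = np.array([phi[sa] for sa in pairs])                       # objective (20a)

    A_eq, b_eq = [], []
    A_eq.append([r[sa] for sa in pairs]); b_eq.append(R_max)      # constraint (20b)
    for s in S_r:                                                 # flow balance (20c)
        row = np.zeros(len(pairs))
        for a in A[s]:
            row[idx[(s, a)]] += 1.0
        for (s2, a), i in idx.items():
            row[i] -= P[(s2, a)].get(s, 0.0)
        A_eq.append(row); b_eq.append(1.0 if s == s_init else 0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")               # x(s, a) >= 0 (20d)
    return {sa: res.x[i] for sa, i in idx.items()}, res.fun
```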
Proposition 1: A stationary policy generated from the optimal decision variables x*(s, a) using the rule in (22) is a solution to the problem in (19a)-(19b).

A proof of Proposition 1 can be found in [16]. Intuitively, the LP in (21a)-(21c) computes the minimum expected time to reach the set B with probability R_max(M, B) and cost υ*. Therefore, if x*(s, a) > 0, by taking the action a ∈ A(s), the agent has to "get closer" to the set B with nonzero probability. Otherwise, the minimum expected time to reach the set B would be strictly decreased. Hence, by choosing an action a ∈ A*(s), the agent reaches the set B with maximum probability at minimum total cost.

VIII. ALGORITHMS FOR BEHAVIOR MODIFICATION
In this section, we present two algorithms to solve the N-BMP. In Appendix C, we explain how to modify these algorithms to solve the NS-BMP.
A. Computing a globally optimal solution to the N-BMP
We formulate a mixed-integer linear program (MILP) to compute a globally optimal solution to the N-BMP. Recall that in the N-BMP, the objective is to synthesize a stationary incentive sequence γ ∈ Γ^S(M) under which the agent reaches the target set B with probability R_max(M, B) regardless of its type and the expected total cost to the principal is minimized. We present the MILP formulated to obtain an ε-optimal stationary incentive sequence for the N-BMP in (23a)-(23n). In what follows, we explain its derivation in detail. Recall from Section VII-B that it is sufficient for the principal to incentivize the agent only from the states S_r ⊆ S in order to induce the desired behavior. Hence, we focus only on the set S_r of states in the MILP formulation. We first express the optimal behavior of an agent type θ, i.e., the constraints in (2b)-(2c), as a set of inequality constraints. Let {γ(s, a) ≥ 0 | s ∈ S_r, a ∈ A} be a set of variables representing the incentive offers. For each θ ∈ Θ, s ∈ S_r, and a ∈ A(s), let V_θ(s) ∈ R and Q_θ(s, a) ∈ R be variables such that

  Q_θ(s, a) = R_θ(s, a) + γ(s, a),   (24a)
  V_θ(s) = max_{a ∈ A(s)} Q_θ(s, a).   (24b)

Then, for a given state s ∈ S_r, any action a ∈ A(s) that satisfies V_θ(s) = Q_θ(s, a) is optimal for the agent type θ. Let {X^θ_{s,a} ∈ {0, 1} | s ∈ S_r, a ∈ A, θ ∈ Θ} be a set of binary variables representing whether an action a ∈ A(s) is optimal for the agent type θ, i.e., X^θ_{s,a} = 1. The constraint in (24b) is not linear in Q_θ(s, a); however, using the big-M method [30], we can replace (24b) exactly by the following set of inequalities

  V_θ(s) ≥ Q_θ(s, a)  ∀ a ∈ A(s),   (25a)
  V_θ(s) ≤ Q_θ(s, a) + (1 − X^θ_{s,a}) M_ε  ∀ a ∈ A(s),   (25b)
  Σ_{a ∈ A(s)} X^θ_{s,a} ≥ 1   (25c)

where M_ε is a large constant which will be specified shortly. In particular, the constant M_ε is chosen such that if X^θ_{s,a} = 0, then the inequality in (25b) trivially holds. The constraints described above correspond to the constraints in (23b)-(23d) and (23n) in the formulated MILP. Next, we introduce a set of inequality constraints which ensure that the agent type θ has a unique optimal policy. For each s ∈ S_r and a ∈ A(s), using the big-M method, we require

  Q_θ(s, a) + (1 − X^θ_{s,a}) M_ε ≥ Q_θ(s, a') + ε̄   (26)

for all a' ∈ A(s) \ {a}, where the constant ε̄ > 0 is defined in terms of the constant ε as shown in Theorem 5. Note that if X^θ_{s,a} = 1, then Q_θ(s, a) > Q_θ(s, a') for all a' ∈ A(s) \ {a}. Hence, the unique optimal policy of the agent type θ satisfies π*(s) = a. Again, the choice of M_ε guarantees that if X^θ_{s,a} = 0, then the inequality in (26) trivially holds. These constraints correspond to the constraints (23e) in the formulated MILP. We now express the reachability constraint in (2d) as a set of equality constraints. Recall from (19b) that the constraint in (2d) can be written as an expected total reward constraint with respect to the reward function r defined in (18). In particular, the constraint in (20b), together with the constraints in (20c)-(20d), ensures that the agent reaches the target set B with maximum probability.
For each θ ∈ Θ, s ∈ S_r, and a ∈ A(s), let μ_{r,θ}(s, a) ∈ R and λ_{r,θ}(s, a) ∈ R_{≥0} be variables such that

  Σ_{a ∈ A(s)} μ_{r,θ}(s, a) − Σ_{s' ∈ S_r} Σ_{a ∈ A(s')} P_{s',a,s} μ_{r,θ}(s', a) = α(s),   (27a)
  Σ_{s ∈ S_r} Σ_{a ∈ A(s)} λ_{r,θ}(s, a) r(s, a) = R_max(M, B),   (27b)
  μ_{r,θ}(s, a) = λ_{r,θ}(s, a) X^θ_{s,a}   (27c)

where the function α : S → {0, 1} represents the initial state distribution. Using the principle of optimality [8], one can show that, from the initial state s_1 ∈ S, the agent type θ reaches the target set B with probability R_max(M, B) if and only if the agent's deterministic optimal policy expressed in terms of the variables X^θ_{s,a} satisfies the constraints in (27a)-(27c). The constraint in (27c) involves a multiplication of the binary variable X^θ_{s,a} with the continuous variable λ_{r,θ}(s, a). Using McCormick envelopes [31], we can replace each constraint in (27c) exactly by the following inequalities

  μ_{r,θ}(s, a) ≥ 0,   (28a)
  μ_{r,θ}(s, a) ≤ M̃ X^θ_{s,a},   (28b)
  μ_{r,θ}(s, a) ≤ λ_{r,θ}(s, a),   (28c)
  μ_{r,θ}(s, a) ≥ λ_{r,θ}(s, a) − (1 − X^θ_{s,a}) M̃.   (28d)

The exact value of the large constant M̃ will be discussed shortly. The above inequalities ensure that if X^θ_{s,a} = 0, then μ_{r,θ}(s, a) = 0, and if X^θ_{s,a} = 1, then μ_{r,θ}(s, a) = λ_{r,θ}(s, a). The constraints described above are the constraints in (23f)-(23i). We finally express the cost of behavior modification to the principal, i.e., the objective function in (2a), with a set of inequality constraints. For each θ ∈ Θ, s ∈ S_r, and a ∈ A(s), let V_{p,θ}(s) ∈ R and Q_{p,θ}(s, a) ∈ R be variables such that

  Q_{p,θ}(s, a) = γ(s, a) + Σ_{s' ∈ S} P_{s,a,s'} V_{p,θ}(s'),   (29a)
  V_{p,θ}(s) = Σ_{a ∈ A(s)} X^θ_{s,a} Q_{p,θ}(s, a).   (29b)

Using the principle of optimality, one can show that the expected total amount of incentives paid to the agent type θ by the principal is V_{p,θ}(s_1), where s_1 ∈ S is the initial state of M. Hence, the principal's objective is to find a stationary incentive sequence that minimizes V_{p,θ}(s_1) over all agent types θ ∈ Θ. This objective is expressed in the formulated MILP with the objective function in (23a) and the constraint in (23m). The equality constraint in (29b) involves bilinear terms. Using the big-M method, we can replace the constraint in (29b) exactly by the following set of inequality constraints

  V_{p,θ}(s) ≥ Q_{p,θ}(s, a) − (1 − X^θ_{s,a}) M̄_ε  ∀ a ∈ A(s),   (30a)
  V_{p,θ}(s) ≤ Q_{p,θ}(s, a) + (1 − X^θ_{s,a}) M̄_ε  ∀ a ∈ A(s),   (30b)

where M̄_ε is a large constant which will be precisely specified shortly. The above constraints ensure that if X^θ_{s,a} = 1, we have V_{p,θ}(s) = Q_{p,θ}(s, a); otherwise, the corresponding inequalities trivially hold.
The constraints in (23k)-(23l) in the formulated MILP then correspond to the constraints explained above.

  minimize_{ω, γ, V_θ, Q_θ, V_{p,θ}, Q_{p,θ}, X^θ_{s,a}, μ_{r,θ}, λ_{r,θ}}  ω   (23a)
  subject to:
  ∀ θ ∈ Θ, ∀ s ∈ S_r, ∀ a ∈ A(s),  Q_θ(s, a) = R_θ(s, a) + γ(s, a)   (23b)
  ∀ θ ∈ Θ, ∀ s ∈ S_r,  V_θ(s) ≥ Q_θ(s, a)  ∀ a ∈ A(s)   (23c)
  ∀ θ ∈ Θ, ∀ s ∈ S_r,  V_θ(s) ≤ Q_θ(s, a) + (1 − X^θ_{s,a}) M_ε  ∀ a ∈ A(s)   (23d)
  ∀ θ ∈ Θ, ∀ s ∈ S_r, ∀ a ∈ A(s),  Q_θ(s, a) + (1 − X^θ_{s,a}) M_ε ≥ Q_θ(s, a') + ε̄  ∀ a' ∈ A(s) \ {a}   (23e)
  ∀ θ ∈ Θ, ∀ s ∈ S_r,  Σ_{a ∈ A(s)} μ_{r,θ}(s, a) − Σ_{s' ∈ S_r} Σ_{a ∈ A(s')} P_{s',a,s} μ_{r,θ}(s', a) = α(s)   (23f)
  ∀ θ ∈ Θ,  Σ_{s ∈ S_r} Σ_{a ∈ A(s)} λ_{r,θ}(s, a) r(s, a) = R_max(M, B)   (23g)
  ∀ θ ∈ Θ, ∀ s ∈ S_r, ∀ a ∈ A(s),  μ_{r,θ}(s, a) ≥ 0,  μ_{r,θ}(s, a) ≤ M̃ X^θ_{s,a},  μ_{r,θ}(s, a) ≤ λ_{r,θ}(s, a)   (23h)
  ∀ θ ∈ Θ, ∀ s ∈ S_r, ∀ a ∈ A(s),  μ_{r,θ}(s, a) ≥ λ_{r,θ}(s, a) − (1 − X^θ_{s,a}) M̃   (23i)
  ∀ θ ∈ Θ, ∀ s ∈ S_r, ∀ a ∈ A(s),  Q_{p,θ}(s, a) = γ(s, a) + Σ_{s' ∈ S} P_{s,a,s'} V_{p,θ}(s')   (23j)
  ∀ θ ∈ Θ, ∀ s ∈ S_r,  V_{p,θ}(s) ≥ Q_{p,θ}(s, a) − (1 − X^θ_{s,a}) M̄_ε  ∀ a ∈ A(s)   (23k)
  ∀ θ ∈ Θ, ∀ s ∈ S_r,  V_{p,θ}(s) ≤ Q_{p,θ}(s, a) + (1 − X^θ_{s,a}) M̄_ε  ∀ a ∈ A(s)   (23l)
  ∀ θ ∈ Θ,  ω ≥ V_{p,θ}(s_1)   (23m)
  ∀ θ ∈ Θ, ∀ s ∈ S_r, ∀ a ∈ A(s),  γ(s, a) ≥ 0,  λ_{r,θ}(s, a) ≥ 0,  X^θ_{s,a} ∈ {0, 1},  Σ_{a ∈ A(s)} X^θ_{s,a} ≥ 1.   (23n)

The formulation of the MILP (23a)-(23n) involves only the exact representation of the problem in (2a)-(2d) and the constraint in (26), which ensures that each agent type has a unique optimal policy. Therefore, it follows from Theorem 5 that an ε-optimal solution to the N-BMP can be obtained from the optimal solution to the formulated MILP by using its optimal decision variables γ*(s, a) as the incentive offers.
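For concreteness, the sketch below assembles the MILP (23a)-(23n) with the gurobipy interface of the GUROBI solver used in Section IX. The data encodings (dictionaries for P, R, r) and all argument names are assumptions made for illustration; this is a sketch, not the authors' implementation.

```python
import gurobipy as gp
from gurobipy import GRB


def build_nbmp_milp(S_r, A, P, R, types, r, R_max, s_init,
                    M_eps, M_tilde, M_bar, eps_bar):
    """Sketch of the MILP (23a)-(23n) for the N-BMP.

    A[s] lists actions, P[(s, a)] maps successors to probabilities, and
    R[(theta, s, a)] gives rewards; states outside S_r are treated as zero-cost.
    """
    keys = [(t, s, a) for t in types for s in S_r for a in A[s]]
    m = gp.Model("N-BMP")
    gam = m.addVars([(s, a) for s in S_r for a in A[s]], lb=0.0, name="gamma")
    X = m.addVars(keys, vtype=GRB.BINARY, name="X")
    Q = m.addVars(keys, lb=-GRB.INFINITY, name="Q")
    V = m.addVars([(t, s) for t in types for s in S_r], lb=-GRB.INFINITY, name="V")
    Qp = m.addVars(keys, lb=0.0, name="Qp")
    Vp = m.addVars([(t, s) for t in types for s in S_r], lb=0.0, name="Vp")
    lam = m.addVars(keys, lb=0.0, name="lam")
    mu = m.addVars(keys, lb=0.0, name="mu")
    w = m.addVar(lb=0.0, name="w")
    m.setObjective(w, GRB.MINIMIZE)                                        # (23a)

    for t in types:
        m.addConstr(w >= Vp[t, s_init])                                    # (23m)
        m.addConstr(gp.quicksum(lam[t, s, a] * r[(s, a)]
                                for s in S_r for a in A[s]) == R_max)      # (23g)
        for s in S_r:
            m.addConstr(gp.quicksum(mu[t, s, a] for a in A[s])
                        - gp.quicksum(P[(s2, a)].get(s, 0.0) * mu[t, s2, a]
                                      for s2 in S_r for a in A[s2])
                        == (1.0 if s == s_init else 0.0))                  # (23f)
            m.addConstr(gp.quicksum(X[t, s, a] for a in A[s]) >= 1)        # (23n)
            for a in A[s]:
                m.addConstr(Q[t, s, a] == R[(t, s, a)] + gam[s, a])        # (23b)
                m.addConstr(V[t, s] >= Q[t, s, a])                         # (23c)
                m.addConstr(V[t, s] <= Q[t, s, a]
                            + (1 - X[t, s, a]) * M_eps)                    # (23d)
                for a2 in A[s]:
                    if a2 != a:                                            # (23e)
                        m.addConstr(Q[t, s, a] + (1 - X[t, s, a]) * M_eps
                                    >= Q[t, s, a2] + eps_bar)
                m.addConstr(mu[t, s, a] <= M_tilde * X[t, s, a])           # (23h)
                m.addConstr(mu[t, s, a] <= lam[t, s, a])                   # (23h)
                m.addConstr(mu[t, s, a] >= lam[t, s, a]
                            - (1 - X[t, s, a]) * M_tilde)                  # (23i)
                m.addConstr(Qp[t, s, a] == gam[s, a]
                            + gp.quicksum(p * Vp[t, s2]
                                          for s2, p in P[(s, a)].items()
                                          if s2 in S_r))                   # (23j)
                m.addConstr(Vp[t, s] >= Qp[t, s, a]
                            - (1 - X[t, s, a]) * M_bar)                    # (23k)
                m.addConstr(Vp[t, s] <= Qp[t, s, a]
                            + (1 - X[t, s, a]) * M_bar)                    # (23l)
    return m, gam
```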
Choosing the big-M constants: The formulation of the MILP (23a)-(23n) requires one to specify the constants M_ε, M̃, and M̄_ε. They should be chosen such that the constraints involving them hold trivially when the binary variable X^θ_{s,a} = 0. The constant M_ε appears in (23d) and (23e). We set

  M_ε := 2 ( max_{θ ∈ Θ, s ∈ S, a ∈ A} R_θ(s, a) − min_{θ ∈ Θ, s ∈ S, a ∈ A} R_θ(s, a) + ε ).

The rationale behind this choice is quite simple. For a given state s ∈ S_r, the principal can make any action a ∈ A(s) uniquely optimal for any agent type θ ∈ Θ by offering an incentive γ(s, a) ≤ max R_θ(s, a) − min R_θ(s, a) + ε. Hence, we have Q_θ(s, a) − Q_θ(s, a') ≤ M_ε for any θ ∈ Θ, any s ∈ S_r, and any a, a' ∈ A(s). Consequently, for the above choice of M_ε, the constraints in (23d) and (23e) hold trivially when X^θ_{s,a} = 0. The constant M̃ appears in (23h)-(23i). These constraints put upper bounds on the expected residence times in state-action pairs. For MDPs with deterministic transition functions, we simply set M̃ = 1. The reason is that, under a deterministic policy that reaches the set B with maximum probability, the agent cannot visit a state s ∈ S_r twice. On the other hand, for MDPs with stochastic transition functions, it is, in general, NP-hard to compute the maximum expected residence times in states subject to the reachability constraint in (2d). Therefore, in that case, we set M̃ = k|S| for some large k ∈ N. The constant M̄_ε appears in (30a)-(30b). These constraints put bounds on the total expected cost V_{p,θ}(s_1) to the principal. Using the methods described in Section VII-C, one can compute a policy that minimizes the expected total cost with respect to the cost function C, defined in (3), subject to the reachability constraint in (2d). The optimal value of this optimization problem corresponds to a total incentive amount that is sufficient to lead the agent to the target set regardless of its type. Hence, it provides an upper bound on the variables Q_{p,θ}(s, a). We set the constant M̄_ε to the optimal value of the above mentioned optimization problem. As a result, the constraints in (30a)-(30b) hold trivially when X^θ_{s,a} = 0.

B. Computing a locally optimal solution to the N-BMP
The MILP formulation described in the previous section involves |S||A||Θ| binary variables. In the worst case, computing its optimal solution takes exponential time in the size of M and Θ. Here, we present a more practical algorithm which computes a locally optimal solution to the N-BMP. Specifically, we formulate the N-BMP as a nonlinear optimization problem (NLP) with bilinear constraints by slightly modifying the MILP (23a)-(23n). We then use the so-called convex-concave procedure (CCP) to solve the derived NLP. To formulate the N-BMP as an NLP with bilinear constraints instead of an MILP, we express the policy of the agent type θ using the set of continuous variables

  { 0 ≤ ν_θ(s, a) ≤ 1 | Σ_{a ∈ A(s)} ν_θ(s, a) = 1,  s ∈ S_r, a ∈ A, θ ∈ Θ }

instead of the binary variables X^θ_{s,a} ∈ {0, 1}. In what follows, we explain the modifications we make to the MILP (23a)-(23n) to obtain an NLP with bilinear constraints. To express the optimal behavior of an agent type θ, we keep the constraints (23b)-(23c), but replace the constraint in (23d) in the MILP with the following constraint

  V_θ(s) ≤ Σ_{a ∈ A(s)} ν_θ(s, a) Q_θ(s, a).   (31)

The above inequality, together with (23b)-(23c), ensures that if ν_θ(s, a) > 0, the corresponding action a ∈ A(s) is optimal for the agent type θ. Note that the condition ν_θ(s, a) > 0 is not necessary for the action a ∈ A(s) to be optimal. Next, we introduce a constraint which ensures that the agent type θ has a unique deterministic optimal policy; therefore, the condition ν_θ(s, a) > 0 becomes a necessary and sufficient condition for the optimality of the action a ∈ A(s) for the agent type θ. Now, similar to (26), we introduce a set of inequality constraints which ensure that the agent type θ has a unique optimal policy. In particular, we replace the constraints in (23e) in the MILP with the following set of constraints

  ν_θ(s, a) [ Q_θ(s, a) − ( Q_θ(s, a') + ε̄ ) ] ≥ 0   (32)

for all a' ∈ A(s) \ {a}. The above constraints, together with the constraints in (23b)-(23c) and (31), ensure that ν_θ(s, a) = 1 for the uniquely optimal action a ∈ A(s). In the MILP formulation, we express the reachability constraint in (2d) as the set of constraints in (23f)-(23i). In particular, the constraints in (28a)-(28d) represent the McCormick envelope corresponding to the constraint in (27c). In the NLP formulation, we keep the constraints in (23f)-(23g), but instead of using the McCormick envelopes, we replace the constraint in (27c) simply with the following bilinear constraint

  μ_{r,θ}(s, a) = λ_{r,θ}(s, a) ν_θ(s, a).   (33)

The above constraint, together with the constraints in (23f)-(23g), ensures that the agent type θ ∈ Θ reaches the target set B with probability R_max(M, B) under its optimal policy expressed in terms of the variables ν_θ(s, a). Finally, we express the expected total cost of behavior modification with a set of inequality constraints. Specifically, we keep the constraints in (23j) and (23m) in the MILP, but replace the constraints in (23k)-(23l) with the constraint

  V_{p,θ}(s) ≥ Σ_{a ∈ A(s)} ν_θ(s, a) Q_{p,θ}(s, a).   (34)

It is important to note that we removed the constraint that puts an upper bound on the value of V_{p,θ}(s).
As a result, unlike the MILP (23a)-(23n), in the formulated optimization problem, the value of V_{p,θ}(s_1) is, in general, not equal to the cost of behavior modification for the corresponding agent type θ. However, it is indeed equal to the cost of behavior modification for the agent type that maximizes the cost to the principal, i.e., the θ ∈ Θ that satisfies ω = V_{p,θ}(s_1). The formulation of the NLP described above involves the exact representation of the constraints in (2b)-(2d) and an additional constraint in (32) which ensures that each agent type has a unique optimal policy. Moreover, it exactly represents the worst-case expected total cost incurred by the principal. Hence, it follows from Theorem 5 that an ε-optimal solution to the N-BMP can be obtained from the optimal solution to the formulated NLP by using its optimal decision variables γ*(s, a) as the incentive offers to the agent.

CCP to compute a local optimum:
We employ the Penalty CCP (P-CCP) algorithm [14], which is a variation of the basic CCP algorithm [15], to compute a locally optimal solution to the formulated NLP. We now briefly cover the main working principles of the P-CCP algorithm and explain how to utilize it for the purposes of this paper. We refer the interested reader to [14] for further details on the P-CCP algorithm. Suppose we are given an optimization problem of the form

  minimize_{x ∈ R^n}  Z_0(x)   (35a)
  subject to:  Z_i(x) − Y_i(x) ≤ 0,  i = 1, 2, ..., m   (35b)

where Z_0 : R^n → R is a convex function, and Z_i : R^n → R and Y_i : R^n → R are convex functions for each i ∈ {1, ..., m}. The above optimization problem is, in general, not convex due to the constraints in (35b) [26]. The P-CCP algorithm is a heuristic method to compute a locally optimal solution to problems of the form (35a)-(35b) by iteratively approximating the constraints in (35b). Let ζ > 1 be a constant, τ_0 and τ_max be positive constants, and τ_k be a constant that is recursively defined as τ_k := min{ζ τ_{k−1}, τ_max} for k ∈ N. Note that τ_i ≥ τ_j for all i ≥ j. Starting from an arbitrary initial point x_0 ∈ R^n, at the k-th iteration, the P-CCP algorithm computes a globally optimal solution x_{k+1} of the convex optimization problem

  minimize_{x ∈ R^n}  Z_0(x) + τ_k Σ_{i=1}^{m} s^k_i   (36a)
  subject to:  Z_i(x) − Ŷ_i(x; x_k) ≤ s^k_i,  i = 1, 2, ..., m   (36b)
  s^k_i ≥ 0,  i = 1, 2, ..., m   (36c)

where Ŷ_i(x; x_k) := Y_i(x_k) + ∇Y_i(x_k)^T (x − x_k) is the first-order approximation of the function Y_i at x_k, and ∇Y_i denotes the gradient of Y_i. In the above problem, the variables s^k_i are referred to as the slack variables. This is because, when the optimal solution to the problem in (36a)-(36c) satisfies s^k_i = 0 for all i, the optimal variables x_{k+1} ∈ R^n constitute a feasible solution for the original problem in (35a)-(35b). The P-CCP algorithm terminates when

  ( Z_0(x_k) + τ_k Σ_{i=1}^{m} s^k_i ) − ( Z_0(x_{k+1}) + τ_k Σ_{i=1}^{m} s^{k+1}_i ) ≤ δ

for some small δ > 0, and either x_{k+1} is feasible, i.e.,

  Σ_{i=1}^{m} s^{k+1}_i ≤ δ_violation ≈ 0,   (37)

or τ_k = τ_max. The P-CCP algorithm is guaranteed to terminate [14]. Moreover, if the condition in (37) is satisfied upon termination, the output of the algorithm constitutes a locally optimal solution to the problem in (35a)-(35b). The NLP formulated in the previous section consists of a convex objective function, i.e., Z_0(x) = ω, a number of linear constraints, and the bilinear constraints in (31)-(34). To employ the P-CCP algorithm for obtaining a locally optimal solution to the formulated NLP, we express the above mentioned bilinear constraints in the form of (35b) by following the convexification technique described in [32]. Due to space restrictions, we refer the interested reader to Section 5 in [32] for further details on the convexification technique. We can now compute a locally optimal solution to the NLP formulated in the previous section as follows. Set the parameters ζ, τ_0, τ_max, δ, and δ_violation. Initialize the P-CCP algorithm by setting all variables to some initial values. Run the P-CCP algorithm by convexifying each bilinear term in the NLP at each iteration. Upon termination, verify whether the condition in (37) is satisfied. If it is satisfied, use the optimal decision variables γ*(s, a) as the incentive offers to the agent.
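The outer loop of the P-CCP algorithm can be organized as in the sketch below, which assumes a hypothetical callback solve_subproblem(x_k, tau_k) that linearizes the concave parts of the bilinear constraints around x_k, solves the convex problem (36a)-(36c), and returns the next iterate together with the penalized objective value and the total slack. The stopping test is a simplified version of the rule built around (37).

```python
def penalty_ccp(x0, solve_subproblem, zeta=1.5, tau0=1.0, tau_max=1e4,
                delta=1e-4, delta_violation=1e-6, max_iter=200):
    """Outer loop of the penalty CCP sketched from (36a)-(37).

    `solve_subproblem(x_k, tau_k)` is a hypothetical callback returning
    (x_next, penalized_objective, total_slack) for the convexified subproblem.
    """
    x, tau = x0, tau0
    prev_val = float("inf")
    slack = float("inf")
    for _ in range(max_iter):
        x_next, val, slack = solve_subproblem(x, tau)
        # Simplified check of the stopping rule: the penalized objective has
        # stalled and either the slacks (near-)vanish or tau has saturated.
        if prev_val - val <= delta and (slack <= delta_violation or tau >= tau_max):
            return x_next, slack <= delta_violation
        x, prev_val = x_next, val
        tau = min(zeta * tau, tau_max)
    return x, slack <= delta_violation
```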
IX. NUMERICAL EXAMPLES

We illustrate the application of the presented algorithms with two examples on discount planning and motion planning. We run the computations on a 3.1 GHz desktop with 32 GB RAM and employ the GUROBI solver [33] for optimization.
A. Discount planning to encourage purchases
We consider a retailer (principal) that aims to sell n products to a customer (agent) over n interactions. It is assumed that the agent purchases a single product per interaction, and that the agent's willingness to purchase new products depends on the ones purchased in the past. The principal does not know how the agent associates different products with each other and aims to maximize its total profit by synthesizing a sequence of discount offers (incentives) that convinces the agent to purchase all the products with a minimum total discount.

Let Q = {1, 2, ..., n} be the set of products that the principal aims to sell. We construct an MDP to express the agent's behavior as follows. Each state in the MDP corresponds to a set of products that are already purchased by the agent, i.e., the set S of states is the power set of Q. The initial state of the MDP is the empty set. In a given state s ∈ S such that s ⊆ Q, the agent has two choices: not to make any purchases and stay in the same state by taking the action a_0, or to purchase a product i ∈ Q \ s by taking the action a_i, as a result of which it transitions to the state s ∪ {i} ∈ S. All transitions in this model are deterministic, i.e., P_{s,a,s'} ∈ {0, 1}.
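As an illustration of this construction, the following minimal sketch enumerates the power-set state space and the deterministic transitions in plain Python; the data structures and function names are our own and are not taken from the paper's implementation.

```python
from itertools import combinations

def build_purchase_mdp(n=4):
    """Build the deterministic purchase MDP: states are subsets of products,
    action 0 stays in place, action i buys product i (if not yet owned)."""
    products = range(1, n + 1)
    # The state space is the power set of the product set Q = {1, ..., n}.
    states = [frozenset(c) for r in range(n + 1)
              for c in combinations(products, r)]

    transitions = {}  # (state, action) -> successor state (all deterministic)
    for s in states:
        transitions[(s, 0)] = s            # action a_0: no purchase, stay put
        for i in products:
            if i not in s:
                transitions[(s, i)] = s | {i}  # action a_i: buy product i

    initial_state = frozenset()            # no products purchased yet
    target_state = frozenset(products)     # all products purchased
    return states, transitions, initial_state, target_state

states, T, s0, goal = build_purchase_mdp()
print(len(states))          # 16 states for n = 4
print(T[(frozenset(), 1)])  # buying product 1 from the empty set -> {1}
```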
We choose n = 4 and consider 3 agent types, i.e., |Θ| = 3. All agent types prefer not to purchase any products unless the principal offers some discounts, i.e., R_θ^max(s) = R_θ(s, a_0) = 0 for all s ∈ S and θ ∈ Θ. The intrinsic motivation of each agent type is summarized in Table II. (Due to space restrictions, we provide only a brief description of the reward functions in the numerical examples. We refer the interested reader to https://github.com/yagizsavas/Sequential-Incentive-Design for a complete description of the reward functions and an implementation of the algorithms.) The product grouping in Table II expresses how an agent type associates the products with each other. For example, once the agent type θ_1 purchases product 1, it becomes more willing to purchase product 2 than products 3 or 4, i.e., R_{θ_1}({1}, a_2) > R_{θ_1}({1}, a_3) = R_{θ_1}({1}, a_4).

TABLE II: Customer preferences in discount planning.
  Customer type    Product grouping      Product importance
  θ_1              {1,2} and {3,4}       {3,4} more important than {1,2}
  θ_2              {1,3} and {2,4}       {2,4} more important than {1,3}
  θ_3              {1,4} and {2,3}       both groups equally important

Once a group of products is purchased by the agent, it becomes harder for the principal to sell the other group of products. The product importance in Table II expresses the agent's willingness to purchase a group of products once the other group is already purchased. For example, it is more profitable for the principal to sell the group {1,2} of products to the type θ_1 first and the group {3,4} of products afterwards, since θ_1 is already more willing to purchase the group {3,4}, i.e., R_{θ_1}({1,2}, a_3) = R_{θ_1}({1,2}, a_4) > R_{θ_1}({3,4}, a_1) = R_{θ_1}({3,4}, a_2).

We synthesized a sequence of discount offers for the agent by solving the corresponding MILP formulation, which has 205 continuous and 112 binary variables. The computation took 0.4 seconds. A part of the synthesized discount sequence is shown in Table III. During the first interaction, the principal offers a discount only for product 1. The offer is large enough to ensure that the agent purchases product 1 regardless of its type, i.e., γ({}, a_1) = 1 + ε. Then, during the second interaction, the principal offers the discounts γ({1}, a_2) = γ({1}, a_3) = γ({1}, a_4) = 1 + ε for all remaining products. As a result, depending on its type, the agent purchases one of the discounted products. For example, the agent type θ_1 purchases product 2, whereas the agent type θ_2 purchases product 3. Finally, the principal sequentially offers discounts for the remaining two products.

The synthesized discount sequence utilizes the principal's knowledge of the possible agent types. Specifically, the principal knows that purchasing product 1 is not a priority for any of the agent types (see Table II). Therefore, the principal is aware that it will be harder to sell product 1 once the agent has spent money on the other products. Hence, by offering a discount only for product 1 during the first interaction, the principal makes sure that the agent purchases the least important product while it still has the money. Moreover, during the second interaction, the principal ensures that each agent type purchases the second least important product for itself by offering discounts for products 2, 3, and 4 at the same time. Since the group of products that are not yet purchased is more important than the group that is already purchased, the principal then sells the remaining products by offering small discount amounts.

TABLE III: Discount offers that maximize the retailer's profit while ensuring that the customer purchases all products regardless of its type. The check marks indicate the discounted products as a function of the products that are already purchased. For example, the retailer discounts only product 1 if the customer has no previous purchases and discounts product 4 if the set {1,2} of products has already been purchased.

  Products purchased:   {}    {1}    {1,2}    {1,3}    {1,4}
  Product 1             ✓     -      -        -        -
  Product 2             -     ✓      -        -        -
  Product 3             -     ✓      -        -        ✓
  Product 4             -     ✓      ✓        ✓        -

B. Incentives for motion planning
We consider a ridesharing company (principal) that aims to convince a driver (agent) to be present at a desired target region during the rush hour by offering monetary incentives. The company is assumed to operate in Austin, TX, USA, which is divided into regions according to zip codes as shown in Fig. 2 (left), in which the areas, e.g., downtown, north, etc., are indicated with different colors for illustrative purposes. The environment is modeled as an MDP with 54 states, each of which corresponds to a region. In state s_i ∈ S, the agent has two choices: to stay in the same state s_i under the action a_i, or to move to a "neighboring" state s_j, i.e., a region that shares a border with the agent's current region, under the action a_j.

We consider three agent types, i.e., |Θ| = 3. In the absence of incentives, staying in the same state is optimal for all agent types in all states, i.e., R_θ^max(s_i) = R_θ(s_i, a_i) = 0 for all s_i ∈ S and θ ∈ Θ. Hence, the principal cannot distinguish the true agent type by passively observing the agent and, to induce the desired behavior, the principal has to offer incentives in all states. The first agent type θ_1 associates each state pair (s_i, s_j) with the corresponding distance d_{i,j}. To move to a different region, it demands that the principal offer incentives proportional to d_{i,j}, i.e., R_{θ_1}(s_i, a_j) = −d_{i,j}. The second agent type θ_2 associates each state s_i with a congestion index t_i; e.g., the states in the downtown area have the highest congestion indices. To move to a region s_j, it demands that the principal offer incentives proportional to t_j, i.e., R_{θ_2}(s_i, a_j) = −t_j. Finally, the third agent type θ_3 takes both the distance and the congestion index into account: its reward R_{θ_3}(s_i, a_j) is the negative of a weighted combination of d_{i,j} and t_j.

We first considered the case of known agent types and synthesized optimal incentive sequences for each agent type using the corresponding LP formulation. The trajectories followed by the agents under the synthesized incentive sequences are shown in Fig. 2 (right) with dashed lines; e.g., the type θ_1 is incentivized to follow the shortest trajectory to the target. We then considered the case of unknown agent types and computed a globally optimal solution to the corresponding N-BMP instance using the MILP formulation. The MILP had 1407 continuous and 937 binary variables, and the computation exceeded the memory limit after 6 hours. Finally, we computed a locally optimal solution to the corresponding N-BMP instance through the NLP formulation, which has 64801 continuous variables. The CCP converged in 193 iterations, and the computation took 1543 seconds in total. The optimal trajectories followed by each agent type under the synthesized incentive sequence are shown in Fig. 2 (right) with solid lines.

Fig. 2: Motion planning with incentive offers. The states labeled S and T are the initial and target states, respectively. Green, blue, and red lines are the trajectories followed by the agent types θ_1, θ_2, and θ_3, respectively. When the agent type is unknown (known) to the principal, the solid (dashed) trajectories are followed under the synthesized incentive sequence.
We measured the suboptimality of the synthesized incentive sequence using the lower bound in (8); the total cost of the synthesized incentive sequence to the principal is at most 1.52 times the total cost of the globally optimal one.

As seen in Fig. 2, under the synthesized incentive sequence, the agent reaches the target state desired by the principal regardless of its type. Moreover, the synthesized incentive sequence utilizes the principal's information on the possible agent types. In particular, under the offered incentive sequence, two of the agent types follow trajectories that resemble the ones they would follow if the true agent type were known to the principal. Moreover, even though the optimal trajectory of the remaining type changes significantly due to the principal's incomplete information on the true agent type, that type still avoids the downtown area under the offered incentive sequence.

X. CONCLUSIONS AND FUTURE DIRECTIONS
We considered a principal that offers a sequence of incentives to an agent with unknown intrinsic motivation in order to align the agent's behavior with a desired objective. We showed that the behavior modification problem (BMP), i.e., the problem of synthesizing an incentive sequence that induces the desired behavior at minimum total cost, is, in general, computationally intractable. We presented a sufficient condition under which the BMP can be solved in time polynomial in the related parameters. Finally, we developed two algorithms to synthesize a stationary incentive sequence that induces the desired behavior while minimizing the total cost either globally or locally.

In the BMP, we assume that the agent's intrinsic motivation can be expressed as a reward function that belongs to a finite set of possible reward functions. The performance of any algorithm that solves the BMP necessarily depends on the size of this finite set. Hence, it is of interest to develop methods that extract reward representations from data while taking the computational complexity of the resulting BMP into account. Another possible future direction is to express the agent's unknown intrinsic motivation as a generic function with certain structural properties, e.g., continuity, monotonicity, or concavity, and to investigate the effects of different properties on the complexity of the resulting BMP.
REFERENCES

[1] Y.-C. Ho, P. B. Luh, and G. J. Olsder, "A control-theoretic view on incentives," Automatica, pp. 167–179, 1982.
[2] T. Başar, "Affine incentive schemes for stochastic systems with dynamic information," SIAM Journal on Control and Optimization, vol. 22, no. 2, pp. 199–210, 1984.
[3] P. Bolton, M. Dewatripont et al., Contract Theory. MIT Press, 2005.
[4] N. Nisan and A. Ronen, "Algorithmic mechanism design," Games and Economic Behavior, vol. 35, no. 1-2, pp. 166–196, 2001.
[5] V. Conitzer and T. Sandholm, "Complexity of mechanism design," arXiv preprint cs/0205075, 2002.
[6] ——, "Computing the optimal strategy to commit to," in Proceedings of the 7th ACM Conference on Electronic Commerce, 2006, pp. 82–90.
[7] H. Zhang and D. C. Parkes, "Value-based policy teaching with active indirect elicitation," in AAAI Conference on Artificial Intelligence, 2008.
[8] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
[9] G. Shani, D. Heckerman, and R. I. Brafman, "An MDP-based recommender system," Journal of Machine Learning Research, 2005.
[10] V. R. Rao and L. J. Thomas, "Dynamic models for sales promotion policies," Journal of the Operational Research Society, 1973.
[11] J. Wei, J. M. Dolan, J. M. Snider, and B. Litkouhi, "A point-based MDP for robust single-lane autonomous driving behavior under uncertainties," in International Conference on Robotics and Automation, 2011.
[12] D. Bergemann and S. Morris, "Robust mechanism design," Econometrica, pp. 1771–1813, 2005.
[13] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, "Model-based clustering and visualization of navigation patterns on a web site," Data Mining and Knowledge Discovery, vol. 7, no. 4, pp. 399–424, 2003.
[14] T. Lipp and S. Boyd, "Variations and extension of the convex–concave procedure," Optimization and Engineering, pp. 263–287, 2016.
[15] A. L. Yuille and A. Rangarajan, "The concave-convex procedure," Neural Computation, pp. 915–936, 2003.
[16] Y. Savas, V. Gupta, M. Ornik, L. J. Ratliff, and U. Topcu, "Incentive design for temporal logic objectives," in Conference on Decision and Control, 2019, pp. 2251–2258.
[17] S. J. Reddi and E. Brunskill, "Incentive decision processes," arXiv preprint arXiv:1210.4877, 2012.
[18] H. Zhang, D. C. Parkes, and Y. Chen, "Policy teaching through reward function learning," in Proceedings of the 10th ACM Conference on Electronic Commerce, 2009, pp. 295–304.
[19] D. C. Parkes and S. P. Singh, "An MDP-based approach to online mechanism design," in Advances in Neural Information Processing Systems, 2004, pp. 791–798.
[20] Y. Le Tallec, "Robust, risk-sensitive, and data-driven control of Markov decision processes," Ph.D. dissertation, MIT, 2007.
[21] L. N. Steimle, D. L. Kaufman, and B. T. Denton, "Multi-model Markov decision processes," Optimization Online, 2018.
[22] C. H. Papadimitriou and J. N. Tsitsiklis, "The complexity of Markov decision processes," Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987.
[23] C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008.
[24] M. Sipser, "Introduction to the theory of computation," ACM SIGACT News, vol. 27, no. 1, pp. 27–29, 1996.
[25] C. H. Papadimitriou, "The Euclidean travelling salesman problem is NP-complete," Theoretical Computer Science, pp. 237–244, 1977.
[26] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[27] F. Teichteil-Königsbuch, "Stochastic safest and shortest path problems," in AAAI Conference on Artificial Intelligence, 2012.
[28] E. Altman, Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999.
[29] K. Etessami, M. Kwiatkowska, M. Y. Vardi, and M. Yannakakis, "Multi-objective model checking of Markov decision processes," in International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2007, pp. 50–65.
[30] A. Schrijver, Theory of Linear and Integer Programming. John Wiley & Sons, 1998.
[31] G. P. McCormick, "Computability of global solutions to factorable nonconvex programs: Part I — Convex underestimating problems," Mathematical Programming, vol. 10, no. 1, pp. 147–175, 1976.
[32] M. Cubuktepe, N. Jansen, S. Junges, J.-P. Katoen, and U. Topcu, "Synthesis in pMDPs: a tale of 1001 parameters," in International Symposium on Automated Technology for Verification and Analysis, 2018.
[33] Gurobi Optimization, LLC, "Gurobi Optimizer Reference Manual." [Online]. Available: https://www.gurobi.com
[34] R. M. Karp, "Reducibility among combinatorial problems," in Complexity of Computer Computations, 1972, pp. 85–103.

APPENDIX A
In this appendix, we briefly explain how to modify the methods developed in this paper in order to solve the BMP when the agent has a finite decision horizon N ∈ N and when the set B ⊆ S of target states consists of non-absorbing states.

Consider an agent with a finite decision horizon N ∈ N whose objective is to maximize the expected total reward it collects at the end of every N stages [16]. Such an agent's optimal policy is a sequence (D_1, D_2, ...) of decisions where, for k ∈ N, D_k := (d_{(k−1)N+1}, d_{(k−1)N+2}, ..., d_{kN}) satisfies

  D_k ∈ arg max_{π ∈ Π(M)} E^π [ Σ_{t=(k−1)N+1}^{kN} R_{θ*}(s_t, a_t) + δ_t(I_t, a_t) ].

We now show that the behavior modification of an agent with a finite decision horizon N is computationally not easier than the behavior modification of a myopic agent. For any MDP M and N ∈ N, consider an expanded MDP with the set S := S × [N] of states and the initial state (s_1, 1). Let the transition function P : S × A × S → [0, 1] be such that P((s, t), a, (s, t+1)) = 1 for t ∈ [N] \ {N} and P((s, N), a, (s', 1)) = P_{s,a,s'}, and let the reward function R_{θ*} : S × A → R be such that R_{θ*}((s, N), a) := R_{θ*}(s, a) and R_{θ*}((s, t), a) := 0 otherwise. It can be shown that, on the expanded MDP, the behavior modification of an agent with a decision horizon N is equivalent to the behavior modification of a myopic agent. Since the expanded MDP is constructed in time polynomial in the size of M and N, the result follows.

One can solve the BMP when the agent has a decision horizon N ∈ N as follows. For a given BMP instance, first construct an MDP with the set S := S × [N] of states, the initial state (s_1, 1), and the transition function P : S × A × S → [0, 1] such that P((s, t), a, (s', t+1)) = P_{s,a,s'} for t ∈ [N] \ {N} and P((s, N), a, (s', 1)) = P_{s,a,s'}. Note that, in the above construction, the agent returns to a state (s, 1), where s ∈ S, after every N stages. Second, on the constructed MDP, synthesize a sequence of incentive offers that solves the BMP. Finally, every N stages, provide the agent with the next N incentive offers.

One can solve the BMP when the set B of target states consists of non-absorbing states as follows. For a given BMP instance, construct an MDP with the set S := S × {0, 1} of states, the initial state (s_1, 0), and the transition function P : S × A × S → [0, 1] such that P((s, 0), a, (s', 0)) = P_{s,a,s'} for all s ∈ S \ B, P((s, 0), a, (s', 1)) = P_{s,a,s'} for all s ∈ B, and P((s, 1), a, (s', 1)) = P_{s,a,s'} for all s ∈ S. On the constructed MDP, the agent transitions to a state (s, 1) ∈ S × {1} if and only if it has reached the target set. Make all states in S × {1} absorbing and replace the target set B with the set S × {1}. Finally, on the constructed MDP, synthesize a sequence of incentive offers that solves the BMP.
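The horizon expansion used in the equivalence argument above is a simple product construction; the sketch below builds the expanded transition function from a dictionary-based MDP description. The encoding and function names are ours, chosen only to make the construction concrete: the state component is frozen while a stage counter advances, and the original dynamics fire only at every N-th stage.

```python
def expand_for_horizon(P, states, actions, N):
    """Expand an MDP so that a myopic agent on the expanded model mimics an
    agent with decision horizon N on the original model.
    P: dict mapping (s, a, s_next) -> transition probability."""
    exp_states = [(s, t) for s in states for t in range(1, N + 1)]
    P_exp = {}
    for s in states:
        for a in actions:
            # Stages 1, ..., N-1: keep the state, advance the stage counter.
            for t in range(1, N):
                P_exp[((s, t), a, (s, t + 1))] = 1.0
            # Stage N: apply the original dynamics and reset the counter.
            for s_next in states:
                p = P.get((s, a, s_next), 0.0)
                if p > 0:
                    P_exp[((s, N), a, (s_next, 1))] = p
    return exp_states, P_exp

# Toy usage: a two-state MDP with a single action and horizon N = 3.
states, actions = ["s1", "s2"], ["a"]
P = {("s1", "a", "s2"): 1.0, ("s2", "a", "s2"): 1.0}
exp_states, P_exp = expand_for_horizon(P, states, actions, N=3)
print(len(exp_states))                     # 6 expanded states
print(P_exp[(("s1", 3), "a", ("s2", 1))])  # 1.0
```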
APPENDIX B

In this appendix, we provide proofs for all results presented in this paper.

Definition 8: (QSAT) [22] Let F(x_1, x_2, ..., x_n) be a Boolean formula in conjunctive normal form with three literals per clause, and let ∃x_1 ∀x_2 ∃x_3 ... ∀x_n F(x_1, x_2, ..., x_n) be a quantified Boolean formula (QBF). Decide whether the given QBF is true.

Proof of Theorem 1:
The proof is by a reduction from QSAT. We are given an arbitrary QBF with n variables and m clauses C_1, C_2, ..., C_m. Without loss of generality, we assume that n is an even number. To prove the claim, we first construct an MDP M, a target set B ⊆ S, a type set Θ, and a reward function R_θ for each θ ∈ Θ. We then show that, on the constructed model, the total cost of an incentive sequence that solves the BMP is n or less if and only if the QBF is true.

We first construct the MDP M. A graphical illustration of the constructed MDP is given in Fig. 3. The set S of states consists of 6n + 2 states: the states A_i, A'_i, T_i, T'_i, F_i, F'_i for each variable x_i where i ∈ [n], and two additional states A_{n+1}, A'_{n+1}. The initial state s_1 is the state A'_1. The set A of actions consists of 6 actions a_1, a_2, ..., a_6. The target set is B = {A_{n+1}}, and the type set is Θ = {θ_1, θ_2, ..., θ_m}.

Transition function P: We first define the transitions from the states T_i, F_i, T'_i, F'_i. From T_i, under the only available action a_5, the agent makes a deterministic transition to A_{i+1}. From F_i, under the only available action a_5, the agent makes a deterministic transition to A_{i+1}. From T'_i, under the available actions a_4 and a_5, the agent deterministically transitions, respectively, to the states A'_{i+1} and A_{i+1}. Similarly, from F'_i, under the available actions a_4 and a_5, the agent deterministically transitions, respectively, to the states A'_{i+1} and A_{i+1}.

Next, we define the transitions from the states A_i, A'_i. From A_i, if i ∈ [n] is even, the agent transitions to the states T_i and F_i with equal probability under the only available action a_6; if i ∈ [n] is odd, the agent deterministically transitions to the states A_i, T_i, and F_i under the actions a_1, a_2, and a_3, respectively. Similarly, from A'_i, if i ∈ [n] is even, the agent transitions to the states T'_i and F'_i with equal probability under the only available action a_6; if i ∈ [n] is odd, the agent deterministically transitions to the states A'_i, T'_i, and F'_i under the actions a_1, a_2, and a_3, respectively. Finally, from the states A_{n+1} and A'_{n+1}, the agent makes a self transition under the action a_1.
Reward functions: The reward function R_θk, where k ∈ [m], is defined as follows. For s ∈ {A_i, A'_i : i ∈ [n] is odd}, i.e., the green states in Fig. 3, we have R_θk(s, a_1) = 0 and R_θk(s, a_2) = R_θk(s, a_3) = −1. That is, the agent stays in the same state in the absence of incentives, and to move the agent toward the target state, the principal needs to offer an incentive of at least 1 + ε for a_2 or a_3, where ε > 0 is an arbitrarily small constant.

We now define the reward function for the states T'_i, F'_i. This is the most important step in our construction since it relates the agent type θ_k to the clause C_k. If the variable x_i appears positively in C_k, for s ∈ {T'_i : i ∈ [n]}, we have R_θk(s, a_5) = n + 1 and R_θk(s, a_4) = 0, and for s ∈ {F'_i : i ∈ [n]}, we have R_θk(s, a_5) = 0 and R_θk(s, a_4) = n + 1. On the other hand, if the variable x_i appears negatively in C_k, for s ∈ {T'_i : i ∈ [n]}, we have R_θk(s, a_5) = 0 and R_θk(s, a_4) = n + 1, and for s ∈ {F'_i : i ∈ [n]}, we have R_θk(s, a_5) = n + 1 and R_θk(s, a_4) = 0. Finally, if the variable x_i does not appear in C_k, for all s ∈ {T'_i, F'_i : i ∈ [n]}, we have R_θk(s, a_5) = 0 and R_θk(s, a_4) = n + 1. For all other states and actions, we have R_θk(s, a) = 0.

(A variable x_i is said to appear positively in a disjunction clause C_k if C_k is true when x_i is true; it is said to appear negatively in C_k if C_k is true when x_i is false; and it is said to not appear in C_k if the truth value of C_k is independent of the truth value of x_i. For example, in C_1 = x_1 ∨ ¬x_2, the variable x_1 appears positively, x_2 appears negatively, and x_3 does not appear.)

We now show that an incentive sequence that leads the agent to the target state A_{n+1} with probability 1 at a worst-case total cost of n or less exists if and only if the QBF F is true.

Suppose first that the QBF F is true. Then, there is a truth value assignment, i.e., true or false, for each existential variable {x_i : i ∈ [n] is odd} such that, under this assignment, all clauses C_k where k ∈ [m] are true. Then, on the constructed MDP, from the states {A'_i : i ∈ [n] is odd}, i.e., the green states in Fig. 3, we can simply offer the following stationary incentives to the agent: if x_i is true, γ(A'_i, a_2) = 2; if x_i is false, γ(A'_i, a_3) = 2. For all other state-action pairs (t, a), we offer γ(t, a) = 0. Since the formula F is true, all the clauses C_k must be true. Then, under the provided incentives, each type θ_k must eventually transition to a state A_i with probability 1. Since, from any state A_i, the agent reaches the target state A_{n+1} with probability 1, the reachability constraint is satisfied. Moreover, the principal pays the incentive γ(A'_i, a) = 2 exactly n/2 times; hence, the total cost to the principal is n.

Suppose now that there exists an incentive sequence under which the agent reaches the state A_{n+1} with probability 1 at a worst-case total cost of n or less. Then, the optimal policy of the agent under such an incentive sequence differs from its optimal policy in the absence of incentives only in the states {A'_i : i ∈ [n] is odd}.
The previous claim is true because changing the optimal policy of the agent in the states T'_i, F'_i would require the principal to offer an incentive amount of at least n + 1. Note that all the agent types θ_k reach the target state with probability 1 under the provided incentive sequence. Then, it must be true that the principal incentivizes the actions a_2 and/or a_3 in the states {A'_i : i ∈ [n] is odd} such that the agent eventually reaches a state A_i with probability 1 regardless of its type. We set the existential variable x_i to true if the principal incentivizes a_2, and to false if the principal incentivizes only a_3. Recall that each clause C_k corresponds to an agent type θ_k, and the transition of an agent type θ_k to a state A_i with probability 1 implies that the clause C_k becomes true. Consequently, since the agent reaches the state A_{n+1} with probability 1 regardless of its type, the formula F evaluates to true under the described truth value assignment. □

Fig. 3: An illustration of the MDP constructed to show that the BMP is PSPACE-hard. Solid lines represent deterministic transitions; dashed lines represent transitions with equal probability. The states A'_i, A_i represent the variables x_i in the Boolean formula F(x_1, x_2, ..., x_n). An existential variable x_i is set to true (false) when the action a_2 (a_3) is incentivized from the states A'_i, A_i. A clause C_k becomes true when the agent type θ_k transitions to a state A_i with probability 1 under the incentive offers. The formula F becomes true when the agent transitions to a state A_i with probability 1 regardless of its type.

Definition 9: (Euclidean path-TSP) [25] Given a set [N] of cities, distances c_{i,j} ∈ N between each pair of cities (i, j) such that c_{i,j} = c_{j,i} and c_{i,j} + c_{j,k} ≥ c_{i,k} for all i, j, k ∈ [N], and a constant K ∈ N, decide whether there exists a path from city 1 to city N that visits all cities exactly once and whose total traversed distance is K or less.

Proof of Theorem 2:
The decision problem is in NP because, for a given stationary incentive sequence, we can compute the occupancy measure [28] of the Markov chain induced by the optimal stationary policy of each agent type in polynomial time via matrix inversion. The reachability probability and the incurred total cost are linear functions of the occupancy measure [8]. Hence, the satisfaction of the reachability constraint and the corresponding total cost incurred by the principal can be verified in polynomial time.

The NP-hardness proof is by a reduction from the Euclidean path-TSP problem. We are given an arbitrary Euclidean path-TSP instance. We first construct an MDP M. The set of states is S = {q_1, q_2, ..., q_N}, the initial state is q_1, and the set of actions is A = {a_1, a_2, ..., a_N}. The transition function P is such that, for states q_i ∈ S \ {q_N}, P_{q_i, a_j, q_j} = 1 for all j ∈ {1, 2, ..., N}, and the state q_N ∈ S is absorbing. The type set is Θ = {θ_1, θ_2, ..., θ_{N−1}}, and the target set is B = {q_N}. The reward function R_θi for the type θ_i ∈ Θ is

  R_θi(q_j, a_k) =  0            if j = k,
                   −c_{j,k}      if j ≠ k and j = i,
                   −c_{j,k}      if j ≠ k, j ≠ i, and k ≠ N,
                   −(K + 1)      if j ≠ k, j ≠ i, and k = N.

We claim that there exists a feasible solution to the constructed N-BMP with objective value K or less if and only if there exists a TSP path with total cost K or less.

Suppose that there exists a feasible solution to the N-BMP with an objective value of K or less. Then, there exists an incentive sequence under which at least one agent type visits all states in S exactly once. Suppose for contradiction that none of the agent types visits the state q_i ∈ S and that the true agent type is θ* = θ_i. Then, to convince the agent to reach the target state q_N, the principal must pay an incentive of (K + 1) for the action a_N from at least one of the states q ∈ S \ {q_i}, because otherwise the agent will never take the action a_N and reach the target state. Therefore, the optimal incentive sequence incurs a total cost of at least (K + 1), which raises a contradiction. Moreover, the agent type θ_i cannot visit the same state twice under a provided stationary incentive sequence, because otherwise the agent's stationary optimal policy violates the reachability constraint. Consequently, the path followed by the agent type θ_i constitutes a solution to the Euclidean path-TSP with total cost K or less.

Conversely, consider a solution to the Euclidean path-TSP problem which visits the states s_1, s_2, ..., s_N in sequence, where s_1 = q_1 and s_N = q_N. The principal can construct a stationary incentive sequence from this solution as follows. For all states s_i where i ∈ [N−1], offer the incentives γ(s_i, a_N) = c_{s_i, s_N} + ε and γ(s_i, a_{s_{i+1}}) = c_{s_i, s_{i+1}} + ε. Under the constructed incentive sequence, each agent type θ_i ∈ Θ transitions to the target state q_N from the state q_i. Moreover, the worst-case total cost to the principal is equal to the cost of the Euclidean path-TSP solution plus Nε, which can be made arbitrarily small. Hence, the claim follows. □
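The reward table used in this reduction is mechanical to generate from a distance matrix; the following sketch does so with a dictionary-based, 1-indexed encoding of our own choosing, purely to make the construction concrete.

```python
def tsp_reduction_rewards(c, K):
    """Build the type-dependent rewards R_{theta_i}(q_j, a_k) used in the
    reduction from Euclidean path-TSP to the N-BMP (all indices 1-based).
    c: symmetric N x N distance matrix (list of lists); K: TSP budget."""
    N = len(c)
    rewards = {}  # (type i, state j, action k) -> reward
    for i in range(1, N):              # types theta_1, ..., theta_{N-1}
        for j in range(1, N + 1):      # states q_1, ..., q_N
            for k in range(1, N + 1):  # actions a_1, ..., a_N
                if j == k:
                    r = 0.0                      # staying put costs nothing
                elif j == i or k != N:
                    r = -float(c[j - 1][k - 1])  # ordinary move: pay the distance
                else:
                    r = -(K + 1.0)               # a_N from a "foreign" state
                rewards[(i, j, k)] = r
    return rewards

# Toy instance: three cities on a line at positions 0, 1, 2 with budget K = 3.
c = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
R = tsp_reduction_rewards(c, K=3)
print(R[(1, 1, 2)])  # type theta_1 moving q_1 -> q_2: -1.0 (the distance)
print(R[(1, 2, 3)])  # type theta_1 taking a_N from q_2: -(K + 1) = -4.0
```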
Definition 10: (Set cover) [34] Given a set S, a collection T = {T_i ⊆ S} of subsets such that ∪_{T_i ∈ T} T_i = S, and a positive integer M, decide whether there exists a subcollection U ⊆ T of size M or less, i.e., |U| ≤ M, that covers S, i.e., ∪_{T_i ∈ U} T_i = S.
Proof of Theorem 3: The decision problem can be shown to be in NP using the same method presented in the proof of Theorem 2. The NP-hardness proof is by a reduction from the set cover problem. We are given an arbitrary set cover instance in which S = {1, 2, ..., N} and T = {T_1, T_2, ..., T_K} such that M ≤ K. To prove the claim, we construct an MDP M, a target set B ⊆ S, a type set Θ, and a state-independent reward function R_θ for each θ ∈ Θ. Then, we show that, on the constructed model, the total cost of an incentive sequence that solves the NS-BMP is M or less if and only if |U| ≤ M.

The MDP M has M + 2 states, i.e., S = {q_1, q_2, ..., q_{M+2}}, and K + 1 actions, i.e., A = {a_0, a_1, ..., a_K}. The initial state is q_1. From the states q_i where i ∈ [M−1], the agent transitions to the state q_{i+1} under the action a_0, and to the state q_{M+1} under all the other actions. From the state q_M, the agent transitions to the state q_{M+2} under the action a_0, and to q_{M+1} under all the other actions. The states q_{M+1} and q_{M+2} are absorbing. We define B = {q_{M+1}} and Θ = {θ_1, θ_2, ..., θ_N}. Finally, the state-independent reward function R_θi is defined as

  R_θi(q, a_j) =  0            if j = 0,
                 −1/2          if i ∈ T_j,
                 −(K + 1)      otherwise.

Note that, in the absence of incentives, the agent reaches the absorbing state q_{M+2} with probability 1 regardless of its type.

Suppose there exists a collection of subsets U ⊆ T of size M or less such that ∪_{T_i ∈ U} T_i = S. Without loss of generality, assume that U = {T_1, T_2, ..., T_L} where L ≤ M. Consider the stationary deterministic incentive sequence

  γ(q_i, a_j) = 1 if i = j, and 0 otherwise.

Under the incentive sequence given above, an agent type θ_i transitions to the state q_{M+1} from a state q_k where k ∈ [M] if i ∈ ∪_{T_j ∈ U} T_j. Since ∪_{T_j ∈ U} T_j = S, all agent types transition to the target state q_{M+1} with probability 1. Moreover, the total cost to the principal is clearly less than or equal to M.

Suppose now that there exists a stationary deterministic incentive sequence that leads the agent to the target state q_{M+1} regardless of its type at a total cost of M or less. Then, the incentive sequence is such that the agent type θ_i where i ∈ T_j transitions to the target state q_{M+1} from a state q_k where k ∈ [M] if the action a_j is incentivized from that state. Since the agent reaches the target state regardless of its type, the collection of subsets T_j that correspond to the incentivized actions a_j constitutes a set cover. Because incentives are provided only from the states q_k where k ∈ [M], the size of the resulting set cover is less than or equal to M. □

Proof of Theorem 4:
For a given incentive sequence γ ∈ Γ_{R,θd}(M), let π* = (d*_1, d*_2, ...) be the optimal policy of the agent type θ_d, and let γ' ∈ Γ(M) be the incentive sequence

  γ'(I_t, a) = γ(I_t, a) if d*_t(s) = a, and 0 otherwise.                 (38)

Note that R_θd(s, a) + γ'(I_t, a) ≥ max_{a' ∈ A(s)} R_θd(s, a') for d*_t(s) = a and γ ∈ Γ_{R,θd}(M). The condition stated in the theorem implies that R_θ(s, a) + γ'(I_t, a) ≥ max_{a' ∈ A(s)} R_θ(s, a') for any θ ∈ Θ. Hence, we have γ' ∈ Γ_{R,Θ}(M), which implies that Γ_{R,Θ}(M) ⊆ Γ_{R,θd}(M). Since Γ_{R,θd}(M) ⊆ Γ_{R,Θ}(M) by definition, we conclude that Γ_{R,Θ}(M) = Γ_{R,θd}(M).

Now, for any given incentive sequence γ ∈ Γ_{R,Θ}(M), let π* = (d*_1, d*_2, ...) be the optimal policy of the agent type θ_d and γ' ∈ Γ(M) be an incentive sequence defined as in (38). If the condition stated in the theorem holds, then we have

  max_{θ ∈ Θ} f(γ, θ) = f(γ, θ_d).

Using the identity Γ_{R,Θ}(M) = Γ_{R,θd}(M), we conclude that

  min_{γ ∈ Γ_{R,Θ}(M)} max_{θ ∈ Θ} f(γ, θ) = min_{γ ∈ Γ_{R,θd}(M)} f(γ, θ_d). □
APPENDIX C

In this appendix, we show how to modify the algorithms presented in Section VIII in order to compute globally and locally optimal solutions to the NS-BMP.

Let {X_{s,a} ∈ {0, 1} : s ∈ S_r, a ∈ A, Σ_{a ∈ A} X_{s,a} = 1} be a set of binary variables. One can obtain a globally optimal solution to the NS-BMP by adding the constraints

  Σ_{a ∈ A} X_{s,a} γ(s, a) = Σ_{a ∈ A} γ(s, a)   for all s ∈ S_r        (39)

to the MILP (23a)-(23n) and solving the resulting optimization problem. Note that the above constraint is not linear in the variables γ(s, a); however, each term X_{s,a} γ(s, a) can be replaced exactly by its corresponding McCormick envelope in order to obtain an MILP formulation for the NS-BMP.

Let {ν_{s,a} ≥ 0 : s ∈ S_r, a ∈ A, Σ_{a ∈ A} ν_{s,a} = 1} be a set of continuous variables. One can obtain a locally optimal solution to the NS-BMP by adding the constraints

  Σ_{a ∈ A} ν_{s,a} γ(s, a) ≥ Σ_{a ∈ A} γ(s, a)   for all s ∈ S_r        (40)

to the NLP formulated in Section VIII-B and solving the resulting optimization problem by resorting to the CCP.

Yagiz Savas joined the Department of Aerospace Engineering at the University of Texas at Austin as a Ph.D. student in Fall 2017. He received his B.S. degree in Mechanical Engineering from Bogazici University in 2017. His research focuses on developing theory and algorithms that guarantee desirable behavior of autonomous systems operating in uncertain, adversarial environments.
Vijay Gupta has been with the Department of Electrical Engineering at the University of Notre Dame since 2008. He received the 2018 Antonio Ruberti Award from the IEEE Control Systems Society, the 2013 Donald P. Eckman Award from the American Automatic Control Council, and a 2009 National Science Foundation (NSF) CAREER Award. His research interests are broadly at the interface of communication, control, distributed computation, and human decision making.