Learnable Strategies for Bilateral Agent Negotiation over Multiple Issues
Pallavi Bagga, Nicola Paoletti, Kostas Stathis
Royal Holloway, University of London, UK
{pallavi.bagga.2017, nicola.paoletti, kostas.stathis}@rhul.ac.uk

Abstract

We present a novel bilateral negotiation model that allows a self-interested agent to learn how to negotiate over multiple issues in the presence of user preference uncertainty. The model relies upon interpretable strategy templates representing the tactics the agent should employ during the negotiation, and learns template parameters to maximize the average utility received over multiple negotiations, thus resulting in optimal bid acceptance and generation. Our model also uses deep reinforcement learning to evaluate threshold utility values, for those tactics that require them, thereby deriving optimal utilities for every environment state. To handle user preference uncertainty, the model relies on a stochastic search to find a user model that best agrees with a given partial preference profile. Multi-objective optimization and multi-criteria decision-making methods are applied at negotiation time to generate Pareto-optimal outcomes, thereby increasing the number of successful (win-win) negotiations. Rigorous experimental evaluations show that the agent employing our model outperforms the winning agents of the 10th Automated Negotiating Agents Competition (ANAC'19) in terms of individual as well as social-welfare utilities.

Introduction
An important problem in automated negotiation is modelling a self-interested agent that is learning to optimally adapt its strategy while bilaterally negotiating against an opponent over multiple issues. In many domains a model of this kind will need to consider the preferences of the user the agent represents in an application. Consider, for instance, bilateral negotiation in e-commerce, where a buyer agent negotiates with a seller agent to buy a product on the user's behalf. Here the buyer has to settle the price of a product specified by a user, around a number of similar issues expressed as user preferences about delivery time, payment methods and delivery location (Fatima et al.).

To address this problem, we develop an interpretable strategy template that guides the use of a series of tactics whose optimal use can be learned during negotiation. The structure of such templates depends upon a number of learnable choice parameters determining which acceptance and bidding tactic to employ at any particular time during negotiation. As these tactics represent hypotheses to be tested, defined by the agent developer, they can be explained to a user, and can in turn depend on learnable parameters. The outcome of our work is an agent model, called ANESIA (Adaptive NEgotiation model for a Self-Interested Autonomous agent), that formulates a strategy template for bid acceptance and generation so that an agent that uses it can make optimal decisions about the choice of tactics while negotiating in different domains. Our specific contribution involves implementing ANESIA as an actor-critic architecture interpreted using Deep Reinforcement Learning (DRL) (Lillicrap et al. 2016). We evaluate our model against the winning agents of ANAC'19, as the theme of this tournament has been bilateral multi-issue negotiations. Our results indicate that our strategy outperforms the winning strategies in terms of individual and joint (or social welfare) utilities.

Related Work
Existing approaches with reinforcement learning have focused on methods such as Tabular Q-learning for bidding (Bakker et al. 2019; see also http://ii.tudelft.nl/nego/node/7). In contrast with these approaches, our agent's strategy is learned based on a strategy template containing different tactics to be employed at different times. Many meta-heuristic optimization algorithms have been acknowledged in the negotiation literature, such as Particle Swarm Optimization for opponent selection (Silva et al. 2018), among others.

The ANESIA Model
We assume that our negotiation environment $E$ consists of two agents negotiating with each other over some domain $D$. A domain $D$ consists of $n$ different issues, $D = (I_1, I_2, \ldots, I_n)$, where each issue can take a finite set of possible values: $I_i = (v_{i1}, \ldots, v_{ik_i})$. An agent's bid $\omega$ is a mapping from each issue to a chosen value (denoted by $c_i$ for the $i$-th issue), i.e. $\omega = (v_{1c_1}, \ldots, v_{nc_n})$. The set of all possible bids or outcomes is called the outcome space and is denoted by $\Omega$, s.t. $\omega \in \Omega$. Before the agents can begin the negotiation and exchange bids, they must agree on a negotiation protocol $P$, which determines the valid moves agents can take at any state of the negotiation (Fatima et al.). Here, the possible moves are $Actions = \{offer(\omega), accept, reject\}$.

Furthermore, we assume that each negotiating agent has its own private preference profile, which describes how bids are preferred over other bids. This profile is given in terms of a utility function $U$, defined as a weighted sum of evaluation functions $e_i(v_{ic_i})$ as shown in (1):

$$U(\omega) = U(v_{1c_1}, \ldots, v_{nc_n}) = \sum_{i=1}^{n} w_i \cdot e_i(v_{ic_i}), \quad \text{where } \sum_{i=1}^{n} w_i = 1. \qquad (1)$$

In (1), each issue is evaluated separately, contributing linearly without depending on the values of other issues, and hence $U$ is referred to as a Linear Additive Utility space. Here, $w_i$ are the normalized weights indicating the importance of each issue to the user, and $e_i(v_{ic_i})$ is an evaluation function that maps the value $v_{ic_i}$ of the $i$-th issue to a utility.

Table 1: Agent's state attributes.

$t$: current negotiation time
$|\Omega|$: total number of possible bids
$n$: total number of issues
$B$: given number of bids in the partial ordering due to user preference uncertainty
$O_{best}$: utility of the best opponent bid so far
$O_{avg}$: average of the utilities of all the bids received from the opponent agent
$O_{sd}$: standard deviation of the utilities of all the bids received from the opponent agent

In our setting, we assume that $U$ is unknown and our agent is given incomplete information in terms of partial preferences, i.e. a randomly generated partial ordering $\preceq$ over bids (w.r.t. $U$) s.t. $\omega_1 \preceq \omega_2 \rightarrow U(\omega_1) \le U(\omega_2)$. Hence, during the negotiation, one of the objectives of our agent is to derive an estimate $\widehat{U}$ of the real utility function $U$ from the given partial preferences.
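To make the utility definition in (1) concrete, here is a minimal Python sketch of a linear additive utility over a toy two-issue domain; the issue names, values and weights are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the linear additive utility in (1).
# Issues, values and weights below are illustrative, not from the paper.

# Evaluation functions e_i: map each issue value to a utility in [0, 1].
evaluations = {
    "price":    {"low": 1.0, "medium": 0.6, "high": 0.2},
    "delivery": {"1 day": 1.0, "3 days": 0.7, "7 days": 0.3},
}

# Normalized importance weights w_i (they sum to 1).
weights = {"price": 0.7, "delivery": 0.3}

def utility(bid: dict) -> float:
    """U(omega) = sum_i w_i * e_i(v_{i c_i}) for a bid omega."""
    return sum(weights[i] * evaluations[i][v] for i, v in bid.items())

print(utility({"price": "low", "delivery": "3 days"}))  # 0.7*1.0 + 0.3*0.7 = 0.91
```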
ANESIA Components
Our proposed agent negotiation model (shown in Figure 1) supports learning during bilateral negotiations with unknown opponents under user preference uncertainty.
Physical Capabilities:
These are the sensors and actuators of the agent that enable it to access a negotiation environment $E$. More specifically, they allow our agent to perceive the current (external) state $S_t$ of the environment $E$ and represent that state locally in the form of internal attributes, as shown in Table 1. Some of these attributes ($|\Omega|$, $n$, $B$) are stored locally in its Knowledge Base, and some of them ($t$, $O_{best}$, $O_{avg}$, $O_{sd}$) are derived from the sequence of previous bids $\Omega_t^o$ offered by the opponent, which the agent perceives using its sensors while interacting with the opponent agent during the negotiation. At any time $t$, the internal agent representation of the environment is $s_t$, which is used by the agent (among acceptance and bidding strategies) to decide what action $a_t$ to execute using its actuators. Action execution then changes the state of the environment to $S_{t+1}$.
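As an illustration, the internal state $s_t$ could be held in a simple record whose opponent-related fields are derived from the observed bid history $\Omega_t^o$; this is a sketch under our own naming, not the paper's implementation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class AgentState:
    t: float        # current negotiation time, normalized in [0, 1]
    n_bids: int     # |Omega|, total number of possible bids
    n_issues: int   # n, number of issues
    b: int          # B, number of bids in the given partial ordering
    o_best: float   # utility of the best opponent bid so far
    o_avg: float    # mean utility of all received opponent bids
    o_sd: float     # standard deviation of those utilities

def make_state(t, n_bids, n_issues, b, opp_bid_utils):
    """Derive O_best, O_avg, O_sd from the estimated utilities of the
    opponent's bids so far (assumes at least one bid was received)."""
    return AgentState(t, n_bids, n_issues, b,
                      max(opp_bid_utils), mean(opp_bid_utils),
                      pstdev(opp_bid_utils))
```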
Learning Capabilities: This component consists of the following sub-components: Negotiation Experience, Decide and Evaluate. The Decide component is further sub-divided into an Acceptance strategy and a Bidding strategy. These sub-components need information from two other components, called User modeling and Opponent modeling, which help the agent negotiate given incomplete information about user and opponent preferences, by estimating the user and opponent models $\widehat{U}$ and $\widehat{U}_o$, respectively. $\widehat{U}$ is estimated using the given partially ordered preferences of the user about the bids. It is estimated only once by the agent, before the start of the negotiation, in order to encourage autonomous behaviour of the agent and avoid user elicitation. On the other hand, $\widehat{U}_o$ is estimated using information from $\Omega_t^o$. The set of opponent bids $\Omega_t^o$ is collected only until half of the negotiation period, as the opponent agent is more likely to change its initial strategy afterwards, in order to either reach an agreement or learn more about the other agent's preferences. The decoupled structure of Decide in the form of acceptance and bidding strategies is inspired by a well-known negotiation architecture known as BOA (Baarslag et al. 2014).
Figure 1: The architecture of ANESIA.

Negotiation Experience stores historical information about previous negotiation experiences, which involve the interactions of an agent with other agents. Experience elements are of the form $\langle s_t, a_t, r_t, s_{t+1} \rangle$, where $s_t$ is the internal state of the negotiation environment $E$, $a_t$ is an action performed by the agent at $s_t$, $r_t$ is a scalar reward received from the environment, and $s_{t+1}$ is the new internal state after executing $a_t$.

Decide refers to a negotiation strategy which helps an agent choose an optimal action $a_t$ among a set of $Actions$ at a particular state $s_t$, based on the negotiation protocol $P$. In particular, it consists of two functions, $f_a$ and $f_b$, for the acceptance and bidding strategy, respectively. Function $f_a$ takes as inputs the agent's state $s_t$, a dynamic threshold utility $\bar{u}_t$ (which we define next), and the sequence of past opponent bids, and returns a discrete action among $accept$ and $reject$. When $f_a$ decides $reject$, $f_b$ is used to compute the next bid to be proposed to the opponent, given in input $s_t$ and $\bar{u}_t$; see (2)-(3):

$$f_a(s_t, \bar{u}_t, \Omega_t^o) = a_t, \quad a_t \in \{accept, reject\} \qquad (2)$$
$$f_b(s_t, \bar{u}_t, \Omega_t^o) = a_t, \quad a_t \in \{offer(\omega), \omega \in \Omega\} \qquad (3)$$

Evaluate refers to a critic which helps our agent learn the dynamic threshold utility $\bar{u}_t$ and evolve the negotiation strategy for unknown negotiation scenarios. More specifically, it is a function of $K$ random ($K < N$) past negotiation experiences fetched from the database. The process of learning $\bar{u}_t$ is retrospective, since it depends on the reward $r_t$ obtained from the negotiation environment by performing action $a_t$ at state $s_t$. The value of the reward depends on the (estimated) discounted utility of the last bid received from the opponent, $\omega_t^o$, or of the bid accepted by either party, $\omega_{acc}$, and is defined as follows:

$$r_t = \begin{cases} \widehat{U}(\omega_{acc}, t), & \text{on agreement} \\ \widehat{U}(\omega_t^o, t), & \text{on received offer} \\ -1, & \text{otherwise}, \end{cases} \qquad (4)$$

where $\widehat{U}(\omega, t)$ is the discounted reward of $\omega$, defined as

$$\widehat{U}(\omega, t) = \widehat{U}(\omega) \cdot d^t, \quad d \in [0, 1], \qquad (5)$$

where $d$ is a temporal discount factor included to encourage the agent to negotiate without delay. We stress that our design of the reward function accelerates agent learning by allowing the agent to receive rewards after every action it performs in the environment, instead of only at the end of the negotiation.
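A minimal sketch of the reward computation in (4)-(5); the event encoding and the default discount value are our own illustrative assumptions.

```python
def discounted(u_hat_of_bid: float, t: float, d: float = 0.95) -> float:
    """Eq. (5): time-discounted estimated utility, with d in [0, 1]."""
    return u_hat_of_bid * d ** t

def reward(event: str, u_hat_of_bid: float, t: float) -> float:
    """Eq. (4): a reward after every action, so learning does not have
    to wait until the end of the negotiation."""
    if event == "agreement":        # bid omega_acc accepted by either party
        return discounted(u_hat_of_bid, t)
    if event == "received_offer":   # opponent proposed bid omega_t^o
        return discounted(u_hat_of_bid, t)
    return -1.0                     # penalty branch of (4)
```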
Strategy templates:
One common way to define the acceptance and bidding strategies $f_a$ and $f_b$ is via a combination of hand-crafted tactics that, by empirical evidence or domain knowledge, are known to work effectively. However, a fixed set of tactics might not adapt well to multiple different negotiation domains. In our model, we do not assume pre-defined strategies for $f_a$ and $f_b$; instead, our agent learns these strategies offline. To do so, we assume that our agent learns the strategy by negotiating with different opponents bilaterally and with full knowledge of the true preferences of the user it represents, so that the strategies can be derived by optimizing the true utility over multiple negotiations.

To enable strategy learning, we introduce the notion of strategy templates, i.e., strategies consisting of a series of tactics, where each tactic is executed for a specific phase of the negotiation. The parameters describing the start and duration of each phase, as well as the choice of the particular tactic for that phase, are all learnable. Moreover, some tactics might expose learnable parameters too. We assume a library of acceptance and bidding tactics, $T_a$ and $T_b$. Each $t_a \in T_a$ maps the agent state, threshold utility, opponent bid history, and a (possibly empty) vector of learnable parameters $\mathbf{p}$ into a utility value $u$, i.e., $t_a(s_t, \bar{u}_t, \Omega_t^o, \mathbf{p}) = u$, where $u$ represents the minimum utility required by the agent to accept the offer. Each $t_b \in T_b$ is of the form $t_b(s_t, \bar{u}_t, \Omega_t^o, \mathbf{p}) = \omega$, where $\omega \in \Omega$ is the bid returned by the tactic. Given a library of acceptance tactics $T_a$, an acceptance strategy template is a parametric function defined by

$$\bigwedge_{i=1}^{n_a} t \in [t_i, t_{i+1}) \rightarrow \bigwedge_{j=1}^{n_i} c_{i,j} \rightarrow \widehat{U}(\omega_t^o) \ge t_{i,j}(s_t, \bar{u}_t, \Omega_t^o, \mathbf{p}_{i,j}) \qquad (6)$$

where $n_a$ is the number of tactics used, $n_i$ is the number of options for the $i$-th tactic, $t_1 = 0$, $t_{n_a+1} = t_{end}$, $t_{i+1} = t_i + \delta_i$, $t_{i,j} \in T_a$, and $\delta_i$, $c_{i,j}$, and $\mathbf{p}_{i,j}$ are the parameters to learn, for $i = 1, \ldots, n_a$ and $j = 1, \ldots, n_i$. In other words, the $\delta_i$ parameters determine for how long the $i$-th tactic is applied, and the $c_{i,j}$ are choice parameters determining which particular tactic from $T_a$ to use. We note that (6) is a predicate, i.e., it returns a Boolean, indicating whether the opponent bid $\omega_t^o$ is accepted. Similarly, given a library of bidding tactics $T_b$, a bidding strategy template is defined by

$$\bigwedge_{i=1}^{n_b} t \in [t_i, t_{i+1}) \rightarrow \begin{cases} t_{i,1}(s_t, \bar{u}_t, \Omega_t^o, \mathbf{p}_{i,1}) & \text{if } c_{i,1} \\ \quad \vdots & \\ t_{i,n_i-1}(s_t, \bar{u}_t, \Omega_t^o, \mathbf{p}_{i,n_i-1}) & \text{if } c_{i,n_i-1} \\ t_{i,n_i}(s_t, \bar{u}_t, \Omega_t^o, \mathbf{p}_{i,n_i}) & \text{o/w} \end{cases} \qquad (7)$$

where $n_b$ is the number of tactics, $n_i$ is the number of options for the $i$-th tactic, $t_{i,j} \in T_b$, and $\delta_i$, $c_{i,j}$, and $\mathbf{p}_{i,j}$ are as above. The particular libraries of tactics used in this work are discussed in the next section.
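The sketch below illustrates how an acceptance template in the spirit of (6) can be instantiated in Python once the learnable quantities are fixed: `delta` holds the phase durations $\delta_i$, `choice` the tactic selections $c_{i,j}$ (reduced here to one index per phase), and `params` the tactic parameters $\mathbf{p}_{i,j}$. All names are ours and the tactic library is left abstract.

```python
from typing import Callable, Sequence

# A tactic maps (state, threshold, opponent-bid history, params) to the
# minimum utility required for acceptance, as in the text.
Tactic = Callable[[object, float, list, Sequence[float]], float]

def make_acceptance_strategy(tactics: Sequence[Sequence[Tactic]],
                             delta: Sequence[float],   # learned phase durations
                             choice: Sequence[int],    # learned tactic choice per phase
                             params: Sequence[Sequence[float]]):
    """Template (6): phase i covers [t_i, t_i + delta_i); the opponent bid
    is accepted iff its estimated utility clears the chosen tactic's threshold."""
    starts = [0.0]
    for d in delta[:-1]:
        starts.append(starts[-1] + d)

    def accept(u_hat_opp_bid, state, u_bar, opp_bids, t):
        for i, t_i in enumerate(starts):
            if t_i <= t < t_i + delta[i]:
                tactic = tactics[i][choice[i]]
                return u_hat_opp_bid >= tactic(state, u_bar, opp_bids, params[i])
        return False  # past t_end: no phase applies
    return accept
```

During offline learning, these three vectors are exactly what the CSO search described below optimizes against the average true utility.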
Methods

In this section, we describe the methods used for user and opponent modelling, for learning the dynamic utility threshold, and for deriving optimal acceptance and bidding strategies out of our strategy templates.
User modeling:
To estimate the user model $\widehat{U}$ from the given partial bid order $\preceq$, our agent uses Cuckoo Search optimization (CSO) (Yang and Deb 2009), a meta-heuristic inspired by the brood parasitism of cuckoo birds. As a metaphor, a cuckoo is an agent in search of its best user model $\widehat{U}$ (or nest, or solution). In brief, in CSO a set of candidate solutions (user models) is evolved, and at each iteration the worst-performing $p$ solutions are abandoned and replaced with new solutions generated by Lévy flight. In our case, the fitness of a candidate solution $\widehat{U}'$ is defined as the Spearman's rank correlation coefficient $\rho$ between the estimated ranking of $\widehat{U}'$ and the real, but partial, ranking of bids given as input to the agent. The coefficient $\rho \in [-1, 1]$ is indeed a measure of the similarity between two rankings, assigning a value of $1$ for identical rankings and $-1$ for opposed rankings.
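For illustration, the CSO fitness of a candidate model can be computed with SciPy's `spearmanr`; the helper below assumes the given partial ordering is available as a list of bids sorted from least to most preferred.

```python
from scipy.stats import spearmanr

def fitness(candidate_utility, ranked_bids):
    """Fitness of a candidate user model U': Spearman's rho between the
    ranking induced by U' and the given partial ranking of bids.
    `ranked_bids` is ordered from least to most preferred;
    rho = 1 means the candidate reproduces that ordering exactly."""
    given_ranks = list(range(len(ranked_bids)))                # 0..B-1
    estimated = [candidate_utility(bid) for bid in ranked_bids]
    rho, _ = spearmanr(given_ranks, estimated)
    return rho
```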
Opponent modeling: For the estimation of opponent preferences during the negotiation, we have used the distribution-based frequency model proposed in (Tunalı et al. 2017). The frequency of the issue values appearing in $\Omega_t^o$ provides an educated guess of the issue values most preferred by the opponent. On the other hand, the issue weights are estimated by analyzing disjoint windows of the opponent bidding history, which gives an idea of whether the opponent shifts from its previous negotiation strategy as time passes.
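A simplified sketch of a distribution-based frequency model in this spirit; the actual update rules of (Tunalı et al. 2017) are more refined, and the count and weight heuristics here are our own simplification.

```python
from collections import Counter

def frequency_model(opp_bids, issues):
    """Estimate opponent value preferences and issue weights from the
    observed bid history. opp_bids: list of dicts, issue -> offered value."""
    value_utils, weights = {}, {}
    for issue in issues:
        counts = Counter(bid[issue] for bid in opp_bids)
        top = counts.most_common(1)[0][1]
        # Most frequently offered value is assumed to be the most preferred.
        value_utils[issue] = {v: c / top for v, c in counts.items()}
        # An issue whose value rarely changes is assumed to be important.
        weights[issue] = top / len(opp_bids)
    total = sum(weights.values())
    weights = {i: w / total for i, w in weights.items()}

    def u_opp(bid):  # estimated opponent utility of a bid
        return sum(weights[i] * value_utils[i].get(bid[i], 0.0) for i in issues)
    return u_opp
```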
We use an actor-critic architecture with model-free deep reinforcement learning, i.e. Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al. 2016), to learn the dynamic threshold utility $\bar{u}_t$. Thus, $\bar{u}_t$ is expressed as a deep neural network function whose input is the agent state $s_t$ (see Table 1 for the list of features). Prior to reinforcement learning, our agent's strategy is pre-trained with supervision from synthetic negotiation data. To collect supervision data, we use a simulation environment called GENIUS (Lin et al. 2014), where our agent negotiates against other strategies (AgentGP, Gravity, HardDealer, Kagent, Kakesoba, SAGA, winkyagent, SACRA, FSEGA2019) in three different domains for varied user profiles, assuming no user preference uncertainty. This initial supervised learning (SL) stage helps our agent decrease the exploration time required for DRL during the negotiation, an idea primarily influenced by the work of Bagga et al. (2020).
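As an illustration, the threshold network can be a small PyTorch actor mapping the seven state features of Table 1 to $\bar{u}_t \in [0, 1]$; the layer sizes are our choice, and the DDPG critic, target networks and replay buffer are omitted.

```python
import torch
import torch.nn as nn

class ThresholdActor(nn.Module):
    """Deterministic policy: state s_t (Table 1 features) -> threshold u_bar_t."""
    def __init__(self, state_dim: int = 7, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # threshold in [0, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: u_bar_t for one (unnormalized, illustrative) state vector
# [t, |Omega|, n, B, O_best, O_avg, O_sd]; in practice features would
# typically be normalized before being fed to the network.
actor = ThresholdActor()
u_bar = actor(torch.tensor([[0.4, 3072., 6., 10., 0.8, 0.55, 0.12]]))
```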
Strategy learning: The parameters of the acceptance and bidding strategy templates (see (6)-(7)) are learned by running the CSO meta-heuristic, initializing the values of the template parameters based on an educated guess. We define the fitness of a particular choice of template parameters as the average true utility over multiple rounds of negotiations under the concrete strategy implied by those parameters, obtained by running our agent on the GENIUS platform against three different opponents from ANAC'19 (AgentGG, KakeSoba and SAGA; see http://web.tuat.ac.jp/~katfuji/ANAC2019/) and three different negotiation domains (Laptop, Holiday and Party, all readily available in GENIUS).

We now describe the libraries of acceptance and bidding tactics we draw from in our templates. As acceptance tactics, we consider:

- $\widehat{U}(\omega_t)$, i.e., the estimated utility of the bid that our agent would propose at time $t$ ($\omega_t = f_b(s_t, \bar{u}_t, \Omega_t^o)$).
- $Q_{\widehat{U}(\Omega_t^o)}(a \cdot t + b)$, where $\widehat{U}(\Omega_t^o)$ is the distribution of (estimated) utility values of the bids in $\Omega_t^o$, $Q_{\widehat{U}(\Omega_t^o)}(p)$ is the quantile function of that distribution, and $a$ and $b$ are learnable parameters. In other words, we consider the $p$-th best utility received from the opponent, where $p$ is a learnable function of the negotiation time $t$.
- The dynamic DRL-based utility threshold $\bar{u}_t$.
- A fixed utility threshold $\bar{u}$.

The bidding tactics in our library are:

- $b_{Boulware}$, a bid generated by a time-dependent Boulware strategy (Fatima et al.).
- $PS(a \cdot t + b)$, which extracts a bid from the set of Pareto-optimal bids $PS$, derived (using the NSGA-II algorithm) under the estimated user and opponent utility models. In particular, it selects the bid that assigns a weight of $a \cdot t + b$ to the ego agent's utility (and $1 - (a \cdot t + b)$ to the opponent's), where $a$ and $b$ are learnable parameters telling how this weight scales with the negotiation time. The TOPSIS algorithm is used to derive such a bid, given the weighting $a \cdot t + b$ as input.
- $b_{opp}(\omega_t^o)$, a bid generated by randomly changing, in a greedy way, the value of the least relevant issue (w.r.t. $\widehat{U}$) in the last received opponent bid $\omega_t^o$.
- $\omega \sim U(\Omega_{\ge \bar{u}(t)})$, a random bid above our DRL-based utility threshold $\bar{u}_t$, where $U(S)$ is the uniform distribution with support $S$, and $\Omega_{\ge \bar{u}(t)}$ is the subset of $\Omega$ whose bids have estimated utility above $\bar{u}(t)$ w.r.t. $\widehat{U}$.

Below is the form of a concrete acceptance strategy learned in our experiments, with learned phase boundaries $0 < t_2 < t_3 < t_4 < 1$ and learned quantile coefficients $a_i$, $b_i$; it tends to favor the time-dependent quantile tactic during the middle of the negotiation, and the DRL utility threshold during the initial and final stages:

$t \in [0, t_2) \rightarrow \widehat{U}(\omega_t^o) \ge \bar{u}_t \wedge \widehat{U}(\omega_t^o) \ge \bar{u}$
$t \in [t_2, t_3) \rightarrow \widehat{U}(\omega_t^o) \ge Q_{\widehat{U}(\Omega_t^o)}(a_1 \cdot t + b_1)$
$t \in [t_3, t_4) \rightarrow \widehat{U}(\omega_t^o) \ge Q_{\widehat{U}(\Omega_t^o)}(a_2 \cdot t + b_2)$
$t \in [t_4, 1] \rightarrow \widehat{U}(\omega_t^o) \ge \widehat{U}(\omega_t) \wedge \widehat{U}(\omega_t^o) \ge \bar{u}_t$
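As an example of how a learnable tactic is evaluated, the time-dependent quantile tactic used above can be sketched with NumPy, with `a` and `b` the learnable parameters:

```python
import numpy as np

def quantile_tactic(u_hat, opp_bids, t, a, b):
    """Acceptance threshold Q_{U_hat(Omega_t^o)}(a*t + b): the (a*t + b)-quantile
    of the estimated utilities of the opponent's bids received so far.
    Assumes at least one opponent bid has been received."""
    utilities = [u_hat(bid) for bid in opp_bids]
    p = float(np.clip(a * t + b, 0.0, 1.0))  # keep the quantile level valid
    return float(np.quantile(utilities, p))
```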
Experimental Results

All the experimental simulations are performed on the simulation environment GENIUS (Lin et al. 2014). We evaluate the following hypotheses:

- Hypothesis A: A stochastic search allows the derivation of accurate user models under user preference uncertainty.
- Hypothesis B: The set of estimated Pareto-optimal bids obtained using NSGA-II under uncertainty is close to the true Pareto-optimal solution.
- Hypothesis C: Under non-optimized acceptance and bidding strategies, DRL of the utility threshold yields performance superior to other negotiation strategies in different domains.
- Hypothesis D: Learning optimal acceptance and bidding tactics yields performance superior to other negotiation strategies and to non-optimized strategies.
- Hypothesis E: ANESIA agents effectively adapt to unseen negotiation settings.
Performance metrics:
Inspired by the ANAC'19 competition, for our experiments we use the following widely adopted metrics:

- Average individual utility rate ($U_{ind}$): sum of all the utilities of an agent, averaged over the successful negotiations (ideal value: high, 1.0);
- Average social welfare utility ($U_{soc}$): sum of all the utilities gained by both negotiating agents, averaged over the successful negotiations (ideal value: high, 2.0);
- Average number of negotiation rounds ($R_{avg}$): total number of negotiation rounds until agreement is reached, averaged over the successful negotiations (ideal value: low, 1).

Figure 2: User modeling results for 3 different domains.

Figure 3: Estimated Pareto frontier using NSGA-II, based on true and estimated user models.

Experimental settings:
We consider three domains (Laptop, Holiday and Party) already used in GENIUS for ANAC, with three different domain sizes: low: Laptop domain ($|\Omega| = 27$), medium: Holiday domain ($|\Omega| = 1024$) and high: Party domain ($|\Omega| = 3072$), with their default settings of reservation price and discount factor. During the negotiation, we assume the deadline for each negotiation to be 60 seconds, normalized in [0, 1]. For each setting, each agent plays both sides of the negotiation (i.e. 2 user profiles in each setting). A user profile is a role an agent plays during negotiation, with its associated preferences. We assume only incomplete information about user preferences, given in the form of $B$ randomly-chosen partially-ordered bids. For CSO (in hypotheses A and D), we use the same population size and number of generations for both user model estimation and the learning of strategy template parameters. For NSGA-II (in hypothesis B), we set the population size to 100, the number of generations to 25 and the mutation rate to 0.1. Tuning the hyper-parameters of CSO and NSGA-II is critical, as we do not want our agent to exceed the timeout of 1000 seconds given during each turn for deciding an action.

Empirical Evaluation

Hypothesis A: User Modeling The results in Figure 2 show the average Spearman correlation coefficient ($\rho$) values (on the Y-axis) taken during 10 simulations for each user profile in every negotiation setting, plotted against the ratio $B/|\Omega|$ (on the X-axis) of the given number of partial bids over the total number of possible bids. Dashed lines indicate the $\rho$ value w.r.t. the true (unknown) ranking of bids, solid lines w.r.t. the partial (given) ranking (i.e., the CSO fitness function). We observe that the true $\rho$ value grows with the ratio $B/|\Omega|$, attaining relatively high values even when, as in the Party domain, only a small fraction of the bids is made available to the agent. This demonstrates that our agent can uncover accurate user models also under high uncertainty.

Table 2: Performance comparison of ANESIA vs. AgentGG vs. KakeSoba vs. SAGA (without strategy template). Each cell gives values for ($B = 10$, $B = 20$).

Laptop domain:
  $U_{ind}$: ANESIA (0.87, 0.83); AgentGG (0.75, 0.56); KakeSoba (0.72, 0.61); SAGA (0.72, 0.63)
  $U_{soc}$: ANESIA (1.66, 1.67); AgentGG (1.39, 1.08); KakeSoba (1.53, 1.11); SAGA (1.51, 1.38)
  $R_{avg}$: ANESIA (207.56, 29.46); AgentGG (1651.60, 5450.23); KakeSoba (1370.0, 5877.86); SAGA (1045.0, 5004.46)

Holiday domain:
  $U_{ind}$: ANESIA (0.86, 0.88); AgentGG (0.81, 0.85); KakeSoba (0.84, 0.79); SAGA (0.77, 0.73)
  $U_{soc}$: ANESIA (1.65, 1.70)
  $R_{avg}$: ANESIA (40.30, 43.16); AgentGG (923.85, 417.68); KakeSoba (880.3, 296.69); SAGA (421.55, 470.44)

Party domain:
  $U_{ind}$: ANESIA (0.81, 0.94); AgentGG (0.72, 0.78); KakeSoba (0.69, 0.71); SAGA (0.55, 0.52)
  $U_{soc}$: ANESIA (1.40, 1.48); AgentGG (1.36, 1.47); KakeSoba (1.37, 1.42); SAGA (1.28, 1.23)
  $R_{avg}$: ANESIA (109.71, 42.71); AgentGG (938.54, 1432.87); KakeSoba (774.08, 407.41); SAGA (319.69, 202.96)

Hypothesis B: Pareto-Optimality
Figure 3 shows three different plots using true and estimated preference profiles. We can clearly see that there is not much distance between the frontier obtained with the estimated user and opponent models and that obtained with the true models. This evidences the potential of NSGA-II for generating the Pareto-optimal bids, as well as the closeness of the estimated utility models to the true utility models. Due to space limitations, we show the results for only one domain (Party) under only a single negotiation setting.
Hypothesis C: Impact of Dynamic Threshold Utility
We tested an ANESIA agent in a GENIUS tournament setting against AgentGG, KakeSoba and SAGA for a total of 120 sessions in 3 different domains (Laptop, Holiday and Party), where each agent negotiates with every other agent. Table 2 compares their performance. We choose two different user profiles with two different preference uncertainties ($B \in \{10, 20\}$) in each domain. According to our results, our agent employing the ANESIA model outperforms the other strategies in terms of $U_{ind}$, $U_{soc}$ and $R_{avg}$, which validates the hypothesis. During the experiments, we also observed that our agent becomes picky and learns to focus on getting the maximum utility from the final agreement (by accepting or proposing a bid from/to the opponent only if a certain dynamic, i.e. learned, threshold utility is met), and hence the rate of successful negotiations is low. However, the proportion of successful negotiations can be accommodated in the reward function to bias our learning to optimize this metric.

Table 3: Performance comparison of ANESIA* vs. ANESIA vs. AgentGG vs. KakeSoba vs. SAGA (with strategy template). Each cell gives values for ($B = 10$, $B = 20$).

Laptop domain:
  $U_{ind}$: ANESIA* (0.87, 0.87); ANESIA (0.73, 0.68); AgentGG (0.64, 0.53); KakeSoba (0.51, 0.73); SAGA (0.73, 0.68)
  $U_{soc}$: ANESIA* (1.66, 1.60); ANESIA (1.52, 1.57); AgentGG (1.22, 1.05); KakeSoba (1.58, 1.28); SAGA (1.52, 1.57)
  $R_{avg}$: ANESIA* (147.03, 173.23); ANESIA (279.80, 181.28); AgentGG (4251.46, 1999.16); KakeSoba (2159.54, 1115.70); SAGA (1865.01, 2794.10)

Holiday domain:
  $U_{ind}$: ANESIA* (0.86, 0.79); ANESIA (0.85, 0.87); AgentGG (0.88, 0.87); KakeSoba (0.79, 0.76); SAGA (0.78, 0.70)
  $U_{soc}$: ANESIA* (–, 1.59); ANESIA (1.68, –); AgentGG (1.56, 1.53); KakeSoba (1.66, 1.58); SAGA (1.38, 1.28)
  $R_{avg}$: ANESIA* (84.38, 74.21); ANESIA (168.79, 278.63); AgentGG (1486.72, 569.64); KakeSoba (745.59, 405.78); SAGA (725.81, 367.09)

Party domain:
  $U_{ind}$: ANESIA* (0.78, 0.91); ANESIA (0.77, 0.69); AgentGG (0.75, 0.76); KakeSoba (0.67, 0.71); SAGA (0.55, 0.51)
  $U_{soc}$: ANESIA* (1.38, 1.50); ANESIA (1.37, 1.35); AgentGG (1.30, 1.23); KakeSoba (1.40, 1.46); SAGA (1.39, 1.37)
  $R_{avg}$: ANESIA* (4.83, 2.50); ANESIA (20.30, 92.69); AgentGG (1068.25, 1093.44); KakeSoba (520.63, 1082.00); SAGA (338.01, 296.05)

Table 4: Performance comparison of ANESIA* vs. WinkyAgent vs. AgentGP vs. FSEGA2019. Each cell gives values for ($B = 10$, $B = 20$).

Laptop domain:
  $U_{ind}$: ANESIA* (0.92, 0.92); WinkyAgent (0.88, 0.86); AgentGP (0.77, 0.75); FSEGA2019 (0.90, 1.00)
  $U_{soc}$: ANESIA* (0.94, 0.93); WinkyAgent (0.89, 0.85); AgentGP (0.74, 0.75); FSEGA2019 (0.88, 0.76)

Holiday domain:
  $U_{ind}$: ANESIA* (0.86, 0.82); WinkyAgent (0.77, 0.78); AgentGP (0.79, 0.80); FSEGA2019 (0.84, 0.84)
  $U_{soc}$: ANESIA* (1.68, 1.69); WinkyAgent (1.63, 1.63); AgentGP (1.55, 1.57); FSEGA2019 (1.66, 1.61)

Party domain:
  $U_{ind}$: ANESIA* (0.72, 0.70); WinkyAgent (0.61, 0.67); AgentGP (0.60, 0.62); FSEGA2019 (0.71, 0.63)
  $U_{soc}$: ANESIA* (1.47, 1.35); WinkyAgent (1.36, 1.38); AgentGP (1.31, 1.31); FSEGA2019 (1.27, 1.30)

Smart Energy Grid domain:
  $U_{ind}$: ANESIA* (0.70, –); WinkyAgent NA; AgentGP (0.68, 0.70); FSEGA2019 (–, 0.65)
  $U_{soc}$: ANESIA* (1.42, 1.42); WinkyAgent NA; AgentGP (1.38, 1.41); FSEGA2019 (1.40, 1.39)

Hypothesis D: Strategy Template Results in Table 3 demonstrate that our agent ANESIA* learns to make the optimal choice of tactics to use at run time, and outperforms the non-optimized ANESIA as well as the other teacher strategies on which it was trained using DDPG.
Hypothesis E: Adaptiveness of the Proposed Model We deploy our agent in a negotiation domain called Smart Energy Grid (already existing in GENIUS, with $|\Omega| = 625$), against the agents of the ANAC'19 tournament which won the competition based on joint utility, i.e. WinkyAgent, FSEGA2019 and AgentGP. These agents and the domain are different from those our agent was initially trained on. Results presented in Table 4 over 2 different user preference uncertainties ($B \in \{10, 20\}$) clearly demonstrate the benefits of our agent strategy, built upon the given template, over the other existing strategies. We do not report $R_{avg}$ in this setting since the deadline is given in terms of rounds (60 rounds), because the readily available WinkyAgent code does not work with continuous time but only with discrete rounds; moreover, WinkyAgent gives a timeout exception when deciding an action during each turn in this domain, and is hence represented by NA in Table 4. This confirms our hypothesis that our model ANESIA with optimized strategies can learn to adapt at run time to different negotiation settings against different unknown opponents.
Conclusions
ANESIA is a novel model encapsulating different types of learning to aid an agent negotiating over multiple issues under user preference uncertainty. The model uses stochastic search based on Cuckoo Search optimization for user modeling, combining NSGA-II and TOPSIS for generating Pareto bids during negotiation according to the estimated user and opponent models. An ANESIA agent learns, using a strategy template, to choose which tactic to employ for deciding when to accept or bid at a particular time during the negotiation. The model implements an actor-critic DDPG architecture to evaluate the target threshold utility value below which it neither accepts nor proposes bids from/to the opponent. We have empirically evaluated the performance of ANESIA against the winning agent strategies of the ANAC'19 tournament in different settings, showing that ANESIA outperforms them. Moreover, our template-based strategy exhibits adaptive behaviour, as it helps the agent transfer knowledge to environments with unknown opponent agents which are unseen during training.

References

Bedour Alrayes, Özgür Kafalı, and Kostas Stathis. Concurrent bilateral negotiation for open e-markets: the CONAN strategy. Knowledge and Information Systems, 56(2):463-501, 2018.
Tim Baarslag, Koen Hindriks, Mark Hendrikx, Alexander Dirkzwager, and Catholijn Jonker. Decoupling negotiating agents to explore the space of negotiation strategies. In Novel Insights in Agent-based Complex Automated Negotiation, pages 61-83. Springer, 2014.

Tim Baarslag, Mark JC Hendrikx, Koen V Hindriks, and Catholijn M Jonker. Learning about the opponent in automated bilateral negotiation: a comprehensive survey of opponent modeling techniques. Autonomous Agents and Multi-Agent Systems, 30(5):849-898, 2016.

Pallavi Bagga, Nicola Paoletti, Bedour Alrayes, and Kostas Stathis. A deep reinforcement learning approach to concurrent bilateral negotiation. In IJCAI, 2020.

Jasper Bakker, Aron Hammond, Daan Bloembergen, and Tim Baarslag. RLBOA: A modular reinforcement learning framework for autonomous negotiating agents. In AAMAS, pages 260-268, 2019.

Stefania Costantini, Giovanni De Gasperis, Alessandro Provetti, and Panagiota Tsintza. A heuristic approach to proposal-based negotiation: with applications in fashion supply chain management. Mathematical Problems in Engineering, 2013, 2013.

Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182-197, 2002.

Walaa H El-Ashmawi, Diaa Salama Abd Elminaam, Ayman M Nabil, and Esraa Eldesouky. A chaotic owl search algorithm based bilateral negotiation model. Ain Shams Engineering Journal, 2020.

Mir Majid Etghani, Mohammad Hassan Shojaeefard, Abolfazl Khalkhali, and Mostafa Akbari. A hybrid method of modified NSGA-II and TOPSIS to optimize performance and emissions of a diesel engine using biodiesel. Applied Thermal Engineering, 59(1-2):309-315, 2013.

S Shaheen Fatima, Michael Wooldridge, and Nicholas R Jennings. Optimal negotiation strategies for agents with incomplete information. In International Workshop on Agent Theories, Architectures, and Languages, pages 377-392. Springer, 2001.

Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. Multi-issue negotiation under time constraints. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 1, pages 143-150, 2002.

Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. A comparative study of game theoretic and evolutionary models of bargaining for software agents. Artificial Intelligence Review, 23(2):187-205, 2005.

S Shaheen Fatima, Michael J Wooldridge, and Nicholas R Jennings. Multi-issue negotiation with deadlines. Journal of Artificial Intelligence Research, 27:381-417, 2006.

Khayyam Hashmi, Amal Alhosban, Erfan Najmi, Zaki Malik, et al. Automated web service quality component negotiation using NSGA-2. Pages 1-6. IEEE, 2013.

Mark Klein, Peyman Faratin, Hiroki Sayama, and Yaneer Bar-Yam. Negotiating complex contracts. Group Decision and Negotiation, 12(2):111-125, 2003.

Fabian Lang and Andreas Fink. Learning from the metaheuristics: Protocols for automated negotiations. Group Decision and Negotiation, 24(2):299-332, 2015.

Raymond YK Lau, Maolin Tang, On Wong, Stephen W Milliner, and Yi-Ping Phoebe Chen. An evolutionary learning approach for adaptive negotiation agents. International Journal of Intelligent Systems, 21(1):41-72, 2006.

Kejing Li and Xiaobing Zhang. Using NSGA-II and TOPSIS methods for interior ballistic optimization based on one-dimensional two-phase flow model. Propellants, Explosives, Pyrotechnics, 37(4):468-475, 2012.

Timothy Paul Lillicrap, Jonathan James Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), 2016.

Raz Lin, Sarit Kraus, Tim Baarslag, Dmytro Tykhonov, Koen Hindriks, and Catholijn M Jonker. GENIUS: An integrated environment for supporting the design of generic automated negotiators. Computational Intelligence, 30(1):48-70, 2014.

Máximo Méndez, Blas Galván, Daniel Salazar, and David Greiner. Multiple-objective genetic algorithm using the multiple criteria decision making method TOPSIS. In Multiobjective Programming and Goal Programming, pages 145-154. Springer, 2009.

Yousef Razeghi, Celal Ozan Berk Yavaz, and Reyhan Aydoğan. Deep reinforcement learning for acceptance strategy in bilateral negotiations. Turkish Journal of Electrical Engineering & Computer Sciences, 28(4):1824-1840, 2020.

Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica: Journal of the Econometric Society, pages 97-109, 1982.

Francisco Silva, Ricardo Faia, Tiago Pinto, Isabel Praça, and Zita Vale. Optimizing opponents selection in bilateral contracts negotiation with particle swarm. In International Conference on Practical Applications of Agents and Multi-Agent Systems, pages 116-124. Springer, 2018.

Dimitrios Tsimpoukis, Tim Baarslag, Michael Kaisers, and Nikolaos G Paterakis. Automated negotiations under user preference uncertainty: A linear programming approach. In International Conference on Agreement Technologies, pages 115-129. Springer, 2018.

Okan Tunalı, Reyhan Aydoğan, and Victor Sanchez-Anguix. Rethinking frequency opponent modeling in automated negotiation. In International Conference on Principles and Practice of Multi-Agent Systems, pages 263-279. Springer, 2017.

Gwo-Hshiung Tzeng and Jih-Jeng Huang. Multiple Attribute Decision Making: Methods and Applications. CRC Press, 2011.

Dengfeng Wang, Rongchao Jiang, and Yinchong Wu. A hybrid method of modified NSGA-II and TOPSIS for lightweight design of parameterized passenger car sub-frame. Journal of Mechanical Science and Technology, 30(11):4909-4917, 2016.

Xin-She Yang and Suash Deb. Cuckoo search via Lévy flights. In World Congress on Nature & Biologically Inspired Computing (NaBIC), pages 210-214. IEEE, 2009.

N Zeelanbasha, V Senthil, and G Mahesh. A hybrid approach of NSGA-II and TOPSIS for minimising vibration and surface roughness in machining process.