A Deep Reinforcement Learning Approach to Concurrent Bilateral Negotiation
Pallavi Bagga, Nicola Paoletti, Bedour Alrayes, Kostas Stathis
Royal Holloway, University of London; King Saud University, Saudi Arabia
{pallavi.bagga.2017, nicola.paoletti, kostas.stathis}@rhul.ac.uk, [email protected]

Abstract
We present a novel negotiation model that allows an agent to learn how to negotiate during concurrent bilateral negotiations in unknown and dynamic e-markets. The agent uses an actor-critic architecture with model-free reinforcement learning to learn a strategy expressed as a deep neural network. We pre-train the strategy by supervision from synthetic market data, thereby decreasing the exploration time required for learning during negotiation. As a result, we can build automated agents for concurrent negotiations that can adapt to different e-market settings without the need to be pre-programmed. Our experimental evaluation shows that our deep reinforcement learning based agents outperform two existing well-known negotiation strategies in one-to-many concurrent bilateral negotiations for a range of e-market settings.
We are concerned with the problem of learning a strategy for a buyer agent to engage in concurrent bilateral negotiations with unknown seller agents in open and dynamic e-markets such as E-bay. Previous work in concurrent bilateral negotiation has mainly focused on heuristic strategies [Nguyen and Jennings, 2004; Mansour and Kowalczyk, 2014; An et al., 2006], some of which adapt to changes in the environment [Williams et al., 2012]. In such strategies, different bilateral negotiations are managed through a coordinator agent [Rahwan et al., 2002] or by coordinating multiple dialogues internally [Alrayes and Stathis, 2013], but they do not support agent learning, which is our main focus. Other approaches use agent learning based on Genetic Algorithms (GA) [Oliver, 1996; Zou et al., 2014], but they require a huge number of trials before obtaining a good strategy, which makes them infeasible for online negotiation settings. Reinforcement Learning (RL)-based negotiation approaches typically employ Q-learning [Papangelis and Georgila, 2015; Bakker et al., 2019; Rodriguez-Fernandez et al., 2019], which does not support continuous actions. This is an important limitation in our setting because we want the agent to learn how much to concede, e.g. on the price of an item for sale, which naturally leads to a continuous action space. Consequently, the design of autonomous agents capable of learning a strategy from concurrent negotiations with other agents is still an important open problem.

We propose, to the best of our knowledge, the first Deep Reinforcement Learning (DRL) approach for one-to-many concurrent bilateral negotiations in open, dynamic and unknown e-market settings. In particular, we define a novel DRL-inspired agent model called ANEGMA, which allows the buyer to develop an adaptive strategy to use effectively against its opponents (which use fixed-but-unknown strategies) during concurrent negotiations in an environment with incomplete information. We choose deep neural networks as they provide a rich class of strategy functions to capture the complex decision-making behind negotiation.

Since RL approaches need a long time to find an optimal policy from scratch, we pre-train our deep negotiation strategies using supervised learning (SL) from a set of training examples. To overcome the lack of real-world negotiation data for the initial training, we generate synthetic datasets using the simulation environment in [Alrayes et al., 2016] and two well-known strategies for concurrent bilateral negotiation described in [Alrayes et al., 2018] and [Williams et al., 2012] respectively.

With this work, we empirically demonstrate three important benefits of our deep learning framework for automated negotiations: 1) existing negotiation strategies can be accurately approximated using neural networks; 2) evolving a pre-trained strategy using DRL with additional negotiation experience yields strategies that even outperform the teachers, i.e., the strategies used for supervision; 3) buyer strategies trained assuming a particular seller strategy quickly adapt via DRL to different (and unknown) sellers' behaviours.

In summary, our contribution is threefold: we propose a novel agent model for one-to-many concurrent bilateral negotiations based on DRL and SL; we extend the existing simulation environment [Alrayes et al., 2016] to generate data and perform experiments that support agent learning for negotiation; and we run extensive experiments showing that our approach outperforms the existing strategies and produces adaptable agents that can transfer to a range of e-market settings.
The existing body of work on automated negotiation differs from ours in one or more of the following ways: the application domain, the focus (or goal) of the research, and the machine learning approach used to improve the autonomous decision-making performance of an agent.

The work in [Lau et al., 2006] uses GAs to derive a heuristic search over a set of potential solutions in order to find mutually acceptable offers. Also, in [Choudhary and Bharadwaj, 2018], the authors propose a GA-based learning technique for multi-agent negotiation, but with regard to making recommendations to a group of persons based on their preferences. Since we are dealing with an environment with limited information, another relevant consideration is related to RL. In [Bakker et al., 2019], the authors study a modular RL-based BOA (Bidding strategy, Opponent model and Acceptance condition) framework, which is an extension of the work done in [Baarslag et al., 2016]. This framework implements an agent that uses tabular Q-learning to learn the bidding strategy by discretizing the continuous state/action space (not an optimal solution for large state/action spaces, as it may lead to the curse of dimensionality and also lose relevant information about the structure of the state/action domain). Q-learning is also used in [Rodriguez-Fernandez et al., 2019] to provide a decision support system for the energy market. In addition, the work in [Sunder et al., 2018] uses a variable reward function for an RL approach called REINFORCE to model the pro-social or selfish behaviour of agents. Furthermore, the work of [Hindriks and Tykhonov, 2008; Zeng and Sycara, 1998] uses Bayesian Learning to learn the opponent preferences instead of the negotiation strategy.

Previous work also considers the combination of different learning approaches to determine an optimal negotiation strategy for an agent. In [Zou et al., 2014], the authors propose a fusion of evolutionary algorithms (EAs) and RL that outperforms classic EAs; here, replicator dynamics is used with a GA to adjust the probabilities of strategies. In that work, the experiments show that the weights assigned to historical and current payoffs (due to changes in environment dynamics) during learning have a great impact on both the negotiation performance and the learning itself. Another relevant work is [Lewis et al., 2017], which combines SL (a Recurrent Neural Network (RNN)) and RL (REINFORCE) to train on human dialogues. We also combine SL and RL, but with the main focus on the autonomy of negotiations rather than Natural Language Processing (NLP). We also differ with respect to the combination of ML approaches (i.e. an Artificial Neural Network (ANN) for SL and the Actor-Critic model DDPG [Lillicrap et al., 2017] for RL), which will be explained in subsequent sections.

In addition, and independently of the approach, numerous works in the domain of bilateral negotiation rely on the Alternating Offers protocol [Rubinstein, 1982] as the negotiation mechanism, which, despite its simplicity, does not capture many realistic bargaining scenarios.

In this section, we formulate the negotiation environment and introduce our agent negotiation model called
ANEGMA (Adaptive NEGotiation model for e-MArkets).

We consider e-marketplaces like E-bay where the competition is visible, i.e. a buyer can observe the number of competitors that are dealing with the same resource from the same seller. We assume that the environment consists of a single e-market m with P agents, comprising a non-empty set of buyers B_m and a non-empty set of sellers S_m; these sets need not be mutually exclusive. For a buyer b ∈ B_m and resource r, we denote with S^t_{b,r} ⊆ S_m the set of sellers from market m which, at time point t, negotiate with b for resource r (over a range of issues I). The buyer b uses |S^t_{b,r}| negotiation threads in order to negotiate concurrently with each seller s ∈ S^t_{b,r}. We assume that no agent can be both buyer and seller for the same resource at the same time, that is, ∀b, r, t: s ∈ S^t_{b,r} ⟹ S^t_{s,r} = ∅. The set of competitors of b is C^t_{b,r} = {b' ∈ B_m | b' ≠ b, S^t_{b',r} ≠ ∅}, i.e. those agents negotiating with the same sellers and for the same resource r as b.

As we are interested in practical settings, we adopt the negotiation protocol of [Alrayes et al., 2018], since it supports concurrent bilateral negotiations. This protocol assumes an open e-market environment, i.e., one where agents can enter or leave the negotiation at their own will. A buyer b always starts the negotiation by making an offer whose start time is t_start. Any negotiation is for a resource r (we index each negotiation thread by the name of the seller s and the resource r) and can last for up to time t_b, the maximum time b can negotiate for. The deadline for b is thus t_end = t_start + t_b, which for simplicity we assume to be the same for all the resources being negotiated. Information about the deadline t_b, the Initial Price IP_b and the Reservation Price RP_b is private to each b ∈ B_m. Each seller s also has its own Initial Price IP_s, Reservation Price RP_s and maximum negotiation duration t_s (which are not visible to other agents). The protocol is turn-based and allows agents to take actions, at each negotiation state (from S1 to S5, see [Alrayes et al., 2018]), from the pool Actions = {offer(x), reqToReserve, reserve, cancel, confirm, accept, exit}.
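For illustration only, the sketch below shows one way the protocol's action pool and a buyer's per-seller negotiation threads might be represented in code; the class and field names are our own assumptions and are not part of the protocol specification in [Alrayes et al., 2018].

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    """Action pool of the turn-based protocol; offer(x) carries a price argument."""
    OFFER = auto()
    REQ_TO_RESERVE = auto()
    RESERVE = auto()
    CANCEL = auto()
    CONFIRM = auto()
    ACCEPT = auto()
    EXIT = auto()

@dataclass
class NegotiationThread:
    """One bilateral thread between buyer b and a seller s for resource r (hypothetical fields)."""
    seller_id: str
    resource: str
    protocol_state: str = "S1"        # one of the protocol states S1..S5
    best_offer: float | None = None   # best price seen in this thread so far
    last_action: Action | None = None
```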
ANEGMA Components
Our proposed agent negotiation model supports learning during concurrent bilateral negotiations with unknown opponents in dynamic and complex e-marketplaces. In this model, we use a centralized approach in which the coordination is done internally to the agent via multi-threading synchronization. This approach minimizes the agent communication overhead and thus improves the run-time performance. The different components of the proposed model are shown in Figure 1 and explained below.
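As a minimal sketch of the internal multi-threaded coordination described above (the class and method names are hypothetical, not taken from the ANEGMA implementation), each bilateral dialogue can run in its own thread while sharing a lock-protected record of the best offer and of the single allowed agreement:

```python
import threading

class InternalCoordinator:
    """Shared state for the buyer's concurrent negotiation threads (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.best_offer = None    # lowest seller offer observed across all threads
        self.agreed_with = None   # seller we have committed to, if any

    def report_offer(self, seller_id, price):
        """Record a seller's offer; called from that seller's negotiation thread."""
        with self._lock:
            if self.best_offer is None or price < self.best_offer:
                self.best_offer = price

    def try_commit(self, seller_id, price):
        """Atomically accept at most one agreement across all threads."""
        with self._lock:
            if self.agreed_with is None:
                self.agreed_with = (seller_id, price)
                return True
            return False

# Usage sketch: two seller threads reporting offers concurrently.
coordinator = InternalCoordinator()
threads = [threading.Thread(target=coordinator.report_offer, args=(s, p))
           for s, p in [("s1", 480.0), ("s2", 455.0)]]
for t in threads: t.start()
for t in threads: t.join()
print(coordinator.best_offer)  # 455.0
```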
Physical Capabilities:
These are the sensors and actuators of the agent that enable it to access an e-marketplace. More specifically, they allow a buyer b to perceive the current (external) state of the environment s_t and represent that state locally in the form of internal attributes, as shown in Table 1. Some of these attributes (NS_r, NC_r) are perceived by the agent using its sensors, some of them (IP_b, RP_b, t_end) are stored locally in its knowledge base, and some of them (S_neg, X_best, T_left) are obtained while interacting with seller agents during a negotiation. At time t, the internal agent representation of the environment is s_t, which is used by the agent to decide what action a_t to execute using its actuators. Action execution then changes the state of the environment to s_{t+1}.

Figure 1: The Architecture of ANEGMA
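Purely as an illustration of how the internal state s_t of Table 1 might be packed into a feature vector for the learning components (the encoding below is our assumption; the paper does not specify it):

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    """Internal representation s_t of the environment (attribute names follow Table 1)."""
    ns_r: int        # sellers currently negotiating with b for r
    nc_r: int        # buyers competing with b for r
    s_neg: int       # protocol state, encoded 1..5 for S1..S5 (our encoding)
    x_best: float    # best offer seen so far in this protocol state
    t_left: float    # time remaining until t_end
    ip_b: float      # buyer's initial price (private)
    rp_b: float      # buyer's reservation price (private)

    def to_vector(self) -> list[float]:
        """Flat numeric encoding fed to the classifier/regressor."""
        return [float(self.ns_r), float(self.nc_r), float(self.s_neg),
                self.x_best, self.t_left, self.ip_b, self.rp_b]
```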
Learning Capabilities:
The foundation of our model is a component providing learning capabilities similar to those of the Actor-Critic architecture in [Lillicrap et al., 2017]. It consists of three sub-components: Negotiation Experience, Decide and
Evaluate.

Negotiation Experience stores historical information about previous negotiation experiences, which involve the interactions of the agent with other agents in the market. Experience elements are of the form ⟨s_t, a_t, r_t, s_{t+1}⟩, where s_t is the state of the e-market environment, a_t is the action performed by b in s_t, r_t is the scalar reward (feedback) received from the environment, and s_{t+1} is the new e-market state after executing a_t.

Decide refers to a negotiation strategy which helps b choose an optimal action a_t from the set Actions in a particular state s_t. It consists of two different functions, f_c and f_r. f_c takes the state s_t as input and returns a discrete action among counter-offer, accept, confirm, reqToReserve and exit, see (1). When f_c decides to perform a counter-offer action, f_r is used to compute, given the input state s_t, the value of the counter-offer, see (2). From a machine learning perspective, deriving f_c corresponds to a classification problem, and deriving f_r to a regression problem.

f_c(s_t) = a_t, a_t ∈ Actions   (1)
f_r(s_t) = x, x ∈ [IP_b, RP_b]   (2)

Evaluate refers to a critic which helps b learn and evolve the negotiation strategy for unknown and dynamic environments. More specifically, it is a function of K (K < N) random past negotiation experiences fetched from the database.
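A possible realization of f_c and f_r as two heads of a single neural network is sketched below in PyTorch. The architecture (layer sizes, shared trunk, sigmoid squashing of the counter-offer into [IP_b, RP_b]) is our assumption, since the text only states that an ANN is used and that dropout is applied.

```python
import torch
import torch.nn as nn

ACTIONS = ["counter-offer", "accept", "confirm", "reqToReserve", "exit"]

class DecideNet(nn.Module):
    """Shared trunk with a classification head f_c (Eq. 1) and a regression head f_r (Eq. 2)."""

    def __init__(self, state_dim: int = 7, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Dropout(p=0.2),                    # dropout regularization, as mentioned in the paper
            nn.Linear(hidden, hidden), nn.ReLU())
        self.f_c = nn.Linear(hidden, len(ACTIONS))  # logits over the discrete actions
        self.f_r = nn.Linear(hidden, 1)             # raw counter-offer score

    def forward(self, state: torch.Tensor, ip_b: float, rp_b: float):
        h = self.trunk(state)
        action_logits = self.f_c(h)
        # Squash the regression output into the buyer's feasible price range [IP_b, RP_b].
        offer = ip_b + (rp_b - ip_b) * torch.sigmoid(self.f_r(h))
        return action_logits, offer
```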
Table 1: Agent’s State Attributes
NS_r: Number of sellers that b is concurrently dealing with for resource r at time t (|S^t_{b,r}|).
NC_r: Number of buyer agents competing with b for resource r at time t (|C^t_{b,r}|).
S_neg: Current state of the negotiation protocol (S1 to S5 [Alrayes et al., 2018]).
X_best: Best offer made by either b or s in S_neg.
T_left: Time left for b to reach t_end after the last action of s.
IP_b: Minimum price which b can offer at the start of the negotiation.
RP_b: Maximum price which b can offer to s.

Also, the learning process of b is retrospective, since it depends on the feedback or scalar reward r_t (and r'_t) obtained from the e-market environment by performing action a_t in state s_t; these rewards are calculated using (3) and (4) to evaluate the discrete and continuous actions made by the Decide component at time t, respectively. Our design of the reward functions accelerates agent learning by allowing b to receive a reward after every action it performs in the environment, rather than only at the end of the negotiation.

\[
r_t = \begin{cases}
U_b(x, t) & \text{if } t \le t_{end} \text{ and an agreement is reached} \\
-1 & \text{if } t \le t_{end} \text{ and there is no deal} \\
r'_t & \text{if } a_t = \text{counter-offer} \\
0 & \text{otherwise}
\end{cases} \tag{3}
\]

\[
r'_t = \begin{cases}
U_b(x, t) & \text{if } t \le t_{end} \text{ and } x \le o_i \ \forall o_i \in O_t \\
-1 & \text{if } t \le t_{end} \text{ and } x > o_i \ \forall o_i \in O_t \\
0 & \text{otherwise}
\end{cases} \tag{4}
\]

In (3) and (4), U_b(x, t) refers to the utility value of the offer x (generated using (2)) at time t and is calculated from the Initial Price (IP_b), the Reservation Price (RP_b), the agreement offer x and a temporal discount factor d_t ∈ [0, 1) [Williams et al., 2012], as defined in (5). The parameter d_t encourages b to negotiate without delay. The reward function r'_t in (4) helps b learn that it should not offer more than what the active sellers have already offered it; O_t refers to the list of preferred offers of all s ∈ S^t_{b,r} at time t.

\[
U_b(x, t) = \left(\frac{RP_b - x}{RP_b - IP_b}\right) \cdot \left(\frac{t}{t_{end}}\right)^{d_t} \tag{5}
\]

In our experiments, d_t is set to a fixed value; the higher the d_t value, the higher the penalty due to delay.
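The reward and utility computations of Equations (3)-(5) could be implemented roughly as follows. This is a sketch of our reconstruction; the -1 penalty constant and the handling of edge cases are our assumptions.

```python
def utility(x: float, t: float, ip_b: float, rp_b: float, t_end: float, d_t: float) -> float:
    """U_b(x, t) from Eq. (5): price term scaled by the temporally discounted factor."""
    price_term = (rp_b - x) / (rp_b - ip_b)
    return price_term * (t / t_end) ** d_t

def reward_discrete(outcome: str, action: str, x: float, t: float,
                    ip_b: float, rp_b: float, t_end: float, d_t: float,
                    r_cont: float) -> float:
    """r_t from Eq. (3); the -1 penalty is an assumption on our part."""
    if t <= t_end and outcome == "agreement":
        return utility(x, t, ip_b, rp_b, t_end, d_t)
    if t <= t_end and outcome == "no_deal":
        return -1.0
    if action == "counter-offer":
        return r_cont            # defer to the continuous reward r'_t
    return 0.0

def reward_continuous(x: float, t: float, offers: list[float],
                      ip_b: float, rp_b: float, t_end: float, d_t: float) -> float:
    """r'_t from Eq. (4): do not counter-offer above what sellers already offered (O_t)."""
    if t <= t_end and all(x <= o for o in offers):
        return utility(x, t, ip_b, rp_b, t_end, d_t)
    if t <= t_end and all(x > o for o in offers):
        return -1.0
    return 0.0
```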
In this section, we describe the data set collected for training the SL model (used for pre-training the ANEGMA agent), the performance measures (used for evaluating the negotiation process) and the ML models (used for the learning process).

Data set collection
In order to collect the data set for training the ANEGMA agent with an SL model, we have used a simulation environment [Alrayes et al., 2016] that supports concurrent negotiations between buyers and sellers. The buyers use the two different strategies presented in [Alrayes et al., 2018] and [Williams et al., 2012], whereas the sellers use the strategies described in [Faratin et al., 1998]. We could also have collected negotiation examples for training using other buyer strategies for concurrent negotiation that can deal with the same environment as ours, or any real-world market data; however, to the best of our knowledge, none of these had readily available implementations. We have selected the input features for the dataset manually, and this set of features corresponds to the agent's state attributes in Table 1. To avoid choosing overlapping features, we have then applied the Pearson Correlation coefficient [Lee Rodgers and Nicewander, 1988] and ensured that there is no strong correlation between the selected features (all pairwise correlation coefficients are small in magnitude, and most are close to zero).
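As a rough illustration of this feature-overlap check (with made-up values; the real check was run on the collected negotiation data set), pairwise Pearson coefficients can be computed with pandas:

```python
import pandas as pd

# df: one row per observed negotiation state, columns named after Table 1's attributes.
df = pd.DataFrame({
    "NS_r": [3, 2, 4, 1], "NC_r": [5, 6, 2, 3], "S_neg": [1, 2, 2, 3],
    "X_best": [470.0, 455.0, 490.0, 440.0], "T_left": [80.0, 60.0, 40.0, 20.0],
})  # illustrative values only

corr = df.corr(method="pearson")               # pairwise Pearson coefficients
overlapping = [(a, b) for a in corr.columns for b in corr.columns
               if a < b and abs(corr.loc[a, b]) > 0.9]   # 0.9 threshold is our assumption
print(corr.round(2))
print("highly correlated pairs:", overlapping)
```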
Performance measures

To successfully evaluate the performance of ANEGMA and compare it with other negotiation approaches, it is necessary to identify appropriate performance metrics. For our experiments, we have used the following widely adopted metrics [Williams et al., 2012; Faratin et al., 1998; Nguyen and Jennings, 2004; Alrayes et al., 2018]: Average utility rate (U_avg), Average negotiation time (T_avg) and Percentage of successful negotiations (S%), which are described in Table 2. Our main motive behind calculating U_avg is to measure the agent's profit over successful negotiations only, hence we exclude the unsuccessful ones from this metric; the (un)successful negotiations are captured in the separate metric S%.

ML models

During our experiments, the buyer negotiates with fixed-but-unknown seller strategies in an e-market. Also, the competitor buyers use only a single fixed-but-unknown strategy, which can be learnt by the buyers after some simulation runs. Hence, we consider our negotiation environment to be fully observable. Following this, for our dynamic (agents leave and enter the market at any time) and episodic (the negotiation terminates at some point) environment, we use a model-free, off-policy RL approach which generates a deterministic policy based on the policy gradient method to support continuous control.
More specifically, we use the Deep Deterministic Policy Gradient (DDPG) algorithm, an actor-critic RL approach that generates a deterministic action selection policy for the buyer (see [Lillicrap et al., 2017] for details, which we omit due to lack of space). We consider a model-free RL approach because our buyer is more concerned with determining what action to take in a given state than with predicting the next state of the environment; this is because the strategies of the sellers and of the competitor buyers are unknown. We adopt an off-policy approach for efficient and independent exploration of continuous action spaces. Furthermore, instead of initializing the RL policy randomly, we use a policy generated by an Artificial Neural Network (ANN) [Goodfellow et al., 2016], due to its compatibility with DRL, in order to speed up and reduce the cost of the RL process. To reduce over-fitting and generalization errors, we also apply regularization (dropout) during the training of the neural network.
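A highly simplified sketch of this two-stage scheme is shown below: supervised pre-training of the actor on teacher demonstrations, followed by DDPG-style actor-critic updates. Target networks, exploration noise and the replay buffer of the full DDPG algorithm [Lillicrap et al., 2017] are omitted, and all function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def pretrain_actor(actor, demos, epochs=10, lr=1e-3):
    """Supervised pre-training: imitate (state, action) pairs produced by the teacher strategy."""
    opt = torch.optim.Adam(actor.parameters(), lr=lr)
    for _ in range(epochs):
        for state, teacher_action in demos:              # tensors collected from simulation runs
            loss = F.mse_loss(actor(state), teacher_action)
            opt.zero_grad(); loss.backward(); opt.step()

def ddpg_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    """One simplified DDPG update on a replay batch (target networks omitted for brevity)."""
    s, a, r, s2 = batch                                   # states, actions, rewards, next states
    with torch.no_grad():
        target_q = r + gamma * critic(s2, actor(s2))      # bootstrapped Q target
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_loss = -critic(s, actor(s)).mean()              # deterministic policy gradient
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```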
We use
ANEGMA to build autonomous buyers that negotiate against unknown opponents in different e-market settings. Our experiments address the following hypotheses.
Hypothesis A:
The Market Density (MD), the Market Ratio or Demand/Supply Ratio (MR), the Zone of Agreement (ZoA) and the Buyer's Deadline (t_end) have a considerable effect on the success of negotiations. Here,
• MD is the total number of agents in the e-market at any given time dealing with the same resource as our buyer.
• MR is the ratio of the total number of buyers to the total number of sellers in the e-market.
• ZoA refers to the intersection between the price ranges of buyers and sellers within which they can agree.
In practice, buyers have no control over these parameters except the deadline, which can be decided by the user or constrained by a higher-level goal the buyer is trying to achieve.
Hypothesis B:
The ANEGMA buyer outperforms the SL, CONAN and Williams' negotiation strategies in terms of U_avg, T_avg and S% in a range of e-market settings.

Hypothesis C:
An ANEGMA buyer trained against a specific seller strategy still performs well against other fixed-but-unknown seller strategies. This shows that the ANEGMA agent's behaviour is adaptive, in that the agent transfers knowledge from previous experience to unknown e-market settings.
To carry out our experiments, we have extended the simulation environment RECON [Alrayes et al., 2016] with a new online learning component for ANEGMA.

Seller Strategies
For the purpose of training our SL model and conducting large-scale quantitative evaluations, we have used two groups of fixed seller strategies developed by Faratin et al. [1998]: Time-Dependent (Linear, Conceder and Boulware) and Behaviour-Dependent (Relative tit-for-tat, Random Absolute tit-for-tat and Averaged tit-for-tat). Each seller's deadline is assumed to be the same as that of the buyer, but private to the seller. Other parameters such as IP_s and RP_s are determined by the ZoA parameter, as shown in Table 3.
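For reference, the time-dependent tactics of Faratin et al. [1998] follow a standard polynomial concession function; the sketch below uses that textbook form with illustrative prices, since the exact parameterisation used in the simulations is not stated here.

```python
def time_dependent_offer(t: float, t_max: float, ip_s: float, rp_s: float,
                         beta: float, k: float = 0.0) -> float:
    """Seller's offer under the time-dependent tactic of Faratin et al. [1998].

    beta > 1: Conceder (concedes early), beta = 1: Linear, beta < 1: Boulware.
    The seller starts near its initial price IP_s and concedes towards its
    reservation price RP_s as its deadline t_max approaches.
    """
    alpha = k + (1.0 - k) * (min(t, t_max) / t_max) ** (1.0 / beta)
    return ip_s - alpha * (ip_s - rp_s)   # for a seller, IP_s >= RP_s

# Illustrative values only: a seller asking 650 initially with a 450 reservation price.
for name, beta in [("Boulware", 0.2), ("Linear", 1.0), ("Conceder", 5.0)]:
    print(name, [round(time_dependent_offer(t, 100, 650, 450, beta)) for t in (0, 50, 100)])
```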
Simulation Parameters
We assume that the buyer negotiates with multiple sellers concurrently to buy a second-hand laptop (r = Laptop) based on a single issue, Price (I = {Price}). We stress that the single-issue assumption is realistic in several real-world e-markets. The simulated market allows agents to enter and leave at their own will. The maximum number of agents allowed in the market, the demand/supply ratio, the buyer's deadline and the ZoAs are simulation-dependent.

Table 2: Performance Evaluation Metrics
U_avg: Sum of all the utilities of the buyer, averaged over the successful negotiations. Ideal value: high (1.0).
T_avg: Total time taken by the buyer (in milliseconds) to reach an agreement, averaged over all successful negotiations. Ideal value: low (close to 0).
S%: Proportion of total negotiations in which the buyer successfully reaches an agreement with one of the concurrent sellers. Ideal value: high (100%).
Table 3: Simulation Parameter Values
IP_b: interval starting at 300.
RP_b: interval starting at 500.
IP_s, RP_s: three values each, determined by the ZoA setting.
MD: three candidate values for each of the levels H, A and L.
MR: three candidate buyer-to-seller ratios for each of the levels H, A and L.
t_end: Lg from 151 s, A from 91 s, Sh from 30 s.
ZoA: percentage levels H, A and L.

As in [Alrayes et al., 2018], three qualitative values are considered for each parameter during the simulations, e.g., High (H), Average (A) and Low (L) for MD, or Long (Lg), Average (A) and Short (Sh) for t_end. The parameter values are reported in Table 3. The user can select one of these qualitative values for each parameter. Each qualitative value corresponds to a set of three quantitative values, of which only one is chosen at random for each simulation (e.g., setting H for parameter MD corresponds to choosing at random among the three corresponding values). The only exception is the parameter ZoA, which maps to a range of uniformly distributed quantitative values for the seller's initial price IP_s and reservation price RP_s (e.g., selecting A for ZoA leads to a value of IP_s uniformly sampled in an interval starting at 580). Therefore, the total number of simulation settings is 81, as we consider three possible settings for each of MD, MR, t_end and ZoA (see Table 3).
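A sketch of this sampling scheme is given below; the candidate values are placeholders of our own (the concrete numbers come from Table 3), but the mechanism, picking one quantitative value at random for the chosen qualitative level, is as described above.

```python
import random

# Placeholder candidate values: the actual numbers appear in Table 3.
MARKET_DENSITY = {"H": [12, 14, 16], "A": [8, 9, 10], "L": [4, 5, 6]}
BUYER_DEADLINE = {"Lg": (151, 210), "A": (91, 150), "Sh": (30, 90)}   # seconds, upper bounds assumed

def sample_setting(md_level: str, deadline_level: str, seed: int | None = None) -> dict:
    """Pick one concrete value per qualitative level, as described for the simulations."""
    rng = random.Random(seed)
    lo, hi = BUYER_DEADLINE[deadline_level]
    return {
        "market_density": rng.choice(MARKET_DENSITY[md_level]),  # one of three candidates
        "t_end": rng.uniform(lo, hi),                             # deadline drawn from a range
    }

print(sample_setting("H", "Lg", seed=0))
```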
We evaluate hypotheses A, B and C as described at the beginning of this section.
Hypothesis A (MD, MR, ZoA and t_end have a significant impact on negotiations)

We experimented with different e-market settings by considering, for each setting, both time-dependent and behaviour-dependent seller strategies over repeated simulations using the CONAN buyer strategy. As shown in Figure 2, these experiments suggest that MD and ZoA have a considerable effect on S%. From our observations, when MD is low, the agents reach more negotiation agreements. Also, there is not much difference in the agreement rate between the two larger ZoA settings when MD is low. The very low number of successful negotiations for the smallest ZoA is not unexpected, since only a minority of agents is willing to concede more in such a small ZoA. On the other hand, MR and t_end have, according to our experiments, a comparably minor impact on negotiation success (only some effect of MR on S% is observed under behaviour-dependent strategies and low MD, as shown in Figure 3). These results support our hypothesis.

Figure 2: Effect of Market Density (MD) and Zone of Agreement (ZoA) on the Proportion of Successful Negotiations (S%) using time-dependent strategies (left) and behaviour-dependent strategies (right).

Hypothesis B (ANEGMA outperforms SL and CONAN)
We performed simulations for our
ANEGMA agent with low MD, 60% and 100% ZoA, high MR and a long t_end, because these settings yielded the best performance in terms of S% in our experiments for Hypothesis A. We have used these settings against the Conceder Time Dependent and Relative Tit for Tat Behaviour Dependent seller strategies. Firstly, we collected training data for our SL approach (ANN) using two distinct strategies for supervision, viz. CONAN [Alrayes et al., 2018] and Williams [Williams et al., 2012]. Both were run for the same number of simulations and with the same settings. Table 4 compares the performance of the CONAN and Williams' models; CONAN outperforms Williams' strategy in these settings. Then, the resulting trained ANN models, called ANN-C and ANN-W respectively, were used as the initial strategies in our DRL approach (based on DDPG), where the strategies are evolved using negotiation experience from additional simulations. In the remainder, we abbreviate this model by
ANEGMA(SL+RL).

Finally, we use test data from simulations to compare the performance of the derived ANEGMA(SL+RL) buyers against CONAN, Williams' model, ANN-C, ANN-W, and the so-called ANEGMA(RL) model, which uses DDPG but is initialized with a random strategy.
Table 4: Performance comparison of the CONAN and Williams' models at 60% and 100% ZoA, against the Conceder Time Dependent and Relative Tit For Tat Behaviour Dependent seller strategies (metrics: U_avg, T_avg, S%). Best results are in bold.

Table 5: Performance comparison of ANN vs ANEGMA(SL+RL) vs ANEGMA(RL) when ZoA is 60%, trained and tested on the Conceder Time Dependent and on the Relative Tit for Tat Behaviour Dependent seller strategies. Best results are in bold. ANN-C and ANN-W correspond to the ANN trained using the data sets collected from the CONAN and Williams' approaches respectively, whereas ANEGMA(SL+RL)-C and ANEGMA(SL+RL)-W correspond to ANEGMA (DDPG) initialized with ANN-C and ANN-W respectively.

Table 6: Performance comparison of ANN vs ANEGMA(SL+RL) vs ANEGMA(RL) when ZoA is 100%. Best results are in bold. Abbreviations as in Table 5.

Table 7: Performance comparison for the adaptive behaviour of ANN vs ANEGMA(SL+RL) vs ANEGMA(RL): agents trained on the Relative Tit for Tat Behaviour Dependent and tested on the Conceder Time Dependent seller strategy, and vice versa. Best results are in bold. Abbreviations as in Table 5.

Figure 3: Effect of Market Density (MD) and Market Ratio (MR) on the Proportion of Successful Negotiations (S%) using time-dependent strategies (left) and behaviour-dependent strategies (right).

Figure 4: Training accuracies of the ANN when trained using data sets collected by negotiating the CONAN and Williams' buyer strategies (for different ZoAs) against time-dependent strategies (left) and behaviour-dependent strategies (right).

According to our results shown in Tables 5 and 6, the performance of ANN-C is comparable to that of CONAN for both 60% and 100%
ZoA (see Table 4), and we observe the same for ANN-W and the Williams' strategy. So, we conclude that our approach can successfully produce neural network strategies which are able to imitate the behaviour and the performance of the CONAN and Williams' models (see also the training accuracies reported in Figure 4).

Even more importantly, the results demonstrate that ANEGMA(SL+RL)-C (i.e. DDPG initialized with ANN-C) and ANEGMA(SL+RL)-W (i.e. DDPG initialized with ANN-W) improve on their respective initial ANN strategies obtained by SL, and outperform the DRL agent ANEGMA(RL) initialized at random, for both 60% and 100% ZoA, see Tables 5 and 6. This shows that both the evolution of the strategies via DRL and the initial supervision are beneficial. Furthermore, ANEGMA(SL+RL)-C and ANEGMA(SL+RL)-W also outperform the existing "teacher strategies" (CONAN and Williams) used for the initial supervision and hence improve on them, see Table 4.
Hypothesis C (ANEGMA is adaptable)
In this final test, we evaluate how well our
ANEGMA agents can adapt to environments different from those used at training time. Specifically, we deploy strategies trained against Conceder Time Dependent opponents into an environment with Relative Tit for Tat Behaviour Dependent opponents, and vice versa. The ANEGMA agents use experience from 500 simulations to adapt to the new environment. Results are presented in Table 7 for 60% ZoA and show a clear superiority of the ANEGMA agents over the ANN-C and ANN-W strategies, which, without online retraining, cannot maintain their performance in the new environment. This confirms our hypothesis that ANEGMA agents can learn to adapt at run-time to different unknown seller strategies.
Further discussion
Pondering over the negative average utility values of ANEGMA(RL) (see Tables 5 and 6), recall that we define the reported utility value as per Equation (5) but without the discount factor term. Therefore, if an agent concedes a lot to make a deal, it will collect a negative utility. This is precisely what happens to the initial random (and inefficient) strategy used in the ANEGMA(RL) configuration. The combination of SL and DRL prevents this very problem, as it uses an initial pre-trained strategy which is much less likely to incur negative utility values.

For the same reason, we observe a consistently shorter average negotiation time for ANEGMA(RL), which is caused by the buyer conceding more in order to reach an agreement without negotiating with the seller for long. Hence, a shorter T_avg alone does not generally imply a better negotiation performance.

An additional advantage of our approach is that it alleviates the common limitation of RL that an agent needs a non-trivial amount of experience before reaching satisfactory performance.

We have proposed
ANEGMA, a novel agent negotiation model that supports agent learning and adaptation during concurrent bilateral negotiations for a class of e-markets such as E-bay. Our approach derives an initial neural network strategy via supervision from well-known existing negotiation models, and evolves the strategy via DRL. We have empirically evaluated the performance of ANEGMA against fixed-but-unknown seller strategies in different e-market settings, showing that ANEGMA outperforms the well-known existing "teacher strategies", the strategies trained with SL only, and those trained with DRL only. Crucially, our model also exhibits adaptive behaviour, as it can transfer to environments with unknown sellers' behaviours different from those seen at training time. As future work, we plan to consider more complex market settings, including multi-issue negotiations and dynamic opponent strategies.
References

[Alrayes and Stathis, 2013] Bedour Alrayes and Kostas Stathis. An agent architecture for concurrent bilateral negotiations. In Decision Support Systems III - Impact of Decision Support Systems for Global Environments, pages 79-89. Springer, 2013.

[Alrayes et al., 2016] Bedour Alrayes, Özgür Kafalı, and Kostas Stathis. RECON: a robust multi-agent environment for simulating concurrent negotiations. In Recent Advances in Agent-based Complex Automated Negotiation, pages 157-174. Springer, 2016.

[Alrayes et al., 2018] Bedour Alrayes, Özgür Kafalı, and Kostas Stathis. Concurrent bilateral negotiation for open e-markets: the CONAN strategy. Knowledge and Information Systems, 56(2):463-501, 2018.

[An et al., 2006] Bo An, Kwang Mong Sim, Liang Gui Tang, Shuang Qing Li, and Dai Jie Cheng. Continuous-time negotiation mechanism for software agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(6):1261-1272, 2006.

[Baarslag et al., 2016] Tim Baarslag, Mark JC Hendrikx, Koen V Hindriks, and Catholijn M Jonker. Learning about the opponent in automated bilateral negotiation: a comprehensive survey of opponent modeling techniques. Autonomous Agents and Multi-Agent Systems, 30(5):849-898, 2016.

[Bakker et al., 2019] Jasper Bakker, Aron Hammond, Daan Bloembergen, and Tim Baarslag. RLBOA: A modular reinforcement learning framework for autonomous negotiating agents. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 260-268. International Foundation for Autonomous Agents and Multiagent Systems, 2019.

[Choudhary and Bharadwaj, 2018] Nirmal Choudhary and KK Bharadwaj. Evolutionary learning approach to multi-agent negotiation for group recommender systems. Multimedia Tools and Applications, pages 1-23, 2018.

[Faratin et al., 1998] Peyman Faratin, Carles Sierra, and Nick R Jennings. Negotiation decision functions for autonomous agents. Robotics and Autonomous Systems, 24(3-4):159-182, 1998.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[Hindriks and Tykhonov, 2008] Koen Hindriks and Dmytro Tykhonov. Opponent modelling in automated multi-issue negotiation using Bayesian learning. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 331-338. International Foundation for Autonomous Agents and Multiagent Systems, 2008.

[Lau et al., 2006] Raymond YK Lau, Maolin Tang, On Wong, Stephen W Milliner, and Yi-Ping Phoebe Chen. An evolutionary learning approach for adaptive negotiation agents. International Journal of Intelligent Systems, 21(1):41-72, 2006.

[Lee Rodgers and Nicewander, 1988] Joseph Lee Rodgers and W Alan Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59-66, 1988.

[Lewis et al., 2017] Mike Lewis, Denis Yarats, Yann N Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? End-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125, 2017.

[Lillicrap et al., 2017] Timothy Paul Lillicrap, Jonathan James Hunt, Alexander Pritzel, Nicolas Manfred Otto Heess, Tom Erez, Yuval Tassa, David Silver, and Daniel Pieter Wierstra. Continuous control with deep reinforcement learning, January 26 2017. US Patent App. 15/217,758.

[Mansour and Kowalczyk, 2014] Khalid Mansour and Ryszard Kowalczyk. Coordinating the bidding strategy in multiissue multiobject negotiation with single and multiple providers. IEEE Transactions on Cybernetics, 45(10):2261-2272, 2014.

[Nguyen and Jennings, 2004] Thuc Duong Nguyen and Nicholas R Jennings. Coordinating multiple concurrent negotiations. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3, pages 1064-1071. IEEE Computer Society, 2004.

[Oliver, 1996] Jim R Oliver. A machine-learning approach to automated negotiation and prospects for electronic commerce. Journal of Management Information Systems, 13(3):83-112, 1996.

[Papangelis and Georgila, 2015] Alexandros Papangelis and Kallirroi Georgila. Reinforcement learning of multi-issue negotiation dialogue policies. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 154-158, 2015.

[Rahwan et al., 2002] Iyad Rahwan, Ryszard Kowalczyk, and Ha Hai Pham. Intelligent agents for automated one-to-many e-commerce negotiation. In Australian Computer Science Communications, volume 24, pages 197-204. Australian Computer Society, Inc., 2002.

[Rodriguez-Fernandez et al., 2019] J Rodriguez-Fernandez, T Pinto, F Silva, I Praça, Z Vale, and JM Corchado. Context aware Q-learning-based model for decision support in the negotiation of energy contracts. International Journal of Electrical Power & Energy Systems, 104:489-501, 2019.

[Rubinstein, 1982] Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica: Journal of the Econometric Society, pages 97-109, 1982.

[Sunder et al., 2018] Vishal Sunder, Lovekesh Vig, Arnab Chatterjee, and Gautam Shroff. Prosocial or selfish? Agents with different behaviors for contract negotiation using reinforcement learning. arXiv preprint arXiv:1809.07066, 2018.

[Williams et al., 2012] Colin R Williams, Valentin Robu, Enrico H Gerding, and Nicholas R Jennings. Negotiating concurrently with unknown opponents in complex, real-time domains. 2012.

[Zeng and Sycara, 1998] Dajun Zeng and Katia Sycara. Bayesian learning in negotiation. International Journal of Human-Computer Studies, 48(1):125-141, 1998.

[Zou et al., 2014] Yi Zou, Wenjie Zhan, and Yuan Shao. Evolution with reinforcement learning in negotiation.