An Autonomous Negotiating Agent Framework with Reinforcement Learning Based Strategies and Adaptive Strategy Switching Mechanism
Ayan Sengupta
NEC Corporation, Tokyo, [email protected]
Yasser Mohammad
NEC Corporation, Tokyo, [email protected]
Shinji Nakadai
NEC Corporation, Tokyo, [email protected]
ABSTRACT
Despite abundant negotiation strategies in the literature, the complexity of automated negotiation forbids a single strategy from being dominant against all others in different negotiation scenarios. To overcome this, one approach is to use a mixture of experts, but a problem with this method is the selection of experts, as the approach is limited by the competency of the experts selected. Another problem with most negotiation strategies is their incapability of adapting to dynamic variation of the opponent's behaviour within a single negotiation session, resulting in poor performance. This work addresses both problems, the selection of experts and adaptation to the opponent's behaviour, with our Autonomous Negotiating Agent Framework. The framework allows real-time classification of the opponent's behaviour and provides a mechanism to select, switch or combine strategies within a single negotiation session. Additionally, our framework has a reviewer component which enables self-enhancement by periodically deciding to include new strategies or replace old ones with better strategies. We demonstrate an instance of our framework by implementing maximum entropy reinforcement learning based strategies with a deep learning based opponent classifier. Finally, we evaluate the performance of our agent against state-of-the-art negotiators under varied negotiation scenarios.
KEYWORDS
Automated Negotiation; Negotiation strategy; Reinforcement Learning
ACM Reference Format:
Ayan Sengupta, Yasser Mohammad, and Shinji Nakadai. 2021. An Autonomous Negotiating Agent Framework with Reinforcement Learning Based Strategies and Adaptive Strategy Switching Mechanism. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 10 pages.
1 INTRODUCTION
Negotiation has been studied for a long time from different perspectives like game theory [42], business [18], psychology [4], neuroeconomics [22] and many more. With the progress of AI technologies, automated negotiation allows collaboration and negotiation among AI-enabled parties. Automated negotiation aims to achieve win-win
deals for all parties, while simultaneously reducing the time and effort, thus adding significant value to society as a whole [39]. But the complexity of automated negotiation still hinders the deployment of autonomous agents in real-world applications [14].

Though much research already existed in developing negotiation strategies, the Automated Negotiating Agents Competition (ANAC) brought significant improvements in strategy development [32]. In spite of such improvements in strategy design, there is no single strategy that is optimal for all possible domains [30]. One natural solution is to choose a pool of strategies and use the approach of mixture of experts during negotiation. At the same time, one needs to choose an appropriate initial set of expert strategies to excel. The questions that arise while designing such an algorithm are these: What initial set of strategies should we select? On what conditions should we switch strategies? How do we improve the initial set of chosen strategies? In this work we give a solution to all three of these questions by introducing our autonomous negotiating agent framework.

The contributions of this work to the existing research in this domain are three-fold. Firstly, we propose an autonomous negotiating agent framework, which facilitates the creation of autonomous negotiating agents capable of classifying the opponent's behaviour and adaptively changing strategies within a single negotiation session to reach better agreements. Secondly, we propose a mechanism to update the base strategies in an algorithmic manner to improve the overall performance. Finally, we validate this framework and provide insights in general about autonomous negotiating agents by evaluating it extensively against state-of-the-art negotiators.

The rest of the paper is organized as follows: Section 2 gives a sketch of related work in this domain, Section 3 provides the introduction to negotiation settings. Section 4 gives a detailed description of each of the components in our framework and Section 5 describes the experimental setup. Section 6 shows the evaluations of our framework and finally, we conclude with Section 7 by discussing the limitations and providing directions for future research.
2 RELATED WORK
A considerable amount of literature has already been published on autonomous negotiation strategies. However, in recent years, the success of reinforcement learning (RL) algorithms in different fields has drawn significant attention to its application in autonomous negotiation [17, 48]. A part of our work falls under the above mentioned domain. Additionally, the other part of our work is at the intersection of the domains of opponent classification and strategy selection in autonomous negotiation. In this section we discuss the work done in both of these domains.

Previously, many computational methods including Bayesian learning [29, 51] and genetic algorithms [21, 35, 40] have been used in automated negotiation for developing and evaluating negotiation strategies. Then again, in the last couple of decades several studies have looked at the application of reinforcement learning (RL) algorithms like Q-learning [17, 20, 46, 48, 49] and REINFORCE [47] in automated negotiation. Recently, deep reinforcement learning (DRL) has been used to learn the target utility values [16], the acceptance strategy [43] or both bidding and acceptance strategies [19]. Moreover, the authors of [15] have also shown an application of DRL in concurrent bilateral negotiation.

Bakker et al. introduced the RLBOA framework [17] based on the BOA architecture [12] for automated negotiating agents, where they trained the bidding strategy of the agent using Q-learning. Their approach involves discretizing the utility space and using opponent modelling to choose the next offer from a set of offers, where the set of offers at each time step depends on the action taken. A limitation of this method is the loss of information due to discretization of the utility space, which leads to further dependence on opponent modelling for the choice of the next offer. In contrast to their work, we do not use opponent modelling while training bidding strategies. Moreover, we train the bidding strategy using DRL on continuous state and action spaces.

The authors of [19] have used DRL algorithms for training both bidding and acceptance strategies in continuous state and action spaces. The state space and the action space for their approach include the actual offer from the outcome space and hence limit the scope to a particular negotiation scenario. Furthermore, in their approach one needs to train both acceptance and bidding strategies for every domain. Moreover, the experimental setup and evaluations were done against fixed preference profiles, which limits the scope of applicability. In contrast, our approach considers the utility value of the offers projected to the self utility axis, thus making our bidding strategies applicable to multiple negotiation scenarios. Additionally, we show the generality of our approach by evaluating in varied negotiation scenarios while training in a single negotiation domain. Furthermore, evaluations in both [17] and [19] are against primitive agents only, whereas we evaluated our approach against GENIUS [37] based ANAC [32] winning agents.
Opponent modelling is a fundamental component of the BOA architecture proposed in [12]. Commonly, opponent models attempt to learn one or more of the following opponent attributes: acceptance strategy, deadline, utility function or bidding strategy [11]. However, our approach does not fit any of these usual types of opponent models. Unlike popular approaches of learning the bidding strategy, we classify an opponent depending on the history of bids. In fact our problem of classifying the opponent falls under the domain of continuous opponent strategy classification. Under this domain, for instance, the authors of [44] used a hierarchical approach with fuzzy models to perform opponent strategy classification in a real-time strategy game. Preference profile learning by classifying the negotiation trace was done in [29, 38, 39] using Bayesian learning to determine the best match for the opponent's preference profile. However, in this work we classify the opponent's bidding behaviour periodically with respect to a set of negotiators' bidding behaviours and select an appropriate strategy for negotiation within a single negotiation session.

Inspired by the algorithm selection method [36], the authors of [30] developed a meta-agent, which predicts the performance of a set of bilateral negotiators based on features of the domains, and accordingly chooses the negotiator expected to perform best for the given negotiation scenario. Extending this idea to multilateral negotiation settings and using the approach of mixture of experts, the authors of [26] solved the problem of how to combine multiple experts. However, in all these approaches a single negotiator is selected throughout a negotiation session. In contrast to that, this paper focuses on selection and switching (or combination) of strategies within a single negotiation session based on the opponent's behaviour.
3 NEGOTIATION SETTINGS
A bilateral automated negotiation is a negotiation between two automated entities. We will denote these entities as negotiators. A negotiation setting consists of a negotiation protocol, the concerned negotiators and a negotiation scenario [9]. A negotiation protocol defines the rules of the encounter, specifying which actions each negotiator can perform at any given moment. A negotiation scenario consists of the preference profiles of each negotiator and the negotiation domain. In this work, a strategy of a negotiator is the combination of an acceptance strategy and a bidding strategy. Additionally, we denote the opponent negotiators drawn from the GENIUS platform [37] as agents.

The negotiation protocol used throughout this paper is the stacked alternating offers protocol. Under this protocol, a negotiation session consists of rounds of consecutive turns where each negotiator can either make an offer, accept the offer, or walk away from the negotiation [7]. The negotiation session ends if both negotiators find a joint agreement, a deadline is reached, or one of the negotiators decides to walk away from the negotiation, resulting in no agreement. The deadline can be measured in number of rounds or actual wall-time. Negotiations are non-repeated, that is, one negotiation session cannot impact the actions of any negotiator in subsequent sessions.

A negotiation domain consists of one or more issues. To reach an agreement, the negotiators must settle on a specific value for each negotiated issue. The outcome space of a negotiation domain, denoted by $\Omega$, is the set of all possible negotiation outcomes. The outcome space can be defined as the Cartesian product of the negotiation issues and is formally denoted as $\Omega = \{\omega_1, \cdots, \omega_n\}$, where $\omega_i$ is a possible outcome and $n$ is the cardinality of the outcome space. A preference profile or utility profile defines a preference order $\leq$ that ranks the outcomes in the outcome space. Usually, a preference profile of a negotiator is specified by a utility function, which assigns a utility value to an outcome $\omega_i$, denoted by $U(\omega_i)$. Utility functions are private information and the negotiators only know their own utility functions. The preference profile of an agent also specifies a reservation value. The reservation value $u_r$ is the utility that the negotiator receives in case of no agreement.

4 AUTONOMOUS NEGOTIATING AGENT FRAMEWORK
In this section we provide the structure and explain the details of our proposed Autonomous Negotiating Agent Framework, a framework that facilitates the creation of autonomous negotiating agents which are capable of classifying opponents in real time and switching strategies accordingly within a single negotiation session. First, we introduce the components of our framework and then describe an approach for designing each of the components.

The proposed framework is comprised of four main components: negotiator-strategy pairs, an opponent classifier, a strategy switching mechanism and a reviewer. Figure 1 outlines all the components of the framework, and each of them is discussed in the remainder of the section.

The first component of our framework is a set of negotiator-strategy pairs, where the negotiators can be any autonomous negotiating agents and the strategies are bidding strategies trained against those negotiators, in addition to a fixed acceptance strategy for each chosen negotiator. The framework gives the user the flexibility to choose any set of negotiators, which we will call the base negotiators for the rest of this paper.
To illustrate, the base negotiators can include anything from simple time or behaviour dependent strategies [24] to state-of-the-art negotiators like ANAC winning agents. In this work, as described in Section 4.1, we have trained deep reinforcement learning (DRL) based bidding strategies against each negotiator to form the negotiator-strategy pairs and subsequently show the superiority and generality of this class of strategies.
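As a concrete illustration of such simple time dependent base strategies, the following is a minimal sketch of the classic time dependent concession family of [24]; the function and parameter names, the fixed utility bounds and the example values are our own illustrative assumptions rather than part of the framework.

```python
def time_dependent_target(t: float, e: float, u_min: float = 0.0,
                          u_max: float = 1.0, k: float = 0.0) -> float:
    """Target utility at relative time t in [0, 1].

    e < 1 gives a Boulware (tough) negotiator, e = 1 a linear conceder,
    e > 1 a Conceder; k is the fraction conceded already at t = 0.
    """
    f_t = k + (1.0 - k) * (t ** (1.0 / e))          # concession curve F(t)
    return u_min + (u_max - u_min) * (1.0 - f_t)    # concede from u_max toward u_min

# Example: a Boulware negotiator (e = 0.2) stays near its best outcome until late.
print([round(time_dependent_target(t / 10, e=0.2), 2) for t in range(11)])
```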
Figure 1: Block diagram of the proposed framework showing the following blocks: the $n$ base negotiator blocks (blue), the $n$ trained strategy blocks (green), the classifier block (yellow), the strategy switching block (yellow) and the reviewer block (purple). The dashed lines connecting a single negotiator to a single strategy represent negotiator-strategy pairs. The components inside the solid box are utilised within a negotiation session whereas the blocks outside the solid box are used outside a negotiation session.

The second component is an opponent classifier that classifies the opponent's bidding behaviour with respect to the bidding behaviour of the base negotiators. After every negotiation round during a negotiation session, the classifier takes a sequence of opponent bids as input and accordingly assigns an estimated probability to each base negotiator. In this work, we have used only the sequence of the opponent's offers projected on the self utility axis as input, since our implementation showed similar accuracy with the additional information of self offers. A deep learning based approach, described in Section 4.2, is used for training the opponent classifier and the results in Section 6 show the versatility of such a classifier in different negotiation scenarios.

The third component is a switching mechanism that switches or combines the strategies learned against the base negotiators depending on the output of the opponent classifier. While the opponent classifier classifies the behaviour of the opponent negotiator at every time step, the switching mechanism has the added flexibility of changing decisions after certain intervals. Note that the strategy switching described in Section 4.3 is performed within a single negotiation session, and hence makes our negotiator framework adaptive.

All three aforementioned components are active components that function when a negotiation is underway. In contrast, the Reviewer is a passive component that does not actively take part in the negotiation process. Outside the negotiation session, the Reviewer provides a mechanism that decides if a new negotiator or a new strategy should be included in the framework. To the best of our knowledge, none of the meta-agent strategies proposed in the literature have a mechanism that can enhance their capability by evaluating and adding new strategies. This component is crucial to the design of our framework as it insures the framework against depreciation in the future. In Section 4.4, we provide the algorithm of the Reviewer. In the remainder of this section, we discuss the approaches for building each component of our framework.
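To make the interplay of the online components more concrete, the following is a minimal Python sketch of how the classifier, the switching weights and the negotiator-strategy pairs could be wired together at every negotiation step. All class, method and parameter names here are illustrative assumptions, and the Reviewer is omitted since it operates outside the negotiation session.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class NegotiatorStrategyPair:
    name: str                                    # identity of the base negotiator
    propose: Callable[[Sequence[float]], float]  # offer history -> next target utility
    accepts: Callable[[float, float], bool]      # (opponent offer utility, relative time) -> accept?

class AdaptiveAgent:
    """Skeleton wiring of the online components: classifier -> switcher -> strategy."""

    def __init__(self, pairs: List[NegotiatorStrategyPair], classify, weigh):
        self.pairs = pairs        # negotiator-strategy pairs (blue/green blocks in Figure 1)
        self.classify = classify  # opponent classifier: offer history -> class probabilities
        self.weigh = weigh        # maps probabilities to the beta weights of the switcher

    def act(self, opp_history: List[float], t: float):
        probs = self.classify(opp_history)                    # classify opponent behaviour
        best = max(range(len(self.pairs)), key=lambda i: probs[i])
        if opp_history and self.pairs[best].accepts(opp_history[-1], t):
            return "ACCEPT"
        betas = self.weigh(probs)                             # one-hot betas give a hard switch
        target = sum(b * p.propose(opp_history) for b, p in zip(betas, self.pairs))
        return ("OFFER", target)  # target utility, mapped to an outcome via the inverse utility map
```

Choosing `weigh` to return a one-hot vector reproduces pure switching, while smoother weights blend the base strategies, mirroring the role of the switching block in Figure 1.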
The prime components of our framework are the negotiation strategies trained against the base negotiators. The whole framework is based on the presumption that one can successfully learn an effective strategy against each of the base negotiators. An additional requirement is that the approach should facilitate the framework to perform well in a domain-independent manner. Due to the recent success of RL algorithms in training strategies of automated negotiators, we used Soft Actor-Critic [27, 28], a DRL algorithm, to train a bidding strategy against each base negotiator. For the acceptance strategy we adopted the approach of combined acceptance conditions as proposed in [13]. In contrast to the RLBOA framework in [17], no opponent model component is used while training against an opponent negotiator.
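The exact combination of acceptance conditions we use follows [13]; purely as an illustration of what a combined condition can look like, the sketch below accepts an offer that beats the agent's own upcoming bid, or, near the deadline, one that beats the best recently seen offer. The function name, the time threshold and the window are illustrative assumptions, not the actual conditions of [13].

```python
def combined_acceptance(u_opponent_offer: float, u_my_next_bid: float, t: float,
                        recent_opponent_utils: list, t_threshold: float = 0.98) -> bool:
    """Sketch of a combined acceptance condition in the spirit of [13].

    Accept if the opponent's offer is at least as good as our next planned bid,
    or, close to the deadline, if it is at least as good as the best offer
    received in the recent window.
    """
    if u_opponent_offer >= u_my_next_bid:
        return True
    if t >= t_threshold and recent_opponent_utils:
        return u_opponent_offer >= max(recent_opponent_utils)
    return False
```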
A major problem for developing a domain-independent negotiator framework is the fact that the outcome space $\Omega$ varies significantly across different negotiation scenarios. Moreover, the offers $\omega_i \in \Omega$ are usually non-numerical in nature, which again demands an approach to convert the offers to numerical values in a meaningful way. To overcome both of these problems, we took a similar approach to [17] and represented every outcome $\omega_i$ by $U_s(\omega_i)$, where $U_s$ is the self utility function. To avoid any information loss we considered the continuous outcome space rather than discretizing it. For our DRL approach, everything in the negotiation scenario including the opponent is considered as the environment. Let us denote the state and action in an environment as $s_t$ and $a_t$ respectively. The state consists of only the information about the offers and the action determines what utility value to bid next. For a negotiation session with time limit $T$, we defined our state space and action space as
$$s_t = \{\, t_r,\; U_s(\omega_s^{t-2}),\; U_s(\omega_o^{t-2}),\; U_s(\omega_s^{t-1}),\; U_s(\omega_o^{t-1}),\; U_s(\omega_s^{t}),\; U_s(\omega_o^{t}) \,\}$$
$$a_t = u_s^{t+1} \quad \text{such that} \quad u_r < u_s \leq 1,$$
where $t_r$ denotes the relative time, and $\omega_s^{t}$ and $\omega_o^{t}$ denote the offers by self and opponent at time step $t < T$ respectively. The self reservation value is denoted by $u_r$ and $u_s^{t+1}$ denotes the utility value of the next offer. To get the actual offer from the utility value we need an inverse map $U_s^{-1}: u_s \rightarrow \omega_s$ of the self utility function $U_s$, which can be a one-to-one or one-to-many mapping. One simple way of defining the inverse utility function is given in Equation (1):
$$U_s^{-1}(u_s) = \operatorname*{argmin}_{\omega} f(\omega), \quad \text{where } f(\omega) = \big(U_s(\omega) - u_s\big)^2 \;\; \forall \omega \in \Omega. \qquad (1)$$
The goal of the trained strategies is to maximise the average utility against the corresponding base negotiator. So the reward function $R$ is defined as
$$R(s_t, a_t, s_{t+1}) = \begin{cases} U_s(\omega_a), & \text{if there is an agreement } \omega_a,\\ -1, & \text{for no agreement and } s_{t+1} \text{ is a terminal state},\\ 0, & \text{otherwise.} \end{cases}$$
There is an immediate reward of 0 after every step in a negotiation session when the negotiation has not ended.

Soft Actor-Critic (SAC) [27, 28] is an off-policy algorithm based on maximum entropy reinforcement learning that aims to maximize both the expected reward and the policy's entropy. Policies with higher entropy have more randomness, which means that maximum entropy reinforcement learning learns a policy that has maximum randomness yet achieves a high reward. Standard reinforcement learning algorithms try to maximize the expected reward only. In contrast, the reason for maximizing the entropy of the policy is to improve both the algorithm's robustness to hyperparameters and its sample efficiency [27]. In automated negotiation, this randomness is desirable to reduce the opponent's ability to predict the behaviour of an agent and exploit this information. An optimal policy $\pi^*$ in entropy-regularized reinforcement learning can be expressed as
$$\pi^* = \operatorname*{arg\,max}_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \Big( R(s_t, a_t, s_{t+1}) + \alpha H\big(\pi(\cdot \mid s_t)\big) \Big)\right],$$
where $R$ is the reward function, $\gamma$ is the discount factor, $H$ denotes the entropy of the policy $\pi$ and $\alpha > 0$ weights the entropy term; $s_t$ and $a_t$ denote the state and action at time step $t$ respectively. Now, the corresponding action-value function $Q^{\pi}(s, a)$ for state $s$ and action $a$ can be expressed as
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H\big(\pi(\cdot \mid s_t)\big) \,\Bigg|\; s_0 = s,\, a_0 = a\right].$$
SAC concurrently learns a policy $\pi$ and two Q-functions.
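A minimal sketch of these definitions follows, assuming helper callables for the self utility function and the outcome space; the function names and the zero-padding convention for early steps are our assumptions, and the no-agreement penalty mirrors the reward definition above.

```python
import numpy as np

def build_state(t_rel, my_offers, opp_offers, u_self):
    """State s_t: relative time plus self utilities of the last three own/opponent offers."""
    def u(seq, i):
        return u_self(seq[i]) if len(seq) >= -i else 0.0   # zero-pad before enough offers exist
    return np.array([t_rel,
                     u(my_offers, -3), u(opp_offers, -3),
                     u(my_offers, -2), u(opp_offers, -2),
                     u(my_offers, -1), u(opp_offers, -1)], dtype=np.float32)

def inverse_utility(u_target, outcomes, u_self):
    """Equation (1): outcome whose self utility is closest to the requested value."""
    return min(outcomes, key=lambda w: (u_self(w) - u_target) ** 2)

def reward(agreement, u_self, no_agreement_penalty=-1.0):
    """Terminal reward: utility of the agreement, or a penalty otherwise; 0 mid-session."""
    return u_self(agreement) if agreement is not None else no_agreement_penalty
```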
In our implementation we used the approach proposed by Haarnoja et al. [28] where the entropy regularization parameter $\alpha$ is also a trainable parameter. The details of the hyperparameters for our implemented model are provided in the supplementary material.

Opponent modelling is a fundamental block of the BOA architecture proposed in [12]. Although we do not have an opponent modelling block while training the DRL based strategies, our framework contains a classifier that classifies an unknown opponent's bidding behaviour with respect to the base negotiators' behaviour in the framework. Our approach uses a 1D Convolutional Neural Network (1D-CNN) based classifier to classify an unknown opponent at every time step of a negotiation. In the following sections we describe our classifier's input/output and the model architecture.
The input to the classifier is a sequence of offers by the opponent projected to the self utility axis. As in Section 4.1, we denote each offer as $\omega_i$ and the self utility value of the offer as $U_s(\omega_i)$. The choice of using self utility values ensures that the input sequence is always numerical and can be provided directly to the classifier without any pre-processing. Another significant benefit is that it allows the framework to work across different negotiation scenarios without retraining the classifier for each domain. The output of the classifier is the estimated probabilities for each base negotiator. Let us denote the input to the classifier at the current negotiation time step $t$ as $\mathcal{I}_c^t$ and the output as $\mathcal{O}_c^t$. Then,
$$\mathcal{I}_c^t = \{U_s(\omega_i)\}_{i=t-k}^{i=t-1} \quad \text{where } k \in \mathbb{Z}^+ \text{ and } k > 1,$$
$$\mathcal{O}_c^t = [p_1, \cdots, p_n],$$
where $n$ is the number of base negotiators in the framework, $\mathbb{Z}^+$ denotes the set of positive integers and $p_i$ denotes the estimated probability of the opponent behaving as the $i$th negotiator. Values of the input array $\mathcal{I}_c^t$ are zero before the first opponent offer, that is, $U_s(\omega_i) = 0$ for $i < 1$. The window length $k$ is fixed before training. Moreover, the greater the value of $k$, the more information is provided to the classifier.

The input to the classifier at every time step, $\mathcal{I}_c^t$, is a time series. For such time series data, long short-term memory (LSTM) or recurrent neural network (RNN) architectures perform incredibly well. But a major problem with LSTMs and RNNs is that they require datasets of massive sizes and large computational resources for training. To overcome such difficulties, 1D-CNNs have shown great promise [33]. 1D-CNN based classifiers have been successfully used in structural damage detection [2, 5], fault detection in modular multilevel converters [34], and condition monitoring of rotating mechanical machine parts [23, 31]. In our classifier model, we have consecutive 1D-CNN layers followed by consecutive Dense layers. The depth of the model increases or decreases with the increase or decrease of the window length $k$ respectively. Moreover, a greater value of $k$ will result in a larger part of the history of opponent offers being considered by the classifier, which can reduce the model accuracy. On the other hand, a very small value of $k$ will make the model myopic and error prone in classification. Some hyperparameter tuning is therefore required for adjusting the value of the window length. The model architecture and the hyperparameters are provided in the supplementary material.

Depending on the output of the classifier $\mathcal{O}_c^t$, the switching component switches or combines the strategies to take the next action. The algorithm for the switching mechanism is provided in Algorithm 1. Although the opponent classification is done at every time step, the approach of choosing the next offer need not change after every time step. In Algorithm 1 the parameters $\beta_i$, where $i \in [1, n]$, tune the algorithm from a hard switcher to a combination mechanism. The algorithm becomes a pure switching algorithm if $\beta_i = 1$ and $\beta_k = 0 \;\; \forall k \neq i$.

Algorithm 1: Algorithm for strategy switching
Input: $\mathcal{O}_c^t = \{p_1, \cdots, p_n\}$ from the opponent classifier
Data: $S_b = \{s_1, \cdots, s_n\}$, the set of base strategies corresponding to the set of base negotiators $N_b = \{N_1, \cdots, N_n\}$
Output: Action: next offer or Accept
  Choose an initial strategy $s_{init} \in S_b$ and set $S_{action} = s_{init}$
  $i = \operatorname{argmax}(\mathcal{O}_c^t)$, where $N_i$ denotes the base negotiator with the highest classification probability
  if the action by strategy $s_i$ is Accept then
    Accept the opponent's offer
  else
    $\omega^{t+1} = U_s^{-1}\!\left(\sum_{k=1}^{n} \beta_k\, u_{s_k}\right)$, where $u_{s_k}$ is the utility value proposed by strategy $s_k$ for the next time step and $\beta_k$ is its weight parameter
  end
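A compact sketch of the offer-construction step of Algorithm 1, assuming that each base strategy exposes the target utility it would bid next and its own acceptance check; the interface names are illustrative assumptions.

```python
import numpy as np

def next_action(probs, strategies, betas, opponent_offer_util, inverse_utility):
    """One step of Algorithm 1.

    probs       -- classifier output O_c^t, one probability per base negotiator
    strategies  -- objects exposing next_target_utility() and would_accept(u)
    betas       -- weight parameters; a one-hot vector gives a pure switcher
    """
    i = int(np.argmax(probs))                         # most probable base negotiator
    if strategies[i].would_accept(opponent_offer_util):
        return "ACCEPT"
    targets = np.array([s.next_target_utility() for s in strategies])
    combined = float(np.dot(betas, targets))          # weighted combination of target utilities
    return ("OFFER", inverse_utility(combined))       # map back to an outcome via U_s^{-1}
```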
This component enables the addition of new negotiators, new strategies or both to the RL-agent instantiated by our framework. To show the basic operation of this component, we implemented an evaluation based approach for the Reviewer. The algorithm is provided in Algorithm 2, where the parameters $\alpha$ and $\beta$ are threshold parameters. When a new negotiator $\mathcal{N}_{new}$ is introduced to the Reviewer, first a new strategy $\mathcal{S}_{train}$ is trained against it. Subsequently, $\mathcal{S}_{train}$ and the RL-agent are evaluated against $\mathcal{N}_{new}$. Finally, depending on the parameter $\alpha$, if the evaluation with $\mathcal{S}_{train}$ is better in comparison with the RL-agent, then the Reviewer will provide confirmation, $\mathcal{S}_{train}$ will be added to the pool of strategies and the classifier will be retrained with a new class $\mathcal{N}_{new}$. Moreover, the newly trained strategy $\mathcal{S}_{train}$ is cross-evaluated with the base negotiators and compared with the base strategies. Depending on the evaluation and the parameter $\beta$, base strategies may be updated with $\mathcal{S}_{train}$. In this manner the Reviewer provides a mechanism for gradual improvement of the agent.

Algorithm 2: Algorithm for the Reviewer component
Input: New strategy $\mathcal{S}_{new}$ or new negotiator $\mathcal{N}_{new}$
Data: $S_b = \{s_1, \cdots, s_n\}$, the set of base strategies corresponding to the set of base negotiators $N_b = \{N_1, \cdots, N_n\}$; Eval, a function that provides an evaluation score of a strategy against a negotiator
Function StrategyEvaluation($\mathcal{S}_{test}$, $N_i$, $s_i$, $\beta$):
  $\hat{e}_k = Eval(N_i, \mathcal{S}_{test})$
  $e_k = Eval(N_i, s_i)$
  if $\hat{e}_k \geq \beta \cdot e_k$ then
    return Accept and replace $s_i$ with $\mathcal{S}_{test}$
  else
    return Reject
  end
if Input is $\mathcal{N}_{new}$ then
  Train new strategy $\mathcal{S}_{train}$ against $\mathcal{N}_{new}$
  $e_f = Eval(\mathcal{N}_{new}, \text{RL-agent})$
  $e_s = Eval(\mathcal{N}_{new}, \mathcal{S}_{train})$
  if $e_s \geq \alpha \cdot e_f$ then Accept $\mathcal{N}_{new}$ and $\mathcal{S}_{train}$ else Reject end
  for $k \in [1, n]$: StrategyEvaluation($\mathcal{S}_{train}$, $N_k$, $s_k$, $\beta$)
end
if Input is $\mathcal{S}_{new}$ then
  for $k \in [1, n]$: StrategyEvaluation($\mathcal{S}_{new}$, $N_k$, $s_k$, $\beta$)
end
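A minimal sketch of the Reviewer decision logic of Algorithm 2 for a newly introduced negotiator, assuming callables for strategy training and for the Eval score used above; the names are illustrative and classifier retraining is abstracted away.

```python
def review_new_negotiator(n_new, rl_agent, base, evaluate, train, alpha, beta):
    """Sketch of Algorithm 2 for a new negotiator N_new.

    base     -- list of (negotiator, strategy) pairs currently in the framework
    evaluate -- Eval(negotiator, strategy or agent) -> average utility obtained
    train    -- trains a fresh strategy against a negotiator
    """
    s_train = train(n_new)                            # learn a strategy against N_new
    accepted = evaluate(n_new, s_train) >= alpha * evaluate(n_new, rl_agent)
    # cross-evaluate the new strategy against the existing base negotiators
    for k, (n_k, s_k) in enumerate(base):
        if evaluate(n_k, s_train) >= beta * evaluate(n_k, s_k):
            base[k] = (n_k, s_train)                  # replace the weaker base strategy
    if accepted:
        base.append((n_new, s_train))                 # N_new joins the pool; classifier retrained
    return accepted, base
```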
5 EXPERIMENTAL SETUP
The goals of our experiments are two-fold. First, we introduce our RL-agent based on the proposed framework with a small number of base negotiators and show that the RL-agent, while generalizing over negotiation scenarios, performs on average better than state-of-the-art ANAC winning agents. Secondly, we show the value of the reviewer mechanism by evaluating and adding new negotiators and subsequently show the improvement of our RL-agent.

We analyzed our proposed system in 18 domains of ANAC 2013, with cardinality of the outcome space ranging from 3 to 56700 and opposition [10] ranging from 0.002 to 0.606, as shown in Table 1. All the negotiation experiments are conducted using the NEGotiation MultiAgent System (NegMAS) platform [41, 50]. For the purpose of calculating benchmarks and evaluating performance we used the given preference profiles of ANAC 2013 for each domain. Among the negotiation settings, the reservation value is kept zero and the discount factor is ignored for all negotiations. Moreover, we used min-max normalisation for normalising the utility values between 0 and 1. For performance comparisons, average utility values are calculated on negotiation data obtained from 50 to 100 negotiations between a pair of agents for each negotiation scenario.

Table 1: Overview of the ANAC 2013 domains. We considered 18 domains with a pair of utility functions.

Domain              Opposition   Outcome Space
Acquisition         0.104        384
Animal              0.15         1152
Camera              0.076        3600
Coffee              0.279        112
Defensive Charms    0.193        36
Dog Choosing        0.002        270
Fifty Fifty         0.498        11
House Keeping       0.13         384
Ice-cream           0.01         720
Kitchen             0.219        15625
Laptop              0.076        27
Lunch               0.246        3840
Nice Or Die         0.177        3
Outfit              0.049        128
Planes              0.606        27
Smart Phone         0.022        12000
Ultimatum           0.319        9
Wholesaler          0.128        56700

For training a bidding strategy against a given negotiator we generate a random utility function for the opponent and use a fixed self utility function. This ensures that the maximum entropy reinforcement learning algorithm can learn a stochastic policy that performs well in varied negotiation scenarios. Additionally, the training of strategies is done in a single domain and evaluations are done in all 18 domains. For our experiments all strategies are trained on the Camera domain and tested across the other domains. Moreover, for simplicity, all hyperparameters of the SAC algorithm were kept fixed while training against different negotiators. Training of the RL-agent was done using the TF-Agents [45] library.

For the opponent classifier we used a window length of 20 and the model consists of three consecutive 1D-CNN layers followed by two Dense layers. The training data for the classifier was generated by multiple simulations of the base negotiators while training the RL strategies against them. Additionally, the classifier is trained only on the data generated in the Camera domain while the same classifier has been used in the evaluation for all other domains. The training of the classifier was done using the TensorFlow library [1].

Evaluations of our proposed framework are done by first instantiating an RL-agent with a small number of base negotiators $n$ and fixed values of the Reviewer threshold parameters $\alpha$ and $\beta$.

6 EVALUATION
In this section we present the results according to the experimental setup of Section 5. First we present the detailed results of our experiments with 7 ANAC winning agents. Next we present the evaluations of the Reviewer mechanism. Finally, we present a summarised result against 47 GENIUS agents. For comparison, we created three different benchmarks and then compared the performance of the RL-agent against each of the benchmarks. In the following, $U^{a}_{a \times b:d}$ denotes the average utility achieved by agent $a$ against agent $b$ in domain $d$ over 100 runs with two different utility functions, $A$ and $D$ denote the sets of agents and domains respectively over which the benchmark is calculated, and $|\cdot|$ denotes the cardinality of a set.

(1) Self utility benchmark: the score of an agent $a$, $S_a = \frac{1}{|A| \times |D|} \sum_{d \in D} \sum_{b \in A} U^{a}_{a \times b:d}$, is the mean utility acquired by agent $a$ when negotiating with every agent $b \in A$ in all negotiation scenarios.
(2) Utility against opponent benchmark: the score against an agent $a$, $O_a = \frac{1}{(|A|-1) \times |D|} \sum_{d \in D} \sum_{b \in A \setminus a} U^{b}_{a \times b:d}$, denotes the mean utility acquired by the agents $b \in A \setminus a$ while negotiating with agent $a$ in all negotiation scenarios.
(3) Domain utility benchmark: the score of a domain $d$, $D_d = \frac{1}{|A|^2} \sum_{a \in A} \sum_{b \in A} U^{a}_{a \times b:d}$, denotes the mean utility obtained by all agents $a \in A$ in domain $d$, while negotiating with every agent $b \in A$.
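Assuming the simulation results are stored as a mapping from (agent $a$, opponent $b$, domain $d$) to the average utility $U^{a}_{a \times b:d}$, the three benchmark scores can be computed as in the following sketch; the data layout is an assumption for illustration.

```python
from itertools import product

def self_utility_benchmark(U, agents, domains, a):
    """S_a: mean utility of agent a over all opponents and domains."""
    return sum(U[(a, b, d)] for d, b in product(domains, agents)) / (len(agents) * len(domains))

def utility_against_opponent_benchmark(U, agents, domains, a):
    """O_a: mean utility obtained by all other agents when facing agent a."""
    others = [b for b in agents if b != a]
    return sum(U[(b, a, d)] for d, b in product(domains, others)) / (len(others) * len(domains))

def domain_utility_benchmark(U, agents, domains, d):
    """D_d: mean utility obtained in domain d over all agent pairings."""
    return sum(U[(a, b, d)] for a, b in product(agents, agents)) / (len(agents) ** 2)
```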
Before evaluating our agent, we first calculate the benchmark scores with 7 ANAC winning agents and then compare the score of our RL-agent with each benchmark, as shown in Figure 2. The agents selected are Atlas3, ParsAgent, RandomDance, ParsCat, AgentYX, Caduceus and PonpokoAgent [6, 8, 25]: Atlas3 (2015 winner), ParsAgent (2015 2nd position), RandomDance (2015 3rd position), Caduceus (2016 winner), ParsCat (2016 2nd position), AgentYX (2016 2nd position) and PonpokoAgent (2017 winner). It is clearly visible from Figure 2 that our RL-agent outperformed all 7 agents in all the benchmarks. The error bars in Figure 2a and Figure 2b denote the standard deviation of the average utilities obtained over the domains. In the comparison using the self utility benchmark, the RL-agent performed 25% better than the agent which acquired the highest average utility in that benchmark, as shown in Figure 2a. Overall, our agent's score was 37% higher than the average scores of all other agents. Glancing at the error bars, one can see that the RL-agent has the minimum standard deviation among all agents, which shows the robustness of the agent in varied negotiation domains. In the comparison using the utility against opponent benchmark, the average utility obtained by the RL-agent exceeded the benchmark scores by a range of 11% to 50%, as shown in Figure 2b. This shows that the RL-agent outperforms the average score of the opponents against each agent. Proceeding to the comparison with the domain benchmark illustrated in Figure 2c, one can clearly see that the score of the RL-agent is better than the highest score by any agent in 13 out of 18 domains. In fact the average score of the RL-agent exceeds the utility benchmark by a range of 4% to 450%.
Figure 2: (a) Comparison of the RL-agent with the self utility benchmark consisting of 7 ANAC winning agents; it also shows the performance of the RL-agent with n = 1, n = 2 and n = 3. (b) Comparison of the RL-agent with the utility against opponent benchmark consisting of 7 ANAC winning agents. (c) Comparison of the RL-agent with the domain benchmark consisting of 18 domains.

Although the final comparison shown in Figure 2 is the instance of the RL-agent with three negotiator-strategy pairs, the framework was first initialized with only one negotiator, that is, with the Random negotiator. One RL based strategy was trained against it and included as a negotiator-strategy pair. Opponent classifier training was not needed as the number of base negotiators was n = 1. As expected, the performance of this initial RL-agent was poor, as shown in Figure 2a. Next we introduced a simple time dependent agent, the Boulware agent [24], to the Reviewer. After receiving acceptance from the Reviewer, a new RL based strategy was added to the strategy pool and the opponent classifier was trained for n = 2. In a similar way, a new behavioural strategy based agent, Naive tit-for-tat, and the corresponding RL based strategy were included in the RL-agent and the classifier was trained for n = 3. Furthermore, the strategy trained against the Random negotiator was replaced by the RL strategy trained against Naive tit-for-tat as per the evaluation of the Reviewer. Moving ahead, we introduced other agents from the pool of 7 ANAC agents to the Reviewer, but the Reviewer rejected the inclusion of any additional agents. It is to be noted that all evaluations by the Reviewer were also restricted to a single domain only (the Camera domain). The performance of the RL-agent with different numbers of base negotiators is shown in Figure 2a. It can be clearly seen that the performance of the RL-agent increased with the addition of strategies, which shows the significance of the Reviewer mechanism in our framework. To keep the visualisations simple, further comparisons with other benchmarks, as shown in Figure 2b and Figure 2c, are only shown with the number of base negotiators n = 3.

To show the versatility of the proposed framework, we chose the already created RL-agent with only 3 base negotiators and evaluated it against the benchmark scores of 47 GENIUS agents, which include ANAC competitors from 2015 to 2017. The results are illustrated in Figure 4 and Figure 3, where the error bars denote the standard deviation over the domains. In the comparison using the self utility benchmark, our RL-agent performed better than the other agents in the range of 11% to 139%. The average improvement is 37.4%, with an improvement of more than 50% against 12 agents, as shown in Figure 4a. The relatively low standard deviation marks the robustness of our agent in different domains. Next, the comparison with the utility against opponent benchmark, shown in Figure 3, reveals that the RL-agent came out ahead in 40 out of 47 comparisons. Individual performance improvement ranges from 1% to 428%, with an improvement of over 50% against 19 agents and 100% against 5 agents. Finally, the comparison with the domain benchmark is shown in Figure 4b, where it is visible that the score of the RL-agent is better than the highest score by any agent in 13 out of 18 domains. Additionally, the RL-agent outperformed the average benchmark scores in all the domains. The performance improvement over the benchmark scores in each domain ranges from 10% to 488%, with an improvement of at least 50% in 7 domains. Finally, we calculate the statistical significance of the differences. For the utility differences to be statistically significant, following Bonferroni's conservative multiple-comparisons correction, the p-values of the t-tests in the self utility benchmark, the utility against opponent benchmark and the domain benchmark should be less than 0.0011, 0.0011 and 0.0028 respectively. It turns out that for the self utility benchmark, the differences of utility in 30 out of 47 comparisons were statistically significant, whereas for the utility against opponent benchmark 32 out of 47 were statistically significant. In the case of the domain benchmark, all differences of utilities were statistically significant.
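These thresholds are consistent with a standard Bonferroni correction: assuming a base significance level of 0.05 (an assumption on our part, as the base level is not stated above), dividing by the 47 per-agent comparisons of the first two benchmarks and the 18 per-domain comparisons of the third gives
$$\frac{0.05}{47} \approx 0.0011, \qquad \frac{0.05}{18} \approx 0.0028.$$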
Figure 3: Comparison of the RL-agent with the utility against opponent benchmark consisting of 47 GENIUS agents in 18 domains.
Figure 4: Comparison of the performance of the RL-agent (a) with the self utility benchmark consisting of 47 GENIUS agents and (b) with the domain benchmark consisting of 18 domains.
7 CONCLUSIONS AND FUTURE WORK
In this work we proposed an autonomous negotiating agent framework with four components: negotiators paired with trained strategies, an opponent classifier, a switching mechanism and a reviewer mechanism. The strategies included are RL based strategies, whereas strategy switching depends on the classification probabilities of the opponent classifier. The proposed opponent classifier classifies the opponent's bidding behaviour with respect to the base negotiators' behaviour at every time step, thus allowing the agent to switch or combine strategies within a single negotiation session. These functionalities together give our RL-agent versatility even with a small pool of base negotiators, as illustrated in our evaluations. Furthermore, the reviewer mechanism helps in the decision making of adding more negotiators and strategies to the existing pool of base negotiators and strategies. This helps in incremental improvement of the RL-agent and also restricts unnecessary addition of base entities.

In our experimental setup, all training and evaluations were done by removing the discount factor and keeping the reservation value at zero. It would be interesting to obtain and compare the results with varying discount factors and reservation values. Moreover, while training bidding strategies, we have used the same acceptance strategy against all negotiators. At the same time, it has been noticed that the performance of a trained strategy depends on the choice of acceptance strategy. So our future work involves training the acceptance strategy together with the bidding strategy. Also, to show the concept of the reviewer mechanism, we have implemented an evaluation based reviewer. Another interesting direction could be using unsupervised clustering algorithms on the negotiators' bidding behaviour to differentiate a new negotiator from the pool of negotiators. That would remove the additional step of training a new strategy each time for evaluation by the reviewer and at the same time would give us a concrete picture about the type of base negotiators that make the RL-agent perform better across various negotiation scenarios and against varied opponents.
REFERENCES
Journal of Sound and Vibration
Encyclopedia of measurement and statistics
Advances in experimental social psychology. Vol. 2. Elsevier, Amsterdam, Netherlands, 267–299.
[5] Onur Avci, Osama Abdeljaber, Serkan Kiranyaz, and Daniel Inman. 2020. Convolutional neural networks for real-time and wireless damage detection. In Dynamics of Civil Structures, Volume 2. Springer, Berlin, Germany, 129–136.
[6] Reyhan Aydogan. 2016. ANAC2016 - Automated Negotiating Agents Competition 2016. TU Delft. Retrieved October 8, 2020 from http://web.tuat.ac.jp/~katfuji/ANAC2016/
[7] Reyhan Aydoğan, David Festen, Koen V Hindriks, and Catholijn M Jonker. 2017. Alternating offers protocols for multilateral negotiation. In Modern Approaches to Agent-based Complex Automated Negotiation. Springer, Berlin, Germany, 153–167.
[8] Reyhan Aydoğan, Katsuhide Fujita, Tim Baarslag, Catholijn M Jonker, and Takayuki Ito. 2018. ANAC 2017: Repeated multilateral negotiation league. In International Workshop on Agent-Based Complex Automated Negotiation. Springer, Stockholm, Sweden, 101–115.
[9] Tim Baarslag. 2014. What to bid and when to stop. Ph.D. Dissertation. Delft University of Technology.
[10] Tim Baarslag, Katsuhide Fujita, Enrico H Gerding, Koen Hindriks, Takayuki Ito, Nicholas R Jennings, Catholijn Jonker, Sarit Kraus, Raz Lin, Valentin Robu, et al. 2013. Evaluating practical negotiating agents: Results and analysis of the 2011 international competition. Artificial Intelligence 198 (2013), 73–103.
[11] Tim Baarslag, Mark JC Hendrikx, Koen V Hindriks, and Catholijn M Jonker. 2016. Learning about the opponent in automated bilateral negotiation: a comprehensive survey of opponent modeling techniques. Autonomous Agents and Multi-Agent Systems 30, 5 (2016), 849–898.
[12] Tim Baarslag, Koen Hindriks, Mark Hendrikx, Alexander Dirkzwager, and Catholijn Jonker. 2014. Decoupling negotiating agents to explore the space of negotiation strategies. In Novel Insights in Agent-based Complex Automated Negotiation. Springer, Berlin, Germany, 61–83.
[13] Tim Baarslag, Koen Hindriks, and Catholijn Jonker. 2014. Effective acceptance conditions in real-time automated negotiation. Decision Support Systems 60 (2014), 68–77.
[14] Tim Baarslag, Michael Kaisers, Enrico H. Gerding, Catholijn M. Jonker, and Jonathan Gratch. 2017. When Will Negotiation Agents Be Able to Represent Us? The Challenges and Opportunities for Autonomous Negotiators. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. IJCAI, Melbourne, Australia, 4684–4690. https://doi.org/10.24963/ijcai.2017/653
[15] Pallavi Bagga, Nicola Paoletti, Bedour Alrayes, and Kostas Stathis. 2020. A Deep Reinforcement Learning Approach to Concurrent Bilateral Negotiation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, Christian Bessiere (Ed.). International Joint Conferences on Artificial Intelligence Organization, Yokohama, Japan, 297–303. https://doi.org/10.24963/ijcai.2020/42 Main track.
[16] Pallavi Bagga, Nicola Paoletti, and Kostas Stathis. 2020. Learnable Strategies for Bilateral Agent Negotiation over Multiple Issues. arXiv:2009.08302 [cs.MA]
[17] Jasper Bakker, Aron Hammond, Daan Bloembergen, and Tim Baarslag. 2019. RLBOA: A Modular Reinforcement Learning Framework for Autonomous Negotiating Agents. In AAMAS. IFAAMAS, Montreal, Canada, 260–268.
[18] Max H Bazerman, George F Loewenstein, and Sally Blount White. 1992. Reversals of preference in allocation decisions: Judging an alternative versus choosing among alternatives. Administrative Science Quarterly 37, 2 (1992), 220–240.
[19] Ho-Chun Herbert Chang. 2020. Multi-Issue Bargaining With Deep Reinforcement Learning. arXiv:2002.07788 [cs.MA]
[20] Lihong Chen, Hongbin Dong, Qilong Han, and Guangzhe Cui. 2013. Bilateral multi-issue parallel negotiation model based on reinforcement learning. In International Conference on Intelligent Data Engineering and Automated Learning. Springer, Hefei, China, 40–48.
[21] Nirmal Choudhary and Kamal Kant Bharadwaj. 2019. Evolutionary learning approach to multi-agent negotiation for group recommender systems. Multimedia Tools and Applications 78, 12 (2019), 16221–16243.
[22] Giorgio Coricelli and Rosemarie Nagel. 2009. Neural correlates of depth of strategic reasoning in medial prefrontal cortex. Proceedings of the National Academy of Sciences.
[23] Journal of Signal Processing Systems 91, 2 (2019), 179–189.
[24] Peyman Faratin, Carles Sierra, and Nick R Jennings. 1998. Negotiation decision functions for autonomous agents. Robotics and Autonomous Systems 24, 3-4 (1998), 159–182.
[25] Katsuhide Fujita, Reyhan Aydoğan, Tim Baarslag, Koen Hindriks, Takayuki Ito, and Catholijn Jonker. 2017. The sixth automated negotiating agents competition (ANAC 2015). In Modern Approaches to Agent-based Complex Automated Negotiation. Springer, Berlin, Germany, 139–151.
[26] Taha D Güneş, Emir Arditi, and Reyhan Aydoğan. 2017. Collective voice of experts in multilateral negotiation. In International Conference on Principles and Practice of Multi-Agent Systems. Springer, Nice, France, 450–458.
[27] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290 [cs.LG]
[28] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. 2019. Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905 [cs.LG]
[29] Koen Hindriks and Dmytro Tykhonov. 2008. Opponent modelling in automated multi-issue negotiation using bayesian learning. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1. IFAAMAS, Estoril, Portugal, 331–338.
[30] Litan Ilany and Ya'akov Gal. 2016. Algorithm selection in bilateral negotiation. Autonomous Agents and Multi-Agent Systems 30, 4 (2016), 697–723.
[31] Turker Ince, Serkan Kiranyaz, Levent Eren, Murat Askar, and Moncef Gabbouj. 2016. Real-time motor fault detection by 1-D convolutional neural networks. IEEE Transactions on Industrial Electronics 63, 11 (2016), 7067–7075.
[32] Catholijn M Jonker, Reyhan Aydogan, Tim Baarslag, Katsuhide Fujita, Takayuki Ito, and Koen Hindriks. 2017. Automated negotiating agents competition (ANAC). In Thirty-First AAAI Conference on Artificial Intelligence. AAAI, San Francisco, California, USA, 5070–5072.
[33] Serkan Kiranyaz, Onur Avci, Osama Abdeljaber, Turker Ince, Moncef Gabbouj, and Daniel J. Inman. 2019. 1D Convolutional Neural Networks and Applications: A Survey. arXiv:1905.03554 [eess.SP]
[34] Serkan Kiranyaz, Adel Gastli, Lazhar Ben-Brahim, Nasser Al-Emadi, and Moncef Gabbouj. 2018. Real-time fault detection and identification for MMC using 1-D convolutional neural networks. IEEE Transactions on Industrial Electronics 66, 11 (2018), 8760–8771.
[35] Raymond YK Lau, Maolin Tang, On Wong, Stephen W Milliner, and Yi-Ping Phoebe Chen. 2006. An evolutionary learning approach for adaptive negotiation agents. International Journal of Intelligent Systems 21, 1 (2006), 41–72.
[36] Kevin Leyton-Brown, Eugene Nudelman, Galen Andrew, Jim McFadden, and Yoav Shoham. 2003. A portfolio approach to algorithm selection. In IJCAI, Vol. 3. Morgan Kaufmann Publishers, Acapulco, Mexico, 1542–1543.
[37] Raz Lin, Sarit Kraus, Tim Baarslag, Dmytro Tykhonov, Koen Hindriks, and Catholijn M. Jonker. 2014. Genius: An Integrated Environment for Supporting the Design of Generic Automated Negotiators. Computational Intelligence.
[38] ECAI 2006: 17th European Conference on Artificial Intelligence, August 29-September 1, 2006, Riva Del Garda, Italy; Including: Prestigious Applications of Intelligent Systems (PAIS 2006); Proceedings, Vol. 141. IOS Press, Riva Del Garda, Italy, 270.
[39] Raz Lin, Sarit Kraus, Jonathan Wilkenfeld, and James Barry. 2008. Negotiating with bounded rational agents in environments with incomplete information using an automated agent. Artificial Intelligence.
[40] Proceedings International Conference on Multi Agent Systems (Cat. No. 98EX160). IEEE, Paris, France, 182–189.
[41] Yasser Mohammad. 2020. NEGotiation MultiAgent System (NegMAS). https://github.com/yasserfarouk/negmas [Online; accessed 2020-09-21].
[42] Howard Raiffa, John Richardson, David Metcalfe, et al. 2002. Negotiation Analysis: The Science and Art of Collaborative Decision Making. Harvard University Press, Harvard, US.
[43] Yousef Razeghi, Celal Ozan Berk Yavuz, and Reyhan Aydoğan. 2020. Deep reinforcement learning for acceptance strategy in bilateral negotiations. Turkish Journal of Electrical Engineering & Computer Sciences 28, 4 (2020), 1824–1840.
[44] Frederik Schadd, Sander Bakkes, and Pieter Spronck. 2007. Opponent Modeling in Real-Time Strategy Games. In GAMEON. GAMEON, Bologna, Italy, 61–70.
[45] Sergio Guadarrama, Anoop Korattikara, et al. 2018. TF-Agents: A library for Reinforcement Learning in TensorFlow. https://github.com/tensorflow/agents [Online; accessed 25-June-2019].
[46] M. Sridharan and G. Tesauro. 2000. Multi-agent Q-learning and regression trees for automated pricing decisions. In Proceedings Fourth International Conference on MultiAgent Systems. IEEE, Boston, MA, USA, 447–448.
[47] Vishal Sunder, Lovekesh Vig, Arnab Chatterjee, and Gautam Shroff. 2018. Prosocial or Selfish? Agents with different behaviors for Contract Negotiation using Reinforcement Learning. arXiv:1809.07066 [cs.LG]
[48] Gerald Tesauro. 2000. Pricing in agent economies using neural networks and multi-agent Q-learning. In Sequence Learning. Springer, Berlin, Germany, 288–307.
[49] Gerald Tesauro and Jeffrey O Kephart. 2002. Pricing in agent economies using multi-agent Q-learning. Autonomous Agents and Multi-Agent Systems 5, 3 (2002), 289–304.
[50] Yasser Mohammad, Shinji Nakadai, and Amy Greenwald. 2020. NegMAS: A platform for Situated Negotiations. International Conference on Principles and Practice of Multi-Agent Systems (2020).
[51] Dajun Zeng and Katia Sycara. 1998. Bayesian learning in negotiation. International Journal of Human-Computer Studies 48, 1 (1998), 125–141.

Supplementary Material: An Autonomous Negotiating Agent Framework with Reinforcement Learning based Strategies and Adaptive Strategy Switching Mechanism
As part of the supplementary material we provide additional information to make our work reproducible. In our proposed method we create several policies which are trained with reinforcement learning algorithms. Additionally, we also train a classifier with a 1D-CNN based deep learning technique that is capable of classifying an unknown opponent and enables the agent to switch strategies within a single negotiation session. In this material we first provide the detailed implementation and the corresponding hyperparameters for the Soft Actor-Critic (SAC) algorithm [2, 3] used for training bidding strategies against the base negotiators. Secondly, we lay out the model architecture and the corresponding hyperparameters for the 1D-CNN based opponent classifier.

The hyperparameters of the SAC algorithm used for our experiments while training strategies are provided in Table 2. We used the TF-Agents [5] library for the implementation of the SAC algorithm. The first version of SAC [2] uses a fixed entropy temperature $\alpha$. Though the performance of the original SAC was quite impressive, $\alpha$ turned out to be a very sensitive hyperparameter. To remedy this, in the second version of SAC [3] $\alpha$ is converted to a trainable parameter, and we have used this version of the SAC algorithm. Furthermore, it is to be noted that in SAC two critic networks are created with the same structure, each one with its own layers and weights. The second one is usually known as the target critic network. After every target update period train steps, the weights from the critic network are copied, with smoothing via the target update $\tau$, to the target critic network. Additionally, the NEGotiation MultiAgent System (NegMAS) [4, 6] platform is used for the purpose of creating the negotiation environment on which the reinforcement learning algorithms were trained. All other negotiation simulations for benchmarking and results are also done using the same platform.

The opponent classifier proposed in our work is a 1D-CNN based classifier and is trained using the TensorFlow library [1]. The model has three consecutive 1D-CNN layers, followed by two fully connected layers as shown in Figure 1. The parameter $k$ denotes the window length, or the length of the input sequence to the classifier, and the parameter $n$ denotes the number of classes or base negotiators. We have separately tested the classifier with the number of base negotiators ranging from 2 to 7. For our final results we only required the models with 2 and 3 classes. Figure 1 illustrates a model with $k = 20$ and $n = 7$, that is, an input consisting of 20 opponent offers and an output consisting of a set of probabilities for each of the seven classes. The hyperparameters of the opponent classifier are given in Table 1.
Figure 1: Classifier model having three consecutive 1D-CNN layers followed by two Dense layers, for k = 20 and n = 7.

Table 1: Hyperparameters for the opponent classifier
Hyperparameter                      Value
Training epochs                     30-40
1D-CNN layer activation function    relu
1D-CNN layer filter size            32
1D-CNN kernel size                  5
Dense layer activation function     relu
Final layer activation function     softmax
Learning rate                       0.0001
Batch size                          50
Loss function                       Categorical Crossentropy
Optimizer                           Adam Optimizer
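A sketch of how a classifier with these hyperparameters could be assembled in Keras follows; the hidden Dense layer width is not specified in Table 1 and the value used below is an assumption.

```python
import tensorflow as tf

def build_classifier(k: int = 20, n: int = 7) -> tf.keras.Model:
    """1D-CNN opponent classifier: three Conv1D layers followed by two Dense layers."""
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(k, 1)),     # k opponent offers projected to self utility
        tf.keras.layers.Conv1D(32, 5, activation="relu"),
        tf.keras.layers.Conv1D(32, 5, activation="relu"),
        tf.keras.layers.Conv1D(32, 5, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),        # hidden width is an assumption
        tf.keras.layers.Dense(n, activation="softmax"),      # one probability per base negotiator
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_classifier()
model.summary()
```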
REFERENCES

Table 2: Hyperparameters for the SAC algorithm

Hyperparameter    Value    Description
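Since the values of Table 2 are not reproduced here, the following sketch only shows how a SAC agent with a trainable entropy temperature can be constructed with the TF-Agents library for our 7-dimensional state and 1-dimensional action; all numeric hyperparameters shown are placeholders, not the values we actually used.

```python
import tensorflow as tf
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.networks import actor_distribution_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# 7-dimensional state (relative time + six projected utilities), 1-dimensional action (next utility).
obs_spec = tensor_spec.BoundedTensorSpec((7,), tf.float32, minimum=0.0, maximum=1.0)
act_spec = tensor_spec.BoundedTensorSpec((1,), tf.float32, minimum=0.0, maximum=1.0)
time_step_spec = ts.time_step_spec(obs_spec)

critic_net = critic_network.CriticNetwork(
    (obs_spec, act_spec), joint_fc_layer_params=(256, 256))
actor_net = actor_distribution_network.ActorDistributionNetwork(
    obs_spec, act_spec, fc_layer_params=(256, 256))

agent = sac_agent.SacAgent(
    time_step_spec, act_spec,
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
    critic_optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
    alpha_optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),  # entropy temperature is trainable
    target_update_tau=0.005,    # smoothing toward the target critic network
    target_update_period=1,
    gamma=0.99)
agent.initialize()
```

The environment driving this agent is created with NegMAS [4, 6], with the state, action and reward defined as in Section 4.1 of the main paper.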