Multi-Agent Reinforcement Learning in a Realistic Limit Order Book Market Simulation
Michaël Karpe∗, Jin Fang∗, Zhongyao Ma†, Chen Wang†
University of California, Berkeley
Berkeley, California
[email protected]

ABSTRACT
Optimal order execution is widely studied by industry practitioners and academic researchers because it determines the profitability of investment decisions and high-level trading strategies, particularly those involving large volumes of orders. However, complex and unknown market dynamics pose enormous challenges for the development and validation of optimal execution strategies. We propose a model-free approach by training Reinforcement Learning (RL) agents in a realistic market simulation environment with multiple agents. First, we configured a multi-agent historical order book simulation environment for execution tasks based on an Agent-Based Interactive Discrete Event Simulation (ABIDES) [5]. Second, we formulated the problem of optimal execution in an RL setting in which an intelligent agent can make order execution and placement decisions based on market microstructure trading signals in HFT. Third, we developed and trained an RL execution agent using the Double Deep Q-Learning (DDQL) algorithm in the ABIDES environment. In some scenarios, our RL agent converges towards a Time-Weighted Average Price (TWAP) strategy. Finally, we evaluated the simulation with our RL agent by comparing it against actual market Limit Order Book (LOB) characteristics.
KEYWORDS
high-frequency trading, limit order book, market simulation, multi-agent reinforcement learning, optimal execution
ACM Reference Format:
Michaël Karpe, Jin Fang, Zhongyao Ma, and Chen Wang. 2020. Multi-Agent Reinforcement Learning in a Realistic Limit Order Book Market Simulation. In ICAIF 2020: ACM International Conference on AI in Finance, October 15–16, 2020, New York, NY. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

∗ Both authors contributed equally to this research.
† Both authors contributed equally to this research.
Simulation techniques form the basis for understanding market dynamics and evaluating trading strategies, both for financial sector investment institutions and for academic researchers. Current simulation methods rely on strong assumptions about the statistical properties of the market environment and the impact of transactions on the prices of financial instruments. Unfortunately, market characteristics are complex, and existing simulation methods cannot replicate a realistic historical trading environment. The trading strategies tested by these simulations generally show lower profitability when implemented in real markets. It is therefore necessary to develop interactive agent-based simulations that allow trading strategy activities to interact with historical events in an environment close to reality.

High Frequency Trading (HFT) is a trading method that executes large volumes of trades at nanosecond timescales. In the United States, HFT companies account for more than 70% of daily equity trading volume. Execution strategies aim to execute a large volume of orders with minimal adverse market price impact. They are particularly important in HFT to reduce transaction costs. A common practice of execution strategies is to split a large order into several child orders and place them over a predefined period of time. However, developing an optimal execution strategy is difficult given the complexity of the HFT environment and the interactions between market participants.

The availability of NASDAQ's high-frequency LOB data allows researchers to develop model-free execution strategies based on RL through LOB simulation. These model-free approaches do not make assumptions about market responses or model them explicitly, but instead rely on realistic market simulations to train an RL agent to accumulate experience and generate optimal strategies. However, no existing research has implemented RL agents in realistic simulations, which makes the generated strategies suboptimal and not robust in real markets.
The use of RL for developing trading strategies has gained popularity in recent years. HFT makes it necessary to use RL to automate and accelerate order placement. Many papers present such RL approaches, such as temporal-difference RL [12] and risk-sensitive RL [9].
Although RL strategies have proven their effectiveness, they suffer from a lack of explainability. Thus, the need to explain these strategies in a business context has led to the development of representations of risk-sensitive RL strategies in the form of compact decision trees [16]. Advances in the development of RL agents for trading and order placement then showed the need to learn strategies in an environment close to the real market environment. Indeed, traditional RL approaches suffer from two main shortcomings.

First, each financial market agent adapts its strategy to the strategies of other agents, in addition to the market environment. This has led research in the field to consider the use of Multi-Agent Reinforcement Learning (MARL) for learning trading and order placement strategies [11]. Second, the market environment simulated in classical RL approaches was simplistic. The creation of a standardized market simulation environment for artificial intelligence agent research was then undertaken, through the creation of ABIDES [5], to allow agents to learn in conditions closer to reality. Research on the metrics to be considered when evaluating RL agents in this environment has also been carried out in the framework of LOB simulation [15].

As MARL and ABIDES allow a simulation much closer to real market conditions, additional research was undertaken to address the curse of dimensionality, as millions of agents compete in traditional market environments. The use of mean field MARL allows faster learning of strategies by approximating the behavior of each agent by the average behavior of its neighbors [17].

The notion of fairness also brings both efficiency and stability to MARL [7] by making it possible to avoid situations where agents could act disproportionately in the market, for example by executing large orders. The integration of fairness into MARL [2] has been studied as an evolution of traditional MARL strategies, used for example for liquidation strategies [3].
Our main contributions in HFT simulation and RL for optimal execution are the following:
• We set up a multi-agent LOB simulation environment for the training of RL execution agents within ABIDES.
• We formulated the problem of optimal execution within an RL framework, consisting of a combination of action spaces, private states, market states and reward functions. In particular, this is the first formulation combining optimal execution and optimal placement in the action space.
• We developed RL execution agents using the Double Deep Q-Learning (DDQL) algorithm in the ABIDES environment.
• We trained an RL agent in a multi-agent LOB simulation environment. Our RL agent converges to the TWAP strategy in some situations.
• We evaluated the multi-agent simulation with a trained RL agent based on real market LOB characteristics, and the observed order flow is consistent with LOB stylized facts.
In our work, we allow RL agents not only to choose the order volume to be placed, but also to choose between a market order and one or more limit orders at different levels of the order book. In this section, we describe the states, actions, and rewards of our optimal execution problem formulation.

We define the trading simulation as a T-period problem, denoted by the times T_0 < T_1 < · · · < T_N with T_0 = 0. We focus on the time horizon from 10:00 to 15:30 for each trading day to avoid the most volatile periods during the agent training process. The time interval within each period is ΔT = 30 seconds, so that there is a total of 660 periods within the time horizon we have defined in a trading day, lasting 5.5 hours (i.e. N = T / ΔT = 660). The capital letter P represents the price, while the capital letter Q represents the volume available at a certain price in the limit order book. Our optimal execution problem is then formulated as follows:

(1) State s: the state space includes the information on the LOB at the beginning of each period. For each time period, we use a tuple containing the following characteristics to represent the current state:
• time_remaining t: the time remaining after the time period T_k. Since we assume that a trade can only take place at the beginning of each period, this variable also encodes the number of remaining trading times. The variable is normalized to the [−1, 1] range as follows:
    t̂ = 2 × (T − t) / T − 1
• quantity_remaining n: the quantity of remaining inventory at the time period T_k, which is also normalized:
    n̂ = 2 × (N − Σ_{i=0}^{t} n_i) / N − 1,
where the capital letter N denotes the initial inventory.
The above state variables are linked to the specific execution task and are called private states. In addition, we also use the following market state variables to capture the market situation at a given point in time:
• bid-ask spread: the difference between the lowest ask price and the highest bid price, which is intended to provide information on the liquidity of the asset in the market:
    s = P_best_ask − P_best_bid
• volume imbalance: the difference between the existing order volumes at the best bid and best ask price levels. This feature contains information on the current liquidity difference on both sides of the order book, indirectly reflecting the price trend:
    v_imbalance = (Q_best_ask − Q_best_bid) / (Q_best_ask + Q_best_bid)
• one-period price return: the log-return of the stock price over two consecutive periods measures the short-term price trend. We intend to allow the RL agent to take advantage of the mean-reverting characteristics of the stock price:
    r_t = log(P_t / P_{t−1})
• t-period price return: the log-return of the stock price since the beginning of the simulation measures the deviation between the stock price at time t and the initial price at time 0:
    r_{0,t} = log(P_t / P_0)

(2) Action a: the action space defines a possible executed order in a given state, i.e. the possible quantity of remaining inventory affected by the order. The order can be either a market order or a limit order, on either the bid side or the ask side. Therefore, the action for each state is a combination of the quantity to be executed and the direction for placement. We use the execution choice to indicate the former and the placement choice to represent the latter.
• Execution Choices: at the beginning of each period, the agent decides on an execution quantity N_t = a · N_TWAP, where a is a scalar the agent chooses from a predefined set of multipliers to increase or decrease the order quantity placed relative to the TWAP strategy.
• Placement Choices: the agent can choose one of the following order placement methods:
– choice 0: Market Order
– choice 1: Limit Order - place 100% on the top level of the LOB
– choice 2: Limit Order - place 50% on each of the top 2 levels
– choice 3: Limit Order - place 33% on each of the top 3 levels

(3) Reward r: the reward is intended to reflect the feedback from the environment after the agent has taken a given action in a given state. It is usually represented by a reward function consisting of useful information obtained from the state or the environment. In our formulation, the reward function R_t measures the execution price slippage and quantity:
    R_t = (1 − |P_fill − P_arrival| / P_arrival) · λ · (N_t / N)
where λ is a constant for scaling the effect of the quantity component.

Aiming to achieve the optimal execution policy, RL enables agents to learn the best action to take through interaction with the environment. The agent follows the strategy that maximizes the expectation of the cumulative reward. In Q-Learning, it is represented by the function below [10]:
    Q(s, a) = E[ Σ_{t=0}^{∞} γ^t × R(s_t, a_t, s_t') ]
Since this tabular approach becomes intractable for larger state spaces, where states cannot be visited in depth, instead of directly using a matrix as the Q-function, we can learn a feature-based value function Q(s, a | θ), where θ is the weight vector updated by stochastic gradient descent.

The Q-value with a parametric representation can be estimated by several flexible means, of which Deep Q-Learning (DQL) best fits our problem. In DQL, the Q-function Q(s, a | θ) is represented by a Deep Q-Network (DQN), and θ denotes the network parameters. The network memory database contains samples with tuples of information recording the current state, action, reward and next state (s, a, r, s'). For each period, we generate samples according to an ε-greedy policy and store them in memory. We replace the samples when the memory database is full, following the First-In-First-Out (FIFO) principle, which means that old samples are removed first.

At each iteration, we batch a given quantity of samples from the memory database, and compute the target Q-value for each sample, which is defined as [10]:
    y = R(s, a) + γ × max_{a'} Q(s', a' | θ)
The network parameters θ are updated by minimizing the loss between the target Q-value and the estimated Q-value calculated from the network based on the current parameters.

However, the DQL algorithm suffers from both instability and overestimation problems, since the same neural network is used to generate the current target Q-value and to update the parameters. A common method to solve this problem is to introduce another neural network with the same structure and calculate the Q-values separately, which is called Double Deep Q-Learning (DDQL) [14]. In DDQL, we use two neural networks, the evaluation network and the target network, to generate the appropriate Q-values. The evaluation network is used to select the best action a* for each state in the sample, while the target network is used to estimate the target Q-value. We update the evaluation network parameters θ_E every period, and replace the target network parameters θ_T with θ_T = θ_E after several iterations.
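To make the state and reward definitions above concrete, the following is a minimal sketch of how the private and market state features and the slippage-based reward could be computed, assuming the normalizations and the reward form given in this section. The function and variable names (compute_state, compute_reward, lam) are illustrative and not part of ABIDES or the authors' implementation.

```python
import numpy as np

def compute_state(t, T, executed, N, best_bid, best_ask, bid_vol, ask_vol, prices):
    """Build the private and market state features described in the formulation."""
    # Private states, normalized to the [-1, 1] range.
    time_remaining = 2.0 * (T - t) / T - 1.0
    quantity_remaining = 2.0 * (N - executed) / N - 1.0
    # Market states.
    spread = best_ask - best_bid
    volume_imbalance = (ask_vol - bid_vol) / (ask_vol + bid_vol)
    one_period_return = np.log(prices[-1] / prices[-2])   # r_t = log(P_t / P_{t-1})
    t_period_return = np.log(prices[-1] / prices[0])      # log(P_t / P_0)
    return np.array([time_remaining, quantity_remaining, spread,
                     volume_imbalance, one_period_return, t_period_return])

def compute_reward(p_fill, p_arrival, n_t, N, lam=1.0):
    """Slippage- and quantity-based reward R_t; lam is the scaling constant lambda."""
    slippage = abs(p_fill - p_arrival) / p_arrival
    return (1.0 - slippage) * lam * (n_t / N)
```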
ABIDES is an Agent-Based Interactive Discrete Event Simulation environment designed primarily for Artificial Intelligence (AI) research in financial market simulations [5].

The first version of ABIDES (0.1) was released in April 2019. The ABIDES development team released a second version (1.0) in September 2019, presented as the first stable release. Finally, the latest version (1.1), released in March 2020, adds many new functionalities, including the implementation of new agents such as Q-Learning agents, as well as the implementation of the realism metrics.

ABIDES aims to replicate a realistic financial market environment by largely implementing the characteristics of real financial markets such as NASDAQ, including nanosecond time resolution, network latency, agent computation delays, and communication solely by means of standardized message protocols [5]. In addition, by providing ABIDES with historical LOB data, we are able to reproduce a given period of this history using the ABIDES market replay configuration file.

ABIDES also aims to help researchers answer questions related to the understanding of market behavior, such as the influence of delays in sending orders to an exchange, the price impact of placing large orders, or the implementation of AI agents in real markets [5].

ABIDES uses a hierarchical structure in order to ease the development of complex agents such as AI agents. Indeed, thanks to
Python object-oriented programming and inheritance, we can, for example, create a new ComplexAgent class which inherits from the Agent class and thus benefits from all functionalities available in the Agent class. We can then use overriding if we want to change an Agent function in order to make it specific to our ComplexAgent.

Given that ABIDES does not only aim to implement financial market simulations, the base Agent class has nothing related to financial markets and is only provided with functions for Discrete Event Simulation. The FinancialAgent class inherits from Agent and has supplementary functionalities to deal with currencies. On the one hand, the ExchangeAgent class inherits from FinancialAgent and simulates a financial exchange. On the other hand, the TradingAgent class also inherits from FinancialAgent and is the base class for all trading agents which communicate with the ExchangeAgent during financial market simulations.

Some trading agents – i.e. agents inheriting from the TradingAgent class – are already provided in ABIDES, such as the MomentumAgent, which places orders depending on a given number of previous observations of the simulated stock price. In the following sections, unless otherwise mentioned, the agents we refer to are all trading agents.

We present in Figure 1 an example of market simulation in ABIDES using real historical data. The graph presents a ZeroIntelligenceAgent placing orders on a stock in a market simulation with 100 ZeroIntelligenceAgents trading against an ExchangeAgent. Each agent is able to place long, short or exit orders, competing with thousands of other agents to maximize their reward.
Figure 1: Agent placing orders on simulated stock price
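To illustrate this inheritance pattern, the sketch below derives a toy trading agent from TradingAgent. The import path, method names and helpers used here (wakeup, receiveMessage, getCurrentSpread, getKnownBidAsk, placeLimitOrder) reflect our reading of the ABIDES code base and should be treated as assumptions rather than a definitive API reference; the agent's behavior is purely illustrative.

```python
from agent.TradingAgent import TradingAgent  # ABIDES import path, assumed


class ComplexAgent(TradingAgent):
    """Toy trading agent that posts a buy limit order at the best bid on each wakeup."""

    def __init__(self, id, name, type, symbol, quantity, **kwargs):
        super().__init__(id, name, type, **kwargs)
        self.symbol = symbol
        self.quantity = quantity

    def wakeup(self, currentTime):
        # Called by the simulation kernel at each scheduled wakeup time.
        super().wakeup(currentTime)
        self.getCurrentSpread(self.symbol)  # request LOB state from the ExchangeAgent

    def receiveMessage(self, currentTime, msg):
        # Called when a message (e.g. the spread we requested) arrives.
        super().receiveMessage(currentTime, msg)
        if msg.body.get("msg") == "QUERY_SPREAD":
            bid, _, ask, _ = self.getKnownBidAsk(self.symbol)
            if bid and ask:
                # Override of the default behavior: join the best bid with a limit order.
                self.placeLimitOrder(self.symbol, self.quantity,
                                     is_buy_order=True, limit_price=bid)
```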
In order to train the DDQL agent in ABIDES during a market replay simulation, the learning process needs to be integrated with the simulation process. The training process starts by initializing the ABIDES execution simulation kernel and instantiating a DDQLExecutionAgent object. The same agent object needs to complete B simulations, where B is referred to as the number of training episodes. Within each training episode, the simulation is divided into N discrete periods. For each period T_i, the agent chooses an action a_{T_i} for the current period according to the ε-greedy policy in order to achieve a balance between exploration and exploitation. Then, an order schedule is generated based on the quantity and placement strategy defined by the chosen action. The current-period order may be broken into smaller orders and placed on different levels of the LOB. Then, the current experience (s_{T_i}, a_{T_i}, s_{T_i+1}, r_{T_i}) is stored in the replay buffer D.

The replay buffer removes the oldest experience when its size reaches the specified maximum capacity. This is intended to use relatively recent experiences to train the agent. As soon as the size of the replay buffer D reaches a minimum training size, a random minibatch (s^(j), a^(j), r^(j), s'^(j)) is sampled from D for training the evaluation network. The target network is updated after training the evaluation network 5 times. The final step within time period T_i is to update the state s_{T_i+1} and compute the reward r_{T_i} for the current period. The entire process is summarized in Algorithm 1 below.
Algorithm 1 Training of DDQL for optimal execution in ABIDES
for training episode b ∈ B do
    for i ← 0 to N − 1 do
        With probability ε, select a random action a_{T_i}
        Otherwise, select the optimal action a_{T_i} based on target_net
        Schedule orders o_i according to a_{T_i} and submit o_i
        Store experience (s_{T_i}, a_{T_i}, s_{T_i+1}, r_{T_i}) in replay buffer D
        if length(D) > max_experience then
            Remove the oldest experience
        if length(D) ≥ min_experience and i mod 5 == 0 then
            Sample a random minibatch from D
            Train eval_net and update target_net
        if orders o_i accepted or executed then
            Observe the environment and update s_{T_i+1}
            Compute and update r_{T_i}

To train the DDQL agent, we implement our neural network as a Multi-Layer Perceptron (MLP). We stack multiple dense layers together as illustrated in Figure 2, and we set the activation function to ReLU to introduce non-linearity. Dropout is used to avoid overfitting. The optimization algorithm used for backpropagation is Root Mean Square Propagation (RMSprop) [13] with a learning rate of 0.01. The loss function we choose is the mean squared error (MSE). The size of the output layer is the number of actions to choose from.

Figure 2: Multi-Layer Perceptron (MLP) architecture
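A minimal sketch of such a network and of the double Q-learning update is given below, assuming a PyTorch implementation. The hidden-layer sizes, dropout rate and action count are illustrative choices, not the exact architecture of Figure 2; only the RMSprop optimizer, the 0.01 learning rate, the MSE loss and the eval/target split follow the description in this section.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP Q-network: state features in, one Q-value per action out."""
    def __init__(self, state_dim, n_actions, hidden=64, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Six state features as in the formulation; the number of actions is illustrative.
eval_net = QNetwork(state_dim=6, n_actions=20)
target_net = QNetwork(state_dim=6, n_actions=20)
target_net.load_state_dict(eval_net.state_dict())

optimizer = torch.optim.RMSprop(eval_net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def ddql_update(batch, gamma=0.99):
    """One DDQL step: eval_net selects the next action, target_net evaluates it."""
    s, a, r, s_next = batch  # shapes: (B, 6), (B,) long, (B,), (B, 6)
    with torch.no_grad():
        best_a = eval_net(s_next).argmax(dim=1, keepdim=True)              # selection
        target = r + gamma * target_net(s_next).gather(1, best_a).squeeze(1)  # evaluation
    q = eval_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = loss_fn(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```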
In this section, we describe our experiments for training a DDQL agent in a multi-agent environment and observing the behavior of the agent during testing.
The data we used for the experiments is NASDAQ data that we converted to the LOBSTER format to fit the simulation environment. We extracted the order flow for 5 stocks (CSCO, IBM, INTC, MSFT and YHOO) from January 13 to February 6, 2003. We trained the model over 9 days and tested it over the following 9 days. The training data is concatenated into a single sequence, and the training process is continuous for consecutive days while the model parameters are stored in intermediate files.
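As an illustration of how such data can be read, the sketch below loads one day of order flow, assuming the standard LOBSTER message-file layout (time, event type, order ID, size, price in units of 1e-4 dollars, direction); the file name and function name are hypothetical.

```python
import pandas as pd

# Standard LOBSTER message-file columns (assumed layout).
MESSAGE_COLUMNS = ["time", "type", "order_id", "size", "price", "direction"]

def load_lobster_messages(path):
    """Load a LOBSTER-format message file into a DataFrame with dollar prices."""
    messages = pd.read_csv(path, names=MESSAGE_COLUMNS, header=None)
    messages["price"] = messages["price"] / 10_000.0  # convert from 1e-4 dollars
    return messages

# Hypothetical file name for one trading day of IBM order flow.
ibm = load_lobster_messages("IBM_2003-01-13_message_10.csv")
```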
The multi-agent environment that we have set up for the training of DDQL agents in ABIDES consists of an ExchangeAgent, a MarketReplayAgent, six MomentumAgents, a TWAPExecutionAgent and our DDQLExecutionAgent.
• ExchangeAgent acts as a centralized exchange that keeps the order book and matches orders on the bid and ask sides.
• MarketReplayAgent accurately replays all market and limit orders recorded in the historical LOB data.
• MomentumAgent compares the average of the last 20 mid-price observations with the average of the last 50 mid-price observations, and places a buy limit order if the 20-observation average is not lower than the 50-observation average, or a sell limit order otherwise.
• TWAPExecutionAgent adopts the TWAP strategy. This strategy minimizes the price impact by dividing a large order equally into several smaller orders. Its execution price is the average price over the recent k time periods. The agent's trading rate is calculated by dividing the total size of the order by the total execution time, which means that the trading quantity is constant. When the stock price follows a Brownian motion and the price impact is assumed to be constant, this is the optimal strategy [6]. In RL terms, if there is no penalty for any running inventory, but a significant penalty for the ending inventory, the TWAP strategy is also optimal (see the schedule sketch at the end of this section).

We have observed that our RL agent converges to the TWAP strategy after 9 consecutive days of training, regardless of the stock chosen. The agent places a top-level limit order or a market order. However, the execution quantity chosen by the agent throughout the test period changes with the stock. A possible reason for this result is that, as our RL agent places an order every 30 seconds, it is not able to capture the trading signals existing during shorter periods of time.
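For reference, the following sketch shows the equal-split child-order schedule that the TWAP benchmark follows, and how the RL agent's execution choice rescales the TWAP quantity for one period. The function name and the rounding convention are illustrative assumptions.

```python
def twap_schedule(total_quantity, n_periods):
    """Split a parent order into n_periods (near-)equal child orders (TWAP benchmark)."""
    base = total_quantity // n_periods
    remainder = total_quantity % n_periods
    # Distribute the remainder over the first few periods so the quantities sum exactly.
    return [base + (1 if i < remainder else 0) for i in range(n_periods)]

# Example: a 10,000-share parent order over the 660 thirty-second periods of a trading day.
schedule = twap_schedule(10_000, 660)
assert sum(schedule) == 10_000

# The RL agent's execution choice rescales the current-period TWAP quantity:
# N_t = a * N_TWAP, with a chosen from a predefined set of multipliers.
a = 1.25  # illustrative multiplier
n_t = round(a * schedule[0])
```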
Numerous research papers have studied the behavior of the LOB. A recent research paper presents a review of these LOB characteristics, which can be referred to as stylized facts [15]. In this section, we compare our simulation with real markets based on these realism metrics, in order to assess whether our market simulation, mainly based on ABIDES, is realistic.

We can mainly distinguish two sets of metrics for the analysis of LOB behavior. The first set includes metrics related to asset return distributions, and the second set includes metrics related to volume and order flow distributions [15]. Asset return metrics generally relate to price returns or percentage changes. For the LOB, this includes the trend of the mid-price, which is the average of the best bid price and the best ask price. Volume and order flow metrics relate to the behavior of incoming order flows, including new buy orders, new sell orders, order modifications and order cancellations.

We briefly recall three main stylized facts related to order flows [15]:
• Order volume in a fixed time interval: order volume in a fixed interval or time window likely follows a positively skewed log-normal distribution or a gamma distribution [1].
• Order interarrival time: the time interval between two consecutive limit orders likely follows an exponential distribution [8] or a Weibull distribution [1].
• Intraday volume patterns: limit order volume within a given time interval for each trading day can be approximated by a U-shaped polynomial curve, where the volume is higher at the start and the end of the trading day [4].

These realism metrics are implemented in ABIDES. We verified the order flow stylized facts mentioned above, as well as the asset returns stylized facts that are also implemented in ABIDES. We did this on a market replay simulation before adding our DDQLExecutionAgent, and then on a simulation including our DDQLExecutionAgent.

We first observe that adding a single new agent to a simulation does not significantly alter the observation of stylized facts. This means that evaluating the realism of our simulation with a single DDQLExecutionAgent is equivalent to evaluating the realism of the LOB data provided as input to the simulation. On our 2003 NASDAQ LOB data, we always observe the two order flow stylized facts mentioned above; however, we do not always observe intraday volume patterns. Figure 3 illustrates the stylized fact about order volume in a fixed time interval, for IBM stock on January 13, 2003. We verify that order volume in a fixed time interval follows a gamma distribution.

Figure 3: Order volume in a fixed time interval for IBM stock on January 13, 2003
In the work presented, we built our DDQLExecutionAgent in ABIDES by implementing our own optimal execution problem formulation through RL in a financial market simulation, and set up a multi-agent simulation environment accordingly. In addition, we conducted experiments to train our DDQLExecutionAgent in the ABIDES environment and compared the agent's strategy with TWAP. Finally, we evaluated the multi-agent simulation environment against financial realism metrics and showed that it follows the stylized facts about order flow patterns.

In our experiments, our DDQLExecutionAgent learned to perform a TWAP strategy because its trading frequency is not high enough. However, this work shows the potential of MARL for developing optimal trading strategies in real financial markets, by implementing agents with a higher trading frequency in the realistic ABIDES market simulation.
Due to limited computing resources and lack of data, the experiments we have been able to run are limited, and our current model can be improved in many ways. The agent's period length can be refined to a shorter time interval, closer to the nanosecond, in order to be closer to real HFT. More features can be added to the state space, and the action space can be expanded to include more types of execution actions. In addition, the reward function can be enhanced to include more information and feedback from both the market and the other agents.

Regarding the RL algorithm, we can try several advanced methods to improve our DDQLExecutionAgent. Directions include the use of prioritized experience replay, to increase the frequency at which important transitions are batched from memory, or the combination of bootstrapping with DDQL, to improve exploration efficiency. In addition, the performance of the neural network itself can be improved by increasing the complexity of the architecture. For example, since the agent's trading decisions may also depend on previous observations, several LSTM layers can be added to take advantage of the agent's past experience.

So far, we have focused on a relatively monotonous set of agents, which is not able to fully capture the influence of the interactions between agents. To remedy this situation, more and different types of agents can be added to the configuration to study collaboration and competition among agents in more detail. Moreover, after introducing a more complex combination of agents in the ABIDES environment, we can perform the financial market simulation on the basis of this configuration, which should be much more realistic than the existing one.

The approach to evaluation is also an aspect that can be further expanded. Our experiments do not currently allow us to clearly distinguish the difference between our agents and the benchmark. In order to assess the model more accurately, we can further improve our evaluation methods to examine both the RL parameters and the financial performance. For example, by conducting a simulation of the financial market in the ABIDES environment, we can use the realism metrics presented above to evaluate our agents.
ACKNOWLEDGMENTS
We would like to thank Xin Guo and Yusuke Kikuchi for their helpful comments during the realization of this work. We also want to thank Svitlana Vyetrenko for her answers to our questions about the work she contributed to and cited in the References section.
REFERENCES
[1] Frédéric Abergel, Marouane Anane, Anirban Chakraborti, Aymen Jedidi, and Ioane Muni Toke. 2016. Limit Order Books. Cambridge University Press.
[2] Wenhang Bao. 2019. Fairness in Multi-agent Reinforcement Learning for Stock Trading. arXiv preprint arXiv:2001.00918 (2019).
[3] Wenhang Bao and Xiao-yang Liu. 2019. Multi-agent Deep Reinforcement Learning for Liquidation Strategy Analysis. arXiv preprint arXiv:1906.11046 (2019).
[4] Jean-Philippe Bouchaud, Julius Bonart, Jonathan Donier, and Martin Gould. 2018. Trades, Quotes and Prices: Financial Markets Under the Microscope. Cambridge University Press.
[5] David Byrd, Maria Hybinette, and Tucker Hybinette Balch. 2019. ABIDES: Towards High-Fidelity Market Simulation for AI Research. arXiv preprint arXiv:1904.12066 (2019).
[6] Kevin Dabérius, Elvin Granat, and Patrik Karlsson. 2019. Deep Execution - Value and Policy Based Reinforcement Learning for Trading and Beating Market Benchmarks. Available at SSRN 3374766 (2019).
[7] Jiechuan Jiang and Zongqing Lu. 2019. Learning Fairness in Multi-Agent Systems. In Advances in Neural Information Processing Systems. 13854–13865.
[8] Junyi Li, Xintong Wang, Yaoyang Lin, Arunesh Sinha, and Michael P. Wellman. 2018. Generating Realistic Stock Market Order Streams. (2018).
[9] Mohammad Mani, Steve Phelps, and Simon Parsons. 2019. Applications of Reinforcement Learning in Automated Market-Making. (2019).
[10] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602 (2013).
[11] Yagna Patel. 2018. Optimizing Market Making using Multi-Agent Reinforcement Learning. arXiv preprint arXiv:1812.10252 (2018).
[12] Thomas Spooner, John Fearnley, Rahul Savani, and Andreas Koukorinis. 2018. Market Making via Reinforcement Learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 434–442.
[13] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5 - RMSprop: Divide the Gradient by a Running Average of Its Recent Magnitude. COURSERA: Neural Networks for Machine Learning 4, 2 (2012), 26–31.
[14] Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep Reinforcement Learning with Double Q-Learning. In Thirtieth AAAI Conference on Artificial Intelligence.
[15] Svitlana Vyetrenko, David Byrd, Nick Petosa, Mahmoud Mahfouz, Danial Dervovic, Manuela Veloso, and Tucker Hybinette Balch. 2019. Get Real: Realism Metrics for Robust Limit Order Book Market Simulations. arXiv preprint arXiv:1912.04941 (2019).
[16] Svitlana Vyetrenko and Shaojie Xu. 2019. Risk-Sensitive Compact Decision Trees for Autonomous Execution in Presence of Simulated Market Response. arXiv preprint arXiv:1906.02312 (2019).
[17] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field Multi-Agent Reinforcement Learning. arXiv preprint arXiv:1802.05438 (2018).