Extending Deep Reinforcement Learning Frameworks in Cryptocurrency Market Making
Jonathan Sadighian*
SESAMm
[email protected]

April 16, 2020

* Completed while associated with SESAMm. The views presented in this paper are of the author and do not necessarily represent the views of SESAMm.
Abstract
There has been a recent surge in interest in the application of artificial intelligence to automated trading. Reinforcement learning has been applied to single- and multi-instrument use cases, such as market making or portfolio management. This paper proposes a new approach to framing cryptocurrency market making as a reinforcement learning challenge by introducing an event-based environment wherein an event is defined as a change in price greater or less than a given threshold, as opposed to tick- or time-based events (e.g., every minute, hour, day, etc.). Two policy-based agents are trained to learn a market making trading strategy using eight days of training data, and their performance is evaluated using 30 days of testing data. Limit order book data recorded from Bitmex is used to validate this approach, which demonstrates improved profit and stability compared to a time-based approach for both agents when using a simple multi-layer perceptron neural network for function approximation and seven different reward functions.
Keywords: reinforcement learning, limit order book, market making, cryptocurrencies
1 Introduction

Applying quantitative methods to market making is a longstanding interest of the quantitative finance community. Over the past decade, researchers have applied stochastic, statistical and machine learning techniques to automate market making. These approaches often use limit order book (LOB) data to train a model to make a prediction about future price movements, generally for the purpose of maintaining the optimal inventory and quotes (i.e., posted bid and ask orders at the exchange) for the market maker. These models are typically trained using the most granular form of event-driven data, level II/III tick data, where an event is defined as an incoming order received by the exchange (e.g., new, cancel, or modify). Alternatively, an event can be defined as a time interval (e.g., every n seconds, minutes, hours, etc.). Every time an event occurs, these models have the opportunity to react (e.g., adjust their quotes), thereby indirectly managing inventory by increasing or decreasing the likelihood of posted order execution.

Tick-based approaches make assumptions about latency and executions, which may impact the capability of a model to translate simulated results into real-world performance. Researchers attempt to address this challenge in different ways, such as creating simulation rules around market impact, execution rates, and priority of LOB queues [1-5]. We propose an alternative approach to address this challenge: fundamentally change the mindset of strategy creation for automated market making from latency-sensitive to intelligent strategies using deep reinforcement learning (DRL) with time- and price-based event environments.

This paper applies DRL to create an intelligent market making strategy, extending the DRL Market Making (DRLMM) framework set forth in our previous work [6], which used time-based event environments. The reinforcement learning framework follows a Markov Decision Process (MDP), where an agent interacts with an environment E over discrete time steps t, observes a state space s_t, takes an action a_t guided by a policy π, and receives a reward r_t. The policy π is a probability distribution mapping state spaces s_t ∈ S to action spaces a_t ∈ A. The agent's interactions with the environment continue until a terminal state is reached. The return R_t = Σ_{k=0}^{∞} γ^k r_{t+k} is the total accumulated return from time step t with discount factor γ ∈ (0, 1]. The goal of the agent is to maximize the expected return from each state s_t [7].

Although reinforcement learning has seen many recent successes across various domains [8-10], the success of its application in automated trading is highly dependent on the reward function (i.e., feedback signal) [11-13]. Previous research [6] proposed a framework for deep reinforcement learning as applied to cryptocurrency market making (DRLMM) and demonstrated its capability to generalize across different currency pairs.
The focus of this paper is to extend the research of previous work by evaluating how seven different reward functions impact the agent's trading strategy, and to introduce a new framework for DRLMM using price-based events.

This paper is structured as follows: section 1, introduction; section 2, related research; section 3, contributions in this paper; section 4, overview of the seven reward functions; section 5, overview of time- and price-based event-driven environments; section 6, our experiment design and methodology; section 7, results and analysis; and section 8, conclusion and future work.

2 Related Research

The earliest approaches to applying model-free reinforcement learning to automated trading consist of training an agent to learn a directional single-instrument trading strategy using low-resolution price data and policy methods [11-14]. These approaches find that risk-based reward functions, such as the Downside Deviation Ratio or Differential Sharpe Ratio, generate more stable out-of-sample results than using actual profit and loss.

More recently, researchers have applied reinforcement learning methods to market making. [3] created a framework using value-based model-free methods with high-resolution LOB data from equity markets. Under this framework, an agent takes a step through the environment every time a new tick event occurs. They proposed a novel reward function, which dampens the change in unrealized profit and loss asymmetrically, discouraging the agent from speculation. Although their agent demonstrates stable out-of-sample results, assumptions about latency and executions in the simulator make it unclear how effectively the trained agent would perform in live trading environments. [15] proposed a hierarchical reinforcement learning architecture, where a macro agent views low-resolution data and generates trade signals, and a micro agent accesses the LOB and is responsible for order execution. Under this approach, the macro agent takes a step through the environment with time-based events using one-minute time intervals; the micro agent interacts with its environment in ten-second intervals. Although the agent outperformed their baseline, the lack of inventory constraints on the agent makes the results uncertain for live trading.

Previous work [6] proposed a new framework for applying deep reinforcement learning to cryptocurrency market making. The approach consists of using a time-based event approach with one-second snapshots of LOB data (including derived statistics from order and trade flow imbalances and indicators) to train policy-based model-free actor-critic algorithms. The performance of two reward functions was compared on Bitcoin, Ether and Litecoin data from the Coinbase exchange. The framework's ability to generalize was demonstrated by applying trained agents to make markets in different currency pairs profitably. This paper extends previous work by comparing five additional reward functions and introducing a new approach for the DRLMM framework using price-based events.
3 Contributions

The main contributions of this paper are as follows:

1. Analysis of seven reward functions: We extend previous work, applying the DRLMM framework to more reward functions and evaluating the impact on the agent's market making strategy. The reward function definitions are explained in section 4.

2. Price-based event environment: We propose a new approach to defining an event in the agent's environment and compare this approach to our original time-based environment framework. The price-based event approach is explained in section 5.
4 Reward Functions

The reward function serves as a feedback signal to the agent and therefore directly impacts the agent's trading strategy in a significant way. There are seven reward functions described in this section, which are categorized as profit-and-loss (PnL), goal-oriented, and risk-based approaches. These seven reward functions provide a wide range of feedback signals to the agent, from frequent to sparse.

When calculating realized PnL, orders are netted in FIFO order and presented in percentage terms, as opposed to dollar value, to ensure compatibility if applied to different instruments (all simulation rules are set forth in section 6.3.3).
Unrealized PnL
The agent’s unrealized PnL
U P nL provides the agent with a continuous feedbacksignal (assuming the agent is trading actively and maintains inventory). This reward function is calculatedby multiplying the agent’s inventory count
Inv by the percentage change in midpoint price ∆ m for timestep t . Note, inventory count Inv is an integer, because the agent trades with equally sized orders (seesection 6.3.3 for the comprehensive list of trading rules in our environment).
U P nL t = Inv t ∆ m t (1)where ∆ m = m t m t − − and Inv t = (cid:80) IMn =0 Ex nt is the total count of executed orders Ex held in inventoryand IM is the maximum permitted inventory. In our experiment, we set IM = 10 , meaning the agentcan execute and hold 10 trades (of equal quantity). Unrealized PnL with Realized Fills
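As an illustration of equation (1), the following minimal Python sketch computes the reward from a signed inventory count and two consecutive midpoints; the function and variable names are illustrative rather than taken from our implementation, and treating Inv as a signed count (positive for long, negative for short) is an assumption.

def unrealized_pnl_reward(inventory: int, midpoint: float, prev_midpoint: float) -> float:
    """Equation (1): UPnL_t = Inv_t * delta_m_t, with delta_m_t = m_t / m_{t-1} - 1."""
    delta_m = midpoint / prev_midpoint - 1.0  # percentage change in midpoint
    return inventory * delta_m

# Example: holding 3 lots through a 0.2% up-move yields a reward of 0.006.
print(unrealized_pnl_reward(inventory=3, midpoint=9018.0, prev_midpoint=9000.0))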
Unrealized PnL with Realized Fills

The unrealized PnL with realized fills UPnLwF reward function (referred to as positional PnL in our previous work) is similar to UPnL, but includes any realized gains or losses RPnL^step obtained between time steps t and t−1. This reward function provides the agent with a continuous feedback signal, as well as larger sparse rewards (assuming the agent is trading actively and maintains inventory).

UPnLwF_t = UPnL_t + RPnL_t^step    (2)

where RPnL_t^step = [Ex_t^{E,short} / Ex_t^{X,cover} − 1] + [Ex_t^{X,sell} / Ex_t^{E,long} − 1], and Ex_t^{E,long/short} is the average entry price and Ex_t^{X,sell/cover} is the average exit price of the executed order(s) between time steps t and t−1 for the long or short side.
Asymmetrical Unrealized PnL with Realized Fills

The asymmetrical unrealized PnL with realized fills Asym reward function is similar to UPnLwF, but removes any upside unrealized PnL to discourage price speculation and adds a small rebate (i.e., half the spread) whenever an open order is executed to promote the use of limit orders. This reward function provides both immediate and sparse feedback to the agent (assuming the agent is trading actively and maintains inventory). Our implementation is similar to [3]'s asymmetrically dampened PnL function, but includes the realized gains RPnL_t^step from the current time step t, which improved our agent's performance in volatile cryptocurrency markets.

Asym_t = min(0, η · UPnL_t) + RPnL_t^step + ψ_t    (3)

where ψ_t = Ex_t^n · (m_t / p_t^bid − 1) is the number n of matched (i.e., executed) orders Ex multiplied by half the spread m_t / p_t^bid − 1 in percentage terms, and η is a constant value used for dampening. In our experiment, we set η to 0.35.
Asymmetrical Unrealized PnL with Realized Fills and Ceiling

The asymmetrical unrealized PnL with realized fills and gains ceiling AsymC reward function can be thought of as an extension of Asym, where we add a cap κ on the realized upside gains RPnL_t^step and remove the half-spread rebate ψ_t on executed limit orders. The intended effect is that AsymC is discouraged from long inventory holding periods and price speculation due to the ceiling and asymmetrical dampening. Like Asym, this reward function provides both immediate and sparse feedback to the agent.

AsymC_t = min(0, η · UPnL_t) + min(RPnL_t^step, κ)    (4)

where κ is the effective ceiling on time step realized gains. In our experiment, we set κ to twice the market order transaction fee.
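To make the dampening and capping in equations (3) and (4) concrete, a minimal Python sketch follows; it assumes the realized step PnL, the per-fill half-spread rebate, and the number of executed orders are supplied by the surrounding simulator, and the helper names are illustrative.

def asym_reward(upnl: float, rpnl_step: float, rebate_per_fill: float,
                n_fills: int, eta: float = 0.35) -> float:
    """Equation (3): dampen only non-positive unrealized PnL and add a
    half-spread rebate for each executed limit order."""
    psi = n_fills * rebate_per_fill
    return min(0.0, eta * upnl) + rpnl_step + psi

def asym_ceiling_reward(upnl: float, rpnl_step: float, kappa: float,
                        eta: float = 0.35) -> float:
    """Equation (4): as above, but cap realized step gains at kappa and drop the rebate."""
    return min(0.0, eta * upnl) + min(rpnl_step, kappa)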
Realized PnL Change

The change in realized PnL, ΔRPnL, provides the agent with a sparse feedback signal since values are only generated at the end of a round-trip trade. The reward is calculated by taking the difference in realized PnL RPnL values between time steps t and t−1.

ΔRPnL_t = RPnL_t − RPnL_{t-1}    (5)

where RPnL_t and RPnL_{t-1} are the agent's realized PnL at time step t and the previous time step t−1, respectively.
Trade Completion

The trade completion TC reward function provides a goal-oriented feedback signal, where a reward r_t ∈ [−1, 1] is generated if an objective is attained or missed. Specifically, if the realized PnL RPnL^step is greater (or less) than a predefined threshold ϖ, the reward r_t is 1 (or −1); otherwise, if RPnL^step is in between the thresholds, the actual realized PnL in percentage terms is the reward. Using this approach, the agent is encouraged to open and close positions with a targeted profit-to-loss ratio, and is not rewarded for longer term price speculation.

TC_t = 1 if RPnL_t^step ≥ ε·ϖ;  −1 if RPnL_t^step ≤ −ϖ;  RPnL_t^step otherwise    (6)

where ε is a constant used for the multiplier and ϖ is a constant used for the threshold. In our experiment, we set ε to 2 and ϖ to the market order transaction fee.
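A minimal Python sketch of the piecewise rule in equation (6) follows; the function name and example values are illustrative, with the threshold set to the market order (taker) fee used in our simulation.

def trade_completion_reward(rpnl_step: float, threshold: float, epsilon: float = 2.0) -> float:
    """Equation (6): +1 if the realized step PnL clears epsilon * threshold,
    -1 if it breaches -threshold, otherwise the raw realized step PnL."""
    if rpnl_step >= epsilon * threshold:
        return 1.0
    if rpnl_step <= -threshold:
        return -1.0
    return rpnl_step

# Example: a 0.2% round-trip gain against a 0.075% fee threshold earns the full reward.
print(trade_completion_reward(rpnl_step=0.002, threshold=0.00075))  # 1.0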
Differential Sharpe Ratio

The differential Sharpe ratio DSR provides the agent with a risk-adjusted continuous feedback signal (assuming the agent is trading actively and maintains inventory). Originally proposed [11] more than 20 years ago, this reward function is the online version of the well-known Sharpe Ratio, but can be calculated cheaply with O(1) time complexity, thereby making it the more practical choice for training agents using high-resolution data sets.

DSR_t = (B_{t-1} · ΔA_t − ½ A_{t-1} · ΔB_t) / (B_{t-1} − A_{t-1}^2)^{3/2}    (7)

where A_t = A_{t-1} + η(R_t − A_{t-1}), B_t = B_{t-1} + η(R_t^2 − B_{t-1}), ΔA_t = R_t − A_{t-1}, ΔB_t = R_t^2 − B_{t-1}, and η is a constant value. In our experiment, we use UPnL (as described in section 4.1) for R_t and set η to 0.01.
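Equation (7) reduces to a small stateful update; the Python sketch below is illustrative (the class name and the zero-denominator guard are additions, not part of the original formulation) and assumes R_t is the UPnL reward with η = 0.01, as above.

class DifferentialSharpeRatio:
    """Online Differential Sharpe Ratio of equation (7)."""

    def __init__(self, eta: float = 0.01):
        self.eta = eta
        self.A = 0.0  # moving estimate of the first moment of R
        self.B = 0.0  # moving estimate of the second moment of R

    def step(self, reward: float) -> float:
        delta_a = reward - self.A
        delta_b = reward ** 2 - self.B
        denom = (self.B - self.A ** 2) ** 1.5
        dsr = 0.0 if denom == 0.0 else (self.B * delta_a - 0.5 * self.A * delta_b) / denom
        # update the moving estimates after the reward is computed
        self.A += self.eta * delta_a
        self.B += self.eta * delta_b
        return dsr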
5 Event-Driven Environments

When applying reinforcement learning to financial time series, the typical approach to framing the MDP is to have the agent take a step through the environment using a time-based interval. Depending on the trading strategy, the interval of time can be anywhere from seconds to days. For market making, the typical approach is to use tick events (e.g., new, cancel or modify order) as the catalyst for an agent to interact with its environment. The tick-based approach differs from a time-based approach in that the events are irregularly spaced in time, and occur with far greater frequency (by more than an order of magnitude). Although tick-based strategies can yield very impressive results in research, external factors (e.g., partial executions, latency, risk checks, etc.) could limit their practicality in live trading environments. We address this challenge by proposing the use of price-based events for market making trading strategies, which partially removes the dependency on these assumptions, enabling the deep reinforcement learning algorithms to learn non-linear market dynamics across multiple time steps (i.e., not latency-sensitive).

5.1 Time-Based Events

The time-based approach to event-driven environments consists of sampling the data at periodic intervals evenly spaced in time (e.g., every second, minute, day, etc.). This approach is the most intuitive for trading strategies, since market data is readily available in this format. This experiment takes snapshots of the LOB (and other inputs in our feature space) using one-second time intervals to reduce the number of events in one trading day from millions to 86,400 (the number of seconds in a 24-hour trading day), resulting in less clock time required to train our agent.
5.2 Price-Based Events

The price-based approach to event-driven environments consists of an event being defined as a change in midpoint price m greater or less than a threshold β. Following this approach, our data set is further down-sampled from its original form of one-second time intervals into significantly fewer price change events that are irregularly spaced in time, thereby decreasing the amount of time required to train the agent per episode (i.e., one trading day). In this experiment, the minimum threshold β is set to one basis point (i.e., 0.01%) and we use the one-second LOB snapshot data (as described in section 5.1) as the underlying data set.
Algorithm 1: Deriving price-based events from high-resolution data sets.

Result: Observation and accumulated reward at time t+n
β ← 0.0001
n ← 1
m_t ← (p_t^ask + p_t^bid) / 2
upper ← m_t (1 + β)
lower ← m_t (1 − β)
step ← True
while step do
    if lower ≤ m_{t+n} ≤ upper then
        n ← n + 1
    else
        step ← False
    end
end
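A minimal Python rendering of Algorithm 1 over a pre-recorded midpoint series is shown below; the array-based handling and the synthetic example are illustrative assumptions, with β set to one basis point as described in section 5.2.

import numpy as np

def price_event_indices(midpoints: np.ndarray, beta: float = 0.0001) -> list:
    """Down-sample a one-second midpoint series into price-based events.

    An event is recorded whenever the midpoint moves more than +/- beta
    relative to the midpoint at the previous event (Algorithm 1).
    """
    events = [0]
    anchor = midpoints[0]
    for i in range(1, len(midpoints)):
        if not (anchor * (1 - beta) <= midpoints[i] <= anchor * (1 + beta)):
            events.append(i)
            anchor = midpoints[i]
    return events

# Example: a synthetic random-walk midpoint series produces far fewer events
# than the 86,400 one-second snapshots of a full trading day.
rng = np.random.default_rng(0)
mids = 9000.0 * np.cumprod(1 + rng.normal(0, 2e-5, size=86_400))
print(len(price_event_indices(mids)))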
6 Experiment Design and Methodology

In this section, the design and methodology aspects of the experiment are set forth.
The agent’s observation space is represented by a combination of LOB data from the first 20 rows, orderand trade flow imbalances, indicators, and other hand-crafted indicators. For each observation, we include5ction ID 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16Bid 0 0 0 4 4 4 4 9 9 9 9 14 14 14 14Ask 4 9 14 0 4 9 14 0 4 9 14 0 4 9 14Action 1 No actionAction 17 Market order M with size Inv
Table 1: The agent action space with 17 possible actions. The numbers in the
Bid and
Ask rows representthe price level at which the agent’s orders are set to for a given action and are indexed at zero. Forexample, action 2 indicates the agent open orders are skewed so that its bid is at level zero (i.e., best bid)and its ask is at level five.100 window lags. The observation space implementation specifications are detailed in the appendix(section A.2).It is worth noting that in previous work [6], the non-stationary feature price level distances to midpoint is included in the agent’s observation space; however, this feature does not inform the agent when usingBitmex data. This is likely due to the tick size at Bitmex being relatively large (0.50) compared toCoinbase exchange (0.01). As a result, the distances of price levels to the midpoint remain unchanged for99.99% of the time at Bitmex.
The agent action space consists of 17 possible actions. The idea is that the agent can take four generalactions: no action, symmetrically quote prices, asymmetrically skew quoted prices, or flatten the entireinventory. The action space is outlined in Table 1.
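The quoting logic behind Table 1 can be expressed as a simple lookup; the sketch below is illustrative (the function name and return format are not from our implementation) and mirrors the bid/ask price-level indices listed in the table.

# Price-level indices (indexed at zero) for actions 2-16, taken from Table 1.
BID_LEVELS = [0, 0, 0, 4, 4, 4, 4, 9, 9, 9, 9, 14, 14, 14, 14]
ASK_LEVELS = [4, 9, 14, 0, 4, 9, 14, 0, 4, 9, 14, 0, 4, 9, 14]

def decode_action(action_id: int):
    """Translate an action ID into a quoting instruction."""
    if action_id == 1:
        return ("noop",)
    if action_id == 17:
        return ("flatten",)  # market order with size equal to the current inventory
    if 2 <= action_id <= 16:
        return ("quote", BID_LEVELS[action_id - 2], ASK_LEVELS[action_id - 2])
    raise ValueError(f"unknown action id: {action_id}")

print(decode_action(2))  # ('quote', 0, 4): bid at the best bid, ask at level five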
Function Approximation

The function approximator is a multilayer perceptron (MLP), which is a feed-forward artificial neural network. The architecture of our implementation consists of a 3-layer network with a single shared layer for feature extraction, followed by separate actor and critic networks. ReLU activations are used in every hidden layer.

Figure 1: Architecture of the actor-critic MLP neural network used in the experiments. Gray represents shared layers. Blue represents non-shared layers. The window size w is 100 and the feature count varies depending on the feature set, as described in Table 2.
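For readers who prefer code to diagrams, the following PyTorch sketch mirrors the shared-layer actor-critic MLP of Figure 1; it is illustrative only, since the experiments rely on the Stable Baselines implementation, and the hidden-layer width of 256 is an assumption rather than a setting reported in this paper.

import torch
import torch.nn as nn

class ActorCriticMLP(nn.Module):
    """Shared feature layer followed by separate actor and critic heads (Figure 1)."""

    def __init__(self, window: int, n_features: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Flatten(),
            nn.Linear(window * n_features, hidden),
            nn.ReLU(),
        )
        self.actor = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs: torch.Tensor):
        z = self.shared(obs)  # obs shape: (batch, window, feature count)
        return self.actor(z), self.critic(z)  # action logits and state value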
We implement seven different reward functions, as outlined in section 4.

Two advanced policy-based model-free algorithms are used as market making agents: Advantage Actor-Critic (A2C) and Proximal Policy Optimization (PPO). We use the Stable Baselines [16] implementations of the algorithms. Since both algorithms run on multiple processes, they require nearly the same amount of clock time to train. The same policy network architectures (Figure 1) are used across all experiments, and parameter settings are listed in the appendix (section A.1).

6.2.1 Advantage Actor-Critic (A2C)
The A2C is an on-policy model-free actor-critic algorithm that is part of the policy-based class of RL algorithms. It interacts with the environment through parallel workers while maintaining a policy π(a_t | s_t; θ) and an estimate of the value function V(s_t; θ_v), and synchronously updates parameters using a GPU, as opposed to its asynchronous-update counterpart A3C [17]. A2C learns good and bad actions by calculating the advantage A(s_t, a_t) of a particular action for a given state. The advantage is the difference between the action-value Q(s_t, a_t) and the state value V(s_t). The A2C algorithm also uses k-step returns to update both the policy and the value function, which results in more stable learning than a vanilla policy gradient, which uses 1-step returns. These features, parallel training and k-step returns, make A2C a strong fit for its application to market making, which relies on noisy high-resolution LOB data.

The A2C update is calculated as

∇_{θ'} J(θ') = ∇_{θ'} log π(a_t | s_t; θ') · A(s_t, a_t; θ, θ_v)    (8)

where A(s_t, a_t; θ, θ_v) is the estimate of the advantage function given by Σ_{i=0}^{k-1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v) − V(s_t; θ_v), and k can vary up to an upper bound t_max [17].

6.2.2 Proximal Policy Optimization (PPO)

The PPO is an on-policy model-free actor-critic algorithm that is part of the policy-based class of RL algorithms, even though the policy is indirectly updated through a surrogate function. Like the A2C algorithm, it interacts with the environment through parallel workers, makes synchronous parameter updates θ, and uses k-step returns. However, unlike A2C, PPO uses Generalized Advantage Estimation (GAE) to reduce the bias of advantages [18] and indirectly optimizes the policy π_θ(a_t | s_t) through a clipped surrogate function L^CLIP that represents the difference between the new policy after the most recent update π_θ(a | s) and the old policy before the most recent update π_{θ_k}(a | s) [19]. This surrogate function removes the incentive for the new policy to depart far from the old policy, thereby increasing learning stability. These features, parallel training, k-step returns, and the surrogate function, make PPO a strong fit for its application to market making, which relies on noisy high-resolution LOB data.

The PPO-Clip update is calculated as

L(s, a, θ_k, θ) = min( (π_θ(a_t|s_t) / π_{θ_k}(a_t|s_t)) · A^{π_{θ_k}}(s_t, a_t),  clip(π_θ(a_t|s_t) / π_{θ_k}(a_t|s_t), 1−ε, 1+ε) · A^{π_{θ_k}}(s_t, a_t) )    (9)

where ε is a hyperparameter constant [19].
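Equation (9) translates into a few lines of code; the PyTorch sketch below is illustrative (the experiments use the Stable Baselines implementation), and the clipping constant of 0.2 is a common default rather than a value reported here.

import torch

def ppo_clip_loss(log_probs_new: torch.Tensor, log_probs_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective of equation (9), averaged over a batch and
    negated so it can be minimized."""
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_theta / pi_theta_k
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()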
6.3.1 Data Recording

LOB data for cryptocurrencies is free to access via WebSocket, but not readily downloadable from exchanges, and therefore requires recording. The data set for this experiment was recorded using Level II tick and trade data from the Bitmex exchange and persisted into an Arctic TickStore (https://github.com/man-group/arctic) for storage. Unlike previous work, where we replayed recorded data to reconstruct the data set, in this experiment we recorded the LOB snapshots in real time using one-second time intervals. This approach has two main advantages over replaying tick data. First, the computational burden of setting up the experiment is significantly reduced (millions of tick events per trading day), since the LOB no longer needs to be reconstructed to create LOB snapshot data. Second, the data feed is more reflective of a production trading system. However, this approach introduced a small amount of latency into the snapshot intervals (less than 1 millisecond), resulting in approximately 86,390 snapshots per 24-hour trading day, as opposed to the actual number of seconds (86,400). We export the recorded data to compressed CSV files, segregated by trading date using the UTC timezone; each file is approximately 160 MB in size before compression.

6.3.2 Data Processing

Since LOB data cannot be used for machine learning without preprocessing, it is necessary to apply a normalization technique to the raw data set. In this experiment, the data set was normalized using the approach described by [6, 20], which transforms the LOB from a non-stationary into a stationary feature set, then uses the previous three trading days to fit and z-score normalize the current trading day's values, where a data point z_x is σ standard deviations from the mean x̄. After normalizing the data set, outliers (values less than -10 or greater than 10) are clipped.

z_x = (x − x̄) / σ    (10)
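A minimal sketch of the normalization in equation (10), including the outlier clipping described above; fitting on the previous three trading days follows the text, while the NumPy array handling and the guard for constant columns are illustrative assumptions.

import numpy as np

def zscore_normalize(day: np.ndarray, previous_days: np.ndarray, clip: float = 10.0) -> np.ndarray:
    """Z-score the current trading day's stationary features (equation 10) using
    the mean and standard deviation fitted on the previous three trading days,
    then clip outliers to [-clip, clip]."""
    mean = previous_days.mean(axis=0)
    std = previous_days.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)  # guard against constant columns
    return np.clip((day - mean) / std, -clip, clip)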
6.3.3 Simulation Rules

The environment follows a set of rules to ensure the simulation is as realistic as possible.

Episode
An episode is defined as a 24-hour trading day, using coordinated universal time (UTC) to segregate trading days. At the end of an episode, the agent is required to flatten its entire inventory as a risk control.
Transaction Fees
Transaction fees for orders are included and are deducted from the realized profit and loss when an order is completed. We use a maker rebate of 0.025% and a taker fee of 0.075%, which corresponds to Bitmex's fee schedule at the time of the experiment. The maker-taker fee structure is crucial to the success of our agent's market making trading strategy.
Risk Limits
The agent is permitted to open one order per side (i.e., bid and ask) at a given moment, and can hold up to ten executed orders (i.e., inventory maximum IM = 10) in its inventory Inv. All orders placed by the agent are equal in size Sz. There are no stop losses imposed on agents.

Position Netting

If the agent has an open long (short) position and fills an open short (long) order, the existing long (short) position is netted, and the position's realized profit and loss is calculated in FIFO order. PnL is calculated in percentage terms.
Executions
Each time a new order is opened by the agent, the dollar value (i.e., price × quantity) of the order's price level i at time step t is captured by our simulator, and is only reduced when there are buy (sell) transactions at or above (below) the ask (bid). Only after the LOB price-level queue is depleted can the agent's order begin to be executed. This environment rule is necessary to help simulate more realistic results. Additionally, the agent can modify an existing open order, even if it is partially filled, to a new price, thereby resetting its priority in the price-level queue. Once the order is filled completely, the average execution price Ex^Avg is used for profit and loss calculations and the agent must wait until the next environment step to select an action a_t (such as replenishing the filled order in the order book).

Slippage
If the agent decides to flatten its inventory, we account for market impact by applying a fixed slippage percentage ξ to each transaction n individually and recursively (i.e., p_n^slippage = p_{n-1}^slippage ± ξ), where ξ is 0.01% and the sign is linked to the order direction. We noticed that adding slippage to the flatten-all action encouraged the agent to use limit orders more frequently.

The market-making agents (A2C and PPO) are trained on 8 days of data (December 27th, 2019 to January 3rd, 2020) and tested on 30 days of data (January 4th to February 3rd, 2020, not including January 14th, 2020 due to a dropped WebSocket connection) using perpetual Bitcoin data (instrument: XBTUSD). Each trading day consists of approximately 86,400 one-second snapshots.

Feature Sets    LOB Quantity    Order Flow    LOB Imbalances    Indicators
Set 1   ✓ ✓ ✓ ✓
Set 2   ✓ ✓
Set 3   ✓ ✓ ✓
Set 4   ✓ ✓
Set 5   ✓ ✓ ✓
Set 6   ✓ ✓
Table 2: Combination of features which make up the observation space in different experiments. For example, Set 1 uses all available features, whereas Set 2 uses only LOB Imbalances and Indicators to represent the environment's state space. Implementation details for features are outlined in section A.2.1.

In each experiment, agents are trained for one million environment steps and episodes restart at a random step in the environment to prevent deterministic learning. The time-based environment takes advantage of action repeats, enabling agents to accelerate learning; we use five action repeats in our experiment, which results in up to approximately 17,000 agent interactions with the environment per episode. The price-based environment does not use action repeats, since the number of interactions with the environment is already reduced to approximately 5,000 instances per episode. It is important to note that during action repeats or between price events, the agent's action is only performed once and not repeated; all subsequent repeats in the environment consist of taking "no action," thereby avoiding performing illogical repetitive actions multiple times in a row, such as flattening the entire inventory, or re-posting orders to the same price and losing LOB queue priority. All the experiment parameters are outlined in the appendix (section A.1).
7 Results and Analysis

In this section, the performance of each agent (PPO and A2C) is compared using the cumulative return (in percentage), including transaction costs, from the out-of-sample tests. Our benchmark is a simple buy-and-hold strategy, where we assume Bitcoin is purchased on the first day of the out-of-sample data and sold on the last day, for a total holding period of 30 consecutive trading days. Although the train-test split of data sets was selected based on data availability and not empirically, the out-of-sample data set coincidentally captures a volatile, month-long upward trend in January 2020, which enables the benchmark to generate a 16.25% return during this period.

The best result obtained from our agents is a 17.61% return over this same period, using the Trade Completion TC reward function, the A2C algorithm, and feature combination Set 3 for the observation space. Although the A2C algorithm outperformed the PPO agent in terms of greatest return and number of profitable experiments, it is interesting that no clear trends emerged for the best observation space combination or reward function (other than what does not work).

It is worth noting that on January 19, 2020, the price of Bitcoin sold off more than 5% in less than 200 seconds, and all experiments (agents, reward functions, and observation space combinations) incurred significant losses ranging between 5% and 10% as a result of the rapid price drop; if this trading day were excluded, many more experiments would have yielded positive results. All experiment results are outlined in Tables 3 and 4.
We evaluated seven different reward functions across a combination of features in the observation space, A2C and PPO reinforcement learning algorithms, and time- and price-based event environments. Each reward function resulted in the agent learning a different approach to trading and maintaining its inventory.

PnL-based rewards

Reward functions where realized gains are not incorporated in the function's feedback signal tended to result in nearsighted trading behavior. For example, the unrealized profit-and-loss function UPnL encouraged the agent to use market orders (e.g., action 17, flatten inventory) often and to execute many trades with short holding periods, resulting in consistent losses due to transaction costs.

Reward functions where the feedback signal is sparse tended to result in speculative trading behavior. For example, the change in realized profit function ΔRPnL encouraged the agent to hold positions for an extended period, regardless of whether the agent had large unrealized gains or drawdowns.

Reward functions where the feedback signal is dampened asymmetrically tended to result in tactical trading while failing to exploit large price movements. For example, the asymmetrical unrealized PnL with realized fills function Asym discouraged the agent from holding a position for an extended period of time into a price jump, either resulting in closing out a position too early and foregoing profits, or closing out a position during a transitory drawdown period. These types of reward functions are very sensitive to the dampening factor η, and the value 0.35 yielded the most stable out-of-sample performance through a grid search.

The Trade Completion TC reward function tended to result in more active trading and inventory management. For example, the agent does not hold positions for speculation and quickly closes positions as they approach the upper and lower boundaries of the reward function curve.

The Differential Sharpe Ratio DSR reward function produced inconsistent results, and appears to be very sensitive to experiment settings. For example, in some experiments the agents learned very stable trading strategies, while being unable to learn at all in other experiments (even with different random seeds). Additionally, in certain market conditions the agents were able to learn how to exploit price jumps, while making nonsensical decisions in other market regimes. It is possible that this reward function could perform better with a thorough parameter grid search.
Figure 2: Plots of agent episode performance. Green and red dots represent buy and sell executions, respectively. Left: Example of a price-based PPO agent making nearsighted decisions and frequent use of market orders with reward function UPnL on February 3, 2020. Right: Example of a time-based PPO agent trading tactically while failing to exploit price jumps with reward function AsymC on January 6, 2020.

Figure 3: Plots of agent episode performance.
Green and red dots represent buy and sell executions, respectively. Left: Example of a time-based A2C agent effectively scaling into positions with goal-oriented reward function TC on January 4, 2020. Right: Example of a price-based A2C agent actively trading and exploiting a price jump with reward function DSR on January 30, 2020.
Time-event: Profit-and-Loss (%)
Reward Function    A2C: Set 1  Set 2  Set 3  Set 4  Set 5  Set 6    PPO: Set 1  Set 2  Set 3  Set 4  Set 5  Set 6
UPnL    (-12.05) (-11.67) (-12.06) (-24.00) (-35.30) (-14.83)    (-18.58) (-29.57) (-25.62) (-43.95) (-32.12) (-56.52)
UPnLwF
Asym    (-13.09) (-39.02) (-35.41) (-6.96) 2.82 8.28    (-14.00) (-16.61) (-13.00) (-17.97) (-11.57) (-4.69)
AsymC   (-13.32) (-38.37)
ΔRPnL   (-10.57) (-36.18) (-19.41) (-21.71) (-33.18) (-26.42)    (-31.16) (-39.70) (-13.90) (-8.19) (-19.50) (-30.56)
TC      (-7.22) (-32.49) (-7.83) (-0.82) 3.45 (-24.88) (-2.42) (-18.55) (-13.28) (-24.53) (-19.34)
DSR     (-16.55) (-23.33) (-0.66) 0.18 9.98 (-2.99)    (-7.55) (-18.71) (-6.11) (-19.96) (-28.43) (-38.10)
Table 3: Total return (in percentage) for the out-of-sample data set (January 4, 2020 to February 3, 2020) using the time-based event environment.
The time-based environments were more difficult for the agents to learn; 15 out of 84 experiments led to profitable outcomes. This is likely due to the training methodology, where agents may benefit from training for more than one million steps. That said, the time-based environment was able to achieve the highest return out of all experiments due to quicker reactions to adverse price movements with the goal-based reward function TC.

The price-based environments were easier for the agents to learn; 23 out of 84 experiments led to profitable outcomes. This is likely due to the nature of having the agent take steps in the environment only when the price changes, thereby avoiding some noise in the LOB data. Although this environment approach did not yield the highest score, in general the agent trading patterns appeared to be more stable and less erratic during large price jumps.
Price-event: Profit-and-Loss (%)
Reward Function    A2C: Set 1  Set 2  Set 3  Set 4  Set 5  Set 6    PPO: Set 1  Set 2  Set 3  Set 4  Set 5  Set 6
UPnL    (-31.42) (-28.65) (-38.74) (-0.95) (-0.91) (-43.58)    (-31.74) (-25.89) (-21.37) (-46.12) (-16.72) (-32.32)
UPnLwF
Asym    (-27.21) (-2.00) (-1.66) (-0.86) (-14.82) (-8.16)    (-16.04) (-12.76) (-15.32) (-6.10) (-15.78) (-10.73)
AsymC   (-7.12) (-11.58) (-12.24) 11.88 (-14.98) (-14.12)    (-19.58) (-15.92) (-2.08) (-12.57) (-8.21) (-15.42)
ΔRPnL
TC      (-16.25)
DSR     (-27.35) 2.80 12.23 (-1.80) 9.73 (-28.14)    (-14.23) (-13.96) (-19.67) (-24.19) 5.74 (-33.25)
Table 4: Total return (in percentage) for the out-of-sample data set (January 4, 2020 to February 3, 2020) using the price-based event environment.
8 Conclusion and Future Work
In this paper, two advanced policy-based model-free reinforcement learning algorithms were trained to learn automated market making for Bitcoin using high-resolution Level II tick data from the Bitmex exchange. The agents learned different trading strategies from seven different reward functions and six different combinations of features for the agent's observation space. Additionally, this paper proposes a price-based approach to defining an event in which the agent steps through the environment and demonstrates its effectiveness in solving the automated market making challenge, extending the DRLMM framework [6]. All agents were trained for one million steps across eight days of data and evaluated on 30 out-of-sample days. The A2C algorithm outperformed PPO in terms of cumulative return and number of profitable experiments. An A2C agent with the goal-based TC reward function generated the greatest return for both time- and price-based environments.

Several observations made during the execution of this experiment could lead to fruitful future research avenues. First, a formalized methodology for training model-free reinforcement learning agents in the context of financial time-series problems would be valuable. More specifically, it would be worthwhile to explore the effects of a framework for scoring and selecting which trading days to include in the training data set (e.g., volatility, daily volume, number of price jumps, etc.). Second, with the demonstrated success of more advanced neural network architectures in the supervised learning domain [1, 21], it would be interesting to see if convolution, attention, and recurrent neural networks help the agents learn to better exploit price jumps.
ACKNOWLEDGEMENTS
Thank you to Toussaint Behaghel for reviewing the paper and providing helpful feedback, and to Florian Labat for suggesting the use of price-based events in reinforcement learning. Thank you to Mathieu Beucher, Sakina Ouisrani, and Badr Ghazlane for helping execute experiments and collate results.
References

[1] Zihao Zhang, Stefan Zohren, and Stephen Roberts. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing, 67(11):3001–3012, June 2019.
[2] Baron Law and Frederi Viens. Market making under a weakly consistent limit order book model. 2019.
[3] Thomas Spooner, John Fearnley, Rahul Savani, and Andreas Koukorinis. Market making via reinforcement learning, 2018.
[4] Maxime Morariu-Patrichi and Mikko S. Pakkanen. State-dependent Hawkes processes and their application to limit order book modelling, 2018.
[5] E. Bacry and J. F. Muzy. Hawkes model for price and trades high-frequency dynamics, 2013.
[6] Jonathan Sadighian. Deep reinforcement learning in cryptocurrency market making, 2019.
[7] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
[9] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
[10] OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning, 2019.
[11] Rakesh Agrawal, Paul E. Stolorz, and Gregory Piatetsky-Shapiro, editors. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City, New York, USA, August 27-31, 1998. AAAI Press, 1998.
[12] John Moody and Matthew Saffell. Reinforcement learning for trading. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 917–923, Cambridge, MA, USA, 1999. MIT Press.
[13] John E. Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, 2001.
[14] Carl Gold. FX trading via recurrent reinforcement learning, pages 363–370, 2003.
[15] Yagna Patel. Optimizing market making using multi-agent reinforcement learning, 2018.
[16] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Stable Baselines. https://github.com/hill-a/stable-baselines, 2018.
[17] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning, 2016.
[18] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015.
[19] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
[20] Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Using deep learning for price prediction by exploiting stationary limit order book features, 2018.
[21] James Wallbridge. Transformers for limit order books, 2020.
Appendix
A.1 Agent Configurations
The parameters used to train agents in all experiments include γ, α, and λ.

A.2 Observation Space
As set forth in our previous work [6], the agent's observation space is a combination of three sub-spaces: the environment state space, consisting of LOB, trade and order flow snapshots with a window size w; the agent state space, consisting of handcrafted risk and position indicators; and the agent action space, consisting of a one-hot vector of the agent's latest action. In this experiment, w is set to 100.

A.2.1 Environment State Space

LOB Quantity
The dollar value of each price level in the LOB, where χ is the dollar value at LOB level i at time t, applied to both bid and ask sides. Since we use the first 20 price levels of the LOB, this feature is represented by a vector of 40 values.

χ_{t,i}^{bid,ask} = Σ_{i=0}^{I-1} p_{t,i}^{bid,ask} × q_{t,i}^{bid,ask}    (11)

where p^{bid,ask} is the price and q^{bid,ask} is the quantity at LOB level i for the bid and ask sides, respectively.
LOB Imbalances

The order imbalances ι_{t,i} ∈ [−1, 1] are represented by the cumulative dollar value for each price level i in the LOB. Since we use the first 20 price levels of the LOB, this feature is represented by a vector of 20 values.

ι_{t,i} = (χ_{t,i}^{ask,q} − χ_{t,i}^{bid,q}) / (χ_{t,i}^{ask,q} + χ_{t,i}^{bid,q})    (12)
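A small sketch of equations (11) and (12) applied to one LOB snapshot follows; reading χ as the cumulative dollar value up to each price level follows the description above, and the function names and NumPy usage are illustrative.

import numpy as np

def lob_notional(prices: np.ndarray, quantities: np.ndarray) -> np.ndarray:
    """Cumulative dollar value per price level for one side of the book (equation 11)."""
    return np.cumsum(prices * quantities)

def lob_imbalance(bid_prices, bid_qty, ask_prices, ask_qty) -> np.ndarray:
    """Per-level order imbalance in [-1, 1] (equation 12)."""
    chi_bid = lob_notional(np.asarray(bid_prices), np.asarray(bid_qty))
    chi_ask = lob_notional(np.asarray(ask_prices), np.asarray(ask_qty))
    return (chi_ask - chi_bid) / (chi_ask + chi_bid)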
Order Flow

The sum of dollar values for cancel C, limit L, and market M orders is captured between each LOB snapshot. Since we use the first 20 price levels of the LOB, this feature is represented by a vector of 120 values, 60 per each side of the LOB.

C_{t,i}^{bid,ask} = p_{t,i}^{bid,ask} × q_{t,i}^{bid,ask}    (13)
L_{t,i}^{bid,ask} = p_{t,i}^{bid,ask} × q_{t,i}^{bid,ask}    (14)
M_{t,i}^{bid,ask} = p_{t,i}^{bid,ask} × q_{t,i}^{bid,ask}    (15)

where q is the number of units available at price p at LOB level i.
Trade Flow Imbalances

The Trade Flow Imbalance indicator TFI ∈ [−1, 1] measures the magnitude of buyer-initiated BI and seller-initiated SI transactions over a given window w. Since we use 3 different windows w (5, 15, and 30 minutes), this feature is represented by a vector of 3 values.

TFI_t = (UP_t − DWN_t) / (UP_t + DWN_t)    (16)

where UP_t = Σ_{n=0}^{w} BI_n and DWN_t = Σ_{n=0}^{w} SI_n.

Custom RSI

The relative strength index (RSI) indicator measures the magnitude of price changes over a given window w. This custom implementation CRSI ∈ [−1, 1] scales the data so that it does not require normalization, even though we do use the scaled z-score values in our experiment. Since we use 3 different windows w (5, 15, and 30 minutes), this feature is represented by a vector of 3 values.

CRSI_t = (gain_t − |loss_t|) / (gain_t + |loss_t|)    (17)

where gain_t = Σ_{n=0}^{w} Δm_n if Δm_n > 0 else 0, loss_t = Σ_{n=0}^{w} Δm_n if Δm_n < 0 else 0, and Δm_t = m_t / m_{t-1} − 1.
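Both window-based indicators reduce to the same scaled-ratio pattern; the sketch below assumes the buyer- and seller-initiated volumes and the midpoint returns over the window are precomputed, and the zero-denominator guards are illustrative additions.

import numpy as np

def trade_flow_imbalance(buy_volume: np.ndarray, sell_volume: np.ndarray) -> float:
    """Equation (16): scaled difference of buyer- and seller-initiated volume."""
    up, down = buy_volume.sum(), sell_volume.sum()
    return 0.0 if up + down == 0 else (up - down) / (up + down)

def custom_rsi(midpoint_returns: np.ndarray) -> float:
    """Equation (17): scaled ratio of summed gains and losses, bounded in [-1, 1]."""
    gain = midpoint_returns[midpoint_returns > 0].sum()
    loss = midpoint_returns[midpoint_returns < 0].sum()
    denom = gain + abs(loss)
    return 0.0 if denom == 0 else (gain - abs(loss)) / denom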
Spread

The spread ς_t is the difference between the best bid p^bid and the best ask p^ask. This feature is represented as a scalar.

ς_t = p_t^bid − p_t^ask    (18)
Change in Midpoint

The change in midpoint δm_t is the log difference in midpoint prices between time steps t and t−1. This feature is represented as a scalar.

δm_t = log m_t − log m_{t-1}    (19)
Reward

The reward r_t from the environment, as described in section 4.

A.2.2 Agent State Space

Net Inventory Ratio
The agent’s net inventory ratio υ ∈ [ − , is the inventory count Inv representedas a percentage of the maximum inventory IM . This feature is represented as a scalar. υ t = Inv long − Inv short IM (20) Realized PnL
The agent’s realized profit-and-loss
RP nL is the sum of realized and unrealized profitand losses. In this experiment, the
RP nL is scaled by a scalar value ρ , which represents the daily PnLtarget. 15 nrealized PnL The agent’s current unrealized PnL
The agent's current unrealized PnL UPnL_t is the unrealized PnL across all open positions. The unrealized PnL feature is represented as a scalar, containing the net of long and short positions.

UPnL_t = [p_t^{Avg,short} / m_t − 1] + [m_t / p_t^{Avg,long} − 1]    (21)

where p^Avg is the average price of the agent's long or short position and m_t is the midpoint price at time t.
The agent’s open limit order distance to midpoint is the distance ζ of the agent’s open bid and ask limit orders L to the midpoint price m at time t . The feature isrepresented as a vector with 2 values. ζ long,shortt = L bid,askt m t − (22) Order Completion Ratio