A reinforcement learning extension to the Almgren-Chriss model for optimal trade execution
Dieter Hendricks
School of Computational and Applied Mathematics
University of the Witwatersrand
Johannesburg, South Africa
Email: [email protected]
Diane Wilcox
School of Computational and Applied Mathematics
University of the Witwatersrand
Johannesburg, South Africa
Email: [email protected]
Abstract—Reinforcement learning is explored as a candidate machine learning technique to enhance existing analytical solutions for optimal trade execution with elements from the market microstructure. Given a volume-to-trade, fixed time horizon and discrete trading periods, the aim is to adapt a given volume trajectory such that it is dynamic with respect to favourable/unfavourable conditions during real-time execution, thereby improving the overall cost of trading. We consider the standard Almgren-Chriss model with linear price impact as a candidate base model. This model is popular amongst sell-side institutions as a basis for arrival price benchmark execution algorithms. By training a learning agent to modify a volume trajectory based on the market's prevailing spread and volume dynamics, we are able to improve post-trade implementation shortfall by up to 10.3% on average compared to the base model, based on a sample of stocks and trade sizes in the South African equity market.
I. INTRODUCTION
A critical problem faced by participants in investment markets is the so-called optimal liquidation problem, viz. how best to trade a given block of shares at minimal cost. Here, cost can be interpreted as in Perold's implementation shortfall [21], i.e. adverse deviations of actual transaction prices from an arrival price baseline when the investment decision is made. Alternatively, cost can be measured as a deviation from the market volume-weighted trading price (VWAP) over the trading period, effectively comparing the specific trader's performance to that of the average market trader. In each case, the primary problem faced by the trader/execution algorithm is the compromise between price impact and opportunity cost when executing an order.

Price impact here refers to adverse price moves due to a large trade size absorbing liquidity supply at available levels in the order book (temporary price impact). As market participants begin to detect the total volume being traded, they may also adjust their bids/offers downward/upward to anticipate order matching (permanent price impact) [16]. To avoid price impact, traders may split a large order into smaller child orders over a longer period. However, there may be exogenous market forces which result in execution at adverse prices (opportunity cost). This behaviour of institutional investors was empirically demonstrated in [9], where they observed that typical trades of large investment management firms are almost always broken up into smaller trades and executed over the course of a day or several days.

Several authors have studied the problem of optimal liquidation, with a strong bias towards stochastic dynamic programming solutions; see [7], [17], [26], [1] as examples. In this paper, we consider the application of a machine learning technique to the problem of optimal liquidation. Specifically, we consider a case where the popular Almgren-Chriss closed-form solution for a trading trajectory (see [1]) can be enhanced by exploiting microstructure attributes over the trading horizon using a reinforcement learning technique.

Reinforcement learning in this context is essentially a calibrated policy mapping states to optimal actions. Each state is a vector of observable attributes which describe the current configuration of the system. It proposes a simple, model-free mechanism for agents to learn how to act optimally in a controlled Markovian domain, where the quality of the action chosen is successively improved for a given state [10]. For the optimal liquidation problem, the algorithm examines the salient features of the current order book and the current state of execution in order to decide which action (e.g. child order price or volume) to select to service the ultimate goal of minimising cost.

The first documented large-scale empirical application of reinforcement learning algorithms to the problem of optimised trade execution in modern financial markets was conducted by [20]. They set up their problem as a minimisation of implementation shortfall for a buying/selling program over a fixed time horizon with discrete time periods. For actions, the agent could choose a price at which to repost a limit order for the remaining shares in each discrete period. State attributes included elapsed time, remaining inventory, current spread, immediate cost and signed volume.
In their results, they found that their reinforcement learning algorithm improved execution efficiency by 50% or more over traditional submit-and-leave or market order policies.

Instead of a pure reinforcement learning solution to the problem, as in [20], we propose a hybrid approach which enhances a given analytical solution with attributes from the market microstructure. Using the Almgren-Chriss (AC) model as a base, for a finite liquidation horizon with discrete trading periods, the algorithm determines the proportion of the AC-suggested trajectory to trade based on prevailing volume/spread attributes. One would expect, for example, that allowing the trajectory to be more aggressive when volumes are relatively high and spreads are tight may reduce the ultimate cost of the trade. In our implementation, a static volume trajectory is preserved for the duration of the trade; however, the proportion traded is dynamic with respect to market dynamics. As in [20], a market order is executed at the end of the trade horizon for the remaining volume, to ensure complete liquidation. An important consideration in our analysis is the specification of the problem as a finite-horizon Markov Decision Process (MDP) and the consequences for optimal policy convergence of the reinforcement learning algorithm. In [20], an approximation is used to address this issue by incorporating elapsed time as a state attribute; however, they do not explicitly discuss convergence. We will use the findings of [14] in our model specification and demonstrate near-optimal policy convergence of the finite-horizon MDP problem.

The model described above is compared with the base Almgren-Chriss model to determine whether it increases/decreases the cost of execution for different types of trades consistently and significantly. This study will help determine whether reinforcement learning is a viable technique which can be used to extend existing closed-form solutions to exploit the nuances of the microstructure where the algorithms are applied.

This paper proceeds as follows: Section II introduces the standard Almgren-Chriss model. Section III describes the specific hybrid reinforcement learning technique proposed, along with a discussion regarding convergence to optimal action values. Section IV discusses the data used and results, comparing the two models for multiple trade types. Section V concludes and proposes some extensions for further research.

II. THE ALMGREN-CHRISS MODEL
Bertsimas and Lo are pioneers in the area of optimal liquidation, treating the problem as a stochastic dynamic programming problem [7]. They employed a dynamic optimisation procedure which finds an explicit closed-form best execution strategy, minimising trading costs over a fixed period of time for large transactions. Almgren and Chriss extended the work of [7] to allow for risk aversion in their framework [1]. They argue that incorporating the uncertainty of execution of an optimal solution is consistent with a trader's utility function. In particular, they employ a price process which permits linear permanent and temporary price impact functions to construct an efficient frontier of optimal execution. They define a trading strategy as being efficient if there is no strategy which has lower execution cost variance for the same or lower level of expected execution cost.

The exposition of their solution is as follows. They assume that the security price evolves according to a discrete arithmetic random walk:

$$S_k = S_{k-1} + \sigma \tau^{1/2} \xi_k - \tau g\!\left(\frac{n_k}{\tau}\right),$$

where $S_k$ is the price at time $k$, $\sigma$ is the volatility of the security, $\tau$ is the length of the discrete time interval, $\xi_k$ are draws from independent random variables, $n_k$ is the volume traded at time $k$ and $g(\cdot)$ is the permanent price impact function.

Here, permanent price impact refers to changes in the equilibrium price as a direct function of our trading, which persist for at least the remainder of the liquidation horizon. Temporary price impact refers to adverse deviations as a result of absorbing available liquidity supply, but where the impact dissipates by the next trading period due to the resilience of the order book. Almgren and Chriss introduce a temporary price impact function $h(v)$ to their model, where $h(v)$ causes a temporary adverse move in the share price as a function of our trading rate $v$ [1]. Given this addition, the actual security transaction price at time $k$ is given by:

$$\tilde{S}_k = S_{k-1} - h\!\left(\frac{n_k}{\tau}\right).$$

Assuming a sell program, we can then define the total trading revenue as:

$$\sum_{k=1}^{N} n_k \tilde{S}_k = X S_0 + \sum_{k=1}^{N}\left(\sigma \tau^{1/2} \xi_k - \tau g\!\left(\frac{n_k}{\tau}\right)\right) x_k - \sum_{k=1}^{N} n_k h\!\left(\frac{n_k}{\tau}\right),$$

where $x_k = X - \sum_{j=1}^{k} n_j = \sum_{j=k+1}^{N} n_j$ for $k = 0, 1, \ldots, N$.

The total cost of trading is thus given by $x = X S_0 - \sum_k n_k \tilde{S}_k$, i.e. the difference between the target revenue value and the total actual revenue from the execution. This definition corresponds to Perold's implementation shortfall measure (see [21]), and serves as the primary transaction cost metric which is minimised in order to maximise trading revenue. Since implementation shortfall is a random variable, Almgren and Chriss compute:

$$E(x) = \sum_{k=1}^{N} \tau x_k\, g\!\left(\frac{n_k}{\tau}\right) + \sum_{k=1}^{N} n_k\, h\!\left(\frac{n_k}{\tau}\right)
\qquad \text{and} \qquad
Var(x) = \sigma^2 \sum_{k=1}^{N} \tau x_k^2.$$

The distribution of implementation shortfall is Gaussian if the $\xi_k$ are Gaussian.

Given the overall goal of minimising execution costs and the variance of execution costs, they specify their objective function as:

$$\min_x \left\{ E(x) + \lambda\, Var(x) \right\},$$

where $x$ is the implementation shortfall and $\lambda$ is the level of risk aversion.

The intuition of this objective function can be thought of as follows. Consider a stock which exhibits high price volatility and thus a high risk of price movement away from the reference price. A risk-averse trader would prefer to trade a large portion of the volume immediately, causing a (known) price impact, rather than risk trading in small increments at successively adverse prices.
Alternatively, if the price is expected to be stable over the liquidation horizon, the trader would rather split the trade into smaller sizes to avoid price impact. This trade-off between speed of execution and risk of price movement is what governs the shape of the resulting trade trajectory in the AC framework.

A detailed derivation of the general solution can be found in [1]. Here, we state the general solution:

$$x_j = \frac{\sinh\!\big(\kappa (T - t_j)\big)}{\sinh(\kappa T)}\, X \quad \text{for } j = 0, \ldots, N.$$

The associated trade list is:

$$n_j = \frac{2 \sinh\!\big(\tfrac{1}{2}\kappa \tau\big)}{\sinh(\kappa T)} \cosh\!\big(\kappa (T - t_{j-\frac{1}{2}})\big)\, X \quad \text{for } j = 1, \ldots, N,$$

where

$$\kappa = \frac{1}{\tau} \cosh^{-1}\!\left(\frac{\tau^2}{2}\tilde{\kappa}^2 + 1\right), \qquad
\tilde{\kappa}^2 = \frac{\lambda \sigma^2}{\eta\left(1 - \frac{\rho \tau}{2\eta}\right)},$$

with $\eta$ the temporary price impact parameter, $\rho$ the permanent price impact parameter and $\tau$ the length of the discrete time period.

This implies that for a program of selling an initially long position, the solution decreases monotonically from its initial value to zero at a rate determined by the parameter $\kappa$. If trading intervals are short, $\kappa$ is essentially the ratio of the product of volatility and risk-intolerance to the temporary transaction cost parameter. We note here that a larger value of $\kappa$ implies a more rapid trading program, again conceptually confirming the propositions of [17] that an intolerance for execution risk leads to a larger concentration of quantity traded early in the trading program. Another consequence of this analysis is that different-sized baskets of the same securities will be liquidated in the same manner, barring scale differences and provided the risk aversion parameter $\lambda$ is held constant. This may be counter-intuitive, since one would expect larger baskets to be effectively less liquid, and thus to follow a less rapid trading program to minimise price impact costs.

It should be noted that the AC solution yields a suggested volume trajectory over the liquidation horizon; however, there is no discussion in [1] as to the prescribed order type to execute the trade list. We have assumed that the trade list can be executed as a series of market orders. Given that this implies we are always crossing the spread, one needs to consider that traversing an order book with thin volumes and widely-spaced prices could have a significant transaction cost impact. We thus consider a reinforcement learning technique which learns when and how much to cross the spread, based on the current order book dynamics.

The general solution outlined above assumes linear price impact functions; however, the model was extended by Almgren in [2] to account for non-linear price impact. This extended model can be considered as an alternative base model in future research.
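As a concrete illustration of the closed-form trajectory stated above, a minimal Python sketch is given below. The function name ac_trajectory and the parameter values are illustrative assumptions, not the calibrated values used later in this paper.

```python
import numpy as np

def ac_trajectory(X, N, tau, lam, sigma, eta, rho):
    """Almgren-Chriss holdings x_j and trade list n_j under linear impact."""
    # kappa_tilde^2 = lambda * sigma^2 / (eta * (1 - rho*tau/(2*eta)))
    kappa_tilde_sq = lam * sigma**2 / (eta * (1.0 - rho * tau / (2.0 * eta)))
    # kappa = (1/tau) * arccosh(tau^2 * kappa_tilde^2 / 2 + 1)
    kappa = np.arccosh(0.5 * tau**2 * kappa_tilde_sq + 1.0) / tau
    T = N * tau
    t = np.arange(N + 1) * tau                               # decision times t_0, ..., t_N
    x = np.sinh(kappa * (T - t)) / np.sinh(kappa * T) * X    # holdings trajectory x_0, ..., x_N
    n = (2.0 * np.sinh(0.5 * kappa * tau) / np.sinh(kappa * T)
         * np.cosh(kappa * (T - (t[1:] - 0.5 * tau))) * X)   # trade list n_1, ..., n_N
    return x, n

# Purely illustrative parameter values
x, n = ac_trajectory(X=100_000, N=4, tau=5.0, lam=0.01,
                     sigma=0.01, eta=1e-5, rho=0.0)
print(np.round(n))   # shares to sell in each of the N periods; the list sums to X
```

As expected from the closed form, the trade list is front-loaded and decays monotonically; larger $\kappa$ (higher risk aversion or volatility relative to temporary impact) concentrates more of the volume in the early periods.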
III. A REINFORCEMENT LEARNING APPROACH

The majority of reinforcement learning research is based on a formalism of Markov Decision Processes (MDPs) [4]. In this context, reinforcement learning is a technique used to numerically solve for a calibrated policy mapping states to optimal or near-optimal actions. It is a framework within which a learning agent repeatedly observes the state of its environment, and then performs a chosen action to service some ultimate goal. Performance of the action has an immediate numeric reward or penalty and changes the state of the environment [10].

The problem of solving for an optimal policy mapping states to actions is well known in stochastic control theory, with a significant contribution by Bellman [5]. Bellman showed that the computational burden of an MDP can be significantly reduced using what is now known as dynamic programming. It was, however, recognised that two significant drawbacks exist for classical dynamic programming. Firstly, it assumes that a complete, known model of the environment exists, which is often not realistically obtainable. Secondly, problems rapidly become computationally intractable as the number of state variables, and hence the size of the state space for which the value function must be computed, increases. This problem is referred to as the curse of dimensionality [25].

Reinforcement learning offers two advantages over classical dynamic programming. Firstly, agents learn online and continually adapt while performing the given task. Secondly, the methods can employ function approximation algorithms to represent their knowledge. This allows them to generalise across the state space so that the learning time scales much better [12]. Reinforcement learning algorithms do not require knowledge about the exact model governing an MDP and can thus be applied to MDPs where exact methods become infeasible.

Although a number of implementations of reinforcement learning exist, we will focus on Q-learning. This is a model-free technique first introduced by [27], which can be used to find the optimal, or near-optimal, action-selection policy for a given MDP.
During Q-learning, an agent's learning takes place during sequential episodes. Consider a discrete finite world where at each step $n$, an agent is able to register the current state $x_n \in X$ and can choose from a finite set of actions $a_n \in A$. The agent then receives a probabilistic reward $r_n$, whose mean value $R_{x_n}(a_n)$ depends only on the current state and action. According to [27], the state of the world changes probabilistically to $y_n$ according to:

$$Prob(y_n = y \mid x_n, a_n) = P_{x_n y}(a_n).$$

The agent is then tasked to learn the optimal policy mapping states to actions, i.e. one which maximises total discounted expected reward. Under some policy mapping $\pi$ and discount rate $\gamma$ ($0 < \gamma < 1$), the value of state $x$ is given by:

$$V^{\pi}(x) = R_x(\pi(x)) + \gamma \sum_{y} P_{xy}(\pi(x))\, V^{\pi}(y).$$

According to [6] and [23], the theory of dynamic programming says there is at least one optimal stationary policy $\pi^*$ such that

$$V^{*}(x) = V^{\pi^*}(x) = \max_{a}\left\{ R_x(a) + \gamma \sum_{y} P_{xy}(a)\, V^{\pi^*}(y) \right\}.$$

We also define $Q^{\pi}(x, a)$ as the expected discounted reward from choosing action $a$ in state $x$, and then following policy $\pi$ thereafter, i.e.

$$Q^{\pi}(x, a) = R_x(a) + \gamma \sum_{y} P_{xy}(a)\, V^{\pi}(y).$$

The task of the Q-learning agent is to determine $V^*$, $\pi^*$ and $Q^{\pi^*}$ where $P_{xy}(a)$ is unknown, using a combination of exploration and exploitation techniques over the given domain. It can be shown that $V^*(x) = \max_a Q^*(x, a)$ and that an optimal policy can be formed such that $\pi^*(x) = a^*$. It thus follows that if the agent can find the optimal Q-values, the optimal action can be inferred for a given state $x$. It is shown in [10] that an agent can learn Q-values via experiential learning, which takes place during sequential episodes. In the $n$th episode, the agent:

• observes its current state $x_n$,
• selects and performs an action $a_n$,
• observes the subsequent state $y_n$ as a result of performing action $a_n$,
• receives an immediate reward $r_n$ and
• uses a learning factor $\alpha_n$, which decreases gradually over time.

$Q$ is updated as follows:

$$Q(x_n, a_n) \leftarrow Q(x_n, a_n) + \alpha_n \left( r_n + \gamma \max_{b} Q(x_{n+1}, b) - Q(x_n, a_n) \right).$$

Provided each state-action pair is visited infinitely often, [10] show that $Q$ converges to $Q^*$ for any exploration policy. Singh et al. provide guidance as to specific exploration policies for asymptotic convergence to optimal actions and asymptotic exploitation under the Q-learning algorithm, which we incorporate in our analysis [24].
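For illustration, a minimal tabular Q-learning step corresponding to the update rule above might be sketched in Python as follows. The dictionary-backed Q table and the epsilon_greedy helper are illustrative choices; the environment interaction (how states, actions and rewards are produced) is left abstract.

```python
import random
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> value, initialised to zero

def q_update(x_n, a_n, r_n, y_n, actions, alpha, gamma):
    """One Q-learning step: Q(x,a) += alpha * (r + gamma * max_b Q(y,b) - Q(x,a))."""
    best_next = max(Q[(y_n, b)] for b in actions)
    Q[(x_n, a_n)] += alpha * (r_n + gamma * best_next - Q[(x_n, a_n)])

def epsilon_greedy(x_n, actions, eps=0.1):
    """Explore with probability eps, otherwise exploit the current Q estimates."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(x_n, b)])
```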
A. Implications of finite-horizon MDP
The above exposition presents an algorithm which guarantees optimal policy convergence for a stationary infinite-horizon MDP. The stationarity assumption, and hence the validity of the above result, needs to be questioned when considering a finite-horizon MDP, since states, actions and policies are time-dependent [22]. In particular, we are considering a discrete-period, finite trading horizon, which guarantees execution of a given volume of shares. At each decision step in the trading horizon, it is possible to have different state spaces, actions, transition probabilities and reward values. Hence the above model needs revision. Garcia and Ndiaye consider this problem and provide a model specification which suits this purpose [14]. They propose a slight modification to the Bellman optimality equations shown above:

$$V^{*}_{n}(x) = \max_{a_n}\left\{ R_x(a_n) + \gamma \sum_{y} P^{n}_{xy}(a_n)\, V^{*}_{n+1}(y) \right\}$$

for all $x \in S_n$, $y \in S_{n+1}$, $a_n \in A_n$, $n \in \{1, 2, \ldots, N\}$, with $V^{*}_{N+1}(x) = 0$. This optimality equation has a single solution $V^* = \{V^*_1, V^*_2, \ldots, V^*_N\}$ that can be obtained using dynamic programming techniques. The equivalent discounted expected reward specification thus becomes:

$$Q^{\pi}_{n}(x, a_n) = R_x(a_n) + \gamma \sum_{y} P^{n}_{xy}(\pi(x))\, V^{\pi}_{n+1}(y).$$

They propose a novel transformation of an $N$-step non-stationary MDP into an infinite-horizon process [14]. This is achieved by adding an artificial final reward-free absorbing state $x_{abs}$, such that all actions $a_N \in A_N$ lead to $x_{abs}$ with probability 1. Hence the revised Q-learning update equation becomes:

$$Q_{n+1}(x_n, a_n) = Q_n(x_n, a_n) + \alpha_n(x_n, a_n)\, U_n,$$

where

$$U_n = \begin{cases} r_n + \gamma \max_{b} Q_n(y_n, b) - Q_n(x_n, a_n) & \text{if } x_n \in S_i,\ i < N, \\ r_n - Q_n(x_n, a_n) & \text{if } x_n \in S_N, \\ 0 & \text{otherwise.} \end{cases}$$

If $x_n \notin S_N$, then $y_n = x_{n+1}$; otherwise $y_n$ is chosen randomly in $S_1$. If $x_{n+1} \in S_j$, select $a_{n+1} \in A_j$. The learning rule for $S_N$ is thus equivalent to setting $V^{*}_{N+1}(x_{abs}) = Q^{*}_{N+1}(x_{abs}, a_j) = 0$ for all $a_j \in A_{N+1}$.

Garcia and Ndiaye further show that the above specification (in the case where $\gamma = 1$) will converge to the optimal policy with probability one, provided that each state-action pair is visited infinitely often, $\sum_n \alpha_n(x, a) = \infty$ and $\sum_n \alpha^2_n(x, a) < \infty$ [14].
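The case split in $U_n$ is the only change relative to the standard update: the terminal decision step bootstraps from nothing beyond its immediate reward, which is equivalent to the artificial absorbing state having value zero. A small sketch, assuming the dictionary-backed Q table used earlier (argument names are assumptions):

```python
def finite_horizon_update(Q, x_n, a_n, r_n, y_n, next_actions, alpha, gamma, step, N):
    """Apply Q_{n+1}(x,a) = Q_n(x,a) + alpha * U_n for a finite-horizon MDP."""
    if step < N:
        # Non-terminal decision step: bootstrap from the best action in the next state
        u_n = r_n + gamma * max(Q[(y_n, b)] for b in next_actions) - Q[(x_n, a_n)]
    else:
        # Terminal step: all actions lead to the reward-free absorbing state (value 0)
        u_n = r_n - Q[(x_n, a_n)]
    Q[(x_n, a_n)] += alpha * u_n
```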
B. Implementation for optimal liquidation

Given the above description, we are able to discuss our specific choices for state attributes, actions and rewards in the context of the optimal liquidation problem. We need to consider a specification which adequately accounts for our state of execution and the current state of the limit order book, representing the opportunity set for our ultimate goal of executing a volume of shares over a fixed trading horizon.
1) States:
We acknowledge that the complexity of the financial system cannot be distilled into a finite set of states and is not likely to evolve according to a Markov process. However, we conjecture that the essential features of the system can be sufficiently captured with some simplifying assumptions such that meaningful insights can still be inferred. For simplicity, we have chosen a look-up table representation of $Q$. Function approximation variants may be explored in future research for more complex system configurations. As described above, each state $x_n \in X$ represents a vector of observable attributes which describe the configuration of the system at time $n$. As in [20], we use Elapsed Time $t$ and Remaining Inventory $i$ as private attributes which capture our state of execution over a finite liquidation horizon $T$. Since our goal is to modify a given volume trajectory based on favourable market conditions, we include spread and volume as candidate market attributes. The intuition here is that the agent will learn to increase (decrease) trading activity when spreads are narrow (wide) and volumes are high (low). This would ensure that a more significant proportion of the total volume-to-trade would be secured at a favourable price and, similarly, less at an unfavourable price, ultimately reducing the post-trade implementation shortfall. Given the look-up table implementation, we have simplified each of the state attributes as follows:

• $T$ = Trading Horizon, $V$ = Total Volume-to-Trade,
• $H$ = Hour of day when trading will begin,
• $I$ = Number of remaining inventory states,
• $B$ = Number of spread states,
• $W$ = Number of volume states,
• $sp_n$ = %ile Spread of the $n$th tuple,
• $vp_n$ = %ile Bid/Ask Volume of the $n$th tuple,
• Elapsed Time: $t_n = 1, 2, 3, \ldots, T$,
• Remaining Inventory: $i_n = 1, 2, 3, \ldots, I$,
• Spread State: $s_n = b$ if $\frac{(b-1)\,100}{B} < sp_n \leq \frac{b\,100}{B}$, for $b = 1, \ldots, B$,
• Volume State: $v_n = w$ if $\frac{(w-1)\,100}{W} < vp_n \leq \frac{w\,100}{W}$, for $w = 1, \ldots, W$.

Thus, for the $n$th episode, the state attributes can be summarised as the following tuple: $z_n = \langle t_n, i_n, s_n, v_n \rangle$.

For $sp_n$ and $vp_n$, we first construct a historical distribution of spreads and volumes based on the training set. It has been empirically observed that major equity markets exhibit U-shaped trading intensity throughout the day, i.e. more activity in mornings and closer to the closing auction. A further discussion of these insights can be found in [3] and [8]. In fact, [13] empirically demonstrates that South African stocks exhibit similar characteristics. We thus consider simulations where training volume/spread tuples are $H$-hour dependent, such that the optimal policy is further refined with respect to trading time ($H$).
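A minimal sketch of how such a state tuple might be constructed from the training-set distributions is shown below. The helpers bucket and make_state, and the example percentile histories, are hypothetical illustrations rather than the authors' implementation.

```python
import numpy as np

def bucket(percentile, n_buckets):
    """Map a percentile in (0, 100] to a bucket label in 1..n_buckets."""
    b = int(np.ceil(percentile / (100.0 / n_buckets)))
    return min(max(b, 1), n_buckets)

def make_state(elapsed, remaining_frac, spread, volume,
               spread_hist, volume_hist, I=5, B=5, W=5):
    """Build the tuple <t_n, i_n, s_n, v_n> used to index the Q look-up table."""
    t_n = int(elapsed)                                         # 1, ..., T
    i_n = min(max(int(np.ceil(remaining_frac * I)), 1), I)     # 1, ..., I
    sp_n = 100.0 * np.mean(np.asarray(spread_hist) <= spread)  # spread percentile
    vp_n = 100.0 * np.mean(np.asarray(volume_hist) <= volume)  # volume percentile
    return (t_n, i_n, bucket(sp_n, B), bucket(vp_n, W))

# Example: second period, half the inventory left, median spread and volume
state = make_state(2, 0.5, 0.05, 12_000,
                   spread_hist=[0.03, 0.05, 0.07, 0.10],
                   volume_hist=[8_000, 10_000, 14_000, 20_000])
print(state)   # (2, 3, 3, 3)
```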
2) Actions:
Based on the Almgren-Chriss (AC) model specified above, we calculate the AC volume trajectory $(AC_1, AC_2, \ldots, AC_T)$ for a given volume-to-trade ($V$), fixed time horizon ($T$) and discrete trading periods ($t = 1, 2, \ldots, T$). $AC_t$ represents the proportion of $V$ to trade in period $t$, such that $\sum_{t=1}^{T} AC_t = V$. For the purposes of this study, we assume that each child order is executed as a market order based on the prevailing limit order book structure. We would like our learning agent to modify the AC volume trajectory based on prevailing volume and spread characteristics in the market. As such, the possible actions for our agent include:

• $\beta_j$ = Proportion of $AC_t$ to trade,
• $\beta_{LB}$ = Lower bound of volume proportion to trade,
• $\beta_{UB}$ = Upper bound of volume proportion to trade,
• Action: $a_{j,t} = \beta_j AC_t$, where $\beta_{LB} \leq \beta_j \leq \beta_{UB}$ and $\beta_j = \beta_{j-1} + \beta_{incr}$.

The aim here is to train the learning agent to trade a higher (lower) proportion of the overall volume when conditions are favourable (unfavourable), whilst still broadly preserving the volume trajectory suggested by the AC model. To ensure that the total volume-to-trade is executed over the given time horizon, we execute any residual volume at the end of the trading period with a market order.
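The action set can be written down directly as a grid of multipliers applied to the current period's AC child order, with the residual sent as a market order at the horizon. The sketch below assumes the $\beta$ bounds and increment listed in Section IV; child_order is a hypothetical helper.

```python
import numpy as np

BETA_LB, BETA_UB, BETA_INCR = 0.0, 2.0, 0.25
BETAS = np.arange(BETA_LB, BETA_UB + BETA_INCR, BETA_INCR)   # 0, 0.25, ..., 2.0

def child_order(ac_t, beta_j, remaining, last_period=False):
    """Volume to submit this period: beta_j * AC_t, capped at the remaining inventory.

    In the final period the full residual is submitted as a market order, so the
    total volume-to-trade V is always completed within the horizon T.
    """
    if last_period:
        return remaining
    return min(beta_j * ac_t, remaining)
```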
3) Rewards:
Each of the actions described above results in a volume to execute with a market order, based on the prevailing structure of the limit order book. The size of the child order volume will determine how deep we will need to traverse the order book. For example, suppose we have a BUY order with a volume-to-trade of 20000, split into child orders of 10000 in period $t$ and 10000 in period $t+1$. If the structure of the limit order book at time $t$ is as follows:

• Level-1 Ask Price = 100.00; Level-1 Ask Volume = 3000
• Level-2 Ask Price = 100.50; Level-2 Ask Volume = 4000
• Level-3 Ask Price = 102.30; Level-3 Ask Volume = 5000
• Level-4 Ask Price = 103.00; Level-4 Ask Volume = 6000
• Level-5 Ask Price = 105.50; Level-5 Ask Volume = 2000

the volume-weighted execution price of the first child order will be:

$$\frac{(3000 \times 100.00) + (4000 \times 100.50) + (3000 \times 102.30)}{10000} = 100.89.$$

Trading more (less) given this limit order book structure will result in a higher (lower) volume-weighted execution price. If the following trading period $t+1$ has the following structure:

• Level-1 Ask Price = 99.80; Level-1 Ask Volume = 6000
• Level-2 Ask Price = 99.90; Level-2 Ask Volume = 2000
• Level-3 Ask Price = 101.30; Level-3 Ask Volume = 7000
• Level-4 Ask Price = 107.00; Level-4 Ask Volume = 3000
• Level-5 Ask Price = 108.50; Level-5 Ask Volume = 1000

the volume-weighted execution price for the second child order will be:

$$\frac{(6000 \times 99.80) + (2000 \times 99.90) + (2000 \times 101.30)}{10000} = 100.12.$$

If the reference price of the stock at $t = 0$ is 99.5, then the implementation shortfall from this trade is:

$$\frac{(20000 \times 99.5) - (10000 \times 100.89 + 10000 \times 100.12)}{20000 \times 99.5} \approx -101 \text{ bps}.$$

Since the conditions of the limit order book were more favourable for BUY orders in period $t+1$, if we had modified the child orders to, say, 8000 in period $t$ and 12000 in period $t+1$, the resulting implementation shortfall would be:

$$\frac{(20000 \times 99.5) - (8000 \times 100.54 + 12000 \times 100.32)}{20000 \times 99.5} \approx -91 \text{ bps}.$$

In this example, increasing the child order volume when Ask Prices are lower and Level-1 Volumes are higher decreases the overall cost of the trade. It is for this reason that implementation shortfall is a natural candidate for the rewards matrix in our reinforcement learning system. Each action implies a child order volume, which has an associated volume-weighted execution price. The agent will learn the consequences of each action over the trading horizon, with the ultimate goal of minimising the total trade's implementation shortfall.
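A minimal sketch of this reward computation follows: walk the ask side of the book to obtain a volume-weighted execution price for each child order, then measure implementation shortfall against the arrival price. The helper names vwap_buy and implementation_shortfall are assumptions; the book snapshots reproduce the example above.

```python
def vwap_buy(book, qty):
    """Volume-weighted price of a buy market order walking the ask levels (best first)."""
    remaining, cost = qty, 0.0
    for price, volume in book:
        take = min(volume, remaining)
        cost += take * price
        remaining -= take
        if remaining == 0:
            break
    return cost / qty

def implementation_shortfall(arrival, fills):
    """((X * S_0) - sum(n_k * price_k)) / (X * S_0) in basis points, for a buy program."""
    total_qty = sum(q for q, _ in fills)
    paper = total_qty * arrival
    actual = sum(q * p for q, p in fills)
    return 1e4 * (paper - actual) / paper

book_t  = [(100.00, 3000), (100.50, 4000), (102.30, 5000), (103.00, 6000), (105.50, 2000)]
book_t1 = [(99.80, 6000), (99.90, 2000), (101.30, 7000), (107.00, 3000), (108.50, 1000)]

p1, p2 = vwap_buy(book_t, 10_000), vwap_buy(book_t1, 10_000)
print(implementation_shortfall(99.5, [(10_000, p1), (10_000, p2)]))   # approx -101 bps
p1, p2 = vwap_buy(book_t, 8_000), vwap_buy(book_t1, 12_000)
print(implementation_shortfall(99.5, [(8_000, p1), (12_000, p2)]))    # approx -91 bps
```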
4) Algorithm and Methodology:
Given the above specification, we followed these steps to generate our results:

• Specify a stock ($S$), volume-to-trade ($V$), time horizon ($T$), and trading datetime (from which the trading hour $H$ is inferred),
• Partition the dataset into independent training sets and testing sets used to generate results (the training set always pre-dates the testing set),
• Calibrate the parameters for the Almgren-Chriss (AC) volume trajectory ($\sigma$, $\eta$) using the historical training set; set $\rho = 0$, since we assume the order book is resilient to our trading activity (see below),
• Generate the AC volume trajectory ($AC_1, \ldots, AC_T$),
• Train the Q-matrix based on the state-action tuples generated by the training set,
• Execute the AC volume trajectory at the specified trading datetime ($H$) on each day in the testing set, recording the implementation shortfall,
• Use the trained Q-matrix to modify the AC trajectory as we execute $V$ at the specified trading datetime, recording the implementation shortfall, and
• Determine whether the reinforcement learning (RL) model improved/worsened realised implementation shortfall.

In order to train the Q-matrix to learn the optimal policy mapping, we need to traverse the training data set $(T \times I \times A)$ times, where $A$ is the total number of possible actions. The following pseudo-code illustrates the algorithm used to train the Q-matrix:

Optimal_strategy
For a = 1 to A {
    Set x = ...
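A minimal Python sketch of this training procedure, under the specification above, is given below. The episode and order-book access functions (episodes, get_state, execute) are hypothetical placeholders standing in for the data handling described in Section IV, and the loop structure is an assumption consistent with the steps listed above rather than the authors' exact implementation.

```python
from collections import defaultdict
import random

def train_q_matrix(episodes, ac_traj, betas, get_state, execute,
                   I=5, alpha=1.0, gamma=1.0):
    """Tabular Q-learning over historical episodes of length T (finite-horizon MDP).

    Hypothetical placeholders: `episodes` is an iterable of per-day lists of
    order-book snapshots, `get_state(t, remaining, book)` returns the tuple
    <t_n, i_n, s_n, v_n>, and `execute(book, volume)` returns the reward for
    that child order (e.g. its negative contribution to implementation shortfall).
    """
    Q = defaultdict(float)
    T, V = len(ac_traj), sum(ac_traj)
    n_sweeps = T * I * len(betas)                    # traverse the training data T*I*A times
    for _ in range(n_sweeps):
        for books in episodes:                       # one trading day = one episode
            remaining = V
            for t in range(1, T + 1):
                x_n = get_state(t, remaining, books[t - 1])
                beta = random.choice(list(betas))    # exploring policy; could be eps-greedy
                if t < T:
                    volume = min(beta * ac_traj[t - 1], remaining)
                else:
                    volume = remaining               # residual market order at the horizon
                r_n = execute(books[t - 1], volume)
                remaining -= volume
                if t < T:                            # finite-horizon update (Garcia & Ndiaye)
                    y_n = get_state(t + 1, remaining, books[t])
                    u_n = r_n + gamma * max(Q[(y_n, b)] for b in betas) - Q[(x_n, beta)]
                else:                                # terminal step bootstraps only its reward
                    u_n = r_n - Q[(x_n, beta)]
                Q[(x_n, beta)] += alpha * u_n
    return Q
```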
An important assumption in this model specification is that our trading activity does not affect the market attributes. Although temporary price impact is incorporated into execution prices via depth participation of the market order in the prevailing limit order book, we assume the limit order book is resilient with respect to our trading activity. Market resiliency can be thought of as the number of quote updates before the market's spread reverts to its competitive level. Degryse et al. showed that a pure limit order book market (Euronext Paris) is fairly resilient with respect to most order sizes, taking on average 50 quote updates for the spread to normalise following the most aggressive orders [11]. Since we are using 5-minute trading intervals and small trade sizes, we will assume that any permanent price impact effects dissipate by the next trading period. A preliminary analysis of South African stocks revealed that there were on average over 1000 quote updates during the 5-minute trading intervals and that the pre-trade order book equilibrium is restored within 2 minutes for large trades. The validity of this assumption will, however, be tested in future research, as will other model specifications which incorporate permanent effects in the system configuration.

IV. DATA AND RESULTS
A. Data used
For this study, we collected 12 months of market depth tick data (Jan-2012 to Dec-2012) from the Thomson Reuters Tick History (TRTH) database, representing a universe of 166 stocks that make up the South African local benchmark index (ALSI) as at 31-Dec-2012. This includes 5 levels of order book depth (bid/ask prices and volumes) at each tick. The raw data was imported into a MongoDB database and aggregated into 5-minute intervals showing average level prices and volumes, which was used as the basis for the analysis.
B. Stocks, parameters and assumptions
To test the robustness of the proposed model in the South African (SA) equity market, we tested a variety of stock types, trade sizes and model parameters. Due to space constraints, we will only show a representative set of results here that illustrate the insights gained from the analysis. The following summarises the stocks, parameters and assumptions used for the results that follow:

• Stocks
  – SBK (Large Cap, Financials)
  – AGL (Large Cap, Resources)
  – SAB (Large Cap, Industrials)
• Model Parameters
  – $\beta_{LB}$: 0, $\beta_{UB}$: 2, $\beta_{incr}$: 0.25
  – $\lambda$: 0.01, $\tau$: 5-min, $\alpha$: 1, $\gamma$: 1
  – $V$: 100 000, 1 000 000
  – $T$: 4 (20-min), 8 (40-min), 12 (60-min)
  – $H$: 9, 10, 11, 12, 13, 14, 15, 16
  – $I, B, W$: 5, 10
  – Buy/Sell: BUY
• Assumptions
  – Max volume participation rate in order book: 20%
  – Market is resilient to our trading activity
Note, we set $\gamma = 1$ since [14] states that this is a necessary condition to ensure convergence to the optimal policy with probability one for a finite-horizon MDP. We also choose an arbitrary value for $\lambda$, although sensitivities to these parameters will be explored in future work. AC parameters are calibrated and the Q-matrix trained over a 6-month training set from 1-Jan-2012 to 30-Jun-2012. The resultant AC and RL trading trajectories are then executed on each day at the specified trading time $H$ in the testing set from 1-Jul-2012 to 20-Dec-2012. The implementation shortfall for both models is calculated and the difference recorded. This allows us to construct a distribution of implementation shortfall for each of the AC and RL models, and for all trading hours $H = 9, 10, \ldots, 16$.

C. Results
Table I shows the average % improvement in median implementation shortfall for the complete set of stocks and parameter values. These results suggest that the model is more effective for shorter trading horizons ($T = 4$), with an average improvement of up to 10.3% over the base AC model. This result may be biased due to the assumption of order book resilience. Indeed, the efficacy of the trained Q-matrix may be less reliable for stocks which exhibit slow order book resilience, since permanent price effects would affect the state space transitions. In future work, we plan to relax this order book resilience assumption and incorporate permanent effects into state transitions.

TABLE I: Average % improvement in median implementation shortfall for various parameter values, using AC and RL models. Training H-dependent.

Parameters                Trading Time (hour)                                  Average
V        T   I,B,W     9     10     11    12    13    14    15    16
100000    4     5     23.9   -1.4   4.7  13.4   1.8   3.3   1.8  35.1    10.3
100000    8     5     25.3    4.3   8.3   2.3   1.4   9.9  -0.6  -1.9     6.1
100000   12     5     32.7  -25.2   7.2  -2.7  -1.5   4.6   4.5  -3.3     2.1
1000000   4     5     23.3   -1.3   4.8   9.3   1.9   3.5   1.8  35.0     9.8
1000000   8     5     28.8    5.6   8.2   1.9   1.4   9.9  -0.3  -2.6     6.6
1000000  12     5     33.1  -25.0   7.2  -4.0  -0.8   4.8   4.8   1.2     2.7
100000    4    10     22.9    1.3   3.0   9.7   2.7   5.8   3.5 -26.1     2.8
100000    8    10     26.0    4.3   6.7  -0.2   3.5   8.6   1.6  -3.1     5.9
100000   12    10     27.8  -21.9   7.5  -4.1   0.6   1.8   6.2  -9.5     1.1
1000000   4    10     22.6    1.4   3.1   9.3   2.5   6.0   3.6 -26.1     2.8
1000000   8    10     26.3    5.0   7.2  -0.5   3.3   7.0   2.3  -1.8     6.1
1000000  12    10     27.9  -24.3   8.3  -6.9   0.5   1.8   7.5  -3.3     1.4

Figure 1 illustrates the improvement in median post-trade implementation shortfall when executing the volume trajectories generated by each of the models, for each of the candidate stocks at the given trading times. In general, the RL model is able to improve (lower) ex-post implementation shortfall; however, the improvement seems more significant for early morning/late afternoon trading hours. This could be due to the increased trading activity at these times, resulting in more state-action visits in the training set to refine the associated Q-matrix values. We also notice more dispersed performance between 10:00 and 11:00. This time period coincides with the UK market open, where global events may drive local trading activity and skew results, particularly since certain SA stocks are dual-listed on the London Stock Exchange (LSE). The improvement in implementation shortfall ranges from 15 bps (85.3%) for trading 1 000 000 of SBK between 16:00 and 17:00, to -7 bps (-83.4%) for trading 100 000 SAB between 16:00 and 17:00. Overall, the RL model is able to improve implementation shortfall by 4.8%.

Fig. 1: Difference between median implementation shortfall generated using RL and AC models, with given parameters (I,B,W = 5). Training H-dependent.

Figure 2 shows the % of correct actions implied by the Q-matrix, as it evolves through the training process after each tuple visit. Here, a correct action is defined as a reduction (addition) in the volume-to-trade based on the max Q-value action, in the case where spreads are above (below) the 50%ile and volumes are below (above) the 50%ile level. This coincides with the intuitive behaviour we would like the RL agent to learn. These results suggest that finer state granularity ($I, B, W = 10$) improves the overall accuracy of the learning agent, as demonstrated by the higher % of correct actions achieved. All model configurations seem to converge to some stationary accuracy level after approximately 1000 tuple visits, suggesting that a shorter training period may yield similar results. We do, however, note that improving the % of correct actions by increasing the granularity of the state space does not necessarily translate into better model performance. This can be seen in Table I, where the results for $I, B, W = 10$ do not show any significant improvement over those with $I, B, W = 5$. This suggests that the market dynamics may not be fully represented by volume and spread state attributes, and alternative state attributes should be explored in future work to improve ex-post model efficacy.

Fig. 2: % correct actions implied by Q-matrix after each training set tuple. Training H-dependent.

Table II shows the average standard deviation of the resultant implementation shortfall when using each of the AC and RL models. Since we have not explicitly accounted for variance of execution in the RL reward function, we see that the resultant trading trajectories generate a higher standard deviation compared to the base AC model. Thus, although the RL model provides a performance improvement over the AC model, this is achieved with a higher degree of execution risk, which may not be acceptable for the trader. We do note that the RL model exhibits comparable risk for $T = 4$, thus validating the use of the RL model to reliably improve IS over short trade horizons. A future refinement of the RL model should incorporate variance of execution, such that it is consistent with the AC objective function. In this way, a true comparison of the techniques can be done, and one can conclude as to whether the RL model indeed outperforms the AC model at a statistically significant level.

TABLE II: Standard deviation (%) of implementation shortfall when using AC vs RL models.

Parameters            Standard Deviation (%)    % improvement
V        T   I,B,W       AC       RL              in IS
100000    4     5        0.13     0.17            10.3
100000    8     5        0.14     0.23             6.1
100000   12     5        0.14     0.26             2.1
1000000   4     5        0.13     0.17             9.8
1000000   8     5        0.14     0.23             6.6
1000000  12     5        0.14     0.26             2.7
100000    4    10        0.13     0.17             2.8
100000    8    10        0.14     0.22             5.9
100000   12    10        0.14     0.26             1.1
1000000   4    10        0.13     0.17             2.8
1000000   8    10        0.14     0.22             6.1
1000000  12    10        0.14     0.26             1.4
Average                  0.14     0.22             4.8

V. CONCLUSION
Q-learning technique can be used to train a learning agent tomodify a static Almgren-Chriss volume trajectory based onprevailing spread and volume dynamics, assuming order bookresiliency. Using a sample of stocks and trade sizes in theSouth African equity market, we were able to reliably improvepost-trade implementation shortfall by up to 10.3% on averagefor short trade horizons, demonstrating promising potentialapplications of this technique. Further investigations includeincorporating variance of execution in the RL reward function,relaxing the order book resiliency assumption and alternativestate attributes to govern market dynamics. A
CKNOWLEDGMENT
The authors thank Dr Nicholas Westray for his contribution in the initiation of this work, as well as the anonymous reviewers for their insightful comments. This work is based on research supported in part by the National Research Foundation of South Africa (Grant Number CPRR 70643).

REFERENCES

[1] R. Almgren, N. Chriss. Optimal execution of portfolio transactions, Journal of Risk, 3, pp. 5-40, 2000.
[2] R. Almgren. Optimal execution with nonlinear impact functions and trading-enhanced risk, Applied Mathematical Finance, 10(1), pp. 1-18, 2003.
[3] A. Admati, P. Pfleiderer. A theory of intraday patterns: volume and price variability, Review of Financial Studies, 1(1), pp. 3-40, 1988.
[4] A. Barto, S. Mahadevan. Recent advances in hierarchical reinforcement learning, Discrete Event Dynamic Systems, 13(4), pp. 341-379, 2003.
[5] R. Bellman. The theory of dynamic programming, Bulletin of the American Mathematical Society, 1954.
[6] R. Bellman, S. Dreyfus. Applied dynamic programming, Princeton University Press, Princeton, New Jersey, 1962.
[7] D. Bertsimas, A. Lo. Optimal control of execution costs, Journal of Financial Markets, 1(1), pp. 1-50, 1998.
[8] W. Brock, A. Kleidon. Periodic market closure and trading volume: a model of intraday bids and asks, Journal of Economic Dynamics and Control, 16(3), pp. 451-489, 1992.
[9] L. Chan, J. Lakonishok. The behavior of stock prices around institutional trades, Journal of Finance, 50(4), pp. 1147-1174, 1995.
[10] P. Dayan, C. Watkins. Reinforcement learning, Encyclopedia of Cognitive Science, 2001.
[11] H. Degryse, F. de Jong, M. Ravenswaaij, G. Wuyts. Aggressive orders and the resiliency of a limit order market, Review of Finance, 9(2), pp. 201-242, 2003.
[12] T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition, Abstraction, Reformulation and Approximation, pp. 26-44, 2000.
[13] B. Du Preez. JSE Market Microstructure, MSc Dissertation, University of the Witwatersrand, School of Computational and Applied Mathematics, 2013.
[14] F. Garcia, S. Ndiaye. A learning rate analysis of reinforcement learning algorithms in finite-horizon, Proceedings of the 15th International Conference on Machine Learning, 1998.
[15] A. Gosavi. Reinforcement learning: a tutorial survey and recent advances, INFORMS Journal on Computing, 21(2), pp. 178-192, 2009.
[16] R. Holthausen, R. Leftwich, D. Mayers. Large-block transactions, the speed of response and temporary and permanent stock-price effects, Journal of Financial Economics, 26(1), pp. 71-95, 1990.
[17] G. Huberman, W. Stanzl. Optimal liquidity trading, Yale School of Management, Working Paper, 2001.
[18] L. Kaelbling, M. Littman, A. Moore. Reinforcement learning: a survey, Journal of Artificial Intelligence Research, 4, pp. 237-285, 1996.
[19] J. McCulloch. Relative volume as a doubly stochastic binomial point process, Quantitative Finance, 7(1), pp. 55-62, 2007.
[20] Y. Nevmyvaka, Y. Feng, M. Kearns. Reinforcement learning for optimal trade execution, Proceedings of the 23rd International Conference on Machine Learning, pp. 673-680, 2006.
[21] A. Perold. The implementation shortfall: paper vs reality, Journal of Portfolio Management, 14(3), pp. 4-9, 1988.
[22] M. Puterman. Markov Decision Processes, John Wiley and Sons, New York, 1994.
[23] S. Ross. Introduction to stochastic dynamic programming, Academic Press, New York, 1983.
[24] S. Singh, T. Jaakkola, M. Littman, C. Szepesvari. Convergence results for single-step on-policy reinforcement learning algorithms, Machine Learning, 38(3), pp. 287-308, 2000.
[25] R. Sutton, A. Barto. Reinforcement learning, Cambridge, MA: MIT Press, 1998.
[26] D. Vayanos. Strategic trading in a dynamic noisy market, Journal of Finance, 56(1), pp. 131-171, 2001.
[27] C. Watkins. Learning from delayed rewards, PhD thesis, University of Cambridge, 1989.