Market-making with reinforcement-learning (SAC)
Alexey Bakshaev
[email protected]
August 28, 2020
Abstract
The paper explores the application of a continuous action space soft actor-critic (SAC) reinforcement learning model to the area of automated market-making. The reinforcement learning agent receives a simulated flow of client trades, thus accruing a position in an asset, and learns to offset this risk by either hedging at simulated "exchange" spreads or by attracting an offsetting client flow by changing the offered client spreads (skewing the offered prices). The question of learning minimum spreads that compensate for the risk of taking the position is investigated. Finally, the agent is posed with the problem of learning to hedge a blended client trade flow resulting from independent price processes (a "portfolio" position). A position penalty method is introduced to improve convergence. An OpenAI Gym-compatible hedge environment is introduced, and the OpenAI SAC baseline RL engine is used as a learning baseline.
1 Introduction

Let's assume that our goal is to train our agent in a way that it can perform market-making effectively. In this trading mode our agent puts out both the price it is willing its clients to buy at (the "client" ask) and to sell at (the "client" bid), thereby accepting an incoming flow of client orders at those prices and profiting from the resulting spread (ask − bid > 0).

Figure 1: Example of making a spread on client flow. At time t = 47 the agent buys from a client at a discount to the mid market price (at the "client bid"), making a half-spread s_1. Had it instead sold to another client at t = 47, it would have made the entire spread ask − bid. Instead, the agent closes the position at time t = 50 at a lower market price at a half-spread s_2
, and the overall result is the sum of those half-spreads s_1 + s_2.

To manage the position accrued from the client flow, the agent can choose between:

• putting offsetting orders out to the exchange, thereby "hedging" the position at the cost of paying exchange bid/ask spreads
• decreasing client bid/ask spreads to attract more client flow on a chosen side, thereby decreasing the position more quickly. This is known as "skewing" the price, under the condition that the client flow (the "demand") is sufficiently elastic to price changes.

Hedging comes at a cost of decreased profitability, whereas not hedging introduces the risk of incurring losses in case of unfavorable price movements. Thus, given the net incoming client order flow and the resulting accrued client position, the agent needs to learn to reason out its risk appetite: how much of the position to carry (in the hope it gets offset by future client order flow) and how much of it to hedge, either via an exchange or by offering clients more attractive prices to attract more offsetting client flow.

In a driftless environment where the asset price is a martingale, the expectation of the market price move with time is zero:

E[S_{t_2} − S_{t_1}] = 0,  t_2 ≥ t_1 ≥ 0    (1)
From this follows the first challenge in setting up an effective market-making ML framework: the agent needs to learn that there is no sense in entering speculative hedge positions that would try to profit from market moves. Instead, in a very basic setting, the agent needs to learn to offset the accrued client position with an opposite exchange (hedger) position. This is an inventory management problem that is very similar to the popular Pendulum environment, where an ML agent learns to balance a pendulum, or to the Mountain Car environment, where the agent needs to learn to swing the car out of the bottom of a pit.

Both the observation space (the size of the position accrued by the agent) and the action space (the choice of an amount to hedge) are chosen to be continuous, and the asset price is driven by drift-adjusted Brownian motion. This allows for an infinite action search space and presents a challenging task. An extension of this problem is presented when price skew is added as an additional action available to the agent, making the overall search space a box in R^2.

Assuming the agent has successfully learnt to balance the position to zero via choosing hedge/skew actions, we can pose the following problem. In an environment where hedge spreads and client flow are unknown, can the market-making agent learn to offer such spreads that would compensate it for the risk taking? In other words, can an agent learn the price of market risk, measured in terms of offered client spreads?

Another problem that naturally follows is that of portfolio management. The dynamics of the portfolio value are determined by the blended asset price process as well as the blended client trade flow, both of which are driven by the underlying asset price processes, correlation coefficients and asset weights. The learning agent does not know what correlations and asset weights have been used to build the portfolio, and needs to work out an effective hedge strategy that will maximize profitability and minimize the position involved. This makes it an inventory management problem with not just one asset but a portfolio of assets.

2 Reinforcement learning and soft actor-critic

Reinforcement learning is becoming increasingly popular in the area of robotics and control automation. In the reinforcement learning setting there is a learning agent that can perpetually (i ∈ [1, N]) interact with the environment by performing an action a_i and then observing the resulting state s_i of the environment and the associated reward r_i. The main purpose of the reward function is to define the end goal of the learning process by rewarding "good" actions and penalizing "bad" ones. As a result, the agent learns how to sample from the action space in a way that maximizes the cumulative discounted return over all steps within a simulation trajectory:

R(τ) = Σ_i γ^i r_i    (2)

where γ is a discount factor which puts more weight on the nearest rewards. The learned probability distribution π for action sampling given the reward increment and observation space state is called a policy. The goal of the learning process is to maximize the expected return over all simulation trajectories and to converge onto the optimal policy π*:

π* = argmax_π E_{τ∼π} [ Σ_i γ^i r_i ]    (3)

If an agent chooses actions according to some fixed policy π starting from a given initial pair of state and action, the resulting expected return is known as the "value function", representing the value of this policy.
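As a toy illustration of the discounted return in eq. (2), the short sketch below computes R(τ) for a trajectory of per-step rewards; the reward values are made up for illustration:

def discounted_return(rewards, gamma=0.99):
    # R(tau) = sum_i gamma^i * r_i -- earlier rewards carry more weight
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

trajectory_rewards = [1.0, 0.5, -0.2, 0.8]  # hypothetical rewards r_0..r_3
print(discounted_return(trajectory_rewards))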
Naturally, as the network learns, the policy gets updated, resulting in a new value function on each step of a trajectory. A Q-function, or "action-value" function, is a value function where the first action is fixed (e.g. sampled off-policy) before the policy is applied on the consecutive steps of the trajectory.

The soft actor-critic (SAC) algorithm introduced in [4], [5] and [6] offered a few methods of improving the stability of convergence to the optimal policy in the case of continuous action and observation spaces:

• entropy maximization: the entropy H(π(·|s_i)) of the action-generating distribution is used as a part of the reward function. This leads to joint maximization of expected entropy and return, which encourages stochastic exploration and stability of convergence:

π* = argmax_π E_{τ∼π} [ Σ_i γ^i ( R(s_i, a_i, s_{i+1}) + α H(π(·|s_i)) ) ]    (4)

where α is the temperature parameter which determines the trade-off between maximizing return vs. maximizing entropy (exploration).

• actor-critic: the use of separate policy ("actor") and value function ("critic") networks. The policy π and two value functions Q_j(s, a), j = 1, 2, are learned, with the Q-networks minimizing the loss

L_j = E[ ( Q_j(s_i, a_i) − y(r_i, s_{i+1}) )^2 ]    (5)

where the target y is approximated by sampling actions ã_{i+1} from the policy π(·|s_{i+1}), conditioned on the reward r_i and next state s_{i+1} coming from the replay buffer:

y(r_i, s_{i+1}) = r_i + γ(1 − d) ( min_{j=1,2} Q_{targ,j}(s_{i+1}, ã_{i+1}) − α log π(ã_{i+1}|s_{i+1}) ),  ã_{i+1} ∼ π(·|s_{i+1})    (6)

The target value function networks Q_{targ,j} are periodically updated from the current value networks Q_j in an exponential-averaging fashion. Within the learning loop (see the SAC algorithm listing in the appendix) the Q-functions are updated by gradient descent on the loss function (5), and the policy is updated by gradient ascent on the entropy-adjusted Q-functions.

• experience replay: the use of a replay buffer to sample past states and rewards in an off-policy manner. Next actions are simulated from the current policy.

These methods, alongside others, were used in OpenAI's baseline implementation of the SAC algorithm, which serves as the baseline RL engine used in this paper. The policy/value network is a multi-layered ReLU perceptron with linear input and output layers. The linear input layer transforms dimensionality from the input (observation) space onto the first perceptron layer; the linear output layer transforms dimensionality from the last perceptron layer onto the action space. E.g. when we have the observation space defined as the position in an asset (dim = 1), the action space defined as the hedge amount (dim = 1) and 2 ReLU layers of 2048 units, we will write this as MLP Linear (1) x ReLU (2048) x ReLU (2048) x Linear (1). An additional dimension to all of those would be the batch size.

The hedge environment is implemented to be OpenAI Gym-compatible. It makes use of the PyTorch tensor framework for generating price processes and client flows and is intended to be used with a PyTorch-based implementation. The hedge environment provides the standard members step() for updating the hedger state based on the action selected by the agent, reset() for re-setting the simulation, and render() for displaying the dashboard with hedger statistics. Visualization of the hedging agent's performance serves as a very important tool in understanding the progress (or lack thereof) of the hedging agent, as well as assessing the overall correctness of the environment set-up. Project code for this paper is available under: https://github.com/bakalex/autohedger
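A minimal sketch of such a Gym-compatible environment is given below. It only mirrors the step()/reset()/render() interface described above; the toy client flow, the placeholder reward and the constants MAX_HEDGE_SIZE and MAX_POS_LIMIT are illustrative assumptions, not the repository's actual implementation:

import numpy as np
import gym
from gym import spaces

MAX_HEDGE_SIZE = 100.0   # illustrative bound on the hedge action
MAX_POS_LIMIT = 1000.0   # illustrative position scale for the toy reward

class HedgeEnvSketch(gym.Env):
    """Skeleton of the hedge environment interface: step(), reset(), render()."""

    def __init__(self, n_steps=256):
        super().__init__()
        # Continuous action (hedge amount) and observation (net position).
        self.action_space = spaces.Box(-MAX_HEDGE_SIZE, MAX_HEDGE_SIZE, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)
        self.n_steps = n_steps
        self.reset()

    def reset(self):
        self.t = 0
        self.position = 0.0
        return np.array([self.position], dtype=np.float32)

    def step(self, action):
        # Toy net client flow; the real environment draws it from the trade flow model.
        client_flow = float(np.random.poisson(5.0) - np.random.poisson(5.0))
        self.position += client_flow + float(action[0])
        # Placeholder reward: keep the net position small.
        reward = -abs(self.position) / MAX_POS_LIMIT
        self.t += 1
        done = self.t >= self.n_steps
        return np.array([self.position], dtype=np.float32), reward, done, {}

    def render(self, mode="human"):
        print(f"step={self.t} position={self.position:.2f}")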
3 Market model

3.1 Price process

The asset price process is generated as a standard Euler discretization of a log-normal price process:

Δlog(S_i) = μ Δt − (σ^2/2) Δt + σ ε √Δt    (7)

S_N = S_0 exp( Σ_i Δlog(S_i) )    (8)

For the inventory management problem we are trying to solve, we do not want our reinforcement learning agent to focus on market drift, which would result in the hedger taking speculative directional positions in an asset. Instead, we want it to focus on managing the position arising from the incoming client order flow, so we assume no drift (μ = 0), which makes our price process a martingale (eq. 1).

For a given normally distributed variable ε_1 and an uncorrelated i.i.d. variable ε_2, a correlated normally distributed variable ε_3 with correlation coefficient ρ is generated as follows:

ε_3 = f(ε_1, ρ) = ρ ε_1 + √(1 − ρ^2) ε_2    (9)

Generation of normally distributed variables, as well as the above discretization, is done within the PyTorch tensor framework, so that at the beginning of the learning process we have the market data set ready for the entire simulation.

3.2 Stochastic spread model

There are two types of offered spreads δ_bid, δ_ask defined in the model: "client" spreads that our market-making agent offers clients to trade at (hence "client" bid/ask) and "hedge" spreads that our agent trades at with an exchange (hence "hedge" bid/ask). Both of those spreads are applied on top of the same mid price process (8). Generated spreads are based on the rolling mean volatility σ_roll_avg of the mid price process plus a log-normal stochastic spread add-on ε_L:

S_bid = S − ν δ_bid,  S_ask = S + ν δ_ask    (10)

δ_bid = σ_roll_avg + ε_L1,  δ_ask = σ_roll_avg + ε_L2    (11)

σ_roll_avg = StDev(S_i, window) / √n_steps    (12)

ε_L = LogNormal( 0, γ S σ / √n_steps )    (13)

where ν is a spread multiplier determining the overall magnitude of the spread, γ is a spread multiplier that determines the magnitude of the stochastic spread add-on, σ is the flat volatility driving the mid process, and ε_L1 and ε_L2 are log-normal i.i.d. variables.

Given the fat-tailed nature of the log-normal distribution, the stochastic spreads δ_bid, δ_ask are clamped between 0.1 and 2.5 of the mean simulation volatility. This helps to avoid unreasonable spikes in bid/ask prices.
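The sketch below generates the market data up-front as PyTorch tensors along the lines of eqs. (7)-(13). The parameter values, the use of the initial price S_0 for the scale S in eq. (13), and my reading of the √n_steps scaling in eqs. (12)-(13) are illustrative assumptions:

import torch

torch.manual_seed(0)
n_steps, S0, sigma, dt = 1000, 100.0, 0.2, 1.0 / 252
nu, gamma_spread, window = 1.0, 0.5, 20

# Driftless (mu = 0) log-normal mid process, eqs. (7)-(8).
eps = torch.randn(n_steps)
dlogS = -0.5 * sigma ** 2 * dt + sigma * eps * dt ** 0.5
S = S0 * torch.exp(torch.cumsum(dlogS, dim=0))

# A second normal stream correlated with eps via eq. (9),
# used when a second correlated asset is needed.
rho = 0.5
eps3 = rho * eps + (1.0 - rho ** 2) ** 0.5 * torch.randn(n_steps)

# Rolling volatility of the mid process, eq. (12).
sigma_roll = S.unfold(0, window, 1).std(dim=1) / n_steps ** 0.5

# Log-normal stochastic spread add-on, eq. (13).
eps_L = torch.distributions.LogNormal(
    torch.zeros_like(sigma_roll), gamma_spread * S0 * sigma / n_steps ** 0.5
).sample()

# Spreads (11), clamped to [0.1, 2.5] of mean volatility, and bid/ask prices (10).
m = sigma_roll.mean().item()
delta = (sigma_roll + eps_L).clamp(0.1 * m, 2.5 * m)
S_mid = S[window - 1:]
S_bid, S_ask = S_mid - nu * delta, S_mid + nu * delta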
3.3 Client trade flow model

Client trade sizes hitting our offered bid and ask prices are simulated as bid and ask Poisson processes. The magnitude of the intensity of those processes is assumed to be a function of the rolling volatility of the price process. The net trade flow that determines how the position changes on each step of the simulation is a function of the imbalance of the bid/ask Poisson intensities, which is assumed to be correlated with the log-price process.

TradeSize_net = TradeSize_bid − TradeSize_ask    (14)
TradeSize_bid = Poisson( ClientTradeRate · λ̄_bid )    (15)

TradeSize_ask = Poisson( ClientTradeRate · λ̄_ask )    (16)

where ClientTradeRate determines the overall magnitude of the size Poisson process, and λ̄_bid and λ̄_ask are trade flow intensity multipliers.

ClientTradeRate = C ( α + σ_rolling / σ_mean )    (17)

where α allows one to define the vol-independent "mean" client trade flow, σ_rolling is the average volatility within a rolling window, and σ_mean is the mean volatility for the entire simulation. C is the flow scaling factor and may be taken to be equal to the initial price of the asset S_0.

Trade flow intensity multipliers are defined as:

λ̄_bid = max(1 − λ, 0)    (18)

λ̄_ask = max(1 + λ, 0)    (19)

where λ is the net intensity. It is defined to be correlated with the log-price process as per eq. (9):

λ = β · ε(log S_i, ρ)    (20)

where β is the sensitivity of the net client trade flow towards log-returns and ρ defines the correlation between the net intensity process and the log-price process. To smooth out variance in intensity, a rolling-average version of the log-price process may be used.
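A sketch of this trade flow model follows; the log-returns and rolling volatility below are crude stand-ins for the market model outputs above, and the constants are illustrative choices for C, α, β and ρ:

import torch

torch.manual_seed(0)
n_steps = 1000
C_scale, alpha_flow = 100.0, 1.0     # C taken equal to S0; alpha sets the mean flow level
beta_sens, rho_flow = 5.0, 0.8       # beta and rho from eq. (20)

# Stand-ins for the market model outputs.
log_returns = 0.01 * torch.randn(n_steps)
sigma_rolling = 0.2 + 0.02 * torch.rand(n_steps)
sigma_mean = sigma_rolling.mean()

# Net intensity, eq. (20): correlated with the log-price increments via eq. (9).
eps_corr = rho_flow * log_returns + (1 - rho_flow ** 2) ** 0.5 * 0.01 * torch.randn(n_steps)
lam = beta_sens * eps_corr

# Intensity multipliers, eqs. (18)-(19), floored at zero.
lam_bid = (1 - lam).clamp(min=0.0)
lam_ask = (1 + lam).clamp(min=0.0)

# Overall trade rate, eq. (17), and Poisson trade sizes, eqs. (14)-(16).
client_trade_rate = C_scale * (alpha_flow + sigma_rolling / sigma_mean)
trade_size_bid = torch.poisson(client_trade_rate * lam_bid)
trade_size_ask = torch.poisson(client_trade_rate * lam_ask)
trade_size_net = trade_size_bid - trade_size_ask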
3.4 PNL model

As explained in the introduction, our hedging agent accepts client trades at the "client" spreads δ_client it offers to clients, and then it may choose to hedge the accrued position out at exchange "hedge" spreads δ_hedge. Let's assume a client sells to our agent at our offered ask price, and the agent then hedges out (sells) the accrued position to the exchange at the exchange bid price. In a hypothetical setting where our hedger could hedge instantaneously, it would make the client half-spread and pay the exchange half-spread, locking in the profit:

PNL = TradeSize (S_client_ask − S_mid) − TradeSize (S_hedge_bid − S_mid) = (δ_client_ask − δ_hedge_bid) TradeSize    (21)

In reality, having accrued the client trade at step t_1, our hedger can only choose to hedge out on the next step t_2 of the simulation, or even later, which results in revaluation of the position at the market mid price between those steps:

PNL = TradeSize (S_client_ask,t_1 − S_mid,t_1) + TradeSize (S_mid,t_2 − S_mid,t_1) − TradeSize (S_hedge_bid,t_2 − S_mid,t_2)    (22)
    = ( δ_client_ask,t_1 − δ_hedge_bid,t_2 + ΔS_mid,t_2−t_1 ) TradeSize    (23)

Given the martingale nature of the simulated price process, E[ΔS_mid,t_2−t_1] = 0, our agent can only be profitable if the average offered client spreads are greater than the average paid hedge spreads. Note that our reinforcement agent does not know that the price process is a martingale, so as it begins to explore the action space it will initially be sensitive to realized directional market moves of the mid process. With time, however, the agent needs to learn to focus on managing positions resulting from the client flow instead of trying to capture directional moves of the mid process. The martingale formulation of the price process helps to achieve that.
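In code, the one-step PNL decomposition of eq. (23) reads as follows; all numbers are made up, and the example shows that with a zero expected mid move the expected PNL collapses to the spread difference of eq. (21):

def step_pnl(trade_size, delta_client_ask, delta_hedge_bid, mid_t1, mid_t2):
    client_spread_pnl = trade_size * delta_client_ask   # half-spread received from the client at t1
    hedge_spread_pnl = -trade_size * delta_hedge_bid    # half-spread paid to the exchange at t2
    market_pnl = trade_size * (mid_t2 - mid_t1)         # mid revaluation between the two steps
    return client_spread_pnl + hedge_spread_pnl + market_pnl

# A 10-lot accrued at a 0.05 client half-spread, hedged one step later at a
# 0.03 exchange half-spread, with the mid dropping 0.02 in between:
print(step_pnl(10.0, 0.05, 0.03, 100.00, 99.98))  # 10 * (0.05 - 0.03 - 0.02) = 0, up to float rounding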
3.5 Elasticity of demand (skewing model)

When demand is elastic, we may affect the trade size a client is willing to trade with us by offering a more attractive ("skewed") price than the rest of the market. This can be used as a means of position management: e.g. given a long position in an asset, we could lower our offered price to entice the client to buy from us, thus decreasing the position. Depending on the marginal profitability of the transaction given the skewed price, such an action may be less expensive compared to executing a similar transaction on the exchange. We will be using a linear model in our experiments:

Δsize = skew · β · MaxHedgeSize / ClientTradeRate    (24)

where ClientTradeRate is the mean trade rate from the Client trade flow model (3.3), MaxHedgeSize is the maximum hedge trade size constant defined for the bounded action space of the hedger, and skew ∈ [−1.0, 1.0] is a bounded action resulting in the following client price adjustment:
S_bid_adj = S_bid − skew (S_mid − S_bid),  skew ∈ [−1.0, 0.0)    (25)

S_ask_adj = S_ask − skew (S_ask − S_mid),  skew ∈ (0.0, 1.0]    (26)
So, negative skew means our agent buys more expensively and attracts more client flow on the bid side; positive skew means our agent sells more cheaply and attracts more client flow on the ask side. The β constant is specified to be large enough for the learning agent to prefer skewing over hedging. Setting β = 0 makes the client flow non-elastic and can be used for price discovery purposes, such as finding the market price of risk spreads in section 5 (Market price of risk).
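A sketch of the skew mechanics, assuming the linear flow response of eq. (24) (the exact arrangement of MaxHedgeSize and ClientTradeRate there is my reading of the formula) and the one-sided price adjustments of eqs. (25)-(26); beta_skew and MAX_HEDGE_SIZE are illustrative constants:

MAX_HEDGE_SIZE = 100.0   # bound of the hedge action space
beta_skew = 2.0          # elasticity constant beta from eq. (24)

def skewed_prices(s_mid, s_bid, s_ask, skew):
    """One-sided price adjustment for skew in [-1, 1], eqs. (25)-(26)."""
    if skew < 0:
        s_bid = s_bid - skew * (s_mid - s_bid)   # skew < 0 raises the bid towards mid
    elif skew > 0:
        s_ask = s_ask - skew * (s_ask - s_mid)   # skew > 0 lowers the ask towards mid
    return s_bid, s_ask

def offsetting_client_size(skew, client_trade_rate):
    """Additional offsetting client flow, linear in skew as in eq. (24)."""
    return skew * beta_skew * MAX_HEDGE_SIZE / client_trade_rate

print(skewed_prices(100.0, 99.9, 100.1, -0.5))  # bid moves halfway towards mid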
3.6 Dummy hedger and simulation dashboard

It is of interest to set up the environment in such a way that the choice to offset the accrued client position with hedge trades follows naturally from the observed reward. An obvious choice is to use strategy profitability (PNL) as the basis for the reward function. Hence, we use such a parametrization of the environment that wrong hedging choices result in adverse PNL. In particular, the ratio of the mid process volatility σ and the stochastic spread multiplier ν is such that, for a given realization of the mid price process, the decision not to hedge quickly becomes apparent in the strategy PNL.

To test this, let's use a dummy hedging strategy that on each step targets to offset the position accrued on the previous step (see hedging_env.py: def heuristic_action; a sketch is given at the end of this subsection).

To compare the performance of such a strategy to the alternative of not hedging at all, we will be using a dashboard that shows the price process, the net position alongside the hedge and client positions, the net PNL of the strategy and its constituent components: client PNL, market PNL and hedge PNL. With our dummy strategy (Figure 2) it is seen that we are offsetting the outstanding client position with the opposite hedge position and are making the difference of the client and hedge spreads. It is also seen that the strategy loses on unfavorable spikes of market data between consecutive steps, which is a result of position revaluation between steps. Compared to the unhedged case, this strategy exhibits low variance in PNL, since market exposure is limited to only the unhedged position increment (net client size flow) between steps.

Figure 2: Dummy hedger dashboard

In the case of an unhedged client position, the variance of the strategy PNL is driven by the combined variance of the simulated position and the market data process. The PNL becomes hostage to directional market moves, resulting in worse Sharpe ratios.

Figure 3: Unhedged client flow
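The heuristic itself can be as simple as the following sketch (the repository's hedging_env.py exposes it as heuristic_action; this reconstruction of it is an assumption):

MAX_HEDGE_SIZE = 100.0  # illustrative action bound

def heuristic_action(net_position):
    """Dummy strategy: target to fully offset the position accrued so far,
    clamped to the bounded hedge action space."""
    return max(-MAX_HEDGE_SIZE, min(MAX_HEDGE_SIZE, -net_position))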
4 Autohedger: hedging a single asset

4.1 Position penalty

We want the learning agent to carry a position in time only if it is justified by the resulting PNL. As a matter of risk management we do not want the agent to take on excessive positions, so we need to set up a risk limit system. To achieve this goal, a position penalty component is introduced into the reward function. This penalty is defined to be an exponential function of the position size, and is intended to make taking bigger positions increasingly more expensive for the agent in reward terms, forcing the agent to justify those positions by the resulting PNL. For positions small relative to MaxPosLimit the charge is relatively small, but it increases exponentially as the position nears or overshoots MaxPosLimit. We can see this penalty as a "spring" that forces the agent to bring the position back to zero unless holding the position is profitable enough.

Penalty = γ S ( e^{|Position| / MaxPosLimit} − 1 ) MaxPosLimit    (27)

where the γ constant determines how harshly the position penalty w.r.t. MaxPosLimit is enforced. To further discourage exploring unpromising trajectories, the simulation is terminated when the MaxPosLimit breach exceeds a certain multiple, and an additionally penalized reward is recorded.
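A sketch of the penalty of eq. (27) together with the termination rule; S0 (taken as the initial asset price for the scale S), the γ constant and the breach multiple are illustrative values:

import math

S0 = 100.0               # price scale (assumed to be the initial asset price)
MAX_POS_LIMIT = 1000.0   # position limit
GAMMA_PEN = 0.1          # gamma: harshness of the penalty
BREACH_MULTIPLE = 2.0    # terminate when the limit is breached by this multiple

def position_penalty(position):
    # Eq. (27): cheap near zero, exponentially expensive near MaxPosLimit.
    return GAMMA_PEN * S0 * (math.exp(abs(position) / MAX_POS_LIMIT) - 1.0) * MAX_POS_LIMIT

def episode_terminated(position):
    # Stop exploring unpromising trajectories early.
    return abs(position) > BREACH_MULTIPLE * MAX_POS_LIMIT

for pos in (0.0, 500.0, 1000.0, 2500.0):
    print(pos, round(position_penalty(pos), 1), episode_terminated(pos))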
The following reward function is defined:

Reward_t = ClientPNL_t + HedgePNL_t + MarketPNL_t − Penalty_t    (28)

Another technique to improve convergence to the optimal policy is to limit the action space to a closed interval:

a_i = hedgeAmount_i ∈ [−MaxHedgeAmount, MaxHedgeAmount]    (29)
where MaxHedgeAmount is defined to be of the same order as the mean client trade size. This naturally limits the capacity of the learning agent to explore policies related to speculating on asset price movements and makes it focus on managing the net position size instead.

Network architecture:

• action space: agent action a_i = HedgeSize ∈ [−MaxHedgeSize, MaxHedgeSize]
• observation space: cumulative net position of client trades and the hedger up until and including the last step, NetPosition_i = Σ_{j≤i} (NetClientTradeSize_j + HedgePositionSize_j) ∈ R
• network layers: MLP of Linear (1) x ReLU (1024) x ReLU (1024) x Linear (1) layers (x batch size)

The agent successfully learns to hedge the client flow, as can be seen from the rewards statistics and the simulation dashboard:

Figure 4: Autohedger single dashboard

Figure 5: Autohedger single progress

Listing 1: Training the single asset autohedger

cd stable-baselines3-autohedger-single/autohedger-single
python3 autohedger_ml.py

4.2 Adding skew

Let's now consider the setting where, in addition to hedging on the market, the agent can also apply skew, i.e. adjust the offered price on one side to attract more offsetting client flow instead of hedging all of the position on the market. For simplicity we will assume that the offsetting client trade flow size is a linear function of skew (see the Skewing model, 3.5).
Architecture layout:

• action space: agent action is a pair of {HedgeSize, Skew} rational numbers within a box bounded by HedgeSize ∈ [−MaxHedgeSize, MaxHedgeSize], Skew ∈ [−1, 1]
• observation space: net position
• network layers: MLP of Linear (1) x ReLU (2048) x ReLU (2048) x Linear (2) layers

It can be seen that the agent learns to make use of the skew in addition to utilizing hedge amounts to hedge the residual. This is useful, as it allows us to raise the question of the market price of risk.

Listing 2: Training the skew model

cd stable-baselines3-autohedger-single/autohedger-single
python3 autohedger_ml_skew.py

To gauge the performance of the model with skew vs. the model without skew, it is useful to introduce a risk-weighted measure of reward. We will be using a measure similar to the Sharpe ratio of returns, where the return of a single step i is defined as step_return_i = PNL_i − PNL_{i−1}:

Sharpe ratio = E[returns array] / StDev(returns array) = ( Σ_{i=1}^{N_steps} (PNL_i − PNL_{i−1}) / N_steps ) / StDev(returns array)    (30)

(a short sketch of this measure is given at the end of this subsection). To compare a non-skew and a skew model side-by-side, the models are trained independently, saved, and then run on the same generated market data to compare performance.

Figure 6: Models with skew and without

Depending on the magnitude of the β chosen in the Elasticity of demand (skewing) model, the skew-based model shows higher Sharpe ratios than the non-skew one. You can run the pre-trained models side-by-side on the same generated market data sets as per the listing below:

Listing 3: Comparison of models with skew and without

cd stable-baselines3-autohedger-single/autohedger-single
python3 autohedger_ml_learned.py
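The step-return Sharpe measure of eq. (30) can be sketched as below; the PNL path is synthetic:

import numpy as np

def step_sharpe(pnl_path):
    returns = np.diff(pnl_path)            # step_return_i = PNL_i - PNL_{i-1}
    return returns.mean() / returns.std()

# Synthetic cumulative PNL path with a small positive step drift.
pnl = np.cumsum(np.random.default_rng(0).normal(0.05, 1.0, size=500))
print(step_sharpe(pnl))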
5 Market price of risk

Let's assume a setting where our market-making agent chooses which spreads to charge clients. It is then of interest to determine the minimum mean spread that would compensate the agent for the need to risk-manage a position in an asset resulting from the accumulation of the client trade flow. Such a spread would include the cost of hedging the position in the market as well as the premium for any non-Brownian behavior of the market (e.g. jumps, volatility clustering, drifts etc.). Having those "hurdle" spreads calculated would allow the researcher to build the elasticity-of-demand spreading model on top of them and to have a clearer idea of how to price the spreads being charged.

Within our market model the price process is a martingale, so the expected revaluation of a position between any two future timestamps is zero:
Position_{t_1} · E[S_{t_2} − S_{t_1}] = 0,  t_2 ≥ t_1 ≥ 0    (31)
Within our PNL model this means that the mean client spreads the agent charges should be equal to the mean hedge spreads paid on the exchange to offset the position. Note, however, that our agent does not know what those hedge spreads are, and needs to work them out by interacting with the market model. To achieve this goal, the following reward function is defined:
Reward_t = max( −|ClientPNL_t + HedgePNL_t|, 0 ) + MarketPNL_t − PositionPenalty_t    (32)

This reward function makes the learning agent come up with client spreads that offset the cost of hedging the position on the exchange. As before, we introduce a charge on carrying a directional position via PositionPenalty. As the agent learns, at first it incurs higher penalties due to exploring actions leading to larger positions. Once the agent has learned to control the position, PositionPenalty becomes a lesser factor, letting the agent bring client spreads in line with the cost of hedging.
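In code the reward can be sketched as below. Note that max(−|x|, 0) as printed in eq. (32) is identically zero, so the sketch assumes the intended term is −|ClientPNL_t + HedgePNL_t|, which penalizes any mismatch between received client spreads and paid hedge spreads:

def price_of_risk_reward(client_pnl, hedge_pnl, market_pnl, position_penalty):
    # Zero when client spread income exactly offsets hedge spread costs,
    # negative otherwise (assumed reading of the first term of eq. 32).
    spread_mismatch = -abs(client_pnl + hedge_pnl)
    return spread_mismatch + market_pnl - position_penalty

# Client spread income of 5.0 vs. hedge spread cost of 4.0 leaves a mismatch
# of 1.0 which reduces the reward:
print(price_of_risk_reward(5.0, -4.0, 0.3, 0.1))  # -1.0 + 0.3 - 0.1 = -0.8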
We allow the learning agent to control client spreads via skew (see 3.5). In the context of this experiment we assume that the client trade flow is not elastic to price changes, so β = 0 in eq. (24).

Architecture layout:

• action space: agent action is a pair of {HedgeSize, Skew} rational numbers within a box bounded by HedgeSize ∈ [−MaxHedgeSize, MaxHedgeSize], Skew ∈ [−1, 1]
• observation space: net position
• network layers: MLP of Linear (1) x ReLU (2048) x ReLU (2048) x Linear (2) layers

As a result of the learning process it is seen how the rolling average spread charged by our agent (the "maker" spread) converges onto the rolling average hedge spread (the "taker" spread).

Figure 7: Discovering price of risk spreads

On a sample PNL dashboard of the learned agent it is seen how the agent tries to keep the net PNL flat, at the same time continuing to offset the accrued client position with the hedger position.

Figure 8: Price of risk dashboard

Listing 4: Training the price of risk model

cd stable-baselines3-autohedger-single/autohedger-single
python3 autohedger_ml_price_of_risk.py
6 Modeling portfolios

Let's assume we have two assets S_1 and S_2 whose log-increments (7) are correlated as in (9) with correlation coefficient ρ. For simplicity, let's assume that the log-increments have the same flat volatility σ, so that the resulting log-normal price processes have the same variance characteristics. Each of the S_1 and S_2 processes drives its own Client trade flow model and Stochastic spread model. Naturally, each of those processes implies its own Poisson intensity of arriving client trade sizes, as well as its own net imbalance of the arriving client trade flow. A "portfolio" is formed by blending the asset prices and the net client trade size flows (14) with a fixed weight w:

S_blended_{mid,bid,ask} = w S^1_{mid,bid,ask} + (1 − w) S^2_{mid,bid,ask}    (33)

TradeSize_blended_net = w TradeSize^1_net + (1 − w) TradeSize^2_net    (34)

As a result, the reinforcement learning agent accrues a position which gets incremented with the blended size at the blended price. The agent does not know what asset weights are being used (w and 1 − w), nor does it know the correlation coefficient ρ. The agent needs to learn the optimal hedging strategy using hedge amounts for S_1 and S_2.

At any step, the net position ("portfolio") value is defined as the sum of the values of the blended client position and its hedges:

Portfolio_value(t) = S_blended_mid(t) Position_blended_Client(t) + S^1_mid(t) Position^{S_1}_Hedge(t) + S^2_mid(t) Position^{S_2}_Hedge(t)    (35)

The goal of the learning agent is to maximize policy PNL whilst keeping the position value in check.
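A sketch of the blending of eqs. (33)-(35); the per-asset mid processes and client flows below are crude stand-ins for the market and trade flow models of section 3, and the weight w is hidden from the agent:

import torch

torch.manual_seed(0)
n_steps, w = 1000, 0.7

# Stand-in per-asset mid processes and net client flows.
S1_mid = 100.0 + torch.randn(n_steps).cumsum(0)
S2_mid = 100.0 + torch.randn(n_steps).cumsum(0)
rate = torch.full((n_steps,), 5.0)
flow1 = torch.poisson(rate) - torch.poisson(rate)
flow2 = torch.poisson(rate) - torch.poisson(rate)

S_blended = w * S1_mid + (1 - w) * S2_mid          # eq. (33)
flow_blended = w * flow1 + (1 - w) * flow2         # eq. (34)

def portfolio_value(t, client_pos, hedge_pos_1, hedge_pos_2):
    # Eq. (35): blended client position plus the two hedge legs.
    return (S_blended[t] * client_pos
            + S1_mid[t] * hedge_pos_1
            + S2_mid[t] * hedge_pos_2)

print(portfolio_value(10, 50.0, -30.0, -15.0))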
The action space is now a square in R^2 which bounds the sampled hedge amounts for S_1 and S_2. As before, we will be using a penalty function to facilitate risk management by the learning agent. However, a straightforward application of the exponential penalty function (see the single-asset penalty function, eq. 27) to Portfolio_value would not be as effective. The reason is that with the two-dimensional action space we are now allowing HedgeAmount_{S_1} and HedgeAmount_{S_2} to offset each other, thus introducing an additional degree of freedom into the action space. This shows up as a lower net portfolio value and contradicts the convex definition of the blended price (33). This can be remediated in a few ways.

The first approach, making the actions convex, is to change the parametrization of the action space:

HedgeAmount_{S_1} = w_action · hedge_amount_action    (36)

HedgeAmount_{S_2} = (1 − w_action) · hedge_amount_action    (37)

Here the actual actions are the weight w_action and hedge_amount_action, and the respective S_1 and S_2 hedge amounts are calculated in a convex manner from those.

A better approach is to leave it for the learning agent to reason out, but to introduce an additional "over-hedge" penalty for when an individual hedge value overshoots the value of the blended client position or has the same sign (so it leverages instead of offsetting):

Overhedge = |hedge_value + client_pos_value| if the hedge overshoots or has the same sign, 0 in other cases

Then Overhedge is used as a part of the overall penalty function, where the φ constant determines the penalty's tolerance to over-hedging and γ determines the tolerance to the overall portfolio value closing in on MaxPosLimit or overshooting it:
Penalty = γ S ( e^{( |Portfolio_value| + φ |Overhedge| S ) / MaxPosLimit} − 1 ) MaxPosLimit    (38)

Note that three positions are accrued over the simulation trajectory: the blended client position, the S_1 hedge position and the S_2 hedge position, where the hedge positions are supposed to offset the blended client position. For each of those positions, at each step, we track spread PNL (paid or received) as well as market revaluation PNL. Portfolio PNL is defined as the trajectory PNL consisting of the blended client position PNL (client spread PNL and client position market revaluation) and the sum of the S_1 and S_2 hedge position PNLs (hedge spread PNL and hedge position market revaluation). The reward function is defined as the portfolio PNL minus the penalty function:

Reward_portfolio = BlendedClientPosition_PNL + HedgePosition^{S_1}_PNL + HedgePosition^{S_2}_PNL − Penalty    (39)
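A sketch of the over-hedge term and the portfolio penalty of eq. (38); φ, γ and the use of the initial price S0 for the scale S are illustrative assumptions:

import math

S0 = 100.0           # price scale (assumed to be the initial asset price)
PHI = 0.5            # tolerance to over-hedging
GAMMA_PEN = 0.1      # tolerance to the portfolio value nearing MaxPosLimit
MAX_POS_LIMIT = 1000.0

def overhedge(hedge_value, client_pos_value):
    # Non-zero when a hedge leg overshoots the client position value or has
    # the same sign (leverages instead of offsetting).
    if hedge_value * client_pos_value > 0 or abs(hedge_value) > abs(client_pos_value):
        return abs(hedge_value + client_pos_value)
    return 0.0

def portfolio_penalty(portfolio_value, overhedge_total):
    # Eq. (38): exponential in the portfolio value plus the weighted over-hedge.
    x = (abs(portfolio_value) + PHI * abs(overhedge_total) * S0) / MAX_POS_LIMIT
    return GAMMA_PEN * S0 * (math.exp(x) - 1.0) * MAX_POS_LIMIT

print(portfolio_penalty(250.0, overhedge(-80.0, 50.0)))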
Architecture layout:

• action space: agent action is a pair of {HedgeSize_{S_1}, HedgeSize_{S_2}} rational numbers within a square box bounded by HedgeSize ∈ [−MaxHedgeSize, MaxHedgeSize]
• observation space: the S_1 hedger position value S^1_mid Position^{S_1}_Hedge, the S_2 hedger position value S^2_mid Position^{S_2}_Hedge, and the net
Portfolio_value(t) (see eq. 35)
• network layers: MLP of Linear (input: 3) x ReLU (8000) x ReLU (8000) x Linear (output: 2) layers

Given the initial correlation ρ = 0, the agent is seen to learn to offset the blended incoming client size flow with the hedge positions. The agent seems to be able to deduce the implied weighting of assets within the blended client trade flow; however, more work needs to be done on improving stability in later epochs. Sample dashboards for the portfolio learning agents can be found in the Portfolio hedging dashboards appendix.

Listing 5: Training the portfolio hedge model

cd stable-baselines3-autohedger-portfolio/autohedger-portfolio
python3 autohedger_ml_portfolio.py

A slower, more stable version:

Listing 6: Training the portfolio hedge model - slow

cd stable-baselines3-autohedger-portfolio/autohedger-portfolio
python3 autohedger_ml_portfolio_slow.py

7 Conclusion

Techniques explored in this paper could be used as a baseline for setting up a position management framework within an automated market-making system. More important, however, is that they could be used as a way of blending alpha-generation and position management into one reinforcement learning system. Let's assume that we have a signal-generating unit that is listening to the real-life price process as well as other information sources like the order book, news feeds etc. The signal it generates could then be plugged into our reinforcement learning agent. If (as a result of the learning process) our agent finds this signal to be beneficial to the reward function, it will automatically learn to adjust its hedging decisions in line with it. Bridging signal-generation capacity with reinforcement-learning inventory management poses an interesting area of research.

Automating portfolio management decisions by letting the agent infer correlations and weights while learning also proves to be a challenging area to be explored. In a more realistic setting for the portfolio management task we could define a risk-weighted reward function (e.g. based on the Sharpe ratio, eq. 30) and let the agent pick from tens or hundreds of assets, which would translate into the same dimensionality for the observation and action spaces and would increase the complexity of the stochastic policy search problem.

Even in the current set-up, more work needs to be done to increase learning stability and to avoid intermittent oscillations of the HedgeSize action between −MaxHedgeSize and +MaxHedgeSize between consecutive steps in later epochs. This is most likely related to the diminished scale of rewards in later epochs compared to the large scale of penalties incurred at the start of training. Methods like normalization of rewards and entropy regularization could be used for this purpose, which would boil down to fine-tuning the entropy maximization parametrization within SAC.
Project code is available under https://github.com/bakalex/autohedger
Appendix

Modeling was performed on an Amazon EC2 g4dn.2xlarge machine with the use of the Deep Learning AMI (Ubuntu 16.04) Version 30.0 (ami-02379288a3b4cbe7b). PyTorch 1.5 with CUDA 10.1 is activated with:

Listing 7: Setting up the PyTorch environment on the EC2 AMI

source activate pytorch_latest_p36

g4dn.2xlarge allowed for research on MLPs as large as 10000 x 10000 x 8000 (x default batch size), but typically nets of no more than 8000 x 8000 units were used.

The OpenAI baseline is installed with:

Listing 8: Installing the OpenAI baseline

cd stable-baselines3
pip install .

For the convenience of reproducing the results presented in this paper, a version of the OpenAI baseline is supplied alongside the project code. It is the researcher's choice to use it or to try a more recent version.

The soft actor-critic algorithm is listed in OpenAI Spinning Up; a more succinct version is available in [6].

Figure 9: Soft actor-critic RL algorithm

Portfolio hedging dashboards
Figure 10: Learning to hedge portfolios, w = 0.5

Figure 11: Learning to hedge portfolios, w = 0.0

References