Market-making with reinforcement-learning (SAC)
Alexey Bakshaev
[email protected]
August 28, 2020
Abstract
The paper explores the application of a continuous action space soft actor-critic (SAC) reinforcement learning model to the area of automated market-making. The reinforcement learning agent receives a simulated flow of client trades, thus accruing a position in an asset, and learns to offset this risk by either hedging at simulated "exchange" spreads or by attracting an offsetting client flow by changing the offered client spreads (skewing the offered prices). The question of learning minimum spreads that compensate for the risk of taking the position is investigated. Finally, the agent is posed with the problem of learning to hedge a blended client trade flow resulting from independent price processes (a "portfolio" position). A position penalty method is introduced to improve convergence. An OpenAI Gym-compatible hedge environment is introduced, and the OpenAI SAC baseline RL engine is used as a learning baseline.
1 Introduction

Let's assume that our goal is to train our agent in a way that it can perform market-making effectively. In this trading mode our agent puts out both the price it is willing its clients to buy at (the "client" ask) and to sell at (the "client" bid), thereby accepting an incoming flow of client orders at those prices and profiting from the resulting spread (ask − bid > 0).

Figure 1: Example of making a spread on client flow. At time t = 47 the agent buys from a client at a discount to the mid market price (at the "client bid"), making a half-spread s_1. Had it instead sold to another client at t = 47, it would have made the entire spread ask − bid. Instead, the agent closes the position at time t = 50 at a lower market price at a half-spread s_2
, and the overall result is the sum of those half-spreads s_1 + s_2.

To manage the position accrued from the client flow, the agent can choose between:

• putting offsetting orders out to the exchange, thereby "hedging" the position at the cost of paying exchange bid/ask spreads
• decreasing client bid/ask spreads to attract more client flow on a chosen side, thereby decreasing the position more quickly. This is known as "skewing" the price, under the condition that the client flow (the "demand") is sufficiently elastic to price changes.

Hedging comes at a cost of decreased profitability, whereas not hedging introduces the risk of incurring losses in case of unfavorable price movements. Thus, given the net incoming client order flow and the resulting accrued client position, the agent needs to learn to reason out its risk appetite: how much of the position to carry (in the hope it gets offset by future client order flow) and how much of it to hedge, either via an exchange or by offering clients more attractive prices to attract more offsetting client flow.

In a driftless environment where the asset price is a martingale, the expectation of the market price move with time is zero:

E[S_{t_2} − S_{t_1}] = 0,  t_2 ≥ t_1 ≥ 0    (1)
From this follows the first challenge in setting up an effective market-making ML framework: the agent needs to learn that there is no sense in entering speculative hedge positions that would try to profit from market moves. Instead, in a very basic setting, the agent needs to learn to offset the accrued client position with an opposite exchange (hedger) position. This is an inventory management problem that is very similar to the popular Pendulum environment, where an ML agent learns to balance a pendulum, or to the Mountain Car environment, where the agent needs to learn to swing the car out of the bottom of a pit.

Both the observation space (the size of the position accrued by the agent) and the action space (the choice of an amount to hedge) are chosen to be continuous, and the asset price is driven by drift-adjusted Brownian motion. This allows for an infinite action search space and presents a challenging task. An extension of this problem is presented when price skew is added as an additional action available to the agent, making the overall search space a box in R^2.

Assuming the agent has successfully learnt to balance the position to zero via choosing hedge/skew actions, we can pose the following problem. In an environment where hedge spreads and client flow are unknown, can the market-making agent learn to offer such spreads that would compensate it for the risk taking? In other words, can an agent learn the price of market risk, measured in terms of offered client spreads?

Another problem that naturally follows is that of portfolio management. The dynamics of the portfolio value are determined by the blended asset price process as well as the blended client trade flow, both of which are driven by the underlying asset price processes, correlation coefficients and asset weights. The learning agent does not know what correlations and asset weights have been used to build the portfolio, and needs to work out an effective hedge strategy that will maximize profitability and minimize the position involved. This makes it an inventory management problem with not just one asset but a portfolio of assets.

2 Reinforcement learning and soft actor-critic

Reinforcement learning is becoming increasingly popular in the area of robotics and control automation. In the reinforcement learning setting there is a learning agent that can perpetually (i ∈ [1, N]) interact with the environment by performing an action a_i and then observing the resulting state s_i of the environment and the associated reward r_i. The main purpose of the reward function is to define the end goal of the learning process by rewarding "good" actions and penalizing "bad" ones. As a result, the agent learns how to sample from the action space in a way that maximizes the cumulative discounted return over all steps within a simulation trajectory:

R(τ) = Σ_i γ^i r_i    (2)

where γ is a discount factor which puts more weight on the nearest rewards. The learned probability distribution π for action sampling given the reward increment and observation space state is called a policy. The goal of the learning process is to maximize the expected return over all simulation trajectories and to converge onto the optimal policy π*:

π* = argmax_π E_{τ∼π} [ Σ_i γ^i r_i ]    (3)

If an agent chooses actions according to some fixed policy π starting from a given initial pair of state and action, the resulting expected return is known as the "value function", representing the value of this policy.
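As a toy illustration of the discounted return in eq. (2), the short sketch below computes R(τ) for a trajectory of per-step rewards; the reward values are made up for illustration:

def discounted_return(rewards, gamma=0.99):
    # R(tau) = sum_i gamma^i * r_i -- earlier rewards carry more weight
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

trajectory_rewards = [1.0, 0.5, -0.2, 0.8]  # hypothetical rewards r_0..r_3
print(discounted_return(trajectory_rewards))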
Naturally, as the network learns, the policy gets updated, resulting in a new value function on each step of a trajectory. A Q-function, or "action-value" function, is a value function where the first action is fixed (e.g. sampled off-policy) before the policy is applied on the consecutive steps of the trajectory.

The soft actor-critic (SAC) algorithm introduced in [4], [5] and [6] offered a few methods of improving the stability of convergence to the optimal policy in the case of continuous action and observation spaces:

• entropy maximization: the entropy H(π(·|s_i)) of the action-generating distribution is used as a part of the reward function. This leads to joint maximization of expected entropy and return, which encourages stochastic exploration and stability of convergence:

π* = argmax_π E_{τ∼π} [ Σ_i γ^i ( R(s_i, a_i, s_{i+1}) + α H(π(·|s_i)) ) ]    (4)

where α is the temperature parameter which determines the trade-off between maximizing return vs. maximizing entropy (exploration).

• actor-critic: the use of separate policy ("actor") and value function ("critic") networks. The policy π and two value functions Q_j(s, a), j = 1, 2, are learned, with the Q-networks minimizing the loss

L_j = E[ ( Q_j(s_i, a_i) − y(r_i, s_{i+1}) )^2 ]    (5)

where the target y is approximated by sampling actions ã_{i+1} from the policy π(·|s_{i+1}), conditioned on the reward r_i and next state s_{i+1} coming from the replay buffer:

y(r_i, s_{i+1}) = r_i + γ(1 − d) ( min_{j=1,2} Q_{targ,j}(s_{i+1}, ã_{i+1}) − α log π(ã_{i+1}|s_{i+1}) ),  ã_{i+1} ∼ π(·|s_{i+1})    (6)

The target value function networks Q_{targ,j} are periodically updated from the current value networks Q_j in an exponential-averaging fashion. Within the learning loop (see the SAC algorithm listing in the appendix) the Q-functions are updated by gradient descent on the loss function (5), and the policy is updated by gradient ascent on the entropy-adjusted Q-functions.

• experience replay: the use of a replay buffer to sample past states and rewards in an off-policy manner. Next actions are simulated from the current policy.

These methods, alongside others, were used in OpenAI's baseline implementation of the SAC algorithm, which serves as the baseline RL engine used in this paper. The policy/value network is a multi-layered ReLU perceptron with linear input and output layers. The linear input layer transforms dimensionality from the input (observation) space onto the first perceptron layer; the linear output layer transforms dimensionality from the last perceptron layer onto the action space. E.g. when we have the observation space defined as the position in an asset (dim = 1), the action space defined as the hedge amount (dim = 1) and 2 ReLU layers of 2048 units, we will write this as MLP Linear (1) x ReLU (2048) x ReLU (2048) x Linear (1). An additional dimension to all of those would be the batch size.

The hedge environment is implemented to be OpenAI Gym-compatible. It makes use of the PyTorch tensor framework for generating price processes and client flows and is intended to be used with a PyTorch-based implementation. The hedge environment provides the standard members step() for updating the hedger state based on the action selected by the agent, reset() for re-setting the simulation, and render() for displaying the dashboard with hedger statistics. Visualization of the hedging agent's performance serves as a very important tool in understanding the progress (or lack thereof) of the hedging agent, as well as assessing the overall correctness of the environment set-up. Project code for this paper is available under: https://github.com/bakalex/autohedger
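A minimal sketch of such a Gym-compatible environment is given below. It only mirrors the step()/reset()/render() interface described above; the toy client flow, the placeholder reward and the constants MAX_HEDGE_SIZE and MAX_POS_LIMIT are illustrative assumptions, not the repository's actual implementation:

import numpy as np
import gym
from gym import spaces

MAX_HEDGE_SIZE = 100.0   # illustrative bound on the hedge action
MAX_POS_LIMIT = 1000.0   # illustrative position scale for the toy reward

class HedgeEnvSketch(gym.Env):
    """Skeleton of the hedge environment interface: step(), reset(), render()."""

    def __init__(self, n_steps=256):
        super().__init__()
        # Continuous action (hedge amount) and observation (net position).
        self.action_space = spaces.Box(-MAX_HEDGE_SIZE, MAX_HEDGE_SIZE, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)
        self.n_steps = n_steps
        self.reset()

    def reset(self):
        self.t = 0
        self.position = 0.0
        return np.array([self.position], dtype=np.float32)

    def step(self, action):
        # Toy net client flow; the real environment draws it from the trade flow model.
        client_flow = float(np.random.poisson(5.0) - np.random.poisson(5.0))
        self.position += client_flow + float(action[0])
        # Placeholder reward: keep the net position small.
        reward = -abs(self.position) / MAX_POS_LIMIT
        self.t += 1
        done = self.t >= self.n_steps
        return np.array([self.position], dtype=np.float32), reward, done, {}

    def render(self, mode="human"):
        print(f"step={self.t} position={self.position:.2f}")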
3 Market model

3.1 Price process

The asset price process is generated as a standard Euler discretization of a log-normal price process:

Δlog(S_i) = μ Δt − (σ^2/2) Δt + σ ε √Δt    (7)

S_N = S_0 exp( Σ_i Δlog(S_i) )    (8)

For the inventory management problem we are trying to solve, we do not want our reinforcement learning agent to focus on market drift, which would result in the hedger taking speculative directional positions in an asset. Instead, we want it to focus on managing the position arising from the incoming client order flow, so we assume no drift (μ = 0), which makes our price process a martingale (eq. 1).

For a given normally distributed variable ε_1 and an uncorrelated i.i.d. variable ε_2, a correlated normally distributed variable ε_3 with correlation coefficient ρ is generated as follows:

ε_3 = f(ε_1, ρ) = ρ ε_1 + √(1 − ρ^2) ε_2    (9)

Generation of normally distributed variables, as well as the above discretization, is done within the PyTorch tensor framework, so that at the beginning of the learning process we have the market data set ready for the entire simulation.

3.2 Stochastic spread model

There are two types of offered spreads δ_bid, δ_ask defined in the model: "client" spreads that our market-making agent offers clients to trade at (hence "client" bid/ask) and "hedge" spreads that our agent trades at with an exchange (hence "hedge" bid/ask). Both of those spreads are applied on top of the same mid price process (8). Generated spreads are based on the rolling mean volatility σ_roll_avg of the mid price process plus a log-normal stochastic spread add-on ε_L:

S_bid = S − ν δ_bid,  S_ask = S + ν δ_ask    (10)

δ_bid = σ_roll_avg + ε_L1,  δ_ask = σ_roll_avg + ε_L2    (11)

σ_roll_avg = StDev(S_i, window) / √n_steps    (12)

ε_L = LogNormal( 0, γ S σ / √n_steps )    (13)

where ν is a spread multiplier determining the overall magnitude of the spread, γ is a spread multiplier that determines the magnitude of the stochastic spread add-on, σ is the flat volatility driving the mid process, and ε_L1 and ε_L2 are log-normal i.i.d. variables.

Given the fat-tailed nature of the log-normal distribution, the stochastic spreads δ_bid, δ_ask are clamped between 0.1 and 2.5 of the mean simulation volatility. This helps to avoid unreasonable spikes in bid/ask prices.
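The sketch below generates the market data up-front as PyTorch tensors along the lines of eqs. (7)-(13). The parameter values, the use of the initial price S_0 for the scale S in eq. (13), and my reading of the √n_steps scaling in eqs. (12)-(13) are illustrative assumptions:

import torch

torch.manual_seed(0)
n_steps, S0, sigma, dt = 1000, 100.0, 0.2, 1.0 / 252
nu, gamma_spread, window = 1.0, 0.5, 20

# Driftless (mu = 0) log-normal mid process, eqs. (7)-(8).
eps = torch.randn(n_steps)
dlogS = -0.5 * sigma ** 2 * dt + sigma * eps * dt ** 0.5
S = S0 * torch.exp(torch.cumsum(dlogS, dim=0))

# A second normal stream correlated with eps via eq. (9),
# used when a second correlated asset is needed.
rho = 0.5
eps3 = rho * eps + (1.0 - rho ** 2) ** 0.5 * torch.randn(n_steps)

# Rolling volatility of the mid process, eq. (12).
sigma_roll = S.unfold(0, window, 1).std(dim=1) / n_steps ** 0.5

# Log-normal stochastic spread add-on, eq. (13).
eps_L = torch.distributions.LogNormal(
    torch.zeros_like(sigma_roll), gamma_spread * S0 * sigma / n_steps ** 0.5
).sample()

# Spreads (11), clamped to [0.1, 2.5] of mean volatility, and bid/ask prices (10).
m = sigma_roll.mean().item()
delta = (sigma_roll + eps_L).clamp(0.1 * m, 2.5 * m)
S_mid = S[window - 1:]
S_bid, S_ask = S_mid - nu * delta, S_mid + nu * delta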
3.3 Client trade flow model

Client trade sizes hitting our offered bid and ask prices are simulated as bid and ask Poisson processes. The magnitude of the intensity of those processes is assumed to be a function of the rolling volatility of the price process. The net trade flow that determines how the position changes on each step of the simulation is a function of the imbalance of the bid/ask Poisson intensities, which is assumed to be correlated with the log-price process.

TradeSize_net = TradeSize_bid − TradeSize_ask    (14)
TradeSize_bid = Poisson( ClientTradeRate · λ̄_bid )    (15)

TradeSize_ask = Poisson( ClientTradeRate · λ̄_ask )    (16)

where ClientTradeRate determines the overall magnitude of the size Poisson process, and λ̄_bid and λ̄_ask are trade flow intensity multipliers.

ClientTradeRate = C ( α + σ_rolling / σ_mean )    (17)

where α allows one to define the vol-independent "mean" client trade flow, σ_rolling is the average volatility within a rolling window, and σ_mean is the mean volatility for the entire simulation. C is the flow scaling factor and may be taken to be equal to the initial price of the asset S_0.

Trade flow intensity multipliers are defined as:

λ̄_bid = max(1 − λ, 0)    (18)

λ̄_ask = max(1 + λ, 0)    (19)

where λ is the net intensity. It is defined to be correlated with the log-price process as per eq. (9):

λ = β · ε(log S_i, ρ)    (20)

where β is the sensitivity of the net client trade flow towards log-returns and ρ defines the correlation between the net intensity process and the log-price process. To smooth out variance in intensity, a rolling-average version of the log-price process may be used.
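A sketch of this trade flow model follows; the log-returns and rolling volatility below are crude stand-ins for the market model outputs above, and the constants are illustrative choices for C, α, β and ρ:

import torch

torch.manual_seed(0)
n_steps = 1000
C_scale, alpha_flow = 100.0, 1.0     # C taken equal to S0; alpha sets the mean flow level
beta_sens, rho_flow = 5.0, 0.8       # beta and rho from eq. (20)

# Stand-ins for the market model outputs.
log_returns = 0.01 * torch.randn(n_steps)
sigma_rolling = 0.2 + 0.02 * torch.rand(n_steps)
sigma_mean = sigma_rolling.mean()

# Net intensity, eq. (20): correlated with the log-price increments via eq. (9).
eps_corr = rho_flow * log_returns + (1 - rho_flow ** 2) ** 0.5 * 0.01 * torch.randn(n_steps)
lam = beta_sens * eps_corr

# Intensity multipliers, eqs. (18)-(19), floored at zero.
lam_bid = (1 - lam).clamp(min=0.0)
lam_ask = (1 + lam).clamp(min=0.0)

# Overall trade rate, eq. (17), and Poisson trade sizes, eqs. (14)-(16).
client_trade_rate = C_scale * (alpha_flow + sigma_rolling / sigma_mean)
trade_size_bid = torch.poisson(client_trade_rate * lam_bid)
trade_size_ask = torch.poisson(client_trade_rate * lam_ask)
trade_size_net = trade_size_bid - trade_size_ask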
3.4 PNL model

As explained in the introduction, our hedging agent accepts client trades at the "client" spreads δ_client it offers to clients, and then it may choose to hedge the accrued position out at exchange "hedge" spreads δ_hedge. Let's assume a client sells to our agent at our offered ask price, and the agent then hedges out (sells) the accrued position to the exchange at the exchange bid price. In a hypothetical setting where our hedger could hedge instantaneously, it would make the client half-spread and pay the exchange half-spread, locking in the profit:

PNL = TradeSize (S_client_ask − S_mid) − TradeSize (S_hedge_bid − S_mid) = (δ_client_ask − δ_hedge_bid) TradeSize    (21)

In reality, having accrued the client trade at step t_1, our hedger can only choose to hedge out on the next step t_2 of the simulation, or even later, which results in revaluation of the position at the market mid price between those steps:

PNL = TradeSize (S_client_ask,t_1 − S_mid,t_1) + TradeSize (S_mid,t_2 − S_mid,t_1) − TradeSize (S_hedge_bid,t_2 − S_mid,t_2)    (22)
    = ( δ_client_ask,t_1 − δ_hedge_bid,t_2 + ΔS_mid,t_2−t_1 ) TradeSize    (23)

Given the martingale nature of the simulated price process, E[ΔS_mid,t_2−t_1] = 0, our agent can only be profitable if the average offered client spreads are greater than the average paid hedge spreads. Note that our reinforcement agent does not know that the price process is a martingale, so as it begins to explore the action space it will initially be sensitive to realized directional market moves of the mid process. With time, however, the agent needs to learn to focus on managing positions resulting from the client flow instead of trying to capture directional moves of the mid process. The martingale formulation of the price process helps to achieve that.
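In code, the one-step PNL decomposition of eq. (23) reads as follows; all numbers are made up, and the example shows that with a zero expected mid move the expected PNL collapses to the spread difference of eq. (21):

def step_pnl(trade_size, delta_client_ask, delta_hedge_bid, mid_t1, mid_t2):
    client_spread_pnl = trade_size * delta_client_ask   # half-spread received from the client at t1
    hedge_spread_pnl = -trade_size * delta_hedge_bid    # half-spread paid to the exchange at t2
    market_pnl = trade_size * (mid_t2 - mid_t1)         # mid revaluation between the two steps
    return client_spread_pnl + hedge_spread_pnl + market_pnl

# A 10-lot accrued at a 0.05 client half-spread, hedged one step later at a
# 0.03 exchange half-spread, with the mid dropping 0.02 in between:
print(step_pnl(10.0, 0.05, 0.03, 100.00, 99.98))  # 10 * (0.05 - 0.03 - 0.02) = 0, up to float rounding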
3.5 Elasticity of demand (skewing model)

When demand is elastic, we may affect the trade size a client is willing to trade with us by offering a more attractive ("skewed") price than the rest of the market. This can be used as a means of position management: e.g. given a long position in an asset, we could lower our offered price to entice the client to buy from us, thus decreasing the position. Depending on the marginal profitability of the transaction given the skewed price, such an action may be less expensive compared to executing a similar transaction on the exchange. We will be using a linear model in our experiments:

Δsize = skew · β · MaxHedgeSize / ClientTradeRate    (24)

where ClientTradeRate is the mean trade rate from the Client trade flow model (3.3), MaxHedgeSize is the maximum hedge trade size constant defined for the bounded action space of the hedger, and skew ∈ [−1.0, 1.0] is a bounded action resulting in the following client price adjustment:
S_bid_adj = S_bid − skew (S_mid − S_bid),  skew ∈ [−1.0, 0.0)    (25)

S_ask_adj = S_ask − skew (S_ask − S_mid),  skew ∈ (0.0, 1.0]    (26)
So, negative skew means our agent buys more expensively and attracts more client flow on the bid side; positive skew means our agent sells more cheaply and attracts more client flow on the ask side. The β constant is specified to be large enough for the learning agent to prefer skewing over hedging. Setting β = 0 makes the client flow non-elastic and can be used for price discovery purposes, such as finding the market price of risk spreads in section 5 (Market price of risk).
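A sketch of the skew mechanics, assuming the linear flow response of eq. (24) (the exact arrangement of MaxHedgeSize and ClientTradeRate there is my reading of the formula) and the one-sided price adjustments of eqs. (25)-(26); beta_skew and MAX_HEDGE_SIZE are illustrative constants:

MAX_HEDGE_SIZE = 100.0   # bound of the hedge action space
beta_skew = 2.0          # elasticity constant beta from eq. (24)

def skewed_prices(s_mid, s_bid, s_ask, skew):
    """One-sided price adjustment for skew in [-1, 1], eqs. (25)-(26)."""
    if skew < 0:
        s_bid = s_bid - skew * (s_mid - s_bid)   # skew < 0 raises the bid towards mid
    elif skew > 0:
        s_ask = s_ask - skew * (s_ask - s_mid)   # skew > 0 lowers the ask towards mid
    return s_bid, s_ask

def offsetting_client_size(skew, client_trade_rate):
    """Additional offsetting client flow, linear in skew as in eq. (24)."""
    return skew * beta_skew * MAX_HEDGE_SIZE / client_trade_rate

print(skewed_prices(100.0, 99.9, 100.1, -0.5))  # bid moves halfway towards mid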
3.6 Dummy hedger and simulation dashboard

It is of interest to set up the environment in such a way that the choice to offset the accrued client position with hedge trades follows naturally from the observed reward. An obvious choice is to use strategy profitability (PNL) as the basis for the reward function. Hence, we use such a parametrization of the environment that wrong hedging choices result in adverse PNL. In particular, the ratio of the mid process volatility σ and the stochastic spread multiplier ν is such that, for a given realization of the mid price process, the decision not to hedge quickly becomes apparent in the strategy PNL.

To test this, let's use a dummy hedging strategy that on each step targets to offset the position accrued on the previous step (see hedging_env.py: def heuristic_action; a sketch is given at the end of this subsection).

To compare the performance of such a strategy to the alternative of not hedging at all, we will be using a dashboard that shows the price process, the net position alongside the hedge and client positions, the net PNL of the strategy and its constituent components: client PNL, market PNL and hedge PNL. With our dummy strategy (Figure 2) it is seen that we are offsetting the outstanding client position with the opposite hedge position and are making the difference of the client and hedge spreads. It is also seen that the strategy loses on unfavorable spikes of market data between consecutive steps, which is a result of position revaluation between steps. Compared to the unhedged case, this strategy exhibits low variance in PNL, since market exposure is limited to only the unhedged position increment (net client size flow) between steps.

Figure 2: Dummy hedger dashboard

In the case of an unhedged client position, the variance of the strategy PNL is driven by the combined variance of the simulated position and the market data process. The PNL becomes hostage to directional market moves, resulting in worse Sharpe ratios.

Figure 3: Unhedged client flow
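The heuristic itself can be as simple as the following sketch (the repository's hedging_env.py exposes it as heuristic_action; this reconstruction of it is an assumption):

MAX_HEDGE_SIZE = 100.0  # illustrative action bound

def heuristic_action(net_position):
    """Dummy strategy: target to fully offset the position accrued so far,
    clamped to the bounded hedge action space."""
    return max(-MAX_HEDGE_SIZE, min(MAX_HEDGE_SIZE, -net_position))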
4 Autohedger: hedging a single asset

4.1 Position penalty

We want the learning agent to carry a position in time only if it is justified by the resulting PNL. As a matter of risk management we do not want the agent to take on excessive positions, so we need to set up a risk limit system. To achieve this goal, a position penalty component is introduced into the reward function. This penalty is defined to be an exponential function of the position size, and is intended to make taking bigger positions increasingly more expensive for the agent in reward terms, forcing the agent to justify those positions by the resulting PNL. For positions small relative to MaxPosLimit the charge is relatively small, but it increases exponentially as the position nears or overshoots MaxPosLimit. We can see this penalty as a "spring" that forces the agent to bring the position back to zero unless holding the position is profitable enough.

Penalty = γ S ( e^{|Position| / MaxPosLimit} − 1 ) MaxPosLimit    (27)

where the γ constant determines how harshly the position penalty w.r.t. MaxPosLimit is enforced. To further discourage exploring unpromising trajectories, the simulation is terminated when the MaxPosLimit breach exceeds a certain multiple, and an additionally penalized reward is recorded.
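A sketch of the penalty of eq. (27) together with the termination rule; S0 (taken as the initial asset price for the scale S), the γ constant and the breach multiple are illustrative values:

import math

S0 = 100.0               # price scale (assumed to be the initial asset price)
MAX_POS_LIMIT = 1000.0   # position limit
GAMMA_PEN = 0.1          # gamma: harshness of the penalty
BREACH_MULTIPLE = 2.0    # terminate when the limit is breached by this multiple

def position_penalty(position):
    # Eq. (27): cheap near zero, exponentially expensive near MaxPosLimit.
    return GAMMA_PEN * S0 * (math.exp(abs(position) / MAX_POS_LIMIT) - 1.0) * MAX_POS_LIMIT

def episode_terminated(position):
    # Stop exploring unpromising trajectories early.
    return abs(position) > BREACH_MULTIPLE * MAX_POS_LIMIT

for pos in (0.0, 500.0, 1000.0, 2500.0):
    print(pos, round(position_penalty(pos), 1), episode_terminated(pos))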
The following reward function is defined:

Reward_t = ClientPNL_t + HedgePNL_t + MarketPNL_t − Penalty_t    (28)

Another technique to improve convergence to the optimal policy is to limit the action space to a closed interval:

a_i = hedgeAmount_i ∈ [−MaxHedgeAmount, MaxHedgeAmount]    (29)
where MaxHedgeAmount is defined to be of the same order as the mean client trade size. This naturally limits the capacity of the learning agent to explore policies related to speculating on asset price movements and makes it focus on managing the net position size instead.

Network architecture:

• action space: agent action a_i = HedgeSize ∈ [−MaxHedgeSize, MaxHedgeSize]
• observation space: cumulative net position of client trades and the hedger up until and including the last step, NetPosition_i = Σ_{j≤i} (NetClientTradeSize_j + HedgePositionSize_j) ∈ R
• network layers: MLP of Linear (1) x ReLU (1024) x ReLU (1024) x Linear (1) layers (x batch size)

The agent successfully learns to hedge the client flow, as can be seen from the rewards statistics and the simulation dashboard:

Figure 4: Autohedger single dashboard

Figure 5: Autohedger single progress

Listing 1: Training the single asset autohedger

cd stable-baselines3-autohedger-single/autohedger-single
python3 autohedger_ml.py

4.2 Adding skew

Let's now consider the setting where, in addition to hedging on the market, the agent can also apply skew, i.e. adjust the offered price on one side to attract more offsetting client flow instead of hedging all of the position on the market. For simplicity we will assume that the offsetting client trade flow size is a linear function of skew (see the Skewing model, 3.5).
Architecture layout:

• action space: agent action is a pair of {HedgeSize, Skew} rational numbers within a box bounded by HedgeSize ∈ [−MaxHedgeSize, MaxHedgeSize], Skew ∈ [−1, 1]
• observation space: net position
• network layers: MLP of Linear (1) x ReLU (2048) x ReLU (2048) x Linear (2) layers

It can be seen that the agent learns to make use of the skew in addition to utilizing hedge amounts to hedge the residual. This is useful, as it allows us to raise the question of the market price of risk.

Listing 2: Training the skew model

cd stable-baselines3-autohedger-single/autohedger-single
python3 autohedger_ml_skew.py

To gauge the performance of the model with skew vs. the model without skew, it is useful to introduce a risk-weighted measure of reward. We will be using a measure similar to the Sharpe ratio of returns, where the return of a single step i is defined as step_return_i = PNL_i − PNL_{i−1}:

Sharpe ratio = E[returns array] / StDev(returns array) = ( Σ_{i=1}^{N_steps} (PNL_i − PNL_{i−1}) / N_steps ) / StDev(returns array)    (30)

(a short sketch of this measure is given at the end of this subsection). To compare a non-skew and a skew model side-by-side, the models are trained independently, saved, and then run on the same generated market data to compare performance.

Figure 6: Models with skew and without

Depending on the magnitude of the β chosen in the Elasticity of demand (skewing) model, the skew-based model shows higher Sharpe ratios than the non-skew one. You can run the pre-trained models side-by-side on the same generated market data sets as per the listing below:

Listing 3: Comparison of models with skew and without

cd stable-baselines3-autohedger-single/autohedger-single
python3 autohedger_ml_learned.py
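The step-return Sharpe measure of eq. (30) can be sketched as below; the PNL path is synthetic:

import numpy as np

def step_sharpe(pnl_path):
    returns = np.diff(pnl_path)            # step_return_i = PNL_i - PNL_{i-1}
    return returns.mean() / returns.std()

# Synthetic cumulative PNL path with a small positive step drift.
pnl = np.cumsum(np.random.default_rng(0).normal(0.05, 1.0, size=500))
print(step_sharpe(pnl))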
5 Market price of risk

Let's assume a setting where our market-making agent chooses which spreads to charge clients. It is then of interest to determine the minimum mean spread that would compensate the agent for the need to risk-manage a position in an asset resulting from the accumulation of the client trade flow. Such a spread would include the cost of hedging the position in the market as well as the premium for any non-Brownian behavior of the market (e.g. jumps, volatility clustering, drifts etc.). Having those "hurdle" spreads calculated would allow the researcher to build the elasticity-of-demand spreading model on top of them and to have a clearer idea of how to price the spreads being charged.

Within our market model the price process is a martingale, so the expected revaluation of a position between any two future timestamps is zero:
Position_{t_1} · E[S_{t_2} − S_{t_1}] = 0,  t_2 ≥ t_1 ≥ 0    (31)
Within our PNL model this means that the mean client spreads the agent charges should be equal to the mean hedge spreads paid on the exchange to offset the position. Note, however, that our agent does not know what those hedge spreads are, and needs to work them out by interacting with the market model. To achieve this goal, the following reward function is defined:
Reward_t = max( −|ClientPNL_t + HedgePNL_t|, 0 ) + MarketPNL_t − PositionPenalty_t    (32)

This reward function makes the learning agent come up with client spreads that offset the cost of hedging the position on the exchange. As before, we introduce a charge on carrying a directional position via PositionPenalty. As the agent learns, at first it incurs higher penalties due to exploring actions leading to larger positions. Once the agent has learned to control the position, PositionPenalty becomes a lesser factor, letting the agent bring client spreads in line with the cost of hedging.
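In code the reward can be sketched as below. Note that max(−|x|, 0) as printed in eq. (32) is identically zero, so the sketch assumes the intended term is −|ClientPNL_t + HedgePNL_t|, which penalizes any mismatch between received client spreads and paid hedge spreads:

def price_of_risk_reward(client_pnl, hedge_pnl, market_pnl, position_penalty):
    # Zero when client spread income exactly offsets hedge spread costs,
    # negative otherwise (assumed reading of the first term of eq. 32).
    spread_mismatch = -abs(client_pnl + hedge_pnl)
    return spread_mismatch + market_pnl - position_penalty

# Client spread income of 5.0 vs. hedge spread cost of 4.0 leaves a mismatch
# of 1.0 which reduces the reward:
print(price_of_risk_reward(5.0, -4.0, 0.3, 0.1))  # -1.0 + 0.3 - 0.1 = -0.8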
We allow the learning agent to control client spreads via skew (see 3.5). In the context of this experiment we assume that the client trade flow is not elastic to price changes, so β = 0 in eq. (24).

Architecture layout:

• action space: agent action is a pair of {HedgeSize, Skew} rational numbers within a box bounded by HedgeSize ∈ [−MaxHedgeSize, MaxHedgeSize], Skew ∈ [−1, 1]
• observation space: net position
• network layers: MLP of Linear (1) x ReLU (2048) x ReLU (2048) x Linear (2) layers

As a result of the learning process it is seen how the rolling average spread charged by our agent (the "maker" spread) converges onto the rolling average hedge spread (the "taker" spread).

Figure 7: Discovering price of risk spreads

On a sample PNL dashboard of the learned agent it is seen how the agent tries to keep the net PNL flat, at the same time continuing to offset the accrued client position with the hedger position.

Figure 8: Price of risk dashboard

Listing 4: Training the price of risk model

cd stable-baselines3-autohedger-single/autohedger-single
python3 autohedger_ml_price_of_risk.py
6 Modeling portfolios

Let's assume we have two assets S_1 and S_2 whose log-increments (7) are correlated as in (9) with correlation coefficient ρ. For simplicity, let's assume that the log-increments have the same flat volatility σ, so that the resulting log-normal price processes have the same variance characteristics. Each of the S_1 and S_2 processes drives its own Client trade flow model and Stochastic spread model. Naturally, each of those processes implies its own Poisson intensity of arriving client trade sizes, as well as its own net imbalance of the arriving client trade flow. A "portfolio" is formed by blending the asset prices and the net client trade size flows (14) with a fixed weight w:

S_blended_{mid,bid,ask} = w S^1_{mid,bid,ask} + (1 − w) S^2_{mid,bid,ask}    (33)

TradeSize_blended_net = w TradeSize^1_net + (1 − w) TradeSize^2_net    (34)

As a result, the reinforcement learning agent accrues a position which gets incremented with the blended size at the blended price. The agent does not know what asset weights are being used (w and 1 − w), nor does it know the correlation coefficient ρ. The agent needs to learn the optimal hedging strategy using hedge amounts for S_1 and S_2.

At any step, the net position ("portfolio") value is defined as the sum of the values of the blended client position and its hedges:

Portfolio_value(t) = S_blended_mid(t) Position_blended_Client(t) + S^1_mid(t) Position^{S_1}_Hedge(t) + S^2_mid(t) Position^{S_2}_Hedge(t)    (35)

The goal of the learning agent is to maximize policy PNL whilst keeping the position value in check.
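A sketch of the blending of eqs. (33)-(35); the per-asset mid processes and client flows below are crude stand-ins for the market and trade flow models of section 3, and the weight w is hidden from the agent:

import torch

torch.manual_seed(0)
n_steps, w = 1000, 0.7

# Stand-in per-asset mid processes and net client flows.
S1_mid = 100.0 + torch.randn(n_steps).cumsum(0)
S2_mid = 100.0 + torch.randn(n_steps).cumsum(0)
rate = torch.full((n_steps,), 5.0)
flow1 = torch.poisson(rate) - torch.poisson(rate)
flow2 = torch.poisson(rate) - torch.poisson(rate)

S_blended = w * S1_mid + (1 - w) * S2_mid          # eq. (33)
flow_blended = w * flow1 + (1 - w) * flow2         # eq. (34)

def portfolio_value(t, client_pos, hedge_pos_1, hedge_pos_2):
    # Eq. (35): blended client position plus the two hedge legs.
    return (S_blended[t] * client_pos
            + S1_mid[t] * hedge_pos_1
            + S2_mid[t] * hedge_pos_2)

print(portfolio_value(10, 50.0, -30.0, -15.0))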
The action space is now a square in R^2 which bounds the sampled hedge amounts for S_1 and S_2. As before, we will be using a penalty function to facilitate risk management by the learning agent. However, a straightforward application of the exponential penalty function (see the single-asset penalty function, eq. 27) to Portfolio_value would not be as effective. The reason is that with the two-dimensional action space we are now allowing HedgeAmount_{S_1} and HedgeAmount_{S_2} to offset each other, thus introducing an additional degree of freedom into the action space. This shows up as a lower net portfolio value and contradicts the convex definition of the blended price (33). This can be remediated in a few ways.

The first approach, making the actions convex, is to change the parametrization of the action space:

HedgeAmount_{S_1} = w_action · hedge_amount_action    (36)

HedgeAmount_{S_2} = (1 − w_action) · hedge_amount_action    (37)

Here the actual actions are the weight w_action and hedge_amount_action, and the respective S_1 and S_2 hedge amounts are calculated in a convex manner from those.

A better approach is to leave it for the learning agent to reason out, but to introduce an additional "over-hedge" penalty for when an individual hedge value overshoots the value of the blended client position or has the same sign (so it leverages instead of offsetting):

Overhedge = |hedge_value + client_pos_value| if the hedge overshoots or has the same sign, 0 in other cases

Then Overhedge is used as a part of the overall penalty function, where the φ constant determines the penalty's tolerance to over-hedging and γ determines the tolerance to the overall portfolio value closing in on MaxPosLimit or overshooting it:
Penalty = γ S ( e^{( |Portfolio_value| + φ |Overhedge| S ) / MaxPosLimit} − 1 ) MaxPosLimit    (38)

Note that three positions are accrued over the simulation trajectory: the blended client position, the S_1 hedge position and the S_2 hedge position, where the hedge positions are supposed to offset the blended client position. For each of those positions, at each step, we track spread PNL (paid or received) as well as market revaluation PNL. Portfolio PNL is defined as the trajectory PNL consisting of the blended client position PNL (client spread PNL and client position market revaluation) and the sum of the S_1 and S_2 hedge position PNLs (hedge spread PNL and hedge position market revaluation). The reward function is defined as the portfolio PNL minus the penalty function:

Reward_portfolio = BlendedClientPosition_PNL + HedgePosition^{S_1}_PNL + HedgePosition^{S_2}_PNL − Penalty    (39)
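A sketch of the over-hedge term and the portfolio penalty of eq. (38); φ, γ and the use of the initial price S0 for the scale S are illustrative assumptions:

import math

S0 = 100.0           # price scale (assumed to be the initial asset price)
PHI = 0.5            # tolerance to over-hedging
GAMMA_PEN = 0.1      # tolerance to the portfolio value nearing MaxPosLimit
MAX_POS_LIMIT = 1000.0

def overhedge(hedge_value, client_pos_value):
    # Non-zero when a hedge leg overshoots the client position value or has
    # the same sign (leverages instead of offsetting).
    if hedge_value * client_pos_value > 0 or abs(hedge_value) > abs(client_pos_value):
        return abs(hedge_value + client_pos_value)
    return 0.0

def portfolio_penalty(portfolio_value, overhedge_total):
    # Eq. (38): exponential in the portfolio value plus the weighted over-hedge.
    x = (abs(portfolio_value) + PHI * abs(overhedge_total) * S0) / MAX_POS_LIMIT
    return GAMMA_PEN * S0 * (math.exp(x) - 1.0) * MAX_POS_LIMIT

print(portfolio_penalty(250.0, overhedge(-80.0, 50.0)))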
Architecture layout:

• action space: agent action is a pair of {HedgeSize_{S_1}, HedgeSize_{S_2}} rational numbers within a square box bounded by HedgeSize ∈ [−MaxHedgeSize, MaxHedgeSize]
• observation space: the S_1 hedger position value S^1_mid Position^{S_1}_Hedge, the S_2 hedger position value S^2_mid Position^{S_2}_Hedge, and the net
Portfolio_value(t) (see eq. 35)
• network layers: MLP of Linear (input: 3) x ReLU (8000) x ReLU (8000) x Linear (output: 2) layers

Given the initial correlation ρ = 0, the agent is seen to learn to offset the blended incoming client size flow with the hedge positions. The agent seems to be able to deduce the implied weighting of assets within the blended client trade flow; however, more work needs to be done on improving stability in later epochs. Sample dashboards for the portfolio learning agents can be found in the Portfolio hedging dashboards appendix.

Listing 5: Training the portfolio hedge model

cd stable-baselines3-autohedger-portfolio/autohedger-portfolio
python3 autohedger_ml_portfolio.py

A slower, more stable version:

Listing 6: Training the portfolio hedge model - slow

cd stable-baselines3-autohedger-portfolio/autohedger-portfolio
python3 autohedger_ml_portfolio_slow.py

7 Conclusion

Techniques explored in this paper could be used as a baseline for setting up a position management framework within an automated market-making system. More important, however, is that they could be used as a way of blending alpha-generation and position management into one reinforcement learning system. Let's assume that we have a signal-generating unit that is listening to the real-life price process as well as other information sources like the order book, news feeds etc. The signal it generates could then be plugged into our reinforcement learning agent. If (as a result of the learning process) our agent finds this signal to be beneficial to the reward function, it will automatically learn to adjust its hedging decisions in line with it. Bridging signal-generation capacity with reinforcement-learning inventory management poses an interesting area of research.

Automating portfolio management decisions by letting the agent infer correlations and weights while learning also proves to be a challenging area to be explored. In a more realistic setting for the portfolio management task we could define a risk-weighted reward function (e.g. based on the Sharpe ratio, eq. 30) and let the agent pick from tens or hundreds of assets, which would translate into the same dimensionality for the observation and action spaces and would increase the complexity of the stochastic policy search problem.

Even in the current set-up, more work needs to be done to increase learning stability and to avoid intermittent oscillations of the HedgeSize action between −MaxHedgeSize and +MaxHedgeSize between consecutive steps in later epochs. This is most likely related to the diminished scale of rewards in later epochs compared to the large scale of penalties incurred at the start of training. Methods like normalization of rewards and entropy regularization could be used for this purpose, which would boil down to fine-tuning the entropy maximization parametrization within SAC.
Project code is available under https://github.com/bakalex/autohedger
Appendix

Modeling was performed on an Amazon EC2 g4dn.2xlarge machine with the use of the Deep Learning AMI (Ubuntu 16.04) Version 30.0 (ami-02379288a3b4cbe7b). PyTorch 1.5 with CUDA 10.1 is activated with:

Listing 7: Setting up the PyTorch environment on the EC2 AMI

source activate pytorch_latest_p36

g4dn.2xlarge allowed for research on MLPs as large as 10000 x 10000 x 8000 (x default batch size), but typically nets of no more than 8000 x 8000 units were used.

The OpenAI baseline is installed with:

Listing 8: Installing the OpenAI baseline

cd stable-baselines3
pip install .

For the convenience of reproducing the results presented in this paper, a version of the OpenAI baseline is supplied alongside the project code. It is the researcher's choice to use it or to try a more recent version.

The soft actor-critic algorithm is listed in OpenAI Spinning Up; a more succinct version is available in [6].

Figure 9: Soft actor-critic RL algorithm

Portfolio hedging dashboards
Figure 10: Learning to hedge portfolios, w = 0.5

Figure 11: Learning to hedge portfolios, w = 0.0

References