Hedging of Financial Derivative Contracts via Monte Carlo Tree Search
Oleg Szehr
Dalle Molle Institute for Artificial Intelligence (IDSIA) - SUPSI/USI, Manno, Switzerland
E-mail address: [email protected]
Abstract.
The construction of approximate replication strategies for derivative contracts in incomplete markets is a key problem of financial engineering. Recently Reinforcement Learning algorithms for pricing and hedging under realistic market conditions have attracted significant interest. While financial research mostly focused on variations of Q-learning, in Artificial Intelligence Monte Carlo Tree Search is the recognized state-of-the-art method for various planning problems, such as the games of Hex, Chess, Go, ... This article introduces Monte Carlo Tree Search as a method to solve the stochastic optimal control problem underlying the pricing and hedging of financial derivatives. As compared to Q-learning it combines reinforcement learning with tree search techniques. As a consequence Monte Carlo Tree Search has higher sample efficiency, is less prone to over-fitting to specific market models and generally learns stronger policies faster. In our experiments we find that Monte Carlo Tree Search, being the world-champion in games like Chess and Go, is easily capable of directly maximizing the utility of the investor's terminal wealth without an intermediate mathematical theory.

1. Introduction
Monte Carlo Tree Search (MCTS) is a method for approximating optimal decisions in multi-period optimization tasks by taking random samples of actions and constructing a search tree according to the results. In Artificial Intelligence MCTS is the state-of-the-art and most well-researched technique for solving sequential decision problems in domains that can be represented by decision trees. Important applications include the search for optimal actions in discrete planning problems and games. This article introduces the application of MCTS to an important planning problem in Finance: the pricing and hedging of financial derivative contracts under realistic market conditions.

The modern derivatives pricing theory came to light with two articles by Black and Scholes [9] and Merton [34] on the valuation of option contracts. The articles introduce what is nowadays known as the Black-Scholes-Merton (BSM) economy, a minimal model of a financial market comprised of a risk-free asset (commonly called a 'bond' or 'cash'), a risky asset (the 'stock') and an asset whose price is derived (thus a derivative) from the stock. The risk-free asset exhibits a constant rate of interest; the risky asset promises higher returns but bears market fluctuations (following a continuous-time Geometric Brownian motion (GBM) process); the derivative contract, a so-called European option, gives its holder the right to purchase shares of stock at a fixed price and date in the future. BSM show that in their economy there exists a self-financing, continuous-time trading strategy
in the risk-free asset and the stock that exactly replicates the value (i.e. the price) of the option contract over the investment horizon. The replicating portfolio can be used to off-set (i.e. to 'hedge') the risk involved in the option contract. The absence of arbitrage then dictates that the option price must be equal to the cost of setting up the replicating/hedging portfolio. In modern terms the replication and pricing problems are often phrased as stochastic multi-period utility optimization problems [21]: in order to price or hedge a derivative contract an economic agent consecutively purchases/sells shares of stock so as to maximize a given utility function of her/his terminal holdings.

In reality, continuous-time trading is of course impossible. It cannot even serve as an approximation due to the high resulting transaction costs. In fact transaction costs can make it cheaper to hold a portfolio of greater value than the derivative (i.e. to super-hedge) [7]. In other words, the transaction costs create market incompleteness: the derivative contract cannot be hedged exactly and, as a consequence, its issuer incurs risk. Further reasons for market incompleteness include the presence of mixed jump-diffusion price processes [1] and stochastic volatility [26]. Since hedging under market incompleteness involves risky positions it follows that the price depends on the risk preference of the investor. Hence, in reality, the hedging problem is 'subjective': it depends on the investor's utility.

Various models have been proposed to address pricing and hedging in incomplete markets. One prominent line that was initiated by Föllmer and Sondermann [22], see also [10, 41, 42], investigates the minimization of a quadratic adjustment cost. However, taking squares weights positive and negative net deviations equally, leading to arbitrage opportunities. The concept of the reservation price of a derivative [25, 19, 17, 3, 53] takes into account both the investor's risk appetite and the financial market structure. In brief, the reservation buy/sell prices are defined to be such that the buyer/seller remains indifferent, in terms of expected utility, between entering into a derivative contract and trading in a market without the derivative. The concept appeared in [25], where it was also recognized that (for a particular structure of transaction costs and utility function) the hedging problem can be embedded within a dynamic programming framework. Nowadays dynamic programming-based algorithms constitute a standard tool for pricing and hedging, see e.g. [6, 27, 21]. Despite this progress, effective algorithms for pricing in realistic markets are still needed. In fact, a common criticism of the concept of reservation price lies in the complexity of the computation.

In recent years, reinforcement learning (RL) has gained wide public attention as trained RL agents began to outperform humans in playing Atari [36] and board games such as Hex [4], Go [43, 44] and Chess [45]. This advancement was also noticed by researchers of the financial derivatives area, where several publications reported on promising hedging performance of trained RL agents, see e.g. [25, 32, 23, 15, 13, 14, 8, 51]. In terms of the choice of training algorithm, the research focus has lain on variations of Deep Q-Learning (DQN) [37], which combines Q-learning [52] with a deep neural network for policy representation.
The article [23] proposes Q-Learning as a tool for training hedgers with a quadratic utility functional but with no transaction costs. The articles [13, 14] are concerned with hedging under coherent risk measures, which have been identified as the right class of measures of risk in the financial mathematics literature. Recent work [15] applies double Q-learning [49] to take account of stochastic volatility. The article [32] assumes that derivative prices are known and studies a DQN approach for hedging in the presence of transaction costs, focusing on mean-variance equivalent loss distributions. Article [51] studies risk-averse policy search for hedging when price information is available, following the mean-volatility approach of [8].

The article at hand introduces the application of MCTS to solve the planning problems of pricing and hedging of financial derivative contracts (in a setting where no a priori pricing information is given). The proposed methodology follows the established dynamic programming formulation of pricing and hedging tasks, see e.g. [25, 6, 27, 21], which equally applies in complete and incomplete markets. Today MCTS is the strongest and most well-studied algorithm for adversarial games with complete information. Despite the focus on games in artificial intelligence research, MCTS applies to Markov decision processes whenever the planning problem can be modeled in terms of a decision tree. This is apparent already in the early roots of MCTS [28, 31], where Monte Carlo search has been investigated as a method to solve stochastic optimal control problems. Over the years MCTS has been applied to various planning problems, including single-player games (puzzles) [40] and the travelling salesman problem [39]. The conceptually closest applications of MCTS to our hedging problem are the planning problems for energy markets in [2] and the stochastic shortest path problems commonly summarized within the sailing domain [50] paradigm. As compared to un-directed Monte Carlo methods (including DQN), MCTS is characterized by problem-specific and heavily restricted search. As a consequence MCTS learns stronger policies faster. It is also less susceptible to the typical instability of RL methods when used in conjunction with deep neural networks [48]. See [31] for a comparison of MCTS versus plain RL agents in the sailing domain or [4] in the game of Hex. Section 3.3 in the main body of this article contains an introductory description of MCTS and the details of our architecture.

Stability and sample efficiency are key when it comes to the application of RL-based algorithms for pricing/hedging in practice. The relative sample inefficiency of RL (as compared to supervised learning) leaves the practitioner little choice but to train on simulated data [32, 15]. This raises the concern that an agent trained in a simulated environment generally learns to acquire rewards in this environment and not necessarily in the real financial markets. Hence, the real-world run-time performance might well be determined by the choice of ad hoc market models for training. MCTS combines RL with search, which leads to a significantly stronger out-of-sample performance and less over-fitting to a specific market model.

Finally let us remark that the application of tree-search techniques to hedging is not a surprise. The binomial [18] and trinomial [11] tree models are popular discrete approximations for the valuation of option contracts, with and without stochastic volatility.
Binomial trees are a standard numerical tool for the computation of reservation prices [25]. Applying MCTS to hedging amounts to a combination of the established tree-based pricing techniques together with RL. The main idea is to reduce the size of the planning tree by focusing on relevant purchases and sells as well as relevant market moves. In our applications we will only address the discrete time, discrete market, discrete actions setting. For completeness we mention that MCTS has been studied extensively also for stochastic optimal control problems with continuous action and state spaces [2, 29].

Figure 1.1. Illustration of partial planning trees for the valuation of a European call option in trinomial and MCTS models. (a) State transitions in the trinomial market. (b) Asymmetric planning in MCTS. The initial stock price and the strike, the planning horizon (in days and discrete time steps) and the volatility of the underlying stock model are fixed; no interest and dividends are paid. Stock prices are shown in the first row at each node, option prices in the second. The trees depict the evolution of the market (the agent's actions are not shown). The MCTS planning tree is heavily restricted and asymmetric, leading to a more efficient and realistic search. Tree construction is guided by a dedicated tree policy. Leaf nodes are valued by Monte Carlo rollout (wavy line).

To illustrate our idea, Figure 1.1 shows a comparison between a trinomial market model and an MCTS market model for the pricing of a European call option. The trinomial model is characterized by a discrete stock market, where at each time step a transition up, middle or down leads from $S_t$ to $S_{t+1} \in \{uS_t, mS_t, dS_t\}$. MCTS builds a restricted and asymmetric decision tree, see Section 4.1 for details.

2. The Hedging Problem
In this section we describe, by way of example, the pricing and hedging of a short Euro vanilla call option, but the discussion extends mutatis mutandis to other derivatives. In general, pricing and hedging of derivative contracts can be modeled in terms of sequential stochastic optimal control problems, see [21] for an introduction. In this formulation an investor, who holds the derivative contract, receives price information about the underlying stock at given time steps. Once the new information is revealed the investor purchases or sells shares so as to maximize the expected utility
$$\max \mathbb{E}[u(w_T)]$$
of wealth $w_T$ over the investment horizon $T$. Usually the utility function $u$ is assumed to be smooth, increasing and concave. The terminal wealth $w_T$ is the result of an initial endowment $w_0$ and the sequence of investment decisions at times $t$ that result in wealth increments $\Delta w_{t+1}$. Thus $w_T = w_0 + \sum_t \Delta w_{t+1}$.

2.1. The complete market.
Let us consider, as in the BSM model, an economy that is comprised of a stock, a bank account with cash and a call option. Let $S_t$ denote the price of the underlying stock at time $t$, which is assumed to follow a Markov process. A Euro vanilla call option contract is characterized by the right to buy at maturity $T$ shares of the underlying at price $K$. Its terminal value is compactly written as
$$C_T = C_T(S_T) = (S_T - K)^+.$$
The issuer of the option contract sets up a replicating portfolio of the form
$$\Pi_t = n_t S_t + B_t$$
to offset the risk incurred by selling the option, where $n_t$ is the number of shares held by the agent and $B_t$ is the bank account. The composition of the replication portfolio is computed by back-propagation beginning with $\Pi_T = C_T$ and imposing that the trading strategy is self-financing. The latter asks that shares bought at time $t$ are equally billed to the bank account. In the absence of transaction costs this implies
$$\Pi_{t+1} - \Pi_t = n_{t+1} S_{t+1} + B_{t+1} - n_t S_t - B_t = n_{t+1} \Delta S_{t+1}, \qquad (2.1)$$
where we use the notation $\Delta X_{t+1} = X_{t+1} - X_t$. Consequently,
$$\Pi_T = C_0 + \sum_{t=0}^{T-1} n_{t+1} \Delta S_{t+1}$$
and maximizing terminal utility amounts to
$$\max_{C_0, n_1, \dots, n_T} \mathbb{E}[u(\Pi_T - C_T)].$$
The self-financing constraint implies in particular that
$$\mathbb{E}[\Delta \Pi_{t+1} - n_{t+1} \Delta S_{t+1} \,|\, \mathcal{F}_t] = 0. \qquad (2.2)$$
The expression $\mathbb{E}[\,\cdot\,|\,\mathcal{F}_t]$ denotes a conditional expectation with respect to the (natural) filtration $\mathcal{F}_t$ (of $S_t$). Conditioning on $\mathcal{F}_t$ means that all market information at time $t$, including $S_t$, $\Pi_t$, ..., is realized and available for decision making. We give a more detailed discussion of $\mathcal{F}_t$ in terms of a decision tree in Section 4.2. In a complete market the replication portfolio would mimic the option price exactly, $C_t = \Pi_t$, such that
$$C_t = \mathbb{E}[\Pi_{t+1} - n_{t+1} \Delta S_{t+1} \,|\, \mathcal{F}_t]. \qquad (2.3)$$
In case of incomplete markets there is no exact replication portfolio and the price will generally depend on the choice of the utility function. Thus the open question is which utility should be used, where various choices are reasonable. We refer the reader to [21] for a detailed discussion of utility optimization in the context of dynamic asset allocation. Two common examples are:
Example 1. Black-Scholes-Merton (BSM) least variance hedges [35]:
It is common to assume that utility is additive with respect to individual wealth increments,
$$\mathbb{E}[u(w_T)] = u(w_0) + \sum_{t=0}^{T-1} \mathbb{E}[u_{t+1}(\Delta w_{t+1})].$$
(For simpler notation we assume that there is no interest on cash.) In the BSM economy the market is comprised of the three assets $S_t$, $B_t$, $C_t$ and the individual wealth increments are $\Delta w_{t+1} = n_{t+1} \Delta S_{t+1} - \Delta C_{t+1}$. Accordingly, the individual utility increments are given by the variances
$$\mathbb{E}[u_{t+1}(\Delta w_{t+1}) \,|\, \mathcal{F}_t] = -\mathbb{V}[\Delta C_{t+1} - n_{t+1} \Delta S_{t+1} \,|\, \mathcal{F}_t]. \qquad (2.4)$$
Notice that (2.4) is $0$ iff (2.3) holds, i.e. the option contract is redundant. Assuming $S_t$ follows a geometric Brownian motion, the well-known solution is the Black-Scholes $\delta$-hedge portfolio. In the limit of small time steps and in the absence of transaction costs the best hedges are given by $n_{t+1} \to \delta_t = \partial C_t / \partial S_t$. In this case the global minimum of the utility is reached: in following the $\delta$-hedge strategy the agent does not take any risk, and any deviation from this strategy results in a risky position.
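To make Example 1 concrete, the following minimal sketch simulates a discretely re-balanced Black-Scholes $\delta$-hedge of a short call under GBM with zero rates and no transaction costs. The parameter values (in particular $\sigma = 0.2$) and the function names are illustrative assumptions rather than the calibration used in our experiments; the standard deviation of the terminal P&L shrinks as the re-balancing frequency grows, in line with the continuous-time limit described above.

```python
import numpy as np
from scipy.stats import norm

def bs_call_price_delta(S, K, sigma, tau):
    """Black-Scholes price and delta of a European call (zero rates/dividends)."""
    tau = np.maximum(tau, 1e-12)
    d1 = (np.log(S / K) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return S * norm.cdf(d1) - K * norm.cdf(d2), norm.cdf(d1)

def delta_hedge_pnl(S0=90.0, K=90.0, sigma=0.2, T=60 / 365, steps=60,
                    n_paths=10_000, seed=0):
    """Terminal P&L of a discretely re-balanced delta hedge of a short call under GBM."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    S = np.full(n_paths, S0)
    premium, _ = bs_call_price_delta(S0, K, sigma, T)
    cash = np.full(n_paths, premium)      # premium received for the short call
    shares = np.zeros(n_paths)            # current stock holdings n_t
    for t in range(steps):
        _, target = bs_call_price_delta(S, K, sigma, T - t * dt)
        cash -= (target - shares) * S     # self-financing: trades billed to the bank account
        shares = target
        S = S * np.exp(-0.5 * sigma**2 * dt
                       + sigma * np.sqrt(dt) * rng.standard_normal(n_paths))
    # liquidate the hedge and pay the option payoff
    return cash + shares * S - np.maximum(S - K, 0.0)

pnl = delta_hedge_pnl()
print(f"mean P&L {pnl.mean():.4f}, std {pnl.std():.4f}")  # std shrinks as `steps` grows
```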
Example 2. Terminal variance hedging [42]:
The variance of the terminal wealth is minimized, i.e. the agent seeks to solve
$$\min_{C_0, n_1, \dots, n_T} \mathbb{V}[C_T - \Pi_T].$$
Following the dynamic programming approach to stochastic optimization the composition of the replication portfolio can be computed recursively backwards through time. The (optimal) value function $V^*$ is defined as
$$V^*(t, S_t, \Pi_t) := -\min_{n_{t+1}, \dots, n_T} \mathbb{V}[C_T - \Pi_T \,|\, \mathcal{F}_t].$$
One time step before maturity the agent is faced with the minimization
$$-\min_{n_T} \mathbb{V}[C_T - \Pi_{T-1} - n_T \Delta S_T \,|\, \mathcal{F}_{T-1}] = V^*(T-1, S_{T-1}, \Pi_{T-1}).$$
Introducing a conditional expectation with respect to the information available at $T-2$, the agent chooses $n_{T-1}$ such that
$$\max_{n_{T-1}} \mathbb{E}[V^*(T-1, S_{T-1}, \Pi_{T-1}) \,|\, \mathcal{F}_{T-2}] = V^*(T-2, S_{T-2}, \Pi_{T-2}).$$
More generally, no matter what the previous decisions have been, at time $t$ the optimal number of shares is obtained from
$$\max_{n_{t+1}} \mathbb{E}[V^*(t+1, S_{t+1}, \Pi_{t+1}) \,|\, \mathcal{F}_t] = V^*(t, S_t, \Pi_t).$$
As will be discussed later, compare (3.2), this is a form of the Bellman equation. At the last stage the option price $C_0$ and the number of shares $n_1$ can be obtained from
$$\max_{C_0, n_1} \mathbb{E}[V^*(1, S_1, \Pi_1)].$$
If the agent is given the transition probabilities $P(S_T \,|\, n_T, S_{T-1})$ the optimal action $n_T$ can be computed directly. In many financial models (such as the BSM economy) a distribution is assumed for $P$.
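For intuition, note that the innermost minimization one step before maturity is an ordinary least-squares problem in $n_T$ (since $\Pi_{T-1}$ is $\mathcal{F}_{T-1}$-measurable); its standard closed-form solution is recorded here as a worked step, which is not spelled out in this form in the surrounding text:

```latex
% Last-step variance minimization: n_T enters quadratically, so the minimizer is the
% conditional regression coefficient of C_T on \Delta S_T.
\[
  n_T^{*} \;=\; \frac{\operatorname{Cov}\!\left[\,C_T,\,\Delta S_T \mid \mathcal{F}_{T-1}\right]}
                     {\mathbb{V}\!\left[\,\Delta S_T \mid \mathcal{F}_{T-1}\right]},
  \qquad
  V^{*}(T\!-\!1, S_{T-1}, \Pi_{T-1}) \;=\;
  -\,\mathbb{V}\!\left[\,C_T \mid \mathcal{F}_{T-1}\right]
  \;+\; \frac{\operatorname{Cov}\!\left[\,C_T,\,\Delta S_T \mid \mathcal{F}_{T-1}\right]^{2}}
             {\mathbb{V}\!\left[\,\Delta S_T \mid \mathcal{F}_{T-1}\right]}.
\]
```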
2.2. Incomplete markets and reservation price. In the presence of transaction costs equation (2.1) must be modified, as each trading activity is accompanied by a loss of $\text{transactionCosts} < 0$. In other words,
$$\Pi_{t+1} - \Pi_t = n_{t+1} \Delta S_{t+1} + \text{transactionCosts}. \qquad (2.5)$$
An additional liquidation fee might apply at maturity, leading to a terminal payoff of $C_T - \Pi_T - \text{liquidationCosts}$. The dynamic programming approach outlined in Example 2 remains valid if proper account is taken of the payoff structure and the new self-financing requirement (2.5). Notice that the structure of transaction costs might require additional state variables. If, for instance, transaction costs depend explicitly on the current holdings $n_t$ then so does the value function $V(n_t, S_t, \Pi_t)$. In a similar vein, stochastic volatility models might require additional state variables to keep track of holdings in stock and cash.

The main idea behind the concept of reservation price [25, 19, 17, 3, 53] is that buy/sell prices of derivative contracts should be such that the buyer/seller remains indifferent, in terms of expected utility, with respect to the two situations:

(1) buying/selling a given number of the derivative contracts and hedging the resulting risk by a portfolio of existing assets, versus
(2) leaving the wealth optimally invested within existing assets (and not entering new contracts).

Suppose an investor buys $\theta > 0$ options at time $t$ but is not allowed to trade the options thereafter. The buyer's problem is to choose a trading strategy (policy, see below) subject to (2.5) that maximizes the expected utility of the resulting portfolio,
$$V^*_{\mathrm{buy}}(t, n_t, S_t, \Pi_t) = \max \mathbb{E}[u(\Pi_T + \theta C_T) \,|\, \mathcal{F}_t],$$
where the maximum is taken over the choices of trading strategy. In the absence of options, $\theta = 0$, the investor simply optimizes
$$\tilde{V}^*(t, n_t, S_t, \Pi_t) = \max \mathbb{E}[u(\Pi_T) \,|\, \mathcal{F}_t].$$
The reservation buy price of $\theta$ European call options is defined as the price $P^b_\theta$ such that
$$\tilde{V}^*(t, n_t, S_t, \Pi_t) = V^*_{\mathrm{buy}}(t, n_t, S_t, \Pi_t - \theta P^b_\theta).$$
The sell price is defined by analogy. It can be shown that in the absence of market friction the two reservation prices converge to the Black-Scholes price. The computation of the reservation price is simplified if the investor's utility function exhibits constant absolute risk aversion (a so-called CARA investor). For the CARA investor the option price and hedging strategy are independent of the investor's total wealth. In case of proportional transaction costs the problem can be solved explicitly.
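As a minimal illustration of the modified self-financing requirement (2.5), the following sketch performs one portfolio update under the constant proportional cost model of Example 3 below. The cost rate and the function name are assumptions made here for illustration; as noted above, the accumulated costs can be tracked as an additional state variable.

```python
def portfolio_step(portfolio_value, shares_prev, shares_new, S_t, S_next, beta=0.01):
    """One step of the self-financing update (2.5) with proportional transaction costs.

    portfolio_value : Pi_t before the re-balancing decision
    shares_prev     : n_t, holdings carried into the step
    shares_new      : n_{t+1}, holdings chosen at time t
    S_t, S_next     : stock price now and at the next time step
    beta            : proportional cost rate (assumed value)
    """
    transaction_costs = -beta * abs(shares_new - shares_prev) * S_t   # always <= 0
    return portfolio_value + shares_new * (S_next - S_t) + transaction_costs
```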
Example 3. Hodges-Neuberger (HN) exponential utility hedges [25, 19]:
Assuming a constant proportional cost model,
$$\text{transactionCosts}(\Delta n_{t+1}, S_t) = -\beta\, |\Delta n_{t+1}|\, S_t,$$
and exponential utility of terminal wealth (CARA),
$$u(w_T) = -\exp(-\lambda w_T),$$
results in trading strategies that are wealth-independent and that do not create risky positions when no derivative contract is traded. The portfolio allocation space is divided into three regions, which can be described as the Buy region, the Sell region, and the no-transaction region. If a portfolio lies in the Buy region, the optimal strategy is to buy the risky asset until the portfolio reaches the boundary between the Buy region and the no-transaction region. Similarly, if a portfolio lies in the Sell region, the optimal strategy is to sell the risky asset until the portfolio reaches the boundary between the Sell region and the no-transaction region. If a portfolio is located in the no-transaction region then it is not adjusted at the respective time step.
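The region structure described above can be encoded as a projection onto the no-transaction band, as in the sketch below. The band boundaries are assumed to be given (computing them requires solving the HN problem itself), and the function name and its return value, the trade $\Delta n_{t+1}$, are our own conventions.

```python
def hn_band_policy(current_shares, lower_boundary, upper_boundary):
    """Project holdings onto the no-transaction band [lower_boundary, upper_boundary].

    Below the band (Buy region) we buy up to the lower boundary; above the band
    (Sell region) we sell down to the upper boundary; inside the band no trade is made.
    Returns the trade Delta n_{t+1}.
    """
    if current_shares < lower_boundary:      # Buy region
        return lower_boundary - current_shares
    if current_shares > upper_boundary:      # Sell region
        return upper_boundary - current_shares
    return 0.0                               # no-transaction region
```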
3. Monte Carlo Planning

3.1. Markov Decision Processes.
Markov decision processes (MDPs) provide a mathematical framework for modeling sequential decision problems in uncertain dynamic environments [38]. An MDP is a four-tuple $(S, A, P, R)$, where $S = \{s_1, \dots, s_n\}$ is a set of states and $A = \{a_1, \dots, a_m\}$ is a set of actions. (We assume that $S$ and $A$ are finite in what follows.) The transition probability function $P: S \times A \times S \to [0,1]$ associates the probability of entering the next state $s'$ to the triple $(s', a, s)$. The reward obtained from action $a$ in state $s$ is $R(s, a)$. The policy $\pi: S \times A \to [0,1]$ is the conditional probability of choosing action $a$ in state $s$; it is a rule that specifies which action a decision-maker takes in each state. The value function of policy $\pi$ associates to each state $s$ the expected total reward when starting at state $s$ and following policy $\pi$,
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^t R(s_t, s_{t+1}) \,\Big|\, s_0 = s\right]. \qquad (3.1)$$
The role of the discount factor $\gamma \in [0,1]$ is two-fold. First, in the case of infinite time horizons $\gamma < 1$ is chosen to discount the value of future rewards and to enforce convergence of the above series. (We assume $T < \infty$ in what follows.) Second, $\gamma$ describes the mutual inter-dependency of rewards, i.e. it measures how relevant in terms of total reward the immediate consequences of an action are versus its long-term consequences. The problem in MDPs is to find an optimal policy that maximizes the expected total reward. The optimal value function $V^*(s) = V^{\pi^*}(s) = \sup_\pi V^\pi(s)$ satisfies the Bellman equation
$$V^*_t(s) = \max_a \left\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \,|\, s, a)\, V^*_{t-1}(s') \right\}. \qquad (3.2)$$

3.2. The $k$-armed bandit problem. The $k$-armed bandit is the prototypical instance of reinforcement learning, reflecting the simplest MDP setting [47]. (The name originates from a gambler who chooses from $k$ slot machines.) Formally the setup involves a set of $k$ possible actions, called arms, and a sequence of $T$ periods. In each period $t = 1, \dots, T$ the agent chooses an arm $a \in \{a_1, \dots, a_k\}$ and receives a random reward $X_{i,t}$, $1 \le i \le k$. The main assumption is that the random variables $X_{i,t}$ are independent with respect to $i$ and i.i.d. with respect to $t$. As compared to general MDPs the $k$-armed bandit addresses a simplified scenario, which is characterized by the assumption that there is no mutual inter-dependency of rewards. In other words, whatever the current action is, it will influence the immediate reward but not the subsequent rewards. The agent receives rewards upon taking actions but has no further information about the $X_{i,t}$. This leads to the famous exploitation-exploration dilemma: should one choose an action that has been lucrative so far or try other actions in the hope of finding something even better. For large classes of reward distributions, there is no policy whose regret after $n$ rounds grows slower than $O(\log n)$ [33]. The UCB1 policy of [5] has expected regret that achieves the asymptotic growth of the lower bound (assuming the reward distribution has compact support). UCB1 keeps track of the average realized rewards $\bar{X}_i$ and selects the arm that maximizes the confidence bound
$$UCB_i = \bar{X}_i + w\, c_{n, n_i} \quad \text{with} \quad c_{n,l} = \sqrt{\frac{2\ln(n)}{l}},$$
where $n_i$ is the number of times arm $i$ has been played so far. The average reward $\bar{X}_i$ emphasizes exploitation of the currently best action, while the term $c_{n, n_i}$ encourages the exploration of high-variance actions. Their relative weight is determined by a problem-specific hyper-parameter $w$.

In the contextual $k$-armed bandit setting, the agent is faced with non-stationary reward distributions. In each round the agent receives context about the state of the $k$-armed bandit. Contextual search tasks are an intermediate between the $k$-armed bandit problem and the full MDP [47, 46]. They are like the full MDP problem in that they involve learning a policy $\pi(a \,|\, s)$, but like the ordinary $k$-armed bandit problem in that each action affects only the immediate reward.
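A minimal UCB1 sketch following the formula above; the constant $2$ inside the square root follows the standard UCB1 of [5], with the problem-specific weight $w$ kept as a separate hyper-parameter, and the toy usage at the end is purely illustrative.

```python
import math
import random

class UCB1:
    """UCB1 arm selection with exploration weight w (cf. the confidence bound above)."""

    def __init__(self, n_arms, w=1.0):
        self.w = w
        self.counts = [0] * n_arms          # n_i: how often each arm was played
        self.means = [0.0] * n_arms         # running average reward \bar X_i

    def select(self):
        for i, c in enumerate(self.counts): # play every arm once before comparing bounds
            if c == 0:
                return i
        n = sum(self.counts)
        ucb = [m + self.w * math.sqrt(2.0 * math.log(n) / c)
               for m, c in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# toy usage: three Bernoulli arms with different success probabilities
bandit, probs = UCB1(3, w=1.0), [0.2, 0.5, 0.8]
for _ in range(2000):
    a = bandit.select()
    bandit.update(a, 1.0 if random.random() < probs[a] else 0.0)
print(bandit.counts)   # most pulls should go to the last arm
```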
3.3. Algorithms for Markov Decision Processes. For complex MDPs, the computation of optimal policies is usually intractable. Several approaches have been developed to compute near-optimal policies by means of function approximation and simulation.

The value iteration algorithm calculates an approximation to the optimal policy by repeatedly performing Bellman updates over the entire state space:
$$V(s) \leftarrow \max_{a \in A(s)} \left\{ R(s, a) + \gamma \sum_{s' \in S} P(s, a, s')\, V(s') \right\}, \quad \forall s \in S,$$
where $A(s)$ denotes the set of applicable actions in state $s$. Value iteration converges to the optimal value function $V^*$. The drawback of this method is that it performs updates over the entire state space. To focus computations on relevant states, real-time heuristic search methods have been generalized to non-deterministic problems [20, 24]. These algorithms perform a look-ahead search on a subset of states reachable from the current state. However, such look-ahead search is applicable only in situations where the transition probabilities $P(s' \,|\, s, a)$ are known and the number of possible successor states for a state/action pair remains low. If the $P(s' \,|\, s, a)$ are not given, it is usually assumed that an environment simulator is available that generates samples $s'$ given $(s, a)$ according to $P(s' \,|\, s, a)$. (In the context of hedging the market will play the role of the simulator.) This has been proposed for example in [28], where a state $s$ is evaluated by trying every possible action $C$ times and, recursively, from each generated state every possible action $C$ times, too, until the planning horizon $H$ is reached. While theoretical results [28] demonstrate that such search provides near-optimal policies for any MDP, often $H$ and $C$ need to be so large that the computation becomes impractical. The key idea of [16, 31, 30] is to incrementally build a problem-specific, restricted and heavily asymmetric decision tree instead of 'brute force Monte Carlo' with width $C$ and depth $H$. The growth of the decision tree is controlled by a tree policy, whose purpose it is to efficiently trade off the incorporation of new nodes versus the simulation of existing promising lines. In each iteration, more accurate sampling results become available through the growing decision tree. In turn they are used to improve the tree policy. Algorithms of this form are commonly summarized under the name Monte Carlo Tree Search (MCTS) [12]. The MCTS main loop contains the following steps:
(1) Selection: starting at the root node, a tree policy is recursively applied to descend through the tree until the most relevant expandable node is reached.
(2) Expansion: child nodes are added to expand the tree, again according to the tree policy.
(3) Simulation: a simulation is run from the new nodes.
(4) Backpropagation: the simulation result is used to update the tree policy.

In MCTS the tree policy handles the incorporation of new tree nodes (exploration) and the simulation of promising lines (exploitation). Kocsis and Szepesvári [31, 30] propose to follow the UCB1 bandit policy for tree construction. They demonstrate that (in case of deterministic environments) the UCB1 regret bound still holds in the non-stationary case and that, given infinite computational resources, the resulting MCTS algorithm, called UCT, selects the optimal action. In summary UCT is characterized by the following features, which make it a suitable choice for real-world applications such as hedging:

(1) UCT is aheuristic. No domain-specific knowledge (such as an evaluation for leaf nodes) is needed.
(2) UCT is an anytime algorithm. Simulation outcomes are back-propagated immediately, which ensures that tree statistics are up to date after each iteration. This leads to a small error probability if the algorithm is stopped prematurely.
(3) UCT converges to the best action if enough resources are granted.

In the setting of adversarial games the role of the simulator is usually taken by another instance of the same MCTS algorithm. The success of MCTS methods, especially in games such as Hex [4], Go [44] and Chess [45], is largely due to the mutual improvement of policies for tree construction and node evaluation. Various architectures have been proposed for the 'search and reinforce' steps. Our specific architecture is described in Section 4.3.
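A compact UCT sketch implementing the four steps above for a generic generative simulator. The environment interface (`legal_actions`, `step`, `is_terminal`) is an assumption of this sketch (it matches the hedging environment sketched in Section 4.2), and intermediate rewards along the tree path are ignored, which suffices for the episodic reward structure used later for hedging.

```python
import math
import random

class Node:
    """Search tree node holding visit statistics for UCT."""
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value_sum = [], 0, 0.0

    def ucb_child(self, w):
        # Selection rule: child maximizing average value plus exploration bonus.
        return max(self.children,
                   key=lambda c: c.value_sum / c.visits
                   + w * math.sqrt(math.log(self.visits) / c.visits))

def uct_search(env, root_state, n_simulations=1000, w=1.4, gamma=1.0):
    """One UCT decision: selection, expansion, simulation and backpropagation.

    `env` is a generative simulator with methods `legal_actions(state)`,
    `step(state, action) -> (next_state, reward)` and `is_terminal(state)`.
    """
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # (1) Selection: descend while fully expanded and non-terminal.
        while node.children and len(node.children) == len(env.legal_actions(node.state)):
            node = node.ucb_child(w)
        # (2) Expansion: attach one untried child unless the node is terminal.
        if not env.is_terminal(node.state):
            tried = {c.action for c in node.children}
            action = random.choice([a for a in env.legal_actions(node.state)
                                    if a not in tried])
            child_state, _ = env.step(node.state, action)
            child = Node(child_state, parent=node, action=action)
            node.children.append(child)
            node = child
        # (3) Simulation: random rollout from the leaf until the episode ends.
        ret, state, discount = 0.0, node.state, 1.0
        while not env.is_terminal(state):
            state, reward = env.step(state, random.choice(env.legal_actions(state)))
            ret += discount * reward
            discount *= gamma
        # (4) Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += ret
            node = node.parent
    return max(root.children, key=lambda c: c.visits).action  # recommended action
```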
4. Monte Carlo Tree Search for Hedging
4.1. Hedging as a Markov Decision Process.
We consider a discrete-time setting where after each market move the investor adjusts her/his positions in the replicating portfolio. Two formulations of the hedging exercise are common in the literature.

(1) P&L formulation: The agent is aware of the value $V_t$ of the derivative contract to be hedged, treating the price as ground truth. The price might be available through an ad hoc formula (such as the Black-Scholes formula for option pricing) or directly learned from the market. Typically, the agent will set up a hedging portfolio to match the value of the derivative contract at each action. In other words only the immediate performance matters.

(2) Cash Flow formulation: The agent has no price information about the derivative contract. This can be the case when a new (over-the-counter) contract is issued on the market. In the absence of pricing information the agent will set up a hedging portfolio to maximize utility of terminal wealth, see above. In other words only the terminal performance matters, and the perfect hedge has value $V_t$, i.e. the price of the derivative contract, at inception.

The articles [23, 15, 13] study the Cash Flow formulation, while [32, 51] are concerned with the P&L formulation. The reward structure of the P&L formulation is characterized by the availability of price information $V_t$. If the market is complete, at each time step the agent takes the action $a^*_t = n^*_{t+1}$ so as to minimize the current conditional variance,
$$a^*_t = \arg\min_{n_{t+1}} \mathbb{V}[\Delta V_{t+1} - n_{t+1} \Delta S_{t+1} \,|\, \mathcal{F}_t]. \qquad (4.1)$$
In this case the greedy action is optimal, such that the P&L formulation is represented by a contextual $k$-armed bandit. Notice that if state-dependent transaction costs were introduced in this framework, this would alter the prices $C_t$, which leads to the Cash Flow formulation. In the Cash Flow formulation the agent is faced with an episodic reward structure as in Examples 2, 3 of Section 2. Remark that Example 1 can also be phrased in the Cash Flow setting if rewards are accumulated over the training episode and only revealed at maturity. The Cash Flow formulation leads to a full-fledged reinforcement learning problem in which each individual decision is guided by terminal performance. Episodic rewards are common for adversarial games, where a player cares about winning in the long term and no help is available in the course of the game. MCTS, being the state-of-the-art algorithm for game tree search, is naturally suited for learning from episodic rewards. Once the search has terminated, the action is selected that maximizes the reward in the form
$$a^*_t = \arg\max \mathbb{E}[u(\Pi_T - V_T) \,|\, \mathcal{F}_t].$$
(Other choices are possible, e.g. the most visited action might be chosen or a weighted average.)
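In the P&L formulation the greedy rule (4.1) can be approximated by one-step simulation over a discrete action grid, as in the following sketch. The market sampler and the pricing function are placeholders that must be supplied (in this formulation the price is assumed known); the discrete action grid is an illustrative assumption.

```python
import numpy as np

def greedy_pnl_action(S_t, tau, actions, sample_next, price, n_samples=5_000, seed=0):
    """Greedy rule of eq. (4.1): pick the trade minimizing the one-step conditional
    variance of the hedged increment  dV_{t+1} - n_{t+1} * dS_{t+1}.

    `sample_next(S_t, size, rng)` draws next-step prices given S_t (the market model),
    `price(S, tau)` returns the assumed-known derivative value with `tau` steps left.
    Both are placeholders supplied by the user.
    """
    rng = np.random.default_rng(seed)
    S_next = sample_next(S_t, n_samples, rng)
    dS = S_next - S_t
    dV = price(S_next, tau - 1) - price(S_t, tau)
    variances = [np.var(dV - n * dS) for n in actions]
    return actions[int(np.argmin(variances))]
```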
Two types of market models are in principle possible within the MCTS paradigm, which differ in the way market behavior is simulated during search.

(1) Pure planning problem: In this setting the $P(s' \,|\, s, a)$ are assumed a priori as a model of the market. The agent learns to hedge by calling the market as a simulator. This setting corresponds to training the agent on simulated data, where the simulator provides market samples according to $P(s' \,|\, s, a)$.

(2) Adversarial game problem: To focus search on relevant market behavior, the market is modeled by an independent instance of MCTS with a specific market policy, see e.g. Figure 1.1. This generalizes the scenario of the pure planning problem and resembles asymmetric games. The market policy can be either fixed or subject to an RL training process.
In our implementation we have focused only on setting (1), leaving (2) for further research. Figure 1.1 illustrates market actions in the setting (2).

4.2. Building of the decision tree.
The planning tree consists of an alternating sequence of planned decisions and market moves. Depending on the underlying market model, the market either generates transitions from state $s$ to $s'$ with probability $P(s' \,|\, s, a)$ or a dedicated market policy guides the market behavior. (In standard microscopic market models in terms of Itô diffusion processes, the agent's action does not influence the behavior of the market, $P(s' \,|\, s, a) = P(s' \,|\, s)$.) Here we focus on the pure planning problem perspective, relying on a standard market model from the theory of stochastic processes. Let $\{0, 1, \dots, T\}$ be given dates and suppose that trading occurs at $\{0, 1, \dots, T-1\}$. The set $\Omega$ of all possible market states (over the planning horizon) is assumed to carry a filtration $\mathcal{F}_t$, which models the unfolding of available information. $\mathcal{F}_t$ can be interpreted as a sequence of partitions of $\Omega$. (Formally $\mathcal{F}_t$ is a sequence of $\sigma$-algebras, and there exists a bijection that maps each $\sigma$-algebra to a partition of $S$.) When trading begins no market randomness has been realized yet, $\mathcal{F}_0 = \{\Omega\}$. At $T$ all market variables are realized, which corresponds to the partition of $\Omega$ into individual elements. At time $t$ the elements $P^{(t)} \in \mathcal{F}_t = \{P_1, \dots, P_K\}$ reflect the realized state of the market. As the amount of available information increases, the partitions of $\Omega$ become finer. The market evolution is represented by the nodes $(t, P^{(t)})$.

In our experiments we implemented a trinomial market model. The advantage of the trinomial model over the more standard binomial model is that drifts and volatilities of stocks can depend on the value of the underlying. There is no dynamic trading strategy in the underlying security that exactly replicates the derivative contract payoff, resulting in risky positions for the hedger. Figure 4.1 depicts the planning tree comprised of the trinomial market model and MCTS-guided actions. Notice that a planning tree structure comprised of agent and (random) market components has been proposed in [2] for planning problems of energy markets, where MCTS is also employed in the continuous market setting.
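The following sketch encodes the pure planning setting of Section 4.1 on the trinomial market described here: the market acts as the simulator, the agent's action re-balances the holdings, and the episodic reward is revealed at maturity. The up/down factors, the discrete trade sizes, the cost rate and the reward gauging are illustrative assumptions rather than the calibration of Section 5; the up-probability is chosen so that $S_t$ is a martingale, and the class matches the environment interface assumed by the UCT sketch of Section 3.3.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class HedgeState:
    t: int          # current time step
    S: float        # stock price
    shares: float   # holdings n_t
    cash: float     # bank account B_t

class TrinomialHedgingEnv:
    """Pure planning setting: the trinomial market acts as the simulator,
    the reward is episodic and revealed only at maturity."""

    def __init__(self, S0=90.0, K=90.0, steps=60, u=1.02, d=1 / 1.02,
                 p_down=0.25, beta=0.0, trade_sizes=(-5, 0, 5)):
        self.S0, self.K, self.steps = S0, K, steps
        self.u, self.d, self.p_down = u, d, p_down
        # choose the up-probability so that S_t is a martingale
        self.p_up = p_down * (1.0 - d) / (u - 1.0)
        self.beta, self.trade_sizes = beta, trade_sizes   # proportional cost, discrete actions

    def initial_state(self):
        return HedgeState(t=0, S=self.S0, shares=0.0, cash=0.0)

    def legal_actions(self, state):
        return self.trade_sizes

    def is_terminal(self, state):
        return state.t >= self.steps

    def step(self, state, trade):
        """The agent re-balances and pays costs, then the market moves up/middle/down."""
        cash = state.cash - trade * state.S - self.beta * abs(trade) * state.S
        r = random.random()
        factor = self.u if r < self.p_up else (self.d if r < self.p_up + self.p_down else 1.0)
        nxt = HedgeState(state.t + 1, state.S * factor, state.shares + trade, cash)
        if not self.is_terminal(nxt):
            return nxt, 0.0                                  # no intermediate reward
        payoff = max(nxt.S - self.K, 0.0)                    # short call is exercised against us
        pnl = nxt.cash + nxt.shares * nxt.S - payoff         # liquidate hedge, settle option
        return nxt, max(-1.0, 1.0 - abs(pnl) / self.S0)      # gauged reward, 1 = perfect hedge
```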
4.3. Architecture. Our algorithm design orients itself at state-of-the-art MCTS for adversarial game tree search [4, 44]. In brief, the UCT design is enhanced by making use of deep neural networks for policy and value function representation. Although the architectural details may vary, the main purpose of the involved neural networks is to
(1) guide tree construction and search, and to
(2) provide accurate evaluation of leaf nodes.

Imitation learning is concerned with mimicking an expert policy $\pi_E$ that has been provided ad hoc. In other words, the expert delivers a list of strong actions given states and an apprentice policy $\pi_A$ is trained via supervised learning on this data. The role of the expert is taken by MCTS, while the apprentice is a standard convolutional neural network. The purpose of the expert is to accurately determine good actions. The purpose of the apprentice is to generalize the expert policy across possible states and to provide faster access to the expert policy. The quality of expert and apprentice estimates improve mutually in an iterative process.

Figure 4.1. Illustration of the composed planning tree for a trinomial market model. Black arrows depict the agent's actions. Blue arrows depict market moves. Nodes of the tree are labeled by the elements $(t, P^{(t)})$ of the partition $\mathcal{F}_t$ at time $t$. The currently searched path is highlighted. A Monte Carlo simulation is run to value the current leaf node (wavy line).

Specifically, regarding point (1), the apprentice policy is trained on tree-policy targets, i.e. the average tree policy at the MCTS root reflects expert advice. The loss function is thus
$$\mathrm{Loss}_{TPT} = -\sum_a \frac{n_a}{n} \log \pi_A(a \,|\, s),$$
where $n_a$ is the number of times action $a$ has been played from $s$ and $n$ is the total number of simulations. (Training the apprentice to imitate the optimal MCTS action $a^*$, $\mathrm{Loss}_{Best} = -\log \pi_A(a^* \,|\, s)$, is also possible; however, the empirical results using $\mathrm{Loss}_{TPT}$ are usually better.) In turn the apprentice improves the expert by guiding tree search towards stronger actions. For this the standard UCB1 tree policy is enhanced with an extra term,
$$NUCB_a = UCB_a + w_a \frac{\pi_A(a \,|\, s)}{n_a + 1},$$
where $w_a \sim \sqrt{n}$ is a hyper-parameter that weights the contributions of $UCB$ and the neural network. Concerning point (2), it is well known that using good value networks substantially improves MCTS performance. The value network reduces search depth and avoids inaccurate rollout-based value estimation. As in the case of the policy network, tree search provides the data samples $z$ for the value network, which is trained to minimize
$$\mathrm{Loss}_V = (z - V_A(s))^2.$$
To regularize value prediction and accelerate tree search, (1) and (2) are simultaneously covered by a multitask network with separate outputs for the apprentice policy and value prediction. The loss for this network is simply the sum of $\mathrm{Loss}_V$ and $\mathrm{Loss}_{TPT}$.
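A minimal PyTorch-style sketch of the multitask apprentice network and the combined loss $\mathrm{Loss}_{TPT} + \mathrm{Loss}_V$; for brevity a fully connected body replaces the CNN layers used in our experiments, and all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Multitask apprentice network: shared body with separate policy and value heads."""

    def __init__(self, n_features, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)    # logits of pi_A(a|s)
        self.value_head = nn.Linear(hidden, 1)              # V_A(s)

    def forward(self, x):
        h = self.body(x)
        return F.log_softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)

def mcts_multitask_loss(net, states, visit_counts, returns):
    """Loss_TPT + Loss_V: cross-entropy against the normalized root visit counts n_a / n,
    plus squared error against the sampled return z from tree search."""
    log_pi, value = net(states)
    targets = visit_counts / visit_counts.sum(dim=-1, keepdim=True)
    loss_tpt = -(targets * log_pi).sum(dim=-1).mean()
    loss_v = ((returns - value) ** 2).mean()
    return loss_tpt + loss_v
```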
5. Experiments
Our code for conducting the experiments is available upon request for non-commercial use. Our architecture defines a general-purpose MCTS agent that, in principle, applies to any MDP-planning problem. This includes the valuation and hedging of American-type contracts. The experimental setup is as follows:
• Option contract: Short Euro vanilla call with strike $K = 90$ and maturity $T = 60$. Hedging and pricing are investigated under the reward models of Examples 1, 2, 3 (minimal terminal variance, HN and BSM models).
• Market: Trinomial market model with initial price $S_0 = 90$ and constant stock volatility. No interest or dividends are paid. The planning horizon is split into discrete time steps. Transition probabilities are chosen to ensure that $S_t$ is a martingale.
• MCTS Parameters: Each training iteration consists of a number of episodes, each comprising a number of individual simulations. A discrete action space was chosen, with each possible action corresponding to the purchase or sale of up to a fixed number of shares, or no transaction.
• MCTS Neural Network (NN): States are represented as two-dimensional arrays. Present state symmetries are employed to enhance the training process. The neural network used for training consists of CNN layers with ReLU activation and batch normalization, followed by dropout and linear layers.

In all hedging tasks rewards are episodic and gauged to the interval $[-1, 1]$, where a reward of $1$ signifies a perfect hedge. A typical feature of RL is that the reward function and the appropriate mapping to $[-1, 1]$ are delicate choices. After each iteration of training the out-of-sample performance is measured on trinomial tree paths. The trained and stored versions of the NN are compared in terms of their reward, and the new version is accepted only if the reward increases by a given percentage (a schematic of this loop is sketched at the end of this section). Figure 5.1 illustrates typical training processes in a trinomial market model (with constant volatility), where the reward structures correspond to Examples 1, 2, 3. In each case the whole reward is granted only at option maturity and no option prices are available during the training process. For the BSM example we accumulate the rewards over the training episode and increase the number of time steps and the number of iterations to investigate how close the trained agent gets to the BSM $\delta$-hedger. In the ideal BSM economy risk-less hedging is possible, which would yield a terminal reward of $1$. In contrast, in our setting discretization errors occur due to the finite representation of state and action spaces. Thus even in a complete market the optimal hedger, who is aware of the exact market model and option prices at each time step, could at best choose an action that minimizes the hedging error in expectation. To assess the performance of the trained MCTS agents in the terminal variance and BSM settings we compare their terminal profits and losses versus the optimal hedger in a complete market. Figure 5.2 shows histograms of terminal profit and loss distributions obtained by executing trained hedger agents on random paths of market evolution. To gain a clearer picture of the distribution of terminal profits and losses, Figure 5.3 depicts a breakdown of terminal hedges according to stock price. Remark that the purpose of these figures is not to present strong hedging agents but to provide proof of concept that even after a relatively small training process the RL agent learns to hedge. Notice also that we cannot run this comparison in the HN example, as switching off transaction costs yields a utility function that has no maximum.

In the terminal variance and HN settings risk-free hedging is not possible; here a reward of $1$ would mean that the global minimum of the utility function is achieved. In all examples the introduction of transaction costs has no apparent impact on the convergence of the training process if collected costs are added as part of the state variable. Stochastic volatility, to the contrary, has a significant impact because many more training samples are required for the agent to learn the distribution of variance.

For comparison we also implemented a plain DQN hedger for terminal utility maximization, where the architecture was chosen such that all reward is granted at maturity. However, training and runtime performance were very poor. The reason lies in the fact that reward is granted only at maturity. DQN must identify those sequences of actions that yield positive rewards from an exponentially growing set of admissible action sequences. Given the number of possible decision paths, our examples seem out of reach for plain Monte Carlo and tabula rasa DQN.
Notice that this implementation differs from previous successful reports, e.g. [32, 15], in that the agent receives no reward signal before maturity.
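As a schematic summary of the training protocol above, the following sketch alternates the expert (MCTS self-play) and apprentice (supervised training) phases and accepts a new network only if the out-of-sample reward improves. The three routines it calls are placeholders supplied by the user, not part of the released code.

```python
def training_loop(env, net, run_episodes, train_candidate, evaluate,
                  n_iterations=20, n_episodes=100, n_eval_paths=1000,
                  min_improvement=0.05):
    """Expert (MCTS) / apprentice (NN) iteration with an acceptance test.

    `run_episodes(env, net, n)` collects self-play samples with MCTS guided by `net`,
    `train_candidate(net, samples)` fits a candidate network to the tree-policy and
    value targets, and `evaluate(env, net, n_paths)` returns the mean out-of-sample
    reward on fresh trinomial paths.  All three are placeholders to be supplied.
    """
    best_reward = evaluate(env, net, n_eval_paths)
    for _ in range(n_iterations):
        samples = run_episodes(env, net, n_episodes)          # expert phase
        candidate = train_candidate(net, samples)             # apprentice phase
        reward = evaluate(env, candidate, n_eval_paths)
        if reward >= best_reward + min_improvement:           # accept only on improvement
            net, best_reward = candidate, reward
    return net
```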
6. Conclusion

We have introduced a new class of algorithms for the problem of hedging and pricing of financial derivative contracts. Our approach is based on state-of-the-art MCTS planning methods, whose architecture is inspired by the most successful systems for game-tree search. This is justified by the theoretical insight that discrete pricing and hedging models can be represented in terms of decision trees. Recent work has highlighted the application of modern RL techniques to hedging but mostly focused on DQN. As compared to DQN, MCTS combines RL with search, which results in a stronger overall performance. We report that our implementation of DQN maximization of terminal utility was not successful, while MCTS shows solidly improving reward curves within the computational resources available to us. We finally note that plain RL agents (including DQN-based agents) trained on simulated data will perform well in market situations that are similar to the training simulations. This shortcoming is addressed by MCTS within the NN's generalization ability by evaluating multiple market states and active search. We leave the valuation of complex derivatives, including American-type or path-dependent derivatives, for further research, but it is conceivable that with sufficient resources granted, trained MCTS agents will show very competitive performance.

Figure 5.1. Illustration of a training process of MCTS applied to option hedging. (a) Terminal variance setting. (b) HN setting. (c) Training process in the BSM setting; to reduce the effect of discretization error, the time-to-maturity interval has been split into a larger number of steps. The blue line depicts the mean reward over random paths of market evolution, the red lines depict upper and lower percentiles.
Figure 5.2. Histograms of terminal profit and loss of option and hedging portfolio at maturity, as obtained on random paths of market evolution. (a) Terminal variance setting. (b) BSM setting. Agents have been trained for a number of iterations, each consisting of a number of episodes. For the performance assessment the transaction costs are switched off. Red: trained MCTS agents. Blue: optimal agent.
Figure 5.3. Scatter plots of terminal profit and loss of the hedging portfolio at maturity, as obtained on random paths of market evolution. (a) Terminal variance setting. (b) BSM setting. (c) Optimal hedger setting. For the performance measurement the transaction costs are switched off. To assess the level of discretization error, (c) illustrates the performance of the optimal hedger. Red scatter: hedging agents. Blue scatter: theoretical option value at maturity.

References
[1] Knut K. Aase. Contingent claims valuation when the security price is a combination of an Ito process and a random point process. Stochastic Processes and their Applications, 28(2):185–220, 1988.
[2] Adrien Couëtoux. Monte Carlo Tree Search for Continuous and Stochastic Sequential Decision Making Problems. PhD Thesis, Université Paris Sud, 2013.
[3] Erling D. Andersen and Anders Damgaard. Utility based option pricing with proportional transaction costs and diversification problems: an interior-point optimization approach. Applied Numerical Mathematics, 29(3):395–422, 1999.
[4] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In NIPS, 2017.
[5] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002.
[6] E. N. Barron and R. Jensen. A stochastic control approach to the pricing of options. Mathematics of Operations Research, 15(1):49–79, 1990.
[7] Bernard Bensaid, Jean-Philippe Lesne, Henri Pagès, and José Scheinkman. Derivative asset pricing with transaction costs. Mathematical Finance, 2(2):63–86, 1992.
[8] L. Bisi, L. Sabbioni, E. Vittori, E. Papini, and M. Restelli. Risk-Averse Trust Region Optimization for Reward-Volatility Reduction. In Proc. Twenty-Ninth Int. Joint Conference on Art. Intelligence, 2020.
[9] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81:637–654, 1973.
[10] Jean-Philippe Bouchaud and Didier Sornette. The Black-Scholes option pricing problem in mathematical finance: generalization and extensions for a large class of stochastic processes. Journal de Physique I, 4(6):863–881, June 1994.
[11] Phelim Boyle. Option valuation using a three-jump process. International Options Journal, 3:7–12, 1986.
[12] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[13] H. Buehler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. Quantitative Finance, 19(8):1271–1291, 2019.
[14] H. Buehler, L. Gonon, J. Teichmann, B. Wood, B. Mohan, and J. Kochems. Deep hedging: hedging derivatives under generic market frictions using reinforcement learning. Technical report, Swiss Finance Institute, 2019.
[15] J. Cao, J. Chen, J. C. Hull, and Z. Poulos. Deep hedging of derivatives using reinforcement learning. Available at SSRN 3514586, 2019.
[16] Hyeong Soo Chang, Michael C. Fu, Jiaqiao Hu, and Steven I. Marcus. An Adaptive Sampling Algorithm for Solving Markov Decision Processes. Operations Research, 53(1):126–139, 2005.
[17] Les Clewlow and Stewart Hodges. Optimal delta-hedging under transactions costs. Journal of Economic Dynamics and Control, 21(8-9):1353–1376, 1997.
[18] John C. Cox, Stephen A. Ross, and Mark Rubinstein. Option pricing: A simplified approach. Journal of Financial Economics, 7(3):229–263, 1979.
[19] Mark H. A. Davis, Vassilios G. Panas, and Thaleia Zariphopoulou. European option pricing with transaction costs. SIAM Journal on Control and Optimization, 31(2):470–493, 1993.
[20] Thomas Dean, Leslie Pack Kaelbling, Jak Kirman, and Ann Nicholson. Planning under time constraints in stochastic domains. Artificial Intelligence, 76(1-2):35–74, July 1995.
[21] D. Duffie. Dynamic Asset Pricing Theory: Third Edition. Princeton Univ. Press, 2001.
[22] H. Föllmer and D. Sondermann. Hedging of non-redundant contingent claims. In Contributions to Mathematical Economics, pages 205–223. Springer Berlin Heidelberg, 1985.
[23] I. Halperin. QLBS: Q-learner in the Black-Scholes (-Merton) worlds. Available at SSRN 3087076, 2017.
[24] Eric A. Hansen and Shlomo Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129(1-2):35–62, June 2001.
[25] S. Hodges and A. Neuberger. Option replication of contingent claims under transactions costs. Technical report, Working paper, Financial Options Research Centre, University of Warwick, 1989.
[26] J. Hull and A. White. The pricing of options on assets with stochastic volatilities. The Journal of Finance, 42(2):281–300, 1987.
[27] Nicole El Karoui and Marie-Claire Quenez. Dynamic programming and pricing of contingent claims in an incomplete market. SIAM Journal on Control and Optimization, 33(1):29–66, 1995.
[28] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning, 49(2/3):193–208, 2002.
[29] B. Kim, K. Lee, S. Lim, L. Kaelbling, and T. Lozano-Perez. Monte Carlo tree search in continuous spaces using Voronoi optimistic optimization with regret bounds. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[30] L. Kocsis, C. Szepesvári, and J. Willemson. Improved Monte-Carlo Search. 2006.
[31] Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In Lecture Notes in Computer Science, pages 282–293. Springer Berlin Heidelberg, 2006.
[32] P. N. Kolm and G. Ritter. Dynamic replication and hedging: A reinforcement learning approach. The Journal of Financial Data Science, 1(1):159–171, 2019.
[33] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, March 1985.
[34] R. C. Merton. Theory of rational option pricing. BELL Journal of Economics, 4:141–183, 1973.
[35] R. C. Merton. Continuous Time Finance. Blackwell, 1990.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[37] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
[38] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. Wiley-Interscience, New York, 1994.
[39] A. Rimmel, F. Teytaud, and T. Cazenave. Optimization of the Nested Monte-Carlo Algorithm on the Traveling Salesman Problem with Time Windows. In Proc. Applicat. Evol. Comput., 2011.
[40] M. Schadd, M. Winands, and J. Uiterwijk. Single-Player Monte-Carlo Tree Search. In Proc. Comput. and Games, 2008.
[41] Manfred Schäl. On quadratic cost criteria for option hedging. Mathematics of Operations Research, 19(1):121–131, 1994.
[42] Martin Schweizer. Variance-optimal hedging in discrete time. Mathematics of Operations Research, 20(1):1–32, 1995.
[43] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[44] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[45] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
[46] Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, 2019.
[47] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[48] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.
[49] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[50] R. Vanderbei. Optimal sailing strategies, statistics and operations research program. University of Princeton, 1996.
[51] E. Vittori, E. Trapletti, and M. Restelli. Option Hedging with Risk Averse Reinforcement Learning. In Proc. ACM International Conference on AI in Finance, 2020.
[52] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[53] Valeri I. Zakamouline. European option pricing and hedging with both fixed and proportional transaction costs. Journal of Economic Dynamics and Control, 30(1):1–25, 2006.