Deep Hedging of Long-Term Financial Derivatives

Abstract

This study presents a deep reinforcement learning approach for global hedging of long-term financial derivatives. A similar setup as in Coleman et al. (2007) is considered with the risk management of lookback options embedded in guarantees of variable annuities with ratchet features. The deep hedging algorithm of Buehler et al. (2019a) is applied to optimize neural networks representing global hedging policies with both quadratic and non-quadratic penalties. To the best of the author's knowledge, this is the first paper that presents an extensive benchmarking of global policies for long-term contingent claims with the use of various hedging instruments (e.g. underlying and standard options) and with the presence of jump risk for equity. Monte Carlo experiments demonstrate the vast superiority of non-quadratic global hedging as it results simultaneously in downside risk metrics two to three times smaller than best benchmarks and in significant hedging gains. Analyses show that the neural networks are able to effectively adapt their hedging decisions to different penalties and stylized facts of risky asset dynamics only by experiencing simulations of the financial market exhibiting these features. Numerical results also indicate that non-quadratic global policies are significantly more geared towards being long equity risk which entails earning the equity risk premium.


Deep Hedging of Long-Term Financial Derivatives∗

Alexandre Carbonneau†

Concordia University, Department of Mathematics and Statistics, Montréal, Canada

July 31, 2020


Keywords: Reinforcement learning; Global hedging; Variable annuity; Lookback option; Jump risk.

∗ A GitHub repository with some examples of code can be found at github.com/alexandrecarbonneau.

∗ The author gratefully acknowledges financial support from the Fonds de recherche du Québec (FRQNT). He would also like to thank Frédéric Godin for his helpful comments and suggestions.

† Email address: [email protected]

Introduction

Variable annuities (VAs), also known as segregated funds and equity-linked insurance, are financial products that enable investors to gain exposure to the market through cashflows that depend on equity performance. These products often include financial guarantees to protect investors against downside equity risk with benefits which can be expressed as the payoff of derivatives. For instance, a guaranteed minimum maturity benefit (GMMB) with ratchet feature is analogous to a lookback put option by providing a minimum monetary amount at the maturity of the contract equal to the maximum account value on specific dates (e.g. anniversary dates of the policy). The valuation of VAs guarantees is typically done with classical option pricing theory by computing the expected risk-neutral discounted cashflows of embedded options under an appropriate equivalent martingale measure; see, for instance, Brennan and Schwartz (1976), Boyle and Schwartz (1977), Persson and Aase (1997), Bacinello (2003) and Bauer et al. (2008). A comprehensive review of the literature on pricing segregated funds guarantees can be found in Gan (2013).

During the subprime mortgage financial crisis, many insurers incurred large losses in segregated fund portfolios due in part to poor risk management, with some insurers even ceasing to write VAs guarantees in certain markets (Zhang (2010)). Two categories of risk management approaches are typically used in practice: the actuarial method and the financial engineering method (Boyle and Hardy (1997)). The former consists in providing stochastic models for the risk factors and setting a reserve held in risk-free assets to cover the liabilities associated to VAs guarantees with a certain probability (e.g. the Value-at-Risk at 99%). The second approach, commonly known as dynamic hedging, entails finding a self-funded sequence of positions in securities to hedge the risk exposure of embedded options.
Dynamic hedging is a popular risk management approach among insurance companies and is studied in this current paper; the reader is referred to Hardy (2003) for a detailed description of the actuarial approach.

Financial markets are said to be complete if every contingent claim can be perfectly replicated with some dynamic hedging strategy. In practice, segregated funds embedded options are typically not attainable as a consequence of their many interrelated risks which are very complex to manage such as equity risk, interest rate risk, mortality risk and basis risk. For insurance companies selling VAs with guarantees, market incompleteness entails that some level of residual risk must be accepted as being intrinsic to the embedded options; the identification of optimal hedging policies in such context is thus highly relevant. Nevertheless, the attention of the actuarial literature has predominantly been on the valuation of segregated funds, not on the design of optimal hedging policies. Indeed, the hedging strategies considered are most often suboptimal and are not necessarily in line with the financial objectives of insurance companies. One popular hedging approach is the greek-based policy where assets positions depend on the sensitivities of the option value (i.e. the value of the guarantee) to different risk factors. Boyle and Hardy (1997) and Hardy (2000) delta-hedge GMMBs under market completeness for mortality risk and Augustyniak and Boudreault (2017) delta-rho hedge GMMBs and guaranteed minimum death benefits (GMDBs) in the presence of model uncertainty for both equity and interest rate. An important pitfall of greek-based policies in incomplete markets is their suboptimality by design: they are a by-product of the choice of pricing kernel (i.e. of the equivalent martingale measure) for option valuation, not of an optimization procedure over hedging decisions to minimize residual risk.
Also, as shown in the seminal work of Harrison and Pliska (1981), in incomplete markets, there exists an infinite set of equivalent martingale measures each of which is consistent with arbitrage-free pricing and can thus be used to compute the hedging positions (i.e. the greeks).

Another strand of literature optimizes hedging policies with local and global criteria.

Local risk minimization (Föllmer and Schweizer (1988) and Schweizer (1991)) consists in choosing assets positions to minimize the periodic risk associated with the hedging portfolio. On the other hand, global risk minimization procedures jointly optimize all hedging decisions with the objective of minimizing the expected value of a loss function applied to the terminal hedging error. In spite of their myopic view of the hedging problem by not necessarily minimizing the risk associated with hedging shortfalls, local risk minimization procedures are attractive for the risk mitigation of VAs guarantees as they are simple to implement and they have outperformed greek-based hedging in several studies. Coleman et al. (2006) and Coleman et al. (2007) apply local risk minimization procedures for risk mitigation of GMDBs using standard options, with the former considering the presence of both interest rate and jump risk and the latter the presence of volatility and jump risk. Kélani and Quittard-Pinon (2017) extend the work of Coleman et al. (2007) in a general Lévy market by including mortality and transaction costs, and Trottier et al. (2018b) and Trottier et al. (2018a) propose a local risk minimization scheme for guarantees in the presence of basis risk.

Within the realm of total risk minimization, global quadratic hedging pioneered by the seminal work of Schweizer (1995) aims at jointly optimizing all hedging decisions with a quadratic penalty for hedging shortfalls. The latter paper provides a theoretical solution to the optimal policy with a single risky asset (see Rémillard and Rubenthaler (2013) for the multidimensional asset case) and Bertsimas et al. (2001) develop a tractable solution to the optimal policy relying on stochastic dynamic programming.
A major drawback of global quadratic hedging is that it penalizes gains and losses equally, which is naturally not in line with the financial objectives of insurance companies. Alternatively, non-quadratic global hedging applies an asymmetric treatment to hedging errors by penalizing hedging losses more heavily (and most often exclusively). In contrast to global quadratic hedging, there is usually no closed-form solution to the optimal policy, but numerical implementations have been proposed in the literature: François et al. (2014) developed a methodology with stochastic dynamic programming algorithms for global hedging with any desired penalty function, Godin (2016) adapts the latter numerical implementation under the Conditional Value-at-Risk measure in the presence of transaction costs and Dupuis et al. (2016) apply global hedging under the semi-mean-square error penalty in the context of short-term hedging for an electricity retailer. The aforementioned studies demonstrated the vast superiority of non-quadratic global hedging over other hedging schemes (e.g. greek-based policies, local risk minimization and global quadratic hedging). Yet, to the best of the author's knowledge, both quadratic and non-quadratic global hedging have seldom been applied for risk mitigation of segregated funds guarantees, or more generally, of long-term contingent claims. Moreover, numerical schemes for global hedging are computationally intensive and often rely on solving Bellman's equations which is known to be [...]

Footnote: An exception is the work of Ankirchner et al. (2014) which considers a minimal-variance hedging strategy for VAs guarantees in continuous-time in the presence of basis risk.

[...] deep hedging to hedge a portfolio of over-the-counter derivatives in the presence of market frictions. The general framework of RL is for an agent to learn over many iterations of an environment how to select sequences of actions to optimize a cost function.
RL has been applied successfully in many areas of quantitative finance such as algorithmic trading (e.g. Moody and Saffell (2001) and Deng et al. (2016)), portfolio optimization (e.g. Jiang et al. (2017) and Almahdi and Yang (2017)) and option pricing (e.g. Li et al. (2009), Becker et al. (2019) and Carbonneau and Godin (2020)). Hedging has also received some attention: Halperin (2020) and Kolm and Ritter (2019) propose TD-learning approaches to the hedging problem and Hongkai et al. (2020) and Carbonneau and Godin (2020) deep hedge European options under respectively the quadratic penalty and the Conditional Value-at-Risk measure. The deep hedging algorithm trains an agent to learn how to approximate optimal hedging decisions by neural networks through many simulations of a synthetic market. This approach is related to the deep learning method of Han and E (2016) by directly optimizing policies for stochastic control problems with Monte Carlo simulations. Arguably, the most important benefit of using neural networks to approximate optimal policies is to overcome the curse of dimensionality which arises when the state-space gets too large.

The contribution of this paper is threefold. First, this study presents a deep reinforcement learning procedure for global hedging long-term financial derivatives which are analogous under assumptions made in this study to embedded options of segregated funds. Our methodological approach which relies on the deep hedging algorithm can be applied for the risk mitigation of any long-term European-type contingent claims (e.g. vanilla, path-dependent) with multiple hedging instruments (e.g. standard options and underlying) under any desired penalty (e.g. quadratic and non-quadratic) and in the presence of different risky assets stylized features (e.g. jump, volatility and regime risk). The second contribution consists in conducting broad numerical experiments of hedging long-term contingent claims with the optimized global policies.
A similar setup as in the work of Coleman et al. (2007) is considered with the risk mitigation of ratchet GMMBs strictly for financial risks in the presence of jumps for equity. To the best of the author's knowledge, this is the first paper that presents such an extensive benchmarking of quadratic and non-quadratic global policies for long-term options with the use of various hedging instruments and by considering different risky assets dynamics. Such benchmarking would have been inaccessible when relying on more traditional optimization procedures for global hedging such as stochastic dynamic programming due to the curse of dimensionality. Numerical results demonstrate the vast superiority of non-quadratic global hedging as it results simultaneously in downside risk metrics two to three times smaller than best benchmarks and in significant hedging gains. Our results clearly demonstrate that non-quadratic global hedging should be prioritized over other popular dynamic hedging procedures found in the literature as it is tailor-made to match the financial objectives of the hedger by always significantly reducing the downside risk as well as earning large expected positive returns. The third contribution is in providing important insights into specific characteristics of the optimized global policies. Monte Carlo experiments indicate that on average, non-quadratic global policies are significantly more bullish than their quadratic counterpart by holding a larger average equity risk exposure which entails earning the equity risk premium. Key factors which contribute to this specific characteristic of non-quadratic global policies are identified. Furthermore, analyses of numerical results show that the training algorithm is able to effectively adapt hedging policies (i.e. neural networks parameters) to different stylized features of risky asset dynamics only by experiencing simulations of the financial market exhibiting these features.

The paper is structured as follows.
Section 2 introduces the notation and the optimal hedging problem. Section 3 describes the numerical scheme based on deep RL to optimize global hedging policies. Section 4 presents benchmarking of the risk mitigation of GMMBs under various market settings. Section 5 concludes.

This section details the financial market setup and the hedging problem considered in this paper.

The financial market is in discrete-time with a finite time horizon of $T \in \mathbb{N}$ years and $N+1$ known observation dates $\mathcal{T} := \{t_i : t_i = i\Delta_N, \; i = 0, \ldots, N\}$ with $\Delta_N := T/N$. The probability space $(\Omega, \mathcal{F}_T, \mathbb{P})$ with $\mathbb{P}$ as the physical measure is equipped with the filtration $\mathbb{F} := \{\mathcal{F}_{t_n}\}_{n=0}^{N}$ that defines all available information of the financial market to investors. A total of $D+2$ liquid assets are accessible to financial participants with $D+1$ risky assets and one risk-free asset. Let $\{B_{t_n}\}_{n=0}^{N}$ be the price process of the risk-free asset where $B_{t_n} := e^{r t_n}$ with $r \in \mathbb{R}$ as the annualized continuous risk-free rate. The risky assets include a non-dividend paying stock and $D$ liquid vanilla European-type options such as calls and puts on the stock which expire on observation dates in $\mathcal{T}$. In this context, the specification of two distinct price processes, one at the beginning and one at the end of each trading period, is required. Let $\{\bar{S}^{(b)}_{t_n}\}_{n=0}^{N}$ be the risky price process at the beginning of each trading period where $\bar{S}^{(b)}_{t_n} := [S^{(0,b)}_{t_n}, \ldots, S^{(D,b)}_{t_n}]$ are the prices at the beginning of $[t_n, t_{n+1})$ with $S^{(0,b)}_{t_n}$ and $S^{(j,b)}_{t_n}$ respectively as the price of the underlying and of the $j$th option. Similarly, let $\{\bar{S}^{(e)}_{t_n}\}_{n=0}^{N-1}$ be the risky price process at the end of each trading period where $\bar{S}^{(e)}_{t_n} := [S^{(0,e)}_{t_n}, \ldots, S^{(D,e)}_{t_n}]$ are the prices at the end of $[t_n, t_{n+1})$ before the next rebalancing at $t_{n+1}$. For the tradable options, if the $j$th option matures at $t_{n+1}$, then $S^{(j,e)}_{t_n}$ is the payoff of the derivative and $S^{(j,b)}_{t_{n+1}}$ is the price of a new contract with the same characteristics (i.e. same payoff function and time-to-maturity). For the underlying, the equality $S^{(0,b)}_{t_{n+1}} = S^{(0,e)}_{t_n}$ holds $\mathbb{P}$-a.s. for $n = 0, \ldots, N-1$.

[...] $T$ be the known maturity in years of the embedded guarantee to be hedged.
This assumption can be motivated by the fact that in practice, insurance companies can significantly reduce the impact of mortality risk on their segregated funds portfolios by insuring additional policies. Furthermore, all VAs are assumed to be held until expiration (i.e. no lapse risk) and their values are linked to a liquid index such as the S&P500 which implies no basis risk.

In this study, the option embedded in VAs is a GMMB with an annual ratchet feature which provides a payoff at time $T$ of the maximum anniversary account value. The anniversary dates of the equity-linked insurance account are assumed to form a subset of the observation dates, i.e. $\{0, 1, \ldots, T\} \subseteq \mathcal{T}$. Let $\{Z_{t_n}\}_{n=0}^{N}$ be the running maximum anniversary value process of the equity-linked account:

$$Z_{t_n} = \begin{cases} \max(S^{(0,b)}_0, \ldots, S^{(0,b)}_m), & \text{if } \lfloor t_n \rfloor = m \text{ and } m \in \{0, \ldots, T-1\}, \\ \max(S^{(0,b)}_0, \ldots, S^{(0,b)}_{T-1}), & \text{if } t_n = T. \end{cases}$$

Footnote: $\lfloor \cdot \rfloor : \mathbb{R} \to \mathbb{R}$ is the floor function, i.e. $\lfloor x \rfloor$ is the largest integer smaller or equal to $x$.

The payoff of the GMMB with annual ratchet can be expressed as the account value at time $T$ plus a lookback put option payoff:

$$\max(S^{(0,b)}_0, \ldots, S^{(0,b)}_T) = \max\big(\max(S^{(0,b)}_0, \ldots, S^{(0,b)}_{T-1}), S^{(0,b)}_T\big) = \max(Z_T - S^{(0,b)}_T, 0) + S^{(0,b)}_T. \tag{2.1}$$

Thus, the assumptions of market completeness with respect to mortality risk and lapse risk considered in this paper entail that the risk exposure of the insurer selling a GMMB is equivalent to holding a short position in a long-term lookback option of fixed maturity $T$ and of payoff $\Phi : \mathbb{R} \times \mathbb{R} \to [0, \infty)$:

$$\Phi(S^{(0,b)}_T, Z_T) := \max(Z_T - S^{(0,b)}_T, 0). \tag{2.2}$$

Footnote: Coleman et al. (2007) consider the problem of hedging a ratchet GMDB with a fixed and known maturity $T$. The use of a fixed maturity in the latter paper is motivated by assuming market completeness under mortality risk and hedging the expected loss of the guarantee. While the current paper considers the risk mitigation of a GMMB instead of a GMDB, assumptions made in both papers (i.e. no mortality risk and lapse risk) entail that the benefits of the two guarantees are equivalent and result in the same lookback put option to hedge as in (2.2).

Let $\delta := \{\delta_{t_n}\}_{n=0}^{N}$ be a trading strategy used by the hedger to minimize his risk exposure to $\Phi$ where for $n = 1, \ldots, N$, $\delta_{t_n} := (\delta^{(0)}_{t_n}, \ldots, \delta^{(D)}_{t_n}, \delta^{(B)}_{t_n})$ is a vector containing the number of shares held in each asset during the period $(t_{n-1}, t_n]$ with $\delta^{(0:D)}_{t_n} := (\delta^{(0)}_{t_n}, \ldots, \delta^{(D)}_{t_n})$ and $\delta^{(B)}_{t_n}$ respectively as the positions in the $D+1$ risky assets and in the risk-free asset. The initial portfolio (at time 0 before the first trade) is invested strictly in the risk-free asset. Also, for convenience, all options used as hedging instruments have one period maturity, i.e. they are traded once and held until expiration. The following additional assumption is made for the rest of the paper.

Assumption 2.1.

The market is liquid and trading in risky assets does not affect their prices.

Before describing the optimization problem of hedging $\Phi$, some well-known concepts in the mathematical finance literature must be described. The reader is referred to Lamberton and Lapeyre (2011) for additional details. Let $\{G^{\delta}_{t_n}\}_{n=0}^{N}$ be the discounted gain process associated with the strategy $\delta$ where $G^{\delta}_{t_n}$ is the discounted gain at time $t_n$ prior to rebalancing. $G^{\delta}_0 := 0$ and

$$G^{\delta}_{t_n} := \sum_{k=1}^{n} \delta^{(0:D)}_{t_k} \bullet \big(B^{-1}_{t_k} \bar{S}^{(e)}_{t_{k-1}} - B^{-1}_{t_{k-1}} \bar{S}^{(b)}_{t_{k-1}}\big), \quad n = 1, 2, \ldots, N, \tag{2.3}$$

where $\bullet$ is the dot product operator. Moreover, let $\{V^{\delta}_{t_n}\}_{n=0}^{N}$ be the hedging portfolio values for a trading strategy $\delta$ where $V^{\delta}_{t_n}$ is the value prior to rebalancing at time $t_n$:

$$V^{\delta}_{t_n} := \delta^{(0:D)}_{t_n} \bullet \bar{S}^{(e)}_{t_{n-1}} + \delta^{(B)}_{t_n} B_{t_n}, \quad n = 1, \ldots, N, \tag{2.4}$$

and $V^{\delta}_0 := \delta^{(B)}_0$ since the initial capital amount is assumed to be strictly invested in the risk-free asset. In this paper, the trading strategies considered require no cash infusion nor withdrawal [...] self-financing. More precisely, the hedging strategy $\delta$ is said to be self-financing if it is predictable and if

$$\delta^{(0:D)}_{t_{n+1}} \bullet \bar{S}^{(b)}_{t_n} + \delta^{(B)}_{t_{n+1}} B_{t_n} = V^{\delta}_{t_n}, \quad n = 0, 1, \ldots, N-1. \tag{2.5}$$

Lastly, let $\Pi$ be the set of admissible trading strategies for the hedger which consists of all sufficiently well-behaved self-financing strategies.

Footnote: If $X = [X_1, \ldots, X_K]$ and $Y = [Y_1, \ldots, Y_K]$, $X \bullet Y := \sum_{i=1}^{K} X_i Y_i$.

Remark 2.1.

It can be shown that $\delta$ is self-financing if and only if $V^{\delta}_{t_n} = B_{t_n}(V^{\delta}_0 + G^{\delta}_{t_n})$ for $n = 0, 1, \ldots, N$. See for instance Lamberton and Lapeyre (2011).
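The identity in Remark 2.1 can be checked numerically on a small example. The following is only a minimal sketch under stated assumptions: the function names (`bank_account`, `portfolio_values`, `discounted_gains`) and the two-period market data are hypothetical and are not taken from the paper. The portfolio is rolled forward with (2.4) and (2.5), the discounted gains are accumulated as in (2.3), and the two sides of the identity are compared.

```python
import math


def bank_account(r, t):
    """Risk-free asset price B_t = exp(r * t)."""
    return math.exp(r * t)


def dot(x, y):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def portfolio_values(v0, r, times, s_begin, s_end, risky_positions):
    """Roll a self-financing portfolio forward with (2.4) and (2.5).

    s_begin[n], s_end[n]: risky prices at the beginning / end of [t_n, t_{n+1}).
    risky_positions[n]:   risky positions held over that period (delta_{t_{n+1}}).
    """
    values = [v0]
    for n in range(len(times) - 1):
        b_now = bank_account(r, times[n])
        b_next = bank_account(r, times[n + 1])
        # (2.5): wealth not spent on risky assets is held in the risk-free asset.
        cash_units = (values[-1] - dot(risky_positions[n], s_begin[n])) / b_now
        # (2.4): portfolio value at t_{n+1} prior to rebalancing.
        values.append(dot(risky_positions[n], s_end[n]) + cash_units * b_next)
    return values


def discounted_gains(r, times, s_begin, s_end, risky_positions):
    """Discounted gain process of (2.3), starting from G_0 = 0."""
    gains = [0.0]
    for n in range(len(times) - 1):
        b_now = bank_account(r, times[n])
        b_next = bank_account(r, times[n + 1])
        move = [se / b_next - sb / b_now for sb, se in zip(s_begin[n], s_end[n])]
        gains.append(gains[-1] + dot(risky_positions[n], move))
    return gains


# Hypothetical two-period market with one stock and one one-period option.
r = 0.03
times = [0.0, 0.5, 1.0]
s_begin = [[100.0, 5.0], [104.0, 4.5]]   # prices at the start of each period
s_end = [[104.0, 2.0], [98.0, 6.5]]      # prices (option payoff) at the end
risky_positions = [[0.4, 1.0], [0.7, -0.5]]
v0 = 10.0

values = portfolio_values(v0, r, times, s_begin, s_end, risky_positions)
gains = discounted_gains(r, times, s_begin, s_end, risky_positions)
for n, t in enumerate(times):
    # Both columns should agree up to floating-point error.
    print(n, values[n], bank_account(r, t) * (v0 + gains[n]))
```

Tracking only the risky positions is sufficient here because the risk-free position is implied by the self-financing constraint (2.5).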

The optimization problem of hedging the risk exposure associated to a short position in the long-term lookback option is now formally defined. For the hedger, the problem consists in the design of a trading policy which minimizes a penalty, also referred to as a loss function, of the difference between the payoff of the lookback option and the hedging portfolio value at maturity (i.e. the hedging error or hedging shortfall). Strategies embedded in such policies are called global hedging strategies as they are jointly optimized over all hedging decisions until the maturity of the lookback option. Let $L : \mathbb{R} \to \mathbb{R}$ be a loss function for the hedging error. For the rest of the paper, assume without loss of generality that the position in the hedging portfolio is long, and that all assets and penalties are well-behaved and integrable enough. Specific conditions are beyond the scope of this study.

Definition 2.1. (Global risk exposure) Define $\epsilon(V_0)$ as the global risk exposure of the short position in $\Phi$ under optimal hedge if the value of the initial hedging portfolio is $V_0 \in \mathbb{R}$:

$$\epsilon(V_0) := \min_{\delta \in \Pi} \mathbb{E}\Big[L\big(\Phi(S^{(0,b)}_T, Z_T) - V^{\delta}_T\big)\Big], \tag{2.6}$$

where the expectation is taken with respect to the physical measure.

Footnote: $X = \{X_n\}_{n=0}^{N}$ with $X_n = [X^{(1)}_n, \ldots, X^{(K)}_n]$ is $\mathbb{F}$-predictable if for $j = 1, \ldots, K$, $X^{(j)}_0 \in \mathcal{F}_0$ and $X^{(j)}_{n+1} \in \mathcal{F}_n$ for $n = 0, \ldots, N-1$.

Remark 2.2. An assumption implicit to Definition 2.1 is that the minimum in (2.6) is indeed attained by some trading strategy, i.e. that the infimum is in fact a minimum. The identification of conditions which ensure that this assumption is satisfied is left out-of-scope.
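The hedging error inside (2.6) depends on the running maximum anniversary value $Z_T$ and on the lookback payoff (2.2). The sketch below illustrates how these quantities could be computed along one price path; the helper names (`running_max_anniversary`, `lookback_payoff`, `hedging_error`) and the semi-annual path are hypothetical choices made for illustration, not the author's implementation.

```python
import math


def is_anniversary(t):
    """True when t is (numerically) an integer number of years."""
    return abs(t - round(t)) < 1e-9


def running_max_anniversary(times, prices, maturity):
    """Z_{t_n}: maximum account value over anniversaries 0, ..., floor(t_n).

    The anniversary at T itself is excluded from Z_T, since the time-T account
    value enters the payoff separately through S_T.
    """
    anniversary_price = {
        round(t): s for t, s in zip(times, prices) if is_anniversary(t)
    }
    z = []
    for t in times:
        last = min(math.floor(t + 1e-9), maturity - 1)
        z.append(max(anniversary_price[m] for m in range(last + 1)))
    return z


def lookback_payoff(s_T, z_T):
    """Phi(S_T, Z_T) = max(Z_T - S_T, 0), as in (2.2)."""
    return max(z_T - s_T, 0.0)


def hedging_error(s_T, z_T, v_T):
    """Payoff minus terminal portfolio value; positive values are losses."""
    return lookback_payoff(s_T, z_T) - v_T


# Hypothetical path: T = 3 years observed semi-annually (N = 6).
maturity = 3
times = [0.5 * i for i in range(7)]
prices = [100.0, 110.0, 105.0, 120.0, 115.0, 90.0, 95.0]

z = running_max_anniversary(times, prices, maturity)
payoff = lookback_payoff(prices[-1], z[-1])
ratchet_benefit = max(s for t, s in zip(times, prices) if is_anniversary(t))

print(z)
print(payoff, ratchet_benefit)
```

In this example the mid-year peak of 120 is ignored because only anniversary values enter the ratchet, and the ratchet benefit equals the payoff plus the terminal account value, in line with (2.1).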

The following defines the optimal hedging strategy for $\Phi$ given the initial capital investment and the loss function for hedging errors.

Definition 2.2. (Optimal hedging strategy) Let $\delta^{\star}(V_0)$ be the optimal hedging strategy corresponding to the global risk exposure of the hedger if the initial portfolio value is $V_0 \in \mathbb{R}$:

$$\delta^{\star}(V_0) := \arg\min_{\delta \in \Pi} \mathbb{E}\Big[L\big(\Phi(S^{(0,b)}_T, Z_T) - V^{\delta}_T\big)\Big]. \tag{2.7}$$

In a realistic setting, the choice of loss function should reflect the financial objectives and the risk aversion of the hedger. One example of penalty which has been extensively studied in the hedging literature is the mean-square error (MSE): $L(x) = x^2$. This penalty entails that hedging gains and losses are treated equally which could be desirable for a financial participant who has to provide a price quote on a security prior to knowing his position (long or short). In the context of this paper where the position in $\Phi$ is always short, penalizing hedging gains is clearly undesirable for the hedger. The corresponding loss function to the MSE that penalizes only hedging losses is the semi-mean-square error (SMSE): $L(x) = x^2 \mathbb{1}_{\{x > 0\}}$. While the MSE and SMSE are the only penalties considered in numerical experiments of Section 4, the optimization procedure for global hedging policies presented in Section 3 is flexible to any well-behaved penalties (see e.g. Carbonneau and Godin (2020) for an implementation with the Conditional Value-at-Risk measure).

The author wants to emphasize that different penalties will often result in different optimal hedging strategies. An extensive numerical study of the impact of the choice of loss function on the hedging policy for the risk management of lookback options is done in Section 4.
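To make the difference between the two penalties concrete, the sketch below evaluates a sample-average estimate of the expected penalty in (2.6) on a handful of made-up hedging errors. The function names and the error values are illustrative assumptions only; a positive error denotes a hedging loss since the error is the payoff minus the terminal portfolio value.

```python
def mse(x):
    """Mean-square error penalty: L(x) = x^2."""
    return x * x


def smse(x):
    """Semi-mean-square error penalty: L(x) = x^2 if x > 0, else 0."""
    return x * x if x > 0 else 0.0


def expected_penalty(loss, hedging_errors):
    """Sample-average estimate of E[L(hedging error)]."""
    return sum(loss(e) for e in hedging_errors) / len(hedging_errors)


# Made-up hedging errors (payoff minus terminal portfolio value).
errors = [-2.0, -1.0, 0.5, 3.0]
# Same magnitudes with gains and losses swapped.
mirrored = [-e for e in errors]

mse_original = expected_penalty(mse, errors)
mse_mirrored = expected_penalty(mse, mirrored)
smse_original = expected_penalty(smse, errors)
smse_mirrored = expected_penalty(smse, mirrored)

print(mse_original, mse_mirrored)
print(smse_original, smse_mirrored)
```

Swapping gains and losses leaves the MSE estimate unchanged but changes the SMSE estimate, which is the asymmetry that makes the SMSE sensitive to the direction of the hedging error.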
Moreover, while the numerical section of this paper strictly studies a specific example of long-term option to hedge, namely the lookback option of payoff $\Phi$, the methodological approach to approximate optimal hedging strategies can be applied for any European-type derivative of well-behaved payoff function which can naturally include other VAs guarantees with payoffs analogous to financial derivatives.

This section describes the reinforcement learning procedure used to optimize global policies. The approach relies on the deep hedging algorithm of Buehler et al. (2019a) who showed that a feedforward neural network (FFNN) can be used to approximate arbitrarily well optimal hedging strategies in very general financial market conditions. At its core, a FFNN is a parameterized composite function which maps input to output vectors through the composition of a sequence of functions called hidden layers. Each hidden layer applies an affine transformation and a nonlinear transformation to input vectors. A FFNN $F_\theta : \mathbb{R}^{d_0} \to \mathbb{R}^{\tilde{d}}$ with $L$ hidden layers has the following representation:

$$F_\theta(X) := (o \circ h_L \circ \ldots \circ h_1)(X), \qquad h_l(X) := g(W_l X + b_l), \quad l = 1, \ldots, L,$$

where $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$ and $b_l \in \mathbb{R}^{d_l \times 1}$ are respectively known as the weight matrix and bias vector of the $l$th hidden layer $h_l$, $g$ is a non-linear function applied to each scalar given as input and $o : \mathbb{R}^{d_L} \to \mathbb{R}^{\tilde{d}}$ is the output function which applies an affine transformation to the output of the last hidden layer $h_L$ and possibly also a nonlinear transformation with the same range as $F_\theta$. Furthermore, the trainable parameters $\theta$ is the set of all weight matrices and bias vectors which are learned (i.e. fitted in statistical terms) by minimizing a specified cost function.

In the current study, the type of neural network considered for functions representing hedging policies is from the family of recurrent neural networks (RNNs, Rumelhart et al. (1986)), a class of neural networks which maps input sequences to output sequences.
The architecture of RNNs is similar to FFNNs but differs by having self-connections in hidden layers: each hidden layer is a function of both an input vector from the current time-step and an output vector from the hidden layer of the previous time-step, hence the name recurrent. More formally, for an input vector $X_{t_n}$ at time $t_n$, the time-$t_n$ output of the hidden layer is computed as $h_{t_n} = f(h_{t_{n-1}}, X_{t_n})$ for some time-independent function $f$. In contrast to FFNNs, feedback loops in hidden layers entail that each output is dependent on past inputs which makes RNNs more appropriate for time-series modeling. The type of RNN considered for dynamic hedging in this study is the long short-term memory (LSTM) introduced by Hochreiter and Schmidhuber (1997). This choice of neural network is motivated by recent results of Buehler et al. (2019b) who showed that LSTM hedging policies are more effective for the risk mitigation of path-dependent contingent claims than FFNN policies. Additional remarks are made in subsequent sections to motivate the choice of an LSTM for the specific setup considered in the current paper. For more general information about RNNs, the reader is referred to Chapter 10 of Goodfellow et al. (2016) and the many references therein.

The LSTM architecture is now formally defined. The application of LSTMs as functions representing global hedging policies is described in Section 3.1. In what follows, the time-steps are the same as the observation dates of the financial market.
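The FFNN composition and the recurrence $h_{t_n} = f(h_{t_{n-1}}, X_{t_n})$ described above can be sketched in a few lines of plain Python. This is an illustrative toy only: the choice $g = \tanh$, the specific form $f(h, X) = \tanh(W_h h + W_x X + b)$, the zero initial state and all weight values are assumptions made for the example and are not specified in the paper.

```python
import math


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def hidden_layer(weights, bias, x, g=math.tanh):
    """h_l(X) = g(W_l X + b_l) with g applied element-wise."""
    return [g(a) for a in affine(weights, bias, x)]


def ffnn(hidden_params, output_params, x):
    """F(X) = o(h_L(...h_1(X))) with an affine output function o."""
    for weights, bias in hidden_params:
        x = hidden_layer(weights, bias, x)
    w_out, b_out = output_params
    return affine(w_out, b_out, x)


def rnn(w_h, w_x, bias, inputs):
    """Assumed recurrence h_n = tanh(W_h h_{n-1} + W_x X_n + b), h_{-1} = 0."""
    h = [0.0] * len(bias)
    outputs = []
    for x in inputs:
        from_state = affine(w_h, bias, h)
        from_input = affine(w_x, [0.0] * len(bias), x)
        h = [math.tanh(a + c) for a, c in zip(from_state, from_input)]
        outputs.append(h)
    return outputs


# Toy FFNN: one hidden layer (2 -> 2) followed by an affine output (2 -> 1).
ffnn_out = ffnn(
    hidden_params=[([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])],
    output_params=([[1.0, 1.0]], [0.5]),
    x=[0.0, 0.0],
)

# Toy RNN with a scalar state fed the same input twice.
rnn_out = rnn(w_h=[[0.5]], w_x=[[1.0]], bias=[0.0], inputs=[[1.0], [1.0]])

print(ffnn_out)
print(rnn_out)
```

Feeding the same input at two consecutive time-steps yields two different RNN outputs because the second step also sees the previous hidden state, whereas an FFNN would return the same output for the same input.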

Definition 3.1. (LSTM) Let $F_\theta : \mathbb{R}^{N} \times \mathbb{R}^{d_{in}} \to \mathbb{R}^{N} \times \mathbb{R}^{d_{out}}$ be an LSTM which maps the sequence of feature vectors $\{X_{t_n}\}_{n=0}^{N-1}$ to $\{Y_{t_n}\}_{n=0}^{N-1}$ where $X_{t_n}$ and $Y_{t_n}$ are respectively two vectors of dimensions $d_{in}, d_{out} \in \mathbb{N}$. Let $\mathrm{sigm}(\cdot)$ and $\tanh(\cdot)$ be the sigmoid and hyperbolic tangent functions applied element-wise to each scalar given as input. For $H \in \mathbb{N}$, the computation of $F_\theta$ at each time-step consists of $H$ LSTM cells which are analogous to but more complex than RNNs hidden layers. Each LSTM cell outputs a vector of $d_j$ neurons denoted as $h^{(j)}_{t_n} \in \mathbb{R}^{d_j \times 1}$ at time $t_n$ for $d_j \in \mathbb{N}$ and $j = 1, \ldots, H$. More precisely, the computation done by the $j$th LSTM cell at time $t_n$ is as follows:

$$\begin{aligned}
i^{(j)}_{t_n} &= \mathrm{sigm}\big(W^{(j)}_i [h^{(j)}_{t_{n-1}}, h^{(j-1)}_{t_n}] + b^{(j)}_i\big), \\
f^{(j)}_{t_n} &= \mathrm{sigm}\big(W^{(j)}_f [h^{(j)}_{t_{n-1}}, h^{(j-1)}_{t_n}] + b^{(j)}_f\big), \\
o^{(j)}_{t_n} &= \mathrm{sigm}\big(W^{(j)}_o [h^{(j)}_{t_{n-1}}, h^{(j-1)}_{t_n}] + b^{(j)}_o\big), \\
c^{(j)}_{t_n} &= f^{(j)}_{t_n} \circ c^{(j)}_{t_{n-1}} + i^{(j)}_{t_n} \circ \tanh\big(W^{(j)}_c [h^{(j)}_{t_{n-1}}, h^{(j-1)}_{t_n}] + b^{(j)}_c\big), \\
h^{(j)}_{t_n} &= o^{(j)}_{t_n} \circ \tanh\big(c^{(j)}_{t_n}\big),
\end{aligned} \tag{3.1}$$

where $[\cdot, \cdot]$ and $\circ$ denote respectively the concatenation of two vectors and the Hadamard product (i.e. the element-wise product) and

• $W^{(1)}_i, W^{(1)}_f, W^{(1)}_o, W^{(1)}_c \in \mathbb{R}^{d_1 \times (d_1 + d_{in})}$ and $b^{(1)}_i, b^{(1)}_f, b^{(1)}_o, b^{(1)}_c \in \mathbb{R}^{d_1 \times 1}$.

• If $H \geq 2$: $W^{(j)}_i, W^{(j)}_f, W^{(j)}_o, W^{(j)}_c \in \mathbb{R}^{d_j \times (d_j + d_{j-1})}$ and $b^{(j)}_i, b^{(j)}_f, b^{(j)}_o, b^{(j)}_c \in \mathbb{R}^{d_j \times 1}$ for $j = 2, \ldots, H$.

At each time-step, the input of the first LSTM cell is the feature vector (i.e. $h^{(0)}_{t_n} := X_{t_n}$) and the final output is an affine transformation of the output of the last LSTM cell:

$$Y_{t_n} = W_y h^{(H)}_{t_n} + b_y, \quad n = 0, \ldots, N-1, \tag{3.2}$$

where $W_y \in \mathbb{R}^{d_{out} \times d_H}$ and $b_y \in \mathbb{R}^{d_{out} \times 1}$. Lastly, the set of trainable parameters denoted as $\theta$ consists of all weight matrices and bias vectors:

$$\theta := \Big\{ \{W^{(j)}_i, W^{(j)}_f, W^{(j)}_o, W^{(j)}_c, b^{(j)}_i, b^{(j)}_f, b^{(j)}_o, b^{(j)}_c\}_{j=1}^{H}, W_y, b_y \Big\}. \tag{3.3}$$

Footnote: Here, $h_{t_{n-1}}$ and $h_{t_n}$ are to be understood for convenience as output vectors from hidden layers and not as mappings.

Footnote: For $X := [X_1, \ldots, X_K]$, $\mathrm{sigm}(X) := \left[\frac{1}{1 + e^{-X_1}}, \ldots, \frac{1}{1 + e^{-X_K}}\right]$ and $\tanh(X) := \left[\frac{e^{X_1} - e^{-X_1}}{e^{X_1} + e^{-X_1}}, \ldots, \frac{e^{X_K} - e^{-X_K}}{e^{X_K} + e^{-X_K}}\right]$.

Remark 3.1.

In the deep learning literature, the $i^{(j)}_{t_n}$, $f^{(j)}_{t_n}$ and $o^{(j)}_{t_n}$ are known as input gates, forget gates and output gates. Their architectures have been shown to help alleviate the issue of learning long-term dependencies of time series with classical RNNs as they control the information passed through the LSTM cells. The reader is referred to Bengio et al. (1994) for more information about this latter pitfall of RNNs and to Chapter . of Goodfellow et al. (2016) and the many references therein for more general information about LSTMs.

Footnote: At time 0 (i.e. $n = 0$), the computation of the $H$ LSTM cells is the same as in (3.1) with $h^{(j)}_{t_{-1}}$ and $c^{(j)}_{t_{-1}}$ as vectors of zeros of dimensions $d_j$ for $j = 1, \ldots, H$.

In the context of dynamic hedging, an LSTM maps a sequence of feature vectors consisting of relevant financial market observations to the sequence of positions in each asset for all time-steps. The trainable parameters $\theta$ are optimized to minimize the expected value of a loss function applied to the terminal hedging error obtained as a result of the trading decisions made by the LSTM. The following definition describes more formally how the LSTM computes the hedging strategy. Note that in the numerical experiments of Section 4, the hedging instruments used for the risk minimization of $\Phi$ are either only the underlying or standard options. The case of using both the underlying and options is not considered because of its redundancy; the options can replicate positions in the underlying with calls and puts.

Definition 3.2. (Hedging with an LSTM) Let $F_\theta$ be an LSTM as in Definition 3.1 which maps the sequence of feature vectors $\{X_{t_n}\}_{n=0}^{N-1}$ to the output vectors $\{Y_{t_n}\}_{n=0}^{N-1}$. The choice of hedging instruments (i.e. the underlying or standard options) implies differences for the feature vectors and output vectors:

1) Hedging only with the underlying: the feature vector at each time-step is
$$X_{t_n} := [\log(S^{(0,b)}_{t_n}), \log(Z_{t_n}), V^{\delta}_{t_n}/V^{\delta}_0], \quad n = 0, \ldots, N-1,$$
and $F_\theta$ outputs at each rebalancing date the position in the underlying: $\delta^{(0)}_{t_n} = Y_{t_{n-1}}$.

2) Hedging only with options: the feature vector at each time-step includes option prices as well as the price of the underlying:
$$X_{t_n} := [\log(\bar{S}^{(b)}_{t_n}), \log(Z_{t_n}), V^{\delta}_{t_n}/V^{\delta}_0], \quad n = 0, \ldots, N-1,$$
and $F_\theta$ outputs at each rebalancing date the position in the $D$ options: $[\delta^{(1)}_{t_n}, \ldots, \delta^{(D)}_{t_n}] = Y_{t_{n-1}}$.

Footnote: The computation of $\{V^{\delta}_{t_n}\}_{n=0}^{N-1}$ can be done for instance as in (2.4) where asset positions are given by the output vectors of the LSTM.

Footnote: Using the transformations $\{\log(S^{(0,b)}_{t_n}), \log(Z_{t_n}), V^{\delta}_{t_n}/V^{\delta}_0\}$ instead of $\{S^{(0,b)}_{t_n}, Z_{t_n}, V^{\delta}_{t_n}\}$ in feature vectors for the numerical experiments of Section 4 was found to significantly improve the training of neural networks. We note that the log transformation could not be applied for the hedging portfolio values since $V^{\delta}_{t_n}$ can theoretically take values on the real line.

It is important to note that the choice of dynamics for the financial market could imply that relevant necessary information to compute the time-$t_n$ trading strategy should be added to feature vectors. For instance, Carbonneau and Godin (2020) apply the deep hedging algorithm with GARCH models which entails adding the volatility process to feature vectors. In the current paper, the models considered for the underlying imply that $\{S^{(0,b)}_{t_n}\}_{n=0}^{N}$ is a Markov process under $\mathbb{P}$ and thus that no additional variables must be added to feature vectors. Nevertheless, we note that the same methodological approach for hedging described in this section can easily be adapted to dynamics requiring the inclusion of additional state variables.

Remark 3.2.

Buehler et al. (2019b) deep hedge exotic derivatives with an LSTM whose feature vectors do not include a path-dependent state variable such as $\{Z_{t_n}\}_{n=0}^{N-1}$. The author of the current paper observed that adding $\{Z_{t_n}\}_{n=0}^{N-1}$ to feature vectors as per Definition 3.2 significantly improved the performance of the optimized hedging policies when the number of trading periods was large (i.e. for large $N$), while for less frequent trading, the gain was marginal.

Remark 3.3.

Theoretical results from Buehler et al. (2019a) show that a FFNN could have been used to approximate arbitrarily well the optimal hedging policy in the setup considered in this study (see Proposition 4.3 of their paper). However, the author of the current paper observed that hedging with an LSTM was significantly more effective than with a FFNN for the numerical experiments conducted in Section 4 in terms of both computational time (i.e. faster learning with LSTMs) and hedging effectiveness, which motivated the use of LSTMs as trading policies. Explaining the superiority of LSTMs over FFNNs in the context of this paper is out of scope and is left as interesting potential future work.

For the rest of the paper, a single set of hyperparameters for the LSTM is considered in terms of the number of LSTM cells and neurons per cell. The optimization problem thus consists in searching for the optimal values of trainable parameters for this specific LSTM architecture. The hyperparameter tuning step is not considered in this paper; the reader is referred to Buehler et al. (2019a) or Carbonneau and Godin (2020) for a complete description of the optimal hedging problem with FFNNs which includes hyperparameter tuning.

Definition 3.3. (Global risk exposure with an LSTM) Define $\tilde{\epsilon}(V_0)$ as the global risk exposure of the short position in $\Phi$ under optimal hedge if the hedging strategy is given by $F_\theta$ and if the value of the initial hedging portfolio is $V_0 \in \mathbb{R}$:

$$\tilde{\epsilon}(V_0) := \min_{\theta \in \mathbb{R}^q} \mathbb{E}\left[ L\left( \Phi(S^{(0,b)}_T, Z_T) - V^{\delta^\theta}_T \right) \right], \tag{3.4}$$

where $\delta^\theta$ is to be understood as the output vectors of $F_\theta$ and $q \in \mathbb{N}$ is the total number of trainable parameters.

Footnote: Note that as per Definition 3.2, the dimensions of the input and output of the LSTM at each time-step, i.e. $d_{in}$ and $d_{out}$, depend on the choice of hedging instruments. Thus, while the numbers of neurons $d_1, \ldots, d_H$ and the number of LSTM cells $H$ are fixed for the numerical experiments of Section 4, the total number of trainable parameters will vary with the choice of hedging instruments.

The numerical scheme to optimize the trainable parameters $\theta$ is now described. For convenience, a notation similar to that of Carbonneau and Godin (2020) is used. For a given loss function and an initial portfolio value, the objective is to find $\theta$ such that the risk exposure of a short position in $\Phi$ is minimized (i.e. as in (3.4)). The training procedure was originally proposed in Buehler et al. (2019a) and relies on (mini-batch) stochastic gradient descent (SGD), a very popular algorithm in the deep learning literature to train neural networks. Denote $J(\theta)$ as the cost function to minimize:

$$J(\theta) := \mathbb{E}\left[ L\left( \Phi(S^{(0,b)}_T, Z_T) - V^{\delta^\theta}_T \right) \right], \quad \theta \in \mathbb{R}^q.$$

Let $\theta_0$ be the initial values for the trainable parameters. The optimization procedure consists in the following iterations:

$$\theta_{j+1} = \theta_j - \eta_j \nabla_\theta J(\theta_j), \tag{3.5}$$

where $\nabla_\theta$ is the gradient operator with respect to $\theta$ and $\{\eta_j\}_{j \geq 0}$ is a sequence of small positive real values. In the context of this paper, $\nabla_\theta J(\theta)$ is unknown analytically and is estimated with Monte Carlo sampling. Let $B_j := \{\pi_{i,j}\}_{i=1}^{N_{batch}}$ be a mini-batch of simulated hedging errors of size $N_{batch} \in \mathbb{N}$ with $\pi_{i,j}$ as the $i$-th hedging error if $\theta = \theta_j$:

$$\pi_{i,j} := \Phi(S^{(0,b)}_{T,i}, Z_{T,i}) - V^{\delta^{\theta_j}}_{T,i},$$

where $S^{(0,b)}_{T,i}$, $Z_{T,i}$ and $V^{\delta^{\theta_j}}_{T,i}$ are to be understood as the values of the $i$-th simulated path. Moreover, denote $\hat{J} : \mathbb{R}^{N_{batch}} \to \mathbb{R}$ as the empirical estimator of $J(\theta_j)$ evaluated with $B_j$ and $\nabla_\theta \hat{J}(B_j)$ as the empirical estimator of $\nabla_\theta J(\theta_j)$ evaluated at $\theta = \theta_j$. In Section 4, the MSE and SMSE penalties defined respectively as $L^{MSE}(x) := x^2$ and $L^{SMSE}(x) := x^2 \mathbb{1}_{\{x > 0\}}$ are extensively used. The empirical estimator of the cost function under each penalty can be stated as follows:

$$\hat{J}^{MSE}(B_j) := \frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} \pi_{i,j}^2, \qquad \hat{J}^{SMSE}(B_j) := \frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} \pi_{i,j}^2 \mathbb{1}_{\{\pi_{i,j} > 0\}}. \tag{3.6}$$

One essential property of the architecture of neural networks is that the gradient of empirical cost functions (i.e. $\nabla_\theta \hat{J}(B_j)$ for both penalties) is known analytically. Indeed, we note that hedging errors depend linearly on the trading strategies produced as the outputs of the LSTM. Furthermore, the gradient of the outputs of an LSTM with respect to trainable parameters is known analytically (see e.g. Chapter 10 of Goodfellow et al. (2016)).

Remark 3.4.

In practice, the backpropagation through time (BPTT) algorithm is often used to compute analytically the gradient of a cost function with respect to the trainable parameters for recurrent neural networks such as an LSTM. BPTT leverages the structure of LSTMs (e.g. parameter sharing at each time-step) as well as the chain rule of calculus to obtain such gradients. Efficient deep learning libraries such as Tensorflow (Abadi et al., 2016) are often used to implement BPTT. Moreover, algorithms such as Adam (Kingma and Ba, 2014) which dynamically adapt the terms $\{\eta_j\}_{j \geq 0}$ in (3.5) have been shown to improve the training of neural networks. For the rest of the paper, Tensorflow and Adam are used to train every neural network.

Footnote: In this paper, the initial values of $\theta$ are always set with the Glorot initialization of Glorot and Bengio (2010).

In this section, an extensive numerical study benchmarking different dynamic hedging strategies for the long-term lookback option is presented. Section 4.3 benchmarks two global hedging strategies optimized with the deep hedging algorithm and the local risk minimization scheme of Coleman et al. (2007) with different hedging instruments and different dynamics for the financial market. Section 4.4 provides insight into specific characteristics of the optimized global policies. The setup for these numerical experiments is described in Section 4.1 and Section 4.2.

The market setup considered in this paper is very similar to the work of Coleman et al. (2007). The contingent claim to hedge is a lookback option of payoff $\Phi$ as in (2.2) with a time-to-maturity of 10 years (i.e. $T = 10$). The annualized continuous risk-free rate is set at 3% (i.e. $r = 0.03$) and $S^{(0,b)}_0 = 100$. In the design of hedging policies, the trading instruments considered are either the underlying, two options or six options. All options have a time-to-maturity of 1 year, are traded once and are held until expiration. For the case of two options, the hedging instruments available at the beginning of each year consist of at-the-money (ATM) calls and puts. With six options, three calls of moneynesses $K \in \{S_{t_n}, [\ldots] S_{t_n}, [\ldots] S_{t_n}\}$ and three puts of moneynesses $K \in \{S_{t_n}, [\ldots] S_{t_n}, [\ldots] S_{t_n}\}$ are available at the beginning of each year $t_n$. As for the underlying, both monthly and yearly rebalancing are considered in numerical experiments. Yearly time-steps are used for all hedging instruments (i.e. $N = 10$) except when hedging is done with the underlying on a monthly basis (i.e. $N = 120$).

Remark 4.1.

The methodological approach of Section 3 is in no way dependent on this choice of hedging instruments.

4.1.1 Global hedging penalties

The penalties studied for global hedging are the MSE and SMSE, and the respective optimization procedures are referred to as quadratic deep hedging (QDH) and semi-quadratic deep hedging (SQDH). While the MSE penalizes hedging gains and losses equally, the SMSE is more in line with the actual objectives of the hedger as it corresponds to an agent who penalizes only hedging losses, in proportion to their squared values. It is important to note that the computational cost of the deep hedging algorithm is close to invariant to the choice of loss function. The motivation for assessing the effectiveness of QDH is the popularity of the quadratic penalty in the global hedging literature.
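To make the difference between the two penalties concrete, the following is a minimal NumPy sketch of the empirical MSE and SMSE statistics applied to a batch of hedging errors. The function names and the toy batch are illustrative assumptions, not part of the paper's implementation; the sign convention is that a positive hedging error is a loss for the hedger.

```python
import numpy as np


def mse_penalty(errors):
    """Empirical MSE: gains and losses are penalized equally."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean(errors ** 2))


def smse_penalty(errors):
    """Empirical SMSE: only losses (positive hedging errors) are penalized."""
    errors = np.asarray(errors, dtype=float)
    losses = np.where(errors > 0.0, errors, 0.0)
    return float(np.mean(losses ** 2))


# Hypothetical batch of four hedging errors (payoff minus terminal portfolio value).
batch = np.array([-2.0, -1.0, 0.5, 3.0])
print(mse_penalty(batch))   # 3.5625
print(smse_penalty(batch))  # 2.3125
```

On this toy batch, the two gains (negative errors) contribute to the MSE but not to the SMSE, which is why a policy trained under the SMSE is free to pursue gains as long as losses stay small.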

The training of the LSTM is done as described in Section 3.2 on a training set of 350,000 paths with 150 epochs and a mini-batch size of 1,[…]. A validation set of […],000 paths is used to find the optimal set of trainable parameters out of the 150 epochs. More precisely, at the end of each epoch, the hedging metric associated with the penalty being optimized (i.e. MSE for QDH and SMSE for SQDH) is evaluated on the validation set at the current values of the trainable parameters. The optimal set of trainable parameters is approximated by the one that minimizes the empirical cost function on the validation set out of 150 epochs. The use of a validation set to select the number of epochs was found to significantly improve the out-of-sample hedging performance obtained with SQDH, while for QDH, the improvement was marginal.

All results presented in subsequent sections are from a test set (out-of-sample) of 75,000 paths. The structure of the LSTM is as in Definition 3.1 with two LSTM cells (i.e. $H = 2$) and 24 neurons per cell (i.e. $d_1 = d_2 = 24$). The Adam optimizer (Kingma and Ba (2014)) is used for all examples with a learning rate of 0.01 for QDH and […] for SQDH since a smaller learning rate was found to improve the training under the SMSE penalty.

Footnote: One epoch is defined as a complete iteration of SGD on the training set. For a training set and mini-batch size of respectively 350,000 and 1,[…]

Define $\{C^{\delta}_{t_n}\}_{n=0}^{N}$ as the discounted cumulative cost process associated with a trading strategy $\delta$:

$$C^{\delta}_{t_n} := B^{-1}_{t_n} V^{\delta}_{t_n} - G^{\delta}_{t_n}, \quad n = 0, \ldots, N.$$

Contrary to global hedging, local risk minimization results in strategies that are not necessarily self-financing. Indeed, the optimization of hedging strategies under this framework imposes the constraint that the terminal portfolio value exactly matches the payoff of the contingent claim, i.e. $V^{\delta}_T = \Phi(S^{(0,b)}_T, Z_T)$ $\mathbb{P}$-a.s., which can always be satisfied by the injection or withdrawal of capital at time $T$. Under this constraint, local risk minimization optimizes at each time-step, starting backward from time $T$, the positions in the assets which minimize the expected squared incremental cost. More precisely, for $n = N-1, \ldots, 0$, the optimization aims at finding $(\delta^{(0:D)}_{t_{n+1}}, \delta^{(B)}_{t_{n+1}})$ that minimize $\mathbb{E}[(C^{\delta}_{t_{n+1}} - C^{\delta}_{t_n})^2 \mid \mathcal{F}_{t_n}]$ at time $t_n$ with the constraint that $V^{\delta}_T = \Phi(S^{(0,b)}_T, Z_T)$ $\mathbb{P}$-a.s. The optimal initial capital amount to invest, denoted as $V^{\star}_0$, is also obtained as a result of this scheme. Once the trading strategy $\delta$ is optimized with the local risk minimization procedure, a self-financing strategy can be constructed by setting the initial portfolio value as $V^{\delta}_0 = V^{\star}_0$, by following the optimized trading strategy strictly for the risky assets (i.e. $\delta^{(0:D)}_{t_n}$ for $n = 1, \ldots, N$) and by adjusting positions in the risk-free asset such that the trading strategy is self-financing (i.e. respecting (2.5)). Hedging results presented in the numerical experiments of this section with local risk minimization are self-financing as per the latter description and are from the work of Coleman et al. (2007). For examples of numerical schemes to implement local risk procedures, the reader is referred to Coleman et al. (2006) or Augustyniak et al. (2017).

The motivation for benchmarking the global policies optimized with our methodological approach against local risk minimization is twofold. First, local risk procedures are popular for the risk mitigation of VA guarantees in the literature (e.g. Coleman et al. (2006), Coleman et al. (2007), Kélani and Quittard-Pinon (2017), Trottier et al. (2018b) and Trottier et al. (2018a)). Second, in the context of hedging European vanilla options of maturity one to three years, Augustyniak et al. (2017) showed that global quadratic hedging with the underlying improves upon the downside risk reduction over local risk minimization. The question remains whether the latter holds for longer maturities and when liquid options are used as hedging instruments.

The hedging metrics considered for the benchmarking of the different trading policies include the root-mean-square error (RMSE) and the semi-RMSE (i.e. the root of the SMSE statistic).
Tail risk metrics are also studied with the Value-at-Risk (VaR) and the Conditional Value-at-Risk (CVaR, Rockafellar and Uryasev (2002)). For an absolutely continuous integrable random variable $X$, the CVaR at confidence level $\alpha$ has the following representation:

$$\mathrm{CVaR}_\alpha(X) := \mathbb{E}[X \mid X \geq \mathrm{VaR}_\alpha(X)], \quad \alpha \in (0, 1), \tag{4.1}$$

where $\mathrm{VaR}_\alpha(X) := \min_x \{x \mid \mathbb{P}(X \leq x) \geq \alpha\}$ is the VaR at confidence level $\alpha$. The $\mathrm{CVaR}_\alpha$ represents tail risk by averaging all hedging errors larger than the $\alpha$-th percentile of the distribution of hedging errors (i.e. the $\mathrm{VaR}_\alpha$ metric). Hedging statistics presented in subsequent sections are estimated with conventional empirical estimators on the test set.

Footnote: All dynamics assumed for the underlying in Section 4 imply that hedging errors are absolutely continuous integrable random variables.

The choice of dynamics for the underlying is motivated by the objective of studying the optimized global policies under different stylized features of the financial market. It is important to recall that deep hedging is a model-free reinforcement learning approach: the LSTM is never explicitly told the dynamics of the financial market during its training phase. Instead, the neural network must learn through many simulations of a market generator how to dynamically adapt its embedded […]

[…] $\mathbb{Q}$ equivalent to $\mathbb{P}$ such that $\{e^{-r t_n} S^{(b)}_{t_n}\}_{n=0}^{N}$ is an $(\mathcal{F}, \mathbb{Q})$-martingale (see, for instance, Delbaen and Schachermayer (1994)). Let $y_{t_n} := \log(S^{(0,b)}_{t_n}/S^{(0,b)}_{t_{n-1}})$ be the periodic log-return of the underlying, and $\{\epsilon^{\mathbb{P}}_{t_n}\}_{n=1}^{N}$ and $\{\epsilon^{\mathbb{Q}}_{t_n}\}_{n=1}^{N}$ be sequences of independent standard normal random variables under respectively $\mathbb{P}$ and $\mathbb{Q}$. The dynamics of both models are now formally defined.

The discrete BSM assumes that log-returns are i.i.d. normal random variables of periodic mean and variance of respectively $(\mu - \sigma^2/2)\Delta_N$ and $\sigma^2 \Delta_N$:

$$y_{t_n} = \left(\mu - \frac{\sigma^2}{2}\right)\Delta_N + \sigma\sqrt{\Delta_N}\,\epsilon^{\mathbb{P}}_{t_n}, \quad n = 1, \ldots, N, \tag{4.2}$$

where $\mu \in \mathbb{R}$ and $\sigma > 0$.

The MJD model extends the BSM by assuming the presence of random jumps in the underlying stock price. More precisely, let $\{\zeta^{\mathbb{P}}_k\}_{k=1}^{\infty}$ be independent normal random variables of mean $\mu_J$ and variance $\sigma_J^2$, and $\{N^{\mathbb{P}}_{t_n}\}_{n=0}^{N}$ be values of a Poisson process of intensity $\lambda > 0$, where $\{\zeta^{\mathbb{P}}_k\}_{k=1}^{\infty}$, $\{N^{\mathbb{P}}_{t_n}\}_{n=0}^{N}$ and $\{\epsilon^{\mathbb{P}}_{t_n}\}_{n=1}^{N}$ are independent. Periodic log-returns under this model can be stated as follows:

$$y_{t_n} = \left(\alpha - \lambda\left(e^{\mu_J + \sigma_J^2/2} - 1\right) - \frac{\sigma^2}{2}\right)\Delta_N + \sigma\sqrt{\Delta_N}\,\epsilon^{\mathbb{P}}_{t_n} + \sum_{k = N^{\mathbb{P}}_{t_{n-1}} + 1}^{N^{\mathbb{P}}_{t_n}} \zeta^{\mathbb{P}}_k, \tag{4.3}$$

where $\{\alpha, \mu_J, \sigma_J, \lambda, \sigma\}$ are the model parameters with $\{\alpha, \lambda, \sigma\}$ being on a yearly scale, $\alpha \in \mathbb{R}$ and $\sigma > 0$.

Footnote: We adopt the convention that if $N^{\mathbb{P}}_{t_n} = N^{\mathbb{P}}_{t_{n-1}}$, then $\sum_{k = N^{\mathbb{P}}_{t_{n-1}} + 1}^{N^{\mathbb{P}}_{t_n}} \zeta^{\mathbb{P}}_k = 0$.

By a discrete-time version of the Girsanov theorem, there exists an $\mathcal{F}$-adapted market price of risk process $\{\varphi_{t_n}\}_{n=1}^{N}$ such that

$$\epsilon^{\mathbb{Q}}_{t_n} = \epsilon^{\mathbb{P}}_{t_n} - \varphi_{t_n}, \quad n = 1, \ldots, N. \tag{4.4}$$

For $n = 1, \ldots, N$, let $\varphi_{t_n} := -\sqrt{\Delta_N}\left(\frac{\mu - r}{\sigma}\right)$. By replacing $\epsilon^{\mathbb{P}}_{t_n} = \epsilon^{\mathbb{Q}}_{t_n} + \varphi_{t_n}$ into (4.2), it is straightforward to obtain the $\mathbb{Q}$-dynamics of log-returns:

$$y_{t_n} = \left(r - \frac{\sigma^2}{2}\right)\Delta_N + \sigma\sqrt{\Delta_N}\,\epsilon^{\mathbb{Q}}_{t_n}, \quad n = 1, \ldots, N. \tag{4.5}$$

The pricing of European calls and puts used as hedging instruments under this model is done with the well-known Black-Scholes closed-form solutions.

The change of measure considered for the MJD model is the same as the one from Coleman et al. (2007). Let $\{\zeta^{\mathbb{Q}}_k\}_{k=1}^{\infty}$ be independent normal random variables under $\mathbb{Q}$ of mean $\tilde{\mu}_J$ and variance $\tilde{\sigma}_J^2$, and $\{N^{\mathbb{Q}}_{t_n}\}_{n=0}^{N}$ be values of a Poisson process of intensity $\tilde{\lambda} > 0$, where $\{\zeta^{\mathbb{Q}}_k\}_{k=1}^{\infty}$, $\{N^{\mathbb{Q}}_{t_n}\}_{n=0}^{N}$ and $\{\epsilon^{\mathbb{Q}}_{t_n}\}_{n=1}^{N}$ are independent. The $\mathbb{Q}$-dynamics of log-returns can be stated as follows:

$$y_{t_n} = \left(r - \tilde{\lambda}\left(e^{\tilde{\mu}_J + \tilde{\sigma}_J^2/2} - 1\right) - \frac{\sigma^2}{2}\right)\Delta_N + \sigma\sqrt{\Delta_N}\,\epsilon^{\mathbb{Q}}_{t_n} + \sum_{k = N^{\mathbb{Q}}_{t_{n-1}} + 1}^{N^{\mathbb{Q}}_{t_n}} \zeta^{\mathbb{Q}}_k,$$

where $\tilde{\sigma}_J^2 := \sigma_J^2$, $\tilde{\mu}_J := \mu_J - (1 - \gamma)\sigma_J^2$ and $\tilde{\lambda} := \lambda e^{-(1 - \gamma)\left(\mu_J - \frac{1}{2}(1 - \gamma)\sigma_J^2\right)}$ with $\gamma \leq$ […] $\gamma = -$[…].5. The value of the risk aversion parameter implies more frequent and more negative jumps on average under $\mathbb{Q}$ than under $\mathbb{P}$ by increasing $\tilde{\lambda}$ and decreasing $\tilde{\mu}_J$. The pricing of European calls and puts used as hedging instruments under the MJD model is done with the well-known closed-form solutions.

Table 1: Parameters of the Black-Scholes model.
Columns: $\mu$, $\sigma$. [Numerical entries not recoverable.]
Notes: $\mu$ and $\sigma$ are on an annual basis.

Table 2: Parameters of the Merton jump-diffusion model.
Columns: $\alpha$, $\sigma$, $\lambda$, $\mu_J$, $\sigma_J$, $\gamma$. [Numerical entries not recoverable.]
Notes: $\alpha$, $\sigma$ and $\lambda$ are on an annual basis.

In this section, the hedging effectiveness of QDH, SQDH and local risk minimization is assessed under various market settings. The analysis starts in Section 4.3.1 by comparing the performance of QDH and local risk minimization, as both approaches are optimized with a quadratic criterion; the benchmarking of the global hedging policies embedded in QDH and SQDH is done in Section 4.3.2.

4.3.1 QDH and local risk minimization benchmark

Table 3 and Table 4 present hedging statistics of QDH and local risk minimization under respectively the BSM and MJD model. For comparative purposes, the initial capital investment is set to the optimized value obtained as a result of the local risk minimization procedure of Coleman et al. (2007) for all examples. We note that this choice naturally puts QDH at a disadvantage.
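The paths behind these statistics are generated from the $\mathbb{P}$-dynamics (4.2) and (4.3). The following is a minimal NumPy sketch of such a market generator for periodic log-returns; the function names, the seed and the parameter values in the usage lines are illustrative assumptions and are not taken from Table 1 or Table 2. The jump term uses the fact that a sum of $k$ i.i.d. $N(\mu_J, \sigma_J^2)$ variables is $N(k\mu_J, k\sigma_J^2)$.

```python
import numpy as np


def simulate_bsm_log_returns(n_paths, n_steps, dt, mu, sigma, rng):
    """Periodic log-returns under the discrete BSM, as in (4.2)."""
    eps = rng.standard_normal((n_paths, n_steps))
    return (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * eps


def simulate_mjd_log_returns(n_paths, n_steps, dt, alpha, sigma,
                             lam, mu_j, sigma_j, rng):
    """Periodic log-returns under the MJD model, as in (4.3)."""
    eps = rng.standard_normal((n_paths, n_steps))
    # Number of jumps in each period: Poisson with mean lam * dt.
    n_jumps = rng.poisson(lam * dt, size=(n_paths, n_steps))
    # Sum of n_jumps i.i.d. normal jump sizes, drawn in one shot (zero if no jump).
    z = rng.standard_normal((n_paths, n_steps))
    jump_sum = n_jumps * mu_j + np.sqrt(n_jumps) * sigma_j * z
    drift = (alpha - lam * (np.exp(mu_j + 0.5 * sigma_j ** 2) - 1.0)
             - 0.5 * sigma ** 2) * dt
    return drift + sigma * np.sqrt(dt) * eps + jump_sum


def log_returns_to_prices(log_returns, s0):
    """Price paths including the initial value s0 in the first column."""
    cum = np.cumsum(log_returns, axis=1)
    start = np.zeros((log_returns.shape[0], 1))
    return s0 * np.exp(np.hstack([start, cum]))


# Illustrative usage: 10 years of monthly steps (N = 120), hypothetical parameters.
rng = np.random.default_rng(seed=0)
y = simulate_mjd_log_returns(1000, 120, 1.0 / 12.0, alpha=0.10, sigma=0.15,
                             lam=0.10, mu_j=-0.20, sigma_j=0.15, rng=rng)
prices = log_returns_to_prices(y, s0=100.0)
print(prices.shape)  # (1000, 121)
```

Setting `lam` to zero removes both the jumps and the jump compensator from the drift, so the MJD generator collapses to the BSM generator with $\mu = \alpha$.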

Table 3: Benchmarking of quadratic deep hedging (QDH) and local risk minimization to hedge the lookback option of $T = 10$ years under the BSM.
Columns: Local risk minimization ($V^{\delta}_0$, RMSE, VaR$_{0.[\ldots]}$, CVaR$_{0.[\ldots]}$); QDH (RMSE, VaR$_{0.[\ldots]}$, CVaR$_{0.[\ldots]}$). Rows: one per hedging instrument. [Numerical entries not recoverable.]
Notes: […] $\mu = 0.[\ldots]$, $\sigma = 0.[\ldots]$, $r = 0.03$ and $S^{(0,b)}_0 = 100$ (see Section 4.2.1 for model description under $\mathbb{P}$ and Section 4.2.3 for the risk-neutral dynamics used for option pricing). Hedging instruments: monthly and yearly underlying, yearly ATM call and put options (two options) and three yearly calls and puts of strikes $K = \{S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}\}$ and $K = \{S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}\}$ (six options). Results for local risk minimization and initial portfolio values $V^{\delta}_0$ are from Table 3 of Coleman et al. (2007). Results for QDH are computed based on 75,000 independent paths generated from the BSM under $\mathbb{P}$. Training of the neural networks is done as described in Section 4.1.2.

Table 4: Benchmarking of quadratic deep hedging (QDH) and local risk minimization to hedge the lookback option of $T = 10$ years under the MJD model.
Columns: Local risk minimization ($V^{\delta}_0$, RMSE, VaR$_{0.[\ldots]}$, CVaR$_{0.[\ldots]}$); QDH (RMSE, VaR$_{0.[\ldots]}$, CVaR$_{0.[\ldots]}$). Rows: one per hedging instrument. [Numerical entries not recoverable.]
Notes: […] $\alpha = 0.[\ldots]$, $\sigma = 0.[\ldots]$, $\lambda = 0.[\ldots]$, $\mu_J = -[\ldots]$, $\sigma_J = 0.[\ldots]$, $\gamma = -[\ldots]$, $r = 0.03$ and $S^{(0,b)}_0 = 100$ (see Section 4.2.2 for model description under $\mathbb{P}$ and Section 4.2.4 for the risk-neutral dynamics used for option pricing). Hedging instruments: monthly and yearly underlying, yearly ATM call and put options (two options) and three yearly calls and puts of strikes $K = \{S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}\}$ and $K = \{S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}\}$ (six options). Results for local risk minimization and initial portfolio values $V^{\delta}_0$ are from Table 4 of Coleman et al. (2007). Results for QDH are computed based on 75,000 independent paths generated from the MJD model under $\mathbb{P}$. Training of the neural networks is done as described in Section 4.1.2.

Since QDH optimizes the MSE penalty, it was expected to outperform local risk minimization on the RMSE metric. The question remained whether QDH also improved upon the downside risk captured by the VaR and CVaR statistics. Numerical results under both dynamics demonstrate that QDH outperforms local risk minimization across all downside risk metrics and all hedging instruments. The risk reduction obtained with QDH over local risk minimization is most impressive with six options: the percentage decreases for respectively the RMSE, VaR and CVaR statistics are 33%, 52% and 36% under the BSM and 27%, 38% and 30% under the MJD model. As for hedging with the underlying on a monthly and yearly basis as well as with two options, the improvement of QDH over local risk minimization for the three hedging statistics ranges from 5% to 13% under the BSM and from 8% to 20% under the MJD model, except for the VaR metric with the stock on a monthly basis under the MJD dynamics, which achieves a 30% reduction. These results demonstrate that the use of a global procedure rather than a local procedure provides better hedging performance.

Footnote: The hedging statistics presented in Table 3 and Table 4 are the ones considered in Coleman et al. (2007). Additional hedging statistics for QDH are presented in Section 4.3.2.

The benchmarking of QDH and SQDH policies is now presented with the same setup as in the previous section except for the initial capital investment, which is set as the risk-neutral price of the lookback option under both dynamics for all hedging instruments: 17.7$ for BSM and 25.3$ for MJD. This choice is motivated by the objective of comparing on common grounds the results obtained across the different hedging instruments for both global hedging approaches. Table 5 and Table 6 present descriptive statistics of the hedging shortfall obtained with QDH and SQDH under respectively the BSM and MJD model.

Footnote: Risk-neutral prices of the lookback option were estimated with simulations for both dynamics.

Table 5: Benchmarking of quadratic deep hedging (QDH) and semi-quadratic deep hedging (SQDH) to hedge the lookback option of $T = 10$ years under the BSM.
Columns: Mean, RMSE, semi-RMSE, VaR$_{0.95}$, VaR$_{0.99}$, CVaR$_{0.95}$, CVaR$_{0.99}$, Skew. Row groups: QDH and SQDH, each with one row per hedging instrument. [Numerical entries not recoverable.]
Notes: […] $\mu = 0.[\ldots]$, $\sigma = 0.[\ldots]$, $r = 0.03$, $S^{(0,b)}_0 = 100$ and $V^{\delta}_0 = 17.[\ldots]$ (see Section 4.2.1 for model description under $\mathbb{P}$ and Section 4.2.3 for the risk-neutral dynamics used for option pricing). Hedging instruments: monthly and yearly underlying, yearly ATM call and put options (two options) and three yearly calls and puts of strikes $K = \{S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}\}$ and $K = \{S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}\}$ (six options). Results for each penalty are computed based on 75,000 independent paths generated from the BSM under $\mathbb{P}$. Training of the neural networks is done as described in Section 4.1.2.

Table 6: Benchmarking of quadratic deep hedging (QDH) and semi-quadratic deep hedging (SQDH) to hedge the lookback option of $T = 10$ years under the MJD model.
Columns: Mean, RMSE, semi-RMSE, VaR$_{0.95}$, VaR$_{0.99}$, CVaR$_{0.95}$, CVaR$_{0.99}$, Skew. Row groups: QDH and SQDH, each with one row per hedging instrument. [Numerical entries not recoverable.]
Notes: […] $\alpha = 0.[\ldots]$, $\sigma = 0.[\ldots]$, $\lambda = 0.[\ldots]$, $\mu_J = -[\ldots]$, $\sigma_J = 0.[\ldots]$, $\gamma = -[\ldots]$, $r = 0.03$, $S^{(0,b)}_0 = 100$ and $V^{\delta}_0 = 25.[\ldots]$ (see Section 4.2.2 for model description under $\mathbb{P}$ and Section 4.2.4 for the risk-neutral dynamics used for option pricing). Hedging instruments: monthly and yearly underlying, yearly ATM call and put options (two options) and three yearly calls and puts of strikes $K = \{S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}\}$ and $K = \{S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}, [\ldots] S^{(0,b)}_{t_n}\}$ (six options). Results for each penalty are computed based on 75,000 independent paths generated from the MJD model. Training of the neural networks is done as described in Section 4.1.2.

Numerical results indicate that, as compared to QDH, SQDH policies result in downside risk metrics two to three times smaller for almost all examples and earn significant gains across all hedging instruments (i.e. negative mean hedging errors). While QDH minimizes the RMSE statistic, the downside risk captured by the semi-RMSE, VaR$_\alpha$ and CVaR$_\alpha$ statistics for $\alpha$ equal to 0.95 and 0.99 is always significantly reduced by SQDH policies. Indeed, the downside risk reduction with SQDH over QDH in the latter hedging statistics ranges from 51% to 85% under the BSM and from 45% to 76% under the MJD model. These impressive gains in risk reduction can be attributed to the fact that QDH penalizes upside and downside risk equally, while SQDH penalizes only hedging losses, in proportion to their squared values. Furthermore, hedging statistics also indicate that SQDH policies achieve significant gains under both models and across all hedging instruments, to a lesser extent for six options. We observe that hedging with the underlying on a yearly basis results in the largest expected gains, followed by monthly underlying, two options and six options. All of these results clearly demonstrate that SQDH policies should be prioritized over QDH policies as they are tailor-made to match the financial objectives of the hedger by always significantly reducing the downside risk as well as earning positive returns on average. Section 4.4 sheds some light on specific characteristics of the SQDH policies which result in these large average hedging gains and downside risk reduction. Moreover, it is also interesting to note that the distinct treatment of hedging shortfalls by each penalty has a direct implication on the skewness statistic. Indeed, by strictly optimizing squared hedging losses, SQDH effectively minimizes the right tail of hedging errors, which entails negative skewness. As for QDH, the positive skewness for all examples can be explained by the fact that the payoff of the lookback option is highly positively asymmetric since it is bounded below at zero and has no upper bound.

Lastly, Coleman et al. (2007) observed with local risk minimization that while hedging with six options always results in better policies in terms of hedging effectiveness, the relative performance of using yearly ATM call and put options (i.e. two options) or the underlying on a monthly basis depends on the dynamics of the risky asset. The same conclusions can be drawn from our results obtained with global hedging. Indeed, hedging statistics of both QDH and SQDH policies under the Black-Scholes dynamics in Table 5 show that the downside risk metrics are most often only slightly better with two options as compared to hedging with the underlying on a monthly basis. On the other hand, values from Table 6 indicate that hedging with two options under the MJD model results in downside risk metrics at least two times smaller than with the underlying on a monthly basis for both QDH and SQDH.
This observation stems from the fact that hedging with options is significantly more effective than with the underlying in the presence of jump risk. Thus, our results show that the observation made by Coleman et al. (2007) with respect to the significant improvement in hedging effectiveness of local risk minimization with options in the presence of jump risk also holds for both QDH and SQDH policies.
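The statistics compared throughout this section are described as conventional empirical estimators on the test set. A minimal NumPy sketch of one such set of estimators is given below, following the definitions of the RMSE, semi-RMSE, VaR and CVaR in (4.1); the function names and the order-statistic convention used for the empirical VaR are assumptions, since the paper does not spell out its exact estimators.

```python
import numpy as np


def rmse(errors):
    """Root-mean-square error of hedging errors."""
    errors = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))


def semi_rmse(errors):
    """Root of the SMSE statistic: only positive errors (losses) count."""
    errors = np.asarray(errors, dtype=float)
    losses = np.where(errors > 0.0, errors, 0.0)
    return float(np.sqrt(np.mean(losses ** 2)))


def value_at_risk(errors, alpha):
    """Smallest sample value x whose empirical CDF satisfies P(X <= x) >= alpha."""
    x = np.sort(np.asarray(errors, dtype=float))
    # Small tolerance guards against floating-point noise in alpha * n.
    k = int(np.ceil(alpha * x.size - 1e-12))
    return float(x[max(k, 1) - 1])


def conditional_value_at_risk(errors, alpha):
    """Average of the sample values at or above the empirical VaR."""
    errors = np.asarray(errors, dtype=float)
    var = value_at_risk(errors, alpha)
    return float(errors[errors >= var].mean())


# Toy sample of hedging errors 1, 2, ..., 10.
sample = np.arange(1.0, 11.0)
print(value_at_risk(sample, 0.8))              # 8.0
print(conditional_value_at_risk(sample, 0.8))  # 9.0
```

On a test set of 75,000 paths, the same functions would be called on the vector of simulated hedging errors with `alpha` set to 0.95 or 0.99.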

While the previous section assessed the hedging performance of QDH and SQDH with various hedging instruments and different market scenarios, the current section provides insights into specific characteristics of the optimized global policies. The analysis starts by comparing the average equity risk exposure of QDH and SQDH policies, also called average exposure for convenience, with the same dynamics for the underlying as in previous sections (i.e. BSM and MJD model). The motivation is to assess whether either the MSE or SMSE penalty results in hedging policies more geared towards being long equity risk, and thus earning the equity risk premium. In this paper, the equity risk exposure is measured as the average portfolio delta over one complete path of the financial market. More formally, for $(\delta^{(0:D)}_{t_{n+1}}, \delta^{(B)}_{t_{n+1}})$ given and fixed, the portfolio delta at the beginning of year $t_n$, denoted as $\tilde{\Delta}^{(pf)}_{t_n}$, is defined as

$$\tilde{\Delta}^{(pf)}_{t_n} := \frac{\partial V^{\delta}_{t_n}}{\partial S^{(0,b)}_{t_n}} = \frac{\partial}{\partial S^{(0,b)}_{t_n}}\left(\delta^{(0:D)}_{t_{n+1}} \bullet \bar{S}^{(b)}_{t_n} + \delta^{(B)}_{t_{n+1}} B_{t_n}\right) = \delta^{(0)}_{t_{n+1}} + \sum_{j=1}^{D} \delta^{(j)}_{t_{n+1}} \tilde{\Delta}^{(j)},$$

where $\tilde{\Delta}^{(j)}$ is the $j$-th option delta (i.e. $\tilde{\Delta}^{(j)} = \partial S^{(j,b)}_{t_n} / \partial S^{(0,b)}_{t_n}$). Note that $\tilde{\Delta}^{(j)}$ is time-independent since the calls and puts used for hedging always have the same characteristics at each trading date (i.e. same moneyness and maturity) and both risky asset models are homoskedastic, which entails that the underlying returns have the same conditional distribution for all time-steps. The $\tilde{\Delta}^{(j)}$ can be computed with the well-known closed-form solutions under both models. For a total of $\tilde{N}$ simulated paths, the average exposure is computed as follows:

$$\bar{\Delta}^{(pf)} := \frac{1}{\tilde{N} N} \sum_{k=1}^{\tilde{N}} \sum_{n=0}^{N-1} \tilde{\Delta}^{(pf)}_{t_n, k},$$

where $\tilde{\Delta}^{(pf)}_{t_n, k}$ is the time-$t_n$ portfolio delta of the $k$-th simulated path. Results presented below for average exposures are from the test set.
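The average exposure defined above can be sketched in a few lines of NumPy. The array layout (paths by time-steps by options), the function names and the parameter values in the usage lines are illustrative assumptions; the option delta shown is the Black-Scholes closed form only, so the MJD case would need its own delta formula.

```python
import math

import numpy as np


def bs_delta(moneyness, r, sigma, tau, is_call):
    """Black-Scholes delta of a European option; moneyness is K / S."""
    d1 = (-math.log(moneyness) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    cdf = 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))
    return cdf if is_call else cdf - 1.0


def average_exposure(underlying_pos, option_pos, option_deltas):
    """Average portfolio delta over all paths and time-steps.

    underlying_pos: shape (n_paths, n_steps), positions in the underlying.
    option_pos:     shape (n_paths, n_steps, D), positions in the D options.
    option_deltas:  shape (D,), time-independent option deltas.
    """
    underlying_pos = np.asarray(underlying_pos, dtype=float)
    option_pos = np.asarray(option_pos, dtype=float)
    option_deltas = np.asarray(option_deltas, dtype=float)
    # Portfolio delta per path and time-step: underlying position plus
    # delta-weighted option positions.
    portfolio_delta = underlying_pos + option_pos @ option_deltas
    return float(portfolio_delta.mean())


# Illustrative usage with two one-year ATM options (call and put), hypothetical inputs.
deltas = [bs_delta(1.0, 0.03, 0.15, 1.0, True), bs_delta(1.0, 0.03, 0.15, 1.0, False)]
rng = np.random.default_rng(seed=0)
positions = rng.normal(size=(1000, 10, 2))  # stand-in for LSTM outputs
print(average_exposure(np.zeros((1000, 10)), positions, deltas))
```

When hedging only with the underlying, `option_pos` has a last dimension of size zero and the average exposure reduces to the mean position in the underlying.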
Table 7 presents average exposures of QDH and SQDH policies with the same market setup as in previous sections with respect to hedging instruments, model parameters and lookback option to hedge. The initial capital investments are again set as the risk-neutral price of the lookback option under each dynamics (i.e. \$17.7 and \$25.3 for BSM and MJD). Numerical results indicate that on average, SQDH policies are significantly more bullish than QDH policies under both dynamics and for all hedging instruments, although to a lesser extent with six options. This tendency of SQDH policies to be more geared towards being long equity risk through a larger average exposure is most pronounced with the underlying on a yearly basis, followed by monthly trading in the underlying, two options and six options.

Table 7: Average equity exposures with quadratic deep hedging (QDH) and semi-quadratic deep hedging (SQDH) for the lookback option of T = 10 years under the BSM and MJD models.

                        BSM                   MJD
                   QDH       SQDH        QDH       SQDH
Stock (year)      −0.10      0.18       −0.14      0.17
Stock (month)     −0.[…]    −0.[…]      −0.15      0.[…]
Two options       −0.[…]    −0.[…]      −0.[…]    −0.[…]
Six options       −0.[…]    −0.[…]      −0.[…]    −0.[…]

Notes: $S^{(0,b)}_0 = 100$ and $r = 0.[…]$. $\mathbb{P}$ and $\mathbb{Q}$ are described in Section 4.2 (see Table 1 and Table 2 for parameter values). Initial capital investments are respectively \$17.7 and \$25.3 under BSM and MJD. Hedging instruments: monthly and yearly underlying, yearly ATM call and put options (two options), and three yearly calls and puts of strikes $K = \{S^{(0,b)}_{t_n}, […]S^{(0,b)}_{t_n}, […]S^{(0,b)}_{t_n}\}$ and $K = \{S^{(0,b)}_{t_n}, […]S^{(0,b)}_{t_n}, […]S^{(0,b)}_{t_n}\}$ (six options). Results for QDH and SQDH are computed based on 75,000 independent paths generated from the BSM and MJD models under $\mathbb{P}$. Training of the neural networks is done as described in Section 4.1.2.

The observation that the average exposure of SQDH policies is only slightly larger than the average exposure of QDH policies when hedging with six options is consistent with benchmarks presented in previous sections. Indeed, values from Table 5 and Table 6 show that the absolute difference between the hedging statistics of QDH and SQDH is by far the smallest with six options. This implies that the hedging positions of quadratic and non-quadratic policies are on average more similar with six options than with the other hedging instruments, which results in relatively closer average equity exposures. One direct implication of the larger average exposure of SQDH policies is that in the risk management of the lookback option, SQDH should result in positive expected gains. This was in fact observed in the benchmarking of global policies presented in Table 5 and Table 6, where SQDH resulted in negative mean hedging error statistics (i.e. mean hedging gains) under both risky asset dynamics. It is worth noting that Trottier et al. (2018a) developed local risk minimization strategies for long-term options which also earned positive returns on average and reduced downside risk as compared to delta-hedging by having larger equity risk exposures.

[…].4.2 Analysis of SQDH bullishness

The distinctive feature of SQDH policies, namely holding a larger average equity exposure than QDH policies, can firstly be explained by the impact of hedging gains and losses on the optimized policies as measured by each penalty.
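The asymmetry between the two penalties can be illustrated with a minimal Python sketch. It assumes the hedging error is defined as the option payoff minus the terminal portfolio value, so that positive values are losses and negative values are gains, in line with the sign convention used for the mean hedging error above. The function names and sample values are hypothetical.

```python
def mse(errors):
    """Mean-square error: gains and losses are penalized alike."""
    return sum(e * e for e in errors) / len(errors)


def smse(errors):
    """Semi-mean-square error: only losses (positive errors) contribute."""
    return sum(e * e for e in errors if e > 0.0) / len(errors)


# Hypothetical terminal hedging errors on four paths (negative = gain).
base = [-4.0, -1.0, 0.5, 2.0]
# Same losses, but larger gains on the first two paths.
more_gains = [-6.0, -3.0, 0.5, 2.0]

print(mse(base), mse(more_gains))    # the MSE increases when gains grow
print(smse(base), smse(more_gains))  # the SMSE is unchanged
```

Since additional gains raise the MSE but leave the SMSE untouched, a policy trained under the SMSE has no incentive to cut its equity exposure when the portfolio is already expected to exceed the payoff.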
On the one hand, by minimizing the MSE statistic in a market with positive expected log-returns for the underlying, as implied by the parameter values of both models, QDH policies have to be less bullish whenever the hedging portfolio value at maturity is expected to be larger than the lookback option payoff. On the other hand, SQDH policies are penalized only for hedging losses, proportionally to their squared values, and not for hedging gains. This entails that SQDH policies are not constrained to reduce their equity risk exposure when the hedging portfolio value is expected to be larger than the lookback option payoff. The second important factor which contributes to SQDH bullishness, specifically when hedging is done with the underlying, is the capacity of deep agents to learn to benefit from time diversification of risk. In the context of this study, time diversification of risk refers to the fact that investing in stocks over a long-term horizon reduces the risk of observing large losses as compared to short-term investments. Average exposure values in Table 7 indicate that deep agents hedging with the underlying and penalized with the SMSE have learned to hold a larger equity risk exposure than under the MSE penalty to benefit simultaneously from the positive expected returns of the underlying and from the downside risk reduction with time diversification of risk. This effect is most pronounced with the underlying on a yearly basis, with SQDH obtaining average exposures of 0.18 and 0.17 under the Black-Scholes and the MJD dynamics, respectively, as compared to −0.10 and −0.14 with QDH.

Moreover, it is very interesting to note that the deep agents rely more on time diversification of risk in the presence of jump risk, i.e. with the MJD dynamics. Indeed, the average exposure difference between SQDH and QDH policies with the underlying is significantly larger under the MJD dynamics, with a difference of 0.31 and 0.22 for yearly and monthly trading as compared to 0.28 and 0.09 under the BSM (for instance, the average exposure difference between SQDH and QDH with the underlying on a yearly basis under the MJD model is 0.17 − (−0.14) = 0.31). The latter observations can be explained by the fact that as […]

This paper studies global hedging strategies of long-term financial derivatives with a reinforcement learning approach. A similar financial market setup to the work of Coleman et al. (2007) is considered by studying the impact of equity risk with jump risk for the equity on the hedging effectiveness of segregated funds GMMBs. In the context of this paper, the latter guarantee is equivalent to holding a short position in a long-term lookback option of fixed maturity. The deep hedging algorithm of Buehler et al. (2019a) is applied to optimize long short-term memory networks representing global hedging policies with the mean-square error (MSE) and semi-mean-square error (SMSE) penalties and with various hedging instruments (e.g. standard options and the underlying).

Monte Carlo simulations are performed under the Black-Scholes model (BSM) and the Merton jump-diffusion (MJD) model to benchmark the hedging effectiveness of quadratic deep hedging (QDH) and semi-quadratic deep hedging (SQDH). Numerical results showed that under both dynamics and across all trading instruments, SQDH results in hedging policies which simultaneously reduce downside risk and increase expected returns as compared to QDH. The downside risk reduction achieved with SQDH over QDH ranges from 51% to 85% under the BSM and from 45% to 76% under the MJD model. Numerical experiments also indicated that QDH outperforms the local risk minimization scheme of Coleman et al. (2007) across all downside risk metrics and all hedging instruments. Thus, our results clearly demonstrate that SQDH policies should be prioritized, as they are tailor-made to match the financial objectives of the hedger by significantly reducing downside risk as well as resulting in large expected positive returns.

Monte Carlo experiments are also done to provide insight into specific characteristics of the optimized global policies.
Numerical results showed that on average, SQDH policies are significantly more bullish than QDH policies for every example considered. Analyses presented in this paper indicate that the bullishness of SQDH policies stems from the impact of hedging gains and losses on the optimized policies as measured by each penalty. Furthermore, an additional factor which contributes to the larger average equity exposure of SQDH policies when hedging with the underlying is the capacity of deep agents to learn to benefit from time diversification of risk. The latter was shown to be most important in the presence of jump risk for the equity: deep agents penalized with the SMSE learned, by experiencing many simulations of the financial market, to rely more on time diversification of risk through larger positions in the underlying than when trained on the Black-Scholes dynamics, due to the lesser efficiency of hedging with the underlying in the presence of jumps.

Further research in the area of global hedging for long-term contingent claims with the deep hedging algorithm would prove worthwhile. The analysis of the impact of additional equity risk factors (e.g. volatility risk and regime risk) on the optimized policies would be of interest. The same methodological approach presented in this paper could be applied with the addition of these equity risk factors with close to no modification to the algorithm. Moreover, robustness analysis of the optimized policies when the dynamics experienced differ slightly from the ones used to train the neural networks would prove worthwhile. The inclusion of realistic transaction costs for each hedging instrument could also be considered following the methodology of the original work of Buehler et al. (2019a).

References

Abadi, M. et al. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Almahdi, S. and Yang, S. Y. (2017). An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown. Expert Systems with Applications, 87:267–279.
Ankirchner, S., Schneider, J. C., and Schweizer, N. (2014). Cross-hedging minimum return guarantees: Basis and liquidity risks. Journal of Economic Dynamics and Control, 41:93–109.
Augustyniak, M. and Boudreault, M. (2017). Mitigating interest rate risk in variable annuities: An analysis of hedging effectiveness under model risk. North American Actuarial Journal, 21(4):502–525.
Augustyniak, M., Godin, F., and Simard, C. (2017). Assessing the effectiveness of local and global quadratic hedging under GARCH models. Quantitative Finance, 17(9):1305–1318.
Bacinello, A. R. (2003). Fair valuation of a guaranteed life insurance participating contract embedding a surrender option. Journal of Risk and Insurance, 70(3):461–487.
Bauer, D., Kling, A., and Russ, J. (2008). A universal pricing framework for guaranteed minimum benefits in variable annuities. ASTIN Bulletin: The Journal of the IAA, 38(2):621–651.
Becker, S., Cheridito, P., and Jentzen, A. (2019). Deep optimal stopping. Journal of Machine Learning Research, 20:1–25.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
Bertsimas, D., Kogan, L., and Lo, A. W. (2001). Hedging derivative securities and incomplete markets: an ε-arbitrage approach. Operations Research, 49(3):372–397.
Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637–654.
Boyle, P. P. and Hardy, M. R. (1997). Reserving for maturity guarantees: Two approaches. Insurance: Mathematics and Economics, 21(2):113–127.
Boyle, P. P. and Schwartz, E. S. (1977). Equilibrium prices of guarantees under equity-linked contracts. Journal of Risk and Insurance, 44:639–660.
Brennan, M. J. and Schwartz, E. S. (1976). The pricing of equity-linked life insurance policies with an asset value guarantee. Journal of Financial Economics, 3(3):195–213.
Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019a). Deep hedging. Quantitative Finance, 19(8):1271–1291.
Buehler, H., Gonon, L., Teichmann, J., Wood, B., Mohan, B., and Kochems, J. (2019b). Deep hedging: hedging derivatives under generic market frictions using reinforcement learning. Technical Report 19-80.
Carbonneau, A. and Godin, F. (2020). Equal risk pricing of derivatives with deep hedging. arXiv preprint arXiv:2002.08492.
Coleman, T., Kim, Y., Li, Y., and Patron, M. (2007). Robustly hedging variable annuities with guarantees under jump and volatility risks. Journal of Risk and Insurance, 74(2):347–376.
Coleman, T., Li, Y., and Patron, M. (2006). Hedging guarantees in variable annuities under both equity and interest rate risks. Insurance: Mathematics and Economics, 38(2):215–228.
Delbaen, F. and Schachermayer, W. (1994). A general version of the fundamental theorem of asset pricing. Mathematische Annalen, 300(1):463–520.
Deng, Y. et al. (2016). Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653–664.
Dupuis, D., Gauthier, G., and Godin, F. (2016). Short-term hedging for an electricity retailer. The Energy Journal, 37(2):31–59.
Föllmer, H. and Schweizer, M. (1988). Hedging by sequential regression: An introduction to the mathematics of option trading. ASTIN Bulletin: The Journal of the IAA, 18(2):147–160.
François, P., Gauthier, G., and Godin, F. (2014). Optimal hedging when the underlying asset follows a regime-switching Markov process. European Journal of Operational Research, 237(1):312–322.
Gan, G. (2013). Application of data clustering and machine learning in variable annuity valuation. Insurance: Mathematics and Economics, 53(3):795–801.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.
Godin, F. (2016). Minimizing CVaR in global dynamic hedging with transaction costs. Quantitative Finance, 16(3):461–475.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Halperin, I. (2020). QLBS: Q-learner in the Black-Scholes(-Merton) worlds. The Journal of Derivatives.
Han, J. and E, W. (2016). Deep learning approximation for stochastic control problems. arXiv preprint arXiv:1611.07422.
Hardy, M. (2003). Investment Guarantees: Modeling and Risk Management for Equity-Linked Life Insurance, volume 215. John Wiley & Sons.
Hardy, M. R. (2000). Hedging and reserving for single-premium segregated fund contracts. North American Actuarial Journal, 4(2):63–74.
Harrison, J. M. and Pliska, S. R. (1981). Martingales and stochastic integrals in the theory of continuous trading. Stochastic Processes and their Applications, 11(3):215–260.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hongkai, C., Cui, Z., and Yanchu, L. (2020). Discrete-time variance-optimal deep hedging in affine GARCH models. Working paper.
Jiang, Z., Xu, D., and Liang, J. (2017). A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059.
Kélani, A. and Quittard-Pinon, F. (2017). Pricing and hedging variable annuities in a Lévy market: a risk management perspective. Journal of Risk and Insurance, 84(1):209–238.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kolm, P. N. and Ritter, G. (2019). Dynamic replication and hedging: A reinforcement learning approach. The Journal of Financial Data Science, 1(1):159–171.
Lamberton, D. and Lapeyre, B. (2011). Introduction to Stochastic Calculus Applied to Finance. Chapman and Hall/CRC.
Li, Y., Szepesvari, C., and Schuurmans, D. (2009). Learning exercise policies for American options. In Artificial Intelligence and Statistics, pages 352–359.
Merton, R. C. (1976). Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics, 3:125–144.
Moody, J. and Saffell, M. (2001). Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889.
Persson, S.-A. and Aase, K. K. (1997). Valuation of the minimum guaranteed return embedded in life insurance products. Journal of Risk and Insurance, 64(4):599–617.
Powell, W. B. (2009). What you should know about approximate dynamic programming. Naval Research Logistics (NRL), 56(3):239–249.
Rémillard, B. and Rubenthaler, S. (2013). Optimal hedging in discrete time. Quantitative Finance, 13(6):819–825.
Rockafellar, R. T. and Uryasev, S. (2002). Conditional Value-at-Risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.
Schweizer, M. (1991). Option hedging for semimartingales. Stochastic Processes and their Applications, 37(2):339–363.
Schweizer, M. (1995). Variance-optimal hedging in discrete time. Mathematics of Operations Research, 20(1):1–32.
Trottier, D.-A., Godin, F., and Hamel, E. (2018a). Local hedging of variable annuities in the presence of basis risk. ASTIN Bulletin: The Journal of the IAA, 48(2):611–646.
Trottier, D.-A., Godin, F., and Hamel, E. (2018b). On fund mapping regressions applied to segregated funds hedging under regime-switching dynamics. Risks, 6(3):78.
Zhang, F. (2010). Integrating robust risk management into pricing: New thinking for VA writers.