Solving optimal stopping problems with Deep Q-Learning
John Ery∗   Loris Michel†

January 26, 2021

∗ RiskLab, Department of Mathematics, ETH Zurich, [email protected]
† Seminar für Statistik, Department of Mathematics, ETH Zurich, [email protected]
Abstract
We propose a reinforcement learning (RL) approach to model optimal exercise strategies for option-type products. We pursue the RL avenue in order to learn the optimal action-value function of the underlying stopping problem. In addition to retrieving the optimal Q-function at any time step, one can also price the contract at inception. We first discuss the standard setting with one exercise right, and later extend this framework to the case of multiple stopping opportunities in the presence of constraints. We propose to approximate the Q-function with a deep neural network, which does not require the specification of basis functions as in the least-squares Monte Carlo framework and is scalable to higher dimensions. We derive a lower bound on the option price obtained from the trained neural network and an upper bound from the dual formulation of the stopping problem, which can also be expressed in terms of the Q-function. Our methodology is illustrated with examples covering the pricing of swing options.
Reinforcement learning (RL) in its most general form deals with agents living in some environment and aiming at maximizing a given reward function. Alongside supervised and unsupervised learning, it is often considered as the third family of models in the machine learning literature. It encompasses a wide class of algorithms that have gained popularity in the context of building intelligent machines that can outperform masters in ancestral board games such as Go or chess, see e.g. Silver et al. (2016); Silver et al. (2017). These models are very skilled when it comes to learning the rules of a certain game, starting from little or no prior knowledge at all, and progressively developing winning strategies. Recent research, see e.g. Mnih et al. (2013), van Hasselt et al. (2016), Wang et al. (2016), has considered integrating deep learning techniques in the framework of reinforcement learning in order to model complex unstructured environments. Deep reinforcement learning can hence leverage the ability of deep neural networks to uncover hidden structure from very complex functionals and the power of reinforcement techniques to take complex actions.

Optimal stopping problems from mathematical finance naturally fit into the reinforcement learning framework. Our work is motivated by the pricing of swing options which appear in energy markets (oil, natural gas, electricity) to hedge against futures price fluctuations, see e.g. Meinshausen and Hambly (2004), Bender et al. (2015), and more recently Daluiso et al. (2020). Intuitively, when behaving optimally, investors holding these options are trying to maximize their reward by following some optimal sequence of decisions, which in the case of swing options consists in purchasing a certain amount of electricity or natural gas at multiple exercise times.

The stopping problems we will consider belong to the category of Markov decision processes (MDP). We refer the reader to Puterman (1994) or Bertsekas (1995) for good textbook references on this topic. When the size of the MDP becomes large or when the MDP is not fully known (model-free learning), alternatives to standard dynamic programming techniques must be sought. Reinforcement learning can efficiently tackle these issues and can be transposed to our problem of determining optimal stopping strategies.

Previous work exists on the connections between optimal stopping problems in mathematical finance and reinforcement learning. For example, the common problem of learning optimal exercise policies for American options has been tackled in Li et al. (2009) using reinforcement learning techniques. They implement two algorithms, namely least-squares policy iteration (LSPI), see Lagoudakis and Parr (2003), and fitted Q-iteration (FQI).
In this section we present the mathematical building blocks and the reinforcement learning machinery, leading to the formulation of the stopping problems under consideration.
As discussed in the introduction, the problems we will consider in the sequel can be embedded into the framework of the well-studied Markov decision processes (MDPs), see Sutton and Barto (1998). A Markov decision process is defined as a tuple (S, A, p, R, γ), where

• S is the set of states;
• A is the set of actions the agent can take;
• p is the transition probability kernel, where p(· | s, a) is the probability of future states given that the current state is s and that action a is taken;
• R is a reward function, where R(s, a) denotes the reward obtained when moving from state s under action a (note here that different definitions exist in the literature);
• γ ∈ (0, 1] is a discount factor which expresses preference towards short-term rewards (in the present work γ = 1 as we consider already discounted rewards).

A policy π is then a rule for selecting actions based on the last visited state. More specifically, π(s, a) denotes the probability of taking action a in state s under policy π. The conventional task is to maximize the total (discounted) expected reward over policies, which can be expressed as E_π[Σ_{t=0}^∞ γ^t R_t]. A policy which maximizes this quantity is called an optimal policy. Given a starting state s and an initial action a, one can define the action-value function, also called Q-function:

Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t R_t | s_0 = s, a_0 = a ],   (1)

where R_t = R(s_t, a_t) for a sequence of state-action pairs (s_t, a_t)_{t≥0} ~ π. The optimal policy π* satisfies

Q*(s, a) = sup_π Q^π(s, a),   (2)

where we write Q* for Q^{π*}. In other words, the optimal Q-function measures how "good" or "rewarding" it is to choose action a while in state s, by following optimal decisions. We will consider problems with finite time horizon T > 0, and we accordingly set R_t = 0 for all t > T.

We consider the same stopping problem as in Becker et al. (2019) and Becker et al. (2020), namely an American-style option defined on a finite time grid t_0 < t_1 < ... < t_N = T. The discounted payoff process (G_n)_{n=0}^N is assumed to be square-integrable and takes the form G_n = g(n, X_{t_n}) for a measurable function g : {0, 1, ..., N} × R^d → R and a d-dimensional F-Markovian process (X_{t_n})_{n=0}^N defined on a filtered probability space (Ω, F, F = (F_n)_{n=0}^N, P). Let E ⊂ R^d denote the space in which the underlying process lives. We assume that X_0 is deterministic and that P is the risk-neutral probability measure. The value of the option at time 0 is given by

V_0 = sup_{τ ∈ T} E[g(τ, X_τ)],   (3)

where T denotes the set of all stopping times τ : Ω → {t_0, t_1, ..., t_N}. This problem is essentially a Markov decision process with state space S = {0, 1, ..., N} × R^d × {0, 1}, action space A = {0, 1} (where we follow the convention a = 0 for continuing and a = 1 for stopping), reward function

R((n, X_{t_n}), a) = g(n, X_{t_n}) if a = 1, and 0 if a = 0,   for n = 0, ..., N,

and transition kernel p driven by the dynamics of the F-Markovian process (X_{t_n})_{n=0}^N. The state space includes time, the d-dimensional Markovian process and an additional (absorbing) state which at each time step captures the event of exercise or no exercise. More precisely, we jump to this absorbing state when we have exercised. In the multiple stopping case which we discuss in Section 3, we jump to this absorbing state once we have used the last exercise right.
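To make this embedding concrete, the following minimal sketch shows one possible way to encode the state transition and reward of the stopping MDP just described. It is an illustration of the formulation above, not the authors' implementation; the callables `payoff` and `sample_next_state` are hypothetical names for the payoff g and the Markovian dynamics of X.

```python
def step(state, action, payoff, sample_next_state):
    """One transition of the stopping MDP.

    state  : (n, x, stopped) with time index n, Markov state x, absorbing flag
    action : 0 = continue, 1 = stop
    payoff : function (n, x) -> discounted payoff g(n, x)
    sample_next_state : function (n, x) -> draw of X_{t_{n+1}} given X_{t_n} = x
    """
    n, x, stopped = state
    if stopped:                       # absorbing state: no further rewards
        return (n + 1, x, True), 0.0
    if action == 1:                   # exercise: collect the payoff, jump to the absorbing state
        return (n + 1, x, True), payoff(n, x)
    # continue: zero reward, move along the Markovian dynamics
    return (n + 1, sample_next_state(n, x), False), 0.0
```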
In both single and multiple stopping frameworks, once this absorbing state has been reached at a random time τ : Ω → {t_0, t_1, ..., t_N}, we set all rewards and Q-values to 0 for t > τ. The associated Snell envelope process (Z_n)_{n=0}^N of the stopping problem in (3) is defined recursively by

Z_n = g(N, X_{t_N}) if n = N, and Z_n = max{ g(n, X_{t_n}), E[Z_{n+1} | F_n] } if 0 ≤ n ≤ N − 1.   (4)

It is well known that the Snell envelope provides an optimal stopping time solving (3), as stated in the following result. A standard proof for the latter can be found in Karatzas and Shreve (1991).
Proposition 2.1.
The stopping time τ* defined by

τ* = inf{ n : Z_n = g(n, X_{t_n}) }

for the Snell envelope (Z_n)_{n=0}^N given in (4) is optimal for the problem (3).

Various modeling approaches have been proposed to estimate the option value in (3). Kohler et al. (2008) propose to model directly the Snell envelope, Becker et al. (2019) take the approach of modeling the optimal stopping times. More recently, Becker et al. (2020) model the continuation values of the stopping problem. In this work, we rather propose to model the optimal action-value function of the problem, Q*((n, X_{t_n}), a) for all n = 0, ..., N and a ∈ {0, 1} (where a represents the stopping decision), given by

Q*((n, X_{t_n}), a) = g(n, X_{t_n}) if a = 1, and E[Z_{n+1} | F_n] if a = 0.   (5)

When exercising (taking action a = 1), we implicitly move to the absorbing state, i.e. the last component of the state space becomes 1. Note that in particular Z_0 = V_0.

From the optimal action-value function Q*((n, X_{t_n}), a), we can recover the optimal stopping time τ*. Indeed, it turns out that the optimal decision functions f_0, ..., f_N in Becker et al. (2019) can be expressed in the action-value function framework through

f_n(X_{t_n}) = 1{ argmax_{a ∈ {0,1}} Q*((n, X_{t_n}), a) = 1 },   for all n = 0, ..., N,

where 1{·} denotes the indicator function. Moreover, one can express the Snell envelope (estimated in Kohler et al. (2008)) as Z_n = max{ Q*((n, X_{t_n}), 0), Q*((n, X_{t_n}), 1) }, and the continuation value modeled in Becker et al. (2020) can be reformulated in our setting as C_n = Q*((n, X_{t_n}), 0). As a by-product, one can price financial products such as swing options by considering max{ Q*((0, X_0), 0), Q*((0, X_0), 1) }.

In this perspective, our modeling approach is very similar to previous studies but differs in the reinforcement learning machinery employed. Indeed, modeling the action-value function and optimizing it is a common and natural approach known under the name of Q-learning in the reinforcement learning literature. We introduce it in the next section.
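As a small illustration of the correspondences just listed, the following sketch reads the stopping decision and the time-0 price off a learned Q-function; `q_net` is a hypothetical name for a trained approximation mapping a state (n, x) to its two Q-values.

```python
def stopping_decision(q_net, n, x):
    """f_n(x): exercise (1) when the Q-value of stopping exceeds that of continuing."""
    q_continue, q_stop = q_net(n, x)   # approximations of Q((n, x), 0) and Q((n, x), 1)
    return int(q_stop > q_continue)

def price_at_inception(q_net, x0):
    """Time-0 price as max{Q((0, X_0), 0), Q((0, X_0), 1)}."""
    q_continue, q_stop = q_net(0, x0)
    return max(q_continue, q_stop)
```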
In contrast to policy or value iteration, Q-learning methods, see e.g. Watkins (1989) and Watkins and Dayan (1992), estimate directly the optimal action-value function. They are model-free and can learn optimal strategies with no prior knowledge of the state transitions and the rewards. In this paradigm, an agent interacts with the environment (exploration step) and learns from past actions (exploitation step) to derive the optimal strategy.

One way to model the action-value function is by using deep neural networks. This approach is referred to under the name deep Q-learning in the reinforcement learning literature. In this setup, the optimal action-value function Q* is modeled with a neural network Q(s, a; θ), often called a deep Q-network (DQN), where θ is a vector of parameters corresponding to the network architecture. However, reinforcement learning can be highly unstable or even potentially diverge due to the introduction of neural networks in the approximation of the Q-function. To tackle these issues, a variant of the original Q-learning method has been developed in Mnih et al. (2015). It relies on two main concepts. The first is called experience replay and makes it possible to remove correlations in the sequence of observations. In practice this is done by generating a large sample of experiences, which we denote as vectors e_t = (s_t, a_t, r_t, s_{t+1}) at each time t, and that we store in a dataset D. We note that once we have reached the absorbing state, we start a new episode or sequence of observations by resetting the MDP to the initial state s_0. Furthermore, we allow the agent to explore new unseen states according to a so-called ε-greedy strategy, see Sutton and Barto (1998), meaning that with probability ε we take a random action and with probability (1 − ε) we take the action maximizing the Q-value. Typically one reduces the value of ε according to a linear schedule as the training iterations increase.

During the training phase, we then perform updates to the Q-values by sampling mini-batches uniformly at random from this dataset, (s, a, r, s') ~ U(D), and minimizing over θ the following loss function:

L(θ) = E_{(s,a,r,s') ~ U(D)}[ ( R(s, a) + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ) )^2 ].   (6)

However, there might still be some correlations between the Q-values Q(s, a; θ) and the so-called target values R(s, a) + γ max_{a'} Q(s', a'; θ). The second improvement brought forward in Mnih et al. (2015) consists in updating the network parameters for the target values only with a regular frequency and not after each iteration. This is called parameter freezing and translates into minimizing over θ the modified loss function

L(θ) = E_{(s,a,r,s') ~ U(D)}[ ( R(s, a) + γ max_{a'} Q(s', a'; θ*) − Q(s, a; θ) )^2 ],   (7)

where the target network parameters θ* are only updated with the DQN parameters θ every T* > 0 steps, and are held constant between individual updates.

An alternative network specification would be to take only the state as input, Q(s; θ), and update the Q-values for each action, see the implementation in Mnih et al. (2013). Network architectures such as double deep Q-networks, see van Hasselt et al. (2016), dueling deep Q-networks, see Wang et al. (2016), and combinations thereof, see Hessel et al. (2017), have been developed to improve the training performance even further. However, the implementation of these algorithms is out of the scope of our presentation.
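The following is a compact, self-contained sketch of deep Q-learning with experience replay, an ε-greedy exploration schedule and a frozen target network, in the spirit of the procedure just described. It is not the authors' exact configuration: the class name `QNet`, the layer widths, the schedule constants and the hypothetical environment interface (`env.reset()` returning a state vector, `env.step(action)` returning `(next_state, reward, done)`) are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state vector (time, Markov state, absorbing flag) to Q-values for a = 0, 1."""
    def __init__(self, state_dim, n_actions=2, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def train_dqn(env, state_dim, n_steps=10_000, batch_size=1000,
              gamma=1.0, eps_start=1.0, eps_end=0.05, target_freq=100):
    q_net = QNet(state_dim)
    target_net = QNet(state_dim)
    target_net.load_state_dict(q_net.state_dict())     # frozen copy, refreshed every target_freq steps
    optimizer = torch.optim.RMSprop(q_net.parameters())
    replay = deque(maxlen=100_000)                      # experience replay dataset D

    state = env.reset()
    for step in range(n_steps):
        eps = max(eps_end, eps_start - (eps_start - eps_end) * step / n_steps)  # linear ε schedule
        if random.random() < eps:                       # ε-greedy exploration
            action = random.randrange(2)
        else:
            with torch.no_grad():
                action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
        next_state, reward, done = env.step(action)
        replay.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state     # start a new episode after absorption

        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)   # (s, a, r, s') ~ U(D)
            s, a, r, s2, d = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                              for x in zip(*batch))
            with torch.no_grad():                       # target values with frozen parameters θ*
                target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
            q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, target)    # empirical version of the loss (7)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step % target_freq == 0:                     # parameter freezing update
            target_net.load_state_dict(q_net.state_dict())
    return q_net
```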
In the same spirit as Becker et al. (2019) and Becker et al. (2020), we compute lower and upper bounds on the option price in (3), the confidence interval resulting from the central limit theorem, as well as a point estimate for the optimal value V_0. In the sequel, for ease of notation, we will use X_{t_n} = X_n for n = 0, ..., N.

We store the parameters learned through the training of the deep neural network on an experience replay dataset with simulations (X_n^k)_{n=0}^N for k = 1, ..., K. We denote by θ̂ ∈ Θ the vector of network parameters, where Θ ⊂ R^q and q > 0 denotes the dimension of the parameter space, and Q(s, a; θ̂) corresponds to the calibrated network. We then generate new simulations of the state space process (X_n^k)_{n=0}^N, independent from those used for training, for k = K + 1, ..., K + K_L. The independence is necessary to achieve unbiasedness of the estimates. The Monte Carlo average

L̂ = (1/K_L) Σ_{k=K+1}^{K+K_L} g(τ^k, X^k_{τ^k}),   where   τ^k = inf{ 0 ≤ n ≤ N : Q((n, X_n^k), 1; θ̂) > Q((n, X_n^k), 0; θ̂) },

yields a lower bound for the optimal value V_0. Since the optimal strategies are not unique, we follow the convention of taking the largest optimal stopping rule, which yields a strict inequality.
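A minimal sketch of this lower-bound estimate from a trained Q-network is given below, assuming independently simulated paths `paths` of shape (K_L, N+1, d), a payoff function `g(n, x)` and a hypothetical `q_values(n, x)` returning the pair (Q((n, x), 0; θ̂), Q((n, x), 1; θ̂)); when the strict-inequality rule never triggers, the sketch defaults to exercising at maturity. The returned one-sided bound follows the central limit theorem as used for the confidence interval below.

```python
import numpy as np
from scipy.stats import norm

def lower_bound(paths, g, q_values, alpha=0.05):
    """Monte Carlo lower bound L_hat and the lower edge of the asymptotic confidence interval."""
    K_L, N_plus_1, _ = paths.shape
    realized = np.empty(K_L)
    for k in range(K_L):
        tau = N_plus_1 - 1                      # default: exercise at maturity
        for n in range(N_plus_1):
            q_cont, q_stop = q_values(n, paths[k, n])
            if q_stop > q_cont:                 # first time stopping strictly dominates continuing
                tau = n
                break
        realized[k] = g(tau, paths[k, tau])
    L_hat = realized.mean()
    sigma_L = realized.std(ddof=1)              # empirical standard deviation of the payoffs
    z = norm.ppf(1 - alpha / 2)
    return L_hat, L_hat - z * sigma_L / np.sqrt(K_L)
```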
The derivation of the upper bound is based on the Doob-Meyer decomposition of the supermartingale given by the Snell envelope, see Karatzas and Shreve (1991). The Snell envelope (Z_n)_{n=0}^N of the discounted payoff process (G_n)_{n=0}^N can be decomposed as

Z_n = Z_0 + M_n^Z − A_n^Z,

where M^Z is the (F_n)-martingale given by M_0^Z = 0 and M_n^Z − M_{n−1}^Z = Z_n − E[Z_n | F_{n−1}], n = 1, ..., N, and A^Z is the non-decreasing (F_n)-predictable process given by A_0^Z = 0 and A_n^Z − A_{n−1}^Z = Z_{n−1} − E[Z_n | F_{n−1}], n = 1, ..., N.

From Proposition 7 in Becker et al. (2019), given a sequence (ε_n)_{n=0}^N of integrable random variables in (Ω, F, P) such that E[ε_n | F_n] = 0 for all n = 0, ..., N, one has

V_0 ≤ E[ max_{0 ≤ n ≤ N} ( g(n, X_n) − M_n − ε_n ) ],

for every (F_n)-martingale (M_n)_{n=0}^N starting from 0. This upper bound is tight if M = M^Z and ε ≡ 0. We can then use the optimal action-value function learned via the deep neural network to construct a martingale close to M^Z. We now adapt the approach presented in Becker et al. (2019) to the expression of the martingale component of the Snell envelope. Indeed, the martingale differences ΔM_n from Subsection 3.2 in Becker et al. (2019) can be written in terms of the optimal action-value function:

ΔM_n = M_n − M_{n−1} = Q*((n, X_n), a_n) − Q*((n−1, X_{n−1}), 0),

where the second term is given by evaluating the optimal action-value function at action a = 0 (continuing). Given the definition of the optimal action-value function at (5), one can rewrite the martingale differences as

ΔM_n = g(n, X_n) 1{a_n = 1} + E[Z_{n+1} | F_n] 1{a_n = 0} − E[Z_n | F_{n−1}].   (8)

The empirical counterparts are given by generating realizations M_n^k of M_n + ε_n based on a sample of K_U simulations (X_n^k)_{n=0}^N, for k = K + K_L + 1, ..., K + K_L + K_U. Again, we simulate realizations of the state space process independently from the simulations used for training. This gives us the following empirical differences:

ΔM_n^k = g(n, X_n^k) 1{a_n^k = 1} + Ê[Z_{n+1}^k | F_n] 1{a_n^k = 0} − Ê[Z_n^k | F_{n−1}],

where a_n^k is the chosen action at time n for simulation path k, and Ê[Z_{n+1}^k | F_n] are the Monte Carlo averages approximating the continuation values for n = 0, ..., N − 1 and k = K + K_L + 1, ..., K + K_L + K_U. The continuation values appearing in the martingale increments are obtained through nested simulation, see the remark below:

Ê[Z_{n+1}^k | F_n] = (1/J) Σ_{j=1}^J g( τ_{n+1}^{k,j}, X̃^{k,j}_{τ_{n+1}^{k,j}} ),

where J is the number of simulations in the inner step, and where, given each X_n^k, we simulate (conditional) continuation paths X̃_{n+1}^{k,j}, ..., X̃_N^{k,j}, j = 1, ..., J, that are conditionally independent of each other and of X_{n+1}^k, ..., X_N^k, and τ_{n+1}^{k,j} is the value of τ_{n+1}^θ along the path X̃_{n+1}^{k,j}, ..., X̃_N^{k,j}.

Remark.
It is not guaranteed that E[ΔM_n | F_{n−1}] = 0 for the Q-function learned via the neural network. To tackle this issue, we implement nested simulations as in Becker et al. (2019) and Becker et al. (2020) to estimate the continuation values. This gives unbiased estimates of M_n, which is crucial to obtain a valid upper bound. Moreover, the variance of the estimates decreases with the number of inner simulations, at the expense of increased computational time.

Finally, we can derive an unbiased estimate for the upper bound of the optimal value V_0:

Û = (1/K_U) Σ_{k=K+K_L+1}^{K+K_L+K_U} max_{0 ≤ n ≤ N} ( g(n, X_n^k) − M_n^k ),

with M_n^k = Σ_{m=1}^n ΔM_m^k.

The average between the lower and the upper bound is considered as the point estimate of V_0 in Becker et al. (2019) and Becker et al. (2020): (L̂ + Û)/2.

Assuming the discounted payoff process is square-integrable for all n = 0, ..., N, we also obtain that the upper bound max_{0 ≤ n ≤ N}(g(n, X_n) − M_n − ε_n) is square-integrable. Let z_{α/2} denote the (1 − α/2)-quantile of a standard normal distribution. Defining the empirical standard deviations for the lower and upper bounds as

σ̂_L = √( (1/(K_L − 1)) Σ_{k=K+1}^{K+K_L} ( g(τ^k, X^k_{τ^k}) − L̂ )^2 )

and

σ̂_U = √( (1/(K_U − 1)) Σ_{k=K+K_L+1}^{K+K_L+K_U} ( max_{0 ≤ n ≤ N}( g(n, X_n^k) − M_n^k ) − Û )^2 ),

we obtain the asymptotic two-sided (1 − α)-confidence interval for the true optimal value V_0:

[ L̂ − z_{α/2} σ̂_L / √K_L ,  Û + z_{α/2} σ̂_U / √K_U ].   (9)

We have presented in this section the unifying properties of Q-learning compared to other approaches used to study optimal stopping problems. On the one hand, we do not require any iterative procedure and do not have to solve a potentially complicated optimization problem at each time step. Indeed, the calibrated deep neural network solves the optimal stopping problem on the whole time interval. On the other hand, we are able to accommodate any finite number of possible actions. Looking back at the direct approach of Becker et al. (2019) to model optimal stopping policies, the parametric form of the stopping times would explode if we allowed for more than two possible actions.

In this section we extend the previous problem to the more general framework of multiple-exercise options. Examples from this family include swing options, which are common in the electricity market. The holder of such an option is entitled to exercise a certain right, e.g. the delivery of a certain amount of energy, several times, until the maturity of the contract. The number of exercise rights and constraints on how they can be used are specified at inception. Typical constraints are a waiting period, i.e. a minimal waiting time between two exercise rights, and a volume constraint, which specifies how many units of the underlying asset can be purchased at each time.

Monte Carlo valuation of such products has been studied in Meinshausen and Hambly (2004), producing lower and upper bounds for the price. Building on the dual formulation for option pricing, alternative methods additionally accounting for waiting time constraints have been considered in Bender (2011), and for both volume and waiting time constraints in Bender et al. (2015). In all cases, the multiple stopping problem is decomposed into several single stopping problems using the so-called reduction principle.
The dual formulation in Meinshausen and Hambly (2004) expresses the marginal excess value due to each additional exercise right as an infimum of an expectation over a certain space of martingales and a set of stopping times. A version of the dual problem in discrete time relying solely on martingales is presented in Schoenmakers (2012), and a dual for the continuous time problem with a non-trivial waiting time constraint is derived in Bender (2011). In the latter case, the optimization is not only over a space of martingales, but also over adapted processes of bounded variation, which stem from the Doob-Meyer decomposition of the Snell envelope. The dual problem in the more general setting considering both volume and waiting time constraints is formulated in Bender et al. (2015).

We now express the multiple stopping extension of the problem defined at (3) for American-style options. Assume that the option holder has n > 1 exercise rights over the lifetime of the contract. We consider the setting with no volume constraint and a waiting time δ > 0, which we assume to be a multiple of the time step resulting from the discretization of the interval [0, T]. The action space is still A = {0, 1}. The state space now has an additional dimension corresponding to the number of remaining exercise opportunities. As in standard stopping, we assume an absorbing state to which we jump once the n-th right has been exercised.

We note that due to the introduction of the waiting period, depending on the specification of n, T and δ, it may not be possible for the option holder to exercise all his rights before maturity, see the discussion in Bender et al. (2015), where a "cemetery time" is defined. If the specification of these parameters allows the exercise of all rights, and if we assume that g(n, X_{t_n}) ≥ 0 for all n = 0, ..., N, then it will always be optimal to use all exercise rights. The value of this option with n > 1 exercise possibilities at time 0 is given by

V_0^n = sup_{τ ∈ T_δ^n} Σ_{i=1}^n E[ g(τ_i, X_{τ_i}) ],   (10)

where T_δ^n is the set of n-tuples τ = (τ_n, τ_{n−1}, ..., τ_1) of stopping times in {t_0, t_1, ..., t_N}^n satisfying τ_i ≥ τ_{i+1} + δ for i = 1, ..., n − 1.

As in Bender (2011), one can combine the dynamic programming principle with the reduction principle to rewrite the primal optimization problem. We introduce the following functions defined in Bender (2011) for ν = 1, ..., n and k = N, ..., 0:

q^ν(k, x) = E[ y^ν(t_{k+1}, X_{t_{k+1}}) | X_{t_k} = x ],
q^ν_δ(k, x) = E[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | X_{t_k} = x ],

and we define the functions y^ν as

y^ν(t_k, x) = max{ g(t_k, x) + q^{ν−1}_δ(k, x), q^ν(k, x) }.

We set q^0_δ(k, x) = 0 for all x ∈ R^d and all k ∈ {0, ..., N}, and g(t, X_t) = 0 for all t > T. In the sequel, we denote by y^{*,ν} the Snell envelope for the problem with ν remaining exercise rights, for ν = 1, ..., n. The reduction principle essentially states that the option with n stopping times is as good as the single option paying the immediate cashflow plus the option with (n − 1) stopping times starting with a temporal delay of δ. This philosophy is also followed in Meinshausen and Hambly (2004) by looking at the marginal extra payoff obtained with an additional exercise right. The function q^ν corresponds to the continuation value in case of no exercise and the function q^ν_δ to the continuation value in case of exercise, which requires a waiting period of δ.
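To make the reduction principle concrete, the recursion can be unrolled for n = 2 exercise rights; the following worked version follows directly from the definitions above.

```latex
% Recursion for n = 2 exercise rights, written out from the definitions of q^\nu, q^\nu_\delta and y^\nu.
% With one right left (\nu = 1), exercising ends the contract since q^0_\delta \equiv 0:
y^1(t_k, x) = \max\bigl\{\, g(t_k, x),\; q^1(k, x) \,\bigr\},
\qquad q^1(k, x) = \mathbb{E}\bigl[\, y^1(t_{k+1}, X_{t_{k+1}}) \,\big|\, X_{t_k} = x \,\bigr].
% With both rights left (\nu = 2), exercising pays g and leaves the single-right option
% available only after the waiting period \delta:
y^2(t_k, x) = \max\bigl\{\, g(t_k, x) + q^1_\delta(k, x),\; q^2(k, x) \,\bigr\},
\qquad q^1_\delta(k, x) = \mathbb{E}\bigl[\, y^1(t_{k+\delta}, X_{t_{k+\delta}}) \,\big|\, X_{t_k} = x \,\bigr].
% With exact conditional expectations these functions coincide with y^{*,1}, y^{*,2},
% and the time-0 price with two rights is V_0^2 = y^{*,2}(0, X_0).
```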
As shown in Bender (2011), one can derive the optimal policy from the continuation values. Indeed, the optimal stopping times τ_ν^{*,n}, for ν = 1, ..., n, are given by

τ_ν^{*,n} = inf{ k ≥ τ_{ν+1}^{*,n} + δ : g(t_k, X_{t_k}) + q^{*,ν−1}_δ(k, X_{t_k}) ≥ q^{*,ν}(k, X_{t_k}) },   (11)

with starting value τ_{n+1}^{*,n} = −δ, which is a convention to make sure that the first exercise time is bounded from below by 0. The optimal price is then

V_0^n = y^{*,n}(0, X_0),

and, as in the single stopping framework, one can express the Snell envelope, the optimal stopping times and the continuation values in terms of the optimal Q-function Q*. Indeed, the continuation values can be expressed as

q^{*,ν}(k, X_{t_k}) = Q*((t_k, X_{t_k}), ν),
q^{*,ν}_δ(k, X_{t_k}) = Q*((t_{k+δ}, X_{t_{k+δ}}), ν),

the Snell envelope as

y^{*,ν}(t_k, X_{t_k}) = max{ g(t_k, X_{t_k}) + Q*((t_{k+δ}, X_{t_{k+δ}}), ν − 1), Q*((t_k, X_{t_k}), ν) },

and the optimal policy as

τ_ν^{*,n} = inf{ k ≥ τ_{ν+1}^{*,n} + δ : g(t_k, X_{t_k}) + Q*((t_{k+δ}, X_{t_{k+δ}}), ν − 1) ≥ Q*((t_k, X_{t_k}), ν) }.   (12)

To remain consistent with the notation introduced above for the functions q^ν, q^ν_δ and y^ν, we denote by Q*((t_k, X_{t_k}), ν) the optimal Q-value in state (t_k, X_{t_k}, ν), i.e. when there are ν remaining exercise rights. Analogously to standard stopping with one exercise right, we can derive a lower bound from the primal problem and an upper bound from the dual problem. Moreover, we derive a confidence interval around the pointwise estimate based on Monte Carlo simulations.

As in Section 2.4.1, we denote by Q(s, a; θ̂) the deep neural network calibrated through the training process using experience replay on a sample of simulated paths (X_n^m)_{n=0}^N for m = 1, ..., M. We then generate a new set of M_L simulations (X_n^m)_{n=0}^N, independent from the simulations used for training, for m = M + 1, ..., M + M_L. Then, using the learned stopping times

τ_ν^{m,n} = inf{ k ≥ τ_{ν+1}^{m,n} + δ : g(t_k, X_{t_k}^m) + Q((t_{k+δ}, X_{t_{k+δ}}^m), ν − 1; θ̂) ≥ Q((t_k, X_{t_k}^m), ν; θ̂) },

for ν = 1, ..., n, and with the convention τ_{n+1}^{m,n} = −δ for all m = M + 1, ..., M + M_L, the Monte Carlo average

L̂^n = (1/M_L) Σ_{m=M+1}^{M+M_L} Σ_{ν=1}^n g(τ_ν^{m,n}, X^m_{τ_ν})

yields a lower bound for the optimal value V_0^n. In order not to overload the notation, we write τ_ν = τ_ν^{m,n} in the subscript of the simulated state space above.

Upper bound

By exploiting the dual as in Bender (2011), one can also derive an upper bound on the optimal value V_0^n. In order to do so, we consider the Doob decomposition of the supermartingales y^{*,ν}(t_k, X_{t_k}) given by

y^{*,ν}(t_k, X_{t_k}) = y^{*,ν}(0, X_0) + M^{*,ν}(k) − A^{*,ν}(k),

where M^{*,ν}(k) is an (F_k)-martingale with M^{*,ν}(0) = 0 and A^{*,ν}(k) is a non-decreasing (F_k)-predictable process with A^{*,ν}(0) = 0, for all ν = 1, ..., n and k = 0, ..., N. The corresponding approximated terms using the learned Q-function lead to the following decomposition:

y^ν(t_k, X_{t_k}) = y^ν(0, X_0) + M^ν(k) − A^ν(k),

where the M^ν are martingales with M^ν(0) = 0, for ν = 1, ..., n, and the A^ν are integrable adapted processes in discrete time with A^ν(0) = 0, for ν = 1, ..., n. Moreover, one can write the increments of both the martingale and adapted components as

M^ν(k) − M^ν(k − 1) = y^ν(t_k, X_{t_k}) − E[ y^ν(t_k, X_{t_k}) | F_{k−1} ]

and

A^ν(k) − A^ν(k − 1) = y^ν(t_{k−1}, X_{t_{k−1}}) − E[ y^ν(t_k, X_{t_k}) | F_{k−1} ].

Given the existence of the waiting period, one must also include the δ-increment term

A^ν(k + δ) − E[ A^ν(k + δ) | F_k ] = M^ν(k + δ) − M^ν(k) + E[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | F_k ] − y^ν(t_{k+δ}, X_{t_{k+δ}}).

We note that for δ = 1, since A^{*,ν} is a predictable process, this increment is equal to 0 for the optimal martingale M^{*,ν} and we retrieve the dual formulation in Schoenmakers (2012).

As the dual formulation involves conditional expectations, we use nested simulation on a new set of M_U independent simulations (X_n^m)_{n=0}^N for m = M + M_L + 1, ..., M + M_L + M_U, with M_U^inner inner simulations for each outer simulation as explained in Section 2.4.2, to approximate the one-step ahead continuation values E[ y^ν(t_k, X_{t_k}) | F_{k−1} ] and the δ-steps ahead continuation values E[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | F_k ]. We denote the Monte Carlo estimators of these conditional expectations by Ê[ y^ν(t_k, X_{t_k}) | F_{k−1} ] and Ê[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | F_k ], respectively. We use these quantities to express the empirical counterparts of the adapted process increments for m = M + M_L + 1, ..., M + M_L + M_U:

A_m^ν(k + δ) − Ê[ A^ν(k + δ) | F_k ] = M_m^ν(k + δ) − M_m^ν(k) + Ê[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | F_k ] − y_m^ν(t_{k+δ}, X_{t_{k+δ}}^m).

We can then rewrite the empirical counterparts of the Snell envelopes through the Q-function:

y_m^ν(t_k, X_{t_k}^m) = max{ g(t_k, X_{t_k}^m) + Q((t_{k+δ}, X_{t_{k+δ}}^m), ν − 1; θ̂), Q((t_k, X_{t_k}^m), ν; θ̂) },

for ν = 1, ..., n, k = 0, ..., N, m = M + M_L + 1, ..., M + M_L + M_U, and where we set g(t, X_t) = 0 for t > T and Q((t_k, X_{t_k}), 0; θ̂) = 0 (no more exercises left).

The theoretical upper bound U^n stemming from the dual problem in Bender (2011) is given by

U^n = E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} { Σ_{ν=1}^{n−1} ( g(u_ν, X_{u_ν}) − (M^ν(u_ν) − M^ν(u_{ν+1})) + A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) + g(u_n, X_{u_n}) − M^n(u_n) } ].

We hence obtain V_0^n ≤ U^n, and this bound is sharp for the exact Doob-Meyer decomposition terms M^{*,ν} and A^{*,ν}, for ν = 1, ..., n. We denote the sharp upper bound by

U^{*,n} = E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} { Σ_{ν=1}^{n−1} ( g(u_ν, X_{u_ν}) − (M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1})) + A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) + g(u_n, X_{u_n}) − M^{*,n}(u_n) } ].

The corresponding Monte Carlo estimate of the upper bound on the optimal value V_0^n is

Û^n = (1/M_U) Σ_{m=M+M_L+1}^{M+M_L+M_U} sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} { Σ_{ν=1}^{n−1} ( g(u_ν, X_{u_ν}^m) − (M_m^ν(u_ν) − M_m^ν(u_{ν+1})) + A_m^ν(u_{ν+1}+δ) − Ê[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) + g(u_n, X_{u_n}^m) − M_m^n(u_n) }.

The pathwise supremum appearing in the expression of the upper bound can be computed using the recursion formula from Proposition 3.8 in Bender et al. (2015). This recursion formula is implemented in our setting using the representation via the Q-function.
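For illustration only, the pathwise supremum can also be evaluated by brute force rather than by the recursion from Proposition 3.8 in Bender et al. (2015). The sketch below does so for the special case n = 2, restricting the search to the time grid and assuming both rights can be exercised before maturity; the arrays `g`, `M1`, `M2` and the adapted-increment correction `adj1` are hypothetical names for quantities assumed precomputed along one simulated path.

```python
import numpy as np

def dual_pathwise_sup_two_rights(g, M1, M2, adj1, delta):
    """Brute-force inner supremum of the dual bound for n = 2 exercise rights on one path.

    g[k]    : payoff g(t_k, X_{t_k}) along the path (set to 0 beyond maturity)
    M1, M2  : approximate dual martingales M^1(k), M^2(k) along the path, with M(0) = 0
    adj1[k] : correction A^1(k + delta) - E_hat[A^1(k + delta) | F_k] along the path
    delta   : waiting period, expressed in time steps
    """
    N = len(g) - 1
    best = -np.inf
    for u2 in range(N + 1 - delta):          # earlier exercise time
        for u1 in range(u2 + delta, N + 1):  # later exercise time, respecting the waiting period
            val = (g[u1] - (M1[u1] - M1[u2]) + adj1[u2]   # nu = 1 term of the dual objective
                   + g[u2] - M2[u2])                       # final term (nu = n = 2)
            best = max(best, val)
    return best

# The Monte Carlo upper bound is then the average of this quantity over the M_U outer paths.
```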
As in Becker et al. (2019) and Becker et al. (2020), we can construct a pointwise estimate for the optimal value in the multiple stopping framework in the presence of a waiting time constraint by taking the midpoint (L̂^n + Û^n)/2.

By storing the empirical standard deviations for the lower and upper bounds, which we denote by σ̂_{L^n} and σ̂_{U^n}, respectively, one can leverage the central limit theorem as in Section 2.4.3 to derive the asymptotic two-sided (1 − α)-confidence interval for the true optimal value V_0^n:

[ L̂^n − z_{α/2} σ̂_{L^n} / √M_L ,  Û^n + z_{α/2} σ̂_{U^n} / √M_U ].   (13)

We now derive the extension of a result presented in Meinshausen and Hambly (2004) on the bias resulting from the derivation of the upper bound, to the case of multiple stopping in the presence of a waiting period. The dual problem from Meinshausen and Hambly (2004), being obtained from an optimization over a space of martingales and a set of stopping times, contains two terms: the bias coming from the martingale approximation, and the bias coming from the policy approximation. In the case with waiting constraint, as exemplified in the dual of Bender (2011), we show how one can again control the bias in the approximations to the n Doob-Meyer decompositions of the Snell envelopes y^{*,ν}, for ν = 1, ..., n. Indeed, in the dual problem, each martingale M^{*,ν} is approximated by a martingale M^ν, and each predictable non-decreasing process A^{*,ν} is approximated by an integrable adapted process in discrete time A^ν. We proceed in three steps and analyse separately the bias from each approximation employed:

• Martingale terms:
E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | M^ν(u_ν) − M^ν(u_{ν+1}) − (M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1})) | ]

• Adapted terms:
E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] − ( A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) | ]

• Final term:
E[ sup_{0 ≤ u_n ≤ N} | g(u_n, X_{u_n}) − M^n(u_n) − ( g(u_n, X_{u_n}) − M^{*,n}(u_n) ) | ]

The bias in the final term g(u_n, X_{u_n}) − M^n(u_n) can be bounded using the methodology in Meinshausen and Hambly (2004).
Define

D_{y,n} = sup_{0 ≤ k ≤ N, x ∈ E} | y^{*,n}(t_k, x) − y^n(t_k, x) |

as the distance between the true Snell envelope and its approximation, and

σ²_{M_U^inner, n} = sup_{0 ≤ k ≤ N, x ∈ E} E[ ( Ê[ y^n(t_k, X_{t_k}) | X_{t_{k−1}} = x ] − E[ y^n(t_k, X_{t_k}) | X_{t_{k−1}} = x ] )^2 | X_{t_{k−1}} = x ]

as an upper bound on the Monte Carlo error from the 1-step ahead nested simulation to approximate the continuation values.

In order to study the bias coming from the martingale approximations, we define

D_y = sup_{ν=1,...,n−1} sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ, x ∈ E} | y^{*,ν}(u_ν, x) − y^ν(u_ν, x) |

as the distance between the optimal Snell envelope and its approximation over all remaining exercise times,

σ²_{M_U^inner} = sup_{ν=1,...,n−1} sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ, x ∈ E} E[ ( Ê[ y^ν(u_{ν+1}, X_{u_{ν+1}}) | X_{u_ν} = x ] − E[ y^ν(u_{ν+1}, X_{u_{ν+1}}) | X_{u_ν} = x ] )^2 | X_{u_ν} = x ],

and

σ²_{M_U^inner, δ} = sup_{ν=1,...,n−1} sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ, x ∈ E} E[ ( Ê[ y^ν(u_ν + δ, X_{u_ν+δ}) | X_{u_ν} = x ] − E[ y^ν(u_ν + δ, X_{u_ν+δ}) | X_{u_ν} = x ] )^2 | X_{u_ν} = x ].

In other words, σ_{M_U^inner} and σ_{M_U^inner, δ} correspond to upper bounds on the standard deviations of the 1-step ahead and δ-steps ahead Monte Carlo estimates of the continuation values, respectively, using a sample of M_U^inner independent simulations starting from the endpoint of simulation path m, for m = M + M_L + 1, ..., M + M_L + M_U.

The following theorem allows us to control the bias in the derivation of the upper bound from the dual problem.

Theorem 1 (Dual upper bound bias). Let B_δ^n(M, A) be the total bias, defined as the difference between the approximate upper bound using (M^ν)_{ν=1,...,n} and (A^ν)_{ν=1,...,n−1} and the theoretical sharp upper bound using the optimal Doob decomposition components (M^{*,ν})_{ν=1,...,n} and (A^{*,ν})_{ν=1,...,n−1}:

B_δ^n(M, A) = U^n − U^{*,n}.

The following result holds:

B_δ^n(M, A) ≤ 8(n − 1) √( (4 D_y^2 + σ²_{M_U^inner}) T ) + (n − 1)( σ_{M_U^inner, δ} + 2 D_y ) + 2 √( (4 D_{y,n}^2 + σ²_{M_U^inner, n}) T ).   (14)
In order to prove this result, let us state an intermediary result which will appear in the proofs of the following propositions. Define

R_t^ν = M_t^ν − M_t^{*,ν}

as the difference between the martingale approximation and the optimal martingale for the problem with ν remaining exercise times, for ν = 1, ..., n.

Lemma 3.1.
The process R^ν is a martingale with R^ν(0) = 0, for all ν = 1, ..., n, and we have the following inequality on the second moment of the martingale increments, for all 0 ≤ t < T and ν = 1, ..., n:

E[ (R_{t+1}^ν − R_t^ν)^2 | F_t ] ≤ 4 D_y^2 + σ²_{M_U^inner}.

As a consequence,

E[ (R_t^ν)^2 ] ≤ ( 4 D_y^2 + σ²_{M_U^inner} ) t.

Proof. The proof of this lemma follows similar lines to the proof of Lemma 6.1 in Meinshausen and Hambly (2004). Let ν ∈ {1, ..., n}. As a difference of martingales with initial value 0, R^ν is also a martingale with initial value 0. The increments can be rewritten as

R_{t+1}^ν − R_t^ν = M_{t+1}^ν − M_{t+1}^{*,ν} − ( M_t^ν − M_t^{*,ν} )
= y^ν(t+1, X_{t+1}) − Ê[ y^ν(t+1, X_{t+1}) | F_t ] − ( y^{*,ν}(t+1, X_{t+1}) − E[ y^{*,ν}(t+1, X_{t+1}) | F_t ] )
= y^ν(t+1, X_{t+1}) − y^{*,ν}(t+1, X_{t+1}) + E[ y^{*,ν}(t+1, X_{t+1}) | F_t ] − E[ y^ν(t+1, X_{t+1}) | F_t ] + E[ y^ν(t+1, X_{t+1}) | F_t ] − Ê[ y^ν(t+1, X_{t+1}) | F_t ].

Now, both the difference between the first two terms and the difference between the third and fourth terms in the final equality are bounded in absolute value by D_y. The last term corresponds to the error from the Monte Carlo approximation of the 1-step ahead continuation values. Since this error term has mean 0, a second moment bounded by σ²_{M_U^inner}, and is independent of the term ( y^{*,ν}(t+1, X_{t+1}) − y^ν(t+1, X_{t+1}) ), we obtain the desired result.

Proposition 3.1.
The bias in the final term can be bounded by

E[ sup_{0 ≤ u_n ≤ N} | g(u_n, X_{u_n}) − M^n(u_n) − ( g(u_n, X_{u_n}) − M^{*,n}(u_n) ) | ] ≤ 2 √( ( 4 D_{y,n}^2 + σ²_{M_U^inner, n} ) T ).   (15)
Proof.
We consider the error in the final term:

E[ sup_{0 ≤ u_n ≤ N} | M^n(u_n) − M^{*,n}(u_n) | ] ≤ E[ sup_{0 ≤ t ≤ T} | R_t^n | ].

From the Cauchy-Schwarz inequality,

E[ sup_{0 ≤ t ≤ T} | R_t^n | ] ≤ ( E[ sup_{0 ≤ t ≤ T} (R_t^n)^2 ] )^{1/2},

and since R^n is a martingale, (R^n)^2 is a non-negative submartingale which is well-defined from the existence of D_y and σ_{M_U^inner}. Then, using Doob's submartingale inequality,

E[ sup_{0 ≤ t ≤ T} (R_t^n)^2 ] ≤ 4 E[ (R_T^n)^2 ].

This last inequality in combination with Lemma 3.1 leads to the desired result.
Proposition 3.2.
The bias from the approximations of the martingale terms can be bounded by

E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | M^ν(u_ν) − M^ν(u_{ν+1}) − ( M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1}) ) | ] ≤ 4 (n − 1) √( ( 4 D_y^2 + σ²_{M_U^inner} ) T ).   (16)
Proof.
The error in the martingale term for the problem with ν remaining exercise times can be expressed as

| M^ν(u_ν) − M^ν(u_{ν+1}) − ( M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1}) ) | ≤ | R^ν(u_ν) | + | R^ν(u_{ν+1}) | ≤ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} | R^ν(u_ν) | + sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} | R^ν(u_{ν+1}) |.

By taking the sum over ν = 1, ..., n − 1, taking the supremum over the subspace of N^n with the constraints imposed by the presence of the waiting period, and finally taking the expectation, we obtain

E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | M^ν(u_ν) − M^ν(u_{ν+1}) − ( M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1}) ) | ]
≤ (n − 1) E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} | R^ν(u_ν) | ] + (n − 1) E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} | R^ν(u_{ν+1}) | ].

The conclusion then follows by bounding each of these two terms as in the proof of Proposition 3.1.
The bias from the approximations of the non-decreasing predictable terms can be bounded by

E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] − ( A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) | ]
≤ (n − 1) ( σ_{M_U^inner, δ} + 2 D_y + 4 √( ( 4 D_y^2 + σ²_{M_U^inner} ) T ) ).

Proof.
Again, we consider the approximation of the predictable process for the problem with ν remaining exercise times:

A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] − ( A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] )
= M^ν(u_{ν+1}+δ) − M^ν(u_{ν+1}) + E[ y^ν(u_{ν+1}+δ, X_{u_{ν+1}+δ}) | F_{u_{ν+1}} ] − y^ν(u_{ν+1}+δ, X_{u_{ν+1}+δ})
− ( M^{*,ν}(u_{ν+1}+δ) − M^{*,ν}(u_{ν+1}) + E[ y^{*,ν}(u_{ν+1}+δ, X_{u_{ν+1}+δ}) | F_{u_{ν+1}} ] − y^{*,ν}(u_{ν+1}+δ, X_{u_{ν+1}+δ}) ).

Now, since

| M^ν(u_{ν+1}+δ) − M^{*,ν}(u_{ν+1}+δ) | ≤ | R^ν_{u_{ν+1}+δ} |,   | M^ν(u_{ν+1}) − M^{*,ν}(u_{ν+1}) | ≤ | R^ν_{u_{ν+1}} |,

and

| y^{*,ν}(u_{ν+1}+δ, X_{u_{ν+1}+δ}) − y^ν(u_{ν+1}+δ, X_{u_{ν+1}+δ}) | ≤ D_y,

by summing over all exercise opportunities, taking the supremum and then the expectation, we obtain by definition of σ_{M_U^inner, δ}

E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] − ( A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) | ]
≤ (n − 1) ( σ_{M_U^inner, δ} + 2 D_y + 4 √( ( 4 D_y^2 + σ²_{M_U^inner} ) T ) ).

The proof of Theorem 1 is then obtained by summing up all contributions to the total bias from Propositions 3.1, 3.2 and 3.3. We thus obtain an upper bound on the total bias stemming from the errors in all approximations. We see in particular in the expression of the total bias that the waiting period appears implicitly in the error term from the Monte Carlo δ-steps ahead estimation.

We now illustrate the Q-learning approach with several numerical examples. As illustrative examples we present swing options in the multiple stopping framework in several dimensions, with varying maturities, n = 2 exercise rights and a waiting period constraint δ > 0. In all examples we select mini-batches of size 1000 using experience replay on a sample of 1,000,000 simulations. We consider ReLU activation functions applied component-wise, perform stochastic gradient descent for the optimization step using the RMSProp implementation from PyTorch, and initialize the network parameters using the default PyTorch implementation.
Swing options appear in the commodity and energy markets (natural gas, electricity) as hedging instruments to protect investors from futures price fluctuations. They give the holder of the option the right to exercise at multiple times during the lifetime of the contract, the number of exercise opportunities being specified at inception. Further constraints can be imposed at each exercise time, such as the maximal quantity of energy that can be bought or sold, or the minimal waiting period between two exercise times, see e.g. Bender (2011). In the presence of a volume constraint, under certain sufficient conditions, see Bardou et al. (2009), the optimal policy is a so-called "bang-bang" strategy, see e.g. Daluiso et al. (2020), i.e. at each exercise time the optimal strategy is to buy or sell the maximum or the minimum amount allowed, which then simplifies the action space. A model for commodity futures prices is derived in Daluiso et al. (2020), implemented using proximal policy optimization (PPO), which is another tool from reinforcement learning and where the policy update is forced to be close to the previous policy by clipping the advantage function. The pricing of such contracts is also investigated in Meinshausen and Hambly (2004) with no constraints, in Bender (2011) with a waiting time constraint, and in Bender et al. (2015) with both waiting time and volume constraints.

We will consider the same model for the electricity spot prices as in Meinshausen and Hambly (2004), that is, the exponential of a Gaussian Ornstein-Uhlenbeck process, which in discrete time takes the form

log S_{t+1} = (1 − k)(log S_t − µ) + µ + σ Z_t,

where {Z_t}_{t=0,...,T−1} are standard normal random variables, and where we choose σ = 0. , k = 0. , µ = 0, S_0 = 1 and strike price K = 1. We consider the payoff (S_t − K)^+ for time t = 0, ..., T, without any discounting, as in Meinshausen and Hambly (2004), Bender (2011) and Bender et al. (2015). A discount factor could be taken into account with no real additional complexity. In the multi-dimensional setting we will consider the same payoff as max-call options, that is (max_{i=1,...,d} S_t^i − K)^+ for a d-dimensional vector of asset prices (S^1, ..., S^d)^T, where we assume for the marginals the same dynamics as above and independence between the respective innovations. We will consider the same starting value S_0 = 1 for all the assets in the examples below. We stress that this pricing approach can be extended to any other type of Markovian dynamics which are more adequate for capturing electricity prices.

We assume that the arbitrage-free price is given by taking the expectation at (10) under an appropriate pricing measure, that is, a probability measure under which the (discounted) prices of tradable and storable basic securities in the underlying market are (local) martingales. The electricity market being incomplete, the prices will depend on the choice of the pricing measure. The latter can be selected by considering a calibration on liquidly traded swing options.

We select a deep neural network with 3 hidden layers containing 32 neurons each for the examples with d = 3 and d = 10, and 90 neurons each for the examples with d = 50. We present our results in dimensions d = 3, d = 10 and d = 50 in Table 1 below, using M_L = 100,000, M_U = 100 and J = 5000.
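A minimal sketch of simulating paths from this exponential Gaussian Ornstein-Uhlenbeck spot price model is given below; the default values of `sigma` and `k` are illustrative placeholders rather than the calibration used in the examples, and the multi-asset case simply draws independent innovations per asset, as described above.

```python
import numpy as np

def simulate_spot_paths(n_paths, T, mu=0.0, sigma=0.5, k=0.1, s0=1.0, d=1, seed=0):
    """Simulate exponential Gaussian Ornstein-Uhlenbeck spot prices on the grid t = 0, ..., T.

    Discrete-time dynamics: log S_{t+1} = (1 - k)(log S_t - mu) + mu + sigma * Z_t,
    with independent standard normal innovations Z_t (independent across the d assets).
    sigma and k are illustrative placeholder values, not the paper's calibration.
    Returns an array of shape (n_paths, T + 1, d).
    """
    rng = np.random.default_rng(seed)
    log_s = np.full((n_paths, T + 1, d), np.log(s0))
    for t in range(T):
        z = rng.standard_normal((n_paths, d))
        log_s[:, t + 1] = (1 - k) * (log_s[:, t] - mu) + mu + sigma * z
    return np.exp(log_s)

# Payoff of the d-dimensional max-call at time t: (max_i S_t^i - K)^+ .
```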
Table 1: Prices at t = 0 for swing options with varying maturities and asset price dimensions, K = 1, µ = 0, σ = 0. , and k = 0. .

Model parameters                      L̂      PE      Û      CI
d = 3,  n = 2, δ = 2, T = 10
d = 3,  n = 2, δ = 2, T = 20
d = 10, n = 2, δ = 2, T = 10
d = 10, n = 2, δ = 2, T = 20
d = 50, n = 2, δ = 2, T = 10
d = 50, n = 2, δ = 2, T = 20

We have presented optimal stopping problems appearing in the valuation of financial products under the lens of reinforcement learning. This new angle allows us to model the optimal action-value function using the RL machinery and deep neural networks. This method could serve as an alternative to recent approaches developed in the literature, be it to derive the optimal policy by modeling directly the stopping times as in Becker et al. (2019), or by modeling the continuation values by approximating conditional expectations as in Becker et al. (2020). We have also considered the pricing of multiple exercise stopping problems with a waiting period constraint and derived lower and upper bounds on the option price, using the trained neural network and the dual representation, respectively. In addition, we have proved a result that controls the total bias resulting from the approximation of the terms appearing in the dual formulation. The RL framework is suitable for configurations where the action space varies in a non-trivial way with time, i.e. there are certain degrees of freedom for the agent to explore the environment at each time step. This is exemplified through the swing option with multiple stopping rights and waiting time constraint, but could also be useful for more complex environments. It could also be interesting to investigate state-of-the-art improvements to the DQN algorithm brought forward in Hessel et al. (2017). One could explore these avenues in further research.

Acknowledgements
We thank Prof. Patrick Cheridito for helpful comments and for carefully reading previous versions of the manuscript.

As SCOR Fellow, John Ery thanks SCOR for financial support.

Both authors have contributed equally to this work.
References
Bardou, O., Bouthemy, S., and Pagès, G. (2009). Optimal quantization for the pricing of swing options. Applied Mathematical Finance, 16(2):183–217.

Becker, S., Cheridito, P., and Jentzen, A. (2019). Deep optimal stopping. Journal of Machine Learning Research, 20(74):1–25.

Becker, S., Cheridito, P., and Jentzen, A. (2020). Pricing and hedging American-style options with deep learning. Journal of Risk and Financial Management, 13(7):158.

Bender, C. (2011). Primal and dual pricing of multiple exercise options in continuous time. SIAM Journal on Financial Mathematics, 2(1):562–586.

Bender, C., Schoenmakers, J., and Zhang, J. (2015). Dual representations for general multiple stopping problems. Mathematical Finance, 25(2):339–370.

Bertsekas, D. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Massachusetts, USA.

Chen, Y. and Wan, J. (2020). Deep neural network framework based on backward stochastic differential equations for pricing and hedging American options in high dimensions. To appear in Quantitative Finance.

Daluiso, R., Nastasi, E., Pallavicini, A., and Sartorelli, G. (2020). Pricing commodity swing options. ArXiv Preprint 2001.08906, version of January 24, 2020.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. (2017). Rainbow: Combining improvements in deep reinforcement learning. ArXiv Preprint 1710.02298, version of October 6, 2017.

Karatzas, I. and Shreve, S. (1991). Brownian Motion and Stochastic Calculus. Springer, Graduate Texts in Mathematics, 2nd edition.

Kohler, M., Krzyżak, A., and Todorovic, N. (2008). Pricing of high-dimensional American options by neural networks. Mathematical Finance, 20(3):383–410.

Lagoudakis, M. and Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149.

Lapeyre, B. and Lelong, J. (2019). Neural network regression for Bermudan option pricing. ArXiv Preprint 1907.06474, version of December 16, 2019.

Li, Y., Szepesvari, C., and Schuurmans, D. (2009). Learning exercise policies for American options. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA, Volume 5 of JMLR: W&CP 5.

Longstaff, F. and Schwartz, E. (2001). Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1):113–147.

Meinshausen, N. and Hambly, B. M. (2004). Monte Carlo methods for the valuation of multiple-exercise options. Mathematical Finance, 14(4):557–583.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.

Schoenmakers, J. (2012). A pure martingale dual for multiple stopping. Finance and Stochastics, 16:319–334.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2017). Mastering chess and Shogi by self-play with a general reinforcement learning algorithm. ArXiv Preprint 1712.01815, version of December 5, 2017.

Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.

Tsitsiklis, J. and Roy, B. V. (2001). Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks, 12(4):694–703.

van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16).

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, Volume 48 of JMLR: W&CP.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4):279–292.