Solving optimal stopping problems with Deep Q-Learning
John Ery∗   Loris Michel†

January 26, 2021

∗ RiskLab, Department of Mathematics, ETH Zurich, [email protected]
† Seminar für Statistik, Department of Mathematics, ETH Zurich, [email protected]
Abstract
We propose a reinforcement learning (RL) approach to model optimal exercise strategies for option-type products. We pursue the RL avenue in order to learn the optimal action-value function of the underlying stopping problem. In addition to retrieving the optimal Q-function at any time step, one can also price the contract at inception. We first discuss the standard setting with one exercise right, and later extend this framework to the case of multiple stopping opportunities in the presence of constraints. We propose to approximate the Q-function with a deep neural network, which does not require the specification of basis functions as in the least-squares Monte Carlo framework and is scalable to higher dimensions. We derive a lower bound on the option price obtained from the trained neural network and an upper bound from the dual formulation of the stopping problem, which can also be expressed in terms of the Q-function. Our methodology is illustrated with examples covering the pricing of swing options.
Reinforcement learning (RL) in its most general form deals with agents living in some environment and aiming at maximizing a given reward function. Alongside supervised and unsupervised learning, it is often considered as the third family of models in the machine learning literature. It encompasses a wide class of algorithms that have gained popularity in the context of building intelligent machines that can outperform masters in ancestral board games such as Go or chess, see e.g. Silver et al. (2016); Silver et al. (2017). These models are very skilled when it comes to learning the rules of a certain game, starting from little or no prior knowledge at all, and progressively developing winning strategies. Recent research, see e.g. Mnih et al. (2013), van Hasselt et al. (2016), Wang et al. (2016), has considered integrating deep learning techniques in the framework of reinforcement learning in order to model complex unstructured environments. Deep reinforcement learning can hence leverage the ability of deep neural networks to uncover hidden structure from very complex functionals and the power of reinforcement techniques to take complex actions.

Optimal stopping problems from mathematical finance naturally fit into the reinforcement learning framework. Our work is motivated by the pricing of swing options which appear in energy markets (oil, natural gas, electricity) to hedge against futures price fluctuations, see e.g. Meinshausen and Hambly (2004), Bender et al. (2015), and more recently Daluiso et al. (2020). Intuitively, when behaving optimally, investors holding these options are trying to maximize their reward by following some optimal sequence of decisions, which in the case of swing options consists in purchasing a certain amount of electricity or natural gas at multiple exercise times.

The stopping problems we will consider belong to the category of Markov decision processes (MDP). We refer the reader to Puterman (1994) or Bertsekas (1995) for good textbook references on this topic. When the size of the MDP becomes large or when the MDP is not fully known (model-free learning), alternatives to standard dynamic programming techniques must be sought. Reinforcement learning can efficiently tackle these issues and can be transposed to our problem of determining optimal stopping strategies.

Previous work exists on the connections between optimal stopping problems in mathematical finance and reinforcement learning. For example, the common problem of learning optimal exercise policies for American options has been tackled in Li et al. (2009) using reinforcement learning techniques. They implement two algorithms, namely least-squares policy iteration (LSPI), see Lagoudakis and Parr (2003), and fitted Q-iteration (FQI).
In this section we present the mathematical building blocks and the reinforcement learning machinery, leading to the formulation of the stopping problems under consideration.
As discussed in the introduction, the problems we will consider in the sequel can be embedded into the framework of the well-studied Markov decision processes (MDPs), see Sutton and Barto (1998). A Markov decision process is defined as a tuple (S, A, p, R, γ), where

• S is the set of states;
• A is the set of actions the agent can take;
• p is the transition probability kernel, where p(· | s, a) is the probability of future states given that the current state is s and that action a is taken;
• R is a reward function, where R(s, a) denotes the reward obtained when moving from state s under action a (note here that different definitions exist in the literature);
• γ ∈ (0, 1] is a discount factor which expresses preference towards short-term rewards (in the present work γ = 1 as we consider already discounted rewards).

A policy π is then a rule for selecting actions based on the last visited state. More specifically, π(s, a) denotes the probability of taking action a in state s under policy π. The conventional task is to maximize the total (discounted) expected reward over policies, which can be expressed as E_π[Σ_{t=0}^∞ γ^t R_t]. A policy which maximizes this quantity is called an optimal policy. Given a starting state s and an initial action a, one can define the action-value function, also called Q-function:

Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t R_t | s_0 = s, a_0 = a ],   (1)

where R_t = R(s_t, a_t) for a sequence of state-action pairs (s_t, a_t)_{t≥0} ~ π. The optimal policy π* satisfies

Q*(s, a) = sup_π Q^π(s, a),   (2)

where we write Q* for Q^{π*}. In other words, the optimal Q-function measures how "good" or "rewarding" it is to choose action a while in state s, by following optimal decisions. We will consider problems with finite time horizon T > 0, and we accordingly set R_t = 0 for all t > T.

We consider the same stopping problem as in Becker et al. (2019) and Becker et al. (2020), namely an American-style option defined on a finite time grid t_0 < t_1 < ... < t_N = T. The discounted payoff process (G_n)_{n=0}^N is assumed to be square-integrable and takes the form G_n = g(n, X_{t_n}) for a measurable function g : {0, 1, ..., N} × R^d → R and a d-dimensional F-Markovian process (X_{t_n})_{n=0}^N defined on a filtered probability space (Ω, F, F = (F_n)_{n=0}^N, P). Let E ⊂ R^d denote the space in which the underlying process lives. We assume that X_0 is deterministic and that P is the risk-neutral probability measure. The value of the option at time 0 is given by

V_0 = sup_{τ ∈ T} E[g(τ, X_τ)],   (3)

where T denotes the set of all stopping times τ : Ω → {t_0, t_1, ..., t_N}. This problem is essentially a Markov decision process with state space S = {0, 1, ..., N} × R^d × {0, 1}, action space A = {0, 1} (where we follow the convention a = 0 for continuing and a = 1 for stopping), reward function

R((n, X_{t_n}), a) = g(n, X_{t_n}) if a = 1, and 0 if a = 0,   for n = 0, ..., N,

and transition kernel p driven by the dynamics of the F-Markovian process (X_{t_n})_{n=0}^N. The state space includes time, the d-dimensional Markovian process and an additional (absorbing) state which at each time step captures the event of exercise or no exercise. More precisely, we jump to this absorbing state when we have exercised. In the multiple stopping case which we discuss in Section 3, we jump to this absorbing state once we have used the last exercise right.
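To make this embedding concrete, the following minimal sketch shows one possible way to encode the state transition and reward of the stopping MDP just described. It is an illustration of the formulation above, not the authors' implementation; the callables `payoff` and `sample_next_state` are hypothetical names for the payoff g and the Markovian dynamics of X.

```python
def step(state, action, payoff, sample_next_state):
    """One transition of the stopping MDP.

    state  : (n, x, stopped) with time index n, Markov state x, absorbing flag
    action : 0 = continue, 1 = stop
    payoff : function (n, x) -> discounted payoff g(n, x)
    sample_next_state : function (n, x) -> draw of X_{t_{n+1}} given X_{t_n} = x
    """
    n, x, stopped = state
    if stopped:                       # absorbing state: no further rewards
        return (n + 1, x, True), 0.0
    if action == 1:                   # exercise: collect the payoff, jump to the absorbing state
        return (n + 1, x, True), payoff(n, x)
    # continue: zero reward, move along the Markovian dynamics
    return (n + 1, sample_next_state(n, x), False), 0.0
```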
In both single and multiple stopping frameworks, once this absorbing state has been reached at a random time τ : Ω → {t_0, t_1, ..., t_N}, we set all rewards and Q-values to 0 for t > τ. The associated Snell envelope process (Z_n)_{n=0}^N of the stopping problem in (3) is defined recursively by

Z_n = g(N, X_{t_N}) if n = N, and Z_n = max{ g(n, X_{t_n}), E[Z_{n+1} | F_n] } if 0 ≤ n ≤ N − 1.   (4)

It is well known that the Snell envelope provides an optimal stopping time solving (3), as stated in the following result. A standard proof for the latter can be found in Karatzas and Shreve (1991).
Proposition 2.1.
The stopping time τ* defined by

τ* = inf{ n : Z_n = g(n, X_{t_n}) }

for the Snell envelope (Z_n)_{n=0}^N given in (4) is optimal for the problem (3).

Various modeling approaches have been proposed to estimate the option value in (3). Kohler et al. (2008) propose to model directly the Snell envelope, Becker et al. (2019) take the approach of modeling the optimal stopping times. More recently, Becker et al. (2020) model the continuation values of the stopping problem. In this work, we rather propose to model the optimal action-value function of the problem, Q*((n, X_{t_n}), a) for all n = 0, ..., N and a ∈ {0, 1} (where a represents the stopping decision), given by

Q*((n, X_{t_n}), a) = g(n, X_{t_n}) if a = 1, and E[Z_{n+1} | F_n] if a = 0.   (5)

When exercising (taking action a = 1), we implicitly move to the absorbing state, i.e. the last component of the state space becomes 1. Note that in particular Z_0 = V_0.

From the optimal action-value function Q*((n, X_{t_n}), a), we can recover the optimal stopping time τ*. Indeed, it turns out that the optimal decision functions f_0, ..., f_N in Becker et al. (2019) can be expressed in the action-value function framework through

f_n(X_{t_n}) = 1{ argmax_{a ∈ {0,1}} Q*((n, X_{t_n}), a) = 1 },   for all n = 0, ..., N,

where 1{·} denotes the indicator function. Moreover, one can express the Snell envelope (estimated in Kohler et al. (2008)) as Z_n = max{ Q*((n, X_{t_n}), 0), Q*((n, X_{t_n}), 1) }, and the continuation value modeled in Becker et al. (2020) can be reformulated in our setting as C_n = Q*((n, X_{t_n}), 0). As a by-product, one can price financial products such as swing options by considering max{ Q*((0, X_0), 0), Q*((0, X_0), 1) }.

In this perspective, our modeling approach is very similar to previous studies but differs in the reinforcement learning machinery employed. Indeed, modeling the action-value function and optimizing it is a common and natural approach known under the name of Q-learning in the reinforcement learning literature. We introduce it in the next section.
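As a small illustration of the correspondences just listed, the following sketch reads the stopping decision and the time-0 price off a learned Q-function; `q_net` is a hypothetical name for a trained approximation mapping a state (n, x) to its two Q-values.

```python
def stopping_decision(q_net, n, x):
    """f_n(x): exercise (1) when the Q-value of stopping exceeds that of continuing."""
    q_continue, q_stop = q_net(n, x)   # approximations of Q((n, x), 0) and Q((n, x), 1)
    return int(q_stop > q_continue)

def price_at_inception(q_net, x0):
    """Time-0 price as max{Q((0, X_0), 0), Q((0, X_0), 1)}."""
    q_continue, q_stop = q_net(0, x0)
    return max(q_continue, q_stop)
```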
In contrast to policy or value iteration, Q-learning methods, see e.g. Watkins (1989) and Watkins and Dayan (1992), estimate directly the optimal action-value function. They are model-free and can learn optimal strategies with no prior knowledge of the state transitions and the rewards. In this paradigm, an agent interacts with the environment (exploration step) and learns from past actions (exploitation step) to derive the optimal strategy.

One way to model the action-value function is by using deep neural networks. This approach is referred to under the name deep Q-learning in the reinforcement learning literature. In this setup, the optimal action-value function Q* is modeled with a neural network Q(s, a; θ), often called a deep Q-network (DQN), where θ is a vector of parameters corresponding to the network architecture. However, reinforcement learning can be highly unstable or even potentially diverge due to the introduction of neural networks in the approximation of the Q-function. To tackle these issues, a variant of the original Q-learning method has been developed in Mnih et al. (2015). It relies on two main concepts. The first is called experience replay and makes it possible to remove correlations in the sequence of observations. In practice this is done by generating a large sample of experiences, which we denote as vectors e_t = (s_t, a_t, r_t, s_{t+1}) at each time t, and that we store in a dataset D. We note that once we have reached the absorbing state, we start a new episode or sequence of observations by resetting the MDP to the initial state s_0. Furthermore, we allow the agent to explore new unseen states according to a so-called ε-greedy strategy, see Sutton and Barto (1998), meaning that with probability ε we take a random action and with probability (1 − ε) we take the action maximizing the Q-value. Typically one reduces the value of ε according to a linear schedule as the training iterations increase.

During the training phase, we then perform updates to the Q-values by sampling mini-batches uniformly at random from this dataset, (s, a, r, s') ~ U(D), and minimizing over θ the following loss function:

L(θ) = E_{(s,a,r,s') ~ U(D)}[ ( R(s, a) + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ) )^2 ].   (6)

However, there might still be some correlations between the Q-values Q(s, a; θ) and the so-called target values R(s, a) + γ max_{a'} Q(s', a'; θ). The second improvement brought forward in Mnih et al. (2015) consists in updating the network parameters for the target values only with a regular frequency and not after each iteration. This is called parameter freezing and translates into minimizing over θ the modified loss function

L(θ) = E_{(s,a,r,s') ~ U(D)}[ ( R(s, a) + γ max_{a'} Q(s', a'; θ*) − Q(s, a; θ) )^2 ],   (7)

where the target network parameters θ* are only updated with the DQN parameters θ every T* > 0 steps, and are held constant between individual updates.

An alternative network specification would be to take only the state as input, Q(s; θ), and update the Q-values for each action, see the implementation in Mnih et al. (2013). Network architectures such as double deep Q-networks, see van Hasselt et al. (2016), dueling deep Q-networks, see Wang et al. (2016), and combinations thereof, see Hessel et al. (2017), have been developed to improve the training performance even further. However, the implementation of these algorithms is out of the scope of our presentation.
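The following is a compact, self-contained sketch of deep Q-learning with experience replay, an ε-greedy exploration schedule and a frozen target network, in the spirit of the procedure just described. It is not the authors' exact configuration: the class name `QNet`, the layer widths, the schedule constants and the hypothetical environment interface (`env.reset()` returning a state vector, `env.step(action)` returning `(next_state, reward, done)`) are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state vector (time, Markov state, absorbing flag) to Q-values for a = 0, 1."""
    def __init__(self, state_dim, n_actions=2, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def train_dqn(env, state_dim, n_steps=10_000, batch_size=1000,
              gamma=1.0, eps_start=1.0, eps_end=0.05, target_freq=100):
    q_net = QNet(state_dim)
    target_net = QNet(state_dim)
    target_net.load_state_dict(q_net.state_dict())     # frozen copy, refreshed every target_freq steps
    optimizer = torch.optim.RMSprop(q_net.parameters())
    replay = deque(maxlen=100_000)                      # experience replay dataset D

    state = env.reset()
    for step in range(n_steps):
        eps = max(eps_end, eps_start - (eps_start - eps_end) * step / n_steps)  # linear ε schedule
        if random.random() < eps:                       # ε-greedy exploration
            action = random.randrange(2)
        else:
            with torch.no_grad():
                action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
        next_state, reward, done = env.step(action)
        replay.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state     # start a new episode after absorption

        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)   # (s, a, r, s') ~ U(D)
            s, a, r, s2, d = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                              for x in zip(*batch))
            with torch.no_grad():                       # target values with frozen parameters θ*
                target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
            q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, target)    # empirical version of the loss (7)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step % target_freq == 0:                     # parameter freezing update
            target_net.load_state_dict(q_net.state_dict())
    return q_net
```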
In the same spirit as Becker et al. (2019) and Becker et al. (2020), we compute lower and upper bounds on the option price in (3), the confidence interval resulting from the central limit theorem, as well as a point estimate for the optimal value V_0. In the sequel, for ease of notation, we will use X_{t_n} = X_n for n = 0, ..., N.

We store the parameters learned through the training of the deep neural network on an experience replay dataset with simulations (X_n^k)_{n=0}^N for k = 1, ..., K. We denote by θ̂ ∈ Θ the vector of network parameters, where Θ ⊂ R^q and q > 0 denotes the dimension of the parameter space, and Q(s, a; θ̂) corresponds to the calibrated network. We then generate new simulations of the state space process (X_n^k)_{n=0}^N, independent from those used for training, for k = K + 1, ..., K + K_L. The independence is necessary to achieve unbiasedness of the estimates. The Monte Carlo average

L̂ = (1/K_L) Σ_{k=K+1}^{K+K_L} g(τ^k, X^k_{τ^k}),   where   τ^k = inf{ 0 ≤ n ≤ N : Q((n, X_n^k), 1; θ̂) > Q((n, X_n^k), 0; θ̂) },

yields a lower bound for the optimal value V_0. Since the optimal strategies are not unique, we follow the convention of taking the largest optimal stopping rule, which yields a strict inequality.
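A minimal sketch of this lower-bound estimate from a trained Q-network is given below, assuming independently simulated paths `paths` of shape (K_L, N+1, d), a payoff function `g(n, x)` and a hypothetical `q_values(n, x)` returning the pair (Q((n, x), 0; θ̂), Q((n, x), 1; θ̂)); when the strict-inequality rule never triggers, the sketch defaults to exercising at maturity. The returned one-sided bound follows the central limit theorem as used for the confidence interval below.

```python
import numpy as np
from scipy.stats import norm

def lower_bound(paths, g, q_values, alpha=0.05):
    """Monte Carlo lower bound L_hat and the lower edge of the asymptotic confidence interval."""
    K_L, N_plus_1, _ = paths.shape
    realized = np.empty(K_L)
    for k in range(K_L):
        tau = N_plus_1 - 1                      # default: exercise at maturity
        for n in range(N_plus_1):
            q_cont, q_stop = q_values(n, paths[k, n])
            if q_stop > q_cont:                 # first time stopping strictly dominates continuing
                tau = n
                break
        realized[k] = g(tau, paths[k, tau])
    L_hat = realized.mean()
    sigma_L = realized.std(ddof=1)              # empirical standard deviation of the payoffs
    z = norm.ppf(1 - alpha / 2)
    return L_hat, L_hat - z * sigma_L / np.sqrt(K_L)
```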
The derivation of the upper bound is based on the Doob-Meyer decomposition of the supermartingale given by the Snell envelope, see Karatzas and Shreve (1991). The Snell envelope (Z_n)_{n=0}^N of the discounted payoff process (G_n)_{n=0}^N can be decomposed as

Z_n = Z_0 + M_n^Z − A_n^Z,

where M^Z is the (F_n)-martingale given by M_0^Z = 0 and M_n^Z − M_{n−1}^Z = Z_n − E[Z_n | F_{n−1}], n = 1, ..., N, and A^Z is the non-decreasing (F_n)-predictable process given by A_0^Z = 0 and A_n^Z − A_{n−1}^Z = Z_{n−1} − E[Z_n | F_{n−1}], n = 1, ..., N.

From Proposition 7 in Becker et al. (2019), given a sequence (ε_n)_{n=0}^N of integrable random variables in (Ω, F, P) such that E[ε_n | F_n] = 0 for all n = 0, ..., N, one has

V_0 ≤ E[ max_{0 ≤ n ≤ N} ( g(n, X_n) − M_n − ε_n ) ],

for every (F_n)-martingale (M_n)_{n=0}^N starting from 0. This upper bound is tight if M = M^Z and ε ≡ 0. We can then use the optimal action-value function learned via the deep neural network to construct a martingale close to M^Z. We now adapt the approach presented in Becker et al. (2019) to the expression of the martingale component of the Snell envelope. Indeed, the martingale differences ΔM_n from Subsection 3.2 in Becker et al. (2019) can be written in terms of the optimal action-value function:

ΔM_n = M_n − M_{n−1} = Q*((n, X_n), a_n) − Q*((n−1, X_{n−1}), 0),

where the second term is given by evaluating the optimal action-value function at action a = 0 (continuing). Given the definition of the optimal action-value function at (5), one can rewrite the martingale differences as

ΔM_n = g(n, X_n) 1{a_n = 1} + E[Z_{n+1} | F_n] 1{a_n = 0} − E[Z_n | F_{n−1}].   (8)

The empirical counterparts are given by generating realizations M_n^k of M_n + ε_n based on a sample of K_U simulations (X_n^k)_{n=0}^N, for k = K + K_L + 1, ..., K + K_L + K_U. Again, we simulate realizations of the state space process independently from the simulations used for training. This gives us the following empirical differences:

ΔM_n^k = g(n, X_n^k) 1{a_n^k = 1} + Ê[Z_{n+1}^k | F_n] 1{a_n^k = 0} − Ê[Z_n^k | F_{n−1}],

where a_n^k is the chosen action at time n for simulation path k, and Ê[Z_{n+1}^k | F_n] are the Monte Carlo averages approximating the continuation values for n = 0, ..., N − 1 and k = K + K_L + 1, ..., K + K_L + K_U. The continuation values appearing in the martingale increments are obtained through nested simulation, see the remark below:

Ê[Z_{n+1}^k | F_n] = (1/J) Σ_{j=1}^J g( τ_{n+1}^{k,j}, X̃^{k,j}_{τ_{n+1}^{k,j}} ),

where J is the number of simulations in the inner step, and where, given each X_n^k, we simulate (conditional) continuation paths X̃_{n+1}^{k,j}, ..., X̃_N^{k,j}, j = 1, ..., J, that are conditionally independent of each other and of X_{n+1}^k, ..., X_N^k, and τ_{n+1}^{k,j} is the value of τ_{n+1}^θ along the path X̃_{n+1}^{k,j}, ..., X̃_N^{k,j}.

Remark.
It is not guaranteed that E[ΔM_n | F_{n−1}] = 0 for the Q-function learned via the neural network. To tackle this issue, we implement nested simulations as in Becker et al. (2019) and Becker et al. (2020) to estimate the continuation values. This gives unbiased estimates of M_n, which is crucial to obtain a valid upper bound. Moreover, the variance of the estimates decreases with the number of inner simulations, at the expense of increased computational time.

Finally, we can derive an unbiased estimate for the upper bound of the optimal value V_0:

Û = (1/K_U) Σ_{k=K+K_L+1}^{K+K_L+K_U} max_{0 ≤ n ≤ N} ( g(n, X_n^k) − M_n^k ),

with M_n^k = Σ_{m=1}^n ΔM_m^k.

The average between the lower and the upper bound is considered as the point estimate of V_0 in Becker et al. (2019) and Becker et al. (2020): (L̂ + Û)/2.

Assuming the discounted payoff process is square-integrable for all n = 0, ..., N, we also obtain that the upper bound max_{0 ≤ n ≤ N}(g(n, X_n) − M_n − ε_n) is square-integrable. Let z_{α/2} denote the (1 − α/2)-quantile of a standard normal distribution. Defining the empirical standard deviations for the lower and upper bounds as

σ̂_L = √( (1/(K_L − 1)) Σ_{k=K+1}^{K+K_L} ( g(τ^k, X^k_{τ^k}) − L̂ )^2 )

and

σ̂_U = √( (1/(K_U − 1)) Σ_{k=K+K_L+1}^{K+K_L+K_U} ( max_{0 ≤ n ≤ N}( g(n, X_n^k) − M_n^k ) − Û )^2 ),

we obtain the asymptotic two-sided (1 − α)-confidence interval for the true optimal value V_0:

[ L̂ − z_{α/2} σ̂_L / √K_L ,  Û + z_{α/2} σ̂_U / √K_U ].   (9)

We have presented in this section the unifying properties of Q-learning compared to other approaches used to study optimal stopping problems. On the one hand, we do not require any iterative procedure and do not have to solve a potentially complicated optimization problem at each time step. Indeed, the calibrated deep neural network solves the optimal stopping problem on the whole time interval. On the other hand, we are able to accommodate any finite number of possible actions. Looking back at the direct approach of Becker et al. (2019) to model optimal stopping policies, the parametric form of the stopping times would explode if we allowed for more than two possible actions.

In this section we extend the previous problem to the more general framework of multiple-exercise options. Examples from this family include swing options, which are common in the electricity market. The holder of such an option is entitled to exercise a certain right, e.g. the delivery of a certain amount of energy, several times, until the maturity of the contract. The number of exercise rights and constraints on how they can be used are specified at inception. Typical constraints are a waiting period, i.e. a minimal waiting time between two exercise rights, and a volume constraint, which specifies how many units of the underlying asset can be purchased at each time.

Monte Carlo valuation of such products has been studied in Meinshausen and Hambly (2004), producing lower and upper bounds for the price. Building on the dual formulation for option pricing, alternative methods additionally accounting for waiting time constraints have been considered in Bender (2011), and for both volume and waiting time constraints in Bender et al. (2015). In all cases, the multiple stopping problem is decomposed into several single stopping problems using the so-called reduction principle.
The dual formulation in Meinshausen and Hambly (2004) expresses the marginal excess value due to each additional exercise right as an infimum of an expectation over a certain space of martingales and a set of stopping times. A version of the dual problem in discrete time relying solely on martingales is presented in Schoenmakers (2012), and a dual for the continuous time problem with a non-trivial waiting time constraint is derived in Bender (2011). In the latter case, the optimization is not only over a space of martingales, but also over adapted processes of bounded variation, which stem from the Doob-Meyer decomposition of the Snell envelope. The dual problem in the more general setting considering both volume and waiting time constraints is formulated in Bender et al. (2015).

We now express the multiple stopping extension of the problem defined at (3) for American-style options. Assume that the option holder has n > 1 exercise rights over the lifetime of the contract. We consider the setting with no volume constraint and a waiting time δ > 0, which we assume to be a multiple of the time step resulting from the discretization of the interval [0, T]. The action space is still A = {0, 1}. The state space now has an additional dimension corresponding to the number of remaining exercise opportunities. As in standard stopping, we assume an absorbing state to which we jump once the n-th right has been exercised.

We note that due to the introduction of the waiting period, depending on the specification of n, T and δ, it may not be possible for the option holder to exercise all his rights before maturity, see the discussion in Bender et al. (2015), where a "cemetery time" is defined. If the specification of these parameters allows the exercise of all rights, and if we assume that g(n, X_{t_n}) ≥ 0 for all n = 0, ..., N, then it will always be optimal to use all exercise rights. The value of this option with n > 1 exercise possibilities at time 0 is given by

V_0^n = sup_{τ ∈ T_δ^n} Σ_{i=1}^n E[ g(τ_i, X_{τ_i}) ],   (10)

where T_δ^n is the set of n-tuples τ = (τ_n, τ_{n−1}, ..., τ_1) of stopping times in {t_0, t_1, ..., t_N}^n satisfying τ_i ≥ τ_{i+1} + δ for i = 1, ..., n − 1.

As in Bender (2011), one can combine the dynamic programming principle with the reduction principle to rewrite the primal optimization problem. We introduce the following functions defined in Bender (2011) for ν = 1, ..., n and k = N, ..., 0:

q^ν(k, x) = E[ y^ν(t_{k+1}, X_{t_{k+1}}) | X_{t_k} = x ],
q^ν_δ(k, x) = E[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | X_{t_k} = x ],

and we define the functions y^ν as

y^ν(t_k, x) = max{ g(t_k, x) + q^{ν−1}_δ(k, x), q^ν(k, x) }.

We set q^0_δ(k, x) = 0 for all x ∈ R^d and all k ∈ {0, ..., N}, and g(t, X_t) = 0 for all t > T. In the sequel, we denote by y^{*,ν} the Snell envelope for the problem with ν remaining exercise rights, for ν = 1, ..., n. The reduction principle essentially states that the option with n stopping times is as good as the single option paying the immediate cashflow plus the option with (n − 1) stopping times starting with a temporal delay of δ. This philosophy is also followed in Meinshausen and Hambly (2004) by looking at the marginal extra payoff obtained with an additional exercise right. The function q^ν corresponds to the continuation value in case of no exercise and the function q^ν_δ to the continuation value in case of exercise, which requires a waiting period of δ.
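To make the reduction principle concrete, the recursion can be unrolled for n = 2 exercise rights; the following worked version follows directly from the definitions above.

```latex
% Recursion for n = 2 exercise rights, written out from the definitions of q^\nu, q^\nu_\delta and y^\nu.
% With one right left (\nu = 1), exercising ends the contract since q^0_\delta \equiv 0:
y^1(t_k, x) = \max\bigl\{\, g(t_k, x),\; q^1(k, x) \,\bigr\},
\qquad q^1(k, x) = \mathbb{E}\bigl[\, y^1(t_{k+1}, X_{t_{k+1}}) \,\big|\, X_{t_k} = x \,\bigr].
% With both rights left (\nu = 2), exercising pays g and leaves the single-right option
% available only after the waiting period \delta:
y^2(t_k, x) = \max\bigl\{\, g(t_k, x) + q^1_\delta(k, x),\; q^2(k, x) \,\bigr\},
\qquad q^1_\delta(k, x) = \mathbb{E}\bigl[\, y^1(t_{k+\delta}, X_{t_{k+\delta}}) \,\big|\, X_{t_k} = x \,\bigr].
% With exact conditional expectations these functions coincide with y^{*,1}, y^{*,2},
% and the time-0 price with two rights is V_0^2 = y^{*,2}(0, X_0).
```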
As shown in Bender (2011), one can derive the optimal policy from the continuation values. Indeed, the optimal stopping times τ_ν^{*,n}, for ν = 1, ..., n, are given by

τ_ν^{*,n} = inf{ k ≥ τ_{ν+1}^{*,n} + δ : g(t_k, X_{t_k}) + q^{*,ν−1}_δ(k, X_{t_k}) ≥ q^{*,ν}(k, X_{t_k}) },   (11)

with starting value τ_{n+1}^{*,n} = −δ, which is a convention to make sure that the first exercise time is bounded from below by 0. The optimal price is then

V_0^n = y^{*,n}(0, X_0),

and, as in the single stopping framework, one can express the Snell envelope, the optimal stopping times and the continuation values in terms of the optimal Q-function Q*. Indeed, the continuation values can be expressed as

q^{*,ν}(k, X_{t_k}) = Q*((t_k, X_{t_k}), ν),
q^{*,ν}_δ(k, X_{t_k}) = Q*((t_{k+δ}, X_{t_{k+δ}}), ν),

the Snell envelope as

y^{*,ν}(t_k, X_{t_k}) = max{ g(t_k, X_{t_k}) + Q*((t_{k+δ}, X_{t_{k+δ}}), ν − 1), Q*((t_k, X_{t_k}), ν) },

and the optimal policy as

τ_ν^{*,n} = inf{ k ≥ τ_{ν+1}^{*,n} + δ : g(t_k, X_{t_k}) + Q*((t_{k+δ}, X_{t_{k+δ}}), ν − 1) ≥ Q*((t_k, X_{t_k}), ν) }.   (12)

To remain consistent with the notation introduced above for the functions q^ν, q^ν_δ and y^ν, we denote by Q*((t_k, X_{t_k}), ν) the optimal Q-value in state (t_k, X_{t_k}, ν), i.e. when there are ν remaining exercise rights. Analogously to standard stopping with one exercise right, we can derive a lower bound from the primal problem and an upper bound from the dual problem. Moreover, we derive a confidence interval around the pointwise estimate based on Monte Carlo simulations.

As in Section 2.4.1, we denote by Q(s, a; θ̂) the deep neural network calibrated through the training process using experience replay on a sample of simulated paths (X_n^m)_{n=0}^N for m = 1, ..., M. We then generate a new set of M_L simulations (X_n^m)_{n=0}^N, independent from the simulations used for training, for m = M + 1, ..., M + M_L. Then, using the learned stopping times

τ_ν^{m,n} = inf{ k ≥ τ_{ν+1}^{m,n} + δ : g(t_k, X_{t_k}^m) + Q((t_{k+δ}, X_{t_{k+δ}}^m), ν − 1; θ̂) ≥ Q((t_k, X_{t_k}^m), ν; θ̂) },

for ν = 1, ..., n, and with the convention τ_{n+1}^{m,n} = −δ for all m = M + 1, ..., M + M_L, the Monte Carlo average

L̂^n = (1/M_L) Σ_{m=M+1}^{M+M_L} Σ_{ν=1}^n g(τ_ν^{m,n}, X^m_{τ_ν})

yields a lower bound for the optimal value V_0^n. In order not to overload the notation, we write τ_ν = τ_ν^{m,n} in the subscript of the simulated state space above.

Upper bound

By exploiting the dual as in Bender (2011), one can also derive an upper bound on the optimal value V_0^n. In order to do so, we consider the Doob decomposition of the supermartingales y^{*,ν}(t_k, X_{t_k}) given by

y^{*,ν}(t_k, X_{t_k}) = y^{*,ν}(0, X_0) + M^{*,ν}(k) − A^{*,ν}(k),

where M^{*,ν}(k) is an (F_k)-martingale with M^{*,ν}(0) = 0 and A^{*,ν}(k) is a non-decreasing (F_k)-predictable process with A^{*,ν}(0) = 0, for all ν = 1, ..., n and k = 0, ..., N. The corresponding approximated terms using the learned Q-function lead to the following decomposition:

y^ν(t_k, X_{t_k}) = y^ν(0, X_0) + M^ν(k) − A^ν(k),

where the M^ν are martingales with M^ν(0) = 0, for ν = 1, ..., n, and the A^ν are integrable adapted processes in discrete time with A^ν(0) = 0, for ν = 1, ..., n. Moreover, one can write the increments of both the martingale and adapted components as

M^ν(k) − M^ν(k − 1) = y^ν(t_k, X_{t_k}) − E[ y^ν(t_k, X_{t_k}) | F_{k−1} ]

and

A^ν(k) − A^ν(k − 1) = y^ν(t_{k−1}, X_{t_{k−1}}) − E[ y^ν(t_k, X_{t_k}) | F_{k−1} ].

Given the existence of the waiting period, one must also include the δ-increment term

A^ν(k + δ) − E[ A^ν(k + δ) | F_k ] = M^ν(k + δ) − M^ν(k) + E[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | F_k ] − y^ν(t_{k+δ}, X_{t_{k+δ}}).

We note that for δ = 1, since A^{*,ν} is a predictable process, this increment is equal to 0 for the optimal martingale M^{*,ν} and we retrieve the dual formulation in Schoenmakers (2012).

As the dual formulation involves conditional expectations, we use nested simulation on a new set of M_U independent simulations (X_n^m)_{n=0}^N for m = M + M_L + 1, ..., M + M_L + M_U, with M_U^inner inner simulations for each outer simulation as explained in Section 2.4.2, to approximate the one-step ahead continuation values E[ y^ν(t_k, X_{t_k}) | F_{k−1} ] and the δ-steps ahead continuation values E[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | F_k ]. We denote the Monte Carlo estimators of these conditional expectations by Ê[ y^ν(t_k, X_{t_k}) | F_{k−1} ] and Ê[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | F_k ], respectively. We use these quantities to express the empirical counterparts of the adapted process increments for m = M + M_L + 1, ..., M + M_L + M_U:

A_m^ν(k + δ) − Ê[ A^ν(k + δ) | F_k ] = M_m^ν(k + δ) − M_m^ν(k) + Ê[ y^ν(t_{k+δ}, X_{t_{k+δ}}) | F_k ] − y_m^ν(t_{k+δ}, X_{t_{k+δ}}^m).

We can then rewrite the empirical counterparts of the Snell envelopes through the Q-function:

y_m^ν(t_k, X_{t_k}^m) = max{ g(t_k, X_{t_k}^m) + Q((t_{k+δ}, X_{t_{k+δ}}^m), ν − 1; θ̂), Q((t_k, X_{t_k}^m), ν; θ̂) },

for ν = 1, ..., n, k = 0, ..., N, m = M + M_L + 1, ..., M + M_L + M_U, and where we set g(t, X_t) = 0 for t > T and Q((t_k, X_{t_k}), 0; θ̂) = 0 (no more exercises left).

The theoretical upper bound U^n stemming from the dual problem in Bender (2011) is given by

U^n = E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} { Σ_{ν=1}^{n−1} ( g(u_ν, X_{u_ν}) − (M^ν(u_ν) − M^ν(u_{ν+1})) + A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) + g(u_n, X_{u_n}) − M^n(u_n) } ].

We hence obtain V_0^n ≤ U^n, and this bound is sharp for the exact Doob-Meyer decomposition terms M^{*,ν} and A^{*,ν}, for ν = 1, ..., n. We denote the sharp upper bound by

U^{*,n} = E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} { Σ_{ν=1}^{n−1} ( g(u_ν, X_{u_ν}) − (M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1})) + A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) + g(u_n, X_{u_n}) − M^{*,n}(u_n) } ].

The corresponding Monte Carlo estimate of the upper bound on the optimal value V_0^n is

Û^n = (1/M_U) Σ_{m=M+M_L+1}^{M+M_L+M_U} sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} { Σ_{ν=1}^{n−1} ( g(u_ν, X_{u_ν}^m) − (M_m^ν(u_ν) − M_m^ν(u_{ν+1})) + A_m^ν(u_{ν+1}+δ) − Ê[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) + g(u_n, X_{u_n}^m) − M_m^n(u_n) }.

The pathwise supremum appearing in the expression of the upper bound can be computed using the recursion formula from Proposition 3.8 in Bender et al. (2015). This recursion formula is implemented in our setting using the representation via the Q-function.
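For illustration only, the pathwise supremum can also be evaluated by brute force rather than by the recursion from Proposition 3.8 in Bender et al. (2015). The sketch below does so for the special case n = 2, restricting the search to the time grid and assuming both rights can be exercised before maturity; the arrays `g`, `M1`, `M2` and the adapted-increment correction `adj1` are hypothetical names for quantities assumed precomputed along one simulated path.

```python
import numpy as np

def dual_pathwise_sup_two_rights(g, M1, M2, adj1, delta):
    """Brute-force inner supremum of the dual bound for n = 2 exercise rights on one path.

    g[k]    : payoff g(t_k, X_{t_k}) along the path (set to 0 beyond maturity)
    M1, M2  : approximate dual martingales M^1(k), M^2(k) along the path, with M(0) = 0
    adj1[k] : correction A^1(k + delta) - E_hat[A^1(k + delta) | F_k] along the path
    delta   : waiting period, expressed in time steps
    """
    N = len(g) - 1
    best = -np.inf
    for u2 in range(N + 1 - delta):          # earlier exercise time
        for u1 in range(u2 + delta, N + 1):  # later exercise time, respecting the waiting period
            val = (g[u1] - (M1[u1] - M1[u2]) + adj1[u2]   # nu = 1 term of the dual objective
                   + g[u2] - M2[u2])                       # final term (nu = n = 2)
            best = max(best, val)
    return best

# The Monte Carlo upper bound is then the average of this quantity over the M_U outer paths.
```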
As in Becker et al. (2019) and Becker et al. (2020), we can construct a pointwise estimate for the optimal value in the multiple stopping framework in the presence of a waiting time constraint by taking the midpoint (L̂^n + Û^n)/2.

By storing the empirical standard deviations for the lower and upper bounds, which we denote by σ̂_{L^n} and σ̂_{U^n}, respectively, one can leverage the central limit theorem as in Section 2.4.3 to derive the asymptotic two-sided (1 − α)-confidence interval for the true optimal value V_0^n:

[ L̂^n − z_{α/2} σ̂_{L^n} / √M_L ,  Û^n + z_{α/2} σ̂_{U^n} / √M_U ].   (13)

We now derive the extension of a result presented in Meinshausen and Hambly (2004) on the bias resulting from the derivation of the upper bound, to the case of multiple stopping in the presence of a waiting period. The dual problem from Meinshausen and Hambly (2004), being obtained from an optimization over a space of martingales and a set of stopping times, contains two terms: the bias coming from the martingale approximation, and the bias coming from the policy approximation. In the case with waiting constraint, as exemplified in the dual of Bender (2011), we show how one can again control the bias in the approximations to the n Doob-Meyer decompositions of the Snell envelopes y^{*,ν}, for ν = 1, ..., n. Indeed, in the dual problem, each martingale M^{*,ν} is approximated by a martingale M^ν, and each predictable non-decreasing process A^{*,ν} is approximated by an integrable adapted process in discrete time A^ν. We proceed in three steps and analyse separately the bias from each approximation employed:

• Martingale terms:
E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | M^ν(u_ν) − M^ν(u_{ν+1}) − (M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1})) | ]

• Adapted terms:
E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] − ( A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) | ]

• Final term:
E[ sup_{0 ≤ u_n ≤ N} | g(u_n, X_{u_n}) − M^n(u_n) − ( g(u_n, X_{u_n}) − M^{*,n}(u_n) ) | ]

The bias in the final term g(u_n, X_{u_n}) − M^n(u_n) can be bounded using the methodology in Meinshausen and Hambly (2004).
Define

D_{y,n} = sup_{0 ≤ k ≤ N, x ∈ E} | y^{*,n}(t_k, x) − y^n(t_k, x) |

as the distance between the true Snell envelope and its approximation, and

σ²_{M_U^inner, n} = sup_{0 ≤ k ≤ N, x ∈ E} E[ ( Ê[ y^n(t_k, X_{t_k}) | X_{t_{k−1}} = x ] − E[ y^n(t_k, X_{t_k}) | X_{t_{k−1}} = x ] )^2 | X_{t_{k−1}} = x ]

as an upper bound on the Monte Carlo error from the 1-step ahead nested simulation to approximate the continuation values.

In order to study the bias coming from the martingale approximations, we define

D_y = sup_{ν=1,...,n−1} sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ, x ∈ E} | y^{*,ν}(u_ν, x) − y^ν(u_ν, x) |

as the distance between the optimal Snell envelope and its approximation over all remaining exercise times,

σ²_{M_U^inner} = sup_{ν=1,...,n−1} sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ, x ∈ E} E[ ( Ê[ y^ν(u_{ν+1}, X_{u_{ν+1}}) | X_{u_ν} = x ] − E[ y^ν(u_{ν+1}, X_{u_{ν+1}}) | X_{u_ν} = x ] )^2 | X_{u_ν} = x ],

and

σ²_{M_U^inner, δ} = sup_{ν=1,...,n−1} sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ, x ∈ E} E[ ( Ê[ y^ν(u_ν + δ, X_{u_ν+δ}) | X_{u_ν} = x ] − E[ y^ν(u_ν + δ, X_{u_ν+δ}) | X_{u_ν} = x ] )^2 | X_{u_ν} = x ].

In other words, σ_{M_U^inner} and σ_{M_U^inner, δ} correspond to upper bounds on the standard deviations of the 1-step ahead and δ-steps ahead Monte Carlo estimates of the continuation values, respectively, using a sample of M_U^inner independent simulations starting from the endpoint of simulation path m, for m = M + M_L + 1, ..., M + M_L + M_U.

The following theorem allows us to control the bias in the derivation of the upper bound from the dual problem.

Theorem 1 (Dual upper bound bias). Let B_δ^n(M, A) be the total bias, defined as the difference between the approximate upper bound using (M^ν)_{ν=1,...,n} and (A^ν)_{ν=1,...,n−1} and the theoretical sharp upper bound using the optimal Doob decomposition components (M^{*,ν})_{ν=1,...,n} and (A^{*,ν})_{ν=1,...,n−1}:

B_δ^n(M, A) = U^n − U^{*,n}.

The following result holds:

B_δ^n(M, A) ≤ 8(n − 1) √( (4 D_y^2 + σ²_{M_U^inner}) T ) + (n − 1)( σ_{M_U^inner, δ} + 2 D_y ) + 2 √( (4 D_{y,n}^2 + σ²_{M_U^inner, n}) T ).   (14)
In order to prove this result, let us state an intermediary result which will appear in the proofs of the following propositions. Define

R_t^ν = M_t^ν − M_t^{*,ν}

as the difference between the martingale approximation and the optimal martingale for the problem with ν remaining exercise times, for ν = 1, ..., n.

Lemma 3.1.
The process R^ν is a martingale with R^ν(0) = 0, for all ν = 1, ..., n, and we have the following inequality on the second moment of the martingale increments, for all 0 ≤ t < T and ν = 1, ..., n:

E[ (R_{t+1}^ν − R_t^ν)^2 | F_t ] ≤ 4 D_y^2 + σ²_{M_U^inner}.

As a consequence,

E[ (R_t^ν)^2 ] ≤ ( 4 D_y^2 + σ²_{M_U^inner} ) t.

Proof. The proof of this lemma follows similar lines to the proof of Lemma 6.1 in Meinshausen and Hambly (2004). Let ν ∈ {1, ..., n}. As a difference of martingales with initial value 0, R^ν is also a martingale with initial value 0. The increments can be rewritten as

R_{t+1}^ν − R_t^ν = M_{t+1}^ν − M_{t+1}^{*,ν} − ( M_t^ν − M_t^{*,ν} )
= y^ν(t+1, X_{t+1}) − Ê[ y^ν(t+1, X_{t+1}) | F_t ] − ( y^{*,ν}(t+1, X_{t+1}) − E[ y^{*,ν}(t+1, X_{t+1}) | F_t ] )
= y^ν(t+1, X_{t+1}) − y^{*,ν}(t+1, X_{t+1}) + E[ y^{*,ν}(t+1, X_{t+1}) | F_t ] − E[ y^ν(t+1, X_{t+1}) | F_t ] + E[ y^ν(t+1, X_{t+1}) | F_t ] − Ê[ y^ν(t+1, X_{t+1}) | F_t ].

Now, both the difference between the first two terms and the difference between the third and fourth terms in the final equality are bounded in absolute value by D_y. The last term corresponds to the error from the Monte Carlo approximation of the 1-step ahead continuation values. Since this error term has mean 0, a second moment bounded by σ²_{M_U^inner}, and is independent of the term ( y^{*,ν}(t+1, X_{t+1}) − y^ν(t+1, X_{t+1}) ), we obtain the desired result.

Proposition 3.1.
The bias in the final term can be bounded by

E[ sup_{0 ≤ u_n ≤ N} | g(u_n, X_{u_n}) − M^n(u_n) − ( g(u_n, X_{u_n}) − M^{*,n}(u_n) ) | ] ≤ 2 √( ( 4 D_{y,n}^2 + σ²_{M_U^inner, n} ) T ).   (15)
Proof.
We consider the error in the final term:

E[ sup_{0 ≤ u_n ≤ N} | M^n(u_n) − M^{*,n}(u_n) | ] ≤ E[ sup_{0 ≤ t ≤ T} | R_t^n | ].

From the Cauchy-Schwarz inequality,

E[ sup_{0 ≤ t ≤ T} | R_t^n | ] ≤ ( E[ sup_{0 ≤ t ≤ T} (R_t^n)^2 ] )^{1/2},

and since R^n is a martingale, (R^n)^2 is a non-negative submartingale which is well-defined from the existence of D_y and σ_{M_U^inner}. Then, using Doob's submartingale inequality,

E[ sup_{0 ≤ t ≤ T} (R_t^n)^2 ] ≤ 4 E[ (R_T^n)^2 ].

This last inequality in combination with Lemma 3.1 leads to the desired result.
Proposition 3.2.
The bias from the approximations of the martingale terms can be bounded by

E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | M^ν(u_ν) − M^ν(u_{ν+1}) − ( M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1}) ) | ] ≤ 4 (n − 1) √( ( 4 D_y^2 + σ²_{M_U^inner} ) T ).   (16)
Proof.
The error in the martingale term for the problem with ν remaining exercise times can be expressed as

| M^ν(u_ν) − M^ν(u_{ν+1}) − ( M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1}) ) | ≤ | R^ν(u_ν) | + | R^ν(u_{ν+1}) | ≤ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} | R^ν(u_ν) | + sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} | R^ν(u_{ν+1}) |.

By taking the sum over ν = 1, ..., n − 1, taking the supremum over the subspace of N^n with the constraints imposed by the presence of the waiting period, and finally taking the expectation, we obtain

E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | M^ν(u_ν) − M^ν(u_{ν+1}) − ( M^{*,ν}(u_ν) − M^{*,ν}(u_{ν+1}) ) | ]
≤ (n − 1) E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} | R^ν(u_ν) | ] + (n − 1) E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} | R^ν(u_{ν+1}) | ].

The conclusion then follows by bounding each of these two terms as in the proof of Proposition 3.1.
The bias from the approximations of the non-decreasing predictable terms can be bounded by

E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] − ( A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) | ]
≤ (n − 1) ( σ_{M_U^inner, δ} + 2 D_y + 4 √( ( 4 D_y^2 + σ²_{M_U^inner} ) T ) ).

Proof.
Again, we consider the approximation of the predictable process for the problem with ν remaining exercise times:

A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] − ( A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] )
= M^ν(u_{ν+1}+δ) − M^ν(u_{ν+1}) + E[ y^ν(u_{ν+1}+δ, X_{u_{ν+1}+δ}) | F_{u_{ν+1}} ] − y^ν(u_{ν+1}+δ, X_{u_{ν+1}+δ})
− ( M^{*,ν}(u_{ν+1}+δ) − M^{*,ν}(u_{ν+1}) + E[ y^{*,ν}(u_{ν+1}+δ, X_{u_{ν+1}+δ}) | F_{u_{ν+1}} ] − y^{*,ν}(u_{ν+1}+δ, X_{u_{ν+1}+δ}) ).

Now, since

| M^ν(u_{ν+1}+δ) − M^{*,ν}(u_{ν+1}+δ) | ≤ | R^ν_{u_{ν+1}+δ} |,   | M^ν(u_{ν+1}) − M^{*,ν}(u_{ν+1}) | ≤ | R^ν_{u_{ν+1}} |,

and

| y^{*,ν}(u_{ν+1}+δ, X_{u_{ν+1}+δ}) − y^ν(u_{ν+1}+δ, X_{u_{ν+1}+δ}) | ≤ D_y,

by summing over all exercise opportunities, taking the supremum and then the expectation, we obtain by definition of σ_{M_U^inner, δ}

E[ sup_{u_1,...,u_n ∈ N, u_ν ≥ u_{ν+1}+δ} Σ_{ν=1}^{n−1} | A^ν(u_{ν+1}+δ) − E[ A^ν(u_{ν+1}+δ) | F_{u_{ν+1}} ] − ( A^{*,ν}(u_{ν+1}+δ) − E[ A^{*,ν}(u_{ν+1}+δ) | F_{u_{ν+1}} ] ) | ]
≤ (n − 1) ( σ_{M_U^inner, δ} + 2 D_y + 4 √( ( 4 D_y^2 + σ²_{M_U^inner} ) T ) ).

The proof of Theorem 1 is then obtained by summing up all contributions to the total bias from Propositions 3.1, 3.2 and 3.3. We thus obtain an upper bound on the total bias stemming from the errors in all approximations. We see in particular in the expression of the total bias that the waiting period appears implicitly in the error term from the Monte Carlo δ-steps ahead estimation.

We now illustrate the Q-learning approach with several numerical examples. As illustrative examples we present swing options in the multiple stopping framework in several dimensions, with varying maturities, n = 2 exercise rights and a waiting period constraint δ > 0. In all examples we select mini-batches of size 1000 using experience replay on a sample of 1,000,000 simulations. We consider ReLU activation functions applied component-wise, perform stochastic gradient descent for the optimization step using the RMSProp implementation from PyTorch, and initialize the network parameters using the default PyTorch implementation.
Swing options appear in the commodity and energy markets (natural gas, electricity) as hedging instruments to protect investors from futures price fluctuations. They give the holder of the option the right to exercise at multiple times during the lifetime of the contract, the number of exercise opportunities being specified at inception. Further constraints can be imposed at each exercise time, such as the maximal quantity of energy that can be bought or sold, or the minimal waiting period between two exercise times, see e.g. Bender (2011). In the presence of a volume constraint, under certain sufficient conditions, see Bardou et al. (2009), the optimal policy is a so-called "bang-bang" strategy, see e.g. Daluiso et al. (2020), i.e. at each exercise time the optimal strategy is to buy or sell the maximum or the minimum amount allowed, which then simplifies the action space. A model for commodity futures prices is derived in Daluiso et al. (2020), implemented using proximal policy optimization (PPO), which is another tool from reinforcement learning and where the policy update is forced to be close to the previous policy by clipping the advantage function. The pricing of such contracts is also investigated in Meinshausen and Hambly (2004) with no constraints, in Bender (2011) with a waiting time constraint, and in Bender et al. (2015) with both waiting time and volume constraints.

We will consider the same model for the electricity spot prices as in Meinshausen and Hambly (2004), that is, the exponential of a Gaussian Ornstein-Uhlenbeck process, which in discrete time takes the form

log S_{t+1} = (1 − k)(log S_t − µ) + µ + σ Z_t,

where {Z_t}_{t=0,...,T−1} are standard normal random variables, and where we choose σ = 0. , k = 0. , µ = 0, S_0 = 1 and strike price K = 1. We consider the payoff (S_t − K)^+ for time t = 0, ..., T, without any discounting, as in Meinshausen and Hambly (2004), Bender (2011) and Bender et al. (2015). A discount factor could be taken into account with no real additional complexity. In the multi-dimensional setting we will consider the same payoff as max-call options, that is (max_{i=1,...,d} S_t^i − K)^+ for a d-dimensional vector of asset prices (S^1, ..., S^d)^T, where we assume for the marginals the same dynamics as above and independence between the respective innovations. We will consider the same starting value S_0 = 1 for all the assets in the examples below. We stress that this pricing approach can be extended to any other type of Markovian dynamics which are more adequate for capturing electricity prices.

We assume that the arbitrage-free price is given by taking the expectation at (10) under an appropriate pricing measure, that is, a probability measure under which the (discounted) prices of tradable and storable basic securities in the underlying market are (local) martingales. The electricity market being incomplete, the prices will depend on the choice of the pricing measure. The latter can be selected by considering a calibration on liquidly traded swing options.

We select a deep neural network with 3 hidden layers containing 32 neurons each for the examples with d = 3 and d = 10, and 90 neurons each for the examples with d = 50. We present our results in dimensions d = 3, d = 10 and d = 50 in Table 1 below, using M_L = 100,000, M_U = 100 and J = 5000.
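A minimal sketch of simulating paths from this exponential Gaussian Ornstein-Uhlenbeck spot price model is given below; the default values of `sigma` and `k` are illustrative placeholders rather than the calibration used in the examples, and the multi-asset case simply draws independent innovations per asset, as described above.

```python
import numpy as np

def simulate_spot_paths(n_paths, T, mu=0.0, sigma=0.5, k=0.1, s0=1.0, d=1, seed=0):
    """Simulate exponential Gaussian Ornstein-Uhlenbeck spot prices on the grid t = 0, ..., T.

    Discrete-time dynamics: log S_{t+1} = (1 - k)(log S_t - mu) + mu + sigma * Z_t,
    with independent standard normal innovations Z_t (independent across the d assets).
    sigma and k are illustrative placeholder values, not the paper's calibration.
    Returns an array of shape (n_paths, T + 1, d).
    """
    rng = np.random.default_rng(seed)
    log_s = np.full((n_paths, T + 1, d), np.log(s0))
    for t in range(T):
        z = rng.standard_normal((n_paths, d))
        log_s[:, t + 1] = (1 - k) * (log_s[:, t] - mu) + mu + sigma * z
    return np.exp(log_s)

# Payoff of the d-dimensional max-call at time t: (max_i S_t^i - K)^+ .
```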
Table 1: Prices at t = 0 for swing options with varying maturities and asset price dimensions, K = 1, µ = 0, σ = 0. , and k = 0. .

Model parameters                      L̂      PE      Û      CI
d = 3,  n = 2, δ = 2, T = 10
d = 3,  n = 2, δ = 2, T = 20
d = 10, n = 2, δ = 2, T = 10
d = 10, n = 2, δ = 2, T = 20
d = 50, n = 2, δ = 2, T = 10
d = 50, n = 2, δ = 2, T = 20

We have presented optimal stopping problems appearing in the valuation of financial products under the lens of reinforcement learning. This new angle allows us to model the optimal action-value function using the RL machinery and deep neural networks. This method could serve as an alternative to recent approaches developed in the literature, be it to derive the optimal policy by modeling directly the stopping times as in Becker et al. (2019), or by modeling the continuation values by approximating conditional expectations as in Becker et al. (2020). We have also considered the pricing of multiple exercise stopping problems with a waiting period constraint and derived lower and upper bounds on the option price, using the trained neural network and the dual representation, respectively. In addition, we have proved a result that controls the total bias resulting from the approximation of the terms appearing in the dual formulation. The RL framework is suitable for configurations where the action space varies in a non-trivial way with time, i.e. there are certain degrees of freedom for the agent to explore the environment at each time step. This is exemplified through the swing option with multiple stopping rights and waiting time constraint, but could also be useful for more complex environments. It could also be interesting to investigate state-of-the-art improvements to the DQN algorithm brought forward in Hessel et al. (2017). One could explore these avenues in further research.

Acknowledgements
We thank Prof. Patrick Cheridito for helpful comments and for carefully reading previous versions of the manuscript.

As SCOR Fellow, John Ery thanks SCOR for financial support.

Both authors have contributed equally to this work.
References
Bardou, O., Bouthemy, S., and Pagès, G. (2009). Optimal quantization for the pricing of swing options. Applied Mathematical Finance, 16(2):183–217.

Becker, S., Cheridito, P., and Jentzen, A. (2019). Deep optimal stopping. Journal of Machine Learning Research, 20(74):1–25.

Becker, S., Cheridito, P., and Jentzen, A. (2020). Pricing and hedging American-style options with deep learning. Journal of Risk and Financial Management, 13(7):158.

Bender, C. (2011). Primal and dual pricing of multiple exercise options in continuous time. SIAM Journal on Financial Mathematics, 2(1):562–586.

Bender, C., Schoenmakers, J., and Zhang, J. (2015). Dual representations for general multiple stopping problems. Mathematical Finance, 25(2):339–370.

Bertsekas, D. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Massachusetts, USA.

Chen, Y. and Wan, J. (2020). Deep neural network framework based on backward stochastic differential equations for pricing and hedging American options in high dimensions. To appear in Quantitative Finance.

Daluiso, R., Nastasi, E., Pallavicini, A., and Sartorelli, G. (2020). Pricing commodity swing options. ArXiv Preprint 2001.08906, version of January 24, 2020.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. (2017). Rainbow: Combining improvements in deep reinforcement learning. ArXiv Preprint 1710.02298, version of October 6, 2017.

Karatzas, I. and Shreve, S. (1991). Brownian Motion and Stochastic Calculus. Springer, Graduate Texts in Mathematics, 2nd edition.

Kohler, M., Krzyżak, A., and Todorovic, N. (2008). Pricing of high-dimensional American options by neural networks. Mathematical Finance, 20(3):383–410.

Lagoudakis, M. and Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149.

Lapeyre, B. and Lelong, J. (2019). Neural network regression for Bermudan option pricing. ArXiv Preprint 1907.06474, version of December 16, 2019.

Li, Y., Szepesvari, C., and Schuurmans, D. (2009). Learning exercise policies for American options. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA, Volume 5 of JMLR: W&CP 5.

Longstaff, F. and Schwartz, E. (2001). Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1):113–147.

Meinshausen, N. and Hambly, B. M. (2004). Monte Carlo methods for the valuation of multiple-exercise options. Mathematical Finance, 14(4):557–583.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.

Schoenmakers, J. (2012). A pure martingale dual for multiple stopping. Finance and Stochastics, 16:319–334.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2017). Mastering chess and Shogi by self-play with a general reinforcement learning algorithm. ArXiv Preprint 1712.01815, version of December 5, 2017.

Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.

Tsitsiklis, J. and Roy, B. V. (2001). Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks, 12(4):694–703.

van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16).

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, Volume 48 of JMLR: W&CP.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4):279–292.