Reinforced optimal control
Christian Bayer, Denis Belomestny, Paul Hager, Paolo Pigato, John Schoenmakers, Vladimir Spokoiny
RREINFORCED OPTIMAL CONTROL
CHRISTIAN BAYER , DENIS BELOMESTNY , PAUL HAGER , PAOLO PIGATO ,JOHN SCHOENMAKERS , AND VLADIMIR SPOKOINY Abstract.
Least squares Monte Carlo methods are a popular numerical approximation methodfor solving stochastic control problems. Based on dynamic programming, their key feature isthe approximation of the conditional expectation of future rewards by linear least squares re-gression. Hence, the choice of basis functions is crucial for the accuracy of the method. Earlierwork by some of us [Belomestny, Schoenmakers, Spokoiny, Zharkynbay. Commun. Math. Sci.,18(1):109–121, 2020] proposes to reinforce the basis functions in the case of optimal stoppingproblems by already computed value functions for later times, thereby considerably improvingthe accuracy with limited additional computational cost. We extend the reinforced regressionmethod to a general class of stochastic control problems, while considerably improving themethod’s efficiency, as demonstrated by substantial numerical examples as well as theoreticalanalysis. Introduction
Stochastic control problems form an important class of stochastic optimization problems thatfind applications in a wide variety of fields, see [Pha09] for an overview. The general problem canbe formulated as follows: How should a decision maker control a system with a stochastic com-ponent to maximize the expected reward? In the theory of stochastic control one distinguishesbetween problems with continuous and discrete sets of possible control values. While the firstclass of control problems contains, for example, energy storage problems, the second one includesstopping and multiple stopping problems. Furthermore one differentiates between discrete timeand continuous time optimal control problems. (Neither of these distinctions is fundamental: forinstance, many numerical methods will replace optimal control problems with a continuous set ofcontrol values in continuous time by a surrogate problem with discrete control values in discretetime. Moreover, many discrete optimal control problem may well be analyzed as continuousones, if the number of possible control values or time-steps is finite, but very high.) Weierstrass Institute, Berlin Duisburg-Essen University and National University Higher School of Economics TU Berlin Department of Economics and Finance, University of Rome Tor Vergata
E-mail addresses : [email protected], [email protected],[email protected], [email protected], [email protected],[email protected] .2020 Mathematics Subject Classification.
Key words and phrases.
Reinforced regression, least squares Monte Carlo, stochastic optimal control. a r X i v : . [ m a t h . O C ] N ov C. BAYER, D. BELOMESTNY, P. HAGER, P. PIGATO, J. SCHOENMAKERS, AND V. SPOKOINY
The range of applications of stochastic control problems is very wide. Originally, optimalstochastic continuous control problems were inspired by engineering problems in the continuouscontrol of a dynamic system in the presence of random noise. In the last decades, problemsin mathematical finance (portfolio optimization, options with variable exercise possibilities) andeconomics inspired many new developments, see [BR11] for some recent developments.As a canonical general approach for solving a discrete time optimal control problem one mayconsider all possible future evolutions of the process at each time that a control choice is tobe made. This method is well developed and may be effective in some special cases, but formore general problems such as optimal control of a diffusion in high dimensions, this approachis impractical. In [BKS09] a generic Monte Carlo approach combined with linear regression wasproposed and studied, see also [BS18] for an overview. However, this approach has an importantdisadvantage – it may exhibit too little flexibility for modeling highly non-linear behavior ofthe optimal value functions. Higher-degree polynomials or local polynomials (splines) can beused, but they may contain too many parameters and, therefore, either over-fit the Monte Carlosample or prohibit parameter estimation because the number of parameters is too large. As analternative to the polynomial bases, nonlinear approximation structures (e.g., artificial neuralnetworks) can be used instead (see, e.g. [BCJ18]).In [BSSZ20] a Monte Carlo based reinforced regression approach is developed for buildingsparse regression models at each backward step of the dynamic programming algorithm in thecase of optimal stopping problems. In a nutshell, the idea is to start with a generic set of basisfunctions, which is systematically enlarged with highly problem dependent additional functions.The additional basis functions are constructed for the optimal stopping problem at hand withoutusing a fixed predefined finite dictionary. Specifically, the new basis functions are learned duringthe backward induction via incorporating information from the preceding backward inductionstep. Thereby, basis functions highly specific to the problem at hand are constructed in acompletely automatic way. [BSSZ20] report that the reinforced basis lead to increased precisionover the starting set of basis functions, comparable to the standard regression algorithm basedon a substantially increased set of basis functions. This improvement is obtained with limitedincrease of the computational cost.In this work we carry over the approach of [BSSZ20] to a general class of discrete time optimalcontrol problems including multiple stopping problems (allowing pricing of swing options) and agas storage problem. This generalization turns out to be rather challenging as the complexity ofusing the previously constructed value function in regression basis at each step of the backwardprocedure becomes prohibitive when applying the original approach of [BSSZ20]. We overcomethis computational bottleneck by introducing a novel version of the original reinforced regressionalgorithm where one uses a hierarchy of fixed time depth approximation of the optimal valuefunction instead of a full depth approximation employed in [BSSZ20]. As a result we regainefficiency and are able to improve upon the standard linear regression algorithm in terms ofachievable precision for a given computational budget.More precisely, we construct a hierarchy v ( i ) , i = 0 , . . 
. , I , of (approximate) value functionswith depth I >
0. Here, v (0) denotes the value functions obtained from the classical Monte Carlo EINFORCED OPTIMAL CONTROL 3 regression algorithm. The higher levels v ( i ) are computed by regression based on a set of basisfunctions reinforced by the value function v ( i − one level lower. This way, the added computa-tional cost incurred from reinforcing the basis can be further decreased with minimal sacrificesof accuracy already for small values of I . In fact, we propose two versions of the algorithm. Inthe first version, the levels of the hierarchy of value functions are trained consecutively, allowingfor an adaptive choice of the depth I of the hierarchy. In the second version, all the levels aretrained concurrently, thereby improving the accuracy at each individual level. As a consequence, I needs to be fixed in advance and cannot be chosen adaptively in the second variant. Outline of the paper.
In Section 2 we describe a rather general setting for discrete stochasticcontrol problems which we are going to use in this paper. The setting is based on [GHW11]. Werecall the reinforced regression algorithm for optimal stopping problems by [BSSZ20] in detail inSection 3. There we also motivate the hierarchical construction of the new reinforced regressionalgorithm as restricted to the optimal stopping problem. The full algorithm – including bothvariants – is introduced in Section 4. A detailed analysis of computational costs is provided inSection 5. The next Section 6 provides a detailed convergence analysis for the standard andreinforced regression algorithms in the current setting. Extensive numerical examples includingoptimal stopping problems, multiple stopping problems and a gas storage optimization problemare provided in Section 7. 2.
Setting
First, we present a proper setting for the construction and analysis of reinforced regressionalgorithms. The setting will be largely based on [GHW11]. We will consider stochastic controlproblems in discrete time with a finite action sets. We note that extensions to continuous actionsets are certainly possible, but are left to future research.We consider a filtration F j , j = 0 , . . . , J , which is extended by F − := { ∅ , Ω } , F J +1 := F J forconvenience. While the setting below could be (easily) formulated in a non-Markovian setting,we right away consider a Markov process X adapted to ( F j ) j =0 ,...,J . Note that we assume thatthe dynamics of the underlying process X does not depend on any control.At time 0 ≤ j ≤ J we are given a control Y j , which is F j − -measurable, and an F j -measurable cash-flow Z j = H j ( a, Y j , X j ) for some deterministic, measurable function H j , where a is an action that we may choose at time j in some action space K . Note that cash-flows may be positive ornegative. We assume that the control Y j takes values in a finite set L , and that K is finite, aswell. For a given value of the control y ∈ L and a given value x of the underlying process X j ,we are given a set(2.1) K j ( y, x ) ⊂ K , j = 0 , . . . , J. Only actions a ∈ K j ( Y j , X j ) are admissible at time j . Finally, if we choose a ∈ K j ( Y j , X j ), thenthe control is updated by(2.2) Y j +1 := ϕ j +1 ( a, Y j ) , ϕ j +1 : K × L → L . C. BAYER, D. BELOMESTNY, P. HAGER, P. PIGATO, J. SCHOENMAKERS, AND V. SPOKOINY
Suppose that we are at time 0 ≤ j ≤ J , the control and the underlying state process take values Y j and X j , respectively. For α = ( a j , . . . , a J ) ∈ K J − j +1 and j ≤ ‘ ≤ J −
1, we define(2.3) Y ‘ +1 ( α ; j, Y j ) := ϕ ‘ +1 ( a ‘ , Y ‘ ( α ; j, Y j )) , Y j ( α ; j, Y j ) := Y j , noting that Y ‘ ( α ; j, Y j ) only depends on a j , . . . , a ‘ − . Additionally, we define F j,J ( K ) to be theset of ( F ‘ ) J‘ = j -adapted processes taking values in K indexed by j, . . . , J . Clearly, if α ∈ F j,J ( K )and Y j ∈ F j − , then the process Y · ( α ; j, Y j ) is previsible. The set of admissible strategies oradmissible policies A j is defined as follows:(2.4) A j ( Y j , X ≥ j ) := n α = ( a j , . . . , a J ) ∈ F j,J ( K ) (cid:12)(cid:12)(cid:12) a ‘ ∈ K ‘ ( Y ‘ ( α ; j, Y j ) , X ‘ ) , ‘ = j, . . . , J o . Now the central issue is the optimal control problem(2.5) V j := sup α ∈A j ( Y j ,X ≥ j ) E j J X ‘ = j H ‘ ( a ‘ , Y ‘ ( α ; j, Y j ) , X ‘ ) , at a generic time 0 ≤ j ≤ J, where E j denotes the conditional expectation w.r.t. F j .Taking advantage of the Markov property, we introduce the notation A j ( y, x ) := A j ( y, X x ≥ j ) , where X j,x denotes the Markov process X conditioned on X j = x , and is defined for j ≤ ‘ ≤ J .We may then define the value function as(2.6) v ∗ j ( y, x ) := sup α ∈A j ( y,x ) E J X ‘ = j H ‘ (cid:16) a ‘ , Y ‘ ( α ; j, y ) , X j,x‘ (cid:17) , which satisfies the dynamic programming principle :(2.7) v ∗ j ( y, x ) = sup a ∈ K j ( y,x ) (cid:16) H j ( a, y, x ) + E h v ∗ j +1 ( ϕ j +1 ( a, y ) , X j,xj +1 ) i(cid:17) . for j = 0 , . . . , J (with v ∗ J +1 ( y, x ) := 0). Remark 2.1.
In light of portfolio optimization problems, we may also want to allow ϕ j +1 toadditionally depend on X j , X j +1 . Indeed, such a construction would allow Y j to contain thevalue of the portfolio at time j . In such a case, the process Y would cease to be previsible. Onthe other hand, most of the following observations would continue to hold.Let us now give a few examples for classical stopping and control problems which fall into theabove setup. Example 2.2.
For a single optimal stopping problem with payoff g j ≥ j , the set ofpossible control values is L = { , } , where a control state y denotes the number of remainingexercise opportunities. The action a takes the value 1 if we stop at the current time and 0otherwise. Hence, we have K j ( y, x ) = K ( y ) := { , } , y = 1 , { } , y = 0 , implying that K = { , } . The cash-flow is defined by H j ( a, Y j , X j ) := a g j ( X j ) , EINFORCED OPTIMAL CONTROL 5 independent of the value of the control Y j . Finally, the update function of the control is definedby ϕ j +1 ( a, y ) := max( y − a, . Example 2.3.
Let us now suppose that we have a multiple stopping problem with L ∈ N exercise rights. Again, the control state y signifies the remaining exercise opportunities, leadingto L = { , , . . . , L } . The admissible action set is now defined as K j ( y, x ) = K ( y ) := { , } , y ≥ , { } , y = 0 . Once again, we have K = { , } . The cash-flow H j and the update function ϕ j +1 are defined asin Example 2.2. Example 2.4.
Next we consider a more complicated multiple stopping problem. Considera multiple stopping problem with refraction period J , i.e., after exercising a right, we maynot exercise another right for J periods. Moreover, the total set of possible exercise dates J := { , , . . . , J } is divided into two subsets, J = J t J . When j ∈ J , at most oneoptionality may be exercised, whereas at most two options may be exercised when j ∈ J .Again, we have L ∈ N total exercise rights.In this case, the control y keeps track of the remaining time until the next optionality may beexercised. Hence, we set L := { , . . . , L } × { , . . . , J } , where, for y = ( y , y ) ∈ L , y denotes the number of remaining exercise rights, whereas y denotes the number of time periods, until we may exercise next. In other words, we may onlyexercise at time j if Y j = 0. Consequently, the set of admissible policies is given by K j ( y ) := { } , y = 0 or y > , { , } , y > y = 0 and j ∈ J , { , , } , y ≥ y = 0 and j ∈ J , { , } , y = 1 and y = 0 and j ∈ J , and K := { , , } . The cash-flow is defined by H j ( a, Y j , X j ) := a g j ( X j ) , and the update rule satisfies the following: ϕ j +1 ( a, y ) := ( y , max( y − , , a = 0 , (max( y − a, , J ) , a ≥ . Example 2.5.
Consider a simple gas storage problem: given N ∈ N and ∆ = 1 /N , we assumethat the volume of gas in a storage can only be increased and decreased by a fraction ∆ overa given time increment. Let the control y denote the status (fill level) of the storage at time j .Hence, we define L := { , ∆ , , . . . , } . C. BAYER, D. BELOMESTNY, P. HAGER, P. PIGATO, J. SCHOENMAKERS, AND V. SPOKOINY
At time j , we may either sell ∆ (volume of gas; a = − a = +1) – at the currentmarket price X j – or do nothing ( a = 0). Hence, the admissible policy set is K j ( y ) := { , } , y = 0 , { − , , } , ∆ ≤ y ≤ − ∆ , { − , } , y = 1 , with K = { − , , } , while the cash-flow is given by H j ( a, Y j , X j ) := − a ∆ X j . The update function in given by ϕ j +1 ( a, y ) := (( y + a ∆) ∧ ∨ . Reinforced regression for optimal stopping
In this section, we recall the standard regression algorithm as well as the reinforced regres-sion algorithm introduced in [BSSZ20] for optimal stopping problems. We will point out thedrawbacks of the latter algorithm for more general control problems, and propose and motivateseveral modifications. However, for the purpose of a clear illustration, we will restrict ourselvesin this section to the optimal stopping case.Let us recall the optimal stopping setup from Example 2.2 and denote by v ∗ j ( x ) the valuefunction at j ∈ { , ..., J } evaluated in x ∈ R d and y = 1. Further recall that the dynamicprogramming principle is given by v ∗ j ( x ) = max( g j ( x ) , c ∗ j ( x )) , ≤ j ≤ J − , v ∗ J ( x ) = g J ( x ) , x ∈ R d , where the continuation function is given by c ∗ j ( x ) = E j [ v ∗ j +1 ( X j,xj +1 )]. For a set of basis function { ψ , ..., ψ K } with ψ i : R d → R and sample trajectories ( X ( m ) j ) ≤ j ≤ J, ≤ m ≤ M from the underlyingMarkov chain, the regression method due to Tsitsiklis-van Roy [TVR01], which we will refer toas the standard regression method , inductively constructs an approximation v = ( v j ) j =0 ,...,J tothe value function v ∗ as follows: For j = J initialize v J = g J . For j ∈ { J − , ..., } set v j ( x ) := max( g j ( x ) , c j ( x )) , c j ( x ) = K X k =1 γ j,k ψ k ( x ) , (3.1)where the regression coefficients are given by the solution to the least squares problem γ j, , ..., γ j,K := arg min γ ,...,γ K M X m =1 (cid:12)(cid:12)(cid:12)(cid:12) v j +1 ( X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12) . (3.2)The procedure is illustrated in Figure 1. Note that the costs of this algorithm are of the order M · J · K (see, e.g., [BS20] or Section 5).One problem of the standard regression algorithm is that its performance strongly depends onthe choice of basis functions. Indeed, while standard classes such as polynomials or splines usuallyform the backbone of the construction of basis functions, practitioners usually add customizedbasis functions, for instance the payoff function g j and some functionals applied to it. EINFORCED OPTIMAL CONTROL 7 v j ( x ) g j ( x ) c j ( x ) ψ · ( x ) γ j, · j = J − J − J − J − J Figure 1.
Illustration of standard regression approach due to Tsitsiklis-vanRoy [TVR01]. The straight lines indicate the dependencies in the evaluation of c j and v j in (3.1). The dashed arrows start from the regression data v j +1 andsymbolize the regression procedure (3.2).As a more systematic approach, the authors of [BS20] proposed a reinforced regression algo-rithm . In this procedure the regression basis at each step of the backward induction is reinforcedwith the approximate value function from the previous step of the induction. The approximatecontinuation function at j ∈ { , ..., J − } is then given by c j ( x ) := K X k =1 γ j,k ψ k ( x ) + γ j,K +1 v j +1 ( x ) , where the regression coefficient are the solution to the least squares problem γ j, , ..., γ j,K +1 := arg min γ ,...,γ K +1 M X m =1 (cid:12)(cid:12)(cid:12)(cid:12) v j +1 ( X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j ) − γ K +1 v j +1 ( X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12) . Note that this procedure induces a recursion whenever an approximate value function is evalu-ated: in order to evaluate v j ( x ) we need to evaluate c j ( x ), which in turn requires an evaluationof v j +1 ( x ) and so forth, until v J ( x ) = g J ( x ) terminates the recursion. Figure 2 illustrates thisprocedure. The costs of the reinforced regression method are proportional to M · J · K + M · J · K (see [BS20]). v j ( x ) g j ( x ) c j ( x ) ψ · ( x ) γ j, · j = J − J − J − J − J Figure 2.
Illustration of the reinforced regression approach. Evaluation of v j in the reinforced regression algorithm leads to a recursion with J − j steps.Despite the increased computational cost compared to the standard regression algorithm withthe same set of basis functions ψ , . . . , ψ K , the reinforced regression algorithm can improve theoverall computational cost for a fixed error tolerance drastically. As a rule of thumb, [BS20] reportthat the reinforced regression algorithm with a standard basis consisting of polynomials of a given C. BAYER, D. BELOMESTNY, P. HAGER, P. PIGATO, J. SCHOENMAKERS, AND V. SPOKOINY degree leads to similar accuracy as the standard regression algorithm based on polynomials ofone degree higher. In particular, the reinforced regression algorithm already outperforms thestandard regression algorithm for small dimensions d >
1, as long as the number J of time-stepsis not too large.A direct generalization of the reinforced regression algorithm to more general control problemsis certainly possible. The main difference to the optimal stopping problem is that for fixed time j we have to choose from many potential candidates to reinforce with, namely any v j +1 ( y, · ) , y ∈ L is a candidate. Additionally, the dynamic programming principle (2.7) now entails a possiblynon-trivial optimization problem in terms of the policy a . Especially the second point makesthe recursion at the heart of the reinforced regression algorithm untenable for general controlproblems.One solutions immediately comes to mind: If performing the recursion all the way to terminaltime J is too costly, why not truncate at a certain recursion depth? This idea is, in principle,sound, and is the basis of the adaptations suggested below. However, some care is needed in theimplementation of this idea. Indeed, if “truncation” simply were to mean “replace the reinforcingbasis functions by 0 after a certain truncation step”, this would introduce a structural error in theprocedure, as regression coefficients formerly computed in the presence of these basis functionswould suddenly be incorrect. Instead, we propose to compute a hierarchy of reinforced regressionsolutions, corresponding to different “cut-off depths” of the recursion. This way, we can makesure that the coefficients are always consistent, that is, an error as mentioned above can beavoided. We introduce two versions, which both adhere to the same general idea, but differ inan important implementation detail.The hierarchical reinforced regression algorithm A iteratively constructs approximations ( v ( i ) ) i =0 , ,... to the true value function as follows: For i = 0 we construct ( v (0) j ) ≤ j ≤ J using the standard re-gression method described above. Then for any i ≥
1, given that v ( l ) is already constructed for0 ≤ l ≤ i −
1, define v ( i ) with the usual backwards induction, where the regression basis at step j ∈ { J − , ..., } is reinforced with v ( i − j . The approximate continuation function of the i th iteration is given by c ( i ) j ( x ) := K X k =1 γ ( i ) j,k ψ k ( x ) + γ ( i ) j,K +1 v ( i − j +1 ( x ) , (3.3)where the regression coefficient are the solution to the least squares problem γ ( i ) j, , ..., γ ( i ) j,K +1 := arg min γ ,...,γ K +1 M X m =1 (cid:12)(cid:12)(cid:12)(cid:12) v ( i ) j +1 ( X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j ) − γ K +1 v ( i − j +1 ( X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12) . The procedure may be stopped after a fixed number of iterations, or using an adaptive criterion.An illustration of the method can be found in Figure 3. Note that the recursion that is startedwhen evaluating v ( i ) j ( x ) always terminates after at most i steps in the evaluation of v (0) j + i ( x ) for i ≤ J − j or in v ( i − J − j ) J ( x ) = g J ( x ) for J − j ≤ i .For a fixed number of iterations i = { , ..., I } we can modify the structure of the previousmethod so that the primary iteration is the backwards induction over j = { J, J − , ..., } andthe secondary iteration is over i = { , ..., I } . In this case we can further modify the algorithm by EINFORCED OPTIMAL CONTROL 9 v ( i ) j ( x ) g j ( x ) c ( i ) j ( x ) ψ · ( x ) γ ( i ) j, · j = J − J − J − J − J − Ji = 0 i = 1 i = 2 i = 3 Figure 3.
Illustration of the hierarchical reinforced regression algorithm A, forthree iterations. In the lower right part of the diagram, the vertical lines indicatethe equality v ( i ) j ≡ v ( l ) j for J − j ≤ i .using v ( I ) j +1 as the regression target for the continuations functions c ( i ) j for all i ∈ { , ..., I } . Wename the resulting algorithm the hierarchical reinforced regression algorithm B . The approximatecontinuation function at step j and iteration i is then still given by (3.3) and the least squaresproblem is given by γ ( i ) j, , ..., γ ( i ) j,K +1 := arg min γ ,...,γ K +1 M X m =1 (cid:12)(cid:12)(cid:12)(cid:12) v ( I ) j +1 ( X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j ) − γ K +1 v ( i − j +1 ( X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12) . Also in this algorithm, the recursion that is started when evaluating v ( I ) j stops after at most I steps. Figure 4 illustrates the procedure. The costs of the algorithm are discussed in Section 5).4. Iterated reinforced regression for optimal control
Following the ideas and motivations of Section 3 we now present hierarchical reinforced re-gression algorithms for optimal control based on the Bellman equation (2.7). The algorithmsare based on M sample trajectories ( X ( m ) j ) j =0 ,...,J,m =1 ,...,M from the underlying Markov chain X . and some initial set { ψ , ..., ψ K } of basis functions ψ i : R d → R . For each y ∈ L wewill define a subset L y ⊂ L of cardinality R y := |L y | and reinforce the basis { ψ , . . . , ψ K } by { v j +1 ( z, · ) | z ∈ L y } . The respective algorithms iteratively construct sequences of approximationsto the value function v ( i ) = ( v ( i ) j ) j =0 ,...,J with v ( i ) j : L × R d → R , for i = { , , ... } until the iteration is terminated. v ( i ) j ( x ) g j ( x ) c ( i ) j ( x ) ψ · ( x ) γ ( i ) j, · j = J − J − J − J − J − Ji = 0 i = 1 i = 2 i = 3 Figure 4.
Illustration of the hierarchical reinforced regression algorithm B with I = 3 iterations. In contrast to algorithm A illustrated in Figure 3, the regressionis always applied to the approximation v ( I ) j +1 , as indicated by the dashed arrows.The gray color of the elements in the upper left part of the diagram indicatethat these elements are not needed for the construction and evaluation of v ( I ) .4.1. Hierarchical reinforced regression algorithm A.
For i = 0 construct v (0) using thestandard regression method inductively as follows: At the terminal time J initialize v (0) J := v J where v J ( y, x ) = max a ∈ K J ( y,x ) H J ( a, y, x ) , for all y ∈ L , x ∈ R d . (4.1)For a j ∈ { , ..., J − } , assume that v (0) l is already constructed for all l ∈ { j + 1 , ..., J } . Thenfor each y ∈ L define the regression coefficients by solving the following least squares problem γ (0) ,yj, , ..., γ (0) ,yj,K := arg min γ ,...,γ K M X m =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) v (0) j +1 ( y, X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (4.2)Next define the continuation function by c (0) j ( y, x ) := K X k =1 γ (0) ,yj,k ψ k ( x ) , for all y ∈ L , x ∈ R d (4.3)and the approximate value function v (0) j through the dynamic programming principle v (0) j ( y, x ) := max a ∈ K j ( y,x ) (cid:16) H j ( a, y, x ) + c (0) j ( ϕ j ( a, y ) , x ) (cid:17) for all y ∈ L , x ∈ R d . (4.4)Given the approximation v ( i ) for some i ≥ v ( i +1) usingreinforced regression inductively as follows: Initialize at the terminal time v ( i +1) J := v J . For EINFORCED OPTIMAL CONTROL 11 j ∈ { , ..., J − } assume that v ( i +1) l is already constructed for l ∈ { j + 1 , ..., J } . Then for each y ∈ L define the regression coefficients by solving the following least squares problem γ ( i +1) ,yj, , ..., γ ( i +1) ,yj,K + R y := arg min γ ,...,γ K + Ry M X m =1 (cid:12)(cid:12)(cid:12)(cid:12) v ( i +1) j +1 ( y, X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j )(4.5) − R y X k =1 γ K + k v ( i ) j +1 ( y k , X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12) , where { y k } k =1 ,...,R y = L y , and define the continuation function c ( i +1) j by c ( i +1) j ( y, x ) := K X k =1 γ ( i +1) ,yj,k ψ k ( x ) + R y X k =1 γ ( i +1) ,yj,K + k v ( i ) j +1 ( y k , x ) , (4.6)for all y ∈ L and x ∈ R d . Finally define the approximation v ( i +1) j through the dynamic program-ming principle by v ( i +1) j ( y, x ) := max a ∈ K j ( y,x ) (cid:16) H j ( a, y, x ) + c ( i +1) j ( ϕ j ( a, y ) , x ) (cid:17) , (4.7)for all y ∈ L and x ∈ R d .The iteration over i ∈ { , , ... } can be terminated after I ∈ N steps, yielding v ( I ) as an ap-proximation to the true value function. Alternatively one can introduce an adaptive terminationcriterion, for example by comparing the relative change in the error of the least squares problem(4.5), terminating after the change falls under a given threshold. Remark 4.1.
Recall that in the initialization we have v ( i ) J = v (0) J for all i ∈ { , ..., I } . It thenfollows inductively that v ( i ) j ≡ v ( l ) j , for all J − j ≤ i ≤ I, l ≥ i. (4.8)This identity can be used to reduce the costs of the algorithm, since the regression problem onlyneeds to be solved for all ( j, i ) with 0 ≤ j ≤ J − ≤ i ≤ ( J − j ) ∧ I . Remark 4.2.
More general or other forms of reinforced basis functions are certainly possible.The essential point is that they are based on the regression result from the preceding step inthe backwards induction and the preceding iteration. Our specific choice may be seen as anatural primal choice. We left flexibility in the choice of the sets L y , for which depending onthe cardinality of the set L , possible choices are the trivial L y = L and L y = { y } , or L y = L for some set L independent of y , or more elaborately L yj = { ϕ ( a, y ) | a ∈ K j ( y, x j ) } for some x j ∈ R d . Note that the use of a step dependent set L yj in the above method is straightforward.4.2. Hierarchical reinforced regression algorithm B.
Fix a number of iterations I ∈ N andinitialize the approximate value functions at the terminal time by v ( i ) J ≡ v J for all i ∈ { , ..., I } ,where v J is given by (4.1). The approximate value functions at times previous to J are definedinductively as follows: Let j ∈ { , ..., J − } and assume that v ( i ) l is already defined for all l ∈ { j + 1 , ..., J } and i ∈ { , ..., I } . For i = 0 and each y ∈ L determine the coefficients for the regression basis bysolving the least squares problem(4.9) γ (0) ,yj, , ..., γ (0) ,yj,K := arg min γ ,...,γ K M X m =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) v ( I ) j +1 ( y, X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) and define the approximate continuation function c (0) by (4.3). For i ∈ { , ..., D } and each y ∈ L determine the regression coefficients by solving the least squares problem γ ( i ) ,yj, , ..., γ ( i ) ,yj,K + R y := arg min γ ,...,γ K + Ry M X m =1 (cid:12)(cid:12)(cid:12)(cid:12) v ( I ) j +1 ( y, X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j )(4.10) − R y X k =1 γ K + k v ( i − j +1 ( y k , X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12) , where { y k } k =1 ,...,R y = L y , and define the continuation function c ( i ) j by (4.6).Finally, define the approximation to the value function v ( i ) j for all i = { , ..., I } by (4.7). Afterending the backwards induction use ( v ( I ) j ) j =0 ,...,J as an approximation to the true value function. Remark 4.3.
Note that the identity (4.8) also holds for the above algorithm. Moreover, since weare only interested in v ( I ) , we can discard the computation of c ( i ) j and v ( i ) j for all 0 ≤ j + i ≤ I − v ( I ) . The least squares problem then onlyneeds to be solved for ( j, i ) ∈ { , ..., J − } × { , ..., I } with0 ≤ j + i ≤ I − ≤ i ≤ ( J − j ) ∧ I. Remark 4.4.
Choosing the number of iterations I = J we then have from the previous remarkthat only the value functions on the diagonal j = i need to be constructed. In this case, denote v j = v ( j ) j , c j = c ( j ) j etc., and observe that the least squares problem which is solved in each step j ∈ { J − , ..., } of the backwards induction is given by γ yj, , ..., γ yj,K + R y := arg min γ ,...,γ K +1 M X m =1 (cid:12)(cid:12)(cid:12)(cid:12) v j +1 ( y, X ( m ) j +1 ) − K X k =1 γ k ψ k ( X ( m ) j ) − R y X k =1 γ K + k v j +1 ( y k , X ( m ) j ) (cid:12)(cid:12)(cid:12)(cid:12) . where { y k } k =1 ,...,R y = L y . Hence, for I = J the above algorithm represents a direct extension ofthe reinforced regression algorithm in [BS20] from optimal stopping to optimal control problems.5. Computational cost
We study the computational work of the modified reinforced regression algorithm of Sec-tion 4.2. In what follows, the following operations are considered to be performed at constant cost: • Multiplications, additions and other primitive operations at cost c ∗ ; • Simulation from the distribution of the Markov process X j at cost c X ; EINFORCED OPTIMAL CONTROL 13 • Evaluation of the standard basis functions ψ i or of the payoff H j at cost c f ; • We set R := max y ∈L R y . • The cost of evaluating other non-trivial, but known functions ϕ (think of the valuefunction when all the required regression coefficients are already known) will be denotedby cost( ϕ ).If an expression involves several such operations, then only the most expansive constant is re-ported. (E.g., evaluating a basis function and multiplying the value by a scalar constant isconsidered to incur a cost c f .) We may also use constants c which do not depend on the specificsof the algorithm. We now go through the individual stages of the algorithm.(1) Simulating trajectories at cost cost = c X M ( J + 1).(2) Computing the terminal value function as in (4.1) at cost cost = c f |L| |K| .(3) For fixed 0 ≤ j ≤ J − y ∈ L set up the least squares problem (4.9) at cost M (cid:16) c f K + cost (cid:16) v ( I ) j +1 (cid:17)(cid:17) .(4) For fixed 0 ≤ j ≤ J − y ∈ L , we solve the least squares problem (4.9) at cost c ∗ M K . The total cost is cost = c ∗ JM K |L| .(5) For fixed 0 ≤ j ≤ J − y ∈ L , and 1 ≤ i ≤ I set up the least squares problem (4.10) atcost M (cid:16) c f K + cost (cid:16) v ( I ) j +1 (cid:17) + R cost (cid:16) v ( i − j +1 (cid:17)(cid:17) .(6) For fixed 0 ≤ j ≤ J − y ∈ L , and 1 ≤ i ≤ I solve the least squares problem (4.10) atcost c ∗ M ( K + R ) , leading to a total cost of cost = c ∗ M ( K + R ) J |L| . List 5.1.
Stages of the algorithmFor simplicity of the presentation, we shall only consider the following scenario:
Assumption 5.1.
The total set of reinforced basis functions contains all available value func-tions, i.e., [ y ∈L L y = L . For fixed 0 ≤ i ≤ I and 0 ≤ j ≤ J let(5.1) v ( i ) j := (cid:16) v ( i ) j ( y, · ) (cid:17) y ∈L , c ( i ) j := (cid:16) c ( i ) j ( y, · ) (cid:17) y ∈L . The key step of the cost analysis is understanding the cost of evaluating the reinforced basisfunctions, which are, in turn, given in terms of reinforced basis functions at later time steps. Wenote that it is essential to analyze the cost of evaluating the full set of reinforced basis functions v ( i ) j rather than individual ones v ( i ) j ( y, · ), as the latter method would show us an apparentexplosion of basis functions as we increase time. By (4.7), evaluating v ( i ) j requires evaluating Suppose that each reinforced basis function v ( i ) j ( y, · ) depends on two reinforced basis functions v ( i − j +1 ( y , · )and v ( i − j +1 ( y , · ). If we follow this recursion for l ≤ i steps, we arrive at a total set of 2 l basis functions. The catchis that many, if not all, of these basis functions overlap with basis functions for other reinforced basis functions v ( i ) j (˜ y, · ). the payoff functions for all combinations of controls y ∈ L and policies a ∈ K , then evaluating c ( i ) j , and taking the corresponding maxima. In total, this meanscost (cid:16) v ( i ) j (cid:17) ≤ |K| |L| ( c f + c ∗ ) + cost (cid:16) c ( i ) j (cid:17) . On the other hand, by (4.6) evaluating c ( i ) j requires K evaluations of standard basis functions, K |L| elementary operations for summing them, one evaluation of v ( i − j +1 , and |L| elementaryoperations for their summation. In total, this means thatcost (cid:16) c ( i ) j (cid:17) ≤ Kc f + K |L| c ∗ + i> (cid:16) |L| c ∗ + cost (cid:16) v ( i − j +1 (cid:17)(cid:17) . This implies the cost estimate(5.2) cost (cid:16) v ( i ) j (cid:17) ≤ |K| |L| ( c f + c ∗ ) + Kc f + K |L| c ∗ + i> (cid:16) |L| c ∗ + cost (cid:16) v ( i − j +1 (cid:17)(cid:17) . Lemma 5.2.
The cost of evaluating v ( i ) j , i = 0 , . . . , I , j = 0 , . . . , J can be bounded by cost (cid:16) v ( i ) j (cid:17) ≤ ( i + 1) ( |K| |L| ( c f + c ∗ ) + Kc f + K |L| c ∗ ) + i |L| c ∗ , j + i ≤ J, |L| |K| ( c f + c ∗ ) + ( J − j ) ( |K| |L| ( c f + c ∗ ) + Kc f + ( K + 1) |L| c ∗ ) , j + i > J. Proof.
For a := |K| |L| ( c f + c ∗ ) + Kc f + K |L| c ∗ , consider the cost recursion c ( k + 1) ≤ a + |L| c ∗ c ( k ) , k ≥ . Assuming that the recursion hits i = 0 before j = J , i.e., i + j ≤ J , the cost c ( k ) := cost (cid:16) v ( k ) j + i − k (cid:17) satisfies the recursion with c (0) ≤ a , and, hence, we obtain c ( k ) ≤ ( k + 1) a + k |L| c ∗ . This gives the first expression in the statement of the lemma with k = i .On the other hand, if i + j > J , we hit j = J before i = 0. In this case, c ( k ) :=cost (cid:16) v ( i + j − J + k ) J − k (cid:17) satisfies the same recursion, but with initial value c (0) ≤ |L| |K| ( c f + c ∗ ). (cid:3) In order to shorten notation, we introduce a := |K| |L| ( c f + c ∗ ) + Kc f + K |L| c ∗ ,b := |L| c ∗ ,d := |L| |K| ( c f + c ∗ ) ,e := |K| |L| ( c f + c ∗ ) + Kc f + ( K + 1) |L| c ∗ , so that the estimate of Lemma 5.2 shortens tocost (cid:16) v ( i ) j (cid:17) ≤ ( i + 1) a + ib, j + i ≤ J,d + ( J − j ) e, j + i > J. We next estimate the cost of setting up the regression problem (4.9).
Lemma 5.3.
The cost of setting up the regression problem for c (0) j ( y, · ) , j = 0 , . . . , J − , y ∈ L ,can be bounded by cost ≤ JM Kc f + M ( J − I ) (( I + 1) a + Ib ) + M Id + 12
M I ( I + 1) e. EINFORCED OPTIMAL CONTROL 15
Proof.
We havecost ≤ J − X j =0 M (cid:16) Kc f + cost (cid:16) v ( I ) j +1 (cid:17)(cid:17) ≤ JM Kc f + M J − I − X j =0 [( I + 1) a + Ib ] + I X ‘ =1 [ d + ‘e ]= JM Kc f + M ( J − I ) (( I + 1) a + Ib ) + M Id + M I ( I + 1) e. (cid:3) The cost for setting up the least squares problem (4.10) is computed in a similar way.
Lemma 5.4.
The cost of setting up the regression problem for c ( i ) j ( y, · ) , i = 1 , . . . , I , j =0 , . . . , J − , y ∈ L , can be bounded by cost ≤ JM Kc f + M I [( I + 1) a + ( I − b ] ( J − I + 2)++ M I (cid:2) a + 2 b − d + 5 e + I ( I + 6) a + 3 I ( I + b ) + 3 I d + I ( I + 6) e (cid:3) . Proof.
A closer look at (4.10) reveals that the total cost of setting up all these least squaresproblems can be bounded by(5.3) cost ≤ J − X j =0 M Kc f + I X i =1 cost (cid:16) v ( i − j +1 (cid:17)! , taking into account that v ( I ) j +1 was already evaluated during the set-up of the least squaresproblem (4.9) and, hence, does not need to be evaluated again. Using Lemma 5.2, we obtaincost ≤ JM Kc f + M J − X j =0 ( J − j ) ∧ ( I − X i =0 (( i + 1) a + ib ) + I − X i =1+( J − j ) ∧ ( I − ( d + ( J − j − e ) = JM Kc f + M J − I +1 X j =0 " I − X i =0 (( i + 1) a + ib ) ++ M J − X j = J − I J − j X i =0 (( i + 1) a + ib ) + I − X i = J − j +1 ( d + ( J − j − e ) . Evaluating the double sums gives the estimate from the statement of the lemma. (cid:3)
Theorem 5.5.
The computational cost of the algorithm presented in Section 4.2 can be boundedby cost ≤ const M J (cid:0) c X + I ( K + |K| + |L| ) |L| + ( K + R ) |L| (cid:1) , where const is a positive number independent of |K| , |L| , K , J , and I . Proof.
We abandon the difference between c f and c ∗ . Then, a ≤ const ( |K| |L| + K ( |L| + 1)) ≤ const( K + |K| ) |L| ,b ≤ const |L| ,d ≤ const |K| |L| ,e ≤ const a ≤ const( K + |K| ) |L| , max( a, b, e, d ) ≤ const( K + |K| + |L| ) |L| . Similarly, we boundcost ≤ const (cid:2) JM K + M ( J − I ) I ( a + b ) + M Id + M I e (cid:3) ≤ const M J ( I + 1)( K + |K| + |L| ) |L| . For the second term in the cost estimate cost given in Lemma 5.4, we bound M I [( I + 1) a + ( I − b ] ( J − I + 2) ≤ const M I ( J − I + 2)( K + |K| + |L| ) |L| . The third term in the cost estimate cost is bounded by M I (cid:2) a + 2 b − d + 5 e + I ( I + 6) a + 3 I ( I + b ) + 3 I d + I ( I + 6) e (cid:3) ≤≤ const M I ( K + |K| + |L| ) |L| . Hence, we obtain cost ≤ const M (cid:2) JK + I J ( K + |K| + |L| ) |L| (cid:3) . Further, note that cost ≤ const cost , and similarly for cost . We also use that cost ≤ const cost . The total cost can now be expressed as followscost ≤ const [cost + cost + cost ] ≤ const (cid:0) M Jc X + M (cid:2) JK + I J ( K + |K| + |L| ) |L| (cid:3) + M ( K + R ) J |L| (cid:1) . (cid:3) Remark 5.6.
Recall that Remark 4.4 introduced a significantly cheaper variant of algorithm Bfor the case I = J . It is easy to see that the computational cost of this variant is bounded bycost ≤ const M J (cid:0) c X + J ( K + |K| + |L| ) |L| + ( K + R ) |L| (cid:1) , i.e., the total cost is proportional to J rather than J . Indeed, the main difference in the costanalysis as compared to the full modified algorithm is that (5.3) can be replaced bycost ≤ J − X j =0 M (cid:16) Kc f + cost (cid:16) v ( J − j − j +1 (cid:17)(cid:17) . Note that this essentially corresponds to the algorithm of [BS20] directly generalized to optimalcontrol problems.
EINFORCED OPTIMAL CONTROL 17 Convergence analysis
In this section we analyze the convergence properties of the standard and reinforced regressionalgorithms introduced in the previous sections. For related convergence analysis in the case ofoptimal stopping problems we refer the interested reader to [Zan13] and [BS20], see also [BRS20].Henceforth we assume that(6.1) max j =0 ,...,J sup y ∈L ,a ∈ A sup x ∈X | H j ( a, y, x ) | ≤ C H , then all the value functions v ∗ j ( y, x ) := sup α ∈A j ( y,x ) E J X ‘ = j H ‘ (cid:16) a ‘ , Y ‘ ( α ; j, y ) , X j,x‘ (cid:17) are uniformly bounded by JC H . Fix a sequence of spaces Ψ j , j = 0 , . . . , J, of functions definedon X . We stress that these spaces must not be linear at this point. Construct the correspondingsequence of estimates ( v j,M ( y, x )) Jj =0 via v J,M ( y, x ) = sup a ∈ K j ( y,x ) H J ( a, y, x ) and(6.2) v j,M ( y, x ) = sup a ∈ K j ( y,x ) ( H j ( a, y, x ) + T W P j,M [ v j +1 ,M ]( ϕ j +1 ( a, y ) , x )) , j < J, where P j,M [ g ]( z, x ) stands for the empirical projection of the conditional expectation E [ g ( z, X j,xj +1 )]on Ψ j , based on a sample(6.3) D M,j = n ( X ( m ) j , X ( m ) j +1 ) , m = 1 , . . . , M o from the joint distribution of ( X j , X j +1 ) , that is, P j,M [ g ]( z, · ) ∈ arg inf ψ ∈ Ψ j M X m =1 (cid:20)(cid:12)(cid:12)(cid:12) g ( z, X ( m ) j +1 ) − ψ ( X ( m ) j ) (cid:12)(cid:12)(cid:12) (cid:21) . In (6.2) T W is a truncation operator at level W = JC H defined by T W f ( x ) = f ( x ) , | f ( x ) | ≤ W,W sign( f ( x )) , otherwise . Due to Theorem 11.5 in [GKKW02], one has for all g with k g k ∞ ≤ W, j = 0 , . . . , J − , and all z ∈ L , that(6.4) E (cid:20)(cid:13)(cid:13)(cid:13) T W P j,M [ g ]( z, · ) − E h g ( z, X j, · j +1 ) i(cid:13)(cid:13)(cid:13) L ( µ j ) (cid:21) ≤ ε j,M + 2 inf w ∈ Ψ j k g ( z, · ) − w k L ( µ j ) with ε j,M := cW MM VC (Ψ j ) , where VC (Ψ j ) is the Vapnik-Chervonenkis dimension of Ψ j (see Definition 9.6 in [GKKW02]) , µ j is the distribution of X j , and c is an absolute constant. In order to keep the analysis tractable,we assume that the sets D M,j are independent for different j. More specifically, we consider an algorithmic framework based on (6.2), where for every exercise date the samples (6.3) aresimulated independently, and consider the information sets G j,M := σ (cid:8) X j ; M , . . . , X J ; M (cid:9) with X j ; M := (cid:16) X ( m ) j , X ( m ) j +1 , m = 1 , . . . , M (cid:17) . Let us define for j < J, z ∈ L , x ∈ X , (6.5) b C j ( z, x ) := T W P j,M [ v j +1 ,M ]( z, x ) , and for a generic (exact) dummy trajectory ( X l ) l =0 ,...,J independent of G j,M , let(6.6) e C j ( z, x ) := E G j +1 ,M h v j +1 ,M ( z, X j,xj +1 ) i . Note that e C j ( · , · ) is a G j +1 ,M -measurable random function while the estimate b C j ( · , · ) is a G j -measurable one. We further define(6.7) C ∗ j ( z, x ) = E h v ∗ j +1 ( z, X j,xj +1 ) i , j < J. The following lemma holds.
Lemma 6.1.
We have that, (6.8) E (cid:20)(cid:13)(cid:13)(cid:13) sup z ∈L (cid:12)(cid:12)(cid:12) e C j ( z, · ) − C ∗ j ( z, · ) (cid:12)(cid:12)(cid:12)(cid:13)(cid:13)(cid:13) L ( µ j ) (cid:21) ≤ E (cid:20)(cid:13)(cid:13)(cid:13) sup z ∈L (cid:12)(cid:12)(cid:12) b C j +1 ( z, · ) − C ∗ j +1 ( z, · ) (cid:12)(cid:12)(cid:12)(cid:13)(cid:13)(cid:13) L ( µ j +1 ) (cid:21) . Proof.
Let X be a generic (exact) dummy trajectory independent of G j +1 ,M . Then from (6.6),and (6.7) we see that for j < J, (6.9) (cid:12)(cid:12)(cid:12) e C j ( z, X j ) − C ∗ j ( z, X j ) (cid:12)(cid:12)(cid:12) ≤ E G j +1 ,M (cid:2) (cid:12)(cid:12) v j +1 ,M ( z, X j +1 ) − v ∗ j +1 ( z, X j +1 ) (cid:12)(cid:12)(cid:12)(cid:12) X j (cid:3) Next, by (2.7), (6.2), (6.5), and (6.7) we have that (cid:12)(cid:12) v j +1 ,M ( z, x ) − v ∗ j +1 ( z, x ) (cid:12)(cid:12) ≤ sup a ∈ K j +1 ( z,x ) (cid:12)(cid:12)(cid:12) b C j +1 ( ϕ j +2 ( a, z ) , x ) − C ∗ j +1 ( ϕ j +2 ( a, z ) , x ) (cid:12)(cid:12)(cid:12) ≤ sup z ∈L (cid:12)(cid:12)(cid:12) b C j +1 ( z , x ) − C ∗ j +1 ( z , x ) (cid:12)(cid:12)(cid:12) . (6.10)Hence, by (6.9) one has thatsup z ∈L (cid:12)(cid:12)(cid:12) e C j ( z, X j ) − C ∗ j ( z, X j ) (cid:12)(cid:12)(cid:12) ≤ E G j +1 ,M (cid:20) sup z ∈L (cid:12)(cid:12)(cid:12) b C j +1 ( z, X j +1 ) − C ∗ j +1 ( z, X j +1 ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X j (cid:21) . Finally, by taking the “all-in expectation” w.r.t. the law µ j ⊗ P M , we observe that E (cid:20) sup z ∈L (cid:12)(cid:12)(cid:12) e C j ( z, X j ) − C ∗ j ( z, X j ) (cid:12)(cid:12)(cid:12) (cid:21) ≤ E (cid:26) E G j +1 ,M (cid:20) sup z ∈L (cid:12)(cid:12)(cid:12) b C j +1 ( z, X j +1 ) − C ∗ j +1 ( z, X j +1 ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X j (cid:21)(cid:27) ≤ E (cid:20) sup z ∈L (cid:12)(cid:12)(cid:12) b C j +1 ( z, X j +1 ) − C ∗ j +1 ( z, X j +1 ) (cid:12)(cid:12)(cid:12) (cid:21) by Jensen’s inequality and the tower property. (cid:3) In fact, Lemma 6.1 is the key to the next proposition.
EINFORCED OPTIMAL CONTROL 19
Proposition 6.2.
Set E j := (cid:13)(cid:13)(cid:13) sup z ∈L (cid:12)(cid:12)(cid:12) b C j ( z, · ) − C ∗ j ( z, · ) (cid:12)(cid:12)(cid:12)(cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) , j = 0 , . . . , J − , with P M being the law of the sample X ( m ) j , m = 1 , . . . , M, j = 1 , . . . , J. Then it holds (6.11) E j ≤ |L| (cid:16) ε j,M + √ z ∈L inf w ∈ Ψ j (cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13) L ( µ j ⊗ P M ) (cid:17) + |L| E j +1 for all j = 0 , . . . , J − , with E J = 0 by definition.Proof. The case j = J − e C J − = C ∗ J − and Due to (6.4)we have with probability 1 , (6.12) E G j +1 ,M (cid:20)(cid:13)(cid:13)(cid:13) b C j ( z, · ) − e C j ( z, · ) (cid:13)(cid:13)(cid:13) L ( µ j ) (cid:21) ≤ ε j,M + 2 inf w ∈ Ψ j (cid:13)(cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) . Hence(6.13) (cid:13)(cid:13)(cid:13) b C j ( z, · ) − e C j ( z, · ) (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ ε j,M + √ w ∈ Ψ j (cid:13)(cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) . By applying (6.13) it follows that (cid:13)(cid:13)(cid:13) b C j ( z, · ) − C ∗ j ( z, · ) (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ ε j,M + √ w ∈ Ψ j (cid:13)(cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) + (cid:13)(cid:13)(cid:13) e C j ( z, · ) − C ∗ j ( z, · ) (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) . From this and Lemma 6.1 we implysup z ∈L (cid:13)(cid:13)(cid:13) b C j ( z, · ) − C ∗ j ( z, · ) (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ ε j,M + √ z ∈L inf w ∈ Ψ j (cid:13)(cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) + (cid:13)(cid:13)(cid:13) sup z ∈L (cid:12)(cid:12)(cid:12) b C j +1 ( z, · ) − C ∗ j +1 ( z, · ) (cid:12)(cid:12)(cid:12)(cid:13)(cid:13)(cid:13) L ( µ j +1 ⊗ P M ) and then (6.11) follows. (cid:3) Corollary 6.3.
Suppose that sup z ∈L inf w ∈ Ψ j (cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ δ, VC (Ψ j ) ≤ D, ≤ j ≤ J − , for some δ > and D > . Proposition 6.2 then yields for j = 0 , . . . , J − , by using (6.10) , (6.14) (cid:13)(cid:13)(cid:13) sup z ∈L (cid:12)(cid:12) v j,M ( z, · ) − v ∗ j ( z, · ) (cid:12)(cid:12)(cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ (cid:18) cW MM D + √ δ (cid:19) |L| J − j +1 − |L||L| − . Corollary 6.4.
By inserting the estimate inf w ∈ Ψ j (cid:13)(cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ (cid:13)(cid:13)(cid:13) e C j ( z, · ) − C ∗ j ( z, · ) (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) + inf w ∈ Ψ j (cid:13)(cid:13) C ∗ j ( z, · ) − w (cid:13)(cid:13) L ( µ j )0 C. BAYER, D. BELOMESTNY, P. HAGER, P. PIGATO, J. SCHOENMAKERS, AND V. SPOKOINY in Proposition 6.2, we get the alternative recursion E j ≤ |L| (cid:18) ε K,M + √ z ∈L inf w ∈ Ψ j (cid:13)(cid:13) C ∗ j ( z, · ) − w (cid:13)(cid:13) L ( µ j ) (cid:19) + |L| (1 + √ E j +1 , and under the alternative assumption sup z ∈L inf w ∈ Ψ j (cid:13)(cid:13) C ∗ j ( z, · ) − w (cid:13)(cid:13) L ( µ j ) ≤ δ, VC (Ψ j ) ≤ D, ≤ j ≤ J − , for some δ > and D > , we obtain for j = 0 , ..., J the bounds (cid:13)(cid:13)(cid:13) sup z ∈L (cid:12)(cid:12) v j,M ( z, · ) − v ∗ j ( z, · ) (cid:12)(cid:12)(cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ (cid:18) cW MM D + √ δ (cid:19) |L| (cid:0) (1 + √ |L| (cid:1) J − j − √ |L| − . Discussion . The proposed reinforced regression algorithm with I = J uses linear approximationspaces of the formΨ j = span { ψ ( x ) , . . . , ψ K ( x ) , v j +1 ,M ( y , x ) , . . . , v j +1 ,M ( y R , x ) } , j = 0 , . . . , J − , for L = { y , . . . , y R } , where ψ ( x ) , . . . , ψ K ( x ) are some fixed basis functions (e.g. polynomials)on X . In this case VC (Ψ j ) ≤ K + R, ≤ j ≤ J − . In order to see the advantage of addingadditional basis functions more clearly, estimatesup z ∈L inf w ∈ Ψ j (cid:13)(cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ sup z ∈L (cid:13)(cid:13)(cid:13) E G j +1 ,M (cid:2) v j +1 ,M ( z, X j, · j +1 ) − v j +1 ,M ( z, · ) (cid:3)(cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) . Assume additionally that(6.15) max j =1 ,...,J sup y ∈L ,a ∈ A sup x ∈X | H j ( a, y, x ) − H j ( a, y, x ) | ≤ L H | x − x | , for all x , x ∈ X andmax j =1 ,...,J max ‘ = j,...,J E [ | X j,x ‘ − X j,x ‘ | ] ≤ L X | x − x | , ∀ x , x ∈ X (6.16)for some constants L H > , L X > . Under assumptions (6.15) and (6.16) we then havemax j =1 ,...,J sup y ∈L | v ∗ j ( y, x ) − v ∗ j ( y, x ) | ≤ JL X L H | x − x | , ∀ x , x ∈ X . By using an additional truncation, one can achieve that the Lipschitz constants of the estimates v j,M ( z, · ) , j = 0 , . . . , J − , are all uniformly bounded by a constant JL X L H with probability 1 . Hence sup z ∈L inf w ∈ Ψ j (cid:13)(cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) ≤ JL X L H (cid:20) E Z | X j,xj +1 − x | µ j ( dx ) (cid:21) / . This implies that if JL H stays bounded for J → ∞ , (for example if H scales as 1 /J ), then the ap-proximation error inf w ∈ Ψ j (cid:13)(cid:13)(cid:13) e C j ( z, · ) − w (cid:13)(cid:13)(cid:13) L ( µ j ⊗ P M ) becomes small as J → ∞ . Note that the latterproperty can not be guaranteed when using fixed (nonadaptive) linear spaces Ψ j . Of course, theexponential in J factor in (6.14) will lead to explosion of the overall error as J → ∞ , but the aboveobservation still indicates that the inclusion of the functions v j +1 ,M ( y , x ) , . . . , v j +1 ,M ( y R , x ) intoΨ j can significantly improve the quality of the estimates v j,M ( z, · ) especially in the case of large J. We also note that the bound (6.14) is likely to be too pessimistic what its dependence on J isconcerned, see also a discussion in [Zan13]. EINFORCED OPTIMAL CONTROL 21 Numerical examples
We now present various numerical examples which demonstrate the accuracy of the reinforcedregression algorithm in practice. To allow for a direct comparison with the reinforced regressionalgorithm of [BSSZ20], we first consider a (single) optimal stopping problem, i.e., a Bermudanoption. Our next example is a multiple stopping problem, for which the hyperparameters alreadybecome critical. Finally, our last example is an optimal control of a gas storage.Before, let us also mention how a lower estimate to the value of a control problem in aMarkovian setting is calculated using the result of the a regression procedure. Let c be anapproximation to the function c ∗ given by c ∗ j ( x, y ) = E h v ∗ j +1 ( y, X j,xj +1 ) i , x ∈ R d , y ∈ L , j ∈ { , ..., J − } , with c J ≡ c ∗ J ≡ c ( I ) defined in (4.6). Further let ( X ( m ) ) ≤ m ≤ M test be sampletrajectories from the underlying Markov chain, generated independently from the samples usedin the regression procedure. Then we can iteratively define a sequence of polices ( α ( m ) ) ≤ m ≤ M test with α ( m ) = ( a ( m )0 , ..., a ( m ) J ) by a ( m ) j := arg max a ∈ K j ( Y ( m ) j ,X ( m ) j ) (cid:16) H j ( a, Y ( m ) j , X ( m ) j ) + c j ( ϕ ( a, Y ( m ) j ) , X ( m ) j ) (cid:17) , or all m = 1 , ..., M test and j = 0 , ..., J , where Y ( m )0 := y ∈ L and Y ( m ) j +1 := ϕ j +1 ( a mj , Y ( m ) j ).It then follows from the definition, that each α ( m ) is an admissible sequence of policies, i.e. α ( m ) ∈ A ( y , X ( m ) ). Therefore, a lower estimate to the value E ( v ( y , X )) is given by1 M test M test X m =1 J X j =0 H j ( a ( m ) j , Y ( m ) j , X ( m ) j ) . Bermudan max-call option.
In this section we evaluate the performance of the hierarchi-cal reinforced regression (HRR) method from Section 4 on the valuation of a Bermudan max-calloption. Let (Ω , F , ( F t ) ≤ t ≤ T , P ) be a filtered probability space on which a d -dimensional Brow-nian motion W = ( W ( t )) ≤ t ≤ T is defined. Further let X = ( X ( t )) ≤ t ≤ T be the geometricBrownian motion defined by dX k ( t ) = ( r − δ ) X k ( t ) dt + X k ( t ) σ dW k ( t ) , X k (0) = x , ≤ t ≤ T, k ∈ { , ..., d } , where x , r, δ, σ >
0. Option rights can be exercised on a predefined set of possible exercisedates { t , t , ..., t J } , where at most one right can be exercised on any given date. Assume thatthe exercise dates are equidistant t j := j · ∆ t for all j = { , ..., J } with ∆ t := T /J and definethe underlying Markov chain ( X j ) j =0 ,...,J by X j = X ( j ∆ t ). Recall from Example 2.3 that inorder to model a multiple stopping problem in the optimal control framework we define the setof policies by K = { , } and the set of controls by L = { , ..., y max } , where y max is the numberof exercise rights. Further, we define ϕ j ( a, y ) := ( y − a ) + , K j ( x, y ) := { , ∧ y } , H j ( a, y, x ) := a · g j ( x ) e − t j r , for all y ∈ L , a ∈ K and j ∈ { , ..., J } , where g is the max-call pay-off function defined by g ( x ) := (max { x , ...., x d } − C ) + , x ∈ R d , where C ∈ R + is the option strike. Then the value function v ∗ ( y max , x ) defined in (2.6) yields thevalue of the Bermudan max-call option with underlying X and data ( d, J, T, y max , x , C, r, δ, σ ).7.1.1. Single exercise right.
We will first consider the case of a single exercise right y max = 1.This is a standard example in the literature and we refer to [AB04] for reference values. Theperformance of the reinforced regression method for this example was already analyzed in [BS20].We revisit this example in order to demonstrate that even in the optimal stopping case our novelHRR method allows for improvements in computational costs without sacrificing the quality ofthe estimations. d Basis Lower bounds Upper boundsRegression HRR Reinf. Reg. I = 0 I = 1 I = 92 Ψ ,g ,g ,g ,g Table 1.
Bounds (with 99.7% confidence interval) for the value of the Bermudamax-call option with data J = 9, T = 1, y max = 1, x = C = 100, r = 0 . δ = 0 . σ = 0 . d ∈ { , , , } . Forall methods we used M = 10 training sample paths and M test = 10 paths forcalculating the lower bound. The dual upper bounds are calculated using theHRR method with I = 1, basis Ψ and 10 outer and 10 inner sample paths.Define the functions f i : R d → R , x sort( x , . . . , x d ) i , the i th largest entry of x , for i = 1 , . . . , d and consider the following three sets of regression basis functions:Ψ := { , f , ..., f d } , Ψ ,g := Ψ ∪ { g } , Ψ := Ψ ∪ { f i · f j | ≤ i ≤ j ≤ d } , EINFORCED OPTIMAL CONTROL 23 Ψ := Ψ ∪ Ψ ∪ { f i · f j · f k | ≤ i ≤ j ≤ k ≤ d } . Note that the cardinalities of these sets are given by | Ψ | = d + 1, | Ψ ,p | = d + 2, | Ψ | = d + d + 1, and | Ψ | = d + d + d + 1, respectively. Regarding the HRR method, weuse the algorithm of the second type described in Section 4.2. Further note that in the optimalstopping case there is only one choice for the set of reinforced value function for the HRR methodsince L = { } and therefore we always set L = { } . g d = 2 g d = 4 g d = 6 g d = 8Regression( I = 0) HRR( I = 1) RR( I = 9) Regression( I = 0) HRR( I = 1) RR( I = 9) Figure 5.
A visualization of the lower bounds from Table 1.We considered two different set-ups for the comparison of the different methods: • First we keep the number of exercises dates J fixed and vary the number of underlyingasset d ; • Second we keep d fixed and vary J (while also keeping T fixed). In Table 1 we present lower and upper bounds to the value of a Bermudan max-call optionwith a single exercise right for J = 9 and d = 2 , , ,
8. In the corresponding Figure 5 we havevisualized the lower bounds for the comparison between the different regression methods. Foreach of the considered methods we used M = 10 simulated training samples paths to determinethe regression coefficients and M test = 10 paths for calculating the lower bounds.We first observe that across all numbers of assets d the HRR method with the set of basisfunctions Ψ performs significantly better then the standard regression method with the basisΨ and Ψ ,g . The same holds true when comparing the methods using the regression basis Ψ .More importantly however, we observe that for d ≤ yields lowerbounds of the same quality as obtained with the standard method and the larger basis Ψ . Thesame holds true when comparing the HRR method with basis Ψ against the standard methodwith the basis Ψ . In the case d = 8 assets, the lower bounds obtained with the HRR method andbasis Ψ lie just slightly below the values of the lower bounds obtained with standard methodand the basis Ψ , however one has to keep in mind that in this case | Ψ | = 9 and | Ψ | = 165.Moreover, we see that the HRR method with a recursion depth I = 1 performs just as well asthe (full depth) reinforced regression method ( I = J = 9). d C P U T i m e ( s ) , I = 0 g , I = 0 , I = 0 , I = 0 , I = 1 , I = 1 , I = 9 , I = 9 Figure 6.
Figure 6. The elapsed CPU times during the backward induction and the calculation of the lower bounds from Table 1, plotted with respect to the number of underlying assets d.

Furthermore, in Figure 6 we visualize the corresponding elapsed CPU times during the backward induction and the calculation of the lower bounds. As foreshadowed in Section 5, we see that the computational cost of the HRR method is significantly reduced by choosing a small recursion depth I.
In particular, for sufficiently large d (d ≥ …, respectively d ≥ 2), the HRR method with recursion depth I = 1 and the basis Ψ_1, respectively Ψ_2, is more efficient than the standard method with the basis Ψ_2, respectively Ψ_3.

[Figure 7: one panel per J = 9, 18, 36, 72; lower bounds plotted against the recursion depth I ∈ {0, 1, 2, 9}; series: standard regression (I = 0) with Ψ_{1,g}, Ψ_2, Ψ_3, and HRR with I = 1, 2, 9.]
Figure 7. Visualization of the lower bounds (with 99.7% confidence intervals) of the values of Bermudan max-call options with J = 9, 18, 36, 72 exercise dates and d = 4, T = 1, y_max = 1, x_0 = C = 100, r = 0.…, δ = 0.…, σ = 0.2. The values are obtained with the standard regression method (I = 0) and the HRR method (I = 1, 2, 9). For all methods we used M = 10 training sample paths and M_test = 10 sample paths for the calculation of the lower bounds.

The results of the second set-up are visualized in Figure 7. In this case, we approximate the values of Bermudan max-call options with a fixed number of assets, d = 4, and different numbers of exercise dates, J = 9, 18, 36, 72,
while also keeping T = 1 fixed. When all other parameters are kept fixed, the value of the option is clearly non-decreasing in the number of exercise dates J. Our first observation is that for all considered methods there exists a threshold for J at which the performance worsens. Indeed, Corollary 6.3 shows that the approximation error depends exponentially on J.
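Schematically (this is an illustration in our own notation, not the precise statement of Corollary 6.3): if each backward step amplifies the already accumulated error by a factor 1 + γ and adds a one-step regression error ε_j, then

```latex
\mathcal{E}_0 \;\le\; \sum_{j=1}^{J} (1+\gamma)^{J-j}\,\varepsilon_j
\;\le\; (1+\gamma)^{J} \sum_{j=1}^{J} \varepsilon_j ,
```

so even a moderate per-step amplification γ > 0 produces a factor that grows exponentially in J.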
However, in practice, some methods are less vulnerable to this error explosion in J than others. In this example, we see that the standard regression method with the bases Ψ_{1,g} and Ψ_2, respectively, starts to perform worse for J ≥ 36. The lower bounds calculated with the HRR method with the basis Ψ_1 and I = 1 stay approximately on the same level as the lower bounds calculated with the standard method and the basis Ψ_2, for all numbers of exercise dates. The lower bounds calculated with the standard method and the basis Ψ_3 first increase when moving from 9 to 18 exercise dates and only decrease when moving from 36 to 72 exercise dates.

[Figure 8: CPU times plotted against the number of exercise dates J; series: standard regression (I = 0) with Ψ_{1,g}, Ψ_2, Ψ_3, and HRR with I = 1, 2, 9.]
Figure 8. The elapsed CPU times during the backward induction and the calculation of the lower bounds in Figure 7, plotted with respect to the number of exercise dates J.

The main observation is that when we increase J, the HRR methods with I = 2 and I = 9 come closer to the lower bounds calculated with the standard method and the basis Ψ_3. This underlines the theoretical discussion for J → ∞ from Section 6. Moreover, we see that the HRR method performs at least as well with I = 2 as with I = 9. The corresponding CPU times that elapsed during the regression procedure and the calculation of the lower bounds are presented in Figure 8. We see that for large enough J the HRR method with I = 2 is more efficient than the standard regression method with the basis Ψ_3. Comparing the HRR methods with I = 2 and I = 9, we see that choosing the parameter I small is necessary in order to obtain the desired efficiency. Since, on the other hand, the HRR method with I = 1 performed significantly worse than with I = 2, we also see that in this case it was necessary to choose I > 1.
These observations underline the relevance of the HRR method in its full complexity even in the case of optimal stopping problems.
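Before turning to multiple exercise rights, we sketch how such a depth-I hierarchy can be organized in the single-stopping case. This is our own schematic reading of the construction (level 0 is plain least squares Monte Carlo; level i reinforces the basis with the next-date value function of level i − 1), not the authors' implementation; `payoff` and `basis` are assumed given:

```python
import numpy as np

def lsq_fit(design, X, Y):
    # ordinary least squares of targets Y on the design evaluated at X
    A = np.stack([design(x) for x in X])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return lambda x: design(x) @ coef

def hrr_stopping(paths, payoff, basis, I):
    """Schematic HRR backward induction for optimal stopping.

    paths : (M, J+1, d) array of training trajectories
    payoff: payoff(j, x) -> discounted exercise value (assumed given)
    basis : basis(x) -> 1-D feature vector (the generic basis, e.g. Psi_1)
    I     : hierarchy depth; I = 0 recovers standard least squares MC
    Returns the level-I value function approximations, one per date.
    """
    M, J1, _ = paths.shape
    J = J1 - 1
    v = [[None] * (J + 1) for _ in range(I + 1)]
    for i in range(I + 1):
        v[i][J] = lambda x: payoff(J, x)
        for j in range(J - 1, -1, -1):
            if i == 0:
                design = basis           # level 0: plain regression basis
            else:
                # reinforce the basis with the next-date value function
                # from one level lower in the hierarchy
                w = v[i - 1][j + 1]
                design = (lambda x, w=w:
                          np.concatenate((basis(x), [w(x)])))
            Ynext = np.array([v[i][j + 1](x) for x in paths[:, j + 1]])
            cont = lsq_fit(design, paths[:, j], Ynext)
            # value = max(exercise, regressed continuation)
            v[i][j] = (lambda x, j=j, c=cont: max(payoff(j, x), c(x)))
    return v[I]
```

Note that evaluating a level-i function recurses only i levels deep, which is the mechanism keeping the overall cost of order J · I regressions; a full-depth reinforcement would instead unwind all the way back to the terminal date.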
7.1.2. Multiple exercise rights.
Next we consider a Bermudan max-call option with y_max = 4 exercise rights. In this case the HRR method allows for different possibilities of reinforced value functions depending on the choice of the sets L_y (recall Remark 4.2). Since y_max is small, we choose L_y ≡ {1, 2, 3, 4} for simplicity.
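Schematically, at each backward step the regression design at rights level y is then enlarged by the next-date value functions for the levels in L_y; a sketch in our own (hypothetical) notation, with the paper's precise construction given in Section 4:

```python
import numpy as np

def reinforced_design(basis, v_next, L_y):
    """Append the next-date value functions v_next[y'] for y' in L_y.

    basis : generic feature map x -> 1-D array
    v_next: dict mapping a number of remaining rights y' to the already
            computed value function approximation at the next date
    L_y   : which levels to reinforce with; here L_y = {1, 2, 3, 4}
    """
    def design(x):
        extra = [v_next[yp](x) for yp in sorted(L_y)]
        return np.concatenate((basis(x), extra))
    return design
```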
[Table 2: lower bounds per basis (among them Ψ_{1,g}) for Regression (I = 0) and Hierarchical Reinforced Regression (I = 1, 2, 3, 4, 5); numerical entries omitted.]

Table 2. Lower bounds (with a 99.7% confidence interval valid for each row) for the value of the Bermudan max-call option with data J = 24, T = 2, y_max = 4, x_0 = C = 100, d = 5, r = 0.…, δ = 0.…, σ = 0.2. For all methods we used M = 10 training sample paths and M_test = 10 paths for calculating the lower bound. An upper bound on the value, calculated with the dual approach from [?] and [?], is given by 92.971 (0.…); it was computed with the HRR method with I = 3 and basis Ψ, using 10 outer and 10 inner sample paths.

In Table 2 and the corresponding Figure 9 we present lower bounds on the value of a Bermudan max-call option with y_max = 4 exercise rights, obtained with the standard regression method and the HRR method for different choices of regression basis functions and of the parameter I, using the implementation of the second type described in Section 4.2. We first observe that, for a fixed set of basis functions Ψ_1 or Ψ_2, increasing the parameter I yields increased, and thus improved, lower bounds. This improvement is most significant when moving from I = 0 (standard regression) to I = 1 and from I = 1 to I = 2, and becomes less significant when I is increased further. Moreover, we observe that the HRR method with I = 1 and basis functions Ψ_1 yields better lower bounds than the standard regression method with the larger set of basis functions Ψ_{1,g}; more importantly, for I ≥ 2 the HRR method yields better lower bounds than the standard regression method with the even larger set of basis functions Ψ_2. This observation prevails when comparing the standard regression method with the basis Ψ_3 against the HRR method with the basis Ψ_2. We can therefore conclude that the HRR method yields results of better quality than standard regression while using fewer regression basis functions. Moreover, we note that, up to changes that are insignificant relative to the Monte Carlo error, the HRR method already reaches its best performance for I = 3, so that further increasing I is not necessary.
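For orientation, the dual upper bounds reported in Tables 1 and 2 rest on martingale duality; in the single-stopping case y_max = 1 it takes the classical form

```latex
V_0 \;=\; \inf_{M \in \mathcal{M}_0} \mathbb{E}\Big[\, \max_{0 \le j \le J} \big( Z_j - M_j \big) \Big],
```

where Z_j is the discounted payoff at date j and the infimum runs over martingales M with M_0 = 0. Any approximate value function induces a candidate martingale and hence a valid, biased-high bound, which is estimated by nested ("outer" and "inner") simulation; the multiple-stopping dual cited in the caption of Table 2 extends this representation.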
7.2. A gas storage problem. In this subsection we consider a gas storage problem of the kind introduced in Example 2.5. In contrast to the example in the previous subsection, this optimal control problem is not of multiple stopping type, which is a consequence of the anti-symmetry in the policy set: injection of gas into the facility (a = 1), no action (a = 0), and production of gas (a = −1).

[Figure 9: lower bounds per basis for Regression (I = 0) and Hierarchical Reinforced Regression with I = 1, I = 3 and I = 5.]
Figure 9. Visualization of the lower bounds with confidence intervals from Table 2. The black horizontal line represents the dual upper bound.

We use the following two-dimensional jump-diffusion model for the price of crude oil X^1 and the price of natural gas X^2:

dX^1(t) = α_1 (β − X^1(t)) dt + σ_1 X^1(t) dW^1(t) + (J^1_{N(t−)+1} − X^1(t)) dN(t),
dX^2(t) = α_2 (X^1(t) − X^2(t)) dt + σ_2 X^2(t) dW^2(t) + (J^2_{N(t−)+1} − X^2(t)) dN(t),   (7.1)

for 0 ≤ t ≤ T, where β, α_i, σ_i > 0 for i = 1, 2, W^1 and W^2 are Brownian motions with correlation ρ_W ∈ [0, 1], N is a Poisson process with intensity λ > 0, and (J^1_k, J^2_k)_{k=1,2,...} are i.i.d. normally distributed random vectors with J^i ∼ N(μ_i, η_i), μ_i, η_i > 0, and ρ_J = Cor(J^1, J^2) ∈ [0, 1]; moreover, (W^1, W^2), N and (J^1, J^2) are independent. Note that both X^1 and X^2 are mean-reverting processes with jump contributions. The oil price process X^1 reverts to the long-term constant mean β, while the gas price process X^2 reverts towards the oil price X^1, which aims to model the well-known strong correlation between crude oil and natural gas prices. Note also that we have assumed for simplicity that the jump signal, which has the purpose of modeling price peaks, is the same Poisson process for both oil and gas prices; however, the magnitudes of the jumps are given by different (but correlated) normally distributed random variables.

Denote by (X̃_j)_{j=1,...,365} the two-dimensional Markov chain obtained by discretizing the SDE (7.1) with an Euler scheme on the time interval [0, 365] (in days). The manager of the gas storage facility has the possibility to buy and sell gas on a predefined set of dates in the year, {t_j}_{j=1,...,J} ⊂ {1, ..., 365} with t_j = j · δt for some δt, J ∈ N such that δt · J ≤ 365, and we set X = (X_j)_{j=0,...,J} with X_j := X̃_{t_j}.

Recall from Example 2.5 that we assume that the volume of gas in the storage can only be increased or decreased by a fraction ∆ = 1/N, for some N ∈ N, over a time interval of δt days. The state space of the control variable is then given by L = {0, ∆, 2∆, ..., 1}. Also recall the definition of the space of policies K, the constraint sets K_j and the function ϕ_j from Example 2.5. We assume that there is no trading at j = 0, hence K_0 ≡ {0}. The cash-flow underlying the optimal control problem only depends on the second component of the Markov chain X_j and is given by

H_j(a, y, X_j) = −a · ∆ · X^2_j · e^{−r·j·δt/365},   a ∈ K_j, y ∈ L, j = 1, ..., J,

where r > 0 is the interest rate; the factor ∆ · X^2_j properly scales the resulting value of the optimal control problem. The following specific choice of the price model parameters is oriented at the values in [TDR09]:

β = 45, α_1 = 0.…, α_2 = 0.…, σ_1 = σ_2 = 0.…, ρ_W = 0.…, λ = 2, μ_1 = μ_2 = 100, η_1 = η_2 = 30, ρ_J = 0.….   (7.2)
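A daily Euler discretization of (7.1) is straightforward. In the sketch below, every numerical value in the example call is a placeholder (several digits of (7.2) did not survive in the source), and reading η_i as a standard deviation is our assumption:

```python
import numpy as np

def simulate_oil_gas(days, x0, beta, alpha, sigma, rho_w,
                     lam, mu, eta, rho_j, seed=None):
    """Daily Euler scheme for the two-factor jump-diffusion (7.1).

    alpha, sigma, mu, eta are pairs (oil, gas); parameters are quoted
    per year, the step size is one day.  A jump replaces the post-drift
    price by the (correlated) normal level J, as in (7.1).
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / 365.0
    cov_w = [[dt, rho_w * dt], [rho_w * dt, dt]]
    cov_j = [[eta[0] ** 2, rho_j * eta[0] * eta[1]],
             [rho_j * eta[0] * eta[1], eta[1] ** 2]]
    x = np.empty((days + 1, 2))
    x[0] = x0
    for t in range(days):
        dw = rng.multivariate_normal([0.0, 0.0], cov_w)
        x1, x2 = x[t]
        inc = np.array([alpha[0] * (beta - x1) * dt + sigma[0] * x1 * dw[0],
                        alpha[1] * (x1 - x2) * dt + sigma[1] * x2 * dw[1]])
        if rng.random() < lam * dt:          # Euler approximation of dN
            inc += rng.multivariate_normal(mu, cov_j) - x[t]
        x[t + 1] = x[t] + inc
    return x

# placeholder values for alpha, sigma, rho_w, rho_j and the initial gas price
path = simulate_oil_gas(365, (100.0, 100.0), beta=45.0,
                        alpha=(0.25, 0.5), sigma=(0.3, 0.3), rho_w=0.5,
                        lam=2.0, mu=(100.0, 100.0), eta=(30.0, 30.0),
                        rho_j=0.5, seed=1)
```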
Figure 10 shows a sample trajectory of the Markov chain X simulated with the above parameters.

Figure 10. A sample path of the Markov chain (X_j)_{j=0,...,J} = (X̃_{t_j})_{j=0,...,J} with t_j = j · δt and J = 52. The approximation X̃ = (X̃_j)_{j=1,...,365} to the SDE (7.1) is simulated with the parameters given in (7.2) and X̃_0 = (100, …); X^1 and X^2 serve as models for the prices of crude oil and natural gas, respectively.

Further, we define the following sets of polynomial regression basis functions:

P_i(X^2) := {(x^1, x^2) ↦ (x^2)^p | p = 0, ..., i},
P_i(X^1, X^2) := {(x^1, x^2) ↦ (x^1)^p (x^2)^q | p, q = 0, ..., i, p + q ≤ i}.   (7.3)
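Before reporting the results, the following sketch shows how the standard (I = 0) regression-based backward induction for this storage problem can be organized with the bases (7.3); the interest rate value and all names are our placeholders:

```python
import numpy as np
from itertools import product

def storage_value_estimate(paths, N=8, r=0.05, dt_days=7, deg=1, y0=4):
    """Schematic least squares backward induction for the storage problem.

    paths: (M, J+1, 2) price paths at the trading dates, column 1 = gas.
    Inventory lives on {0, 1/N, ..., 1} (indexed 0..N); at each date the
    manager injects (a = 1), does nothing (a = 0) or produces (a = -1)
    one unit 1/N, subject to the boundary constraints enforced below.
    r = 0.05 is a placeholder rate (the source digits did not survive).
    Returns the in-sample (regression-based) value estimate; a genuine
    lower bound would re-run the induced policy on fresh paths.
    """
    M, J1, _ = paths.shape
    J = J1 - 1

    def design(X, d=deg):                  # polynomials P_d(X^1, X^2)
        return np.stack([X[:, 0] ** p * X[:, 1] ** q
                         for p, q in product(range(d + 1), repeat=2)
                         if p + q <= d], axis=1)

    V = np.zeros((M, N + 1))               # terminal value set to zero
    for j in range(J, 0, -1):
        disc = np.exp(-r * j * dt_days / 365.0)
        A = design(paths[:, j])
        # regressed continuation value, one regression per inventory level
        C = np.stack([A @ np.linalg.lstsq(A, V[:, y], rcond=None)[0]
                      for y in range(N + 1)], axis=1)
        gas = paths[:, j, 1]
        Vnew = np.full((M, N + 1), -np.inf)
        for y in range(N + 1):
            for a in (-1, 0, 1):
                if 0 <= y + a <= N:        # cannot over-fill or over-empty
                    cash = -a * (1.0 / N) * gas * disc
                    Vnew[:, y] = np.maximum(Vnew[:, y], cash + C[:, y + a])
        V = Vnew
    return V[:, y0].mean()                 # no trading at j = 0, Y_0 = y0/N
```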
I   Basis           v(Y_0, X_0)   Lower bound
0   P_1(X^2)        78.381        70.489 (0.066)
0   P_1(X^1, X^2)   78.575        70.635 (0.068)
0   P_2(X^2)        73.072        71.253 (0.068)
0   P_2(X^1, X^2)   73.207        71.402 (0.068)
0   P_3(X^1, X^2)   72.929        71.333 (0.081)
0   P_4(X^1, X^2)   72.595        71.498 (0.068)
1   P_1(X^1, X^2)   71.991        71.579 (0.070)

Table 3. Approximate values and lower bounds for the gas storage problem with the parameters given in (7.4) and the price model parameters given in (7.2). The quantities were obtained with the standard regression method (I = 0) and the HRR method (I = 1), the different sets of basis functions (7.3), M = 10 training sample paths and M_test = 10 paths for calculating the lower bounds.

We have approximated the value of the gas storage problem with the following parameters:

δt = 7, J = 52, ∆ = 1/8, X_0 = (100, …), Y_0 = 4/8, r = 0.….   (7.4)

In this configuration the gas storage facility is initially loaded to half its capacity, the gas storage manager has the possibility to trade gas every seven days, and the amount by which the manager can inject or produce gas is one eighth of the total capacity. In Table 3 and the corresponding Figure 11 we present the numerical results obtained with the standard regression method and the HRR method. We used M = 10 training sample paths and M_test = 10 sample paths for the calculation of the lower bounds. For the set of reinforced basis functions in the HRR method we have chosen L_y ≡ {y}, i.e., in each step of the backward induction the regression basis was reinforced with only one function.

We observe at first that the lower bounds obtained with the standard regression method improve when using polynomials in both variables (X^1, X^2) instead of just in the second variable X^2 (the gas price), and also improve when using polynomials of increasing order (with the only exception of the third-degree polynomials). Moreover, we observe that the lower bound obtained with the HRR method, using the set of basis functions P_1(X^1, X^2) and I = 1, lies above all lower bounds that were obtained with standard regression, in particular above the bound obtained with the regression basis P_4(X^1, X^2).

[Figure 11: Visualization of the lower bounds with confidence intervals from Table 3, one group per basis P_1(X^2), P_1(X^1, X^2), P_2(X^2), P_2(X^1, X^2), P_3(X^1, X^2), P_4(X^1, X^2) for Regression (I = 0), and P_1(X^1, X^2) for HRR (I = 1).]

8. Conclusions
In this paper, we present a new reinforced regression algorithm (HRR) for stochastic control problems, extending and improving the algorithm (RR) developed in [BSSZ20] for optimal stopping problems.
The new algorithm provides a flexible hierarchical construction of approximate value functions, based on enlarging the set of basis functions with approximate value functions at one level lower in the hierarchy. As value functions at level 0 correspond to value functions obtained by standard regression, the new hierarchical construction significantly reduces the computational footprint of reinforced regression compared with the algorithm of [BSSZ20], while keeping the improvement in accuracy. In fact, for a hierarchy of value functions of depth I + 1, the computational cost of HRR is proportional to J·I, compared to J² for RR and J for standard regression (SR); see Theorem 5.5.

We provide a generic convergence analysis of regression-type algorithms for the considered class of stochastic optimal control problems. In particular, we show that the mean-squared error of the corresponding estimate for the value function v*(z, ·) is proportional to (log M)/M plus an approximation error. Our analysis further indicates that the use of adaptive (data-dependent) approximation spaces as, for example, in HRR can significantly reduce the approximation error, especially in the case of large J. These theoretical findings are underlined by careful numerical examples. In particular, we observe that HRR based on polynomial basis functions of a certain degree deg tends to produce results comparable to SR based on polynomial basis functions of degree deg + 1 or even higher; see Figures 5, 7, 9, and, most impressively, Figure 11.

The numerical results also indicate that HRR with a low depth I of the hierarchy already performs very well, even if I ≪ J; see Figures 5, 7, 9. Hence, HRR performs with similar accuracy to RR, but at much improved cost. Additionally, when comparing HRR with SR at fixed accuracy, the computational cost of HRR is usually much smaller, especially for large d; see Figures 6, 8.

Finally, we note that the accuracy of the HRR method increases substantially when the time discretization is refined, i.e., when J is increased for a fixed time horizon T. This theoretically very plausible observation (see Section 6) is backed up by numerical experiments; see Figure 7.

The present paper considers stochastic control problems in discrete time with finite action spaces. Of course, continuous-time problems can easily be covered by discretizing time, a standard approach also considered in our theoretical analysis of Section 6. The same is, in principle, true for continuous action spaces, even though such problems lead to interesting new questions for the proposed HRR algorithms, in particular regarding the choice of (reinforced) basis functions in light of infinitely many possible choices. We postpone this problem to future research.

References

[AB04] L. Andersen and M. Broadie. A primal-dual simulation algorithm for pricing multi-dimensional American options.
Management Science, 50(9):1222–1234, 2004.
[BCJ18] Sebastian Becker, Patrick Cheridito, and Arnulf Jentzen. Deep optimal stopping. arXiv preprint arXiv:1804.05394, 2018.
[BKS09] D. Belomestny, A. Kolodko, and J. Schoenmakers. Regression methods for stochastic control problems and their convergence analysis. SIAM Journal on Control and Optimization, 48:3562–3588, 2009.
[BR11] Nicole Bäuerle and Ulrich Rieder. Markov decision processes with applications to finance. Springer Science & Business Media, 2011.
[BRS20] Christian Bayer, Martin Redmann, and John Schoenmakers. Dynamic programming for optimal stopping via pseudo-regression. Quantitative Finance, pages 1–16, 2020.
[BS18] Denis Belomestny and John Schoenmakers. Advanced simulation-based methods for optimal stopping and control. Palgrave Macmillan, London, 2018. With applications in finance.
[BS20] Denis Belomestny and John Schoenmakers. Optimal stopping of McKean–Vlasov diffusions via regression on particle systems. SIAM Journal on Control and Optimization, 58(1):529–550, 2020.
[BSSZ20] Denis Belomestny, John Schoenmakers, Vladimir Spokoiny, and Bakhyt Zharkynbay. Optimal stopping via reinforced regression. Commun. Math. Sci., 18(1):109–121, 2020.
[GHW11] Lajos Gyurko, B. Hambly, and Jan Witte. Monte Carlo methods via a dual approach for some discrete time stochastic control problems. Mathematical Methods of Operations Research, 81, 2011.
[GKKW02] László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Series in Statistics. Springer-Verlag, New York, 2002.
[Pha09] Huyên Pham. Continuous-time stochastic control and optimization with financial applications, volume 61. Springer Science & Business Media, 2009.
[TDR09] Matt Thompson, Matt Davison, and Henning Rasmussen. Natural gas storage valuation and optimization: A real options application. Naval Research Logistics (NRL), 56(3):226–238, 2009.
[TVR01] John Tsitsiklis and Benjamin Van Roy. Regression methods for pricing complex American-style options.
IEEE Transactions on Neural Networks, 12:694–703, 2001.
[Zan13] Daniel Z. Zanger. Quantitative error estimates for a least-squares Monte Carlo algorithm for American option pricing. Finance and Stochastics, 17(3):503–534, 2013.