A Backward Simulation Method for Stochastic Optimal Control Problems∗
Zhiyi Shen† and Chengguo Weng‡

January 23, 2019

∗ The authors are thankful to Alexander Schied, Pengfei Li, Degui Li, and Zhaoxing Gao for helpful discussions.
† Department of Statistics and Actuarial Science, University of Waterloo. Email Address: [email protected]
‡ Department of Statistics and Actuarial Science, University of Waterloo. Email Address: [email protected]
Abstract
A number of optimal decision problems with uncertainty can be formulated into a stochastic optimal control framework. The Least-Squares Monte Carlo (LSMC) algorithm is a popular numerical method to approach solutions of such stochastic control problems, as analytical solutions are not tractable in general. This paper generalizes the LSMC algorithm proposed in Shen and Weng (2017) to solve a wide class of stochastic optimal control models. Our algorithm has three pillars: a construction of an auxiliary stochastic control model, an artificial simulation of the post-action value of the state process, and a shape-preserving sieve estimation method, which equip the algorithm with a number of merits, including bypassing forward simulation and control randomization, evading extrapolation of the value function, and alleviating the computational burden of tuning parameter selection. The efficacy of the algorithm is corroborated by an application to pricing equity-linked insurance products.
1 Introduction

The stochastic optimal control model is a prevalent paradigm for solving optimal decision problems with uncertainty in a variety of fields, particularly financial engineering. In solving discrete-time stochastic optimal control problems, the Dynamic Programming Principle (DPP) is a prevailing tool which characterizes the optimal value function as the solution to a backward recursive equation system, often known as the
Bellman equation. This reduces the stochastic optimization problem into two separate problems: 1) solving a sequence of deterministic optimization problems and 2) evaluating the conditional expectation terms in the Bellman equation. In spite of the theoretical appeal of the DPP, there generally does not exist a closed-form solution of the Bellman equation, which impedes the application of stochastic optimal control models to complicated real-world problems. Recently, a number of numerical methods have been proposed in the literature to approach the optimal or suboptimal solutions to various stochastic optimal control problems by combining Monte Carlo simulation with nonparametric regression methods.

In a statistical setting, the typical goal of nonparametric regression methods is to estimate the functional form of the expectation of a response variable conditioning on a covariate variable. This naturally motivates one to use certain nonparametric regression methods to evaluate the conditional expectation (also known as the continuation value in the context of pricing Bermudan options) involved in the Bellman equation, where the value function at the next time step and the state variable at the current time step are taken as the response and covariate variables, respectively. Such a ground-breaking idea was incubated in a series of papers including Carriere (1996), Longstaff and Schwartz (2001), and Tsitsiklis and Van Roy (2001), and the corresponding numerical algorithms are often referred to as the Least-Squares Monte Carlo (LSMC) algorithms. Since then, the LSMC algorithm has witnessed remarkable popularity in solving optimal stopping problems, a special class of stochastic control problems; see, e.g., Clément et al. (2002), Stentoft (2004), Egloff (2005), Glasserman and Yu (2004), Glasserman et al. (2004), Egloff et al. (2007), Zanger (2009), Zanger (2013), Belomestny et al. (2009), Belomestny (2011), and the references therein.

The problem of solving general stochastic control problems by resorting to the LSMC algorithm is considerably more involved. To understand the crux, let us note that the LSMC method has two building blocks: 1) a forward simulation of the state process and 2) a backward updating procedure which employs the nonparametric regression to estimate the continuation value. In an optimal stopping problem, the evolution of the state process is independent of the decision maker's (DM's) action and therefore, the forward simulation of the sample paths of the state process is relatively straightforward. In stark contrast to this, in a general stochastic optimal control setting, the state process is influenced by the DM's action and accordingly, its simulation is unattainable without specifying the DM's action. Ideally, one may expect to simulate the state process driven by the optimal action of the DM. However, the optimal action should be determined by solving the Bellman equation in a backward recursion manner, which is incongruous with the need for forward simulation in an LSMC algorithm. To circumvent this, Kharroubi et al.
(2014) propose to first draw the DM's action from a random distribution and then simulate the sample paths of the state process based on the initialized action. This method has been referred to as the control randomization method in the literature and applied in the LSMC algorithm to solve many specific stochastic control problems; see, e.g., Cong and Oosterlee (2016), Zhang et al. (2018), and Huang and Kwok (2016), among others. Despite the wide usage of the control randomization method, the accuracy of the numerical estimate is impaired over the region with sparse sample points (Zhang et al., 2018, Section 3.3), and the spread of the sample paths is sensitive to the specific way of initializing the action. In an extreme case, the LSMC algorithm might even miss the optimal solution under a dismal choice of the random distribution from which the DM's action is drawn; see Shen and Weng (2017), for instance. It is also notable that most literature binds together the control randomization and the forward simulation of the state process. However, this paper will show that the forward simulation is not imperative in an LSMC algorithm and that the merits of abjuring the forward simulation are extant in several aspects. The limitations of the control randomization and the consequential forward simulation will be elaborated in Section 2.2 of this paper.

Besides the simulation of the state process, the approximation of the conditional expectation term in a Bellman equation is also taxing for several reasons. Firstly, the prevalent regression methods only warrant the accuracy of the regression estimate over a compact support, see, e.g., Newey (1997), Stentoft (2004), and Zanger (2013), whereas the state variable generally takes values in an unbounded set. Some literature compromises by first truncating the domain of the continuation function and then using extrapolation techniques when knowledge of the function outside the truncated region is required. It is worth noting that this problem is not acute in the context of optimal stopping problems but is severe in a general stochastic control setting. This is because, in the latter case, one has to traverse all admissible actions, which calls for the values of the continuation function over a domain that is wider than the spreading range of the sample paths. Secondly, in order to avoid overfitting or underfitting, most nonparametric regression methods thirst for an appropriate choice of the tuning parameter, e.g., the number of basis functions in a linear sieve estimation method (see the sequel Section 3.3). This is often resolved by computationally expensive cross-validation methods, see, e.g., Li (1987). However, in view of the extraordinarily large number of simulated paths, such a tuning parameter selection procedure is intolerable in implementing the LSMC algorithm. The aforementioned challenges will be investigated in detail in the sequel section.

The contribution of this paper is summarized as follows. Firstly, we restrain the value set of the state process into a compact set, which evades the undesirable extrapolation of the value function estimate during the backward recursion of the LSMC algorithm. The value function accompanying the truncated state process is shown to be a legitimate approximation for the primal value function under a suitable choice of the truncation parameter.
Secondly, we generalize the idea of Shen and Weng (2017) to simulate the post-action value of the state process from an artificial probability distribution. This eliminates the need for the forward simulation and is consistent with the backward induction nature of the Bellman equation. The memory as well as time costs of the artificial simulation method are considerably less than those of the control-randomization-based forward simulation method. Thirdly, we introduce a shape-preserving sieve estimation method to approximate the conditional expectation term involved in the Bellman equation. By exploiting certain shape information of the continuation function, the sieve estimate is insensitive to the tuning parameter and accordingly reduces the computational cost of the tuning parameter selection. We refer to the proposed LSMC algorithm as the Backward Simulation and Backward Updating (BSBU) algorithm. Finally, we establish the convergence result of the BSBU algorithm, which sheds light on how the numerical error propagates over the backward recursion procedure.

This paper is organized as follows. Section 2 gives a tour through the LSMC algorithm and shows its challenges in solving general stochastic optimal control problems. Section 3 gives the main results of the paper: a construction of an auxiliary stochastic optimal control model, the BSBU algorithm, and the associated convergence analysis. Section 4 applies the BSBU algorithm to the pricing problem of an equity-linked insurance product and Section 5 conducts the corresponding numerical experiments. Finally, Section 6 concludes the paper.

We restrict our attention to a collection of consecutive time points labeled by $\mathcal{T} := \{0, 1, \ldots, T\}$ on which a decision maker (DM) may take action. The uncertainty faced by the DM is formulated by a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ equipped with a filtration $\mathbb{F} = \{\mathcal{F}_t\}_{t \in \mathcal{T}}$. The DM's action is described by a discrete-time stochastic process $a = \{a_t\}_{t \in \mathcal{T}_0}$ with $\mathcal{T}_0 = \mathcal{T} \setminus \{T\}$. Let $X = \{X_t\}_{t \in \mathcal{T}}$ be a certain state process valued in $\mathcal{X} \subseteq \mathbb{R}^d$ with $d \in \mathbb{N}$. Starting from an initial state $X_0 \in \mathbb{R}^d$, it evolves recursively according to the following transition equation:

$$X_{t+1} = S(X_t, a_t, \varepsilon_{t+1}), \quad \text{for } t = 0, 1, \ldots, T-1, \qquad (1)$$

where $\varepsilon := \{\varepsilon_{t+1}\}_{t \in \mathcal{T}_0}$ is a sequence of independent random variables valued in $\mathbb{R}^q$ with $q \in \mathbb{N}$. $\varepsilon_{t+1}$ reflects the uncertainty faced by the DM at time step $t$ and is referred to as the random innovation in what follows. For brevity of notation, in what follows, we compress the dependency of the state process on the action, and the readers should always bear in mind that $X_t$ implicitly depends on the DM's action up to time $t-1$.
We give the formal definition of the action $a$ as follows.

Definition 1 (Admissible Action). We call a discrete-time process $a = \{a_t\}_{t \in \mathcal{T}_0}$ an admissible action if it satisfies: (i) $a_t$ is $\mathcal{F}_t$-measurable for $t = 0, 1, \ldots, T-1$; (ii) $a_t \in A_t(X_t)$ for $t = 0, 1, \ldots, T-1$, where $A_t(\cdot)$ is some function valued as a subset of $\mathbb{R}^p$ with $p \in \mathbb{N}$.

The function $A_t(\cdot)$ in the preceding definition corresponds to a certain state constraint on the action taken at time $t$. Denote by $\mathcal{A}$ the set of the DM's admissible actions. Consider a discrete-time stochastic optimal control problem in the following form:

$$V_0(X_0) = \sup_{a \in \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{T-1} \varphi^t f_t(X_t, a_t) + \varphi^T f_T(X_T)\right], \qquad (2)$$

where $\varphi \in (0, 1)$ is a certain discounting factor,
and $f_t(\cdot, \cdot)$ and $f_T(\cdot)$ are the intermediate and terminal reward functions, respectively. In order to ensure the well-posedness of the stochastic control problem (2), we impose the following assumption, which is conventional in the literature; see Rogers (2007) and Belomestny et al. (2010) for instance.

Assumption 1.
$$\sup_{a \in \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{T-1} f_t(X_t, a_t)\right] < \infty, \quad \text{and} \quad \sup_{a \in \mathcal{A}} \mathbb{E}\big[f_T(X_T)\big] < \infty.$$

By the DPP, $V_0(\cdot)$ can be solved recursively:

$$V_T(x) = f_T(x),$$
$$V_t(x) = \sup_{a \in A_t(x)} \Big[ f_t(x, a) + \varphi\, \bar{C}_t(x, a) \Big], \quad \text{for } t = 0, 1, \ldots, T-1, \qquad (3)$$

where

$$\bar{C}_t(x, a) = \mathbb{E}\Big[ V_{t+1}(X_{t+1}) \,\Big|\, X_t = x,\, a_t = a \Big]. \qquad (4)$$

We proceed by rewriting the transition equation (1) into the following form:

$$S(X_t, a_t, \varepsilon_{t+1}) = H\big(K(X_t, a_t), \varepsilon_{t+1}\big), \qquad (5)$$

where $H(\cdot, \cdot): \mathbb{R}^{r+q} \longrightarrow \mathbb{R}^d$ and $K(\cdot, \cdot): \mathbb{R}^{d+p} \longrightarrow \mathbb{R}^r$ are some measurable functions with $r \in \mathbb{N}$. It is worth stressing that any transition function $S(\cdot, \cdot, \cdot)$ can be rewritten into the above form, since one may choose $K(\cdot, \cdot)$ as the identity function (i.e., $K(x, a) = (x, a)^{\intercal}$) and the above equation then holds trivially. Nevertheless, it is instructive to introduce the function $K(\cdot, \cdot)$ as it brings the benefit of dimension reduction. We will explain this more in the sequel. Combining Eqs. (4) and (5), we get

$$\bar{C}_t(X_t, a_t) = \mathbb{E}\Big[ V_{t+1}\big(H(X_{t+}, \varepsilon_{t+1})\big) \,\Big|\, X_{t+} = K(X_t, a_t) \Big].$$

Hereafter, we call $X_{t+}$ the post-action value of the state process $X_t$ at time $t$. It constitutes an essential component in the LSMC algorithm proposed in Section 3.2. Define the function

$$C_t(k) := \mathbb{E}\Big[ V_{t+1}(X_{t+1}) \,\Big|\, X_{t+} = k \Big] = \mathbb{E}\Big[ V_{t+1}\big(H(k, \varepsilon_{t+1})\big) \Big]. \qquad (6)$$

We observe the following relationship between $\bar{C}_t(\cdot, \cdot)$ and $C_t(\cdot)$:

$$\bar{C}_t(x, a) = C_t\big(K(x, a)\big). \qquad (7)$$

The crucial implication of the above relation is that it suffices to recover the functional form of $C_t(\cdot)$ in order to evaluate $\bar{C}_t(\cdot, \cdot)$, since $K(\cdot, \cdot)$ is known at first hand. The motivation for rewriting the transition equation into Eq. (5) is now clear: $K(\cdot, \cdot)$ maps a $(d+p)$-dimensional vector into an $r$-dimensional vector, which compresses the dimension if $r < d + p$, and it is more efficient to recover the function $C_t(\cdot)$ than $\bar{C}_t(\cdot, \cdot)$ due to such a dimension reduction. It is also worth noting that $C_t(\cdot)$ is solely determined by the probability distribution of $\varepsilon_{t+1}$ according to Eq. (6). This implies that it is not necessary to know the exact distribution of $X_{t+}$ in the evaluation of the function $C_t(\cdot)$. In view of the relation (7), the Bellman equation (3) can be equivalently written as

$$V_T(x) = f_T(x),$$
$$V_t(x) = \sup_{a \in A_t(x)} \Big[ f_t(x, a) + \varphi\, C_t\big(K(x, a)\big) \Big], \quad \text{for } t = 0, 1, \ldots, T-1. \qquad (8)$$

Figure 1: A diagram for backward information propagation in solving the Bellman equation: $\cdots \to V_{t+1}(\cdot) \xrightarrow{\text{Eq. (6)}} C_t(\cdot) \xrightarrow{\text{Eq. (7)}} \bar{C}_t(\cdot,\cdot) \xrightarrow{\text{Eq. (8)}} V_t(\cdot) \to \cdots$

The above equation system states that, given the value function at time step $t+1$, one may first evaluate the continuation function according to Eqs. (6) and (7) and then obtain the value function at time step $t$ via solving an optimization problem in the second line of Eq. (8). The information propagation behind the above recursive procedure is illustrated in Figure 1.
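To make the backward recursion concrete, the following minimal sketch uses a toy scalar model with a multiplicative innovation and a finite action grid; the transition, reward, discount factor, and all numerical values are illustrative assumptions rather than the paper's specification. It approximates the continuation function of Eq. (6) by averaging $V_{t+1}(H(k, \varepsilon_{t+1}))$ over simulated innovations and then solves the optimization in Eq. (8) by enumerating the feasible actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ingredients (illustrative assumptions, not the paper's model):
def K(x, a):          # pre-action state and action -> post-action value
    return x - a

def H(k, eps):        # post-action value and innovation -> next state
    return k * eps

def f(x, a):          # intermediate reward f_t
    return a

phi = 0.95            # one-period discounting factor
actions = np.linspace(0.0, 1.0, 11)

def bellman_step(V_next, n_inner=10_000):
    """Return V_t computed from V_{t+1} via Eqs. (6)-(8):
    C_t(k) = E[V_{t+1}(H(k, eps))] and V_t(x) = max_a f_t(x, a) + phi * C_t(K(x, a))."""
    eps = rng.lognormal(mean=0.0, sigma=0.2, size=n_inner)   # innovation sample

    def C(k):         # plain Monte Carlo estimate of Eq. (6)
        return np.mean(V_next(H(k, eps)))

    def V(x):         # optimization of Eq. (8) over the feasible action grid A_t(x) = [0, x]
        feasible = actions[actions <= x]
        return max(f(x, a) + phi * C(K(x, a)) for a in feasible)

    return V

V_T = lambda x: np.maximum(x, 0.0)   # terminal value f_T
V_Tm1 = bellman_step(V_T)
print(V_Tm1(1.0))
```

Iterating `bellman_step` from $t = T-1$ down to $t = 0$ reproduces the recursion of Figure 1; the LSMC algorithms reviewed next replace the inner Monte Carlo average by a regression on simulated data.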
We proceed by briefly reviewing the Least-Squares Monte Carlo (LSMC) algorithm. We will show its limitations in several aspects, which motivate the algorithm we will propose in the subsequent sections.
There has been voluminous literature on the LSMC for optimal stopping problems, while the literature on the LSMC for general stochastic optimal control problems is thin. Most literature addresses the LSMC for the stochastic control problems arising in some specific applications, see, e.g., Carmona and Ludkovski (2010), Barrera-Esteve et al. (2006), Huang and Kwok (2016), Shen and Weng (2017), Cong and Oosterlee (2016), and Zhang et al. (2018), among others. An LSMC algorithm for a class of stochastic control problems is developed in Belomestny et al. (2010).

Most variants of the LSMC algorithm can be decomposed into two pillars: (i) a forward simulation of the state process and (ii) a backward updating of control policies. We review these algorithms in a unified paradigm as follows.

1. Initiation: Set $V_T^E(x) = f_T(x)$. For $t = T-1, T-2, \ldots, 0$, do the two steps below.

2. Forward Simulation:

2.1 Control randomization: Generate a random sample of the DM's action up to time step $t$:
$$a_t^M := \Big\{ \big(a_0^{(m)}, \ldots, a_t^{(m)}\big),\ m = 1, 2, \ldots, M \Big\},$$
with each $a_t^{(m)}$ generated by a certain heuristic rule.

2.2 Simulation of state process: Simulate a random sample of the random innovations $\big\{\big(\varepsilon_1^{(m)}, \ldots, \varepsilon_{t+1}^{(m)}\big),\ m = 1, 2, \ldots, M\big\}$. The sample of the state process up to time step $t+1$ is given by
$$X_{t+1}^M := \Big\{ X_{1:t+1}^{(m)} := \big(X_1^{(m)}, \ldots, X_{t+1}^{(m)}\big),\ m = 1, 2, \ldots, M \Big\},$$
where $X_n^{(m)} = S\big(X_{n-1}^{(m)}, a_{n-1}^{(m)}, \varepsilon_n^{(m)}\big)$ for $n = 1, 2, \ldots, t+1$.

Figure 2: A diagram illustrating the map $K(\cdot,\cdot)$ relating the pre-action value $(x, a)$ to the range $D$ of post-action values.

3. Backward Updating:

3.1 Regression: Given a numerical estimate of the value function at time step $t+1$, denoted by $V_{t+1}^E(\cdot)$, construct the random sample $Y_{t+1}^M := \big\{ V_{t+1}^E\big(X_{t+1}^{(m)}\big),\ m = 1, 2, \ldots, M \big\}$. Further construct a random sample of post-action values of the state process as follows:
$$X_{t+}^M := \Big\{ X_{t+}^{(m)} := K\big(X_t^{(m)}, a_t^{(m)}\big),\ m = 1, 2, \ldots, M \Big\}. \qquad (9)$$
Take $Y_{t+1}^M$ and $X_{t+}^M$ as the samples of the response variable and the regressor, respectively, and employ a certain non-parametric regression to obtain a regression estimate $C_t^E(\cdot)$ for $C_t(\cdot)$.

3.2 Optimization: An estimate for the value function at time step $t$ is given by
$$V_t^E(x) = \sup_{a \in A_t(x)} \Big[ f_t(x, a) + \varphi\, C_t^E\big(K(x, a)\big) \Big]. \qquad (10)$$

We henceforth refer to the above algorithm as the Forward Simulation and Backward Updating (FSBU) algorithm.

Remark 1 (Randomness of $V_t^E(\cdot)$ and $C_t^E(\cdot)$). The superscript $E$ in $V_t^E(\cdot)$ and $C_t^E(\cdot)$ stresses that they are numerical estimates of the true value function and continuation function, respectively. Since a certain regression technique is employed to get such numerical estimates, they essentially depend on the random samples $Y_{t+1}^M$ and $X_{t+}^M$, and hence on all previously generated random samples going from step $T-1$ down to step $t$, i.e., $Y_{n+1}^M$ and $X_{n+}^M$ for $n = t, \ldots, T-1$. Such dependency is suppressed in notation for brevity, but the readers should keep in mind that both $V_t^E(\cdot)$ and $C_t^E(\cdot)$ are random functions.
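For later comparison with the backward simulation of Section 3, the sketch below illustrates Steps 2.1 and 2.2 above for the same toy scalar model used earlier; the uniform action distribution, the transition, and all constants are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative specification (not from the paper): scalar state with transition
# X_{n+1} = (X_n - a_n) * eps_{n+1}, actions drawn uniformly from [0, X_n].
def simulate_fsbu_sample(t, M, X0=1.0):
    """Steps 2.1-2.2 of the FSBU algorithm: randomize the actions and simulate the
    state paths forward; return the time-t post-action values K(X_t, a_t) and the
    time-(t+1) states that enter the regression data of Step 3.1."""
    X = np.full(M, X0)
    for n in range(t + 1):
        a = rng.uniform(0.0, X)                  # control randomization
        post = X - a                             # post-action value K(X_n, a_n)
        eps = rng.lognormal(0.0, 0.2, size=M)    # random innovations
        X = post * eps                           # forward transition to X_{n+1}
    return post, X                               # (X_{t+}, X_{t+1}) pairs

post, X_next = simulate_fsbu_sample(t=5, M=100_000)
# A regression of V^E_{t+1}(X_next) on `post` would then estimate C_t(.).
```

Note that the path has to be regenerated from time $0$ at every backward step, which is the quadratic time cost discussed in item (II) below.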
There are several challenges in implementing the previously introduced FSBU algorithm to solve a stochastic control problem. Below, we make some comments on the challenges from three aspects.

(I) Limitation of control randomization. As the DM's optimal action is not tractable at the very first but should be solved in the backward updating stage of the algorithm, Step 2.1 randomly generates a feasible action, which is referred to as the control randomization method; see Kharroubi et al. (2014). For some selected action sample $a_n^M$, the accuracy of the regression estimate $C_t^E(\cdot)$ can be warranted only over the support of the resulting sample points $X_{t+}^M$, say $D$, which might be smaller than those for other actions. However, in order to solve the optimization problem in Step 3.2 (see Eq. (10)), one requires the knowledge of $C_t^E(\cdot)$ over the range of the post-action value $X_{t+} = K(X_t, a_t)$ for all feasible actions $a_t$, because all possible values of the action should be traversed and taken as the input of the function $C_t^E(K(x, a))$ in evaluating $V_t^E(x)$; see Figure 2 for a graphical illustration. As a compromise, one may use certain extrapolation methods to infer the value of $C_t^E(\cdot)$ outside the region $D$, which incurs extra error whose legitimacy is hard to justify.

(II) Cost of forward simulation. It is notable that, at time step $t$ of the above FSBU algorithm, a new random sample of the state process that is independent of the sample at time step $t+1$ is simulated; see Figure 3 for a graphical illustration. This is required in order to apply the nonparametric regression theory to establish the convergence result; see Section 2.3 of (Zanger, 2013, p. 511) for instance. On the contrary, using a single sample causes in-sample bias, because the numerical estimate of the value function obtained at time step $t+1$, $V_{t+1}^E(\cdot)$, is correlated with $X_{t+}^M$; see, e.g., Section 3.1 of Choi et al. (2018) and the earlier Remark 1. The total time cost of the forward simulation procedure of the above LSMC algorithm is of $O(T^2)$.¹ Simulating the whole path of the state process can be time-consuming, especially when one uses some approximation schemes to simulate general stochastic differential equations.² Besides the issue of time cost, the memory cost of a single simulation is of $O(dT)$, with $T$ and $d$ being the number of time steps and the dimensionality of the state process, respectively, which is sizable for a large $T$.

(III) Choice of regression technique. Despite the voluminous literature on nonparametric regression, the choice of the nonparametric regression method in Step 3.1 should be meticulous. In the above FSBU algorithm, the sample size in the regression problem corresponds to the number of simulated paths and is generally recommended in the literature to be chosen larger than one hundred thousand, which makes most regression methods computationally prohibitive. To be specific, local methods such as local-polynomial regression are clearly not wise choices, as they require running a regression at each sample point. It is worth pointing out that even computing a single point in the sample $Y_t^M$ is fairly time-consuming, as it involves a local optimization problem (see Eq. (10)). Furthermore, the nuisance of high memory cost also burdens most nonparametric regression methods. For example, the kernel regression and Isotonic regression methods require storing the entire sample, and the memory cost is extraordinarily large accordingly. The above two thorny issues are escalated by the fact that almost all nonparametric regression techniques involve a computationally-intensive cross-validation procedure to determine the tuning parameter (e.g., the bandwidth in local regression methods and the number of basis functions in global regression methods) in order to avoid overfitting or underfitting.

¹ Suppose the time cost of simulating a path over each time interval $[t, t+1]$ is $\mathcal{C}$. Then the time cost of simulating a whole path up to time step $n$ is about $n \times \mathcal{C}$, and the forward simulation in the whole LSMC algorithm has an approximate time cost of $\mathcal{C}(1 + 2 + \cdots + T) = T(T+1)\mathcal{C}/2$, accordingly.
² It is the authors' experience that a single simulation of 10 paths of the Heston model over a 10-year period takes 365 seconds by using the R package "msde" on a MacBook Pro (2.8 GHz Intel Core i7).

Figure 3: A diagram for illustrating the forward simulation of the state process. The solid line corresponds to a simulated sample of the state process at time step $t$ of the LSMC algorithm. The dashed line corresponds to another independent simulated sample of the state process at time step $t-1$.

In view of the previous items (I)–(III), the thrust behind this paper is to explore possible answers to the following questions:

(Q1)
How to avoid theoretically shaky extrapolation? (Q2)
Is it possible to bypass the forward simulation in an LSMC algorithm? (Q3)
Is there a regression method that is insensitive to tuning parameter?
In terms of (Q1), in the sequel section we will construct an auxiliary stochastic control problem where the accompanying state process only takes values in a bounded set. This construction sidesteps extrapolating the regression function outside the region where the sample distributes. In response to (Q2), we will propose to directly simulate the post-action value of the state process. For (Q3), we will introduce a shape-preserving sieve estimation method to infer the continuation function. The resulting sieve estimate, on the one hand, is insensitive to the tuning parameter, and on the other hand, preserves certain shape properties of the continuation function.
As commented in the item “Limitation of control randomization” in the last section, it is necessaryto know the value of the continuation function over the whole range of post-action value of the state9rocess which is wider than the set where the regression sample suffuse. It is notable that the rangeof post-action value is unbounded if the state process takes value in an unbounded set, which isparticularly the case in many finance applications. Therefore, it is generally inevitable to infer thecontinuation function outside the support of the sample and the error incurred by extrapolating theregression estimate is hard to quantify.The aim of this subsection is to find a certain way to circumvent the unsound extrapolation in theimplementation of an LSMC algorithm. The pivotal idea is to first construct an auxiliary stochasticoptimal control problem where the accompanying state process only takes values in a bounded setand then show the discrepancy between the auxiliary problem and the primal one is marginal in acertain sense. To formalize the idea, we let X R be a bounded subset of the set X where the subscript R denotes a certain truncation parameter. Further denote ˚ X R (resp. ∂ X R ) as the interior (resp.boundary) of X R . Given the initial state X ∈ ˚ X R , define the following stopping time: τ R := inf (cid:110) t ∈ T (cid:12)(cid:12)(cid:12) X t / ∈ ˚ X R (cid:111) , (11)with the convention: τ R = ∞ if X t ∈ ˚ X R for all t ∈ T . Let cl ( X R ) be the closure of the set X R andassume it to be strictly convex.We recursively define an auxiliary state process X R := (cid:8) X Rt (cid:9) t ∈T as follows: X R = X ,X Rt = X t I { τ R >t } + Q ( X τ R ∧ t ) I { τ R ≤ t } , for t = 1 , , . . . , T, (12)where Q ( x ) = arg inf y ∈ col( X R ) (cid:107) y − x (cid:107) with (cid:107)·(cid:107) denoting the Euclidean (cid:96) -norm. Since cl ( X R ) is acompact and strictly convex set, Q ( x ) is unique and lies on the boundary set ∂ X R for x / ∈ ˚ X R .Below we give some interpretations regarding the auxiliary state process defined in the above Eq.(12). The primal state process X coincides with the auxiliary state process X R until the stoppingtime τ R . Once the primal state process passes through the interior of the truncated domain, theauxiliary state process freezes at a certain point in the boundary set ∂ X R thereafter. The evolutionmechanisms of the primal and auxiliary state processes are illustrated in Figure 4. The followingproposition gives the transition equation of X R . Proposition 1.
The auxiliary state process X R defined by Eq. (12) admits the following transitionequation across each time point: X R = X and X Rt +1 = X Rt I { X Rt ∈ ∂ X R } + ˜ H (cid:16) K (cid:0) X Rt , a t (cid:1) , ε t +1 (cid:17) I { X Rt ∈ ˚ X R } , for t = 0 , , . . . , T − , (13) where ˜ H ( k, e ) = Q ( H ( k, e )) , if H ( k, e ) / ∈ ˚ X R ,H ( k, e ) , otherwise , (14) and K ( · , · ) is the transition equation relating the pre-action and post-action values of the primal X R . . . τ R − τ R τ R + 1 . . . X Q ( X t ) Figure 4: Graphical illustration of the evolution mechanisms of X and X R . It is notablethat X might evolve continuously between two discrete time points t and t + 1. Thestopping time τ R corresponds to the first time point upon which X t stays outside of X R among all discrete time points { , , . . . , T } . The circles correspond to a path of X R . state process defined in Eq. (5) . The proof of the above proposition is relegated to Appendix B.1 for clarity of presentation. Thepreceding Eq. (13) essentially states that X R is a Markov chain by itself, and accordingly, it is thesole state process of the auxiliary stochastic control model defined in the sequel.Let A R be the set of all admissible actions for the auxiliary state process which is defined as: A R := (cid:8) a = { a t } t ∈T (cid:12)(cid:12) a t is F t − measurable , a t ∈ A t (cid:0) X Rt (cid:1) , for t ∈ T (cid:9) . Relative to the primal stochastic optimal control problem (2), we consider the following auxiliaryproblem:˜ V ( X ) = sup a ∈A R E (cid:34) T − (cid:88) t =0 ϕ t f t (cid:0) X Rt , a t (cid:1) + ϕ T f T (cid:0) X RT (cid:1)(cid:35) , (15)where X R = (cid:8) X Rt (cid:9) t ∈T is defined recursively by Eq. (13) for any given action a . Since the stateprocess X R freezes once it reaches the boundary set ∂ X R , the value function in Eq. (15) is given by˜ V t ( x ) = T − (cid:88) n = t ϕ n − t f n (cid:0) x ; a ∗ n ( x ) (cid:1) + ϕ T − t f T ( x ) , for x ∈ ∂ X R , t ∈ T , (16)with a ∗ n ( x ) ∈ arg max a ∈ A n ( x ) f n ( x ; a ). Over the interior of the truncated domain, the above valuefunction ˜ V ( · ) can be solved in a similar backward recursion way as V ( · ) does, that is, ˜ V T ( x ) = f T ( x ) , ˜ V t ( x ) = sup a ∈ A t ( x ) (cid:104) f t ( x, a ) + ϕ ˜ C t (cid:0) K ( x, a ) (cid:1)(cid:105) , for x ∈ ˚ X R , t = 0 , , . . . , T − , (17)where ˜ C t ( · ) is defined in line with Eq. (6) with H ( · , · ) replaced by ˜ H ( · , · ). It is worth noting that, inevaluating ˜ C t (cid:0) K ( x, a ) (cid:1) , the knowledge of ˜ V t +1 ( · ) over ∂ X R might be in need, and in such a situation,11q. (16) is invoked.We make some comparisons between Eq. (17) and the Bellman equation (8) associated with theprimal stochastic control model. Firstly, in both equations, the state constraint A t ( · ), transitionequation between pre-action and post-action values K ( · , · ), and reward functions are exactly thesame. Secondly, the value function ˜ V t ( · ) is solely defined on a bounded set cl ( X R ), whilst V t ( · ) isdefined on the set X which might be unbounded in many financial applications as the primal stateprocess X may correspond to a certain risky asset valued on the whole positive real line.We will characterize the discrepancy between the value functions ˜ V t ( · ) and V t ( · ) in a certainsense. To this end, it is necessary to impose some assumptions on the state process and the rewardfunctions. Assumption 2.
Let X R = X ∈ ˚ X R . There exists a measurable function E ( · , · ) : ˚ X R × R > −→ [0 , satisfying inf a ∈A P (cid:104) X t = X Rt , for all 1 ≤ t ≤ T (cid:105) ≥ − E ( X , R ) . (18) E ( X , R ) in Eq. (18) gives an upper bound for the probability that the auxiliary state processdisagrees with the primal at some time before maturity regardless of the DM’s action. Since theprimary difference between the auxiliary and primal value functions stems from the disparity betweenthe associated state processes, it is not surprising that the above inequality (18) plays an importantrole in characterizing the approximation error of ˜ V t ( · ) as we will see later in the proof of Theorem1. The expression of E ( X , R ) should be specified for each specific application. Assumption 3. (i)
There exists a measurable function B ( · ) : R d −→ R > and a generic constant ζ independent of t and R such that (cid:12)(cid:12) f T ( x ) (cid:12)(cid:12) ≤ B ( x ) , sup a ∈ A t ( x ) (cid:12)(cid:12) f t ( x, a ) (cid:12)(cid:12) ≤ B ( x ) , sup a ∈A E [ B ( X t +1 )] ≤ ζ, and sup a ∈A E [ f T ( X T )] ≤ ζ, for all t ∈ T . (19) (ii) There exists a measurable function ξ ( · ) : R > −→ R > satisfying sup x ∈X R (cid:32) sup a ∈ A t ( x ) (cid:12)(cid:12) f t ( x, a ) (cid:12)(cid:12) (cid:33) ≤ ξ ( R ) , for all t ∈ T , and sup x ∈X R (cid:12)(cid:12) f T ( x ) (cid:12)(cid:12) ≤ ξ ( R ) . In many applications B ( x ) has a polynomial form and in such a situation, the above assumptionstates that the reward functions are bounded by a certain polynomial from the above uniformly in t . In the context of pricing financial products, this assumption says that the policy payoffs have apolynomial growth rate.The following theorem quantifies the error stemming from using the auxiliary problem (15) as aproxy for the primal stochastic control model (2).12 t,R (cid:98) K t,R ˜ H ( · , ε t +1 )˜ H ( · , ε t +1 ) X R ∂ X R Figure 5: A graphical illustration for the relationships between K t,R , (cid:98) K t,R , and X R . Theorem 1 (Truncation Error Estimate) . Suppose Assumptions 1, 2, and 3 hold. Then (cid:12)(cid:12)(cid:12) V ( X ) − ˜ V ( X ) (cid:12)(cid:12)(cid:12) ≤ T (cid:113) (cid:0) ξ ( R ) + ζ (cid:1) E ( X , R ) . (20)The proof of the above theorem is relegated to Appendix B.2. The inequality (20) can be under-stood as follows. The term (cid:0) ξ ( R ) + ζ (cid:1) E ( X , R ) corresponds to an upper bound for the discrepancybetween the reward functions of the two stochastic control models (2) and (15) at each time step.Since such a difference primarily stems from replacing the primal state process X by X R , it is notsurprising that the term E ( X , R ) appears in the error estimate. Furthermore, the two terms ξ ( R )and ζ correspond to certain upper bounds of the magnitudes of the reward terms f t (cid:0) X Rt , a t (cid:1) and f t ( X t , a t ), respectively, and therefore a square root arises in the inequality (20). Finally, the dis-crepancy between the two value functions is amplified as the time horizon is prolonged, which isreflected by the existence of factor T in the above error estimate. In this subsection we propose an LSMC algorithm which simulates the state process without referringto the optimal action. Recall from Step 2.1 of the FSBU algorithm in Section 2.2 that the ultimategoal of simulating the state process is generating a random sample of the post-action value of thestate process which acts as a crucial input for the regression step. This naturally inspires us todirectly simulate the post-action value X t + from an artificial probability distribution. The term“artificial” stresses the fact that such a distribution might not coincide with the distribution of X t + under the optimal action process.Since the value function ˜ V t ( · ) is explicitly given by Eq. (16) over ∂ X R , the primary goal of ourproposed LSMC algorithm is to get a numerical estimate for the value function over the open set˚ X R . In view of this, we may circumscribe the support of the artificial probability distribution that13he post-action values are simulated from. First note that the range of post-action value of theauxiliary state process denoted by (cid:101) K t,R is given by (cid:101) K t,R := (cid:91) x ∈ ˚ X R (cid:91) a ∈ A t ( x ) (cid:8) K ( x, a ) (cid:9) , for t ∈ T . 
Consider the following subset: (cid:98) K t,R := (cid:110) k ∈ (cid:101) K t,R (cid:12)(cid:12)(cid:12) ˜ H ( k, e ) = ˜ H ( k, e ) ∈ ∂ X R , ∀ e and e ∈ ran ( ε t +1 ) (cid:111) , (21)for t ∈ T , where ran ( ε t +1 ) is the set of all values the random innovation ε t +1 might take and ˜ H ( · , · )is the transition equation relating the post-action value at time step t to the state variable at timestep t + 1 which is given in Eq. (14).The preceding equation states that X t +1 will stop at a certain point in the boundary set ∂ X R if X Rt + := K (cid:0) X Rt , a t (cid:1) lies in the set (cid:98) K t,R ; see Figure 5 for a graphical illustration. To make thematter more concrete, let us consider the example of pricing variable annuities (see, e.g. Huang andKwok (2016) and Shen and Weng (2017)) where X Rt + corresponds the post-withdrawal value of theinvestment account. If the investment account is depleted after the policyholder’s withdrawal (i.e., X Rt + = 0), it remains exhausted forever (i.e., X Rn = 0 for n = t + 1 , . . . , T ). In such an example, (cid:98) K t,R is a singleton { } . In view of the above discussion and Eq. (16), for any k ∈ (cid:98) K t,R , we observe˜ C t ( k ) = E (cid:104) ˜ V t +1 (cid:0) ˜ H ( k, ε t +1 ) (cid:1)(cid:105) = ˜ V t +1 (cid:0) ˜ H ( k, e ) (cid:1) (22)which has a value independent of e ∈ ran( ε t +1 ) and is given by Eq. (16). Therefore, at time step t , itsuffices to get a regression estimate for the continuation function ˜ C t ( · ) on the set K t,R := (cid:101) K t,R \ (cid:98) K t,R . Now we present the Backward Simulation and Backward Updating (BSBU) algorithm as follows.1.
Initiation: Set $\tilde{V}_T^E(x) = f_T(x)$ for $x \in \mathrm{cl}(\mathcal{X}_R)$. For $t = T-1, T-2, \ldots, 0$, do the two steps below.

2. Backward Simulation:

2.1 Simulation of post-action value: Generate a sample of the post-action values, denoted by $X_{t+}^M := \big\{ X_{t+}^{(m)},\ m = 1, 2, \ldots, M \big\}$, from a probability distribution $Q_{t,R}$ with support $\mathcal{K}_{t,R}$.

2.2 Simulation of the state process: Construct the sample of the state process at time step $t+1$ according to
$$X_{t+1}^M := \Big\{ X_{t+1}^{(m)} = \tilde{H}\big(X_{t+}^{(m)}, \varepsilon_{t+1}^{(m)}\big),\ m = 1, 2, \ldots, M \Big\}, \qquad (23)$$
with $\big\{\varepsilon_{t+1}^{(m)},\ m = 1, 2, \ldots, M\big\}$ being a sample of the random innovations.

3. Backward Updating:

3.1 Data preparation: Given a numerical estimate of the value function at time step $t+1$, denoted by $\tilde{V}_{t+1}^E(\cdot)$, construct the sample
$$Y_{t+1}^M := \Big\{ \tilde{V}_{t+1}^E\big(X_{t+1}^{(m)}\big),\ m = 1, 2, \ldots, M \Big\}. \qquad (24)$$

3.2 Regression: Take $Y_{t+1}^M$ and $X_{t+}^M$ as the samples of the response variable and the regressor, respectively, and employ a certain non-parametric regression to obtain a regression estimate $\tilde{C}_t^E(\cdot)$ over the set $\mathcal{K}_{t,R}$. For $k \in \widehat{\mathcal{K}}_{t,R}$, we set $\tilde{C}_t^E(k) = \tilde{C}_t(k)$ with $\tilde{C}_t(\cdot)$ given by Eq. (22).

3.3 Optimization: An estimate for the value function at time step $t$ is given by
$$\tilde{V}_t^E(x) = \sup_{a \in A_t(x)} \Big[ f_t(x, a) + \varphi\, \tilde{C}_t^E\big(K(x, a)\big) \Big], \quad \text{for } x \in \mathring{\mathcal{X}}_R. \qquad (25)$$
For $x \in \partial\mathcal{X}_R$, we set $\tilde{V}_t^E(x) = \tilde{V}_t(x)$ with $\tilde{V}_t(\cdot)$ given by Eq. (16).

Figure 6: A diagram for backward information propagation in the BSBU algorithm: $\cdots \to \tilde{V}_{t+1}^E(\cdot) \xrightarrow{\text{Regression}} \tilde{C}_t^E(\cdot) \xrightarrow{\text{Eq. (25)}} \tilde{V}_t^E(\cdot) \to \cdots$

Figure 7: A diagram for the information propagation in evaluating $\tilde{V}_t^E\big(X_t^{(m)}\big)$: if $X_t^{(m)} \in \mathring{\mathcal{X}}_R$, it is computed from $\tilde{C}_t^E\big(K(X_t^{(m)}, a)\big)$, which comes from the regression estimate obtained at Step 3.2 when $K\big(X_t^{(m)}, a\big) \in \mathcal{K}_{t,R}$ and from Eq. (22) when $K\big(X_t^{(m)}, a\big) \in \widehat{\mathcal{K}}_{t,R}$; if $X_t^{(m)} \in \partial\mathcal{X}_R$, Eq. (16) is invoked.

In Step 3.2, we prescribe $\tilde{C}_t(k)$ for the value of $\tilde{C}_t^E(k)$ when $k \in \widehat{\mathcal{K}}_{t,R}$, because $K\big(X_t^{(m)}, a\big)$ might fall in the set $\widehat{\mathcal{K}}_{t,R}$. Similarly, in Step 3.3, Eq. (16) is invoked to evaluate $\tilde{V}_t^E(x)$ for $x \in \partial\mathcal{X}_R$, as $X_t^{(m)}$ generated by Eq. (23) may lie on $\partial\mathcal{X}_R$, the boundary set of the truncated domain. The backward information propagation in the above BSBU algorithm is illustrated in Figure 6.

Comparing the above BSBU algorithm and the FSBU counterpart in Section 2.2, we have the following observations.

1. Firstly, the primary difference between the two algorithms lies in how the post-action values of the state process, i.e., $X_{t+}^M$, are generated. The control randomization method is a forward simulation scheme, while the BSBU algorithm directly generates post-action values from a certain prior distribution. Indeed, the FSBU algorithm can be viewed as a special BSBU algorithm if $Q_{t,R}$ is chosen as the probability distribution of the post-action value from a control randomization procedure. In general, neither method yields the distribution of $X_{t+}$ driven by the optimal action, and thus there is no loss in directly generating $X_{t+}^M$ from a prior distribution $Q_{t,R}$.

2. Secondly, the BSBU method has the advantage of reducing memory and time costs. On the one hand, one does not need to store the sample of whole trajectories at each time step in the BSBU algorithm. On the other hand, the total time cost of simulating the state process is of $O(T)$ in the BSBU algorithm, while it is of $O(T^2)$ in the FSBU counterpart; see the item "(II) Cost of forward simulation" in Section 2.2.

3. Thirdly, the BSBU algorithm circumvents extrapolating the numerical estimates of the continuation function and the value function. It is notable that $\tilde{C}_t^E(\cdot)$ and $\tilde{V}_t^E(\cdot)$ are obtained over the sets $\widetilde{\mathcal{K}}_{t,R}$ and $\mathrm{cl}(\mathcal{X}_R)$, respectively, at time step $t$; see Steps 3.2–3.3 of the above BSBU algorithm. At the time step $t-1$, the BSBU algorithm does not require the knowledge of the value function (resp. continuation function) outside $\mathrm{cl}(\mathcal{X}_R)$ (resp. $\widetilde{\mathcal{K}}_{t,R}$) in computing the regression data $Y_t^M$; see Figure 7 for a graphical illustration. This nice property is inherited from the construction of the auxiliary state process $X^R$, whose values are confined to a bounded set. In the FSBU algorithm, however, the state process is not restrained and $K\big(X_t^{(m)}, a\big)$ might lie outside the regression domain for $\tilde{C}_t^E(\cdot)$. In such a situation, extrapolating the numerical solution causes extra error which is hard to control. A minimal one-step sketch of the above procedure is given below.
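The sketch below makes the contrast with the FSBU sketch of Section 2.2 explicit; the toy scalar transition, the uniform prior $Q_{t,R}$, the truncation level $R$, and the plain polynomial regression (standing in for the shape-preserving sieve estimator of Section 3.3) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# One backward step of the BSBU algorithm for a toy scalar model on the truncated
# domain [0, R]; the transition, prior Q_{t,R} and all constants are illustrative.
R, phi, M = 5.0, 0.95, 100_000

def H_tilde(k, eps):
    """Auxiliary transition in the spirit of Eq. (14): apply H and project onto cl(X_R)."""
    return np.clip(k * eps, 0.0, R)

def bsbu_step(V_next_E):
    # Step 2.1: draw post-action values directly from an artificial prior Q_{t,R}
    post = rng.uniform(0.0, R, size=M)
    # Step 2.2: one-step simulation of the auxiliary state at time t+1
    eps = rng.lognormal(0.0, 0.2, size=M)
    X_next = H_tilde(post, eps)
    # Step 3.1: data preparation, Eq. (24)
    Y = V_next_E(X_next)
    # Step 3.2: regression of Y on `post` (a simple polynomial fit as a placeholder)
    coef = np.polynomial.polynomial.polyfit(post, Y, deg=4)
    C_E = lambda k: np.polynomial.polynomial.polyval(k, coef)
    # Step 3.3: optimization, Eq. (25), over a finite feasible action grid
    def V_E(x):
        actions = np.linspace(0.0, x, 11)
        return max(a + phi * C_E(min(x - a, R)) for a in actions)
    return V_E

V_T_E = lambda x: np.maximum(x, 0.0)
V_E = bsbu_step(V_T_E)
print(V_E(1.0))
```

The post-action values are drawn directly, so no trajectory needs to be stored and only one transition per sample point is simulated at each backward step.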
In this subsection, we discuss the details of the regression method used to estimate the continuation function in our BSBU algorithm.

In Section 2.2 we have discussed the potential issues associated with a regression method when used for the estimation of the continuation function of a stochastic control problem; see the item "(III) Choice of regression technique". Based on that discussion, we propose the following criteria for the choice of a regression method for estimating the continuation function.

(C1) Small memory cost:
The regression problem embedded in an LSMC algorithm usually exhibits an extraordinarily large sample size. Thus, an appropriate regression method should have a small memory requirement. This criterion excludes the kernel method (Nadaraya (1964) and Watson (1964)), the local-polynomial regression method (Fan and Gijbels (1996)), and the Isotonic regression method (Robertson et al. (1988)), which require storing all sample points in memory in order to compute the regression function at any point in the domain.

(C2) Computationally cheap: In almost all nonparametric regression methods, a certain parameter (referred to as the tuning parameter in the statistical literature) is used to avoid undesirable overfitting or underfitting of the regression model. Determining the optimal value of such a tuning parameter is usually computationally intensive. Therefore, an ideal regression method should be insensitive to the tuning parameter.

In view of the above two criteria, we have a limited number of suitable choices despite the voluminous nonparametric regression methods in the literature. In the following, we discuss a class of regression methods referred to as the sieve estimation method, which includes the least-squares method of Longstaff and Schwartz (2001) as a special case.
We give a brief introduction to the sieve estimation method; refer to Chen (2007) for a comprehensivereview. Suppose we have a sample of independent and identically distributed (i.i.d.) random pairs (cid:8)(cid:0) U ( m ) , Z ( m ) (cid:1)(cid:9) Mm =1 where Z ( m ) is a R r -valued random vector with compact support Z and U ( m ) isa univariate random variable. Define the function g ( · ) : Z −→ R as g ( z ) = E (cid:104) U ( m ) (cid:12)(cid:12)(cid:12) Z ( m ) = z (cid:105) (26)which is independent of m . In the context of our BSBU algorithm, U ( m ) and Z ( m ) correspond to˜ V t +1 (cid:16) X ( m ) t +1 (cid:17) and X ( m ) t + , respectively, and the parallel function g ( · ) is the continuation function ˜ C t ( · ).The sieve estimation method strives to estimate the functional form of g ( · ) by solving the followingoptimization problem:ˆ g ( · ) := arg min h ( · ) ∈H J M M (cid:88) m =1 (cid:104) U ( m ) − h (cid:16) Z ( m ) (cid:17)(cid:105) , (27)where H J is a finite-dimensional functional space depending on a certain parameter J and is calledas sieve space . Intuitively, the ampler the sieve space is, the smaller the “gap” between the H J and the function g ( · ) would be. The price to pay is that larger estimation error is incurred for aricher sieve space due to limited sample size M . Therefore, one has to balance such a trade-off bycontrolling the complexity of the sieve space and this is achieved by tuning the parameter J . Tomake the matter more concrete, we consider two examples of the sieve space in the sequel. Example 1 (Linear Sieve Space) . Let { φ j ( · ) : Z −→ R } j ∈ N be a sequence of basis functions indexedby j ∈ N . Consider the sieve space defined by H J = h ( · ) : h ( z ) = J (cid:88) j =0 β j φ j ( z ) , β j ∈ R . (28)The above set H J is essentially a linear span of finitely many basis functions and is referred toas linear sieve space in the statistical literature. 17n the present context of stochastic control, the regression function g ( · ) corresponds to the con-tinuation function and it exhibits some shape properties such as monotonicity in many applications;see Del Moral et al. (2012) for pricing American option and Huang and Kwok (2016) for valuingequity-linked insurance product, among others. In view of this, it is natural to expect the element inthe sieve space satisfies such shape constraints, which in turn preserves the financial interpretationsof the numerical result. This can be achieved by considering a special linear sieve space in thefollowing example. Example 2 (Shape-Preserving Sieve Space) . Let { φ j ( · ) : Z −→ R } j ∈ N be a sequence of basis func-tions indexed by j ∈ N . Denote β J = ( β , . . . , β J ) (cid:124) with β j ∈ R , j = 0 , , . . . , J . Consider the sievespace defined by H J = h ( · ) : h ( z ) = J (cid:88) j =0 β j φ j ( z ) , A J β J ≥ b ( J ) , (29) where b ( · ) : N −→ N is some integer-valued function, A J is a b ( J ) -by- ( J + 1) matrix, and b ( J ) is a b ( J ) -by-1 null vector. Wang and Ghosh (2012a,b) show that each element in the sieve space in Eq. (29) is a convex,concave, or monotone function (with respect to each coordinate) with a special the choice of matrix A J given that φ j ( · ) , j = 0 , , . . . , J, are Bernstein polynomials. Some forms of A J are relegated toAppendix A.1.For a linear sieve space H J defined either in Eq. (28) or Eq. 
(29), the solution of the preceding optimization problem (27) is given by the following form:
$$\hat{g}(z) = \hat{\beta}^{\intercal} \phi(z), \quad \text{for } z \in \mathcal{Z}, \qquad (30)$$
where $\phi(z) := (\phi_0(z), \ldots, \phi_J(z))^{\intercal}$ and $\hat{\beta}$ is the optimizer of the following optimization problem:
$$\min_{\beta \in \mathbb{R}^{J+1}} \frac{1}{M} \sum_{m=1}^{M} \Big[ U^{(m)} - \beta^{\intercal} \phi\big(Z^{(m)}\big) \Big]^2, \quad \text{subject to } \beta^{\intercal}\phi(\cdot) \in \mathcal{H}_J. \qquad (31)$$
The dependency of $\hat{\beta}$ and $\phi(\cdot)$ on $J$ is suppressed for brevity. In general, one has to solve a constrained quadratic programming problem to obtain $\hat{\beta}$.

One clear merit of the above linear sieve estimation is that one only needs to store the vector $\hat{\beta}$ for future evaluation of the regression function $\hat{g}(\cdot)$ at any point in the domain, because the basis functions $\phi(\cdot)$ are explicitly known at first hand. This makes the linear sieve estimation method tailored to our present problem in terms of criterion (C1). For criterion (C2), it is well documented in the statistical literature that when the true regression function $g(\cdot)$ satisfies certain shape constraints, the shape-preserving estimate $\hat{g}(\cdot)$ obtained by (31) with $\mathcal{H}_J$ given by Eq. (29) is insensitive to the tuning parameter $J$; see, e.g., Meyer (2008) and Wang and Ghosh (2012a,b). However, this is legitimate only when the true conditional mean function $g(\cdot)$ exhibits such a convexity, concavity, or monotonicity property. For the general case when there is no prior shape information on $g(\cdot)$, one has to use the sieve space (28), and the regression estimate is sensitive to the choice of $J$. Under such a situation, $J$ should be determined in a data-driven manner. In Appendix A.2, we present some common methods of selecting $J$ discussed in the statistical literature.

Finally, the convergence of the sieve estimate $\hat{g}(\cdot)$ to the conditional mean function $g(\cdot)$ is ensured under some technical conditions. These conditions are summarized in Assumption 4, which is relegated to Appendix A.3 for clarity of presentation.
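As an illustration of a shape-preserving estimator built from Bernstein polynomials, the sketch below fits a monotone (non-decreasing) regression function by constrained least squares. The cumulative-increment parametrization of the coefficients is one concrete way to realize a linear constraint of the form appearing in Eq. (29); the synthetic data and the choice $J = 8$ are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import lsq_linear
from scipy.special import comb

def bernstein_design(z, J):
    """Bernstein basis b_{j,J}(z) = C(J, j) z^j (1 - z)^(J - j) on [0, 1]."""
    j = np.arange(J + 1)
    return comb(J, j) * z[:, None] ** j * (1.0 - z[:, None]) ** (J - j)

def monotone_sieve_fit(z, u, J=8):
    """Constrained least squares as in (31): write beta_j = gamma_0 + sum_{i<=j} gamma_i
    with gamma_i >= 0 for i >= 1, which forces beta_0 <= ... <= beta_J and hence
    a non-decreasing Bernstein fit."""
    B = bernstein_design(z, J)
    L = np.tril(np.ones((J + 1, J + 1)))          # beta = L @ gamma
    lb = np.r_[-np.inf, np.zeros(J)]              # gamma_0 free, increments >= 0
    res = lsq_linear(B @ L, u, bounds=(lb, np.full(J + 1, np.inf)))
    beta = L @ res.x
    return lambda znew: bernstein_design(np.atleast_1d(znew), J) @ beta

# Usage on synthetic data with a noisy monotone conditional mean.
rng = np.random.default_rng(3)
z = rng.uniform(size=5_000)
u = np.log1p(5 * z) + rng.normal(scale=0.3, size=z.size)
g_hat = monotone_sieve_fit(z, u)
print(g_hat(np.array([0.1, 0.5, 0.9])))
```

Because the monotonicity constraint already prevents the wiggly overfitting of an unconstrained fit, re-running the example with, say, $J = 15$ changes the fitted curve only marginally, which is the insensitivity to the tuning parameter that the BSBU algorithm exploits.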
Now we are ready to conduct the convergence analysis of the BSBU algorithm proposed in Section 3.2. For the regression method employed in the algorithm, we restrict our attention to the linear sieve estimator given by Eqs. (30) and (31) in the last subsection. A complete convergence analysis of the BSBU algorithm should take account of three types of errors:

(E1) Truncation Error: The truncation error is caused by taking $\tilde{V}_0(X_0)$ as a proxy for $V_0(X_0)$.

(E2) Sieve Estimation Error: At each step of the BSBU algorithm, the sieve estimation method is employed to get an estimate for the continuation function. The associated sieve estimation error stems from two sources: (a) the bias caused by using a finite-dimensional sieve space $\mathcal{H}_J$ to approximate the continuation function; and (b) the statistical error in estimating the coefficients of the basis functions under a limited sample size $M$.

(E3) Accumulation Error: The primal goal of the regression step in the BSBU algorithm is to estimate the continuation function of the auxiliary stochastic control problem, i.e., $\tilde{C}_t(\cdot) = \mathbb{E}\big[\tilde{V}_{t+1}(X_{t+1}) \,\big|\, X_{t+} = \cdot\,\big]$. Thus, in principle, one should generate a random sample $\big\{\big(\tilde{V}_{t+1}\big(X_{t+1}^{(m)}\big),\, X_{t+}^{(m)}\big)\big\}_{m=1}^{M}$, based on which the sieve estimation method can be employed to get a regression estimate. However, $\tilde{V}_{t+1}(\cdot)$ is not known exactly at each time step of the algorithm and is replaced by its numerical estimate $\tilde{V}_{t+1}^E(\cdot)$; see Step 3.1 of the BSBU algorithm in Section 3.2. Therefore, the algorithm error accumulated from time step $T-1$ down to $t+1$ triggers a new type of error in addition to (E1) and (E2).

(E1) has been investigated in Theorem 1. The discrepancy between $\tilde{V}_0(X_0)$ and $\tilde{V}_0^E(X_0)$ is contributed by (E2) and (E3). Distinguishing these two types of error plays a crucial role in our convergence analysis, and this is inspired by Belomestny et al. (2010). Our main convergence result is summarized in the following theorem.

Theorem 2 (BSBU Algorithm Error). Suppose that (i)
Assumptions 1–3 and Assumption 5 in Appendix B.3 hold; (ii)
Assumption 4 in Appendix A holds for U ( m ) = V t +1 (cid:16) X ( m ) t +1 (cid:17) and Z ( m ) = X ( m ) t + uniformly in t ∈ T , where X ( m ) t + and X ( m ) t +1 are given in Steps 2.1 and 2.2 of the BSBU algorithm.Then, there exists a constant ψ such that (cid:12)(cid:12)(cid:12) ˜ V ( X ) − ˜ V E0 ( X ) (cid:12)(cid:12)(cid:12) = O P (cid:18)(cid:113) ψ T − ( J/M + ρ J ) (cid:19) , as M −→ ∞ , (32) with “Big O p” notation O P ( · ) defined in Definition 2 of Appendix B.3. The above theorem basically states that the numerical solution ˜ V E0 ( X ) converges to ˜ V ( X ) inprobability as both the number of basis functions J and number of simulated paths M approachinfinity at the rate specified by Condition (v) in Assumption 4. Since Theorem 1 shows that thediscrepancy between ˜ V ( X ) and V ( X ) shrinks as R increases, the numerical estimate ˜ V E0 ( X ) isa legitimate approximation for V ( X ) when R , J , and M are considerable. The R.H.S. of Eq. (32)reveals that the overall BSBU algorithm error arises from the two resources discussed in the previousitem (E2) , which are indicated by the terms ρ J and J/M , respectively. Furthermore, Eq. (32) alsoshows that such a regression error is escalated by a factor ψ at each time step, which reflects theerror accumulation from time step T − (E3) . In this section, we apply the BSBU algorithm to the pricing of equity-linked insurance products.This pricing problem is an appropriate example to show the limitations of the FSBU algorithmcommented in Section 2.2. For the convenience of illustration, the contract we study here is asimplified version of variable annuities (VAs); for the discussions on more generic policies, we referto Azimzadeh and Forsyth (2015), Huang and Kwok (2016), Huang et al. (2017), and Shen andWeng (2017), among others.
We give a brief introduction to the VA. VAs are equity-linked insurance products issued by insurancecompanies. At the inception of the contract, the policyholder (PH) pays a lump sum W to theinsurer which is invested into a certain risky asset. The PH is entitled to withdraw any portion ofthe investment before the maturity. She also enjoys certain guaranteed payments regardless of theperformance of the investment account. Therefore, the insurer provides downside protection for a20otential market decline. As a compensation, the insurer deducts insurance fees from the investmentaccount and trade available securities to hedge his risk exposure. Thus, no-arbitrage pricing has beenthe dominating paradigm for pricing VAs in the literature. The primary challenge of this pricingproblem stems from the uncertainty of the PH’s withdrawal behavior. This is conventionally resolvedby studying the optimal withdrawal strategy of the PH, which naturally leads to a stochastic controlproblem; see Dai et al. (2008), Chen et al. (2008), Huang and Kwok (2016), and many others. In the following, we exemplify the model setup of Section 2 in the present pricing problem. Thelattice T corresponds to the collection of all available withdrawal dates. The first decision variable τ t represents the PH’s decision to initialize the withdrawal or not by taking values 1 and 0, respectively.As we will see later, the payoff functions depend on the timing of the first withdrawal of the PH.Therefore, a state variable { I t } t ∈T is introduced to record the first-withdrawal-time, and its evolutionmechanism is prescribed as follows: I = 0, and I t +1 = S It ( I t , τ t ) := t, if I t = 0 and τ t = 1 ,I t , otherwise , (33)for t ∈ T . The feasible set of τ t is a singleton { } if the withdrawal has been initialized, i.e., I t > { , } .Denote ( a ) + := max { a, } and a ∨ b := max { a, b } . The second state variable corresponds to theinvestment account and it evolves according to W = W ,W t +1 = (cid:0) W t − γ t (cid:1) + (cid:124) (cid:123)(cid:122) (cid:125) post-withdrawal value · ε t +1 , γ t ∈ (cid:2) , W t ∨ G ( I t ) P (cid:3) , t ∈ T , (34)where γ t is the withdrawal amount of the PH at time t , ε t +1 is the absolute return of the underlyingasset over [ t, t + 1], and G ( I t ) is a certain percentage depending on the first-withdrawal-time I t .The above equation implies that the PH can withdraw up to the amount of G ( I t ) P even if theinvestment account is depleted, i.e., W t = 0. The jump mechanism of the investment account acrosseach withdrawal date is illustrated in Figure 8.Now, the state process and the DM’s action are X = { X t = ( W t , I t ) (cid:124) } t ∈T and a = { a t = ( γ t , τ t ) (cid:124) } t ∈T ,respectively, with the superscript “ (cid:124) ” denoting vector transpose. In accordance with Eqs. (33) and(34), the accompanying transition equation is X t +1 = H (cid:0) K ( X t , a t ) , ε t +1 (cid:1) , where K ( X t , a t ) = (cid:16)(cid:0) W t − γ t (cid:1) + , S It ( I t , τ t ) (cid:17) (cid:124) , H (cid:0) k, ε t +1 (cid:1) = (cid:0) k ε t +1 , k (cid:1) (cid:124) (35)with k = ( k , k ) (cid:124) ∈ [0 , ∞ ) × T . The dependency of K ( · , · ) on t is suppressed for notational brevity.Next, we discuss the feasible set of the PH’s action. In principle, the withdrawal amount γ t takes21 t W t +1 γ t t − t t + 1 (cid:0) W t − γ t (cid:1) + Figure 8: Jump mechanism of the investment account across a withdrawal date.values in a continum (cid:2) , W t ∨ G ( I t ) P (cid:3) (see Eq. (34)). 
However, it can be shown that the optimalwithdrawal amount is limited to three choices: 1) γ t = 0, 2) γ t = G ( I t ) P , and 3) γ t = W t undercertain contract specifications; see Azimzadeh and Forsyth (2015), Huang and Kwok (2016), Huanget al. (2017), and Shen and Weng (2017). Via a similar argument adopted by the above references,one may show that this conclusion still holds for the contract considered here. Therefore, we restrictthe feasible set of action a t into the following discrete set: A t ( X t ) = (cid:8) (0 , (cid:124) , ( G ( I t ) P , (cid:124) , ( W t , (cid:124) (cid:9) , if I = 0 , (withdrawal has not been initialized) (cid:8) (0 , (cid:124) , ( G ( I t ) P , (cid:124) , ( W t , (cid:124) (cid:9) , if I > , (withdrawal has been initialized) (36)for t = 1 , , . . . , T −
We proceed by specifying the reward functions, which correspond to the policy payoffs in the present context. Before maturity, the cash inflow of the PH is her withdrawal amount subject to a penalty:
$$ f_t(X_t, a_t) = \gamma_t - \kappa\big(\gamma_t - G(I_t)P\big)_+, \qquad \gamma_t \in \big[0,\, W_t \vee G(I_t)P\big], \quad t = 1, 2, \ldots, T-1, $$
with $\kappa \in [0, 1]$ being the penalty rate. In other words, the withdrawal amount in excess of the guaranteed amount is subject to a proportional penalty. Conventionally, $f_0(\cdot, \cdot) \equiv 0$. At maturity, the policy payoff is the remaining value of the investment account, i.e., $f_T(X_T) = W_T$.
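Continuing the sketch above, the payoffs can be coded directly from the two displays; the signatures are again illustrative assumptions.

```python
def intermediate_payoff(gamma_t, I_t, P, G, kappa):
    """f_t(X_t, a_t) for t = 1, ..., T-1: withdrawal net of the proportional penalty
    kappa charged on the part exceeding the guaranteed payment G(I_t) * P."""
    return gamma_t - kappa * max(gamma_t - G(I_t) * P, 0.0)

def terminal_payoff(W_T):
    """f_T(X_T) = W_T: the remaining account value paid at maturity."""
    return W_T
```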
Finally, we give the interpretation of the value function in the present context. $V_t(x) = V_t\big((W, I)^\top\big)$ with $I > 0$ (resp., $I = 0$) corresponds to the no-arbitrage price of the contract at withdrawal date $t$, given that the investment account has value $W$ and the first withdrawal was triggered at withdrawal date $I$ (resp., no withdrawal has been taken).

4.3 A BSBU Algorithm for the Pricing Problem

The state process $X$ generally takes values in the unbounded set $\mathcal X = [0, \infty) \times \mathcal T$. We consider a truncated domain $\mathcal X_R = [0, R) \times \mathcal T$ with $R > 0$.
Consequently, we may define the auxiliary state process $X^R$ as in Eq. (13). The range of the post-action value is given by $\widetilde{\mathcal K}_{t,R} = \widehat{\mathcal K}_{t,R} \cup \mathcal K_{t,R}$, where $\widehat{\mathcal K}_{t,R} = \{0, R\} \times \{0, 1, \ldots, t\}$ and $\mathcal K_{t,R} = (0, R) \times \{0, 1, \ldots, t\}$, respectively. This is in line with Eq. (21).

Now we are almost ready to employ the BSBU algorithm developed in Section 3.2 to solve the present pricing problem. It is worth noting that a discrete state variable $I_t$ appears in the present context and the continuation function, in general, is not continuous with respect to the post-action value accompanying this state variable, i.e., $k_2$; see Eq. (35). Consequently, Condition (iii) of Assumption 4 might not hold here; see Appendix A.3. However, for each given value of $k_2$, the continuation function is still continuous with respect to $k_1$, the post-action value associated with the investment account. Therefore, one may repeat Step 3.2 of the BSBU algorithm for every distinct value of $k_2$, and it is easy to see that the convergence of the resulting BSBU algorithm is not affected by this modification.

Finally, it remains to specify how to simulate the post-action value of the state process in order to pave the way for implementing the BSBU algorithm. In the next section, we address this issue in detail and, in particular, compare the control randomization method with our artificial simulation method.

5 Numerical Experiments

This section is devoted to numerical experiments that demonstrate the merits of the BSBU algorithm in the context of pricing the variable annuity product addressed in the last section.
We first present the parameter setting for our numerical experiments. We consider $T = 12$ time steps; the time interval between two consecutive withdrawal dates is $\delta$, and the initial lump-sum investment is $W_0 = 1$. The guaranteed payment percentage $G(I_t)$ is prescribed as a step function of the first-withdrawal time: it takes one value for $0 \le I_t \le 3$, a larger value for $4 \le I_t \le 7$, and $7\%$ for $8 \le I_t \le 11$; see Table 1. In other words, the PH enjoys a larger guaranteed payment if she postpones the initiation of the withdrawal. As a result, the value function and the continuation function are not continuous with respect to the state variable $I_t$.

Let $r$ and $q$ be the annualized risk-free rate and insurance fee rate, respectively. We assume that the absolute return $\varepsilon_{t+1}$ of the underlying fund follows a log-normal distribution with $\mathbb E[\log \varepsilon_{t+1}] = (r - q - \sigma^2/2)\delta$ and $\mathrm{Var}[\log \varepsilon_{t+1}] = \sigma^2 \delta$ under a risk-neutral pricing measure. This implicitly assumes that the underlying fund evolves according to a geometric Brownian motion with annualized volatility rate $\sigma$. Finally, the one-period discount factor is given by $\varphi = e^{-r\delta}$. All of the above market and contract parameters are summarized in Table 1.

Table 1: Parameters used for the numerical experiments: volatility rate $\sigma$, risk-free rate $r$, insurance fee rate $q$, number of withdrawal dates $T$, time step $\delta$, discount factor $\varphi = e^{-r\delta}$, initial investment $W_0$, penalty rate $\kappa$, and guarantee schedule $G(I)$.

Finally, we discuss the choice of the truncation parameter $R$. In the present context, it is easy to see that the function $\mathcal E(X_0, R)$ in Assumption 2 is bounded from above by the tail probability of the continuous running maximum of a geometric Brownian motion. To be specific, we have
$$
\mathcal E(X_0, R) \;\le\; \mathbb P\Big( W_0 \max_{t \in [0, \delta T]} e^{(r - q - 0.5\sigma^2)t + \sigma B_t} \ge R \Big)
\;=\; \mathbb P\Big( \max_{t \in [0, \delta T]} \big( (r - q - 0.5\sigma^2)t/\sigma + B_t \big) \ge \tfrac{1}{\sigma}\log(R/W_0) \Big)
$$
$$
\;=\; 1 - N\Bigg( \frac{\tfrac{1}{\sigma}\log(R/W_0) - \alpha \delta T}{\sqrt{\delta T}} \Bigg)
\;+\; \Big(\frac{R}{W_0}\Big)^{2\alpha/\sigma} N\Bigg( \frac{-\tfrac{1}{\sigma}\log(R/W_0) - \alpha \delta T}{\sqrt{\delta T}} \Bigg),
$$
with $\alpha := (r - q - 0.5\sigma^2)/\sigma$, where $B_t$ is a standard Brownian motion, $N(\cdot)$ is the cumulative distribution function of the standard normal distribution, and the last equality follows from the Reflection Principle (see, e.g., Shreve (2004, pp. 297)). Let $R = 4$. Then the right-hand side of the above inequality is negligibly small under the parameter setting in Table 1. It is also easy to see that $\xi(R)$ is quadratic in $R$ in the present example, and therefore the truncation error is marginal according to the error bound in Eq. (20). In view of this, we fix $R = 4$ in all subsequent numerical experiments.
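The tail bound above is straightforward to evaluate numerically. The sketch below implements the reflection-principle formula; the parameter values in the example call are purely illustrative assumptions and are not taken from Table 1.

```python
from math import log, sqrt
from statistics import NormalDist

def truncation_tail_bound(R, W0, r, q, sigma, delta, T):
    """Reflection-principle upper bound on E(X_0, R): the probability that the running
    maximum of W0 * exp((r - q - sigma^2/2) t + sigma B_t) over [0, delta*T] exceeds R."""
    N = NormalDist().cdf
    alpha = (r - q - 0.5 * sigma ** 2) / sigma
    m = log(R / W0) / sigma           # barrier for the drifted Brownian motion
    h = delta * T                     # length of the time horizon
    return (1.0 - N((m - alpha * h) / sqrt(h))
            + (R / W0) ** (2.0 * alpha / sigma) * N((-m - alpha * h) / sqrt(h)))

# Purely illustrative (assumed) parameter values:
print(truncation_tail_bound(R=4.0, W0=1.0, r=0.03, q=0.01, sigma=0.2, delta=1/12, T=12))
```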
5.2 Forward Simulation vs. Artificial Simulation

Next, we illustrate the limitations of forward simulation based on control randomization when generating random samples of the state process. Below, we present some prevalent control randomization rules.

(CR0) Given a simulated $X^{(m)}_t := \big(W^{(m)}_t, I^{(m)}_t\big)^\top$, the PH's action $a^{(m)}_t$ is simulated from a degenerate distribution that puts all of its mass on the action in $A_t\big(X^{(m)}_t\big)$ withdrawing the guaranteed amount $G\big(I^{(m)}_t\big)P$.

(CR1) Given $X^{(m)}_t$, the DM's action $a^{(m)}_t$ is simulated from a discrete uniform distribution over the feasible set $A_t\big(X^{(m)}_t\big)$ given in Eq. (36).

(CR2) Given $X^{(m)}_t$, the DM's action $a^{(m)}_t$ is simulated from a discrete uniform distribution over the feasible set $A_t\big(X^{(m)}_t\big)$ with the full-surrender action $\big(W^{(m)}_t, \cdot\big)^\top$ removed.

Given the above rules for generating the PH's action, one may simulate the state process in a forward manner in accordance with Steps 2.1 and 2.2 of the FSBU algorithm; see Section 2.2.
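For concreteness, the following self-contained Python sketch forward-simulates the state process under rule (CR1). It assumes log-normal one-period returns as in Section 5.1; the guarantee schedule $G$, guarantee base $P$, and the $\tau$-components of the feasible actions follow our reading of Eq. (36). It is an illustration rather than the authors' implementation.

```python
import numpy as np

def simulate_paths_CR1(M, T, P, G, r, q, sigma, delta, W0=1.0, seed=0):
    """Forward simulation of X_t = (W_t, I_t), with the PH's action drawn uniformly
    from the feasible set A_t(X_t) at every withdrawal date (rule CR1)."""
    rng = np.random.default_rng(seed)

    def returns():  # log-normal one-period returns under the risk-neutral measure
        return np.exp((r - q - 0.5 * sigma ** 2) * delta
                      + sigma * np.sqrt(delta) * rng.standard_normal(M))

    W, I = W0 * returns(), np.zeros(M, dtype=int)   # state at t = 1 (no action at inception)
    paths = [(W.copy(), I.copy())]
    for t in range(1, T):                           # actions at dates t = 1, ..., T-1
        eps = returns()
        for m in range(M):
            tau_init = 1 if I[m] == 0 else 0
            actions = [(0.0, 0), (G(I[m]) * P, tau_init), (W[m], tau_init)]
            gamma, tau = actions[rng.integers(len(actions))]
            if I[m] == 0 and tau == 1:              # Eq. (33)
                I[m] = t
            W[m] = max(W[m] - gamma, 0.0) * eps[m]  # Eq. (34)
        paths.append((W.copy(), I.copy()))
    return paths
```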
(CR0) was first proposed by Huang and Kwok (2016) in the context of pricing the Guaranteed Lifelong Withdrawal Benefit, a particular type of variable annuity policy. It initializes the withdrawal at $t = 1$, and the resulting simulated state variable $I^{(m)}_t$ (resp., its accompanying post-action value $S^I_t\big(I^{(m)}_t, a^{(m)}_t\big)$) equals a fixed value for all $t = 1, 2, \ldots, T-1$, whereas $I_t$ (resp., $S^I_t(I_t, a_t)$) can, in principle, take any value in $\{0, 1, \ldots, t-1\}$ (resp., $\{0, 1, \ldots, t\}$). A consequential annoying issue is that the obtained estimate of the value function/continuation function is invariant to the first-withdrawal time $I_t$, which is not sensible: the later the PH initializes the withdrawal, the larger the guaranteed amount $G(I_t)$ she enjoys over the remaining contract life.

(CR1) uniformly simulates the PH's action from its feasible set $A_t(X_t)$; see Eq. (36). By virtue of this, there always exist some paths with $I^{(m)}_t = 0$, corresponding to the scenario in which the withdrawal has not been initialized. This in turn guarantees that, in principle, $I^{(m)}_t$ (resp., $S^I_t\big(I^{(m)}_t, a^{(m)}_t\big)$) can take any value in $\{0, 1, \ldots, t-1\}$ (resp., $\{0, 1, \ldots, t\}$). However, this strategy is also not satisfactory: an overwhelming portion of the paths are absorbed by the state $W_t = 0$, i.e., the depletion of the investment account, and only sparse sample points of the investment account are positive. This is graphically illustrated in the top panel of Figure 9, where 1000 sample paths are plotted for clarity of presentation. It is therefore not hard to expect that the accuracy of the regression estimate is severely impaired over $\mathcal K_{t,R}$.

To alleviate this serious problem, (CR2) discards the strategy of depleting the investment account, i.e., the action $(W_t, \cdot)^\top$, when simulating the PH's action. The simulated investment account value $W^{(m)}_t$ can therefore spread over a wider range than under (CR1); see the bottom panel of Figure 9. This phenomenon is more palpable from the histograms of $W^{(m)}_{T-1}$ collected in Figure 10. Nevertheless, the performance of (CR2) in simulating $I^{(m)}_t$ is undesirable: Figure 11 shows that a substantial portion of the sample points of the first-withdrawal time $I^{(m)}_t$ are concentrated in the first few values that $I_t$ can take. To understand the crux, note that at the first possible withdrawal date, one half of the sample paths exhibit the initiation of the withdrawal; among the remaining paths, one half initiate the withdrawal at the subsequent withdrawal date. Therefore, the proportion of paths whose first withdrawal occurs at date $t$ declines at an exponential rate as $t$ increases, which is in line with Figure 11. In view of this, the consequential numerical estimate of the value function can be expected to sustain significant error at states $x = (W, I)^\top$ with a large $I$.

Overall, none of the rules (CR0)–(CR2) gives agreeable performance, and it is hard to devise an ideal way to randomize the PH's action that sidesteps the thorny issues mentioned above. This shows one drawback of binding together control randomization and forward simulation, in addition to the issue of computational cost; see also the item "Limitation of control randomization" in Section 2.2.

To circumvent these problems, in the subsequent numerical experiments we simulate the post-action value of the state process at each time step as follows: $X^{(m)}_{t+} := \big(W^{(m)}_{t+}, I^{(m)}_{t+}\big)^\top$, where $W^{(m)}_{t+}$ and $I^{(m)}_{t+}$ are simulated from two independent uniform distributions with support sets $(0, R)$ and $\{0, 1, \ldots, t\}$, respectively. This ensures that the post-action values are evenly distributed over $\mathcal K_{t,R}$.
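By contrast, the artificial simulation requires no randomization of actions at all. A minimal sketch (with hypothetical function names) of the sampler we just described is:

```python
import numpy as np

def simulate_post_action_values(M, t, R, seed=1):
    """Artificial simulation of the post-action value X_{t+} = (W_{t+}, I_{t+}):
    two independent uniform draws, on (0, R) for the account component and on
    {0, 1, ..., t} for the first-withdrawal-time component."""
    rng = np.random.default_rng(seed)
    W_plus = rng.uniform(0.0, R, size=M)
    I_plus = rng.integers(0, t + 1, size=M)
    return W_plus, I_plus
```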
Figure 9: Sample paths of the investment account generated by control randomization methods (CR1) and (CR2).
Figure 10: Histograms of $W^{(m)}_{T-1}$ generated by control randomization methods (CR1) and (CR2).

Figure 11: Histogram of $I^{(m)}_{11}$ generated by control randomization method (CR2).

In the sequel, we conduct several numerical experiments to compare the regression estimates of the continuation function produced by two regression methods: the raw sieve estimation (RSE) method and the shape-preserving sieve estimation (SPSE) method. The RSE and SPSE are essentially the linear sieve estimation method discussed in Section 3.3 with sieve spaces (28) and (29), respectively. It is easy to show that the continuation function $k_1 \mapsto \widetilde C_t(k_1, k_2)$ is monotone, and we therefore incorporate this shape constraint in the SPSE method; the expression of the accompanying matrix $A_J$ is given in Appendix A.1. The RSE method is equivalent to the least-squares method commonly adopted in the literature; see, for instance, Longstaff and Schwartz (2001). The aim of this subsection is to show the advantage of incorporating shape constraints in the regression step of an LSMC algorithm.
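The monotonicity constraint is cheap to impose in practice. The sketch below is our own illustration of an increasing Bernstein-polynomial fit on $[0, 1]$: writing the coefficients as cumulative sums of non-negative increments turns the constraint $A_J\beta \ge 0$ into simple bound constraints handled by SciPy's bounded least-squares solver. The rescaling of the regressor to $[0, 1]$ and all names are our own assumptions; this is not the paper's implementation of the SPSE.

```python
import numpy as np
from scipy.special import comb
from scipy.optimize import lsq_linear

def bernstein_basis(z, J):
    """Bernstein basis phi_j(z) = C(J, j) z^j (1 - z)^(J - j), j = 0, ..., J."""
    z = np.asarray(z, dtype=float)[:, None]
    j = np.arange(J + 1)[None, :]
    return comb(J, j) * z ** j * (1.0 - z) ** (J - j)

def monotone_bernstein_fit(z, y, J):
    """Increasing shape-preserving least-squares fit: beta_0 <= ... <= beta_J is imposed
    by writing beta = L @ theta with theta_1, ..., theta_J >= 0 (theta_0 free)."""
    Phi = bernstein_basis(z, J)
    L = np.tril(np.ones((J + 1, J + 1)))     # beta_j = theta_0 + ... + theta_j
    lb = np.r_[-np.inf, np.zeros(J)]         # non-negative increments
    theta = lsq_linear(Phi @ L, y, bounds=(lb, np.inf)).x
    beta = L @ theta
    return lambda z_new: bernstein_basis(z_new, J) @ beta

# Toy usage on noisy monotone data over [0, 1]
rng = np.random.default_rng(2)
z = rng.uniform(size=500)
y = np.minimum(z, 0.6) + 0.05 * rng.standard_normal(500)
fit = monotone_bernstein_fit(z, y, J=20)
```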
In the first numerical experiment, we compare the SPSE and RSE estimates of the continuation function at each time step. For fairness of comparison, we take $\phi(\cdot) = \big(\phi_0(\cdot), \ldots, \phi_J(\cdot)\big)^\top$ to be a vector of univariate Bernstein polynomials up to order $J = 20$ in both sieve estimation methods. Figure 12 collects the plots of the regression estimates as functions of $k_1$ at odd time steps with $k_2 = 0$. To better display the subtle difference between the estimates produced by the SPSE and RSE, the plots are restricted to the interval $[0, 1]$. Although the continuation function is monotone in $k_1$, the dotted lines in Figure 12 show that the regression estimate produced by the RSE does not inherit this monotonicity and, accordingly, loses certain economic interpretation. This issue is more serious at smaller time steps, as shown by the bottom panel of Figure 12. This is not surprising: once monotonicity is lost at a certain time step, the regression estimate obtained at the next (earlier) time step of the backward recursion is influenced, which in turn exaggerates the problem. In contrast, as depicted by the solid lines in Figure 12, the SPSE method always preserves the monotonicity of the continuation function, and the corresponding regression estimates are therefore more economically sensible. This shows the first advantage of the SPSE method: it preserves certain shape properties of the continuation function.

Figure 12: Regression estimates of the SPSE and RSE for $k_1 \mapsto \widetilde C^E_t(k_1, 0)$ over $[0, 1]$ at odd time steps; $J = 20$ basis functions are used.

Next, we implement the BSBU algorithm to compute $\widetilde V^E_0(X_0)$ with $X_0 = (1, 0)^\top$, where the SPSE and RSE are employed, respectively. These estimates approximate the no-arbitrage price of the VA policy at inception and are therefore of most interest in the present context; see the last paragraph of Section 4.2. It is worth noting that $\widetilde V^E_0(X_0)$ is random due to the randomness of the simulated sample; see also Remark 1. We therefore repeat the BSBU algorithm 40 times in order to study the stability of $\widetilde V^E_0(X_0)$ under a finite sample size. Table 2 summarizes the mean and standard deviation of $\widetilde V^E_0(X_0)$ under different pairs of $M$ and $J$. The "S.d." columns of the table disclose that the standard deviation accompanying the SPSE is nearly one half of that associated with the RSE under all numerical settings. For the numerical settings 0–2 of Table 2, Figure 13 delineates the corresponding density plots of the 40 estimates. By comparing the left and right panels of Figure 13, we can easily perceive that the numerical estimates accompanying the SPSE are less volatile, as reflected by the more spiked shape of the corresponding density plots. This observation is consistent with Table 2. To sum up, the SPSE surpasses the RSE in terms of a smaller standard deviation of the resulting numerical estimate of the optimal value function at the initial state.

From settings 0–2 of Table 2, we also observe that, for both methods, the change in the mean of the numerical estimate is not substantial as the number of basis functions $J$ increases from 15 to 25. In settings 1, 3, and 4 of Table 2, we fix $J = 20$ and progressively increase the number of simulated paths $M$; the standard deviation decreases as the number of simulated paths climbs. This descending trend is also confirmed by the box plots depicted in Figure 14: the height of the boxes shrinks as the number of simulated paths grows. All of these results show the convergence of the BSBU algorithm, which is in line with the convergence result established in Section 3.4.

Table 2: Mean and standard deviation of $\widetilde V^E_0(X_0)$ produced by the SPSE and RSE methods, obtained by repeating the BSBU algorithm 40 times. Settings 0–2 fix $M$ and vary $J \in \{15, 20, 25\}$; settings 1, 3, and 4 fix $J = 20$ and increase $M$.

Overall, the advantages of the SPSE over the RSE are at least two-fold. Firstly, the SPSE produces economically sensible regression estimates by inheriting certain shape properties of the true continuation function. Secondly, the consequential estimate of the optimal value function accompanying the SPSE method is less volatile than that produced by the RSE method under a finite number of simulated sample paths.
Figure 13: Density plots of $\widetilde V^E_0(X_0)$ produced by the SPSE and RSE methods under 40 repeats of the BSBU algorithm, for settings 0–2 of Table 2. The dotted lines correspond to the sample means.

Figure 14: Box plots of $\widetilde V^E_0(X_0)$ produced by the SPSE and RSE methods under 40 repeats of the BSBU algorithm; $J$ is fixed at 20 and $M$ varies across the settings of Table 2.

6 Conclusion

This paper develops a novel LSMC algorithm, referred to as the Backward Simulation and Backward Updating (BSBU) algorithm, to solve discrete-time stochastic optimal control problems. We first introduce an auxiliary stochastic control problem in which the state process only takes values in a compact set. This enables the BSBU algorithm to sidestep extrapolating the value function estimate. We further show that the optimal value function of the auxiliary problem is a legitimate approximation of that of the original problem under an appropriate choice of the truncation parameter. To circumvent the drawbacks of forward simulation and control randomization, we propose to directly simulate the post-action value of the state process from an artificial probability distribution. The pivotal idea behind this artificial simulation method is that the continuation function is solely determined by the distribution of the random innovation term. Moreover, motivated by the shape information of the continuation function, we introduce a shape-preserving sieve estimation technique to alleviate the computational burden of the tuning parameter selection involved in the regression step of an LSMC algorithm. Furthermore, a convergence result for the BSBU algorithm is established by resorting to the theory of nonparametric sieve estimation. Finally, we confirm the merits of the BSBU algorithm through an application to pricing equity-linked insurance products and the corresponding numerical experiments.
References
Parsiad Azimzadeh and Peter A. Forsyth. The existence of optimal bang-bang controls for GMxB contracts. SIAM Journal on Financial Mathematics, 6(1):117–139, 2015.

Christophe Barrera-Esteve, Florent Bergeret, Charles Dossal, Emmanuel Gobet, Asma Meziou, Rémi Munos, and Damien Reboul-Salze. Numerical methods for the pricing of swing options: a stochastic control approach. Methodology and Computing in Applied Probability, 8(4):517–540, 2006.

Denis Belomestny. Pricing Bermudan options by nonparametric regression: optimal rates of convergence for lower estimates. Finance and Stochastics, 15(4):655–683, 2011.

Denis Belomestny, Grigori Milstein, and Vladimir Spokoiny. Regression methods in pricing American and Bermudan options using consumption processes. Quantitative Finance, 9(3):315–327, 2009.

Denis Belomestny, Anastasia Kolodko, and John Schoenmakers. Regression methods for stochastic control problems and their convergence analysis. SIAM Journal on Control and Optimization, 48(5):3562–3588, 2010.

René Carmona and Michael Ludkovski. Valuation of energy storage: an optimal switching approach. Quantitative Finance, 10(4):359–374, 2010.

Jacques F. Carriere. Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics and Economics, 19(1):19–30, 1996.

Xiaohong Chen. Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632, 2007.

Zhang Chen, Ken Vetzal, and Peter A. Forsyth. The effect of modelling parameters on the value of GMWB guarantees. Insurance: Mathematics and Economics, 43(1):165–173, 2008.

Jaehyuk Choi, Chenru Liu, and Jeechul Woo. An efficient approach for removing look-ahead bias in the least square Monte Carlo algorithm: leave-one-out. arXiv preprint arXiv:1810.02071, 2018.

Emmanuelle Clément, Damien Lamberton, and Philip Protter. An analysis of a least squares regression method for American option pricing. Finance and Stochastics, 6(4):449–471, 2002.

Fei Cong and Cornelis W. Oosterlee. Multi-period mean–variance portfolio optimization based on Monte-Carlo simulation. Journal of Economic Dynamics and Control, 64:23–38, 2016.

Min Dai, Yue Kuen Kwok, and Jianping Zong. Guaranteed minimum withdrawal benefit in variable annuities. Mathematical Finance, 18(4):595–611, 2008.

Pierre Del Moral, Bruno Rémillard, and Sylvain Rubenthaler. Monte Carlo approximations of American options that preserve monotonicity and convexity. In Numerical Methods in Finance, pages 115–143. Springer, 2012.

Daniel Egloff. Monte Carlo algorithms for optimal stopping and statistical learning. The Annals of Applied Probability, 15(2):1396–1432, 2005.

Daniel Egloff, Michael Kohler, and Nebojsa Todorovic. A dynamic look-ahead Monte Carlo algorithm for pricing Bermudan options. The Annals of Applied Probability, 17(4):1138–1171, 2007.

Jianqing Fan and Irene Gijbels. Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66, volume 66. CRC Press, 1996.

Paul Glasserman and Bin Yu. Simulation for American options: regression now or regression later? In Monte Carlo and Quasi-Monte Carlo Methods 2002, pages 213–226. Springer, 2004.

Paul Glasserman and Bin Yu. Number of paths versus number of basis functions in American option pricing. The Annals of Applied Probability, 14(4):2090–2119, 2004.

Yao Tung Huang and Yue Kuen Kwok. Regression-based Monte Carlo methods for stochastic control models: variable annuities with lifelong guarantees. Quantitative Finance, 16(6):905–928, 2016.

Yao Tung Huang, Pingping Zeng, and Yue Kuen Kwok. Optimal initiation of guaranteed lifelong withdrawal benefit with dynamic withdrawals. SIAM Journal on Financial Mathematics, 8(1):804–840, 2017.

Idris Kharroubi, Nicolas Langrené, and Huyên Pham. A numerical algorithm for fully nonlinear HJB equations: an approach by control randomization. Monte Carlo Methods and Applications, 20(2):145–165, 2014.

Ker-Chau Li. Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. The Annals of Statistics, pages 958–975, 1987.

Francis A. Longstaff and Eduardo S. Schwartz. Valuing American options by simulation: a simple least-squares approach. Review of Financial Studies, 14(1):113–147, 2001.

Mary C. Meyer. Inference using shape-restricted regression splines. The Annals of Applied Statistics, pages 1013–1033, 2008.

Elizbar A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

Whitney K. Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168, 1997.

Tim Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. John Wiley and Sons, 1988.

L. C. G. Rogers. Pathwise stochastic optimal control. SIAM Journal on Control and Optimization, 46(3):1116–1132, 2007.

Zhiyi Shen and Chengguo Weng. Pricing bounds and bang-bang analysis of Polaris variable annuities. Working paper, University of Waterloo. Available at SSRN: https://ssrn.com/abstract=3056794, 2017.

Steven E. Shreve. Stochastic Calculus for Finance II: Continuous-Time Models, volume 11. Springer Science & Business Media, 2004.

Lars Stentoft. Convergence of the least squares Monte Carlo approach to American option valuation. Management Science, 50(9):1193–1203, 2004.

John N. Tsitsiklis and Benjamin Van Roy. Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks, 12(4):694–703, 2001.

Jiangdian Wang and Sujit K. Ghosh. Shape restricted nonparametric regression based on multivariate Bernstein polynomials. Technical report, Department of Statistics, North Carolina State University, 2012a.

Jiangdian Wang and Sujit K. Ghosh. Shape restricted nonparametric regression with Bernstein polynomials. Computational Statistics and Data Analysis, 56(9):2729–2741, 2012b.

Geoffrey S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.

Daniel Z. Zanger. Convergence of a least-squares Monte Carlo algorithm for bounded approximating sets. Applied Mathematical Finance, 16(2):123–150, 2009.

Daniel Z. Zanger. Quantitative error estimates for a least-squares Monte Carlo algorithm for American option pricing. Finance and Stochastics, 17(3):503–534, 2013.

Rongju Zhang, Nicolas Langrené, Yu Tian, Zili Zhu, Fima Klebaner, and Kais Hamza. Dynamic portfolio optimization with liquidity cost and market impact: a simulation-and-regression approach. Quantitative Finance, pages 1–14, 2018.
A Supplements for Sieve Estimation Method
A.1 Forms of Matrix $A_J$

To make the paper self-contained, we collect several forms of the constraint matrix $A_J$ in (29) which ensure monotonicity, convexity, or concavity of the sieve estimate (30); see Wang and Ghosh (2012a,b) for a justification. For simplicity of notation, we only address the case $r = 1$.

Monotonicity. Suppose the conditional mean $g(\cdot)$ defined in Eq. (26) is monotone. Then the corresponding monotonicity-preserving sieve estimate $\hat g(\cdot)$ is obtained from (30) with $\mathcal H_J$ given by Eq. (29) and
$$ A_J = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0\\ 0 & -1 & 1 & \cdots & 0\\ & & \ddots & \ddots & \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}_{J \times (J+1)}. $$

Convexity/Concavity. If $g(\cdot)$ is convex, we choose the matrix $A_J$ as
$$ A_J = \begin{pmatrix} 1 & -2 & 1 & 0 & \cdots & 0\\ 0 & 1 & -2 & 1 & \cdots & 0\\ & & \ddots & \ddots & \ddots & \\ 0 & \cdots & 0 & 1 & -2 & 1 \end{pmatrix}_{(J-1) \times (J+1)}. $$
Moreover, the matrix $A_J$ accompanying a concave $g(\cdot)$ is obtained by taking the negative of the above matrix.

Convexity and Monotonicity. If $g(\cdot)$ is convex and monotone, the corresponding $A_J$ is given by
$$ A_J = \begin{pmatrix} -1 & 1 & 0 & 0 & \cdots & 0\\ 1 & -2 & 1 & 0 & \cdots & 0\\ 0 & 1 & -2 & 1 & \cdots & 0\\ & & \ddots & \ddots & \ddots & \\ 0 & \cdots & 0 & 1 & -2 & 1 \end{pmatrix}_{J \times (J+1)}. $$
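For reference, the three constraint matrices above can be assembled in a few lines. The sketch below assumes the constraint is imposed as $A_J \beta \ge 0$ for an increasing (resp. convex) fit, matching the displays above; the function names are ours.

```python
import numpy as np

def A_monotone(J):
    """J x (J+1) first-difference matrix; A @ beta >= 0 enforces beta_0 <= ... <= beta_J."""
    A = np.zeros((J, J + 1))
    i = np.arange(J)
    A[i, i], A[i, i + 1] = -1.0, 1.0
    return A

def A_convex(J):
    """(J-1) x (J+1) second-difference matrix; A @ beta >= 0 enforces convexity."""
    A = np.zeros((J - 1, J + 1))
    i = np.arange(J - 1)
    A[i, i], A[i, i + 1], A[i, i + 2] = 1.0, -2.0, 1.0
    return A

def A_convex_monotone(J):
    """J x (J+1) matrix stacking one first-difference row on top of the second differences."""
    return np.vstack([A_monotone(J)[:1], A_convex(J)])
```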
A.2 A Data-driven Choice of J

Below we present some common methods for choosing the number of basis functions $J$ in a sieve estimation method; see, e.g., Li (1987).

Mallows's $C_p$. For a discrete candidate set $\mathcal J \subseteq \mathbb N$, $J$ is determined by solving the following minimization problem:
$$ \hat J = \arg\min_{J \in \mathcal J}\; \frac{1}{M} \sum_{m=1}^{M} \Big[ U^{(m)} - \hat g\big(Z^{(m)}\big) \Big]^2 + 2\hat\sigma^2\big(J/M\big), $$
where $\hat g(\cdot)$ is given in (30) and $\hat\sigma^2 := M^{-1}\sum_{m=1}^{M}\big[U^{(m)} - \hat g\big(Z^{(m)}\big)\big]^2$ is an estimate of the variance of the residual term.

Generalized cross-validation. $J$ is determined by
$$ \hat J = \arg\min_{J \in \mathcal J}\; \frac{M^{-1}\sum_{m=1}^{M}\big[U^{(m)} - \hat g\big(Z^{(m)}\big)\big]^2}{\big(1 - (J/M)\big)^2}, $$
with $\hat g(\cdot)$ given in (30).

Leave-one-out cross-validation. Select $J$ to minimize
$$ \mathrm{CV}(J) := \frac{1}{M}\sum_{m=1}^{M}\Big[ U^{(m)} - \hat g_{-m}\big(Z^{(m)}\big) \Big]^2, $$
where $\hat g_{-m}(\cdot)$ is obtained similarly to Eq. (30) with the sample point $\big(U^{(m)}, Z^{(m)}\big)$ removed.

It is worth stressing that, among the above three selection methods, leave-one-out cross-validation is the most computationally expensive, as one has to compute the regression estimate $\hat g_{-m}(\cdot)$ $M$ times in a single evaluation of $\mathrm{CV}(J)$. This is clearly computationally prohibitive when $M$ is considerable, which is particularly the case in the present context of the LSMC algorithm. The Mallows's $C_p$ criterion and the generalized cross-validation method are relatively less cumbersome, but still undesirable under a sizable $M$. These observations show the importance of avoiding such a tuning parameter selection procedure, which is one important thrust behind proposing the shape-preserving sieve estimation method.
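As an illustration of the generalized cross-validation criterion above, applied here to a plain (unconstrained) Bernstein least-squares fit, one might proceed as in the following sketch; the function names and toy data are our own assumptions.

```python
import numpy as np
from scipy.special import comb

def bernstein_design(z, J):
    """Design matrix of Bernstein polynomials of order J evaluated at z in [0, 1]."""
    z = np.asarray(z, dtype=float)[:, None]
    j = np.arange(J + 1)[None, :]
    return comb(J, j) * z ** j * (1.0 - z) ** (J - j)

def gcv_choose_J(z, y, candidates):
    """Pick J by generalized cross-validation for an unconstrained least-squares sieve fit,
    using J/M as the model-complexity term as in the criterion above."""
    M = len(y)
    scores = {}
    for J in candidates:
        Phi = bernstein_design(z, J)
        beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        rss = np.mean((y - Phi @ beta) ** 2)
        scores[J] = rss / (1.0 - J / M) ** 2
    return min(scores, key=scores.get), scores

# Toy usage with simulated data
rng = np.random.default_rng(4)
z = rng.uniform(size=400)
y = np.sin(3 * z) + 0.1 * rng.standard_normal(400)
J_hat, _ = gcv_choose_J(z, y, candidates=range(5, 31, 5))
```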
A.3 Technical Assumptions for the Sieve Estimation Method

We impose the following assumption, which accompanies the sieve estimation method discussed in Section 3.3 and follows Newey (1997).
Assumption 4.
(i) $\big\{\big(U^{(m)}, Z^{(m)}\big)\big\}_{m=1}^{M}$ are i.i.d. and $Z^{(m)}$ has compact support $\mathcal Z$. Furthermore, $\mathrm{Var}\big[U^{(m)} \,\big|\, Z^{(m)} = \cdot\,\big]$ is bounded over $\mathcal Z$.
(ii) There exists a sequence $\Upsilon(J)$ such that $\|\phi\|_\infty \le \Upsilon(J)$, with $\|\cdot\|_\infty$ denoting the supremum norm of a continuous function over $\mathcal Z$.
(iii) For the sieve space $\mathcal H_J$ defined either in Eq. (28) or Eq. (29), there exist a $(J+1)$-by-$1$ vector $\tilde\beta$ and a sequence $\rho_J$ such that $\rho_J \to 0$ as $J \to \infty$, and
$$ \inf_{h(\cdot) \in \mathcal H_J} \|h - g\|_\infty = \big\|\tilde\beta^\top\phi - g\big\|_\infty = O(\rho_J), \qquad \text{(A.1)} $$
where we remind the reader that $g(\cdot) := \mathbb E\big[U^{(m)} \,\big|\, Z^{(m)} = \cdot\,\big]$.
(iv) Let $\Phi := \mathbb E\big[\phi\big(Z^{(m)}\big)\phi^\top\big(Z^{(m)}\big)\big]$. There exist positive constants $c_\Phi$ and $\bar c_\Phi$ independent of $J$ such that $0 < c_\Phi \le \lambda_{\min}(\Phi) \le \lambda_{\max}(\Phi) \le \bar c_\Phi < \infty$, with $\lambda_{\min}(\Phi)$ and $\lambda_{\max}(\Phi)$ denoting the smallest and largest eigenvalues of $\Phi$, respectively.
(v) As $M \to \infty$, $J \to \infty$ and $\Upsilon(J)^2 J / M \to 0$.

We give some comments on the above technical conditions.

1. The i.i.d. condition in Part (i) of the above assumption discloses the necessity of generating an independent sample at each time step in an LSMC algorithm; see also the discussion in the earlier item "Cost of forward simulation" of Section 2.2. Part (i) further requires that $Z^{(m)}$ has compact support, which is conventional in the literature; see, e.g., Newey (1997) and Chen (2007). In the context of the BSBU algorithm, this shows that restricting the state process to a bounded domain is not only beneficial in eliminating undesirable extrapolation but also indispensable in guaranteeing the convergence of the regression estimate to the continuation function. This has also been pointed out in the literature; see, e.g., Stentoft (2004) and Zanger (2013).

2. Part (ii) specifies how the magnitude of $\phi(\cdot)$ is amplified as the number of basis functions grows. In particular, Newey (1997) shows that $\Upsilon(J) = O\big(\sqrt{J}\big)$ for B-splines and $\Upsilon(J) = O(J)$ for power series; for other types of basis functions, we refer to Chen (2007).

3. Part (iii) states that there exists a function $\tilde\beta^\top\phi(\cdot)$ in the sieve space $\mathcal H_J$ that "best" approximates the conditional mean function $g(\cdot)$ under the supremum norm; see Figure 15 for a graphical illustration. The existence of the vector $\tilde\beta$ (referred to as the oracle) is guaranteed by the convexity of the sieve space $\mathcal H_J$. For the sieve space (29), the existence of $\rho_J$ relies on the convexity, concavity, or monotonicity of the function $g(\cdot)$, which follows from Property 3.2 of Wang and Ghosh (2012b). Figure 15 depicts the relationship between $\tilde\beta^\top\phi(\cdot)$ and $g(\cdot)$: their discrepancy vanishes as $J$ increases and, for a fixed $J$, the sieve estimate $\hat\beta^\top\phi(\cdot)$ converges to $\tilde\beta^\top\phi(\cdot)$ as the sample size $M$ approaches infinity. Therefore, one may view the sieve estimation as a two-stage approximation of the conditional mean function $g(\cdot)$.

4. The condition in Part (iv) ensures that the design matrix of the regression problem is nonsingular with high probability and does not blow up as $J$ approaches infinity. Finally, Part (v) prescribes the growth rates of $J$ and $M$ in order to avoid overfitting or underfitting.

Figure 15: A diagram illustrating the relationship between $\tilde\beta^\top\phi(\cdot)$ and $g(\cdot)$.

B Proofs of Statements
B.1 Proof of Proposition 1
B.1.1 Preliminary

Lemma 1.
For any F -adapted process a = { a t } t ∈T , the following statements hold: (i) (cid:8) τ R ≤ t (cid:9) = (cid:8) X Rt ∈ ∂ X R (cid:9) for t = 1 , , . . . , T ; (ii) (cid:8) τ R = t + 1 (cid:9) = (cid:110) X Rt ∈ ˚ X R , S (cid:0) X Rt , a t , ε t +1 (cid:1) / ∈ ˚ X R (cid:111) for t = 0 , , . . . , T − ,where τ R and X Rt are defined in Eqs. (11) and (12) , respectively.Proof of Lemma 1. (i) According to Eq. (12), we observe (cid:8) X Rt ∈ ∂ X R (cid:9) = (cid:8) X t ∈ ∂ X R , τ R > t (cid:9) ∪ (cid:8) Q ( X τ R ∧ t ) ∈ ∂ X R , τ R ≤ t (cid:9) = (cid:8) Q ( X τ R ∧ t ) ∈ ∂ X R , τ R ≤ t (cid:9) , where the second identity is by the definition of the stopping time τ R and the fact that ∂ X R ∩ ˚ X R = ∅ . To show the statement in Part (i) of Lemma 1, it suffices to prove (cid:8) τ R ≤ t (cid:9) ⊆{Q ( X τ R ∧ t ) ∈ ∂ X R } . Indeed, τ R ≤ t implies X τ R ∧ t / ∈ ˚ X R , and thus Q ( X τ R ∧ t ) ∈ ∂ X R . (ii) In view of Part (i) and Eq. (12), we obtain (cid:110) X Rt ∈ ˚ X R (cid:111) = (cid:8) X Rt ∈ ∂ X R (cid:9) c = (cid:8) τ R > t (cid:9) ⊆ (cid:8) X Rt = X t (cid:9) . (cid:110) X Rt ∈ ˚ X R , S (cid:0) X Rt , a t , ε t +1 (cid:1) / ∈ ˚ X R (cid:111) = (cid:110) X Rt ∈ ˚ X R , X Rt = X t , S (cid:0) X Rt , a t , ε t +1 (cid:1) / ∈ ˚ X R , τ R > t (cid:111) = (cid:110) X t ∈ ˚ X R , S ( X t , a t , ε t +1 ) / ∈ ˚ X R , τ R > t (cid:111) = (cid:110) X t ∈ ˚ X R , X t +1 / ∈ ˚ X R , τ R > t (cid:111) = (cid:8) τ R = t + 1 (cid:9) . This proves Part (ii) of Lemma 1.
B.1.2 Proof of the Main Result
Proof of Proposition 1.
By exploiting Lemma 1 and Eq. (12), we get X Rt +1 = X t +1 I { τ R >t +1 } + Q (cid:0) X τ R ∧ ( t +1) (cid:1) I { τ R ≤ t +1 } = X t +1 I { τ R >t +1 } + Q ( X τ R ∧ t ) I { τ R ≤ t } + Q ( X t +1 ) I { τ R = t +1 } = S ( X t , a t , ε t +1 ) I { τ R >t +1 } + Q ( X τ R ∧ t ) I { τ R ≤ t } + Q ( S ( X t , a t , ε t +1 )) I { τ R = t +1 } = S (cid:0) X Rt , a t , ε t +1 (cid:1) I { τ R >t +1 } + X Rt I { τ R ≤ t } + Q (cid:0) S (cid:0) X Rt , a t , ε t +1 (cid:1)(cid:1) I { τ R = t +1 } = S (cid:0) X Rt , a t , ε t +1 (cid:1) I { τ R >t +1 } + X Rt I { X t ∈ ∂ X R } + Q (cid:0) S (cid:0) X Rt , a t , ε t +1 (cid:1)(cid:1) I { X Rt ∈ ˚ X R , S ( X Rt ,a t ,ε t +1 ) / ∈ ˚ X R } , where the fourth equality follows from Eq. (12) and the last equality follows from Lemma 1.The above equation together with Eqs. (5) and (14) yields Eq. (13). This completes theproof. B.2 Proof of Theorem 1
B.2.1 Preliminary
Recall that X and X R implicitly depend on certain actions a ; see Eqs. (1) and (13), respectively.In the sequel, we sometimes stress such dependency by writing X t ( a ) (resp. X Rt ( a )) and X ( a ) (resp. X R ( a )). Lemma 2.
For the state process X R defined through Eq. (13) , the following statements hold. (i) For any a ∈ A , there exists ˜ a ∈ A R such that X Rt ( a ) = X Rt (˜ a ) for all t ∈ T almost surely. (ii) For any ˜ a ∈ A R , there exists a ∈ A such that X Rt ( a ) = X Rt (˜ a ) for all t ∈ T almost surely.Proof of Lemma 2. (i) Given a ∈ A and X R ( a ), we construct ˜ a as follows: ˜ a = a , and˜ a t = a t I { X Rt ( a ) ∈ ˚ X R } + a ∗ t (cid:0) X Rt ( a ) (cid:1) I { X Rt ( a ) ∈ ∂ X R } , for t = 1 , , . . . , T − , (B.1)39here a ∗ t ( x ) := arg sup a ∈ A t ( x ) f t ( x, a ) for x ∈ ∂ X R and t ∈ T .It is easy to see from the above construction that ˜ a is F -adapted. It remains to show that˜ a t ∈ A t (cid:0) X Rt (˜ a ) (cid:1) and X Rt ( a ) = X Rt (˜ a ) , for t ∈ T . (B.2)Firstly, we observe ˜ a = a ∈ A ( X ) and X R (˜ a ) = X . As induction hypothesis, we assumethe statement (B.2) holds for time step t . For time step t + 1, we split the discussions into twocases.1. If X Rt ( a ) = X Rt (˜ a ) ∈ ∂ X R , then X Rt +1 (˜ a ) = X Rt (˜ a ) = X Rt ( a ) = X Rt +1 ( a ) , where the first and third equalities follow by Eq. (13) and the second equality is due tothe induction hypothesis.2. In the second case that X Rt ( a ) = X Rt (˜ a ) ∈ ˚ X R , we apply Eq. (13) to get X Rt +1 (˜ a ) = ˜ H (cid:16) K (cid:0) X Rt (˜ a ) , ˜ a t (cid:1) , ε t +1 (cid:17) = ˜ H (cid:16) K (cid:0) X Rt ( a ) , a t (cid:1) , ε t +1 (cid:17) = X Rt +1 ( a ) , where the second equality follows by Eq. (B.1) and the induction hypothesis (B.2).In either of the above cases, we have X Rt +1 (˜ a ) = X Rt +1 ( a ). This combined with Eq. (B.1)implies˜ a t +1 = a t +1 ∈ A t +1 (cid:0) X Rt +1 ( a ) (cid:1) = A t +1 (cid:0) X Rt +1 (˜ a ) (cid:1) , if X Rt +1 (˜ a ) ∈ ˚ X R . Otherwise, ˜ a t +1 = a ∗ t +1 (cid:0) X Rt +1 ( a ) (cid:1) = a ∗ t +1 (cid:0) X Rt +1 (˜ a ) (cid:1) ∈ A t +1 (cid:0) X Rt +1 (˜ a ) (cid:1) . This proves thestatement (B.2) for time step t + 1. The proof of Part (i) is complete. (ii) Given ˜ a ∈ A R and X R (˜ a ), we construct a as follows: a = ˜ a , and a t = ˜ a t I { X Rt (˜ a ) ∈ ˚ X R } + ˆ a t ( X t ( a )) I { X Rt (˜ a ) ∈ ∂ X R } , for t = 1 , , . . . , T − , (B.3)where ˆ a t ( · ) is any measurable function satisfying ˆ a t ( x ) ∈ A t ( x ) for x ∈ X and t ∈ T .It is easy to see that ˜ a is F -adapted. Next, we use a forward induction argument to show that a t ∈ A t ( X t ( a )) and X Rt ( a ) = X Rt (˜ a ) , for t ∈ T . (B.4)The above statement holds trivially for t = 0. As induction hypothesis, we assume it holds fortime step t . For time step t + 1, we consider two separate cases.40. If X Rt ( a ) = X Rt (˜ a ) ∈ ∂ X R , Eq. (13) in combined with (B.4) implies X Rt +1 ( a ) = X Rt ( a ) = X Rt (˜ a ) = X Rt +1 (˜ a ) .
2. In the second case that X Rt ( a ) = X Rt (˜ a ) ∈ ˚ X R , applying Eq. (13) gives X Rt +1 ( a ) = ˜ H (cid:16) K (cid:0) X Rt ( a ) , a t (cid:1) , ε t +1 (cid:17) = ˜ H (cid:16) K (cid:0) X Rt (˜ a ) , ˜ a t (cid:1) , ε t +1 (cid:17) = X Rt +1 (˜ a ) , where the second equality follows by Eq. (B.3) and the induction hypothesis (B.4).Overall, we always observe X Rt +1 ( a ) = X Rt +1 (˜ a ). To prove the statement (B.4) holds for timestep t + 1, it remains to show a t +1 ∈ A t +1 ( X t +1 ( a )) . We split the discussion into two separatecases.1. Firstly, suppose X Rt +1 ( a ) = X Rt +1 (˜ a ) ∈ ∂ X R , Eq. (B.3) implies a t +1 = ˆ a t +1 ( X t +1 ( a )) ∈ A t +1 ( X t +1 ( a )) .
2. Secondly, suppose X Rt +1 ( a ) = X Rt +1 (˜ a ) ∈ ˚ X R . By Part (i) of Lemma 1, (cid:110) X Rt +1 ( a ) ∈ ˚ X R (cid:111) = (cid:8) τ R > t (cid:9) and thus, it follows from Eq. (12) that (cid:110) X Rt +1 ( a ) ∈ ˚ X R (cid:111) ⊆ (cid:8) X Rt +1 ( a ) = X t +1 ( a ) (cid:9) .Consequently, we apply Eq. (B.3) to get a t +1 = ˜ a t +1 ∈ A t +1 (cid:0) X Rt +1 (˜ a ) (cid:1) = A t +1 (cid:0) X Rt +1 ( a ) (cid:1) = A t +1 ( X t +1 ( a )) . The proof of Part (ii) is complete.A direct consequence of the preceding lemma is the following corollary.
Corollary 1.
The value function ˜ V ( X ) defined in Eq. (15) exhibits: ˜ V ( X ) = sup a ∈A E (cid:34) T − (cid:88) t =0 ϕ t f t (cid:0) X Rt , a t (cid:1) + ϕ T G (cid:0) X RT (cid:1)(cid:35) . (B.5)It is worth noting that the optimization problems in Eq. (B.5) and Eq. (15) are taken overthe set A and A R , respectively. The above corollary states that the optimal values of these twooptimization problems are exactly the same as given by ˜ V ( X ). B.2.2 Proof of the Main Result
Proof of Theorem 1.
In view of Eqs. (2) and (B.5), we obtain (cid:12)(cid:12)(cid:12) ˜ V ( X ) − V ( X ) (cid:12)(cid:12)(cid:12) = sup a ∈A E (cid:34) T − (cid:88) t =0 (cid:12)(cid:12) f t (cid:0) X Rt , a t (cid:1) − f t ( X t , a t ) (cid:12)(cid:12) I { X Rt (cid:54) = X t } (cid:35) + sup a ∈A E (cid:104)(cid:12)(cid:12) G (cid:0) X RT (cid:1) − G ( X T ) (cid:12)(cid:12) I { X RT (cid:54) = X T } (cid:105) := I + I . (B.6)41elow we establish upper bounds for the I and I defined in the above display, respectively. Let E := (cid:8) X t = X Rt for all 1 ≤ t ≤ T (cid:9) . Note that E ⊆ (cid:8) X t = X Rt (cid:9) = ⇒ (cid:8) X t (cid:54) = X Rt (cid:9) = (cid:8) X t = X Rt (cid:9) c ⊆ E c . Accordingly, we get I = sup a ∈A E (cid:34) T − (cid:88) t =0 (cid:12)(cid:12) f t (cid:0) X Rt , a t (cid:1) − f t ( X t , a t ) (cid:12)(cid:12) I { X t (cid:54) = X Rt } (cid:35) ≤ sup a ∈A E (cid:34) T − (cid:88) t =0 (cid:0)(cid:12)(cid:12) f t (cid:0) X Rt , a t (cid:1)(cid:12)(cid:12) + | f t ( X t , a t ) | (cid:1) I { X t (cid:54) = X Rt } (cid:35) ≤ sup a ∈A E (cid:34) T − (cid:88) t =0 (cid:0)(cid:12)(cid:12) f t (cid:0) X Rt , a t (cid:1)(cid:12)(cid:12) + | f t ( X t , a t ) | (cid:1) I E c (cid:35) ≤ sup a ∈A E (cid:34) T − (cid:88) t =0 (cid:16) ξ ( R ) + B ( X t ) (cid:17) I E c (cid:35) = sup a ∈A E (cid:34)(cid:32) T − (cid:88) t =0 Y t (cid:33) I E c (cid:35) , (B.7)with Y t := ξ ( R ) + B ( X t ) , where the first inequality is by triangular inequality and Assumption 3and the third inequality is due to Part (ii) of Assumption 3. Applying Cauchy–Schwarz inequalitytwice gives I ≤ sup a ∈A E [ I E c ] · E (cid:32) T − (cid:88) t =0 Y t (cid:33) ≤ ( T − · sup a ∈A (cid:40) E [ I E c ] · E (cid:34) T − (cid:88) t =0 Y t (cid:35)(cid:41) ≤ ( T − · sup a ∈A (cid:40) E [ I E c ] · E (cid:34) T − (cid:88) t =0 (cid:0) ξ ( R ) + B ( X t ) (cid:1)(cid:35)(cid:41) ≤ √ T − · sup a ∈A (cid:40) E [ I E c ] · T − (cid:88) t =0 (cid:0) E [ ξ ( R )] + E [ B ( X t )] (cid:1)(cid:41) , where the third inequality follows because ( a + b ) ≤ a + 2 b for two real numbers a and b . Inview of Assumption 3, we get T − (cid:88) t =0 (cid:0) E [ ξ ( R )] + E [ B ( X t )] (cid:1) ≤ ( T − (cid:18) ξ ( R ) + sup a ∈A E [ B ( X t )] (cid:19) ≤ ( T − (cid:0) ξ ( R ) + ζ (cid:1) . Combing the last two displays with Assumption 2 implies I ≤ √ T − (cid:0) ξ ( R ) + ζ (cid:1) (cid:18) sup a ∈A E [ I E c ] (cid:19) = √ T − (cid:0) ξ ( R ) + ζ (cid:1) (cid:18) − inf a ∈A E [ I E ] (cid:19) ≤ ( T − (cid:113) (cid:0) ξ ( R ) + ζ (cid:1) E ( X , R ) . (B.8)42 similar argument gives I ≤ (cid:113) (cid:0) ξ ( R ) + ζ (cid:1) E ( X , R ) . (B.9)Combining (B.6), (B.8), and (B.9) together implies (cid:12)(cid:12) V ( X ) − ˜ V ( X ) (cid:12)(cid:12) ≤ T (cid:113) (cid:0) ξ ( R ) + ζ (cid:1) E ( X , R ) . The proof is complete.
B.3 Proof of Theorem 2
B.3.1 Preliminary lemmas
We first give the definitions of the "big $O_P$" and "small $o_P$" notations, which are commonplace in the statistical literature.
Definition 2. (i)
For two sequences of random variables { a M } M ∈ N and { b M } M ∈ N indexed by M ,we say a M = O P ( b M ) if lim k →∞ lim sup M →∞ P ( | a M | > kb M ) = 0 . (ii) Moreover, we say a M = o P ( b M ) if lim sup M →∞ P ( | a M | > kb M ) = 0 for all k > .Some Matrices Let h t ( x ) = (cid:32) sup a ∈ A t ( x ) φ (cid:0) K ( x, a ) (cid:1) , . . . , sup a ∈ A t ( x ) φ J (cid:0) K ( x, a ) (cid:1)(cid:33) (cid:124) , for x ∈ cl ( X R ),and we suppress its dependency on J . Define matricesΨ t = E (cid:104) h t (cid:16) X ( m ) t (cid:17) h (cid:124) t (cid:16) X ( m ) t (cid:17)(cid:105) and ˆΨ t = 1 M M (cid:88) m =1 h t (cid:16) X ( m ) t (cid:17) h (cid:124) t (cid:16) X ( m ) t (cid:17) for t = 1 , , . . . , T − (cid:124) denoting vector transpose. It is palpable that ˆΨ t is afinite-sample estimate for Ψ t . In the sequel, we denote λ max ( B ) (resp. λ min ( B )) as the largest (resp.smallest) eigenvalue of a square matrix B . We impose the following Assumption on the eigenvaluesof Ψ t . Assumption 5. (i)
For any fixed x and t , A t ( x ) is a compact set. Moreover, a (cid:55)−→ K ( x, a ) and φ j ( · ) : R r −→ R are continuous functions for ≤ j ≤ J . (ii) There exists a positive constant ¯ c Ψ independent of t and J such that λ max (Ψ t ) ≤ ¯ c Ψ < ∞ . Part (i) of the preceding assumption guarantees that the function h t ( · ) is well-defined for t ∈ T .The continuity requirement of a (cid:55)−→ K ( x, a ) can be removed if A t ( x ) is a lattice (discrete set), whichis particularly the case when the stochastic optimal control problem exhibits the Bang-bang solution,see, e.g., Azimzadeh and Forsyth (2015) and Huang and Kwok (2016). Part (ii) requires the largesteigenvalue of the matrix ˆΨ t does not blow up as M and J approach infinity. This condition ensuresthe sample eigenvalue converges to the non-sample counterpart as M approaches infinity as shownin the sequel Lemma 3. 43oreover, we define matricesΦ t = E (cid:104) φ (cid:16) X ( m ) t + (cid:17) φ (cid:124) (cid:16) X ( m ) t + (cid:17)(cid:105) and ˆΦ t = 1 M M (cid:88) m =1 φ (cid:16) X ( m ) t + (cid:17) φ (cid:124) (cid:16) X ( m ) t + (cid:17) . The following lemma relates the eigenvalues of ˆΦ t and ˆΨ t to those of Φ t and Ψ t . Lemma 3. (i)
Suppose Condition (ii) of Theorem 2 is satisfied. Then, (cid:12)(cid:12)(cid:12) λ max (Φ t ) − λ max (cid:16) ˆΦ t (cid:17)(cid:12)(cid:12)(cid:12) = O P (cid:16) Υ( J ) (cid:112) J/M (cid:17) , and (cid:12)(cid:12)(cid:12) λ min (Φ t ) − λ min (cid:16) ˆΦ t (cid:17)(cid:12)(cid:12)(cid:12) = O P (cid:16) Υ( J ) (cid:112) J/M (cid:17) , for t ∈ T . (ii) Suppose Assumption 5 holds. In addition, Condition (v) of Assumption 4 is satisfied. Then, λ max (cid:16) ˆΨ t (cid:17) = O P (1) for t = 1 , , . . . , T − .Proof of Lemma 3. Lemma 3 can be proved by a similar argument as that used in the proof of Eq.(A.1) in Newey (1997).The above lemma shows the sample eigenvalues converge to the non-sample counterparts as M approaches infinity. In view of Condition (iv) of Assumption 4, Lemma 3 also implies the largest(resp., smallest) eigenvalue of ˆΦ t is bounded from above (resp., below) with probability approaching1 as M −→ ∞ . This fact is exploited in the proofs of sequel Lemmas 4 and 5. Pseudo Estimate, Oracle, and True Estimate
Next, we introduce the concept of pseudo esti-mate . Let ¯ β t (resp. ˆ β t ) be the solution to the optimization problem in Eq. (31) with U ( m ) =˜ V t +1 (cid:16) X ( m ) t +1 (cid:17) (resp. ˜ V E t +1 (cid:16) X ( m ) t +1 (cid:17) ) and Z ( m ) = X ( m ) t + . Given ¯ β t and ˆ β t , denote the associated re-gression estimates by ˜ C PE t ( · ) = ¯ β (cid:124) t φ ( · ) and ˜ C E t ( · ) = ˆ β (cid:124) t φ ( · ), respectively. ˜ C PE t ( · ) is essentially thesieve estimate for the continuation function ˜ C t ( · ) when the true value function ˜ V t +1 ( · ) is employedin the regression. We further define function ˜ V PE t ( x ) for x ∈ ˚ X R by substituting ˜ C E t ( · ) in Eq. (25)with ˜ C PE t ( · ). For x ∈ ∂ X R , we set ˜ V PE t ( x ) = ˜ V t ( x ) with ˜ V t ( · ) given by Eq. (16).Admittedly, in the implementation of the BSBU algorithm, ¯ β t is not tractable because thetrue value function is unknown and should be replaced by the numerical estimate ˜ V E t +1 ( · ) obtainedinductively. For this reason, following Belomestny et al. (2010), we call ¯ β t the pseudo estimate .Despite this, the pseudo estimate plays an indispensable role in establishing the convergence resultof Theorem 2. In addition to the two estimates ¯ β t and ˆ β t defined in the above, we further definethe oracle ˜ β t as the solution to the optimization problem (A.1) with g ( · ) replaced by ˜ C t ( · ).The following lemma discloses that the gap between pseudo estimate and the oracle vanisheswhen both M and J increase at a certain rate. 44 emma 4. Suppose the conditions of Theorem 2 are satisfied. Then, (cid:13)(cid:13)(cid:13) ¯ β t − ˜ β t (cid:13)(cid:13)(cid:13) = O P (cid:16)(cid:112) J/M + ρ J (cid:17) , for t ∈ T . Proof of Lemma 4.
Recall that ¯ β t solves the optimization problem:min β ∈ R J M M (cid:88) m =1 (cid:104) ˜ V t +1 (cid:16) X ( m ) t +1 (cid:17) − β (cid:124) φ (cid:16) X ( m ) t + (cid:17)(cid:105) , subject to β (cid:124) φ ( · ) ∈ H J . On the other hand, ˜ β t is a suboptimal solution to the above optimization problem. Therefore, weget (cid:13)(cid:13) V t +1 − P ¯ β t (cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) V t +1 − P ˜ β t (cid:13)(cid:13)(cid:13) , where P is a M -by- J matrix with m -th row being φ (cid:124) (cid:16) X ( m ) t + (cid:17) and V t +1 is a M -by-1 vector with m -th element given by ˜ V t +1 (cid:16) X ( m ) t +1 (cid:17) .By adding and subtracting the term P ˜ β t in the L.H.S. of the above inequality, we get (cid:13)(cid:13) ¯ U − P ¯ δ (cid:13)(cid:13) ≤ (cid:13)(cid:13) ¯ U (cid:13)(cid:13) , where we use the shorthand notations ¯ δ := ¯ β t − ˜ β t and ¯ U := V t +1 − P ˜ β t . Expanding both sides ofthe above inequality gives (cid:13)(cid:13) P ¯ δ (cid:13)(cid:13) M ≤ (cid:12)(cid:12) ¯ U (cid:124) P ¯ δ (cid:12)(cid:12) M ≤ (cid:13)(cid:13) P (cid:124) ¯ U (cid:13)(cid:13) (cid:13)(cid:13) ¯ δ (cid:13)(cid:13) M , where the second inequality is by H¨older’s inequality. For the L.H.S. of the above inequality, itfollows from the definition of the smallest eigenvalue that (cid:13)(cid:13) P ¯ δ (cid:13)(cid:13) M = ¯ δ (cid:124) P (cid:124) P ¯ δ M ≥ (cid:13)(cid:13) ¯ δ (cid:13)(cid:13) λ min (cid:16) ˆΦ t (cid:17) . Combing the last two inequalities together implies (cid:13)(cid:13) ¯ δ (cid:13)(cid:13) λ min (cid:16) ˆΦ t (cid:17) ≤ M (cid:13)(cid:13) P (cid:124) ¯ U (cid:13)(cid:13) . It follows from Lemma 3 that the event (cid:110) c Φ / ≤ λ min (cid:16) ˆΦ t (cid:17)(cid:111) holds with probability approaching 1as M −→ ∞ . And therefore, (cid:13)(cid:13) ¯ δ (cid:13)(cid:13) ≤ (cid:0) /c Φ (cid:1) M − (cid:13)(cid:13) P (cid:124) ¯ U (cid:13)(cid:13) (B.10)holds with probability approaching 1 as M −→ ∞ .It follows as in Eq. (A.2) of Newey (1997, pp. 163) that M − (cid:13)(cid:13) P (cid:124) ¯ U (cid:13)(cid:13) = O P (cid:16)(cid:112) J/M + ρ J (cid:17) . This in conjunction with the last display proves the desired result. The proof is complete.The next lemma relates the discrepancy between the pseudo estimate ¯ β t and the true estimateˆ β t to the estimation error of the value function at the previous time step. Lemma 5.
Suppose the conditions of Theorem 2 are satisfied. Then, for t ∈ T , there exists a onstant ψ > independent of t , R and J such that (cid:13)(cid:13)(cid:13) ¯ β t − ˆ β t (cid:13)(cid:13)(cid:13) ≤ (cid:114) ψM (cid:13)(cid:13)(cid:13) V t +1 − ˆ V t +1 (cid:13)(cid:13)(cid:13) + O P (cid:16)(cid:112) ψρ J (cid:17) holds with probability approaching 1 as M −→ ∞ , where V t +1 and ˆ V t +1 are two M -by-1 vectorswith m -th element given by ˜ V t +1 (cid:16) X ( m ) t +1 (cid:17) and ˜ V E t +1 (cid:16) X ( m ) t +1 (cid:17) , respectively.Proof of Lemma 5. Using the argument as in the proof of inequality (B.10), we obtain (cid:13)(cid:13)(cid:13) ˆ δ (cid:13)(cid:13)(cid:13) ≤ (cid:0) /c Φ (cid:1) M − (cid:13)(cid:13)(cid:13) P (cid:124) ˆ U (cid:13)(cid:13)(cid:13) , holds with probability approaching 1 as M −→ ∞ , where we adopt shorthand notations ˆ δ := ¯ β t − ˆ β t and ˆ U := ˆ V t +1 − P ¯ β t . On the other hand, it follows from Lemma 3 that M − (cid:13)(cid:13)(cid:13) P (cid:124) ˆ U (cid:13)(cid:13)(cid:13) = M − ˆ U (cid:124) (cid:0) M − P P (cid:124) (cid:1) ˆ U ≤ M − λ max (cid:16) ˆΦ t (cid:17) (cid:13)(cid:13)(cid:13) ˆ U (cid:13)(cid:13)(cid:13) ≤ M − c Φ (cid:13)(cid:13)(cid:13) ˆ U (cid:13)(cid:13)(cid:13) holds with probability approaching 1 as M −→ ∞ .Combing the above two inequalities implies (cid:13)(cid:13)(cid:13) ˆ δ (cid:13)(cid:13)(cid:13) ≤ (cid:113)(cid:0) c Φ /c (cid:1) M (cid:13)(cid:13)(cid:13) ˆ U (cid:13)(cid:13)(cid:13) := (cid:114) ψM (cid:13)(cid:13)(cid:13) ˆ U (cid:13)(cid:13)(cid:13) . By adding and subtracting the term V t +1 in the R.H.S. of the above inequality, we get (cid:13)(cid:13)(cid:13) ˆ U (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) ˆ V t +1 − V t +1 + V t +1 − P ¯ β t (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) ˆ V t +1 − V t +1 (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13) V t +1 − P ¯ β t (cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) ˆ V t +1 − V t +1 (cid:13)(cid:13)(cid:13) + O (cid:16) √ M ρ J (cid:17) where the last equality is guaranteed by Part (ii) of Assumption 4. Combing the last two inequalitiesimplies (cid:13)(cid:13)(cid:13) ˆ δ (cid:13)(cid:13)(cid:13) ≤ (cid:114) ψM (cid:13)(cid:13)(cid:13) ˆ V t +1 − V t +1 (cid:13)(cid:13)(cid:13) + O P (cid:16)(cid:112) ψρ J (cid:17) . holds with probability approaching 1 as M −→ ∞ . This proves Lemma 5.The statement of the above lemma is not hard to expect because the primary difference betweenthe pseudo estimate and the true estimate stems from the the estimation error of value function.The final lemma quantifies the discrepancy between the value function and its numerical estimateunder the empirical L norm. Lemma 6.
Let F Xt ( · ) be the probability distribution function of X ( m ) t for t = 1 , , . . . , T − . Supposethe assumptions of Theorem 2 hold. Then M − (cid:13)(cid:13)(cid:13) V t − ˆ V t (cid:13)(cid:13)(cid:13) = O P (cid:0) ψ T − t − (cid:0) J/M + ρ J (cid:1)(cid:1) , for t = 1 , , . . . , T − , (B.11) where V t and ˆ V t are two M -by- vectors with m -th element being ˜ V t (cid:16) X ( m ) t (cid:17) and ˜ V E t (cid:16) X ( m ) t (cid:17) ,respectively. roof of Lemma 6. We use a backward induction procedure to prove the statement of Lemma 6.For t = T −
1, we note that ˜ C E T − ( · ) is in agreement with ˜ C PE T − ( · ) because ˜ V E T ( x ) = ˜ V T ( x ) = G ( x )for x ∈ cl ( X R ). We get ˜ V E T − ( x ) = ˜ V PE T − ( x ) for x ∈ cl ( X R ), accordingly. Furthermore, we observethat (cid:12)(cid:12)(cid:12) ˜ V E T − ( x ) − ˜ V T − ( x ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) ˜ V PE T − ( x ) − ˜ V T − ( x ) (cid:12)(cid:12)(cid:12) ≤ sup a ∈ A T − ( x ) (cid:12)(cid:12)(cid:12) ˜ C PE T − (cid:0) K ( x, a ) (cid:1) − ˜ C T − (cid:0) K ( x, a ) (cid:1)(cid:12)(cid:12)(cid:12) = sup a ∈ A T − ( x ) (cid:12)(cid:12)(cid:12) ¯ β (cid:124) T − φ (cid:0) K ( x, a ) (cid:1) − ˜ C T − (cid:0) K ( x, a ) (cid:1)(cid:12)(cid:12)(cid:12) ≤ sup a ∈ A T − ( x ) (cid:12)(cid:12)(cid:12)(cid:16) ¯ β T − − ˜ β T − (cid:17) (cid:124) φ (cid:0) K ( x, a ) (cid:1)(cid:12)(cid:12)(cid:12) + sup a ∈ A T − ( x ) (cid:12)(cid:12)(cid:12) ˜ β (cid:124) T − φ (cid:0) K ( x, a ) (cid:1) − ˜ C T − (cid:0) K ( x, a ) (cid:1)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:16) ¯ β T − − ˜ β T − (cid:17) (cid:124) h T − ( x ) (cid:12)(cid:12)(cid:12) + (cid:13)(cid:13)(cid:13) ˜ β (cid:124) T − φ − ˜ C T − (cid:13)(cid:13)(cid:13) ∞ = (cid:12)(cid:12)(cid:12)(cid:16) ¯ β T − − ˜ β T − (cid:17) (cid:124) h T − ( x ) (cid:12)(cid:12)(cid:12) + O ( ρ J ) , (B.12)where the third inequality is by the definition of function h T − ( · ) and the last equality is guaranteedby Assumption 4.Consequently, we obtain M − (cid:13)(cid:13)(cid:13) V T − − ˆ V T − (cid:13)(cid:13)(cid:13) = 1 M M (cid:88) m =1 (cid:12)(cid:12)(cid:12) ˜ V E T − (cid:16) X ( m ) T − (cid:17) − ˜ V T − (cid:16) X ( m ) T − (cid:17)(cid:12)(cid:12)(cid:12) ≤ M M (cid:88) m =1 (cid:12)(cid:12)(cid:12)(cid:16) ¯ β T − − ˜ β T − (cid:17) (cid:124) h T − (cid:16) X ( m ) T − (cid:17)(cid:12)(cid:12)(cid:12) + O (cid:0) ρ J (cid:1) = (cid:16) ¯ β T − − ˜ β T − (cid:17) (cid:124) ˆΨ T − (cid:16) ¯ β T − − ˜ β T − (cid:17) + O (cid:0) ρ J (cid:1) ≤ λ max (cid:16) ˆΨ T − (cid:17) (cid:13)(cid:13)(cid:13) ¯ β T − − ˜ β T − (cid:13)(cid:13)(cid:13) + O (cid:0) ρ J (cid:1) = O P (cid:0) J/M + ρ J (cid:1) , (B.13)where the second inequality follows from the definition of the largest eigenvalue of a matrix and thelast equality is guaranteed by Lemma 4 and Lemma 3. In view of the above display, Eq. (B.11)holds for t = T − t + 1. Note that, for x ∈ ˚ X R , (cid:12)(cid:12)(cid:12) ˜ V t ( x ) − ˜ V E t ( x ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) ˜ V t ( x ) − ˜ V PE t ( x ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) ˜ V E t ( x ) − ˜ V PE t ( x ) (cid:12)(cid:12)(cid:12) . (B.14)An argument similar to the one used in establishing (B.13) shows that1 M M (cid:88) m =1 (cid:12)(cid:12)(cid:12) ˜ V PE t (cid:16) X ( m ) t (cid:17) − ˜ V t (cid:16) X ( m ) t (cid:17)(cid:12)(cid:12)(cid:12) = O P (cid:0) J/M + ρ J (cid:1) . (B.15)47ext, we investigate the term (cid:12)(cid:12)(cid:12) ˜ V E t ( x ) − ˜ V PE t ( x ) (cid:12)(cid:12)(cid:12) . Observe that (cid:12)(cid:12)(cid:12) ˜ V E t ( x ) − ˜ V PE t ( x ) (cid:12)(cid:12)(cid:12) ≤ sup a ∈ A t ( x ) (cid:12)(cid:12)(cid:12) ˜ C E t (cid:0) K ( x, a ) (cid:1) − ˜ C PE t (cid:0) K ( x, a ) (cid:1)(cid:12)(cid:12)(cid:12) = sup a ∈ A t ( x ) (cid:12)(cid:12)(cid:12)(cid:16) ˆ β t − ¯ β t (cid:17) (cid:124) φ (cid:0) K ( x, a ) (cid:1)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:16) ˆ β t − ¯ β t (cid:17) (cid:124) h t (cid:0) x (cid:1)(cid:12)(cid:12)(cid:12) . 
(B.16)We adopt the same argument as in the proof of (B.13) to get1 M M (cid:88) m =1 (cid:12)(cid:12)(cid:12) ˜ V E t (cid:16) X ( m ) t (cid:17) − ˜ V PE t (cid:16) X ( m ) t (cid:17)(cid:12)(cid:12)(cid:12) ≤ λ max (cid:16) ˆΨ t (cid:17) (cid:13)(cid:13)(cid:13) ˆ β t − ¯ β t (cid:13)(cid:13)(cid:13) . Applying Lemma 5 yields1 M M (cid:88) m =1 (cid:12)(cid:12)(cid:12) ˜ V E t (cid:16) X ( m ) t (cid:17) − ˜ V PE t (cid:16) X ( m ) t (cid:17)(cid:12)(cid:12)(cid:12) ≤ λ max (cid:16) ˆΨ t (cid:17) (cid:20) ψM (cid:13)(cid:13)(cid:13) V t +1 − ˆ V t +1 (cid:13)(cid:13)(cid:13) + O (cid:0) ψρ J (cid:1)(cid:21) = O P (cid:0) ψ T − t − (cid:0) J/M + ρ J (cid:1)(cid:1) , where the last equality is due to induction hypothesis (B.11) and λ max (cid:16) ˆΨ t (cid:17) = O P (1) (see Lemma3). The above display in conjunction with (B.14) and (B.15) implies M − (cid:13)(cid:13)(cid:13) V t − ˆ V t (cid:13)(cid:13)(cid:13) = O P (cid:0) J/M + ρ J (cid:1) + O P (cid:0) ψ T − t − (cid:0) J/M + ρ J (cid:1)(cid:1) = O P (cid:0) ψ T − t − (cid:0) J/M + ρ J (cid:1)(cid:1) . This completes the proof.
B.3.2 Proof of the Main Result
Proof of Theorem 2.
Following the arguments used to prove (B.12), we get (cid:12)(cid:12)(cid:12) ˜ V PE0 ( X ) − ˜ V ( X ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:16) ¯ β − ˜ β (cid:17) (cid:124) h ( X ) (cid:12)(cid:12)(cid:12) + O ( ρ J ) = O P (cid:16)(cid:112) J/M + ρ J (cid:17) , where the last equality is by Lemma 4 and Part (ii) of Assumption 5.On the other hand, an argument similar to the one used in deriving (B.16) shows (cid:12)(cid:12)(cid:12) ˜ V PE0 ( X ) − ˜ V E0 ( X ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:16) ˆ β − ¯ β (cid:17) (cid:124) h ( X ) (cid:12)(cid:12)(cid:12) ≤ M − / (cid:13)(cid:13)(cid:13) V − ˆ V (cid:13)(cid:13)(cid:13) (cid:107) h ( X ) (cid:107) . The above two displays in conjunction with (B.11) implies (cid:12)(cid:12)(cid:12) ˜ V ( X ) − ˜ V E0 ( X ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) ˜ V PE0 ( X ) − ˜ V ( X ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) ˜ V PE0 ( X ) − ˜ V E0 ( X ) (cid:12)(cid:12)(cid:12) = O P (cid:18)(cid:113) ψ T − ( J/M + ρ J ) (cid:19) ..