Rolling Horizon Policies in Multistage Stochastic Programming
Murwan Siddig
Department of Industrial Engineering, Clemson University, Clemson, SC, USA 29631
[email protected]

Yongjia Song
Department of Industrial Engineering, Clemson University, Clemson, SC, USA 29631
[email protected]

Amin Khademi
Department of Industrial Engineering, Clemson University, Clemson, SC, USA 29631
[email protected]
February 10, 2021

Abstract
Multistage Stochastic Programming (MSP) is a class of models for sequential decision-making under uncertainty. MSP problems are known for their computational intractability: the sequential nature of the decision-making structure and the uncertainty in the problem data give rise to the so-called curse of dimensionality. A common approach to tackle MSP problems with a large number of stages is a rolling-horizon (RH) procedure, where one solves a sequence of MSP problems with a smaller number of stages. This leads to a delicate issue of how many stages to include in the smaller problems used in the RH procedure. This paper addresses this question for both finite and infinite horizon MSP problems. For the infinite horizon case with discounted costs, we derive a bound which can be used to prescribe an $\epsilon$-sufficient number of stages. For the finite horizon case, we propose a heuristic approach from the perspective of approximate dynamic programming to provide a sufficient number of stages for each roll in the RH procedure. Our numerical experiments on a hydrothermal power generation planning problem show the effectiveness of the proposed approaches.

Keywords: Multistage stochastic programming · Rolling horizon · Stochastic dual dynamic programming · Approximate dynamic programming
Multistage Stochastic Programming (MSP) is a class of decision-making models where the decision-maker (DM) may adapt and control the behavior of a probabilistic system sequentially over multiple stages. The goal of the DM is to choose a decision policy that leads to optimal system performance with respect to some performance criterion, e.g., maximizing the expected profit or minimizing the expected cost. MSP models have many real-life applications; these include (but are not limited to) energy [1, 2, 3, 4, 5, 6], finance [7, 8, 9, 10], transportation [11, 12, 13] and sports [14], among others.

Two main challenges arise from solving MSP problems: (i) the uncertainty in the problem data; and (ii) the nested form of the sequential decision-making structure of the problem. A typical approach to tackle these challenges is to approximate the underlying stochastic process using scenario trees to address the former, and to make use of the dynamic programming principle via the so-called
Bellman equations [15] to address the latter. This approach, however, can quickly become impractical, especially with an increase in the dimensionality of the state variables ($s_t$) of the system and the number of stages ($T$) in the planning horizon. This is known as the curse of dimensionality, and it gives rise to an exponential growth in the computational resources required to solve such large MSP problems under certain optimality guarantees. In this paper, we are concerned with the computational tractability of MSP problems with a large number of stages $T$, including the special case when $T = \infty$.

In practice, a typical workaround for the computational intractability of MSP problems with a large number of stages is to solve a sequence of MSP problems in an “online” fashion, each of which has a smaller number of stages. This approach is the so-called rolling horizon (RH) procedure, and the (smaller) number of stages used in each roll of the RH procedure is referred to as the forecast horizon [16, 17]. In this procedure, the DM fixes a forecast horizon $\tau \le T$, solves the corresponding $\tau$-stage problem, implements the decisions only for the current period, rolls forward one period, and repeats the process starting from a new initial state. Nevertheless, because the length of the forecast horizon $\tau$ employed by the RH procedure is typically smaller than the actual planning horizon of the original MSP problem, the resulting decision policy, which is referred to as the look-ahead policy, may not be optimal. It is conceivable that the degree of suboptimality of a look-ahead policy is tied to the choice of the number of stages $\tau$ used in the RH procedure. This leads to a very delicate issue of how to choose a small enough forecast horizon with a tolerable optimality gap, which we refer to as a sufficient forecast horizon $\tau^*$ (see Definition 1). The goal of this paper is to address the issue of finding $\tau^*$ in an RH procedure.

In the literature, there is a good deal of work on the RH procedure aiming to achieve a balance between computational efficiency and the corresponding policy performance. In [18], for instance, instead of solving a sequence of MSPs with small forecast horizons, the authors use a deterministic approximation of the MSP in the RH procedure based on the point forecasts of all the exogenous uncertain future information. The authors also provide bounds on the suboptimality of the corresponding solution, and [19] extend these bounds to a stochastic RH approximation using what is known as reference scenarios. [20] use a partial approximation approach which uses the point forecasts of exogenous future information only beyond a certain number of stages until the end of the horizon, in order to truncate the number of stages considered in the MSPs. Another common approach is the so-called stage aggregation, where the DM aggregates all the decisions to be made in multiple stages into a single decision stage (see the discussion in [21]). This approach is used by [22], who introduced the partially adaptive MSP model, where stage aggregations only occur after a certain period until the end of the horizon.

Unlike these prior works, we do not consider aggregating any of the future exogenous uncertain information nor the stages. Instead, we consider solving MSP problems with a smaller number of stages and provide a systematic way of detecting a sufficient forecast horizon $\tau^*$. We focus on finite/infinite horizon MSP problems where the stochastic process characterizing the data uncertainty is stationary.
In the infinite horizon discounted case, given a fixed forecast horizon $\tau$, we show that the resulting optimality gap associated with the (static) look-ahead policy, in terms of the total expected discounted reward/cost, can be bounded by a function of the forecast horizon $\tau$. This function can be used to derive a forecast horizon $\tau^*_\epsilon$ which achieves a prescribed optimality gap $\epsilon$. In the finite horizon case (with no discount), we take an approximate dynamic programming (ADP) [23] perspective and develop a dynamic look-ahead policy, where the forecast horizon $\tau$ to use in each roll is chosen dynamically according to the state of the system at that stage.

The rest of this paper is organized as follows. In Section 2, we describe in detail the key components of an MSP model, introduce the necessary mathematical notation, and discuss some basic assumptions and solution approaches. In Section 3, we provide a detailed description of the RH procedure and the main theoretical result for the infinite horizon discounted case. In Section 4, we present the ADP-based heuristic approach for constructing the dynamic look-ahead policy for the finite horizon case. An extensive numerical analysis of the proposed approaches on a multi-period hydrothermal power generation planning problem is presented in Section 5. Finally, in Section 6, we conclude with some final remarks.

This section is organized as follows. First, we introduce a generic nested formulation for a finite horizon MSP problem, its dynamic programming (DP) counterpart, and some basic assumptions. Second, we discuss the solution methodology based on the DP formulation under these basic assumptions. Finally, we describe a stationary analog for discounted infinite horizon MSP problems.
Consider a generic nested formulation for a finite horizon MSP problem as follows:
$$\min_{x_1 \in \mathcal{X}_1(x_0,\xi_1)} f_1(x_1,\xi_1) + \mathbb{E}_{|\xi_{[1]}}\Bigg[\min_{x_2 \in \mathcal{X}_2(x_1,\xi_2)} f_2(x_2,\xi_2) + \mathbb{E}_{|\xi_{[2]}}\bigg[\cdots + \mathbb{E}_{|\xi_{[T-1]}}\bigg[\min_{x_T \in \mathcal{X}_T(x_{T-1},\xi_T)} f_T(x_T,\xi_T)\bigg]\bigg]\Bigg]. \tag{1}$$
Here, the initial vector $x_0$ and initial data $\xi_1$ are assumed to be deterministic, and the exogenous random data is modeled by a stochastic process $(\xi_2, \ldots, \xi_T) \in \Xi_2 \times \cdots \times \Xi_T$, where each random vector $\xi_t$ is associated with a known probability distribution $D_t$ supported on a set $\Xi_t \subset \mathbb{R}^n$. Moreover, we denote the history of the stochastic process up to stage $t$ by $\xi_{[t]} := (\xi_1, \ldots, \xi_t)$. The decision variable $x_t$ is also referred to as the pre-decision state, and the state of the system $s_t$ in stage $t$ is fully resolved after the realization of the random vector $\xi_t$ is observed. Here, $s_t := S_t(x_t, \xi_t)$ with $S_t : \mathcal{X}_t \times \Xi_t \to \mathbb{R}^m$.

In its present form, this formulation has two main challenges: (i) the nested optimization posed by the sequential nature of the decision-making structure; and (ii) the (conditional) expectation posed by the stochastic nature of the problem. The first challenge can be addressed by using a DP perspective via the Bellman equations, and the computation simplifies dramatically under the following stage-wise independence assumption.

Assumption 1 (Stage-wise independence). We assume that the stochastic process $\{\xi_t\}$ is stage-wise independent, i.e., $\xi_t$ is independent of the history of the stochastic process up to time $t-1$, which is given by $\xi_{[t-1]}$, for $t = 1, 2, \ldots, T$.

Under this assumption, problem (1) can be written as:
$$\min_{x_1 \in \mathcal{X}_1(x_0,\xi_1)} f_1(x_1,\xi_1) + \mathcal{Q}_2(x_1), \tag{2}$$
where, for $t = 1, \ldots, T-1$, $\mathcal{Q}_{t+1}(x_t) := \mathbb{E}[Q_{t+1}(x_t, \xi_{t+1})]$ is referred to as the expected cost-to-go function,
$$Q_t(x_{t-1}, \xi_t) := \min_{x_t \in \mathcal{X}_t(x_{t-1},\xi_t)} f_t(x_t, \xi_t) + \mathcal{Q}_{t+1}(x_t), \tag{3}$$
and $\mathcal{Q}_{T+1}(x_T) := 0$. As for the second challenge, a typical approach is to proceed by means of discretization and/or to approximate the (discretized) distribution of $\xi_t$ by a sample $N_t$ obtained using sampling techniques such as Monte Carlo methods. This is, for instance, the case in which (2) is a sample average approximation (SAA), where the expectation $\mathbb{E}[Q_{t+1}(x_t, \xi_{t+1})]$ is approximated by a sample average. We refer the reader to [24] for a detailed discussion on this topic. Before we discuss the solution methodology based on the DP formulation (2), we make the following additional assumptions.

Assumption 2 (Relatively complete recourse). We assume that $\mathcal{X}_t(x_{t-1}, \xi_t) \neq \emptyset$ for all $x_{t-1} \in \mathcal{X}_{t-1}$ and $\xi_t \in \Xi_t$, for all $t = 1, \ldots, T$.

Assumption 3 (Boundedness). We assume that the immediate cost function $f_t(\cdot, \cdot)$ at each stage $t$ is bounded, i.e., there exists $\kappa > 0$ such that $|f_t(x_t, \xi_t)| \le \kappa$ for all $(x_t, \xi_t) \in \mathcal{X}_t \times \Xi_t$ and all $t = 1, \ldots, T$.

Assumption 4. We consider multistage stochastic linear programs (MSLPs), i.e., $f_t(x_t, \xi_t)$ in (1) is defined as
$$f_t(x_t, \xi_t) := \begin{cases} \tilde{c}_t^{\top} x_t & \text{if } x_t \ge 0, \ \forall t = 1, \ldots, T \\ +\infty & \text{otherwise,} \end{cases} \tag{4}$$
and $\mathcal{X}_t(x_{t-1}, \xi_t) := \{x_t \in \mathbb{R}^m_+ \mid \tilde{A}_t x_t + \tilde{B}_t x_{t-1} = \tilde{b}_t\}$ with $\xi_t = (\tilde{c}_t, \tilde{A}_t, \tilde{B}_t, \tilde{b}_t)$ for $t = 1, \ldots, T$.

Assumption 2 and Assumption 4 are made for simplicity – the proposed approaches can potentially be used for more general classes of MSPs. Assumption 3 is a technical assumption made without loss of generality.
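To make the setting concrete, the following is a minimal sketch (not the implementation used in this paper) of a single stage problem (3) under Assumption 4, written in Julia with JuMP. The expected cost-to-go is represented by an epigraph variable bounded below by a collection of cuts, anticipating the cutting-plane approximation introduced next; all data passed to the function are hypothetical placeholders, and the open-source HiGHS solver is used here instead of a commercial one.

```julia
using JuMP, HiGHS, LinearAlgebra

# Solve one stage problem of the form (3)-(4): min c'x + θ  s.t.  A x + B x_prev = b, x ≥ 0,
# where θ models the expected cost-to-go from below via cuts θ ≥ β'x + α.
function stage_problem(x_prev, c, A, B, b, cuts)
    model = Model(HiGHS.Optimizer)
    set_silent(model)
    n = length(c)
    @variable(model, x[1:n] >= 0)
    @variable(model, θ >= 0)                        # valid lower bound when stage costs are nonnegative
    @constraint(model, A * x .== b .- B * x_prev)   # dynamics Ã_t x_t + B̃_t x_{t-1} = b̃_t
    for (β, α) in cuts
        @constraint(model, θ >= dot(β, x) + α)      # cutting planes on the cost-to-go
    end
    @objective(model, Min, dot(c, x) + θ)
    optimize!(model)
    return value.(x), objective_value(model)
end

# Toy usage with made-up data and no cuts yet (θ then stays at its lower bound).
x, val = stage_problem([1.0], [2.0, 1.0], [1.0 1.0], reshape([0.5], 1, 1), [2.0], [])
```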
Under Assumption 4, a backward induction argument can be used to show that the expected cost-to-go function $\mathcal{Q}_t(x_{t-1})$ is piecewise linear and convex with respect to $x_{t-1}$, for all $t = 1, \ldots, T$. Therefore, $\mathcal{Q}_t(x_{t-1})$ can be approximated from below by an outer cutting-plane approximation $\check{\mathcal{Q}}_t(x_{t-1})$, which can be represented as the maximum of a collection of $L$ cutting planes (cuts):
$$\check{\mathcal{Q}}_t(x_{t-1}) = \max_{\ell \in L} \left\{ \beta_{t,\ell}^{\top} x_{t-1} + \alpha_{t,\ell} \right\}. \tag{5}$$
A common approach for assembling these cuts is the SDDP algorithm [25]. Drawing influence from the backward recursion technique developed in DP, the SDDP algorithm alternates between two main steps: (i) a forward simulation (forward pass), which uses the current approximate expected cost-to-go functions $\check{\mathcal{Q}}_{t+1}(\cdot)$ to generate a sequence of decisions $\check{x}_t := \check{x}_t(\xi_t)$, for $t = 2, \ldots, T$; and (ii) a backward recursion (backward pass) to improve the approximations $\check{\mathcal{Q}}_{t+1}(\cdot)$, for $t = 2, \ldots, T$. After the forward step, a statistical upper bound for the optimal value of (2) can be computed; and after the backward step, an improved lower bound for the optimal value of (2) is obtained. We summarize the forward and backward steps of the algorithm next; the reader is referred to [25] for a detailed discussion of this topic.

Given the initial (pre-decision) state $x_0$, the initial realization $\xi_1$, and the current approximations of the expected cost-to-go functions $(\check{\mathcal{Q}}_2(\cdot), \ldots, \check{\mathcal{Q}}_T(\cdot))$, the SDDP algorithm proceeds as follows.

• Forward pass.
1. Initialize a list of candidate solutions $(\check{x}_1, \ldots, \check{x}_T)$; set $t = 1$, $\check{x}_{t-1} = x_0$, $\check{\xi}_t = \xi_1$.
2. Solve $\check{Q}_t(\check{x}_{t-1}, \check{\xi}_t)$, i.e., (3) with $\check{\mathcal{Q}}_{t+1}(\cdot)$ in place of $\mathcal{Q}_{t+1}(\cdot)$, to obtain $x^*_t$, and then set $\check{x}_t = x^*_t$.
3. If $t = T$, go to the Backward pass. Otherwise, let $t \leftarrow t+1$, sample a new realization $\xi_{t+1} \in \Xi_{t+1}$ according to its probability distribution, set $\check{\xi}_{t+1} = \xi_{t+1}$, and go to step 2.

• Backward pass.
1. Given $(\check{x}_1, \ldots, \check{x}_T)$, for $t = T, \ldots, 2$ do the following.
2. For every $\xi_t \in \Xi_t$, solve $\check{Q}_t(\check{x}_{t-1}, \xi_t)$ with $\check{\mathcal{Q}}_{t+1}(\cdot)$ as in (3). Add a cut, weighted by the respective probability $P(\xi_t \in \Xi_t)$, to $\check{\mathcal{Q}}_t(\cdot)$ (if it is violated by $\check{x}_{t-1}$).
3. If the termination criterion is met, STOP and return the expected cost-to-go functions $\check{\mathcal{Q}}_t(\cdot)$, $\forall t = 2, \ldots, T$. Otherwise, go to the Forward pass and repeat.

While the SDDP algorithm has become a popular approach for solving MSP problems, using SDDP (and other similar decomposition schemes) can become impractical as the dimensionality of the problem increases. In the literature, there have been some recent advances on how some of this impracticality can be mitigated by (static/dynamic) scenario reduction and aggregation techniques; see, e.g., [26] and the references therein. Nevertheless, even when using such enhancement techniques, the viability of these approaches can easily be undermined in MSP problems where the number of stages $T$ in the planning horizon is large.

Consider an infinite horizon MSP problem of the form (1) introduced in [27], where $T = \infty$ and the stochastic process has a periodic behavior with $m \in \mathbb{N}$ periods. To simplify the presentation, we only consider the risk-neutral and stationary case where the number of periods is $m = 1$. To that end, we make the following stationarity assumption.

Assumption 5 (Stationarity). The random vectors $\xi_t$ have the same distribution with support $\Xi \subset \mathbb{R}^m$, and the functions $f_t(\cdot, \cdot)$, $A_t(\cdot)$, $B_t(\cdot)$, $b_t(\cdot)$ are identical for all $t = 2, 3, \ldots, T$.

Unlike the finite horizon case, where the solvability of the DP formulation (3) is guaranteed by the finiteness of the terminal stage value function, infinite horizon MSP problems require an evaluation of an infinite sequence of costs for every state $s_t$. As such, it is necessary to establish some notion of finiteness for the expected cost-to-go functions $\mathcal{Q}_t(\cdot)$. One common approach for doing this is to introduce a discount factor $\gamma \in (0,1)$ and find a policy $\pi = \{x^{\pi}_t\}_{t=1}^{T}$ which minimizes the expected total discounted cost, that is,
$$\min_{\pi \in \Pi} \ f_1(x^{\pi}_1, \xi_1) + \lim_{T \to \infty} \mathbb{E}\left[\sum_{t=2}^{T} \gamma^{t-1} f_t(x^{\pi}_t, \xi_t)\right]. \tag{6}$$
Note that, since the $\xi_t$'s are assumed to follow the same probability distribution for all $t \ge 2$, we can omit the stage subscript $t$ and obtain the following discounted infinite horizon stationary variant of the DP formulation (3):
$$Q(x, \xi) := \min_{x' \in \mathcal{X}(x,\xi)} f(x', \xi) + \gamma\, \mathcal{Q}(x'), \tag{7}$$
where $\mathcal{Q}(x) := \mathbb{E}[Q(x, \xi)]$ and the random vector $\xi$ has the same distribution as the $\xi_t$'s.

The popularity of these discounted models for infinite horizon MSPs stems from their numerical solvability and their wide range of direct applications in real-life economic problems where the DM accounts for the time value of future costs. This solvability, however, relies on the existence of a unique fixed-point solution for the function $\mathcal{Q}(\cdot)$ satisfying (7). Such existence can be asserted using the Banach fixed-point theorem. We summarize this result next; readers are referred to [27] for a detailed discussion. To that end, consider the following:

• $\mathcal{B} := \mathcal{B}(\mathcal{X})$: the Banach space of bounded functions $g := g(x)$, $g: \mathcal{X} \to \mathbb{R}$, equipped with the sup-norm $\|g\|_{\mathcal{B}} = \sup_{x \in \mathcal{X}} |g(x)|$.

• A mapping $T : \mathcal{B} \to \mathcal{B}$ defined as $T(g)(x) := \mathbb{E}[\psi_g(x, \xi)]$, where
$$\psi_g(x, \xi) := \min_{x' \in \mathcal{X}(x,\xi)} \left\{ f(x', \xi) + \gamma g(x') \right\}. \tag{8}$$

Proposition 1 (Contraction mapping). The mapping $T$ is a contraction mapping, that is, for every $g, g' \in \mathcal{B}$ the following inequality holds:
$$\|T(g) - T(g')\|_{\mathcal{B}} \le \gamma \|g - g'\|_{\mathcal{B}}. \tag{9}$$

Theorem 1 ([27]). Under Assumptions 2, 3 and 5, the following results hold:
• Uniqueness: there exists a unique function $Q(\cdot, \cdot)$ satisfying the DP formulation (7).
• Convergence: for any $g_0 \in \mathcal{B}$, the sequence of functions $\{g_k\}_{k=0}^{\infty}$ obtained iteratively by
$$g_k(x) = T(g_{k-1})(x) = \mathbb{E}[\psi_{g_{k-1}}(x, \xi)] := \mathbb{E}\left[\min_{x_k \in \mathcal{X}(x,\xi)} \left\{ f(x_k, \xi) + \gamma g_{k-1}(x_k) \right\}\right] \quad \forall x \in \mathcal{X}, \tag{10}$$
converges (in the norm $\|\cdot\|_{\mathcal{B}}$) to $\mathcal{Q}(\cdot)$.
• Convexity: if the set $\mathcal{X}$ is convex and the function $f(x, \xi)$ is convex in $x \in \mathcal{X}$, then the function $\mathcal{Q} : \mathcal{X} \to \mathbb{R}$ is also convex.

The uniqueness and convergence assertions of Theorem 1 imply that the unique solution $g^* = T(g^*)$ obtained iteratively using (10) is the optimal solution to the DP formulation (7). Moreover, under the convexity assertion, [27] show that $\mathcal{Q}(x)$ is piecewise linear and convex with respect to $x$, and provide an SDDP-type cutting-plane algorithm which gives an optimal policy $\pi^{\infty}$ for the corresponding infinite horizon problem (6). We refer the reader to [27] for the proofs of Proposition 1 and Theorem 1, a detailed description of the SDDP-type algorithm, and further discussions on the topic.

The RH procedure is a well-known approach to construct online decision policies for sequential decision making. In this section, our goal is to provide an error bound on the suboptimality of the online policy corresponding to the RH procedure for solving the stationary discounted infinite horizon MSP model (7). To that end, let us first discuss the RH procedure and clarify what is meant by an online policy.
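As a concrete illustration of the contraction property in Proposition 1 and the fixed-point iteration (10), the following Julia sketch iterates the mapping $T$ on a small discretized state space. The grid, the toy cost function, and the feasibility rule are hypothetical placeholders and are not taken from the paper; the point is only that the sup-norm error contracts at least by a factor $\gamma$ per pass.

```julia
# Toy fixed-point iteration g_k = T(g_{k-1}) from (8)-(10) on a discretized state space.
function value_iteration(; γ = 0.9, tol = 1e-8, maxit = 1_000)
    states = collect(0.0:0.1:1.0)                 # discretized state space X
    ξs = [0.2, 0.5, 0.8]                          # equally likely realizations of ξ
    f(x′, ξ) = (x′ - ξ)^2                         # toy bounded immediate cost
    feasible(x, x′) = abs(x′ - x) ≤ 0.3           # toy feasible set X(x, ξ)

    g = zeros(length(states))                     # g_0 ≡ 0
    for k in 1:maxit
        g_new = similar(g)
        for (i, x) in enumerate(states)
            # T(g)(x) = E[ min_{x′ ∈ X(x,ξ)} f(x′, ξ) + γ g(x′) ]
            g_new[i] = sum(minimum(f(states[j], ξ) + γ * g[j]
                                   for j in eachindex(states) if feasible(x, states[j]))
                           for ξ in ξs) / length(ξs)
        end
        err = maximum(abs.(g_new .- g))           # sup-norm ‖g_k − g_{k−1}‖
        g = g_new
        if err < tol
            return g, k                           # contracts at least by γ per pass (Proposition 1)
        end
    end
    return g, maxit
end
```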
To demonstrate what is meant by an online policy, it might be instructive to first revisit and analyze the form of what is considered an offline policy. We consider the offline policy provided by the standard implementation of the SDDP algorithm described in Subsection 2.2. Recall that the final output of the SDDP algorithm is a collection of approximate expected cost-to-go functions $(\check{\mathcal{Q}}_2(\cdot), \ldots, \check{\mathcal{Q}}_T(\cdot))$. This collection of expected cost-to-go functions is what defines the offline policy provided by the SDDP algorithm. In other words, an offline policy is a policy induced by the approximate value functions obtained via an offline training procedure. The quality of this offline policy can be evaluated by using a set of $K$ (out-of-sample) scenarios $\{\xi^k\}_{k \in K}$, with $|K| \ll |\Xi_2| \times |\Xi_3| \times \cdots \times |\Xi_T|$ and $\xi^k = (\xi^k_1, \ldots, \xi^k_T)$.

Alternatively, one might consider using an RH procedure, where an online policy is obtained by solving an MSP problem on the fly at each decision stage $t = 1, \ldots, T$ during an out-of-sample evaluation. To demonstrate how the RH procedure works, let us define a roll as the decision stage at which the DM implements their actions. For any trajectory of the stochastic process $\xi^k = (\xi^k_1, \ldots, \xi^k_T)$, at the beginning of every roll $t = 1, \ldots, T$, based on the current (pre-decision) state and the observed realization $\xi^k_t$, the DM defines a new MSP problem of the form (1) with $\tau \le T$ being the number of look-ahead stages. The newly defined problem is then solved (e.g., using the SDDP algorithm) and only the first-stage decision $\check{x}_t$ is implemented. The DM then pays the immediate cost $f_t(\check{x}_t, \xi^k_t)$ and moves forward to the next stage. At that stage ($t+1$), a new problem is defined with the state variable $s_t = S_t(\check{x}_t, \xi^k_t)$ being the information carried over from the previous stage, $\xi^k_{t+1}$ being the observed realization of the stochastic process, and $\tau \le T$ (possibly different from the one used in the previous stage) being the number of look-ahead stages. The online policy is then evaluated by collecting all the immediate costs $f_t(\check{x}_t, \xi^k_t)$ incurred at every roll $t$ over a collection of (out-of-sample) scenarios $\{\xi^k\}_{k \in K}$. A symbolic representation of the RH procedure is depicted in Figure 1 (in the appendix) and an algorithmic description is provided in Algorithm 1.
Algorithm 1 The rolling horizon procedure (online policy).

Input: Initial value $x_0$ and an out-of-sample path $\xi = (\xi_1, \xi_2, \ldots, \xi_T)$.

STEP 0: Initialize $z(\xi) = 0$ and a vector $\check{x} := (\check{x}_1, \check{x}_2, \ldots, \check{x}_T)$, with $\check{x}_0 = x_0$ given as input.

STEP 1:
For $t = 1, 2, \ldots, T$:
• Define an MSP problem $P_t(\tau, \check{x}_{t-1}, \xi_t)$ of the form (3), with $\mathcal{Q}_{t+\tau}(\cdot) := 0$, $\check{x}_{t-1}$ as the initial value, and $\xi_t$ as the observed realization of the random variable.
• Solve $P_t(\tau, \check{x}_{t-1}, \xi_t)$ to obtain the optimal decision $x^*_t$, and set $\check{x}_t = x^*_t$.
• Update $z(\xi) \leftarrow z(\xi) + f_t(x^*_t, \xi_t)$.

STEP 2: Return $z(\xi)$.

The MSP problem solved at the beginning of every roll, with $\tau$ being the length of the forecast horizon, can be defined in many different ways. We summarize them in the following three categories; the reader is referred to [21] for a detailed discussion, and to [16] for a classified bibliography of the literature on the RH procedure.

1. Limiting the horizon: reduce the number of stages in every roll by assuming that $\mathcal{Q}_{t+\tau}(\cdot) := 0$.

2. Stage aggregation: aggregate the stages from $t' = t+\tau, \ldots, T$ in the same spirit as the partially adaptive framework introduced in [22], such that
$$\mathcal{Q}_{t+\tau}(x_{t+\tau-1}) := \mathbb{E}\left[\min_{x_{t+\tau},\ldots,x_T} \sum_{t'=t+\tau}^{T} f_{t'}(x_{t'}, \xi_{t'}) \,\middle|\, \begin{array}{l} (x_{t+\tau}, \ldots, x_T) \in \mathcal{X}_{t+\tau}(x_{t+\tau-1}, \xi_{t+\tau}) \times \cdots \times \mathcal{X}_T(x_{T-1}, \xi_T), \\ x_{t'}(\xi_{t'}) = x_{t'}, \ \forall t' = t+\tau, \ldots, T \end{array}\right]. \tag{11}$$

3. Outcome aggregation: use a point estimator $\mu_{t'} = \mathbb{E}[\xi_{t'}]$ for all the exogenous information from $t' = t+\tau, \ldots, T$, in the same spirit as the framework discussed in [20], such that
$$\mathcal{Q}_{t+\tau}(x_{t+\tau-1}) := \min_{x_{t+\tau},\ldots,x_T} \left\{ \sum_{t'=t+\tau}^{T} f_{t'}(x_{t'}, \mu_{t'}) \,\middle|\, (x_{t+\tau}, \ldots, x_T) \in \mathcal{X}_{t+\tau}(x_{t+\tau-1}, \xi_{t+\tau}) \times \cdots \times \mathcal{X}_T(x_{T-1}, \xi_T) \right\}. \tag{12}$$

For simplicity, in this paper we only consider the first strategy (limiting the horizon). Nevertheless, we expect that similar results can be extended to the other strategies as well.
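To make Algorithm 1 concrete, the following is a minimal Julia sketch of the online evaluation loop under the limiting-the-horizon strategy. The function `solve_msp` is a hypothetical placeholder standing in for solving $P_t(\tau, \check{x}_{t-1}, \xi_t)$ (e.g., with SDDP) and returning the optimal first-stage decision, and `cost` stands in for the immediate cost $f_t(\cdot, \cdot)$.

```julia
# Sketch of Algorithm 1: roll forward along one out-of-sample path, implementing only the
# first-stage decision of each τ-stage look-ahead problem and accumulating immediate costs.
function rolling_horizon(x0, ξ_path, τ::Int, solve_msp, cost)
    z = 0.0
    x_prev = x0
    for ξt in ξ_path
        x_star = solve_msp(τ, x_prev, ξt)   # solve P_t(τ, x̌_{t-1}, ξ_t) with Q_{t+τ}(·) := 0
        z += cost(x_star, ξt)               # pay the immediate cost f_t(x*_t, ξ_t)
        x_prev = x_star                     # carry the implemented decision to the next roll
    end
    return z                                # total cost z(ξ) along the path
end
```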
In this subsection, our goal is to construct an error bound on the suboptimality of the RH procedure when used to solve problem (7). The bound can then be used to derive a sufficient forecast horizon $\tau^*_\epsilon$ (see Definition 1). Our work is inspired by that of [28], where the authors provide similar ideas for discrete-time Markov control processes.

To that end, let us first make clear the definition of a sufficient forecast horizon $\tau^*$. Let $\pi^\tau := \{x^\tau_t\}_{t=1}^{\infty}$ be the sequence of optimal first-stage decisions obtained by an online RH procedure, where $\tau$ is the forecast horizon used in every roll $t = 1, \ldots, \infty$, such that
$$V(\pi^\tau) := f_1(x^\tau_1, \xi_1) + \lim_{T \to \infty} \mathbb{E}\left[\sum_{t=2}^{T} \gamma^{t-1} f_t(x^\tau_t, \xi_t)\right] \tag{13}$$
is the total expected discounted cost when evaluated using $\pi^\tau$. Similarly, we have that $g^* = V(\pi^\infty)$ is the minimal value of (6), where
$$\pi^\infty \in \arg\min_{\pi \in \Pi} \left\{ f_1(x^\infty_1, \xi_1) + \lim_{T \to \infty} \mathbb{E}\left[\sum_{t=2}^{T} \gamma^{t-1} f_t(x^\infty_t, \xi_t)\right] \right\}. \tag{14}$$

Definition 1 ($\epsilon$-Sufficient forecast horizon $\tau^*_\epsilon$). For $\epsilon > 0$, we define a forecast horizon $\tau^*_\epsilon$ as $\epsilon$-sufficient if $V(\pi^\tau) - V(\pi^\infty) \le \epsilon$ for all $\tau \in \{\tau^*_\epsilon, \tau^*_\epsilon + 1, \ldots, \infty\}$.

Under the contraction mapping assertion of Proposition 1, we have the following lemma.

Lemma 1. For any $g_0 \in \mathcal{B}$ and $k = 1, \ldots, \infty$, the value functions $g_k$ obtained iteratively using (10) satisfy $\|g^* - g_k\| \le \gamma^k \|g^* - g_0\|$.

Proof. $\|g^* - g_k\| = \|T g^* - T g_{k-1}\| = \|T^k g^* - T^k g_0\| \le \gamma^k \|g^* - g_0\|$.

Let us define $g_0 = 0$, and suppose we solve (10) recursively for $k = \tau$ iterations. We claim that $g_\tau$ provides the same optimal first-stage action obtained by the RH policy $\pi^\tau$. To see why this is the case, note that since $g_0 = 0$, the function $g_0$ is equivalent to the terminal stage value function $\mathcal{Q}_{\tau+1}(\cdot) := 0$ of the $\tau$-stage problem solved under the policy $\pi^\tau$. Similarly, since $g_1$ is obtained by applying (10) once to $g_0$, we have that $g_1$ is equivalent to $\mathcal{Q}_\tau(\cdot)$. We can then use the same argument, by backward induction, to show the equivalence between $g_k$ and $\mathcal{Q}_{\tau+1-k}(\cdot)$ for $k = 2, \ldots, \tau-1$. This shows that, for given initial state values $x_0$ and $\xi_1$, the function $g_\tau$ is equivalent to the optimal first-stage value of problem (2). Moreover, the minimizer given by
$$x^\tau_1 \in \arg\min_{x' \in \mathcal{X}(x,\xi)} \left\{ f(x', \xi) + \gamma g_{\tau-1}(x') \right\} \tag{15}$$
is also a minimizer of (2) for a given $x \in \mathcal{X}$ and $\xi \in \Xi$.
Theorem 2. Let $\gamma \in (0,1)$, suppose that $V(\pi^\infty)$ is the minimal value of (6) and $V(\pi^\tau)$ is given by (13). Then, under Assumption 3, we have the following:
(a) If $f(x, \xi) \le 0$, then $V(\pi^\tau) - V(\pi^\infty) \le \dfrac{\gamma^\tau \kappa}{1 - \gamma}$.
(b) For any general $f(x, \xi) : \mathcal{X} \times \Xi \to \mathbb{R}$, we have that $V(\pi^\tau) - V(\pi^\infty) \le \dfrac{2\gamma^\tau \kappa}{1 - \gamma}$.

Proof. (a) Since $x^\tau_1$ is the minimizer of (15), we have that
$$g_\tau(x) = \mathbb{E}\left[f(x^\tau_1, \xi) + \gamma\, g_{\tau-1}(x^\tau_1)\right], \quad \forall x \in \mathcal{X}. \tag{16}$$
Moreover, since $f(x, \xi) \le 0$, we have that
$$g_\tau(x) \ge \mathbb{E}\left[\sum_{t=1}^{k} \gamma^{t-1} f(x^\tau_t, \xi_t) + \gamma^k g_{\tau-k}(x^\tau_k)\right], \quad \forall x \in \mathcal{X},\ k = 1, 2, \ldots.$$
By letting $k \to \infty$, we get that
$$g_\tau(x) \ge \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} f(x^\tau_t, \xi_t)\right] = V(\pi^\tau), \quad \forall x \in \mathcal{X}.$$
Therefore,
$$V(\pi^\tau) - V(\pi^\infty) = V(\pi^\tau) - g^* \le g_\tau - g^* \le \gamma^\tau \|g^* - g_0\| \le \gamma^\tau \left\|\sum_{t=1}^{\infty} \gamma^{t-1} f(x^\tau_t, \xi_t)\right\| \le \frac{\gamma^\tau \kappa}{1 - \gamma}.$$
(b) Let us define another (immediate) cost function $\bar{f}(x, \xi) = f(x, \xi) - \kappa$. Since this newly defined function satisfies $-2\kappa \le \bar{f}(x, \xi) \le 0$, the result follows from (a) (by redefining $\kappa$ as $2\kappa$).

Theorem 2 provides a bound on the total expected discounted cost obtained by using the RH procedure as a function of the length of the forecast horizon $\tau$. This bound can be used to obtain the following formula for an $\epsilon$-sufficient forecast horizon $\tau^*_\epsilon$:
$$\tau^*_\epsilon = \frac{\log\left(\epsilon(1-\gamma)/\kappa\right)}{\log(\gamma)}, \tag{17}$$
which guarantees that, for all $\tau \ge \tau^*_\epsilon$, the suboptimality of the respective policy $\pi^\tau$ is within an optimality gap of $\epsilon$. Note that similar bounds can also be constructed for the situation where the number of periods is $m > 1$. However, since the mapping $g$ is then defined for all the stages within a period of length $m$, the bound will scale by a factor $m$ for every increment in the forecast horizon $\tau$.

While Theorem 2 does provide a cue on how to choose the number of look-ahead stages for RH policies in MSP problems, the formula for an $\epsilon$-sufficient forecast horizon $\tau^*_\epsilon$ provided in (17) indicates that $\tau^*_\epsilon$ can grow quickly as $\gamma$ gets larger and/or $\epsilon$ gets smaller. From a computational perspective, this is somewhat pathological and defeats the purpose of using an RH procedure, since constructing a policy $\pi^{\tau^*_\epsilon}$ with a relatively small $\epsilon$ and large $\gamma$ entails solving a sequence of MSP problems, each with a significantly large horizon $\tau^*_\epsilon$. Another concern with this bound is that its dependence on the original MSP problem is only through the (constant-valued) upper bound $\kappa$ on the immediate cost function $f_t(\cdot, \cdot)$. This makes the choice of the employed $\epsilon$-sufficient forecast horizon static, losing the opportunity to exploit information related to the system state. In Section 4, we address exactly these concerns by constructing a dynamic policy which maps the state of the system $s_t$ to a sufficient forecast horizon $\tau^*_t$, for all $t = 1, \ldots, T$. Henceforth, we introduce a subscript $t$ to distinguish between the static forecast horizon $\tau$ and the dynamic forecast horizon $\tau_t$, which is time-dependent (via the state $s_t$).

In this section, we propose a solution approach for constructing a dynamic RH look-ahead policy from an ADP perspective. Although novel in stochastic programming, ideas similar to our proposed dynamic look-ahead policy have been discussed in the MDP literature. For instance, [29] construct a dynamic look-ahead policy for the infinite horizon discounted MDP as follows. First, the authors show the existence of a finite sufficient forecast horizon $\tau^*$ for any given system state. Second, they identify the optimal first-stage decision $x^{\tau^*}$ which corresponds to this $\tau^*$. Finally, they propose an optimality criterion for this $x^{\tau^*}$, and then use this result to provide a procedure for detecting the smallest $\tau \in \mathbb{Z}_+$ such that $V(\pi^\tau) = V(\pi^{\tau^*})$.

The high-level idea of the procedure described in [29] for detecting $\tau^*$ is to go through every possible system state and increase the length of the forecast horizon one stage at a time until $\tau^*$ is identified. To do this, at each step the algorithm loops over every feasible action and solves a DP problem, where all actions which do not satisfy the (predetermined) optimality criterion for $x^{\tau^*}$ are eliminated. This process is repeated until the finite action space is reduced to a single element, in which case the corresponding forecast horizon is considered sufficient. Similar procedures have also been discussed in [17] for a more general class of MDPs.

Showing the existence of $\tau^*$ and finding the corresponding $x^{\tau^*}$ are somewhat readily accessible in the MDP setting because of the finiteness assumptions on the state/action space. While it seems challenging to extend this procedure to MSP problems with continuous state/action spaces, we expect that this idea and its featured theoretical results are extendable to MSP problems with purely discrete variables. However, even if such extensions are possible, since this procedure solves a DP for every state, action, and $\tau \le \tau^*$, the computation might scale poorly with the size of the state and action space. Instead, we propose a more practical approach next.

Ultimately, our goal is to construct a function which maps every state $s_t = S_t(x_t, \xi_t)$, $\forall (x_t, \xi_t) \in \mathcal{X}_t \times \Xi_t$, at the beginning of every roll $t$ to the smallest sufficient forecast horizon $\tau^*_t$. As previously noted, when using the RH procedure, at the beginning of every roll $t = 1, 2, \ldots$, the DM solves an MSP problem with a given forecast horizon, implements the first stage only, and repeats this process at the next roll $t+1$. As such, if the inclusion of additional stages in the forecast horizon (i.e., using $\tau'_t > \tau_t$ stages instead of $\tau_t$) does not influence the first-stage decision $x^{\tau_t}_t$ (i.e., $x^{\tau'_t}_t$ remains the same for all $\tau'_t \ge \tau_t$), then $\tau_t$ can be viewed as sufficient for this roll $t$.
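This first-stage stability criterion is what our offline procedure checks in practice. A minimal Julia sketch of the check for a single state is given below; `solve_first_stage` is a hypothetical placeholder that solves the $\tau$-stage MSP started from state `s` and returns its optimal first-stage decision. The defaults mirror the parameters introduced next ($w = 10$ and $\tau_{\max} = 64$ are the values used later in the paper, while the tolerance value here is a placeholder).

```julia
using LinearAlgebra

# Find the smallest horizon after which the first-stage decision stops moving:
# once ‖x^τ − x^{τ−w}‖ < ϵ, the horizon τ − w is declared sufficient for this state.
function smallest_sufficient_horizon(s, solve_first_stage; τmax = 64, w = 10, ϵ = 1e-2)
    sols = Vector{Vector{Float64}}()
    for τ in 1:τmax
        push!(sols, solve_first_stage(s, τ))
        if length(sols) > w && norm(sols[end] - sols[end - w]) < ϵ
            return τ - w
        end
    end
    return τmax                 # fall back to the maximum horizon considered
end
```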
The main idea of our proposed approach is to train the dependency of a “small and sufficient” forecast horizon $\tau^*_t$ on the state using an offline training procedure. Specifically, we consider a regression model, treating the system states as the independent variables and the corresponding smallest sufficient forecast horizons as the dependent variables. First, we generate a random sample of system states $\{s_n\}_{n=1}^{N}$. Then, for each sample $s_n$, we solve an MSP problem using different forecast horizons $\tau \in \{1, 2, \ldots, \tau_{\max}\}$, and we record the smallest forecast horizon $\tau^*$ for which the optimal first-stage solution $x^{\tau^*}$ “converges”, i.e., it remains the same for all $\tau \ge \tau^*$. To practically determine whether or not $x^\tau$ will remain the same for all $\tau \ge \tau^*$, we use the following stability test: we increase the length of the forecast horizon $\tau$ one step at a time, and we keep track of the optimal first-stage solution $x^\tau$ for every $\tau \in \{1, 2, \ldots, \tau_{\max}\}$. Then, whenever $\|x^\tau - x^{\tau-w}\| < \epsilon$ for some user-specified tolerance parameter $\epsilon > 0$ and stability parameter $w \in \mathbb{N}$, we conclude that $\tau^*_n = \tau - w$. Once all the data points $\{(s_n, \tau^*_n)\}_{n=1}^{N}$ have been collected, we use a real-valued function $\mathrm{T} : \mathcal{X} \times \Xi \to \mathbb{R}$ to fit these points. To do this, we define a set of basis functions $\phi_0(\cdot), \ldots, \phi_P(\cdot)$, and look for $\mathrm{T}(\cdot)$ in the finite-dimensional space spanned by these functions. In other words, we look for $\mathrm{T}(\cdot)$ in the form of the following linear combination:
$$\mathrm{T}(s_t) := \sum_{i=0}^{P} \theta_i \phi_i(s_t), \tag{18}$$
and fit a regression model to find the parameters $\theta_0, \ldots, \theta_P$. We note that to find a good design of the basis functions, one needs to take advantage of specific problem structures. Given $\theta_0, \ldots, \theta_P$ and the form of the basis functions, one can obtain the forecast horizon to consider for each state encountered online during the RH procedure. This approach is summarized in Algorithm 2.
Algorithm 2 An offline learning approach for dynamic selection of forecast horizons.

Input: A sample size $N$, a sample $\{s_n\}_{n=1}^{N}$, a tolerance parameter $\epsilon > 0$, a stalling parameter $w \in \mathbb{N}$, and a maximum forecast horizon $\tau_{\max} < T$.

Step 0: Initialize $n = 0$ and a set of candidate sufficient forecast horizons (one for each sample) $\{\tau^*_n\}_{n=1}^{N}$.

Step 1: Let $n \leftarrow n + 1$.

Step 2: Initialize a set of candidate optimal first-stage solutions $X^*_n = \emptyset$ and a candidate forecast horizon $\tau \in \{1, 2, \ldots, \tau_{\max}\}$.
• while true do
  (a) Given $s_n$ as an input, solve an MSP problem of the form (3) with $T = \tau$ to obtain the optimal (first-stage) action $x^\tau$ and append $x^\tau$ to $X^*_n$.
  (b) If $|X^*_n| > w$, go to (c). Otherwise, go to (d).
  (c) If $\|x^\tau - x^{\tau-w}\| < \epsilon$, set $\tau^*_n = \tau - w$ and stop. Else, go to (d).
  (d) If $\tau < \tau_{\max}$, let $\tau \leftarrow \tau + 1$ and go to (a). Else, set $\tau^*_n = \tau_{\max}$ and stop.

Step 3: If $n = N$, return $\{(s_n, \tau^*_n)\}_{n=1}^{N}$ and stop. Else, go to Step 1.

Step 4:
Define $\mathrm{T}(s_t) := \sum_{i=0}^{P} \theta_i \phi_i(s_t)$ and fit a regression model to find $\theta_0, \ldots, \theta_P$ using $\{(s_n, \tau^*_n)\}_{n=1}^{N}$.

Next, we illustrate the proposed approach on a class of hydrothermal energy planning problems, which are used as the benchmark instances in our numerical experiments shown in Section 5.

The hydrothermal power generation system is one of the main energy sources in many countries. For example, the Brazilian hydrothermal power system is a large-scale network of facilities that produce and distribute energy by circulating water. The power plants in the Brazilian network can be categorized into two types:

1. A set of hydro plants $H$, which have no production cost, but for each hydro plant $h \in H$ there is an upper limit $\bar{q}_h$ on the amount of turbined flow that can be used for power generation. These hydro plants can be categorized further into two types:
(a) Hydro plants with reservoirs $H_R$; for each $h \in H_R$ there are lower and upper limits on the level of water allowed in the reservoir, denoted by $\underline{v}_{h,t}$ and $\bar{v}_{h,t}$, respectively.
(b) Hydro plants without reservoirs (run-of-river) $H_I$, for which no water storage is possible.
Moreover, for each hydro plant $h$, a unit of released water generates $r_h$ units of power, which is known as the efficiency rate. The sets of immediate upstream and downstream plants for hydro plant $h$ in the network are given by $U(h)$ and $L(h)$, respectively.

2. A set of thermal plants $F$, where for each thermal plant $f \in F$ the minimum and maximum amounts of power generation allowed are $\underline{g}_{f,t}$ and $\bar{g}_{f,t}$, respectively, for all $t = 1, \ldots, T$. Additionally, each plant $f \in F$ is associated with an operating cost $c_{f,t}$ for each unit of thermal power generated. We denote the cost vector of generating thermal power at time $t$ by $c_{g,t}$.

In the hydrothermal power operation planning (HPOP) problem, the goal of the DM is to find an operation strategy $\pi$ that meets the production targets (demands) $d_t$ at each decision stage $t = 1, \ldots, T$, while minimizing the overall production cost. To do this, at each decision stage $t$ and given the state of the system $s_t$ (i.e., the water level at each hydro plant $h \in H_R$), the DM has to decide how much water to turbine at each hydro plant $h \in H$ and how much thermal power to generate at each plant $f \in F$. In the event that the level of energy production using hydro/thermal plants does not meet the level of demand, the DM must pay a penalty of $c_{p,t}$ for each unit of unsatisfied demand. Such a penalty can also be thought of as the cost of buying energy from an outside market, which is typically higher than the cost of generating energy using internal resources (thermal plants), i.e., $c_{p,t} > \max_{f \in F}\{c_{f,t}\}$.

The HPOP problem can be modeled as an MSP problem because of the uncertainty in the problem data, such as future inflows (amount of rainfall) $\tilde{b}_t := \tilde{b}_t(\xi_t)$, demand $\tilde{d}_t := \tilde{d}_t(\xi_t)$, fuel costs $\tilde{c}_t := \tilde{c}_t(\xi_t)$, and equipment availability [30]. We will use the notation $b_t, d_t, c_t$ whenever the parameter is deterministic (e.g., at stage $t = 1$) and $\tilde{b}_t, \tilde{d}_t, \tilde{c}_t$ whenever it is random. To introduce a mathematical programming formulation for the decision problem which the DM has to solve at every decision stage $t$, for $t = 1, \ldots, T$, consider the following decision variables:

• $x_t \in \mathbb{R}^{|H_R|}_+$: where $x_{h,t}$ is the amount of water stored at each hydro plant with a reservoir $h \in H_R$.
• $y_t \in \mathbb{R}^{|H|}_+$: where $y_{h,t}$ is the amount of water turbined by each hydro plant $h \in H$.
• $g_t \in \mathbb{R}^{|F|}_+$: where $g_{f,t}$ is the amount of thermal power generated by each thermal plant $f \in F$.
• $v^+_t, v^-_t \in \mathbb{R}^{|H_R|}_+$: where $v^+_{h,t}$, $v^-_{h,t}$ are the amounts of spilled/pumped-back water (without generating power) at hydro plant $h \in H_R$.
• $p_t \in \mathbb{R}_+$: the amount of unsatisfied demand at stage $t$.

See Table 8 in the appendix for a summary of all the notation used in the HPOP problem. The local optimization problem at every decision stage $t$, for $t = 1, \ldots, T$, is defined as follows:
$$\min_{x_t, y_t, g_t, v^+_t, v^-_t, p_t} \ \tilde{c}_{g,t}^{\top} g_t + \tilde{c}_{p,t}\, p_t \tag{19a}$$
subject to
$$x_{h,t} = x_{h,t-1} + \tilde{b}_{h,t} + \Big[\sum_{m \in U(h)} (y_{m,t} + v^+_{m,t}) - \sum_{m \in L(h)} v^-_{m,t}\Big] - (y_{h,t} + v^+_{h,t} - v^-_{h,t}), \quad \forall h \in H_R \tag{19b}$$
$$y_{h,t} + v^+_{h,t} - v^-_{h,t} = \tilde{b}_{h,t} + \Big[\sum_{m \in U(h)} (y_{m,t} + v^+_{m,t}) - \sum_{m \in L(h)} v^-_{m,t}\Big], \quad \forall h \in H_I \tag{19c}$$
$$\sum_{h \in H} y_{h,t}\, r_h + \sum_{f \in F} g_{f,t} + p_t \ge d_t \tag{19d}$$
$$\underline{v}_{h,t} \le x_{h,t} \le \bar{v}_{h,t}, \quad \forall h \in H_R \tag{19e}$$
$$y_{h,t} \le \bar{q}_h, \quad \forall h \in H \tag{19f}$$
$$\underline{g}_{f,t} \le g_{f,t} \le \bar{g}_{f,t}, \quad \forall f \in F \tag{19g}$$
$$x_t, y_t, g_t, v^+_t, v^-_t, p_t \ge 0. \tag{19h}$$

Constraints (19b) describe the flow balance in the system, which gives the (post-decision) water level $x_{h,t}$ stored in a reservoir $h \in H_R$ at the end of stage $t$. Constraints (19c) describe the system dynamics for the run-of-river plants; they are similar to the first set except that the state variables representing the water level in the reservoirs are excluded. Constraint (19d) models the demand requirement. Constraints (19e) to (19g) are the capacity constraints for the reservoirs, the amount of water that can be turbined, and the thermal power generation, respectively. Note that in this context the system state is given by $s_{h,t} = S_t(x_{h,t-1}, \tilde{b}_{h,t}) = x_{h,t-1} + \tilde{b}_{h,t}$ for a hydro plant with a reservoir $h \in H_R$, and $s_{h,t} = \tilde{b}_{h,t}$ for each run-of-river hydro plant $h \in H_I$.

In our implementation, to construct the mapping $\mathrm{T}(\cdot)$ from the system state to a sufficient forecast horizon for this HPOP problem, we consider a piecewise linear function where each piece is given by (18) with $P = 1$. The basis functions for each piece are defined as $\phi_0(s_t) = 1$ (following the ADP literature, for numerical stability; see, e.g., [31]) and $\phi_1(s_t) = \sum_{h \in H} r_h s_{h,t}$. In other words, we define $\phi_1(s_t)$ as the total amount of electricity that can be generated by all the hydro plants given the current levels of water in their reservoirs. As such, performing the regression reduces to estimating the parameters $\theta_0$ and $\theta_1$ for each piece of the piecewise linear function through an offline training procedure.
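The offline fit itself is an ordinary least-squares problem once the basis functions are fixed. The following Julia sketch fits a single piece of the model (18) with $P = 1$ and $\phi_1(s) = \sum_h r_h s_h$ as defined above; the efficiency rates and the training data passed in are hypothetical, and a multi-piece fit would simply repeat this on the samples falling in each piece.

```julia
using LinearAlgebra

# Fit T(s) = θ0 + θ1 φ1(s) by least squares, with φ1(s) = Σ_h r_h s_h.
# `samples` holds sampled states s_n, `horizons` the labels τ*_n from Algorithm 2.
function fit_horizon_map(samples::Vector{Vector{Float64}}, horizons::Vector{Int}, r::Vector{Float64})
    ϕ1 = [dot(r, s) for s in samples]          # basis φ1: hydro energy available in each state
    X  = hcat(ones(length(ϕ1)), ϕ1)            # design matrix [φ0  φ1], with φ0 ≡ 1
    θ  = X \ float.(horizons)                  # ordinary least squares
    return s -> θ[1] + θ[2] * dot(r, s)        # the fitted map s_t ↦ T(s_t)
end
```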
We implemented two different policies for choosing the forecast horizon in the RH procedure: a static policy, where the forecast horizon is kept the same in each roll, and a dynamic policy, where the forecast horizon is dynamically chosen based on the system state according to the regression model trained offline as described in Algorithm 2. To train this function, we use a sample size of $N = 50$, a small tolerance parameter $\epsilon$, a stalling parameter $w = 10$, and a maximum forecast horizon $\tau_{\max} = 64$ (we chose 64 as the maximum forecast horizon due to a limited computational budget and the observation that the policy performance does not improve significantly beyond that). We refer to this procedure as the offline training step. We then evaluate both the static and the dynamic RH policies on a sample path with $T = 5000$ stages as the out-of-sample evaluation. We refer to this step as the online evaluation step. To measure the performance of each policy, we calculate the long-run average cost, which is given by
$$\bar{z} = \frac{1}{T} \sum_{t=1}^{T} f_t(x^{\tau_t}_t, \xi_t). \tag{20}$$
It might seem computationally prohibitive to evaluate the out-of-sample performance on a sample path of $T = 5000$ stages for different policies for choosing the number of look-ahead stages in each roll, static or dynamic. However, the situation simplifies dramatically under our stationarity assumption (see Assumption 5), where the expected cost-to-go functions $\mathcal{Q}_t(\cdot)$ of the MSPs defined in every roll are identical for any fixed length of the forecast horizon. In this case, one can reuse the approximate value function obtained in previous rolls as an initial approximate value function in the current roll, instead of redefining a new MSP problem and training the value function from scratch. Since the problems solved in previous rolls were solved under initial states that are not necessarily the same as the one observed in the current roll, the approximate value function inherited from previous rolls might not be a good enough approximation under the new initial state, and may therefore need to be evaluated and trained further if necessary. Note that an algorithmic procedure similar to that described in Algorithm 1 can be employed here. The only difference is that, as we move forward over time, we do not dispose of the cuts generated in previous rolls. Instead, we carry them over to the next roll and then re-solve the problem under the new initial state $s_t$ (if needed). This idea of “reusing” the approximate value functions has appeared in the literature (see, e.g., [32]).

We remark on the following subtle distinction between reusing the approximate value functions in the static versus the dynamic policy for choosing the number of look-ahead stages in the RH procedure. In the static policy, a single MSP problem is defined and the same approximate value functions are simply reused in every stage over the course of the entire procedure. In the dynamic policy, we keep track of multiple MSP problems (and the associated approximate value functions), one for each forecast horizon $\tau_t$. If the forecast horizon $\tau_t$ given by $\mathrm{T}(s_t)$ at the beginning of a roll has been observed before, we reuse the corresponding (previously defined) MSP problem which has the same $\tau_t$ value. Otherwise, a new MSP problem is defined and added to the list of MSP problems with the corresponding forecast horizons encountered so far.

In addition, as previously noted, discount factors have many economic interpretations where the DM accounts for the time value of costs. One such interpretation is that future cost is considered less valuable than current cost: the smaller the discount factor, the less the DM values future costs, and vice versa. It is intuitively clear that to each discount factor there corresponds a certain length of the look-ahead horizon, both of which represent how the DM values future costs. To explore this analogy, we implemented the periodic variant of the SDDP algorithm described in [27] using different values of the discount factor $\gamma$
and made a comparison with the RH policies.

In summary, we compare the performance of the following policies:
1. A static RH policy that reuses the cuts generated during the online evaluation steps, with no discount factor.
2. A dynamic RH policy that reuses the cuts generated during the online evaluation steps, with no discount factor.
3. A stationary policy trained with a given discount factor using the periodic variant of the SDDP algorithm described in [27].

In all of our experiments, we considered varying the number of sample paths used in each iteration of the SDDP algorithm for solving the MSP problems. However, we observed that using a single sample path per forward/backward step works best, and thus we use it in all of our experiments. We use the following termination criteria for stopping the SDDP algorithm: (i) a maximum number of iterations is reached; or (ii) a time limit of 10,800 seconds (three hours) is reached; or (iii) the lower bound does not progress by more than $\epsilon$ (in relative terms) over $\bar{j} \in \mathbb{Z}_+$ consecutive iterations (i.e., $(LB_i - LB_{i-\bar{j}})/LB_i < \epsilon$). We refer to $\bar{j}$ as the “stalling parameter”, which is set to $\bar{j} = 500$ by default. When solving the MSP problems in an RH procedure during the out-of-sample evaluation, we gradually tune down the online training effort by decreasing this parameter $\bar{j}$ to accelerate computation without sacrificing much solution quality. The rationale behind this dynamic parameter setting is that, as the online evaluation procedure proceeds, the state space is explored more extensively, implying that the quality of the approximate value functions keeps improving. Therefore, online training in later rolls does not warrant the same effort as in the initial rolls. Specifically, we gradually tune down the online training effort as follows:
1. Starting from the second roll ($t = 2$), we set $\bar{j} = 50$.
2. Then, once all of the realizations of the random data process $\xi_t$ have been encountered in the online evaluation procedure, we set $\bar{j} = 10$.
3. Finally, if the algorithm keeps terminating as soon as the stalling parameter $\bar{j} = 10$ is hit (i.e., right after $i = 11$ iterations) for more than 50 consecutive rolls, we turn off the online training by setting $\bar{j} = 1$.
We remark that all the parameters above are chosen in a rather heuristic manner and can be tuned according to specific problem instances to improve computational performance.

All proposed approaches are tested using benchmark instances motivated by the Brazilian HPOP problem described in Section 4.1. To create a variety of instances, we consider different values for the demand parameter $d_t$, the number of hydro plants in the network $|H| \in \{1, 3, 6\}$, the number of realizations $|\Xi_t| \in \{5, 12\}$, and the probability distribution of the random process $\xi_t$, as shown in Table 9 and Table 10 (in the appendix). In these instances, three of the hydro plants have reservoirs ($H_R$) and three are run-of-river plants ($H_I$). The problem parameters used in formulation (19) are given in Table 11 (in the appendix). Note that in instances where $|H| = 1$ we chose the third hydro plant ($h = 3$), and in instances where $|H| = 3$ we chose the second, third and fourth hydro plants. In addition to the hydro plants, we also consider a set of four thermal plants $F$, with maximum power generation capacity $\bar{g}_{f,t}$ and cost $c_{f,t}$ for each thermal plant $f \in F$ as shown in Table 12 (in the appendix). Finally, we use $c_{p,t} = 500$ as the penalty parameter for each unit of unsatisfied demand, together with a fixed conversion factor for converting the amount of turbined water flow into the water level in the reservoirs. All algorithms were implemented in Julia
using JuMP, with Gurobi version 9.0.0 as the optimization solver. All of the tests were conducted on Clemson University's primary high-performance computing cluster, the Palmetto cluster, where we use an R830 Dell Intel Xeon “big memory” compute node with 2.60 GHz, 1.0 TB of memory, and 24 cores.
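The stalling-based stopping rule (iii) described above amounts to a simple check on the recorded sequence of lower bounds; a minimal sketch, with hypothetical names:

```julia
# Return true when the lower bound has improved by less than ϵ (relative) over the last
# jbar iterations, i.e., (LB_i − LB_{i−jbar}) / LB_i < ϵ.
function stalled(LB::Vector{Float64}, jbar::Int, ϵ::Float64)
    length(LB) ≤ jbar && return false          # not enough iterations recorded yet
    return (LB[end] - LB[end - jbar]) / LB[end] < ϵ
end
```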
The first step in our implementation of the dynamic RH policy is the offline training step, where we estimate the parameters $\theta_0$ and $\theta_1$ in each piece (see (18)) of the piecewise linear function. The choice of a piecewise linear function as the regression function is inspired by our preliminary observation that, in these instances, the relationship between $s_n$ and $\tau_n$ exhibits a piecewise linear structure with up to three pieces. The best-fit results are shown in Table 1. To measure the quality of the obtained regression functions, we use an aggregated coefficient of determination $R^2_{\text{avg}}$, which is given by a weighted summation of the $R^2$ values over all pieces representing the function. For example, in the test instance where $|H| = 1$, $d_t = 2250$, and $|\Xi_t| = 5$, the piecewise linear regression function contains three pieces with an aggregated $R^2_{\text{avg}}$ of about 85, which is almost sixteen times higher than the value obtained with a simple linear regression (i.e., with a single piece). Similar improvements in the $R^2_{\text{avg}}$ values can be observed for most instances, except for two instances with $|H| = 3$ and $d_t = 2250$, where the coefficients of determination given by the simple linear regression are already satisfactory.

To analyze the performance of the different RH look-ahead policies and the stationary policy, we record the following performance metrics:
• The training time (in seconds): recorded online in the case of the RH policies, and offline in the case of the stationary policy obtained by the periodic variant of the SDDP algorithm.
• The long-run average cost: $\bar{z}$ given by (20), denoted by $\bar{z}_\tau$ in the case of an RH policy with $\tau$ look-ahead stages, and by $\bar{z}_\gamma$ in the case of the stationary policy with discount factor $\gamma$.
• The relative gap: given by $(\bar{z}_\tau - \bar{z}_{64})/\bar{z}_{64}\,\%$ in the case of RH policies, and by the analogous gap with respect to the stationary policy with the largest discount factor considered. We treat $\bar{z}_{64}$ as the “optimal” RH policy, as this length of the look-ahead horizon is large enough in the sense that the policy performance does not improve significantly beyond it, according to our computational experiments. We likewise treat the stationary policy with the largest discount factor as the “optimal” stationary policy, as that discount factor is close enough to 1.

All of the numerical results for the instances with $|H| = 1$, $3$, and $6$ are shown in Table 2 to Table 5. Next, we summarize our main observations.
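Once the per-piece coefficients are estimated, evaluating the fitted map online is a constant-time lookup plus one affine evaluation. A minimal sketch, with hypothetical breakpoints and coefficients (not the fitted values reported in Table 1):

```julia
# Evaluate a fitted piecewise-linear map T(s_t): pick the piece that φ1(s_t) falls into,
# then return max(1, round(θ0 + θ1 φ1)). Breakpoints are the interior thresholds on φ1.
function horizon_for_state(ϕ1::Float64, breakpoints::Vector{Float64},
                           θ0::Vector{Float64}, θ1::Vector{Float64})
    piece = searchsortedlast(breakpoints, ϕ1) + 1      # piece index in 1, …, length(θ0)
    piece = clamp(piece, 1, length(θ0))
    return max(1, round(Int, θ0[piece] + θ1[piece] * ϕ1))
end
```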
Table 1: The regression fit for the piecewise linear basis function parameters $\hat{\theta}_0$ and $\hat{\theta}_1$, and the aggregated coefficient of determination $R^2_{\text{avg}}$, for each instance and each range of $\phi_1(s_t)$.

The first observation is that, for static RH policies across the different instances, the long-run average cost $\bar{z}_\tau$ plateaus after a certain length of forecast horizon $\tau^*$. In other words, the decrease in $\bar{z}_\tau$ as $\tau$ increases becomes hardly noticeable for $\tau > \tau^*$. This behavior can be observed in Figure 2 to Figure 4 in the appendix. Across all of the instances, this plateauing behavior takes place somewhere between $\tau^* = 8$ and $\tau^* = 16$, where $\bar{z}_\tau$ is, on average, only slightly larger than $\bar{z}_{64}$ for the instances with $|H| = 1$, $|H| = 3$, and $|H| = 6$. On the other hand, static RH policies with $\tau \in [8, 16]$ require, on average, a computational time of 770.13 seconds, 1439.22 seconds, and 8016.52 seconds for the instances with $|H| = 1$, $|H| = 3$, and $|H| = 6$, respectively, whereas the static RH policy with $\tau = 64$ requires, on average, 10569.23 seconds, 12368.98 seconds, and 42374.78 seconds for the same instances. This means that the small reduction in the optimality gap obtained by using a static RH policy with $\tau = 64$ instead of $\tau \in [8, 16]$ comes at the price of a several-fold increase in computational time.

Overall, across all of the instances, the dynamic RH policy performs quite well compared to the benchmark static RH policy with $\tau = 64$. The largest gap between $\bar{z}_{\text{dynamic}}$ and $\bar{z}_{64}$ occurs in the instance where ($|H| = 3$, $d_t = 1750$, $|\Xi_t| = 12$), followed by the instances where ($|H| = 6$, $d_t = 2250$, $|\Xi_t| = 12$) and ($|H| = 3$, $d_t = 1750$, $|\Xi_t| = 5$). It is interesting to see that these three instances also happen to be the ones with the least accurate regression fit according to the (aggregated) coefficient of determination shown in Table 1.
Table 2: Numerical results for the different RH look-ahead policies (with no discount factor): training time, long-run average cost $\bar{z}_\tau$, and relative gap for each combination of $d_t$ and $|\Xi_t|$, for $|H| \in \{1, 3, 6\}$.

Nevertheless, apart from these three instances, $\bar{z}_{\text{dynamic}}$ is, on average, only marginally larger than $\bar{z}_{64}$ in the instances with $|H| = 1$, $|H| = 3$, and $|H| = 6$. Comparing these gaps to the computational time required to compute the static RH policy with $\tau = 64$, we can see from Table 2 that closing this small remaining optimality gap would require a substantial increase in computational time for the instances with $|H| = 1$, $|H| = 3$, and $|H| = 6$.

Moreover, given the monotonicity of $\bar{z}_\tau$ as $\tau$ increases, it might be instructive to explore where the dynamic policy ranks (in terms of its performance) compared to static RH policies with various lengths of forecast horizons. In nine out of the twelve instances that we have, the long-run average cost and the computational time of the dynamic policy are both placed between the static RH policies with $\tau = 8$ and $\tau = 16$. Even in the instances where this is not the case – namely, ($|H| = 3$, $d_t = 1750$, $|\Xi_t| = 12$), ($|H| = 3$, $d_t = 2250$, $|\Xi_t| = 5$) and ($|H| = 6$, $d_t = 2000$, $|\Xi_t| = 5$) – the performance of the dynamic RH policy, in terms of $\bar{z}_{\text{dynamic}}$ and $\text{Time}_{\text{dynamic}}$, places very closely to the policies with $\tau = 8$ in the case of the two instances with $|H| = 3$, and very closely to the policies with $\tau = 16$ in the case of the instance with $|H| = 6$. This interval is, as discussed earlier, the same interval in which the plateauing behavior starts to take place. This coincidence, however, is somewhat expected given the fact that our regression functions are constructed precisely so that the corresponding states are mapped to the smallest forecast horizon $\tau^*$ for which $\|x^\tau - x^{\tau^*}\| < \epsilon$, and hence $\|\bar{z}_\tau - \bar{z}_{\tau^*}\| < \epsilon$, i.e., where $\bar{z}_\tau$ plateaus. This provides numerical evidence for the validity of our proposed approach.

Finally, another advantage of the dynamic policy over the static policy is its usability. To elaborate on this, it is important to note that when using a static policy, it is not clear how one can identify a priori a sufficient forecast horizon $\tau^*$ for which $\bar{z}_\tau$ plateaus for all $\tau \ge \tau^*$. In our implementation, for instance, we do this by keeping track of $\bar{z}_\tau$ and testing the static RH policy for every candidate $\tau$ up to $\tau_{\max} = 64$. In other words, if the DM chooses to use a static policy, then prior to solving the problem he/she does not know what $\tau^*$ is, and the only way to find out is to solve the actual problem for multiple values of $\tau$. While the offline training step involved in estimating the dynamic policy's regression parameters, in some sense, also follows this same enumeration process, it is less expensive from a computational perspective and allows for more flexibility. First, when enumerating the values $\tau \in \mathbb{Z}_+$, we do not have to solve the actual problem for every $t = 1, 2, \ldots, T$. Instead, we only need to do this for the first-stage problem under different sampled initial states $s_n$. Moreover, the accuracy of the regression fit can be adapted according to the sample size $N$, which depends on the available computational budget. Finally, the DM has the opportunity to study and analyze
Finally, another advantage of the dynamic policy over the static policy is its usability. To elaborate, it is important to note that, when using a static policy, it is not clear how one can identify a sufficient forecast horizon τ* a priori, i.e., one for which z̄_τ plateaus for all τ ≥ τ*. In our implementation, for instance, we do this by keeping track of z̄_τ and testing the static RH policy with every τ ∈ { , , . . . , }. In other words, if the DM chooses to use a static policy, then prior to solving the problem he/she does not know what τ* is, and the only way to find out is to solve the actual problem for multiple values of τ. While the offline training step involved in estimating the regression parameters of the dynamic policy, in some sense, follows this same enumeration process, it is less expensive from a computational perspective and allows for more flexibility. First, when enumerating the values τ ∈ Z_+, we do not have to solve the actual problem for every t = 1, 2, . . . , T. Instead, we only need to do this for the first-stage problem under different sampled initial states s^n. Moreover, the accuracy of the regression fit can be adapted according to the sample size N, which depends on the available computational budget. Finally, the DM has the opportunity to study and analyze the function T(s_t) prior to solving the actual problem, e.g., he/she has the freedom to construct the regression functions that best fit the set of sampled initial states s^n and the corresponding responses.
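As a rough illustration of this offline training step, the sketch below enumerates candidate forecast horizons for a set of sampled initial states, uses a stalling counter (the parameter w discussed in the sensitivity analysis later) to detect the smallest horizon beyond which the first-stage decision stops changing, and then fits a simple regression mapping states to that horizon. The name solve_first_stage and the plain least-squares fit are hypothetical simplifications; the plateau test shown here is only one plausible reading of the stalling rule, and the actual training uses piecewise linear basis functions of φ(s_t) rather than a linear fit.

```python
import numpy as np

def train_forecast_horizon_map(sampled_states, solve_first_stage,
                               tau_grid, w=10, eps=1e-3):
    """Offline training of a map from initial state s to a sufficient forecast horizon.

    sampled_states            -- list of sampled initial states s^n
    solve_first_stage(s, tau) -- hypothetical: first-stage decision of the tau-stage
                                 look-ahead problem started at state s
    tau_grid                  -- increasing, unit-spaced candidate horizons, e.g. range(2, 65)
    w                         -- stalling parameter: number of consecutive horizons with
                                 (almost) unchanged first-stage decision
    eps                       -- tolerance for declaring two decisions identical
    """
    responses = []
    for s in sampled_states:
        stall, tau_star, x_prev = 0, tau_grid[-1], None
        for tau in tau_grid:
            x = np.asarray(solve_first_stage(s, tau))
            if x_prev is not None and np.linalg.norm(x - x_prev) < eps:
                stall += 1
                if stall >= w:           # decision has stalled for w consecutive horizons
                    tau_star = tau - w   # horizon at which it stopped changing (unit-spaced grid)
                    break
            else:
                stall = 0
            x_prev = x
        responses.append(tau_star)
    # Fit a simple linear least-squares regression of tau_star on the sampled states.
    X = np.column_stack([np.ones(len(sampled_states)), np.asarray(sampled_states)])
    theta, *_ = np.linalg.lstsq(X, np.asarray(responses, dtype=float), rcond=None)
    return theta
```

At prediction time, the fitted parameters are simply evaluated at the current state and rounded up to obtain the forecast horizon for the next roll, as in the rollout sketch above.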
From Table 3 to Table 5, we can see that the stationary policy trained with a given discount factor using the periodic variant of the SDDP algorithm performs quite well compared to the RH policies. Specifically, if we pick the policies corresponding to γ = 0. in every instance and compare them to z̄, we can see that z̄_γ is only around . , . , and . larger than z̄ on average, for instances where |H| = 1, |H| = 3, and |H| = 6, respectively. More importantly, the computational time required by the stationary policy is . , . , and . less, on average, than that of the static RH policies with τ = 64, for instances where |H| = 1, |H| = 3, and |H| = 6, respectively.

Table 3: Numerical results for the stationary policy (with discount factor γ) for |H| = 1.

Table 4: Numerical results for the stationary policy (with discount factor γ) for |H| = 3.

Table 5: Numerical results for the stationary policy (with discount factor γ) for |H| = 6.

It is also interesting to see, as shown in Figure 5 to Figure 7 in the appendix, that the performance of the stationary policy exhibits a plateauing behavior (as γ increases) similar to that of the static RH policies (as τ increases), although not with the same monotonicity rate. It is important to note that, when implementing the periodic variant of the SDDP algorithm under the stationarity assumption (Assumption 5) with m = 1, the (infinite horizon) MSP reduces to a static problem. This leads to the delicate issue of how to choose the trial points during the forward pass of the SDDP algorithm, which should depend on the length of the horizon T. Approximating T = ∞ with a finite horizon leads to an error that is a consequence of the so-called end-of-horizon effect. Moreover, when γ is close to 1, the convergence of the algorithm becomes slow since the required T can be large. To deal with this, in our implementation of the periodic variant of the SDDP algorithm, we treat T as a random variable following a geometric distribution with success probability p = 1 − γ. Then, at the beginning of every iteration i, we sample a forecast horizon T^i and a sample path ξ^i = (ξ^i_1, . . . , ξ^i_{T^i}), and implement the forward/backward pass along ξ^i. Note that the sampling error introduced by sampling T^i is what might be causing the non-monotonicity of z̄_γ as γ increases in our numerical results.
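The horizon-sampling step just described can be summarized in a short sketch. Here sample_stage, forward_pass, and backward_pass are hypothetical stand-ins for the stagewise sampling and the forward/backward steps of the periodic SDDP variant; the only point being illustrated is how T^i and the sample path ξ^i are drawn at every iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_horizon_and_path(gamma, sample_stage, max_T=10_000):
    """Draw T^i ~ Geometric(1 - gamma) and a sample path of that length."""
    # With probability 1 - gamma the horizon ends after each stage,
    # so E[T^i] = 1 / (1 - gamma); long draws are capped for practicality.
    T_i = min(int(rng.geometric(1.0 - gamma)), max_T)
    xi = [sample_stage() for _ in range(T_i)]
    return T_i, xi

def periodic_sddp(gamma, sample_stage, forward_pass, backward_pass, n_iters=100):
    for i in range(n_iters):
        T_i, xi = sample_horizon_and_path(gamma, sample_stage)
        trial_points = forward_pass(xi)   # hypothetical: states visited along xi
        backward_pass(xi, trial_points)   # hypothetical: cut generation along xi
```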
Finally, we investigate how the ε-sufficient forecast horizon τ*_ε suggested by Theorem 2 (see also (17)) compares to the average length of the forecast horizon used by the dynamic RH policy, τ̄ = Σ_{t=1}^T τ_t / T, over the entire sample path in the out-of-sample test, as shown in Table 6. Note that the upper bound κ on the immediate cost function f_t(·,·) is calculated as the optimal objective value of problem (19) with state variables s_{h,t} = 0, ∀h ∈ H. From Table 6, we see that τ̄ has an average value of around . , . , and . for instances where |H| = 1, |H| = 3, and |H| = 6, respectively. These numbers are not only significantly smaller than the values of τ*_ε corresponding to the different discount factors computed via Theorem 2 and equation (17), but also consistent with the observations made in Subsection 5.3.2 regarding where the dynamic policy ranks (in terms of its performance) compared to static RH policies with various lengths of forecast horizons.

Table 6: The average forecast horizon used by the dynamic policy, τ̄ = Σ_{t=1}^T τ_t / T, and the ε-sufficient forecast horizon τ*_ε obtained by (17) (columns: |H|, d̃_t, τ̄, κ, and τ*_ε for different values of γ).
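For reference, the sketch below shows how an ε-sufficient horizon of this type can be computed once κ and γ are available. It assumes a discounted-tail bound of the standard form κ γ^τ / (1 − γ) on the optimality gap of a τ-stage look-ahead policy; the exact constant prescribed by Theorem 2 and (17) may differ, so the numbers it produces should be read as illustrative only.

```python
import math

def epsilon_sufficient_horizon(kappa, gamma, eps):
    """Smallest tau with kappa * gamma**tau / (1 - gamma) <= eps.

    Assumes a discounted-tail bound of the form kappa * gamma**tau / (1 - gamma);
    the exact expression in (17) may use a different constant.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    # Solve kappa * gamma**tau / (1 - gamma) <= eps for tau and round up.
    tau = math.log(eps * (1.0 - gamma) / kappa) / math.log(gamma)
    return max(1, math.ceil(tau))

# Illustrative call with hypothetical values of kappa, gamma, and eps.
print(epsilon_sufficient_horizon(kappa=1e4, gamma=0.99, eps=1.0))
```

The example call illustrates how quickly such a bound grows as γ approaches 1, which is in line with the observation above that τ*_ε is much larger than the average horizon τ̄ used by the dynamic policy.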
In this section, we present a sensitivity analysis to assess the robustness of the solution with respect to the value used for the stalling parameter w. As previously noted, the stalling parameter w is what we use to determine whether or not x_τ will remain the same for all τ ≥ τ* during the offline training step of our proposed approach for the dynamic RH policy. In addition to the value w = 10 used in our original implementation, we also performed the same procedure using w ∈ {5, 15}. As it is intuitive to consider w = 15 as the stalling parameter with the most stable solution, we use it as the benchmark. Similarly to the analysis presented in Subsection 5.3, we report (i) the training time Time_w (in seconds) for w = 15; (ii) the long-run average cost z̄_w given by (20) for w = 15; and (iii) the relative gaps in Time_w and z̄_w for w ∈ {5, 10} compared to Time and z̄.

When comparing the performance in terms of z̄_w, for w ∈ {5, 10}, with the z̄ obtained by using the supposedly most stable parameter w = 15, we can see from Table 7 that the difference is negligible. Specifically, compared to the z̄_w obtained with w = 5, z̄ has a relative gap of . , . , and . when averaged across the instances where |H| = 1, |H| = 3, and |H| = 6, respectively, whereas compared to the z̄_w obtained with the w = 10 used in our original implementation, z̄ has a relative gap of − . , − . , and . when averaged across the instances where |H| = 1, |H| = 3, and |H| = 6, respectively.

Unlike the difference in performance in terms of z̄_w, the difference in the time it took each policy to perform the RH procedure appears to be significant. Specifically, compared to the Time_w obtained for w = 5, Time has a relative gap of − . , − . , and − . when averaged across the instances where |H| = 1, |H| = 3, and |H| = 6, respectively, whereas compared to the Time_w obtained with the w = 10 used in our original implementation, Time has a relative gap of . , − . , and − . when averaged across the instances where |H| = 1, |H| = 3, and |H| = 6, respectively.

As we can see, the dynamic RH policy constructed using w = 5 requires much less computational time than the dynamic RH policy constructed using w = 15. This is to be expected, since using w = 5 is likely to prescribe smaller forecast horizons during the RH procedure and hence lead to less computational time overall. Similarly, apart from the instances where |H| = 1, the dynamic RH policy constructed using w = 10 requires less computational time than the one with w = 15. Overall, using w = 5 instead of w = 10 leads to an average of . , − . , and . decrease in the computational time for instances where |H| = 1, |H| = 3, and |H| = 6, respectively. However, it also leads to an average of . , . , and . increase in the optimality gap for these instances, respectively.

Table 7: Sensitivity analysis with respect to the stalling parameter w used in the offline training step for the dynamic RH policy. For each instance (|H|, |Ξ_t|, d_t), the table reports z̄_w and Time_w for w = 15, together with the relative gaps (z̄_w − z̄)/z̄ and (Time_w − Time)/Time (in %) for w = 5 and w = 10.

In this paper, we have studied the question of how many stages to include in the forecast horizon when solving MSP problems with the RH procedure, for both finite and infinite horizon MSP problems. In the infinite horizon discounted case, given a fixed forecast horizon τ, we have shown that the resulting optimality gap of the static look-ahead policy, in terms of the total expected discounted cost, can be bounded by a function of the chosen forecast horizon τ. This function can be used to provide an upper bound on the forecast horizon τ*_ε at which the corresponding static RH policy achieves a prescribed ε optimality gap. In the finite horizon case (with no discount), we have taken an ADP perspective and developed a dynamic RH policy where the forecast horizon τ used in each roll of the RH procedure is chosen dynamically, according to the state of the system at that stage, via a regression function that can be trained offline. This dynamic RH policy has been shown to be capable of exploiting the state information encountered during the RH procedure. Our numerical results have illustrated the empirical behavior of the proposed RH policies on a class of MSPs and highlighted the effectiveness of the proposed approaches. Specifically, when using a static RH policy where the length of the forecast horizon in every roll is fixed, we have shown that the optimality gap plateaus and ceases to respond to increases in the length of the forecast horizon beyond a certain number of stages. On the other hand, we have shown that the advantage of the dynamic RH policy over alternative approaches lies in a slight increase in the optimality gap in exchange for a significant reduction in the computational time.

We have identified several avenues for future research. First, we anticipate that the proposed approaches can be extended with various strategies to address the “end-of-horizon” effect during the RH procedure. Second, we expect that some assumptions made for the sake of simplicity in the experiments and analysis can be relaxed, allowing the proposed approaches to be applied to a broader class of MSPs, including those defined on hidden Markov chains.
Lastly, although we chose not to pursue it in this paper, we expect that the proposed approaches can be more appealing for multistage stochastic integer programs with a finite action/control space.

References

[1] Vikas Goel and Ignacio E Grossmann. A stochastic programming approach to planning of offshore gas field developments under uncertainty in reserves. Computers & Chemical Engineering, 28(8):1409–1429, 2004.
[2] Vitor L de Matos, David P Morton, and Erlon C Finardi. Assessing policy quality in a multistage stochastic program for long-term hydrothermal scheduling. Annals of Operations Research, 253(2):713–731, 2017.
[3] Tito Homem-de-Mello, Vitor L de Matos, and Erlon C Finardi. Sampling strategies and stopping criteria for stochastic dual dynamic programming: a case study in long-term hydrothermal scheduling. Energy Systems, 2(1):1–31, 2011.
[4] Mario Veiga Pereira, Sérgio Granville, Marcia HC Fampa, Rafael Dix, and Luiz Augusto Barroso. Strategic bidding under uncertainty: a binary expansion approach. IEEE Transactions on Power Systems, 20(1):180–188, 2005.
[5] Steffen Rebennack. Combining sampling-based and scenario-based nested Benders decomposition methods: application to stochastic dual dynamic programming. Mathematical Programming, 156(1-2):343–389, 2016.
[6] Alexander Shapiro, Wajdi Tekaya, Joari Paulo da Costa, and Murilo Pereira Soares. Risk neutral and risk averse stochastic dual dynamic programming method. European Journal of Operational Research, 224(2):375–391, 2013.
[7] Shabbir Ahmed, Alan J King, and Gyana Parija. A multi-stage stochastic integer programming approach for capacity expansion under uncertainty. Journal of Global Optimization, 26(1):3–24, 2003.
[8] Georg Ch Pflug and Werner Römisch. Modeling, Measuring and Managing Risk. World Scientific, 2007.
[9] Jitka Dupačová. Portfolio Optimization and Risk Management via Stochastic Programming. Osaka University Press, 2009.
[10] Jitka Dupačová and Jan Polívka. Asset-liability management for Czech pension funds using stochastic programming. Annals of Operations Research, 165(1):5–28, 2009.
[11] Antonio Alonso, Laureano F Escudero, and M Teresa Ortuño. A stochastic 0–1 program based approach for the air traffic flow management problem. European Journal of Operational Research, 120(1):47–62, 2000.
[12] Boutheina Fhoula, Adnene Hajji, and Monia Rekik. Stochastic dual dynamic programming for transportation planning under demand uncertainty. In , pages 550–555. IEEE, 2013.
[13] Yale T Herer, Michal Tzur, and Enver Yücesan. The multilocation transshipment problem. IISE Transactions, 38(3):185–200, 2006.
[14] Giovanni Pantuso. The football team composition problem: a stochastic programming approach. Journal of Quantitative Analysis in Sports, 13(3):113–129, 2017.
[15] Richard Bellman. Dynamic Programming. Princeton University Press, 1957.
[16] Suresh Chand, Vernon Ning Hsu, and Suresh Sethi. Forecast, solution, and rolling horizons in operations management problems: A classified bibliography. Manufacturing & Service Operations Management, 4(1):25–43, 2002.
[17] O Hernández-Lerma and JB Lasserre. A forecast horizon and a stopping rule for general Markov decision processes. Journal of Mathematical Analysis and Applications, 132(2):388–400, 1988.
[18] Francesca Maggioni, Elisabetta Allevi, and Marida Bertocchi. Bounds in multistage linear stochastic programming. Journal of Optimization Theory and Applications, 163(1):200–229, 2014.
[19] Francesca Maggioni and Georg Ch Pflug. Bounds and approximations for multistage stochastic programs. SIAM Journal on Optimization, 26(1):831–855, 2016.
[20] Giovanni Pantuso and Trine K Boomsma. On the number of stages in multistage stochastic programs. Annals of Operations Research, pages 1–23, 2019.
[21] Warren B Powell. Clearing the jungle of stochastic optimization. In Bridging Data and Decisions, pages 109–137. INFORMS, 2014.
[22] Jikai Zou, Shabbir Ahmed, and Xu Andy Sun. Partially adaptive stochastic optimization for electric power generation expansion planning. INFORMS Journal on Computing, 30(2):388–401, 2018.
[23] W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 842. John Wiley & Sons, 2011.
[24] Alexander Shapiro. Analysis of stochastic dual dynamic programming method. European Journal of Operational Research, 209(1):63–72, 2011.
[25] Mario VF Pereira and Leontina MVG Pinto. Multi-stage stochastic optimization applied to energy planning. Mathematical Programming, 52(1-3):359–375, 1991.
[26] Murwan Siddig and Yongjia Song. Adaptive partition-based SDDP algorithms for multistage stochastic linear programming. arXiv preprint arXiv:1908.11346, 2019.
[27] Alexander Shapiro and Lingquan Ding. Stationary multistage programs. Optimization Online, 2019.
[28] O Hernández-Lerma and JB Lasserre. Error bounds for rolling horizon policies in discrete-time Markov control processes. IEEE Transactions on Automatic Control, 35(10):1118–1124, 1990.
[29] C Bes and JB Lasserre. An on-line procedure in discounted infinite-horizon stochastic optimal control. Journal of Optimization Theory and Applications, 50(1):61–67, 1986.
[30] Alexander Shapiro, Wajdi Tekaya, Joari Paulo da Costa, and Murilo Pereira Soares. Report for technical cooperation between Georgia Institute of Technology and ONS - Operador Nacional do Sistema Elétrico, 2011.
[31] Amir Ali Nasrollahzadeh, Amin Khademi, and Maria E Mayorga. Real-time ambulance dispatching and relocation. Manufacturing & Service Operations Management, 20(3):467–480, 2018.
[32] Thuener Silva, Davi Valladão, and Tito Homem-de-Mello. A data-driven approach for a class of stochastic dynamic optimization problems. Optimization Online, 2019.
[33] Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.
A Tables
Notation: Description
H, F: The set of hydro/thermal plants in the HPOP problem
h, f: A hydro/thermal plant index in the set of hydro/thermal plants
c_{g,t}, c̃_{g,t}: Cost and random cost vector of generating thermal power at time t
c_{p,t}, c̃_{p,t}: Unit penalty cost and random unit penalty cost for unsatisfied demand at time t
b_{h,t}, b̃_{h,t}: Amount of inflows and random inflows to hydro plant h during stage t
r_{h,t}: Amount of power generated by releasing one unit of water flow in hydro plant h
d_t, d̃_t: Demand and random demand in stage t
q̄_h: Maximum allowed amount of turbined flow in hydro plant h in stage t
v_{h,t}, v̄_{h,t}: Minimum/maximum level of water allowed in hydro plant h
g_{f,t}, ḡ_{f,t}: Minimum/maximum allowed amount of power generated by thermal plant f in stage t
U(h), L(h): Set of immediate upper/lower stream hydro plants of h in the network

Table 8: Notation for the parameters of the HPOP stage-t problem (19).

Table 9: Realizations and probability distribution of ξ_t (rainfall at hydro plant h ∈ H) when |Ξ_t| = 5.

Table 10: Realizations and probability distribution of ξ_t (rainfall at hydro plant h ∈ H) when |Ξ_t| = 12.
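To indicate how the notation in Table 8 fits together, the following is a minimal sketch of a stage-t dispatch problem of the kind that (19) represents: thermal generation and unmet demand are priced in the objective, hydro releases produce power through the coefficients r_{h,t}, and reservoir levels are linked to the releases of the upstream plants U(h). The sketch is a simplified illustration only (it omits, e.g., spillage), all function and variable names are hypothetical, and it should not be read as the exact formulation of (19).

```python
import pulp

def stage_problem(H, F, U, r, q_bar, v_min, v_bar, g_min, g_bar,
                  c_g, c_p, d_t, v_prev, b_t):
    """A hypothetical stage-t hydrothermal dispatch LP in the spirit of (19).

    H, F    -- lists of hydro and thermal plants
    U[h]    -- immediate upstream hydro plants of h
    v_prev  -- reservoir levels carried over from stage t-1 (the state)
    b_t     -- realized inflows at stage t
    """
    prob = pulp.LpProblem("HPOP_stage_t", pulp.LpMinimize)
    q = {h: pulp.LpVariable(f"q_{h}", 0, q_bar[h]) for h in H}           # turbined flow
    v = {h: pulp.LpVariable(f"v_{h}", v_min[h], v_bar[h]) for h in H}    # end-of-stage level
    g = {f: pulp.LpVariable(f"g_{f}", g_min[f], g_bar[f]) for f in F}    # thermal generation
    shed = pulp.LpVariable("unmet_demand", 0)                            # load shedding

    # Objective: thermal generation cost plus penalty for unsatisfied demand.
    prob += pulp.lpSum(c_g[f] * g[f] for f in F) + c_p * shed

    # Water balance: new level = previous level + inflow - release + upstream releases.
    for h in H:
        prob += v[h] == v_prev[h] + b_t[h] - q[h] + pulp.lpSum(q[u] for u in U[h])

    # Demand balance: hydro plus thermal generation plus shedding covers demand.
    prob += pulp.lpSum(r[h] * q[h] for h in H) + pulp.lpSum(g[f] for f in F) + shed >= d_t

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(prob.objective), {h: v[h].value() for h in H}
```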
Table 11: A summary of the hydro plants data.

Table 12: A summary of the thermal plants data.

B Figures

Figure 1: Symbolic representation of the rolling horizon procedure for T = 5 and τ = 3.
Figure 2: The long-run average cost z̄_τ of the static RH policy as a function of the forecast horizon τ in all of the different test instances where |H| = 1.
Figure 3: The long-run average cost z̄_τ of the static RH policy as a function of the forecast horizon τ in all of the different test instances where |H| = 3.
Figure 4: The long-run average cost z̄_τ of the static RH policy as a function of the forecast horizon τ in all of the different test instances where |H| = 6.
Figure 5: The long-run average cost z̄ as a function of the discount factor γ in all of the different test instances where |H| = 1 (panels report z̄ and the training time for |Ξ_t| ∈ {5, 12} and d_t ∈ {1000, 1500}).
Figure 6: The long-run average cost z̄ as a function of the discount factor γ in all of the different test instances where |H| = 3 (panels report z̄ and the training time for |Ξ_t| ∈ {5, 12} and d_t ∈ {1750, 2250}).
Figure 7: The long-run average cost z̄ as a function of the discount factor γ in all of the different test instances where |H| = 6 (panels report z̄ and the training time for |Ξ_t| ∈ {5, 12} and d_t ∈ {2000, 2250}).