Regularized Decomposition of High-Dimensional Multistage Stochastic Programs with Markov Uncertainty
Tsvetan Asamov* and Warren B. Powell†

February 28, 2017

*Department of Operations Research and Financial Engineering, Princeton University
†Department of Operations Research and Financial Engineering, Princeton University
Abstract
We develop a quadratic regularization approach for the solution of high-dimensional multistage stochastic optimization problems characterized by a potentially large number of time periods/stages (e.g., hundreds), a high-dimensional resource state variable, and a Markov information process. The resulting algorithms are shown to converge to an optimal policy after a finite number of iterations under mild technical assumptions. Computational experiments are conducted using the setting of optimizing energy storage over a large transmission grid, which motivates both the spatial and temporal dimensions of our problem. Our numerical results indicate that the proposed methods exhibit significantly faster convergence than their classical counterparts, with greater gains observed for higher-dimensional problems.
1 Introduction

Multistage stochastic problems arise in a wide variety of real-world applications in fields as diverse as energy, finance, transportation and others. In this paper we consider multistage stochastic linear programs that satisfy the following conditions: i) the time horizon length $T$ is finite but potentially large (there may be hundreds of time periods and stages); ii) for each time period, the set of sample realizations of the exogenous information process is finite (and relatively small); iii) for each stage, the stage cost is a linear function of the decision.

Pereira and Pinto [19] introduced a powerful algorithmic strategy known as Stochastic Dual Dynamic Programming (SDDP) that has received considerable attention for this problem class. Despite its popularity, SDDP can exhibit slow convergence, especially in the setting of high-dimensional resource allocation problems. A separate but important challenge arises when handling problems with long horizons, which introduces algorithmic issues both in the setting of intertemporal independence and when there is Markov dependence. Not surprisingly, as practical problems grow in size, improving the rate of convergence of SDDP-type methods becomes an issue of growing importance.

Quadratic regularization has been among the most effective techniques for accelerating the convergence of scenario tree-based decomposition methods (see work by Ruszczyński [26, 27, 29]). However, its application to the SDDP framework has not been possible because of the exponential growth of the scenario tree with the number of stages. This paper makes the following contributions: i) We adopt notation that bridges the gap between dynamic programming and classical stochastic programming, which lays the foundation of our algorithmic strategy by identifying and clarifying the role of the post-decision information state; ii) We develop the first quadratic regularization method for the SDDP framework, with or without Markov dependence in the information process, that produces an optimal policy for a sampled model; iii) Unlike existing regularization methods on scenario trees, our approach remains computationally tractable even for problems that involve long time horizons; iv) Our numerical results indicate that the proposed approach exhibits faster convergence than classical SDDP and is especially useful for problems with high-dimensional resource states. That makes the work relevant to a wide variety of practical applications.

Our numerical work uses the setting of optimizing energy over a fleet of storage devices for a congested transmission grid. This problem class offers a realistic setting for testing the algorithm with anywhere from 50 to 500 batteries, allowing us to test the performance of the algorithm for resource state variables with widely varying dimensionality. A separate challenge is that these problems exhibit a large number of time periods; our experiments model a day in 5-minute increments, producing problems with 288 time periods.

2 Literature review

The decomposition approach of Benders [3] and the L-shaped method of Van Slyke and Wets [35] originally focused on the solution of two-stage stochastic optimization problems. Eventually, the idea was extended to the multi-period setting by Birge [5], as well as Donohue and Birge [8], who considered successive Benders-type approximations of the recourse functions in the nested Benders decomposition algorithm for multistage problems on scenario trees.
Pereira and Pinto [19] further extended the approach to develop Stochastic Dual Dynamic Programming, which has become popular among practitioners. On one hand, the method provides both lower and upper bounds, as well as clear convergence guarantees for many of its different versions, as discussed by Shapiro [32], as well as Linowsky and Philpott [14]. Moreover, it is also very suitable for parallel computing and can be applied to problems with long time horizons. Despite its progress towards overcoming the curse of dimensionality, in its essence SDDP is a cutting plane method, a class of algorithms known to exhibit slow convergence (see [30]); in general, their computational complexity grows exponentially with the dimension of the problem. In the special case of only two time periods, the SDDP algorithm is equivalent to the well-known cutting plane method of Kelley [12], which can require a number of iterations that grows exponentially with the dimension $n$ of the problem to achieve an $\epsilon$-optimal solution, as pointed out by Nesterov [18].

Rockafellar [24] introduced the proximal point algorithm for the minimization of (deterministic) lower semicontinuous proper convex functions. The quadratic regularization of two-stage linear stochastic optimization problems was developed by Ruszczyński [25, 29]. The same idea has also been implemented in the two-stage and multistage versions of the Stochastic Decomposition method developed by Higle and Sen [10, 31], as well as the decomposition work of Morton [17]. All of these methods utilize a scenario tree, either explicitly or implicitly (when indexing regularization terms by the entire history $H_t$), and consider separate incumbent solutions for every parent node in the scenario tree. That limits their applicability to problems with short time horizons. On the other hand, the method described below is universal and can be applied to problems with a large number of time periods.

3 Problem formulation

Given a probability space $(\Omega, \mathcal{F}, P)$ with a sigma-algebra $\mathcal{F}$, and a filtration $\{\emptyset, \Omega\} = \mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}_T = \mathcal{F}$, we consider a stochastic process $\{W_t, t = 1, \ldots, T\}$ adapted to $\{\mathcal{F}_t, t = 1, \ldots, T\}$. Throughout our presentation, we adopt the convention that any variable indexed by $t$ is $\mathcal{F}_t$-measurable (surprisingly, this is not a standard assumption). Our goal is to develop new solution methods for the following multistage linear stochastic programming problem:

$$\min_{\substack{A_0 x_0 = b_0 \\ x_0 \ge 0}} \langle c_0, x_0 \rangle + \mathbb{E}\left[ \min_{\substack{B_0 x_0 + A_1 x_1 = b_1 \\ x_1 \ge 0}} \langle c_1, x_1 \rangle + \mathbb{E}\left[ \cdots + \mathbb{E}_T\left[ \min_{\substack{B_{T-1} x_{T-1} + A_T x_T = b_T \\ x_T \ge 0}} \langle c_T, x_T \rangle \right] \cdots \right] \right]. \tag{1}$$

The components of the information process $W_t = (A_t, B_t, b_t, c_t)$, $t = 1, \ldots, T$, are the $\mathcal{F}_t$-measurable random matrices $A_t, B_t$ and vectors $b_t, c_t$, while $A_0, B_0, b_0, c_0$ are assumed to be deterministic components of the initial state of the system $S_0 = (A_0, B_0, b_0, c_0)$. We denote the sets of possible realizations of $W_t$ with $\Omega_t$, $t = 1, \ldots, T$. Those correspond to nested partitions of $\Omega$ given by the filtration $\{\mathcal{F}_t, t = 1, \ldots, T\}$, and each $\omega \in \Omega$ can be represented as $\omega = (\omega_1, \omega_2, \ldots, \omega_T) \in \Omega_1 \times \Omega_2 \times \cdots \times \Omega_T$. We assume that each sample set $\Omega_t$ has a finite number of elements that is small enough to be enumerated computationally.
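To make the sampled model concrete, the following minimal Python sketch (ours, not from the paper) represents a finite sample set $\Omega_t$ of realizations of $W_t = (A_t, B_t, b_t, c_t)$ and draws one path $\omega = (\omega_1, \ldots, \omega_T)$ under stagewise independence; the names `StageRealization` and `sample_path` are hypothetical.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class StageRealization:
    """One element of Omega_t: a realization of W_t = (A_t, B_t, b_t, c_t)."""
    A: np.ndarray
    B: np.ndarray
    b: np.ndarray
    c: np.ndarray
    prob: float  # P(omega_t) under stagewise independence

def sample_path(omegas, rng):
    """Draw omega = (omega_1, ..., omega_T) with independent stages.

    `omegas[t]` is the finite sample set Omega_{t+1}, a list of
    StageRealization objects whose probabilities sum to one.
    """
    path = []
    for stage_set in omegas:
        probs = [w.prob for w in stage_set]
        path.append(stage_set[rng.choice(len(stage_set), p=probs)])
    return path
```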
Definition 1. The information history at time $t$ is $H_t = \{S_0, \omega_1, \omega_2, \ldots, \omega_t\}$, where $H_t \in \mathcal{H}_t = \{S_0\} \times \Omega_1 \times \Omega_2 \times \cdots \times \Omega_t$. Further, we define the post-decision information history at time $t$ to be $H^x_t = \{S_0, x_0, \omega_1, x_1, \omega_2, x_2, \ldots, \omega_t, x_t\}$.

Employing a dynamic programming framework, we distinguish between two types of states of the system, the pre-decision states $S_t$ and the post-decision states $S^x_t$.
Definition 2. The (pre-decision) state $S_t$ of the system at time $t \ge 1$ is all the information in $H^x_{t-1} \cup \omega_t$ that is necessary and sufficient to make a decision at time $t$, and model the impact of $H^x_{t-1} \cup \omega_t$ on the computation of costs, constraints and transitions from time $t$ onward.

Furthermore, the pre-decision state of the system $S_t$ can be represented as $S_t = (R_t, I_t)$, where the pre-decision resource state $R_t$ is the amount of resources available at the beginning of time period $t$, and $I_t$ is the pre-decision information state. Note that $R_t$ depends on both the decision $x_{t-1}$ and the random vector $b_t$,

$$R_t = B_{t-1} x_{t-1} - b_t.$$

The information state $I_t$ contains all the remaining information of $S_t$ that is necessary and sufficient to model the system but is not in $R_t$. Formally, we consider the following model for the evolution of the system over time:

$$S_0 \xrightarrow{x_0} S^x_0 \xrightarrow{\omega_1} S_1 \xrightarrow{x_1} S^x_1 \xrightarrow{\omega_2} \cdots \xrightarrow{\omega_T} S_T \xrightarrow{x_T} S^x_T.$$

Definition 3. The post-decision state $S^x_t$, $t \ge 0$, of the system at time $t$ is all the information in the post-decision history $H^x_t$ that is necessary and sufficient to model the impact of $H^x_t$ on the computation of costs, constraints and transitions from time $t$ onward, after a decision has been made.

We also represent the post-decision state of the system as $S^x_t = (R^x_t, I^x_t)$. The post-decision resource state $R^x_t$ is given by $R^x_t = B_t x_t$, and the post-decision information state $I^x_t$ represents all the information in $S^x_t$ that is not in $R^x_t$. Moreover, we refer to the rank of the matrix $B_t$ as the dimension of the post-decision resource state.

If we define $C(S_t, x_t) := \langle c_t, x_t \rangle$ and the set $\mathcal{X}_t(S_t)$ so that the following conditions are satisfied,

$$\mathcal{X}_t(S_t) := \begin{cases} \{x_t \in \mathbb{R}^{n_t} : A_t x_t = b_t,\ x_t \ge 0\}, & \text{if } t = 0, \\ \{x_t \in \mathbb{R}^{n_t} : B_{t-1} x_{t-1} + A_t x_t = b_t,\ x_t \ge 0\}, & \text{if } t > 0, \end{cases}$$

then we can rewrite problem (1) using dynamic programming notation as follows,

$$\min_{x_0 \in \mathcal{X}_0(S_0)} C(S_0, x_0) + \mathbb{E}\left[ \min_{x_1 \in \mathcal{X}_1(S_1)} C(S_1, x_1) + \mathbb{E}\left[ \cdots + \mathbb{E}_T\left[ \min_{x_T \in \mathcal{X}_T(S_T)} C(S_T, x_T) \right] \cdots \right] \right]. \tag{2}$$

Since the problem is stochastic, its optimal solution is not a vector but rather a policy $\pi$, which is a function that maps states $S_t$ to decisions $x_t \in \mathcal{X}_t(S_t)$. Thus, we can consider the optimization problem (2) to be a search for an optimal policy $\pi^*$ over the set $\Pi$ consisting of all feasible and implementable policies:

$$\min_{\pi \in \Pi} \mathbb{E}\left[ \sum_{t=0}^{T} C(S_t, X^\pi_t(S_t)) \right]. \tag{3}$$

We refer to equation (3) as the base model, and we can solve it by constructing an optimal lookahead policy. In that case, the optimal decisions $X^*_t(S_t)$ corresponding to $\pi^*$ satisfy the following optimality equation:

$$X^*_t(S_t) \in \arg\min_{x_t \in \mathcal{X}_t(S_t)} \left( C(S_t, x_t) + \min_{\pi \in \Pi} \mathbb{E}\left\{ \sum_{t' = t+1}^{T} C(S_{t'}, X^\pi_{t'}(S_{t'})) \,\middle|\, S^x_t \right\} \right). \tag{4}$$

Therefore, we can also specify an optimal lookahead policy $\pi^*$ by employing its corresponding post-decision value functions $V^*_t(S^x_t)$,

$$V^*_t(S^x_t) = \min_{\pi \in \Pi} \mathbb{E}\left\{ \sum_{t' = t+1}^{T} C(S_{t'}, X^\pi_{t'}(S_{t'})) \,\middle|\, S^x_t \right\}. \tag{5}$$
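Equations (4)-(5) say that a policy is fully determined by the post-decision value functions. As an illustration (our sketch, not the paper's code), the loop below simulates such a policy along one sample path; `stage_argmin`, `stage_cost` and `transition` are hypothetical callables standing in for the stage solver, the cost $C(S_t, x_t)$ and the state transition.

```python
def simulate_policy(path, V, stage_argmin, stage_cost, transition, S0):
    """Simulate the policy induced by post-decision value functions V[t].

    path         -- realized (omega_1, ..., omega_T)
    V            -- list of callables, V[t](R_x) approximates V*_t
    stage_argmin -- solves min_{x in X_t(S_t)} C(S_t, x) + V[t](B_t x)
    stage_cost   -- evaluates C(S_t, x_t)
    transition   -- maps (t, S_t, x_t, omega_{t+1}) to S_{t+1}
    """
    S = S0
    total = 0.0
    for t in range(len(path) + 1):
        x = stage_argmin(t, S, V[t])       # decision at the pre-decision state
        total += stage_cost(t, S, x)       # accumulate C(S_t, x_t)
        if t < len(path):
            S = transition(t, S, x, path[t])
    return total
```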
Remark 1. It is common for practitioners to employ a scenario tree in order to construct an approximate lookahead policy for problem (2) as follows:

$$X^*_t(S_t) \in \arg\min_{x_t \in \mathcal{X}_t(S_t)} \left( C(S_t, x_t) + \min_{\pi \in \Pi} \mathbb{E}\left\{ \sum_{t' = t+1}^{t''} C(S_{t'}, X^\pi_{t'}(S_{t'})) \,\middle|\, S^x_t \right\} \right). \tag{6}$$

When $t'' < T$ the optimality of the approximate lookahead policy given by (6) cannot be established. In addition, lower and upper bounds to the optimal value of problem (2) might not be readily available (due to approximation errors stemming from the stage reduction). Nonetheless, lookahead models can produce high-quality solutions in selected problems (see [9]).

Thus, at any time period $t = 0, \ldots, T$, the optimal decision $X^*_t(S_t)$ for problem (2) can be computed as

$$X^*_t(S_t) \in \arg\min_{x_t \in \mathcal{X}_t(S_t)} \left\{ C(S_t, x_t) + V^*_t(S^x_t) \right\}.$$

Hence, the search for an optimal policy $\pi^*$ is equivalent to the computation of optimal post-decision value functions $V^*_t(S^x_t)$, $t = 0, \ldots, T$. One of the well-known methods that allows us to construct such value functions is Stochastic Dual Dynamic Programming.

4 Stochastic Dual Dynamic Programming

Typically, stochastic programming techniques model the flow of information by utilizing a scenario tree that involves the entire set $\mathcal{H}_t = \{S_0\} \times \Omega_1 \times \cdots \times \Omega_t$. While such an approach is useful for analytical purposes, its practical applicability is limited by the exponential growth of the number of nodes in the scenario tree when the length of the time horizon $T$ increases. To overcome that challenge, Pereira and Pinto [19] introduced the Stochastic Dual Dynamic Programming (SDDP) method for the solution of multistage stochastic linear optimization problems over long time horizons. SDDP overcomes the combinatorial explosion of the information state by exploiting (a key and limiting assumption of) stagewise independence, $P(\omega_{t+1} \mid H_t) = P(\omega_{t+1})$, and therefore all post-decision states $S^x_t$ share a single information state $I^x_t$. Hence, $S^x_t = R^x_t$ and the optimal value functions $V^*_t(S^x_t)$ only depend on the post-decision resource states $R^x_t$:

$$V^*_t(S^x_t) = V^*_t(R^x_t), \quad t = 0, \ldots, T.$$

The convexity of the optimal value functions $V^*_t(R^x_t)$ is the key property that allows one to partially escape the curse of dimensionality arising from partitioning the resource space. Instead, $V^*_t(R^x_t)$ can be approximated with lower-bounding convex outer approximations $V^k_t(R^x_t)$ whose functional form is the maximum over a collection of affine functions. Those are commonly known as cutting hyperplanes or Benders cuts, and are constructed at the resource points $R^{x,j}_t$ that are visited during the $j$-th forward pass:

$$V^k_t(R^x_t) := \max_{j \le k} \left\{ \alpha^j_t + \langle \beta^j_t, R^x_t - R^{x,j}_t \rangle \right\}. \tag{7}$$

For example, at iteration $k$ we would obtain $R^{x,k}_t$ by solving the following linear program,

$$x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(S_t)} \left\{ C(S_t, x_t) + V^{k-1}_t(R^x_t) \right\}, \quad \text{and setting } R^{x,k}_t \leftarrow B^k_t x^k_t, \tag{8}$$

where $V^0_t(R^x_t) = 0$. The approximations $V^{k-1}_t(R^x_t)$ are updated in the backward pass of iteration $k$ by constructing a cutting hyperplane $h^k_t(R^x_t)$ to the optimal value function $V^*_t(R^x_t)$. To accomplish this, we use a lower bound $V^k_{t+1}(R^{x,k}_t)$ (derived from solutions to subproblems for time $t+1$) to $V^*_t(R^{x,k}_t)$:

$$h^k_t(R^x_t) := V^k_{t+1}(R^{x,k}_t) + \langle \beta^k_t, R^x_t - R^{x,k}_t \rangle. \tag{9}$$

Note that the hyperplane $h^k_t(R^x_t)$ is not necessarily tangent to $V^*_t(R^x_t)$, since $V^k_{t+1}(R^{x,k}_t)$ might be strictly smaller than $V^*_t(R^{x,k}_t)$.
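A value function approximation of the form (7) is just the pointwise maximum of affine functions, which is straightforward to maintain in code. The sketch below is ours (the class and method names are illustrative); it stores cuts as triples $(\alpha^j, \beta^j, R^{x,j})$ and evaluates the outer approximation at a point.

```python
import numpy as np

class CutCollection:
    """Lower-bounding outer approximation, as in equation (7):
    V(R) = max_j { alpha_j + <beta_j, R - R_j> }."""

    def __init__(self):
        self.alphas, self.betas, self.points = [], [], []

    def add_cut(self, alpha, beta, R_point):
        self.alphas.append(float(alpha))
        self.betas.append(np.asarray(beta, dtype=float))
        self.points.append(np.asarray(R_point, dtype=float))

    def value(self, R):
        """Evaluate the approximation; -inf before any cut is added."""
        if not self.alphas:
            return -np.inf
        R = np.asarray(R, dtype=float)
        return max(a + b @ (R - p)
                   for a, b, p in zip(self.alphas, self.betas, self.points))
```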
Remark 2. When we need to emphasize the dependence of the feasible set $\mathcal{X}_{t+1}(S_{t+1})$ on the previous post-decision state $R^x_t$, we use the notation $\mathcal{X}_{t+1}(R^x_t, I_{t+1})$, where the exogenous information in $R_{t+1}$ that is not contained in $R^x_t$ is assumed to be contained in $I_{t+1}$.

In order to construct $V^k_{t+1}$, we consider every element of the sample set $\omega_{t+1} \in \Omega_{t+1}$ and denote with $V^k_{t+1}(R^x_t, \omega_{t+1})$ the optimal value of the following optimization problem,

$$V^k_{t+1}(R^x_t, \omega_{t+1}) := \min_{x_{t+1} \in \mathcal{X}_{t+1}(R^x_t, I_{t+1}(\omega_{t+1}))} \left\{ C(S_{t+1}(\omega_{t+1}), x_{t+1}) + V^k_{t+1}(R^x_{t+1}) \right\}.$$

Finally, we set

$$V^k_{t+1}(R^x_t) := \sum_{\omega_{t+1} \in \Omega_{t+1}} P(\omega_{t+1})\, V^k_{t+1}(R^x_t, \omega_{t+1}).$$

Hence, if we choose $\beta^k_t \in \partial_R V^k_{t+1}(R^{x,k}_t)$, then we can construct a new aggregated cut $h^k_t(R^x_t)$ as described in equation (9). Thus, in the backward pass of iteration $k$, we can update the approximate value function $V^k_t(R^x_t)$ as follows,

$$V^k_t(R^x_t) := \max\left\{ V^{k-1}_t(R^x_t),\ h^k_t(R^x_t) \right\}.$$

If none of the constructed cuts are removed, then the growing collections of affine functions generate sequences of monotonically increasing lower-bounding approximations $V^k_t(R^x_t)$ to the optimal post-decision value functions $V^*_t(R^x_t)$ for any $t = 0, \ldots, T-1$:

$$V^{k-1}_t(R^x_t) \le V^k_t(R^x_t) \le V^*_t(R^x_t), \quad \forall k \in \mathbb{N},\ t = 0, \ldots, T-1.$$

Furthermore, in this work we assume relatively complete recourse, i.e., for any feasible solutions to the optimization problems at time periods $t = 0, \ldots, T-1$, there exists a feasible solution to any realized stage $t+1$ subproblem with probability one. This assumption alleviates the need for feasibility cuts and allows us to improve the clarity of the presentation.

5 Quadratic regularization with stagewise independence

Existing regularization approaches [29, 25, 10, 31, 17] utilize a scenario tree and consider separate incumbent solutions $\bar{x}_t(H_t)$ for every possible history $H_t \in \mathcal{H}_t$, $t = 0, \ldots, T-1$. The underlying idea in such methods has been to augment optimization problems of the form (8) with a regularization term as follows,

$$x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(S_t)} \left\{ C(S_t, x_t) + V^{k-1}_t(R^x_t) + \rho\, \| x_t - \bar{x}_t(H_t) \|^2 \right\}. \tag{10}$$

As the algorithm progresses, each incumbent solution $\bar{x}_t(H_t)$ is updated to a new optimal solution, if certain conditions are satisfied. While such an approach is feasible for problems on scenario trees with small $T$, it is not practical for non-trivial time horizons. The exponential growth of the scenario tree ensures that only a tiny fraction of all possible realizations $H_t \in \mathcal{H}_t$, $t = 0, \ldots, T-1$, could be examined in the forward pass in a reasonable computational time. Moreover, multiple visits to each $H_t \in \mathcal{H}_t$, $t = 0, \ldots, T-1$, and multiple updates of its incumbent solution are also out of the realm of computational feasibility for most practical instances. One way to remedy this difficulty would be for different histories to share incumbent solutions. For example, a single incumbent solution $\bar{x}_t$ can be shared among all realizations $H_t \in \mathcal{H}_t$, which would result in the optimization problem

$$x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(S_t)} \left\{ C(S_t, x_t) + V^{k-1}_t(R^x_t) + \rho\, \| x_t - \bar{x}_t \|^2 \right\}. \tag{11}$$
Equation (11) can be used in place of equation (8), and it would still result in a convergent method for a fixed set of incumbent solutions $\bar{x}_t$, $t = 0, \ldots, T-1$. However, the optimality of the resulting policy cannot be established. Moreover, since the purpose of the quadratic regularization term is to mitigate the inaccuracy of the value function approximations, we do not need to regularize around the entire vector $x_t$ (which might be very high-dimensional) but only around the parameters $R^x_t$ of the post-decision value function approximations $V^{k-1}_t(R^x_t)$. Thus, we can adjust problem (11) to address these concerns by making the following adjustments,

$$x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(S_t)} \left\{ C(S_t, x_t) + V^{k-1}_t(R^x_t) + \varrho_k \left\langle R^x_t - R^{x,k-1}_t,\ Q_t (R^x_t - R^{x,k-1}_t) \right\rangle \right\}, \tag{12}$$

where the sequence of penalty coefficients $\{\varrho_k\}$ is such that $\varrho_k \ge 0$, $\forall k \in \mathbb{N}$, and $\lim_{k \to \infty} \varrho_k = 0$. We also introduce a positive semi-definite matrix $Q_t \succeq 0$, which can be used to address any scaling concerns across different entries of the resource vectors $R^x_t$. Note that the meaning of the proposed regularization strategy is quite different from its scenario tree counterparts, as it aims to steer the solution towards a "known" region of the value function domain, rather than to the "correct" solution for the given history $H_t$ of the stochastic process. Hence, we choose the incumbent solutions to be the previous points encountered in the forward pass, since the cuts supported at those points are the ones generated with the most information. Finally, we also point out that, unlike the case of scenario trees, in the current method we do not aim for the convergence of the incumbent solutions towards any point. Interested readers are free to choose different incumbent solutions that they might find appropriate, and the convergence results presented below would still hold.

Now, we can substitute equation (12) for equation (8) in the forward pass of SDDP, and the new procedure would still converge to an optimal solution of problem (2) with probability one after a finite number of iterations. That might appear somewhat surprising, since gradient methods applied to quadratic optimization problems typically entail asymptotic convergence. However, in this case a finite number of iterations is sufficient since the true problem remains linear, and the quadratic terms are only used to guide the exploration phase of the forward pass. Moreover, the generation of the supporting hyperplanes in the backward pass utilizes linear programming problems, which can generate only a finite number of different cuts when basic dual feasible solutions are used. The details of the resulting method are presented in Algorithm 1, and we study its convergence properties below.
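To make (12) concrete, here is a minimal sketch of one regularized stage subproblem posed as a convex QP, written with the cvxpy modeling package (our illustration; the paper's implementation used Java with CPLEX). The cut collection enters through an epigraph variable $\theta \ge \alpha^j + \langle \beta^j, B_t x - R^{x,j} \rangle$ for every stored cut.

```python
import cvxpy as cp
import numpy as np

def solve_regularized_stage(c, A, rhs, B, cuts, R_incumbent, varrho, Q):
    """Solve one stage of problem (12):
        min  <c, x> + theta + varrho * (Bx - R_inc)' Q (Bx - R_inc)
        s.t. A x = rhs,  x >= 0,
             theta >= alpha_j + <beta_j, B x - R_j>  for each cut j.

    `cuts` is a list of (alpha_j, beta_j, R_j) triples; an empty list
    corresponds to the initial approximation with theta fixed to zero.
    """
    n = c.shape[0]
    x = cp.Variable(n, nonneg=True)
    theta = cp.Variable()
    R_x = B @ x
    cons = [A @ x == rhs]
    if cuts:
        cons += [theta >= a + b @ (R_x - p) for (a, b, p) in cuts]
    else:
        cons += [theta == 0.0]
    obj = cp.Minimize(c @ x + theta
                      + varrho * cp.quad_form(R_x - R_incumbent, Q))
    cp.Problem(obj, cons).solve()
    return x.value, theta.value
```

Setting `varrho = 0` (or `Q` to the zero matrix) recovers the unregularized SDDP stage problem (8).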
Lemma 4 ([21, 32]). Suppose that dual basic solutions are used in the solution of subproblems in the backward pass of Algorithm 1. Then, there exist a finite number of possible value function approximations $V_t(\cdot)$, $t = 0, \ldots, T$.

Since the regularization terms are artificial for the original problem, we exclude them from the definition of an optimal policy.
Definition 5. The value function approximations $V^k_t$, $t = 0, \ldots, T$, are optimal for problem (2) if for any realization $\omega \in \Omega$,

$$\min_{x_t \in \mathcal{X}_t(S_t(\omega))} \left\{ C(S_t(\omega), x_t) + V^k_t(R^x_t) \right\} = \min_{x_t \in \mathcal{X}_t(S_t(\omega))} \left\{ C(S_t(\omega), x_t) + V^*_t(R^x_t) \right\} \tag{13}$$

for $t = 0, \ldots, T$.

Algorithm 1 Quadratic Regularization Method with Stagewise Independence
1: Choose $Q_t \succeq 0$, $t = 0, \ldots, T$, and define the sequence $\{\varrho_k\}$.
2: Define $V^k_T(R^x_T) := V^*_T(R^x_T)$, $k = 0, \ldots, K$.
3: Define $V^0_t(R^x_t) := -\infty$, $t = 0, \ldots, T-1$.
4: $(R^{x,k}_{-1}, I_0) \leftarrow S_0$, $k = 0, \ldots, K$.
5: for $k = 0, \ldots, K$ do
6:   Forward Pass:
7:   Sample $\omega \in \Omega$.
8:   for $t = 0, \ldots, T$ do
9:     if $k = 0$ then
10:      Select $x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(R^{x,k}_{t-1}, I_t(\omega))} \{ C(S_t(\omega), x_t) \}$
11:    else if $t < T$ then
12:      Select $x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(R^{x,k}_{t-1}, I_t(\omega))} \{ C(S_t(\omega), x_t) + V^{k-1}_t(R^x_t) + \varrho_k \langle R^x_t - R^{x,k-1}_t, Q_t (R^x_t - R^{x,k-1}_t) \rangle \}$
13:    else
14:      Select $x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(R^{x,k}_{t-1}, I_t(\omega))} \{ C(S_t(\omega), x_t) + V^{k-1}_t(R^x_t) \}$
15:    end if
16:    Set $R^{x,k}_t \leftarrow B^k_t x^k_t$; $S_{t+1}(\omega) \leftarrow (R^{x,k}_t - b_{t+1}(\omega), I_{t+1}(\omega))$
17:  end for
18:  Backward Pass:
19:  for $t = T, \ldots, 1$ do
20:    Define $V^k_t(R^x_{t-1}, \omega_t) := \min_{x_t \in \mathcal{X}_t(R^x_{t-1}, I_t(\omega_t))} \{ C(S_t(\omega_t), x_t) + V^k_t(R^x_t) \}$
21:    for all $\omega_t \in \Omega_t$ do
22:      Select $\beta^k_t(\omega_t) \in \partial_{R^x_{t-1}} V^k_t(R^{x,k}_{t-1}, \omega_t)$
23:    end for
24:    $\alpha^k_{t-1} \leftarrow \sum_{\omega_t \in \Omega_t} P(\omega_t)\, V^k_t(R^{x,k}_{t-1}, \omega_t)$;  $\beta^k_{t-1} \leftarrow \sum_{\omega_t \in \Omega_t} P(\omega_t)\, \beta^k_t(\omega_t)$
25:    $h^k_{t-1}(R^x_{t-1}) := \alpha^k_{t-1} + \langle \beta^k_{t-1}, R^x_{t-1} - R^{x,k}_{t-1} \rangle$
26:    $V^k_{t-1}(R^x_{t-1}) := \max\{ V^{k-1}_{t-1}(R^x_{t-1}),\ h^k_{t-1}(R^x_{t-1}) \}$
27:  end for
28:  $\underline{V}^k \leftarrow \min_{x_0 \in \mathcal{X}_0(S_0)} \{ C(S_0, x_0) + V^k_0(R^x_0) \}$
29:  Store $R^{x,k}_t$, $t = 0, \ldots, T-1$, as the incumbents for iteration $k+1$.
30: end for
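As an illustration of the cut generation in the backward pass (lines 19-27 above), the following sketch (ours; `solve_stage_lp` is a hypothetical stage solver returning the optimal value and a subgradient with respect to $R^x_{t-1}$) aggregates the per-realization values and subgradients into a single cut, matching the definitions of $\alpha^k_{t-1}$ and $\beta^k_{t-1}$.

```python
import numpy as np

def backward_cut(omega_set, probs, R_prev, solve_stage_lp):
    """Build the aggregated cut h(R) = alpha + <beta, R - R_prev>.

    omega_set      -- the finite sample set Omega_t
    probs          -- P(omega_t) for each element (sums to one)
    R_prev         -- the forward-pass point R^{x,k}_{t-1}
    solve_stage_lp -- returns (value, subgradient) of the stage-t problem
                      evaluated at (R_prev, omega_t)
    """
    alpha, beta = 0.0, None
    for p, omega_t in zip(probs, omega_set):
        v, g = solve_stage_lp(R_prev, omega_t)
        alpha += p * v
        beta = p * np.asarray(g) if beta is None else beta + p * np.asarray(g)
    return alpha, beta
```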
Theorem 6. Suppose that Algorithm 1 satisfies the following assumptions:

1. $V^k_T(\cdot) \equiv V^*_T(\cdot)$, $k \in \mathbb{N}$.
2. Dual basic optimal solutions are used in the backward pass.
3. Every element $\omega \in \Omega$ has a strictly positive probability $P(\omega) > 0$.
4. $\varrho_k \ge 0$ and $\lim_{k \to \infty} \varrho_k = 0$.
5. The feasible sets $\mathcal{X}_t(S_t)$ are bounded for each $t = 0, \ldots, T$.

Then, the regularization method presented in Algorithm 1 converges to an optimal policy of problem (2) after a finite number of iterations with probability one.

Proof. Let $\mathcal{V}_t$ denote the set of all possible value function approximations $V^k_t$, $t = 0, \ldots, T$, that can be generated by the backward pass of Algorithm 1. Since according to Assumption 2 we use only dual basic optimal solutions in the backward pass, by Lemma 4 we know that the sets $\mathcal{V}_t$ have finite cardinality for all $t = 0, \ldots, T$. Thus, we know that as the algorithm progresses all the value function approximations $V^k_t$ will eventually stabilize. Therefore, there exists an iteration index $k_0 \in \mathbb{N}$ after which no updates can be made to the value functions $V^k_t$, $t = 0, \ldots, T$, for $k > k_0$. If the value functions $V^{k_0}_t$, $t = 0, \ldots, T$, are optimal for problem (2), then we are done.

Now, suppose that was not the case. Then there exist $t'$, $0 \le t' < T$, and $\omega' \in \Omega$ such that for any $k > k_0$ we have

$$\min_{x_{t'} \in \mathcal{X}_{t'}(S_{t'}(\omega'))} \left\{ C(S_{t'}(\omega'), x_{t'}) + V^*_{t'}(R^x_{t'}(\omega')) \right\} > \min_{x_{t'} \in \mathcal{X}_{t'}(S_{t'}(\omega'))} \left\{ C(S_{t'}(\omega'), x_{t'}) + V^{k-1}_{t'}(R^x_{t'}(\omega')) \right\}.$$

Let us consider the set

$$\Delta = \Big\{ \delta \in \mathbb{R} :\ \delta = \min_{x_t \in \mathcal{X}_t(S_t(\omega))} \left\{ C(S_t(\omega), x_t) + V^*_t(R^x_t(\omega)) \right\} - \min_{x_t \in \mathcal{X}_t(S_t(\omega))} \left\{ C(S_t(\omega), x_t) + V_t(R^x_t(\omega)) \right\}$$
$$\text{such that } \min_{x_t \in \mathcal{X}_t(S_t(\omega))} \left\{ C(S_t(\omega), x_t) + V^*_t(R^x_t(\omega)) \right\} > \min_{x_t \in \mathcal{X}_t(S_t(\omega))} \left\{ C(S_t(\omega), x_t) + V_t(R^x_t(\omega)) \right\},$$
$$\text{where } \omega \in \Omega \text{ and } V_t \in \mathcal{V}_t,\ t = 0, \ldots, T \Big\}. \tag{14}$$

Since the number of elements $\omega \in \Omega$ is finite, we know that the set $\Delta$ also has a finite number of elements. Thus, $\Delta$ has a minimum element, and we denote $\epsilon = \min \Delta$. Hence,

$$\min_{x_{t'} \in \mathcal{X}_{t'}(S_{t'}(\omega'))} \left\{ C(S_{t'}(\omega'), x_{t'}) + V^*_{t'}(R^x_{t'}(\omega')) \right\} - \min_{x_{t'} \in \mathcal{X}_{t'}(S_{t'}(\omega'))} \left\{ C(S_{t'}(\omega'), x_{t'}) + V^{k-1}_{t'}(R^x_{t'}(\omega')) \right\} \ge \epsilon.$$

If $t' > 0$, we know that

$$V^*_{t'-1}(R^x_{t'-1}) = \sum_{\omega_{t'} \in \Omega_{t'}} P(\omega_{t'}) \min_{x_{t'} \in \mathcal{X}(R^x_{t'-1}, I_{t'}(\omega_{t'}))} \left\{ C(S_{t'}(\omega_{t'}), x_{t'}) + V^*_{t'}(R^x_{t'}) \right\},$$

and, using convexity,

$$V^{k-1}_{t'-1}(R^x_{t'-1}) \le \sum_{\omega_{t'} \in \Omega_{t'}} P(\omega_{t'}) \min_{x_{t'} \in \mathcal{X}(R^x_{t'-1}, I_{t'}(\omega_{t'}))} \left\{ C(S_{t'}(\omega_{t'}), x_{t'}) + V^{k-1}_{t'}(R^x_{t'}) \right\}.$$

Therefore,

$$\min_{x_{t'-1} \in \mathcal{X}_{t'-1}(S_{t'-1}(\omega'))} \left\{ C(S_{t'-1}(\omega'), x_{t'-1}) + V^*_{t'-1}(R^x_{t'-1}(\omega')) \right\} > \min_{x_{t'-1} \in \mathcal{X}_{t'-1}(S_{t'-1}(\omega'))} \left\{ C(S_{t'-1}(\omega'), x_{t'-1}) + V^{k-1}_{t'-1}(R^x_{t'-1}(\omega')) \right\}, \tag{15}$$

which implies

$$\min_{x_{t'-1} \in \mathcal{X}_{t'-1}(S_{t'-1}(\omega'))} \left\{ C(S_{t'-1}(\omega'), x_{t'-1}) + V^*_{t'-1}(R^x_{t'-1}(\omega')) \right\} - \min_{x_{t'-1} \in \mathcal{X}_{t'-1}(S_{t'-1}(\omega'))} \left\{ C(S_{t'-1}(\omega'), x_{t'-1}) + V^{k-1}_{t'-1}(R^x_{t'-1}(\omega')) \right\} \ge \epsilon. \tag{16}$$

Proceeding by backward induction, we know that

$$\min_{x_0 \in \mathcal{X}_0(S_0)} \left\{ C(S_0, x_0) + V^*_0(R^x_0) \right\} - \min_{x_0 \in \mathcal{X}_0(S_0)} \left\{ C(S_0, x_0) + V^{k-1}_0(R^x_0) \right\} \ge \epsilon.$$

Moreover, using Assumption 5 we know that $R^x_t$ is bounded for each $t$. Hence, without loss of generality we can assume that $k_0$ is such that

$$\varrho_k \left\langle R^x_t - R^{x,k-1}_t,\ Q_t (R^x_t - R^{x,k-1}_t) \right\rangle < \epsilon, \quad \text{for } t = 0, \ldots, T-1.$$
Hence, if we denote with $\tilde{x}^k_0$ the solution to the following regularized problem,

$$\tilde{x}^k_0 = \arg\min_{x_0 \in \mathcal{X}_0(S_0)} \left\{ C(S_0, x_0) + V^{k-1}_0(R^x_0(\omega')) + \varrho_k \left\langle R^x_0(\omega') - R^{x,k-1}_0,\ Q_0 (R^x_0(\omega') - R^{x,k-1}_0) \right\rangle \right\},$$

then we know that

$$\min_{x_0 \in \mathcal{X}_0(S_0)} \left\{ C(S_0, x_0) + V^*_0(R^x_0) \right\} > C(S_0, \tilde{x}^k_0) + V^{k-1}_0(R^{\tilde{x},k}_0) + \varrho_k \left\langle R^{\tilde{x},k}_0 - R^{x,k-1}_0,\ Q_0 (R^{\tilde{x},k}_0 - R^{x,k-1}_0) \right\rangle.$$

And since $Q_0$ is positive semi-definite, we know that

$$\min_{x_0 \in \mathcal{X}_0(S_0)} \left\{ C(S_0, x_0) + V^*_0(R^x_0) \right\} > C(S_0, \tilde{x}^k_0) + V^{k-1}_0(R^{\tilde{x},k}_0),$$

which implies

$$C(S_0, \tilde{x}^k_0) + V^*_0(R^{\tilde{x},k}_0) > C(S_0, \tilde{x}^k_0) + V^{k-1}_0(R^{\tilde{x},k}_0),$$

and therefore $V^*_0(R^{\tilde{x},k}_0) > V^{k-1}_0(R^{\tilde{x},k}_0)$. In other words, $V^{k-1}_0(\cdot)$ is suboptimal at the point $R^{\tilde{x},k}_0$ corresponding to $\tilde{x}^k_0$. Hence, if the value function $V^{k-1}_1(\cdot)$ is such that for each $\omega_1 \in \Omega_1$ the following holds,

$$\min_{x_1 \in \mathcal{X}(R^{\tilde{x},k}_0, I_1(\omega_1))} \left\{ C(S_1(\omega_1), x_1) + V^{k-1}_1(R^x_1) \right\} = \min_{x_1 \in \mathcal{X}(R^{\tilde{x},k}_0, I_1(\omega_1))} \left\{ C(S_1(\omega_1), x_1) + V^*_1(R^x_1) \right\},$$

then the backward pass will result in an updated value function $V^k_0(\cdot)$ such that $V^k_0(R^{\tilde{x},k}_0) = V^*_0(R^{\tilde{x},k}_0) > V^{k-1}_0(R^{\tilde{x},k}_0)$, which is a contradiction with the choice of $k_0$. Therefore, it must be the case that there exists $\omega'' \in \Omega$ such that

$$\min_{x_1 \in \mathcal{X}(R^{\tilde{x},k}_0, I_1(\omega''))} \left\{ C(S_1(\omega''), x_1) + V^{k-1}_1(R^x_1) \right\} < \min_{x_1 \in \mathcal{X}(R^{\tilde{x},k}_0, I_1(\omega''))} \left\{ C(S_1(\omega''), x_1) + V^*_1(R^x_1) \right\}.$$

Moreover,

$$\min_{x_T \in \mathcal{X}(S_T(\omega))} \left\{ C(S_T(\omega), x_T) + V^{k-1}_T(R^x_T) \right\} = \min_{x_T \in \mathcal{X}(S_T(\omega))} \left\{ C(S_T(\omega), x_T) + V^*_T(R^x_T) \right\}.$$

Therefore, there exist a sample path $\omega'' \in \Omega$ and a time index $t''$, $0 < t'' < T$, such that the sequence of regularized solutions $\tilde{x}^k_t(\omega'')$ would result in a suboptimal value function evaluation at $t''$,

$$\min_{x_{t''} \in \mathcal{X}(R^{\tilde{x},k}_{t''-1}, I_{t''}(\omega''))} \left\{ C(S_{t''}(\omega''), x_{t''}) + V^{k-1}_{t''}(R^x_{t''}) \right\} < \min_{x_{t''} \in \mathcal{X}(R^{\tilde{x},k}_{t''-1}, I_{t''}(\omega''))} \left\{ C(S_{t''}(\omega''), x_{t''}) + V^*_{t''}(R^x_{t''}) \right\},$$

and optimal evaluations at $t'' + 1$ for all possible $\omega_{t''+1} \in \Omega_{t''+1}$,

$$\min_{x_{t''+1} \in \mathcal{X}(R^{\tilde{x},k}_{t''}, I_{t''+1}(\omega_{t''+1}))} \left\{ C(S_{t''+1}(\omega_{t''+1}), x_{t''+1}) + V^{k-1}_{t''+1}(R^x_{t''+1}) \right\} = \min_{x_{t''+1} \in \mathcal{X}(R^{\tilde{x},k}_{t''}, I_{t''+1}(\omega_{t''+1}))} \left\{ C(S_{t''+1}(\omega_{t''+1}), x_{t''+1}) + V^*_{t''+1}(R^x_{t''+1}) \right\}.$$

Hence the backward pass of iteration $k$ will result in an updated value function approximation $V^k_{t''}(\cdot)$ such that

$$V^k_{t''}(R^{\tilde{x},k}_{t''}) = V^*_{t''}(R^{\tilde{x},k}_{t''}) > V^{k-1}_{t''}(R^{\tilde{x},k}_{t''}),$$

which is a contradiction with the choice of $k_0$. This completes the proof.

[Figure 1: Value function update at iteration $k$. Left panel (forward pass): the approximation $V^{k-1}_{t''}(R^x_{t''})$ lies below $V^*_{t''}(R^x_{t''})$ near the regularized point $R^{\tilde{x},k}_{t''}(\omega'')$. Right panel (backward pass): the updated approximation $V^k_{t''}(R^x_{t''})$.]

6 Markov uncertainty

Despite its advantages, the SDDP methodology has one crucial drawback. The stagewise independence of $W_t = (A_t, B_t, b_t, c_t)$ will generally not hold in practice, since real-world multistage problems often involve stochastic processes that exhibit some degree of temporal dependence. There are different approaches that we can adopt to address this difficulty.
First, let us consider the special case when the history dependence occurs only in the right-hand side constraint vectors $b_t$, and it has the following autoregressive structure:

$$b_t = \sum_{t'=1}^{t-1} \left( \Phi_{t,t'} b_{t'} + \Psi_{t,t'} \eta_{t'} \right) + \eta_t, \tag{17}$$

where the process $(A_t, B_t, c_t, \eta_t)$ is stagewise independent and the deterministic matrices $\Phi_{t,t'}$ and $\Psi_{t,t'}$ contain the autoregressive information. Then, for each time period $t > 0$ in the SDDP formulation, we can extend the original optimization problem with additional variables to accommodate the realizations of $b_{t'}$ and $\eta_{t'}$, $t' < t$, that are necessary to model the autoregressive dependence (see [6, 15], and [34]). The advantage of such a solution to the history dependence problem is that stagewise independence is present in the extended formulation. A drawback of the approach is that the dimension of the state space also increases from $|R^x_t|$ (in the stagewise independence case) to possibly as much as $|R^x_t| + \sum_{t'=0}^{t-1} (|b_{t'}| + |\eta_{t'}|)$ (in the history dependent case), which implies a slower convergence rate (note that we can omit the terms $|b_{t'}|$ if $\Phi_{t,t'} = 0$, and $|\eta_{t'}|$ if $\Psi_{t,t'} = 0$). This problem can be alleviated with the use of cut sharing strategies as described in [11] and [7].

In the remainder of this section we consider an alternative setup that leads to an increase in the information dimension rather than the resource dimension. We assume that the stochastic process $W_t$ is a discrete state Markov chain. Thus, the probability of occurrence of $\omega_{t+1} \in \Omega_{t+1}$ depends only on the current realization $\omega_t \in \Omega_t$ or the current post-decision information state $I^x_t$:

$$P(\omega_{t+1} \mid H_t) = \begin{cases} P(\omega_{t+1} \mid S_0) = P(\omega_{t+1} \mid I^x_0), & \text{if } t = 0, \\ P(\omega_{t+1} \mid \omega_t) = P(\omega_{t+1} \mid I^x_t), & \text{if } t > 0. \end{cases} \tag{18}$$

Such an approach can be suitable for problems where the process $(A_t, B_t, c_t)$ is not stagewise independent, or the autoregressive model (17) does not constitute a good fit to the observed realizations of the random process. For example, historical weather data might indicate the presence of distinct patterns that cannot be explained with a normal error distribution around a given mean (which arises in autoregressive estimation). Alternatively, the relevant information state could be the forecast of the highest temperature tomorrow.

To properly model such weather dynamics one might need to consider different weather regimes that are inherently distinct. Thus, multiple approximations of the value functions need to be employed, which increases the size of the optimization problem. That leads to greater computational requirements for solving the problem, as a distinct recourse function approximation needs to be constructed for every $I^x_t \in \mathcal{I}^x_t$, where $\mathcal{I}^x_t$ denotes the set of all possible post-decision information states at time $t$. Hence, we need to maintain $|\mathcal{I}^x_t(\Omega_t)|$ sets of cuts for each time period $t = 0, \ldots, T$, and therefore the approach is suitable for problems where the cardinality of the possible post-decision information states $|\mathcal{I}^x_t(\Omega_t)|$ is small, or alternatively when the cardinality of the sample sets $|\Omega_t|$ is small. However, unlike the case of an autoregressive fit (17), the dimension of the post-decision resource state is preserved in each set of cuts, and the corresponding exponential increase in the computational time is avoided.
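A minimal sketch of sampling a path from the Markov model (18) (our illustration; we index the regimes in each $\Omega_t$ by integers and represent the conditional distribution as a row-stochastic transition matrix):

```python
import numpy as np

def sample_markov_path(P0, P_trans, T, rng):
    """Sample (omega_1, ..., omega_T) under the Markov model (18).

    P0      -- initial distribution P(omega_1 | S_0) over regimes
    P_trans -- transition matrix, P_trans[i, j] = P(omega_{t+1}=j | omega_t=i)
    """
    path = [rng.choice(len(P0), p=P0)]
    for _ in range(T - 1):
        path.append(rng.choice(P_trans.shape[1], p=P_trans[path[-1]]))
    return path
```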
In the forward pass at iteration $k$, we consider a sample path $\omega = (\omega_1, \ldots, \omega_T)$ that is generated using (18). At each time step $t = 0, \ldots, T-1$, the piecewise-linear value function $V^{k-1}_t(R^x_t, I^x_t(\omega))$ is used to approximate the optimal value function $V^*_t(R^x_t, I^x_t(\omega))$ at the current post-decision information state $I^x_t(\omega)$.

In the backward pass of the algorithm at iteration $k$, we consider $t = T, \ldots, 1$ and generate the cutting hyperplanes $h^k_{t-1}(R^x_{t-1}, I^x_{t-1})$ for each $I^x_{t-1} \in \mathcal{I}^x_{t-1}$. Note that if the random process $W_t$ is a finite state Markov chain, then $|\mathcal{I}^x_t(\Omega_t)| \le |\Omega_t|$, $t = 0, \ldots, T$. We employ the conditional probabilities $P(\omega_t \mid I^x_{t-1})$ to construct constant intercepts and slopes,

$$\alpha^k_{t-1}(I^x_{t-1}) \leftarrow \sum_{\omega_t \in \Omega_t} P(\omega_t \mid I^x_{t-1})\, V^k_t(R^{x,k}_{t-1}, \omega_t) \quad \text{and} \quad \beta^k_{t-1}(I^x_{t-1}) \leftarrow \sum_{\omega_t \in \Omega_t} P(\omega_t \mid I^x_{t-1})\, \beta^k_t(\omega_t).$$

Thus,

$$h^k_{t-1}(R^x_{t-1}, I^x_{t-1}) := \alpha^k_{t-1}(I^x_{t-1}) + \langle \beta^k_{t-1}(I^x_{t-1}),\ R^x_{t-1} - R^{x,k}_{t-1} \rangle.$$

Hence, we can construct the new value function approximation $V^k_{t-1}(R^x_{t-1}, I^x_{t-1})$ for the post-decision information state $I^x_{t-1}$ as

$$V^k_{t-1}(R^x_{t-1}, I^x_{t-1}) := \max\left\{ V^{k-1}_{t-1}(R^x_{t-1}, I^x_{t-1}),\ h^k_{t-1}(R^x_{t-1}, I^x_{t-1}) \right\}. \tag{19}$$

The description of the method is given in Algorithm 2.
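In code, the only change relative to the stagewise independent case is that each stage keeps one cut collection per post-decision information state, and cuts are aggregated with the conditional probabilities. A sketch (ours; it reuses the hypothetical `CutCollection` class from Section 4):

```python
def markov_backward_step(V_prev, info_states, Omega_t, P_cond,
                         R_prev, stage_values, stage_subgrads):
    """Add one cut per information state I at stage t-1, per equation (19).

    V_prev         -- dict mapping I to a CutCollection for stage t-1
    P_cond[I][w]   -- conditional probability P(omega_t = w | I)
    stage_values   -- dict omega_t -> V^k_t(R_prev, omega_t)
    stage_subgrads -- dict omega_t -> beta^k_t(omega_t)
    """
    for I in info_states:
        alpha = sum(P_cond[I][w] * stage_values[w] for w in Omega_t)
        beta = sum(P_cond[I][w] * stage_subgrads[w] for w in Omega_t)
        V_prev[I].add_cut(alpha, beta, R_prev)
    return V_prev
```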
Theorem 7. Suppose that $\{W_t, t = 1, \ldots, T\}$ is a discrete Markov process as described by equation (18). If $V^k_T(\cdot, I^x_T) \equiv V^*_T(\cdot, I^x_T)$ for $I^x_T \in \mathcal{I}^x_T$, $k = 0, \ldots, K$, and conditions 2, 3, 4, and 5 specified in Theorem 6 are satisfied, then the method presented in Algorithm 2 converges to an optimal policy of problem (2) after a finite number of iterations with probability one.

Algorithm 2 Quadratic Regularization Method for Markov Models
1: Choose $Q_t \succeq 0$, $t = 0, \ldots, T$, and define the sequence $\{\varrho_k\}$.
2: Define $V^k_T(R^x_T, I^x_T) := V^*_T(R^x_T, I^x_T)$, $k = 0, \ldots, K$, $I^x_T \in \mathcal{I}^x_T$.
3: Define $V^0_t(R^x_t, I_t(\omega_t)) := -\infty$, $\omega_t \in \Omega_t$, $t = 0, \ldots, T-1$.
4: $(R^{x,k}_{-1}, I_0) \leftarrow S_0$, $k = 0, \ldots, K$.
5: for $k = 0, \ldots, K$ do
6:   Forward Pass:
7:   Sample $\omega \in \Omega$ using the Markov stochastic process $\{W_t, t = 1, \ldots, T\}$.
8:   for $t = 0, \ldots, T$ do
9:     if $k = 0$ then
10:      Select $x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(R^{x,k}_{t-1}, I_t(\omega))} \{ C(S_t(\omega), x_t) \}$
11:    else if $t < T$ then
12:      Select $x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(R^{x,k}_{t-1}, I_t(\omega))} \{ C(S_t(\omega), x_t) + V^{k-1}_t(R^x_t, I^x_t(\omega)) + \varrho_k \langle R^x_t - R^{x,k-1}_t, Q_t (R^x_t - R^{x,k-1}_t) \rangle \}$
13:    else
14:      Select $x^k_t \in \arg\min_{x_t \in \mathcal{X}_t(R^{x,k}_{t-1}, I_t(\omega))} \{ C(S_t(\omega), x_t) + V^{k-1}_t(R^x_t, I^x_t(\omega)) \}$
15:    end if
16:    Set $R^{x,k}_t \leftarrow B^k_t x^k_t$; $S_{t+1}(\omega) \leftarrow (R^{x,k}_t - b_{t+1}(\omega), I_{t+1}(\omega))$
17:  end for
18:  Backward Pass:
19:  for $t = T, \ldots, 1$ do
20:    Define $V^k_t(R^x_{t-1}, \omega_t) := \min_{x_t \in \mathcal{X}_t(R^x_{t-1}, I_t(\omega_t))} \{ C(S_t(\omega_t), x_t) + V^k_t(R^x_t, I^x_t(\omega_t)) \}$
21:    for all $\omega_t \in \Omega_t$ do
22:      Select $\beta^k_t(\omega_t) \in \partial_{R^x_{t-1}} V^k_t(R^{x,k}_{t-1}, \omega_t)$
23:    end for
24:    for all $I^x_{t-1} \in \mathcal{I}^x_{t-1}(\Omega_{t-1})$ do
25:      $\alpha^k_{t-1}(I^x_{t-1}) \leftarrow \sum_{\omega_t \in \Omega_t} P(\omega_t \mid I^x_{t-1})\, V^k_t(R^{x,k}_{t-1}, \omega_t)$;  $\beta^k_{t-1}(I^x_{t-1}) \leftarrow \sum_{\omega_t \in \Omega_t} P(\omega_t \mid I^x_{t-1})\, \beta^k_t(\omega_t)$
26:      $h^k_{t-1}(R^x_{t-1}, I^x_{t-1}) := \alpha^k_{t-1}(I^x_{t-1}) + \langle \beta^k_{t-1}(I^x_{t-1}), R^x_{t-1} - R^{x,k}_{t-1} \rangle$
27:      $V^k_{t-1}(R^x_{t-1}, I^x_{t-1}) := \max\{ V^{k-1}_{t-1}(R^x_{t-1}, I^x_{t-1}),\ h^k_{t-1}(R^x_{t-1}, I^x_{t-1}) \}$
28:    end for
29:  end for
30:  $\underline{V}^k \leftarrow \min_{x_0 \in \mathcal{X}_0(S_0)} \{ C(S_0, x_0) + V^k_0(R^x_0, I^x_0) \}$
31:  Store $R^{x,k}_t$, $t = 0, \ldots, T-1$, as the incumbents for iteration $k+1$.
32: end for

Proof. The proof is analogous to the proof of Theorem 6. The main difference is that for each time period $t = 1, \ldots, T$ we need to consider $|\mathcal{I}^x_t|$ different value functions $V^k_t(R^x_t, I^x_t)$. Since $|\mathcal{I}^x_t|$ is finite, the argument of the proof of Theorem 6 can be extended to show that, with probability one, there exists a large enough $k_0 \in \mathbb{N}$ such that the value functions $V^k_t(R^{\hat{x},k}_t(\omega), I^x_t)$ are optimal for all $\omega \in \Omega$ and $I^x_t \in \mathcal{I}^x_t$, $t = 0, \ldots, T$.
Remark 3. Various optimization methods for Markov models have been studied in the literature for both the risk-neutral (see [23, 22, 33]) and risk-averse cases (see [20, 28, 16] and the references therein). An extensive treatment of optimization problems with Markov uncertainty is beyond the scope of this work. The goal of our presentation is to introduce regularization into the Markovian setting, so that it can be adapted to other problems on a case-by-case basis.
7 Implementation

In order to turn mathematical arguments into useful numerical results, one needs to employ a high-quality implementation and suitable parameter tuning. In this section we present some of the potential issues regarding the reliability and computational performance of the methods presented above. We consider the construction of regularization sequences, and discuss numerical concerns regarding the solutions of subproblems.
7.1 Regularization sequences

In general, we cannot find a regularization sequence that would lead to the fastest possible convergence. However, if we consider sequences that are defined by a set of parameters, then we can attempt to find suitable parameter values. For example, we can construct regularization sequences $\varrho_k \ge 0$ such that $\lim_{k \to \infty} \varrho_k = 0$ by using the following geometric sequence. Given $\varrho_0 > 0$ and $r \in (0, 1)$, we define

$$\varrho_k = \varrho_0 r^k = r \cdot \varrho_{k-1}, \quad \text{if } k > 0. \tag{20}$$

In this case, we need to tune the parameters $\varrho_0$ and $r$. We can gain insight by solving a small instance of the given problem for different pairs $(\varrho_0, r)$. For example, in Section 8 we describe an optimization model to be solved for high-dimensional post-decision resource states $|R^x_t| \ge 50$. As a pre-processing step, we can solve a smaller instance, e.g. $|R^x_t| = 25$, for each pair $(\varrho_0, r)$ with $\varrho_0 \in \{1, 10, 100\}$ and three values of $r \in (0, 1)$, and compare the results. Since estimates of the upper bounds and optimality gaps are stochastic, we prefer to compare only the deterministic lower bounds, as they are more reliable. The resulting plots can be found in Figure 2. We can see that the sequence of regularization coefficients has an impact on the behavior of the proposed methods. However, various choices of $(\varrho_0, r)$ can be used with similar success. In our experiments in Section 8, we use $\varrho_0 = 1$ together with the tuned value of $r$.
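The geometric schedule (20) is a one-liner; a small sketch (ours) that also illustrates how quickly the penalty vanishes, using assumed values $\varrho_0 = 1$ and $r = 0.9$ for illustration only:

```python
def geometric_rho(rho0, r, K):
    """Return the regularization sequence rho_k = rho0 * r**k, k = 0..K,
    as in equation (20); rho0 > 0 and 0 < r < 1, so rho_k -> 0."""
    return [rho0 * r**k for k in range(K + 1)]

rhos = geometric_rho(1.0, 0.9, 300)       # rho0 and r are assumed values
print(rhos[0], rhos[50], rhos[300])       # 1.0, ~5.2e-3, ~1.9e-14
```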
[Figure 2: Lower bounds on the objective value of a stochastic optimization problem for different regularization sequences over 300 iterations. Left panel: stagewise independence with $|R^x_t| = 25$; right panel: Markov uncertainty with $|R^x_t| = 25$. Legend: $(\varrho_0, r)$ pairs with $\varrho_0 \in \{1, 10, 100\}$.]
7.2 Numerical precision of subproblem solutions

At each step of the forward and backward pass of Algorithm 1 and Algorithm 2, we use the current collection of hyperplanes $\{\alpha^j_t + \langle \beta^j_t, R^x_t - R^{x,j}_t \rangle,\ j \le k\}$ and a realization of $W_t = (A_t, B_t, b_t, c_t)$ as part of the input to a convex optimization problem having the following general form:

$$\begin{aligned} \min\ & \langle c, y \rangle + \tfrac{1}{2} \langle y, Qy \rangle \\ \text{s.t.}\ & Ay = b, \\ & y \ge 0. \end{aligned} \tag{21}$$

The numerical precision of the solutions to subproblems (21) is essential for the correctness of the resulting policy for problem (1). However, the right-hand side vector $b$ of problem (21) includes the vector $b_t$ and the constant terms $\alpha^j_t - \beta^j_t R^{x,j}_t$ of the value function approximations given in (7) or (19). If problem (1) has a long time horizon, then an aggregation of constant terms with large modulus $|\alpha^j_t - \beta^j_t R^{x,j}_t|$ can occur, and that could lead to numerical solutions of problem (21) which do not satisfy the system of constraints $B_{t-1} x_{t-1} + A_t x_t = b_t$ with a desirable precision. Convex optimization tools, including specialized algorithms for linear and quadratic programming problems, often use convergence tolerance parameters to guide their stopping conditions. For problems with long time horizons, we encounter numerical precision problems that require care in setting tolerance parameters for stopping conditions. In the subsections below, we discuss the issues of relative primal feasibility and the relative complementarity gap.

7.2.1 Relative primal feasibility

Suppose that a given optimization solver has a feasibility condition of the following form:

$$\frac{\| Ay - b \|}{\| b \|} \le \varepsilon_f. \tag{22}$$

We can consider two right-hand side vectors $b_1, b_2$ such that $\| b_1 \| < \| b_2 \|$, and corresponding candidate solutions $y_1, y_2$ such that

$$\frac{\| Ay_1 - b_1 \|}{\| b_1 \|} = \frac{\| Ay_2 - b_2 \|}{\| b_2 \|} = \varepsilon_f.$$

Then the feasibility errors satisfy $\| Ay_1 - b_1 \| < \| Ay_2 - b_2 \|$. Therefore, if we keep the primal feasibility tolerance $\varepsilon_f$ fixed while $\| b \|$ grows, then the feasibility errors $\| Ay - b \|$ (and therefore $\| B_{t-1} x_{t-1} + A_t x_t - b_t \|$) could increase as well. Hence, for problems with a long time horizon or a large number of hyperplanes in the value function approximation, one might need to decrease the tolerance $\varepsilon_f$ for problem (21) in order to bring the size of the error $\| B_{t-1} x_{t-1} + A_t x_t - b_t \|$ down to an acceptable level.

7.2.2 Relative complementarity gap

Commercial solvers often include implementations of primal-dual interior point methods (see [36, 4]) that employ a relative complementarity tolerance $\varepsilon_c$ in their stopping condition. The presence of large (by modulus) constant terms in the right-hand side vector $b$ can also lead the numerical solver to terminate at an infeasible solution with non-negligible errors $\| Ay - b \|$ and $\| B_{t-1} x_{t-1} + A_t x_t - b_t \|$, if $\varepsilon_c$ is not chosen appropriately. We present a brief explanation below.

The Lagrangian of problem (21) is given by

$$L(y, \mu, \lambda) = \langle c, y \rangle + \tfrac{1}{2} \langle y, Qy \rangle + \langle \mu, b - Ay \rangle - \langle \lambda, y \rangle. \tag{23}$$

Hence, the Karush-Kuhn-Tucker conditions for problem (21) are given by the system of constraints

$$\begin{aligned} Ay &= b, \\ A^\top \mu - Qy + \lambda &= c, \\ Y \Lambda &= 0, \\ y, \lambda &\ge 0, \end{aligned} \tag{24}$$

where $Y = \operatorname{diag}(y)$ and $\Lambda = \operatorname{diag}(\lambda)$. Interior point methods construct iterative approximations to the solution of (24) using a sequence of scalar barrier parameters $\nu_n > 0$ such that $\nu_n \downarrow 0$. Assuming that the initial point $(y_0, \mu_0, \lambda_0)$ is infeasible for (24) and $\langle y_0, \lambda_0 \rangle > 0$, we can have a stopping condition for the complementarity gap such as

$$\frac{\langle y_n, \lambda_n \rangle}{\langle y_0, \lambda_0 \rangle} \le \varepsilon_c \quad \text{or} \quad \nu_n \le \varepsilon_c \quad \text{or} \quad \frac{\nu_n}{\left| \langle c, y_n \rangle + \langle y_n, Q y_n \rangle \right|} \le \varepsilon_c. \tag{25}$$
At iteration $n$, the interior point method finds a Newton direction $(\Delta y, \Delta \mu, \Delta \lambda)$ for problem (24) as the solution to the following system:

$$\begin{pmatrix} A & 0 & 0 \\ -Q & A^\top & I \\ \Lambda_n & 0 & Y_n \end{pmatrix} \begin{pmatrix} \Delta y \\ \Delta \mu \\ \Delta \lambda \end{pmatrix} = \begin{pmatrix} b - A y_n \\ c - A^\top \mu_n + Q y_n - \lambda_n \\ \nu_n \mathbf{1} - Y_n \Lambda_n \mathbf{1} \end{pmatrix}, \tag{26}$$

where $I$ denotes the identity matrix. Then the current solution $(y_n, \mu_n, \lambda_n)$ can be updated by choosing $\gamma_n \in (0, 1]$ such that

$$(y_n, \lambda_n) + \gamma_n (\Delta y, \Delta \lambda) \ge 0 \tag{27}$$

and setting

$$(y_{n+1}, \mu_{n+1}, \lambda_{n+1}) = (y_n, \mu_n, \lambda_n) + \gamma_n (\Delta y, \Delta \mu, \Delta \lambda). \tag{28}$$

Note that if $\gamma_n = 1$, then $y_{n+1} \ge 0$ would be feasible for problem (21), since $A y_{n+1} = b$. However, in practice we usually have $\gamma_n < 1$. Hence, a complementarity tolerance condition (25) can be met even if the system $B_{t-1} x_{t-1} + A_t x_t = b_t$ is not satisfied within the desired precision. In order to address this concern, in our numerical experiments we set the tolerance $\varepsilon_c$ to the smallest possible value allowed by the solver.
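A simple way to monitor the two quantities discussed above is to compute them directly from a candidate primal-dual solution; a small numpy sketch (ours) of the checks implied by (22) and (25):

```python
import numpy as np

def solution_diagnostics(A, b, y, lam, y0, lam0):
    """Compute the relative primal feasibility (22) and the relative
    complementarity gap from (25) for a candidate solution of (21)."""
    rel_feas = np.linalg.norm(A @ y - b) / np.linalg.norm(b)
    rel_comp = (y @ lam) / (y0 @ lam0)
    return rel_feas, rel_comp
```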
8 Numerical experiments

In this section we study the computational performance of the algorithms proposed above. We focus our analysis on the following questions.

• How is the computational performance of Algorithms 1 and 2 affected by:
  – the dimension of the resource vector $R_t$?
  – the size of the post-decision information state space $\mathcal{I}^x_t$?
• How does the performance of Algorithms 1 and 2 compare to their non-regularized counterparts?

Our experimental work was conducted using the setting of optimizing grid level storage for a large transmission grid managed by PJM Interconnection. PJM manages grid level storage devices from a single location, making it a natural setting for testing our algorithms. As of this writing, grid level storage is dropping in price, providing a meaningful setting to evaluate the performance of our algorithms for a wide range of storage devices, challenging the ability of the algorithms to handle high-dimensional applications. For this reason, we conducted tests on networks with 50 to 500 storage devices. These are much higher dimensional problems than prior research that has focused on the management of water reservoirs.

Another distinguishing feature of our grid storage setting (compared to prior experimental work) is that a natural time step is 5 minutes, which is the frequency with which real-time electricity prices (known as LMPs, for locational marginal prices) are updated on the PJM grid. We anticipate using storage devices to hold energy over horizons of several hours. For this reason, we used a 24-hour model, divided into 5-minute increments, for 288 time periods, which is quite large compared to many applications using this algorithmic technology.

A complete description of the given model is beyond the scope of the current paper and can be found in [1]. Below we briefly describe the construction of the network and the exogenous stochastic process. Finally, we present the results of an extensive set of experiments investigating the effect of regularization, the number of storage devices (which determines the dimensionality of $R_t$), and the presence of an exogenous post-decision information state, on the rate of convergence and solution quality.

8.1 The network

We performed our experiments using an aggregated version of the PJM grid. Instead of the full network with 9,000 buses and 14,000 transmission lines, we limited our analysis to the higher voltage lines, producing a grid with 1,360 buses and 1,715 transmission lines. The power generators include 396 gas turbines (23,309 MW), 50 combined cycle generators (21,248 MW), 264 steam generators (73,374 MW), 31 nuclear reactors (31,086 MW), and 84 conventional hydro power generators (2,217 MW). Off-shore wind power was simulated for a set of hypothetical wind turbines with a combined maximum capacity of 16 GW. Moreover, we consider a daily time horizon with 5-minute discretization, resulting in a total of 288 time periods.

The data was prepared by first running a unit-commitment simulator called SMART-ISO that determines which generators are on or off at each point in time, given forecasts of wind generated from a planned set of off-shore wind farms. We made the assumption that the use of grid level storage would not change which generators are on or off at any point in time. However, we simultaneously optimize ramping the generators up or down within ranges, while charging and discharging storage devices around the grid in the presence of stochastic injections from the wind farms.

We placed the distributed storage devices at the points-of-interconnection for wind farms, as well as the buses with the highest demand. Each storage device is characterized by its minimum and maximum energy capacity, its charging and discharging efficiency, and its variable storage cost. The control of multiple storage devices in a distributed energy system is a challenging task that depends on a variety of factors, such as the location of each device and the presence of transmission line congestion. A good storage algorithm needs to respond to daily variations in supply, demand and congestion, taking advantage of opportunities to store energy near generation points (to avoid congestion) or near load points (during off-peak periods). It has to balance when and where to store and discharge in a stochastic, time-dependent setting, providing a challenging test environment for our algorithm.
8.2 The stochastic process

Our only source of uncertainty (the exogenous information) was the injected wind from the off-shore wind farms. In order to calibrate our stochastic wind error model, we employed historical wind data and speed measurements of off-shore wind for the month of January 2010. For each time period, we consider a set of ten vectors of possible wind speed realizations, which correspond to ten different weather regimes.

In general, the exogenous information process can be characterized by one of the following: stagewise independence, compact state variables (Markov processes), or scenario-dependence (path dependence). For some instances, the latter case could be reduced to one of the former two by applying an appropriate transformation, as described in Section 6. In our experiments, we consider instances with stagewise independent transitions between ten equally likely scenarios. When we assumed stagewise independence, we would sample from each of these 10 scenarios with equal probability at each time period. For the problems with Markov uncertainty, we assumed that at every time period $t$, the probability of continuing with the same weather regime at time $t+1$ is 91 percent. Additionally, each of the remaining nine regimes can be visited at time $t+1$ with a probability of 1 percent.
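For reference, the regime transition model just described corresponds to a $10 \times 10$ row-stochastic matrix with 0.91 on the diagonal and 0.01 elsewhere; a two-line numpy sketch (ours), which can be passed to the hypothetical `sample_markov_path` helper from Section 6:

```python
import numpy as np

n_regimes = 10
# P[i, j] = P(regime j at t+1 | regime i at t): 0.91 on the diagonal,
# 0.01 for each of the nine other regimes (each row sums to one).
P = np.full((n_regimes, n_regimes), 0.01) + np.eye(n_regimes) * 0.90
```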
[Figure 3: Simulated daily realizations of wind power (Simulations 1-10) for a given wind farm over a 24-hour horizon; wind power in MWh plotted against time of day.]
The proposed algorithms were implemented in Java, and the IBM ILOG CPLEX 12.4 solver was used for the solution of linear and quadratic convex optimization problems. Further, we performed parameter tuning as described in Section 7. We set the relative complementarity tolerance of CPLEX to the smallest value allowed by the solver, and used a geometric regularization sequence with $\varrho_0 = 1$ and the value of $r$ selected in Section 7. Additionally, we run each method for $K = 300$ iterations. Moreover, the scaling matrices $Q_t$, $t = 0, \ldots, T$, are set to the identity matrix, which implies that the amount of energy in each storage device has the same weight in the regularization term.

In this section we examine the performance of Algorithms 1 and 2 when the number of storage devices (the dimension of the resource state variable) is $|R^x_t| = 50, 100, 200, 500$. Plots of the behavior of the methods can be found in Figures 4, 5, 6 and 7 below. Each figure shows the results for stagewise independence on the left, and Markov uncertainty on the right. These graphs show the convergence of the upper and lower bounds, illustrating the dramatic impact of regularization, especially as the number of dimensions grows. The results suggest that we consistently obtain high-quality solutions within approximately 50 iterations for all problems.
[Figure 4: Numerical comparison of multistage stochastic optimization methods for $|R^x_t| = 50$; upper and lower bounds of SDDP vs. Algorithm 1 (stagewise independence, left) and of MSDDP vs. Algorithm 2 (Markov uncertainty, right) over 300 iterations.]

[Figure 5: Numerical comparison of multistage stochastic optimization methods for $|R^x_t| = 100$; same layout as Figure 4.]
[Figure 6: Numerical comparison of multistage stochastic optimization methods for $|R^x_t| = 200$; same layout as Figure 4.]

[Figure 7: Numerical comparison of multistage stochastic optimization methods for $|R^x_t| = 500$; same layout as Figure 4.]

Table 1 and Table 2 show the CPU times (in seconds) per iteration for problems with 50 to 500 storage devices, with stagewise independence and Markov uncertainty, for up to 300 iterations. We note that in a practical application, the algorithms would be run offline (for example, the day before, given a particular forecast of wind). The cuts would be stored and then used in real time the next day. This would be easily implementable in a policy updated every 5 minutes.

[Table 1: CPU times (seconds) per iteration under stagewise independence, for $|R^x_t| = 50$ to $500$.]

[Table 2: CPU times (seconds) per iteration under Markov uncertainty, for $|R^x_t| = 50$ to $500$.]

9 Conclusion
Large scale multistage stochastic optimization problems with long time horizons arise in numerous real-world applications in energy, finance, transportation and other fields. The numerical solution of such models can be computationally demanding, often causing practitioners to face a trade-off between solution quality and computational time.

In our work we have developed regularization methods for the SDDP framework and studied their convergence. The algorithms employ regularization terms in the selection of cutting hyperplanes, which improve the quality of the resulting value function approximations. The proposed techniques feature straightforward implementation and can be quickly integrated into existing software solutions without the need for major additional efforts in development and testing.

In order to assess the performance of the proposed approach, we consider a model for the integration of renewable energy using distributed grid-level storage into the grid of PJM, one of the largest regional transmission operators in the United States. Our numerical experiments indicate that the proposed regularized algorithms exhibit significantly faster convergence than their non-regularized counterparts, with greater gains observed for higher-dimensional problems.

In the future we can consider several extensions of the current work. One possible direction would involve further investigation into the selection of appropriate regularization terms and coefficients. Another possible path of exploration would be the application of regularization techniques to the solution of risk-averse models involving time-consistent compositions of coherent measures of risk, along the lines of [13, 2], and [34]. Additionally, we would also like to extend the proposed approach to the solution of multiobjective stochastic models [37]. Finally, obtaining further empirical results and insights from problems in the field would also be a subject of great interest.
References

[1] T. Asamov and W. B. Powell, Multistage stochastic optimization methods applied to grid energy storage, tech. report, Department of Operations Research and Financial Engineering, Princeton University, 2015.
[2] T. Asamov and A. Ruszczyński, Time-consistent approximations of risk-averse multistage stochastic optimization problems, Mathematical Programming (2014), pp. 1-35, http://dx.doi.org/10.1007/s10107-014-0813-x.
[3] J. F. Benders, Partitioning procedures for solving mixed-variables programming problems, Numerische Mathematik, 4 (1962), pp. 238-252.
[4] H. Y. Benson, Interior-point linear programming solvers, Wiley Encyclopedia of Operations Research and Management Science, (2009).
[5] J. R. Birge, Decomposition and partitioning methods for multistage stochastic linear programs, Operations Research, 33 (1985), pp. 989-1007.
[6] A. Chiralaksanakul and D. P. Morton, Assessing policy quality in multi-stage stochastic programming, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät II, Institut für Mathematik, (2004).
[7] A. R. de Queiroz and D. P. Morton, Sharing cuts under aggregated forecasts when decomposing multi-stage stochastic programs, Operations Research Letters, 41 (2013), pp. 311-316.
[8] C. J. Donohue and J. R. Birge, The abridged nested decomposition method for multistage stochastic linear programs with relatively complete recourse, Algorithmic Operations Research, 1 (2006).
[9] S. Gülten and A. Ruszczyński, Two-stage portfolio optimization with higher-order conditional measures of risk, Annals of Operations Research, 229 (2014), pp. 409-427.
[10] J. L. Higle and S. Sen, Finite master programs in regularized stochastic decomposition, Mathematical Programming, 67 (1994), pp. 143-168.
[11] G. Infanger and D. P. Morton, Cut sharing for multistage stochastic linear programs with interstage dependency, Mathematical Programming, 75 (1996), pp. 241-256.
[12] J. E. Kelley, Jr., The cutting-plane method for solving convex programs, Journal of the Society for Industrial & Applied Mathematics, 8 (1960), pp. 703-712.
[13] V. Kozmík and D. Morton, Risk-averse stochastic dual dynamic programming.
[14] K. Linowsky and A. B. Philpott, On the convergence of sampling-based decomposition algorithms for multistage stochastic programs, Journal of Optimization Theory and Applications, 125 (2005), pp. 349-366.
[15] M. Maceira and J. Damázio, Use of the PAR(p) model in the stochastic dual dynamic programming optimization scheme used in the operation planning of the Brazilian hydropower system, Probability in the Engineering and Informational Sciences, 20 (2006), pp. 143-156.
[16] B. Mo, A. Gjelsvik, and A. Grundt, Integrated risk management of hydro power scheduling and contract management, IEEE Transactions on Power Systems, 16 (2001), pp. 216-221.
[17] D. P. Morton, An enhanced decomposition algorithm for multistage stochastic hydroelectric scheduling, Annals of Operations Research, 64 (1996), pp. 211-235.
[18] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, vol. 87, Springer, 2004.
[19] M. V. F. Pereira and L. M. V. G. Pinto, Multi-stage stochastic optimization applied to energy planning, Mathematical Programming, 52 (1991), pp. 359-375.
[20] A. B. Philpott and V. L. de Matos, Dynamic sampling algorithms for multistage stochastic programs with risk aversion, tech. report, Electric Power Optimization Centre, University of Auckland, 2011.
[21] A. B. Philpott and Z. Guan, On the convergence of stochastic dual dynamic programming and related methods, Operations Research Letters, 36 (2008), pp. 450-455.
[22] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley & Sons, 2011.
[23] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 2014.
[24] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14 (1976), pp. 877-898.
[25] A. Ruszczyński, A regularized decomposition method for minimizing a sum of polyhedral functions, Mathematical Programming, 35 (1986), pp. 309-333.
[26] A. Ruszczyński, Regularized decomposition of stochastic programs: algorithmic techniques and numerical results, (1993).
[27] A. Ruszczyński, Decomposition methods in stochastic programming, Mathematical Programming, 79 (1997), pp. 333-353.
[28] A. Ruszczyński, Risk-averse dynamic programming for Markov decision processes, Mathematical Programming, 125 (2010), pp. 235-261.
[29] A. Ruszczyński and A. Świętanowski, Accelerating the regularized decomposition method for two-stage stochastic linear problems, European Journal of Operational Research, 101 (1997), pp. 328-342.
[30] A. P. Ruszczyński, Nonlinear Optimization, vol. 13, Princeton University Press, 2006.
[31] S. Sen and Z. Zhou, Multistage stochastic decomposition: a bridge between stochastic programming and approximate dynamic programming, SIAM Journal on Optimization, 24 (2014), pp. 127-153.
[32] A. Shapiro, Analysis of stochastic dual dynamic programming method, European Journal of Operational Research, 209 (2011), pp. 63-72.
[33] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, vol. 16, SIAM, 2014.
[34] A. Shapiro, W. Tekaya, J. P. da Costa, and M. P. Soares, Risk neutral and risk averse stochastic dual dynamic programming method, European Journal of Operational Research, 224 (2013), pp. 375-391.
[35] R. M. Van Slyke and R. Wets, L-shaped linear programs with applications to optimal control and stochastic programming, SIAM Journal on Applied Mathematics, 17 (1969), pp. 638-663.
[36] S. J. Wright, Primal-Dual Interior-Point Methods, SIAM, 1997.
[37] C. M. Young, M. Li, Y. Zhu, M. Xie, E. A. Elsayed, and T. Asamov, Multiobjective optimization of a port-of-entry inspection policy, IEEE Transactions on Automation Science and Engineering, 7 (2010), pp. 392-400.