Scenario Reduction Revisited: Fundamental Limits and Guarantees
Napat Rujeerapaiboon, Kilian Schindler, Daniel Kuhn, Wolfram Wiesemann
Abstract

The goal of scenario reduction is to approximate a given discrete distribution with another discrete distribution that has fewer atoms. We distinguish continuous scenario reduction, where the new atoms may be chosen freely, and discrete scenario reduction, where the new atoms must be chosen from among the existing ones. Using the Wasserstein distance as a measure of proximity between distributions, we identify those $n$-point distributions on the unit ball that are least susceptible to scenario reduction, i.e., that have maximum Wasserstein distance to their closest $m$-point distributions for some prescribed $m < n$. We also provide sharp bounds on the added benefit of continuous over discrete scenario reduction. Finally, to our best knowledge, we propose the first polynomial-time constant-factor approximations for both discrete and continuous scenario reduction as well as the first exact exponential-time algorithms for continuous scenario reduction.

Keywords: scenario reduction, Wasserstein distance, constant-factor approximation algorithm, $k$-median clustering, $k$-means clustering

Napat Rujeerapaiboon, Kilian Schindler, Daniel Kuhn
Risk Analytics and Optimization Chair, École Polytechnique Fédérale de Lausanne, Switzerland
Tel.: +41 (0)21 693 00 36, Fax: +41 (0)21 693 24 89
E-mail: napat.rujeerapaiboon@epfl.ch, kilian.schindler@epfl.ch, daniel.kuhn@epfl.ch

Wolfram Wiesemann
Imperial College Business School, Imperial College London, United Kingdom
Tel.: +44 (0)20 7594 9150
E-mail: [email protected]

1 Introduction

The vast majority of numerical solution schemes in stochastic programming rely on a discrete approximation of the true (typically continuous) probability distribution governing the uncertain problem parameters. This discrete approximation is often generated by sampling from the true distribution. Alternatively, it could be constructed directly from real historical observations of the uncertain parameters.
To obtain a faithful approximation of the true distribution, however, the discrete distribution must have a large number $n$ of support points or scenarios, which may render the underlying stochastic program computationally prohibitive.

An effective means to ease the computational burden is scenario reduction, pioneered by Dupačová et al (2003), which aims to approximate the initial $n$-point distribution with a simpler $m$-point distribution ($m < n$) that is as close as possible to the initial distribution with respect to a probability metric; see also Heitsch and Römisch (2003). The modern stability theory of stochastic programming surveyed by Dupačová (1990) and Römisch (2003) indicates that the Wasserstein distance may serve as a natural candidate for this probability metric.

Our interest in Wasserstein distance-based scenario reduction is also fuelled by recent progress in data-driven distributionally robust optimization, where it has been shown that the worst-case expectation of an uncertain cost over all distributions in a Wasserstein ball can often be computed efficiently via convex optimization (Mohajerin Esfahani and Kuhn 2015, Zhao and Guan 2015, Gao and Kleywegt 2016). A Wasserstein ball is defined as the family of all distributions that are within a certain Wasserstein distance from a discrete reference distribution. As distributionally robust optimization problems over Wasserstein balls are harder to solve than their stochastic counterparts, we expect significant computational savings from replacing the initial $n$-point reference distribution with a new $m$-point reference distribution. The benefits of scenario reduction may be particularly striking for two-stage distributionally robust linear programs, which admit tight approximations as semidefinite programs (Hanasusanto and Kuhn 2016).

Suppose now that the initial distribution is given by $\mathbb{P} = \sum_{i \in I} p_i \delta_{\xi_i}$, where $\xi_i \in \mathbb{R}^d$ and $p_i \in [0, 1]$ represent the location and probability of the $i$-th scenario of $\mathbb{P}$ for $i \in I = \{1, \ldots, n\}$. Similarly, assume that the reduced target distribution is representable as $\mathbb{Q} = \sum_{j \in J} q_j \delta_{\zeta_j}$, where $\zeta_j \in \mathbb{R}^d$ and $q_j \in [0, 1]$ stand for the location and probability of the $j$-th scenario of $\mathbb{Q}$ for $j \in J = \{1, \ldots, m\}$. Then, the type-$l$ Wasserstein distance between $\mathbb{P}$ and $\mathbb{Q}$ is defined through

$$ d_l(\mathbb{P}, \mathbb{Q}) \;=\; \min_{\Pi \in \mathbb{R}_+^{n \times m}} \left[ \, \sum_{i \in I} \sum_{j \in J} \pi_{ij} \, \| \xi_i - \zeta_j \|^l \;:\; \textstyle\sum_{j \in J} \pi_{ij} = p_i \;\; \forall i \in I, \;\; \sum_{i \in I} \pi_{ij} = q_j \;\; \forall j \in J \right]^{1/l}, $$

where $l \geq 1$ and $\| \cdot \|$ denotes some norm on $\mathbb{R}^d$; see, e.g., Heitsch and Römisch (2007) or Pflug and Pichler (2011). The linear program in the definition of the Wasserstein distance can be viewed as a minimum-cost transportation problem, where $\pi_{ij}$ represents the amount of probability mass shipped from $\xi_i$ to $\zeta_j$ at unit transportation cost $\|\xi_i - \zeta_j\|^l$. Thus, $d_l^l(\mathbb{P}, \mathbb{Q})$ quantifies the minimum cost of moving the initial distribution $\mathbb{P}$ to the target distribution $\mathbb{Q}$.

For any $\Xi \subseteq \mathbb{R}^d$, we denote by $\mathcal{P}_E(\Xi, n)$ the set of all uniform discrete distributions on $\Xi$ with exactly $n$ distinct scenarios and by $\mathcal{P}(\Xi, m)$ the set of all (not necessarily uniform) discrete distributions on $\Xi$ with at most $m$ scenarios. We henceforth assume that $\mathbb{P} \in \mathcal{P}_E(\mathbb{R}^d, n)$. This assumption is crucial for the simplicity of the results in Sections 2 and 3, and it is almost surely satisfied whenever $\mathbb{P}$ is obtained via sampling from a continuous probability distribution. Hence, we can think of $\mathbb{P}$ as an empirical distribution. To remind us of this interpretation, we will henceforth denote the initial distribution by $\hat{\mathbb{P}}_n$.
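The transportation problem in this definition is a plain linear program, so $d_l$ between two discrete distributions can be computed with any LP solver. The sketch below is ours, not the authors'; the helper name `wasserstein_l` is hypothetical, and it uses SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_l(xi, p, zeta, q, l=2):
    """Type-l Wasserstein distance between two discrete distributions.

    xi: (n, d) atom locations with probabilities p; zeta: (m, d) atom
    locations with probabilities q.  Solves the minimum-cost
    transportation LP from the definition of d_l."""
    n, m = len(xi), len(zeta)
    # unit costs ||xi_i - zeta_j||_2^l, flattened row by row
    cost = np.linalg.norm(xi[:, None, :] - zeta[None, :, :], axis=2) ** l
    A_eq, b_eq = [], []
    for i in range(n):                     # row sums: sum_j pi_ij = p_i
        row = np.zeros(n * m)
        row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(p[i])
    for j in range(m):                     # column sums: sum_i pi_ij = q_j
        col = np.zeros(n * m)
        col[j::m] = 1.0
        A_eq.append(col); b_eq.append(q[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / l)
```

For instance, moving the uniform two-point distribution on $\{0, 1\} \subset \mathbb{R}$ to the Dirac distribution at $0.5$ costs $0.5$ for both $l = 1$ and $l = 2$.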
Note that the pairwise distinctness of the scenarios can always be enforced by slightly perturbing their locations, while the uniformity of their probabilities can be enforced by decomposing the scenarios into clusters of close but mutually distinct sub-scenarios with (smaller) uniform probabilities.

We are now ready to introduce the continuous scenario reduction problem

$$ C_l(\hat{\mathbb{P}}_n, m) \;=\; \min_{\mathbb{Q}} \left\{ d_l(\hat{\mathbb{P}}_n, \mathbb{Q}) \;:\; \mathbb{Q} \in \mathcal{P}(\mathbb{R}^d, m) \right\}, $$

where the new scenarios $\zeta_j$, $j \in J$, of the target distribution $\mathbb{Q}$ may be chosen freely from within $\mathbb{R}^d$, as well as the discrete scenario reduction problem

$$ D_l(\hat{\mathbb{P}}_n, m) \;=\; \min_{\mathbb{Q}} \left\{ d_l(\hat{\mathbb{P}}_n, \mathbb{Q}) \;:\; \mathbb{Q} \in \mathcal{P}(\mathrm{supp}(\hat{\mathbb{P}}_n), m) \right\}, $$

where the new scenarios must be chosen from within the support of the empirical distribution, which is given by the finite set $\mathrm{supp}(\hat{\mathbb{P}}_n) = \{\xi_i : i \in I\}$. Even though the continuous scenario reduction problem offers more flexibility and is therefore guaranteed to find (weakly) better approximations to the initial empirical distribution, to our best knowledge, the existing stochastic programming literature has exclusively focused on the discrete scenario reduction problem.

Note that if the support points $\zeta_j$, $j \in J$, are fixed, then both scenario reduction problems simplify to a linear program over the probabilities $q_j$, $j \in J$, which admits an explicit solution (Dupačová et al 2003, Theorem 2). Otherwise, however, both problems are intractable. Indeed, if $l = 1$, then the discrete scenario reduction problem represents a metric $k$-median problem with $k = m$, which was shown to be NP-hard by Kariv and Hakimi (1979).
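When the reduced atoms $\zeta_j$ are fixed, the explicit solution mentioned above simply routes the whole mass of every scenario to a nearest retained atom. A minimal sketch of this step (ours; the function name is hypothetical):

```python
import numpy as np

def reduce_to_fixed_atoms(xi, p, zeta, l=2):
    """Optimal reduction of the distribution (xi, p) onto fixed atoms zeta.

    Every scenario xi_i ships its whole mass p_i to a nearest new atom,
    which is optimal in the transportation problem once the atoms are
    fixed (cf. Dupacova et al 2003, Theorem 2).  Returns the new
    weights q and the resulting type-l Wasserstein distance."""
    dist = np.linalg.norm(xi[:, None, :] - zeta[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    q = np.zeros(len(zeta))
    np.add.at(q, nearest, p)               # accumulate mass per new atom
    d = (p @ dist[np.arange(len(xi)), nearest] ** l) ** (1.0 / l)
    return q, d
```

For example, reducing the uniform distribution on $\{0, 1, 2\} \subset \mathbb{R}$ onto the atoms $\{0, 2\}$ yields weights $(2/3, 1/3)$ (with the tie at the middle point broken towards the first atom) and type-1 distance $1/3$.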
If $l = 2$ and distances in $\mathbb{R}^d$ are measured by the 2-norm, on the other hand, then the continuous scenario reduction problem constitutes a $k$-means clustering problem with $k = m$, which is NP-hard even if $d = 2$ or $m = 2$; see Mahajan et al (2009) and Aloise et al (2009).

Heitsch and Römisch (2003) have shown that the discrete scenario reduction problem admits a reformulation as a mixed-integer linear program (MILP), which can be solved to global optimality for moderate values of $n$ using off-the-shelf solvers. For larger instances, however, one must resort to approximation algorithms. Most large-scale discrete scenario reduction problems are nowadays solved with a greedy heuristic that was originally devised by Dupačová et al (2003) and further refined by Heitsch and Römisch (2003). For example, this heuristic is routinely used for scenario (tree) reduction in the context of power systems operations, see, e.g., Römisch and Vigerske (2010) or Morales et al (2009) and the references therein. Despite its practical success, we will show in Section 4 that this heuristic fails to provide a constant-factor approximation for the discrete scenario reduction problem.

This paper extends the theory of scenario reduction along several dimensions.

(i) We establish fundamental performance guarantees for continuous scenario reduction when $l \in \{1, 2\}$, i.e., we show that the Wasserstein distance of the initial $n$-point distribution to its nearest $m$-point distribution is bounded by $\sqrt{\frac{n-m}{n-1}}$ across all initial distributions on the unit ball in $\mathbb{R}^d$. We show that for $l = 2$ this worst-case performance is attained by some initial distribution, which we construct explicitly. We also provide evidence indicating that this worst-case performance reflects the norm rather than the exception in high dimensions $d$. Finally, we provide a lower bound on the worst-case performance for $l = 1$.
(ii) We analyze the loss of optimality incurred by solving the discrete scenario reduction problem instead of its continuous counterpart. Specifically, we demonstrate that the ratio $D_l(\hat{\mathbb{P}}_n, m) / C_l(\hat{\mathbb{P}}_n, m)$ is bounded by $\sqrt{2}$ for $l = 2$ and by $2$ for $l = 1$. We also show that these bounds are essentially tight.

(iii) We showcase the intimate relation between scenario reduction and $k$-means clustering. By leveraging existing constant-factor approximation algorithms for $k$-median clustering problems due to Arya et al (2004) and the new performance bounds from (ii), we develop the first polynomial-time constant-factor approximation algorithms for both continuous and discrete scenario reduction. We also show that these algorithms can be warmstarted using the greedy heuristic by Dupačová et al (2003) to improve practical performance.

(iv) We present exact mixed-integer programming reformulations for the continuous scenario reduction problem.

Continuous scenario reduction is intimately related to the optimal quantization of probability distributions, where one seeks an $m$-point distribution approximating a non-discrete initial distribution. Research efforts in this domain have mainly focused on the asymptotic behavior of the quantization problem as $m$ tends to infinity, see Graf and Luschgy (2000). The ramifications of this stream of literature for stochastic programming are discussed by Pflug and Pichler (2011). Techniques familiar from scenario reduction also lend themselves to scenario generation, where one aims to construct a scenario tree with a prescribed branching structure that approximates a given stochastic process with respect to a probability metric, see, e.g., Pflug (2001) and Hochreiter and Pflug (2007).

The rest of this paper unfolds as follows. Section 2 seeks to identify $n$-point distributions on the unit ball that are least susceptible to scenario reduction, i.e.
, that have maximum Wasserstein distance to their closest $m$-point distributions, and Section 3 discusses sharp bounds on the added benefit of continuous over discrete scenario reduction. Section 4 presents exact exponential-time algorithms as well as polynomial-time constant-factor approximations for scenario reduction. Section 5 reports on numerical results for a color quantization experiment. Unless otherwise specified, below we will always work with the 2-norm on $\mathbb{R}^d$.

Notation:
We let $\mathbf{I}$ be the identity matrix, $\mathbf{e}$ the vector of all ones and $\mathbf{e}_i$ the $i$-th standard basis vector of appropriate dimensions. The $ij$-th element of a matrix $A$ is denoted by $a_{ij}$. For $A$ and $B$ in the space $\mathbb{S}^n$ of symmetric $n \times n$ matrices, the relation $A \succeq B$ means that $A - B$ is positive semidefinite. Generic norms are denoted by $\|\cdot\|$, while $\|\cdot\|_p$ stands for the $p$-norm, $p \geq 1$. For $\Xi \subseteq \mathbb{R}^d$, we define $\mathcal{P}(\Xi, m)$ as the set of all probability distributions supported on at most $m$ points in $\Xi$ and $\mathcal{P}_E(\Xi, n)$ as the set of all uniform distributions supported on exactly $n$ distinct points in $\Xi$. The support of a probability distribution $\mathbb{P}$ is denoted by $\mathrm{supp}(\mathbb{P})$, and the Dirac distribution concentrating unit mass at $\xi$ is denoted by $\delta_\xi$.

2 Fundamental Limits of Scenario Reduction

In this section we characterize the Wasserstein distance $C_l(\hat{\mathbb{P}}_n, m)$ between an $n$-point empirical distribution $\hat{\mathbb{P}}_n = \frac{1}{n} \sum_{i=1}^n \delta_{\xi_i}$ and its continuously reduced optimal $m$-point distribution $\mathbb{Q} \in \mathcal{P}(\mathbb{R}^d, m)$. Since the positive homogeneity of the Wasserstein distance $d_l$ implies that $C_l(\hat{\mathbb{P}}'_n, m) = \lambda \cdot C_l(\hat{\mathbb{P}}_n, m)$ for the scaled distribution $\hat{\mathbb{P}}'_n = \frac{1}{n} \sum_{i=1}^n \delta_{\lambda \xi_i}$, $\lambda \in \mathbb{R}_+$, we restrict ourselves to empirical distributions $\hat{\mathbb{P}}_n$ whose scenarios satisfy $\|\xi_i\| \leq 1$, $i = 1, \ldots, n$. We thus want to quantify

$$ C_l(n, m) \;=\; \max_{\hat{\mathbb{P}}_n \in \mathcal{P}_E(\mathbb{R}^d, n)} \left\{ C_l(\hat{\mathbb{P}}_n, m) \;:\; \|\xi\| \leq 1 \;\; \forall \xi \in \mathrm{supp}(\hat{\mathbb{P}}_n) \right\}, \quad (1) $$

which amounts to the worst-case (i.e., largest) Wasserstein distance between any $n$-point empirical distribution $\hat{\mathbb{P}}_n$ over the unit ball and its optimally selected continuous $m$-point scenario reduction. By construction, this worst-case distance satisfies $C_l(n, m) \geq 0$, and the lower bound is attained whenever $n = m$. One also verifies that $C_l(n, m) \leq C_l(n, 1) \leq 1$, where the second inequality holds because the Wasserstein distance between any distribution on the unit ball and the Dirac distribution $\delta_0$ concentrating at the origin is bounded above by 1. Our goal is to derive possibly tight upper bounds on $C_l(n, m)$ for the Wasserstein distances of type $l \in \{1, 2\}$.

In the following, we denote by $\mathcal{P}(I, m)$ the family of all $m$-set partitions of the index set $I$, i.e.,

$$ \mathcal{P}(I, m) \;=\; \bigl\{ \{I_1, \ldots, I_m\} \;:\; \emptyset \neq I_1, \ldots, I_m \subseteq I, \;\; \cup_j I_j = I, \;\; I_i \cap I_j = \emptyset \;\; \forall i \neq j \bigr\}, $$

and an element of this set (i.e., a specific $m$-set partition) as $\{I_j\} \in \mathcal{P}(I, m)$. Our derivations will make extensive use of the following theorem.
Theorem 1
For any type-$l$ Wasserstein distance induced by any norm $\|\cdot\|$, the continuous scenario reduction problem can be reformulated as

$$ C_l(\hat{\mathbb{P}}_n, m) \;=\; \min_{\{I_j\} \in \mathcal{P}(I, m)} \left[ \frac{1}{n} \sum_{j \in J} \; \min_{\zeta_j \in \mathbb{R}^d} \; \sum_{i \in I_j} \|\xi_i - \zeta_j\|^l \right]^{1/l}. \quad (2) $$

Problem (2) can be interpreted as a Voronoi partitioning problem that asks for a Voronoi decomposition of $\mathbb{R}^d$ into $m$ cells whose Voronoi centroids $\zeta_1, \ldots, \zeta_m$ minimize the cumulative $l$-th powers of the distances to the $n$ prespecified points $\xi_1, \ldots, \xi_n$.

Proof of Theorem 1
Theorem 2 of Dupačová et al (2003) implies that the smallest Wasserstein distance between the empirical distribution $\hat{\mathbb{P}}_n \in \mathcal{P}_E(\mathbb{R}^d, n)$ and any distribution $\mathbb{Q}$ supported on a finite set $\Xi \subset \mathbb{R}^d$ amounts to

$$ \min_{\mathbb{Q} \in \mathcal{P}(\Xi, \infty)} d_l(\hat{\mathbb{P}}_n, \mathbb{Q}) \;=\; \left[ \frac{1}{n} \sum_{i \in I} \min_{\zeta \in \Xi} \|\xi_i - \zeta\|^l \right]^{1/l}, $$

where $\mathcal{P}(\Xi, \infty)$ denotes the set of all probability distributions supported on the finite set $\Xi$. The continuous scenario reduction problem $C_l(\hat{\mathbb{P}}_n, m)$ selects the set $\Xi^\star$ that minimizes this quantity over all sets $\Xi \subset \mathbb{R}^d$ with $|\Xi| = m$ elements:

$$ C_l(\hat{\mathbb{P}}_n, m) \;=\; \min_{\{\zeta_j\} \subseteq \mathbb{R}^d} \left[ \frac{1}{n} \sum_{i \in I} \min_{j \in J} \|\xi_i - \zeta_j\|^l \right]^{1/l}. \quad (3) $$

One readily verifies that any optimal solution $\{\zeta_1^\star, \ldots, \zeta_m^\star\}$ to problem (3) corresponds to an optimal solution $\{I_1^\star, \ldots, I_m^\star\}$ to problem (2) with the same objective value if we identify the set $I_j^\star$ with all observations $\xi_i$ that are closer to $\zeta_j^\star$ than to any other $\zeta_{j'}^\star$ (ties may be broken arbitrarily). Likewise, any optimal solution $\{I_1^\star, \ldots, I_m^\star\}$ to problem (2) with inner minimizers $\{\zeta_1^\star, \ldots, \zeta_m^\star\}$ translates into an optimal solution $\{\zeta_1^\star, \ldots, \zeta_m^\star\}$ to problem (3) with the same objective value. □
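For $l = 1$, the inner minimum of (2) over $\zeta_j$ is attained at a geometric median of the cluster points. Besides conic-programming formulations, a classical way to approximate it is the Weiszfeld fixed-point iteration; the sketch below is ours and is not part of the paper.

```python
import numpy as np

def geometric_median(points, iters=200, eps=1e-9):
    """Approximate arg min_z sum_i ||points_i - z||_2 by the classical
    Weiszfeld fixed-point iteration (each step re-weights the points by
    the reciprocal of their current distance to the iterate)."""
    z = points.mean(axis=0)                 # start from the centroid
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - z, axis=1), eps)
        w = 1.0 / d
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z
```

By symmetry, the geometric median of the four corners of a square is its center; for asymmetric clouds the iteration never increases the objective, so the result is at least as good a single representative as the centroid.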
Remark 1 (Minimizers of (2)) For $l = 2$, the inner minimum corresponding to the set $I_j$ is attained by the mean $\zeta_j^\star = \mathrm{mean}(I_j) = \frac{1}{|I_j|} \sum_{i \in I_j} \xi_i$. Likewise, for $l = 1$, the inner minimum corresponding to the set $I_j$ is attained by any geometric median

$$ \zeta_j^\star \;=\; \mathrm{gmed}(I_j) \;\in\; \arg\min_{\zeta_j \in \mathbb{R}^d} \sum_{i \in I_j} \|\xi_i - \zeta_j\|, $$

which can be determined efficiently by solving a second-order cone program whenever a $p$-norm with rational $p \geq 1$ is used.

In the remainder of this section, we derive upper bounds on $C_l(n, m)$ for Wasserstein distances of type $l = 2$ (Section 2.1) as well as upper and lower bounds for Wasserstein distances of type $l = 1$ (Section 2.2). We summarize and discuss our findings in Section 2.3.

2.1 Fundamental Limits for the Type-2 Wasserstein Distance

We now derive a revised upper bound on $C_l(n, m)$ for the type-2 Wasserstein distance. The result relies on auxiliary lemmas that are relegated to the appendix.

Theorem 2
The worst-case type-2 Wasserstein distance satisfies $C_2(n, m) \leq \sqrt{\frac{n-m}{n-1}}$.

Note that whenever the reduced distribution satisfies $m > 1$, the bound of Theorem 2 is strictly tighter than the naïve bound of 1 from the previous section.
Proof of Theorem 2
From Theorem 1 and Remark 1 we observe that

$$ C_2(n, m)^2 \;=\; \max_{\{\xi_i\} \subseteq \mathbb{R}^d} \; \min_{\{I_j\} \in \mathcal{P}(I, m)} \; \frac{1}{n} \sum_{j \in J} \sum_{i \in I_j} \bigl\| \xi_i - \mathrm{mean}(I_j) \bigr\|_2^2 \quad \text{s.t.} \;\; \|\xi_i\|_2 \leq 1 \;\; \forall i \in I. $$

Introducing the epigraphical variable $\tau$, this problem can be expressed as

$$ C_2(n, m)^2 \;=\; \max_{\tau \in \mathbb{R}, \, \{\xi_i\} \subseteq \mathbb{R}^d} \; \frac{1}{n} \tau \quad \text{s.t.} \;\; \tau \leq \sum_{j \in J} \sum_{i \in I_j} \bigl\| \xi_i - \mathrm{mean}(I_j) \bigr\|_2^2 \;\; \forall \{I_j\} \in \mathcal{P}(I, m), \quad \xi_i^\top \xi_i \leq 1 \;\; \forall i \in I. \quad (4) $$

For each $j \in J$ and $i \in I_j$, the squared norm in the first constraint of (4) can be expressed in terms of the inner products between pairs of empirical observations:

$$ \bigl\| \xi_i - \mathrm{mean}(I_j) \bigr\|_2^2 \;=\; \frac{1}{|I_j|^2} \Bigl\| \, |I_j| \, \xi_i - \sum_{k \in I_j} \xi_k \Bigr\|_2^2 \;=\; \frac{1}{|I_j|^2} \biggl( |I_j|^2 \, \xi_i^\top \xi_i - 2 |I_j| \sum_{k \in I_j} \xi_i^\top \xi_k + \sum_{k \in I_j} \xi_k^\top \xi_k + \sum_{\substack{k, k' \in I_j \\ k \neq k'}} \xi_k^\top \xi_{k'} \biggr). $$

Introducing the Gram matrix

$$ S \;=\; [\xi_1, \ldots, \xi_n]^\top [\xi_1, \ldots, \xi_n] \;\in\; \mathbb{S}^n, \quad S \succeq 0 \;\; \text{and} \;\; \mathrm{rank}(S) \leq \min\{n, d\}, \quad (5) $$

then allows us to simplify the first constraint in (4) to

$$ \tau \;\leq\; \sum_{j \in J} \frac{1}{|I_j|^2} \sum_{i \in I_j} \biggl( |I_j|^2 \, s_{ii} - 2 |I_j| \sum_{k \in I_j} s_{ik} + \sum_{k \in I_j} s_{kk} + \sum_{\substack{k, k' \in I_j \\ k \neq k'}} s_{kk'} \biggr). $$

Note that the second constraint in problem (4) can now be expressed as $s_{ii} \leq 1$. Our discussion implies that we obtain an upper bound on $C_2(n, m)$ by reformulating problem (4) as a semidefinite program in terms of the Gram matrix $S$:

$$ \max_{\tau \in \mathbb{R}, \, S \in \mathbb{S}^n} \; \frac{1}{n} \tau \quad \text{s.t.} \;\; \tau \leq \sum_{j \in J} \frac{1}{|I_j|^2} \sum_{i \in I_j} \biggl( |I_j|^2 \, s_{ii} - 2 |I_j| \sum_{k \in I_j} s_{ik} + \sum_{k \in I_j} s_{kk} + \sum_{\substack{k, k' \in I_j \\ k \neq k'}} s_{kk'} \biggr) \;\; \forall \{I_j\} \in \mathcal{P}(I, m), \quad S \succeq 0, \;\; s_{ii} \leq 1 \;\; \forall i \in I, \quad (6) $$

where we have relaxed the rank condition in the definition of the Gram matrix (5). Lemma 1 in the appendix shows that (6) has an optimal solution $(\tau^\star, S^\star)$ that satisfies $S^\star = \alpha \mathbf{I} + \beta \mathbf{e}\mathbf{e}^\top$ for some $\alpha, \beta \in \mathbb{R}$. Moreover, Lemma 2 in the appendix shows that any matrix of the form $S = \alpha \mathbf{I} + \beta \mathbf{e}\mathbf{e}^\top$ is positive semidefinite if and only if $\alpha \geq 0$ and $\alpha + n\beta \geq 0$. We thus conclude that (6) can be reformulated as

$$ \max_{\tau, \alpha, \beta \in \mathbb{R}} \; \frac{1}{n} \tau \quad \text{s.t.} \;\; \tau \leq (n - m) \alpha, \quad \alpha + \beta \leq 1, \quad \alpha \geq 0, \quad \alpha + n\beta \geq 0, \quad (7) $$

where the first constraint follows from the fact that for any set $I_j$ in (6), we have

$$ \frac{1}{|I_j|^2} \sum_{i \in I_j} \biggl( |I_j|^2 (\alpha + \beta) - 2 |I_j| (\alpha + |I_j| \beta) + \sum_{k \in I_j} (\alpha + \beta) + \sum_{\substack{k, k' \in I_j \\ k \neq k'}} \beta \biggr) \;=\; (|I_j| - 1) \, \alpha, $$

and $\sum_{j \in J} (|I_j| - 1) \, \alpha = (n - m) \alpha$ since $|I| = n$ and $|J| = m$. The statement of the theorem now follows since problem (7) is optimized by $\tau^\star = \frac{n(n-m)}{n-1}$, $\alpha^\star = \frac{n}{n-1}$ and $\beta^\star = -\frac{1}{n-1}$. □

The proof of Theorem 2 shows that the upper bound $\sqrt{\frac{n-m}{n-1}}$ on the worst-case type-2 Wasserstein distance $C_2(n, m)$ is tight whenever there is an empirical distribution $\hat{\mathbb{P}}_n \in \mathcal{P}_E(\mathbb{R}^d, n)$ whose scenarios $\xi_1, \ldots, \xi_n$ correspond to the Gram matrix $S = [\xi_1, \ldots, \xi_n]^\top [\xi_1, \ldots, \xi_n] = \frac{n}{n-1} \mathbf{I} - \frac{1}{n-1} \mathbf{e}\mathbf{e}^\top$, which implies $\|\xi_i\|_2 = \sqrt{s_{ii}} = 1$ for all $i \in I$. We now show that such an empirical distribution exists when $d \geq n - 1$.
Proposition 1
For $d \geq n - 1$, there is $\hat{\mathbb{P}}_n \in \mathcal{P}_E(\mathbb{R}^d, n)$ with $\|\xi\| \leq 1$ for all $\xi \in \mathrm{supp}(\hat{\mathbb{P}}_n)$ such that $C_2(\hat{\mathbb{P}}_n, m) = \sqrt{\frac{n-m}{n-1}}$.

Proof Assume first that $d = n$ and consider the empirical distribution $\hat{\mathbb{P}}_n = \frac{1}{n} \sum_{i=1}^n \delta_{\xi_i}$ defined through

$$ \xi_i \;=\; y \, \mathbf{e} + (x - y) \, \mathbf{e}_i \;\in\; \mathbb{R}^n \quad \text{with} \quad x = \sqrt{\frac{n-1}{n}} \quad \text{and} \quad y = -\frac{1}{\sqrt{n(n-1)}}. \quad (8) $$

A direct calculation reveals that $S = [\xi_1, \ldots, \xi_n]^\top [\xi_1, \ldots, \xi_n] = \frac{n}{n-1} \mathbf{I} - \frac{1}{n-1} \mathbf{e}\mathbf{e}^\top$. To prove the statement for $d = n - 1$, we note that the $n$ scenarios in (8) lie on the $(n-1)$-dimensional hyperplane $H$ orthogonal to $\mathbf{e} \in \mathbb{R}^n$. Thus, there exists a rotation that maps $H$ to $\mathbb{R}^{n-1} \times \{0\}$. As the Gram matrix is invariant under rotations, the rotated scenarios give rise to an empirical distribution $\hat{\mathbb{P}}_n \in \mathcal{P}_E(\mathbb{R}^{n-1}, n)$ satisfying the statement of the proposition. Likewise, for $d > n$ the linear transformation $\xi_i \mapsto (\mathbf{I}, \mathbf{0})^\top \xi_i$, $\mathbf{I} \in \mathbb{R}^{n \times n}$ and $\mathbf{0} \in \mathbb{R}^{n \times (d - n)}$, generates an empirical distribution $\hat{\mathbb{P}}_n \in \mathcal{P}_E(\mathbb{R}^d, n)$ that satisfies the statement of the proposition. □
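The extremal construction (8) is easy to verify numerically. The following quick check (ours, not from the paper) confirms that the scenarios lie on the unit sphere and reproduce the extremal Gram matrix:

```python
import numpy as np

n = 5                                      # small instance with d = n
x = np.sqrt((n - 1) / n)
y = -1.0 / np.sqrt(n * (n - 1))
# scenario i equals y on every coordinate except x on coordinate i
xi = y * np.ones((n, n)) + (x - y) * np.eye(n)

S = xi @ xi.T                              # Gram matrix of the scenarios
target = n / (n - 1) * np.eye(n) - 1 / (n - 1) * np.ones((n, n))

print(np.allclose(np.linalg.norm(xi, axis=1), 1.0))   # → True (unit norms)
print(np.allclose(S, target))                         # → True (extremal Gram matrix)
```

The same check goes through for any $n \geq 2$, since $x^2 + (n-1)y^2 = 1$ and $2xy + (n-2)y^2 = -\frac{1}{n-1}$ hold identically.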
1, which appears to be restrictive. We note,however, that this condition is only sufficient (and not necessary) to guaranteethe tightness of the bound from Theorem 2. Moreover, we will observe in Sec-tion 2.3 that the bound of Theorem 2 provides surprisingly accurate guidance forthe Wasserstein distance between practice-relevant empirical distributions ˆ P n andtheir continuously reduced optimal distributions.2.2 Fundamental Limits for the Type-1 Wasserstein DistanceIn analogy to the previous section, we now derive a revised upper bound on C l ( n, m ) for the type-1 Wasserstein distance. Theorem 3
The worst-case type-1 Wasserstein distance satisfies $C_1(n, m) \leq \sqrt{\frac{n-m}{n-1}}$.

Note that this bound is identical to the bound of Theorem 2 for $l = 2$.

Proof of Theorem 3
Leveraging again Theorem 1 and Remark 1, we obtain that

$$ C_1(n, m) \;=\; \max_{\{\xi_i\} \subseteq \mathbb{R}^d} \; \min_{\{I_j\} \in \mathcal{P}(I, m)} \; \frac{1}{n} \sum_{j \in J} \sum_{i \in I_j} \bigl\| \xi_i - \mathrm{gmed}(I_j) \bigr\|_2 \quad \text{s.t.} \;\; \|\xi_i\|_2 \leq 1 \;\; \forall i \in I. $$

We show that $C_1(n, m) \leq C_2(n, m)$ for all $n$ and $m = 1, \ldots, n$, which in turn proves the statement of the theorem by virtue of Theorem 2. To this end, we observe that

$$ C_1(n, m) \;\leq\; \max_{\{\xi_i\} \subseteq \mathbb{R}^d} \; \min_{\{I_j\} \in \mathcal{P}(I, m)} \; \frac{1}{n} \sum_{j \in J} \sum_{i \in I_j} \bigl\| \xi_i - \mathrm{mean}(I_j) \bigr\|_2 \quad \text{s.t.} \;\; \|\xi_i\|_2 \leq 1 \;\; \forall i \in I \quad (9) $$

$$ \leq\; \max_{\{\xi_i\} \subseteq \mathbb{R}^d} \; \min_{\{I_j\} \in \mathcal{P}(I, m)} \; \Biggl[ \frac{1}{n} \sum_{j \in J} \sum_{i \in I_j} \bigl\| \xi_i - \mathrm{mean}(I_j) \bigr\|_2^2 \Biggr]^{1/2} \quad \text{s.t.} \;\; \|\xi_i\|_2 \leq 1 \;\; \forall i \in I, \quad (10) $$

where the first inequality follows from the definition of the geometric median, which ensures that

$$ \sum_{i \in I_j} \bigl\| \xi_i - \mathrm{gmed}(I_j) \bigr\|_2 \;\leq\; \sum_{i \in I_j} \bigl\| \xi_i - \mathrm{mean}(I_j) \bigr\|_2 \quad \forall j \in J, $$

and the second inequality is due to the arithmetic-mean quadratic-mean inequality (Steele 2004, Exercise 2.14). The statement of the theorem now follows from the observation that the optimal value of (10) is identical to $C_2(n, m)$. □

In the next proposition we derive a lower bound on $C_1(n, m)$.

Proposition 2
For $d \geq n - 1$, the worst-case type-1 Wasserstein distance satisfies $C_1(n, m) \geq \sqrt{\frac{(n-m)(n-m+1)}{n(n-1)}}$.

Proof Assume first that $d = n$, and consider the empirical distribution $\hat{\mathbb{P}}_n$ with scenarios defined as in (8). Let $\{I_j\} \in \mathcal{P}(I, m)$ be an arbitrary $m$-set partition of $I$ and note that $\mathrm{gmed}(I_j) = \mathrm{mean}(I_j)$ for every $j \in J$ due to the permutation symmetry of the $\xi_i$. This is indeed the case because $0 \in \partial f_j(\mathrm{mean}(I_j))$ for each $f_j(\zeta) = \sum_{i \in I_j} \| y \mathbf{e} + (x - y) \mathbf{e}_i - \zeta \|_2$, $j \in J$. Thus, we have

$$ \bigl\| \xi_i - \mathrm{mean}(I_j) \bigr\|_2 \;=\; \Biggl[ \biggl( x - \frac{x + (|I_j| - 1) y}{|I_j|} \biggr)^{\!2} + (|I_j| - 1) \biggl( y - \frac{x + (|I_j| - 1) y}{|I_j|} \biggr)^{\!2} \Biggr]^{1/2} \;=\; \biggl[ \frac{|I_j| - 1}{|I_j|} \biggr]^{1/2} (x - y) \;=\; \biggl[ \frac{n \, (|I_j| - 1)}{(n - 1) \, |I_j|} \biggr]^{1/2} \quad \forall i \in I_j, $$

where the last equality follows from the definitions of $x$ and $y$ in (8). By Theorem 1 and Remark 1 we therefore obtain

$$ C_1(\hat{\mathbb{P}}_n, m) \;=\; \min_{\{I_j\} \in \mathcal{P}(I, m)} \; \frac{1}{n} \sum_{j \in J} \sum_{i \in I_j} \bigl\| \xi_i - \mathrm{mean}(I_j) \bigr\|_2 \;=\; \min_{\{I_j\} \in \mathcal{P}(I, m)} \; \frac{1}{\sqrt{n(n-1)}} \sum_{j \in J} \sqrt{|I_j| \, (|I_j| - 1)}. $$

By introducing auxiliary variables $z_j = |I_j| - 1 \in \mathbb{N}_0$, $j \in J$, we find that determining $C_1(\hat{\mathbb{P}}_n, m)$ is tantamount to solving

$$ C_1(\hat{\mathbb{P}}_n, m) \;=\; \frac{1}{\sqrt{n(n-1)}} \; \min_{\{z_j\} \subseteq \mathbb{N}_0} \Biggl\{ \sum_{j \in J} \sqrt{z_j (z_j + 1)} \;:\; \sum_{j \in J} z_j = n - m \Biggr\}. $$

Observe that the objective function at $z_1 = n - m$ and $z_2 = \cdots = z_m = 0$ evaluates to $\sqrt{(n-m)(n-m+1)}$, which implies that $C_1(\hat{\mathbb{P}}_n, m) \leq \sqrt{\frac{(n-m)(n-m+1)}{n(n-1)}}$. Hence, it remains to establish the reverse inequality. To this end, we note that

$$ \sum_{j \in J} \sqrt{z_j (z_j + 1)} \;=\; \Biggl[ \sum_{j \in J} z_j (z_j + 1) + \sum_{\substack{j, j' \in J \\ j \neq j'}} \sqrt{z_j z_{j'} (1 + z_j)(1 + z_{j'})} \Biggr]^{1/2} \;\geq\; \Biggl[ \sum_{j \in J} z_j (z_j + 1) + \sum_{\substack{j, j' \in J \\ j \neq j'}} z_j z_{j'} \Biggr]^{1/2} \;=\; \Biggl[ \Bigl( \sum_{j \in J} z_j \Bigr)^{\!2} + \sum_{j \in J} z_j \Biggr]^{1/2} \;=\; \sqrt{(n-m)(n-m+1)}, $$

and thus the claim follows for $d = n$. The cases $d = n - 1$ and $d > n$ can be reduced to the case $d = n$ as in Proposition 1. Details are omitted for brevity. □

Proposition 2 asserts that $C_1(n, m) \gtrsim \frac{n-m}{n-1} = C_2(n, m)^2$ whenever $d \geq n - 1$. We thus obtain the following bounds for $l = 1$ and $l = 2$: $C_2(n, m)^2 \lesssim C_1(n, m) \leq C_2(n, m)$. We conjecture that the lower bound is tighter, but we were not able to prove this.

2.3 Discussion

Theorems 2 and 3 imply that $C_l(n, m) \lesssim \sqrt{1 - p}$ for large $n$ and for $l \in \{1, 2\}$, where $p = \frac{m}{n}$ represents the desired reduction factor. The significance of this result is that it offers a priori guidelines for selecting the number $m$ of support points in the reduced distribution. To see this, consider any empirical distribution $\hat{\mathbb{P}}_n = \frac{1}{n} \sum_{i \in I} \delta_{\xi_i}$, and denote by $r \geq 0$ and $\mu \in \mathbb{R}^d$ the radius and the center of any (ideally the smallest) ball enclosing $\xi_1, \ldots, \xi_n$, respectively. In this case, we have

$$ C_l(\hat{\mathbb{P}}_n, m) \;=\; r \cdot C_l \Biggl( \frac{1}{n} \sum_{i \in I} \delta_{(\xi_i - \mu)/r}, \, m \Biggr) \;\leq\; r \cdot C_l(n, m) \;\lesssim\; r \cdot \sqrt{1 - p}, \quad (11) $$

where the inequality holds because $\|(\xi_i - \mu)/r\| \leq 1$ for all $i \in I$.
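As an illustration of this a priori guidance, bound (11) can be rearranged to give the smallest $m$ that it certifies for a target accuracy $\epsilon$: requiring $r \sqrt{(n-m)/(n-1)} \leq \epsilon$ gives $m \geq n - (n-1)(\epsilon/r)^2$. A sketch (ours; the helper name is hypothetical):

```python
import math

def min_reduced_size(n, r, eps):
    """Smallest m for which the a priori bound (11),
    C_l <= r * sqrt((n - m)/(n - 1)), already certifies accuracy eps;
    the result is clipped to the admissible range 1 <= m <= n."""
    m = math.ceil(n - (n - 1) * (eps / r) ** 2)
    return min(max(m, 1), n)
```

For $n = 100$ scenarios in a unit-radius ball ($r = 1$), accuracy $\epsilon = 0.5$ is certified with $m = 76$ retained atoms, since $\sqrt{24/99} \approx 0.492 \leq 0.5$ while $\sqrt{25/99} \approx 0.503 > 0.5$.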
Note that (11) enables us to find an upper bound on the smallest $m$ guaranteeing that $C_l(\hat{\mathbb{P}}_n, m)$ falls below a prescribed threshold (i.e., guaranteeing that the reduced $m$-point distribution remains within some prescribed distance from $\hat{\mathbb{P}}_n$).

Even though the inequality in (11) can be tight, which has been established in Proposition 1, one might suspect that typically $C_l(\hat{\mathbb{P}}_n, m)$ is significantly smaller than $r \cdot \sqrt{1 - p}$ when the points $\xi_i \in \mathbb{R}^d$, $i \in I$, are sampled randomly from a standard distribution, e.g., a multivariate uniform or normal distribution. However, while the upper bound (11) can be loose for low-dimensional data, Proposition 3 below suggests that it is surprisingly tight in high dimensions, at least for $l = 2$.

Proposition 3
For any $\epsilon > 0$ and $\delta > 0$ there exist $c > 0$ and $d \in \mathbb{N}$ such that

$$ \mathbb{P}^n \Biggl( \|\xi_i\| \leq 1 \;\; \forall i \in I \;\; \text{and} \;\; C_2 \Biggl( \frac{1}{n} \sum_{i \in I} \delta_{\xi_i}, \, m \Biggr) \geq \sqrt{1 - p} - \delta \Biggr) \;\geq\; 1 - \epsilon, \quad (12) $$

where $p = \frac{m}{n}$, and the support points $\xi_i$, $i \in I$, are sampled independently from the normal distribution $\mathbb{P}$ with mean $0 \in \mathbb{R}^d$ and covariance matrix $(\sqrt{d} + c)^{-2} \, \mathbf{I} \in \mathbb{S}^d$.

Proposition 3 can be paraphrased as follows. Sampling the $\xi_i$, $i \in I$, independently from the normal distribution $\mathbb{P}$ yields a (random) empirical distribution $\hat{\mathbb{P}}_n$ that is feasible and $\delta$-suboptimal in (1) with probability $1 - \epsilon$. The intuition behind this result is that, in high dimensions, samples drawn from $\mathbb{P}$ are almost orthogonal and close to the surface of the unit ball with high probability. Indeed, these two properties are shared by the worst-case distribution (8) in high dimensions.

Proof of Proposition 3
Theorem 1 and Remark 1 imply that
$$ C\Big(\frac{1}{n}\sum_{i\in I}\delta_{\xi_i},\,m\Big) \;=\; \min_{\{I_j\}\in\mathcal P(I,m)} \Big[\frac{1}{n}\sum_{j\in J}\sum_{i\in I_j}\big\|\xi_i-\mathrm{mean}(I_j)\big\|^2\Big]^{1/2}. \qquad(13) $$
From the proof of Theorem 2 we further know that the squared value of (13) can be expressed as a continuous function $f(S)$ of the Gram matrix $S=[\xi_1,\dots,\xi_n]^\top[\xi_1,\dots,\xi_n]$, that is,
$$ f(S) \;=\; \min_{\{I_j\}\in\mathcal P(I,m)} \frac{1}{n}\sum_{j\in J}\frac{1}{|I_j|^2}\sum_{i\in I_j}\Big(|I_j|^2 s_{ii} - 2|I_j|\sum_{k\in I_j}s_{ik} + \sum_{k\in I_j}s_{kk} + \sum_{\substack{k,k'\in I_j\\ k\neq k'}}s_{kk'}\Big). $$
An elementary calculation shows that $f(\mathbb I)=\frac{n-m}{n}=1-p$. Thus, by the continuity of $f(\cdot)$, there exists $\eta\in(0,1)$ with $\sqrt{f(S)}\ge\sqrt{1-p}-\delta$ whenever $\|S-\mathbb I\|_{\max}\le\eta$.

We are now ready to construct $\mathbb P$. First, select $c>0$ large enough such that
$$ \Big(1-\frac{2}{c}\,e^{-c^2/2}\Big)^n \;\ge\; 1-\epsilon. $$
Then, select $d\in\mathbb N$ large enough such that
$$ \frac{\sqrt{d-1}-c}{\sqrt d+c}\;\ge\;\sqrt{1-\eta} \qquad\text{and}\qquad 2n(n-1)\,\Phi\big(-\eta(\sqrt d+c)\big)\;\le\;\epsilon, $$
where $\Phi(\cdot)$ denotes the univariate standard normal distribution function. Observe that the distribution $\mathbb P$ is completely determined by $c$ and $d$. Next, we find that
$$ \begin{aligned} \mathbb P^n\Big(\|\xi_i\|\le 1\;\forall i \;\text{ and }\; C\Big(\frac1n\sum_{i\in I}\delta_{\xi_i},m\Big)\ge\sqrt{1-p}-\delta\Big) &\ge \mathbb P^n\big(\|\xi_i\|\le 1\;\forall i \text{ and } \|S-\mathbb I\|_{\max}\le\eta\big)\\ &= \mathbb P^n\big(\sqrt{1-\eta}\le\|\xi_i\|\le 1\;\forall i \text{ and } |\xi_i^\top\xi_j|\le\eta\;\forall i\ne j\big)\\ &\ge \mathbb P^n\Big(\frac{\sqrt{d-1}-c}{\sqrt d+c}\le\|\xi_i\|\le 1\;\forall i \text{ and } |\xi_i^\top\xi_j|\le\eta\,\|\xi_j\|\;\forall i\ne j\Big)\\ &\ge \mathbb P^n\Big(\frac{\sqrt{d-1}-c}{\sqrt d+c}\le\|\xi_i\|\le 1\;\forall i\Big) + \mathbb P^n\big(|\xi_i^\top\xi_j|\le\eta\,\|\xi_j\|\;\forall i\ne j\big) - 1, \end{aligned} \qquad(14) $$
where the first and second inequalities follow from the construction of $\eta$ and $c$, respectively, while the third inequality exploits the union bound. We also have
$$ \mathbb P^n\Big(\frac{\sqrt{d-1}-c}{\sqrt d+c}\le\|\xi_i\|\le 1\;\forall i\Big) = \mathbb P\Big(\frac{\sqrt{d-1}-c}{\sqrt d+c}\le\|\xi\|\le 1\Big)^n \ge \Big(1-\frac{2}{c}\,e^{-c^2/2}\Big)^n \ge 1-\epsilon, \qquad(15) $$
where the equality holds due to the independence of the $\xi_i$, while the first and second inequalities follow from Lemma 2.8 in Hopcroft and Kannan (2012) and the construction of $c$, respectively. By the choice of $d$, we finally obtain
$$ \mathbb P^n\big(|\xi_i^\top\xi_j|\le\eta\,\|\xi_j\|\;\forall i\ne j\big) \;\ge\; 1-\sum_{i\ne j}\mathbb P^n\big(|\xi_i^\top\xi_j|\ge\eta\,\|\xi_j\|\big) \;=\; 1-2n(n-1)\,\Phi\big(-\eta(\sqrt d+c)\big) \;\ge\; 1-\epsilon. \qquad(16) $$
The equality in (16) holds due to the rotation symmetry of $\mathbb P$, which implies that
$$ \mathbb P^n\big(|\xi_i^\top\xi_j|\ge\eta\,\|\xi_j\|\big) = \mathbb P\big(|\xi^\top e_1|\ge\eta\big) = 2\,\Phi\big(-\eta(\sqrt d+c)\big). $$
The claim then follows by substituting (15) and (16) into (14). ⊓⊔
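The high-dimensional concentration effect that drives this construction can be observed directly: independent Gaussian samples, suitably rescaled, have norms slightly below 1 and nearly vanishing pairwise inner products, so their Gram matrix $S$ is close to the identity. The following sketch is purely illustrative; the scaling $1/(\sqrt d+c)$ and all parameter values are assumptions of ours, not part of the proof.

```python
import math
import random

# Illustrative check: n i.i.d. Gaussian atoms in high dimension d, rescaled
# so that their norms stay (with overwhelming probability) below 1, yield a
# Gram matrix S close to the identity.  Scaling and parameters are
# assumptions of this sketch.
random.seed(0)
n, d, c = 10, 4000, 3.0
scale = 1.0 / (math.sqrt(d) + c)
xi = [[random.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

S = [[dot(xi[i], xi[j]) for j in range(n)] for i in range(n)]
max_diag_dev = max(abs(S[i][i] - 1.0) for i in range(n))   # |s_ii - 1|
max_off_diag = max(abs(S[i][j]) for i in range(n)
                   for j in range(n) if i != j)            # |s_ij|, i != j
print(max_diag_dev, max_off_diag)
```

Increasing $d$ drives both deviations to 0, which is why $\|S-\mathbb I\|_{\max}\le\eta$ holds with high probability for large $d$.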
Fig. 1: Comparison between $C(\hat{\mathbb P}_n,m)$ and $C(n,m)$ under uniform (left panel) and normal (right panel) sampling.

Figure 1 compares $C(n,m)$ with $C(\hat{\mathbb P}_n,m)$ for $n=100$, a range of reduction sizes $m$ and several dimensions $d$. The $n$ original support points are sampled randomly from the uniform distribution on the unit ball (left panel) and the normal distribution from Proposition 3 with $c=2.97$, which ensures that $\|\xi_i\|\le 1$ for all $i$ with high probability (right panel). In both cases $C(\hat{\mathbb P}_n,m)$ is random; all shown values are therefore averaged across 100 independent trials. Figure 1 confirms that $C(\hat{\mathbb P}_n,m)$ approaches the worst-case bound $C(n,m)$ as the dimension $d$ increases.

For $n$-point empirical distributions $\hat{\mathbb P}_n=\frac1n\sum_{i=1}^n\delta_{\xi_i}$ supported on $\mathbb R^d$, we now study the loss of optimality incurred by solving the discrete scenario reduction problem instead of its continuous counterpart. More precisely, we want to determine the point-wise largest lower bound $\underline\kappa_l(n,m)$ and the point-wise smallest upper bound $\overline\kappa_l(n,m)$ that satisfy
$$ \underline\kappa_l(n,m)\cdot C_l(\hat{\mathbb P}_n,m)\;\le\;D_l(\hat{\mathbb P}_n,m)\;\le\;\overline\kappa_l(n,m)\cdot C_l(\hat{\mathbb P}_n,m) \quad\forall\,\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n) \qquad(17) $$
for the Wasserstein distances of type $l\in\{1,2\}$. Note that the existence of finite bounds $\underline\kappa_l(n,m)$ and $\overline\kappa_l(n,m)$ is not a priori obvious as they do not depend on the dimension $d$. Moreover, it is clear that $\overline\kappa_l(n,m)\ge\underline\kappa_l(n,m)\ge 1$.

Our derivations in this section will use the following result, which is the analogue of Theorem 1 for the discrete scenario reduction problem.

Theorem 4
For any type-$l$ Wasserstein distance induced by any norm $\|\cdot\|$, the discrete scenario reduction problem can be reformulated as
$$ D_l(\hat{\mathbb P}_n,m) \;=\; \min_{\{I_j\}\in\mathcal P(I,m)} \Big[\frac1n\sum_{j\in J}\,\min_{\zeta_j\in\{\xi_i\,:\,i\in I_j\}}\sum_{i\in I_j}\|\xi_i-\zeta_j\|^l\Big]^{1/l}. $$

Proof
The proof is similar to the proof of Theorem 1 and is therefore omitted. ⊓⊔
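On small instances, the partition formulas of Theorem 1 and Theorem 4 can be evaluated exactly by exhaustive enumeration, which is useful for validating heuristics. The following sketch (helper names are ours; type-2 distance with the Euclidean norm) computes $C_2(\hat{\mathbb P}_n,m)$ with cluster means as centers and $D_2(\hat{\mathbb P}_n,m)$ with centers restricted to the atoms:

```python
import itertools
import math

def c2(points, m):
    """Exact C_2(P_n, m): centers are the cluster means (Theorem 1 / Remark 1)."""
    n, dim = len(points), len(points[0])
    best = float("inf")
    for labels in itertools.product(range(m), repeat=n):
        cost = 0.0
        for j in range(m):
            cluster = [p for p, g in zip(points, labels) if g == j]
            if not cluster:
                continue
            mean = [sum(p[k] for p in cluster) / len(cluster) for k in range(dim)]
            cost += sum(math.dist(p, mean) ** 2 for p in cluster)
        best = min(best, cost)
    return math.sqrt(best / n)

def d2(points, m):
    """Exact D_2(P_n, m): centers restricted to the atoms (Theorem 4)."""
    n = len(points)
    best = min(
        sum(min(math.dist(p, z) ** 2 for z in centers) for p in points)
        for centers in itertools.combinations(points, m)
    )
    return math.sqrt(best / n)

points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (0.3, 0.2)]
c_val, d_val = c2(points, 2), d2(points, 2)
print(c_val, d_val)
```

Both routines are exponential in $n$ and only meant for sanity checks on tiny instances; by construction $C_2(\hat{\mathbb P}_n,m)\le D_2(\hat{\mathbb P}_n,m)$, since the discrete problem optimizes over a smaller feasible set.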
The remainder of this section derives lower and upper bounds on $\overline\kappa_l(n,m)$ and $\underline\kappa_l(n,m)$ for Wasserstein distances of type $l=2$ (Section 3.1) and $l=1$ (Section 3.2), respectively. To eliminate trivial cases, we assume throughout this section that $n\ge 2$, $m\in\{1,\dots,n-1\}$ and $d\ge 1$.

3.1 Guarantees for the Type-2 Wasserstein Distance

We first bound $\overline\kappa_2(n,m)$ in equation (17) from above (Theorem 5) and below (Proposition 4).

Theorem 5
The upper bound $\overline\kappa_2(n,m)$ in (17) satisfies $\overline\kappa_2(n,m)\le\sqrt2$ for all $n,m$.

Proof The proof proceeds in two steps. We first show that $\overline\kappa_2(n,1)\le\sqrt2$ for all $n$ when $m=1$ (Step 1). Then we extend the result to all $n$ and $m$ (Step 2).

Step 1:
Fix any $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$. W.l.o.g., we can assume that $\mathrm{mean}(I)=0$ and $\frac1n\sum_{i\in I}\|\xi_i\|^2=1$ by re-positioning and scaling the atoms $\xi_i$ appropriately. Note that the re-positioning does not affect $C_2(\hat{\mathbb P}_n,1)$ or $D_2(\hat{\mathbb P}_n,1)$, while the scaling affects $C_2(\hat{\mathbb P}_n,1)$ and $D_2(\hat{\mathbb P}_n,1)$ in the same way and thus preserves their ratio $\overline\kappa_2(n,1)$. Under this normalization, Theorem 1 and Remark 1 imply that
$$ C_2(\hat{\mathbb P}_n,1)=\Big[\frac1n\sum_{i\in I}\|\xi_i-\mathrm{mean}(I)\|^2\Big]^{1/2}=1. $$
Step 1 is thus complete if we can show that $D_2(\hat{\mathbb P}_n,1)\le\sqrt2$. Indeed, we have
$$ \begin{aligned} D_2(\hat{\mathbb P}_n,1)^2 &= \min_{j\in I}\frac1n\sum_{i\in I}\|\xi_i-\xi_j\|^2 = \min_{j\in I}\frac1n\sum_{i\in I}(\xi_i-\xi_j)^\top(\xi_i-\xi_j)\\ &= \min_{j\in I}\frac1n\sum_{i\in I}\big(\xi_i^\top\xi_i-2\xi_i^\top\xi_j+\xi_j^\top\xi_j\big) = \min_{j\in I}\frac1n\sum_{i\in I}\big(\xi_i^\top\xi_i+\xi_j^\top\xi_j\big)\\ &= \min_{j\in I}\,\|\xi_j\|^2+\frac1n\sum_{i\in I}\|\xi_i\|^2 = 1+\min_{j\in I}\|\xi_j\|^2 \;\le\; 2, \end{aligned} \qquad(18) $$
where the first equality is due to Theorem 4, the fourth follows from $\sum_{i\in I}\xi_i=n\cdot\mathrm{mean}(I)=0$, and the inequality holds since $\min_{j\in I}\|\xi_j\|^2\le\frac1n\sum_{i\in I}\|\xi_i\|^2=1$.

Step 2:
Fix any $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$. Theorem 1 and Remark 1 imply that
$$ C_2(\hat{\mathbb P}_n,m)=\min_{\{I_j\}\in\mathcal P(I,m)}\Big[\frac1n\sum_{j\in J}\sum_{i\in I_j}\|\xi_i-\mathrm{mean}(I_j)\|^2\Big]^{1/2}. $$
For an optimal partition $\{I_j^\star\}$ to this problem, $C_2(\hat{\mathbb P}_n,m)$ can be expressed as
$$ C_2(\hat{\mathbb P}_n,m)=\Big[\sum_{j\in J}\frac{|I_j^\star|}{n}\,C_{2,j}^2\Big]^{1/2} \quad\text{with}\quad C_{2,j}=\Big[\frac{1}{|I_j^\star|}\sum_{i\in I_j^\star}\|\xi_i-\mathrm{mean}(I_j^\star)\|^2\Big]^{1/2}. $$
From our discussion in Step 1 we know that $C_{2,j}$ represents the type-2 Wasserstein distance between the conditional empirical distribution $\hat{\mathbb P}_n^j=\frac{1}{|I_j^\star|}\sum_{i\in I_j^\star}\delta_{\xi_i}$ and its closest Dirac distribution, that is, $C_{2,j}=C_2(\hat{\mathbb P}_n^j,1)$. We thus obtain
$$ D_2(\hat{\mathbb P}_n,m) \le \Big[\sum_{j\in J}\frac{|I_j^\star|}{n}\,D_{2,j}^2\Big]^{1/2} \;\text{with}\; D_{2,j}=\min_{j'\in I_j^\star}\Big[\frac{1}{|I_j^\star|}\sum_{i\in I_j^\star}\|\xi_i-\xi_{j'}\|^2\Big]^{1/2} \le \Big[\sum_{j\in J}\frac{|I_j^\star|}{n}\,\big(2\,C_{2,j}^2\big)\Big]^{1/2} = \sqrt2\,C_2(\hat{\mathbb P}_n,m), $$
where the first inequality holds since the optimal partition $\{I_j^\star\}$ for $C_2(\hat{\mathbb P}_n,m)$ is typically suboptimal in $D_2(\hat{\mathbb P}_n,m)$, the second inequality follows from the fact that $D_{2,j}=D_2(\hat{\mathbb P}_n^j,1)$ and $D_2(\hat{\mathbb P}_n^j,1)\le\sqrt2\,C_2(\hat{\mathbb P}_n^j,1)$ due to Step 1, and the identity follows from the definition of $C_{2,j}$. The statement now follows. ⊓⊔

Proposition 4
There is $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ with $D_2(\hat{\mathbb P}_n,m)=\sqrt2\,C_2(\hat{\mathbb P}_n,m)$ for all $n,m$.

Proof In analogy to the proof of Theorem 5, we first show the statement for $m=1$ (Step 1) and then extend the result to $m>1$ (Step 2).

Fig. 2: Empirical distributions in $\mathbb R^2$ that maximize the ratio between $D_l(\hat{\mathbb P}_n,1)$ and $C_l(\hat{\mathbb P}_n,1)$ for $l=2$ and $\|\cdot\|=\|\cdot\|_2$ (left panel) as well as $l=1$ and $\|\cdot\|=\|\cdot\|_1$ (right panel). In both cases, the continuous scenario reduction problem is optimized by the Dirac distribution at $0$ (marked as ×), whereas the discrete scenario reduction problem is optimized by any of the atoms (such as ◦).

Step 1:
The first step in the proof of Theorem 5 shows that $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ satisfies $D_2(\hat{\mathbb P}_n,1)=\sqrt2\,C_2(\hat{\mathbb P}_n,1)$ if $\sum_{i\in I}\xi_i=0$ and $\|\xi_1\|=\dots=\|\xi_n\|=1$. For an even number $n=2k$, $k\in\mathbb N$, both conditions are satisfied if we place $\xi_1,\dots,\xi_k$ on the surface of the unit ball in $\mathbb R^d$ and then choose $\xi_{k+i}=-\xi_i$ for $i=1,\dots,k$ (see left panel of Figure 2 for an illustration in $\mathbb R^2$). Likewise, for an odd number $n=2k+3$, $k\in\mathbb N$, we can place $\xi_1,\dots,\xi_k$ on the surface of the unit ball, choose $\xi_{k+i}=-\xi_i$ for $i=1,\dots,k$ and fix $\xi_{2k+1}=e_1$, $\xi_{2k+2}=-\frac12 e_1+\frac{\sqrt3}{2}e_2$ and $\xi_{2k+3}=-\frac12 e_1-\frac{\sqrt3}{2}e_2$.

Step 2: To prove the statement for $m>1$, we construct an empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ whose atoms satisfy $\mathrm{supp}(\hat{\mathbb P}_n)=\Xi_1\cup\Xi_2$ with $|\Xi_1|=n-m+1$ and $|\Xi_2|=m-1$. The atoms $\xi_1,\dots,\xi_{n-m+1}$ in $\Xi_1$ are selected according to the recipe outlined in Step 1, whereas the atoms $\xi_{n-m+2},\dots,\xi_n$ in $\Xi_2$ satisfy $\xi_{n-m+1+i}=(1+iM)e_1$, $i=1,\dots,m-1$, for a sufficiently large number $M$. A direct calculation then shows that the atoms in $\Xi_2$ are sufficiently far away from those in $\Xi_1$ as well as from each other so that any optimal partition $\{I_j^\star\}$ to the discrete scenario reduction problem in Theorem 4 as well as the continuous scenario reduction problem in Theorem 1 consists of the sets $\{i:\xi_i\in\Xi_1\}$ and $\{i\}$, $\xi_i\in\Xi_2$. The result then follows from the fact that either problem accumulates a Wasserstein distance of 0 over the atoms in $\Xi_2$, whereas the Wasserstein distance of $D_2(\hat{\mathbb P}_n,m)$ is a factor of $\sqrt2$ larger than that of $C_2(\hat{\mathbb P}_n,m)$ over the atoms in $\Xi_1$ (see Step 1). ⊓⊔

Theorem 5 and Proposition 4 imply that $\overline\kappa_2(n,m)=\sqrt2$ for all $n$ and $m$, that is, the bound is indeed independent of both the number of atoms $n$ in the empirical distribution and the number of atoms $m$ in the reduced distribution. We now show that the naïve lower bound of 1 on the approximation ratio is essentially tight.

Proposition 5
The lower bound $\underline\kappa_2(n,m)$ in (17) satisfies $\underline\kappa_2(n,m)=1$ whenever $n\ge 3$ and $m\in\{1,\dots,n-2\}$, while $\underline\kappa_2(n,n-1)=\sqrt2$ always.

Proof We first prove $\underline\kappa_2(n,m)=1$ when $m=1$ and $n\ge 3$ (Step 1) and then extend the result to $m\in\{2,\dots,n-2\}$ (Step 2). Finally, we show $\underline\kappa_2(n,n-1)=\sqrt2$ (Step 3).

Step 1: Choose $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ such that the first $n-1$ atoms $\xi_1,\dots,\xi_{n-1}$ are selected according to the recipe outlined in Step 1 in the proof of Proposition 4 and $\xi_n=0$. We thus have $\mathrm{mean}(I)=0$, and Theorem 1 and Remark 1 imply that the optimal continuous scenario reduction is given by the Dirac distribution $\delta_0$. Since $0\in\mathrm{supp}(\hat{\mathbb P}_n)$, we have $C_2(\hat{\mathbb P}_n,1)=D_2(\hat{\mathbb P}_n,1)$ and the result follows.
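This instance is small enough to verify directly. A minimal sketch under our own choices ($n=5$ atoms $\pm e_1$, $\pm e_2$ and the origin in $\mathbb R^2$; the helper function is ours):

```python
import math

# Step 1 construction: four unit atoms in antipodal pairs plus one atom at
# the origin, which coincides with mean(I).
atoms = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (0.0, 0.0)]
n = len(atoms)

def type2_cost(center):
    # type-2 Wasserstein distance from the empirical distribution
    # to a single Dirac placed at `center`
    return math.sqrt(sum(math.dist(a, center) ** 2 for a in atoms) / n)

c_cont = type2_cost((0.0, 0.0))             # continuous optimum: the mean
d_disc = min(type2_cost(a) for a in atoms)  # discrete optimum: best atom
print(c_cont, d_disc)
```

Because the mean itself is one of the atoms, the continuous and discrete optima coincide, exactly as the proof of Step 1 asserts.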
Step 2: To prove the statement for $m>1$, we proceed as in Step 2 in the proof of Proposition 4. In particular, we construct an empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ whose atoms satisfy $\mathrm{supp}(\hat{\mathbb P}_n)=\Xi_1\cup\Xi_2$ with $|\Xi_1|=n-m+1$ and $|\Xi_2|=m-1$. The atoms $\xi_1,\dots,\xi_{n-m+1}$ in $\Xi_1$ are selected according to the recipe outlined in Step 1 of this proof, whereas the remaining atoms $\xi_{n-m+2},\dots,\xi_n$ in $\Xi_2$ satisfy $\xi_{n-m+1+i}=(1+iM)e_1$, $i=1,\dots,m-1$, for a sufficiently large number $M$. A similar argument as in the proof of Proposition 4 then shows that $C_2(\hat{\mathbb P}_n,m)=D_2(\hat{\mathbb P}_n,m)$.

Step 3: Fix any $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$. W.l.o.g., assume that $\{\xi_{n-1},\xi_n\}$ is the closest pair of atoms in terms of Euclidean distance, and let $d_{\min}=\|\xi_n-\xi_{n-1}\|$. One readily verifies that the partition $I_j^\star=\{j\}$, $j=1,\dots,n-2$, and $I_{n-1}^\star=\{n-1,n\}$ optimizes both the discrete scenario reduction problem in Theorem 4 as well as the continuous scenario reduction problem in Theorem 1. We thus have
$$ C_2(\hat{\mathbb P}_n,n-1)=\frac{d_{\min}}{\sqrt{2n}} \qquad\text{and}\qquad D_2(\hat{\mathbb P}_n,n-1)=\frac{d_{\min}}{\sqrt n}, $$
which concludes the proof. ⊓⊔

Hence, for any empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ the type-2 Wasserstein distance between the minimizer of the discrete scenario reduction problem and $\hat{\mathbb P}_n$ exceeds the Wasserstein distance between the minimizer of the continuous scenario reduction problem and $\hat{\mathbb P}_n$ by up to 41.4%, and this bound is tight for all $n$ and $m$.

3.2 Guarantees for the Type-1 Wasserstein Distance

In analogy to Section 3.1, we first bound $\overline\kappa_1(n,m)$ from above (Theorem 6) and below (Proposition 6). In contrast to the previous section, we consider an arbitrary norm $\|\cdot\|$, and we adapt the definition of the geometric median accordingly.

Theorem 6
The upper bound $\overline\kappa_1(n,m)$ in (17) satisfies $\overline\kappa_1(n,m)\le 2$ whenever $m\in\{1,\dots,n-2\}$, as well as $\overline\kappa_1(n,1)\le 2\big(1-\frac1n\big)$ and $\overline\kappa_1(n,n-1)\le 1$.

Proof We first prove the statement for $m=1$ (Step 1) and then extend the result to $m\in\{2,\dots,n-2\}$ (Step 2) and $m=n-1$ (Step 3).

Step 1: Fix any $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$. As in the proof of Theorem 5, we can assume that $\mathrm{gmed}(I)=0$ and $\frac1n\sum_{i\in I}\|\xi_i\|=1$ by re-positioning and scaling the atoms $\xi_i$ appropriately. Theorem 1 and Remark 1 then imply that for $m=1$, we have
$$ C_1(\hat{\mathbb P}_n,1)=\frac1n\sum_{i\in I}\|\xi_i-\mathrm{gmed}(I)\|=1. $$
Step 1 is thus complete if we can show that $D_1(\hat{\mathbb P}_n,1)\le 2\big(1-\frac1n\big)$. Indeed, we have
$$ \begin{aligned} D_1(\hat{\mathbb P}_n,1) &= \min_{j\in I}\frac1n\sum_{i\in I}\|\xi_i-\xi_j\| = \min_{j\in I}\frac1n\sum_{i\in I\setminus\{j\}}\|\xi_i-\xi_j\|\\ &\le \min_{j\in I}\frac1n\sum_{i\in I\setminus\{j\}}\big(\|\xi_i\|+\|\xi_j\|\big) = \min_{j\in I}\frac1n\Big((n-2)\|\xi_j\|+\sum_{i\in I}\|\xi_i\|\Big)\\ &= \min_{j\in I}\frac1n\big((n-2)\|\xi_j\|+n\big) = 1+\frac{n-2}{n}\cdot\min_{j\in I}\|\xi_j\| \;\le\; 2\Big(1-\frac1n\Big), \end{aligned} $$
where the two inequalities follow from the triangle inequality and the fact that $\min_{j\in I}\|\xi_j\|\le\frac1n\sum_{i\in I}\|\xi_i\|=1$, respectively.

Step 2:
Fix any $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$. Theorem 1 and Remark 1 then imply that
$$ C_1(\hat{\mathbb P}_n,m)=\min_{\{I_j\}\in\mathcal P(I,m)}\frac1n\sum_{j\in J}\sum_{i\in I_j}\|\xi_i-\mathrm{gmed}(I_j)\|. $$
Let $\{I_j^\star\}$ be an optimal partition for this problem. The same arguments as in the proof of Theorem 5 show that
$$ D_1(\hat{\mathbb P}_n,m) \le \sum_{j\in J}\frac{|I_j^\star|}{n}D_{1,j} \;\text{with}\; D_{1,j}=\min_{j'\in I_j^\star}\frac{1}{|I_j^\star|}\sum_{i\in I_j^\star}\|\xi_i-\xi_{j'}\| \le \sum_{j\in J}\frac{|I_j^\star|}{n}\big(2\,C_{1,j}\big) \;\text{with}\; C_{1,j}=\frac{1}{|I_j^\star|}\sum_{i\in I_j^\star}\|\xi_i-\mathrm{gmed}(I_j^\star)\|, $$
and the last expression is equal to $2\,C_1(\hat{\mathbb P}_n,m)$ by definition of $C_{1,j}$.

Step 3: For $n=2$ and $m=n-1=1$, the statement follows from Step 1 since $\overline\kappa_1(2,1)\le 2\big(1-\frac12\big)=1$. For $n>2$ and $m=n-1$, the statement can be derived in the same way as the third step in the proof of Proposition 5. We omit the details for the sake of brevity. ⊓⊔
Proposition 6
There is $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ such that $D_1(\hat{\mathbb P}_n,m)=2\big(1-\frac mn\big)\,C_1(\hat{\mathbb P}_n,m)$ under the 1-norm for all $n$ divisible by $2m$, all $m$ and all $d\ge\frac{n}{2m}$.

Proof We first prove the statement for $m=1$ (Step 1) and then extend the result to $m>1$ (Step 2). To this end, set $k=\frac{n}{2m}$ and consider w.l.o.g. the case where $d=k$.

Step 1: Fix $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ with the atoms $\xi_i=+e_i$ as well as $\xi_{k+i}=-e_i$, $i=1,\dots,k$. The symmetric placement of the atoms implies that $\mathrm{gmed}(I)=0$ and hence $C_1(\hat{\mathbb P}_n,1)=1$. Furthermore, we note that $\|\xi_i-\xi_j\|_1=2$ for all $i\ne j$, that is, any two atoms are equidistant from one another (see right panel of Figure 2 for an illustration in $\mathbb R^2$). By Theorem 4, any 1-point discrete scenario reduction thus results in a Wasserstein distance of $2\,\frac{n-1}{n}$ to $\hat{\mathbb P}_n$.

Step 2:
To prove the statement for $m>1$, we construct an empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ whose atoms satisfy $\mathrm{supp}(\hat{\mathbb P}_n)=\bigcup_{j=1}^m(\Xi_j^+\cup\Xi_j^-)$ with $|\Xi_j^+|=|\Xi_j^-|=k$, $j=1,\dots,m$. The atoms $\xi_{2(j-1)k+1},\dots,\xi_{2(j-1)k+k}$ in $\Xi_j^+$ satisfy $\xi_{2(j-1)k+i}=+e_i+jMe_1$, $i=1,\dots,k$, whereas the atoms $\xi_{2(j-1)k+k+1},\dots,\xi_{2jk}$ in $\Xi_j^-$ satisfy $\xi_{2(j-1)k+k+i}=-e_i+jMe_1$, $i=1,\dots,k$, for any number $M$ satisfying $M>n+2$. The same arguments as in the proof of Theorem 5 show that any optimal partition $\{I_j^\star\}$ to the discrete scenario reduction problem in Theorem 4 as well as the continuous scenario reduction problem in Theorem 1 consists of the sets indexing the atoms in $\Xi_j^+\cup\Xi_j^-$, $j=1,\dots,m$. Step 1 shows that the continuous scenario reduction problem accumulates a Wasserstein distance of 1 over each set, whereas the discrete scenario reduction problem accumulates a Wasserstein distance of $2\,\frac{2k-1}{2k}$ over each set. The result then follows from the fact that there are $m$ such sets and hence the ratio of the respective overall Wasserstein distances amounts to $m\big(2\,\frac{2k-1}{2k}\big)/m=2\big(1-\frac mn\big)$. ⊓⊔

Theorem 6 and Proposition 6 imply that $\overline\kappa_1(n,m)\in[2(1-m/n),\,2]$ for all $n$ and $m\in\{1,\dots,n-2\}$. For the small ratios $m:n$ commonly used in practice, we thus conclude that the bound is essentially independent of both the number of atoms $n$ in the empirical distribution and the number of atoms $m$ in the reduced distribution. We close with an analysis of the lower bound $\underline\kappa_1(n,m)$.

Proposition 7
The lower bound $\underline\kappa_1(n,m)$ in (17) satisfies $\underline\kappa_1(n,m)=1$ for all $n,m$.

Proof The proof widely parallels that of Proposition 5, with the difference that the atoms $\xi_1,\dots,\xi_n$ of the empirical distribution $\hat{\mathbb P}_n$ are placed such that a geometric median (as opposed to the mean) of each subset in the optimal partition coincides with one of the atoms in that subset. This allows both continuous and discrete scenario reduction to choose the same support points for the reduced distribution, hence incurring the same Wasserstein distance. Details are omitted for brevity. ⊓⊔

In conclusion, for any $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ the type-1 Wasserstein distance between the minimizer of the discrete scenario reduction problem and $\hat{\mathbb P}_n$ exceeds the Wasserstein distance between the minimizer of the continuous scenario reduction problem and $\hat{\mathbb P}_n$ by up to 100%, and this bound is asymptotically attained for decreasing ratios $m:n$.

We now review existing and propose new solution schemes for the discrete and continuous scenario reduction problems.
More precisely, we will study two heuristics for discrete and continuous scenario reduction, respectively, that do not come with approximation guarantees (Section 4.1), we will propose a constant-factor approximation scheme for both the discrete and the continuous scenario reduction problem (Section 4.2), and we will discuss two exact reformulations of these problems as mixed-integer optimization problems (Section 4.3).

In the remainder of this section, we denote by $D_l(\hat{\mathbb P}_n,\Xi)$ the type-$l$ Wasserstein distance between $\hat{\mathbb P}_n$ and its closest distribution supported on the finite set $\Xi$. Moreover, for an algorithm providing an upper bound $\overline D_l(\hat{\mathbb P}_n,m)$ on the discrete scenario reduction problem in $\mathbb R^d$, we define the algorithm's approximation ratio as the maximum fraction $\overline D_l(\hat{\mathbb P}_n,m)/D_l(\hat{\mathbb P}_n,m)$, where the maximum is taken over all $n$ and $m$, as well as all empirical distributions $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$.

4.1 Heuristics for the Discrete Scenario Reduction Problem

We review in Section 4.1.1 a popular heuristic for the discrete scenario reduction problem due to Dupačová et al (2003). We will show that despite the simplicity and efficiency of the algorithm, there is no finite upper bound on the algorithm's approximation ratio. In Section 4.1.2 we adapt a widely used clustering heuristic to the continuous scenario reduction problem, and we show that this algorithm's approximation ratio cannot be bounded from above either.

4.1.1 The Algorithm of Dupačová et al.

We outline Dupačová et al.'s algorithm for the problem $D_l(\hat{\mathbb P}_n,m)$ below.

Dupačová et al.'s algorithm for $D_l(\hat{\mathbb P}_n,m)$:
1. Initialize the set of atoms in the reduced set as $R\leftarrow\emptyset$.
2. Select the next atom to be added to the reduced set as
$$ \zeta\in\arg\min_{\zeta\in\mathrm{supp}(\hat{\mathbb P}_n)} D_l\big(\hat{\mathbb P}_n,\,R\cup\{\zeta\}\big) $$
and update $R\leftarrow R\cup\{\zeta\}$.
3. Repeat Step 2 until $|R|=m$.

Given an empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$, the algorithm iteratively populates the reduced set $R$ containing the atoms of the reduced distribution $\mathbb Q$. Each atom $\zeta\in\mathrm{supp}(\hat{\mathbb P}_n)$ is selected greedily so as to minimize the Wasserstein distance between $\hat{\mathbb P}_n$ and the closest distribution supported on the augmented reduced set $R\cup\{\zeta\}$. After termination, the distribution $\mathbb Q$ can be recovered from the reduced set $R$ as follows. Let $\{I_\zeta\}\in\mathcal P(I,m)$ be any partition of $\mathrm{supp}(\hat{\mathbb P}_n)$ into sets $I_\zeta$, $\zeta\in R$, such that $I_\zeta$ contains all elements of $\mathrm{supp}(\hat{\mathbb P}_n)$ that are closest to $\zeta$ (ties may be broken arbitrarily). Then $\mathbb Q=\sum_{\zeta\in R}q_\zeta\delta_\zeta$, where $q_\zeta=|I_\zeta|/n$.

Theorem 7
For every $d\ge 2$ and $l,p\ge 1$, the approximation ratio of Dupačová et al.'s algorithm is unbounded.

Proof The proof constructs a specific distribution $\hat{\mathbb P}_n$ (Step 1), bounds $D_l(\hat{\mathbb P}_n,m)$ from above (Step 2) and bounds the Wasserstein distance between $\hat{\mathbb P}_n$ and the output $\mathbb Q$ of Dupačová et al.'s algorithm from below (Step 3).

Step 1: Fix $d\ge 2$, $l,p\ge 1$ and $m=4$, and consider the empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ with $n=4z+1$ for some positive integer $z$ as well as $\mathrm{supp}(\hat{\mathbb P}_n)=\Xi_1\cup\dots\cup\Xi_4\cup\{\xi_{4z+1}\}$ with $\Xi_j=\{\xi_{(j-1)z+1},\dots,\xi_{jz}\}$, $j=1,\dots,4$, and
$$ \Xi_1\subset\mathbb B_\epsilon(+e_1),\quad \Xi_2\subset\mathbb B_\epsilon(-e_1),\quad \Xi_3\subset\mathbb B_\epsilon(+e_2),\quad \Xi_4\subset\mathbb B_\epsilon(-e_2) \quad\text{and}\quad \xi_{4z+1}=0, $$
where $\mathbb B_\epsilon(x)=\{\xi\in\mathbb R^d:\|\xi-x\|_p\le\epsilon\}$ denotes the $\epsilon$-ball around $x$. Here, $\epsilon>0$ is chosen small enough such that each atom $\xi_i$, $i\le 4z$, is closer to $0$ than to any atom in any of the other sets $\Xi_j$. The triangle inequality then implies that
$$ \|\xi_i\|_p\in[1-\epsilon,\,1+\epsilon]\;\;\forall\xi_i\in\Xi_1\cup\dots\cup\Xi_4,\qquad \|\xi_i-\xi_1\|_p\ge 2-2\epsilon\;\;\forall\xi_i\in\Xi_2,\qquad \|\xi_i-\xi_1\|_p\ge 1-2\epsilon\;\;\forall\xi_i\in\Xi_3\cup\Xi_4. \qquad(19) $$

Step 2:
By construction, we have that
$$ \begin{aligned} D_l(\hat{\mathbb P}_n,4) &\le d_l\Big(\hat{\mathbb P}_n,\;\frac{z+1}{4z+1}\delta_{\xi_z}+\frac{z}{4z+1}\delta_{\xi_{2z}}+\frac{z}{4z+1}\delta_{\xi_{3z}}+\frac{z}{4z+1}\delta_{\xi_{4z}}\Big)\\ &\le \Big[\frac{1}{4z+1}\Big(\sum_{j=1}^4\sum_{\xi_i\in\Xi_j}\|\xi_i-\xi_{jz}\|_p^l+\|\xi_{4z+1}-\xi_z\|_p^l\Big)\Big]^{1/l}\\ &\le \Big[\frac{1}{4z+1}\Big(\sum_{j=1}^4\sum_{\xi_i\in\Xi_j}(2\epsilon)^l+(1+\epsilon)^l\Big)\Big]^{1/l} = \Big[\frac{4z\,2^l\epsilon^l+(1+\epsilon)^l}{4z+1}\Big]^{1/l}, \end{aligned} $$
where the first inequality holds because $\xi_z,\xi_{2z},\xi_{3z},\xi_{4z}\in\mathrm{supp}(\hat{\mathbb P}_n)$, the second inequality holds since moving the atoms in $\Xi_j$ to $\xi_{jz}$, $j=1,\dots,4$, and $\xi_{4z+1}$ to $\xi_z$ represents a feasible transportation plan, and the third inequality is due to (19) and the triangle inequality.

Step 3:
We first show that for a sufficiently small $\epsilon>0$, Dupačová et al.'s algorithm adds $\xi_{4z+1}=0$ to the reduced set $R$ in the first iteration. We then show that under this selection, the output $\mathbb Q$ of Dupačová et al.'s algorithm can be arbitrarily worse than the bound on $D_l(\hat{\mathbb P}_n,4)$ determined in the previous step.

To show the first point, the symmetry inherent in $\mathrm{supp}(\hat{\mathbb P}_n)$ implies that it suffices to show that $d_l(\hat{\mathbb P}_n,\delta_0)<d_l(\hat{\mathbb P}_n,\delta_{\xi_1})$. To this end, we note that
$$ d_l^l(\hat{\mathbb P}_n,\delta_0)=\frac{1}{4z+1}\sum_{j=1}^4\sum_{\xi_i\in\Xi_j}\|\xi_i\|_p^l \;\le\; \frac{1}{4z+1}\sum_{j=1}^4\sum_{\xi_i\in\Xi_j}(1+\epsilon)^l \;=\; \frac{4z}{4z+1}(1+\epsilon)^l $$
due to equation (19), while at the same time
$$ d_l^l(\hat{\mathbb P}_n,\delta_{\xi_1})=\frac{1}{4z+1}\sum_{i=2}^{4z+1}\|\xi_i-\xi_1\|_p^l \;\ge\; \frac{z(2-2\epsilon)^l+(2z+1)(1-2\epsilon)^l}{4z+1}. $$
As $\epsilon$ tends to 0, we have that
$$ \lim_{\epsilon\downarrow 0} d_l(\hat{\mathbb P}_n,\delta_0) \le \Big[\frac{4z}{4z+1}\Big]^{1/l} < \Big[\frac{2^l z+2z+1}{4z+1}\Big]^{1/l} \le \lim_{\epsilon\downarrow 0} d_l(\hat{\mathbb P}_n,\delta_{\xi_1}), $$
where the strict inequality is due to $l\ge 1$. As a consequence, we may conclude that there indeed exists an $\epsilon>0$ such that Dupačová et al.'s algorithm adds $\xi_{4z+1}=0$ to the reduced set $R$ in the first iteration.

As for the second point, we note that after adding $\xi_{4z+1}=0$ to the reduced set $R$, there must be at least one subset $\Xi_j$, $j\in\{1,\dots,4\}$, such that no $\xi_i\in\Xi_j$ is contained in the final reduced set $R$. Assume w.l.o.g. that this is the case for $j=1$. We then have
$$ d_l(\hat{\mathbb P}_n,\mathbb Q) \ge \Big[\frac{1}{4z+1}\sum_{\xi_i\in\Xi_1}\|\xi_i-0\|_p^l\Big]^{1/l} \ge \Big[\frac{z(1-\epsilon)^l}{4z+1}\Big]^{1/l}, $$
and combining this with the result of Step 2, we can conclude that the approximation ratio $d_l(\hat{\mathbb P}_n,\mathbb Q)/D_l(\hat{\mathbb P}_n,4)$ approaches $\infty$ as $z\to\infty$ and $z\epsilon^l\to 0$. ⊓⊔

We remark that the algorithm of Dupačová et al. can be improved by adding multiple atoms to the reduced set $R$ in Step 2. Nevertheless, a similar argument as in the proof of Theorem 7 shows that the resulting improved algorithm does not allow for a finite upper bound on the approximation ratio either.

4.1.2 The $k$-Means Clustering Algorithm

The $k$-means clustering algorithm has first been proposed in 1957 for a pulse-code modulation problem (Lloyd 1982), and it has since then become a widely used heuristic for various classes of clustering problems. It aims to partition a set of observations $x_1,\dots,x_n$ into $m$ clusters $S_1,\dots,S_m$ such that the intra-cluster sums of squared distances are minimized. By generalizing the algorithm to arbitrary powers and norms, we can adapt the algorithm to our continuous scenario reduction problem as follows.

$k$-means clustering algorithm for $C_l(\hat{\mathbb P}_n,m)$:
1. Initialize the reduced set $R=\{\zeta_1,\dots,\zeta_m\}\subseteq\mathrm{supp}(\hat{\mathbb P}_n)$ arbitrarily.
2. Let $\{I_j\}\in\mathcal P(I,m)$ be any partition whose sets $I_j$, $j\in J$, contain all atoms of $\mathrm{supp}(\hat{\mathbb P}_n)$ that are closest to $\zeta_j$ (ties may be broken arbitrarily).
3. For each $j\in J$, update $\zeta_j\leftarrow\arg\min\big\{\sum_{i\in I_j}\|\xi_i-\zeta\|^l:\zeta\in\mathbb R^d\big\}$.
4. Repeat Steps 2 and 3 until the reduced set $R$ no longer changes.

For the empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$, the algorithm iteratively updates the reduced set $R$ containing the atoms of the reduced distribution $\mathbb Q$ through a sequence of assignment (Step 2) and update (Step 3) steps. Step 2 assigns each atom $\xi_i\in\mathrm{supp}(\hat{\mathbb P}_n)$ of the empirical distribution to the closest atom in the reduced set, and Step 3 updates each atom in the reduced set so as to minimize the sum of $l$-th powers of the distances to its assigned atoms from $\mathrm{supp}(\hat{\mathbb P}_n)$. After termination, the continuously reduced distribution $\mathbb Q$ can be recovered from the reduced set $R$ in the same way as in the previous subsection.

Remark 1 implies that for $l=2$ and $\|\cdot\|=\|\cdot\|_2$, Step 3 reduces to $\zeta_j\leftarrow\frac{1}{|I_j|}\sum_{i\in I_j}\xi_i$, in which case we recover the classical $k$-means clustering algorithm. Although the algorithm terminates at a local minimum, Dasgupta (2008) has shown that for $l=2$ and $\|\cdot\|=\|\cdot\|_2$, the solution determined by the algorithm can be arbitrarily suboptimal. We now generalize this finding to generic type-$l$ Wasserstein distances induced by arbitrary $p$-norms.

Theorem 8
If initialized randomly in Step 1, the approximation ratio of the $k$-means clustering algorithm is unbounded for every $d,l,p\ge 1$ with significant probability.

Proof In analogy to the proof of Theorem 7, we construct a specific distribution $\hat{\mathbb P}_n$ (Step 1), bound $C_l(\hat{\mathbb P}_n,m)$ from above (Step 2) and bound the Wasserstein distance between $\hat{\mathbb P}_n$ and the output $\mathbb Q$ of the $k$-means algorithm from below (Step 3).

Step 1: Fix $d,l,p\ge 1$ and $m=3$, and consider the empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$ with $n=3z+1$ for some positive integer $z$ as well as $\mathrm{supp}(\hat{\mathbb P}_n)=\Xi_1\cup\Xi_2\cup\{\xi_{3z+1}\}$ with $\Xi_1=\{\xi_1,\dots,\xi_{2z}\}$, $\Xi_2=\{\xi_{2z+1},\dots,\xi_{3z}\}$ and
$$ \Xi_1\subset\mathbb B_\epsilon(-e_1),\qquad \Xi_2\subset\mathbb B_\epsilon(0)\qquad\text{for }\epsilon\in(0,1/4), \qquad(20) $$
as well as $\xi_{3z+1}=e_1$, where again $\mathbb B_\epsilon(x)=\{\xi\in\mathbb R^d:\|\xi-x\|_p\le\epsilon\}$. By construction, the distance between any pair of atoms in $\Xi_j$ is bounded above by $2\epsilon<1/2$, $j=1,2$, whereas the distance between two atoms from $\Xi_1$ and $\Xi_2$ is bounded from below by $1-2\epsilon>1/2$.

Step 2: A similar argument as in the proof of Theorem 7 shows that
$$ C_l(\hat{\mathbb P}_n,3) \le d_l\Big(\hat{\mathbb P}_n,\;\frac{2z}{3z+1}\delta_{-e_1}+\frac{z}{3z+1}\delta_{0}+\frac{1}{3z+1}\delta_{e_1}\Big) \le \Big[\frac{1}{3z+1}\Big(\sum_{\xi_i\in\Xi_1}\|\xi_i+e_1\|_p^l+\sum_{\xi_i\in\Xi_2}\|\xi_i\|_p^l\Big)\Big]^{1/l} \le \Big[\frac{3z\,\epsilon^l}{3z+1}\Big]^{1/l}, $$
where the last inequality follows from (20).

Step 3: We first show that with significant probability, the algorithm chooses a reduced set $R$ containing two atoms from $\Xi_1$ and one atom from $\Xi_2$ in the first step. We then show that under this initialization, the output $\mathbb Q$ of the algorithm can be arbitrarily worse than the bound on $C_l(\hat{\mathbb P}_n,3)$ determined above.

In view of the first point, we note that the probability of the reduced set $R$ containing two atoms from $\Xi_1$ and one atom from $\Xi_2$ after the first step is $\binom{2z}{2}\binom{z}{1}/\binom{3z+1}{3}$, which approaches $4/9\approx 44.44\%$ as $z\to\infty$. In the following, we thus assume w.l.o.g. that $R=\{\zeta_1,\zeta_2,\zeta_3\}$ with $\zeta_1,\zeta_2\in\Xi_1$ and $\zeta_3\in\Xi_2$ after the first step.

As for the second point, we note that Step 2 of the algorithm assigns the atoms $\xi_i\in\Xi_1$ to either $\zeta_1$ or $\zeta_2$, whereas the atoms $\xi_i\in\Xi_2\cup\{\xi_{3z+1}\}$ are assigned to $\zeta_3$. Hence, the update of the reduced set $R$ in the next iteration satisfies $\zeta_1,\zeta_2\in\mathbb B_\epsilon(-e_1)$, whereas $\zeta_3$ is chosen with respect to the set $\Xi_2\cup\{\xi_{3z+1}\}$. The algorithm then terminates in the third iteration as the reduced set $R$ no longer changes. We thus find that
$$ d_l(\hat{\mathbb P}_n,\mathbb Q) \ge \Big[\frac{1}{3z+1}\sum_{\xi_i\notin\Xi_1}\|\xi_i-\zeta_3\|_p^l\Big]^{1/l} \ge \Big[\frac{1}{3z+1}\Big(\|\xi_{3z}-\zeta_3\|_p^l+\|\xi_{3z+1}-\zeta_3\|_p^l\Big)\Big]^{1/l}. $$
Recall that $\xi_{3z}\in\mathbb B_\epsilon(0)$ and $\xi_{3z+1}=e_1$, which implies that
$$ \|\xi_{3z}-\zeta_3\|_p+\|\xi_{3z+1}-\zeta_3\|_p \;\ge\; \|\xi_{3z+1}-\xi_{3z}\|_p \;\ge\; 1-\epsilon, $$
and that at least one of the two terms $\|\xi_{3z}-\zeta_3\|_p$ or $\|\xi_{3z+1}-\zeta_3\|_p$ is greater than $\frac{1-\epsilon}{2}$. We thus conclude that
$$ d_l(\hat{\mathbb P}_n,\mathbb Q) \ge \Big[\frac{(1-\epsilon)^l}{2^l(3z+1)}\Big]^{1/l}, $$
which by virtue of Step 2 implies that $d_l(\hat{\mathbb P}_n,\mathbb Q)/C_l(\hat{\mathbb P}_n,3)\to\infty$ as $\epsilon\to 0$. ⊓⊔

4.2 A Constant-Factor Approximation Scheme

We now propose an algorithm that solves the discrete scenario reduction problem with a constant approximation ratio, and we show that it also provides solutions to the continuous scenario reduction problem with an approximation ratio of 10. To our best knowledge, we describe the first constant-factor approximations for the discrete and continuous scenario reduction problems.

Our algorithm follows from the insight that the discrete scenario reduction problem under the type-1 Wasserstein distance is equivalent to the $k$-median clustering problem.
The $k$-median clustering problem is a variant of the $k$-means clustering problem described in Section 4.1.2, where the $l$-th power of the norm terms is dropped (i.e., $l=1$). In the following, we adapt a well-known local search algorithm (Arya et al 2004) to our discrete scenario reduction problem:

Local search algorithm for $D_l(\hat{\mathbb P}_n,m)$:
1. Initialize the reduced set $R\subseteq\mathrm{supp}(\hat{\mathbb P}_n)$, $|R|=m$, arbitrarily.
2. Select the next exchange to be applied to the reduced set as
$$ (\zeta,\zeta')\in\arg\min\big\{D_l\big(\hat{\mathbb P}_n,\,R\cup\{\zeta\}\setminus\{\zeta'\}\big) : (\zeta,\zeta')\in\big(\mathrm{supp}(\hat{\mathbb P}_n)\setminus R\big)\times R\big\}, $$
and update $R\leftarrow R\cup\{\zeta\}\setminus\{\zeta'\}$ if $D_l(\hat{\mathbb P}_n,R\cup\{\zeta\}\setminus\{\zeta'\})<D_l(\hat{\mathbb P}_n,R)$.
3. Repeat Step 2 until no further improvement is possible.

For an empirical distribution $\hat{\mathbb P}_n\in\mathcal P_E(\mathbb R^d,n)$, the algorithm constructs a sequence of reduced sets $R$ containing the atoms of the reduced distribution $\mathbb Q$. In each iteration, Step 2 selects the exchange $R\cup\{\zeta\}\setminus\{\zeta'\}$, $\zeta\in\mathrm{supp}(\hat{\mathbb P}_n)$ and $\zeta'\in R$, that maximally reduces the Wasserstein distance $D_l(\hat{\mathbb P}_n,R)$. For performance reasons, this 'best fit' strategy can also be replaced with a 'first fit' strategy which conducts the first exchange $R\cup\{\zeta\}\setminus\{\zeta'\}$ found that leads to a reduction of $D_l(\hat{\mathbb P}_n,R)$. After termination, the reduced distribution $\mathbb Q$ can be recovered from the reduced set $R$ in the same way as in Section 4.1.1.

It follows from Arya et al (2004) that the above algorithm (with either 'best fit' or 'first fit') has an approximation ratio of 5 for the discrete scenario reduction problem for all $d$. We now show that the algorithm also provides solutions to the continuous scenario reduction problem with an approximation ratio of at most 10.

Corollary 1 The problems $D_l(\hat{\mathbb P}_n,m)$ and $C_l(\hat{\mathbb P}_n,m)$ are related as follows.
1. Any approximation algorithm for $D_2(\hat{\mathbb P}_n,m)$ under the 2-norm with approximation ratio $\alpha$ gives rise to an approximation algorithm for $C_2(\hat{\mathbb P}_n,m)$ under the 2-norm with approximation ratio $\sqrt2\,\alpha$.
2. Any approximation algorithm for $D_1(\hat{\mathbb P}_n,m)$ under any norm with approximation ratio $\alpha$ gives rise to an approximation algorithm for $C_1(\hat{\mathbb P}_n,m)$ under the same norm with approximation ratio $2\alpha$.

Proof The two statements follow directly from Theorems 5 and 6, respectively. ⊓⊔
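The local search scheme above can be sketched compactly for $l=1$ (the 'best fit' variant; helper names, the toy instance and the seed are our choices, and the exact optimum is computed by enumeration purely for comparison):

```python
import itertools
import math
import random

def d1_cost(points, centers):
    # D_1(P_n, R): each atom ships its mass 1/n to the nearest center in R.
    return sum(min(math.dist(x, z) for z in centers) for x in points) / len(points)

def local_search(points, m, seed=0):
    # Single-swap 'best fit' local search in the spirit of Arya et al (2004).
    rng = random.Random(seed)
    R = rng.sample(points, m)
    while True:
        best, swap = d1_cost(points, R), None
        for zeta, zeta_out in itertools.product(points, R):
            if zeta in R:
                continue                       # only consider true exchanges
            cand = [z for z in R if z != zeta_out] + [zeta]
            cost = d1_cost(points, cand)
            if cost < best - 1e-12:
                best, swap = cost, cand
        if swap is None:                        # no improving swap: local optimum
            return R
        R = swap

random.seed(2)
pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(14)]
R = local_search(pts, 3)
opt = min(d1_cost(pts, C) for C in itertools.combinations(pts, 3))
print(d1_cost(pts, R), opt)
```

The 'first fit' variant would apply the first improving swap instead of scanning all of them, trading solution quality per iteration for speed.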
As presented, the local search algorithm is not guaranteed to terminate inpolynomial time. This can be remedied by a variant of the algorithm that onlyaccepts exchanges R ∪ { ζ } \ { ζ (cid:48) } that reduce the Wasserstein distance D l (ˆ P n , R )by at least (cid:15)/ (( n − m ) m ) for some constant (cid:15) >
0. It follows from Arya et al (2004)that for any (cid:15) , this variant terminates in polynomial time and provides a (5 + (cid:15) )-approximation for the discrete scenario reduction problem. The algorithm can alsobe extended to accommodate multiple swaps in every iteration, which lowers theapproximation ratio to 3 + (cid:15) at the expense of additional computations.We remark that there is a wealth of algorithms for the k -median problem thatcan be adapted to the discrete scenario reduction problem. For example, Charikarand Li (2012) present a rounding scheme for the k -median problem that gives riseto a polynomial-time algorithm for D (ˆ P n , R ) with an approximation ratio of 3 . k -medianproblem that gives rise to a polynomial-time algorithm for D (ˆ P n , R ) under the2-norm with an approximation ratio of 9+ (cid:15) . In both cases, Corollary 1 allows us toextend these guarantees to the corresponding versions of the continuous scenarioreduction problem.4.3 Mixed-Integer Reformulations of the Discrete and Continuous ScenarioReduction ProblemsWe first review a well-known mixed-integer linear programming (MILP) reformu-lation of the discrete scenario reduction problem D l (ˆ P n , m ): Theorem 9
The discrete scenario reduction problem can be formulated as the MILP

  D_l(P̂_n, m)^l = min_{Π, λ}  (1/n) ⟨Π, D⟩
                  s.t.  Π e = e,  Π ≤ e λ^⊤,  λ^⊤ e = m,          (21)
                        Π ∈ ℝ_+^{n×n},  λ ∈ {0, 1}^n,

with D ∈ 𝕊^n and d_ij = ‖ξ_i − ξ_j‖^l encoding the distances among the atoms in supp(P̂_n).

Proof See, e.g., Heitsch and Römisch (2003). ⊓⊔
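On toy instances, the optimal value of the MILP (21) can be cross-checked by enumeration: with uniform weights, the optimal transportation plan sends every atom to its nearest retained atom, so the objective (1/n)⟨Π, D⟩ collapses to (1/n) Σ_i min_{j∈R} d_ij. A brute-force sketch (our own helper names; exponential in m, so for validation only):

```python
# Sketch: exact discrete scenario reduction by enumerating all m-subsets of
# the support, useful for cross-checking the MILP (21) on small instances.
# With uniform weights the objective is (1/n) * sum_i min_{j in R} d_ij.
from itertools import combinations

def discrete_reduction(xi, m, dist):
    """Return (best cost, best index set R) over all m-subsets of atoms."""
    n = len(xi)
    best_cost, best_R = float("inf"), None
    for R in combinations(range(n), m):
        cost = sum(min(dist(xi[i], xi[j]) for j in R) for i in range(n)) / n
        if cost < best_cost:
            best_cost, best_R = cost, R
    return best_cost, best_R

# Type-1 Wasserstein distance induced by the 1-norm, scalar scenarios.
l1 = lambda a, b: abs(a - b)
cost, R = discrete_reduction([0.0, 1.0, 2.0, 10.0], m=2, dist=l1)
# Retaining the atoms 1.0 and 10.0 is optimal: cost = (1+0+1+0)/4 = 0.5
assert cost == 0.5 and R == (1, 3)
```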
In problem (21), the decision variable π_ij determines how much of the probability mass of atom ξ_i in P̂_n is shifted to the atom ξ_j in the reduced distribution Q, whereas the decision variable λ_j determines whether the atom ξ_j ∈ supp(P̂_n) is contained in the support of Q. A solution (Π*, λ*) to problem (21) allows us to recover the reduced distribution via Q = Σ_{j=1}^n ((1/n) e^⊤ Π* e_j) · δ_{ξ_j}. Problem (21) has n binary and n² continuous variables as well as n² + n + 1 constraints.

We now consider the continuous scenario reduction problem. Due to its bilinear objective function, which involves products of the transportation weights π_ij and the distances ‖ξ_i − ζ_j‖^l containing the continuous decision variables ζ_j, this problem may not appear to be amenable to a reformulation as a mixed-integer convex optimization problem. We now show that such a reformulation indeed exists.

Theorem 10
The continuous scenario reduction problem can be formulated as the mixed-integer convex optimization problem

  C_l(P̂_n, m)^l = min_{Π, c, {ζ_j}}  (1/n) e^⊤ c
                  s.t.  Π e = e,
                        ‖ξ_i − ζ_j‖^l ≤ c_i + M(1 − π_ij)  ∀ i ∈ I, ∀ j ∈ J,   (22)
                        Π ∈ {0, 1}^{n×m},  c ∈ ℝ_+^n,  ζ_1, …, ζ_m ∈ ℝ^d,

where M = max_{i,j∈I} ‖ξ_i − ξ_j‖^l denotes the l-th power of the diameter of the support of P̂_n.

In problem (22), the decision variable π_ij determines whether or not the probability mass of atom ξ_i in the empirical distribution P̂_n is shifted to the atom ζ_j in the reduced distribution Q, whereas the decision variable c_i records the cost of moving the atom ξ_i under the transportation plan Π.

Proof of Theorem 10
We prove the statement by showing that optimal solutions to problem (22) correspond to feasible solutions in problem (2) with the same objective function value in their respective problems, and vice versa.

Fix a minimizer (Π*, c*, ζ*_1, …, ζ*_m) to problem (22), which corresponds to a feasible solution ({I_j}, ζ*_1, …, ζ*_m) in problem (2) if we set I_j = {i ∈ I : π*_ij = 1} for all j ∈ J. Note that c*_i = ‖ξ_i − ζ*_j‖^l for j ∈ J and i ∈ I_j. Thus, both solutions adopt the same objective value.

Conversely, fix a minimizer ({I*_j}, ζ*_1, …, ζ*_m) to problem (2). This solution corresponds to a feasible solution (Π, c, ζ*_1, …, ζ*_m) to problem (22) if we set π_ij = 1 if i ∈ I*_j and π_ij = 0 otherwise for all j ∈ J, as well as c_i = ‖ξ_i − ζ*_j‖^l for all i ∈ I*_j and j ∈ J. By construction, both solutions adopt the same objective value. ⊓⊔

A solution (Π*, c*, ζ*_1, …, ζ*_m) to problem (22) allows us to recover the reduced distribution via Q = Σ_{j=1}^m ((1/n) e^⊤ Π* e_j) · δ_{ζ*_j}. Problem (22) has nm binary and n + md continuous variables as well as nm + n constraints. We now show that (22) typically reduces to an MILP or a mixed-integer second-order cone program (MISOCP).

Proposition 8
For the type-1 Wasserstein distance induced by ‖·‖_1 or ‖·‖_∞, problem (22) reduces to an MILP. For any type-l Wasserstein distance induced by ‖·‖_p, where l ≥ 1 and p ≥ 1 are rational numbers, problem (22) reduces to an MISOCP.

Proof In view of the first statement, we note that ‖ξ_i − ζ_j‖_1 ≤ c_i + M(1 − π_ij) is satisfied if and only if there is φ_ij ∈ ℝ^d such that φ_ij ≥ ξ_i − ζ_j, φ_ij ≥ ζ_j − ξ_i and e^⊤ φ_ij ≤ c_i + M(1 − π_ij). Likewise, ‖ξ_i − ζ_j‖_∞ ≤ c_i + M(1 − π_ij) holds if and only if there is φ_ij ∈ ℝ with φ_ij e ≥ ξ_i − ζ_j, φ_ij e ≥ ζ_j − ξ_i and φ_ij ≤ c_i + M(1 − π_ij).

As for the second statement, we note that ‖ξ_i − ζ_j‖_p^l ≤ c_i + M(1 − π_ij) is satisfied if and only if there is φ_ij ∈ ℝ such that φ_ij ≥ ‖ξ_i − ζ_j‖_p and φ_ij^l ≤ c_i + M(1 − π_ij). For rational l, p ≥ 1, both inequalities can be expressed through finitely many second-order cone constraints (Alizadeh and Goldfarb 2003, Section 2.3). ⊓⊔
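For the type-2 Wasserstein distance under the 2-norm, a cheap alternative to solving (22) exactly is the classical alternating heuristic of Lloyd (1982): fix the representatives and reassign each atom to its nearest one, then fix the assignment and move each representative to its cluster centroid. The sketch below (our own helper names) only finds a local optimum, unlike formulation (22):

```python
# Sketch: Lloyd-style alternating heuristic for continuous scenario
# reduction under the type-2 Wasserstein distance with the 2-norm. It
# alternates between an assignment step (fixing the representatives zeta)
# and a centroid update step (fixing the transportation plan Pi).

def lloyd(points, centers, iters=50):
    for _ in range(iters):
        # Assignment step: each atom is sent to its nearest representative.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Update step: each representative moves to its cluster centroid.
        centers = [
            tuple(sum(p[k] for p in cl) / len(cl) for k in range(len(cl[0])))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers

points = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers = lloyd(points, [(0.0,), (9.0,)])
assert centers == [(0.5,), (9.5,)]
```

Each iteration weakly decreases the objective of (22), which is why the heuristic is a natural warmstart for the exact mixed-integer formulation.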
5 Numerical Experiments: Color Quantization

Color quantization aims to reduce the color palette of a digital image without compromising its visual appearance. In the standard RGB24 model, colors are encoded by vectors of the form (r, g, b) ∈ {0, 1, …, 255}³. This means that the RGB24 model can represent a vast number of 16,777,216 distinct colors. Consequently, color quantization serves primarily as a lossy image compression method.

In the following we interpret the color quantization problem as a discrete scenario reduction problem using the type-1 Wasserstein distance induced by the 1-norm on ℝ³. Thus, we can solve color quantization problems via Dupačová's greedy heuristic, the local search algorithm or the exact MILP reformulation (21). In our experiment we aim to compress all 24 pictures from the Kodak Lossless True Color Image Suite (http://r0k.us/graphics/kodak/) to a palette of m colors for a range of values of m. As the MILP reformulation scales poorly with n, we first reduce each image to n ≲ 1,024 colors using the Linux command "convert -colors", which is distributed through ImageMagick. We henceforth refer to the resulting images as the original images. We then apply Dupačová's greedy heuristic (DPCV), and we initialize the local search algorithm either with the color palette obtained from Dupačová's algorithm (LOC-1) or naïvely with the m most frequent colors of the original image (LOC-2). The MILP (21) is solved with GUROBI 7.0.1 (MILP). All algorithms are implemented in C++, and all experiments are executed on a 3.40GHz i7 CPU machine with 16GB RAM. We report the average and the worst-case runtimes in Table 1. Note that DPCV, LOC-1 and LOC-2 all terminate in less than 14 seconds across all instances, while MILP requires substantially more time (the maximum runtime was set to ten hours). Moreover, warmstarting the local search algorithm with the color palette obtained from DPCV can significantly reduce the runtimes.
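A greedy forward-selection heuristic in the spirit of Dupačová et al (2003) can be sketched as follows (this is our own simplified rendering, not the C++ implementation used in the experiments): starting from an empty palette, it repeatedly retains the color whose inclusion most reduces the frequency-weighted transport cost.

```python
# Sketch: greedy forward selection in the spirit of Dupacova et al (2003),
# applied to frequency-weighted colors. In each round the heuristic retains
# the atom whose inclusion most reduces the weighted transport cost
# sum_i w_i * min_{j in R} dist(xi_i, xi_j).

def greedy_reduction(atoms, weights, m, dist):
    R = []
    best_d = [float("inf")] * len(atoms)   # distance of each atom to R
    for _ in range(m):
        def cost_if_added(c):
            return sum(w * min(bd, dist(a, atoms[c]))
                       for a, w, bd in zip(atoms, weights, best_d))
        c = min(range(len(atoms)), key=cost_if_added)
        R.append(c)
        best_d = [min(bd, dist(a, atoms[c])) for a, bd in zip(atoms, best_d)]
    return R

# Toy "image": two frequent dark colors, one mid-gray, one rare outlier;
# type-1 Wasserstein distance induced by the 1-norm on RGB space.
colors = [(0, 0, 0), (10, 10, 10), (200, 200, 200), (255, 0, 0)]
freq = [0.4, 0.4, 0.15, 0.05]
d1 = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
palette = greedy_reduction(colors, freq, m=2, dist=d1)
assert palette == [1, 2]   # a dark color and the gray, not the rare outlier
```

The incremental `best_d` array makes each round linear in the number of candidate evaluations, mirroring how the heuristic stays fast even for n ≈ 1,024 colors.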
                 DPCV    LOC-1    LOC-2    MILP
Average (secs)   2.16    2.60     3.59     1,…

Table 1: Runtimes of different methods for discrete scenario reduction.

Fig. 3: Outputs of different color quantization algorithms for image "kodim15.png": (a) Original, (b) MS Paint, (c) DPCV, (d) LOC-1, (e) LOC-2, (f) MILP.

As an example, Figure 3 shows the image "kodim15.png" as well as the results of different color quantization algorithms for m = 16. While the outputs of LOC-1, LOC-2 and MILP are almost indistinguishable, the output of DPCV has ostensible deficiencies (e.g., it misrepresents the yellow color around the subject's eye). For comparison, we also show the output of the color quantization routine in Microsoft Paint (MS Paint). Figure 4 visualizes the optimality gaps of DPCV, LOC-1 and LOC-2 relative to MILP (i.e., their respective approximation ratios minus one). We observe that LOC-1 and LOC-2 are competitive with MILP in terms of output quality but at significantly reduced runtimes. Moreover, the local search algorithm LOC-1 warmstarted with the color palette obtained from DPCV is guaranteed to outperform DPCV in terms of optimality gaps.
Acknowledgements
This research was funded by the SNSF grant BSCGI0 157733 and the EPSRC grants EP/M028240/1 and EP/M027856/1.
References
Alizadeh F, Goldfarb D (2003) Second-order cone programming. Mathematical Programming 95(1):3–51
Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75(2):245–248
Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for k-median and facility location problems. SIAM Journal on Computing 33(3):544–562
Charikar M, Li S (2012) A dependent LP-rounding approach for the k-median problem. In: Proceedings of the 39th International Colloquium Conference on Automata, Languages, and Programming, pp 194–205
Dasgupta S (2008) CSE 291: Topics in unsupervised learning. URL http://cseweb.ucsd.edu/~dasgupta/291-unsup/
Dupačová J (1990) Stability and sensitivity-analysis for stochastic programming. Annals of Operations Research 27(1):115–142
Dupačová J, Gröwe-Kuska N, Römisch W (2003) Scenario reduction in stochastic programming: an approach using probability metrics. Mathematical Programming 95(3):493–511
Gao R, Kleywegt A (2016) Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint
Gray RM (2006) Toeplitz and circulant matrices: A review. Foundations and Trends in Communications and Information Theory 2(3):155–239
Heitsch H, Römisch W (2003) Scenario reduction algorithms in stochastic programming. Computational Optimization and Applications 24(2–3):187–206
Kanungo T, Mount D, Netanyahu N, Piatko C, Silverman R, Wu A (2004) A local search approximation algorithm for k-means clustering. Computational Geometry 28(2):89–112
Kariv O, Hakimi SL (1979) An algorithmic approach to network location problems. II: The p-medians. SIAM Journal on Applied Mathematics 37(3):539–560
Lloyd S (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137
Mahajan M, Nimbhorkar P, Varadarajan K (2009) The planar k-means problem is NP-hard. In: Proceedings of the 3rd International Workshop on Algorithms and Computation, pp 274–285
Mohajerin Esfahani P, Kuhn D (2015) Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. arXiv preprint

Appendix: Auxiliary Results
The proof of Theorem 2 relies on the following two lemmas.
Lemma 1
The semidefinite program (6) admits an optimal solution (τ, S) with S = α I + β e e^⊤ for some α, β ∈ ℝ.

Proof Let (τ, S*) be any optimal solution to (6), which exists because (6) has a continuous objective function and a compact feasible set, and denote by 𝔖 the set of all permutations of I. For any σ ∈ 𝔖, the permuted solution (τ, S^σ) with s^σ_ij = s*_{σ(i)σ(j)} is also optimal in (6). Note first that (τ, S^σ) is feasible in (6) because

  τ ≤ Σ_{j∈J} (1/|I_j|) Σ_{i∈I_j} ( |I_j| s^σ_ii − 2 Σ_{k∈I_j} s^σ_ik + (1/|I_j|) ( Σ_{k∈I_j} s^σ_kk + Σ_{k,k′∈I_j, k≠k′} s^σ_kk′ ) )
  ⟺
  τ ≤ Σ_{j∈J} (1/|I^σ_j|) Σ_{i∈I^σ_j} ( |I^σ_j| s*_ii − 2 Σ_{k∈I^σ_j} s*_ik + (1/|I^σ_j|) ( Σ_{k∈I^σ_j} s*_kk + Σ_{k,k′∈I^σ_j, k≠k′} s*_kk′ ) ),

where the index sets I^σ_j = {σ(i) : i ∈ I_j} for j ∈ J form an m-set partition from within P(I, m), and because S^σ ⪰ 0 and s^σ_ii = s*_{σ(i)σ(i)} ≤ 1 for all i ∈ I by construction. Moreover, it is clear that (τ, S^σ) and (τ, S*) share the same objective value in (6). Thus, (τ, S^σ) is optimal in (6) for every σ ∈ 𝔖.

The convexity of problem (6) implies that (τ, S̄) with S̄ = (1/n!) Σ_{σ∈𝔖} S^σ is also optimal in (6). The claim follows by noting that S̄ is invariant under permutations of the coordinates and thus representable as α I + β e e^⊤ for some α, β ∈ ℝ. ⊓⊔
For α, β ∈ ℝ the eigenvalues of S = α I + β e e^⊤ ∈ 𝕊^n are given by α + nβ (with multiplicity 1) and α (with multiplicity n − 1).

Proof Note that S is a circulant matrix, meaning that each of its rows coincides with the preceding row rotated by one element to the right. Thus, the eigenvalues of S are given by α + β(1 + ρ_j + … + ρ_j^{n−1}), j = 0, …, n − 1, where ρ_j = e^{2πij/n} and i denotes the imaginary unit; see e.g. Gray (2006). For j = 0 we then obtain the eigenvalue α + nβ, and for j = 1, …, n − 1 we obtain the eigenvalue α with multiplicity n − 1 because Σ_{k=0}^{n−1} e^{2πijk/n} = (1 − e^{2πij})/(1 − e^{2πij/n}) = 0. ⊓⊔
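Lemma 2 can also be sanity-checked numerically without any linear-algebra library: e is an eigenvector of S = αI + β e e^⊤ with eigenvalue α + nβ, and every vector whose entries sum to zero is an eigenvector with eigenvalue α. A small sketch (the parameter values are arbitrary):

```python
# Numeric sanity check of Lemma 2: for S = alpha*I + beta*e*e^T, the vector
# e is an eigenvector with eigenvalue alpha + n*beta, and any vector whose
# entries sum to zero is an eigenvector with eigenvalue alpha.

n, alpha, beta = 5, 2.0, 0.25
S = [[alpha * (i == j) + beta for j in range(n)] for i in range(n)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

e = [1.0] * n
assert matvec(S, e) == [(alpha + n * beta) * x for x in e]

u = [1.0, -1.0, 0.0, 0.0, 0.0]          # entries sum to zero
assert matvec(S, u) == [alpha * x for x in u]
```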
Fig. 4: Optimality gaps of DPCV (dashed line with box), LOC-1 (solid line with plus) and LOC-2 (solid line with star) relative to MILP.