Sparsity in Max-Plus Algebra and Systems
Anastasios Tsiamis† and Petros Maragos‡

June 5, 2019
Abstract
We study sparsity in the max-plus algebraic setting. We seek both exact and approximate solutions of the max-plus linear equation with minimum cardinality of support. In the former case, the sparsest solution problem is shown to be equivalent to the minimum set cover problem and, thus, NP-complete. In the latter one, the approximation is quantified by the ℓ1 residual error norm, which is shown to have supermodular properties under some convex constraints, called lateness constraints. Thus, greedy approximation algorithms of polynomial complexity can be employed for both problems with guaranteed bounds of approximation. We also study the sparse recovery problem and present conditions, under which, the sparsest exact solution solves it. Through multi-machine interactive processes, we describe how the present framework could be applied to two practical discrete event systems problems: resource optimization and structure-seeking system identification. We also show how sparsity is related to the pruning problem. Finally, we present a numerical example of the structure-seeking system identification problem and we study the performance of the greedy algorithm via simulations.

∗ The paper was published in the Discrete Event Dynamic Systems journal; doi: https://doi.org/10.1007/s10626-019-00281-1.
† Department of Electrical and Systems Engineering, University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA 19104, United States. (email: [email protected])
‡ School of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus, 15773 Athens, Greece. (email: [email protected])

1 Introduction

Max-plus algebra has been used to model a subclass of nonlinear phenomena with some linear-like structure. It is obtained from the linear algebra if we replace addition with maximum and multiplication with addition (Butkovič, 2010). The development of this algebraic theory was motivated by problems arising in scheduling theory, graph theory and operations research (Cuninghame-Green, 1979). Later on, max-plus algebra was also employed in discrete event systems to deal mainly with synchronization problems (Cohen et al, 1985; Baccelli et al, 1992; Cohen et al, 1999; De Schutter and van den Boom, 2008). Other applications include the max-algebraic approach to optimal control (Litvinov et al, 2001; McEneaney, 2006), general max-plus dynamical systems and control (Adzkiya et al, 2015; Hardouin et al, 2011) and generalized HMMs for audiovisual event detection (Maragos and Koutras, 2015). An extensive survey about the applications of the max-plus algebra can be found in Gaubert (2009). Generalizations of max-plus algebra using other idempotent semirings are described in Gondran and Minoux.

Our contributions are the following.

i) We formulate sparsity in the max-plus algebraic setting, where the sparsity pattern of a vector is determined by its finite components, since the zero element is −∞, and we allow matrices and vectors to have −∞ elements.

ii) We define the problem of finding the sparsest exact solution to the max-plus equation (see problem (6)). Then, in Section 4, we show that this problem is equivalent to the minimum set-cover one and, thus, NP-complete (Theorem 1).

iii) We define the problem of finding the sparsest approximate solution (see problem (7)). Here, we are searching for the sparsest solution that satisfies the following constraints: i) its ℓ1 approximation error is bounded and ii) it satisfies some additional convex constraints, called lateness constraints. In Section 5, we show that the ℓ1-error of approximation has supermodular properties (Theorem 3).
Thus, a suboptimal greedy approximate algorithm of polynomial complexity can be employed with guaranteed bounds on the suboptimality ratio (Theorem 4). Our analysis is extended to the case when the components of the matrices are allowed to take −∞ values (Theorem 5).

iv) We study the sparse recovery problem (see Section 6, Theorem 6). In particular, it is explored whether we can recover a vector, for which we do not know the sparsity pattern, from its image under a max-plus linear transformation. We derive sufficient conditions, under which, we can use the sparsity framework to recover that vector and its sparsity pattern.

The paper is organized as follows. In Section 2, we revisit the max-plus equation and its properties. Section 3 formulates the problems of finding the exact and approximate sparsest solutions to the max-plus equation. Then, in Sections 4 and 5, we present possible solutions to the former and the latter problem respectively. For completeness, in Section 5, we also include a brief introduction to the supermodularity literature. In Section 6, we study the sparse recovery problem. In Section 7, we study two applications of the sparsity framework to multi-machine interactive production processes (Butkovič, 2010): i) application to resource optimization and ii) application to structure-seeking system identification. There, we also show how our sparsity framework is related to the pruning problem (McEneaney, 2009; Gaubert et al, 2011). In Section 8, we present a numerical example of the system identification problem and we study the performance of the greedy algorithm via simulations. Finally, in Section 9, we conclude the paper and discuss possible extensions of the present work. All proofs which do not appear in the main text are included in the Appendix.

The relation between set covers and solutions to the max-plus equation has been known before (Vorobyev, 1967; Zimmermann, 1976; Butkovič, 2003; Akian et al, 2005). We use those previous results to prove the equivalence between the sparsest exact solution problem and the minimum set cover problem in Theorem 1. Still, our paper is the first to explicitly define and study the problem of finding the sparsest exact solution.

The most related problem to sparsity is the pruning one (McEneaney, 2009; Gaubert et al, 2011). It arises in optimal control problems, where we try to approximate value functions as the supremum of certain basis functions. The goal there is to replace the supremum over many basis functions with the supremum over a smaller subset of basis functions, which has fixed cardinality; this subset is selected via minimizing an ℓ1 approximation error cost. The problem of finding the sparsest approximate solution (problem (7)) defined in this paper is a "dual" version of the pruning one; see also Section 7. The minimization is over the cardinality of the subset such that the ℓ1-error remains bounded. Another difference is that the pruning problem deals with basis functions defined on infinite spaces; the sparsity problem deals with basis vectors instead of basis functions. In Gaubert et al (2011), equation (34), the pruning problem is reduced to a k-median problem, which can be shown to have supermodular properties (Nemhauser et al, 1978). This argument could lead to an alternative proof of Theorem 3.
Finally, our sparsity framework also applies when the basis vectors are allowed to have −∞ (null) components (Theorem 5).

The recovery problem, without any sparsity considerations, is related to the uniqueness of the max-plus equation (see Chapter 15 of Cuninghame-Green (1979) or Butkovič (2010) or Corollary 4.8 in Akian et al (2005)). However, the sparse recovery problem is quite different; we might be able to solve the sparse recovery problem even if we have infinitely many solutions to the max-plus equation; see Section 6 for more details. The results of Schullerus et al (2006) are related to the result of Theorem 6. However, in Schullerus et al (2006) there are no sparsity considerations, e.g. the sparsity pattern of the involved matrices is considered known. To the best of our knowledge, our paper is the first to define and address the sparse recovery problem.

Throughout this paper, matrices and vectors will be denoted by bold characters. If A is an m × n matrix, then its columns (m × 1 vectors) are denoted by A_j, j = 1, . . . , n. Its components are denoted by A_ij or [A]_ij, for i = 1, . . . , m, j = 1, . . . , n. The transpose matrix is denoted by A⊺. If x is an n × 1 vector, then its components are denoted by x_j or [x]_j, for j = 1, . . . , n. Finally, for simplicity we denote the row and column index sets by I = {1, 2, . . . , m} and J = {1, 2, . . . , n}, respectively.

The max-plus algebra (or the (∨, +) semiring) is the set R_max = R ∪ {−∞} equipped with the maximum operator ∨ as "addition" and + as "multiplication" (Gaubert and Plus, 1997). If x, y ∈ R_max, then x ∨ y = max{x, y}. The zero element for the maximum operator ∨ is −∞. The operator + is defined in the usual way with 0 as the identity element and −∞ as the null element. Similarly, we can define the min-plus algebra on R_min = R ∪ {+∞}, equipped with the minimum operator ∧ and addition +.

If x, y ∈ R^n_max are vectors, we overload ∨ with the componentwise maximum [x ∨ y]_i = x_i ∨ y_i, i = 1, . . . , n. The operators <, ≤ are interpreted with the vector partial order, induced by componentwise comparison. We also define the addition x + a of a scalar a ∈ R_max to a vector x ∈ R^n_max componentwise as follows: [x + a]_i = x_i + a, i = 1, . . . , n. This can be interpreted as the scalar "multiplication" counterpart of linear algebra.

If A ∈ R^{m×n}_max, B ∈ R^{m×n}_max are matrices, then we define their componentwise maximum [A ∨ B]_ij = A_ij ∨ B_ij, i = 1, . . . , m, j = 1, . . . , n. If A ∈ R^{m×n}_max and B ∈ R^{n×p}_max, then their max-plus matrix "multiplication" A ⊞ B ∈ R^{m×p}_max is defined as:

[A ⊞ B]_ij = ⋁_{k=1}^{n} (A_ik + B_kj), i = 1, . . . , m, j = 1, . . . , p.

If A ∈ R^{m×n}_min, B ∈ R^{n×p}_min, their min-plus matrix multiplication A ⊞′ B ∈ R^{m×p}_min is defined similarly:

[A ⊞′ B]_ij = ⋀_{k=1}^{n} (A_ik + B_kj), i = 1, . . . , m, j = 1, . . . , p.

The max-plus equation has a form similar to the linear equation Ax = b, though we replace addition with maximum and multiplication with addition. In particular, given A ∈ R^{m×n}_max, x ∈ R^n_max, b ∈ R^m, it is given by the following formula:

⋁_{j=1}^{n} (A_ij + x_j) = b_i, i = 1, . . . , m,

An alternative notation that has been used in the literature is ⊕ for maximum (max-plus "addition") and ⊗ for addition (max-plus "multiplication"); see Cuninghame-Green (1979) or Baccelli et al (1992). Here, we follow the notation of lattice theory (Birkhoff (1967), Maragos (2013), Maragos (2017)), where the symbol ∨ / ∧ is used for max/min operations. We also use the classic symbol "+" for real addition, without obscuring the addition with the less intuitive symbol ⊗. Further, we avoid the symbol ⊕ because it is used in signal and image processing to denote max-plus convolution and in set theory to denote Minkowski set addition. In the general case, b ∈ R^m_max (Butkovič, 2010). However, in this paper we will only consider finite b ∈ R^m. See also Assumption 1 in Section 3.
or, in compact form,

A ⊞ x = b. (1)

Next, we define the set of all solutions of (1), S(A, b):

S(A, b) = { x ∈ R^n_max : A ⊞ x = b }. (2)

We can also write (1) as A ⊞ x = ⋁_{j=1}^{n} (A_j + x_j) = b. Hence, in this form, A ⊞ x can be interpreted as a "max-plus linear combination" of the columns of A with weights x_j.

To analyze the max-plus equation, we need the definition of the principal solution x̄ ∈ R^n_min (Cuninghame-Green, 1979): x̄ = (−A)⊺ ⊞′ b, whose components can be expressed as:

x̄_j = ⋀_{i=1}^{m} (b_i − A_ij), for all j ∈ J. (3)

Although the principal solution belongs to R^n_min, in this paper we will only deal with cases where x̄ ∈ R^n. When the max-plus equation (1) admits a solution, it turns out that the principal solution x̄ is also an actual solution (see Theorem 7 in the Appendix). In other words, the set S(A, b) is non-empty if and only if x̄ is a solution to equation (1) (Cuninghame-Green, 1979).

Although the principal solution x̄ is always defined, it may not be a solution of (1). In this case, system (1) cannot be solved. However, we may find an approximate solution, by minimizing the ℓ1 norm of the residual error b − A ⊞ x. Still, without any additional constraint this problem is hard to solve. For this reason, the convex constraint

A ⊞ x ≤ b, (4)

also called the lateness constraint (Cuninghame-Green, 1979), is added to the minimization problem. This relaxation, adopted in Cuninghame-Green (1979), is also motivated by time constraints in operations research (see also Section 7.1.1). The approximate solution problem can be described with the following optimization problem:

minimize_{x ∈ R^n_max}  ‖b − A ⊞ x‖1
subject to  A ⊞ x ≤ b, (5)

which can be recast as a linear program. It turns out that the principal solution x̄ is the largest possible element that satisfies the constraint A ⊞ x ≤ b. Therefore, it is also an optimal solution to problem (5) (see Theorem 8 in the Appendix).

The principal solution can also be expressed in terms of residuation theory; see, for example, Baccelli et al (1992). The map Π(x) = A ⊞ x is residuated, with Π♯(b) = (−A)⊺ ⊞′ b being the residual map, where ⊞′ denotes the min-plus matrix product. Both maps are increasing and they satisfy the properties (Π ∘ Π♯)(b) ≤ b and (Π♯ ∘ Π)(x) ≥ x. Then, the principal solution x̄ can be written as x̄ = Π♯(b). The notion of residuated and residual maps is also related to the notion of adjunctions in lattice theory, e.g. see Maragos (2013), Maragos (2017), as well as the notion of Galois connections, e.g. see Akian et al (2005). A small computational sketch of the principal solution follows.
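As an illustration of equations (1), (3) and (5), the following is a minimal Python sketch (the paper's own experiments use Matlab); it computes the principal solution and checks whether it solves the equation exactly. The helper names and the small instance are ours, not part of the paper, and −∞ is encoded as -np.inf.

```python
import numpy as np

def max_plus_prod(A, x):
    """Max-plus matrix-vector product: (A ⊞ x)_i = max_j (A_ij + x_j)."""
    # -inf entries encode the max-plus zero element
    return np.max(A + x[np.newaxis, :], axis=1)

def principal_solution(A, b):
    """Principal solution (3): x̄_j = min_i (b_i - A_ij)."""
    return np.min(b[:, np.newaxis] - A, axis=0)

# Small illustrative instance (values chosen arbitrarily for the sketch).
A = np.array([[2.0, 0.0, -np.inf],
              [0.0, 1.0, 3.0]])
b = np.array([3.0, 4.0])

x_bar = principal_solution(A, b)
# Equation (1) is solvable iff A ⊞ x̄ = b (see Theorem 7 in the Appendix).
solvable = np.allclose(max_plus_prod(A, x_bar), b)
print(x_bar, solvable)
```

By Theorem 8, the same x_bar is also an optimal point of the relaxed problem (5), since it is the largest vector satisfying the lateness constraint.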
Problem Statement

In this section, we define the problems of finding the sparsest exact and approximate solutions to the max-plus equation A ⊞ x = b. In linear algebra, the sparsity pattern of a vector or a matrix is determined by the set of its nonzero components. In a similar fashion, in max-plus algebra, the sparsity pattern of any matrix or vector is determined by the set of its finite elements, since the zero element is −∞. We define the support of an element x ∈ R^n_max as

supp(x) = { j ∈ J : x_j ≠ −∞ },

i.e. the set of the indices of its finite components.

The first problem studied in this paper is finding the sparsest solution to equation (1). Formally, given the matrices A ∈ R^{m×n}_max, b ∈ R^m, we want to determine the optimal (possibly non-unique) solution x* ∈ R^n_max to the following optimization problem:

x* = arg min_{x ∈ R^n_max}  |supp(x)|
     subject to  A ⊞ x = b, (6)

where |T| denotes the cardinality of a set T.

However, a solution to the equation A ⊞ x = b may not exist. Meanwhile, solving problem (5) might not work either, since it does not guarantee a sparse approximate solution. Instead of optimizing with respect to the residual error, one option would be to search for sparse approximate solutions to (1), within some allowed error. We define an ǫ-approximate solution to (1) as a vector x ∈ R^n_max that: i) has residual error bounded by a positive constant ǫ > 0, that is, ‖b − A ⊞ x‖1 ≤ ǫ, and ii) satisfies the lateness constraint A ⊞ x ≤ b.

In the second problem, given a prescribed constant ǫ > 0, we seek the sparsest (possibly non-unique) ǫ-approximate solution. Equivalently, we solve the optimization problem:

x* = arg min_{x ∈ R^n_max}  |supp(x)|
     subject to  ‖b − A ⊞ x‖1 ≤ ǫ
                 A ⊞ x ≤ b. (7)

We may recover the exact sparsest solution problem if we select ǫ = 0. Notice that we need to select ǫ ≥ ‖b − A ⊞ x̄‖1 in order to guarantee feasibility of problem (7) (this follows from Theorem 8).

To guarantee that the problem we are solving is not trivial, we make the following assumption about A and b, which holds throughout the paper. It has been a standard assumption in the literature (see Chapter 15 in Cuninghame-Green (1979)).

Assumption 1.
All elements of b in equation (1) are finite: b ∈ R^m. Every row and column of the matrix A ∈ R^{m×n}_max in (1) has at least one finite element:

i-th row: ⋁_{k=1}^{n} A_ik ≠ −∞, i = 1, . . . , m;    j-th column: ⋁_{l=1}^{m} A_lj ≠ −∞, j = 1, . . . , n.

Such matrices are also called doubly R-astic in Butkovič (2010) and doubly G-astic in Cuninghame-Green (1979).
If this assumption is not satisfied, it leads to trivial situations. For example, if the k-th column A_k consists only of −∞ elements, then x_k does not influence the solution at all, since A_ik + x_k = −∞, i = 1, . . . , m, for every x ∈ R^n_max. So, we may remove the k-th column and the variable x_k without any effect.

Remark 1.
The lateness constraint A ⊞ x ≤ b is desirable in many discrete-event systems applications (see also Section 7.1.1), where we want some tasks to be completed at times A ⊞ x, before some deadlines b. In general, it makes problem (7) more tractable. However, in other situations where it is not needed, it might lead to less sparse solutions or a higher residual error. It is a subject of future work to explore how we could remove it from problem (7).

Remark 2.
The sparsest solution problem makes sense even if m > n and we have an overdetermined system. When system (1) is solvable, we might have infinitely many solutions. Among those solutions, some might be sparse.
In the following sections we study problems (6), (7). Then, we explore the sparse recoveryproblem as well as applications.
In this section, we present our results about the solution to the first problem (6). We show that the sparsest solution problem is equivalent to a minimum set cover problem and, thus, NP-complete. Recall that J = {1, . . . , n} is used for column indices, while I = {1, . . . , m} is used for row indices. Although the principal solution x̄ defined in (3) is a solution when S(A, b) is non-empty, it is not sparse, as the next result shows.

Lemma 1.
Under Assumption 1, the principal solution x̄ of (1), defined in (3), is finite, or equivalently x̄ ∈ R^n. ⋄

Proof.
Since b_i is finite (Assumption 1), for every i ∈ I, j ∈ J, we have b_i − A_ij > −∞. Thus, x̄_j > −∞, for all j ∈ J. Moreover, from Assumption 1, for every j ∈ J, there exists at least one k ∈ I, such that A_kj is finite, which implies that b_k − A_kj is finite. Thus,

x̄_j = ⋀_{i=1}^{m} (b_i − A_ij) ≤ b_k − A_kj < +∞,

and x̄_j is finite for all j ∈ J.

The above result implies that we should find another way to compute sparse solutions. In particular, we can leverage results from Vorobyev (1967) and Zimmermann (1976) (see Butkovič (2003) for an English source, or Theorem 7 in the Appendix), which show that any solution of equation (1) has to agree with the principal one at some components. To each element x ∈ S(A, b), we assign the set of indices J_x, which indicates the components where x agrees with x̄:

J_x = { j ∈ J : x_j = x̄_j }. (8)

We will call this set the agreement set of x. By Lemma 1, since x̄ is finite, we have J_x ⊆ supp(x) for every solution x ∈ S(A, b). The main idea is that if x ∈ S(A, b) is a solution, we can construct a new sparser solution x̂ ∈ S(A, b) such that supp(x̂) = J_x. Thus, solving the sparsest solution problem is equivalent to finding an agreement set of the smallest possible cardinality |J_x|.

However, we cannot have an arbitrarily small agreement set J_x. There are some necessary conditions that should be satisfied. For each j ∈ J, we define the set of row indices I_j ⊆ I, where the minimum in (3) is attained:

I_j = { i ∈ I : b_i − A_ij = ⋀_{k=1}^{m} (b_k − A_kj) = x̄_j }. (9)

Those necessary conditions require the sub-collection I_j, j ∈ J_x, to be a set cover of I (see Theorem 7 in the Appendix).

The next theorem proves that the solution to problem (6) can be reduced to finding the minimum set cover of I by the subsets I_j, j ∈ J; the minimum is with respect to the number of subsets required for the cover. Conversely, any minimum set cover problem can be reduced to solving an instance of problem (6), for suitably defined matrices A, b. Thus, problem (6) is NP-complete.

Theorem 1. i) The problem (6) of computing the sparsest max-plus solution is equivalent to finding the minimum set cover of I by the subset-collection { I_j : j ∈ J } defined in (9). In particular, let x̄ be the principal solution defined in (3). Given a minimum set cover { I_j : j ∈ K* }, K* ⊆ J, the element x̂ ∈ R^n_max defined as:

x̂_j = x̄_j, j ∈ K*;    x̂_j = −∞, j ∈ J \ K*, (10)

is an optimal solution to problem (6).

ii) Any minimum set cover problem can be reduced to solving problem (6), for suitably defined matrices A, b. Thus, problem (6) is NP-complete. ⋄

Remark 3 (Suboptimal solution to problem (6)). According to Theorem 1, we can solve problem (6) by finding the minimum set cover { I_j : j ∈ K* } of I, and by using (10) to construct an optimal solution x*. Although the minimum set cover is an NP-complete problem, it can be approximated by a greedy algorithm of polynomial complexity with a logarithmic approximation ratio (Chvatal, 1979). Alternatively, we could solve problem (6) by solving problem (7) for ǫ = 0, using the techniques of Section 5. A small computational sketch of this set-cover viewpoint follows; the example after it illustrates the results of this section.
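The following Python sketch illustrates the reduction of Theorem 1 and the greedy heuristic of Remark 3: it builds the sets I_j from (9) and greedily covers I. It is our own illustration, not code from the paper; helper names are assumptions, and ties in the greedy choice are broken arbitrarily.

```python
import numpy as np

def agreement_row_sets(A, b):
    """I_j from (9): rows where the minimum defining x̄_j is attained."""
    x_bar = np.min(b[:, np.newaxis] - A, axis=0)          # principal solution (3)
    diffs = b[:, np.newaxis] - A                           # matrix of b_i - A_ij
    return x_bar, [set(np.where(diffs[:, j] == x_bar[j])[0]) for j in range(A.shape[1])]

def greedy_sparse_solution(A, b):
    """Greedy set cover over {I_j}; returns a (sub)optimal sparse solution of (6)."""
    m, n = A.shape
    x_bar, I_sets = agreement_row_sets(A, b)
    uncovered, chosen = set(range(m)), []
    while uncovered:
        # pick the column whose set I_j covers the most remaining rows
        j = max(range(n), key=lambda c: len(I_sets[c] & uncovered))
        if not I_sets[j] & uncovered:
            raise ValueError("equation (1) has no exact solution")
        chosen.append(j)
        uncovered -= I_sets[j]
    x_hat = np.full(n, -np.inf)
    x_hat[chosen] = x_bar[chosen]                          # construction (10)
    return x_hat
```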
Example 1.
Suppose we are given a 3 × 3 system A ⊞ x = b with b ∈ R^3. From (3), the principal solution x̄ is computed componentwise as x̄_j = ⋀_{i=1}^{3} (b_i − A_ij). The corresponding row-index sets (9) are I_1 = {1, 2}, I_2 = {3}, I_3 = {3}. The minimum set cover of I = {1, 2, 3} is either I_1 ∪ I_2 or I_1 ∪ I_3. Hence, we have two possible sparsest solutions: x* = [x̄_1, x̄_2, −∞]⊺ and x* = [x̄_1, −∞, x̄_3]⊺. ⋄

In this section, we present the approximate solution to problem (7), which uses tools from the supermodular optimization literature; a brief introduction to supermodularity is included in Subsection 5.1. In Subsection 5.2, we reformulate problem (7) to a simpler one, where we only optimize over the support of the optimal solution. Then, in Subsection 5.3, we prove that this new optimization problem has supermodular properties if A has only finite elements (Theorem 3). This allows us to approximately solve problem (7) via a greedy algorithm of polynomial complexity with guaranteed bounds of approximation (Theorem 4). In some sense, this greedy solution is similar to the "matching pursuit" algorithm in Mallat and Zhang (1993), applied to linear systems. Finally, in Subsection 5.4, we extend the results to the case where matrix A can also have infinite elements (Theorem 5).

Supermodularity (Krause and Golovin, 2012) is a property of set functions, which enables us to approximately solve some optimization problems of combinatorial complexity. In particular, greedy algorithms of polynomial complexity can be employed, with theoretical guarantees (bounds) regarding the ratio of approximation (Wolsey, 1982), (Nemhauser et al, 1978). A set function f : 2^J → R is a function that takes a subset T ⊆ J and returns a real value f(T). Two useful properties of set functions are supermodularity and monotonicity. A set function f : 2^J → R is supermodular if for every C ⊆ B ⊆ J and k ∈ J:

f(C ∪ {k}) − f(C) ≤ f(B ∪ {k}) − f(B). (11)

Respectively, a set function f : 2^J → R is decreasing if for every C ⊆ B ⊆ J, f(C) ≥ f(B).

Finally, we present a result from Wolsey (1982), which shows how we can approximately solve cardinality minimization problems subject to a supermodular equality constraint. (The result in Wolsey (1982) is for submodular and increasing functions. But f is supermodular (decreasing) if and only if −f is submodular (increasing). Hence, the result is also valid for supermodular and decreasing functions.) Let the optimization problem be:

minimize_{T ⊆ J}  |T|
subject to  f(T) = f(J), (12)

where f : 2^J → R is supermodular and decreasing, while |T| denotes the cardinality of set T. Suppose we use the following greedy algorithm.

Algorithm 1 Greedy Approximate Solution of (12)
Set T_0 = ∅, k = 0
while f(T_k) ≠ f(J) do
  k = k + 1
  j_k = arg min_{s ∈ J \ T_{k−1}} f(T_{k−1} ∪ {s})
  T_k = T_{k−1} ∪ {j_k}
end while
return T_k

The following theorem provides a bound on the approximation ratio of Algorithm 1.

Theorem 2 (Wolsey (1982)). Suppose f : 2^J → R is supermodular and decreasing. Algorithm 1 returns a suboptimal solution T_k ⊆ J to problem (12) with |T_k| = k. If T* is the optimal solution, then the following bound holds:

|T_k| / |T*| ≤ 1 + log( (f(∅) − f(J)) / (f(T_{k−1}) − f(J)) ). (13) ⋄

In the next sections, we reformulate problem (7) in order to reveal its supermodular structure and leverage the results of Theorem 2. A generic computational sketch of Algorithm 1 is given below.
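As a minimal sketch (our own illustration, not code from the paper), Algorithm 1 can be written generically in Python for any set function f supplied as a callable; feasibility (f(T) eventually reaching f(J)) is assumed, and the toy function below is only an example of a decreasing supermodular cost.

```python
def greedy_supermodular_cover(f, J):
    """Algorithm 1: grow T greedily until f(T) = f(J).

    f : callable on frozensets (assumed supermodular and decreasing)
    J : iterable with the ground-set elements
    """
    J = frozenset(J)
    target = f(J)
    T = frozenset()
    while f(T) != target:
        # pick the element whose addition decreases f the most
        j = min(J - T, key=lambda s: f(T | {s}))
        T = T | {j}
    return T

# Toy usage: f(T) = number of rows left uncovered by the sets S_j
# (a decreasing, supermodular function), so (12) becomes a set-cover problem.
S = {1: {0, 1}, 2: {2}, 3: {1, 2}}
f = lambda T: 3 - len(set().union(*[S[j] for j in T]))
print(greedy_supermodular_cover(f, S.keys()))
```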
Given any feasible point x of problem (7), we can construct a new one by forcing every component in the support to be equal to the respective component of the principal solution. In this way, we reduce problem (7) to just finding the support of x, skipping the decision over the finite values of x. Formally, suppose x ∈ R^n_max, with support supp(x) = T, satisfies the inequality A ⊞ x ≤ b. Now, define a new element z ∈ R^n_max with the same support as x, supp(z) = T. Then, replace all its finite components with the ones of the principal solution: z_j = x̄_j, j ∈ supp(x). In terms of the agreement set defined in (8), we have J_z = supp(z) = T. The next lemma shows that the new vector z not only is feasible, but also has a smaller residual error than x.

Lemma 2.
Fix a subset T ⊆ J. Let

X_T = { x ∈ R^n_max : supp(x) = T, A ⊞ x ≤ b }

be the set of elements which satisfy the lateness constraint and have support equal to T. Assume that z ∈ R^n_max has support and agreement set equal to T:

J_z = supp(z) = T.

Then, z ∈ X_T and ‖b − A ⊞ x‖1 ≥ ‖b − A ⊞ z‖1, for all x ∈ X_T. ⋄

(The feasible points of an optimization problem are the elements that satisfy the constraints.)

Since, given a support supp(x) = T ⊆ J, we can select x_j = x̄_j, j ∈ T, and x_j = −∞, j ∈ J \ T, the only decision variable that matters in problem (7) is T ⊆ J. To introduce more compact notation, we can rewrite A ⊞ x = ⋁_{j ∈ J} (A_j + x_j) as a max-plus linear combination. But if supp(x) = T ⊆ J, then this max-plus linear combination becomes:

A ⊞ x = ⋁_{j ∈ T} (A_j + x_j), if supp(x) = T.

Choosing x_j = −∞ is equivalent to ignoring column A_j in the max-plus linear combination. The next definition uses this notation.

Definition 1.
We define the error vector e : 2^J → R^m_min as:

e(T) = b − ⋁_{j ∈ T} (A_j + x̄_j), for T ≠ ∅;    e(∅) = ⋁_{j ∈ J} e({j}). (14)

The ℓ1-error function E : 2^J → R_min is defined as the ℓ1-norm of the error vector:

E(T) = ‖e(T)‖1, (15)

where ‖e(T)‖1 = ∞ if some component of e(T) equals ∞. ⋄

We note that for the empty set we consider the singletons' error vectors and take the componentwise maximum in the above definition. This selection guarantees that the ℓ1-error function E is supermodular and decreasing. The next corollary exploits the result of Lemma 2 and proves that we can rewrite problem (7) as:

min |T|
subject to E(T) ≤ ǫ. (16)

A short computational sketch of the error function E follows.
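Below is a minimal Python sketch of Definition 1 (our own helpers, not the paper's code), with −∞ encoded as -np.inf; it evaluates the error vector e(T) and the ℓ1-error E(T) for a given support T.

```python
import numpy as np

def error_vector(A, b, T):
    """e(T) from (14); T is an iterable of column indices, possibly empty."""
    x_bar = np.min(b[:, np.newaxis] - A, axis=0)           # principal solution (3)
    T = list(T)
    if T:
        return b - np.max(A[:, T] + x_bar[T], axis=1)
    # e(∅): componentwise maximum of the singleton error vectors
    singles = [b - (A[:, [j]] + x_bar[j]).ravel() for j in range(A.shape[1])]
    return np.max(np.stack(singles), axis=0)

def l1_error(A, b, T):
    """E(T) = ||e(T)||_1 from (15); returns inf if some component is infinite."""
    return float(np.sum(error_vector(A, b, T)))
```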
Corollary 1. Problem (7) is equivalent to problem (16). In particular, if T̂ is an optimal solution to problem (16), then the element x̂ ∈ R^n_max defined as:

x̂_j = x̄_j, j ∈ T̂;    x̂_j = −∞, j ∈ J \ T̂, (17)

is an optimal solution to problem (7). ⋄

Now, we can show that if A has only finite elements, the ℓ1 error set function E(T), defined in (15), is supermodular. An alternative proof can be found if we follow the steps of Gaubert et al (2011), Section VI. (Function E(T) can be expressed as the cost function of a k-median problem; see Gaubert et al (2011). This function is known to be supermodular (Nemhauser et al, 1978).)

Theorem 3. Suppose A ∈ R^{m×n}. The ℓ1 error set function E(T) defined in (15) is decreasing and supermodular. ⋄

The above result, along with Corollary 1, enables us to approximately solve problem (7), using Algorithm 2 below. First, we compute the approximate solution to problem (16) in a greedy way. Define T_k ⊂ J to be the collection of k elements, selected greedily in a sequential way. Starting from the empty set T_0 = ∅, at each time k, we select the index j which achieves the smallest ℓ1-error E(T_{k−1} ∪ {j}). Then, we update T_k = T_{k−1} ∪ {j}, and this is repeated until the error E(T_k) becomes at most ǫ. After the selection of T_k, we construct an approximate solution according to equation (17). The complexity of the algorithm is O(n^2) evaluations of E, since the minimization step requires an inner loop of at most n iterations, while the outer loop requires at most n iterations.

Algorithm 2
Approximate Solution of Problem (7)
Input: A, b
1: Compute x̄ from (3)
2: if E(J) > ǫ then
3:   return Infeasible
4: end if
5: Initialize x̂_j = −∞, for all j ∈ J
6: Set T_0 = ∅, k = 0
7: while E(T_k) > ǫ do
8:   k = k + 1
9:   j_k = arg min_{s ∈ J \ T_{k−1}} E(T_{k−1} ∪ {s})
10:   T_k = T_{k−1} ∪ {j_k}
11: end while
12: Update x̂_j = x̄_j, j ∈ T_k
13: return x̂, T_k

Since E(T) is a supermodular function, it follows that Ē(T) = max(E(T), ǫ) is also supermodular (Krause and Golovin, 2012). Thus, the stopping condition E(T_k) ≤ ǫ is equivalent to Ē(T_k) = ǫ. Now, by applying the results of Wolsey (1982) (Theorem 2), we can obtain an upper bound on the approximation ratio of Algorithm 2. A computational sketch of Algorithm 2 follows.
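The following is a minimal, self-contained Python sketch of Algorithm 2 (the paper's reference implementation is in Matlab); the function name is ours, −∞ is encoded as -np.inf, and no attempt is made to speed up the O(n^2) inner/outer loops. It re-implements the error function of the previous sketch so that it runs on its own.

```python
import numpy as np

def greedy_sparse_approx(A, b, eps):
    """Algorithm 2 (sketch): greedy sparsest eps-approximate solution of A ⊞ x = b."""
    m, n = A.shape
    x_bar = np.min(b[:, np.newaxis] - A, axis=0)           # principal solution (3)
    cols = A + x_bar[np.newaxis, :]                         # column j holds A_j + x̄_j

    def E(T):
        """l1-error (15); for T = ∅ use the componentwise max of singleton errors (14)."""
        if T:
            e = b - np.max(cols[:, sorted(T)], axis=1)
        else:
            e = np.max(b[:, np.newaxis] - cols, axis=1)
        return float(np.sum(e))

    if E(set(range(n))) > eps:
        return None                                         # problem (7) infeasible for this eps
    T = set()
    while E(T) > eps:
        # line 9 of Algorithm 2; ties are broken arbitrarily
        j = min(set(range(n)) - T, key=lambda s: E(T | {s}))
        T.add(j)
    x_hat = np.full(n, -np.inf)
    x_hat[sorted(T)] = x_bar[sorted(T)]                     # construction (17)
    return x_hat, T
```

By construction the returned x_hat satisfies A ⊞ x_hat ≤ b and ‖b − A ⊞ x_hat‖1 ≤ eps whenever the feasibility check passes.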
Theorem 4. Assume that A ∈ R^{m×n} has only finite elements. Suppose ǫ ≥ 0 is such that E(J) ≤ ǫ and E(∅) > ǫ, where E is defined in (15). Let k be the time Algorithm 2 terminates, with x̂, T_k the respective outputs. Then, x̂ is a suboptimal solution to problem (7) with T_k = supp(x̂). Moreover, if T* = supp(x*), where x* is an optimal solution of problem (7), the following inequality holds:

|T_k| / |T*| ≤ 1 + log( m∆ / (E(T_{k−1}) − ǫ) ), (18)

where ∆ = ⋁_{i ∈ I, j ∈ J} (b_i − A_ij − x̄_j) and the x̄_j are the components of the principal solution defined in (3). ⋄

Parameter ∆ is the largest element of the normalized matrix [b_i − A_ij − x̄_j], i ∈ I, j ∈ J. Since A has only finite elements, ∆ is also finite. The presence of the logarithm mitigates the effect of a large ∆ or a small difference E(T_{k−1}) − ǫ. In general, the term E(T_{k−1}) depends on A, b, but by allowing more memory, it can be precomputed for all possible k with complexity O(n^2) (O(n) per k). Nonetheless, there are special cases where data-independent bounds for the difference E(T_{k−1}) − ǫ are possible. For example, if both A and b are integer valued, which is common in timing applications, then the error function is also integer valued and E(T_{k−1}) ≥ ⌊ǫ + 1⌋. Then, the bound of Theorem 4 becomes |T_k| / |T*| ≤ 1 + log( m∆ / (⌊ǫ + 1⌋ − ǫ) ) and no longer depends on the specific A, b. Quantized elements can also be dealt with in a similar fashion.

If A has infinite elements, then we cannot directly apply the results of Theorem 2. However, we can replace the infinite elements of the error vector e(T), T ⊆ J, with a sufficiently large positive constant M > 0. This is motivated by the "big-M" method in linear optimization (Bertsimas and Tsitsiklis, 1997). It is sufficient to replace the matrix A ∈ R^{m×n}_max with a new one, denoted by Â(M) ∈ R^{m×n}_max, where:

Â_ij(M) = A_ij, if A_ij ≠ −∞;    Â_ij(M) = −M + b_i − x̄_j, if A_ij = −∞,    for all i ∈ I, j ∈ J. (19)

This new matrix Â(M) has only finite elements. Thus, we can now apply Algorithm 2 to the matrices Â(M), b instead of A, b and leverage Theorem 4 to bound the approximation ratio. However, we first have to require that the optimal solution remains the same with this change. This is indeed the case if M is large enough. In particular, if M > ǫ, it turns out that the optimal solution remains the same, as the following lemma shows. A small sketch of the construction (19) is given below.
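As an illustration (our own sketch, not the paper's code), the big-M substitution (19) can be implemented as follows; M is assumed to satisfy M > ǫ as required by Lemma 3.

```python
import numpy as np

def big_m_matrix(A, b, M):
    """Construct Â(M) from (19): replace each A_ij = -inf by -M + b_i - x̄_j."""
    x_bar = np.min(b[:, np.newaxis] - A, axis=0)     # principal solution (3)
    A_hat = A.copy()
    rows, cols = np.where(np.isneginf(A))            # positions of the -inf entries
    A_hat[rows, cols] = -M + b[rows] - x_bar[cols]
    return A_hat
```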
Lemma 3.
Suppose
M > ǫ ≥ 0. Then, for Â(M) defined in (19), the following problem:

min_{x ∈ R^n_max}  |supp(x)|
subject to  ‖b − Â(M) ⊞ x‖1 ≤ ǫ,    Â(M) ⊞ x ≤ b (20)

is equivalent to problem (7). ⋄

Now, we can just apply Theorem 4 to the finite matrices Â(M) and b.

Theorem 5.
Suppose
M > ǫ ≥ 0 are constants such that E(∅) > ǫ and E(J) ≤ ǫ, where E is defined in (15). Let k be the time Algorithm 2 terminates under input Â(M), b, where Â(M) is defined in (19). Let x̂, T_k be the respective outputs. Then, x̂ is a suboptimal solution to problem (7) with T_k = supp(x̂). Moreover, if T* = supp(x*), where x* is an optimal solution of problem (7), the following inequality holds:

|T_k| / |T*| ≤ 1 + log( m∆ / (min{E(T_{k−1}), M} − ǫ) ), (21)

where ∆ = ⋁_{i ∈ I, j ∈ J} (b_i − Â_ij(M) − x̄_j). ⋄

Proof. Let Ê(T) = ‖b − ⋁_{j ∈ T} (Â(M)_j + x̄_j)‖1 be the ℓ1-error function for Â(M), b. From Theorem 4 and Lemma 3, we obtain:

|T_k| / |T*| ≤ 1 + log( m∆ / (Ê(T_{k−1}) − ǫ) ).

But either Ê(T_{k−1}) = E(T_{k−1}), if there is no infinite component in e(T_{k−1}), or Ê(T_{k−1}) ≥ M, if there is some infinite component in e(T_{k−1}).

Remark 4. Consider the notation of the previous theorem. Notice that:

∆ = max{ ⋁_{i ∈ I, j ∈ J, A_ij ≠ −∞} (b_i − A_ij − x̄_j), M },

where M is used to replace the −∞ elements in (19). By increasing M we might make the numerator in (21) bigger. Thus, under the sufficient condition M > ǫ, it might be a good choice to select M close to ǫ. On the other hand, we should not choose M too close to ǫ, since we might make the denominator small. In the case of integer-valued elements, a reasonable selection could be M = ǫ + 1, since it guarantees |T_k| / |T*| ≤ 1 + log( m∆ / (⌊ǫ + 1⌋ − ǫ) ), as in the finite-element case.

In the recovery problem, the goal is to reconstruct an unknown vector z ∈ R^n_max from the measurements A ⊞ z ∈ R^m, by solving the equation:

A ⊞ x = A ⊞ z. (22)

If the equation A ⊞ x = A ⊞ z has a unique solution, then the principal solution can recover z. Uniqueness holds only if the whole collection { I_j : j ∈ J } is needed to cover I (see Chapter 15 of Cuninghame-Green (1979) or Butkovič (2010) or Corollary 4.8 in Akian et al (2005)), where the I_j are defined in (9). In other words, the principal solution will recover z only if z is dense. If the original z is sparse, then, in general, the equation A ⊞ x = A ⊞ z will not have a unique solution and the principal solution will misidentify the −∞ elements as finite.

Here, we explore conditions under which we could estimate a sparse z by computing x*, i.e. one of the sparsest solutions to problem (6). We call this the sparse recovery problem.

Problem 1 (Sparse Recovery). Consider an arbitrary vector z ∈ R^n_max such that the pair (A, b) = (A, A ⊞ z) satisfies Assumption 1. Let x* be the optimal solution of problem (6) for the pair (A, b) = (A, A ⊞ z). We say that the Sparse Recovery Problem is solved if x* recovers z, i.e. z = x*.

This problem is also related to the system identification problem (Schullerus et al, 2006), where, however, the sparsity pattern is considered known. Notice that in general there might be multiple sparsest solutions to the max-plus equation; see Example 1. However, the sparse recovery problem above can only be solved exactly when the sparsest solution x* = z is unique. (We note that uniqueness of the sparsest solution x* is different from uniqueness of the solution of the equation A ⊞ x = b. The former requires a unique minimum set cover, while the latter requires that the minimum set cover is the whole collection { I_j : j ∈ J }.) Even if x* is unique,
it will have more −∞ components than z in general. Nonetheless, under some sufficient conditions, the sparse recovery problem can be solved, as the next theorem proves.

Theorem 6. Consider an element z ∈ R^n_max such that the pair (A, A ⊞ z) satisfies Assumption 1. Let x* be the optimal solution of problem (6) for (A, b) = (A, A ⊞ z). Then, x* = z if the following sufficient condition holds: For every finite component j ∈ supp(z), there exists a row index i = i(j) ∈ I such that:

a) for all other indices in the support, k ∈ supp(z), k ≠ j, we have: A_ij > A_ik + z_k − z_j;

b) for all indices in the complement of the support, l ∈ J \ supp(z), there exists at least one row index s = s(j, l) ∈ I, such that: A_sl > A_il + [A ⊞ z]_s − [A ⊞ z]_i.

Intuitively, the first part of the condition of the preceding theorem states that for any component j ∈ J with z_j ≠ −∞, there must be at least one row index i for which A_ij is large enough, in order to observe the influence of z_j in A ⊞ z. Given the previous pair (i, j), the second part of the condition requires that for every l ∈ J \ supp(z), there exists some row s ∈ I such that the component A_sl is large enough to reveal that z_l is smaller than z_j; in fact, small enough to be identified as −∞.

Both conditions can be guaranteed if, for example, m ≥ n and A has large enough leading diagonal elements (or large diagonal elements up to permutations; see Section 8). In this case, if A_jj, j ∈ supp(z), is large enough across the j-th row, then part a) is satisfied with i = j. Similarly, if A_ll, l ∈ J \ supp(z), is large enough across the l-th column, then part b) is satisfied by choosing s(j, l) = l for all j ∈ supp(z). A small sketch that checks these conditions numerically is given below.
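The following Python sketch (our own illustration; the function names are not from the paper) checks the sufficient conditions a) and b) of Theorem 6 for a given pair (A, z). Such a check can be useful, for example, when designing inputs in the experiment-design scenario discussed in Section 7.

```python
import numpy as np

def max_plus_prod(A, x):
    """(A ⊞ x)_i = max_j (A_ij + x_j)."""
    return np.max(A + x[np.newaxis, :], axis=1)

def satisfies_theorem6(A, z):
    """Check the sufficient recovery conditions a) and b) of Theorem 6."""
    m, n = A.shape
    b = max_plus_prod(A, z)
    support = [j for j in range(n) if np.isfinite(z[j])]
    complement = [l for l in range(n) if not np.isfinite(z[l])]
    for j in support:
        found_i = False
        for i in range(m):
            # condition a): at row i, column j strictly dominates the other support columns
            cond_a = all(A[i, j] > A[i, k] + z[k] - z[j] for k in support if k != j)
            # condition b): every column outside the support is exposed by some row s
            cond_b = all(any(A[s, l] > A[i, l] + b[s] - b[i] for s in range(m))
                         for l in complement)
            if cond_a and cond_b:
                found_i = True
                break
        if not found_i:
            return False
    return True
```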
In this section, we give several applications of the present framework. First, we provide two possible applications in discrete-event systems: i) resource optimization; and ii) system identification with unknown sparsity pattern. Then, we show how the pruning problem can be formulated as a sparsity problem.

We motivate the application to discrete-event systems through multi-machine interactive production processes (Butkovič, 2010). Consider m different products, which are made using n machines. A machine j ∈ J contributes to the completion of a product i ∈ I by making a partial product. It processes all partial products in parallel as soon as it starts working. A system matrix G ∈ R^{m×n}_max determines how much time it takes for the partial products to be made. Each element G_ij represents the time needed for machine j to make the partial product for product i. Thus, either G_ij ≥ 0, or G_ij = −∞ if product i does not depend on machine j. An input u ∈ R^n_max describes the times the machines start working; u_j is the time at which machine j starts working. If u_j = −∞, then machine j is not used at all. The output

y = G ⊞ u (23)

describes the times the products are made; product i is completed at time y_i. We will use the above model to explore the following problems.

7.1.1 Resource Optimization

Suppose that the products have delivery deadlines d ∈ R^m, which should not be exceeded. This implies that the outputs y should satisfy the lateness constraint y ≤ d. Meanwhile, it costs storage resources to make the products before the delivery time. Thus, we wish to restrict the earliness ‖d − y‖1. Suppose now that we have an extra constraint; we also want to minimize the number of machines used, which consume energy resources. Recall that when u_j = −∞, machine j is not used. Thus, the number of used machines is equal to the cardinality of the support of the vector u. This problem could be formulated as an instance of problem (7) with A = G, x = u, b = d. Sparsity here implies resource efficiency, since we use fewer machines. Notice that in this case the lateness constraint is not a relaxation but a desired property.

7.1.2 Structure-Seeking System Identification

Assume we have an unknown system matrix G ∈ R^{m×n}_max. Our goal is to recover G from a sequence of K input-output pairs (u_l, y_l) ∈ R^n_max × R^m, l ∈ L = {1, . . . , K}. Those pairs are related via the max-plus model (23): y_l = G ⊞ u_l, l ∈ L (we assume the output is finite). If we stack the inputs and outputs together, we obtain a set of max-plus equations:

[y_1⊺; . . . ; y_K⊺] = [u_1⊺; . . . ; u_K⊺] ⊞ G⊺,    or    Y = U ⊞ G⊺. (24)

Notice that Y ∈ R^{K×m}, U ∈ R^{K×n}_max. System (24) consists of m separate max-plus equations written together in matrix form.

In this scenario, the infinite elements of G reflect the structure of the system. As mentioned before, G_ij = −∞ means that product i does not depend on machine j. Here, we are interested in obtaining a solution that not only solves the above equation but also reveals the system structure. (We assume that we do not have any a priori knowledge about the structure of system G; the only information is the input-output pairs.)

Without any sparsity constraints, the principal solution Ḡ will have only finite elements, hiding the actual sparsity pattern of the original system matrix G. Thus, we have to find another way to identify the −∞ elements. One way to approach this problem would be to solve the sparse recovery problem instead. If the sufficient conditions of Theorem 6 are satisfied, then exact reconstruction is possible. In fact, those conditions also suggest a way to do experiment design, i.e. to design the inputs U. It is sufficient to select U with large enough leading diagonal elements such that the sparsest solution recovers G. Without knowing G, we may not be able to compute how large the leading diagonal elements should be.
Nonetheless, we could overcome this problem by exploiting bounds on the finite elements of G.

7.2 Pruning

The pruning problem emerged as a curse-of-dimensionality-free method for approximating optimal control value functions; see McEneaney (2009), Gaubert et al (2011) for more details and motivation behind the method. Next, we show that a "dual" version of the pruning problem can be formulated in terms of a sparsity problem as in (7).

Suppose that φ = ⋁_{j=1}^{n} φ_j, where the φ_j ∈ R^m_max are basis vectors. The goal is to find a reduced representation φ̃ = ⋁_{j ∈ S ⊆ J} φ_j such that the approximation error ‖φ − φ̃‖1 is small. Let x ∈ R^n_max indicate which basis columns should be selected. If x_j = −∞, then we ignore column φ_j; otherwise we select it. To solve the pruning problem, we could formulate it as an ǫ-approximate sparsity problem:

min_{x ∈ R^n_max}  |supp(x)|
subject to  ‖φ − [φ_1 . . . φ_n] ⊞ x‖1 ≤ ǫ,    [φ_1 . . . φ_n] ⊞ x ≤ φ.

The formulation here is a "dual" version of the one that appears in Gaubert et al (2011). There, the minimization is with respect to ‖φ − [φ_1 . . . φ_n] ⊞ x‖1, while the cardinality of the support |supp(x)| = k is kept fixed to a value k.

In this subsection, we present a numerical example, where we apply our results to the system identification problem. We implemented the greedy Algorithm 2 in Matlab to obtain solutions to problems (6), (7). For the small examples below, we can verify by hand that the greedy solution will also be optimal.

Consider a multi-machine interactive production process, as defined in Section 7, with a 3 × 3 system matrix G ∈ R^{3×3}_max that has three −∞ elements; the remaining elements are finite and nonnegative. Now consider K = 4 input instances u_l, l = 1, . . . , 4, which are imposed to the system and collected in U⊺ = [u_1 . . . u_4] ∈ R^{3×4}_max. For each input instance u_l, the respective outputs y_l, l = 1, . . . , 4, are collected in

Y⊺ = [y_1 . . . y_4] = G ⊞ U⊺ = [13 12 3 4; 11 11 1 3; 12 11 16 8].

The goal is to identify G from the given inputs and the corresponding outputs. The principal solution Ḡ has only finite entries, which hides the sparsity pattern of the original matrix G. Notice that some −∞ elements of G correspond to negative elements of Ḡ (for example, an entry equal to −7). Those can be identified as −∞, since G can only have nonnegative or −∞ elements. However, not all of them are negative; for instance, one −∞ element of G corresponds to an entry equal to 1 in Ḡ. Thus, this method does not guarantee that all −∞ elements are revealed.

Suppose now that we compute the sparsest solution G*, by solving problem (6). In this case, we obtain G* = G. This result is expected, since U is designed to have large diagonal values (under a column permutation) and satisfies the assumptions of Theorem 6. Thus, without any prior knowledge, we managed to identify, for all products, which machines they depend on. If this condition is not satisfied, e.g. if one of the large diagonal entries of U (equal to 10) is changed to 1, then the sparsest solution falsely sets the corresponding finite entry of G* to −∞, but it correctly identifies the remaining elements.

For the next example, suppose that due to some unexpected delay the last output y_4 is perturbed. The equation Y = U ⊞ G⊺ is no longer satisfied. In this case, we solve problem (7) and find the sparsest approximate solution Ĝ. For ǫ = 0.3, we have Ĝ = G and we recover G. However, if the error gets bigger, the sparsest approximate solution falsely returns a −∞ entry in Ĝ for ǫ = 1. The results for the sparse recovery problem, presented in Section 6, are only applicable to the exact solution case. Nonetheless, from the last numerical example, it seems that if the delay is small, they might still be valid for the approximate solution case. It is a subject of future work to provide a formal analysis.

In this subsection, we explore the performance of the greedy Algorithm 2 with respect to problem (7). First, we construct an example where the greedy Algorithm 2 is suboptimal. Then, we compare Algorithm 2 with the brute force one, using random matrices A, b. For the brute force algorithm, we solve a combinatorial problem; we search over all possible supports supp(x).

Example 2 (Suboptimality of greedy algorithm). Consider a pair (A, b) with n = 3 columns
and let ǫ = 1, where A and b are chosen so that the optimal solution to problem (7) is x* = [−∞, x̄_2, x̄_3]⊺. The greedy algorithm will initially select T_1 = {1}, since the first column of A leads to the smallest error. However, in this example, both components 2 and 3 are jointly sufficient and necessary for the error to drop below ǫ. Hence, the greedy algorithm will return the set T_3 = {1, 2, 3} and the suboptimal solution x̂ = [x̄_1, x̄_2, x̄_3]⊺. ⋄

Next, we compare the greedy algorithm with the brute force one. Both were implemented in Matlab. For the comparison we compute the suboptimality ratio of the greedy algorithm as well as the execution times. Due to the exponential complexity of the brute force algorithm, this comparison can only be made for small values of n, where n is the number of columns of matrix A. We generated random m × n matrices A with integer elements taking values up to n − 1 and m × 1 vectors b with integer elements taking values up to n + 5, for several (m, n) pairs; see Table 1. Because the times and the suboptimality ratios depend on the randomly sampled matrices, we averaged them over 40 independent iterations for each (m, n) pair. To guarantee feasibility, in all of the cases we selected ǫ = ‖b − A ⊞ x̄‖1 + 1, where x̄ is the principal solution. We observe that the average suboptimality ratio of the greedy algorithm is very close to one and does not decrease noticeably. Meanwhile, as we expected, the execution time of the greedy algorithm scales much better than the brute force one. Thus, empirically the greedy algorithm performs very well on average for small values of m, n. As we stated above, it is not easy to empirically evaluate the performance for larger values of m, n, since the brute force algorithm requires a lot of time to terminate.

Table 1: Comparison between the greedy and the brute force algorithm for random matrices A, b. For every pair (m, n), the average is over 40 independent samples. The greedy algorithm performs very well on average for small (m, n). It has a suboptimality ratio close to one and is much faster than the brute force algorithm.

(m, n)                    (8,16)   (8,17)   (9,18)   (9,19)   (10,20)  (10,21)  (11,22)
suboptimality ratio       0.970    0.948    0.952    0.968    0.967    0.955    0.979
time greedy (sec)         0.0012   0.0013   0.0015   0.0017   0.0019   0.0020   0.0022
time brute force (sec)    0.09     1.33     2.72     5.56     11.30    22.37    46.73

We studied the problem of finding the sparsest solution of the max-plus equation and proved that it is NP-complete. Then we studied the problem of finding the sparsest approximate solution subject to a lateness constraint. The degree of approximation was measured via an ℓ1 norm function, which was proved to have supermodular properties. Thus, we developed a greedy algorithm of polynomial complexity, which approximates the optimal solution with a guaranteed ratio of approximation. We also derived sufficient conditions such that the sparse recovery problem can be solved. The present framework can be applied to discrete event systems applications such as resource optimization or system identification. In future work we will explore whether we can drop the lateness constraint when searching for the sparsest approximate solution. We will also study whether the sufficient conditions of the sparse recovery problem can be relaxed. Another direction is extending the concepts of sparsity to max-plus dynamical systems. Finally, we would like to extend the results to more general idempotent semirings by using residuation theory.

Appendix A: Previous Results

The result below was originally proved in Vorobyev (1967) and Zimmermann (1976). A reference
A referencein English can be found in Butkoviˇc (2003).
Theorem 7 (Covering theorem). An element x ∈ R^n_max is a solution to (1), i.e. x ∈ S(A, b), if and only if:

a) x ≤ x̄;    b) ⋃_{j ∈ J_x} I_j = I,

where x̄ is the principal solution defined in (3), the set J_x is defined in (8), and the sets I_j are defined in (9). ⋄

Theorem 8 (Cuninghame-Green (1979)). Let x̄ be the principal solution defined in (3). The following equivalence holds:

A ⊞ x ≤ b ⇔ x ≤ x̄. (25)

Moreover, x̄ is an optimal solution to problem (5). ⋄

Appendix B: Proofs
Proof of Theorem 1
First, we prove i). Suppose that x* is an optimal solution to (6). Since it is a solution of the equation A ⊞ x = b, by Theorem 7, the subcollection { I_j : j ∈ J_{x*} }, determined by the agreement set J_{x*} = { j ∈ J : x*_j = x̄_j }, is a set cover of I. We will show that the size |J_{x*}| of the set cover is minimum. By optimality of x*, we necessarily have x*_j = −∞ for j ∈ J \ J_{x*}, and the support of x* is the same as the agreement set; otherwise, we could create a sparser solution by forcing the elements outside of the agreement set to be −∞. So, |supp(x*)| = |J_{x*}|. Now, take any set cover { I_j : j ∈ K ⊆ J } of I and define the element x^(K) as:

x^(K)_j = x̄_j, j ∈ K;    x^(K)_j = −∞, j ∈ J \ K. (26)

Notice that |supp(x^(K))| = |K| and, by Theorem 7, x^(K) is also a solution to the max-plus equation A ⊞ x = b. By optimality, x* has the smallest support, or |supp(x*)| ≤ |supp(x^(K))|. But this implies that |J_{x*}| ≤ |K|, which shows that { I_j : j ∈ J_{x*} } is a minimum set cover of I.

Conversely, suppose the collection { I_j : j ∈ K* ⊆ J } is a minimum set cover of I. Then, we can define the solution x̂ as in (10). We will show that x̂ is an optimal solution to (6). Suppose x* is one optimal solution to (6). Then, the collection { I_j : j ∈ J_{x*} } is a set cover with J_{x*} = { j ∈ J : x*_j = x̄_j }. Since x* is the sparsest solution, we can only have |supp(x*)| = |J_{x*}|. Meanwhile, by optimality of the set cover, we have |supp(x̂)| = |K*| ≤ |J_{x*}| = |supp(x*)|. Hence, x̂ is also an optimal solution to (6).

Second, we prove ii). This part is adapted from Butkovič (2003). Suppose we are given an arbitrary collection of nonempty subsets S_j ⊆ {1, . . . , m} = I, j ∈ {1, . . . , n} = J, for some m, n ∈ N, such that ⋃_{j ∈ J} S_j = I. Define A_ij = 1(i ∈ S_j) for all i ∈ I, j ∈ J, where 1(·) is the indicator function, and b_i = 1, for all i ∈ I. By equations (3), (9), it follows that the principal solution is x̄ = [0 . . . 0]⊺, while the sets S_j are equal to the sets I_j. But following the analysis of i), finding the minimum set cover of I using S_j = I_j is equivalent to finding the solution to problem (6) with the above selection of A, b. This completes the proof of part ii).

Proof of Lemma 2
By construction, the agreement set and the support are equal to T, i.e.

z_j = x̄_j, for j ∈ T;    z_j = −∞, for j ∈ J \ T.

Thus, z ≤ x̄ and, by Theorem 8, also A ⊞ z ≤ b, which proves that z ∈ X_T.

To prove the second part, again from Theorem 8, if x ∈ X_T then

x_j ≤ x̄_j = z_j, for j ∈ T;    x_j = z_j = −∞, for j ∈ J \ T.

As a result, x ≤ z for any x ∈ X_T. Now, since A ⊞ · is increasing (Cuninghame-Green, 1979), we obtain the inequality:

b − A ⊞ z ≤ b − A ⊞ x, for any x ∈ X_T.

Since both x, z satisfy the lateness constraint (4), we finally have

‖b − A ⊞ x‖1 = 1⊺ (b − A ⊞ x) ≥ 1⊺ (b − A ⊞ z) = ‖b − A ⊞ z‖1

for any x ∈ X_T, where 1 = [1 · · · 1]⊺.

Proof of Corollary 1
Let x*, T̂ be the optimal solutions of problems (7), (16) respectively. Denote by T* = supp(x*) the support of x*. Then construct a new vector z* such that z*_j = x̄_j, j ∈ T*, and z*_j = −∞, j ∈ J \ T*. By Lemma 2,

E(T*) = ‖b − A ⊞ z*‖1 ≤ ‖b − A ⊞ x*‖1 ≤ ǫ.

Thus, T* = supp(x*) is a feasible point of problem (16), implying |T̂| ≤ |T*|.

Conversely, define x̂ as in (17). By construction, the feasibility of T̂ and Lemma 2, we have:

‖b − A ⊞ x̂‖1 = E(T̂) ≤ ǫ,    A ⊞ x̂ ≤ b.

Thus, x̂ is a feasible point of problem (7), which implies |T*| ≤ |T̂|. From the above inequalities we obtain |T*| = |T̂|, which also proves that x̂ is an optimal solution to problem (7).

Proof of Theorem 3

Notice that we have:

⋁_{j ∈ T} (A_j + x̄_j) ≤ ⋁_{j ∈ J} (A_j + x̄_j) = A ⊞ x̄ ≤ b.

Thus, we get by construction that the error vector e(T) has only nonnegative components, for every T ⊆ J, which implies:

E(T) = ‖e(T)‖1 = 1⊺ e(T), (27)

where 1 = [1 . . . 1]⊺. For convenience, define the matrix Â ∈ R^{m×n}_max as Â_ij = A_ij + x̄_j. Then, by the definition (14) of the error vector:

⋁_{j ∈ T} Â_j = b − e(T).

First, we show that E(T) is decreasing. Let B, C be two nonempty subsets of J with C ⊆ B ⊆ J. Then, ⋁_{j ∈ C} Â_j ≤ ⋁_{j ∈ B} Â_j. Consequently, e(B) ≤ e(C). Now, if C is empty and B is non-empty, then by construction e(∅) ≥ ⋁_{k ∈ J} e({k}) ≥ e(B) (if C, B are both empty we trivially have e(C) = e(B)). In any case, by (27), we obtain E(C) ≥ E(B).

Second, we show that E(T) is supermodular. Let C ⊆ B ⊆ J and k ∈ J \ B. It is sufficient to prove that:

e(C ∪ {k}) − e(C) ≤ e(B ∪ {k}) − e(B). (28)

For C ≠ ∅ define:

u = ⋁_{j ∈ C} Â_j,    v = ⋁_{j ∈ C ∪ {k}} Â_j,    z = ⋁_{j ∈ B} Â_j,    w = ⋁_{j ∈ B ∪ {k}} Â_j.

By this definition, v_i = u_i ∨ Â_ik and w_i = z_i ∨ Â_ik for every i ∈ I. Also, by monotonicity, u ≤ z, v ≤ w. There are three possibilities:

i) If u_i > Â_ik then v_i = u_i. But also w_i = z_i, since by monotonicity z ≥ u and z_i ≥ u_i > Â_ik. In this case, v_i − u_i = w_i − z_i = 0.

ii) If u_i ≤ Â_ik and z_i > Â_ik then v_i − u_i = Â_ik − u_i ≥ 0 = w_i − z_i.

iii) If both u_i ≤ Â_ik and z_i ≤ Â_ik then v_i − u_i = Â_ik − u_i ≥ Â_ik − z_i = w_i − z_i, since by monotonicity z_i ≥ u_i.

If C is the empty set, we define u = b − e(∅), and v, z, w are defined as before. Since by construction e(∅) ≥ e({k}) for all k ∈ J, we also have u ≤ v and u ≤ z. Thus, either case ii) or case iii) applies. In any case, v − u ≥ w − z, which is equivalent to (28). Finally, multiplying both sides of (28) from the left by 1⊺ gives the desired result: E(C ∪ {k}) − E(C) ≤ E(B ∪ {k}) − E(B).

Proof of Theorem 4

Define the truncated error set function

Ē(T) = max(E(T), ǫ).

By Theorem 3, the error set function E(T) is supermodular and decreasing. Thus, so is the truncated error function (Krause and Golovin, 2012). This enables us to express the constraint E(T) ≤ ǫ as Ē(T) = Ē(J). Then, lines 6-11 of Algorithm 2 are a version of Algorithm 1. Hence, Theorem 2 readily applies, giving the bound

|T_k| / |T̂| ≤ 1 + log( (Ē(∅) − Ē(J)) / (Ē(T_{k−1}) − Ē(J)) ),

where T̂ is the optimal solution of problem (16). From Corollary 1, we can replace T̂ with T*. By the assumption E(∅) > ǫ and the definition of the ℓ1-error set function at ∅:

Ē(∅) = E(∅) = Σ_{i ∈ I} ⋁_{j ∈ J} (b_i − A_ij − x̄_j) ≤ m∆.

Meanwhile, we have Ē(J) = ǫ. The terminating step k is such that E(T_{k−1}) > ǫ and E(T_k) ≤ ǫ. Such a k exists since E(J) ≤ ǫ and, in the worst case, Algorithm 2 halts at k = |J| with T_k = J. Thus, we have Ē(T_{k−1}) = E(T_{k−1}) and Ē(J) = ǫ, and the bound (18) follows.

Proof of Lemma 3
It is sufficient to prove that the feasible regions of both problems are identical. First, we prove that:

A ⊞ x ≤ b ⇔ Â(M) ⊞ x ≤ b. (29)

By Theorem 8, it is equivalent to show that x̄ = x̄(M), where x̄ is the original principal solution defined in (3) and x̄(M) is the new principal solution with Â(M) instead of A:

x̄_j(M) = ⋀_{i=1}^{m} (b_i − Â_ij(M)), for all j ∈ J. (30)

By construction, A_ij ≤ Â_ij(M), which, by (3), (30), implies x̄(M) ≤ x̄. To show the other direction, we have x̄_j(M) = b_k − Â_kj(M), for some k ∈ I. There are two cases:

i) Â_kj(M) = A_kj. Then, x̄_j(M) = b_k − A_kj ≥ ⋀_{i ∈ I} (b_i − A_ij) = x̄_j.

ii) Â_kj(M) = −M + b_k − x̄_j. Then, x̄_j(M) = M + x̄_j ≥ x̄_j, since M > 0. This proves x̄ = x̄(M).

Second, we prove that under the constraint A ⊞ x ≤ b (which we showed is equivalent to Â(M) ⊞ x ≤ b) we have:

‖b − A ⊞ x‖1 ≤ ǫ ⇔ ‖b − Â(M) ⊞ x‖1 ≤ ǫ. (31)

"⇒" direction. Since A_ij ≤ Â_ij(M), we obtain A ⊞ x ≤ Â(M) ⊞ x. But we have A ⊞ x ≤ b and Â(M) ⊞ x ≤ b. Thus,

‖b − Â(M) ⊞ x‖1 ≤ ‖b − A ⊞ x‖1 ≤ ǫ.

"⇐" direction. For every i ∈ I, there exists an index j_i ∈ J such that [Â(M) ⊞ x]_i = Â_{ij_i}(M) + x_{j_i}; hence

‖b − Â(M) ⊞ x‖1 = Σ_{i=1}^{m} (b_i − Â_{ij_i}(M) − x_{j_i}).

Now assume that some element Â_{kj_k}(M) is equal to −M + b_k − x̄_{j_k}, for some k ∈ I. Then, this implies

ǫ ≥ ‖b − Â(M) ⊞ x‖1 = Σ_{i=1}^{m} (b_i − Â_{ij_i}(M) − x_{j_i}) ≥ b_k − Â_{kj_k}(M) − x̄_{j_k} = M,

where the second inequality follows from Â(M) ⊞ x ≤ b and the equivalent fact x ≤ x̄ (see Theorem 8). But since M > ǫ, this is a contradiction, and the only possible case is Â_{ij_i}(M) = A_{ij_i}, for all i ∈ I. Finally,

ǫ ≥ ‖b − Â(M) ⊞ x‖1 = Σ_{i=1}^{m} (b_i − A_{ij_i} − x_{j_i}) ≥ Σ_{i=1}^{m} (b_i − ⋁_{j ∈ J} (A_ij + x_j)) = ‖b − A ⊞ x‖1.

This completes the proof.
Proof of Theorem 6