Sparse Approximate Solutions to Max-Plus Equations with Application to Multivariate Convex Regression
Nikos Tsilivis, Anastasios Tsiamis, and Petros Maragos
School of ECE, National Technical University of Athens, Greece
ESE Department, SEAS, University of Pennsylvania, USA
December 22, 2020
Abstract
In this work, we study the problem of finding approximate solutions, with minimum support set, to matrix max-plus equations, which we call sparse approximate solutions. We show how one can obtain such solutions efficiently and in polynomial time for any $\ell_p$ approximation error. Based on these results, we propose a novel method for piecewise-linear fitting of convex multivariate functions, with optimality guarantees for the model parameters and an approximately minimum number of affine regions.

In the last decades, the areas of signal and image processing have greatly benefited from the advancement of the theory of sparse representations [1]. Given a few linear measurements of an object of interest, sparse approximation theory provides efficient tools and algorithms for the acquisition of the sparsest (most zero elements) solution of the corresponding underdetermined linear system [1, 2]. Under a sparsity assumption on the initial signal, this allows perfect reconstruction from little data. Ideas stemming from this area have also given birth to compressed sensing techniques [3, 4] that allow accurate reconstructions from limited random projections of the initial signal, with wide-ranging applications in photography, magnetic resonance imaging and others.

Yet, there is a variety of problems in areas such as scheduling and synchronization [5, 6], morphological image and signal analysis [7, 8, 9], and optimization and optimal control [6, 10, 11] that do not admit linear representations. Instead, these problems share the ability to be described as a system of nonlinear equations which involve maximum operations together with additions. The relevant theoretical framework was initially developed in [5, 6, 12] and the appropriate algebra for this kind of problems is called max-plus algebra. Motivated by sparsity in the linear setting, [13] introduced the notion of sparsity (signals with many $-\infty$ values, i.e. the identity element of this algebra) in max-plus algebra. Herein, we contribute to this theory by studying the problem of sparse approximate solutions to matrix max-plus equations, allowing the approximation error to be measured by any $\ell_p$ norm.

Subsequently, we apply our theoretical framework to the fundamental problem of multivariate convex regression, where the goal is to approximate a convex function by a piecewise-linear (PWL) one. Formulating the problem as a max-plus equation and computing a sparse solution enables us to obtain a PWL function with a minimum number of affine regions. In general, the problem of fitting PWL functions has been studied before in many areas, including convex optimization, nonlinear circuits, geometric programming, machine learning and statistics. Previous attempts at solving the multivariate version of it have focused on iterating between finding a suitable partition of the input space and locally fitting affine functions to each domain of the partition [14, 15, 16, 17].
A stable method is proposed in [17], where the authors propose a convex adaptive partitioning algorithm that is a consistent estimator and requires $O(n(n+1)m\log(m)\log(\log(m)))$ computing time, where $n$ is the dimension of the input space and $m$ the number of points sampled from the convex function. Recently, it has been proposed to identify PWL functions with max-plus polynomials and formulate the regression problem as a max-plus equation, yielding a linear time algorithm [18].

In summary, our contributions are the following: a) We pose a generalized problem of finding the sparsest approximate solution to max-plus equations under a constraint which makes the problem more tractable, also known as the "lateness constraint". The approximation error is in terms of any $\ell_p$ norm, for $p < \infty$. This formulation is more general than [13], where only the $\ell_1$ norm was considered. b) We prove that for any $\ell_p$, $p < \infty$, norm the problem has a supermodular structure, which allows us to solve it approximately but efficiently via a greedy algorithm, with a derived approximation ratio. c) We investigate the $\ell_\infty$ case without the "lateness constraint", reveal its hardness and propose a heuristic method for solving it. d) We apply our framework to the problem of multivariate convex regression via PWL function fitting. Our method shares a common theoretical background with [18], but differs from it in that it allows an automatic, nearly optimal selection of the affine regions, due to the imposed sparsity of the solutions. It also guarantees error bounds on the approximation, while compared to partitioning and locally fitting style methods [14, 15, 16, 17] it has lower complexity.

For max and min operations we use the well-established lattice-theoretic symbols $\vee$ and $\wedge$, respectively. We use roman letters for functions, signals and their arguments and greek letters mainly for operators. Also, boldface roman letters are used for vectors (lowercase) and matrices (capital). If $M = [m_{ij}]$ is a matrix, its $(i,j)$-th element is also denoted as $m_{ij}$ or as $[M]_{ij}$. Similarly, $x = [x_i]$ denotes a column vector, whose $i$-th element is denoted as $[x]_i$ or simply $x_i$.

Max-plus arithmetic consists of the idempotent semiring $(\mathbb{R}_{\max}, \max, +)$, where $\mathbb{R}_{\max} = \mathbb{R} \cup \{-\infty\}$ is equipped with the standard maximum and sum operations, respectively. Max-plus algebra consists of vector operations that extend max-plus arithmetic to $\mathbb{R}^n_{\max}$. They include the pointwise operations of partial ordering $x \le y$ and pointwise supremum $x \vee y = [x_i \vee y_i]$, together with a class of vector transformations defined below. Max-plus algebra is isomorphic to the tropical algebra, namely the min-plus semiring $(\mathbb{R}_{\min}, \min, +)$, $\mathbb{R}_{\min} = \mathbb{R} \cup \{\infty\}$, when extended to $\mathbb{R}^n_{\min}$ in a similar fashion. Vector transformations on $\mathbb{R}^n_{\max}$ (resp. $\mathbb{R}^n_{\min}$) that distribute over max-plus (resp. min-plus) vector superpositions can be represented as a max-plus $\boxplus$ (resp. min-plus $\boxplus'$) product of a matrix $A \in \mathbb{R}^{m \times n}_{\max}$ ($\mathbb{R}^{m \times n}_{\min}$) with an input vector $x \in \mathbb{R}^n_{\max}$ ($\mathbb{R}^n_{\min}$):
$$[A \boxplus x]_i \triangleq \bigvee_{k=1}^{n} a_{ik} + x_k, \qquad [A \boxplus' x]_i \triangleq \bigwedge_{k=1}^{n} a_{ik} + x_k. \quad (1)$$
More details about general algebraic structures that obey those arithmetics can be found in [19]. In the case of a max-plus matrix equation $A \boxplus x = b$, there is a solution if and only if the vector
$$\hat{x} = (-A)^\top \boxplus' b \quad (2)$$
satisfies it [5, 12, 19]. We call this vector the principal solution of the equation. It also satisfies the inequality $A \boxplus \hat{x} \le b$.
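To make these operations concrete, the following is a minimal NumPy sketch of the max-plus product (1), its min-plus counterpart, and the principal solution (2). The function names are ours and the code is illustrative rather than part of the original development.

```python
import numpy as np

def maxplus_prod(A, x):
    """[A (max-plus) x]_i = max_k (a_ik + x_k); A is (m, n), x is (n,)."""
    return np.max(A + x[None, :], axis=1)

def minplus_prod(A, x):
    """[A (min-plus) x]_i = min_k (a_ik + x_k)."""
    return np.min(A + x[None, :], axis=1)

def principal_solution(A, b):
    """x_hat = (-A)^T (min-plus) b: the greatest vector with A (max-plus) x_hat <= b (Eq. (2))."""
    return minplus_prod(-A.T, b)
```

For a solvable system, `maxplus_prod(A, principal_solution(A, b))` reproduces `b` exactly; otherwise it stays componentwise below `b`.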
Lastly, a vector $x \in \mathbb{R}^n_{\max}$ is called sparse if it contains many $-\infty$ elements, and we define its support set, $\mathrm{supp}(x)$, to be the set of positions where the vector $x$ has finite values, that is $\mathrm{supp}(x) = \{ i \mid x_i \neq -\infty \}$.

Let $U$ be a universe of elements. A set function $f : 2^U \to \mathbb{R}$ is called submodular [20, 21] if for all $A \subseteq B \subseteq U$ and $k \notin B$ it holds:
$$f(A \cup \{k\}) - f(A) \ge f(B \cup \{k\}) - f(B). \quad (3)$$
A set function $f$ is called supermodular if $-f$ is submodular. Submodular functions occur as models of many real world evaluations in a number of fields and allow many hard combinatorial problems to be solved fast and with strong approximation guarantees [22, 23]. It has been suggested that their importance in discrete optimization is similar to that of convex functions in continuous optimization [21]. The following definition captures the idea of how far a given function is from being submodular and generalizes the notion of submodularity.

Definition 1. [24] Let $U$ be a set and $f : 2^U \to \mathbb{R}_+$ be an increasing, non-negative function. The submodularity ratio of $f$ is
$$\gamma_{U,k}(f) \triangleq \min_{L \subseteq U,\; S : |S| \le k,\; S \cap L = \emptyset} \frac{\sum_{x \in S} f(L \cup \{x\}) - f(L)}{f(L \cup S) - f(L)}. \quad (4)$$

Proposition 1. [24] An increasing function $f : 2^U \to \mathbb{R}$ is submodular if and only if $\gamma_{U,k}(f) \ge 1$ for all $U, k$.

In [24], the authors used the submodularity ratio to analyze the properties of greedy algorithms in maximization problems subject to cardinality constraints and in minimum submodular cover problems, when the functions are only approximately submodular ($\gamma \in (0, 1]$), thus allowing guarantees for a wider variety of objective functions.

We consider the problem of finding the sparsest approximate solution to the max-plus matrix equation $A \boxplus x = b$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$. Such a solution should i) have minimum support set $\mathrm{supp}(x)$, and ii) have small enough approximation error $\| b - A \boxplus x \|_p^p$, for some $\ell_p$, $p < \infty$, norm. For this reason, given a prescribed constant $\epsilon$, we formulate the following optimization problem:
$$\arg\min_{x \in \mathbb{R}^n_{\max}} |\mathrm{supp}(x)| \quad \text{s.t.} \quad \| b - A \boxplus x \|_p^p \le \epsilon, \; p < \infty, \quad A \boxplus x \le b. \quad (5)$$
Note that we add an additional constraint $A \boxplus x \le b$, also known as the "lateness" constraint. This constraint makes problem (5) more tractable; it enables the reformulation of problem (5) as a set optimization problem in (13). In many applications this constraint is desirable; see [13]. However, in other situations, it might lead to less sparse solutions or higher residual error. A possible way to overcome this constraint is explored in section 3.1.

Even with the additional lateness constraint, problem (5) is very hard to solve. For example, when $\epsilon = 0$, solving (5) is an NP-hard problem [13]. Thus, we do not expect to find an efficient algorithm which solves (5) exactly. Instead, we will prove next that there is a polynomial time algorithm which finds an approximate solution, by leveraging its supermodular properties. First, let us show that the above problem can be formed as a discrete optimization problem over a set. We follow a similar procedure to [13], where the case $p = 1$ was examined. For the rest of this section, let $J = \{1, \ldots, n\}$.
Lemma 1. (Projection on the support set, $\ell_p$ case) Let $T \subseteq J$,
$$X_T = \{ x \in \mathbb{R}^n_{\max} : \mathrm{supp}(x) = T, \; A \boxplus x \le b \}, \quad (6)$$
and let $x|_T$ be defined as $\hat{x}$ inside $T$ and $-\infty$ otherwise, where $\hat{x}$ is the principal solution defined in (2). Then, it holds:
- $x|_T \in X_T$.
- $\| b - A \boxplus x|_T \|_p^p \le \| b - A \boxplus x \|_p^p$ for all $x \in X_T$.

Proof.
- It suffices to show that $A \boxplus x|_T \le b$. For $j \in T$ it is $[x|_T]_j = \hat{x}_j$, and for $j \in J \setminus T$, $[x|_T]_j = -\infty \le \hat{x}_j$. Thus,
$$x|_T \le \hat{x} \iff A \boxplus x|_T \le A \boxplus \hat{x} \implies A \boxplus x|_T \le b. \quad (7)$$
Hence, $x|_T \in X_T$.
- Let $x \in X_T$; then $A \boxplus x \le b \iff x \le \hat{x}$, which implies (since both $x$ and $x|_T$ have $-\infty$ values outside of $T$):
$$x \le x|_T \iff b - A \boxplus x|_T \le b - A \boxplus x. \quad (8)$$
Hence:
$$\| b - A \boxplus x|_T \|_p^p = \sum_{i=1}^{m} (b - A \boxplus x|_T)_i^p \le \sum_{i=1}^{m} (b - A \boxplus x)_i^p = \| b - A \boxplus x \|_p^p. \quad (9)$$

The previous lemma informs us that we can fix the finite values of a solution of Problem (5) to be equal to those of the principal solution $\hat{x}$. Indeed:

Proposition 2.
Let $x_{\mathrm{OPT}}$ be an optimal solution of (5); then we can construct a new one with values inside the support set equal to those of the principal solution $\hat{x}$.

Proof. Define
$$z_j = \begin{cases} \hat{x}_j, & j \in \mathrm{supp}(x_{\mathrm{OPT}}) \\ -\infty, & \text{otherwise}, \end{cases} \quad (10)$$
then $\mathrm{supp}(x_{\mathrm{OPT}}) = \mathrm{supp}(z)$ and, from Lemma 1, $\| b - A \boxplus z \|_p^p \le \| b - A \boxplus x_{\mathrm{OPT}} \|_p^p$ and $A \boxplus z \le b$. Thus, $z$ is also an optimal solution of (5).

Therefore, the only variable that matters in Problem (5) is the support set. To further clarify this, let us proceed with the following definitions:

Definition 2.
Let $T \subseteq J$ be a candidate support and let $A_j$ denote the $j$-th column of $A$. The error vector $e : 2^J \to \mathbb{R}^m$ is defined as:
$$e(T) = \begin{cases} b - \bigvee_{j \in T} (A_j + \hat{x}_j), & T \neq \emptyset \\ \bigvee_{j \in J} e(\{j\}), & T = \emptyset. \end{cases} \quad (11)$$
Observe that for any $T$, it holds $\bigvee_{j \in T} (A_j + \hat{x}_j) \le \bigvee_{j \in J} (A_j + \hat{x}_j) \le b$, which means that the above vector $e(T) = (e_1(T), e_2(T), \ldots, e_m(T))^\top$ is always non-negative. We also define the corresponding error function $E_p : 2^J \to \mathbb{R}$ as:
$$E_p(T) = \| e(T) \|_p^p = \sum_{i=1}^m (e_i(T))^p. \quad (12)$$
Problem (5) can now be written as:
$$\arg\min_{T \subseteq J} |T| \quad \text{s.t.} \quad E_p(T) \le \epsilon. \quad (13)$$
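As an illustration of Definition 2 and of the set-function form (13), the following sketch (our own code, reusing `principal_solution` from the earlier snippet) evaluates $e(T)$ and $E_p(T)$:

```python
def error_vector(A, b, x_hat, T):
    """e(T) = b - max_{j in T} (A_j + x_hat_j) for non-empty T (Eq. (11))."""
    cols = A[:, T] + x_hat[T]            # shape (m, |T|)
    return b - np.max(cols, axis=1)

def E_p(A, b, x_hat, T, p):
    """E_p(T) = ||e(T)||_p^p (Eq. (12)); e(empty set) is the pointwise max of singleton errors."""
    T = list(T)
    if not T:
        singles = [error_vector(A, b, x_hat, [j]) for j in range(A.shape[1])]
        e = np.max(np.stack(singles, axis=1), axis=1)
    else:
        e = error_vector(A, b, x_hat, T)
    return float(np.sum(e ** p))
```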
The main results of this section are based on the following properties of $E_p$.

Theorem 1. The error function $E_p$ is decreasing and supermodular.

Proof. Regarding the monotonicity, let $\emptyset \neq C \subseteq B \subset J$; then
$$\bigvee_{j \in C} (A_j + \hat{x}_j) \le \bigvee_{j \in B} (A_j + \hat{x}_j) \iff e(B) \le e(C),$$
thus raising the (non-negative) components of the two vectors to the $p$-th power and adding the inequalities together yields $E_p(B) \le E_p(C)$. The case $C = \emptyset$ easily follows from the definition of $e$.

We employ Definition 1 to help us prove the supermodularity of the function. Let $S, L \subseteq U \subseteq J$, with $|S| \le K$, $S \cap L = \emptyset$, and define $f(U) = -E_p(U)$ for all $U$. Then:
$$\gamma_{U,K}(f) = \min_{L,S} \frac{\sum_{s_k \in S} f(L \cup \{s_k\}) - f(L)}{f(L \cup S) - f(L)} = \min_{L,S} \frac{\sum_{s_k \in S} \sum_{i=1}^m -\big[b_i - \bigvee_{j \in L \cup \{s_k\}} (A_{ij} + \hat{x}_j)\big]^p + \big[b_i - \bigvee_{j \in L} (A_{ij} + \hat{x}_j)\big]^p}{\sum_{i=1}^m -\big[b_i - \bigvee_{j \in L \cup S} (A_{ij} + \hat{x}_j)\big]^p + \big[b_i - \bigvee_{j \in L} (A_{ij} + \hat{x}_j)\big]^p}.$$
Let now $I$ be the set
$$I = \Big\{ i \in \{1, 2, \ldots, m\} \;\Big|\; \bigvee_{j \in L \cup S} (A_{ij} + \hat{x}_j) = \bigvee_{j \in L} (A_{ij} + \hat{x}_j) \Big\}, \quad (14)$$
and, for each $s_k \in S$, define two sets of indices:
$$I_1(s_k) = \Big\{ i \in \{1, 2, \ldots, m\} \;\Big|\; \bigvee_{j \in L \cup \{s_k\}} (A_{ij} + \hat{x}_j) = \bigvee_{j \in L \cup S} (A_{ij} + \hat{x}_j) > \bigvee_{j \in L} (A_{ij} + \hat{x}_j) \Big\} \quad (15)$$
and
$$I_2(s_k) = \Big\{ i \in \{1, 2, \ldots, m\} \;\Big|\; \bigvee_{j \in L \cup S} (A_{ij} + \hat{x}_j) > \bigvee_{j \in L \cup \{s_k\}} (A_{ij} + \hat{x}_j) > \bigvee_{j \in L} (A_{ij} + \hat{x}_j) \Big\}. \quad (16)$$
Then, if
$$\Sigma = \sum_{s_k \in S} \sum_{i \in I \cup I_1(s_k)} -\Big[b_i - \bigvee_{j \in L \cup \{s_k\}} (A_{ij} + \hat{x}_j)\Big]^p + \Big[b_i - \bigvee_{j \in L} (A_{ij} + \hat{x}_j)\Big]^p, \quad (17)$$
the ratio becomes:
$$\gamma_{U,K}(f) = \min_{L,S} \frac{\Sigma + \sum_{s_k \in S} \sum_{i \in I_2(s_k)} -\big[b_i - \bigvee_{j \in L \cup \{s_k\}} (A_{ij} + \hat{x}_j)\big]^p + \big[b_i - \bigvee_{j \in L} (A_{ij} + \hat{x}_j)\big]^p}{\Sigma} \ge 1, \quad \forall U, K,$$
meaning that $f$ is submodular or, equivalently, $E_p = -f$ is supermodular.
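A quick numerical sanity check of the supermodularity claim on a random instance, using the helpers sketched above (this is our own illustrative code, not part of the paper):

```python
# Check E_p(C U {k}) - E_p(C) <= E_p(B U {k}) - E_p(B) for C subset of B, k not in B.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4)); b = rng.normal(size=6)
x_hat = principal_solution(A, b)
p = 2
for _ in range(200):
    B = set(rng.choice(4, size=3, replace=False))
    C = set(rng.choice(list(B), size=2, replace=False))
    k = int(rng.choice([j for j in range(4) if j not in B]))
    lhs = E_p(A, b, x_hat, list(C | {k}), p) - E_p(A, b, x_hat, list(C), p)
    rhs = E_p(A, b, x_hat, list(B | {k}), p) - E_p(A, b, x_hat, list(B), p)
    assert lhs <= rhs + 1e-9   # supermodularity (Theorem 1)
```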
Algorithm 1: Approximate solution of problem (5)

Input: $A$, $b$
1. Compute $\hat{x} = (-A)^\top \boxplus' b$.
2. If $E_p(J) > \epsilon$, return Infeasible.
3. Set $T_0 = \emptyset$, $k = 0$.
4. While $E_p(T_k) > \epsilon$:
   (a) $j = \arg\min_{s \in J \setminus T_k} E_p(T_k \cup \{s\})$;
   (b) $T_{k+1} = T_k \cup \{j\}$;
   (c) $k = k + 1$.
5. Set $x_j = \hat{x}_j$ for $j \in T_k$ and $x_j = -\infty$ otherwise.
6. Return $x$, $T_k$.

Setting $\tilde{E}_p(T) = \max(E_p(T), \epsilon)$ and leveraging the previous theorem, we are able to formulate problem (13), and thus the initial one (5), as a cardinality minimization problem subject to a supermodular equality constraint [25], which allows us to approximately solve it by the greedy Algorithm 1. (The new, truncated, error function remains supermodular; see [22].) The calculation of the principal solution requires $O(nm)$ time and the greedy selection of the support set of the solution costs $O(n)$ time. We call the solutions of problem (5) Sparse Greatest Lower Estimates (SGLE) of $b$. Regarding the approximation ratio between the optimal solution and the output of Algorithm 1, the following proposition holds.

Proposition 3. Let $x$ be the output of Algorithm 1 after $k > 0$ iterations of the inner while loop and $T_k$ the respective support set. Then, if $T^*$ is the support set of the optimal solution of (5), the following inequality holds:
$$\frac{|T_k|}{|T^*|} \le 1 + \ln\left(\frac{m\Delta^p - \epsilon}{E_p(T_{k-1}) - \epsilon}\right), \quad (18)$$
where $\Delta = \bigvee_{i,j} (b_i - A_{ij} - \hat{x}_j)$.

Proof. From [25], the following bound holds for the cardinality minimization problem subject to a supermodular and decreasing constraint, defined as a function $f : 2^J \to \mathbb{R}$, by the greedy algorithm:
$$\frac{|T_k|}{|T^*|} \le 1 + \ln\left(\frac{f(\emptyset) - f(J)}{f(T_{k-1}) - f(J)}\right). \quad (19)$$
For our problem, it is $f = \tilde{E}_p$. Observe now that, since $k > 0$, $\tilde{E}_p(\emptyset) = E_p(\emptyset) \le m\Delta^p$, $0 \le \tilde{E}_p(J) = \epsilon$ and $\tilde{E}_p(T_{k-1}) > \epsilon$. Therefore, the result follows.
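For concreteness, here is a short NumPy sketch of Algorithm 1, reusing the `principal_solution` and `E_p` helpers from the earlier snippets; it is an illustrative implementation of the greedy selection, not an optimized one.

```python
def sparse_greatest_lower_estimate(A, b, eps, p):
    """Greedy Algorithm 1: sparse x with ||b - A (max-plus) x||_p^p <= eps and A (max-plus) x <= b."""
    m, n = A.shape
    x_hat = principal_solution(A, b)
    J = range(n)
    if E_p(A, b, x_hat, list(J), p) > eps:
        raise ValueError("infeasible: even the full support exceeds the error budget")
    T = []
    while E_p(A, b, x_hat, T, p) > eps:
        # pick the column whose addition reduces the error the most
        j = min((s for s in J if s not in T),
                key=lambda s: E_p(A, b, x_hat, T + [s], p))
        T.append(j)
    x = np.full(n, -np.inf)
    x[T] = x_hat[T]
    return x, T
```

On exit, `maxplus_prod(A, x) <= b` holds componentwise and `T` is the greedily selected support.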
The ratio warns us to expect less optimal and, thus, less sparse vectors when increasing the norm order $p$ that we use to measure the approximation. It also hints towards an inapproximability result when $p \to \infty$, which is formalised next.

$\ell_\infty$ errors

Although in some settings the $A \boxplus x \le b$ constraint is needed [13], in other cases it could exclude the sparsest vectors from consideration. Omitting the constraint, on the other hand, makes it unclear how to search for minimum error solutions for any $\ell_p$ ($p < \infty$) norm. For instance, it has recently been reported that it is NP-hard to determine if a given point is a local minimum for the $\ell_2$ norm [26]. For that reason, we shift our attention to the case of $p = \infty$. It is well known [5, 12] that the problem $\min_{x \in \mathbb{R}^n_{\max}} \| b - A \boxplus x \|_\infty$ has a closed form solution; it can be calculated in $O(nm)$ time by adding to the principal solution, element-wise, half of its $\ell_\infty$ error. Note that this new vector does not necessarily satisfy $A \boxplus x \le b$, so it shows a way to overcome the aforementioned limitation.

First, let us demonstrate that problem (5), when considering the $\ell_\infty$ norm, becomes harder than before and non-approximable by the greedy Algorithm 1. Hence, consider now the following optimization problem:
$$\arg\min_{x \in \mathbb{R}^n_{\max}} |\mathrm{supp}(x)| \quad \text{s.t.} \quad \| b - A \boxplus x \|_\infty \le \epsilon. \quad (20)$$
Thanks to a similar construction as in the previous section, this problem can be recast as a set-search problem.
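The closed-form unconstrained $\ell_\infty$ minimizer mentioned above is a one-liner on top of the earlier helpers; Lemma 2 below restricts the same construction to a fixed support $T$. A small illustrative sketch (our own code):

```python
def minimax_solution(A, b):
    """Unconstrained minimizer of ||b - A (max-plus) x||_inf: principal solution plus half its residual."""
    x_hat = principal_solution(A, b)
    r = np.max(b - maxplus_prod(A, x_hat))   # residual is non-negative since A (max-plus) x_hat <= b
    return x_hat + r / 2.0
```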
Lemma 2. (Projection on the support set, $\ell_\infty$ case) Let $T \subseteq J$, let $x|_T$ be defined as $\hat{x}$ inside $T$ and $-\infty$ otherwise, and let $x^* = x|_T + \| b - A \boxplus x|_T \|_\infty / 2$. Then for all $z \in \mathbb{R}^n_{\max}$ with $\mathrm{supp}(z) = T$, it holds:
$$\| b - A \boxplus z \|_\infty \ge \| b - A \boxplus x^* \|_\infty = \| b - A \boxplus x|_T \|_\infty / 2. \quad (21)$$

Proof. (Sketch) By fixing the support set of the considered vectors equal to $T$, we equivalently omit the columns of $A$ and the entries of $x$ that do not belong to $T$ (since they are not considered in the evaluation of the maximum). By doing so, we get a new equation with the same vector $b$ and restricted $A$, $x$. The vector $x^*$ that minimizes the $\ell_\infty$ error of this equation is obtained from its principal solution plus half of its $\ell_\infty$ error. But now observe that the new principal solution shares the same values with the original principal solution inside $T$ (this follows from Lemma 1), which is exactly the vector $x|_T$. Extending $x^*$ back to $\mathbb{R}^n_{\max}$ yields the result.

So, a similar result to Proposition 2 holds.

Proposition 4. Let $x_{\mathrm{OPT}}$ be an optimal solution of (20); then we can construct a new one with values inside the support set equal to those of the principal solution $\hat{x}$ plus half of its $\ell_\infty$ error.

By defining $E_\infty(T) = \| b - A \boxplus x|_T \|_\infty / 2$, (20) becomes:
$$\arg\min_{T \subseteq J} |T| \quad \text{s.t.} \quad E_\infty(T) \le \epsilon. \quad (22)$$
Unfortunately, this problem does not admit an approximate solution by the greedy Algorithm 1 (to be precise, the modified version of Algorithm 1 in which $E_p$ becomes $E_\infty$), as its error function, although decreasing, is not supermodular. The following example also reveals that the submodularity ratio (4) of $E_\infty$ is 0. Therefore, it is not even approximately supermodular and a solution by Algorithm 1 can be arbitrarily bad [24].

Example 1.
Consider a matrix $A \in \mathbb{R}^{3 \times 3}$ and a vector $b \in \mathbb{R}^3$ whose principal solution $\hat{x} = (-A)^\top \boxplus' b$ gives the following behaviour of the error function on different sets:
- When $T = \{3\}$, then $\hat{x}|_{\{3\}} = (-\infty, -\infty, \hat{x}_3)^\top$ and $E_\infty(\{3\}) > 0$.
- Likewise, when $T = \{1, 3\}$ or $T = \{2, 3\}$, the error does not improve: $E_\infty(\{1, 3\}) = E_\infty(\{2, 3\}) = E_\infty(\{3\})$.
- When $T = \{1, 2, 3\}$, however, $E_\infty(\{1, 2, 3\}) = 0$.

Let now $f = -E_\infty$, $L = \{3\}$ and $S = \{1, 2\}$; then, by (4), we have
$$\frac{f(\{3\} \cup \{1\}) - f(\{3\}) + f(\{3\} \cup \{2\}) - f(\{3\})}{f(\{3\} \cup \{1, 2\}) - f(\{3\})} = 0, \quad (23)$$
since both terms of the numerator vanish while the denominator is strictly positive, meaning that $f$ has submodularity ratio 0, i.e. $E_\infty$ is not even approximately supermodular.

Although the previous discussion denies problem (20) a greedy solution with any guarantees, we propose next a practical alternative for obtaining a sparse enough vector. We first obtain a sparse vector $x_{p,\epsilon}$ by solving problem (5). Then, we add to the vector, element-wise, half of its $\ell_\infty$ error, $\| b - A \boxplus x_{p,\epsilon} \|_\infty / 2$.
Interestingly, this new solution minimizes the $\ell_\infty$ error among all vectors with the same support, as formalized in the following result.

Proposition 5.
Let $x_{\mathrm{SMMAE}} \in \mathbb{R}^n_{\max}$ be defined as:
$$x_{\mathrm{SMMAE}} = x_{p,\epsilon} + \| b - A \boxplus x_{p,\epsilon} \|_\infty / 2, \quad (24)$$
where $x_{p,\epsilon}$ is a solution of problem (5) with fixed $(p, \epsilon)$. Then for all $z \in \mathbb{R}^n_{\max}$ with $\mathrm{supp}(z) = \mathrm{supp}(x_{p,\epsilon})$, it holds
$$\| b - A \boxplus z \|_\infty \ge \| b - A \boxplus x_{\mathrm{SMMAE}} \|_\infty = \| b - A \boxplus x_{p,\epsilon} \|_\infty / 2 \quad (25)$$
and, also,
$$\| b - A \boxplus x_{\mathrm{SMMAE}} \|_\infty \le \sqrt[p]{\epsilon} / 2. \quad (26)$$

Proof.
Observe that $x_{p,\epsilon}$ is equal to the principal solution $\hat{x}$ inside $\mathrm{supp}(x_{p,\epsilon})$. So the first inequality follows from Lemma 2. Regarding the second one, we have:
$$\| b - A \boxplus x_{\mathrm{SMMAE}} \|_\infty = \frac{\| b - A \boxplus x_{p,\epsilon} \|_\infty}{2} = \frac{\bigvee_i (b_i - [A \boxplus x_{p,\epsilon}]_i)}{2}. \quad (27)$$
But notice that
$$\Big( \bigvee_i b_i - [A \boxplus x_{p,\epsilon}]_i \Big)^p = \bigvee_i (b_i - [A \boxplus x_{p,\epsilon}]_i)^p \le \sum_i (b_i - [A \boxplus x_{p,\epsilon}]_i)^p \le \epsilon, \quad (28)$$
so
$$\bigvee_i (b_i - [A \boxplus x_{p,\epsilon}]_i) \le \sqrt[p]{\epsilon} \quad (29)$$
and the result follows from (27). Note that the bound tightens as $p$ increases.

The above method provides sparse vectors that are approximate solutions of the equation with respect to the $\ell_\infty$ norm, without the need for the lateness constraint. After computing $x_{p,\epsilon}$, obtaining $x_{\mathrm{SMMAE}}$ requires $O(m\,|\mathrm{supp}(x_{p,\epsilon})| + |\mathrm{supp}(x_{p,\epsilon})|) = O((m+1)\,|\mathrm{supp}(x_{p,\epsilon})|)$ additional time. We call $x_{\mathrm{SMMAE}}$ the Sparse Minimum Max Absolute Error (SMMAE) estimate of $b$.
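A minimal sketch of this construction, reusing `maxplus_prod` from the earlier snippets (the function name is ours):

```python
def smmae_estimate(A, b, x_sgle):
    """Shift a sparse greatest lower estimate up by half of its l_inf residual (Prop. 5).
    Assumes x_sgle has a non-empty support and satisfies A (max-plus) x_sgle <= b."""
    r = np.max(b - maxplus_prod(A, x_sgle))   # non-negative residual
    return x_sgle + r / 2.0                   # -inf entries stay -inf, so the support is preserved
```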
Next, we provide an experiment on randomly generated data to assess the effectiveness of the proposed method in solving the $\ell_\infty$ problem (20).

Example 2.
Input data consist of 100 random pairs of matrices $A$ and vectors $b$, where each entry of $A$ is sampled from a zero-mean normal distribution and each entry of $b$ from a standard normal $\mathcal{N}(0, 1)$.

Figure 1: The cardinality of the support set obtained from a greedy solution of (20) and from the heuristic approach proposed in Proposition 5, shown for 100 different pairs of input data $A$, $b$. Best viewed in color.

We organise the experiment as follows: for each pair $A$, $b$, we first solve problem (5) with $p = 150$ and $\epsilon = (2 \cdot 2.5)^{150}$ to acquire a sparse vector that is an approximate solution with respect to the $\ell_{150}$ norm, and then add to it half of its $\ell_\infty$ error, obtaining this way a sparse vector whose $\ell_\infty$ error is less than 2.5 and, in practice, close to its theoretical bound 2.5. Afterwards, we solve the $\ell_\infty$ problem (20) directly with the greedy Algorithm 1 (with a change of error function, i.e. $E_p$ becomes $E_\infty$) for $\epsilon = 2.5$. Figure 1 shows the cardinality of the support sets obtained by the two approaches for the 100 input pairs.
In this section, we are interested in approximating a convex function by a piecewise-linear one. We call this the Tropical Regression problem. It is well known that any convex function can be expressed as the pointwise supremum of a, potentially infinite, family of affine hyperplanes, using the Legendre-Fenchel conjugate (a.k.a. slope transform) [27, 28, 29]. Our goal is to approximate the convex function with as few hyperplanes as possible. We show next how the sparse framework we introduced addresses this problem.

Let $(x_i, f_i) \in \mathbb{R}^{n+1}$, $i = 1, \ldots, m$, be a set of (possibly noisy) data sampled from a convex function $f$, and let $\{a_k\}_{k=1}^K$ be a set of slope vectors; for example, these could be integer multiples of a slope step inside a fixed $n$-dimensional interval, or the numerical gradients of the data. Given the data and the slopes, our goal is to compute a PWL (piecewise-linear) function $p$:
$$p(x) = \bigvee_{k=1}^{K} a_k^\top x + b_k, \quad (30)$$
that satisfies $f_i = p(x_i) + \text{error}$ for all $i$. Ideally, this regression problem can be formulated as the following max-plus matrix equation:
$$\underbrace{\begin{bmatrix} a_1^\top x_1 & a_2^\top x_1 & \cdots & a_K^\top x_1 \\ \vdots & \vdots & \ddots & \vdots \\ a_1^\top x_m & a_2^\top x_m & \cdots & a_K^\top x_m \end{bmatrix}}_{A} \boxplus \underbrace{\begin{bmatrix} b_1 \\ \vdots \\ b_K \end{bmatrix}}_{x} = \underbrace{\begin{bmatrix} f_1 \\ \vdots \\ f_m \end{bmatrix}}_{b}. \quad (31)$$
Observe that by taking $b_k = -\infty$, the hyperplane $a_k^\top x + b_k$ is neglected in the maximum. Hence, sparsity leads to using fewer affine regions. We can formulate problem (5) for the above matrices for any desired $(\epsilon, p)$. If a solution exists, then it produces intercepts $b_k$ that ensure that the $\ell_p$ approximation error is less than $\epsilon$ and, at the same time, the resulting tropical polynomial contains the approximately minimum number of affine regions needed to approximate $f$. Apart from the SGLEs, we are also able to get the SMMAE estimates of $f$ by adding to the result half of its $\ell_\infty$ error, as explained in section 3.1. Coming with $\ell_\infty$ guarantees, those estimates are useful especially when the approximation is being used as a surrogate of the original function in an optimization problem, as the difference between the two minima can be bounded.

First, we calculate the matrix $A$ in $O(Knm)$ time. Solving problem (5) for equation (31) then requires the computation of its principal solution in $O(Km)$ time and the greedy algorithm to find the intercepts $b_k$ with complexity $O(K)$, meaning a total complexity of $O(K + K(n+1)m)$. Computing the SMMAE estimate, as well, requires an extra $O(Km)$. Next, we demonstrate the effectiveness of our method via numerical examples.
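The construction of Eq. (31) and the subsequent fit are easy to express with the helpers sketched earlier; the code below is our own illustrative sketch (names such as `tropical_regression` are not from the paper):

```python
def tropical_regression(X, f, slopes, eps, p):
    """Fit p(x) = max_k (a_k^T x + b_k) by solving problem (5) for Eq. (31).
    X: (m, n) inputs, f: (m,) samples of the convex function, slopes: (K, n) candidates."""
    A = X @ slopes.T                       # A[i, k] = a_k^T x_i
    b_coef, support = sparse_greatest_lower_estimate(A, f, eps, p)
    return b_coef, support                 # intercepts, with -inf for unused slopes

def pwl_predict(X, slopes, b_coef):
    """Evaluate the fitted tropical polynomial; -inf intercepts drop their affine region."""
    return np.max(X @ slopes.T + b_coef[None, :], axis=1)
```

The corresponding SMMAE fit is then obtained by passing the same matrix `A = X @ slopes.T` and the fitted intercepts through `smmae_estimate`.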
Example 3. Consider 100 pairs of noiseless data $(x_i, y_i)$, where the $x_i$ are evenly spaced points in a closed interval and $y_i = f(x_i)$, with $f$ a fixed convex function given as the pointwise maximum of three convex terms (32). We wish to fit a max-plus tropical polynomial curve, where we fix the candidate slopes to be a uniform grid of values $k_1 < k_2 < \cdots < k_K$ in a symmetric interval:
$$p(x) = \bigvee_{k} kx + b_k, \quad (33)$$
so the corresponding equation (31) becomes:
$$\begin{bmatrix} k_1 x_1 & k_2 x_1 & \cdots & k_K x_1 \\ \vdots & \vdots & \ddots & \vdots \\ k_1 x_{100} & k_2 x_{100} & \cdots & k_K x_{100} \end{bmatrix} \boxplus \begin{bmatrix} b_{k_1} \\ \vdots \\ b_{k_K} \end{bmatrix} = \begin{bmatrix} f_1 \\ \vdots \\ f_{100} \end{bmatrix}. \quad (34)$$
We solve problem (5) for the above matrices for a variety of different pairs of error threshold and norm order to obtain sparse greatest lower estimates (SGLE), and then add to these solutions half of their $\ell_\infty$ error in order to get the corresponding sparse minimum max absolute error (SMMAE) estimates. In order to provide a clarifying comparison between solutions obtained with different $p$ norms, for each experiment we set the error threshold $\epsilon$ to be $\theta^p$, where $\theta$ is varied. We present the resulting SGLEs and SMMAEs in Tables 1, 2 and 3, 4, respectively. Notice that the SMMAE estimates have exactly half the $\ell_\infty$ error of the respective SGLEs, as expected by Proposition 5. Also, observe in Tables 2 and 4 the effect of increasing the norm order $p$ on the resulting support set (it increases, as suggested by Proposition 3). See Figure 2 for the best PWL approximations of $f$.

Table 1: $\ell_1$ and $\ell_2$ SGLEs obtained from solving problem (5) for equation (34), with $p = 1, 2$ and $\epsilon = \theta^p$. We report the Root Mean Squared and Maximum Absolute errors, along with the cardinality of the support set of the solution (the number of affine regions of the resulting tropical polynomial).

Table 2: $\ell_5$ and $\ell_{150}$ SGLEs for a number of different error thresholds. Same metrics reported as in Table 1.
Example 4. Let us now focus on the 2-dimensional case, meaning we obtain data from a convex surface. For this example, we sample values from the noisy paraboloid surface
$$z = x^2 + y^2 + \mathcal{N}(0, \sigma^2), \quad (35)$$
where $x_i$, $y_i$ are drawn as i.i.d. random variables from the $\mathrm{Unif}[-1, 1]$ distribution.

Table 3: $\ell_1$ and $\ell_2$ SMMAE estimates for a number of different error thresholds. Same metrics reported as in Table 1.

Table 4: $\ell_5$ and $\ell_{150}$ SMMAE estimates for a number of different error thresholds. Same metrics reported as in Table 1.

Figure 2: Piecewise-linear approximations of the convex function $f$ of Example 3 with $K$ regions, resulting from the sparse tropical regression method with varied error threshold $\epsilon$ and norm order $p$: (a) $K = 11$, $p = 1$; (b) $K = 6$, $p = 2$; (c) $K = 5$, $\epsilon = 1$, $p = 2$; (d) $K = 3$, $\epsilon = 1024$, $p = 5$. Best viewed in color.

Figure 3: The sparse greatest lower and minimum max absolute error estimates of the surface (35) for two different runs of the fitting algorithm: (a) $K = 16$, $\epsilon = 10$, $p = 150$; (b) $K = 5$, $\epsilon = 220$, $p = 2$. Best viewed in color.

Figure 4: RMS error of SMMAE estimators vs. the number of affine regions $K$: comparison between our method and the tropical regression method (MMAE) reported in [18].
We obtain 500 observations from the surface. Let $A$ be a finite grid of values for the partial derivatives of the affine regions that are to be considered; then our tropical model for this example is
$$p(x, y) = \bigvee_{(k, l) \in A \times A} b_{kl} + kx + ly. \quad (36)$$
We obtain SGLEs and SMMAE estimators for the above model, with different runs of our algorithm, as in Example 3. We present the results in Table 5, compared to those obtained from the tropical regression method of [18], in which the number of affine regions is a pre-defined constant. Fig. 4 shows the RMS error of the SMMAE estimators as a function of the number $K$ of affine regions and compares it with the MMAE estimators reported in [18].

We verify that, in the presence of noise, the SMMAE estimators perform better than the SGLEs, as the latter must approximate the data from below (see Fig. 3) and, therefore, underestimate noise-corrupted low values. Both estimators are able to find good approximations with a relatively low number of affine regions, and the results are superior to those reported in [18] (in terms of error and number of affine regions). Notice that the SMMAE estimates have exactly half the $\ell_\infty$ error of their SGLE counterparts, as expected by Proposition 5. Moreover, the $\ell_\infty$ error can be controlled directly: to achieve a prescribed $\ell_\infty$ error threshold $\delta$, one can choose a large $p$ (such as $p = 150$ here) and set $\epsilon = (2\delta)^p$.
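For completeness, here is a sketch of how the two-dimensional slope grid of model (36) can be assembled and fed to the `tropical_regression` helper sketched earlier; the grid spacing, noise-free data and error budget below are illustrative placeholders, not the exact values used in this example:

```python
# Candidate slope pairs (k, l) in A x A for model (36); the grid here is a stand-in.
grid = np.arange(-1.0, 1.01, 0.25)
slopes_2d = np.array([(k, l) for k in grid for l in grid])      # shape (K, 2), K = len(grid)**2

XY = np.random.uniform(-1.0, 1.0, size=(500, 2))                # x_i, y_i ~ Unif[-1, 1]
Z = XY[:, 0] ** 2 + XY[:, 1] ** 2                               # noise-free paraboloid samples
b_kl, support = tropical_regression(XY, Z, slopes_2d, eps=220.0, p=2)
Z_fit = pwl_predict(XY, slopes_2d, b_kl)                        # PWL lower estimate of the surface
```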
Example 5. Consider the case where the dimension is $n = 3$ and we have $m = 11^3 = 1331$ points collected from the set $V \times V \times V$, where $V = \{-5, -4, \ldots, 4, 5\}$. The convex function to approximate is:
$$g(x) = \log(\exp(x_1) + \exp(x_2) + \exp(x_3)). \quad (37)$$

Table 5: PWL approximations and their errors of the surface (35). $K$ is the number of affine regions in the resulting tropical polynomial.

The above synthetic dataset was used before in the PWL fitting literature in [15]. The authors propose an iterated method, which alternates between partitioning the data into affine regions and carrying out least squares fits to update the local coefficients. As the resulting approximation depends on the initial partition, the authors propose running multiple instances of their algorithm to obtain a good PWL fit to $g$.

Note that when the dimension of the problem grows beyond $n = 3$ or $n = 4$, it becomes infeasible to divide large $n$-dimensional intervals $[-l, l]^n$ with a fixed step size, as $K$ becomes equal to $\left(\frac{2l}{\text{step}} + 1\right)^n$. Similar to [18], we propose instead finding the numerical gradients of the data, setting them as the candidate slopes $a_k$, and then applying our tropical sparse method to select some of the regions and determine their constant terms. With this change, the method becomes tractable and grows as $O(m)$. For this example, we fix $p = 2$ and, to obtain the first approximation, we set $\epsilon = 1331$, so that the RMS error is less than 1. The resulting tropical polynomial has $K = 4$ affine regions. From then on, we gradually lower $\epsilon$, so that we get approximations with varying $K$, until $K$ reaches 21. Fig. 5 shows the RMS errors versus the number of affine regions. The results are competitive with those reported in [15], while our method produces approximations with a single run, as opposed to [15], which relies on 10 or 100 different trials, with complexity for each one of $O((n+1)mi)$, $i$ being the number of iterations until convergence.

Figure 5: RMS error vs. number of affine regions of the PWL approximation of $g(x) = \log(\exp(x_1) + \exp(x_2) + \exp(x_3))$.

Max-plus and tropical algebra serve as a framework for various fields, with emerging applications in optimization and machine learning. In this work, we demonstrated how to obtain sparse approximate solutions to max-plus equations and, based on that, introduced a novel method for multivariate convex regression by PWL functions (i.e., tropical regression) with a nearly optimal number of affine regions. The proposed method comes with error bounds for the resulting approximation and has an edge over previously reported tropical regression methods in terms of robustness. In future work, we wish to further study the statistical properties of the tropical estimators when dealing with noisy data. Lastly, an extension of the sparsity results to nonlinear vector spaces called Complete Weighted Lattices [19] would allow one to solve more general problems of regression, using the tools introduced in this work.
Acknowledgements
The authors wish to thank Manos Theodosis for helpful comments on this work.
References

[1] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, 1st ed. Springer, 2010.
[2] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM J. Comput., vol. 24, no. 2, pp. 227–234, 1995.
[3] D. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, pp. 1289–1306, 2006.
[4] E. Candès, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[5] R. Cuninghame-Green, Minimax Algebra. Springer-Verlag, 1979.
[6] F. Baccelli, G. Cohen, G. J. Olsder, and J.-P. Quadrat, Synchronization and Linearity: An Algebra for Discrete Event Systems. J. Wiley & Sons, 1992.
[7] J. Serra, Image Analysis and Mathematical Morphology. Acad. Press, 1982.
[8] H. Heijmans, Morphological Image Operators. Boston: Acad. Press, 1994.
[9] P. Maragos, “Morphological filtering for image enhancement and feature detection,” in The Image and Video Processing Handbook, Second Edition, A. C. Bovik, Ed. Elsevier Acad. Press, 2005, pp. 135–156.
[10] M. Akian, S. Gaubert, and A. Guterman, “Tropical polyhedra are equivalent to mean payoff games,” Int’l J. Algebra and Computation, vol. 22, no. 1, 2012.
[11] S. Gaubert, W. McEneaney, and Z. Qu, “Curse of dimensionality reduction in max-plus based approximation methods: Theoretical estimates and improved pruning algorithms,” in Proc. IEEE Conf. on Decision and Control and Eur. Control Conf., 2011.
[12] P. Butkovič, Max-linear Systems: Theory and Algorithms. Springer, 2010.
[13] A. Tsiamis and P. Maragos, “Sparsity in max-plus algebra,” Discrete Event Dynamic Systems, vol. 29, pp. 163–189, 2019.
[14] W. Hoburg, P. Kirschen, and P. Abbeel, “Data fitting with geometric-programming-compatible softmax functions,” Optim. Eng., vol. 17, pp. 897–918, 2016.
[15] A. Magnani and S. P. Boyd, “Convex piecewise-linear fitting,” Optim. Eng., vol. 10, pp. 1–17, 2009.
[16] J. Kim, L. Vandenberghe, and C. Yang, “Convex piecewise-linear modeling method for circuit optimization via geometric programming,” IEEE Trans. Computer-Aided Design of Integr. Circuits Syst., vol. 29, no. 11, pp. 1823–1827, Nov. 2010.
[17] L. A. Hannah and D. B. Dunson, “Multivariate convex regression with adaptive partitioning,” arXiv preprint, 2011.
[18] P. Maragos and E. Theodosis, “Multivariate tropical regression and piecewise-linear surface fitting,” in Proc. IEEE Int’l Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3822–3826.
[19] P. Maragos, “Dynamical systems on weighted lattices: General theory,” Math. Control Signals Syst., vol. 29, no. 21, 2017.
[20] J. Edmonds, “Submodular functions, matroids, and certain polyhedra,” Combinatorial Structures and Applications, pp. 69–87, 1970.
[21] L. Lovász, “Submodular functions and convexity,” in Mathematical Programming: The State of the Art. Springer, Berlin, Heidelberg, 1983.
[22] A. Krause and D. Golovin, “Submodular function maximization,” in Tractability, 2014.
[23] F. Bach, Learning with Submodular Functions: A Convex Optimization Perspective, arXiv preprint, 2013.
[24] A. Das and D. Kempe, “Approximate submodularity and its applications: Subset selection, sparse approximation and dictionary selection,” Journal of Machine Learning Research, vol. 19, no. 1, pp. 74–107, Jan. 2018.
[25] L. Wolsey, “An analysis of the greedy algorithm for the submodular set covering problem,” Combinatorica, vol. 2, pp. 385–393, 1982.
[26] J. Hook, Max-plus Linear Inverse Problems: 2-norm Regression and System Identification of Max-plus Linear Dynamical Systems with Gaussian Noise, arXiv preprint, 2019.
[27] R. T. Rockafellar, Convex Analysis. Princeton Univ. Press, 1970.
[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge Univ. Press, 2004.
[29] H. Heijmans and P. Maragos, “Lattice calculus of the morphological slope transform,”