Finding sparse solutions of systems of polynomial equations via group-sparsity optimization
F. Lauer (LORIA, Université de Lorraine, CNRS, Inria, France) and H. Ohlsson (Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA; Dept. of Electrical Engineering, Linköping University, Sweden)
November 8, 2018
Abstract
The paper deals with the problem of finding sparse solutions to systems of polynomial equations possibly perturbed by noise. In particular, we show how these solutions can be recovered from group-sparse solutions of a derived system of linear equations. Then, two approaches are considered to find these group-sparse solutions. The first one is based on a convex relaxation resulting in a second-order cone programming formulation which can benefit from efficient reweighting techniques for sparsity enhancement. For this approach, sufficient conditions for the exact recovery of the sparsest solution to the polynomial system are derived in the noiseless setting, while stable recovery results are obtained for the noisy case. Though lacking a similar analysis, the second approach provides a more computationally efficient algorithm based on a greedy strategy adding the groups one-by-one. With respect to previous work, the proposed methods recover the sparsest solution in a very short computing time while remaining at least as accurate in terms of the probability of success. This probability is empirically analyzed to emphasize the relationship between the ability of the methods to solve the polynomial system and the sparsity of the solution.
When faced with an underdetermined system of equations, one typically applies a regularization strategy in order to recover well-posedness. The choice of regularization depends on the particular application at hand and should be made to drive the solution towards desired properties. In the absence of precise goals, the most popular choice favors solutions with minimum $\ell_2$-norm. However, an alternative becoming more and more popular is to search for sparse solutions (which often have non-minimal $\ell_2$-norm). In the case of a system of linear equations, this alternative has been investigated in numerous works, see, e.g., [Bruckstein et al., 2009] for a review, and entails many applications of great importance, particularly for signal processing where it goes under the name of compressed sensing/sampling [Donoho, 2006, Candès, 2006]. Formally, finding the sparsest solutions of linear systems can be written as the minimization, under the constraints of the linear system, of the number of nonzero variables, which is a nonsmooth, nonconvex and NP-hard problem. Two main approaches can be distinguished to tackle such problems. The first one, known as Basis Pursuit (BP), relies on a convex relaxation based on the minimization of an $\ell_1$-norm, while the second one applies a greedy strategy to add nonzero variables one-by-one. Many results on the convergence of these methods to the sparsest solution are available in the literature [Bruckstein et al., 2009, Eldar and Kutyniok, 2012, Foucart and Rauhut, 2013].

More recently, a few works introduced extensions of this problem to nonlinear equations. In particular, the first greedy approaches appeared in [Blumensath and Davies, 2008, Blumensath, 2013, Beck and Eldar, 2013, Ehler et al., 2014], while BP algorithms were developed in [Ohlsson et al., 2013b] to find sparse solutions of systems of quadratic equations and in [Ohlsson et al., 2013a] for more general nonlinear equations. Formally, these problems can be formulated as

$$\min_{x \in \mathbb{R}^n} \|x\|_0 \qquad (1)$$
$$\text{s.t. } y_i = f_i(x), \quad i = 1, \dots, N,$$

where $y_i \in \mathbb{R}$, $f_i : \mathbb{R}^n \to \mathbb{R}$ are nonlinear functions and $\|x\|_0 = |\{j \in \{1, \dots, n\} : x_j \neq 0\}|$ denotes the $\ell_0$-pseudo-norm of $x$, i.e., the number of nonzero components $x_j$.

Here, we focus on the case where the $f_i$'s in (1) are polynomial functions of maximal degree $d$:

$$y_i = f_i(x) = p_i^d(x) = b_i + \sum_{k=1}^M a_{ik}\, x^{\alpha_k}, \quad i = 1, \dots, N, \qquad (2)$$

where $\{\alpha_k\}_{k=1}^M$ is the set of $M = \sum_{q=1}^d \binom{n+q-1}{q}$ multi-indexes with $1 \le |\alpha_k| = \sum_{j=1}^n (\alpha_k)_j \le d$ and $x^{\alpha_k} = \prod_{j=1}^n x_j^{(\alpha_k)_j}$ are the corresponding monomials. This includes the particular case considered in [Ohlsson et al., 2013b] with $d = 2$, while in [Ohlsson et al., 2013a] this setting is used to deal with the more general case via the Taylor expansions of the nonlinear functions $f_i$. Note that the formulation in (1)–(2) also entails cases outside of the quasi-linear setting considered in [Ehler et al., 2014].

The present paper proposes a new BP approach to solve (1)–(2), which, in comparison with the previous works [Ohlsson et al., 2013b, Ohlsson et al., 2013a], is simpler and amounts to solving an easier optimization problem. More precisely, the method proposed in Sect. 2.1 relies on a simple change of variable sufficient to result in an efficient algorithm implemented as a classical $\ell_1$-minimization problem.
However, the structure of the polynomials is discarded and the solution may not satisfy the original polynomial constraints. Note that this change of variable is also found in [Ohlsson et al., 2013a], though with constraints handled in a more complex manner, and closely related to the lifting technique of [Ohlsson et al., 2013b] for quadratic BP and the one of [Vidal et al., 2005] for subspace clustering.

The method in [Ohlsson et al., 2013b] enforces the structure of the quadratic polynomials via rank constraints, which leads to optimization problems with additional levels of relaxation and the introduction of a parameter tuning the trade-off between the minimization of the $\ell_1$-norm on the one hand and the satisfaction of the rank constraint on the other hand. In [Ohlsson et al., 2013a], the structure of higher-degree polynomials is enforced by a set of quadratic constraints on the monomials (for example, with monomials $u_1 = x_1$, $u_2 = x_2$, $u_3 = x_1 x_2$ and $u_4 = x_1^2 x_2$, the structure of $u_3$ and $u_4$ is enforced by $u_3 = u_1 u_2$ and $u_4 = u_1 u_3$; but note that since these constraints are relaxed in the final formulation, the estimation can yield $u_3 \neq u_1 u_2$, which then recursively implies that all monomial constraints involving $u_3$ are meaningless), which are then relaxed as in [Ohlsson et al., 2013b].

Instead, the method proposed in Sect. 2.2 implements the structural knowledge via group-sparsity. This results in an $\ell_1/\ell_2$-minimization, i.e., a second-order cone program (SOCP) which can be solved more efficiently. In addition, this SOCP formulation is easily extended in Sect. 2.4 to benefit from reweighting techniques for improving the sparsity of the solution.

The conditions for exact recovery of the proposed BP methods are analyzed in Sect. 2.3. In particular, we show that though the simple $\ell_1$-minimization does not include structural constraints, exact recovery occurs for sufficiently sparse cases. A similar condition is proved for the $\ell_1/\ell_2$-minimization based on group-sparsity.

The greedy approach is discussed in Sect. 3, where two variants are proposed: an exact algorithm for solving the group-sparsity optimization problem in small-scale cases and an approximate one which remains efficient in higher-dimensional settings. Previous greedy approaches [Blumensath, 2013, Beck and Eldar, 2013] considered the problem in its sparsity-constrained form, where the sum of squared errors over the equations (2) is minimized subject to $\|x\|_0 \le s$ for an a priori fixed $s$. The iterative hard thresholding of [Blumensath, 2013] is a gradient descent algorithm with an additional projection onto the feasible set at each iteration via a simple thresholding operation. In [Beck and Eldar, 2013], this is interpreted as a fixed point iteration enforcing a necessary (but not sufficient) optimality condition, which requires a well-chosen step-size to converge for $d > 1$. The sparse-simplex algorithm of [Beck and Eldar, 2013] is a coordinate descent method which enjoys similar but less restrictive convergence properties while being parameter-free. However, for polynomial equations as in (2), each iteration requires solving several one-dimensional minimizations of a polynomial of degree $2d$, which becomes difficult for $d > 2$. On the contrary, the proposed greedy algorithms remain simple thanks to the group-sparse and linearized formulation of the problem: each iteration requires only solving least-squares problems.

Extensions of the methods are discussed in Sections 4 and 5. In particular, Section 4 deals with the issue of purely nonlinear polynomials, for which the solution to (1)–(2) cannot be estimated as directly as for polynomials involving linear monomials, but which arise in important applications such as phase retrieval [Kohler and Mandel, 1973, Gonsalves, 1976, Gerchberg and Saxton, 1972, Fienup, 1982, Marchesini, 2007, Candès et al., 2013b, Candès et al., 2013a]. Then, the case where the equations (2) are perturbed by noise is considered in Sect. 5. In particular, the analysis provides stable recovery results for polynomial BP denoising in the flavor of the one obtained in [Donoho et al., 2006] for linear BP denoising.

Finally, numerical experiments in Sect. 6 show the efficiency of the proposed methods and extensions. Results are in line with the ones found in classical sparse optimization with linear constraints. In particular, all methods can recover the sparsest solution in sufficiently sparse cases and the greedy approach is the fastest, while the BP methods based on convex relaxations benefit from a slightly higher probability of success.
This section develops two basis pursuit approaches to find sparse solutions of systems of polynomial equations via the minimization of an $\ell_1$-norm for the first one and of a mixed $\ell_1/\ell_2$-norm for the second one. The methods are developed under the following assumption, which will be relaxed in Sect. 4.

Assumption 1. For all $j \in \{1, \dots, n\}$, $\exists i \in \{1, \dots, N\}$ such that $a_{ik} \neq 0$ for $k$ determined such that $(\alpha_k)_l = 1$ for $l = j$ and $(\alpha_k)_l = 0$ for $l \in \{1, \dots, n\} \setminus \{j\}$.

In particular, Assumption 1 ensures that the polynomials include a linear part, or more precisely, that for any variable $x_j$, $j = 1, \dots, n$, the monomial $x_j$ has a nonzero coefficient in at least one of the polynomials.

$\ell_1$-minimization

We start by rewriting the constraints in (1)–(2) as

$$y_i = f_i(x) = b_i + \sum_{k=1}^M a_{ik}\, x^{\alpha_k} = a_i^T \phi(x) + b_i, \qquad (3)$$

where $a_i = [a_{i1}, \dots, a_{iM}]^T$ and $\phi : \mathbb{R}^n \to \mathbb{R}^M$ is a mapping that computes all the monomials of degree $q$, $1 \le q \le d$, with $n$ variables. Note that $x$ is embedded in $\phi(x)$ as the components corresponding to the $n$ multi-indexes $\alpha_{k(j)}$, $j = 1, \dots, n$, such that $(\alpha_{k(j)})_l = 1$ for $l = j$ and $0$ for $l \neq j$. Assuming that these are the first components of $\phi(x)$, i.e., $k(j) = j$, we also define the (linear) inverse mapping $\phi^{-1} : \mathbb{R}^M \to \mathbb{R}^n$, such that $\phi^{-1}(\phi(x)) = x$, as $\phi^{-1}(\phi) = [\,I_n \ \ 0\,]\, \phi$, where $I_n$ is the $n$-by-$n$ identity matrix.

The high-dimensional lifting by $\phi$ allows us to recast (1)–(2) as a standard (i.e., linear) problem in sparse optimization:

$$\min_{\phi \in \mathbb{R}^M} \|\phi\|_0 \qquad (4)$$
$$\text{s.t. } A\phi = y - b,$$

where $A = [a_1, \dots, a_N]^T$, $b = [b_1, \dots, b_N]^T$ and $\phi(x)$ is replaced by an unstructured vector $\phi$. This constitutes the first level of relaxation in the proposed approach, where the components of $\phi$ are not constrained to be interdependent monomials. While this yields a rather crude approximation, it will serve as the basis for the refined approach of Sect. 2.2, where additional structure will be imposed on $\phi$.

The second level of relaxation comes from the BP approach, in which problems such as (4) are typically solved via the convex relaxation

$$\hat{\phi} = \arg\min_{\phi \in \mathbb{R}^M} \|W\phi\|_1 \qquad (5)$$
$$\text{s.t. } A\phi = y - b,$$

where $W = \mathrm{diag}(\|A_1\|_2, \dots, \|A_M\|_2)$, with $A_k$ the $k$th column of $A$, is a diagonal matrix of precompensating weights. Then, for polynomial BP, an estimate of the solution to (1)–(2) can be easily computed under Assumption 1 as $\hat{x} = \phi^{-1}(\hat{\phi})$.

Problem (5) can be formulated as a linear program and solved by generic interior-point algorithms in polynomial time, while it can also be solved by more efficient and dedicated algorithms, see, e.g., [Foucart and Rauhut, 2013, Chapter 15]. However, the complexity with respect to $n$ remains exponential because of the dimension $M$ which scales exponentially with $n$.

The literature on basis pursuit and compressed sensing [Bruckstein et al., 2009, Foucart and Rauhut, 2013] tells us that the sparser the solution to (4) is, the more likely the convex relaxation (5) is to yield its recovery. Here, by construction we know that there is at least a very sparse vector $\phi(x)$ satisfying the constraints: with $x$ the sparse solution to (1)–(2), the sparsity level $\|\phi(x)\|_0 / M$ is better than $\|x\|_0 / n$, as stated by Proposition 1 below. Note that a better bound implying an increased level of sparsity depending on $d$ for very sparse cases is derived in Appendix A.
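To make the change of variable concrete, the following is a minimal illustrative sketch of the lifting (3) and of the weighted $\ell_1$ relaxation (5). It is not the authors' implementation (the paper's code is written in MATLAB with CVX/MOSEK); it is a Python example assuming numpy and the CVXPY package, and all function names (`multi_indexes`, `phi`, `l1_polynomial_bp`) are hypothetical.

```python
# Sketch of the monomial lifting phi and of the weighted l1 relaxation (5).
from itertools import combinations_with_replacement
import numpy as np
import cvxpy as cp

def multi_indexes(n, d):
    """All multi-indexes alpha with 1 <= |alpha| <= d; the n linear monomials come first."""
    alphas = []
    for q in range(1, d + 1):
        for combo in combinations_with_replacement(range(n), q):
            alpha = np.zeros(n, dtype=int)
            for j in combo:
                alpha[j] += 1
            alphas.append(alpha)
    return np.array(alphas)          # shape (M, n) with M = sum_q C(n+q-1, q)

def phi(x, alphas):
    """Monomial lifting: phi(x)_k = prod_j x_j^{(alpha_k)_j}."""
    return np.array([np.prod(x ** a) for a in alphas])

def l1_polynomial_bp(A, b, y):
    """Solve (5): min ||W phi||_1  s.t.  A phi = y - b, with W = diag of column norms."""
    M = A.shape[1]
    w = np.linalg.norm(A, axis=0)    # precompensating weights ||A_k||_2
    phi_var = cp.Variable(M)
    prob = cp.Problem(cp.Minimize(cp.norm(cp.multiply(w, phi_var), 1)),
                      [A @ phi_var == y - b])
    prob.solve()
    return phi_var.value
```

Under Assumption 1, the estimate is then read off as the first $n$ components of the returned vector, e.g. `x_hat = l1_polynomial_bp(A, b, y)[:n]`, which is exactly the inverse mapping $\phi^{-1}$ described above.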
Proposition 1. Let the mapping $\phi : \mathbb{R}^n \to \mathbb{R}^M$ be defined as above. Then, the vector $\phi(x)$ is at least as sparse as the vector $x$ in the sense that the inequality

$$\frac{\|\phi(x)\|_0}{M} \le \frac{\|x\|_0}{n}$$

holds for all $x \in \mathbb{R}^n$.

Proposition 1 implies that if (1)–(2) has a sparse solution then so does (4). To prove Proposition 1, we first need the following lemma.
Lemma 1.
For all triplets $(a, b, c) \in (\mathbb{N}^*)^3$ such that $a \ge b$, the inequality

$$\frac{1}{a}\binom{a+c-1}{c} \ge \frac{1}{b}\binom{b+c-1}{c}$$

holds.

Proof. On the one hand, we have

$$\frac{1}{a}\binom{a+c-1}{c} = \frac{(a+c-1)!}{a\, c!\, (a-1)!} = \frac{1}{a\, c!}\prod_{i=0}^{c-1}(a+i) = \frac{1}{c!}\prod_{i=1}^{c-1}(a+i)$$

and a similar expression with $b$ instead of $a$. On the other hand, with $a \ge b$, we have

$$\forall i \in \mathbb{N},\ a+i \ge b+i \ \Rightarrow\ \frac{1}{c!}\prod_{i=1}^{c-1}(a+i) \ge \frac{1}{c!}\prod_{i=1}^{c-1}(b+i),$$

which then yields the sought statement.

We now give the proof of Proposition 1.

Proof. By construction, the number of nonzeros in $\phi(x)$ is equal to the sum over $q$, $1 \le q \le d$, of the number of monomials of degree $q$ in $\|x\|_0$ variables. This yields

$$\frac{\|\phi(x)\|_0}{M} = \frac{\sum_{q=1}^d \binom{\|x\|_0+q-1}{q}}{\sum_{q=1}^d \binom{n+q-1}{q}} = \frac{\|x\|_0}{n} \cdot \frac{\frac{1}{\|x\|_0}\sum_{q=1}^d \binom{\|x\|_0+q-1}{q}}{\frac{1}{n}\sum_{q=1}^d \binom{n+q-1}{q}}.$$

Since $n \ge \|x\|_0$, Lemma 1 gives the bound

$$\frac{1}{\|x\|_0}\sum_{q=1}^d \binom{\|x\|_0+q-1}{q} \le \frac{1}{n}\sum_{q=1}^d \binom{n+q-1}{q},$$

from which the statement follows.

$\ell_1/\ell_2$-minimization (group sparsity)

The approach proposed above relies on a rather crude approximation. Thus, the polynomial equations are not guaranteed to be satisfied by the solution $\hat{x}$, due to the factorization of the polynomial that may not hold for $\hat{\phi}$. In other words, the linearization in (3) together with the direct optimization of $\phi$ discard the desired structure for $\phi$. In practice, the solution $\hat{x}$ can be checked a posteriori by testing whether $y_i = p_i^d(\hat{x})$, $i = 1, \dots, N$. But in order to increase the probability of obtaining a satisfactory solution, i.e., one which satisfies the original polynomial constraints, we must embed the structure of the polynomial in the problem formulation.

Let us define the index sets

$$I_j = \{k \in \{1, \dots, M\} : (\alpha_k)_j \neq 0\}, \quad j = 1, \dots, n. \qquad (6)$$

Then, the structural information we aim at embedding is given by the following implication:

$$\forall j \in \{1, \dots, n\},\ \forall k \in I_j, \quad \begin{cases} x_j = 0 \ \Rightarrow\ (\phi(x))_k = 0, \\ (\phi(x))_k \neq 0 \ \Rightarrow\ x_j \neq 0, \end{cases} \qquad (7)$$

which formalizes the fact that whenever a variable is zero, all monomials involving this variable must be zero. Such a structure can be favored via group-sparsity optimization, as detailed next.

Let us define the set of mappings $\varphi_j : \mathbb{R}^n \to \mathbb{R}^m$, each computing the subset of components of $\phi$ involving the variable $x_j$ (see Appendix C for the precise value of $m = |I_j|$). Note that these mappings are nonlinear in $x$ but linear in $\phi(x)$: $\varphi_j(x) = B_j \phi(x)$, where $B_j$ is an $m$-by-$M$ binary matrix filled with zeros except for a 1 on each row at the $k$th column for all $k \in I_j$. Further define the $\ell_0$-pseudo-norm of a vector-valued sequence as the number of nonzero vectors in the sequence:

$$\|\{u_j\}_{j=1}^n\|_0 = |\{j \in \{1, \dots, n\} : u_j \neq 0\}|.$$

We call $\phi$ a group-sparse vector if $\|\{B_j \phi\}_{j=1}^n\|_0$ is small. By construction and due to (7), a sparse $x$ leads to a group-sparse $\phi(x)$ with $\|\{B_j \phi(x)\}_{j=1}^n\|_0 = \|x\|_0$. Therefore, in order to solve (1)–(2) we search for a group-sparse solution to the linear system $A\phi = y - b$.

Such group-sparse solutions can be found by solving

$$\min_{\phi \in \mathbb{R}^M} \|\{W_j \phi\}_{j=1}^n\|_0 \qquad (8)$$
$$\text{s.t. } A\phi = y - b,$$

where each $B_j$ is replaced by a weight matrix $W_j = B_j W$, with $W = \mathrm{diag}(\|A_1\|_2, \dots, \|A_M\|_2)$, without effect on the $\ell_0$-pseudo-norm. The reason for introducing these precompensating weights will become clear in the analysis of Sect. 2.3.

Let us define the $\ell_p/\ell_q$-norm of a vector-valued sequence as

$$\|\{u_j\}_{j=1}^n\|_{p,q} = \left(\sum_{j=1}^n \|u_j\|_q^p\right)^{1/p}.$$

Problem (8) can be relaxed by replacing the $\ell_0$-pseudo-norm with an $\ell_p/\ell_q$-norm. In this paper we only consider the case $p = 1$ and $q = 2$, i.e.,

$$\|\{W_j \phi\}_{j=1}^n\|_{1,2} = \left\| \begin{bmatrix} \|W_1 \phi\|_2 \\ \vdots \\ \|W_n \phi\|_2 \end{bmatrix} \right\|_1,$$

but other norms could be used, such as the $\ell_1/\ell_\infty$-norm. This yields the convex relaxation

$$\hat{\phi} = \arg\min_{\phi \in \mathbb{R}^M} \sum_{j=1}^n \|W_j \phi\|_2 \qquad (9)$$
$$\text{s.t. } A\phi = y - b,$$

which is easily reformulated as a Second Order Cone Program (SOCP) that can be solved by generic software such as CVX [Grant and Boyd, 2013, Grant and Boyd, 2008] or MOSEK [Andersen and Andersen, 2000] (more efficient dedicated solvers can also be found, e.g., [Deng et al., 2011]). However, as for the $\ell_1$-minimization (5), the complexity remains exponential with respect to the number $n$ of base variables via the dimension $M$.
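For illustration, the group-sparse relaxation (9) can be prototyped in a few lines. The sketch below is an assumption-laden Python example (CVXPY, hypothetical names `index_sets` and `group_l1l2_bp`), not the authors' MATLAB/CVX implementation; it reuses the `multi_indexes` convention from the earlier sketch, where the rows of `alphas` store the multi-indexes $\alpha_k$.

```python
# Sketch of the l1/l2 group-sparse relaxation (9), assuming CVXPY.
import numpy as np
import cvxpy as cp

def index_sets(alphas):
    """I_j = {k : (alpha_k)_j != 0}: one group of monomial indexes per base variable."""
    n = alphas.shape[1]
    return [np.flatnonzero(alphas[:, j] > 0) for j in range(n)]

def group_l1l2_bp(A, b, y, alphas):
    """Solve (9): min sum_j ||W_j phi||_2  s.t.  A phi = y - b."""
    M = A.shape[1]
    w = np.linalg.norm(A, axis=0)                  # diagonal of the weight matrix W
    groups = index_sets(alphas)
    phi_var = cp.Variable(M)
    objective = sum(cp.norm(cp.multiply(w[I], phi_var[I]), 2) for I in groups)
    prob = cp.Problem(cp.Minimize(objective), [A @ phi_var == y - b])
    prob.solve()
    return phi_var.value
```

The nonnegativity constraints discussed next (problem (10)) can be added by appending `phi_var[k] >= 0` to the constraint list for every $k$ such that all entries of $\alpha_k$ are even.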
Adding structure via constraints. All components of the mapping $\phi(x)$ involving only even degrees of the base variables $x_j$ are nonnegative. These structural constraints can help to drive the solution towards one that correctly factorizes and corresponds to a solution of the polynomial equations. This is obtained by solving

$$\hat{\phi} = \arg\min_{\phi \in \mathbb{R}^M} \sum_{j=1}^n \|W_j \phi\|_2 \qquad (10)$$
$$\text{s.t. } A\phi = y - b,$$
$$\phi_k \ge 0, \ \forall k \text{ such that } (\alpha_k)_j \text{ is even for all } j.$$

This optimization problem has the same complexity as (9) since we simply added nonnegativity constraints to some variables.

Other forms of prior knowledge can be easily introduced in (10). For instance, if (tight) box constraints on the base variables are available, then lower and upper bounds on all monomials can be derived. Finally, note that these structural constraints can also be added to the $\ell_1$-minimization (5).

The following derives conditions for exact recovery of the sparse solution to the system of polynomial equations via the various convex relaxations. These conditions are based on the mutual coherence of the matrix $A$ as defined, e.g., in [Donoho and Huo, 2001, Bruckstein et al., 2009].

Definition 1. The mutual coherence of a matrix $A = [A_1, \dots, A_M]$ is

$$\mu(A) = \max_{1 \le i < j \le M} \frac{|A_i^T A_j|}{\|A_i\|_2 \|A_j\|_2}.$$
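Since the recovery conditions below are all stated in terms of $\mu(A)$, it is useful to be able to evaluate it numerically. The following is a small Python helper computing the quantity of Definition 1 for a given coefficient matrix; the function name is hypothetical and the code assumes the nonzero-column condition of Assumption 2 below so that the normalization is well defined.

```python
import numpy as np

def mutual_coherence(A):
    """mu(A) = max_{i<j} |A_i^T A_j| / (||A_i||_2 ||A_j||_2), over the columns of A."""
    norms = np.linalg.norm(A, axis=0)       # requires nonzero columns (Assumption 2)
    G = np.abs(A.T @ A) / np.outer(norms, norms)   # normalized Gram matrix
    np.fill_diagonal(G, 0.0)                 # ignore the diagonal terms (i = j)
    return G.max()
```

Such a helper can be used, for instance, to check a posteriori whether a given polynomial system satisfies a condition such as (11) below.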
In order for the mutual coherence of $A$ to be defined, we focus on the case where the following assumption holds.

Assumption 2. All columns $A_k$ of the matrix $A$ are nonzero, i.e., $A_k \neq 0$, $k = 1, \dots, M$, or, equivalently, for all $k \in \{1, \dots, M\}$, $\exists i \in \{1, \dots, N\}$ such that the corresponding polynomial coefficient $a_{ik} \neq 0$.

Assumption 2 is slightly more restrictive than Assumption 1, which only constrains the first $n$ columns of $A$. If Assumption 2 does not hold, the following analysis can be reproduced under Assumption 1 by considering the submatrix $\tilde{A} \in \mathbb{R}^{N \times \tilde{M}}$ containing the $\tilde{M}$ nonzero columns of $A$ and a similarly truncated mapping $\phi : \mathbb{R}^n \to \mathbb{R}^{\tilde{M}}$. Adjustments then need to be made where numbers of columns are used, i.e., by substituting $\tilde{M} \le M$ and $\tilde{m}_j \le m$ for $M$ and $m$. However, for the case where both Assumptions 1 and 2 do not hold, i.e., of zero columns corresponding to base variables, $A_k = 0$, $k \in \{1, \dots, n\}$, $\hat{x}$ cannot be obtained by the inverse mapping $\phi^{-1}$. This particular case will be discussed in Sect. 4.

Note that whenever a condition requires the mutual coherence $\mu(A)$ to be defined, Assumption 2 implicitly holds. In such cases, Assumption 2 will not be explicitly stated in the theorems below.

$\ell_1$-minimization method

The result below characterizes a case where the simple $\ell_1$-minimization method is sufficient to solve the sparse optimization problem (1)–(2).

Theorem 1. Let $x$ denote the unique solution to (1)–(2). If the inequality

$$\|x\|_0 < \frac{n}{2M}\left(1 + \frac{1}{\mu(A)}\right) \qquad (11)$$

holds, then the solution $\hat{\phi}$ to (5) is unique and equal to $\phi(x)$, thus providing $\hat{x} = \phi^{-1}(\hat{\phi}) = x$.

Proof. Assume (1)–(2) has a unique solution $x$. Then, $A\phi = y - b$ has a solution $\phi = \phi(x)$ with a sparsity bounded by Proposition 1 as $\|\phi\|_0 \le \frac{M}{n}\|x\|_0$. But, according to Theorem 7 in [Bruckstein et al., 2009], if

$$\|\phi\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(A)}\right),$$

then, on the one hand, $\phi$ is the sparsest solution to $A\phi = y - b$, and on the other hand, it is also the unique solution to (5). Thus, if (11) holds, $\phi$ is the unique solution to both (4) and (5), i.e., $\hat{\phi} = \phi$. Since $\hat{x}$ is given by the first $n$ components of $\hat{\phi}$ and $x$ by the ones of $\phi$, this completes the proof.

Other less conservative conditions can be similarly obtained by considering the exact value of $\|\phi\|_0$ or tighter bounds (see Appendix B), but these do not take the form of a simple inequality on $\|x\|_0$. Another condition for very sparse cases (with $\|x\|_0 \le n/d - 1$) can be similarly obtained by using Proposition 2 in Appendix A instead of Proposition 1.

$\ell_1/\ell_2$-minimization method

The first result below shows that the group-sparse problem (8) can be used as a proxy for the polynomial problem (1)–(2).

Theorem 2. If the solution $\phi^*$ to (8) is unique and yields $x^* = \phi^{-1}(\phi^*)$ such that $x^*$ is a solution to the system of polynomial equations (2), then $x^*$ is the sparsest solution to the system of polynomial equations, i.e., the unique solution to (1)–(2).

Proof. Assume there is an $x \neq x^*$ that is a solution to the polynomial system (2) and at least as sparse as $x^*$.
Then A φ ( x ) = y − b and (cid:107){ W j φ ( x ) } nj =1 (cid:107) = (cid:107) x (cid:107) ≤ (cid:107) x ∗ (cid:107) ≤ (cid:107){ W j φ ∗ } nj =1 (cid:107) , which contradicts the fact that φ ∗ is the unique solution to (8) unless φ ( x ) = φ ∗ and x = x ∗ .The next result provides a condition on the sparsity of the solution to (1)–(2) under which it canbe recovered by solving the convex problem (9) or its variant with nonnegativity constraints (10).This result requires the following lemma. Lemma 2. Let A = [ A , . . . , A M ] be an N × M matrix with mutual coherence µ ( A ) as definedin Definition 1. Let W be the M × M -diagonal matrix of entries w i = (cid:107) A i (cid:107) . Then, for all δ ∈ Ker ( A ) and i ∈ { , . . . , M } , the bound w i δ i ≤ µ ( A )1 + µ ( A ) (cid:107) W δ (cid:107) (12) holds.Proof. For all δ ∈ Ker ( A ), we have Aδ = 0 ⇒ A T Aδ = 0, which further implies W − A T AW − W δ = 0 and − W δ = (cid:16) W − A T AW − − I (cid:17) W δ . Then, we have w i δ i = ( W δ ) i = (cid:16) ( W − A T AW − − I ) W δ (cid:17) i = M (cid:88) j =1 (cid:16) W − A T AW − − I (cid:17) i,j ( W δ ) j ≤ M (cid:88) j =1 (cid:16) W − A T AW − − I (cid:17) i,j ( W δ ) j . Note that W − A T AW − is a normalized matrix with ones on the diagonal and off diagonal( i, j )-entries equal to | A Ti A j |(cid:107) A i (cid:107) (cid:107) A j (cid:107) and bounded by µ ( A ) via Definition 1. Thus, we obtain w i δ i ≤ µ ( A ) (cid:88) j (cid:54) = i w j δ j and, by adding µ ( A ) w i δ i to both sides, the bound w i δ i ≤ µ ( A )1 + µ ( A ) M (cid:88) j =1 w j δ j , which is precisely (12). Theorem 3. Let x be a solution to the polynomial equations (2) and φ = φ ( x ) . If the condition (cid:107) x (cid:107) < √ m (cid:115) µ ( A ) , where m is the number of components of φ associated to a variable, holds, then φ is the uniquesolution to both (9) and (10) . roof. The vector φ is the unique solution to (9) if the inequality n (cid:88) j =1 (cid:107) W j ( φ + δ ) (cid:107) > n (cid:88) j =1 (cid:107) W j φ (cid:107) holds for all δ (cid:54) = satisfying Aδ = 0. The inequality above can be rewritten as (cid:88) j ∈ I (cid:107) W j δ (cid:107) + (cid:88) j / ∈ I (cid:107) W j ( φ + δ ) (cid:107) − (cid:107) W j φ (cid:107) > , where I = { j ∈ { , . . . , n } : W j φ = } . By the triangle inequality, (cid:107) u + v (cid:107) − (cid:107) u (cid:107) ≥ −(cid:107) v (cid:107) ,this condition is met if (cid:88) j ∈ I (cid:107) W j δ (cid:107) − (cid:88) j / ∈ I (cid:107) W j δ (cid:107) > n (cid:88) j =1 (cid:107) W j δ (cid:107) − (cid:88) j / ∈ I (cid:107) W j δ (cid:107) > . (13)By defining G j as the set of indexes corresponding to nonzero columns of W j , Lemma 2 yields (cid:107) W j δ (cid:107) = (cid:88) i ∈ G j w i δ i ≤ m j µ ( A )1 + µ ( A ) (cid:107) W δ (cid:107) , where m j = m is the number of components of φ associated to a variable x j . Due to the fact that (cid:83) k ∈{ ,...,n } G k = { , . . . , M } , we also have (cid:107) W δ (cid:107) = M (cid:88) i =1 w i δ i ≤ n (cid:88) k =1 (cid:88) i ∈ G k w i δ i = n (cid:88) k =1 (cid:107) W k δ (cid:107) ≤ (cid:32) n (cid:88) k =1 (cid:107) W k δ (cid:107) (cid:33) , which then leads to (cid:107) W j δ (cid:107) ≤ m µ ( A )1 + µ ( A ) (cid:32) n (cid:88) k =1 (cid:107) W k δ (cid:107) (cid:33) . Introducing this result in (13) gives the condition n (cid:88) j =1 (cid:107) W j δ (cid:107) − n − | I | ) µ ( A ) √ m (cid:112) µ ( A ) n (cid:88) k =1 (cid:107) W k δ (cid:107) > . 
Finally, given that | I | = n − (cid:107){ W j φ } nj =1 (cid:107) = n − (cid:107) x (cid:107) , this yields n (cid:88) j =1 (cid:107) W j δ (cid:107) − (cid:107) x (cid:107) µ ( A ) √ m (cid:112) µ ( A ) n (cid:88) k =1 (cid:107) W k δ (cid:107) > . (14)or, after dividing by (cid:107){ W j δ } nj =1 (cid:107) , and rearranging the terms, (cid:107) x (cid:107) < (cid:112) µ ( A )2 µ ( A ) √ m , which can be rewritten as in the statement of the Theorem.It remains to prove that φ is also the unique solution to (10) in the case where this conditionis satisfied and φ is the unique solution to (9). To see this, note that φ = φ ( x ) is by definitiona feasible point of (10). In addition, the feasible set of (10) is included in the one of (9), whilethe problems (9) and (10) share the same cost function. Thus, if φ is the unique solution to (9),there cannot be another feasible φ (cid:54) = φ with lower or equal cost function value for (10).9 orollary 1. Let x be a solution to the polynomial equations (2) and m be the number of com-ponents of φ associated to a variable. If the condition (cid:107) x (cid:107) < √ m (cid:115) µ ( A ) holds, then x is the unique solution to the minimization problem (1) – (2) and it can be computedas x = φ − ( ˆ φ ) with ˆ φ the solution to either (9) or (10) .Proof. Assume there exists another solution x (cid:54) = x to (1)–(2), and thus with (cid:107) x (cid:107) ≤ (cid:107) x (cid:107) .Then, Theorem 3 implies that both φ ( x ) and φ ( x ) are unique solutions to either (9) or (10), andthus that φ ( x ) = φ ( x ) = ˆ φ (where ˆ φ is similarly defined as the solution to either (9) or (10) inthis case). But this contradicts the definition of the mapping φ implying φ ( x ) (cid:54) = φ ( x ) whenever x (cid:54) = x . Therefore, the assumption x (cid:54) = x cannot hold and x is the unique solution to (1)–(2),while φ − ( ˆ φ ) = φ − ( φ ( x )) = x . Theorem 4. Let ˆ φ be a solution to (9) and m be the number of components of φ associated to avariable. If the condition (cid:107){ W j ˆ φ } nj =1 (cid:107) < √ m (cid:115) µ ( A ) holds, then ˆ φ is the unique solution to both (8) and (9) . If, in addition, ˆ x = φ − ( ˆ φ ) satisfies thepolynomial constraints (2) , then ˆ x is the unique solution to (1) – (2) .Proof. By following steps similar to those in the proof of Theorem 3 with ˆ φ instead of φ , weobtain (by replacing (cid:107) x (cid:107) by (cid:107){ W j ˆ φ } nj =1 (cid:107) in (14)) that ˆ φ is the unique solution to (9). Then,let φ ∗ be a solution to (8). This implies (cid:107){ W j φ ∗ } nj =1 (cid:107) ≤ (cid:107){ W j ˆ φ } nj =1 (cid:107) and, under the conditionof the Theorem, (cid:107){ W j φ ∗ } nj =1 (cid:107) < √ m (cid:115) µ ( A ) . (15)By following similar steps again, we obtain that φ ∗ is also the unique solution to (9) and thus that φ ∗ = ˆ φ is the unique solution to (8).Finally, since ˆ φ is the unique solution to (8), if ˆ x satisfies the polynomial equations (2), Theo-rem 2 implies that ˆ x is the solution to (1)–(2).In comparison with Theorem 3, Theorem 4 provides a condition that only depends on theestimate obtained by solving (9) rather than on the sought solution. In practice, convex relaxations such as the ones described above provide a good step towardsthe solution but might fail to yield the exact solution with sufficient sparsity. In such cases, itis common practice to improve the sparsity of the solution by repeating the procedure with awell-chosen weighting of the variables as described in [Cand`es et al., 2008, Le et al., 2013]. 
These techniques can be directly applied to improve the $\ell_1$-minimization method of Sect. 2.1, while they are adapted below to group-sparsity as considered in Sect. 2.2.

Iteratively reweighted $\ell_1/\ell_2$-minimization. The classical reweighting scheme of [Candès et al., 2008] for sparse recovery improves the sparsity of the solution by solving a sequence of linear programs. It can be adapted to the group-sparse recovery problem by iteratively solving

$$\hat{\phi} = \arg\min_{\phi \in \mathbb{R}^M} \sum_{j=1}^n \mu_j \|W_j \phi\|_2 \qquad (16)$$
$$\text{s.t. } A\phi = y - b,$$
$$\phi_k \ge 0, \ \forall k \text{ such that } (\alpha_k)_j \text{ is even for all } j,$$

with weights $\mu_j$ initially set to 1 and refined at each iteration by

$$\mu_j = \frac{1}{\|W_j \hat{\phi}\|_2 + \epsilon}$$

for a given small value of $\epsilon > 0$. This decreases the weights of the groups with large $\ell_2$-norms that are assumed to be nonzero in the solution while increasing the weight of groups with small norms in order to force them towards zero.

Selective $\ell_1/\ell_2$-minimization. The S$\ell_1$M algorithm proposed in [Le et al., 2013] is another reweighted $\ell_1$-minimization mechanism which sets a single weight to zero at each iteration. Though typically requiring more computation time than the previous approach due to a number of iterations equal to the number of nonzero elements, this algorithm can recover sparse solutions in cases where the classical reweighting scheme of [Candès et al., 2008] fails. Other advantages include the absence of a tuning parameter and the presence of a convergence analysis [Le et al., 2013]. The S$\ell_1\ell_2$M algorithm below is an adaptation of S$\ell_1$M to group-sparse problems.

1. Initialize all weights $\mu_j = 1$, $j = 1, \dots, n$.
2. Solve the weighted group-sparse problem (16).
3. Find $k \in \arg\max_{j \in \{1, \dots, n\}} \|W_j \hat{\phi}\|_2$.
4. Set $\mu_k = 0$ (to relax the sparsity constraint on the $k$th group).
5. Repeat from Step 2 until $\sum_{j=1}^n \mu_j \|W_j \hat{\phi}\|_2 = 0$.

Note that in this algorithm, the number of iterations is equal to the number of nonzero groups, which, for polynomial basis pursuit, is $\|x\|_0$ and is typically small. This results in a fast and accurate method for polynomial basis pursuit, as will be seen in Sect. 6. A sketch of this selective reweighting loop is given below.
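The following Python sketch (CVXPY assumed, hypothetical names, and again not the paper's MATLAB code) illustrates the S$\ell_1\ell_2$M loop above, with (16) implemented without the optional nonnegativity constraints for brevity. One small safeguard is added with respect to the listed steps: the selection in Step 3 is restricted to groups whose weight is still nonzero, which guarantees that each iteration frees a new group.

```python
# Sketch of the selective l1/l2 reweighting loop (S l1 l2 M), assuming CVXPY.
import numpy as np
import cvxpy as cp

def solve_weighted(A, b, y, groups, w, mu):
    """Weighted group-sparse problem (16), without the nonnegativity constraints."""
    phi_var = cp.Variable(A.shape[1])
    obj = sum(mu[j] * cp.norm(cp.multiply(w[I], phi_var[I]), 2)
              for j, I in enumerate(groups))
    cp.Problem(cp.Minimize(obj), [A @ phi_var == y - b]).solve()
    return phi_var.value

def selective_l1l2(A, b, y, groups, tol=1e-8):
    n = len(groups)
    w = np.linalg.norm(A, axis=0)
    mu = np.ones(n)
    while True:
        phi_hat = solve_weighted(A, b, y, groups, w, mu)
        norms = np.array([np.linalg.norm(w[I] * phi_hat[I]) for I in groups])
        if (mu * norms).sum() <= tol:          # Step 5: all penalized groups are (numerically) zero
            return phi_hat
        penalized = np.flatnonzero(mu > 0)     # safeguard: only free groups still penalized
        mu[penalized[np.argmax(norms[penalized])]] = 0.0
```

In practice the loop terminates after roughly $\|x\|_0$ iterations, in line with the remark above about the number of nonzero groups.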
As mentioned in the introduction, there are two major techniques to minimize an $\ell_0$-pseudo-norm: the BP approach and the greedy approach. We now consider the second one to solve problem (8), and more particularly develop two variants of the greedy approach: the exact method and the approximate method. The exact method is intended for small-scale problems where the number of possible combinations of base variables remains small. The approximate method is designed to circumvent this limitation and applies to much larger problems.

Exact greedy algorithm. The exact method is implemented as follows, where we let $\epsilon = 0$ if the polynomial system of equations is assumed to be feasible, and $\epsilon$ set to a small positive tolerance otherwise:

1. Initialize $\hat{n} = 0$ and $e = +\infty$.
2. $\hat{n} \leftarrow \hat{n} + 1$ (the maximal number of iterations is the number of groups $n$, but if the correct sparsity pattern is recovered then the algorithm stops earlier).
3. For all combinations $C$ of $\hat{n}$ variables among $n$:
   (a) Set $S = \{1, \dots, M\} \setminus \bigcup_{k \notin C} I_k$, where the index sets $I_k$ are defined as in (6).
   (b) Build the submatrix $A_S$ with the columns of $A$ with index in $S$.
   (c) Solve $\phi_S = \arg\min_{\phi \in \mathbb{R}^{|S|}} \|A_S \phi + b - y\|_2$.
   (d) Update $e \leftarrow \min\{e, \|A_S \phi_S + b - y\|_2\}$.
   (e) If $e \le \epsilon$, compute $\hat{\phi}$ by setting its components of index in $S$ to the values in $\phi_S$ and the others to 0. Return $\hat{x} = \phi^{-1}(\hat{\phi})$.
4. If $\hat{n} < n$, repeat from Step 2, otherwise return an infeasibility certificate.

In Step 3.(a), $S$ corresponds to the support of $\phi(x)$ when $\mathrm{supp}(x) = C$, where $C$ is a combination of $\hat{n}$ indexes from 1 to $n$. The maximal number of least squares (LS) problems to solve in Step 3.(c) is $2^n$. But many of these are spared by starting with the sparsest combinations and stopping as soon as a solution is found. Thus, if a sparse solution $x$ with $\|x\|_0 \le n/2$ exists, at most

$$\sum_{t=1}^{\|x\|_0} \binom{n}{t} \le \|x\|_0 \binom{n}{\|x\|_0}$$

LS problems are solved. At iteration $t$, the LS problem is of size $N \times |S|$, with $|S| = \sum_{q=1}^d \binom{t+q-1}{q} \le d\binom{t+d-1}{d}$. Thus, assuming a complexity $LS(N_1, N_2)$ for an LS problem of size $N_1 \times N_2$, the overall complexity of the algorithm scales as

$$\|x\|_0 \binom{n}{\|x\|_0}\, LS\!\left(N,\ d\binom{\|x\|_0 + d - 1}{d}\right),$$

which is exponential in $n$, $d$ and $\|x\|_0$.

With $\epsilon = 0$, the exact greedy algorithm above can be slightly modified to compute all the solutions to (8), simply by letting the for loop in Step 3 complete instead of returning as soon as a solution is found. As a result, the algorithm could provide a uniqueness certificate for the solution of (8), from which Theorem 2 could be applied to conclude that the solution coincides with the unique minimizer of (1)–(2).

Approximate greedy algorithm. The approximate method is similar except that it explores only a single branch of the tree of possible combinations. Its implementation uses a set $S$ of retained variables (more precisely, $S$ contains the indexes of these variables):

1. Initialize the set of nonzero variables: $S = \emptyset$.
2. For all $j \in \{1, \dots, n\} \setminus S$,
   (a) Set $S_j = \{1, \dots, M\} \setminus \bigcup_{k \notin S \cup \{j\}} I_k$, where the index sets $I_k$ are defined as in (6).
   (b) Build the submatrix $A_{S_j}$ with the columns of $A$ with index in $S_j$.
   (c) Solve $\phi_j = \arg\min_{\phi \in \mathbb{R}^{|S_j|}} \|A_{S_j} \phi + b - y\|_2$.
3. Select the variable that minimizes the error if added to $S$: $k = \arg\min_{j \in \{1, \dots, n\} \setminus S} \|A_{S_j} \phi_j + b - y\|_2$.
4. Update $S \leftarrow S \cup \{k\}$.
5. Repeat from Step 2 until $\|A_{S_k} \phi_k + b - y\|_2 \le \epsilon$.
6. Compute $\hat{\phi}$ by setting its components of index in $S_k$ to the values in $\phi_k$ and the others to 0.
7. Return $\hat{x} = \phi^{-1}(\hat{\phi})$ and the error $\|A_{S_k} \phi_k + b - y\|_2$.

The algorithm starts with an empty set of nonzero variables $S$ and adds a single variable to that set at each iteration. The variable retained at a given iteration is the one that, if added, leads to the minimum sum of squared errors for the equations $A_{S_j} \phi_j = y - b$. In Step 2.(a), $S_j$ corresponds to the support of $\phi(x)$ when $\mathrm{supp}(x) = S \cup \{j\}$. Note that the value of the minimizer $\phi_k$ is not retained but re-estimated at the next iteration. The reason for this is that there is no guarantee that the components of $\phi_k$ correspond to monomials of base variables. A sketch of this procedure is given after this paragraph.
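The sketch below is a plain numpy illustration of the approximate greedy algorithm under Assumption 1 (so that $\hat{x}$ can be read off the first $n$ components of $\hat{\phi}$). The names `approximate_greedy` and `groups` are hypothetical and follow the earlier sketches, where `groups[j]` holds the index set $I_j$; a stop after $n$ iterations is added as a safeguard for infeasible or very noisy data.

```python
# Numpy sketch of the approximate greedy algorithm for group-sparse recovery.
import numpy as np

def approximate_greedy(A, b, y, groups, eps=1e-10):
    """groups[j] = I_j, the indexes of the monomials involving variable x_j."""
    n, M = len(groups), A.shape[1]
    S = set()                                   # indexes of retained base variables
    while True:
        best = None                             # (error, support S_j, phi on support)
        for j in sorted(set(range(n)) - S):
            # S_j = {1,...,M} minus the union of I_k over excluded variables k
            excluded = set(range(n)) - (S | {j})
            forbidden = set().union(*(set(groups[k]) for k in excluded))
            support = sorted(set(range(M)) - forbidden)
            phi_j, *_ = np.linalg.lstsq(A[:, support], y - b, rcond=None)
            err = np.linalg.norm(A[:, support] @ phi_j + b - y)
            if best is None or err < best[0]:
                best, k = (err, support, phi_j), j
        err, support, phi_k = best
        S.add(k)                                # retain the variable with the smallest error
        if err <= eps or len(S) == n:           # stop once the residual is small enough
            phi_hat = np.zeros(M)
            phi_hat[support] = phi_k
            return phi_hat[:n], err             # x_hat = first n components (Assumption 1)
```

Each pass of the inner loop is a single dense least-squares solve, which is what makes this variant cheap compared with the exhaustive enumeration of the exact method.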
Since one base variable is added at each iteration $t$, the number of iterations equals the sparsity of the returned solution and $t \le \|\hat{x}\|_0$. In this approximate variant of the algorithm, each iteration requires solving only $(n - t + 1)$ LS problems, which yields a total number of LS problems equal to $\sum_{t=1}^{\|\hat{x}\|_0}(n - t + 1) \le n\|\hat{x}\|_0$. Then, bounding the complexity of each LS problem as for the exact greedy algorithm leads to an overall complexity bounded by

$$n\|\hat{x}\|_0\, LS\!\left(N,\ d\binom{\|\hat{x}\|_0 + d - 1}{d}\right).$$

Since we do not have a result on the convergence of the algorithm to the sparsest solution, we cannot bound $\|\hat{x}\|_0$ except by $n$ and the worst-case complexity is exponential in $n$. However, in practice, if the algorithm finds a sparse solution $\hat{x}$, the complexity is exponential in its sparsity $\|\hat{x}\|_0$, but only linear in $n$. This is rather promising given the empirical results to be shown in Sect. 6, which suggest that this occurs with high probability.

We now consider the case of purely nonlinear polynomials $p_i^d(x)$ without a linear part for some variables, i.e., with $a_{ik} = 0$, $i = 1, \dots, N$, for some $k \in \{1, \dots, n\}$ (according to the ordering of the multi-indexes, the linear monomials correspond to the first coefficients with $1 \le k \le n$). For this specific case where Assumption 1 does not hold, some of the first $n$ columns of $A$ are zero and the corresponding components of $\phi$ are unconstrained, thus set to arbitrary values in $\hat{\phi}$. As a result, not only does the analysis of Sect. 2.3 not hold, but the estimate $\hat{x}$ cannot be directly obtained by the inverse mapping $\phi^{-1}$ as the first components of $\hat{\phi}$.

However, the core of the method remains applicable to purely nonlinear polynomials. More precisely, we can solve (5), (9) or (10) (or apply a reweighting scheme of Sect. 2.4) to obtain $\hat{\phi}$ and the corresponding support of $\hat{x}$ as

$$\mathrm{supp}(\hat{x}) = \{j \in \{1, \dots, n\} : \|W_j \hat{\phi}\|_2 \neq 0\},$$

while the greedy algorithms of Sect. 3 directly estimate $\mathrm{supp}(\hat{x})$. Then, the estimate $\hat{x}$ can be computed from the higher-degree monomials as, e.g., $\hat{x}_j = \pm\sqrt[q]{\hat{\phi}_{jq}}$, where the subscript $jq$ denotes the index such that $(\phi(\cdot))_{jq} : x \mapsto x_j^q$. The precise procedure to compute $\hat{x}$ actually depends on the monomials involved in the polynomials. The most straightforward manner is to compute $\hat{x}_j$ from the estimate of its smallest nonzero odd power:

$$\forall j \in \mathrm{supp}(\hat{x}),\quad \hat{x}_j = \sqrt[2\hat{p}+1]{\hat{\phi}_{j(2\hat{p}+1)}}, \quad \text{with } \hat{p} = \min_{p \in \{0, \dots, (d-1)/2\}} p \ \text{ s.t. } \hat{\phi}_{j(2p+1)} \neq 0. \qquad (17)$$

But for polynomial systems that involve only monomials with even degrees, the minimization computing $\hat{p}$ in (17) has no solution and the procedure is slightly more complex. For instance, with purely quadratic equations, the absolute value of $\hat{x}_j$ is given by $|\hat{x}_j| = \sqrt{\hat{\phi}_{j2}}$ and the sign must be determined by looking at the signs of the estimates of the bilinear terms $\widehat{x_i x_j}$, $i \neq j$. In addition, note that for such cases, the solution of (1)–(2) is not unique for symmetry reasons and the method cannot be analyzed as in Sect. 2.3 in terms of convergence towards the sparsest solution. Then, a different notion of uniqueness is usually considered in the literature dedicated to purely quadratic equations [Balan et al., 2006, Bandeira et al., 2014, Ohlsson and Eldar, 2013, Ranieri et al., 2013].

In many applications, the equations $y_i = f_i(x)$ need to be relaxed to an error-tolerant form for various reasons, which can for instance be interpreted as having access to noisy measurements, $y_i = f_i(x) + e_i$, with unknown noise terms $e_i$. In this case, we reformulate the general problem (1) as a denoising one:

$$\min_{x \in \mathbb{R}^n,\, e \in \mathbb{R}^N} \|x\|_0 \qquad (18)$$
$$\text{s.t. } y_i = f_i(x) + e_i, \quad i = 1, \dots, N,$$
$$\|e\|_p \le \varepsilon,$$

where $\varepsilon > 0$ is a bound on the noise $\ell_p$-norm, with typical choices for $p$ being $p = 1$, $p = 2$ or $p = \infty$.

With polynomial constraints,

$$y_i = f_i(x) + e_i = p_i^d(x) + e_i = b_i + \sum_{k=1}^M a_{ik}\, x^{\alpha_k} + e_i, \quad i = 1, \dots, N, \qquad (19)$$

convex relaxations similar to the ones described in Sect. 2 can be derived to obtain polynomial BP denoising methods. This leads to solving

$$\hat{\phi} = \arg\min_{\phi \in \mathbb{R}^M} \|W\phi\|_1 \qquad (20)$$
$$\text{s.t. } \|A\phi + b - y\|_p \le \varepsilon,$$

for the $\ell_1$-minimization method and

$$\hat{\phi} = \arg\min_{\phi \in \mathbb{R}^M} \sum_{j=1}^n \|W_j \phi\|_2 \qquad (21)$$
$$\text{s.t. } \|A\phi + b - y\|_p \le \varepsilon$$

with $\ell_1/\ell_2$-minimization. For $p \in \{1, \infty\}$, problem (20) remains a linear program, while $p = 2$ leads to an SOCP. Problem (21) can still be written as an SOCP for all $p \in \{1, 2, \infty\}$ and be solved by the generic solvers that apply to (9). Thus, the enhancements proposed in Sect. 2 for the formulations (5) and (9) (such as reweighting schemes or the addition of structural constraints) can be easily transposed to the noisy case.

Regarding the greedy algorithms of Sect. 3, they are already applicable to the noisy case, for which it suffices to set an appropriate threshold $\varepsilon$ on the noise $\ell_2$-norm. Adaptations of these algorithms to $p = 1$ or $p = \infty$ are straightforward, but require solving a convex optimization problem in sub-step (c) without a closed-form solution and thus without the same computational benefit for the approximate greedy approach.

In the noisy case, the solution to (18) is in general not unique and the analysis focuses on stability rather than on exact recovery. The following theorem provides such a stability result for the $\ell_1$-minimization.

Theorem 5. Let $(x, e)$ denote a solution to (18)–(19) for $p = 2$. If the inequality

$$\|x\|_0 < \frac{n}{4M}\left(1 + \frac{1}{\mu(A)}\right) \qquad (22)$$

holds, then $\hat{x} = \phi^{-1}(\hat{\phi})$ with $\hat{\phi}$ the solution to (20) for $p = 2$ must obey

$$\|\hat{x} - x\|_2^2 \le \frac{4\varepsilon^2}{1 - \mu(A)(4M\|x\|_0/n - 1)}. \qquad (23)$$

Proof. Assume (18)–(19) has a solution $(x, e)$. Then, $A\phi = y - b - e$ with $\phi = \phi(x)$ and $\|e\|_2 \le \varepsilon$. By Proposition 1, we have $\|\phi\|_0 \le \frac{M}{n}\|x\|_0$. But, according to Theorem 9 in [Bruckstein et al., 2009], if

$$\|\phi\|_0 < \frac{1}{4}\left(1 + \frac{1}{\mu(A)}\right),$$

then $\hat{\phi}$ must obey

$$\|\hat{\phi} - \phi\|_2^2 \le \frac{4\varepsilon^2}{1 - \mu(A)(4\|\phi\|_0 - 1)}. \qquad (24)$$

Thus, if (22) holds, so does (24) and $\|\hat{\phi} - \phi\|_2^2 \le 4\varepsilon^2 / [1 - \mu(A)(4M\|x\|_0/n - 1)]$. Since $\|\hat{x} - x\|_2 \le \|\hat{\phi} - \phi\|_2$, this completes the proof.

For the $\ell_1/\ell_2$-minimization, we have the following stability result.

Theorem 6. Let $(x, e)$ denote a solution to (18)–(19) for $p = 2$. If the inequality

$$\|x\|_0 < \frac{1}{4nM}\left(1 + \frac{1}{\mu(A)}\right) \qquad (25)$$

holds, then $\hat{x} = \phi^{-1}(\hat{\phi})$ with $\hat{\phi}$ the solution to (21) for $p = 2$ must obey

$$\|W_x(\hat{x} - x)\|_2^2 \le \frac{4n\varepsilon^2}{1 - \mu(A)(4nM\|x\|_0 - 1)}, \qquad (26)$$

where $W_x = \mathrm{diag}(\|A_1\|_2, \dots, \|A_n\|_2)$.

Proof. Assume (18)–(19) has a solution $(x, e)$. Let us define $\phi = \phi(x)$ and $\delta = \hat{\phi} - \phi$. To prove the theorem, we will first bound from above the norm $\|W\delta\|_2$, where $W$ is the diagonal matrix of precompensating weights.
This part of the proof follows a path similar to that of Theorem 3.1 in[Donoho et al., 2006], while adapting it to the group-sparse setting and mixed (cid:96) p / (cid:96) q norms.Due to the definition of ˆ φ as a minimizer of (21), δ must satisfy either n (cid:88) j =1 (cid:107) W j ( φ + δ ) (cid:107) < n (cid:88) j =1 (cid:107) W j φ (cid:107) or δ = , in which case the statement is obvious. The inequality above can be rewritten as (cid:88) j ∈ I (cid:107) W j δ (cid:107) + (cid:88) j / ∈ I (cid:107) W j ( φ + δ ) (cid:107) − (cid:107) W j φ (cid:107) < , where I = { j ∈ { , . . . , n } : W j φ = } . By the triangle inequality, (cid:107) u + v (cid:107) − (cid:107) u (cid:107) ≥ −(cid:107) v (cid:107) ,this implies (cid:88) j ∈ I (cid:107) W j δ (cid:107) − (cid:88) j / ∈ I (cid:107) W j δ (cid:107) < . (27)In addition, δ must satisfy the constraints in (21) as (cid:107) A ( φ + δ ) + b − y (cid:107) ≤ ε, in which y can be replaced by Aφ + b + e , leading to (cid:107) Aδ − e (cid:107) ≤ ε. Using (cid:107) u (cid:107) ≤ (cid:107) u − v (cid:107) + (cid:107) v (cid:107) , this implies (cid:107) Aδ (cid:107) ≤ ε , which further gives4 ε ≥ (cid:107) Aδ (cid:107) = (cid:107) AW − W δ (cid:107) = ( W δ ) T W − A T AW − ( W δ )= (cid:107) W δ (cid:107) + ( W δ ) T ( W − A T AW − − I )( W δ ) ≥ (cid:107) W δ (cid:107) − (cid:12)(cid:12)(cid:12) ( W δ ) T ( W − A T AW − − I )( W δ ) (cid:12)(cid:12)(cid:12) ≥ (cid:107) W δ (cid:107) − | W δ | T | W − A T AW − − I || W δ |≥ (cid:107) W δ (cid:107) − µ ( A )( (cid:107) W δ (cid:107) − (cid:107) W δ (cid:107) )= (1 + µ ( A )) (cid:107) W δ (cid:107) − µ ( A ) (cid:107) W δ (cid:107) (28)where we used W − W = I and the fact that the diagonal entries of | W − A T AW − − I | arezeros while off-diagonal entries are bounded from above by µ ( A ).15ue to W j δ being a vector with a subset of entries from W δ , we have (cid:107) W δ (cid:107) ≥ (cid:107) W j δ (cid:107) , j = 1 , . . . , n , and thus (cid:107) W δ (cid:107) ≥ n n (cid:88) j =1 (cid:107) W j δ (cid:107) = 1 n (cid:107){ W j δ } nj =1 (cid:107) , . (29)Since the groups defined by the W j ’s overlap, (cid:107) W δ (cid:107) ≤ (cid:80) nj =1 (cid:107) W j δ (cid:107) , and the squared (cid:96) -normin (28) can be bounded by (cid:107) W δ (cid:107) ≤ M (cid:107) W δ (cid:107) ≤ M n (cid:88) j =1 (cid:107) W j δ (cid:107) = M (cid:107){ W j δ } nj =1 (cid:107) , . (30)Introducing the bounds (29)–(30) in (28) yields1 + µ ( A ) n (cid:107){ W j δ } nj =1 (cid:107) , − µ ( A ) M (cid:107){ W j δ } nj =1 (cid:107) , ≤ ε . (31)We will now use this inequality to derive an upper bound on (cid:107){ W j δ } nj =1 (cid:107) , , which will also applyto (cid:107) W δ (cid:107) ≤ (cid:107){ W j δ } nj =1 (cid:107) , , since the groups overlap and the squared components of W δ aresummed multiple times in (cid:107){ W j δ } nj =1 (cid:107) , . To derive the upper bound, we first introduce a fewnotations: a = (cid:107){ W j δ } j ∈ I (cid:107) , , b = (cid:107){ W j δ } j / ∈ I (cid:107) , , and c = (cid:18) (cid:107){ W j δ } j ∈ I (cid:107) , (cid:107){ W j δ } j ∈ I (cid:107) , (cid:19) ∈ (cid:20) | I | , (cid:21) , c = (cid:18) (cid:107){ W j δ } j / ∈ I (cid:107) , (cid:107){ W j δ } j / ∈ I (cid:107) , (cid:19) ∈ (cid:20) n − | I | , (cid:21) , where the box bounds are obtained by classical relations between the (cid:96) and (cid:96) norms ( ∀ u ∈ R k , (cid:107) u (cid:107) ≤ (cid:107) u (cid:107) ≤ √ k (cid:107) u (cid:107) ). 
With these notations, the term to bound is rewritten as (cid:107){ W j δ } nj =1 (cid:107) , = (cid:107){ W j δ } j ∈ I (cid:107) , + (cid:107){ W j δ } j / ∈ I (cid:107) , = c a + c b , while the inequality (31) becomes1 + µ ( A ) n ( c a + c b ) − µ ( A ) M ( a + b ) ≤ ε . We further reformulate this constraint by letting a = ρb :1 + µ ( A ) n ( c ρ + c ) b − µ ( A ) M (1 + ρ ) b ≤ ε . (32)Let γ = (1 + ρ ) / ( c ρ + c ). Due to (27), we have a < b and thus ρ ∈ [0 , c and c , gives the constraints 1 ≤ γ ≤ n − | I | ). By setting V = ( c ρ + c ) b ,(31) is rewritten as 1 + µ ( A ) n V − µ ( A ) M γV ≤ ε , where 1 + µ ( A ) n − µ ( A ) M γ ≥ µ ( A ) n − n − | I | ) µ ( A ) M > , since γ ≤ n − | I | ) and the positivity is ensured by the condition (25) and the fact that (cid:107) x (cid:107) = n − | I | . Thus, (cid:107) W δ (cid:107) ≤ (cid:107){ W j δ } nj =1 (cid:107) , = V ≤ nε µ ( A ) − µ ( A ) nM (cid:107) x (cid:107) and, since (cid:107) W x (ˆ x − x ) (cid:107) ≤ (cid:107) W ( ˆ φ − φ ) (cid:107) = (cid:107) W δ (cid:107) , (26) is proved.16able 1: Results on the example from [Ohlsson et al., 2013b] with quadratic equations. Method IHT QBP (cid:96) M (cid:96) (cid:96) M IR (cid:96) (cid:96) M S (cid:96) (cid:96) M AGA EGA Success rate 54% 79% 0% 0% 97% 97% 91% 100% Mean time (s) d = 4. Method IHT NLBP (cid:96) M (cid:96) (cid:96) M IR (cid:96) (cid:96) M S (cid:96) (cid:96) M AGA EGA Success rate 66% 100% 85% 16% 100% 100% 100% 100% Mean time (s) This section evaluates the efficiency of the BP and greedy methods in terms of accuracy andcomputing time for the noiseless case in Sect. 6.1 and the noisy case in Sect. 6.2. Here, thedefinition of accuracy depends on the presence of noise in the equations, while the computing timerefers to Matlab implementations using MOSEK and CVX for the convex programs and runningon a standard laptop (except for times reported in Fig. 3).The following methods are compared: the iterative hard thresholding (IHT) algorithm[Blumensath, 2013] as implemented by [Beck and Eldar, 2013], the quadratic (QBP) and nonlinear(NLBP) BP methods of [Ohlsson et al., 2013b, Ohlsson et al., 2013a], the simple (cid:96) -minimization( (cid:96) M) solving (5) , the (cid:96) /(cid:96) -minimization ( (cid:96) (cid:96) M) solving (10) with its iteratively reweightedcounterpart (IR (cid:96) (cid:96) M) using 10 iterations as defined in Sect. 2.4.1 for (cid:15) = 0 . (cid:96) /(cid:96) -minimization (S (cid:96) (cid:96) M) of Sect. 2.4.2, and the exact (EGA) and approximate (AGA) greedyalgorithms of Sect. 3. For the noisy cases, error tolerant formulations of these methods as describedin Sect. 5 are used. In the noiseless setting considered in the following experiments, accuracy is defined as the ability ofa method to recover the sparsest solution x of a polynomial system and is measured as a successrate, i.e., the percentage of systems for which it recovers x (meaning (cid:107) ˆ x − x (cid:107) ≤ − ) in aMonte Carlo experiment involving 100 trials with random polynomial coefficients but same x .More precisely, for each experiment, given a sparsity level (cid:107) x (cid:107) , we generate an x with the first (cid:107) x (cid:107) components set to 1 and the rest at 0, while the values of y i , i = 1 , . . . , N , are given by (2)with b i and a ik , k = 1 , . . . , M , randomly drawn for each trial according to a zero-mean Gaussiandistribution with unit variance. Sparse solutions of quadratic equations. 
Consider example A in [Ohlsson et al., 2013b] with N = 25 quadratic equations ( d = 2) in n = 20 variables. The true x that we aim at recoveringhas three components at 1 and the rest at 0: (cid:107) x (cid:107) = 3. The results reported in Table 1 showthat all reweighted BP and greedy methods recover the solution in almost all trials in less than 1second, whereas the IHT and QBP methods lead to more failures despite longer computing times.However, the (cid:96) M method, which does not enforce the polynomial structure on ˆ φ , cannot recoverthe correct solution. The straightforward optimization of the (cid:96) /(cid:96) -norm ( (cid:96) (cid:96) M) also fails, whichshows the importance of reweighting for obtaining truly sparse solutions. Higher-degree polynomial equations. Consider now example A in [Ohlsson et al., 2013a]with N = 50 polynomial equations of degree d = 4 in n = 5 variables with two nonzeros ( (cid:107) x (cid:107) = The code for the proposed methods is available at . The results reported here are obtained with 10 iterations of the reweighting procedure of [Cand`es et al., 2008]applied to (5). Method QBP (cid:96) M (cid:96) (cid:96) M IR (cid:96) (cid:96) M S (cid:96) (cid:96) M AGA EGA Success rate 0% 0% 3% 100% 99% 91% 100% Mean time (s) d = 4. Method NLBP (cid:96) M (cid:96) (cid:96) M IR (cid:96) (cid:96) M S (cid:96) (cid:96) M AGA EGA Success rate Mean time (s) Method GSS QBP IR (cid:96) (cid:96) M S (cid:96) (cid:96) M AGA EGA Success rate 32% 0% 79% 72% 71% 100% Mean time (s) (cid:96) M method alreadyobtains a success rate of 85% while all reweighted BP and greedy methods achieve a 100% successrate. This is due to the increase of the number of equations, N , and the decrease of the numberof variables, n , as will be emphasized by additional experiments below. Here, S (cid:96) (cid:96) M can bemuch faster than IR (cid:96) (cid:96) M based on the iterative reweighting of [Cand`es et al., 2008] (Sect. 2.4.1)because the number of nonzero groups is very small: 2 < n < typical number of iterations for thereweighting of [Cand`es et al., 2008]. However, the AGA obtains the same success rate, while beingone order of magnitude faster.The NLBP method also recovered the correct solution in all trials, but with a much longercomputing time. This time is roughly divided in halves for the construction of quadratic constraintsthat can be handled by QBP on the one hand and the semi-definite optimization of QBP on theother hand. Here again, the IHT method offers a low success rate, though it uses additionalinformation on (cid:107) x (cid:107) to compute ˆ x with the optimal sparsity level (which is unknown to the othermethods). This is due in part to the difficulty of tuning the gradient step-size to obtain convergencewith an objective function whose gradient is not Lipschitz continuous. For this reason, we excludethe IHT method from the remaining experiments. Purely nonlinear equations. Consider now the particular case of purely nonlinear equationsdiscussed in Sect. 4. Given N = 25 purely quadratic polynomials p i ( x ) = x T Q i x in n = 20variables, we compute the values y i = p i ( x ). Results are shown in Table 3 for (cid:107) x (cid:107) = 3 andwhere the definition of the success rate is slightly modified: since the sparsest solution is notunique in this case, a successful trial is defined as one for which the estimate ˆ x belongs to the setof solutions (meaning that min {(cid:107) ˆ x − x (cid:107) , (cid:107) ˆ x + x (cid:107) } ≤ − ). 
Table 4 reports results of similarexperiments with a higher degree d = 4 and N = 50, n = 5, (cid:107) x (cid:107) = 2. In this case, the estimatesof base variables ˆ x j are directly given as cube roots of the estimates of x j . In both cases ( d = 2or 4), the success rates are similar to the ones reported in Tables 1 and 2, which provides evidencethat the alternative proposed in Sect. 4 yields satisfactory results when Assumption 1 does nothold.In order to compare with the greedy sparse-simplex (GSS) method of [Beck and Eldar, 2013]using the implementation provided with that paper, we consider a typical application for purelyquadratic equations, namely, phase retrieval. In such applications, the measurements obey c i = | c Ti x | , which can be reformulated as y i = c i = x T c i c Ti x to be handled by the proposed methods.Here, the components of the vectors c i , i = 1 , . . . , N , are randomly drawn in each trial according toa zero-mean Gaussian distribution of unit variance. Results in Table 5 obtained for N = 25, n = 2018nd (cid:107) x (cid:107) = 3 show that the proposed methods perform much better than the GSS method withdata generated in this manner, despite the fact that GSS works with the additional knowledgeof (cid:107) x (cid:107) . Note that many other methods have been proposed for sparse phase retrieval. Thecomparison is here limited to a sample of generic sparse recovery methods that apply to phaseretrieval as a special case. Remark: All examples above consider a moderate dimensional setting ( n ≤ ) with low sparsitylevels (cid:107) x (cid:107) ≤ . In such cases, the exact greedy algorithm (EGA) can be applied without complexityissues and yields a perfect recovery in all trials, even faster than the BP methods based on convexrelaxations. However, despite its name and the observed success rates, the EGA is not an exactalgorithm for solving polynomial systems but only for solving group-sparsity optimization problemsin the form of (8) . While Theorem 2 states that this is sufficient to solve (1) – (2) if the polynomialconstraints are satisfied, there is no guarantee that this is the case in general. Estimating the probability of successful recovery. Figures 1 (for n = 10) and 2 (for n = 20)show the probability of successful recovery of an x estimated via the success rate versus thesparsity level (cid:107) x (cid:107) for various values of δ = N/n . Contrary to the linear case where the systemis overdetermined for values of δ > 1, here δ can grow as large as M/n before the system becomesoverdetermined. Thus, classical studies on the linear case focus on small values of δ < 1, whereaswe only analyze the probability of successful recovery for δ ≥ δ < δ on the other hand. Larger values of δ correspond to larger systems with more equations and thus with more information on the soughtsolution. Another expected observation is that the probability of success decreases when the degree d increases. For n = 20, this leads to constant failure for the (cid:96) M and IR (cid:96) (cid:96) M methods at d = 5,though the S (cid:96) (cid:96) M and AGA methods can still recover sufficiently sparse solutions. Except forthis particular setting ( n = 20, d = 5), the results for IR (cid:96) (cid:96) M and S (cid:96) (cid:96) M are comparable, whilethe AGA yields slightly lower probabilities. The simple (cid:96) M method is much less effective thanthe others, but nonetheless obtains a high probability of recovery for not too difficult cases with d = 2 and a sufficiently large N . 
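To make the experimental protocol concrete, the following Python sketch reproduces the data-generation and success-rate estimation described above (planted solution with its first $\|x\|_0$ components equal to 1, Gaussian polynomial coefficients, 100 trials). It reuses the hypothetical helpers `multi_indexes`, `index_sets`, `phi` and `approximate_greedy` from the earlier sketches; the $10^{-6}$ success threshold is an assumption, as the exact value is not legible in this version of the text.

```python
# Monte Carlo estimation of the probability of successful recovery (illustrative sketch).
import numpy as np

def success_rate(n, d, N, sparsity, trials=100, seed=0):
    rng = np.random.default_rng(seed)
    alphas = multi_indexes(n, d)
    groups = index_sets(alphas)
    x_true = np.zeros(n)
    x_true[:sparsity] = 1.0                 # planted sparse solution
    successes = 0
    for _ in range(trials):
        A = rng.standard_normal((N, alphas.shape[0]))    # coefficients a_ik
        b = rng.standard_normal(N)                       # constant terms b_i
        y = A @ phi(x_true, alphas) + b                  # noiseless measurements (2)
        x_hat, _ = approximate_greedy(A, b, y, groups)
        successes += np.linalg.norm(x_hat - x_true) <= 1e-6   # assumed threshold
    return successes / trials
```

Sweeping `sparsity` and the ratio $\delta = N/n$ in such a loop is the experiment reported in Figures 1 and 2 below.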
Computational complexity. The next experiments evaluate the computational complexity of the methods with respect to the number of variables n, the degree d and the sparsity level ‖x‖₀. Focusing only on the computing time, we consider favorable settings in which these parameters and the number of equations N are set such that x should be recovered in most cases. The left plot of Fig. 3 shows the mean computing time of the methods for a range of n ∈ [10, 40] with ‖x‖₀ = 3, d = 2 and N = 50. They all have a complexity exponential in n due to the dimension M of φ. In addition, they have similar rates of increase, except for the EGA, whose computing time grows faster with n due to the combinatorial nature of the algorithm. Similar results, shown in the middle plot, are obtained with respect to the degree d (for n = 10, ‖x‖₀ = 2 and N = 50), except that the greedy algorithms benefit from a much lower rate of increase. In these experiments, the reweighting scheme of the Sℓ1/ℓ2M is faster than the one used by the IRℓ1/ℓ2M, but its computing time highly depends on ‖x‖₀, whereas the number of iterations is fixed for IRℓ1/ℓ2M. This is clearly seen from the right plot of Fig. 3 (obtained with n = 20, d = 2 and N = 100), where the computing time of the Sℓ1/ℓ2M reaches the one of the IRℓ1/ℓ2M when ‖x‖₀ equals the number of iterations of IRℓ1/ℓ2M. The exact greedy algorithm (EGA) suffers from a similar complexity issue with respect to ‖x‖₀, but much more pronounced. On the contrary, its approximate variant (AGA) remained very efficient in all of our tests.

Figure 1: Estimated probability of successful recovery versus ‖x‖₀ for various δ = N/n in the case n = 10. Each row considers a different degree d, while each column contains the results of a different algorithm (from left to right: ℓ1M, IRℓ1/ℓ2M, Sℓ1/ℓ2M and AGA).

Figure 2: Estimated probability of successful recovery versus ‖x‖₀ for various δ = N/n in the case n = 20. Each row considers a different degree d, while each column contains the results of a different algorithm (from left to right: ℓ1M, IRℓ1/ℓ2M, Sℓ1/ℓ2M and AGA).

Figure 3: Mean computing time of the methods in seconds versus the number of variables n (left), the degree d (middle) and the sparsity level ‖x‖₀ (right). These times are obtained on a Linux server equipped with 2 Xeon X5690 at 3.47 GHz.

We now turn to the noisy case as discussed in Sect. 5. In this setting, accuracy is measured by two performance indexes: the mean relative error, ‖x̂ − x‖/‖x‖, and the success rate with respect to the recovery of the support of x (where values |x̂_j| below a small threshold are considered as zeros). Indeed, in many applications, the recovery of the correct support is more important than the precise values of the estimates.
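As an illustration of the two performance indexes just introduced, the following sketch computes the relative error and the support-recovery criterion for a single trial; the zero-threshold value is an assumption, since its exact value is not given here.

```python
import numpy as np

# Two performance indexes for one trial: relative error and support recovery.
# The zero-threshold `tol` is an assumption for illustration.
def relative_error(x_hat, x_true):
    # ||x_hat - x|| / ||x||
    return np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

def support_success(x_hat, x_true, tol=1e-6):
    # Success if the estimated support equals the true one, treating
    # entries of x_hat with magnitude below tol as zeros.
    return np.array_equal(np.abs(x_hat) > tol, np.abs(x_true) > tol)
```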
First results. The data are generated as in Sect. 6.1 except that y_i = p_i^d(x) + e_i, with noise terms e_i forming a zero-mean Gaussian random vector e satisfying ‖e‖ = ε = 3. This corresponds to a signal-to-noise ratio, ‖y‖/‖e‖, of about 18 dB. Results with N = 50 polynomials of degree d = 2 in n = 20 variables are shown in Table 6, and in Table 7 for d = 4 and n = 5. The reweighted BP methods based on group-sparsity and the greedy methods lead to a rather small approximation error, and they all recover the correct support in almost all trials despite the presence of noise.

Table 6: Results with noisy quadratic equations (d = 2, n = 20, N = 50).
  Method               QBP   ℓ1M     ℓ1/ℓ2M   IRℓ1/ℓ2M   Sℓ1/ℓ2M   AGA     EGA
  Mean relative error  —     11.7%   22.2%    7.72%      6.52%     6.20%   —
  Success rate         14%   96%     0%       100%       100%      99%     100%

Table 7: Results with noisy polynomial equations of degree d = 4 (n = 5, N = 50).
  Method               NLBP   ℓ1M     ℓ1/ℓ2M   IRℓ1/ℓ2M   Sℓ1/ℓ2M   AGA     EGA
  Mean relative error  —      22.3%   29.8%    7.65%      5.83%     6.74%   —
  Success rate         55%    87%     0%       100%       100%      99%     100%

Influence of the noise level ε. The influence of the noise level ε = ‖e‖ on the performance of the proposed methods is evaluated by letting the value of ε vary between 1 and 10, which corresponds to a signal-to-noise ratio decreasing from 28 dB to 9 dB. For all methods except the ℓ1M, the results plotted in Fig. 4 indicate that, as expected, e.g., from the bound (26), the approximation error directly depends on the noise level. In fact, the mean relative error is almost linear with respect to ε, which shows that the dependence on ε in the bound (26) of Theorem 6 is of the correct nature. But even more interestingly, the noise level does not influence the success rate corresponding to the estimation of the support of x, thus providing evidence that the methods are robust to noise. Here again, the ℓ1M method, which does not include structural knowledge in its formulation, does not benefit from such satisfactory features.

Figure 4: Mean relative error (left) and success rate (right) versus ε = ‖e‖. Except for the ℓ1M, the curves are hardly distinguishable and close to 100% for the success rate.

Figure 5: Mean relative error (left) and success rate (right) versus ε with data perturbed by a noise of ℓ2-norm equal to 3. Except for the ℓ1M, the curves of the success rate are hardly distinguishable and close to 100% for ε ≥ 3.
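The noisy data generation described in the first-results paragraph can be sketched as follows; the helper name, the seeded generator and the decibel formula are assumptions made for illustration, while the polynomial evaluations are taken as given.

```python
import numpy as np

# Sketch of the noisy data generation: a zero-mean Gaussian noise vector
# rescaled so that ||e||_2 equals a prescribed level eps.
rng = np.random.default_rng(0)

def add_noise(p_values, eps=3.0):
    # p_values: noiseless evaluations p_i(x), i = 1..N
    p_values = np.asarray(p_values, dtype=float)
    e = rng.standard_normal(p_values.shape)
    e *= eps / np.linalg.norm(e)             # enforce ||e||_2 = eps
    y = p_values + e
    snr_db = 20 * np.log10(np.linalg.norm(y) / eps)   # reported above as about 18 dB for eps = 3
    return y, snr_db
```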
Influence of ε as a tuning parameter. In practical applications, the noise level might be unknown, and ε then becomes a tuning parameter. Figure 5 shows the influence of this parameter on the performance of the methods for a fixed noise level ‖e‖ = 3. All methods except the ℓ1M perform very well with a slightly overestimated ε ∈ [3, 4] and recover the support of x for ε within a reasonable range around ‖e‖. Significantly larger values of ε lead to an increase of the error for all methods except the greedy algorithms, which maintain a mean relative error below 7% for all values of ε ≥ ‖e‖. On the other hand, the IRℓ1/ℓ2M method is the only one that is not badly affected by an underestimation of ε in terms of approximation error.

Regarding the estimation of the support of x, all methods fail with ε < ‖e‖, while overestimating ε ≥ ‖e‖ leads to perfect recovery even for much larger values of ε. This is in line with results on linear BP denoising, such as Theorem 4.1 in [Donoho et al., 2006], which guarantees that the estimated support is a subset of the sought one for sufficiently sparse cases when using an overestimated ε.

Conclusions

The paper proposed several methods for finding the sparsest solution of a system of polynomial equations. Two generic approaches were considered, one based on convex relaxations and one on a greedy strategy. For the convex relaxations, sufficient conditions for exact recovery of the sparsest solution were derived. The methods were also extended to deal with noisy equations, in which case stable recovery bounds for the convex relaxations were obtained. Both the computational efficiency and the accuracy of the proposed methods were shown in numerical experiments, which also emphasized the relationship between the probability of success and the sparsity of the solution for each method. In addition, these results indicate that the proposed methods accurately recover the sought solution in many cases where the sufficient conditions do not hold. As in classical BP theory, these conditions suffer from an “excessive pessimism” and are too restrictive due to their worst-case nature.

Remaining open issues include the convergence analysis of the greedy approximation towards the sparsest solution and the derivation of sufficient conditions for the solution of the group-sparsity optimization problem (8) to satisfy the polynomial constraints. Another question of interest is whether less restrictive conditions can be obtained for the variants of the convex relaxations with non-negativity constraints such as (10). Future work will also consider applying the proposed methods to more general nonlinear equations, e.g., by using Taylor expansions as in [Ohlsson et al., 2013a].

Acknowledgements

Henrik Ohlsson gratefully acknowledges support from the NSF project FORCES (Foundations Of Resilient CybEr-physical Systems), the Swedish Research Council in the Linnaeus center CADICS, the European Research Council under the advanced grant LEARN, contract 267381, a postdoctoral grant from the Sweden-America Foundation, donated by ASEA's Fellowship Fund, and a postdoctoral grant from the Swedish Research Council.

References

[Andersen and Andersen, 2000] Andersen, E. D. and Andersen, K. D. (2000). The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm. High Performance Optimization, 33:197–232.

[Balan et al., 2006] Balan, R., Casazza, P., and Edidin, D. (2006). On signal reconstruction without phase. Applied and Computational Harmonic Analysis, 20:345–356.
[Bandeira et al., 2014] Bandeira, A. S., Cahill, J., Mixon, D. G., and Nelson, A. A. (2014). Saving phase: Injectivity and stability for phase retrieval. Applied and Computational Harmonic Analysis, 37(1):106–125.

[Beck and Eldar, 2013] Beck, A. and Eldar, Y. C. (2013). Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization, 23(3):1480–1509.

[Blumensath, 2013] Blumensath, T. (2013). Compressed sensing with nonlinear observations and related nonlinear optimisation problems. IEEE Transactions on Information Theory, 59(6):3466–3474.

[Blumensath and Davies, 2008] Blumensath, T. and Davies, M. E. (2008). Gradient pursuit for non-linear sparse signal modelling. In European Signal Processing Conference (EUSIPCO), pages 25–29.

[Bruckstein et al., 2009] Bruckstein, A. M., Donoho, D. L., and Elad, M. (2009). From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81.

[Candès, 2006] Candès, E. J. (2006). Compressive sampling. In Proceedings of the International Congress of Mathematicians: invited lectures, pages 1433–1452.

[Candès et al., 2013a] Candès, E. J., Eldar, Y. C., Strohmer, T., and Voroninski, V. (2013a). Phase retrieval via matrix completion. SIAM Journal on Imaging Sciences, 6(1):199–225.

[Candès et al., 2013b] Candès, E. J., Strohmer, T., and Voroninski, V. (2013b). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274.

[Candès et al., 2008] Candès, E. J., Wakin, M. B., and Boyd, S. P. (2008). Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905.

[Deng et al., 2011] Deng, W., Yin, W., and Zhang, Y. (2011). Group sparse optimization by alternating direction method. Technical Report TR11-06, Department of Computational and Applied Mathematics, Rice University.

[Donoho, 2006] Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306.

[Donoho et al., 2006] Donoho, D. L., Elad, M., and Temlyakov, V. N. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18.

[Donoho and Huo, 2001] Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862.

[Ehler et al., 2014] Ehler, M., Fornasier, M., and Sigl, J. (2014). Quasi-linear compressed sensing. Multiscale Modeling & Simulation, 12(2):725–754.

[Eldar and Kutyniok, 2012] Eldar, Y. C. and Kutyniok, G., editors (2012). Compressed Sensing: Theory and Applications. Cambridge University Press.

[Fienup, 1982] Fienup, J. (1982). Phase retrieval algorithms: a comparison. Applied Optics, 21(15):2758–2769.

[Foucart and Rauhut, 2013] Foucart, S. and Rauhut, H. (2013). A Mathematical Introduction to Compressive Sensing. Springer.

[Gerchberg and Saxton, 1972] Gerchberg, R. and Saxton, W. (1972). A practical algorithm for the determination of phase from image and diffraction plane pictures. Optik, 35:237–246.

[Gonsalves, 1976] Gonsalves, R. (1976). Phase retrieval from modulus data. Journal of the Optical Society of America, 66(9):961–964.

[Grant and Boyd, 2008] Grant, M. and Boyd, S. (2008). Graph implementations for nonsmooth convex programs. In Blondel, V., Boyd, S., and Kimura, H., editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag Limited. http://stanford.edu/~boyd/graph_dcp.html.
[Grant and Boyd, 2013] Grant, M. and Boyd, S. (2013). CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx.

[Kohler and Mandel, 1973] Kohler, D. and Mandel, L. (1973). Source reconstruction from the modulus of the correlation function: a practical approach to the phase problem of optical coherence theory. Journal of the Optical Society of America, 63(2):126–134.

[Le et al., 2013] Le, V. L., Lauer, F., and Bloch, G. (2013). Selective ℓ1 minimization for sparse recovery. IEEE Transactions on Automatic Control. (to appear).

[Marchesini, 2007] Marchesini, S. (2007). Phase retrieval and saddle-point optimization. Journal of the Optical Society of America A, 24(10):3289–3296.

[Ohlsson and Eldar, 2013] Ohlsson, H. and Eldar, Y. C. (2013). On conditions for uniqueness in sparse phase retrieval. CoRR, abs/1308.5447.

[Ohlsson et al., 2013a] Ohlsson, H., Yang, A. Y., Dong, R., and Sastry, S. (2013a). Nonlinear basis pursuit. In Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, CA, USA, pages 315–319.

[Ohlsson et al., 2013b] Ohlsson, H., Yang, A. Y., Dong, R., Verhaegen, M., and Sastry, S. (2013b). Quadratic basis pursuit. arXiv preprint arXiv:1301.7002.

[Ranieri et al., 2013] Ranieri, J., Chebira, A., Lu, Y. M., and Vetterli, M. (2013). Phase retrieval for sparse signals: Uniqueness conditions. CoRR, abs/1308.3058.

[Vidal et al., 2005] Vidal, R., Ma, Y., and Sastry, S. (2005). Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945–1959.

A Bound on the sparsity level of φ

Lemma 3. For all (a, b, d) ∈ (ℕ*)³ such that a ≥ d(b + d), the inequality

\frac{1}{a}\sum_{q=2}^{d}\binom{a+q-1}{q} \geq \frac{d}{b}\sum_{q=2}^{d}\binom{b+q-1}{q}

holds.

Proof. For 2 ≤ q ≤ d, we can bound the terms in the sum as

\frac{1}{a}\binom{a+q-1}{q} = \frac{(a+q-1)!}{a\,q!\,(a-1)!} = \frac{1}{a\,q!}\prod_{i=0}^{q-1}(a+i) = \frac{1}{q!}\prod_{i=1}^{q-1}(a+i) \geq \frac{a^{q-1}}{q!} \geq \frac{d^{q-1}(b+d)^{q-1}}{q!} \geq \frac{d^{q-1}}{q!}\prod_{i=1}^{q-1}(b+i) = \frac{d^{q-1}}{b}\binom{b+q-1}{q},

where we used a ≥ d(b + d) in the second inequality. Then

\frac{1}{a}\sum_{q=2}^{d}\binom{a+q-1}{q} \geq \frac{1}{b}\sum_{q=2}^{d} d^{q-1}\binom{b+q-1}{q} = \frac{d}{b}\sum_{q=2}^{d} d^{q-2}\binom{b+q-1}{q} \geq \frac{d}{b}\sum_{q=2}^{d}\binom{b+q-1}{q}.

Proposition 2. Let the mapping φ : ℝⁿ → ℝᴹ be defined as above. Then, with d ≥ 2 and n ≥ d(‖x‖₀ + d), the vector φ(x) is sparser than the vector x in the sense that the inequality

\frac{\|\phi(x)\|_0}{M} \leq \frac{2\,\|x\|_0}{d\,n}

holds.
Proof. By construction, the number of nonzeros in φ(x) is equal to the sum over q, 1 ≤ q ≤ d, of the number of monomials of degree q in ‖x‖₀ variables:

\frac{\|\phi(x)\|_0}{M} = \frac{\sum_{q=1}^{d}\binom{\|x\|_0+q-1}{q}}{\sum_{q=1}^{d}\binom{n+q-1}{q}} = \frac{\|x\|_0}{n}\cdot\frac{1+\frac{1}{\|x\|_0}\sum_{q=2}^{d}\binom{\|x\|_0+q-1}{q}}{1+\frac{1}{n}\sum_{q=2}^{d}\binom{n+q-1}{q}} = \frac{\|x\|_0}{n}\left(\frac{1}{1+\frac{1}{n}\sum_{q=2}^{d}\binom{n+q-1}{q}} + \frac{\frac{1}{\|x\|_0}\sum_{q=2}^{d}\binom{\|x\|_0+q-1}{q}}{1+\frac{1}{n}\sum_{q=2}^{d}\binom{n+q-1}{q}}\right).

On the one hand, n ≥ d(‖x‖₀ + d) implies n ≥ d². Since the terms \frac{1}{n}\binom{n+q-1}{q} form an increasing sequence in q, we have

\frac{1}{n}\sum_{q=2}^{d}\binom{n+q-1}{q} \geq \frac{d-1}{n}\binom{n+1}{2} = \frac{(d-1)(n+1)}{2} \geq \frac{(d-1)\,d^2}{2} \geq d-1,

where the last inequality uses d ≥ 2, so that

\frac{1}{1+\frac{1}{n}\sum_{q=2}^{d}\binom{n+q-1}{q}} \leq \frac{1}{d}.

On the other hand, Lemma 3 applied with a = n and b = ‖x‖₀ (which is valid since n ≥ d(‖x‖₀ + d)) yields

\frac{\frac{1}{\|x\|_0}\sum_{q=2}^{d}\binom{\|x\|_0+q-1}{q}}{1+\frac{1}{n}\sum_{q=2}^{d}\binom{n+q-1}{q}} \leq \frac{\frac{1}{\|x\|_0}\sum_{q=2}^{d}\binom{\|x\|_0+q-1}{q}}{\frac{1}{n}\sum_{q=2}^{d}\binom{n+q-1}{q}} \leq \frac{1}{d}.

Thus,

\frac{\|\phi(x)\|_0}{M} \leq \frac{\|x\|_0}{n}\left(\frac{1}{d}+\frac{1}{d}\right) = \frac{2\,\|x\|_0}{d\,n}.

Note also that the assumption of the Proposition implies d ≤ n: assuming d > n leads to n ≥ d(‖x‖₀ + d) > n(‖x‖₀ + n), hence ‖x‖₀ + n < 1, which is impossible since n ≥ 1.

B Other conditions for sparse recovery

The following uses the exact value of ‖φ(x)‖₀.

Theorem 7. Let x denote the unique solution to (1)–(2). If the inequality

\sum_{q=1}^{d}\binom{\|x\|_0+q-1}{q} \leq \frac{1}{2}\left(1+\frac{1}{\mu(A)}\right) \qquad (33)

holds, then the solution φ̂ to (5) is unique and equal to φ(x), thus providing x̂ = x.

Another, more compact but slightly less tight, result is as follows.

Theorem 8. Let x denote the unique solution to (1)–(2). If the inequality

\binom{\|x\|_0+d-1}{d} \leq \frac{1}{2d}\left(1+\frac{1}{\mu(A)}\right) \qquad (34)

holds, then the solution φ̂ to (5) is unique and equal to φ(x), thus providing x̂ = x.

Proof. Since the terms in the sum of Theorem 7 form an increasing sequence, we have

\sum_{q=1}^{d}\binom{\|x\|_0+q-1}{q} \leq d\binom{\|x\|_0+d-1}{d},

which yields the sought statement by application of Theorem 7.

C Value of m

Let us define m as the number of monomials involving a given base variable x_j with a degree ≥ 1. It can be computed as the sum over q, 0 ≤ q ≤ d − 1, of the number of monomials of degree q in the n − 1 remaining variables, multiplied by d − q (since each such monomial can be multiplied by x_j, or x_j², ..., or x_j^{d−q} to produce a monomial of degree at most d in n variables involving x_j):

m = \sum_{q=0}^{d-1}(d-q)\binom{(n-1)+q-1}{q} = \sum_{q=0}^{d-1}(d-q)\binom{n+q-2}{q}.

Another technique computes m as the total number of monomials, M, minus the number of monomials not involving x_j, which is the number of monomials in n − 1 variables:

m = M - \sum_{q=1}^{d}\binom{(n-1)+q-1}{q} = \sum_{q=1}^{d}\left[\binom{n+q-1}{q}-\binom{n+q-2}{q}\right] = \sum_{q=1}^{d}\left[\frac{n+q-1}{n-1}-1\right]\binom{n+q-2}{q} = \sum_{q=1}^{d}\frac{q}{n-1}\binom{n+q-2}{q}.
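As a small numerical sanity check (not part of the original derivation), the bound of Proposition 2 and the two expressions for m given in Appendix C can be verified for small parameter values:

```python
from math import comb

# Numerical check of the quantities in Appendices A and C, under the assumption
# n >= d * (s + d), where s stands for ||x||_0.
def num_monomials(n, d):
    # M: number of monomials of degree 1..d in n variables
    return sum(comb(n + q - 1, q) for q in range(1, d + 1))

def phi_nonzeros(s, d):
    # ||phi(x)||_0 for a vector x with s nonzero entries
    return sum(comb(s + q - 1, q) for q in range(1, d + 1))

# Proposition 2: ||phi(x)||_0 / M <= 2 s / (d n) whenever d >= 2 and n >= d (s + d).
for d in (2, 3, 4):
    for s in (1, 2, 3):
        n = d * (s + d)                       # smallest admissible n
        assert phi_nonzeros(s, d) / num_monomials(n, d) <= 2 * s / (d * n)

# Appendix C: the two expressions for m coincide.
def m_direct(n, d):
    return sum((d - q) * comb((n - 1) + q - 1, q) for q in range(d))

def m_complement(n, d):
    return num_monomials(n, d) - num_monomials(n - 1, d)

assert all(m_direct(n, d) == m_complement(n, d)
           for n in range(2, 10) for d in range(1, 6))
print("checks passed")
```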