A Sublevel Moment-SOS Hierarchy for Polynomial Optimization
Tong Chen, Jean-Bernard Lasserre, Victor Magron, Edouard Pauwels
CHEN, Tong∗ [email protected]
LASSERRE, Jean-Bernard∗† [email protected]
MAGRON, Victor∗† [email protected]
PAUWELS, Edouard‡† [email protected]

∗ LAAS-CNRS, BP 54200, 7 avenue du Colonel Roche, 31031 Toulouse, Cédex 4, France.
† IMT, Université Toulouse 3 Paul Sabatier.
‡ IRIT, Université de Toulouse, CNRS.
January 14, 2021
Abstract
We introduce a sublevel Moment-SOS hierarchy where each SDP relaxation can be viewed as an intermediate (or interpolation) between the $d$-th and $(d+1)$-th order SDP relaxations of the Moment-SOS hierarchy (dense or sparse version). With a flexible choice of the size (level) and number (depth) of the subsets appearing in the SDP relaxation, one can obtain different improvements over the $d$-th order relaxation, depending on the available machine memory. In particular, we provide numerical experiments for $d = 1$ and various types of problems, both in combinatorial optimization (Max-Cut, mixed-integer programming) and in deep learning (robustness certification, Lipschitz constants of neural networks), where the standard Lasserre relaxation (or its sparse variant) is computationally intractable. In our numerical results, the lower bounds from the sublevel relaxations improve the bound from Shor's relaxation (the first-order Lasserre relaxation) and are significantly closer to the optimal value or to the best-known lower/upper bounds.

1 Introduction

Consider the polynomial optimization problem (POP) of the following form:

$$f^\ast := \inf_{\mathbf{x}\in\mathbb{R}^n} \{ f(\mathbf{x}) : g_i(\mathbf{x}) \ge 0,\ i = 1,\dots,p \}, \tag{POP}$$

where $f$ and the $g_i$ are polynomials in the variable $\mathbf{x}$ for all $i = 1,\dots,p$. Lasserre's hierarchy [9] is a well-known method based on semidefinite programming (SDP) to approximate the optimal value of (POP): one solves a sequence of SDPs that provides a monotone series of lower bounds converging to the optimal value of the original problem. Under certain assumptions, this convergence is even shown to be finite [18].

Related works

Other related frameworks of relaxations also provide lower bounds converging to the optimal value of a POP, including DSOS [17], based on linear programming (LP), SDSOS [17], based on second-order cone programming (SOCP), and the hybrid BSOS [11], which combines features of the LP and SDP hierarchies. Generally speaking, when comparing LP and SDP solvers, the former can handle problems of much larger size. On the other hand, the bounds from LP relaxations are significantly weaker than those obtained by SDP relaxations, in particular for combinatorial problems [13]. Building on the standard Lasserre hierarchy, several further works have explored various types of sparsity patterns inside POPs to compute lower bounds more efficiently and handle larger-scale POPs. The first such extension can be traced back to Waki et al. [27] and Lasserre [10], where the authors consider the so-called correlative sparsity pattern (CSP), with an associated CSP graph whose nodes are the POP's variables. Two nodes of the CSP graph are connected by an edge if the two corresponding variables appear in the same constraint or in the same monomial of the objective. The standard sparse Lasserre hierarchy splits the full moment and localizing matrices into several smaller submatrices, according to subsets of nodes (maximal cliques) of a chordal extension of the CSP graph associated with the POP. When the size of the largest clique (a crucial parameter of the sparsity pattern) is reasonable, the resulting SDP relaxations become tractable.
There are many successful applications of the resulting sparse Moment-SOS hierarchy, including certified roundoff error bounds [15, 14], optimal power flow [6], volume computation of sparse semialgebraic sets [26], approximation of regions of attraction of sparse polynomial systems [24, 25], noncommutative POPs [7], and sparse positive definite functions [16]. Similarly, the sparse BSOS hierarchy [33] is a sparse version of BSOS for large-scale polynomial optimization. Besides correlative sparsity, recent developments [32, 31] exploit the so-called term sparsity (TSSOS), or combine correlative and term sparsity (CS-TSSOS) [30], to handle large-scale polynomial optimization problems. The TSSOS framework relies on a term sparsity pattern (TSP) graph whose nodes are the monomials of some monomial basis. Two nodes of a TSP graph are connected by an edge when the product of the corresponding monomials appears in the supports of the polynomials involved in the POP or is a monomial of even degree. Extensions have been provided to compute more efficiently approximations of joint spectral radii [28] and minimal traces or eigenvalues of noncommutative polynomials [29]. More variants of the sparse Moment-SOS hierarchy have been built for quantum bounds on Bell inequalities [21], condensed-matter ground-state problems [1], and quantum many-body problems [5], where one selects a certain subset of words (noncommutative monomials) to decrease the number of SDP variables. Recently, the authors of [2] proposed a partial and an augmented partial relaxation tailored to the Max-Cut problem, which strengthen Shor's relaxation by adding some (but not all) constraints from the second-order Lasserre hierarchy. The same idea was already used in the multi-order
SDP relaxation of [6] for solving large-scale optimal power flow (OPF) problems. There, the authors set a threshold on the size of the maximal cliques and include the second-order relaxation constraints for the cliques whose size is below the threshold, and the first-order relaxation constraints for the cliques whose size is above it.
Contribution
This work is in the line of research concerned with extensions and variants of the Moment-SOS hierarchy designed to handle large-scale POPs that are out of reach for the standard hierarchy. We provide a principled way to obtain intermediate SDP relaxations between the first- and second-order SDP relaxations of the Moment-SOS hierarchy for general POPs. It encompasses the above-cited works [6, 2] as special cases for Max-Cut and OPF problems. It also generalizes to intermediate SDP relaxations between the (arbitrary) order-$d$ and order-$(d+1)$ relaxations of the Moment-SOS hierarchy, when the order-$(d+1)$ relaxation is too costly to implement.

We develop what we call the sublevel hierarchy, based on the standard Moment-SOS hierarchy. Compared with existing sparse variants of the latter, we propose several possible SDP relaxations to improve lower bounds for general POPs. The basic principle is quite simple. In the sublevel hierarchy between the $d$-th and $(d+1)$-th orders of the sparse Moment-SOS hierarchy, from the maximal cliques of a chordal extension of the CSP graph we further select several subsets of nodes (variables). Then, in the $d$-th order sparse SDP relaxation, we also include the $(d+1)$-th order moment and localizing matrices associated with these subsets only. This methodology proves helpful when the bound obtained by the $d$-th order relaxation of a POP is not satisfactory and one is not able to solve the $(d+1)$-th order relaxation.

One important distinguishing feature of the sublevel hierarchy is that it is not restricted to POPs with a correlative sparsity pattern. It can also be applied to dense POPs, or to nearly-dense POPs where the problem is sparse except for a few dense constraints. As a result, we are able to improve the bounds obtained at the first-order relaxation (also called Shor's relaxation). In [3] we proposed a heuristic method to deal with general nearly-dense POPs as a trade-off between the first- and second-order relaxations of the Moment-SOS hierarchy; as we will see, the heuristic of [3] is also a special case of the sublevel hierarchy.

Another feature of the sublevel hierarchy is that it offers more flexible ways to tune the resulting relaxation than simply increasing the relaxation order (a rigid and costly strategy). More specifically, there are two hyper-parameters in the sublevel hierarchy: (i) the size of the selected subsets, and (ii) the number of such subsets. Suppose $m$ is the size of a maximal clique of a chordal extension of the CSP graph. Then we can choose $q$ (called the depth) subsets of size $l$ (called the level), with $1 \le l \le m$ and $q \le \binom{m}{l}$. For each maximal clique, we have a wide range of choices for the level and the depth, yielding a good trade-off between solution accuracy and computational efficiency.

The outline of the paper is as follows: Section 2 recalls preliminaries on the dense and sparse Lasserre hierarchies; Section 3 is the theoretical part on the sublevel hierarchy and the sublevel relaxation; Section 4 explicitly illustrates the sublevel relaxation for several types of optimization problems; Section 5 reports the results of the sublevel relaxation applied to the problems discussed in Section 4.

2 Preliminaries

In this section we briefly introduce Lasserre's hierarchy [9], which already has many successful applications in and outside optimization [12]. First let us recall some notation of polynomial optimization.
Given a positive integer $n \in \mathbb{N}$, let $\mathbf{x} = [x_1,\dots,x_n]^T$ be a vector of decision variables and $\mathbb{R}[\mathbf{x}]$ be the space of real polynomials in the variable $\mathbf{x}$. For a set $I \subseteq \{1,2,\dots,n\}$, let $\mathbf{x}_I := [x_i]_{i\in I}$ and let $\mathbb{R}[\mathbf{x}_I]$ be the space of real polynomials in the variable $\mathbf{x}_I$. Denote by $\mathbb{R}[\mathbf{x}]$ (resp. $\mathbb{R}[\mathbf{x}]_d$) the vector space of polynomials (resp. of degree at most $d$) in the variable $\mathbf{x}$; by $P[\mathbf{x}] \subseteq \mathbb{R}[\mathbf{x}]$ (resp. $P_d[\mathbf{x}] \subseteq \mathbb{R}[\mathbf{x}]_{2d}$) the convex cone of nonnegative polynomials (resp. nonnegative polynomials of degree at most $2d$) in the variable $\mathbf{x}$; and by $\Sigma[\mathbf{x}] \subseteq P[\mathbf{x}]$ (resp. $\Sigma[\mathbf{x}]_d \subseteq P_d[\mathbf{x}]$) the convex cone of SOS polynomials (resp. SOS polynomials of degree at most $2d$) in the variable $\mathbf{x}$.

In the context of optimization, Lasserre's hierarchy allows one to approximate the global optimum of (POP) by solving a hierarchy of SDPs of increasing size. Each SDP is a semidefinite relaxation of (POP) of the form:

$$\rho_d^{\mathrm{dense}} = \inf_{\mathbf{y}} \{ L_{\mathbf{y}}(f) : L_{\mathbf{y}}(1) = 1,\ M_d(\mathbf{y}) \succeq 0,\ M_{d-\omega_i}(g_i\,\mathbf{y}) \succeq 0,\ i = 1,\dots,p \}, \tag{Mom-d}$$

where $\omega_i = \lceil \deg(g_i)/2 \rceil$, $\mathbf{y} = (y_\alpha)_{\alpha\in\mathbb{N}^n_{2d}}$, and $L_{\mathbf{y}} : \mathbb{R}[\mathbf{x}] \to \mathbb{R}$ is the so-called Riesz linear functional:

$$f = \sum_\alpha f_\alpha\, \mathbf{x}^\alpha \;\mapsto\; L_{\mathbf{y}}(f) := \sum_\alpha f_\alpha\, y_\alpha, \qquad f \in \mathbb{R}[\mathbf{x}],$$

and $M_d(\mathbf{y})$, $M_{d-\omega_i}(g_i\,\mathbf{y})$ are the moment matrix and the localizing matrix, respectively; see [12] for precise definitions and more details. The semidefinite program (Mom-d) is the $d$-th order moment relaxation of problem (POP). When the semialgebraic set $K := \{\mathbf{x} : g_i(\mathbf{x}) \ge 0,\ i = 1,\dots,p\}$ is compact, one obtains a monotone sequence of lower bounds $(\rho_d)_{d\in\mathbb{N}}$ with $\rho_d \uparrow f^\ast$ as $d \to \infty$, under a certain technical Archimedean condition; the latter is easily satisfied by including the redundant quadratic constraint $M - \|\mathbf{x}\|^2 \ge 0$ in the description of $K$ (redundant since $K$ is compact and $M$ is large enough). Last but not least, this convergence is generically finite [18]. Ideally, one expects an optimal solution $\mathbf{y}^\ast$ of (Mom-d) to be the vector of moments up to order $2d$ of the Dirac measure $\delta_{\mathbf{x}^\ast}$ at a global minimizer $\mathbf{x}^\ast$ of (POP).

The hierarchy (Mom-d) is often referred to as the dense Lasserre hierarchy, since it does not exploit any possible sparsity pattern of the POP. Therefore, if one solves (Mom-d) with interior-point methods (as current SDP solvers usually do), the dense hierarchy is limited to POPs of modest size. Indeed, the $d$-th order dense moment relaxation (Mom-d) involves $\binom{n+2d}{n}$ variables and a moment matrix $M_d(\mathbf{y})$ of size $\binom{n+d}{n} = O(n^d)$ for fixed $d$. Fortunately, large-scale POPs often exhibit structured sparsity patterns which can be exploited to yield a sparse version of (Mom-d), as initially demonstrated in [27]. As a result, much wider applications of Lasserre's hierarchy have become possible.

Assume that the set of variables of (POP) can be divided into $r$ subsets indexed by $I_k$, for $k \in \{1,\dots,r\}$, i.e., $\{1,\dots,n\} = \bigcup_{k=1}^r I_k$, and suppose that the following assumptions hold:

A1: The objective $f$ is a sum of polynomials, each summand involving the variables of only one subset, i.e., $f(\mathbf{x}) = \sum_{k=1}^r f_k(\mathbf{x}_{I_k})$;

A2: Each constraint also involves the variables of only one subset, i.e., $g_i \in \mathbb{R}[\mathbf{x}_{I_{k(i)}}]$ for some $k(i) \in \{1,\dots,r\}$;

A3: The subsets $I_k$ satisfy the Running Intersection Property (RIP): for every $k \in \{1,\dots,r-1\}$, $I_{k+1} \cap \bigcup_{j=1}^k I_j \subseteq I_s$ for some $s \le k$.

It turns out that the maximal cliques of a chordal extension of the CSP graph induced by the POP satisfy the RIP [27].
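As a small illustration (ours, not from the paper), the chordal extension and maximal cliques of a toy CSP graph can be computed with networkx; the graph below is illustrative data and does not come from any instance of the paper:

```python
import networkx as nx

# Toy CSP graph on variables 0..4: an edge joins two variables that
# share a constraint or a monomial of the objective (illustrative data).
G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])  # a 5-cycle

# Chordal extension and its maximal cliques; these cliques play the role
# of the subsets I_k above and automatically satisfy the RIP.
H, _ = nx.complete_to_chordal_graph(G)
cliques = sorted(sorted(c) for c in nx.chordal_graph_cliques(H))
print(cliques)  # e.g. [[0, 1, 2], [0, 2, 3], [0, 3, 4]], depending on ordering
```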
From now on we call these subsets cliques, to distinguish them from the subsets of the sublevel hierarchy that will be discussed in the next section. A POP with such a sparsity pattern is of the form:

$$\inf_{\mathbf{x}\in\mathbb{R}^n} \{ f(\mathbf{x}) : g_i(\mathbf{x}_{I_k}) \ge 0,\ i = 1,\dots,p;\ i \in I_k \}, \tag{SpPOP}$$

where, by a slight abuse of notation, $i \in I_k$ means that the variables of the constraint $g_i$ all belong to the clique $I_k$. Its associated sparse Lasserre hierarchy reads:

$$\rho_d^{\mathrm{sparse}} = \inf_{\mathbf{y}} \{ L_{\mathbf{y}}(f) : L_{\mathbf{y}}(1) = 1,\ M_d(\mathbf{y}, I_k) \succeq 0,\ k \in \{1,\dots,r\},\ M_{d-\omega_i}(g_i\,\mathbf{y}, I_k) \succeq 0,\ i \in \{1,\dots,p\};\ i \in I_k \}, \tag{SpMom-d}$$

where $d$, $\omega_i$, $\mathbf{y}$, $L_{\mathbf{y}}$ are defined as in (Mom-d), but with a crucial difference: the matrix $M_d(\mathbf{y}, I_k)$ (resp. $M_{d-\omega_i}(g_i\,\mathbf{y}, I_k)$) is the submatrix of the moment matrix $M_d(\mathbf{y})$ (resp. of the localizing matrix $M_{d-\omega_i}(g_i\,\mathbf{y})$) associated with the clique $I_k$, and hence of much smaller size $\binom{\tau_k+d}{\tau_k}$ if $|I_k| =: \tau_k \ll n$. One has $\rho_d^{\mathrm{sparse}} \le f^\ast$ for all $d$; moreover, if the cliques $I_k$ satisfy the RIP, then we still obtain the convergence $\rho_d^{\mathrm{sparse}} \uparrow f^\ast$ as $d \to \infty$, as for the dense relaxation (Mom-d).

For each fixed $d$, the dual of (Mom-d) reads:

$$\sup_{t\in\mathbb{R}} \Big\{ t : f - t = \theta + \sum_{i=1}^p \sigma_i\, g_i \Big\}, \tag{SOS-d}$$

where $\theta$ is a sum-of-squares (SOS) polynomial in $\mathbb{R}[\mathbf{x}]$ of degree at most $2d$, and the $\sigma_i$ are SOS polynomials in $\mathbb{R}[\mathbf{x}]$ of degree at most $2(d-\omega_i)$ with $\omega_i = \lceil \deg(g_i)/2 \rceil$. The right-hand side of the identity in (SOS-d) is nothing less than Putinar's positivity certificate [20] for the polynomial $\mathbf{x} \mapsto f(\mathbf{x}) - t$ on the compact semialgebraic set $K$.

Similarly, the dual problem of (SpMom-d) reads:

$$\sup_{t\in\mathbb{R}} \Big\{ t : f - t = \sum_{k=1}^r \Big( \theta_k + \sum_{i\in I_k} \sigma_{i,k}\, g_i \Big) \Big\}, \tag{SpSOS-d}$$

where $\theta_k$ is an SOS in $\mathbb{R}[\mathbf{x}_{I_k}]$ of degree at most $2d$, and $\sigma_{i,k}$ is an SOS in $\mathbb{R}[\mathbf{x}_{I_k}]$ of degree at most $2(d-\omega_i)$ with $\omega_i = \lceil \deg(g_i)/2 \rceil$, for each $k = 1,\dots,r$. Then (SpSOS-d) implements the sparse version of Putinar's positivity certificate [10, 27].
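To make (Mom-d) concrete, here is a minimal sketch (ours, not from the paper) of the first-order moment relaxation of a two-variable toy POP, written with cvxpy; the toy problem and all names are our own choices:

```python
import cvxpy as cp

# Toy POP: min x1*x2  s.t. x1^2 = 1, x2^2 = 1  (optimal value -1).
# First-order moment matrix M_1(y), indexed by the monomials (1, x1, x2).
M = cp.Variable((3, 3), symmetric=True)
constraints = [
    M >> 0,        # M_1(y) positive semidefinite
    M[0, 0] == 1,  # L_y(1) = 1
    M[1, 1] == 1,  # L_y(x1^2) = 1, from the constraint x1^2 = 1
    M[2, 2] == 1,  # L_y(x2^2) = 1, from the constraint x2^2 = 1
]
shor = cp.Problem(cp.Minimize(M[1, 2]), constraints)  # minimize L_y(x1*x2)
shor.solve()
print(shor.value)  # approximately -1.0: a lower bound on the POP (here tight)
```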
Example 1

Let $\mathbf{x} \in \mathbb{R}^6$, $\mathbf{x}_1 := [x_i]_{i=1}^{4}$, $\mathbf{x}_2 := [x_i]_{i=3}^{6}$. We minimize $f(\mathbf{x}) = -\|\mathbf{x}\|^2$ over the semialgebraic set defined by $g_1(\mathbf{x}) = 1 - \|\mathbf{x}_1\|^2 \ge 0$ and $g_2(\mathbf{x}) = 1 - \|\mathbf{x}_2\|^2 \ge 0$. Then the second-order dense Lasserre relaxation reads

$$\sup_{t\in\mathbb{R}} \{ t : f(\mathbf{x}) - t = \theta(\mathbf{x}) + \sigma_1(\mathbf{x})\, g_1(\mathbf{x}) + \sigma_2(\mathbf{x})\, g_2(\mathbf{x}) \},$$

where $\theta$ is a degree-4 SOS polynomial in the variable $\mathbf{x}$, and $\sigma_1, \sigma_2$ are degree-2 SOS polynomials in the variable $\mathbf{x}$. Define $I_1 = \{1,2,3,4\}$ and $I_2 = \{3,4,5,6\}$; then $g_1 \in \mathbb{R}[\mathbf{x}_{I_1}]$ and $g_2 \in \mathbb{R}[\mathbf{x}_{I_2}]$. The second-order sparse Lasserre relaxation reads

$$\sup_{t\in\mathbb{R}} \{ t : f(\mathbf{x}) - t = \big(\theta_1(\mathbf{x}_{I_1}) + \sigma_1(\mathbf{x}_{I_1})\, g_1(\mathbf{x})\big) + \big(\theta_2(\mathbf{x}_{I_2}) + \sigma_2(\mathbf{x}_{I_2})\, g_2(\mathbf{x})\big) \},$$

where $\theta_k$ is a degree-4 SOS polynomial in the variable $\mathbf{x}_{I_k}$ and $\sigma_k$ is a degree-2 SOS polynomial in the variable $\mathbf{x}_{I_k}$, for each $k = 1, 2$.

3 The sublevel hierarchy

As seen in Section 2, the way to reduce the size of the moment and localizing matrices in (Mom-d) is either to reduce the relaxation order or to reduce the number of variables/terms in the SOS weights involved in Putinar's representation. The authors of [6] propose the multi-order Lasserre hierarchy to deal with large-scale optimal power flow problems: one reduces the relaxation order for the constraints involving a large number of variables. This approach is reused as the so-called partial relaxation to solve Max-Cut problems in [2]. The authors of [2] also proposed the augmented partial relaxation, an extended version of the partial relaxation that further improves the bounds. In this section, we develop the sublevel hierarchy, which generalizes several existing frameworks for both sparse and non-sparse POPs, and we show that, in the case of Max-Cut problems, the partial and augmented partial relaxations can be cast as special instances of the sublevel relaxation.
For problem (POP), the $d$-th order dense Lasserre relaxation is associated with the Putinar certificate $f - t = \sigma_0 + \sum_{i=1}^p \sigma_i\, g_i$, where $\sigma_0$ is an SOS in $\mathbb{R}[\mathbf{x}]$ of degree at most $2d$ and the $\sigma_i$ are SOS in $\mathbb{R}[\mathbf{x}]$ of degree at most $2(d-\omega_i)$ with $\omega_i = \lceil \deg(g_i)/2 \rceil$. In this section we choose some subsets of the variables $\mathbf{x}$ to decrease the number of terms involved in the SOS multipliers $\sigma_0$ and $\sigma_i$, and define intermediate sublevel hierarchies between the $d$-th and $(d+1)$-th order relaxations.

Note that in the dense variant of Lasserre's hierarchy, one approximates the cone of positive polynomials from the inside with the following hierarchy of SOS cones:

$$\mathbb{R}_+ = \Sigma[\mathbf{x}]_0 \subseteq \Sigma[\mathbf{x}]_1 \subseteq \dots \subseteq \Sigma[\mathbf{x}],$$

with $\bigcup_{d=0}^{+\infty} \Sigma[\mathbf{x}]_d = \Sigma[\mathbf{x}]$. Similarly, in the sparse variant, one relies on the following hierarchy of direct sums of SOS cones:

$$\mathbb{R}_+ = \oplus_k \Sigma[\mathbf{x}_{I_k}]_0 \subseteq \oplus_k \Sigma[\mathbf{x}_{I_k}]_1 \subseteq \dots \subseteq \oplus_k \Sigma[\mathbf{x}_{I_k}],$$

with $\bigcup_{d=0}^{+\infty} (\oplus_k \Sigma[\mathbf{x}_{I_k}]_d) = \oplus_k \Sigma[\mathbf{x}_{I_k}]$.

Definition 1 (Sublevel hierarchy of SOS cones)
Let $n$ be the number of variables of (POP). For $d \ge 0$ and $0 \le l \le n$, the $l$-th level SOS cone associated with $\Sigma[\mathbf{x}]_d$, denoted by $\Sigma[\mathbf{x}]_d^l$, is an SOS cone lying between $\Sigma[\mathbf{x}]_d$ and $\Sigma[\mathbf{x}]_{d+1}$, defined as

$$\Sigma[\mathbf{x}]_d \subseteq \Sigma[\mathbf{x}]_d^l := \Sigma[\mathbf{x}]_d + \widetilde{\Sigma}[\mathbf{x}]_{d+1}^l \subseteq \Sigma[\mathbf{x}]_{d+1},$$

where $\widetilde{\Sigma}[\mathbf{x}]_{d+1}^l := \Big\{ \sum_{|I|=l} \sigma_I(\mathbf{x}_I) : I \subseteq \{1,\dots,n\},\ \sigma_I(\mathbf{x}_I) \in \Sigma[\mathbf{x}_I]_{d+1} \Big\} \subseteq \Sigma[\mathbf{x}]_{d+1}$; i.e., the SOS polynomials of $\widetilde{\Sigma}[\mathbf{x}]_{d+1}^l$ are the elements of $\Sigma[\mathbf{x}]_{d+1}$ that can be decomposed into several components, each component being an SOS polynomial in $l$ variables. We use the convention $\Sigma[\mathbf{x}]_d^0 := \Sigma[\mathbf{x}]_d$. Then, for the dense case, we rely on the sublevel hierarchy of inner approximations of the cone of positive polynomials:

$$\Sigma[\mathbf{x}]_d = \Sigma[\mathbf{x}]_d^0 \subseteq \Sigma[\mathbf{x}]_d^1 \subseteq \dots \subseteq \Sigma[\mathbf{x}]_d^n = \Sigma[\mathbf{x}]_{d+1}.$$

Similarly, suppose that $\{I_k\}_{1\le k\le r}$ are the cliques of the sparse problem (SpPOP). For $l \le \tau_k := |I_k|$, we define the $l$-th level SOS cone of $\Sigma[\mathbf{x}_{I_k}]_d$, denoted by $\Sigma[\mathbf{x}_{I_k}]_d^l$, as

$$\Sigma[\mathbf{x}_{I_k}]_d \subseteq \Sigma[\mathbf{x}_{I_k}]_d^l := \Sigma[\mathbf{x}_{I_k}]_d + \widetilde{\Sigma}[\mathbf{x}_{I_k}]_{d+1}^l \subseteq \Sigma[\mathbf{x}_{I_k}]_{d+1},$$

where $\widetilde{\Sigma}[\mathbf{x}_{I_k}]_{d+1}^l := \Big\{ \sum_{|I|=l} \sigma_I(\mathbf{x}_I) : I \subseteq I_k,\ \sigma_I(\mathbf{x}_I) \in \Sigma[\mathbf{x}_I]_{d+1} \Big\} \subseteq \Sigma[\mathbf{x}_{I_k}]_{d+1}$; i.e., the SOS polynomials of $\widetilde{\Sigma}[\mathbf{x}_{I_k}]_{d+1}^l$ are the elements of $\Sigma[\mathbf{x}_{I_k}]_{d+1}$ that can be decomposed into several components, each component being an SOS polynomial in $l$ variables indexed by $I_k$. Then, for the sparse case, we rely on the sublevel hierarchy of inner approximations of the cone of positive polynomials:

$$\Sigma[\mathbf{x}_{I_k}]_d = \Sigma[\mathbf{x}_{I_k}]_d^0 \subseteq \Sigma[\mathbf{x}_{I_k}]_d^1 \subseteq \dots \subseteq \Sigma[\mathbf{x}_{I_k}]_d^{\tau_k} = \Sigma[\mathbf{x}_{I_k}]_{d+1}.$$
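As an illustration (our own, not from the paper), take $n = 3$, $d = 1$ and $l = 2$; Definition 1 then gives

```latex
\Sigma[\mathbf{x}]_1^2 := \Sigma[\mathbf{x}]_1 + \widetilde{\Sigma}[\mathbf{x}]_2^2,
\qquad
\widetilde{\Sigma}[\mathbf{x}]_2^2 =
\Bigl\{\, \sigma_{12}(x_1,x_2) + \sigma_{13}(x_1,x_3) + \sigma_{23}(x_2,x_3)
\;:\; \sigma_{ij} \in \Sigma[x_i,x_j]_2 \,\Bigr\}.
```

Each $\sigma_{ij}$ is a degree-4 SOS in two variables, with a Gram matrix of size $\binom{2+2}{2} = 6$, instead of a single degree-4 SOS in all three variables, whose Gram matrix has size $\binom{3+2}{2} = 10$.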
Remark 1

Lasserre's hierarchy relies on a hierarchy of SOS cones, while the sublevel hierarchy relies on a hierarchy of sublevel SOS cones. Taking the sparse case for illustration: solving the $d$-th order relaxation of the standard sparse Lasserre hierarchy boils down to finding SOS multipliers in the cones $\Sigma[\mathbf{x}_{I_k}]_d \oplus \Sigma[\mathbf{x}_{I_k}]_{d-\omega_i}$ for each clique $I_k$, i.e., in $\oplus_k \big(\Sigma[\mathbf{x}_{I_k}]_d \oplus \Sigma[\mathbf{x}_{I_k}]_{d-\omega_i}\big)$. Solving the $d$-th order sublevel hierarchy boils down to finding SOS multipliers in the intermediate cones $\oplus_k \big(\Sigma[\mathbf{x}_{I_k}]_d^{l_k} \oplus \Sigma[\mathbf{x}_{I_k}]_{d-\omega_i}^{l_k}\big)$ for some $0 \le l_k \le \tau_k$. This cone approximates the standard cone $\oplus_k \big(\Sigma[\mathbf{x}_{I_k}]_{d+1} \oplus \Sigma[\mathbf{x}_{I_k}]_{d-\omega_i+1}\big)$ as $l_k$ gets larger, since $\oplus_k \big(\Sigma[\mathbf{x}_{I_k}]_d^{\tau_k} \oplus \Sigma[\mathbf{x}_{I_k}]_{d-\omega_i}^{\tau_k}\big) = \oplus_k \big(\Sigma[\mathbf{x}_{I_k}]_{d+1} \oplus \Sigma[\mathbf{x}_{I_k}]_{d-\omega_i+1}\big)$. We will see in the next definition that this yields the so-called sublevel relaxation, and we call the vector $\{l_k\}$ the vector of sublevels of the relaxation. Each $l_k$ determines the size of the subsets in the clique $I_k$ and is called a sublevel.

Definition 2 (Sublevel hierarchy of moment-SOS relaxations)
Let $n$ be the number of variables of (POP). For each constraint $g_i \ge 0$ of (POP), we define a sublevel $0 \le l_i \le n$ and a depth $1 \le q_i \le n$. Denote by $\mathbf{l} = [l_i]_{i=1}^p$ the vector of sublevels and by $\mathbf{q} = [q_i]_{i=1}^p$ the vector of depths. Then the $(\mathbf{l},\mathbf{q})$-sublevel relaxation of the $d$-th order dense SOS problem (SOS-d) reads

$$\sup_{t\in\mathbb{R}} \Big\{ t : f - t = \theta_0 + \sum_{i=1}^p \big( \tilde\theta_i + (\sigma_i + \tilde\sigma_i)\, g_i \big) \Big\}, \tag{SubSOS-[d, l, q]}$$

where $\theta_0$ (resp. $\sigma_i$) are SOS polynomials in $\Sigma[\mathbf{x}]_d$ (resp. $\Sigma[\mathbf{x}]_{d-\omega_i}$), and $\tilde\theta_i$ (resp. $\tilde\sigma_i$) are SOS polynomials in $\widetilde{\Sigma}[\mathbf{x}]_{d+1}^{l_i}$ (resp. $\widetilde{\Sigma}[\mathbf{x}]_{d-\omega_i+1}^{l_i}$), with $\omega_i = \lceil \deg(g_i)/2 \rceil$. Moreover, each $\tilde\sigma_i$ is a sum of $q_i$ SOS polynomials, each summand involving the variables of a certain subset $\Gamma_{i,j} \subseteq \{1,2,\dots,n\}$ with $|\Gamma_{i,j}| = l_i$; i.e., $\tilde\sigma_i = \sum_{j=1}^{q_i} \tilde\sigma_{i,j}$ where $\tilde\sigma_{i,j} \in \Sigma[\mathbf{x}_{\Gamma_{i,j}}]_{d-\omega_i+1}$. Each $\tilde\theta_i$ is also a sum of $q_i$ SOS polynomials whose summands share the same variable sets $\Gamma_{i,j}$ as the $\tilde\sigma_{i,j}$; i.e., $\tilde\theta_i = \sum_{j=1}^{q_i} \tilde\theta_{i,j}$ where $\tilde\theta_{i,j} \in \Sigma[\mathbf{x}_{\Gamma_{i,j}}]_{d+1}$. The identity in (SubSOS-[d, l, q]) can be written compactly, in a form analogous to the standard dense Lasserre relaxation, as

$$\sup_{t\in\mathbb{R}} \Big\{ t : f - t = \sum_{i=1}^p (\tilde\theta_i + \tilde\sigma_i\, g_i) \Big\},$$

where now $\tilde\theta_i$ (resp. $\tilde\sigma_i$) are SOS polynomials in $\Sigma[\mathbf{x}]_d^{l_i}$ (resp. $\Sigma[\mathbf{x}]_{d-\omega_i}^{l_i}$).

Similarly, suppose that $(I_k)_{1\le k\le r}$ are the cliques of the sparse problem (SpPOP) with $\tau_k = |I_k|$. For each constraint $g_i \ge 0$ of (SpPOP), denote by $k(i)$ the set of indices $s$ such that $i \in I_s$. For each $i$ and $s \in k(i)$, define a sublevel $0 \le l_{i,s} \le \tau_s$ and a depth $1 \le q_{i,s} \le \tau_s$. Denote by $\mathbf{l} = [l_{i,s}]_{i=1,\dots,p;\ s\in k(i)}$ the vector of sublevels and by $\mathbf{q} = [q_{i,s}]_{i=1,\dots,p;\ s\in k(i)}$ the vector of depths. Then the $(\mathbf{l},\mathbf{q})$-sublevel relaxation of the $d$-th order sparse SOS problem (SpSOS-d) reads

$$\sup_{t\in\mathbb{R}} \Big\{ t : f - t = \sum_{k=1}^r \Big( \theta_{0,k} + \sum_{i\in I_k} \big( \tilde\theta_{i,k} + (\sigma_{i,k} + \tilde\sigma_{i,k})\, g_i \big) \Big) \Big\}, \tag{SubSpSOS-[d, l, q]}$$

where $\theta_{0,k}$ (resp. $\sigma_{i,k}$) are SOS polynomials in $\Sigma[\mathbf{x}_{I_k}]_d$ (resp. $\Sigma[\mathbf{x}_{I_k}]_{d-\omega_i}$), and $\tilde\theta_{i,k}$ (resp. $\tilde\sigma_{i,k}$) are SOS polynomials in $\widetilde{\Sigma}[\mathbf{x}_{I_k}]_{d+1}^{l_{i,k}}$ (resp. $\widetilde{\Sigma}[\mathbf{x}_{I_k}]_{d-\omega_i+1}^{l_{i,k}}$), with $\omega_i = \lceil \deg(g_i)/2 \rceil$. Moreover, each $\tilde\sigma_{i,k}$ with $i \in I_k$ is a sum of $q_{i,k}$ SOS polynomials, each summand involving the variables of a certain subset $\Gamma_{i,k,j} \subseteq I_k$ with $|\Gamma_{i,k,j}| = l_{i,k}$; i.e., $\tilde\sigma_{i,k} = \sum_{j=1}^{q_{i,k}} \tilde\sigma_{i,k,j}$ where $\tilde\sigma_{i,k,j} \in \Sigma[\mathbf{x}_{\Gamma_{i,k,j}}]_{d-\omega_i+1}$. Each $\tilde\theta_{i,k}$ is also a sum of $q_{i,k}$ SOS polynomials whose summands share the same variable sets $\Gamma_{i,k,j}$ as the $\tilde\sigma_{i,k,j}$; i.e., $\tilde\theta_{i,k} = \sum_{j=1}^{q_{i,k}} \tilde\theta_{i,k,j}$ where $\tilde\theta_{i,k,j} \in \Sigma[\mathbf{x}_{\Gamma_{i,k,j}}]_{d+1}$. The identity in (SubSpSOS-[d, l, q]) can also be written compactly, in a form analogous to the standard sparse Lasserre relaxation, as

$$\sup_{t\in\mathbb{R}} \Big\{ t : f - t = \sum_{k=1}^r \sum_{i\in I_k} \big( \tilde\theta_{i,k} + \tilde\sigma_{i,k}\, g_i \big) \Big\},$$

where now $\tilde\theta_{i,k}$ (resp. $\tilde\sigma_{i,k}$) are SOS polynomials in $\Sigma[\mathbf{x}_{I_k}]_d^{l_{i,k}}$ (resp. $\Sigma[\mathbf{x}_{I_k}]_{d-\omega_i}^{l_{i,k}}$).

Remark 2

(i) If one of the sublevels $l_i$ (resp. $l_{i,k}$) in the dense (resp. sparse) sublevel relaxation is such that $l_i = n$ (resp. $l_{i,k} = \tau_k$), then the depth $q_i$ (resp. $q_{i,k}$) should automatically be 1.

(ii) Heuristics to determine the subsets ($\Gamma_{i,j}$ in the dense case, $\Gamma_{i,k,j}$ in the sparse case) of the sublevel relaxation are discussed in the next section.

(iii) The size of the SDP Gram matrix associated with an SOS polynomial in $\Sigma[\mathbf{x}]_d^l$ (resp.
$\Sigma[\mathbf{x}_{I_k}]_d^l$) is $\max\big\{\binom{n+d}{d}, \binom{l+d+1}{d+1}\big\}$ (resp. $\max\big\{\binom{|I_k|+d}{d}, \binom{l+d+1}{d+1}\big\}$). If the lower bound obtained by solving the SOS problem over $\Sigma[\mathbf{x}]_d$ (resp. $\Sigma[\mathbf{x}_{I_k}]_d$) is not satisfactory, then one may look for more accurate bounds in one of the cones $\Sigma[\mathbf{x}]_d^l$ (resp. $\Sigma[\mathbf{x}_{I_k}]_d^l$).
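To get a feel for the numbers in Remark 2(iii), the following small sketch (ours) evaluates those binomial sizes, here for the dense case with $n = 100$ and $d = 1$, the regime of the experiments of Section 5:

```python
from math import comb

def gram_size(n, d, l):
    """Largest Gram block for the cone Sigma[x]^l_d, as in Remark 2(iii):
    max{ C(n+d, d), C(l+d+1, d+1) }."""
    return max(comb(n + d, d), comb(l + d + 1, d + 1))

n, d = 100, 1
for l in (0, 4, 6, 8, n):
    print(l, gram_size(n, d, l))
# l = 0..8: the largest block stays at C(101, 1) = 101 (Shor-sized), the
#           extra sublevel blocks having size at most C(10, 2) = 45;
# l = n   : C(102, 2) = 5151, the full second-order moment matrix.
```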
Example 2

Take the polynomials $f, g_k$ and the cliques $I_k$ as in Example 1. Define $\mathbf{l} = [2, 2]$ and $\mathbf{q} = [1, 1]$, and select the subsets associated with $g_1$ and $g_2$ respectively as $\Gamma_{1,1} = \{1,2\}$ and $\Gamma_{2,1} = \{5,6\}$. Then the second-order dense $(\mathbf{l},\mathbf{q})$-sublevel relaxation reads

$$\sup_{t\in\mathbb{R}} \{ t : f(\mathbf{x}) - t = \theta_0(\mathbf{x}) + \big( \tilde\theta_1(\mathbf{x}_{\Gamma_{1,1}}) + \tilde\sigma_1(\mathbf{x}_{\Gamma_{1,1}})\, g_1(\mathbf{x}) \big) + \big( \tilde\theta_2(\mathbf{x}_{\Gamma_{2,1}}) + \tilde\sigma_2(\mathbf{x}_{\Gamma_{2,1}})\, g_2(\mathbf{x}) \big) \},$$

where $\theta_0$ is a degree-2 SOS polynomial in the variable $\mathbf{x}$, the $\tilde\theta_k$ are degree-4 SOS polynomials in the variables $\mathbf{x}_{\Gamma_{k,1}}$, and the $\tilde\sigma_k$ are degree-2 SOS polynomials in the variables $\mathbf{x}_{\Gamma_{k,1}}$. In other words, $\theta_0 \in \Sigma[\mathbf{x}]_1$, $\tilde\theta_k \in \Sigma[\mathbf{x}_{\Gamma_{k,1}}]_2 \subseteq \widetilde{\Sigma}[\mathbf{x}]_2^2$, $\tilde\sigma_k \in \Sigma[\mathbf{x}_{\Gamma_{k,1}}]_1 \subseteq \widetilde{\Sigma}[\mathbf{x}]_1^2$.

Similarly, define $\Gamma_{1,1,1} = \{1,2\} \subseteq I_1$ and $\Gamma_{2,2,1} = \{5,6\} \subseteq I_2$; then the second-order sparse $(\mathbf{l},\mathbf{q})$-sublevel relaxation reads

$$\sup_{t\in\mathbb{R}} \{ t : f(\mathbf{x}) - t = \big( \theta_{0,1}(\mathbf{x}_{I_1}) + \tilde\theta_1(\mathbf{x}_{\Gamma_{1,1,1}}) + \tilde\sigma_1(\mathbf{x}_{\Gamma_{1,1,1}})\, g_1(\mathbf{x}) \big) + \big( \theta_{0,2}(\mathbf{x}_{I_2}) + \tilde\theta_2(\mathbf{x}_{\Gamma_{2,2,1}}) + \tilde\sigma_2(\mathbf{x}_{\Gamma_{2,2,1}})\, g_2(\mathbf{x}) \big) \},$$

where the $\theta_{0,k}$ are degree-2 SOS polynomials in the variables $\mathbf{x}_{I_k}$, the $\tilde\theta_k$ are degree-4 SOS polynomials in the variables $\mathbf{x}_{\Gamma_{k,k,1}}$, and the $\tilde\sigma_k$ are degree-2 SOS polynomials in the variables $\mathbf{x}_{\Gamma_{k,k,1}}$. In other words, $\theta_{0,k} \in \Sigma[\mathbf{x}_{I_k}]_1$, $\tilde\theta_k \in \Sigma[\mathbf{x}_{\Gamma_{k,k,1}}]_2 \subseteq \widetilde{\Sigma}[\mathbf{x}_{I_k}]_2^2$, $\tilde\sigma_k \in \Sigma[\mathbf{x}_{\Gamma_{k,k,1}}]_1 \subseteq \widetilde{\Sigma}[\mathbf{x}_{I_k}]_1^2$.

The standard Lasserre hierarchy and many of its variants are contained in the framework of the sublevel hierarchy:
Example 3 (Dense Lasserre’s Relaxation [9])
The dense version of the $d$-th order Lasserre relaxation is the dense $(d-1)$-th order sublevel relaxation with $\mathbf{l} = [n, n, \dots, n]$ and $\mathbf{q} = \mathbf{1}_p$, where $\mathbf{1}_p$ denotes the $p$-dimensional vector of all ones.

Example 4 (Sparse Lasserre Relaxation [10])

The sparse version of the $d$-th order Lasserre relaxation is the sparse $(d-1)$-th order sublevel relaxation with $\mathbf{l} = [[\tau_s]_{s\in k(1)}; \dots; [\tau_s]_{s\in k(p)}]$ and $\mathbf{q} = \mathbf{1}_{|k(1)|+\dots+|k(p)|}$.

Example 5 (Multi-Order/Partial Relaxation)
The multi-order relaxation (used to solve the optimal power flow problem in [6]), also named the partial relaxation (used to solve the Max-Cut problem in [2]), is a variant of the second-order sparse Lasserre relaxation. We first preset a threshold value $r_0$, then compute the maximal cliques of the chordal extension of the CSP graph of the POP. For the cliques of size larger than $r_0$ we consider the first-order moment matrices; for those of size smaller than or equal to $r_0$ we consider the second-order moment matrices. Denote by $S$ the set of indices $k$ such that $\tau_k > r_0$, and by $T$ the set of indices $k$ such that $\tau_k \le r_0$. Then the multi-order/partial relaxation is the second-order sublevel relaxation with $\mathbf{l} = [[0]_{s\in k(1)\cap S}, [\tau_s]_{s\in k(1)\cap T}; \dots; [0]_{s\in k(p)\cap S}, [\tau_s]_{s\in k(p)\cap T}]$ and $\mathbf{q} = [[0]_{s\in k(1)\cap S}, [1]_{s\in k(1)\cap T}; \dots; [0]_{s\in k(p)\cap S}, [1]_{s\in k(p)\cap T}]$.
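A minimal sketch (ours, with toy clique data, not the authors' code) of the threshold rule behind Example 5: cliques above the threshold keep first-order matrices, the others are lifted to second order.

```python
# Threshold rule of the multi-order/partial relaxation (Example 5).
cliques = {1: 40, 2: 12, 3: 7, 4: 25}  # clique index -> size tau_k (toy data)
r0 = 15                                 # preset threshold

S = {k for k, tau in cliques.items() if tau > r0}    # kept at first order
T = {k for k, tau in cliques.items() if tau <= r0}   # lifted to second order
levels = {k: (0 if k in S else cliques[k]) for k in cliques}
depths = {k: (0 if k in S else 1) for k in cliques}
print(S, T, levels)  # {1, 4} {2, 3} {1: 0, 2: 12, 3: 7, 4: 25}
```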
Example 6 (Augmented Partial Relaxation)

This relaxation is a strengthened version of the partial relaxation, used by the authors of [2] to solve Max-Cut problems. It is exactly the second-order sublevel relaxation restricted to the Max-Cut problem.
Example 7 (Heuristic Relaxation)
The heuristic relaxation proposed by the authors of [3] to compute upper bounds on the Lipschitz constant of ReLU networks is a variant of the second-order dense Lasserre relaxation. The setting is that some constraints of the POP are sparse (denote by $S$ the set of their indices) while their corresponding cliques are large, so that one cannot solve the second-order relaxation of the standard sparse Lasserre hierarchy. We then consider the dense first-order relaxation (Shor's relaxation) and choose subsets of moderate size (size 2 in [3]) that contain the variable sets of these sparse constraints. For the other constraints, with larger variable sets (denote by $T$ the set of their indices), we consider the first-order moment matrices. Then the heuristic relaxation is the second-order sublevel relaxation with $\mathbf{l} = [[0]_{i\in T}, [2]_{i\in S}]$ and $\mathbf{q} = [[0]_{i\in T}, [1]_{i\in S}]$.

Summarizing the above discussion, we have the following proposition:
Proposition 1
For the dense case, if $\mathbf{l} = [n, n, \dots, n]$, then the $d$-th order $(\mathbf{l},\mathbf{q})$-sublevel relaxation is exactly the dense $(d+1)$-th order Lasserre relaxation. For the sparse case, if $\mathbf{l} = [[\tau_s]_{s\in k(1)}; \dots; [\tau_s]_{s\in k(p)}]$, then the $d$-th order $(\mathbf{l},\mathbf{q})$-sublevel relaxation is exactly the sparse $(d+1)$-th order Lasserre relaxation.

Choice of the subsets

There are different ways to determine the subsets $\Gamma_{i,j}$ (or $\Gamma_{i,k,j}$) of the sublevel relaxation described in Definition 2. In general, we are not aware of any algorithm that would guarantee that the selected subsets are optimal at a given level of relaxation. In this section, we propose several heuristics to select the subsets. Suppose that $\{I_k\}_{1\le k\le r}$ is the sequence of maximal cliques of the chordal extension of the CSP graph of the sparse problem (SpPOP) and that the level of the relaxation is $l \le |I_k| =: \tau_k$. We need to select the "best" candidates among the $\binom{\tau_k}{l}$ subsets of size $l$. In practice, however, the number $\binom{\tau_k}{l}$ might be very large, since $\binom{\tau_k}{l} \approx \tau_k^l$ for fixed $l$.

In order to make this selection procedure tractable, we reduce the number of sample subsets to $\tau_k$. Precisely, suppose $I_k := \{i_1, i_2, \dots, i_{\tau_k}\}$ and define $I_{k,j} := \{i_j, i_{j+1}, \dots, i_{j+l-1}\}$ for $j = 1, 2, \dots, \tau_k$ and $1 \le l \le \tau_k$, with the convention that $i_j = i_{j'}$ if $j \equiv j' \bmod \tau_k$. Denote by $p$ the depth of the relaxation. We then use the following heuristics to choose $p$ subsets among the candidates $I_{k,j}$. Without loss of generality, we assume that $l < \tau_k$ (otherwise $l \ge \tau_k$, and we only need to select the single subset $I = I_k$).

• H1 (Random Heuristic). For each $i$ and clique $I_k$, we randomly select $p$ subsets $\Gamma_{i,k,j} \subseteq I_k$ for $j = 1,\dots,p$, such that $|\Gamma_{i,k,j}| = l$ for all $j$.

• H2 (Ordered Heuristic). For each $i$ and clique $I_k$, we select one after another $\Gamma_{i,k,j} = I_{k,j} \subseteq I_k$ for $j = 1,\dots,p$. For $p = \tau_k$, we also call this heuristic the cyclic heuristic.

The heuristics H1 and H2 do not depend on the problem, so they might not fully explore the specific structure hidden in the POP. We can also select the subsets according to the values of the moments in the first-order moment relaxation (Shor's relaxation).

• H3 (Moment Heuristic). First, we solve the first-order sparse relaxation. For each $i$ and clique $I_k$, let $M_k$ be the first-order moment matrix indexed by 1 and the monomials of $\mathbf{x}_{I_k}$. Denote by $M_k(I_{k,j})$ the submatrix whose rows and columns are indexed by 1 and $\mathbf{x}_{I_{k,j}}$, for $j = 1, 2, \dots, \tau_k$. We reorder the subsets $I_{k,j}$ w.r.t. the infinity norms of the submatrices $M_k(I_{k,j})$, i.e., $\|M_k(I_{k,1})\|_\infty \ge \|M_k(I_{k,2})\|_\infty \ge \dots \ge \|M_k(I_{k,\tau_k})\|_\infty$. Then we pick the first $p$ subsets $\Gamma_{i,k,1} = I_{k,1}, \Gamma_{i,k,2} = I_{k,2}, \dots, \Gamma_{i,k,p} = I_{k,p}$ after reordering.

In particular, for the Max-Cut problem, the authors of [2] proposed the following heuristics, which take into account the weights of the graph or the maximal cliques of the chordal graph. We briefly introduce the idea of these heuristics; the reader can refer to [2] for details. For the heuristics H4 and H4-5, denote by $L$ the Laplacian matrix of the graph.

• H4 (Laplacian Heuristic). For each clique $I_k$, denote by $L(I_{k,j})$ the submatrix of the Laplacian matrix $L$ whose rows and columns are indexed by $I_{k,j}$, for $j = 1, 2, \dots, \tau_k$. We reorder the subsets $I_{k,j}$ w.r.t. the infinity norms of the submatrices $L(I_{k,j})$, i.e., $\|L(I_{k,1})\|_\infty \ge \|L(I_{k,2})\|_\infty \ge \dots$
$\ge \|L(I_{k,\tau_k})\|_\infty$. Then we pick the first $p$ subsets $\Gamma_{i,k,1} = I_{k,1}, \Gamma_{i,k,2} = I_{k,2}, \dots, \Gamma_{i,k,p} = I_{k,p}$ after reordering.

• H5 (Max-Repeated Heuristic). We select subsets contained in many maximal cliques.

• H6 (Min-Repeated Heuristic). We select subsets contained in few maximal cliques.

• H4-5.
We combine the heuristics H4 and H5 to select the subsets that are not repeated in other maximal cliques and contain variables with large weights.

In the spirit of the heuristic H4-5, we can also combine H5 with the moment heuristic H3:
• H3-5. We combine H3 and H5 to select the subsets that are not repeated in other maximal cliques and contain variables with large moments.

Table 1: Comparison of different heuristics for the Max-Cut instances g 20 and w01 100 (missing entries were lost in the source).

Heuristic | lv=4,p=1 (g20, w01) | lv=4,p=2 (g20, w01) | lv=6,p=1 (g20, w01) | lv=6,p=2 (g20, w01) | Count
H1   | –     –     | –     –     | –     –     | –     –     | –
H3   | 550.6 728.8 | 541.8 723.2 | 528.5 713.9 | 524.2 705.6 | 0
H4   | –     –     | –     –     | –     –     | –     –     | –
H5   | 553.5 731.0 | 543.1 725.8 | 529.3 715.6 | 525.2 708.4 | 0
H6   | 553.3 731.2 | 543.2 726.6 | 529.3 717.2 | 525.2 710.3 | 0
H3-5 | 550.5 729.5 | 541.8 726.6 | 528.5 713.8 | 524.2 704.8 | 0
H4-5 | 549.8 726.6 | 542.0 719.3 | 526.9 710.4 | 523.6 –     | –

H1 performs the best among the heuristics; the ordered heuristic H2 and the Laplacian heuristic H4 also perform well. For the sake of simplicity, we will only consider the ordered heuristic H2 and its variants in the forthcoming examples.

4 Sublevel relaxations for different classes of problems

In this section, we explicitly build different sublevel relaxations for different classes of polynomial optimization problems: Maximum Cut (Max-Cut), Maximum Clique (Max-Cliq), Mixed-Integer Quadratically Constrained Programming (MIQCP) and Quadratically Constrained Quadratic Programming (QCQP). We also consider two classes of problems arising from deep learning: robustness certification and Lipschitz constant estimation of neural networks. For many deep learning applications, the targeted optimization problems are often dense or nearly dense, due to the composition of affine maps and nonlinear activation functions such as $\mathrm{ReLU}(A\mathbf{x}) = \max\{A\mathbf{x}, 0\}$. In this case, the sublevel hierarchy is indeed helpful. A simple application to Lipschitz constant estimation was previously considered by the authors in [3].

For simplicity, unless stated explicitly, we always assume that all the levels $l_i$ (resp. $l_{i,k}$) and depths $q_i$ (resp. $q_{i,k}$) are identical, i.e., $l_i = l$, $q_i = q$ for all $i$ (resp. $l_{i,k} = l$, $q_{i,k} = q$ for all $i, k$). We say that this simplified sublevel relaxation is of level $l$ and depth $q$. Note that the sublevel relaxation of level 0 and depth 0 is equivalent to Shor's relaxation. By convention, if $l_{i,k} \ge \tau_k$, then the sublevel $l_{i,k}$ is automatically set to $\tau_k$ and the depth to 1. For all the examples, we consider the ordered heuristic H2 or its variants to select the subsets of the sublevel relaxation.

4.1 Examples from optimization

The examples listed in this section are typical in optimization.
Maximum cut (Max-Cut) problem
Given an undirected graph $G(V,E)$, where $V$ is the set of vertices and $E$ the set of edges, a cut is a partition of the vertices into two disjoint subsets. The Max-Cut problem consists of finding a cut of the graph such that the number (or total weight) of edges between the two subsets is as large as possible. It can be formulated as follows:

$$\max_{\mathbf{x}} \{ \mathbf{x}^T L\, \mathbf{x} : \mathbf{x} \in \{-1,1\}^n \}, \tag{Max-Cut}$$

where $L$ is the Laplacian matrix of the given graph with $n$ vertices, i.e., $L := \mathrm{diag}(W\mathbf{1}_n) - W$, where $W$ is the weight matrix of the graph. The constraint $\mathbf{x} \in \{-1,1\}^n$ is equivalent to $x_i^2 = 1$ for all $i$. Suppose that the $I_k$ are the maximal cliques of a chordal extension of the given graph. For $i = 1, 2, \dots, n$, denote by $k(i)$ the set of indices $s$ such that $i \in I_s$. For $s \in k(i)$, suppose that $I_s = \{i_1, \dots, i_{\tau_s}\}$, so that $i_{j(i)} = i$ for some $1 \le j(i) \le \tau_s$. Then we select the $q$ subsets of size $l$ by order as $I_{s,t} = \{i_{j(i)}, i_{j(i)+t}, \dots, i_{j(i)+t+l-2}\}$ for $t = 1, 2, \dots, q$ (see the code sketch at the end of this subsection). If we consider the dense sublevel hierarchy, then we directly select the subsets by order as $I_t = \{i, i+t, \dots, i+t+l-2\}$ for $t = 1, 2, \dots, q$.

Maximum clique (Max-Cliq) problem

Given an undirected graph $G(V,E)$, where $V$ is the set of vertices and $E$ the set of edges, a clique is a set of vertices that is completely interconnected. The Max-Cliq problem consists of determining a clique of maximum cardinality. It can be stated as a nonconvex quadratic programming problem over the unit simplex [19], with general formulation:

$$\max_{\mathbf{x}} \Big\{ \mathbf{x}^T A\, \mathbf{x} : \sum_{i=1}^n x_i = 1,\ \mathbf{x} \in [0,1]^n \Big\}, \tag{Max-Cliq}$$

where $A$ is the adjacency matrix of the given graph with $n$ vertices. The constraint $\mathbf{x} \in [0,1]^n$ is equivalent to $x_i(x_i - 1) \le 0$ for $i = 1, 2, \dots, n$. The Max-Cliq problem is dense, since the constraint $\sum_{i=1}^n x_i = 1$ involves all the variables; we therefore apply the dense sublevel hierarchy. To handle the constraint $\sum_{i=1}^n x_i = 1$, we select the $q$ subsets of size $l$ by order as $I_t = \{t, t+1, \dots, t+l-1\}$ for $t = 1, 2, \dots, q$. For the constraints $x_i(x_i - 1) \le 0$, we select the subsets by order as $I_t = \{i, i+t, \dots, i+t+l-2\}$ for $t = 1, 2, \dots, q$.
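The ordered (H2-style, cyclic) selection used above for Max-Cut and Max-Cliq can be sketched as follows (our illustration with toy values; not the authors' code):

```python
def ordered_subsets(clique, i, l, q):
    """For a variable i in a clique I_k = [i_1, ..., i_tau]: return q subsets
    of size l, each containing i followed by l-1 consecutive clique members,
    indices taken cyclically as in the Max-Cut selection rule above."""
    tau = len(clique)
    j = clique.index(i)  # position j(i) of variable i in the clique
    subsets = []
    for t in range(1, q + 1):
        # { i_j(i), i_j(i)+t, ..., i_j(i)+t+l-2 }, indices modulo tau
        sub = [clique[j]] + [clique[(j + t + s) % tau] for s in range(l - 1)]
        subsets.append(sub)
    return subsets

print(ordered_subsets([3, 7, 8, 11, 12], i=7, l=3, q=2))
# [[7, 8, 11], [7, 11, 12]]
```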
Mixed-integer quadratically constrained programming (MIQCP)

The MIQCP problem is of the following form:

$$\min_{\mathbf{x}} \{ \mathbf{x}^T Q_0\, \mathbf{x} + \mathbf{b}_0^T \mathbf{x} : \mathbf{x}^T Q_i\, \mathbf{x} + \mathbf{b}_i^T \mathbf{x} \le c_i,\ i = 1,\dots,p,\ A\mathbf{x} = \mathbf{b},\ \mathbf{l} \le \mathbf{x} \le \mathbf{u},\ \mathbf{x}_I \in \mathbb{Z} \}, \tag{MIQCP}$$

where each $Q_i$ is a symmetric matrix of size $n \times n$, $A$ is a real matrix with $n$ columns, $\mathbf{b}_0, \mathbf{b}_i, \mathbf{l}, \mathbf{u}$ are $n$-dimensional vectors, and each $c_i$ is a real number. The constraints $\mathbf{x}^T Q_i \mathbf{x} + \mathbf{b}_i^T \mathbf{x} \le c_i$ are called quadratic constraints, and the constraints $A\mathbf{x} = \mathbf{b}$ are called linear constraints. The constraints $\mathbf{l} \le \mathbf{x} \le \mathbf{u}$ and $\mathbf{x}_I \in \mathbb{Z}$ bound the variables and restrict some of them to be integers. In our benchmarks, we only consider the case $\mathbf{x} \in \{0,1\}^n$, which is equivalent to $x_i(x_i - 1) = 0$ for $i = 1, 2, \dots, n$. If we only have bound constraints, then we use the same ordered heuristic as for the Max-Cut problem to select the subsets. If, in addition, we also have quadratic or linear constraints, then the problem is dense and we therefore consider the dense sublevel hierarchy. For the quadratic constraints we do not apply the sublevel relaxation, i.e., $l = q = 0$; however, if $Q_i$ equals the identity matrix, then we use the same heuristic as for the linear constraints: we select the subsets by order as $I_t = \{t, t+1, \dots, t+l-1\}$ for $t = 1, 2, \dots, q$.

Quadratically constrained quadratic problems (QCQP)
A QCQP can be cast as follows:

$$\min_{\mathbf{x}} \{ \mathbf{x}^T Q_0\, \mathbf{x} + \mathbf{b}_0^T \mathbf{x} : \mathbf{x}^T Q_i\, \mathbf{x} + \mathbf{b}_i^T \mathbf{x} \le c_i,\ i = 1,\dots,p,\ A\mathbf{x} = \mathbf{b},\ \mathbf{l} \le \mathbf{x} \le \mathbf{u} \}, \tag{QCQP}$$

where each $Q_i$ is a symmetric matrix of size $n \times n$, $A$ is a real matrix with $n$ columns, $\mathbf{b}_0, \mathbf{b}_i, \mathbf{l}, \mathbf{u}$ are $n$-dimensional vectors, and each $c_i$ is a real number. This is very similar to MIQCP, except that we drop the integrality constraints. We therefore use the same strategy to select the subsets of the sublevel relaxation.

4.2 Examples from deep learning

The following examples are picked from recent deep learning topics.
Upper bounds on Lipschitz constants of deep neural networks [3]
We only consider 1-hidden-layer neural networks with the ReLU activation function; upper-bounding the Lipschitz constant of such a network results in the following QCQP:

$$\max_{\mathbf{x},\mathbf{u},\mathbf{t}} \big\{ \mathbf{t}^T A^T \mathrm{diag}(\mathbf{u})\, \mathbf{c} \;:\; u_j(u_j - 1) = 0,\ (u_j - 1/2)(A_{j,:}\mathbf{x} + b_j) \ge 0,\ t_k^2 \le 1,\ (x_k - \bar{x}_k + \varepsilon)(x_k - \bar{x}_k - \varepsilon) \le 0 \big\}, \tag{Lip}$$

where $A$ is a matrix of size $p \times p$, $\bar{\mathbf{x}}$ is a $p$-dimensional vector, $\mathbf{b}, \mathbf{c}$ are $p$-dimensional vectors, and $\varepsilon$ is a positive real number. When $\varepsilon = 10$ (resp. $\varepsilon = 0.1$), we compute an upper bound on the global (resp. local) Lipschitz constant of the neural network. Assume that the matrix $A$ is dense; then the maximal cliques of the chordal extension associated with (Lip) are $I_0 = \{x_1,\dots,x_p; u_1,\dots,u_p\}$ and $I_k = \{u_1,\dots,u_p, t_k\}$ for $k = 1,\dots,p$. Therefore, we consider the sparse sublevel relaxation. For the constraints $t_k^2 \le 1$, we choose the subsets by order as $I_{k,i} = \{u_i, \dots, u_{i+l-2}; t_k\}$ for $i = 1,\dots,q$. For the constraints $(x_k - \bar{x}_k + \varepsilon)(x_k - \bar{x}_k - \varepsilon) \le 0$, we choose the subsets by order as $I_i = \{x_k, x_{k+i}, \dots, x_{k+i+l/2-2}; u_i, \dots, u_{i+l/2-1}\}$ for $i = 1,\dots,q$. For the constraints $u_j(u_j - 1) = 0$ and $(u_j - 1/2)(A_{j,:}\mathbf{x} + b_j) \ge 0$, we choose the subsets by order as $I_i = \{x_i, \dots, x_{i+l/2-1}; u_j, u_{j+i}, \dots, u_{j+i+l/2-2}\}$ for $i = 1,\dots,q$.

Robustness certification of deep neural networks [22]
We also consider a 1-hidden-layer neural network with the ReLU activation function. The robustness certification problem can then be formulated as the following QCQP:

$$\max_{\mathbf{x},\mathbf{u}} \big\{ \mathbf{c}^T \mathbf{u} \;:\; u_j(u_j - A_{j,:}\mathbf{x} - b_j) = 0,\ u_j \ge A_{j,:}\mathbf{x} + b_j,\ u_j \ge 0,\ (x_k - \bar{x}_k + \varepsilon)(x_k - \bar{x}_k - \varepsilon) \le 0 \big\}, \tag{Cert}$$

where $A$ is a matrix of size $p \times p$, $\bar{\mathbf{x}}$ is a $p$-dimensional vector, $\mathbf{b}, \mathbf{c}$ are $p$-dimensional vectors, and $\varepsilon$ is a positive real number. Assume that the matrix $A$ is dense; then the maximal cliques of the chordal extension associated with (Cert) are $I_k = \{x_1,\dots,x_p; u_k\}$ for $k = 1,\dots,p$. Similarly to the Lipschitz problem (Lip), we consider the sparse sublevel relaxation. For all the constraints, we choose the subsets by order as $I_{k,i} = \{x_i, \dots, x_{i+l-2}; u_k\}$ for $i = 1,\dots,q$.

5 Numerical experiments

In this section, we apply the sublevel relaxation to different types of POPs, both in optimization and in deep learning, as discussed in the previous section. Most of the instances in optimization are taken from the Biq-Mac library [23] and the QPLIB library [4]; the others are generated randomly. For each sublevel relaxation we report the ratio of improvement (RI) over Shor's relaxation, namely
$$\mathrm{RI} = \frac{\mathrm{Shor} - \mathrm{sublevel}}{\mathrm{Shor} - \mathrm{solution}} \times 100\%,$$

and the relative gap (RG) between the sublevel relaxation and the optimal solution, given by

$$\mathrm{RG} = \frac{\mathrm{sublevel} - \mathrm{solution}}{|\mathrm{solution}|} \times 100\%.$$
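As a quick sanity check (ours), the two quantities for the first Max-Cut row of Table 3 below (instance g05 60: Shor bound 550.1, level-8 bound 544.6, optimal value 536):

```python
def ri(shor, sublevel, solution):
    # ratio of improvement over Shor's relaxation, in percent
    return 100 * (shor - sublevel) / (shor - solution)

def rg(sublevel, solution):
    # relative gap to the optimal value, in percent
    return 100 * (sublevel - solution) / abs(solution)

print(round(ri(550.1, 544.6, 536), 1))  # 39.0, as reported in Table 3
print(round(rg(544.6, 536), 1))         # 1.6
```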
Max-Cut instances

The following classes of problems and their solutions are from the Biq-Mac library. For each class of problems, we choose the first instance, i.e., i = 0, and drop the suffix ".i" in Table 3:

• g05 n.i: unweighted graphs with edge probability 0.5, n = 60, 80, 100;

• pm1s n.i, pm1d n.i: weighted graphs with edge weights chosen uniformly from {−1, 0, 1} and density 10% and 99% respectively, n = 80, 100;

• w d n.i, pw d n.i: graphs with integer edge weights chosen from [−10, 10] and [0, 10] respectively, density d = 0.1, 0.5, 0.9, n = 100.

The instances named g n and the corresponding upper bounds are from the CS-TSSOS paper [30]. The instances named G n are from the G-set library of Y. Ye (http://web.stanford.edu/~yyye/yyye/Gset/), and their best-known solutions are taken from [8].

In Table 2, we give a summary of the basic information and the graph structure of each instance: nVar denotes the number of variables, Density the percentage of nonzero entries of the adjacency matrix, nCliques the number of cliques in the chordal extension,
MaxClique the maximum size of the cliques, and MinClique the minimum size of the cliques.

Table 2: Summary of the basic information and sparse structure of the Max-Cut instances.

Instance | nVar | Density | nCliques | MaxClique | MinClique
g05 60 | 60 | 50% | 11 | 50 | 19
g05 80 | 80 | 50% | 12 | 69 | 28
g05 100 | 100 | 50% | 13 | 88 | 37
pm1d 80 | 80 | 99% | 2 | 79 | 76
pm1d 100 | 100 | 99% | 2 | 99 | 95
pm1s 80 | 80 | 10% | 44 | 37 | 4
pm1s 100 | 100 | 10% | 47 | 54 | 4
pw01 100 | 100 | 10% | 47 | 54 | 4
pw05 100 | 100 | 50% | 12 | 89 | 40
pw09 100 | 100 | 90% | 4 | 97 | 83
w01 100 | 100 | 10% | 47 | 54 | 4
w05 100 | 100 | 50% | 12 | 89 | 40
w09 100 | 100 | 90% | 4 | 97 | 83
g 20 | 505 | 1.6% | 369 | 15 | 1
g 40 | 1005 | 0.68% | 756 | 15 | 1
g 60 | 1505 | 0.43% | 756 | 15 | 1
g 80 | 2005 | 0.30% | 1556 | 15 | 1
g 100 | 2505 | 0.23% | 1930 | 16 | 1
g 120 | 3005 | 0.19% | 2383 | 15 | 1
g 140 | 3505 | 0.16% | 2762 | 15 | 1
g 160 | 4005 | 0.13% | 3131 | 15 | 1
g 180 | 4505 | 0.12% | 3429 | 15 | 1
g 200 | 5005 | 0.11% | 3886 | 15 | 1
G11 | 800 | 0.25% | 598 | 24 | 5
G12 | 800 | 0.25% | 598 | 48 | 5
G13 | 800 | 0.25% | 598 | 90 | 5
G32 | 2000 | 0.1% | 1498 | 76 | 5
G33 | 2000 | 0.1% | 1498 | 99 | 5
G34 | 2000 | 0.1% | 1498 | 141 | 5

In Table 3, we display the upper bounds and running times corresponding to the sublevel relaxations of depth 1 and level 0, 4, 6, 8, respectively. Notice that the authors of [2] use the partial relaxation to compute upper bounds for the instances g 20 to g 200; the sublevel relaxation we consider here is actually what they call the augmented partial relaxation, a strengthened relaxation based on the partial relaxation. From the ratio of improvement, we see that the more sparse structure the graph has, the better the sublevel relaxation performs. Notice that if we obtain better upper/lower bounds than the current best-known bounds, the ratio of improvement is larger than 100% and the relative gap becomes negative. In particular, our method provides better bounds for all the instances g n from the CS-TSSOS paper [30], and computes upper bounds very close to the best-known solutions for the instances G n from the G-set.

Moreover, if the number of variables is of moderate size, the dense sublevel relaxation may perform faster than the sparse one. For example, the instance g05 100 has 13 maximal cliques, with maximum size 88 and minimum size 37. The sparse sublevel relaxation consists of 13 first-order moment matrices of sizes from 37 to 88, whereas the dense version consists of a single first-order moment matrix of size 100. In fact, the dense sublevel relaxation gives an upper bound of 1463.5 at level 0 in 10 seconds, the same bound as the sparse case at level 0 but with much less computing time, and 1458.1 at level 8 in 178.1 seconds, a better bound than the sparse case at level 6, again with less computing time.

Table 3: Results obtained with sublevel relaxations of Max-Cut problems.
Instance | Sol./UB | nVar | Density | upper bounds (l = 0/4/6/8, q = 1; level 0 = Shor) | (RI, RG) | solving time (s), l = 0/4/6/8 (missing entries were lost in the source)
g05 60 | 536 | 60 | 50% | 550.1 548.1 546.0 544.6 | (39.0%, 1.6%) | 4.5 10.6 17.6 65.7
g05 80 | 929 | 80 | 50% | 950.9 949.0 946.6 944.6 | (28.8%, 1.7%) | 33.8 56.2 61.8 137.4
g05 100 | 1430 | 100 | 50% | 1463.5 1462.0 1459.2 1456.8 | (20.0%, 1.9%) | 138.7 303.7 328.7 460.3
pm1d 80 | 227 | 80 | 99% | 270.0 265.9 262.0 258.8 | (26.0%, 14.0%) | 15.0 29.4 39.2 128.1
pm1d 100 | 340 | 100 | 99% | 405.4 402.2 397.9 393.7 | (19.0%, 15.8%) | 47.6 69.4 110.2 225.1
pm1s 80 | 79 | 80 | 10% | 90.3 86.7 83.6 – | (–, 4.8%) | 1.4 4.9 13.4 37.7
pm1s 100 | 127 | 100 | 10% | 143.2 141.4 137.6 – | (–, 6.5%) | 11.1 24.3 28.6 180.3
pw01 100 | 2019 | 100 | 10% | 2125.4 2107.8 2088.1 – | (–, 2.8%) | 13.0 20.5 29.7 285.8
pw05 100 | 8190 | 100 | 50% | 8427.7 8416.6 8403.6 8388.1 | (16.7%, 2.4%) | 136.8 223.0 272.9 400.3
pw09 100 | 13585 | 100 | 90% | 13806.0 13797.1 13781.1 13766.5 | (17.9%, 1.3%) | 141.6 218.4 268.7 442.4
w01 100 | 651 | 100 | 10% | 740.9 728.3 710.3 – | (–, 6.9%) | 10.5 22.4 35.0 224.7
w05 100 | 1646 | 100 | 50% | 1918.0 1902.6 1885.5 1869.7 | (17.8%, 13.6%) | 138.1 265.8 272.2 403.2
w09 100 | 2121 | 100 | 90% | 2500.3 2478.2 2447.3 2422.8 | (20.4%, 14.2%) | 124.3 255.0 280.8 451.7
g 20 | 537.4 | 505 | 1.6% | 570.8 547.1 526.7 – | (–, -4.5%) | 0.7 15.1 46.1 102.2
g 40 | 992.2 | 1005 | 0.68% | 1032.6 982.4 950.8 – | (–, -6.5%) | 1.2 18.6 47.9 102.5
g 60 | 1387.2 | 1505 | 0.43% | 1439.9 1368.4 1317.8 – | (–, -7.6%) | 2.8 26.0 74.7 431.1
g 80 | 1838.1 | 2005 | 0.3% | 1899.2 1803.8 1744.9 – | (–, -7.6%) | 6.0 23.8 76.0 290.7
g 100 | 2328.3 | 2505 | 0.23% | 2398.7 2282.9 2205.1 – | (–, -7.7%) | 3.4 30.1 117.4 428.6
g 120 | 2655.4 | 3005 | 0.19% | 2731.7 2588.5 2507.3 – | (–, -8.1%) | 3.8 33.3 113.2 434.5
g 140 | 3027.2 | 3505 | 0.16% | 3115.8 2947.9 2856.5 – | (–, -8.1%) | 3.8 46.3 138.4 522.1
g 160 | 3589.0 | 4005 | 0.13% | 3670.7 3487.1 3380.7 – | (–, -7.7%) | 8.2 56.5 198.2 506.6
g 180 | 3953.1 | 4505 | 0.12% | 4054.7 3855.9 3736.9 – | (–, -7.6%) | 8.8 51.5 277.0 693.4
g 200 | 4472.3 | 5005 | 0.11% | 4584.6 4353.3 4228.1 – | (–, -7.6%) | 5.4 52.7 203.2 839.2
G11 | 564 | 800 | 0.25% | 629.2 581.3 564.6 – | (–, 0.1%) | 4.0 15.8 32.6 36.5
G12 | 556 | 800 | 0.25% | 623.9 572.5 559.6 – | (–, 0.6%) | 17.8 57.8 54.3 51.9
G13 | 580 | 800 | 0.25% | 647.1 594.2 585.1 – | (–, 0.7%) | 159.2 241.7 340.2 321.6
G32 | 1398 | 2000 | 0.1% | 1567.6 1433.4 1415.9 – | (–, 1.3%) | 622.0 736.3 630.8 628.0
G33 | 1376 | 2000 | 0.1% | 1544.3 1415.3 1392.7 – | (–, 0.8%) | 1956.6 2115.8 1221.5 1486.8
G34 | 1372 | 2000 | 0.1% | 1546.7 1407.9 1388.2 – | (–, 1.2%) | 3613.5 6580.9 6327.9 6147.4

MIQCP instances
The following classes of problems and their solutions are from the Biq-Mac library; they have neither quadratic constraints $\mathbf{x}^T Q_i \mathbf{x} + \mathbf{b}_i^T \mathbf{x} \le c_i$ nor linear constraints $A\mathbf{x} = \mathbf{b}$, only the integer bound constraints $\mathbf{x} \in \{0,1\}^n$:

• bqp n-i, with 10% density; all coefficients are uniformly chosen integers, n = 50, 100;

• gka i a, with dimensions in [30, 100] and densities in [0.1, 0.625];

• gka i b, with dimensions in [20, 125], fully dense, with nonpositive diagonal coefficients and nonnegative off-diagonal coefficients;

• gka i c, with dimensions in [40, 100] and densities in [0.1, 0.8];

• gka i d, with dimension 100 and densities in [0.1, 1].

We also select some instances and their solutions from the QPLIB library, with IDs 0032, 0067, 0633, 2512, 3762, 5935 and 5944; these involve linear constraints $A\mathbf{x} = \mathbf{b}$. For the instance 0032, there are 50 continuous variables and 50 integer variables. For the two instances 5935 and 5944 we maximize the objective; the others are minimization problems.

Similarly to the Max-Cut instances, Table 4 summarizes the basic information and clique structure of each instance. Table 5 summarizes the basic information and the number of quadratic, linear and bound constraints of the instances from the QPLIB library.

Table 4: Summary of the basic information and sparse structure of the MIQCP instances.

Instance | nVar | Density | nCliques | MaxClique | MinClique | nQuad | nLin | nBound
bqp50-1 | 50 | 10% | 36 | 15 | 3 | 0 | 0 | 50
bqp100-1 | 100 | 10% | 52 | 49 | 4 | 0 | 0 | 100
gka1a | 50 | 10% | 36 | 15 | 1 | 0 | 0 | 50
gka2a | 60 | 10% | 41 | 20 | 3 | 0 | 0 | 60
gka3a | 70 | 10% | 44 | 27 | 3 | 0 | 0 | 70
gka4a | 80 | 10% | 48 | 33 | 4 | 0 | 0 | 80
gka5a | 50 | 20% | 25 | 26 | 4 | 0 | 0 | 50
gka6a | 30 | 40% | 11 | 20 | 7 | 0 | 0 | 30
gka7a | 30 | 50% | 10 | 21 | 10 | 0 | 0 | 30
gka8a | 100 | 62.5% | 64 | 37 | 2 | 0 | 0 | 100
gka1b | 20 | 100% | 2 | 19 | 19 | 0 | 0 | 20
gka2b | 30 | 100% | 2 | 29 | 29 | 0 | 0 | 30
gka3b | 40 | 100% | 2 | 39 | 38 | 0 | 0 | 40
gka4b | 50 | 100% | 2 | 49 | 47 | 0 | 0 | 50
gka5b | 60 | 100% | 2 | 59 | 56 | 0 | 0 | 60
gka6b | 70 | 100% | 2 | 69 | 67 | 0 | 0 | 70
gka7b | 80 | 100% | 2 | 79 | 77 | 0 | 0 | 80
gka8b | 90 | 100% | 2 | 89 | 87 | 0 | 0 | 90
gka9b | 100 | 100% | 2 | 99 | 97 | 0 | 0 | 100
gka10b | 125 | 100% | 2 | 124 | 124 | 0 | 0 | 125
gka1c | 40 | 80% | 4 | 37 | 25 | 0 | 0 | 40
gka2c | 50 | 60% | 6 | 45 | 26 | 0 | 0 | 50
gka3c | 60 | 40% | 14 | 47 | 17 | 0 | 0 | 60
gka4c | 70 | 30% | 22 | 49 | 12 | 0 | 0 | 70
gka5c | 80 | 20% | 27 | 54 | 11 | 0 | 0 | 80
gka6c | 90 | 10% | 46 | 45 | 4 | 0 | 0 | 90
gka7c | 100 | 10% | 51 | 50 | 3 | 0 | 0 | 100
gka1d | 100 | 10% | 50 | 51 | 4 | 0 | 0 | 100
gka2d | 100 | 20% | 30 | 71 | 11 | 0 | 0 | 100
gka3d | 100 | 30% | 23 | 78 | 18 | 0 | 0 | 100
gka4d | 100 | 40% | 15 | 86 | 31 | 0 | 0 | 100
gka5d | 100 | 50% | 13 | 88 | 36 | 0 | 0 | 100
gka6d | 100 | 60% | 10 | 91 | 47 | 0 | 0 | 100
gka7d | 100 | 70% | 7 | 94 | 57 | 0 | 0 | 100
gka8d | 100 | 80% | 6 | 95 | 68 | 0 | 0 | 100
gka9d | 100 | 90% | 5 | 96 | 79 | 0 | 0 | 100
gka10d | 100 | 100% | 2 | 99 | 95 | 0 | 0 | 100

Table 5: Summary of the basic information and constraint structure of the MIQCP instances from the QPLIB library.

Instance | nVar | Density | nQuad | nLin | nBound
qplib0032 | 100 | 89% | 0 | 52 | 100
qplib0067 | 80 | 89% | 0 | 1 | 80
qplib0633 | 75 | 99% | 0 | 1 | 75
qplib2512 | 100 | 28% | 0 | 20 | 100
qplib3762 | 90 | 28% | 0 | 480 | 90
qplib5935 | 100 | 28% | 0 | 1237 | 100
qplib5944 | 100 | 28% | 0 | 2475 | 100
In Table 6, we show the lower bounds and running times obtained by solving the sublevel relaxations of depth 1 and level 0, 4, 6, 8, respectively. We see that when the problem has a good sparsity structure or is of low dimension, the sublevel relaxation performs very well and provides the exact solution, in particular for the two instances gka2a and gka7a. For dense problems we are not able to find the exact solution, but we still obtain improvements between 20% and 40% compared to Shor's relaxation. Notice that for the instances gka1b to gka10b, even though the improvement ratio ranges from 24.0% to 77.9%, the relative gap is very high, varying from 38.2% to 947.2%. This means that these problems are themselves very hard to solve, so that the gap between the result of Shor's relaxation and the exact optimal solution is very large; even though the sublevel relaxation yields a substantial improvement over Shor's relaxation, it is still far away from the true optimum.

Table 6: Results obtained with sublevel relaxations of MIQCP problems.
Instance | Sol. | nVar | Density | lower/upper bounds (l = 0/4/6/8, q = 1; level 0 = Shor) | (RI, RG) | solving time (s), l = 0/4/6/8 (missing entries were lost in the source)
bqp50-1 | -2098 | 50 | 10% | -2345.5 -2136.3 -2116.3 -2105.4 | (–, 0.4%) | 0.1 0.5 1.3 3.4
bqp100-1 | -7970 | 100 | 10% | -8721.1 -8358.2 -8215.1 -8101.8 | (–, 1.7%) | 8.8 16.7 21.8 87.9
gka1a | -3414 | 50 | 10% | -3623.3 -3453.2 -3432.6 -3428.5 | (–, 0.4%) | 0.1 0.6 0.9 1.8
gka2a | -6063 | 60 | 10% | -6204.3 -6076.3 -6063.0 -6063.0 | (–, 0%) | 0.3 1.0 4.6 8.5
gka3a | -6037 | 70 | 10% | -6546.2 -6291.5 -6182.6 -6106.3 | (–, 1.1%) | 0.7 1.6 6.1 31.0
gka4a | -8598 | 80 | 10% | -8935.1 -8767.3 -8713.7 -8676.0 | (–, 0.9%) | 2.1 3.4 10.1 30.0
gka5a | -5737 | 50 | 20% | -5979.9 -5789.9 -5760.3 -5750.0 | (–, 0.2%) | 0.7 1.4 6.2 31.0
gka6a | -3980 | 30 | 40% | -4190.2 -4008.9 -3986.0 -3982.5 | (–, 0.1%) | 0.2 0.6 3.9 23.6
gka7a | -4541 | 30 | 50% | -4696.6 -4566.8 -4541.1 -4541.1 | (–, 0%) | 0.3 0.8 4.9 23.1
gka8a | -11109 | 100 | 62.5% | -11283.8 -11148.0 -11124.8 -11114.0 | (–, 0.05%) | 2.3 2.7 7.5 19.5
gka1b | -133 | 20 | 100% | -362.9 -295.1 -253.6 -183.8 | (–, 38.2%) | 0.1 0.5 2.4 25.0
gka2b | -121 | 30 | 100% | -505.7 -425.3 -325.4 -282.5 | (–, 133.5%) | 0.2 0.7 4.0 29.9
gka3b | -118 | 40 | 100% | -718.0 -535.6 -483.4 -437.7 | (–, 270.9%) | 0.7 1.4 6.5 45.9
gka4b | -129 | 50 | 100% | -809.8 -670.9 -614.2 -571.5 | (35.0%, 343.0%) | 1.9 3.3 14.3 65.2
gka5b | -150 | 60 | 100% | -1034.8 -820.9 -736.8 -705.5 | (37.2%, 370.3%) | 3.2 8.4 15.5 76.1
gka6b | -146 | 70 | 100% | -1279.0 -972.2 -894.8 -833.5 | (39.3%, 470.9%) | 9.1 11.6 26.6 86.5
gka7b | -160 | 80 | 100% | -1362.5 -1138.1 -1031.0 -982.6 | (31.6%, 514.1%) | 26.1 31.2 50.8 136.1
gka8b | -145 | 90 | 100% | -1479.1 -1269.8 -1190.2 -1120.9 | (26.8%, 673.0%) | 40.5 60.1 102.3 187.0
gka9b | -137 | 100 | 100% | -1663.6 -1385.4 -1298.9 -1212.6 | (29.5%, 785.1%) | 65.9 92.3 111.2 256.3
gka10b | -154 | 125 | 100% | -2073.1 -1782.1 -1707.1 -1612.7 | (24.0%, 947.2%) | 285.8 413.3 452.2 700.9
gka1c | -5058 | 40 | 80% | -5161.1 -5102.9 -5077.9 -5073.7 | (–, 0.3%) | 0.8 1.6 5.3 41.9
gka2c | -6213 | 50 | 60% | -6392.6 -6291.3 -6263.1 -6246.2 | (–, 0.5%) | 1.9 2.8 7.8 50.3
gka3c | -6665 | 60 | 40% | -6849.9 -6730.7 -6703.1 -6688.1 | (–, 0.3%) | 6.1 9.3 15.9 62.1
gka4c | -7398 | 70 | 30% | -7647.1 -7527.7 -7494.9 -7462.8 | (–, 0.9%) | 13.1 18.4 24.6 88.1
gka5c | -7362 | 80 | 20% | -7684.5 -7543.7 -7474.6 -7412.8 | (–, 0.7%) | 15.1 27.7 40.3 112.8
gka6c | -5824 | 90 | 10% | -6065.8 -5932.2 -5869.7 -5847.4 | (–, 0.4%) | 10.0 11.0 19.0 57.4
gka7c | -7225 | 100 | 10% | -7422.7 -7297.8 -7264.3 -7248.7 | (–, 0.3%) | 12.4 13.9 22.1 55.6
gka1d | -6333 | 100 | 10% | -6592.7 -6475.3 -6403.1 -6369.6 | (–, 0.6%) | 11.4 13.4 29.1 71.3
gka2d | -6579 | 100 | 20% | -7234.2 -6980.5 -6897.9 -6811.6 | (–, 3.5%) | 42.3 70.8 70.6 193.7
gka3d | -9261 | 100 | 30% | -9963.0 -9686.2 -9591.7 -9523.6 | (–, 2.8%) | 164.8 200.4 262.7 330.0
gka4d | -10727 | 100 | 40% | -11592.5 -11303.3 -11175.4 -11096.5 | (–, 3.4%) | 302.2 259.1 191.8 387.7
gka5d | -11626 | 100 | 50% | -12632.1 -12381.6 -12274.7 -12185.0 | (–, 4.8%) | 324.3 256.3 294.3 380.2
gka6d | -14207 | 100 | 60% | -15235.3 -14938.2 -14834.9 -14720.2 | (–, 3.6%) | 236.6 239.7 221.9 437.9
gka7d | -14476 | 100 | 70% | -15672.0 -15413.2 -15267.6 -15173.6 | (–, 4.8%) | 138.8 225.9 150.0 314.6
gka8d | -16352 | 100 | 80% | -17353.3 -17011.5 -16887.6 -16794.3 | (–, 2.7%) | 271.5 277.9 291.6 408.6
gka9d | -15656 | 100 | 90% | -17010.9 -16652.0 -16513.3 -16409.6 | (–, 4.8%) | 390.5 419.8 367.0 513.5
gka10d | -19102 | 100 | 100% | -20421.4 -20121.7 -19974.1 -19863.8 | (–, 4.0%) | 77.8 83.4 130.2 244.8
qplib0032 | 10.1 | 100 | 99% | -19751 -16491 -15962 -15440 | (21.8%, 152971.3%) | 18.1 19.4 37.0 94.7
qplib0067 | -110942 | 80 | 89% | -116480 -112923 -112615 -112478 | (–, 1.4%) | 6.2 11.1 21.9 158.3
qplib0633 | 79.6 | 75 | 99% | 70.9 74.0 75.1 – | (–, 4.9%) | 2.9 10.1 27.1 140.0
qplib2512 | 135028 | 100 | 77% | -441284 -125060 27898 – | (–, 38.6%) | 18.6 19.9 53.4 278.6
qplib3762 | -296 | 90 | 28% | -345.6 -330.8 -319.9 -309.5 | (–, 4.6%) | 6.3 18.1 50.7 183.4
qplib5935 | 4758 | 100 | 99% | 67494 40148 36842 – | (–, 505.5%) | 12.8 39.4 259.0 1745.3
qplib5944 | 1829 | 100 | 99% | 66934 27437 23142 – | (–, 981.7%) | 15.6 182.1 2304.3 13204.6

Max-Cliq instances
We take the same graphs as those considered for the Max-Cut instances. Some instances share the same adjacency matrix with different weights, in which case we delete the repeated graphs. LB denotes the lower bound of a given instance, computed from 10 random samples. By contrast with the strategy used for the Max-Cut instances, we use sublevel relaxations of level 2 and depth 0, 20, 40, 60, respectively. From Table 7 we see that the sublevel relaxation yields a large improvement compared to Shor's relaxation. The Max-Cliq problem remains hard to solve, as emphasized by the large relative gap, ranging from 662.5% to 3660%.

Table 7: Results obtained with sublevel relaxations of Max-Cliq problems.

Instance | LB | nVar | Density | upper bounds (l = 2, q = 0/20/40/60; depth 0 = Shor) | (RI, RG) | solving time (s) (missing entries were lost in the source)
g05 60 | 0.8 | 60 | 50% | 29.9 19.3 8.3 – | (–, 662.5%) | 0.6 1.8 3.6 2.4
g05 80 | 0.9 | 80 | 50% | 39.9 29.1 20.1 – | (–, 888.9%) | 2.8 7.4 7.5 8.3
g05 100 | 0.8 | 100 | 50% | 50.0 39.1 28.9 – | (–, 2200.0%) | 6.5 30.9 19.1 33.0
pm1d 80 | 1.0 | 80 | 99% | 78.2 57.5 37.6 – | (–, 1690.0%) | 2.3 5.4 7.6 4.4
pm1d 100 | 1.0 | 100 | 99% | 98.0 77.2 57.5 – | (–, 3660%) | 5.6 12.5 22.8 17.3
pm1s 80 | 0.7 | 80 | 10% | 8.9 6.2 4.6 – | (–, 557.1%) | 2.6 6.1 6.4 9.0
pw01 100 | 0.6 | 100 | 10% | 10.6 8.2 5.9 – | (–, 800.0%) | 7.5 30.7 20.0 29.6
pw05 100 | 0.8 | 100 | 50% | 49.8 39.7 28.9 – | (–, 2262.5%) | 7.6 21.9 24.0 26.5
pw09 100 | 1.0 | 100 | 90% | 89.2 70.2 51.9 – | (–, 3300%) | 8.5 15.0 33.2 28.9

QCQP instances
We take the MIQCP instances from the Biq-Mac library of size at least 50, add one dense quadratic constraint $\|\mathbf{x}\|^2 = 1$, and relax the integer bound constraints $\mathbf{x} \in \{0,1\}^n$ to the linear bound constraints $\mathbf{x} \in [0,1]^n$. UB denotes the upper bound obtained by taking the minimum value over 10 random evaluations.

We also select some instances and their solutions from the QPLIB library, with IDs 1535, 1661, 1675, 1703 and 1773. These instances have more than one quadratic constraint and involve linear constraints. Table 8 summarizes the basic information as well as the number of quadratic, linear and bound constraints of the instances from the QPLIB library.

Table 8: Summary of the basic information and constraint structure of the QCQP instances from the QPLIB library.

Instance | nVar | Density | nQuad | nLin | nBound
qplib1535 | 60 | 94% | 60 | 6 | 60
qplib1661 | 60 | 95% | 1 | 12 | 60
qplib1675 | 60 | 49% | 1 | 12 | 60
qplib1703 | 60 | 98% | 30 | 6 | 60
qplib1773 | 60 | 95% | 1 | 6 | 60

In Table 9, we show the lower bounds and running times obtained by the sublevel relaxations of level 0, 4, 6, 8, with depth 1 for the instances from the QPLIB library and depth 10 for the instances adapted from the Biq-Mac library. We see that the sublevel relaxation yields a uniform improvement compared to Shor's relaxation. However, for the QCQP problems adapted from the MIQCP instances, it is very hard to find the exact optimal solution, as the relative gap varies from 60.5% to 747.0%. This is in deep contrast with the instances from the QPLIB library, which are relatively easier to solve, as the relative gap varies from 9.4% to 13.8%.

Table 9: Results obtained with sublevel relaxations of QCQP problems.
Sublevel relaxation, l = 0/4/6/8, q = 1 for the QPLIB instances and q = 10 for the Biq-Mac-derived instances (level 0 = Shor); lower bounds and solving times in seconds.

Instance | Sol./UB | nVar | Density | lower bounds (l = 0/4/6/8) | (RI, RG) | solving time (s), l = 0/4/6/8
bqp50-1 | -99 | 50 | 10% | -215.7 / -195.4 / -180.6 / -172.5 | (37.0%, 74.2%) | 0.8 / 2.4 / 12.3 / 88.5
bqp100-1 | -67.2 | 100 | 10% | -323.1 / -304.7 / -296.1 / -290.0 | (12.9%, 331.5%) | 21.4 / 22.7 / 56.9 / 249.6
gka1a | -109.5 | 50 | 10% | -241.8 / -224.1 / -219.8 / -213.8 | (21.2%, 95.3%) | 0.8 / 1.9 / 10.0 / 65.6
gka2a | -140.7 | 60 | 10% | -275.3 / -260.9 / -258.7 / -251.6 | (17.6%, 78.8%) | 1.7 / 3.8 / 16.8 / 126.9
gka3a | -143.2 | 70 | 10% | -300.0 / -284.6 / -278.8 / -275.0 | (15.9%, 92.0%) | 3.6 / 9.1 / 23.1 / 121.1
gka4a | -126.2 | 80 | 10% | -311.0 / -288.3 / -282.9 / -280.0 | (16.8%, 121.9%) | 6.8 / 7.9 / 25.4 / 160.0
gka5a | -180.2 | 50 | 20% | -351.8 / -319.3 / -306.4 / -299.1 | (30.7%, 66.0%) | 0.7 / 2.8 / 15.1 / 75.6
gka8a | -122.5 | 100 | 62.5% | -320.1 / -306.8 / -302.1 / -299.5 | (10.4%, 144.5%) | 21.3 / 23.5 / 72.5 / 232.0
gka4b | -63 | 50 | 100% | -381.4 / -326.2 / -302.4 / -280.8 | (31.6%, 345.7%) | 0.7 / 1.8 / 17.4 / 79.2
gka5b | -63 | 60 | 100% | -446.8 / -377.2 / -348.9 / -327.4 | (31.1%, 419.7%) | 1.1 / 4.2 / 19.5 / 117.2
gka6b | -63 | 70 | 100% | -496.6 / -409.9 / -385.9 / -366.8 | (29.9%, 482.2%) | 3.2 / 5.3 / 17.4 / 118.1
gka7b | -63 | 80 | 100% | -518.3 / -447.1 / -421.6 / -404.3 | (25.0%, 541.7%) | 5.9 / 9.5 / 21.6 / 170.3
gka8b | -63 | 90 | 100% | -534.5 / -472.7 / -449.0 / -430.1 | (22.1%, 582.7%) | 10.9 / 16.0 / 36.0 / 148.9
gka9b | -63 | 100 | 100% | -573.0 / -501.3 / -477.0 / -455.8 | (23.0%, 623.5%) | 19.7 / 36.6 / 42.5 / 191.4
gka10b | -63 | 125 | 100% | -639.4 / -569.7 / -553.7 / -533.6 | (18.4%, 747.0%) | 80.1 / 82.1 / 110.9 / 410.1
gka2c | -159.1 | 50 | 60% | -290.0 / -269.3 / -261.6 / -255.4 | (26.4%, 60.5%) | 0.8 / 2.5 / 11.2 / 80.8
gka3c | -126.3 | 60 | 40% | -271.2 / -240.2 / -235.4 / -231.3 | (27.5%, 83.1%) | 1.7 / 4.4 / 16.0 / 103.1
gka4c | -123.0 | 70 | 30% | -292.7 / -263.7 / -254.4 / -247.9 | (26.4%, 101.5%) | 3.0 / 6.5 / 19.7 / 155.6
gka5c | -114.0 | 80 | 20% | -239.1 / -225.9 / -223.2 / -220.4 | (14.9%, 93.3%) | 10.6 / 9.6 / 32.3 / 166.5
gka6c | -100 | 90 | 10% | -198.8 / -190.8 / -186.7 / -182.4 | (16.6%, 82.4%) | 12.3 / 15.9 / 33.1 / 216.5
gka7c | -100 | 100 | 10% | -225.8 / -213.7 / -210.5 / -208.5 | (13.8%, 108.5%) | 21.4 / 28.2 / 63.4 / 323.1
gka1d | -75 | 100 | 10% | -197.9 / -182.5 / -177.0 / -174.5 | (19.0%, 132.7%) | 19.3 / 25.9 / 64.8 / 243.9
gka2d | -87.2 | 100 | 20% | -259.6 / -242.2 / -233.8 / -229.5 | (17.5%, 163.2%) | 23.5 / 28.2 / 61.1 / 254.3
gka3d | -88.1 | 100 | 30% | -304.0 / -281.6 / -274.1 / -267.5 | (16.9%, 203.6%) | 26.2 / 28.9 / 49.2 / 278.4
gka4d | -105.5 | 100 | 40% | -375.2 / -340.1 / -326.0 / -317.5 | (21.4%, 201.0%) | 21.6 / 21.9 / 53.2 / 270.7
gka5d | -131.9 | 100 | 50% | -383.6 / -351.5 / -341.5 / -332.3 | (20.4%, 152.0%) | 20.4 / 22.7 / 41.3 / 257.1
gka6d | -137.7 | 100 | 60% | -443.1 / -400.0 / -391.0 / -378.9 | (21.0%, 175.2%) | 23.8 / 23.3 / 48.6 / 254.2
gka7d | -156.3 | 100 | 70% | -453.9 / -421.4 / -406.6 / -397.4 | (19.0%, 154.3%) | 21.4 / 22.5 / 71.7 / 217.8
gka8d | -147.6 | 100 | 80% | -488.0 / -441.1 / -423.3 / -414.2 | (21.7%, 180.6%) | 21.7 / 25.0 / 47.0 / 232.8
gka9d | -179.6 | 100 | 90% | -539.7 / -487.7 / -469.2 / -456.8 | (23.0%, 154.3%) | 20.6 / 21.5 / 45.1 / 222.2
gka10d | -187.0 | 100 | 100% | -552.4 / -505.7 / -491.8 / -478.4 | (20.3%, 155.8%) | 23.2 / 24.6 / 56.9 / 196.5
qplib1535 | -11.6 | 60 | 94% | -13.9 / -13.5 / -13.3 / -13.2 | (30.4%, 13.8%) | 1.4 / 4.3 / 13.7 / 99.2
qplib1661 | -16.0 | 60 | 95% | -18.4 / -18.1 / -17.8 / -17.5 | (37.5%, 9.4%) | 1.4 / 3.0 / 14.7 / 96.4
qplib1675 | -75.7 | 60 | 49% | -93.1 / -87.0 / -85.2 / -83.8 | ( , 10.7%) | 1.0 / 4.2 / 19.2 / 147.8
qplib1703 | -132.8 | 60 | 98% | -152.8 / -147.0 / -145.2 / -143.5 | ( , 8.06%) | 1.2 / 4.2 / 20.8 / 109.3
qplib1773 | -14.6 | 60 | 95% | -17.3 / -16.8 / -16.6 / -16.4 | (33.3%, 12.3%) | 1.1 / 4.0 / 14.0 / 89.3
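To make the QCQP construction above concrete, the following Python sketch builds the relaxed problem from QUBO data (Q, c) and computes the sampled bound UB. The objective convention x'Qx + c'x and the random toy data are assumptions standing in for a parsed Biq-Mac instance; the constraint set ||x||^2 = 1, x ∈ [0, 1]^n and the 10 random evaluations follow the text. Note that a nonnegative vector divided by its Euclidean norm stays in [0, 1]^n (each coordinate is at most the norm), so normalized uniform samples are always feasible.

```python
import numpy as np

def qcqp_upper_bound(Q, c, n_samples=10, seed=0):
    # Upper bound for   min x'Qx + c'x  s.t.  ||x||^2 = 1, x in [0, 1]^n,
    # taken as the minimum objective value over random feasible points.
    rng = np.random.default_rng(seed)
    n = len(c)
    best = np.inf
    for _ in range(n_samples):
        x = rng.uniform(0.0, 1.0, size=n)
        x /= np.linalg.norm(x)   # feasible: unit norm, and still in [0, 1]^n
        best = min(best, float(x @ Q @ x + c @ x))
    return best

# Toy data standing in for a parsed Biq-Mac instance (hypothetical).
n = 50
rng = np.random.default_rng(1)
Q = rng.normal(size=(n, n)); Q = (Q + Q.T) / 2
c = rng.normal(size=n)
print(qcqp_upper_bound(Q, c))
```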
Lipschitz constant estimation

We generate random 1-hidden-layer neural networks with parameters A, b, c. We denote by net1_n the instances of 1-hidden-layer networks of size n, and compute the upper bounds corresponding to ε = 0.1 and ε = 10 with sublevel relaxations of depth 1 and levels 0, 4, 6, 8, respectively.

We see in Table 10 that we obtain a relatively high improvement ratio and a low gap for the global case (ε = 10), while Table 11, dedicated to the local case (ε = 0.1), shows that the improvement ratio decreases and the gap increases. The underlying rationale is that local Lipschitz constants of neural networks are harder to estimate than global ones.

Table 10: Results obtained with sublevel relaxations of Lipschitz constant problems, ε = 10.

Sublevel relaxation, l = 0/4/6/8, q = 1 (level 0 = Shor); upper bounds and solving times in seconds.

Instance | Sol./LB | nVar | upper bounds (l = 0/4/6/8) | (RI, RG) | solving time (s), l = 0/4/6/8
net1_5 | 0.38 | 15 | 0.44 / 0.39 | ( , 0%) | 0.02 / 0.61 / 1.77 / 7.01
net1_10 | 0.69 | 30 | 0.72 / 0.70 | ( , 0%) | 0.10 / 0.90 / 4.18 / 26.05
net1_15 | 1.72 | 45 | 1.86 / 1.81 / 1.76 | ( , 0.58%) | 0.35 / 2.19 / 10.55 / 69.66
net1_20 | 2.68 | 60 | 2.88 / 2.83 / 2.77 | ( , 2.61%) | 1.17 / 4.24 / 15.56 / 60.59
net1_25 | 3.56 | 75 | 3.83 / 3.74 / 3.69 | ( , 3.37%) | 2.79 / 8.72 / 29.38 / 166.35
net1_30 | 5.60 | 90 | 6.16 / 6.11 / 6.08 / 6.06 | (17.86%, 8.21%) | 8.45 / 11.68 / 33.29 / 220.39
net1_35 | 7.77 | 105 | 8.92 / 8.79 / 8.73 / 8.66 | (22.61%, 11.45%) | 16.69 / 26.55 / 74.28 / 267.19
net1_40 | 7.40 | 120 | 9.07 / 8.97 / 8.86 / 8.78 | (17.37%, 18.65%) | 33.19 / 56.15 / 116.37 / 333.65

Table 11: Results obtained with sublevel relaxations of Lipschitz constant problems, ε = 0.1.

Sublevel relaxation, l = 0/4/6/8, q = 1 (level 0 = Shor); upper bounds and solving times in seconds.

Instance | Sol./LB | nVar | upper bounds (l = 0/4/6/8) | (RI, RG) | solving time (s), l = 0/4/6/8
net1_5 | 0.247 | 15 | 0.251 / 0.251 | ( , 0%) | 0.04 / 0.34 / 1.25 / 6.72
net1_10 | 0.581 | 30 | 0.610 / 0.608 / 0.606 / 0.605 | (17.2%, 4.13%) | 0.18 / 0.84 / 4.65 / 38.56
net1_15 | 1.384 | 45 | 1.449 / 1.441 / 1.441 / 1.435 | (21.54%, 3.68%) | 0.42 / 1.43 / 7.29 / 60.27
net1_20 | 1.73 | 60 | 2.23 / 2.22 / 2.20 / 2.19 | (8.00%, 26.59%) | 4.21 / 3.82 / 13.11 / 83.14
net1_25 | 2.03 | 75 | 2.73 / 2.67 / 2.65 / 2.64 | (12.86%, 30.05%) | 4.79 / 7.08 / 23.31 / 134.50
net1_30 | 4.10 | 90 | 5.09 / 5.07 / 5.06 / 5.04 | (5.05%, 22.93%) | 19.10 / 13.60 / 28.29 / 146.17
net1_35 | 5.84 | 105 | 7.12 / 7.08 / 7.07 / 7.03 | (7.03%, 20.38%) | 56.46 / 28.95 / 47.06 / 192.31
net1_40 | 5.02 | 120 | 7.30 / 7.21 / 7.15 / 7.07 | (10.09%, 40.84%) | 144.28 / 58.01 / 80.27 / 254.22
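A sampled bound of the kind reported in the Sol./LB column can be obtained as follows: at any point where a network x ↦ c'ReLU(Ax + b) is differentiable, the norm of its gradient is a valid lower bound on its Lipschitz constant, so the maximum of sampled gradient norms is one as well. The sketch below illustrates this under the assumption that the networks have this one-hidden-layer form; the shapes, sampling region, and parameter draws are assumptions, not the paper's exact experimental setup.

```python
import numpy as np

def lipschitz_lower_bound(A, b, c, n_samples=10, radius=10.0, seed=0):
    # Sampled lower bound on the Lipschitz constant of
    # f(x) = c' ReLU(A x + b) over the box of the given radius.
    # The gradient norm at any differentiability point lower-bounds
    # the Lipschitz constant, hence so does the max over samples.
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    best = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-radius, radius, size=n)
        active = (A @ x + b > 0).astype(float)   # ReLU activation pattern at x
        grad = A.T @ (active * c)                # gradient of f at x
        best = max(best, float(np.linalg.norm(grad)))
    return best

# Toy 1-hidden-layer network with n = 5 (hypothetical parameters).
rng = np.random.default_rng(2)
n = 5
A = rng.normal(size=(n, n)); b = rng.normal(size=n); c = rng.normal(size=n)
print(lipschitz_lower_bound(A, b, c))
```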
Certification instances
We use the same networks net1_n as the ones generated for the Lipschitz problems above, and compute the upper bounds corresponding to ε = 0.1 and ε = 10 with sublevel relaxations of depth 1 and levels 0, 4, 6, 8, respectively.

As for the Lipschitz problem, Tables 12 and 13 indicate that in the local case it is much harder to improve the bounds and find the exact optimal solution than in the global case. Furthermore, the difficulty of the problem also increases with the dimension: when the number of variables gets larger, the improvement ratio decreases while the relative gap increases.

Table 12: Results obtained with sublevel relaxations of certification problems, ε = 10.

Sublevel relaxation, l = 0/4/6/8, q = 1 (level 0 = Shor); upper bounds and solving times in seconds.

Instance | Sol./LB | nVar | upper bounds (l = 0/4/6/8) | (RI, RG) | solving time (s), l = 0/4/6/8
net1_5 | 2.63 | 10 | 3.51 / 3.00 | ( , 4.18%) | 0.01 / 0.26 / 1.79 / 1.83
net1_10 | 3.49 | 20 | 4.88 / 4.69 / 4.60 / 4.48 | (28.78%, 28.37%) | 0.06 / 0.99 / 4.23 / 30.36
net1_15 | 5.61 | 30 | 8.20 / 8.10 / 7.84 / 7.41 | (30.50%, 32.09%) | 0.19 / 1.02 / 6.93 / 40.31
net1_20 | 9.24 | 40 | 16.48 / 16.03 / 15.75 / 15.48 | (13.81%, 67.53%) | 0.60 / 2.35 / 9.69 / 67.31
net1_25 | 14.40 | 50 | 26.68 / 26.28 / 25.89 / 25.57 | (9.04%, 77.57%) | 2.24 / 5.67 / 17.40 / 66.99
net1_30 | 17.22 | 60 | 38.06 / 37.72 / 36.82 / 35.89 | (10.41%, 108.42%) | 5.08 / 14.32 / 24.02 / 102.93
net1_35 | 26.71 | 70 | 59.18 / 58.64 / 57.78 / 57.39 | (5.51%, 114.86%) | 10.69 / 25.79 / 40.96 / 136.35
net1_40 | 22.94 | 80 | 57.59 / 56.08 / 54.69 / 54.18 | (9.84%, 136.18%) | 23.22 / 44.74 / 65.04 / 146.91
net1_45 | 22.57 | 90 | 57.56 / 56.34 / 55.57 / 54.68 | (8.23%, 142.27%) | 44.67 / 85.38 / 107.12 / 186.94
net1_50 | 27.34 | 100 | 73.59 / 72.10 / 71.19 / 69.92 | (7.94%, 155.74%) | 81.61 / 144.61 / 165.18 / 333.42

Table 13: Results obtained with sublevel relaxations of certification problems, ε = 0.1.

Sublevel relaxation, l = 0/4/6/8, q = 1 (level 0 = Shor); upper bounds and solving times in seconds.

Instance | Sol./LB | nVar | upper bounds (l = 0/4/6/8) | (RI, RG) | solving time (s), l = 0/4/6/8
net1_5 | 0.190 | 10 | 0.191 / 0.191 / 0.191 / 0.191 | (0.00%, 0.53%) | 0.01 / 0.13 / 0.74 / 0.73
net1_10 | 0.021 | 20 | 0.025 / 0.025 / 0.025 / 0.024 | (25.00%, 14.29%) | 0.15 / 0.51 / 3.16 / 17.68
net1_15 | 0.027 | 30 | 0.053 / 0.053 / 0.053 / 0.053 | (0.00%, 96.30%) | 0.17 / 0.55 / 2.82 / 17.01
net1_20 | 0.269 | 40 | 0.299 / 0.299 / 0.299 / 0.298 | (3.33%, 10.78%) | 0.79 / 2.36 / 7.83 / 46.08
net1_25 | -0.104 | 50 | -0.025 / -0.028 / -0.031 / -0.031 | (7.59%, 70.19%) | 2.37 / 5.77 / 13.22 / 54.81
net1_30 | 0.669 | 60 | 0.810 / 0.807 / 0.806 / 0.803 | (4.96%, 20.03%) | 6.34 / 10.06 / 21.85 / 84.82
net1_35 | 0.825 | 70 | 1.107 / 1.107 / 1.107 / 1.107 | (0.00%, 34.18%) | 12.61 / 18.44 / 34.07 / 102.26
net1_40 | 0.741 | 80 | 0.949 / 0.943 / 0.942 / 0.940 | (4.33%, 26.86%) | 34.70 / 52.44 / 56.64 / 163.67
net1_45 | 0.265 | 90 | 0.603 / 0.602 / 0.600 / 0.599 | (1.18%, 126.04%) | 55.69 / 89.39 / 115.78 / 200.94
net1_50 | 0.614 | 100 | 0.920 / 0.919 / 0.916 / 0.914 | (1.96%, 48.86%) | 105.68 / 177.41 / 179.53 / 205.58
These last two examples show that finding guaranteed bounds for optimization problems arising from deep learning is much harder than for the usual sparse problems coming from the classical optimization literature. Hence, it remains a significant challenge to adapt our approach to large real networks, involving a large number of variables and more complicated structures such as convolutional or max-pooling layers.
Conclusion

In this paper, we propose a new semidefinite programming hierarchy based on the standard dense and sparse Lasserre hierarchies. This hierarchy provides a wider choice of intermediate relaxation levels, lying between the d-th and (d + 1)-th order relaxations of Lasserre's hierarchy. With this technique, we are able to solve problems for which the standard relaxations are intractable. Our experimental results demonstrate that the sublevel relaxation often allows one to compute more accurate bounds than existing frameworks such as Shor's relaxation or term sparsity, in particular for dense problems.

Sublevel relaxations offer a large choice of parameter tunings, as one can select the level, depth, and subsets for each relaxation. We can benefit from this flexibility to potentially perform better than state-of-the-art methods. However, this flexibility also comes with a drawback: the more flexible the approach is, the more difficult it is for users to tune the parameters. An important and interesting topic for future work is to design an algorithm that searches for the optimal level, depth, and subsets in sublevel relaxations.

Acknowledgement
This work has benefited from the Tremplin ERC Stg Grant ANR-18-ERC2-0004-01 (T-COPS project), the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions, grant agreement 813211 (POEMA), as well as from the AI Interdisciplinary Institute ANITI funding, through the French "Investing for the Future PIA3" program under the Grant agreement n° ANR-19-PI3A-0004. The third author was supported by the FMJH Program PGMO (EPICS project) and EDF, Thales, Orange and Criteo. The fourth author acknowledges the support of the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant numbers FA9550-19-1-7026 and FA9550-18-1-0226, and of ANR MasDol.

References

[1] Thomas Barthel and Robert Hübener. Solving condensed-matter ground-state problems by semidefinite relaxations. Physical Review Letters, 108(20), May 2012.
[2] Juan S Campos, Ruth Misener, and Panos Parpas. Partial Lasserre relaxation for sparse Max-Cut. 2020.
[3] Tong Chen, Jean B Lasserre, Victor Magron, and Edouard Pauwels. Semialgebraic Optimization for Lipschitz Constants of ReLU Networks. Advances in Neural Information Processing Systems, 33, 2020.
[4] Fabio Furini, Emiliano Traversi, Pietro Belotti, Antonio Frangioni, Ambros Gleixner, Nick Gould, Leo Liberti, Andrea Lodi, Ruth Misener, Hans Mittelmann, et al. QPLIB: a library of quadratic programming instances. Mathematical Programming Computation, 11(2):237–265, 2019.
[5] Arbel Haim, Richard Kueng, and Gil Refael. Variational-correlations approach to quantum many-body problems, 2020.
[6] Cédric Josz and Daniel K Molzahn. Lasserre hierarchy for large scale polynomial optimization in real and complex variables. SIAM Journal on Optimization, 28(2):1017–1048, 2018.
[7] Igor Klep, Victor Magron, and Janez Povh. Sparse noncommutative polynomial optimization, 2019.
[8] Gary A Kochenberger, Jin-Kao Hao, Zhipeng Lü, Haibo Wang, and Fred Glover. Solving large scale max cut problems via tabu search. Journal of Heuristics, 19(4):565–571, 2013.
[9] Jean B Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
[10] Jean B Lasserre. Convergent SDP-relaxations in polynomial optimization with sparsity. SIAM Journal on Optimization, 17(3):822–843, 2006.
[11] Jean B Lasserre, Kim-Chuan Toh, and Shouguang Yang. A bounded degree SOS hierarchy for polynomial optimization. EURO Journal on Computational Optimization, 5(1-2):87–117, 2017.
[12] Jean Bernard Lasserre. An introduction to polynomial and semi-algebraic optimization, volume 52. Cambridge University Press, 2015.
[13] Monique Laurent. A comparison of the Sherali-Adams, Lovász-Schrijver, and Lasserre relaxations for 0–1 programming. Mathematics of Operations Research, 28(3):470–496, 2003.
[14] Victor Magron. Interval enclosures of upper bounds of roundoff errors using semidefinite programming. ACM Transactions on Mathematical Software (TOMS), 44(4):1–18, 2018.
[15] Victor Magron, George Constantinides, and Alastair Donaldson. Certified roundoff error bounds using semidefinite programming. ACM Transactions on Mathematical Software (TOMS), 43(4):1–31, 2017.
[16] Ngoc Hoang Anh Mai, Victor Magron, and Jean-Bernard Lasserre. A sparse version of Reznick's Positivstellensatz, 2020.
[17] A. Majumdar, A. A. Ahmadi, and R. Tedrake. Control and verification of high-dimensional systems with DSOS and SDSOS programming. In Proceedings of the 53rd IEEE Conference on Decision and Control, pages 394–401. IEEE, 2014.
[18] Jiawang Nie. Optimality conditions and finite convergence of Lasserre's hierarchy. Mathematical Programming, 146(1-2):97–121, 2014.
[19] Panos M. Pardalos and A. T. Phillips. A global optimization approach for solving the maximum clique problem. International Journal of Computer Mathematics, 33(3-4):209–216, 1990.
[20] Mihai Putinar. Positive polynomials on compact semi-algebraic sets. Indiana University Mathematics Journal, 42(3):969–984, 1993.
[21] Károly F. Pál and Tamás Vértesi. Quantum bounds on Bell inequalities. Physical Review A, 79(2), Feb 2009.
[22] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pages 10877–10887, 2018.
[23] Franz Rendl, Giovanni Rinaldi, and Angelika Wiegele. A branch and bound algorithm for Max-Cut based on combining semidefinite and polyhedral relaxations. In International Conference on Integer Programming and Combinatorial Optimization, pages 295–309. Springer, 2007.
[24] Corbinian Schlosser and Milan Korda. Sparse moment-sum-of-squares relaxations for nonlinear dynamical systems with guaranteed convergence. arXiv preprint arXiv:2012.05572, 2020.
[25] Matteo Tacchi, Carmen Cardozo, Didier Henrion, and Jean Lasserre. Approximating regions of attraction of a sparse polynomial differential system. arXiv preprint arXiv:1911.09500, 2019.
[26] Matteo Tacchi, Tillmann Weisser, Jean-Bernard Lasserre, and Didier Henrion. Exploiting sparsity for semi-algebraic set volume computation, 2019.
[27] Hayato Waki, Sunyoung Kim, Masakazu Kojima, and Masakazu Muramatsu. Sums of squares and semidefinite program relaxations for polynomial optimization problems with structured sparsity. SIAM Journal on Optimization, 17(1):218–242, 2006.
[28] Jie Wang, Martina Maggio, and Victor Magron. SparseJSR: A Fast Algorithm to Compute Joint Spectral Radius via Sparse SOS Decompositions. arXiv preprint arXiv:2008.11441, 2020.
[29] Jie Wang and Victor Magron. Exploiting term sparsity in noncommutative polynomial optimization. arXiv preprint arXiv:2010.06956, 2020.
[30] Jie Wang, Victor Magron, Jean B Lasserre, and Ngoc Hoang Anh Mai. CS-TSSOS: Correlative and term sparsity for large-scale polynomial optimization. arXiv preprint arXiv:2005.02828, 2020.
[31] Jie Wang, Victor Magron, and Jean-Bernard Lasserre. Chordal-TSSOS: a moment-SOS hierarchy that exploits term sparsity with chordal extension. SIAM Journal on Optimization, 31(1):114–141, 2021.
[32] Jie Wang, Victor Magron, and Jean-Bernard Lasserre. TSSOS: A Moment-SOS hierarchy that exploits term sparsity. SIAM Journal on Optimization, 31(1):30–58, 2021.
[33] Tillmann Weisser, Jean B Lasserre, and Kim-Chuan Toh. Sparse-BSOS: a bounded degree SOS hierarchy for large scale polynomial optimization with sparsity. Mathematical Programming Computation, 10(1):1–32, 2018.