Computing Lower Rank Approximations of Matrix Polynomials
Mark Giesbrecht, Joseph Haraldson, George Labahn

Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Email addresses: [email protected] (Mark Giesbrecht), [email protected] (Joseph Haraldson), [email protected] (George Labahn)

Abstract
Given an input matrix polynomial whose coefficients are floating point numbers, we consider the problem of finding the nearest matrix polynomial which has rank at most a specified value. This generalizes the problem of finding a nearest matrix polynomial that is algebraically singular with a prescribed lower bound on the dimension given in a previous paper by the authors. In this paper we prove that such lower rank matrices at minimal distance always exist, satisfy regularity conditions, and are all isolated and surrounded by a basin of attraction of non-minimal solutions. In addition, we present an iterative algorithm which, on given input sufficiently close to a rank-at-most matrix, produces that matrix. The algorithm is efficient and is proven to converge quadratically given a sufficiently good starting point. An implementation demonstrates the effectiveness and numerical robustness of our algorithm in practice.
1. Introduction
Matrix polynomials appear in many areas of computational algebra, control systems theory, differential equations, and mechanics. The algebra of matrix polynomials is typically described assuming that the individual polynomial coefficients come from an exact arithmetic domain. However, in applications these coefficients are typically numeric, usually real or complex numbers. As such, arithmetic can have numerical errors and algorithms are prone to numerical instability.

Numerical errors have an impact, for example, in determining the rank of a matrix polynomial with floating point coefficients.
In an exact setting, determining the rank or determinant of a matrix polynomial is straightforward, and efficient procedures are available, for example from Storjohann and Villard (2005). However, in a numeric environment, a matrix polynomial may appear to have full or high rank while at the same time being close to one having lower rank. Here "close" is defined naturally under the Frobenius norm on the underlying coefficient matrices of the matrix polynomial. Rather than computing the rank of the given matrix polynomial exactly, one can ask how far away it is from one that is rank-deficient, and then find one at that distance. In the case of matrices with constant entries this is a problem solved via the Singular Value Decomposition (SVD). However, in the case of matrix polynomials no equivalent rank revealing factorization has thus far been available.

In this paper we consider the problem of computing the nearest matrix polynomial to an input matrix polynomial in $\mathbb{R}[t]^{m\times n}$ having rank at most a specified value. More precisely, given an integer $r$ and an $\mathcal{A} \in \mathbb{R}[t]^{m\times n}$ of full rank, we want to compute $\Delta\mathcal{A} \in \mathbb{R}[t]^{m\times n}$ with $\deg(\Delta\mathcal{A}_{ij}) \leq \deg \mathcal{A}_{ij}$ (or similar degree constraints to be specified later), such that $\mathcal{A} + \Delta\mathcal{A}$ has rank at most $n - r$ and where $\|\Delta\mathcal{A}\|$ is minimized. In the case where $n - r$ is one less than the row or column size, this is the problem of finding the nearest matrix polynomial which is singular.

A reasonable metric for measuring closeness on the space of matrix polynomials over the reals is the Frobenius norm. For a matrix polynomial $\mathcal{A} \in \mathbb{R}[t]^{m\times n}$, with $(i,j)$ entry $A_{ij} \in \mathbb{R}[t]$, the Frobenius norm is given by

$$\|\mathcal{A}\|^2 = \|\mathcal{A}\|_F^2 = \sum_{1 \leq i \leq m,\; 1 \leq j \leq n} \|A_{ij}\|^2, \qquad (1.1)$$

where, for a polynomial $a \in \mathbb{R}[t]$, the coefficient 2-norm is defined by

$$a = \sum_{0 \leq i \leq \deg a} a_i t^i, \qquad \|a\|^2 = \|a\|_2^2 = \sum_{0 \leq i \leq \deg a} a_i^2. \qquad (1.2)$$

The main results in this paper center on the characterization of the geometry of minimal solutions. We show that minimal solutions exist; that is, for a given $r$ there exists a $\Delta\mathcal{A} \in \mathbb{R}[t]^{m\times n}$ of minimal norm such that $\mathcal{A} + \Delta\mathcal{A}$ has rank at most $n - r$ and meets the required degree constraints on perturbed coefficients. In addition, we show that minimal solutions are isolated and are surrounded by a non-trivial open neighbourhood of non-minimal solutions. Also, regularity and second-order sufficiency conditions are generically satisfied, and a restricted version of the problem always satisfies these conditions. Finally, we show that we can also generalize our results to the lower rank approximation instance of matrix polynomials generated by an affine structure, and so generalize to low-rank approximations of structured matrices by taking the degree to be zero.

We demonstrate efficient algorithms for computing our minimal lower rank approximants.
That is, for an input matrix polynomial $\mathcal{A} \in \mathbb{R}[t]^{m\times n}$ (with prescribed affine structure) sufficiently close to a singular matrix polynomial, we give an iterative scheme which converges to a rank-at-most matrix polynomial at minimal distance, at a provably quadratic rate of convergence. We further generalize the iterative scheme so that it converges to a matrix polynomial with a kernel of dimension at least $r$, at a minimal distance and a provable quadratic rate of convergence. Finally, we also discuss a Maple implementation which demonstrates the convergence and numerical robustness of our iterative scheme.

Here a matrix $A \in \mathbb{F}^{m\times n}$ has an affine structure over a ring $\mathbb{F}$ if it can be written as $A = B_0 + \sum_{i=1}^{L} c_i B_i$ for $\{B_0, B_1, \ldots, B_L\} \subseteq \mathbb{F}^{m\times n}$ and $c_i \in \mathbb{F}$. If $B_0$ is the zero matrix, then the structure is said to be linear. Examples of linear structures include symmetric and Hermitian matrices, while matrices with an affine structure include those whose entries contain fixed non-zero coefficients, such as monic matrix polynomials.

Much of the work in this area has been done under the heading of matrix pencils; see Gohberg et al. (2009) for an excellent overview. Non-singular (full rank) square matrix polynomials are sometimes referred to as regular matrix polynomials.

In the case of finding the nearest singular matrix pencil, this problem was solved by the present authors in Giesbrecht et al. (2017). Previous to that, this problem was posed for linear matrix pencils in Byers and Nichols (1993) and followed up in Byers et al. (1998). The nearest singular matrix polynomial relates to the stability of polynomial eigenvalue problems, linear time invariant systems and differential-algebraic equations studied subsequently in (Kressner and Voigt, 2015; Guglielmi et al., 2017). For non-linear matrix polynomials/pencils, previous works rely on embedding a non-linear (degree greater than 1) matrix polynomial into a linear matrix polynomial of much higher order. Theorem 1.1 and Section 7.2 of Gohberg et al. (2009) show that any regular $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$ of degree $d$ is equivalent to a linear matrix polynomial $\mathcal{B} = B_0 + tB_1$, for $B_0, B_1 \in \mathbb{R}^{nd\times nd}$. However, this equivalence is (obviously) not an isomorphism, nor is it distance preserving (the equivalence mapping is not surjective). Hence a nearby singular matrix polynomial to $\mathcal{B} \in \mathbb{R}[t]^{nd\times nd}$ (even when constrained to a degree one perturbation) almost certainly does not correspond to a nearby singular $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$. Moreover, even if one were to perturb to a rank-reduced matrix within the image of the linearization, the inverse image would not necessarily have reduced rank. In Lawrence and Corless (2015) a more sophisticated linearization, with an eye towards ameliorating this, is explored.

In the context of computer algebra, the notion of symbolic-numeric algorithms for polynomials has been an active area of research for a number of years, and the general framework of finding nearby instances with a desired algebraic property is being thoroughly explored. Closest to our work here is work on approximate Greatest Common Divisors (GCD) Corless et al. (1995); Beckermann and Labahn (1998b,a), multivariate polynomial factorizations Kaltofen et al. (2008), and especially the optimization-based approaches employing the Structured Total Least Norm algorithm Li et al. (2005); Kaltofen et al. (2005, 2006); Zhi (2007) and the Riemannian SVD Botting et al. (2005).
More recently, we have explored computing the approximate GCRD of (non-commutative) differential polynomials (Giesbrecht and Haraldson, 2014; Giesbrecht et al., 2016), where similar issues are resolved.

The computer algebra community has made impressive progress on fast, exact algorithms for matrix polynomials, including nearly optimal algorithms for computing ranks, factorizations and various normal forms; see Kaltofen and Storjohann (2015) and references therein for a recent overview. Part of our goal in this current paper is to establish a basis for extending the reach of these symbolic techniques to matrices of polynomials with floating point coefficients.

In a more general setting our problem can be formulated as a Structured Low Rank Approximation (SLRA) problem. A popular method to solve SLRA problems is the Structured Total Least Norm (STLN) approach (Rosen et al., 1996, 1998). These are iterative methods and in general their convergence to stationary points is linear (first order), rather than quadratic, unless additional assumptions are made. In the event STLN converges to a solution, there may be other solutions arbitrarily nearby, as second order sufficient conditions may not hold. The SLRA problem is a non-linear least squares problem, and accordingly other techniques such as the Restricted and Riemannian SVD (De Moor, 1993, 1994, 1995) provide general tools for solving such problems. Other heuristic tools applicable to our problem include variable projection (Golub and Pereyra, 1973, 2003) and Newton's method (Abatzoglou et al., 1991). We would expect these methods to perform poorly in our case, as problems with large residuals are known to be difficult for them, and the rational function arising from variable projection can be too costly to deal with for modestly sized problems. The problem may also be considered as optimization on a manifold (Absil et al., 2009); however, we do not explicitly consider this approach. For a detailed survey of affinely structured low-rank approximation, see (Markovsky, 2008, 2011).

Other methods for structured low-rank approximation involve the family of lift-and-project algorithms, the best known being Cadzow's algorithm (Cadzow, 1988). More recently, Schost and Spaenlehauer (2016) give a sequence of alternating projections that provably converge quadratically to a fixed point. However, lift-and-project algorithms do not generally satisfy necessary first order optimality conditions (see (Bertsekas, 1999)), and while they may converge (quickly) to a fixed point, there is no guarantee that the fixed point is an optimal solution, though it is usually quite good. In any case, for specific problems such as ours, understanding the geometry of the minimal solutions (and hence the well-posedness of the problem) is key to effective algorithms for their computation.

SLRA problems are in general NP-hard to solve; see for example (Poljak and Rohn, 1993; Braatz et al., 1994). They are also hard to approximate under affinely structured matrices over $\mathbb{Q}$. In general the hardness stems from determining if a bilinear system of equations admits a non-trivial solution. In the instance of classical matrix polynomials it is trivial to construct feasible points, since the underlying scalar matrix problem is linearly structured. All of our contributions apply to matrix polynomials with an affine structure provided that feasible points exist, that is, singular matrix polynomials with a prescribed structure exist, which is NP-hard to decide in general.
In particular, in the degree zero case our algorithms and techniques apply to affine SLRA problems. Thus, computing the nearest (affinely structured) matrix polynomial is equivalent to SLRA problems with an affine structure.

While the contributions in this paper focus on local properties of SLRA, the local properties also imply global results. The Sum of Squares (SOS) hierarchy is a global framework for studying polynomial optimization problems subject to polynomial constraints Lasserre (2001). The SOS optimization tools have found experimental success in computing structured distances to singularity and extracting minimizers when the solutions are locally unique; see for example Henrion and Lasserre (2006). In general the SOS hierarchy converges over an infinite order of relaxations, but for several problems the relaxations converge after a finite order. This finite convergence is in polynomial time with respect to the input and the number of relaxations. In particular, finite convergence was observed for affine SLRA problems in Henrion and Lasserre (2006), but little theory was provided to indicate the reason why. The later work of Nie (2014) shows that, under regularity and second-order sufficiency conditions, finite convergence always occurs and that it is possible to extract a minimal solution. In our contributions we prove that second-order sufficiency and regularity conditions hold generically (and if they do not, then they will hold on a restricted subset of the problem). The corollary to this is that the SOS hierarchy will have finite convergence for affine SLRA problems if a solution exists (such as computing the distance to the nearest rank-deficient matrix polynomial), and if the embedding is minimal then a minimizer may be extracted as well. Another useful feature of the SOS hierarchy is that even if convergence cannot be certified, a structured lower bound is obtained.

In Sections 2 and 3 we describe tools needed for our constructions and then explore the geometry of our problem. We show that the problem is locally well-posed. One cannot expect the nearest rank at most matrix polynomial to be unique. However, under weak normalization assumptions, we show that solutions are locally unique in a closed ball around them. To complement the separation of solutions, we also show that, for an equivalent problem, solutions corresponding to a different closed ball are separated by at least a constant amount independent of the dimension of the space.

In Section 4 we give an equality constrained variant of Newton's method for computing, via post-refinement, the nearest rank at most matrix polynomial. The main idea is to compute an initial guess with a suitable first order or lift-and-project method. We are able to prove that, with a suitable initial guess and regularity assumptions, our algorithm generally has local quadratic convergence except for degenerate cases. This is done by deriving closed-form expressions for the Jacobian of the constraints and the Hessian of the Lagrangian. When we refer to the speed of convergence, we refer to quotient rates, as is typical in the nomenclature.

In Section 5 we describe our prototype implementation, including heuristics for starting points and other improvements. We discuss the numerical performance of the algorithm and give examples demonstrating convergence results for a low-rank approximation of matrix polynomials. The paper ends with a conclusion and topics for future research.
2. Preliminaries and Geometry
In this section we will introduce some basic definitions and explore the numerical geometry of our lower rank problem. Canonically we will let

$$\mathcal{A} = \sum_{j=0}^{d} A_j t^j \in \mathbb{R}[t]^{m\times n}$$

be a matrix polynomial, with coefficients $A_0, \ldots, A_d \in \mathbb{R}^{m\times n}$. In the case of rectangular matrix polynomials we are able to pad the matrix with zeros, thus embedding the problem into one with square matrix polynomials. Thus we will let

$$\mathcal{A} = \sum_{j=0}^{d} A_j t^j \in \mathbb{R}[t]^{n\times n}$$

be a matrix polynomial, with coefficients $A_0, \ldots, A_d \in \mathbb{R}^{n\times n}$. The degree $\deg \mathcal{A}$ of $\mathcal{A}$ is defined as $d$, assuming that $A_d \neq 0$.

We say that $\mathcal{A}$ is singular if $\det(\mathcal{A})$ is the zero polynomial in $\mathbb{R}[t]$, or equivalently, that there is a $\mathbf{b} \in \mathbb{R}[t]^{n\times 1}$ such that $\mathcal{A}\mathbf{b} \equiv 0$.
The kernel of $\mathcal{A}$ is $\ker \mathcal{A} = \{\mathbf{b} \in \mathbb{R}[t]^{n\times 1} : \mathcal{A}\mathbf{b} = 0\}$ and the rank of $\mathcal{A}$ is $n - \dim \ker \mathcal{A}$ (as a vector space over $\mathbb{R}(t)$). Then $\mathcal{A}$ has rank at most $n - r$ if there exist at least $r$ linearly independent vectors $\{\mathbf{b}_i\}_{i=1,\ldots,r}$ satisfying $\mathcal{A}\mathbf{b}_i = 0$.

For $a \in \mathbb{R}[t]$, define

$$\phi(a) = \phi^{(n,d)}(a) = \begin{pmatrix} a_0 & & & \\ a_1 & a_0 & & \\ \vdots & a_1 & \ddots & \\ a_d & \vdots & \ddots & a_0 \\ & a_d & & a_1 \\ & & \ddots & \vdots \\ & & & a_d \end{pmatrix} \in \mathbb{R}^{(\mu+d)\times\mu}, \qquad (2.1)$$

where $\mu = nd + 1$. $\phi(a)$ is a Toeplitz matrix. Such matrices are conveniently used to describe polynomial multiplication, in the sense that if $c = a \cdot b$ with $a$ of degree $d$ and $b \in \mathbb{R}[t]$ of degree at most $\mu - 1$, then $\mathrm{vec}(c) = \phi(a) \cdot \mathrm{vec}(b)$, where $\mathrm{vec}(p)$ is the vector of coefficients of a polynomial $p$.
Definition 2.1. The $\mathbb{R}$-embedding of $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$ is

$$\widehat{\mathcal{A}} = \begin{pmatrix} \phi(A_{1,1}) & \cdots & \phi(A_{1,n}) \\ \vdots & & \vdots \\ \phi(A_{n,1}) & \cdots & \phi(A_{n,n}) \end{pmatrix} \in \mathbb{R}^{n(\mu+d)\times n\mu}.$$

For $\mathbf{b} \in \mathbb{R}[t]^{n\times 1}$ of degree $\mu - 1$, the $\mathbb{R}$-embedding of $\mathbf{b}$ is

$$\widehat{\mathbf{b}} = (b_{1,0}, b_{1,1}, \ldots, b_{1,\mu-1}, \ldots, b_{n,0}, \ldots, b_{n,\mu-1})^T \in \mathbb{R}^{n\mu\times 1}.$$

Note that $\mathcal{A}\mathbf{b} = 0$, for $\mathbf{b} \in \mathbb{R}[t]^{n\times 1}$ of degree at most $\mu - 1$, if and only if $\widehat{\mathcal{A}} \cdot \widehat{\mathbf{b}} = 0 \in \mathbb{R}^{n(\mu+d)\times 1}$. This property is central to our work in the coming sections. For ease of notation we will take

$$N = n(\mu + d) = n^2 d + n(d+1) \quad \text{and} \quad M = n\mu = n^2 d + n$$

for the dimensions of $\mathbb{R}$-embeddings in subsequent sections.

We note that $\widehat{\mathcal{A}}$ is a block-Toeplitz matrix, and as such one method of understanding the lower rank problem is to find close by structured rank deficient block-Toeplitz matrices, a typical structured low rank approximation problem. Some authors refer to such embeddings as a (permuted) Sylvester matrix associated with $\mathcal{A}$. We avoid this terminology as it is ambiguous when considering Sylvester matrices occurring in (approximate) GCD computations.

Unlike the standard linearizations in (Gohberg et al., 2009, Section 7.2) used to turn arbitrary degree matrix pencils into linear pencils, this $\mathbb{R}$-embedding is kernel preserving for matrix polynomials of arbitrary degree. In particular, $\mathbf{b} \in \ker \mathcal{A}$ with $\deg \mathbf{b} \leq \mu - 1$ implies $\widehat{\mathbf{b}} \in \ker \widehat{\mathcal{A}}$. The $\mathbb{R}$-embedding is also quasi-distance preserving, since $\|\widehat{\mathcal{A}}\|_F^2 = \mu \|\mathcal{A}\|_F^2$.
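To make the embedding concrete, the following is a minimal sketch in Python/NumPy (the implementation discussed later in this paper is in Maple; the function names here are our own illustration). It builds $\phi(a)$ of (2.1), assembles the block matrix of Definition 2.1, and verifies the kernel-preservation property on a small singular example:

```python
# A sketch of the R-embedding of Definition 2.1, assuming polynomials are
# represented as coefficient lists [a_0, ..., a_d] and a matrix polynomial
# as an n x n nested list of such coefficient lists.
import numpy as np

def phi(a, mu):
    """Toeplitz multiplication matrix phi(a) in R^{(mu+d) x mu} of (2.1)."""
    d = len(a) - 1
    T = np.zeros((mu + d, mu))
    for j in range(mu):              # column j carries a shifted copy of a
        T[j:j + d + 1, j] = a
    return T

def r_embed(A, d):
    """Block matrix (phi(A_ij)) in R^{n(mu+d) x n mu}, with mu = n d + 1."""
    n = len(A)
    mu = n * d + 1
    return np.block([[phi(A[i][j], mu) for j in range(n)] for i in range(n)])

# Singular example: A = [[t, 1], [t^2, t]] has kernel vector b = (1, -t)^T.
A = [[[0, 1, 0], [1, 0, 0]],
     [[0, 0, 1], [0, 1, 0]]]          # degree d = 2, so mu = n d + 1 = 5
A_hat = r_embed(A, d=2)               # shape (14, 10): N = 14, M = 10
b_hat = np.array([1, 0, 0, 0, 0,      # coefficients of b_1 = 1
                  0, -1, 0, 0, 0])    # coefficients of b_2 = -t
print(np.allclose(A_hat @ b_hat, 0))  # True: the kernel is preserved
```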
Problem 2.2 (Main Problem).
Given $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$, non-singular of degree $d$, and an integer $r \leq n$, determine $\Delta\mathcal{A} \in \mathbb{R}[t]^{n\times n}$, with $\deg \Delta\mathcal{A}_{ij} \leq \deg \mathcal{A}_{ij}$ for all $1 \leq i, j \leq n$, and $r$ linearly independent vectors $\mathbf{b}_k \in \mathbb{R}[t]^{n\times 1}$, such that $\|\Delta\mathcal{A}\|$ is (locally) minimized, subject to the constraints $(\mathcal{A} + \Delta\mathcal{A})\mathbf{b}_k = 0$ and $\|\mathbf{b}_k\| = 1$.

Note that this is minimizing a convex objective function subject to non-convex constraints. However, the equality constraints are linear in each argument. It is still not clear that Problem 2.2 is well-posed in the current form. We will prove that solutions exist, that is, that there is an attainable global minimum value and not merely an infimum.
Lemma 2.3. $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$ is singular if and only if there exists a $\mathbf{b} \in \mathbb{R}[t]^{n\times 1}$ with $\deg \mathbf{b} \leq nd = \mu - 1$ such that $\mathcal{A}\mathbf{b} = 0$.

Proof. Suppose that $\mathcal{A}$ has rank $s < n$. By permuting rows and columns we may assume without loss of generality that the leading $s \times s$ submatrix of $\mathcal{A}$ is non-singular. There is a unique vector of the form $\mathbf{c} = (b_1/\gamma, \ldots, b_s/\gamma, -1, 0, \ldots, 0)^T$ such that $\mathcal{A}\mathbf{c} = 0$, where $\gamma \in \mathbb{R}[t]$ is the determinant of the leading $s \times s$ minor of $\mathcal{A}$, and all of $b_1, \ldots, b_s, \gamma \in \mathbb{R}[t]$ have degree at most $sd \leq nd$. Multiplying through by $\gamma$, we find that $\mathbf{b} = \gamma\mathbf{c}$ satisfies the requirements of the lemma. See (Beckermann et al., 2006, Corollary 5.5) for an alternative proof.

Lemma 2.4. $\mathcal{A}$ is singular if and only if $\widehat{\mathcal{A}}$ does not have full column rank.

Proof. If $\mathcal{A}$ is rank deficient then there exists $\mathbf{b} \in \mathbb{R}[t]^{n\times 1}$ with $\deg \mathbf{b} \leq \mu - 1$ such that $\mathcal{A}\mathbf{b} = 0$; then $\widehat{\mathcal{A}}$ has a non-trivial kernel, since $\widehat{\mathbf{b}} \in \ker \widehat{\mathcal{A}}$ by construction. Conversely, suppose that $\mathcal{A}$ has full rank. Then for all $\mathbf{b} \in \mathbb{R}[t]^{n\times 1}$ we have $\mathcal{A}\mathbf{b} \neq 0$, which implies that $\widehat{\mathcal{A}}\widehat{\mathbf{b}} \neq 0$, i.e., $\ker \widehat{\mathcal{A}}$ is trivial.

We recall the Singular Value Decomposition as the primary tool for finding the distance to the nearest unstructured rank deficient matrix over $\mathbb{R}$ or $\mathbb{C}$.

Definition 2.5.
A Singular Value Decomposition (SVD) of $C \in \mathbb{R}^{N\times M}$ is given by $C = Q \cdot \Sigma \cdot P^T$, where $Q \in \mathbb{R}^{N\times N}$ and $P \in \mathbb{R}^{M\times M}$ are orthogonal matrices and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_M) \in \mathbb{R}^{N\times M}$ is a diagonal matrix consisting of the singular values of $C$ in descending order of magnitude. See (Golub and Van Loan, 2012).

The following fact is a standard motivation for the SVD.
Fact 2.6 (Eckart and Young (1936)). Suppose $C = Q\Sigma P^T \in \mathbb{R}^{N\times M}$ as above has full column rank, with $N \geq M$. Then $\Delta C = Q\,\mathrm{diag}(0, \ldots, 0, -\sigma_M)\,P^T$ is such that $C + \Delta C$ has column rank at most $M - 1$, $\|\Delta C\|_F = \sigma_M$, and $\Delta C$ is a perturbation of minimal Frobenius norm which reduces the column rank of $C$.
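Fact 2.6 is easy to exercise numerically. The following NumPy fragment (our own illustration, not code from the paper) builds the minimal rank-reducing perturbation from the SVD and checks both claims:

```python
# Minimal illustration of Fact 2.6: the smallest singular value gives the
# Frobenius distance to the nearest column-rank-deficient matrix, and the
# SVD yields the minimal perturbation.
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])            # full column rank, N = 3 >= M = 2
Q, sigma, Pt = np.linalg.svd(C)       # C = Q @ Sigma @ P^T
Sigma = np.zeros_like(C)
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)

Sigma_def = Sigma.copy()
Sigma_def[len(sigma) - 1, len(sigma) - 1] = 0.0   # zero out sigma_M
Delta_C = Q @ (Sigma_def - Sigma) @ Pt

print(np.linalg.matrix_rank(C + Delta_C))              # 1 = M - 1
print(np.isclose(np.linalg.norm(Delta_C), sigma[-1]))  # True
```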
Lemma 2.7. Given a non-singular $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$ and $\Delta\mathcal{A} \in \mathbb{R}[t]^{n\times n}$ such that $\mathcal{B} = \mathcal{A} + \Delta\mathcal{A}$ is singular, it is the case that $\|\widehat{\Delta\mathcal{A}}\| \geq \sigma_{n\mu}(\widehat{\mathcal{A}})$.

Proof. By Lemma 2.4 above, $\widehat{\mathcal{B}}$ is not of full column rank. Thus, by Fact 2.6, $\|\widehat{\Delta\mathcal{A}}\|_F \geq \sigma_{n\mu}(\widehat{\mathcal{A}})$.

Corollary 2.8.
The set of matrices of rank greater than $n - r$ over $\mathbb{R}[t]^{n\times n}$ of degree at most $d$ is open; or, equivalently, the set of all matrices of rank at most $n - r$ over $\mathbb{R}[t]^{n\times n}$ of degree at most $d$ is closed.

Theorem 2.9 (Existence of Solutions). The minimization posed in Problem 2.2 has an attainable global minimum if $\deg \Delta\mathcal{A}_{i,j} \leq \deg \mathcal{A}_{i,j}$ for all $1 \leq i, j \leq n$.

Proof. Let

$$S = \left\{\mathcal{C} \in \mathbb{R}[t]^{n\times n} \;\middle|\; \mathrm{rank}\,\mathcal{C} \leq n - r \;\wedge\; \deg \mathcal{C} \leq d\right\} \cap \left\{\mathcal{C} \in \mathbb{R}[t]^{n\times n} \;\middle|\; \|\mathcal{C}\|_F \leq 2\|\mathcal{A}\|_F\right\}.$$

$S$ is the intersection of a closed and bounded set and a closed set; hence $S$ is closed and bounded. $S$ is isomorphic to some closed and bounded subset of Euclidean space; hence, by the Heine-Borel theorem, $S$ is compact. To show the set is non-empty, we note that, by the degree assumption on $\Delta\mathcal{A}$, the perturbation $\Delta\mathcal{A} = -\mathcal{A}$ is a feasible point independent of rank. Let $\mathcal{C} \in S$; then $\|\mathcal{A} - \mathcal{C}\|_F = \|\Delta\mathcal{A}\|_F$ is a continuous function over a compact set. By Weierstrass' theorem it has an attainable global minimum.

It is important not to over-constrain the problem with a choice of $\Delta\mathcal{A}$, since otherwise the feasible set might be empty. Another reasonable choice of $\Delta\mathcal{A}$ which we can handle is that the perturbation has the same coefficient structure/support as $\mathcal{A}$, that is, zero terms in polynomial entries are preserved.

We note that this result says nothing about uniqueness or separation of solutions or any local properties. All that has been shown is that if the perturbations are in the same space as the input, and one seeks a rank at-most approximation, then there is an attainable global minimum value, i.e., not an infimum. If one wants a minimal solution with the rank being exactly $n - r$, then there is no guarantee that there is an attainable global minimum to Problem 2.2.
3. Rank Factorization
A natural formulation of the problem that encompasses the rank implicitly is to perform a rank factorization and write $\mathcal{A} + \Delta\mathcal{A} = \mathcal{U}\mathcal{V}$ for $\mathcal{U} \in \mathbb{R}[t]^{n\times(n-r)}$ and $\mathcal{V} \in \mathbb{R}[t]^{(n-r)\times n}$. Here $\mathcal{U}\mathcal{V}$ is subject to some constraints that preserve the structure of $\Delta\mathcal{A}$ (i.e., that we do not perturb any coefficients we are not allowed to; typically $\deg \Delta\mathcal{A}_{ij} \leq \deg \mathcal{A}_{ij}$, but possibly also preserving the zero coefficients and not introducing a larger support). This is a non-linear least squares problem. However, solutions are not unique. Indeed, if $\mathcal{Z} \in \mathbb{R}[t]^{(n-r)\times(n-r)}$ is unimodular (i.e., $\det(\mathcal{Z}) \in \mathbb{R}^*$), then $\mathcal{U}\mathcal{Z}$, $\mathcal{Z}^{-1}\mathcal{V}$ is another rank $n - r$ factorization, and we obtain an infinite family. While normalizing over matrix polynomial rank-factorizations is difficult, it is much easier to exploit the quasi-distance preserving property of $\|\cdot\|_F$ and look at rank factorizations of $\widehat{\mathcal{A}}$ that do not necessarily correspond to $\mathcal{U}$ and $\mathcal{V}$.

Definition 3.1.
Let $N = (\mu + d)n$, $M = n\mu$ and $R > 0$. A rank factorization of $\widehat{\mathcal{A}} + \widehat{\Delta\mathcal{A}}$ is given by writing $\widehat{\mathcal{A}} + \widehat{\Delta\mathcal{A}} = UV$, where $U \in \mathbb{R}^{N\times R}$ and $V \in \mathbb{R}^{R\times M}$ are arbitrary (unstructured) matrices over $\mathbb{R}$.

We seek $U, V$ with appropriate dimensions which minimize $\|\Delta\widehat{\mathcal{A}}\| = \|\widehat{\mathcal{A}} - UV\|$ and such that $\Delta\widehat{\mathcal{A}}$ has the correct Toeplitz-block structure (i.e., it is an $\mathbb{R}$-embedding of a matrix polynomial). This is a problem with a non-convex objective function (that is convex in each argument) and non-convex constraints. We note that $U$, $V$ have no direct connection with $\mathcal{U}$ and $\mathcal{V}$ above. One may always write $\widehat{\mathcal{A}} + \widehat{\Delta\mathcal{A}}$ this way via the SVD for fixed $\widehat{\mathcal{A}}$ and $\widehat{\Delta\mathcal{A}}$, so in particular the optimal solution can be written as a rank factorization. The problem $\min \|\widehat{\mathcal{A}} - UV\|$ such that $UV$ has the same structure as $\widehat{\Delta\mathcal{A}}$ is generally ill-posed and needs to be constrained to allow any meaningful analysis, as there are numerous degrees of freedom. At first glance, optimizing over rank factorizations appears to be a harder problem than the original. However, it is helpful to perform analysis on this formulation. In particular, we are able to prove that optimal values of $\widehat{\Delta\mathcal{A}}$ that satisfy first order conditions (which contain all useful perturbations) are separated by a constant amount, and that equivalence classes of solutions are isolated. Additionally, this formulation of the problem is convex in each argument (but not jointly convex), and is amenable to block coordinate descent methods.

We next need to demonstrate that the condition that the matrix $\Delta\widehat{\mathcal{A}} = \widehat{\mathcal{A}} - UV$ is the $\mathbb{R}$-embedding of some matrix polynomial $\Delta\mathcal{A} \in \mathbb{R}[t]^{n\times n}$ can be phrased as a single polynomial being zero. $\widehat{\mathcal{A}}$ is generated by a linear structure $\sum_{i=1}^{L} c_i \widehat{\mathcal{A}}^{(i)}$, where $c_i \in \mathbb{R}$ and $\{\widehat{\mathcal{A}}^{(1)}, \ldots, \widehat{\mathcal{A}}^{(L)}\} \subseteq \mathbb{R}^{N\times M}$. Define the structural enforcement function $\Gamma : \mathbb{R}^{N\times R} \times \mathbb{R}^{R\times M} \to \mathbb{R}$ as

$$\Gamma(U, V) = \left\| \sum_{i=1}^{L} c_i \widehat{\mathcal{A}}^{(i)} - \Delta\widehat{\mathcal{A}} \right\|_F^2.$$

We note that there exist $c_i$ such that $\Gamma(U, V) = 0$ if and only if $\Delta\widehat{\mathcal{A}}$ is an $\mathbb{R}$-embedding of a matrix polynomial.
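The constraint $\Gamma(U, V) = 0$ can be tested (or enforced, in lift-and-project style) through the orthogonal projection onto the linear span of $\mathbb{R}$-embeddings: for the block-Toeplitz structure of Definition 2.1 this projection averages, within each block, the entries carrying the same coefficient and zeroes the entries outside the band. A minimal sketch, under the assumption that the structure is exactly that of Definition 2.1 (our own illustration):

```python
# Orthogonal projection of an N x M matrix onto the span of R-embeddings.
# Each (mu+d) x mu block is replaced by the nearest matrix of the form
# phi(a), obtained by averaging along diagonals and zeroing the rest.
import numpy as np

def project_block(T, d):
    """Nearest phi(a) to T in the Frobenius norm."""
    mu = T.shape[1]
    P = np.zeros_like(T)
    for k in range(d + 1):                         # coefficient a_k
        diag = [T[j + k, j] for j in range(mu)]
        P[np.arange(mu) + k, np.arange(mu)] = np.mean(diag)
    return P

def project_structure(X, n, d):
    """Blockwise projection of X in R^{n(mu+d) x n mu} onto R-embeddings."""
    mu = X.shape[1] // n
    rows, cols = mu + d, mu
    Y = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            blk = X[i * rows:(i + 1) * rows, j * cols:(j + 1) * cols]
            Y[i * rows:(i + 1) * rows, j * cols:(j + 1) * cols] = \
                project_block(blk, d)
    return Y
```

In particular, $\Delta\widehat{\mathcal{A}} = \widehat{\mathcal{A}} - UV$ is an $\mathbb{R}$-embedding exactly when it is fixed by this projection.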
Problem 3.2. With $\widehat{\mathcal{A}}, U, V$ as above, the constrained $\mathbb{R}$-embedded rank factorization problem consists of computing $\min \|\widehat{\mathcal{A}} - UV\|_F$ subject to the constraints that $U^T U - I = 0$ and $\Gamma(U, V) = 0$. If $R = M - 1$, then this encodes all rank deficient matrix polynomials.

It is still not clear that Problem 3.2 is well-posed, as there are many degrees of freedom in $V$, and this matrix can have arbitrary rank. The enforcement of $U$ as an orthogonal matrix ($U^T U - I = 0$) is done without loss of generality. Informally, then, we are looking at all rank factorizations where $U$ is orthogonal and $\Gamma(U, V) = 0$, that is, where the product satisfies the block-Toeplitz structure on $\widehat{\Delta\mathcal{A}}$.

We employ the machinery of non-linear optimization to describe the geometry of the minimal solutions, and hence the nearest appropriately structured matrices. See (Bertsekas, 1999) for an excellent overview.

Fact 3.3 (Bertsekas (1999, Section 3.1.1)). For a sufficiently large $\rho > 0$ (sometimes known as a penalty term), Problem 3.2 is equivalent to computing a solution to the unconstrained optimization problem
$$\Phi(U, V) = \min_{U,V} \|\widehat{\mathcal{A}} - UV\|_F^2 + \rho\|\Gamma(U, V)\|_F^2 + \rho\|U^T U - I\|_F^2.$$
All the interesting solutions to the minimization of $\Phi(U, V)$ occur at stationary points. The first-order necessary condition (on $V$) of vanishing gradients gives us

$$\nabla_V \left( \|\widehat{\mathcal{A}} - UV\|_F^2 + \rho\|\Gamma(U, V)\|_F^2 + \rho\|U^T U - I\|_F^2 \right) = 0 \iff U^T(\widehat{\mathcal{A}} - UV) + \left(\frac{\partial}{\partial V}\Gamma(U, V)\right)^T \rho\,\Gamma(U, V) = 0.$$

If we assume that the constraints are active, that is, $U$ is orthogonal and $\Gamma(U, V) = 0$, then we have $U^T\widehat{\mathcal{A}} - V = 0$. Of course, there is the other first order necessary condition, requiring that

$$\nabla_U \left( \|\widehat{\mathcal{A}} - UV\|^2 + \rho\|\Gamma(U, V)\|^2 + \rho\|U^T U - I\|^2 \right) = 0.$$

However, we do not need to employ this explicitly in the following.
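For experimentation, the penalty function $\Phi$ of Fact 3.3 can be evaluated directly. In the sketch below (our own illustration, reusing project_structure from the previous sketch), $\Gamma$ is realized as the squared Frobenius distance from $\Delta\widehat{\mathcal{A}}$ to its nearest $\mathbb{R}$-embedding:

```python
# Evaluate the penalty function Phi of Fact 3.3 (a sketch; Gamma is the
# squared distance from Delta A_hat to the nearest R-embedding).
import numpy as np

def phi_penalty(A_hat, U, V, n, d, rho):
    Delta = A_hat - U @ V                           # Delta A_hat
    gamma = np.linalg.norm(Delta - project_structure(Delta, n, d)) ** 2
    orth = np.linalg.norm(U.T @ U - np.eye(U.shape[1])) ** 2
    return np.linalg.norm(Delta) ** 2 + rho * gamma + rho * orth
```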
Theorem 3.4 (Strong Separation of Objective). Suppose $\widehat{\Delta\mathcal{A}}$ and $\widehat{\Delta\mathcal{A}}^\star$ are distinct (local) optimal solutions to Problem 3.2 that satisfy first order necessary conditions. Then $\|\widehat{\Delta\mathcal{A}} - \widehat{\Delta\mathcal{A}}^\star\| \geq \sigma_{\min}(\widehat{\mathcal{A}})$, where $\sigma_{\min}(\cdot)$ is the smallest non-trivial singular value.

Proof. From the previously discussed necessary first order condition we have that there exist $U \in \mathbb{R}^{N\times R}$, $V \in \mathbb{R}^{R\times M}$ and $U^\star \in \mathbb{R}^{N\times R^\star}$, $V^\star \in \mathbb{R}^{R^\star\times M}$ ($R$ and $R^\star$ need not be the same) such that

$$\|\widehat{\Delta\mathcal{A}} - \widehat{\Delta\mathcal{A}}^\star\| = \|UV - U^\star V^\star\| = \|UU^T\widehat{\mathcal{A}} - U^\star U^{\star T}\widehat{\mathcal{A}}\|.$$

From this we obtain the sequence of lower bounds

$$\|UU^T\widehat{\mathcal{A}} - U^\star U^{\star T}\widehat{\mathcal{A}}\| \geq \|UU^T - U^\star U^{\star T}\|\,\sigma_{\min}(\widehat{\mathcal{A}}) = \|I - U^T U^\star U^{\star T} U\|\,\sigma_{\min}(\widehat{\mathcal{A}}) \geq \sigma_{\min}(\widehat{\mathcal{A}}).$$

The symmetric matrix $W = U^T U^\star U^{\star T} U$ is a product of matrices whose non-zero eigenvalues have magnitude 1. Symmetric matrices have real eigenvalues, and the non-zero eigenvalues of $W$ will be $\pm 1$, since $U$ and $U^\star$ are orthogonal. Thus $\|W\| = 1$. $W$ must have at least one negative or zero eigenvalue by the orthogonality assumption, since $W \neq I$. Since $W$ is symmetric, we can diagonalize $W$ as a matrix with eigenvalues $0$ or $\pm 1$, and so $\|I - W\| \geq 1$.

Corollary 3.5.
All locally optimal solutions satisfying first order necessary conditions are isolated modulo equivalence classes.

Proof. Suppose the contrary; that is, $(U, V)$ is a solution corresponding to $\widehat{\Delta\mathcal{A}}$ and $(U^\star, V^\star)$ is a solution corresponding to $\widehat{\Delta\mathcal{A}}^\star$. The objective function and constraints are locally Lipschitz continuous, so let $s > 0$ be a local Lipschitz constant with respect to $\|\cdot\|_F$ in some open neighborhood. If we take $0 < \varepsilon < \sigma_{\min}(\widehat{\mathcal{A}})/s$ and $\left\|\begin{pmatrix} U \\ V \end{pmatrix} - \begin{pmatrix} U^\star \\ V^\star \end{pmatrix}\right\|_F < \varepsilon$, then we have

$$\sigma_{\min}(\widehat{\mathcal{A}}) \leq \|\widehat{\Delta\mathcal{A}} - \widehat{\Delta\mathcal{A}}^\star\| \leq s\left\|\begin{pmatrix} U \\ V \end{pmatrix} - \begin{pmatrix} U^\star \\ V^\star \end{pmatrix}\right\|_F < \sigma_{\min}(\widehat{\mathcal{A}}),$$

which is a contradiction to Theorem 3.4.

Implicitly, the matrix $V$ parametrizes the kernel of $\widehat{\mathcal{A}}$. If we normalize the kernel of $\widehat{\mathcal{A}}$ to contain $\mathbb{R}$-embeddings of primitive kernel vectors, then the matrix $V$ can be made locally unique, although we do not employ this in the rank-factorization formulation directly.

Corollary 3.6.
Under a suitable choice of $\|\cdot\|$ we have that minimal solutions are separated. In particular, separation holds for $\|\cdot\|_2$.
4. An Iterative Algorithm for Lower Rank Approximation
In this section we propose an iterative algorithm to solve Problem 2.2, based on Newton's method for constrained optimization. Sufficient conditions for quadratic convergence are that second-order sufficiency holds (Wright, 2005) and local Lipschitz continuity of the objective and constraints. We ensure these conditions hold for non-degenerate problems by working on a restricted space of minimal $\mathbb{R}$-embeddings that removes degrees of freedom.

In order to compute a nearby rank $n - r$ approximation we want to solve the non-convex optimization problem

$$\min \|\Delta\mathcal{A}\|_F^2 \quad \text{subject to} \quad \begin{cases} (\mathcal{A} + \Delta\mathcal{A})\mathcal{B} = 0, \\ \mathrm{rank}(\mathcal{B}) = r, \end{cases} \qquad (4.1)$$

where $\mathcal{B} \in \mathbb{R}[t]^{n\times r}$. In the instance of (structured) scalar matrices the rank constraint can be enforced by ensuring that $\mathcal{B}$ has orthogonal columns or is in a column reduced echelon form (CREF). In the instance of matrix polynomials this is not sufficient, since polynomial multiples of the same vector will have linearly independent combined coefficient vectors. In order to apply these normalizations on the coefficient vectors of $\mathcal{B}$, we require that the columns be represented with a minimal number of equations with respect to $\mathcal{B}$.

Definition 4.1 (Minimal $\mathbb{R}$-Embedding). Suppose $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$ with $\mathbb{R}$-embedding $\widehat{\mathcal{A}}$. The vector $\mathbf{b} \in \mathbb{R}[t]^{n\times 1}$, with $\mathbb{R}$-embedding $\widehat{\mathbf{b}}$, is said to be minimally $\mathbb{R}$-embedded in $\widehat{\mathcal{A}}$ if $\ker \widehat{\mathcal{A}} = \langle \widehat{\mathbf{b}} \rangle$ (i.e., a dimension 1 subspace). We say that $\widehat{\mathbf{b}}$ is minimally degree $\mathbb{R}$-embedded in $\widehat{\mathcal{A}}$ if (1) $\widehat{\mathbf{b}}$ is minimally $\mathbb{R}$-embedded in $\widehat{\mathcal{A}}$ and (2) $\widehat{\mathbf{b}}$ corresponds to a primitive kernel vector $\mathbf{b}$, that is, $\gcd(\mathbf{b}_1, \ldots, \mathbf{b}_n) = 1$. (This normalization alone is not sufficient for rapid convergence.)

We note that this definition ensures minimally $\mathbb{R}$-embedded vectors are unique (up to a scaling factor), or that $(\widehat{\mathcal{A}}_j + \widehat{\Delta\mathcal{A}}_j)\widehat{\mathcal{B}}[*, j] = 0$ has a (locally) unique solution for fixed $\widehat{\Delta\mathcal{A}}$. In the minimal embedding, we will assume, without loss of generality, that redundant equations, or equations known in advance, such as $0 = 0$, $\Delta\widehat{\mathcal{A}}_{ij} = 0$ or $\widehat{\mathcal{B}}_{ij} = 0$ corresponding to known entries, are removed for some indices $i$ and $j$. Some of these trivial equations occur because of the CREF assumption, while others occur from over-estimating degrees of entries.

This allows us to reformulate $(\mathcal{A} + \Delta\mathcal{A})\mathcal{B} = 0$ as the (bi-linear) system of equations

$$\{(\widehat{\mathcal{A}}_j + \Delta\widehat{\mathcal{A}}_j)\widehat{\mathcal{B}}[*, j] = 0\}_{j=1}^{r}, \qquad (4.2)$$

where the $j$th column of $\mathcal{B}$ is minimally degree embedded in the system $(\widehat{\mathcal{A}}_j + \Delta\widehat{\mathcal{A}}_j)$. We also note that assuming $\mathcal{B}$ is in a column-reduced echelon form essentially requires us to guess the pivots of the optimal solution in advance, which is only possible with a good initial guess. The benefit of this approach is that if the pivots are not guessed correctly, we are still able to compute a rank $n - r$ approximation of $\mathcal{A}$.

In order to exclude trivial solutions, we can assume that the pivot elements of $\mathcal{B}$ have a norm bounded away from zero. Let $\mathcal{N}(\widehat{\mathbf{b}}_i)$ be a normalization vector such that $\mathcal{N}(\widehat{\mathbf{b}}_i)^T \widehat{\mathbf{b}}_i = 1$, which implies that the CREF pivots are bounded away from zero.
For example, take the pivot to have unit norm. Note that other normalization vectors are possible, such as $\mathcal{N}(\widehat{\mathbf{b}}_i) = \widehat{\mathbf{b}}_i$ (which corresponds to each column having a unit norm) if the initial guess is adequately close, or we could take the pivot element to be a monic polynomial. Of course there are several other permissible normalizations.

Define the matrix $\widehat{\mathcal{A}}_i$ to have the column $\widehat{\mathbf{b}}_i = \widehat{\mathcal{B}}[1..n, i]$ minimally degree embedded. We can express (4.2) in vector-matrix form as follows:

$$\begin{pmatrix} \widehat{\mathcal{A}}_1 + \widehat{\Delta\mathcal{A}}_1 & & \\ & \ddots & \\ & & \widehat{\mathcal{A}}_r + \widehat{\Delta\mathcal{A}}_r \\ \mathcal{N}(\widehat{\mathbf{b}}_1)^T & & \\ & \ddots & \\ & & \mathcal{N}(\widehat{\mathbf{b}}_r)^T \end{pmatrix} \begin{pmatrix} \widehat{\mathbf{b}}_1 \\ \vdots \\ \widehat{\mathbf{b}}_r \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \qquad (4.3)$$

has a (locally) unique solution for fixed $\Delta\mathcal{A}$. In order to solve (4.1) we will use the method of Lagrange multipliers (Bertsekas, 1999). Let $M(\Delta\mathcal{A}, \mathcal{B})$ be the vector of residuals corresponding to (4.3); then the Lagrangian is defined as

$$L = \|\Delta\mathcal{A}\|_F^2 + \lambda^T M(\Delta\mathcal{A}, \mathcal{B}), \qquad (4.4)$$

where $\lambda$ is a vector of Lagrange multipliers, one per equation of (4.3).
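For $r = 1$, the residual vector $M(\Delta\mathcal{A}, \mathcal{B})$ of (4.3) and the Lagrangian (4.4) are straightforward to assemble once the embeddings are available. A minimal sketch (our own notation: A_hat and Delta_hat are the $\mathbb{R}$-embeddings, b_hat the embedded kernel vector, N_vec the normalization vector $\mathcal{N}(\widehat{\mathbf{b}})$, and delta_coeffs the coefficient vector $\mathrm{vec}(\Delta\mathcal{A})$):

```python
# A sketch of the residuals of (4.3) and the Lagrangian (4.4) for r = 1.
import numpy as np

def residual(A_hat, Delta_hat, b_hat, N_vec):
    """Stacked residuals: kernel equations plus the normalization."""
    top = (A_hat + Delta_hat) @ b_hat     # (A + Delta A) b = 0
    bottom = N_vec @ b_hat - 1.0          # N(b)^T b = 1
    return np.append(top, bottom)

def lagrangian(delta_coeffs, A_hat, Delta_hat, b_hat, N_vec, lam):
    """L = ||Delta A||_F^2 + lambda^T M(Delta A, B); note the norm is taken
    over the polynomial coefficients, not the embedded matrix."""
    M = residual(A_hat, Delta_hat, b_hat, N_vec)
    return delta_coeffs @ delta_coeffs + lam @ M
```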
Definition 4.2. The vectorization of $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$ of degree at most $d$ is defined as

$$\mathrm{vec}(\mathcal{A}) = (\mathcal{A}_{1,1,0}, \ldots, \mathcal{A}_{1,1,d}, \mathcal{A}_{2,1,0}, \ldots, \mathcal{A}_{2,1,d}, \ldots, \mathcal{A}_{n,n,0}, \ldots, \mathcal{A}_{n,n,d})^T,$$

that is, $\mathrm{vec}(\mathcal{A})$ stacks the entry-wise coefficient vectors of each column on top of each other.

We will find it convenient to define $x = \begin{pmatrix} \mathrm{vec}(\Delta\mathcal{A}) \\ \mathrm{vec}(\mathcal{B}) \end{pmatrix}$ to be the combined vector of unknowns corresponding to $\Delta\mathcal{A}$ and $\mathcal{B}$. Let $\nabla_{xx} L$ denote the Hessian matrix of $L$ with respect to $x$, and let $J$ be the Jacobian of the residuals of the constraints, i.e., $J = \nabla_x M(\Delta\mathcal{A}, \mathcal{B})$. Necessary optimality conditions at a point $(x^*, \lambda^*)$ (Bertsekas, 1999) are that $\nabla L = 0$ and

$$\ker(J)^T\, \nabla_{xx} L\, \ker(J) \succeq 0. \qquad (4.5)$$

Sufficient conditions for optimality at the same point are that $\nabla L = 0$ and

$$\ker(J)^T\, \nabla_{xx} L\, \ker(J) \succ 0. \qquad (4.6)$$

These conditions are known as the second-order sufficiency conditions (Bertsekas, 1999). We note that (4.6) implies that minimal solutions are locally unique, and it will fail to hold if minimal solutions are not locally unique. The idea is to show that (4.6) holds in the minimal embedding, which allows us to construct an algorithm with rapid local convergence.
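Conditions (4.5) and (4.6) can be checked numerically at a candidate point by computing an orthonormal basis $Z$ of $\ker(J)$ from the SVD and examining the eigenvalues of the projected Hessian $Z^T (\nabla_{xx} L) Z$. A minimal sketch, assuming $J$ and $\nabla_{xx} L$ are available as dense arrays (our own illustration):

```python
# Numerical check of the second-order sufficiency condition (4.6).
import numpy as np

def kernel_basis(J, tol=1e-12):
    """Columns spanning ker(J), from the right singular vectors of J."""
    _, s, Vt = np.linalg.svd(J)
    rank = int(np.sum(s > tol))          # numerical rank, absolute tolerance
    return Vt[rank:].T

def second_order_sufficient(J, H):
    """True if Z^T H Z is positive definite on ker(J), i.e. (4.6) holds."""
    Z = kernel_basis(J)
    if Z.shape[1] == 0:                  # trivial kernel: holds vacuously
        return True
    eigvals = np.linalg.eigvalsh(Z.T @ H @ Z)
    return bool(eigvals.min() > 0)
```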
Definition 4.3. The matrix $\psi(\widehat{\mathbf{b}})$ is an alternative form of $(\widehat{\mathcal{A}} + \widehat{\Delta\mathcal{A}})\widehat{\mathbf{b}} = 0$ that satisfies $\psi(\widehat{\mathbf{b}})\,\mathrm{vec}(\mathcal{A} + \Delta\mathcal{A}) = 0$. That is, $\psi(\widehat{\mathbf{b}})$ satisfies

$$\psi(\widehat{\mathbf{b}}) \cdot \mathrm{vec}(\mathcal{A} + \Delta\mathcal{A}) = 0 \iff (\widehat{\mathcal{A}} + \widehat{\Delta\mathcal{A}})\widehat{\mathbf{b}} = 0.$$
We will adopt the notation that $\psi(\widehat{\mathbf{b}}_i)$ corresponds to $\psi(\widehat{\mathbf{b}}_i)\,\mathrm{vec}(\mathcal{A}_i + \Delta\mathcal{A}_i) = 0$. Here we use the bi-linearity of (4.3) to write the same system using a matrix with entries from $\widehat{\mathcal{B}}$ instead of $\mathrm{vec}(\mathcal{A} + \Delta\mathcal{A})$.

The closed-form expression for the Jacobian of the residuals (up to permutation) in (4.3) is given by

$$J = \begin{pmatrix} \psi(\widehat{\mathbf{b}}_1) & \widehat{\mathcal{A}}_1 + \widehat{\Delta\mathcal{A}}_1 & & \\ \psi(\widehat{\mathbf{b}}_2) & & \widehat{\mathcal{A}}_2 + \widehat{\Delta\mathcal{A}}_2 & \\ \vdots & & & \ddots \\ \psi(\widehat{\mathbf{b}}_r) & & & \widehat{\mathcal{A}}_r + \widehat{\Delta\mathcal{A}}_r \\ 0 & \mathcal{N}(\widehat{\mathbf{b}}_1)^T & & \\ \vdots & & \ddots & \\ 0 & & & \mathcal{N}(\widehat{\mathbf{b}}_r)^T \end{pmatrix}. \qquad (4.7)$$

Unlike the case of a single kernel vector in (Giesbrecht et al., 2017), $J$ may be rank deficient, since some equations corresponding to low (high) index entries may be redundant at the solution. The Lagrange multipliers will not be unique in this particular scenario, and the rate of convergence may degrade if Newton's method is used. In the instance of $r = 1$ we present the following result (Giesbrecht et al., 2017).

Theorem 4.4.
Suppose that $r = 1$ and $\widehat{\mathbf{b}}$ is minimally degree $\mathbb{R}$-embedded in $\widehat{\mathcal{A}}$; then $J$ has full rank when (4.5) holds.

Proof. We show that $J$ has full row rank by contradiction. If this matrix were rank deficient, then one row would be a linear combination of the others. This means that one of the equations in the constraints is trivial, or the solution is not regular (see (Bertsekas, 1999, Section 3.1)). As we are only concerned with regular solutions, this contradicts the minimal degree $\mathbb{R}$-embedding.

The corollary to this is that in the minimal embedding regularity conditions hold, and it is straightforward to obtain rapid local convergence. The Hessian matrix $\nabla^2 L$ is straightforward to compute as

$$\nabla^2 L = \begin{pmatrix} \nabla_{xx} L & J^T \\ J & 0 \end{pmatrix}.$$

The following theorem shows that second-order sufficiency holds for the instance of $r = 1$; the case of $r > 1$ is handled in Corollary 4.6.

Theorem 4.5 (Second Order Sufficiency Holds). Suppose that $\widehat{\mathcal{A}} + \widehat{\Delta\mathcal{A}}$ has a minimally degree $\mathbb{R}$-embedded kernel vector $\widehat{\mathbf{b}}$, i.e., $r = 1$ in (4.4); then at a minimal solution, the second order sufficiency condition (4.6) holds in the minimal embedding of $\widehat{\mathbf{b}}$.

Proof. If $\|\Delta\mathcal{A}\| = 0$ at the local minimizer $(x^\star, \lambda^\star)$ then

$$\nabla_{xx} L(x^\star, \lambda^\star) = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad K = \ker \nabla_{xx} L(x^\star, \lambda^\star) = \mathrm{span}\begin{pmatrix} 0 \\ I \end{pmatrix}.$$

We have that, for $y \in \mathrm{span}(K)$, $Jy = 0$ implies that $\widehat{\mathcal{A}} y = 0$ and $\mathcal{N}(\widehat{\mathbf{b}})^T y = 0$. It follows that $\ker \widehat{\mathcal{A}} = \mathrm{span}(\widehat{\mathbf{b}})$; thus we have $y = \widehat{\mathbf{b}}$ or $y = 0$ via the minimal degree $\mathbb{R}$-embedding, and so $y = 0$, as $\widehat{\mathbf{b}} \notin \mathrm{span}(K)$. Hence second-order sufficiency holds, as $\ker J \cap K = 0$.

If $\|\Delta\mathcal{A}\| \neq 0$ then we have that

$$\nabla_{xx} L(x^\star, \lambda^\star) = \underbrace{\begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}}_{\mathcal{H}} + \underbrace{\begin{pmatrix} 0 & E^T \\ E & 0 \end{pmatrix}}_{\mathcal{E}}.$$

The matrix $\mathcal{E}$ is linear in $\lambda$; however, the precise tensor decomposition is irrelevant to the proof. If $E$ has full rank, then $\nabla_{xx} L$ has full rank and we are done, so suppose that $E$ is rank deficient. If $E$ is rank deficient, then one can eliminate a row of $E$ and a column of $E^T$ without affecting $\mathcal{H}$ via symmetric row and column updates. We observe that $\ker(\mathcal{H} + \mathcal{E}) \subseteq \ker \mathcal{H}$, and the result follows.

Corollary 4.6.
Suppose that $r > 1$ in (4.4) and $\mathcal{B}$ is minimally degree embedded; then second-order sufficiency (4.6) holds.

Proof. The proof is almost the same as that of Theorem 4.5, and follows by induction on $r$, since each block is decoupled.

We now have all of the ingredients for an iterative method with rapid local convergence. Newton's method for equality constrained minimization problems can be interpreted as solving the non-linear system of equations $\nabla L = 0$. Newton's method is based on the iterative update scheme

$$\begin{pmatrix} x_{k+1} \\ \lambda_{k+1} \end{pmatrix} = \begin{pmatrix} x_k + \Delta x_k \\ \lambda_k + \Delta\lambda_k \end{pmatrix} \quad \text{such that} \quad \nabla^2 L \begin{pmatrix} \Delta x \\ \Delta\lambda \end{pmatrix} = -\nabla L. \qquad (4.8)$$

If $r = 1$ then $\nabla^2 L$ has full rank and the iteration is well defined by matrix inversion. If $r > 1$ we use the regularized iteration

$$\begin{pmatrix} x_{k+1} \\ \lambda_{k+1} \end{pmatrix} = \begin{pmatrix} x_k + \Delta x_k \\ \lambda_k + \Delta\lambda_k \end{pmatrix} \quad \text{such that} \quad \begin{pmatrix} \nabla_{xx} L & J^T \\ J & -\mu_k I \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta\lambda \end{pmatrix} = -\nabla L \qquad (4.9)$$

for a suitably chosen parameter $\mu_k$. Taking $\mu_k = \|\nabla L(x_k, \lambda_k)\|$, one has provably quadratic convergence (Wright, 2005, Theorem 4.2) with $x_k$ and $\lambda_k$ chosen sufficiently close to the optimal solution.

Theorem 4.7.
The iteration (4.9) converges quadratically to $(x^\star, \lambda^\star)$ if $(x_0, \lambda_0)$ are chosen sufficiently close to $(x^\star, \lambda^\star)$.

We now have a method to compute a nearby rank deficient matrix polynomial with a quadratic rate of convergence, provided that the initial values of $x$ are chosen sufficiently close to the optimal solution.
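The update (4.9) is straightforward to realize once $\nabla L$, $\nabla_{xx} L$ and $J$ are available. The following sketch (our own driver, not the authors' Maple implementation; grad_L, hess_xx_L and jacobian are assumed callbacks evaluated at the current point) implements the regularized Newton step with $\mu_k = \|\nabla L(x_k, \lambda_k)\|$:

```python
# A sketch of the regularized Newton iteration (4.9).
import numpy as np

def newton_refine(x, lam, grad_L, hess_xx_L, jacobian,
                  tol=1e-14, max_iter=50):
    for _ in range(max_iter):
        g = grad_L(x, lam)                # gradient of L in (x, lambda)
        if np.linalg.norm(g) < tol:
            break
        H, J = hess_xx_L(x, lam), jacobian(x, lam)
        mu_k = np.linalg.norm(g)          # regularization, cf. (4.9)
        KKT = np.block([[H, J.T],
                        [J, -mu_k * np.eye(J.shape[0])]])
        step = np.linalg.solve(KKT, -g)
        x, lam = x + step[:x.size], lam + step[x.size:]
    return x, lam
```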
5. Implementation, Description and Comparison
In this section we discuss implementation details and demonstrate our implementation for computing the nearest rank deficient matrix polynomial. All algorithms are implemented in Maple 2016. All experiments are done using quad precision floating point arithmetic, with about 35 decimal digits of accuracy. We compare some degree one examples to the recent results of (Guglielmi et al., 2017).

To compute an approximate kernel vector, we first use the SVD to compute an approximate kernel of an $\mathbb{R}$-embedded (nearly) rank deficient matrix polynomial. Next we use a structured orthogonal elimination RQ (LQ) decomposition to produce a minimally (degree) $\mathbb{R}$-embedded vector from the kernel. In the case of several kernel vectors we use a modified Gaussian elimination on an embedding of an approximate kernel obtained by the SVD, together with approximate GCD computations, to find nearby approximate kernel vectors that are primitive.

We now formally describe an algorithm for computing the nearest matrix polynomial of a prescribed rank. The algorithm has no global convergence guarantees; however, a globally convergent (although not necessarily optimal) algorithm can be developed in a straightforward manner by augmenting our second order algorithm with a first order one, and removing content from kernel vectors if necessary. The size of $\nabla^2 L$ is $O(rn^2d)$, and accordingly each iteration has a cost of $O(r^3n^6d^3)$ flops using standard matrix multiplication, where $r$ is the dimension of the kernel.
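The SVD-based initial guess described above amounts to a single SVD of the $\mathbb{R}$-embedding. A minimal sketch of this step (our own illustration; the subsequent RQ/LQ reduction to a minimally degree $\mathbb{R}$-embedded vector is not shown):

```python
# Initial kernel guess from the smallest right singular vector of A_hat.
import numpy as np

def initial_kernel_guess(A_hat, n, mu):
    """Approximate R-embedded kernel vector, reshaped so that row i holds
    the mu coefficients of the polynomial entry b_i(t)."""
    _, _, Vt = np.linalg.svd(A_hat)
    b_hat = Vt[-1]                   # right singular vector for sigma_min
    return b_hat.reshape(n, mu)
```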
Algorithm 1: Iterative Kernel Post-Refinement

Require:
• Full rank matrix polynomial $\mathcal{A} \in \mathbb{R}[t]^{n\times n}$
• (Approximately) rank deficient matrix polynomial $\mathcal{C} \in \mathbb{R}[t]^{n\times n}$
• Approximate kernel vectors $\mathbf{c}_1, \ldots, \mathbf{c}_r \in \mathbb{R}[t]^{n\times 1}$ of the desired degree/displacement structure
• Displacement structure matrix $\Delta\mathcal{A}$ to optimize over
Ensure:
• Singular matrix $\mathcal{A} + \Delta\mathcal{A}$ with $\mathcal{B} \subset \ker(\mathcal{A} + \Delta\mathcal{A})$, or an indication of failure.

1. $\mathbb{R}$-embed $\mathcal{A}$, $\mathcal{C}$, $\mathbf{c}_1, \ldots, \mathbf{c}_r$ and $\Delta\mathcal{A}$.
2. Compute the Lagrangian $L$ from Section 4.2.
3. Initialize $\lambda$ via linear least squares from $\nabla L|_x = 0$.
4. Compute $\begin{pmatrix} x + \Delta x \\ \lambda + \Delta\lambda \end{pmatrix}$ by solving (4.9) until $\left\|\begin{pmatrix} \Delta x \\ \Delta\lambda \end{pmatrix}\right\|$ is sufficiently small or divergence is detected.
5. Return the locally optimal $\Delta\mathcal{A}$ and $\mathcal{B}$, or an indication of failure.
In this section we consider Examples 2.10, 2.11 and 2.12 from Guglielmi et al. (2017), where we compare our results to real perturbations. Note that complex perturbations are a straightforward generalization of the theory presented here, and can be re-formulated as a problem over $\mathbb{R}$.

The technique of Guglielmi et al. (2017) poses computing a nearby rank-deficient linear matrix pencil by verifying that sufficiently many images of the matrix polynomial are singular, so that $\det(\mathcal{A} + \Delta\mathcal{A}) \equiv 0$.
The problem is then posed as a solution to a system of Ordinary Differential Equations (ODEs), assuming that certain genericity conditions on the eigenvalues of the solution hold. (Our algorithm and convergence theory do not explicitly rely on genericity assumptions or other properties of eigenvalues; however, we do exploit generic properties in formulating initial guesses.) They consider the instances of computing $A_0$ and $A_1$ with a common kernel vector, and the instance where $A_0$ and $A_1$ do not have a common kernel. Additionally, perturbations affecting only one of $A_0$ and $A_1$ are considered. We note that the solutions to the ODEs do not necessarily satisfy the necessary optimality conditions (4.5), and accordingly will generally not be local minimizers.

5.2.1. Nearest Affinely Structured Examples I

Consider first the matrix polynomial $\mathcal{A} = A_1 t + A_0$, with the $3\times 3$ coefficient matrices $A_1$ and $A_0$ [entries omitted] coming from Examples 2.10 and 2.12 of Guglielmi et al. (2017).

Example 5.1.
If we assume that $A_0$ is constant, then this is finding the (locally) nearest matrix polynomial with an affine structure, since $A_0$ has non-zero fixed constants. First let us assume that zero entries are preserved; this is a linear structure on $A_1$.

To compute an initial guess for $\mathbf{b}$ we use the SVD on $\widehat{\mathcal{A}}$ and extract a guess from the smallest singular vector. This gives us $\mathbf{b}_{\mathrm{init}}$, a vector of degree-one polynomial entries [entries omitted]. For an initial guess on $\mathcal{A}$ we take $\mathcal{A}_{\mathrm{init}} = \mathcal{A}$. Note that we do not need an initial guess that is singular; it just needs to be "sufficiently close" to a singular matrix polynomial.

If we do not allow perturbations to zero-coefficients, that is, the zero entries $A_1[1,\cdot]$ and $A_1[2,\cdot]$ may not be perturbed, then after five iterations of plain Newton's method (see (Giesbrecht et al., 2017)) we compute $\Delta A_1$ [entries omitted] with perturbation $\|\Delta\mathcal{A}\|_F \approx$ [value omitted]. A corresponding (approximate) kernel vector is $\mathbf{b} \approx$ [entries omitted].

Example 5.2.
If we allow perturbations to zero-coefficients in $A_1$, then after five rounds of plain Newton's method we compute $\Delta A_1$ [entries omitted] with perturbation $\|\Delta\mathcal{A}\|_F \approx$ [value omitted], which is a marginal improvement over the previous example. A corresponding approximate kernel vector is $\mathbf{b} \approx$ [entries omitted].

Guglielmi et al. (2017) report an upper bound on the distance to singularity allowing complex perturbations, that is $\Delta\mathcal{A} \in \mathbb{C}[t]^{n\times n}$, of $\|\Delta^{\mathbb{C}}\mathcal{A}\|_F \approx$ [value omitted], and for real perturbations, $\|\Delta^{\mathbb{R}}\mathcal{A}\|_F \approx$ [value omitted].

If we allow perturbations to both $A_0$ and $A_1$, then this is some form of finding the nearest rank deficient matrix polynomial. The question is whether to allow degree or support preserving perturbations. Again, we will use the same initial guesses as the previous example. Matrix degree preserving perturbations are of the form

$$\Delta^{\deg}\mathcal{A} = \begin{pmatrix} tA_{1,1,1}+A_{1,1,0} & tA_{1,2,1}+A_{1,2,0} & tA_{1,3,1}+A_{1,3,0} \\ tA_{2,1,1}+A_{2,1,0} & tA_{2,2,1}+A_{2,2,0} & tA_{2,3,1}+A_{2,3,0} \\ tA_{3,1,1}+A_{3,1,0} & tA_{3,2,1}+A_{3,2,0} & tA_{3,3,1}+A_{3,3,0} \end{pmatrix},$$

whereas support preserving perturbations have the same support as $\mathcal{A}$, so that entries and terms which are zero in $\mathcal{A}$ remain zero in $\Delta^{\sup}\mathcal{A}$ [explicit form omitted].

Example 5.3.
In the instance of degree preserving perturbations we compute, after five iterations of Newton's method, $\Delta^{\deg}\mathcal{A}$ [entries omitted] with $\|\Delta^{\deg}\mathcal{A}\| \approx$ [value omitted]. A corresponding approximate kernel vector is $\mathbf{b} \approx$ [entries omitted].

Example 5.4. In the instance of support preserving perturbations we compute, after five iterations of Newton's method, $\Delta^{\sup}\mathcal{A}$ [entries omitted] with $\|\Delta^{\sup}\mathcal{A}\| \approx$ [value omitted]. A corresponding approximate kernel vector is $\mathbf{b} \approx$ [entries omitted].

Guglielmi et al. (2017) report an upper bound on the distance to singularity of $\|\Delta^{\deg}\mathcal{A}\|_F \approx$ [value omitted].

Example 5.5.
Next we consider the matrix polynomial $\mathcal{A}$ in Example 2.11 of (Guglielmi et al., 2017), defined as $\mathcal{A} = A_0 + A_1 t$, where $A_0, A_1 \in \mathbb{R}^{3\times 3}$ are as given there [entries omitted]. To compute an initial guess we take $\mathcal{A}_{\mathrm{init}} = \mathcal{A}$ and take $\mathbf{b}_{\mathrm{init}}$, a vector of degree-two polynomial entries [entries omitted], computed from the smallest singular vector of $\widehat{\mathcal{A}}$.

We note that this initial guess does not attempt to find a nearby singular matrix polynomial; all that is needed is that $\nabla L(x_{\mathrm{init}}, \lambda_{\mathrm{init}})$ is reasonably small to obtain convergence. Using a globalized variant of Newton's method based on Levenberg-Marquardt, we compute $\Delta\mathcal{A}$ [entries omitted] with $\|\Delta\mathcal{A}\|_F \approx$ [value omitted]. The corresponding approximate kernel vector is $\mathbf{b}$ [entries omitted].

If we use the result of (Guglielmi et al., 2017) as the initial guess, then we compute a $\mathbf{b}_{\mathrm{init}}$ [entries omitted] whose entries we will assume are of degree at most two. After five iterations of Newton's method we obtain $\Delta\mathcal{A}$ [entries omitted] with $\|\Delta\mathcal{A}\| \approx$ [value omitted]. The corresponding approximate kernel vector is $\mathbf{b}$ [entries omitted]. The previously noted small quadratic terms were at roughly machine precision (the computation is done with 35 digits of precision) and truncated.
Guglielmi et al. (2017) obtain a result on this past example that produces an upper bound on the distance to singularity of [value omitted].

5.2.2. Lower Rank Approximation of a 4 × 4 Matrix Polynomial
In the following example we consider computing a lower-rank approximation to a given matrix polynomial. Consider the $4 \times 4$ matrix polynomial $\mathcal{A}$, defined as $\mathcal{A} = A_0 + A_1 t + A_2 t^2 + A_3 t^3$, where $A_0, A_1, A_2, A_3 \in \mathbb{R}^{4\times 4}$ [entries omitted].

Example 5.6.
We will consider a displacement structure on the kernel as well in this example, where higher-order zero terms are not perturbed from the initial guess. For the entries of $\Delta\mathcal{A}$ we preserve higher-order zero terms, and allow low-order terms to be perturbed. This is a linearly structured problem in both the main variable $\Delta\mathcal{A}$ and the auxiliary kernel variable $\mathcal{B}$.

To ensure the rank constraint holds, we will additionally assume that the kernel $\widehat{\mathcal{B}}$ is in a CREF (while $\mathcal{B}$ is obviously not) and that the columns have unit norm. This normalization is (locally) equivalent to the ones discussed in Section 4.2. Having $\widehat{\mathcal{B}}$ in a CREF ensures that the two kernel vectors remain locally linearly independent during the iteration. Of course, perturbing both pivots to zero is possible (although this is sub-optimal); in such a scenario linear independence can no longer be guaranteed, and the iteration would need to be re-initialized.

For the initial guess we use $\mathcal{A}_{\mathrm{init}} = \mathcal{A}$ and take $\mathcal{B}_{\mathrm{init}}$ as a $4\times 2$ matrix of degree-one polynomial entries [entries omitted]. Using Algorithm 1 we compute, after nine iterations, $\Delta A_0$, $\Delta A_1$, $\Delta A_2$ and $\Delta A_3$ [entries omitted] with $\|\Delta\mathcal{A}\| \approx$ [value omitted]. An approximate kernel $\mathcal{B}$ is also obtained [entries omitted].

A natural question is what happens if we change the displacement structure on the kernel. To investigate this behavior, we consider an equivalent representation of the previously used kernel, except that $\mathcal{B}$ is in a CREF directly.
Example 5.7.
If we change the kernel $\mathcal{B}_{\mathrm{init}}$ to be the CREF representation [entries omitted] of the kernel used in the initialization of the previous example, then we compute a perturbation with $\|\Delta\mathcal{A}\| \approx$ [value omitted].

In either case, we obtain comparable answers that are a reasonable lower-rank approximation, and these can likely be improved by relaxing restrictions on the displacement structure on $\mathcal{B}$ or $\widehat{\mathcal{B}}$. It is important to note that relaxing the degree bounds to be $(n-r)d$ in general on all non-zero entries (where entries are zero if they are in the same row as a CREF pivot) will likely lead to a better approximation; however, one may lose quadratic convergence in doing so, since iterates may no longer have primitive kernel vectors, and (4.6) will no longer hold. As discussed in Section 4, it is generally difficult to determine the CREF pivots of the kernel unless the initial guess is very accurate.

The structure of the kernel is an important consideration when deciding upon an initial guess. It is preferable to restrict fewer coefficients; however, the iteration then requires a better initialization due to the increased number of possible descent directions. In such scenarios, for maximum flexibility, a globalized variant of Newton's method is required. Likewise, the structure for $\Delta\mathcal{A}$ is also an important choice. Restricting which terms can be changed has a large influence on the (approximate) distance to singularity (of prescribed kernel dimension).

Another way to approach the lower-rank approximation problem is to use alternating projections or alternating directions of descent (since the objective is bi-linear with bi-linear constraints, it is convex in each argument) on the rank factorization in Section 3. Since solutions in one coordinate, $\Delta\mathcal{A}$, are isolated, one can expect linear convergence with a reasonable algorithm. The lack of normalization required overcomes the difficulty of choosing a suitable kernel displacement structure; however, convergence would be linear at best, and determining the dimensions of $U$ and $V$ is another problem to be addressed. It is also worth noting that Algorithm 1 requires more computational resources per iteration as $r$ increases, whereas a rank factorization requires fewer computational resources per iteration as $r$ increases.
6. Conclusions and Future Work
We have shown that finding lower-rank approximations of matrix polynomials can be established as a numerically well-posed problem, and that it is amenable to first and second order optimization methods. The existence and isolation of solutions is established, along with an algorithm exploiting affine structures to obtain locally quadratic convergence under mild normalization assumptions.

Along with considering the lower-rank approximation of matrix polynomials, we present a generalization of the theory to matrix polynomials with an arbitrary affine structure. We provide examples of how the structure of permissible perturbations and the prescribed kernel structure impact the distance to solutions.

We also regard this current paper as a first step towards a formally robust approach to non-linear matrix polynomials, in the spirit of recent work on symbolic-numeric algorithms for polynomials. Problems such as approximate matrix polynomial division, GCRD and factorization all have applications which can benefit from these modern tools.
References
Abatzoglou, T., Mendel, J., Harada, G., 1991. The constrained total least squares technique and its applications to harmonic superresolution. IEEE Transactions on Signal Processing 39 (5), 1070–1087.

Absil, P.-A., Mahony, R., Sepulchre, R., 2009. Optimization algorithms on matrix manifolds. Princeton University Press.

Beckermann, B., Labahn, G., 1998a. A fast and numerically stable Euclidean-like algorithm for detecting relatively prime numerical polynomials. Journal of Symbolic Computation 26 (6), 691–714.

Beckermann, B., Labahn, G., 1998b. When are two numerical polynomials relatively prime? Journal of Symbolic Computation 26, 677–689.

Beckermann, B., Labahn, G., Villard, G., 2006. Normal forms for general polynomial matrices. Journal of Symbolic Computation 41 (6), 708–737.

Bertsekas, D., 1999. Nonlinear programming. Athena Scientific.

Botting, B., Giesbrecht, M., May, J., 2005. Using the Riemannian SVD for problems in approximate algebra. In: Proc. Workshop on Symbolic-Numeric Computation. pp. 209–219.

Braatz, R. P., Young, P. M., Doyle, J. C., Morari, M., 1994. Computational complexity of μ calculation. IEEE Transactions on Automatic Control 39 (5), 1000–1002.

Byers, R., He, C., Mehrmann, V., 1998. Where is the nearest non-regular pencil? Linear Algebra and its Applications 285 (1), 81–105.

Byers, R., Nichols, N., 1993. On the stability radius of a generalized state-space system. Linear Algebra and its Applications 188, 113–134.

Cadzow, J., 1988. Signal enhancement – a composite property mapping algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing 36 (1), 49–62.

Corless, R. M., Gianni, P., Trager, B., Watt, S., 1995. The singular value decomposition for polynomial systems. In: Proc. International Symposium on Symbolic and Algebraic Computation (ISSAC). pp. 96–103.

De Moor, B., 1993. Structured total least squares and L2