Implementations of efficient univariate polynomial matrix algorithms and application to bivariate resultants
Seung Gyu Hyun
University of Waterloo
Waterloo, ON, Canada
Vincent Neiger
Univ. Limoges, CNRS, XLIM, UMR 7252
F-87000 Limoges, France
Éric Schost
University of Waterloo
Waterloo, ON, Canada
Abstract
Complexity bounds for many problems on matrices with univariate polynomial entries have been improved in the last few years. Still, for most related algorithms, efficient implementations are not available, which leaves open the question of the practical impact of these algorithms, e.g. on applications such as decoding some error-correcting codes and solving polynomial systems or structured linear systems. In this paper, we discuss implementation aspects for most fundamental operations: multiplication, truncated inversion, approximants, interpolants, kernels, linear system solving, determinant, and basis reduction. We focus on prime fields with a word-size modulus, relying on Shoup's C++ library NTL. Combining these new tools to implement variants of Villard's algorithm for the resultant of generic bivariate polynomials (ISSAC 2018), we get better performance than the state of the art for large parameters.
Keywords
Polynomial matrices, algorithms, implementation, resultant.
ACM Reference Format:
Seung Gyu Hyun, Vincent Neiger, and Éric Schost. 2019. Implementations of efficient univariate polynomial matrix algorithms and application to bivariate resultants. In International Symposium on Symbolic and Algebraic Computation (ISSAC '19), July 15–18, 2019, Beijing, China. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3326229.3326272
Hereafter, K is a field and K[x] is the algebra of univariate polynomials over K. Recent years have witnessed a host of activity on fast algorithms for polynomial matrices and their applications:
• Minimal approximant bases [15, 53] were used to compute kernel bases [54], giving the first efficient deterministic algorithm for linear system solving over K[x].
• Basis reduction [15, 16] played a key role in accelerating the decoding of one-point Hermitian codes [35] and in designing deterministic determinant and Hermite form algorithms [29].
• Progress on minimal interpolant bases [23, 24] led to the best known complexity bound for list-decoding Reed-Solomon codes and folded Reed-Solomon codes [24, Sec. 2.4 to 2.7].
• Coppersmith's block Wiedemann algorithm and its extensions [7, 26, 48] were used in a variety of contexts, from integer factorization [44] to polynomial system solving [22, 49].
At the core of these improvements, one also finds techniques such as high-order lifting [41] and partial linearization [42], [16, Sec. 6]. For many of these operations, no implementation of the latest algorithms is available and no experimental evidence has been given regarding their practical behavior. Our goal is to partly remedy this issue, by providing and discussing implementations for a core of fundamental algorithms such as multiplication, approximant and interpolant bases, etc., upon which one may implement higher level algorithms. As an illustration, we describe the performance of slightly modified versions of Villard's recent breakthroughs on bivariate resultant and characteristic polynomial computation [49].

Our implementation is based on Shoup's Number Theory Library (NTL) [40], and is dedicated to polynomial matrix arithmetic over K = F_p for a word-size prime p. Particular attention was paid to performance issues, so that our library compares favorably with previous work for those operations where comparisons were possible. Our code is available at https://github.com/vneiger/pml.
Overview. Polynomial matrix algorithms rely on efficient arithmetic in K[x] and for matrices over K; in Section 2, we review some related algorithms and their NTL implementations. Then, we describe our implementation of a key building block: multiplication. Section 3 presents the next major part of our work, concerning algorithms for approximant bases [2, 15, 25, 53] and interpolant bases [3, 23, 24, 47]. We focus on a version of interpolants which is less general than in these references but allows for a more efficient algorithm. In particular, we show that with this version, both interpolant and approximant bases can be used interchangeably in several contexts, with interpolants sometimes achieving better performance than approximants. In Section 4, we discuss algorithms for minimal kernel bases, linear system solving, determinant, and basis reduction. Finally, using these tools, we study the practical behavior of the bivariate resultant algorithm of [49] (Section 5).

Below, cost bounds are given in an algebraic complexity model, counting all operations in the base field at unit cost. While standard, this point of view fails to describe parts of the implementation (CRT-based algorithms, such as the 3-primes FFT, cannot be described in such a manner), but we believe that this is a minor issue.
Implementation choices. NTL is a C++ library for polynomial and matrix arithmetic over rings such as Z, Z/nZ, etc., and is often seen as a reference point for fast implementations in such contexts. Other libraries for these operations include for example FLINT [18] as well as FFLAS-FFPACK and LinBox [45, 46]. Currently, our implementation relies solely on NTL; this choice was based on comparisons of performance for the functionalities we need.

In our implementation, the base field is a prime finite field F_p; we rely on NTL's lzz_p class. At the time of writing, on standard x86_64 platforms, NTL v11.3.1 uses unsigned long's as its primary data type for lzz_p, supporting moduli up to 60 bits long. For such fields, one can directly compare running times and cost bounds, since in the literature most polynomial matrix algorithms are analyzed in the algebraic complexity model. Besides, computations over F_p are at the core of a general approach consisting in solving problems over Z or Q by means of reduction modulo sufficiently many primes, which are chosen so as to satisfy several, partly conflicting, objectives. We may want them to support Fourier transforms of high orders. Linear algebra modulo each prime should be fast, so we may wish them to be small enough to support vectorized matrix arithmetic with SIMD instructions. On the other hand, using larger primes allows one to use fewer of them, and reduces the likelihood of unlucky choices in randomized algorithms.

As a result, while all NTL lzz_p moduli are supported, our implementation puts an emphasis on three families: small FFT primes that support AVX-based matrix multiplication (such primes have at most 23 bits); arbitrary size FFT primes (at most 60 bits); arbitrary moduli (at most 60 bits). Very small fields such as F_2 or F_3 are supported, but we did not make any specific optimization for them.
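To fix ideas, here is a minimal NTL setup for the three families of moduli just described; the specific primes below are illustrative choices of ours, not values mandated by our library.

```cpp
#include <NTL/lzz_p.h>
using namespace NTL;

int main()
{
    // One of NTL's internal FFT primes (the 0-th one).
    zz_p::FFTInit(0);
    // A user-chosen FFT prime, e.g. 3*2^18 + 1 (hypothetical choice):
    // zz_p::UserFFTInit(786433);
    // An arbitrary word-size modulus (at most 60 bits):
    // zz_p::init(1000000007);
    zz_p a = random_zz_p(), b = random_zz_p();
    zz_p c = a * b;   // arithmetic modulo the current modulus
    return 0;
}
```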
Experiments. All runtimes below are in seconds and were measured on an Intel Core i7-4790 CPU with 32GB RAM, using version 11.3.1 of NTL. Unless specified otherwise, timings are obtained modulo a random 60 bit prime. Runtimes were measured on a single thread; currently, most parts of our code do not explicitly exploit multi-threading. Tables below only show a few selected timings, with the best time(s) in bold; for more timings, see https://github.com/vneiger/pml/tree/master/benchmarks.
We review basic algorithms for polynomials and matrices, and related complexity results that hold over an abstract field K, and we describe how we implemented these operations. Hereafter, for d ≥ 0, K[x]_d is the set of elements of K[x] of degree less than d.

Multiplication in K[x] and the Fast Fourier Transform (FFT) are cornerstones of most algorithms in this paper. Let M : N → N be a function such that polynomials of degree at most d in K[x] can be multiplied in M(d) operations in K. If K supports FFT, we can take M(d) ∈ O(d log(d)), and otherwise, M(d) ∈ O(d log(d) log log(d)) [11, Chapter 8]; as in this reference, we assume that d ↦ M(d)/d is increasing. A useful variant of multiplication is the middle product [5, 17]: for integers c and d, and F in K[x]_c and G in K[x]_{c+d}, MiddleProduct(F, G, c, d) returns the slice of the product FG with coefficients of degrees c, ..., c + d − 1; the main case of interest here is c = d. The direct approach computes the whole product and extracts the slice. Yet, the transposition principle [27] yields a more efficient approach, saving a constant factor (roughly a factor 2 when c = d, if FFT multiplication is used).

Polynomial matrix algorithms frequently use fast evaluation and interpolation at multiple points. In general, subproduct tree techniques [11, Chapter 10] allow one to do evaluation and interpolation of polynomials in K[x]_d at d points in O(M(d) log(d)) operations. For special sets of points, one can do better: if we know α in K of order at least d, then evaluation and interpolation at the geometric progression (1, α, ..., α^{d−1}) can both be done in time O(M(d)) [6].

In NTL, multiplication in F_p[x] uses either naive, Karatsuba, or FFT techniques, depending on p and on the degree (NTL provides FFT primes with roots of unity of large power-of-two order, and supports arbitrary user-chosen FFT primes). FFT multiplication uses the TFT algorithm of [19] and Harvey's improvements on arithmetic mod p [20]. For primes p that do not support Fourier transforms, multiplication is done by means of either 3-primes FFT techniques [11, Chapter 8] or Schönhage and Strassen's algorithm. We implemented middle products for naive, Karatsuba and FFT multiplication, closely following [5, 17], as well as evaluation/interpolation algorithms for general sets of points and for geometric progressions.
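As an illustration of the middle product specification, here is a minimal NTL sketch of the direct approach (full product, then extraction of the slice); the function name is ours, and the transposed variants of [5, 17] that we actually implemented avoid the wasted work.

```cpp
#include <NTL/lzz_pX.h>
using namespace NTL;

// Direct middle product: for F in K[x]_c and G in K[x]_{c+d}, return
// the slice of FG with coefficients of degrees c, ..., c+d-1, via a
// full product. A naive sketch that only pins down the specification;
// hypothetical helper name, not PML's actual interface.
zz_pX MiddleProductNaive(const zz_pX& F, const zz_pX& G, long c, long d)
{
    zz_pX H;
    mul(H, F, G);          // full product FG
    RightShift(H, H, c);   // discard coefficients of degree < c
    trunc(H, H, d);        // keep the next d coefficients
    return H;
}
```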
Let ω be such that n × n matrices over any ring can be multiplied by a bilinear algorithm doing O(n^ω) ring operations. The naive algorithm does exactly n^3 multiplications. First improvements due to Winograd and Waksman [50, 51] reduced the number of multiplications to n^3/2 + O(n^2) if 2 is a unit. Strassen's and Winograd's recursive algorithms [43, 52] have ω = log_2(7); the best known bound is ω ≤ 2.373 [8, 30]. Note that, using blocking, rectangular matrices of sizes (m × n) and (n × p) can be multiplied in O(mnp · min(m, n, p)^{ω−3}) ring operations. NTL implements its own arithmetic for matrices over F_p and chooses one of several implementations depending on the bitsize of p, the matrix dimensions, the available processor instructions, etc.

In what follows, we write MM(n, d) for a function such that two n × n matrices of degree at most d can be multiplied in MM(n, d) operations in K; we make the assumption that d ↦ MM(n, d)/d is increasing for all n. From the definitions above we obtain MM(n, d) ∈ O(n^ω M(d)), which is in O~(n^ω d). Using evaluation/interpolation at 1, α, ..., α^{2d} or at roots of unity, one obtains the following bounds on MM(n, d):
• O(n^ω d + n^2 M(d)) if an element α in K of order more than 2d is known [6, Thm. 2.4].
• O(n^ω d + n^2 d log(d)) if K supports FFT in degree 2d.
We also mention a polynomial analogue of an integer matrix multiplication algorithm from [10] which uses evaluation/interpolation, done plainly via multiplication by (inverse) Vandermonde matrices. Then, the corresponding part of the cost (e.g. O(n^2 M(d)) for geometric progressions) is replaced by the cost of multiplying matrices over K in sizes roughly (d × d) by (d × n^2); this is in O(n^2 d^{ω−1}) if d ≤ n. For moderate values of d, where M(d) is not in the FFT regime, this allows us to leverage fast matrix multiplication over K.

We implemented and compared various algorithms for matrix multiplication over F_p[x]. For matrices of degree less than 5, we use dedicated routines based on Karatsuba's and Montgomery's formulas [32]; for matrices of small size (up to 10, depending on p), we use Waksman's algorithm. For other inputs, most of our efforts were spent on variants of the evaluation/interpolation scheme. For FFT primes, we use evaluation/interpolation at roots of unity. For general primes, we use either evaluation/interpolation at geometric progressions (if such points exist in F_p), or our adaptation of the algorithm of [10], or 3-primes multiplication (as for polynomials, we lift the product from F_p[x] to Z[x], where it is done modulo up to 3 FFT primes). No single variant outperformed or underperformed all others for all sizes and degrees, so thresholds were experimentally determined to switch between these options, with different values for small (less than 23 bits) and for large primes. Middle product versions of these algorithms were implemented, and are used in approximant basis algorithms (Section 3.1) and Newton iteration (Section 4.3). Multiplier classes are available: they do precomputations on a matrix A to accelerate repeated multiplications by A; they are used in Dixon's algorithm (Section 4.3).
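The following sketch shows the generic evaluation/interpolation scheme for polynomial matrix multiplication, at arbitrary distinct points; our implementation instead uses roots of unity, geometric progressions, or the adaptation of [10], but the structure is the same. Names and the point choice are ours, and we assume p exceeds the number of points.

```cpp
#include <NTL/lzz_pX.h>
#include <NTL/mat_lzz_p.h>
using namespace NTL;

// Product of polynomial matrices by evaluation/interpolation at the
// points 0, 1, ..., dA+dB; a sketch of the generic scheme.
void MulEvalInterp(Mat<zz_pX>& C, const Mat<zz_pX>& A, const Mat<zz_pX>& B,
                   long dA, long dB)
{
    long m = A.NumRows(), n = A.NumCols(), q = B.NumCols();
    long npts = dA + dB + 1;                  // deg(C) <= dA + dB
    vec_zz_p pts; pts.SetLength(npts);
    Vec<Mat<zz_p>> Cval; Cval.SetLength(npts);
    Mat<zz_p> Ak, Bk;
    Ak.SetDims(m, n); Bk.SetDims(n, q);
    for (long k = 0; k < npts; k++) {
        conv(pts[k], k);                      // evaluation point x = k
        for (long i = 0; i < m; i++)
            for (long j = 0; j < n; j++)
                eval(Ak[i][j], A[i][j], pts[k]);
        for (long i = 0; i < n; i++)
            for (long j = 0; j < q; j++)
                eval(Bk[i][j], B[i][j], pts[k]);
        mul(Cval[k], Ak, Bk);                 // constant matrix product
    }
    C.SetDims(m, q);
    vec_zz_p vals; vals.SetLength(npts);
    for (long i = 0; i < m; i++)
        for (long j = 0; j < q; j++) {
            for (long k = 0; k < npts; k++) vals[k] = Cval[k][i][j];
            interpolate(C[i][j], pts, vals);  // recover the entry of C
        }
}
```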
The table below shows timings for our multiplication and LinBox's, for random m × m matrices of degree d and two choices of prime p. The global comparison showed running times that are either similar or in favor of our implementation.

           20 bit FFT prime         60 bit prime
m   d       ours  LinBox  ratio     ours  LinBox  ratio
8   131072
These bases are matrix generalizations of Padé approximation and play an important role in many higher-level algorithms. For F in K[x]^{m×n} and M non-constant in K[x], they are bases of the K[x]-module A_M(F) of all p in K[x]^{1×m} such that pF = 0 mod M. Specifically, approximant bases are for M = x^d and interpolant bases for M = ∏_i (x − α_i) for d distinct points α_1, ..., α_d in K. (Here, we do not consider more general cases from the literature, for example with several moduli M_1, ..., M_n, one for each column of pF.)

Since A_M(F) is free of rank m, such a basis is represented row-wise by a nonsingular P in K[x]^{m×m}. The algorithms below return P in s-ordered weak Popov form (also known as s-quasi Popov form [4]), for a given shift s = (s_1, ..., s_m) in Z^m. Shifts allow us to set degree constraints on the sought basis P, and they inherently occur in a general approach for finding bases of solutions to equations (approximants, interpolants, kernels, etc.). Approximant basis algorithms often require P to be in s-reduced form [47]; although the s-ordered weak Popov form is stronger, obtaining it involves minor changes in these algorithms, without impact on performance according to our experiments. Recent literature shows that this stronger form reveals valuable information for further computations with P [23, 25], in particular for finding s-Popov bases [4].

From the shift s, the s-degree of p = [p_i]_i ∈ K[x]^{1×m} is defined as rdeg_s(p) = max_{1≤i≤m}(deg(p_i) + s_i), which extends to matrices: rdeg_s(P) is the list of s-degrees of the rows of P. Then, the s-pivot of p is its rightmost entry p_i such that rdeg_s(p) = deg(p_i) + s_i, and a nonsingular matrix P is in s-ordered weak Popov form if the s-pivots of its rows are located on the diagonal.

To simplify cost bounds below, we make use of the function MM′(m, d) = ∑_{0≤i≤log(d)} 2^i MM(m, d/2^i) ∈ O(MM(m, d) log(d)).

For F in K[x]^{m×n} and d in Z_{>0}, an approximant basis for (F, d) is a nonsingular m × m matrix whose rows form a basis of A_{x^d}(F). We implemented minor variants of the algorithms M-Basis (iterative, via matrix multiplication) and PM-Basis (divide and conquer, via polynomial matrix multiplication) from [15]. The lowest-level function (M-Basis-1, with the signature in Algorithm 1) handles order d = 1 in O(rank(F)^{ω−2} mn) operations; here, working modulo x, the matrix F is over K. Our implementation follows [25, Algo. 1], which returns an s-Popov basis, using only an additional row permutation compared to the algorithm in [15]. This form of the output of M-Basis-1 suffices to ensure that M-Basis and PM-Basis return bases in s-ordered weak Popov form.
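The shifted forms above are easy to test; the following sketch (helper names are ours; a zero row has no pivot and makes the test fail) checks the s-ordered weak Popov property by locating the s-pivot of each row.

```cpp
#include <climits>
#include <NTL/lzz_pX.h>
using namespace NTL;

// s-degree of a row: rdeg_s(p) = max_i (deg(p_i) + s_i).
long SDeg(const Vec<zz_pX>& p, const Vec<long>& s)
{
    long rd = LONG_MIN;
    for (long i = 0; i < p.length(); i++)
        if (!IsZero(p[i]) && deg(p[i]) + s[i] > rd)
            rd = deg(p[i]) + s[i];
    return rd;
}

// s-pivot: the rightmost entry reaching the s-degree (-1 for a zero row).
long SPivot(const Vec<zz_pX>& p, const Vec<long>& s)
{
    long rd = SDeg(p, s), piv = -1;
    for (long i = 0; i < p.length(); i++)
        if (!IsZero(p[i]) && deg(p[i]) + s[i] == rd)
            piv = i;   // keep the rightmost one
    return piv;
}

// P is in s-ordered weak Popov form when the s-pivots lie on the diagonal.
bool IsOrderedWeakPopov(const Mat<zz_pX>& P, const Vec<long>& s)
{
    for (long i = 0; i < P.NumRows(); i++)
        if (SPivot(P[i], s) != i)
            return false;
    return true;
}
```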
Algorithm 1: M-Basis-1(F, s)
Input: matrix F in K^{m×n}, shift s in Z^m
Output: the s-Popov approximant basis for (F, 1)

Our implementation of M-Basis follows the original design [15], with d iterations, each computing the residual R and updating P via multiplication by a basis Q obtained by M-Basis-1 on R. We also follow [15] for PM-Basis, using a threshold T such that M-Basis is called if d ≤ T. Building PM-Basis directly upon M-Basis-1, i.e. choosing T = 1, has the same cost bound but is slower in practice.
Algorithm 2: M-Basis(F, d, s)
Input: matrix F in K[x]^{m×n}, order d in Z_{>0}, shift s in Z^m
Output: an s-ordered weak Popov approximant basis for (F, d)
1. P ← identity matrix in K[x]^{m×m}, and t ← copy of s
2. For k = 0, ..., d − 1:
   a. R ∈ K^{m×n} ← coefficient of PF of degree k
   b. Q ∈ K[x]^{m×m} ← M-Basis-1(R, t)
   c. P ← QP, and then t ← rdeg_s(P)
3. Return P
Algorithm 3: PM-Basis(F, d, s)
Input: matrix F in K[x]^{m×n}, order d in Z_{>0}, shift s in Z^m
Output: an s-ordered weak Popov approximant basis for (F, d)
1. If d ≤ T, return M-Basis(F, d, s)
2. P_1 ← PM-Basis(F mod x^{⌈d/2⌉}, ⌈d/2⌉, s)
3. R ← MiddleProduct(P_1, F, ⌈d/2⌉, ⌊d/2⌋)
4. t ← rdeg_s(P_1)
5. P_2 ← PM-Basis(R, ⌊d/2⌋, t)
6. Return P_2 P_1

These algorithms use O((m^ω + m^{ω−1} n) d) and O((1 + n/m) MM′(m, d)) operations, respectively [15]. Some implementation details are discussed in Section 3.2. The next table compares timings for LinBox's and our implementations of PM-Basis for a 20 bit FFT prime (LinBox's implementation is not optimized for large primes and general primes).

m  n  d       ours  LinBox  ratio
8  4  131072

We also implemented [25, Algo. 3], which returns s-Popov bases and is about twice slower than PM-Basis; making this overhead negligible for some usual cases is future work. For completeness, we handle general approximants (with one modulus per column of F) by an iterative approach from [2, 47]; faster algorithms are more complex [23–25] and use partial linearization techniques. These techniques, from [42, 53], yield cost bounds in O~(m^{ω−1} n d), which is a Θ(m/n) speedup compared to PM-Basis. Implementing them is work in progress; experimental code, which for simplicity focuses on n = 1 and on cases where the degrees in P can be predicted, revealed significant speedups:

m    n  d      PM-Basis  with linearization
4    1  65536  1.6693
16   1  16384  1.8535
64   1  2048   2.2865
256  1  1024   36.620
Approximant bases are often applied to solve block-Hankel systems [28]. In two specific settings, we have compared this approach to the one which uses structured matrix algorithms; we are not aware of previous comparisons of this kind. We obtain the following running times, using the NTL-based solver from [21].
           Setting 1           Setting 2
m  d       PM-Basis  solver    PM-Basis  solver
5  8000
Setting 1: we call PM-Basis on [F^T −I_m]^T at order 2d with shift (0, ..., 0), where F is an m × m matrix of degree 2d − 1, and we solve a system with m × m Hankel blocks of size d × d (the structured solver returns a random solution to the system). Our experiments show a clear advantage for approximant algorithms. The asymptotic costs being similar, the effects at play here are constant factor differences: approximant basis algorithms seem to be somewhat simpler and to better leverage the main building blocks (matrix arithmetic over K and univariate polynomial arithmetic).

Setting 2 is the vector rational reconstruction problem. We call PM-Basis on [F^T −I_m]^T at order (m + 1)d with shift (0, ..., 0), where F is a 1 × m vector of degree (m + 1)d − 1, and we solve a block system with 1 × m Hankel blocks of size md × d. The cost bounds are O~(m^{ω+1} d) and O~(m^ω d), respectively. Approximants are faster up to dimension about 15, which is explained by the arguments in the previous paragraph. For larger dimensions, as predicted by the cost estimates, the block-Hankel solver is more efficient.

For matrices E = (E_1, ..., E_d) in K^{m×n} and pairwise distinct points α = (α_1, ..., α_d) in K, consider I_α(E) = { p ∈ K[x]^{1×m} | p(α_i) E_i = 0 for 1 ≤ i ≤ d }. An interpolant basis for (E, α) is a matrix whose rows form a basis of the K[x]-module I_α(E). Note that I_α(F(α_1), ..., F(α_d)) coincides with A_M(F), for F in K[x]^{m×n} and M = ∏_{i=1}^{d} (x − α_i).

This definition is a specialization of those in [3, 24], which consider n sets of points, one for each of the n columns of E_1, ..., E_d: here, these sets are all equal. This more restrictive problem allows us to give faster algorithms than those in these references, by direct adaptations of the approximant basis algorithms presented above. Besides, Sections 4.1 and 4.2 will show that interpolant bases can often play the same role as approximant bases in applications.

These adaptations are described in Algorithms 4 and 5, where α_{i...j} stands for the sublist (α_i, α_{i+1}, ..., α_j). In the next proposition, we assume that MM(n, d) is in Ω(n^2 M(d)) (instead, one may add an extra term O(n^2 M(d) log(d)) in the cost).

Proposition 3.1. Algorithm 5 is correct. For input evaluation points in geometric progression, it costs O(MM′(m, d)) if n ≤ m and O(MM′(m, d) + m^{ω−1} n d log(d)) otherwise. For general evaluation points, an extra cost O(m^2 M(d) log(d)) is incurred.

(Correctness follows from Items (i) and (iii) of [25, Lem. 2.4]; the cost analysis is standard for such divide and conquer algorithms.)
Our current code uses the threshold T = 32 in the divide and conquer PM-Basis and PM-IntBasis: beyond this point, they are faster than the iterative M-Basis and M-IntBasis. Unlike in most other functions, where elements of K[x]^{m×n} are represented as matrices of polynomials (Mat<zz_pX>), the iterative algorithms work with vectors of matrices over K (Vec<Mat<zz_p>>), since they use only matrix arithmetic over K (recall that deg(Q) ≤ 1). In M-Basis, either R is computed from P and F at each iteration, or we initialize a list of residuals with a copy of F and we update the whole list at each iteration using Q. The second variant improves over the first when n > m/2, with significant savings when n is close to m. For interpolant bases, this does not lead to any gain.

Algorithm 4: M-IntBasis(E, α, s)
Input: matrices E = (E_1, ..., E_d) in K^{m×n}, evaluation points α = (α_1, ..., α_d) in K, shift s in Z^m
Output: an s-ordered weak Popov interpolant basis for (E, α)
1. P ← identity matrix in K[x]^{m×m}, and t ← copy of s
2. For k = 1, ..., d:
   a. R ∈ K^{m×n} ← P(α_k) E_k
   b. Q ∈ K[x]^{m×m} ← M-Basis-1(R, t)
   c. P ← QP, and then t ← rdeg_s(P)
3. Return P

Algorithm 5: PM-IntBasis(E, α, s)
Input: matrices E = (E_1, ..., E_d) in K^{m×n}, evaluation points α = (α_1, ..., α_d) in K, shift s in Z^m
Output: an s-ordered weak Popov interpolant basis for (E, α)
1. If d ≤ T, return M-IntBasis(E, α, s)
2. P_1 ← PM-IntBasis(E_{1...⌈d/2⌉}, α_{1...⌈d/2⌉}, s)
3. R ← (P_1(α_{⌈d/2⌉+1}) E_{⌈d/2⌉+1}, ..., P_1(α_d) E_d)
4. t ← rdeg_s(P_1)
5. P_2 ← PM-IntBasis(R, α_{⌈d/2⌉+1...d}, t)
6. Return P_2 P_1

Timings are shown in the next table, for Algorithms M-Basis (M), M-IntBasis (M-I), PM-Basis (PM), and PM-IntBasis for general points (PM-I) and for geometric points (PM-Ig). For approximants, we take a random input in K[x]^{m×n} of degree d − 1; for interpolants, we take d random matrices in K^{m×n}. We focus on the common case m ≃ n, which arises for example in kernel algorithms (Section 4.2) and in fraction reconstruction, itself used in basis reduction (Section 4.5) and in the resultant algorithm of [49] (Section 5).

m    n    d   M        M-I       d      PM  PM-I  PM-Ig
4    2    32  1.60e-4  1.42e-4   32768
32   16   32  0.0104
64   32   32  0.0502
128  64   32  0.374
256  128  32  2.92
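For reference, the interpolant residual of step 2.a combines one evaluation of P with one matrix product over K; a minimal sketch, with an interface of our choosing:

```cpp
#include <NTL/lzz_pX.h>
#include <NTL/mat_lzz_p.h>
using namespace NTL;

// Residual R = P(alpha) * E used by M-IntBasis, with P(alpha) obtained
// entrywise by NTL's Horner-based eval. A sketch; names are ours.
void IntResidual(Mat<zz_p>& R, const Mat<zz_pX>& P, const zz_p& alpha,
                 const Mat<zz_p>& E)
{
    long m = P.NumRows();
    Mat<zz_p> Pval;
    Pval.SetDims(m, m);
    for (long i = 0; i < m; i++)
        for (long j = 0; j < m; j++)
            eval(Pval[i][j], P[i][j], alpha);   // Horner evaluation
    mul(R, Pval, E);                            // one matrix product over K
}
```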
Concerning iterative algorithms, we observe that interpolants are slightly faster than approximants, which is explained by the cost of computing the residual R: it uses one Horner evaluation of P and one matrix product for interpolants, whereas for approximants it uses about min(k, deg(P)) matrix products at iteration k.

As for the divide and conquer algorithms, interpolant bases with general points are slower, in some cases significantly, than the other two algorithms: although the complexity analysis predicted a disadvantage, we believe that our implementation of multipoint evaluation at general points could be improved to reduce this gap. For the other two algorithms, the comparison is less clear. There could be many factors at play here, but the main differences lie in the base case (Step 1), which calls the iterative algorithm, and in the computation of residuals (Step 3), which uses either middle products or geometric evaluation. It seems that FFT-based polynomial multiplication performs slightly better than geometric evaluation for small matrices and slightly worse for large matrices.

In this section we consider kernel bases, system solving, determinants, and basis reduction; we discuss algorithms which rely on multiplication, through approximant/interpolant bases and lifting techniques. For many of these algorithms, we are not aware of previous implementations or experimental comparisons.
Given H in K(x)^{n×n}, a left fraction description of H is a pair of polynomial matrices (Q, R) in K[x]^{n×n} such that H = Q^{−1} R. It is minimal if Q and R have unimodular left matrix GCD and Q is in reduced form (right fraction descriptions are defined similarly). Besides, H is said to be strictly proper if the numerator of each of its entries has degree less than the corresponding denominator.

Such a description of H is often computed from the power series expansion of H at sufficient precision, using an approximant basis. Yet, for resultant computations in Section 5.2, we would like to use an interpolant basis to obtain this description from sufficiently many values of H. We now state the validity of this approach; this is a matrix version of rational function reconstruction [11, Chap. 5.7].

Proposition 4.1. Let H in K(x)^{n×n} be strictly proper and suppose H admits left and right fraction descriptions of degrees at most D, for some D ∈ Z_{>0}. For M in K[x] of degree at least 2D and such that all denominators in H are invertible modulo M, define the matrix F = [(H mod M)^T −I_n]^T ∈ K[x]^{2n×n}. Then, if P ∈ K[x]^{2n×2n} is a 0-ordered weak Popov basis of A_M(F), the first n rows of P form a matrix [Q R] such that (Q, R) is a minimal left fraction description of H, with Q in 0-ordered weak Popov form.

The proof given in [15, Lem. 3.7] for the specific M = x^{2D+1} extends to any modulus M; using an ordered weak Popov form (rather than a reduced form) allows us both to know a priori that the first n rows are those of degree at most D, and to use degree 2D instead of 2D + 1 (deg(R) < deg(Q) is ensured by this form). In particular, if M = ∏_{i=1}^{2D} (x − α_i) for pairwise distinct points (α_1, ..., α_{2D}), the interpolant basis algorithms in Section 3.2 give a minimal left fraction description of H from H(α_1), ..., H(α_{2D}).

We implemented two kernel basis algorithms: the first one, based on Lemma 4.2, finds the kernel basis from a single approximant basis at sufficiently large order; the second one, from [54], uses several approximant bases at small order and combines recursively obtained kernel bases via multiplication. With a minor modification and no performance impact, the latter returns an s-ordered weak Popov basis. In both cases, we designed variants which rely on interpolant bases instead of approximant bases.

Lemma 4.2. Let F be in K[x]^{m×n} of degree d ≥ 1, let s be in N^m, and let δ in Z_{>0} be an upper bound on the s-degree of any s-reduced left kernel basis of F; for example, δ = nd + max(s) − min(s) + 1. Let M be in K[x] of degree at least δ + d, and P in K[x]^{m×m} be an s-reduced basis of A_M(F). Then, the submatrix of P formed by its rows of s-degree less than δ is an s-reduced left kernel basis for F.

For a proof, see https://hal.archives-ouvertes.fr/hal-01995873v1/document. In particular, one may find P via PM-IntBasis at d + δ points or via PM-Basis at order d + δ; for n ≤ m, this costs O(MM′(m, d + δ)). The approximant-based direct approach is folklore [54, Sec. 2.3], yet explicit statements in the literature focus on shifts linked to the degrees in F, with better bounds δ (see [54, Lem. 3.3], [33, Lem. 4.3]).

The algorithm of [54] is more efficient, at least when the entries of s are close to the corresponding row degrees of F; for a uniform shift, it costs O~(m^ω ⌈nd/m⌉) operations. We obtained significant practical improvements over the plain implementation of [54, Algo. 1] thanks to the following observation: if n ≤ m/2, for a vast majority of inputs F, the approximant basis at Step 2 of [54, Algo. 1], computed at order more than 2s, contains the sought kernel basis. Furthermore, this can be easily tested by checking well-chosen degrees, and then the algorithm can exit early, avoiding the further recursive calls. We took advantage of this via the following modifications: we use an order slightly more than 2s (see [54, Rmk. 3.5] for a discussion on this point), and when n > m/2 we fall back to the direct approach. As for interpolant variants, the key remark is that rows of small degree in a basis of A_M(F) for a large degree M must be in the kernel of F. Thus, one can directly replace approximant bases with interpolant bases in [54, Algo. 1], up to modifying Step 8 accordingly (dividing by the appropriate polynomial M).

Timings for both approaches are shown in the next table, for a random F of degree d. Except for the last row, the shift is uniform and, as expected, [54, Algo. 1] is faster than the direct approach; the differences between interpolant and approximant variants follow those observed in Section 3. The last row corresponds to a shift yielding the kernel basis in Hermite form and shows, as expected, that the direct approach is faster for shifts that are far from uniform. We note that [54, Algo. 1] may use partial linearization if it computes matrix products with unbalanced degrees or approximant bases with n ≪ m. We have not yet implemented this part of the algorithm, which may lead to slowdowns for some rare inputs.

                    direct           [54, Algo. 1]
m    n    d      approx.  int.    approx.  int.
8    4    8192   7.22     6.60
32   31   1024   142      118
128  64   256    2720     1827   16.8
128  127  256    >1h      >1h    43.8
16   1    512
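The extraction step in Lemma 4.2 is straightforward; a minimal sketch (interface ours), assuming P was computed by PM-Basis at order d + δ or PM-IntBasis at d + δ points:

```cpp
#include <algorithm>
#include <climits>
#include <NTL/lzz_pX.h>
using namespace NTL;

// Keep the rows of P whose s-degree is less than delta: by Lemma 4.2,
// they form an s-reduced left kernel basis of F.
Mat<zz_pX> KernelRows(const Mat<zz_pX>& P, const Vec<long>& s, long delta)
{
    long m = P.NumRows();
    Vec<long> keep;
    for (long i = 0; i < m; i++) {
        long rd = LONG_MIN;                       // s-degree of row i
        for (long j = 0; j < m; j++)
            if (!IsZero(P[i][j]))
                rd = std::max(rd, deg(P[i][j]) + s[j]);
        if (rd < delta)
            keep.append(i);
    }
    Mat<zz_pX> K;
    K.SetDims(keep.length(), m);
    for (long i = 0; i < keep.length(); i++)
        K[i] = P[keep[i]];
    return K;
}
```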
For systems Av = b, with A in K[x]^{m×n}, b in K[x]^{m×1} and v in K(x)^{n×1}, we implemented two families of algorithms. The first one uses lifting techniques, assuming A is square, nonsingular, with A(0) invertible; the algorithm returns a pair (u, f) in K[x]^{n×1} × K[x] such that Au = f b and f has minimal degree. The second one uses a kernel basis and works for any input A; under the assumptions above, it has a similar output.
Lifting techniques. Under the above assumptions, our lifting algorithm is standard: if A and b have degree at most d, we first compute the truncated inverse S = A^{−1} mod x^{d+1} by matrix Newton iteration [38]. Then, we use Dixon's algorithm [9] to compute v mod x^{2nd} = A^{−1} b mod x^{2nd}; it consists of roughly 2n steps, each involving a matrix-vector product using either A or S. Then, vector rational reconstruction is applied to recover (u, f) from v. The cost of this algorithm is O(MM(n, d)) for the truncated inverse of A and O(n^3 M(d)) for Dixon's algorithm; overall this is in O~(n^3 d).

To reduce the exponent in n, Storjohann introduced the high-order lifting algorithm [41]. The core of this algorithm is the computation of Θ(log(n)) slices S_1, S_2, ... of the power series expansion of A^{−1}, where the coefficients of S_i are the coefficients of degree (2^i − 1)d − 2^i + 1, ..., (2^i + 1)d − 2^i − 1 of A^{−1}. These matrices are computed recursively, each step involving 4 matrix products; the other steps of the algorithm, that use these S_i to compute v mod x^{2nd}, are cheaper, so the runtime is O(MM(n, d) log(n)) ⊂ O~(n^ω d).
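For concreteness, here is a minimal sketch of the matrix Newton iteration for the truncated inverse, with a naive truncated product standing in for the fast multiplication routines of Section 2; names are ours.

```cpp
#include <NTL/lzz_pX.h>
#include <NTL/mat_lzz_p.h>
using namespace NTL;

// Naive truncated product of polynomial matrices (C must not alias A, B).
static void MatMulTrunc(Mat<zz_pX>& C, const Mat<zz_pX>& A,
                        const Mat<zz_pX>& B, long prec)
{
    long m = A.NumRows(), n = A.NumCols(), q = B.NumCols();
    C.SetDims(m, q);
    zz_pX t;
    for (long i = 0; i < m; i++)
        for (long j = 0; j < q; j++) {
            clear(C[i][j]);
            for (long l = 0; l < n; l++) {
                MulTrunc(t, A[i][l], B[l][j], prec);  // mod x^prec
                add(C[i][j], C[i][j], t);
            }
        }
}

// Matrix Newton iteration [38]: X = A^{-1} mod x^prec, assuming A(0)
// invertible. Each pass doubles the precision via X <- 2X - XAX.
void InverseSeries(Mat<zz_pX>& X, const Mat<zz_pX>& A, long prec)
{
    long m = A.NumRows();
    Mat<zz_p> A0, A0inv;
    A0.SetDims(m, m);
    for (long i = 0; i < m; i++)
        for (long j = 0; j < m; j++)
            A0[i][j] = coeff(A[i][j], 0);
    inv(A0inv, A0);                                // error if A(0) singular
    X.SetDims(m, m);
    for (long i = 0; i < m; i++)
        for (long j = 0; j < m; j++)
            conv(X[i][j], A0inv[i][j]);            // X = A(0)^{-1} mod x
    for (long k = 1; k < prec; k *= 2) {
        long n = (2 * k < prec) ? 2 * k : prec;    // target precision
        Mat<zz_pX> T, U;
        MatMulTrunc(T, A, X, n);                   // T = AX  mod x^n
        MatMulTrunc(U, X, T, n);                   // U = XAX mod x^n
        for (long i = 0; i < m; i++)
            for (long j = 0; j < m; j++) {
                add(X[i][j], X[i][j], X[i][j]);    // 2X
                sub(X[i][j], X[i][j], U[i][j]);    // 2X - XAX
                trunc(X[i][j], X[i][j], n);
            }
    }
}
```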
Using kernel bases. For this second approach, let A be any matrix in K[x]^{m×n} and b be in K[x]^{m×1}. The algorithm simply computes K ∈ K[x]^{(n+1)×k}, a right kernel basis of the augmented matrix [A | b] ∈ K[x]^{m×(n+1)}. The matrix K generates, via K(x)-linear combinations of its columns, all solutions v ∈ K(x)^{n×1} to Av = b. In particular, if K is empty (i.e. k = 0, which requires m ≥ n), or if the last row of K is zero, then the system has no solution. Furthermore, if A is square and nonsingular, K has a single column [u^T f]^T, where u ∈ K[x]^{n×1} and f ∈ K[x], with f of minimal degree (otherwise, K would not be a basis).

In this context, the fastest known kernel algorithm is [54, Algo. 1]. To exploit it best, we choose the input shift s = (d, d′), where d′ = deg(b) and d ∈ N^n is the tuple of column degrees of A (zero columns of A are discarded while computing d).
Implementation. We implemented the approaches described above: lifting with Dixon's algorithm, high-order lifting, and via kernel. The table below shows timings for a randomly chosen m × m matrix A and m × 1 vector b, both of degree d. In this case the lifting algorithms apply (with high probability). On such inputs, Dixon's algorithm usually does best. High-order lifting, although theoretically faster, is outperformed, mainly because it performs Θ(log(n)) matrix products (we will however see that this algorithm still plays an important role for basis reduction). The kernel based approach is moderately slower than Dixon's algorithm, but has the advantage of working without any assumption on A.

m   d     Dixon  high-order lifting  kernel
16  1024         266                 108
We implemented four algorithms, taking as input a square m × m matrix A.

The most basic one uses expansion by minors, which turns out to be the fastest option up to dimension about 6.

The second one assumes that we have an element α in K of order at least 2md + 1, and uses evaluation/interpolation at the geometric progression 1, α, ..., α^{md}; this costs O(m^2 M(md) + m^{ω+1} d) operations in K. For dimensions between 7 and about 20, it is often the fastest variant, sometimes competing with the third.

The third one consists in solving a linear system with random right-hand side of degree d; this yields a solution (u, f) with f the sought determinant up to a constant [36]. For dimensions exceeding 20, this is the fastest method (assuming that A(0) is invertible).

The last one, from [29], is based on triangularization and runs in O~(m^ω d). Our implementation currently only supports the generic case where the Hermite form of A has diagonal (1, ..., 1, det(A)), or in other words, all so-called row bases computed in that algorithm are the identity. This allows us to circumvent the temporary lack of a row basis implementation, while still being able to observe meaningful timings. Indeed, we believe that timings for a complete implementation called on such "generic" input will be similar to the timings presented here in many cases of interest, where one can easily detect whether the algorithm has made a wrong prediction about the row basis being identity (for example if the input matrix is reduced, which is the case if it is the denominator of a minimal fraction description). This recursive determinant algorithm calls expansion by minors as a base case for small dimensions; for larger dimensions, it is generally slightly slower than the third method.

The timings below are for a random m × m matrix A of degree d.

m   d      minors  evaluation  linsolve  triangular
4   65536
16  4096   ∞       ∞           ∞         ∞
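A sketch of the second (evaluation/interpolation) method, written at arbitrary distinct points for simplicity; our code uses a geometric progression, which lowers the evaluation/interpolation cost. The name and point choice are ours, and we assume p > md + 1.

```cpp
#include <NTL/lzz_pX.h>
#include <NTL/mat_lzz_p.h>
using namespace NTL;

// Determinant of an m x m polynomial matrix of degree d by
// evaluation/interpolation at md + 1 distinct points.
void DetEvalInterp(zz_pX& detA, const Mat<zz_pX>& A, long d)
{
    long m = A.NumRows();
    long npts = m * d + 1;                 // deg det(A) <= m*d
    vec_zz_p pts, vals;
    pts.SetLength(npts);
    vals.SetLength(npts);
    Mat<zz_p> Ak;
    Ak.SetDims(m, m);
    for (long k = 0; k < npts; k++) {
        conv(pts[k], k);                   // the point x = k
        for (long i = 0; i < m; i++)
            for (long j = 0; j < m; j++)
                eval(Ak[i][j], A[i][j], pts[k]);
        determinant(vals[k], Ak);          // determinant over F_p
    }
    interpolate(detA, pts, vals);
}
```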
Our implementation of the algorithm of [15] takes as input a nonsingular matrix A ∈ K[x]^{m×m} of degree d such that A(0) is invertible, and returns a reduced form of A. The algorithm first computes a slice S of 2d consecutive coefficients of degree about md in the power series expansion of A^{−1}, then uses PM-Basis to reconstruct a fraction description S = R^{−1} Q, and then returns R. A Las Vegas randomized version is mentioned in [15], to remove the assumption on A(0): we will implement it for large enough K, but for smaller fields this requires to work in an extension of K, which is currently beyond the scope of our work.
In our experiments, to create the input A, we started from a random m × m matrix of degree d/3 and multiplied it by a random matrix of degree 2d/3. The following table shows timings for both steps, with the first step either based on Newton iteration or on high-order lifting; the displayed total time is when using the faster of the two. We conclude that for reduction, as opposed to the above observations for system solving, it is crucial to rely on high-order lifting. Indeed, it improves over Newton iteration already for dimension 8, and the gap becomes quite significant when the dimension grows.

m  d      Newton  high-order  reconstruct  total
4  24574
We conclude this paper with algorithms originating from Villard's recent breakthrough on computing the determinant of structured polynomial matrices [49]. Fix a field K and consider the two following questions: computing the resultant of two polynomials F, G in K[x, z] with respect to z, and computing the characteristic polynomial of an element A in K[z]/(P), for some P in K[z].

The second problem is a particular case of the former, since the characteristic polynomial of A modulo P is the resultant of x − A(z) and P(z) with respect to z, up to a nonzero constant. Let n be an upper bound on the degree in z of the polynomials we consider, and d be a bound on their degree in x (so in the second problem, d = 1). For generic inputs, Villard's approach solves these problems in O~(n^{2−1/ω} d) ⊂ O~(n^{1.58} d) operations in K. For the first problem, the best previous bound is O~(n^2 d), obtained either by evaluation/interpolation techniques or Reischert's algorithm [37]. For the second problem, the previous record was O~(n^{ω_2/2}), where ω_2 is the exponent of multiplication of an s × s matrix by an s × s^2 matrix, with ω_2/2 ≤ 1.63 [31]. Note that these bounds apply to all inputs. We show how the work we presented above allows us to put Villard's ideas to practice, and outperform the previous state of the art for large input sizes. This is however not straightforward: in both cases, this required modifications of Villard's original designs (for the second case, using an algorithm from [34]).
In [49], Villard designed the following algorithm to find the determinant of a matrix P over K[x].

Algorithm 6: Determinant(P, m)
Input: nonsingular P in K[x]^{ν×ν}; parameter m ∈ {1, ..., ν}
Output: det(P)
1. Compute H̄ = H mod x^{⌈ν/m⌉d+1}, where d is the degree of P and H is the m × m top-right quadrant of P^{−1} ∈ K(x)^{ν×ν}
2. From H̄, find a minimal left fraction description (Q, R) of H
3. Return det(Q)

The parameter m is chosen so as to minimize the theoretical cost. The correctness of the algorithm follows from the next properties, which do not hold for an arbitrary nonsingular P: the matrix H is strictly proper and admits a left fraction description H = Q^{−1} R such that det(P) = det(Q), for Q and R in K[x]^{m×m} of degree at most ⌈ν/m⌉d (see Section 4.1 for definitions). In [49], they are proved to hold for generic instances of the problems discussed here.

Once sufficiently many terms of the expansion of H have been obtained in Step 1, the denominator Q is recovered by an approximant basis algorithm and its determinant is computed by a general algorithm in Steps 2 and 3, which cost O~(m^ω · νd/m).

While the algorithm applies to any nonsingular matrix P satisfying the properties above, in general it does not improve over previously known methods (see Section 4.4). Indeed, the fastest known algorithm for obtaining H̄ costs O~(ν^ω d) operations via high-order lifting (see for example [16, Thm. 1]).

However, sometimes P has some structure which helps to speed up the first step. Villard pointed out that when P is the Sylvester matrix of two bivariate polynomials, then P^{−1} is a Toeplitz-like matrix which can be described succinctly as P^{−1} = L_1 U_1 + L_2 U_2; here, L_1, L_2 (resp. U_1, U_2) are lower (resp. upper) triangular Toeplitz matrices with entries in K(x). Hence, we start by computing the first columns c_1 of L_1 and c_2 of L_2 as well as the first rows r_1 of U_1 and r_2 of U_2, all of them modulo x^{⌈ν/m⌉d+1}; then, H̄ is directly obtained via the above formula for P^{−1}, using O~(mνd) operations. Computing these rows and columns is done by solving systems with matrices P and P^T and very simple right-hand sides [49, Prop. 5.1], with power series coefficients, in time O~(ν^2 d/m).

Altogether, taking m = ν^{1/ω} minimizes the cost, yielding the runtime O~(ν^{2−1/ω} d). In the case of bivariate resultants described above, the Sylvester matrix of F and G has size ν = 2n, hence the cost bound O~(n^{2−1/ω} d).

We implemented the algorithm described in the previous section to compute the resultant of generic F, G in K[x, z]; first experiments showed that obtaining c_1, c_2, r_1, r_2 was a bottleneck. These vectors have power series entries and are solutions of linear systems whose matrix is the Sylvester matrix of F and G or its transpose: they were obtained via Hensel lifting techniques, following [11, Ch. 15.4].

To get better performance, we designed a minor variant of Villard's algorithm: instead of computing the power series expansion of H modulo x^δ, where δ = ⌈ν/m⌉d + 1,
we compute values of H at δ points. We choose these points in geometric progression and use the interpolant basis algorithm of Section 3.2 to recover Q and R, as detailed in Section 4.1. The value of H at x = α is computed following the same approach as above, but over K instead of K[[x]]. In particular, our implementation directly relies on NTL's extended GCD algorithm over K = F_p to compute the vectors c_1, c_2, r_1, r_2.

The next table compares our implementation to the direct approach via evaluation/interpolation; note that the latter approach, while straightforward conceptually, is the state of the art.
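For reference, the direct approach (our baseline in the table below) can be sketched as follows, with F and G stored as vectors of their coefficients in z; names are ours, and we assume p is large enough and that the chosen points avoid the vanishing of the leading coefficients in z (which holds generically).

```cpp
#include <NTL/lzz_pX.h>
using namespace NTL;

// Res_z(F, G) by evaluation/interpolation with respect to x.
// F[j] is the coefficient of z^j, a polynomial in x of degree <= dx.
zz_pX ResultantEvalInterp(const Vec<zz_pX>& F, const Vec<zz_pX>& G, long dx)
{
    long nF = F.length() - 1, nG = G.length() - 1;  // degrees in z
    long npts = (nF + nG) * dx + 1;                 // bound on deg_x of the resultant
    vec_zz_p pts, vals;
    pts.SetLength(npts);
    vals.SetLength(npts);
    zz_pX f, g, R;
    for (long k = 0; k < npts; k++) {
        conv(pts[k], k);
        clear(f); clear(g);                         // specializations at x = pts[k]
        for (long j = 0; j <= nF; j++) SetCoeff(f, j, eval(F[j], pts[k]));
        for (long j = 0; j <= nG; j++) SetCoeff(g, j, eval(G[j], pts[k]));
        resultant(vals[k], f, g);                   // univariate resultant over F_p
    }
    interpolate(R, pts, vals);
    return R;
}
```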
n = d  Direct  Algo. 6
100
600    797
700    1343
800    2121
900    3203
For these running times, input polynomials were chosen at random with partial degree n both in x and in z; such polynomials have total degree 2n, and their resultant has degree 2n^2. The largest examples have quite significant sizes, but such degrees are not unheard-of in applications, as for instance in the genus-2 point counting algorithms of [1, 12–14]. Overall, with n = d, we observe a crossover point at a few hundred; the parameter m was set to a small power of n, since this gave us the best runtimes. As an example, for the largest values of d = n, the time is roughly evenly split between obtaining the values of H and recovering Q and its determinant, which is a good balance.

We consider the computation of the characteristic polynomial of an element A in K[z]/(P), for some monic P in K[z] of degree n. The algorithm we implemented, and which we sketch below, is from [34] and assumes that A and P are generic.

As explained previously, this problem is a particular case of a bivariate resultant, but we rely on another point of view that allows for a better asymptotic cost. Indeed, the characteristic polynomial of A modulo P is by definition the characteristic polynomial of the matrix M of multiplication by A modulo P. In other words, it is the determinant of the degree-1 matrix P = xI − M ∈ K[x]^{n×n}.

The genericity assumption ensures that M is invertible, hence the power series expansion of P^{−1} is −∑_{k≥0} M^{−k−1} x^k. Here, we use the top-left m × m quadrant H of P^{−1}; it has entries h_{i,j} ∈ K[[x]], where

h_{i,j,k} := coeff(h_{i,j}, x^k) = coeff(−z^j A^{−k−1} mod P, z^i),

for 0 ≤ i, j < m and for all k ≥ 0.
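In code, this formula computed directly reads as follows (a sketch with an interface of our choosing, requiring gcd(A, P) = 1; as discussed next, the implementation replaces this loop by baby steps/giant steps).

```cpp
#include <NTL/lzz_pX.h>
#include <NTL/mat_lzz_p.h>
using namespace NTL;

// h[k][i][j] = coeff(-z^j A^{-k-1} mod P, z^i),
// for 0 <= i, j < m and 0 <= k < delta.
void TopLeftQuadrant(Vec<Mat<zz_p>>& h, const zz_pX& A, const zz_pX& P,
                     long m, long delta)
{
    zz_pXModulus Pmod(P);
    zz_pX Ainv, pw, cur;
    InvMod(Ainv, A, P);                    // A^{-1} mod P
    pw = Ainv;                             // pw = A^{-k-1} mod P
    h.SetLength(delta);
    for (long k = 0; k < delta; k++) {
        h[k].SetDims(m, m);
        cur = pw;                          // cur = z^j A^{-k-1} mod P
        for (long j = 0; j < m; j++) {
            for (long i = 0; i < m; i++)
                h[k][i][j] = -coeff(cur, i);
            MulByXMod(cur, cur, P);        // multiply by z modulo P
        }
        MulMod(pw, pw, Ainv, Pmod);        // next power of A^{-1}
    }
}
```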
A direct implementation of this idea does not improve on the runtime given in Section 5.1, since it computes A^{−k−1} mod P for all 0 ≤ k < δ = ⌈n/m⌉ and therefore costs Ω(n^2/m). It turns out that baby-steps giant-steps techniques allow one to compute h_{i,j,k} for 0 ≤ i, j < m and 0 ≤ k < δ in O~(δ^{(ω−1)/2} n + mn) operations in K. Taking m = ⌈n^{1/3}⌉ minimizes the overall cost, resulting in the runtime O~(n^{(ω+2)/3}) ⊂ O(n^{1.46}).

The following table compares our implementation to NTL's built-in characteristic polynomial algorithm, with random inputs A and P. For such inputs, NTL uses Shoup's algorithm for power projection [39], which runs in time O~(n^{(ω+1)/2}).

n     m  NTL  new
5000  5
Acknowledgements. Hyun was supported by a Mitacs Globalink award; Neiger was supported by CNRS/INS2I's program for young researchers; Schost was supported by an NSERC Discovery Grant.
References
[1] S. Abélard, P. Gaudry, and P.-J. Spaenlehauer. 2018. Counting points on genus-3 hyperelliptic curves with explicit real multiplication. In ANTS-XIII.
[2] B. Beckermann and G. Labahn. 1994. A Uniform Approach for the Fast Computation of Matrix-Type Padé Approximants. SIAM J. Matrix Anal. Appl. 15, 3 (1994), 804–823.
[3] B. Beckermann and G. Labahn. 2000. Fraction-free computation of matrix rational interpolants and matrix gcds. SIAM J. Matrix Anal. Appl. 22, 1 (2000), 114–144.
[4] B. Beckermann, G. Labahn, and G. Villard. 1999. Shifted Normal Forms of Polynomial Matrices. In ISSAC'99. ACM, 189–196.
[5] A. Bostan, G. Lecerf, and É. Schost. 2003. Tellegen's Principle Into Practice. In ISSAC'03. ACM, 37–44.
[6] A. Bostan and É. Schost. 2005. Polynomial evaluation and interpolation on special sets of points. J. Complexity 21, 4 (2005), 420–446.
[7] D. Coppersmith. 1994. Solving homogeneous linear equations over GF(2) via block Wiedemann algorithm. Math. Comp. 62, 205 (1994), 333–350.
[8] D. Coppersmith and S. Winograd. 1990. Matrix multiplication via arithmetic progressions. J. Symbolic Comput. 9, 3 (1990), 251–280.
[9] J. Dixon. 1982. Exact solution of linear equations using p-adic expansions. Numer. Math. 40 (1982), 137–141.
[10] J. Doliskani, P. Giorgi, R. Lebreton, and É. Schost. 2018. Simultaneous conversions with the residue number system using linear algebra. ACM Trans. Math. Software 44, 3 (2018), 1–21.
[11] J. von zur Gathen and J. Gerhard. 2013. Modern Computer Algebra (third edition). Cambridge University Press.
[12] P. Gaudry and R. Harley. 2000. Counting points on hyperelliptic curves over finite fields. In ANTS-IV (LNCS), Vol. 1838. Springer-Verlag, 313–332.
[13] P. Gaudry and É. Schost. 2004. Construction of secure random curves of genus 2 over prime fields. In Eurocrypt'04 (LNCS), Vol. 3027. Springer, 239–256.
[14] P. Gaudry and É. Schost. 2012. Point-counting in genus 2 over prime fields. J. Symbolic Comput. 47, 4 (2012), 368–400.
[15] P. Giorgi, C.-P. Jeannerod, and G. Villard. 2003. On the complexity of polynomial matrix computations. In ISSAC'03. ACM, 135–142.
[16] S. Gupta, S. Sarkar, A. Storjohann, and J. Valeriote. 2012. Triangular x-basis decompositions and derandomization of linear algebra algorithms over K[x]. J. Symbolic Comput. 47, 4 (2012), 422–453.
[17] G. Hanrot, M. Quercia, and P. Zimmermann. 2004. The Middle Product Algorithm, I. Appl. Algebra Engrg. Comm. Comp. 14, 6 (2004), 415–438.
[18] W. Hart, F. Johansson, and S. Pancratz. 2015. FLINT: Fast Library for Number Theory. Version 2.5.2, http://flintlib.org.
[19] D. Harvey. 2009. A cache-friendly truncated FFT. Theoretical Computer Science 410 (2009), 2649–2658.
[20] D. Harvey. 2014. Faster arithmetic for number-theoretic transforms. Journal of Symbolic Computation 60 (2014), 113–119.
[21] S. G. Hyun, R. Lebreton, and É. Schost. 2017. Algorithms for structured linear systems solving and their implementation. In ISSAC'17. ACM, 205–212.
[22] S. G. Hyun, V. Neiger, H. Rahkooy, and É. Schost. 2017. Block-Krylov techniques in the context of sparse-FGLM algorithms. (2017). arXiv:1712.04177
[23] C.-P. Jeannerod, V. Neiger, É. Schost, and G. Villard. 2016. Fast computation of minimal interpolation bases in Popov form for arbitrary shifts. In ISSAC'16. ACM, 295–302.
[24] C.-P. Jeannerod, V. Neiger, É. Schost, and G. Villard. 2017. Computing minimal interpolation bases. J. Symbolic Comput. 83 (2017), 272–314.
[25] C.-P. Jeannerod, V. Neiger, and G. Villard. 2018. Fast computation of approximant bases in canonical form. Journal of Symbolic Computation (2018). To appear.
[26] E. Kaltofen. 1995. Analysis of Coppersmith's block Wiedemann algorithm for the parallel solution of sparse linear systems. Math. Comp. 64, 210 (1995), 777–806.
[27] M. Kaminski, D. G. Kirkpatrick, and N. H. Bshouty. 1988. Addition requirements for matrix and transposed matrix products. J. Algorithms 9, 3 (1988), 354–364.
[28] G. Labahn, D. K. Choi, and S. Cabay. 1990. The inverses of block Hankel and block Toeplitz matrices. SIAM J. Comput. 19, 1 (1990), 98–123.
[29] G. Labahn, V. Neiger, and W. Zhou. 2017. Fast, deterministic computation of the Hermite normal form and determinant of a polynomial matrix. J. Complexity (in press) (2017).
[30] F. Le Gall. 2014. Powers of Tensors and Fast Matrix Multiplication. In ISSAC'14. ACM, 296–303.
[31] F. Le Gall and F. Urrutia. 2018. Improved rectangular matrix multiplication using powers of the Coppersmith-Winograd tensor. In SODA'18. SIAM, 1029–1046.
[32] P. L. Montgomery. 2005. Five, six, and seven-term Karatsuba-like formulae. IEEE Trans. Comput. 54, 3 (2005), 362–369.
[33] V. Neiger, J. Rosenkilde, and G. Solomatov. 2018. Computing Popov and Hermite Forms of Rectangular Polynomial Matrices. In ISSAC'18. ACM, 295–302.
[34] V. Neiger, B. Salvy, É. Schost, and G. Villard. 2019. In preparation.
[35] J. S. R. Nielsen and P. Beelen. 2015. Sub-quadratic decoding of one-point Hermitian codes. IEEE Trans. Inf. Theory 61, 6 (2015), 3225–3240.
[36] V. Pan. 1988. Computing the determinant and the characteristic polynomial of a matrix via solving linear systems of equations. Inform. Process. Lett. 28, 2 (1988), 71–75.
[37] D. Reischert. 1997. Asymptotically fast computation of subresultants. In ISSAC'97. ACM, 233–240.
[38] G. Schulz. 1933. Iterative Berechnung der reziproken Matrix. Zeitschrift für Angewandte Mathematik und Mechanik 13, 1 (1933), 57–59.
[39] V. Shoup. 1994. Fast construction of irreducible polynomials over finite fields. J. Symbolic Comput. 17, 5 (1994), 371–391.
[40] V. Shoup. 2018. NTL: A Library for doing Number Theory, version 11.3.2. https://www.shoup.net/ntl/.
[41] A. Storjohann. 2003. High-order lifting and integrality certification. J. Symbolic Comput. 36, 3-4 (2003), 613–648.
[42] A. Storjohann. 2006. Notes on computing minimal approximant bases. In Challenges in Symbolic Computation Software (Dagstuhl Seminar Proceedings).
[43] V. Strassen. 1969. Gaussian elimination is not optimal. Numer. Math. 13 (1969), 354–356.
[44] The CADO-NFS Development Team. 2017. CADO-NFS, An implementation of the number field sieve algorithm. Release 2.3.0, http://cado-nfs.gforge.inria.fr/.
[45] The FFLAS-FFPACK Group. 2019. FFLAS-FFPACK: Finite Field Linear Algebra Subroutines / Package, version 2.3.2. http://github.com/linbox-team/fflas-ffpack.
[46] The LinBox Group. 2018. LinBox: Linear algebra over black-box matrices, version 1.5.3. https://github.com/linbox-team/linbox/.
[47] M. Van Barel and A. Bultheel. 1992. A general module theoretic framework for vector M-Padé and matrix rational interpolation. Numer. Algorithms 3 (1992), 451–462.
[48] G. Villard. 1997. Further analysis of Coppersmith's block Wiedemann algorithm for the solution of sparse linear systems. In ISSAC'97. ACM, 32–39.
[49] G. Villard. 2018. On computing the resultant of generic bivariate polynomials. In ISSAC'18. ACM, 391–398.
[50] A. Waksman. 1970. On Winograd's Algorithm for Inner Products. IEEE Trans. Comput. C-19 (1970), 360–361.
[51] S. Winograd. 1968. A New Algorithm for Inner Product. IEEE Trans. Comput. C-17, 7 (1968), 693–694.
[52] S. Winograd. 1971. On multiplication of 2 × 2 matrices. Linear Algebra Appl. 4 (1971), 381–388.
[53] W. Zhou and G. Labahn. 2012. Efficient algorithms for order basis computation. J. Symbolic Comput. 47, 7 (2012), 793–819.
[54] W. Zhou, G. Labahn, and A. Storjohann. 2012. Computing Minimal Nullspace Bases. In ISSAC'12. ACM, 366–373.