Faster integer multiplication using short lattice vectors
aa r X i v : . [ c s . S C ] F e b FASTER INTEGER MULTIPLICATIONUSING SHORT LATTICE VECTORS
DAVID HARVEY AND JORIS VAN DER HOEVEN
Abstract.
We prove that n -bit integers may be multiplied in O ( n log n log ∗ n )bit operations. This complexity bound had been achieved previously by severalauthors, assuming various unproved number-theoretic hypotheses. Our proofis unconditional, and depends in an essential way on Minkowski’s theoremconcerning lattice vectors in symmetric convex sets. Introduction
Let M ( n ) denote the number of bit operations required to multiply two n -bitintegers, where “bit operations” means the number of steps on a deterministicTuring machine with a fixed, finite number of tapes [18] (our results also hold inthe Boolean circuit model). Let log ∗ x denote the iterated natural logarithm, i.e.,log ∗ x := min { j ∈ N : log ◦ j x } , where log ◦ j x := log · · · log x (iterated j times).The main result of this paper is an algorithm achieving the following bound. Theorem 1.1.
We have M ( n ) = O ( n log n log ∗ n ) . The first complexity bound for M ( n ) of the form O ( n log n K log ∗ n ) was estab-lished by F¨urer [8, 9], for an unspecified constant K >
1. His algorithm reducesa multiplication of size n to many multiplications of size exponentially smallerthan n , which are then handled recursively. The number of recursion levels islog ∗ n + O (1), and the constant K measures the “expansion factor” at each recur-sion level.The first explicit value for K , namely K = 8, was given by Harvey, van derHoeven and Lecerf [13]. Their method is somewhat different to F¨urer, but stillcarries out an exponential size reduction at each recursion level. One may thinkof the constant K = 8 as being built up of three factors of 2, each coming froma different source.The first factor of 2 arises from the need to perform both forward and inverseDFTs (discrete Fourier transforms) at each recursion level. This is a feature com-mon to all of the post-F¨urer algorithms, suggesting that significantly new ideas willbe needed to do any better than O ( n log n log ∗ n ).The second factor of 2 arises from coefficient growth: a product of polynomialswith k -bit integer coefficients has coefficients with at least 2 k bits. This factor of 2also seems difficult to completely eliminate, although Harvey and van der Hoevenhave recently made some progress [12]: they achieve K = 4 √ ≈ .
66 by arrangingthat, in effect, the coefficient growth only occurs at every second recursion level.This was the best known unconditional value of K prior to the present paper. Harvey was supported by the Australian Research Council (grants DP150101689 andFT160100219).
The final factor of 2 occurs because the algorithm works over C : when multiply-ing complex coefficients with say β significant bits, the algorithm first computes afull 2 β -bit product, and then truncates to β bits. More precisely, after splitting the β -bit inputs into m exponentially smaller chunks, and encoding them into polyno-mials of degree m , the algorithm must compute the full product of degree 2 m , eventhough essentially only m coefficients are needed to resolve β significant bits of theproduct. Again, this factor of 2 has been the subject of a recent attack: Harveyhas shown [10] that it is possible to work modulo a polynomial of degree only m ,at the expense of increasing the working precision by a factor of 3 /
2. This leads toan integer multiplication algorithm achieving K = 6.Another way of attacking this last factor of 2 is to replace the coefficient ring C by a finite ring Z /q Z for a suitable integer q . By choosing q with some specialstructure, it may become possible to convert a multiplication modulo q directly intoa polynomial multiplication modulo some polynomial of degree m , rather than 2 m .Three algorithms along these lines have been proposed.First, Harvey, van der Hoeven and Lecerf suggested using Mersenne primes ,i.e., primes of the form q = 2 k −
1, where k is itself prime [13, § Z /q Z to multiplication in Z [ y ] / ( y m − m is a power of two.Because k is not divisible by m , the process of splitting an element of Z /q Z into m chunks is somewhat involved, and depends on a variant of the Crandall–Faginalgorithm [7].Covanov and Thom´e [6] later proposed using generalised Fermat primes , i.e.,primes of the form q = r m + 1, where m is a power of two and r is a small eveninteger. Here, multiplication in Z /q Z is converted to multiplication in Z [ y ] / ( y m +1).The splitting procedure consists of rewriting an element of Z /q Z in base r , via fastradix-conversion algorithms.Finally, Harvey and van der Hoeven [11] proposed using FFT primes , i.e., primesof the form q = a · k + 1, where a is small. They reduce multiplication in Z /q Z tomultiplication in Z [ y ] / ( y m + a ) via a straightforward splitting of the integers into m chunks, where m is a power of two. Here the splitting process is trivial, as k maybe chosen to be divisible by m .These three algorithms all achieve K = 4, subject to plausible but unprovedconjectures on the distribution of the relevant primes. Unfortunately, in all threecases, it is not even known that there are infinitely many primes of the requiredform, let alone that there exist a sufficiently high density of them to satisfy therequirements of the algorithm.The main technical novelty of the present paper is a splitting procedure thatworks for an almost arbitrary modulus q . The core idea is to introduce an al-ternative representation for elements of Z /q Z : we represent them as expressions a + a θ + · · · + a m − θ m − , where θ is some fixed 2 m -th root of unity in Z /q Z , andwhere the a i are small integers, of size roughly q /m . Essentially the only restrictionon q is that Z /q Z must contain an appropriate 2 m -th root of unity. We will seethat Linnik’s theorem is strong enough to construct suitable such moduli q .In Section 2 we show that the cost of multiplication in this representation isonly a constant factor worse than for the usual representation. The key ingredientis Minkowski’s theorem on lattice vectors in symmetric convex sets. We also givealgorithms for converting between this representation and the standard representa-tion. The conversions are not as fast as one might hope — in particular, we do not ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 3 know how to carry them out in quasilinear time — but surprisingly this turns outnot to affect the overall complexity, because in the main multiplication algorithmwe perform the conversions only infrequently.Then in Sections 3 and 4 we prove Theorem 1.1, using an algorithm that isstructurally very similar to [11]. We make no attempt to minimise the impliedbig- O constant in Theorem 1.1; our goal is to give the simplest possible proof ofthe asymptotic bound, without any regard for questions of practicality.An interesting question is whether it is possible to combine the techniques of thepresent paper with those of [12] to obtain an algorithm achieving K = 2 √ ≈ . F p [ x ]. However, this is not so interesting, because anunconditional proof of the bound corresponding to K = 4 in the polynomial caseis already known [12].Throughout the paper we use the following notation. We write lg n := ⌈ log n ⌉ for n >
2, and for convenience put lg 1 := 1. We define M SS ( n ) = Cn lg n lg lg n ,where C > n -bit integers in at most M SS ( n ) bit operations [19]. This function satisfies n M SS ( m ) M SS ( nm ) for any n, m >
1, and also M SS ( dm ) = O ( M SS ( m )) forfixed d . An n -bit integer may be divided by m -bit integer, producing quotient andremainder, in time O ( M SS (max( n, m ))) [20, Ch. 9]. We may transpose an n × m array of objects of bit size b in O ( bnm lg min( n, m )) bit operations [4, Appendix].Finally, we occasionally use Xylouris’s refinement of Linnik’s theorem [21], whichstates that for any relatively prime positive integers a and n , the least prime in thearithmetic progression p = a (mod n ) satisfies p = O ( n . ).2. θ -representations Throughout this section, fix an integer q > m such that(2.1) m log q (lg lg q ) , or equivalently, q /m > (lg lg q ) , and assume we are given some θ ∈ Z /q Z such that θ m = − F = F + F y + · · · + F m − y m − ∈ Z [ y ] / ( y m + 1), define k F k :=max i | F i | . This norm satisfies k F G k m k F kk G k for any F, G ∈ Z [ y ] / ( y m + 1). Definition 2.1.
Let u ∈ Z /q Z . A θ -representation for u is a polynomial U ∈ Z [ y ] / ( y m + 1) such that U ( θ ) = u (mod q ) and k U k mq /m . Example . Let m = 4 and q = 3141592653589793238462833 ,θ = 2542533431566904450922735 (mod q ) ,u = 2718281828459045235360288 (mod q ) . The coefficients in a θ -representation must not exceed mq /m ≈ .
46. Twoexamples of θ -representations for u are U ( y ) = − y + 951670 y − y − ,U ( y ) = − y + 1849981 y − y + 1317423 . DAVID HARVEY AND JORIS VAN DER HOEVEN
By (2.1), the number of bits required to store U ( y ) is at most m (cid:0) log ( mq /m ) + O (1) (cid:1) = lg q + O ( m lg m ) = (cid:18) O (1)lg lg q (cid:19) lg q, so a θ -representation incurs very little overhead in space compared to the standardrepresentation by an integer in the interval 0 x < q .Our main tool for working with θ -representations is the reduction algorithm stated in Lemma 2.9 below. Given a polynomial F ∈ Z [ y ] / ( y m + 1), whose coeffi-cients are up to about twice as large as allowed in a θ -representation, the reductionalgorithm computes a θ -representation for F ( θ ) (up to a certain scaling factor, dis-cussed further below). The basic idea of the algorithm is to precompute a nonzeropolynomial P ( y ) such that P ( θ ) = 0 (mod q ), and then to subtract an appropriatemultiple of P ( y ) from F ( y ) to make the coefficients small.After developing the reduction algorithm, we are able to give algorithms forbasic arithmetic on elements of Z /q Z given in θ -representation (Proposition 2.15),a more general reduction algorithm for inputs of arbitrary size (Proposition 2.17),and algorithms for converting between standard and θ -representation (Proposition2.18 and Proposition 2.20).We begin with two results that generate certain precomputed data necessary forthe main reduction step. Lemma 2.3. In q o (1) bit operations, we may compute a nonzero polynomial P ∈ Z [ y ] / ( y m + 1) such that P ( θ ) = 0 (mod q ) and k P k q /m .Proof. We first establish existence of a suitable P ( y ). Let θ i denote a lift of θ i to Z ,and consider the lattice Λ ⊂ Z m spanned by the rows of the m × m integer matrix A = q − θ · · · − θ − θ m − · · · Every vector ( a , . . . , a m − ) ∈ Λ satisfies the equation a + · · · + a m − θ m − = 0(mod q ). The volume of the fundamental domain of Λ is det A = q . The volumeof the closed convex symmetric set Σ := {| a i | q /m } ⊂ R m is (2 q /m ) m = 2 m q ,so by Minkowski’s theorem (see for example [14, Ch. V, Thm. 3]), there exists anonzero vector ( a , . . . , a m − ) in Λ ∩ Σ. The corresponding polynomial P ( y ) := a + · · · + a m − y m − then has the desired properties.To actually compute P ( y ), we simply perform a brute-force search. By (2.1)there are at most (2 q /m + 1) m (3 q /m ) m = 3 m q < q o (1) candidates to test.Enumerating them in lexicographical order, we can easily evaluate P ( θ ) (mod q )in an average of O (lg q ) bit operations per candidate. (cid:3) Example . Continuing Example 2.2, the coefficients of P ( y ) must not exceed q /m ≈ .
36. A suitable polynomial P ( y ) is given by P ( y ) = − y − y + 1136523 y − . Remark . The computation of P ( y ) is closely related to the problem of findingan element of small norm in the ideal of the ring Z [ ζ m ] generated by q and ζ m − θ ,where ζ m denotes a primitive 2 m -th root of unity. ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 5
Remark . The poor exponential-time complexity of Lemma 2.3 can probably beimproved, by taking advantage of more sophisticated lattice reduction or shortestvector algorithms, but we were not easily able to extract a suitable result from theliterature. For example, LLL is not guaranteed to produce a short enough vector[15], and the Micciancio–Voulgaris exact shortest vector algorithm [16] solves theproblem for the Euclidean norm rather than the uniform norm. In any case, thishas no effect on our main result.
Lemma 2.7.
Assume that P ( y ) has been precomputed as in Lemma 2.3. Let r bethe smallest prime exceeding m q /m such that r ∤ q and such that P ( y ) is invertiblein ( Z /r Z )[ y ] / ( y m + 1) . Then r = O ( m q /m ) , and in q o (1) bit operations we maycompute r and a polynomial J ∈ Z [ y ] / ( y m + 1) such that J ( y ) P ( y ) = 1 (mod r ) and k J k r .Proof. Let R ∈ Z be the resultant of P ( y ) (regarded as a polynomial in Z [ y ]) and y m + 1. The primes r dividing R are exactly the primes for which P ( y ) fails to beinvertible in ( Z /r Z )[ y ] / ( y m + 1). Therefore our goal is to find a prime r > m q /m such that r ∤ Rq .Since m is a power of two, y m +1 is a cyclotomic polynomial and hence irreduciblein Q [ y ]. Thus y m + 1 and P ( y ) have no common factor, and so R = 0. Also, wehave R = Q α P ( α ) where α runs over the complex roots of y m + 1. These roots alllie on the unit circle, so | P ( α ) | m k P k mq /m , and hence by (2.1) we obtain | Rq | ( mq /m ) m q = m m q < q .On the other hand, the prime number theorem (in the form P p
Lemma 2.9.
Assume that P ( y ) , r and J ( y ) have been precomputed as in Lem-mas 2.3 and 2.7. Given as input F ∈ Z [ y ] / ( y m + 1) with k F k m ( q /m ) , wemay compute a θ -representation for F ( θ ) /r (mod q ) in O ( M SS (lg q )) bit operations.Proof. We first compute the “quotient” Q := F J (mod r ), normalised so that k Q k r/
2. This is done by means of Kronecker substitution [20, Ch. 8], i.e., wepack the polynomials F ( y ) and J ( y ) into integers, multiply the integers, unpack the DAVID HARVEY AND JORIS VAN DER HOEVEN result, and reduce the result modulo y m +1 and modulo r . The packed integers haveat most m (lg k F k + lg r + lg m ) bits, where the lg m term accounts for coefficientgrowth in Z [ y ]. By (2.1) and Lemma 2.7, this simplifies to O (lg q ) bits, so the integermultiplication step costs O ( M SS (lg q )) bit operations. This bound also covers thecost of the reductions modulo r .Next we compute the product QP , again using Kronecker substitution, at acost of O ( M SS (lg q )) bit operations. Since k Q k r/ k P k q /m , we have k QP k rmq /m .By construction of J we have QP = F (mod r ). In particular, all the coefficientsof F − QP ∈ Z [ y ] / ( y m + 1) are divisible by r . The last step is to compute the“remainder” G := ( F − QP ) /r ; again, this step costs O ( M SS (lg q )) bit operations.Since r > m q /m , we have k G k k F k r + k QP k r m ( q /m ) m q /m + mq /m mq /m . Finally, since P ( θ ) = 0 (mod q ), and all arithmetic throughout the algorithm hasbeen performed modulo y m + 1, we see that G ( θ ) = F ( θ ) /r (mod q ). (cid:3) Using the above reduction algorithm, we may give preliminary addition andmultiplication algorithms for elements of Z /q Z in θ -representation. Lemma 2.10.
Assume that P ( y ) , r and J ( y ) have been precomputed as in Lem-mas 2.3 and 2.7. Given as input θ -representations for u, v ∈ Z /q Z , we may compute θ -representations for uv/r and ( u ± v ) /r in O ( M SS (lg q )) bit operations.Proof. Let the θ -representations be given by U, V ∈ Z [ y ] / ( y m + 1). We may com-pute F ∗ := U V in Z [ y ] / ( y m + 1) using Kronecker substitution in O ( M SS (lg q ))bit operations, and F ± := U ± V in O (lg q ) bit operations. Note that k F ∗ k m k U kk V k m ( q /m ) , and k F ± k k U k + k V k mq /m m ( q /m ) , so wemay apply Lemma 2.9 to obtain the desired θ -representations. (cid:3) Example . Continuing Example 2.2, we walk through an example of computinga product of elements in θ -representation. Let u = 1414213562373095048801689 (mod q ) ,v = 1732050807568877293527447 (mod q ) . Suppose we are given as input the θ -representations U ( y ) = 3740635 y + 3692532 y − y + 4285386 ,V ( y ) = 4629959 y − y − y − . We first compute the product of U ( y ) and V ( y ) modulo y m + 1: F ( y ) = U ( y ) V ( y ) = 10266868543625 y − y − y + 26582459129078 . We multiply F ( y ) by J ( y ) and reduce modulo r to obtain the quotient Q ( y ) = 3932274 y − y + 20464841 y − . Then the remainder( F ( y ) − P ( y ) Q ( y )) /r = 995963 y − y + 398819 y + 777998is a θ -representation for uv/r (mod q ). ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 7
The following precomputation will assist in eliminating the spurious 1 /r factorappearing in Lemmas 2.9 and 2.10. Lemma 2.12.
Assume that P ( y ) , r and J ( y ) have been precomputed as in Lem-mas 2.3 and 2.7. In q o (1) bit operations, we may compute a polynomial D ∈ Z [ y ] / ( y m + 1) such that k D k mq /m and D ( θ ) = r (mod q ) .Proof. We may easily compute the totient function ϕ ( q ) in q o (1) bit operations,by first factoring q . Since ( r, q ) = 1, we have r − ( ϕ ( q ) − = r (mod q ). Repeat-edly using the identity r − i − = ( r − i · /r , we may compute θ -representations for r − , r − , . . . , r − ( ϕ ( q ) − by successively applying Lemma 2.10. (cid:3) Remark . Assuming the factorisation of q is known (which will always be thecase in the application in Section 3), the complexity of Lemma 2.12 may be im-proved to O ( M SS (lg q ) lg q ) bit operations by using a modified “repeated squaring”algorithm. Example . Continuing Example 2.2, we may take D ( y ) = − y − y + 2036309 y − . Henceforth we write P ( q, m, θ ) for the tuple ( P ( y ) , r, J ( y ) , D ( y )) of precomputeddata generated by Lemmas 2.3, 2.7, and 2.12. Given q , m and θ as input, the aboveresults show that we may compute P ( q, m, θ ) in q o (1) bit operations. With theseprecomputations out of the way, we may state complexity bounds for the mainoperations on θ -representations. Proposition 2.15.
Assume that P ( q, m, θ ) has been precomputed. Given as input θ -representations for u, v ∈ Z /q Z , we may compute θ -representations for uv and u ± v in O ( M SS (lg q )) bit operations.Proof. For the product, we first use Lemma 2.10 to compute a θ -representation for uv/r (mod q ), and then we use Lemma 2.10 again to multiply by D ( y ), to obtaina θ -representation for ( uv/r )( r ) /r = uv (mod q ). The sum and difference arehandled similarly. (cid:3) Remark . We suspect that the complexity bound for u ± v can be improvedto O (lg q ), but we do not currently know how to achieve this. This question seemsclosely related to Remark 2.22 below. Proposition 2.17.
Assume that P ( q, m, θ ) has been precomputed. Given as inputa polynomial F ∈ Z [ y ] / ( y m + 1) (with no restriction on k F k ), we may computea θ -representation for F ( θ ) (mod q ) in time O ( ⌈ m lg k F k / lg q ⌉ M SS (lg q )) .Proof. Let b := lg ⌈ q /m ⌉ and n := ⌈ m lg k F k / lg q ⌉ , so that2 nb > ( q /m ) n > ( q /m ) m lg k F k / lg q = 2 lg k F k (2 log q/ lg q ) > lg k F k . We may therefore decompose the coefficients of F into n chunks of b bits, i.e., wemay compute polynomials F , . . . , F n − ∈ Z [ y ] / ( y m + 1) such that F = F + 2 b F + · · · + 2 ( n − b F n − and k F i k b q /m . (This step implicitly requires an arraytransposition of cost O ( bmn lg m ) = O ( n lg q lg lg q ).) Now we use Proposition 2.15repeatedly to compute a θ -representation for F via Horner’s rule, i.e., first wecompute a θ -representation for 2 b F n − + F n − , then for 2 b (2 b F n − + F n − ) + F n − ,and so on. (cid:3) DAVID HARVEY AND JORIS VAN DER HOEVEN
Proposition 2.18.
Assume that P ( q, m, θ ) has been precomputed. Given as inputan element u ∈ Z /q Z in standard representation, we may compute a θ -representationfor u in O ( m M SS (lg q )) bit operations.Proof. Simply apply Proposition 2.17 to the constant polynomial F ( y ) = u , notingthat k F k q . (cid:3) Remark . A corollary of Proposition 2.18 is that every u ∈ Z /q Z admits a θ -representation. It would be interesting to have a direct proof of this fact that doesnot rely on the reduction algorithm. A related question is whether it is possible totighten the bound in the definition of θ -representation from mq /m to q /m , or even q /m . We do not know whether such a representation exists for all u ∈ Z /q Z . Proposition 2.20.
Given as input an element u ∈ Z /q Z in θ -representation, wemay compute the standard representation for u in O ( m M SS (lg q )) bit operations.Proof. Let U ∈ Z [ y ] / ( y m + 1) be the input polynomial. The problem amounts toevaluating U ( θ ) in Z /q Z . Again we may simply use Horner’s rule. (cid:3) Remark . In both Proposition 2.18 and Proposition 2.20, the input and outputhave bit size O (lg q ), but the complexity bounds given are not quasilinear in lg q .It is possible to improve on the stated bounds, but we do not know a quasilineartime algorithm for the conversion in either direction. Remark . In the reduction algorithm, the reader may wonder why we go tothe trouble of introducing the auxiliary prime r . Why not simply precompute anapproximation to a real inverse for P ( y ), i.e., the inverse in R [ y ] / ( y m + 1), anduse this to clear out the high-order bits of each coefficient of the dividend? Inother words, why not replace the Montgomery-style division with the more naturalBarrett-style division [2]?The reason is that we cannot prove tight enough bounds on the size of thecoefficients of this inverse: it is conceivable that P ( y ) might accidentally take ona very small value near one of the complex roots of y m + 1, or equivalently, thatthe resultant R in the proof of Lemma 2.7 might be unusually small. For the samereason, we cannot use a more traditional 2-adic Montgomery inverse to clear outthe low-order bits of the dividend, because again P ( y ) may take a 2-adically smallvalue near one of the 2-adic roots of y m + 1, or equivalently, the resultant R mightbe divisible by an unusually large power of 2.3. Integer multiplication: the recursive step
In this section we present a recursive routine
Transform with the followinginterface. It takes as input a (sufficiently large) power-of-two transform length L ,a prime p = 1 (mod L ), a prime power q = p α such that(3.1) lg L lg q L lg lg L, a principal L -th root of unity ζ ∈ Z /q Z (i.e., an L -th root of unity whose reductionmodulo p is a primitive L -th root of unity in the field Z /p Z ), certain precomputeddata depending on L , q and ζ (see below), and a polynomial F ∈ ( Z /q Z )[ x ] / ( x L − F with respect to ζ , that is, the vectorˆ F := ( F (1) , F ( ζ ) , . . . , F ( ζ L − )) ∈ ( Z /q Z ) L . The coefficients of both F and ˆ F are given in standard representation. ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 9
The precomputed data consists of the tuple P ( q, m, θ ) defined in Section 2,where m and θ are defined as follows.First, (3.1) implies that lg q > (lg lg L ) lg lg lg L for large L , so we may take m to be the unique power of two lying in the interval(3.2) lg q (lg lg L ) lg lg lg L m < q (lg lg L ) lg lg lg L .
Observe that (2.1) is certainly satisfied for this choice of m (for large enough L ),as (3.1) implies that lg lg L ∼ lg lg q .Next, note that 2 m | L , because (3.1) and (3.2) imply that m = o (lg L ) = o ( L );therefore we may take θ := ζ L/ m , so that θ m = ζ L/ = − α is to give us enough control overthe bit size of q , to compensate for the fact that Linnik’s theorem does not give ussufficiently fine control over the bit size of p (see Lemma 3.5).Our implementation of Transform uses one of two algorithms, depending onthe size of L . If L is below some threshold, say L , then it uses any convenientbase-case algorithm. Above this threshold, it reduces the given DFT problem toa collection of exponentially smaller DFTs of the same type, via a series of reduc-tions that may be summarised as follows.(i) Use the conversion algorithms from Section 2 to reduce to a transform over Z /q Z where the input and output coefficients are given in θ -representation.(During steps (ii) and (iii) below, all elements of Z /q Z are stored and manip-ulated entirely in θ -representation.)(ii) Reduce the “long” transform of length L over Z /q Z to many “short” trans-forms of exponentially small length S := 2 (lg lg L ) over Z /q Z , via the Cooley–Tukey decomposition.(iii) Reduce each short transform from step (ii) to a product in ( Z /q Z )[ x ] / ( x S − S , using Bluestein’s algorithm.(iv) Use the definition of θ -representation to reinterpret each product from (iii) asa product in Z [ x, y ] / ( x S − , y m + 1), where the coefficients in Z are exponen-tially smaller than the original coefficients in Z /q Z .(v) Embed each product from (iv) into ( Z /q ′ Z )[ x, y ] / ( x S − , y m +1) for a suitableprime power q ′ that is exponentially smaller than q , and large enough toresolve the coefficients of the products over Z .(vi) Reduce each product from (v) to a collection of forward and inverse DFTs oflength S over Z /q ′ Z , and recurse.The structure of this algorithm is very similar to that of [11]. The main dif-ference is that it is not necessary to explicitly split the coefficients into chunks instep (iv); this happens automatically as a consequence of storing the coefficients in θ -representation. In effect, the splitting (and reassembling) work has been shuntedinto the conversions in step (i).We now consider each of the above steps in more detail. We write T ( L, q ) forthe running time of
Transform . We always assume that L is increased whenevernecessary to accommodate statements that hold only for large L . Step (i) — convert between representations.
Let T long ( L, q ) denote the timerequired to compute a DFT of length L over Z /q Z with respect to ζ , assuming thatthe coefficients of the input F and the output ˆ F are given in θ -representation, andassuming that P ( q, m, θ ) is known. Lemma 3.1.
We have T ( L, q ) < T long ( L, q ) + O ( L lg L lg q ) .Proof. We first convert F from standard to θ -representation using Proposition 2.18;we then compute ˆ F from F (working entirely in θ -representation); at the end,we convert ˆ F back to standard representation using Proposition 2.20. By (3.1)and (3.2), the total cost of the conversions is O ( Lm M SS (lg q )) = O (cid:18) L lg q (lg lg L ) lg lg lg L lg q lg lg q lg lg lg q (cid:19) = O (cid:18) L lg L lg lg L (lg lg L ) lg lg lg L lg q lg lg L lg lg lg L (cid:19) = O ( L lg L lg q ) . (cid:3) Henceforth all elements of Z /q Z are assumed to be stored in θ -representation,and we will always use Proposition 2.15 to perform arithmetic operations on suchelements in O ( M SS (lg q )) bit operations. Step (ii) — reduce to short DFTs.
Let S := 2 (lg lg L ) . Given as input polynomials F , . . . , F L/S ∈ ( Z /q Z )[ x ] / ( x S −
1) (presented sequentially on tape), let T short ( L, q )denote the time required to compute the transforms ˆ F , . . . , ˆ F L/S ∈ ( Z /q Z ) S withrespect to the principal S -th root of unity ω := ζ L/S . (Here and below, we continueto assume that P ( q, m, θ ) is known.). Lemma 3.2.
We have T long ( L, q ) < lg L (lg lg L ) T short ( L, q ) + O ( L lg L lg q ) .Proof. Let d := ⌊ lg L/ lg S ⌋ , so that lg L = d lg S + d ′ where 0 d ′ < lg S . Applyingthe Cooley–Tukey method [5] to the factorisation L = S d d ′ , the given transformof length L may be decomposed into d layers, each consisting of L/S transforms oflength S (with respect to ω ), followed by d ′ layers, each consisting of L/ O ( L ) multiplicationsby “twiddle factors” in Z /q Z , which are given by certain powers of ζ . (For furtherdetails of the Cooley–Tukey decomposition, see for example [13, § O (( d + d ′ ) L M SS (lg q )) = O (cid:18)(cid:18) lg L (lg lg L ) + (lg lg L ) (cid:19) L lg q lg lg q lg lg lg q (cid:19) = O (cid:18) lg L (lg lg L ) L lg q lg lg L lg lg lg L (cid:19) = O ( L lg L lg q ) . This bound also covers the cost of the length 2 transforms (‘butterflies’), each ofwhich requires one addition and one subtraction in Z /q Z .In the Turing model, we must also account for the cost of rearranging data sothat the inputs for each layer of short DFTs are stored sequentially on tape. Thecost per layer is O ( L lg S lg q ) bit operations, so O ( L lg L lg q ) altogether (see [13, § (cid:3) Step (iii) — reduce to short convolutions.
Given polynomials G , . . . , G L/S , H ∈ ( Z /q Z )[ x ] / ( x S −
1) as input, let M short ( L, q ) denote the time required to computethe products G H, . . . , G
L/S H . Lemma 3.3.
We have T short ( L, q ) < M short ( L, q ) + O ( L (lg lg L ) lg q ) . ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 11
Proof.
We use Bluestein’s method [3], which reduces the the problem of computingthe DFT of F ∈ ( Z /q Z )[ x ] / ( x S −
1) to the problem of computing the product ofcertain polynomials
G, H ∈ ( Z /q Z )[ x ] / ( x S − O ( S ) auxiliary multiplicationsin Z /q Z (for further details see [13, § G depends on F and ζ , but H depends only on ζ . The total cost of the auxiliary multiplications is O (( L/S ) S M SS (lg q )) = O ( L lg q lg lg q lg lg lg q ) = O ( L (lg lg L ) lg q ) . (cid:3) Step (iv) — reduce to bivariate products over Z . Given as input polynomials˜ G , . . . , ˜ G L/S , ˜ H ∈ Z [ x, y ] / ( x S − , y m + 1), all whose of coefficients are boundedin absolute value by mq /m , let M bivariate ( L, q ) denote the cost of computing theproducts ˜ G ˜ H, . . . , ˜ G L/S ˜ H . Lemma 3.4.
We have M short ( L, q ) < M bivariate ( L, q ) + O ( L (lg lg L ) lg q ) .Proof. We are given as input polynomials G , . . . , G L/S , H ∈ ( Z /q Z )[ x ] / ( x S − θ -representation, we may immediately reinter-pret them as polynomials ˜ G , . . . , ˜ G L/S , ˜ H ∈ Z [ x, y ] / ( x S − , y m + 1), with coef-ficients bounded by mq /m . By definition of θ -representation, we have ˜ H ( x, θ ) = H ( x ) (mod q ), and similarly for the G i .After computing the products ˜ G i ˜ H for i = 1 , . . . , L/S , suppose that( ˜ G i ˜ H )( x, y ) = S − X j =0 A ij ( y ) x j , A ij ∈ Z [ y ] / ( y m + 1) . Then we have ( G i H )( x ) = ( ˜ G i ˜ H )( x, θ ) = P j A ij ( θ ) x j (mod q ) for each i . There-fore, to compute the desired products G i H with coefficients in θ -representation, itsuffices to apply Proposition 2.17 to each A ij , to compute θ -representations for allof the A ij ( θ ).Let us estimate the cost of the invocations of Proposition 2.17. We have k A ij k Sm ( mq /m ) = Sm ( q /m ) , solg k A ij k qm + lg S + 3 lg m < qm + (lg lg L ) + O (lg lg L ) . From (3.2) we have lg qm > (lg lg L ) lg lg lg L , so for large L ,(3.3) lg k A ij k < (cid:18) L (cid:19) lg qm . The cost of applying Proposition 2.17 for all A ij is thus O (cid:18) ( L/S ) S (cid:24) m lg k A ij k lg q (cid:25) M SS (lg q ) (cid:19) = O ( L M SS (lg q )) = O ( L (lg lg L ) lg q ) . (cid:3) Step (v) — Reduce to bivariate products over Z /q ′ Z . Let p ′ be the smallestprime such that p ′ = 1 (mod S ); by Linnik’s theorem we have p ′ = O ( S . ). Put q ′ := ( p ′ ) α ′ where α ′ := (cid:24)(cid:18) L (cid:19) lg qm (cid:30) lg ⌊ p ′ / ⌋ (cid:25) . We have the following bounds for q ′ . Lemma 3.5.
Let A ij be as in the proof of Lemma 3.4, for i = 1 , . . . , L/S and j = 0 , . . . , S − . Then q ′ > k A ij k and lg q ′ < (cid:18) O (1)lg lg lg L (cid:19) lg qm . Proof.
In what follows, we frequently use the fact that lg qm ≍ (lg lg L ) lg lg lg L (see (3.2)). Now, observe that log q ′ = α ′ log p ′ > α ′ lg ⌊ p ′ / ⌋ , so by (3.3),log q ′ > (cid:18) L (cid:19) lg qm > (cid:18) L (cid:19) lg qm + 2 > lg k A ij k + 2 . Thus q ′ > k A ij k . For the other direction, since lg p ′ ≍ lg S = (lg lg L ) , we havelg q ′ α ′ lg p ′ (cid:16) L (cid:17) lg qm lg ⌊ p ′ / ⌋ + 1 lg p ′ < (cid:18) O (1)lg lg lg L (cid:19) lg qm · lg p ′ lg ⌊ p ′ / ⌋ , and lg p ′ / lg ⌊ p ′ / ⌋ < O (1) / lg p ′ < O (1) / (lg lg L ) . (cid:3) Now, given as input polynomials g , . . . , g L/S , h ∈ ( Z /q ′ Z )[ x, y ] / ( x S − , y m + 1),let M ′ bivariate ( L, q ) denote the cost of computing the products g h, . . . , g L/S h , whereall input and output coefficients in Z /q ′ Z are in standard representation. Lemma 3.6.
We have M bivariate ( L, q ) < M ′ bivariate ( L, q ) + O ( L lg q ) .Proof. We may locate p ′ by testing S +1 , S +1 , . . . , in S O (1) = 2 O ((lg lg L ) ) = O ( L )bit operations, and we may easily compute α ′ and q ′ within the same time bound.Now, given as input ˜ G , . . . , ˜ G L/S , ˜ H ∈ Z [ x, y ] / ( x S − , y m + 1), we first convertthem (in linear time) to polynomials g , . . . , g L/S , h ∈ ( Z /q ′ Z )[ x, y ] / ( x S − , y m +1),and then multiply them in the latter ring. The bound q ′ > k A ij k in Lemma 3.5shows that the products over Z may be unambiguously recovered from those over Z /q ′ Z ; again, this lifting can be done in linear time. (cid:3) Step (vi) — reduce to DFTs over Z /q ′ Z . In this step we will call
Transform recursively to handle certain transforms of length S over Z /q ′ Z . To check thatthese calls are permissible, we must verify the precondition corresponding to (3.1),namely lg S lg q ′ S lg lg S . The first inequality is clear since q ′ > p ′ > S .The second inequality follows from (3.2), Lemma 3.5, and the observation thatlg S lg lg S > (lg lg L ) lg lg lg L . Lemma 3.7.
We have M ′ bivariate ( L, q ) < (cid:0) LS + 1 (cid:1) m T ( S, q ′ ) + O ( L (lg lg L ) lg q ) .Proof. We start by computing various data needed for the recursive calls. Wemay compute a primitive S -th root of unity in Z /p ′ Z in ( p ′ ) O (1) = O ( L ) bit op-erations, and then Hensel lift it to a principal S -th root of unity ζ ′ ∈ Z /q ′ Z in(lg p ′ lg q ′ ) O (1) = O ( L ) bit operations. Just as before, we define m ′ to be the uniquepower of two in the interval(3.4) lg q ′ (lg lg S ) lg lg lg S m ′ < q ′ (lg lg S ) lg lg lg S , and set θ ′ := ( ζ ′ ) S/ m ′ . Using Lemmas 2.3, 2.7, and 2.12, we may compute P ( q ′ , m ′ , θ ′ ) in ( q ′ ) o (1) = 2 O ((lg lg L ) lg lg lg L ) = O ( L ) bit operations.Now suppose we wish to compute the products g h, . . . , g L/S h , for polynomials g , . . . , g L/S , h ∈ ( Z /q ′ Z )[ x, y ] / ( x S − , y m + 1). We use the following algorithm. ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 13
First we use
Transform to transform all
L/S +1 polynomials with respect to x ,that is, we compute g i (( ζ ′ ) j , y ) and h (( ζ ′ ) j , y ) as elements of ( Z /q ′ Z )[ y ] / ( y m +1), for i = 1 , . . . , L/S and j = 0 , . . . , S −
1. Since
Transform must be applied separatelyto every coefficient 1 , y, . . . , y m − , the total number of calls is ( L/S +1) m . Accessingthe coefficient of each y k also implies a number of array transpositions whose totalcost is O (( L/S ) Sm lg m lg q ′ ) = O ( L lg lg L lg q ).Next we compute the ( L/S ) S = L pointwise products g i (( ζ ′ ) j , y ) h (( ζ ′ ) j , y ).Using Kronecker substitution, each such product in ( Z /q ′ Z )[ y ] / ( y m + 1) costs O ( M SS (lg q )) bit operations, as m (lg q ′ + lg m ) = O (lg q ).Finally, we perform ( L/S ) m inverse transforms with respect to x . It is wellknown that these may be computed by the same algorithm as the forward transform,with ζ ′ replaced by ( ζ ′ ) − , followed by a division by S . The division may beaccomplished by simply multiplying through by S − (mod q ′ ); this certainly costsno more than the pointwise multiplication step. (cid:3) Corollary 3.8.
We have T ( L, q ) < lg L (lg lg L ) (cid:0) LS + 1 (cid:1) m T ( S, q ′ ) + O ( L lg L lg q ) .Proof. This follows immediately by chaining together Lemmas 3.1, 3.2, 3.3, 3.4,3.6, and 3.7. (cid:3)
Define T ( L ) := max q T ( L, q ) L lg L lg q , where the maximum is taken over all prime powers q satisfying (3.1). (For large L ,at least one such q always exists. For example, take α := 1 and take q = p to bethe smallest prime satisfying p = 1 (mod L ); then Linnik’s theorem implies that(3.1) holds for this q .) Proposition 3.9.
We have T ( L ) < (cid:16) O (1)lg lg lg L (cid:17) T (2 (lg lg L ) ) + O (1) .Proof. Dividing the bound in Corollary 3.8 by L lg L lg q yields T ( L, q ) L lg L lg q < (cid:18) SL (cid:19) m lg q ′ lg q · T ( S, q ′ ) S lg S lg q ′ + O (1) . Applying Lemma 3.5 and the estimate
S/L < O (1) / lg lg lg L yields T ( L, q ) L lg L lg q < (cid:18) O (1)lg lg lg L (cid:19) T ( S ) + O (1) . Taking the maximum over allowable q yields the desired bound. (cid:3) Corollary 3.10.
We have T ( L ) = O (4 log ∗ L ) .Proof. This follows by applying the “master theorem” [13, Prop. 8] to the recurrencein Proposition 3.9. Alternatively, it follows by the same method used to deduce[11, Cor. 3] from [11, Prop. 2]. The key point is that 2 (lg lg L ) is dominated bya “logarithmically slow” function of L , such as Φ( x ) := 2 (log log x ) (see [13, § (cid:3) Remark . When working with θ -representations, it is possible to multiply anelement of Z /q Z by any power of θ in linear time, by simply permuting the coeffi-cients. In other words, we have available “fast roots of unity” in the sense of F¨urer.Notice however that the algorithm presented in this section makes no use of thisfact! This raises the question of whether one can design an integer multiplicationalgorithm that uses these fast roots in the same way as in F¨urer’s original algorithm,instead of our appeal to Bluestein’s trick. This is indeed possible, and one doesobtain a bound of the form O ( n lg n K log ∗ n ). In this algorithm, instead of therunning time being dominated by the short transforms, it is dominated by thetwiddle factor multiplications, just as in F¨urer’s algorithm. Unfortunately, thisleads to a worse value of K , because of the implied constant in Proposition 2.15.4. Integer multiplication: the top level
The only complication in building an integer multiplication algorithm on top ofthe
Transform routine is ensuring that the precomputations do not dominate thecomplexity. We achieve this by means of a multivariate Kronecker-style splitting,as follows.
Proof of Theorem 1.1.
Suppose that we wish to compute the product of two n -bitintegers u and v , for some sufficiently large n . Let b := lg n and t := ⌈ n/b ⌉ / , sothat t b > n . Decompose u into t chunks of b bits, say u = u + u b + · · · + u t − ( t − b where 0 u i < b for each i , and similarly for v . Let U ( x , . . . , x ) := t − X i =0 · · · t − X i =0 u i + ti + ··· + t i x i · · · x i ∈ Z [ x , . . . , x ] , so that u = U (2 b , tb , . . . , t b ), and define V ( x , . . . , x ) similarly. The product U V has degree less than 2 t in each variable, so at most 64 t terms altogether,and its coefficients are bounded by 2 b n n . We may therefore reconstruct uv from U V using a straightforward overlap-add procedure (essentially, evaluating at(2 b , tb , . . . , t b )) in O ( t lg n ) = O ( n ) bit operations.Now we consider the computation of U V . Let L be the unique power of two inthe interval 2 t L < t ; then it suffices to compute the product U V in the ring Z [ x , . . . , x ] / ( x L − , . . . , x L − i = 0 , . . . ,
18, let q i be the least prime such that q i = 1 (mod L ) and q i = i (mod 19). Then the q i are distinct, and by Linnik’s theorem they sat-isfy q i = O ( L . ) = O ( t . ) = O ( n . ), so we may locate the q i in n . o (1) bit operations, and they certainly satisfy (3.1). Moreover, for large n we have q · · · q > L > t > ( n/ lg n ) / > n , so to compute U V it sufficesto compute
U V (mod q i ) for each i and then reconstruct U V by the Chinese re-mainder theorem. The cost of this reconstruction is (lg n ) o (1) bit operations percoefficient, so ( n/ lg n )(lg n ) o (1) = n (lg n ) o (1) altogether.We have therefore reduced to the problem of computing a product in the ring( Z /q i Z )[ x , . . . , x ] / ( x L − , . . . , x L −
1) for each i = 0 , . . . ,
19. To do this, weuse
Transform to perform forward DFTs of length L with respect to a suitableprimitive L -th root of unity in Z /q i Z , for each variable x , . . . , x successively; thenwe multiply pointwise in Z /q i Z ; finally we perform inverse DFTs and scale theresults. The necessary precomputations for each prime (finding a suitable root ofunity and computing the appropriate tuple P ( q i , m i , θ i )) require only q o (1) i = n . o (1) bit operation per prime. The total cost of the pointwise multiplications is n (lg n ) o (1) . The total number of calls to Transform for each prime is 12 L , so by ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 15
Corollary 3.10 we obtain M ( n ) = O ( L P i =0 T ( L, q i )) + n (lg n ) o (1) = O ( L P i =0 T ( L ) lg L lg q i ) + n (lg n ) o (1) = O (( n/ lg n )4 log ∗ L lg n lg n ) + n (lg n ) o (1) = O ( n lg n log ∗ n ) . (cid:3) References
1. T. M. Apostol,
Introduction to analytic number theory , Springer-Verlag, New York-Heidelberg,1976, Undergraduate Texts in Mathematics. MR 04349292. P. Barrett,
Implementing the Rivest Shamir and Adleman public key encryption algorithm ona standard digital signal processor , Advances in cryptology—CRYPTO ’86 (Santa Barbara,Calif., 1986), Lecture Notes in Comput. Sci., vol. 263, Springer, Berlin, 1987, pp. 311–323.MR 907099 (88i:94015)3. L. I. Bluestein,
A linear filtering approach to the computation of discrete Fourier transform ,IEEE Transactions on Audio and Electroacoustics (1970), no. 4, 451–455.4. A. Bostan, P. Gaudry, and ´E. Schost, Linear recurrences with polynomial coefficients andapplication to integer factorization and Cartier-Manin operator , SIAM J. Comput. (2007),no. 6, 1777–1806. MR 2299425 (2008a:11156)5. J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourierseries , Math. Comp. (1965), 297–301. MR 01785866. S. Covanov and E. Thom´e, Fast integer multiplication using generalized Fermat primes , http://arxiv.org/abs/1502.02800 , 2016.7. R. Crandall and B. Fagin, Discrete weighted transforms and large-integer arithmetic , Math.Comp. (1994), no. 205, 305–324. MR 11852448. M. F¨urer, Faster integer multiplication , STOC’07—Proceedings of the 39th Annual ACMSymposium on Theory of Computing, ACM, New York, 2007, pp. 57–66. MR 2402428(2009e:68124)9. ,
Faster integer multiplication , SIAM J. Comput. (2009), no. 3, 979–1005.MR 2538847 (2011b:68296)10. D. Harvey, Faster truncated integer multiplication , https://arxiv.org/abs/1703.00640 ,2017.11. D. Harvey and J. van der Hoeven, Faster integer multiplication using plain vanilla FFTprimes , https://arxiv.org/abs/1611.07144 , to appear in Mathematics of Computation,2016.12. , Faster integer and polynomial multiplication using cyclotomic coefficient rings , https://arxiv.org/abs/1712.03693 , 2017.13. D. Harvey, J. van der Hoeven, and G. Lecerf, Even faster integer multiplication , J. Complexity (2016), 1–30. MR 353063714. S. Lang, Algebraic number theory , second ed., Graduate Texts in Mathematics, vol. 110,Springer-Verlag, New York, 1994. MR 128272315. A. K. Lenstra, H. W. Lenstra, Jr., and L. Lov´asz,
Factoring polynomials with rational coeffi-cients , Math. Ann. (1982), no. 4, 515–534. MR 68266416. D. Micciancio and P. Voulgaris,
A deterministic single exponential time algorithm for mostlattice problems based on Voronoi cell computations , SIAM J. Comput. (2013), no. 3,1364–1391. MR 350463217. P. L. Montgomery, Modular multiplication without trial division , Math. Comp. (1985),no. 170, 519–521. MR 777282 (86e:11121)18. C. H. Papadimitriou, Computational complexity , Addison-Wesley Publishing Company, Read-ing, MA, 1994. MR 1251285 (95f:68082)19. A. Sch¨onhage and V. Strassen,
Schnelle Multiplikation grosser Zahlen , Computing (Arch.Elektron. Rechnen) (1971), 281–292. MR 0292344 (45 Modern computer algebra , third ed., Cambridge UniversityPress, Cambridge, 2013. MR 3087522
21. T. Xylouris,
On the least prime in an arithmetic progression and estimates for the zeros ofDirichlet L -functions , Acta Arith. (2011), no. 1, 65–91. MR 2825574 (2012m:11129) School of Mathematics and Statistics, University of New South Wales, Sydney NSW2052, Australia
E-mail address : [email protected] CNRS, Laboratoire d’informatique, ´Ecole polytechnique, 91128 Palaiseau, France
E-mail address ::