[PDF] Faster integer multiplication using short lattice vectors

Abstract

We prove that n -bit integers may be multiplied in O(nlogn 4 log ∗ n ) bit operations. This complexity bound had been achieved previously by several authors, assuming various unproved number-theoretic hypotheses. Our proof is unconditional, and depends in an essential way on Minkowski's theorem concerning lattice vectors in symmetric convex sets.

Full PDF

aa r X i v : . [ c s . S C ] F e b FASTER INTEGER MULTIPLICATIONUSING SHORT LATTICE VECTORS

DAVID HARVEY AND JORIS VAN DER HOEVEN

Abstract.

We prove that n -bit integers may be multiplied in O ( n log n log ∗ n )bit operations. This complexity bound had been achieved previously by severalauthors, assuming various unproved number-theoretic hypotheses. Our proofis unconditional, and depends in an essential way on Minkowski’s theoremconcerning lattice vectors in symmetric convex sets. Introduction

Let M ( n ) denote the number of bit operations required to multiply two n -bitintegers, where “bit operations” means the number of steps on a deterministicTuring machine with a ﬁxed, ﬁnite number of tapes [18] (our results also hold inthe Boolean circuit model). Let log ∗ x denote the iterated natural logarithm, i.e.,log ∗ x := min { j ∈ N : log ◦ j x } , where log ◦ j x := log · · · log x (iterated j times).The main result of this paper is an algorithm achieving the following bound. Theorem 1.1.

We have M ( n ) = O ( n log n log ∗ n ) . The ﬁrst complexity bound for M ( n ) of the form O ( n log n K log ∗ n ) was estab-lished by F¨urer [8, 9], for an unspeciﬁed constant K >

1. His algorithm reducesa multiplication of size n to many multiplications of size exponentially smallerthan n , which are then handled recursively. The number of recursion levels islog ∗ n + O (1), and the constant K measures the “expansion factor” at each recur-sion level.The ﬁrst explicit value for K , namely K = 8, was given by Harvey, van derHoeven and Lecerf [13]. Their method is somewhat diﬀerent to F¨urer, but stillcarries out an exponential size reduction at each recursion level. One may thinkof the constant K = 8 as being built up of three factors of 2, each coming froma diﬀerent source.The ﬁrst factor of 2 arises from the need to perform both forward and inverseDFTs (discrete Fourier transforms) at each recursion level. This is a feature com-mon to all of the post-F¨urer algorithms, suggesting that signiﬁcantly new ideas willbe needed to do any better than O ( n log n log ∗ n ).The second factor of 2 arises from coeﬃcient growth: a product of polynomialswith k -bit integer coeﬃcients has coeﬃcients with at least 2 k bits. This factor of 2also seems diﬃcult to completely eliminate, although Harvey and van der Hoevenhave recently made some progress [12]: they achieve K = 4 √ ≈ .

66 by arrangingthat, in eﬀect, the coeﬃcient growth only occurs at every second recursion level.This was the best known unconditional value of K prior to the present paper. Harvey was supported by the Australian Research Council (grants DP150101689 andFT160100219).

The ﬁnal factor of 2 occurs because the algorithm works over C : when multiply-ing complex coeﬃcients with say β signiﬁcant bits, the algorithm ﬁrst computes afull 2 β -bit product, and then truncates to β bits. More precisely, after splitting the β -bit inputs into m exponentially smaller chunks, and encoding them into polyno-mials of degree m , the algorithm must compute the full product of degree 2 m , eventhough essentially only m coeﬃcients are needed to resolve β signiﬁcant bits of theproduct. Again, this factor of 2 has been the subject of a recent attack: Harveyhas shown [10] that it is possible to work modulo a polynomial of degree only m ,at the expense of increasing the working precision by a factor of 3 /

2. This leads toan integer multiplication algorithm achieving K = 6.Another way of attacking this last factor of 2 is to replace the coeﬃcient ring C by a ﬁnite ring Z /q Z for a suitable integer q . By choosing q with some specialstructure, it may become possible to convert a multiplication modulo q directly intoa polynomial multiplication modulo some polynomial of degree m , rather than 2 m .Three algorithms along these lines have been proposed.First, Harvey, van der Hoeven and Lecerf suggested using Mersenne primes ,i.e., primes of the form q = 2 k −

1, where k is itself prime [13, § Z /q Z to multiplication in Z [ y ] / ( y m − m is a power of two.Because k is not divisible by m , the process of splitting an element of Z /q Z into m chunks is somewhat involved, and depends on a variant of the Crandall–Faginalgorithm [7].Covanov and Thom´e [6] later proposed using generalised Fermat primes , i.e.,primes of the form q = r m + 1, where m is a power of two and r is a small eveninteger. Here, multiplication in Z /q Z is converted to multiplication in Z [ y ] / ( y m +1).The splitting procedure consists of rewriting an element of Z /q Z in base r , via fastradix-conversion algorithms.Finally, Harvey and van der Hoeven [11] proposed using FFT primes , i.e., primesof the form q = a · k + 1, where a is small. They reduce multiplication in Z /q Z tomultiplication in Z [ y ] / ( y m + a ) via a straightforward splitting of the integers into m chunks, where m is a power of two. Here the splitting process is trivial, as k maybe chosen to be divisible by m .These three algorithms all achieve K = 4, subject to plausible but unprovedconjectures on the distribution of the relevant primes. Unfortunately, in all threecases, it is not even known that there are inﬁnitely many primes of the requiredform, let alone that there exist a suﬃciently high density of them to satisfy therequirements of the algorithm.The main technical novelty of the present paper is a splitting procedure thatworks for an almost arbitrary modulus q . The core idea is to introduce an al-ternative representation for elements of Z /q Z : we represent them as expressions a + a θ + · · · + a m − θ m − , where θ is some ﬁxed 2 m -th root of unity in Z /q Z , andwhere the a i are small integers, of size roughly q /m . Essentially the only restrictionon q is that Z /q Z must contain an appropriate 2 m -th root of unity. We will seethat Linnik’s theorem is strong enough to construct suitable such moduli q .In Section 2 we show that the cost of multiplication in this representation isonly a constant factor worse than for the usual representation. The key ingredientis Minkowski’s theorem on lattice vectors in symmetric convex sets. We also givealgorithms for converting between this representation and the standard representa-tion. The conversions are not as fast as one might hope — in particular, we do not ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 3 know how to carry them out in quasilinear time — but surprisingly this turns outnot to aﬀect the overall complexity, because in the main multiplication algorithmwe perform the conversions only infrequently.Then in Sections 3 and 4 we prove Theorem 1.1, using an algorithm that isstructurally very similar to [11]. We make no attempt to minimise the impliedbig- O constant in Theorem 1.1; our goal is to give the simplest possible proof ofthe asymptotic bound, without any regard for questions of practicality.An interesting question is whether it is possible to combine the techniques of thepresent paper with those of [12] to obtain an algorithm achieving K = 2 √ ≈ . F p [ x ]. However, this is not so interesting, because anunconditional proof of the bound corresponding to K = 4 in the polynomial caseis already known [12].Throughout the paper we use the following notation. We write lg n := ⌈ log n ⌉ for n >

2, and for convenience put lg 1 := 1. We deﬁne M SS ( n ) = Cn lg n lg lg n ,where C > n -bit integers in at most M SS ( n ) bit operations [19]. This function satisﬁes n M SS ( m ) M SS ( nm ) for any n, m >

1, and also M SS ( dm ) = O ( M SS ( m )) forﬁxed d . An n -bit integer may be divided by m -bit integer, producing quotient andremainder, in time O ( M SS (max( n, m ))) [20, Ch. 9]. We may transpose an n × m array of objects of bit size b in O ( bnm lg min( n, m )) bit operations [4, Appendix].Finally, we occasionally use Xylouris’s reﬁnement of Linnik’s theorem [21], whichstates that for any relatively prime positive integers a and n , the least prime in thearithmetic progression p = a (mod n ) satisﬁes p = O ( n . ).2. θ -representations Throughout this section, ﬁx an integer q > m such that(2.1) m log q (lg lg q ) , or equivalently, q /m > (lg lg q ) , and assume we are given some θ ∈ Z /q Z such that θ m = − F = F + F y + · · · + F m − y m − ∈ Z [ y ] / ( y m + 1), deﬁne k F k :=max i | F i | . This norm satisﬁes k F G k m k F kk G k for any F, G ∈ Z [ y ] / ( y m + 1). Deﬁnition 2.1.

Let u ∈ Z /q Z . A θ -representation for u is a polynomial U ∈ Z [ y ] / ( y m + 1) such that U ( θ ) = u (mod q ) and k U k mq /m . Example . Let m = 4 and q = 3141592653589793238462833 ,θ = 2542533431566904450922735 (mod q ) ,u = 2718281828459045235360288 (mod q ) . The coeﬃcients in a θ -representation must not exceed mq /m ≈ .

46. Twoexamples of θ -representations for u are U ( y ) = − y + 951670 y − y − ,U ( y ) = − y + 1849981 y − y + 1317423 . DAVID HARVEY AND JORIS VAN DER HOEVEN

By (2.1), the number of bits required to store U ( y ) is at most m (cid:0) log ( mq /m ) + O (1) (cid:1) = lg q + O ( m lg m ) = (cid:18) O (1)lg lg q (cid:19) lg q, so a θ -representation incurs very little overhead in space compared to the standardrepresentation by an integer in the interval 0 x < q .Our main tool for working with θ -representations is the reduction algorithm stated in Lemma 2.9 below. Given a polynomial F ∈ Z [ y ] / ( y m + 1), whose coeﬃ-cients are up to about twice as large as allowed in a θ -representation, the reductionalgorithm computes a θ -representation for F ( θ ) (up to a certain scaling factor, dis-cussed further below). The basic idea of the algorithm is to precompute a nonzeropolynomial P ( y ) such that P ( θ ) = 0 (mod q ), and then to subtract an appropriatemultiple of P ( y ) from F ( y ) to make the coeﬃcients small.After developing the reduction algorithm, we are able to give algorithms forbasic arithmetic on elements of Z /q Z given in θ -representation (Proposition 2.15),a more general reduction algorithm for inputs of arbitrary size (Proposition 2.17),and algorithms for converting between standard and θ -representation (Proposition2.18 and Proposition 2.20).We begin with two results that generate certain precomputed data necessary forthe main reduction step. Lemma 2.3. In q o (1) bit operations, we may compute a nonzero polynomial P ∈ Z [ y ] / ( y m + 1) such that P ( θ ) = 0 (mod q ) and k P k q /m .Proof. We ﬁrst establish existence of a suitable P ( y ). Let θ i denote a lift of θ i to Z ,and consider the lattice Λ ⊂ Z m spanned by the rows of the m × m integer matrix A =  q − θ · · · − θ − θ m − · · ·  Every vector ( a , . . . , a m − ) ∈ Λ satisﬁes the equation a + · · · + a m − θ m − = 0(mod q ). The volume of the fundamental domain of Λ is det A = q . The volumeof the closed convex symmetric set Σ := {| a i | q /m } ⊂ R m is (2 q /m ) m = 2 m q ,so by Minkowski’s theorem (see for example [14, Ch. V, Thm. 3]), there exists anonzero vector ( a , . . . , a m − ) in Λ ∩ Σ. The corresponding polynomial P ( y ) := a + · · · + a m − y m − then has the desired properties.To actually compute P ( y ), we simply perform a brute-force search. By (2.1)there are at most (2 q /m + 1) m (3 q /m ) m = 3 m q < q o (1) candidates to test.Enumerating them in lexicographical order, we can easily evaluate P ( θ ) (mod q )in an average of O (lg q ) bit operations per candidate. (cid:3) Example . Continuing Example 2.2, the coeﬃcients of P ( y ) must not exceed q /m ≈ .

36. A suitable polynomial P ( y ) is given by P ( y ) = − y − y + 1136523 y − . Remark . The computation of P ( y ) is closely related to the problem of ﬁndingan element of small norm in the ideal of the ring Z [ ζ m ] generated by q and ζ m − θ ,where ζ m denotes a primitive 2 m -th root of unity. ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 5

Remark . The poor exponential-time complexity of Lemma 2.3 can probably beimproved, by taking advantage of more sophisticated lattice reduction or shortestvector algorithms, but we were not easily able to extract a suitable result from theliterature. For example, LLL is not guaranteed to produce a short enough vector[15], and the Micciancio–Voulgaris exact shortest vector algorithm [16] solves theproblem for the Euclidean norm rather than the uniform norm. In any case, thishas no eﬀect on our main result.

Lemma 2.7.

Assume that P ( y ) has been precomputed as in Lemma 2.3. Let r bethe smallest prime exceeding m q /m such that r ∤ q and such that P ( y ) is invertiblein ( Z /r Z )[ y ] / ( y m + 1) . Then r = O ( m q /m ) , and in q o (1) bit operations we maycompute r and a polynomial J ∈ Z [ y ] / ( y m + 1) such that J ( y ) P ( y ) = 1 (mod r ) and k J k r .Proof. Let R ∈ Z be the resultant of P ( y ) (regarded as a polynomial in Z [ y ]) and y m + 1. The primes r dividing R are exactly the primes for which P ( y ) fails to beinvertible in ( Z /r Z )[ y ] / ( y m + 1). Therefore our goal is to ﬁnd a prime r > m q /m such that r ∤ Rq .Since m is a power of two, y m +1 is a cyclotomic polynomial and hence irreduciblein Q [ y ]. Thus y m + 1 and P ( y ) have no common factor, and so R = 0. Also, wehave R = Q α P ( α ) where α runs over the complex roots of y m + 1. These roots alllie on the unit circle, so | P ( α ) | m k P k mq /m , and hence by (2.1) we obtain | Rq | ( mq /m ) m q = m m q < q .On the other hand, the prime number theorem (in the form P p x > P x p Cx log p > x (sum taken over primes). Taking x := m q /m , by (2.1) again we get X m q /m p Cm q /m log p > m q /m > · (lg lg q ) > q > log ( q ) . In particular, there must be at least one prime in the interval 2 m q /m r Cm q /m that does not divide Rq .To ﬁnd the smallest such r , we ﬁrst make a list of all primes up to Cm q /m in( Cm q /m ) o (1) < q o (1) bit operations. Then for each prime r between 2 m q /m and Cm q /m , we check whether r divides q in (lg q ) o (1) bit operations, andattempt to invert P ( y ) in ( Z /r Z )[ y ] / ( y m + 1) in ( m lg r ) o (1) = (lg q ) o (1) bitoperations [20, Ch. 11]. (cid:3) Example . Continuing Example 2.2, we have r = 42602761 and J ( y ) = 17106162 y + 6504907 y + 30962874 y + 8514380 . Now we come to the main step of the reduction algorithm, which is inspired byMontgomery’s method for modular reduction [17].

Lemma 2.9.

Assume that P ( y ) , r and J ( y ) have been precomputed as in Lem-mas 2.3 and 2.7. Given as input F ∈ Z [ y ] / ( y m + 1) with k F k m ( q /m ) , wemay compute a θ -representation for F ( θ ) /r (mod q ) in O ( M SS (lg q )) bit operations.Proof. We ﬁrst compute the “quotient” Q := F J (mod r ), normalised so that k Q k r/

2. This is done by means of Kronecker substitution [20, Ch. 8], i.e., wepack the polynomials F ( y ) and J ( y ) into integers, multiply the integers, unpack the DAVID HARVEY AND JORIS VAN DER HOEVEN result, and reduce the result modulo y m +1 and modulo r . The packed integers haveat most m (lg k F k + lg r + lg m ) bits, where the lg m term accounts for coeﬃcientgrowth in Z [ y ]. By (2.1) and Lemma 2.7, this simpliﬁes to O (lg q ) bits, so the integermultiplication step costs O ( M SS (lg q )) bit operations. This bound also covers thecost of the reductions modulo r .Next we compute the product QP , again using Kronecker substitution, at acost of O ( M SS (lg q )) bit operations. Since k Q k r/ k P k q /m , we have k QP k rmq /m .By construction of J we have QP = F (mod r ). In particular, all the coeﬃcientsof F − QP ∈ Z [ y ] / ( y m + 1) are divisible by r . The last step is to compute the“remainder” G := ( F − QP ) /r ; again, this step costs O ( M SS (lg q )) bit operations.Since r > m q /m , we have k G k k F k r + k QP k r m ( q /m ) m q /m + mq /m mq /m . Finally, since P ( θ ) = 0 (mod q ), and all arithmetic throughout the algorithm hasbeen performed modulo y m + 1, we see that G ( θ ) = F ( θ ) /r (mod q ). (cid:3) Using the above reduction algorithm, we may give preliminary addition andmultiplication algorithms for elements of Z /q Z in θ -representation. Lemma 2.10.

Assume that P ( y ) , r and J ( y ) have been precomputed as in Lem-mas 2.3 and 2.7. Given as input θ -representations for u, v ∈ Z /q Z , we may compute θ -representations for uv/r and ( u ± v ) /r in O ( M SS (lg q )) bit operations.Proof. Let the θ -representations be given by U, V ∈ Z [ y ] / ( y m + 1). We may com-pute F ∗ := U V in Z [ y ] / ( y m + 1) using Kronecker substitution in O ( M SS (lg q ))bit operations, and F ± := U ± V in O (lg q ) bit operations. Note that k F ∗ k m k U kk V k m ( q /m ) , and k F ± k k U k + k V k mq /m m ( q /m ) , so wemay apply Lemma 2.9 to obtain the desired θ -representations. (cid:3) Example . Continuing Example 2.2, we walk through an example of computinga product of elements in θ -representation. Let u = 1414213562373095048801689 (mod q ) ,v = 1732050807568877293527447 (mod q ) . Suppose we are given as input the θ -representations U ( y ) = 3740635 y + 3692532 y − y + 4285386 ,V ( y ) = 4629959 y − y − y − . We ﬁrst compute the product of U ( y ) and V ( y ) modulo y m + 1: F ( y ) = U ( y ) V ( y ) = 10266868543625 y − y − y + 26582459129078 . We multiply F ( y ) by J ( y ) and reduce modulo r to obtain the quotient Q ( y ) = 3932274 y − y + 20464841 y − . Then the remainder( F ( y ) − P ( y ) Q ( y )) /r = 995963 y − y + 398819 y + 777998is a θ -representation for uv/r (mod q ). ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 7

The following precomputation will assist in eliminating the spurious 1 /r factorappearing in Lemmas 2.9 and 2.10. Lemma 2.12.

Assume that P ( y ) , r and J ( y ) have been precomputed as in Lem-mas 2.3 and 2.7. In q o (1) bit operations, we may compute a polynomial D ∈ Z [ y ] / ( y m + 1) such that k D k mq /m and D ( θ ) = r (mod q ) .Proof. We may easily compute the totient function ϕ ( q ) in q o (1) bit operations,by ﬁrst factoring q . Since ( r, q ) = 1, we have r − ( ϕ ( q ) − = r (mod q ). Repeat-edly using the identity r − i − = ( r − i · /r , we may compute θ -representations for r − , r − , . . . , r − ( ϕ ( q ) − by successively applying Lemma 2.10. (cid:3) Remark . Assuming the factorisation of q is known (which will always be thecase in the application in Section 3), the complexity of Lemma 2.12 may be im-proved to O ( M SS (lg q ) lg q ) bit operations by using a modiﬁed “repeated squaring”algorithm. Example . Continuing Example 2.2, we may take D ( y ) = − y − y + 2036309 y − . Henceforth we write P ( q, m, θ ) for the tuple ( P ( y ) , r, J ( y ) , D ( y )) of precomputeddata generated by Lemmas 2.3, 2.7, and 2.12. Given q , m and θ as input, the aboveresults show that we may compute P ( q, m, θ ) in q o (1) bit operations. With theseprecomputations out of the way, we may state complexity bounds for the mainoperations on θ -representations. Proposition 2.15.

Assume that P ( q, m, θ ) has been precomputed. Given as input θ -representations for u, v ∈ Z /q Z , we may compute θ -representations for uv and u ± v in O ( M SS (lg q )) bit operations.Proof. For the product, we ﬁrst use Lemma 2.10 to compute a θ -representation for uv/r (mod q ), and then we use Lemma 2.10 again to multiply by D ( y ), to obtaina θ -representation for ( uv/r )( r ) /r = uv (mod q ). The sum and diﬀerence arehandled similarly. (cid:3) Remark . We suspect that the complexity bound for u ± v can be improvedto O (lg q ), but we do not currently know how to achieve this. This question seemsclosely related to Remark 2.22 below. Proposition 2.17.

Assume that P ( q, m, θ ) has been precomputed. Given as inputa polynomial F ∈ Z [ y ] / ( y m + 1) (with no restriction on k F k ), we may computea θ -representation for F ( θ ) (mod q ) in time O ( ⌈ m lg k F k / lg q ⌉ M SS (lg q )) .Proof. Let b := lg ⌈ q /m ⌉ and n := ⌈ m lg k F k / lg q ⌉ , so that2 nb > ( q /m ) n > ( q /m ) m lg k F k / lg q = 2 lg k F k (2 log q/ lg q ) > lg k F k . We may therefore decompose the coeﬃcients of F into n chunks of b bits, i.e., wemay compute polynomials F , . . . , F n − ∈ Z [ y ] / ( y m + 1) such that F = F + 2 b F + · · · + 2 ( n − b F n − and k F i k b q /m . (This step implicitly requires an arraytransposition of cost O ( bmn lg m ) = O ( n lg q lg lg q ).) Now we use Proposition 2.15repeatedly to compute a θ -representation for F via Horner’s rule, i.e., ﬁrst wecompute a θ -representation for 2 b F n − + F n − , then for 2 b (2 b F n − + F n − ) + F n − ,and so on. (cid:3) DAVID HARVEY AND JORIS VAN DER HOEVEN

Proposition 2.18.

Assume that P ( q, m, θ ) has been precomputed. Given as inputan element u ∈ Z /q Z in standard representation, we may compute a θ -representationfor u in O ( m M SS (lg q )) bit operations.Proof. Simply apply Proposition 2.17 to the constant polynomial F ( y ) = u , notingthat k F k q . (cid:3) Remark . A corollary of Proposition 2.18 is that every u ∈ Z /q Z admits a θ -representation. It would be interesting to have a direct proof of this fact that doesnot rely on the reduction algorithm. A related question is whether it is possible totighten the bound in the deﬁnition of θ -representation from mq /m to q /m , or even q /m . We do not know whether such a representation exists for all u ∈ Z /q Z . Proposition 2.20.

Given as input an element u ∈ Z /q Z in θ -representation, wemay compute the standard representation for u in O ( m M SS (lg q )) bit operations.Proof. Let U ∈ Z [ y ] / ( y m + 1) be the input polynomial. The problem amounts toevaluating U ( θ ) in Z /q Z . Again we may simply use Horner’s rule. (cid:3) Remark . In both Proposition 2.18 and Proposition 2.20, the input and outputhave bit size O (lg q ), but the complexity bounds given are not quasilinear in lg q .It is possible to improve on the stated bounds, but we do not know a quasilineartime algorithm for the conversion in either direction. Remark . In the reduction algorithm, the reader may wonder why we go tothe trouble of introducing the auxiliary prime r . Why not simply precompute anapproximation to a real inverse for P ( y ), i.e., the inverse in R [ y ] / ( y m + 1), anduse this to clear out the high-order bits of each coeﬃcient of the dividend? Inother words, why not replace the Montgomery-style division with the more naturalBarrett-style division [2]?The reason is that we cannot prove tight enough bounds on the size of thecoeﬃcients of this inverse: it is conceivable that P ( y ) might accidentally take ona very small value near one of the complex roots of y m + 1, or equivalently, thatthe resultant R in the proof of Lemma 2.7 might be unusually small. For the samereason, we cannot use a more traditional 2-adic Montgomery inverse to clear outthe low-order bits of the dividend, because again P ( y ) may take a 2-adically smallvalue near one of the 2-adic roots of y m + 1, or equivalently, the resultant R mightbe divisible by an unusually large power of 2.3. Integer multiplication: the recursive step

In this section we present a recursive routine

Transform with the followinginterface. It takes as input a (suﬃciently large) power-of-two transform length L ,a prime p = 1 (mod L ), a prime power q = p α such that(3.1) lg L lg q L lg lg L, a principal L -th root of unity ζ ∈ Z /q Z (i.e., an L -th root of unity whose reductionmodulo p is a primitive L -th root of unity in the ﬁeld Z /p Z ), certain precomputeddata depending on L , q and ζ (see below), and a polynomial F ∈ ( Z /q Z )[ x ] / ( x L − F with respect to ζ , that is, the vectorˆ F := ( F (1) , F ( ζ ) , . . . , F ( ζ L − )) ∈ ( Z /q Z ) L . The coeﬃcients of both F and ˆ F are given in standard representation. ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 9

The precomputed data consists of the tuple P ( q, m, θ ) deﬁned in Section 2,where m and θ are deﬁned as follows.First, (3.1) implies that lg q > (lg lg L ) lg lg lg L for large L , so we may take m to be the unique power of two lying in the interval(3.2) lg q (lg lg L ) lg lg lg L m < q (lg lg L ) lg lg lg L .

Observe that (2.1) is certainly satisﬁed for this choice of m (for large enough L ),as (3.1) implies that lg lg L ∼ lg lg q .Next, note that 2 m | L , because (3.1) and (3.2) imply that m = o (lg L ) = o ( L );therefore we may take θ := ζ L/ m , so that θ m = ζ L/ = − α is to give us enough control overthe bit size of q , to compensate for the fact that Linnik’s theorem does not give ussuﬃciently ﬁne control over the bit size of p (see Lemma 3.5).Our implementation of Transform uses one of two algorithms, depending onthe size of L . If L is below some threshold, say L , then it uses any convenientbase-case algorithm. Above this threshold, it reduces the given DFT problem toa collection of exponentially smaller DFTs of the same type, via a series of reduc-tions that may be summarised as follows.(i) Use the conversion algorithms from Section 2 to reduce to a transform over Z /q Z where the input and output coeﬃcients are given in θ -representation.(During steps (ii) and (iii) below, all elements of Z /q Z are stored and manip-ulated entirely in θ -representation.)(ii) Reduce the “long” transform of length L over Z /q Z to many “short” trans-forms of exponentially small length S := 2 (lg lg L ) over Z /q Z , via the Cooley–Tukey decomposition.(iii) Reduce each short transform from step (ii) to a product in ( Z /q Z )[ x ] / ( x S − S , using Bluestein’s algorithm.(iv) Use the deﬁnition of θ -representation to reinterpret each product from (iii) asa product in Z [ x, y ] / ( x S − , y m + 1), where the coeﬃcients in Z are exponen-tially smaller than the original coeﬃcients in Z /q Z .(v) Embed each product from (iv) into ( Z /q ′ Z )[ x, y ] / ( x S − , y m +1) for a suitableprime power q ′ that is exponentially smaller than q , and large enough toresolve the coeﬃcients of the products over Z .(vi) Reduce each product from (v) to a collection of forward and inverse DFTs oflength S over Z /q ′ Z , and recurse.The structure of this algorithm is very similar to that of [11]. The main dif-ference is that it is not necessary to explicitly split the coeﬃcients into chunks instep (iv); this happens automatically as a consequence of storing the coeﬃcients in θ -representation. In eﬀect, the splitting (and reassembling) work has been shuntedinto the conversions in step (i).We now consider each of the above steps in more detail. We write T ( L, q ) forthe running time of

Transform . We always assume that L is increased whenevernecessary to accommodate statements that hold only for large L . Step (i) — convert between representations.

Let T long ( L, q ) denote the timerequired to compute a DFT of length L over Z /q Z with respect to ζ , assuming thatthe coeﬃcients of the input F and the output ˆ F are given in θ -representation, andassuming that P ( q, m, θ ) is known. Lemma 3.1.

We have T ( L, q ) < T long ( L, q ) + O ( L lg L lg q ) .Proof. We ﬁrst convert F from standard to θ -representation using Proposition 2.18;we then compute ˆ F from F (working entirely in θ -representation); at the end,we convert ˆ F back to standard representation using Proposition 2.20. By (3.1)and (3.2), the total cost of the conversions is O ( Lm M SS (lg q )) = O (cid:18) L lg q (lg lg L ) lg lg lg L lg q lg lg q lg lg lg q (cid:19) = O (cid:18) L lg L lg lg L (lg lg L ) lg lg lg L lg q lg lg L lg lg lg L (cid:19) = O ( L lg L lg q ) . (cid:3) Henceforth all elements of Z /q Z are assumed to be stored in θ -representation,and we will always use Proposition 2.15 to perform arithmetic operations on suchelements in O ( M SS (lg q )) bit operations. Step (ii) — reduce to short DFTs.

Let S := 2 (lg lg L ) . Given as input polynomials F , . . . , F L/S ∈ ( Z /q Z )[ x ] / ( x S −

1) (presented sequentially on tape), let T short ( L, q )denote the time required to compute the transforms ˆ F , . . . , ˆ F L/S ∈ ( Z /q Z ) S withrespect to the principal S -th root of unity ω := ζ L/S . (Here and below, we continueto assume that P ( q, m, θ ) is known.). Lemma 3.2.

We have T long ( L, q ) < lg L (lg lg L ) T short ( L, q ) + O ( L lg L lg q ) .Proof. Let d := ⌊ lg L/ lg S ⌋ , so that lg L = d lg S + d ′ where 0 d ′ < lg S . Applyingthe Cooley–Tukey method [5] to the factorisation L = S d d ′ , the given transformof length L may be decomposed into d layers, each consisting of L/S transforms oflength S (with respect to ω ), followed by d ′ layers, each consisting of L/ O ( L ) multiplicationsby “twiddle factors” in Z /q Z , which are given by certain powers of ζ . (For furtherdetails of the Cooley–Tukey decomposition, see for example [13, § O (( d + d ′ ) L M SS (lg q )) = O (cid:18)(cid:18) lg L (lg lg L ) + (lg lg L ) (cid:19) L lg q lg lg q lg lg lg q (cid:19) = O (cid:18) lg L (lg lg L ) L lg q lg lg L lg lg lg L (cid:19) = O ( L lg L lg q ) . This bound also covers the cost of the length 2 transforms (‘butterﬂies’), each ofwhich requires one addition and one subtraction in Z /q Z .In the Turing model, we must also account for the cost of rearranging data sothat the inputs for each layer of short DFTs are stored sequentially on tape. Thecost per layer is O ( L lg S lg q ) bit operations, so O ( L lg L lg q ) altogether (see [13, § (cid:3) Step (iii) — reduce to short convolutions.

Given polynomials G , . . . , G L/S , H ∈ ( Z /q Z )[ x ] / ( x S −

1) as input, let M short ( L, q ) denote the time required to computethe products G H, . . . , G

L/S H . Lemma 3.3.

We have T short ( L, q ) < M short ( L, q ) + O ( L (lg lg L ) lg q ) . ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 11

Proof.

We use Bluestein’s method [3], which reduces the the problem of computingthe DFT of F ∈ ( Z /q Z )[ x ] / ( x S −

1) to the problem of computing the product ofcertain polynomials

G, H ∈ ( Z /q Z )[ x ] / ( x S − O ( S ) auxiliary multiplicationsin Z /q Z (for further details see [13, § G depends on F and ζ , but H depends only on ζ . The total cost of the auxiliary multiplications is O (( L/S ) S M SS (lg q )) = O ( L lg q lg lg q lg lg lg q ) = O ( L (lg lg L ) lg q ) . (cid:3) Step (iv) — reduce to bivariate products over Z . Given as input polynomials˜ G , . . . , ˜ G L/S , ˜ H ∈ Z [ x, y ] / ( x S − , y m + 1), all whose of coeﬃcients are boundedin absolute value by mq /m , let M bivariate ( L, q ) denote the cost of computing theproducts ˜ G ˜ H, . . . , ˜ G L/S ˜ H . Lemma 3.4.

We have M short ( L, q ) < M bivariate ( L, q ) + O ( L (lg lg L ) lg q ) .Proof. We are given as input polynomials G , . . . , G L/S , H ∈ ( Z /q Z )[ x ] / ( x S − θ -representation, we may immediately reinter-pret them as polynomials ˜ G , . . . , ˜ G L/S , ˜ H ∈ Z [ x, y ] / ( x S − , y m + 1), with coef-ﬁcients bounded by mq /m . By deﬁnition of θ -representation, we have ˜ H ( x, θ ) = H ( x ) (mod q ), and similarly for the G i .After computing the products ˜ G i ˜ H for i = 1 , . . . , L/S , suppose that( ˜ G i ˜ H )( x, y ) = S − X j =0 A ij ( y ) x j , A ij ∈ Z [ y ] / ( y m + 1) . Then we have ( G i H )( x ) = ( ˜ G i ˜ H )( x, θ ) = P j A ij ( θ ) x j (mod q ) for each i . There-fore, to compute the desired products G i H with coeﬃcients in θ -representation, itsuﬃces to apply Proposition 2.17 to each A ij , to compute θ -representations for allof the A ij ( θ ).Let us estimate the cost of the invocations of Proposition 2.17. We have k A ij k Sm ( mq /m ) = Sm ( q /m ) , solg k A ij k qm + lg S + 3 lg m < qm + (lg lg L ) + O (lg lg L ) . From (3.2) we have lg qm > (lg lg L ) lg lg lg L , so for large L ,(3.3) lg k A ij k < (cid:18) L (cid:19) lg qm . The cost of applying Proposition 2.17 for all A ij is thus O (cid:18) ( L/S ) S (cid:24) m lg k A ij k lg q (cid:25) M SS (lg q ) (cid:19) = O ( L M SS (lg q )) = O ( L (lg lg L ) lg q ) . (cid:3) Step (v) — Reduce to bivariate products over Z /q ′ Z . Let p ′ be the smallestprime such that p ′ = 1 (mod S ); by Linnik’s theorem we have p ′ = O ( S . ). Put q ′ := ( p ′ ) α ′ where α ′ := (cid:24)(cid:18) L (cid:19) lg qm (cid:30) lg ⌊ p ′ / ⌋ (cid:25) . We have the following bounds for q ′ . Lemma 3.5.

Let A ij be as in the proof of Lemma 3.4, for i = 1 , . . . , L/S and j = 0 , . . . , S − . Then q ′ > k A ij k and lg q ′ < (cid:18) O (1)lg lg lg L (cid:19) lg qm . Proof.

In what follows, we frequently use the fact that lg qm ≍ (lg lg L ) lg lg lg L (see (3.2)). Now, observe that log q ′ = α ′ log p ′ > α ′ lg ⌊ p ′ / ⌋ , so by (3.3),log q ′ > (cid:18) L (cid:19) lg qm > (cid:18) L (cid:19) lg qm + 2 > lg k A ij k + 2 . Thus q ′ > k A ij k . For the other direction, since lg p ′ ≍ lg S = (lg lg L ) , we havelg q ′ α ′ lg p ′  (cid:16) L (cid:17) lg qm lg ⌊ p ′ / ⌋ + 1  lg p ′ < (cid:18) O (1)lg lg lg L (cid:19) lg qm · lg p ′ lg ⌊ p ′ / ⌋ , and lg p ′ / lg ⌊ p ′ / ⌋ < O (1) / lg p ′ < O (1) / (lg lg L ) . (cid:3) Now, given as input polynomials g , . . . , g L/S , h ∈ ( Z /q ′ Z )[ x, y ] / ( x S − , y m + 1),let M ′ bivariate ( L, q ) denote the cost of computing the products g h, . . . , g L/S h , whereall input and output coeﬃcients in Z /q ′ Z are in standard representation. Lemma 3.6.

We have M bivariate ( L, q ) < M ′ bivariate ( L, q ) + O ( L lg q ) .Proof. We may locate p ′ by testing S +1 , S +1 , . . . , in S O (1) = 2 O ((lg lg L ) ) = O ( L )bit operations, and we may easily compute α ′ and q ′ within the same time bound.Now, given as input ˜ G , . . . , ˜ G L/S , ˜ H ∈ Z [ x, y ] / ( x S − , y m + 1), we ﬁrst convertthem (in linear time) to polynomials g , . . . , g L/S , h ∈ ( Z /q ′ Z )[ x, y ] / ( x S − , y m +1),and then multiply them in the latter ring. The bound q ′ > k A ij k in Lemma 3.5shows that the products over Z may be unambiguously recovered from those over Z /q ′ Z ; again, this lifting can be done in linear time. (cid:3) Step (vi) — reduce to DFTs over Z /q ′ Z . In this step we will call

Transform recursively to handle certain transforms of length S over Z /q ′ Z . To check thatthese calls are permissible, we must verify the precondition corresponding to (3.1),namely lg S lg q ′ S lg lg S . The ﬁrst inequality is clear since q ′ > p ′ > S .The second inequality follows from (3.2), Lemma 3.5, and the observation thatlg S lg lg S > (lg lg L ) lg lg lg L . Lemma 3.7.

We have M ′ bivariate ( L, q ) < (cid:0) LS + 1 (cid:1) m T ( S, q ′ ) + O ( L (lg lg L ) lg q ) .Proof. We start by computing various data needed for the recursive calls. Wemay compute a primitive S -th root of unity in Z /p ′ Z in ( p ′ ) O (1) = O ( L ) bit op-erations, and then Hensel lift it to a principal S -th root of unity ζ ′ ∈ Z /q ′ Z in(lg p ′ lg q ′ ) O (1) = O ( L ) bit operations. Just as before, we deﬁne m ′ to be the uniquepower of two in the interval(3.4) lg q ′ (lg lg S ) lg lg lg S m ′ < q ′ (lg lg S ) lg lg lg S , and set θ ′ := ( ζ ′ ) S/ m ′ . Using Lemmas 2.3, 2.7, and 2.12, we may compute P ( q ′ , m ′ , θ ′ ) in ( q ′ ) o (1) = 2 O ((lg lg L ) lg lg lg L ) = O ( L ) bit operations.Now suppose we wish to compute the products g h, . . . , g L/S h , for polynomials g , . . . , g L/S , h ∈ ( Z /q ′ Z )[ x, y ] / ( x S − , y m + 1). We use the following algorithm. ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 13

First we use

Transform to transform all

L/S +1 polynomials with respect to x ,that is, we compute g i (( ζ ′ ) j , y ) and h (( ζ ′ ) j , y ) as elements of ( Z /q ′ Z )[ y ] / ( y m +1), for i = 1 , . . . , L/S and j = 0 , . . . , S −

1. Since

Transform must be applied separatelyto every coeﬃcient 1 , y, . . . , y m − , the total number of calls is ( L/S +1) m . Accessingthe coeﬃcient of each y k also implies a number of array transpositions whose totalcost is O (( L/S ) Sm lg m lg q ′ ) = O ( L lg lg L lg q ).Next we compute the ( L/S ) S = L pointwise products g i (( ζ ′ ) j , y ) h (( ζ ′ ) j , y ).Using Kronecker substitution, each such product in ( Z /q ′ Z )[ y ] / ( y m + 1) costs O ( M SS (lg q )) bit operations, as m (lg q ′ + lg m ) = O (lg q ).Finally, we perform ( L/S ) m inverse transforms with respect to x . It is wellknown that these may be computed by the same algorithm as the forward transform,with ζ ′ replaced by ( ζ ′ ) − , followed by a division by S . The division may beaccomplished by simply multiplying through by S − (mod q ′ ); this certainly costsno more than the pointwise multiplication step. (cid:3) Corollary 3.8.

We have T ( L, q ) < lg L (lg lg L ) (cid:0) LS + 1 (cid:1) m T ( S, q ′ ) + O ( L lg L lg q ) .Proof. This follows immediately by chaining together Lemmas 3.1, 3.2, 3.3, 3.4,3.6, and 3.7. (cid:3)

Deﬁne T ( L ) := max q T ( L, q ) L lg L lg q , where the maximum is taken over all prime powers q satisfying (3.1). (For large L ,at least one such q always exists. For example, take α := 1 and take q = p to bethe smallest prime satisfying p = 1 (mod L ); then Linnik’s theorem implies that(3.1) holds for this q .) Proposition 3.9.

We have T ( L ) < (cid:16) O (1)lg lg lg L (cid:17) T (2 (lg lg L ) ) + O (1) .Proof. Dividing the bound in Corollary 3.8 by L lg L lg q yields T ( L, q ) L lg L lg q < (cid:18) SL (cid:19) m lg q ′ lg q · T ( S, q ′ ) S lg S lg q ′ + O (1) . Applying Lemma 3.5 and the estimate

S/L < O (1) / lg lg lg L yields T ( L, q ) L lg L lg q < (cid:18) O (1)lg lg lg L (cid:19) T ( S ) + O (1) . Taking the maximum over allowable q yields the desired bound. (cid:3) Corollary 3.10.

We have T ( L ) = O (4 log ∗ L ) .Proof. This follows by applying the “master theorem” [13, Prop. 8] to the recurrencein Proposition 3.9. Alternatively, it follows by the same method used to deduce[11, Cor. 3] from [11, Prop. 2]. The key point is that 2 (lg lg L ) is dominated bya “logarithmically slow” function of L , such as Φ( x ) := 2 (log log x ) (see [13, § (cid:3) Remark . When working with θ -representations, it is possible to multiply anelement of Z /q Z by any power of θ in linear time, by simply permuting the coeﬃ-cients. In other words, we have available “fast roots of unity” in the sense of F¨urer.Notice however that the algorithm presented in this section makes no use of thisfact! This raises the question of whether one can design an integer multiplicationalgorithm that uses these fast roots in the same way as in F¨urer’s original algorithm,instead of our appeal to Bluestein’s trick. This is indeed possible, and one doesobtain a bound of the form O ( n lg n K log ∗ n ). In this algorithm, instead of therunning time being dominated by the short transforms, it is dominated by thetwiddle factor multiplications, just as in F¨urer’s algorithm. Unfortunately, thisleads to a worse value of K , because of the implied constant in Proposition 2.15.4. Integer multiplication: the top level

The only complication in building an integer multiplication algorithm on top ofthe

Transform routine is ensuring that the precomputations do not dominate thecomplexity. We achieve this by means of a multivariate Kronecker-style splitting,as follows.

Proof of Theorem 1.1.

Suppose that we wish to compute the product of two n -bitintegers u and v , for some suﬃciently large n . Let b := lg n and t := ⌈ n/b ⌉ / , sothat t b > n . Decompose u into t chunks of b bits, say u = u + u b + · · · + u t − ( t − b where 0 u i < b for each i , and similarly for v . Let U ( x , . . . , x ) := t − X i =0 · · · t − X i =0 u i + ti + ··· + t i x i · · · x i ∈ Z [ x , . . . , x ] , so that u = U (2 b , tb , . . . , t b ), and deﬁne V ( x , . . . , x ) similarly. The product U V has degree less than 2 t in each variable, so at most 64 t terms altogether,and its coeﬃcients are bounded by 2 b n n . We may therefore reconstruct uv from U V using a straightforward overlap-add procedure (essentially, evaluating at(2 b , tb , . . . , t b )) in O ( t lg n ) = O ( n ) bit operations.Now we consider the computation of U V . Let L be the unique power of two inthe interval 2 t L < t ; then it suﬃces to compute the product U V in the ring Z [ x , . . . , x ] / ( x L − , . . . , x L − i = 0 , . . . ,

18, let q i be the least prime such that q i = 1 (mod L ) and q i = i (mod 19). Then the q i are distinct, and by Linnik’s theorem they sat-isfy q i = O ( L . ) = O ( t . ) = O ( n . ), so we may locate the q i in n . o (1) bit operations, and they certainly satisfy (3.1). Moreover, for large n we have q · · · q > L > t > ( n/ lg n ) / > n , so to compute U V it suﬃcesto compute

U V (mod q i ) for each i and then reconstruct U V by the Chinese re-mainder theorem. The cost of this reconstruction is (lg n ) o (1) bit operations percoeﬃcient, so ( n/ lg n )(lg n ) o (1) = n (lg n ) o (1) altogether.We have therefore reduced to the problem of computing a product in the ring( Z /q i Z )[ x , . . . , x ] / ( x L − , . . . , x L −

1) for each i = 0 , . . . ,

19. To do this, weuse

Transform to perform forward DFTs of length L with respect to a suitableprimitive L -th root of unity in Z /q i Z , for each variable x , . . . , x successively; thenwe multiply pointwise in Z /q i Z ; ﬁnally we perform inverse DFTs and scale theresults. The necessary precomputations for each prime (ﬁnding a suitable root ofunity and computing the appropriate tuple P ( q i , m i , θ i )) require only q o (1) i = n . o (1) bit operation per prime. The total cost of the pointwise multiplications is n (lg n ) o (1) . The total number of calls to Transform for each prime is 12 L , so by ASTER INTEGER MULTIPLICATION USING SHORT LATTICE VECTORS 15

Corollary 3.10 we obtain M ( n ) = O ( L P i =0 T ( L, q i )) + n (lg n ) o (1) = O ( L P i =0 T ( L ) lg L lg q i ) + n (lg n ) o (1) = O (( n/ lg n )4 log ∗ L lg n lg n ) + n (lg n ) o (1) = O ( n lg n log ∗ n ) . (cid:3) References

1. T. M. Apostol,

Introduction to analytic number theory , Springer-Verlag, New York-Heidelberg,1976, Undergraduate Texts in Mathematics. MR 04349292. P. Barrett,

Implementing the Rivest Shamir and Adleman public key encryption algorithm ona standard digital signal processor , Advances in cryptology—CRYPTO ’86 (Santa Barbara,Calif., 1986), Lecture Notes in Comput. Sci., vol. 263, Springer, Berlin, 1987, pp. 311–323.MR 907099 (88i:94015)3. L. I. Bluestein,

A linear ﬁltering approach to the computation of discrete Fourier transform ,IEEE Transactions on Audio and Electroacoustics (1970), no. 4, 451–455.4. A. Bostan, P. Gaudry, and ´E. Schost, Linear recurrences with polynomial coeﬃcients andapplication to integer factorization and Cartier-Manin operator , SIAM J. Comput. (2007),no. 6, 1777–1806. MR 2299425 (2008a:11156)5. J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourierseries , Math. Comp. (1965), 297–301. MR 01785866. S. Covanov and E. Thom´e, Fast integer multiplication using generalized Fermat primes , http://arxiv.org/abs/1502.02800 , 2016.7. R. Crandall and B. Fagin, Discrete weighted transforms and large-integer arithmetic , Math.Comp. (1994), no. 205, 305–324. MR 11852448. M. F¨urer, Faster integer multiplication , STOC’07—Proceedings of the 39th Annual ACMSymposium on Theory of Computing, ACM, New York, 2007, pp. 57–66. MR 2402428(2009e:68124)9. ,

Faster integer multiplication , SIAM J. Comput. (2009), no. 3, 979–1005.MR 2538847 (2011b:68296)10. D. Harvey, Faster truncated integer multiplication , https://arxiv.org/abs/1703.00640 ,2017.11. D. Harvey and J. van der Hoeven, Faster integer multiplication using plain vanilla FFTprimes , https://arxiv.org/abs/1611.07144 , to appear in Mathematics of Computation,2016.12. , Faster integer and polynomial multiplication using cyclotomic coeﬃcient rings , https://arxiv.org/abs/1712.03693 , 2017.13. D. Harvey, J. van der Hoeven, and G. Lecerf, Even faster integer multiplication , J. Complexity (2016), 1–30. MR 353063714. S. Lang, Algebraic number theory , second ed., Graduate Texts in Mathematics, vol. 110,Springer-Verlag, New York, 1994. MR 128272315. A. K. Lenstra, H. W. Lenstra, Jr., and L. Lov´asz,

Factoring polynomials with rational coeﬃ-cients , Math. Ann. (1982), no. 4, 515–534. MR 68266416. D. Micciancio and P. Voulgaris,

A deterministic single exponential time algorithm for mostlattice problems based on Voronoi cell computations , SIAM J. Comput. (2013), no. 3,1364–1391. MR 350463217. P. L. Montgomery, Modular multiplication without trial division , Math. Comp. (1985),no. 170, 519–521. MR 777282 (86e:11121)18. C. H. Papadimitriou, Computational complexity , Addison-Wesley Publishing Company, Read-ing, MA, 1994. MR 1251285 (95f:68082)19. A. Sch¨onhage and V. Strassen,

Schnelle Multiplikation grosser Zahlen , Computing (Arch.Elektron. Rechnen) (1971), 281–292. MR 0292344 (45 Modern computer algebra , third ed., Cambridge UniversityPress, Cambridge, 2013. MR 3087522

21. T. Xylouris,

On the least prime in an arithmetic progression and estimates for the zeros ofDirichlet L -functions , Acta Arith. (2011), no. 1, 65–91. MR 2825574 (2012m:11129) School of Mathematics and Statistics, University of New South Wales, Sydney NSW2052, Australia

E-mail address : [email protected] CNRS, Laboratoire d’informatique, ´Ecole polytechnique, 91128 Palaiseau, France

E-mail address ::