Faster integer and polynomial multiplication using cyclotomic coefficient rings
DAVID HARVEY AND JORIS VAN DER HOEVEN
Abstract.
We present an algorithm that computes the product of two n-bit integers in O(n log n (4√2)^{log* n}) bit operations. Previously, the best known bound was O(n log n 6^{log* n}). We also prove that for a fixed prime p, polynomials in F_p[X] of degree n may be multiplied in O(n log n 4^{log* n}) bit operations; the previous best bound was O(n log n 8^{log* n}).

1. Introduction
In this paper we present new complexity bounds for multiplying integers and polynomials over finite fields. Our focus is on theoretical bounds rather than practical algorithms. We work in the deterministic multitape Turing model [19], in which time complexity is defined by counting the number of steps, or equivalently, the number of 'bit operations', executed by a Turing machine with a fixed, finite number of tapes. The main results of the paper also hold in the Boolean circuit model.

The following notation is used throughout. For x ∈ R, we denote by log* x the iterated logarithm, that is, the least non-negative integer k such that log^{◦k} x ≤ 1, where log^{◦k} x := log ··· log x (iterated k times). For a positive integer n, we define lg n := max(1, ⌈log_2 n⌉); in particular, expressions like lg lg lg n are defined and take positive values for all n ≥ 1. We denote the n-th cyclotomic polynomial by φ_n(X) ∈ Z[X], and the Euler totient function by ϕ(n).

All absolute constants in this paper are in principle effectively computable. This includes the implied constants in all uses of O(·) notation.

1.1. Integer multiplication.
Let M(n) denote the number of bit operations required to multiply two n-bit integers. For over 35 years, the best known bound for M(n) was that achieved by the Schönhage–Strassen algorithm [23], namely

M(n) = O(n lg n lg lg n).  (1.1)

In 2007, Fürer described an asymptotically faster algorithm that achieves

M(n) = O(n lg n K_Z^{log* n})  (1.2)

for some unspecified constant K_Z > 1. Fürer's algorithm reduces a multiplication of size n to a large collection of multiplications of size exponentially smaller than n; these smaller multiplications are handled recursively. The K_Z^{log* n} term may be understood roughly as follows: the number of recursion levels is log* n + O(1), and the constant K_Z measures the amount of 'data expansion' that occurs at each level, due to phenomena such as zero-padding.

Immediately following Fürer's work, De, Kurur, Saha and Saptharishi described a variant based on modular arithmetic [10], instead of the approximate complex arithmetic used by Fürer. Their algorithm also achieves (1.2), again for some unspecified K_Z > 1.

The first explicit value for K_Z was given by Harvey, van der Hoeven and Lecerf, who described an algorithm that achieves (1.2) with K_Z = 8 [17]. Their algorithm borrows some important ideas from Fürer's work, but also differs in several respects. In particular, their algorithm has no need for the 'fast' roots of unity that were the cornerstone of Fürer's approach (and of the variant of [10]). The main presentation in [17] is based on approximate complex arithmetic, and the paper includes a sketch of a variant based on modular arithmetic that also achieves K_Z = 8.

In a recent preprint, the first author announced that in the complex arithmetic case, the constant may be reduced to K_Z = 6, by taking advantage of new methods for truncated integer multiplication [14]. This improvement does not seem to apply to the modular variants.

The first main result of this paper is the following further improvement.

Theorem 1.1.
There is an integer multiplication algorithm that achieves

M(n) = O(n lg n (4√2)^{log* n}).

In other words, (1.2) holds with K_Z = 4√2 ≈ 5.66. It was shown in [17] that one may achieve K_Z = 4 under various unproved number-theoretic conjectures. Whether K_Z = 4 can be reached unconditionally remains an important open question.

1.2. Polynomial multiplication over finite fields.
For a prime p, let M_p(n) denote the number of bit operations required to multiply two polynomials in F_p[X] of degree less than n. The optimal choice of algorithm for this problem depends very much on the relative size of n and p.

If n is not too large compared to p, say lg n = O(lg p), then a reasonable choice is Kronecker substitution: one lifts the polynomials to Z[X], packs the coefficients of each polynomial into a large integer (i.e., evaluates at X = 2^b for b := 2 lg p + lg n), multiplies these large integers, unpacks the resulting coefficients to obtain the product in Z[X], and finally reduces the output modulo p. This leads to the bound

M_p(n) = O(M(n lg p)) = O(n lg p lg(n lg p) K_Z^{log*(n lg p)}),  (1.3)

where K_Z is any admissible constant in (1.2). To the authors' knowledge, this is the best known asymptotic bound for M_p(n) in the region lg n = O(lg p).

When n is large compared to p, the situation is starkly different. The Kronecker substitution method leads to poor results, due to coefficient growth in the lifted product: for example, when p is fixed, Kronecker substitution yields

M_p(n) = O(M(n lg n)) = O(n (lg n)^2 K_Z^{log* n}).

For many years, the best known bound in this regime was that achieved by the algebraic version of the Schönhage–Strassen algorithm [23, 22], namely

M_p(n) = O(n lg n lg lg n lg p + n lg n M(lg p)).  (1.4)

The first term arises from performing O(n lg n lg lg n) additions in F_p, and the second term from O(n lg n) multiplications in F_p. (In fact, this sort of bound holds for polynomial multiplication over quite general rings [7].) For fixed p, this is faster than the Kronecker substitution method by a factor of almost lg n.
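As an aside, the packing scheme just described is short enough to model in full. The following Python sketch (not from the paper; it uses the choice b := 2 lg p + lg n described above, with lg realised via bit lengths) multiplies in F_p[X] by lifting to Z[X] and evaluating at X = 2^b:

```python
def kronecker_mul(f, g, p):
    # f, g: coefficient lists (low degree first) of polynomials over F_p.
    # Pack at X = 2^b with b = 2*lg(p) + lg(n), so that the base-2^b digits
    # of the integer product F*G are exactly the coefficients of f*g in Z[X].
    n = max(len(f), len(g))
    b = 2 * max(1, (p - 1).bit_length()) + max(1, (n - 1).bit_length())
    F = sum(c << (b * i) for i, c in enumerate(f))
    G = sum(c << (b * i) for i, c in enumerate(g))
    H, mask, out = F * G, (1 << b) - 1, []
    for _ in range(len(f) + len(g) - 1):
        out.append((H & mask) % p)   # unpack one base-2^b digit, reduce mod p
        H >>= b
    return out
```

For instance, kronecker_mul([1, 2, 3], [4, 5], 7) returns [4, 6, 1, 1], i.e. (1 + 2X + 3X^2)(4 + 5X) ≡ 4 + 6X + X^2 + X^3 (mod 7). For fixed p and growing n the lifted coefficients force b to grow like lg n, which is precisely the coefficient growth that makes Kronecker substitution wasteful in that regime.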
The main reason for its superiority is that it exploits the modulo p structure throughout the algorithm, whereas the Kronecker substitution method forgets this structure in the very first step.

After the appearance of Fürer's algorithm, it was natural to ask whether a Fürer-type bound could be proved for M_p(n), in the case that n is large compared to p. This question was answered in the affirmative by Harvey, van der Hoeven and Lecerf, who gave an algorithm that achieves

M_p(n) = O(n lg p lg(n lg p) 8^{log*(n lg p)}),

uniformly for all n and p [18]. This is a very elegant bound; however, written in this way, it obscures the fact that the constant 8 plays two quite different roles in the complexity analysis. One source of the value 8 is the constant K_Z = 8 arising from the integer multiplication algorithm mentioned above, but there is also a separate constant K_F = 8 arising from the polynomial part of the algorithm. There is no particular reason to expect that K_Z = K_F, and it is somewhat of a coincidence that they have the same numerical value in [18].

To clarify the situation, we mention that one may derive a complexity bound for the algorithm of [18] under the assumption that one has available an integer multiplication algorithm achieving (1.2) for some K_Z > 1, where possibly K_Z = 8. Namely, one finds that

M_p(n) = O(n lg p lg(n lg p) K_F^{max(0, log* n − log* p)} K_Z^{log* p}),  (1.5)

where K_F = 8 (we omit the proof). The second main result of this paper, proved in Section 4, is the following improvement in the value of K_F.

Theorem 1.2.
Let K_Z > 1 be any constant for which (1.2) holds (for example, by Theorem 1.1, one may take K_Z = 4√2). Then there is a polynomial multiplication algorithm that achieves

M_p(n) = O(n lg p lg(n lg p) 4^{max(0, log* n − log* p)} K_Z^{log* p}),  (1.6)

uniformly for all n > 1 and all primes p.

In other words, (1.5) holds with K_F = 4. In particular, for fixed p, one can multiply polynomials in F_p[X] of degree n in O(n lg n 4^{log* n}) bit operations.

Theorem 1.2 may be generalised in various ways. We briefly mention a few possibilities along the lines of [18, §8] (no proofs will be given). First, we may obtain analogous bit complexity bounds for multiplication in F_{p^a}[X] and (Z/p^a Z)[X] for a ≥ 1, and in (Z/mZ)[X] for arbitrary m > 1. Second, we may multiply polynomials in A[X] of degree less than n, for any F_p-algebra A, using O(n lg n 4^{log* n}) additions and scalar multiplications and O(n 4^{log* n}) nonscalar multiplications (compare with [18, Thm. 8.4]).

1.3. Overview of the new algorithms.
To explain the new approach, let us first recall the idea behind the polynomial multiplication algorithm of [18]. Consider a polynomial multiplication problem in F_p[X], where the degree n is very large compared to p. By splitting the inputs into chunks, we convert this to a bivariate multiplication problem in F_p[Y, Z]/(f(Y), Z^m − 1), for a suitable integer m and irreducible polynomial f ∈ F_p[Y]. This bivariate product is handled by means of DFTs (discrete Fourier transforms) of length m over F_p[Y]/f. The key innovation of [18] was to choose deg f so that p^{deg f} − 1 is divisible by many small primes whose product exceeds n, even though deg f itself is exponentially smaller. This is possible thanks to a number-theoretic result of Adleman, Pomerance and Rumely [1], building on earlier work of Prachar [21]. Taking m to be a product of many of these primes, we obtain m | p^{deg f} − 1, and hence F_p[Y]/f contains a root of unity of order m. As m is highly composite, each DFT of length m may be converted to a collection of much smaller DFTs via the Cooley–Tukey method. These in turn are converted into multiplication problems using Bluestein's algorithm. These multiplications, corresponding to exponentially smaller values of n, are handled recursively.

The recursion continues until n becomes comparable to p. The number of recursion levels during this phase is log* n − log* p + O(1), and the constant K_F = 8 represents the expansion factor at each recursion level. When n becomes comparable to p, the algorithm switches strategy to Kronecker substitution combined with ordinary integer multiplication. This phase contributes the K_Z^{log* p} term.

It was pointed out in [16, §8] that the value of K_F can be improved to K_F = 4 if one is willing to accept certain unproved number-theoretic conjectures, including Artin's conjecture on primitive roots. More precisely, under these conjectures, one may find an irreducible f of the form f(Y) = Y^{α−1} + ··· + Y + 1, where α is prime, so that F_p[Y, Z]/(f(Y), Z^m − 1) is a direct summand of F_p[Y, Z]/(Y^α − 1, Z^m − 1) ≅ F_p[X]/(X^{αm} − 1). Working in the latter ring requires essentially no zero-padding, and this economy is the source of the improved value of K_F.

To prove Theorem 1.2, we will pursue a variant of this idea. We will take f to be a cyclotomic polynomial φ_α(Y) for a judiciously chosen integer α (not necessarily prime). Since φ_α | Y^α − 1, we may use the above isomorphism to realise the same economy in zero-padding as in the conjectural construction of [16, §8]. The price we pay is that we can no longer require f to be irreducible in F_p[Y]. Thus F_p[Y]/f is no longer in general a field, but a direct sum of fields. The situation is reminiscent of Fürer's algorithm, in which the coefficient ring C[Y]/(Y^r + 1) is not a field, but a direct sum of copies of C. The key technical contribution of this paper is to show that we have enough control over the factorisation of φ_α in F_p[Y] to ensure that F_p[Y]/φ_α contains suitable principal roots of unity. This approach avoids Artin's conjecture and other number-theoretic difficulties, and enables us to reach K_F = 4 unconditionally. The construction of α is the subject of Section 3, and the main polynomial multiplication algorithm is presented in Section 4.

Let us now outline how we go about proving Theorem 1.1 (the integer case). The algorithm is heavily dependent on the polynomial multiplication algorithm just sketched. We take the basic problem to be multiplication in Z/(2^n − 1)Z, for arbitrary positive n. We choose a collection of small primes p, each having around 2 lg lg n bits, and whose product P has (lg n)^{1+o(1)} bits. By cutting the input integers into many small chunks, we convert to a multiplication in (Z/PZ)[X]/(X^N − 1), where N ≈ n/lg P. One technical headache is that n is not necessarily divisible by N; following [17], we deal with this by adapting an idea of Crandall and Fagin [9]. Next, by the Chinese remainder theorem, we reduce to multiplying in F_p[X]/(X^N − 1) for each p separately. This is reminiscent of Pollard's algorithm [20], but instead of using three primes, here the number of primes grows with n. At this stage, the coefficient size lg p is doubly exponentially smaller than N. We perform these multiplications in F_p[X]/(X^N − 1) by applying two recursion levels of the polynomial multiplication algorithm of Theorem 1.2. This reduces the problem to a collection of multiplication problems in F_p[X], each doubly exponentially smaller than the original problem. Using Kronecker substitution, these are converted back to multiplications in Z/(2^{n′} − 1)Z, where n′ is doubly exponentially smaller than n, and the algorithm is applied recursively.

In effect, each recursive call in the new integer multiplication algorithm corresponds to two recursion levels of the existing Fürer-type algorithms. The speedup relative to [17] may be understood as follows. In the algorithm of [17], at each recursion level we incur a factor of two in overhead due to the zero-padding that occurs when we split the inputs into small chunks. In the new algorithm, the passage from Z/PZ to F_p manages the same exponential size reduction without any zero-padding. This roughly corresponds to saving a factor of two at every second recursion level of the algorithm of [17], and explains the factor of (√2)^{log* n} overall speedup.

2. Preliminaries
2.1. Logarithmically slow functions.
Let x_0 ∈ R, and let Φ : (x_0, ∞) → R be a smooth increasing function. We recall from [17, §5] that Φ is said to be logarithmically slow if there exists an integer ℓ ≥ 0 such that

(log^{◦ℓ} ∘ Φ ∘ exp^{◦ℓ})(x) = log x + O(1)

as x → ∞. For example, the functions log(5x), 5 log x, (log x)^2, and 2^{(log log x)^2} are logarithmically slow, with ℓ = 0, 1, 2, 3 respectively.

We will always assume that x_0 is chosen large enough to ensure that Φ(x) ≤ x − 1 for all x > x_0. According to [17, Lemma 2], this is possible for any logarithmically slow function, and it implies that the iterator

Φ*(x) := min{k ≥ 0 : Φ^{◦k}(x) ≤ x_0}

is well-defined on R. It is shown in [17, Lemma 3] that this iterator satisfies

Φ*(x) = log* x + O(1)  (2.1)

as x → ∞. In other words, logarithmically slow functions are more or less indistinguishable from log x, as far as iterators are concerned.

As in [17] and [18], we will use logarithmically slow functions to measure size reduction in multiplication algorithms. The typical situation is that we have a function T(n) measuring the (normalised) cost of a certain multiplication algorithm for inputs of size n; we reduce the problem to a collection of problems of size n_i ≤ Φ^{◦κ}(n) for some κ ≥ 1, leading to a bound for T(n) in terms of the various T(n_i). Applying the reduction recursively, we wish to convert these bounds into an explicit asymptotic estimate for T(n). This is achieved via the following 'master theorem'.

Proposition 2.1.
Let K > 1, B > 0, and let ℓ ≥ 0 and κ ≥ 1 be integers. Let x_0 > exp^{◦ℓ}(1), and let Φ : (x_0, ∞) → R be a logarithmically slow function such that Φ(x) ≤ x − 1 for all x > x_0. Assume that x_1 ≥ x_0 is chosen so that Φ^{◦κ}(x) is defined for all x > x_1. Then there exists a positive constant C (depending on x_0, x_1, Φ, K, B, ℓ and κ) with the following property.

Let σ > x_1 and L > 0. Let S ⊆ R, and let T : S → R_{>0} be any function satisfying the following recurrence. First, T(y) ≤ L for all y ∈ S, y ≤ σ. Second, for all y ∈ S, y > σ, there exist y_1, ..., y_d ∈ S with y_i ≤ Φ^{◦κ}(y), and weights γ_1, ..., γ_d ≥ 0 with Σ_i γ_i = 1, such that

T(y) ≤ K (1 + B/log^{◦ℓ} y) Σ_{i=1}^{d} γ_i T(y_i) + L.

Then for all y ∈ S, y > σ, we have T(y) ≤ C L (K^{1/κ})^{log* y − log* σ}.

Proof.
The special case κ = 1, x_1 = x_0 is exactly [17, Prop. 8]. We indicate briefly how the proof of [17, Prop. 8] must be modified to obtain this more general statement.

The first two paragraphs of the proof of [17, Prop. 8] may be read verbatim. In the third paragraph, the inductive statement is changed to

T(y) ≤ E_1 ··· E_j L (K^{⌈j/κ⌉} + ··· + K + 1),

where j := Φ*_σ(y) denotes the least k ≥ 0 with Φ^{◦k}(y) ≤ σ. The inductive step is modified slightly: for 0 < j ≤ κ we use the fact that y_i ≤ σ, and for j > κ the fact that Φ*_σ(y_i) ≤ Φ*_σ(Φ^{◦κ}(y)) = Φ*_σ(y) − κ. With these changes, the proof given in [17] goes through without difficulty. □

2.2. Discrete Fourier transforms.
Let n ≥ 1, and let R be a commutative ring in which n is invertible. A principal n-th root of unity is an element ω ∈ R such that ω^n = 1 and such that Σ_{j=0}^{n−1} ω^{ij} = 0 for i = 1, 2, ..., n − 1. If m is a divisor of n, then ω^{n/m} is easily seen to be a principal m-th root of unity.

Fix a principal n-th root of unity ω. The discrete Fourier transform (DFT) of the sequence (a_0, ..., a_{n−1}) ∈ R^n with respect to ω is the sequence (â_0, ..., â_{n−1}) ∈ R^n defined by â_j := Σ_{i=0}^{n−1} ω^{ij} a_i. Equivalently, â_j = A(ω^j), where A := Σ_{i=0}^{n−1} a_i X^i ∈ R[X]/(X^n − 1).

The inverse DFT recovers (a_0, ..., a_{n−1}) from (â_0, ..., â_{n−1}). Computationally it corresponds to a DFT with respect to ω^{−1}, followed by a division by n, because

(1/n) Σ_{j=0}^{n−1} ω^{−kj} â_j = (1/n) Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} ω^{(i−k)j} a_i = a_k,  k = 0, ..., n − 1.

DFTs may be used to implement cyclic convolutions. Suppose that we wish to compute C := AB where A, B ∈ R[X]/(X^n − 1). We first perform DFTs to compute A(ω^i) and B(ω^i) for i = 0, ..., n − 1. We then compute C(ω^i) = A(ω^i) B(ω^i) for each i, and finally perform an inverse DFT to recover C ∈ R[X]/(X^n − 1).

More generally, we may use multidimensional DFTs to compute C := AB for A, B ∈ R[X_1, ..., X_d]/(X_1^{n_1} − 1, ..., X_d^{n_d} − 1). For this, we require that each n_k be invertible in R, and that R contain a principal n_k-th root of unity ω_k for each k. Let n := n_1 ··· n_d. We first perform multidimensional DFTs to evaluate A and B at the n points {(ω_1^{j_1}, ..., ω_d^{j_d}) : 0 ≤ j_k < n_k}. We then multiply pointwise, and finally recover C via a multidimensional inverse DFT.
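These definitions are easy to exercise at toy scale. The following Python sketch (not from the paper) implements the DFT, the inverse DFT and cyclic multiplication over R = Z/pZ using the naive O(n^2) evaluation, together with Bluestein's reduction of a DFT to a cyclic convolution, which is recalled at the end of this subsection:

```python
def dft(a, omega, p):
    # a_hat[j] = sum_i omega^(i*j) * a[i] over Z/pZ (naive O(n^2) version)
    n = len(a)
    return [sum(pow(omega, i * j, p) * a[i] for i in range(n)) % p
            for j in range(n)]

def inverse_dft(ahat, omega, p):
    # DFT with respect to omega^{-1}, followed by division by n
    n = len(ahat)
    inv_n = pow(n, -1, p)                       # requires Python >= 3.8
    return [x * inv_n % p for x in dft(ahat, pow(omega, -1, p), p)]

def cyclic_mul(a, b, omega, p):
    # product in (Z/pZ)[X]/(X^n - 1): evaluate, multiply pointwise, interpolate
    ah, bh = dft(a, omega, p), dft(b, omega, p)
    return inverse_dft([x * y % p for x, y in zip(ah, bh)], omega, p)

def bluestein_dft(a, omega, p):
    # Bluestein: DFT of odd length n via one cyclic convolution.  With
    # xi := omega^{(n+1)/2} we have xi^2 = omega and xi^n = 1, and
    # a_hat[j] = xi^{j^2} * (f * g)[j] for the f, g defined below.
    n = len(a)
    xi = pow(omega, (n + 1) // 2, p)
    f = [pow(xi, i * i, p) * a[i] % p for i in range(n)]
    g = [pow(xi, -i * i, p) for i in range(n)]
    h = [0] * n                                 # h = f * g in (Z/pZ)[Z]/(Z^n - 1)
    for i in range(n):
        for j in range(n):
            h[(i + j) % n] = (h[(i + j) % n] + f[i] * g[j]) % p
    return [pow(xi, j * j, p) * h[j] % p for j in range(n)]
```

For example, ω = 2 is a principal 5th root of unity modulo p = 31 (since 2^5 = 32 ≡ 1), and bluestein_dft(a, 2, 31) agrees with dft(a, 2, 31) for every length-5 sequence a.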
Each multidimensional DFT may be reduced to a collection of one-dimensional DFTs as follows. We first compute A(X_1, ..., X_{d−1}, ω_d^j) ∈ R[X_1, ..., X_{d−1}] for each j = 0, ..., n_d − 1; this involves n/n_d DFTs of length n_d. We then recursively evaluate each of these polynomials at the n/n_d points (ω_1^{j_1}, ..., ω_{d−1}^{j_{d−1}}). Altogether, this strategy involves computing n/n_k DFTs of length n_k for each k = 1, ..., d.

Finally, we briefly recall Bluestein's method [5] for reducing a (one-dimensional) DFT to a convolution problem (see also [17]). Let n > 1 be odd, and let ω ∈ R be a principal n-th root of unity. Set ξ := ω^{(n+1)/2}, so that ξ^2 = ω and ξ^n = 1. Then computing the DFT of a given sequence (a_0, ..., a_{n−1}) ∈ R^n with respect to ω reduces to computing the product of the polynomials

f(Z) := Σ_{i=0}^{n−1} ξ^{i^2} a_i Z^i,   g(Z) := Σ_{i=0}^{n−1} ξ^{−i^2} Z^i

in R[Z]/(Z^n − 1), plus O(n) auxiliary multiplications in R. Notice that g(Z) is fixed and does not depend on the input sequence.

2.3. The Crandall–Fagin algorithm.
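To fix ideas before the formal description, here is a toy Python model of the reduction recalled in this subsection (not from the paper; the parameters n = 10, N = 3, P = 2063 and θ used in the example are hypothetical, chosen so that θ^N = 2 mod P and P is large enough):

```python
def crandall_fagin_mul(u, v, n, N, P, theta):
    """Multiply u, v in Z/(2^n - 1)Z via a length-N cyclic convolution over
    Z/PZ.  Assumes theta^N = 2 (mod P) and that P is large enough (roughly
    lg P >= 2*ceil(n/N) + lg N + 1); N need not divide n."""
    e = [-(-n * i // N) for i in range(N + 1)]     # e_i = ceil(n*i/N)
    c = [N * e[i] - n * i for i in range(N)]       # c_i = N*e_i - n*i

    def digits(x):
        # variable-base decomposition: x = sum_i 2^{e_i} x_i,
        # with 0 <= x_i < 2^{e_{i+1} - e_i}
        return [(x >> e[i]) & ((1 << (e[i + 1] - e[i])) - 1) for i in range(N)]

    U = [pow(theta, c[i], P) * d % P for i, d in enumerate(digits(u))]
    V = [pow(theta, c[i], P) * d % P for i, d in enumerate(digits(v))]
    W = [0] * N                                    # W = U*V in (Z/PZ)[X]/(X^N - 1)
    for i in range(N):
        for j in range(N):
            W[(i + j) % N] = (W[(i + j) % N] + U[i] * V[j]) % P
    # peel off theta^{c_i} and overlap-add: uv = sum_i 2^{e_i} w_i mod 2^n - 1
    w = [pow(theta, -c[i], P) * W[i] % P for i in range(N)]
    return sum(w[i] << e[i] for i in range(N)) % ((1 << n) - 1)
```

For example, P = 2063 is prime with P ≡ 2 (mod 3), so θ := 2^{1375} mod P satisfies θ^3 = 2 (1375 being the inverse of 3 modulo P − 1); then crandall_fagin_mul(1000, 999, 10, 3, 2063, θ) returns 1000 · 999 mod (2^{10} − 1), even though 3 ∤ 10.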
Consider the problem of computing a 'cyclic' integer product of length n, that is, a product uv where u, v ∈ Z/(2^n − 1)Z. If N and P are positive integers such that N | n and lg P ≥ 2n/N + lg N, then we may reduce the given problem to multiplication in (Z/PZ)[X]/(X^N − 1), by cutting the inputs into chunks of n/N bits. In this section we briefly recall a variant [17] of an algorithm of Crandall and Fagin [9], which works without the assumption that N | n.

Assume that N ≤ n and lg P ≥ 2⌈n/N⌉ + lg N + 1, and that we have available some θ ∈ Z/PZ with θ^N = 2. (This θ plays the same role as the real N-th root of 2 in the original Crandall–Fagin algorithm.) Set e_i := ⌈ni/N⌉ and c_i := N e_i − ni for 0 ≤ i < N. Observe that e_{i+1} − e_i = ⌊n/N⌋ or ⌈n/N⌉ for each i. Decompose the inputs as u = Σ_{i=0}^{N−1} 2^{e_i} u_i and v = Σ_{i=0}^{N−1} 2^{e_i} v_i, where 0 ≤ u_i, v_i < 2^{e_{i+1} − e_i} (i.e., a decomposition with respect to a 'variable base'). Set U(X) := Σ_{i=0}^{N−1} θ^{c_i} u_i X^i and V(X) := Σ_{i=0}^{N−1} θ^{c_i} v_i X^i, regarded as polynomials in (Z/PZ)[X]/(X^N − 1), and let W(X) := U(X) V(X). Then one finds (see [17]) that uv may be recovered by the formula uv = Σ_{i=0}^{N−1} 2^{e_i} w_i (mod 2^n − 1), where the w_i are integers in [0, P) defined by W(X) = Σ_{i=0}^{N−1} θ^{c_i} w_i X^i.

To summarise, the problem of computing uv reduces to computing a product in (Z/PZ)[X]/(X^N − 1), plus O(N) auxiliary multiplications in Z/PZ, and O(N (lg n)^2 + N lg P) bit operations to compute the e_i and to handle the final overlap-add phase (again, see [17]).

2.4. Data layout.
In this section we discuss several issues relating to the layout of data on the Turing machine tapes.

Integers will always be stored in the standard binary representation. If n is a positive integer, then elements of Z/nZ will always be stored as residues in the range 0 ≤ x < n, occupying lg n bits of storage.

If R is a ring and f ∈ R[X] is a polynomial of degree n ≥ 1, then an element of R[X]/f(X) will always be represented as a sequence of n coefficients in the standard monomial order. This convention is applied recursively, so for rings of the type (R[Y]/f(Y))[X]/g(X), the coefficient of X^0 is stored first, as an element of R[Y]/f(Y), followed by the coefficient of X^1, and so on.
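The dominant cost of changing between such layouts is array transposition, treated in the next paragraphs. A Python model of the recursive strategy (not from the paper; for simplicity we always split the row dimension, whereas the Turing-machine algorithm cited below splits along the short dimension in order to obtain the lg min(n, m) factor):

```python
def transpose(a, n, m):
    """Transpose a flat row-major n x m array into a flat row-major m x n
    array by splitting the rows in half and transposing each half recursively."""
    if n <= 1 or m <= 1:
        return list(a)              # a 1 x m row is already an m x 1 column
    h = n // 2
    top = transpose(a[:h * m], h, m)        # m x h, row-major
    bot = transpose(a[h * m:], n - h, m)    # m x (n - h), row-major
    out = []
    for j in range(m):                      # glue row j of both halves
        out.extend(top[j * h:(j + 1) * h])
        out.extend(bot[j * (n - h):(j + 1) * (n - h)])
    return out
```

For instance, transpose([1, 2, 3, 4, 5, 6], 2, 3) returns [1, 4, 2, 5, 3, 6]. With b-bit entries the recursion satisfies T(n, m) = 2 T(n/2, m) + O(bnm), which yields the O(bnm lg min(n, m)) bound quoted below when the split is taken along the short dimension.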
A multidimensional array of size n_d × ··· × n_1, whose entries occupy b bits each, will be stored as a linear array of b n_1 ··· n_d bits. The entries are ordered lexicographically in the order (0, ..., 0, 0), (0, ..., 0, 1), ..., (n_d − 1, ..., n_1 − 1). In particular, an element of (···(R[X_1]/f_1(X_1))···)[X_d]/f_d(X_d) is represented as an n_d × ··· × n_1 array of elements of R, where n_k := deg f_k. We will generally prefer the more compact notation R[X_1, ..., X_d]/(f_1(X_1), ..., f_d(X_d)).

There are many instances where an n × m array must be transposed so that its entries can be accessed efficiently either 'by columns' or 'by rows'. Using the algorithm of [6, Lemma 18], such a transposition may be achieved in O(bnm lg min(n, m)) bit operations, where b is the bit size of each entry. (The idea of the algorithm is to split the array in half along the short dimension, and transpose each half recursively.)

One important application is the following result, which estimates the data rearrangement cost associated to the Agarwal–Cooley method [2] for converting between one-dimensional and multidimensional convolution problems (this is closely related to the Good–Thomas DFT algorithm [13, 26]).

Lemma 2.2.
Let n, m ≥ 1 be relatively prime, and let R be a ring whose elements are represented using b bits. There exists an isomorphism

R[X]/(X^{nm} − 1) ≅ R[Y, Z]/(Y^n − 1, Z^m − 1)

that may be evaluated in either direction in O(bnm lg min(n, m)) bit operations.

Proof. Let c := m^{−1} mod n, and let β : R[X]/(X^{nm} − 1) → R[Y, Z]/(Y^n − 1, Z^m − 1) be the isomorphism that sends X to Y^c Z, and acts as the identity on R. Suppose that we wish to compute β(F) for some input polynomial F = Σ_{k=0}^{nm−1} F_k X^k ∈ R[X]/(X^{nm} − 1). Interpreting the coefficient list (F_0, ..., F_{nm−1}) as an n × m array, the (i, j)-th entry corresponds to F_{im+j}. After transposing the array, which costs O(bnm lg min(n, m)) bit operations, we have an m × n array, whose (j, i)-th entry is F_{im+j}. Now for each j, cyclically permute the j-th row by (jc mod n) slots; altogether this uses only O(bnm) bit operations. The result is an m × n array whose (j, i)-th entry is F_{(i−jc mod n)m+j}, which is exactly the coefficient of Y^{((i−jc)m+j)c} Z^{(i−jc)m+j} = Y^i Z^j in β(F). The inverse map β^{−1} may be computed by reversing this procedure. □

Corollary 2.3.
Let n_1, ..., n_d ≥ 1 be pairwise relatively prime, let n := n_1 ··· n_d, and let R be a ring whose elements are represented using b bits. There exists an isomorphism

R[X]/(X^n − 1) ≅ R[X_1, ..., X_d]/(X_1^{n_1} − 1, ..., X_d^{n_d} − 1)

that may be evaluated in either direction in O(bn lg n) bit operations.

Proof. Using Lemma 2.2, we may construct a sequence of isomorphisms

R[X]/(X^{n_1 ··· n_d} − 1) ≅ R[X_1, W]/(X_1^{n_1} − 1, W^{n_2 ··· n_d} − 1)
≅ R[X_1, X_2, W]/(X_1^{n_1} − 1, X_2^{n_2} − 1, W^{n_3 ··· n_d} − 1)
···
≅ R[X_1, ..., X_d]/(X_1^{n_1} − 1, ..., X_d^{n_d} − 1),

the i-th of which may be computed in O(bn lg n_i) bit operations. The overall cost is O(Σ_i bn lg n_i) = O(bn lg n) bit operations. □

3. Cyclotomic coefficient rings
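Throughout this section we will repeatedly split a cyclic ring F_p[X]/(X^N − 1) into a bivariate ring via Lemma 2.2. A small Python check of the underlying index map X ↦ Y^c Z (not from the paper; integer coefficients for simplicity), verified to be multiplicative on a toy example:

```python
def beta(F, n, m):
    # Lemma 2.2: send F in R[X]/(X^{nm}-1) to G in R[Y,Z]/(Y^n-1, Z^m-1),
    # where X -> Y^c Z with c := m^{-1} mod n; G[i][j] is the coeff of Y^i Z^j
    c = pow(m, -1, n)
    G = [[0] * m for _ in range(n)]
    for k, coeff in enumerate(F):
        G[k * c % n][k % m] = coeff
    return G

def mul_1d(F, G):
    # cyclic convolution in Z[X]/(X^L - 1)
    L = len(F)
    H = [0] * L
    for i in range(L):
        for j in range(L):
            H[(i + j) % L] += F[i] * G[j]
    return H

def mul_2d(A, B, n, m):
    # bivariate cyclic convolution in Z[Y,Z]/(Y^n - 1, Z^m - 1)
    C = [[0] * m for _ in range(n)]
    for i1 in range(n):
        for j1 in range(m):
            for i2 in range(n):
                for j2 in range(m):
                    C[(i1 + i2) % n][(j1 + j2) % m] += A[i1][j1] * B[i2][j2]
    return C
```

For n = 3, m = 4 one checks that beta(mul_1d(F, G), 3, 4) == mul_2d(beta(F, 3, 4), beta(G, 3, 4), 3, 4) for any F, G of length 12, i.e. β is a ring homomorphism; bijectivity of the index map k ↦ (kc mod n, k mod m) follows from the Chinese remainder theorem.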
The aim of this section is to construct certain coefficient rings that play a central role in the multiplication algorithms described later. The basic idea is as follows. Suppose that we want to multiply two polynomials in F_p[X], and that the degree of the product is known to be at most n. If N is an integer with N > n, then by appropriate zero-padding, we may embed the problem in F_p[X]/(X^N − 1). If N = αm, where α and m are relatively prime, then there is an isomorphism F_p[X]/(X^N − 1) ≅ F_p[Y, Z]/(Y^α − 1, Z^m − 1), and the latter ring is closely related to

F_p[Y, Z]/(φ_α(Y), Z^m − 1) ≅ (F_p[Y]/φ_α)[Z]/(Z^m − 1)

(recall that φ_α(Y) denotes the α-th cyclotomic polynomial). In particular, computing the product in (F_p[Y]/φ_α)[Z]/(Z^m − 1) recovers 'most' of the information about the product in F_p[X]/(X^N − 1).

We will choose N, α and m with the following properties:

(1) N is not much larger than n, so that not too much space is 'wasted' in the initial zero-padding step;
(2) ϕ(α) (= deg φ_α) is not much smaller than α, so that we do not lose much information by working modulo φ_α(Y) instead of modulo Y^α − 1;
(3) the ring F_p[Y]/φ_α contains a principal m-th root of unity, so that we can multiply in (F_p[Y]/φ_α)[Z]/(Z^m − 1) efficiently by means of DFTs over F_p[Y]/φ_α;
(4) m is a product of many integers that are exponentially smaller than n, so that the DFTs of length m may be reduced to many small DFTs; and
(5) α is itself exponentially smaller than n.

The last two items ensure that the small DFTs can be converted to multiplication problems of degree exponentially smaller than n, to allow the recursion to proceed.

Definition 3.1. An admissible tuple is a sequence (q_0, q_1, ..., q_e) of distinct primes (e ≥ 1) satisfying the following conditions. First,

(lg N)^2 < q_i < 2^{(lg lg N)^3},   i = 0, ..., e,  (3.1)

where N := q_0 ··· q_e. Second, q_i − 1 is squarefree for i = 1, ..., e, and

λ(q_1, ..., q_e) := LCM(q_1 − 1, ..., q_e − 1) < 2^{(lg lg N)^3}.  (3.2)

(Note that q_0 − 1 does not participate in (3.2).)

An admissible length is a positive integer N of the form N = q_0 ··· q_e where (q_0, ..., q_e) is an admissible tuple.

If N is an admissible length, we treat (q_0, ..., q_e) and λ(N) := λ(q_1, ..., q_e) as auxiliary data attached to N. For example, if an algorithm takes N as input, we implicitly assume that this auxiliary data is also supplied as part of the input.

Example 3.2.
For n = 10^{100000}, there is a nearby admissible length

N = 1000000000000000000156121 ... (99971 digits omitted) ... = q_0 q_1 ··· q_e,

where

q_0 = 206658761261792645783,
q_1 = 36658226833235899 = 1 + 2 · 3 · ⋯,
q_2 = 36658244723486119 = 1 + 2 · 3 · ⋯,
q_3 = 36658319675739343 = 1 + 2 · ⋯,
q_4 = 36658428883190467 = 1 + 2 · ⋯,
  ⋮
q_e = 37076481100386859 = 1 + 2 · ⋯,

and

λ(N) = 2 · 3 · 5 ⋯ 113 = 31610054640417607788145206291543662493274686990.

Definition 3.3.
Let p be a prime. An admissible length N is called p-admissible if N > p and p ∤ N (i.e., p is distinct from q_0, ..., q_e).

The following result explains how to choose a p-admissible length close to any prescribed target.

Proposition 3.4.
There is an absolute constant z_0 > 0 with the following property. Given as input a prime p and an integer n > max(z_0, p), we may compute, in 2^{O((lg lg n)^2)} bit operations, a p-admissible length N in the interval

n < N < (1 + 1/lg n) n.  (3.3)

The key ingredient in the proof is the following number-theoretic result of Adleman, Pomerance and Rumely.

Lemma 3.5 ([1, Prop. 10]). There is an absolute constant C > 0 with the following property. For all sufficiently large x, there exists a positive squarefree integer λ < x such that

Σ_{q prime, q−1 | λ} 1 > exp(C log x / log log x).

Proof of Proposition 3.4.
Let λ_max := ⌈2^{(lg lg n)^2}⌉, and for λ ≥ 1 define f(λ) to be the number of primes q in the interval (lg n)^3 < q ≤ λ_max + 1 such that q − 1 | λ and q ≠ p. We claim that, provided n is large enough, there exists some squarefree λ ∈ {1, ..., λ_max} such that f(λ) > lg n. To see this, apply Lemma 3.5 with x := 2^{(lg lg n)^2}; for large n we then have

C log x / log log x > 5 lg lg n,

so Lemma 3.5 implies that there exists a positive squarefree integer λ < x ≤ λ_max for which

Σ_{q prime, q−1 | λ} 1 > exp(5 lg lg n) > (lg n)^5,

and hence

f(λ) ≥ (lg n)^5 − (lg n)^3 − 1 > lg n

(the term (lg n)^3 discards the primes q ≤ (lg n)^3, and the term 1 discards the possibility q = p).
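Such a λ can be located by the sieve-and-scan procedure described next; at toy scale it looks like this (a Python sketch, not from the paper; the bounds λ_max and lg n are replaced by small hypothetical values):

```python
def find_lambda(lam_max, target):
    """Find a squarefree lambda <= lam_max such that at least `target`
    primes q <= lam_max + 1 satisfy q - 1 | lambda (toy-scale model)."""
    limit = lam_max + 1
    # sieve of Eratosthenes up to lam_max + 1
    is_prime = [False, False] + [True] * (limit - 1)
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, limit + 1, i):
                is_prime[j] = False
    # squarefree flags up to lam_max
    squarefree = [True] * (lam_max + 1)
    for i in range(2, int(lam_max ** 0.5) + 1):
        for j in range(i * i, lam_max + 1, i * i):
            squarefree[j] = False
    # one pass per prime q: bump every squarefree lambda divisible by q - 1
    count = [0] * (lam_max + 1)
    for q in range(2, limit + 1):
        if not is_prime[q]:
            continue
        for lam in range(q - 1, lam_max + 1, q - 1):
            if squarefree[lam]:
                count[lam] += 1
                if count[lam] >= target:
                    return lam, count[lam]
    return None
```

For instance, find_lambda(30, 5) returns (30, 5): the primes 2, 3, 7, 11, 31 all satisfy q − 1 | 30. In the proof, target is lg n, the counters are the array entries c_λ, and one pass is made per sieved prime.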
We may locate one such λ by means of the following algorithm (adapted from the proof of [18, Lemma 4.5]). First use a sieve to enumerate the primes q in the interval (lg n)^3 < q ≤ λ_max + 1, and to determine which λ = 1, ..., λ_max are squarefree, in (λ_max)^{1+o(1)} bit operations. Now initialise an array of integers c_λ := 0 for λ = 1, ..., λ_max. For each such prime q ≠ p, scan through the array, incrementing those c_λ for which λ is squarefree and divisible by q − 1, and stop as soon as one of the c_λ reaches lg n. We need only allocate O(lg lg n) bits per array entry, so each pass through the array costs O(λ_max lg lg n) bit operations. The number of passes is O(λ_max), so the total cost of finding a suitable λ is O(λ_max^2 lg lg n) = 2^{O((lg lg n)^2)} bit operations. Within the same time bound, we may also easily recover a list of primes q_1, q_2, ..., q_{lg n} for which q_i − 1 | λ.

Next, compute the partial products q_1, q_1 q_2, ..., q_1 q_2 ··· q_{lg n}, and determine the smallest integer e ≥ 1 such that q_1 ··· q_e > n / 2^{3(lg lg n)^2}. Such an e certainly exists, as q_1 ··· q_{lg n} > 2^{lg n} ≥ n. Since each q_i occupies O((lg lg n)^2) bits, this can all be done in (lg n)^{O(1)} bit operations. Also, as q_e ≤ λ + 1 ≤ 2^{(lg lg n)^2} + 1 < 2^{2(lg lg n)^2} and q_1 ··· q_{e−1} ≤ n / 2^{3(lg lg n)^2}, we find that

2^{(lg lg n)^2} < n / (q_1 ··· q_e) < 2^{3(lg lg n)^2}

for large n.

Let q_0 be the least prime that exceeds n/(q_1 ··· q_e) and that is distinct from p. According to [4], the interval [x − x^{0.525}, x] contains at least one prime for all sufficiently large x; therefore

q_0 < n/(q_1 ··· q_e) + 2 (n/(q_1 ··· q_e))^{0.525} < (1 + 1/lg n) · n/(q_1 ··· q_e)

for n sufficiently large, using n/(q_1 ··· q_e) > 2^{(lg lg n)^2}. We may find q_0 in 2^{O((lg lg n)^2)} bit operations, by using trial division to test successive integers for primality.

Set N := q_0 q_1 ··· q_e. Then (3.3) holds, and certainly N > p and p ∤ N. Let us check that (q_0, ..., q_e) is admissible, provided n is large enough. For i = 1, ..., e we have

(lg N)^2 < (lg n)^3 < q_i ≤ λ + 1 ≤ 2^{(lg lg n)^2} + 1 < 2^{(lg lg N)^3},

and also

(lg N)^2 < 2^{(lg lg n)^2} < q_0 < (1 + 1/lg n) · 2^{3(lg lg n)^2} < 2^{(lg lg N)^3};

this establishes (3.1). Also, as q_0 > 2^{(lg lg n)^2} ≥ λ + 1 ≥ q_i for i = 1, ..., e, we see that q_0 is distinct from q_1, ..., q_e. Finally, (3.2) holds because

LCM(q_1 − 1, ..., q_e − 1) | λ ≤ 2^{(lg lg n)^2} < 2^{(lg lg N)^3}.

This also shows that we may compute the auxiliary data λ(q_1, ..., q_e) in 2^{O((lg lg n)^2)} bit operations. □

Remark 3.6. Example 3.2 was constructed by enumerating the smallest primes q_1, q_2, ... exceeding (lg n)^3 for which q_i − 1 divides 2 · 3 · 5 ⋯ 113, and then choosing q_0 to make N as close to n as possible.

The proof of Proposition 3.4 goes a different way: rather than choosing λ first, the proof constructs q_1, ..., q_e and λ simultaneously. In particular, one cannot guarantee that λ will be a product of an initial segment of primes, as occurred in the example. Indeed, the proof of [1, Prop. 10] (and of its predecessor [21]) yields very little information at all about the prime factorisation of λ. For further discussion, see [1, Remark 6.2].

Definition 3.7.
Let p be a prime and let N = q · · · q e be a p -admissible length.A p -admissible divisor of N is a positive divisor α of N , with q | α , such that thering F p [ Y ] /φ α ( Y ) contains a principal ( q · · · q e )-th root of unity, and such thatlg N < α < (lg lg N ) (3.4)and ϕ ( α ) > (cid:18) − N (cid:19) α. (3.5)The next result shows how to construct a p -admissible divisor for any sufficientlylarge p -admissible length N . The idea behind the construction is as follows. Letord n p denote the order of p in the multiplicative group of integers modulo n . Forany α >
1, not divisible by p, the ring F_p[Y]/φ_α(Y) is a direct sum of fields of order p^r, where r = ord_α p [27, Lemma 14.50]. Our goal is to ensure that p^r − 1 is divisible by q_1 · · · q_e, so that F_p[Y]/φ_α(Y) contains the desired principal root of unity. One way to force q_i to divide p^r − 1 is to take α divisible by q_i, as this implies that ord_{q_i} p | r. The difficulty is that we cannot do this for all q_i, because then α would become too large, violating (3.4). Fortunately, we can take advantage of the fact that the q_i − 1 all divide λ = λ(N); this enables us to take α to be a product of a small subset of the q_i, in such a way that still every one of q_1, . . . , q_e divides p^r − 1.

Proposition 3.8.
There is an absolute constant z > with the following property.Given as input a prime p and a p -admissible length N > z , we may compute a p -admissible divisor α of N , together with the cyclotomic polynomial φ α ∈ F p [ Y ] and a principal ( q · · · q e ) -th root of unity in F p [ Y ] /φ α , in O ((lg lg N ) ) p o (1) bitoperations.Proof. We are given as input an admissible tuple ( q , . . . , q e ) with N = q · · · q e ,and the squarefree integer λ := λ ( q , . . . , q e ). Let L be the set of primes dividing λ .By (3.2) we have |L| log λ < (lg lg N ) , and we may compute L in λ O (1) =2 O ((lg lg N ) ) bit operations.We start by computing a table of values of ord q i p for i = 1 , . . . , e ; note that p = q i by hypothesis, so ord q i p is well-defined. We have q i − | λ and henceord q i p | λ for each i . To compute ord q i p , we first compute p mod q i in O (lg q i lg p )bit operations, and then repeatedly multiply by p modulo q i until reaching 1. Sinceord q i p λ , and there are e = O (lg N ) primes q i , the total cost to compute thetable is O (( λ lg q i + lg q i lg p ) lg N ) = (2 (lg lg N ) lg p ) O (1) bit operations. ASTER INTEGER AND POLYNOMIAL MULTIPLICATION 13
Using the above table, we construct a certain vector σ = ( σ , . . . , σ e ) ∈ { , } e as follows. Initialise the vector as σ := (0 , . . . , ℓ ∈ L , search for thesmallest i = 1 , . . . , e such that ℓ | ord q i p . If such an i is found, set σ i := 1; if no i is found, ignore this ℓ . The cost of computing σ is O ( |L| e (lg λ ) ) = (lg N ) O (1) bitoperations.Set α := q Q i : σ i =1 q i . To establish (3.4), note that the number of i for which σ i = 1 is at most |L| , so (3.1) implies thatlg N < q α < (2 (lg lg N ) ) |L| +1 (2 (lg lg N ) ) (lg lg N ) = 2 (lg lg N ) . For (3.5), first observe that ϕ ( α ) α = (cid:18) − q (cid:19) Y i : σ i =1 (cid:18) − q i (cid:19) > (cid:18) − N ) (cid:19) (lg lg N ) . Since − log(1 − ε ) < ε for any ε ∈ (0 , ), we obtain − log ϕ ( α ) α < N ) (lg N ) < N and hence ϕ ( α ) /α > exp( − / lg N ) > − / lg N for sufficiently large N .Now compute the cyclotomic polynomial φ α ∈ F p [ Y ] (i.e., the reduction mod-ulo p of φ α ( Y ) ∈ Z [ Y ]). This can be done in ( α lg p ) O (1) bit operations, using forexample [27, Algorithm 14.48]. We may then determine the factorisation of φ α into irreducibles in F p [ Y ], say φ α = f · · · f k , in α O (1) p / o (1) bit operations [24,Thm. 1]. Since p ∤ α , the f j are distinct, and each f j has degree r := ord α p [27,Lemma 14.50]. In other words, F p [ Y ] /φ α is isomorphic to a direct sum of k copiesof F p r .We claim that q h | p r − h = 1 , . . . , e . For this, it suffices to prove thatord q h p | r for each h . Since λ is squarefree, it suffices in turn to show that everyprime ℓ dividing ord q h p also divides r . But for every such ℓ , the procedure forconstructing σ must have succeeded in finding some i for which ℓ | ord q i p (since atleast one value of i works, namely i = h ). Then σ i = 1 for this i , so q i | α . Thisimplies that ord q i p | ord α p = r , and hence that ℓ | r .We conclude that q · · · q e | p r −
1, so each F_p[Y]/f_j contains a primitive root of unity of order q_1 · · · q_e. As the factorisation of q_1 · · · q_e is known, we may locate one such primitive root in each F_p[Y]/f_j in α^{O(1)} p^{o(1)} bit operations [25] (see also [17, Lemma 3.3]). Combining these primitive roots via the Chinese remainder theorem, we obtain the desired principal (q_1 · · · q_e)-th root of unity in F_p[Y]/φ_α in another (α lg p)^{O(1)} bit operations. □

Remark 3.9. The p^{o(1)} term in Proposition 3.8 arises from the best known deterministic complexity bounds for factoring polynomials and finding primitive roots. If we permit randomised algorithms, then p^{o(1)} may be replaced by (lg p)^{O(1)}. This has no effect on the main results of this paper.

Example 3.10.
Continuing with Example 3.2, let us take p = 3. In the notationof the proof of Proposition 3.8, we have L = { , , , . . . , } . For each ℓ ∈ L , let us write q ( ℓ ) for the smallest q i for which ord q i ℓ . Then we have q (2) = q , q (3) = q , q (5) = q , q (7) = q , q (11) = q ,q (13) = q , q (17) = q , q (19) = q , q (23) = q , q (29) = q ,q (31) = q , q (37) = q , q (41) = q , q (43) = q , q (47) = q ,q (53) = q , q (59) = q , q (61) = q , q (67) = q , q (71) = q ,q (73) = q , q (79) = q , q (83) = q , q (89) = q , q (97) = q ,q (101) = q , q (103) = q , q (107) = q , q (109) = q , q (113) = q . Therefore σ i = 1 for i = 1 , , , , , ,
9, and we have α = q q q q q q q q ≈ . × ,ϕ ( α ) ≈ . × ,r = ord α · · · · · · · · · · . The ring F [ Y ] /φ α is isomorphic to a direct sum of ϕ ( α ) /r copies of F r . Theextraneous factors in r (namely 883, 9041, 327251 and 39551747) arise from theauxiliary prime q . Let m := N/α = q q q q · · · q ≈ . × ;then since m | q · · · q e | r −
1, each copy of F r contains a primitive m -th root ofunity, so F [ Y ] /φ α contains a principal m -th root of unity. Thus it is possible tomultiply in the ring F [ Y, Z ] / ( φ α ( Y ) , Z m −
1) by using DFTs over F_3[Y]/φ_α.

Remark 3.11. In Example 3.10, every ℓ ∈ L divides ord_{q_i} p for some i. It seems likely that this always occurs (at least for large n), but we do not know how to prove this. If it fails for some ℓ, then r may turn out not to be divisible by ℓ, but the proof of Proposition 3.8 shows that we still have q_h | p^r − 1 for h = 1, . . . , e.

4. Faster polynomial multiplication
The goal of this section is to prove Theorem 1.2. We fix a constant K_Z > 1 such that M(n) = O(n lg n K_Z^{log∗ n}).

We will describe a recursive routine PolynomialMultiply, that takes as input integers r, t ≥ 1, a prime p, and polynomials U_1, . . . , U_t, V ∈ F_p[X]/(X^r − 1), and computes the products U_1 V, . . . , U_t V. Its running time is denoted by C_poly(t, r, p). Note that the input polynomials U_1, . . . , U_t, V are expected to be supplied consecutively on the input tape (first U_1, then U_2, and so on), and the outputs U_1 V, . . . , U_t V should also be written consecutively to the output tape.

The role of the parameter t is to allow us to amortise the cost of transforming the fixed operand V across t products. This optimisation (borrowed from [17] and [18]) saves a constant factor in time at each recursion level of the main algorithm. Altogether the algorithm will perform 2t + 1 transforms: t + 1 forward transforms for U_1, . . . , U_t and V, followed by t inverse transforms to recover the products U_1 V, . . . , U_t V.
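The 2t + 1 transform pattern can be illustrated with a toy DFT over a small prime field. This is our own illustrative sketch, not the paper's Turing-machine routine: a naive O(r^2) DFT stands in for the fast transforms, and the names `dft` and `cyclic_products` and the parameters (p = 5, r = 4, ω = 2) are illustrative choices. It assumes a prime p with r | p − 1, so that F_p contains a principal r-th root of unity.

```python
# Illustration of the 2t+1 transform pattern: t+1 forward DFTs (one for the
# fixed operand V, one per U_s) and t inverse DFTs, so V is transformed once.

def dft(a, omega, p):
    r = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(r)) % p
            for i in range(r)]

def cyclic_products(us, v, omega, p):
    r = len(v)
    v_hat = dft(v, omega, p)                 # one forward transform for V
    inv_omega = pow(omega, -1, p)
    inv_r = pow(r, -1, p)
    out = []
    for u in us:                             # t forward transforms
        u_hat = dft(u, omega, p)
        w_hat = [(x * y) % p for x, y in zip(u_hat, v_hat)]
        w = dft(w_hat, inv_omega, p)         # t inverse transforms
        out.append([(x * inv_r) % p for x in w])
    return out

def direct(u, v, p):
    # reference: schoolbook multiplication modulo X^r - 1
    r = len(u)
    w = [0] * r
    for i in range(r):
        for j in range(r):
            w[(i + j) % r] = (w[(i + j) % r] + u[i] * v[j]) % p
    return w

# p = 5, r = 4, omega = 2 (a principal 4th root of unity mod 5)
us = [[1, 2, 3, 4], [0, 1, 0, 1]]
v = [2, 0, 1, 3]
assert cyclic_products(us, v, 2, 5) == [direct(u, v, 5) for u in us]
```

With t products sharing the fixed operand V, only 2t + 1 transforms are performed instead of 3t, which is the source of the amortisation discussed above.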
To simplify the analysis, it is convenient to introduce the normalisation

C⋆_poly(r, p) := sup_{t ≥ 1} C_poly(t, r, p) / ((2t + 1) r lg p lg(r lg p)).

We certainly have M_p(n) < C_poly(1, n, p) + O(n lg p), so to prove Theorem 1.2 it is enough to show that

C⋆_poly(r, p) = O(4^{max(0, log∗ r − log∗ p)} K_Z^{log∗ p}).    (4.1)

The algorithms presented in this section perform many auxiliary multiplications and divisions involving 'small' integers and polynomials. We assume that all auxiliary divisions are reduced to multiplication via Newton's method [27, Ch. 9], so that the cost of a division (by a monic divisor) is at most a constant multiple of the cost of a multiplication of the same bit size. We also assume that, unless otherwise specified, all auxiliary multiplications are handled using the integer and polynomial variants of the Schönhage–Strassen algorithm, whose complexities are given by (1.1) and (1.4).

We first discuss a subroutine Transform that handles DFTs over rings of the form R_{p,α} := F_p[Y]/φ_α(Y), where p is a prime and α > 1. It takes as input p and α, positive integers t and n such that n is odd and relatively prime to α, a principal n-th root of unity ω ∈ R_{p,α}, and t input sequences (a_{s,0}, . . . , a_{s,n−1}) ∈ R_{p,α}^n for s = 1, . . . , t. Its output is the sequence of transforms (â_{s,0}, . . . , â_{s,n−1}) ∈ R_{p,α}^n with respect to ω, for s = 1, . . . , t. Just like PolynomialMultiply, the input and output sequences are stored consecutively on the tape.

Let T(t, n, α, p) denote the running time of Transform. The following result shows how to reduce the DFT problem to an instance of PolynomialMultiply.

Proposition 4.1.
We have

T(t, n, α, p) < C_poly(t, nα, p) + O(tnα lg α lg lg α lg p lg lg p lg lg lg p).

Proof.
Let R := R_{p,α}. We use Bluestein's method to reduce each DFT to the problem of computing a certain product f_s(Z) g(Z) in R[Z]/(Z^n − 1), plus O(n) multiplications in R, where f_s(Z) and g(Z) are defined as in Section 2.2. By (1.1) and (1.4), each multiplication in R costs

O((α lg α lg lg α)(lg p lg lg p lg lg lg p))    (4.2)

bit operations. To handle the products f_s(Z) g(Z), we first lift the polynomials from F_p[Y, Z]/(φ_α(Y), Z^n − 1) to F_p[Y, Z]/(Y^α − 1, Z^n − 1) (for example, by zero-padding in Y up to degree α). We then compute their images under the isomorphism F_p[Y, Z]/(Y^α − 1, Z^n − 1) ≅ F_p[X]/(X^{nα} − 1), in O(tnα lg α lg p) bit operations. We call PolynomialMultiply to compute the products in F_p[X]/(X^{nα} − 1), in C_poly(t, nα, p) bit operations. We evaluate the inverse of the above isomorphism to bring the products back to F_p[Y, Z]/(Y^α − 1, Z^n − 1), and then reduce modulo φ_α(Y) to obtain the desired products in R[Z]/(Z^n − 1). □

We now return to multiplication in F_p[X]/(X^r − 1). The routine PolynomialMultiply chooses one of two algorithms, depending on the size of r relative to p. For r ≤ p it uses the straightforward Kronecker substitution method described in Section 1. By (1.3) this yields the bound

C_poly(t, r, p) = O(t M(r lg p)) = O(t r lg p lg(r lg p) K_Z^{log∗(r lg p)})

and hence

C⋆_poly(r, p) = O(K_Z^{log∗(p^2 lg p)}) = O(K_Z^{log∗ p}), for r ≤ p.    (4.3)

Therefore (4.1) holds in this case.

For r > p, most of the work will be delegated to a subroutine AdmissibleMultiply, which is defined as follows. It takes as input an integer t ≥ 1, a prime p, a p-admissible length N, and polynomials U_1, . . . , U_t, V ∈ F_p[X]/(X^N − 1), and computes the products U_1 V, . . . , U_t V. In other words, it has the same interface as PolynomialMultiply, but it only works for p-admissible lengths. We denote its running time by C_ad(t, N, p). As above we also define the normalisation

C⋆_ad(N, p) := sup_{t ≥ 1} C_ad(t, N, p) / ((2t + 1) N lg p lg(N lg p)).

The reduction from PolynomialMultiply to AdmissibleMultiply in the case r > p is given by the following proposition.

Proposition 4.2.
There is an absolute constant z > 0 with the following property. For any prime p and any integer r > max(z, p), there exists a p-admissible length N in the interval

2r < N < (1 + O(1)/lg r) · 2r    (4.4)

such that

C⋆_poly(r, p) < (2 + O(1)/lg r) C⋆_ad(N, p) + O(1).    (4.5)

Proof. Given as input U_1, . . . , U_t, V ∈ F_p[X]/(X^r − 1), we wish to compute the products U_1 V, . . . , U_t V. For sufficiently large r we may apply Proposition 3.4 with n := 2r to find a p-admissible length N such that (4.4) holds. Since N > 2r, we may simply zero-pad to reduce each problem to multiplication in F_p[X]/(X^N − 1). Therefore

C_poly(t, r, p) < C_ad(t, N, p) + O(tr lg p) + 2^{O((lg lg r)^2)},

where the tr lg p term arises from the reduction modulo X^r − 1, and the last term from Proposition 3.4. Dividing by (2t + 1) r lg p lg(r lg p) and taking suprema over t ≥ 1, we find that

C⋆_poly(r, p) < (N lg(N lg p)) / (r lg(r lg p)) · C⋆_ad(N, p) + O(1).

Finally, since lg(N lg p) ≤ lg(r lg p) + 2 we obtain

(N lg(N lg p)) / (r lg(r lg p)) < 2 (1 + O(1)/lg r)(1 + 2/lg(r lg p)) < 2 + O(1)/lg r. □

The motivation for defining admissible lengths is the following result, which shows how to implement AdmissibleMultiply in terms of a large collection of exponentially smaller instances of PolynomialMultiply.
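The Kronecker substitution used for the base case r ≤ p converts one polynomial product into one integer product by packing coefficients into bit fields. A minimal illustrative sketch (our own toy code, ignoring the Turing-model data layout; Python's built-in big integers stand in for the fast integer multiplication M(n)):

```python
# Kronecker substitution sketch: multiply f, g in F_p[X] by packing their
# coefficients into an integer at X = 2^b and doing one integer product.
# Each coefficient of the integer product is < r * (p-1)^2, so choosing b
# with 2^b above that bound prevents adjacent fields from overlapping.

def kronecker_multiply(f, g, p):
    r = max(len(f), len(g))
    b = (r * (p - 1) ** 2).bit_length()       # enough bits per packed field
    pack = lambda h: sum(c << (b * i) for i, c in enumerate(h))
    w = pack(f) * pack(g)                      # single large integer product
    out = []
    while w:
        out.append((w & ((1 << b) - 1)) % p)   # unpack field, reduce mod p
        w >>= b
    return out

# degree-2 example over F_7: (1 + 2X + 3X^2)(4 + 5X) = 4 + 13X + 22X^2 + 15X^3
assert kronecker_multiply([1, 2, 3], [4, 5], 7) == [4, 6, 1, 1]
```

The packed integers have O(r lg p) bits, matching the problem size O(r lg p) quoted in the bound (4.3).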
Proposition 4.3.
There is an absolute constant z > with the following property.Let p be a prime and let N > z be a p -admissible length. Then there exist integers r , . . . , r d in the interval (lg lg N ) < r i < (lg lg N ) , (4.6) and weights γ , . . . , γ d > with P i γ i = 1 , such that C ⋆ ad ( N, p ) < (cid:18) O (1)lg lg N (cid:19) d X i =1 γ i C ⋆ poly ( r i , p ) + O (1) . (4.7) Proof.
We are given as input a prime p , a p -admissible length N = q · · · q e andpolynomials U , . . . , U t , V ∈ F p [ X ] / ( X N − U V, . . . , U t V . We will describe a series of reductions that converts this problemto a collection of exponentially smaller multiplication problems, plus overhead of O ( tN lg N lg p ) bit operations incurred during the reductions. Step 1 — reduce to products over cyclotomic coefficient ring.
Invoking Propo-sition 3.8, we compute a p -admissible divisor α of N , the cyclotomic polynomial φ α ∈ F p [ Y ], and a principal ( q · · · q e )-th root of unity ω ∈ F p [ Y ] /φ α . As p < N ,this requires at most 2 O ((lg lg N ) ) p o (1) < N / o (1) bit operations.Set ψ α := ( Y α − /φ α ∈ F p [ Y ]. Since Y α − F p [ Y ], we have ( φ α , ψ α ) = 1. Using the Euclidean algorithm, compute polynomials χ , χ ∈ F p [ Y ] of degree at most α such that χ φ α + χ ψ α = 1; this costs at most( α lg p ) O (1) < N o (1) bit operations.Let m := N/α . As m and α are coprime, Lemma 2.2 provides an isomorphism F p [ X ] / ( X N − ∼ = F p [ Y, Z ] / ( Y α − , Z m − O ( mα lg α lg p ) bit operations. By (3.4)this simplifies to O ( N (lg lg N ) lg p ) = O ( N lg N lg p ) bit operations. Next, since( φ α , ψ α ) = 1, there is an isomorphism F p [ Y ] / ( Y α − ∼ = ( F p [ Y ] /φ α ) ⊕ ( F p [ Y ] /ψ α ) . Using the precomputed polynomials χ and χ , we may evaluate the above isomor-phism in either direction in O (( α lg α lg lg α )(lg p lg lg p lg lg lg p )) = O ( α (lg lg N ) (lg lg lg N ) lg p )= O ( α lg N lg p )bit operations (here we have again used (3.4) and the fact that p < N ). Thisisomorphism induces another isomorphism F p [ Y ] / ( Y α − , Z m − ∼ = ( F p [ Y ] /φ α )[ Z ] / ( Z m − ⊕ ( F p [ Y ] /ψ α )[ Z ] / ( Z m − Z i separately; it may be evaluated in eitherdirection in O ( mα lg N lg p ) = O ( N lg N lg p ) bit operations. Chaining these iso-morphisms together, we obtain an isomorphism F p [ X ] / ( X N − ∼ = ( F p [ Y ] /φ α )[ Z ] / ( Z m − ⊕ ( F p [ Y ] /ψ α )[ Z ] / ( Z m − O ( N lg N lg p ) bit operations. We now use the following algorithm. First, at a cost of O ( tN lg N lg p ) bitoperations, apply the above isomorphism to U , . . . , U t and V to obtain polynomials U ′ , . . . , U ′ t , V ′ ∈ ( F p [ Y ] /φ α )[ Z ] / ( Z m − , ˜ U ′ , . . . , ˜ U ′ t , ˜ V ′ ∈ ( F p [ Y ] /ψ α )[ Z ] / ( Z m − . Second, compute the products ˜ U ′ ˜ V ′ , . . . 
, ˜ U ′ t ˜ V ′ : since deg ψ α < α/ lg N by (3.5),each of these products may be converted, via Kronecker substitution, to a productof univariate polynomials in F p [ X ] of degree O ( mα/ lg N ) = O ( N/ lg N ) (i.e., map Y to X and Z to X ψ α ). The cost of these multiplications is O ( t (( N/ lg N ) lg N lg lg N )(lg p lg lg p lg lg lg p )) = O ( tN lg N lg p )bit operations. Third, compute the products U ′ V ′ , . . . , U ′ t V ′ , using the methodexplained in Step 2 below. Finally, at a cost of O ( tN lg N lg p ) bit operations, applythe inverse isomorphism to the pairs ( U ′ s V ′ , ˜ U ′ s ˜ V ′ ) to obtain the desired products U V, . . . , U t V . Step 2 — convert to multidimensional convolutions.
Let R := F p [ Y ] /φ α . In thisstep our goal is to compute the products U ′ V ′ , . . . , U ′ t V ′ , where U ′ , . . . , U ′ t , V ′ ∈R [ Z ] / ( Z m − m d × · · · × m , for a suitable decomposition m = m · · · m d . Forthe subsequent complexity analysis, it is important that the m i are chosen to besomewhat larger than the coefficient size. To achieve this we proceed as follows.Let m = ℓ · · · ℓ u be the prime factorisation of m . The ℓ j form a subset of { q , . . . , q e } , so by (3.1) we have(lg N ) < ℓ j < (lg lg N ) (4.8)for each j . Let w := ⌊ (lg lg N ) ⌋ . We certainly have u > w for large enough N ,as (4.8) and (3.4) imply that u > log m (lg lg N ) = log N − log α (lg lg N ) > log N − (lg lg N ) (lg lg N ) ≫ (lg lg N ) . Therefore we may take m := ℓ · · · ℓ w ,m := ℓ w +1 · · · ℓ w , · · · m d − := ℓ ( d − w +1 · · · ℓ ( d − w ,m d := ℓ ( d − w +1 · · · ℓ dw ℓ dw +1 · · · ℓ u , where d := ⌊ u/w ⌋ >
1. Each m i is a product of exactly w primes, except possi-bly m d , which is a product of at least w and at most 2 w − N we have m i < (2 (lg lg N ) ) w (lg lg N ) (4.9)and m i > ((lg N ) ) w > (2 lg lg N − ) w > (lg lg N ) (4.10)for all i , and hence d log m (lg lg N ) lg N (lg lg N ) . (4.11) ASTER INTEGER AND POLYNOMIAL MULTIPLICATION 19
Computing the decomposition m = m · · · m d requires no more than (lg N ) O (1) bitoperations.As the m i are pairwise relatively prime, Corollary 2.3 furnishes an isomorphism R [ Z ] / ( Z m − ∼ = R [ Z , . . . , Z d ] / ( Z m − , . . . , Z m d d − O (( m lg m )( α lg p )) = O ( N lg N lg p )bit operations. Therefore we may use the following algorithm. First, at a cost of O ( tN lg N lg p ) bit operations, compute the images U ′′ , . . . , U ′′ t , V ′′ ∈ R [ Z , . . . , Z d ] / ( Z m − , . . . , Z m d d − U ′ , . . . , U ′ t , V ′ under the above isomorphism. Next, as explained in Step 3 below,compute the products U ′′ V ′′ , . . . , U ′′ t V ′′ . Finally, apply the inverse isomorphism torecover the products U ′ V ′ , . . . , U ′ t V ′ ; again this costs O ( tN lg N lg p ) bit operations. Step 3 — reduce to DFTs over R . In this step our goal is to compute the prod-ucts U ′′ V ′′ , . . . , U ′′ t V ′′ , where U ′′ , . . . , U ′′ t and V ′′ are as above. Let ω i := ω q ··· q e /m i for i = 1 , . . . , d , where ω is the principal ( q · · · q e )-th root of unity in R computedin Step 1. According to the discussion in Section 2.2, the desired multidimensionalconvolutions may be computed by performing t +1 multidimensional m -point DFTswith respect to the evaluation points ( ω j , . . . , ω j d d ), followed by tm pointwise multi-plications in R , and then t multidimensional m -point inverse DFTs and tm divisionsby m . The total cost of the pointwise multiplications and divisions is O ( tm ( α lg α lg lg α )(lg p lg lg p lg lg lg p )) = O ( tN lg N lg p )bit operations.Each of the 2 t + 1 multidimensional DFTs may be converted to a collectionof one-dimensional DFTs of lengths m , . . . , m d by the method explained in Sec-tion 2.2. Note that the inputs must be rearranged so that the data to transformalong each dimension may be accessed sequentially. Let 1 i d , and consider thetransforms of length m i . 
Treating each input vector as a sequence of m i +1 · · · m d arrays of size m i × ( m · · · m i − ), we must transpose each array into an array of size( m · · · m i − ) × m i , perform m/m i DFTs of length m i , and then transpose back tothe original ordering. The total cost of all these transpositions is O ( tmα lg p P i lg m i ) = O ( tN lg p lg m ) = O ( tN lg N lg p )bit operations.The one-dimensional DFTs over R are handled by the Transform subroutine.Combining the contributions from Steps 1, 2 and 3 shows that C ad ( t, N, p ) < (2 t + 1) d X i =1 T (cid:16) mm i , m i , α, p (cid:17) + O ( tN lg N lg p ) . This concludes the description of the algorithm; it remains to establish the overallcomplexity claim. First, Proposition 4.1 yields d X i =1 T (cid:16) mm i , m i , α, p (cid:17) < d X i =1 C poly (cid:16) mm i , m i α, p (cid:17) + O ( dmα lg α lg lg α lg p lg lg p lg lg lg p ) . By (4.11), the last term lies in O ( dN (lg lg N ) (lg lg lg N ) lg p ) = O ( N lg N lg p ) . Setting r i := m i α for i = 1 , . . . , d , we obtain C ad ( t, N, p ) < (2 t + 1) d X i =1 C poly (cid:16) Nr i , r i , p (cid:17) + O ( tN lg N lg p ) . Notice that (4.6) follows immediately from (4.9), (4.10) and (3.4) (for large N ).For the normalised quantities, we have C ⋆ ad ( N, p ) < d X i =1 C poly (cid:0) Nr i , r i , p (cid:1) N lg p lg( N lg p ) + O (1) < d X i =1 (cid:16) Nr i + 1 (cid:17) r i lg( r i lg p ) N lg( N lg p ) C ⋆ poly ( r i , p ) + O (1) . Now observe thatlg( r i lg p )lg( N lg p ) < log m i + log α + lg lg p + O (1)log N < log m i + O ((lg lg N ) )log m . Put γ i := log m i / log m , so that P i γ i = 1. Then (4.10) implies thatlg( r i lg p )lg( N lg p ) < (cid:18) O ((lg lg N ) )log m i (cid:19) γ i < (cid:18) O (1)lg lg N (cid:19) γ i . Moreover, from (4.6) we certainly have (cid:16) Nr i + 1 (cid:17) r i N = 2 + r i N < O (1)lg lg N .
The desired bound (4.7) follows immediately. □
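Step 2 of the preceding proof rests on the coprime-factor isomorphism of Corollary 2.3, which converts a cyclic convolution of length m = m_1 · · · m_d, with the m_i pairwise coprime, into a d-dimensional cyclic convolution. The index trick can be sketched for d = 2 with naive convolutions (our own illustrative code, not the paper's in-place Turing-machine implementation):

```python
# CRT index map behind Step 2: for coprime m1, m2, sending k to
# (k mod m1, k mod m2) turns a length m = m1*m2 cyclic convolution into an
# m1 x m2 two-dimensional cyclic convolution (the Agarwal-Cooley technique).

def to_grid(a, m1, m2):
    grid = [[0] * m2 for _ in range(m1)]
    for k in range(m1 * m2):
        grid[k % m1][k % m2] = a[k]           # CRT gives a bijection
    return grid

def from_grid(grid, m1, m2):
    return [grid[k % m1][k % m2] for k in range(m1 * m2)]

def cyclic(a, b, m):
    out = [0] * m
    for i in range(m):
        for j in range(m):
            out[(i + j) % m] += a[i] * b[j]
    return out

def cyclic2d(A, B, m1, m2):
    out = [[0] * m2 for _ in range(m1)]
    for i1 in range(m1):
        for j1 in range(m1):
            for i2 in range(m2):
                for j2 in range(m2):
                    out[(i1 + j1) % m1][(i2 + j2) % m2] += A[i1][i2] * B[j1][j2]
    return out

m1, m2 = 3, 5
a = list(range(1, 16))
b = [k * k % 7 for k in range(15)]
lhs = cyclic(a, b, 15)
rhs = from_grid(cyclic2d(to_grid(a, m1, m2), to_grid(b, m1, m2), m1, m2), m1, m2)
assert lhs == rhs
```

Because index addition commutes with the CRT map, the one-dimensional wraparound modulo m becomes independent wraparounds modulo each m_i, which is what allows the multidimensional DFTs of Step 3.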
Combining Proposition 4.2 and Proposition 4.3, we obtain the following recurrence inequality for C⋆_poly(r, p). (This is identical to Theorem 7.1 of [18], but with the constant 8 replaced by 4.)

Proposition 4.4.
There are absolute constants z, C_1, C_2 > 0 and a logarithmically slow function Φ : (z, ∞) → R with the following property. For any prime p and any integer r > max(z, p), there exist positive integers r_1, . . . , r_d < Φ(r), and weights γ_1, . . . , γ_d > 0 with Σ_i γ_i = 1, such that

C⋆_poly(r, p) < (4 + C_1/lg lg r) Σ_{i=1}^{d} γ_i C⋆_poly(r_i, p) + C_2.    (4.12)

Proof.
We first apply Proposition 4.2 to construct a p -admissible length N suchthat (4.4) and (4.5) both hold; then we apply Proposition 4.3 to construct inte-gers r , . . . , r d and weights γ , . . . , γ d satisfying (4.6) and (4.7). Define Φ( x ) :=2 (log log x ) ; then certainly r i < (lg lg 3 r ) < Φ( r ) for large r . The bound (4.12)follows immediately by substituting (4.7) into (4.5). (cid:3) Now we may prove our main result for multiplication in F p [ X ]. The proof is verysimilar to that of [18, Thm. 1.1]. Proof of Theorem 1.2.
We have already noted that C⋆_poly(r, p) = O(K_Z^{log∗ p}) in the region r ≤ p (see (4.3)). To handle the case r > p, let z, C_1, C_2 and Φ(x) be as in Proposition 4.4. Increasing z if necessary, we may assume that z > exp(exp(1)) and that Φ(x) ≤ x − 1 for all x > z. For each prime p, set σ_p := max(z, p) and

L_p := max(C_2, max_{r ≤ σ_p} C⋆_poly(r, p)) = O(K_Z^{log∗ p}).

Now apply Proposition 2.1 with K = 4, B = C_1/4, S = {1, 2, . . .}, ℓ = 2, κ = 1, x_0 = x_1 = z, σ = σ_p, L = L_p, and T(r) = C⋆_poly(r, p). The first part of the recurrence for T(y) is satisfied due to the definition of L_p, and the second part due to Proposition 4.4. We conclude that C⋆_poly(r, p) = O(L_p 4^{log∗ r − log∗ σ_p}) for r > p. Since log∗ σ_p = log∗ p + O(1) and L_p = O(K_Z^{log∗ p}), we obtain the desired bound

C⋆_poly(r, p) = O(4^{log∗ r − log∗ p} K_Z^{log∗ p}) for r > p. □

5. Faster integer multiplication
The goal of this section is to prove Theorem 1.1. We will describe a recursive routine IntegerMultiply, that takes as input positive integers n and t, and integers u_1, . . . , u_t, v ∈ Z/(2^n − 1)Z, and computes the products u_1 v, . . . , u_t v. We denote its running time by C_int(t, n). As in Section 4, it is convenient to define the normalisation

C⋆_int(n) := sup_{t ≥ 1} C_int(t, n) / ((2t + 1) n lg n).

We certainly have M(n) < C_int(1, n) + O(n), so to prove Theorem 1.1 it is enough to prove that

C⋆_int(n) = O((4√2)^{log∗ n}).    (5.1)

We begin by revisiting the polynomial multiplication algorithm from Section 4. Recall that to handle a multiplication problem in F_p[X]/(X^r − 1) for r ≤ p, we used Kronecker substitution to convert it to an integer multiplication problem of size O(r lg p) (see (4.3)). This approach is suboptimal because it ignores the cyclic structure of F_p[X]/(X^r − 1). To take advantage of this structure, we introduce two new routines RefinedPolynomialMultiply and RefinedAdmissibleMultiply. They have exactly the same interface as PolynomialMultiply and AdmissibleMultiply. Their running times are denoted by C̃_poly(t, r, p) and C̃_ad(t, N, p), with corresponding normalisations C̃⋆_poly(r, p) and C̃⋆_ad(N, p). The implementation of RefinedAdmissibleMultiply is exactly the same as AdmissibleMultiply, except that calls to PolynomialMultiply are replaced by calls to RefinedPolynomialMultiply. Similarly, the implementation of RefinedPolynomialMultiply for r > p is exactly the same as PolynomialMultiply, except that calls to AdmissibleMultiply are replaced by calls to RefinedAdmissibleMultiply. Therefore these routines satisfy analogues of Proposition 4.2 and Proposition 4.3, with C⋆_poly and C⋆_ad replaced by C̃⋆_poly and C̃⋆_ad.

Where the new routines differ is in the implementation of RefinedPolynomialMultiply for the case r ≤ p, which is described in the proof of the following result. The idea is to exploit the cyclic structure by using IntegerMultiply to handle the resulting (cyclic) integer multiplication. This device saves a constant factor at each recursion level of the main algorithm.
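The packing idea can be sketched as follows: unlike plain Kronecker substitution, the relation X^r = 1 corresponds exactly to 2^{rb} = 1 when each coefficient occupies b bits, so the product in F_p[X]/(X^r − 1) maps to a single cyclic integer product modulo 2^{rb} − 1. This is our own toy illustration (small parameters; Python's `%` on big integers stands in for IntegerMultiply):

```python
# Cyclic packing sketch: a product in F_p[X]/(X^r - 1) becomes one integer
# product modulo 2^(rb) - 1, since reduction modulo 2^(rb) - 1 wraps the
# high packed fields around exactly as X^r = 1 wraps high powers of X.

def cyclic_mul_via_integers(u, v, p):
    r = len(u)
    b = (r * (p - 1) ** 2).bit_length()        # room for wrapped coefficient sums
    n = r * b
    pack = lambda h: sum(c << (b * i) for i, c in enumerate(h))
    w = (pack(u) * pack(v)) % (2 ** n - 1)     # cyclic integer multiplication
    return [((w >> (b * i)) & ((1 << b) - 1)) % p for i in range(r)]

def direct(u, v, p):
    # reference: schoolbook multiplication modulo X^r - 1
    r = len(u)
    out = [0] * r
    for i in range(r):
        for j in range(r):
            out[(i + j) % r] = (out[(i + j) % r] + u[i] * v[j]) % p
    return out

u, v, p = [3, 1, 4, 1], [2, 7, 1, 8], 11
assert cyclic_mul_via_integers(u, v, p) == direct(u, v, p)
```

The point of the cyclic packing is that no zero-padding to degree 2r is needed: the integer has only rb bits rather than the roughly 2rb bits a non-cyclic Kronecker product would require, which is where the saved constant factor comes from.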
Proposition 5.1.
For any prime p, and for any positive integer r satisfying (lg lg p)^2 < lg r < (lg p)^{1/2}, there exists an integer n in the interval

2r lg p < n < (1 + O(1)/lg r) · 2r lg p    (5.2)

such that

C̃⋆_poly(r, p) < (2 + O(1)/lg r) C⋆_int(n) + O(1).

(Note that this bound does not hold over the whole range r ≤ p; to obtain the constant 2, we need to restrict to a smaller range of r.)

Proof. We are given as input U_1, . . . , U_t, V ∈ F_p[X]/(X^r − 1), and we wish to compute U_1 V, . . . , U_t V. We use the following algorithm. Lift the inputs to polynomials U′_1, . . . , U′_t, V′ ∈ Z[X]/(X^r − 1), with coefficients in the interval 0 ≤ x < p. Evaluate these polynomials at X = 2^b, where b := 2 lg p + lg r; that is, pack the coefficients together to obtain integers u_s := U′_s(2^b) and v := V′(2^b) in Z/(2^{rb} − 1)Z. Call IntegerMultiply with n := rb to compute the cyclic integer products w_s := u_s v. Then we have w_s = W′_s(2^b), where W′_s := U′_s V′ ∈ Z[X]/(X^r − 1). The coefficients of W′_s lie in the interval 0 ≤ x ≤ r(p − 1)^2 < rp^2, and since 2^b ≥ rp^2, we may unpack w_s to recover the coefficients of W′_s unambiguously. Finally, by reducing the coefficients of W′_s modulo p, we arrive at the desired products W_s ∈ F_p[X]/(X^r − 1).

Since n = 2r lg p + r lg r, the bound (5.2) follows by taking into account the hypothesis that lg r < (lg p)^{1/2}. For the complexity we have

C̃_poly(t, r, p) < C_int(t, n) + O(tr lg p lg lg p lg lg lg p),

where the last term covers the divisions by p at the end of the algorithm (and also the linear-time packing and unpacking steps). Dividing by (2t + 1) r lg p lg(r lg p) and taking suprema over t ≥ 1, we obtain

C̃⋆_poly(r, p) < (n lg n) / (r lg p lg(r lg p)) · C⋆_int(n) + O(lg lg p lg lg lg p / lg(r lg p)).

The last term lies in O(1) thanks to the assumption lg r > (lg lg p)^2. Moreover, (5.2) implies that lg n ≤ lg(r lg p) + 2, so we find that

(n lg n) / (r lg p lg(r lg p)) < 2 (1 + O(1)/lg r)(1 + 2/lg(r lg p)) < 2 + O(1)/lg r. □

Now we describe the implementation of
IntegerMultiply . It chooses one oftwo algorithms, depending on the size of n . For small n , it calls any convenientbasecase multiplication algorithm, such as the Sch¨onhage–Strassen algorithm. Forlarge n , it uses the algorithm described in the proof of Proposition 5.2 below.This algorithm reduces the problem to a collection of instances of RefinedAd-missibleMultiply , one for each prime p ∈ P ( n ), where P ( n ) is defined to bethe set consisting of the smallest lg n primes that exceed (lg n ) (we will see inthe proof below that these primes satisfy lg p = 2 lg lg n + O (1)). For example, P (10 ) = { , , . . . , } (the first 14 primes after 144.5). ASTER INTEGER AND POLYNOMIAL MULTIPLICATION 23
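One ingredient of the proof below is the element θ ∈ Z/PZ with θ^N = 2 required by the Crandall–Fagin reduction; its construction via the Chinese remainder theorem can be sketched with toy parameters. The values N = 7 and primes {5, 7, 11} are our own stand-ins, chosen only so that gcd(N, p − 1) = 1 for each p, as guaranteed at scale by the fact that N is a product of primes exceeding every p ∈ P(n):

```python
# Toy illustration of constructing theta in Z/PZ with theta^N = 2:
# per prime p, set a_p := N^{-1} mod (p - 1) and theta_p := 2^{a_p} mod p,
# so (theta_p)^N = 2 (mod p); then glue the theta_p by the CRT.

from math import prod

def build_theta(N, primes):
    P = prod(primes)
    theta = 0
    for p in primes:
        a_p = pow(N, -1, p - 1)       # exists since gcd(N, p - 1) = 1
        theta_p = pow(2, a_p, p)      # an N-th root of 2 modulo p
        M = P // p
        theta = (theta + theta_p * M * pow(M, -1, p)) % P   # CRT gluing
    return theta

primes = [5, 7, 11]                   # toy stand-in for P(n)
N = 7                                 # coprime to 4, 6 and 10
theta = build_theta(N, primes)
assert pow(theta, N, prod(primes)) == 2
```

This mirrors the computation in Step 2 of the proof, where everything is done in (lg n)^{O(1)} bit operations because N, P and the primes are all tiny compared with n.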
Proposition 5.2.
There is an absolute constant z > with the following property.For all n > z , there exists an admissible length N in the interval n lg n lg lg n < N < (cid:18) n (cid:19) n lg n lg lg n , (5.3) such that N is p -admissible for all p ∈ P ( n ) , and such that C ⋆ int ( n ) < (cid:18) O (1)lg lg n (cid:19) X p ∈P ( n ) n ˜ C ⋆ ad ( N, p ) + O (1) . (5.4) Proof.
We are given as input u , . . . , u t , v ∈ Z / (2 n − Z , and we wish to computethe products u v, . . . , u t v . Step 1 — choose parameters.
In this preliminary step we compute a number ofparameters that depend only on n .The prime number theorem (see for example [3, p. 9]) implies that the numberof primes between (lg n ) and (lg n ) is asymptotically(lg n ) n ) ) ≫ lg n. Therefore, for large n we certainly have (lg n ) < p < (lg n ) (5.5)for all p ∈ P ( n ). Define P := Q p ∈P ( n ) p ; then(2 log lg n −
1) lg n log P (2 log lg n ) lg n, so 2 lg n lg lg n − n lg P n lg lg n. (5.6)Clearly we may compute P ( n ) and P within (lg n ) O (1) bit operations.Let n ′ := (cid:24) n lg P − lg n − (cid:25) ; (5.7)this makes sense for large n , as lg P ≫ lg n . Using Proposition 3.4 (with p = 2),construct an admissible length N in the interval n ′ < N < (cid:18) n ′ (cid:19) n ′ . The invocation of Proposition 3.4 costs 2 O ((lg lg n ′ ) ) = O ( n ) bit operations. Let uscheck that (5.3) holds for this choice of N . In one direction, by (5.6) we have N > n ′ > n lg P − lg n − > n lg P > n lg n lg lg n . (5.8) For the other direction we have
N < (cid:18) n ′ (cid:19) (cid:18) n n lg lg n − n − (cid:19) = (cid:18) n ′ (cid:19) (cid:18) n + 32 lg n lg lg n − n − n lg lg nn (cid:19) n lg n lg lg n< (cid:18) n ′ (cid:19) (cid:18) o (1)lg lg n (cid:19) n lg n lg lg n< (cid:18) n (cid:19) n lg n lg lg n for large n .Finally, let us verify that N is p -admissible for all p ∈ P ( n ) (for large n ). First,by (5.8) and (5.5) we have N > (lg n ) > p . Also, by (3.1) and (5.8), every primedivisor of N = q · · · q e satisfies q i > (lg N ) > (lg n ) > p ; in particular, p ∤ N . Step 2 — convert to polynomial product modulo P . In this step we applythe Crandall–Fagin algorithm from Section 2.3. For this, we require that lg
P > ⌈ n/N ⌉ + lg N + 1; this follows from (5.7) aslg P > nn ′ + lg n + 3 > nN + lg n + 3 > l nN m + lg N + 1 . We also require an element θ ∈ Z /P Z such that θ N = 2. To construct θ , we firstcompute a p := N − (mod p −
1) for each p ∈ P ( n ). This modular inverse existsbecause N is a product of primes that are all greater than p , and hence relativelyprime to p −
1. Then we put θ p := 2 a p (mod p ), so that ( θ p ) N = 2 (mod p ). Usingthe Chinese remainder theorem, we compute θ ∈ Z /P Z such that θ = θ p (mod p )for all p ∈ P ( n ); then θ N = 2 as desired. All of this can be effected in (lg n ) O (1) bitoperations.According to Section 2.3, the problem of computing u v, . . . , u t v reduces to com-puting U V, . . . , U t V for certain polynomials U , . . . , U t , V ∈ ( Z /P Z )[ X ] / ( X N − O ( t ( N lg P + N (lg n ) + N lg P lg lg P lg lg lg P )) = O ( tn lg n/ lg lg n )bit operations. Step 3 — reduce to products modulo small primes.
In this step we convert each multiplication problem in (Z/PZ)[X]/(X^N − 1) into a collection of products in F_p[X]/(X^N − 1), one for each p ∈ P(n). We start with the isomorphism Z/PZ ≅ ⊕_{p∈P(n)} F_p, which may be computed in either direction in O(lg P (lg lg P)^2 lg lg lg P) bit operations using fast simultaneous modular reduction and fast Chinese remaindering algorithms [27, §10.3]. Applying this isomorphism coefficientwise yields an isomorphism

(Z/PZ)[X]/(X^N − 1) ≅ ⊕_{p∈P(n)} F_p[X]/(X^N − 1),

which may be computed in either direction, for all s = 1, . . . , t, in

O(tN lg P (lg lg P)^2 lg lg lg P) = O(tn (lg lg n)^2 lg lg lg n)

bit operations. Note that the isomorphism Z/PZ ≅ ⊕_p F_p must be applied to the coefficient of each X^i independently, but the subroutine for multiplying in F_p[X]/(X^N − 1) needs sequential access to all of the residues for a single prime p. The required data rearrangement corresponds to transposing a tN × |P(n)| array, which costs only O(tN |P(n)| lg |P(n)| max_p lg p) = O(tn lg lg n) bit operations. Finally, for each p ∈ P(n), the products in F_p[X]/(X^N − 1) may be computed by calling
RefinedAdmissibleMultiply , since N is a p -admissible length.Combining the contributions from Steps 1, 2 and 3, we obtain C int ( t, n ) < X p ∈P ( n ) ˜ C ad ( t, N, p ) + O ( tn lg n/ lg lg n ) . Dividing by (2 t + 1) n lg n and taking suprema over t > C ⋆ int ( n ) < X p ∈P ( n ) ˜ C ⋆ ad ( N, p ) N lg p lg( N lg p ) n lg n + O (1) . By (5.5) we have lg p n , so (5.3) implies that N lg p < (cid:18) O (1)lg lg n (cid:19) n lg n , and also lg( N lg p ) lg n , for large n . The bound (5.4) follows immediately. (cid:3) We may now glue together the various pieces to obtain a doubly-exponentialrecurrence for C ⋆ int ( n ). Proposition 5.3.
There are absolute constants z_2 > z_1 > 2 and C_1, C_2 > 0, and a logarithmically slow function Ψ : (z_1, ∞) → R with Ψ(x) > z_1 for x > z_2, with the following property. For any n > z_2, there exist positive integers n_1, …, n_d < Ψ(Ψ(n)), and weights γ_1, …, γ_d > 0 with Σ_i γ_i = 1, such that

    C⋆_int(n) < (32 + C_1/lg lg lg n) Σ_{i=1}^{d} γ_i C⋆_int(n_i) + C_2.    (5.9)

Proof.
Step 1 — top-level call to IntegerMultiply. Applying Proposition 5.2, we obtain an admissible length N in the interval

    n / (2 lg n lg lg n) < N < (1 + O(1)/lg lg n) · n / (lg n lg lg n),    (5.10)

such that N is p-admissible for all p ∈ P(n), and such that

    C⋆_int(n) < (2 + O(1)/lg lg n) Σ_{p∈P(n)} (1/lg n) ˜C⋆_ad(N, p) + O(1).    (5.11)

In what follows, we frequently use the estimates lg N = lg n + O(lg lg n) and lg lg N = lg lg n + O(1), which follow from (5.10). Also, from (5.5) we have lg p = 2 lg lg n + O(1) for all p ∈ P(n).

Step 2 — first call to RefinedAdmissibleMultiply. In this step we use (the refined analogue of) Proposition 4.3 to estimate the ˜C⋆_ad(N, p) term in (5.11), for a fixed p ∈ P(n). We obtain integers r_{p,1}, …, r_{p,d_p} such that

    2^{(lg lg N)^2} < r_{p,i} < 2^{2 (lg lg N)^2},    (5.12)

and weights γ_{p,1}, …, γ_{p,d_p} > 0 with Σ_i γ_{p,i} = 1, such that

    ˜C⋆_ad(N, p) < (2 + O(1)/lg lg N) Σ_{i=1}^{d_p} γ_{p,i} ˜C⋆_poly(r_{p,i}, p) + O(1).

Substituting into (5.11), and using lg lg N = lg lg n + O(1), yields

    C⋆_int(n) < (4 + O(1)/lg lg n) Σ_{p∈P(n)} Σ_{i=1}^{d_p} (γ_{p,i}/lg n) ˜C⋆_poly(r_{p,i}, p) + O(1).    (5.13)

Step 3 — first call to RefinedPolynomialMultiply. In this step we use (the refined analogue of) Proposition 4.2 to estimate the ˜C⋆_poly(r_{p,i}, p) term in (5.13), for a fixed p ∈ P(n) and i ∈ {1, …, d_p}. The precondition r_{p,i} > max(z_0, p^2) (with z_0 the absolute constant from Proposition 4.2) holds for large n, because by (5.12) we have

    lg(p^2) = 4 lg lg n + O(1) < (lg lg N)^2 < lg r_{p,i}.

Thus there exists a p-admissible length N_{p,i} in the interval

    2 r_{p,i} < N_{p,i} < (1 + 1/lg r_{p,i}) · 2 r_{p,i}    (5.14)

such that

    ˜C⋆_poly(r_{p,i}, p) < (2 + O(1)/lg lg n) ˜C⋆_ad(N_{p,i}, p) + O(1).

Substituting into (5.13) yields

    C⋆_int(n) < (8 + O(1)/lg lg n) Σ_{p∈P(n)} Σ_{i=1}^{d_p} (γ_{p,i}/lg n) ˜C⋆_ad(N_{p,i}, p) + O(1).    (5.15)

Step 4 — second call to RefinedAdmissibleMultiply. In this step we use Proposition 4.3 again, to estimate the ˜C⋆_ad(N_{p,i}, p) term in (5.15), for a fixed p ∈ P(n) and i ∈ {1, …, d_p}. We obtain integers r_{p,i,1}, …, r_{p,i,d_{p,i}} such that

    2^{(lg lg N_{p,i})^2} < r_{p,i,j} < 2^{2 (lg lg N_{p,i})^2},    (5.16)

and weights γ_{p,i,1}, …, γ_{p,i,d_{p,i}} > 0 with Σ_j γ_{p,i,j} = 1, such that

    ˜C⋆_ad(N_{p,i}, p) < (2 + O(1)/lg lg N_{p,i}) Σ_{j=1}^{d_{p,i}} γ_{p,i,j} ˜C⋆_poly(r_{p,i,j}, p) + O(1).

We have lg lg N_{p,i} ≥ lg lg r_{p,i} ≥ 2 lg lg lg n + O(1), so substituting into (5.15) yields

    C⋆_int(n) < (16 + O(1)/lg lg lg n) Σ_{p∈P(n)} Σ_{i=1}^{d_p} Σ_{j=1}^{d_{p,i}} (γ_{p,i} γ_{p,i,j}/lg n) ˜C⋆_poly(r_{p,i,j}, p) + O(1).    (5.17)

Step 5 — second call to RefinedPolynomialMultiply. In this step we use Proposition 5.1 to estimate the ˜C⋆_poly(r_{p,i,j}, p) term in (5.17), for a fixed p ∈ P(n), i ∈ {1, …, d_p} and j ∈ {1, …, d_{p,i}}. The precondition

    (lg lg p)^2 < lg r_{p,i,j} < (lg p)/2

holds for large n, as (5.16), (5.14) and (5.12) imply that

    (2 lg lg lg n + O(1))^2 < lg r_{p,i,j} < 2 (2 lg lg lg n + O(1))^2,

whereas (lg lg p)^2 = (lg lg lg n + O(1))^2 and (lg p)/2 = lg lg n + O(1). We thus obtain an integer n_{p,i,j} in the interval

    2 r_{p,i,j} lg p < n_{p,i,j} < (1 + 1/lg r_{p,i,j}) · 2 r_{p,i,j} lg p    (5.18)

such that

    ˜C⋆_poly(r_{p,i,j}, p) < (2 + O(1)/lg lg lg n) C⋆_int(n_{p,i,j}) + O(1).

Substituting into (5.17) produces

    C⋆_int(n) < (32 + O(1)/lg lg lg n) Σ_{p∈P(n)} Σ_{i=1}^{d_p} Σ_{j=1}^{d_{p,i}} (γ_{p,i} γ_{p,i,j}/lg n) C⋆_int(n_{p,i,j}) + O(1).

The weights γ_{p,i} γ_{p,i,j}/lg n sum to 1, so after appropriate reindexing we obtain the desired bound (5.9).

Finally, for the logarithmically slow function Ψ(x) := 2^{(log log x)^3}, let us verify that for large n, we have n_{p,i,j} < Ψ(Ψ(n)) for all p, i and j. First, since lg lg N_{p,i} ≥ 2 lg lg lg n + O(1), we have lg p < 2^{lg lg N_{p,i}} for large n, and hence, by (5.18) and (5.16),

    n_{p,i,j} < 3 r_{p,i,j} lg p < 2^{2 + 2 (lg lg N_{p,i})^2 + lg lg N_{p,i}} < Ψ(N_{p,i}).

Then by (5.14) and (5.12) we have

    N_{p,i} < 3 r_{p,i} < 2^{2 + 2 (lg lg N)^2} < 2^{3 (lg lg n)^2} < Ψ(n).

Since Ψ(x) is increasing, we get the desired inequality n_{p,i,j} < Ψ(Ψ(n)). □

Now we may prove the main theorem for integer multiplication.
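Before giving the formal proof, it may help to see informally how a recurrence of the shape (5.9) leads to the claimed bound; the following unrolling is only a heuristic sketch and not a substitute for the master theorem (Proposition 2.1). Each application of (5.9) costs a multiplicative factor of roughly 32 while shrinking the argument from n to at most Ψ(Ψ(n)), i.e., it descends two levels of the iterated-logarithm hierarchy at once:

```latex
% Heuristic unrolling of (5.9), suppressing the C_1/\lg\lg\lg n and C_2 terms.
% Since \Psi is logarithmically slow, about \tfrac12 \log^* n applications of
% \Psi \circ \Psi reduce n to bounded size, so
\[
   C^{\star}_{\mathrm{int}}(n)
     \;\lesssim\; 32\, C^{\star}_{\mathrm{int}}\bigl(\Psi(\Psi(n))\bigr)
     \;\lesssim\; 32^{2}\, C^{\star}_{\mathrm{int}}\bigl(\Psi^{\circ 4}(n)\bigr)
     \;\lesssim\; \cdots
     \;\lesssim\; 32^{\frac12 \log^* n + O(1)}
     \;=\; O\bigl((4\sqrt{2}\,)^{\log^* n}\bigr),
\]
% using 32^{1/2} = 4\sqrt{2}.
```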
Proof of Theorem 1.1. We have already noted that it suffices to establish that C⋆_int(n) = O((4√2)^{log* n}) (see (5.1)). Let z_1, z_2, C_1, C_2 and Ψ(x) be as in Proposition 5.3. Increasing z_2 if necessary, we may assume that z_2 > exp(exp(exp(1))) and that Ψ(x) < x for all x > z_2. Applying Proposition 2.1 with K = 32, B = C_2, ℓ = 3, κ = 2, thresholds x_1 = z_1 and x_2 = σ = z_2, L = max(C_2, max_{n ≤ z_2} C⋆_int(n)), and T(n) = C⋆_int(n) leads immediately to the desired bound. □

Acknowledgments
The authors thank Grégoire Lecerf for his comments on a draft of this paper. The first author was supported by the Australian Research Council (grants DP150101689 and FT160100219).
References
1. L. M. Adleman, C. Pomerance, and R. S. Rumely, On distinguishing prime numbers from composite numbers, Ann. of Math. (2) (1983), no. 1, 173–206.
2. R. Agarwal and J. Cooley, New algorithms for digital convolution, IEEE Transactions on Acoustics, Speech, and Signal Processing (1977), no. 5, 392–410.
3. T. M. Apostol, Introduction to analytic number theory, Undergraduate Texts in Mathematics, Springer-Verlag, New York-Heidelberg, 1976.
4. R. C. Baker, G. Harman, and J. Pintz, The difference between consecutive primes. II, Proc. London Math. Soc. (3) (2001), no. 3, 532–562.
5. L. I. Bluestein, A linear filtering approach to the computation of discrete Fourier transform, IEEE Transactions on Audio and Electroacoustics (1970), no. 4, 451–455.
6. A. Bostan, P. Gaudry, and É. Schost, Linear recurrences with polynomial coefficients and application to integer factorization and Cartier–Manin operator, SIAM J. Comput. (2007), no. 6, 1777–1806.
7. D. G. Cantor and E. Kaltofen, On fast multiplication of polynomials over arbitrary algebras, Acta Inform. (1991), no. 7, 693–701.
8. S. Covanov and E. Thomé, Fast integer multiplication using generalized Fermat primes, http://arxiv.org/abs/1502.02800, 2016.
9. R. Crandall and B. Fagin, Discrete weighted transforms and large-integer arithmetic, Math. Comp. (1994), no. 205, 305–324.
10. A. De, P. Kurur, C. Saha, and R. Saptharishi, Fast integer multiplication using modular arithmetic, SIAM J. Comput. (2013), no. 2, 685–699.
11. M. Fürer, Faster integer multiplication, STOC'07—Proceedings of the 39th Annual ACM Symposium on Theory of Computing, ACM, New York, 2007, pp. 57–66.
12. ———, Faster integer multiplication, SIAM J. Comput. (2009), no. 3, 979–1005.
13. I. J. Good, The interaction algorithm and practical Fourier analysis, J. Roy. Statist. Soc. Ser. B (1958), 361–372.
14. D. Harvey, Faster truncated integer multiplication, https://arxiv.org/abs/1703.00640, 2017.
15. D. Harvey and J. van der Hoeven, Faster integer multiplication using plain vanilla FFT primes, https://arxiv.org/abs/1611.07144, to appear in Mathematics of Computation, 2016.
16. D. Harvey, J. van der Hoeven, and G. Lecerf, Faster polynomial multiplication over finite fields, technical report, http://arxiv.org/abs/1407.3361, 2014.
17. ———, Even faster integer multiplication, J. Complexity (2016), 1–30.
18. ———, Faster polynomial multiplication over finite fields, J. ACM (2017), no. 6, 52:1–52:23.
19. C. H. Papadimitriou, Computational complexity, Addison-Wesley Publishing Company, Reading, MA, 1994.
20. J. M. Pollard, The fast Fourier transform in a finite field, Math. Comp. (1971), 365–374.
21. K. Prachar, Über die Anzahl der Teiler einer natürlichen Zahl, welche die Form p − 1 haben, Monatsh. Math. (1955), 91–97.
22. A. Schönhage, Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2, Acta Informat. (1976/77), no. 4, 395–398.
23. A. Schönhage and V. Strassen, Schnelle Multiplikation grosser Zahlen, Computing (Arch. Elektron. Rechnen) (1971), 281–292.
24. V. Shoup, On the deterministic complexity of factoring polynomials over finite fields, Inform. Process. Lett. (1990), no. 5, 261–267.
25. ———, Searching for primitive roots in finite fields, Math. Comp. (1992), no. 197, 369–380.
26. L. H. Thomas, Using computers to solve problems in physics, Applications of digital computers, 1963, pp. 42–57.
27. J. von zur Gathen and J. Gerhard, Modern computer algebra, third ed., Cambridge University Press, Cambridge, 2013.
E-mail address: [email protected]

School of Mathematics and Statistics, University of New South Wales, Sydney NSW 2052, Australia

E-mail address: [email protected]