Faster integer multiplication using plain vanilla FFT primes
aa r X i v : . [ c s . S C ] O c t MATHEMATICS OF COMPUTATIONVolume 00, Number 0, Pages 000–000S 0025-5718(XX)0000-0
FASTER INTEGER MULTIPLICATION USINGPLAIN VANILLA FFT PRIMES
DAVID HARVEY AND JORIS VAN DER HOEVEN
Abstract.
Assuming a conjectural upper bound for the least prime in anarithmetic progression, we show that n -bit integers may be multiplied in O ( n log n log ∗ n ) bit operations. Introduction
Let M ( n ) be the number of bit operations required to multiply two n -bit integersin the deterministic multitape Turing model [22]. A decade ago, F¨urer [8, 9] provedthat M ( n ) = O ( n log n K log ∗ n ) (1)for some constant K >
1. Here log ∗ x denotes the iterated logarithm, that is,log ∗ x := min { j ∈ N : log ◦ j x } ( x ∈ R ) , where log ◦ j x := log · · · log x (iterated j times). Harvey, van der Hoeven andLecerf [14] subsequently gave a related algorithm that achieves (1) with the ex-plicit value K = 8, and more recently Harvey announced that one may achieve K = 6 by applying new techniques for truncated integer multiplication [13].There have been two proposals in the literature for algorithms that achieve thetighter bound M ( n ) = O ( n log n log ∗ n ) (2)under plausible but unproved number-theoretic hypotheses. First, Harvey, van derHoeven and Lecerf gave such an algorithm [14, §
9] that depends on a slight weak-ening of the Lenstra–Pomerance–Wagstaff conjecture on the density of Mersenneprimes, that is, primes of the form p = 2 m −
1, where m is itself prime. Althoughthis conjecture is backed by reasonable heuristics and some numerical evidence, itis problematic for several reasons. At the time of writing, only 49 Mersenne primesare known, the largest being 2 , , − r λ + 1. Again, although their unproved hypothesis is supportedby heuristics and some numerical evidence, it is still unknown whether there areinfinitely many primes of the desired form. It is a famous unsolved problem even to Mathematics Subject Classification.
Primary 68W30, 68W40, 11Y16.Harvey was supported by the Australian Research Council (grants DP150101689 andFT160100219). c (cid:13) XXXX American Mathematical Society prove that there are infinitely many primes of the form n + 1, of which the abovegeneralised Fermat primes are a special case.As an aside, we mention that the unproved hypotheses in [14, §
9] and [6] mayboth be expressed as statements about the cyclotomic polynomials φ k ( x ) occasion-ally taking prime values: for [14, §
9] we have 2 m − φ m (2), and for [6] we have r λ + 1 = φ λ +1 ( r ).In this paper we give a new conditional proof of (2), which depends on the follow-ing hypothesis. Let ϕ ( q ) denote the totient function. For relatively prime positiveintegers r and q , let P ( r, q ) denote the least prime in the arithmetic progression n = r (mod q ), and put P ( q ) := max r P ( r, q ). Hypothesis P.
We have P ( q ) = O ( ϕ ( q ) log q ) as q → ∞ . Our main result is as follows.
Theorem 1.
Assume Hypothesis P. Then there is an algorithm achieving (2) . The overall structure of the new algorithm is largely inherited from [14, § §
9] is the choice of coefficient ring.The algorithm of [14, §
9] worked over F p [ i ], where p = 2 m − i = −
1. In the new algorithm we use instead the ring F p , where p is a prime ofthe form p = a · m +1, for an appropriate choice of m and a = O ( m ). Hypothesis Pguarantees that such primes exist (take q = 2 m and r = 1). The key new ingredientis the observation that we may convert an integer product modulo a · m + 1 to apolynomial product modulo X k + a , by splitting the integers into chunks of m/k bits.In software implementations of fast Fourier transforms (FFTs) over finite fields,such as Shoup’s NTL library [25], it is quite common to work over F p where p is aprime of the form a · m + 1 that fits into a single machine register. Such primes aresometimes called FFT primes ; they are popular because it is possible to performa radix-two FFT efficiently over F p with a large power-of-two transform length.Our Theorem 1 shows that such primes remain useful even in a theoretical senseas m → ∞ .From a technical point of view, the new algorithm is considerably simpler thanthat of [14, § m (the coefficient size), and in particular we may easily choose m tobe divisible by the desired chunk size. By contrast, the choice of m in [14, §
9] isdictated by the rather erratic distribution of Mersenne primes; this forces one todeal with the technical complication of splitting integers into chunks of ‘non-integralsize’, which was handled in [14, §
9] by adapting an idea of Crandall and Fagin [7].We remark that the conditional algorithm of Covanov and Thom´e [6] achieves K = 4 by a rather different route. Instead of using Bluestein’s trick to handlethe short transforms, the authors follow F¨urer’s original strategy, which involvesconstructing a coefficient ring containing “fast” roots of unity. The algorithm ofthis paper, like the algorithms of [14], makes no use of such “fast” roots.Let us briefly discuss the evidence in favour of Hypothesis P. The best uncondi-tional bound for P ( q ) is currently Xylouris’s refinement of Linnik’s theorem, namely P ( q ) = O ( q . ) [30]. If q is a prime power (the case of interest in this paper), one ASTER INTEGER MULTIPLICATION 3 can obtain P ( q ) = O ( q . ε ) [4, Cor. 11]. Assuming the Generalised Riemann Hy-pothesis (GRH), one has P ( q ) = O ( q ε ) [19]. All of these bounds are far too weakfor our purposes.The tighter bound in Hypothesis P was suggested by Heath–Brown [17, 18]. Itcan be derived from the reasonable assumption that a randomly chosen integer in agiven congruence class should be no more or less ‘likely’ to be prime than a randominteger of the same size, after correcting the probabilities to take into account thedivisors of q . A detailed discussion of this argument is given by Wagstaff [29], whoalso presents some supporting numerical evidence. In the other direction, Granvilleand Pomerance [11] have conjectured that ϕ ( q ) log q = O ( P ( q )). These questionshave been revisited in a recent preprint of Li, Pratt and Shakan [21]; they givefurther numerical data, and propose the more precise conjecture thatlim inf q P ( q ) ϕ ( q ) log q = 1 , lim sup q P ( q ) ϕ ( q ) log q = 2 . The consensus thus seems to be that ϕ ( q ) log q is the right order of magnitude for P ( q ), although a proof is apparently still elusive.For the purposes of this paper, there are several reasons that Hypothesis P ismuch more compelling than the conjectures required by [14, §
9] and [6]. First, itis well known that there are infinitely many primes in any given congruence class,and we even know that asymptotically the primes are equidistributed among thecongruence classes modulo q . Second, one finds that, in practice, primes of therequired type are extremely common. For example, we find that a · + 1 isprime for a = 13 , , , , , , , , , , , , , . . . and there are still plenty of opportunities to hit primes before exhausting the pos-sible values of a up to about 10 allowed by Hypothesis P.Third, we point out that Hypothesis P is actually much stronger than whatis needed in this paper. We could prove Theorem 1 assuming only the weakerstatement that there exists a logarithmically slow function Φ( q ) (see [14, § P ( q ) < ϕ ( q )Φ( q ) for all large q . For example, we could replace (log q ) inHypothesis P by (log q ) C for any fixed C >
2, or even by (log q ) (log log q ) C for anyfixed C >
0. To keep the complexity arguments in this paper as simple as possible,we will only give the proof of Theorem 1 for the simplest form of Hypothesis P, asstated above.It is interesting to ask for what bit size the new algorithm would be fasterthan the asymptotically inferior multiplication algorithms that are used in practice.The current reference implementation in the GMP library [10] uses Sch¨onhage–Strassen’s algorithm [24]. On recent architectures, Pollard’s algorithm [23] tends tobe very competitive as well [12]. Concerning our new algorithm, we stress that thepresent paper is optimised for simplicity rather than speed. An optimised versionwould essentially coincide with Pollard’s algorithm for sizes n < , where 64corresponds to the bit size of an integer hardware register, so the main recursivestep in our algorithm would only be invoked for super-astronomical sizes. This doesnot withstand that some of the techniques that we developed for proving the newcomplexity bound may have practical applications for much smaller sizes. This isexactly what happened for the related problem of carryless integer multiplication:new techniques from [16] led to faster practical implementations [15, 20]. DAVID HARVEY AND JORIS VAN DER HOEVEN The algorithm
Define lg x := ⌈ log x ⌉ for x > For the rest of the paper we assume thatHypothesis P holds , and hence we may fix an absolute constant
C > P ( q ) Cq (lg q ) for all q >
2. (Numerical evidence suggests that P ( q ) / ( q (lg q ) ) achieves its maxi-mum value at q = 2, implying that one may take C = 3 / admissible size is an integer m > of the form m = k (lg k ) for some integer k . For such m , let p ( m ) denote the smallest prime of the form p = a · m + 1 . Hypothesis P implies that 1 a < Cm . (3)In the proof of Proposition 2 below, we will describe a recursive algorithm Transform that takes as input an admissible size m = k (lg k ) , the correspondingprime p = p ( m ) = a · m + 1, a power-of-two transform length L = 2 lg L such that(lg m ) < lg L < m, (4)a primitive L -th root of unity ζ ∈ F p (such a primitive root exists as lg L < m and2 m | p − F ∈ F p [ X ] / ( X L − F with respect to ζ , that is, the vectorˆ F := ( F (1) , F ( ζ ) , . . . , F ( ζ L − )) ∈ ( F p ) L . For sufficiently large m , Proposition 2 will reduce the computation of such a DFT tothe computation of a large collection of similar DFTs over F p ′ for an exponentiallysmaller prime p ′ = p ( m ′ ) = a ′ · m ′ + 1where m ′ = k ′ (lg k ′ ) ∼ m ) . The main reduction consists of five steps, which may be summarised as follows:(i) Reduce the given ‘long’ transform of length
L < m over F p to many ‘short’transforms of exponentially smaller length S = 2 (lg m ) over F p , via theCooley–Tukey decomposition.(ii) Reduce each short transform to a product in F p [ X ] / ( X S − S , using Bluestein’s algorithm.(iii) By splitting each coefficient in F p into exponentially smaller chunks of bitsize r = (lg k ) ∼ (lg m ) , reduce each product from step (ii) to a product in Z [ X, Y ] / ( X S − , Y k + a ).(iv) Embed each product from step (iii) into F p ′ [ X, Y ] / ( X S − , Y k + a ), for asuitable prime p ′ = a ′ · m ′ + 1 that is exponentially smaller than p ; moreprecisely, m ′ ∼ m ) ∼ r .(v) Reduce each product from step (iv) to a collection of forward and inverseDFTs of length S over F p ′ , and recurse. ASTER INTEGER MULTIPLICATION 5
The main recursion.
Let us now present the main reduction in more detailtogether with its complexity analysis. We denote the running time of
Transform by T ( m, L ). For m > there is always at least one integer lg L in the interval(4), so we may define the normalisation T ( m ) := max (lg m ) < lg L In our algorithm we must often perform auxiliary arithmetic operations on ‘small’integers. These will always be handled via the Sch¨onhage–Strassen algorithm [24]and Newton’s method [28, Ch. 9]; thus we may compute products, quotients andremainders of n -bit integers in O ( n lg n lg lg n ) bit operations. Proposition 2. There exist absolute constants m > , C > and C > with the following property. For any admissible m > m , there exists an admissible m ′ < (log m ) such that T ( m ) < (cid:18) C lg lg m (cid:19) T ( m ′ ) + C . (5) Proof. Assume that we are given as input an admissible size m = k (lg k ) , the corre-sponding prime p = p ( m ) = a · m + 1, a transform length L = 2 lg L satisfying (4),a primitive L -th root of unity ζ ∈ F p , and a polynomial F ∈ F p [ X ] / ( X L − F . For the base case m m , we may compute ˆ F usingany convenient algorithm. In what follows, we assume that m > m and that m is increased whenever necessary to accommodate statements that hold only forlarge m .Let us now detail the reductions (i)–(v) mentioned above. Step (i) — reduce to short DFTs. In this step we reduce the given transform oflength L to a collection of short transforms of length S := 2 (lg m ) . By (4) we have lg L > (lg m ) , so S | L . Let ω := ζ L/S ; then ω is a primitive S -throot of unity in F p .Let d := ⌊ lg L/ (lg m ) ⌋ , so that lg L = (lg m ) d + d ′ where 0 d ′ < (lg m ) .Applying the Cooley–Tukey method [5] to the factorisation L = S d d ′ , the giventransform of length L may be decomposed into d layers, each consisting of L/S transforms of length S (with respect to ω ), followed by d ′ layers, each consisting of L/ O ( L )multiplications by ‘twiddle factors’ in F p , which are given by certain powers of ζ .(For further details of the Cooley–Tukey decomposition, see for example [14, § d lg L/ (lg m ) , and by (4) also d ′ < (lg m ) < lg L/ (lg m ) , so the total num-ber of twiddle factor multiplications is O (( d + d ′ ) L ) = O ( L lg L/ (lg m ) ). Us-ing the Sch¨onhage–Strassen algorithm, each multiplication in F p costs at most O (lg p lg lg p lg lg lg p ) bit operations. As p = a · m + 1 we have lg p = m + O (lg a ),so (3) implies that lg p = m + O (lg m ) = O ( m ). Thus the cost of each multiplica-tion in F p is O ( m (lg m ) ) bit operations, and the total cost of the twiddle factormultiplications is O ( mL lg L ) bit operations. This bound also covers the cost ofthe length 2 transforms (‘butterflies’), each of which requires one addition and onesubtraction in F p . DAVID HARVEY AND JORIS VAN DER HOEVEN In the Turing model, we must also account for the cost of rearranging data sothat the inputs for each layer of short DFTs are stored sequentially on tape. Using afast matrix transpose algorithm, the cost per layer is O ( L lg p lg S ) = O ( Lm (lg m ) )bit operations (see [14, § O ( mL lg L ) bit operations alto-gether.Let T short ( m, L ) denote the number of bit operations required to perform L/S transforms of length S with respect to ω , i.e., the cost of one layer of short trans-forms. Since the number of layers of short transforms is d lg L/ (lg m ) , the abovediscussion shows that T ( m, L ) < lg L (lg m ) T short ( m, L ) + O ( mL lg L ) . (6) Step (ii) — reduce to short convolutions. In this step we use Bluestein’s algo-rithm [2] to convert the short transforms into convolution problems. Suppose thatat some layer of the main DFT we are given as input the short polynomials a t ( X ) = S − X i =0 a t,i X i ∈ F p [ X ] / ( X S − , t = 1 , . . . , L/S. We wish to compute ˆ a , . . . , ˆ a L/S , the DFTs with respect to ω .Let η := ζ L/ S so that η = ω . For i = 0 , . . . , S − t = 1 , . . . , L/S , define f t,i := η i a t,i , g i := η − i , and put f t ( X ) := S − X i =0 f t,i X i , g ( X ) := S − X i =0 g i X i , regarded as polynomials in F p [ X ] / ( X S − g i , andcompute all of the f t,i from the a t,i , using O (( L/S ) S ) = O ( L ) operations in F p .Then one finds (see for example [14, § a t ) i = η i h t,i , where h t := f t g = S − X i =0 h t,i X i ∈ F p [ X ] / ( X S − . (7)In other words, computing the short DFTs reduces to computing the products f g, . . . , f L/S g in F p [ X ] / ( X S − O ( L ) operations in F p .Let C short ( m, L ) denote the cost of computing the products f g, . . . , f L/S g in F p [ X ] / ( X S − F p costs O ( m (lg m ) )bit operations, so the above discussion shows that T short ( m, L ) < C short ( m, L ) + O ( Lm (lg m ) ) . Substituting into (6) yields T ( m, L ) < lg L (lg m ) C short ( m, L ) + O ( mL lg L ) . (8) Step (iii) — reduce to bivariate product over Z . In this step we transport theproblem of computing the products f g, . . . , f L/S g in F p [ X ] / ( X S − 1) to the ring R := Z [ X, Y ] / ( X S − , Y k + a ) , ASTER INTEGER MULTIPLICATION 7 by cutting up each coefficient in F p into k chunks of bit size r := m/k = (lg k ) . We note for future reference the estimate r = (cid:18) O (1)lg lg m (cid:19) (lg m ) ; (9)this follows from m = k (lg k ) , becauselg m = lg k + O (lg lg k ) = (cid:18) O (lg lg k )lg k (cid:19) lg k = (cid:18) O (1)lg lg m (cid:19) lg k. Interpreting each f t,i and g i as an integer in the interval [0 , p ), and decomposingthem in base 2 r , we write f t,i = k − X j =0 f t,i,j ( k − − j ) r , g i = k − X j =0 g i,j ( k − − j ) r , (10)where f t,i,j and g i,j are integers in the interval0 f t,i,j , g i,j r a. (11)(In fact, they are less than 2 r for j = 1 , . . . , k − 1; the bound 2 r a is only needed forthe first term j = 0.) Then define polynomials F t := S − X i =0 k − X j =0 f t,i,j X i Y j , G := S − X i =0 k − X j =0 g i,j X i Y j , regarded as elements of R = Z [ X, Y ] / ( X S − , Y k + a ), and let H t := F t G = S − X i =0 k − X j =0 h t,i,j X i Y j be the corresponding products in R for t = 1 , . . . , L/S .We claim that knowledge of H t determines h t ; specifically, that h t,i = k − X j =0 h t,i,j (2 k − − j ) r (mod p ) (12)for each pair ( t, i ). To prove this, observe that by definition of multiplication in R , h t,i,j = X i + i = i mod S X j + j = j mod kj + j 1) that sends Y to 2 − r , modulo a scaling factor of 2 ( k − r .)Let us estimate the cost of using (12) to compute h t,i for a single pair ( t, i ),assuming that the h t,i,j are known. By (13) and (11) we have | h t,i,j | ( Sk ) a (2 r a ) = 2 r Ska . Then, using (3) and the fact that k m , we obtain | h t,i,j | r SC m , solg | h t,i,j | < r + (lg m ) + 7 lg m + O (1) . Moreover, as m = k (lg k ) , for large m we certainly have lg k lg m − 1, so r = (lg k ) (lg m − (lg m ) − m ) , and we deduce that lg | h t,i,j | < m ) . (14)In particular, by (9), we see that h t,i,j has bit size O ((lg m ) ) = O ( r ). Now,to compute h t,i , we first evaluate the sum in (12) (in Z ) using a straightforwardoverlap-add procedure; this costs O ( kr ) = O ( m ) bit operations. Then we reducethe result modulo p ; this costs a further O ( m (lg m ) ) bit operations (using theSch¨onhage–Strassen algorithm, as in step (i)). The total cost over all pairs ( t, i ) is O (( L/S ) Sm (lg m ) ) = O ( Lm (lg m ) ) bit operations.Let C biv ( m, L ) denote the cost of computing the products F G, . . . , F L/S G in R .The above discussion shows that C short ( m, L ) < C biv ( m, L ) + O ( Lm (lg m ) ) . Substituting into (8) yields T ( m, L ) < lg L (lg m ) C biv ( m, L ) + O ( mL lg L ) . (15) Step (iv) — reduce to bivariate multiplication over F p ′ . In this step we transferthe above multiplication problems from R = Z [ X, Y ] / ( X S − , Y k + a ) to the ring S := F p ′ [ X, Y ] / ( X S − , Y k + a ) , where p ′ := p ( m ′ ) = a ′ · m ′ + 1 for a suitable choice of admissible size m ′ . Theidea is to choose m ′ to be just large enough that computing the desired productsmodulo p ′ determines their coefficients unambiguously in Z .To achieve this, we will take m ′ := k ′ (lg k ′ ) where k ′ := (cid:24) β (lg β − β ) (cid:25) , β := 2(lg m ) . ASTER INTEGER MULTIPLICATION 9 Note that m ′ is admissible (we may ensure that m ′ > by taking m sufficientlylarge). We claim that with this choice of m ′ we have β m ′ < (cid:18) O (1)lg lg m (cid:19) β, (16)and consequently, taking (9) into account, m ′ = (cid:18) O (1)lg lg m (cid:19) r. (17)To establish the first inequality in (16), observe that since k ′ > β/ (lg β ) , we havelog k ′ > log β − lg β > log β − β. Hence lg k ′ > lg β − β , which implies that m ′ = k ′ (lg k ′ ) > β . For the secondinequality, since k ′ β we have m ′ = k ′ (lg k ′ ) (cid:18) β (lg β − β ) + 1 (cid:19) (lg β ) = (cid:18) (lg β ) (lg β − β ) + (lg β ) β (cid:19) β = (cid:18) O ((lg lg β ) )(lg β ) (cid:19) β = (cid:18) O (1)lg lg m (cid:19) β. Let us estimate the cost of computing p ′ = a ′ · m ′ + 1. Hypothesis P ensuresthat a ′ < C ( m ′ ) , so we may locate p ′ by testing each candidate a ′ = 1 , . . . , C ( m ′ ) using a naive primality test (trial division) in 2 O ( m ′ ) bit operations. By (4) and(16) this amounts to2 O ( m ′ ) = 2 O ( β ) = 2 O ((lg m ) ) = 2 O ((lg L ) / ) = O ( L ) (18)bit operations.Now let F , . . . , F L/S , G ∈ R be as in step (iii), and let u , . . . , u L/S , v be theirimages in S ; that is, u t := S − X i =0 k − X j =0 u t,i,j X i Y j , v := S − X i =0 k − X j =0 v i,j X i Y j , where u t,i,j and v i,j are the images in F p ′ of f t,i,j and g i,j . Computing theseimages amounts to zero-padding each coefficient up to lg p ′ bits; by (17) we havelg p ′ = O ( m ′ ) = O ( r ), so the total cost of this step is O (( L/S ) Skr ) = O ( Lm ) bitoperations.Let w t := u t v for each t . Clearly w t is the image in S of H t = F t G . By (14)and (16), the coefficients h t,i,j of H t are completely determined by those of w t , as | h t,i,j | β − and p ′ > m ′ + 1 > β + 1. Moreover, this lifting can be carried outin linear time, so the cost of deducing H t from w t (for all t ) is again O ( Lm ) bitoperations.Let C tiny ( m, L ) denote the cost of computing the products u v, . . . , u L/S v in S .The above discussion shows that C biv ( m, L ) < C tiny ( m, L ) + O ( Lm ) . Substituting into (15) yields T ( m, L ) < lg L (lg m ) C tiny ( m, L ) + O ( mL lg L ) . (19) Step (v) — reduce to DFTs over F p ′ . Since 2 m ′ | p ′ − S = (lg m ) < m ′ ,there exists a primitive S -th root of unity ζ ′ ∈ F p ′ . We may find one such primitiveroot by a brute force search in 2 O ( m ′ ) = O ( L ) bit operations (see (18)).We will compute the products w t = u t v in S by first performing DFTs withrespect to X , and then multiplying pointwise in F p ′ [ Y ] / ( Y k + a ). More precisely,for each j = 0 , . . . , k − U t,j := S − X i =0 u t,i,j X i , V j := S − X i =0 v i,j X i , regarded as polynomials in F p ′ [ X ] / ( X S − Transform recursively tocompute the transforms of each U t,j and V j with respect to ζ ′ , i.e., to compute thepolynomials u t (( ζ ′ ) i , Y ) = k − X j =0 U t,j (( ζ ′ ) i ) Y j , v (( ζ ′ ) i , Y ) = k − X j =0 V j (( ζ ′ ) i ) Y j , as elements of F p ′ [ Y ] / ( Y k + a ), for each i = 0 , . . . , S − t = 1 , . . . , L/S . Theprecondition corresponding to (4) for these recursive calls is(lg m ′ ) < lg S < m ′ . This is certainly satisfied for large m , as lg S = (lg m ) and m ′ ∼ m ) by (16).There are ( L/S + 1) k transforms, so their total cost is ( L/S + 1) k T ( m ′ , S ) bitoperations.We next compute the pointwise products w t (( ζ ′ ) i , Y ) = u t (( ζ ′ ) i , Y ) · v (( ζ ′ ) i , Y )in F p ′ [ Y ] / ( Y k + a ) for each i and t . Using the Sch¨onhage–Strassen algorithm(both the integer variant and the polynomial variant [3]), the cost of each productis O (( k lg k lg lg k )(lg p ′ lg lg p ′ lg lg lg p ′ )) bit operations. Using the bounds lg p ′ = O ( m ′ ), km ′ = O ( kr ) = O ( m ), and m ′ = O ((lg m ) ), this amounts to O (( k lg m lg lg m )( m ′ lg m ′ lg lg m ′ )) = O ( m lg m (lg lg m ) lg lg lg m )= O ( m (lg m ) )bit operations, or O ( Lm (lg m ) ) bit operations over all t and i .Finally, we perform inverse DFTs with respect to X to recover w , . . . , w L/S . Itis well known that these inverse DFTs may be computed by the same algorithmas the forward DFT, with ζ ′ replaced by ( ζ ′ ) − , followed by a division by S . Thedivisions cost O ( Lm (lg m ) ) bit operations altogether (they are no more expensivethan the pointwise multiplications), so the cost of this step is ( L/S ) k T ( m ′ , s ) + O ( Lm (lg m ) ).The procedure just described requires some data rearrangement, so that theDFTs and pointwise multiplication steps can access the necessary data sequentially.Using a fast matrix transpose algorithm, this costs O ( Lk lg p ′ lg k ) = O ( Lm lg m )bit operations altogether. ASTER INTEGER MULTIPLICATION 11 Combining all the contributions mentioned above, we obtain C tiny ( m, L ) < (cid:18) LS + 1 (cid:19) k T ( m ′ , S ) + O ( Lm (lg m ) ) . Substituting into (19) yields T ( m, L ) < lg L (lg m ) (cid:18) LS + 1 (cid:19) k T ( m ′ , S ) + O ( mL lg L ) . Conclusion. This concludes the description of the algorithm; it remains to estab-lish (5). Dividing the previous inequality by mL lg L , and recalling that S = 2 (lg m ) ,we obtain T ( m, L ) mL lg L < (cid:18) SL (cid:19) km ′ m · T ( m ′ , S ) m ′ S lg S + O (1) . By definition of T ( m ′ ) we have T ( m ′ , S ) / ( m ′ S lg S ) T ( m ′ ). Equation (17) impliesthat km ′ = (cid:18) O (1)lg lg m (cid:19) kr = (cid:18) O (1)lg lg m (cid:19) m, and (4) yields S/L < (lg m ) − (lg m ) = O (1 / lg lg m ). Therefore T ( m, L ) mL lg L < (cid:18) O (1)lg lg m (cid:19) T ( m ′ ) + O (1) . Taking the maximum over all lg L yields the desired bound (5). (cid:3) Corollary 3. We have T ( m ) = O (4 log ∗ m ) for admissible m → ∞ . The corollary could be deduced from Proposition 2 by using [14, Prop. 8]. Wegive a simpler (but less general) argument here. Proof. Let m , C and C be as in Proposition 2. We may assume, increasing m if necessary, that(log m ) / < log( m / ) and C lg lg m < − log ∗ ( m / ) for all m > m . Define B := max (cid:0) C , max m admissible m m T ( m ) (cid:1) . We will prove that T ( m ) < (4 log ∗ ( m / )+1 − B (20)for all admissible m .If m m then (20) holds by the definition of B , so we may assume that m > m . By Proposition 2, there exists an admissible m ′ < (log m ) such that T ( m ) < (cid:18) C lg lg m (cid:19) T ( m ′ ) + C < (4 + 4 − log ∗ ( m / ) ) T ( m ′ ) + B. Since ( m ′ ) / < (log m ) / < log( m / ), we have log ∗ (( m ′ ) / ) log ∗ ( m / ) − ∗ ( m / ), we obtain T ( m ) < (4 + 4 − log ∗ ( m / ) )(4 log ∗ ( m / ) − B + B = (4 log ∗ ( m / )+1 − − log ∗ ( m / ) − B< (4 log ∗ ( m / )+1 − B. This establishes (20), and the corollary follows immediately. (cid:3) Application to integer multiplication. We are now in a position to provethe main result. Proof of Theorem 1. We are given as input two positive integers u, v < n for somelarge n ; our goal is to compute uv .Define k := (cid:24) (5 / 2) lg n (lg lg n ) (cid:25) , m := k (lg k ) . We have lg k = lg lg n + O (lg lg lg n ) = (1 + o (1)) lg lg n, so 2 lg n < m < n for large n . We may assume that n is large enough so that m > ; then m isadmissible.Let b := ⌊ m/ ⌋ and d := ⌈ n/b ⌉ . We decompose u and v in base 2 b as u = d − X i =0 u i bi , v = d − X i =0 v i bi , u i , v i < b , and define polynomials U ( X ) := d − X i =0 u i X i , V ( X ) := d − X i =0 v i X i ∈ Z [ X ] . Let W := U V ∈ Z [ X ]. The coefficients of W have at most 2 b + lg d = O ( m ) bits,so the product uv = W (2 b ) may be recovered from W ( X ) in O ( n ) bit operations.Let p := p ( m ) = a · m + 1. We may find p by testing each value of a up to Cm using a polynomial-time primality test [1], in m O (1) = (lg n ) O (1) bit operations.Also put ℓ := lg(10 n/m ) and L := 2 ℓ . The polynomial W is determined by itsimage in F p [ X ] / ( X L − d nb + 1 nm/ − < nm L/ b + lg d m/ n < m for large n .To compute the product in F p [ X ] / ( X L − Transform to performDFTs and inverse DFTs, and multiply pointwise in F p . The precondition (4) cer-tainly holds for large n . According to [26], we may find a suitable primitive root in F p in p / o (1) < (2 m/ ) o (1) < (2 (3 / 4) lg n ) o (1) = n / o (1)ASTER INTEGER MULTIPLICATION 13 bit operations. By Corollary 3, we conclude that M ( n ) < T ( m, L ) + O ( Lm lg m lg lg m ) < mL lg L T ( m ) + O ( Lm lg m lg lg m )= O ( n lg n log ∗ m ) + O ( n lg lg n lg lg lg n )= O ( n lg n log ∗ n ) . (cid:3) Acknowledgments The authors thank an anonymous referee, whose comments helped to greatlyimprove the presentation of these results, and also Igor Shparlinski and LiangyiZhao, for helpful discussions about bounds for primes in arithmetic progressions. References 1. M. Agrawal, N. Kayal, and N. Saxena, PRIMES is in P , Ann. of Math. (2) (2004), no. 2,781–793. MR 21239392. L. I. Bluestein, A linear filtering approach to the computation of discrete Fourier transform ,IEEE Transactions on Audio and Electroacoustics (1970), no. 4, 451–455.3. D. G. Cantor and E. Kaltofen, On fast multiplication of polynomials over arbitrary algebras ,Acta Inform. (1991), no. 7, 693–701. MR 1129288 (92i:68068)4. M. Chang, Short character sums for composite moduli , J. Anal. Math. (2014), 1–33.MR 32335735. J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourierseries , Math. Comp. (1965), 297–301. MR 01785866. S. Covanov and E. Thom´e, Fast integer multiplication using generalized Fermat primes ,preprint http://arxiv.org/abs/1502.02800 , 2016.7. R. Crandall and B. Fagin, Discrete weighted transforms and large-integer arithmetic , Math.Comp. (1994), no. 205, 305–324. MR 11852448. M. F¨urer, Faster integer multiplication , STOC’07—Proceedings of the 39th Annual ACMSymposium on Theory of Computing, ACM, New York, 2007, pp. 57–66. MR 2402428(2009e:68124)9. , Faster integer multiplication , SIAM J. Comput. (2009), no. 3, 979–1005.MR 2538847 (2011b:68296)10. Torbj¨orn Granlund and the GMP development team, GNU Multiple Precision ArithmeticLibrary , version 6.1.2, available at http://gmplib.org/ , 2017.11. A. Granville and C. Pomerance, On the least prime in certain arithmetic progressions , J.London Math. Soc. (2) (1990), no. 2, 193–200. MR 106726112. D. Harvey, Faster arithmetic for number-theoretic transforms , J. Symbolic Comput. (2014), 113–119.13. , Faster truncated integer multiplication , https://arxiv.org/abs/1703.00640 , 2017.14. D. Harvey, J. van der Hoeven, and G. Lecerf, Even faster integer multiplication , J. Complexity (2016), 1–30. MR 353063715. , Fast polynomial multiplication over F , Proc. ISSAC ’16 (New York, NY, USA),ACM, 2016, pp. 255–262.16. , Faster polynomial multiplication over finite fields , J. ACM (2017), no. 6, Arti-cle 52.17. D. R. Heath-Brown, Almost-primes in arithmetic progressions and short intervals , Math.Proc. Cambridge Philos. Soc. (1978), no. 3, 357–375. MR 0491558 (58 Siegel zeros and the least prime in an arithmetic progression , Quart. J. Math. OxfordSer. (2) (1990), no. 164, 405–418. MR 108110319. , Zero-free regions for Dirichlet L -functions, and the least prime in an arithmeticprogression , Proc. London Math. Soc. (3) (1992), no. 2, 265–338. MR 1143227 (93a:11075)20. J. van der Hoeven, R. Larrieu, and G. Lecerf, Implementing fast carryless multiplication ,Tech. report, HAL, 2017, http://hal.archives-ouvertes.fr/hal-01579863 . 21. J. Li, K. Pratt, and G. Shakan, A lower bound for the least prime in an arithmetic progression ,preprint https://arxiv.org/abs/1607.02543 , 2016.22. C. H. Papadimitriou, Computational complexity , Addison-Wesley Publishing Company, Read-ing, MA, 1994. MR 1251285 (95f:68082)23. J. M. Pollard, The fast Fourier transform in a finite field , Mathematics of Computation (1971), no. 114, 365–374.24. A. Sch¨onhage and V. Strassen, Schnelle Multiplikation grosser Zahlen , Computing (Arch.Elektron. Rechnen) (1971), 281–292. MR 0292344 (45 NTL: a library for doing number theory (Version 9.9.1) , .26. I. Shparlinski, On finding primitive roots in finite fields , Theoret. Comput. Sci. (1996),no. 2, 273–275. MR 138977327. GIMPS team, Mersenne Prime Number discovery - − is Prime , , 2016.28. J. von zur Gathen and J. Gerhard, Modern computer algebra , third ed., Cambridge UniversityPress, Cambridge, 2013. MR 308752229. S. S. Wagstaff, Jr., Greatest of the least primes in arithmetic progressions having a givenmodulus , Math. Comp. (1979), no. 147, 1073–1080. MR 52806130. T. Xylouris, On the least prime in an arithmetic progression and estimates for the zeros ofDirichlet L -functions , Acta Arith. (2011), no. 1, 65–91. MR 2825574 (2012m:11129) School of Mathematics and Statistics, University of New South Wales, Sydney NSW2052, Australia E-mail address : [email protected] Laboratoire d’informatique, UMR 7161 CNRS, ´Ecole polytechnique, 91128 Palaiseau,France E-mail address ::