arXiv: [cs.SC] Mar

FASTER TRUNCATED INTEGER MULTIPLICATION
DAVID HARVEY
Abstract.
We present new algorithms for computing the low n bits or the high n bits of the product of two n-bit integers. We show that these problems may be solved in asymptotically 75% of the time required to compute the full 2n-bit product, assuming that the underlying integer multiplication algorithm relies on computing cyclic convolutions of real sequences.

1. Introduction
Let n ≥ 1, and let u and v be integers in the interval 0 ≤ u, v < 2^n. We write M(n) for the cost of computing the full product of u and v, which is just the usual 2n-bit product uv. Unless otherwise specified, by 'cost' we mean the number of bit operations, under a model such as the multitape Turing machine [12].

In this paper we are interested in two types of truncated product. The low product of u and v is the unique integer w in the interval 0 ≤ w < 2^n such that w ≡ uv (mod 2^n), or in other words, the low n bits of uv. We denote the cost of computing the low product by M_lo(n).

The high product of u and v is defined to be any integer w in the range 0 ≤ w ≤ 2^n such that |uv − 2^n w| < 2^n. Thus there are at most two possible values for the high product, and an algorithm that computes it is permitted to return either one. The high product consists of, more or less, the high n bits of uv, except that we allow a small error in the lowest bit. We denote the cost of computing the high product by M_hi(n).

There are many applications of truncated products in computer arithmetic. The most obvious example is high-precision arithmetic on real numbers: to compute an n-bit approximation to the product of two real numbers with n-bit mantissae, one may scale by an appropriate power of two to convert the inputs into n-bit integers, and then compute the high product of those integers. Further examples include Barrett's and Montgomery's algorithms for modular arithmetic [1, 10].

It is natural to ask whether a truncated product can be computed more quickly than a full product. This is indeed the case for small n: in the classical quadratic-time regime, one can compute a truncated product in about half the time of a full product, because essentially only half of the n^2 bit-by-bit products contribute to the desired output.

However, as n grows, and more sophisticated multiplication algorithms are deployed, these savings begin to dissipate.
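For reference, the two truncated products just defined can be pinned down by a naive sketch using Python's arbitrary-precision integers (the helper names are ours, not the paper's; this illustrates the definitions only, not an efficient algorithm):

```python
# Reference implementations of the low and high products, directly from the
# definitions (illustrative helper names, not from the paper).

def low_product(u, v, n):
    # the unique w with 0 <= w < 2^n and w = uv (mod 2^n): the low n bits
    return (u * v) % (1 << n)

def high_product(u, v, n):
    # some w with 0 <= w <= 2^n and |uv - 2^n w| < 2^n; floor division
    # returns one of the (at most two) admissible values
    return (u * v) >> n

u, v, n = 0b1011, 0b1101, 4          # u = 11, v = 13, uv = 143 = 0b10001111
assert low_product(u, v, n) == 0b1111    # low 4 bits of uv
assert high_product(u, v, n) == 0b1000   # high 4 bits of uv
```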
Consider for instance Karatsuba's algorithm, which has complexity M(n) = O(n^α) for α = log 3/log 2 ≈ 1.58. Mulders showed [11] that one can adapt Karatsuba's algorithm to obtain bounds for M_hi(n) and M_lo(n) around 0.81 M(n). However, it is not known how to reach 0.5 M(n) in this regime.

For much larger values of n, the most efficient integer multiplication algorithms known are based on FFTs (fast Fourier transforms). The complexity of these algorithms is of the form M(n) = n (log n)^{1+o(1)}; see [9] for the smallest o(1) term presently known.

It has long been thought that the best way to compute a truncated product using FFT-based algorithms is simply to compute the full product and then discard the unwanted part of the output. One might be able to save O(n) bit operations compared to the full product, by skipping computations that do not contribute to the desired half of the output, but no bounds of the type M_lo(n) < c M(n) or M_hi(n) < c M(n) have been proved for any constant c < 1. There is even some evidence suggesting that it may not be possible to do better than computing the full product: for example, in a suitable algebraic model, the multiplicative complexity of any algorithm that computes the low n coefficients of the product of two polynomials of degree less than n is at least 2n − 1.

In this paper we show that a 25% saving is nevertheless achievable, provided that the underlying integer multiplication algorithm is based on computing cyclic convolutions of real sequences. However, we do not yet know how to obtain this 25% reduction in complexity for arbitrary integer multiplication algorithms. In particular, we are currently unable to establish analogous results for integer multiplication algorithms based on FFTs over other rings, such as finite fields.

The remainder of the paper is structured as follows. In Section 2 we state our main results precisely, after making some preliminary definitions. Section 3 presents the new algorithm for the low product, including the proof of correctness and complexity analysis. Section 4 does the same for the high product. Section 5 gives some performance data for an implementation of the new algorithms. Finally, Section 6 sketches several further results that can be proved using the methods introduced in this paper.

2. Setup and statement of results
2.1. Fixed point arithmetic and real convolutions.
We write lg x for ⌈log_2 x⌉. To simplify the analysis of numerical error, all algorithms are assumed to work with the following fixed-point representation for real numbers (see [9, §3] for a more detailed treatment). Let p ≥ 1 be a precision parameter. We write R_p for the set of real numbers of the form a/2^p, where a is an integer in the interval −2^p ≤ a ≤ 2^p. Thus R_p models the unit interval [−1, 1]. Elements of R_p are represented using p + O(1) bits of storage. For e ∈ Z, we write 2^e R_p for the set of real numbers of the form 2^e x where x ∈ R_p. An element of 2^e R_p is represented simply by its mantissa in R_p; the exponent e is always known from context, and is not explicitly stored.

We will frequently work with quotient rings of the form R[X]/P(X), where P(X) is some fixed monic polynomial of positive degree, such as X^N − 1. If F ∈ R[X]/P(X) and deg P = N, we write F_0, ..., F_{N−1} for the coefficients of F with respect to the standard monomial basis; that is, F = F_0 + F_1 X + ··· + F_{N−1} X^{N−1} (mod P(X)). For such F we define the norm ‖F‖ := max_i |F_i|.
We assume that a subroutine Convolution is available, which takes as input an exponent e ≥ 0, a length N ≥ 1, and polynomials F, G ∈ 2^e R_p[X]/(X^N − 1). Let H := FG ∈ R[X]/(X^N − 1), so that

    H_k := Σ_{i+j ≡ k (mod N)} F_i G_j ∈ [−2^{2e} N, 2^{2e} N],   0 ≤ k < N.

Then Convolution is required to output a polynomial H̃ ∈ 2^{2e+lg N} R_p[X]/(X^N − 1) such that

    ‖H̃ − H‖ < 2^{2e+lg N−p}.

In other words, Convolution computes a p-bit approximation to the cyclic convolution of two real input sequences of length N.

We write C(N, p) for the bit complexity of Convolution. We treat this routine as a black box; its precise implementation is not important for our purposes. A typical implementation would execute a real-to-complex FFT for each input sequence, multiply the Fourier coefficients pointwise, and then compute an inverse complex-to-real transform to recover the result. Internally, it should work to precision slightly higher than p to manage rounding errors during intermediate computations (for an explicit error bound, see for example [3, Theorem 3.6]). The routine may completely ignore the exponent parameter e.

2.2. The full product.
For completeness, we recall the well-known algorithm that uses Convolution to compute the full product of two n-bit integers (Algorithm 2.1). It depends on two parameters: a chunk size b, and a transform length N, where Nb ≥ n. The idea is to cut the integers into N chunks of b bits, thereby converting the integer multiplication problem into the problem of multiplying two polynomials in Z[X] modulo X^{2N} − 1. For a given n there is some freedom in the choice of b and N. The optimal choice of N will involve some balance between making N as close to n/b as possible, but also ensuring that N is sufficiently smooth (has only small prime factors) so that FFTs of length N are as efficient as possible. (An alternative approach is to use "truncated FFTs" [13], which eliminates the need to choose a smooth transform length. However, this makes no difference asymptotically. Despite the overlapping terminology, it is not clear whether the new truncated multiplication algorithms can be adapted to the case of truncated FFTs.)
Algorithm 2.1: Full product

Input: Parameters n ≥ 1, b ≥ 1, N ≥ 1 with Nb ≥ n; integers 0 ≤ u, v < 2^n
Output: uv (the full product of u and v)

1. p := 2b + lg 2N + 2.
2. Compute (u_0, ..., u_{N−1}) and (v_0, ..., v_{N−1}), where 0 ≤ u_i, v_i < 2^b, so that

       u = Σ_{i=0}^{N−1} u_i 2^{ib},   v = Σ_{i=0}^{N−1} v_i 2^{ib},

   and let

       Ū(X) := Σ_{i=0}^{N−1} u_i X^i,   V̄(X) := Σ_{i=0}^{N−1} v_i X^i,   Ū, V̄ ∈ 2^b R_p[X]/(X^{2N} − 1).

3. Use Convolution to compute W̄ ∈ 2^{2b+lg 2N} R_p[X]/(X^{2N} − 1) such that ‖W̄ − ŪV̄‖ < 2^{2b+lg 2N−p}.
4. return Σ_{i=0}^{2N−2} round(W̄_i) 2^{ib}.

Theorem 2.1 (Full product). Let n ≥ 1, and let u and v be n-bit integers. Let b ≥ 1 and N ≥ 1 be integers such that Nb ≥ n. Then Algorithm 2.1 correctly computes the full product of u and v. Assuming that lg N = O(b), its complexity is

    M(n) < C(2N, 2b + lg 2N + 2) + O(Nb).

Proof.
The condition Nb ≥ n ensures that the decompositions of u and v into u_0, ..., u_{N−1} and v_0, ..., v_{N−1} in line 2 are legal. Let

    U(X) := Σ_{i=0}^{N−1} u_i X^i ∈ Z[X],   V(X) := Σ_{i=0}^{N−1} v_i X^i ∈ Z[X],

and

    W(X) := U(X) V(X) = Σ_{i=0}^{2N−2} w_i X^i ∈ Z[X].

Note that Ū and V̄ are the images of U and V in 2^b R_p[X]/(X^{2N} − 1), and that u = U(2^b) and v = V(2^b). Since W(X) has degree at most 2N − 2, it is determined by its remainder modulo X^{2N} − 1. Line 3 computes an approximation W̄ to this remainder with

    |W̄_i − w_i| < 2^{2b+lg 2N−p} = 1/4

for each i. The function round(·) in line 4 rounds its argument to the nearest integer (ties broken in either direction as convenient). Since w_i ∈ Z, we deduce that round(W̄_i) = w_i for each i; hence line 4 returns W(2^b) = U(2^b) V(2^b) = uv.

The main term in the complexity bound arises from the Convolution call in line 3. The secondary term consists of the splitting step in line 2, which costs O(b) bit operations per coefficient, and the overlap-add procedure in line 4, which requires O(b + lg N) = O(b) bit operations per coefficient. □
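The structure of Algorithm 2.1 can be sketched in a few lines of Python. In this toy model (our illustration) a naive O(N^2) floating-point cyclic convolution stands in for the Convolution black box; a serious implementation would use real FFTs, and the parameters here are small enough that double precision absorbs all rounding error:

```python
# A toy model of Algorithm 2.1 (full product via a length-2N real cyclic
# convolution). A naive convolution plays the role of the Convolution black
# box; parameters are small enough for double precision to be exact here.

def cyclic_convolution(F, G):
    # the "Convolution" black box: cyclic convolution of real sequences
    n = len(F)
    return [sum(F[i] * G[(k - i) % n] for i in range(n)) for k in range(n)]

def full_product(u, v, n, b, N):
    assert N * b >= n
    mask = (1 << b) - 1
    # line 2: split u, v into N chunks of b bits, zero-padded to length 2N
    U = [float((u >> (b * i)) & mask) for i in range(N)] + [0.0] * N
    V = [float((v >> (b * i)) & mask) for i in range(N)] + [0.0] * N
    # line 3: length-2N cyclic convolution of the chunk sequences
    W = cyclic_convolution(U, V)
    # line 4: round each coefficient and evaluate at X = 2^b (overlap-add)
    return sum(int(round(w)) << (b * i) for i, w in enumerate(W))

u, v = 123456789, 987654321
assert full_product(u, v, n=30, b=8, N=4) == u * v
```

With b = 8 and N = 4 the convolution coefficients stay far below 2^53, so the rounding in line 4 recovers the integer coefficients exactly, mirroring the 1/4 error bound in the proof above.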
2.3. Statement of results.
The main results of this paper are the following analogues of Theorem 2.1 for the low product and the high product. These are proved in Section 3 and Section 4 respectively.
Theorem 2.2 (Low product). Let n ≥ 1, and let u and v be n-bit integers. Let b ≥ 4 and N ≥ 4 be integers such that Nb ≥ n. Then Algorithm 3.1 correctly computes the low product of u and v. Assuming that lg N = O(b), its complexity is

    M_lo(n) < C(N, 3b + lg N + 6) + O(N M(b)).   (2.1)

Theorem 2.3 (High product). Let n ≥ 1, and let u and v be n-bit integers. Let b ≥ 4 and N ≥ 4 be integers such that (N + 1)b ≥ n + lg N + 2. Then Algorithm 4.1 correctly computes the high product of u and v. Assuming that lg N = O(b), its complexity is

    M_hi(n) < C(N, 3b + lg N + 9) + O(N M(b)).   (2.2)

Comparing these complexity bounds to Theorem 2.1, we observe that the convolution length has dropped from 2N to N, but the working precision has increased from roughly 2b to roughly 3b. To understand the implications for the overall complexity, we need to make further assumptions on the growth of C(N, p) as a function of n. We consider two scenarios.

Scenario 1: n → ∞. We assume that N and b are chosen so that Nb = (1 + o(1)) n as n → ∞. We also assume that N is restricted to suitably smooth values (for instance, the ultrasmooth numbers defined by Bernstein [2]), and that b is chosen to be exponentially smaller than N but somewhat larger than lg N, say b = Θ(log n) and N = Θ(n/log n) (as is done for example in [9, §3]). This choice of b ensures that 3b + lg N + 6 = (3 + o(1)) b as n → ∞, and similarly 2b + lg 2N + 2 = (2 + o(1)) b.

Under these assumptions, it is reasonable to expect that the complexity of the underlying real convolution is quasi-linear with respect to the total bit size of the input, i.e., that C(N, p) behaves roughly like Np log(Np). This is the case for all FFT-based convolution algorithms known to the author. We conclude that

    M_lo(n)/M(n) = [C(N, 3b + lg N + 6) + O(N M(b))] / [C(2N, 2b + lg 2N + 2) + O(Nb)] = 3/4 + o(1)

as n → ∞, and similarly for the high product. This justifies our assertion that asymptotically the new truncated product algorithms save 25% of the total work compared to the full product.

Scenario 2: fixed word size.
Now let us consider the situation faced by a programmer working on a modern microprocessor with hardware support for a fixed word size, such as the 53-bit double-precision floating-point type specified by the IEEE 754 standard. In this setting, the Convolution subroutine takes as input two vectors of coefficients represented by this data type, and computes their cyclic convolution using some sort of FFT, taking direct advantage of the hardware arithmetic. We assume that b is chosen as large as possible so that the FFTs can be performed in this way; for example, under IEEE 754 we would require that 3b + lg N + β_N ≤ 53 for the low product, where β_N is an allowance for numerical error. Obviously in this scenario it does not make sense to allow n → ∞, and it also does not quite make sense to measure complexity by the number of "bit operations". Instead, n should be restricted to lie in some finite (possibly quite large) range, and a more natural measure of complexity is the number of word operations (ignoring issues such as locality and parallelism).

We claim that it is still reasonable to expect a reduction in complexity close to 25%. To see this, consider a full product computation for a given n, with splitting parameters N and b. Let N′ and b′ be the splitting parameters for the corresponding truncated product (for the same n). We should choose b′ around 2b/3 and N′ around 3N/2, so that N′b′ ≈ Nb. Then the bulk of the work for the full product consists of FFTs of length 2N, but for the truncated products the FFT length is reduced to around 3N/2. Since the FFTs run in quasilinear time (word operations), we expect to see roughly 25% savings. In practice this will be tempered somewhat by the additional linear-time work inherent in the truncated product algorithms, such as the evaluation of α* and β* in Algorithm 3.1. The situation is also complicated by the fact that we are constrained to choose smooth transform lengths. Section 5 gives some empirical evidence in favour of this conclusion.

3. The low product
Throughout this section we fix integers

    b ≥ 4,   N ≥ 4,   (3.1)

as in Theorem 2.2.

3.1. The cancellation trick.
The key to the new low product algorithm is the following simple observation.
Proposition 3.1.
Let W(X) ∈ Z[X] with deg W ≤ 2N − 2. Let L(X) ∈ R[X] be the remainder on dividing W(X) by

    A(X) := X^N + 2^{−b} X − 1 ∈ R[X],

with deg L < N. Then 2^b L(X) ∈ Z[X], L(2^b) ∈ Z, and L(2^b) ≡ W(2^b) (mod 2^{Nb}).

Proof. Write

    W(X) = Σ_{i=0}^{2N−2} w_i X^i = Σ_{i=0}^{N−1} w_i X^i + Σ_{i=0}^{N−2} w_{N+i} X^{N+i}.

Since X^N ≡ 1 − 2^{−b} X (mod A(X)), we obtain

    L(X) = Σ_{i=0}^{N−1} w_i X^i + Σ_{i=0}^{N−2} w_{N+i} (1 − 2^{−b} X) X^i.

This shows that 2^b L(X) ∈ Z[X]. Moreover, since the factor 1 − 2^{−b} X vanishes at X = 2^b,

    L(2^b) = Σ_{i=0}^{N−1} w_i 2^{ib} ≡ W(2^b) (mod 2^{Nb}). □
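Proposition 3.1 is easy to verify computationally. The following sketch (our illustration, using exact rational arithmetic in place of fixed-point arithmetic) reduces an integer polynomial modulo A(X) via the congruence X^N ≡ 1 − 2^{−b}X and checks all three conclusions:

```python
# Exact check of the cancellation trick (Proposition 3.1) with rationals.
from fractions import Fraction

def reduce_mod_A(w, N, b):
    # remainder of W(X) = sum_i w[i] X^i modulo A(X) = X^N + 2^-b X - 1,
    # using X^N = 1 - 2^-b X (mod A(X)); assumes deg W <= 2N - 2
    L = [Fraction(c) for c in w[:N]]
    for i, c in enumerate(w[N:]):      # term w_{N+i} X^{N+i}
        L[i] += c                      # contributes w_{N+i} (1 - 2^-b X) X^i
        L[i + 1] -= Fraction(c, 1 << b)
    return L

N, b = 4, 3
w = [5, 7, 1, 6, 2, 9, 4]              # integer coefficients, deg W = 2N - 2
L = reduce_mod_A(w, N, b)
L_at = sum(c * (1 << b) ** i for i, c in enumerate(L))       # L(2^b)
W_at = sum(c * (1 << b) ** i for i, c in enumerate(w))       # W(2^b)
assert all((c * (1 << b)).denominator == 1 for c in L)       # 2^b L in Z[X]
assert L_at.denominator == 1                                 # L(2^b) in Z
assert (int(L_at) - W_at) % (1 << (N * b)) == 0              # mod 2^{Nb} match
```

The final assertion is the cancellation at work: the high-order coefficients of W(X) leave no trace in L(2^b) modulo 2^{Nb}.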
Later we will apply Proposition 3.1 to a polynomial W(X) = U(X) V(X) analogous to the W(X) encountered earlier in the proof of Theorem 2.1. The proposition shows that after reducing W(X) modulo A(X) and making the substitution X = 2^b, the 2^{−b} X term in A(X) causes the unwanted high-order coefficients of W(X) to disappear. An alternative point of view is that polynomial multiplication modulo A(X) corresponds roughly to integer multiplication modulo A(2^b) = 2^{Nb} + 2^{−b}·2^b − 1 = 2^{Nb}.

To make use of Proposition 3.1 to compute a low product, we must compute L(X) exactly. Note that the coefficients of L(X) lie in 2^{−b} Z rather than Z. Consequently, to compute L(X), we must increase the working precision by b bits compared to the precision used in the full product algorithm. This is why the precision parameter in Theorem 2.2 (and Theorem 2.3) is 3b + lg N + O(1) instead of 2b + lg N + O(1).

3.2. The roots of A(X). In this section we study the roots of the special polynomial A(X) introduced in Proposition 3.1. For r > 0, let D_r denote the open disc {z ∈ C : |z| < r}.

Lemma 3.2.
The roots of A(X) lie in D_2, and they are all simple.

Proof. If z ∈ C is a root of A(X), then |z|^N = |1 − 2^{−b} z| ≤ 1 + 2^{−b}|z| ≤ 1 + |z|/2, which is impossible if |z| ≥ 2. A multiple root z of A(X) would have to satisfy

    A(z) = z^N + 2^{−b} z − 1 = 0,   A′(z) = N z^{N−1} + 2^{−b} = 0,

and hence, eliminating z^N, (N − 1) 2^{−b} z = N. This implies that z > 0, contradicting A′(z) = 0, since A′(z) > 0 for z > 0. □

Now consider the function

    β(z) := z (1 − 2^{−b} z)^{−1/N},   z ∈ D_{2^b},

where w ↦ w^{−1/N} means the branch that maps 1 to 1.

Lemma 3.3. The function β(z) maps roots of A(X) to roots of X^N − 1.

Proof. If z is a root of A(X), then z^N = 1 − 2^{−b} z, so

    β(z)^N = z^N/(1 − 2^{−b} z) = z^N/z^N = 1. □

In fact, β(z) always sends a root of A(X) to the root of X^N − 1 nearest to it. Figure 1 illustrates the case N = 12 and b = 1, showing that the roots of A(X) are very close to those of X^N − 1. (For b = 2 the roots are already too close together to distinguish at this scale.)

For any k ∈ Z, the binomial theorem implies that β(z)^k is represented on D_{2^b} by the series

    β(z)^k = z^k Σ_{r=0}^∞ β_{r,k} z^r = z^k + β_{1,k} z^{k+1} + β_{2,k} z^{k+2} + ···,

where

    β_{r,k} := (−k/N choose r) (−2^{−b})^r,   r ≥ 0.
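The correspondence between the two root sets can also be observed numerically. The sketch below (our illustration; the parameters N = 12, b = 4 and the 25-term truncation are our choices) feeds each N-th root of unity through the inverse series α(z), whose coefficients are given in Lemma 3.4 below, and checks that the result is an approximate root of A(X):

```python
# Numerical illustration of Lemma 3.3 and Corollary 3.7: the truncated
# series alpha(z) maps each N-th root of unity to (very nearly) a root of
# A(X) = X^N + 2^-b X - 1.
import cmath

N, b, TERMS = 12, 4, 25

def binom(x, r):
    # generalized binomial coefficient (x choose r) for real x
    out = 1.0
    for j in range(r):
        out *= (x - j) / (j + 1)
    return out

def alpha(z):
    # alpha(z) = z * sum_r alpha_{r,1} z^r, coefficients from Lemma 3.4
    s = 1.0
    for r in range(1, TERMS):
        s += (1.0 / (r * N)) * binom((1 + r) / N - 1, r - 1) \
             * (-2.0 ** -b) ** r * z ** r
    return z * s

for k in range(N):
    zeta = cmath.exp(2j * cmath.pi * k / N)   # a root of X^N - 1
    w = alpha(zeta)                           # approximate root of A(X)
    assert abs(w ** N + 2.0 ** -b * w - 1) < 1e-8
```

(With b = 1, as in Figure 1, the series for α(z) does not converge on the unit circle, which is why this sketch uses b = 4; the coefficient bounds behind this are Lemma 3.5 below.)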
Figure 1. Roots of A(X) = X^{12} + X/2 − 1 and of X^{12} − 1.

The first few terms of β(z) are

    β(z) = z + (1/N) 2^{−b} z^2 + ((N+1)/(2N^2)) 2^{−2b} z^3 + ((N+1)(2N+1)/(6N^3)) 2^{−3b} z^4 + ···.

We will need to construct an explicit functional inverse for β(z), in order to map the roots of X^N − 1 back to the roots of A(X). Let α(z) ∈ z R[[z]] be the formal power series inverse of β(z), i.e., so that

    β(α(z)) = z = α(β(z)).

The coefficients of α(z), and of its powers, are given as follows.

Lemma 3.4.
For any k ∈ Z we have (formally)

    α(z)^k = z^k Σ_{r=0}^∞ α_{r,k} z^r = z^k + α_{1,k} z^{k+1} + α_{2,k} z^{k+2} + ···,

where α_{0,k} := 1 and

    α_{r,k} := (k/(rN)) ((k+r)/N − 1 choose r−1) (−2^{−b})^r,   r ≥ 1.

In particular, the first few terms of α(z) are

    α(z) = z − (1/N) 2^{−b} z^2 − ((N−3)/(2N^2)) 2^{−2b} z^3 − ((N−2)(N−4)/(3N^3)) 2^{−3b} z^4 − ···.

Proof. By the Lagrange inversion formula (for example, in equation (2.1.2) of [6], set φ(t) := t^k and n := k + r), we find that α_{r,k} is equal to the coefficient of z^r in

    (1 − β(z) (z/β(z))′) (z/β(z))^{k+r}
      = (1 − z (1 − 2^{−b} z)^{−1/N} ((1 − 2^{−b} z)^{1/N})′) (1 − 2^{−b} z)^{(k+r)/N}
      = (1 + (2^{−b} z/N) (1 − 2^{−b} z)^{−1}) (1 − 2^{−b} z)^{(k+r)/N}
      = (1 − 2^{−b} z)^{(k+r)/N} + (2^{−b} z/N) (1 − 2^{−b} z)^{(k+r)/N − 1}.

Therefore for r ≥ 1,

    α_{r,k} = ((k+r)/N choose r) (−2^{−b})^r + (2^{−b}/N) ((k+r)/N − 1 choose r−1) (−2^{−b})^{r−1}
            = ((k+r)/(rN) − 1/N) ((k+r)/N − 1 choose r−1) (−2^{−b})^r
            = (k/(rN)) ((k+r)/N − 1 choose r−1) (−2^{−b})^r. □

Lemma 3.5.
For all r ≥ 0 and 0 ≤ k < N we have

    |β_{r,k}| ≤ 2^{−rb},   |α_{r,k}| ≤ 2^{−r(b−2)}.

Proof. The bounds are trivial for r = 0, so assume that r ≥ 1. Then

    |β_{r,k}| = (2^{−rb}/r!) ∏_{j=0}^{r−1} |−k/N − j| = (2^{−rb}/r!) ∏_{j=0}^{r−1} (k/N + j) ≤ (2^{−rb}/r!) ∏_{j=0}^{r−1} (j + 1) = 2^{−rb}

and

    |α_{r,k}| = 2^{−rb} (k/(rN)) (1/(r−1)!) ∏_{j=1}^{r−1} |(k+r)/N − j| ≤ (2^{−rb}/r!) (r + r/N)^r = 2^{−rb} (r^r/r!) (1 + 1/N)^r ≤ 2^{−rb} e^r (1 + 1/N)^r ≤ 2^{−r(b−2)},

using |(k+r)/N − j| ≤ 1 + r/N + j ≤ r + r/N, then r^r/r! ≤ e^r, and finally e(1 + 1/N) ≤ 4 by (3.1). □

Corollary 3.6.
The series for α(z) and β(z) converge on D_{2^{b−2}} and D_{2^b} respectively, and

    α(β(z)) = z = β(α(z)),   z ∈ D_2.   (3.2)

Proof. We already know that β(z) converges on D_{2^b}, and the convergence of α(z) on D_{2^{b−2}} follows from Lemma 3.5. If |z| < 2, then

    |α(z)| ≤ Σ_{r=0}^∞ |α_{r,1}| |z|^{r+1} < Σ_{r=0}^∞ 2^{−r(b−2)} 2^{r+1} = 2/(1 − 2^{−b+3}) ≤ 4,

so α(z) maps D_2 into D_4 ⊆ D_{2^b}. A similar argument shows that β(z) maps D_2 into D_4 ⊆ D_{2^{b−2}}. Since both α(z) and β(z) map D_2 into the disc of convergence of the other, and since they are inverses formally, they must be inverse functions in the sense of (3.2). □

Corollary 3.7.
The functions α(z) and β(z) induce mutually inverse bijections between the roots of X^N − 1 and the roots of A(X).

Proof. By Lemma 3.2, the polynomial A(X) has N distinct roots z_1, ..., z_N in D_2. Corollary 3.6 implies that β(z) is injective on D_2, so β(z) must map z_1, ..., z_N to N distinct roots of X^N − 1. Since X^N − 1 has exactly N roots, every root of X^N − 1 arises as β(z_i) for some i, and then α(z) must map this root back to z_i. □

3.3. Ring isomorphisms.
The aim of this section is to construct a pair of mutually inverse ring isomorphisms

    α* : R[X]/A(X) → R[X]/(X^N − 1),
    β* : R[X]/(X^N − 1) → R[X]/A(X).

In the main low product algorithm, the role of these maps will be to convert the problem of multiplying two polynomials modulo A(X) into an ordinary cyclic convolution.

The idea of the construction is that for F ∈ R[X]/A(X), we want to define (α* F)(z) to be the composition F(α(z)), considered as a polynomial modulo X^N − 1. The map β* is defined similarly. However, since α(z) and β(z) are not polynomials, there are convergence issues to consider. We now proceed to give a formal construction of α* and β* in terms of the power series expansions of α(z) and β(z), to make all of this more precise.

For each r ≥ 0, define linear maps

    α*_r : R[X]/A(X) → R[X]/(X^N − 1),
    β*_r : R[X]/(X^N − 1) → R[X]/A(X)

by the formulas

    α*_r (Σ_{k=0}^{N−1} F_k X^k mod A(X)) := Σ_{k=0}^{N−1} α_{r,k} F_k X^{k+r} mod X^N − 1,
    β*_r (Σ_{k=0}^{N−1} F_k X^k mod X^N − 1) := Σ_{k=0}^{N−1} β_{r,k} F_k X^{k+r} mod A(X).

These maps satisfy the following norm bounds.
Lemma 3.8.
For any r ≥ 0 and F ∈ R[X]/A(X),

    ‖α*_r F‖ ≤ 2^{−r(b−2)} ‖F‖.

Proof. For any G ∈ R[X]/(X^N − 1) we have ‖XG‖ = ‖G‖, because multiplication by X simply permutes the coefficients cyclically. Applying this observation to G := Σ_{k=0}^{N−1} α_{r,k} F_k X^k ∈ R[X]/(X^N − 1), and using Lemma 3.5, we obtain

    ‖α*_r F‖ = ‖X^r G‖ = ‖G‖ ≤ 2^{−r(b−2)} ‖F‖. □

Lemma 3.9. For any r ≥ 0 and F ∈ R[X]/(X^N − 1),

    ‖β*_r F‖ ≤ 2^{−r(b−1)} ‖F‖.

Proof. For any G = Σ_{k=0}^{N−1} G_k X^k ∈ R[X]/A(X) we have

    XG = G_{N−1} + (G_0 − 2^{−b} G_{N−1}) X + G_1 X^2 + ··· + G_{N−2} X^{N−1},

so ‖XG‖ ≤ (1 + 2^{−b}) ‖G‖. Taking G := Σ_{k=0}^{N−1} β_{r,k} F_k X^k ∈ R[X]/A(X), again Lemma 3.5 implies that

    ‖β*_r F‖ = ‖X^r G‖ ≤ (1 + 2^{−b})^r ‖G‖ ≤ (1 + 2^{−b})^r 2^{−rb} ‖F‖ ≤ 2^{−r(b−1)} ‖F‖. □

We now define α* and β* by setting

    α* F := Σ_{r=0}^∞ α*_r F,   β* F := Σ_{r=0}^∞ β*_r F.
Lemma 3.8 and Lemma 3.9 guarantee that these series converge coefficientwise, so α* and β* are well-defined, and they are clearly linear maps. Moreover, we immediately obtain the following estimates.

Lemma 3.10. For any F ∈ R[X]/A(X) and any λ ≥ 1 we have

    ‖α* F − Σ_{r=0}^{λ−1} α*_r F‖ ≤ 2 · 2^{−λ(b−2)} ‖F‖,   ‖Σ_{r=0}^{λ−1} α*_r F‖ ≤ (4/3) ‖F‖,

and ‖α* F‖ ≤ (4/3) ‖F‖.

Proof. For the first claim, observe that

    ‖α* F − Σ_{r=0}^{λ−1} α*_r F‖ = ‖Σ_{r=λ}^∞ α*_r F‖ ≤ Σ_{r=λ}^∞ ‖α*_r F‖ ≤ Σ_{r=λ}^∞ 2^{−r(b−2)} ‖F‖ = (2^{−λ(b−2)}/(1 − 2^{−b+2})) ‖F‖ ≤ 2 · 2^{−λ(b−2)} ‖F‖,

by Lemma 3.8 and (3.1). The remaining estimates are proved in a similar way. □

Lemma 3.11. For any F ∈ R[X]/(X^N − 1) and any λ ≥ 1 we have

    ‖β* F − Σ_{r=0}^{λ−1} β*_r F‖ ≤ 2 · 2^{−λ(b−1)} ‖F‖,   ‖Σ_{r=0}^{λ−1} β*_r F‖ ≤ (4/3) ‖F‖,

and ‖β* F‖ ≤ (4/3) ‖F‖.

Proof. Similar to the proof of Lemma 3.10. □
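As a consistency check on the series machinery assembled so far, the inverse relation β(α(z)) = z can be verified exactly on truncated power series over Q, using the coefficient formulas for β_{r,k} (Section 3.2) and α_{r,k} (Lemma 3.4). A sketch (the truncation order T = 8 is our choice):

```python
# Exact check of Lemma 3.4: composing beta(z) with the Lagrange-inversion
# series alpha(z) gives beta(alpha(z)) = z, as an identity on truncated
# power series over Q.
from fractions import Fraction

def binom(x, r):
    # generalized binomial coefficient (x choose r) for rational x
    out = Fraction(1)
    for j in range(r):
        out = out * (x - j) / (j + 1)
    return out

def mul(f, g, T):
    # product of two truncated power series (coefficient lists, length T)
    h = [Fraction(0)] * T
    for i in range(T):
        for j in range(T - i):
            h[i + j] += f[i] * g[j]
    return h

N, b, T = 4, 3, 8
mb = Fraction(-1, 1 << b)                                        # -2^{-b}
beta1 = [binom(Fraction(-1, N), r) * mb ** r for r in range(T)]  # beta_{r,1}
alpha1 = [Fraction(1)] + [
    Fraction(1, r * N) * binom(Fraction(1 + r, N) - 1, r - 1) * mb ** r
    for r in range(1, T)]                                        # alpha_{r,1}

A = [Fraction(0)] + alpha1[:T - 1]   # the series alpha(z), modulo z^T
comp = [Fraction(0)] * T             # accumulates beta(alpha(z))
P = A[:]                             # current power alpha(z)^{r+1}
for r in range(T - 1):
    for i in range(T):
        comp[i] += beta1[r] * P[i]
    P = mul(P, A, T)
assert comp == [Fraction(0), Fraction(1)] + [Fraction(0)] * (T - 2)
```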
Now we can establish that α* F and β* F behave like the desired compositions F(α(z)) and F(β(z)).

Lemma 3.12. Let F ∈ R[X]/A(X), and let z be a root of X^N − 1. Then (α* F)(z) = F(α(z)).

Proof. By the definition of α*_r,

    (α*_r F)(z) = Σ_{k=0}^{N−1} α_{r,k} F_k z^{k+r},

so

    (α* F)(z) = Σ_{r=0}^∞ (α*_r F)(z) = Σ_{k=0}^{N−1} F_k z^k Σ_{r=0}^∞ α_{r,k} z^r = Σ_{k=0}^{N−1} F_k α(z)^k = F(α(z)). □

Lemma 3.13. Let F ∈ R[X]/(X^N − 1), and let z be a root of A(X). Then (β* F)(z) = F(β(z)).

Proof. Similar to the proof of Lemma 3.12. □
Corollary 3.14.
The maps α* and β* are mutually inverse ring isomorphisms between R[X]/A(X) and R[X]/(X^N − 1).

Proof. Lemma 3.12 implies that

    (α*(FG))(z) = (FG)(α(z)) = F(α(z)) G(α(z)) = (α* F)(z) (α* G)(z)

for any F, G ∈ R[X]/A(X) and any root z of X^N − 1. Since a polynomial in R[X]/(X^N − 1) is determined by its values at the roots of X^N − 1, this shows that α*(FG) = (α* F)(α* G), and hence that α* is a ring homomorphism. A similar argument shows that β* is a ring homomorphism. Finally, if F ∈ R[X]/A(X) and z is a root of A(X), then Corollary 3.7 implies that

    (β* α* F)(z) = F(α(β(z))) = F(z).

Therefore α* and β* are inverses. □

Finally, we have the following two results concerning the complexity of approximating α* and β*.

Proposition 3.15 (Approximating α*). Given as input F ∈ 2^e R_p[X]/A(X), we may compute G ∈ 2^{e+1} R_p[X]/(X^N − 1) such that ‖G − α* F‖ < 2^{e+1−p}, in O(N M(p)) bit operations, assuming that p = O(b).

Proof. Let λ := ⌈(p + 2)/(b − 2)⌉; the hypothesis p = O(b) implies that λ = O(1). According to Lemma 3.10,

    ‖Σ_{r=0}^{λ−1} α*_r F‖ ≤ (4/3) ‖F‖ ≤ 2^{e+1}

and

    ‖α* F − Σ_{r=0}^{λ−1} α*_r F‖ ≤ 2 · 2^{−λ(b−2)} ‖F‖ ≤ (1/4) · 2^{e+1−p}.

To compute the desired G ∈ 2^{e+1} R_p[X]/(X^N − 1) such that ‖G − α* F‖ < 2^{e+1−p}, it suffices to ensure that G satisfies

    ‖G − Σ_{r=0}^{λ−1} α*_r F‖ < (3/4) · 2^{e+1−p}.   (3.3)

This may be accomplished by simply evaluating the sum Σ_{r=0}^{λ−1} α*_r F directly from the definition, with a sufficiently high working precision.

In more detail, we first calculate the coefficients α_{r,k}, for each 0 ≤ r < λ and 0 ≤ k < N. Each one requires O(λ) = O(1) operations, using the usual formula for the binomial coefficients. Next we compute the coefficients of the polynomials

    α*_0 F = F_0 + F_1 X + F_2 X^2 + ··· + F_{N−2} X^{N−2} + F_{N−1} X^{N−1},
    α*_1 F = (α_{1,0} F_0 X + α_{1,1} F_1 X^2 + ··· + α_{1,N−1} F_{N−1} X^N) mod X^N − 1
           = α_{1,N−1} F_{N−1} + α_{1,0} F_0 X + ··· + α_{1,N−2} F_{N−2} X^{N−1},
    α*_2 F = (α_{2,0} F_0 X^2 + α_{2,1} F_1 X^3 + ··· + α_{2,N−1} F_{N−1} X^{N+1}) mod X^N − 1
           = α_{2,N−2} F_{N−2} + α_{2,N−1} F_{N−1} X + α_{2,0} F_0 X^2 + ··· + α_{2,N−3} F_{N−3} X^{N−1},

and so on, up to α*_{λ−1} F. This costs altogether O(λN) = O(N) operations. Taking the sum of these polynomials costs another O(λN) operations. To ensure that (3.3) holds, it suffices to perform all of these operations with a working precision of p + O(λ) = p + O(1) significant bits (the details of this analysis are routine and are omitted). Each such addition, multiplication or division costs O(M(p)) bit operations, leading to the claimed complexity bound. □

Proposition 3.16 (Approximating β*). Given as input F ∈ 2^e R_p[X]/(X^N − 1), we may compute G ∈ 2^{e+1} R_p[X]/A(X) such that ‖G − β* F‖ < 2^{e+1−p}, in O(N M(p)) bit operations, assuming that p = O(b).

Proof. The proof is identical to that of Proposition 3.15, except that the reductions modulo A(X) lead to slightly more complicated formulas. For example, we have

    β*_2 F = (β_{2,0} F_0 X^2 + ··· + β_{2,N−2} F_{N−2} X^N + β_{2,N−1} F_{N−1} X^{N+1}) mod A(X)
           = β_{2,N−2} F_{N−2} + (β_{2,N−1} F_{N−1} − 2^{−b} β_{2,N−2} F_{N−2}) X
             + (β_{2,0} F_0 − 2^{−b} β_{2,N−1} F_{N−1}) X^2 + ··· + β_{2,N−3} F_{N−3} X^{N−1}.

The terms with the minus signs are those arising from the 2^{−b} X term in A(X). Overall, there are no more than O(λ) = O(1) of these additional terms per coefficient compared to the proof of Proposition 3.15. □

Remark. In the estimates given above, such as Lemma 3.5 and Lemma 3.8, we have opted for the shortest possible proofs rather than the sharpest possible bounds. With more effort, one could prove tighter bounds; this saves a few bits in the main algorithm, but does not affect the asymptotic conclusions of the paper. Similar remarks apply to Section 4.

3.4. The main algorithm.
We are now in a position to state Algorithm 3.1 and prove the main theorem concerning the computation of the low product.
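Before the formal statement and proof, the whole pipeline can be illustrated with a simplified model (our illustration): exact rational arithmetic replaces fixed-point arithmetic, a naive cyclic convolution stands in for the Convolution black box, and α*, β* are truncated after LAM terms (LAM = 6 is our choice). This sketches the structure of Algorithm 3.1, not its cost:

```python
# Exact-arithmetic model of the low product pipeline:
#   split -> alpha* -> cyclic convolution -> beta* -> round -> reassemble.
from fractions import Fraction

def binom(x, r):
    out = Fraction(1)
    for j in range(r):
        out = out * (x - j) / (j + 1)
    return out

def low_product(u, v, n, b, N, LAM=6):
    assert N * b >= n
    mb = Fraction(-1, 1 << b)                        # -2^{-b}

    def alpha_coeff(r, k):                           # alpha_{r,k}, Lemma 3.4
        if r == 0:
            return Fraction(1)
        return Fraction(k, r * N) * binom(Fraction(k + r, N) - 1, r - 1) * mb ** r

    def beta_coeff(r, k):                            # beta_{r,k}, Section 3.2
        return binom(Fraction(-k, N), r) * mb ** r

    mask = (1 << b) - 1
    U = [Fraction((u >> (b * i)) & mask) for i in range(N)]
    V = [Fraction((v >> (b * i)) & mask) for i in range(N)]

    def alpha_star(F):       # truncated alpha*: R[X]/A(X) -> R[X]/(X^N - 1)
        G = [Fraction(0)] * N
        for r in range(LAM):
            for k in range(N):
                G[(k + r) % N] += alpha_coeff(r, k) * F[k]
        return G

    def beta_star(F):        # truncated beta*: R[X]/(X^N - 1) -> R[X]/A(X)
        G = [Fraction(0)] * (N + LAM)
        for r in range(LAM):
            for k in range(N):
                G[k + r] += beta_coeff(r, k) * F[k]
        for i in range(N + LAM - 1, N - 1, -1):      # X^i = (1 - 2^-b X) X^{i-N}
            G[i - N] += G[i]
            G[i - N + 1] += G[i] * mb
            G[i] = Fraction(0)
        return G[:N]

    Ut, Vt = alpha_star(U), alpha_star(V)
    # cyclic convolution of length N (the Convolution black box)
    Wt = [sum(Ut[i] * Vt[(k - i) % N] for i in range(N)) for k in range(N)]
    Wb = beta_star(Wt)
    # round 2^b times each coefficient, evaluate at X = 2^b, reduce mod 2^n
    S = sum(round(c * (1 << b)) << (b * i) for i, c in enumerate(Wb))
    return (S >> b) % (1 << n)

u, v, n = 0xDEADBEEF, 0xCAFEBABE, 32
assert low_product(u, v, n, b=8, N=4) == (u * v) % (1 << n)
```

Only a length-N convolution is performed, rather than length 2N as in Algorithm 2.1; the truncation error of the series maps is far below the 2^{−b}/2 threshold that the final rounding step tolerates.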
Algorithm 3.1: Low product

Input: Parameters n ≥ 1, b ≥ 4, N ≥ 4 with Nb ≥ n; integers 0 ≤ u, v < 2^n
Output: uv mod 2^n (the low product of u and v)

1. p := 3b + lg N + 6.
2. Compute (u_0, ..., u_{N−1}) and (v_0, ..., v_{N−1}), where 0 ≤ u_i, v_i < 2^b, so that

       u = Σ_{i=0}^{N−1} u_i 2^{ib},   v = Σ_{i=0}^{N−1} v_i 2^{ib},

   and let

       Ū(X) := Σ_{i=0}^{N−1} u_i X^i,   V̄(X) := Σ_{i=0}^{N−1} v_i X^i,   Ū, V̄ ∈ 2^b R_p[X]/A(X).

3. Use Proposition 3.15 (approximating α*) to compute Ũ, Ṽ ∈ 2^{b+1} R_p[X]/(X^N − 1) such that ‖Ũ − α* Ū‖ < 2^{b+1−p} and ‖Ṽ − α* V̄‖ < 2^{b+1−p}.
4. Use Convolution to compute W̃ ∈ 2^{2b+2+lg N} R_p[X]/(X^N − 1) such that ‖W̃ − ŨṼ‖ < 2^{2b+2+lg N−p}.
5. Use Proposition 3.16 (approximating β*) to compute W̄ ∈ 2^{2b+3+lg N} R_p[X]/A(X) such that ‖W̄ − β* W̃‖ < 2^{2b+3+lg N−p}.
6. return Σ_{i=0}^{N−1} round(2^b W̄_i) 2^{(i−1)b} (mod 2^n).

Proof of Theorem 2.2. As in the proof of Theorem 2.1, let

    U(X) := Σ_{i=0}^{N−1} u_i X^i ∈ Z[X],   V(X) := Σ_{i=0}^{N−1} v_i X^i ∈ Z[X],

so that u = U(2^b) and v = V(2^b), and let

    W(X) := U(X) V(X) = Σ_{i=0}^{2N−2} w_i X^i ∈ Z[X].

The polynomials Ū and V̄ in line 2 are just the images of U and V in 2^b R_p[X]/A(X). Our goal is to compute L(X), the remainder on dividing W(X) by A(X), as in Proposition 3.1. By definition this is equal to Ū V̄ (computed in R[X]/A(X)).

Line 3 computes approximations Ũ and Ṽ to α* Ū and α* V̄. Line 4 computes W̃, an approximation to Ũ Ṽ (the cyclic convolution of Ũ and Ṽ). Observe that

    ‖W̃ − α*(Ū V̄)‖ = ‖W̃ − (α* Ū)(α* V̄)‖
      ≤ ‖W̃ − Ũ Ṽ‖ + ‖Ũ (Ṽ − α* V̄)‖ + ‖(Ũ − α* Ū)(α* V̄)‖
      ≤ ‖W̃ − Ũ Ṽ‖ + N ‖Ũ‖ ‖Ṽ − α* V̄‖ + N ‖α* V̄‖ ‖Ũ − α* Ū‖.

In this calculation we have used the fact that α* is a ring homomorphism (Corollary 3.14), and that ‖FG‖ ≤ N ‖F‖ ‖G‖ for any F, G ∈ R[X]/(X^N − 1). By Lemma 3.10 we have ‖α* V̄‖ ≤ (4/3) ‖V̄‖ < 2^{b+1}, so

    ‖W̃ − α*(Ū V̄)‖ ≤ 2^{2b+2+lg N−p} + N · 2^{b+1} · 2^{b+1−p} + N · 2^{b+1} · 2^{b+1−p} ≤ 12 · 2^{2b+lg N−p}.

Line 5 computes W̄, an approximation to β* W̃. Since α* and β* are inverses (Corollary 3.14), Lemma 3.11 implies that

    ‖W̄ − Ū V̄‖ ≤ ‖W̄ − β* W̃‖ + ‖β*(W̃ − α*(Ū V̄))‖ < 2^{2b+3+lg N−p} + (4/3) · 12 · 2^{2b+lg N−p} = 24 · 2^{2b+lg N−p} < 2^{−b}/2.

On the other hand, we know from Proposition 3.1 that 2^b L(X) has integer coefficients, so we deduce that round(2^b W̄_i) = 2^b L_i for each i. Therefore the sum in line 6 is equal to L(2^b); by Proposition 3.1 this is equal to W(2^b) = uv modulo 2^{Nb}, and hence also modulo 2^n, as Nb ≥ n.

The main term in the complexity bound arises from the Convolution call in line 4. The splitting and overlap-add steps in lines 2 and 6 contribute O(Nb) bit operations, as lg N = O(b), and the invocations of Proposition 3.15 and Proposition 3.16 in lines 3 and 5 contribute another O(N M(b)) bit operations. □

4. The high product
The discussion for the high product runs along similar lines to the low product, with one additional technical complication. The polynomial B(X) that naturally replaces A(X) in the cancellation trick (see Proposition 4.1) has N roots near the roots of X^N − 1, just like A(X), but it also has a real root near 2^b. Some extra work is needed to handle this additional root.

Throughout this section we continue to assume that (3.1) holds.

4.1. The cancellation trick. We begin with a suitable analogue of Proposition 3.1.
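The identity it rests on can be checked exactly before it is stated: multiplying W(X) by 1 − 2^{−b}X and reducing modulo B(X) := X^{N+1} − 2^b X^N + 2^b (the polynomial appearing in Proposition 4.1 below) produces a polynomial H(X) with 2^{Nb} H(2^b) close to W(2^b). A sketch in exact rational arithmetic (our illustration):

```python
# Exact check of the high-product cancellation trick (Proposition 4.1 below):
# reduce (1 - 2^-b X) W(X) modulo B(X) = X^{N+1} - 2^b X^N + 2^b and compare
# 2^{Nb} H(2^b) with W(2^b).
from fractions import Fraction

def high_remainder(w, N, b):
    # multiply W(X) = sum_i w[i] X^i by (1 - 2^-b X), then reduce mod B(X)
    # using the congruence X^{N+1} = 2^b X^N - 2^b (mod B(X))
    P = [Fraction(c) for c in w] + [Fraction(0)]
    for i in range(len(w)):
        P[i + 1] -= Fraction(w[i], 1 << b)
    for i in range(len(P) - 1, N, -1):     # eliminate X^i for i >= N + 1
        P[i - 1] += P[i] * (1 << b)
        P[i - N - 1] -= P[i] * (1 << b)
        P[i] = Fraction(0)
    return P[:N + 1]                       # H(X), deg H <= N

N, b = 3, 4
w = [9, 2, 7, 1, 8, 3, 5]                  # integer coefficients, deg W = 2N
H = high_remainder(w, N, b)
H_at = sum(c * (1 << b) ** i for i, c in enumerate(H))       # H(2^b)
W_at = sum(c * (1 << b) ** i for i, c in enumerate(w))       # W(2^b)
assert all((c * (1 << b)).denominator == 1 for c in H)       # 2^b H in Z[X]
err = W_at - (1 << (N * b)) * H_at
assert abs(err) <= (1 << ((N - 1) * b + 1)) * max(abs(c) for c in w[:N])
```

Here the factor 1 − 2^{−b}X kills the low-order terms of W at X = 2^b, so 2^{Nb} H(2^b) reproduces the high part of W(2^b) up to the bound asserted on the last line.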
Proposition 4.1.
Let W ( X ) ∈ Z [ X ] with deg W N . Let H ( X ) ∈ R [ X ] be theremainder on dividing (1 − − b X ) W ( X ) by B ( X ) := X N +1 − b X N + 2 b ∈ R [ X ] . with deg H < N + 1 . Then b H ( X ) ∈ Z [ X ] and | W (2 b ) − Nb H (2 b ) | ( N − b +1 max( | W | , . . . , | W N − | ) . Proof.
Write
$$W(X) = \sum_{i=0}^{2N} W_iX^i = \sum_{i=0}^{N-1} W_iX^i + \sum_{i=0}^{N} W_{N+i}X^{N+i}.$$
Since $X^N(1 - 2^{-b}X) = 1 \pmod{B(X)}$, we obtain
$$H(X) = \sum_{i=0}^{N-1} W_iX^i(1 - 2^{-b}X) + \sum_{i=0}^{N} W_{N+i}X^i.$$
This shows that $2^bH(X) \in \mathbf{Z}[X]$. Moreover, since $1 - 2^{-b}\cdot 2^b = 0$, the first sum vanishes at $X = 2^b$, so
$$|W(2^b) - 2^{Nb}H(2^b)| = \Big|\sum_{i=0}^{N-1} W_i2^{ib}\Big| \le 2^{(N-1)b+1}\max(|W_0|, \ldots, |W_{N-1}|). \qquad\square$$
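The cancellation identity above is easy to check in exact rational arithmetic. The following sketch (our own illustration, not from the paper, with small illustrative parameters $N = 4$, $b = 3$; `rem` and `ev` are hypothetical helper names) verifies that $2^bH(X)$ has integer coefficients and that $W(2^b) - 2^{Nb}H(2^b) = \sum_{i<N} W_i 2^{ib}$, exactly as derived in the proof:

```python
from fractions import Fraction
import random

def rem(f, g):
    """Remainder of polynomial f modulo monic g (coefficient lists, low degree first)."""
    f = f[:]
    while len(f) >= len(g):
        c = f[-1]
        for i in range(len(g)):
            f[len(f) - len(g) + i] -= c * g[i]
        f.pop()  # leading coefficient is now exactly zero
    return f

random.seed(1)
N, b = 4, 3
# B(X) = X^(N+1) - 2^b X^N + 2^b
B = [Fraction(2**b)] + [Fraction(0)] * (N - 1) + [Fraction(-(2**b)), Fraction(1)]

W = [Fraction(random.randrange(-100, 100)) for _ in range(2 * N + 1)]  # deg W <= 2N

# f = (1 - 2^{-b} X) * W(X)
f = [Fraction(0)] * (2 * N + 2)
for i, w in enumerate(W):
    f[i] += w
    f[i + 1] -= w / 2**b

H = rem(f, B)  # deg H <= N
assert all((2**b * h).denominator == 1 for h in H)  # 2^b H(X) has integer coefficients

ev = lambda p, x: sum(c * x**i for i, c in enumerate(p))
# only the low N coefficients of W survive in W(2^b) - 2^{Nb} H(2^b)
assert ev(W, 2**b) - 2**(N * b) * ev(H, 2**b) == sum(W[i] * 2**(i * b) for i in range(N))
```

The key point, visible in the assertions, is that the factor $1 - 2^{-b}X$ kills the low-order part of $W$ at $X = 2^b$, so $H$ retains only the high-order coefficients there.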
4.2. The roots of $B(X)$. The next result isolates the auxiliary real root of $B(X)$.

Lemma 4.2. The polynomial $B(X)$ has a unique real root $\rho$ in the interval
$$2^b(1 - 2^{-Nb+1}) < \rho < 2^b.$$
Moreover, this root satisfies
$$0.9 < \rho^N/2^{Nb} < 1. \qquad (4.1)$$

Proof.
The running hypothesis (3.1) implies that
$$(1 - 2^{-Nb+1})^N \ge (1 - 2^{-N+1})^N > 0.9, \qquad (4.2)$$
as the function $(1 - 2^{-x+1})^x$ is increasing for $x \ge 3$. Therefore
$$B(2^b(1 - 2^{-Nb+1})) = 2^b\big(1 - 2(1 - 2^{-Nb+1})^N\big) < 2^b(1 - 2\cdot 0.9) < 0.$$
On the other hand $B(2^b) = 2^b > 0$, so the intermediate value theorem ensures there is at least one root in the interval indicated. The derivative
$$B'(X) = X^{N-1}\big((N+1)X - N2^b\big)$$
is strictly positive on the interval, as for any $x > 2^b(1 - 2^{-Nb+1})$ we certainly have $x > 2^bN/(N+1)$; this implies that $\rho$ is unique. The estimate (4.1) now follows immediately from (4.2). $\square$

Next consider the polynomial
$$C(X) := \frac{B(X)}{X - \rho}.$$
Since $B(\rho) = 0$ we have
$$\rho - 2^b = \frac{\rho^{N+1} - 2^b\rho^N}{\rho^N} = -\frac{2^b}{\rho^N},$$
and hence the coefficients of $C(X)$ are given explicitly by
$$C(X) = X^N - \frac{2^b}{\rho}\Big(\frac{X^{N-1}}{\rho^{N-1}} + \cdots + \frac{X}{\rho} + 1\Big).$$
Lemma 4.3. The roots of $C(X)$ lie in $D_2$, and they are all simple.

Proof. If $z$ is a root of $C(X)$ and $|z| > 2$, then
$$|z|^N = \frac{2^b}{\rho}\Big|\frac{z^{N-1}}{\rho^{N-1}} + \cdots + \frac{z}{\rho} + 1\Big| \le \frac{2^b}{\rho}|z|^{N-1}\Big(\frac{1}{\rho^{N-1}} + \cdots + \frac{1}{\rho} + 1\Big),$$
so by (4.1) we have
$$|z| \le \frac{2^b}{\rho}\cdot\frac{1}{1 - \rho^{-1}} < 2,$$
which is impossible. If $C(X)$ had a multiple root, say $z$, then $z$ would also be a multiple root of $B(X)$. This would imply that
$$B'(z) = (N+1)z^N - 2^bNz^{N-1} = 0,$$
which in turn forces $z = 2^bN/(N+1)$. This contradicts the previous paragraph, as $2^bN/(N+1)$ does not lie in $D_2$. $\square$

Lemma 4.2 and Lemma 4.3 together imply that $B(X)$ has $N+1$ distinct roots, namely, the $N$ roots of $C(X)$, and the auxiliary root $\rho$. Figure 2 illustrates the case $N = 12$, $b = 1$.

[Figure 2. Roots of $X^{13} - 2X^{12} + 2$ (filled circles) and of $X^{12} - 1$.]

Now consider the function
$$\delta(z) := z(1 - 2^{-b}z)^{1/N}, \qquad z \in D_{2^b}.$$
This is the same as the definition of $\beta(z)$ in Section 3.2, except that the exponent $-1/N$ is replaced by $1/N$. The roots of $C(X)$ lie well within the domain of definition of $\delta(z)$. The auxiliary root $\rho$ is also inside the domain, but lies very close to the boundary.
The function $\delta(z)$ maps roots of $B(X)$ to roots of $X^N - 1$.

Proof. If $z$ is a root of $B(X)$, then $\delta(z)^N = z^N(1 - 2^{-b}z) = z^N - 2^{-b}z^{N+1} = 1$. $\square$

Of course, $\delta(z)$ cannot yield a bijection between the roots of $B(X)$ and those of $X^N - 1$, as $B(X)$ has too many roots. In a moment we will see that we do get a bijection if we restrict to the roots of $C(X)$.

For any $k \in \mathbf{Z}$, the function $\delta(z)^k$ is represented on $D_{2^b}$ by the series
$$\delta(z)^k = z^k\sum_{r=0}^{\infty}\delta_{r,k}z^r = z^k + \delta_{1,k}z^{k+1} + \delta_{2,k}z^{k+2} + \cdots,$$
where
$$\delta_{r,k} := \binom{k/N}{r}(-2^{-b})^r, \qquad r \ge 0.$$
Again, $\delta_{r,k}$ is identical to $\beta_{r,k}$, except that $N$ has the opposite sign. The first few terms in the expansion of $\delta(z)$ are
$$\delta(z) = z - \frac{1}{N}2^{-b}z^2 - \frac{N-1}{2N^2}2^{-2b}z^3 - \frac{(N-1)(2N-1)}{6N^3}2^{-3b}z^4 - \cdots.$$
Let $\gamma(z) \in z\mathbf{R}[[z]]$ be the formal series inverse of $\delta(z)$. The remaining results in this section are proved in exactly the same way as the corresponding results in Section 3.2, replacing $N$ by $-N$ as appropriate.

Lemma 4.5.
For any $k \in \mathbf{Z}$ we have (formally)
$$\gamma(z)^k = z^k\sum_{r=0}^{\infty}\gamma_{r,k}z^r = z^k + \gamma_{1,k}z^{k+1} + \gamma_{2,k}z^{k+2} + \cdots,$$
where $\gamma_{0,k} := 1$ and
$$\gamma_{r,k} := -\frac{k}{rN}\binom{-(k+r)/N - 1}{r-1}(-2^{-b})^r, \qquad r \ge 1.$$
In particular, the first few terms of $\gamma(z)$ are
$$\gamma(z) = z + \frac{1}{N}2^{-b}z^2 + \frac{N+3}{2N^2}2^{-2b}z^3 + \frac{(N+4)(2N+4)}{6N^3}2^{-3b}z^4 + \cdots.$$
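The coefficients $\gamma_{r,k}$ can be double-checked against Lagrange inversion (cf. [6]): since $z/\delta(z) = (1 - 2^{-b}z)^{-1/N}$, the coefficient of $z^{r+1}$ in $\gamma(z)$ equals $\frac{1}{r+1}\binom{-(r+1)/N}{r}(-2^{-b})^r$. The sketch below (our own verification, not code from the paper) confirms in exact rational arithmetic that this agrees with the formula above for $k = 1$ and small $r$:

```python
from fractions import Fraction

def binom(q, j):
    """Generalised binomial coefficient for rational q."""
    out = Fraction(1)
    for i in range(j):
        out = out * (q - i) / (i + 1)
    return out

N, b = 5, 3
m = Fraction(-1, 2**b)  # the quantity -2^{-b}

for r in range(1, 6):
    # Lemma 4.5 with k = 1: gamma_{r,1} = -(1/(rN)) binom(-(1+r)/N - 1, r-1) (-2^{-b})^r
    lemma = -Fraction(1, r * N) * binom(Fraction(-(1 + r), N) - 1, r - 1) * m**r
    # Lagrange inversion: [z^{r+1}] gamma(z) = (1/(r+1)) binom(-(r+1)/N, r) (-2^{-b})^r
    lagrange = Fraction(1, r + 1) * binom(Fraction(-(r + 1), N), r) * m**r
    assert lemma == lagrange

# spot-check the displayed z^3 term: (N+3)/(2N^2) * 2^{-2b}
assert -Fraction(1, 2 * N) * binom(Fraction(-3, N) - 1, 1) * m**2 \
       == Fraction(N + 3, 2 * N**2) * Fraction(1, 4**b)
```

The agreement is an instance of the identity $\binom{q}{r} = \frac{q}{r}\binom{q-1}{r-1}$ with $q = -(r+1)/N$, so it holds for every $r$, not just the range tested.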
Lemma 4.6. For all $r \ge 0$ and $0 \le k < N$ we have
$$|\delta_{r,k}| \le 2^{-rb}, \qquad |\gamma_{r,k}| \le 2^{-r(b-2)}.$$

Corollary 4.7. The series for $\gamma(z)$ and $\delta(z)$ converge on $D_{2^{b-2}}$ and $D_{2^b}$ respectively, and
$$\gamma(\delta(z)) = z = \delta(\gamma(z)), \qquad z \in D_2.$$

Corollary 4.8.
The functions $\gamma(z)$ and $\delta(z)$ induce mutually inverse bijections between the roots of $X^N - 1$ and the roots of $C(X)$.

4.3. Ring isomorphisms. In this section we will first construct maps
$$\gamma^*: \mathbf{R}[X]/C(X) \to \mathbf{R}[X]/(X^N-1), \qquad \delta^*: \mathbf{R}[X]/(X^N-1) \to \mathbf{R}[X]/C(X),$$
which are analogous to the maps $\alpha^*$ and $\beta^*$ defined in Section 3.3. Note that these maps do not yet take into account the auxiliary root $\rho$.

For each $r \ge 0$, define linear maps
$$\gamma_r^*: \mathbf{R}[X]/C(X) \to \mathbf{R}[X]/(X^N-1), \qquad \delta_r^*: \mathbf{R}[X]/(X^N-1) \to \mathbf{R}[X]/C(X)$$
by the formulas
$$\gamma_r^*\Big(\sum_{k=0}^{N-1}F_kX^k \bmod C(X)\Big) := \sum_{k=0}^{N-1}\gamma_{r,k}F_kX^{k+r} \bmod (X^N-1),$$
$$\delta_r^*\Big(\sum_{k=0}^{N-1}F_kX^k \bmod (X^N-1)\Big) := \sum_{k=0}^{N-1}\delta_{r,k}F_kX^{k+r} \bmod C(X).$$
Lemma 4.9.
For any $r \ge 0$ and $F \in \mathbf{R}[X]/C(X)$ we have $\|\gamma_r^*F\| \le 2^{-r(b-2)}\|F\|$.
Similar to the proof of Lemma 3.8. $\square$
Lemma 4.10.
For any $r \ge 0$ and $F \in \mathbf{R}[X]/(X^N-1)$ we have $\|\delta_r^*F\| \le 2^{-r(b-1)}\|F\|$.
For any $G = \sum_{k=0}^{N-1}G_kX^k \in \mathbf{R}[X]/C(X)$ we have
$$XG = \frac{2^b}{\rho}G_{N-1} + \Big(G_0 + \frac{2^b}{\rho^2}G_{N-1}\Big)X + \cdots + \Big(G_{N-2} + \frac{2^b}{\rho^N}G_{N-1}\Big)X^{N-1}.$$
Since $2^b/\rho < 2$ and $2^b/\rho^i < 1$ for $i > 1$, we find that $\|XG\| \le 2\|G\|$. The rest of the proof now runs as in Lemma 3.9. $\square$
We now define $\gamma^*$ and $\delta^*$ by setting
$$\gamma^*F := \sum_{r=0}^{\infty}\gamma_r^*F, \qquad \delta^*F := \sum_{r=0}^{\infty}\delta_r^*F.$$
The next five statements are proved exactly as in Section 3.3.
Lemma 4.11.
For any $F \in \mathbf{R}[X]/C(X)$ and any $\lambda \ge 1$, we have
$$\Big\|\gamma^*F - \sum_{r=0}^{\lambda-1}\gamma_r^*F\Big\| \le 2\cdot 2^{-\lambda(b-2)}\|F\|, \qquad \Big\|\sum_{r=0}^{\lambda-1}\gamma_r^*F\Big\| \le \tfrac43\|F\|,$$
and $\|\gamma^*F\| \le \tfrac43\|F\|$.
For any $F \in \mathbf{R}[X]/(X^N-1)$ and any $\lambda \ge 1$, we have
$$\Big\|\delta^*F - \sum_{r=0}^{\lambda-1}\delta_r^*F\Big\| \le 2\cdot 2^{-\lambda(b-1)}\|F\|, \qquad \Big\|\sum_{r=0}^{\lambda-1}\delta_r^*F\Big\| \le \tfrac43\|F\|,$$
and $\|\delta^*F\| \le \tfrac43\|F\|$.
Let $F \in \mathbf{R}[X]/C(X)$, and let $z$ be a root of $X^N - 1$. Then $(\gamma^*F)(z) = F(\gamma(z))$.
Let $F \in \mathbf{R}[X]/(X^N-1)$, and let $z$ be a root of $C(X)$. Then $(\delta^*F)(z) = F(\delta(z))$.

Corollary 4.15.
The maps $\gamma^*$ and $\delta^*$ are mutually inverse ring isomorphisms between $\mathbf{R}[X]/C(X)$ and $\mathbf{R}[X]/(X^N-1)$.

Now we bring $\rho$ back into the picture. We will define maps
$$\gamma^\dagger: \mathbf{R}[X]/B(X) \to \mathbf{R}[X]/(X^N-1)\oplus\mathbf{R}, \qquad \delta^\dagger: \mathbf{R}[X]/(X^N-1)\oplus\mathbf{R} \to \mathbf{R}[X]/B(X),$$
in terms of $\gamma^*$ and $\delta^*$, by utilising the Chinese remainder theorem isomorphism
$$\mathbf{R}[X]/B(X) \cong \mathbf{R}[X]/C(X)\oplus\mathbf{R}[X]/(X-\rho), \qquad F \mapsto (F \bmod C(X),\ F \bmod X-\rho).$$
More precisely, for $F \in \mathbf{R}[X]/B(X)$, we define
$$\gamma^\dagger F := \big(\gamma^*(F \bmod C(X)),\ \rho^{-N}F(\rho)\big) \in \mathbf{R}[X]/(X^N-1)\oplus\mathbf{R},$$
and conversely, for $(F,\theta) \in \mathbf{R}[X]/(X^N-1)\oplus\mathbf{R}$, we define $\delta^\dagger(F,\theta)$ to be the unique polynomial $G \in \mathbf{R}[X]/B(X)$ such that
$$G = (1 - 2^{-b}X)\,\delta^*(F) \pmod{C(X)}, \qquad G(\rho) = \rho^N\theta.$$
The maps $\gamma^\dagger$ and $\delta^\dagger$ are clearly linear isomorphisms, but they are not ring isomorphisms, because they do not quite preserve multiplication (due to the scaling factors $\rho^{\pm N}$ and $1 - 2^{-b}X$). Instead, they satisfy the following property.

Lemma 4.16. For any $F, G \in \mathbf{R}[X]/B(X)$ we have
$$\delta^\dagger(\gamma^\dagger F\cdot\gamma^\dagger G) = (1 - 2^{-b}X)FG.$$
Proof.
It is enough to check the equality modulo $C(X)$ and modulo $X - \rho$. It is satisfied modulo $C(X)$ as
$$\delta^\dagger(\gamma^\dagger F\cdot\gamma^\dagger G) = (1-2^{-b}X)\,\delta^*\big(\gamma^*(F\bmod C(X))\,\gamma^*(G\bmod C(X))\big) = (1-2^{-b}X)\,\delta^*\big(\gamma^*(FG\bmod C(X))\big) = (1-2^{-b}X)FG \pmod{C(X)},$$
and it is satisfied modulo $X - \rho$ as
$$\delta^\dagger(\gamma^\dagger F\cdot\gamma^\dagger G)(\rho) = \rho^N\big(\rho^{-N}F(\rho)\big)\big(\rho^{-N}G(\rho)\big) = \rho^{-N}F(\rho)G(\rho) = (1-2^{-b}\rho)(FG)(\rho),$$
using $1 - 2^{-b}\rho = \rho^{-N}$. $\square$

Define a norm on $\mathbf{R}[X]/(X^N-1)\oplus\mathbf{R}$ by taking
$$\|(F,\theta)\| := \max(\|F\|, |\theta|).$$
Then $\gamma^\dagger$ and $\delta^\dagger$ satisfy the following norm bounds.

Lemma 4.17.
For any $F \in \mathbf{R}[X]/B(X)$ we have $\|\gamma^\dagger F\| \le 3\|F\|$.

Proof.
Let $F = F_0 + \cdots + F_NX^N \in \mathbf{R}[X]/B(X)$. Then $F \bmod C(X)$ is equal to
$$\Big(F_0 + \frac{2^b}{\rho^N}F_N\Big) + \Big(F_1 + \frac{2^b}{\rho^{N-1}}F_N\Big)X + \cdots + \Big(F_{N-1} + \frac{2^b}{\rho}F_N\Big)X^{N-1},$$
so $\|F \bmod C(X)\| \le \big(1 + \frac{2^b}{\rho}\big)\|F\| \le 2.1\|F\|$ by (4.1), and hence Lemma 4.11 yields
$$\|\gamma^*(F\bmod C(X))\| \le \tfrac43\cdot 2.1\,\|F\| \le 3\|F\|.$$
We also have
$$|\rho^{-N}F(\rho)| = |F_N + F_{N-1}\rho^{-1} + \cdots| \le (1 - \rho^{-1})^{-1}\|F\| \le 2\|F\|.$$
Together these inequalities show that $\|\gamma^\dagger F\| \le 3\|F\|$. $\square$

Lemma 4.18.
For any $(F,\theta) \in \mathbf{R}[X]/(X^N-1)\oplus\mathbf{R}$ we have $\|\delta^\dagger(F,\theta)\| \le 3\|(F,\theta)\|$.

Proof.
We may derive an explicit formula for $\delta^\dagger(F,\theta)$ as follows. Let
$$H = H_0 + H_1X + \cdots + H_{N-1}X^{N-1} \in \mathbf{R}[X]/B(X)$$
such that $H = \delta^*F \pmod{C(X)}$ (in other words, $H(X)$ is the unique lift of $\delta^*F$ from $\mathbf{R}[X]/C(X)$ to $\mathbf{R}[X]/B(X)$ whose coefficient of $X^N$ is zero). It may be easily checked, by considering congruences modulo $C(X)$ and modulo $X - \rho$, that
$$\delta^\dagger(F,\theta) = (1 - 2^{-b}X)H(X) + \psi\cdot C(X), \qquad (4.3)$$
where
$$\psi = \frac{\rho^N\theta - \rho^{-N}H(\rho)}{C(\rho)} \in \mathbf{R}.$$
Thus we may estimate $\|\delta^\dagger(F,\theta)\|$ as follows. First, by Lemma 4.12 we have $\|H\| \le \tfrac43\|F\|$, so
$$|H(\rho)| \le \rho^{N-1}|H_{N-1} + H_{N-2}\rho^{-1} + \cdots| \le 2\rho^{N-1}\|H\| \le \tfrac83\rho^{N-1}\|F\|.$$
Also (4.1) implies that
$$C(\rho) = \rho^N - \frac{N2^b}{\rho} > 0.9\cdot 2^{Nb} - 1.2N > 0.8\cdot 2^{Nb}.$$
Therefore
$$|\psi| \le \frac{\rho^N|\theta| + \rho^{-N}|H(\rho)|}{C(\rho)} \le \frac{2^{Nb}|\theta| + \tfrac83\rho^{-1}\|F\|}{0.8\cdot 2^{Nb}} \le 1.3\,\|(F,\theta)\|.$$
Since $\|C\| \le 2^b/\rho \le 1.2$, we conclude that
$$\|\delta^\dagger(F,\theta)\| \le (1 + 2^{-b})\|H\| + |\psi|\,\|C\| \le \big((1 + 2^{-b})\tfrac43 + 1.3\cdot 1.2\big)\|(F,\theta)\| \le 3\|(F,\theta)\|. \qquad\square$$

Next, we exhibit efficient algorithms for approximating $\gamma^\dagger$ and $\delta^\dagger$.

Proposition 4.19 (Approximating $\gamma^\dagger$). Given as input $F \in 2^e\mathbf{R}_p[X]/B(X)$, we may compute $G \in 2^{e+2}\mathbf{R}_p[X]/(X^N-1)$ and $\theta \in 2^{e+2}\mathbf{R}_p$, such that $\|(G,\theta) - \gamma^\dagger F\| < 2^{e+2-p}$, in $O(N\,\mathsf{M}(p))$ bit operations, assuming that $p = O(b)$.

Proof. We first remark that since $p = O(b)$, we may precompute $\rho$ to a precision of $p + O(1)$ significant bits using Newton's method in only $O(1)$ operations in $\mathbf{R}$. (In fact, in practice one always has $p \ll Nb$, in which case the trivial approximation $\rho \approx 2^b$ is already correct to $p + O(1)$ significant bits.)

Given as input $F \in 2^e\mathbf{R}_p[X]/B(X)$ as above, we first compute an approximation to $F \bmod C(X)$ using the formula given in the proof of Lemma 4.17. The hypothesis $p = O(b)$, together with the rapid decay of the coefficients of $C(X)$, implies that this may be done using $O(1)$ operations in $\mathbf{R}$. We may then compute the desired approximation $G$ to $\gamma^*(F \bmod C(X))$ using the same method as in the proof of Proposition 3.15, at a cost of $O(N)$ operations in $\mathbf{R}$. Finally, we may easily compute the desired approximation $\theta$ to $\rho^{-N}F(\rho) = F_N + F_{N-1}\rho^{-1} + \cdots$ using another $O(1)$ operations in $\mathbf{R}$. $\square$

Proposition 4.20 (Approximating $\delta^\dagger$). Given as input $F \in 2^e\mathbf{R}_p[X]/(X^N-1)$ and $\theta \in 2^e\mathbf{R}_p$, we may compute $G \in 2^{e+2}\mathbf{R}_p[X]/B(X)$ such that $\|G - \delta^\dagger(F,\theta)\| < 2^{e+2-p}$, in $O(N\,\mathsf{M}(p))$ bit operations, assuming that $p = O(b)$.

Proof. The algorithm amounts to evaluating the explicit formula (4.3). We first approximate $H = \delta^*F$ using the same method as in the proof of Proposition 3.16, at a cost of $O(N)$ operations in $\mathbf{R}$. (This requires $O(1)$ more operations than the corresponding algorithm for $\beta^*$, because the reductions modulo $C(X)$ involve a few more terms than those modulo $A(X)$.) We then approximate $\psi$ at a cost of $O(1)$ operations, and evaluate (4.3) in another $O(N)$ operations. $\square$

Finally we may state the main high product algorithm, and prove the main theorem concerning its correctness and complexity.
Proof of Theorem 2.3.
Line 2 decomposes $u$ and $v$ into $N+1$ chunks of $b$ bits. The splitting boundaries are different to those used for the full product and the low product: here $u_N$ consists of the $b$ most significant bits of $u$, then $u_{N-1}$ the next lower $b$ bits, and so on. The hypothesis $N+1 \ge n/b$ ensures that this splitting is possible.

As in the proof of Theorem 2.1, let
$$U(X) := \sum_{i=0}^{N} u_iX^i \in \mathbf{Z}[X], \qquad V(X) := \sum_{i=0}^{N} v_iX^i \in \mathbf{Z}[X],$$
so that $u = U(2^b)\,2^{n-(N+1)b}$ and $v = V(2^b)\,2^{n-(N+1)b}$. Let
$$W(X) := U(X)V(X) = \sum_{i=0}^{2N} W_iX^i \in \mathbf{Z}[X].$$
The polynomials $\bar U$ and $\bar V$ in line 2 are just the images of $U$ and $V$ in $2^b\mathbf{R}_p[X]/B(X)$. Our goal is to compute $H(X) := (1-2^{-b}X)W(X) \pmod{B(X)}$ as in Proposition 4.1; by definition this is equal to $(1-2^{-b}X)\bar U\bar V$.

Line 3 computes approximations $(\tilde U, \theta_U)$ and $(\tilde V, \theta_V)$ to $\gamma^\dagger\bar U$ and $\gamma^\dagger\bar V$. Line 4 computes $\tilde W$, an approximation to $\tilde U\tilde V$, and $\theta_W$, an approximation to $\theta_U\theta_V \in \mathbf{R}$; the latter involves just a single real multiplication. Let us write $\bar U'$ and $\bar V'$ for the images of $\bar U$ and $\bar V$ in $\mathbf{R}[X]/C(X)$. A similar calculation to that used in the proof of Theorem 2.2 shows that
$$\|\tilde W - (\gamma^*\bar U')(\gamma^*\bar V')\| \le \|\tilde W - \tilde U\tilde V\| + N\|\tilde U\|\,\|\tilde V - \gamma^*\bar V'\| + N\|\gamma^*\bar V'\|\,\|\tilde U - \gamma^*\bar U'\| \le 2^{2b+4+\lg N-p} + N\cdot 2^{b+2}\cdot 2^{b+2-p} + N\cdot 2^{b+2}\cdot 2^{b+2-p}$$
Algorithm 4.1: High product

Input: Parameters $n$, $b$, $N$ satisfying (3.1) and $(N+1)b > n + \lg N + 2$; integers $0 \le u, v < 2^n$
Output: An integer $0 \le w \le 2^n$ such that $|uv - 2^nw| < 2^n$ (the high product of $u$ and $v$)

1. $p := 3b + \lg N + 9$.
2. Compute $(u_0, \ldots, u_N)$ and $(v_0, \ldots, v_N)$, where $0 \le u_i, v_i < 2^b$, so that
$$u = \sum_{i=0}^{N} u_i\,2^{ib - ((N+1)b - n)}, \qquad v = \sum_{i=0}^{N} v_i\,2^{ib - ((N+1)b - n)},$$
and let
$$\bar U(X) := \sum_{i=0}^{N} u_iX^i, \qquad \bar V(X) := \sum_{i=0}^{N} v_iX^i, \qquad \bar U, \bar V \in 2^b\mathbf{R}_p[X]/B(X).$$
3. Use Proposition 4.19 (approximating $\gamma^\dagger$) to compute $\tilde U, \tilde V \in 2^{b+2}\mathbf{R}_p[X]/(X^N-1)$ and $\theta_U, \theta_V \in 2^{b+2}\mathbf{R}_p$, such that
$$\|(\tilde U, \theta_U) - \gamma^\dagger\bar U\| < 2^{b+2-p}, \qquad \|(\tilde V, \theta_V) - \gamma^\dagger\bar V\| < 2^{b+2-p}.$$
4. Use Convolution to compute $\tilde W \in 2^{2b+4+\lg N}\mathbf{R}_p[X]/(X^N-1)$ and $\theta_W \in 2^{2b+4+\lg N}\mathbf{R}_p$ such that
$$\|\tilde W - \tilde U\tilde V\| < 2^{2b+4+\lg N-p}, \qquad |\theta_W - \theta_U\theta_V| < 2^{2b+4+\lg N-p}.$$
5. Use Proposition 4.20 (approximating $\delta^\dagger$) to compute $\bar W \in 2^{2b+6+\lg N}\mathbf{R}_p[X]/B(X)$ such that $\|\bar W - \delta^\dagger(\tilde W, \theta_W)\| < 2^{2b+6+\lg N-p}$.
6. $t := \sum_{i=0}^{N} \mathrm{round}(2^b\bar W_i)\,2^{(i-1)b - ((N+2)b - n)}$.
7. return $\mathrm{round}(t)$.

and that
$$|\theta_W - (\rho^{-N}\bar U(\rho))(\rho^{-N}\bar V(\rho))| \le |\theta_W - \theta_U\theta_V| + |\theta_U|\,|\theta_V - \rho^{-N}\bar V(\rho)| + |\rho^{-N}\bar V(\rho)|\,|\theta_U - \rho^{-N}\bar U(\rho)| \le 2^{2b+4+\lg N-p} + 2^{b+2}\cdot 2^{b+2-p} + 2^{b+2}\cdot 2^{b+2-p}.$$
These two inequalities may be expressed more briefly in combination by writing
$$\|(\tilde W, \theta_W) - (\gamma^\dagger\bar U)(\gamma^\dagger\bar V)\| \le 48\cdot 2^{2b+\lg N-p}.$$
Line 5 computes an approximation $\bar W$ to $\delta^\dagger(\tilde W, \theta_W)$. Using Lemma 4.16 and Lemma 4.18 we find that
$$\|\bar W - (1-2^{-b}X)\bar U\bar V\| \le \|\bar W - \delta^\dagger(\tilde W, \theta_W)\| + \|\delta^\dagger((\tilde W, \theta_W) - (\gamma^\dagger\bar U)(\gamma^\dagger\bar V))\| < 2^{2b+6+\lg N-p} + 3\cdot 48\cdot 2^{2b+\lg N-p} = 208\cdot 2^{2b+\lg N-p} < 2^{-b}/2.$$
We know from Proposition 4.1 that $2^bH(X)$ has integer coefficients, so we deduce that $\mathrm{round}(2^b\bar W_i) = 2^bH_i$ for each $i$. Thus the $t$ computed in line 6 is equal to $t = H(2^b)\,2^{n-(N+2)b}$. Applying Proposition 4.1, we obtain
$$|uv - 2^nt| = \Big|\frac{U(2^b)V(2^b)}{2^{(2N+2)b-2n}} - \frac{2^nH(2^b)}{2^{(N+2)b-n}}\Big| = \frac{|W(2^b) - 2^{Nb}H(2^b)|}{2^{(2N+2)b-2n}} < \frac{2^{(N-1)b+1}\cdot N\cdot 2^{2b+1}}{2^{(2N+2)b-2n}} \le 2^{2n-(N+1)b+\lg N+2} \le 2^n/2.$$
Since $0 \le uv < 2^{2n}$, this implies that $-1/2 < t < 2^n + 1/2$. Let $w := \mathrm{round}(t)$ be the value returned by the algorithm in line 7. Then the previous inequality shows that $0 \le w \le 2^n$. Moreover, since $|w - t| \le 1/2$, we conclude that
$$|uv - 2^nw| < 2^n/2 + 2^n/2 = 2^n$$
as desired. The running time analysis is essentially the same as in the proof of Theorem 2.2. $\square$
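The overall shape of Algorithm 4.1 can be illustrated in exact arithmetic: if steps 3–5 (the maps $\gamma^\dagger$, $\delta^\dagger$ and the convolution) are collapsed into an exact computation of $H = (1-2^{-b}X)UV \bmod B(X)$, then the surrounding splitting, rounding and recombination already yield a valid high product. The sketch below is our own simplification of that skeleton (it performs no FFTs and is not the paper's floating-point implementation); `high_product` and the small parameters are illustrative:

```python
from fractions import Fraction
import random

def high_product(u, v, n, b, N):
    """High product via the cancellation trick, with exact rational arithmetic
    standing in for the approximate convolution pipeline of Algorithm 4.1."""
    assert (N + 1) * b > n + N.bit_length() + 2    # input constraint (lg N ~ bit_length)
    s = (N + 1) * b - n
    U = [((u << s) >> (i * b)) & (2**b - 1) for i in range(N + 1)]
    V = [((v << s) >> (i * b)) & (2**b - 1) for i in range(N + 1)]
    # W = U * V as polynomials (degree <= 2N)
    W = [0] * (2 * N + 1)
    for i, ui in enumerate(U):
        for j, vj in enumerate(V):
            W[i + j] += ui * vj
    # H = (1 - 2^{-b} X) W(X) mod B(X), with B(X) = X^{N+1} - 2^b X^N + 2^b
    f = [Fraction(0)] * (2 * N + 2)
    for i, w in enumerate(W):
        f[i] += w
        f[i + 1] -= Fraction(w, 2**b)
    B = [Fraction(2**b)] + [Fraction(0)] * (N - 1) + [Fraction(-(2**b)), Fraction(1)]
    while len(f) > N + 1:
        c = f.pop()
        d = len(f)                     # degree of the popped leading term
        for i in range(N + 1):
            f[d - (N + 1) + i] -= c * B[i]
    H = f
    # t = H(2^b) * 2^{n - (N+2)b}, then round to get the high product
    t = sum(h * 2**(i * b) for i, h in enumerate(H)) * Fraction(2**n, 2**((N + 2) * b))
    return round(t)

random.seed(0)
n, b, N = 24, 4, 7
for _ in range(100):
    u, v = random.randrange(2**n), random.randrange(2**n)
    w = high_product(u, v, n, b, N)
    assert 0 <= w <= 2**n and abs(u * v - (w << n)) < 2**n
```

Replacing the exact remainder computation by the approximate pipeline of steps 3–5 is what makes the real algorithm fast; the skeleton only shows why the splitting and rounding around it are sound.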
5. Implementation

We wrote an implementation of the new low product and high product algorithms in C, together with a comparable implementation of the full product, to examine to what extent the desired 25% reduction in complexity can be realised in practice. The source code is available from the author's web page under a free software license.

The timings reported in Table 1 were obtained on a single core of an otherwise idle 2.4 GHz Intel Xeon E5-2680v4 (Broadwell microarchitecture), running Linux (CentOS 6.3, kernel 2.6.32). The compiler used was GCC 6.2.0, and our program was compiled with the optimisation flags -O3 -mavx2 -ffast-math. In the critical inner loops, our code uses GCC's vector extensions to take advantage of the AVX2 instruction set available on the target platform.

For the real convolutions, we rely on the one-dimensional real-to-complex and complex-to-real transforms provided by the FFTW library (version 3.3.6) [5]. We configured FFTW using the --enable-avx2 flag, and used FFTW's "wisdom" facility with the FFTW_MEASURE option to find optimal transform sequences for all relevant transform lengths.
Table 1.
Timings for full and truncated products

[Table 1: columns are n; parameters (N, b, λ) for the truncated products; low; high; parameters (N, b) for the full product; full; GMP. The tabulated data is only partially recoverable; representative rows: for n = 1 000 000, low 2.91 ms, high 2.99 ms, full 3.29 ms, GMP 4.07 ms; at the largest size shown, low 15.9 s, high 16.4 s, full 19.6 s, GMP 23.6 s.]
Theparameter λ refers to the number of terms used in the approximation of the ring iso-morphisms such as α ∗ ; it has the same meaning as in the proof of Proposition 3.15.Again, we chose λ by empirical testing, taking the smallest value that led to consis-tently correct output. We examined several possible candidates for N , namely thoseof the form N = 2 e e e e where n/b N . · n/b and 3 e e e < n increases. For the smaller values of n thetruncated products are around 10% faster than the full product, and towards thelarger values of n this improves towards the 15–20% range. We are hopeful thatmore careful implementation work would see these numbers decrease even further.6. Further applications
In this section we mention several further results that can be derived from the methods of this paper. Detailed proofs will appear in a forthcoming work.

6.1. Faster integer multiplication. Currently, the asymptotically fastest known integer multiplication algorithm has complexity
$$\mathsf{M}(n) = O(n\log n\,8^{\log^* n}),$$
where $\log^* n$ is the iterated logarithm function [9]. The idea of this algorithm is to convert the multiplication problem to a product in $\mathbf{C}[X]$, and then use a combination of Bluestein's algorithm and Kronecker substitution to convert this into a large collection of exponentially smaller integer multiplication problems, which are then handled recursively.

It was pointed out in [9, Section 9] that working over $\mathbf{C}$ leads to undesirable zero-padding being introduced on each recursive call. The culprit is essentially that multiplication of real numbers corresponds to a high product of integers, and it was not previously known how to compute a high product any more efficiently than a full product.

The technique introduced in Proposition 4.1 can be adapted to this situation. This yields a 25% reduction in zero-padding at every recursion level, and leads to the bound
$$\mathsf{M}(n) = O(n\log n\,6^{\log^* n}).$$

6.2. Other arithmetic operations.
Apart from multiplication, one may consider other operations on integers such as division, square root, and so on. Several authors have given bounds of the form $(c + o(1))\mathsf{M}(n)$ for these problems. These results typically rely on various reasonable assumptions about the underlying integer multiplication algorithm; for example, that it really computes products modulo numbers of the form $2^k \pm 1$, that Fourier transforms of multiplicands may be reused in subsequent computations, and that the transforms have reasonable linearity properties.

For division, reciprocal and square root, the best known constants are currently as follows. One may compute an $n$-bit approximation to the quotient of a $2n$-bit integer by an $n$-bit integer in time $(5/3 + o(1))\mathsf{M}(n)$ [14]. One may compute the quotient and remainder exactly in time $(2 + o(1))\mathsf{M}(n)$ [14]. One may compute an $n$-bit approximation to the reciprocal of an $n$-bit integer in time $(13/9 + o(1))\mathsf{M}(n)$ [8]. One may compute an approximate $n$-bit square root of a $2n$-bit integer in time $(4/3 + o(1))\mathsf{M}(n)$ [8]. One may compute the exact integer square root (i.e., $\lfloor\sqrt{x}\rfloor$) and remainder in time $(5/3 + o(1))\mathsf{M}(n)$ [8].

Note that the algorithms listed in the previous paragraph were originally stated for power series over a suitable ring (such as $\mathbf{C}$), but the same ideas apply to the integer case, and yield the same constants. Essentially, these algorithms all work by splitting the inputs into large blocks, applying a fairly naive algorithm at the level of blocks (for example, primary school long division), but operating on the Fourier transforms of the blocks.

One operation that occurs frequently in these algorithms is to compute the high product of two blocks. If we perform these using the new high product algorithm, then we can obtain better constants. For the five problems mentioned above, the constants decrease to respectively 3/2, 11/6, 25/18, 7/6 and 3/2. Of course, this conclusion is only valid under the additional assumption that the underlying integer multiplication algorithm is based on cyclic convolution of real sequences.
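As a concrete illustration of where a high product enters such block-level algorithms, here is a classical Barrett-style division sketch (cf. [1]; this is standard material, not code from the paper). The helper `mulhi` stands in for an $n$-bit high product, here computed exactly by a shift; only the high bits of the product $x_1 m$ are needed, which is exactly where a faster high product pays off:

```python
def mulhi(a, c, n):
    # High product: the top bits of a*c. In practice this is where a
    # truncated multiplication algorithm would be substituted.
    return (a * c) >> n

def barrett_divmod(x, d, n):
    """Divide 0 <= x < 2^(2n) by an n-bit d (top bit set) using one
    precomputed scaled reciprocal and a high product; Barrett reduction sketch."""
    assert 2**(n - 1) <= d < 2**n and 0 <= x < 2**(2 * n)
    m = 2**(2 * n) // d               # precomputed scaled reciprocal of d
    q = mulhi(x >> n, m, n)           # approximate quotient via a high product
    r = x - q * d                     # 0 <= r < 4d, so only a few corrections
    while r >= d:
        q, r = q + 1, r - d
    return q, r

n = 32
for d in (2**(n - 1) + 12345, 2**n - 1, 3 << (n - 2)):
    for x in (0, 123456789, 2**(2 * n) - 1, (d << n) - 1):
        assert barrett_divmod(x, d, n) == divmod(x, d)
```

The approximate quotient undershoots the true one by at most a small constant, so the correction loop runs O(1) times; the dominant cost is the single truncated multiplication.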
Acknowledgments
The author thanks Joris van der Hoeven for his comments on a draft of this paper. The author was supported by the Australian Research Council, Discovery Project grant DP150101689.
References
1. P. Barrett, Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor, Advances in cryptology—CRYPTO '86 (Santa Barbara, Calif., 1986), Lecture Notes in Comput. Sci., vol. 263, Springer, Berlin, 1987, pp. 311–323. MR 907099
2. D. Bernstein, Removing redundancy in high-precision Newton iteration, unpublished, available at http://cr.yp.to/papers.html, 2004.
3. R. P. Brent and P. Zimmermann, Modern computer arithmetic, Cambridge Monographs on Applied and Computational Mathematics, vol. 18, Cambridge University Press, Cambridge, 2011. MR 2760886
4. P. Bürgisser, M. Clausen, and M. A. Shokrollahi, Algebraic complexity theory, Grundlehren der Mathematischen Wissenschaften, vol. 315, Springer-Verlag, Berlin, 1997, with the collaboration of Thomas Lickteig. MR 1440179
5. M. Frigo, A fast Fourier transform compiler, Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation (PLDI '99), ACM, New York, 1999, pp. 169–180.
6. I. M. Gessel, Lagrange inversion, J. Combin. Theory Ser. A 144 (2016), 212–249. MR 3534068
7. T. Granlund, The GNU Multiple Precision Arithmetic Library (version 6.1.2), http://gmplib.org/.
8. D. Harvey, Faster algorithms for the square root and reciprocal of power series, Math. Comp. 80 (2011), no. 273, 387–394. MR 2728985
9. D. Harvey, J. van der Hoeven, and G. Lecerf, Even faster integer multiplication, J. Complexity 36 (2016), 1–30. MR 3530637
10. P. L. Montgomery, Modular multiplication without trial division, Math. Comp. 44 (1985), no. 170, 519–521. MR 777282
11. T. Mulders, On short multiplications and divisions, Appl. Algebra Engrg. Comm. Comput. 11 (2000), no. 1, 69–88. MR 1817699
12. C. H. Papadimitriou, Computational complexity, Addison-Wesley, Reading, MA, 1994. MR 1251285
13. J. van der Hoeven, The truncated Fourier transform and applications, ISSAC 2004, ACM, New York, 2004, pp. 290–296. MR 2126956
14. J. van der Hoeven, Newton's method and FFT trading, J. Symbolic Comput. 45 (2010), no. 8, 857–878. MR 2657669

E-mail address: [email protected]