Faster Multiplication for Long Binary Polynomials
Ming-Shing Chen, Chen-Mou Cheng, Po-Chun Kuo, Wen-Ding Li, Bo-Yin Yang
Department of Electrical Engineering, National Taiwan University, Taiwan, {mschen,doug,kbj}@crypto.tw
Institute of Information Science, Academia Sinica, Taiwan, {thekev,by}@crypto.tw
Research Center of Information Technology and Innovation, Academia Sinica, Taiwan

January 8, 2018
Abstract
We set new speed records for multiplying long polynomials over finite fields of characteristic two. Our multiplication algorithm is based on an additive FFT (Fast Fourier Transform) by Lin, Chung, and Han from 2014, whereas the previously best results were based on multiplicative FFTs. Both approaches have similar complexity in arithmetic operations over the underlying finite field; however, our implementation shows that the additive FFT has less overhead. For further optimization, we employ a tower field construction, because the multipliers in the additive FFT naturally fall into small subfields, which leads to speed-ups using table-lookup instructions in modern CPUs. Benchmarks show that our method saves 10%–40% of the computing time when multiplying long polynomials, compared to previous multiplicative-FFT implementations.

Keywords: Finite Field, Multiplication, Additive FFT, Single Instruction Multiple Data (SIMD).
Multiplication of long binary polynomials in the ring F_2[x], where F_2 is the finite field with two elements, is a fundamental problem in computer science. It is needed to factor polynomials in number theory [vzGG96] [vzGG02] and is central to the modern Block Wiedemann algorithm [Tho02]. Block Wiedemann is in turn integral to the Number Field Sieve [AFK+]. To the best of our knowledge, all currently fast binary-polynomial multiplication algorithms are based on a Fast Fourier Transform (FFT), an algorithm for efficiently evaluating polynomials at subgroups of the underlying finite field. Good reviews of algorithms and implementations for multiplying binary polynomials appear in [BGTZ08] and [HvdHL16].

In general, these previous methods were based on "multiplicative" FFTs, which evaluate polynomials at multiplicative subgroups formed by roots of unity. Since the sizes of the multiplicative groups existing in F_{2^k} are restricted, multiplicative FFTs over binary fields are somewhat more difficult than over the reals or the complex numbers. These multiplicative FFTs, evaluating polynomials at n points, reach their best efficiency only when n is the size of a multiplicative subgroup.

In [BGTZ08], Brent et al. implemented mainly Schönhage's algorithm [Sch77] for long polynomials, with complexity O(n log n log log n) field operations. Schönhage's algorithm is based on ternary FFTs over binary finite fields; in their implementation, the optimal number of evaluation points is a power of three. Harvey, van der Hoeven, and Lecerf [HvdHL16] presented multiplication using an FFT with a mixed-radix approach. They applied several discrete Fourier transforms (DFTs) for different input sizes, e.g., Cooley–Tukey [CT65] for the largest-scale DFT. In particular, they need a finite field whose size is close to a machine word and which simultaneously contains abundant multiplicative subgroups.
They proposed a field that elegantly satisfies these conditions.

Following Cantor [Can89], alternative methods were developed to evaluate polynomials at points that form an additive subgroup of a field of characteristic 2. These methods are called "additive FFTs". Cantor presented a basis, termed the "Cantor basis" in the literature, for constructing a finite field, as well as an FFT over that field. His FFT evaluates a polynomial at n = 2^m points with O(n log n) multiplications and O(n^{log_2 3}) additions (XORs). In 2010, Gao and Mateer [GM10] presented an additive FFT (henceforth "GM FFT") over F_{2^k}, where the evaluation points form an additive subgroup of the underlying field. Additive subgroups are easier to form than multiplicative subgroups in fields of characteristic 2. However, the complexity of the GM FFT is O(n log^2 n) XOR operations for evaluating a polynomial at n = 2^m points in general. It can be optimized to O(n log n log log n) only when m is a power of 2, in which case the polynomials are represented in a special polynomial basis introduced by Cantor [Can89]. In 2014, Bernstein and Chou [BC14] presented an efficient implementation of the GM FFT.

In 2014, Lin, Chung, and Han [LCH14] proposed a more general variant ("LCH FFT"), which uses a polynomial basis different from the standard one. This polynomial basis results in a more regular structure in the butterfly stage. When the underlying finite field is represented in a Cantor basis, the butterfly stage is the same as in the optimized GM FFT and Bernstein–Chou. In subsequent work [LANH16], they presented a method for converting the polynomial from the standard basis; the conversion costs O(n log n log log n) XOR operations for n any power of two, matching the complexity of multiplicative FFTs. More details of the LCH FFT are reviewed in Sec. 2.3.
In this paper, we present a method of multiplication for long binary polynomials, based on the recently developed LCH FFT, that is faster than previous multiplicative-FFT-based algorithms. From the faster results of our additive-FFT-based implementation, we confirm that the recent development of additive FFTs helps the multiplication of binary polynomials. Our implementation is faster than previous multiplicative-FFT codes for two reasons. First, the FFT in our algorithm uses simple binary butterfly stages, leading to (1/2) n log_2 n multiplications for n evaluation points, compared to (for example) the ternary FFT used in Schönhage's algorithm, which leads to (4/3) n log_3 n multiplications. This factor confers a 10%–20% advantage over multiplicative-FFT-based implementations and will be discussed in Sec. 3.1. Second, we exploit the fact that the multipliers in the additive FFT lie in subfields, which reduces the average time taken per multiplication. We will discuss the method in Sec. 3.2 and 3.3. The overall improvement is 10%–40% over previous implementations.

In this section, we discuss the general method of multiplication for long binary polynomials. Suppose we are multiplying two polynomials a(x) = a_0 + a_1 x + ... + a_{d-1} x^{d-1} and b(x) = b_0 + ... + b_{d-1} x^{d-1} ∈ F_2[x]. The polynomials are represented as bit sequences of length d. The standard Kronecker segmentation for multiplying binary polynomials proceeds as follows:

1. Partition the polynomials into w-bit blocks; there are n = ⌈d/w⌉ blocks:
   a(x) = a_0 + a_1 x + ... + a_{d-1} x^{d-1}
        → (a_0 + ... + a_{w-1} x^{w-1}) + (a_w + ... + a_{2w-1} x^{w-1}) x^w + ... + (...) x^{w(n-1)}.

2. Define F_{2^{2w}} := F_2[z]/(g(z)), with g(z) an irreducible polynomial of degree 2w. Let ψ map a(x) to a'(y) ∈ F_{2^{2w}}[y] (and similarly ψ(b(x)) = b'(y)):
   a'(y) := a'_0 + a'_1 y + ... + a'_{n-1} y^{n-1} ∈ F_{2^{2w}}[y],
   b'(y) := b'_0 + b'_1 y + ... + b'_{n-1} y^{n-1} ∈ F_{2^{2w}}[y],
   such that a'_0 = (a_0 + a_1 z + ... + a_{w-1} z^{w-1}), a'_1 = (a_w + ... + a_{2w-1} z^{w-1}), ..., a'_{n-1} = (a_{(n-1)w} + a_{(n-1)w+1} z + ... + a_{nw-1} z^{w-1}), and the same for the b'_j.

3. Calculate c'(y) = a'(y) · b'(y) = c'_0 + c'_1 y + ... ∈ F_{2^{2w}}[y] (using FFTs).

4. Map z ↦ x and y ↦ x^w. Then collect terms and coefficients to find the result of the multiplication of binary polynomials; since each c'_k occupies up to 2w bits placed at stride w, we need to add together at most 2 coefficients at any power.
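A minimal sketch of steps 1 and 4 in Python, where a GF(2)[x] polynomial is packed into an integer (bit e is the coefficient of x^e). All names are illustrative, and an O(n^2) blockwise convolution stands in for the FFT-based step 3; each block product has degree < 2w, so at most two products overlap at any bit position, as noted in step 4:

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials packed into ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def split_blocks(f: int, w: int) -> list:
    """Step 1: cut the bit string of f into w-bit blocks (lowest block first)."""
    blocks = []
    while f:
        blocks.append(f & ((1 << w) - 1))
        f >>= w
    return blocks or [0]

def combine(blocks: list, w: int) -> int:
    """Step 4: map y -> x^w and XOR the overlapping 2w-bit coefficients."""
    r = 0
    for i, c in enumerate(blocks):
        r ^= c << (i * w)
    return r

def kronecker_mul(a: int, b: int, w: int = 64) -> int:
    """Multiply a, b in GF(2)[x] blockwise (schoolbook stands in for the FFT)."""
    A, B = split_blocks(a, w), split_blocks(b, w)
    C = [0] * (len(A) + len(B) - 1)
    for i, ai in enumerate(A):
        for j, bj in enumerate(B):
            C[i + j] ^= clmul(ai, bj)   # exact product: degree < 2w, no reduction
    return combine(C, w)
```

The result agrees with a direct carry-less multiplication of the unsplit operands, for any block width w.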
FFT-based polynomial multiplication. It is well known that polynomial multiplication can be done using FFTs [CLRS09]. To multiply two polynomials a'(y) and b'(y) ∈ F_{2^{2w}}[y] whose product has degree at most n − 1 with FFT algorithms, the standard steps are as follows:
1. (fft) Evaluate a'(y) and b'(y) at n points by an FFT algorithm.
2. (pointmul) Multiply the evaluated values pairwise.
3. (ifft) Interpolate the pairwise products back into a polynomial of degree ≤ n − 1 by the inverse FFT.
The complexity of the polynomial multiplication is the same as that of the FFT in use.
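The three steps can be illustrated with a toy quadratic-time DFT over the prime field F_257, which contains an 8th root of unity (64); real implementations replace this with fast FFTs over binary fields. The names and the choice of field are purely illustrative:

```python
P, N, W = 257, 8, 64          # 64 is a primitive 8th root of unity mod 257

def dft(coeffs, root):
    """Evaluate the polynomial at root**0, ..., root**(N-1): the 'fft' step."""
    return [sum(c * pow(root, i * k, P) for i, c in enumerate(coeffs)) % P
            for k in range(N)]

def fft_mul(a, b):
    """Multiply a, b (product degree < N) via evaluate / pointmul / interpolate."""
    a = a + [0] * (N - len(a))
    b = b + [0] * (N - len(b))
    ea, eb = dft(a, W), dft(b, W)
    ec = [x * y % P for x, y in zip(ea, eb)]      # 'pointmul'
    c = dft(ec, pow(W, -1, P))                    # 'ifft': DFT at the inverse root
    inv_n = pow(N, -1, P)
    return [v * inv_n % P for v in c]
```

The output equals the schoolbook product of the coefficient vectors, reduced mod 257.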
XOR and AND are the addition and multiplication of F_2 = {0, 1}, the field of two elements. In this paper, every field is an algebraic extension of F_2. We will apply interchangeable representations for each finite field (or Galois field, GF) in use; the illustrative example is the field of 2^128 elements, denoted F_{2^128}, as in [Can89] and [BGTZ08]. We will switch representations during the computation to achieve better efficiency for field multiplication. All fields of the same size are isomorphic, and the cost of changing representation is one linear transformation.

We choose the basic working field to be the same as in AES-GCM:
F_{2^128} := F_2[x]/(x^128 + x^7 + x^2 + x + 1).
An element of F_{2^128} is represented as a binary polynomial of degree < 128. F_{2^128} can also be viewed as a linear space of dimension 128 with the basis (x^i)_{i=0}^{127}.

The cost of multiplication in F_{2^128}. Many platforms provide hardware instructions for multiplying small binary polynomials, which fit the multiplication of F_{2^128}. PCLMULQDQ is a widely used x86 instruction for multiplying 64-bit binary polynomials. Since one PCLMULQDQ performs a 64 × 64 → 128-bit product, a multiplication in F_{2^128} costs roughly 5 PCLMULQDQ: 3 for multiplying 128-bit polynomials with Karatsuba's method and 2 for reducing the 256-bit result back to 128 bits with linear folding. More details about multiplication in F_{2^128} can be found in [GK14].

2.2.2 Cantor Basis for Finite Field as Linear Space

Gao and Mateer presented an explicit construction of a Cantor basis for finite fields in [GM10]. The Cantor basis (β_i) satisfies β_0 = 1 and β_i^2 + β_i = β_{i−1} for i > 0.
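The 5-multiplication strategy for F_{2^128} described in Sec. 2.2.1 can be emulated in portable code; clmul64 below is a software stand-in for one PCLMULQDQ, and the function names are ours:

```python
M64 = (1 << 64) - 1

def clmul64(a: int, b: int) -> int:
    """Software stand-in for one PCLMULQDQ: 64x64 -> 128-bit carry-less product."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf128_mul(a: int, b: int) -> int:
    """Multiply in F_{2^128} = F_2[x]/(x^128 + x^7 + x^2 + x + 1) using
    3 carry-less 64-bit products (Karatsuba) plus reduction by folding."""
    a0, a1, b0, b1 = a & M64, a >> 64, b & M64, b >> 64
    lo = clmul64(a0, b0)                       # product 1
    hi = clmul64(a1, b1)                       # product 2
    mid = clmul64(a0 ^ a1, b0 ^ b1) ^ lo ^ hi  # product 3 (Karatsuba middle term)
    p = lo ^ (mid << 64) ^ (hi << 128)         # 255-bit product of a and b
    # Reduce: fold the high bits down twice using x^128 = x^7 + x^2 + x + 1.
    for _ in range(2):
        hi128, p = p >> 128, p & ((1 << 128) - 1)
        p ^= hi128 ^ (hi128 << 1) ^ (hi128 << 2) ^ (hi128 << 7)
    return p
```

For instance, x^127 · x reduces to x^7 + x^2 + x + 1 = 0x87, and the usual field identities (commutativity, associativity, multiplication by 1) hold.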
Definition 2.1. With respect to the basis (β_i), let φ_β(k) := Σ_{j=0}^{m−1} b_j β_j be the field element represented by k under (β_i), where k = Σ_{j=0}^{m−1} b_j 2^j is the binary expansion of k with b_j ∈ {0, 1}.

Definition 2.2.
Given a basis (β_i)_{i=0}^{m−1} of the base field, its sequence of subspaces is V_i := span{β_0, β_1, ..., β_{i−1}}, and its subspace vanishing polynomials (s_i) are
s_i(x) := ∏_{a ∈ V_i} (x − a).
Note that V_i is a field with linear basis (β_j)_{j=0}^{i−1} only when i is a power of 2. Since dim V_i = i, one can see that deg(s_i(x)) = 2^i. From [Can89] and [GM10], the vanishing polynomials s_i(x) w.r.t. the Cantor basis (β_i) have the following useful properties:
• (linearity) s_i(x) contains only monomials of the form x^{2^m}.
• (minimal two terms) s_i(x) = x^{2^i} + x iff i is a power of 2. In the cases i = 2^k, the V_{2^k} are fields and {β_{2^k}, β_{2^k+1}, ..., β_{2^{k+1}−1}} ⊂ F_{2^{2^{k+1}}} \ F_{2^{2^k}}.
• (recursivity) s_i(x) = s_{i−1}(x)^2 + s_{i−1}(x) = s_1(s_{i−1}(x)), and s_{i+j}(x) = s_i(s_j(x)). If k = 2^{i_1} + 2^{i_2} + ... + 2^{i_j} with i_1 < i_2 < ... < i_j, then we can write s_k(x) = s_{2^{i_1}}(s_{2^{i_2}}(···(s_{2^{i_j}}(x))···)). Therefore, every s_i is a composition of functions that each have only two terms (see Table 1).

Evaluating s_i(x) in the Cantor basis. Computing s_i(α) in the Cantor basis is very fast, since the representation of s_i(α) is exactly that of α shifted right by i bits: s_i(φ_β(j)) = φ_β(j ≫ i). For example, s_i(β_i) = β_0 = 1.

In this section, we introduce how to evaluate a degree-(n−1) polynomial at n points of a binary field with the LCH FFT. We assume that n is a power of 2; the polynomial is padded with 0 high-degree coefficients if its actual degree is less than n − 1. The LCH FFT requires the evaluated polynomial to be converted into a particular kind of basis, called the novelpoly basis.

novelpoly basis. The polynomial basis used in the LCH FFT was presented in [LCH14]. The novelpoly basis for polynomials must be distinguished from bases for the field: although the LCH FFT is independent of the underlying basis of the finite field, we assume that the field is in a Cantor basis for simplicity.
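The two-term and recursivity properties of the s_i listed above can be verified symbolically in a few lines; here a polynomial over F_2 is an integer whose bit e is the coefficient of x^e (a sketch; names are ours):

```python
def poly_sq(f: int) -> int:
    """Square in GF(2)[x]: the coefficient of x^e moves to x^(2e)."""
    r, e = 0, 0
    while f:
        if f & 1:
            r |= 1 << (2 * e)
        f >>= 1
        e += 1
    return r

def s(i: int) -> int:
    """Vanishing polynomial s_i w.r.t. a Cantor basis:
    s_0(x) = x and s_i(x) = s_{i-1}(x)^2 + s_{i-1}(x)."""
    f = 0b10                      # s_0(x) = x
    for _ in range(i):
        f = poly_sq(f) ^ f
    return f
```

For instance, s(2) = x^4 + x and s(8) = x^256 + x have exactly two terms (2 and 8 are powers of 2), while every exponent appearing in any s(i) is a power of two (linearity).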
Definition 2.3. Given the Cantor basis (β_i) of the base field and its vanishing polynomials (s_i), define the novelpoly basis w.r.t. (β_i) to be the polynomials (X_k):
X_k(x) := ∏ (s_i(x))^{b_i}, where k = Σ b_i 2^i with b_i ∈ {0, 1}.
I.e., X_k(x) is the product of all s_i(x) where the i-th bit of k is set. Since deg(s_i(x)) = 2^i, clearly deg(X_k(x)) = k.

To perform the LCH FFT, the evaluated polynomial f(x) has to be converted into the form
f(x) = g(X) = g_0 + g_1 X_1(x) + ... + g_{n−1} X_{n−1}(x).
The evaluation of a polynomial in the novelpoly basis can be done through a "butterfly" process, denoted FFT_LCH. The general idea for evaluating f(x) at all points of V_k is to divide V_k into the two sets V_{k−1} and V_k \ V_{k−1} = V_{k−1} + β_{k−1} := {x + β_{k−1} : x ∈ V_{k−1}}. Since s_{k−1}(x) is linear, the evaluations at V_{k−1} + β_{k−1} can be quickly calculated from the evaluations at V_{k−1} with the butterfly process. It is a divide-and-conquer process in which the polynomial f(x) = g(X) is expressed as two half-sized polynomials h_0(X) and h_1(X) with g(X) = h_0(X) + X_{2^{⌈log n⌉−1}}(x) · h_1(X).

FFT_LCH is detailed in Algorithm 1. FFT_LCH evaluates the converted polynomial g(X) at the points V_{log n} + α. Lines 4 and 5 perform the butterfly process (see Fig. 1 and Fig. 2). The inverse FFT_LCH simply performs the butterflies in reverse.
Algorithm 1: LCH FFT in the novelpoly basis.
FFT_LCH(f(x) = g(X), α):
  input : a polynomial g(X) = g_0 + g_1 X_1(x) + ... + g_{n−1} X_{n−1}(x); a scalar α.
  output: a list [f(0 + α), f(φ_u(1) + α), ..., f(φ_u(n − 1) + α)].
  1. if deg(f(x)) = 0 then return [g_0];
  2. Let k ← max i s.t. 2^i ≤ n − 1.
  3. Let g(X) = p_0(X) + X_{2^k} · p_1(X) = p_0(X) + s_k(x) · p_1(X).
  4. h_0(X) ← p_0(X) + s_k(α) · p_1(X).
  5. h_1(X) ← h_0(X) + s_k(β_k) · p_1(X).   // s_k(β_k) = 1 in the Cantor basis.
  6. return [FFT_LCH(h_0(X), α), FFT_LCH(h_1(X), β_k + α)]
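To make Algorithm 1 concrete, here is a toy FFT_LCH over F_16 with a Cantor basis found by brute force, checked against direct evaluation in the novelpoly basis. This is only an illustrative sketch (the paper's implementation works over much larger fields with SIMD code), and all names are ours:

```python
# GF(16) = F_2[t]/(t^4 + t + 1); elements are 4-bit integers.
def gmul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13                      # reduce by t^4 + t + 1
        b >>= 1
    return r

# Brute-force a Cantor basis: beta_0 = 1, beta_i^2 + beta_i = beta_{i-1}.
beta = [1]
for _ in range(3):
    beta.append(next(b for b in range(16) if gmul(b, b) ^ b == beta[-1]))

def phi(j: int) -> int:
    """Field element represented by the integer j in the Cantor basis."""
    r = 0
    for i, bi in enumerate(beta):
        if j >> i & 1:
            r ^= bi
    return r

def s_at(k: int, x: int) -> int:
    """Evaluate s_k at x via the recursion s_i(x) = s_{i-1}(x)^2 + s_{i-1}(x)."""
    for _ in range(k):
        x = gmul(x, x) ^ x
    return x

def fft_lch(g: list, alpha: int) -> list:
    """Algorithm 1: evaluate g_0 + g_1 X_1 + ... + g_{n-1} X_{n-1} at V_{log n} + alpha."""
    n = len(g)
    if n == 1:
        return g
    k = n.bit_length() - 2                 # split off X_{n/2} = X_{2^k}
    p0, p1 = g[:n // 2], g[n // 2:]
    ska = s_at(k, alpha)
    h0 = [u ^ gmul(ska, v) for u, v in zip(p0, p1)]
    h1 = [u ^ v for u, v in zip(h0, p1)]   # s_k(beta_k) = 1 in a Cantor basis
    return fft_lch(h0, alpha) + fft_lch(h1, beta[k] ^ alpha)
```

Output index j holds f(φ(j) + α); one can also observe the shift property s_i(φ_β(j)) = φ_β(j ≫ i) directly on this basis.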
LCH and s k ( β k ) =1 in Cantor basis, avoiding one multiplication. This constant reduction of mul-tiplications won’t affect the asymptotic complexity; however, one can not bearthe extra multiplication in practice. Although FFT
LCH is applicable to any basisof field, the choice for a practitioner might be Cantor basis only. novelpoly
Basis w.r.t. Cantor Basis
The evaluated polynomial has to be in the novelpoly basis before performing FFT_LCH. We review the conversion algorithms in this section. The fast conversion relies on the simple form of the (s_i) w.r.t. the Cantor basis (see Table 1).

Table 1: s_i(x) w.r.t. the Cantor basis, written with y := s_2(x), z := s_4(x), w := s_8(x).
  s_0(x) = x
  s_1(x) = x^2 + x
  s_2(x) = x^4 + x =: y
  s_3(x) = x^8 + x^4 + x^2 + x = s_1(y) = y^2 + y
  s_4(x) = x^16 + x = s_2(y) = y^4 + y =: z
  s_5(x) = x^32 + x^16 + x^2 + x = s_1(z) = z^2 + z
  s_6(x) = x^64 + x^16 + x^4 + x = s_2(z) = z^4 + z
  s_7(x) = x^128 + x^64 + x^32 + x^16 + x^8 + x^4 + x^2 + x = s_3(z) = z^8 + z^4 + z^2 + z
  s_8(x) = x^256 + x = s_4(z) = z^16 + z =: w

[BC14] converts f(x) to g(X) by finding the largest i such that 2^i < deg f, and then dividing f(x) by s_i(x) to form f(x) = f_0(x) + s_i(x) f_1(x). Recursively dividing f_0(x) and f_1(x) by lower s_i(x) eventually expresses f(x) as a sum of non-repetitive products of the s_i(x), which is the desired form for g(X). The division comprises only XOR operations, since the nonzero coefficients of the s_i(x) are always 1 w.r.t. the Cantor basis; the complexity of a division by s_i(x) therefore depends on the number of terms of s_i(x). This is functionally equivalent to the Cantor transform.

[LANH16] does better by
1. finding the largest i such that deg(s_{2^i}) ≤ deg f, and then doing a variable substitution (Alg. 2) to express f as a power series in s_{2^i};
2. recursively expressing the series in s_{2^i} as a series in the X_j(s_{2^i}), where j < 2^i;
3. recursively expressing each coefficient of X_j(s_{2^i}) (which is a polynomial in x of degree < 2^{2^i}) as a series in the X_k, where k < 2^{2^i}.
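The division-based conversion described above for [BC14] can be sketched for coefficients in F_2, where dividing by the sparse s_i costs only shifts and XORs. This is a sketch: the function names and the expansion-based check are ours:

```python
# Polynomials over GF(2): bit e of an int is the coefficient of x^e.
def poly_sq(f):                    # squaring spreads bits: x^e -> x^(2e)
    r, e = 0, 0
    while f:
        if f & 1:
            r |= 1 << (2 * e)
        f >>= 1
        e += 1
    return r

def s_poly(i):                     # s_i w.r.t. a Cantor basis (0/1 coefficients)
    f = 0b10
    for _ in range(i):
        f = poly_sq(f) ^ f
    return f

def divmod2(f, g):                 # GF(2)[x] division: only shifts and XORs
    q, dg = 0, g.bit_length() - 1
    while f and f.bit_length() - 1 >= dg:
        sh = f.bit_length() - 1 - dg
        q |= 1 << sh
        f ^= g << sh
    return q, f

def to_novelpoly(f, m):
    """Coefficients (in GF(2)) of f, deg f < 2^m, in the basis X_0..X_{2^m - 1}."""
    if m == 0:
        return [f & 1]
    f1, f0 = divmod2(f, s_poly(m - 1))      # f = f0 + s_{m-1} * f1
    return to_novelpoly(f0, m - 1) + to_novelpoly(f1, m - 1)

def from_novelpoly(g):
    """Check: expand sum of g_k * X_k(x) back into a GF(2)[x] polynomial."""
    def clmul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            b >>= 1
        return r
    total = 0
    for k, gk in enumerate(g):
        if gk:
            xk, i = 1, 0
            while k >> i:
                if k >> i & 1:
                    xk = clmul(xk, s_poly(i))
                i += 1
            total ^= xk
    return total
```

Converting and then expanding returns the original polynomial, and X_1(x) = x as expected.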
Algorithm 2: Variable substitution.
VarSubs(f(x), y):
  input : two polynomials: f(x) = f_0 + f_1 x + ... + f_{n−1} x^{n−1} and y = x^{2^i} + x.
  output: h(y) = h_0(x) + h_1(x) y + ... + h_{m−1}(x) y^{m−1}.
  1. if deg(f(x)) < 2^i then return h(y) ← f(x);
  2. Let k ← max 2^j (j ∈ Z) s.t. deg((x^{2^i} + x)^{2^j}) ≤ deg(f(x)).
  3. Let y^k ← x^{k·2^i} + x^k.
  4. Compute f_0(x) + y^k · f_1(x) = f(x) by dividing f(x) by x^{k·2^i} + x^k.   // Done by repeated subtraction (XOR).
  5. return VarSubs(f_0(x), y) + y^k · VarSubs(f_1(x), y).

The details of the basis conversion are given in Algorithm 3, and an example is given in Appendix B. Note that the algorithms rely only on the simple form of the (s_i), not on the field representation of the coefficients.

Algorithm 3: Basis conversion: monomial to novelpoly w.r.t. the Cantor basis.
BasisCvt(f(x)):
  input : f(x) = f_0 + f_1 x + ... + f_{n−1} x^{n−1}.
  output: g(X) = g_0 + g_1 X_1(x) + ... + g_{n−1} X_{n−1}(x).
  1. if deg(f(x)) ≤ 1 then return g(X) ← f_0 + X_1 f_1;
  2. Let k ← max 2^i (i ∈ Z) s.t. deg(s_{2^i}(x)) ≤ deg(f(x)).
  3. Let y ← s_k(x).
  4. h(y) = h_0(x) + h_1(x) y + ... + h_{m−1}(x) y^{m−1} ← VarSubs(f(x), y).
  5. h′(Y) = q_0(x) + q_1(x) X_k + ... + q_{m−1}(x) X_{(m−1)·k} ← BasisCvt(h(y)).
  6. foreach q_i(x) in h′(Y) do: g_i(X) ← BasisCvt(q_i(x)).
  7. return g(X) = g_0(X) + g_1(X) X_k + ... + g_{m−1}(X) X_{(m−1)·k}.

We can obtain a fast multiplication for binary polynomials simply by applying the LCH FFT with the Cantor basis as the underlying FFT in the general method of Sec. 2.1. Besides this straightforward method, we also present a faster algorithm that accelerates the field multiplications inside the FFT. The acceleration relies on a special tower field representation, making all multipliers in the butterflies short.
A simple version of our multiplication for binary polynomials keeps the working field in a polynomial-basis representation and applies the LCH FFT, with evaluation points in the Cantor basis, to the general multiplication of Sec. 2.1. The details of this straightforward algorithm are presented in Alg. 4. For FFT_LCH, we choose w = 64 and use F_{2^128} as our base field. The evaluation points are {φ_β(0), ..., φ_β(2n − 1)}, as can be seen on lines 5 and 6 of Alg. 4. The multiplier s_k(α) is calculated in the Cantor basis and then switched to its representation in F_{2^128} with a linear map (field isomorphism). Although the multipliers are short numbers in the Cantor basis, they are random-looking 128-bit polynomials in F_{2^128}; the field multiplication in F_{2^128} is performed with 5 PCLMULQDQ (cf. Sec. 2.2.1). The other multiplier, s_k(β_k), equals 1 in the Cantor basis.

Algorithm 4: Simple multiplication for binary polynomials.
binPolyMul(a(x), b(x)):
  input : a(x), b(x) ∈ F_2[x].
  output: c(x) = a(x) · b(x) ∈ F_2[x].
  1. f_a(x) ∈ F_{2^128}[x] ← Split(a(x)).
  2. f_b(x) ∈ F_{2^128}[x] ← Split(b(x)).
  3. g_a(X) ∈ F_{2^128}[X] ← BasisCvt(f_a(x)).
  4. g_b(X) ∈ F_{2^128}[X] ← BasisCvt(f_b(x)).
  5. [f_a(0), ..., f_a(φ_β(2n − 1))] ← FFT_LCH(g_a(X), 0).
  6. [f_b(0), ..., f_b(φ_β(2n − 1))] ← FFT_LCH(g_b(X), 0).
  7. [f_c(0), ..., f_c(φ_β(2n − 1))] ← [f_a(0) · f_b(0), ..., f_a(φ_β(2n − 1)) · f_b(φ_β(2n − 1))].
  8. g_c(X) ∈ F_{2^128}[X] ← iFFT_LCH([f_c(0), ..., f_c(φ_β(2n − 1))]).
  9. f_c(x) ∈ F_{2^128}[x] ← iBasisCvt(g_c(X)).
  10. c(x) ∈ F_2[x] ← InterleavedCombine(f_c(x)).
  11. return c(x).

Advantages of the additive FFT. We can expect the simple structure of the LCH FFT to lead to a lower complexity. One way to see this is to count the butterflies. The LCH FFT has a binary structure: at each of log_2 n layers there are n/2 butterflies, for a total of (1/2) n log_2 n multiplications. In a ternary FFT there are instead log_3 n layers of structures analogous to butterflies, n/3 per layer, and each of these structures takes four multiplications, for a total of (4/3) n log_3 n multiplications. All else being equal, the multiplicative complexity of the binary structure of the additive FFT is thus about 1.68 times lower than that of the ternary FFT, which is used in the Schönhage-like algorithm in [BGTZ08]. Similarly, we hold an advantage over the even more complex FFT method in [HvdHL16].

Results. Please refer to Tab. 2 in Sec. 5. Simply using an additive FFT confers a 10%–20% advantage over the state-of-the-art libraries of [HvdHL16, BGTZ08].
We consider the following sequence of extension fields:
  F_4 = F̃_4 := F_2[x_1]/(x_1^2 + x_1 + 1),
  F_16 = F̃_16 := F̃_4[x_2]/(x_2^2 + x_2 + x_1),
  F_256 = F̃_256 := F̃_16[x_3]/(x_3^2 + x_3 + x_1 x_2),
  F̃_{2^16} := F̃_256[x_4]/(x_4^2 + x_4 + x_1 x_2 x_3),
  F̃_{2^32} := F̃_{2^16}[x_5]/(x_5^2 + x_5 + ∏_{i=1}^{4} x_i),
  F̃_{2^64} := F̃_{2^32}[x_6]/(x_6^2 + x_6 + ∏_{i=1}^{5} x_i),
  F̃_{2^128} := F̃_{2^64}[x_7]/(x_7^2 + x_7 + ∏_{i=1}^{6} x_i),
  F̃_{2^256} := F̃_{2^128}[x_8]/(x_8^2 + x_8 + ∏_{i=1}^{7} x_i).
Decimal subscripts or a tilde denote that the field is in the tower representation. We can now define a basis for these tower fields as vector spaces over F_2.

Definition 3.1. v_k := ∏_{j=0}^{m−1} x_{j+1}^{b_j}, where k := Σ_{j=0}^{m−1} b_j 2^j with b_j ∈ {0, 1}.

By definition, the sequence (v_0, v_1, v_2, v_3, v_4, v_5, ...) := (1, x_1, x_2, x_1 x_2, x_3, x_1 x_3, ...). Henceforth (v_k) will be our default basis unless otherwise specified.

Definition 3.2. ī := φ_v(i) is the element of F̃_{2^k} represented by the integer i. Numbers in hexadecimal also denote representatives under the basis (v_i).

Hence the sequence (v_k) can also be written as (v_0, v_1, v_2, v_3, ...) := (1, 2, 4, 8, ...), or (0x1, 0x2, 0x4, 0x8, ...) in hexadecimal. For example, x_1 x_2 + x_2 + x_1 + 1 = v_3 + v_2 + v_1 + v_0 ∈ F_16 is denoted 15 or 0xf. Under this notation, we can order two elements of the tower field, and thus define "big" or "small", by comparing their representative numbers. The representation of each field is embedded in the lower bit(s) of the field that is twice as wide; for example, the elements {0, 1, 2, 3} of F_16 form the subfield F_4. Therefore, a "small" number in the tower field is often in a subfield.

3.2.1 Compatibility between Tower and Cantor Bases

We discuss the "compatibility" between the tower and Cantor bases in this section. Clearly, the basis (v_0, v_1, ...) is not a Cantor basis, since, e.g., v_3^2 + v_3 ≠ v_2.
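A compact recursive multiplication for this tower, with elements packed as integers as in Definition 3.2, can be sketched as follows (a sketch; tmul is our name, and the reduction constant is the product x_1 · x_2 · ... · x_{k−1} = v_{2^{k−1}−1} from the tower definition above):

```python
def tmul(a: int, b: int, k: int) -> int:
    """Multiply a, b in the tower field of 2^(2^k) elements; elements are ints
    of 2^k bits, and the low half of the bits is the next smaller tower field."""
    if k == 0:
        return a & b                       # GF(2)
    half = 1 << (k - 1)                    # bits per half
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    lo = tmul(a0, b0, k - 1)
    hi = tmul(a1, b1, k - 1)
    mid = tmul(a0 ^ a1, b0 ^ b1, k - 1) ^ lo ^ hi   # Karatsuba middle term
    # Reduce with x_k^2 = x_k + c, where c = x_1 x_2 ... x_{k-1} = v_{2^(k-1)-1},
    # whose integer representative is 1 << (half - 1).
    c = 1 << (half - 1)
    return ((hi ^ mid) << half) ^ tmul(hi, c, k - 1) ^ lo
```

The subfield embedding is directly visible: multiplying two elements of {0, 1, 2, 3} in F̃_256 (k = 3) gives the same result as in F̃_4 (k = 1), and each v_i = 2^i maps under x ↦ x^2 + x to a strictly shorter number.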
Claim 1. The sequence of subspaces (V_k) and the subspace vanishing polynomials (s_i) are nevertheless the same w.r.t. the Cantor basis (β_j) and the tower basis (v_j).

We first set V_k := span(v_0, ..., v_{k−1}) and s_k(x) := ∏_{c ∈ V_k} (x − c), and we will show that these are the same as those obtained from a Cantor basis. Note again that V_k is a field only for k = 2^m a power of two, and only in this case we have V_{2^m} = F̃_{2^{2^m}} = F̃_{2^{2^{m−1}}}[x_m]/(x_m^2 + x_m + v_{2^{m−1}−1}).

Proposition 1. s_k(v_k) = 1.

Proof:
See Appendix A.
Corollary 1 (recursivity). With respect to (v_i),
s_{k+1}(x) = s_k(x) · ∏_{c ∈ V_k + v_k} (x − c) = s_k(x) · s_k(x + v_k) = s_k(x)(s_k(x) + s_k(v_k)) = (s_k(x))^2 + s_k(x).

Note that s_i(x) is linear w.r.t. any basis of the field [LCH14, Theorem 1]. Thus, the vanishing polynomials of the tower and Cantor bases are the same. Since s_i(x) determines V_i by the unique factorization theorem, we have proved Claim 1.

In this section we show that multiplying by a subfield element is not only cheaper than a general multiplication but can also be calculated as a vector-scalar product. The cost depends on the size of the subfield. For example, to multiply a ∈ F̃_{2^{2w}} by b ∈ F̃_{2^w}, the element a is represented as a polynomial a := a_0 + a_1 x ∈ F̃_{2^w}[x] with a_0, a_1 ∈ F̃_{2^w}; the product a · b ∈ F̃_{2^{2w}} is then calculated as two multiplications in F̃_{2^w}, i.e., (a_0 + a_1 x) · b. It is easy to generalize this to the following proposition.

Proposition 2.
Given a ∈ F̃_{2^{l_1}} = V_{l_1} and b ∈ F̃_{2^{l_2}} = V_{l_2} with l_2 | l_1, the product a · b ∈ V_{l_1} can be performed with l_1/l_2 field multiplications in F̃_{2^{l_2}}.

Besides the compatibility with the Cantor basis, which allows efficient FFT_LCH and basis conversion in tower fields, we show in this section that the multipliers in FFT_LCH fall to smaller numbers, which can be optimized via subfield multiplication with Prop. 2. Recall that the two multipliers are s_k(α) and s_k(v_k), the latter corresponding to s_k(β_k) in Alg. 1.

First, s_k(v_k) = 1 by Prop. 1. We can thus avoid the multiplication by s_k(v_k), as in the Cantor basis; one butterfly unit then contains one multiplication and two additions. Figure 1 shows the details of the butterfly unit. It is also an example of evaluating a degree-1 polynomial f(x) = f_0 + f_1 x at the points {c, c + 1}, i.e., α = c in the notation of Alg. 1. Since X_1(x) = x, a degree-1 polynomial after basis conversion is identical to the original polynomial, i.e., g(X) = g_0 + g_1 X_1 = f_0 + f_1 X_1. The only effective multiplier is s_0(α) = s_0(c) = c.

[Figure 1: Details of the butterfly unit.]

Now we discuss the multiplier s_k(α).

Proposition 3. For any v_i = 2^i w.r.t. the tower representation, v_i^2 + v_i ∈ V_i.

Proof: See Appendix A.

Contrast this with the Cantor basis, where β_i^2 + β_i = β_{i−1}. Note that s_1 : x ↦ x^2 + x is a linear map from the vector space V_{i+1} to V_i and its kernel is {0, 1}. Since each vector in V_i is the image of exactly two vectors in V_{i+1}, each vector in V_i \ V_{i−1} is also the image of exactly two vectors in V_{i+1} \ V_i. Hence s_1(v_i) = v_{i−1} + u with u ∈ V_{i−1}, i.e., s_1 shortens the tower representation by exactly one bit. Since s_k is just s_1 applied k consecutive times, the multiplier s_k(α) in the LCH butterflies is (a) k bits shorter than α in the tower representation, and (b) independent of the least significant k bits of α.

Fig. 2 depicts the evaluation of a degree-7 polynomial at the points {0, 1, ..., 15} with the LCH FFT, using 4 (= log 16) layers of butterflies. We can observe that the multipliers in the butterflies are smaller than the actual evaluation points.

[Figure 2: The forward butterfly units for evaluating a degree-7 polynomial f(x) = g(X) = g_0 + ... + g_7 X_7 at the points {0, 1, ..., 15}.]

Since the coefficients above degree 7 are all 0, the first layer is simply a "fan-out" (in software, copying a block of memory). The multipliers in the second layer are calculated by evaluating the degree-4 polynomial s_2(x) at the two values α ∈ {0, 8}, resulting in small multipliers in F_4. The third layer evaluates the degree-2 polynomial s_1(x) at the 4 points α ∈ {0, 4, 8, 12}, again resulting in small multipliers. In the last layer, the multipliers are the (even) evaluation points {0, 2, 4, ..., 14} themselves, because s_0(α) = α.

Multiplication in F̃_{2^128} or F̃_{2^256}

The main idea of our algorithm is to trade an expensive field isomorphism for faster multiplications by small subfield elements. We show the algorithm for multiplying binary polynomials in Alg. 5. Assume first that we use w = 64-bit words; the input polynomials are then in 64-bit blocks after partitioning (Split()).
We perform the basis conversion before changing representations, so as to have more densely packed data during the conversion. Then we convert the data to the tower representation; the data are kept in the tower representation for performing FFT_LCH. We compute the subfield multiplications efficiently with the technique of Sec. 4.1. The pointmul step is also performed in F̃_{2^128}; its somewhat different operations are detailed in Sec. 4.1.3. The inverse butterfly and inverse basis-conversion stages are then computed (still in the tower representation) before converting back to the polynomial basis of F_{2^128}. Finally, we split the 128-bit results into 64-bit blocks and collate the coefficients for the final result (InterleavedCombine()). An initial block width of w = 128 bits is also possible; in this case the operative field is F̃_{2^256}.
Algorithm 5: Multiplication for binary polynomials in the tower representation.
binPolyMul(a(x), b(x)):
  input : a(x), b(x) ∈ F_2[x].
  output: c(x) = a(x) · b(x) ∈ F_2[x].
  1. f_a(x) ∈ F_{2^{2w}}[x] ← Split(a(x)).
  2. f_b(x) ∈ F_{2^{2w}}[x] ← Split(b(x)).
  3. g_a(X) ∈ F_{2^{2w}}[X] ← BasisCvt(f_a(x)).
  4. g_b(X) ∈ F_{2^{2w}}[X] ← BasisCvt(f_b(x)).
  5. g̃_a(X) ∈ F̃_{2^{2w}}[X] ← changeRepr(g_a(X)).
  6. g̃_b(X) ∈ F̃_{2^{2w}}[X] ← changeRepr(g_b(X)).
  7. [f̃_a(0), ..., f̃_a(2n − 1)] ← FFT_LCH(g̃_a(X), 0).
  8. [f̃_b(0), ..., f̃_b(2n − 1)] ← FFT_LCH(g̃_b(X), 0).
  9. [f̃_c(0), ..., f̃_c(2n − 1)] ← [f̃_a(0) · f̃_b(0), ..., f̃_a(2n − 1) · f̃_b(2n − 1)].
  10. g̃_c(X) ∈ F̃_{2^{2w}}[X] ← iFFT_LCH([f̃_c(0), ..., f̃_c(2n − 1)]).
  11. f̃_c(x) ∈ F̃_{2^{2w}}[x] ← iBasisCvt(g̃_c(X)).
  12. f_c(x) ∈ F_{2^{2w}}[x] ← changeRepr(f̃_c(x)).
  13. c(x) ∈ F_2[x] ← InterleavedCombine(f_c(x)).
  14. return c(x).

Implementation of F̃_{2^128}

Multiplication by Subfield Elements
Multiplying an element of a tower field by a subfield element is implemented as a scalar multiplication of a vector over the subfield, by Prop. 2. We show how to calculate this product efficiently on current mainstream computers in this section.
The typical single-instruction-multiple-data (SIMD) instruction set nowadays is Advanced Vector Extensions 2 (AVX2), providing 256-bit ymm registers on x86 platforms. We use the table-lookup instruction VPSHUFB in AVX2 for multiplying elements of F̃_{2^128} by subfield elements from F_16 up to F̃_{2^32}. We demonstrate the scalar multiplication over subfields with the PSHUFB instruction, which is the precursor of VPSHUFB and uses 128-bit xmm registers.

PSHUFB takes two 16-byte sources: one is a lookup table of 16 bytes x = (x_0, x_1, ..., x_15) and the other is a vector of 16 indices y = (y_0, y_1, ..., y_15). The 16-byte result of "PSHUFB x, y" at position i is x_{y_i mod 16} if y_i ≥ 0, and 0 if y_i < 0 (index bytes are treated as signed). VPSHUFB simply performs two copies of PSHUFB in one instruction. The two instructions are well suited to scalar multiplication over small fields [CYC13].

For scalar multiplication over F_16, we first prepare 16 tables; each table stores the products of all elements with one specific element of F_16. Given a ∈ F_16^n and b ∈ F_16, we apply VPSHUFB to the prepared "multiply-by-b" table and a to obtain the product a · b. Since one 256-bit register stores 64 elements of F_16, the data in a have to be split into nibbles (4-bit chunks) before applying VPSHUFB. Scalar multiplication over F_256 is similar, except that the number of prepared tables becomes 256 and one multiplication costs 2 VPSHUFB.

For scalar multiplication over F̃_{2^16} or F̃_{2^32}, we implement the field multiplication as polynomial multiplication over F_256[x], because 2^16 or 2^32 prepared tables would be far too much for the caches. Karatsuba's method is applied to reduce the total number of multiplications in F_256 while multiplying the polynomials.

While multiplying F̃_{2^128} elements by elements of F̃_{2^16} or F̃_{2^32}, different bytes of an F̃_{2^128} element have to be multiplied by different multipliers in F_256. Since we use scalar multiplication over F_256 as our building block, multiplying by different multipliers reduces the efficiency.

Example: product of F̃_{2^128} and F̃_{2^16} elements. The natural data layout of F̃_{2^128} consists of eight consecutive F̃_{2^16} elements, each with its two bytes stored side by side. Suppose we are multiplying a = (a_0, ..., a_7) ∈ F̃_{2^128} := (F̃_{2^16})^8 by c = (c_0, c_1) ∈ F̃_{2^16} := (F_256)^2. Then a is naturally stored as ((a_{0,0}, a_{0,1}), (a_{1,0}, a_{1,1}), ..., (a_{7,0}, a_{7,1})). To perform the field multiplication in F̃_{2^16} = F_256[x] with Karatsuba, we must then compute products such as ((a_{0,0}, 0), (a_{1,0}, 0), ..., (a_{7,0}, 0)) · c_0 and ((0, a_{0,1}), ..., (0, a_{7,1})) · c_1. In this case, one has to mask off half the components of a, which reduces the efficiency.
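The table-lookup mechanism above can be modeled in a few lines. gf16_mul below uses one fixed F_16 representation purely to fill the tables (the mechanism is representation-agnostic; the paper uses the tower representation), and all names are ours:

```python
def gf16_mul(a: int, b: int) -> int:
    """Reference GF(16) multiply in F_2[t]/(t^4 + t + 1), for table generation."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13                      # reduce by t^4 + t + 1
        b >>= 1
    return r

# One "multiply-by-b" table per scalar b: 16 tables of 16 entries, as in the text.
TABLES = [[gf16_mul(a, b) for a in range(16)] for b in range(16)]

def pshufb(table: list, idx: list) -> list:
    """Model of the table lookup done by (V)PSHUFB for 4-bit indices (the real
    instruction also zeroes any lane whose index byte has its top bit set)."""
    return [table[i & 0x0F] for i in idx]

def scalar_mul_gf16(vec: list, b: int) -> list:
    """Multiply a whole vector of GF(16) nibbles by the scalar b in one lookup."""
    return pshufb(TABLES[b], vec)
```

As the text describes, F_256 uses the same idea with 256 prepared tables and 2 lookups per multiplication (one for the low and one for the high nibble of each byte).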
"Byte-slicing" layout: To perform many scalar multiplications with VPSHUFB efficiently, a possible solution is to store each of the 16 bytes of an F̃_{2^128} element in a separate register. This rearrangement of the data layout is exactly a transposition of a byte matrix. After the bytes in the same position of many F̃_{2^128} elements are collected in the same register (cf. Fig. 3), SIMD instructions can cover entire registers.
Transposition as needed: When multiplying two elements by one subfield element, instead of one multiplicand as in the previous example, e.g., multiplying a, b ∈ F̃_{2^128} by c ∈ F̃_{2^16}, we only have to split the even and odd bytes of a and b to perform the scalar multiplications efficiently. Hence we can use a 2 × 2 transposition, which converts a = (a_0, ..., a_15) and b = (b_0, ..., b_15) into one register of even bytes and one of odd bytes. The 2 × 2 transposition involves only 2 registers instead of the 16 registers of the full byte-matrix transposition and is thus more efficient. For multiplying by an F̃_{2^32} element, we need a 4 × 4 transposition of our data. Fig. 3 depicts the data rearrangements for the different demands.

Figure 3: "Byte-matrix transpose" for F̃_{2^128} elements: the data layout in the middle is for multiplying by F̃_{2^32} elements, and the rightmost (fully byte-sliced) layout is for pointmul.

To transpose a matrix, we use techniques similar to [War12], composing a large transposition from smaller block transpositions. While performing the transposition, we first collect elements within one register with a byte-shuffle instruction (VPSHUFB). The interchange of data between different registers is accomplished by swizzle instructions from the AVX2 instruction set. We refer readers to [Int15] [Fog17] for more information on AVX instructions.

pointmul

In pointmul, we need a generic field multiplication instead of the subfield multiplications of FFT_LCH for data in the tower representation. Assuming the data are in F̃_{2^128} and in the byte-sliced layout, the multiplication in F̃_{2^128} can easily be performed as polynomial multiplication in F̃_{2^64}[x] in a SIMD manner. In other words, we reduce one parallelized multiplication in F̃_{2^128} to several parallelized multiplications in F̃_{2^64}. The process is applied recursively down to pointwise multiplication in F_256, where we apply the method in [CLP+17].

We now discuss the choice of working field based on the cost of field multiplication. When multiplying d-bit polynomials, we split the data into w-bit chunks and then apply the FFT over an l-bit base field with l = 2w. Here we want to decide on a suitable size of base field; on a typical x86 CPU, one may choose l to be 64, 128, or 256.
Field constructed by an irreducible polynomial: In this construction, the field multiplication is implemented with the PCLMULQDQ instruction. Using Karatsuba's method and linear folding for the reduction, one multiplication costs (1+2) = 3, (3+2) = 5, and (9+4) = 13 PCLMULQDQ in F_{2^64}, F_{2^128}, and F_{2^256} respectively. The number of butterflies in FFT_LCH is proportional to (d/w) log(d/w), with one field multiplication per butterfly. Weighing these per-multiplication costs against the number of butterflies, working over F_{2^128} gives the lowest total number of PCLMULQDQ for FFT_LCH.
Tower field: For an l-bit tower field, the cost of multiplying by subfield elements in FFT_LCH or by generic elements in pointmul is proportional to the field length l, as described in Prop. 2. Since the number of multiplications in FFT_LCH is proportional to (d/l) log(d/l), doubling l removes one layer of butterflies (the log(d/l) factor) but leaves the cost of each layer of butterflies (∝ l · d/l) unchanged. The cost of a generic multiplication in pointmul grows faster than linearly in l (about l^{log₂ 3}, using the recursive Karatsuba of Sec. 4.1.3). Comparing a field of length l with one of length 2l, the smaller field is therefore more efficient. However, the field length should also fit the underlying machine architecture. On the Intel Haswell architecture with the AVX2 instruction set, it is most efficient to work in 256-bit units, because memory access is best with 256-bit alignment. Therefore we choose l = 256 bits for Intel Haswell in our implementation.

The change of field representations is simply the product of a predefined matrix I with the data α ∈ V_k viewed as a bit vector. We compute the product I · α with the famous method of four Russians (M4R) [AH74]. With M4R of l'-bit granularity, one first prepares all possible partial products of I, i.e., I · b for all b ∈ span(v_0, ..., v_{l'−1}), for all b ∈ span(v_{l'}, ..., v_{2l'−1}), ..., and for all b ∈ span(v_{k−l'}, ..., v_{k−1}). To compute I · α, one splits α into l'-bit chunks, looks up the prepared tables for the various segments of α, and XORs the results together. The number of operations is proportional to the bit length of α. The choice of l' depends on the cache size. The matrix I for a 256-bit field is 256 × 256 bits, and the L1 data cache on the Intel Haswell architecture is 32 KiB. Therefore we choose M4R with 4-bit granularity, which results in 32 KiB = 16 × 64 × 256 bits of prepared tables.

Calculation of Multipliers in the Butterflies

Suppose we want to calculate s_i(α) for α ∈ V_k in a tower field. s_i(α) can be calculated recursively via s_i(α) = s_{i−1}(s_1(α)). Since s_1 is a linear map, we can prepare a table S1 := [s_1(v_0), ..., s_1(v_{k−1})] of the images of s_1 on all basis elements (v_0, ..., v_{k−1}). Since α = Σ b_i · v_i with b_i ∈ {0, 1}, we have s_1(α) = S1 · α, which is a matrix product. For further optimization, we can omit the least significant i bits of α while evaluating s_i(x), by Prop. 3. In our implementations, we actually precompute 31 tables S1, S2, ..., S31 for the evaluations of s_1(x), s_2(x), ..., s_31(x), avoiding the need for recursion. Again, we use M4R to accelerate the matrix-vector products; the storage for the tables is 4 KiB, which fits our L1 cache.

Benchmarks

We benchmark our implementation on the Intel Haswell architecture (the same as [HvdHL16]). Our hardware is an Intel Xeon E3-1245 v3 @3.40GHz with Turbo Boost disabled and 32 GB DDR3@1600MHz memory. The experiments were run on Ubuntu 16.04, Linux version 4.4.0-78-generic, with compiler gcc 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4). We show our main results in Tab. 2, comparing against [BGTZ08] and [HvdHL16]. Our implementations are denoted by their base fields in FFT_LCH. The version over F is the simple version with evaluation points in a Cantor basis (Sec. 3.1); the versions over F̃ are the tower-field implementations of Sec. 3.3. In general, our implementations are faster than the other binary polynomial multipliers, by roughly one third at the largest sizes in Tab. 2.

Table 2: Products of polynomials of degree < d in F_2[x] on Intel Xeon E3-1245 v3 @ 3.40GHz (in 10^{-6} sec.)
log₂ d                          15   16   17   18    19    20    21    22    23
This work, F̃ (tower field)     11   22   48  104   243   527  1105  2302  4812
This work, F (Cantor basis)     12   26   55  119   261   554  1181  2491  5282
gf2x [BGTZ08] ^a                12   26   59  123   285   586  1371  3653  7364
F_{2^60} [HvdHL16] ^b           14   29   64  148   279   668  1160  3142  7040

^a Version 1.2, available from http://gf2x.gforge.inria.fr/
^b SVN r10663, available from svn://scm.gforge.inria.fr/svn/mmx
The first notable result is that our simple version over F outperforms the previous implementations of [BGTZ08] and [HvdHL16] for large polynomials. This shows that, given the same base-field multiplication using PCLMULQDQ, an additive FFT is the better basis for multiplying binary polynomials. Software can be downloaded from https://github.com/fast-crypto-lab/bitpolymul .
We can also compare the effectiveness of the field multiplications using the second and third rows. The comparison between the F̃ tower versions and F shows that subfield multiplication with VPSHUFB outperforms generic multiplication with PCLMULQDQ. This is a counter-intuitive result, since PCLMULQDQ is dedicated to multiplying binary polynomials by design. The effect can be estimated qualitatively: we can do 32 multiplications in F_256 using 2 VPSHUFB. Multiplying by F̃_{2^32} elements (the largest multipliers in the butterflies) costs 9 F_256 multiplications. Multiplying an F̃_{2^128} element by an F̃_{2^32} element takes 4 such products, i.e., 36 F_256 multiplications, so each F̃_{2^128}-by-F̃_{2^32} product takes 72/32 VPSHUFB on average. Similarly, an F̃_{2^128}-by-F̃_{2^16} product takes 48/32 VPSHUFB on average. The average is fewer than 2 VPSHUFB, against several PCLMULQDQ per multiplication when working over F.
The profiles of the various components: The relative costs of basis conversion : butterfly process : pointmul : change of representation when multiplying long polynomials are 1 : 3.06 : 0.08 : 0, 1 : 2.11 : 0.27 : 0.54, and 0.73 : 1.74 : 0.47 : 0.57 for F and the two tower versions F̃, respectively. The results show that the tower representation adds the cost of changing representations and also increases the cost of pointmul. However, the efficient subfield multiplication reduces the cost of the butterfly process by more than these added costs. The larger tower version even reduces the cost of basis conversion, because its better-aligned memory accesses fit the machine architecture. More results on a newer Intel architecture, and full profiles, can be found in Appendix D.

Discussion of the most recent result: In [VDHLL17], van der Hoeven et al. present a new result on multiplying binary polynomials. They use a traditional multiplicative FFT and replace the Kronecker segmentation of Sec. 2.1 with a Frobenius DFT, saving a significant fraction of the running time of [HvdHL16]. We note that the same technique can be applied to the additive FFT for a comparable saving; the method for the additive FFT is described in Appendix C.

Our experiments are in fact optimal cases for additive-FFT-based multipliers, since the multiplied polynomials have a power-of-two number of terms. Because the cost of the additive FFT as a function of the number of terms is strongly stairwise, one can truncate the FFT to make the curve smoother, as in [BGTZ08]. One truncated version of the additive FFT, for n = 3 · 2^k terms, was shown in [CLP+17]. In this section, we discuss two possible variants of the implementation.
Tower field implementations with evaluation points in Cantor basis
If we choose evaluation points in a Cantor basis, the evaluation of s_i(α) seems faster than with the M4R technique of the previous section. However, we then still need to change the representation of s_i(α) from the Cantor basis to the tower field, and this operation costs the same as calculating s_i(α) in the tower field with M4R.
Performing subfield multiplication in Cantor basis: For subfield multiplications in the tower field, the efficiency comes from the powerful VPSHUFB instruction. Since we do not have Prop. 2 in a Cantor basis, we cannot use VPSHUFB there the way we do in tower fields. Another option for implementing field multiplication in a Cantor basis is to use bit-sliced data and logical instructions [BC14]. However, this requires at least 64 ymm registers of data in flight when operating over a 64-bit base field, and keeping so much data in play is inefficient.
We have presented efficient multipliers based on a recently developed additive FFT, which has multiplicative complexity similar to, but lower than, the ternary variant of Schönhage's algorithm used in [BGTZ08]. In [BGTZ08], Brent et al. also implemented Cantor's algorithm [Can89] besides Schönhage's. They concluded that "Schönhage's algorithm is consistently faster by a factor of about 2 (than Cantor)"; in their implementation, the multiplicative FFT outperformed the additive one. Our experiments show that the recently developed additive FFT does help in practice for multiplying binary polynomials of large degrees. We derive a further advantage by exploiting the lower cost of multiplying by subfield elements in tower fields.
Future Work: Our implementation is written in C and may be improved with assembly for better register allocation. Further, AVX-512, featuring 512-bit SIMD instructions, will soon be widely available. We probably cannot speed the additive FFT multiplier up by a full factor of two, but AVX-512 can surely be expected to offer a substantial advance.
References [AFK +
07] Kazumaro Aoki, Jens Franke, Thorsten Kleinjung, Arjen KLenstra, and Dag Arne Osvik. A kilobit special number fieldsieve factorization. In
International Conference on the Theory andApplication of Cryptology and Information Security , pages 1–12.Springer, 2007.[AH74] Alfred V. Aho and John E. Hopcroft.
The Design and Analysis ofComputer Algorithms . Addison-Wesley Longman Publishing Co.,Inc., Boston, MA, USA, 1st edition, 1974.[BC14] Daniel J. Bernstein and Tung Chou. Faster binary-field multipli-cation and faster binary-field macs. In Antoine Joux and Amr M.Youssef, editors,
Selected Areas in Cryptography - SAC 2014 - 21stInternational Conference, Montreal, QC, Canada, August 14-15,2014, Revised Selected Papers , volume 8781 of
Lecture Notes in Computer Science, pages 92–111. Springer, 2014.

[BGTZ08] Richard P. Brent, Pierrick Gaudry, Emmanuel Thomé, and Paul Zimmermann. Faster multiplication in GF(2)[x].
Lecture Notes inComputer Science , 5011:153–166, 2008.[Can89] David G. Cantor. On arithmetical algorithms over finite fields.
J.Comb. Theory Ser. A , 50(2):285–300, March 1989.[CCNY12] Chen-Mou Cheng, Tung Chou, Ruben Niederhagen, and Bo-YinYang. Solving quadratic equations with xl on parallel architectures.In Emmanuel Prouff and Patrick Schaumont, editors,
CHES , vol-ume 7428 of
Lecture Notes in Computer Science , pages 356–373.Springer, 2012.[CLP +
17] Ming-Shing Chen, Wen-Ding Li, Bo-Yuan Peng, Bo-Yin Yang,and Chen-Mou Cheng. Implementing 128-bit secure mpkc signa-tures. Cryptology ePrint Archive, Report 2017/636, 2017. http://eprint.iacr.org/2017/636 .[CLRS09] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, andClifford Stein.
Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.

[CT65] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.

[CYC13] Ming-Shing Chen, Bo-Yin Yang, and Chen-Mou Cheng. RAIDq: A software-friendly, multiple-parity RAID. In
Presented as part of the5th USENIX Workshop on Hot Topics in Storage and File Systems ,Berkeley, CA, 2013. USENIX.[DGS06] Jintai Ding, Jason Gower, and Dieter Schmidt.
Multivariate Public-Key Cryptosystems . Advances in Information Security. Springer,2006. ISBN 0-387-32229-9.[Fog17] Agner Fog.
Instruction Tables . Copenhagen University, College ofEngineering, May 2017. Lists of Instruction Latencies, Through-puts and micro-operation breakdowns for Intel, AMD, and VIACPUs, .[GK14] Shay Gueron and Michael E. Kounavis. Intel(r) carry-less multiplication instruction and its usage for com-puting the gcm mode(rev.2.02), April 2014. https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf .[GM10] Shuhong Gao and Todd D. Mateer. Additive fast fourier transformsover finite fields.
IEEE Trans. Information Theory, 56(12):6265–6272, 2010.

[HvdHL16] David Harvey, Joris van der Hoeven, and Grégoire Lecerf. Fast polynomial multiplication over F_{2^60}. In Sergei A. Abramov, Eugene V. Zima, and Xiao-Shan Gao, editors, Proceedings of the International Symposium on Symbolic and Algebraic Computation, ISSAC 2016, Waterloo, ON, Canada, July 19-22, 2016, pages 255–262. ACM, 2016.

[Int15] Intel. Intel architecture instruction set extensions programming reference, August 2015. https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf .

[LANH16] Sian-Jheng Lin, Tareq Y. Al-Naffouri, and Yunghsiang S. Han. FFT algorithm for binary extension finite fields and its application to Reed–Solomon codes. IEEE Trans. Inf. Theory, 62(10):5343–5358, October 2016.

[LCH14] Sian-Jheng Lin, Wei-Ho Chung, and Yunghsiang S. Han. Novel polynomial basis and its application to Reed–Solomon erasure codes. In 55th Annual Symposium on Foundations of Computer Science (FOCS 2014), pages 316–325. IEEE Computer Society, 2014.

[Sch77] Arnold Schönhage. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2.
Acta Informatica , 7(4):395–398, 1977.[Tho02] Emmanuel Thomé. Subquadratic computation of vector generatingpolynomials and improvement of the block wiedemann algorithm.
Journal of Symbolic Computation, 33(5):757–775, 2002.

[VDHLL17] Joris van der Hoeven, Robin Larrieu, and Grégoire Lecerf. Implementing fast carryless multiplication. Working paper or preprint, August 2017.

[vzGG96] Joachim von zur Gathen and Jürgen Gerhard. Arithmetic and factorization of polynomials over F_2 (extended abstract). In
Proceedings of the 1996 International Symposium on Symbolic and Algebraic Computation, ISSAC '96, pages 1–9, New York, NY, USA, 1996. ACM.

[vzGG02] Joachim von zur Gathen and Jürgen Gerhard. Polynomial factorization over F_2.
Mathematics of Computation , 71(240):1677–1698,2002.[War12] Henry S. Warren.
Hacker’s Delight . Addison-Wesley Professional,2nd edition, 2012.
A  Claims about s_i(x) in Tower Fields and Proofs

Proposition 1. s_k(v_k) := ∏_{b ∈ V_k} (v_k − b) = 1.

We first discuss the special case that k = 2^l is a power of two, so that V_k is the field F_q with q = 2^k. By Galois theory, we have s_k(x) = ∏_{b ∈ F_q} (x − b) = x^q + x.

Proposition 4. If q is a power of two, and we choose any a ∈ F_q such that F_{q^2} := F_q[x_l]/(x_l^2 + x_l + a) is a valid field extension, then ∏_{b ∈ F_q} (x_l − b) = x_l^q + x_l = 1.

Proof.
  x_l^q + x_l = (x_l^2 + x_l) + (x_l^2 + x_l)^2 + (x_l^2 + x_l)^4 + ··· + (x_l^2 + x_l)^{q/2}
              = a + a^2 + a^4 + ··· + a^{q/2}
              = Trace of a (in F_q over F_2) ∈ {0, 1}.
But zero here would be contradictory, because all q solutions of x^q = x already lie in F_q, while x_l is in F_{q^2} \ F_q; hence we must have x_l^q + x_l = 1. ∎

Proof of Prop. 1. If k = 2^j is a power of two, then V_k = F_{2^k} and the result holds by the proposition above. So we assume the proposition holds for all k < h, and write h = q + ℓ < 2q, where q = 2^j. Note that for a ∈ V_q = F_{2^q} we have (v_q a)^{2^q} + v_q a = (v_q + 1) a + v_q a = a, and

  ∏_{b ∈ V_h} (v_h − b)
    = ∏_{c ∈ span(v_q, ..., v_{h−1})} ∏_{c′ ∈ V_q} (v_h + c + c′)                 (divide V_h)
    = ∏_{c″ ∈ V_{h−q}} ∏_{c′ ∈ V_q} (v_q (v_{h−q} + c″) + c′)                     (replace c by v_q c″)
    = ∏_{c″ ∈ V_{h−q}} [ (v_q (v_{h−q} + c″))^{2^q} + v_q (v_{h−q} + c″) ]        (Galois theory)
    = ∏_{c″ ∈ V_{h−q}} (v_{h−q} + c″) = 1.                                        (by induction) ∎

Proposition 3. For any v_i w.r.t. the tower representation, v_i^2 + v_i ∈ V_i.

In the proof of Prop. 3, we need the following consequences of Definitions 3.1 and 2.3.

Corollary 2. If 2^k > i, then x_{k+1} v_i = v_{2^k} v_i = v_{2^k + i}.

Corollary 3. If 2^k > i and v ∈ V_i, then x_{k+1} v = v_{2^k} v ∈ V_{2^k + i}.

Proof of Prop. 3. If i = 2^k, then v_i = x_{k+1}, and v_i^2 + v_i = x_1 x_2 ··· x_k = v_{2^k − 1} by the definition of the x_i, so the claim holds. Using mathematical induction, we let i = 2^{k−1} + j with 0 < j < 2^{k−1}. Thus v_i = x_k v_j, and v_i^2 + v_i = (x_k^2 + x_k) v_j^2 + x_k (v_j^2 + v_j). The first term is the product of two elements of F_{2^{2^{k−1}}} and therefore itself lies in F_{2^{2^{k−1}}} = V_{2^{k−1}} ⊆ V_i. The second is the product of v_{2^{k−1}} and an element of V_j (by the induction hypothesis), which by Corollary 3 also lies in V_{2^{k−1}+j} = V_i. ∎
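The trace computation at the heart of the proof of Prop. 4 can be checked numerically for q = 16, assuming the representation F_16 = F_2[x]/(x^4 + x + 1) (an illustrative choice): the sum a + a^2 + a^4 + a^8 always lands in {0, 1}, and both values occur.

```python
def gf16_mul(a, b):
    """Multiply in F_16 = F_2[x]/(x^4 + x + 1) (an assumed representation)."""
    p = 0
    for i in range(4):
        if (b >> i) & 1:
            p ^= a << i
    for i in range(7, 3, -1):
        if (p >> i) & 1:
            p ^= 0b10011 << (i - 4)
    return p

def trace(a):
    """Tr(a) = a + a^2 + a^4 + a^8, the sum appearing in the proof of Prop. 4."""
    t, s = 0, a
    for _ in range(4):
        t ^= s
        s = gf16_mul(s, s)   # repeated squaring: a, a^2, a^4, a^8
    return t
```

Since the trace is F_2-linear and surjective onto F_2, exactly half of the 16 elements have trace 1, matching the dichotomy used in the proof.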
B  Basis Conversion: An Example

Figure 4 shows an example of converting a degree-15 polynomial to the novel polynomial basis. We have to divide by 3 different s_i(x)'s, namely s_1(x), s_2(x), and s_3(x), the last of which has 4 terms. One can see that there are actually 4 layers of division in Figure 4, and the number of XORs is the same in every layer.

Figure 4: From f(x) = f_0 + ··· + f_15 x^15 to g(X) = g_0 + ··· + g_15 X^15 by Algorithm 3.

What the first 2 layers perform is a variable substitution into terms of y = s_1(x) = x^2 + x (see Algorithm 2). The first layer is a division by s_1(x)^4 = x^8 + x^4, and the second layer divides the resulting high-degree and low-degree polynomials by s_1(x)^2 = x^4 + x^2. The third layer is a division by s_2(x) = y^2 + y, adding coefficients of terms that differ by a factor of s_2(x) and lie 4 positions apart (see the 3rd column of Table 1). The last layer divides 4 short polynomials by s_1(x) = x^2 + x (the last loop of Algorithm 3). Note that the conversion only ever divides by two-term polynomials.

C  Frobenius Bijection between F_2[x]_{<128·n} and F_{2^128}^n

In this section, we perform an FFT directly on F_2[x] instead of combining many coefficients into a larger finite field, i.e., we abandon the Kronecker segmentation. We refer the reader to [VDHLL17] for preliminaries and focus on carrying the idea over to the additive FFT.

C.1  The Partition of Evaluation Points
Let φ be the Frobenius operation over F_2, which is exactly squaring in an extension field of characteristic 2. In the Cantor basis we thus have φ(β_0) = β_0, and φ(β_i) = β_i^2 = β_i + β_{i−1} for i > 0. Moreover, the order of φ on β_i for i > 0 is 2 · 2^{⌊log₂ i⌋}.

Definition C.1. Given a(x) ∈ F_2[x]_{<128·n} with n = 2^m and m < 64, define the linear map E_{β_{m+64}+V_m} : F_2[x]_{<128·n} → F_{2^128}^n as the evaluation of a(x) at the points β_{m+64} + V_m, i.e., E_{β_{m+64}+V_m} : a(x) ↦ { a(β_{m+64} + i) : i ∈ V_m }.

Note that a(x) is evaluated at only n = 2^m points, fewer than its number of coefficients; however, the evaluation points lie in F_{2^128}. One can check that the order of φ on β_{m+64} is 128.

Proposition 5. E_{β_{m+64}+V_m} is a bijection between F_2[x]_{<128·n} and F_{2^128}^n.

C.2  Truncated Additive FFT

To evaluate a polynomial a(x) ∈ F_2[x]_{<128·n} at the 2^m points, we treat each coefficient of a(x) as an element of F_{2^128}. Since there are 128 · n = 128 · 2^m coefficients and only 2^m evaluation points, E_{β_{m+64}+V_m} is exactly an additive FFT at the points β_{m+64} + V_{m+7}, truncated to 1/128 of its points. The basis conversion proceeds straightforwardly with Alg. 3. In the notation of Alg. 1, the evaluation of a(x) at β_{m+64} + V_m is accomplished with the scalar displacement α = β_{m+64}, keeping only the first 1/128 of the evaluation points V_{m+7}.

C.3  Encoding: the First Seven Layers of the Truncated FFT
We can also translate the truncated FFT into an "encoding" of the input coefficients, corresponding to the first 7 layers of butterflies, followed by a normal FFT_LCH over F_{2^128}[x]. The encoding is the result of a linear map applied to the inputs (a_{j·n+i})_{j=0}^{127}, which is a 128 × n transposition of the input data.

C.4  Results

Table 3 shows the results of our implementation of multiplication with the additive-FFT Frobenius partition and the comparison with the implementation of [VDHLL17].

Table 3: Multiplications with Frobenius partitions. Products of polynomials of degree < d in F_2[x] on Intel Xeon E3-1245 v3 @ 3.40GHz (in 10^{-6} sec.), for log₂ d = 15, ..., 23, comparing our additive-FFT version^a against F_{2^60} [VDHLL17]^c.
^a Version 2ae2461, https://github.com/fast-crypto-lab/bitpolymul
^c SVN r10681, available from svn://scm.gforge.inria.fr/svn/mmx

D  Profiles: Raw Data

Table 4 shows the raw running times of the various components of our implementations. Table 5 shows the performance of our implementations on the Intel Skylake architecture.

Table 4: Running time of various components, in 10^{-6} sec.

log₂ d                    15     16     17     18     19     20      21
F̃: chRepr                971   1955   3482   7226  13991  27550   53982
F̃: BasisCvt              823   1846   4055   8398  17598  35611   80913
F̃: iFFT_LCH              923   1960   4402  11782  34244  75693  161648
F̃ (2nd version): chRepr  924   1866   3890   7172  13237  25965   52275
F: pointmul               317    635   1297   2863   7250  10278   21188

Table 5: Products of polynomials of degree < d in F_2[x] on Intel Xeon E3-1275 v5 @ 3.60GHz (in 10^{-6} sec.)

log₂ d                 15   16   17   18    19    20    21    22    23
This work, F̃
This work, F
gf2x [BGTZ08] ^a       11   23   51  111   250   507  1182  2614  6195
F_{2^60} [HvdHL16] ^b  10   22   51  116   217   533   885  2286  5301

^a Version 1.2, available from http://gf2x.gforge.inria.fr/
^b SVN r10663, available from svn://scm.gforge.inria.fr/svn/mmx