Faster Multiplication for Long Binary Polynomials
Ming-Shing Chen, Chen-Mou Cheng, Po-Chun Kuo, Wen-Ding Li, Bo-Yin Yang
Department of Electrical Engineering, National Taiwan University, Taiwan, {mschen,doug,kbj}@crypto.tw
Institute of Information Science, Academia Sinica, Taiwan, {thekev,by}@crypto.tw
Research Center of Information Technology and Innovation, Academia Sinica, Taiwan

January 8, 2018
Abstract
We set new speed records for multiplying long polynomials over finite fields of characteristic two. Our multiplication algorithm is based on an additive FFT (Fast Fourier Transform) by Lin, Chung, and Han from 2014, whereas the previously best results were based on multiplicative FFTs. Both approaches have similar complexity in arithmetic operations over the underlying finite field; however, our implementation shows that the additive FFT has less overhead. For further optimization, we employ a tower field construction, because the multipliers in the additive FFT naturally fall into small subfields, which leads to speed-ups using table-lookup instructions in modern CPUs. Benchmarks show that our method saves 10%–40% of the computing time when multiplying long polynomials, compared to previous multiplicative-FFT implementations.

Keywords: Finite Field, Multiplication, Additive FFT, Single Instruction Multiple Data (SIMD).
Multiplication of long binary polynomials in the ring F_2[x], where F_2 is the finite field with two elements, is a fundamental problem in computer science. It is needed to factor polynomials in number theory [vzGG96] [vzGG02] and is central to the modern Block Wiedemann algorithm [Tho02]. Block Wiedemann is in turn integral to the Number Field Sieve [AFK+]. To the best of our knowledge, all currently fast binary-polynomial multiplication algorithms are based on a Fast Fourier Transform (FFT), an algorithm for efficiently evaluating polynomials at subgroups of the underlying finite field. Good reviews of algorithms and implementations for multiplying binary polynomials appear in [BGTZ08] and [HvdHL16].

In general, these previous methods were based on "multiplicative" FFTs, which evaluate polynomials at multiplicative subgroups formed by roots of unity. Since the sizes of the multiplicative groups existing in F_{2^k} are restricted, multiplicative FFTs over binary fields are somewhat more difficult than over the reals or the complex numbers. These multiplicative FFTs, evaluating polynomials at n points, reach their best efficiency only when n is the size of a multiplicative subgroup.

In [BGTZ08], Brent et al. implemented mainly Schönhage's algorithm [Sch77] for long polynomials, with complexity O(n log n log log n) field operations. Schönhage's algorithm is based on ternary FFTs over binary finite fields; in their implementation, the optimal number of evaluation points is a power of three. Harvey, van der Hoeven, and Lecerf [HvdHL16] presented multiplication using an FFT with a mixed-radix approach. They applied several discrete Fourier transforms (DFTs) for different input sizes, e.g., Cooley–Tukey [CT65] for the largest-scale DFT. In particular, they need a finite field whose size is close to a machine word and which simultaneously contains abundant multiplicative subgroups.
They proposed a field that elegantly satisfies these conditions.

Following Cantor [Can89], alternative methods were developed to evaluate polynomials at points that form an additive subgroup of a field of characteristic 2. These methods are called "additive FFTs". Cantor presented a basis, termed the "Cantor basis" in the literature, for constructing a finite field, as well as an FFT over that field. His FFT evaluates a polynomial at n = 2^m points with O(n log n) multiplications and O(n^{log_2 3}) additions (XORs). In 2010, Gao and Mateer [GM10] presented an additive FFT (henceforth "GM FFT") over F_{2^k}, where the evaluation points form an additive subgroup of the underlying field. Additive subgroups are easier to form than multiplicative subgroups in fields of characteristic 2. However, the complexity of the GM FFT is O(n log^2 n) XOR operations for evaluating a polynomial at n = 2^m points in general. It can be optimized to O(n log n log log n) only when m is a power of 2, in which case the polynomials are represented in a special polynomial basis introduced by Cantor [Can89]. In 2014, Bernstein and Chou [BC14] presented an efficient implementation of the GM FFT.

In 2014, Lin, Chung, and Han [LCH14] proposed a more general variant ("LCH FFT"), which uses a polynomial basis different from the standard one. This polynomial basis results in a more regular structure in the butterfly stage. When the underlying finite field is represented in a Cantor basis, the butterfly stage is the same as in the optimized GM FFT and Bernstein–Chou. In subsequent work [LANH16], they presented a method for converting the polynomial from the standard basis; the conversion costs O(n log n log log n) XOR operations for n any power of two, matching the complexity of multiplicative FFTs. More details of the LCH FFT are reviewed in Sec. 2.3.
In this paper, we present a method of multiplication for long binary polynomials, based on the recently developed LCH FFT, that is faster than previous multiplicative-FFT-based algorithms. From the faster results of our additive-FFT-based implementation, we confirm that the recent development of additive FFTs helps the multiplication of binary polynomials. Our implementation is faster than previous multiplicative-FFT codes for two reasons. First, the FFT in our algorithm uses simple binary butterfly stages, leading to (1/2) n log_2 n multiplications for n evaluation points, compared to (for example) the ternary FFT used in Schönhage's algorithm, which leads to (4/3) n log_3 n multiplications. This factor confers a 10%–20% advantage over multiplicative-FFT-based implementations and will be discussed in Sec. 3.1. Second, we exploit the fact that the multipliers in the additive FFT lie in subfields, which reduces the average time taken per multiplication. We will discuss the method in Sec. 3.2 and 3.3. The overall improvement is 10%–40% over previous implementations.

In this section, we discuss the general method of multiplication for long binary polynomials. Suppose we are multiplying two polynomials a(x) = a_0 + a_1 x + ... + a_{d-1} x^{d-1} and b(x) = b_0 + ... + b_{d-1} x^{d-1} ∈ F_2[x]. The polynomials are represented as bit sequences of length d. The standard Kronecker segmentation for multiplying binary polynomials proceeds as follows:

1. Partition the polynomials into w-bit blocks; there are n = ⌈d/w⌉ blocks:
   a(x) = a_0 + a_1 x + ... + a_{d-1} x^{d-1}
        → (a_0 + ... + a_{w-1} x^{w-1}) + (a_w + ... + a_{2w-1} x^{w-1}) x^w + ... + (...) x^{w(n-1)}.

2. Define F_{2^{2w}} := F_2[z]/(g(z)), with g(z) an irreducible polynomial of degree 2w. Let ψ map a(x) to a'(y) ∈ F_{2^{2w}}[y] (and similarly ψ(b(x)) = b'(y)):
   a'(y) := a'_0 + a'_1 y + ... + a'_{n-1} y^{n-1} ∈ F_{2^{2w}}[y],
   b'(y) := b'_0 + b'_1 y + ... + b'_{n-1} y^{n-1} ∈ F_{2^{2w}}[y],
   such that a'_0 = (a_0 + a_1 z + ... + a_{w-1} z^{w-1}), a'_1 = (a_w + ... + a_{2w-1} z^{w-1}), ..., a'_{n-1} = (a_{(n-1)w} + a_{(n-1)w+1} z + ... + a_{nw-1} z^{w-1}), and the same for the b'_j.

3. Calculate c'(y) = a'(y) · b'(y) = c'_0 + c'_1 y + ... ∈ F_{2^{2w}}[y] (using FFTs).

4. Map z ↦ x and y ↦ x^w. Then collect terms and coefficients to find the result of the multiplication of binary polynomials; since each c'_k occupies up to 2w bits placed at stride w, we need to add together at most 2 coefficients at any power.
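A minimal sketch of steps 1 and 4 in Python, where a GF(2)[x] polynomial is packed into an integer (bit e is the coefficient of x^e). All names are illustrative, and an O(n^2) blockwise convolution stands in for the FFT-based step 3; each block product has degree < 2w, so at most two products overlap at any bit position, as noted in step 4:

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials packed into ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def split_blocks(f: int, w: int) -> list:
    """Step 1: cut the bit string of f into w-bit blocks (lowest block first)."""
    blocks = []
    while f:
        blocks.append(f & ((1 << w) - 1))
        f >>= w
    return blocks or [0]

def combine(blocks: list, w: int) -> int:
    """Step 4: map y -> x^w and XOR the overlapping 2w-bit coefficients."""
    r = 0
    for i, c in enumerate(blocks):
        r ^= c << (i * w)
    return r

def kronecker_mul(a: int, b: int, w: int = 64) -> int:
    """Multiply a, b in GF(2)[x] blockwise (schoolbook stands in for the FFT)."""
    A, B = split_blocks(a, w), split_blocks(b, w)
    C = [0] * (len(A) + len(B) - 1)
    for i, ai in enumerate(A):
        for j, bj in enumerate(B):
            C[i + j] ^= clmul(ai, bj)   # exact product: degree < 2w, no reduction
    return combine(C, w)
```

The result agrees with a direct carry-less multiplication of the unsplit operands, for any block width w.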
FFT-based polynomial multiplication. It is well known that polynomial multiplication can be done using FFTs [CLRS09]. To multiply two polynomials a'(y) and b'(y) ∈ F_{2^{2w}}[y] whose product has degree at most n − 1 with FFT algorithms, the standard steps are as follows:
1. (fft) Evaluate a'(y) and b'(y) at n points by an FFT algorithm.
2. (pointmul) Multiply the evaluated values pairwise.
3. (ifft) Interpolate the pairwise products back into a polynomial of degree ≤ n − 1 by the inverse FFT.
The complexity of the polynomial multiplication is the same as that of the FFT in use.
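The three steps can be illustrated with a toy quadratic-time DFT over the prime field F_257, which contains an 8th root of unity (64); real implementations replace this with fast FFTs over binary fields. The names and the choice of field are purely illustrative:

```python
P, N, W = 257, 8, 64          # 64 is a primitive 8th root of unity mod 257

def dft(coeffs, root):
    """Evaluate the polynomial at root**0, ..., root**(N-1): the 'fft' step."""
    return [sum(c * pow(root, i * k, P) for i, c in enumerate(coeffs)) % P
            for k in range(N)]

def fft_mul(a, b):
    """Multiply a, b (product degree < N) via evaluate / pointmul / interpolate."""
    a = a + [0] * (N - len(a))
    b = b + [0] * (N - len(b))
    ea, eb = dft(a, W), dft(b, W)
    ec = [x * y % P for x, y in zip(ea, eb)]      # 'pointmul'
    c = dft(ec, pow(W, -1, P))                    # 'ifft': DFT at the inverse root
    inv_n = pow(N, -1, P)
    return [v * inv_n % P for v in c]
```

The output equals the schoolbook product of the coefficient vectors, reduced mod 257.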
XOR and AND are the addition and multiplication of F_2 = {0, 1}, the field of two elements. In this paper, every field is an algebraic extension of F_2. We will apply interchangeable representations for each finite field (or Galois field, GF) in use; the illustrative example is the field of 2^128 elements, denoted F_{2^128}, as in [Can89] and [BGTZ08]. We will switch representations during the computation to achieve better efficiency for field multiplication. All fields of the same size are isomorphic, and the cost of changing representation is one linear transformation.

We choose the basic working field to be the same as in AES-GCM:
F_{2^128} := F_2[x]/(x^128 + x^7 + x^2 + x + 1).
An element of F_{2^128} is represented as a binary polynomial of degree < 128. F_{2^128} can also be viewed as a linear space of dimension 128 with the basis (x^i)_{i=0}^{127}.

The cost of multiplication in F_{2^128}. Many platforms provide hardware instructions for multiplying small binary polynomials, which fit the multiplication of F_{2^128}. PCLMULQDQ is a widely used x86 instruction for multiplying 64-bit binary polynomials. Since one PCLMULQDQ performs a 64 × 64 → 128-bit product, a multiplication in F_{2^128} costs roughly 5 PCLMULQDQ: 3 for multiplying 128-bit polynomials with Karatsuba's method and 2 for reducing the 256-bit result back to 128 bits with linear folding. More details about multiplication in F_{2^128} can be found in [GK14].

2.2.2 Cantor Basis for Finite Field as Linear Space

Gao and Mateer presented an explicit construction of a Cantor basis for finite fields in [GM10]. The Cantor basis (β_i) satisfies β_0 = 1 and β_i^2 + β_i = β_{i−1} for i > 0.
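The 5-multiplication strategy for F_{2^128} described in Sec. 2.2.1 can be emulated in portable code; clmul64 below is a software stand-in for one PCLMULQDQ, and the function names are ours:

```python
M64 = (1 << 64) - 1

def clmul64(a: int, b: int) -> int:
    """Software stand-in for one PCLMULQDQ: 64x64 -> 128-bit carry-less product."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf128_mul(a: int, b: int) -> int:
    """Multiply in F_{2^128} = F_2[x]/(x^128 + x^7 + x^2 + x + 1) using
    3 carry-less 64-bit products (Karatsuba) plus reduction by folding."""
    a0, a1, b0, b1 = a & M64, a >> 64, b & M64, b >> 64
    lo = clmul64(a0, b0)                       # product 1
    hi = clmul64(a1, b1)                       # product 2
    mid = clmul64(a0 ^ a1, b0 ^ b1) ^ lo ^ hi  # product 3 (Karatsuba middle term)
    p = lo ^ (mid << 64) ^ (hi << 128)         # 255-bit product of a and b
    # Reduce: fold the high bits down twice using x^128 = x^7 + x^2 + x + 1.
    for _ in range(2):
        hi128, p = p >> 128, p & ((1 << 128) - 1)
        p ^= hi128 ^ (hi128 << 1) ^ (hi128 << 2) ^ (hi128 << 7)
    return p
```

For instance, x^127 · x reduces to x^7 + x^2 + x + 1 = 0x87, and the usual field identities (commutativity, associativity, multiplication by 1) hold.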
Definition 2.1. With respect to the basis (β_i), let φ_β(k) := Σ_{j=0}^{m−1} b_j β_j be the field element represented by k under (β_i), where k = Σ_{j=0}^{m−1} b_j 2^j is the binary expansion of k with b_j ∈ {0, 1}.

Definition 2.2.
Given a basis (β_i)_{i=0}^{m−1} of the base field, its sequence of subspaces is V_i := span{β_0, β_1, ..., β_{i−1}}, and its subspace vanishing polynomials (s_i) are
s_i(x) := ∏_{a ∈ V_i} (x − a).
Note that V_i is a field with linear basis (β_j)_{j=0}^{i−1} only when i is a power of 2. Since dim V_i = i, one can see that deg(s_i(x)) = 2^i. From [Can89] and [GM10], the vanishing polynomials s_i(x) w.r.t. the Cantor basis (β_i) have the following useful properties:
• (linearity) s_i(x) contains only monomials of the form x^{2^m}.
• (minimal two terms) s_i(x) = x^{2^i} + x iff i is a power of 2. In the cases i = 2^k, the V_{2^k} are fields and {β_{2^k}, β_{2^k+1}, ..., β_{2^{k+1}−1}} ⊂ F_{2^{2^{k+1}}} \ F_{2^{2^k}}.
• (recursivity) s_i(x) = s_{i−1}(x)^2 + s_{i−1}(x) = s_1(s_{i−1}(x)), and s_{i+j}(x) = s_i(s_j(x)). If k = 2^{i_1} + 2^{i_2} + ... + 2^{i_j} with i_1 < i_2 < ... < i_j, then we can write s_k(x) = s_{2^{i_1}}(s_{2^{i_2}}(···(s_{2^{i_j}}(x))···)). Therefore, every s_i is a composition of functions that each have only two terms (see Table 1).

Evaluating s_i(x) in the Cantor basis. Computing s_i(α) in the Cantor basis is very fast, since the representation of s_i(α) is exactly that of α shifted right by i bits: s_i(φ_β(j)) = φ_β(j ≫ i). For example, s_i(β_i) = β_0 = 1.

In this section, we introduce how to evaluate a degree-(n−1) polynomial at n points of a binary field with the LCH FFT. We assume that n is a power of 2; the polynomial is padded with 0 high-degree coefficients if its actual degree is less than n − 1. The LCH FFT requires the evaluated polynomial to be converted into a particular kind of basis, called the novelpoly basis.

novelpoly basis. The polynomial basis used in the LCH FFT was presented in [LCH14]. The novelpoly basis for polynomials must be distinguished from bases for the field: although the LCH FFT is independent of the underlying basis of the finite field, we assume that the field is in a Cantor basis for simplicity.
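The two-term and recursivity properties of the s_i listed above can be verified symbolically in a few lines; here a polynomial over F_2 is an integer whose bit e is the coefficient of x^e (a sketch; names are ours):

```python
def poly_sq(f: int) -> int:
    """Square in GF(2)[x]: the coefficient of x^e moves to x^(2e)."""
    r, e = 0, 0
    while f:
        if f & 1:
            r |= 1 << (2 * e)
        f >>= 1
        e += 1
    return r

def s(i: int) -> int:
    """Vanishing polynomial s_i w.r.t. a Cantor basis:
    s_0(x) = x and s_i(x) = s_{i-1}(x)^2 + s_{i-1}(x)."""
    f = 0b10                      # s_0(x) = x
    for _ in range(i):
        f = poly_sq(f) ^ f
    return f
```

For instance, s(2) = x^4 + x and s(8) = x^256 + x have exactly two terms (2 and 8 are powers of 2), while every exponent appearing in any s(i) is a power of two (linearity).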
Definition 2.3. Given the Cantor basis (β_i) of the base field and its vanishing polynomials (s_i), define the novelpoly basis w.r.t. (β_i) to be the polynomials (X_k):
X_k(x) := ∏ (s_i(x))^{b_i}, where k = Σ b_i 2^i with b_i ∈ {0, 1}.
I.e., X_k(x) is the product of all s_i(x) where the i-th bit of k is set. Since deg(s_i(x)) = 2^i, clearly deg(X_k(x)) = k.

To perform the LCH FFT, the evaluated polynomial f(x) has to be converted into the form
f(x) = g(X) = g_0 + g_1 X_1(x) + ... + g_{n−1} X_{n−1}(x).
The evaluation of a polynomial in the novelpoly basis can be done through a "butterfly" process, denoted FFT_LCH. The general idea for evaluating f(x) at all points of V_k is to divide V_k into the two sets V_{k−1} and V_k \ V_{k−1} = V_{k−1} + β_{k−1} := {x + β_{k−1} : x ∈ V_{k−1}}. Since s_{k−1}(x) is linear, the evaluations at V_{k−1} + β_{k−1} can be quickly calculated from the evaluations at V_{k−1} with the butterfly process. It is a divide-and-conquer process in which the polynomial f(x) = g(X) is expressed as two half-sized polynomials h_0(X) and h_1(X) with g(X) = h_0(X) + X_{2^{⌈log n⌉−1}}(x) · h_1(X).

FFT_LCH is detailed in Algorithm 1. FFT_LCH evaluates the converted polynomial g(X) at the points V_{log n} + α. Lines 4 and 5 perform the butterfly process (see Fig. 1 and Fig. 2). The inverse FFT_LCH simply performs the butterflies in reverse.
Algorithm 1: LCH FFT in the novelpoly basis.
FFT_LCH(f(x) = g(X), α):
  input : a polynomial g(X) = g_0 + g_1 X_1(x) + ... + g_{n−1} X_{n−1}(x); a scalar α.
  output: a list [f(0 + α), f(φ_u(1) + α), ..., f(φ_u(n − 1) + α)].
  1. if deg(f(x)) = 0 then return [g_0];
  2. Let k ← max i s.t. 2^i ≤ n − 1.
  3. Let g(X) = p_0(X) + X_{2^k} · p_1(X) = p_0(X) + s_k(x) · p_1(X).
  4. h_0(X) ← p_0(X) + s_k(α) · p_1(X).
  5. h_1(X) ← h_0(X) + s_k(β_k) · p_1(X).   // s_k(β_k) = 1 in the Cantor basis.
  6. return [FFT_LCH(h_0(X), α), FFT_LCH(h_1(X), β_k + α)]
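To make Algorithm 1 concrete, here is a toy FFT_LCH over F_16 with a Cantor basis found by brute force, checked against direct evaluation in the novelpoly basis. This is only an illustrative sketch (the paper's implementation works over much larger fields with SIMD code), and all names are ours:

```python
# GF(16) = F_2[t]/(t^4 + t + 1); elements are 4-bit integers.
def gmul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13                      # reduce by t^4 + t + 1
        b >>= 1
    return r

# Brute-force a Cantor basis: beta_0 = 1, beta_i^2 + beta_i = beta_{i-1}.
beta = [1]
for _ in range(3):
    beta.append(next(b for b in range(16) if gmul(b, b) ^ b == beta[-1]))

def phi(j: int) -> int:
    """Field element represented by the integer j in the Cantor basis."""
    r = 0
    for i, bi in enumerate(beta):
        if j >> i & 1:
            r ^= bi
    return r

def s_at(k: int, x: int) -> int:
    """Evaluate s_k at x via the recursion s_i(x) = s_{i-1}(x)^2 + s_{i-1}(x)."""
    for _ in range(k):
        x = gmul(x, x) ^ x
    return x

def fft_lch(g: list, alpha: int) -> list:
    """Algorithm 1: evaluate g_0 + g_1 X_1 + ... + g_{n-1} X_{n-1} at V_{log n} + alpha."""
    n = len(g)
    if n == 1:
        return g
    k = n.bit_length() - 2                 # split off X_{n/2} = X_{2^k}
    p0, p1 = g[:n // 2], g[n // 2:]
    ska = s_at(k, alpha)
    h0 = [u ^ gmul(ska, v) for u, v in zip(p0, p1)]
    h1 = [u ^ v for u, v in zip(h0, p1)]   # s_k(beta_k) = 1 in a Cantor basis
    return fft_lch(h0, alpha) + fft_lch(h1, beta[k] ^ alpha)
```

Output index j holds f(φ(j) + α); one can also observe the shift property s_i(φ_β(j)) = φ_β(j ≫ i) directly on this basis.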
LCH and s k ( β k ) =1 in Cantor basis, avoiding one multiplication. This constant reduction of mul-tiplications won’t affect the asymptotic complexity; however, one can not bearthe extra multiplication in practice. Although FFT
LCH is applicable to any basisof field, the choice for a practitioner might be Cantor basis only. novelpoly
Basis w.r.t. Cantor Basis
The evaluated polynomial has to be in the novelpoly basis before performing FFT_LCH. We review the conversion algorithms in this section. The fast conversion relies on the simple form of the (s_i) w.r.t. the Cantor basis (see Table 1).

Table 1: s_i(x) w.r.t. the Cantor basis, written with y := s_2(x), z := s_4(x), w := s_8(x).
  s_0(x) = x
  s_1(x) = x^2 + x
  s_2(x) = x^4 + x =: y
  s_3(x) = x^8 + x^4 + x^2 + x = s_1(y) = y^2 + y
  s_4(x) = x^16 + x = s_2(y) = y^4 + y =: z
  s_5(x) = x^32 + x^16 + x^2 + x = s_1(z) = z^2 + z
  s_6(x) = x^64 + x^16 + x^4 + x = s_2(z) = z^4 + z
  s_7(x) = x^128 + x^64 + x^32 + x^16 + x^8 + x^4 + x^2 + x = s_3(z) = z^8 + z^4 + z^2 + z
  s_8(x) = x^256 + x = s_4(z) = z^16 + z =: w

[BC14] converts f(x) to g(X) by finding the largest i such that 2^i < deg f, and then dividing f(x) by s_i(x) to form f(x) = f_0(x) + s_i(x) f_1(x). Recursively dividing f_0(x) and f_1(x) by lower s_i(x) eventually expresses f(x) as a sum of non-repetitive products of the s_i(x), which is the desired form for g(X). The division comprises only XOR operations, since the nonzero coefficients of the s_i(x) are always 1 w.r.t. the Cantor basis; the complexity of a division by s_i(x) therefore depends on the number of terms of s_i(x). This is functionally equivalent to the Cantor transform.

[LANH16] does better by
1. finding the largest i such that deg(s_{2^i}) ≤ deg f, and then doing a variable substitution (Alg. 2) to express f as a power series in s_{2^i};
2. recursively expressing the series in s_{2^i} as a series in the X_j(s_{2^i}), where j < 2^i;
3. recursively expressing each coefficient of X_j(s_{2^i}) (which is a polynomial in x of degree < 2^{2^i}) as a series in the X_k, where k < 2^{2^i}.
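The division-based conversion described above for [BC14] can be sketched for coefficients in F_2, where dividing by the sparse s_i costs only shifts and XORs. This is a sketch: the function names and the expansion-based check are ours:

```python
# Polynomials over GF(2): bit e of an int is the coefficient of x^e.
def poly_sq(f):                    # squaring spreads bits: x^e -> x^(2e)
    r, e = 0, 0
    while f:
        if f & 1:
            r |= 1 << (2 * e)
        f >>= 1
        e += 1
    return r

def s_poly(i):                     # s_i w.r.t. a Cantor basis (0/1 coefficients)
    f = 0b10
    for _ in range(i):
        f = poly_sq(f) ^ f
    return f

def divmod2(f, g):                 # GF(2)[x] division: only shifts and XORs
    q, dg = 0, g.bit_length() - 1
    while f and f.bit_length() - 1 >= dg:
        sh = f.bit_length() - 1 - dg
        q |= 1 << sh
        f ^= g << sh
    return q, f

def to_novelpoly(f, m):
    """Coefficients (in GF(2)) of f, deg f < 2^m, in the basis X_0..X_{2^m - 1}."""
    if m == 0:
        return [f & 1]
    f1, f0 = divmod2(f, s_poly(m - 1))      # f = f0 + s_{m-1} * f1
    return to_novelpoly(f0, m - 1) + to_novelpoly(f1, m - 1)

def from_novelpoly(g):
    """Check: expand sum of g_k * X_k(x) back into a GF(2)[x] polynomial."""
    def clmul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            b >>= 1
        return r
    total = 0
    for k, gk in enumerate(g):
        if gk:
            xk, i = 1, 0
            while k >> i:
                if k >> i & 1:
                    xk = clmul(xk, s_poly(i))
                i += 1
            total ^= xk
    return total
```

Converting and then expanding returns the original polynomial, and X_1(x) = x as expected.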
Algorithm 2: Variable substitution.
VarSubs(f(x), y):
  input : two polynomials: f(x) = f_0 + f_1 x + ... + f_{n−1} x^{n−1} and y = x^{2^i} + x.
  output: h(y) = h_0(x) + h_1(x) y + ... + h_{m−1}(x) y^{m−1}.
  1. if deg(f(x)) < 2^i then return h(y) ← f(x);
  2. Let k ← max 2^j (j ∈ Z) s.t. deg((x^{2^i} + x)^{2^j}) ≤ deg(f(x)).
  3. Let y^k ← x^{k·2^i} + x^k.
  4. Compute f_0(x) + y^k · f_1(x) = f(x) by dividing f(x) by x^{k·2^i} + x^k.   // Done by repeated subtraction (XOR).
  5. return VarSubs(f_0(x), y) + y^k · VarSubs(f_1(x), y).

The details of the basis conversion are given in Algorithm 3, and an example is given in Appendix B. Note that the algorithms rely only on the simple form of the (s_i), not on the field representation of the coefficients.

Algorithm 3: Basis conversion: monomial to novelpoly w.r.t. the Cantor basis.
BasisCvt(f(x)):
  input : f(x) = f_0 + f_1 x + ... + f_{n−1} x^{n−1}.
  output: g(X) = g_0 + g_1 X_1(x) + ... + g_{n−1} X_{n−1}(x).
  1. if deg(f(x)) ≤ 1 then return g(X) ← f_0 + X_1 f_1;
  2. Let k ← max 2^i (i ∈ Z) s.t. deg(s_{2^i}(x)) ≤ deg(f(x)).
  3. Let y ← s_k(x).
  4. h(y) = h_0(x) + h_1(x) y + ... + h_{m−1}(x) y^{m−1} ← VarSubs(f(x), y).
  5. h′(Y) = q_0(x) + q_1(x) X_k + ... + q_{m−1}(x) X_{(m−1)·k} ← BasisCvt(h(y)).
  6. foreach q_i(x) in h′(Y) do: g_i(X) ← BasisCvt(q_i(x)).
  7. return g(X) = g_0(X) + g_1(X) X_k + ... + g_{m−1}(X) X_{(m−1)·k}.

We can obtain a fast multiplication for binary polynomials simply by applying the LCH FFT with the Cantor basis as the underlying FFT in the general method of Sec. 2.1. Besides this straightforward method, we also present a faster algorithm that accelerates the field multiplications inside the FFT. The acceleration relies on a special tower field representation, making all multipliers in the butterflies short.
A simple version of our multiplication for binary polynomials keeps the working field in a polynomial-basis representation and applies the LCH FFT, with evaluation points in the Cantor basis, to the general multiplication of Sec. 2.1. The details of this straightforward algorithm are presented in Alg. 4. For FFT_LCH, we choose w = 64 and use F_{2^128} as our base field. The evaluation points are {φ_β(0), ..., φ_β(2n − 1)}, as can be seen on lines 5 and 6 of Alg. 4. The multiplier s_k(α) is calculated in the Cantor basis and then switched to its representation in F_{2^128} with a linear map (field isomorphism). Although the multipliers are short numbers in the Cantor basis, they are random-looking 128-bit polynomials in F_{2^128}; the field multiplication in F_{2^128} is performed with 5 PCLMULQDQ (cf. Sec. 2.2.1). The other multiplier, s_k(β_k), equals 1 in the Cantor basis.

Algorithm 4: Simple multiplication for binary polynomials.
binPolyMul(a(x), b(x)):
  input : a(x), b(x) ∈ F_2[x].
  output: c(x) = a(x) · b(x) ∈ F_2[x].
  1. f_a(x) ∈ F_{2^128}[x] ← Split(a(x)).
  2. f_b(x) ∈ F_{2^128}[x] ← Split(b(x)).
  3. g_a(X) ∈ F_{2^128}[X] ← BasisCvt(f_a(x)).
  4. g_b(X) ∈ F_{2^128}[X] ← BasisCvt(f_b(x)).
  5. [f_a(0), ..., f_a(φ_β(2n − 1))] ← FFT_LCH(g_a(X), 0).
  6. [f_b(0), ..., f_b(φ_β(2n − 1))] ← FFT_LCH(g_b(X), 0).
  7. [f_c(0), ..., f_c(φ_β(2n − 1))] ← [f_a(0) · f_b(0), ..., f_a(φ_β(2n − 1)) · f_b(φ_β(2n − 1))].
  8. g_c(X) ∈ F_{2^128}[X] ← iFFT_LCH([f_c(0), ..., f_c(φ_β(2n − 1))]).
  9. f_c(x) ∈ F_{2^128}[x] ← iBasisCvt(g_c(X)).
  10. c(x) ∈ F_2[x] ← InterleavedCombine(f_c(x)).
  11. return c(x).

Advantages of the additive FFT. We can expect the simple structure of the LCH FFT to lead to a lower complexity. One way to see this is to count the butterflies. The LCH FFT has a binary structure: at each of log_2 n layers there are n/2 butterflies, for a total of (1/2) n log_2 n multiplications. In a ternary FFT there are instead log_3 n layers of structures analogous to butterflies, n/3 per layer, and each of these structures takes four multiplications, for a total of (4/3) n log_3 n multiplications. All else being equal, the multiplicative complexity of the binary structure of the additive FFT is thus about 1.68 times lower than that of the ternary FFT, which is used in the Schönhage-like algorithm in [BGTZ08]. Similarly, we hold an advantage over the even more complex FFT method in [HvdHL16].

Results. Please refer to Tab. 2 in Sec. 5. Simply using an additive FFT confers a 10%–20% advantage over the state-of-the-art libraries of [HvdHL16, BGTZ08].
We consider the following sequence of extension fields:
  F_4 = F̃_4 := F_2[x_1]/(x_1^2 + x_1 + 1),
  F_16 = F̃_16 := F̃_4[x_2]/(x_2^2 + x_2 + x_1),
  F_256 = F̃_256 := F̃_16[x_3]/(x_3^2 + x_3 + x_1 x_2),
  F̃_{2^16} := F̃_256[x_4]/(x_4^2 + x_4 + x_1 x_2 x_3),
  F̃_{2^32} := F̃_{2^16}[x_5]/(x_5^2 + x_5 + ∏_{i=1}^{4} x_i),
  F̃_{2^64} := F̃_{2^32}[x_6]/(x_6^2 + x_6 + ∏_{i=1}^{5} x_i),
  F̃_{2^128} := F̃_{2^64}[x_7]/(x_7^2 + x_7 + ∏_{i=1}^{6} x_i),
  F̃_{2^256} := F̃_{2^128}[x_8]/(x_8^2 + x_8 + ∏_{i=1}^{7} x_i).
Decimal subscripts or a tilde denote that the field is in the tower representation. We can now define a basis for these tower fields as vector spaces over F_2.

Definition 3.1. v_k := ∏_{j=0}^{m−1} x_{j+1}^{b_j}, where k := Σ_{j=0}^{m−1} b_j 2^j with b_j ∈ {0, 1}.

By definition, the sequence (v_0, v_1, v_2, v_3, v_4, v_5, ...) := (1, x_1, x_2, x_1 x_2, x_3, x_1 x_3, ...). Henceforth (v_k) will be our default basis unless otherwise specified.

Definition 3.2. ī := φ_v(i) is the element of F̃_{2^k} represented by the integer i. Numbers in hexadecimal also denote representatives under the basis (v_i).

Hence the sequence (v_k) can also be written as (v_0, v_1, v_2, v_3, ...) := (1, 2, 4, 8, ...), or (0x1, 0x2, 0x4, 0x8, ...) in hexadecimal. For example, x_1 x_2 + x_2 + x_1 + 1 = v_3 + v_2 + v_1 + v_0 ∈ F_16 is denoted 15 or 0xf. Under this notation, we can order two elements of the tower field, and thus define "big" or "small", by comparing their representative numbers. The representation of each field is embedded in the lower bit(s) of the field that is twice as wide; for example, the elements {0, 1, 2, 3} of F_16 form the subfield F_4. Therefore, a "small" number in the tower field is often in a subfield.

3.2.1 Compatibility between Tower and Cantor Bases

We discuss the "compatibility" between the tower and Cantor bases in this section. Clearly, the basis (v_0, v_1, ...) is not a Cantor basis, since, e.g., v_3^2 + v_3 ≠ v_2.
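A compact recursive multiplication for this tower, with elements packed as integers as in Definition 3.2, can be sketched as follows (a sketch; tmul is our name, and the reduction constant is the product x_1 · x_2 · ... · x_{k−1} = v_{2^{k−1}−1} from the tower definition above):

```python
def tmul(a: int, b: int, k: int) -> int:
    """Multiply a, b in the tower field of 2^(2^k) elements; elements are ints
    of 2^k bits, and the low half of the bits is the next smaller tower field."""
    if k == 0:
        return a & b                       # GF(2)
    half = 1 << (k - 1)                    # bits per half
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    lo = tmul(a0, b0, k - 1)
    hi = tmul(a1, b1, k - 1)
    mid = tmul(a0 ^ a1, b0 ^ b1, k - 1) ^ lo ^ hi   # Karatsuba middle term
    # Reduce with x_k^2 = x_k + c, where c = x_1 x_2 ... x_{k-1} = v_{2^(k-1)-1},
    # whose integer representative is 1 << (half - 1).
    c = 1 << (half - 1)
    return ((hi ^ mid) << half) ^ tmul(hi, c, k - 1) ^ lo
```

The subfield embedding is directly visible: multiplying two elements of {0, 1, 2, 3} in F̃_256 (k = 3) gives the same result as in F̃_4 (k = 1), and each v_i = 2^i maps under x ↦ x^2 + x to a strictly shorter number.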
Claim 1. The sequence of subspaces (V_k) and the subspace vanishing polynomials (s_i) are nevertheless the same w.r.t. the Cantor basis (β_j) and the tower basis (v_j).

We first set V_k := span(v_0, ..., v_{k−1}) and s_k(x) := ∏_{c ∈ V_k} (x − c), and we will show that these are the same as those obtained from a Cantor basis. Note again that V_k is a field only for k = 2^m a power of two, and only in this case we have V_{2^m} = F̃_{2^{2^m}} = F̃_{2^{2^{m−1}}}[x_m]/(x_m^2 + x_m + v_{2^{m−1}−1}).

Proposition 1. s_k(v_k) = 1.

Proof:
See Appendix A.
Corollary 1 (recursivity). With respect to (v_i),
s_{k+1}(x) = s_k(x) · ∏_{c ∈ V_k + v_k} (x − c) = s_k(x) · s_k(x + v_k) = s_k(x)(s_k(x) + s_k(v_k)) = (s_k(x))^2 + s_k(x).

Note that s_i(x) is linear w.r.t. any basis of the field [LCH14, Theorem 1]. Thus, the vanishing polynomials of the tower and Cantor bases are the same. Since s_i(x) determines V_i by the unique factorization theorem, we have proved Claim 1.

In this section we show that multiplying by a subfield element is not only cheaper than a general multiplication but can also be calculated as a vector-scalar product. The cost depends on the size of the subfield. For example, to multiply a ∈ F̃_{2^{2w}} by b ∈ F̃_{2^w}, the element a is represented as a polynomial a := a_0 + a_1 x ∈ F̃_{2^w}[x] with a_0, a_1 ∈ F̃_{2^w}; the product a · b ∈ F̃_{2^{2w}} is then calculated as two multiplications in F̃_{2^w}, i.e., (a_0 + a_1 x) · b. It is easy to generalize this to the following proposition.

Proposition 2.
Given a ∈ F̃_{2^{l_1}} = V_{l_1} and b ∈ F̃_{2^{l_2}} = V_{l_2} with l_2 | l_1, the product a · b ∈ V_{l_1} can be performed with l_1/l_2 field multiplications in F̃_{2^{l_2}}.

Besides the compatibility with the Cantor basis, which allows efficient FFT_LCH and basis conversion in tower fields, we show in this section that the multipliers in FFT_LCH fall to smaller numbers, which can be optimized via subfield multiplication with Prop. 2. Recall that the two multipliers are s_k(α) and s_k(v_k), the latter corresponding to s_k(β_k) in Alg. 1.

First, s_k(v_k) = 1 by Prop. 1. We can thus avoid the multiplication by s_k(v_k), as in the Cantor basis; one butterfly unit then contains one multiplication and two additions. Figure 1 shows the details of the butterfly unit. It is also an example of evaluating a degree-1 polynomial f(x) = f_0 + f_1 x at the points {c, c + 1}, i.e., α = c in the notation of Alg. 1. Since X_1(x) = x, a degree-1 polynomial after basis conversion is identical to the original polynomial, i.e., g(X) = g_0 + g_1 X_1 = f_0 + f_1 X_1. The only effective multiplier is s_0(α) = s_0(c) = c.

[Figure 1: Details of the butterfly unit.]

Now we discuss the multiplier s_k(α).

Proposition 3. For any v_i = 2^i w.r.t. the tower representation, v_i^2 + v_i ∈ V_i.

Proof: See Appendix A.

Contrast this with the Cantor basis, where β_i^2 + β_i = β_{i−1}. Note that s_1 : x ↦ x^2 + x is a linear map from the vector space V_{i+1} to V_i and its kernel is {0, 1}. Since each vector in V_i is the image of exactly two vectors in V_{i+1}, each vector in V_i \ V_{i−1} is also the image of exactly two vectors in V_{i+1} \ V_i. Hence s_1(v_i) = v_{i−1} + u with u ∈ V_{i−1}, i.e., s_1 shortens the tower representation by exactly one bit. Since s_k is just s_1 applied k consecutive times, the multiplier s_k(α) in the LCH butterflies is (a) k bits shorter than α in the tower representation, and (b) independent of the least significant k bits of α.

Fig. 2 depicts the evaluation of a degree-7 polynomial at the points {0, 1, ..., 15} with the LCH FFT, using 4 (= log 16) layers of butterflies. We can observe that the multipliers in the butterflies are smaller than the actual evaluation points.

[Figure 2: The forward butterfly units for evaluating a degree-7 polynomial f(x) = g(X) = g_0 + ... + g_7 X_7 at the points {0, 1, ..., 15}.]

Since the coefficients above degree 7 are all 0, the first layer is simply a "fan-out" (in software, copying a block of memory). The multipliers in the second layer are calculated by evaluating the degree-4 polynomial s_2(x) at the two values α ∈ {0, 8}, resulting in small multipliers in F_4. The third layer evaluates the degree-2 polynomial s_1(x) at the 4 points α ∈ {0, 4, 8, 12}, again resulting in small multipliers. In the last layer, the multipliers are the (even) evaluation points {0, 2, 4, ..., 14} themselves, because s_0(α) = α.

Multiplication in F̃_{2^128} or F̃_{2^256}

The main idea of our algorithm is to trade an expensive field isomorphism for faster multiplications by small subfield elements. We show the algorithm for multiplying binary polynomials in Alg. 5. Assume first that we use w = 64-bit words; the input polynomials are then in 64-bit blocks after partitioning (Split()).
We perform the basis conversion before changing representations, so as to have more densely packed data during the conversion. Then we convert the data to the tower representation; the data are kept in the tower representation for performing FFT_LCH. We compute the subfield multiplications efficiently with the technique of Sec. 4.1. The pointmul step is also performed in F̃_{2^128}; its somewhat different operations are detailed in Sec. 4.1.3. The inverse butterfly and inverse basis-conversion stages are then computed (still in the tower representation) before converting back to the polynomial basis of F_{2^128}. Finally, we split the 128-bit results into 64-bit blocks and collate the coefficients for the final result (InterleavedCombine()). An initial block width of w = 128 bits is also possible; in this case the operative field is F̃_{2^256}.
Algorithm 5: Multiplication for binary polynomials in the tower representation.
binPolyMul(a(x), b(x)):
  input : a(x), b(x) ∈ F_2[x].
  output: c(x) = a(x) · b(x) ∈ F_2[x].
  1. f_a(x) ∈ F_{2^{2w}}[x] ← Split(a(x)).
  2. f_b(x) ∈ F_{2^{2w}}[x] ← Split(b(x)).
  3. g_a(X) ∈ F_{2^{2w}}[X] ← BasisCvt(f_a(x)).
  4. g_b(X) ∈ F_{2^{2w}}[X] ← BasisCvt(f_b(x)).
  5. g̃_a(X) ∈ F̃_{2^{2w}}[X] ← changeRepr(g_a(X)).
  6. g̃_b(X) ∈ F̃_{2^{2w}}[X] ← changeRepr(g_b(X)).
  7. [f̃_a(0), ..., f̃_a(2n − 1)] ← FFT_LCH(g̃_a(X), 0).
  8. [f̃_b(0), ..., f̃_b(2n − 1)] ← FFT_LCH(g̃_b(X), 0).
  9. [f̃_c(0), ..., f̃_c(2n − 1)] ← [f̃_a(0) · f̃_b(0), ..., f̃_a(2n − 1) · f̃_b(2n − 1)].
  10. g̃_c(X) ∈ F̃_{2^{2w}}[X] ← iFFT_LCH([f̃_c(0), ..., f̃_c(2n − 1)]).
  11. f̃_c(x) ∈ F̃_{2^{2w}}[x] ← iBasisCvt(g̃_c(X)).
  12. f_c(x) ∈ F_{2^{2w}}[x] ← changeRepr(f̃_c(x)).
  13. c(x) ∈ F_2[x] ← InterleavedCombine(f_c(x)).
  14. return c(x).

Implementation of F̃_{2^128}

Multiplication by Subfield Elements
Multiplying an element of a tower field by a subfield element is implemented as a scalar multiplication of a vector over the subfield, by Prop. 2. We show how to calculate this product efficiently on current mainstream computers in this section.
The typical single-instruction-multiple-data (SIMD) instruction set nowadays is Advanced Vector Extensions 2 (AVX2), providing 256-bit ymm registers on x86 platforms. We use the table-lookup instruction VPSHUFB in AVX2 for multiplying elements of F̃_{2^128} by subfield elements from F_16 up to F̃_{2^32}. We demonstrate the scalar multiplication over subfields with the PSHUFB instruction, which is the precursor of VPSHUFB and uses 128-bit xmm registers.

PSHUFB takes two 16-byte sources: one is a lookup table of 16 bytes x = (x_0, x_1, ..., x_15) and the other is a vector of 16 indices y = (y_0, y_1, ..., y_15). The 16-byte result of "PSHUFB x, y" at position i is x_{y_i mod 16} if y_i ≥ 0, and 0 if y_i < 0 (index bytes are treated as signed). VPSHUFB simply performs two copies of PSHUFB in one instruction. The two instructions are well suited to scalar multiplication over small fields [CYC13].

For scalar multiplication over F_16, we first prepare 16 tables; each table stores the products of all elements with one specific element of F_16. Given a ∈ F_16^n and b ∈ F_16, we apply VPSHUFB to the prepared "multiply-by-b" table and a to obtain the product a · b. Since one 256-bit register stores 64 elements of F_16, the data in a have to be split into nibbles (4-bit chunks) before applying VPSHUFB. Scalar multiplication over F_256 is similar, except that the number of prepared tables becomes 256 and one multiplication costs 2 VPSHUFB.

For scalar multiplication over F̃_{2^16} or F̃_{2^32}, we implement the field multiplication as polynomial multiplication over F_256[x], because 2^16 or 2^32 prepared tables would be far too much for the caches. Karatsuba's method is applied to reduce the total number of multiplications in F_256 while multiplying the polynomials.

While multiplying F̃_{2^128} elements by elements of F̃_{2^16} or F̃_{2^32}, different bytes of an F̃_{2^128} element have to be multiplied by different multipliers in F_256. Since we use scalar multiplication over F_256 as our building block, multiplying by different multipliers reduces the efficiency.

Example: product of F̃_{2^128} and F̃_{2^16} elements. The natural data layout of F̃_{2^128} consists of eight consecutive F̃_{2^16} elements, each with its two bytes stored side by side. Suppose we are multiplying a = (a_0, ..., a_7) ∈ F̃_{2^128} := (F̃_{2^16})^8 by c = (c_0, c_1) ∈ F̃_{2^16} := (F_256)^2. Then a is naturally stored as ((a_{0,0}, a_{0,1}), (a_{1,0}, a_{1,1}), ..., (a_{7,0}, a_{7,1})). To perform the field multiplication in F̃_{2^16} = F_256[x] with Karatsuba, we must then compute products such as ((a_{0,0}, 0), (a_{1,0}, 0), ..., (a_{7,0}, 0)) · c_0 and ((0, a_{0,1}), ..., (0, a_{7,1})) · c_1. In this case, one has to mask off half the components of a, which reduces the efficiency.
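The table-lookup mechanism above can be modeled in a few lines. gf16_mul below uses one fixed F_16 representation purely to fill the tables (the mechanism is representation-agnostic; the paper uses the tower representation), and all names are ours:

```python
def gf16_mul(a: int, b: int) -> int:
    """Reference GF(16) multiply in F_2[t]/(t^4 + t + 1), for table generation."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13                      # reduce by t^4 + t + 1
        b >>= 1
    return r

# One "multiply-by-b" table per scalar b: 16 tables of 16 entries, as in the text.
TABLES = [[gf16_mul(a, b) for a in range(16)] for b in range(16)]

def pshufb(table: list, idx: list) -> list:
    """Model of the table lookup done by (V)PSHUFB for 4-bit indices (the real
    instruction also zeroes any lane whose index byte has its top bit set)."""
    return [table[i & 0x0F] for i in idx]

def scalar_mul_gf16(vec: list, b: int) -> list:
    """Multiply a whole vector of GF(16) nibbles by the scalar b in one lookup."""
    return pshufb(TABLES[b], vec)
```

As the text describes, F_256 uses the same idea with 256 prepared tables and 2 lookups per multiplication (one for the low and one for the high nibble of each byte).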
"Byte-slicing" layout: To perform many scalar multiplications with VPSHUFB efficiently, a possible solution is to store each of the 16 bytes of an F̃_{2^128} element in a separate register. This rearrangement of the data layout is exactly a transposition of a byte matrix. After the bytes in the same position of many F̃_{2^128} elements are collected in the same register (cf. Fig. 3), SIMD instructions can cover entire registers.
Transposition as needed: When multiplying two elements by one subfield element, instead of one multiplicand as in the previous example, e.g., multiplying a, b ∈ F̃_{2^128} by c ∈ F̃_{2^16}, we only have to split the even and odd bytes of a and b to perform the scalar multiplications efficiently. Hence we can use a 2 × 2 transposition, which converts a = (a_0, ..., a_15) and b = (b_0, ..., b_15) into one register of even bytes and one of odd bytes. The 2 × 2 transposition involves only 2 registers instead of the 16 registers of the full byte-matrix transposition and is thus more efficient. For multiplying by an F̃_{2^32} element, we need a 4 × 4 transposition of our data. Fig. 3 depicts the data rearrangements for the different demands.

Figure 3: "Byte-matrix transpose" for F̃_{2^128} elements: the data layout in the middle is for multiplying by F̃_{2^32} elements, and the rightmost (fully byte-sliced) layout is for pointmul.

To transpose a matrix, we use techniques similar to [War12], composing a large transposition from smaller block transpositions. While performing the transposition, we first collect elements within one register with a byte-shuffle instruction (VPSHUFB). The interchange of data between different registers is accomplished by swizzle instructions from the AVX2 instruction set. We refer readers to [Int15] [Fog17] for more information on AVX instructions.

pointmul

In pointmul, we need a generic field multiplication instead of the subfield multiplications of FFT_LCH for data in the tower representation. Assuming the data are in F̃_{2^128} and in the byte-sliced layout, the multiplication in F̃_{2^128} can easily be performed as polynomial multiplication in F̃_{2^64}[x] in a SIMD manner. In other words, we reduce one parallelized multiplication in F̃_{2^128} to several parallelized multiplications in F̃_{2^64}. The process is applied recursively down to pointwise multiplication in F_256, where we apply the method in [CLP+17].

We now discuss the choice of working field based on the cost of field multiplication. When multiplying d-bit polynomials, we split the data into w-bit chunks and then apply the FFT over an l-bit base field with l = 2w. Here we want to decide on a suitable size of base field; on a typical x86 CPU, one may choose l to be 64, 128, or 256.
Field constructed by an irreducible polynomial: In this construction, the field multiplication is implemented with the PCLMULQDQ instruction. Using Karatsuba's method and linear folding for the reduction, one multiplication costs (1+2) = 3, (3+2) = 5, and (9+4) = 13 PCLMULQDQ in F_{2^64}, F_{2^128}, and F_{2^256} respectively. The number of butterflies in FFT_LCH is proportional to (d/w) log(d/w), with one field multiplication per butterfly. Weighing these per-multiplication costs against the number of butterflies, working over F_{2^128} gives the lowest total number of PCLMULQDQ for FFT_LCH.
Tower field: For an l-bit tower field, the cost of multiplying by subfield elements in FFT_LCH or by generic elements in pointmul is proportional to the field length l, as described in Prop. 2. Since the number of multiplications in FFT_LCH is proportional to (d/l) log(d/l), doubling l removes one layer of butterflies (the log(d/l) factor) but leaves the cost of each layer of butterflies (∝ l · d/l) unchanged. The cost of a generic multiplication in pointmul grows faster than linearly in l (about l^{log₂ 3}, using the recursive Karatsuba of Sec. 4.1.3). Comparing a field of length l with one of length 2l, the smaller field is therefore more efficient. However, the field length should also fit the underlying machine architecture. On the Intel Haswell architecture with the AVX2 instruction set, it is most efficient to work in 256-bit units, because memory access is best with 256-bit alignment. Therefore we choose l = 256 bits for Intel Haswell in our implementation.

The change of field representations is simply the product of a predefined matrix I with the data α ∈ V_k viewed as a bit vector. We compute the product I · α with the famous method of four Russians (M4R) [AH74]. With M4R of l'-bit granularity, one first prepares all possible partial products of I, i.e., I · b for all b ∈ span(v_0, ..., v_{l'−1}), for all b ∈ span(v_{l'}, ..., v_{2l'−1}), ..., and for all b ∈ span(v_{k−l'}, ..., v_{k−1}). To compute I · α, one splits α into l'-bit chunks, looks up the prepared tables for the various segments of α, and XORs the results together. The number of operations is proportional to the bit length of α. The choice of l' depends on the cache size. The matrix I for a 256-bit field is 256 × 256 bits, and the L1 data cache on the Intel Haswell architecture is 32 KiB. Therefore we choose M4R with 4-bit granularity, which results in 32 KiB = 16 × 64 × 256 bits of prepared tables.

Calculation of Multipliers in the Butterflies

Suppose we want to calculate s_i(α) for α ∈ V_k in a tower field. s_i(α) can be calculated recursively via s_i(α) = s_{i−1}(s_1(α)). Since s_1 is a linear map, we can prepare a table S1 := [s_1(v_0), ..., s_1(v_{k−1})] of the images of s_1 on all basis elements (v_0, ..., v_{k−1}). Since α = Σ b_i · v_i with b_i ∈ {0, 1}, we have s_1(α) = S1 · α, which is a matrix product. For further optimization, we can omit the least significant i bits of α while evaluating s_i(x), by Prop. 3. In our implementations, we actually precompute 31 tables S1, S2, ..., S31 for the evaluations of s_1(x), s_2(x), ..., s_31(x), avoiding the need for recursion. Again, we use M4R to accelerate the matrix-vector products; the storage for the tables is 4 KiB, which fits our L1 cache.

Benchmarks

We benchmark our implementation on the Intel Haswell architecture (the same as [HvdHL16]). Our hardware is an Intel Xeon E3-1245 v3 @3.40GHz with Turbo Boost disabled and 32 GB DDR3@1600MHz memory. The experiments were run on Ubuntu 16.04, Linux version 4.4.0-78-generic, with compiler gcc 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4). We show our main results in Tab. 2, comparing against [BGTZ08] and [HvdHL16]. Our implementations are denoted by their base fields in FFT_LCH. The version over F is the simple version with evaluation points in a Cantor basis (Sec. 3.1); the versions over F̃ are the tower-field implementations of Sec. 3.3. In general, our implementations are faster than the other binary polynomial multipliers, by roughly one third at the largest sizes in Tab. 2.

Table 2: Products of polynomials of degree < d in F_2[x] on Intel Xeon E3-1245 v3 @ 3.40GHz (in 10^{-6} sec.)
log₂ d                          15   16   17   18    19    20    21    22    23
This work, F̃ (tower field)     11   22   48  104   243   527  1105  2302  4812
This work, F (Cantor basis)     12   26   55  119   261   554  1181  2491  5282
gf2x [BGTZ08] ^a                12   26   59  123   285   586  1371  3653  7364
F_{2^60} [HvdHL16] ^b           14   29   64  148   279   668  1160  3142  7040

^a Version 1.2, available from http://gf2x.gforge.inria.fr/
^b SVN r10663, available from svn://scm.gforge.inria.fr/svn/mmx
The first notable result is that our simple version over F outperforms the previous implementations of [BGTZ08] and [HvdHL16] for large polynomials. This shows that, given the same base-field multiplication using PCLMULQDQ, an additive FFT is the better basis for multiplying binary polynomials. Software can be downloaded from https://github.com/fast-crypto-lab/bitpolymul .
We can also compare the effectiveness of the field multiplications using the second and third rows. The comparison between the F̃ tower versions and F shows that subfield multiplication with VPSHUFB outperforms generic multiplication with PCLMULQDQ. This is a counter-intuitive result, since PCLMULQDQ is dedicated to multiplying binary polynomials by design. The effect can be estimated qualitatively: we can do 32 multiplications in F_256 using 2 VPSHUFB. Multiplying by F̃_{2^32} elements (the largest multipliers in the butterflies) costs 9 F_256 multiplications. Multiplying an F̃_{2^128} element by an F̃_{2^32} element takes 4 such products, i.e., 36 F_256 multiplications, so each F̃_{2^128}-by-F̃_{2^32} product takes 72/32 VPSHUFB on average. Similarly, an F̃_{2^128}-by-F̃_{2^16} product takes 48/32 VPSHUFB on average. The average is fewer than 2 VPSHUFB, against several PCLMULQDQ per multiplication when working over F.
The profiles of the various components: The relative costs of basis conversion : butterfly process : pointmul : change of representation when multiplying long polynomials are 1 : 3.06 : 0.08 : 0, 1 : 2.11 : 0.27 : 0.54, and 0.73 : 1.74 : 0.47 : 0.57 for F and the two tower versions F̃, respectively. The results show that the tower representation adds the cost of changing representations and also increases the cost of pointmul. However, the efficient subfield multiplication reduces the cost of the butterfly process by more than these added costs. The larger tower version even reduces the cost of basis conversion, because its better-aligned memory accesses fit the machine architecture. More results on a newer Intel architecture, and full profiles, can be found in Appendix D.

Discussion of the most recent result: In [VDHLL17], van der Hoeven et al. present a new result on multiplying binary polynomials. They use a traditional multiplicative FFT and replace the Kronecker segmentation of Sec. 2.1 with a Frobenius DFT, saving a significant fraction of the running time of [HvdHL16]. We note that the same technique can be applied to the additive FFT for a comparable saving; the method for the additive FFT is described in Appendix C.

Our experiments are in fact optimal cases for additive-FFT-based multipliers, since the multiplied polynomials have a power-of-two number of terms. Because the cost of the additive FFT as a function of the number of terms is strongly stairwise, one can truncate the FFT to make the curve smoother, as in [BGTZ08]. One truncated version of the additive FFT, for n = 3 · 2^k terms, was shown in [CLP+17]. In this section, we discuss two possible variants of the implementation.
Tower field implementations with evaluation points in Cantor basis
If we choose evaluation points in a Cantor basis, the evaluation of s_i(α) seems faster than with the M4R technique of the previous section. However, we then still need to change the representation of s_i(α) from the Cantor basis to the tower field, and this operation costs the same as calculating s_i(α) in the tower field with M4R.
Performing subfield multiplication in Cantor basis: For subfield multiplications in the tower field, the efficiency comes from the powerful VPSHUFB instruction. Since we do not have Prop. 2 in a Cantor basis, we cannot use VPSHUFB there the way we do in tower fields. Another option for implementing field multiplication in a Cantor basis is to use bit-sliced data and logical instructions [BC14]. However, this requires at least 64 ymm registers of data in flight when operating over a 64-bit base field, and keeping so much data in play is inefficient.
We have presented efficient multipliers based on a recently developed additive FFT, which has multiplicative complexity similar to, but lower than, the ternary variant of Schönhage's algorithm used in [BGTZ08]. In [BGTZ08], Brent et al. also implemented Cantor's algorithm [Can89] besides Schönhage's. They concluded that "Schönhage's algorithm is consistently faster by a factor of about 2 (than Cantor)"; in their implementation, the multiplicative FFT outperformed the additive one. Our experiments show that the recently developed additive FFT does help in practice for multiplying binary polynomials of large degrees. We derive a further advantage by exploiting the lower cost of multiplying by subfield elements in tower fields.
Future Work: Our implementation is written in C and may be improved with assembly for better register allocation. Further, AVX-512, featuring 512-bit SIMD instructions, will soon be widely available. We probably cannot speed the additive FFT multiplier up by a full factor of two, but AVX-512 can surely be expected to offer a substantial advance.
References [AFK +
07] Kazumaro Aoki, Jens Franke, Thorsten Kleinjung, Arjen KLenstra, and Dag Arne Osvik. A kilobit special number fieldsieve factorization. In
International Conference on the Theory andApplication of Cryptology and Information Security , pages 1–12.Springer, 2007.[AH74] Alfred V. Aho and John E. Hopcroft.
The Design and Analysis ofComputer Algorithms . Addison-Wesley Longman Publishing Co.,Inc., Boston, MA, USA, 1st edition, 1974.[BC14] Daniel J. Bernstein and Tung Chou. Faster binary-field multipli-cation and faster binary-field macs. In Antoine Joux and Amr M.Youssef, editors,
Selected Areas in Cryptography - SAC 2014 - 21stInternational Conference, Montreal, QC, Canada, August 14-15,2014, Revised Selected Papers , volume 8781 of
Lecture Notes in Computer Science, pages 92–111. Springer, 2014.

[BGTZ08] Richard P. Brent, Pierrick Gaudry, Emmanuel Thomé, and Paul Zimmermann. Faster multiplication in GF(2)[x].
Lecture Notes inComputer Science , 5011:153–166, 2008.[Can89] David G. Cantor. On arithmetical algorithms over finite fields.
J.Comb. Theory Ser. A , 50(2):285–300, March 1989.[CCNY12] Chen-Mou Cheng, Tung Chou, Ruben Niederhagen, and Bo-YinYang. Solving quadratic equations with xl on parallel architectures.In Emmanuel Prouff and Patrick Schaumont, editors,
CHES , vol-ume 7428 of
Lecture Notes in Computer Science , pages 356–373.Springer, 2012.[CLP +
17] Ming-Shing Chen, Wen-Ding Li, Bo-Yuan Peng, Bo-Yin Yang,and Chen-Mou Cheng. Implementing 128-bit secure mpkc signa-tures. Cryptology ePrint Archive, Report 2017/636, 2017. http://eprint.iacr.org/2017/636 .[CLRS09] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, andClifford Stein.
Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.

[CT65] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.

[CYC13] Ming-Shing Chen, Bo-Yin Yang, and Chen-Mou Cheng. RAIDq: A software-friendly, multiple-parity RAID. In
Presented as part of the5th USENIX Workshop on Hot Topics in Storage and File Systems ,Berkeley, CA, 2013. USENIX.[DGS06] Jintai Ding, Jason Gower, and Dieter Schmidt.
Multivariate Public-Key Cryptosystems . Advances in Information Security. Springer,2006. ISBN 0-387-32229-9.[Fog17] Agner Fog.
Instruction Tables . Copenhagen University, College ofEngineering, May 2017. Lists of Instruction Latencies, Through-puts and micro-operation breakdowns for Intel, AMD, and VIACPUs, .[GK14] Shay Gueron and Michael E. Kounavis. Intel(r) carry-less multiplication instruction and its usage for com-puting the gcm mode(rev.2.02), April 2014. https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf .[GM10] Shuhong Gao and Todd D. Mateer. Additive fast fourier transformsover finite fields.
IEEE Trans. Information Theory, 56(12):6265–6272, 2010.

[HvdHL16] David Harvey, Joris van der Hoeven, and Grégoire Lecerf. Fast polynomial multiplication over F_{2^60}. In Sergei A. Abramov, Eugene V. Zima, and Xiao-Shan Gao, editors, Proceedings of the International Symposium on Symbolic and Algebraic Computation, ISSAC 2016, Waterloo, ON, Canada, July 19-22, 2016, pages 255–262. ACM, 2016.

[Int15] Intel. Intel architecture instruction set extensions programming reference, August 2015. https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf .

[LANH16] Sian-Jheng Lin, Tareq Y. Al-Naffouri, and Yunghsiang S. Han. FFT algorithm for binary extension finite fields and its application to Reed–Solomon codes. IEEE Trans. Inf. Theory, 62(10):5343–5358, October 2016.

[LCH14] Sian-Jheng Lin, Wei-Ho Chung, and Yunghsiang S. Han. Novel polynomial basis and its application to Reed–Solomon erasure codes. In 55th Annual Symposium on Foundations of Computer Science (FOCS 2014), pages 316–325. IEEE Computer Society, 2014.

[Sch77] Arnold Schönhage. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2.
Acta Informatica , 7(4):395–398, 1977.[Tho02] Emmanuel Thomé. Subquadratic computation of vector generatingpolynomials and improvement of the block wiedemann algorithm.
Journal of Symbolic Computation, 33(5):757–775, 2002.

[VDHLL17] Joris van der Hoeven, Robin Larrieu, and Grégoire Lecerf. Implementing fast carryless multiplication. Working paper or preprint, August 2017.

[vzGG96] Joachim von zur Gathen and Jürgen Gerhard. Arithmetic and factorization of polynomials over F_2 (extended abstract). In
Proceedings of the 1996 International Symposium on Symbolic and Algebraic Computation, ISSAC '96, pages 1–9, New York, NY, USA, 1996. ACM.

[vzGG02] Joachim von zur Gathen and Jürgen Gerhard. Polynomial factorization over F_2.
Mathematics of Computation , 71(240):1677–1698,2002.[War12] Henry S. Warren.
Hacker’s Delight . Addison-Wesley Professional,2nd edition, 2012.
A  Claims about s_i(x) in Tower Fields and Proofs

Proposition 1. s_k(v_k) := ∏_{b ∈ V_k} (v_k − b) = 1.

We first discuss the special case that k = 2^l is a power of two, so that V_k is the field F_q with q = 2^k. By Galois theory, we have s_k(x) = ∏_{b ∈ F_q} (x − b) = x^q + x.

Proposition 4. If q is a power of two, and we choose any a ∈ F_q such that F_{q^2} := F_q[x_l]/(x_l^2 + x_l + a) is a valid field extension, then ∏_{b ∈ F_q} (x_l − b) = x_l^q + x_l = 1.

Proof.
  x_l^q + x_l = (x_l^2 + x_l) + (x_l^2 + x_l)^2 + (x_l^2 + x_l)^4 + ··· + (x_l^2 + x_l)^{q/2}
              = a + a^2 + a^4 + ··· + a^{q/2}
              = Trace of a (in F_q over F_2) ∈ {0, 1}.
But zero here would be contradictory, because all q solutions of x^q = x already lie in F_q, while x_l is in F_{q^2} \ F_q; hence we must have x_l^q + x_l = 1. ∎

Proof of Prop. 1. If k = 2^j is a power of two, then V_k = F_{2^k} and the result holds by the proposition above. So we assume the proposition holds for all k < h, and write h = q + ℓ < 2q, where q = 2^j. Note that for a ∈ V_q = F_{2^q} we have (v_q a)^{2^q} + v_q a = (v_q + 1) a + v_q a = a, and

  ∏_{b ∈ V_h} (v_h − b)
    = ∏_{c ∈ span(v_q, ..., v_{h−1})} ∏_{c′ ∈ V_q} (v_h + c + c′)                 (divide V_h)
    = ∏_{c″ ∈ V_{h−q}} ∏_{c′ ∈ V_q} (v_q (v_{h−q} + c″) + c′)                     (replace c by v_q c″)
    = ∏_{c″ ∈ V_{h−q}} [ (v_q (v_{h−q} + c″))^{2^q} + v_q (v_{h−q} + c″) ]        (Galois theory)
    = ∏_{c″ ∈ V_{h−q}} (v_{h−q} + c″) = 1.                                        (by induction) ∎

Proposition 3. For any v_i w.r.t. the tower representation, v_i^2 + v_i ∈ V_i.

In the proof of Prop. 3, we need the following consequences of Definitions 3.1 and 2.3.

Corollary 2. If 2^k > i, then x_{k+1} v_i = v_{2^k} v_i = v_{2^k + i}.

Corollary 3. If 2^k > i and v ∈ V_i, then x_{k+1} v = v_{2^k} v ∈ V_{2^k + i}.

Proof of Prop. 3. If i = 2^k, then v_i = x_{k+1}, and v_i^2 + v_i = x_1 x_2 ··· x_k = v_{2^k − 1} by the definition of the x_i, so the claim holds. Using mathematical induction, we let i = 2^{k−1} + j with 0 < j < 2^{k−1}. Thus v_i = x_k v_j, and v_i^2 + v_i = (x_k^2 + x_k) v_j^2 + x_k (v_j^2 + v_j). The first term is the product of two elements of F_{2^{2^{k−1}}} and therefore itself lies in F_{2^{2^{k−1}}} = V_{2^{k−1}} ⊆ V_i. The second is the product of v_{2^{k−1}} and an element of V_j (by the induction hypothesis), which by Corollary 3 also lies in V_{2^{k−1}+j} = V_i. ∎
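The trace computation at the heart of the proof of Prop. 4 can be checked numerically for q = 16, assuming the representation F_16 = F_2[x]/(x^4 + x + 1) (an illustrative choice): the sum a + a^2 + a^4 + a^8 always lands in {0, 1}, and both values occur.

```python
def gf16_mul(a, b):
    """Multiply in F_16 = F_2[x]/(x^4 + x + 1) (an assumed representation)."""
    p = 0
    for i in range(4):
        if (b >> i) & 1:
            p ^= a << i
    for i in range(7, 3, -1):
        if (p >> i) & 1:
            p ^= 0b10011 << (i - 4)
    return p

def trace(a):
    """Tr(a) = a + a^2 + a^4 + a^8, the sum appearing in the proof of Prop. 4."""
    t, s = 0, a
    for _ in range(4):
        t ^= s
        s = gf16_mul(s, s)   # repeated squaring: a, a^2, a^4, a^8
    return t
```

Since the trace is F_2-linear and surjective onto F_2, exactly half of the 16 elements have trace 1, matching the dichotomy used in the proof.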
B  Basis Conversion: An Example

Figure 4 shows an example of converting a degree-15 polynomial to the novel polynomial basis. We have to divide by 3 different s_i(x)'s, namely s_1(x), s_2(x), and s_3(x), the last of which has 4 terms. One can see that there are actually 4 layers of division in Figure 4, and the number of XORs is the same in every layer.

Figure 4: From f(x) = f_0 + ··· + f_15 x^15 to g(X) = g_0 + ··· + g_15 X^15 by Algorithm 3.

What the first 2 layers perform is a variable substitution into terms of y = s_1(x) = x^2 + x (see Algorithm 2). The first layer is a division by s_1(x)^4 = x^8 + x^4, and the second layer divides the resulting high-degree and low-degree polynomials by s_1(x)^2 = x^4 + x^2. The third layer is a division by s_2(x) = y^2 + y, adding coefficients of terms that differ by a factor of s_2(x) and lie 4 positions apart (see the 3rd column of Table 1). The last layer divides 4 short polynomials by s_1(x) = x^2 + x (the last loop of Algorithm 3). Note that the conversion only ever divides by two-term polynomials.

C  Frobenius Bijection between F_2[x]_{<128·n} and F_{2^128}^n

In this section, we perform an FFT directly on F_2[x] instead of combining many coefficients into a larger finite field, i.e., we abandon the Kronecker segmentation. We refer the reader to [VDHLL17] for preliminaries and focus on carrying the idea over to the additive FFT.

C.1  The Partition of Evaluation Points
Let φ be the Frobenius operation over F_2, which is exactly squaring in an extension field of characteristic 2. In the Cantor basis we thus have φ(β_0) = β_0, and φ(β_i) = β_i^2 = β_i + β_{i−1} for i > 0. Moreover, the order of φ on β_i for i > 0 is 2 · 2^{⌊log₂ i⌋}.

Definition C.1. Given a(x) ∈ F_2[x]_{<128·n} with n = 2^m and m < 64, define the linear map E_{β_{m+64}+V_m} : F_2[x]_{<128·n} → F_{2^128}^n as the evaluation of a(x) at the points β_{m+64} + V_m, i.e., E_{β_{m+64}+V_m} : a(x) ↦ { a(β_{m+64} + i) : i ∈ V_m }.

Note that a(x) is evaluated at only n = 2^m points, fewer than its number of coefficients; however, the evaluation points lie in F_{2^128}. One can check that the order of φ on β_{m+64} is 128.

Proposition 5. E_{β_{m+64}+V_m} is a bijection between F_2[x]_{<128·n} and F_{2^128}^n.

C.2  Truncated Additive FFT

To evaluate a polynomial a(x) ∈ F_2[x]_{<128·n} at the 2^m points, we treat each coefficient of a(x) as an element of F_{2^128}. Since there are 128 · n = 128 · 2^m coefficients and only 2^m evaluation points, E_{β_{m+64}+V_m} is exactly an additive FFT at the points β_{m+64} + V_{m+7}, truncated to 1/128 of its points. The basis conversion proceeds straightforwardly with Alg. 3. In the notation of Alg. 1, the evaluation of a(x) at β_{m+64} + V_m is accomplished with the scalar displacement α = β_{m+64}, keeping only the first 1/128 of the evaluation points V_{m+7}.

C.3  Encoding: the First Seven Layers of the Truncated FFT
We can also translate the truncated FFT into an "encoding" of the input coefficients, corresponding to the first 7 layers of butterflies, followed by a normal FFT_LCH over F_{2^128}[x]. The encoding is the result of a linear map applied to the inputs (a_{j·n+i})_{j=0}^{127}, which is a 128 × n transposition of the input data.

C.4  Results

Table 3 shows the results of our implementation of multiplication with the additive-FFT Frobenius partition and the comparison with the implementation of [VDHLL17].

Table 3: Multiplications with Frobenius partitions. Products of polynomials of degree < d in F_2[x] on Intel Xeon E3-1245 v3 @ 3.40GHz (in 10^{-6} sec.), for log₂ d = 15, ..., 23, comparing our additive-FFT version^a against F_{2^60} [VDHLL17]^c.
^a Version 2ae2461, https://github.com/fast-crypto-lab/bitpolymul
^c SVN r10681, available from svn://scm.gforge.inria.fr/svn/mmx

D  Profiles: Raw Data

Table 4 shows the raw running times of the various components of our implementations. Table 5 shows the performance of our implementations on the Intel Skylake architecture.

Table 4: Running time of various components, in 10^{-6} sec.

log₂ d                    15     16     17     18     19     20      21
F̃: chRepr                971   1955   3482   7226  13991  27550   53982
F̃: BasisCvt              823   1846   4055   8398  17598  35611   80913
F̃: iFFT_LCH              923   1960   4402  11782  34244  75693  161648
F̃ (2nd version): chRepr  924   1866   3890   7172  13237  25965   52275
F: pointmul               317    635   1297   2863   7250  10278   21188

Table 5: Products of polynomials of degree < d in F_2[x] on Intel Xeon E3-1275 v5 @ 3.60GHz (in 10^{-6} sec.)

log₂ d                 15   16   17   18    19    20    21    22    23
This work, F̃
This work, F
gf2x [BGTZ08] ^a       11   23   51  111   250   507  1182  2614  6195
F_{2^60} [HvdHL16] ^b  10   22   51  116   217   533   885  2286  5301

^a Version 1.2, available from http://gf2x.gforge.inria.fr/
^b SVN r10663, available from svn://scm.gforge.inria.fr/svn/mmx