Frobenius Additive Fast Fourier Transform
Wen-Ding Li, Ming-Shing Chen, Po-Chun Kuo, Chen-Mou Cheng, Bo-Yin Yang
Wen-Ding Li
Research Center of Information Technology and Innovation, Academia Sinica, [email protected]
Ming-Shing Chen
Research Center of Information Technology and Innovation, Academia Sinica, [email protected]
Po-Chun Kuo
Department of Electrical Engineering, National Taiwan University, [email protected]
Chen-Mou Cheng
Department of Electrical Engineering, National Taiwan University, [email protected]
Bo-Yin Yang
Institute of Information Science, Academia Sinica, [email protected]
ABSTRACT
In ISSAC 2017, van der Hoeven and Larrieu showed that evaluating a polynomial P ∈ F_q[x] of degree < n at all n-th roots of unity in F_{q^d} can essentially be computed d times faster than evaluating Q ∈ F_{q^d}[x] at all these roots, assuming F_{q^d} contains a primitive n-th root of unity [vdHL17a]. Termed the Frobenius FFT, this discovery has a profound impact on polynomial multiplication, especially for multiplying binary polynomials, which finds ample application in coding theory and cryptography. In this paper, we show that the theory of the Frobenius FFT beautifully generalizes to a class of additive FFT developed by Cantor and Gao-Mateer [Can89, GM10]. Furthermore, we demonstrate the power of the Frobenius additive FFT for q = 2: to multiply two binary polynomials whose product is of degree < 256, the new technique requires 29,005 bit operations, improving on the best previously reported results.

CCS CONCEPTS
• Mathematics of computing → Computations in finite fields;

KEYWORDS
additive FFT, Frobenius FFT, polynomial multiplication
1 INTRODUCTION

Let F_{q^d} be the finite field of q^d elements, and let ξ ∈ F_{q^d} be a primitive n-th root of unity. The (discrete) Fourier transform of a polynomial P ∈ F_{q^d}[x] with degree < n is (P(1), P(ξ), P(ξ^2), …, P(ξ^{n−1})), namely, the evaluation of P at all n-th roots of unity. How to efficiently compute the Fourier transform not only is an important problem in its own right but also finds a wide variety of applications. As a result, there is a long line of research aiming to find what is termed a "fast" Fourier transform, or FFT for short, for various situations. Arguably, one of the most important applications of FFT is fast polynomial multiplication. In particular, in the case of q = 2 one needs to go to an extension field F_{2^d} in order to obtain a primitive n-th root of unity for any meaningful n, and in this case, it is well known that one can use the Kronecker method to efficiently compute binary polynomial multiplication [Can89, HvdHL16]. Such FFT-based techniques have better asymptotic complexity compared with school-book and Karatsuba algorithms. However, it is conventional wisdom that FFT is not suitable for polynomial multiplication of small degrees because of the large hidden constant in the big-O notation [FH15]. We recall that the Frobenius map ϕ: x ↦ x^q fixes F_q in any of its extension fields F_{q^d}, and hence ∀P ∈ F_q[x], ∀a ∈ F_{q^d}, P(ϕ(a)) = ϕ(P(a)). In ISSAC 2017, van der Hoeven and Larrieu showed how to use the Frobenius map to speed up the Fourier transform of P ∈ F_q[x] essentially by a factor of d over that of Q ∈ F_{q^d}[x], and hence avoid the factor-of-two loss incurred by the Kronecker method [vdHL17a]. However, the Frobenius FFT is complicated, especially when the Cooley-Tukey algorithm is used for a (highly) composite n.
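Concretely, a binary polynomial can be packed into a machine integer with bit i holding the coefficient of x^i, so that addition is XOR and multiplication is carryless. A minimal schoolbook sketch in Python (the name clmul is ours, not from the literature) of the baseline that Kronecker- and FFT-based methods compete against:

```python
def clmul(a: int, b: int) -> int:
    """Product in GF(2)[x] of two binary polynomials packed as
    integers: bit i of an input is the coefficient of x^i."""
    r = 0
    while b:
        if b & 1:
            r ^= a      # add (XOR) a shifted copy of a per set coefficient
        a <<= 1
        b >>= 1
    return r

# (x^2 + x + 1)(x + 1) = x^3 + 1 over F_2
assert clmul(0b111, 0b11) == 0b1001
```

Schoolbook carryless multiplication takes quadratically many such XOR/AND steps, which is exactly the cost profile that the algorithms below improve on.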
One of the reasons behind this might be that the Galois group of F_{q^n} over F_q is generated by the Frobenius map and is isomorphic to a cyclic subgroup of the multiplicative group of units of Z/nZ, whereas the Cooley-Tukey algorithm works by decomposing the additive group Z/nZ. The complicated interplay between these two group structures can bring a lot of headaches to implementers.

In his seminal work, Cantor showed how to evaluate a polynomial on certain additive subgroups of a tower of Artin-Schreier extensions of a finite field and gave an O(n lg^2(n)) FFT algorithm based on polynomial division [Can89]. An Artin-Schreier extension of a finite field F_q of characteristic p is a degree-p Galois extension of F_q. In this paper, we restrict our discussion to the case of p = 2. Based on Cantor's construction, Gao and Mateer gave a Cooley-Tukey-style algorithm whose complexity is O(n lg(n) lg lg(n)) when d is a power of two [GM10], using which Chen et al. achieved competitive performance compared with other state-of-the-art binary polynomial multiplications [CCK+17], based on the "bitslice" software technique on embedded devices. However, so far most of the techniques for small degrees were based on the Karatsuba algorithm or its generalization to n-way splits. By applying the Frobenius additive Fourier transform instead of the Kronecker method, we show that we can break the record for the number of bit operations even at polynomial size 231. To the best of our knowledge, it is the first time an FFT-based method is shown to be competitive at such small degrees. We also implement a code generator that outputs procedures for multiplying two polynomials, publicly available at https://github.com/fast-crypto-lab/Frobenius_AFFT

The rest of this paper is organized as follows. In Section 2, we will review the relevant background information. In Section 3, we will define the Frobenius additive Fourier transform and show some of its important properties.
In Section 4, we conclude by showing how we apply the Frobenius additive FFT to binary polynomial multiplication and achieve a new record.

2 PRELIMINARIES

Let F_{2^d} denote a binary extension field, and let v_d = (v_0, v_1, …, v_{d−1}). We call v_d a basis for F_{2^d} if v_0, v_1, …, v_{d−1} are linearly independent over F_2. Throughout this paper, we often represent an element ω_i of a binary extension field as ω_i = i_0 v_0 + i_1 v_1 + ⋯ + i_{m−1} v_{m−1}, where i = i_0 + 2 i_1 + 4 i_2 + ⋯ + 2^{m−1} i_{m−1} and i_j ∈ {0, 1} for all 0 ≤ j < m, with the basis elements v_0, v_1, …, v_{m−1} inferred from the context. In his seminal work, Cantor presented a sequence of explicit and computationally useful bases for binary extension fields [Can89].

Definition 2.1.
Given a sequence u_0, u_1, u_2, … of elements from the algebraic closure of F_2 satisfying

u_i^2 + u_i = (u_0 u_1 ⋯ u_{i−1}) + [a sum of monomials of lower degree],

where each "monomial of lower degree" has the form u_0^{j_0} u_1^{j_1} ⋯ u_{i−1}^{j_{i−1}} such that ∀ 0 ≤ k < i, j_k ∈ {0, 1} and ∃k, j_k = 0. Then a Cantor basis v_d = (v_0, …, v_{d−1}) for F_{2^d} is defined as v_i = u_0^{i_0} u_1^{i_1} ⋯ u_{k−1}^{i_{k−1}}, where i = i_0 + 2 i_1 + ⋯ + 2^{k−1} i_{k−1} and d = 2^k.

If we set F_{2^{2^k}} = F_2(u_0, u_1, …, u_{k−1}) for k = 1, 2, …, then with Cantor's construction, we arrive at a tower of Artin-Schreier extensions of F_2. For example, the following tower of extension fields of F_2 is one such construction:

F_{2^2} := F_2[u_0]/(u_0^2 + u_0 + 1),
F_{2^4} := F_{2^2}[u_1]/(u_1^2 + u_1 + u_0),
F_{2^8} := F_{2^4}[u_2]/(u_2^2 + u_2 + u_0 u_1),
F_{2^16} := F_{2^8}[u_3]/(u_3^2 + u_3 + u_0 u_1 u_2),
...

In this case, the Cantor basis for, e.g., F_{2^16} is v = (1, u_0, u_1, u_0 u_1, u_2, u_0 u_2, u_1 u_2, …, u_0 u_1 u_2 u_3). In this paper, we will focus on additive Fourier transforms with respect to Cantor bases.
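The first two levels of this tower can be checked directly. The Python sketch below uses a toy encoding of our own (not from the paper): an element of F_{2^4} is a 4-bit integer whose bits are its coordinates on the Cantor basis (1, u_0, u_1, u_0 u_1), so that addition is plain XOR.

```python
# Toy model of the first two tower levels:
# F_{2^2} = F_2[u0]/(u0^2+u0+1) and F_{2^4} = F_{2^2}[u1]/(u1^2+u1+u0).

def mul4(x, y):
    """Multiply in F_{2^2}; bits are coordinates on (1, u0), u0^2 = u0 + 1."""
    a0, a1, b0, b1 = x & 1, x >> 1, y & 1, y >> 1
    lo = a0 & b0 ^ a1 & b1                 # constant term (u0^2 folds to u0+1)
    hi = a0 & b1 ^ a1 & b0 ^ a1 & b1       # coefficient of u0
    return lo | hi << 1

def mul16(x, y):
    """Multiply in F_{2^4}; low/high bit pairs are F_{2^2} coordinates on
    (1, u1), with u1^2 = u1 + u0."""
    a0, a1, b0, b1 = x & 3, x >> 2, y & 3, y >> 2
    t = mul4(a1, b1)                       # multiplies u1^2 = u1 + u0
    lo = mul4(a0, b0) ^ mul4(t, 2)         # ... its u0 part folds down
    hi = mul4(a0, b1) ^ mul4(a1, b0) ^ t   # ... its u1 part stays up
    return lo | hi << 2

u0, u1 = 0b0010, 0b0100
assert mul4(2, 2) ^ 2 == 1          # u0^2 + u0 = 1 in F_{2^2}
assert mul16(u1, u1) ^ u1 == u0     # u1^2 + u1 = u0, the tower relation
assert mul16(u0, u1) == 0b1000      # u0*u1 is the fourth Cantor basis element
```

The bit-per-basis-vector convention is also what makes Cantor bases convenient in software: the subset sum ω_i of basis elements is literally the integer i.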
We will use the bit complexity model for finite field arithmetic unless stated otherwise. We use M_q(d) to denote the complexity of multiplication of polynomials of degree < d over F_q. Currently, the best known bound is M_q(d) = O(d log q · log(d log q) · 8^{log*(d log q)}), where log*(·) is the iterated logarithm function [HvdHL17]. It is conventional to assume that M_q(d)/d is an increasing function of d. We will denote by M(d) the bit complexity of multiplying two elements of F_{2^d} represented in a Cantor basis. Since we can use the modular decomposition technique [vdHL17b, JKR12] to convert between F_{2^d} and F_2[x] representations with O(d lg d) bit operations and then perform polynomial multiplication, we have M(d) = O(M_2(d)). We also assume that M(d)/d is an increasing function of d for Cantor bases. We use A(d) to denote the complexity of adding two elements of F_{2^d}; as usual, this is O(d). Note that in some cases, Cantor's construction allows more efficient multiplication. For example, given α, β ∈ F_{2^{2^k}} := F_{2^{2^{k−1}}}[u_{k−1}]/(u_{k−1}^2 + u_{k−1} + ζ), if α happens to lie in the (proper) subfield F_{2^{2^{k−1}}}, then the product of α and β can be computed using only two multiplications in F_{2^{2^{k−1}}}: the cost becomes 2 M(2^{k−1}) instead of M(2^k). As we shall see, we often multiply elements from different extension fields of F_2, so Cantor's trick plays an important role in reducing bit complexity.

Let v_d = (v_0, v_1, v_2, …, v_{d−1}) be a basis of F_{2^d}. Let n = 2^m with m ≤ d. Now consider a polynomial P ∈ F_{2^d}[x]_{<n}, where F_{2^d}[x]_{<n} := {P ∈ F_{2^d}[x] : deg(P) < n}. We will define the additive Fourier transform
AFT_n(P) with respect to a basis v_d to be

AFT_n(P) = (P(ω_0), P(ω_1), P(ω_2), …, P(ω_{n−1})).

Recall that ω_i = Σ_{j=0}^{m−1} i_j · v_j, where i = Σ_{j=0}^{m−1} i_j · 2^j and i_j ∈ {0, 1}.

Consider a basis v = (v_i)_{i=0}^{d−1} with all v_i ∈ F_{2^d}. Let W_k := span{v_0, v_1, …, v_{k−1}} = {Σ_{j∈S} v_j | S ⊆ {0, 1, 2, …, k−1}} denote a k-dimensional subspace of F_{2^d}, where k ≤ d. These W_k satisfy {0} = W_0 ⊂ W_1 ⊂ ⋯ ⊂ W_d = F_{2^d} and form a sequence of subspaces. We define W_0 = {0} for convenience later.

Definition 2.2.
Given a subspace W_k of F_{2^d}, the subspace polynomial is defined as s_k(x) := Π_{a ∈ W_k} (x − a).

Lemma 2.3. s_k(x) is a linearized polynomial: s_k(x + y) = s_k(x) + s_k(y) for all x, y ∈ F_{2^d}.

As in [Can89, GM10, LANH16], we will consider Cantor bases to construct an efficient algorithm. For the rest of this subsection, we list properties of subspace polynomials with respect to Cantor bases. These properties were proven in [Can89] and are necessary for deriving the algorithm later.

Lemma 2.4. For a Cantor basis v_d, s_k(v_k) = 1 for 0 ≤ k < d.

Given a function f, denote by f^{∘i} = f ∘ f ∘ ⋯ ∘ f the composition of f with itself i times.

Lemma 2.5. The subspace polynomials with respect to a Cantor basis can be written in recursive form:

s_0(x) = x,
s_1(x) = x^2 + x,
s_j(x) = s_{j−1}(x) · s_{j−1}(x − v_{j−1}) = s_{j−1}(x)^2 + s_{j−1}(x) = s_1(s_{j−1}(x)) = s_1^{∘j}(x).

Lemma 2.6.
Given d a power of two and a Cantor basis v_d, then s_1(v_i) = v_i^2 + v_i = v_{i−1} + α, where α ∈ W_{i−1}, for i > 0.

Lemma 2.7.
Given a Cantor basis v_d, ∀ 0 ≤ j ≤ k < d, s_j(v_k) + v_{k−j} ∈ W_{k−j}.

Lemma 2.8.
For a subspace polynomial s_k(x) with respect to a Cantor basis v_d,

s_k(x) = Σ_{i=0}^{k} s_{k,i} x^{2^i}, where s_{k,i} ∈ F_2 for 0 ≤ i ≤ k.

If k is a power of 2, then s_k(x) = x^{2^k} + x.

Here we introduce the polynomial basis proposed in [LCH14], which we call the novel polynomial basis. The authors propose an additive fast Fourier transform for polynomials represented in this basis.

Definition 2.9.
Given a basis v_d, its subspace polynomials (s_0, s_1, …, s_{d−1}), and n = 2^d, define the corresponding novel polynomial basis to be the polynomials (X_k)_{k=0}^{n−1}, where

X_k(x) := Π_i (s_i(x))^{b_i}, with k = Σ_{i=0}^{d−1} b_i 2^i and b_i ∈ {0, 1}.

Note that X_{2^i}(x) = s_i(x) and deg(s_i(x)) = 2^i for all i, so deg(X_k(x)) = k for all k. Thus, any polynomial P ∈ F_{2^d}[x]_{<n} can be represented in the novel polynomial basis as P = p_0 X_0(x) + p_1 X_1(x) + p_2 X_2(x) + ⋯ + p_{n−1} X_{n−1}(x), where all p_i ∈ F_{2^d}.

To convert between the monomial basis and the novel polynomial basis, we can simply recursively divide by s_k(x). Thus the cost of naive basis conversion is O(n (lg n)^2) additions in F_{2^d}. However, more efficient basis conversion algorithms with respect to Cantor bases were proposed in [GM10] and [LANH16]. We show the algorithm from [LANH16] in Algorithm 1; it requires only O(n lg(n) lg(lg(n))) additions in F_{2^d}. It is easy to see that for a polynomial with coefficients in F_2, basis conversion from the monomial to the novel polynomial basis gains a factor of d because each addition in F_2 costs A(1) instead of A(d).

BasisConversion(f(x)):
  input: f(x) = f_0 + f_1 x + ⋯ + f_{n−1} x^{n−1}
  output: g(X) = g_0 + g_1 X_1(x) + ⋯ + g_{n−1} X_{n−1}(x) with f(x) = g(X)
  if deg(f(x)) ≤ 1 then return g(X) = f_0 + X_1 f_1;
  Let k = max{i : deg(s_i(x)) ≤ deg(f(x))}.
  Let y = s_k(x).
  Write f(x) = h′(y) = q′_0(x) + q′_1(x) y + q′_2(x) y^2 + ⋯, where the coefficients of h′(y) are polynomials q′_i(x) of degree < 2^k.
  h(Y) ← BasisConversion(h′(y))
  Then we have h(Y) = q_0(x) + q_1(x) X_{2^k} + q_2(x) X_{2^{k+1}} + ⋯
  g_i(X) ← BasisConversion(q_i(x)) for all q_i(x).
  return g_0(X) + g_1(X) X_{2^k} + g_2(X) X_{2^{k+1}} + ⋯

Algorithm 1: Basis conversion from the monomial to the novel polynomial basis constructed from a Cantor basis
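To make the conversion concrete, the following Python sketch (a naive one-division-per-level version, not the fast [LANH16] routine; all function names are ours) converts a GF(2)[x] polynomial, packed bitwise into an integer, to the novel basis and verifies the result by symbolic reconstruction.

```python
def sq(p):
    """Square in GF(2)[x]: spread the coefficient bits apart."""
    r, i = 0, 0
    while p:
        if p & 1:
            r |= 1 << (2 * i)
        p >>= 1
        i += 1
    return r

def s(k):
    """Subspace polynomial s_k for a Cantor basis, via Lemma 2.5."""
    p = 0b10                  # s_0(x) = x
    for _ in range(k):
        p = sq(p) ^ p         # s_j = s_{j-1}^2 + s_{j-1}
    return p

def divmod2(f, g):
    """Quotient and remainder in GF(2)[x]."""
    q, dg = 0, g.bit_length() - 1
    while f and f.bit_length() - 1 >= dg:
        sh = f.bit_length() - 1 - dg
        q |= 1 << sh
        f ^= g << sh
    return q, f

def to_novel(f, m):
    """Coefficients of f on (X_0 .. X_{2^m - 1}), for deg f < 2^m:
    split f = q0 + q1 * s_{m-1} and recurse on the halves, using
    X_{2^{m-1} + j} = s_{m-1} * X_j."""
    if m == 0:
        return [f & 1]
    q1, q0 = divmod2(f, s(m - 1))
    return to_novel(q0, m - 1) + to_novel(q1, m - 1)

def pmul(a, b):               # GF(2)[x] product
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def X(k):
    """X_k = product of s_i over the set bits b_i of k (Definition 2.9)."""
    p, i = 1, 0
    while k:
        if k & 1:
            p = pmul(p, s(i))
        k >>= 1
        i += 1
    return p

f = 0b10110111                # x^7 + x^5 + x^4 + x^2 + x + 1, degree < 8
g = to_novel(f, 3)
recon = 0
for k, gk in enumerate(g):
    if gk:
        recon ^= X(k)
assert recon == f             # conversion is exact
assert all(X(k).bit_length() - 1 == k for k in range(8))  # deg X_k = k
```

Note that s(2) = x^4 + x and s(4) = x^16 + x, matching Lemma 2.8 for k a power of two.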
Given a polynomial P represented in the novel polynomial basis, Lin, Chung, and Han [LCH14] proposed a fast method to compute its additive Fourier transform. Given a basis v_d of the finite field F_{2^d}, we construct the novel polynomial basis (X_0(x), X_1(x), …, X_{2^d−1}(x)) accordingly. Then, given a polynomial P ∈ F_{2^d}[x]_{<2^k} represented in the novel polynomial basis, P(x) = p_0 X_0(x) + p_1 X_1(x) + ⋯ + p_{2^k−1} X_{2^k−1}(x), we denote

AFFT(k, P(x), α) = (P(ω_i + α))_{i=0}^{2^k−1}, where P(ω_i + α) = Σ_{0≤j<2^k} p_j X_j(ω_i + α),

with k ≤ d and α ∈ F_{2^d}. Now, let n′ = 2^{k−1} and write i = n′·i_1 + i_0 with i_1 ∈ {0, 1} and 0 ≤ i_0 < n′. Then

P(ω_i + α) = P(ω_{n′·i_1 + i_0} + α)
= Σ_{0≤j_1<2} Σ_{0≤j_0<n′} p_{n′·j_1 + j_0} X_{n′·j_1 + j_0}(ω_{n′·i_1 + i_0} + α)
= Σ_{0≤j<n′} (p_j + s_{k−1}(ω_{n′·i_1 + i_0} + α) · p_{n′+j}) X_j(ω_{n′·i_1 + i_0} + α)
= Σ_{0≤j<n′} (p_j + s_{k−1}(ω_{n′·i_1} + α) · p_{n′+j}) X_j(ω_{i_0} + (α + ω_{n′·i_1})),

where the last step uses the linearity of s_{k−1} and the fact that s_{k−1} vanishes on W_{k−1} ∋ ω_{i_0}. We can see that an AFFT with input polynomial of degree < 2^k reduces to two AFFTs with input polynomials of degree < 2^{k−1}, one for i_1 = 0 and one for i_1 = 1.

FFT(k, P(x), α):
  input: P(x) = p_0 X_0(x) + p_1 X_1(x) + ⋯ + p_{2^k−1} X_{2^k−1}(x), all p_i ∈ F_{2^d}; α ∈ F_{2^d}; k ≤ d
  output: (P(ω_0 + α), P(ω_1 + α), …, P(ω_{2^k−1} + α)).
  if k = 0 then return p_0;
  // Decompose P(x) = P_0(x) + s_{k−1}(x) · P_1(x).
  P_0(x) ← p_0 X_0(x) + p_1 X_1(x) + ⋯ + p_{2^{k−1}−1} X_{2^{k−1}−1}(x)
  P_1(x) ← p_{2^{k−1}} X_0(x) + p_{2^{k−1}+1} X_1(x) + ⋯ + p_{2^k−1} X_{2^{k−1}−1}(x)
  Q_0(x) ← P_0(x) + s_{k−1}(α) · P_1(x).
  Q_1(x) ← Q_0(x) + s_{k−1}(v_{k−1}) · P_1(x).
  return AFFT(k−1, Q_0(x), α) ∥ AFFT(k−1, Q_1(x), v_{k−1} + α)

Algorithm 2: Additive FFT in the novel polynomial basis from [LCH14]

Note that if we use a Cantor basis, then s_{k−1}(v_{k−1}) = 1 by Lemma 2.4, so Q_1(x) = Q_0(x) + P_1(x). Given P ∈ F_{2^d}[x]_{<n} represented in the monomial basis with n = 2^m, its additive Fourier transform AFT_n(P) can be computed as follows. We first perform basis conversion to get the p_i such that P(x) = p_0 X_0(x) + p_1 X_1(x) + ⋯ + p_{2^m−1} X_{2^m−1}(x). Then we perform AFFT(m, P(x), 0). Thus, to compute AFT_n(P) using AFFT, the maximum depth of recursion is m, and the algorithm performs n/2 multiplications and n additions at each depth of recursion. Therefore the cost of the algorithm is O(n lg(n)(M(d) + A(d))), where n = 2^m is the number of terms.

3 FROBENIUS ADDITIVE FFT

Let P be a polynomial in F_2[x] and v_d a basis of F_{2^d}. We define the Frobenius map ϕ: x ↦ x^2. Notice that P(ϕ(a)) = ϕ(P(a)) for all a ∈ F_{2^d}. The core idea of the Frobenius Fourier transform is to evaluate P at a minimal number of points such that all other values can be computed by applying the Frobenius map ϕ; this is possible because we now consider polynomials in F_2[x] ⊂ F_{2^d}[x]. The set of those points is called a cross section [vdHL17a]. Formally, given a set W ⊆ F_{2^d}, a subset Σ ⊆ W is called a cross section of W if for every w ∈ W, there exists exactly one σ ∈ Σ such that ϕ^{∘j}(σ) = w for some j.

Let v_d denote a basis of F_{2^d}. Given a polynomial P ∈ F_2[x]_{<n}, where n = 2^m, AFT_n(P) is the evaluation of P at the points of W_m = {ω_0, ω_1, ω_2, …, ω_{2^m−1}}. To perform the Frobenius additive Fourier transform, we partition W_m into disjoint orbits under ϕ. If there exists a subset Σ of W_m that contains exactly one element of each orbit, that is, Σ is a cross section of W_m, then the Frobenius map allows us to recover AFT_n(P) from the evaluations of P at the points of Σ.
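Before specializing to cross sections, Algorithm 2 itself can be exercised end to end in a toy field. The Python sketch below (encoding and names are ours, not the paper's) runs the additive FFT in F_{2^4} with the Cantor basis (v_0, v_1, v_2, v_3) = (1, u_0, u_1, u_0 u_1), packed one coordinate per bit so that addition is XOR and the evaluation point ω_i is literally the integer i, and checks the output against direct evaluation.

```python
def mul4(x, y):  # F_{2^2} product, u0^2 = u0 + 1
    a0, a1, b0, b1 = x & 1, x >> 1, y & 1, y >> 1
    return (a0 & b0 ^ a1 & b1) | (a0 & b1 ^ a1 & b0 ^ a1 & b1) << 1

def mul16(x, y):  # F_{2^4} product, u1^2 = u1 + u0
    a0, a1, b0, b1 = x & 3, x >> 2, y & 3, y >> 2
    t = mul4(a1, b1)
    return (mul4(a0, b0) ^ mul4(t, 2)) | (mul4(a0, b1) ^ mul4(a1, b0) ^ t) << 2

def s_eval(k, a):
    """s_k(a) via Lemma 2.5: apply s_1(x) = x^2 + x exactly k times."""
    for _ in range(k):
        a = mul16(a, a) ^ a
    return a

def afft(k, p, alpha):
    """Algorithm 2: values of sum_j p[j]*X_j at the 2^k points omega_i + alpha."""
    if k == 0:
        return [p[0]]
    h = 1 << (k - 1)
    sa = s_eval(k - 1, alpha)
    q0 = [p[j] ^ mul16(sa, p[h + j]) for j in range(h)]
    q1 = [q0[j] ^ p[h + j] for j in range(h)]   # s_{k-1}(v_{k-1}) = 1 (Lemma 2.4)
    return afft(k - 1, q0, alpha) + afft(k - 1, q1, h ^ alpha)  # v_{k-1} is bit k-1

def X_at(j, a):
    """X_j evaluated at a: product of s_i(a) over the set bits b_i of j."""
    r, i = 1, 0
    while j:
        if j & 1:
            r = mul16(r, s_eval(i, a))
        j >>= 1
        i += 1
    return r

import random
random.seed(1)
p = [random.randrange(16) for _ in range(16)]
direct = [0] * 16
for i in range(16):
    for j in range(16):
        direct[i] ^= mul16(p[j], X_at(j, i))
assert afft(4, p, 0) == direct
```

Because s_{k−1}(v_{k−1}) = 1 on a Cantor basis, the second half q1 costs no multiplications at all, only XORs.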
We denote by {P(σ) | σ ∈ Σ} the Frobenius additive Fourier transform (FAFT) of the polynomial P. Exactly evaluating a polynomial at the points of a cross section, so as to reduce the complexity of the discrete Fourier transform by a factor of d, is certainly not easy, as can be seen in [vdHL17a]. However, for the additive Fourier transform proposed by Cantor, we will show that there exists a cross section that naturally supports a truncated method (as in the truncated FFT), so that we evaluate only at those points. In other words, there exists a cross section suited to the structure of the additive FFT that lets us obtain a fast algorithm.

Consider the field F_{2^d} with Cantor basis v_d, for d a power of two. We have ϕ(v_0) = v_0, and ϕ(v_i) = v_i^2 = v_i + v_{i−1} + α, where α ∈ W_{i−1}, for i > 0. We now construct a cross section Σ for F_{2^d}. We recall that the Frobenius map ϕ on F_{2^d} generates the (cyclic) Galois group Gal(F_{2^d}/F_2) of order [F_{2^d} : F_2] = d, which naturally acts on F_{2^d} by taking α ∈ F_{2^d} to ϕ(α). The orbit of α under this action is thus Orb_α = {σ(α) : σ ∈ Gal(F_{2^d}/F_2)}.
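These orbits are easy to enumerate in a toy field. Reusing the 4-bit Cantor-basis encoding of F_{2^4} from Section 2 (an encoding of ours, not the paper's), the following sketch partitions F_{2^4} into Frobenius orbits and checks the commutation P(ϕ(a)) = ϕ(P(a)) for a binary polynomial P.

```python
def mul4(x, y):  # F_{2^2} product, u0^2 = u0 + 1
    a0, a1, b0, b1 = x & 1, x >> 1, y & 1, y >> 1
    return (a0 & b0 ^ a1 & b1) | (a0 & b1 ^ a1 & b0 ^ a1 & b1) << 1

def mul16(x, y):  # F_{2^4} product, u1^2 = u1 + u0
    a0, a1, b0, b1 = x & 3, x >> 2, y & 3, y >> 2
    t = mul4(a1, b1)
    return (mul4(a0, b0) ^ mul4(t, 2)) | (mul4(a0, b1) ^ mul4(a1, b0) ^ t) << 2

def phi(x):       # the Frobenius map x -> x^2
    return mul16(x, x)

orbits, seen = [], set()
for a in range(16):
    if a in seen:
        continue
    orb, b = [], a
    while b not in orb:
        orb.append(b)
        b = phi(b)
    orbits.append(orb)
    seen.update(orb)

# singletons {0} and {1}, one orbit of size 2 in W_2 \ W_1, three of size 4
assert sorted(len(o) for o in orbits) == [1, 1, 2, 4, 4, 4]

def eval2(P, a):  # P in F_2[x], packed bitwise in an int, evaluated at a
    r, pw = 0, 1
    while P:
        if P & 1:
            r ^= pw
        pw = mul16(pw, a)
        P >>= 1
    return r

# a binary polynomial commutes with Frobenius: P(phi(a)) = phi(P(a))
assert all(eval2(0b1101011, phi(a)) == phi(eval2(0b1101011, a)) for a in range(16))
```

The orbit-size multiset {1, 1, 2, 4, 4, 4} matches Lemma 3.1, so a cross section of W_4 here has 6 points instead of 16.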
Lemma 3.1. Given a Cantor basis v_d, ∀k > 0, ∀w ∈ W_{k+1} \ W_k, |Orb_w| = 2^{⌊lg k⌋+1}.

Proof. Let ℓ = ⌊lg k⌋, so that 2^ℓ ≤ k < 2^{ℓ+1} and v_k = u_ℓ u_{ℓ−1}^{j_{ℓ−1}} ⋯ u_0^{j_0}, with j_i ∈ {0, 1} for 0 ≤ i < ℓ. Since w ∈ W_{k+1} \ W_k, we can write w = v_k + α = u_ℓ u_{ℓ−1}^{j_{ℓ−1}} ⋯ u_0^{j_0} + α for some α ∈ W_k. Obviously the splitting field of w is F_{2^{2^{ℓ+1}}} = F_2(u_0, u_1, …, u_ℓ), so the stabilizer of w is the subgroup of Gal(F_{2^d}/F_2) generated by ϕ^{∘2^{ℓ+1}}. It follows immediately from the orbit-stabilizer theorem and Lagrange's theorem that |Orb_w| = 2^{⌊lg k⌋+1}. □

Moreover, we can further characterize the orbit of w ∈ W_{k+1} \ W_k using the following lemma.

Lemma 3.2. Given a Cantor basis v_d, ∀k > 0, consider the orbit of w ∈ W_{k+1} \ W_k under the action of Gal(F_{2^d}/F_2). Then for every choice of j_i ∈ {0, 1}, i = 0, 1, 2, …, ⌊lg k⌋, there is precisely one element w′ ∈ Orb_w such that w′ = v_k + j′_{k−1} v_{k−1} + ⋯ + j′_0 v_0 ∈ W_{k+1} \ W_k, with all j′_i ∈ {0, 1} and j′_{k−2^i} = j_i for i = 0, 1, 2, …, ⌊lg k⌋.

Proof. Let ℓ be a power of two. From Lemma 2.8, we have ϕ^{∘ℓ}(x) = x^{2^ℓ} = s_ℓ(x) + x. From Lemma 2.7, we see that ϕ^{∘ℓ}(w) + w = s_ℓ(w) ∈ W_{k−ℓ+1} \ W_{k−ℓ}; that is, ϕ^{∘ℓ}(w) + w ∈ v_{k−ℓ} + W_{k−ℓ}, so applying ϕ^{∘ℓ} flips the coefficient of v_{k−ℓ} while disturbing only coordinates below position k − ℓ. Let ℓ = 1: ϕ gives us w and ϕ(w), one of which has j′_{k−1} = 0 and the other j′_{k−1} = 1, and both lie in W_{k+1} \ W_k. Now let ℓ = 2: we can use ϕ^{∘2} to obtain w and ϕ^{∘2}(w), one of which has j′_{k−2} = 0 and the other j′_{k−2} = 1, while both have the same j′_{k−1}; similarly for ϕ(w) and ϕ^{∘2}(ϕ(w)). Let ℓ = 4: we can use ϕ^{∘4} to obtain pairs differing in j′_{k−4} but with the same j′_{k−1} and j′_{k−2}. Continuing in this way, we obtain all combinations of j_i ∈ {0, 1} for i = 0, 1, 2, …, ⌊lg k⌋. However, as |Orb_w| = 2^{⌊lg k⌋+1}, each such combination can appear in Orb_w precisely once by the pigeonhole principle. □

Now we can explicitly construct a cross section. Let Σ_0 = {0}, and ∀i > 0, let

Σ_i = { v_{i−1} + j_{i−2} v_{i−2} + ⋯ + j_0 v_0 : j_k = 0 if (i−1) − k is a power of 2, j_k ∈ {0, 1} otherwise }.

Theorem 3.3. Σ_i is a cross section of W_i \ W_{i−1}. That is, ∀i > 0, ∀w ∈ W_i \ W_{i−1}, there exists exactly one σ ∈ Σ_i such that ϕ^{∘j}(σ) = w for some j.

Proof. First, any two elements of Σ_i are in different orbits for any i; this is a corollary of Lemma 3.2. Next, we know that ∀w ∈ W_i \ W_{i−1}, |Orb_w| = 2^{⌊lg(i−1)⌋+1}, and ϕ^{∘j}(w) ∈ W_i \ W_{i−1} for all j. So each orbit generated by an element of Σ_i has size 2^{⌊lg(i−1)⌋+1}, and 2^{⌊lg(i−1)⌋+1} · |Σ_i| = 2^{i−1} = |W_i \ W_{i−1}|. By the pigeonhole principle, each element of W_i \ W_{i−1} must be in the orbit generated by exactly one element of Σ_i. □

With Theorem 3.3, a cross section of W_m is Σ_0 ∪ Σ_1 ∪ Σ_2 ∪ ⋯ ∪ Σ_m.

Given P(x) ∈ F_2[x]_{<n} represented in the novel polynomial basis with respect to a Cantor basis v_d of the field F_{2^d}, where n = 2^m, instead of computing AFT_n(P) = (P(ω_0), P(ω_1), …, P(ω_{2^m−1})), we only need to compute FAFT_n(P) = {P(σ) : σ ∈ Σ_0 ∪ Σ_1 ∪ Σ_2 ∪ ⋯ ∪ Σ_m} and then use the Frobenius map ϕ to get the rest. Due to the structure of the additive FFT, we can simply 'truncate' to those points. In the original additive FFT (Algorithm 2), each FAFFT calls two
FAFFT routines recursively. Those two FAFFT calls correspond to evaluating the points in α + W_{k−1} and α + v_{k−1} + W_{k−1}, respectively. When Σ ∩ (α + v_{k−1} + W_{k−1}) = ∅, we can omit the second call and only compute α + W_{k−1}, so that we never evaluate at points outside the cross section Σ. We thus obtain Algorithm 3. It is easy to see that FAFFT(m, P(x), 1, v_m) computes {P(σ) : σ ∈ Σ_{m+1}}, because truncation happens exactly when the v_{m−l} component is zero for all points in Σ_{m+1}, i.e., when l is a power of two. To compute FAFT(P), we call FAFFT(m, P(x), 0, 0).

Fig. 1 is a graphical illustration of the FAFFT(5, f, 0, 0) routine, which computes FAFT(f) where f = g_0 X_0(x) + g_1 X_1(x) + g_2 X_2(x) + ⋯ + g_31 X_31(x). It consists of 5 layers corresponding to the recursive depth in the pseudocode. Each grey box is a 'butterfly unit' that performs a multiplication and an addition. A butterfly unit has two inputs a, b ∈ F_{2^d}. A normal butterfly unit has two outputs a′, b′, computed as

a′ ← a + b · s_k(α)
b′ ← a′ + b,

while a truncated one only outputs a′. In the figure, we denote the s_k(α) in each butterfly unit by c_{i,j}. Initially, the inputs g_0, g_1, …, g_31 to the butterfly units are all in F_2. But layer by layer, because the multiplicands c_{i,j} may lie in extension fields, the bit size of the inputs to the following butterfly units grows. For example, after the second layer, the lower half of the inputs are in F_{2^2} because the corresponding c_{i,j} lie in (W_2 \ W_1) ⊂ F_{2^2}; they then go through butterfly units with constants in (W_3 \ W_2) ⊂ F_{2^4} and come to be in F_{2^4}.

FAFFT(k, P(x), l, α):
  input: k ∈ N ∪ {0}; P(x) = p_0 X_0(x) + p_1 X_1(x) + ⋯ + p_{2^k−1} X_{2^k−1}(x), with p_i ∈ F_{2^{2^{⌈lg l⌉}}} if l > 0 and p_i ∈ F_2 otherwise; l ∈ N ∪ {0}; α ∈ W_{k+l} \ W_k if l > 0, α = 0 otherwise
  output: (P(σ))_{σ∈Σ}, where Σ = (Σ_0 ∪ Σ_1 ∪ ⋯ ∪ Σ_{k+l}) ∩ (α + W_k)
  if k = 0 then return p_0;
  Decompose P(x) = P_0(x) + s_{k−1}(x) · P_1(x).
  Q_0(x) ← P_0(x) + s_{k−1}(α) · P_1(x).
  Q_1(x) ← Q_0(x) + P_1(x).
  if l = 0 then
    return FAFFT(k−1, Q_0(x), l, α) ∥ FAFFT(k−1, Q_1(x), l+1, v_{k−1} + α)
  else if l is a power of two then
    return FAFFT(k−1, Q_0(x), l+1, α)
  else
    return FAFFT(k−1, Q_0(x), l+1, α) ∥ FAFFT(k−1, Q_1(x), l+1, v_{k−1} + α)
  end

Algorithm 3: Frobenius additive FFT in the novel polynomial basis.

Figure 1: Illustration of the butterfly network with n = 32. f(0), f(1) ∈ F_2, f(ω_2) ∈ F_{2^2}, f(ω_4), f(ω_8), f(ω_9) ∈ F_{2^4}, and f(ω_16), f(ω_18) ∈ F_{2^8}.

In this section, we analyze the complexity of
FAFFT in Algorithm 3. Let F_⊗(k, l) and F_A(k, l) denote the cost in multiplications and additions, respectively, of computing FAFFT(k, P(x), l, α).

First, it is straightforward to verify that every recursive call FAFFT(k′, P′(x), l′, α′) satisfies:
• α′ ∈ W_{k′+l′} \ W_{k′} if l′ > 0, and α′ = 0 otherwise;
• P′(x) = Σ p′_i X_i(x), with p′_i ∈ F_{2^{2^{⌈lg l′⌉}}} if l′ > 0 and p′_i ∈ F_2 otherwise;
• s_{k′−1}(α′) ∈ W_{l′+1} \ W_{l′}, which is contained in u_{lg l′} + F_{2^{l′}} if l′ is a power of two and in F_{2^{2^{⌈lg l′⌉}}} otherwise.

Then we have

F_⊗(k, l) =
  F_⊗(k−1, 0) + F_⊗(k−1, 1) + 2^{k−1} M(1)        if l = 0,
  F_⊗(k−1, l+1) + 2^{k−1} M(l)                    if l is a power of two,
  2 F_⊗(k−1, l+1) + 2^{k−1} M(2^{⌈lg l⌉})         otherwise.

Theorem 3.4 (multiplication complexity). Given n = 2^m with m + l ≤ d and d a power of two,

F_⊗(m, l) = O(n lg(n) · M(d)/d)                   if l = 0,
F_⊗(m, l) = O(n lg(n) · M(d) · 2^{⌈lg l⌉}/d)      otherwise.

Proof. By induction on m, checking the three cases of the recurrence above. The base case m = 1 follows since F_⊗(1, l) = M(2^{⌈lg l⌉}) ≤ M(d) · 2^{⌈lg l⌉}/d, and the inductive step repeatedly uses the assumption that M(d)/d is an increasing function of d. □

The cost of additions can be bounded by the same argument, with each butterfly contributing 2 A(·) instead of M(·) (note that A(d)/d is constant).

Theorem 3.5 (addition complexity). Given n = 2^m with m + l ≤ d and d a power of two,

F_A(m, l) = O(n lg(n) · A(d)/d)                   if l = 0,
F_A(m, l) = O(n lg(n) · A(d) · 2^{⌈lg l⌉}/d)      otherwise.

Given P ∈ F_2[x]_{<n} with n a power of two, to compute FAFT_n(P) we call FAFFT(lg(n), P, 0, 0). Thus the cost of computing FAFT_n(P) is O(n lg(n) M(d)/d + n lg(n) A(d)/d). Compared with the additive FFT for F_{2^d}[x]_{<n}, whose cost is O(n lg(n)(M(d) + A(d))), we gain a speed-up factor of d.

The inverse Frobenius additive FFT is straightforward for the butterfly units with two outputs, since each such unit is easy to invert. However, due to the truncation, it is not obvious how to invert when l is a power of two. Here we show that it is always invertible. In Algorithm 3, when l is a power of two, the algorithm truncates and only computes the FAFT of Q_0(x) = P_0(x) + s_{k−1}(α) · P_1(x). To be able to invert, we need to recover P_0(x) and P_1(x) from Q_0(x).
Note that s_{k−1}(α) ∈ (W_{l+1} \ W_l) = v_l + W_l, by the invariant above and Lemma 2.7. Since we use a Cantor basis, recall from Definition 2.1 that v_l = u_{lg l} when l is a power of two. We can thus rewrite the equation from the point of view of F_{2^l}[u_{lg l}][x]. Let s_{k−1}(α) = u_{lg l} + c with c ∈ F_{2^l}, and write

Q_0(x) = R_0(x) + R_1(x) u_{lg l} = P_0(x) + (c + u_{lg l}) · P_1(x),

where R_0(x), R_1(x) ∈ F_{2^l}[x]. Then we get

P_0(x) = R_0(x) + R_1(x) · c,
P_1(x) = R_1(x).

Thus we can always recover P_0(x) and P_1(x) from Q_0(x). The full inverse Frobenius additive FFT algorithm is shown in Algorithm 4.

4 MULTIPLICATION IN F_2[x]

Multiplying polynomials using the Frobenius additive FFT follows exactly the conventional pattern: apply basis conversion to the novel polynomial basis, compute the Frobenius additive FFT of both inputs, multiply pointwise, compute the inverse Frobenius additive FFT, then transform back into the original monomial basis.

4.1 Multiplication in F_2[x] of small degree

The records for the minimal number of bit operations to multiply polynomials over F_2[x] were set by [Ber09] and [CH15], both based on Karatsuba-like algorithms. Instead of a Karatsuba-like algorithm, we use the Frobenius additive FFT to perform multiplication in F_2[x]. We implement a generator that produces code for binary polynomial multiplication of size 2^m in which each variable is in F_2, since the multiplicands s_k(α) in FAFFT can all be precomputed. To reduce the number of bit operations, when multiplying by a constant, we transform the multiplication into a matrix-vector product over F_2 and apply a common-subexpression algorithm as in [Paa97]. The generated code consists of XOR and AND expressions. The generator will be made publicly available on GitHub.

In Figure 2, we show the best results for polynomial multiplication over the binary field. [Ber09] set the records for polynomial sizes up to 1000 in 2009. [CH15] improved those results by up to 4.5% for certain sizes of polynomials.
Since our Frobenius additive FFT works with polynomial sizes equal to powers of two, we apply it to polynomial multiplication with polynomial sizes 256, 512, and 1024, improving the best known results by 19.1%, 29.7%, and 41.1%, respectively. As a consequence, we set new records for all sizes from 231 to 256, from 414 to 512, and from 709 to 1024. To the best of our knowledge, it is the first time an FFT-based method outperforms Karatsuba-like algorithms at such low degrees in terms of bit-operation count. In addition, for polynomial size 128, our FAFFT costs 11,556 bit operations, compared with the previous best of 11,466 from [CH15]; our result is only 0.78% slower in terms of bit-operation count.

IFAFFT(k, A, l, α):
  input: A = (P(σ))_{σ∈Σ}, where Σ = (Σ_0 ∪ Σ_1 ∪ ⋯ ∪ Σ_{k+l}) ∩ (α + W_k); α ∈ W_{k+l} \ W_k if l > 0, α = 0 otherwise
  output: P(x) = p_0 X_0(x) + p_1 X_1(x) + ⋯ + p_{2^k−1} X_{2^k−1}(x), with p_i ∈ F_{2^{2^{⌈lg l⌉}}} if l > 0 and p_i ∈ F_2 otherwise
  if k = 0 then return the only element of A;
  if l = 0 then
    Divide the set A into A_0, A_1
    Q_0(x) ← IFAFFT(k−1, A_0, l, α)
    Q_1(x) ← IFAFFT(k−1, A_1, l+1, v_{k−1} + α)
    P_1(x) ← Q_0(x) + Q_1(x)
    P_0(x) ← Q_0(x) + s_{k−1}(α) · P_1(x)
  else if l is a power of two then
    Q_0(x) ← IFAFFT(k−1, A, l+1, α)
    Let s_{k−1}(α) = c + u_{lg(l)}
    Let Q_0(x) = R_0(x) + u_{lg(l)} · R_1(x)
    P_1(x) ← R_1(x)
    P_0(x) ← R_0(x) + R_1(x) · c
  else
    Divide the set A into A_0, A_1
    Q_0(x) ← IFAFFT(k−1, A_0, l+1, α)
    Q_1(x) ← IFAFFT(k−1, A_1, l+1, v_{k−1} + α)
    P_1(x) ← Q_0(x) + Q_1(x)
    P_0(x) ← Q_0(x) + s_{k−1}(α) · P_1(x)
  end
  return P_0(x) + P_1(x) · s_{k−1}(x)

Algorithm 4: Inverse Frobenius additive FFT in the novel polynomial basis.
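For scale (our own estimate, not a figure from the paper): the schoolbook product of two n-coefficient binary polynomials takes n^2 AND and (n−1)^2 XOR gates, which the following instrumented counter verifies.

```python
def schoolbook_gates(n: int) -> int:
    """Count bit operations of the schoolbook GF(2)[x] product of two
    n-coefficient polynomials: one AND per partial product, and an XOR
    fold of the partial products in each output coefficient."""
    ands = xors = 0
    for k in range(2 * n - 1):                       # output coefficient c_k
        terms = sum(1 for i in range(n) if 0 <= k - i < n)
        ands += terms                                # partial products a_i*b_{k-i}
        xors += terms - 1                            # XORs to fold them
    return ands + xors

assert schoolbook_gates(128) == 128**2 + 127**2      # 32,513 bit operations
```

So the 11,556 bit operations quoted above for size 128 amount to roughly a 2.8x saving over the schoolbook count of 32,513.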
In [BC14], an optimized implementation of the additive FFT based on [GM10] was presented, multiplying in F_2[x]_{<256} using 22,292 bit operations; it can be used to multiply polynomials of size 128 via the Kronecker method. Our Frobenius additive FFT requires only 11,556 bit operations for size 128, about half of their result. The factor-2 speed-up over the Kronecker method is expected, as in [vdHL17a], because the total bit length is halved by the Frobenius method.

Several polynomial sizes are of particular interest to the cryptographic engineering community, and the number of bit operations for multiplication at these sizes has been studied due to applications in binary elliptic curves [Ber09, CH15]. These binary elliptic curves include the Koblitz and random curves sect233k1, sect233r1 over F_2[x]/(x^233 + x^74 + 1), the curves sect239k1, sect239r1 over F_2[x]/(x^239 + x^158 + 1), and the Edwards curve BBE251 over F_2[x]/(x^251 + x^7 + x^4 + x^2 + 1), according to the Standards for Efficient Cryptography Group (SECG) and [Ber09]. For the corresponding polynomial sizes 233, 239, and 251, the Frobenius additive FFT outperforms previous methods in terms of the number of bit operations. Thus, our FAFFT can potentially be applied to these curves in order to accelerate the computation.

Figure 2: Number of bit operations for multiplication in F_2[x]. Data points from prior work [Ber09, CH15]: (231, 29124), (414, 68484), (512, 98018), (709, 159267); this work: (256, 29005), (512, 68446), (1024, 158226).

4.2 Multiplication in F_2[x] of large degree

Another application is the implementation of multiplication in F_2[x] of large degree on modern CPUs. Here we implement a variant of the algorithm. The bit-operation count is not a good predictor of performance on modern CPUs, since they operate on 64-bit machine words and provide the special instruction PCLMULQDQ designed for carryless multiplication with input size 64. As in [VDHLL17], an implementation on a modern CPU has to take this into account. To be able to use
PCLMULQDQ, we change our algorithm to compute only a subset of the cross section. The set of points we use is { v_{i−1} + j_{i−2} v_{i−2} + ⋯ + j_0 v_0 : j_k = 0 if (i−1) − k is a power of 2 and (i−1) − k ≥ 64, j_k ∈ {0, 1} otherwise }, where 64 < i ≤ d, so that the bulk of the work takes place over F_{2^64} and mainly uses the PCLMULQDQ instruction, which performs carryless multiplication with input size 64. We show benchmarks on the Intel Skylake architecture in Table 1, with comparison to other implementations. In the table, for polynomials of size n with log_2(n/64) = 16, 17, …,
23, our implementation of variant ofFAFFT outperforms previous best results from [VDHLL17, CCK + able 1: Products in degree < n in F [ x ] on Intel Skylake XeonE3-1275 v5 @ 3.60GHz ( − sec.) log ( n / )
16 17 18 19 20 21 22 23This work, F c
11 24 56 127 239 574 958 2465ADFT[CCK +
17] 16 34 74 175 382 817 1734 3666 F [HvdHL16] b
22 51 116 217 533 885 2286 5301 gf2x a [BGTZ08] 23 51 111 250 507 1182 2614 6195 a Version 1.2. Available from http://gf2x.gforge.inria.fr/ b SVN r10663. Available from svn://scm.gforge.inria.fr/svn/mmx c SVN r10681. Available from svn://scm.gforge.inria.fr/svn/mmx
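To make the role of PCLMULQDQ above concrete, the following plain-Python sketch emulates the 64-bit carryless multiplication that the instruction performs in hardware. It is for exposition only and is not part of the benchmarked implementation; the function name clmul64 is ours.

```python
def clmul64(a: int, b: int) -> int:
    """Carryless (XOR) multiplication of two 64-bit words.

    Software emulation of what the PCLMULQDQ instruction computes:
    the product of a and b viewed as polynomials over F_2, returned
    as an integer of up to 127 bits.
    """
    assert 0 <= a < 1 << 64 and 0 <= b < 1 << 64
    r = 0
    while b:
        if b & 1:       # if the current coefficient of b is 1,
            r ^= a      # add (XOR) the shifted copy of a
        a <<= 1         # multiply a by x
        b >>= 1         # move to the next coefficient of b
    return r
```

For example, clmul64(0b11, 0b11) computes (x + 1)^2 = x^2 + 1 over F_2, i.e. 0b101, illustrating that additions of partial products carry no carries, unlike integer multiplication.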
This is the first time that an FFT-based algorithm outperforms Karatsuba-like algorithms for binary polynomial multiplication at such low degrees. We hope our work can open up a new direction for the community interested in the number of bit operations of binary polynomial multiplication at small degrees. Possible future work includes further reducing the bit-operation counts in [BC14], or using the truncated FFT to eliminate the "jump" in complexity when the size crosses a power of two [vdH04].
REFERENCES

[BC14] Daniel J. Bernstein and Tung Chou. Faster binary-field multiplication and faster binary-field MACs. In Antoine Joux and Amr M. Youssef, editors, Selected Areas in Cryptography - SAC 2014 - 21st International Conference, Montreal, QC, Canada, August 14-15, 2014, Revised Selected Papers, volume 8781 of Lecture Notes in Computer Science, pages 92-111. Springer, 2014.
[Ber09] Daniel J. Bernstein. Batch binary Edwards. In Advances in Cryptology - CRYPTO 2009, 29th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 16-20, 2009. Proceedings, pages 317-336, 2009.
[BGTZ08] Richard P. Brent, Pierrick Gaudry, Emmanuel Thomé, and Paul Zimmermann. Faster multiplication in GF(2)[x]. Lecture Notes in Computer Science, 5011:153-166, 2008.
[BYD17] Michael A. Burr, Chee K. Yap, and Mohab Safey El Din, editors. Proceedings of the 2017 ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC 2017, Kaiserslautern, Germany, July 25-28, 2017. ACM, 2017.
[Can89] David G. Cantor. On arithmetical algorithms over finite fields. J. Comb. Theory Ser. A, 50(2):285-300, March 1989.
[CCK+17] Ming-Shing Chen, Chen-Mou Cheng, Po-Chun Kuo, Wen-Ding Li, and Bo-Yin Yang. Faster multiplication for long binary polynomials. CoRR, abs/1708.09746, 2017.
[CH15] Murat Cenk and M. Anwar Hasan. Some new results on binary polynomial multiplication. J. Cryptographic Engineering, 5(4):289-303, 2015.
[CHN14] Murat Cenk, M. Anwar Hasan, and Christophe Nègre. Efficient subquadratic space complexity binary polynomial multipliers based on block recombination. IEEE Trans. Computers, 63(9):2273-2287, 2014.
[CNH13] Murat Cenk, Christophe Nègre, and M. Anwar Hasan. Improved three-way split formulas for binary polynomial and Toeplitz matrix vector products. IEEE Trans. Computers, 62(7):1345-1361, 2013.
[FH15] Haining Fan and M. Anwar Hasan. A survey of some recent bit-parallel GF(2^n) multipliers. Finite Fields Appl., 32(C):5-43, March 2015.
[GM10] Shuhong Gao and Todd Mateer. Additive fast Fourier transforms over finite fields. IEEE Trans. Inf. Theor., 56(12):6265-6272, December 2010.
[HvdHL16] David Harvey, Joris van der Hoeven, and Grégoire Lecerf. Fast polynomial multiplication over F_{2^60}. In Sergei A. Abramov, Eugene V. Zima, and Xiao-Shan Gao, editors, Proceedings of the ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC 2016, Waterloo, ON, Canada, July 19-22, 2016, pages 255-262. ACM, 2016.
[HvdHL17] David Harvey, Joris van der Hoeven, and Grégoire Lecerf. Faster polynomial multiplication over finite fields. J. ACM, 63(6):52:1-52:23, 2017.
[JKR12] Charanjit S. Jutla, Vijay Kumar, and Atri Rudra. On the circuit complexity of composite Galois field transformations. Electronic Colloquium on Computational Complexity (ECCC), 19:93, 2012.
[LANH16] Sian-Jheng Lin, Tareq Y. Al-Naffouri, and Yunghsiang S. Han. FFT algorithm for binary extension finite fields and its application to Reed-Solomon codes. IEEE Trans. Inf. Theor., 62(10):5343-5358, October 2016.
[LCH14] Sian-Jheng Lin, Wei-Ho Chung, and Yunghsiang S. Han. Novel polynomial basis and its application to Reed-Solomon erasure codes. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, pages 316-325. IEEE Computer Society, 2014.
[Paa97] C. Paar. Optimized arithmetic for Reed-Solomon encoders. In Proceedings of IEEE International Symposium on Information Theory, pages 250-, June 1997.
[vdH04] Joris van der Hoeven. The truncated Fourier transform and applications. In Jaime Gutierrez, editor, Symbolic and Algebraic Computation, International Symposium ISSAC 2004, Santander, Spain, July 4-7, 2004, Proceedings, pages 290-296. ACM, 2004.
[vdHL17a] Joris van der Hoeven and Robin Larrieu. The Frobenius FFT. In Burr et al. [BYD17], pages 437-444.
[vdHL17b] Joris van der Hoeven and Grégoire Lecerf. Composition modulo powers of polynomials. In Burr et al. [BYD17], pages 445-452.
[VDHLL17] Joris van der Hoeven, Robin Larrieu, and Grégoire Lecerf. Implementing fast carryless multiplication. Working paper or preprint, August 2017.
[vzGS05] Joachim von zur Gathen and Jamshid Shokrollahi. Efficient FPGA-based Karatsuba multipliers for polynomials over F_2. In Bart Preneel and Stafford E. Tavares, editors, Selected Areas in Cryptography, 12th International Workshop, SAC 2005, Kingston, ON, Canada, August 11-12, 2005, Revised Selected Papers, volume 3897 of Lecture Notes in Computer Science, pages 359-369. Springer, 2005.