[PDF] An in-place truncated Fourier transform

Abstract

We show that simple modifications to van der Hoeven's forward and inverse truncated Fourier transforms allow the algorithms to be performed in-place, and with only a linear overhead in complexity.

Full PDF

aa r X i v : . [ c s . CC ] J a n AN IN-PLACE TRUNCATED FOURIER TRANSFORM

NICHOLAS COXON

Abstract.

We show that simple modiﬁcations to van der Hoeven’s forward andinverse truncated Fourier transforms allow the algorithms to be performed in-place,and with only a linear overhead in complexity. Introduction

The discrete Fourier transform (DFT) with respect to a principal n th root of unity is a linear mapthat evaluates polynomials of degree less than n at each of the n distinct powers of the root of unity.A naive approach allows the map to be evaluated with O ( n ) arithmetic operations. However, when n is comprised entirely of small prime factors, the DFT can be evaluated with O ( n log n ) operations bythe fast Fourier transform (FFT) algorithm. While already known to Gauss [5], the FFT algorithm onlyreceived widespread attention after its rediscovery [7, 3] by Cooley and Tukey [4]. The algorithm hassince found numerous applications in mathematics, computer science and electrical engineering. In thearea of computer algebra, the FFT algorithm provides the basis for asymptotically fast algorithms forinteger and polynomial multiplication [11, 2].The radix-2 Cooley-Tukey FFT algorithm is widely used due to its ease of implementation and fastpractical performance. However, the radix-2 algorithm requires the order n of the root of unity to bea power of two, limiting control over the number of evaluation points of the corresponding transform.This requirement leads to unwanted “jumps” in the complexity for certain applications, e.g., polynomialmultiplication, since polynomials may have to be evaluated at almost twice as many points than actuallyrequired by the problem at hand. Truncated (or pruned) variants of the radix-2 algorithm addressthis issue by considering more general problems involving the evaluation of polynomials with restrictedsupport at subsets of the evaluation points. The algorithms then disregard parts of the radix-2 algorithmthat are vacuous due to zero coeﬃcients, or that do not contribute to the computation of the desiredoutput, resulting in complexities which vary relatively smoothly with the size of the inputs and outputs.The truncated Fourier transform (TFT) algorithm of van der Hoeven [13, 14] modiﬁes the radix-2algorithm so that it evaluates polynomials of degree less than ℓ , where ℓ ≤ n , at the ℓ powers of theroot of unity whose exponents are smallest under a reversal of their bit representation. The algorithmretains the simplicity of the radix-2 algorithm, while enjoying a time complexity which varies relativelysmoothly with ℓ . More speciﬁcally, the algorithm performs ( ℓ/

2) log ℓ + O ( ℓ ) ring multiplications, and ℓ log ℓ + O ( ℓ ) ring additions or subtractions. Van der Hoeven also provides a second algorithm, called theinverse truncated Fourier transform (ITFT), which performs the inverse transformation of interpolationwith the same complexity. Together, the algorithms allow for more eﬃcient implementations of FFT-based polynomial multiplication algorithms by reducing the size of the jumps in their complexities.Van der Hoeven’s TFT and ITFT algorithms operate on arrays of 2 ⌈ log ℓ ⌉ ring elements, resulting injumps in their space complexities as ℓ increases. Harvey and Roche [6] address this issue by describ-ing in-place variants of the algorithms, which operate by updating their length ℓ input arrays throughreplacements until they are overwritten with the output, while storing only a constant number of ringelements and bounded-precision integers in auxiliary storage space. Their variants retain the O ( ℓ log ℓ )time complexity of van der Hoeven’s algorithms, albeit with a larger implied constant. Arnold [1] sub-sequently describes in-place algorithms that match the complexity of van der Hoeven’s algorithms up tothe implied constants in the O ( ℓ ) terms. The algorithms are based on existing in-place algorithms [12]for computing a diﬀerent truncated transform [9, Chapter 6], which Arnold improves and then modiﬁesto compute the transforms of van der Hoeven’s algorithms. E-mail address : [email protected] . Date : January 25, 2021.

Key words and phrases. fast Fourier transform, truncated Fourier transform, in-place algorithms.

In Section 3 of this paper, we present an in-place TFT algorithm that performs at most ( ℓ/ ⌊ log ℓ ⌋ +2 ℓ + O (log ℓ ) ring multiplications, and at most ℓ ⌊ log ℓ ⌋ +2 ℓ ring additions or subtractions. The algorithmsimply injects an additional O ( ℓ ) computation into the middle of a slightly restructured version of van derHoeven’s TFT algorithm. This approach is arguably more natural than the one used by Arnold, which wesee reﬂected in the fact that our algorithm performs Ω( ℓ ) fewer multiplications than Arnold’s algorithmwhen ℓ is not a power of two. In Section 4, we show that our TFT algorithm is easily inverted, yieldingan in-place ITFT algorithm that performs at most ( ℓ/ ⌊ log ℓ ⌋ + 4 ℓ + O (log ℓ ) ring multiplications, andat most ℓ ⌊ log ℓ ⌋ + 3 ℓ ring additions or subtractions.2. Preliminaries

The discrete Fourier transform.

Let R be a commutative ring with identity. Then ω ∈ R iscalled a principal n th root of unity if ω n = 1 and P n − j =0 ω ij = 0 for i ∈ { , . . . , n − } . The discreteFourier transform (DFT) with respect to a principal n th root of unity ω ∈ R is the R -linear mapDFT ω : R n → R n deﬁned by ( a , . . . , a n − ) (ˆ a , . . . , ˆ a n − ) whereˆ a i = n − X j =0 a j ω ij for i ∈ { , . . . , n − } . It is readily veriﬁed that ω − = ω n − is also a principal n th root of unity, and DFT ω − (DFT ω ( a )) = n a for a ∈ R n . Thus, DFT ω is injective if and only if n is not a zero-divisor in R .Hereafter, we assume that ω ∈ R is a principal n th root of unity such that n = 2 p for some p ≥

1. Wefurther assume that ω n/ = −

1, in order to simplify the algorithms presented in this paper. When n isa power of two, satisfying this condition is suﬃcient for ω ∈ R to be an n th principal root of unity; thecondition is also necessary if 2 is not a zero-divisor in R .2.2. Computational model.

We adhere to the computational model laid out by Harvey and Roche [6,Section 2.1]. Accordingly, two primitive types are allowed in memory: elements of R and integers in theinterval [ − cℓ, cℓ ], where ℓ is the length of the transform being computed, and c ∈ N is a ﬁxed constant(like Harvey and Roche, we may take c = 2 for our algorithms). An algorithm is then said to be in-placeif it operates by updating its given inputs until they are overwritten with the corresponding outputs.The algorithm may use extra working space during the computation, which is deﬁned to be the auxiliaryspace used by the algorithm. However, only O (1) auxiliary space is allowed.We measure the time complexity of algorithms by the number algebraic operations they perform, i.e.,by the number of multiplications, additions and subtractions performed in R . Thus, we ignore the cost ofoperations on indices. However, these costs are negligible compared to those of the algebraic operationsfor the algorithms in this paper. At times, we provide a more detailed analysis by separately counting thenumber of multiplications by powers of ω , and by powers of 2 or its inverse. Multiplications by powersof ω are signiﬁcant in settings where the root has been artiﬁcially added to the ring [11, 2], since theyenjoy a smaller cost than arbitrary multiplications. Similarly, the cost of a multiplication by 2 is alwaysbounded by that of an addition, while for certain rings (e.g., R = C ) multiplications by powers of 2 orits inverse can be eﬃciently implemented by simple bit operations (e.g., bit-shifts).2.3. The radix-2 fast Fourier transform.

Along with the rediscovery of the FFT algorithm, Cooleyand Tukey [4] introduced an in-place variant of the radix-2 algorithm. A feature of the algorithm is thatit requires a “bit-reversal” permutation to be performed either at its beginning or at its end. For k ∈ N ,the k -bit bit-reversal permutation [ ] k : { , . . . , k − } → { , . . . , k − } is deﬁned by2 i + 2 i + · · · + 2 k − i k − i k − + 2 i k − + · · · + 2 k − i for i , . . . , i k − ∈ { , } . The in-place radix-2 FFT algorithm is then derived from the following result ofCooley and Tukey (see also [13, Section 2]). Lemma 2.1.

Let a = ( a , . . . , a n − ) ∈ R n , and recursively deﬁne a k, , . . . , a k,n − ∈ R for k ∈ { , . . . , p } as follows: a p,i = a i for i ∈ { , . . . , n − } ; and, if k < p , (2.1) (cid:18) a k, k (2 i )+ j a k, k (2 i +1)+ j (cid:19) = (cid:18) ω [2 i ] p − ω [2 i ] p (cid:19) (cid:18) a k +1 , k (2 i )+ j a k +1 , k (2 i +1)+ j (cid:19) for i ∈ { , . . . , p − k − − } and j ∈ { , . . . , k − } . Then DFT ω ( a ) = ( a , [0] p , a , [1] p , . . . , a , [ n − p ) . N IN-PLACE TRUNCATED FOURIER TRANSFORM 3

Lemma 2.1 immediately yields a O ( n log n ) algorithm for evaluating DFT ω . To make the algorithmin-place, the input vector ( a , . . . , a n − ) is successively overwritten with a k, , . . . , a k,n − for k = p − , p − , . . . , n/ ω [2 i ] p of the transformations (2.1), commonlyreferred to as “twiddle factors”, using only O (1) auxiliary space. This requirement can be met byperforming the transformations successively for i = [0] p − k − , [1] p − k − , . . . , [2 p − k − − p − k − , since thecorresponding twiddle factors are then successive powers of ω k . More simply, the input vector can bepermuted according to the bit-reversal permutation at the beginning of the algorithm, rather than at itsend, since (2.1) implies that (cid:18) a k, [2 p − k − (2 j )+ i ] p a k, [2 p − k − (2 j +1)+ i ] p (cid:19) = ω k i − ω k i ! (cid:18) a k +1 , [2 p − k − (2 j )+ i ] p a k +1 , [2 p − k − (2 j +1)+ i ] p (cid:19) for k ∈ { , . . . , p − } , i ∈ { , . . . , p − k − − } and j ∈ { , . . . , k − } . Using either method, the cost ofcomputing the twiddle factors is n + O (log n ) multiplications by powers of ω . Theorem 2.2 ([4]) . If a ∈ R n , then DFT ω ( a ) can be computed in-place with ( n/

2) log n + O (log n ) multiplications by powers of ω , and n log n additions or subtractions. The truncated Fourier transform.

The length ℓ ≤ n truncated Fourier transform (TFT) withrespect to ω is the R -linear map TFT ω,ℓ : R ℓ → R ℓ deﬁned by ( a , . . . , a ℓ − ) (ˆ a [0] p , . . . , ˆ a [ ℓ − p ) whereˆ a i = ℓ − X j =0 a j ω ij for i ∈ { , . . . , n − } . When convenient, we also view the truncated Fourier transform as the polynomial evaluation mapTFT ω,ℓ : R [ x ] <ℓ → R ℓ given by f ( f ( ω [0] p ) , . . . , f ( ω [ ℓ − p )), where R [ x ] <ℓ denotes the set of polyno-mials over R with degree less than ℓ .The length ℓ TFT is readily evaluated by applying Lemma 2.1 with a ℓ = · · · = a n − = 0. Vander Hoeven’s TFT algorithm simply augments this approach by disregarding redundant parts of thecomputation. For example, the redundancies introduced by the zero padding are accounted for by thefollowing lemma. Lemma 2.3.

Let ℓ ∈ { , . . . , n } , m = ⌈ log ℓ ⌉ and a , . . . , a n − ∈ R such that a ℓ = · · · = a n − = 0 .Deﬁne a k, , . . . , a k,n − ∈ R for k ∈ { , . . . , p } as per Lemma 2.1. Then (2.2) a m − , m − i + j = ( a j + ( − i a m − + j if j < ℓ − m − ,a j if j ≥ ℓ − m − , for i ∈ { , } and j ∈ { , . . . , m − − } .Proof. If for some k ∈ { m, . . . , p − } , we have a k +1 ,j = a j for j ∈ { , . . . , k +1 − } , then (2.1) with i = 0implies that a k,j = a k +1 ,j + a k +1 , k + j = a j + a k + j = a j for j ∈ { , . . . , k − } , since 2 k ≥ m ≥ ℓ . As a p,j = a j for j ∈ { , . . . , p − } by deﬁnition, it follows by induction that a m,j = a j for j ∈ { , . . . , m − } .Thus, (2.1) with k = m − i = 0 implies that a m − ,j = a j + a m − + j and a m − , m − + j = a j − a m − + j for j ∈ { , . . . , m − − } , where a m − + j = 0 if j ≥ ℓ − m − . (cid:3) To compute the truncated Fourier transform of ( a , . . . , a ℓ − ) ∈ R ℓ , van der Hoeven’s algorithmoperates on an array of length 2 ⌈ log ℓ ⌉ that initially has its ﬁrst ℓ entries set equal to a , . . . , a ℓ − . Thealgorithm then overwrites the ﬁrst 2 k ⌈ ℓ/ k ⌉ entries of the array with a k, , . . . , a k, k ⌈ ℓ/ k ⌉− , successivelyfor k = ⌈ log ℓ ⌉ − , ⌈ log ℓ ⌉ − , . . . ,

0, by using (2.2) if k = ⌈ log ℓ ⌉ −

1, and otherwise using (2.1) tooverwrite pairs of entries or single entries. The truncated Fourier transform is then read from the ﬁrst ℓ entries of the array upon termination.The description of van der Hoeven’s ITFT algorithm is more involved, and not provided here. However,we do recall a simple, but key, observation that is used in its development. Van der Hoeven [13, Section 4](see also [14, Section 6]) observes that not only do we have (2.1), but also the relations(2.3) (cid:18) a k +1 , k (2 i )+ j a k +1 , k (2 i +1)+ j (cid:19) = (cid:18) ω − [2 i ] p − ω − [2 i ] p (cid:19) (cid:18) a k, k (2 i )+ j a k, k (2 i +1)+ j (cid:19) AN IN-PLACE TRUNCATED FOURIER TRANSFORM and(2.4) (cid:18) a k +1 , k (2 i )+ j a k, k (2 i +1)+ j (cid:19) = (cid:18) − ω [2 i ] p − ω [2 i ] p (cid:19) (cid:18) a k, k (2 i )+ j a k +1 , k (2 i +1)+ j (cid:19) for k ∈ { , . . . , p − } , i ∈ { , . . . , p − k − − } and j ∈ { , . . . , k − } , and those obtained by invertingthe matrix in (2.4). The additional relations are used in van der Hoeven’s ITFT algorithm to navigatenimbly, and cheaply, through the elements a k,i for increasing and decreasing k . They are used similarlyin our in-place algorithms.The complexity of van der Hoeven’s TFT and ITFT algorithms is summarised as follows. Theorem 2.4 ([13, Theorems 1 and 2]) . Suppose that a ∈ R ℓ for some ℓ ≤ n . Then TFT ω,ℓ ( a ) can be computed with ( ℓ/

2) log ℓ + O ( ℓ ) multiplications by powers of ω , and ℓ log ℓ + O ( ℓ ) additions orsubtractions. Moreover, if is a unit in R , then a can be recovered from TFT ω,ℓ ( a ) with ( ℓ/

2) log ℓ + O ( ℓ ) multiplications by powers of ω , and ℓ log ℓ + O ( ℓ ) shifted additions or subtractions. In Theorem 2.4, a “shifted” addition or subtraction refers to the combined operation of an additionor subtraction with a multiplication by 2 or 2 − . These operations arise, for example, when using (2.3)and (2.4), and are signiﬁcant for the reasons discussed in Section 2.2.2.5. In-place algorithms for the truncated Fourier transform.

Van der Hoeven’s TFT and ITFTalgorithms both operate on arrays of 2 ⌈ log ℓ ⌉ ring elements. Thus, their auxiliary space requirements arenot O (1), since each algorithms stores at least 2 ⌈ log ℓ ⌉ − ℓ ring elements auxiliary space. The ﬁrst in-placealgorithms for the truncated Fourier transform are due to Harvey and Roche [6] (see also [10, Chapter 3]).Their TFT algorithm is again based on the idea of computing the sequences a k, , . . . , a k, k ⌈ ℓ/ k ⌉− . How-ever, elements a k,i with i > ℓ are never stored by the algorithm. Instead, they are computed on-the-ﬂyonly when needed, and then immediately discarded. This approach allows the algorithm to operate on avector of length ℓ , and potentially results in the computation of fewer elements a k,i with i > ℓ . However,the individual cost of their computation is higher, leading to an increase in multiplicative complexity overvan der Hoeven’s algorithm of a constant factor. Theorem 2.5 ([10, Chapter 3]) . If a ∈ R ℓ for some ℓ ≤ n , then TFT ω,ℓ ( a ) can be computed in-placewith ℓ (log ℓ ) / O ( ℓ ) multiplications by powers of ω , and O ( ℓ log ℓ ) additions or subtractions. Harvey and Roche’s in-place ITFT algorithm performs a similar number of multiplications by powersof ω , and O ( ℓ log ℓ ) shifted additions or subtractions.Arnold [1] provides in-place TFT and ITFT algorithms that match the complexity of van der Ho-even’s algorithms up to lower order terms. The problem of evaluating TFT ω,ℓ is not considered directlyby Arnold. Instead, the main focus of Arnold’s work is to improve an existing in-place algorithm forevaluating the “cyclotomic” truncated Fourier transform proposed by Mateer [9, Chapter 6]. The im-proved algorithm is then modiﬁed to evaluate TFT ω,ℓ . We summarise Arnold’s main results relating tothe evaluation of TFT ω,ℓ in the following two lemmas. Lemma 2.6 ([1, Section 6]) . Write ℓ ∈ { , . . . , n } as P tr =1 ℓ r for integer powers of two ℓ > ℓ > · · · > ℓ t ,and let e r = [ ℓ r + · · · + ℓ t ] p for r ∈ { , . . . , t } , with [ n ] p taken to be zero. Then, for f ∈ R [ x ] <ℓ , TFT ω,ℓ ( f ) = TFT ω n/ℓ ,ℓ ( f ( x/ω e )) k · · · k TFT ω n/ℓt ,ℓ t ( f s ( x/ω e t )) , where f r ∈ R [ x ] <ℓ r is the residue of f ( ω [ ℓ ] p x ) modulo x ℓ r + 1 , for r ∈ { , . . . , t } , and k denotes theconcatenation operator. Lemma 2.7 ([1, Lemma 13]) . Write ℓ ∈ N \ { } as P tr =1 ℓ t for integer powers of two ℓ > ℓ > · · · > ℓ t .Then given f ∈ R [ x ] <ℓ , the minimum degree residues of f modulo the polynomials x ℓ r +1 for r ∈ { , . . . , t } can be computed in-place with at most ℓ multiplications by or − , and at most ℓ additions orsubtractions. The maps TFT ω n/ℓr ,ℓ r that appear in Lemma 2.6 can be evaluated in-place by the methods of Sec-tion 2.3, since ω n/ℓ r is a principal ℓ r th root of unity. In Lemma 2.7, “computed in-place” means thatthe length ℓ coeﬃcient vector of f is overwritten with the concatenation of the coeﬃcient vectors of theresidues, appearing in order, and with the residue modulo x ℓ r + 1 having coeﬃcient vector of length ℓ r .Arnold furthermore shows that the bounds of the lemma also apply to the cost of inverting this transfor-mation in-place. Thus, by combining Lemmas 2.6 and 2.7 with Theorem 2.2, Arnold obtains the followingcomplexity result. N IN-PLACE TRUNCATED FOURIER TRANSFORM 5

Theorem 2.8 ([1]) . Suppose that is a unit in R , and let a ∈ R ℓ for some ℓ ≤ n . Then given either a or TFT ω,ℓ ( a ) , the other vector can be computed in-place with ( ℓ/

2) log ℓ + O ( ℓ ) multiplications, and ℓ log ℓ + O ( ℓ ) additions or subtractions. The assumption in Theorem 2.8 that 2 is a unit in R can be removed for the case of evaluating TFT ω,ℓ by replacing the algorithm of Lemma 2.7 with the earlier algorithm of Sergeev [12]. The algorithmspresented in this paper yield an alternative proof of the theorem, and without the need for the unitassumption when evaluating the truncated Fourier transform.3. An in-place truncated Fourier transform

Our in-place algorithm for computing the truncated Fourier transform is presented in Algorithm 1.The algorithm operates on an array ( x , . . . , x ℓ − ), which initially contains a given vector a ∈ R ℓ , and hasit entries overwritten by the algorithm with those of TFT ω,ℓ ( a ). The algorithm also asks for ω n/ ⌈ log2 ℓ ⌉ asan input, thus assuming (along with Harvey and Roche [10, Section 3.3]) a precomputation of p − ⌈ log ℓ ⌉ multiplications if repeated squaring is used. This element generates the smallest subgroup that containsthe twiddle factors used in the algorithm, and is used as part of their computation. Algorithm 1 omitsthe details of the twiddle factor computations, which are instead provided in Section 3.3. We temporarilyassume that 2 is not a zero-divisor in R , so that in Line 24 of the algorithm the map div : 2 R → R given by 2 a a for a ∈ R is well-deﬁned. We also assume that the map can be evaluated in-place.In Section 3.2, we show how to modify the algorithm so as to remove the use of the map, thus removingalso the need for the two assumptions.3.1. Overview and correctness.

Similar to van der Hoeven’s TFT algorithm, Algorithm 1 is based onthe idea of computing the sequences a k, , . . . , a k, k ⌈ ℓ/ k ⌉− for k = m − , m − , . . . ,

0, where m = ⌈ log ℓ ⌉ .However, the computation does not proceed by computing the full sequences for successive values of k ,since their length is greater than ℓ for k such that 2 k does not divide ℓ . Instead, in the ﬁrst phase of thealgorithm, consisting of Lines 1 to 17, the subsequences a k, k +1 ⌊ ℓ/ k +1 ⌋ , . . . , a k, k ⌈ ℓ/ k ⌉− are computed for k = m − , m − , . . . , v , where v = ord ℓ is the maximum integer for which 2 v divides ℓ . Each computedelement a k,i is written to x i if i < ℓ , and written to x i − m − if i ≥ ℓ . These assignments are consistentfor k = m −

1, since Lemma 2.3 implies that a m − ,i = a m − , m − + i for i ∈ { ℓ − m − , . . . , m − − } (ofcourse, we do not compute these elements twice), while no clashes occur for smaller values of k , sinceonly elements a k,i with i ≥ m − are computed. This method of assigning computed elements a k,i to thearray is used throughout the entire algorithm.By indexing in this manner, the elements a m − ,ℓ − m − , . . . , a m − , m − − , which by Lemma 2.3 residein the entries x ℓ − m − , . . . , x m − − after Lines 3 and 4 have been performed, are partially overwritten byLines 6 to 17 with elements a k,i such that k ≤ m − i ≥ ℓ . However, the choice of the subsequencescomputed in the ﬁrst phase of the algorithm ensures that no element a k,i with i ≥ ℓ is needed to computethe remainder of the transform, whereas no further progress can be made on computing the transformwithout access to a m − ,ℓ − m − , . . . , a m − , m − − . Consequently, the next phase of the algorithm, Lines 19to 28, reverts the modiﬁcations to the entries x ℓ − m − , . . . , x m − − made by Lines 6 to 17 by proceedingbackwards through the loop and using (2.3) and (2.4) to partially inverting the oﬀending transformations.The ﬁnal phase of the algorithm, Lines 30 to 36, then completes the computation of the transform bycomputing a k, , . . . , a k, k +1 ⌊ ℓ/ k +1 ⌋− for k = m − , m − , . . . ,

0, where 2 k +1 ⌊ ℓ/ k +1 ⌋ = 2 k ⌈ ℓ/ k ⌉ if k < v .Further details are provided by the proof of the following lemma. Lemma 3.1.

Algorithm 1 correctly evaluates

TFT ω,ℓ .Proof.

Suppose that Algorithm 1 is called with a = ( a , . . . , a ℓ − ) ∈ R ℓ . Deﬁne a k, , . . . , a k,n − for k ∈ { , . . . , p } as per Lemma 2.1, with a i taken to be zero for i ∈ { ℓ, . . . , n − } . Then the lemmaimplies that TFT ω,ℓ ( a ) is computed correctly by Algorithm 1 if x i = a i, for i ∈ { , . . . , ℓ − } upon itstermination.Deﬁne m = ⌈ log ℓ ⌉ and v = ord ℓ , as in Line 1 of Algorithm 1. Then Lemma 2.3 implies x i = a m − ,i for i ∈ { , . . . , ℓ − } after Lines 3 and 4 of the algorithm have been performed. If ℓ = 2 m , the algorithmthen proceeds directly to the loop of Lines 30 to 36. In this case, it follows immediately from the deﬁnitionof the a k,i that each iteration of the loop sets x i = a k,i for i ∈ { , . . . , ℓ − } , where k is the correspondingvalue of the loop variable. Thus, the algorithm terminates with x i = a i, for i ∈ { , . . . , ℓ − } in thiscase. Therefore, we may assume that ℓ < m . Then v ≤ m −

2, since it is never equal to m − AN IN-PLACE TRUNCATED FOURIER TRANSFORM

Algorithm 1

TFT ( ℓ, ω n/ ⌈ log2 ℓ ⌉ , ( x , . . . , x ℓ − )) Input: the transform length ℓ ∈ { , . . . , n } , the precomputed element ω n/ ⌈ log2 ℓ ⌉ , and an array( x , . . . , x ℓ − ) containing some vector a ∈ R ℓ . Result: the vector TFT ω,ℓ ( a ) written to ( x , . . . , x ℓ − ). m ← ⌈ log ℓ ⌉ , v ← ord ℓ // Compute a m − , , . . . , a m − ,ℓ − for j = 0 , . . . , ℓ − m − − do (cid:18) x j x m − + j (cid:19) ← (cid:18) − (cid:19) (cid:18) x j x m − + j (cid:19) // Compute a k, k +1 ⌊ ℓ/ k +1 ⌋ , . . . , a k, k ⌈ ℓ/ k ⌉− for k = m − , m − , . . . , v for k = m − , m − , . . . , v do q ← ⌊ ℓ/ k +1 ⌋ , r ← ℓ − k +1 q , q ′ ← q − m − k − , α ← ω [2 q ] p if r > k then for j = 0 , . . . , r − k − do (cid:18) x k (2 q )+ j x k (2 q +1)+ j (cid:19) ← (cid:18) α − α (cid:19) (cid:18) x k (2 q )+ j x k (2 q +1)+ j (cid:19) for j = r − k , . . . , k − do (cid:18) x k (2 q )+ j x k (2 q ′ +1)+ j (cid:19) ← (cid:18) α − α (cid:19) (cid:18) x k (2 q )+ j x k (2 q ′ +1)+ j (cid:19) else for j = 0 , . . . , r − do x k (2 q )+ j ← x k (2 q )+ j + αx k (2 q ′ +1)+ j for j = r, . . . , k − do x k (2 q ′ )+ j ← x k (2 q ′ )+ j + αx k (2 q ′ +1)+ j // Restore x ℓ − m − , . . . , x m − − to a m − ,ℓ − m − , . . . , a m − , m − − for k = v + 1 , v + 2 , . . . , m − do q ← ⌊ ℓ/ k +1 ⌋ , r ← ℓ − k +1 q , q ′ ← q − m − k − if r > k then α ← ω − [2 q ] p for j = r − k , . . . , k − do x k (2 q ′ +1)+ j ← div ( α ( x k (2 q )+ j − x k (2 q ′ +1)+ j )) else α ← ω [2 q ] p for j = r, . . . , k − do x k (2 q ′ )+ j ← x k (2 q ′ )+ j − αx k (2 q ′ +1)+ j // Compute a k, , . . . , a k, k +1 ⌊ ℓ/ k +1 ⌋− for k = m − , m − , . . . , for k = m − , m − , . . . , do q ← ⌊ ℓ/ k +1 ⌋ for j = 0 , . . . , k − do (cid:18) x j x k + j (cid:19) ← (cid:18) − (cid:19) (cid:18) x j x k + j (cid:19) for ( i, α ) ∈ { ( i, ω [2 i ] p ) | i ∈ { , . . . , q − }} do for j = 0 , . . . , k − do (cid:18) x k (2 i )+ j x k (2 i +1)+ j (cid:19) ← (cid:18) α − α (cid:19) (cid:18) x k (2 i )+ j x k (2 i +1)+ j (cid:19) For k ∈ { , . . . , m − } , deﬁne q k = ⌊ ℓ/ k +1 ⌋ , r k = ℓ − k +1 q k and q ′ k = q k − m − − k , so as to reﬂectthe values q , r and q ′ computed in Lines 7, 20 and 31 of the algorithm. Furthermore, deﬁne index sets R k = ( { i ∈ Z | m − ≤ i < ℓ } if k = m − , { i ∈ Z | k (2 q k ) ≤ i < ℓ } if k < m − , N IN-PLACE TRUNCATED FOURIER TRANSFORM 7 and R ′ k = ( { i ∈ Z | ℓ − m − ≤ i < k (2 q ′ k + 2) } if r k > k , { i ∈ Z | ℓ − m − ≤ i < k (2 q ′ k + 1) } if r k ≤ k . Then in Lines 6 to 17 of the algorithm (and also Lines 19 to 28), indices of the form 2 k (2 q ) + j or2 k (2 q + 1) + j always belong to R m − = { m − , . . . , ℓ − } , while indices of the form 2 k (2 q ′ ) + j or2 k (2 q ′ + 1) + j always belong to R ′ m − = { ℓ − m − , . . . , m − − } . Moreover, for each value of the loopvariable k , the set R k ∪ R ′ k consists of the indices i for which x i is overwritten during the correspondingiteration of the loop. For k ∈ { v, . . . , m − } ,(3.1) q k = ( q k +1 + 1 if r k +1 > k +1 , q k +1 if r k +1 ≤ k +1 . Consequently, R v ⊆ R v +1 ⊆ · · · ⊆ R m − and ∅ = R ′ v ⊆ R ′ v +1 ⊆ · · · ⊆ R ′ m − .We now use induction to determine the eﬀect of the loop in Lines 6 to 17. As x i = a m − ,i for i ∈ { , . . . , ℓ − } after Lines 3 and 4 have been performed, Lemma 2.3 implies that x i = a m − ,i for i ∈ R m − , and x i = a m − ,i = a m − , m − + i for i ∈ R ′ m − , at the beginning of the ﬁrst iteration of theloop. Suppose more generally that x i = a k +1 ,i for i ∈ R k +1 , and x i = a k +1 , m − + i for i ∈ R ′ k +1 , at thebeginning of some iteration of the loop, where k ∈ { v, . . . , m − } is the corresponding value of the loopvariable. If r k > k , then a k +1 , k (2 q k )+ j = x k (2 q k )+ j and a k +1 , k (2 q k +1)+ j = ( x k (2 q k +1)+ j if j < r k − k ,x k (2 q ′ k +1)+ j if j ≥ r k − k , for j ∈ { , . . . , k − } , since R k ⊆ R k +1 and R ′ k ⊆ R ′ k +1 . Thus, (2.1) implies that Lines 9 to 12 compute a k, k (2 q k )+ j and a k, k (2 q k +1)+ j for j ∈ { , . . . , k − } , with a k,i written to x i for i ∈ R k , and a k, m − + i written to x i for i ∈ R ′ k . If r k ≤ k , then a k +1 , k (2 q k )+ j = ( x k (2 q k )+ j if j < r k ,x k (2 q ′ k )+ j if j ≥ r k , and a k +1 , k (2 q k +1)+ j = x k (2 q ′ k +1)+ j for j ∈ { , . . . , k − } , since R k ⊆ R k +1 , R ′ k ⊆ R ′ k +1 and (3.1) implies that(3.2) (cid:8) i ∈ Z | k (2 q ′ k + 1) ≤ i < k (2 q ′ k + 1) + 2 k (cid:9) = R ′ k +1 \ R ′ k . Thus, (2.1) implies that Lines 14 to 17 compute a k, k (2 q k )+ j for j ∈ { , . . . , k − } , with a k,i written to x i for i ∈ R k , and a k, m − + i written to x i for i ∈ R ′ k . It follows that each iteration of the loop in Lines 6to 17 sets x i = a k,i for i ∈ R k , and x i = a k +1 , m − + i for i ∈ R ′ k , where k is the corresponding value of theloop variable. Therefore, after Lines 6 to 17 have been performed, the state of the array ( x , . . . , x ℓ − ) issummarised as follows:(3.3) x i =  a m − ,i if i / ∈ R m − ∪ R ′ m − ,a k +1 ,i if i ∈ R k +1 \ R k for some k ≥ v,a k +1 ,i +2 m − if i ∈ R ′ k +1 \ R ′ k for some k ≥ v,a v,i if i ∈ R v . We now use induction to determine the eﬀect of the loop in Lines 19 to 28. Therefore, suppose that x i = a k, m − + i for i ∈ R ′ k − at the beginning of some iteration of the loop, where k ∈ { v + 1 , . . . , m − } is the corresponding value of the loop variable. Then x i = a k, m − + i for i ∈ R ′ k − ∪ ( R ′ k \ R ′ k − ) = R ′ k , x i = a k,i for i ∈ R k \ R k − , and x i = a k +1 , m − + i for i ∈ R ′ k +1 \ R ′ k , since (3.3) held after Lines 6to 17 were performed, while any previous iterations of the loop only modiﬁed x i for i ∈ R ′ v +1 ∪ · · · ∪ R ′ k − = R ′ k − . Thus, if r k > k , then x k (2 q ′ k +1)+ j = a k, k (2 q k +1)+ j and x k (2 q k )+ j = a k, k (2 q k )+ j for j ∈ { r k − k , . . . , k − } , since (3.1) implies that(3.4) R k \ R k − = (cid:8) i ∈ Z | k (2 q k ) ≤ i < k (2 q k ) + 2 k (cid:9) . Consequently, (2.3) implies in this case that Lines 23 and 24 set x i = a k +1 , m − + i for i ∈ R ′ k . If r k ≤ k ,then x k (2 q ′ k )+ j = a k, k (2 q k )+ j and x k (2 q ′ k +1)+ j = a k +1 , k (2 q k +1)+ j for j ∈ { r k , . . . , k − } , since (3.2)holds. Thus, (2.4) implies that Lines 27 and 28 set x i = a k +1 , m − + i for i ∈ R ′ k . Therefore, we ﬁndthat if x i = a k, m − + i for i ∈ R ′ k − at the beginning of an iteration of the loop in Lines 19 to 28, then x i = a k +1 , m − + i for i ∈ R ′ k at the end of the iteration. As R ′ v is the empty set, it follows that the AN IN-PLACE TRUNCATED FOURIER TRANSFORM overall eﬀect of loop is to set x i = a m − , m + i for i ∈ R ′ m − . Hence, x i = a m − , m + i = a m − ,i for i ∈ ( R ′ m − \ R ′ m − ) ∪ R ′ m − = R ′ m − after the loop has been performed.Redeﬁne R m − to be the set { , . . . , ℓ − } . Then, as R = · · · = R v − = ∅ if v >

0, the state of thevector ( x , . . . , x ℓ − ) after Lines 19 to 28 have been performed is summarised as follows:(3.5) x i = ( a k +1 ,i if i ∈ R k +1 \ R k ,a ,i if i ∈ R . For k ∈ { , . . . , m − } , deﬁne L k = { i ∈ Z | ≤ i < k (2 q k ) } . Then for each value of the loopvariable k in Line 30, the set L k consists of the indices i for which x i is modiﬁed by Lines 31 to 36during the corresponding iteration of the loop. Moreover, L k is the complement of R k in { , . . . , ℓ − } for k ∈ { , . . . , m − } , and L ⊇ L ⊇ · · · ⊇ L m − = ∅ .Suppose that x i = a k +1 ,i for i ∈ L k +1 at the beginning of an iteration of the loop in Lines 30 to 36,where k is the corresponding value of the loop variable. Then x i = a k +1 ,i for i ∈ L k +1 ∪ ( R k +1 \ R k ) = L k ,since (3.5) held after Lines 19 to 28 were performed, while any previous iterations of the loop onlymodiﬁed x i for i ∈ L k +1 ∪ · · · ∪ L m − = L k +1 . Thus, (2.1) implies that x i = a k,i for i ∈ L k at the end ofthe iteration. As L m − = ∅ , it holds trivially that x i = a m − ,i for i ∈ L m − for at the beginning of theﬁrst iteration of the loop in Lines 30 to 36. Therefore, x i = a ,i for i ∈ L once all iterations of the loophave been performed. However, it also holds that x i = a ,i for i ∈ R = { , . . . , ℓ − }\ L , since (3.5) heldafter Lines 19 to 28 were performed, while Lines 30 to 36 only modiﬁed x i for i ∈ L ∪ · · · ∪ L m − = L .It follows that the algorithm terminates with x i = a i, for i ∈ { , . . . , ℓ − } , as required. (cid:3) Avoiding division by two.

It is possible to modify Algorithm 1 so as to avoid the use of themap div , and without incurring any additional costs. If Lines 11 and 12 are performed for some k ∈{ v, . . . , m − } , then the values stored in the entries x k (2 q )+ j for j ∈ { r − k , . . . , k − } by the loop maybe modiﬁed, so long as they are then set to the correct values when Lines 23 and 24 are performed for thesame value of k . Using the notation of the proof of Lemma 3.1, we see that such a modiﬁcation is possiblesince only entries x i with indices belonging to R k − ∪ R ′ k − are accessed or modiﬁed during the interveningperiod, while (3.4) implies that the entries x k (2 q k )+ j for j ∈ { r k − k , . . . , k − } have indices that belongto R k \ R k − = R k \ ( R k − ∪ R ′ k − ). Consequently, the algorithm may be modiﬁed so that Line 12 replaces( x k (2 q )+ j , x k (2 q ′ +1)+ j ) t by A ( x k (2 q )+ j , x k (2 q ′ +1)+ j ) t , and Line 24 replaces ( x k (2 q )+ j , x k (2 q ′ +1)+ j ) t by B ( x k (2 q )+ j , x k (2 q ′ +1)+ j ) t , for 2 × A and B over R such that (cid:0) (cid:1) A = (cid:0) − ω [2 q ] p (cid:1) and BA = (cid:18) ω [2 q ] p (cid:19) . Indeed, the requirement on the bottom row of A ensures that Line 12 assigns the correct value to x k (2 q ′ +1)+ j , while the requirement on the product BA ensures that Line 24 reverts the eﬀect of Line 12on x k (2 q ′ +1)+ j , and corrects the value stored in x k (2 q )+ j .If 2 is a unit in R , then Algorithm 1 in its current form is equivalent to taking A = (cid:18) ω [2 q ] p − ω [2 q ] p (cid:19) and B = (cid:18) − ω − [2 q ] p − − ω − [2 q ] p (cid:19) . To remove the use of the map div , along with the assumption that 2 is not a zero-divisor in R , weconsider the choice A = (cid:18) − ω [2 q ] p (cid:19) and B = (cid:18) ω [2 q ] p

11 0 (cid:19) , which implies that Algorithm 1 may be modiﬁed as follows: replace Line 12 by(3.6) (cid:18) x k (2 q )+ j x k (2 q ′ +1)+ j (cid:19) ← (cid:18) − α (cid:19) (cid:18) x k (2 q )+ j x k (2 q ′ +1)+ j (cid:19) , replace Line 22 by α ← ω [2 q ] p , and replace Line 24 by(3.7) (cid:18) x k (2 q )+ j x k (2 q ′ +1)+ j (cid:19) ← (cid:18) α

11 0 (cid:19) (cid:18) x k (2 q )+ j x k (2 q ′ +1)+ j (cid:19) . Performing (3.6) rather than Line 12 saves an addition, while performing (3.7) instead of Line 24 replacesthe cost of evaluating the map div with that of multiplying by 2. N IN-PLACE TRUNCATED FOURIER TRANSFORM 9

Computing the twiddle factors.

We now describe how the twiddle factors are computed inAlgorithm 1. Therefore, suppose that the algorithm has been called to compute a length ℓ transform.Let m = ⌈ log ℓ ⌉ and v = ord ℓ , as deﬁned in Line 1 of the algorithm. Furthermore, let q k = ⌊ ℓ/ k +1 ⌋ for k ∈ { , . . . , m − } , so as to reﬂect the values of q computed in Lines 7, 20 and 31.For the computation of the twiddle factors in Lines 7, 22 and 26 we may assume that v < m , sinceotherwise the lines are not performed by the algorithm. Then, for k ∈ { v, . . . , m − } , we have ω [2 q k ] p = (cid:16) ω n/ m (cid:17) k [ q k ] m − k − and ω − [2 q k ] p = (cid:16) ω n/ m (cid:17) k (2 m − k − [ q k ] m − k − ) , since q k < m − k − . As ω n/ ⌈ log2 ℓ ⌉ = ω n/ m is provided as an input to Algorithm 1, it follows that thetwiddle factors of Lines 7 and 26, and the inverse twiddle factors of Line 22, can be computed by asquare and multiply algorithm (see [8]) with O (1) space and an individual cost of O ( m ) multiplicationsby powers of ω . We propose to use this approach for Lines 7, 22 and 26 of the algorithm, at a total cost of O (log ℓ ) multiplications by powers of ω . However, we also note that Lines 22 and 26 can be implementedwith a total cost of O (log ℓ ) multiplications by powers of ω , since (3.1) implies that ω [2 q k +1 ] p = ( − (cid:0) ω [2 q k ] p (cid:1) if r k +1 > k +1 , (cid:0) ω [2 q k ] p (cid:1) if r k +1 ≤ k +1 , for k ∈ { v, . . . , m − } .Line 34 asks that we compute the pairs ( i, ω [2 i ] p ) for i ∈ { , . . . , q − } , where q ∈ { , . . . , m − } . Weimplement the line by generalising the approach of Section 2.3 in a manner that is reminiscent of thedecomposition of Lemma 2.6. Therefore, suppose that q ∈ { , . . . , m − } . Then q may be written in theform P tr =1 i r for t ≥ i > i > · · · > i t . The sets { i + · · · + 2 i s − + [ j ] i s | j ∈{ , . . . , i s − }} for s ∈ { , . . . , t } then form a partition of { , . . . , q − } . Furthermore,[2(2 i + · · · + 2 i s − + [ j ] i s )] p = 2 p − − i s j + s − X r =1 p − − i r for s ∈ { , . . . , t } and j ∈ { , . . . , i s − } . Thus, if q < m − , then ω [2(2 i + ··· +2 is − +[ j ] is )] p = τ s ( λ s ) j for s ∈ { , . . . , t } and j ∈ { , . . . , i s − } , where λ , . . . , λ t , τ , . . . , τ t ∈ R are deﬁned recursively as follows: λ r = ((cid:0) ω n/ m (cid:1) m − − i if r = 1 ,λ ir − − ir r − if r > , and τ r = ( r = 1 ,τ r − λ r − if r > . If q = 2 m − , then t = 1 and ω [2[ j ] i ] p = ( ω n/ m ) j for j ∈ { , . . . , i − } . Therefore, computing the pairs( i, ω [2 i ] p ) for i ∈ { , . . . , q − } reduces to computing the terms of t geometric progressions whose ﬁrstterms and common ratios are related by simple recurrence relations. We derive Algorithm 2 from theseobservations, which is presented in the style of a generator that may be called by Algorithm 1. Assumingthat repeated squaring is used for exponentiation, the algorithm performs q + O ( m ) multiplications bypowers of ω , and uses O (1) space.3.4. Complexity.

Recall that Lines 1 to 17 of Algorithm 1 compute a k, k +1 ⌊ ℓ/ k +1 ⌋ , . . . , a k, k ⌈ ℓ/ k ⌉− for k = m − , m − , . . . , v , while Lines 30 to 36 compute a k, , . . . , a k, k +1 ⌊ ℓ/ k +1 ⌋− for k = m − , m − , . . . , O ( ℓ ) operations in R for the modiﬁed algorithm, so that the diﬀerence in complexities between thealgorithm and van der Hoeven’s algorithm, and thus also Arnold’s algorithm, is conﬁned to lower orderterms. Lemma 3.2.

Suppose that the modiﬁcations of Section 3.2 are applied to Algorithm 1 with the multiplica-tion by in (3.7) implemented as an addition. Then the algorithm performs at most P tr =1 ( ℓ r /

2) log ℓ r +2 ℓ + O (log ℓ ) multiplications by powers of ω , and at most ℓ ⌊ log ℓ ⌋ + 2 ℓ additions or subtractions, where ℓ > ℓ > · · · > ℓ t are integer powers of two such that ℓ = P tr =1 ℓ r . Algorithm 2

GeneratePairs ( m, ω n/ m , q ) Input: m ∈ { , . . . , p } , the precomputed element ω n/ m , and q ∈ { , . . . , m − } . Output: the pairs ( i, ω [2 i ] p ) for i ∈ { , . . . , q − } , yielded one at a time. i ← ⌊ log q ⌋ , τ ← e ← min( m − − i, λ ← ( ω n/ m ) m − − i − e , µ ← λ e , θ ← τ for j = 1 , . . . , i − do θ ← θµ yield ([ j ] i , θ ) o ← i while q > o do i ′ ← i , i ← ⌊ log ( q − o ) ⌋ , τ ← τ λ , λ ← λ i ′− i , µ ← λ , θ ← τ yield ( o, θ ) for j = 1 , . . . , i − do θ ← θµ yield ( o + [ j ] i , θ ) o ← o + 2 i Proof.

Suppose that the modiﬁed algorithm has been called to evaluate a length ℓ transform. Let m = ⌈ log ℓ ⌉ and v = ord ℓ , as deﬁned in Line 1 of the algorithm. Furthermore, deﬁne q k = ⌊ ℓ/ k +1 ⌋ and r k = ℓ − k +1 q k for k ∈ { , . . . , m − } , so as to reﬂect the values of q and r computed in Lines 7, 20and 31. Write ℓ = P mk =0 k ℓ k with ℓ , . . . , ℓ m ∈ { , } , which we note is diﬀerent to how ℓ is written inthe statement of the lemma. All multiplications in the algorithm involve a power of ω . Consequently, wesimply refer to them as multiplications hereafter. Similarly, we do not distinguish between additions andsubtractions, since they are counted together.Lines 3 and 4 perform no multiplications and 2( ℓ − m − ) additions. The deﬁnition of v implies that ℓ v = 1 and r v = 2 v . Thus, Lines 7 to 17 perform at most 2 v additions for k = v . It also follows thatfor k ∈ { v + 1 , . . . , m − } , r k > k if and only if ℓ k = 1. Consequently, for k ∈ { v + 1 , . . . , m − } such that r k > k , Lines 7 to 17 and Lines 20 to 28 perform a combined total of r k + 2(2 k +1 − r k ) =2 k +1 + (2 k +1 − r k ) ≤ k +1 + 2 k ℓ k additions. Similarly, for k ∈ { v + 1 , . . . , m − } such that r k ≤ k , Lines 7to 17 and Lines 20 to 28 perform a combined total 2 k + (2 k − r k ) ≤ k +1 + 2 k ℓ k additions. Therefore,Lines 6 to 28 perform at most 2 v + m − X k = v +1 (cid:0) k +1 + 2 k ℓ k (cid:1) ≤ ℓ + 2 m − additions. Excluding the computation of the twiddle factors, the number of multiplications by performedby Lines 6 to 28 is easily seen to be bounded by the number of additions they perform. Moreover, weknow from Section 3.3 that the total cost of computing the twiddle factors in Lines 6 to 28 is O (log ℓ )multiplications. It follows that Lines 6 to 28 perform at most ℓ + 2 m − + O (log ℓ ) multiplications.For each k ∈ { , . . . , m − } , Lines 31 to 36 perform 2 k ( q k −

1) + q k + O (log ℓ ) multiplications, and2 k +1 q k additions, since we know from Section 3.3 that Line 34 performs q k + O (log ℓ ) multiplications.We have m − X k =0 k +1 q k = m − X k =0 m X i = k +1 i ℓ i = m X i =0 i iℓ i − m ℓ m ≤ ℓ ⌊ log ℓ ⌋ − ( ℓ − m − )and m − X k =0 (cid:0) q k − k (cid:1) ≤ m − X k =0 (cid:18) ℓ k +1 − k (cid:19) ≤ ℓ − m − + 1 . Thus, Lines 30 to 36 perform at most P mi =0 i − iℓ i + ℓ − m − + O (log ℓ ) multiplications, and at most ℓ ⌊ log ℓ ⌋ − ( ℓ − m − ) additions. Summing each set of bounds and discarding the terms with ℓ i = 0 thencompletes the proof. (cid:3) Algorithm 1 and Arnold’s TFT algorithm both reduce to the algorithm of Section 2.3 (without thepermutation) when ℓ is a power of two. If ℓ is not a power of two, then Arnold’s algorithm performsapproximately P tr =1 ( ℓ r /

2) log ℓ r multiplications, where the error made is O (log ℓ ), in order to evaluatethe TFTs of the decomposition in Lemma 2.6. To compute the polynomials f r ( x/ω e r ) that are evaluated N IN-PLACE TRUNCATED FOURIER TRANSFORM 11 by the transforms, Arnold’s algorithm ﬁrst replaces the coeﬃcients of f by those of f ( ω [ ℓ ] p x ) “term-by-term”, by computing successive powers of ω [ ℓ ] p , and multiplying each power by the correspondingcoeﬃcient of f . By observing that ω [ ℓ ] p ℓ = −

1, one only has to compute the ﬁrst ℓ powers of ω [ ℓ ] p .However, the observation also implies that these ℓ > ℓ/ . ℓ multiplications when computing the coeﬃcientsof f ( ω [ ℓ ] p x ). The coeﬃcients of the residues f r are later replaced with those of f r ( x/ω e r ) by a similarmethod, requiring at least ℓ −⌈ log ℓ ⌉ multiplications. Thus, Lemma 3.2 implies that Algorithm 1 performsΩ( ℓ ) fewer multiplications than Arnold’s algorithm when ℓ is not a power of two.4. An in-place inverse truncated Fourier transform

Assume for simplicity that 2 is a unit in R . Then Algorithm 1 is comprised of a series of invertible lineartransformations. Consequently, the truncated Fourier transform can be inverted in this case by proceedingbackwards through the algorithm and inverting each transformation. However, this approach does notimmediately yield an algorithm with complexity similar to that of Algorithm 1, since the transformationsof Lines 4, 10, 12, 33 and 36 are less expensive to evaluate than their inverses: (cid:18) α − α (cid:19) − = 2 − (cid:18) α − − α − (cid:19) for α ∈ R × . We can address this problem when ℓ is a power of two, in which case it is only necessary to invert thetransformations of Lines 4, 33 and 36, by simply ignoring multiplications by 2 − during the algorithm, andaccounting for them afterwards by multiplying each entry of ( x , . . . , x ℓ − ) by (2 − ) log ℓ . The algorithmthen performs only ℓ + O (log log ℓ ) more multiplications than Algorithm 1. To generalise this approachto arbitrary ℓ , we weight the sequences deﬁned in Lemma 2.1 by setting ˜ a k,i = 2 k a k,i for k ∈ { , . . . , p } and i ∈ { , . . . , n − } . Then (2.1) implies that (cid:18) a k, k (2 i )+ j a k, k (2 i +1)+ j (cid:19) = (cid:18) ω [2 i ] p − ω [2 i ] p (cid:19) (cid:18) ˜ a k +1 , k (2 i )+ j ˜ a k +1 , k (2 i +1)+ j (cid:19) for k ∈ { , . . . , p − } , i ∈ { , . . . , p − k − − } and j ∈ { , . . . , k − } . Thus, it is straight-forwardto derive a version of Algorithm 1 that works by computing the values ˜ a k,i in-place of the values a k,i .Reversing the steps of this new algorithm and inverting its transformations then yields Algorithm 3.The twiddle factors in Algorithm 3 can be computed by the methods of Section 3.3. Consequently, byassuming that all multiplications by 2 in the algorithm are implemented as additions, a proof similar tothat of Lemma 3.2 provides the following bounds. Lemma 4.1.

Algorithm 3 performs at most ( ℓ/ ⌊ log ℓ ⌋ + 2 ℓ + O (log ℓ ) multiplications by powers of ω ,at most ⌈ log ℓ ⌉ + O (log log ℓ ) multiplications by powers of − , and at most ℓ ⌊ log ℓ ⌋ + 3 ℓ additions orsubtractions. The assumption that 2 is a unit in R may be replaced by the weaker assumption that 2 is not a zero-divisor in R if all multiplications by powers of 2 − in Algorithm 3 are replaced by in-place evaluationsof the corresponding maps div k : 2 k R → R deﬁned by 2 k a a for a ∈ R . References

1. Andrew Arnold,

A new truncated Fourier transform algorithm , ISSAC 2013—Proceedings of the 38th InternationalSymposium on Symbolic and Algebraic Computation, ACM, New York, 2013, pp. 15–22.2. David G. Cantor and Erich Kaltofen,

On fast multiplication of polynomials over arbitrary algebras , Acta Inform. (1991), no. 7, 693–701.3. James W. Cooley, The re-discovery of the fast Fourier transform algorithm , Microchimica Acta (1987), no. 1, 33–45.4. James W. Cooley and John W. Tukey, An algorithm for the machine calculation of complex Fourier series , Math.Comp. (1965), 297–301.5. Carl Friedrich Gauss, Nachlaß: Theoria interpolationis methodo nova tractata , Carl Friedrich Gauss, Werke, Band 3,K¨oniglichen Gesellschaft der Wissenschaften, G¨ottingen, 1866, pp. 265–330.6. David Harvey and Daniel S. Roche,

An in-place truncated Fourier transform and applications to polynomial multiplica-tion , ISSAC 2010—Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation, ACM,New York, 2010, pp. 325–329.7. Michael T. Heideman, Don H. Johnson, and C. Sidney Burrus,

Gauss and the history of the fast Fourier transform ,Arch. Hist. Exact Sci. (1985), no. 3, 265–277.8. Donald E. Knuth, The art of computer programming, Volume 2: Seminumerical algorithms , 3rd ed., Addison-Wesley,Reading, Massachusetts, 1997.9. Todd Mateer,

Fast Fourier Transform algorithms with applications , Ph.D. thesis, Clemson University, 2008.

Algorithm 3

ITFT ( ℓ, ω n/ ⌈ log2 ℓ ⌉ , − , ( x , . . . , x ℓ − )) Input: the transform length ℓ ∈ { , . . . , n } , precomputed elements ω n/ ⌈ log2 ℓ ⌉ and 2 − , and an array( x , . . . , x ℓ − ) containing TFT ω,ℓ ( a ) for some vector a ∈ R ℓ . Result: the vector a written to ( x , . . . , x ℓ − ). m ← ⌈ log ℓ ⌉ , v ← ord ℓ for k = 0 , , . . . , m − do q ← ⌊ ℓ/ k +1 ⌋ for j = 0 , . . . , k − do (cid:18) x j x k + j (cid:19) ← (cid:18) − (cid:19) (cid:18) x j x k + j (cid:19) for ( i, α ) ∈ { ( i, ω − [2 i ] p ) | i ∈ { , . . . , q − }} do for j = 0 , . . . , k − do (cid:18) x k (2 i )+ j x k (2 i +1)+ j (cid:19) ← (cid:18) α − α (cid:19) (cid:18) x k (2 i )+ j x k (2 i +1)+ j (cid:19) for k = m − , m − , . . . , v + 1 do q ← ⌊ ℓ/ k +1 ⌋ , r ← ℓ − k +1 q , q ′ ← q − m − k − , α ← ω [2 q ] p if r > k then for j = r − k , . . . , k − do x k (2 q ′ +1)+ j ← x k (2 q )+ j − αx k (2 q ′ +1)+ j else for j = r, . . . , k − do x k (2 q ′ )+ j ← − ( x k (2 q ′ )+ j + αx k (2 q ′ +1)+ j ) for k = v, v + 1 , . . . , m − do q ← ⌊ ℓ/ k +1 ⌋ , r ← ℓ − k +1 q , q ′ ← q − m − k − if r > k then α ← ω − [2 q ] p for j = 0 , . . . , r − k − do (cid:18) x k (2 q )+ j x k (2 q +1)+ j (cid:19) ← (cid:18) α − α (cid:19) (cid:18) x k (2 q )+ j x k (2 q +1)+ j (cid:19) for j = r − k , . . . , k − do (cid:18) x k (2 q )+ j x k (2 q ′ +1)+ j (cid:19) ← (cid:18) α − α (cid:19) (cid:18) x k (2 q )+ j x k (2 q ′ +1)+ j (cid:19) else α ← ω [2 q ] p for j = 0 , . . . , r − do x k (2 q )+ j ← x k (2 q )+ j − αx k (2 q ′ +1)+ j for j = r, . . . , k − do x k (2 q ′ )+ j ← x k (2 q ′ )+ j − αx k (2 q ′ +1)+ j α ← (2 − ) m − for j = ℓ − m − , . . . , m − − do x j ← αx j α ← − α for j = 0 , . . . , ℓ − m − − do (cid:18) x j x m − + j (cid:19) ← α (cid:18) − (cid:19) (cid:18) x j x m − + j (cid:19)

10. Daniel S. Roche,

Eﬃcient computation with sparse and dense polynomials , Ph.D. thesis, University of Waterloo, 2011.11. Arnold Sch¨onhage and Volker Strassen,

Schnelle Multiplikation grosser Zahlen , Computing (1971), 281–292.12. Igor S. Sergeev, Regular estimates for the complexity of polynomial multiplication and truncated Fourier transform ,Prikl. Diskr. Mat. (2011), 72–88.13. Joris van der Hoeven,

The truncated Fourier transform and applications , ISSAC 2004—Proceedings of the 2004 Inter-national Symposium on Symbolic and Algebraic Computation, ACM, New York, 2004, pp. 290–296.14. Joris van der Hoeven,