Fast In-place Algorithms for Polynomial Operations: Division, Evaluation, Interpolation
Pascal Giorgi, LIRMM, Univ. Montpellier, CNRS, Montpellier, France ([email protected])
Bruno Grenet, LIRMM, Univ. Montpellier, CNRS, Montpellier, France ([email protected])
Daniel S. Roche, United States Naval Academy, Annapolis, Maryland, U.S.A. ([email protected])
June 11, 2020
Abstract
We consider space-saving versions of several important operations on univariate polynomials, namely power series inversion and division, division with remainder, multi-point evaluation, and interpolation. Now-classical results show that such problems can be solved in (nearly) the same asymptotic time as fast polynomial multiplication. However, these reductions, even when applied to an in-place variant of fast polynomial multiplication, yield algorithms which require at least a linear amount of extra space for intermediate results. We demonstrate new in-place algorithms for the aforementioned polynomial computations which require only constant extra space and achieve the same asymptotic running time as their out-of-place counterparts. We also provide a precise complexity analysis so that all constants are made explicit, parameterized by the space usage of the underlying multiplication algorithms.
Computations with dense univariate polynomials or truncated power series over a finite ring are of central importance in computer algebra and symbolic computation. Since the discovery of sub-quadratic ("fast") multiplication algorithms [12, 4, 19, 10, 3], a major research task has been to reduce many other polynomial computations to the cost of polynomial multiplication. This project has been largely successful, starting with symbolic Newton iteration for fast inversion and division with remainder [ ], product tree algorithms for multi-point evaluation and interpolation [ ], the "half-GCD" fast Euclidean algorithm [ ], and many more related important problems [2, 6]. Not only are these problems important in their own right, but they also form the basis for many more, such as polynomial factorization, multivariate and/or sparse polynomial arithmetic, structured matrix computations, and further applications in areas such as coding theory and public-key cryptography.

But the use of fast arithmetic frequently comes at the expense of extra temporary space to perform the computation. This can make a difference in practice, from the small scale where embedded systems engineers seek to minimize hardware circuitry, to the medium scale where a space-inefficient algorithm can exceed the boundaries of (some level of) cache and cause expensive cache misses, to the large scale where main memory may simply not be sufficient to hold the intermediate values. In a streaming model, where the output must be written only once, in order, explicit time-space tradeoffs prove that fast multiplication algorithms will always require up to linear extra space. And indeed, all sub-quadratic polynomial multiplication algorithms we are aware of, in their original formulation, require linear extra space [12, 4, 19, 10, 3].

However, if we treat the output space as pre-allocated random-access memory, allowing values in output registers to be both read and written multiple times, then improvements are possible. In-place quadratic-time algorithms for polynomial arithmetic are described in [ ]. A series of recent results provide explicit algorithms and reductions from arbitrary fast multiplication routines which have the same asymptotic running time, but use only constant extra space [17, 11, 8]. That is, these algorithms trade a constant increase in the running time for a linear reduction in the amount of extra space. So far, these results are limited to multiplication routines and related computations such as the middle and short products. Applying in-place multiplication algorithms directly to other problems, such as those considered in this paper, does not immediately yield an in-place algorithm for the desired application problem.

Time | Space | Reference
Power series inversion at precision n:
  (λ_m + λ_s) M(n) | max(c_m, c_s + 1) n/2 | [9, Alg. MP-inv]
  λ_m M(n) log_{(c_m+2)/(c_m+1)}(n) | O(1) | Theorem 2.3
Power series division at precision n:
  (λ_m + λ_s) M(n) | max(c_m, c_s + 2) n/2 | [9, Alg. MP-div-KM]
  λ_m M(n) log_{(c_m+2)/(c_m+1)}(n) | O(1) | Theorem 2.5
  O(M(n)) | α n, for any α > 0 | Remark 2.7
  (λ_m (c² − c + 2)/(2c) + λ_s (c + 1)/c) M(n) | O(1)‡ | Corollary 2.6
Euclidean division of polynomials in sizes (m + n − 1, n):
  (λ_m + λ_s) M(m) + λ_s M(n) | max(max(c_m, c_s + 2) m/2 − n, c_s n) | standard algorithm
  λ_s M(m) + (λ_m + λ_s) M(n) | (1 + max(c_m, c_s + 1, 2 c_s)) n | ⌈m/n⌉ balanced div. (precomp.)
  (λ_m (c² − c + 2)/(2c) + λ_s (3c + 1)/c) M(m) | O(1) | Theorem 2.8
Multipoint evaluation of a size-n polynomial on n points:
  M(n) log(n) | n log(n) | [ ]
  M(n) log(n) | n | [ ], Lemma 3.1
  (1 + λ_s / log((c_s + 2)/(c_s + 1))) M(n) log(n) | O(1) | Theorem 3.4
Interpolation of a size-n polynomial on n points:
  M(n) log(n) | n log(n) | [ ]
  5 M(n) log(n) | n | [6, 7], Lemma 3.3
  ≃ M(n) log(n) | O(1) | Theorem 3.6

Table 1: Summary of complexity analyses, omitting non-dominant terms and assuming c_f ≤ c_s ≤ c_m. We use c = c_m + 3. For O(1)‡ space, the memory model is changed such that the input dividend can be overwritten. Here, and throughout the paper, the base of the logarithms is 2 if not otherwise stated.

In this paper, we present new in-place algorithms for power series inversion and division, polynomial division with remainder, multi-point evaluation, and interpolation. These algorithms are fast because their running time is only a constant factor larger than that of the fastest known out-of-place algorithms, parameterized by the cost of dense polynomial multiplication.

Our space complexity model is the one of [
17, 11, 8], where input space is read-only while output space is pre-allocated and can be used to store intermediate results. In that model, the space complexity is measured by counting only the auxiliary space required during the computation, excluding input and output spaces. We shall mention that a single memory location or register may contain either an element of the coefficient ring, or a pointer to the input or output space. It follows that in-place algorithms are those that require only a constant number of extra memory locations.

For all five problems, we present in-place variants which have nearly the same asymptotic running time as their fastest out-of-place counterparts. The power series inversion and division algorithms incur an extra log(n) overhead when quasi-linear multiplication is used, while the polynomial division, evaluation, and interpolation algorithms keep the same asymptotic runtime as the fastest known algorithms. Our reductions essentially trade a small amount of extra runtime for a significant decrease in space usage.

Our motivation in this work is mainly theoretical. We address the existence of such fast in-place algorithms, as we already did for polynomial multiplication [ ]. To further extend our results, we compare precisely the number of arithmetic operations in our algorithms with the best known theoretical bounds. These results are summarized in Table 1.

Of course, further work is needed to determine the practicability of our approach. In particular, cache misses play a predominant role when dealing with memory management. Studying the cache complexity of all these algorithms, for instance in the idealized cache model [ ], would give more precise insights. However, the practicability will heavily depend on the underlying multiplication algorithms. Due to their diversity and the need for fine-tuned implementations, we leave this task to future work.

By a size-n polynomial, we mean a polynomial of degree ≤ n − 1. As usual, we denote by M(n) a bound on the number of operations in K to multiply two size-n polynomials, and we assume that α M(n) ≤ M(αn) for any constant α ≥ 1. All known multiplication algorithms have at most a linear space complexity. Nevertheless, several results reduce this space complexity at the expense of a slight increase in the time complexity [20, 17, 11, 8]. To provide tight analyses, we consider multiplication algorithms with time complexity λ_f M(n) and space complexity c_f n, for some constants λ_f ≥ 1 and c_f ≥ 0.

The middle product of a size-(m + n − 1) polynomial F ∈ K[X] and a size-n polynomial G ∈ K[X] is the size-m polynomial defined as MP(F, G) = (FG div X^(n−1)) mod X^m. We denote by λ_m M(n) and c_m n the time and space complexities of the middle product in sizes (2n − 1, n). Then, a middle product in sizes (m + n − 1, n) where m < n can be computed with ⌈n/m⌉ λ_m M(m) operations in K and (c_m + 1)m extra space. Similarly, the short product of two size-n polynomials F, G ∈ K[X] is defined as SP(F, G) = FG mod X^n, and we denote by λ_s M(n) and c_s n its time and space complexities.

On the one hand, the most time-efficient algorithms achieve λ_f = λ_m = λ_s = 1 and c_f, c_m, c_s ≤ 4, using the transposition principle [9, 2] for λ_m = λ_f. On the other hand, the authors recently proposed new space-efficient algorithms reaching c_f = c_m = c_s = 0, while λ_f, λ_m and λ_s remain constants [ ].

Writing F = Σ_{i=0}^{d} f_i X^i ∈ K[X], we will use rev(F) ∈ K[X] to denote the reverse polynomial of F, that is, rev(F) = X^d F(1/X), whose computation does not involve any operations in K. Note that we will abusively use the notation F[a..b[ to refer to the chunk of F that is the polynomial Σ_{i=a}^{b−1} f_i X^(i−a), and the notation F[a] for the coefficient f_a. Considering our storage, the notation F[a..b[ will also serve to refer to the specific registers associated to F. When necessary, our algorithms indicate with WS the output registers used as work space.

In this section, we present in-place algorithms for the inversion and the division of power series, as well as the Euclidean division of polynomials. As a first step, we investigate the space complexity of the existing fast algorithms from the literature for these computations.
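As a concrete reference for the MP and SP notations above, the following Python sketch implements the three product variants naively on coefficient lists (low degree first). It is illustrative only: the function names are ours, and these quadratic loops stand in for the fast routines with costs λ_f M(n), λ_m M(n) and λ_s M(n) assumed in the analyses.

```python
# Naive reference implementations of the product variants (illustrative
# only; the paper assumes fast, space-efficient subroutines instead).

def full_product(F, G):
    """Plain product of two coefficient lists (low degree first)."""
    H = [0] * (len(F) + len(G) - 1)
    for i, f in enumerate(F):
        for j, g in enumerate(G):
            H[i + j] += f * g
    return H

def short_product(F, G):
    """SP(F, G) = F*G mod X^n for two size-n polynomials."""
    n = len(F)
    assert len(G) == n
    return full_product(F, G)[:n]

def middle_product(F, G):
    """MP(F, G) = (F*G div X^(n-1)) mod X^m, where |F| = m+n-1, |G| = n."""
    n = len(G)
    m = len(F) - n + 1
    return full_product(F, G)[n - 1 : n - 1 + m]
```

For instance, with F = 1 + 2X + 3X² and G = 4 + 5X + 6X², the full product is [4, 13, 28, 27, 18] and the short product keeps its low half [4, 13, 28].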
Power series inversion
Power series inversion is usually computed through Newton iteration: if G is the inverse of F at precision k, then H = G + (1 − GF)G mod X^(2k) is the inverse of F at precision 2k. This allows one to compute F^(−1) at precision n using O(M(n)) operations in K, see [6, Chapter 9]. As noticed in [9, Alg. MP-inv], only the coefficients of degree k to 2k − 1 of H are needed. Thus, assuming that G[0..k[ = F^(−1) mod X^k, one step of Newton iteration computes k new coefficients of F^(−1) into G[k..2k[ as

    G[k..2k[ = −SP(MP(F[1..2k[, G[0..k[), G[0..k[).    (1)

The time complexity is then (λ_m + λ_s) M(n) for an inversion at precision n. For the space complexity, the most consuming part is the last iteration, of size n/2. It needs max(c_m, c_s + 1) n/2 extra registers: one can compute the middle product in G[n/2..n[ using c_m n/2 extra registers, then move it to n/2 extra registers and compute the short product using c_s n/2 registers.

Power series division
Let F, G ∈ K[[X]]. The fast approach to compute F/G mod X^n is to first invert G at precision n and then to multiply the result by F. The complexity is given by one inversion and one short product at precision n. Actually, Karp and Markstein remarked in [ ] that F/G can be directly computed during the last iteration. Applying this trick, the complexity becomes (λ_m + λ_s) M(n) [ ], see also [ ]. The main difference with inversion is the storage of the short product of size n/2, yielding a space complexity of max(c_m, c_s + 2) n/2.

Euclidean division of polynomials

Given two polynomials A, B of respective sizes m + n − 1 and n, the fast Euclidean division computes the quotient A div B as rev(rev(A)/rev(B)), viewed as power series at precision m [6, Chapter 9]. The remainder R is retrieved with a size-n short product, yielding a total time complexity of (λ_m + λ_s) M(m) + λ_s M(n). Since the remainder size is not determined by the input size, we assume that we are given a maximal output space of size n − 1. As this space remains free when computing the quotient, this computation requires max(max(c_m, c_s + 2) m/2 − n, c_s n) extra registers.

As a first result, when m ≤ n, using space-efficient multiplications is enough to obtain an in-place O(M(n)) Euclidean division. Indeed, the output space is enough to compute the small quotient, while the remainder can be computed in place [ ]. When m > n, the space complexity becomes O(m − n). In that case, the Euclidean division of A by B can also be computed by ⌈m/n⌉ balanced Euclidean divisions of polynomials of size 2n − 1 by B. This actually corresponds to a variation of the long division algorithm, in which each step computes n new coefficients of the quotient. To save some time, one can precompute the inverse of rev(B) at precision n, which gives a time complexity (λ_m + λ_s) M(n) + (m/n) λ_s M(n) ≤ λ_s M(m) + (λ_m + λ_s) M(n) and a space complexity (1 + max(c_m, c_s + 1, 2 c_s)) n.

Finally, one may consider computing only the quotient or only the remainder. Computing the quotient only is equivalent to power series division. For the computation of the remainder, it is not yet known how to compute it without the quotient; in that case, we shall consider the space usage for the computation and the storage of the quotient. When m is large compared to n, one may notice that relying on balanced divisions does not require one to retain the whole quotient, but only its n latest computed coefficients. In that case the space complexity only increases by n. Since we can always perform a middle product via two short products, we obtain the following result.

Lemma 2.1. Given A ∈ K[X] of size m and B ∈ K[X], monic of size n, and provided n registers for the output, the remainder A mod B can be computed using 2λ_s M(m) + λ_s M(n) + O(m + n) operations in K and (c_s + 2) n extra registers.

We notice that during the first Newton iterations, only a few coefficients of the inverse have already been written.
The output space thus contains lots of free registers, and the standard algorithm can use them as working space. In the last iterations, the number of free registers becomes too small to perform a standard iteration. Our idea is then to slow down the computation: instead of still doubling the number of coefficients computed at each iteration, the algorithm computes fewer and fewer coefficients, in order to be able to use the free output space as working space. We denote these two phases as the acceleration and deceleration phases.

The following easy lemma generalizes Newton iteration to compute only ℓ ≤ k new coefficients from an inverse at precision k.

Lemma 2.2. Let F be a power series and G[0..k[ contain its inverse at precision k. Then for 0 < ℓ ≤ k, if we compute

    G[k..k+ℓ[ = −SP(MP(F[1..k+ℓ[, G[0..k[), G[0..ℓ[)    (2)

then G[0..k+ℓ[ contains the inverse of F at precision k + ℓ.

Algorithm 1 is an in-place fast inversion algorithm. The accelerating and decelerating phases correspond to ℓ = k and ℓ < k, respectively.

Algorithm 1 In-Place Fast Power Series Inversion (InPlaceInv)
Input: F ∈ K[X] of size n, such that F[0] is invertible.
Output: G ∈ K[X] of size n, such that FG = 1 mod X^n.
Required: MP and SP algorithms using extra space ≤ c_m n and ≤ c_s n.
1: G[0] ← F[0]^(−1)
2: k ← 1; ℓ ← 1
3: while ℓ > 0 do
4:     G[n−ℓ..n[ ← MP(F[1..k+ℓ[, G[0..k[)            ▷ WS: G[k..n−ℓ[
5:     G[k..k+ℓ[ ← SP(G[0..ℓ[, −G[n−ℓ..n[)           ▷ WS: G[k+ℓ..n−ℓ[
6:     k ← k + ℓ
7:     ℓ ← min(k, ⌊(n − k)/c⌋), where c = 2 + max(c_m, c_s)
8: G[k..n[ ← SP(G[0..n−k[, −MP(F[1..n[, G[0..k[))    ▷ O(1) space

Theorem 2.3.
Algorithm 1 is correct. It uses O(1) extra space, and either λ_m M(n) log_{(c_m+2)/(c_m+1)}(n) + O(M(n)) operations in K when M(n) is quasi-linear, or O(M(n)) operations in K when M(n) = n^(1+γ), 0 < γ ≤ 1.

Proof. Steps 4 and 5, and Step 8, correspond to Equation (2): they compute ℓ new coefficients of G when k of them are already written in the output, whence Lemma 2.2 implies the correctness. Step 4 needs (c_m + 1)ℓ free registers for its computation and its storage. Then (c_s + 2)ℓ free registers are needed for Step 5: ℓ registers for G[n−ℓ..n[ and (c_s + 1)ℓ registers for the short product computation and its result. For this computation to be done in place, we need cℓ ≤ n − k. Since at most k new coefficients can be computed, the maximal number of new coefficients at each step is ℓ = min(k, ⌊(n − k)/c⌋).

Each iteration uses O(M(k)) operations in K: O(⌈k/ℓ⌉ M(ℓ)) for the middle product at Step 4 and O(M(ℓ)) for the short product at Step 5. The accelerating phase stops when k > (n − k)/c, that is, k > n/(c + 1). It costs Σ_{i=0}^{⌊log(n/(c+1))⌋} M(2^i) = O(M(n)). During the decelerating phase, each iteration computes a constant fraction of the remaining coefficients. Hence, this phase lasts for δ = log_{c/(c−1)}(n) steps.

Let ℓ_i and k_i denote the values of ℓ and k at the i-th iteration of the deceleration phase, and let t_i = n − k_i. Then one iteration of the deceleration phase costs one middle product in sizes (n − t_i + ⌊t_i/c⌋ − 1, n − t_i) and one short product in size ⌊t_i/c⌋. The total cost of all the short products amounts to Σ_i M(t_i) = O(M(n)) since Σ_i t_i ≤ cn. The cost of the middle product at the i-th step is λ_m ⌈(n − t_i)/⌊t_i/c⌋⌉ M(⌊t_i/c⌋) ≤ λ_m M(n) + O(n). Therefore, the total cost of all the middle products is at most λ_m M(n) log_{c/(c−1)}(n) + O(M(n)), and it is dominant in the complexity. We can choose the in-place short products of [ ] and get c = c_m + 2. The complexity is then λ_m M(n) log_{(c_m+2)/(c_m+1)}(n) + O(M(n)).

If M(n) = n^(1+γ) with 0 < γ ≤ 1, the cost of each iteration is O(⌈(n − t_i)/ℓ_i⌉ ℓ_i^(1+γ)). Since ℓ_i ≤ n, we have ℓ_i < n ((c−1)/c)^i + c, whence

    Σ_{i=0}^{δ} ⌈(n − t_i)/ℓ_i⌉ ℓ_i^(1+γ) ≤ n Σ_{i=0}^{δ} ℓ_i^γ ≤ n Σ_{i=0}^{δ} (n ((c−1)/c)^i + c)^γ.

Since 0 < γ ≤ 1, we have (α + β)^γ ≤ α^γ + β^γ for any α, β > 0, and the complexity is n^(1+γ) Σ_{i=0}^{δ} ((c−1)/c)^(iγ) + O(n log n) = O(M(n)).

Division of power series can be implemented easily as an inversion followed by a product. Yet, using in-place algorithms for these two steps is not enough to obtain an in-place division algorithm, since the intermediate result must be stored. Karp and Markstein's trick, which includes the dividend in the last iteration of Newton's iteration [ ], cannot be used directly in our case since we replace the very last iteration by several ones. We thus need to build our in-place algorithm on the following generalization of their method.

Lemma 2.4.
Let F and G be two power series, G invertible, and let Q[0..k[ contain their quotient at precision k. Then for 0 < ℓ ≤ k, if we compute

    Q[k..k+ℓ[ = SP(G^(−1)[0..ℓ[, F[k..k+ℓ[ − MP(G[1..k+ℓ[, Q[0..k[))

then Q[0..k+ℓ[ contains their quotient at precision k + ℓ.

Proof. Let us write F/G = Q_k + X^k Q_ℓ + O(X^(k+ℓ)). We prove that Q_ℓ = G^(−1) × ((F − GQ_k) div X^k) mod X^ℓ. By definition, F ≡ G(Q_k + X^k Q_ℓ) mod X^(k+ℓ). Hence (F − GQ_k) div X^k = GQ_ℓ mod X^ℓ. Therefore, Q_ℓ = (G^(−1) × ((F − GQ_k) div X^k)) mod X^ℓ. Finally, since only the coefficients of degree k to k + ℓ − 1 of GQ_k are needed, they can be computed as MP(G[1..k+ℓ[, Q[0..k[).

Algorithm 2 is an in-place power series division algorithm based on Lemma 2.4, choosing at each step the appropriate value of ℓ so that all computations can be performed in place.

Theorem 2.5.
Algorithm 2 is correct. It uses O(1) extra space, and either λ_m M(n) log_{(c_m+2)/(c_m+1)}(n) + O(M(n)) operations in K when M(n) is quasi-linear, or O(M(n)) operations in K when M(n) = O(n^(1+γ)), 0 < γ ≤ 1.

Algorithm 2 In-Place Power Series Division (InPlacePSDiv)
Input: F, G ∈ K[X] of size n, such that G[0] is invertible.
Output: Q ∈ K[X] of size n, such that F/G = Q mod X^n.
Required: MP, SP, Inv algorithms using extra space ≤ c_m n, c_s n, c_i n.
1: k ← ⌊n / max(c_i + 1, c_s + 2)⌋
2: Q[n−k..n[ ← rev(Inv(G[0..k[))                ▷ WS: Q[0..n−k[
3: Q[0..k[ ← SP(F[0..k[, rev(Q[n−k..n[))        ▷ WS: Q[k..n−k[
4: ℓ ← ⌊(n − k) / (2 + max(c_m, c_s))⌋
5: while ℓ > 0 do
6:     Q[n−2ℓ..n−ℓ[ ← MP(G[1..k+ℓ[, Q[0..k[)    ▷ WS: Q[k..n−2ℓ[
7:     Q[n−2ℓ..n−ℓ[ ← F[k..k+ℓ[ − Q[n−2ℓ..n−ℓ[
8:     Q[k..k+ℓ[ ← SP(Q[n−2ℓ..n−ℓ[, Q*_ℓ)       ▷ WS: Q[k+ℓ..n−2ℓ[, where Q*_ℓ = rev(Q[n−ℓ..n[)
9:     k ← k + ℓ
10:    ℓ ← ⌊(n − k) / (2 + max(c_m, c_s))⌋
11: tmp ← F[k..n[ − MP(G[1..n[, Q[0..k[)        ▷ constant space
12: Q[k..n[ ← SP(tmp, rev(Q[k..n[))             ▷ constant space

Proof.
The correctness follows from Lemma 2.4. The inverse of G is computed once, at Step 2, at precision ⌊n / max(c_i + 1, c_s + 2)⌋. Its coefficients are then progressively overwritten during the loop, since Step 8 only requires ℓ coefficients of the inverse and ℓ is decreasing. Since c_i = max(c_m, c_s + 1), ℓ is always less than the initial precision. For simplicity of the presentation, we store the inverse in reversed order in Q[n−k..n[. Step 2 requires space c_i k while the free space has size n − k: since k ≤ n/(c_i + 1), the free space is large enough. Similarly, the next step requires space c_s k while the free space has size n − 2k, and k ≤ n/(c_s + 2). Step 6 needs (c_m + 1)ℓ space and the free space has size n − k − ℓ, and Step 8 requires c_s ℓ space while the free space has size n − k − 2ℓ. Since ℓ ≤ (n − k)/(2 + max(c_m, c_s)), these computations can also be performed in place.

The time complexity analysis is very similar to the one of Algorithm 1 given in Theorem 2.3. The main difference is Step 7, which adds a negligible term O(n log n) to the complexity.

Corollary 2.6.
If it can erase its dividend, Algorithm 2 can be modified to improve its complexity to (λ_m (c² − c + 2)/(2c) + λ_s (c + 1)/c) M(n) + O(n) operations in K, where c = max(c_m + 2, c_s + 3), still using O(1) extra space.

Proof. Once k coefficients of Q have been computed, F[0..k[ is not needed anymore. This means that at Step 7, the result can be directly written into F[k..k+ℓ[, and that F[0..k[ can be used as working space in the other steps of the loop. The free space at Steps 6 and 8 becomes n − ℓ and n − 2ℓ instead of n − k − ℓ and n − k − 2ℓ, respectively. Therefore, ℓ can always be chosen as large as ⌊n/c⌋ where c = max(c_m + 2, c_s + 3). Since ℓ stays positive, we also modify the algorithm to stop when all the coefficients of Q have been computed.

To simplify the complexity analysis, we further assume that k gets the same value ⌊n/c⌋ at Step 1. Step 2 requires (λ_s + λ_m) M(⌊n/c⌋) operations in K. The sum of the input sizes of all the short products in the algorithm is n; their total complexity is thus λ_s M(n). At the i-th iteration of the loop, k = (i + 1)ℓ; therefore Step 6 has complexity at most (i + 1) λ_m M(⌊n/c⌋), and Step 7 requires ⌊n/c⌋ operations in K. Altogether, the complexity of the modified algorithm is

    λ_s M(n) + (λ_s + λ_m) M(⌊n/c⌋) + Σ_{i=1}^{c−1} (i λ_m M(⌊n/c⌋) + ⌊n/c⌋) ≤ (λ_m (c² − c + 2)/(2c) + λ_s (c + 1)/c) M(n) + O(n).

Using similar techniques, we get the following variant.

Remark 2.7.
Algorithm 2 can be easily modified to improve the complexity to O(M(n)) operations in K when a linear amount of extra space is available, say αn registers for some constant α > 0.

If A is a size-(m + n − 1) polynomial and B a size-n polynomial, one can compute their size-m quotient Q in place using Algorithm 2, in O(M(m) log m) operations in K. When Q is known, the remainder R = A − BQ can be computed in place using O(M(n)) operations in K, as it requires a single short product and some subtractions. As already mentioned, the exact size of the remainder is not determined by the size of the inputs. Given any tighter bound r < n on the size of R, the same algorithm can compute R in place, in time O(M(r)).

Altogether, we get in-place algorithms that compute the quotient of two polynomials in time O(M(m) log m), or the quotient and the size-r remainder in time O(M(m) log m + M(r)). As suggested in Section 2.1 and in Remark 2.7, this complexity becomes O(M(m) + M(r)) whenever m = O(r): indeed, in that case the remainder space can be used to speed up the quotient computation. We shall mention that computing only the remainder remains a harder problem, as we cannot count on the space of the quotient while it is required for the computation. As of today, only the classical quadratic long division algorithm allows such an in-place computation.

We now provide a new in-place algorithm for computing both the quotient and the remainder that achieves a complexity of O(M(m) + M(n)) operations in K when m ≥ n. Our algorithm requires an output space of size n − 1 for the remainder, even when its actual size r is smaller.

Algorithm 3
In-Place Euclidean Division (InPlaceEuclDiv)
Input: A, B ∈ K[X] of sizes (m + n − 1, n), m ≥ n, such that B[n−1] ≠ 0.
Output: Q, R ∈ K[X] of sizes (m, n − 1) such that A = BQ + R.
Required: an in-place DivErase(F, G, n) computing F/G mod X^n while erasing F; an in-place SP.
For simplicity, H is a size-n polynomial such that H[0..n−1[ is R and H[n−1] is an extra register.
1: H ← A[m−1..m+n−1[
2: k ← m
3: while k > n do
4:     Q[k−n..k[ ← rev(DivErase(rev(H), rev(B), n))
5:     H[1..n[ ← SP(Q[k−n..k−1[, B[0..n−1[)
6:     H[1..n[ ← A[k−n..k−1[ − H[1..n[
7:     H[0] ← A[k−n−1]
8:     k ← k − n
9: Q[0..k[ ← rev(DivErase(rev(H[n−k..n[), rev(B[n−k..n[), k))
10: H[0..n−1[ ← SP(Q[0..n−1[, B[0..n−1[)
11: H[0..n−1[ ← A[0..n−1[ − H[0..n−1[
12: return (Q, H[0..n−1[)

Theorem 2.8.
Algorithm 3 is correct. It uses O(1) extra space and (λ_m (c² − c + 2)/(2c) + λ_s (3c + 1)/c) M(m) + O(m log n) operations in K, where c = max(c_m + 2, c_s + 3).

Proof. Algorithm 3 is an adaptation of the classical long division algorithm, recalled in Section 2.1, where chunks of the quotient are computed iteratively via Euclidean divisions in sizes (2n − 1, n). The main difficulty is that the update of the dividend cannot be done on the input. Since we compute only chunks of size n from the quotient, the update of the dividend affects only n − 1 coefficients. We then use the free space of R for storing these new coefficients. As we need to consider n coefficients from the dividend to get a new chunk, we add the missing coefficient from A and consider the polynomial H as our new dividend.

By Corollary 2.6, Step 4 can be done in place while erasing H, which is not part of the original input. It is thus immediate that our algorithm is in-place. For the complexity, Steps 4 and 5 dominate the cost. Using the exact complexity of Step 4 given in Corollary 2.6, one can easily deduce that Algorithm 3 requires (λ_m (c² − c + 2)/(2c) + λ_s (3c + 1)/c) M(m) + O(m log n) operations in K. Using time-efficient products with λ_m = λ_s = 1 and c_m = c_s = 4, this bound is 44/7 M(m), which is roughly 6.29 M(m).

In this section, we present in-place algorithms for the two related problems of multipoint evaluation and interpolation. We first review both classical algorithms and their space-efficient variants.
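Before reviewing the space usage of the classical methods, it may help to keep their common recursive structure in mind. The following Python sketch is illustrative only: the helper names are ours, and naive quadratic routines stand in for the fast multiplication and division subroutines whose constants the paper tracks.

```python
# Classical divide-and-conquer multipoint evaluation (out-of-place
# reference sketch; the in-place variants below refine its space usage).

def poly_mul(F, G):
    """Plain product of two coefficient lists (low degree first)."""
    H = [0] * (len(F) + len(G) - 1)
    for i, f in enumerate(F):
        for j, g in enumerate(G):
            H[i + j] += f * g
    return H

def poly_mod(F, G):
    """Remainder of F modulo a monic polynomial G."""
    R = list(F)
    while len(R) >= len(G):
        c = R[-1]
        for i in range(len(G)):
            R[len(R) - len(G) + i] -= c * G[i]
        R.pop()
    return R

def multipoint_eval(F, points):
    """Evaluate F at all points, splitting them into two halves and
    reducing F modulo each half's subproduct-tree root."""
    if len(points) == 1:
        acc = 0                      # Horner evaluation at a single point
        for c in reversed(F):
            acc = acc * points[0] + c
        return [acc]
    mid = len(points) // 2
    out = []
    for half in (points[:mid], points[mid:]):
        M = [1]                      # subproduct-tree root for this half
        for a in half:
            M = poly_mul(M, [-a, 1])
        out += multipoint_eval(poly_mod(F, M), half)
    return out
```

Each recursive call reduces F modulo the subproduct-tree root of half the points, so the work splits evenly; the fast algorithms reviewed below obtain quasi-linear time by performing these products and reductions with fast subroutines, and the space question is how much of the tree must be kept alive.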
Multipoint evaluation
Given n elements a_1, ..., a_n of K and a size-n polynomial F ∈ K[X], multipoint evaluation aims to compute F(a_1), ..., F(a_n). While the naive approach using Horner's scheme leads to a quadratic complexity, the fast approach of [ ] reaches a quasi-linear complexity O(M(n) log(n)) using a divide-and-conquer scheme and the fact that F(a_i) = F mod (X − a_i). As proposed in [ ], this complexity can be sharpened to (λ_m + λ_f) M(n) log(n) + O(M(n)) using the transposition principle.

The fast algorithms are based on building the so-called subproduct tree [6, Chapter 10], whose leaves contain the (X − a_i)'s and whose root contains the polynomial Π_{i=1}^{n} (X − a_i). This tree contains 2^i monic polynomials of degree n/2^i at level i, and can be stored in exactly n log n registers if n is a power of two. The fast algorithms then require n log(n) + O(n) registers as work space. Here, because the space complexity constants c_f, c_m, c_s do not appear in the leading term n log(n) of the space usage, we can always choose the fastest underlying multiplication routines, so the computational cost of this approach is simply M(n) log(n) + O(M(n)).

As remarked in [ ], one can easily derive a fast variant that uses only O(n) extra space. In particular, [7, Lemma 2.1] shows that the evaluation of a size-n polynomial F on k points a_1, ..., a_k with k ≤ n can be done at a cost O(M(k)(n/k + log(k))) with O(k) extra space.

We provide a tight analysis of this algorithm, starting with the balanced case k = n, i.e. when the number of evaluation points equals the size of F. The idea of the algorithm is to group the points into ⌈log(n)⌉ groups of ⌊n/log(n)⌋ points each, and to use standard multipoint evaluation on each group, by first reducing F modulo the root of the corresponding subproduct tree. The complexity analysis of this approach is given in the following lemma. Observe that here too, the constants λ_s, c_s, etc., do not enter in, since we can always use the fastest out-of-place subroutines without affecting the O(n) term in the space usage.

Lemma 3.1.
Given F ∈ (cid:75) [ X ] of size n and a , . . . , a n ∈ (cid:75) , one can compute F ( a ) , . . . , F ( a n ) using M ( n ) log ( n ) + O ( M ( n )) operations in (cid:75) and n + O ( n log ( n ) ) extra registers. roof. Computing each subproduct tree on O ( n / log ( n )) points can be done in time M ( n / log ( n )) log ( n ) ≤ M ( n ) and space n + O ( n / log ( n )) . The root of this tree is a polyno-mial of degree at most n / log ( n ) . Each reduction of F modulo such a polynomial takes time2 M ( n ) + O ( n / log ( n )) and space O ( n / log ( n )) using the balanced Euclidean division algorithm fromSection 2.1. Each multi-point evaluation of the reduced polynomial on n / log ( n ) points, using thepre-computed subproduct tree, takes M ( n / log ( n )) log ( n ) + O ( M ( n / log ( n ))) operations in (cid:75) and O ( n / log ( n )) extra space [ ] .All information except the evaluations from the last step — which are written directly to theoutput space — may be discarded before the next iteration begins. Therefore the total time andspace complexity are as stated.When the number of evaluation points k is large compared to the size n of the polynomial F , wecan simply repeat the approach of Lemma 3.1 (cid:100) k / n (cid:101) times. The situation is more complicated when k ≤ n , because the output space is smaller. The idea is to compute the degree- k polynomial M at theroot of the product tree, reduce F modulo M and perform balanced k -point evaluation of F mod M . Lemma 3.2.
Lemma 3.2. Given $F \in \mathbb{K}[X]$ of size $n$ and $a_1, \dots, a_k \in \mathbb{K}$, one can compute $F(a_1), \dots, F(a_k)$ using $2\lambda_s M(n) + M(k)\log(k) + O(n + M(k)\log\log(k))$ operations in $\mathbb{K}$ and $(c_s + 1)k + O(k/\log(k))$ extra registers.

Proof. Computing the root $M$ of a product tree proceeds in two phases. For the bottom levels of the tree, we use the fastest out-of-place full multiplication algorithm, which computes the product of two size-$t$ polynomials in time $M(t)$ and space $O(t)$. Then, only for the top $\log\log(n)$ levels, we switch to an in-place full product algorithm from [8], which has time $O(M(t))$ but only $O(1)$ extra space. The result is that $M$ can be computed using $M(k)\log(k) + O(M(k)\log\log(k))$ operations in $\mathbb{K}$ and $k + O(k/\log(k))$ registers.

Then, we reduce $F$ modulo $M$. By Lemma 2.1, this is accomplished in time $2\lambda_s M(n) + O(n + M(k))$ and space $(c_s + 1)k$. Adding the cost of the $k$-point evaluation of Lemma 3.1 completes the proof.

Interpolation
Interpolation is the inverse operation of multipoint evaluation: reconstruct a size-$n$ polynomial $F$ from its evaluations $F(a_1), \dots, F(a_n)$ at $n$ distinct points. The classic approach using Lagrange's interpolation formula has quadratic complexity [6, Chapter 5], while the fast approach of [15] has quasi-linear time complexity $O(M(n)\log(n))$. We first briefly recall this fast algorithm.

Let $M(X) = \prod_{i=1}^{n}(X - a_i)$ and let $M'$ be its derivative. Noting that $\frac{M}{X - a_i}(a_i) = M'(a_i)$ for $1 \le i \le n$, we have
$$F(X) = M(X)\sum_{i=1}^{n} \frac{F(a_i)/M'(a_i)}{X - a_i}. \qquad (3)$$
Hence the fast algorithm of [15] consists in computing $M'(X)$ and its evaluations at each $a_i$ through multipoint evaluation, and then summing the $n$ fractions using a divide-and-conquer strategy. The numerator of the result is then $F$ by Equation (3).

If the subproduct tree over the $a_i$'s is already computed, it provides all the denominators in the rational fraction sum. Using the same subproduct tree for evaluating $M'$ and for the rational fraction sum gives the fastest interpolation algorithm, combining the textbook method [6] with the multipoint evaluation of [7]. The total computational cost is only $\frac{5}{2}M(n)\log(n) + O(M(n))$, while the space is dominated by the size of the subproduct tree, $n\log(n) + O(n)$.

A more space-efficient approach can be derived using linear-space multipoint evaluation. Since the subproduct tree must essentially be recomputed on the first and last steps, the total running time is $(\lambda_f + O(1))\,M(n)\log(n)$, using $(c_f + O(1))\,n + O(n/\log(n))$ registers. This approach can be improved in two ways: first, by again grouping the interpolation points and re-using the smaller subproduct trees of each group; and secondly, by using an in-place full multiplication algorithm from [8] to combine the results of each group in the rational function summation. A detailed description of the resulting algorithm, along with a proof of the following lemma, can be found in Appendix A.
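The fast interpolation just recalled can be sketched as follows, again as a hypothetical illustration over $\mathbb{Z}/97\mathbb{Z}$: the weights $c_i = y_i/M'(a_i)$ are computed naively here, whereas the paper obtains them via multipoint evaluation of $M'$, and schoolbook products stand in for fast multiplication.

```python
P = 97  # hypothetical small prime modulus

def pmul(f, g):
    r = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            r[i + j] = (r[i + j] + fi * gj) % P
    return r

def padd(f, g):
    n = max(len(f), len(g))
    return [((f[i] if i < len(f) else 0) + (g[i] if i < len(g) else 0)) % P
            for i in range(n)]

def interpolate(points, values):
    # weights c_i = y_i / M'(a_i), computed naively for this sketch
    c = []
    for i, ai in enumerate(points):
        d = 1
        for j, aj in enumerate(points):
            if j != i:
                d = d * (ai - aj) % P
        c.append(values[i] * pow(d, P - 2, P) % P)

    def combine(lo, hi):
        # returns (N, D) with N/D = sum_{lo <= i < hi} c_i / (X - a_i)
        if hi - lo == 1:
            return [c[lo]], [-points[lo] % P, 1]
        mid = (lo + hi) // 2
        n1, d1 = combine(lo, mid)
        n2, d2 = combine(mid, hi)
        return padd(pmul(n1, d2), pmul(n2, d1)), pmul(d1, d2)

    numerator, _ = combine(0, len(points))
    return numerator  # equals F by Equation (3)
```

`combine` returns the numerator and denominator of the partial fraction sum; at the root the denominator is $M$ and the numerator is $F$, by Equation (3).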
Lemma 3.3. Given $a_1, \dots, a_n \in \mathbb{K}$ and $y_1, \dots, y_n \in \mathbb{K}$, one can compute $F \in \mathbb{K}[X]$ of size $n$ such that $F(a_i) = y_i$ for $1 \le i \le n$, using $5M(n)\log(n) + O(M(n)\log\log(n))$ operations in $\mathbb{K}$ and $2n + O(n/\log(n))$ extra registers.

In order to derive an in-place algorithm, we make repeated use of unbalanced multipoint evaluation with linear space to compute only $k$ evaluations of the polynomial $F$ among the $n$ original points. The strategy is to set $k$ as a fraction of $n$, to ensure that $n - k$ is large enough to serve as extra space. Applying this strategy to smaller and smaller values of $k$ leads to Algorithm 4, an in-place algorithm with the same asymptotic time complexity $O(M(n)\log(n))$ as out-of-place fast multipoint evaluation.

Algorithm 4
In-Place Multipoint Evaluation (InPlaceEval)

Input: $F \in \mathbb{K}[X]$ of size $n$ and $(a_1, \dots, a_n) \in \mathbb{K}^n$
Output: $R = (F(a_1), \dots, F(a_n))$
Required: Eval with space complexity at most $(c_s + 1)k$, as in Lemma 3.2

1: $s \leftarrow 0$
2: $k \leftarrow \lfloor n/(c_s + 2) \rfloor$
3: while $k > 0$ do
4:   $R[s .. s{+}k[ \leftarrow \textsc{Eval}(F, a_{s+1}, \dots, a_{s+k})$  ▷ work space: $R[s{+}k .. n[$
5:   $s \leftarrow s + k$
6:   $k \leftarrow \lfloor (n - s)/(c_s + 2) \rfloor$
7: $R[s .. n[ \leftarrow \textsc{Eval}(F, a_{s+1}, \dots, a_n)$  ▷ constant space
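The space schedule of Algorithm 4 can be sketched as follows, assuming Eval needs at most $(c_s+1)k$ work registers so that choosing $k = \lfloor (n-s)/(c_s+2) \rfloor$ leaves the unwritten suffix of $R$ large enough; plain Horner evaluation stands in for the fast Eval of Lemma 3.2, and `cs = 1` is a hypothetical value of the constant.

```python
cs = 1  # hypothetical space constant of EVAL

def horner(F, a):
    # plain evaluation, standing in for the fast EVAL of Lemma 3.2
    acc = 0
    for coef in reversed(F):
        acc = acc * a + coef
    return acc

def inplace_eval(F, points):
    n = len(points)
    R = [None] * n              # output space, reused as scratch by EVAL
    s, k = 0, n // (cs + 2)
    while k > 0:
        # EVAL writes R[s:s+k] and may use R[s+k:n] as work space,
        # since (cs + 1) * k <= n - s - k by the choice of k
        R[s:s + k] = [horner(F, a) for a in points[s:s + k]]
        s += k
        k = (n - s) // (cs + 2)
    R[s:n] = [horner(F, a) for a in points[s:n]]   # O(1) points remain
    return R
```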
Theorem 3.4. Algorithm 4 is correct. It uses $O(1)$ extra space and $\left(1 + \frac{2\lambda_s}{\log((c_s+2)/(c_s+1))}\right) M(n)\log(n) + O(M(n)\log\log n)$ operations in $\mathbb{K}$.

Proof. The correctness is obvious as soon as Eval is correct. By the choice of $k$ and the extra space bound on Eval from Lemma 3.2, the call to Eval at Step 4 has sufficient work space, and therefore the entire algorithm is in-place. The sequence $k_i = \frac{(c_s+1)^{i-1}}{(c_s+2)^i}\,n$, for $i = 1, 2, \dots$, gives the values of $k$ at each iteration. Then $\sum_i k_i \le n$, and the loop terminates after at most $\ell \log(n)$ iterations, where $\ell \le 1/\log\left(\frac{c_s+2}{c_s+1}\right)$. Applying Lemma 3.2, the cost of the entire algorithm is therefore dominated by $\sum_{1 \le i \le \ell\log(n)} \left(2\lambda_s M(n) + M(k_i)\log(k_i)\right)$, which is at most $(2\lambda_s \ell + 1)\,M(n)\log(n)$.

Using a time-efficient short product, the bound of Theorem 3.4 evaluates to roughly $11.61\,M(n)\log n$ (see Table 1).

Let $(a_1, y_1), \dots, (a_n, y_n)$ be $n$ pairs of evaluations, with the $a_i$'s pairwise distinct. Our goal is to compute the unique size-$n$ polynomial $F \in \mathbb{K}[X]$ such that $F(a_i) = y_i$ for $1 \le i \le n$, with an in-place algorithm. Our first aim is to provide a variant of polynomial interpolation that computes $F \bmod X^k$ using $O(k)$ extra space. Without loss of generality, we assume that $k$ divides $n$. For $1 \le i \le n/k$, let $T_i = \prod_{j=1+k(i-1)}^{ki}(X - a_j)$ and $S_i = M/T_i$, where $M = \prod_{i=1}^{n}(X - a_i)$. Note that $S_i = \prod_{j \ne i} T_j$. One can rewrite Equation (3) as
$$F(X) = M(X)\sum_{i=1}^{n/k}\sum_{j=1+k(i-1)}^{ki} \frac{F(a_j)}{M'(a_j)(X - a_j)} = M(X)\sum_{i=1}^{n/k} \frac{N_i(X)}{T_i(X)} = \sum_{i=1}^{n/k} N_i(X)\,S_i(X) \qquad (4)$$
for some size-$k$ polynomials $N_1, \dots, N_{n/k}$. One may remark that the latter equality can also be viewed as an instance of the Chinese remainder theorem, where $N_i = F/S_i \bmod T_i$ (see [
6, Chapter 5]). To get the first $k$ terms of the polynomial $F$, we only need to compute
$$F \bmod X^k = \sum_{i=1}^{n/k} N_i\,(S_i \bmod X^k) \bmod X^k. \qquad (5)$$
One can observe that $M'(a_j) = (S_i \bmod T_i)(a_j)\,T_i'(a_j)$ for $k(i-1) < j \le ki$. Therefore, Equation (4) implies that $N_i$ is the unique size-$k$ polynomial satisfying $N_i(a_j) = (F/S_i \bmod T_i)(a_j)$, and it can be computed using interpolation: one first computes $S_i \bmod T_i$, evaluates it at the $a_j$'s, performs $k$ divisions in $\mathbb{K}$ to get each $N_i(a_j)$, and finally interpolates $N_i$.

Our second aim is to generalize the previous approach to the case where some initial coefficients of $F$ are already known. Writing $F = G + X^s H$ where $G$ is known, we want to compute $H \bmod X^k$ from some evaluations of $F$. Since $H$ has size at most $n - s$, only $n - s$ evaluation points are needed. Therefore, using Equation (4) with $M = \prod_{i=1}^{n-s}(X - a_i)$, we can write
$$H(X) = M(X)\sum_{i=1}^{(n-s)/k}\sum_{j=1+k(i-1)}^{ki} \frac{F(a_j) - G(a_j)}{a_j^s\,M'(a_j)\,(X - a_j)}. \qquad (6)$$
This implies that $H \bmod X^k$ can be computed by the same approach as above, replacing each $F(a_j)$ with $H(a_j) = (F(a_j) - G(a_j))/a_j^s$. We remark that the $H(a_j)$'s can be computed using multipoint evaluation and fast exponentiation. Algorithm 5 fully describes this approach.
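The key identity $M'(a_j) = (S_i \bmod T_i)(a_j)\,T_i'(a_j)$, which lets each $N_i$ be computed from group-local data, follows from $M = T_i S_i$ and $T_i(a_j) = 0$. The following sketch, a hypothetical check over $\mathbb{Z}/97\mathbb{Z}$ with schoolbook arithmetic, verifies it on a small example.

```python
P = 97  # hypothetical small prime modulus

def pmul(f, g):
    r = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            r[i + j] = (r[i + j] + fi * gj) % P
    return r

def pder(f):
    # formal derivative
    return [i * c % P for i, c in enumerate(f)][1:]

def peval(f, x):
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % P
    return acc

def prod(points):
    m = [1]
    for a in points:
        m = pmul(m, [-a % P, 1])
    return m

def group_identity_holds(points, k):
    n = len(points)
    Mp = pder(prod(points))                              # M'
    for i in range(n // k):
        grp = points[k * i:k * (i + 1)]
        Ti = prod(grp)
        Si = prod(points[:k * i] + points[k * (i + 1):])  # S_i = M / T_i
        Tip = pder(Ti)
        for a in grp:
            # a is a root of T_i, hence (S_i mod T_i)(a) = S_i(a)
            if peval(Mp, a) != peval(Si, a) * peval(Tip, a) % P:
                return False
    return True
```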
Lemma 3.5. Algorithm 5 is correct. It requires $6k + O(k/\log k)$ extra space and uses
$$\left(2\left(\tfrac{n-s}{k}\right)^2 + \tfrac{n-s}{k}\right) M(k)\log(k) + (n-s)\log(s) + O\!\left(\left(\tfrac{n-s}{k}\right)^2 M(k)\log\log k\right)$$
operations in $\mathbb{K}$.

Algorithm 5 Partial Interpolation (PartInterpol)

Input: $G \in \mathbb{K}[X]$ of size $s$; $(y_1, \dots, y_{n-s})$ and $(a_1, \dots, a_{n-s})$ in $\mathbb{K}^{n-s}$; an integer $k \le n - s$
Output: $H \bmod X^k$, where $F = G + X^s H \in \mathbb{K}[X]$ is the unique size-$n$ polynomial such that $F(a_i) = y_i$ for $1 \le i \le n - s$

1: for $i = 1$ to $(n-s)/k$ do
2:   $S_i^k \leftarrow S_i^T \leftarrow T_i \leftarrow \prod_{j=1+k(i-1)}^{ki}(X - a_j)$  ▷ fast divide-and-conquer
3:   for $j = 1$ to $(n-s)/k$, $j \ne i$ do
4:     $T_j \leftarrow \prod_{t=1+k(j-1)}^{kj}(X - a_t)$  ▷ fast divide-and-conquer
5:     $S_i^k \leftarrow S_i^k \times T_j \bmod X^k$  ▷ $S_i^k = S_i \bmod X^k$
6:     $S_i^T \leftarrow S_i^T \times T_j \bmod T_i$  ▷ $S_i^T = S_i \bmod T_i$
7:   $G_T \leftarrow G \bmod T_i$
8:   $(b_1, \dots, b_k) \leftarrow \textsc{Eval}(S_i^T, a_{1+k(i-1)}, \dots, a_{ki})$
9:   $(z_1, \dots, z_k) \leftarrow \textsc{Eval}(G_T, a_{1+k(i-1)}, \dots, a_{ki})$
10:  for $j = 1$ to $k$ do
11:    $b_j \leftarrow (y_{j+k(i-1)} - z_j)\,/\,(a_{j+k(i-1)}^s\, b_j)$
12:  $N_i \leftarrow \textsc{Interpol}((a_{1+k(i-1)}, \dots, a_{ki}), (b_1, \dots, b_k))$
13:  $H[0..k[ \leftarrow H[0..k[ + N_i S_i^k \bmod X^k$

Proof. The correctness follows from the above discussion. In particular, note that the polynomials $S_i^k$ and $S_i^T$ at Steps 5 and 6 equal $S_i \bmod X^k$ and $S_i \bmod T_i$ respectively. Furthermore, $z_j = G(a_{j+k(i-1)})$ since $G(a_{j+k(i-1)}) = (G \bmod T_i)(a_{j+k(i-1)})$. Hence, Step 12 correctly computes the polynomial $N_i$, and the result follows from Equations (5) and (6).

From the discussion in Section 3.1, we can compute each $T_i$ in $\frac{1}{2}M(k)\log(k) + O(M(k)\log\log k)$ operations in $\mathbb{K}$ and $k$ extra space. Steps 8 and 9 require some care, as we can share some computation between the two equal-size evaluations: the subproduct trees induced by these computations are identical and thus can be computed only once. Using Lemma 3.1, this amounts to $M(k)\log(k) + O(M(k))$ operations in $\mathbb{K}$ using $k + O(k/\log k)$ extra space. Step 12 can be done in $5M(k)\log(k) + O(M(k)\log\log k)$ operations in $\mathbb{K}$ and $2k + O(k/\log k)$ extra space using Lemma 3.3. Taking into account the $n - s$ exponentiations $a_j^s$, and the fact that the other steps have complexity $O(M(k))$, the cost of the algorithm is
$$\left(2\left(\tfrac{n-s}{k}\right)^2 + \tfrac{n-s}{k}\right) M(k)\log(k) + (n-s)\log(s) + O\!\left(\left(\tfrac{n-s}{k}\right)^2 M(k)\log\log k\right).$$

We now show that $6k + O(k/\log k)$ extra registers are enough to implement this algorithm. At Step 6, the polynomials $T_i$, $T_j$, $S_i^k$, $S_i^T$ must be stored in memory. The computation involved at this step requires only $2k$ extra registers, as $S_i^T \times T_j \bmod T_i$ can be computed with an in-place full product (stored in the extra registers) followed by an in-place division with remainder, using the registers of $S_i^T$ and $T_j$ for the quotient and remainder storage. Using the same technique, Step 7 requires only $k$ extra space, as do Steps 2 to 5. At Steps 8 and 9, we need $3k$ registers to store $G_T$, $S_i^T$, $S_i^k$, and $2k$ registers to store $(b_1, \dots, b_k)$ and $(z_1, \dots, z_k)$, plus $k + O(k/\log k)$ extra registers for the computation. At Step 12, we re-use the space of $G_T$ and $S_i^T$ for $N_i$, together with the extra space of the computation, which implies the claim.

We can now provide our in-place variant of fast interpolation.

Algorithm 6
In-Place Interpolation (InPlaceInterpol)

Input: $(y_1, \dots, y_n)$ and $(a_1, \dots, a_n)$ with $a_i, y_i \in \mathbb{K}$
Output: $F \in \mathbb{K}[X]$ of size $n$ such that $F(a_i) = y_i$ for $1 \le i \le n$
Required: PartInterpol with space complexity at most $c_{\mathrm{pi}}\,k$

1: $s \leftarrow 0$
2: while $s < n$ do
3:   $k \leftarrow \lfloor (n - s)/(c_{\mathrm{pi}} + 1) \rfloor$
4:   if $k = 0$ then $k \leftarrow n - s$
5:   $Y, A \leftarrow (y_1, \dots, y_{n-s}), (a_1, \dots, a_{n-s})$
6:   $F[s .. s{+}k[ \leftarrow \textsc{PartInterpol}(F[0 .. s[, Y, A, k)$
7:   $s \leftarrow s + k$
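The schedule of Algorithm 6 can be sketched as follows: $k$ shrinks geometrically so that the not-yet-written part $F[s{+}k .. n[$ always provides the roughly $c_{\mathrm{pi}}\,k$ work registers that PartInterpol needs. Here `c_pi = 4` is a hypothetical value of the constant.

```python
c_pi = 4  # hypothetical value of the space constant of PartInterpol

def schedule(n):
    # (s, k) chunks produced by Algorithm 6's while loop
    chunks = []
    s = 0
    while s < n:
        k = (n - s) // (c_pi + 1)
        if k == 0:
            k = n - s          # O(1) coefficients left: constant space
        chunks.append((s, k))
        s += k
    return chunks
```

By construction, every chunk except the last satisfies $c_{\mathrm{pi}}\,k \le n - s - k$, which is exactly the work-space invariant used in the proof of Theorem 3.6.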
Theorem 3.6. Algorithm 6 is correct. It uses $O(1)$ extra space and at most $(2c^2 + c)\,M(n)\log n + O(M(n)\log\log n)$ operations in $\mathbb{K}$, where $c = 1 + c_{\mathrm{pi}}$.

Proof. The correctness is clear from the correctness of Algorithm PartInterpol. To ensure that the algorithm uses $O(1)$ extra space, we notice that at Step 6, $F[s{+}k .. n[$ can be used as work space. Therefore, as soon as $c_{\mathrm{pi}}\,k \le n - s - k$, that is, $k \le \frac{n-s}{c_{\mathrm{pi}}+1}$, this free space is enough to run PartInterpol. Note that when $k = n - s < c_{\mathrm{pi}} + 1$, only $O(1)$ extra space is needed. Let $k_1, k_2, \dots, k_t$ and $s_1, s_2, \dots, s_t$ be the values of $k$ and $s$ taken during the course of the algorithm. Since $s_i = \sum_{j=1}^{i} k_j \le n$ with $s_0 = 0$, we have $k_i \le \lambda n (1 - \lambda)^{i-1}$ and $s_i \ge n\left(1 - (1 - \lambda)^i\right)$, where $\lambda = \frac{1}{c_{\mathrm{pi}}+1}$. The time complexity $T(n)$ of the algorithm satisfies
$$T(n) \le \sum_{i=1}^{t} (2c^2 + c)\,M(k_i)\log(k_i) + \sum_{i=1}^{t} (n - s_{i-1})\log(s_{i-1}) + O\!\left(c^2 M(k_i)\log\log k_i\right),$$
since $(n - s_{i-1})/k_i \le c = c_{\mathrm{pi}} + 1$. Moreover, we have $\sum_{i=1}^{t} M(k_i)\log(k_i) \le M\!\left(\sum_i k_i\right)\log n \le M(n)\log(n)$. By definition of $s_i$, we have $n - s_i \le n(1-\lambda)^i$, which gives
$$\sum_{i=1}^{t} (n - s_{i-1})\log(s_{i-1}) \le n\log(n)\sum_{i=1}^{t} (1-\lambda)^{i-1} \le (c_{\mathrm{pi}} + 1)\,n\log n.$$
This concludes the proof.

Since $c_{\mathrm{pi}} < 6 + \varepsilon$ for any $\varepsilon >$
0, the complexity can be approximated to $105\,M(n)\log(n)$, which is $42$ times slower than the fastest interpolation algorithm (see Table 1).

Acknowledgments
We thank Grégoire Lecerf, Alin Bostan and Michael Monagan for pointing out the references [
7, 16].

References

[1] Daniel J. Bernstein. Fast multiplication and its applications. In Algorithmic Number Theory, volume 44 of MSRI Publications, pages 325–384. Cambridge University Press, 2008.
[2] Alin Bostan, Grégoire Lecerf, and Éric Schost. Tellegen's principle into practice. In ISSAC'03, pages 37–44. ACM, 2003.
[3] David G. Cantor and Erich Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Inform., 28(7):693–701, 1991.
[4] Stephen A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966.
[5] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In FOCS'99, pages 285–297. IEEE, 1999.
[6] Joachim von zur Gathen and Jürgen Gerhard. Modern Computer Algebra. Cambridge University Press, 3rd edition, 2013.
[7] Joachim von zur Gathen and Victor Shoup. Computing Frobenius maps and factoring polynomials. Comput. Complex., 2(3):187–224, 1992.
[8] Pascal Giorgi, Bruno Grenet, and Daniel S. Roche. Generic reductions for in-place polynomial multiplication. In ISSAC'19, pages 187–194. ACM, 2019.
[9] Guillaume Hanrot, Michel Quercia, and Paul Zimmermann. The middle product algorithm I. Appl. Algebr. Eng. Comm., 14(6):415–438, 2004.
[10] David Harvey and Joris van der Hoeven. Polynomial multiplication over finite fields in time O(n log n). 2019. URL: https://hal.archives-ouvertes.fr/hal-02070816/.
[11] David Harvey and Daniel S. Roche. An in-place truncated Fourier transform and applications to polynomial multiplication. In ISSAC'10, pages 325–329. ACM, 2010.
[12] Anatoli Karatsuba and Yuri Ofman. Multiplication of multidigit numbers on automata. Sov. Phys. Dokl., 7:595–596, 1963.
[13] Alan H. Karp and Peter Markstein. High-precision division and square root. ACM Trans. Math. Software, 23(4):561–589, 1997.
[14] Hsiang-Tsung Kung. On computing reciprocals of power series. Numer. Math., 22(5):341–348, 1974.
[15] Robert Moenck and Allan Borodin. Fast modular transforms via division. In SWAT'72, pages 90–96. IEEE, 1972.
[16] Michael Monagan. In-place arithmetic for polynomials over Z_n. In DISCO'93, pages 22–34. Springer, 1993.
[17] Daniel S. Roche. Space- and time-efficient polynomial multiplication. In ISSAC'09, pages 295–302. ACM, 2009.
[18] Arnold Schönhage. Probabilistic computation of integer polynomial GCDs. J. Algorithms, 9(3):365–371, 1988.
[19] Arnold Schönhage and Volker Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7(3):281–292, 1971.
[20] Emmanuel Thomé. Karatsuba multiplication with temporary space of size ≤ n. 2002. URL: https://hal.archives-ouvertes.fr/hal-02396734.

Interpolation with linear space
The algorithm proceeds as follows:

(1) Run the subproduct tree algorithm for each group of $n/\log(n)$ interpolation points, saving only the roots of each subtree $M_1, \dots, M_{\lceil \log(n) \rceil}$, using fast out-of-place full multiplications.
(2) Run the subproduct tree algorithm over these $M_i$'s to compute the root $M$, using the in-place full multiplications from [8] and discarding the other nodes of the tree.
(3) Compute the derivative $M'$ in place.
(4) Compute the remainders $M' \bmod M_i$ for $1 \le i \le \lceil \log(n) \rceil$, using the balanced (with precomputation) algorithm described in Section 2.1. The size-$n$ polynomial $M'$ may now be discarded.
(5) For each group $i$, compute the full subproduct tree over its $n/\log(n)$ points. Use it to perform multipoint evaluation of $M' \bmod M_i$ at the $n/\log(n)$ points of that group only, and then compute the partial sum of (3) for that group's points. Discard the subproduct tree, but save the rational function partial sum of each group.
(6) Combine the rational functions of the $\lceil \log(n) \rceil$ groups using a divide-and-conquer strategy, employing again the in-place full multiplications from [8].

The following lemma gives the complexity of this linear-space interpolation algorithm.
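Step (6) merges the per-group rational functions pairwise: each group $i$ contributes a pair $(N_i, D_i)$, with $D_i$ the group's subproduct root and $N_i/D_i$ its partial fraction sum, and $F$ is the final numerator. A sketch with exact integer coefficients, schoolbook products standing in for the in-place fast multiplications of [8]:

```python
def pmul(f, g):
    r = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            r[i + j] += fi * gj
    return r

def padd(f, g):
    n = max(len(f), len(g))
    return [(f[i] if i < len(f) else 0) + (g[i] if i < len(g) else 0)
            for i in range(n)]

def merge(groups):
    # groups: list of pairs (N_i, D_i); returns (N, D) with N/D = sum_i N_i/D_i,
    # i.e. N = sum_i N_i * prod_{j != i} D_j and D = prod_i D_i
    if len(groups) == 1:
        return groups[0]
    mid = len(groups) // 2
    n1, d1 = merge(groups[:mid])
    n2, d2 = merge(groups[mid:])
    return padd(pmul(n1, d2), pmul(n2, d1)), pmul(d1, d2)
```

For instance, merging the two one-point groups $1/(X-1)$ and $1/(X-2)$ yields the numerator $2X - 3$ over $(X-1)(X-2)$.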
Lemma 3.3. Given $a_1, \dots, a_n \in \mathbb{K}$ and $y_1, \dots, y_n \in \mathbb{K}$, one can compute $F \in \mathbb{K}[X]$ of size $n$ such that $F(a_i) = y_i$ for $1 \le i \le n$, using $5M(n)\log(n) + O(M(n)\log\log(n))$ operations in $\mathbb{K}$ and $2n + O(n/\log(n))$ extra registers.

Proof. Steps (1) and (5) collectively involve, for each group, two subproduct tree computations, one multipoint evaluation, and one rational function summation, for a total of $3M(n)\log(n) + O(M(n))$ time. Step (4) contributes another $2M(n)\log(n) + O(M(n))$ operations in $\mathbb{K}$. In Steps (2) and (6), the expensive in-place multiplications are used only for the top $\lceil \log\log(n) \rceil$ levels of the entire subproduct tree, so they contribute only $O(M(n)\log\log(n))$.

For the space, note that the size-$n$