Generic reductions for in-place polynomial multiplication
Pascal Giorgi ∗ , Bruno Grenet ∗ and Daniel S. Roche † February 11, 2019
Abstract
The polynomial multiplication problem has attracted considerable attention since the early days of computer algebra, and several algorithms have been designed to achieve the best possible time complexity. More recently, efforts have been made to improve the space complexity, developing modified versions of a few specific algorithms that use no extra space while keeping the same asymptotic running time.

In this work, we broaden the scope in two regards. First, we ask whether an arbitrary multiplication algorithm can be performed in-place generically. Second, we consider two important variants which produce only part of the result (and hence have less space to work with), the so-called middle and short products, and ask whether these operations can also be performed in-place.

To answer both questions in (mostly) the affirmative, we provide a series of reductions starting with any linear-space multiplication algorithm. For full and short product algorithms these reductions yield in-place versions with the same asymptotic time complexity as the out-of-place version. For the middle product, the reduction incurs an extra logarithmic factor in the time complexity only when the algorithm is quasi-linear.
Keywords: arithmetic, polynomial multiplication, in-place algorithm, self-reduction
∗ LIRMM, Université de Montpellier, CNRS, Montpellier, France. ([email protected])
† United States Naval Academy, Annapolis, Maryland, USA. [email protected]

1 Introduction

Polynomial multiplication is a fundamental problem in mathematical algorithms. It forms the basis (and key bottleneck) for other fundamental problems such as division with remainder, GCD computation, evaluation/interpolation, resultants, factorization, and structured linear algebra (see, e.g., [9]). Many algorithms have been developed for multiplying size-n polynomials, most notably Karatsuba's algorithm [16], Toom-Cook multiplication [8], and Schönhage-Strassen [21]; more recent results have improved the complexity further but have not yet seen wide adoption in practice [6, 13].

After minimizing the runtime, an important question both in theory and in practice is how much extra space these algorithms require. While the classical algorithm can be made to use only a constant number of temporary values, all the faster algorithms mentioned above require O(n) space to multiply two size-n polynomials. In fact, proven time-space trade-offs in the algebraic circuit and branching program models indicate that space at least polynomial in n is required for any sub-quadratic multiplication algorithm [20, 1]. But in a model where the output space admits both random writes and reads, these time-space lower bounds can be broken. [19] developed a variant of Karatsuba's algorithm using only O(log n) space. Later, an FFT-based multiplication algorithm using O(n log n) time and constant space was developed for the case that the coefficient ring contains a suitable root of unity [14]. Space-saving versions of Karatsuba's algorithm can also be found in [23, 5, 22, 7].

Besides the usual full product computation, two other variants have also been extensively studied: the short product, which truncates the output to the first n terms, and the middle product, which truncates the result on both ends.
These variants are especially important for power series, and specific variants of Karatsuba's algorithm and others have been developed, usually gaining a constant factor compared to a full product followed by a truncation [10, 18, 11, 12]. [4] shows that the middle product can be viewed essentially as the reverse of a full product, computable in the same space. However, in our model, which uses the space of the output as temporary working space, this reversal implies that the inputs must also be destroyed for an in-place middle product. In some sense it would not be surprising if middle and short products were more difficult in our setting, as the truncated size of the output essentially limits the working space of the algorithm.

In this paper, we develop reductions which can transform any multiplication algorithm that uses O(n) extra space into full, short, and middle product algorithms that use only O(1) extra space. The time complexity for the full and short products is the same as that of the original, while that for the middle product incurs an additional log n factor. This improves the O(log n) space of the most space-constrained Karatsuba algorithm [19], and implies for the first time: in-place versions of Toom-Cook multiplication; in-place FFT-based multiplication even when the ring does not contain a root of unity; in-place subquadratic short product algorithms; and in-place middle product algorithms which do not overwrite their inputs.

We begin by carefully stating our space complexity model and then defining the multiplication problems in Sections 2 and 3. A few easier but important reductions and equivalences are presented next in Section 4, followed by the critical reductions in Section 5, which prove our main results.

2 Complexity model

We use the model of an algebraic RAM that is equipped with two kinds of registers: the standard registers store integers as in the classical Word-RAM model, whereas the algebraic registers store elements from the base field K of coefficients.
As in the Word-RAM model, we assume that the standard registers can store integers of size O(log n), where n is the number of coefficients in the inputs. Word-RAM machines are a classical model in computational complexity, in particular for fine-grained complexity, which classifies the difficulty of polynomial-time problems [24]. We use this model in order to distinguish the space needed to store indices (which is thus hidden in the standard registers) from the space needed to store elements of the base field.

Time complexity
As mentioned, we use the number of arithmetic operations as the time complexity measure, since the cost of operations on indices is negligible with respect to arithmetic operations. Formally, we assume that any ring operation on the algebraic registers has cost 1.
Space complexity
We divide the registers into three categories: the input space is made of the (algebraic) registers that store the inputs, the output space is made of the (algebraic) registers where the output must be written, and the work space is made of the (algebraic and non-algebraic) registers that are used as extra space during the computation. The space complexity is then the maximum number of work registers used simultaneously during the computation. An algorithm is said to be "in-place" if its space complexity is O(1), and "out-of-place" otherwise.

One can then distinguish different models depending on the read/write permissions on the input and output registers:

1. Input space is read-only, output space is write-only;
2. Input space is read-only, output space is read/write;
3. Input and output spaces are both read/write.

The first model is the classical one from complexity theory [2]. Despite its theoretical interest, it does not reflect low-level computation, where output is typically in some DRAM or Flash memory on which reading is no more costly than writing. Furthermore, polynomial multiplication in this model has a quadratic lower bound on the product of time and space [1], limiting the possibility for meaningful improvements.

The second model has been used in the context of in-place polynomial multiplication [19, 14]. This is a very reasonable model since it matches the paradigm of parallel computing with shared memory. This is the model in which we develop our algorithms.

The third model has been used to provide a generic approach to memory-preserving algorithm design via the transposition principle [4]: given an algorithm for a linear map with time complexity t(n) and space complexity s(n), the transposition principle yields an algorithm for the transposed linear map which has the same space complexity and time complexity O(t(n)) [4, Propositions 1 and 2].
However, the inputs are destroyed during the computation, which is problematic in particular for recursive algorithms that re-use their operands; we will not use this too-permissive model.

Notation
The output space in our algorithms is denoted by R, and its registers are indexed from 0 to n − 1. We write R[k..ℓ[ to denote the registers of indices k to ℓ − 1.

3 Polynomial products

Define the size of a univariate polynomial as the number of coefficients in its (dense) representation; a polynomial of size n has degree at most n − 1. Importantly, we allow zero padding: a size-n polynomial could have degree strictly less than n − 1; the size indicates only how it is represented.

Let f = Σ_{i=0}^{n−1} f_i X^i and g = Σ_{i=0}^{n−1} g_i X^i be two size-n polynomials. Their product h = f g is a polynomial of size 2n − 1, which we call a balanced full product. More generally, if f has size m and g has size n, their product has size m + n − 1. We call this case the unbalanced full product of f and g.

We now define precisely the short product, middle product, and half-additive full product.

Definition 3.1.
Let f and g be two size-n polynomials. Their low short product is the size-n polynomial defined as SP_lo(f, g) = (f · g) mod X^n, and their high short product is the size-(n − 1) polynomial defined as SP_hi(f, g) = (f · g) quo X^n. The rationale for this choice of definitions is to have the identity f g = SP_lo(f, g) + X^n SP_hi(f, g).

Definition 3.2.
Let f and g be two polynomials of sizes n + m − 1 and n, respectively. Their middle product is the size-m polynomial made of the central coefficients of the product f g, that is, MP(f, g) = ((f · g) quo X^{n−1}) mod X^m.

Definition 3.3. Let f and g be two polynomials of degree less than n, and h be a polynomial of degree less than n − 1. The (low-order) half-additive full product of f and g given h is FP+_lo(f, g, h) = h + f g. Similarly, their high-order half-additive full product is FP+_hi(f, g, h) = X^n h + f g. An in-place half-additive full product algorithm is an algorithm computing a half-additive full product where h is initially stored in the output space.

This variant of the full product, which has a partially-initialized output space, will be useful to derive other in-place algorithms.

For ease of explanation, we will use the linearity of polynomial multiplication when one operand is fixed. Let f = Σ_{i=0}^{n−1} f_i X^i and g = Σ_{i=0}^{n−1} g_i X^i be two size-n polynomials. If f is fixed, the product h = f g can be described as a linear map from K^n to K^{2n−1}. The matrix for this map, denoted M_FP(f), is a Toeplitz matrix built from the coefficients of f, and the product f g corresponds to the following matrix-vector product:

$$\underbrace{\begin{pmatrix}
f_0 & & \\
f_1 & f_0 & \\
\vdots & \ddots & \ddots \\
f_{n-1} & & f_0 \\
& \ddots & \vdots \\
& & f_{n-1}
\end{pmatrix}}_{M_{FP}(f)}
\times
\underbrace{\begin{pmatrix} g_0 \\ g_1 \\ \vdots \\ g_{n-1} \end{pmatrix}}_{\vec{g}}
=
\underbrace{\begin{pmatrix} h_0 \\ h_1 \\ \vdots \\ h_{2n-2} \end{pmatrix}}_{\vec{h}} \qquad (1)$$

where M_FP(f) ∈ K^{(2n−1)×n}, ⃗g ∈ K^n and ⃗h ∈ K^{2n−1}. The low and high short products being defined as parts of the result of the full product, their corresponding linear maps are endomorphisms of K^n and K^{n−1} respectively, given by the following submatrices of M_FP(f):

$$\underbrace{\begin{pmatrix}
f_0 & & & \\
f_1 & f_0 & & \\
\vdots & \ddots & \ddots & \\
f_{n-1} & \cdots & f_1 & f_0
\end{pmatrix}}_{M_{SP_{lo}}(f)}
\qquad
\underbrace{\begin{pmatrix}
f_{n-1} & f_{n-2} & \cdots & f_1 \\
& f_{n-1} & \ddots & \vdots \\
& & \ddots & f_{n-2} \\
& & & f_{n-1}
\end{pmatrix}}_{M_{SP_{hi}}(f)} \qquad (2)$$

Finally, the middle product also corresponds to a linear map from K^n to K^m when the larger operand is fixed, given by the m × n Toeplitz matrix

$$\underbrace{\begin{pmatrix}
f_{n-1} & f_{n-2} & \cdots & f_0 \\
f_n & f_{n-1} & \cdots & f_1 \\
\vdots & \vdots & & \vdots \\
f_{n+m-2} & f_{n+m-3} & \cdots & f_{m-1}
\end{pmatrix}}_{M_{MP}(f)}.$$

4 TISP reductions

In this section, we compare the relative difficulties of the full product, the half-additive full product, the low and high short products, and the middle product, in the framework of time- and space-efficient algorithms. To this end, we define a notion of time- and space-preserving reduction between problems. We say that a problem A is TISP-reducible to a problem B if, given an algorithm for B that has time complexity t(n) and space complexity s(n), one can deduce an algorithm for A that has time complexity O(t(n)) and space complexity s(n) + O(1). We write A ≤_TISP B if A is TISP-reducible to B, and A ≡_TISP B if both A ≤_TISP B and B ≤_TISP A. Note that the TISP-reduction is transitive. The reduction we use can be defined using oracles and is an adaptation of the notion of fine-grained reduction [24, Definition 2.1] to time-space fine-grained complexity classes [17].

Theorem 4.1. Half-additive full products and short products are equivalent under TISP-reductions, that is, FP+_hi ≡_TISP FP+_lo ≡_TISP SP_hi ≡_TISP SP_lo. Furthermore, if SP denotes either SP_lo or SP_hi, then FP ≤_TISP SP ≤_TISP MP.

Proof. The equivalences SP_hi ≡_TISP SP_lo and FP+_hi ≡_TISP FP+_lo are proved below in Lemmas 4.3 and 4.4. The equivalence SP ≡_TISP FP+ (where SP denotes any of SP_lo and SP_hi, and FP+ any of FP+_lo and FP+_hi) is proved in Section 4.2. The reduction FP ≤_TISP SP simply amounts to the identity FP(f, g) = SP_lo(f, g) + X^n SP_hi(f, g).
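The identity FP(f, g) = SP_lo(f, g) + X^n SP_hi(f, g) is easy to check numerically. The sketch below is our own illustration: it uses naive schoolbook products, not the in-place algorithms of this paper, and represents polynomials as coefficient lists, lowest degree first.

```python
# Naive schoolbook full and short products (Definition 3.1), used only
# to illustrate the recombination FP(f, g) = SP_lo(f, g) + X^n SP_hi(f, g).
# Polynomials are lists of coefficients, lowest degree first.

def full_product(f, g):
    """Size-(len(f)+len(g)-1) product of two coefficient lists."""
    h = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def sp_lo(f, g):
    """Low short product (f*g) mod X^n, a size-n polynomial."""
    return full_product(f, g)[:len(f)]

def sp_hi(f, g):
    """High short product (f*g) quo X^n, a size-(n-1) polynomial."""
    return full_product(f, g)[len(f):]

f, g = [1, 2, 3, 4], [5, 6, 7, 8]
# The high part starts at X^n, so list concatenation recovers the full product.
assert full_product(f, g) == sp_lo(f, g) + sp_hi(f, g)
```

Concatenating the two coefficient lists implements the sum SP_lo(f, g) + X^n SP_hi(f, g), since the two parts occupy disjoint coefficient ranges.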
The reductions SP ≤_TISP MP and FP ≤_TISP MP follow from the following equalities, where 0 denotes a zero polynomial stored in size n − 1:

SP_lo(f, g) = MP(0 + X^{n−1} f, g),
SP_hi(f, g) = MP((f quo X) + X^{n−1} 0, g), and
FP(f, g) = MP(0 + X^{n−1} f + X^{2n−1} 0, g).

Hence, one can compute the full product and the low and high short products of f and g simply by calling a middle product algorithm on f padded with zeroes and g. In our model of read-only inputs, an actual padding is not required: it is sufficient to use some kind of fake padding, where the data structure storing f is responsible for returning 0 when needed.

The relative order of difficulty FP ≤_TISP SP ≤_TISP MP makes intuitive sense based on the size of the output compared to the size of the inputs, since the output can be used as work space: the full product maps 2n input coefficients to 2n − 1 output coefficients, the short product maps 2n coefficients to n coefficients, and the middle product maps 3n coefficients to n coefficients. In Section 5, we shall give a partial converse to SP ≤_TISP MP: there exists a reduction from MP to SP which preserves space and either maintains the asymptotic complexity or increases it by a logarithmic factor.

Definition 4.2. The size-n reversal of a polynomial f is rev_n(f) = X^{n−1} f(1/X).

We note that any algorithm whose input is a size-n polynomial f can be turned into a new algorithm that computes the same function with input rev_n(f), simply by replacing a query to any coefficient with index i by one with index n − 1 − i, which does not affect the number of ring operations. Let us now prove that SP_hi ≡_TISP SP_lo.

Lemma 4.3. Let f and g be two size-n polynomials. Then

SP_hi(f, g) = rev_{n−1}(SP_lo(rev_{n−1}(f quo X), rev_{n−1}(g quo X))).

Proof. Let f̃ = rev_{n−1}(f quo X) and g̃ = rev_{n−1}(g quo X), so that f̃_i = f_{n−1−i} and g̃_j = g_{n−1−j}. For k ≤ n − 2, the coefficient of X^k in SP_lo(f̃, g̃) is Σ_{i+j=k} f_{n−1−i} g_{n−1−j}, which is the coefficient of X^{2n−2−k} in f g. Reversing in size n − 1 therefore yields the coefficients of X^n, ..., X^{2n−2} of f g, that is, SP_hi(f, g).

Lemma 4.4. Let f and g be two size-n polynomials and h be a size-(n − 1) polynomial. Then

FP+_hi(f, g, h) = rev_{2n−1}(FP+_lo(rev_n(f), rev_n(g), rev_{n−1}(h))).

Proof.
Let f∗ = rev_n(f), g∗ = rev_n(g) and h∗ = rev_{n−1}(h). First note that rev_{2n−1}(h∗) = X^n h by definition. Since rev_{2n−1}(f∗ g∗) = rev_n(f∗) rev_n(g∗) = f g, we get that rev_{2n−1}(f∗ g∗ + h∗) = f g + X^n h = FP+_hi(f, g, h).

Reduction from SP to FP+

Let f and g be two size-n polynomials and h be a size-(n − 1) polynomial. The half-additive full product FP+_lo(f, g, h) equals f g + h. Note that f g = SP_lo(f, g) + X^n SP_hi(f, g). This already proves that the non-additive full product can be computed using algorithms for the low and high short products. For the half-additive full products, it is sufficient to store an intermediate result in the free registers of the output space. Assuming R[0..n − 1[ holds the value of h, the following instructions reduce the computation of FP+_lo(f, g, h) to two short products plus n − 1 additions.

R[n − 1..2n − 1[ ← SP_lo(f, g)
R[0..n − 1[ ← R[0..n − 1[ + R[n − 1..2n − 2[
R[n − 1] ← R[2n − 2]
R[n..2n − 1[ ← SP_hi(f, g)

Reduction from FP+ to SP

Let f and g be polynomials of degree less than n. Splitting f and g in halves such that f = f_0 + X^{⌈n/2⌉} f_1 and g = g_0 + X^{⌈n/2⌉} g_1, we have

SP_lo(f, g) = f_0 g_0 + X^{⌈n/2⌉} (f_0 g_1 + f_1 g_0) mod X^n.

What is needed is the full product of f_0 and g_0, and the low short products of f_0 and g_1, and of f_1 and g_0. Actually, since f_0 is larger than g_1 when n is odd (and g_0 larger than f_1), one only needs the short products SP_lo(f_0^−, g_1) and SP_lo(f_1, g_0^−), where f_0^− = f_0 mod X^{⌊n/2⌋} and g_0^− = g_0 mod X^{⌊n/2⌋}. To avoid any recursive call that would imply storing a call stack, we can actually use full products instead of short products: we first compute f_0^− g_1 + f_1 g_0^− using a full product and a half-additive full product. Then we can forget about the higher-order terms, and add f_0 g_0 to this sum using a second half-additive full product.
The following instructions summarize this approach:

R[0..2⌊n/2⌋ − 1[ ← FP(f_0^−, g_1)  ▷ half-additivity not needed
R[0..2⌊n/2⌋ − 1[ ← FP+_lo(f_1, g_0^−)  ▷ erase higher part of f_0^− g_1
R[⌈n/2⌉..n[ ← R[0..⌊n/2⌋[  ▷ keep lower part of f_0^− g_1 + f_1 g_0^−
R[0..2⌈n/2⌉ − 1[ ← FP+_hi(f_0, g_0)

The correctness is clear. The complexity of the algorithm is the cost of three full products in degree approximately n/2: one non-additive full product in size ⌊n/2⌋ and two half-additive full products in sizes ⌊n/2⌋ and ⌈n/2⌉, respectively. As a direct consequence of Lemmas 4.3 and 4.4, one obtains the same reductions to SP_hi and from FP+_lo or FP+_hi.

Unbalanced full product

The unbalanced full product can be computed using any algorithm for the (balanced) full product. Nevertheless, the space complexity increases since intermediate results must be stored: given an algorithm for the balanced full product of space complexity s(n), one obtains an algorithm with space complexity s(n) + (n − 1) for the unbalanced full product. In this section, we prove that if the original full product algorithm is half-additive, the resulting unbalanced full product algorithm has the same space complexity.

Let f be a size-m polynomial and g be a size-n polynomial with m > n. Write f = Σ_{k=0}^{⌈m/n⌉−1} X^{kn} f_k, where each sub-polynomial f_0, ..., f_{⌈m/n⌉−1} has size at most n. The computation of f · g reduces to the computations of each f_k · g. The following instructions prove that, using half-additivity, the intermediate results f_k · g can be computed directly in the output space.

R[(⌈m/n⌉ − 1)n..m + n − 1[ ← FP(f_{⌈m/n⌉−1}, g)  ▷ using fake padding
for k from ⌈m/n⌉ − 2 downto 0 do
    R[kn..(k + 2)n − 1[ ← FP+_hi(f_k, g)

Note that at step 1, the polynomial computed may have a larger size than what is needed, due to padding. Yet one can without difficulty use the lower part of the output space to store these additional useless coefficients, which are then erased at step 3. The time complexity remains ⌈m/n⌉ M(n), where M(n) is the complexity of the half-additive full product.

5 In-place algorithms from out-of-place algorithms

In this section, we show how to obtain in-place algorithms from out-of-place algorithms. The theorem below summarizes the main results described in this section.

Theorem 5.1.
1. Given a full product algorithm with time complexity M(n) and space complexity ≤ cn, one can build an in-place algorithm for the half-additive full product with time complexity ≤ (2c + 7) M(n) + o(M(n)).
2. Given a (low or high) short product algorithm with time complexity M(n) and space complexity ≤ cn, one can build an in-place algorithm for the same problem with time complexity ≤ (2c + 5) M(n) + o(M(n)).
3. Given a middle product algorithm with time complexity M(n) and space complexity ≤ cn, one can build an in-place algorithm for the same problem with time complexity ≤ M(n) log_{(c+2)/(c+1)}(n) + O(M(n)) if M(n) is quasi-linear, and O(M(n)) otherwise.

Actually, our reductions work for any space bound s(n) ≤ O(n). Smaller space bounds yield better time bounds, though we do not have a general expression in terms of s(n). Yet sublinear space bounds still imply an increase of the time complexity by a multiplicative constant for full and short products.

Formally, we give self-reductions for the three problems. That is, we use an out-of-place algorithm for the problem as a building block of our in-place version. The general idea is similar in the three cases. In a first step, we use the out-of-place algorithm to compute some part of the output, using the unused output space as temporary work space. Then a recursive call finishes the work.
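The chunked evaluation of the unbalanced full product described above can be sketched as follows. This is our own illustration: a naive product stands in for the half-additive calls, and the in-place register management is not modeled.

```python
# Unbalanced full product f*g with len(f) = m > n = len(g), computed by
# splitting f into blocks of size <= n and accumulating the shifted block
# products X^(kn) * (f_k * g) into the output, as in the reduction above.

def full_product(f, g):
    h = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def unbalanced_product(f, g):
    m, n = len(f), len(g)
    out = [0] * (m + n - 1)
    for k in range(0, m, n):              # block f_k = f[k..k+n)
        for i, c in enumerate(full_product(f[k:k + n], g)):
            out[k + i] += c               # accumulate X^k * (f_k * g)
    return out

f, g = [1, 2, 3, 4, 5, 6, 7], [1, 1, 1]
assert unbalanced_product(f, g) == full_product(f, g)
```

Because each block product is added into a window of the output, a half-additive full product can perform this accumulation directly in the output space, which is the point of the reduction.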
The (constant) amount of space needed in our in-place algorithms corresponds to the space needed to process the base cases.

Using the language of linear algebra, we aim to apply some specific matrix to a vector. The general construction we use consists in first applying the top or bottom rows of the matrix to the vector using the out-of-place algorithm, and then applying the remaining rows using a recursive call (cf. Fig. 1). In the cases of the full and short products, the diamond and triangular shapes of the corresponding matrices imply that the recursive call is made on two smaller inputs: for instance, to apply the first rows of a triangular matrix to a vector, one only needs to apply them to the first entries of the vector. For the middle product, the square shape implies that one input remains of the same size in the recursive call. This difference explains the difference in the time complexities in Theorem 5.1.

Figure 1: Tilings of the matrices M_FP(f) (left), M_SP_lo(f) (center) and M_MP(f) (right).

In-place half-additive full product

Our aim is to build an in-place (low-order) half-additive full product algorithm iFP+_lo based on an out-of-place full product algorithm oFP that has space complexity cn. That is, we are given two polynomials f and g of degree < n in the input space and a polynomial h of degree < n − 1 in the n − 1 low-order registers of the output space R, and we aim to compute f g + h in R. The algorithm is based on the tiling of the matrix M_FP(f) given in Fig. 1 (left).

For some k < n to be fixed later, let f = f̂ X^k + f_0 and g = ĝ X^k + g_0, where deg f_0, deg g_0 < k. Then we have

h + f g = h + f_0 g + f̂ g_0 X^k + f̂ ĝ X^{2k}.  (3)

Recall that the output R has size 2n − 1 and that its n − 1 low-order registers initially contain h. Then equation (3) can be evaluated with the following three steps:

R[0..n + k − 1[ ← h + f_0 g
R[k..n + k − 1[ ← R[k..n + k − 1[ + f̂ g_0
R[2k..2n − 1[ ← R[2k..2n − 1[ + f̂ ĝ

The first two steps correspond exactly to two additive unbalanced full products, that is, unbalanced full products that must be added to some already-filled output space. One can describe an algorithm oFP+u for this task, based on a (standard) full product algorithm oFP: if f has degree < k and g has degree < n with n > k, we write g = Σ_{i=0}^{⌈n/k⌉−1} g_i X^{ki} with deg(g_i) < k. Then f g = Σ_i f g_i X^{ki}: the algorithm computes the ⌈n/k⌉ products f g_i in 2k − 1 extra registers and adds them at the right places in the output space. If oFP has time complexity M(n) and space complexity cn, the time complexity of oFP+u is ⌈n/k⌉(M(k) + 2k − 1) and its space complexity is (c + 2)k − 1.

The last step adds f̂ ĝ to h + f_0 g + f̂ g_0 X^k and corresponds to a half-additive full product on inputs of degree < n − k, since only the registers R[2k..n + k − 1[ of R[2k..2n − 1[ are already filled: indeed, deg(h + f_0 g + f̂ g_0 X^k) < n + k − 1. This last step is thus a recursive call.

In order to make this algorithm run in place, k must be chosen so that the extra memory needed in the two calls to oFP+u fits exactly in the unused part of R. This is the case when (c + 2)k − 1 ≤ (2n − 1) − (n + k − 1), that is, k ≤ (n + 1)/(c + 3). The resulting algorithm is formally depicted below.

Algorithm 1 iFP+_lo from oFP
Input: f and g of degree < n in the input space, h of degree < n − 1 in R
Output: R contains f g + h
Required alg.: a full product algorithm oFP with space complexity ≤ cn

if n < c + 2 then
    R ← R + f g  ▷ using a naive algorithm
else
    k ← ⌊(n + 1)/(c + 3)⌋
    R[0..n + k − 1[ ← oFP+u(h, f_0, g)  ▷ work space: R[n + k − 1..2n − 1[
    R[k..n + k − 1[ ← oFP+u(h + f_0 g, f̂, g_0)  ▷ same work space
    R[2k..2n − 1[ ← iFP+_lo from oFP(f quo X^k, g quo X^k)

Complexity analysis

The algorithm uses two calls to oFP+u, with inputs of sizes (k, n) and (n − k, k) respectively.
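Equation (3), on which Algorithm 1 relies, can be checked numerically. The sketch below is ours; it only verifies the splitting with naive products and does not model the in-place register management.

```python
# Check of equation (3): with f = f0 + X^k * fh and g = g0 + X^k * gh,
# deg f0, deg g0 < k, we have
#   h + f*g = (h + f0*g) + X^k * (fh*g0) + X^2k * (fh*gh).

def mul(f, g):
    h = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def add(a, b):
    """Coefficient-wise sum of two lists of possibly different lengths."""
    out = list(a) if len(a) >= len(b) else list(b)
    for i, c in enumerate(b if len(a) >= len(b) else a):
        out[i] += c
    return out

def shift(a, k):
    """Multiply by X^k."""
    return [0] * k + a

n, k = 5, 2
f, g = [1, 2, 3, 4, 5], [6, 7, 8, 9, 10]
h = [1, 1, 1, 1]                                  # size n - 1
f0, fh, g0, gh = f[:k], f[k:], g[:k], g[k:]
lhs = add(h, mul(f, g))
rhs = add(add(add(h, mul(f0, g)), shift(mul(fh, g0), k)),
          shift(mul(fh, gh), 2 * k))
assert lhs == rhs
```

The three terms on the right-hand side are exactly the three register updates of the algorithm, applied to disjoint (or overlapping but additive) windows of the output.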
The total complexity amounts to ⌈n/k⌉(M(k) + 2k − 1) + (⌈n/k⌉ − 1)(M(k) + 2k − 1), plus a recursive call in size n − k. Letting T(n) be the complexity of iFP+_lo, we thus have

T(n) = T(n − k) + (2⌈n/k⌉ − 1)[M(k) + 2k − 1].

Note that k depends on n; this implies that the analysis must be carried out after eliminating k. Since k = ⌊(n + 1)/(c + 3)⌋, we have ⌈n/k⌉ ≤ c + 4 for n ≥ (c + 2)(c + 4). Therefore,

T(n) ≤ T((c + 2)(n + 1)/(c + 3)) + (2c + 7)[M((n + 1)/(c + 3)) + 2n/(c + 3) − (c + 1)/(c + 3)].

Using Corollary 5.6, we conclude that T(n) ≤ (2c + 7) M(n) + o(M(n)).

In-place short product

Our goal is to describe an in-place (low) short product algorithm based on an out-of-place one, using the tiling of M_SP_lo(f) depicted in Fig. 1 (center). Let f = Σ_{i=0}^{n−1} f_i X^i and g = Σ_{i=0}^{n−1} g_i X^i, and let h = Σ_{i=0}^{n−1} h_i X^i = SP_lo(f, g). The idea is to fix some k < n and to proceed in two phases. The first phase corresponds to the bottom k rows of M_SP_lo(f) and computes h_{n−k} to h_{n−1} using the out-of-place algorithm on smaller polynomials. The second phase corresponds to the top n − k rows and is a recursive call to compute h_0 to h_{n−k−1}: indeed, h mod X^{n−k} = SP_lo(f mod X^{n−k}, g mod X^{n−k}).

For the first phase, we remark that the bottom k rows can be tiled by ⌈n/k⌉ lower triangular matrices (denoted L_0, ..., L_{⌈n/k⌉−1} from right to left) and ⌈n/k⌉ − 1 upper triangular matrices (U_1, ..., U_{⌈n/k⌉−1}). One can identify the matrices L_i and U_i as matrices of some low and high short products. More precisely, the coefficients that appear in the lower triangular matrix L_i are the coefficients of degree ki to k(i + 1) − 1 of f. Thus, L_i = M_SP_lo(f_{ki,k(i+1)}), where f_{ki,k(i+1)} = Σ_{j=ki}^{k(i+1)−1} f_j X^{j−ki}. Similarly, U_i = M_SP_hi(f_{ki,k(i+1)}). The matrices L_{⌈n/k⌉−1} and U_{⌈n/k⌉−1} must be padded if k does not divide n. Altogether, this proves that this part of the computation reduces to ⌈n/k⌉ low short products and ⌈n/k⌉ − 1 high short products in size k.

In order for this algorithm to actually be in place, k must be small enough. If the out-of-place short product algorithm uses ck extra space, since we also need k free registers to store the intermediate results, k must satisfy n − k ≥ (c + 1)k, that is, k ≤ n/(c + 2).

Algorithm 2 iSP_lo from oSP
Input: f and g of degree < n
Output: R contains SP_lo(f, g)
Required alg.: two short product algorithms oSP_lo and oSP_hi with space complexity ≤ cn

if n < c + 2 then
    R ← SP_lo(f, g)  ▷ using a naive algorithm
else
    k ← ⌊n/(c + 2)⌋
    for i = 0 to ⌈n/k⌉ − 1 do  ▷ work space: R[0..n − k[
        R[n − k..n[ += oSP_lo(f_{ki,k(i+1)}, g_{n−k(i+1),n−ki})
    for i = 0 to ⌈n/k⌉ − 2 do  ▷ same work space
        R[n − k..n[ += oSP_hi(f_{ki,k(i+1)}, g_{n−k(i+2),n−k(i+1)})
    R[0..n − k[ ← iSP_lo from oSP(f mod X^{n−k}, g mod X^{n−k})

Complexity analysis

The algorithm performs ⌈n/k⌉ low short products and ⌈n/k⌉ − 1 high short products in size k, plus a recursive call in size n − k. Let M(k) be the complexity of a low short product algorithm. Then the high short product can be computed in time M(k − 1). Let T(n) be the complexity of the recursive algorithm. Then T(n) = ⌈n/k⌉ M(k) + (⌈n/k⌉ − 1) M(k − 1) + 2(⌈n/k⌉ − 1)k + T(n − k), where the linear term accounts for the additions. Since k = ⌊n/(c + 2)⌋, we have ⌈n/k⌉ ≤ c + 3 for n ≥ (c + 3)(c + 2), and n − k ≤ (c + 1)n/(c + 2) + 1. Thus,

T(n) ≤ (c + 3) M(n/(c + 2)) + (c + 2) M(n/(c + 2) − 1) + 2n + T((c + 1)n/(c + 2) + 1).
Using Corollary 5.6, this equation yields T(n) ≤ (2c + 5) M(n) + o(M(n)).

In-place middle product

To build an in-place middle product algorithm, we assume that we have an algorithm for the middle product that uses cn extra space to compute the middle product in size (n, m) (that is, with inputs of degree < n + m − 1 and < n, respectively).

The in-place algorithm is again based on the tiling given in Fig. 1 (right): the top k rows correspond to the matrix M_MP(f mod X^{n+k−1}) and the bottom m − k rows to the matrix M_MP(f quo X^k). The algorithm consists in computing M_MP(f mod X^{n+k−1}) ⃗g using the out-of-place algorithm, and then M_MP(f quo X^k) ⃗g using a recursive call.

To make this algorithm work in place, the value of k has to be adjusted so that the work space is large enough. The result of a middle product in size k has degree < k and needs ck extra work space by hypothesis. Therefore, if m − k ≥ (c + 1)k, that is, k ≤ m/(c + 2), the computation can be performed in place.

Algorithm 3 iMP from oMP
Input: f and g of degree < n + m − 1 and < n respectively
Output: R contains MP(f, g)
Required alg.: an out-of-place middle product algorithm oMP with space complexity ≤ cn

if m < c + 2 then
    R ← oMP(f, g)  ▷ using a naive algorithm
else
    k ← ⌊m/(c + 2)⌋
    R[0..k[ ← oMP(f mod X^{n+k−1}, g)  ▷ work space: R[k..m[
    R[k..m[ ← iMP from oMP(f quo X^k, g)  ▷ recursive call

Complexity analysis

Let M(k) be the cost of an out-of-place balanced middle product algorithm. The cost of an unbalanced middle product in size (n, k) is thus ⌈n/k⌉ M(k) for k < n. The in-place algorithm first computes a middle product using the out-of-place algorithm, and then makes a recursive call on the remaining part. Note that n does not change during the algorithm and can be viewed as a large constant, while m is the parameter that varies. Then the cost of the algorithm satisfies T(m) ≤ ⌈n/k⌉ M(k) + T(m − k).
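The tiling behind Algorithm 3 splits MP(f, g) into its first k coefficients, obtained from f mod X^{n+k−1}, and its last m − k coefficients, obtained from f quo X^k. This can be checked with the following sketch of ours, which uses a naive middle product (Definition 3.2) for illustration only.

```python
# Naive middle product: for f of size n+m-1 and g of size n,
# MP(f, g) = ((f*g) quo X^(n-1)) mod X^m, the m central coefficients.

def middle_product(f, g):
    n = len(g)
    m = len(f) - n + 1
    h = [0] * (len(f) + n - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h[n - 1 : n - 1 + m]

n, m, k = 4, 6, 2
f = list(range(1, n + m))                 # size n + m - 1
g = [1, 2, 1, 3]                          # size n
top = middle_product(f[:n + k - 1], g)    # f mod X^(n+k-1): k coefficients
bottom = middle_product(f[k:], g)         # f quo X^k: m - k coefficients
assert middle_product(f, g) == top + bottom
```

In the in-place algorithm, the out-of-place call produces `top` while the registers holding `bottom` serve as work space, and the recursive call then fills them.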
Since k = ⌊m/(c + 2)⌋, we have ⌈n/k⌉ < n(c + 2)/(m − c − 2) + 1 and m − k ≤ (c + 1)m/(c + 2) + 1. Furthermore, M(k) ≤ m/(n(c + 2)) M(n), thus ⌈n/k⌉ M(k) ≤ (m/(m − c − 2) + m/(n(c + 2))) M(n). That is,

T(m) ≤ (m/(n(c + 2)) + m/(m − c − 2)) M(n) + T((c + 1)m/(c + 2) + 1).

Corollary 5.7 implies T(n) ≤ M(n) log_{(c+2)/(c+1)}(n) + O(M(n)) for m = n.

Improvement for non-quasi-linear algorithms

The extra logarithmic factor only occurs when M(n) = n^{1+o(1)}. Suppose on the contrary that M(n) ≤ λn^γ for some γ > 1. The recurrence now reads

T(m) ≤ (n(c + 2)/(m − c − 2) + 1) λ (m/(c + 2))^γ + T((c + 1)m/(c + 2) + 1).

We claim that there exist constants µ and ν such that T(m) ≤ µ m^{γ−1} n + ν m^γ + o(m^{γ−1} n + m^γ), and prove it by induction. Using the recurrence relation and the induction hypothesis,

T(m) ≤ λ n m^{γ−1}/(c + 2)^{γ−1} + λ m^γ/(c + 2)^γ + µ ((c + 1)/(c + 2))^{γ−1} m^{γ−1} n + ν ((c + 1)/(c + 2))^γ m^γ + o(m^{γ−1} n + m^γ).

The result follows as soon as (λ + µ(c + 1)^{γ−1})/(c + 2)^{γ−1} ≤ µ and (λ + ν(c + 1)^γ)/(c + 2)^γ ≤ ν. We can thus fix

µ = λ/((c + 2)^{γ−1} − (c + 1)^{γ−1}) and ν = λ/((c + 2)^γ − (c + 1)^γ).

Finally, taking m = n, we conclude that T(n) ≤ (µ + ν) n^γ + o(n^γ).

Reduction from the middle product to short products

The middle product of f and g can be computed as the sum of the low short product of f quo X^{n−1} with g and the high short product of f mod X^{n−1} with g. Yet this reduction does not preserve the space complexity, since one needs to store the results of the two short products in two zones of size n before summing them. Actually, the reduction given above from oMP to iMP can easily be adapted to a space-preserving reduction from MP to SP. Yet the complexity then also worsens by a logarithmic factor. Thus, we cannot conclude that MP ≤_TISP SP.

Lemma 5.2.
Let T(n) be a function satisfying T(n) ≤ f(n) + T(⌊αn + β⌋) for some α < 1. Then

T(n) ≤ T(⌊n_K⌋) + Σ_{i=0}^{K−1} f(n_i)

where n_i = α^i n + β(1 − α^{i+1})/(1 − α) and K ≤ log_{1/α}(n).

Proof. Let T(x) = T(⌊x⌋) for non-integral x. By definition of n_i, n ≤ n_0 and n_{i+1} = αn_i + β, whence T(n_i) ≤ f(n_i) + T(n_{i+1}). Then by recurrence, T(n) ≤ T(n_{i+1}) + Σ_{j=0}^{i} f(n_j).

Lemma 5.3. Let n_i = α^i n + β(1 − α^{i+1})/(1 − α). Then

Σ_{i=0}^{K−1} n_i ≤ (n + βK)/(1 − α).

Proof. Since Σ_{i=0}^{K−1} α^i = (1 − α^K)/(1 − α) and 1 − α^K < 1, Σ_i α^i n ≤ n/(1 − α). Then Σ_i (1 − α^{i+1})/(1 − α) = K/(1 − α) + (α^{K+1} − α)/(1 − α)^2 ≤ K/(1 − α) since α^{K+1} < α.

Lemma 5.4. Let n_i = α^i n + β(1 − α^{i+1})/(1 − α). Then

Σ_{i=0}^{K−1} 1/(n_i − β/(1 − α)) = α(α^{−K} − 1)/((1 − α)n − αβ).

Proof. Since n_i = α^i(n − βα/(1 − α)) + β/(1 − α), n_i − β/(1 − α) is a multiple of α^i. Thus,

Σ_{i=0}^{K−1} 1/(n_i − β/(1 − α)) = 1/(n − βα/(1 − α)) · Σ_{i=0}^{K−1} α^{−i}.

Then Σ_i α^{−i} = (1 − α^{−K})/(1 − 1/α) = α(α^{−K} − 1)/(1 − α), and Σ_i 1/(n_i − β/(1 − α)) = α(α^{−K} − 1)/((1 − α)n − αβ).

Lemma 5.5. If M(n)/n is non-decreasing, and n_i = α^i n + β(1 − α^{i+1})/(1 − α) for some α < 1, then

Σ_{i=0}^{K−1} M(λn_i + µ) ≤ λ/(1 − α) M(n) + o(M(n))

for K ≤ log_{1/α}(n) and any λ and µ such that λn_i + µ ≤ n for all n_i.

Proof. Since M(n)/n is non-decreasing, M(λn_i + µ) ≤ ((λn_i + µ)/n)M(n). Therefore, Σ_i M(λn_i + µ) ≤ (M(n)/n) Σ_i (λn_i + µ). By Lemma 5.3, Σ_i M(λn_i + µ) ≤ λM(n)/(1 − α) + λβK M(n)/(n(1 − α)) + µK M(n)/n. Since K = O(log n), K M(n)/n = o(M(n)).

Corollary 5.6. Let T(n) ≤ Σ_k a_k M(λ_k n + µ_k) + bn + c + T(αn + β) with α < 1 and λ_k n + µ_k < n for all k. Then

T(n) ≤ (Σ_k a_k λ_k)/(1 − α) M(n) + bn/(1 − α) + o(M(n)).
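The closed forms above are easy to sanity-check numerically. The following short Python snippet (ours, purely illustrative) verifies that the sequence n_i used in Lemmas 5.2 to 5.4 satisfies the recurrence n_{i+1} = αn_i + β, and that the sum in Lemma 5.4 matches its closed form, for one arbitrary choice of parameters.

```python
# Numerical sanity check (illustration only) for the sequence
# n_i = alpha^i * n + beta*(1 - alpha^(i+1))/(1 - alpha):
# it satisfies n_{i+1} = alpha*n_i + beta, and the sum of
# 1/(n_i - beta/(1-alpha)) has the closed form of Lemma 5.4.
alpha, beta, n, K = 0.75, 2.0, 1000.0, 20

closed = [alpha**i * n + beta * (1 - alpha**(i + 1)) / (1 - alpha)
          for i in range(K + 1)]

# recurrence n_{i+1} = alpha * n_i + beta
for i in range(K):
    assert abs(closed[i + 1] - (alpha * closed[i] + beta)) < 1e-9

# Lemma 5.4: sum_{i<K} 1/(n_i - beta/(1-alpha))
#            = alpha*(alpha^(-K) - 1) / ((1-alpha)*n - alpha*beta)
lhs = sum(1 / (closed[i] - beta / (1 - alpha)) for i in range(K))
rhs = alpha * (alpha**-K - 1) / ((1 - alpha) * n - alpha * beta)
assert abs(lhs - rhs) < 1e-12
```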
The linear term is negligible unless M(n) = O(n).

Proof. By Lemma 5.2, T(n) ≤ T(n_K) + Σ_i f(n_i) with n_i defined as in the lemma and f(n) = Σ_k a_k M(λ_k n + µ_k) + bn + c. Then

Σ_{i=0}^{K−1} f(n_i) = Σ_k a_k Σ_{i=0}^{K−1} M(λ_k n_i + µ_k) + b Σ_{i=0}^{K−1} n_i + Kc
  ≤ Σ_k a_k ( λ_k/(1 − α) M(n) + o(M(n)) ) + b(n + βK)/(1 − α) + Kc
  = (Σ_k a_k λ_k)/(1 − α) M(n) + bn/(1 − α) + o(M(n))

since K = o(M(n)) and the sum over k is of fixed size.

Corollary 5.7. Let T(m) ≤ (λm/n + µ/(m − 1/(1 − α)) + 1)M(n) + T(αm + 1) with α < 1 and m ≤ n. Then for m = n,

T(n) ≤ M(n) log_{1/α}(n) + (λ + µα)/(1 − α) M(n) + o(M(n)).

Proof. By Lemma 5.2,

T(m) ≤ T(m_K) + M(n) Σ_i ( λm_i/n + µ/(m_i − 1/(1 − α)) + 1 )

where m_i = α^i m + (1 − α^{i+1})/(1 − α). By Lemma 5.3, Σ_i m_i ≤ (m + K)/(1 − α), and by Lemma 5.4, Σ_i 1/(m_i − 1/(1 − α)) ≤ α^{−K+1}/((1 − α)m − α). Altogether,

T(m) ≤ T(m_K) + K M(n) + λ(m + K)/(n(1 − α)) M(n) + µα/(1 − α) · (1/α)^K/(m − α/(1 − α)) M(n).

If we plug K = log_{1/α}(m) and fix m = n, we get

T(n) ≤ T(n_K) + M(n) log_{1/α}(n) + (λ + µα)/(1 − α) M(n) + o(M(n)).

Conclusion
We have presented algorithms for polynomial multiplication problems which are efficient in terms of both time and space. Our results show that any algorithm for the full and short products of polynomials can be turned into another algorithm with the same asymptotic time complexity while using only O(1) extra space. We obtain similar results for the middle product, but only prove them for algorithms that do not have a quasi-linear time complexity. In the latter case, the time complexity increases by a logarithmic factor. We provided analyses of our reductions that make their constants explicit. In particular, their values ensure that our reductions are practicable.

In a future work, we plan to address some remaining issues.
By examining the constants in the already known algorithms, we can choose the algorithms to use as starting points of our reductions so as to optimize the complexity. For instance, three variants of Karatsuba's algorithm with different time and space complexities are known [19, 23, 16]. Furthermore, it seems possible to improve on the complexity of low-space versions of Karatsuba's and Toom-Cook's algorithms, yielding faster in-place algorithms through our reductions. Another promising approach is to slightly relax the model of computation and work in a model in which one can write on the input space, as long as the original inputs are restored by the end of the computation. Preliminary results for Karatsuba's algorithm suggest that this could also yield a lower constant in the time complexity. Finally, we have started to explore the design of in-place algorithms for a broader range of problems on polynomials, such as division or evaluation/interpolation. The use of in-place middle and short products becomes crucial, since one needs to avoid any increase in the size of the intermediate results.

Acknowledgements
This work was begun while the last author was graciously hosted by the LIRMM at the Université de Montpellier. This work was supported in part by the National Science Foundation under grants 1319994 and 1618269.

References

[1] K. Abrahamson. Time-space tradeoffs for branching programs contrasted with those for straight-line programs. In , pages 402–409, 1986.
[2] S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press, 1st edition, 2009.
[3] A. Bostan, F. Chyzak, M. Giusti, R. Lebreton, G. Lecerf, B. Salvy, and E. Schost. Algorithmes Efficaces en Calcul Formel. 1.0 edition, Aug. 2017.
[4] A. Bostan, G. Lecerf, and E. Schost. Tellegen's principle into practice. In Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, ISSAC '03, pages 37–44, New York, NY, USA, 2003. ACM.
[5] R. Brent and P. Zimmermann.
Modern Computer Arithmetic. Cambridge University Press, New York, NY, USA, 2010.
[6] D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.
[7] Y. Cheng. Space-efficient Karatsuba multiplication for multi-precision integers. CoRR, abs/1605.06760, 2016.
[8] S. A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, May 1966.
[9] J. von zur Gathen and J. Gerhard. Modern Computer Algebra (third edition). Cambridge University Press, 2013.
[10] G. Hanrot, M. Quercia, and P. Zimmermann. Speeding up the Division and Square Root of Power Series. Technical Report RR-3973, INRIA, 2000.
[11] G. Hanrot, M. Quercia, and P. Zimmermann. The middle product algorithm I. Applicable Algebra in Engineering, Communication and Computing, 14(6):415–438, Mar. 2004.
[12] G. Hanrot and P. Zimmermann. A long note on Mulders' short product. Journal of Symbolic Computation, 37(3):391–401, 2004.
[13] D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. J. ACM, 63(6):52:1–52:23, Jan. 2017.
[14] D. Harvey and D. S. Roche. An in-place truncated Fourier transform and applications to polynomial multiplication. In ISSAC '10: Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation, pages 325–329, New York, NY, USA, 2010. ACM.
[15] E. Kaltofen. Challenges of symbolic computation: my favorite open problems. Journal of Symbolic Computation, 29(6):891–919, 2000.
[16] A. Karatsuba and Y. Ofman. Multiplication of Multidigit Numbers on Automata. Soviet Physics-Doklady, 7:595–596, 1963.
[17] A. Lincoln, V. Vassilevska Williams, J. R. Wang, and R. R. Williams. Deterministic Time-Space Trade-Offs for k-SUM. In I. Chatzigiannakis, M. Mitzenmacher, Y. Rabani, and D. Sangiorgi, editors, , volume 55 of Leibniz International Proceedings in Informatics (LIPIcs), pages 58:1–58:14, Dagstuhl, Germany, 2016.
Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
[18] T. Mulders. On Short Multiplications and Divisions. Applicable Algebra in Engineering, Communication and Computing, 11(1):69–88, 2000.
[19] D. S. Roche. Space- and time-efficient polynomial multiplication. In Proceedings of the 2009 International Symposium on Symbolic and Algebraic Computation, ISSAC '09, pages 295–302. ACM, 2009.
[20] J. Savage and S. Swamy. Space-time tradeoffs for oblivious integer multiplication. In H. Maurer, editor, Automata, Languages and Programming, volume 71 of Lecture Notes in Computer Science, pages 498–504. Springer Berlin / Heidelberg, 1979.
[21] A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[22] C. Su and H. Fan. Impact of Intel's new instruction sets on software implementation of GF(2)[x] multiplication.