[PDF] A Simple and Fast Algorithm for Computing the N -th Term of a Linearly Recurrent Sequence

Abstract

We present a simple and fast algorithm for computing the N -th term of a given linearly recurrent sequence. Our new algorithm uses O(M(d)logN) arithmetic operations, where d is the order of the recurrence, and M(d) denotes the number of arithmetic operations for computing the product of two polynomials of degree d . The state-of-the-art algorithm, due to Charles Fiduccia (1985), has the same arithmetic complexity up to a constant factor. Our algorithm is simpler, faster and obtained by a totally different method. We also discuss several algorithmic applications, notably to polynomial modular exponentiation, powering of matrices and high-order lifting.

Full PDF

aa r X i v : . [ c s . S C ] A ug A Simple and Fast Algorithm for Computingthe N -th Term of a Linearly Recurrent Sequence Alin Bostan ⋆ and Ryuhei Mori † ⋆ Inria, Palaiseau, France and † Tokyo Institute of Technology, Japan [email protected] , [email protected] August 19, 2020

Abstract

We present a simple and fast algorithm for computing the N -thterm of a given linearly recurrent sequence. Our new algorithm uses O ( M ( d ) log N ) arithmetic operations, where d is the order of the re-currence, and M ( d ) denotes the number of arithmetic operations forcomputing the product of two polynomials of degree d . The state-of-the-art algorithm, due to Charles Fiduccia (1985), has the samearithmetic complexity up to a constant factor. Our algorithm is sim-pler, faster and obtained by a totally diﬀerent method. We also discussseveral algorithmic applications, notably to polynomial modular expo-nentiation, powering of matrices and high-order lifting. Keywords : Algebraic Algorithms; Computational Complexity; LinearlyRecurrent Sequence; Rational Power Series; Fast Fourier Transform

Computing eﬃciently selected terms in sequences is a basic and fundamen-tal algorithmic problem, whose applications are ubiquitous, for instance intheoretical computer science [65, 49], algebraic complexity theory [59, 73],computer algebra [28, 71, 48], cryptography [31, 32, 29, 33], algorithmicnumber theory [72, 1], eﬀective algebraic geometry [12, 36], numerical anal-ysis [52, 51] and computational biology [56].In simple terms, the problem can be formulated as follows:1iven a sequence ( u n ) n ≥ in an eﬀective ring R , and given apositive integer N ∈ N , compute the term u N as fast as possible.Here, the input ( u n ) n ≥ ∈ R N is assumed to be a recurrent sequence,speciﬁed by a data structure consisting in a recurrence relation and suﬃ-ciently many initial terms that uniquely determine the sequence ( u n ) n ≥ .Eﬃciency is measured in terms of ring operations (algebraic model), or of bit operations (Turing machine model). The cost of an algorithm is respec-tively estimated in terms of arithmetic complexity or of binary complexity .Both measures have their own usefulness: the algebraic model is relevantwhen ring operations have essentially unit cost (typically, when R is a ﬁnitering such as the prime ﬁeld F p := Z /p Z ), while the bit complexity model isrelevant when elements of R have a variable bitsize, and thus ring operationsin R have variable cost (typically, when R is the ring Z of integer numbers).The recurrence relation satisﬁed by the input sequence ( u n ) n ≥ mightbe of several types:(C) linear with constant coeﬃcients , that is of the form, u n + d = c d − u n + d − + · · · + c u n , n ≥ , for some given coeﬃcients c , . . . , c d − in R . In this case we simplysay that the sequence is linearly recurrent (or, C-recursive ). The mostbasic examples are the geometric sequence ( q n ) n ≥ , for q ∈ R , and theFibonacci sequence ( F n ) n with F n +2 = F n +1 + F n and F = 0 , F = 1.(P) linear with polynomial coeﬃcients , that is of the form, u n + d = c d − ( n ) u n + d − + · · · + c ( n ) u n , n ≥ , for some given rational functions c ( x ) , . . . , c d − ( x ) in R ( x ). In thiscase the sequence is called holonomic (or, P-recursive ). Among themost basic examples, other than the C-recursive ones, there is thefactorial sequence ( n !) n ≥ = (1 , , , , , , . . . ) and the Motzkinsequence ( u n ) n ≥ = (1 , , , , , , , . . . ) speciﬁed by the recurrence u n +1 = n +3 n +3 · u n + nn +3 · u n − and the initial conditions u = u = 1. The ring R is assumed to be commutative with unity and eﬀective in the sense thatits elements are represented using some data structure, and there exist algorithms forperforming the basic ring operations (+ , − , × ) and for testing equality of elements in R . linear with polynomial coeﬃcients in q and q n , that is of the form, u n + d = c d − ( q, q n ) u n + d − + · · · + c ( q, q n ) u n , n ≥ , for some q ∈ R and some rational functions c ( x, y ) , . . . , c d − ( x, y )in R ( x, y ). In this case, the sequence is called q -holonomic ; such asequence can be seen as a q -deformation of a holonomic sequence (inthe sense that when q

1, the limit sequence tends to be holonomic).A typical example is the q -factorial [ n ] q ! := (1+ q ) · · · (1+ q + · · · + q n − ).In all these classes of examples, the recurrence is linear , and the integer d that governs the length of the recurrence relation is called the order of thecorresponding linear recurrence.Of course, some interesting sequences satisfy nonlinear recurrences, asis the case for the so-called Somos-4 sequence (1 , , , , , , , , , . . . )deﬁned by: u n +4 = (cid:0) u n +3 u n +1 + u n +2 (cid:1) /u n together with u = · · · = u = 1,but we will not consider this larger class in what follows.For computing the N -th term u N in a sequence of type (P), resp. (Q), thebest known algorithms are presented in [16, 12], resp. in [9]. In the algebraicmodel, they rely on an algorithmic technique called baby-step / giant-step ,which allows to compute u N using a number of operations in R that is almostlinear in √ N , up to logarithmic factors. This should be contrasted with thedirect iterative algorithm, of arithmetic complexity linear in N .In the bit model, the same references provide diﬀerent algorithms basedon a diﬀerent technique, called binary splitting ; these algorithms are quasi-optimal in the sense that they are able to compute u N in a number ofbit operations almost linear (up to logarithmic factors) in the bitsize ofthe output value u N . Once again, this should be contrasted with the directiterative algorithm, whose binary complexity is larger by at least one order ofmagnitude (e.g., in case (P) the naive algorithm has bit complexity O ( N )). In what follows, we will restrict our attention to the case (C) only. This caseobviously is a subcase of both cases (P) and (Q). It presents an exceptionalfeature with respect to the algebraic model: contrary to the general cases(P) and (Q), in case (C) it is possible to compute the term u N using anumber of arithmetic operations in R that is only logarithmic in N .For the geometric sequence u n = q n , this is known since Pingala ( ∼

200 BC)who seemingly is the inventor of the algorithmic method of binary powering ,3r square-and-multiply [46, §4.6.3]. The corresponding algorithm is recursiveand based on the equalities q N = ( ( q N/ ) , if N is even, q · ( q N − ) , else.The arithmetic complexity of this algorithm is bounded by 2 log N multipli-cations in R , which represents a tremendous improvement compared to thenaive iterative algorithm that computes the term q N in N − R , by simply unrolling the recurrence u n +1 = q · u n with q = 1.In the general case (C), Miller and Spencer Brown showed in 1966 [54]that a similar complexity can be obtained by converting the scalar recur-rence of order du n + d = c d − u n + d − + · · · + c u n , n ≥ , (1)into a vector recurrence of order 1  u n u n +1 ... u n + d − | {z } v n =  c c · · · c d − | {z } M ×  u n − u n ... u n + d − | {z } v n − , n ≥ , (2)and by using binary powering in the ring M d ( R ), of d × d matrices withcoeﬃcients in R , in order to compute M N recursively by M N = ( ( M N/ ) , if N is even, M · ( M N − ) , else.From there, u N can be read oﬀ the matrix-vector product v N = M N · v .The arithmetic complexity of this method is O ( d θ log N ) operations in R ,where θ ∈ [2 ,

3] is any feasible exponent for matrix multiplication in M d ( R ).Strangely enough, the paper [54] of Miller and Spencer Brown was largelyoverlooked in the subsequent literature, and their result has been redis-covered several times in the 1980s. For instance, Shortt [68] proposed a In all this article, the notation log refers to the logarithm in base 2. In [15, p. 74], the authors call this un truc bien connu (“a well-known trick”). (log N ) algorithm for computing the N -th Fibonacci number , and ex-tended it together with Wilson [76] to the computation of order- d Fibonaccinumbers in O ( d log N ) arithmetic operations. The same cost has also beenobtained by Dijkstra [21] and Urbanek [75]. Pettorossi [60], and indepen-dently Gries and Levin [34], improved the algorithm and lowered the cost to O ( d log N ), essentially by taking into account the sparse structure of thematrix M . See also [22, 50, 23, 24, 39, 63, 30, 44] for similar algorithms. The currently best algorithm is due to Fiduccia [26]. It is based on thefollowing observation: the matrix M in (2) is the transpose of the compan-ion matrix C which represents the R -linear multiplication-by- x map fromthe quotient ring R [ x ] / (Γ) to itself, where Γ = x d − P d − i =0 c i x i . Therefore,denoting by e the row vector e = [ · · · ], the N -th term u N equals u N = e · v N = e · M N · v = (cid:16) C N · e T (cid:17) T · v = h x N mod Γ , v i , (3)where the inner product takes place between the vector v = h u · · · u d − i of initial terms of ( u n ) n ≥ , and the vector whose entries are the coeﬃcientsof the remainder ( x N mod Γ) of the Euclidean division of x N by Γ.Therefore, computing u N is reduced to computing the coeﬃcients of( x N mod Γ), and this can be performed eﬃciently by using binary poweringin the quotient ring A := R [ x ] / (Γ), at the cost of O (log N ) multiplicationsin A . Each multiplication in A may be performed using O ( M ( d )) operationsin R [27, Ch. 9, Corollary 9.7], where M ( d ) denotes the arithmetic cost of Shortt’s algorithm had actually appeared before, in the 1969 edition of Knuth’sbook [45, p. 552], as a solution of Ex. 26 (p. 421, §4.6.3). The algorithm is based onthe so-called doubling formulas ( F n , F n − ) = ( F n + 2 F n F n − , F n + F n − ), actually dueto Lucas (1876) and Catalan (1886), see e.g. [20, Ch. XVII]. The currently best implemen-tation for computing F N over Z (mpz_ﬁb_ui from GMP) uses a variant of this method,requiring just two squares (and a few additions) per binary digit of N . Already in 1899, G. de Rocquigny asked “for an expeditious procedure to compute avery distant term of the Fibonacci sequence” [19]. In response, several methods (includingthe one mentioned by Knuth in [45, p. 552]) have been published one year later by Rosace(alias), E.-B. Escott, E. Malo, C.-A. Laisant and G. Picou [66]. This fact does not seem tohave been noticed in the modern algorithmic literature before the current paper, althoughthe reference [66] is mentioned in Dickson’s formidable book [20, p. 404]. The idea already appears in the 1982 conference paper [25]. We have discovered thatthe same algorithm had been sketched by D. Knuth in the corrections/changes to [46]published in 1981 in [47, p. 28], where he attributes the result to R. Brent. Almost surely,C. Fiduccia was not aware about this fact. R [ x ] in degree d . Using Fast Fourier Transform(FFT) methods, one may take M ( d ) = O ( d log d ) when R contains enoughroots of unity, and M ( d ) = O ( d log d log log d ) in general [27, Ch. 8].In conclusion, Fiduccia’s algorithm allows the computation of the N -thterm u N of a linearly recurrent sequence of order d using O ( M ( d ) log N )operations in R . Since 1985, this is the state-of-the-art algorithm for thistask in case (C).A closer inspection of the proof of [27, Corollary 9.7] shows that a moreprecise estimate for the arithmetic cost of Fiduccia’s algorithm is F ( N, d ) = 3 M ( d ) ⌊ log N ⌋ + O ( d log N ) (4)operations in R . This comes from the fact that squaring in A = R [ x ] / (Γ)is based on one polynomial multiplication in degree less than d followed byan Euclidean division by Γ of a polynomial of degree less than 2 d . TheEuclidean division is reduced to a power series division by the reversal Q ( x ) := x d · Γ(1 /x ) of Γ, followed by a polynomial multiplication in de-gree less than d . The reciprocal of Q ( x ) is precomputed modulo x d onceand for all (using a formal Newton iteration) in 3 M ( d ) + O ( d ) operationsin R , and then each squaring in A also takes 3 M ( d ) + O ( d ) operations in R .The announced cost from (4) follows from the fact that binary poweringuses ⌊ log N ⌋ squarings and at most ⌊ log N ⌋ multiplications by x . We propose in this paper a new and simpler algorithm, with a better cost.More precisely, our ﬁrst main complexity result is:

Theorem 1.

One can compute the N -th term of a linearly recurrent se-quence of order d with coeﬃcients in a ring R using T ( N, d ) = 2 M ( d ) ⌈ log( N + 1) ⌉ + M ( d ) arithmetic operations in R . The proof of this result is based on a very natural algorithm, which willbe presented in Section 2.1. Let us remark that it improves by a factor of 1 . Note that multiplying by x in A is much easier and has linear arithmetic cost O ( d ). heorem 2. One can compute the N -th term of a linearly recurrent se-quence of order d with coeﬃcients in a ﬁeld K supporting FFT using ∼ M ( d ) log( N ) arithmetic operations in K . Algorithms 1 and 2 (underlying Theorem 1) and Algorithm 11 (underly-ing Theorem 2) are both of LSB-ﬁrst (least signiﬁcant bit ﬁrst) type. Thisprevents them from computing simultaneously several consecutive terms ofhigh indices, such as u N , . . . , u N + d − . This makes a notable diﬀerence withFiduccia’s algorithm from §1.3. For this reason, we will design a secondalgorithm, of MSB-ﬁrst (most signiﬁcant bit ﬁrst) type, by “transposing”Algorithm 1. This leads to the following complexity result. Theorem 3.

One can compute the terms of indices N − d + 1 , . . . , N of alinearly recurrent sequence of order d with coeﬃcients in a ring R using M ( d ) ⌈ log( N + 1) ⌉ + O ( M ( d )) arithmetic operations in R . The method underlying this complexity result is based on Algorithms 5, 6and 8, which are presented in Section 3. Along the way, using the MSB-ﬁrstAlgorithm 5, we improve the cost of polynomial modular exponentiation ,which is a central algorithmic task in computer algebra, with many applica-tions. Since this result has an interest per se , we isolate it here as our lastcomplexity result.

Theorem 4.

Given N ∈ N and a polynomial Γ( x ) in R [ x ] of degree d , onecan compute x N mod Γ( x ) using M ( d ) ⌈ log( N + 1) ⌉ + M ( d ) arithmetic operations in R . This cost of ∼ M ( d ) log N compares favorably with the currently bestestimate of ∼ M ( d ) log N obtained by square-and-multiply in the quotientring R [ x ] / (Γ( x )), combined with fast modular multiplications performedeither classically [27, Corollary 9.7], or using Montgomery’s algorithm [55].In the FFT setting, the gain is even larger, and our results improve on thebest estimates, due to Mihăilescu [53].7 .5 Structure of the paper In Section 2 we propose our LSB-ﬁrst (least signiﬁcant bit ﬁrst) algorithm forcomputing the N -th term of a C-recursive sequence. We design in Section 3a second algorithm, which is an MSB-ﬁrst (most signiﬁcant bit ﬁrst) variant,and discuss several algorithmic applications, including polynomial modularexponentiation, powering of matrices and high-order lifting. In Section 4 wespecialize and analyze Algorithm 1 in the speciﬁc FFT setting, where poly-nomial multiplication is based on Discrete Fourier Transform techniques,and we compare it with the FFT-based Fiduccia’s algorithm. We concludein Section 5 by a summary of results and plans of future work. We will prove Theorem 1 in §2.1, where we propose the ﬁrst main algorithms(Algorithms 1 and 2), which are faster than Fiduccia’s algorithm. Then,in §2.2 we instantiate them in the particular case of the Fibonacci sequence.The resulting algorithm is competitive with state-of-the-art algorithms.

The algorithm underlying Theorem 1 is very natural. Let us sketch now itsmain idea.First, it is classical [64] that the generating functions of linearly recur-rent sequences are rational . As a consequence, computing terms of a linearlyrecurrent sequence is equivalent to computing coeﬃcients in the power se-ries expansion (at the origin) of a rational function. More precisely, let usattach to the recurrence (1) the polynomial Q ( x ) := 1 − c d − x − · · · − c x d ,that is the reversal of the characteristic polynomial Γ( x ) = x d − P d − i =0 c i x i of recurrence (1). Let us denote by F ( x ) the generating function of thesequence ( u n ) n ≥ , F ( x ) := X n ≥ u n x n . Then, there exists a polynomial P ( x ) in R [ x ] of degree less than d such that F ( x ) = P ( x ) /Q ( x ) in R [[ x ]]. This is immediately seen by checking that, forany n ≥

0, the coeﬃcient of x n + d in the power series P ( x ) := Q ( x ) · F ( x ) isequal to u n + d − c d − u n + d − −· · · − c u n , hence it is zero by (1), and thereforethe power series P ( x ) is in fact a polynomial of degree less than d . Moreover,the coeﬃcients of P ( x ) can be determined from the recurrence (1) and fromthe initial terms u , . . . , u d − by using M ( d ) operations in R .8e are thus reduced to the question of determining the N -th coeﬃ-cient u N of the rational power series F ( x ) = P ( x ) /Q ( x ). Our new algorithmis based on the following observation. The polynomial Q ( x ) Q ( − x ) is even,so it writes V ( x ) for some V ∈ R [ x ] of degree d . Then, denoting by U ( x )the polynomial P ( x ) Q ( − x ), of degree less than 2 d , and by U e and U o theeven and the odd parts of U , that is U ( x ) = U e ( x ) + x · U o ( x ), we have P ( x ) Q ( x ) = P ( x ) Q ( − x ) Q ( x ) Q ( − x ) = U e ( x ) V ( x ) + x · U o ( x ) V ( x ) , which implies that the N -th coeﬃcient in the series expansion of P/Q is[ x N ] P ( x ) Q ( x ) =  [ x N ] U e ( x ) V ( x ) , if N is even,[ x N − ] U o ( x ) V ( x ) , else.In other words, the problem of computing the N -th term of a rational func-tion P/Q of degree d is reduced to that of computing the term of index ⌊ N/ ⌋ of another rational function of degree d , that can be deduced from P/Q byusing two polynomial multiplications in degree d . The desired coeﬃcient u N is computed after repeating this process at most ⌈ log( N + 1) ⌉ times, andthe complexity estimate in Theorem 1 is easily deduced.Notice that, by the duality between linearly recurrent sequences andrational functions, the new algorithm admits a nice and simple interpretationdirectly at the level of recurrences. The sequence ( u n ) n ≥ is determined bythe recurrence (1) (encoded by the denominator Q ( x )) and by the initialconditions u , . . . , u d − (encoded by the numerator P ( x )). To compute the N -th coeﬃcient u N , the new method builds a diﬀerent recurrence (encodedby V ( x ) = Q ( √ x ) Q ( −√ x )) still with constant coeﬃcients and of the sameorder d , together with new initial conditions (encoded by P ( x ) Q ( − x )) .Computing u N is thus reduced to computing the term of index ⌊ N/ ⌋ of thenew sequence; and this reduction is applied at most ⌈ log( N + 1) ⌉ times.The proposed algorithm for computing [ x N ] P ( x ) /Q ( x ) is summarized inAlgorithm 1, and its immediate consequence for computing the N -th term ofthe linearly recurrent sequence ( u n ) n ≥ deﬁned by eq. (1) is displayed in Al-gorithm 2. Algorithm 1 has complexity 2 M ( d ) ⌈ log( N + 1) ⌉ and Algorithm 2has complexity 2 M ( d ) ⌈ log( N + 1) ⌉ + M ( d ), which proves in Theorem 1.Note that Algorithms 1 and 2 use an idea similar to the ones in [11,10] which were dedicated to the larger class of algebraic power series, but In fact, the new sequence consists of the even (or odd) terms of the original sequence. lgorithm 1 ( OneCoeﬀ ) Input : P ( x ), Q ( x ), N Output : [ x N ] P ( x ) Q ( x ) Assumptions: Q (0) invertible and deg( P ) < deg( Q ) =: d while N ≥ do U ( x ) ← P ( x ) Q ( − x ) ⊲ U = P d − i =0 U i x i if N is even then P ( x ) ← P d − i =0 U i x i else P ( x ) ← P d − i =0 U i +1 x i A ( x ) ← Q ( x ) Q ( − x ) ⊲ A = P di =0 A i x i Q ( x ) ← P di =0 A i x i N ← ⌊ N/ ⌋ return P (0) /Q (0) Algorithm 2 ( OneTerm ) Input : rec. (1), u , . . . , u d − , N Output : u N Assumptions: Γ( x ) = x d − P d − i =0 c i x i with c = 0 Q ( x ) ← x d Γ(1 /x ) P ( x ) ← ( u + · · · + u d − x d − ) · Q ( x ) mod x d return [ x N ] P ( x ) /Q ( x ) ⊲ using Algorithm 1restricted to positive characteristic only. Algorithm 1 also shares commonfeatures with the technique of section operators [2, Lemma 4.1] used byAllouche and Shallit to compute the N -th term of k -regular sequences [2,Corollary 4.5] in O (log N ) ring operations.Algorithm 1 can be interpreted at the level of recurrences as computing ∼ log N new recurrences produced by the Graeﬀe process, which is a classicaltechnique to compute the largest root of a real polynomial [40, 57, 58].Interestingly, the Graeﬀe process has been used in a purely algebraic contextby Schönhage in [67, §3] for computing the reciprocal of a power series, seealso [14, §2]. However, our paper seems to be the ﬁrst reference where theGraeﬀe process and the section operators approach are combined together. To illustrate the mechanism of Algorithm 1, let us instantiate it in theparticular case of the Fibonacci sequence deﬁned by F = 0 , F = 1 , F n +2 = F n +1 + F n , n ≥ . lgorithm 3 ( NewFibonacci ) Input : N Output : F N Assumptions: N ≥ c ← if N is even then [ a, b ] ← [0 , else [ a, b ] ← [1 , − N ← ⌊ N/ ⌋ while N > do if N is even then b ← a + b · c else a ← b + a · c c ← c − N ← ⌊ N/ ⌋ return b + a · c The generating function P n ≥ F n x n equals x/ (1 − x − x ). Therefore, thecoeﬃcient F N is equal to[ x N ] x − x − x = [ x N ] x (1 + x − x )1 − x + x =  [ x N ] x − x + x , if N is even,[ x N − ] − x − x + x , else.The computation of F N is reduced to that of a coeﬃcient of the form[ x N ] a + bx − cx + x = [ x N ] ( a + bx )(1 + cx + x )1 − ( c − x + x =  [ x N ] a +( bc + a ) x − ( c − x + x , if N is even,[ x N − ] ( ac + b )+ bx − ( c − x + x , else.This yields Algorithm 3 for the computation of F N . A close inspectionreveals that this algorithm computes F N by a recursive use of the formula F N = L ⌊ log N ⌋ · F N − ⌊ log N ⌋ + ( − N · F ⌊ log N ⌋ − N , which is a particular instance of the classical formula F n + m = L m F n + ( − n F m − n Notice that the same algorithm can be used to compute eﬃciently the N -th Fibonaccipolynomial, or the N -th Chebychev polynomial. Fibonacci polynomials in R [ t ] are deﬁnedby F n +2 ( t ) = t · F n +1 ( t ) + F n ( t ) with F ( t ) = 1 and F ( t ) = 1. It is suﬃcient to initialize c to t + 2 (instead of 3) and b to t when N is even (instead of 0). The complexity of thisalgorithm is O ( M ( N )) operations in R , which is quasi-optimal. lgorithm 4 Input : N Output : F N Assumptions: N ≥ N is a power of 2 [ b, c ] ← [1 , N ← ⌊ N/ ⌋ while N > do b ← b · c c ← c − N ← ⌊ N/ ⌋ return b · c relating the Fibonacci numbers and the Lucas numbers L n = F n +1 + F n − .When N is a power of 2, then Algorithm 3 degenerates into Algorithm 4.This is equivalent to Algorithm fib ( n ) in [18, Fig. 6] . It uses 2 log( N ) − N ) − N ) − N is arbitrary, Algorithm 3 has essentially the same cost: it usesat most 2 log( N ) − N ) − ⌊ log( N ) ⌋ − F is explicitly displayed in Table 1.A nice feature of Algorithm 3 is not only that it is simple and natural,but also that its arithmetic and bit complexity matches the complexity ofthe state-of-art algorithms for computing Fibonacci numbers [74]. N a b c

21 1 − × − − − × − − × − − − × − − × − F = 433494437 using the new algorithm. This algorithm had also appeared before, in Knuth’s book [45, p. 552, second solution]. The MSB-ﬁrst algorithm and applications

We present in §3.1 a “most signiﬁcant bit” (MSB) variant (Algoritm 5) ofAlgoritm 1. Then we discuss various applications of Algorithms 1 and 5.In §3.2 we design a faster algorithm for polynomial modular exponentiation,that we use in §3.3 to design a faster Fiduccia-like algorithm for computinga slice of d terms of indices N − d + 1 , . . . , N in ∼ M ( d ) log N operations. In Fiduccia’s algorithm (§1.3), the N -th coeﬃcient u N in the power seriesexpansion P i ≥ u i x i of P/Q is given by the inner product h x N mod Γ( x ) , v i ,where Γ is the reversal polynomial of Q and v is the vector of initial coeﬃ-cients h u · · · u d − i . Here, x N mod Γ( x ) depends only on the linear recur-rence equation (1), and is independent of the initial terms v . Hence, if wewant to compute the N -th terms of k diﬀerent linearly recurrent sequencesthat share the same linear recurrence equation (1), we can ﬁrst determine ρ ( x ) := x N mod Γ( x ), and then h ρ, v ( i )0 i for i = 1 , . . . , k , where v ( i )0 denotesthe vector of d initial terms of the i -th sequence. The total arithmetic com-plexity of this algorithm is O ( M ( d ) log N + kd ); this is faster than Fiduccia’salgorithm repeated independently k times, with cost O ( k M ( d ) log N ).On the other hand, in Algorithm 1, we iteratively update both the de-nominator and the numerator, and each new numerator depends on theoriginal numerator P ( x ) which encodes the initial d terms of the sequence.Hence, it is not a priori clear how to obtain with Algorithm 1 the goodfeature of Fiduccia’s algorithm mentioned above.In this section, we present an algorithm that computes u N with arith-metic complexity equal to that of Algorithm 1 and which, in addition,achieves the cost O ( M ( d ) log N + kd ) for the above problem with k sequences.While Algorithm 1 looks at N from the least signiﬁcant bit (LSB), themain algorithm presented in this section (Algoritm 5) looks at N from themost signiﬁcant bit (MSB). In fact, Algoritm 5 is in essentially equivalent to“the transposition” of Algorithm 1 in the sense of [8]. This is the reason whythis MSB-ﬁrst algorithm has exactly the same complexity as Algorithm 1.However, in order to keep the presentation self-contained, we are not going toappeal here to the general machinery of algorithmic transposition tools, butrather derive the transposed algorithm “by hand”, using a direct reasoning.In order to compute the coeﬃcient [ x N ] P ( x ) /Q ( x ), it is suﬃcient tocompute the ( N − d + 1)-th term to the N -th term of 1 /Q ( x ) since thedegree of P ( x ) is at most d −

1. Let F N,d ( P i ≥ a i x i ) := P d − i =0 a N − d +1+ i x i .13 lgorithm 5 ( SliceCoeﬀ ) Input : Q ( x ), N Output : F N,d (1 /Q ( x )) Assumptions: Q (0) invertible and deg( Q ) =: d function F ( N , Q ( x )) if N = 0 then return x d − /Q (0) A ( x ) ← Q ( x ) Q ( − x ) ⊲ A = P di =0 A i x i V ( x ) ← P di =0 A i x i W ( x ) ← F ( ⌊ N/ ⌋ , V ( x )) if N is even then S ( x ) ← xW ( x ) else S ( x ) ← W ( x ) B ( x ) ← Q ( − x ) S ( x ) ⊲ B = P d − i =0 B i x i return P d − i =0 B d + i x i Our goal is to compute F N,d (1 /Q ( x )). We have the sequence of equalities F N,d (cid:18) Q ( x ) (cid:19) = F N,d (cid:18) Q ( − x ) Q ( x ) Q ( − x ) (cid:19) = F N,d (cid:18) Q ( − x ) x N − d +1 F N, d (cid:18) Q ( x ) Q ( − x ) (cid:19)(cid:19) = F d − ,d (cid:18) Q ( − x ) F N, d (cid:18) V ( x ) (cid:19)(cid:19) , where V ( x ) := Q ( x ) Q ( − x ).In the second equality, we ignore the terms of 1 /V ( x ) except for the( N − d + 1)-th term to the N -th term. In the third equality, we use thefact that F N,d ( xA ( x )) = F N − ,d ( A ( x )). Let now W ( x ) := F ⌊ N/ ⌋ ,d (1 /V ( x )).Then, it is easy to see that F N,d (cid:18) Q ( x ) (cid:19) = F d − ,d ( Q ( − x ) S ( x )) , where S ( x ) := ( xW ( x ) , if N is even W ( x ) , else.The resulting method for computing F N,d (1 /Q ( x )) is summarized in Algo-rithm 5, and its immediate applications to the computation of [ x N ] P/Q , andto [ x N ] P ( i ) /Q for several i = 1 , . . . , k , are displayed in Algorithms 6 and 7.14 lgorithm 6 ( OneCoeﬀT ) Input : P ( x ), Q ( x ), N Output : [ x N ] P ( x ) Q ( x ) Assumptions: Q (0) invertible and deg( P ) < deg( Q ) =: d U ← F N,d (1 /Q ( x )) using Algorithm 5 ⊲ U = u N − d +1 + · · · + u N x d − return p u N + · · · + p d − u N − d +1 ⊲ P = P d − i =0 p i x i Algorithm 7 Input : P , . . . , P k , Q , N Output : [ x N ] P j Q , j = 1 , . . . , k Assumptions: Q (0) invertible and deg( P ) < deg( Q ) =: d U ← F N,d (1 /Q ( x )) using Algorithm 5 ⊲ U = u N − d +1 + · · · + u N x d − return p ( j )0 u N + · · · + p ( j ) d − u N − d +1 , j = 1 , . . . , k ⊲ P j = P d − i =0 p ( j ) i x i Let us analyze the complexity of Algorithms 5 and 6 more carefully. Ateach step, Algorithm 5 computes Q ( x ) Q ( − x ) and Q ( − x ) S ( x ), where thedegrees of Q ( x ) and S ( x ) are d and at most 2 d −

1, respectively. Hence adirect analysis concludes that its complexity is 3 M ( d ) log N operations in R .However, an improvement comes from the remark that not all coeﬃcients of Q ( − x ) S ( x ) are needed: it is suﬃcient to compute the d -th coeﬃcient to the(2 d − Q ( − x ) S ( x ). This operation is known as “the middleproduct”, and can be performed with exactly the same arithmetic complexityas the standard product of two polynomials of degrees d and d − M ( d ) log N .This complexity is also inherited by Algorithm 6, which uses at most 2 d additional operations in the last step.It should be obvious at this point that the slight variant Algorithm 7of Algorithm 6 achieves arithmetic complexity O ( M ( d ) log N + kd ) for theaforementioned problem with k sequences, and more precisely its cost is ofat most (2 M ( d ) + d ) log N + 2 kd operations in R .In conclusion, Algorithm 5 achieves the same arithmetic complexity asAlgorithm 1 and it extends to Algorithms 6 and 7. All algorithmic tech-niques speciﬁc to the FFT setting, that we will describe in Section 4, canalso be applied to Algorithms 5, 6 and 7, yielding the same complexity gains. The algorithms of §3.1 are not only well-suited to compute the N -th termsof several sequences satisfying the same recurrence relation. In this section,we show that they also permit a surprising application to the computation15 lgorithm 8 ( NewModExp ) Input : Γ( x ), N Output : x N mod Γ( x ) Assumptions: lc(Γ) invertible, Γ(0) = 0 and deg(Γ) =: d Q ( x ) ← x d Γ(1 /x ) u ( x ) ← F N,d (1 /Q ( x )) ⊲ using Algorithm 5 v ( x ) ← u ( x ) Q ( x ) mod x d return v (1 /x ) x d − of polynomial modular exponentiations . This fact has many consequences,since modular exponentiation is a central algorithmic task in algebraic com-putations. In §3.3, we will discuss a ﬁrst application in relation with themain topic of our article. Namely, we will design a new Fiduccia-style algo-rithm for the computation of the N -th term, and actually of a whole sliceof k ≥ d terms, in 2 M ( d ) log N + O (( k + d ) M ( d ) /d ) arithmetic operations.More consequences will be separately discussed in §3.4.Assume we are given a polynomial Γ( x ) ∈ R [ x ] of degree d , an integer N ,and that we want to compute ρ ( x ) := x N mod Γ( x ). Without loss of gen-erality, we may assume Γ(0) = 0. Let Q ( x ) ∈ R [ x ] be the reversal of Γ( x ),that is Q ( x ) := x d Γ(1 /x ). Let us denote the power series expansion of 1 /Q by P i ≥ a i x i . Then, equation (3) implies that h a N · · · a N + d − i = r × H , (5)where r = h r · · · r d − i with ρ = P d − i =0 r i x i and H is the Hankel matrix H :=  a · · · a d − a · · · a d ... a d − · · · a d −  . Note that the matrix H is invertible, as its determinant is equal (up to asign) to ([ x d ] Q ) d − = Γ(0) d − . Therefore, r (and thus ρ ) can be found by(1) computing h u N · · · u N + d − i using Algorithm 5;(2) solving the Hankel linear system (5).The arithmetic complexity of step (1) is 2 M ( d ) log( N ) + O ( d log N ), whilestep (2) has negligible cost O ( M ( d ) log d ) using [13], see also [6, Ch. 2, §5].It is actually possible to improve a bit more on this algorithm, by usingthe next lemma. 16 emma 1. Let N ∈ N and let Γ( x ) ∈ R [ x ] be of degree d with Γ(0) = 0 . Let Q ( x ) ∈ R [ x ] be its reversal, Q ( x ) := x d Γ(1 /x ) . Denote its reciprocal /Q by P i ≥ a i x i , and let u ( x ) be F N,d (1 /Q ( x )) = a N − d +1 + · · · + a N x d − . Deﬁne v ( x ) to be u ( x ) Q ( x ) mod x d . Then x N mod Γ( x ) = v (1 /x ) x d − .Proof. Write the Euclidean division x N = L ( x ) · Γ( x )+ ρ ( x ), where deg( L ) = N − d and ρ ( x ) = r + · · · + r d − x d − . Replacing x by 1 /x on both sides,and then multiplying by x N yields 1 = L rev ( x ) · Q ( x ) + x N − d +1 · ˜ ρ ( x ), where L rev ( x ) = x N − d · L (1 /x ) and ˜ ρ ( x ) = x d − · ρ (1 /x ). In other words1 Q ( x ) = L rev ( x ) + x N − d +1 · ˜ ρ ( x ) Q ( x ) . Since L rev ( x ) has degree at most N − d , it follows that u ( x ) = ˜ ρ ( x ) Q ( x ) mod x d .Therefore, ˜ ρ ( x ) is equal to v ( x ), and the conclusion follows.The merit of Lemma 1 is that it shows that computing x N mod Γ( x )can be reduced to computing F N,d (1 /Q ( x )), plus a few additional operationswith negligible cost M ( d ). The resulting method is presented in Algorithm 8,whose complexity is 2 M ( d ) log N + M ( d ). This proves Theorem 4.Notice that Algorithm 8 is simpler, and faster by a factor of 1.5, than theclassical algorithm based on binary powering in the quotient ring R [ x ] / (Γ( x )).Algorithm 8 admits a specialization into the FFT setting, with complexity ∼ M ( d ) log N , in the spirit of §4 below. Similarly to the case of Algo-rithm 11 in §4, the FFT variant of Algorithm 8 is faster by a factor of 2 . .

625 than the (much more complex) algorithm of Mihăilescu [53].This speed-up might be beneﬁcial for instance in applications to polyno-mial factoring in F p [ x ], where one time-consuming step to factor f ∈ F p [ x ]is the computation of x p mod f , see [27, Algorithms 14.3, 14.8, 14.13, 14.15,14.31, 14.33 and 14.36], and also [70, 48].It might also be so in point-counting methods such as Schoof’s algo-rithm and the Schoof-Elkies-Atkin (SEA) algorithm [7, Ch. VII], the secondone being the best known method for counting the number of points of el-liptic curves deﬁned over ﬁnite ﬁelds of large characteristic. Indeed, thebulks of these algorithms are computations of x q modulo the “division poly-nomial” f ℓ ( x ) and of x q modulo the “modular polynomial” Φ ℓ ( x ), where ℓ = O (log( q )) and deg( f ℓ ) = O ( ℓ ), deg(Φ ℓ ) = O ( ℓ ).As a ﬁnal remark, note that while Fiduccia’s algorithm shows that com-puting the terms of indices N, . . . , N + d − d can be reduced to polynomial modular exponentiation17 x N mod Γ( x )), Algorithm 8 shows that the converse is also true: poly-nomial modular exponentiation can be reduced to computing the terms ofindices N, . . . , N + d − d . There-fore, these two problems are computationally equivalent . To our knowledge,this important fact seems not to have been noticed before. We conclude this section by discussing a straightforward application of Al-gorithm 8. This is based on the next equality, generalizing (5) to any k ≥ h u N · · · u N + k − i = r × H k , (6)where as before r = h r · · · r d − i is the coeﬃcients vector of ρ = P d − i =0 r i x i ,with ρ = x N mod Γ( x ), and H k is the Hankel matrix H k :=  u · · · u d − · · · · · · u k − u · · · u d · · · · · · u k ... ... ... ... u d − · · · u d − · · · · · · · · · u k + d −  . The matrix H k is built upon the ﬁrst terms of the sequence ( u n ) n ≥ satisfying recurrence (1) with characteristic polynomial Γ = x d − P d − i =0 c i x i ,or equivalently, from the power series expansion of the rational function P/Q with Q ( x ) = x d Γ(1 /x ).Note that the entries of H k can be computed either from P and Q , orfrom the recurrence (1) together with the initial terms u , . . . , u d − , using O (( k + d ) M ( d ) /d ) arithmetic operations, by the algorithm in [69, Thm. 3.1],see also [8, §5]. To compute u N , . . . , u N + k − it thus only remains to performthe vector-matrix product (6).When k = 1, the product r × H costs 2 d operations and it yields theterm u N .When k ≥ d , the product r × H k can be reduced to the polynomial mul-tiplication of r d − + · · · + r x d − by P k + d − i =0 u i x i , and this can be performedusing ⌈ k + dd ⌉ M ( d ) arithmetic operations. As a consequence, the whole sliceof coeﬃcients u N + i = [ x N + i ] P/Q for i = 0 , . . . , k −

1, can be computed usingAlgorithm 8 and eq. (6) for a total cost of arithmetic operations equal to2 M ( d ) log N + O (cid:18) k + dd M ( d ) (cid:19) . lgorithm 9 Input : rec. (1), u , . . . , u d − , N Output : u N , . . . , u N + d − Assumptions: Γ( x ) = x d − P d − i =0 c i x i with c = 0 ρ ( x ) ← x N mod Γ( x ) ⊲ using Algorithm 8 U ( x ) ← u + · · · + u d − x d − ⊲ using Algorithm in [69, p. 18] V ( x ) ← U ( x ) · ( x d · ρ (1 /x )) ⊲ V = P d − i =0 v i x i return [ v d , . . . , v d − ]When k = d , this proves Theorem 3. The corresponding algorithm is givenas Algorithm 9.We emphasize that this variant of Fiduccia’s algorithm is diﬀerent fromAlgorithm 1. It is actually a bit slower than Algorithm 1 when k = 1.However, when k > k times Algorithm 1. It also compares favorably with Fiduccia’soriginal algorithm, whose adaption to k terms has arithmetic complexity3 M ( d ) log N + O (cid:18) d log( N ) + k + dd M ( d ) (cid:19) . In this section, we discuss three more applications of the MSB-ﬁrst algo-rithms (Algorithm 5 and 8) presented in §3.1 and §3.2. We deal with thecase of multiplicities (§3.4.1), and explain a new way to speed up computa-tions in that case. Then, we address other applications, to faster poweringof matrices (§3.4.2) and to faster high-order lifting (§3.4.3).To simplify matters, we assume in this section that R = K is a ﬁeld. Hyun and his co-authors [42, 41] addressed the following question: is itpossible to compute faster the N -th term of a linearly recurrent sequencewhen the characteristic polynomial of the recurrence has multiple roots? Bythe Chinese Remainder Theorem, it is suﬃcient to focus on the case wherethe characteristic polynomial is a pure power of a squarefree polynomial. Inother words, the main step of [42, Algorithm 1] is to compute x N mod Q ,where Q = ( Q ⋆ ) m and Q ⋆ is the squarefree part of Q . Under suitableinvertibility conditions, the problem is solved in [42, 41] in O ( M ( d ⋆ ) log N + M ( d ) log d ) operations in K , where d ⋆ = deg( Q ⋆ ) and d = deg( Q ) = m · d ⋆ .This cost is obtained using an algorithm based on bivariate computations,using the isomorphisms between K [ x ] / ( Q ) and K [ y, x ] / ( Q ⋆ ( y ) , ( x − y ) m )19ade eﬀective by the so-called tangling / untangling operations. We nowpropose an alternatively fast, but simpler, algorithm with the same cost.Let us explain this on an example, for “multiple-Fibonacci numbers”,that is when Q has the form ( Q ⋆ ) m , with Q ⋆ = 1 − x − x and d ⋆ = 2 , d = 2 d ⋆ .Assume we want to compute the N -th coeﬃcient u N in the power seriesexpansion of ( x/ (1 − x − x )) m . The cost of Fiduccia’s algorithm, and alsoof our new algorithms, is O ( M ( m ) · log N ). Let us explain how we can lowerthis to O (log N + M ( m ) log m ). The starting point is the observation that,by the structure theorem of linearly recurrent sequences [15, §A.(I)] (seealso [62, §2]), u N is of the form u m ( N ) φ N + v m ( N ) ψ N , where φ and ψ are the two roots of 1 + x = x and u m , v m are polynomials in K [ x ] ofdegree less than m . By an easy liner algebra argument, u N is thus equal to U m ( N ) F N + V m ( N ) F N +1 , where U m ( x ) and V m ( x ) are polynomials in K [ x ]of degree less than m . These polynomials can be computed by (structured)linear algebra from the ﬁrst 2 m values of the sequence ( u n ), in complexity O ( M ( m ) log m ). For instance, when d = 2, we have U ( x ) = − ( x + 1) / V ( x ) = 2 x/

5. Once U m and V m are determined, it remains to compute F N and F N +1 using Algorithm 9 in O (log N ) operations in K , then to returnthe value U m ( N ) · F N + V m ( N ) · F N +1 .The arguments extend to the general case and yields an algorithm ofarithmetic complexity 2 M ( d ⋆ ) log N + O ( M ( d ) log d ). Assume we are given a matrix M ∈ M d ( K ), an integer N , and that we wantto compute the N -th power M N of M .The arithmetic complexity of binary powering in M d ( K ) is O ( d θ log N )operations in K , where as before θ ∈ [2 ,

3] is any feasible exponent formatrix multiplication in M d ( K ). A better algorithm consists in ﬁrst com-puting the characteristic polynomial Γ( x ) of the matrix M , then the re-mainder ρ ( x ) := x N mod Γ( x ), and ﬁnally evaluating the polynomial ρ ( x )at M . By the Cayley-Hamilton theorem, ρ ( M ) = M N . The most costlystep is the computation of R , which can be done as explained in §3.2 using ∼ M ( d ) log( N ) operations in K . The cost of the other two steps is inde-pendent of N , and it is respectively O ( d θ log d ) [43] and O ( d θ + ), this lastcost being achieved using the Paterson-Stockmeyer baby-step / giant-step algorithm [59]. The total cost of this algorithm is 2 M ( d ) log( N ) + O ( d θ + ).Note that a faster variant (w.r.t. d ), of cost 2 M ( d ) log( N ) + O ( d θ log d ),can be obtained using [28, Corollary 7.4]. The corresponding algorithm isbased on the computation of the Frobenius (block-companion) form of the20atrix M , followed by the powering of companion matrices, which againreduces to modular exponentiation. The fastest known algorithms for polynomial linear algebra rely fundamen-tally on an algorithmic technique introduced by Storjohann [71], called high-order lifting .Given an invertible polynomial matrix A of degree d , the problem is tocompute the high order components ( C , C ) , ( C , C ) , ( C , C ) , ( C , C ) , . . . in the power series expansion of its inverse A − = X i ≥ C i ( x ) · ( x d ) i , with C i ∈ M n ( K [ x ]

1, this algorithmuses ∼ MM ( n, d ) log( N ) operations in K if polynomial products are used,or ∼ MM ( n, d ) log( N ) operations in K if middle product techniques areused for the outmost products. Using a matrix adaptation of Algorithm 5,we can lower this to ∼ MM ( n, d ) log( N ) operations in K .Note that Nuel and Dumas compared in [56] Fiduccia’s and Storjohann’salgorithms (in the scalar case), but only in the under the speciﬁc assumptionthat naive polynomial multiplication is used, that is M ( d ) = O ( d ).21 Analysis under the FFT multiplication model

In this section, we specialize, optimize and analyze the generic Algorithm 1to the FFT setting, in which polynomial multiplications are assumed to beperformed using the discrete Fourier transform (DFT), and its inverse.In order to do this, we will assume that the base ring R possesses rootsof unity of suﬃciently high order. To simplify the exposition, the ring R will be supposed to be a ﬁeld, but the arguments also apply without thisassumption, modulo some technical complications, see [27, §8.2]. Let K be a ﬁeld with a primitive n -th root ω n of unity. Let A ∈ K [ x ] be apolynomial of degree at most d ≤ n −

1. The DFT b A of A is deﬁned by b A y := A ( ω − yn ) = n − X i =0 A i ω − yin for y = 0 , , . . . , n − . Here, A y = 0 for y > d . It is classical that the DFT map is an invertible K -linear transform from K n to itself, and that the polynomial A can beretrieved from its DFT b A using the formulas A i = 1 n n − X y =0 b A y ω yin for i = 0 , , . . . , n − . For computing the polynomial multiplication C ( x ) = A ( x ) B ( x ) for given A ( x ) , B ( x ) ∈ K [ x ] of degree at most d , it is suﬃcient to compute the DFTof C ( x ) for n ≥ d + 1. Since b C y = C ( ω − yn ) = A ( ω − yn ) B ( ω − yn ) = b A y b B y , thepolynomial C ( x ) can be computed using two DFTs and one inverse DFT.Let E ( n ) be an arithmetic complexity for computing a DFT of length n .Then the cost of polynomial multiplication in K [ x ] is governed by M ( d ) = 3 E (2 d ) + O ( d ) . In this subsection, we brieﬂy recall the Fast Fourier Transform (FFT), whichgives the quasi-linear estimate E ( n ) = O ( n log n ).22ssume n is even. Then, for y = 0 , , . . . , n/ − b A y = n/ − X i =0 A i ω − y (2 i ) n + n/ − X i =0 A i +1 ω − y (2 i +1) n = n/ − X i =0 A i ω − yin/ + ω − yn · n/ − X i =0 A i +1 ω − yin/ = c A e y + ω − yn c A o y where A e ( x ) := P n/ i =0 A i x i and A o ( x ) := P n/ i =0 A i +1 x i .Similarly, we have b A n/ y = b A ey − ω − yn b A oy . We therefore obtain thefollowing matrix equation " b A y b A n/ y = " − ω − yN A e y b A o y for y = 0 , , . . . , n/ − . (7)Thus, computing a DFT in size n reduces to two DFTs in size n/

2. Moreprecisely, E ( n ) ≤ E ( n/

2) + (3 / n . If n is a power of two, n = 2 k , andif the ﬁeld K contains a primitive 2 k -th root of unity (as is the case forinstance when K = C , K = F p for a prime number p satisfying 2 k | p − k = log n times, and it yields the estimate E ( n ) = n log n . The corresponding algorithm is called the decimation-in-time Cooley–Tukey fast Fourier transform [17], see also [5, §2].By the arguments of §4.1, we conclude that polynomial multiplicationin K [ x ] can be performed in arithmetic complexity M ( d ) = 9 d log d + O ( d ) . In the FFT setting, it is useful for many applications to compute eﬃcientlya DFT of length 2 n starting from DFT of length n .Assume n ≥ d + 1 and we have at our disposal the DFT b A of A , oflength n . Assume that we want to compute the DFT b A (2 n ) of length 2 n .The simplest algorithm is to apply the inverse DFT of length n to ob-tain A , and then to apply the DFT of length 2 n to A . This costs E ( n )+ E (2 n )arithmetic operations, that is n log n + 3 n operations in K .23 lgorithm 10 Doubling the length of a DFT function UP ( b A ) A ← IDFT n ( b A ) B i ← ω − i n A i for i = 0 , , . . . , n − b B ← DFT n ( B ) b A (2 n )2 y ← b A y for y = 0 , , . . . , n − b A (2 n )2 y +1 ← b B y for y = 0 , , . . . , n − return b A (2 n ) This algorithm can be improved using the following formulas b A (2 n )2 y = n − X i =0 A i ω − yi n = n − X i =0 A i ω − yin = b A y , b A (2 n )2 y +1 = n − X i =0 A i ω − (2 y +1) i n = n − X i =0 ω − i n A i ω − yin = b B y , where B i := ω − i n A i for i = 0 , , . . . , n −

1. We obtain Algorithm 10 witharithmetic complexity 2 E ( n )+ n , i.e. n log n + n , [5, §12.8], see also [4, 52] .Compared with the direct algorithm, the gain is roughly a factor of 3 / Recall that our main objective is, given

P, Q in K [ x ] with d = deg( Q ) > deg( P ), to compute the N -th coeﬃcient u N in the series expansion of P/Q .Let k be the minimum integer satisfying 2 k ≥ d + 1. Assume that thereexists a primitive 2 k -th root of unity in K . In this case, we can employan FFT-based polynomial multiplication in K [ x ]. In each iteration of Al-gorithm 1, it is suﬃcient to compute P ( x ) Q ( − x ) and Q ( x ) Q ( − x ). Here,only two FFTs and two inverse FFT of length 2 k are needed since b Q − y = b Q ¯ y for Q − ( x ) := Q ( − x ) where ¯ y := y + 2 k − if y < k − and ¯ y := y − k − if y ≥ k − . Hence, the arithmetic complexity S ( d ) for a single step inAlgorithm 1 satisﬁes S ( d ) ≤ E (2 k ) + O (2 k ).In the following we will show the improved estimate S ( d ) ≤ E (2 k − ) + O (2 k ) . This “FFT doubling” trick is sometimes attributed to R. Kramer (2004), but we havenot been able to locate Kramer’s paper. while loop in Algorithm 1, the DFTs b P and b Q of P ( x ) and Q ( x ) of length 2 k are computed, respectively. Inside the while loop, b P and b Q are updated. The recursive formula (7) for the decimation-in-time Cooley–Tukey FFT is equivalent to " b A e y b A o y = 12 " ω yN − A y b A k − + y for y = 0 , , . . . , k − . By using this formula, b A e (or b A o ) can be computed with O (2 k ) operationsfrom b A . By using Algorithm 10, we obtain the updated b P from b A e or b A o .The algorithm is summarized in Algorithm 11. In each step, UP is calledtwice. Hence, the total arithmetic complexity of Algorithm 11 is(4 E (2 k − ) + O (2 k )) · log N. When d is of the form 2 ℓ − , then one can take k = ℓ + 1 and the costsimpliﬁes to T ( N, d ) = 4 E ( d ) log N + O ( d log N ) , or, equivalently T ( N, d ) = 6 d log d log N + O ( d log N ) . The (striking) conclusion of this analysis is that, in the FFT setting, our(variant of the) algorithm for computing the N -th term of P/Q uses muchless operations than in the general case, namely T ( N, d ) ∼ M ( d ) log N, (8)while for a generic multiplication algorithm the cost is ∼ M ( d ) log N . Thisproves Theorem 2.Note that the complexity bound (8) compares favorably with Fidducia’salgorithm combined with the best algorithms for modular squaring. Forinstance, Shoup’s algorithm [70, §7.3] computes one modular squaring inthe FFT setting using ∼ M ( d ) arithmetic operations, while Mihăilescu’salgorithm [53, Table 1] (based on Montgomery’s algorithm [55]) uses roughly ∼ M ( d ) arithmetic operations. Our bound (8) is better by a factor of 2 . . In the general case, it might be useful to use the Truncated Fourier Transform (TFT),which smoothes the “jumps” in complexity exhibited by FFT algorithms [38, 37, 3]. lgorithm 11 ( OneCoeﬀ-FFT ) Input : P ( x ), Q ( x ), N Output : [ x N ] P ( x ) Q ( x ) b P ← DFT k ( P ) b Q ← DFT k ( Q ) while N ≥ do b U y ← b P y b Q ¯ y for y = 0 , , . . . , k − if N is even then b U e y ← ( b U y + b U y +2 k − ) / y = 0 , , . . . , k − − b P ← UP ( b U e y ) else b U o y ← ω yN ( b U y − b U y +2 k − ) / y = 0 , , . . . , k − − b P ← UP ( b U o y ) b A y ← b Q y b Q ¯ y for y = 0 , , . . . , k − − b Q ← UP ( b A ) N ← ⌊ N/ ⌋ P (0) ← P k − y =0 b P y Q (0) ← P k − y =0 b Q y return P (0) /Q (0) We have proposed several algorithmic contributions to the classical ﬁeld oflinearly recurrent sequences.Firstly, we have designed a simple and fast algorithm for computing the N -th term of a linearly recurrent sequence of order d , using ∼ M ( d ) log N arithmetic operations, which is faster by a factor of 1 . ∼ M ( d ) log N which is faster than the fastest variant of Fiduccia’s algorithm in the FFTsetting by a factor of 1.625. The new algorithms are based on a new method(Algorithm 1) for computing the N -th coeﬃcient of a rational power series.Secondly, using algorithmic transposition techniques, we have derivedfrom Algorithm 1 a new method (Algorithm 5) for computing simultane-ously the coeﬃcients of indices N − d + 1 , . . . , N in the power series expan-sion of the reciprocal of a degree- d polynomial, using again ∼ M ( d ) log N arithmetic operations. Using Algorithm 5, we have designed a new algo-rithm for computing the remainder of x N modulo a given polynomial ofdegree d , using ∼ M ( d ) log N arithmetic operations as well. This is better26y a factor of 1.5 than the previous best algorithm for modular exponenti-ation, with an even better speed-up in the FFT setting, as for Algorithm 1.Combined with the basic idea of Fiduccia’s algorithm, our new algorithm formodular exponentiation yields a faster Fiduccia-like algorithm (by the afore-mentioned constant factors) that computes a slice of d consecutive terms (ofindices N − d + 1 , . . . , N ) of a linearly recurrent sequence of order d using ∼ M ( d ) log N arithmetic operations.Thirdly, we have discussed applications of the new algorithms to a fewother algorithmic problems, including powering of matrices, high-order lift-ing (a basic brick for modern polynomial linear algebra algorithms) and thecomputation of terms of linearly recurrent sequences when the recurrencehas roots with (high) multiplicities.As future work, we plan to investigate further the full power of our tech-nique. To which extent can it be generalized to larger classes of power series?For instance, although it perfectly works for bivariate rational power series U ( x, y ), the corresponding method does not directly provide a O (log N )-algorithm for computing the ( N, N )-th coeﬃcient u N,N , the reason beingthat the log N new bivariate recurrences produced by the Graeﬀe processdo not have constant orders, as in the univariate case. This is disappointing,but after all not surprising, because the generating function of the sequence( u n,n ) n is known to be algebraic, but not rational anymore [61]. As of to-day, no algorithm is known for computing the N -th coeﬃcient of an algebraicpower series faster than in the P-recursive case (P), namely in a number ofring operations almost linear in √ N . Acknowledgements.

Our special thanks go to Kevin Atienza, whose edi-torial on https://discuss.codechef.com was our initial source of inspiration,and to Sergey Yurkevich, for his careful reading of a ﬁrst draft of this work.A. Bostan was supported in part by DeRerumNatura ANR-19-CE40-0018.R. Mori was supported in part by JST PRESTO Grant

References [1] Manindra Agrawal, Neeraj Kayal, and Nitin Saxena. PRIMES is in P.

Ann. of Math. (2) , 160(2):781–793, 2004.[2] Jean-Paul Allouche and Jeﬀrey Shallit. The ring of k -regular sequences. Theoret. Comput. Sci. , 98(2):163–197, 1992.273] Andrew Arnold. A new truncated Fourier transform algorithm. In

Proceedings of ISSAC’13 , pages 15–22. ACM Press, 2013.[4] D. J. Bernstein. Removing redundancy in high-precision Newton iter-ation, 2004. Preprint, http://cr.yp.to/fastnewton.html .[5] Daniel J. Bernstein. Fast multiplication and its applications. In

Algo-rithmic number theory: lattices, number ﬁelds, curves and cryptography ,volume 44 of

Math. Sci. Res. Inst. Publ. , pages 325–384. CambridgeUniv. Press, 2008.[6] Dario Bini and Victor Y. Pan.

Polynomial and matrix computations.Vol. 1 . Progress in Theoretical Computer Science. Birkhäuser Boston,Inc., Boston, MA, 1994. Fundamental algorithms.[7] I. Blake, G. Seroussi, and N. Smart.

Elliptic curves in cryptography ,volume 265 of

London Math. Soc. Lecture Note Ser.

Cambridge Uni-versity Press, 1999.[8] A. Bostan, G. Lecerf, and É. Schost. Tellegen’s principle into practice.In

Proceedings of ISSAC’03 , pages 37–44. ACM Press, 2003.[9] Alin Bostan. Computing the N -th Term of a q -Holonomic Sequence.In Proceedings of ISSAC’20 , pages 46–53. ACM Press, 2020.[10] Alin Bostan, Xavier Caruso, Gilles Christol, and Philippe Dumas. Fastcoeﬃcient computation for algebraic power series in positive charac-teristic. In

Proceedings of the Thirteenth Algorithmic Number TheorySymposium , volume 2 of

Open Book Ser. , pages 119–135. Math. Sci.Publ., Berkeley, CA, 2019.[11] Alin Bostan, Gilles Christol, and Philippe Dumas. Fast computationof the N th term of an algebraic series over a ﬁnite prime ﬁeld. In Proceedings of ISSAC’16 , pages 119–126. ACM Press, 2016.[12] Alin Bostan, Pierrick Gaudry, and Éric Schost. Linear recurrenceswith polynomial coeﬃcients and application to integer factorization andCartier-Manin operator.

SIAM J. Comput. , 36(6):1777–1806, 2007.[13] Richard P. Brent, Fred G. Gustavson, and David Y. Y. Yun. Fastsolution of Toeplitz systems of equations and computation of Padéapproximants.

J. Algorithms , 1(3):259–295, 1980.2814] Neil Calkin, Jimena Davis, Kevin James, Elizabeth Perez, and CharlesSwannack. Computing the integer partition function.

Math. Comp. ,76(259):1619–1638, 2007.[15] L. Cerlienco, M. Mignotte, and F. Piras. Suites récurrentes linéaires.Propriétés algébriques et arithmétiques.

L’Enseignement Mathéma-tique , 33:67–108, 1987.[16] D. V. Chudnovsky and G. V. Chudnovsky. Approximations and com-plex multiplication according to Ramanujan. In

Ramanujan revis-ited (Urbana-Champaign, Ill., 1987) , pages 375–472. Academic Press,Boston, MA, 1988.[17] James W. Cooley and John W. Tukey. An algorithm for the machinecalculation of complex Fourier series.

Math. Comp. , 19:297–301, 1965.[18] Paul Cull and James L. Holloway. Computing Fibonacci numbersquickly.

Inform. Process. Lett. , 32(3):143–149, 1989.[19] G. de Rocquigny. Question 1541. [I25a].

L’Intermédiaire des mathé-maticiens , 6:148, 1899.[20] Leonard Eugene Dickson.

History of the theory of numbers. Vol. I:Divisibility and primality . Publication No. 256. Washington: CarnegieInstitution of Washington, Vol. 1, 1919.[21] E.W. Dijkstra. In honour of Fibonacci. In

Program Construction.Lecture Notes in Computer Science 69, ed. F.L. Bauer et al. , pages49–50. Springer, Berlin, Heidelberg, 1979.[22] M. C. Er. Computing sums of order- k Fibonacci numbers in log time.

Inform. Process. Lett. , 17(1):1–5, 1983.[23] M. C. Er. A formal derivation of an O (log n ) algorithm for computingFibonacci numbers. J. Inform. Optim. Sci. , 7(1):9–15, 1986.[24] M. C. Er. An O ( k log( n/k )) algorithm for computing generalizedorder- k Fibonacci numbers with linear space.

J. Inform. Optim. Sci. ,9(3):343–353, 1988.[25] Charles M. Fiduccia. The n -th power of a companion matrix: Fastsolutions to linear recurrences. 20th Proceedings of the Annual AllertonConference on Communication, Control and Computing, pages 934–940, 1982. 2926] Charles M. Fiduccia. An eﬃcient formula for linear recurrences. SIAMJ. Comput. , 14(1):106–112, 1985.[27] J. von zur Gathen and J. Gerhard.

Modern computer algebra . Cam-bridge Univ. Press, third edition, 2013.[28] Mark Giesbrecht. Nearly optimal algorithms for canonical matrix forms.

SIAM J. Comput. , 24(5):948–969, 1995.[29] Kenneth J. Giuliani and Guang Gong. New LFSR-Based Cryptosys-tems and the Trace Discrete Log Problem (Trace-DLP). In Tor Helle-seth, Dilip Sarwate, Hong-Yeop Song, and Kyeongcheol Yang, editors,

Sequences and Their Applications - SETA 2004 , pages 298–312, Berlin,Heidelberg, 2005. Springer Berlin Heidelberg.[30] Kenneth J. Giuliani and Guang Gong. A new algorithm to computeremote terms in special types of characteristic sequences. In

Sequencesand their applications—SETA 2006 , volume 4086 of

Lecture Notes inComput. Sci. , pages 237–247. Springer, Berlin, 2006.[31] Guang Gong and Lein Harn. Public-key cryptosystems based on cubicﬁnite ﬁeld extensions.

IEEE Trans. Inform. Theory , 45(7):2601–2605,1999.[32] Guang Gong, Lein Harn, and Huapeng Wu. The GH public-key cryp-tosystem. In

Selected areas in cryptography , volume 2259 of

LectureNotes in Comput. Sci. , pages 284–300. Springer, Berlin, 2001.[33] Santos González, Llorenç Huguet, Consuelo Martínez, and Hugo Vil-lafañe. Discrete logarithm like problems and linear recurring sequences.

Adv. Math. Commun. , 7(2):187–195, 2013.[34] David Gries and Gary Levin. Computing Fibonacci numbers (and simi-larly deﬁned functions) in log time.

Inform. Process. Lett. , 11(2):68–69,1980.[35] Guillaume Hanrot, Michel Quercia, and Paul Zimmermann. The middleproduct algorithm. I.

Appl. Algebra Engrg. Comm. Comput. , 14(6):415–438, 2004.[36] David Harvey. Counting points on hyperelliptic curves in average poly-nomial time.

Ann. of Math. (2) , 179(2):783–803, 2014.3037] David Harvey and Daniel S. Roche. An in-place truncated Fouriertransform and applications to polynomial multiplication. In

Proceedingsof ISSAC’10 , pages 325–329. ACM Press, 2010.[38] Joris van der Hoeven. The truncated Fourier transform and applica-tions. In

Proceedings of ISSAC’04 , pages 290–296. ACM Press, 2004.[39] J. L. Holloway. Algorithms for computing Fibonacci numbers quickly.MSc Thesis. Oregon State Univ. 1988.[40] Alston S. Householder. Dandelin, Lobačevski ˘ı, or Graeﬀe?

Amer.Math. Monthly , 66:464–466, 1959.[41] Seung Gyu Hyun, Stephen Melczer, Éric Schost, and Catherine St-Pierre. Change of basis for m -primary ideals in one and two variables.In Proceedings of ISSAC’19 , pages 227–234. ACM Press, 2019.[42] Seung Gyu Hyun, Stephen Melczer, and Catherine St-Pierre. A fast al-gorithm for solving linearly recurrent sequences.

ACM Communicationsin Computer Algebra , page 100–103, 2019.[43] Walter Keller-Gehrig. Fast algorithms for the characteristic polynomial.

Theoret. Comput. Sci. , 36(2-3):309–317, 1985.[44] Dmitry I. Khomovsky. Eﬃcient computation of terms of linear recur-rence sequences of any order.

Integers , 18:Paper No. A39, 12, 2018.[45] Donald E. Knuth.

The art of computer programming. Vol. 2: Seminu-merical algorithms . Addison-Wesley Publishing Co., Reading, Mass.-London-Don Mills, Ont, ﬁrst edition, 1969.[46] Donald E. Knuth.

The art of computer programming. Vol. 2 . Addison-Wesley Publishing Co., Reading, Mass., second edition, 1981. Seminu-merical algorithms, Addison-Wesley Series in Computer Science andInformation Processing.[47] Donald E. Knuth. The Last Whole Errata Catalog, 1981. Department ofComputer Science, Stanford University, Report. No. STAN-CS-81-868, http://infolab.stanford.edu/pub/cstr/reports/cs/tr/81/868/CS-TR-81-868.pdf .[48] Grégoire Lecerf. New recombination algorithms for bivariate polyno-mial factorization based on Hensel lifting.

Appl. Algebra Engrg. Comm.Comput. , 21(2):151–176, 2010. 3149] Katharina Lürwer-Brüggemeier and Martin Ziegler. On faster integercalculations using non-arithmetic primitives. In

Unconventional com-putation , volume 5204 of

Lecture Notes in Comput. Sci. , pages 111–128.Springer, Berlin, 2008.[50] Alain J. Martin and Martin Rem. A presentation of the Fibonaccialgorithm.

Inform. Process. Lett. , 19(2):67–68, 1984.[51] J. M. McNamee and V. Y. Pan.

Numerical methods for roots of poly-nomials. Part II , volume 16 of

Studies in Computational Mathematics .Elsevier/Academic Press, Amsterdam, 2013.[52] Marc Mezzarobba. NumGfun: a package for numerical and analyticcomputation and D-ﬁnite functions. In

Proceedings of ISSAC’10 , pages139–146. ACM Press, 2010.[53] Preda Mihăilescu. Fast convolutions meet Montgomery.

Math. Comp. ,77(262):1199–1221, 2008.[54] J. C. P. Miller and D. J. Spencer Brown. An algorithm for evaluationof remote terms in a linear recurrence sequence.

Computer Journal ,9:188–190, 1966.[55] Peter L. Montgomery. Modular multiplication without trial division.

Math. Comp. , 44(170):519–521, 1985.[56] Gregory Nuel and Jean-Guillaume Dumas. Sparse approaches for theexact distribution of patterns in long state sequences generated by aMarkov source.

Theoret. Comput. Sci. , 479:22–42, 2013.[57] V. Pan. Algebraic complexity of computing polynomial zeros.

Comput.Math. Appl. , 14(4):285–304, 1987.[58] Victor Y. Pan. Solving a polynomial equation: some history and recentprogress.

SIAM Rev. , 39(2):187–220, 1997.[59] Michael S. Paterson and Larry J. Stockmeyer. On the number of non-scalar multiplications necessary to evaluate polynomials.

SIAM J. Com-put. , 2:60–66, 1973.[60] Alberto Pettorossi. Derivation of an O ( k log n ) algorithm for comput-ing order- k Fibonacci numbers from the O ( k log n ) matrix multiplica-tion method. Inform. Process. Lett. , 11(4-5):172–179, 1980.3261] G. Pólya. Sur les séries entières, dont la somme est une fonction al-gébrique.

Enseignement Math. , 22:38–47, 1921/1922.[62] A. J. van der Poorten. Some facts that should be better known, es-pecially about rational functions. In

Number theory and applications(Banﬀ, AB, 1988) , volume 265 of

NATO Adv. Sci. Inst. Ser. C Math.Phys. Sci. , pages 497–528. Kluwer Acad. Publ., Dordrecht, 1989.[63] Marco Protasi and Maurizio Talamo. On the number of arithmeti-cal operations for ﬁnding Fibonacci numbers.

Theoret. Comput. Sci. ,64(1):119–124, 1989.[64] Arthur Ranum. The general term of a recurring series.

Bull. Amer.Math. Soc. , 17(9):457–461, 1911.[65] B. Ravikumar and G. Eisman. Weak minimization of DFA—an algo-rithm and applications.

Theoret. Comput. Sci. , 328(1-2):113–133, 2004.[66] Rosace, E.-B. Escott, E. Malo, C.-A. Laisant, and G. Picou. Answersto Question 1541. [I25a] asked by G. de Rocquigny.

L’Intermédiaire desmathématiciens , 7:172–177, 1900.[67] Arnold Schönhage. Variations on computing reciprocals of power series.

Inform. Process. Lett. , 74(1-2):41–46, 2000.[68] Joseph Shortt. An iterative program to calculate Fibonacci numbersin O (log n ) arithmetic operations. Inform. Process. Lett. , 7(6):299–303,1978.[69] Victor Shoup. A fast deterministic algorithm for factoring polynomialsover ﬁnite ﬁelds of small characteristic. In

Proceedings of ISSAC’91 ,pages 14–21. ACM Press, 1991.[70] Victor Shoup. A new polynomial factorization algorithm and its imple-mentation.

J. Symbolic Comput. , 20(4):363–397, 1995.[71] A. Storjohann. High-order lifting. In

Proceedings of ISSAC’02 , pages246–254. ACM Press, 2002.[72] V. Strassen. Einige Resultate über Berechnungskomplexität.

Jahres-bericht der Deutschen Mathematiker-Vereinigung , 78(1):1–8, 1976/77.[73] Volker Strassen. Polynomials with rational coeﬃcients which are hardto compute.

SIAM J. Comput. , 3:128–149, 1974.3374] Daisuke Takahashi. A fast algorithm for computing large Fibonaccinumbers.

Inform. Process. Lett. , 75(6):243–246, 2000.[75] Friedrich J. Urbanek. An O (log n ) algorithm for computing the n thelement of the solution of a diﬀerence equation. Inform. Process. Lett. ,11(2):66–67, 1980.[76] Thomas C. Wilson and Joseph Shortt. An O (log n ) algorithm forcomputing general order- k Fibonacci numbers.