Efficient sparse polynomial factoring using the Funnel heap
Fatima K. Abu Salem a,∗, Khalil El-Harake b, Karl Gemayel c

a Computer Science Department, American University of Beirut, Beirut, Lebanon
b Computer Science Department, Boston University, Boston, U.S.A.
c School of Computational Science and Engineering, Georgia Institute of Technology, Georgia, U.S.A.
Abstract
This work is a comprehensive extension of Abu-Salem et al. (2015) that investigates the prowess of the Funnel Heap for implementing sums of products in the polytope method for factoring polynomials, when the polynomials are in sparse distributed representation. We exploit that the work and cache complexity of an Insert operation using Funnel Heap can be refined to depend on the rank of the inserted monomial product, where rank corresponds to its lifetime in Funnel Heap. By optimising on the pattern by which insertions and extractions occur during the Hensel lifting phase of the polytope method, we are able to obtain an adaptive Funnel Heap that minimises all of the work, cache, and space complexity of this phase. This, in turn, maximises the chances of having all polynomial arithmetic performed in the innermost levels of the memory hierarchy, and observes nearly optimal spatial locality. We provide proofs of results introduced in Abu-Salem et al. (2015) pertaining to properties of Funnel Heap, several of which are of independent worth extending beyond Hensel lifting. Additionally, we conduct a detailed empirical study confirming the superiority of Funnel Heap over the generic Binary Heap once swaps to external memory begin to take place. We support the theoretical analysis of the cache and space complexity in Abu-Salem et al. (2015) using accounts of cache misses and memory consumption, and compare the run-time results appearing there against adaptive Funnel Heap. We

∗ Corresponding Author
Email addresses: [email protected] (Fatima K. Abu Salem), [email protected] (Khalil El-Harake), [email protected] (Karl Gemayel)
further demonstrate that Funnel Heap is a more efficient merger than the cache oblivious k-merger, which fails to achieve its optimal (and amortised) cache complexity when used for performing sums of products. This provides an empirical proof of concept that the overlapping approach for performing sums of products using one global Funnel Heap is more suited than the serialised approach, even when the latter uses the best merging structures available. Our main conclusion is that Funnel Heap will outperform Binary Heap for performing sums of products, whether data fits in in-core memory or not.

Preprint submitted to Journal of Symbolic Computation, August 27, 2018

Keywords:
Hensel Lifting, Newton Polytopes, Polynomial Factorisation, Cache Oblivious Algorithms and Data Structures, Cache Complexity, Priority Queues, Funnel Heap
1. Introduction
Hensel lifting techniques are at the basis of several polynomial factoring algorithms that are fast in practice. The classical algorithms are designed for generic bivariate polynomials over finite fields without reference to sparsity (e.g. (Bostan et al., 2004; Gao and Lauder, 2002)). The polytope method of (Abu-Salem et al., 2004) is intended to factor sparse polynomials more efficiently, by exploiting the structure of their Newton polygon. It promises to be significantly fast when the polygon has few decompositions, and can help factor families of polynomials which possess the same Newton polytope. While the pre-processing stages of the polytope method benefit from the sparsity of the input in reference to its Newton polygon, the Hensel lifting phase that pursues the boundary factorisations does not do so. Our chain of work in (Abu-Salem et al., 2014, 2015) reveals that the inner workings of Hensel lifting remain oblivious to the sparsity of the input as well as to fluctuations in the sparsity of intermediary output, so long as one is designing the Hensel lifting phase using the dense model for polynomial representation. In contrast, the sparse distributed representation considers the problem size to be a function of the number of non-zero terms of the polynomials treated, which captures the fluctuation in sparsity throughout the factorisation process. In (Abu-Salem et al., 2014), we revised the analysis of the Hensel lifting phase when polynomials are in sparse distributed representation. We derived that the asymptotic performance in work, space, and cache complexity is critically affected not only by the degree of the input polynomial, but also by the following factors: (i) the sparsity of each polynomial multiplication, and (ii) the sparsity of the resulting polynomial products to be merged into a final summand.
We further showed that even with advanced additive (merging) data structures like the cache aware tournament tree or the cache oblivious k-merger, the asymptotic performance of the serialised version in all three metrics is still poor. This was a result of the straightforward implementation, which performed all polynomial products first, followed by sums of those products, a process that we dubbed serialised. We remedied this by re-engineering the Hensel lifting phase such that sums of polynomial products are computed simultaneously using a MAX priority queue. This generalises the approach of (Johnson, 1974; Monagan and Pearce, 2007, 2009, 2011) for a single polynomial multiplication. We derived orders of magnitude reduction in work, space, and cache complexity even against a serialised version that employs many possible enhancements, and succeeded in evading expression swell. Hereafter, we label the serialised and the priority queue versions of Hensel lifting as SER-HL and PQ-HL respectively. More specifically, and with regards to the latter algorithm, we will denote by PQ-HL_B the version that uses Binary Heap as a priority queue, and by PQ-HL_F the version that uses Funnel Heap instead. Our experiments in Abu-Salem et al. (2014, 2015) demonstrate that the polytope method is now able to adapt significantly more efficiently to sparse input when its Newton polygon consists of a few edges, something that had not been observed when employing SER-HL. In (Abu-Salem et al., 2015), we shifted to enhancing the overlapping algorithm PQ-HL_F. The motivation lies in the fact that Binary Heap is not scalable, which, on a serial machine, means that its performance will deteriorate once data no longer fits in in-core memory, thus restricting the number of non-zero terms that input and intermediary output polynomials are permitted to possess.
By performing priority queue operations using optimal cache complexity and in a cache oblivious fashion, Funnel Heap beats Binary Heap at large scale. The fact that Funnel Heap assumes no knowledge of the underlying parameters, such as memory level, memory level size, or word length, makes it ideal for applications where polynomial arithmetic is susceptible to fluctuations in sparsity. However, all of those features can also be observed when adopting an alternate cache oblivious priority queue (see, for example, (Brodal et al., 2004; Arge et al., 2002)). As such, we pursued Funnel Heap for further attributes that can improve on its asymptotic performance, as well as exploit it at small scale, specifically for Hensel lifting. In (Abu-Salem et al., 2015), we addressed the chaining optimisation, and how Funnel Heap can be tailored to implement it in a highly efficient manner. We exploited that Funnel Heap is able to identify equal order monomials “for free” as part of its inner workings whilst it re-organises itself over sufficiently many updates during one of its special operations known as the “SWEEP”. By this we were able to eliminate entirely the requirement for searching from the chaining process. We designed a batched mode for chaining that is overlapped with Funnel Heap's mechanism for emptying its in-core components. In addition to also managing expression swell and irregularity in sparsity, batched chaining is sensitive to the number of distinct monomials residing in Funnel Heap, as opposed to the number of replicas chained. This allows the overhead due to batched chaining to decrease with increasing replicas.
For sufficiently large input size with respect to the cache-line length, and also sufficiently sparse input and intermediary polynomials, batched chaining that is “search free” leads to an implementation of Hensel lifting that exhibits optimal cache complexity in the number of replicas found in Funnel Heap, and one that achieves an order of magnitude reduction in space, as well as a reduction in the logarithmic factor in work and cache complexity, when comparing against PQ-HL_B of (Abu-Salem et al., 2014). We label as FH-HL the enhancement of Hensel lifting using Funnel Heap and batched chaining.

This paper extends all of the above work in garnering the prowess of Funnel Heap. To this end, we incorporate analytical as well as experimental algorithmics techniques as follows:

• In Section 3, we provide proofs of results introduced in (Abu-Salem et al., 2015) pertaining to properties of Funnel Heap, several of which are of independent worth extending beyond Hensel lifting. For example, we provide complete proofs for the following:

– We establish where the replicas will reside immediately after each insertion into Funnel Heap.

– We determine the number of times one is expected to call SWEEP on each link of Funnel Heap throughout a given sequence of insertions.

– Given an upper bound on the maximum constituency of Funnel Heap at any one point in time across a sequence of operations, we compute the total number of links required by Funnel Heap.
– We establish that the cache complexity by which one performs batched chaining within FH-HL is optimal.

• In Section 4, we exploit that the work and cache complexity of an Insert operation using Funnel Heap can be refined to depend on the rank of the inserted monomial product, where rank corresponds to its lifetime in Funnel Heap. By optimising on the pattern by which insertions and extractions occur during the Hensel lifting phase of the polytope method, we are able to obtain an adaptive Funnel Heap that minimises all of the work, cache, and space complexity of this phase. This, in turn, maximises the chances of having all polynomial arithmetic performed in the innermost levels of the memory hierarchy, and observes nearly optimal spatial locality. We show that the asymptotic costs of such preprocessing can be embedded in the overall costs to perform Hensel lifting with batched chaining (FH-HL), independently of the amount of minimisation taking place. We call the resulting algorithm FH-RANK.

• In Section 5, we develop the experimental algorithmics component to our work, addressing various facets:

– We conduct a detailed empirical study confirming the scalability of Funnel Heap over the generic Binary Heap. By simulating out of core behaviour, Funnel Heap is superior once swaps to external memory begin to take place, despite that it performs considerably more work than Binary Heap. This supports the notion that Funnel Heap should be employed even when performing a single polynomial multiplication or division once data grows out of core.

– We support the theoretical analysis of the cache and space complexity in (Abu-Salem et al., 2015) using accounts of cache misses and memory consumption of FH-HL. This can be seen as an extension of (Abu-Salem et al., 2015), as the performance measures presented there capture only the real execution time.
– We benchmark FH-RANK against several other variants of Hensel lifting, which include PQ-HL_B, PQ-HL_B with the chaining method akin to Monagan and Pearce (2011), PQ-HL_F, and FH-HL. Our empirical account of the time, space, and cache complexity of FH-RANK confirms the predicted asymptotic analysis in all three metrics.

– We demonstrate that Funnel Heap is a more efficient merger than the cache oblivious k-merger, which fails to achieve its optimal (and amortised) cache complexity when used for performing sums of products. We attribute this to the fact that the polynomial streams to be merged during Hensel lifting cannot be guaranteed to be of equal size (as a result of fluctuating sparsity). This provides an empirical proof of concept that the overlapping approach for performing sums of products using one global Funnel Heap is more suited than the serialised approach, even when the latter uses the best merging structures available.

We now begin with the following section on background literature and results.
2. Background
In the remainder of this paper, we will consider that in-core memory is of size M. It is organised using cache lines (disk blocks, respectively), each consisting of B consecutive words. All words in a single line are transferred together between in-core and out-of-core memory in one round (I/O operation) referred to as a cache miss (disk block transfer).

Funnel Heap implements Insert and Extract-Max operations in a cache oblivious fashion. For N elements, Funnel Heap can perform these operations using amortised (and optimal) O((1/B) log_{M/B}(N/B)) cache misses (Brodal and Fagerberg, 2002b).

At the innermost level, Funnel Heap is first constructed using simple binary mergers. Each binary merger processes two sorted input streams and produces their final merge. The heads of the input streams and the tail of the output stream reside in buffers of a limited size. A binary merger is invoked using a FILL function, where merge steps are repetitively performed until its output buffer is full or both its input streams are exhausted. One can construct binary merge trees by letting the output buffer of one merger be an input buffer of another merger. Now let k = 2^i for i ∈ Z^+. A k-merger is a binary merge tree with exactly k input streams. The size of the output buffer is k^3, and the sizes of the remaining buffers are defined recursively in a van Emde Boas fashion (see (Brodal and Fagerberg, 2002a,b; Frigo et al., 1999)). Funnel Heap consists of a sequence {K_i} of k-mergers, where k increases doubly exponentially across the sequence. The K_i's are linked together in a list, with the help of extra binary mergers and buffers at each juncture of the list. In Fig. 1, the circles are binary mergers, rectangles are buffers, and triangles are k-mergers. Link i in the linked list consists of a binary merger v_i, two buffers A_i and B_i, and a merger K_i with k_i input buffers labeled S_{i,1}, . . . , S_{i,k_i}. Link i has an associated counter c_i for which 1 ≤ c_i ≤ k_i + 1.
Initially, c_i = 1. It will be an invariant that S_{i,c_i}, . . . , S_{i,k_i} are empty. The first structure in Funnel Heap is a buffer S_{0,1}, of extremely small size s, dedicated for insertion. This buffer occupies in-core memory at all times. Funnel Heap is now laid out in memory in the order S_{0,1}, link 1, link 2, etc. Within link i the layout order is c_i, A_i, v_i, B_i, K_i, S_{i,1}, . . . , S_{i,k_i}.
Figure 1: Funnel Heap

The linked list of buffers and mergers constitutes one binary tree T with root v_1 and with sorted sequences of elements on the edges. This tree is heap-ordered: when traversing any path towards the root, elements will be passed in increasing order. If buffer A_1 is non-empty, the maximum element will reside in A_1 or in S_{0,1}. The smaller mergers in Funnel Heap are meant to occupy primary memory, and can process sufficiently many insertions and extractions in-core before an expensive operation is encountered. In contrast, the larger mergers tend to be out of core, and contain elements that are least likely to be accessed in the near future. To perform an Extract-Max, we call FILL on v_1 if buffer A_1 is empty. We return the largest element residing in S_{0,1} and A_1. To insert into Funnel Heap, an element has to be inserted into S_{0,1}. If S_{0,1} is full, a SWEEP function is called. Its purpose is to free the insertion buffer S_{0,1} together with all the heavily occupied links in Funnel Heap which are closer to in-core memory. During a SWEEP, all elements residing in those dense links are extracted then merged into one single stream. This stream is then copied sufficiently far downwards in Funnel Heap, towards the first link which has at least one empty input buffer. As a result of SWEEP, the dense links are now free and Funnel Heap operations are resumed within in-core memory. The SWEEP kernel is considerably expensive, yet sufficiently many insertions and all the extractions can be accounted for between any two SWEEPs.

Let F denote a finite field of characteristic p, and consider a polynomial f ∈ F[x, y] with total degree n. We wish to obtain a polynomial factorisation of f into two factors g and h such that f = gh and g, h ∈ F[x, y]. Let Newt(f) denote the Newton polygon R of f, defined as the convex hull of the support vectors of f.
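As an illustrative aside, the FILL mechanics of a binary merger described above can be sketched as follows. This is our own toy rendition (the names `BinaryMerger` and `fill` are hypothetical), using max-order merging to match the Extract-Max setting; it is not the paper's implementation:

```python
from collections import deque

class BinaryMerger:
    """A toy binary merger: repeatedly moves the larger head of its two
    sorted (descending) input streams into a bounded output buffer."""

    def __init__(self, left, right, out_capacity):
        self.left = deque(left)      # sorted in decreasing order
        self.right = deque(right)    # sorted in decreasing order
        self.out = deque()
        self.out_capacity = out_capacity

    def fill(self):
        # Merge steps run until the output buffer is full or both
        # input streams are exhausted (the FILL invariant above).
        while len(self.out) < self.out_capacity and (self.left or self.right):
            if not self.right or (self.left and self.left[0] >= self.right[0]):
                self.out.append(self.left.popleft())
            else:
                self.out.append(self.right.popleft())
        return list(self.out)

m = BinaryMerger([9, 5, 2], [8, 7, 1], out_capacity=4)
print(m.fill())  # → [9, 8, 7, 5]
```

Letting the output of one such merger feed an input of another yields the binary merge trees (and hence k-mergers) described above.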
One identifies suitable subsets {∆_i} of edges belonging to Newt(f), such that all lattice points can be accounted for by a proper translation of this set of edges. One then specialises terms of f along each edge δ_j^{(i)} ∈ ∆_i. Those specialisations are derived from the nonzero terms of f whose exponents make up integral points on each δ_j^{(i)}, and we label them as f_{δ_j}. These can be transformed into Laurent polynomials in one variable. For at least one ∆_i, the associated edge polynomials f_{δ_j} ought to be squarefree, for all δ_j ∈ ∆_i. One then begins lifting using the boundary factorisations given by f_{δ_j} = g_{δ_j} h_{δ_j}, for all δ_j ∈ ∆_i. For each boundary factorisation, we determine the associated {g_k}'s and {h_k}'s that satisfy the Hensel lifting equation

g_0^{δ_j} h_k^{δ_j} + h_0^{δ_j} g_k^{δ_j} = f_k^{δ_j} − Σ_{i=1}^{k−1} g_i^{δ_j} h_{k−i}^{δ_j}    (1)

for k = 1, . . . , min(deg(g), deg(h)).

In (Abu-Salem et al., 2014) we revised the analysis associated with the bottleneck in computation arising in Eq. (1), using the sparse distributed representation. In this model of representation, a polynomial is exclusively represented as the sum of its non-zero terms, sorted upon some decreasing monomial ordering. Eq. (1) can be modeled using the input and output requirements shown in Alg. 1:
Algorithm 1
Local-Iterative
Require:
An integer k designating one iterative step in the Hensel lifting process. Two sets of univariate polynomials over F, {g_i}_{i=1}^{k−1} and {h_i}_{i=1}^{k−1}, in sparse distributed monomial order representation.
Ensure:
The polynomial S_k = Σ_{i=1}^{k−1} g_i · h_j, where j = k − i.
for i = 1 to k − 1 do
    Compute p_i ← g_i · h_j.
end for
Compute S_k = Σ_{i=1}^{k−1} p_i.

We distinguish between the serialised approach (SER-HL) and the overlapping approach (PQ-HL) for performing the required arithmetic. In the serialised version, one performs all polynomial multiplications first, and then merges all the resulting polynomial products. In the overlapping approach, one handles all arithmetic simultaneously using a single Max priority queue. In (Abu-Salem et al., 2014), we analysed the work, space, and cache complexity when polynomials are in sparse distributed representation. We derived that the performance of the serialised version in all three metrics is critically affected not only by the degree of the input polynomial, but also by the following factors: (i) the sparsity of each polynomial multiplication, and (ii) the sparsity of the resulting polynomial products to be merged into a final summand. We further showed that this remains the case even with advanced additive (merging) data structures, like the cache aware tournament tree or the cache oblivious k-merger, for performing the sums of resulting polynomial products, and that the serialised approach is not able to fully exploit the cache efficiency of these structures.

In the overlapping approach, the priority queue is initialised using the highest order monomial products generated from each product g_i · h_j. Then, terms of S_k are produced in decreasing order of degree, via successive invocations of Extract-Max upon the priority queue. In (Abu-Salem et al., 2015), we pursued Funnel Heap as an alternative to the generic Binary Heap for implementing the overlapping approach. Beyond its cache oblivious nature and optimal cache complexity, we showed that Funnel Heap allows for a mechanism of chaining that significantly improves its overall performance. Chaining replicas outside the priority queue following insertions is a well known technique (e.g.
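The overlapping approach just described can be sketched with a generic max priority queue standing in for Funnel Heap (Python's `heapq` is a min-heap, so degrees are negated). The queue is seeded with the highest order monomial product of each pair g_i · h_j, and every extraction reinserts successor products from the same pair, in the spirit of Johnson (1974). The function name and the enumeration details below are ours, a sketch rather than the authors' code; polynomials are non-empty dicts mapping exponent to coefficient, and pairs are passed already matched as (g_i, h_{k−i}):

```python
import heapq

def sum_of_products(gs, hs):
    """Compute sum_i g_i * h_i with all pairwise products merged
    simultaneously through one max priority queue, emitting terms
    in decreasing order of degree."""
    pairs = list(zip(gs, hs))
    heap = []
    for p, (g, h) in enumerate(pairs):
        ge = sorted(g, reverse=True)   # exponents of g, descending
        he = sorted(h, reverse=True)   # exponents of h, descending
        # Seed with the highest order monomial product of this pair.
        heapq.heappush(heap, (-(ge[0] + he[0]), p, 0, 0, ge, he))
    result = {}
    while heap:
        negdeg, p, i, j, ge, he = heapq.heappop(heap)
        g, h = pairs[p]
        deg = -negdeg
        result[deg] = result.get(deg, 0) + g[ge[i]] * h[he[j]]
        # Reinsert successors from the same pair; each (i, j) cell of the
        # product grid is generated exactly once (row i opens at j == 0).
        if j + 1 < len(he):
            heapq.heappush(heap, (-(ge[i] + he[j + 1]), p, i, j + 1, ge, he))
        if j == 0 and i + 1 < len(ge):
            heapq.heappush(heap, (-(ge[i + 1] + he[0]), p, i + 1, 0, ge, he))
    return {d: c for d, c in result.items() if c != 0}

# (x^2 + 1)(x + 2) + x * x^3 = x^4 + x^3 + 2x^2 + x + 2
print(sum_of_products([{2: 1, 0: 1}, {1: 1}], [{1: 1, 0: 2}, {3: 1}]))
# → {4: 1, 3: 1, 2: 2, 1: 1, 0: 2}
```

Because every reinserted product has degree no larger than the one just extracted, the heap always contains the globally largest remaining monomial product, and terms of the sum come out in order.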
see (Monagan and Pearce, 2007, 2009, 2011) for the case of single polynomial multiplication using Binary Heap). It helps reduce several parameters tied to performance, such as the total number of extractions required to perform a single polynomial multiplication and the size of the priority queue. In turn, the latter results in reducing the number of monomial comparisons as well as the cache complexity required to perform each priority queue operation. In the straightforward implementation, one has to search for a replica immediately after an insertion and then chain the newly inserted element to the end of a linked list tied to that replica in the priority queue. When using Binary Heap, chaining hinders performance critically. Each insertion into the linked list denoting the chain incurs a random miss, whereas a single search query may require traversing the entire heap. It follows that the work and cache complexity of a single insertion amount to those of a traversal of N elements for a heap of size N. When employed in the priority queue that is implementing sums of products arising in Hensel lifting, chaining becomes daunting as the size of the queue and the amount of replication change irregularly from one iteration to the other.

In (Abu-Salem et al., 2015) we showed how to exploit the expensive SWEEP kernel of Funnel Heap in order to develop a cache friendly batched chaining mechanism (BATCHED-CHAIN) that gets intertwined with the SWEEP's internal operations. The crux behind our approach lies in delaying chaining and performing it in batches, at the “right time”. In the interim, a prescribed amount of replication is tolerated, whose effect is shown to be insignificant at scale. Here, we restrict chaining to only two specific phases in Funnel Heap's operations. If one is inserting a monomial product into the (sorted) insertion buffer S_{0,1}, a replica that resides in S_{0,1} is immediately identified and chaining can take place.
One does not attempt to find a replica outside of S_{0,1}. If such a replica exists, chaining will be deferred until S_{0,1} is full. That is when SWEEP is invoked upon some link i as well as one of its input buffers S_{i,c_i}. In the duration of SWEEP, one is forming the stream σ which contains the merged output of all elements in the buffers leading from A_1 down to S_{i,c_i}, together with all elements in links 1, . . . , i − 1. During the merge, the replicas residing in those specified regions of Funnel Heap will be aligned consecutively and thus identified. One can then chain them, all at once, outside of Funnel Heap.

BATCHED-CHAIN eliminates entirely the need for searching for replicas, and fewer links are allocated to Funnel Heap, which reduces garbage collection. BATCHED-CHAIN is further sensitive to the number of distinct monomials in Funnel Heap, and not the number of replicas chained. This can be understood to mean that the overhead due to chaining decreases with increasing replicas, which is intuitively appealing, since chaining is likely to be disabled once the number of replicas is lower than an acceptable threshold. When incorporating Funnel Heap and BATCHED-CHAIN into the priority queue algorithm for sums of products, Alg. FH-HL was shown to be significantly fast. The timings reported in (Abu-Salem et al., 2015) correspond to overall run-time, with the following percentages of improvement recorded, attained with increasing input size: about 90%–98% (FH-HL to Magma 2.18-7), about 90%–99% (FH-HL to SER-HL), and about 10%–60% (FH-HL to PQ-HL_B). The dramatic reduction in run-time over SER-HL is largely attributed to the substantial expression swell suffered by SER-HL, and that over PQ-HL_B is attributed to BATCHED-CHAIN.
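The key observation behind BATCHED-CHAIN, namely that equal order monomials become adjacent in the merged stream σ and can therefore be chained without any searching, can be illustrated by the following simplified sketch (ours, not the paper's implementation; streams are lists of (degree, coefficient) pairs sorted by decreasing degree, and `merge_and_chain` is a hypothetical name):

```python
import heapq

def merge_and_chain(streams):
    """Merge several descending-sorted (degree, coeff) streams into one
    stream, chaining (combining) equal-degree monomials as they become
    adjacent during the merge; no search is ever needed."""
    merged = heapq.merge(*streams, reverse=True)  # keeps descending order
    out = []
    for deg, coeff in merged:
        if out and out[-1][0] == deg:
            # Replicas are consecutive in the merged stream: chain in place.
            out[-1] = (deg, out[-1][1] + coeff)
        else:
            out.append((deg, coeff))
    return [t for t in out if t[1] != 0]

s1 = [(5, 2), (3, 1), (0, 4)]
s2 = [(5, 1), (2, 7)]
print(merge_and_chain([s1, s2]))  # → [(5, 3), (3, 1), (2, 7), (0, 4)]
```

The combining step piggybacks on work the merge does anyway, which is the sense in which equal order monomials are identified “for free”.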
3. Funnel Heap Properties: Extended Results
In this section we revisit several claims made in (Abu-Salem et al., 2015) and provide their complete proofs. Those results pertain to the behaviour of Funnel Heap in general, and not necessarily only in relation to Hensel lifting, and thus are of independent worth. Unless otherwise stated, all lemmas and corollaries in this specific section are stated in Abu-Salem et al. (2015). We begin with the following invariant, which identifies where the replicas will reside immediately after each Insert into Funnel Heap:
Lemma 3.1.
Let ℓ denote the index of the last link in Funnel Heap. Using BATCHED-CHAIN, and immediately following each insertion, there will be no replication within the constituency of any buffer {{S_{i,j}}_{j=1}^{k_i}}_{i=0}^{ℓ}. As a result, a given element in some buffer S_{i,j} may only be replicated at most once in each of the preceding buffers {S_{i,j′}}_{j′=1}^{j−1} in its own link, or in each of the buffers {{S_{i′,j}}_{j=1}^{k_{i′}}}_{i′=i+1}^{ℓ} in the larger links.

Proof.
Consider the case when one is inserting immediately into the insertion buffer S_{0,1}. Alg. BATCHED-CHAIN ensures that chaining happens immediately, and so there will be no replicas in this particular buffer. Now consider an arbitrary S_{i,j} for i > 0. We know that one can only write elements to S_{i,j} upon a call to SWEEP(i). This call produces the stream σ which merges the content of all links 1, . . . , i − 1 together with the path leading from A_1 down to S_{i,j}. Since BATCHED-CHAIN employs chaining during the formation of σ, buffer S_{i,j} will not contain any replicas.

Now, by the first claim above, each buffer S_{i,j} in Funnel Heap contains distinct elements. When i = 0, it is straightforward to see that since S_{0,1} has no buffers which precede it, each of its elements is replicated at most once in each of the following buffers. Now take i > 0. We know that once SWEEP is called onto S_{i,j}, each buffer {S_{i,j′}}_{j′>j} in the i'th link must be empty. Also, as we form σ – the end of which is written to S_{i,j} – we exclude the elements residing in each buffer that is also in the same link as S_{i,j} but which precedes it in that link. It follows that the only possible replicas of each element in S_{i,j} will be in each of the buffers {S_{i,j′}}_{j′=1}^{j−1} preceding it in its own link, as well as in each of the buffers {{S_{i′,j}}_{j=1}^{k_{i′}}}_{i′=i+1}^{ℓ} in the larger links.

The following result captures the number of times one is expected to call SWEEP on each link of Funnel Heap throughout a given sequence of insertions and extractions:

Lemma 3.2.
Let ℓ denote the index of the last link in Funnel Heap and let T_j denote the total number of times SWEEP(j) is called, across a given sequence of insertions and extractions. Then

T_j = c_ℓ · ∏_{i=j}^{ℓ−1} k_i.

Proof.
We proceed by backward induction on j. Take j = ℓ. Link ℓ has k_ℓ input buffers. Since this is the last link, not all of its input buffers S_{ℓ,j} may be written onto using SWEEP. In fact, exactly c_ℓ of them will be so. We thus have T_ℓ = c_ℓ. We now show that T_j = c_ℓ · ∏_{i=j}^{ℓ−1} k_i, assuming the property holds for T_{j+1}. Observe that before any SWEEP on link j + 1 has occurred, there should have preceded it exactly k_j SWEEPs, in order to fill each of the input buffers in link j. Also, by the inductive hypothesis, the total number of SWEEPs on link j + 1 is given by T_{j+1} = c_ℓ ∏_{i=j+1}^{ℓ−1} k_i. Combining, we get that there are T_j = k_j · c_ℓ · ∏_{i=j+1}^{ℓ−1} k_i = c_ℓ · ∏_{i=j}^{ℓ−1} k_i SWEEPs on link j.

Given an upper bound on the maximum constituency of Funnel Heap at any one point in time across a sequence of operations, we now determine the total number of links the heap requires:

Lemma 3.3.
Let ℓ denote the index of the last link in Funnel Heap. Then

ℓ = θ(log log |T|),

where |T| designates the maximum number of elements residing in Funnel Heap at any point in time.

Proof.
From (Brodal and Fagerberg, 2002b) we invoke the following proven results which we require for our proof:

1. The space usage s_i of each input buffer in link i satisfies s_i = θ(k_i), where k_i is the number of input buffers in link i.
2. The space usage of link i is θ(k_i s_i), i.e. it is dominated by the space usage of all of its k_i input buffers.
3. k_i = θ(k_{i−1}^{4/3}).

Since link ℓ is the last link required by Funnel Heap to host all of its elements, those elements will consume at least one path leading to the first input buffer of link ℓ, and at most all k_ℓ such possible paths. By (2) above, the space usage of each such path is dominated by the size of the input buffer itself, and we thus have |T| = O(k_ℓ s_ℓ) and |T| = Ω(s_ℓ). By |T| = O(k_ℓ s_ℓ) we have:

|T| = O(k_ℓ s_ℓ)
    = O(k_ℓ^2)    by (1) above
    = O((k_1^{(4/3)^{ℓ−1}})^2),

where the last equality follows by (3) above and by unrolling the recursive relation down to the base case. Using k_1 = 2 and composing the logarithm function on the two bases 2 and 4/3 respectively, we get ℓ = O(log log |T|). Taking |T| = Ω(s_ℓ), one can proceed analogously as above and obtain ℓ = Ω(log log |T|). This concludes the proof.

As in Monagan and Pearce (2007, 2009, 2011), reasoning in the sparse distributed representation produces worst-case versus best-case polynomial multiplication, depending on the structure of the output. Writing #p for the number of non-zero terms of a polynomial p, in the worst case a given multiplication g_i · h_j is sparse, as it yields a product with θ(#g_i · #h_j) non-zero terms, an incidence of a memory bound computation. At best, the multiplication is dense, as it yields a product with θ(#g_i + #h_j) terms. When the product has significantly fewer terms due to cancellation of terms, the operation is said to suffer from expression swell. We now establish that the cache complexity by which one performs BATCHED-CHAIN within FH-HL is optimal. For this, we require a few notations from Abu-Salem et al.
(2015) that will be helpful in the forthcoming sections as well. Let ḡ = max{#g_i}_{i=1}^{k} and h̄ = max{#h_j}_{j=1}^{k} denote the maximum number of non-zero monomials comprising each g_i and h_j respectively. Let τ denote the fraction of reduction in the size of the heap during chaining, such that the largest size the priority queue attains during the k'th lifting step is θ(kḡ/τ). Let τ′ denote the fraction of replication in the total number of monomial products, such that the total number of replicas chained during the k'th Hensel lifting step is θ(kḡh̄/τ′). The two parameters τ and τ′ reflect, in an asymptotic sense, the changes in the size of the queue as a function of the amount of replicas. Particularly, the bounds on τ and τ′ are as follows. When no replicas are encountered at all during any one lifting step, we have that τ = 1 and τ′ = θ(kḡh̄). In contrast, when each polynomial in the pair (g_i, h_j) is totally dense and all resulting products in one lifting step are of the same degree, the heap will contain only one element, leading to τ = θ(kḡ) and τ′ = θ(1). We now have the following:

Corollary 3.4.
Assume the sparse distributed representation for polynomials. Assume further that B = O(τh̄ log(kḡ/τ) / log log(kḡ/τ)). In the worst case analysis, when each polynomial multiplication g_i h_j is sparse, the cache complexity by which one performs BATCHED-CHAIN within FH-HL is optimal.

Proof.
Following the analysis in Prop. 3.6 of Abu-Salem et al. (2015), the cache complexity of FH-HL is split into two major parts. The first part accounts for all the insertions into Funnel Heap, using O((kḡh̄/B) log_{M/B}(kḡ/τ)) cache misses. The second part accounts for the cost to perform BATCHED-CHAIN, using O(kḡh̄/(τ′B) + (kḡ/τ) log log(kḡ/τ)) cache misses. When B = O(τh̄ log(kḡ/τ) / log log(kḡ/τ)), we get that the second summand in the cache complexity incurred by BATCHED-CHAIN is dominated by the cost to perform all the insertions into Funnel Heap, so that the cost for BATCHED-CHAIN is dominated by O(kḡh̄/(τ′B)), where θ(kḡh̄/τ′) denotes the total number of replicas chained. It follows that the cache complexity of BATCHED-CHAIN corresponds to that of a traversal, and hence is optimal.

In the following, we provide a detailed proof that FH-HL, thanks to BATCHED-CHAIN, outperforms PQ-HL_F (and thus, by transitivity, also PQ-HL_B). In other words, performing sums of products using Funnel Heap with BATCHED-CHAIN is provably more efficient in work, space, and cache complexity than if we were to resort to a standalone Funnel Heap implementation.

Corollary 3.5.
Assume the sparse distributed representation for polynomials, and assume further the conditions in Cor. 3.4. In the worst case analysis, when each polynomial multiplication g_i·h_j is sparse, FH-HL achieves an order of magnitude reduction in space, as well as a reduction in the logarithmic factor in work and cache complexity, over PQ-HL_F.

Proof.
From (Abu-Salem et al., 2014), Alg. PQ-HL_F requires the following costs:

Space: θ(kḡ); Work: θ(kḡh̄ log(kḡ)); Cache complexity: O( (kḡh̄/B) log_{M/B}(kḡ) ).

From (Abu-Salem et al., 2015), Alg. FH-HL requires the following costs:

Space: θ(kḡ/τ); Work: θ(kḡh̄ log(kḡ/τ)); Cache complexity: O( (kḡh̄/B) log_{M/B}(kḡ/τ) + kḡh̄/(τ′B) ).

Reductions in space borne by FH-HL are obvious by comparing θ(kḡ/τ) and θ(kḡ) respectively, and noting that τ ≥ 1. The logarithmic factor reductions in work are obvious by comparing θ(kḡh̄ log(kḡ/τ)) and θ(kḡh̄ log(kḡ)). For cache complexity, it remains to show that

(kḡh̄/B) log_{M/B}(kḡ/τ) + kḡh̄/(τ′B) = O( (kḡh̄/B) log_{M/B}(kḡ) ), or

(1/B) log_{M/B}(kḡ/τ) + 1/(τ′B) = O( (1/B) log_{M/B}(kḡ) ).   (2)

Recall the bounds established earlier for τ and τ′. When τ = 1, we have τ′ = θ(kḡh̄), for which we have:

(1/B) log_{M/B}(kḡ/τ) + 1/(τ′B) = θ( (1/B) log_{M/B}(kḡ) + 1/(kḡh̄B) ) = θ( (1/B) log_{M/B}(kḡ) ),

and (2) holds. As τ increases, τ′ satisfies τ′ = Ω(1), and so

(1/B) log_{M/B}(kḡ/τ) = O( (1/B) log_{M/B}(kḡ) ) and 1/(τ′B) = O(1/B) = O( (1/B) log_{M/B}(kḡ) ),

for which (2) holds again. This concludes the proof.
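The role the replication parameters play above can be made concrete with a toy sketch of chaining (a simplified stand-in, not the paper's implementation): monomial products that share an order occupy a single queue slot, with their coefficients chained aside, so the queue holds at most one entry per distinct order and its size shrinks by the factor τ relative to the unchained queue.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <vector>

// Simplified illustration of chaining: products sharing a monomial order
// occupy one priority-queue slot; replicas are chained aside.
struct ChainedQueue {
    std::priority_queue<int> orders;        // one slot per distinct order
    std::map<int, std::vector<long>> chain; // order -> chained coefficients

    void insert(int order, long coeff) {
        auto it = chain.find(order);
        if (it == chain.end()) {            // first occurrence: real insert
            orders.push(order);
            chain[order].push_back(coeff);
        } else {                            // replica: chain it, no insert
            it->second.push_back(coeff);
        }
    }

    std::size_t size() const { return orders.size(); }
};
```

With four products of which three share one order, the queue holds only two entries; the ratio of products inserted to queue size is the (toy analogue of the) reduction fraction τ.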
4. Adaptive Funnel Heap and the Sequence of Insertions/Extractions
The canonical SWEEP function described in Sec. 2 works by identifying the smallest link in Funnel Heap that is completely empty, which necessitates pushing the content of the heap downwards in the direction of larger and larger links, which are also more likely to be out-of-core. An enhanced version of SWEEP exploits the smaller links in Funnel Heap that are sufficiently sparse, instead of always sweeping onto totally empty, yet significantly larger, buffers. The refined SWEEP operation identifies the first link i whose total number of elements residing in its input buffers {S_{i,j}} is less than half of its total size. The input buffer with minimal occupancy in that link, say S_{i,j}, is then recycled and its content moved onto another input buffer, say S_{i,j′}, with second largest occupancy. SWEEP is now called with S_{i,j} as the destination buffer. That smaller buffers are effectively used instead of the larger, out-of-core buffers causes Funnel Heap to adapt to various modes of usage. The analysis in (Brodal and Fagerberg, 2002b) shows that the amortised cost for the r’th insertion is now O( (1/B) log_{M/B}(N_r/B) ), where N_r denotes some notion of the lifetime of the r’th inserted element in Funnel Heap. Particularly, if the r’th inserted element is removed by an Extract-Max prior to the insertion of the t’th element, then N_r = t − r.

In this section, we exploit the idea that the cache complexity of an Insert operation can be refined to depend on the rank of the inserted monomial product, by optimising the pattern by which insertions and extractions occur during the Hensel lifting phase, with this notion of lifetime in mind. We achieve this by efficiently delaying all the insertions that come from polynomial pairs that “can wait”, as indicated by their total order and their rank in relation to the maximal element residing in Funnel Heap. The refined process, which we label FH-RANK, proceeds as follows.
We first accumulate the set of all distinct monomial orders α appearing in the sum of products S_k. When the input polynomials are univariate, the monomial order can be understood to denote the total degree of a given monomial product. For each α, let ψ(α) denote the set of indices { i | 1 ≤ i ≤ k−1 ∧ o(g_i·h_{j=k−i}) = α }, which maps each monomial order α to the polynomial operands that resulted in a product of this particular order. Let O = {(α, ψ(α))}, where O is sorted on α in strictly decreasing order. We manipulate the sequence of insertions into Funnel Heap based on information derived from O as follows. Let α_max = max{α}. Initially, we insert monomial products generated from pairs pointed to by ψ(α_max) only. No other polynomial pair of total order α′ < α_max may be involved until at least one monomial product from ψ(α_max), whose order is less than or equal to α′, has been inserted into Funnel Heap. This point in time is identified by knowledge of the next maximal order to be encountered before an upcoming Extract-Max is called. The function NEXT-MAX-ORDER introduced below answers this particular query, by calling Extract-Max on Funnel Heap whilst refraining from actually extracting the maximal element, and only reporting on its order. Alg. 2 summarises the details of this adaptive technique, which we label FH-RANK. We further demonstrate its impact asymptotically, particularly with regards to the costs associated with preprocessing the sorted list O, as well as invoking NEXT-MAX-ORDER on top of the existing calls to Extract-Max. In Lem. 4.1, Cor. 4.2 and Cor. 4.3 we argue that with FH-RANK, the space required to handle sums of products using the priority queue approach is minimised, and so are the work and cache complexity required to perform insertions:

Lemma 4.1.
In Alg. FH-RANK, the lifetime of each inserted element in the priority queue is minimised.
Proof.
We establish the proof by showing that no monomial product enters the priority queue prior to the time when it is necessary for it to be there. Put differently, a monomial product is inserted into Funnel Heap at a point in time when its insertion can no longer be deferred. To show the claim, assume that Funnel Heap contains a monomial product (X_u^{(i)} Y_w^{(j)}, g^{(i)}, h^{(j)}) whose insertion could have been safely deferred. Then one of the following two cases must hold:

• The polynomial pair (g_i, h_j) should not have been engaged in the insertion process. But that is impossible, since the pair (g_i, h_j) must have been identified by the latest call to NEXT-MAX-ORDER in Step 20, which ensures that a polynomial pair is chosen only when its highest order monomial product is larger than the maximum residing element in Funnel Heap.

• The pair (g_i, h_j) is already engaged in the insertion process, but this particular monomial product X_u^{(i)} Y_w^{(j)} can wait. This is also impossible, since the sequence of insertions and extractions ensures that a given monomial product is inserted only after one of its horizontal or vertical predecessors (X_{u−1}^{(i)} Y_w^{(j)} or X_u^{(i)} Y_{w−1}^{(j)}) has been extracted. This means that X_u^{(i)} Y_w^{(j)} can potentially be the next maximum, and so must be inserted into Funnel Heap.

Corollary 4.2.
In Alg. FH-RANK, the cache complexity of Insert is minimised.
Proof.
Consider the r’th insertion in the sequence prescribed by Alg. FH-RANK. Let t be the index of the first insertion that takes place immediately following the extraction of the r’th element. The refined SWEEP operation brings the amortised cache complexity of the r’th insertion to O( (1/B) log_{M/B}(N_r/B) ), where N_r = t − r. By Lem. 4.1, the lifetime in Funnel Heap of the r’th element is minimised, and hence, so is t − r. This concludes the proof.

Algorithm 2: Alg. FH-RANK
Require:
An integer k designating one iterative step in the Hensel lifting process. Two sets of univariate polynomials over F, {g_i}_{i=1}^{k−1} and {h_i}_{i=1}^{k−1}, in sparse and sorted monomial order representation. Also, two arrays Ord_g and Ord_h, such that Ord_g(i) and Ord_h(i) designate respectively the maximal order of the polynomials g_i and h_i under the assumed monomial ordering, for i = 1, …, k−1.

Ensure:
The polynomial S_k = Σ_{i=1}^{k−1} g_i·h_j, where j = k − i.

For each product pair (g_i, h_j), calculate o(i, j), the total order of their product under the assumed monomial ordering, by a forward scan of the array Ord_g and a backward scan of Ord_h. Collect the set {α} of distinct total orders. Set O = {(α, ψ(α))}, where ψ(α) = { i | 1 ≤ i ≤ k−1 ∧ o(g_i·h_{j=k−i}) = α }. If k ∈ θ(n), sort O on the {α}’s using Counting Sort; else, use a cache-efficient comparison-based algorithm. Consider O_ind = (α_ind, ψ(α_ind)), where ind ← 1.
for i ∈ ψ(α_ind), and j = k − i do
  Call BATCHED-CHAIN(X_1^{(i)} Y_1^{(j)}, g^{(i)}, h^{(j)}) to insert those monomial products into Funnel Heap while chaining.
end for
Set t ← 0.
repeat
  Let (
XY, g, h) denote the maximal element in Funnel Heap, and β denote the rank of XY under the assumed monomial ordering. Set t ← t + 1, a_t ← 0, R_t ← XY.
  Call Extract-Max on Funnel Heap to return the maximal element (XY, g, h).
  Return all monomial products of order β chained outside of Funnel Heap.
  while the maximal element in Funnel Heap has rank equal to β do
    Repeat Steps 10–11 above
  end while
  for each element (X_u^{(i)} Y_w^{(j)}, g^{(i)}, h^{(j)}) returned in Steps 10 and 11 do
    Perform the coefficient arithmetic required to accumulate in a_t by reading the coefficients of terms pointed to by g^{(i)} and h^{(j)}, then set S_k ← S_k + a_t R_t.

Algorithm 2: Alg. FH-RANK (continued)

    If w < #h_j, insert into Funnel Heap the horizontal successor by calling BATCHED-CHAIN(X_u^{(i)} Y_{w+1}^{(j)}).
    If w = 1 and u < #g_i, insert into Funnel Heap the vertical successor by calling BATCHED-CHAIN(X_{u+1}^{(i)} Y_w^{(j)}).
  end for
  β′ ← NEXT-MAX-ORDER.
  while α_{ind+1} ≥ β′ do
    for i ∈ ψ(α_{ind+1}), and j = k − i do
      Call BATCHED-CHAIN(X_1^{(i)} Y_1^{(j)}, g^{(i)}, h^{(j)}) to insert those monomial products into Funnel Heap while chaining.
      ind ← ind + 1.
    end for
  end while
until no monomials can be inserted into Funnel Heap.
Return S_k.

Moreover, we have:

Corollary 4.3.
In Alg. FH-RANK, the size of the priority queue is minimised. Put differently, the likelihood that it can operate in the innermost possible levels of the memory hierarchy is maximised.
Proof.
The size of the priority queue is minimised as an immediate consequence of Lemma 4.1, since no element enters the queue prior to the time when it has to be there.

As an immediate consequence of Cor. 4.3, the work required to perform each monomial product using insertions and extractions is also minimised. Finally, we establish that the spatial locality associated with BATCHED-CHAIN in Alg. FH-RANK is nearly optimal. Under optimal spatial locality, the length of each stride in the address space is at most 1. Our definition of nearly optimal relaxes this requirement: it suffices that the length of each stride in the address space is minimised.
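The chaining layout behind this stride argument can be sketched as follows (a simplified stand-in with hypothetical names, not the paper's implementation): a static array D indexed by monomial order, each slot holding the dynamic chain of replicas of that order, so that appending a replica is a stride-1 (horizontal) access and visiting orders in increasing sequence keeps the (vertical) hops over D minimal.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for the batched-chaining layout:
// D[alpha] is the chain of replicas of monomial order alpha.
struct ChainStore {
    std::vector<std::vector<long>> D;     // static array of chain heads

    explicit ChainStore(std::size_t maxOrder) : D(maxOrder + 1) {}

    void chain(std::size_t alpha, long coeffRef) {
        D[alpha].push_back(coeffRef);     // sequential append, stride 1
    }

    // Drain and return all replicas of a given order (on Extract-Max).
    std::vector<long> drain(std::size_t alpha) {
        std::vector<long> out;
        out.swap(D[alpha]);
        return out;
    }
};
```

Because replicas of one order live in one contiguous chain, and chains are visited in increasing order of α, both access directions behave like near-sequential traversals.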
Corollary 4.4.
BATCHED-CHAIN invoked by Alg. FH-RANK exhibits nearly optimal spatial locality.

Proof. From Abu-Salem et al. (2015), the mechanism for chaining in batches introduced in Sec. 2 is achieved as follows. We store all monomials of a given order α that have to be excluded from the queue in a dynamic array D[α]. The set {D[α]}_α over all monomial orders α encountered during the latest SWEEP represents pointers to the heads of chains. These pointers are aligned consecutively in a static array D in increasing monomial order, where the size of D grows like the bound on deg(S_k). We label the memory accesses to the dynamic array pointed to by D[α] as horizontal accesses. In contrast, we label the memory accesses to the static array D, as we hop from one pointer D[α] to another, as vertical accesses. Observe that replicas of each given monomial order α are chained consecutively into the single chain pointed to by D[α], which maintains sequential spatial locality corresponding to strides of length equal to 1 in the horizontal direction. Because of BATCHED-CHAIN, all pointers {D[α]}_α are accessed in increasing monomial order. Additionally, because of FH-RANK, no element being chained could have been delayed entry into Funnel Heap. It follows that jumps in the vertical direction are minimised.

We devote the remainder of this section to showing that the pre-processing costs associated with Alg. FH-RANK can be embedded in the costs to perform all monomial insertions, independently of the amount of minimisation taking place. This is taken up in Lem. 4.5, Lem. 4.6 and Cor. 4.7.

Lemma 4.5.
The cost to perform Next-Max-Order is θ(1) if buffer A of Funnel Heap is non-empty. Else, the cost to perform Next-Max-Order accounts for the cost to call one ensuing Extract-Max.

Proof.
If buffer A is non-empty, Next-Max-Order returns the maximum over all elements residing in the insertion buffer S and in A. Else, Next-Max-Order will trigger a FILL on buffer A (see Sec. 2). Merely revealing the maximum element, however, does not alter the physical constituency of A. Hence, this buffer can be queried again with respect to the maximum residing in it when the first Extract-Max to be encountered after the call to Next-Max-Order is issued. This can also be done without the need for FILL. This completes the proof.

Lemma 4.6.
Consider an invocation of Alg. FH-RANK during the k’th lifting step. If k ∈ θ(n), then sorting polynomial pairs on their total order in Step 2 of FH-RANK requires θ(k) work and θ((M + k)/B) cache misses. Else, we have k ∈ O(n) but k ∉ Ω(n), for which this step requires O(k log k) work and O( (k/B) log_{M/B} k ) cache misses.

Proof.
Collecting the total order of products in {(g_i · h_{k−i})}_{i=1,…,k−1} using a forward and backward scan of the arrays Ord_g and Ord_h respectively requires only streaming scans, whose total cost amounts to θ(k/B) cache misses. We now address the costs for sorting.

Case 1: Using Counting Sort, the number of records to be sorted is equal to k − 1. Since each record represents the total degree of a monomial product, it is less than or equal to n − k, and so the work of Counting Sort is θ(k + (n − k)) = θ(n). Using a cache-efficient variant of Counting Sort attains a cache complexity nearly linear in the number of records as well, equal to θ((n + M)/B) cache misses (see Moreno Maza (2014)). When k = θ(n), the space, work, and cache complexity of Counting Sort simplify to those stated in the Lemma above.

Case 2:
Here, we know that k ∉ θ(n). But k ≤ n, so we must have k ∈ O(n) but k ∉ Ω(n). Here, Counting Sort is no longer linear in the number of records being sorted. Using any of the comparison-based, cache-efficient sorting algorithms requires O(k) space, optimal O(k log k) work, and optimal O( (k/B) log_{M/B} k ) cache misses (see (Frigo et al., 1999), for example).

In the following Corollary, we conclude that the cost for sorting and pre-fetching the maximal order can be embedded in the asymptotic costs for performing all monomial products comprising S_k, independently of the amount of minimisation exerted onto Funnel Heap.

Corollary 4.7.
Assume the sparse distributed representation for polynomials. Let |T|_min denote the size of Funnel Heap following the minimisation incurred by Alg. FH-RANK in the k’th lifting step, ḡ = max{#g_i}_{i=1}^{k} and h̄ = max{#h_j}_{j=1}^{k}. Assume further that the cache size M satisfies M ∈ O(n), and, if k ∈ O(n) but k ∉ Ω(n), that ḡh̄ = Ω(log k). In the worst case analysis, when each polynomial multiplication g_i·h_j is sparse, the cost to sort all polynomial products {(g_i · h_j)} on their total order and to issue all calls to NEXT-MAX-ORDER can be embedded in the cost for performing all insertions into Funnel Heap.

Proof.
The gist of the proof lies in deriving bounds on the costs for the pre-processing phase that do not depend on |T|_min. By Lem. 4.5, each call to NEXT-MAX-ORDER accounts for the cost of the ensuing Extract-Max. Performing all θ(kḡh̄) monomial extractions and insertions into Funnel Heap requires O( kḡh̄ log |T|_min ) work and amortised O( (kḡh̄/B) log_{M/B} |T|_min ) cache misses. From Lem. 4.6, if k ∈ θ(n), we know that cache-friendly Counting Sort requires θ(k) work and θ((M + k)/B) cache misses, and for these costs to be embedded in their respective counterparts, we require

k ∈ O( kḡh̄ log |T|_min )   (3)

and

(M + k)/B ∈ O( (kḡh̄/B) log_{M/B} |T|_min ).   (4)

The requirement in (3) trivially holds. For (4), the assumption that M ∈ O(n) when k ∈ θ(n) leads to (M + k)/B = O(k/B) ∈ O( (kḡh̄/B) log_{M/B} |T|_min ). When k ∈ O(n) but k ∉ Ω(n), the work of sorting ought to satisfy k log k ∈ O( kḡh̄ log |T|_min ), or log k ∈ O( ḡh̄ log |T|_min ), which is attained if each monomial product g_i·h_j, known to be sparse and having O(ḡh̄) non-zero terms, also has at least Ω(log k) non-zero terms. This is a very realistic assumption, since k is sufficiently small in this branch of the proof. Using this same requirement, we also get that the cache complexity required by sorting is embedded in that required to perform all monomial insertions and extractions, as we can see from:

(k/B) log_{M/B} k ∈ O( (kḡh̄/B) log_{M/B} |T|_min ).

Remark 4.8.
The condition M ∈ O(n) entails that the input polynomial has sufficiently high degree with respect to the size of in-core memory. This is a very reasonable requirement at scale, bearing contemporary cache sizes. To require that log k ∈ O(ḡh̄) when k ∈ O(n) but k ∉ Ω(n) means that the sparsest polynomial product encountered in the k’th lifting step has Ω(log k) non-zero terms. But any such product has degree at most n − k, which means that for significantly small iteration indices k, the polynomial products arising tend to be of significantly large degrees. The lower bound requirement on the sparsity of polynomial products indicated by ḡh̄ ∈ Ω(log k) is thus very permissive.

5. Experimental Results

We implement all algorithms in C++ and compile our code using g++ version 4.4.6 20120305 with optimisation level -O3. We run the experiments on an Intel(R) Xeon(R) CPU E5645 with 43GB of RAM, 12MB in L3 cache, and 256KB in L2 cache. To record cache misses, we use the STXXL library in Sec. 5.1 and the profiler tool perf in Sec. 5.2 and 5.3.
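The generic push/pop benchmark used in Sec. 5.1 below can be sketched, in simplified form, as follows (a hypothetical harness; the paper's version runs against STXXL-backed Binary and Funnel Heaps with RAM capped at 16MB).

```cpp
#include <cassert>
#include <cstdlib>
#include <queue>

// Simplified benchmark harness: replay a sequence of random pushes
// followed by pops against a priority queue, returning the number of
// completed operations. The in-core std::priority_queue stands in for
// the external-memory heaps measured in the paper.
std::size_t runBenchmark(std::size_t n) {
    std::priority_queue<int> heap;
    std::size_t ops = 0;
    for (std::size_t i = 0; i < n; ++i) {   // push n random elements
        heap.push(std::rand());
        ++ops;
    }
    while (!heap.empty()) {                 // pop all n elements
        heap.pop();
        ++ops;
    }
    return ops;
}
```

Timing such a loop under a constrained-RAM configuration is what exposes the cut-off at which disk swaps begin to dominate.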
5.1. Generic priority queue operations

In this section we present a preamble where we benchmark Funnel Heap against Binary Heap for performing a generic sequence of priority queue operations outside the scope of Hensel lifting. The only available benchmarking appears in the unpublished work of (Sach and Clifford, 2008). (The authors of the current manuscript were unable to locate any standardised implementations of Funnel Heap available for public use.) The specific goal of this section is to reproduce the conclusions derived in (Sach and Clifford, 2008) on our own machines and to reveal the cut-off line where Funnel Heap is able to beat Binary Heap. A careful examination is required before one is able to witness the performance predicted by the asymptotic analysis at large scale. This is because Funnel Heap performs more computations than Binary Heap, making it expensive to use at small scale, when the cost to perform memory accesses does not dominate performance. In line with (Sach and Clifford, 2008), we simulate out-of-core behaviour by constraining the RAM of our machine to 16MB through the use of STXXL vectors (see STXXL version 1.3.1 and (Dementiev et al., 2005)). In this case, we force both heaps to store part of their structure on disk despite that the input suites are not too large. We generate a list of random integers and perform the following sequence of insertions and extractions onto the queues:

• N elements are pushed into the heap
• N/…
• N/…
• N elements are popped off the heap

In Table 1 below, we present an account of the overall runtime as well as cache misses incurred by each of the two heaps. The term “Max Capacity” refers to N = 4 × … elements of size four bytes each, the expected cutoff line representing the customised size of RAM. Beyond that point, both heaps start swapping to disk and the cache complexity begins to dominate the computation cost.
Funnel Heap now beats Binary Heap by orders of magnitude, as predicted by the asymptotic analysis.

Table 1: Generic Priority Queue Operations

Heap Type | Max Capacity | Cache Misses | Runtime (s)
Binary | 65,536 | 46,247 | 2.36
Binary | 262,144 | 220,761 | 9.76
Binary | 524,288 | 370,777 | 22.18
Binary | 1,048,576 | 987,864 | 46.49
Binary | 2,097,152 | 4,428,548 | 98.05
Binary | 4,194,304 | 16,364,635 | 206.67
Binary | 8,388,608 | 647,576,728 | 49,021.72
Binary | 16,777,216 | 3,977,883,205 | 337,626.06
Funnel | 65,536 | 142,839 | 2.85
Funnel | 262,144 | 291,070 | 10.91
Funnel | 524,288 | 779,126 | 26.21
Funnel | 1,048,576 | 1,572,436 | 55.89
Funnel | 2,097,152 | 3,319,421 | 127.17
Funnel | 4,194,304 | 5,073,462 | 269.02
Funnel | 8,388,608 | 13,639,451 | 558.49
Funnel | 16,777,216 | 23,786,264 | 1,234.35

The results in this section should be construed in the following sense. The input suite to our polynomial factorisation experiments below does not attain a level of growth sufficient to solicit out-of-core behaviour. As such, any significant improvements in performance shown hereafter can only be attributed to the optimisations incurred onto Funnel Heap, particularly batched chaining and the optimised sequence of insertions, but not to Funnel Heap alone. On the other hand, we do expect Funnel Heap to outperform Binary Heap considerably, without any of those mentioned optimisations. This will remain applicable even when tackling a single polynomial multiplication or division – and not just sums of products – once the input data grows sufficiently large.

5.2. Performance of FH-RANK

This section is dedicated to the performance of FH-RANK. We start off with a summary of relevant results from (Abu-Salem et al., 2015), where we observe all of the following. Both overlapping implementations PQ-HL_B and FH-HL are always significantly faster than SER-HL, despite that the Newton polygons treated there are extremely sparse. Even when the polygon has a few edges, the polytope method remains susceptible to fluctuations in the sparsity of the intermediary polynomials, which is attributed to expression swell.
Both PQ-HL_B and FH-HL are also always faster than Magma 2.18-7, whose built-in function for factoring bivariate polynomials relies on the standard algorithms in (Bernardin and Monagan, 1997; von Hoeij, 2002). Starting with bivariate polynomials of total degree equal to 10…, B…, which is largely attributed to BATCHED-CHAIN. Hereafter we address the performance of FH-RANK, by benchmarking it against all of PQ-HL_F, PQ-HL_B, PQ-HL_B with chaining (PQ-HL_B-CHAIN), and FH-HL. We use the sparse distributed representation for encoding all polynomials. Our input suite consists of random bivariate polynomials over F that turn out to factor into two irreducibles, and that are extremely sparse. In several instances, we specifically generate random polynomials whose Newton polygon is the sparsest possible, consisting of the triangle (0, n), (n, 0), (0, 0), where n denotes the total degree of the input polynomial, and t denotes the total number of its non-zero terms. Our requirement for sparse input is to have t ≪ n. The parameter F corresponds to the total number of boundary factorisations attempted before the two irreducible factors are produced. The first table for each input polynomial provides an account of run-time in seconds as well as cache misses. The second and third ensuing tables show the number of SWEEPs called upon each link in the Funnel Heap used in Alg. PQ-HL_F and Alg. FH-RANK respectively. In those tables we also indicate the size of each link.

FH-RANK against all other variants:
Both FH-RANK and FH-HL are faster than all of PQ-HL_F, PQ-HL_B, and PQ-HL_B-CHAIN, and they incur considerably fewer cache misses thanks to BATCHED-CHAIN. In turn, FH-RANK improves over FH-HL at a larger scale thanks to the optimised sequence of insertions. This is demonstrated by an order of magnitude reduction in time as well as cache misses over FH-HL. For all of the input polynomials tested, the second and third tables show that FH-RANK brings about an order of magnitude reduction in space, as demonstrated in the reduction of the number of links required, as well as the corresponding size of each link. Of significance also is the notable reduction in the number of SWEEPs to each link. The improvement in the amount of space required, as well as the number of SWEEPs incurred, explains the significant reduction in the run-time and cache misses associated with FH-RANK. In contrast, we note that FH-HL consumed the same amount of peak memory as PQ-HL_F: as explained in (Abu-Salem et al., 2015), FH-HL requires the same amount of space as FH, except that an asymptotically large amount of space shifts from being “working space” to “auxiliary space”, which improves on the runtime and rate of cache misses of FH-HL.

The effect of chaining on Binary Heap:
In several instances, PQ-HL_B-CHAIN incurs the same order of cache misses as PQ-HL_B, and on a few occasions it is actually slower. This demonstrates that chaining is not consistently efficient when employing Binary Heap. In (Abu-Salem et al., 2015), we elaborate on the reasons behind this behaviour. For example, each insertion into the linked list denoting the chain incurs a random miss. Also, a single search for a replica to be chained may very well require traversing the entire heap of size N, bringing the work and cache complexity for performing a single search to be that of a traversal of N elements. Summing up, neither temporal locality nor spatial locality are observed in PQ-HL_B-CHAIN.

Funnel Heap without any of the optimisations: The results reported for PQ-HL_F are also not promising for the range of input degrees we had treated, specifically as taken against PQ-HL_B and PQ-HL_B-CHAIN. Funnel Heap inherently performs more work, and incurs more cache misses for smaller levels of the memory hierarchy. The input polynomials treated here do not solicit access to disk. As a result, the extra work performed by Funnel Heap is not compensated for. Yet, it is evident from the benchmarks reported in Section 5.1 that PQ-HL_F is set to outperform PQ-HL_B and PQ-HL_B-CHAIN once the input is large enough to solicit disk swaps. At smaller scale, all the benefits observed are attributed solely to the techniques in BATCHED-CHAIN and optimising the sequence of insertions (FH-HL and FH-RANK respectively).

Table 2 (input n = 2000, t ≈ …, F = 1): runtime and cache misses per heap type (PQ-HL_B-Chain at 7.… s), followed by the number of SWEEPs per link (Links 1–7) for PQ-HL_F and FH-RANK.

Tables for input n = 4000, t ≈ …, F = 8: runtime and cache misses per heap type (PQ-HL_B-Chain at 17.… s), and SWEEPs per link (Links 1–7) for PQ-HL_F and FH-RANK.

Tables for input n = 5000, t ≈ …, F = 5: runtime and cache misses per heap type (PQ-HL_F at 50,129,826 cache misses; FH-HL at 32.… s), and SWEEPs per link (Links 1–7) for PQ-HL_F and FH-RANK.

Tables for input n = 6000, t ≈ …, F = 3: runtime and cache misses per heap type (PQ-HL_B-Chain at 372.… s), and SWEEPs per link (Links 1–7) for PQ-HL_F and FH-RANK.

Tables for input n = 10…, t ≈ …, F = 7: runtime and cache misses per heap type (PQ-HL_B at 17 hrs 19′…, PQ-HL_B-Chain at 9 hrs 39′…, PQ-HL_F at 51 hrs 18′…), and SWEEPs per link (Links 1–7) for PQ-HL_F and FH-RANK.

Tables for input n = 13…, t ≈ …, F = 2: runtime and cache misses per heap type (PQ-HL_B-Chain at 7,799.34 s and 5,307,782,843 cache misses), and SWEEPs per link (Links 1–7) for PQ-HL_F and FH-RANK.

5.3. Funnel Heap versus the k-merger

In this section we gather further insight into the bottleneck associated with the irregularity in computations as a result of the varying density of the intermediary output arising during Hensel lifting. In (Abu-Salem et al., 2014), we observed that if space is not a concern and one is willing to store all polynomial products during each Hensel lifting step before their final merge into S_k, the only cache oblivious competitor at scale for the global priority queue approach would be the following scenario. One would perform each polynomial multiplication separately using a dedicated (local) priority queue, followed by merging of all the polynomial products using the static and cache oblivious k-merger of (Frigo et al., 1999). Local priority queues tackling one polynomial multiplication at a time are more likely to reside in cache when the multiplication is sparse. As such, a data structure suited for internal memory, such as a MAX-heap, should be used, since it performs less work than any external memory implementation. The goal of this section, however, is to demonstrate that even in this scenario, the k-merger still fails to achieve its optimal (and amortised) work and cache complexity, since the streams to be merged are of varying density. This is in contrast to its typical mode of usage in merge-based sorting algorithms, where the streams tend to be of equal size. This section provides an empirical proof of concept that overlapping arithmetic using a global Funnel Heap is not only cache-oblivious, but is further guaranteed to attain optimal performance, bearing the irregularity in computation.
An interesting conclusion we draw is that, despite that the k-merger incurs optimal work for merging a given set of elements, it is beaten by Funnel Heap, albeit at a higher work complexity, once the k-merger fails to amortise its cache complexity in the presence of irregular input.

To this end, we demonstrate by merging streams of random integers, and we simulate three scenarios depending on the density of streams. In Table 20, we process k streams each containing √k elements, in Table 21 we process k streams with k elements each, and in Table 22, we process k streams with k² elements each. In addition to cache misses and total execution time, we record the total number of integer comparisons required by merging subroutines, including the merging that is required by the SWEEP function in Funnel Heap.

On integer comparisons: In all three tables, Funnel Heap does more integer comparisons than the k-merger, something we attribute to the cost of its SWEEP function. Despite that it incurs the least number of comparisons, the k-merger lags behind in cache miss rate and then finally in overall execution time.

On cache misses: In all three tables, Funnel Heap terminates using significantly fewer cache misses than the k-merger. We note, however, that the rate of improvement of Funnel Heap in Table 22 is less than what is observed for Tables 20 and 21, which is to be expected, since the k-merger is now able to produce k³ elements at the end of its invocation. This is in line with the k-merger’s cache complexity analysis, by which we know that the amortised cache complexity is met so long as the output buffer produces k³ elements.

On overall runtime: In all three categories, Funnel Heap terminates faster than the single, fixed size k-merger. In Tables 20 and 21, Funnel Heap attains a significantly lower cache miss rate that justifies its fast execution time.
In Table 22, despite the k-merger achieving its best cache miss rate relative to Tables 20 and 21, it still lags behind Funnel Heap in total execution time.

Table 20: k streams of √k elements each

Merge Type | Streams | Input Per Stream | Insertions, Extractions | Cache Misses | Comparison Count | Runtime (s)
FunnelHeap | 64 | 8 | 512 | 18,495 | 5,900 | 0.…
FunnelHeap | 128 | 11 | 1,408 | 15,603 | 20,217 | 0.…
FunnelHeap | 256 | 16 | 4,096 | 21,100 | 63,524 | 0.…
FunnelHeap | 512 | 22 | 11,264 | 25,771 | 209,359 | 0.…
Kmerger | 64 | 8 | 512 | 367,006 | 2,937 | 0.…
Kmerger | 128 | 11 | 1,408 | 1,171,682 | 9,596 | 2.…
Kmerger | 256 | 16 | 4,096 | 8,206,508 | 32,248 | 21.…
Kmerger | 512 | 22 | 11,264 | 74,409,763 | 100,410 | 132.…

Table 21: k streams with k elements each

Merge Type | Streams | Input Per Stream | Insertions, Extractions | Cache Misses | Comparison Count | Runtime (s)
FunnelHeap | 64 | 64 | 4,096 | 13,482 | 59,641 | 0.…
FunnelHeap | 128 | 128 | 16,384 | 19,135 | 290,088 | 0.…
FunnelHeap | 256 | 256 | 65,536 | 63,775 | 1,378,606 | 0.…
FunnelHeap | 512 | 512 | 262,144 | 129,492 | 6,394,298 | 0.…
Kmerger | 64 | 64 | 4,096 | 354,527 | 24,451 | 0.…
Kmerger | 128 | 128 | 16,384 | 1,935,724 | 114,412 | 2.…
Kmerger | 256 | 256 | 65,536 | 6,975,425 | 523,766 | 17.…
Kmerger | 512 | 512 | 262,144 | 52,426,417 | 2,358,303 | 130.…

Table 22: k streams with k² elements each

Merge Type | Streams | Input Per Stream | Insertions, Extractions | Cache Misses | Comparison Count | Runtime (s)
FunnelHeap | 64 | 4,096 | 262,144 | 169,123 | 6,270,855 | 0.…
FunnelHeap | 128 | 16,384 | 2,097,152 | 1,693,942 | 63,857,189 | 12.…
FunnelHeap | 256 | 65,536 | 16,777,216 | 17,161,816 | 535,670,619 | 180.…
FunnelHeap | 512 | 262,144 | 134,217,728 | 261,091,499 | 4,752,491,629 | 3,….…
Kmerger | 64 | 4,096 | 262,144 | 433,473 | 1,572,732 | 0.…
Kmerger | 128 | 16,384 | 2,097,152 | 5,138,297 | 14,679,774 | 44.…
Kmerger | 256 | 65,536 | 16,777,216 | 39,108,825 | 134,217,167 | 560.…
Kmerger | 512 | 262,144 | 134,217,728 | 265,845,601 | 1,207,958,504 | 3,….…

6. Conclusion

In this paper we presented a comprehensive design and analysis that extends the work in (Abu-Salem et al., 2014) and (Abu-Salem et al., 2015). Alg.
FH-RANK exploits all the features of Funnel Heap for implementing sums of products arising in the Hensel lifting phase of the polytope method, when polynomials are in sparse distributed representation. These features comprise a batched mechanism for chaining replicas, as well as an optimised sequence of insertions and extractions that minimises the size of the priority queue together with the work and cache complexity. The competitive asymptotics are validated by empirical results, which, in addition to asserting the high efficiency of FH-RANK whether or not the data fits in in-core memory, help us derive two further conclusions. First, we confirm that, at a large scale, all polynomial arithmetic employing a priority queue benefits substantially from using Funnel Heap over Binary Heap, even without the proposed mechanisms for chaining and/or optimising the sequence of insertions and extractions. Second, Funnel Heap is confirmed to be superior in practice as a merger when tested against the provably optimal k-merger structure, despite having a higher work complexity. We attribute this to its ability to adapt to merging input streams of fluctuating density, which in turn makes Funnel Heap ideal for performing polynomial arithmetic in the sparse distributed representation, where such fluctuation affects overall performance. This supports our argument that one should adopt the overlapping approach using a single priority queue, as opposed to handling each of the local multiplications separately using a local priority queue, followed by additive merging of all polynomial streams. This conclusion remains valid whether or not expression swell takes place.
7. Acknowledgments
We thank the Lebanese National Council for Scientific Research and the University Research Board – American University of Beirut, for supporting this work.
References
Abu Salem, F., Gao, S., Lauder, A. G., 2004. Factoring polynomials via polytopes. In: Proc. of ISSAC '04. ACM Press, pp. 4–11.

Abu Salem, F. K., El-Harake, K., Gemayel, K., 2014. Factoring sparse bivariate polynomials using the priority queue. In: Proc. of CASC '14. ACM Press, pp. 388–402.

Abu Salem, F. K., El-Harake, K., Gemayel, K., 2015. Cache oblivious sparse polynomial factoring using the funnel heap. In: Proc. of PASCO '15. Vol. 8660 of Lecture Notes in Computer Science. Springer, pp. 7–15.

Arge, L., Bender, M. A., Demaine, E. D., Holland-Minkley, B., Munro, J. I., 2002. Cache-oblivious priority queue and graph algorithm applications. In: Proc. of STOC '02. ACM Press, pp. 268–276.

Bernardin, L., Monagan, M. B., 1997. Efficient multivariate factorization over finite fields. In: Proc. of AAECC '97. Vol. 1255 of Lecture Notes in Computer Science. Springer, pp. 15–28.

Bostan, A., Lecerf, G., Salvy, B., Schost, E., Wiebelt, B., 2004. Complexity issues in bivariate polynomial factorization. In: Proc. of ISSAC '04. ACM Press, pp. 42–49.

Brodal, G. S., Fagerberg, R., 2002a. Cache oblivious distribution sweeping. In: Proc. of ICALP '02. Vol. 2380 of Lecture Notes in Computer Science. Springer-Verlag, pp. 426–438.

Brodal, G. S., Fagerberg, R., 2002b. Funnel heap – a cache oblivious priority queue. In: Proc. of ISAAC '02. Vol. 2518 of Lecture Notes in Computer Science. Springer-Verlag, pp. 219–228.

Brodal, G. S., Fagerberg, R., Meyer, U., Zeh, N., 2004. Cache-oblivious data structures and algorithms for undirected breadth-first search and shortest paths. In: Proc. of SWAT '04. Vol. 3111 of Lecture Notes in Computer Science. Springer, pp. 480–492.

Brodal, G. S., Fagerberg, R., Vinther, K., 2008. Engineering a cache-oblivious sorting algorithm. J. Exp. Alg. 8.

Dementiev, R., Kettner, L., Sanders, P., 2005. STXXL: Standard template library for XXL data sets. In: Proc. of ESA '05. Vol. 3669 of Lecture Notes in Computer Science. Springer, pp. 640–651.

Frigo, M., Leiserson, C. E., Prokop, H., Ramachandran, S., 1999. Cache-oblivious algorithms. In: Proc. of FOCS '99. IEEE Press, pp. 285–297.

Gao, S., Lauder, A. G., 2002. Hensel lifting and bivariate polynomial factorisation over finite fields. Math. Comp. 71, 1663–1676.

Johnson, S. C., 1974. Sparse polynomial arithmetic. ACM SIGSAM Bulletin 8, 63–71.

Moreno Maza, M., 2014. Cache memories, cache complexity.

Monagan, M., Pearce, R., 2007. Polynomial division using dynamic arrays, heaps, and packed exponent vectors. In: Proc. of CASC '07. Vol. 4770 of Lecture Notes in Computer Science. Springer, pp. 295–315.

Monagan, M., Pearce, R., 2009. Parallel sparse polynomial multiplication using heaps. In: Proc. of ISSAC '09. ACM Press, pp. 263–269.

Monagan, M., Pearce, R., 2011. Sparse polynomial pseudo division using a heap. J. Symb. Comp. 46 (7), 807–822.

Sach, B., Clifford, R., 2008. An empirical study of cache-oblivious priority queues and their application to the shortest path problem. CoRR 0802.1026v1.

von Hoeij, M., 2002. Factoring polynomials and the knapsack problem. Journal of Number Theory 95 (2), 167–189.