Accelerated Polynomial Evaluation and Differentiation at Power Series in Multiple Double Precision
Jan Verschelde

University of Illinois at Chicago, Department of Mathematics, Statistics, and Computer Science, 851 S. Morgan St. (m/c 249), Chicago, IL 60607-7045. Email: [email protected]. Supported by the National Science Foundation under grant DMS 1854513.

Abstract
The problem is to evaluate a polynomial in several variables and its gradient at a power series truncated to some finite degree with multiple double precision arithmetic. To compensate for the cost overhead of multiple double precision and power series arithmetic, data parallel algorithms for general purpose graphics processing units are presented. The reverse mode of algorithmic differentiation is organized into a massively parallel computation of many convolutions and additions of truncated power series. Experimental results demonstrate that teraflop performance is obtained in deca double precision with power series truncated at degree 152. The algorithms scale well for increasing precision and increasing degrees.
Keywords. acceleration, convolution, CUDA, differentiation, evaluation, GPU, parallel, polynomial, precision.
1 Introduction

Solving systems of many polynomial equations in several variables is needed in various fields of science and engineering. Numerical continuation methods apply path trackers [15]. A path tracker computes approximations of the solution paths defined by a family of polynomial systems. The paths start at known solutions of easier systems and end at the solutions of the given system. The evaluation and differentiation of polynomials often dominates the computational cost.

The main motivation for this paper is to accelerate a new robust path tracker [17], added recently to PHCpack [19], which requires power series expansions of the solution series of polynomial systems. As shown in [18], double precision may no longer suffice to obtain accurate results, for larger systems, and for longer power series, truncated at higher degrees. The double precision can be extended with double doubles, triple doubles, quad doubles, etc., applying multiple double arithmetic [16]. The goal is to compensate for the computational cost overhead caused by power series and multiple double precision by the application of data parallel algorithms on general purpose graphics processing units (GPUs).

The CUDA programming model (see [12] for an introduction) is applied. The software was developed on five different NVIDIA graphics cards: the C2050, K20C, P100, V100 (on Linux), and the GeForce RTX 2080 (on Windows).

Prior work. Data parallel algorithms for polynomial evaluation and differentiation [9] were first presented by G. Yoffe and the author in [21], using double double and quad double arithmetic of [4] on the host and of [13] on the device. Adding accelerated linear algebra [22] led to an accelerated Newton's method [23], and to accelerated path trackers [24, 25].
Related work.
The authors of [6, 7] describe a GPU implementation of a path tracker, with an application to kinematic synthesis. Instead of an adaptive step size control, paths for which the desired accuracy is not achieved are recomputed with a smaller step size. The source code for the computations of [6, 7] can be found in the appendix of [5].

Accelerating the multiplication of polynomials is reported in [1] and [2, 3, 14]. Those algorithms are applied with exact, modular arithmetic, to polynomials with huge degrees. A GPU-accelerated application of adjoint algorithmic differentiation is implemented in [8] for the gradient computation of cost functions given by computer programs.

In [10], several software packages for high precision arithmetic on GPUs are considered. For the problem of matrix-vector multiplication, the double double arithmetic of CAMPARY [11] performs best. In quad double precision, the performance of the implementation with CAMPARY comes close to the multiple precision proposed by [10].
Contributions.
This paper extends the ideas of [21] to power series and to more levels of multiple double precision, with the code generated by the CAMPARY software [11]. In addition to double, double double, and quad double precision, the algorithms in this paper run also in triple, penta, octo, and deca double precision, extending double precision respectively three, five, eight, and ten times.

In [12], convolutions and scans are explained as parallel patterns. This paper presents novel data staging algorithms. In one addition of two power series, consecutive threads in the block add consecutive coefficients of the series. In one multiplication of two power series, the number of steps equals the degree d at which the series are truncated. The computations in the data parallel algorithms are defined by two sequences of jobs. The first sequence computes all multiplications, for all monomials, while the second sequence encodes the additions of all evaluated monomials. The theoretical speedup of the novel parallel algorithms is a multiple of d.

In deca double precision, on power series truncated at degree 152, teraflop performance is reached on the P100 and the V100. Experimental results show the scalability for increasing degrees and increasing precisions.

2 Convolutions and Additions of Power Series

Consider the product z of two power series x and y, both truncated to the same degree d, so the input consists of two sequences of d+1 coefficients. The output consists of the d+1 coefficients of the product z. In a data parallel algorithm with d+1 threads, thread k will compute the k-th coefficient z_k of z, defined by the formula

    z_k = \sum_{i=0}^{k} x_i y_{k-i},   k = 0, 1, ..., d,                (1)

where x_i is the i-th coefficient of x and y_{k-i} is the (k-i)-th coefficient of y. In the direct application of the formula in (1), every thread performs a different number of computations, which results in thread divergence.

The remedy for this thread divergence is to insert zero numbers before the second vector when threads load the numbers into the shared memory of the block. In the statements below, X, Y, and Z represent the shared memory locations, respectively for the coefficients of x, y, and z. The Y has space for at least 2d+2 numbers. In the data parallel algorithm with zero insertion, thread k executes the following statements, expressed in pseudo code:

1. X_k := x_k
2. Y_k := 0
3. Y_{d+k} := y_k
4. Z_k := X_0 * Y_{d+k}
5. for i from 1 to d do Z_k := Z_k + X_i * Y_{d+k-i}
6. z_k := Z_k

Thread k executes the same statements on different data. In the statements above, the k = 0, 1, ..., d of formula (1) is implicit. As the data parallel version eliminates the outside loop on k, it is expected to run about d times faster than the sequential application of formula (1).

The zero insertion justifies the auxiliary vector Y, but why are X and Z needed? The coefficients x, y, and z reside in the global memory of the device. Access to global memory is slower than access to shared memory. Observe that all threads need access to x. Retrieving x once from global memory and then d+1 times from shared memory is expected to be faster than d+1 times retrieving x from global memory. The same argument applies to the assignments to Z_k. Every thread assigns once to Z_k and then updates Z_k as many as d times. Only at the very end of the algorithm is the value of Z_k in shared memory assigned to the value of z_k in global memory. Another benefit of using vectors in shared memory occurs when one needs to update the same series x or y with the product: replacing z by x or y in (1) requires an auxiliary vector.

All coefficients are stored in consecutive memory locations, promoting efficient memory access. For arrays of doubles, threads with consecutive indices access the corresponding consecutive memory locations. For complex numbers and multiple double numbers, the same efficient memory access is obtained by storing real and imaginary parts of complex numbers in separate arrays and by storing all parts of multiple double numbers in separate arrays.

The other basic operation is the addition of two power series. In the data parallel version, one block of threads adds two power series. If the block has exactly as many threads as the number of coefficients of the power series, then thread k adds the k-th coefficients of the two series. A CUDA sketch of both operations follows. The next two sections elaborate the scheduling of convolution jobs.
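The kernels below give a minimal CUDA sketch of the two basic operations, assuming real double coefficients and one block of d+1 threads per job; the kernel names and the bound D_MAX on the truncation degree are illustrative choices, not taken from the actual PHCpack sources.

    #define D_MAX 152  // illustrative bound on the truncation degree

    // product of two truncated series, with zero insertion
    __global__ void convolute ( const double *x, const double *y,
                                double *z, int d )
    {
       __shared__ double X[D_MAX+1];
       __shared__ double Y[2*D_MAX+2];
       __shared__ double Z[D_MAX+1];

       const int k = threadIdx.x;    // thread k computes z_k

       X[k] = x[k];                  // step 1: load x
       Y[k] = 0.0;                   // step 2: insert zeros
       Y[d+k] = y[k];                // step 3: load y after the zeros
       __syncthreads();              // all loads must finish first

       Z[k] = X[0]*Y[d+k];           // step 4: the term for i = 0
       for(int i=1; i<=d; i++)       // step 5: the other terms
          Z[k] = Z[k] + X[i]*Y[d+k-i];
       z[k] = Z[k];                  // step 6: write z_k to global memory
    }

    // sum of two truncated series: thread k adds the k-th coefficients
    __global__ void add ( const double *x, const double *y,
                          double *z, int d )
    {
       const int k = threadIdx.x;
       if(k <= d) z[k] = x[k] + y[k];
    }

Both kernels would be launched with one block of d+1 threads, e.g. as convolute<<<1,d+1>>>(x,y,z,d). For multiple double coefficients, the double arithmetic is replaced by the operations generated by CAMPARY, with one coefficient array per part.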
3 Monomial Evaluation and Differentiation

Consider a monomial a x_1 x_2 ··· x_n, in the n variables x_1, x_2, ..., x_n, where a is a nonzero power series, truncated to degree d. We want to evaluate and differentiate this monomial at a sequence of n power series z = (z_1, z_2, ..., z_n). All series in z are also truncated to degree d.

For monomials with powers larger than one, e.g. x_1^2 x_2^3, observe that the value of x_1 x_2^2 is not only a factor of the monomial value, but is also a factor in all values of the derivatives. Therefore, we write x_1^2 x_2^3 as a x_1 x_2, where a = x_1 x_2^2. This common factor is then evaluated with a table of powers of the variables.

The statements below assume that n is larger than 2. Each ⋆ represents a convolution of two power series. The n forward products are stored in the n-dimensional array f. The (n-2)-dimensional array b collects the backward products. The other partial derivatives can then be found in the (n-2)-dimensional array c of cross products.

1. f_1 := a ⋆ z_1
2. for j from 2 to n do f_j := f_{j-1} ⋆ z_j
3. b_1 := z_n ⋆ z_{n-1}
4. for j from 2 to n-2 do b_j := b_{j-1} ⋆ z_{n-j}
5. b_{n-2} := b_{n-2} ⋆ a
6. for j from 1 to n-3 do c_j := f_j ⋆ b_{n-2-j}
7. c_{n-2} := f_{n-2} ⋆ z_n

The number of ⋆ operations equals the total amount of auxiliary storage, plus one (as b_{n-2} gets assigned twice): 3n - 3. A sequential sketch of these seven steps is given below.
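The following host-side C++ sketch spells out the seven steps with 0-based indices, mainly to make the index pattern explicit; the type Series with its operator * for the convolution ⋆ and the function name are assumptions for illustration, not the paper's actual data types.

    #include <vector>

    template <typename Series>
    void monomial_evaldiff ( const Series &a, const std::vector<Series> &z,
                             std::vector<Series> &f, std::vector<Series> &b,
                             std::vector<Series> &c )
    {
       const int n = (int) z.size();      // number of variables, n > 2
       f.assign(n, Series());
       b.assign(n-2, Series());
       c.assign(n-2, Series());

       f[0] = a*z[0];                                    // step 1
       for(int j=1; j<n; j++) f[j] = f[j-1]*z[j];        // step 2
       b[0] = z[n-1]*z[n-2];                             // step 3
       for(int j=1; j<n-2; j++) b[j] = b[j-1]*z[n-2-j];  // step 4
       b[n-3] = b[n-3]*a;                                // step 5
       for(int j=0; j<n-3; j++) c[j] = f[j]*b[n-4-j];    // step 6
       c[n-3] = f[n-3]*z[n-1];                           // step 7
    }

After the call, f[n-1] holds the value of the monomial, b[n-3] and f[n-2] hold the derivatives with respect to the first and the last variable, and c[j] holds the derivative with respect to variable j+2.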
In the example below for n = 5, if one expands the left and right operands of ⋆ into their values, one can verify that b_3, c_1, c_2, c_3, and f_4 contain the values of all five partial derivatives. The organization in three columns shows the parallelism. Statements on the same line in (2) can be executed in parallel.

    f_1 := a ⋆ z_1      b_1 := z_5 ⋆ z_4
    f_2 := f_1 ⋆ z_2    b_2 := b_1 ⋆ z_3
    f_3 := f_2 ⋆ z_3    b_3 := b_2 ⋆ z_2    c_1 := f_1 ⋆ b_2
    f_4 := f_3 ⋆ z_4    b_3 := b_3 ⋆ a      c_2 := f_2 ⋆ b_1
    f_5 := f_4 ⋆ z_5                        c_3 := f_3 ⋆ z_5          (2)

For n = 5, three blocks of threads may evaluate and differentiate one monomial in five steps. For n > 5, observe that c_1 needs z_n z_{n-1} ··· z_3, or the value of b_{n-3}. While this observation seems to limit the amount of parallelism, the cross products need not be computed one after the other.
Proposition 3.1. The j-th cross product c_j can be computed after max(j, n-2-j) steps.

Proof. For n = 3, the only cross product is c_1 := f_1 ⋆ z_3, and c_1 can be computed after f_1 has been computed in the first step. After one step, c_1 can be computed for n = 3.

For n > 3, consider c_{n-2} := f_{n-2} ⋆ z_n. The computation of c_{n-2} has to wait for the computation of f_{n-2}, which requires n-2 steps. For j < n-2, c_j := f_j ⋆ b_{n-2-j} and c_j can be computed after f_j and b_{n-2-j} have been computed, which take respectively j and n-2-j steps. Thus, after max(j, n-2-j) steps, c_j can be computed.

Corollary 3.2. Given sufficiently many blocks of threads, monomial evaluation and differentiation takes n steps for n variables.

As the number of convolutions to evaluate and differentiate one monomial in n variables equals 3n - 3, Corollary 3.2 means that 3d is the upper bound on the speedup, where d is the truncation degree of the power series: about three jobs run simultaneously at each step, and within every convolution job d+1 threads are busy. A sketch of the resulting layer assignment is given below.
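The layer assignment implied by Proposition 3.1 and Corollary 3.2 can be made explicit with a few loops. The sketch below prints, for one monomial in n variables, which job belongs to which layer; the function name and output format are ours, for illustration only.

    #include <algorithm>
    #include <iostream>

    void print_layers ( int n ) // assumes a monomial in n > 2 variables
    {
       for(int j=1; j<=n; j++)         // forward products
          std::cout << "f" << j << " : layer " << j << "\n";
       for(int j=1; j<=n-2; j++)       // backward products
          std::cout << "b" << j << " : layer " << j << "\n";
       std::cout << "b" << n-2 << "*a : layer " << n-1 << "\n";
       for(int j=1; j<=n-3; j++)       // cross products, Proposition 3.1
          std::cout << "c" << j << " : layer "
                    << std::max(j, n-2-j) + 1 << "\n";
       std::cout << "c" << n-2 << " : layer " << n-1 << "\n";
    }

For n = 5, the deepest jobs are f_5 in layer five and b_3 ⋆ a and c_3 in layer four, consistent with the n steps of Corollary 3.2.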
4 Polynomial Evaluation and Differentiation

We consider a polynomial p of N monomials in n variables with nonzero coefficients as power series, all truncated to the same degree d. The coefficient of the k-th monomial is denoted as a_k and n_k variables appear in the monomial, for k ranging from 1 to N. The variables in the k-th monomial are defined by the tuple of indices (i_1, i_2, ..., i_{n_k}) with 1 ≤ i_1 < i_2 < ··· < i_{n_k} ≤ n. We want to evaluate and differentiate

    p(x_1, x_2, ..., x_n) = a_0 + \sum_{k=1}^{N} a_k x_{i_1} x_{i_2} ··· x_{i_{n_k}},     (3)

at a sequence of n power series z = (z_1, z_2, ..., z_n). All series in z are also truncated to degree d. The constant term a_0 is not included in the count N.

In the data parallel evaluation and differentiation, every monomial has three separate arrays of forward, backward, and cross products, denoted respectively by f, b, and c. In the example below, the first index of f, b, and c corresponds to the monomial index.

    p = a_0 + a_1 x_1 x_2 x_4 + a_2 x_1 x_3 x_5 x_6 + a_3 x_2 x_3 x_6

    f_{1,1} := a_1 ⋆ z_1        f_{2,1} := a_2 ⋆ z_1           f_{3,1} := a_3 ⋆ z_2
    f_{1,2} := f_{1,1} ⋆ z_2    f_{2,2} := f_{2,1} ⋆ z_3       f_{3,2} := f_{3,1} ⋆ z_3
    f_{1,3} := f_{1,2} ⋆ z_4    f_{2,3} := f_{2,2} ⋆ z_5       f_{3,3} := f_{3,2} ⋆ z_6
                                f_{2,4} := f_{2,3} ⋆ z_6
    b_{1,1} := z_4 ⋆ z_2        b_{2,1} := z_6 ⋆ z_5           b_{3,1} := z_6 ⋆ z_3
    b_{1,1} := b_{1,1} ⋆ a_1    b_{2,2} := b_{2,1} ⋆ z_3       b_{3,1} := b_{3,1} ⋆ a_3
                                b_{2,2} := b_{2,2} ⋆ a_2
    c_{1,1} := f_{1,1} ⋆ z_4    c_{2,1} := f_{2,1} ⋆ b_{2,1}   c_{3,1} := f_{3,1} ⋆ z_6
                                c_{2,2} := f_{2,2} ⋆ z_6                              (4)

The 21 convolutions in (4) are arranged in (5). Statements on the same line can be computed in parallel. (The targets b_{1,1}, b_{3,1}, and b_{2,2} appear twice, because of the updates with the coefficients.)

    f_{1,1}  b_{1,1}  f_{2,1}  b_{2,1}  f_{3,1}  b_{3,1}
    f_{1,2}  b_{1,1}  c_{1,1}  f_{2,2}  b_{2,2}  c_{2,1}  f_{3,2}  b_{3,1}  c_{3,1}
    f_{1,3}  f_{2,3}  b_{2,2}  c_{2,2}  f_{3,3}
    f_{2,4}                                                                          (5)

If 9 thread blocks are available, all 21 convolutions can be computed in 4 steps. The value of p at z and all six partial derivatives are listed below:

    a_0 + f_{1,3} + f_{2,4} + f_{3,3},   b_{1,1} + b_{2,2},   c_{1,1} + b_{3,1},
    c_{2,1} + c_{3,1},   f_{1,2},   c_{2,2},   f_{2,3} + f_{3,2}.                    (6)

If 7 thread blocks are available, all values can be computed in two steps. Two steps are needed for the value of p. The first step computes f_{1,3} := a_0 + f_{1,3} and f_{2,4} := f_{2,4} + f_{3,3} simultaneously. The second step then does f_{1,3} := f_{1,3} + f_{2,4}, so the value of p is in f_{1,3}.

Obviously, as convolutions for different monomials can be computed in parallel, the result in Corollary 3.2 extends directly to polynomials.
Corollary 4.1. Consider a polynomial p in n variables, with N monomials. Let m be the number of variables in that monomial of p that has the largest number of variables. Given sufficiently many blocks of threads, the evaluation and differentiation of p takes m + ⌈log_2(N)⌉ steps.

Figure 1: The data array used to compute the forward, backward, and cross products to evaluate a polynomial and its gradient for the example in (4). Every box represents d+1 doubles for the coefficients of a series truncated at degree d. The boxes hold, in order, a_0, a_1, a_2, a_3, the input series z_1 through z_6, the ten forward products, the four backward products, and the four cross products. The arrows point at the start position of the input series and at the space for the forward, backward, and cross products for every monomial.

To interpret Corollary 4.1, assume every monomial has m variables. Then the total number of convolutions equals N(3m - 3) and the number of additions equals N(m + 1) - n. As in the case of one monomial, the speedup factor of 3d is present. For polynomials with many monomials, for N ≫ m, the upper bound on the speedup is dN/log_2(N).

5 Data Staging

The accelerated polynomial evaluation and differentiation algorithm proceeds in two stages. The first stage computes all convolutions. The second stage adds up the evaluated and differentiated monomials.

As the number of threads in each block matches the number of coefficients in each truncated power series, within each block the natural order of the data follows the coefficient vectors of the power series. In preparation for the launching of the kernels, the staging of the data must be defined.

Extracting the data structures of (3), the input of the algorithm consists of the following:

1. N, the number of monomials;
2. n, the number of variables;
3. d, the degree at which all series are truncated;
4. a_k, a truncated series as the coefficient of the k-th monomial, k = 1, 2, ..., N;
5. (i_1, i_2, ..., i_{n_k}), the indices of the variables in the k-th monomial, where 1 ≤ i_1 < i_2 < ··· < i_{n_k} ≤ n.

The output of the first stage consists, for each monomial, of the forward products f_{k,j}, for j up to n_k, the backward products b_{k,j}, for j up to max(1, n_k - 2), and the cross products c_{k,j}, for j up to max(0, n_k - 2). The max(1, n_k - 2) in the upper bound for b_{k,j} is for the special case n_k = 2, to store z_{i_1} ⋆ a_k.

The output of the first stage is the input of the second stage. The second stage adds, for each monomial k, the last forward product f_{k,n_k} to obtain p(z). The values of the derivatives of the k-th monomial are in f_{k,n_k - 1}, b_{k,n_k - 2}, and the c_{k,j}.

The data parallel algorithm to compute all convolutions is defined by the data layout. The total count of numbers involved in all convolution and addition jobs is

    e = (d+1) ( 1 + N + n + \sum_{k=1}^{N} ( n_k + max(1, n_k - 2) + max(0, n_k - 2) ) ).     (7)

The first factor (d+1) in (7) counts the number of coefficients in all series truncated to degree d. The terms in the second factor in (7) count respectively the constant coefficient a_0, the N coefficients a_k, the n input series in z, the n_k forward, the max(1, n_k - 2) backward, and the max(0, n_k - 2) cross products.
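As a small illustration of (7), assuming standard C++ containers (the function name is ours, not from the PHCpack sources), the count e can be computed as follows:

    #include <algorithm>
    #include <vector>

    // d: truncation degree, n: number of variables, nvr[k] = n_{k+1}
    long long data_count ( int d, int n, const std::vector<int> &nvr )
    {
       long long terms = 1 + (long long) nvr.size() + n; // a_0, a_k, z
       for(int nk : nvr)                                 // f, b, c
          terms += nk + std::max(1, nk - 2) + std::max(0, nk - 2);
       return (long long)(d + 1)*terms;                  // first factor
    }

For the example in (4), with n = 6 and n_k = 3, 4, 3, the second factor evaluates to 28, the number of boxes in Figure 1.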
Figure 1 illustrates the layout of the data vector for the polynomial in (4). The data are in an array A of e doubles. The order of the numbers in A follows the count as in (7): the coefficients a_0 and a_k and the series z are followed by the numbers in the forward, backward, and cross products. Each job is then characterized by a triplet of indices in this A. The first two indices point respectively at the start of the first and the second input. The third index in the triplet defines the start of the output. For example, the triplet for the convolution f_{1,1} := a_1 ⋆ z_1 for the polynomial in (4) is (d+1, 4(d+1), 10(d+1)), for degree d, as the coefficients for a_1 start after the first d+1 coefficients for a_0, z_1 starts after the first four series, and f_{1,1} after the ten series that define the coefficients of the polynomial and the input.

For complex numbers, the data are in two arrays, one for the real parts and the other for the imaginary parts. For m-fold double numbers, there are m data arrays, all following the same layout as for A, described above.

Jobs are placed in layers, following the lines of (5) for the example polynomial. Jobs to compute f_{k,j} and b_{k,j} are at layer j. Following Proposition 3.1, the layer of c_{k,j} is max(j, n_k - 2 - j) + 1. All jobs in the same layer can be executed at the same time. A kernel is launched with as many blocks as the number of jobs in one layer. One convolution job is executed by one block of threads. In addition to the data array A, the kernel is launched with the triplets of coordinates which define each job. Each block of threads extracts its triplet according to its block number.

The coordinates for each convolution job in the first stage of the algorithm depend only on the structure of the monomials and are computed only once. The same holds for the coordinates of the addition jobs in the second stage of the algorithm. Each addition job updates one series with another, so one pair of indices defines one addition job. For example, for the polynomial in (4), the first update f_{1,3} := a_0 + f_{1,3} has coordinates (0, 12(d+1)), as a_0 comes first and f_{1,3} is positioned after 12 series in A, see Figure 1.

For a convolution job j, let t = (t_1(j), t_2(j), t_3(j)) denote the triplet where t_1(j) and t_2(j) respectively are the locations in the data array A of the first and second input, and where t_3(j) is the location in A of the output. Given A and t, we can then symbolically summarize the code for the kernel to compute all convolution jobs at the same layer as

1. B := blockIdx.x
2. (i_1, i_2, i_3) := (t_1(B), t_2(B), t_3(B))
3. A[i_3 : i_3+d+1] := A[i_1 : i_1+d+1] ⋆ A[i_2 : i_2+d+1]

where the block index B corresponds to the index j of the convolution job, and [i : i+d+1] denotes the range of the coefficients in A. Likewise, for an addition job j, let t = (t_1(j), t_2(j)) be the pair of input and update locations in the data array A. Given A and t, the kernel to execute all addition jobs at the same layer is then summarized as

1. B := blockIdx.x
2. (i_1, i_2) := (t_1(B), t_2(B))
3. A[i_2 : i_2+d+1] := A[i_2 : i_2+d+1] + A[i_1 : i_1+d+1]
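In CUDA, the convolution version of this summary becomes a kernel in which every block reads its own triplet; the sketch below reuses the zero-insertion pattern of Section 2 and its illustrative bound D_MAX, with per-layer device arrays t1, t2, t3 of job coordinates (all names are ours, not those of the PHCpack sources).

    __global__ void convjobs ( double *A, const int *t1, const int *t2,
                               const int *t3, int d )
    {
       __shared__ double X[D_MAX+1];
       __shared__ double Y[2*D_MAX+2];

       const int B = blockIdx.x;   // block B executes job B of the layer
       const int k = threadIdx.x;  // thread k makes the k-th coefficient

       X[k] = A[t1[B]+k];          // load the first input series
       Y[k] = 0.0;                 // zero insertion
       Y[d+k] = A[t2[B]+k];        // load the second input series
       __syncthreads();

       double zk = X[0]*Y[d+k];    // accumulate z_k in a register
       for(int i=1; i<=d; i++) zk = zk + X[i]*Y[d+k-i];
       A[t3[B]+k] = zk;            // write the output series
    }

    // host side: one launch per layer, one block per job in the layer,
    // assuming jobs[L] counts the jobs and t1_d[L], t2_d[L], t3_d[L]
    // hold the staged coordinates of layer L on the device:
    // for(int L=1; L<=nbr_layers; L++)
    //    convjobs<<<jobs[L],d+1>>>(A_d,t1_d[L],t2_d[L],t3_d[L],d);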
The data staging algorithm to define the convolution jobs runs through the steps to compute all forward products f_{k,ℓ}, backward products b_{k,ℓ}, and cross products c_{k,ℓ}, for k ranging from 1 to N. The k-th monomial has n_k variables, with indices (i_1, i_2, ..., i_{n_k}). The offsets

    α_k = \sum_{ℓ=1}^{k-1} n_ℓ,   β_k = α_{N+1} + \sum_{ℓ=1}^{k-1} max(1, n_ℓ - 2),   γ_k = β_{N+1} + \sum_{ℓ=1}^{k-1} max(0, n_ℓ - 2)     (8)

mark the positions respectively of f_{k,1}, b_{k,1}, and c_{k,1} in the data array A. For an example, see the arrows in Figure 1. Let J_ℓ denote the set of convolution jobs at layer ℓ. Jobs in J_ℓ can be executed after the jobs in J_{ℓ-1} have finished. For n_k > 2, for k from 1 to N, do:

1. to execute f_{k,1} := a_k ⋆ z_{i_1}:
   t_1 := k(d+1)
   t_2 := (1 + N + i_1 - 1)(d+1)
   t_3 := (1 + N + n + α_k)(d+1)
   J_1 := J_1 ∪ {(t_1, t_2, t_3)}
2. for ℓ from 2 to n_k, to execute f_{k,ℓ} := f_{k,ℓ-1} ⋆ z_{i_ℓ}:
   t_1 := (1 + N + n + α_k + ℓ - 2)(d+1)
   t_2 := (1 + N + i_ℓ - 1)(d+1)
   t_3 := (1 + N + n + α_k + ℓ - 1)(d+1)
   J_ℓ := J_ℓ ∪ {(t_1, t_2, t_3)}
3. to execute b_{k,1} := z_{i_{n_k}} ⋆ z_{i_{n_k - 1}}:
   t_1 := (1 + N + i_{n_k} - 1)(d+1)
   t_2 := (1 + N + i_{n_k - 1} - 1)(d+1)
   t_3 := (1 + N + n + β_k)(d+1)
   J_1 := J_1 ∪ {(t_1, t_2, t_3)}
4. for ℓ from 2 to n_k - 2, to execute b_{k,ℓ} := b_{k,ℓ-1} ⋆ z_{i_{n_k - ℓ}}:
   t_1 := (1 + N + n + β_k + ℓ - 2)(d+1)
   t_2 := (1 + N + i_{n_k - ℓ} - 1)(d+1)
   t_3 := (1 + N + n + β_k + ℓ - 1)(d+1)
   J_ℓ := J_ℓ ∪ {(t_1, t_2, t_3)}
5. to execute b_{k,n_k - 2} := b_{k,n_k - 2} ⋆ a_k:
   t_1 := (1 + N + n + β_k + n_k - 3)(d+1)
   t_2 := k(d+1)
   t_3 := (1 + N + n + β_k + n_k - 3)(d+1)
   J_{n_k - 1} := J_{n_k - 1} ∪ {(t_1, t_2, t_3)}
6. for ℓ from 1 to n_k - 3, to execute c_{k,ℓ} := f_{k,ℓ} ⋆ b_{k,n_k - 2 - ℓ}:
   t_1 := (1 + N + n + α_k + ℓ - 1)(d+1)
   t_2 := (1 + N + n + β_k + n_k - 3 - ℓ)(d+1)
   t_3 := (1 + N + n + γ_k + ℓ - 1)(d+1)
   L := max(ℓ, n_k - 2 - ℓ) + 1
   J_L := J_L ∪ {(t_1, t_2, t_3)}
7. to execute c_{k,n_k - 2} := f_{k,n_k - 2} ⋆ z_{i_{n_k}}:
   t_1 := (1 + N + n + α_k + n_k - 3)(d+1)
   t_2 := (1 + N + i_{n_k} - 1)(d+1)
   t_3 := (1 + N + n + γ_k + n_k - 3)(d+1)
   J_{n_k - 1} := J_{n_k - 1} ∪ {(t_1, t_2, t_3)}

Sets are natural data structures in the mathematical description of the data staging algorithm, as the jobs in one J_ℓ can be executed in any order. In the implementation, the jobs in the same layer are stored in three integer arrays. The block index B is then used to get the coordinates of the convolution job performed by block B.

The pairs of indices for the addition jobs are defined in a recursive manner, following the order of the tree summation algorithm. To compute the value of the polynomial, add the forward products f_{k,n_k} of the monomials. For stride L and level ℓ, apply the following:

to execute f_{k,n_k} := f_{k,n_k} + f_{k-L,n_{k-L}}:
   t_1 := (1 + N + n + α_{k-L} + n_{k-L} - 1)(d+1)
   t_2 := (1 + N + n + α_k + n_k - 1)(d+1)
   J_ℓ := J_ℓ ∪ {(t_1, t_2)}

recursively, starting at L = ⌊N/2⌋ and level ℓ = log_2(N) (assuming N = 2^ℓ), dividing L by two in each step, until L = 1. For k = L in the formula f_{k,n_k} := f_{k,n_k} + f_{k-L,n_{k-L}}, replace f_{k-L,n_{k-L}} by a_0. The same recursive formula is applied to sum the first backward products and all cross products to obtain the gradient. A generic sketch of this tree summation is shown below.
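The host-side sketch below shows the tree summation pattern in its generic form; it sums plain doubles instead of series and does not reproduce the exact index mapping of the staging above, but makes the ⌈log_2(N)⌉ levels visible.

    #include <vector>

    void tree_sum ( std::vector<double> &v ) // v[0] holds the sum at the end
    {
       int cnt = (int) v.size();
       while(cnt > 1)                       // ceil(log2(N)) levels
       {
          const int half = (cnt + 1)/2;     // the stride for this level
          for(int k=0; k+half<cnt; k++)     // all pairs are independent
             v[k] = v[k] + v[k+half];       // and can run in parallel
          cnt = half;
       }
    }

For the value of p in the example (4), the four terms a_0, f_{1,3}, f_{2,4}, f_{3,3} are summed in two levels, as described at the end of Section 4.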
6 Computational Results

6.1 Experimental Setup

Table 1 summarizes the characteristics of each GPU, with the focus on the core counts and the processor speeds, because the problem is compute bound. The first four GPUs in Table 1 are housed in a Linux workstation, running CentOS. The fifth GPU resides in a Windows laptop.

Table 1: Characteristics of the NVIDIA GPUs.

The code was compiled with nvcc -O3 on the device, with the code for the host compiled by gcc -O3 on the Linux computers, and by the community edition of Microsoft Visual Studio on the Windows laptop. While running the same software on all five GPUs is obviously convenient, more advanced features of newer devices are not utilized. The main importance for the evaluation of our software is that an unfair comparison with the CPU is avoided. Taking into account the double precision peak performance of the P100 and the V100 (4.7 TFLOPS and 7.9 TFLOPS respectively), we may expect the V100 to be about 1.68 times faster than the P100.

The first test polynomial p_1 is a function of 16 variables. Its evaluation adds to the constant term all 1,820 monomials that are the products of exactly four variables. The evaluation requires 16,380 convolutions and 9,084 additions. As each monomial has no more than four variables, the 16,380 convolutions are performed in four kernel launches of respectively 3,640, 5,460, 5,460, and 1,820 blocks. The execution of the 9,084 additions requires 11 kernel launches of respectively 4,542, 2,279, 1,140, 562, 281, 140, 78, 39, 20, 2, and 1 blocks. The second test polynomial p_2 is constructed to require many more convolutions than additions, respectively 24,192 versus 8,192. To evaluate the third polynomial p_3, as many convolutions as additions are required: 24,256. Table 2 lists the characteristics of the test polynomials. Compared to p_1, p_2 has fewer monomials but each monomial has many more variables; whereas p_3 has many more monomials, but each monomial has only two variables.

           n    m      N  #convolutions  #additions
    p_1   16    4  1,820         16,380       9,084
    p_2  128   64    128         24,192       8,192
    p_3  128    2  8,128         24,256      24,256

Table 2: For each polynomial, n is the total number of variables, m is the number of variables per monomial, and N is the number of monomials (not counting the constant term). The last two columns list the number of convolution and addition jobs.

To examine the scalability of our problem, experiments are run for increasing degrees of truncation and for increasing levels of precision. Does the shape of the test polynomials influence the execution times?

6.2 Performance

For each run, four times are reported. The elapsed times of the kernel launches are measured by cudaEventElapsedTime and expressed in milliseconds. The first two times are the sums of all elapsed times spent by all kernels, respectively for all convolutions and all additions. The third time is the sum of the first two times. Each kernel launch involves the memory transfer of the index vectors that define the coordinates of the jobs in the data arrays. The fourth reported time is the wall clock time, which also includes this memory transfer. What is not included in the times is the transfer of the input data arrays from the host to the device and of the output data arrays from the device to the host.

Table 3 summarizes execution times to evaluate the first test polynomial p_1 at a power series truncated to degree 152 in deca double precision. This degree is the largest one block of threads can manage because of the limitation of the size of shared memory, which is the same on all five devices.

                    C2050     K20C      P100     V100  RTX 2080
    convolution  12947.26  11290.22  1060.03   634.29  10002.32
    addition        10.72     11.13     1.37     0.77      5.01
    sum          12957.98  11301.35  1061.40   635.05  10007.34
    wall clock   12964.00  11309.00  1066.00   640.00  10024.00

Table 3: Evaluating p_1 for degree d = 152 in deca double precision. The last line is the wall clock time for all convolution and addition kernels. All units are milliseconds.
The ratio 12964/640 ≈ 20.26 is the speedup of the most recent V100 over the oldest C2050. Compare the ratio of the wall clock times for P100 over V100 in Table 3, 1066/640 ≈ 1.67, with the ratio of the theoretical double precision peak performance of the V100 over the P100: 7.9/4.7 ≈ 1.68.

Consider the number of double float operations needed to evaluate p_1 at series truncated at degree d = 152 in deca double precision. Following the counts in [20], one addition in deca double precision requires 139 additions and 258 subtractions of doubles, while one deca double multiplication requires 952 additions, 1743 subtractions, and 394 multiplications of doubles. One convolution with zero insertion on series truncated at degree d requires (d+1)^2 multiplications and d(d+1) additions. One addition of two series truncated at degree d requires d+1 additions. So we have 16,380(d+1)^2 multiplications and 16,380 d(d+1) + 9,084(d+1) additions in deca double precision. One multiplication and one addition in deca double precision require respectively 3089 and 397 double operations. Then the 16,380(d+1)^2 multiplications evaluate to 1,184,444,368,380 and the 16,380 d(d+1) + 9,084(d+1) additions to 151,782,283,404 double float operations. In total, in 1.066 seconds the P100 performed 1,336,226,651,784 double float operations, reaching a performance of about 1.25 TFLOPS.
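The arithmetic in the previous paragraph can be checked with a few lines of C++ (a verification we add for illustration; the constants are the counts quoted above):

    #include <cstdio>

    int main ( void )
    {
       const long long d = 152, cnv = 16380, add = 9084;
       const long long mul_ops = 952 + 1743 + 394;  // 3089 doubles per *
       const long long add_ops = 139 + 258;         // 397 doubles per +
       const long long muls = cnv*(d+1)*(d+1);      // multiplications
       const long long adds = cnv*d*(d+1) + add*(d+1); // additions
       const long long flops = muls*mul_ops + adds*add_ops;
       printf("%lld double operations\n", flops);   // 1,336,226,651,784
       printf("%.2f TFLOPS\n", flops/1.066/1.0e12); // about 1.25 on P100
       return 0;
    }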
Another observation from Table 3 is the tiny amount of time spent by the addition kernels, when compared to the convolution kernels, for the V100: 0.77 versus 634.29. The convolution is quadratic in the degree d, whereas the addition is linear in d. As every block has d+1 threads, the addition finishes in one single step, whereas there are still d steps in the convolutions.

Table 4 lists the execution times for p_2 and p_3 on the P100 and V100, to verify if the shape of the test polynomial would influence the conclusions on p_1.

                      p_2                p_3
                 P100     V100      P100    V100
    convolution  1700.49  1115.03  1566.58  926.53
    addition        1.24     0.67     3.43    1.92
    sum          1701.72  1115.71  1570.01  928.45
    wall clock   1729.00  1142.00  1583.00  941.00

Table 4: Evaluating p_2 and p_3 for degree d = 152 in deca double precision. The last line is the wall clock time for all convolution and addition kernels. All units are milliseconds.

The ratios of the wall clock times on the P100 over the V100 for p_2 and p_3 are respectively 1729/1142 ≈ 1.51 and 1583/941 ≈ 1.68. For p_2, the factor 1.51 is not as high as expected. One probable cause is that the number of convolution jobs in the first 31 layers equals 256. This number equals the number of blocks in one kernel launch. The numbers of streaming multiprocessors of the P100 and V100 equal respectively 56 and 80. The number of 256 blocks in one launch does not occupy the V100 as much as the P100.

6.3 Scalability

The plots in this section visualize data in Tables 5, 6, and 7, which contain times on the three test polynomials, on the V100. These raw data sets are in the appendix.

As the time spent by all addition kernels is less than one millisecond for p_1, Figure 2 shows the relative cost of the multiple doubles versus doubles. The cost starts to increase once the degrees become larger than the warp size. For all precisions, the cost at degree 127 is less than twice the cost at degree 63.

Figure 2: Times spent by all addition kernels when evaluating p_1 and its gradient at power series truncated at increasing degrees 0, 8, 15, 31, 63, 95, 127, and 152, for seven precisions: double (1d), 2d, 3d, 4d, 5d, 8d, and 10d.

Figure 3 shows the sum of the times spent by all addition kernels, for the three test polynomials, for power series truncated at degree 152, for all seven precisions. Although p_3 has 8,128 monomials and p_2 has only 128, the increase in addition times for p_3 is at most three times as much as for p_2. The addition for p_3 happens with 12 kernel launches, while the addition for p_2 has 7 kernel launches.

Figure 3: Times spent by all addition kernels when evaluating p_1, p_2, p_3 and their gradients at power series truncated at degree 152, for seven precisions: double (1d), 2d, 3d, 4d, 5d, 8d, and 10d.

The percentage of time spent by all kernels over the wall clock time is visualized in Figure 4. For double precision, the wall clock time dominates (the percentage of the time spent by all kernels is less than 10%), although the time is also less than a tenth of a millisecond. This percentage climbs for higher precisions. In triple double precision, the time spent on all kernels dominates the wall clock time. For octo and deca double precision, this percentage is more than 95%.

Figure 4: Percentage of the time spent by all kernels over the wall clock time when evaluating p_1, p_2, p_3 and their gradients at power series truncated at degree 152, for seven precisions: double (1d), 2d, 3d, 4d, 5d, 8d, and 10d.

As the precision increases, the problem becomes more and more compute bound. For degree 191, in Table 5, for p_1, the wall clock times in double, double double, quad double, and octo double precision are respectively 6, 14, 95, and 449 seconds. The cost overhead factor of double double over double is typically a factor of about five, whereas here we observe 14/6 ≈ 2.33. The other observed cost overhead factors are 95/14 ≈ 6.79 and 449/95 ≈ 4.72. In Figure 5, the evolution of the logarithmic wall clock time is plotted.

Figure 5: The 2-logarithm of the wall clock times to evaluate and differentiate p_1, p_2, p_3 in double (1d), double double (2d), quad double (4d), and octo double (8d) precision, for power series truncated at degree 191.

If the number of coefficients in a truncated series doubles from 32 to 64, and from 64 to 128, then one would expect the observed wall clock times to quadruple, as the cost of the convolutions is O(d^2) for the truncation degree d. As shown in Figure 6, the wall clock times only double, as the difference between the bars in the 2-log times is about one.

Figure 6: The 2-logarithm of the wall clock times to evaluate and differentiate p_1 in quad double (4d), penta double (5d), octo double (8d), and deca double (10d) precision, for power series truncated at degrees 31, 63, and 127.

7 Conclusions

The evaluation and differentiation of a polynomial in n variables at power series truncated at some finite degree d requires a number of convolution jobs proportional to the number of variables per monomial and the number N of monomials. The convolution jobs are arranged in layers of jobs that can be executed simultaneously. A scan performs the N addition jobs in ⌈log_2(N)⌉ steps. For polynomials where N dominates the number of variables per monomial, the theoretical speedup is bounded by dN/log_2(N).

Data staging algorithms define the coordinates for the convolution and the addition jobs. Speedup factors comparing the V100 and P100 are close to the ratio of their theoretical peak performance. Experimental results show that teraflop performance is obtained. The accelerated algorithms scale well for increasing degrees and precisions. GPUs are well suited to compensate for the overhead of power series arithmetic and multiple double precision.

References
[1] P. Emeliyanenko. Efficient multiplication of polynomials on graphics hardware. In Y. Dou, R. Gruber, and J.M. Joller, editors, Advanced Parallel Processing Technologies. 8th International Symposium, APPT 2009, Rapperswil, Switzerland, August 2009, volume 5737 of Lecture Notes in Computer Science, pages 134–149. Springer-Verlag, 2009.

[2] S.A. Haque and M.M. Maza. Plain polynomial arithmetic on GPU. In High Performance Computing Symposium (HPCS 2012), Journal of Physics: Conference Series, 385, 2012.

[3] S.A. Haque, X. Li, F. Mansouri, M.M. Maza, W. Pan, and N. Xie. Dense arithmetic over finite fields with the CUMODP library. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, volume 8592 of Lecture Notes in Computer Science, pages 725–732. Springer-Verlag, 2014.

[4] Y. Hida, X.S. Li, and D.H. Bailey. Algorithms for quad-double precision floating point arithmetic. In Proceedings of the 15th IEEE Symposium on Computer Arithmetic (Arith-15 2001), pages 155–162. IEEE Computer Society, 2001.

[5] J. Glabe. Numerical Continuation on a GPU for Kinematic Synthesis. PhD thesis, University of California, Irvine, 2020.

[6] J. Glabe and J.M. McCarthy. A GPU homotopy path tracker and end game for mechanism synthesis. In Proceedings of the 2020 USCToMM Symposium on Mechanical Systems and Robotics, pages 206–215. Springer, 2020.

[7] J. Glabe and J.M. McCarthy. Numerical continuation on a graphical processing unit for kinematic synthesis. Journal of Computing and Information Science in Engineering.

[8] F. Gremse, A. Höfter, L. Razik, F. Kiessling, and U. Naumann. GPU-accelerated adjoint algorithmic differentiation. Computer Physics Communications, 200, pages 300–311, 2016.

[9] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.

[10] K. Isupov and V. Knyazkov. Multiple-precision matrix-vector multiplication on graphics processing units. Program Systems: Theory and Applications.

[11] M. Joldes, J.-M. Muller, V. Popescu, and W. Tucker. CAMPARY: CUDA multiple precision arithmetic library and applications. In Mathematical Software – ICMS 2016, the 5th International Conference on Mathematical Software, pages 232–240. Springer-Verlag, 2016.

[12] D.B. Kirk and W.W. Hwu. Programming Massively Parallel Processors. A Hands-on Approach. Morgan Kaufmann, second edition, 2013.

[13] M. Lu, B. He, and Q. Luo. Supporting extended precision on graphics processors. In A. Ailamaki and P.A. Boncz, editors, Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN 2010), June 7, 2010, Indianapolis, Indiana, pages 19–26, 2010.

[14] M.M. Maza and W. Pan. Fast polynomial multiplication on a GPU. Journal of Physics: Conference Series, 256, 2010. High Performance Computing Symposium (HPCS 2010), 5-9 June 2010, Victoria College, University of Toronto, Canada.

[15] A. Morgan. Solving Polynomial Systems using Continuation for Engineering and Scientific Problems, volume 57 of Classics in Applied Mathematics. SIAM, 2009.

[16] J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefevre, G. Melquiond, N. Revol, and S. Torres. Handbook of Floating-Point Arithmetic. Second edition, Springer-Verlag, 2018.

[17] S. Telen, M. Van Barel, and J. Verschelde. A robust numerical path tracking algorithm for polynomial homotopy continuation. SIAM Journal on Scientific Computing.

[18] S. Telen, M. Van Barel, and J. Verschelde. Robust numerical tracking of one path of a polynomial homotopy on parallel shared memory computers. In Proceedings of the 22nd International Workshop on Computer Algebra in Scientific Computing (CASC 2020), volume 12291 of Lecture Notes in Computer Science, pages 563–582. Springer-Verlag, 2020.

[19] J. Verschelde. Algorithm 795: PHCpack: A general-purpose solver for polynomial systems by homotopy continuation. ACM Transactions on Mathematical Software, 25(2):251–276, 1999.

[20] J. Verschelde. arXiv preprint, http://arxiv.org/abs/2012.06607.

[21] J. Verschelde and G. Yoffe. Evaluating polynomials in several variables and their derivatives on a GPU computing processor. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops (PDSEC 2012), pages 1391–1399. IEEE Computer Society, 2012.

[22] J. Verschelde and G. Yoffe. Orthogonalization on a general purpose graphics processing unit with double double and quad double arithmetic. In Proceedings of the 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops (PDSEC 2013), pages 1373–1380. IEEE Computer Society, 2013.

[23] J. Verschelde and X. Yu. GPU acceleration of Newton's method for large systems of polynomial equations in double double and quad double arithmetic. In Proceedings of the 16th IEEE International Conference on High Performance Computing and Communication (HPCC 2014), pages 161–164. IEEE Computer Society, 2014.

[24] J. Verschelde and X. Yu. Accelerating polynomial homotopy continuation on a graphics processing unit with double double and quad double arithmetic. In J.-G. Dumas and E.L. Kaltofen, editors, Proceedings of the 7th International Workshop on Parallel Symbolic Computation (PASCO 2015), July 10-11 2015, Bath, United Kingdom, pages 109–118. ACM, 2015.

[25] J. Verschelde and X. Yu. Tracking many solution paths of a polynomial homotopy on a graphics processing unit in double double and quad double arithmetic. In Proceedings of the 17th IEEE International Conference on High Performance Computing and Communication (HPCC 2015), pages 371–376. IEEE Computer Society, 2015.
Appendix
The source code is available on GitHub, under the GNU GPL license, as part of the code of PHCpack. Tables 5, 6, and 7 contain times on the three test polynomials, on the V100.

Table 5: Times in milliseconds to evaluate and differentiate p_1, for increasing degree d, and for increasing precision.

Table 6: Times in milliseconds to evaluate and differentiate p_2, for increasing degree d, and for increasing precision.

Table 7: Times in milliseconds to evaluate and differentiate p_3, for increasing degree d, and for increasing precision.