Accelerated Polynomial Evaluation and Differentiation at Power Series in Multiple Double Precision
Jan Verschelde

University of Illinois at Chicago, Department of Mathematics, Statistics, and Computer Science, 851 S. Morgan St. (m/c 249), Chicago, IL 60607-7045. Email: [email protected]. Supported by the National Science Foundation under grant DMS 1854513.

Abstract
The problem is to evaluate a polynomial in several variables and its gradient at a power series truncated to some finite degree with multiple double precision arithmetic. To compensate for the cost overhead of multiple double precision and power series arithmetic, data parallel algorithms for general purpose graphics processing units are presented. The reverse mode of algorithmic differentiation is organized into a massively parallel computation of many convolutions and additions of truncated power series. Experimental results demonstrate that teraflop performance is obtained in deca double precision with power series truncated at degree 152. The algorithms scale well for increasing precision and increasing degrees.
Keywords. acceleration, convolution, CUDA, differentiation, evaluation, GPU, parallel, polynomial, precision.
1 Introduction

Solving systems of many polynomial equations in several variables is needed in various fields of science and engineering. Numerical continuation methods apply path trackers [15]. A path tracker computes approximations of the solution paths defined by a family of polynomial systems. The paths start at known solutions of easier systems and end at the solutions of the given system. The evaluation and differentiation of polynomials often dominates the computational cost.

The main motivation for this paper is to accelerate a new robust path tracker [17], added recently to PHCpack [19], which requires power series expansions of the solution series of polynomial systems. As shown in [18], double precision may no longer suffice to obtain accurate results, for larger systems, and for longer power series, truncated at higher degrees. The double precision can be extended with double doubles, triple doubles, quad doubles, etc., applying multiple double arithmetic [16]. The goal is to compensate for the computational cost overhead caused by power series and multiple double precision by the application of data parallel algorithms on general purpose graphics processing units (GPUs).

The CUDA programming model (see [12] for an introduction) is applied. The software was developed on five different NVIDIA graphics cards: the C2050, K20C, P100, V100 (on Linux), and the GeForce RTX 2080 (on Windows).

Prior work. Data parallel algorithms for polynomial evaluation and differentiation [9] were first presented by G. Yoffe and the author in [21], using double double and quad double arithmetic of [4] on the host and of [13] on the device. Adding accelerated linear algebra [22] led to an accelerated Newton's method [23], and to accelerated path trackers [24, 25].
Related work.
The authors of [6, 7] describe a GPU implementation of a path tracker, with an application to kinematic synthesis. Instead of an adaptive step size control, paths for which the desired accuracy is not achieved are recomputed with a smaller step size. The source code for the computations of [6, 7] can be found in the appendix of [5].

Accelerating the multiplication of polynomials is reported in [1] and [2, 3, 14]. Those algorithms are applied with exact, modular arithmetic, to polynomials with huge degrees. A GPU-accelerated application of adjoint algorithmic differentiation is implemented in [8] for the gradient computation of cost functions given by computer programs.

In [10], several software packages for high precision arithmetic on GPUs are considered. For the problem of matrix-vector multiplication, the double double arithmetic of CAMPARY [11] performs best. In quad double precision, the performance of the implementation with CAMPARY comes close to the multiple precision proposed by [10].
Contributions.
This paper extends the ideas of [21] to power series and to more levels of multiple double precision, with the code generated by the CAMPARY software [11]. In addition to double, double double, and quad double precision, the algorithms in this paper run also in triple, penta, octo, and deca double precision, extending double precision respectively three, five, eight, and ten times.

In [12], convolutions and scans are explained as parallel patterns. This paper presents novel data staging algorithms. In one addition of two power series, consecutive threads in the block add consecutive coefficients of the series. In one multiplication of two power series, the number of steps equals the degree d at which the series are truncated. The computations in the data parallel algorithms are defined by two sequences of jobs. The first sequence computes all multiplications, for all monomials, while the second sequence encodes the additions of all evaluated monomials. The theoretical speedup of the novel parallel algorithms is a multiple of d.

In deca double precision, on power series truncated at degree 152, teraflop performance is reached on the P100 and the V100. Experimental results show the scalability for increasing degrees and increasing precisions.

2 Convolutions and Additions of Power Series

Consider the product z of two power series x and y, both truncated to the same degree d, so the input consists of two sequences of d+1 coefficients. The output consists of the d+1 coefficients of the product z. In a data parallel algorithm with d+1 threads, thread k will compute the k-th coefficient z_k of z, defined by the formula

    z_k = \sum_{i=0}^{k} x_i y_{k-i},   k = 0, 1, ..., d,                (1)

where x_i is the i-th coefficient of x and y_{k-i} is the (k-i)-th coefficient of y. In the direct application of the formula in (1), every thread performs a different number of computations, which results in thread divergence.

The remedy for this thread divergence is to insert zero numbers before the second vector when threads load the numbers into the shared memory of the block. In the statements below, X, Y, and Z represent the shared memory locations, respectively for the coefficients of x, y, and z. The Y has space for at least 2d+2 numbers. In the data parallel algorithm with zero insertion, thread k executes the following statements, expressed in pseudo code:

1. X_k := x_k
2. Y_k := 0
3. Y_{d+k} := y_k
4. Z_k := X_0 * Y_{d+k}
5. for i from 1 to d do Z_k := Z_k + X_i * Y_{d+k-i}
6. z_k := Z_k

Thread k executes the same statements on different data. In the statements above, the k = 0, 1, ..., d of formula (1) is implicit. As the data parallel version eliminates the outside loop on k, it is expected to run about d times faster than the sequential application of formula (1).

The zero insertion justifies the auxiliary vector Y, but why are X and Z needed? The coefficients x, y, and z reside in the global memory of the device. Access to global memory is slower than access to shared memory. Observe that all threads need access to x. Retrieving x once from global memory and then d+1 times from shared memory is expected to be faster than d+1 times retrieving x from global memory. The same argument applies to the assignments to Z_k. Every thread assigns once to Z_k and then updates Z_k as many as d times. Only at the very end of the algorithm is the value of Z_k in shared memory assigned to the value of z_k in global memory. Another benefit of using vectors in shared memory occurs when one needs to update the same series x or y with the product: replacing z by x or y in (1) requires an auxiliary vector.

All coefficients are stored in consecutive memory locations, promoting efficient memory access. For arrays of doubles, threads with consecutive indices access the corresponding consecutive memory locations. For complex numbers and multiple double numbers, the same efficient memory access is obtained by storing real and imaginary parts of complex numbers in separate arrays and by storing all parts of multiple double numbers in separate arrays.

The other basic operation is the addition of two power series. In the data parallel version, one block of threads adds two power series. If the block has exactly as many threads as the number of coefficients of the power series, then thread k adds the k-th coefficients of the two series. A CUDA sketch of both operations follows. The next two sections elaborate the scheduling of convolution jobs.
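The kernels below give a minimal CUDA sketch of the two basic operations, assuming real double coefficients and one block of d+1 threads per job; the kernel names and the bound D_MAX on the truncation degree are illustrative choices, not taken from the actual PHCpack sources.

    #define D_MAX 152  // illustrative bound on the truncation degree

    // product of two truncated series, with zero insertion
    __global__ void convolute ( const double *x, const double *y,
                                double *z, int d )
    {
       __shared__ double X[D_MAX+1];
       __shared__ double Y[2*D_MAX+2];
       __shared__ double Z[D_MAX+1];

       const int k = threadIdx.x;    // thread k computes z_k

       X[k] = x[k];                  // step 1: load x
       Y[k] = 0.0;                   // step 2: insert zeros
       Y[d+k] = y[k];                // step 3: load y after the zeros
       __syncthreads();              // all loads must finish first

       Z[k] = X[0]*Y[d+k];           // step 4: the term for i = 0
       for(int i=1; i<=d; i++)       // step 5: the other terms
          Z[k] = Z[k] + X[i]*Y[d+k-i];
       z[k] = Z[k];                  // step 6: write z_k to global memory
    }

    // sum of two truncated series: thread k adds the k-th coefficients
    __global__ void add ( const double *x, const double *y,
                          double *z, int d )
    {
       const int k = threadIdx.x;
       if(k <= d) z[k] = x[k] + y[k];
    }

Both kernels would be launched with one block of d+1 threads, e.g. as convolute<<<1,d+1>>>(x,y,z,d). For multiple double coefficients, the double arithmetic is replaced by the operations generated by CAMPARY, with one coefficient array per part.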
3 Monomial Evaluation and Differentiation

Consider a monomial a x_1 x_2 ··· x_n, in the n variables x_1, x_2, ..., x_n, where a is a nonzero power series, truncated to degree d. We want to evaluate and differentiate this monomial at a sequence of n power series z = (z_1, z_2, ..., z_n). All series in z are also truncated to degree d.

For monomials with powers larger than one, e.g. x_1^2 x_2^3, observe that the value of x_1 x_2^2 is not only a factor of the monomial value, but is also a factor in all values of the derivatives. Therefore, we write x_1^2 x_2^3 as a x_1 x_2, where a = x_1 x_2^2. This common factor is then evaluated with a table of powers of the variables.

The statements below assume that n is larger than 2. Each ⋆ represents a convolution of two power series. The n forward products are stored in the n-dimensional array f. The (n-2)-dimensional array b collects the backward products. The other partial derivatives can then be found in the (n-2)-dimensional array c of cross products.

1. f_1 := a ⋆ z_1
2. for j from 2 to n do f_j := f_{j-1} ⋆ z_j
3. b_1 := z_n ⋆ z_{n-1}
4. for j from 2 to n-2 do b_j := b_{j-1} ⋆ z_{n-j}
5. b_{n-2} := b_{n-2} ⋆ a
6. for j from 1 to n-3 do c_j := f_j ⋆ b_{n-2-j}
7. c_{n-2} := f_{n-2} ⋆ z_n

The number of ⋆ operations equals the total amount of auxiliary storage, plus one (as b_{n-2} gets assigned twice): 3n - 3. A sequential sketch of these seven steps is given below.
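The following host-side C++ sketch spells out the seven steps with 0-based indices, mainly to make the index pattern explicit; the type Series with its operator * for the convolution ⋆ and the function name are assumptions for illustration, not the paper's actual data types.

    #include <vector>

    template <typename Series>
    void monomial_evaldiff ( const Series &a, const std::vector<Series> &z,
                             std::vector<Series> &f, std::vector<Series> &b,
                             std::vector<Series> &c )
    {
       const int n = (int) z.size();      // number of variables, n > 2
       f.assign(n, Series());
       b.assign(n-2, Series());
       c.assign(n-2, Series());

       f[0] = a*z[0];                                    // step 1
       for(int j=1; j<n; j++) f[j] = f[j-1]*z[j];        // step 2
       b[0] = z[n-1]*z[n-2];                             // step 3
       for(int j=1; j<n-2; j++) b[j] = b[j-1]*z[n-2-j];  // step 4
       b[n-3] = b[n-3]*a;                                // step 5
       for(int j=0; j<n-3; j++) c[j] = f[j]*b[n-4-j];    // step 6
       c[n-3] = f[n-3]*z[n-1];                           // step 7
    }

After the call, f[n-1] holds the value of the monomial, b[n-3] and f[n-2] hold the derivatives with respect to the first and the last variable, and c[j] holds the derivative with respect to variable j+2.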
In the example below for n = 5, if one expands the left and right operands of ⋆ into their values, one can verify that b_3, c_1, c_2, c_3, and f_4 contain the values of all five partial derivatives. The organization in three columns shows the parallelism. Statements on the same line in (2) can be executed in parallel.

    f_1 := a ⋆ z_1      b_1 := z_5 ⋆ z_4
    f_2 := f_1 ⋆ z_2    b_2 := b_1 ⋆ z_3
    f_3 := f_2 ⋆ z_3    b_3 := b_2 ⋆ z_2    c_1 := f_1 ⋆ b_2
    f_4 := f_3 ⋆ z_4    b_3 := b_3 ⋆ a      c_2 := f_2 ⋆ b_1
    f_5 := f_4 ⋆ z_5                        c_3 := f_3 ⋆ z_5          (2)

For n = 5, three blocks of threads may evaluate and differentiate one monomial in five steps. For n > 5, observe that c_1 needs z_n z_{n-1} ··· z_3, or the value of b_{n-3}. While this observation seems to limit the amount of parallelism, the cross products need not be computed one after the other.
Proposition 3.1. The j-th cross product c_j can be computed after max(j, n-2-j) steps.

Proof. For n = 3, the only cross product is c_1 := f_1 ⋆ z_3, and c_1 can be computed after f_1 has been computed in the first step. After one step, c_1 can be computed for n = 3.

For n > 3, consider c_{n-2} := f_{n-2} ⋆ z_n. The computation of c_{n-2} has to wait for the computation of f_{n-2}, which requires n-2 steps. For j < n-2, c_j := f_j ⋆ b_{n-2-j} and c_j can be computed after f_j and b_{n-2-j} have been computed, which take respectively j and n-2-j steps. Thus, after max(j, n-2-j) steps, c_j can be computed.

Corollary 3.2. Given sufficiently many blocks of threads, monomial evaluation and differentiation takes n steps for n variables.

As the number of convolutions to evaluate and differentiate one monomial in n variables equals 3n - 3, Corollary 3.2 means that 3d is the upper bound on the speedup, where d is the truncation degree of the power series: about three jobs run simultaneously at each step, and within every convolution job d+1 threads are busy. A sketch of the resulting layer assignment is given below.
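The layer assignment implied by Proposition 3.1 and Corollary 3.2 can be made explicit with a few loops. The sketch below prints, for one monomial in n variables, which job belongs to which layer; the function name and output format are ours, for illustration only.

    #include <algorithm>
    #include <iostream>

    void print_layers ( int n ) // assumes a monomial in n > 2 variables
    {
       for(int j=1; j<=n; j++)         // forward products
          std::cout << "f" << j << " : layer " << j << "\n";
       for(int j=1; j<=n-2; j++)       // backward products
          std::cout << "b" << j << " : layer " << j << "\n";
       std::cout << "b" << n-2 << "*a : layer " << n-1 << "\n";
       for(int j=1; j<=n-3; j++)       // cross products, Proposition 3.1
          std::cout << "c" << j << " : layer "
                    << std::max(j, n-2-j) + 1 << "\n";
       std::cout << "c" << n-2 << " : layer " << n-1 << "\n";
    }

For n = 5, the deepest jobs are f_5 in layer five and b_3 ⋆ a and c_3 in layer four, consistent with the n steps of Corollary 3.2.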
4 Polynomial Evaluation and Differentiation

We consider a polynomial p of N monomials in n variables with nonzero coefficients as power series, all truncated to the same degree d. The coefficient of the k-th monomial is denoted as a_k and n_k variables appear in the monomial, for k ranging from 1 to N. The variables in the k-th monomial are defined by the tuple of indices (i_1, i_2, ..., i_{n_k}) with 1 ≤ i_1 < i_2 < ··· < i_{n_k} ≤ n. We want to evaluate and differentiate

    p(x_1, x_2, ..., x_n) = a_0 + \sum_{k=1}^{N} a_k x_{i_1} x_{i_2} ··· x_{i_{n_k}},     (3)

at a sequence of n power series z = (z_1, z_2, ..., z_n). All series in z are also truncated to degree d. The constant term a_0 is not included in the count N.

In the data parallel evaluation and differentiation, every monomial has three separate arrays of forward, backward, and cross products, denoted respectively by f, b, and c. In the example below, the first index of f, b, and c corresponds to the monomial index.

    p = a_0 + a_1 x_1 x_2 x_4 + a_2 x_1 x_3 x_5 x_6 + a_3 x_2 x_3 x_6

    f_{1,1} := a_1 ⋆ z_1        f_{2,1} := a_2 ⋆ z_1           f_{3,1} := a_3 ⋆ z_2
    f_{1,2} := f_{1,1} ⋆ z_2    f_{2,2} := f_{2,1} ⋆ z_3       f_{3,2} := f_{3,1} ⋆ z_3
    f_{1,3} := f_{1,2} ⋆ z_4    f_{2,3} := f_{2,2} ⋆ z_5       f_{3,3} := f_{3,2} ⋆ z_6
                                f_{2,4} := f_{2,3} ⋆ z_6
    b_{1,1} := z_4 ⋆ z_2        b_{2,1} := z_6 ⋆ z_5           b_{3,1} := z_6 ⋆ z_3
    b_{1,1} := b_{1,1} ⋆ a_1    b_{2,2} := b_{2,1} ⋆ z_3       b_{3,1} := b_{3,1} ⋆ a_3
                                b_{2,2} := b_{2,2} ⋆ a_2
    c_{1,1} := f_{1,1} ⋆ z_4    c_{2,1} := f_{2,1} ⋆ b_{2,1}   c_{3,1} := f_{3,1} ⋆ z_6
                                c_{2,2} := f_{2,2} ⋆ z_6                              (4)

The 21 convolutions in (4) are arranged in (5). Statements on the same line can be computed in parallel. (The targets b_{1,1}, b_{3,1}, and b_{2,2} appear twice, because of the updates with the coefficients.)

    f_{1,1}  b_{1,1}  f_{2,1}  b_{2,1}  f_{3,1}  b_{3,1}
    f_{1,2}  b_{1,1}  c_{1,1}  f_{2,2}  b_{2,2}  c_{2,1}  f_{3,2}  b_{3,1}  c_{3,1}
    f_{1,3}  f_{2,3}  b_{2,2}  c_{2,2}  f_{3,3}
    f_{2,4}                                                                          (5)

If 9 thread blocks are available, all 21 convolutions can be computed in 4 steps. The value of p at z and all six partial derivatives are listed below:

    a_0 + f_{1,3} + f_{2,4} + f_{3,3},   b_{1,1} + b_{2,2},   c_{1,1} + b_{3,1},
    c_{2,1} + c_{3,1},   f_{1,2},   c_{2,2},   f_{2,3} + f_{3,2}.                    (6)

If 7 thread blocks are available, all values can be computed in two steps. Two steps are needed for the value of p. The first step computes f_{1,3} := a_0 + f_{1,3} and f_{2,4} := f_{2,4} + f_{3,3} simultaneously. The second step then does f_{1,3} := f_{1,3} + f_{2,4}, so the value of p is in f_{1,3}.

Obviously, as convolutions for different monomials can be computed in parallel, the result in Corollary 3.2 extends directly to polynomials.
Corollary 4.1. Consider a polynomial p in n variables, with N monomials. Let m be the number of variables in that monomial of p that has the largest number of variables. Given sufficiently many blocks of threads, the evaluation and differentiation of p takes m + ⌈log_2(N)⌉ steps.

Figure 1: The data array used to compute the forward, backward, and cross products to evaluate a polynomial and its gradient for the example in (4). Every box represents d+1 doubles for the coefficients of a series truncated at degree d. The boxes hold, in order, a_0, a_1, a_2, a_3, the input series z_1 through z_6, the ten forward products, the four backward products, and the four cross products. The arrows point at the start position of the input series and at the space for the forward, backward, and cross products for every monomial.

To interpret Corollary 4.1, assume every monomial has m variables. Then the total number of convolutions equals N(3m - 3) and the number of additions equals N(m + 1) - n. As in the case of one monomial, the speedup factor of 3d is present. For polynomials with many monomials, for N ≫ m, the upper bound on the speedup is dN/log_2(N).

5 Data Staging

The accelerated polynomial evaluation and differentiation algorithm proceeds in two stages. The first stage computes all convolutions. The second stage adds up the evaluated and differentiated monomials.

As the number of threads in each block matches the number of coefficients in each truncated power series, within each block the natural order of the data follows the coefficient vectors of the power series. In preparation for the launching of the kernels, the staging of the data must be defined.

Extracting the data structures of (3), the input of the algorithm consists of the following:

1. N, the number of monomials;
2. n, the number of variables;
3. d, the degree at which all series are truncated;
4. a_k, a truncated series as the coefficient of the k-th monomial, k = 1, 2, ..., N;
5. (i_1, i_2, ..., i_{n_k}), the indices of the variables in the k-th monomial, where 1 ≤ i_1 < i_2 < ··· < i_{n_k} ≤ n.

The output of the first stage consists, for each monomial, of the forward products f_{k,j}, for j up to n_k, the backward products b_{k,j}, for j up to max(1, n_k - 2), and the cross products c_{k,j}, for j up to max(0, n_k - 2). The max(1, n_k - 2) in the upper bound for b_{k,j} is for the special case n_k = 2, to store z_{i_1} ⋆ a_k.

The output of the first stage is the input of the second stage. The second stage adds, for each monomial k, the last forward product f_{k,n_k} to obtain p(z). The values of the derivatives of the k-th monomial are in f_{k,n_k - 1}, b_{k,n_k - 2}, and the c_{k,j}.

The data parallel algorithm to compute all convolutions is defined by the data layout. The total count of numbers involved in all convolution and addition jobs is

    e = (d+1) ( 1 + N + n + \sum_{k=1}^{N} ( n_k + max(1, n_k - 2) + max(0, n_k - 2) ) ).     (7)

The first factor (d+1) in (7) counts the number of coefficients in all series truncated to degree d. The terms in the second factor in (7) count respectively the constant coefficient a_0, the N coefficients a_k, the n input series in z, the n_k forward, the max(1, n_k - 2) backward, and the max(0, n_k - 2) cross products.
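As a small illustration of (7), assuming standard C++ containers (the function name is ours, not from the PHCpack sources), the count e can be computed as follows:

    #include <algorithm>
    #include <vector>

    // d: truncation degree, n: number of variables, nvr[k] = n_{k+1}
    long long data_count ( int d, int n, const std::vector<int> &nvr )
    {
       long long terms = 1 + (long long) nvr.size() + n; // a_0, a_k, z
       for(int nk : nvr)                                 // f, b, c
          terms += nk + std::max(1, nk - 2) + std::max(0, nk - 2);
       return (long long)(d + 1)*terms;                  // first factor
    }

For the example in (4), with n = 6 and n_k = 3, 4, 3, the second factor evaluates to 28, the number of boxes in Figure 1.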
Figure 1 illustrates the layout of the data vector for the polynomial in (4). The data are in an array A of e doubles. The order of the numbers in A follows the count as in (7): the coefficients a_0 and a_k and the series z are followed by the numbers in the forward, backward, and cross products. Each job is then characterized by a triplet of indices in this A. The first two indices point respectively at the start of the first and the second input. The third index in the triplet defines the start of the output. For example, the triplet for the convolution f_{1,1} := a_1 ⋆ z_1 for the polynomial in (4) is (d+1, 4(d+1), 10(d+1)), for degree d, as the coefficients for a_1 start after the first d+1 coefficients for a_0, z_1 starts after the first four series, and f_{1,1} after the ten series that define the coefficients of the polynomial and the input.

For complex numbers, the data are in two arrays, one for the real parts and the other for the imaginary parts. For m-fold double numbers, there are m data arrays, all following the same layout as for A, described above.

Jobs are placed in layers, following the lines of (5) for the example polynomial. Jobs to compute f_{k,j} and b_{k,j} are at layer j. Following Proposition 3.1, the layer of c_{k,j} is max(j, n_k - 2 - j) + 1. All jobs in the same layer can be executed at the same time. A kernel is launched with as many blocks as the number of jobs in one layer. One convolution job is executed by one block of threads. In addition to the data array A, the kernel is launched with the triplets of coordinates which define each job. Each block of threads extracts its triplet according to its block number.

The coordinates for each convolution job in the first stage of the algorithm depend only on the structure of the monomials and are computed only once. The same holds for the coordinates of the addition jobs in the second stage of the algorithm. Each addition job updates one series with another, so one pair of indices defines one addition job. For example, for the polynomial in (4), the first update f_{1,3} := a_0 + f_{1,3} has coordinates (0, 12(d+1)), as a_0 comes first and f_{1,3} is positioned after 12 series in A, see Figure 1.

For a convolution job j, let t = (t_1(j), t_2(j), t_3(j)) denote the triplet where t_1(j) and t_2(j) respectively are the locations in the data array A of the first and second input, and where t_3(j) is the location in A of the output. Given A and t, we can then symbolically summarize the code for the kernel to compute all convolution jobs at the same layer as

1. B := blockIdx.x
2. (i_1, i_2, i_3) := (t_1(B), t_2(B), t_3(B))
3. A[i_3 : i_3+d+1] := A[i_1 : i_1+d+1] ⋆ A[i_2 : i_2+d+1]

where the block index B corresponds to the index j of the convolution job, and [i : i+d+1] denotes the range of the coefficients in A. Likewise, for an addition job j, let t = (t_1(j), t_2(j)) be the pair of input and update locations in the data array A. Given A and t, the kernel to execute all addition jobs at the same layer is then summarized as

1. B := blockIdx.x
2. (i_1, i_2) := (t_1(B), t_2(B))
3. A[i_2 : i_2+d+1] := A[i_2 : i_2+d+1] + A[i_1 : i_1+d+1]
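In CUDA, the convolution version of this summary becomes a kernel in which every block reads its own triplet; the sketch below reuses the zero-insertion pattern of Section 2 and its illustrative bound D_MAX, with per-layer device arrays t1, t2, t3 of job coordinates (all names are ours, not those of the PHCpack sources).

    __global__ void convjobs ( double *A, const int *t1, const int *t2,
                               const int *t3, int d )
    {
       __shared__ double X[D_MAX+1];
       __shared__ double Y[2*D_MAX+2];

       const int B = blockIdx.x;   // block B executes job B of the layer
       const int k = threadIdx.x;  // thread k makes the k-th coefficient

       X[k] = A[t1[B]+k];          // load the first input series
       Y[k] = 0.0;                 // zero insertion
       Y[d+k] = A[t2[B]+k];        // load the second input series
       __syncthreads();

       double zk = X[0]*Y[d+k];    // accumulate z_k in a register
       for(int i=1; i<=d; i++) zk = zk + X[i]*Y[d+k-i];
       A[t3[B]+k] = zk;            // write the output series
    }

    // host side: one launch per layer, one block per job in the layer,
    // assuming jobs[L] counts the jobs and t1_d[L], t2_d[L], t3_d[L]
    // hold the staged coordinates of layer L on the device:
    // for(int L=1; L<=nbr_layers; L++)
    //    convjobs<<<jobs[L],d+1>>>(A_d,t1_d[L],t2_d[L],t3_d[L],d);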
The data staging algorithm to define the convolution jobs runs through the steps to compute all forward products f_{k,ℓ}, backward products b_{k,ℓ}, and cross products c_{k,ℓ}, for k ranging from 1 to N. The k-th monomial has n_k variables, with indices (i_1, i_2, ..., i_{n_k}). The offsets

    α_k = \sum_{ℓ=1}^{k-1} n_ℓ,   β_k = α_{N+1} + \sum_{ℓ=1}^{k-1} max(1, n_ℓ - 2),   γ_k = β_{N+1} + \sum_{ℓ=1}^{k-1} max(0, n_ℓ - 2)     (8)

mark the positions respectively of f_{k,1}, b_{k,1}, and c_{k,1} in the data array A. For an example, see the arrows in Figure 1. Let J_ℓ denote the set of convolution jobs at layer ℓ. Jobs in J_ℓ can be executed after the jobs in J_{ℓ-1} have finished. For n_k > 2, for k from 1 to N, do:

1. to execute f_{k,1} := a_k ⋆ z_{i_1}:
   t_1 := k(d+1)
   t_2 := (1 + N + i_1 - 1)(d+1)
   t_3 := (1 + N + n + α_k)(d+1)
   J_1 := J_1 ∪ {(t_1, t_2, t_3)}
2. for ℓ from 2 to n_k, to execute f_{k,ℓ} := f_{k,ℓ-1} ⋆ z_{i_ℓ}:
   t_1 := (1 + N + n + α_k + ℓ - 2)(d+1)
   t_2 := (1 + N + i_ℓ - 1)(d+1)
   t_3 := (1 + N + n + α_k + ℓ - 1)(d+1)
   J_ℓ := J_ℓ ∪ {(t_1, t_2, t_3)}
3. to execute b_{k,1} := z_{i_{n_k}} ⋆ z_{i_{n_k - 1}}:
   t_1 := (1 + N + i_{n_k} - 1)(d+1)
   t_2 := (1 + N + i_{n_k - 1} - 1)(d+1)
   t_3 := (1 + N + n + β_k)(d+1)
   J_1 := J_1 ∪ {(t_1, t_2, t_3)}
4. for ℓ from 2 to n_k - 2, to execute b_{k,ℓ} := b_{k,ℓ-1} ⋆ z_{i_{n_k - ℓ}}:
   t_1 := (1 + N + n + β_k + ℓ - 2)(d+1)
   t_2 := (1 + N + i_{n_k - ℓ} - 1)(d+1)
   t_3 := (1 + N + n + β_k + ℓ - 1)(d+1)
   J_ℓ := J_ℓ ∪ {(t_1, t_2, t_3)}
5. to execute b_{k,n_k - 2} := b_{k,n_k - 2} ⋆ a_k:
   t_1 := (1 + N + n + β_k + n_k - 3)(d+1)
   t_2 := k(d+1)
   t_3 := (1 + N + n + β_k + n_k - 3)(d+1)
   J_{n_k - 1} := J_{n_k - 1} ∪ {(t_1, t_2, t_3)}
6. for ℓ from 1 to n_k - 3, to execute c_{k,ℓ} := f_{k,ℓ} ⋆ b_{k,n_k - 2 - ℓ}:
   t_1 := (1 + N + n + α_k + ℓ - 1)(d+1)
   t_2 := (1 + N + n + β_k + n_k - 3 - ℓ)(d+1)
   t_3 := (1 + N + n + γ_k + ℓ - 1)(d+1)
   L := max(ℓ, n_k - 2 - ℓ) + 1
   J_L := J_L ∪ {(t_1, t_2, t_3)}
7. to execute c_{k,n_k - 2} := f_{k,n_k - 2} ⋆ z_{i_{n_k}}:
   t_1 := (1 + N + n + α_k + n_k - 3)(d+1)
   t_2 := (1 + N + i_{n_k} - 1)(d+1)
   t_3 := (1 + N + n + γ_k + n_k - 3)(d+1)
   J_{n_k - 1} := J_{n_k - 1} ∪ {(t_1, t_2, t_3)}

Sets are natural data structures in the mathematical description of the data staging algorithm, as the jobs in one J_ℓ can be executed in any order. In the implementation, the jobs in the same layer are stored in three integer arrays. The block index B is then used to get the coordinates of the convolution job performed by block B.

The pairs of indices for the addition jobs are defined in a recursive manner, following the order of the tree summation algorithm. To compute the value of the polynomial, add the forward products f_{k,n_k} of the monomials. For stride L and level ℓ, apply the following:

to execute f_{k,n_k} := f_{k,n_k} + f_{k-L,n_{k-L}}:
   t_1 := (1 + N + n + α_{k-L} + n_{k-L} - 1)(d+1)
   t_2 := (1 + N + n + α_k + n_k - 1)(d+1)
   J_ℓ := J_ℓ ∪ {(t_1, t_2)}

recursively, starting at L = ⌊N/2⌋ and level ℓ = log_2(N) (assuming N = 2^ℓ), dividing L by two in each step, until L = 1. For k = L in the formula f_{k,n_k} := f_{k,n_k} + f_{k-L,n_{k-L}}, replace f_{k-L,n_{k-L}} by a_0. The same recursive formula is applied to sum the first backward products and all cross products to obtain the gradient. A generic sketch of this tree summation is shown below.
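The host-side sketch below shows the tree summation pattern in its generic form; it sums plain doubles instead of series and does not reproduce the exact index mapping of the staging above, but makes the ⌈log_2(N)⌉ levels visible.

    #include <vector>

    void tree_sum ( std::vector<double> &v ) // v[0] holds the sum at the end
    {
       int cnt = (int) v.size();
       while(cnt > 1)                       // ceil(log2(N)) levels
       {
          const int half = (cnt + 1)/2;     // the stride for this level
          for(int k=0; k+half<cnt; k++)     // all pairs are independent
             v[k] = v[k] + v[k+half];       // and can run in parallel
          cnt = half;
       }
    }

For the value of p in the example (4), the four terms a_0, f_{1,3}, f_{2,4}, f_{3,3} are summed in two levels, as described at the end of Section 4.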
6 Computational Results

6.1 Experimental Setup

Table 1 summarizes the characteristics of each GPU, with the focus on the core counts and the processor speeds, because the problem is compute bound. The first four GPUs in Table 1 are housed in a Linux workstation, running CentOS. The fifth GPU resides in a Windows laptop.

Table 1: Characteristics of the NVIDIA GPUs.

The code was compiled with nvcc -O3 on the device, with the code for the host compiled by gcc -O3 on the Linux computers, and by the community edition of Microsoft Visual Studio on the Windows laptop. While running the same software on all five GPUs is obviously convenient, more advanced features of newer devices are not utilized. The main importance for the evaluation of our software is that an unfair comparison with the CPU is avoided. Taking into account the double precision peak performance of the P100 and the V100 (4.7 TFLOPS and 7.9 TFLOPS respectively), we may expect the V100 to be about 1.68 times faster than the P100.

The first test polynomial p_1 is a function of 16 variables. Its evaluation adds to the constant term all 1,820 monomials that are the products of exactly four variables. The evaluation requires 16,380 convolutions and 9,084 additions. As each monomial has no more than four variables, the 16,380 convolutions are performed in four kernel launches of respectively 3,640, 5,460, 5,460, and 1,820 blocks. The execution of the 9,084 additions requires 11 kernel launches of respectively 4,542, 2,279, 1,140, 562, 281, 140, 78, 39, 20, 2, and 1 blocks. The second test polynomial p_2 is constructed to require many more convolutions than additions, respectively 24,192 versus 8,192. To evaluate the third polynomial p_3, as many convolutions as additions are required: 24,256. Table 2 lists the characteristics of the test polynomials. Compared to p_1, p_2 has fewer monomials but each monomial has many more variables; whereas p_3 has many more monomials, but each monomial has only two variables.

           n    m      N  #convolutions  #additions
    p_1   16    4  1,820         16,380       9,084
    p_2  128   64    128         24,192       8,192
    p_3  128    2  8,128         24,256      24,256

Table 2: For each polynomial, n is the total number of variables, m is the number of variables per monomial, and N is the number of monomials (not counting the constant term). The last two columns list the number of convolution and addition jobs.

To examine the scalability of our problem, experiments are run for increasing degrees of truncation and for increasing levels of precision. Does the shape of the test polynomials influence the execution times?

6.2 Performance

For each run, four times are reported. The elapsed times of the kernel launches are measured by cudaEventElapsedTime and expressed in milliseconds. The first two times are the sums of all elapsed times spent by all kernels, respectively for all convolutions and all additions. The third time is the sum of the first two times. Each kernel launch involves the memory transfer of the index vectors that define the coordinates of the jobs in the data arrays. The fourth reported time is the wall clock time, which also includes this memory transfer. What is not included in the times is the transfer of the input data arrays from the host to the device and of the output data arrays from the device to the host.

Table 3 summarizes execution times to evaluate the first test polynomial p_1 at a power series truncated to degree 152 in deca double precision. This degree is the largest one block of threads can manage because of the limitation of the size of shared memory, which is the same on all five devices.

                    C2050     K20C      P100     V100  RTX 2080
    convolution  12947.26  11290.22  1060.03   634.29  10002.32
    addition        10.72     11.13     1.37     0.77      5.01
    sum          12957.98  11301.35  1061.40   635.05  10007.34
    wall clock   12964.00  11309.00  1066.00   640.00  10024.00

Table 3: Evaluating p_1 for degree d = 152 in deca double precision. The last line is the wall clock time for all convolution and addition kernels. All units are milliseconds.
The ratio 12964/640 ≈ 20.26 is the speedup of the most recent V100 over the oldest C2050. Compare the ratio of the wall clock times for P100 over V100 in Table 3, 1066/640 ≈ 1.67, with the ratio of the theoretical double precision peak performance of the V100 over the P100: 7.9/4.7 ≈ 1.68.

Consider the number of double float operations needed to evaluate p_1 at series truncated at degree d = 152 in deca double precision. Following the counts in [20], one addition in deca double precision requires 139 additions and 258 subtractions of doubles, while one deca double multiplication requires 952 additions, 1743 subtractions, and 394 multiplications of doubles. One convolution with zero insertion on series truncated at degree d requires (d+1)^2 multiplications and d(d+1) additions. One addition of two series truncated at degree d requires d+1 additions. So we have 16,380(d+1)^2 multiplications and 16,380 d(d+1) + 9,084(d+1) additions in deca double precision. One multiplication and one addition in deca double precision require respectively 3089 and 397 double operations. Then the 16,380(d+1)^2 multiplications evaluate to 1,184,444,368,380 and the 16,380 d(d+1) + 9,084(d+1) additions to 151,782,283,404 double float operations. In total, in 1.066 seconds the P100 performed 1,336,226,651,784 double float operations, reaching a performance of about 1.25 TFLOPS.
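The arithmetic in the previous paragraph can be checked with a few lines of C++ (a verification we add for illustration; the constants are the counts quoted above):

    #include <cstdio>

    int main ( void )
    {
       const long long d = 152, cnv = 16380, add = 9084;
       const long long mul_ops = 952 + 1743 + 394;  // 3089 doubles per *
       const long long add_ops = 139 + 258;         // 397 doubles per +
       const long long muls = cnv*(d+1)*(d+1);      // multiplications
       const long long adds = cnv*d*(d+1) + add*(d+1); // additions
       const long long flops = muls*mul_ops + adds*add_ops;
       printf("%lld double operations\n", flops);   // 1,336,226,651,784
       printf("%.2f TFLOPS\n", flops/1.066/1.0e12); // about 1.25 on P100
       return 0;
    }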
Another observation from Table 3 is the tiny amount of time spent by the addition kernels, when compared to the convolution kernels, for the V100: 0.77 versus 634.29. The convolution is quadratic in the degree d, whereas the addition is linear in d. As every block has d+1 threads, the addition finishes in one single step, whereas there are still d steps in the convolutions.

Table 4 lists the execution times for p_2 and p_3 on the P100 and V100, to verify if the shape of the test polynomial would influence the conclusions on p_1.

                      p_2                p_3
                 P100     V100      P100    V100
    convolution  1700.49  1115.03  1566.58  926.53
    addition        1.24     0.67     3.43    1.92
    sum          1701.72  1115.71  1570.01  928.45
    wall clock   1729.00  1142.00  1583.00  941.00

Table 4: Evaluating p_2 and p_3 for degree d = 152 in deca double precision. The last line is the wall clock time for all convolution and addition kernels. All units are milliseconds.

The ratios of the wall clock times on the P100 over the V100 for p_2 and p_3 are respectively 1729/1142 ≈ 1.51 and 1583/941 ≈ 1.68. For p_2, the factor 1.51 is not as high as expected. One probable cause is that the number of convolution jobs in the first 31 layers equals 256. This number equals the number of blocks in one kernel launch. The numbers of streaming multiprocessors of the P100 and V100 equal respectively 56 and 80. The number of 256 blocks in one launch does not occupy the V100 as much as the P100.

6.3 Scalability

The plots in this section visualize data in Tables 5, 6, and 7, which contain times on the three test polynomials, on the V100. These raw data sets are in the appendix.

As the time spent by all addition kernels is less than one millisecond for p_1, Figure 2 shows the relative cost of the multiple doubles versus doubles. The cost starts to increase once the degrees become larger than the warp size. For all precisions, the cost at degree 127 is less than twice the cost at degree 63.

Figure 2: Times spent by all addition kernels when evaluating p_1 and its gradient at power series truncated at increasing degrees 0, 8, 15, 31, 63, 95, 127, and 152, for seven precisions: double (1d), 2d, 3d, 4d, 5d, 8d, and 10d.

Figure 3 shows the sum of the times spent by all addition kernels, for the three test polynomials, for power series truncated at degree 152, for all seven precisions. Although p_3 has 8,128 monomials and p_2 has only 128, the increase in addition times for p_3 is at most three times as much as for p_2. The addition for p_3 happens with 12 kernel launches, while the addition for p_2 has 7 kernel launches.

Figure 3: Times spent by all addition kernels when evaluating p_1, p_2, p_3 and their gradients at power series truncated at degree 152, for seven precisions: double (1d), 2d, 3d, 4d, 5d, 8d, and 10d.

The percentage of time spent by all kernels over the wall clock time is visualized in Figure 4. For double precision, the wall clock time dominates (the percentage of the time spent by all kernels is less than 10%), although the time is also less than a tenth of a millisecond. This percentage climbs for higher precisions. In triple double precision, the time spent on all kernels dominates the wall clock time. For octo and deca double precision, this percentage is more than 95%.

Figure 4: Percentage of the time spent by all kernels over the wall clock time when evaluating p_1, p_2, p_3 and their gradients at power series truncated at degree 152, for seven precisions: double (1d), 2d, 3d, 4d, 5d, 8d, and 10d.

As the precision increases, the problem becomes more and more compute bound. For degree 191, in Table 5, for p_1, the wall clock times in double, double double, quad double, and octo double precision are respectively 6, 14, 95, and 449 seconds. The cost overhead factor of double double over double is typically a factor of about five, whereas here we observe 14/6 ≈ 2.33. The other observed cost overhead factors are 95/14 ≈ 6.79 and 449/95 ≈ 4.72. In Figure 5, the evolution of the logarithmic wall clock time is plotted.

Figure 5: The 2-logarithm of the wall clock times to evaluate and differentiate p_1, p_2, p_3 in double (1d), double double (2d), quad double (4d), and octo double (8d) precision, for power series truncated at degree 191.

If the number of coefficients in a truncated series doubles from 32 to 64, and from 64 to 128, then one would expect the observed wall clock times to quadruple, as the cost of the convolutions is O(d^2) for the truncation degree d. As shown in Figure 6, the wall clock times only double, as the difference between the bars in the 2-log times is about one.

Figure 6: The 2-logarithm of the wall clock times to evaluate and differentiate p_1 in quad double (4d), penta double (5d), octo double (8d), and deca double (10d) precision, for power series truncated at degrees 31, 63, and 127.

7 Conclusions

The evaluation and differentiation of a polynomial in n variables at power series truncated at some finite degree d requires a number of convolution jobs proportional to the number of variables per monomial and the number N of monomials. The convolution jobs are arranged in layers of jobs that can be executed simultaneously. A scan performs the N addition jobs in ⌈log_2(N)⌉ steps. For polynomials where N dominates the number of variables per monomial, the theoretical speedup is bounded by dN/log_2(N).

Data staging algorithms define the coordinates for the convolution and the addition jobs. Speedup factors comparing the V100 and P100 are close to the ratio of their theoretical peak performance. Experimental results show that teraflop performance is obtained. The accelerated algorithms scale well for increasing degrees and precisions. GPUs are well suited to compensate for the overhead of power series arithmetic and multiple double precision.

References
[1] P. Emeliyanenko. Efficient multiplication of polynomials on graphics hardware. In Y. Dou, R. Gruber, and J.M. Joller, editors, Advanced Parallel Processing Technologies. 8th International Symposium, APPT 2009, Rapperswil, Switzerland, August 2009, volume 5737 of Lecture Notes in Computer Science, pages 134–149. Springer-Verlag, 2009.

[2] S.A. Haque and M.M. Maza. Plain polynomial arithmetic on GPU. In High Performance Computing Symposium (HPCS 2012), Journal of Physics: Conference Series, 385, 2012.

[3] S.A. Haque, X. Li, F. Mansouri, M.M. Maza, W. Pan, and N. Xie. Dense arithmetic over finite fields with the CUMODP library. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, volume 8592 of Lecture Notes in Computer Science, pages 725–732. Springer-Verlag, 2014.

[4] Y. Hida, X.S. Li, and D.H. Bailey. Algorithms for quad-double precision floating point arithmetic. In Proceedings of the 15th IEEE Symposium on Computer Arithmetic (Arith-15 2001), pages 155–162. IEEE Computer Society, 2001.

[5] J. Glabe. Numerical Continuation on a GPU for Kinematic Synthesis. PhD thesis, University of California, Irvine, 2020.

[6] J. Glabe and J.M. McCarthy. A GPU homotopy path tracker and end game for mechanism synthesis. In Proceedings of the 2020 USCToMM Symposium on Mechanical Systems and Robotics, pages 206–215. Springer, 2020.

[7] J. Glabe and J.M. McCarthy. Numerical continuation on a graphical processing unit for kinematic synthesis. Journal of Computing and Information Science in Engineering.

[8] F. Gremse, A. Höfter, L. Razik, F. Kiessling, and U. Naumann. GPU-accelerated adjoint algorithmic differentiation. Computer Physics Communications, 200, pages 300–311, 2016.

[9] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.

[10] K. Isupov and V. Knyazkov. Multiple-precision matrix-vector multiplication on graphics processing units. Program Systems: Theory and Applications.

[11] M. Joldes, J.-M. Muller, V. Popescu, and W. Tucker. CAMPARY: CUDA multiple precision arithmetic library and applications. In Mathematical Software – ICMS 2016, the 5th International Conference on Mathematical Software, pages 232–240. Springer-Verlag, 2016.

[12] D.B. Kirk and W.W. Hwu. Programming Massively Parallel Processors. A Hands-on Approach. Morgan Kaufmann, second edition, 2013.

[13] M. Lu, B. He, and Q. Luo. Supporting extended precision on graphics processors. In A. Ailamaki and P.A. Boncz, editors, Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN 2010), June 7, 2010, Indianapolis, Indiana, pages 19–26, 2010.

[14] M.M. Maza and W. Pan. Fast polynomial multiplication on a GPU. Journal of Physics: Conference Series, 256, 2010. High Performance Computing Symposium (HPCS 2010), 5-9 June 2010, Victoria College, University of Toronto, Canada.

[15] A. Morgan. Solving Polynomial Systems using Continuation for Engineering and Scientific Problems, volume 57 of Classics in Applied Mathematics. SIAM, 2009.

[16] J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefevre, G. Melquiond, N. Revol, and S. Torres. Handbook of Floating-Point Arithmetic. Second edition, Springer-Verlag, 2018.

[17] S. Telen, M. Van Barel, and J. Verschelde. A robust numerical path tracking algorithm for polynomial homotopy continuation. SIAM Journal on Scientific Computing.

[18] S. Telen, M. Van Barel, and J. Verschelde. Robust numerical tracking of one path of a polynomial homotopy on parallel shared memory computers. In Proceedings of the 22nd International Workshop on Computer Algebra in Scientific Computing (CASC 2020), volume 12291 of Lecture Notes in Computer Science, pages 563–582. Springer-Verlag, 2020.

[19] J. Verschelde. Algorithm 795: PHCpack: A general-purpose solver for polynomial systems by homotopy continuation. ACM Transactions on Mathematical Software, 25(2):251–276, 1999.

[20] J. Verschelde. arXiv preprint, http://arxiv.org/abs/2012.06607.

[21] J. Verschelde and G. Yoffe. Evaluating polynomials in several variables and their derivatives on a GPU computing processor. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops (PDSEC 2012), pages 1391–1399. IEEE Computer Society, 2012.

[22] J. Verschelde and G. Yoffe. Orthogonalization on a general purpose graphics processing unit with double double and quad double arithmetic. In Proceedings of the 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops (PDSEC 2013), pages 1373–1380. IEEE Computer Society, 2013.

[23] J. Verschelde and X. Yu. GPU acceleration of Newton's method for large systems of polynomial equations in double double and quad double arithmetic. In Proceedings of the 16th IEEE International Conference on High Performance Computing and Communication (HPCC 2014), pages 161–164. IEEE Computer Society, 2014.

[24] J. Verschelde and X. Yu. Accelerating polynomial homotopy continuation on a graphics processing unit with double double and quad double arithmetic. In J.-G. Dumas and E.L. Kaltofen, editors, Proceedings of the 7th International Workshop on Parallel Symbolic Computation (PASCO 2015), July 10-11 2015, Bath, United Kingdom, pages 109–118. ACM, 2015.

[25] J. Verschelde and X. Yu. Tracking many solution paths of a polynomial homotopy on a graphics processing unit in double double and quad double arithmetic. In Proceedings of the 17th IEEE International Conference on High Performance Computing and Communication (HPCC 2015), pages 371–376. IEEE Computer Society, 2015.
Appendix
The source code is available on GitHub, under the GNU GPL license, as part of the code of PHCpack. Tables 5, 6, and 7 contain times on the three test polynomials, on the V100.

Table 5: Times in milliseconds to evaluate and differentiate p_1, for increasing degree d, and for increasing precision.

Table 6: Times in milliseconds to evaluate and differentiate p_2, for increasing degree d, and for increasing precision.

Table 7: Times in milliseconds to evaluate and differentiate p_3, for increasing degree d, and for increasing precision.