Fast Derivatives for Multilinear Polynomials
Valeri Aronov

September 25, 2020
Abstract
The article considers functions of many ($n$) variables that are linear in each of them: multilinear polynomials (MP) [3]. A three-step evaluation scheme is presented that uses the minimal possible number of floating point operations for non-sparse MP at each step. The minimal number of additions is achieved in the algorithm for fast MP derivatives ($FMPD$) calculation. The cost of evaluating all first and second derivatives approaches the cost of MP evaluation itself with a growing $n$. The $FMPD$ algorithm structure exhibits similarity to the Fast Fourier Transform (FFT) algorithm.
This is a function of $n$ variables, linear in each of them:
$$a(x) = \sum_{i=0}^{2^n-1} r_i \cdot \prod_{k=1}^{n} x_k^{i_k}, \qquad i = (i_n \ldots i_k \ldots i_1), \quad x \in \mathbb{R}^n,\ r \in \mathbb{R}^{2^n} \tag{1}$$
It is called a multilinear polynomial (MP) [3]. For example (for $n = 2$):
$$a(x) = r_0 + r_1 \cdot x_1 + r_2 \cdot x_2 + r_3 \cdot x_1 \cdot x_2$$
The binary notation for $i$ is natural for the fast MP derivatives ($FMPD$) algorithm presented in this article.

The transfer function of linear analog electric circuits has MPs of the circuit parameters in its numerator and denominator, with coefficients being complex polynomials of frequency; $x$ holds the values of the circuit parameters. Some software systems for linear circuit analysis generate and use symbolic (analytical) presentations of the transfer functions for fast multiple transfer function evaluations [4]. This makes it possible to build a symbolic presentation for derivatives of transfer functions as well. The $FMPD$ algorithm was first introduced in [8] for linear circuit parameter optimization.
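Definition (1) can be checked directly for small $n$. The following sketch (function and variable names are illustrative, not from the paper) evaluates an MP by iterating over the bits of $i$ and reproduces the $n = 2$ example above:

```python
def mp_eval(r, x):
    """Evaluate a(x) by definition (1): bit k of i decides whether x[k] enters the monomial."""
    n = len(x)
    total = 0.0
    for i in range(2 ** n):
        term = r[i]
        for k in range(n):
            if (i >> k) & 1:
                term *= x[k]
        total += term
    return total

# The n = 2 example from the text: a(x) = r0 + r1*x1 + r2*x2 + r3*x1*x2
r = [1.0, 2.0, 3.0, 4.0]
x = [5.0, 7.0]                     # x1 = 5, x2 = 7
assert mp_eval(r, x) == 1 + 2*5 + 3*7 + 4*5*7
```

This direct evaluation costs $O(n \cdot 2^n)$ multiplications and serves only as a reference point; the three-step scheme below does much better.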
This article uses the designation $q_i(x)$:
$$q_i(x) = r_i \cdot \prod_{k=1}^{n} x_k^{i_k}, \qquad i = (i_n \ldots i_k \ldots i_1) \tag{2}$$
Their derivatives then are: $(q_i(x))'_{x_k} = q_i(x)/x_k$ for all $i_k = 1$. $q_i(x)$ does not depend on those $x_k$ which have $i_k = 0$, and the corresponding derivatives are equal to $0$.

The following definitions of the full and partial sums of the $q_i(x)$ are used here:
$$a_{0\ldots 0} = \sum_{\substack{i=0 \\ i=(i_n \ldots i_1)}}^{2^n-1} q_i = a(x), \qquad a_{0\ldots 01} = \sum_{\substack{i=0 \\ i=(i_n \ldots i_2 1)}}^{2^n-1} q_i,$$
the latter summing only the $q_i$ with $i_1 = 1$. For two $n$-bit integers $j = (j_n, \ldots, j_1)$ and $i = (i_n, \ldots, i_1)$ we write $j \succcurlyeq i$ if their binary digits satisfy $j_\nu \ge i_\nu$, $\nu = 1, \ldots, n$. Then
$$a_i = \sum_{j \succcurlyeq i} q_j \tag{3}$$
and, in particular, $a_{1\ldots 1} = q_{1\ldots 1}$.

Given integers $1 \le k_1 < \ldots < k_m \le n$, we denote by $t(k_1, \ldots, k_m)$ the $n$-bit integer having 1 at positions $k_1, \ldots, k_m$ and 0 at all other positions. Then the MP derivatives are:
$$(a(x))^{(m)}_{x_{k_1} \ldots x_{k_m}} = a_{t(k_1, \ldots, k_m)}(x) \Big/ \prod_{j=1}^{m} x_{k_j} \tag{4}$$
The task of MP derivatives calculation is: given $r$ and $x$ and using the MP presentation (1), calculate all derivatives as in (4), including the evaluation of the MP itself. MPs in this article are considered to be 'non-sparse': either $r_i \ne 0$ for all $i$, or the savings in the MP and derivatives calculation due to the omission of $r_i = 0$ are negligible.

Execute the task in three steps:
1. Calculate all items as in (2).
2. Calculate all $a_i$ as in (3).
3.
Calculate all derivatives as in (4).

The following algorithm calculates all products of the $x_k$ in Step 1:

    GetProducts(x, products)
        k = 1
        products[0] = 1
        products[1] = x[1]
        for i = 2 to sizeof(x)
        {
            products[k + 1] = x[i]
            for j = 1 to k
                products[k + j + 1] = x[i] * products[j]
            k = k * 2 + 1
        }

Note that item $i$ of products in GetProducts and the items $i$ of the $q_i(x)$ and $r_i$ arrays in (2) relate to this product:
$$\prod_{k=1}^{n} x_k^{i_k}, \qquad i = (i_n \ldots i_k \ldots i_1)$$
Note also that $n + 1$ items in products do not require multiplications. Getting the results in (2) after having all the products is straightforward and requires $2^n - 1$ multiplications more.

Let us apply a 'divide and conquer' approach to Step 2. Split all $q_j$ for $j = (j_p \ldots j_1)$ into two groups: $q_{(0 j_{p-1} \ldots j_1)}$ and $q_{(1 j_{p-1} \ldots j_1)}$. Assume that Step 2 for $(p-1)$ is executed for both halves and the results are $a_{(j_{p-1} \ldots j_1)}$ and $\tilde{a}_{(j_{p-1} \ldots j_1)}$. Then the following operations produce the results for $p$:
$$a_{(1 j_{p-1} \ldots j_1)} = \tilde{a}_{(j_{p-1} \ldots j_1)}$$
$$a_{(0 j_{p-1} \ldots j_1)} = a_{(j_{p-1} \ldots j_1)} + \tilde{a}_{(j_{p-1} \ldots j_1)}$$
GetPartialSums implements this algorithm:
    GetPartialSums(qa, xSize)
        addendPositionsDifference = 1
        clusterPositionsDifference = 2
        qaSizeMinusOne = sizeof(qa) - 1
        lastSumPositionLimitInSilo = qaSizeMinusOne
        for iSilo = 1 to xSize
        {
            for i = 0 by clusterPositionsDifference to lastSumPositionLimitInSilo
            {
                for j = 0 to addendPositionsDifference - 1
                    qa[i + j] = qa[i + j] + qa[i + j + addendPositionsDifference]
            }
            lastSumPositionLimitInSilo = qaSizeMinusOne - clusterPositionsDifference
            addendPositionsDifference = clusterPositionsDifference
            clusterPositionsDifference = 2 * clusterPositionsDifference
        }

It starts with qa containing all the $q_i$ and finishes with qa containing all the $a_i$; here xSize $= n$ and qaSizeMinusOne $= 2^n - 1$. Once the memory for the $q_i$ is allocated, this algorithm does not require any additional memory except for several scalar variables.

[Figure: GetPartialSums diagrams for $n = 2, 3, 4$. The $q_i$ input data are located along the left vertical line and the $a_i$ output data along the right vertical line; the top-left parts outlined by broken lines represent the diagrams for $n = 2$ and $n = 3$.]

An arrow represents an addition where the addends are taken at the levels of the arrow's end points and the sum is placed into the location of the addend the arrow points to. The additions are executed silo-by-silo from the left silo to the right one.

As is proven in the next section, the algorithm implemented by GetPartialSums requires a minimal number of floating point additions. The adjective 'fast' is appropriate here then.
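The three steps can be transcribed into Python to check the algebra. This is an illustrative sketch (function names are mine, not from the paper): the partial sums are verified against the brute-force definition (3), and the derivatives from (4) are verified through the linearity of the MP in each variable:

```python
import random

def mp_eval(r, x):
    """Direct evaluation of a(x) from definition (1)."""
    n = len(x)
    total = 0.0
    for i in range(2 ** n):
        term = r[i]
        for k in range(n):
            if (i >> k) & 1:       # bit k of i selects x[k]
                term *= x[k]
        total += term
    return total

def get_products(x):
    """Step 1: all monomials prod_k x[k]**i_k, indexed by the bits of i."""
    n = len(x)
    products = [1.0] * (2 ** n)
    products[1] = x[0]
    k = 1                          # index of the last filled position
    for i in range(1, n):
        products[k + 1] = x[i]     # obtained without a multiplication
        for j in range(1, k + 1):
            products[k + 1 + j] = x[i] * products[j]
        k = 2 * k + 1
    return products

def get_partial_sums(qa, n):
    """Step 2 (FMPD): silo-by-silo in-place additions turning the q_i into the a_i."""
    addend_diff, cluster_diff = 1, 2
    last_limit = len(qa) - 1
    for _ in range(n):
        for i in range(0, last_limit + 1, cluster_diff):
            for j in range(addend_diff):
                qa[i + j] += qa[i + j + addend_diff]
        last_limit = len(qa) - 1 - cluster_diff
        addend_diff, cluster_diff = cluster_diff, 2 * cluster_diff

random.seed(1)
n = 4
x = [random.uniform(0.5, 2.0) for _ in range(n)]
r = [random.uniform(-1.0, 1.0) for _ in range(2 ** n)]

products = get_products(x)
qa = [r[i] * products[i] for i in range(2 ** n)]       # finish Step 1: q_i = r_i * product_i
brute = [sum(qa[j] for j in range(2 ** n) if j & i == i) for i in range(2 ** n)]  # (3)
get_partial_sums(qa, n)                                 # Step 2
assert all(abs(a - b) < 1e-9 for a, b in zip(qa, brute))

derivs = [qa[i] / products[i] for i in range(2 ** n)]   # Step 3, formula (4)
# Since a(x) is linear in each x_k, da/dx_k = a(..., x_k=1, ...) - a(..., x_k=0, ...):
for k in range(n):
    hi, lo = x[:], x[:]
    hi[k], lo[k] = 1.0, 0.0
    assert abs(derivs[1 << k] - (mp_eval(r, hi) - mp_eval(r, lo))) < 1e-9
```

Note that Step 3 here divides every position for simplicity; the text below only counts the $2^n - 2$ divisions that are actually required.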
Definition 2.1. The algorithm implemented by GetPartialSums is called a Fast Multilinear Polynomial Derivatives ($FMPD$) algorithm.

Having products calculated in Step 1 and qa calculated in Step 2, the last Step 3 gets the derivatives of the MP by dividing each qa[i], containing $a_i$, by the associated products[i].

Consider the floating point operation counts in Steps 1-3.

Step 1 requires at least $2^n - n - 1$ multiplications to produce the $2^n - n - 1$ different products. GetProducts achieves this minimum because it produces one product per multiplication. Overall, Step 1 requires $2^n - n - 1 + 2^n - 1 = 2^{n+1} - n - 2$ multiplications.

Step 3 requires $2^n - 2$ divisions, because $a_{0\ldots 0}$ (the MP value itself) and $a_{1\ldots 1} = q_{1\ldots 1}$, whose derivative is $r_{2^n-1}$, do not require divisions. It achieves the minimal floating point complexity because it produces the remaining $2^n - 2$ derivatives, one derivative per division.

Step 2 ($FMPD$) requires $n \cdot 2^{n-1}$ additions. Its asymptotic complexity is the greatest of all the Steps. The following statement holds:

Theorem 1.
The $FMPD$ algorithm uses the minimal number of floating point additions for the evaluation of (3) out of all algorithms using additions only.

Proof. The theorem proposition about the algorithms under consideration contains the 'additions only' restriction. It is an open question whether the proposition holds for algorithms using both additions and subtractions.

Let us call an algorithm direct if for each $a_i$ it uses only the addends appearing in (3), and each of them only once. A direct algorithm never adds a $q_j$ with $j \not\succcurlyeq i$ to $a_i$, for example; if a non-direct algorithm does so, that $q_j$ item has to be subtracted from $a_i$ as well. Thus the 'additions only' restriction is equivalent to the 'direct algorithms only' restriction. The $FMPD$ algorithm uses $n \cdot 2^{n-1}$ additions. Below we prove that this is the minimal number of additions required for (3).

The theorem is valid for $n = 2$. Indeed, any direct algorithm must spend at least one addition on each of $a_{01}$ and $a_{10}$, and at least two more on $a_{00}$, a sum of four different items that can reuse at most one of the previous sums. The $FMPD$'s number of additions for $n = 2$ is $n \cdot 2^{n-1} = 4$ and is equal to the minimum required.

Assume that the theorem is valid for $k$ and prove that it is then valid for $k + 1$ as well. Let us split all additions into three sets:
• those with both addends being $q_{(1 i_k \ldots i_1)}$ items (the lower half of the $q_i$ in the diagram) or sums of them (the 1st set);
• those with exactly one addend being a $q_{(1 i_k \ldots i_1)}$ item or a sum of them (the 2nd set);
• the rest of the additions (the 3rd set).
Note that these three sets do not overlap and their union includes all additions used for (3).

Let us introduce $b_j$: $b_{(j_k \ldots j_1)} = a_{(1 j_k \ldots j_1)}$. According to (3), any algorithm evaluating $a_{(1 i_k \ldots i_1)}$ does not use additions whose addends are $q_{(0 i_k \ldots i_1)}$ items or sums of them; otherwise subtractions would be necessary to remove them from the result. The 1st set therefore computes all the $b_j$ and, according to our assumption, contains at least $k \cdot 2^{k-1}$ additions.

Consider the 2nd set of additions. Each $a_{(0 i_k \ldots i_1)}$ has to use at least one addition from the 2nd set; hence there are at least $2^k$ of them, because the number of $a_{(0 i_k \ldots i_1)}$ items is $2^k$.

Let us prove that the 3rd set has to have at least $k \cdot 2^{k-1}$ additions. If there were an algorithm $A$ for $(k+1)$ needing a lesser number, consider its application to $x^* = (0, x_k, \ldots, x_1)$. Only additions from the 3rd set remain, because each addition from the first two sets has at least one zero addend. The application of $A$ to $x^*$ reduces the original problem dimension from $(k+1)$ to $k$, and it cannot use fewer than $k \cdot 2^{k-1}$ additions, because that would contradict our assumption about the minimal number of additions for $k$.

Finally, we have minimal numbers of additions for all three sets. These sets do not overlap and their union is the set of all additions used for the evaluation of (3). Thus the total number of additions is:
$$A_{k+1} \ge k \cdot 2^{k-1} + k \cdot 2^{k-1} + 2^k = (k + 1) \cdot 2^k$$
The right-hand side of the above is exactly the number of additions used by $FMPD$ for $k + 1$, meaning that it uses the minimal number of required additions.

An $FMPD_l$ analog of $FMPD$ for derivatives up to the $l$-th order ($l < n$) can be produced by omitting in $FMPD$ the additions that do not contribute to the required partial sums. In the last silo the omitted additions are those whose arrows point to $a_i$ with the number of 1s in the binary presentation of $i$ greater than $l$. In the preceding silos an addition is omitted when its arrow points to a position whose offset within its cluster has more than $l$ ones in its binary presentation; every cluster of a silo repeats the omission pattern of the first one. The first silo with omissions is $iSilo_{first} = l + 2$.

[Figure: the $FMPD_l$ diagram for $n = 5$, $l = 2$; the omitted additions are presented with broken lines.]

Note that the above omission can be precalculated for the usage in the internal loop of
GetPartialSums and has a negligible effect on the overall execution time.

The number of floating point additions for the evaluation of (3) by $FMPD_l$ is [8]:
$$A_{FMPD_l} = \begin{cases} (l+1) \cdot 2^{n-1} + \sum\limits_{k=l+2}^{n} \Big[ \sum\limits_{i=0}^{l} \binom{k-1}{i} \Big] \cdot 2^{n-k}, & l < n - 1, \\ n \cdot 2^{n-1}, & l = n - 1. \end{cases} \tag{5}$$

An important particular case of (5) is the evaluation of the polynomial and its gradient. Let us estimate the relative increase in the number of additions required for obtaining the gradient in comparison with the MP evaluation itself:
$$R_1(n) = \big[ A_{FMPD_1} - A(n) \big] / A(n) = \Big[ 2 \cdot 2^{n-1} + \sum_{k=3}^{n} k \cdot 2^{n-k} - A(n) \Big] \Big/ A(n), \qquad A(n) = \sum_{k=1}^{n} 2^{n-k}.$$
Using $\sum_{i=1}^{\infty} 2^{-i} = 1$ we get $A(n) \to 2^n$, and using $\sum_{i=1}^{\infty} i \cdot 2^{-i} = 2$ [6] we get $\lim_{n \to \infty} R_1(n) = 1/8$. Thus the cost of evaluating all partial sums for the first derivatives approaches only 1/8 of the MP evaluation cost with a growing number of variables. The cost per first derivative is negligible in comparison with the MP evaluation.

Similarly, using $\sum_{i=1}^{\infty} i^2 \cdot 2^{-i} = 6$ [7], we get $\lim_{n \to \infty} R_2(n) = 1$. Thus the cost of evaluating the $n(n+1)/2$ first and second derivatives approaches the cost of the MP evaluation itself. The cost per derivative is negligible in comparison with the MP evaluation.

The number of additions in the algorithm calculating the partial sums in (3) for each $i$ separately is:
$$A_{naive} = \sum_{i=0}^{n} \binom{n}{i} \cdot (2^{n-i} - 1) = 2^n \cdot \sum_{i=0}^{n} \binom{n}{i} \cdot 2^{-i} - 2^n = 3^n - 2^n$$
The particular variant of the binomial formula was used here:
$$\sum_{i=0}^{n} \binom{n}{i} \cdot u^i = (1 + u)^n$$
The number of additions in the naive algorithm for the derivatives up to and including order $l$ is:
$$A_{naive\,l} = \sum_{i=0}^{l} \binom{n}{i} \cdot (2^{n-i} - 1)$$
In particular, for the derivatives up to and including the second order:
$$A_{naive\,2} = (2^n - 1) + n \cdot (2^{n-1} - 1) + \binom{n}{2} \cdot (2^{n-2} - 1) \approx n \cdot 2^{n-1} + n^2 \cdot 2^{n-3} - n \cdot 2^{n-3}$$
The asymptotic complexity expressions of these algorithms are:
$$O_{FMPD}(n) = n \cdot 2^n, \quad O_{FMPD_2}(n) = 2^n, \quad O_{naive}(n) = 3^n, \quad O_{naive\,2}(n) = n^2 \cdot 2^n$$
$FMPD$ has a substantially lower complexity than the naive algorithm. The exponential complexity of $FMPD$ and $FMPD_l$ allows their practical usage only for relatively low $n$ values. Still, $FMPD$ is noticeably faster than its naive counterpart: for example, for $n = 8$ $FMPD$ uses 1024 additions vs. 6305 used by the naive algorithm. Comparing the numbers of additions in $FMPD$ and $FMPD_2$ against their naive counterparts, the ratios $A_{naive}/A_{FMPD}$ and $A_{naive\,2}/A_{FMPD_2}$ grow with $n$ as $O(1.5^n/n)$ and $O(n^2)$ respectively.

Now consider parallelization opportunities for the floating point operations in Steps 1-3.

Step 1. The body of the internal loop of
GetProducts can be split across $m$ processors. If the single-processor asymptotic complexity is $O(n)$, then the one for $m$ processors is $O_m(n) = O(n)/m$. After the products in (2) are evaluated, the multiplications by the $r_i$ can be done in $M_m(n) = \lfloor M(n)/m \rfloor + 1$ operations per processor, where $M(n)$ and $M_m(n)$ are the numbers of multiplications for a single processor and for each of the $m$ processors.

Step 2. As illustrated by the $FMPD$ diagram, all additions in a silo can be executed in parallel, because each addition has its own separate addends and result. Then $A_m(n) = \lfloor A(n)/m \rfloor + 1$, where $A(n)$ and $A_m(n)$ are the numbers of additions for a single processor and for each of the $m$ processors.

Step 3. All divisions may be executed in parallel. It requires $M_m(n) = \lfloor M(n)/m \rfloor + 1$ divisions per processor.

6 Fast Multilinear Polynomial Derivatives and FFT
The simplified FFT algorithm [1] has the following diagram:

[Figure: the simplified FFT diagram; the solid line arrows mean the same as in the $FMPD$ diagram, and the broken line arrows turn into void operations if $\omega = 0$.]

Comparison of the FFT ($\omega = 0$) diagram with the $FMPD$ diagram reveals their structural similarity. Note that the somewhat simpler MP derivatives problem allowed achieving the minimal possible number of operations, whereas only a complexity lower bound for the FFT [5] is established.
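The structural similarity can be made concrete by comparing index patterns: the (target, addend) pairs visited silo-by-silo by GetPartialSums coincide with the per-stage butterfly pairs of a textbook iterative radix-2 FFT. The enumeration helpers below are an illustrative sketch (not from the paper):

```python
def fmpd_silo_pairs(n):
    """(target, addend) index pairs of the additions in each silo of GetPartialSums."""
    size = 2 ** n
    pairs, addend, cluster = [], 1, 2
    for _ in range(n):
        pairs.append([(i + j, i + j + addend)
                      for i in range(0, size, cluster) for j in range(addend)])
        addend, cluster = cluster, 2 * cluster
    return pairs

def fft_butterfly_pairs(n):
    """(top, bottom) index pairs of the butterflies in each stage of an iterative radix-2 FFT."""
    size = 2 ** n
    pairs, m = [], 2
    while m <= size:
        pairs.append([(k + j, k + j + m // 2)
                      for k in range(0, size, m) for j in range(m // 2)])
        m *= 2
    return pairs

# Silo s of FMPD touches exactly the index pairs of FFT stage s:
assert fmpd_silo_pairs(3) == fft_butterfly_pairs(3)
assert fmpd_silo_pairs(5) == fft_butterfly_pairs(5)
```

Each FMPD addition qa[target] += qa[addend] corresponds to the additive half of a butterfly; the butterfly's second output and its twiddle-factor multiplication have no FMPD counterpart, which matches the broken-line arrows of the diagram becoming void operations.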
Acknowledgements
The author would like to thank Joris van der Hoeven and Igor Shparlinski for constructive criticism of the article and advice.
References

[1] A. Aho, J. Hopcroft, J. Ullman, The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[2] L.O. Chua, Pen-Min Lin, Computer-Aided Analysis of Electronic Circuits: Algorithms and Computational Techniques. Englewood Cliffs, N.J.: Prentice-Hall, 1975.
[3] A. Giambruno, M. Zaicev, Polynomial Identities and Asymptotic Methods. AMS Bookstore, Section 1.3, 2005.
[4] G. Gielen, W. Sansen, Symbolic Analysis for Automated Design of Analog Integrated Circuits. Boston: Kluwer Academic Publishers, 1991.
[5] J. Morgenstern, Note on a Lower Bound of the Linear Complexity of the Fast Fourier Transform. Journal of the ACM, Vol. 20, No. 2, 1973.
[6] How do you evaluate the sum of n/2**n from n=1 to infinity,
[7] How do you evaluate the sum of n**2/2**n from n=1 to infinity,