ARCHITECT: Arbitrary-precision Hardware with Digit Elision for Efficient Iterative Compute
He Li, James J. Davis, John Wickerson, George A. Constantinides
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Abstract—Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the algorithm for enough iterations or because the arithmetic in some iterations was insufficiently precise? There is no easy way to answer this question, so users will often over-budget precision in the hope that the answer will always be to run for a few more iterations. We propose a fundamentally new approach: with the appropriate arithmetic able to generate results from most-significant digit first, we show that fixed compute-area hardware can be used to calculate an arbitrary number of algorithmic iterations to arbitrary precision, with both precision and approximant index increasing in lockstep. Consequently, datapaths constructed following our principles demonstrate efficiency over their traditional arithmetic equivalents where the latter's precisions are either under- or over-budgeted for the computation of a result to a particular accuracy. Use of most-significant digit-first arithmetic additionally allows us to declare certain digits to be stable at runtime, avoiding their recalculation in subsequent iterations and thereby increasing performance and decreasing memory footprints. Versus arbitrary-precision iterative solvers without the optimisations we detail herein, we achieve up-to 16× performance speedups and 1.9× memory savings for the evaluated benchmarks.

Index Terms—Arbitrary-precision arithmetic, hardware architecture, online arithmetic, field-programmable gate array.
I. INTRODUCTION

In numerical analysis, an algorithm executing on the real numbers, R, is often expressed as a conceptually infinite iterative process that converges to a result. This is illustrated in a general form by the equation

  x^(k+1) = f(x^(k)),

in which the computable real function f ∈ (R^N → R^N) is repeatedly applied to an initial approximation x^(0) ∈ R^N. The true result, x*, is obtained as k approaches infinity, i.e.

  x* = lim_{k→∞} Π(x^(k)),

where the operator Π denotes projection of the variables of interest since the result may be of lower dimensionality than N. Examples of this template include classical iterative methods such as Jacobi and successive over-relaxation, as

The authors are with the Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, United Kingdom. E-mail: {h.li16, james.davis, j.wickerson, g.constantinides}@imperial.ac.uk.

Algorithm 1 Generic finite-precision iterative algorithm.
Require: x̂^(0) ∈ FP_P^N, f̂ ∈ (FP_P^N → FP_P^N)
  for k = 0 to K − 1 do
    x̂^(k+1) ← f̂(x̂^(k))
  end for
Assert: ‖Π(x̂^(K)) − x*‖ < η

well as others including gradient descent methods, the key algorithms in deep learning [1].

In practice, these calculations are often implemented using finite-precision approximations such as that shown in Algorithm 1, wherein FP_P denotes some finite-precision datatype, P is a measure of its precision (usually word length), f̂ is a finite-precision approximation of f and η is an accuracy bound. The problem with this implementation lies in the coupling of P and iteration limit K. Generally, this algorithm will not be able to ensure that its assertion passes, and when it fails we are left with no knowledge as to whether K should be increased or if all computations need to be thrown away and the algorithm restarted with a higher P instead.

As a simple demonstration of this problem, suppose we wish to compute the toy iteration x^(k+1) = 1/6 − 1/2 · x^(k), starting from zero. When performing this arithmetic using a standard approach in either software or hardware, we must choose a single, fixed precision for our calculations before beginning to iterate. Fig. 1a shows the order in which digits are calculated when the precision is fixed to six decimal places: approximant-by-approximant, least-significant digit (LSD) first. Choosing the right precision a priori is difficult, particularly with respect to hardware implementation. If it is too high, the circuit may be unnecessarily slow and power-hungry, while, if it is too low, the criterion for convergence may never be reached.

Our proposal, illustrated in Fig. 1b, avoids the need to answer the aforementioned question entirely. The digits are calculated in a zig-zag pattern, sweeping through approximants and decimal places simultaneously.
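A quick software check makes the toy iteration concrete. The sketch below assumes the recurrence as reconstructed here, x^(k+1) = 1/6 − x^(k)/2 (whose fixed point is 1/9), and uses exact rational arithmetic to show that the achievable accuracy is limited only by how long we iterate:

```python
from fractions import Fraction

def toy_iteration(num_iters):
    """Iterate x(k+1) = 1/6 - x(k)/2 exactly, starting from x(0) = 0."""
    x = Fraction(0)
    for _ in range(num_iters):
        x = Fraction(1, 6) - x / 2
    return x

# The error against the fixed point 1/9 halves on every iteration.
for k in (5, 10, 20):
    print(k, float(abs(toy_iteration(k) - Fraction(1, 9))))
```

With exact arithmetic the answer improves indefinitely; the difficulty the text describes arises precisely because finite-precision hardware must truncate each x^(k) before the next iteration.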
The longer we compute, the more accurate our result will be; the computation can terminate whenever the result is accurate enough. This avoids the need to fix the precision beforehand, but requires the ability to calculate from most-significant digit (MSD) first: a facility provided through the use of online arithmetic [2].

Fig. 1. Digit-calculating strategies for the solution of x^(k+1) = 1/6 − 1/2 · x^(k) starting from x^(0) = 0. Arrows show the order of digit generation. (a) Conventional arithmetic, approximants x^(1), x^(2), ... calculated LSD first. (b) Our proposal, MSD first. (c) Our proposal, MSD first with don't-change digit elision. ″-marks indicate don't-change digits, with the solid line representing the region's boundary.

While general-purpose processors featuring traditional, LSD-first arithmetic units exhibit inefficiency for the realisation of online arithmetic, field-programmable gate arrays (FPGAs) represent excellent platforms for the implementation of such MSD-first operations [3]–[5].

As originally formulated, this method is somewhat inefficient since the triangular shape traced out results in the computation of more digits than is actually needed. In the bottom-left corner lie high-significance digits of later approximants; these generally become stable over time, so we call them don't-change digits. By detecting the presence and avoiding the recomputation of these digits, we arrive at a digit pattern such as that shown in Fig. 1c. This increases efficiency while having no bearing on the chosen iterative method's ability to reach a result of any accuracy.

The proposed architecture, coined
ARCHITECT (for ARbitrary-precision Constant-Hardware ITErative CompuTe), is the first to allow the runtime adaptation of both precision and iteration count for iterative algorithms implemented in hardware. We make the following novel contributions:
• The first fixed-compute-resource hardware for iterative calculation capable of producing arbitrary-precision results after arbitrary numbers of iterations.
• An optimised mechanism for digit-vector storage based on a Cantor pairing function to facilitate simultaneously increasing precision and iteration count.
• Theoretical analysis of MSD stability within any online arithmetic-implemented iterative method, enabling the runtime elision of don't-change digits to obtain performance speedups and increase memory efficiency.
• Exemplary hardware implementations of our proposals for the computation of both linear (Jacobi method) and nonlinear (Newton) iterations.
• Qualitative and quantitative performance and scalability comparisons against traditional and state-of-the-art online arithmetic FPGA implementations.

An earlier version of this work appeared in the proceedings of the 16th International Conference on Field-programmable Technology (FPT) [5]. This article combines that paper's material with the don't-change digit proposal taken from our 24th IEEE Symposium on Computer Arithmetic (ARITH) publication [6], extending both. In particular:
• We add an arbitrary-precision divider to our available operators, enabling the construction of datapaths for the calculation of irrational results with Newton's method.
• Changes to our digit elision technique, originally designed for linear-convergence algorithms, are made to suit the Newton method's quadratic convergence.
• To complement the new digit elision strategy, we propose an enhanced memory-addressing scheme, leading to greater performance and higher achievable result accuracy for a given memory budget.
• Finally, we exploit digit-parallel online addition to decrease datapath latency.

These optimisations allow us to obtain significant increases in throughput and memory efficiency over previous designs. The implementations presented and evaluated in this article are fixed-point.
ARCHITECT's principles are, however, generic, and could be employed for the construction of floating-point operators supporting arbitrary-precision mantissas.

II. BACKGROUND
In scientific computing, machine learning, optimisation and many other numerical application areas, methods of iterative calculation are particularly popular and interest in their acceleration with FPGAs is growing [7]. Recent studies have demonstrated that FPGAs are promising platforms for the acceleration of the Jacobi [8], Newton's [9], conjugate gradient [10] and MINRES methods [11]. However, implementations relying on traditional arithmetic, whether digit-serial or -parallel, enforce compile-time determination of precision; for digit-parallel designs this affects their area and input/output bandwidth requirements, while for digit-serial it is one of the factors affecting algorithm runtime. Runtime tuning of precision in iterative calculations was enabled through the use of online arithmetic in recent work [4]; however, unrolling was necessary in order to implement the algorithm's loop, so area scaled with the desired number of iterations. As shown in Table I, ARCHITECT stands apart from these alternatives by enabling the runtime selection of both factors affecting result accuracy while keeping compute area constant.

TABLE I
COMPARISON OF ITERATIVE ARITHMETIC PARADIGMS

                      Area scales with       Runtime scales with
                      Prec.   Iter. limit    Prec.          Iter. limit
LSD-first, parallel   ✓       ✗              ✗              ✓ unbounded
LSD-first, serial     ✗       ✗              ✓ bounded      ✓ unbounded
Zhao et al. [4]       ✗       ✓              ✓ unbounded    ✗
ARCHITECT             ✗       ✗              ✓ unbounded    ✓ unbounded
A. Arbitrary-precision Computing
Applications requiring very high precisions have become increasingly popular in recent years [12]. For example, today, hundreds of digits of precision are required in atomic system simulations and electromagnetic scattering theory calculations, while Ising integrals and elliptic function evaluation need thousands of digits [13]. In experimental mathematics, Poisson equation computations frequently require results to tens or hundreds of thousands of digits of precision [14]. Standard numeric datatypes, such as double- or even quadruple-precision floating-point, are therefore no longer sufficient in an increasing number of scenarios.

Many software libraries have been developed for arbitrary-precision arithmetic [15]–[17]. The de facto standard is MPFR, which guarantees correct rounding to any requested number of bits, selected before each operation is executed. Interest in the hardware acceleration of high-precision operations, in particular those within iterative algorithms, is growing [7]. FPGAs provide flexibility not available on other platforms, allowing for the implementation of bespoke designs with many precision and performance tradeoffs. Libraries including FloPoCo [18] and VFLOAT [19], alongside proprietary vendor tools, facilitate the creation of custom-precision arithmetic units. These provide designers with many options to suit particular frequency, latency and resource requirements. Sun et al. proposed an FPGA-based mixed-precision linear solver: as many operations as possible are performed in low precision before switching to a slower, higher-precision mode for the later iterations [8]. A dual-mode (double- and quadruple-precision) architecture based on Taylor series expansion has also been implemented [20]. Zhao et al.'s work enables arbitrary-precision computation but, as mentioned earlier, necessitates compile-time determination of iteration count [4].

With the exception of Zhao et al.
's architecture [4], each of the aforementioned proposals requires precision, or precisions, to be determined a priori. In many cases, this is not a trivial task; making the wrong choice often means having to throw away the calculations already done and starting from scratch with higher precision, wasting both time and energy in doing so. In our work, we are particularly interested in hardware architectures which allow precision to be increased over time without having to restart computation or modify the circuitry. Table II presents a side-by-side comparison of these techniques and their features with ARCHITECT, the only entry supporting the determination of result precision and iteration count after each calculation has commenced.

TABLE II
COMPARISON OF ARBITRARY-PRECISION TECHNIQUES

                            Level      Prec. set per calc.   Iter. limit set per calc.
MPFR [16]                   Software   Before                During
FloPoCo [18], etc.          Hardware   Before                During
Mixed precision [8], [20]   Hardware   Before                During
Zhao et al. [4]             Hardware   During                Before
ARCHITECT                   Hardware   During                During
B. Online Arithmetic
Achieving arbitrary-precision computation with fixed hardware requires MSD-first input consumption and output generation. A suitable proposal for this, widely discussed in the literature, is online arithmetic [2]. By employing redundancy in their number representation, allowing less-significant digits to correct errors introduced in those of higher significance, all online operators are able to function in MSD-first fashion. Online operators are classically serial; however, efficient digit-parallel (unrolled) implementations targeting FPGAs have been developed as well [3]. We make use of both digit-serial and -parallel online operators in this work, employing the de facto standard radix-2 signed-digit number representation, wherein the ith digit of a number x, x_i, lies in {−1, 0, 1}. In hardware, each x_i corresponds to a pair of bits, x_i^+ and x_i^−, selected such that x_i = x_i^+ − x_i^−. Data can be efficiently converted between non-redundant and redundant forms using well known on-the-fly conversion techniques [2].

Of particular significance to the material presented in this article is the concept of online delay. When performing an online operation, the digits of its result are generated at the same rate as its input digits are consumed, but the result is delayed by a fixed number of digits, denoted δ. That is, the first (i.e. most-significant) q digits of an operator's result are wholly determined by the first q + δ digits within each of its operands [2]. The value of δ is operator-specific, but is typically a small integer. When chaining operators to form a datapath, as we do, the total online delay is the highest cumulative delay through the complete circuit [4].
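As a small illustration of this redundancy, the sketch below (helper names are ours, not the paper's) evaluates radix-2 signed-digit fraction strings and shows that distinct digit strings can denote the same value, which is what allows later digits to compensate for earlier ones:

```python
from fractions import Fraction

def sd_value(digits):
    """Value of a radix-2 signed-digit fraction: sum of d_i * 2^-(i+1),
    where each digit d_i lies in {-1, 0, 1}."""
    return sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(digits))

# Redundancy: the strings (1, -1) and (0, 1) both denote one quarter,
# so an over-estimate in the MSD can be corrected one position later.
print(sd_value([1, -1]), sd_value([0, 1]))
```

It is exactly this freedom that lets an online operator commit an output digit before seeing all of its input digits.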
1) Addition:
A classical online adder makes use of full adders and registers to add digits of inputs x and y presented serially as x_in and y_in, as shown in Fig. 2 (left), from most to least significant [2]. Digits of z start to appear at serial output z_out after two clock cycles, hence δ+ = 2. Duplication of such a serial adder P times and removal of its registers leads to the creation of a P-digit parallel online adder devoid of online delay, as shown in Fig. 2 (right). Crucially, while carry digits are presented at the least-significant end of the adder and generated at the most, there is no carry chain; independent of its word length, the critical path lies across two full adders. This demonstrates the adder's suitability for the construction of more complex online operators and indicates that its maximum frequency is independent of precision.
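One way to see why no carry chain arises is the following simplified decomposition (an intuition for the structure of Fig. 2, not a gate-level description of the exact circuit):

```latex
% Each digit-pair sum p_i = x_i + y_i lies in {-2, ..., 2} and is split
% locally into a transfer t_i and an interim digit w_i:
\begin{align*}
  p_i &= x_i + y_i = 2\,t_i + w_i, \\
  z_i &= w_i + t_{i+1}.
\end{align*}
% The redundant digit set {-1, 0, 1} leaves enough slack that t_i can be
% chosen from position i alone, so each output digit z_i combines only
% quantities formed one position away: the path through the adder is two
% full-adder stages long regardless of the word length P.
```

This locality is what the text's claim of a precision-independent maximum frequency rests on.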
2) Multiplication:
Algorithm 2 illustrates classical radix-2 online multiplication [2]: a process that operates in serial-in, serial-out fashion.

Fig. 2. Radix-2 online adders [2]. Left: Serial. Right: Parallel.

Algorithm 2 Radix-2 online multiplication [2].
Inputs: serially presented multiplicand x, multiplier y
  x, y, w ← 0
  for j = 0 to P + 2 do
    y ← y ∥ y_j
    v ← 2w + 2^(−3) (x·y_j + y·x_j)
    z_(j−3) ← sel_×(v)
    w ← v − z_(j−3)
    x ← x ∥ x_j
  end for
Output: serially generated product z

So-called digit vectors x and y are assembled from digits of multiplicand x and multiplier y over time from the most significant first; ∥ represents concatenation performed such that

  x = Σ_{i=0}^{j} x_i 2^(−(i+1)),  y = Σ_{i=0}^{j} y_i 2^(−(i+1))

during cycle j. Digit-selection function sel_× serves to determine the digits of product z. This is defined to be

  sel_×(v) = 1 if v ≥ 1/2, 0 if −1/2 ≤ v < 1/2, −1 otherwise.

z_j is produced at cycle j + 3 since δ_× = 3. Note that 'digits' z_j for j < 0 are ignored. P-digit online addition lies at the heart of the algorithm; due to its fixed width, hardware that implements Algorithm 2 can multiply to a precision of at most P digits, which must be fixed in advance.
3) Division:
The process of classical radix-2 online division is shown in Algorithm 3, in which dividend x and divisor y are used to produce quotient z. In contrast to Algorithm 2, division requires the formation of digit vector z since all prior output digits are needed for the calculation of v, while updates to w require the full history of y. Online division therefore has more complex computation dependencies than multiplication. Its digit-selection function, sel_÷, is

  sel_÷(v) = 1 if v ≥ 1/2, 0 if −1/2 ≤ v < 1/2, −1 otherwise.

z_j is produced at cycle j + 4 since δ_÷ is 4.

Algorithm 3
Radix-2 online division [2].
Inputs: serially presented dividend x, divisor y
  y, w, z ← 0
  for j = 0 to P + 3 do
    y ← y ∥ y_j
    v ← 2w + 2^(−4) (x_j − z·y_j)
    z_(j−4) ← sel_÷(v)
    w ← v − z_(j−4)·y
    z ← z ∥ z_(j−4)
  end for
Output: serially generated quotient z

III. PROPOSED ARCHITECTURE
Using classical online operators as a starting point, we now describe the construction of constant-compute-resource hardware capable of performing iterative computation to increasing precision over time. We call this concept
ARCHITECT.

A. Digit-vector Storage
Classical online operators make use of registers to store digit vectors. When implementing Algorithm 2 in hardware, for example, P-digit registers are needed for x and y. To compute to an arbitrary precision p instead, this is unsuitable; we must use random-access memory (RAM) for digit-vector storage to avoid both under- and over-budgeting register resources. We break p into two dimensions: one fixed, U, that determines the RAM width, and a second variable, n = ⌈p/U⌉, representing the number of these 'chunks' that constitute each p-digit number. For digit index i, where 0 ≤ i < p, we define chunk index c = ⌊i/U⌋ and chunk digit index u = i mod U such that i = Uc + u. When performing iterative calculations, independent digit vectors exist for each step, thus their indexing requires three variables: c ∈ [0, n), u ∈ [0, U) and approximant index k.

Since ARCHITECT requires k and i to both vary non-monotonically as time progresses, it is necessary to uniquely encode a one-to-one mapping from two-dimensional approximant and chunk index pair (k, c) into one-dimensional time. We use a Cantor pairing function (CPF) [21], a bijection from N × N onto N, for this purpose, defined to be

  cpf(k, c) = (k + c)(k + c + 1)/2 + c.    (1)

The function's bijectivity is crucial for ARCHITECT. Unlike classical row- or column-major indexing, the injectivity of the CPF allows both dimensions to grow without bound while providing a unique result for every (k, c). Its operation is demonstrated visually in Fig. 3; what is conceptually a three-dimensional array indexed as (k, c, u) becomes a two-dimensional array indexed by (cpf(k, c), u) instead, thereby suiting the 'flat' nature of RAM. The function's surjectivity ensures that every RAM address is produced by some (k, c), thus enabling the most efficient use of the available memory.

B. Arbitrary-precision Operators

1) Multiplication:
We are now in a position to rewrite Algorithm 2 such that it can compute results to arbitrary

Fig. 3. Operation of our Cantor pairing function, showing the transformation of a three-dimensional array growing with both approximant and chunk indices k and c to a structure growing only in a single dimension.

Algorithm 4
Radix-2
ARCHITECT multiplication.
Inputs: serially presented multiplicand x, multiplier y; approximant index k, precision p
  x, y, w ← 0
  for j = 0 to p + 2 do
    y[cpf(k, ⌊j/U⌋)][j mod U] ← y_j
    for c = ⌊j/U⌋ downto 0 do
      v[cpf(k, c)] ← 2w[cpf(k, c)] + 2^(−3) (x[cpf(k, c)]·y_j + y[cpf(k, c)]·x_j)
      if c > 0 then
        w[cpf(k, c)] ← v[cpf(k, c)]
      end if
    end for
    z_(j−3) ← sel_×(v[cpf(k, 0)])
    w[cpf(k, 0)] ← v[cpf(k, 0)] − z_(j−3)
    x[cpf(k, ⌊j/U⌋)][j mod U] ← x_j
  end for
Output: serially generated product z

precision. These transformed steps are shown in Algorithm 4. Most importantly, a new loop has been introduced; this iterates over the n pairs of p-digit numbers' chunks, most significant first, to facilitate arbitrary-precision multiplication with a U-digit online adder. Digit vectors x, y, v and w are now indexed in two dimensions, corresponding to standard RAM addressing denoted as [word][digit]. Where a digit index is not given, all U digits of that word are accessed simultaneously.
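The cpf-based addressing used by Algorithm 4 is easy to model in software. Below, a minimal sketch of (1), together with an inverse of our own for illustration, shows that every (k, c) pair maps to a distinct RAM word and that no addresses are wasted:

```python
def cpf(k, c):
    """Cantor pairing function of (1): a bijection from N x N onto N."""
    return (k + c) * (k + c + 1) // 2 + c

def cpf_inverse(a):
    """Recover (k, c) from a RAM word address; inverse of cpf."""
    # Find the anti-diagonal s = k + c containing address a.
    s = int(((8 * a + 1) ** 0.5 - 1) // 2)
    c = a - s * (s + 1) // 2
    return s - c, c

# Enumerating anti-diagonals (k + c = 0, 1, 2, ...) fills the address
# space contiguously: injective (no collisions), surjective (no gaps).
addrs = sorted(cpf(s - c, c) for s in range(64) for c in range(s + 1))
print(addrs == list(range(64 * 65 // 2)))  # True
```

Note that a row- or column-major scheme would need one dimension's bound fixed up-front, which is precisely what ARCHITECT cannot know; the CPF lets both k and c grow indefinitely.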
2) Division:
The equivalently transformed version of Algorithm 3 is shown in Algorithm 5. Mirroring the increased complexity of classical online division over multiplication, here, two accumulation loops are needed: one for the calculation of v, as for multiplication, and a second for w. Consequently, n − 1 more cycles are required for the computation of an output digit in ARCHITECT division than multiplication.

Particular care is required for digit alignment in online division since input operands need to be bounded such that the output range is (−1, 1) [2]. The normalisation of quotients following online division ordinarily necessitates variable δ_÷ [22]. To avoid this, we can maintain a fixed online delay by bounding divisor magnitude within [1/r, 1) [23]. For

Algorithm 5
Radix-2
ARCHITECT division.
Inputs: serially presented dividend x, divisor y; approximant index k, precision p
  y, w, z ← 0
  for j = 0 to p + 3 do
    y[cpf(k, ⌊j/U⌋)][j mod U] ← y_j
    for c = ⌊j/U⌋ downto 0 do
      v[cpf(k, c)] ← 2w[cpf(k, c)] + 2^(−4) (x_j − z[cpf(k, c)]·y_j)
    end for
    z_(j−4) ← sel_÷(v[cpf(k, 0)])
    for c = ⌊j/U⌋ downto 0 do
      w[cpf(k, c)] ← v[cpf(k, c)] − z_(j−4)·y[cpf(k, c)]
    end for
    z[cpf(k, ⌊j/U⌋)][j mod U] ← z_(j−4)
  end for
Output: serially generated quotient z

experimentation, we can guarantee alignment across iterations through the appropriate selection of initial inputs.

C. Digit Computation Scheduling
Given a generic online delay δ made up of latencies from a pipeline (or replicated pipelines operating in parallel) of one or more operators implementing the body of an iterative algorithm, restrictions are imposed on the order in which digits can be calculated. δ impacts us in two ways:
• Calculation of the first output digit requires the prior input of the first δ + 1 input digits. Thereafter, each subsequent output digit requires one additional input digit in order to be computed.
• The ith output digit is generated δ cycles after the ith input digit is presented.

In general, digits of the same approximant can be calculated indefinitely, while those across iterations must be sequenced such that they obey these δ-imposed limitations. When scheduling digit z_i^(k)'s generation, we must ensure that

  t(z_{i+1}^(k)) > t(z_i^(k)),  t(z_i^(k+1)) > t(z_{i+δ}^(k))

for all approximant indices k ≥ 1 and digit indices i ≥ 0, where t is the time at which a generation event occurs. While we have the freedom to trade off between iteration count and precision within the bounds of these dependencies, we always assume a mapping from current to next digit of the form depicted in Fig. 4. The groups of digits shown, each δ in size, are processed 'downwards' and 'leftwards,' with slope dependent on δ and control snapping back to the first approximant once digit position i = 0 has been reached. Fixing the granularity of digit generation to δ allows for control path simplification, as will be elaborated upon in Section III-E, and limits transitions between approximants. The latter is beneficial since, as will be explained in Section III-G, switching between approximants leads to the incursion of performance penalties under some circumstances.

Fig. 4.
Proposed digit generation pattern without don't-change digit elision for generic iterative computation using online operators.

Fig. 5. A proof sketch showing why it is sound to omit don't-change digits. If the two hatched regions contain the same q + δ digits, the three thick boxes are guaranteed to contain the same q digits, hence x^(k)'s calculation can begin from digit index q.

D. Don't-change Digit Elision
Thanks to the use of online arithmetic, when advancing downwards in our iteration-precision space, we can avoid the recalculation of don't-change digits, i.e. those of later approximants that have stabilised. This is generally not possible in LSD-first architectures, in which carries can propagate from LSD to MSD. Don't-change digit elision is guaranteed to be an error-free transformation: it induces no approximation.

The concept behind this optimisation is straightforward. Before beginning to calculate the digits of approximant k, we examine the digits of the previous two approximants. If these approximants are equal in their most-significant q + δ digits, it is guaranteed that approximant k will be equal to its two predecessors in its first q digits. Hence, we do not need to calculate them; we can skip directly to digit q's generation.

The soundness of this optimisation can be justified by appealing to the digit dependencies of online arithmetic. Fig. 5 provides some graphical intuition. Given that each approximant depends only on the value of its immediate predecessor, and recalling the definition of online delay from Section II-B, we emphasise that the first q digits of one approximant depend only upon the first q + δ digits of the previous approximant [2]. Hence, if approximants k − 1 and k − 2 are equal in their first q + δ digits, approximant k is guaranteed to be equal to them in its first q digits.

During the generation of approximant k, we compare digits on the fly with those generated for approximant k − 1, previously stored in RAM. Based on the number of digits found to be equal, we store a pointer indicating whence approximant k + 1's, i.e. the next approximant's, generation should begin. Pointer storage requires a small amount of extra memory but, as will be elaborated upon in Section V-F, this overhead is small and amortised out the more RAM is instantiated for storing digit vectors.

Fig. 6. Digit generation pattern with don't-change digit elision. Groups of digits in the shaded region were found to be identical at runtime, allowing computation of the first group to be skipped in the subsequent iteration. Dashed lines are scheduled paths not taken and ×s are digits therefore elided.

Since we have elected to process digits in groups of δ, it makes sense to also limit our don't-change digit elision to this granularity. We thus avoid the processing of entire groups of digits, where possible.

As a result of the introduction of don't-change digit elision, our scheduling pattern becomes dynamic. Fig. 6 shows an example. This is similar to Fig. 4, but, due to the identification of the third approximant's first group of MSDs as stable, we can advance into the iteration-precision space more quickly than had we not elided them, increasing compute efficiency.

Along with increased performance, the elision of don't-change digits also enables us to increase memory efficiency. Defining ψ as the number of digits guaranteed not to have changed within the current approximant, as determined through the runtime comparison of MSDs within the preceding two approximants, we can substitute

  cpf(k, ĉ) = (k + ĉ)(k + ĉ + 1)/2 + ĉ

for (1), where ĉ = ⌊(i − ψ + 1)/U⌋. By doing so, stable digits no longer need to be recomputed or stored. In common with its predecessor, this optimised storage strategy guarantees no memory wastage through the surjectivity of its mapping from approximant and chunk indices to memory addresses.

E. Control Logic
Given a particular (k, i), we can compute the subsequent (k, i), (k′, i′), needed to realise scheduling patterns such as those shown in Figs 4 and 6 with the finite-state machine (FSM) depicted in Fig. 7. Therein, we present state transition conditions both with and without don't-change digit elision functionality. When elision is enabled, the conditions shown in boxes are evaluated in addition to those outside.

The states' functionality is as follows.
• Digit generation: Manages the propagation and storage of δ-digit groups across iterations. When remaining within this state, only digit index i must be evaluated to determine changes needed to k and i without don't-change digit elision. When enabled, ψ must also be considered.
• Accumulation: Assuming that the constructed datapath contains at least one multiplier or divider, we must account for the variable latency of those operators. The throughput of the datapath as a whole is determined by

Fig. 7. FSM for digit computation scheduling. Transition edges are labelled with conditions and actions separated by slashes (/). If the datapath consists only of adders, the accumulation state is never entered. Otherwise, α = 2 if the datapath contains one or more dividers, and is 1 in all other cases. Boxed conditions apply only when don't-change digit elision is active; they are otherwise ignored. Termination occurs either on demand or following memory exhaustion.
Since ARCHITECT's multiplication and division operators have dissimilar accumulation functionality, as was explained in Section III-B, the number of clock cycles consumed by each is different. If the datapath contains at least one divider, advancement must be inhibited for 2⌊i/U⌋ cycles per generated digit. If it does not, but does contain at least one multiplier, this factor is ⌊i/U⌋ instead. Counter γ sequences the return to the digit-generation state. Since i is variable, this loop cannot be unrolled. In the case that the datapath contains only adders, entry into this state never occurs.

F. Accuracy Bounds
Let us assume the existence of a target result defined by its approximant index and precision (K, P). To reach it, we are required to compute for at least K iterations and to at least P-digit precision. We emphasise that ARCHITECT does not require its users to specify K or P up-front, while other approaches require either one or both of these (usually P) to be determined before beginning to iterate. Since don't-change digits are identified at runtime, the analysis herein applies to ARCHITECT without digit elision. Enabling this optimisation will therefore increase the bounds that follow.

As shown in Fig. 8, we define the number of iterations resulting from computation to target (K, P) as K_res and the final precision of the first approximant, always the most precise, as P_res. K_res is bounded to no more than K_max, while P_res is similarly bounded by P_max, both of which are determined by the size of the available memory. The latter therefore determines the maximum approximant index and precision, and consequently accuracy, that can be reached through the use of our approach. Thus, if higher accuracy is required, more memory must be instantiated.

Upon termination, the precision of approximant k will be

p(k) = { δ(⌈P/δ⌉ + K − k)  if k < K
       { P                  if k = K
       { δ(K_res − k)       otherwise,
Fig. 8. How the final precision and iteration count (K_res, P_res) are constrained by the desired result (K, P) and the available memory (K_max, P_max).

where K_res can be geometrically deduced to be

K_res = { ⌈P/δ⌉ + K − 1  if P > δ
        { K               otherwise

and P_res = p(1).

For each arbitrary-precision digit vector to be stored, K_max and P_max are fixed by RAM depth D (in U-digit words). Analysis of our pairing function in (1) allows us to derive

P_max = U⌊(√(8D + 1) − 1)/2⌋,

K_max = { P_max/U + 1  if D ≥ (P_max/U + 1)(P_max/U)/2
        { P_max/U       otherwise.

G. Compute Time
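To make the cycle-count analysis in this subsection concrete, the bounds above can be turned into a small software model (Python; helper names are ours, and only the adder-only datapath case is modelled since, per Section III-H, the serial-addition component vanishes when parallel online adders are used):

```python
import math

def k_res(K: int, P: int, delta: int) -> int:
    # Iterations performed when computing to target (K, P).
    return math.ceil(P / delta) + K - 1 if P > delta else K

def p(k: int, K: int, P: int, delta: int, kres: int) -> int:
    # Precision, in digits, of approximant k upon termination.
    if k < K:
        return delta * (math.ceil(P / delta) + K - k)
    if k == K:
        return P
    return delta * (kres - k)

def total_cycles(K: int, P: int, delta: int) -> int:
    # T1 (online delays) + T2 (digit generation) for an adder-only datapath.
    kres = k_res(K, P, delta)
    t1 = delta * kres
    t2 = sum(p(k, K, P, delta, kres) - delta for k in range(kres))
    return t1 + t2
```

For instance, a target of (K, P) = (3, 8) with δ = 4 yields K_res = 4 and a total of 56 cycles under this model.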
Given a particular target (K, P), and hence a certain K_res and P_res, we can calculate the number of clock cycles required to compute the desired result. Let us first assume that don't-change digit elision is disabled. The total time T can be broken down into three components such that T = T₁ + T₂ + T₃.

• Initial online delay: We must wait δ clock cycles before each approximant's result begins to appear, thus the delay across all iterations is simply T₁ = δK_res.

• Digit generation: Across all iterations performed, the total time for digit generation is either

T₂ = Σ_{k=0}^{K_res−1} [ p(k)(2n(k) − 1) − Un(k)(n(k) − 1) − δ ]

if the datapath contains one or more dividers, or

T₂ = Σ_{k=0}^{K_res−1} [ n(k)(p(k) − U(n(k) − 1)/2) − δ ]

if it contains one or more multipliers. n(k) = ⌈p(k)/U⌉ and represents the number of chunks within the given approximant upon termination of the algorithm. In the case that the datapath contains only one or more adders,

T₂ = Σ_{k=0}^{K_res−1} (p(k) − δ).

p(0) and n(0) are the numbers of digits and chunks, respectively, that must be read from the initial guess.

• Digit-serial addition: Recall that a serial online adder has δ₊ = 2. When switching between iterations, adders, if present, require two cycles to recalculate the preceding approximant's residuals in order to produce a new digit [2]. This ensures that the calculated digit aligns with its truncated digit vectors. For this,

T₃ = β(K_res² − K_res + 2K − 2),

where β is the number of serial adders present along the highest online-delay path within the circuit.

When enabled, don't-change digit elision generally allows computation time to be reduced below T. Since the amount of achievable reduction is input-dependent, however, it is not practicable to determine such reductions analytically.

H. Digit-parallel Addition Optimisation
It is possible to eliminate the final T₃ component in Section III-G, resulting in T₃ = 0, by using three-digit parallel online adders in place of serial ones. We store consecutive digit-vector words in alternating memory banks for speed. By ensuring that RAM width U > 1, i.e. that each word contains at least two digits, we can always read the three contiguous digits required by these adders in a single cycle. No additional memory is needed for this optimisation.

IV. BENCHMARKS
In order to evaluate
ARCHITECT, we implemented two widely used iterative algorithms, the Jacobi method (to solve systems of linear equations) and Newton's method (for the solution of nonlinear equations), in hardware following the aforementioned principles. We chose Jacobi and Newton to exemplify a large class of iterative methods with linear and quadratic convergence properties, respectively. Except where otherwise stated,
ARCHITECT implementations featured all of the previously described optimisations: don't-change digit elision, its related memory-addressing and digit-scheduling schemes, and serial-to-parallel online adder substitutions.
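The linear-versus-quadratic contrast between the two chosen methods is easy to see in software; the sketch below (plain Python; the system, right-hand side and function are hypothetical examples, not the evaluated benchmarks) iterates both:

```python
# Jacobi on a diagonally dominant 2x2 system: error shrinks by a roughly
# constant factor per iteration (linear convergence).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

def jacobi_step(x):
    return [(b[0] - A[0][1] * x[1]) / A[0][0],
            (b[1] - A[1][0] * x[0]) / A[1][1]]

# Newton on f(x) = a*x**2 - 1: the number of correct digits roughly
# doubles per iteration (quadratic convergence), converging to 1/sqrt(a).
def newton_step(x, a=2.0):
    return x / 2 + 1 / (2 * a * x)  # x - f(x)/f'(x)

x = [0.0, 0.0]
for _ in range(30):
    x = jacobi_step(x)   # ~30 steps for full double-precision accuracy

y = 1.0
for _ in range(6):
    y = newton_step(y)   # ~6 steps suffice
```

The step-count gap is what later makes don't-change digit elision far more profitable for Newton's method than for Jacobi.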
A. Jacobi Method
The Jacobi method seeks to solve the system of N linear equations Ax = b. If A is decomposed into diagonal and remainder components such that A = D + R, x can be computed through the repeated evaluation of

x(k+1) = D⁻¹(b − Rx(k)),

or, expressed in element-wise fashion,

x(k+1)_i = (1/a_ii)(b_i − Σ_{j≠i} a_ij x(k)_j)  ∀i ∈ [0, N),

where k is the approximant index. Since D's only non-zero elements lie along its diagonal, D⁻¹ is trivial to calculate. Note that x(k+1) relies only upon the previously computed value of x; the calculation can therefore be parallelised by computing each x(k+1)_i independently. A convergence criterion, ‖Ax(k) − b‖ < η, can be used in order to determine whether the solution has been found to great-enough accuracy.

Such a system is guaranteed to be soluble when A is strictly diagonally dominant, i.e. if the condition |a_ii| > Σ_{j≠i} |a_ij| holds for all i. Although strict diagonal dominance is not a necessity in every case, we assume this condition to always be satisfied for simplicity.

A metric used to quantify the sensitivity of a particular linear system to error is the condition number of A [24], where

κ(A) = ‖A‖‖A⁻¹‖.

Perturbations in x(k), caused by rounding, lead to errors in x(k+1) whose magnitude is dependent, in part, on κ(A); a high condition number indicates that A is sensitive to error and therefore ill-conditioned [25]. We can expect to need at least ω additional digits of precision in order to compute a system with κ(A) = 2^ω than would be required if κ(A) were 1 [26].

Without loss of generality, the datapath we developed to solve systems with dimensionality N = 2 is depicted in Fig. 9a, featuring ARCHITECT numerical operators as described in Section III-B. Jacobi solvers with
N > 2 could have been built with additional multipliers and adders, but this is not the emphasis of this work, which is to demonstrate arbitrary-accuracy iterative calculation. Note that runtime division is unnecessary since A and b are constants and that simple rearrangement transforms subtraction into addition.

B. Newton's Method
Newton's method is a root-finding algorithm, commonly employed to approximate the zeroes of a real-valued function f. The iterative process is

x(k+1) = x(k) − f(x(k))/f′(x(k)),
Fig. 9. ARCHITECT benchmark datapaths for (a) the Jacobi method (δ = 3) and (b) Newton's method (δ = 4). Adders, multipliers and dividers are arbitrary-precision radix-2 signed-digit online operators. Use of three-digit adders reduces online delay by 2 over their serial equivalents.

where f′ is the first derivative of f. Assuming that f(x) = 0 is soluble and f′(x) is Lipschitz continuous, convergence is quadratic if x(0) is sufficiently close to the solution [27].

We implemented the datapath shown in Fig. 9b, again with ARCHITECT operators, as a second case study. This can solve equations of the form f(x) = ax² − 1:

x(k+1) = x(k)/2 + 1/(2ax(k)).

Since the solution of f(x) = 0 is irrational for some choices of a, this benchmark showcases ARCHITECT's arbitrary-precision capabilities.

V. EVALUATION
We conducted theoretical analysis and performed experiments to investigate how
ARCHITECT scales and performs versus competing arithmetic implementations, both traditional (LSD-first) and online, using the Jacobi and Newton's methods as benchmarks. Performance is evaluated in terms of latency which, for all implementations considered in this article, is the multiplicative inverse of throughput.

The closest study to this work is that presented by Zhao et al. [4], which we compare against directly. For comparison against traditional arithmetic, we chose to implement parallel-in serial-out (PISO) operators since
ARCHITECT operates in a similar digit-serial fashion. PISO sits at the midpoint between fully serial (SISO) and fully parallel (PIPO) operators in terms of area and performance [28]. With increase in precision P (which, for traditional arithmetic, bounds the problems that can be solved to those requiring precision up to P), PISO suffers less from area growth and operating frequency f_max degradation than PIPO [29] while also being dramatically faster than SISO [30]. While we focus exclusively on hardware implementations, the limitations revealed for PISO apply equally to software libraries since precision must be chosen prior to iterative algorithmic commencement.

TABLE III
COMPLEXITIES OF ITERATIVE SOLVER IMPLEMENTATIONS (N-DIMENSIONAL JACOBI METHOD, NTH-ORDER NEWTON'S METHOD).

                 Area      Memory         Solve time
PISO             O(N²P)    O(NP), O(P)    O(log(N)KP)
Zhao et al. [4]  O(N²K)    O(N²KP)        O(P(log(N)K + P))
ARCHITECT        O(N²)     O(N²(K + P)²)  O((log(N)K + P)²/log(N))

A. Complexity Analysis
In Table III, we present the results of asymptotic complexity analysis, in terms of circuit size, memory requirements and latency, performed for
ARCHITECT and its competitors. For PISO, we assume the repeated evaluation of an iterative expression using datapaths composed of standard numeric operators. For each arithmetic, we further assume latency-optimal datapath implementations featuring minimal-depth adder (for Jacobi) and multiplier (Newton) trees. Complexities for Zhao et al.'s implementations were derived from analytical expressions provided by the authors [4].

Since we have chosen to analyse latency-optimised datapaths, area scales with the required number of multipliers (Newton) and adders (Jacobi), which themselves grow quadratically with N. For PISO, area also scales linearly with the width of its input operands, controlled by P, while the size of Zhao et al.'s implementations instead scales linearly with the number of iterations to be performed, K. The area of an ARCHITECT implementation, however, scales with neither K nor P, since the same arithmetic operators compute every approximant, to any precision, for the chosen iterative method.

As with area, a PISO implementation's memory footprint scales linearly with P; for the Jacobi method, scaling is also linear in N due to the size of the computed vector. Both Zhao et al.'s implementations and ARCHITECT require residue storage within their multipliers and dividers; memory occupancy therefore scales with area for the arbitrary-precision architectures. For the former, use of memory also scales with P as residues are stored to the same precision as its input data. Since ARCHITECT effectively collapses approximant and precision indices into a single dimension via its CPF, the memory requirements for each operator are determined by the maximum value of (1) during computation to the target (K, P). They thus scale quadratically with K + P.

PISO's latency grows linearly with K and P, but logarithmically with N due to our aforementioned choice of adder (and multiplier) structures. Zhao et al.
's speed is bottlenecked by the growth of precision, which is quadratic, as well as by the frequency of pipeline flushes, which grows as O(log(N)KP) [4]. For ARCHITECT, given that each datapath's highest cumulative
Fig. 10. Experimental setup for the evaluation of
ARCHITECT.

online delay δ is logarithmically related to N, its latency complexity can be determined by solving for T in Section III-G. Note that T₂ dominates T₁ in all cases and T₃ = 0 since we assume the use of digit-parallel adders.

At first glance, it appears that ARCHITECT behaves more poorly than its competitors in terms of memory use and solve time when scaled. We emphasise, however, that these complexities represent worst-case scenarios for
ARCHITECT: optimisations including digit elision do not factor into its asymptotic behaviour but do significantly improve the average case. They also do not take fundamental limitations of the alternatives into account. In particular, exact computation to a given (K, P) is rarely possible with P-digit LSD-first arithmetic due to rounding errors introduced in earlier approximants; only MSD-first architectures are capable of producing exact results for every approximant. Additionally, they do not account for ARCHITECT's unique ability to compute results to any required accuracy, effectively allowing the necessary (K, P) to be determined, on a problem-by-problem basis, at runtime. In contrast, a PISO implementation's precision is always bounded, while the same is true of iteration count for Zhao et al.'s proposal. In the remainder of this section, we empirically explore the implications of these issues.

B. Experimental Particulars
We targeted a Xilinx Virtex UltraScale FPGA (XCVU190-FLGB2104-3-E) for all experiments detailed henceforward, with implementation performed using Vivado 16.4. The correctness of results obtained in hardware was verified via comparison against those produced by golden models executed in software. Fig. 10 captures our experimental process.
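Schematically, each golden model is an arbitrary-precision software reference against which the hardware's digit streams are checked. A minimal sketch (using Python's exact rational arithmetic; the 2 × 2 system is a hypothetical example, not one of the evaluated inputs):

```python
from fractions import Fraction as F

# A diagonally dominant system solved exactly: no rounding ever occurs,
# so every approximant's digits can be checked against hardware output.
A = [[F(5), F(2)], [F(1), F(4)]]
b = [F(3), F(1)]

def jacobi_step(x):
    return [(b[0] - A[0][1] * x[1]) / A[0][0],
            (b[1] - A[1][0] * x[0]) / A[1][1]]

x = [F(0), F(0)]
for _ in range(20):
    x = jacobi_step(x)
# x now holds the 20th approximant as exact rationals; a hardware run is
# deemed correct if its MSD stream matches these values' leading digits.
```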
C. Qualitative Performance Comparison
To evaluate performance for the Jacobi method, we considered 2 × 2 systems A_m x = b, parameterised by a scalar m, with the elements of b randomly selected from a uniform distribution and initial guess x(0) = 0. As m increases, condition number κ(A_m) also increases, indicating that higher precision P will be required to generate a result of great-enough accuracy. We fixed the accuracy bound η and experimentally determined that the most ill-conditioned matrix requiring P = 32, a commonly encountered traditional arithmetic data width, to solve the associated system was that with m = 25, so we limited our experiments to m ∈ [0, 25]. We postulate that ARCHITECT should 'win', i.e. compute the required result in less time, versus PISO either when the latter's precision P is high and A_m is well conditioned or when P is too low for an ill-conditioned A_m to allow convergence at all. For ARCHITECT, we used a fixed RAM size (U, D). Latencies were calculated using frequencies taken from Section V-E.

Fig. 11a captures the latency ratio between ARCHITECT and PISO with a fixed precision of 32 bits (LSD-32), necessary to compute results for matrices with low m. Here, PISO can be said to have over-budgeted precision; results take longer to compute than had a smaller P been chosen. For the most well conditioned matrices, ARCHITECT takes less time to reach the target accuracy. For larger m, however, the opposite is true: lower-indexed approximants are computed to greater precision than those of PISO, taking more time. Had a lower choice of P been made for PISO, ARCHITECT would have been at a disadvantage for the more well conditioned matrices, but it would also have been able to compute the results of systems featuring ill-conditioned matrices that PISO could not. As shown in Fig. 11c, with P = 8 (LSD-8), ARCHITECT can solve systems beyond the point at which PISO's precision becomes under-budgeted; there, even if PISO ran indefinitely, it would never be able to converge to an accurate-enough solution.
We can conclude that ARCHITECT requires less time to generate results either when PISO's P is over-budgeted and convergence is fast or when P is too small for PISO ever to converge.

For Newton's method, we experimented with a range of values of a. As a increases, the root 1/√a decreases, thus greater precision will be required for its representation. We calculated under termination condition |f(x(k))| < η, with η again set as before. The range of a was chosen since, to solve f(x) with a = 2, the worst-case precision requirement was again P = 32. Figs 11b and 11d show the performance of our ARCHITECT-based Newton's method benchmark versus 32-bit and 8-bit PISO in the same form as Figs 11a and 11c, respectively. The results achieved for Newton's method are broadly similar to those for Jacobi.
ARCHITECT beats LSD-32 in terms of compute time up to a threshold value of a, while only our proposed iterative solver can solve systems beyond a second, larger threshold when PISO has P = 8. Identical conclusions regarding under- and over-budgeted precisions can therefore be drawn for Newton's method.

D. Area & Frequency Scalability
Implementational results are presented in Fig. 12 for our Jacobi and Newton's method benchmarks, including area and maximum operating frequency f_max. Each of the four plots features D, the RAM depth used for storage of each digit vector, on the x-axis; RAM width U was 8 in all cases.
Fig. 11. Performance comparisons of our proposal against LSD-first arithmetic for the Jacobi and Newton's methods. (a) and (b) show how the conditioning of input matrix A_m (Jacobi) and input value a (Newton) affect the solve time of our proposal compared to LSD-32: ARCHITECT computes more quickly than LSD-32 for sufficiently small m (Jacobi) and a (Newton). (c) and (d) show that, even though our proposal leads to a slowdown compared to LSD-8, there are nevertheless points, at large m (Jacobi) and a (Newton), beyond which LSD-8 does not converge at all, hence our speedup is effectively infinite.

Lookup table (LUT) and flip-flop (FF) use are not shown since the numbers are insignificant compared to those of on-chip block RAM (BRAM): from 0.17% to 0.66% for LUTs and 0.045% to 0.21% for FFs for the smallest and largest Jacobi designs implemented, and from 0.22% to 0.86% (LUTs) and 0.040% to 0.23% (FFs) for the Newton datapath. Memory use grows with D, as expected; the higher the K_res and P_res one wishes to be able to reach, the more RAM must be instantiated. With 90% and 77% of BRAMs allocated for the Jacobi and Newton methods, respectively, we can reach K_max = 1023 and P_max = 8184: the maxima for our targeted FPGA with power-of-two choices of D. The small increases in non-memory resources noted can be attributed to the additional control logic and multiplexing required to address larger memories. The f_max plots show that our implementations are able to run at between 120 MHz, for the smallest D tested, and around 50 MHz, for the largest, across both benchmarks. Subtle increases are due to compilation noise.

ARCHITECT gives its users the freedom to trade off area and computation time directly by varying RAM width U. When U is changed, so are the widths of the parallel online adders used in the datapath.
While a design with narrower adders is just as able to compute a particular result as one capable of performing wider additions, it will consume more clock cycles in return for demanding fewer resources. Comparisons between U = 8 and U = 64 with the same D are shown in Table IV to exemplify this for both of our benchmarks. Note that the accumulation latency for Newton's method is higher than Jacobi's due to the former's use of division; as was explained in Section III-B2, division requires more cycles to produce each output digit than are needed for multiplication. Table V shows the area breakdown and minimum clock period (critical path delay) for each of our individual arithmetic components for an example (U, D).

Fig. 12. Resource use and maximum clock rate of ARCHITECT Jacobi and Newton benchmarks versus RAM depth D. Area is reported in terms of BRAMs only; LUT and FF use were below 1% for all design points.

TABLE IV
AREA-SPEED TRADEOFF VIA SELECTION OF RAM WIDTH U. FOR EACH BENCHMARK, ROWS U = 8 AND U = 64 LIST LUTs, FFs, BRAMs, f_max (MHz) AND ACCUMULATION LATENCY (CYCLES), THE LAST OF THESE SCALING WITH ⌈p(k)/U⌉.

TABLE V
ARCHITECT OPERATOR FEATURES FOR AN EXAMPLE RAM SIZE (U, D).

Operator   LUTs   FFs   BRAMs   Minimum clock period (ns)
+          4      3     –       .
×          250    141   4       5.
÷          255    93    6       5.

E. Quantitative Area & Frequency Comparison
In order to compare the resource use and f_max of ARCHITECT against its competitors, we now assume that we wish to compute to particular (K, P) targets. We emphasise that, since ARCHITECT iterates exactly while LSD-first arithmetic-based solvers do not, latency cannot be fairly compared when considering computation to a particular (K, P).

We chose to set targets of (100, 2048) (for the Jacobi method) and (10, 2048) (Newton). Thus, at their 100th and 10th iterations, respectively, we wish to obtain a result with 2048-digit precision. Fewer iterations were targeted for Newton's method due to its quadratic convergence. Using U = 8 for ARCHITECT, the resultant iteration counts for the two methods are K_res = 509 (Jacobi) and 351 (Newton). To successfully perform computation to (K, P), we must ensure that K_max ≥ K_res and P_max ≥ P_res. We can determine that, by setting RAM depth D = 2¹⁷, we are able to reach K_max = 512 and P_max = 4088, which satisfies these requirements for both benchmarks.

Fig. 13 presents a side-by-side comparison of the architectures implemented following the principles presented herein and those using PISO operators, as well as the online implementation published by Zhao et al. [4]. Most strikingly, the latter demonstrates area inefficiency, with resource use scaling linearly with iteration count K; ARCHITECT consumes 57× fewer LUTs and 59× fewer FFs than Zhao et al.'s proposal requires in order to execute 100 iterations of the Jacobi method. When executing 10 iterations of Newton's method, these factors are 8.4 and 13, respectively. f_max is comparable between the two since the underlying arithmetic is largely equivalent, although ARCHITECT's is slightly inferior, principally due to reductions caused by the introduction of don't-change digit elision logic.
For PISO, we can see that, while its f_max is initially much higher than ARCHITECT's, at over 300 MHz for the smallest precisions, it falls as P increases and a crossover is eventually reached. Taking Newton's method as an example, with a high precision requirement such as 2048 digits, ARCHITECT is able to outperform its PISO counterpart in terms of f_max by a factor of 1.5. Corresponding decreases in LUT and FF use were also found: when computing to lower precisions, again for Newton, ARCHITECT consumes 1.8× and 3.3× fewer of each than PISO, while for P = 2048 these factors increase to 3.6 and 6.5. Similar conclusions can be drawn for our implementation of the Jacobi method. Since the proposed designs are able to calculate to any K ≤ K_max and P ≤ P_max, their area and f_max are constant.

F. Performance Improvement Breakdown
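Before quantifying the gains, it helps to see why don't-change digits arise at all: successive approximants of a quadratically convergent method agree in a rapidly growing prefix of MSDs, and it is exactly this stability that elision exploits. A quick software illustration (Python's decimal module; f(x) = x² − 2 is a hypothetical stand-in, not one of the evaluated benchmarks):

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # digits of working precision

def newton(x: Decimal) -> Decimal:
    # x - f(x)/f'(x) for f(x) = x**2 - 2; converges to sqrt(2).
    return x - (x * x - 2) / (2 * x)

def shared_prefix(a: Decimal, b: Decimal) -> int:
    # Number of leading characters on which two printed iterates agree.
    s, t = str(a), str(b)
    n = 0
    while n < min(len(s), len(t)) and s[n] == t[n]:
        n += 1
    return n

x = Decimal(1)
iterates = []
for _ in range(6):
    x = newton(x)
    iterates.append(x)

# The agreed prefix never shrinks and roughly doubles per iteration;
# those digits need never be recomputed or re-stored.
stable = [shared_prefix(u, v) for u, v in zip(iterates, iterates[1:])]
```

A linearly convergent method grows this prefix by only a constant number of digits per iteration, which is why the savings reported below are far larger for Newton's method than for Jacobi.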
We conducted further analysis to investigate how the elision of don't-change digits and the use of parallel online adders individually improve the performance and memory efficiency of
ARCHITECT. Overall, Figs 14a and 14b show that solve time can be significantly reduced when enabling these optimisations. As expected, don't-change digit elision leads to the majority of our design's efficiency savings over 'vanilla'
ARCHITECT (that without digit elision or parallel addition).

Fig. 13. Resource use and performance comparison of Jacobi and Newton's method implementations using Zhao et al.'s, PISO and our approaches versus required result precision P.

The gaps between the don't-change-plus-parallel online addition and parallel addition-only lines widen as η is reduced, indicating that consideration of don't-change digits becomes more important with higher accuracy requirements. The subtle jump present in Fig. 14a is due to the δ-digit granularity of elision. With respect to using parallel online addition only, performance is improved for higher η since this optimisation leads to clock cycle savings when switching between iterations. For higher-accuracy cases, however, it does not contribute much to solve time speedup. This makes sense since, as η falls, more iterations are required to achieve convergence, thus more cycles are required for the production of each new digit. This also affords much greater opportunity for don't-change digit elision, however, hence the high overall speedups seen on the right-hand side of, in particular, Fig. 14b.

The speedups we observed for Newton's method were far more significant than those for Jacobi: up to 16× for the former. Relatively low performance improvements were expected for the Jacobi benchmark due to the method's linear convergence. Far fewer don't-change digits are detected and elided during computation than for the quadratic-convergence Newton's method, hence the less-significant latency reductions seen in Fig. 14a than in Fig. 14b.

Figs 14c and 14d show the memory efficiency improvements afforded through the use of don't-change digit elision for both
Fig. 14. Solve time speedup for (a) the Jacobi and (b) Newton's methods using both don't-change digit elision and parallel online addition and using parallel addition only, versus
ARCHITECT with both optimisations disabled. (c) and (d) show the corresponding memory requirement reductions for Jacobi and Newton, respectively, facilitated through digit elision.

benchmarks. We present these as the ratio of the number of BRAM blocks that must be instantiated on our targeted FPGA for the solution of equations to particular accuracies with and without digit elision enabled. The jaggedness of these plots is due to the granularity of the memories, for which we only used whole numbers of BRAMs. For lower-accuracy cases, both pairs of designs require approximately the same amount of memory, although that considering don't-change digits is slightly inferior due to the overheads involved in comparison and subsequent elision. However, don't-change digit elision allows our optimised Newton design to continue using the same amount of memory at accuracies for which vanilla ARCHITECT starts to consume more. For the test cases we evaluated, we observed up-to 1.5× and 1.9× memory savings for the Jacobi and Newton's methods, respectively. Beyond those shown in Fig. 14, there are particularly high-accuracy cases for both benchmarks that vanilla ARCHITECT cannot reach before it exhausts its available memory, while the design with digit elision can. The advantages of this scheme and its efficient memory addressing therefore come to the fore with higher accuracy requirements.

VI. CONCLUSION & FUTURE WORK
In this article, we proposed the first hardware architecture capable of executing iterative algorithms to produce results of arbitrary accuracy by combining increasing iteration count with increasing precision while using constant compute resources. We named this technique ARCHITECT. Our proposal employs online arithmetic to generate its results MSD first and a Cantor pairing function within its digit-storage mechanism to facilitate the simultaneous growth of iteration count and precision. Using digit dependency analysis, we identified stable 'don't-change' digits across iterations, excluding them from calculation. This technique holds for any iterative method implemented using online arithmetic and was realised in hardware using simple runtime detection and digit-scheduling logic. We also proposed the replacement of serial online adders within iterative datapaths with parallel equivalents, facilitating latency reduction and consequent improvements in throughput.

We evaluated ARCHITECT on FPGAs using the Jacobi and Newton's methods in order to verify its accuracy and establish its scalability and efficiency. These benchmarks showcased the key advantage of our approach: removing the burden of having to determine and fix the precision of arithmetic operators in advance. By doing so, we showed that datapaths constructed from ARCHITECT operators are superior to their traditional arithmetic equivalents in scenarios where the latter's precision is either overly high for the problems being solved or too low for results to converge at all. A single ARCHITECT datapath is able to compute results to any accuracy, with the only limit being imposed by the size of the available RAM.

Our experiments revealed 12× LUT and 24× FF reductions over 2048-bit conventional parallel-in serial-out arithmetic, along with 57× LUT and 59× FF decreases versus the state-of-the-art online arithmetic implementation, when executing 100 Jacobi iterations. For Newton's method run for 10 iterations, these factors were 3.6, 6.5, 8.4 and 13, respectively. Versus ARCHITECT with the proposed don't-change and parallel addition optimisations disabled, we were able to achieve up-to 16× decreases in solve time.

While we prototyped our designs on FPGAs owing to the costs and lead times associated with full-custom implementation, we note that these devices are optimised for the implementation of conventional arithmetic operators. In particular, FPGAs' hardened carry chains suit the construction of fast LSD-first adders. Our proposals cannot take advantage of such structures at present. We are confident, however, that, should ARCHITECT see application-specific integrated circuit (ASIC) implementation, much more competitive performance would be achievable. Higher-radix (r > 2) online arithmetic could instead (or additionally) be employed to exploit high-performance adders, including on FPGAs, which we anticipate would also lead to speedups. We leave the exploration of such optimised implementations to future work.

Beyond this, we will extend our benchmarking to cover additional iterative algorithms, including Krylov subspace methods such as conjugate gradient descent. Finally, we envisage that the arbitrary-precision computation enabled by ARCHITECT can be combined with high-level synthesis to enable faster hardware specialisation.

ACKNOWLEDGEMENTS
The authors are grateful for the support of the United Kingdom EPSRC (grant numbers EP/P010040/1, EP/R006865/1 and EP/K034448/1), Imagination Technologies, the Royal Academy of Engineering and the China Scholarship Council.
Supporting data for this article are available online at https://doi.org/10.5281/zenodo.3378800.
He Li is a PhD student in the Department of Electrical and Electronic Engineering at Imperial College London. He received the MS degree from the Department of Microelectronics at Tianjin University in 2016. His main research interests are FPGA arithmetic, custom computing and hardware security. He received the Best Paper Presentation Award at FPT 2017.
James J. Davis is a Research Fellow in the Department of Electrical and Electronic Engineering's Circuits and Systems group at Imperial College London. He received a PhD in Electrical and Electronic Engineering from Imperial College London in 2016. His research is focussed on the exploitation of FPGA features for cutting-edge applications, driving up performance, energy efficiency and reliability. Dr Davis serves on the technical programme committees of the four top-tier reconfigurable computing conferences (FPGA, FCCM, FPL and FPT) and is a multi-best paper award recipient. He is a Member of the IEEE and the ACM.
John Wickerson received a PhD in Computer Science from the University of Cambridge in 2013. He is a Lecturer in the Department of Electrical and Electronic Engineering at Imperial College London. His research interests include high-level synthesis, the design and implementation of programming languages and software verification. He is a Senior Member of the IEEE and a Member of the ACM.