An Approximate Carry Estimating Simultaneous Adder with Rectification

Abstract

Approximate computing has in recent times found significant applications in lowering the power, area, and time requirements of arithmetic operations. Several works in recent years have furthered approximate computing along these directions. In this work, we propose a new approximate adder that employs a carry prediction method. This allows parallel propagation of the carry, enabling faster calculations. In addition to the basic adder design, we also propose a rectification logic that enables higher accuracy for larger computations. Experimental results show that our adder produces results 91.2% faster than the conventional ripple-carry adder. In terms of accuracy, adding the rectification logic to the basic design produces results that are 74% more accurate than state-of-the-art adders such as SARA [1] and BCSA [2].



Rajat Bhattacharjya*

Indian Institute of Information Technology Guwahati, Assam, India.

Vishesh Mishra*

Indian Institute of Information Technology Guwahati, Assam, India.

Saurabh Singh

Indian Institute of Information Technology Guwahati, Assam, India.

Kaustav Goswami

Indian Institute of Information Technology Guwahati, Assam, India.

Dip Sankar Banerjee

Indian Institute of Information Technology Guwahati, Assam, India.

Keywords: Approximate adder · carry estimation · rectification logic · accuracy · delay · systems level simulations · image processing

* Both authors contributed equally to this research. This work has been accepted at the 30th ACM Great Lakes Symposium on VLSI (GLSVLSI) 2020 as a regular paper.

Approximate computing has garnered significant attention in recent years due to the massive data deluge and the requirements of fault-tolerant real-time computing. Several proposed techniques have shown that an approximate result with provable error bounds is often preferable to computing the correct result from scratch, which can take significantly longer. With the massive advancements in semiconductor technologies in recent years, digital circuits have also become more vulnerable to variations. These variations have made accurate results more difficult to ensure [3].

Techniques for analyzing large data sets and applying machine learning methods often rely on approximations to quickly model the available data. For the design of semiconductor devices, such methods are often feasible in the sense that if the applications themselves employ some level of approximation, then the results provided by hardware can also assist in approximation. This can potentially lead to much faster computations, simpler hardware, and additional power benefits.

Figure 1: Cascaded blocks of CESA-PERL

Work on implementing approximations in hardware [1, 4] in recent years has shown that approximation yields acceptable results in the end applications. Approximation can be done at several levels, and works focusing on both full-system approximation [5] and software-level approximation [6] have been shown. Arithmetic circuits such as adders and multipliers form the basic building blocks of all such approximate systems. In hardware, the techniques proposed for achieving approximate results are mostly non-configurable, which means that it is not possible to trade off the approximation error against the amount of extra logic that needs to be implemented. In this work, we propose a new solution for approximate addition via a simultaneous carry estimation technique (CESA). Additionally, we fine-tune our design to ensure that only a marginal amount of extra space is consumed on the die. We also propose a propagating error rectification logic (PERL), which yields higher accuracy than the basic adder design (CESA) for larger computations. We name this modified design CESA-PERL (Carry Estimating Simultaneous Adder with Propagating Error Rectification Logic).

An intuitive look at how adders work in hardware (like the popular ripple-carry adder) shows that the opportunity for an approximate result depends on how early a possible point of error can be detected. The potential points of good approximation are the carries that are generated after every pair of bits is added. So, a naive way to limit error accumulation at the carries could be to process the carries sequentially as and when they are generated. On close observation, however, once the instruction for addition is invoked, the register gets its result after a finite amount of delay. So, if we look at all the carries generated in parallel and ensure that the error does not accumulate beyond a certain threshold, then the result can be quickly generated and accurately estimated.

Our main motivation behind this work is achieving faster computations through approximate additions which incur a lower area overhead. Since accuracy is a function of the chip area, we exploit the fundamental intuition that errors cumulatively build across parallel addition blocks. If we can track this propagation, we can correct the errors trivially, which leads to much better accuracy. This is what we achieve through the design of our naive approximate adder called CESA, which is further optimized by rectifying the propagating errors in the CESA-PERL design (discussed in Section 3). We show that on the application's end, we achieve a speedup of around 2.83x over accurate adders, at an additional space requirement of 12.5%. When compared to other state-of-the-art approximate adders, we observe that our design outperforms existing designs that consume similar area overheads by 74% on average (discussed in Section 4).

In this section, we discuss the preliminary ideas behind our designs. The basic architecture of the proposed adder, depicted in Figure 1, comprises [n/k] k-bit summation blocks which generate the partial results in a non-blocking, parallel manner. If two n-bit input operands are taken for approximate addition, the proposed adder CESA-PERL divides the input operands into n/k equal segments, called sub-inputs. The sub-inputs are then fed to a k-bit sub-adder, which is part of the k-bit summation block. The k-bit summation block also contains the CEU (Carry Estimate Unit), PERL (Propagating Error Rectification Logic), and an SU (Selection Unit).
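The segmentation step can be illustrated with a minimal Python sketch. The function name `split_into_sub_inputs` is hypothetical, and the sketch assumes that k divides n exactly and that sub-inputs are listed from the least significant block upwards; neither detail is fixed by the text above.

```python
def split_into_sub_inputs(value: int, n: int, k: int) -> list[int]:
    """Split an n-bit operand into n/k k-bit sub-inputs, least significant first.

    Assumption for this sketch: k divides n exactly.
    """
    if n % k != 0:
        raise ValueError("this sketch assumes that k divides n")
    mask = (1 << k) - 1  # k ones, selects one block
    return [(value >> (i * k)) & mask for i in range(n // k)]


a = 0b1011_0110
print([format(s, "04b") for s in split_into_sub_inputs(a, n=8, k=4)])
# ['0110', '1011']
```

Each list element would be fed to one k-bit sub-adder, so an 8-bit operand with k = 4 occupies two summation blocks.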

In Figure 1, the carry input of the $i$-th sub-adder is selected through the $(i-1)$-th SU, which selects the output of the CEU if it estimates the carry input correctly, and the output of PERL if the CEU generates an error in carry estimation. Since the carry input of the $i$-th sub-adder is not generated through the actual carry chain mechanism but selected directly by the SU, the results are approximate in nature. In our proposed adder, the estimated carry input of the $i$-th sub-adder is given by

$$C_{out} = Sel_{(i-1)} \cdot C_{ceu} + \overline{Sel_{(i-1)}} \cdot C_{perl} \tag{1}$$

In equation 1, $C_{out}$, besides being the carry input of the $i$-th sub-adder, is also the carry output of the $(i-1)$-th sub-adder. Here, $Sel_{(i-1)} \cdot C_{ceu}$ is the output signal of the select unit for the CEU and $\overline{Sel_{(i-1)}} \cdot C_{perl}$ is the output signal of the select unit for PERL. $Sel_{(i-1)}$ is the $(i-1)$-th Selection Unit (SU), whose logic is given by

$$Sel_{(i-1)} = \overline{(A^{i-1}_{k-1} \oplus B^{i-1}_{k-1}) \cdot (A^{i-1}_{k-2} \oplus B^{i-1}_{k-2})} \tag{2}$$

Subsequently, $C_{ceu}$ and $C_{perl}$ are respectively given by

$$C_{ceu} = A^{i-1}_{k-1} \cdot B^{i-1}_{k-1} + A^{i-1}_{k-2} \cdot B^{i-1}_{k-2} \, (A^{i-1}_{k-1} + B^{i-1}_{k-1}) \tag{3}$$

$$C_{perl} = A^{i-1}_{k-3} \cdot B^{i-1}_{k-3} + A^{i-1}_{k-4} \cdot B^{i-1}_{k-4} \, (A^{i-1}_{k-3} + B^{i-1}_{k-3}) \tag{4}$$

Here in equations 2, 3 and 4, $A^{i-1}_{k-1}$ and $B^{i-1}_{k-1}$ can be interpreted as the $(k-1)$-th input bits of the $(i-1)$-th sub-adder. The other terms used in these equations can be interpreted in a similar manner.

The Carry Estimating Simultaneous Adder (CESA) contains various sub-adders which compute the individual block sums in parallel. The Carry Estimate Unit (CEU) generates the carry input for a summation block based on the two most significant bits of the summation block previous to it. Since four input bits are involved, namely $A^{i-1}_{k-1}$, $B^{i-1}_{k-1}$, $A^{i-1}_{k-2}$ and $B^{i-1}_{k-2}$, a total of $2^4 = 16$ permutations arise. The idea behind carry estimation by the CEU is as follows:

• If $A^{i-1}_{k-1}$ and $B^{i-1}_{k-1}$ are (0, 0), then the carry input for the $i$-th block is 0 irrespective of previous carry-ins. This accounts for four binary combinations, since $A^{i-1}_{k-2}$ and $B^{i-1}_{k-2}$ can still vary over 0 and 1.
• If $A^{i-1}_{k-1}$ and $B^{i-1}_{k-1}$ are (1, 1), then the carry input for the $i$-th block is 1 irrespective of previous carry-ins. This also includes four binary combinations.
• If $A^{i-1}_{k-1}$ and $B^{i-1}_{k-1}$ are (1, 0) or (0, 1), then the carry input for the $i$-th block depends on the previous bits. If the previous bits $A^{i-1}_{k-2}$ and $B^{i-1}_{k-2}$ are (0, 0), then the carry input is 0, and if they are (1, 1), then it is 1.
• For the remaining cases, accurately determining the carry input would require a further backward traversal of the input bits. So in all four remaining cases, the carry input is approximated as 0 without any backward traversal.
This approximation leads to a tolerable error, as the actual carry input may not be 0.

Let $C_{radd}$ be the accurate carry input for the $i$-th sub-adder, which would have been generated through the carry chain mechanism in a traditional ripple-carry adder. Based on the above-mentioned cases, the carry input is always accurate in 12 out of 16 cases. So, the probability of correct carry bit estimation is

$$P(C_{ceu} = C_{radd}) = 12/16 = 3/4 \tag{5}$$

Clearly, an error occurs only when the carry estimation is false. So, using equation 5, the probability that the result generated by CESA is error-prone is

$$P(C_{ceu} \neq C_{radd}) = 1 - P(C_{ceu} = C_{radd}) = 1/4 \tag{6}$$

The probability of 1/4 in equation 6 can be interpreted as a worst-case error rate of 25%, provided the number of summation blocks is minimum.

Since this worst-case error rate is quite high, we propose an auxiliary rectification unit known as PERL to rectify the error. When the input bits $A^{i-1}_{k-1}$ and $B^{i-1}_{k-1}$ are (1, 0) or (0, 1) and the CEU is unable to determine the carry input accurately, then instead of selecting 0 as the approximate carry input, the output signal of PERL is selected, which significantly lowers error rates.

Figure 2: Error Metrics. (a) Error Rate (%), (b) Mean Relative Error Distance (MRED), (c) Mean Error Distance (MED, log scale). Each panel plots CESA, CESA-PERL, BCSA, BCSA with ERU, SARA and RAP-CLA against Size (Bit Size, Block Size) for (32,4), (32,8), (16,4), (16,8) and (8,4).

Now an error will only occur if both units, namely CEU and PERL, simultaneously fail to determine the carry input accurately. So the probability that the result generated by the adder is error-prone is

$$P(C_{out} \neq C_{radd}) = P(C_{ceu} \neq C_{radd}) \times P(C_{perl} \neq C_{radd}) = 1/16 \tag{7}$$

It is clear from equation 7 that the chances of getting error-prone results decrease significantly after the addition of PERL. Although adding PERL introduces some additional area overhead, the benefits that it provides in accuracy clearly outweigh this nominal overhead.

In this section, we briefly discuss the solution mechanism along with the hardware-level designs that are proposed.

Our proposed adder is designed with the sole motivation of providing higher accuracy at a low on-chip area cost and with minimal power consumption. The basic design is based on the divide-and-conquer technique, in which we divide the n-bit input operand into n/k equal segments called sub-inputs. The sub-inputs are then fed to sub-adders, which are part of the summation blocks. The summation blocks of size k can be evaluated in parallel to calculate their individual sums. Since all summation blocks are independent of each other, they produce their results at the same instant after a certain delay. Meanwhile, just before the beginning of the computations, the carry input of these blocks is selected through the SU. As discussed in Section 2, the SU selects the output of either the CEU or PERL based on various cases. The CEU, PERL and SU generate their results in a non-blocking manner, so that carry selection can be done concurrently with the sum calculation. This eliminates the added latency in the generation of the carry input, hence producing the result in much less time.

Algorithm 1: SummationBlock($A^{i-1}_{k-1}$, $B^{i-1}_{k-1}$, $C_{in}$)
    Initialize: sum[n]
    $C_{ceu}$ = CEU($A^{i-1}_{k-1}$, $B^{i-1}_{k-1}$, $A^{i-1}_{k-2}$, $B^{i-1}_{k-2}$)
    $C_{perl}$ = PERL($A^{i-1}_{k-3}$, $B^{i-1}_{k-3}$, $A^{i-1}_{k-4}$, $B^{i-1}_{k-4}$)
    $Sel$ = SU($A^{i-1}_{k-1}$, $B^{i-1}_{k-1}$, $A^{i-1}_{k-2}$, $B^{i-1}_{k-2}$)
    $C_{out}$ = $Sel \cdot C_{ceu} + \overline{Sel} \cdot C_{perl}$
    for j ← 0 to k − 1 do
        sum[j] = $A^{i-1}[j] \oplus B^{i-1}[j] \oplus C_{in}$
        $C_{in}$ = $(A^{i-1}[j] \cdot B^{i-1}[j]) + ((A^{i-1}[j] \oplus B^{i-1}[j]) \cdot C_{in})$
    end for
    return (sum, $C_{out}$)

Algorithm 1 shows how the estimation of the carry input for the $i$-th block and the sum calculation in the $(i-1)$-th summation block take place simultaneously. Here $Sel$, $C_{ceu}$ and $C_{perl}$ are computed using equations 2, 3 and 4 respectively. Each summation block returns the sum bits and a carry-out ($C_{out}$). This $C_{out}$ acts as the carry input for the next block. The carry input for the first block is initialized to zero, while the other summation blocks take the carry-out of the previous block as their carry input. In this way, all the blocks are evaluated in parallel and each block follows Algorithm 1 to generate its results. Now we will briefly discuss the various components of the summation block.
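As a rough software model of Algorithm 1, the following Python sketch simulates the whole adder block by block. It is only an illustration under stated assumptions: the names `ceu`, `bit` and `cesa_add` are hypothetical, the CEU and PERL share the Equation 3 form with PERL reading the next two lower bit pairs, PERL is selected exactly when both top bit pairs differ, and the final estimated carry is appended as bit n. The loop is sequential only because this is a simulation; in hardware every block's carry input depends solely on input bits and can be formed at once.

```python
def ceu(a_hi: int, b_hi: int, a_lo: int, b_lo: int) -> int:
    """Carry estimate from two adjacent bit pairs, in the form of Equation 3."""
    return (a_hi & b_hi) | (a_lo & b_lo & (a_hi | b_hi))


def bit(x: int, j: int) -> int:
    """Return bit j of x."""
    return (x >> j) & 1


def cesa_add(a: int, b: int, n: int = 16, k: int = 4, use_perl: bool = True) -> int:
    """Approximate n-bit addition with k-bit blocks (hypothetical helper)."""
    min_k = 4 if use_perl else 2
    if n % k != 0 or k < min_k:
        raise ValueError("k must divide n and respect the minimum block size")
    mask = (1 << k) - 1
    result = 0
    carry_in = 0  # the first block starts with a carry input of zero
    for i in range(n // k):
        sa = (a >> (i * k)) & mask
        sb = (b >> (i * k)) & mask
        # Exact k-bit sum inside the block, using the estimated carry input.
        result |= ((sa + sb + carry_in) & mask) << (i * k)
        # The carry for the next block depends only on this block's input bits.
        carry_in = ceu(bit(sa, k - 1), bit(sb, k - 1), bit(sa, k - 2), bit(sb, k - 2))
        if use_perl:
            both_differ = (bit(sa, k - 1) ^ bit(sb, k - 1)) & (bit(sa, k - 2) ^ bit(sb, k - 2))
            if both_differ:  # CEU cannot decide, so the PERL output is selected
                carry_in = ceu(bit(sa, k - 3), bit(sb, k - 3), bit(sa, k - 4), bit(sb, k - 4))
    return result | (carry_in << n)  # final estimated carry becomes bit n


print(cesa_add(5, 11, n=8, k=4, use_perl=False))  # 0
print(cesa_add(5, 11, n=8, k=4, use_perl=True))   # 16
```

For 5 + 11 with 4-bit blocks, both top bit pairs of the low block differ, so the CEU alone estimates a carry of 0 and the result is 0, while the PERL path recovers the carry from the two lower bit pairs and the result is the exact sum 16.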

The Carry Estimate Unit (CEU) estimates the carry input of the next block based on the two most significant bits of the previous block. It produces its output after two gate-level delays, which is faster than the delay of a single full adder. Equation 3 describes the logic behind it.

The hardware design of PERL is exactly the same as that of the CEU, but a different set of inputs is fed to PERL. If the CEU wrongly estimates the carry input using the two most significant bits of the block, then the two next most significant bits, adjacent to the previous ones, are fed to PERL. As shown in equation 7, the chances of estimating the carry input incorrectly decrease after the inclusion of PERL in the hardware design. In this way, PERL rectifies the propagating error due to false carry estimation.
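The probabilities in equations 5 to 7 can be checked by brute force. The sketch below is an illustration rather than the authors' procedure: it assumes uniformly distributed input bits, treats an estimate as "always correct" when it matches the true ripple carry for both possible incoming carries, and uses hypothetical names (`ceu`, `true_carry`, `count_always_correct`).

```python
from itertools import product


def ceu(a_hi: int, b_hi: int, a_lo: int, b_lo: int) -> int:
    """Carry estimate from two adjacent bit pairs, in the form of Equation 3."""
    return (a_hi & b_hi) | (a_lo & b_lo & (a_hi | b_hi))


def true_carry(pairs, c_in: int) -> int:
    """Ripple the carry through (a, b) bit pairs given least significant first."""
    c = c_in
    for a, b in pairs:
        c = (a & b) | ((a ^ b) & c)
    return c


def count_always_correct(use_perl: bool):
    """Count input combinations whose estimate holds for any incoming carry."""
    width = 4 if use_perl else 2  # bits per operand examined by the estimator
    ok = total = 0
    for bits in product((0, 1), repeat=2 * width):
        a, b = bits[:width], bits[width:]  # index 0 is the lowest examined bit
        est = ceu(a[-1], b[-1], a[-2], b[-2])
        if use_perl and (a[-1] ^ b[-1]) and (a[-2] ^ b[-2]):
            est = ceu(a[-3], b[-3], a[-4], b[-4])  # PERL reads the next two pairs
        pairs = list(zip(a, b))
        if all(true_carry(pairs, c) == est for c in (0, 1)):
            ok += 1
        total += 1
    return ok, total


print(count_always_correct(use_perl=False))  # (12, 16)
print(count_always_correct(use_perl=True))   # (240, 256)
```

The first count reproduces the 12 out of 16 cases of equation 5. The second leaves 16 of 256 combinations unresolved, which is the 1/16 of equation 7.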

The SU selects the output signal of the CEU if it estimates the carry input correctly, and the output signal of PERL if the carry estimated through the CEU was false and required rectification. Equation 2 gives the logic of the SU, which is obtained through Boolean simplification. It also generates its result after two gate-level delays, ensuring a faster selection of the available carry input. Figure 1 shows three cascaded summation blocks of our proposed adder design. The logic for all individual units of the summation block is generated through Boolean simplification, and the circuit for each unit is built from the resulting logic expressions.

In our proposed adder CESA-PERL, the minimum block size is 4, because at least 4 input bits are required to estimate the carry bit with error rectification via PERL. However, if the propagating error is ignored and PERL is not included in the hardware design (as in the case of CESA), then the minimum block size can be reduced to 2, as the carry estimate unit requires only 2 input bits of the summation block in order to generate the carry input for the next block.

Figures 2(a), 2(b) and 2(c) depict the error results. CESA-PERL shows the least error rate when the block size is lowest. Since the minimum block size can be four, the least error rate occurs when the input operands are divided into n/4 blocks.

In order to facilitate the use of our proposed adder from a software perspective, we propose two additional assembly-level instructions, namely adx and adxi, on existing ISAs. adx, or approximately add, adds two numbers r1 and r2 approximately, explicitly on the CESA/CESA-PERL circuit. adxi, or approximately add immediate, functions the same as adx but includes an immediate as an operand.

In this section, we evaluate CESA and CESA-PERL using standard metrics to measure the error and the gains obtained in power and delay.

CESA and CESA-PERL are compared to various other state-of-the-art approximate adders, and all of them are evaluated using standard error metrics. Error analysis was done using GNU Octave for random cases, with averages taken over a dozen runs. The comprehensive results are shown in Figure 2. The error metrics ER (Error Rate) [7], MED (Mean Error Distance) [7] and MRED (Mean Relative Error Distance) [7] are used to compare our designs with SARA [1], RAP-CLA [8], BCSA and BCSA with ERU (Error Reduction Unit) [2].

For 8-bit numbers, CESA gave accurate results 85.94% of the time, whereas RAP-CLA gave accurate results 91.1% of the time, since RAP-CLA is extensively based on the carry-lookahead adder (shown in Figure 2(a)). This, however, comes at an area cost that is significantly higher than that of our solution, as discussed in Section 4.2.2. SARA and BCSA give accurate results 82.25% and 83.5% of the time respectively, which is lower than CESA, but BCSA with ERU produces accurate results 90.55% of the time. The good output of BCSA with ERU can be credited to the error reduction unit, which comes at the cost of an area overhead. For 16-bit numbers, CESA gave accurate results 70.1% of the time, whereas RAP-CLA gave accurate results 85% of the time, and SARA, BCSA, and BCSA with ERU gave accurate results 68.4%, 70.6% and 82.2% of the time respectively. For 32-bit numbers too, CESA was found to be better than SARA and BCSA in terms of accuracy by 42.5%.

Figure 3: Hardware Evaluation. (a) Area (µm²), (b) Power (µW), (c) Delay (ns). Each panel plots CESA, CESA-PERL, BCSA, BCSA with ERU, SARA and RAP-CLA against Size (Bit Size, Block Size) for (32,4), (32,8), (16,4), (16,8) and (8,4).

For CESA-PERL, we find that error rates drop significantly, as shown in Figure 2(a). The drop in error rates can be attributed to the use of PERL, which comes at a minimal area overhead. On average, we see that CESA-PERL is better than SARA and BCSA by 74%.
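For readers who want to reproduce this kind of comparison, the three metrics can be computed as in the Python sketch below. It is not the authors' GNU Octave script. The definitions used are the usual ones (ER is the fraction of wrong results, MED the mean of |approximate − exact|, MRED the mean of that distance divided by the exact sum), and skipping pairs whose exact sum is zero in MRED is a choice made for this sketch. `error_metrics` and the toy adder `no_carry_add`, which simply drops the carry between two 4-bit halves, are hypothetical names.

```python
import random


def error_metrics(approx_add, n: int, samples: int = 10_000, seed: int = 0) -> dict:
    """Estimate ER, MED and MRED of an n-bit approximate adder on random inputs."""
    rng = random.Random(seed)
    wrong = 0
    ed_sum = 0
    red_sum = 0.0
    red_count = 0
    for _ in range(samples):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        exact = a + b
        ed = abs(approx_add(a, b) - exact)  # error distance of this addition
        wrong += ed != 0
        ed_sum += ed
        if exact != 0:  # relative error is undefined when the exact sum is zero
            red_sum += ed / exact
            red_count += 1
    return {
        "ER": wrong / samples,
        "MED": ed_sum / samples,
        "MRED": red_sum / max(red_count, 1),
    }


def no_carry_add(a: int, b: int) -> int:
    """Toy 8-bit adder that drops the carry out of the low 4-bit half."""
    low = (a & 0xF) + (b & 0xF)
    high = (a >> 4) + (b >> 4)
    return (high << 4) | (low & 0xF)


print(error_metrics(lambda a, b: a + b, n=8))  # all three metrics are zero
print(error_metrics(no_carry_add, n=8))
```

Any adder model with the same two-argument signature can be passed in place of the toy adder to obtain figures comparable in kind to those in Figure 2.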

We compare our adders CESA and CESA-PERL with SARA [1], RAP-CLA [8], BCSA and BCSA with ERU [2] using Synopsys Design Compiler (DC) with the NanGate Open Cell Library (45nm technology node) on global operating voltage . V. All the adders were described using Verilog HDL. We add 8, 16 and 32-bit numbers for various block sizes, thereby configuring accuracy.

For an 8-bit design, CESA shows faster output generation than SARA (by 2.99%), BCSA (by 17.29%) and BCSA with ERU (by 25.8%). Similarly, for 16-bit and 32-bit designs, we find that CESA yields results faster, as shown in Figure 3(c). The higher delay of CESA compared to RAP-CLA can be attributed to the use of the carry-lookahead adder concept in RAP-CLA. On average, we see that CESA's delay is 14.57% lower than that of SARA, BCSA, and BCSA with ERU combined. When compared to the conventional ripple-carry adder, we find that CESA is 91.2% faster when used in a best-case scenario. Due to the additional rectification logic in CESA-PERL, we find that on average SARA and RAP-CLA outperform CESA-PERL by 26.4%. Compared with BCSA and BCSA with ERU, however, the delay of CESA-PERL is better by 9.98% across all configurations, as shown in Figure 3(c).

Investigating the statistics of area (shown in Figure 3(a)), we see that for 8-bit additions CESA takes 10.51% less area compared to SARA, RAP-CLA, BCSA and BCSA with ERU. The larger area of CESA compared to SARA is due to the use of additional circuitry for parallel carry estimation. Even though SARA also does a parallel estimation, our implementation looks at the vicinity of the MSBs for carry estimation, whereas SARA simply looks at the MSB. Coming to the 16-bit architecture, we see that CESA takes less area than RAP-CLA (by 24.83%), SARA (by 3.63%), BCSA (by 20.64%) and BCSA with ERU (by 27.43%). Similarly, a smaller area for CESA can be observed for the 32-bit design as well. On average, we see that the area overhead of SARA is actually higher than that of CESA, given the advantages that CESA gets when the carry estimation happens over longer distances. In CESA-PERL, the rectification logic (PERL) adds to the area of CESA. Even after the addition of PERL to CESA, the area of CESA-PERL comes out to be 10.3% less than that of RAP-CLA, BCSA and BCSA with ERU on average.

The power comparison in µW is shown in Figure 3(b). In the case of 8-bit designs, we see that CESA takes 9.49% less power than RAP-CLA, 1.90% more power than SARA, and 10.56% and 19.44% less power than BCSA and BCSA with ERU respectively. More power is consumed with respect to SARA in the 8-bit design because of the larger area of CESA. For 16-bit additions, we find that CESA takes 17.33% less power on average compared to the other four adders. Finally, for 32-bit designs, we see that CESA takes 20.23% less power on average than RAP-CLA, SARA, BCSA, and BCSA with ERU combined. The lower power consumed by CESA overall is due to the smaller amount of extra logic that CESA uses to implement the approximate adder in comparison to the others. We see that on average CESA-PERL takes 12.54% less power than RAP-CLA, BCSA, and BCSA with ERU, as shown in Figure 3(b). Though CESA-PERL takes more power than SARA, it is worth noting that the accuracy of SARA is nowhere near that of CESA-PERL.

We have selected 7 integer SPEC CPU2006 benchmarks, namely bzip2, sjeng, astar, libquantum, mcf, hmmer and omnetpp, to evaluate the speedup of our proposed approximate adder on end applications. We have used the delay values obtained using Synopsys Design Compiler for all 32-bit configurations. Using these values, we have modified the addition parameters in GEM5 to measure the runtime of the end applications. This experiment is geared only towards measuring the speedup obtained at the application end and not the program's correctness. Simulation was done for 1 billion instructions. We have simulated the benchmarks on GEM5 [9] using a system with 4 out-of-order CPU cores with a frequency of 2 GHz each and a DRAM size of 4 GB. The system had three levels of cache (L1, L2, L3) of sizes 64 KB, 512 KB and 4 MB respectively.

In this section we briefly evaluate the performance of some applications that can potentially benefit from CESA and CESA-PERL.

Figure 4: Gaussian image smoothing (a) Original image, (b) Original image with noise, (c) RAP-CLA, PSNR=29.366dB & SSIM=0.7814, (d) SARA, PSNR=26.79dB & SSIM=0.787, (e) BCSA, PSNR=33.9dB & SSIM=0.9142, (f) BCSA with ERU, PSNR=37.837dB & SSIM=0.9482, (g) CESA, PSNR=32.032dB & SSIM=0.9007, (h) CESA-PERL, PSNR=36.097dB & SSIM=0.9302

In this section, we study the application of Gaussian smoothing using CESA and CESA-PERL and compare them to other state-of-the-art approximate adders, as shown in Figure 4. For this purpose, we take a 256x256 grayscale image of Lena and apply Gaussian smoothing to it. We take a × filter and apply it to a noisy image of Lena. The original filter has fractional numbers, but for our application using CESA/CESA-PERL we need fixed-point numbers, hence we round them. The addition operation in the convolution is approximated and the rest of the arithmetic operations are unchanged.

We compare our adder's accuracy with that of other state-of-the-art approximate adders based on the metrics Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM) Index [10] for 32-bit approximate adders with a block size of 8. The PSNRs and SSIMs of all approximate adders are computed with respect to accurate addition in Gaussian smoothing. The results indicate that CESA has a PSNR of 32.032dB and an SSIM of 0.9007, which is 12.3% better than that of RAP-CLA and SARA combined. This gain can be attributed to the parallel carry estimation done by CESA. The PSNR and SSIM values of BCSA and BCSA with ERU are, however, better than those of CESA by 10.4% and 3.25% respectively, because in BCSA and in BCSA with ERU an Error Reduction Unit is present, which inherently provides better results than our technique at a significant area overhead. Coming to CESA-PERL, Figure 4 shows that it outperforms all other approximate adders except BCSA with ERU. But compared to PERL, ERU consumes higher on-chip area and produces more accurate results at the expense of latency.
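For reference, PSNR between the exactly smoothed image and an approximately smoothed one can be computed as in the short Python sketch below. It uses the standard definition, 10·log10(MAX² / MSE), and assumes 8-bit grayscale pixels stored as nested lists; the function name `psnr` and this data layout are choices made for the sketch, not details taken from the experiments above.

```python
import math


def psnr(reference, test, max_value: int = 255) -> float:
    """Peak Signal to Noise Ratio in dB between two equally sized 2D images."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            squared_error += (r - t) ** 2
            count += 1
    mse = squared_error / count  # mean squared error over all pixels
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


exact_output = [[10, 20], [30, 40]]   # smoothing with accurate addition
approx_output = [[10, 22], [30, 40]]  # one pixel perturbed by the approximate adder
print(psnr(exact_output, exact_output))   # inf
print(psnr(exact_output, approx_output))
```

A higher value means the approximately smoothed image is closer to the one produced with accurate addition, which is how the PSNR figures in Figure 4 are to be read.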

Figure 5: K-Means (a) CESA-PERL K-Means with (32,8) and (32,16), (b) CESA-PERL K-Means with (32,4). The incorrectly clustered data point is highlighted in a black box.

We have evaluated CESA-PERL on the K-Means clustering algorithm. We have considered a simple dataset containing 150 data points with 3 clusters. For bit size and block size configurations of (32, 8) and (32, 16), our adder performs accurate clustering on the given dataset (Figure 5(a)). However, for CESA-PERL configured with a bit size of 32 and a block size of 4, our results differ from the accurate result by 0.66 (Figure 5(b)). Evaluating CESA on K-Means shows similar results, with the same incorrect clustering accuracy.

To measure the performance improvement, we have used GEM5 statistics for 7 SPEC CPU2006 benchmarks, as mentioned in Section 4.3. Our proposed adders, CESA and CESA-PERL, were used to find the speedup obtained for a program's addition operations, compared to a baseline case of a conventional system with a ripple-carry adder. The speedups obtained using CESA-PERL for the bit size and block size configurations (32, 4), (32, 8) and (32, 16) were 2.57x, 2.03x, and 1.50x respectively, compared to the baseline case. CESA-PERL cannot be used in the (32, 2) configuration due to the minimum bits requirement mentioned in Section 3.1.3. So, we have used the CESA adder to obtain the results for the (32, 2) configuration. The speedup obtained was 2.83x.

Approximate adders, in general, are developed to cater to compute-intensive applications that require fast computations. An approximate extension to the carry-lookahead adder was proposed through RAP-CLA [8], which reduced the area of the actual carry-lookahead adder. RAP-CLA, however, suffers from a delay that is higher than that of CLA and produces results that are on average 63.7% more error-prone than an accurate adder.

The approximate binary adders [11, 12] split the input operands into two segments. Here, the LSBs are approximately computed and the MSBs are accurately computed, thus producing the result in less time. A few other types of approximate adders are based on carry selection [13, 14, 15]. In these adders, every block computes its sum assuming carry inputs of both 0 and 1, similar to the conventional carry-select adder, and one of the two sums is selected based on a predicted carry rather than the accurate carry. The approximate adders RAP-CLA [8] and SARA [1] speculate the carry and compute the correct carry, thus leading to a costlier design than an accurate adder. In some other approximate adders, the input operands are divided into various segments. In [16, 2], the sum of each segment is computed independently, with blocks being computed using either accurate or approximate methods.

It is worth noting that [1, 8] employ multiplexers in their designs, which leads to a higher consumption of on-chip area. Hence we investigate two designs, CESA and CESA-PERL. CESA does not require multiplexing but has lower accuracy for larger computations, whereas CESA-PERL, with the addition of multiplexers, provides higher accuracy for both small and large computations. Our solution for the fundamental adder architecture (CESA) differs from the existing solutions in that we do not use any multiplexer in it, leading to a lower space requirement with better accuracy. This design, while providing good results on smaller block sizes, generates some error on larger computations. To counter that, we also propose a rectification logic (PERL) which negates the effect of the propagating error to a great extent, at the cost of an additional area overhead. In this case, we make use of one multiplexer per block in CESA-PERL to select the carry input.


In this paper, we propose an approximate carry estimating simultaneous adder called CESA. It is based on a nearly accurate estimation of the carry-out using a carry estimator circuit. It has significantly lower power consumption, delay and area overhead than other state-of-the-art approximate adders. Moreover, we also propose a rectification logic, PERL, which yields more accurate results for larger computations. In the near future, we plan to extend this work to incorporate hardware support for the addition of signed integers alongside floating point numbers, and to implement a subsequent compiler design to use the same.

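References

[1] W. Xu, S. S. Sapatnekar, and J. Hu. A Simple Yet Efficient Accuracy-Configurable Adder Design. IEEE Transactions on VLSI Systems, pages 1112–1125, 2018.

[2] F. Ebrahimi-Azandaryani, O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram. Block-based carry speculative approximate adder for energy-efficient applications. IEEE Transactions on Circuits and Systems II: Express Briefs, 67(1):137–141, 2020.

[3] J. Han and M. Orshansky. Approximate computing: An emerging paradigm for energy-efficient design. In , pages 1–6, 2013.

[4] M. S. Ansari, B. F. Cockburn, and J. Han. A hardware-efficient logarithmic multiplier with improved accuracy. In , pages 928–931, 2019.

[5] A. Raha and V. Raghunathan. Towards Full-System Energy-Accuracy Tradeoffs: A Case Study of An Approximate Smart Camera System. In Proceedings of the 54th Annual Design Automation Conference, DAC 17, pages 1–6, 2017.

[6] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke. Paraprox: Pattern-based Approximation for Data Parallel Applications. In , pages 35–50, 2014.

[7] J. Liang, J. Han, and F. Lombardi. New metrics for the reliability of approximate and probabilistic adders. IEEE Transactions on Computers, 62(9):1760–1771, 2013.

[8] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram. RAP-CLA: A Reconfigurable Approximate Carry Look-Ahead Adder. IEEE Transactions on Circuits and Systems II: Express Briefs, pages 1089–1093, 2018.

[9] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The Gem5 Simulator. SIGARCH Comput. Archit. News, pages 1–7, Aug 2011.

[10] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[11] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong. Design of Low-Power High-Speed Truncation-Error-Tolerant Adder and Its Application in Digital Signal Processing. IEEE Transactions on VLSI Systems, pages 1225–1229, 2010.

[12] N. Zhu, W. L. Goh, and K. S. Yeo. An enhanced low-power high-speed Adder For Error-Tolerant application. In Proceedings of the 12th International Symposium on Integrated Circuits, pages 69–72, 2009.

[13] K. Du, P. Varman, and K. Mohanram. High performance reliable variable latency carry select addition. In Design, Automation Test in Europe Conference Exhibition (DATE), pages 1257–1262, 2012.

[14] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu. On reconfiguration-oriented approximate adder design and its application. In , pages 48–54, 2013.

[15] J. Hu and W. Qian. A new approximate adder with low relative error and correct sign calculation. In , pages 1449–1454, 2015.

[16] N. Zhu, W. L. Goh, and K. S. Yeo. Ultra low-power high-speed flexible Probabilistic Adder for Error-Tolerant Applications. In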
SIGARCHComput. Archit. News , page 1–7, Aug 2011.[10] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility tostructural similarity.

IEEE Transactions on Image Processing , 13(4):600–612, 2004.[11] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong. Design of Low-Power High-Speed Truncation-Error-Tolerant Adder and Its Application in Digital Signal Processing.

IEEE Transactions on VLSI Systems , pages1225–1229, 2010.[12] N. Zhu, W. L. Goh, and K. S. Yeo. An enhanced low-power high-speed Adder For Error-Tolerant application. In

Proceedings of the 12th International Symposium on Integrated Circuits , pages 69–72, 2009.[13] K. Du, P. Varman, and K. Mohanram. High performance reliable variable latency carry select addition. In

Design,Automation Test in Europe Conference Exhibition (DATE) , pages 1257–1262, 2012.[14] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu. On reconﬁguration-oriented approximate adder design and itsapplication. In , pages 48–54,2013.[15] J. Hu and W. Qian. A new approximate adder with low relative error and correct sign calculation. In , pages 1449–1454, 2015.[16] N. Zhu, W. L. Goh, and K. S. Yeo. Ultra low-power high-speed ﬂexible Probabilistic Adder for Error-TolerantApplications. In