ApproxFPGAs: Embracing ASIC-Based Approximate Arithmetic Components for FPGA-Based Systems
Bharath Srinivas Prabakaran, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, Muhammad Shafique
To appear at the 57th Design Automation Conference (DAC), July 2020, San Francisco, CA, USA.
Bharath Srinivas Prabakaran∗,‡, Vojtech Mrazek†,‡, Zdenek Vasicek†, Lukas Sekanina†, Muhammad Shafique∗
∗Institute of Computer Engineering, Technische Universität Wien (TU Wien), Austria
{bharath.prabakaran, muhammad.shafique}@tuwien.ac.at
†Faculty of Information Technology, IT4Innovations Centre of Excellence, Brno University of Technology, Czech Republic
{mrazek, vasicek, sekanina}@fit.vutbr.cz

Abstract—There has been abundant research on the development of Approximate Circuits (ACs) for ASICs. However, previous studies have illustrated that ASIC-based ACs offer asymmetrical gains when used in FPGA-based accelerators. Therefore, an AC that is pareto-optimal for ASICs might not be pareto-optimal for FPGAs. In this work, we present the ApproxFPGAs methodology, which uses machine learning models to reduce the time required to explore the state-of-the-art ASIC-based ACs and determine the set of pareto-optimal FPGA-based ACs. We also perform a case-study to illustrate the benefits obtained by deploying these pareto-optimal FPGA-based ACs in a state-of-the-art automation framework that systematically generates pareto-optimal approximate accelerators, which can be deployed in FPGA-based systems to achieve high performance or low power consumption.

Index Terms—Approximate Computing, FPGA, ASIC, Adder, Multiplier, Arithmetic Units, Machine Learning, Statistics, Models, Synthesis.
I. INTRODUCTION
Field Programmable Gate Arrays (FPGAs) have become increasingly popular since their introduction [1]. Due to their (partial) run-time reconfigurability, short time-to-market, and lower prototyping costs compared to Application-Specific Integrated Circuits (ASICs), FPGAs are preferred in a wide variety of applications. These comprise domains like high-performance computing clusters and server platforms that offer "FPGAs as a Service", and embedded and cyber-physical systems, which perform complex data-computations on the configurable arrays [2]. The current generation of FPGAs is equipped with a wide range of capabilities that can be used to design a Programmable System (on a Chip) by including hard IPs (IC realizations) of the low-power ARM A9 processor core and other commonly used hardware accelerators, such as video codecs [3]. However, FPGAs are lower-performance, power-hungry devices that are considerably less energy-efficient than ASICs.

The Approximate Computing paradigm offers a direction of research in which intermediate computational units can be approximated, without "significantly" degrading the output quality, to obtain savings in power/energy consumption and latency [4]. This error-tolerance is exhibited by applications in the fields of recognition, mining, and synthesis, due to the following four factors: (i) redundancy in the processed data, (ii) algorithms with error-attenuating patterns, (iii) non-existence of a unique golden output, and (iv) imperceptible differences in the output quality for end-users. Since its re-emergence, plenty of research works from academia and industry have exploited this phenomenon across the hardware [5]–[16] and software [17]–[20] layers to obtain power/energy/latency savings.

Most of the current works on approximate circuits (ACs) primarily focus on obtaining energy/power/latency savings in ASIC-based systems. Previous studies have illustrated that ASIC-based approximate computing principles and techniques offer asymmetric savings when implemented on FPGAs [13] [15] [16]. State-of-the-art approximate adders and multipliers can offer substantial energy savings when synthesized for ASICs, whereas the same designs offer minimal savings, or at times negative savings, i.e., an increase in resources, when synthesized for FPGAs. This is primarily due to the architectural differences between ASICs and FPGAs: the required functionality is realized using logic gates in ASICs, but using Lookup Tables (LUTs) made of SRAM elements in FPGAs.

Therefore, an AC that offers significant savings while introducing the least error (pareto-optimal) for ASICs might not necessarily be pareto-optimal for FPGAs. Note, by pareto-optimal approximate circuits we mean the set of all circuits in the library that are not dominated by any other circuit from that set in terms of the evaluation metrics.

Furthermore, the works presented in [13]–[16] have developed FPGA-based approximate circuits by analyzing the architecture of the target FPGA. These techniques are typically not scalable, due to their manual lookup-table optimizations and approximations, and do not offer multiple pareto-optimal design points that trade off power consumption against the introduced error. To further illustrate these behavioral differences between ASICs and FPGAs, we present a motivational analysis of our work in the next sub-section.

‡These two authors have contributed to this work equally.
A. Motivational Analysis
We synthesize and implement a small subset of the 8x8 unsigned approximate multiplier designs from the library of evolutionary approximate arithmetic circuits [21] and the state-of-the-art FPGA-based approximate multiplier designs [16]. These circuits were synthesized and implemented for the Xilinx xc7vx485tffg1157-1 FPGA using the Vivado tool-chain, with the Digital Signal Processing (DSP) blocks disabled to ensure that the designs are mapped to the reconfigurable logic (see details in Section III). We also evaluate the output quality of these approximate circuits with the help of their behavioral models by computing their Mean Error Distance (MED), which we define as the average of the absolute error across all input combinations, relative to the maximum output value [22]. Based on the resources required for each of these designs and their MED, we extract the pareto-front of approximate 8x8 multipliers for the target FPGA and compare it to the pareto-front obtained when the same designs are synthesized for ASICs. The results of these experiments are presented in Fig. 1. From these results, we make the following key observations:

(1) The ACs that are pareto-optimal for ASICs (ASIC-ACs) are not necessarily pareto-optimal for FPGAs (FPGA-ACs). As discussed earlier, this is primarily due to the differences in realizing logic functions on the ASIC and FPGA platforms.

(2) The time required for synthesizing even a small percentage of the approximate 8x8 multiplier library runs into multiple days. This huge time requirement can also be attributed to the architectural differences between FPGAs and ASICs: the synthesis and routing algorithms of an FPGA tool-flow need to map the functionality to existing hardware blocks on the target FPGA while optimizing for various factors and constraints to maximize performance.

(3) The state-of-the-art FPGA-based approximate multipliers [16] are not pareto-optimal even when compared to this small subset of approximate multipliers from [21]. Similarly, the other FPGA-based approximate adders and multipliers presented in [13] [15] are neither pareto-optimal nor scalable. Due to their manual optimizations and circuit designs, they are not effective in achieving performance/power trade-offs similar to those of the evolutionary approximate arithmetic library for larger bit-widths.

Fig. 1: Analysis of Pareto-optimal Approximate Circuits for Approximate 8x8 Multipliers and State-of-the-Art (SoA) FPGA-Based Approximate Multipliers [16].

Based on these observations, we have identified the following research challenges:
• Based on the time required for synthesizing and implementing a small subset of the designs for the target FPGA, the time required for exhaustively exploring all designs in the data-set would be in the magnitude of hundreds of hours, or a couple of weeks.
– How can we efficiently reduce the time required for exploring the design-space of approximate arithmetic units on FPGAs?
– Can we explore the concepts of machine learning to reduce the exploration time by estimating FPGA parameters? If yes, which machine learning algorithm?
• There is no set of pareto-optimal FPGA-ACs that offers a design-space trade-off between the resources consumed and the error introduced.
– How can we determine a set of pareto-optimal FPGA-ACs that can be deployed in error-tolerant applications to obtain power/energy/latency savings?
• There is no systematic automation framework that can be used to develop FPGA-ACs for a given error-tolerant application and its quality requirements.
– How can we systematically deploy FPGA-ACs in a given error-tolerant application to maximize performance or power/energy savings?
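The MED computation described above can be sketched as follows; `approx_mul8` is a hypothetical stand-in (truncating the two least-significant product bits), not a circuit from [21]:

```python
# Sketch: computing the Mean Error Distance (MED) of an approximate
# 8x8 multiplier from its behavioral model. `approx_mul8` is an
# illustrative assumption, not one of the library designs.

def approx_mul8(a: int, b: int) -> int:
    """Hypothetical approximate 8x8 multiplier (LSB truncation)."""
    return (a * b) & ~0b11  # drop the two lowest result bits

def med(mul, bits: int = 8) -> float:
    """Average absolute error over all input combinations,
    relative to the maximum exact output (2**bits - 1)**2."""
    max_out = (2**bits - 1) ** 2
    total = sum(abs(a * b - mul(a, b))
                for a in range(2**bits) for b in range(2**bits))
    return total / (2**bits * 2**bits) / max_out

print(f"MED = {med(approx_mul8):.6f}")
```

For 8-bit operands, the 65,536 input combinations can still be enumerated exhaustively; for larger bit-widths, the same metric is typically estimated by sampling.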
To address these research challenges, we propose the following novel contributions:
• We propose the ApproxFPGAs methodology, which deploys machine learning (ML) models to estimate the power and latency of approximate circuits. These ML models are trained using a small subset of the evolutionary approximate circuits [21].
• Based on the estimates, we propose to construct a pseudo-pareto-front, which can be used to determine the set of pseudo-pareto-optimal approximate circuits for varying bit-widths of the approximate arithmetic units. These circuits can then be synthesized for the target FPGA to measure the exact power and latency of these FPGA-ACs.
• These pareto-optimal FPGA-ACs are open-source and available online at https://github.com/ehw-fit/approx-fpgas, to enable reproducible research and foster development in this area.
• We also perform a case-study by deploying these pareto-optimal FPGA-ACs in a state-of-the-art automation framework that can systematically generate approximate accelerators, which can be deployed in FPGA-based systems to achieve high performance and/or low power/energy consumption.

II. THE APPROXFPGAS METHODOLOGY
Overview:
Fig. 2 presents an overview of the proposed methodology. The complete procedure can be divided into two sub-parts: (i) the first part deals with the training and testing of the ML models, which can be used to efficiently estimate the hardware resources of a given approximate arithmetic design, while (ii) the second part deals with the construction of the pareto-optimal FPGA-ACs, which can be deployed in error-tolerant applications.
Inputs:
We start by compiling the library of approximate arithmetic circuits that need to be analyzed and deployed in the target application. Without loss of generality, in this work, we consider the evolutionary library of approximate adder and multiplier circuits for illustrating the benefits of our methodology [21]. Note, the use of other state-of-the-art designs is orthogonal to our approach, and they can be appropriately included, with the necessary modifications, in the library of approximate circuits.

Fig. 2: An Overview of the ApproxFPGAs Methodology.
Exhaustive Exploration:
Due to the large number of designs present in the library, the time required for exploring all the designs exhaustively might be quite large, as stated in Section I. Fig. 3 presents a brief illustration of the estimated time required for synthesizing all the approximate circuits present in the library for the target FPGA. As can be observed, when the number of ACs in the library increases, the time required for exploring the designs rises and reaches a magnitude of hundreds of hours. Therefore, exhaustive exploration is not a feasible option for identifying the pareto-optimal approximate circuits for FPGAs. Fig. 3 also illustrates the savings in exploration time when the proposed ApproxFPGAs methodology is used for exploration as opposed to exhaustive exploration. The exploration time is reduced by a factor of ~10×, from 82.4 days to 8.2 days, including the time required for synthesizing the data-set, training and evaluating the ML models, and re-synthesizing the pareto-optimal FPGA-ACs.

Fig. 3: Time Required for Exhaustive Exploration Compared to our ApproxFPGAs Approach for all ACs in the Library.

ML-Model Learning: Due to the infeasible time requirements of exhaustive exploration, we propose to train and evaluate a wide variety of statistical and machine learning (S/ML) models, which can be used to estimate the resource requirements of an approximate circuit, given its hardware description. The models considered are listed below.

TABLE I: List of Light-weight Statistical/Machine Learning Models Used in ApproxFPGAs.
ML1: Regression w.r.t. ASIC-AC Power
ML2: Regression w.r.t. ASIC-AC Latency
ML3: Regression w.r.t. ASIC-AC Area
ML4: PLS Regression
ML5: Random Forest
ML6: Gradient Boosting
ML7: Adaptive Boosting (AdaBoost)
ML8: Gaussian Process
ML9: Symbolic Regression
ML10: Kernel Ridge
ML11: Bayesian Ridge
ML12: Coordinate Descent (Lasso)
ML13: Least Angle Regression
ML14: Ridge Regression
ML15: Stochastic Gradient Descent
ML16: K-Nearest Neighbours
ML17: Multi-Layer Perceptron (MLP)
ML18: Decision Tree
These S/ML models estimate FPGA parameters such as power consumption (W), latency (ns), and area (LUTs). Training these models requires a labeled data-set, with the FPGA parameters as the output labels and the hardware description of the AC as the input data. We build this data-set by randomly extracting a subset of the complete library of ACs and synthesizing it for the target FPGA platform. This subset is further partitioned into training and validation data-sets, which are then used to train and evaluate the various machine learning models, respectively. Without loss of generality, in this work, we evaluate the applicability of the most commonly used light-weight S/ML models (see Table I) to reduce the time required for exploring the library of ACs. We iteratively evaluate the accuracy of the models and modify their parameters based on the correlation obtained on the validation data-set to further improve the models' accuracy. Instead of synthesizing and implementing each circuit in the library, which might take weeks to months, we can roughly estimate the FPGA parameters of all circuits using these models in the order of seconds.

To estimate the accuracy of these ML models, we propose the fidelity metric, which evaluates the relationship between the measured (mes) and estimated (est) FPGA parameters for any two ACs in the library. We compute the fidelity F of a set of ACs, X, as:

F(X) = ( Σ_{x1∈X} Σ_{x2∈X} E(x1, x2) ) / |X|²   (1)

where E denotes the correctness of the relationship between the estimated and measured FPGA parameters:

E(x, y) = 1 if est(x) R est(y) ∧ mes(x) R mes(y), 0 otherwise   (2)

where R denotes one of the relations {<, >, =} between the FPGA parameters of the ACs.
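A direct Python sketch of the fidelity metric of Eqs. (1) and (2), averaging the pairwise ordering agreement over all pairs of circuits (the toy values are illustrative):

```python
# Sketch: the fidelity metric, i.e., the fraction of circuit pairs
# (i, j) whose <, =, > relation agrees between the measured and the
# estimated FPGA parameter values.
from itertools import product

def fidelity(measured, estimated):
    def rel(a, b):
        return (a > b) - (a < b)  # -1, 0, or +1
    n = len(measured)
    pairs = list(product(range(n), repeat=2))
    agree = sum(rel(estimated[i], estimated[j]) ==
                rel(measured[i], measured[j])
                for i, j in pairs)
    return agree / len(pairs)

# Toy example: one swapped pair lowers the fidelity below 1.0.
mes = [1.0, 2.0, 3.0, 4.0]
est = [1.1, 1.9, 4.2, 3.8]   # last two circuits ordered incorrectly
print(fidelity(mes, est))    # -> 0.875
```

Note that fidelity rewards a correct ordering of circuits rather than small absolute errors, which is what matters when the estimates are only used to pick pareto-candidates for re-synthesis.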
Pareto Construction:

Based on the outcome of our experiments (see Section IV), we select the best S/ML models to estimate the FPGA parameters of all ACs in the library. Based on these parameter estimates, we can determine the pareto-optimal FPGA-ACs. However, we have observed that these models have limited fidelity, because of which the real pareto-optimal ACs can be dominated by ACs whose parameters were estimated incorrectly. Therefore, we propose to construct multiple pseudo-pareto-fronts from the input set (library) of ACs, C. We determine the first set of pseudo-pareto-optimal ACs (F1) from the initial set of all ACs, C. Next, we eliminate all these pseudo-pareto-points from the input set to construct the second pseudo-pareto-front, i.e., using C \ F1 as the input, we determine F2. Similarly, we construct the third pseudo-pareto-front, F3, using the input C \ (F1 ∪ F2), and so on. By constructing multiple pseudo-pareto-fronts, we mitigate the inaccuracies associated with our S/ML models. The ACs lying on these pseudo-pareto-fronts can subsequently be synthesized again using our work-flow to determine the accurate FPGA parameters and the resources required. Hence, we have to synthesize an additional number of ACs when we construct multiple pseudo-pareto-fronts.

Based on the real FPGA parameter measurements obtained from the synthesis and implementation reports of Vivado, we construct an open-source library of pareto-optimal FPGA-ACs that offers a trade-off between the output quality and the resources consumed. This library can subsequently be utilized by application and system developers to further maximize the performance or the power and energy savings obtained while satisfying the quality constraints of the application. The RTL and behavioral models of the FPGA-ACs are open-source and available online at https://github.com/ehw-fit/approx-fpgas.
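The front-peeling procedure described above can be sketched as follows, for circuits represented as (error, cost) pairs with both objectives minimized; the point values are illustrative:

```python
# Sketch: peeling multiple pseudo-pareto-fronts F1, F2, ... from an
# input set C of circuits, each given as an (error, cost) pair.
# Assumes distinct points; both objectives are minimized.

def pareto_front(circuits):
    """Non-dominated subset of (error, cost) points."""
    return [c for c in circuits
            if not any(o != c and o[0] <= c[0] and o[1] <= c[1]
                       for o in circuits)]

def pseudo_pareto_fronts(C, n=3):
    """Return [F1, ..., Fn]; each front is removed before the next."""
    fronts, remaining = [], list(C)
    for _ in range(n):
        if not remaining:
            break
        F = pareto_front(remaining)
        fronts.append(F)
        remaining = [c for c in remaining if c not in F]
    return fronts

C = [(0.0, 9.0), (0.1, 5.0), (0.2, 6.0), (0.3, 2.0), (0.4, 4.0)]
F1, F2 = pseudo_pareto_fronts(C, n=2)
print(F1)  # -> [(0.0, 9.0), (0.1, 5.0), (0.3, 2.0)]
```

The circuits collected on F1, F2, ... are exactly the candidates that get re-synthesized; everything else in C is never sent to the FPGA tool-flow.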
AutoAx-FPGA: To incorporate the set of pareto-optimal FPGA-ACs in different error-tolerant applications, we modify the state-of-the-art AutoAx [23] framework to include the functionality of designing ACs for a given application that can be deployed in FPGA-based systems. The traditional AutoAx framework searches the design-space of approximate components to select and combine approximation components, in order to generate an approximate hardware accelerator that maximizes the energy savings. Initially, a set of random approximation assignments is evaluated for the target accelerator circuit to obtain the quality of results (QoR) and hardware (HW) cost of the accelerator. Based on these values, QoR and HW cost estimators are constructed, which can be used to explore the complete design-space of approximate components for the given accelerator and to determine the set of pareto-optimal circuits for the given application. To generate approximate accelerators for a given application, which can be used in low-power and/or high-performance FPGA-based systems, we propose to include the following functionality in AutoAx: (i) we replace the library of pareto-optimal ASIC-ACs with the set of pareto-optimal FPGA-ACs obtained from the proposed ApproxFPGAs methodology, and (ii) we modify the estimators used in AutoAx to estimate the FPGA parameters of the approximate accelerator instead of their ASIC-based HW costs.

III. EXPERIMENTAL SETUP
The RTL (in Verilog) and behavioral models (in C) of the evolutionary approximate arithmetic circuits are open-source and readily accessible at https://github.com/ehw-fit/evoapproxlib. These designs are synthesized and implemented (i.e., placed & routed) using the Vivado Design Suite for the target FPGA xc7vx485tffg1157-1, to extract their area, power, and timing reports. We restrict the placement and routing algorithms of Xilinx Vivado by disabling the use of the FPGA's DSP logic blocks. We do this to ensure that the designs are mapped onto the configurable logic. These reports are used to extract the FPGA parameters, which are subsequently used for training and evaluating the S/ML models. The S/ML models are implemented, trained, and tested in a Python environment with the help of the scikit-learn library. The RTL designs were synthesized on an Intel Core CPU with a Solid-State Drive (SSD). The S/ML models were trained and evaluated on an Intel Xeon CPU. An overview of our work-flow is presented in Fig. 4.

Fig. 4: Overview of Our Experimental Work-flow.

IV. RESULTS & DISCUSSION
Fidelity:
First, we illustrate the accuracy of the S/ML models that we have evaluated inside our ApproxFPGAs framework. We do this by studying the fidelity of these models with respect to the three important FPGA parameters, namely, latency (ns), power (mW), and area (LUTs). The fidelity of these models is evaluated on the validation data-set. The results of these experiments are presented in Fig. 5. From these results, we make the following key observations:
• Tree-based methods, like Decision Trees and Random Forests, achieve above-average accuracy in estimating the FPGA parameters and retaining their relationship to the other ACs.
• Generalization of models across all bit-widths is not very effective, i.e., estimating the FPGA parameters of higher bit-width designs (adder or multiplier) using a model learned from lower bit-width designs is not very effective. On average, we observed a clear decrease in the fidelity for the higher bit-width designs when using such models.
• Ridge models, such as Kernel Ridge and Bayesian Ridge, typically illustrate the best fidelity.

We also summarize the top S/ML models for each FPGA parameter in Table II. Likewise, we identify the models that achieve maximum fidelity when obtained using regression analysis on their corresponding ASIC parameters.

TABLE II: Top-ranked ML Models for Estimating the FPGA Parameters.
FPGA Latency: ML11, ML4, ML10, ML2
FPGA Power: ML11, ML13, ML4, ML1
FPGA Area: ML4, ML13, ML11, ML3

Correlation of ML Models:
Next, we illustrate the correlation between the estimated FPGA parameters and their measured values when the top S/ML models are used on the library of approximate 8x8 multipliers. These results are illustrated in Fig. 6. From these results, we make the following key observations:
• The Bayesian Ridge and PLS regression techniques can be used as standalone techniques to estimate all three FPGA parameters, as they are among the top models for all three parameters.
• Statistical regression with respect to the corresponding ASIC parameters is equally useful in estimating the FPGA parameters of a given circuit.
• Latency is not estimated accurately, due to the bias illustrated by the models, especially regression with ASIC parameters and Kernel Ridge. This leads to a scenario where the circuit latency is under-estimated by the model, including for certain pareto-optimal designs.

Fig. 5: Fidelity Analysis of the FPGA Parameters for the Different S/ML Techniques Described in Table I.
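The correlation and bias discussed above can be quantified with a short sketch; the (measured, estimated) values below are made-up placeholders, not data from the paper:

```python
# Sketch: quantifying how well an estimator tracks a measured FPGA
# parameter, as in the correlation analysis of Fig. 6. The arrays are
# illustrative stand-ins for (measured, estimated) latency pairs.
import numpy as np

measured  = np.array([1.2, 1.5, 1.9, 2.4, 3.1])   # e.g., latency in ns
estimated = np.array([1.3, 1.4, 2.0, 2.3, 3.0])

r = np.corrcoef(measured, estimated)[0, 1]         # Pearson correlation
bias = float(np.mean(estimated - measured))        # systematic offset
print(f"r = {r:.3f}, bias = {bias:+.3f} ns")
```

A high correlation with a negative bias corresponds to the under-estimation scenario described above: the circuits are ranked almost correctly, but their latencies are systematically reported too low.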
Construction of the Pareto-fronts:
As discussed earlier, we construct multiple pareto-fronts to ensure that truly pareto-optimal designs are not missed by our methodology. Towards this, we illustrate the benefits of constructing multiple pareto-fronts sequentially for estimating the FPGA latency using the top S/ML models and regression with respect to ASIC latency. Fig. 7 illustrates the results of constructing one, two, and three pareto-fronts using the technique discussed in Section II. From these results, we make the following key observations:
• Using ML-based techniques for estimating the FPGA parameters reduces the total number of synthesized circuits by a large factor, including the training and validation data-set and the synthesis of the pseudo-pareto-optimal points, instead of synthesizing the complete library of approximate 8x8 multipliers.
• The ML models are highly effective in selecting the pseudo-pareto-optimal designs that have to be re-synthesized, as compared to the regression analysis w.r.t. ASIC latency, which roughly doubles the number of new circuits to be synthesized compared to Bayesian Ridge.
• The best results are obtained when we effectively combine the pseudo-pareto-optimal points obtained from multiple ML models. Therefore, we need to consider the union of all the pareto-fronts, ⋃_{i=1}^{n} F_i, to determine the final set of pareto-optimal FPGA-ACs.

Pareto-Optimal FPGA-ACs:
Fig. 8 illustrates the set of FPGA-ACs synthesized to obtain the subset of pareto-optimal FPGA-ACs using our proposed ApproxFPGAs methodology on the library of approximate adders and multipliers of multiple bit-widths. Although we have not exclusively determined and synthesized all the pareto-optimal designs, we have reduced the exploration time by an order of magnitude while covering, on average, the large majority of the pareto-optimal designs present in the library of approximate circuits. This is quite explicitly illustrated by the pareto-fronts of the designs with a higher number of ACs in the library, such as the approximate multipliers, and a little less explicitly for libraries with a lower number of circuits, like the approximate adders. Similarly, we have generated the pareto-optimal ACs for the remaining approximate adder and multiplier bit-widths.

Fig. 6: Correlation Analysis of the Top S/ML Techniques for the Library of 8x8 Approximate Multipliers.

Fig. 7: Analysis of Constructing Multiple Pareto-fronts for the 8x8 Approximate Multiplier Library w.r.t. FPGA Latency.

Fig. 8: Evaluation of the Pareto-optimal FPGA-ACs Obtained using the ApproxFPGAs Methodology.

AutoAx-FPGA: Finally, we present the results of modifying the AutoAx framework to include the functionality of generating pareto-optimal accelerators for FPGA-based systems. We evaluated the modified AutoAx-FPGA methodology using a Gaussian Filter as a case-study, with pareto-optimal 8x8 approximate multipliers and approximate adders as inputs. The QoR of the Gaussian filter's output is estimated using the structural similarity index (SSIM), for which we build an estimator. First, we generate a training and validation data-set of random approximate configurations of the given Gaussian filter, which were synthesized and implemented using the Vivado work-flow to measure their FPGA parameters, such as area, latency, and power consumption. Similar to the AutoAx methodology, we constructed estimators that can determine the FPGA parameters for the other circuits in the library, and constructed the different pareto-fronts using the hill-climbing algorithm. We thereby substantially reduce the number of accelerator circuits to be explored for each of the FPGA parameter-QoR scenarios, namely, latency-SSIM, power-SSIM, and area-SSIM. Each of these designs is synthesized in the Vivado work-flow, and their behavioral models are deployed in the image-processing environment to measure their FPGA parameters and determine their SSIM. These results are illustrated in Fig. 9. We can observe that AutoAx-FPGA achieves better results when compared to a simple random search. Furthermore, we can also observe that optimizing for area and power improves the savings obtained in the other FPGA parameters as well, which is not the case when we optimize for latency. For example, in the case where we optimize for latency in Fig. 9, we would expect the SSIM-Latency pareto-front to encompass the best ACs in terms of latency, but this is not the case, as the latency-estimator is not very effective. However, since the other two pareto-fronts improve the savings for the other FPGA parameters as well, they outperform the SSIM-Latency pareto-front ACs, even in terms of latency.

Fig. 9: Analysis of the ACs from AutoAx-FPGA Compared to the Basic Random Search.

V. CONCLUSION
We presented the ApproxFPGAs methodology for embracing the use of current state-of-the-art ASIC-based approximate circuits in FPGA-based systems. We synthesize a small subset of the library of arithmetic circuits to establish the training and validation data-sets, which are used to train the models and evaluate their applicability. Based on the outcome, we chose the models that achieve the best fidelity to estimate the FPGA parameters for all the circuits in the data-set, which are subsequently used to construct multiple pseudo-pareto-fronts. The circuits on these pareto-fronts are re-synthesized to measure the correct FPGA parameters and determine the final set of pareto-optimal FPGA-ACs, which can be used by system developers and application designers to develop low-power or high-performance FPGA-based accelerators. This set of pareto-optimal arithmetic FPGA-ACs is open-source and available online at https://github.com/ehw-fit/approx-fpgas. Finally, we evaluated the applicability of these pareto-optimal ACs by using a modified version of the state-of-the-art AutoAx framework to illustrate the benefits obtained.

ACKNOWLEDGEMENT
This work was partially supported by the Doctoral College Resilient Embedded Systems, which is run jointly by TU Wien's Faculty of Informatics and FH-Technikum Wien, and partially by Czech Science Foundation project 19-10137S.

REFERENCES
[1] S. M. S. Trimberger, "Three ages of FPGAs: A retrospective on the first thirty years of FPGA technology," IEEE Solid-State Circuits Magazine, vol. 10, no. 2, pp. 16–29, 2018.
[2] R. Watanabe et al., "Implementation of FPGA building platform as a cloud service," in Proceedings of the 10th HEART. ACM, 2019.
[3] L. H. Crockett et al., The Zynq Book: Embedded Processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 All Programmable SoC. Strathclyde Academic Media, 2014.
[4] V. K. Chippa et al., "Analysis and characterization of inherent application resilience for approximate computing," in DAC. ACM, 2013.
[5] H. Jiang et al., "A comparative review and evaluation of approximate adders," in GLSVLSI. ACM, 2015.
[6] H. Jiang et al., "A comparative evaluation of approximate multipliers," in NANOARCH. IEEE, 2016.
[7] S. Mittal, "A survey of techniques for approximate computing," ACM Computing Surveys (CSUR), vol. 48, no. 4, p. 62, 2016.
[8] S. Hashemi et al., "DRUM: A dynamic range unbiased multiplier for approximate applications," in ICCAD. IEEE Press, 2015.
[9] H. Saadat et al., "Approximate integer and floating-point dividers with near-zero error bias," in DAC. ACM, 2019.
[10] H. Saadat et al., "Minimally biased multipliers for approximate integer and floating-point multiplication," IEEE TCAD, vol. 37, no. 11, pp. 2623–2635, 2018.
[11] S. Venkataramani et al., "Quality programmable vector processors for approximate computing," in MICRO. IEEE, 2013.
[12] A. Sampson et al., "EnerJ: Approximate data types for safe and general low-power computation," in ACM SIGPLAN Notices, vol. 46, no. 6. ACM, 2011, pp. 164–174.
[13] B. S. Prabakaran et al., "DeMAS: An efficient design methodology for building approximate adders for FPGA-based systems," in DATE. IEEE, 2018.
[14] J. Echavarria et al., "FAU: Fast and error-optimized approximate adder units on LUT-based FPGAs," in FPT. IEEE, 2016.
[15] S. Ullah et al., "SMApproxLib: Library of FPGA-based approximate multipliers," in DAC. IEEE, 2018.
[16] S. Ullah et al., "Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators," in DAC. ACM, 2018.
[17] A. K. Mishra et al., "iACT: A software-hardware framework for understanding the scope of approximate computing," in WACAS, 2014.
[18] W. Baek et al., "Green: A framework for supporting energy-conscious programming using controlled approximation," in ACM SIGPLAN Notices, vol. 45, no. 6. ACM, 2010, pp. 198–209.
[19] D. S. Khudia et al., "Rumba: An online quality management system for approximate computing," in ISCA. IEEE, 2015.
[20] A. Yazdanbakhsh et al., "Axilog: Language support for approximate hardware design," in DATE. IEEE EDA Consortium, 2015.
[21] V. Mrazek et al., "EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods," in DATE. IEEE European Design and Automation Association, 2017.
[22] J. Han et al., "Approximate computing: An emerging paradigm for energy-efficient design," in ETS. IEEE, 2013.
[23] V. Mrazek et al., "AutoAx: An automatic design space exploration and circuit building methodology utilizing libraries of approximate components," in DAC. ACM, 2019.