NEAT: A Framework for Automated Exploration of Floating Point Approximations
Saeid Barati
Computer Science Department, University of Chicago
Chicago, [email protected]
Lee Ehudin
Computer Science Department, University of Chicago
Chicago, [email protected]
Henry Hoffmann
Computer Science Department, University of Chicago
Chicago, [email protected]
Abstract—Much recent research is devoted to exploring tradeoffs between computational accuracy and energy efficiency at different levels of the system stack. Approximation at the floating point unit (FPU) saves energy by simply reducing the number of computed floating point bits in return for accuracy loss. However, finding the most energy-efficient approximation for various applications with minimal effort remains the main challenge. To address this issue, we propose NEAT: a Pin tool that helps users automatically explore the accuracy-energy tradeoff space induced by various floating point implementations. NEAT helps programmers explore the effects of simultaneously using multiple floating point implementations to achieve the lowest energy consumption for an accuracy constraint, or vice versa. NEAT accepts one or more user-defined floating point implementations and programmable placement rules for where/when to apply them. NEAT then automatically replaces floating point operations with different implementations based on the user-specified rules at runtime and explores the resulting tradeoff space to find the best use of approximate floating point implementations for precision tuning throughout the program. We evaluate NEAT by enforcing combinations of 24/53 different floating point implementations with three sets of placement rules on a wide range of benchmarks. We find that heuristic precision tuning at the function level provides up to 22% and 48% energy savings at 1% and 10% accuracy loss, respectively, compared to applying a single implementation to the whole application. NEAT is also applicable to neural networks, where it finds the optimal precision level for each layer given an accuracy target for the model.
I. INTRODUCTION
Early work in approximate computing demonstrates tremendous energy and execution time reductions by making a variety of arithmetic and logic functional units available [11], [12], [22], [24]. Reduced-precision methods advocate less numerical precision for data storage and computation to achieve higher performance and energy efficiency [16], [64], [73], [77]. The proliferation of both approximate functional units and reduced-precision software methods creates tremendous opportunity, but it also creates a new problem. While designing for reduced precision has long been common in specialized application domains, such as digital signal processing [10], the proliferation of these techniques means that general programmers will now have to consider the implications of such designs. Specifically, it is up to programmers to decide which level of approximation to use at different points in their application and to navigate the immense tradeoff space created by allowing multiple approximations within a single program. Consider 10 different levels of approximation available at the function level for a moderate-sized program with 10 functions. Programmers attempting to design for energy efficiency and accuracy in this scenario face two separate, but related, challenges. First is the challenge of correctly (in terms of achieved accuracy) implementing 10 different versions of each candidate function (one version for each available level of precision). Second is the challenge of searching the resulting tradeoff space with 10^10 points to explore. The tradeoff space could be even larger if we exploit data type approximation, where each variable in the program could acquire a different level of approximation [9], [30], [65], [71].
Constructing a large number of alternative implementations and then navigating such an immense tradeoff space is likely beyond the abilities of even domain experts. Thus, we need an automated precision tuning framework that can both generate alternative implementations and then explore the induced tradeoff space. In this paper, we propose one mechanism that helps address both of the above challenges: programmable placement rules for approximate floating point computation. We argue that asking programmers to implement N different versions of key functions is unnecessarily burdensome, and generating all possible approximations of each function would make the search space prohibitively large. The programmable rules are a compromise, where programmers can encode their knowledge of the application into concise rules about which functions can be approximated, by how much, and when it might be permissible to do so. These rules can then be used by an automated tool to generate a candidate set of approximate function implementations that is much smaller than the set of all possible approximations. To address the challenges of creating and selecting from a large number of approximation alternatives, we propose NEAT (Navigating Energy Approximation Tradeoffs), a tool that helps users explore different levels of approximation within a program without detailed instrumentation and without laboriously creating many alternative implementations of functions. NEAT accepts a user program, a set of approximate floating point implementations, and a set of programmable placement rules for when to use a specific implementation within a program. NEAT then runs the program and dynamically replaces floating point operations (FLOPs) with the approximate version as specified by the rules. NEAT reports the program's output along with an estimate of floating point unit (FPU) and memory access energy and an itemized report of FLOPs in the program.
Thus, NEAT helps developers explore the configuration space of floating point implementations (FPIs) without requiring them to have deep numerical expertise. We implement NEAT for x86 using the Pin binary instrumentation system [51]. We demonstrate NEAT's value by comparing the approximations produced by different placement rule sets. In the first, we write a simple rule that picks a single floating point implementation for the entire program; i.e., the rule is a simple one-to-one replacement (whole-program rule) common to many proposed approximation methods, e.g., those that use a single, reduced precision for machine learning [21] or scientific simulation [22]. In the second, we allow the top 10 executed functions with the most FLOPs to each use a different approximation (per-function rules). Either we use the currently-in-progress function (CIP) or the most recent function on the call stack (FCS) as the target for the approximate floating point implementation. For all rules, NEAT uses a genetic algorithm to guide exploration of the enormous resulting search space. We evaluate NEAT on a selected set of benchmarks from the Parsec 3.0 [7] and Rodinia 3.1 [13] suites, which cover a variety of real-world applications. For the FPIs, we applied mantissa bitwidth tuning. On average, per-function placement retrieves more energy-optimal floating point implementations than the whole-program approach, providing 22.1% and 3.2% energy savings in the FPU and memory, respectively, with an allowance of 1% accuracy loss. To ensure the robustness of NEAT, we include multiple inputs for each application, divided into training and test sets, to evaluate whether NEAT produces statistically sound results. We also extend the evaluation with a digit recognition application implemented as a neural network on the MNIST dataset. For any accuracy target, NEAT provides the required precision level for each layer.
NEAT is also released as open source, so others can evaluate or use it freely. In summary, this paper proposes:
• The NEAT framework, which helps users explore the tradeoff space of reduced precision floating point combinations while not requiring hand tuning or code instrumentation.
• A case study that compares whole-program vs. per-function approximation placements for a variety of benchmarks. NEAT also offers a separate placement solution based on the caller function, useful for frequently invoked functions.
• Robustness on unseen inputs with a high correlation coefficient. NEAT finds statistically meaningful approximations that are not sensitive to input data and are likely to remain efficient on an unseen set of inputs.
• A demonstration of NEAT's applicability to Convolutional Neural Networks (CNNs), providing per-layer precision computation modes that yield energy savings with minimal loss of model accuracy.

II. BACKGROUND & MOTIVATION
A. Prior Work
While there has been a substantial amount of effort aimed toward finding new forms of approximation [1], [17], [25], [36], [37], [50], [56], [61], [63], [65], [69], [70], [80], there is a lack of solutions that help users both develop their own approximation methods and then specify the approximation level to enforce for a single application. Hardware approximation computes inexactly in return for reduced energy, area, or time [14], [49]. Approximate multipliers [44], [46], [81] and adders [20], [84] are widely advocated for energy-efficient computing. State-of-the-art neural network training platforms offer 16-bit floating point hardware that provides up to 4x performance gain compared to traditional 32-bit systems [29]. Recent proposals promote putting many different approximate units or customized accelerators on a single core [31]. Thus, it is beneficial to include multiple FPUs on a chip for higher energy efficiency [22], but this requires tedious hand-tuning. The challenge, therefore, is figuring out which FPU to use in each part of the program. This is the challenge that motivates NEAT. Languages support approximation by allowing the specification of variants for key functionality and formal analysis of their effects [3], [9], [55]. Approximation Knobs provide a way to lend performance and energy gains to existing power knobs [43]. Quora is a quality-programmable processor where the notion of quality is codified in the instruction set of the processor [73]. Another example of user-defined approximation is Green, a system that allows programmers to supply approximate versions of loops and while-blocks that terminate early [4]. In contrast to these programming language techniques, our proposal lets users easily, through our programmable substitution rules, examine and change the accuracy of FLOPs, giving them more control over the floating point computations in a program. Performing precision tuning at fine grain is available through software libraries.
EnerJ proposes to declare approximate data via type qualifiers [65]. MPFR adds to its arbitrary-precision representation support for rounding modes, exceptions, and special values as defined in the IEEE 754 standard [30]. FlexFloat reduces floating point emulation time by providing a C/C++ interface supporting multiple FP formats. These techniques require source code instrumentation (changing float and double variable definitions to custom parameters) or aim to yield more precise computation (for instance, floating point numbers with more than 128 bits). NEAT focuses on energy efficiency by reducing precision while only requiring the program binary. Convolutional Neural Networks (CNNs) include a significant amount of floating point computation in the training and inference stages. A large body of research has focused on CNN precision scaling [15], [32], [33], [62], [64], [72]. For example, WAGE quantizes weights to 2 bits while activations, errors, and gradients are 8 bits [79]. Flexpoint presents a new format with a 16-bit mantissa to train CNNs with full precision [45]. Another piece of research demonstrates successful training with 8-16 bit floating point numbers at full accuracy [77]. Other, tangentially related approaches create networks with early exit points [75], [76], but those are not related to the problem of changing numerical precision. Prior approaches either change the training architecture or apply a coarse-grain precision level for all layers. In contrast, NEAT generates precision tuning analysis at different granularities by offering WP and CIP solutions without modifying the application's internal structure or exhaustively exploring precisions. While prior work mainly develops mechanisms that enable approximation to provide energy and runtime savings in different domains, it does not help users make more informed decisions about approximation.

Fig. 1: Energy Per Instruction for different classes of instructions.
These techniques are mostly not flexible about how much, where, and when to approximate, and only provide discrete approximation knobs, which leads to more conservative design choices. NEAT does not propose new mechanisms but helps users answer the questions above.
B. Motivation
Current inexact functional units, in addition to approximate software libraries, create an opportunity to exploit quality-energy tradeoffs. While an FPU accounts for 2-5% of chip area, floating point instructions consume significantly more energy than other classes of instructions such as integer, memory, and control [5], [54]. Figure 1 illustrates the energy per instruction (EPI) for different classes of instructions on a 64-bit 32nm processor. With random operands, a 64-bit floating point add consumes 400 pJ, and a division can go as high as 680 pJ. For 32-bit versions, the energy consumption is 350 and 420 pJ, respectively. As expected given the type of operations, executing floating point instructions emerges as a major contributor to total energy consumption. Recent empirical studies have shown that up to 50% of the energy consumed in a core and memory is related to floating point instructions [54]. Thus, exploiting reduced bitwidth at the instruction level (bit truncation) to generate Floating Point Implementations (FPIs) can facilitate higher energy efficiency. Another useful insight from Figure 1 is the relationship between computation and memory accesses. For example, three add operations consume the same amount of energy as a ldx instruction. Hence, from an energy efficiency point of view, reducing memory traffic can be as effective as optimizing floating point arithmetic operations [54]. A body of literature has focused on providing tool support that allows users to define several approximations for different components of an application [17], [25], [36], [37], [56], [60], [63], [66]. Petabricks provides language extensions that expose tradeoffs between time and accuracy to the compiler [3]. The compiler then runs dynamic autotuning to generate optimized elements that achieve the target accuracy.
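The EPI figures above support a back-of-envelope FPU energy model. The sketch below uses only the 64-bit numbers quoted in the text (add: 400 pJ, divide: 680 pJ); the simple linear cost model is our illustration, not NEAT's exact estimator.

```cpp
#include <cassert>

// Naive FPU energy model from the per-instruction figures in the text:
// a 64-bit FP add costs 400 pJ and a 64-bit FP divide up to 680 pJ.
// Total energy is just the instruction counts weighted by these EPIs.
double fpu_energy_pj(long n_add64, long n_div64) {
    const double EPI_ADD64 = 400.0;  // pJ per 64-bit add (Figure 1)
    const double EPI_DIV64 = 680.0;  // pJ per 64-bit divide (Figure 1)
    return n_add64 * EPI_ADD64 + n_div64 * EPI_DIV64;
}
```

Three adds cost 1200 pJ under this model, which matches the text's observation that three add operations consume roughly as much energy as one ldx memory instruction.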
However, autotuners need to be configured on a per-application basis by the user. OpenTuner provides fully-customizable configuration representations and ensembles of search techniques to find an optimal solution [2]. Both autotuning techniques are meant to help programmers, but Petabricks requires a separate language and both require users to implement all alternatives before the search can be conducted. NEAT also helps users deal with approximation, but instead of requiring users to implement all possible alternatives, they simply describe programmable rules that are then used to automatically generate the alternatives. Hence, there is a need for a generic framework that provides multiple precision levels, accommodates custom user-defined floating point implementations, and does not require code refactoring. NEAT provides such a solution. NEAT generates insightful information for precision tuning at the function level for floating point programs.

III. SYSTEM DESIGN
In this section, we describe our solution, which generates insightful information about floating point precision tuning for applications. This tool, named Navigating Energy and Accuracy Tradeoffs (NEAT), allows users to collect energy and performance data from applications using custom implementations of floating point arithmetic. The main challenge of precision tuning is constructing the right configuration of floating point precisions for the application. This configuration space ranges from extremely large to fairly small, from using a different floating point implementation for each dynamic floating point instruction, to using a different implementation for different function calls, to just picking a single floating point implementation for the entire application. NEAT provides such flexibility in the granularity of enforcing floating point approximations by introducing programmable placement rules and then automatically searching the accuracy-energy tradeoff space to find the optimal frontier.

Fig. 2: NEAT Design

Figure 2 illustrates the NEAT system from the user perspective. Users specify: (1) the application that they want to understand (this can be just a binary and requires no special changes), (2) whether NEAT should consider double or single precision (or both), (3) a set of alternative implementations for floating point arithmetic, and (4) the programmable placement rules that describe when, where, and how in the program to replace the standard floating point operations with one of the alternative implementations. NEAT then runs the program as a Pin tool, intercepting all floating point operations of the specified type and replacing them according to the rules. NEAT performs multiple runs of the application, collecting statistics on floating point usage, accuracy, and estimated energy. NEAT offers a profiling mode where the user collects precision analysis, such as the quantity and frequency of FLOPs, for the application before applying any FPIs.
Ultimately, NEAT can repeatedly test different assignments of floating point operations to find the frontier of optimal configurations, i.e., assignments of floating point operations to different regions of the code. This section describes NEAT's inputs, internals, and outputs.
A. NEAT Inputs
User inputs to NEAT include: a user application to instrument, a precision level as the optimization target, the desired FP arithmetic implementations, and a set of FPI-to-function mappings (programmable placement rules). NEAT receives the binary of the program and instruments the floating point instructions. Unlike other precision tuning tools, NEAT does not require the source code of the program. NEAT then expects the optimization target, which can be either single or double precision. There are two reasons for including an optimization objective. First, most programs hold the same precision level across the code base for their data structures and functions. Second, if we considered both float and double
FLOPs for optimization, the configuration space of FPI combinations would explode. Next, users specify multiple FPIs for the individual arithmetic instructions, such as addition, subtraction, multiplication, and division. Finally, NEAT expects a mapping between the candidate code sections and the FPIs used to calculate each FLOP in a program. By default, NEAT enforces the FPIs at the function level, meaning all FLOPs executed within a specific function use the same customized FPI. Any function that has at least one FLOP can be considered a candidate for approximation.
B. NEAT Internal Structure
The NEAT dynamic instrumentation tool is written in C++ using the Intel Pin instrumentation system [51]. NEAT performs run-time instrumentation to facilitate the analysis and replacement of floating point arithmetic operations during the execution of compiled C and C++ binaries.
1) Intel Pin Tool:
The Pin instrumentation system was chosen as the backbone for this tool because of its clean API and efficient implementation. The Pin API makes it possible to write instrumentation routines that observe and alter the architectural state of a process. Pin uses a JIT compiler to generate new instrumented code that can be executed without extra runtime overhead from instrumentation.
2) Floating Point Operations:
For the purposes of this tool, we identify floating point arithmetic operations as the Streaming SIMD Extensions (SSE) instructions for scalar arithmetic. These instructions are included in a SIMD instruction set extension to the x86 architecture and operate on 32-bit or 64-bit floating point numbers. More specifically, the instructions we use for our definition of a floating point operation are ADDSS, SUBSS, MULSS, DIVSS, ADDSD, SUBSD, MULSD, and DIVSD.
3) Floating Point Arithmetic Implementation:
Custom hardware units and accelerators have been considered for enriching quality-versus-energy tradeoff spaces. Approximate adders [20], [74], [84] and multipliers [44], [46], [81] have been designed for lower power consumption and high performance. In the presence of inexact hardware units, NEAT provides information on how to efficiently redirect arithmetic instructions to these units. Floating point formats with fewer bits present an appealing opportunity to reduce energy consumption, since they allow simplification of hardware units and reduction of the memory bandwidth required to transfer data between memory and registers. An FPI can be as simple as bit truncation in the FP format representation, enforcing direct approximation on the operands or result of arithmetic operations, or redirecting instructions to approximate hardware units or software libraries.
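As a concrete illustration of the simplest kind of FPI, bit truncation, the sketch below zeroes the low-order stored mantissa bits of an IEEE 754 double. This is a minimal sketch of the idea, not NEAT's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Keep only the top `bits` of a double's 52 stored mantissa bits by
// zeroing the low-order ones; sign and exponent are untouched.
double truncate_mantissa(double x, int bits) {
    uint64_t u;
    std::memcpy(&u, &x, sizeof u);       // bit-level view of x
    u &= ~((1ULL << (52 - bits)) - 1);   // clear low mantissa bits
    std::memcpy(&x, &u, sizeof x);
    return x;
}
```

Applying such a truncation after every arithmetic operation emulates an FPU that computes with a narrower significand.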
4) Execution of Floating Point Instructions:
Defining an FPI is fairly trivial. The main challenge with enforcing FPIs dynamically is specifying the exact mapping between FPIs and FLOPs. NEAT allows users to define placement rules that determine which FPI is used to calculate each FLOP in a program. Every time a FLOP is about to be calculated in the user application, NEAT examines all of the mappings, captures information about the current state of the application, and uses that information to determine which FPI will be applied to calculate the result of the FLOP.

TABLE I: Built-in Placement Rules in NEAT.

Placement Rule | Description | Tradeoff Space Size
WP | one FPI for the whole program | −
CIP | one FPI for the currently in-progress function | −
FCS | one FPI for the most recent function on the call stack | −

NEAT comes packaged with three predefined sets of FPI placements that cover many use-cases and show off its versatility. Table I lists the default placement rules and the corresponding tradeoff space sizes. Sets of rules are specified as C++ routines that accept the program state as input and return a single FPI as output. The first set applies the same FPI to every FLOP in the whole program (WP), regardless of the current function and the program state. For finer granularity, the user can register callbacks through NEAT that are executed whenever a function is entered or exited in the instrumented application. This allows more complex information to be collected about the program state, such as the call stack of the application. The second set of placement rules allows the user to specify a map of function names to FPIs and employs each FPI for the FLOPs in the corresponding currently-in-progress (CIP) function. Similarly, the third set of placement rules uses callbacks registered with NEAT to keep track of the function call stack (FCS) of the program. Instead of inspecting the current function, NEAT first checks the most recent function on the call stack. If no functions in the call stack match the names of those in the user-supplied map, a default implementation is used. To highlight the difference between CIP and FCS, we analyzed the structure of 7 functions in the radar benchmark, shown in Figure 3. Radar is an embedded real-time signal processing application used to find moving targets on the ground [35], [47]. It includes both a low-pass filter (LPF) and pulse compression (PC).
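A minimal sketch of what an FCS-style rule's lookup logic might look like, using names of our own invention (an FPI as a mantissa-width tag, select_fcs); NEAT's real rules are C++ routines registered with the tool, and this is only an illustration:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct FPI { int mantissa_bits; };  // illustrative stand-in for an FPI

// FCS-style placement: walk the call stack from the most recent frame
// and return the FPI registered for the first matching function; fall
// back to full double precision if nothing matches.
FPI select_fcs(const std::vector<std::string>& call_stack,
               const std::map<std::string, FPI>& rules) {
    for (auto it = call_stack.rbegin(); it != call_stack.rend(); ++it) {
        auto hit = rules.find(*it);
        if (hit != rules.end()) return hit->second;
    }
    return FPI{52};  // default implementation
}
```

Under such a rule, an FFT invoked from a low-pass filter and the same FFT invoked from pulse compression can receive different FPIs, which is exactly the distinction the radar example relies on.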
Both of these components use a Fast Fourier Transform (FFT) as part of their computation. With the CIP option, NEAT enforces the same FPI every time the FFT function is called. With the FCS option, NEAT distinguishes between the two occurrences of the FFT based on who made the function call. Therefore, NEAT uses one FPI for the FFT in the low-pass filter (LPF) stage and a second FPI for the FFT in the pulse compression (PC) stage. Empirically, we have found that the FCS and CIP results for most benchmarks do not differ, as the callers of FLOP-intensive functions are the same. Radar is an example where multiple functions make numerous calls to the same FLOP-intensive function that is accuracy sensitive.

C. Outputs
Fig. 3: FCS placement considers the FFT function call stack before selecting the approximate FPI.

There are five outputs from this tool: the output from the user application, a trace of the operands and result of every FLOP executed by the program, the estimated FPU energy of FLOPs in the execution of the program, the estimated energy of off-chip memory accesses of the program, and the number of FLOPs executed per function in the program. The trace of the FLOPs executed by the instrumented application is written to a file while the application is running. If FPIs are supplied to NEAT by the user, the result of each operation is printed after the operation is calculated with the chosen FPI. The operands and result of each operation are printed as hexadecimal numbers so that there is no confusion in rounding the floating point values. NEAT reports the total energy consumed in the FPU by using the energy per instruction (EPI) of different classes of floating point operations. We extracted the energy model of fadd, fmul, and fdiv for single and double precision operations from related work [54]. To this end, NEAT counts the number of bits manipulated in the operands and results of every FLOP in the instrumented program. Modifying the bit width of the exponent or sign of a floating point number changes the accuracy so significantly that the quality of output becomes unacceptable; hence, NEAT only focuses on mantissa bits. NEAT counts the number of zeroes in the binary representation of the floating point number, starting with the least significant bit, and then subtracts that count from the available mantissa bits of the floating type (24/53 bits in single/double precision, respectively) to calculate the number of manipulated bits. NEAT uses the EPI models and the number of manipulated bits per FLOP to estimate the total floating point energy consumed in the FPU. NEAT also records the total number of bits used in FLOPs during the execution of the program, output to a file after the termination of the application.
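The manipulated-bit count described above can be sketched as follows for double precision; the exact accounting NEAT uses may differ in details (e.g., the treatment of zero), so treat this as an assumption-laden illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Count "manipulated" significand bits of a double per the text:
// count trailing zeroes of the 52 stored mantissa bits (from the LSB)
// and subtract from the 53 available double-precision mantissa bits.
int manipulated_bits(double x) {
    uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    uint64_t mant = u & ((1ULL << 52) - 1);  // stored mantissa bits
    int zeros = 0;
    if (mant == 0) {
        zeros = 52;                          // all stored bits are zero
    } else {
        while ((mant & 1) == 0) { mant >>= 1; ++zeros; }
    }
    return 53 - zeros;
}
```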
Unlike the FPU energy estimation, this metric can be used as a platform-independent way to evaluate the approximate amount of power used by FLOPs when instrumenting a program. Currently, memory accounts for more than 25% of the energy spent in a large scale system. While, on average, each single precision FLOP takes 400 pJ to execute, a byte read from memory consumes 1.5 nJ [8]. Accordingly, NEAT counts the total number of bits transmitted to/from memory and then estimates the total memory access energy of the instrumented program [53]. This allows NEAT to yield a better energy estimate of the program on a real system. NEAT generates detailed statistics about the floating point instructions in the program. Users might use NEAT to profile the application before performing precision tuning to decide, first, whether NEAT is useful for their application and, second, what type of FPIs to use, which functions to target, and how to map them. In general, NEAT is a tool used at program design time. NEAT allows users to evaluate many points on the accuracy/energy tradeoff curve without having to implement all possible alternatives. After profiling with NEAT, users can then select a point and implement it with confidence that it will provide the desired behavior. Future work will explore additional machine learning techniques to configure floating point usage differently for different functions in the program [19], [39], [58], [59]. Another promising line of work is using a runtime system to dynamically tune floating point usage to maintain either energy or accuracy constraints in a changing workload [6], [26]–[28], [34], [38], [40], [41], [52], [57], [78], or possibly implementing this control scheme in hardware [67], [68], [83].

IV. NEAT INTERFACE AND RUNTIME
We now explain how the user can manage floating point precision scaling with the NEAT framework described in the previous section. We specify the information that NEAT expects to receive from users and then discuss the steps to execute NEAT's runtime engine. The NEAT procedure is as follows:

1. Profile the Program: The user runs the application. NEAT records the single and double precision instructions and the functions associated with them, and generates a detailed report in csv format.

2. Assign FP Optimization Target: Since applications usually use the same precision level across the source code, NEAT enhances either single or double precision instructions at a time. At this point, the user defines the directive for NEAT to target 32- or 64-bit FLOPs.

3. Develop FPIs: Users may define multiple FPIs to be explored by NEAT. NEAT supports FPIs developed in a number of different ways. An FPI can be created by truncating mantissa bits of the FLOP representation or by injecting direct approximation into the operands or results of floating point arithmetic operations. For example, approximating the inverse function [82] or the sin function using a neural network [23] is considered an FPI, too. An FPI can be applied to one or more floating point arithmetic instructions. For instance, one benchmark might include numerous accumulations but few divisions, so the user defines an FPI enforcing 8 precision bits for the add/sub instructions and 24 precision bits for the multiply instructions. The user develops an FPI by creating an instance of the FpImplementation virtual class. Furthermore, the user can customize the PerformOperation subroutine to modify the operands or results of a floating point instruction directly.
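Following the example of 8 precision bits for add/sub and 24 for multiply, a user FPI might look like the sketch below. The FpImplementation and PerformOperation names come from the text, but the exact signatures are our assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed shape of the FpImplementation virtual class.
struct FpImplementation {
    virtual double PerformOperation(char op, double a, double b) = 0;
    virtual ~FpImplementation() = default;
};

// An FPI keeping 8 mantissa bits for add/sub and 24 for multiply.
struct MixedPrecisionFPI : FpImplementation {
    static double trunc(double x, int bits) {  // zero low mantissa bits
        uint64_t u;
        std::memcpy(&u, &x, sizeof u);
        u &= ~((1ULL << (52 - bits)) - 1);
        std::memcpy(&x, &u, sizeof x);
        return x;
    }
    double PerformOperation(char op, double a, double b) override {
        switch (op) {
            case '+': return trunc(a + b, 8);
            case '-': return trunc(a - b, 8);
            case '*': return trunc(a * b, 24);
            default:  return a / b;            // divide left exact here
        }
    }
};
```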
4. Register FPI Placement Rules and Functions: NEAT expects to receive a mapping between FPIs and when to enforce them. For the WP approach, the user only needs to instantiate the RegisterFPSelector class with the desired FPI as the argument. For the per-function rules, NEAT by default considers the top 10 FLOP-intensive functions. The user might pre-profile the program to detect and select any number of functions. The user then provides a mapping between functions and FPIs by defining a map of <functionName, FPI*> pairs. Next, the user combines the map with one of the pre-packaged placement rules (CIP or FCS). This mapping is also referred to as a configuration. Finally, the user creates an instance of the RegisterFPSelector class and passes the map and placement strategy as the input arguments. At runtime, the user passes the registered instance name via the fp_selector_name command line flag to NEAT. This interface is simple, but provides quite a flexible approach to replacing standard floating point operations with approximate versions. For example, the user can provide several maps, and their instantiation of the selector class can look at the current program context to select the desired map. This allows NEAT to explore many different options for a single function within a program. For example, users can specify that the map should depend on the function call stack so that different FP implementations will be used for the same function based on where it was called from.
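The <functionName, FPI*> configuration in step 4 can be sketched as below; the function names and the lookup helper are illustrative, and in NEAT the map would be handed to the selector class together with a CIP or FCS rule:

```cpp
#include <cassert>
#include <map>
#include <string>

struct FPI { int mantissa_bits; };  // illustrative FPI object

using Config = std::map<std::string, const FPI*>;  // <functionName, FPI*>

// Resolve the FPI for a function under a CIP-style configuration,
// falling back to a default implementation when the function is not
// one of the registered FLOP-intensive candidates.
const FPI* lookup(const Config& cfg, const std::string& fn,
                  const FPI* fallback) {
    auto it = cfg.find(fn);
    return it != cfg.end() ? it->second : fallback;
}
```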
Activate Exploration Scripts : If CIP or FCS schemais selected, the tradeoff space of FPI to function mappings(configurations) becomes too huge to explore exhaustively.Hence, NEAT uses the NSGA-II genetic exploration techniqueto search for energy efficient configurations [18]. If the userdesires to enhance the exploration phase of the configurationspace further, NEAT provides an interface through the com-mand line flags to manually modify the tuning parameters ofNSGA-II such as population size, number of generations, orconvergence threshold.6.
6. Analyze the Output: NEAT reports detailed energy and performance data per configuration. Moreover, a Python script is provided to generate scatter plots of the tradeoff space with the lower convex hulls. At the completion of these steps, the user has information about the most appropriate precision level for each individual function or for the whole program.
V. EXPERIMENTAL RESULTS
We evaluate the efficacy and flexibility of NEAT for floating point approximation analysis. In general, NEAT generates useful information about the precision tuning of applications, which can be used at the design stage of software or conveyed to other layers of the system, such as compilers or hardware (e.g., building a set of reduced-precision FPUs). Section V-B inspects NEAT's floating point profiling of the applications. The primary challenge of automatic precision tuning is creating approximation configurations. We examine NEAT's flexibility to produce customized FPI definitions in Sections V-C and V-D. Moreover, the main mechanism of NEAT, programmable placement rules, is investigated in Sections V-E and V-F. To navigate the immense configuration space, NEAT comes with a tunable genetic exploration algorithm, which is used in the sections above (V-B through V-C). To ensure robustness of NEAT on unseen data, we evaluate the difference between predicted accuracy and energy on training and test data, demonstrating that NEAT finds configurations that are robust across test inputs that were not seen in training (Section V-G). Finally, in Section V-H, we evaluate NEAT's general applicability to finding appropriate reduced-precision floating point configurations by evaluating it on a problem that has recently received tremendous attention from human experts: trading accuracy for reduced precision in neural network inference. We find that NEAT can use the whole-program rule to automatically find a single floating point precision similar to those reported by human experts. Further, we find that by using different floating point implementations for different layers, NEAT produces even greater energy savings for the same accuracy.
A. Evaluation platform
We evaluate NEAT by exploring the tradeoff spaces of the placement rules for a variety of benchmarks. Table II lists the applications from the Parsec 3.0 [7] and Rodinia 3.1 [13] suites with the configuration space size (default precision optimization target) and the training and test inputs for each benchmark. These benchmarks cover domains from finance to image processing.

To create FPIs, we use bit truncation. For single precision floating point numbers (the float type in C), we have 24 different FPIs corresponding to the mantissa bits. Similarly, we created 53 FPIs for double precision floating point numbers. For the whole-program approach, the size of the tradeoff space is the total number of possible FPIs, i.e., 24 or 53 points. For the per-function approaches, we consider the top 10 functions with the most FLOPs when enforcing the FP rules, so each of the top 10 functions may use a different FPI. In each experiment, at most 400 configurations in the tradeoff space (a small fraction of all possible configurations) have been evaluated through NEAT's genetic algorithm.

B. Floating Point Precision Distribution
NEAT can be used to analyze the type, distribution, and intensity of the FLOPs in a program. Figure 4 depicts the ratio of single and double precision FLOPs for each benchmark. Most of the benchmarks hold the same precision level across the source for correctness and portability. For example, Bodytrack, Heartwall, and Kmeans are all implemented with the float type, while Canneal mainly uses double. However, some benchmarks, such as Ferret, Particlefilter, and Srad, contain a mixture of both precision levels due to included external libraries. In this case, users might choose which optimization target to enforce. Specifying the right target opens up further opportunities for additional energy savings.

Fig. 4: Floating Point Type Breakdown for Benchmarks. While most benchmarks have a dominant FP type, some carry both.
C. FPU Energy Saving
NEAT estimates the FPU energy consumed by the FLOPs. We compare two rules: whole program (WP) and currently-in-progress (CIP). As a reminder, WP uses one floating point implementation through the entirety of the program, while CIP is free to choose a separate implementation for each of the top 10 functions (by FLOP count) in the program. For Particlefilter, we set the optimization target to double precision, as most of its FLOPs are double. For the rest of the benchmarks, we apply the single precision optimization.

We consider the top 10 FLOP-intensive functions for the CIP placement. One might ask how many of the FLOPs are included in the top 10 functions. For all benchmarks, at least 98% of FLOPs come from the top 10 functions, so NEAT covers almost all of the FLOPs in the program.

TABLE II: Benchmarks Used for Evaluation.

Benchmark        Training inputs                     Test inputs
Blackscholes     10 lists with 100K initial prices   30 lists with 100K initial prices
Bodytrack        Sequence of 5 frames                Sequence of 20 frames
Fluidanimate     5 fluids with 15K+ particles        15 fluids with 15K+ particles
Ferret           5 databases of 16 images            15 databases of 16 images
Heartwall        Sequence of 15 frames               Sequence of 60 frames
Kmeans           10 vectors with 512 data points     30 vectors with 512 data points
Particlefilter   Sequence of 32 frames               Sequence of 128 frames
Radar            Sequence of 10 frames               Sequence of 40 frames

Figure 5 illustrates the lower convex hull of normalized FPU energy and the error rate (also referred to as accuracy loss). The error rate metric is the relative error of a configuration compared against the highest-quality configuration (the baseline), where no approximation happens. The horizontal axis is the error rate, while the Normalized Energy Consumption (NEC) relative to the baseline is shown on the y-axis. The lower the curve, the more efficient the configurations found, i.e., the higher the energy efficiency. Since users generally do not care about extremely inaccurate outputs, only error rates less than 20% are shown in the subfigures. The results show that if we assign multiple FPIs at the function level, NEAT retrieves more energy-efficient configurations that are not reachable with a single FPI for the whole program. This result further demonstrates NEAT's value in design space exploration.

Fig. 5: Lower Convex Hulls of FPU Energy and Error Rates for the WP and CIP. Values are normalized to the baseline.

With minimal error in the final output of the benchmark, NEAT reduces the FPU energy by up to 60%. For some applications, such as Blackscholes, Fluidanimate, and Particlefilter, the FPU energy savings are more considerable. These benchmarks have fewer than 10 FLOP-intensive functions. Therefore, first, CIP covers all the FLOPs in the program; second, since the tradeoff space is relatively smaller, NSGA-II searches a larger portion of it in the same exploration time.

For the Fluidanimate and Ferret benchmarks, there are only three and two configurations, respectively, where WP outperforms CIP. The reason is that NEAT's genetic algorithm, being a heuristic, fails to explore those specific configurations. The same pattern can be seen for the Radar benchmark, where CIP does not dominate the whole-program approach.

The Heartwall benchmark has only two FLOP functions, and they are very sensitive to bit-width adjustment; any modification leads to more than 20% error. Consequently, NEAT is not able to decrease FPU energy below 71% of the baseline with a reasonable error rate. The opposite scenario occurs for the Particlefilter application, where the major FLOP functions do not considerably impact the quality of the output; hence NEAT aggressively reduces the FPU energy without causing much error.

For a more detailed comparison, we re-illustrate a quantized representation of the previous plot. Figure 6 displays how the FPU energy savings improve as the tolerated error threshold increases. Higher bars indicate more energy savings. By harmonic mean, applying the CIP versus the WP approach results in 7%, 12%, and 13% more energy savings at 1%, 5%, and 10% error rate, respectively.

A steeper slope in the lower convex hull curves in the subplots of Figure 5 translates into higher bars in Figure 6 as the error threshold increases. The Blackscholes and Particlefilter benchmarks demonstrate such behavior. On the contrary, increasing the error threshold in the Particlefilter and Radar applications does not inflate the FPU energy savings similarly.

From these graphs, we draw two conclusions. First, specifying FPI placement at a finer granularity results in more efficient FPI-to-function mappings. In other words, per-function rules use less energy at the same error than a single FPI for the whole application. This type of insight is really only achievable with an automated system like NEAT. Second, if higher error rates are allowed, NEAT achieves higher FPU energy efficiency. Thus, NEAT can navigate the whole tradeoff space and give users a range of options depending on the tolerable error rate.

Fig. 6: FPU Energy Savings at Different Error Rates, normalized to the baseline. The higher the bars, the more energy efficient.
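Concretely, the two metrics plotted in this section can be computed as follows. This is a simplified sketch assuming a scalar output-quality measure; NEAT's actual quality metrics are application-specific.

```python
def error_rate(approx_output, baseline_output):
    """Relative error of an approximate configuration's output against the
    non-approximated baseline (the horizontal axis of the convex hull plots)."""
    return abs(approx_output - baseline_output) / abs(baseline_output)

def nec(config_energy, baseline_energy):
    """Normalized Energy Consumption: a configuration's FPU energy as a
    percentage of the baseline energy (the vertical axis)."""
    return 100.0 * config_energy / baseline_energy
```

A configuration with output 99.0 against a baseline of 100.0 thus sits at a 1% error rate, and one consuming 40 units against a 100-unit baseline sits at 40% NEC.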
D. Memory Instructions
Main memory (DRAM) consumes as much as half of the total system power in a computer today, due to the increasing demand for memory capacity and bandwidth [53]. Hence, reducing memory traffic translates directly into substantial energy savings. NEAT estimates memory energy by accounting only for accesses to/from off-chip memory, keeping track of memory operations such as MOVSS and MOVSD. Figure 7 depicts memory access energy for a range of error rates for both the whole-program (WP) and per-function (CIP) approaches across the benchmarks. As before, higher bars indicate higher energy efficiency. Values are normalized to the non-approximated version of the application, which acts as the baseline. On harmonic mean, increasing the error rate from 1% to 10% results in 3.2-10.5% less energy consumption.

If the FLOP functions are memory intensive, reducing the precision bits results in lower memory bandwidth and, consequently, more energy savings. That is why benchmarks such as Bodytrack, Fluidanimate, and Radar reduce memory energy by more than 60%. In the rest of the benchmarks, the FLOP functions were solely compute intensive.

To conclude the experiments above, we use the WP rule as representative of prior work [79], which tries to find a single best approximation for the whole application. The per-function rules of NEAT demonstrate the ability of the placement rules to let programmers explore a richer set of tradeoffs without having to create whole new implementations of existing program functionality.
E. Flexible Precision Level
In previous sections, we observed that some benchmarks have a mixture of both float and double FLOPs. To choose the right optimization target, we compare the energy and accuracy of selected benchmarks under both single and double precision optimization targets. The FPI-to-function mapping is CIP in this experiment.

Figure 8 shows the normalized energy savings for both single and double precision optimization targets. As expected, choosing the optimization target to match the FP type with the larger share of the FLOP distribution yields higher energy savings. This observation is easily justified by looking back at Section V-B. Both Canneal and Particlefilter contain more 64-bit than 32-bit FLOPs. Thus, double precision as the NEAT directive is the right choice to achieve substantially higher energy efficiency.

Ferret requires special attention, as it is not obvious how to choose the optimization target based on the FLOP distribution ratio: it has almost equal amounts of float and double FLOPs. At the 10% error rate, NEAT saves up to 92% of the FPU energy corresponding to double instructions, while only 38% savings are available if we consider only float instructions. There are two reasons for the discrepancy. One is that double FLOPs generally yield more precise output, but they use more precision bits in return. Thus, NEAT has more freedom to cut unnecessary floating point bits without losing much accuracy, because the double baseline is already more accurate than the float one. Second, the double functions in Ferret are not accuracy sensitive, meaning that enforcing approximation on these functions does not excessively change the quality of the output. This is a good example of how NEAT determines the most efficient configurations for any benchmark regardless of how its floating point precision is specified in the source (or binary).
F. Function Call Stack
As we mentioned in Section III-B4, if we map an FPI to a function, the quality of the output can change depending on the caller. While on most benchmarks the CIP and FCS approaches produce the same result, on Radar they differ. Hence, we examine the impact of the caller of the FFT function on the energy and accuracy of this benchmark. Figure 9 illustrates the FPU energy savings normalized to the baseline for the CIP and FCS placement rules. FCS was able to explore a handful of additional, more optimal configurations, resulting in 7% more energy savings at 1% accuracy loss compared to CIP, without extra runtime overhead. At the 5% and 10% error rates, the additional energy savings are 4% and 2%, respectively.
G. Sensitivity to Input Changes
Since we employ a heuristic exploration technique, we ensure that NEAT produces statistically sound results by evaluating each application with multiple inputs divided into training and test sets. We take the median of normalized accuracy loss and FPU energy for each set of inputs, compute a linear least-squares fit of training data to test data, and compute the correlation coefficient of each fit. Higher correlation coefficients imply less input sensitivity; i.e., the behavior of configurations found on training data is a good predictor of test behavior.

Fig. 7: Memory Transfer Energy Savings at Different Error Rates, normalized to the baseline.

Fig. 8: FPU Energy Savings with Different Optimization Targets for NEAT.

Fig. 9: Comparison of CIP and FCS for the FPU Energy Savings in Radar.

Table III shows the correlation coefficients (R-values) for accuracy loss and FPU energy for each benchmark. Due to the heuristic nature of the exploration technique, it is possible to select configurations that perform differently on unseen data. For instance, Kmeans clearly stresses the difference between training and test inputs. Nevertheless, all benchmarks have uniformly high R-values on accuracy loss and FPU energy, at least 0.932. This demonstrates that NEAT's search techniques are robust and that the accuracy and energy results they predict on training inputs hold up well for test inputs. The robustness of the energy results is, perhaps, not surprising, as those should be highly predictable (simpler FLOP implementations predictably lower energy). The robustness of the accuracy results is more surprising, as it is not intuitively obvious that floating point implementations that work well for one set of inputs would also work for another.

TABLE III: Correlation Coefficients for Error Rates and FPU Energy.

Benchmark        Error Rates   FPU Energy
Blackscholes     0.999         0.999
Bodytrack        0.958         0.989
Fluidanimate     0.995         1.0
Ferret           0.973         1.0
Heartwall        0.999         1.0
Kmeans           0.932         1.0
Particlefilter   0.991         1.0
Radar            0.992         1.0

H. Neural Network Integration
The energy and resource constraints in neural networks create an intriguing challenge. Recently, a growing body of literature has tried to sacrifice the precision of training and inference for lower runtime and energy consumption [16]. NEAT can be used to identify the FLOP-intensive sections of the network and then provide the minimum precision required for the computation without considerable reductions in model accuracy. This tradeoff (small accuracy loss for large energy savings) is well known, and we perform this study not to claim a new result, but to demonstrate that NEAT's automated approach can produce the same types of savings for this problem that have been produced by human domain experts. We also believe that using NEAT's programmable replacement rules to create DNNs with differing precision throughout the network is a new contribution that would (due to the size of the search space) be quite difficult even for human experts.

We use hand-written digit classification with the MNIST dataset, which includes 60K images and 10K labels. For the CNN, we consider the LeNet-5 model with the architecture summary listed in Table IV. The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully-connected layers, and finally a softmax classifier [48].

TABLE IV: LeNet-5 Architecture Summary.

Layer                      Feature Map   Size    Kernel Size   Activation
Input: Image               1             32x32   -             -
1 Convolutional(1)         6             28x28   5x5           tanh
2 Average Pooling(1)       6             14x14   2x2           tanh
3 Convolutional(2)         16            10x10   5x5           tanh
4 Average Pooling(2)       16            5x5     2x2           tanh
5 Convolutional(3)         120           1x1     5x5           tanh
6 Fully Connected          -             84      -             tanh
Output: Fully Connected    -             10      -             softmax

Fig. 10: 32-bit FLOP breakdown per layer in the digit recognition CNN.

Figure 10 shows the FLOP breakdown for CNN training with a minibatch size of 4, a learning rate of 1, and 30 epochs. We first measured how many of the operations are floating point to determine the applicability of NEAT. For inference, more than 73% of operations were FLOPs, which makes NEAT clearly beneficial to apply. Next, we analyze the FLOP distribution between the layers. We observe that more than 69% of floating point computation happens in the convolutional layers, where they extract interesting features from an image. Activation phases and internal compute functions are responsible for the majority of the remainder. Finally, we show that the number of FLOPs decreases in the latter layers of the CNN, since the size of the data transferred between layers decreases as well.

To apply the FPI-to-function placement rules to a CNN, there are two options. The first is to apply one FPI per layer category (which we refer to as PLC), meaning that, for example, all convolutional layers use the same precision level. The second approach is to apply a different FPI per layer instance (PLI); in this case the first and third layers might use distinct precision levels even though they are both convolutional layers.

Picking the right FPI placement policy is not trivial for CNNs. Unlike the WP versus CIP rules, where one has a significantly larger tradeoff space, the PLC and PLI tradeoff spaces are both large enough that heuristic exploration is required. Thus, either of these rules could outperform the other within the same exploration time. With PLC, NEAT explores a larger portion of the tradeoff space, leading to locating efficient configurations more quickly. On the other hand, PLI examines FPI mappings at a finer granularity and hence has a higher chance of discovering more optimal configurations.

Fig. 11: Comparison of PLC and PLI replacements for the CNN. (a) Lower Convex Hull Curves of Energy and Error Rate. (b) Quantized Energy Savings at Different Error Rates.

Figure 11a illustrates the lower convex hull of normalized FPU energy and accuracy for both approaches. The accuracy loss is the error difference relative to the baseline configuration without approximation. The baseline recognition accuracy in the inference stage is 99.04% with a fully accurate trained model. Each point in the tradeoff space represents an FPI-to-layer (category or instance) mapping. Points closer to the origin indicate higher energy efficiency.

As can be seen, the lower convex hull of PLI (finer granularity) outperforms the PLC curve for error rates of less than 20%. The quantized representation of the FPU energy versus error rate tradeoff space is shown in Figure 11b for both PLC and PLI placements. Similar to the previous evaluation, finer granularity results in higher energy efficiency. With 1%, 5%, and 10% accuracy loss, NEAT with PLI placements achieves 6%, 4%, and 3% more energy savings compared to the default configuration.

NEAT's programmable placement rules allow developers to analyze various precision levels for different components of their neural networks without requiring them to instrument the source code or re-design the architecture.

Since the FPIs are based on bit truncation of the mantissa, using the above analysis, NEAT finds the required precision bits for each layer in the LeNet-5 network under accuracy loss constraints. By default, each layer is implemented with single precision floating point numbers (24 mantissa bits). Table V shows the mantissa bits required for every layer in the network. These precisions could later be integrated with the MPFR library in C [30] or the mpmath library in Python [42].

TABLE V: Mantissa Bits for Single Precision FP Recommended by NEAT for Each Layer at Different Error Rates.

Layers / Error Rates   Conv 1   Avg Pool 1   Conv 2   Avg Pool 2   Conv 3   FC   Tanh   Internal Func.
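The difference between the two placement policies can be sketched as a lookup rule. The layer names and mantissa widths below are placeholders, not the values NEAT actually recommends in Table V.

```python
# PLC: one FPI per layer *category* (all conv layers share one precision).
plc_map = {"conv": 12, "pool": 8, "fc": 16}
# PLI: one FPI per layer *instance* (conv1 and conv3 may differ).
pli_map = {"conv1": 10, "pool1": 8, "conv2": 12,
           "pool2": 8, "conv3": 14, "fc1": 16, "fc2": 18}

def fpi_for_layer(layer_name, rule):
    """Return the mantissa width assigned to a layer under PLC or PLI."""
    if rule == "PLI":
        return pli_map[layer_name]
    # PLC: strip the trailing instance index to recover the category.
    category = layer_name.rstrip("0123456789")
    return plc_map[category]
```

Under PLC, `conv1` and `conv3` both resolve to the shared "conv" width, whereas PLI can assign each of them its own precision, which is exactly the finer granularity that yields the additional savings in Figure 11.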
VI. CONCLUSION
In this work, we proposed NEAT, a tool for automated precision tuning of floating point applications. NEAT provides mechanisms for programmers to explore the tradeoff space of combinations of approximate floating point implementations without extensive source code refactoring. We evaluated NEAT on various benchmarks with whole-program and per-function placement rules. We found that at finer granularity, up to 54% and 74% energy savings are available in the FPU and in memory transmissions, respectively. We empirically show that NEAT performs robustly on unseen inputs as well. We also performed a case study on a digit recognition CNN to find the optimal precision level required for each layer.
Acknowledgments:
This research is supported by NSF (CCF-1439156, CNS-1526304, CCF-1823032, CNS-1764039). Additional support comes from the Proteus project under the DARPA BRASS program and a DOE Early Career award.
EFERENCES[1] C. Alvarez, J. Corbal, and M. Valero, “Fuzzy memoization for floating-point multimedia applications,”
IEEE Transactions on Computers ,vol. 54, no. 7, pp. 922–927, July 2005.[2] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom,U.-M. O’Reilly, and S. Amarasinghe, “Opentuner: An extensibleframework for program autotuning,” in
Proceedings of the 23rdInternational Conference on Parallel Architectures and Compilation ,ser. PACT ’14. New York, NY, USA: ACM, 2014, pp. 303–316.[Online]. Available: http://doi.acm.org/10.1145/2628071.2628092[3] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Ama-rasinghe, “Language and compiler support for auto-tuning variable-accuracy algorithms,” in
Proceedings of the 9th Annual IEEE/ACMInternational Symposium on Code Generation and Optimization . IEEEComputer Society, 2011, pp. 85–96.[4] W. Baek and T. M. Chilimbi, “Green: A framework for supportingenergy-conscious programming using controlled approximation,”
SIGPLAN Not. , vol. 45, no. 6, pp. 198–209, Jun. 2010. [Online].Available: http://doi.acm.org/10.1145/1809028.1806620[5] J. Balkind, M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, A. Lavrov,M. Shahrad, A. Fuchs, S. Payne, X. Liang et al. , “Openpiton: An opensource manycore research framework,” in
ACM SIGARCH ComputerArchitecture News , vol. 44, no. 2. ACM, 2016, pp. 217–232.[6] S. Barati, F. A. Bartha, S. Biswas, R. Cartwright, A. Duracz, D. S.Fussell, H. Hoffmann, C. Imes, J. E. Miller, N. Mishra, Arvind,D. Nguyen, K. V. Palem, Y. Pei, K. Pingali, R. Sai, A. Wright, Y. Yang,and S. Zhang, “Proteus: Language and runtime support for self-adaptivesoftware development,”
IEEE Software , vol. 36, no. 2, pp. 73–82, 2019.[Online]. Available: https://doi.org/10.1109/MS.2018.2884864[7] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation,Princeton University, January 2011.[8] S. Borkar, “The exascale challange.” Keynote Talk, Parallel Architec-tures and Compilation Techniques (PACT), Galveston Island, Texas,USA., 10 2011. [9] J. Bornholt, T. Mytkowicz, and K. S. McKinley, “Uncertain¡ t¿: A first-order type for uncertain data,”
ACM SIGPLAN Notices , vol. 49, no. 4,pp. 51–66, 2014.[10] A. Boutros, S. Yazdanshenas, and V. Betz, “Embracing diversity:Enhanced dsp blocks for low-precision deep learning on fpgas,” in . IEEE, 2018, pp. 35–357.[11] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V.Palem, and B. Seshasayee, “Ultra-efficient (embedded) soc architecturesbased on probabilistic cmos (pcmos) technology,” in
Proceedings of theConference on Design, Automation and Test in Europe: Proceedings ,ser. DATE ’06. 3001 Leuven, Belgium, Belgium: European Designand Automation Association, 2006, pp. 1110–1115. [Online]. Available:http://dl.acm.org.proxy.uchicago.edu/citation.cfm?id=1131481.1131790[12] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consump-tion in digital cmos circuits,”
Proceedings of the IEEE , vol. 83, no. 4,pp. 498–523, Apr 1995.[13] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, andK. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,”in
Workload Characterization, 2009. IISWC 2009. IEEE InternationalSymposium on , Oct 2009, pp. 44–54.[14] V. K. Chippa, S. Venkataramani, S. T. Chakradhar, K. Roy, andA. Raghunathan, “Approximate computing: An integrated hardwareapproach,” in , Nov 2013, pp. 111–117.[15] M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Trainingdeep neural networks with binary weights during propagations,”
CoRR ,vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363[16] D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha,K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas et al. ,“Mixed precision training of convolutional neural networks using integeroperations,” arXiv preprint arXiv:1802.00930 , 2018.[17] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: Anarchitectural framework for software recovery of hardware faults,” in
Proceedings of the 37th Annual International Symposium on ComputerArchitecture , ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp.497–508. [Online]. Available: http://doi.acm.org.proxy.uchicago.edu/10.1145/1815961.1816026[18] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitistmultiobjective genetic algorithm: Nsga-ii,”
IEEE Transactions on Evo-lutionary Computation , vol. 6, no. 2, pp. 182–197, Apr 2002.[19] Y. Ding, N. Mishra, and H. Hoffmann, “Generative and multi-phaselearning for computer systems optimization,” in
Proceedings of the 46thInternational Symposium on Computer Architecture , ser. ISCA ’19.New York, NY, USA: Association for Computing Machinery, 2019, p.3952. [Online]. Available: https://doi.org/10.1145/3307650.3326633[20] K. Du, P. Varman, and K. Mohanram, “High performance reliablevariable latency carry select addition,” in , March 2012, pp. 1257–1262.[21] Z. Du, K. Palem, A. Lingamneni, O. Temam, Y. Chen, and C. Wu,“Leveraging the error resilience of machine-learning applications fordesigning highly energy efficient accelerators,” in , Jan 2014, pp.201–206.[22] P. D. D¨uben, J. Joven, A. Lingamneni, H. McNamara, G. De Micheli,K. V. Palem, and T. N. Palmer, “On the use of inexact, prunedhardware in atmospheric modelling,”
Philosophical Transactionsof the Royal Society of London A: Mathematical, Physical andEngineering Sciences , vol. 372, no. 2018, 2014. [Online]. Available:http://rsta.royalsocietypublishing.org/content/372/2018/20130276[23] S. Eldridge, F. Raudies, D. Zou, and A. Joshi, “Neural network-basedaccelerators for transcendental function approximation,” in
Proceedingsof the 24th edition of the great lakes symposium on VLSI . ACM, 2014,pp. 169–174.[24] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecturesupport for disciplined approximate programming,” in
Proceedings ofthe Seventeenth International Conference on Architectural Support forProgramming Languages and Operating Systems , ser. ASPLOS XVII.New York, NY, USA: ACM, 2012, pp. 301–312. [Online]. Available:http://doi.acm.org.proxy.uchicago.edu/10.1145/2150976.2151008[25] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neuralacceleration for general-purpose approximate programs,” in
Proceedingsof the 2012 45th Annual IEEE/ACM International Symposium n Microarchitecture , ser. MICRO-45. Washington, DC, USA:IEEE Computer Society, 2012, pp. 449–460. [Online]. Available:http://dx.doi.org.proxy.uchicago.edu/10.1109/MICRO.2012.48[26] A. Farrell and H. Hoffmann, “MEANTIME: achieving both minimalenergy and timeliness with approximate computing,” in , 2016, pp. 421–435.[27] A. Filieri, H. Hoffmann, and M. Maggio, “Automated multi-objectivecontrol for self-adaptive software design,” in Proceedings of the 201510th Joint Meeting on Foundations of Software Engineering, ESEC/FSE2015, Bergamo, Italy, August 30 - September 4, 2015 , E. D. Nitto,M. Harman, and P. Heymans, Eds. ACM, 2015, pp. 13–24. [Online].Available: https://doi.org/10.1145/2786805.2786833[28] A. Filieri, M. Maggio, K. Angelopoulos, N. D’Ippolito,I. Gerostathopoulos, A. B. Hempel, H. Hoffmann, P. Jamshidi,E. Kalyvianaki, C. Klein, F. Krikava, S. Misailovic, A. V.Papadopoulos, S. Ray, A. M. Sharifloo, S. Shevtsov, M. Ujma, andT. Vogel, “Control strategies for self-adaptive software systems,”
ACMTrans. Auton. Adapt. Syst. , vol. 11, no. 4, pp. 24:1–24:31, 2017.[Online]. Available: https://doi.org/10.1145/3024188[29] B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan,J. Choi, S. Mueller, A. Agrawal, T. Babinsky, N. Cao, C. Chen,P. Chuang, T. Fox, G. Gristede, M. Guillorn, H. Haynie, M. Klaiber,D. Lee, S. Lo, G. Maier, M. Scheuermann, S. Venkataramani,C. Vezyrtzis, N. Wang, F. Yee, C. Zhou, P. Lu, B. Curran, L. Chang, andK. Gopalakrishnan, “A scalable multi- teraops deep learning processorcore for ai trainina and inference,” in , June 2018, pp. 35–36.[30] L. Fousse, G. Hanrot, V. Lef`evre, P. P´elissier, and P. Zimmermann,“Mpfr: A multiple-precision binary floating-point library with cor-rect rounding,”
ACM Transactions on Mathematical Software (TOMS) ,vol. 33, no. 2, p. 13, 2007.[31] N. Gajjar, N. M. Devahsrayee, and K. S. Dasgupta, “Scalable leon3 based soc for multiple floating point operations,” in , Dec 2011, pp. 1–3.[32] B. Grigorian, N. Farahpour, and G. Reinman, “Brainiac: Bringingreliable accuracy into neurally-implemented approximate computing,”in , Feb 2015, pp. 615–626.[33] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, andW. J. Dally, “Eie: Efficient inference engine on compressed deep neuralnetwork,” in , June 2016, pp. 243–254.[34] H. Hoffmann, “Coadapt: Predictable behavior for accuracy-aware appli-cations running on power-aware systems,” in ,2014, pp. 223–232.[35] H. Hoffmann, A. Agarwal, and S. Devadas, “Selecting spatiotemporalpatterns for development of parallel applications,”
IEEE Trans. Parallel Distributed Syst., vol. 23, no. 10, pp. 1970–1982, 2012. [Online]. Available: https://doi.org/10.1109/TPDS.2011.298
[36] H. Hoffmann, S. Misailovic, S. Sidiroglou, A. Agarwal, and M. Rinard, “Using code perforation to improve performance, reduce energy consumption, and respond to failures,” no. MIT-CSAIL-TR-2009-042, 09 2009.
[37] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard, “Dynamic knobs for responsive power-aware computing,” in
Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 199–212. [Online]. Available: http://doi.acm.org/10.1145/1950365.1950390
[38] C. Imes and H. Hoffmann, “Bard: A unified framework for managing soft timing and power constraints,” in
International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2016, Agios Konstantinos, Samos Island, Greece, July 17-21, 2016, W. A. Najjar and A. Gerstlauer, Eds. IEEE, 2016, pp. 31–38. [Online]. Available: https://doi.org/10.1109/SAMOS.2016.7818328
[39] C. Imes, S. A. Hofmeyr, and H. Hoffmann, “Energy-efficient application resource scheduling using machine learning classifiers,” in
Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, Eugene, OR, USA, August 13-16, 2018. ACM, 2018, pp. 45:1–45:11. [Online]. Available: https://doi.org/10.1145/3225058.3225088
[40] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann, “POET: a portable approach to minimizing energy under soft real-time constraints,” in . IEEE Computer Society, 2015, pp. 75–86. [Online]. Available: https://doi.org/10.1109/RTAS.2015.7108419
[41] C. Imes, H. Zhang, K. Zhao, and H. Hoffmann, “CoPPer: Soft real-time application performance using hardware power capping,” in . IEEE, 2019, pp. 31–41. [Online]. Available: https://doi.org/10.1109/ICAC.2019.00015
[42] F. Johansson et al., mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.14), February 2010, http://code.google.com/p/mpmath/.
[43] A. Kanduri, M. H. Haghbayan, A. M. Rahmani, P. Liljeberg, A. Jantsch, N. Dutt, and H. Tenhunen, “Approximation knob: Power capping meets energy efficiency,” in , Nov 2016, pp. 1–8.
[44] K. Y. Kyaw, W. L. Goh, and K. S. Yeo, “Low-power high-speed multiplier for error-tolerant application,” in , Dec 2010, pp. 1–4.
[45] U. Köster, T. J. Webb, X. Wang, M. Nassar, A. K. Bansal, W. H. Constable, O. H. Elibol, S. Gray, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss, R. J. Pai, and N. Rao, “Flexpoint: An adaptive numerical format for efficient training of deep neural networks,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. USA: Curran Associates Inc., 2017, pp. 1740–1750. [Online]. Available: http://dl.acm.org/citation.cfm?id=3294771.3294937
[46] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power with an underdesigned multiplier architecture,” in , Jan 2011, pp. 346–351.
[47] J. Lebak, J. Kepner, H. Hoffmann, and E. Rutledge, “Parallel VSIPL++: An open standard software library for high-performance parallel signal processing,”
Proceedings of the IEEE, vol. 93, no. 2, pp. 313–330, Feb 2005.
[48] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with gradient-based learning,” in
Shape, contour and grouping in computer vision. Springer, 1999, pp. 319–345.
[49] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, “Designing energy-efficient arithmetic operators using inexact computing,”
Journal of Low Power Electronics, vol. 9, no. 1, pp. 141–153, 2013.
[50] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn, “Flikker: Saving DRAM refresh-power through critical data partitioning,” in
Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 213–224. [Online]. Available: http://doi.acm.org/10.1145/1950365.1950391
[51] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in
Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’05. New York, NY, USA: ACM, 2005, pp. 190–200. [Online]. Available: http://doi.acm.org/10.1145/1065010.1065034
[52] M. Maggio, A. V. Papadopoulos, A. Filieri, and H. Hoffmann, “Automated control of multiple software goals using multiple actuators,” in
Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, 2017, pp. 373–384. [Online]. Available: https://doi.org/10.1145/3106237.3106247
[53] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, “Towards energy-proportional datacenter memory with mobile DRAM,” in , June 2012, pp. 37–48.
[54] M. McKeown, A. Lavrov, M. Shahrad, P. J. Jackson, Y. Fu, J. Balkind, T. M. Nguyen, K. Lim, Y. Zhou, and D. Wentzlaff, “Power and energy characterization of an open source 25-core manycore processor,” in , Feb 2018, pp. 762–775.
[55] S. Misailovic, M. Carbin, S. Achour, Z. Qi, and M. C. Rinard, “Chisel: Reliability- and accuracy-aware optimization of approximate computational kernels,” in
Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, ser. OOPSLA ’14. New York, NY, USA: ACM, 2014, pp. 309–328. [Online]. Available: http://doi.acm.org/10.1145/2660193.2660231
[56] S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard, Quality of Service Profiling. New York, NY, USA: Association for Computing Machinery, 2010, pp. 25–34. [Online]. Available: https://doi.org/10.1145/1806799.1806808
[57] N. Mishra, C. Imes, J. D. Lafferty, and H. Hoffmann, “CALOREE: learning control for predictable latency and low energy,” in
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-28, 2018, X. Shen, J. Tuck, R. Bianchini, and V. Sarkar, Eds. ACM, 2018, pp. 184–198. [Online]. Available: https://doi.org/10.1145/3173162.3173184
[58] N. Mishra, J. D. Lafferty, and H. Hoffmann, “ESP: A machine learning approach to predicting application interference,” in , X. Wang, C. Stewart, and H. Lei, Eds. IEEE Computer Society, 2017, pp. 125–134. [Online]. Available: https://doi.org/10.1109/ICAC.2017.29
[59] N. Mishra, H. Zhang, J. D. Lafferty, and H. Hoffmann, “A probabilistic graphical model-based approach for minimizing energy under performance constraints,” in
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, Istanbul, Turkey, March 14-18, 2015, Ö. Öztürk, K. Ebcioglu, and S. Dwarkadas, Eds. ACM, 2015, pp. 267–281. [Online]. Available: https://doi.org/10.1145/2694344.2694373
[60] T. Moreau, A. Sampson, and L. Ceze, “Approximate computing: Making mobile systems more efficient,”
IEEE Pervasive Computing, vol. 14, no. 2, pp. 9–13, Apr 2015.
[61] K. V. Palem, L. N. Chakrapani, Z. M. Kedem, A. Lingamneni, and K. K. Muntimadugu, “Sustaining Moore’s law in embedded computing through probabilistic and approximate design: Retrospects and prospects,” in
Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, ser. CASES ’09. New York, NY, USA: ACM, 2009, pp. 1–10. [Online]. Available: http://doi.acm.org/10.1145/1629395.1629397
[62] Q. Zhang, F. Yuan, R. Ye, and Q. Xu, “ApproxIt: An approximate computing framework for iterative methods,” in , June 2014, pp. 1–6.
[63] M. Rinard, H. Hoffmann, S. Misailovic, and S. Sidiroglou, “Patterns and statistical analysis for understanding reduced resource computing,” in
Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, ser. OOPSLA ’10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 806–821. [Online]. Available: https://doi.org/10.1145/1869459.1869525
[64] C. Sakr, N. Wang, C.-Y. Chen, J. Choi, A. Agrawal, N. Shanbhag, and K. Gopalakrishnan, “Accumulation bit-width scaling for ultra-low precision training of deep networks,” arXiv preprint arXiv:1901.06588, 2019.
[65] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “EnerJ: Approximate data types for safe and general low-power computation,” in
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’11. New York, NY, USA: ACM, 2011, pp. 164–174. [Online]. Available: http://doi.acm.org/10.1145/1993498.1993518
[66] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “EnerJ: Approximate data types for safe and general low-power computation,” in
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’11. New York, NY, USA: ACM, 2011, pp. 164–174. [Online]. Available: http://doi.acm.org/10.1145/1993498.1993518
[67] M. H. Santriaji and H. Hoffmann, “GRAPE: minimizing energy for GPU applications with performance requirements,” in . IEEE Computer Society, 2016, pp. 16:1–16:13. [Online]. Available: https://doi.org/10.1109/MICRO.2016.7783719
[68] M. H. Santriaji and H. Hoffmann, “MERLOT: architectural support for energy-efficient real-time processing in GPUs,” in
IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2018, 11-13 April 2018, Porto, Portugal, R. Pellizzoni, Ed. IEEE Computer Society, 2018, pp. 214–226. [Online]. Available: https://doi.org/10.1109/RTAS.2018.00030
[69] Q. Shi, H. Hoffmann, and O. Khan, “A cross-layer multicore architecture to tradeoff program accuracy and resilience overheads,”
IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 85–89, 2015. [Online]. Available: https://doi.org/10.1109/LCA.2014.2365204
[70] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, “Managing performance vs. accuracy trade-offs with loop perforation,” in
Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 124–134. [Online]. Available: http://doi.acm.org/10.1145/2025113.2025133
[71] G. Tagliavini, A. Marongiu, and L. Benini, “FlexFloat: A software library for transprecision computing,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
[72] S. Venkataramani, A. Ranjan, K. Roy, and A. Raghunathan, “AxNN: Energy-efficient neuromorphic systems using approximate computing,” in , Aug 2014, pp. 27–32.
[73] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Quality programmable vector processors for approximate computing,” in
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46. New York, NY, USA: ACM, 2013, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/2540708.2540710
[74] A. K. Verma, P. Brisk, and P. Ienne, “Variable latency speculative addition: A new paradigm for arithmetic circuit design,” in , March 2008, pp. 1250–1255.
[75] C. Wan, H. Hoffmann, S. Lu, and M. Maire, “Orthogonalized SGD and nested architectures for anytime neural networks,” in
Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 9807–9817. [Online]. Available: http://proceedings.mlr.press/v119/wan20a.html
[76] C. Wan, M. Santriaji, E. Rogers, H. Hoffmann, M. Maire, and S. Lu, “ALERT: Accurate learning for energy and timeliness,” in
Advances in neural information processing systems, 2018, pp. 7675–7684.
[78] S. Wang, C. Li, H. Hoffmann, S. Lu, W. Sentosa, and A. I. Kistijantoro, “Understanding and auto-adjusting performance-sensitive configurations,” in
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-28, 2018, X. Shen, J. Tuck, R. Bianchini, and V. Sarkar, Eds. ACM, 2018, pp. 154–168. [Online]. Available: https://doi.org/10.1145/3173162.3173206
[79] S. Wu, G. Li, F. Chen, and L. Shi, “Training and inference with integers in deep neural networks,” arXiv preprint arXiv:1802.04680, 2018.
[80] A. Yazdanbakhsh, D. Mahajan, B. Thwaites, J. Park, A. Nagendrakumar, S. Sethuraman, K. Ramkrishnan, N. Ravindran, R. Jariwala, A. Rahimi, H. Esmaeilzadeh, and K. Bazargan, “Axilog: Language support for approximate hardware design,” in , March 2015, pp. 812–817.
[81] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, “Design-efficient approximate multiplication circuits through partial product perforation,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 10, pp. 3105–3117, Oct 2016.
[82] H. Zhang, M. Putic, and J. Lach, “Low power GPGPU computation with imprecise hardware,” in
Proceedings of the 51st Annual Design Automation Conference, ser. DAC ’14. New York, NY, USA: ACM, 2014, pp. 99:1–99:6. [Online]. Available: http://doi.acm.org/10.1145/2593069.2593156
[83] Y. Zhou, H. Hoffmann, and D. Wentzlaff, “CASH: supporting IaaS customers with a sub-core configurable architecture,” in . IEEE Computer Society, 2016, pp. 682–694. [Online]. Available: https://doi.org/10.1109/ISCA.2016.65
[84] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, “Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 8, pp. 1225–1229, Aug 2010.