NEAT: A Framework for Automated Exploration of Floating Point Approximations
Saeid Barati
Computer Science Department, University of Chicago
Chicago, [email protected]
Lee Ehudin
Computer Science Department, University of Chicago
Chicago, [email protected]
Henry Hoffmann
Computer Science Department, University of Chicago
Chicago, [email protected]
Abstract—Much recent research is devoted to exploring tradeoffs between computational accuracy and energy efficiency at different levels of the system stack. Approximation at the floating point unit (FPU) saves energy by simply reducing the number of computed floating point bits in return for accuracy loss. However, finding the most energy-efficient approximation for various applications with minimal effort remains the main challenge. To address this issue, we propose NEAT: a Pin tool that helps users automatically explore the accuracy-energy tradeoff space induced by various floating point implementations. NEAT helps programmers explore the effects of simultaneously using multiple floating point implementations to achieve the lowest energy consumption for an accuracy constraint, or vice versa. NEAT accepts one or more user-defined floating point implementations and programmable placement rules for where/when to apply them. NEAT then automatically replaces floating point operations with different implementations based on the user-specified rules at runtime and explores the resulting tradeoff space to find the best use of approximate floating point implementations for precision tuning throughout the program. We evaluate NEAT by enforcing combinations of 24/53 different floating point implementations with three sets of placement rules on a wide range of benchmarks. We find that heuristic precision tuning at the function level provides up to 22% and 48% energy savings at 1% and 10% accuracy loss, respectively, compared to applying a single implementation to the whole application. NEAT is also applicable to neural networks, where it finds the optimal precision level for each layer given an accuracy target for the model.
I. INTRODUCTION
Early work in approximate computing demonstrates tremendous energy and execution time reductions by making a variety of arithmetic and logic functional units available [11], [12], [22], [24]. Reduced-precision methods advocate less numerical precision for data storage and computation to achieve higher performance and energy efficiency [16], [64], [73], [77]. The proliferation of both approximate functional units and reduced-precision software methods creates tremendous opportunity, but it also creates a new problem. While designing for reduced precision has long been common in specialized application domains, such as digital signal processing [10], the proliferation of these techniques means that general programmers will now have to consider the implications of such designs. Specifically, it is up to programmers to decide which level of approximation to use at different points in their application and to navigate the immense tradeoff space created by allowing multiple approximations within a single program. Consider 10 different levels of approximation available at the function level for a moderate-sized program with 10 functions. Programmers attempting to design for energy efficiency and accuracy in this scenario face two separate, but related, challenges. First is the challenge of correctly (in terms of achieved accuracy) implementing 10 different versions of each candidate function (one version for each available level of precision). Second is the challenge of searching the resulting tradeoff space with 10^10 points to explore. The tradeoff space could be even larger if we exploit data type approximation, where each variable in the program could acquire a different level of approximation [9], [30], [65], [71].
Constructing a large number of alternative implementations and then navigating such an immense tradeoff space is likely beyond the abilities of even domain experts. Thus, we need an automated precision tuning framework that can both generate alternative implementations and then explore the induced tradeoff space. In this paper, we propose one mechanism that helps address both of the above challenges: programmable placement rules for approximate floating point computation. We argue that asking programmers to implement N different versions of key functions is unnecessarily burdensome, and generating all possible approximations of each function would make the search space prohibitively large. The programmable rules are a compromise, where programmers can encode their knowledge of the application into concise rules about which functions can be approximated, by how much, and when it might be permissible to do so. These rules can then be used by an automated tool to generate a candidate set of approximate function implementations that is much smaller than the set of all possible approximations. To address the challenges of creating and selecting from a large number of approximation alternatives, we propose NEAT (Navigating Energy Approximation Tradeoffs), a tool that helps users explore different levels of approximation within a program without detailed instrumentation and without laboriously creating many alternative implementations of functions. NEAT accepts a user program, a set of approximate floating point implementations, and a set of programmable placement rules for when to use a specific implementation within a program. NEAT then runs the program and dynamically replaces floating point operations (FLOPs) with the approximate version as specified by the rules. NEAT reports the program's output along with an estimate of floating point unit (FPU) and memory access energy and an itemized report of FLOPs in the program.
Thus, NEAT helps developers explore the configuration space of floating point implementations (FPIs) without requiring them to have deep numerical expertise. We implement NEAT for x86 using the Pin binary instrumentation system [51]. We demonstrate NEAT's value by comparing the approximations produced by different placement rule sets. In the first, we write a simple rule that picks a single floating point implementation for the entire program; i.e., the rule is a simple one-to-one replacement (whole-program rule) common to many proposed approximation methods, e.g., those that use a single, reduced precision for machine learning [21] or scientific simulation [22]. In the second, we allow the top 10 executed functions with the most FLOPs to each use a different approximation (per-function rules). Either we use the currently-in-progress function (CIP) or the most recent function on the call stack (FCS) as the target for the approximate floating point implementation. For all rules, NEAT uses a genetic algorithm to guide exploration of the enormous resulting search space. We evaluate NEAT on a selected set of benchmarks from the Parsec 3.0 [7] and Rodinia 3.1 [13] suites, which cover a variety of real-world applications. For the FPIs, we applied mantissa bitwidth tuning. On average, per-function placement retrieves more energy-optimal floating point implementations than the whole-program approach, providing 22.1% and 3.2% energy savings in the FPU and memory, respectively, with an allowance of 1% accuracy loss. To ensure the robustness of NEAT, we include multiple inputs for each application, divided into training and test sets, to evaluate whether NEAT produces statistically sound results. We also extend the evaluation with a digit recognition application implemented as a neural network on the MNIST dataset. For any accuracy target, NEAT provides the required precision level for each layer.
NEAT is also released as open source, so others can evaluate or use it freely. In summary, this paper proposes:
• The NEAT framework, which helps users explore the tradeoff space of reduced precision floating point combinations while not requiring hand tuning or code instrumentation.
• A case study that compares whole-program vs. per-function approximation placements for a variety of benchmarks. NEAT also offers a separate placement solution based on the caller function, useful for frequently invoked functions.
• Robustness on unseen inputs with a high correlation coefficient. NEAT finds statistically meaningful approximations that are not sensitive to input data and are likely to remain efficient on an unseen set of inputs.
• A demonstration of NEAT's applicability to Convolutional Neural Networks (CNNs), providing per-layer precision computation modes that yield energy savings with minimal loss of model accuracy.

II. BACKGROUND & MOTIVATION
A. Prior Work
While there has been a substantial amount of effort aimed toward finding new forms of approximation [1], [17], [25], [36], [37], [50], [56], [61], [63], [65], [69], [70], [80], there is a lack of solutions that help users both develop their own approximation methods and then specify the approximation level to enforce for a single application. Hardware approximation computes inexactly in return for reduced energy, area, or time [14], [49]. Approximate multipliers [44], [46], [81] and adders [20], [84] are widely advocated for energy-efficient computing. State-of-the-art neural network training platforms offer 16-bit floating point hardware that provides up to 4x performance gain compared to traditional 32-bit systems [29]. Recent proposals promote putting many different approximate units or customized accelerators on a single core [31]. Thus, it is beneficial to include multiple FPUs on a chip for higher energy efficiency [22], but this requires tedious hand-tuning. The challenge, therefore, is figuring out which FPU to use in each part of the program. This is the challenge that motivates NEAT. Languages support approximation by allowing the specification of variants for key functionality and formal analysis of their effects [3], [9], [55]. Approximation Knobs provide a way to lend performance and energy gains to existing power knobs [43]. Quora is a quality-programmable processor where the notion of quality is codified in the instruction set of the processor [73]. Another example of user-defined approximation is Green, a system that allows programmers to supply approximate versions of loops and while-blocks that terminate early [4]. In contrast to these programming language techniques, our proposal lets users easily, through our programmable substitution rules, examine and change the accuracy of FLOPs, giving them more control over the floating point computations in a program. Performing precision tuning at fine grain is available through software libraries.
EnerJ proposes to declare approximate data via type qualifiers [65]. MPFR adds to its arbitrary-precision representation support for rounding modes, exceptions, and special values as defined in the IEEE 754 standard [30]. FlexFloat reduces floating point emulation time by providing a C/C++ interface supporting multiple FP formats. These techniques require source code instrumentation (changing float and double variable definitions to custom parameters) or aim to yield more precise computation (for instance, floating point numbers with more than 128 bits). NEAT focuses on energy efficiency by reducing precision while only requiring the program binary. Convolutional Neural Networks (CNNs) include a significant amount of floating point computation in the training and inference stages. A large body of research has focused on CNN precision scaling [15], [32], [33], [62], [64], [72]. For example, WAGE quantizes weights to 2 bits while activations, errors, and gradients are 8 bits [79]. Flexpoint presents a new format with a 16-bit mantissa to train CNNs with full precision [45]. Another piece of research demonstrates successful training with 8-16 bit floating point numbers at full accuracy [77]. Other, tangentially related approaches create networks with early exit points [75], [76], but those are not related to the problem of changing numerical precision. Prior approaches either change the training architecture or apply a coarse-grain precision level for all layers. In contrast, NEAT generates precision tuning analysis at different granularities by offering WP and CIP solutions without modifying the application's internal structure or exhaustively exploring precisions. While prior work mainly develops mechanisms that enable approximation to provide energy and runtime savings in different domains, it does not help users make more informed decisions about approximation.

Fig. 1: Energy Per Instruction for different classes of instructions.
These techniques are mostly not flexible about how much, where, and when to approximate, and only provide discrete approximation knobs, which leads to more conservative design choices. NEAT does not propose new mechanisms but helps users answer the questions above.
B. Motivation
Current inexact functional units, in addition to approximate software libraries, create an opportunity to exploit quality-energy tradeoffs. While an FPU accounts for 2-5% of chip area, floating point instructions consume significantly more energy than other classes of instructions such as integer, memory, and control [5], [54]. Figure 1 illustrates the energy per instruction (EPI) for different classes of instructions on a 64-bit 32nm processor. With random operands, a 64-bit floating point add consumes 400 pJ, and a division can go as high as 680 pJ. For 32-bit versions, the energy consumption is 350 and 420 pJ, respectively. As expected given the type of operations, executing floating point instructions emerges as a major contributor to total energy consumption. Recent empirical studies have shown that up to 50% of the energy consumed in a core and memory is related to floating point instructions [54]. Thus, exploiting reduced bitwidth at the instruction level (bit truncation) to generate Floating Point Implementations (FPIs) can facilitate higher energy efficiency. Another useful insight from Figure 1 is the relationship between computation and memory accesses. For example, three add operations consume the same amount of energy as a ldx instruction. Hence, from an energy efficiency point of view, reducing memory traffic can be as effective as optimizing floating point arithmetic operations [54]. A body of literature has focused on providing tool support that allows users to define several approximations for different components of an application [17], [25], [36], [37], [56], [60], [63], [66]. Petabricks provides language extensions that expose tradeoffs between time and accuracy to the compiler [3]. The compiler then runs dynamic autotuning to generate optimized elements that achieve the target accuracy.
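The EPI figures above support a back-of-envelope FPU energy model. The sketch below uses only the 64-bit numbers quoted in the text (add: 400 pJ, divide: 680 pJ); the simple linear cost model is our illustration, not NEAT's exact estimator.

```cpp
#include <cassert>

// Naive FPU energy model from the per-instruction figures in the text:
// a 64-bit FP add costs 400 pJ and a 64-bit FP divide up to 680 pJ.
// Total energy is just the instruction counts weighted by these EPIs.
double fpu_energy_pj(long n_add64, long n_div64) {
    const double EPI_ADD64 = 400.0;  // pJ per 64-bit add (Figure 1)
    const double EPI_DIV64 = 680.0;  // pJ per 64-bit divide (Figure 1)
    return n_add64 * EPI_ADD64 + n_div64 * EPI_DIV64;
}
```

Three adds cost 1200 pJ under this model, which matches the text's observation that three add operations consume roughly as much energy as one ldx memory instruction.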
However, autotuners need to be configured on a per-application basis by the user. OpenTuner provides fully-customizable configuration representations and ensembles of search techniques to find an optimal solution [2]. Both autotuning techniques are meant to help programmers, but Petabricks requires a separate language and both require users to implement all alternatives before the search can be conducted. NEAT also helps users deal with approximation, but instead of requiring users to implement all possible alternatives, they simply describe programmable rules that are then used to automatically generate the alternatives. Hence, there is a need for a generic framework that provides multiple precision levels, accommodates custom user-defined floating point implementations, and does not require code refactoring. NEAT provides such a solution. NEAT generates insightful information for precision tuning at the function level for floating point programs.

III. SYSTEM DESIGN
In this section, we describe our solution, which generates insightful information about floating point precision tuning for applications. This tool, named Navigating Energy and Accuracy Tradeoffs (NEAT), allows users to collect energy and performance data from applications using custom implementations of floating point arithmetic. The main challenge of precision tuning is constructing the right configuration of floating point precisions for the application. This configuration space ranges from extremely large to fairly small, from using a different floating point implementation for each dynamic floating point instruction, to using a different implementation for different function calls, to just picking a single floating point implementation for the entire application. NEAT provides such flexibility in the granularity of enforcing floating point approximations by introducing programmable placement rules and then automatically searching the accuracy-energy tradeoff space to find the optimal frontier.

Fig. 2: NEAT Design

Figure 2 illustrates the NEAT system from the user perspective. Users specify: (1) the application that they want to understand (this can be just a binary and requires no special changes), (2) whether NEAT should consider double or single precision (or both), (3) a set of alternative implementations for floating point arithmetic, and (4) the programmable placement rules that describe when, where, and how in the program to replace the standard floating point operations with one of the alternative implementations. NEAT then runs the program as a Pin tool, intercepting all floating point operations of the specified type and replacing them according to the rules. NEAT performs multiple runs of the application, collecting statistics on floating point usage, accuracy, and estimated energy. NEAT offers a profiling mode where the user collects precision analysis, such as the quantity and frequency of FLOPs, for the application before applying any FPIs.
Ultimately, NEAT can repeatedly test different assignments of floating point operations to find the frontier of optimal configurations, i.e., assignments of floating point operations to different regions of the code. This section describes NEAT's inputs, internals, and outputs.
A. NEAT Inputs
User inputs to NEAT include: a user application to instrument, a precision level as the optimization target, the desired FP arithmetic implementations, and a set of FPI-to-function mappings (programmable placement rules). NEAT receives the binary of the program and instruments the floating point instructions. Unlike other precision tuning tools, NEAT does not require the source code of the program. NEAT then expects the optimization target, which can be either single or double precision. There are two reasons for including an optimization objective. First, most programs hold the same precision level across the code base for their data structures and functions. Second, if we considered both float and double
FLOPs for optimization, the configuration space of FPI combinations would explode. Next, users specify multiple FPIs for the individual arithmetic instructions, such as addition, subtraction, multiplication, and division. Finally, NEAT expects a mapping between the candidate code sections and the FPIs used to calculate each FLOP in a program. By default, NEAT enforces the FPIs at the function level, meaning all FLOPs executed within a specific function use the same customized FPI. Any function that has at least one FLOP can be considered a candidate for approximation.
B. NEAT Internal Structure
The NEAT dynamic instrumentation tool is written in C++ using the Intel Pin instrumentation system [51]. NEAT performs run-time instrumentation to facilitate the analysis and replacement of floating point arithmetic operations during the execution of compiled C and C++ binaries.
1) Intel Pin Tool:
The Pin instrumentation system was chosen as the backbone for this tool because of its clean API and efficient implementation. The Pin API makes it possible to write instrumentation routines that observe and alter the architectural state of a process. Pin uses a JIT compiler to generate new instrumented code that can be executed without extra runtime overhead from instrumentation.
2) Floating Point Operations:
For the purposes of this tool, we identify floating point arithmetic operations as the Streaming SIMD Extensions (SSE) instructions for scalar arithmetic. These instructions are included in a SIMD instruction set extension to the x86 architecture and operate on 32-bit or 64-bit floating point numbers. More specifically, the instructions we use for our definition of a floating point operation are ADDSS, SUBSS, MULSS, DIVSS, ADDSD, SUBSD, MULSD, and DIVSD.
3) Floating Point Arithmetic Implementation:
Custom hardware units and accelerators have been considered for enriching quality-versus-energy tradeoff spaces. Approximate adders [20], [74], [84] and multipliers [44], [46], [81] have been designed for lower power consumption and high performance. In the presence of inexact hardware units, NEAT provides information on how to efficiently redirect arithmetic instructions to these units. Floating point formats with fewer bits present an appealing opportunity to reduce energy consumption, since they allow simplification of hardware units and reduction of the memory bandwidth required to transfer data between memory and registers. An FPI can be as simple as bit truncation in the FP format representation, enforcing direct approximation on the operands or result of arithmetic operations, or redirecting instructions to approximate hardware units or software libraries.
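As a concrete illustration of the simplest kind of FPI, bit truncation, the sketch below zeroes the low-order stored mantissa bits of an IEEE 754 double. This is a minimal sketch of the idea, not NEAT's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Keep only the top `bits` of a double's 52 stored mantissa bits by
// zeroing the low-order ones; sign and exponent are untouched.
double truncate_mantissa(double x, int bits) {
    uint64_t u;
    std::memcpy(&u, &x, sizeof u);       // bit-level view of x
    u &= ~((1ULL << (52 - bits)) - 1);   // clear low mantissa bits
    std::memcpy(&x, &u, sizeof x);
    return x;
}
```

Applying such a truncation after every arithmetic operation emulates an FPU that computes with a narrower significand.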
4) Execution of Floating Point Instructions:
Defining an FPI is fairly trivial. The main challenge with enforcing FPIs dynamically is specifying the exact mapping between FPIs and FLOPs. NEAT allows users to define placement rules that determine which FPI is used to calculate each FLOP in a program. Every time a FLOP is about to be calculated in the user application, NEAT examines all of the mappings, captures information about the current state of the application, and uses that information to determine which FPI will be applied to calculate the result of the FLOP.

TABLE I: Built-in Placement Rules in NEAT.

Placement Rule | Description | Tradeoff Space Size
WP | one FPI for the whole program | −
CIP | one FPI for the currently in-progress function | −
FCS | one FPI for the most recent function on the call stack | −

NEAT comes packaged with three predefined sets of FPI placements that cover many use-cases and show off its versatility. Table I lists the default placement rules and the corresponding tradeoff space sizes. Sets of rules are specified as C++ routines that accept the program state as input and return a single FPI as output. The first set applies the same FPI to every FLOP in the whole program (WP), regardless of the current function and the program state. For finer granularity, the user can register callbacks through NEAT that are executed whenever a function is entered or exited in the instrumented application. This allows more complex information to be collected about the program state, such as the call stack of the application. The second set of placement rules allows the user to specify a map of function names to FPIs and employs each FPI for the FLOPs in the corresponding currently-in-progress (CIP) function. Similarly, the third set of placement rules uses callbacks registered with NEAT to keep track of the function call stack (FCS) of the program. Instead of inspecting the current function, NEAT first checks the most recent function on the call stack. If no functions in the call stack match the names of those in the user-supplied map, a default implementation is used. To highlight the difference between CIP and FCS, we analyzed the structure of 7 functions in the radar benchmark, shown in Figure 3. Radar is an embedded real-time signal processing application used to find moving targets on the ground [35], [47]. It includes both a low-pass filter (LPF) and pulse compression (PC).
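A minimal sketch of what an FCS-style rule's lookup logic might look like, using names of our own invention (an FPI as a mantissa-width tag, select_fcs); NEAT's real rules are C++ routines registered with the tool, and this is only an illustration:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct FPI { int mantissa_bits; };  // illustrative stand-in for an FPI

// FCS-style placement: walk the call stack from the most recent frame
// and return the FPI registered for the first matching function; fall
// back to full double precision if nothing matches.
FPI select_fcs(const std::vector<std::string>& call_stack,
               const std::map<std::string, FPI>& rules) {
    for (auto it = call_stack.rbegin(); it != call_stack.rend(); ++it) {
        auto hit = rules.find(*it);
        if (hit != rules.end()) return hit->second;
    }
    return FPI{52};  // default implementation
}
```

Under such a rule, an FFT invoked from a low-pass filter and the same FFT invoked from pulse compression can receive different FPIs, which is exactly the distinction the radar example relies on.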
Both of these components use a Fast Fourier Transform (FFT) as part of their computation. With the CIP option, NEAT enforces the same FPI every time the FFT function is called. With the FCS option, NEAT distinguishes between the two occurrences of the FFT based on who made the function call. Therefore, NEAT uses one FPI for the FFT in the low-pass filter (LPF) stage and a second FPI for the FFT in the pulse compression (PC) stage. Empirically, we have found that the FCS and CIP results for most benchmarks do not differ, as the callers of FLOP-intensive functions are the same. Radar is an example where multiple functions make numerous calls to the same FLOP-intensive function that is accuracy sensitive.

C. Outputs
Fig. 3: FCS placement considers the FFT function call stack before selecting the approximate FPI.

There are five outputs from this tool: the output from the user application, a trace of the operands and result of every FLOP executed by the program, the estimated FPU energy of FLOPs in the execution of the program, the estimated energy of off-chip memory accesses of the program, and the number of FLOPs executed per function in the program. The trace of the FLOPs executed by the instrumented application is written to a file while the application is running. If FPIs are supplied to NEAT by the user, the result of each operation is printed after the operation is calculated with the chosen FPI. The operands and result of each operation are printed as hexadecimal numbers so that there is no confusion in rounding the floating point values. NEAT reports the total energy consumed in the FPU by using the energy per instruction (EPI) of different classes of floating point operations. We extracted the energy model of fadd, fmul, and fdiv for single and double precision operations from related work [54]. To this end, NEAT counts the number of bits manipulated in the operands and results of every FLOP in the instrumented program. Modifying the bit width of the exponent or sign of a floating point number changes the accuracy so significantly that the quality of output becomes unacceptable; hence, NEAT only focuses on mantissa bits. NEAT counts the number of zeroes in the binary representation of the floating point number, starting with the least significant bit, and then subtracts that count from the available mantissa bits of the floating type (24/53 bits in single/double precision, respectively) to calculate the number of manipulated bits. NEAT uses the EPI models and the number of manipulated bits per FLOP to estimate the total floating point energy consumed in the FPU. NEAT also records the total number of bits used in FLOPs during the execution of the program, output to a file after the termination of the application.
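The manipulated-bit count described above can be sketched as follows for double precision; the exact accounting NEAT uses may differ in details (e.g., the treatment of zero), so treat this as an assumption-laden illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Count "manipulated" significand bits of a double per the text:
// count trailing zeroes of the 52 stored mantissa bits (from the LSB)
// and subtract from the 53 available double-precision mantissa bits.
int manipulated_bits(double x) {
    uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    uint64_t mant = u & ((1ULL << 52) - 1);  // stored mantissa bits
    int zeros = 0;
    if (mant == 0) {
        zeros = 52;                          // all stored bits are zero
    } else {
        while ((mant & 1) == 0) { mant >>= 1; ++zeros; }
    }
    return 53 - zeros;
}
```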
Unlike the FPU energy estimation, this metric can be used as a platform-independent way to evaluate the approximate amount of power used by FLOPs when instrumenting a program. Currently, memory accounts for more than 25% of the energy spent in a large scale system. While, on average, each single precision FLOP takes 400 pJ to execute, a byte read from memory consumes 1.5 nJ [8]. Accordingly, NEAT counts the total number of bits transmitted to/from memory and then estimates the total memory access energy of the instrumented program [53]. This allows NEAT to yield a better energy estimate of the program on a real system. NEAT generates detailed statistics about the floating point instructions in the program. Users might use NEAT to profile the application before performing precision tuning to decide, first, whether NEAT is useful for their application and, second, what type of FPIs to use, which functions to target, and how to map them. In general, NEAT is a tool used at program design time. NEAT allows users to evaluate many points on the accuracy/energy tradeoff curve without having to implement all possible alternatives. After profiling with NEAT, users can then select a point and implement it with confidence that it will provide the desired behavior. Future work will explore additional machine learning techniques to configure floating point usage differently for different functions in the program [19], [39], [58], [59]. Another promising line of work is using a runtime system to dynamically tune floating point usage to maintain either energy or accuracy constraints in a changing workload [6], [26]–[28], [34], [38], [40], [41], [52], [57], [78], or possibly implementing this control scheme in hardware [67], [68], [83].

IV. NEAT INTERFACE AND RUNTIME
We now explain how the user can manage floating point precision scaling with the NEAT framework described in the previous section. We specify the information that NEAT expects to receive from users and then discuss the steps to execute NEAT's runtime engine. The NEAT procedure is as follows:

1. Profile the Program: The user runs the application. NEAT records the single and double precision instructions and the functions associated with them, and generates a detailed report in csv format.

2. Assign FP Optimization Target: Since applications usually use the same precision level across the source code, NEAT enhances either single or double precision instructions at a time. At this point, the user defines the directive for NEAT to target 32- or 64-bit FLOPs.

3. Develop FPIs: Users may define multiple FPIs to be explored by NEAT. NEAT supports FPIs developed in a number of different ways. An FPI can be created by truncating mantissa bits of the FLOP representation or by injecting direct approximation into the operands or results of floating point arithmetic operations. For example, approximating the inverse function [82] or the sin function using a neural network [23] is considered an FPI, too. An FPI can be applied to one or more floating point arithmetic instructions. For instance, one benchmark might include numerous accumulations but few divisions, so the user defines an FPI enforcing 8 precision bits for the add/sub instructions and 24 precision bits for the multiply instructions. The user develops an FPI by creating an instance of the FpImplementation virtual class. Furthermore, the user can customize the PerformOperation subroutine to modify the operands or results of a floating point instruction directly.
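Following the example of 8 precision bits for add/sub and 24 for multiply, a user FPI might look like the sketch below. The FpImplementation and PerformOperation names come from the text, but the exact signatures are our assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed shape of the FpImplementation virtual class.
struct FpImplementation {
    virtual double PerformOperation(char op, double a, double b) = 0;
    virtual ~FpImplementation() = default;
};

// An FPI keeping 8 mantissa bits for add/sub and 24 for multiply.
struct MixedPrecisionFPI : FpImplementation {
    static double trunc(double x, int bits) {  // zero low mantissa bits
        uint64_t u;
        std::memcpy(&u, &x, sizeof u);
        u &= ~((1ULL << (52 - bits)) - 1);
        std::memcpy(&x, &u, sizeof x);
        return x;
    }
    double PerformOperation(char op, double a, double b) override {
        switch (op) {
            case '+': return trunc(a + b, 8);
            case '-': return trunc(a - b, 8);
            case '*': return trunc(a * b, 24);
            default:  return a / b;            // divide left exact here
        }
    }
};
```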
4. Register FPI Placement Rules and Functions: NEAT expects to receive a mapping between FPIs and when to enforce them. For the WP approach, the user only needs to instantiate the RegisterFPSelector class with the desired FPI as the argument. For the per-function rules, NEAT by default considers the top 10 FLOP-intensive functions. The user might pre-profile the program to detect and select any number of functions. The user then provides a mapping between functions and FPIs by defining a map of <functionName, FPI*> pairs. Next, the user combines the map with one of the pre-packaged placement rules (CIP or FCS). This mapping is also referred to as a configuration. Finally, the user creates an instance of the RegisterFPSelector class and passes the map and placement strategy as the input arguments. At runtime, the user passes the registered instance name via the fp_selector_name command line flag to NEAT. This interface is simple, but provides quite a flexible approach to replacing standard floating point operations with approximate versions. For example, the user can provide several maps, and their instantiation of the selector class can look at the current program context to select the desired map. This allows NEAT to explore many different options for a single function within a program. For example, users can specify that the map should depend on the function call stack so that different FP implementations will be used for the same function based on where it was called from.
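The <functionName, FPI*> configuration in step 4 can be sketched as below; the function names and the lookup helper are illustrative, and in NEAT the map would be handed to the selector class together with a CIP or FCS rule:

```cpp
#include <cassert>
#include <map>
#include <string>

struct FPI { int mantissa_bits; };  // illustrative FPI object

using Config = std::map<std::string, const FPI*>;  // <functionName, FPI*>

// Resolve the FPI for a function under a CIP-style configuration,
// falling back to a default implementation when the function is not
// one of the registered FLOP-intensive candidates.
const FPI* lookup(const Config& cfg, const std::string& fn,
                  const FPI* fallback) {
    auto it = cfg.find(fn);
    return it != cfg.end() ? it->second : fallback;
}
```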
Activate Exploration Scripts : If CIP or FCS schemais selected, the tradeoff space of FPI to function mappings(configurations) becomes too huge to explore exhaustively.Hence, NEAT uses the NSGA-II genetic exploration techniqueto search for energy efficient configurations [18]. If the userdesires to enhance the exploration phase of the configurationspace further, NEAT provides an interface through the com-mand line flags to manually modify the tuning parameters ofNSGA-II such as population size, number of generations, orconvergence threshold.6.
6. Analyze the Output: NEAT reports detailed energy and performance data per configuration. Moreover, a Python script is provided to generate scatter plots of the tradeoff space with the lower convex hulls. At the completion of these steps, the user has information about the most appropriate precision level for each individual function or for the whole program.
V. EXPERIMENTAL RESULTS
We evaluate the efficacy and flexibility of NEAT for floating point approximation analysis. In general, NEAT generates useful information about the precision tuning of applications, which can be used at the design stage of software or conveyed to other layers of the system, such as compilers or hardware (e.g., building a set of reduced-precision FPUs). Section V-B inspects NEAT's floating point profiling of the applications. The primary challenge of automatic precision tuning is creating approximation configurations. We examine NEAT's flexibility to produce customized FPI definitions in Sections V-C and V-D. Moreover, the main mechanism of NEAT, programmable placement rules, is investigated in Sections V-E and V-F. To navigate the immense configuration space, NEAT comes with a tunable genetic exploration algorithm, which is used in the sections above (V-B through V-C). To ensure robustness of NEAT on unseen data, we evaluate the difference between predicted accuracy and energy on training and test data, demonstrating that NEAT finds configurations that are robust across test inputs that were not seen in training (Section V-G). Finally, in Section V-H, we evaluate NEAT's general applicability to finding appropriate reduced-precision floating point configurations by evaluating it on a problem that has recently received tremendous attention from human experts: trading accuracy for reduced precision in neural network inference. We find that NEAT can use the whole-program rule to automatically find a single floating point precision similar to those reported by human experts. Further, we find that by using different floating point implementations for different layers, NEAT produces even greater energy savings for the same accuracy.
A. Evaluation platform
We evaluate NEAT by exploring the tradeoff spaces of the placement rules for a variety of benchmarks. Table II lists the applications from the Parsec 3.0 [7] and Rodinia 3.1 [13] suites with the configuration space size (default precision optimization target) and the training and test inputs for each benchmark. These benchmarks cover domains from finance to image processing.

To create FPIs, we use bit truncation. For single precision floating point numbers (the float type in C), we have 24 different FPIs corresponding to the mantissa bits. Similarly, we created 53 FPIs for double precision floating point numbers. For the whole-program approach, the size of the tradeoff space is the total number of possible FPIs, i.e., 24 or 53 points. For the per-function approaches, we consider the top 10 functions with the most FLOPs when enforcing the FP rules, so each of the top 10 functions may use a different FPI. In each experiment, at most 400 configurations in the tradeoff space (a small fraction of all possible configurations) have been evaluated through NEAT's genetic algorithm.

B. Floating Point Precision Distribution
NEAT can be used to analyze the type, distribution, and intensity of the FLOPs in a program. Figure 4 depicts the ratio of single and double precision FLOPs for each benchmark. Most of the benchmarks hold the same precision level across the source for correctness and portability. For example, Bodytrack, Heartwall, and Kmeans are all implemented with the float type, while Canneal mainly uses double. However, some benchmarks, such as Ferret, Particlefilter, and Srad, contain a mixture of both precision levels due to included external libraries. In this case, users might choose which optimization target to enforce. Specifying the right target opens up further opportunities for additional energy savings.

Fig. 4: Floating Point Type Breakdown for Benchmarks. While most benchmarks have a dominant FP type, some carry both.
C. FPU Energy Saving
NEAT estimates the FPU energy consumed by the FLOPs. We compare two rules: whole program (WP) and currently-in-progress (CIP). As a reminder, WP uses one floating point implementation through the entirety of the program, while CIP is free to choose a separate implementation for each of the top 10 functions (by FLOP count) in the program. For Particlefilter, we set the optimization target to double precision, as most of its FLOPs are double. For the rest of the benchmarks, we apply the single precision optimization.

We consider the top 10 FLOP-intensive functions for the CIP placement. One might ask how many of the FLOPs are included in the top 10 functions. For all benchmarks, at least 98% of FLOPs come from the top 10 functions, so NEAT covers almost all of the FLOPs in the program.

TABLE II: Benchmarks Used for Evaluation.

Benchmark        Training inputs                     Test inputs
Blackscholes     10 lists with 100K initial prices   30 lists with 100K initial prices
Bodytrack        Sequence of 5 frames                Sequence of 20 frames
Fluidanimate     5 fluids with 15K+ particles        15 fluids with 15K+ particles
Ferret           5 databases of 16 images            15 databases of 16 images
Heartwall        Sequence of 15 frames               Sequence of 60 frames
Kmeans           10 vectors with 512 data points     30 vectors with 512 data points
Particlefilter   Sequence of 32 frames               Sequence of 128 frames
Radar            Sequence of 10 frames               Sequence of 40 frames

Figure 5 illustrates the lower convex hull of normalized FPU energy and the error rate (also referred to as accuracy loss). The error rate metric is the relative error of a configuration compared against the highest-quality configuration (the baseline), where no approximation happens. The horizontal axis is the error rate, while the Normalized Energy Consumption (NEC) relative to the baseline is shown on the y-axis. The lower the curve, the more efficient the configurations found, i.e., the higher the energy efficiency. Since users generally do not care about extremely inaccurate outputs, only error rates less than 20% are shown in the subfigures. The results show that if we assign multiple FPIs at the function level, NEAT retrieves more energy-efficient configurations that are not reachable with a single FPI for the whole program. This result further demonstrates NEAT's value in design space exploration.

Fig. 5: Lower Convex Hulls of FPU Energy and Error Rates for the WP and CIP. Values are normalized to the baseline.

With minimal error in the final output of the benchmark, NEAT reduces the FPU energy by up to 60%. For some applications, such as Blackscholes, Fluidanimate, and Particlefilter, the FPU energy savings are more considerable. These benchmarks have fewer than 10 FLOP-intensive functions. Therefore, first, CIP covers all the FLOPs in the program; second, since the tradeoff space is relatively smaller, NSGA-II searches a larger portion of it in the same exploration time.

For the Fluidanimate and Ferret benchmarks, there are only three and two configurations, respectively, where WP outperforms CIP. The reason is that NEAT's genetic algorithm, being a heuristic, fails to explore those specific configurations. The same pattern can be seen for the Radar benchmark, where CIP does not dominate the whole-program approach.

The Heartwall benchmark has only two FLOP functions, and they are very sensitive to bit-width adjustment; any modification leads to more than 20% error. Consequently, NEAT is not able to decrease FPU energy below 71% of the baseline with a reasonable error rate. The opposite scenario occurs for the Particlefilter application, where the major FLOP functions do not considerably impact the quality of the output; hence NEAT aggressively reduces the FPU energy without causing much error.

For a more detailed comparison, we re-illustrate a quantized representation of the previous plot. Figure 6 displays how the FPU energy savings improve as the tolerated error threshold increases. Higher bars indicate more energy savings. By harmonic mean, applying the CIP versus the WP approach results in 7%, 12%, and 13% more energy savings at 1%, 5%, and 10% error rate, respectively.

A steeper slope in the lower convex hull curves in the subplots of Figure 5 translates into higher bars in Figure 6 as the error threshold increases. The Blackscholes and Particlefilter benchmarks demonstrate such behavior. On the contrary, increasing the error threshold in the Particlefilter and Radar applications does not inflate the FPU energy savings similarly.

From these graphs, we draw two conclusions. First, specifying FPI placement at a finer granularity results in more efficient FPI-to-function mappings. In other words, per-function rules use less energy at the same error than a single FPI for the whole application. This type of insight is really only achievable with an automated system like NEAT. Second, if higher error rates are allowed, NEAT achieves higher FPU energy efficiency. Thus, NEAT can navigate the whole tradeoff space and give users a range of options depending on the tolerable error rate.

Fig. 6: FPU Energy Savings at Different Error Rates, normalized to the baseline. The higher the bars, the more energy efficient.
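Concretely, the two metrics plotted in this section can be computed as follows. This is a simplified sketch assuming a scalar output-quality measure; NEAT's actual quality metrics are application-specific.

```python
def error_rate(approx_output, baseline_output):
    """Relative error of an approximate configuration's output against the
    non-approximated baseline (the horizontal axis of the convex hull plots)."""
    return abs(approx_output - baseline_output) / abs(baseline_output)

def nec(config_energy, baseline_energy):
    """Normalized Energy Consumption: a configuration's FPU energy as a
    percentage of the baseline energy (the vertical axis)."""
    return 100.0 * config_energy / baseline_energy
```

A configuration with output 99.0 against a baseline of 100.0 thus sits at a 1% error rate, and one consuming 40 units against a 100-unit baseline sits at 40% NEC.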
D. Memory Instructions
Main memory (DRAM) consumes as much as half of the total system power in a computer today, due to the increasing demand for memory capacity and bandwidth [53]. Hence, reducing memory traffic translates directly into substantial energy savings. NEAT estimates memory energy by accounting only for accesses to/from off-chip memory, keeping track of memory operations such as MOVSS and MOVSD. Figure 7 depicts memory access energy for a range of error rates for both the whole-program (WP) and per-function (CIP) approaches across the benchmarks. As before, higher bars indicate higher energy efficiency. Values are normalized to the non-approximated version of the application, which acts as the baseline. On harmonic mean, increasing the error rate from 1% to 10% results in 3.2-10.5% less energy consumption.

If the FLOP functions are memory intensive, reducing the precision bits results in lower memory bandwidth and, consequently, more energy savings. That is why benchmarks such as Bodytrack, Fluidanimate, and Radar reduce memory energy by more than 60%. In the rest of the benchmarks, the FLOP functions were solely compute intensive.

To conclude the experiments above, we use the WP rule as representative of prior work [79], which tries to find a single best approximation for the whole application. The per-function rules of NEAT demonstrate the ability of the placement rules to let programmers explore a richer set of tradeoffs without having to create whole new implementations of existing program functionality.
E. Flexible Precision Level
In previous sections, we observed that some benchmarks have a mixture of both float and double FLOPs. To choose the right optimization target, we compare the energy and accuracy of selected benchmarks under both single and double precision optimization targets. The FPI-to-function mapping is CIP in this experiment.

Figure 8 shows the normalized energy savings for both single and double precision optimization targets. As expected, choosing the optimization target to match the FP type with the larger share of the FLOP distribution yields higher energy savings. This observation is easily justified by looking back at Section V-B. Both Canneal and Particlefilter contain more 64-bit than 32-bit FLOPs. Thus, double precision as the NEAT directive is the right choice to achieve substantially higher energy efficiency.

Ferret requires special attention, as it is not obvious how to choose the optimization target based on the FLOP distribution ratio: it has almost equal amounts of float and double FLOPs. At the 10% error rate, NEAT saves up to 92% of the FPU energy corresponding to double instructions, while only 38% savings are available if we consider only float instructions. There are two reasons for the discrepancy. One is that double FLOPs generally yield more precise output, but they use more precision bits in return. Thus, NEAT has more freedom to cut unnecessary floating point bits without losing much accuracy, because the double baseline is already more accurate than the float one. Second, the double functions in Ferret are not accuracy sensitive, meaning that enforcing approximation on these functions does not excessively change the quality of the output. This is a good example of how NEAT determines the most efficient configurations for any benchmark regardless of how its floating point precision is specified in the source (or binary).
F. Function Call Stack
As we mentioned in Section III-B4, if we map an FPI to a function, the quality of the output can change depending on the caller. While on most benchmarks the CIP and FCS approaches produce the same result, on Radar they differ. Hence, we examine the impact of the caller of the FFT function on the energy and accuracy of this benchmark. Figure 9 illustrates the FPU energy savings normalized to the baseline for the CIP and FCS placement rules. FCS was able to explore a handful of additional, more optimal configurations, resulting in 7% more energy savings at 1% accuracy loss compared to CIP, without extra runtime overhead. At the 5% and 10% error rates, the additional energy savings are 4% and 2%, respectively.
G. Sensitivity to Input Changes
Since we employ a heuristic exploration technique, we ensure that NEAT produces statistically sound results by evaluating each application with multiple inputs divided into training and test sets. We take the median of normalized accuracy loss and FPU energy for each set of inputs, compute a linear least-squares fit of training data to test data, and compute the correlation coefficient of each fit. Higher correlation coefficients imply less input sensitivity; i.e., the behavior of configurations found on training data is a good predictor of test behavior.

Fig. 7: Memory Transfer Energy Savings at Different Error Rates, normalized to the baseline.

Fig. 8: FPU Energy Savings with Different Optimization Targets for NEAT.

Fig. 9: Comparison of CIP and FCS for the FPU Energy Savings in Radar.

Table III shows the correlation coefficients (R-values) for accuracy loss and FPU energy for each benchmark. Due to the heuristic nature of the exploration technique, it is possible to select configurations that perform differently on unseen data. For instance, Kmeans clearly stresses the difference between training and test inputs. Nevertheless, all benchmarks have uniformly high R-values on accuracy loss and FPU energy, at least 0.932. This demonstrates that NEAT's search techniques are robust and that the accuracy and energy results they predict on training inputs hold up well for test inputs. The robustness of the energy results is, perhaps, not surprising, as those should be highly predictable (simpler FLOP implementations predictably lower energy). The robustness of the accuracy results is more surprising, as it is not intuitively obvious that floating point implementations that work well for one set of inputs would also work for another.

TABLE III: Correlation Coefficients for Error Rates and FPU Energy.

Benchmark        Error Rates   FPU Energy
Blackscholes     0.999         0.999
Bodytrack        0.958         0.989
Fluidanimate     0.995         1.0
Ferret           0.973         1.0
Heartwall        0.999         1.0
Kmeans           0.932         1.0
Particlefilter   0.991         1.0
Radar            0.992         1.0

H. Neural Network Integration
The energy and resource constraints in neural networks create an intriguing challenge. Recently, a growing body of literature has tried to sacrifice the precision of training and inference for lower runtime and energy consumption [16]. NEAT can be used to identify the FLOP-intensive sections of the network and then provide the minimum precision required for the computation without considerable reductions in model accuracy. This tradeoff (small accuracy loss for large energy savings) is well known, and we perform this study not to claim a new result, but to demonstrate that NEAT's automated approach can produce the same types of savings for this problem that have been produced by human domain experts. We also believe that using NEAT's programmable replacement rules to create DNNs with differing precision throughout the network is a new contribution that would (due to the size of the search space) be quite difficult even for human experts.

We use hand-written digit classification with the MNIST dataset, which includes 60K images and 10K labels. For the CNN, we consider the LeNet-5 model with the architecture summary listed in Table IV. The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully-connected layers, and finally a softmax classifier [48].

TABLE IV: LeNet-5 Architecture Summary.

Layer                      Feature Map   Size    Kernel Size   Activation
Input: Image               1             32x32   -             -
1 Convolutional(1)         6             28x28   5x5           tanh
2 Average Pooling(1)       6             14x14   2x2           tanh
3 Convolutional(2)         16            10x10   5x5           tanh
4 Average Pooling(2)       16            5x5     2x2           tanh
5 Convolutional(3)         120           1x1     5x5           tanh
6 Fully Connected          -             84      -             tanh
Output: Fully Connected    -             10      -             softmax

Fig. 10: 32-bit FLOP breakdown per layer in the digit recognition CNN.

Figure 10 shows the FLOP breakdown for CNN training with a minibatch size of 4, a learning rate of 1, and 30 epochs. We first measured how many of the operations are floating point to determine the applicability of NEAT. For inference, more than 73% of operations were FLOPs, which makes NEAT clearly beneficial to apply. Next, we analyze the FLOP distribution between the layers. We observe that more than 69% of floating point computation happens in the convolutional layers, where they extract interesting features from an image. Activation phases and internal compute functions are responsible for the majority of the remainder. Finally, we show that the number of FLOPs decreases in the latter layers of the CNN, since the size of the data transferred between layers decreases as well.

To apply the FPI-to-function placement rules to a CNN, there are two options. The first is to apply one FPI per layer category (which we refer to as PLC), meaning that, for example, all convolutional layers use the same precision level. The second approach is to apply a different FPI per layer instance (PLI); in this case the first and third layers might use distinct precision levels even though they are both convolutional layers.

Picking the right FPI placement policy is not trivial for CNNs. Unlike the WP versus CIP rules, where one has a significantly larger tradeoff space, the PLC and PLI tradeoff spaces are both large enough that heuristic exploration is required. Thus, either of these rules could outperform the other within the same exploration time. With PLC, NEAT explores a larger portion of the tradeoff space, leading to locating efficient configurations more quickly. On the other hand, PLI examines FPI mappings at a finer granularity and hence has a higher chance of discovering more optimal configurations.

Fig. 11: Comparison of PLC and PLI replacements for the CNN. (a) Lower Convex Hull Curves of Energy and Error Rate. (b) Quantized Energy Savings at Different Error Rates.

Figure 11a illustrates the lower convex hull of normalized FPU energy and accuracy for both approaches. The accuracy loss is the error difference relative to the baseline configuration without approximation. The baseline recognition accuracy in the inference stage is 99.04% with a fully accurate trained model. Each point in the tradeoff space represents an FPI-to-layer (category or instance) mapping. Points closer to the origin indicate higher energy efficiency.

As can be seen, the lower convex hull of PLI (finer granularity) outperforms the PLC curve for error rates of less than 20%. The quantized representation of the FPU energy versus error rate tradeoff space is shown in Figure 11b for both PLC and PLI placements. Similar to the previous evaluation, finer granularity results in higher energy efficiency. With 1%, 5%, and 10% accuracy loss, NEAT with PLI placements achieves 6%, 4%, and 3% more energy savings compared to the default configuration.

NEAT's programmable placement rules allow developers to analyze various precision levels for different components of their neural networks without requiring them to instrument the source code or re-design the architecture.

Since the FPIs are based on bit truncation of the mantissa, using the above analysis, NEAT finds the required precision bits for each layer in the LeNet-5 network under accuracy loss constraints. By default, each layer is implemented with single precision floating point numbers (24 mantissa bits). Table V shows the mantissa bits required for every layer in the network. These precisions could later be integrated with the MPFR library in C [30] or the mpmath library in Python [42].

TABLE V: Mantissa Bits for Single Precision FP Recommended by NEAT for Each Layer at Different Error Rates.

Layers / Error Rates   Conv 1   Avg Pool 1   Conv 2   Avg Pool 2   Conv 3   FC   Tanh   Internal Func.
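The difference between the two placement policies can be sketched as a lookup rule. The layer names and mantissa widths below are placeholders, not the values NEAT actually recommends in Table V.

```python
# PLC: one FPI per layer *category* (all conv layers share one precision).
plc_map = {"conv": 12, "pool": 8, "fc": 16}
# PLI: one FPI per layer *instance* (conv1 and conv3 may differ).
pli_map = {"conv1": 10, "pool1": 8, "conv2": 12,
           "pool2": 8, "conv3": 14, "fc1": 16, "fc2": 18}

def fpi_for_layer(layer_name, rule):
    """Return the mantissa width assigned to a layer under PLC or PLI."""
    if rule == "PLI":
        return pli_map[layer_name]
    # PLC: strip the trailing instance index to recover the category.
    category = layer_name.rstrip("0123456789")
    return plc_map[category]
```

Under PLC, `conv1` and `conv3` both resolve to the shared "conv" width, whereas PLI can assign each of them its own precision, which is exactly the finer granularity that yields the additional savings in Figure 11.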
VI. CONCLUSION
In this work, we proposed NEAT, a tool for automated precision tuning of floating point applications. NEAT provides mechanisms for programmers to explore the tradeoff space of combinations of approximate floating point implementations without extensive source code refactoring. We evaluated NEAT on various benchmarks with whole-program and per-function placement rules. We found that at finer granularity, up to 54% and 74% energy savings are available in the FPU and in memory transmissions, respectively. We empirically show that NEAT performs robustly on unseen inputs as well. We also performed a case study on a digit recognition CNN to find the optimal precision level required for each layer.
Acknowledgments:
This research is supported by NSF (CCF-1439156, CNS-1526304, CCF-1823032, CNS-1764039). Additional support comes from the Proteus project under the DARPA BRASS program and a DOE Early Career award.
EFERENCES[1] C. Alvarez, J. Corbal, and M. Valero, “Fuzzy memoization for floating-point multimedia applications,”
IEEE Transactions on Computers ,vol. 54, no. 7, pp. 922–927, July 2005.[2] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom,U.-M. O’Reilly, and S. Amarasinghe, “Opentuner: An extensibleframework for program autotuning,” in
Proceedings of the 23rdInternational Conference on Parallel Architectures and Compilation ,ser. PACT ’14. New York, NY, USA: ACM, 2014, pp. 303–316.[Online]. Available: http://doi.acm.org/10.1145/2628071.2628092[3] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Ama-rasinghe, “Language and compiler support for auto-tuning variable-accuracy algorithms,” in
Proceedings of the 9th Annual IEEE/ACMInternational Symposium on Code Generation and Optimization . IEEEComputer Society, 2011, pp. 85–96.[4] W. Baek and T. M. Chilimbi, “Green: A framework for supportingenergy-conscious programming using controlled approximation,”
SIGPLAN Not. , vol. 45, no. 6, pp. 198–209, Jun. 2010. [Online].Available: http://doi.acm.org/10.1145/1809028.1806620[5] J. Balkind, M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, A. Lavrov,M. Shahrad, A. Fuchs, S. Payne, X. Liang et al. , “Openpiton: An opensource manycore research framework,” in
ACM SIGARCH ComputerArchitecture News , vol. 44, no. 2. ACM, 2016, pp. 217–232.[6] S. Barati, F. A. Bartha, S. Biswas, R. Cartwright, A. Duracz, D. S.Fussell, H. Hoffmann, C. Imes, J. E. Miller, N. Mishra, Arvind,D. Nguyen, K. V. Palem, Y. Pei, K. Pingali, R. Sai, A. Wright, Y. Yang,and S. Zhang, “Proteus: Language and runtime support for self-adaptivesoftware development,”
IEEE Software , vol. 36, no. 2, pp. 73–82, 2019.[Online]. Available: https://doi.org/10.1109/MS.2018.2884864[7] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation,Princeton University, January 2011.[8] S. Borkar, “The exascale challange.” Keynote Talk, Parallel Architec-tures and Compilation Techniques (PACT), Galveston Island, Texas,USA., 10 2011. [9] J. Bornholt, T. Mytkowicz, and K. S. McKinley, “Uncertain¡ t¿: A first-order type for uncertain data,”
ACM SIGPLAN Notices , vol. 49, no. 4,pp. 51–66, 2014.[10] A. Boutros, S. Yazdanshenas, and V. Betz, “Embracing diversity:Enhanced dsp blocks for low-precision deep learning on fpgas,” in . IEEE, 2018, pp. 35–357.[11] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V.Palem, and B. Seshasayee, “Ultra-efficient (embedded) soc architecturesbased on probabilistic cmos (pcmos) technology,” in
Proceedings of theConference on Design, Automation and Test in Europe: Proceedings ,ser. DATE ’06. 3001 Leuven, Belgium, Belgium: European Designand Automation Association, 2006, pp. 1110–1115. [Online]. Available:http://dl.acm.org.proxy.uchicago.edu/citation.cfm?id=1131481.1131790[12] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consump-tion in digital cmos circuits,”
Proceedings of the IEEE , vol. 83, no. 4,pp. 498–523, Apr 1995.[13] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, andK. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,”in
Workload Characterization, 2009. IISWC 2009. IEEE InternationalSymposium on , Oct 2009, pp. 44–54.[14] V. K. Chippa, S. Venkataramani, S. T. Chakradhar, K. Roy, andA. Raghunathan, “Approximate computing: An integrated hardwareapproach,” in , Nov 2013, pp. 111–117.[15] M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Trainingdeep neural networks with binary weights during propagations,”
CoRR ,vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363[16] D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha,K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas et al. ,“Mixed precision training of convolutional neural networks using integeroperations,” arXiv preprint arXiv:1802.00930 , 2018.[17] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: Anarchitectural framework for software recovery of hardware faults,” in
Proceedings of the 37th Annual International Symposium on ComputerArchitecture , ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp.497–508. [Online]. Available: http://doi.acm.org.proxy.uchicago.edu/10.1145/1815961.1816026[18] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitistmultiobjective genetic algorithm: Nsga-ii,”
IEEE Transactions on Evo-lutionary Computation , vol. 6, no. 2, pp. 182–197, Apr 2002.[19] Y. Ding, N. Mishra, and H. Hoffmann, “Generative and multi-phaselearning for computer systems optimization,” in
Proceedings of the 46thInternational Symposium on Computer Architecture , ser. ISCA ’19.New York, NY, USA: Association for Computing Machinery, 2019, p.3952. [Online]. Available: https://doi.org/10.1145/3307650.3326633[20] K. Du, P. Varman, and K. Mohanram, “High performance reliablevariable latency carry select addition,” in , March 2012, pp. 1257–1262.[21] Z. Du, K. Palem, A. Lingamneni, O. Temam, Y. Chen, and C. Wu,“Leveraging the error resilience of machine-learning applications fordesigning highly energy efficient accelerators,” in , Jan 2014, pp.201–206.[22] P. D. D¨uben, J. Joven, A. Lingamneni, H. McNamara, G. De Micheli,K. V. Palem, and T. N. Palmer, “On the use of inexact, prunedhardware in atmospheric modelling,”
Philosophical Transactionsof the Royal Society of London A: Mathematical, Physical andEngineering Sciences , vol. 372, no. 2018, 2014. [Online]. Available:http://rsta.royalsocietypublishing.org/content/372/2018/20130276[23] S. Eldridge, F. Raudies, D. Zou, and A. Joshi, “Neural network-basedaccelerators for transcendental function approximation,” in
Proceedingsof the 24th edition of the great lakes symposium on VLSI . ACM, 2014,pp. 169–174.[24] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecturesupport for disciplined approximate programming,” in
Proceedings ofthe Seventeenth International Conference on Architectural Support forProgramming Languages and Operating Systems , ser. ASPLOS XVII.New York, NY, USA: ACM, 2012, pp. 301–312. [Online]. Available:http://doi.acm.org.proxy.uchicago.edu/10.1145/2150976.2151008[25] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neuralacceleration for general-purpose approximate programs,” in
Proceedingsof the 2012 45th Annual IEEE/ACM International Symposium n Microarchitecture , ser. MICRO-45. Washington, DC, USA:IEEE Computer Society, 2012, pp. 449–460. [Online]. Available:http://dx.doi.org.proxy.uchicago.edu/10.1109/MICRO.2012.48[26] A. Farrell and H. Hoffmann, “MEANTIME: achieving both minimalenergy and timeliness with approximate computing,” in , 2016, pp. 421–435.[27] A. Filieri, H. Hoffmann, and M. Maggio, “Automated multi-objectivecontrol for self-adaptive software design,” in Proceedings of the 201510th Joint Meeting on Foundations of Software Engineering, ESEC/FSE2015, Bergamo, Italy, August 30 - September 4, 2015 , E. D. Nitto,M. Harman, and P. Heymans, Eds. ACM, 2015, pp. 13–24. [Online].Available: https://doi.org/10.1145/2786805.2786833[28] A. Filieri, M. Maggio, K. Angelopoulos, N. D’Ippolito,I. Gerostathopoulos, A. B. Hempel, H. Hoffmann, P. Jamshidi,E. Kalyvianaki, C. Klein, F. Krikava, S. Misailovic, A. V.Papadopoulos, S. Ray, A. M. Sharifloo, S. Shevtsov, M. Ujma, andT. Vogel, “Control strategies for self-adaptive software systems,”
ACMTrans. Auton. Adapt. Syst. , vol. 11, no. 4, pp. 24:1–24:31, 2017.[Online]. Available: https://doi.org/10.1145/3024188[29] B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan,J. Choi, S. Mueller, A. Agrawal, T. Babinsky, N. Cao, C. Chen,P. Chuang, T. Fox, G. Gristede, M. Guillorn, H. Haynie, M. Klaiber,D. Lee, S. Lo, G. Maier, M. Scheuermann, S. Venkataramani,C. Vezyrtzis, N. Wang, F. Yee, C. Zhou, P. Lu, B. Curran, L. Chang, andK. Gopalakrishnan, “A scalable multi- teraops deep learning processorcore for ai trainina and inference,” in , June 2018, pp. 35–36.[30] L. Fousse, G. Hanrot, V. Lef`evre, P. P´elissier, and P. Zimmermann,“Mpfr: A multiple-precision binary floating-point library with cor-rect rounding,”
ACM Transactions on Mathematical Software (TOMS) ,vol. 33, no. 2, p. 13, 2007.[31] N. Gajjar, N. M. Devahsrayee, and K. S. Dasgupta, “Scalable leon3 based soc for multiple floating point operations,” in , Dec 2011, pp. 1–3.[32] B. Grigorian, N. Farahpour, and G. Reinman, “Brainiac: Bringingreliable accuracy into neurally-implemented approximate computing,”in , Feb 2015, pp. 615–626.[33] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, andW. J. Dally, “Eie: Efficient inference engine on compressed deep neuralnetwork,” in , June 2016, pp. 243–254.[34] H. Hoffmann, “Coadapt: Predictable behavior for accuracy-aware appli-cations running on power-aware systems,” in ,2014, pp. 223–232.[35] H. Hoffmann, A. Agarwal, and S. Devadas, “Selecting spatiotemporalpatterns for development of parallel applications,”
IEEE Trans. Parallel Distributed Syst., vol. 23, no. 10, pp. 1970–1982, 2012. [Online]. Available: https://doi.org/10.1109/TPDS.2011.298
[36] H. Hoffmann, S. Misailovic, S. Sidiroglou, A. Agarwal, and M. Rinard, “Using code perforation to improve performance, reduce energy consumption, and respond to failures,” no. MIT-CSAIL-TR-2009-042, 09 2009.
[37] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard, “Dynamic knobs for responsive power-aware computing,” in
Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 199–212. [Online]. Available: http://doi.acm.org/10.1145/1950365.1950390
[38] C. Imes and H. Hoffmann, “Bard: A unified framework for managing soft timing and power constraints,” in
International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2016, Agios Konstantinos, Samos Island, Greece, July 17-21, 2016, W. A. Najjar and A. Gerstlauer, Eds. IEEE, 2016, pp. 31–38. [Online]. Available: https://doi.org/10.1109/SAMOS.2016.7818328
[39] C. Imes, S. A. Hofmeyr, and H. Hoffmann, “Energy-efficient application resource scheduling using machine learning classifiers,” in
Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, Eugene, OR, USA, August 13-16, 2018. ACM, 2018, pp. 45:1–45:11. [Online]. Available: https://doi.org/10.1145/3225058.3225088
[40] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann, “POET: a portable approach to minimizing energy under soft real-time constraints,” in . IEEE Computer Society, 2015, pp. 75–86. [Online]. Available: https://doi.org/10.1109/RTAS.2015.7108419
[41] C. Imes, H. Zhang, K. Zhao, and H. Hoffmann, “CoPPer: Soft real-time application performance using hardware power capping,” in . IEEE, 2019, pp. 31–41. [Online]. Available: https://doi.org/10.1109/ICAC.2019.00015
[42] F. Johansson et al., mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.14), February 2010, http://code.google.com/p/mpmath/.
[43] A. Kanduri, M. H. Haghbayan, A. M. Rahmani, P. Liljeberg, A. Jantsch, N. Dutt, and H. Tenhunen, “Approximation knob: Power capping meets energy efficiency,” in , Nov 2016, pp. 1–8.
[44] K. Y. Kyaw, W. L. Goh, and K. S. Yeo, “Low-power high-speed multiplier for error-tolerant application,” in , Dec 2010, pp. 1–4.
[45] U. Köster, T. J. Webb, X. Wang, M. Nassar, A. K. Bansal, W. H. Constable, O. H. Elibol, S. Gray, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss, R. J. Pai, and N. Rao, “Flexpoint: An adaptive numerical format for efficient training of deep neural networks,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. USA: Curran Associates Inc., 2017, pp. 1740–1750. [Online]. Available: http://dl.acm.org/citation.cfm?id=3294771.3294937
[46] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power with an underdesigned multiplier architecture,” in , Jan 2011, pp. 346–351.
[47] J. Lebak, J. Kepner, H. Hoffmann, and E. Rutledge, “Parallel VSIPL++: An open standard software library for high-performance parallel signal processing,”
Proceedings of the IEEE, vol. 93, no. 2, pp. 313–330, Feb 2005.
[48] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with gradient-based learning,” in
Shape, contour and grouping in computer vision. Springer, 1999, pp. 319–345.
[49] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, “Designing energy-efficient arithmetic operators using inexact computing,”
Journal of Low Power Electronics, vol. 9, no. 1, pp. 141–153, 2013.
[50] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn, “Flikker: Saving DRAM refresh-power through critical data partitioning,” in
Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 213–224. [Online]. Available: http://doi.acm.org/10.1145/1950365.1950391
[51] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in
Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’05. New York, NY, USA: ACM, 2005, pp. 190–200. [Online]. Available: http://doi.acm.org/10.1145/1065010.1065034
[52] M. Maggio, A. V. Papadopoulos, A. Filieri, and H. Hoffmann, “Automated control of multiple software goals using multiple actuators,” in
Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, 2017, pp. 373–384. [Online]. Available: https://doi.org/10.1145/3106237.3106247
[53] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, “Towards energy-proportional datacenter memory with mobile DRAM,” in , June 2012, pp. 37–48.
[54] M. McKeown, A. Lavrov, M. Shahrad, P. J. Jackson, Y. Fu, J. Balkind, T. M. Nguyen, K. Lim, Y. Zhou, and D. Wentzlaff, “Power and energy characterization of an open source 25-core manycore processor,” in , Feb 2018, pp. 762–775.
[55] S. Misailovic, M. Carbin, S. Achour, Z. Qi, and M. C. Rinard, “Chisel: Reliability- and accuracy-aware optimization of approximate computational kernels,” in
Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, ser. OOPSLA ’14. New York, NY, USA: ACM, 2014, pp. 309–328. [Online]. Available: http://doi.acm.org/10.1145/2660193.2660231
[56] S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard, Quality of Service Profiling. New York, NY, USA: Association for Computing Machinery, 2010, pp. 25–34. [Online]. Available: https://doi.org/10.1145/1806799.1806808
[57] N. Mishra, C. Imes, J. D. Lafferty, and H. Hoffmann, “CALOREE: learning control for predictable latency and low energy,” in
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-28, 2018, X. Shen, J. Tuck, R. Bianchini, and V. Sarkar, Eds. ACM, 2018, pp. 184–198. [Online]. Available: https://doi.org/10.1145/3173162.3173184
[58] N. Mishra, J. D. Lafferty, and H. Hoffmann, “ESP: A machine learning approach to predicting application interference,” in , X. Wang, C. Stewart, and H. Lei, Eds. IEEE Computer Society, 2017, pp. 125–134. [Online]. Available: https://doi.org/10.1109/ICAC.2017.29
[59] N. Mishra, H. Zhang, J. D. Lafferty, and H. Hoffmann, “A probabilistic graphical model-based approach for minimizing energy under performance constraints,” in
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, Istanbul, Turkey, March 14-18, 2015, Ö. Öztürk, K. Ebcioglu, and S. Dwarkadas, Eds. ACM, 2015, pp. 267–281. [Online]. Available: https://doi.org/10.1145/2694344.2694373
[60] T. Moreau, A. Sampson, and L. Ceze, “Approximate computing: Making mobile systems more efficient,”
IEEE Pervasive Computing, vol. 14, no. 2, pp. 9–13, Apr 2015.
[61] K. V. Palem, L. N. Chakrapani, Z. M. Kedem, A. Lingamneni, and K. K. Muntimadugu, “Sustaining Moore’s law in embedded computing through probabilistic and approximate design: Retrospects and prospects,” in
Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, ser. CASES ’09. New York, NY, USA: ACM, 2009, pp. 1–10. [Online]. Available: http://doi.acm.org/10.1145/1629395.1629397
[62] Q. Zhang, F. Yuan, R. Ye, and Q. Xu, “ApproxIt: An approximate computing framework for iterative methods,” in , June 2014, pp. 1–6.
[63] M. Rinard, H. Hoffmann, S. Misailovic, and S. Sidiroglou, “Patterns and statistical analysis for understanding reduced resource computing,” in
Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, ser. OOPSLA ’10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 806–821. [Online]. Available: https://doi.org/10.1145/1869459.1869525
[64] C. Sakr, N. Wang, C.-Y. Chen, J. Choi, A. Agrawal, N. Shanbhag, and K. Gopalakrishnan, “Accumulation bit-width scaling for ultra-low precision training of deep networks,” arXiv preprint arXiv:1901.06588, 2019.
[65] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “EnerJ: Approximate data types for safe and general low-power computation,” in
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’11. New York, NY, USA: ACM, 2011, pp. 164–174. [Online]. Available: http://doi.acm.org/10.1145/1993498.1993518
[66] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “EnerJ: Approximate data types for safe and general low-power computation,” in
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’11. New York, NY, USA: ACM, 2011, pp. 164–174. [Online]. Available: http://doi.acm.org/10.1145/1993498.1993518
[67] M. H. Santriaji and H. Hoffmann, “GRAPE: minimizing energy for GPU applications with performance requirements,” in . IEEE Computer Society, 2016, pp. 16:1–16:13. [Online]. Available: https://doi.org/10.1109/MICRO.2016.7783719
[68] M. H. Santriaji and H. Hoffmann, “MERLOT: architectural support for energy-efficient real-time processing in GPUs,” in
IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2018, 11-13 April 2018, Porto, Portugal, R. Pellizzoni, Ed. IEEE Computer Society, 2018, pp. 214–226. [Online]. Available: https://doi.org/10.1109/RTAS.2018.00030
[69] Q. Shi, H. Hoffmann, and O. Khan, “A cross-layer multicore architecture to tradeoff program accuracy and resilience overheads,”
IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 85–89, 2015. [Online]. Available: https://doi.org/10.1109/LCA.2014.2365204
[70] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, “Managing performance vs. accuracy trade-offs with loop perforation,” in
Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 124–134. [Online]. Available: http://doi.acm.org/10.1145/2025113.2025133
[71] G. Tagliavini, A. Marongiu, and L. Benini, “FlexFloat: A software library for transprecision computing,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
[72] S. Venkataramani, A. Ranjan, K. Roy, and A. Raghunathan, “AxNN: Energy-efficient neuromorphic systems using approximate computing,” in , Aug 2014, pp. 27–32.
[73] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Quality programmable vector processors for approximate computing,” in
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46. New York, NY, USA: ACM, 2013, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/2540708.2540710
[74] A. K. Verma, P. Brisk, and P. Ienne, “Variable latency speculative addition: A new paradigm for arithmetic circuit design,” in , March 2008, pp. 1250–1255.
[75] C. Wan, H. Hoffmann, S. Lu, and M. Maire, “Orthogonalized SGD and nested architectures for anytime neural networks,” in
Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 9807–9817. [Online]. Available: http://proceedings.mlr.press/v119/wan20a.html
[76] C. Wan, M. Santriaji, E. Rogers, H. Hoffmann, M. Maire, and S. Lu, “ALERT: Accurate learning for energy and timeliness,” in
Advances in neural information processing systems, 2018, pp. 7675–7684.
[78] S. Wang, C. Li, H. Hoffmann, S. Lu, W. Sentosa, and A. I. Kistijantoro, “Understanding and auto-adjusting performance-sensitive configurations,” in
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-28, 2018, X. Shen, J. Tuck, R. Bianchini, and V. Sarkar, Eds. ACM, 2018, pp. 154–168. [Online]. Available: https://doi.org/10.1145/3173162.3173206
[79] S. Wu, G. Li, F. Chen, and L. Shi, “Training and inference with integers in deep neural networks,” arXiv preprint arXiv:1802.04680, 2018.
[80] A. Yazdanbakhsh, D. Mahajan, B. Thwaites, J. Park, A. Nagendrakumar, S. Sethuraman, K. Ramkrishnan, N. Ravindran, R. Jariwala, A. Rahimi, H. Esmaeilzadeh, and K. Bazargan, “Axilog: Language support for approximate hardware design,” in , March 2015, pp. 812–817.
[81] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, “Design-efficient approximate multiplication circuits through partial product perforation,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 10, pp. 3105–3117, Oct 2016.
[82] H. Zhang, M. Putic, and J. Lach, “Low power GPGPU computation with imprecise hardware,” in
Proceedings of the 51st Annual Design Automation Conference, ser. DAC ’14. New York, NY, USA: ACM, 2014, pp. 99:1–99:6. [Online]. Available: http://doi.acm.org/10.1145/2593069.2593156
[83] Y. Zhou, H. Hoffmann, and D. Wentzlaff, “CASH: supporting IaaS customers with a sub-core configurable architecture,” in . IEEE Computer Society, 2016, pp. 682–694. [Online]. Available: https://doi.org/10.1109/ISCA.2016.65
[84] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, “Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 8, pp. 1225–1229, Aug 2010.