Comparing and Combining Approximate Computing Frameworks
Saeid Barati
Computer Science Department, University of Chicago
Chicago, [email protected]
Gordon Kindlmann
Computer Science Department, University of Chicago
Chicago, [email protected]
Hank Hoffmann
Computer Science Department, University of Chicago
Chicago, [email protected]
Abstract—Approximate computing frameworks configure applications so they can operate at a range of points in an accuracy-performance trade-off space. Prior work has introduced many frameworks to create approximate programs. As approximation frameworks proliferate, it is natural to ask how they can be compared and combined to create even larger, richer trade-off spaces. We address these questions by presenting VIPER and BOA. VIPER compares trade-off spaces induced by different approximation frameworks by visualizing performance improvements across the full range of possible accuracies. BOA is a family of exploration techniques that quickly locate Pareto-efficient points in the immense trade-off space produced by the combination of two or more approximation frameworks. We use VIPER and BOA to compare and combine three different approximation frameworks from across the system stack, including: one that changes numerical precision, one that skips loop iterations, and one that manipulates existing application parameters. Compared to simply looking at Pareto-optimal curves, we find VIPER's visualizations provide a quicker and more convenient way to determine the best approximation technique for any accuracy loss. Compared to a state-of-the-art evolutionary algorithm, we find that BOA explores 14× fewer configurations yet locates 35% more Pareto-efficient points.

I. INTRODUCTION
Approximation frameworks configure applications to operate within a trade-off space where result accuracy is exchanged for other benefits, typically increased performance. Different approximation frameworks exist across the layers of the system stack. Some focus on the circuit level [1]–[6]. Others replace expensive hardware with approximations [7]–[10]. Still others exist at the programming language and compiler level [11]–[18]. As approximation methods proliferate, it is natural to question their interaction; especially:
• How to compare the trade-off spaces induced by different techniques? Comparing individual points in the trade-off space is easy: simply compare all frameworks' performance at that point. Most approximation frameworks, however, produce a trade-off space—with a range of operating points—so we need techniques that compare frameworks across that entire range.
• How to combine different techniques' trade-off spaces and locate Pareto-efficient points in the new space? The challenge is quickly locating more efficient configurations in the immense combined trade-off space, which is too big to search exhaustively.
VIPER and BOA: To compare approximation frameworks we propose VIPER: Visualizing Improved PErformance Ratios. While existing techniques use numerical comparisons [19]–[23] or simply display Pareto-optimal curves, VIPER produces a visual representation of the trade-off space. VIPER produces charts showing normalized performance for different frameworks across all possible accuracy loss ranges. A chart is divided into different, mathematically meaningful regions that show how much one framework out-performs others.
To combine frameworks, we propose BOA: Blending Optimal Approximations. BOA is a family of algorithms that locate Pareto-efficient points in the huge trade-off space produced by multiple frameworks. In its simplest version, BOA-simple searches the cross product of Pareto-optimal points from individual frameworks. BOA extensions evaluate more of the search space—either deterministically or probabilistically—including more near-optimal points. BOA then returns the Pareto-efficient points from this search space.
Summary of Results: We consider two case studies. Both use prior approximation frameworks from across the system stack: Loop Perforation—a compiler technique (LP) [24], PowerDial—an application-level technique (PD) [25], and the Approximate Math Library—a library that changes numerical precision (AML) [18]. We use eleven applications covering domains from machine learning to image processing. Each application includes multiple inputs that we divide into training and test sets to evaluate whether combination methods produce statistically sound results on unseen inputs.
The first case study uses VIPER to compare Loop Perforation, PowerDial, and the Approximate Math Library. Loop Perforation simply discards loop iterations with no regard to original intent. PowerDial, in contrast, builds off approximations that already exist in the application; i.e., those envisioned by the original programmer. The Approximate Math Library approximates math functions (e.g., exp, log, and sqrt) using variable Taylor series expansion. While these approximations work at different levels of the stack, VIPER allows us to quickly compare them and produces more intuitive visualizations than simply looking at Pareto-optimal curves.
The second case study combines the three approximation techniques and locates Pareto-efficient points in this new, significantly larger trade-off space. We compare BOA to two state-of-the-art design-space exploration algorithms: the Multiple Choice Knapsack Problem (MCKP) [26] and the Non-dominated Sorting Genetic Algorithm (NSGA-II) [27]. Compared to these two approaches, BOA achieves:
• More efficient configurations: BOA-simple produces 48.2% and 35.1% more Pareto-efficient points than MCKP and NSGA-II, respectively.
• More reliable behavior on unseen inputs: BOA-simple finds statistically meaningful Pareto-efficient configurations that are not sensitive to input data and are more likely to be efficient on an unseen set of inputs. That is, the correlation between BOA's behavior on training (seen) and test (unseen) inputs is much higher than with MCKP and NSGA-II.
Somewhat surprisingly, BOA produces much better results while using a much simpler search technique than the prior works to which it is compared. The fact that simpler search methods can produce better results is a key contribution of this work. The primary insight is that, empirically, we find that optimal combinations of approximation frameworks tend to involve configurations that are near-optimal for the individual frameworks. Therefore, BOA explores combinations derived from these points with high probability. In contrast, MCKP does not consider enough non-optimal configurations and gets stuck in local minima, while NSGA-II explores too many non-optimal configurations—avoiding the worst local minima, but also stopping short of the true optimal combinations. Thus, BOA's method represents a compromise that works well for approximation frameworks, whose optimal combinations tend to be near the individually optimal points.
Contributions:
• Introduction of VIPER to visually compare approximation trade-off spaces over their entire range.
• Comparison analysis—based on VIPER—of three approximation frameworks (Loop Perforation [24], PowerDial [25], and the Approximate Math Library [18]).
• Proposal of variations of BOA for quickly locating Pareto-efficient configurations in a combined trade-off space.
• Open-source release of the VIPER and BOA tools.

II. MOTIVATION AND BACKGROUND
A. Approximation Across the System Stack
Approximation frameworks reduce runtime (or energy) by allowing output quality degradation. Hardware approximation computes inexactly in return for reduced energy, area, or time [2], [4], [28]. Many software approximation techniques allow specific software components to be replaced by approximate variants; e.g., skipping loop iterations or replacing math operations with Taylor-series expansions [17], [18], [24], [25], [29]–[38]. Some approaches use machine learning to replace exact computation with a faster, less accurate learned variant [5], [8], [9], [39]–[41]. Languages support approximation, allowing specification of variants for key functionality and formal analysis of their effects [11]–[13], [15], [16], [42]–[45]. Other mechanisms guarantee that approximate programs will maintain some quality or energy guarantees, either through program analysis [46]–[48] or runtimes with formally analyzable dynamic adaptation [37], [49]–[55].

Fig. 1: fluidanimate's accuracy/runtime trade-off space with Loop Perforation (LP), PowerDial (PD), and the Approximate Math Library (AML). The closer a configuration is to the origin, the more efficient it is.
B. Comparing Approximation Frameworks
Our intuition is that some approximation frameworks will produce better results (e.g., higher performance for the same accuracy) than others in different situations. Hence, we need a method to compare frameworks across their full range of accuracy and choose the best one for a specific usage scenario. Points in these trade-off spaces correspond to configurations of the approximation framework. Point-by-point comparison is infeasible since trade-off spaces include numerous configurations, many of which are not useful. Typically, only the Pareto-optimal points are used for comparison.
As an example, we compare three approximation frameworks: PowerDial [25], Loop Perforation [24], [31], [32], and the Approximate Math Library [18]. We pick these three frameworks because (1) they are either easily recreated or publicly available, requiring no specialized language or hardware support, and yet (2) they are representative of approaches applied at different levels of the system stack. PowerDial is an application-level approach that exploits existing trade-offs envisioned by the application developers. Loop Perforation creates approximate applications by applying a compiler transformation to selectively skip loop iterations. The Approximate Math Library changes computation, and while implemented in software, it is a good proxy for approximation techniques that change hardware arithmetic units. Figure 1 illustrates the trade-off spaces induced by these three approaches for the fluidanimate benchmark. Each point represents the normalized runtime and accuracy loss.
Comparison of approximation frameworks across their full range of accuracy is necessary, as not all users have the same accuracy requirements. We find, however, that looking at Pareto-optimal curves like those in Figure 1 is unsatisfying and rarely makes it obvious which approximation method is better across a range of operating points. Moreover, while for a certain range of accuracy one framework might perform better, another framework might produce higher performance at different accuracy ranges. This motivates VIPER, a tool that allows users to tell—at a glance—which framework has the best performance for any range of accuracy loss.
C. Combining Approximation Frameworks
Figure 1 suggests that none of the three approximation frameworks is uniformly best. Furthermore, the fact that the three are broadly representative of approaches from different levels of the system stack motivates us to combine them for better accuracy/runtime trade-offs than any individual framework. The challenge is that merging multiple frameworks leads to an enormous trade-off space, which is infeasible to explore exhaustively. Table II lists the number of points in the trade-off spaces of Loop Perforation, PowerDial, and the Approximate Math Library for sample benchmarks. For example, the x264 benchmark takes up to 4 weeks to test all combined configurations with a single input. Considering that multiple inputs should be tested for statistically sound results, the infeasibility of exhaustive exploration is obvious.
More formally, combining approximation frameworks requires quickly locating Pareto-efficient configurations in the new, larger trade-off space. Exploring large trade-off spaces is well-studied and has produced two broad classes of approach. The first carefully selects and exhaustively searches a subset of the combined trade-off space [26], [56]. The second class intelligently traverses the entire combined trade-off space—not limiting the initial configuration combinations, but exploring only a small number of the total [21], [57], [58]. Among these intelligent search techniques, NSGA-II—a genetic algorithm-based approach—has repeatedly out-performed other proposals [27], [58].
While prior work has proven effective for application-specific processor design, we find that it is not the best match for combining approximation frameworks. Specifically, the heuristic exploration of genetic algorithms appears to cause two issues: (1) in an effort to avoid local minima, they produce less efficient combinations (see Section VI-B) and (2) they add too much randomization, which leads to lower correlation between training and test inputs (see Section VI-E).

III. COMPARING APPROXIMATION FRAMEWORKS
A. Terminology
To produce performance/accuracy trade-offs, any approximation framework must have one or more tunable parameters. The values assigned to the parameter set represent a configuration, and the range of possible parameter settings is a configuration space. Each configuration represents a trade-off between performance and accuracy. The trade-off space (or design space) is the set of all possible trade-offs; i.e., the range of achievable performance and accuracy.
We consider large search spaces and often do not know the true optimal values for which we are searching. We therefore distinguish between Pareto-optimal—meaning we know that a point is on the true Pareto-optimal frontier—and Pareto-efficient—meaning a point on the estimated, unknown Pareto-optimal frontier. Thus, if we say a point is Pareto-efficient, it is better than all other points found so far, but we do not know that it is truly Pareto-optimal.
B. Numerical Comparisons
For large trade-off spaces, a point-by-point comparison is not possible. Therefore, prior work has introduced analytical methods for comparing trade-off spaces based on the number of Pareto-optimal—if the trade-off space is known—or Pareto-efficient—if the trade-off space is estimated—points from each framework.
A point in our accuracy-performance trade-off space is a 2D vector with runtime and accuracyLoss. Ideally, we would have zero runtime and zero accuracy loss; i.e., instantaneously get a perfect answer, leading to:
Definition 1. Objective Function: Given points x1 and x2, the objective to be minimized is f(x) where:

f(x1) < f(x2) ⟺ accuracyLoss(x1) < accuracyLoss(x2) and runtime(x1) < runtime(x2)   (1)

Points closer to the origin represent more efficient configurations. Given the objective function f(x), we determine if a point is more efficient than another by:

Definition 2. Dominance: Given points x1 and x2, we say:

x1 ⪰ x2 (weakly dominates) if f(x1) ≤ f(x2)
x1 ≻ x2 (dominates) if f(x1) < f(x2)   (2)

A point is Pareto-optimal if it is not dominated by any other point. A point is Pareto-efficient if we do not know of another point that dominates it. Figure 2(a) illustrates an example of dominance, where point x2 is dominated by point x1. Coverage quantifies the number of Pareto-efficient points produced by different techniques [57]:
Definition 3. Coverage is the dominance ratio of the Pareto-efficient curves induced by two separate frameworks. If X and Y are two Pareto-efficient curves, and x and y represent points on them respectively, then:

C(X, Y) = |{y ∈ Y | ∃x ∈ X : x ⪰ y}| / |Y|   (3)

C(X, Y) = 1 means that all points in Y are weakly dominated by points in X; i.e., all points of X provide lower runtime for the same accuracy loss than the points of Y. Figure 2(b) illustrates the coverage of curve X with respect to curve Y. The points y1 and y2 on the Y curve are dominated by at least one point on the curve X—x1, for example—therefore C(X, Y) = 2/3. In contrast, no point on X is dominated by a point on curve Y, which means C(Y, X) = 0. By this metric, we consider X more efficient, but note that the curve Y extends through a larger range within the trade-off space; i.e., y3 is a useful point which neither dominates nor is dominated by any points in X. The coverage function is non-symmetric (C(X, Y) ≠ C(Y, X)) and usually the two values do not sum to 1 [20]. Hence, we need a metric that considers both coverage functions simultaneously:

Fig. 2: Dominance (a) and Coverage (b) functions (from [57]). In (a), x1 dominates x2, as x1 is both faster and more accurate. In (b), X covers 2/3 of Y because y1 and y2 are dominated by at least one point in X.

Definition 4. Difference of Coverage compares coverage for two different Pareto-efficient curves.
DOC(X, Y) = C(X, Y) − C(Y, X)   (4)

As a result, when DOC(X, Y) ≥ 0, the fraction of Y points dominated by X is greater than the fraction of X points dominated by Y. A higher DOC implies one set is more efficient than the other. If DOC(X, Y) is close to zero, both may provide the same efficiency. This metric is widely used in multi-objective optimization problems [21]–[23].

C. VIPER
Numerical comparisons suffer from two major shortcomings. First, they do not show the full range of accuracy loss induced by each framework. Second, as seen in Figure 1, the best approximation framework varies as accuracy loss changes. Numerical metrics—like DOC—have limited expressiveness; Figure 2(b) shows that y3 is a useful point, but DOC makes X look uniformly better than Y.
As an alternative to numerical methods, researchers have used Pareto-optimal curves to compare frameworks, but this graphical evaluation has proven problematic [56], [59]. While curves may look compact, they can differ by orders of magnitude; e.g., when there is a steep slope and a large range covered, a small change in one dimension (e.g., accuracy loss) leads to a significant shift in the other (e.g., runtime).

Algorithm 1 VIPER.
Require: M, B ▷ lower convex hulls for framework M and baseline B
  MinX = Max(Min(M.x), Min(B.x)) ▷ lower bound
  MaxX = Min(Max(M.x), Max(B.x)) ▷ upper bound
  step = (MaxX − MinX) / 1000
  for accuLoss = MinX; accuLoss < MaxX; accuLoss += step do
    M_i ← find point on M where M_i.x < accuLoss < M_{i+1}.x
    B_j ← find point on B where B_j.x < accuLoss < B_{j+1}.x
    ŷ_M ← interpolate runtime between M_i and M_{i+1} where x = accuLoss
    ŷ_B ← interpolate runtime between B_j and B_{j+1} where x = accuLoss
    perfImprovRatio[accuLoss] = ŷ_M / ŷ_B
  end for
  NORMALIZE(perfImprovRatio) ▷ limit the ratio to [0,1]
  return perfImprovRatio ▷ array of points

To provide an alternative visualization of approximation frameworks we introduce VIPER, which illustrates the relative performance of frameworks for any range of accuracy loss. Algorithm 1 explains how VIPER calculates the performance improvement ratio (PIR) of one framework M over a baseline B. The PIR is the ratio of the frameworks' performance at a given accuracy loss. To make the charts readable, PIR is in the range [0,1]. PIR = 1 means a configuration is the fastest in the space, while PIR = 0 is the slowest. A configuration with PIR = 0.6 achieves 60% of the maximum speedup.
First, VIPER finds the lower and upper bounds of the accuracy loss, which define the range of comparison. Then, this range is divided by a parameterized granularity. We use a granularity of 1000 in this paper, as larger values produced no benefit and smaller values make the charts less clear. For each accuLoss value, VIPER finds the corresponding runtime in both frameworks. Afterwards, we search for the nearest points on each lower convex hull whose accuracy loss is smaller than accuLoss (identified as points M_i and B_j, respectively). Then, we interpolate the runtime at the specified accuLoss for both frameworks (named ŷ_M and ŷ_B), and divide these interpolated runtime values to compute the performance improvement ratio.
Finally, we normalize the ratio to [0,1]. When we compare more than two frameworks, the ratio is normalized to the lowest and highest among all. VIPER then charts the PIR across the range of accuracy loss. The values for the baseline B form a straight horizontal line. Values of M above that line indicate that M achieves higher performance for that accuracy loss. If the line for M stays above that for B for a greater range of accuracy loss, it means M has found more efficient configurations, on average. The color shading on the plot background indicates the highest-performance method for that accuracy loss among the multiple frameworks. Therefore, if the plot's background is dominated by a single color, the corresponding method provides the more efficient configurations. Thus, VIPER allows users to see at a (literal) glance whether one approximation framework is clearly superior to another.
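Algorithm 1's core loop (bracketing each accuracy-loss sample on the two lower convex hulls and interpolating runtimes) can be sketched as follows. The sketch returns the raw ratio and omits the final cross-framework normalization; the helper names are ours, not the released tool's.

```python
import bisect

def interpolate(hull, x):
    """Linearly interpolate runtime at accuracy loss x on a hull given as
    a list of (accuracy_loss, runtime) points sorted by accuracy loss."""
    xs = [p[0] for p in hull]
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(hull) - 2)
    (x0, y0), (x1, y1) = hull[i], hull[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def viper_pir(M, B, granularity=1000):
    """Raw performance-improvement ratios of framework M over baseline B,
    sampled across their shared accuracy-loss range (Algorithm 1)."""
    lo = max(M[0][0], B[0][0])   # shared lower bound on accuracy loss
    hi = min(M[-1][0], B[-1][0])  # shared upper bound
    step = (hi - lo) / granularity
    return [(lo + k * step,
             interpolate(M, lo + k * step) / interpolate(B, lo + k * step))
            for k in range(granularity)]
```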
IV. CASE STUDY 1: COMPARING FRAMEWORKS
A. Experimental Setup
We use a dual-socket Intel Xeon E5 server system with 20 physical cores at 2.9 GHz, hyperthreading, and 32 GB memory. Table I lists the benchmarks used, from Parsec 3.0 [60] and Rodinia 3.1 [61]. Table I also contains the description, type, application accuracy metric, and default runtime for each benchmark. Accuracy loss is the error relative to the most accurate configuration. Shorter runtime corresponds to higher performance. blackscholes's only tunable parameter is the number of prices to estimate, and modifying it does not affect accuracy. Thus, PowerDial has no effect on blackscholes. Similarly, canneal, heartwall, kmeans, and x264 use math functions infrequently; the Approximate Math Library is not applicable to them.
We evaluate each suite across multiple inputs and compare the median across the inputs for this evaluation. In this section, we are evaluating known frameworks, thus we use the training inputs from Table I. In subsequent sections—where we present new techniques—we divide inputs into training and test sets, build combinations of frameworks using the training data, and then use the test data to ensure our selected combination works well on previously unseen data.
We evaluate three approximation frameworks. PowerDial (PD) transforms an application's command line parameters into software knobs that are automatically manipulated to trade accuracy for performance [25]. Each application has tunable knobs, which can take different values, and an assignment of values to knobs is a configuration. Loop Perforation (LP) identifies perforatable loops whose iterations can be skipped to produce faster, but less accurate, results [24]. A set of loops and perforation rates is a configuration. The Approximate Math Library (AML) substitutes math functions with a variable Taylor series expansion. A set of functions and their number of terms is the configuration.

B. Comparison by Difference of Coverage
To compare Loop Perforation and PowerDial, we calculate the average coverage function (C(X, Y) from Eq. 3) across all benchmarks for both. On average, Loop Perforation covers only . Pareto-optimal points of PowerDial, while PowerDial covers . Pareto-optimal points of Loop Perforation. Thus, DOC(LoopPerforation, PowerDial) = −. , which shows the slight superiority of PowerDial over Loop Perforation, on average. On the other hand, the negative values DOC(AML, PD) = −. and DOC(AML, LP) = −. show the significant inferiority of the Approximate Math Library against the other frameworks, on average.

C. Comparison by Pareto-optimal Curves
Figure 3a illustrates the frameworks' trade-off spaces. Each point represents a configuration. The y-axis is runtime normalized to the default configuration and the x-axis is the accuracy loss. Each framework's Pareto-optimal curve is shown in the same color as its trade-off space. These plots highlight how configurations cover wide ranges of runtime and accuracy loss. While in some cases—e.g., canneal, kmeans, and srad—Pareto-optimal curves are easy to compare, in other benchmarks—e.g., particlefilter and swaptions—comparison is infeasible. For fluidanimate, the Pareto-optimal curves intersect multiple times; the best approximation framework differs across the range of accuracy loss.

D. Comparison by VIPER
Just viewing the Pareto-optimal curves in Figure 3a provides limited intuition, as differences are not always visible. We use VIPER to compare these frameworks in Figure 3b. The y-axis represents the performance improvement ratio (PIR), while the x-axis illustrates accuracy loss. The horizontal line represents Loop Perforation: points above that line mean the corresponding technique is faster than Loop Perforation. The backdrop color indicates the best method for an accuracy loss. VIPER shows only useful configurations—if one configuration dominates the others, VIPER shows a small range of accuracy loss. Thus, some accuracy loss ranges in the Figure 3b plots are narrower than those in Figure 3a because there is no benefit to increasing accuracyLoss. VIPER provides the following insights:
• It illustrates how frameworks perform within a specific accuracy loss range. For instance, while PowerDial finds a higher-performance configuration than Loop Perforation for bodytrack and canneal, its performance is worse for kmeans and x264 for most accuracies.
• While the distinction between frameworks is not clear in Figure 3a for streamcluster and swaptions, VIPER allows quick, obvious comparison.
• VIPER clearly illustrates the intersection of Pareto-optimal curves; e.g., in fluidanimate and srad.
Since VIPER only requires trade-off spaces to compare, it can be applied to any approximation frameworks regardless of system level. We believe VIPER provides clear insights, which are instantly visually recognizable and mathematically meaningful. VIPER is not a replacement for existing methods, but a complement that simplifies comparison.

V. COMBINING APPROXIMATION FRAMEWORKS
The prior section shows that none of Loop Perforation, PowerDial, or the Approximate Math Library is uniformly best. This observation motivates us to combine frameworks. At one level, this process is quite easy—just create a new trade-off space that is the cross product of all configurations in the original frameworks. The challenge, of course, is quickly locating the Pareto-efficient points in the resulting massive combined trade-off space.
We meet this challenge with the BOA family of search algorithms. All BOA methods select a subset of the combined trade-off space and exhaustively search that subspace. The first algorithm, BOA-simple, only considers configurations in the cross-product of individual frameworks' Pareto-optimal configurations. This technique produces a relatively small set of points to search, but may be subject to local minima if the approximation frameworks are not independent.
Unfortunately, most approximation frameworks are not independent. For example, Loop Perforation changes the number of loop iterations within an application; PowerDial may change convergence criteria. When we combine configurations from these frameworks, we find that some configurations that were Pareto-optimal when considering only the original frameworks are now far from optimal. Conversely, we empirically find that some configurations that were not Pareto-optimal in the original frameworks combine to be Pareto-optimal when we consider multiple frameworks together. These observations motivate us to expand BOA-simple to include more non-Pareto-optimal configurations in combination.
BOA-flex expands the combined search space to consider the configurations that produce a trade-off within a user-defined threshold of Pareto-optimal. This technique searches more points and tends to find more efficient combinations, but it is deterministic. A common way to avoid local minima in large search spaces is expanding the exploration area with some form of randomization. We follow this approach with the last algorithm: BOA-prob, which probabilistically selects configurations from each individual framework to combine.
TABLE I: Benchmarks used for evaluation.

Benchmark      | Accuracy Metric                        | Training Inputs                   | Test Inputs                       | Runtime (sec)
Blackscholes   | Average Relative Error of Prices       | 30 lists with 1M initial prices   | 90 lists with 1M initial prices   | 3.2
Bodytrack      | Average Distance of Poses              | sequence of 100 frames            | sequence of 261 frames            | 3.1
Canneal        | Average Relative Routing Cost          | 30 netlists with 400K+ elements   | 90 netlists with 400K+ elements   | 6.88
Fluidanimate   | Distance between Particles             | 5 fluids with 100K+ particles     | 15 fluids with 500K+ particles    | 17.2
Heartwall      | Average Relative Error of Heart Frames | sequence of 30 ultrasound images  | sequence of 100 ultrasound images | 11.6
Kmeans         | Distance between Cluster Centers       | 30 vectors with 256K data points  | 90 vectors with 256K data points  | 3.1
Particlefilter | Distance between Particles             | sequence of 60 frames             | sequence of 240 frames            | 12.9
Srad           | Image Diff (RMSE)                      | 3 images with 2560*1920 pixels    | 9 images with 2560*1920 pixels    | 22.6
Streamcluster  | Distance between Cluster Centers       | 3 streams of 19K-100K data points | 9 streams of 100K data points     | 30
Swaptions      | Average Relative Error of Prices       | 40 swaptions                      | 160 swaptions                     | 6.2
x264           | Relative PSNR+Bitrate                  | 4 HD videos of 200+ frames        | 12 HD videos of 200+ frames       | 7.7
Specifically, it uses a sigmoid probability function, so that the closer points are to Pareto-optimal, the more likely they are to be included in the combined trade-off space. BOA-prob includes most of the same points as the other BOA algorithms, but includes some outliers with small probability, making it more robust in the presence of local minima.
BOA-simple:
The simple version of BOA forms the cross-product of all Pareto-optimal configurations from the individual frameworks. After executing on the evaluation platform, BOA-simple returns the Pareto-efficient configurations found in this combined trade-off space. The worst-case complexity of BOA-simple is bounded by O((m + log(m)) ∗ n^m), where m is the number of frameworks to be merged and n is the total number of parameters of all approximation frameworks [56]. In our experiments, the input parameters are the sum of the number of loop rates, software knobs, and Taylor series bounds that represent our three frameworks. While the algorithm has exponential complexity, it is practical because so few configurations lie on the Pareto-optimal frontiers produced by the individual frameworks (see Figure 3 and Table V).
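BOA-simple can be sketched as follows. The sketch assumes each framework is given as a mapping from configuration name to its measured (accuracy_loss, runtime) trade-off, and that a `measure` callback runs a combined configuration and returns its trade-off; both are stand-ins for the real measurement infrastructure, and the names are ours.

```python
import itertools

def pareto_front(space):
    """Keep configurations whose trade-offs no other configuration
    strictly dominates (lower accuracy loss AND lower runtime)."""
    def dominates(a, b):
        return a[0] < b[0] and a[1] < b[1]
    return {c: t for c, t in space.items()
            if not any(dominates(u, t) for u in space.values())}

def boa_simple(frameworks, measure):
    """Exhaustively measure the cross product of each framework's
    Pareto-optimal configurations; return the efficient combinations."""
    fronts = [list(pareto_front(f)) for f in frameworks]
    combined = {combo: measure(combo)
                for combo in itertools.product(*fronts)}
    return pareto_front(combined)
```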
BOA-flex:
BOA-flex augments BOA-simple with a user-specified selection threshold, as shown in Algorithm 2. Thisthreshold also removes some inconsistency that may arise dueto experimental noise; i.e., it is possible that for high varianceapplications, the true Pareto-optimal configurations cannotbe found with confidence, so adding the threshold makesthe search more robust. Specifically, BOA-flex considers allconfigurations whose trade-off is within the user-specifiedthreshold of a Pareto-optimal trade-off.This threshold is specified in terms of normalized Euclideandistance . All trade-offs are normalized so that accuracy lossand runtime range from 0 to 1. Accuracy loss of 1 means thelowest quality. A runtime of 1 is the slowest execution time.A trade-off point is the output of executing a configurationand, is thus, a pair of accuracy loss and runtime. Havingnormalized all configurations accuracy loss and runtime, wecan then compute the Euclidean distance between the trade-offs of two separate configurations. Given this definition, thethreshold specifies how close to Pareto-optimal a trade-offmust be for it to be included in the search. For example, For the purpose of time complexity analysis, we assume each approxima-tion knob can take on two values only, however, in reality, parameters maybe assigned a larger number of values. the threshold is . , and then the algorithm will include anyconfiguration whose accuracy loss/runtime trade-off is within5% of a Pareto-optimal point. If the threshold is zero, thisalgorithm is equivalent to BOA-simple. Algorithm 2
BOA-flex: expands the search space by a threshold.

Require: frameworks              ▷ trade-off spaces of the frameworks
Require: threshold               ▷ user-defined threshold
  Combination = []               ▷ configurations to explore
  for f in frameworks do
      Pareto-opt_f ← Get-Pareto-Opt(f)
      for Config C_i in Pareto-opt_f do
          for Config C_j in f do
              if NormalizedEuclideanDistance(C_i, C_j) ≤ threshold then
                  Combination.append(C_j)
              end if
          end for
      end for
  end for
  return Combination             ▷ set of points to explore
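Algorithm 2 can be sketched in Python roughly as follows; the dictionary-based configuration records and helper names are illustrative assumptions, not the authors' code:

```python
import math

def pareto_optimal(configs):
    """Non-dominated configs; each has normalized 'loss' and 'time' in [0, 1]."""
    return [c for c in configs
            if not any(o['loss'] <= c['loss'] and o['time'] <= c['time']
                       and (o['loss'] < c['loss'] or o['time'] < c['time'])
                       for o in configs)]

def distance(a, b):
    """Normalized Euclidean distance between two trade-off points."""
    return math.hypot(a['loss'] - b['loss'], a['time'] - b['time'])

def boa_flex(frameworks, threshold):
    """Keep every configuration within `threshold` of its framework's frontier."""
    combination = []
    for f in frameworks:
        for ci in pareto_optimal(f):
            for cj in f:
                if distance(ci, cj) <= threshold and cj not in combination:
                    combination.append(cj)
    return combination
```

With threshold = 0, only the frontier points themselves survive, recovering BOA-simple's per-framework selection.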
BOA-prob:
While BOA-flex expands the combined search space, it only considers additional configurations that are close to an individual framework's Pareto-optimal curve. To make BOA even more robust to local minima, BOA-prob employs a sigmoid probability function to include a few points that are farther from the individual frameworks' Pareto-optimal curves:

S(C_j) = 1 / (1 + exp((∆ − β) / γ))    (5)

where ∆ is the normalized Euclidean distance between configuration C_j and the nearest Pareto-optimal configuration, β is the horizontal shift, and γ decides the curve's smoothness. Algorithm 3 shows how BOA-prob uses Equation 5. We choose the constants β and γ so that there is a 92% chance of including points very close to the frontier, a 50% chance of selecting a point at ∆ = β, and less than a 1% chance of inclusion at larger distances. If ∆ = 0, then C_j is actually Pareto-optimal and BOA-prob always includes it. The interdependent parameters β and γ control the size of the combined trade-off space and the exploration time.

VI. CASE STUDY 2: BOA EVALUATION
We compare variations of BOA to prior exploration techniques using the same experimental setup from Section IV-A.

Fig. 3: Comparison of PowerDial (PD), Loop Perforation (LP), and the Approximate Math Library (AML) on each benchmark. (a) shows the Performance/Accuracy-Loss trade-off spaces (Pareto-optimal frontiers of normalized runtime vs. accuracy loss); (b) shows the VIPER comparison.
Algorithm 3 BOA-prob: probabilistic search space expansion.

Require: frameworks              ▷ trade-off spaces of the frameworks
  Combination = []               ▷ Pareto-efficient configurations
  for f in frameworks do
      Pareto-opt_f ← Get-Pareto-Opt(f)
      for Config C_i in Pareto-opt_f do
          for Config C_j in f do
              ∆ = NormalizedEuclideanDistance(C_i, C_j)
              S(C_j) = 1 / (1 + exp((∆ − β) / γ))
              r = rand()         ▷ random number between 0 and 1
              if (r < S(C_j)) or (∆ = 0) then
                  Combination.append(C_j)
              end if
          end for
      end for
  end for
  return Combination             ▷ set of points to explore
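A sketch of BOA-prob's selection rule follows. The concrete β and γ values below are illustrative assumptions (the paper's constants did not survive extraction), and the frontier and distance helpers are passed in by the caller:

```python
import math
import random

def keep_probability(delta, beta=0.1, gamma=0.02):
    """Sigmoid of Eq. 5: inclusion probability decays with the distance
    delta from the frontier; beta is the 50% point, gamma the smoothness."""
    return 1.0 / (1.0 + math.exp((delta - beta) / gamma))

def boa_prob(frameworks, frontier_of, dist, beta=0.1, gamma=0.02,
             rng=random.random):
    """Probabilistically expand each framework's frontier (Algorithm 3)."""
    combination = []
    for f in frameworks:
        for ci in frontier_of(f):
            for cj in f:
                delta = dist(ci, cj)
                # Pareto-optimal points (delta == 0) are always included.
                if delta == 0 or rng() < keep_probability(delta, beta, gamma):
                    if cj not in combination:
                        combination.append(cj)
    return combination
```

With these illustrative constants, a point at ∆ = 0.05 is kept with probability ≈ 92%, at ∆ = 0.1 with 50%, and at ∆ = 0.2 with under 1%, which matches the qualitative shape described above.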
We now split our inputs into training and test data sets as shown in Table I. For each exploration technique, we first use the training inputs to find Pareto-efficient configurations; then we evaluate those points using the separate test data.
A. Points of Comparison
We compare BOA to state-of-the-art approaches for locating Pareto-efficient points in large trade-off spaces:

• MCKP: The multiple-choice knapsack problem, a variant of the classic knapsack problem, has classes of items and must choose one item from each class. MCKP has been used to find Pareto-efficient processor designs in the performance-power space for application-specific embedded processors [26]. We declare each framework to be a class. MCKP then selects the Pareto-optimal configurations of each class while keeping the default values for the other classes. This creates a new, small trade-off space which can be searched by brute force.

• NSGA-II: The non-dominated sorting-based multi-objective evolutionary algorithm (NSGA-II) explores large trade-off spaces to find Pareto-optimal configurations using an evolutionary genetic algorithm [27]. NSGA-II is the state of the art for multiobjective optimization of embedded processors that navigate performance-power trade-offs, and it has been widely cited.

Fig. 4: Difference of coverage over NSGA-II. Higher bars are better. (Bars for MCKP, BOA-simple, BOA-flex, and BOA-prob, per benchmark and on average.)
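The MCKP-style baseline can be sketched as follows (the configuration encoding and names are our own assumptions): each framework is a class, and each candidate varies one framework over its Pareto-optimal settings while every other framework stays at its default.

```python
def mckp_candidates(defaults, frontiers):
    """Build the small MCKP search space: for class (framework) i, substitute
    each of its Pareto-optimal settings into the all-defaults configuration."""
    candidates = []
    for i, frontier in enumerate(frontiers):
        for setting in frontier:
            combined = list(defaults)
            combined[i] = setting   # vary only framework i
            candidates.append(tuple(combined))
    return candidates
```

The resulting space has at most the sum of the individual frontier sizes, which is why MCKP explores so few configurations but misses cross-framework interactions.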
B. Comparison by Difference of Coverage
Recall from Section III that the difference of coverage (DOC) measures the efficiency of one curve relative to another. Figure 4 displays the difference of coverage of various techniques over NSGA-II, per benchmark and on average. The y-axis shows DOC(X, NSGA), the DOC of exploration technique X over NSGA-II (see Section III-B). Negative values of DOC(X, NSGA) indicate that X does not find as many Pareto-efficient points as NSGA-II; conversely, positive values imply that technique X provides that many more points that dominate NSGA-II's. BOA-flex and BOA-prob on average locate 52.8-65.6% more Pareto-efficient configurations than NSGA-II. BOA's superiority is due to its focus on configurations that have been shown to be Pareto-optimal in the individual frameworks. NSGA-II starts the exploration from a random set of points in the combined trade-off space and iteratively looks for more efficient points. MCKP uses the individual Pareto-optimal curves but keeps the rest of the frameworks at default configurations. Since the frameworks are not fully independent, we empirically find that some configurations that were not Pareto-optimal in the original frameworks become part of the Pareto-efficient curve of the combined trade-off space when we consider multiple frameworks together. By expanding BOA with threshold-based and probabilistic exploration, we search more points, resulting in more efficient configurations. In short, MCKP does not search enough combinations, while NSGA-II searches too many. By restricting the search to points likely to be near the Pareto-optimal frontiers of the individual frameworks, BOA achieves the right balance and the best empirical results. These data show that, for approximate computations, BOA produces many more efficient configurations than prior state-of-the-art search techniques.

C. Comparison by VIPER
Figure 5a shows the Pareto-efficient points for each benchmark and search method. The y-axis shows runtime (normalized to the default configuration) and the x-axis represents accuracy loss. We use the median runtime across test inputs. These figures display the range of runtime and accuracy loss that each method can achieve. For instance, NSGA-II and MCKP cannot provide normalized runtime less than 78.1% and 55.5% of the default configuration, respectively, for particlefilter.

Figure 5b illustrates the VIPER comparison of NSGA-II, MCKP, and the different variations of BOA. The y-axis shows the performance improvement ratio while the x-axis shows accuracy loss. We use NSGA-II as the baseline, so it is represented by a horizontal line. For the same accuracy loss, lines above that horizontal represent better (more efficient) configurations, and lines below represent configurations worse than those found by NSGA-II. For most applications, MCKP stays below the horizontal, meaning it is worse than NSGA-II. By comparing the performance improvement ratio lines of BOA-simple and MCKP, we see that BOA-simple outperforms MCKP.

From the VIPER plots we also find the maximum and minimum performance improvement over NSGA-II. Considering fluidanimate, the NSGA-II line is at 0.25, indicating that the maximum performance is 4× better than NSGA-II, and the minimum performance is 25% worse. In fact, for every benchmark BOA-flex finds at least one configuration with higher performance for the same accuracy.

Whenever NSGA-II locates more Pareto-efficient points than BOA-simple, expanding the Pareto-efficient configurations reduces the performance improvement ratio gap. Benchmarks heartwall, kmeans, and particlefilter demonstrate how expanding the combined configurations provides higher Pareto-efficiency. In total, we find that by increasing the threshold, the lines of BOA-flex are above the NSGA-II line more than 95% of the time.
These results provide visual confirmation that BOA not only finds a greater number of efficient points than prior techniques, but BOA's points are also significantly better, representing much more efficient trade-offs. Furthermore, we believe this case study provides further evidence of VIPER's value: the VIPER charts in Figure 5b are visually intuitive, whereas the Pareto frontiers (Figure 5a) do not immediately show which framework is best at a given accuracy or by how much.

D. Exploration Time
The Pareto-efficiency of the located points depends on exploration time. While Figures 4 and 5 show that BOA produces better configurations than the other techniques, it is important to know whether that gain comes from exploring more points or from a better exploration strategy. Table II presents the number of configurations explored for each benchmark and method, including different thresholds for BOA-flex. To estimate the time spent exploring the combined trade-off space for a specific benchmark, we can multiply the number of explored configurations by the average runtime (from the last row of Table I). Comparing NSGA-II and BOA-simple across all benchmarks, NSGA-II explores 2.05% of all possible configurations, while BOA-simple explores about 14× fewer. BOA-flex and BOA-prob search only 0.682% and 0.692% of all possible configurations, respectively. These results indicate that BOA not only finds better combinations of approximation frameworks, it does so with less searching.

Fig. 5: Comparison of MCKP, NSGA-II, and BOA with different thresholds. (a) shows the Pareto-optimal curves found by each exploration technique; (b) shows the VIPER comparison of MCKP, NSGA-II, and the different versions of BOA.

TABLE II: Number of Explored Configurations.

Benchmark      | LP  | PD  | AML | MCKP | NSGA-II | BOA-simple | BOA-flex Th=0.05 | BOA-flex Th=0.1 | BOA-prob | Combined Configs
Blackscholes   | 20  | -   | 216 | 5    | 240     | 3          | 12               | 18              | 27       | 4,320
Bodytrack      | 768 | 200 | 36  | 15   | 800     | 30         | 224              | 256             | 144      | 5,529,600
Canneal        |     |     |     |      |         |            |                  |                 |          |
Fluidanimate   | 144 | 20  | 6   | 11   | 180     | 24         | 216              | 672             | 168      | 17,280
Heartwall      | 256 | 320 | -   | 19   | 400     | 98         | 665              | 722             | 777      | 81,920
Kmeans         | 120 | 100 | -   | 11   | 200     | 32         | 216              | 216             | 192      | 12,000
Particlefilter | 200 | 380 | 216 | 12   | 1000    | 240        | 1760             | 7200            | 1728     | 16,416,000
Srad           | 256 | 10  | 36  | 11   | 320     | 48         | 288              | 396             | 160      | 92,160
Streamcluster  | 384 | 256 | 6   | 9    | 480     | 80         | 392              | 1380            | 640      | 589,824
Swaptions      | 768 | 100 | 36  | 15   | 1000    | 330        | 2310             | 11025           | 1584     | 2,764,800
X264           | 768 | 400 | -   | 17   | 1000    | 45         | 345              | 400             | 405      | 307,200
Since MCKP only chooses configurations from the individual Pareto-optimal curves rather than merging the configurations, the number of explored configurations stays very low. In the worst case, MCKP explores up to the sum of the Pareto-optimal points of PowerDial, Loop Perforation, and the Approximate Math Library. Unfortunately, while MCKP searches a small space, that space is too small to find many useful points.
E. Input Sensitivity
Since exhaustive exploration is not feasible, we use training and test data to ensure robustness of BOA on unseen inputs. We show how well the behavior on training inputs predicts that on test inputs. For each search method, we take the normalized runtime and accuracy loss, compute a linear least-squares fit of training data to test data, and compute the correlation coefficient of each fit. Higher correlation coefficients imply greater robustness; i.e., the behavior of configurations found on the training data is a good predictor of test behavior.

Table III shows the correlation coefficients (R-values) for accuracy loss for each exploration method per benchmark. Table IV shows the R-values for runtime. By harmonic mean, BOA has 17% and 64% higher consistency of accuracy loss and normalized runtime, respectively, compared to NSGA-II. Since MCKP evaluates few configurations, its predictions are quite robust, one advantage of MCKP over the other techniques. Some benchmarks, such as fluidanimate and streamcluster, clearly stress the difference between training and test inputs: NSGA-II's heuristic approach can select configurations for the training data that produce bad results on the test data. In contrast, BOA not only finds more efficient configurations, its results are also much more robust when applied to new inputs, producing uniformly high R-values. These results indicate that BOA is a sound method for combining approximation frameworks.

F. Combination Distribution
When BOA combines frameworks, it considers multiple configurations from each rather than choosing from only one or two. Table V lists the number of configurations BOA-simple selects from each framework to generate the new, combined trade-off space. As mentioned in Section IV-D, the Approximate Math Library is never better than Loop Perforation or PowerDial in any range of accuracy loss. However, BOA uses the Approximate Math Library in combination with Loop Perforation and PowerDial for 7 out of the 11 applications.

TABLE III: Correlation coefficients for accuracy loss.

Benchmark | MCKP | NSGA-II | BOA-simple | BOA-flex | BOA-prob
(rows for Blackscholes, Bodytrack, Canneal, Fluidanimate, Heartwall, Kmeans, Particlefilter, Srad, Streamcluster, Swaptions, X264, and the Average)

TABLE IV: Correlation coefficients for normalized runtime.

Benchmark | MCKP | NSGA-II | BOA-simple | BOA-flex | BOA-prob
(rows for the same benchmarks and the Average)
These results show that there is real benefit to combining frameworks: even the Approximate Math Library, which is uniformly the worst of the three techniques by itself, contributes to Pareto-efficient points in the combined space found by BOA.
VII. CONCLUSION
A proliferation of approximation frameworks has recently appeared, exploiting different configurable parameters to trade reduced accuracy for decreased resource consumption. This paper proposes methods for both comparing and combining different frameworks. VIPER is a visualization tool for comparing approximation frameworks across their entire range of available accuracies. We show this tool is useful for comparing existing approximation frameworks regardless of their type and the system level at which they apply. BOA is a family of algorithms that combine approximation frameworks and quickly locate Pareto-efficient configuration combinations.
Acknowledgments:
The effort on this project is funded by the U.S. Government under the DARPA BRASS program and by a DOE Early Career Award. Additional funding comes from the NSF (CCF-1439156, CNS-1526304, CCF-1823032, CNS-1764039).

TABLE V: Combinations of approximation frameworks found by BOA.

Benchmark | LP (Pareto-opt) | PD (Pareto-opt) | AML (Pareto-opt) | BOA-simple
Swaptions | 11 | 5 | 6 | 330
(rows for the remaining benchmarks: Blackscholes, Bodytrack, Canneal, Fluidanimate, Heartwall, Kmeans, Particlefilter, Srad, Streamcluster, X264)

REFERENCES

[1] K. V. Palem, "Energy aware algorithm design via probabilistic computing: From algorithms and models to Moore's law and novel (semiconductor) devices," in Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES '03). New York, NY, USA: ACM, 2003, pp. 113-116. Available: http://doi.acm.org/10.1145/951710.951712
[2] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, "Designing energy-efficient arithmetic operators using inexact computing," Journal of Low Power Electronics, vol. 9, no. 1, pp. 141-153, 2013.
[3] A. Ingole, B. Maiti, J. Augustine, and K. Palem, "Does customizing inexactness help over simplistic precision (bit-width) reduction? A case study," in Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2015 International Conference on, Oct 2015, pp. 33-34.
[4] V. K. Chippa, S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Approximate computing: An integrated hardware approach," in , Nov 2013, pp. 111-117.
[5] S. Muralidharan, A. Roy, M. Hall, M. Garland, and P. Rai, "Architecture-adaptive code variant tuning," SIGPLAN Not., vol. 51, no. 4, pp. 325-338, Mar. 2016. Available: http://doi.acm.org/10.1145/2954679.2872411
[6] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, "Ultra-efficient (embedded) SoC architectures based on probabilistic CMOS (PCMOS) technology," in Proceedings of the Design Automation Test in Europe Conference, vol. 1, March 2006, pp. 1-6.
[7] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke, "Paraprox: Pattern-based approximation for data parallel applications," in ACM SIGARCH Computer Architecture News, vol. 42, no. 1. ACM, 2014, pp. 35-50.
[8] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). Washington, DC, USA: IEEE Computer Society, 2012, pp. 449-460. Available: http://dx.doi.org/10.1109/MICRO.2012.48
[9] O. Temam, "A defect-tolerant accelerator for emerging high-performance applications," in Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12). Washington, DC, USA: IEEE Computer Society, 2012, pp. 356-367. Available: http://dl.acm.org/citation.cfm?id=2337159.2337200
[10] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and S. M. Fakhraie, "Neural network stream processing core (NnSP) for embedded systems," in , May 2006, pp. 4 pp.-2776.
[11] J. Bornholt, T. Mytkowicz, and K. S. McKinley, "Uncertain<T>: A first-order type for uncertain data," ACM SIGPLAN Notices, vol. 49, no. 4, pp. 51-66, 2014.
[12] A. Kansal, S. Saponas, A. B. Brush, K. S. McKinley, T. Mytkowicz, and R. Ziola, "The latency, accuracy, and battery (LAB) abstraction: Programmer productivity and energy efficiency for continuous mobile context sensing," in Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA '13). New York, NY, USA: ACM, 2013, pp. 661-676. Available: http://doi.acm.org/10.1145/2509136.2509541
[13] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, "EnerJ: Approximate data types for safe and general low-power computation," in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '11). New York, NY, USA: ACM, 2011, pp. 164-174. Available: http://doi.acm.org/10.1145/1993498.1993518
[14] T. Oh, H. Kim, N. P. Johnson, J. W. Lee, and D. I. August, "Practical automatic loop specialization," SIGPLAN Not., vol. 48, no. 4, pp. 419-430, Mar. 2013. Available: http://doi.acm.org/10.1145/2499368.2451161
[15] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Amarasinghe, "Language and compiler support for auto-tuning variable-accuracy algorithms," in Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 2011, pp. 85-96.
[16] W. Baek and T. M. Chilimbi, "Green: A framework for supporting energy-conscious programming using controlled approximation," SIGPLAN Not., vol. 45, no. 6, pp. 198-209, Jun. 2010. Available: http://doi.acm.org/10.1145/1809028.1806620
[17] L. Fousse, G. Hanrot, V. Lefèvre, P. Pélissier, and P. Zimmermann, "MPFR: A multiple-precision binary floating-point library with correct rounding," ACM Trans. Math. Softw., vol. 33, no. 2, Jun. 2007. Available: http://doi.acm.org/10.1145/1236463.1236468
[18] T.-J. Kwon and J. Draper, "Floating-point division and square root using a Taylor-series expansion algorithm," Microelectronics Journal.
[19]-[20] … Swarm and Evolutionary Computation, July 2007, pp. 126-133.
[21] G. Ascia, V. Catania, and M. Palesi, "A GA-based design space exploration framework for parameterized system-on-a-chip platforms," IEEE Transactions on Evolutionary Computation, vol. 8, no. 4, pp. 329-346, Aug 2004.
[22] L. Martí, J. García, A. Berlanga, and J. M. Molina, "A cumulative evidential stopping criterion for multiobjective optimization evolutionary algorithms," in Proceedings of the 9th Annual Conference Companion on Genetic and Evolutionary Computation (GECCO '07). New York, NY, USA: ACM, 2007, pp. 2835-2842. Available: http://doi.acm.org/10.1145/1274000.1274053
[23] L. Marti, J. Garcia, A. Berlanga, and J. M. Molina, "An approach to stopping criteria for multi-objective optimization evolutionary algorithms: The MGBM criterion," in , May 2009, pp. 1263-1270.
[24] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, "Managing performance vs. accuracy trade-offs with loop perforation," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE '11). New York, NY, USA: ACM, 2011, pp. 124-134. Available: http://doi.acm.org/10.1145/2025113.2025133
[25] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard, "Dynamic knobs for responsive power-aware computing," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI). New York, NY, USA: ACM, 2011, pp. 199-212. Available: http://doi.acm.org/10.1145/1950365.1950390
[26] P. Yang and F. Catthoor, "Pareto-optimization-based run-time task scheduling for embedded systems," in Hardware/Software Codesign and System Synthesis, 2003. First IEEE/ACM/IFIP International Conference on, Oct 2003, pp. 120-125.
[27] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182-197, Apr 2002.
[28] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Architecture support for disciplined approximate programming," SIGPLAN Not., vol. 47, no. 4, pp. 301-312, Mar. 2012. Available: http://doi.acm.org/10.1145/2248487.2151008
[29] Q. Shi, H. Hoffmann, and O. Khan, "A cross-layer multicore architecture to tradeoff program accuracy and resilience overheads," IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 85-89, 2015. Available: https://doi.org/10.1109/LCA.2014.2365204
[30] M. Rinard, H. Hoffmann, S. Misailovic, and S. Sidiroglou, "Patterns and statistical analysis for understanding reduced resource computing," in Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA '10). New York, NY, USA: Association for Computing Machinery, 2010, pp. 806-821. Available: https://doi.org/10.1145/1869459.1869525
[31] S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard, Quality of Service Profiling. New York, NY, USA: Association for Computing Machinery, 2010, pp. 25-34. Available: https://doi.org/10.1145/1806799.1806808
[32] H. Hoffmann, S. Misailovic, S. Sidiroglou, A. Agarwal, and M. Rinard, "Using code perforation to improve performance, reduce energy consumption, and respond to failures," no. MIT-CSAIL-TR-2009-042, 09 2009.
[33] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, "SAGE: Self-tuning approximation for graphics engines," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). New York, NY, USA: ACM, 2013, pp. 13-24. Available: http://doi.acm.org/10.1145/2540708.2540711
[34] J. Park, E. Amaro, D. Mahajan, B. Thwaites, and H. Esmaeilzadeh, "AxGames: Towards crowdsourcing quality target determination in approximate computing," SIGPLAN Not., vol. 51, no. 4, pp. 623-636, Mar. 2016. Available: http://doi.acm.org/10.1145/2954679.2872376
[35] R. J. Mathar, "A Java math.BigDecimal implementation of core mathematical functions," arXiv preprint arXiv:0908.3030, 2009.
[36] A. Abad, R. Barrio, M. Marco-Buzunariz, and M. Rodríguez, "Automatic implementation of the numerical Taylor series method: A Mathematica and Sage approach," Applied Mathematics and Computation.
[37] … SOSP, 2015.
[38] A. Canino, Y. D. Liu, and H. Masuhara, "Stochastic energy optimization for mobile GPS applications," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). New York, NY, USA: ACM, 2018, pp. 703-713. Available: http://doi.acm.org/10.1145/3236024.3236076
[39] X. Sui, A. Lenharth, D. S. Fussell, and K. Pingali, "Proactive control of approximate programs," SIGOPS Oper. Syst. Rev., vol. 50, no. 2, pp. 607-621, Mar. 2016. Available: http://doi.acm.org/10.1145/2954680.2872402
[40] C. Wan, H. Hoffmann, S. Lu, and M. Maire, "Orthogonalized SGD and nested architectures for anytime neural networks," in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13-18 Jul 2020, pp. 9807-9817. Available: http://proceedings.mlr.press/v119/wan20a.html
[41] C. Wan, M. Santriaji, E. Rogers, H. Hoffmann, M. Maire, and S. Lu, "ALERT: Accurate learning for energy and timeliness," in …
[42] … in Proceedings of the 5th International Conference on Embedded Networked Sensor Systems (SenSys '07), 2007.
[43] S. Barati, F. A. Bartha, S. Biswas, R. Cartwright, A. Duracz, D. S. Fussell, H. Hoffmann, C. Imes, J. E. Miller, N. Mishra, Arvind, D. Nguyen, K. V. Palem, Y. Pei, K. Pingali, R. Sai, A. Wright, Y. Yang, and S. Zhang, "Proteus: Language and runtime support for self-adaptive software development," IEEE Software, vol. 36, no. 2, pp. 73-82, 2019. Available: https://doi.org/10.1109/MS.2018.2884864
[44] A. Kansal, S. Saponas, A. Brush, K. S. McKinley, T. Mytkowicz, and R. Ziola, "The latency, accuracy, and battery (LAB) abstraction: Programmer productivity and energy efficiency for continuous mobile context sensing," in ACM SIGPLAN Notices, 2013.
[45] A. Canino and Y. D. Liu, "Proactive and adaptive energy-aware programming with mixed typechecking," in Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). New York, NY, USA: ACM, 2017, pp. 217-232. Available: http://doi.acm.org/10.1145/3062341.3062356
[46] M. Ringenburg, A. Sampson, I. Ackerman, L. Ceze, and D. Grossman, "Monitoring and debugging the quality of results in approximate programs," SIGPLAN Not., vol. 50, no. 4, pp. 399-411, Mar. 2015. Available: http://doi.acm.org/10.1145/2775054.2694365
[47] M. Carbin, S. Misailovic, and M. C. Rinard, "Verifying quantitative reliability for programs that execute on unreliable hardware," in Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA '13). New York, NY, USA: ACM, 2013, pp. 33-52. Available: http://doi.acm.org/10.1145/2509136.2509546
[48] E. Darulova, V. Kuncak, R. Majumdar, and I. Saha, "Synthesis of fixed-point programs," in Proceedings of the Eleventh ACM International Conference on Embedded Software (EMSOFT '13). Piscataway, NJ, USA: IEEE Press, 2013, pp. 22:1-22:10. Available: http://dl.acm.org/citation.cfm?id=2555754.2555776
[49] H. Hoffmann, "CoAdapt: Predictable behavior for accuracy-aware applications running on power-aware systems," in , 2014, pp. 223-232.
[50] M. Maggio, A. V. Papadopoulos, A. Filieri, and H. Hoffmann, "Automated control of multiple software goals using multiple actuators," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017), Paderborn, Germany, September 4-8, 2017, pp. 373-384. Available: https://doi.org/10.1145/3106237.3106247
[51] A. Filieri, H. Hoffmann, and M. Maggio, "Automated multi-objective control for self-adaptive software design," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015), Bergamo, Italy, August 30-September 4, 2015, E. D. Nitto, M. Harman, and P. Heymans, Eds. ACM, 2015, pp. 13-24. Available: https://doi.org/10.1145/2786805.2786833
[52] A. Farrell and H. Hoffmann, "MEANTIME: Achieving both minimal energy and timeliness with approximate computing," in , 2016, pp. 421-435.
[53] A. Filieri, M. Maggio, K. Angelopoulos, N. D'Ippolito, I. Gerostathopoulos, A. B. Hempel, H. Hoffmann, P. Jamshidi, E. Kalyvianaki, C. Klein, F. Krikava, S. Misailovic, A. V. Papadopoulos, S. Ray, A. M. Sharifloo, S. Shevtsov, M. Ujma, and T. Vogel, "Control strategies for self-adaptive software systems," ACM Trans. Auton. Adapt. Syst., vol. 11, no. 4, pp. 24:1-24:31, 2017. Available: https://doi.org/10.1145/3024188
[54] S. Wang, C. Li, H. Hoffmann, S. Lu, W. Sentosa, and A. I. Kistijantoro, "Understanding and auto-adjusting performance-sensitive configurations," in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2018), Williamsburg, VA, USA, March 24-28, 2018, X. Shen, J. Tuck, R. Bianchini, and V. Sarkar, Eds. ACM, 2018, pp. 154-168. Available: https://doi.org/10.1145/3173162.3173206
[55] C. Hankendi, A. K. Coskun, and H. Hoffmann, "Adapt&Cap: Coordinating system- and application-level adaptation for power-constrained systems," IEEE Des. Test, vol. 33, no. 1, pp. 68-76, 2016. Available: https://doi.org/10.1109/MDAT.2015.2463275
[56] T. Givargis, F. Vahid, and J. Henkel, "System-level exploration for Pareto-optimal configurations in parameterized systems-on-a-chip," in Computer Aided Design, 2001. ICCAD 2001. IEEE/ACM International Conference on, Nov 2001, pp. 25-30.
[57] G. Palermo, C. Silvano, and V. Zaccaria, "ReSPIR: A response surface-based Pareto iterative refinement for application-specific design space exploration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 12, pp. 1816-1829, Dec 2009.
[58] E. Zitzler, M. Laumanns, and L. Thiele, "SPEA2: Improving the strength Pareto evolutionary algorithm," Tech. Rep., 2001.
[59] J. Knowles and D. Corne, "The Pareto archived evolution strategy: A new baseline algorithm for Pareto multiobjective optimisation," in Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, vol. 1, 1999, p. 105.
[60] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton University, January 2011.
[61] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, Oct 2009, pp. 44-54.