AXES: Approximation Manager for Emerging Memory Architectures
Biswadip Maity, Bryan Donyanavard, Anmol Surhonne, Amir Rahmani, Andreas Herkersdorf, Nikil Dutt
A Preprint

University of California, Irvine, CA. Email: {maityb, bdonyana, a.rahmani, dutt}@uci.edu
Technical University of Munich, Munich, Germany. Email: {anmol.surhonne, herkersdorf}@tum.de

Abstract
Memory approximation techniques are commonly limited in scope, targeting individual levels of the memory hierarchy. Existing approximation techniques for a full memory hierarchy determine optimal configurations at design-time given a goal and application. Such policies are rigid: they cannot adapt to unknown workloads and must be redesigned for different memory configurations and technologies. We propose AXES: the first self-optimizing runtime manager for coordinating configurable approximation knobs across all levels of the memory hierarchy. AXES continuously updates and optimizes its approximation management policy throughout runtime for diverse workloads. AXES optimizes the approximate memory configuration to minimize power consumption without compromising the quality threshold specified by application developers. AXES can (1) learn a policy at runtime to manage variable application quality of service (QoS) constraints, (2) automatically optimize for a target metric within those constraints, and (3) coordinate runtime decisions for interdependent knobs and subsystems. We demonstrate AXES' ability to efficiently provide functions 1-3 on a RISC-V Linux platform with approximate memory segments in the on-chip cache and main memory. We demonstrate AXES' ability to save up to 37% energy in the memory subsystem without any design-time overhead. We show AXES' ability to reduce QoS violations by 75% with < additional energy.

Keywords: Approximate Computing, Memory Hierarchy, Model-free Control, RISC-V
As applications become increasingly resource-intensive, trading off performance and energy in battery-powered systems is crucial. Application profiling has revealed memory accesses as one of the most significant performance and energy bottlenecks [1]. Approximate memory is an effective way to alleviate the energy bottleneck in memory for applications that can tolerate output errors caused by inexact memory load/store operations, potentially improving energy consumption, leakage power, latency, or lifetime. The inexactness stems from relaxing the need for high-precision storage for some data structures in the application.

Approximation techniques for different types of memories with configurable degrees of approximation have been previously explored [2, 3, 4, 5, 6, 7], but are typically limited to one or a few levels of the memory hierarchy. A holistic solution would need to manage approximation knobs (e.g., V_DD for SRAM caches, t_REF for DRAM main memory) across the entire memory hierarchy, from cache to main memory, in order to fully exploit memory approximation opportunities. Furthermore, runtime dynamic reconfiguration of approximation knobs is required to fully leverage the performance/energy tradeoff while honoring application goals (e.g., quality of service (QoS) targets) and system constraints (e.g., minimizing energy consumption).

Current approximation techniques using runtime dynamic reconfiguration still depend on design-time modeling of workloads (application along with input) to determine the optimal operating knobs for a specific system configuration. Such techniques are application-specific and cannot be ported to new systems. However, memory technologies are changing very rapidly, and the dependency on design-time workload profiling introduces significant overhead to utilizing approximate memory in emerging memory technologies. Furthermore, runtime dynamic reconfiguration across the memory hierarchy requires coordination between multiple knobs (e.g., L1 V_DD, L2 V_DD, and DRAM t_REF). Coordination is challenging because knobs in one layer can affect other subsystems directly or indirectly (e.g., write errors in L1 affect reads in higher layers).

Figure 1: Runtime management of approximation knobs using output quality monitoring.

Consider the system shown in Figure 1. The memory hierarchy (in this case: L1 cache, L2 cache, and main memory) exposes tunable knobs (e.g., operating voltage for L1 and L2, and data refresh period for main memory) that control the degree of approximation. Each knob introduces a new degree of freedom and increases the configuration space exponentially. Furthermore, satisfying even a single objective function poses non-trivial optimization challenges, with an additional level of complexity arising from the optimization of multiple objective functions to determine the optimal system configuration. Researchers have proposed frameworks for exploring the configuration space at design-time and determining static optimal knob settings for an approximate memory hierarchy before deployment [8, 9]. More flexible solutions have been proposed to provide dynamic configuration of knobs at runtime but require identifying workload-specific system dynamics at design-time [10, 11, 12].
A priori knowledge limits the ability to adapt to changing workloads, and further assumes that the system and workload are observable ahead of deployment. On the other hand, determining the optimal knob configuration for unknown applications and new inputs at runtime is an extremely challenging decision process.

To address these challenges, we develop AXES, a model-free method to tune memory knobs without any previous knowledge about the system (the memory hierarchy as well as the workload). AXES eases the design of systems with approximate memory by enabling deployment without design-time exploration of configuration knobs. AXES' methodology is independent of the underlying memory technologies and works regardless of the nature of the knobs. Once deployed, AXES can learn the optimal knob configuration for unknown applications, resulting in self-optimizing systems.
The main contributions of this paper are as follows:
1. Enable self-optimization of multi-level memory approximation knobs through AXES, a runtime resource manager using reinforcement learning. Self-optimization is demonstrated by finding the system configuration for unknown workloads at runtime, as well as the dynamic management of quality of service (QoS).
2. Enable coordination between multiple memory system knobs without explicit communication. Coordination is demonstrated through dynamic runtime reconfiguration of multiple knobs by continuously evaluating different subsystem configurations (e.g., ↑ L1 knob, ↓ L2 knob vs. ↓ L1 knob, ↑ L2 knob).
3. An approximate memory management approach that is (a) technology-agnostic, (b) application-independent, and (c) can easily be applied to any multi-level memory hierarchy.
4. Experimental case study: A software implementation of AXES is evaluated using an FPGA board with a modified RISC-V processing core to validate the approach.

We believe AXES will enable quick adoption of approximation using a variety of memory nodes.

Figure 2: Effect of configuration knobs on cache layers (L1 data cache and L2 shared cache) for two different applications (A and B), and different inputs within an application (B1 and B2 within B). The dot diameter indicates the number of errors (smaller is better: no dot means no errors), and the color indicates normalized total energy usage (normalized to the 1V:1V case). The outer circle represents the quality constraint which the system must meet. For knob configurations where there is no outer circle, the system fails to meet the quality constraint. Feasible operating regions that can achieve the target QoS are outlined in dashed rectangles, and the optimal setting is indicated by a star.
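The "star" in Figure 2 amounts to picking the minimum-energy configuration inside the feasible region. A minimal sketch of that selection, assuming illustrative (QoS error, energy) measurements rather than the measured data behind Figure 2:

```python
# Hypothetical per-configuration measurements: (L1 level, L2 level) -> (qos_error, energy).
# The numbers below are illustrative only, not taken from Figure 2.
measurements = {
    (0, 0): (0.00, 1.00),
    (1, 0): (0.02, 0.90),
    (0, 1): (0.03, 0.85),
    (1, 1): (0.06, 0.70),
    (2, 1): (0.12, 0.60),  # violates the quality constraint below
}

QOS_THRESHOLD = 0.10  # outer circle: maximum acceptable QoS error

# Feasible region: configurations meeting the quality constraint.
feasible = {cfg: qe for cfg, qe in measurements.items() if qe[0] <= QOS_THRESHOLD}

# The "star": minimum-energy configuration within the feasible region.
star = min(feasible, key=lambda cfg: feasible[cfg][1])
print(star)  # -> (1, 1)
```

Note that the lowest-energy configuration overall, (2, 1), is excluded because it violates the constraint; this is exactly the tension the runtime manager must navigate.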
The QoS delivered by a given configuration of approximation knobs varies widely based on the application and current input. Even for a fixed workload (application and input), the configuration space grows exponentially with each additional knob (e.g., one knob = 4 states, two knobs = 16 states, three knobs = 64 states). More examples of approximation knobs for different memory technologies are presented in Table 1. In a memory hierarchy, knobs are at least partially interdependent: changing one knob affects multiple subsystems in ways that are complex to predict (e.g., changing L1 V_DD introduces errors in L1, which subsequently propagate to L2). This makes the configuration problem extremely challenging.

To illustrate this challenge, consider a system equipped with an approximate memory subsystem, as shown in Figure 1. The application's source code is annotated with a quality monitor and is running on a system that supports approximate memory. The approximate memory subsystem consists of three layers of hierarchy, including an SRAM L1 cache memory, an SRAM L2 cache memory, and a DRAM main memory. These memories have an 'exact' and an 'approximate' region in which application data can be mapped. The degree of approximation varies based on the memory technology: in this work, the voltage level for the SRAM caches and the refresh rate for the DRAM main memory. Approximation can be controlled at each layer of the memory hierarchy, and the knob setting impacts the application QoS measured by the quality monitor. malloc calls from the application to the Linux kernel are modified by the developer to indicate which data can be mapped to approximate regions. The complete experimental setup is described in Section 5.

Figure 2 is an illustrative example showing variations in QoS observed across different configurations of the L1 and L2 approximation knobs, as well as across different applications. The DRAM knob is fixed for the sake of simplicity. The dots' size represents the QoS (i.e., number of errors; smaller is better). We observe the effect of configuration knobs on two applications:

- Application A: A memory write-read kernel that writes 512 64-bit numbers to main memory and then reads the numbers back from main memory. The QoS metric for this kernel is defined as the total number of bit flips that occur during the write-read cycle. The QoS and average energy for each knob configuration are shown in Figure 2a.
- Application B: The Canny edge detection application as described in Section 5.2. The QoS metric for this application is the RMSE (root mean square error) between the pixels of the approximate runs and the exact runs of the application. The QoS and average energy for knob configurations corresponding to two different inputs (i.e., scenes), B1 and B2, are shown in Figures 2b and 2c respectively.

We make the following key observations: First, we observe that configurations achieving a target QoS vary both within and between applications. In Figure 2, we define a feasible region (dashed rectangle) by identifying the set of configurations that achieve acceptable QoS. Depending on the workload, the feasible regions of operation are different.
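The exponential growth quoted above (4, 16, 64 states) is simply the Cartesian product of per-knob settings; a short sketch (the knob and level names are ours):

```python
from itertools import product

LEVELS = ["none", "low", "medium", "high"]   # four high-level settings per knob
KNOBS = ["L1_vdd", "L2_vdd", "DRAM_tref"]    # one knob per memory layer

# Joint configuration space the runtime manager must search grows as 4^n.
for n in range(1, len(KNOBS) + 1):
    space = list(product(LEVELS, repeat=n))
    print(n, len(space))  # 1 -> 4, 2 -> 16, 3 -> 64
```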
Table 1: Examples of approximate memory technology knobs.
Memory       Technology  Type          Technology Knobs                                Knob Objective                      Reference
Cache        SRAM        Volatile      Operating Voltage (V_DD)                        Energy savings                      ASPLOS'12 [13], ESL'15 [2]
Cache        STT-RAM     Non-Volatile  Read Voltage (V_read),                          Energy savings                      HPCA'11 [3], CASES'15 [14],
                                       Write Pulse Duration (t_write)                                                      ISLPED'17 [15]
Main Memory  DRAM        Volatile      Data Refresh Period (t_REF),                    Energy savings,                     ASPLOS'11 [4], ISLPED'14 [7],
                                       Operating Voltage (V_DD),                       Reduce latency                      MICRO'19 [16]
                                       Row Activation Delay (t_RCD)
Main Memory  PCM         Non-Volatile  Data Comparison Write (T_h)                     Energy savings,                     MICRO'09 [6], MICRO'13 [5]
                                                                                       Increase lifetime
The difference is seen in the varying bounding boxes of Figure 2a (Application A) and Figure 2b (Application B). Even within the same application, the acceptable regions of operation vary based on the dynamic inputs to the application at runtime, as seen in Figure 2b (input B1) and Figure 2c (input B2).

Second, we observe that even within the feasible regions, the achieved QoS varies across applications and inputs. In some cases, the outer circle and inner circle are well separated, implying that there is still room for approximation. However, in some cases, the inner circle is very close to the outer circle, implying that the QoS is reaching its threshold.

Third, we observe that even the same configuration of knobs (i.e., the same L1 and L2 settings) can have different power characteristics with respect to different applications and different inputs within an application. This results in varying optimal design points (marked with a star).

This simple example demonstrates that even for the same memory technology, it is hard to predict the resulting QoS and energy when knobs are changed in only two layers of the memory hierarchy; i.e., the dynamics between the system and application vary both within and between applications. We expect that finding the optimal configuration for additional layers of a memory hierarchy or new memory technologies will only exacerbate these challenges, with current state-of-the-art techniques (summarized next) insufficient for determining the complex interactions of knob configurations for multi-level approximate memories.

A popular approximation strategy is to use design-time techniques to find optimal knob configurations [17, 16, 18, 19]. Based on the application profile, approximation knobs are determined before deployment and are expected to meet the QoS requirements throughout the application's lifetime.
In an open-loop system, designers must design the system with the worst-case scenario in mind and are unable to exploit the full potential of the approximation knobs at runtime. Additionally, application programmers are burdened with the task of setting memory approximation knobs through intensive profiling of the target workloads at design time [16, 8, 10]. Thus, open-loop control techniques are application-specific and not portable to new systems.
To address the lack of reconfiguration in open-loop systems, state-of-the-art alternatives reconfigure approximation knobs using closed-loop controllers [20, 21, 22, 23]. The controllers are generated based on a system model identified at design-time. Closed-loop control aims to alleviate the programmer's burden at design-time by using feedback at runtime. Design-time models address the difficulty of specifying an under-designed memory's parameters by measuring the output accuracy in different settings. However, with the number of system parameters on the rise, system identification is becoming impractical for capturing the effects of one knob on another. Coordination in control theory requires a formal multiple-input multiple-output (MIMO) method, but designing a MIMO controller requires nontrivial design-time effort. Additionally, such models are rigid: models must be generated for each memory technology, with an underlying assumption that the system is available for observation ahead of deployment. Thus, closed-loop control techniques are also application-specific and suffer from significant design-time overhead.
A static model identified during development does not take into account complex system dynamics (e.g., variability between applications). As the configuration space increases due to the increasing number of knobs, self-learning intelligent agents without a priori knowledge are attractive candidates to find optimal solutions through runtime observation. Reinforcement learning [24] is a prevalent candidate in the field of self-learning agents, demonstrating success in decision-making for services such as recommendation engines and games. In this work, we utilize a model-free reinforcement learning approach to develop an approximate memory controller that can learn the behavior of knobs through runtime experience. Model-independent control techniques can provide a general-purpose solution, independent of the application and system dynamics.

Table 2: Memory approximation approaches and the key challenges addressed (∗ = uniquely addressed by AXES). AXES is compared against EDEN [16], control theory [11], AdAM [10], and DART [8] on the following features: Technology Independence, Memory Hierarchy coverage, Application Agnosticism (∗), Coordination (∗), Self-Adaptivity, Self-Optimization (∗), Model Independence, and Real System Evaluation. AXES addresses all eight features; those marked ∗ are addressed by AXES alone.
Approximate memory subsystems have been widely explored in the literature [25]. An integral step towards running an approximate application is to identify the non-critical sections of the data elements. Allowing faults in the critical data sections would lead to crashes and would require additional recovery mechanisms. Identifying non-critical data sections can either be done automatically [26, 27, 28] or through explicit programming language support [29]. Upon identifying the non-critical data, various memory approximation strategies can be implemented on systems, determined by the underlying memory hardware technology. Table 1 summarizes some of the standard technologies utilized throughout the memory hierarchy, along with the approximation objective. AXES is technology-agnostic and can leverage all of the technology knobs described in Table 1.

Several methods have been proposed in the context of tuning memory approximation knobs. Table 2 identifies the most recent and relevant research and compares AXES to these prior works. We define self-adaptivity as the ability to adapt to user-specified application goals or system constraints (e.g., an increased target QoS). We define self-optimization as the ability to find desirable system configurations given a fixed goal in the face of external disturbances (e.g., a scene change). Manual schemes rely on designer expertise to optimize approximation knobs (e.g., t_REF in DRAM). In EDEN [16], Koppula et al. show the effectiveness of manual tuning for neural networks, which have an intrinsic capacity for tolerating errors in memory accesses. EDEN uses approximate DRAM to reduce energy consumption and increase the performance of DNN inference. EDEN is limited to machine learning workloads and does not apply to a multi-level memory hierarchy. The absence of a runtime quality monitor in EDEN prevents dynamic reconfiguration of the approximation knobs (e.g., row activation delay t_RCD, operating voltage V_DD). Maity et al.
[11] have proposed a solution to maintain a quality target at runtime by using classical control theory. Quality configuration tracking is modelled as a formal quality-control problem, and black-box modelling is used to capture memory approximation effects with variations in application input and system architecture. However, this scheme assumes only one level of the memory hierarchy is tuned at runtime and fails to address the problem of coordination between multiple knobs.

In AdAM [10], Teimoori et al. investigate memory approximation by managing approximation knobs across the memory hierarchy. AdAM solves a design-time ILP optimization problem and uses a runtime algorithm to adapt to new tasks by re-estimating the execution time. Although optimization techniques are a natural choice for simple architectural tuning, the lack of a feedback mechanism makes AdAM too rigid for any sort of adaptivity (e.g., unknown inputs, and disturbances from other applications). Furthermore, their use-case only addresses a two-layer memory hierarchy, with an on-chip STT-RAM and an off-chip PCM main memory, and the design-time algorithm is technology-dependent.

In DART [8], Yarmand et al. propose a framework for a three-layer memory hierarchy (SRAM L1, SRAM L2, and an off-chip DRAM) without any technology-specific assumptions. DART uses a branch-and-bound algorithm to consider all possibilities at design time, and creates a search tree to perform error probability analysis. Although DART considers the full memory hierarchy, it requires the programmer to: (1) analyze the program during design time, (2) generate a memory profile for each application that would run on the system, and (3) estimate the worst-case probability of errors that would occur due to under-designed memory. Therefore DART requires a priori knowledge of the application and assumes that the system is available for full observation before deployment.

In the related topic of runtime resource management, machine learning approaches have gained traction recently. Researchers have investigated the feasibility of machine learning methods for quality configuration in the approximation domain [30, 31]. However, conventional machine learning methods require extensive training to learn the correlation between the system's inputs and outputs. Static models that are defined ahead of deployment fail to handle new situations outside of expected behavior. Online learning methods aim to address this issue and have shown promising results for resource management [32].

AXES incorporates the features highlighted in Table 2 using online learning methods. The AXES approach improves upon prior work by eliminating design-time modeling, being memory technology-agnostic, and coordinating multiple knobs at runtime to exploit approximation for multi-level memory hierarchies, enabling quick adoption of approximation for diverse platforms.

Figure 3: Overview of AXES system architecture.
Figure 3 presents the AXES realization of the logical architecture described in Figure 1, consisting of the following components: (1) A hardware platform with a processing unit, cache subsystem, and main memory. This hardware controls the degree of approximation at each memory layer by configuring the specific technology knobs available on the platform. Examples of technology knobs are in Table 1. The processor core contains special registers to set knobs (e.g., updating L1 V_DD) through special instructions. For instance, in our current RISC-V realization of the processor, we deploy unused control and status registers (CSRs) for this purpose, as shown in Figure 3. (2) Instructions that write to these CSRs form an extension of the processor's ISA (RISC-V in our implementation), and are used to manage approximate elements at runtime. Truffle [13] is another example of a micro-architecture design that efficiently supports these ISA extensions for disciplined approximate programming. Instructions supported through these new CSRs include:
- AX_ENABLE to enable approximation,
- AX_DISABLE to disable approximation,
- AX_L1_LEVEL to set the technology-specific knob for the Level 1 cache,
- AX_L2_LEVEL to set the knob for the Level 2 cache,
- AX_DRAM_LEVEL to set the technology-specific knob for DRAM.

(3)
A loadable kernel module that maps the high-level knobs (e.g., low approximation) to technology-specific knob values (e.g., V_DD). For new technologies, the information in the module should be updated to reflect the available actuation knobs (e.g., the available write pulse durations (t_write) for STT-RAM). The kernel module also allows applications to indicate which parts of the application's virtual memory can be placed physically in the approximate regions (explained further in Section 5.3). (4) The user application running on this platform, specifying the non-critical sections of the data using a malloc_approx() call to the kernel module of component (3). (5)
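The module's translation can be pictured as a small lookup table; the level names and device values below are illustrative placeholders, not the values shipped with the real kernel module's device models:

```python
# Hypothetical mapping from high-level knob levels to device-specific values.
# The numbers are placeholders for illustration only.
KNOB_TABLE = {
    "L1_vdd":    {"none": 1.0,   "low": 0.9, "medium": 0.8,  "high": 0.7},   # SRAM V_DD (V)
    "L2_vdd":    {"none": 1.0,   "low": 0.9, "medium": 0.8,  "high": 0.7},
    "DRAM_tref": {"none": 0.064, "low": 1.0, "medium": 10.0, "high": 20.0},  # refresh period (s)
}

def to_device_value(knob: str, level: str) -> float:
    """Translate a high-level approximation level into a technology-specific value."""
    return KNOB_TABLE[knob][level]

print(to_device_value("L1_vdd", "high"))  # -> 0.7
```

Swapping in a new memory technology then amounts to replacing one row of this table with the knob values of the new device, leaving the controller untouched.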
A quality monitor that computes the QoS periodically at runtime and reports it to AXES. (6) The current power of the system, sensed using power sensors. (7) The expected QoS specified by the user. The expected QoS can be updated at runtime to adapt to different system objectives (e.g., a strict quality constraint optimizes AXES for more accurate executions, whereas a relaxed quality constraint optimizes AXES for energy savings). (8)
The AXES Controller agent is the final component of the architecture and is responsible for runtime control of the memory approximation knobs.

The AXES controller agent is a model-free runtime manager for tuning configurable approximation knobs throughout the memory hierarchy. AXES follows the observe-decide-act (ODA) paradigm: the environment is observed through sensors during normal execution, and the decision-making agent is periodically invoked in order to (re)configure the system using knobs.

Figure 4: AXES taking actions against the environment; the environment returns observations (updated state) and rewards.

We design our decision-making logic by first defining our problem as a Markov Decision Process [33]: (S, A, P_a, R_a), where S refers to the state space, A refers to the action space, P_a refers to the transition probabilities from S → S′ given action a, and R_a refers to the expected reward for selecting action a in state s. As is common when controlling real systems, we do not know the system dynamics and assume they change continuously. This is a well-known problem, and to address it, we apply an appropriate established reinforcement learning solution, namely temporal difference (TD) learning [34].

Our goal is to design a decision-making agent that coordinates each layer in a unified 3-layer memory hierarchy to achieve acceptable application QoS while minimizing energy consumption. First, we must define the structure of our environment.

State (S): The state is a representation of the current system under control. In AXES, we define a state vector that consists of the high-level approximation settings (e.g., no/low/medium/high approximation) of each memory layer, as well as the current QoS error:

1. L1D: current level 1 data cache configuration
2. L2: current level 2 shared cache configuration
3. Main memory: current main memory configuration
4. Discretized QoS error (Q_threshold − Q), where Q is the measured QoS and Q_threshold is the constraint

This way, the state informs the agent what the current knob settings are, as well as how well they are achieving the goal of meeting the QoS requirement set by the application. This allows us to translate the dynamics between application behavior and hardware configuration. The QoS error is normalized to the worst-case QoS value (max_Q) to make AXES portable across applications, and high-level knobs allow AXES to be independent of memory technologies.
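A minimal sketch of this state encoding (the helper names are ours, and a plain linear bucket split stands in for the log-spaced buckets used by AXES):

```python
# State vector: (L1 level, L2 level, main-memory level, discretized QoS error).
# Knob levels are small integers (e.g., 0-3); the QoS error is bucketed.
N_BUCKETS = 16

def discretize_qos_error(q: float, q_threshold: float, q_max: float) -> int:
    """Bucket the normalized QoS error |q_threshold - q| / q_max.

    AXES uses log-spaced buckets; a linear split is used here to keep the
    sketch short.
    """
    err = max(0.0, min(1.0, abs(q_threshold - q) / q_max))
    return min(N_BUCKETS - 1, int(err * N_BUCKETS))

def make_state(l1: int, l2: int, mm: int, q: float, q_thr: float, q_max: float):
    """Assemble the discrete state tuple the agent indexes its Q-table with."""
    return (l1, l2, mm, discretize_qos_error(q, q_thr, q_max))

print(make_state(0, 2, 1, q=0.05, q_thr=0.1, q_max=1.0))  # -> (0, 2, 1, 0)
```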
Action (A): The action space contains all possible operations the agent may take to configure the system each time the agent is invoked. The AXES action vector consists of the relative changes to the high-level knobs for layers in the memory hierarchy:

1. L1D: Increase/Decrease/No change
2. L2: Increase/Decrease/No change
3. Main memory: Increase/Decrease/No change

Initially, the AXES policy does not have any information regarding what actions are desirable and must discover which actions yield the maximum reward in each state via exploration (e.g., when there is no QoS constraint, actions which decrease power yield the maximum reward).

Reward (R): The reward provides immediate feedback to the agent on how the previous state's action helped achieve the system goal. In our case, this goal is to find the optimal configuration corresponding to minimum energy with acceptable QoS. Using the normalized power consumption measured at regular intervals, we define the reward in an unconstrained system by Equation 1:

    reward_P = 1 − Power / maxPower,    reward_P ∈ {x | 0 ≤ x ≤ 1}    (1)

where Power is the measured power, maxPower is the power consumed when approximation is disabled at all layers of the memory hierarchy, and reward_P is the reward obtained in terms of optimizing power. This function represents a power optimization objective with a target power of zero. In an unconstrained system, operating at the highest power yields no reward, while operating at zero power yields the maximum reward. However, we must constrain the total reward in order to account for the quality threshold.

The policy should take actions which minimize the number of violations of the quality constraint specified by the application developer. Thus, the reward of a quality violation is calculated in Equation 2 as:

    reward_Q = − (Q − Q_threshold) / max_Q,    reward_Q ∈ {x | −1 ≤ x ≤ 0}    (2)

where reward_Q is the reward obtained by staying within the quality constraint. In case of violations, reward_Q is negative, indicating that an undesired action was performed by AXES, which led to a QoS violation.

Finally, the reward R is calculated from reward_P and reward_Q and reported to the agent by Equation 3:

    R = reward_P if Q ≤ Q_threshold; reward_Q otherwise    (3)

Algorithm 1: TD(λ) algorithm [24] for determining AXES policy.

    Algorithm parameters: step size α, discount factor γ, trace decay λ ∈ (0, 1]
    Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
    for each episode do
        E(s, a) = 0, ∀ s ∈ S, a ∈ A(s)
        Initialize S, A
        for each step of episode do
            Take action A, observe R, S′
            Choose A′ from S′ using policy derived from Q
            δ ← R + γ Q(S′, A′) − Q(S, A)
            E(S, A) ← E(S, A) + 1
            for each s ∈ S, a ∈ A(s) do
                Q(s, a) ← Q(s, a) + α δ E(s, a)
                E(s, a) ← γ λ E(s, a)
            end for
            S ← S′; A ← A′
        end for
    end for

4.2 AXES Agent: Model-free Control

Given the definition of the environment and goals, we simply need a decision-making mechanism (AXES) to find the optimal policy. Initially, the AXES agent does not have any information regarding the environment and explores the state-space by taking purely arbitrary decisions (actions). It uses temporal-difference (TD) learning [24] to learn directly from raw experience without a model of the environment's dynamics. Figure 4 shows the logical structure of the AXES agent and its relation to the environment, i.e., the system under control. The agent interacts with the environment through actions, and the environment provides rewards and updated state information to the agent.
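Equations 1-3 and the inner loop of Algorithm 1 can be sketched as tabular Python. The ε-greedy exploration policy and the exploration rate are our simplifications (the algorithm only says the policy is "derived from Q"); the learning parameters are those reported in Section 4.2:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, LAMBDA = 0.6, 0.1, 0.95   # learning parameters from Section 4.2
EPSILON = 0.1                           # exploration rate (our choice)

Q = defaultdict(float)                  # Q-table: (state, action) -> value
E = defaultdict(float)                  # eligibility traces

def reward(power, max_power, q, q_threshold, q_max):
    """Equations 1-3: power reward while QoS is met, negative penalty on violation."""
    if q <= q_threshold:
        return 1.0 - power / max_power          # Eq. 1, in [0, 1]
    return -(q - q_threshold) / q_max           # Eq. 2, in [-1, 0)

def choose_action(state, actions):
    """Policy derived from Q (epsilon-greedy)."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def td_lambda_step(s, a, r, s_next, actions):
    """One inner-loop step of Algorithm 1; returns the next action A'."""
    a_next = choose_action(s_next, actions)
    delta = r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
    E[(s, a)] += 1.0                            # accumulating trace
    for key in list(E):                         # backward update over traced pairs
        Q[key] += ALPHA * delta * E[key]
        E[key] *= GAMMA * LAMBDA                # trace decay
    return a_next
```

One call to td_lambda_step performs exactly the δ, Q, and E updates of Algorithm 1 for a single agent invocation, with the reward computed from the power and quality sensors.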
Figure 5: Sensitivity analysis of the memory configuration knobs (L1 knob setting, L2 knob setting, DRAM refresh period) on QoS (RMSE, top) and memory power (bottom), normalized to the no-approximation baseline (DRAM refresh period 0.064 s), for canny.

Actions that lead the system to optimize power without violating quality constraints are rewarded well. The policy is modeled as a state-action value function by keeping track of all the state variables, along with the possible actions, in the form of a table. Q-learning [35] is a popular TD control algorithm. Q-learning aims to learn a state-action value function, Q, which directly approximates q∗, the optimal state-action value function. A variation of Q-learning combines eligibility traces to obtain a more general method that may learn more efficiently. Eligibility traces look backward to recently visited states and act as short-term memory. This algorithm, where Q-learning is combined with a backward short-term memory using eligibility traces, is known as TD(λ) [34].

AXES uses the TD(λ) algorithm to continuously update and optimize the approximation management policy throughout runtime. The detailed algorithm is outlined in Algorithm 1. The dilemma presented during any controller design is determining the control parameters, whether the implementation uses classical control theory or reinforcement learning. In the TD(λ) algorithm, the learning parameters have interpretable meaning, so they can be set in several ways, e.g., using designer intuition or empirical observation. In our case we determine the learning parameters (α = 0.6, γ = 0.1, and λ = 0.95) empirically by simulating our control logic on system traces for canny. No matter the controller deployed, these parameters must be determined. However, we define our control logic in such a way that the parameters apply to the type of control (i.e., memory approximation knobs), as opposed to the application under control (i.e., edge detection).

A Q-table is formed that maintains the Q-value of each state-action pair (Figure 4). The agent is invoked periodically and performs the following steps during each invocation:

1. Measure the power and QoS to evaluate the reward R
2. Update the table ( Q values) based on reward R
3. Sense the current approximation levels and QoS to determine the current state S
4. Given the current state S and the updated Q values, select the next action A

As described earlier, Figure 3 outlines the AXES system architecture, with the AXES Controller agent ⑧ in software responsible for runtime control of the hardware memory approximation knobs. Our implemented environment consists of a unicore RISC-V processor with a three-layer memory hierarchy: L1 SRAM data cache, L2 SRAM shared cache, and DRAM main memory. AXES is implemented in software and runs in userspace. A Loadable Kernel Module accompanies AXES, which incorporates the device-specific translations from high-level configurations (e.g., 25% approximation) to technology-specific values (e.g., a V_DD setting for the SRAM caches).

Table 3: System configuration used for AXES evaluation.
Component            Configuration
Cores                1
TLBs                 16 entries
L1 D-Cache           Number of sets, ways
L2 Cache             Number of sets, ways
Floating-Point Unit  Present
Main Memory          Onboard (512 MB, 800 MHz DDR3)
Clock frequency      30 MHz

The state vector S is made up of (1) the high-level configurations corresponding to the memory layers, and (2) the QoS error. All of the vector values are represented as discrete integers. The L1 and L2 voltage levels (V_DD) take four discrete settings starting from 0.7 V [8]. The main memory refresh period takes four discrete settings, the longest being 20 s [4]. The QoS error is normalized and discretized into 16 logarithmically spaced buckets. Including the QoS error in the state explicitly differentiates desirable actions for the same voltage level, depending on the current QoS error.

The action vector A contains a field for adjusting each of the L1, L2, and main memory knobs. The possible knob configurations are voltage levels for the L1 and L2 caches and refresh periods for DRAM main memory. Actions for each knob consist only of increase by one, decrease by one, or remain the same. To keep the action space manageable, we performed a sensitivity analysis on the knob in each memory layer. Figure 5 shows the sensitivity of the approximation knobs on the output QoS and power. We make three observations. (1) As we move up the memory hierarchy (i.e., from L1 to L2 to main memory), the quality is less affected by higher degrees of approximation. (2) The contribution of memory power from individual levels of the hierarchy varies: although main memory techniques can save around 23% of DRAM power, when the full memory hierarchy is considered, DRAM's power savings saturate at 12%. (3) Four knob configurations capture the range of power/quality trade-off effects while keeping the state space manageable. We conclude that four knob configurations per level provide sufficient control for reaching our goal.

Reward R is calculated based on Equation 3. To evaluate the reward, AXES uses software-level sensors to determine the application's output quality. Although the metric is domain-specific and is generated by a quality monitor, normalizing it to the worst quality keeps AXES domain-agnostic. We update the Q values using the reward as specified in Algorithm 1.

To demonstrate the efficacy of AXES for coordinating knobs in the memory hierarchy, we deploy a hardware platform that mimics the effects of approximation. The effect of the approximation knobs in each layer of the memory hierarchy is determined using existing models from the literature [8, 4].
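The agent's periodic invocation (steps 1-4 above) can be sketched as a tabular TD(λ) update. This is a minimal illustrative sketch, not the authors' implementation: the state/action encodings are simplified, ε-greedy action selection is assumed, and the reward computation stands in for Equation 3; α, γ, and λ take the values given in the text.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, LAM = 0.6, 0.1, 0.95  # learning parameters from the text
EPSILON = 0.1                        # exploration rate (our assumption)

Q = defaultdict(float)  # Q[(state, action)] -> estimated value
E = defaultdict(float)  # eligibility traces (short-term memory)

def select_action(state, actions):
    """Epsilon-greedy selection over the Q-table."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def td_lambda_step(s, a, reward, s_next, actions):
    """One periodic invocation: reward -> table update -> next action."""
    a_next = select_action(s_next, actions)
    delta = reward + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
    E[(s, a)] += 1.0  # mark the just-taken pair as recently visited
    for sa in list(E):
        # Unlike one-step Q-learning, every recently visited pair is updated.
        Q[sa] += ALPHA * delta * E[sa]
        E[sa] *= GAMMA * LAM  # traces decay: older pairs receive less credit
    return a_next
```

Here a state would encode the discrete L1/L2 V_DD levels, the DRAM refresh-period knob, and the bucketed QoS error, and an action would be a per-knob increment, decrement, or hold.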
For the AXES architecture described in Section 4, we describe our experimental setup for the RISC-V hardware platform on which AXES runs (① in Figure 3). An overview of our evaluation platform is shown in Figure 6. We implement AXES in software running on OpenPiton [36], an open-source framework designed to enable scalable architecture research prototypes. We use OpenPiton with a single Ariane [37] core, a 64-bit RISC-V core capable of running Linux. The framework is synthesized on a Digilent NexysVideo board with a Xilinx Artix-7 FPGA (XC7A200T-1SBG484C). The parameters used to synthesize the system are summarized in Table 3.
We further modify the Ariane core to support fault injection throughout the memory hierarchy. The synthesized core,running on the NexysVideo board, does not have an option to configure real knobs for approximation. However, werely on existing works that map device-specific approximation knobs to the observable bit error rates [8, 4]. Thus, weintroduce bit errors through fault injection in order to emulate the effect of approximation knobs.
The RISC-V specification defines separate addresses for Control and Status Registers (CSRs) associated with each hardware thread [38]. Unused CSRs are utilized by the kernel to communicate the information required for the configuration of the approximation knobs.

Figure 6: Modification of the Ariane RISC-V core to emulate on-chip approximate memory. ① Addition of new CSRs to communicate with the AXES kernel module. ② Modification of the address translation logic in the Memory Management Unit (MMU) to generate the approx signal. ③ Fault injectors, which introduce errors on the memory bus.

In Figure 6, the additional CSRs are denoted with ①. In particular, the following information is stored in CSRs:
1. L1 data cache read and write Bit Error Rate
2. L2 shared cache read and write Bit Error Rate
3. Starting and ending physical address of the non-critical memory segment
The Bit Error Rates correspond to specific memory nodes and are translated from the technology-specific values described in Section 5.5. The information from the CSRs is propagated to the ②
Memory Management Unit (MMU), where address translation takes place. The MMU uses this information to generate an additional approx bit, along with the index and tag bits, to indicate that the address is in the valid range of approximation. The approx bit generation is repeated whenever a virtual address is converted to a physical address. The approx bit, in conjunction with the CSRs for Bit Error Rate, is utilized by the cache controller to control the degree of approximation and contain it to the non-critical parts of the application. A Fault Injector (FI) module is used to emulate the effects of approximation by introducing bit flips on the memory bus. Four FI modules are instantiated in the cache subsystem, as shown in Figure 6 ③. The FI modules generate a bit-flip mask for each memory access using a Linear-Feedback Shift Register (LFSR) that introduces randomness in the injected errors.

The FIs are located in (1) the Data Cache Memory, emulating the bit flips corresponding to L1 data reads and L2 reads, (2) the Write Buffer, emulating the bit flips corresponding to L1 data writes, and (3) the Miss Unit, emulating the bit flips corresponding to L2 writes.

DRAM cells store data in capacitors that lose charge over time. To keep the data consistent, the DRAM cells have to be refreshed periodically. DRAM cells' strength is non-uniform due to manufacturing variability, i.e., some DRAM cells lose charge faster than others. The number of bit flips in DRAM increases as the refresh period increases, because more DRAM cells lose charge before they are refreshed. These bit flips also depend on when the data was written into and read from the DRAM. Therefore, implementing an FI module for DRAM requires keeping track of the faulty DRAM cells for each refresh-rate knob and the hold times of the data in each DRAM cell.

Table 4: Applications used for AXES' evaluation along with their inputs and QoS.
Application        Domain            Input Size           Quality Metric
canny [39]         Image Processing  352x288 (Grayscale)  Image Diff (RMSE)
k-means [40]       Machine Learning  426x240 (RGB)        Image Diff (RMSE)
blackscholes [40]  Finance           4K entries           Avg. Relative Error

Given the DRAM size, maintaining this information requires a lookup table of impractical size on an FPGA. The lookup table also introduces considerable latency in DRAM reads/writes. To emulate DRAM errors, we therefore implement a software-based FI for DRAM. Initially, a map DRAM_MAP of faulty DRAM cells for the maximum refresh period (20 s) knob is generated randomly using a uniform distribution. The faulty DRAM cells for higher refresh rates are a subset of DRAM_MAP. The data being loaded into the DRAM is modified using DRAM_MAP and the current refresh-rate knob. The exact read and write accesses to the DRAM are not impacted.
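Both fault-injection mechanisms can be modeled in software along the following lines. This is an illustrative sketch, not the RTL or kernel implementation: the LFSR width and taps, the per-bit thresholding against the Bit Error Rate, and the deterministic thinning used for DRAM_MAP subsets are our assumptions.

```python
import random

class FaultInjector:
    """Model of a cache-side FI: LFSR-driven bit-flip mask per access."""

    def __init__(self, bit_error_rate, width=64, seed=0xACE1):
        self.ber = bit_error_rate  # probability of flipping each bus bit
        self.width = width         # modeled memory-bus width (assumed)
        self.lfsr = seed           # 16-bit Fibonacci LFSR state

    def _rand(self):
        # Maximal-length taps: x^16 + x^14 + x^13 + x^11 + 1
        bit = ((self.lfsr >> 0) ^ (self.lfsr >> 2) ^
               (self.lfsr >> 3) ^ (self.lfsr >> 5)) & 1
        self.lfsr = (self.lfsr >> 1) | (bit << 15)
        return self.lfsr / 0xFFFF  # pseudo-random value in (0, 1]

    def corrupt(self, word, approx):
        """XOR a bit-flip mask onto one word if its approx bit is set."""
        if not approx:
            return word  # exact accesses pass through untouched
        mask = 0
        for i in range(self.width):
            if self._rand() < self.ber:
                mask |= 1 << i
        return word ^ mask

def build_dram_map(n_cells, max_period_ber, seed=42):
    """DRAM_MAP: faulty cells at the maximum (20 s) refresh period."""
    rng = random.Random(seed)
    return {c for c in range(n_cells) if rng.random() < max_period_ber}

def faulty_cells(dram_map, knob, n_knobs):
    """Faulty cells for a given knob; always a subset of DRAM_MAP."""
    keep = (knob + 1) / n_knobs  # knob 0: shortest period, fewest faults
    return {c for c in dram_map if (c % 1000) / 1000 < keep}

def load_with_errors(bits, dram_map, knob, n_knobs):
    """Flip faulty bits as data is loaded into the approximate section."""
    bad = faulty_cells(dram_map, knob, n_knobs)
    return [b ^ 1 if i in bad else b for i, b in enumerate(bits)]
```

In the hardware, four `FaultInjector` instances sit on the L1/L2 read and write paths, each programmed with the Bit Error Rate from the corresponding CSR; the DRAM functions mirror the software FI applied when data enters the approximate partition.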
The AXES methodology is well suited for a large class of workloads that have a high intensity of memory operations(e.g., video processing, machine learning). Table 4 summarizes the applications used for AXES’ case study. (1)
Canny edge detection [39] operates on a video stream and marks the edges in each frame. (2) k-means is a machine-learning application [40] that partitions three-dimensional input points (RGB pixels) into 6 clusters, and (3) blackscholes [40] is a financial-analysis application that solves partial differential equations to perform price estimations.

The applications' source code is modified to indicate which data elements are non-critical. Several techniques have been explored in the literature [41, 42] to systematically analyze and report how different parts of an application are affected by errors. Depending on the application, there are one or more candidate data segments (e.g., image data, video data, signal data) for accuracy/energy tradeoffs. We identify these segments in the source code, and replace malloc() calls to the kernel with malloc_approx() calls. For canny, the image buffer is marked as a non-critical section. For k-means, the data structure for the image buffer is modified to separate the non-critical pixel data, and the raw pixel information is converted from a float representation to an unsigned char representation, since each pixel value lies between 0 and 255. For blackscholes, the buffer data structure is left unmodified: the non-critical approximate memory consists of a buffer of floats. Thus, errors can impact the bits differently and, in the case of extreme approximation, may produce very large relative errors. The malloc_approx() calls are intercepted by a custom Linux Kernel Module, described in the next section.

In addition to specifying the non-critical data elements, a quality monitor specific to the application domain is required. The quality monitor is a lightweight software routine invoked to evaluate the application QoS and used to calculate the reward as described in Section 4.1.3. The QoS metric indicates the quality degradation caused by the configuration of approximation knobs.
Typically, application developers provide a software routine that is capable of measuring the quality at runtime. In canny, the QoS is determined by evaluating the Root Mean Square Error (RMSE), the square root of the mean of the squared pixel differences between an exact result and an approximate result. For k-means and blackscholes, the quality monitors are RMSE and Average Relative Error, used directly from AxBench [40]. This software routine is invoked during runtime, and the result of an exact run of the application is compared with the approximate version [43, 44]. The quality evaluation is not repeated for every input, so that the benefits of approximation can justify the overhead. Depending on the status of learning, the frequency of quality evaluation should be adapted. A detailed overhead analysis is presented in Section 6.5. If additional cores are available, the ground-truth comparison can be performed in parallel. In our unicore setup, this incurs unavoidable overhead during the initial exploration phase.
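The two quality monitors take the following shape (a sketch; the actual monitors come from the applications and AxBench [40]):

```python
import math

def rmse(exact, approx):
    """Root Mean Square Error between exact and approximate pixel buffers."""
    n = len(exact)
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(exact, approx)) / n)

def avg_relative_error(exact, approx, eps=1e-12):
    """Average Relative Error, as used for blackscholes.

    eps guards against division by zero for zero-valued exact entries.
    """
    n = len(exact)
    return sum(abs(e - a) / max(abs(e), eps) for e, a in zip(exact, approx)) / n
```

The reward then normalizes the monitored value against the worst observed quality, which is what keeps AXES domain-agnostic.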
We develop a Loadable Kernel Module (LKM) as middleware between the user application and the CSRs. malloc_approx() calls from the applications are intercepted by the LKM, and a contiguous physical segment is allocated using mmap. The starting and ending addresses of the segment are written to the additional CSRs. Whenever the user application loads/stores data, the MMU compares the memory address in hardware against these CSRs to check whether it lies in the non-critical segment.
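The hardware address check amounts to a range comparison against the two segment-bound CSRs; a behavioral model of that comparison (names and widths are ours) is:

```python
class ApproxSegmentCheck:
    """Behavioral model of the MMU check against the segment-bound CSRs."""

    def __init__(self, seg_start, seg_end):
        # Written by the LKM when it services a malloc_approx() call.
        self.seg_start = seg_start
        self.seg_end = seg_end

    def approx_bit(self, paddr):
        """1 iff the physical address falls in the non-critical segment."""
        return int(self.seg_start <= paddr < self.seg_end)
```

The resulting bit accompanies the index and tag bits of every access, so only loads/stores into the malloc_approx() segment are subject to fault injection.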
Figure 7: (a) Bit error rate for a 6T SRAM cell with varying V_DD values in 65 nm; data from [8]. (b) Bit error rate and self-refresh power saving for different refresh cycles in a DRAM array; data from Flikker [4].
The evaluation platform does not come equipped with on-board power sensors. Thus, we use Sniper [45] simulations and McPAT [46], along with existing power models from the literature [8, 4], to compute the power for different knob settings. Each new input to the application is simulated in Sniper, and McPAT is invoked to estimate the power and energy consumption of the different system components. The power and energy are then scaled according to the technology models (described below) to estimate the power values of the different knob configurations.
On-chip SRAM: When scaling the supply voltage (V_DD) in SRAM cells, read and write errors are dominant; hence, hold failures are not considered here. We use a model for a 6T SRAM at the 65 nm node from [8] for comparison with related memory approximation work. The Bit Error Rate corresponding to the relative power supply voltage is shown in Figure 7a.
Off-chip DRAM : We employ the power model proposed in Flikker [4]. We assume that the DRAM is partitioned intotwo sections: (1) 1/4 exact DRAM having a high refresh rate, and (2) 3/4 DRAM having a lower refresh rate based onthe approximation knob. The corresponding power model is shown in Figure 7b.
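The 1/4-exact, 3/4-approximate split implies a simple weighted power estimate (a sketch; `saving_fraction` for a given refresh-period knob would be read off the Flikker curve in Figure 7b):

```python
def dram_power(p_full_refresh, saving_fraction):
    """Estimated DRAM power with a partially refresh-relaxed array.

    p_full_refresh:  power of the array when fully refreshed (normalized).
    saving_fraction: self-refresh power saving of the relaxed section,
                     taken from the refresh-period knob's model [4].
    """
    exact = 0.25 * p_full_refresh                           # high refresh rate
    approx = 0.75 * p_full_refresh * (1 - saving_fraction)  # relaxed refresh
    return exact + approx
```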
In this section, we demonstrate AXES' ability to learn directly from raw experience, without requiring any model of the environment's dynamics. These experiments are evaluated against canny.

First, we evaluate AXES' ability to learn an optimal policy to minimize energy from scratch. We compare two TD reinforcement learning algorithms: TD(λ) and Q-learning. The primary difference between the methods is that TD(λ) uses bootstrapping. For both algorithms, we determined the learning parameters empirically using a simulated workload (discussed in Section 4.2).

Without any QoS constraints, the approximation knobs should be set to the configuration corresponding to the lowest power. The goal of a policy should be to reach the optimal configuration as quickly as possible. Figure 8 shows the comparison of the two methods, along with the optimal configuration corresponding to maximum energy savings. The plots are averaged over 16 runs to remove the effect of any outliers.

The x-axis in Figure 8 represents frames processed, and the y-axis represents average memory power (normalized to the exact execution) for each episode of the canny edge detection application [39]. We make two major observations. First, both algorithms eventually converge to the optimal policy. Second, in Figure 8a, when the configuration space is small (i.e., we restrict the allowable knob settings), Q-learning and TD(λ) converge at an equal rate. However, when the configuration space is increased (Figure 8b), TD(λ) improves its policy faster than Q-learning because it uses short-term memory in the form of eligibility traces. The traces are used to update multiple state-action pairs based on the reward obtained, instead of just one state-action pair per step as in Q-learning. We conclude that, with growing complexity in configuration knobs, TD(λ) is the better algorithm. Thus, for the rest of the experiments, we use only the TD(λ) algorithm in AXES.
Figure 8: Power consumption (normalized to exact execution) achieved by the different learning algorithms provided a goal to minimize power: (a) configuration space = 64 states; (b) configuration space = 704 states. Ideally, the policy should learn to reduce power consumption as quickly as possible toward the minimum (black dashed line).
To show that AXES is capable of self-optimizing the approximation knobs in the memory hierarchy within the QoS budget specified by the application, we study AXES' behavior for previously unseen inputs. We expose an agent with an established policy to varying inputs and compare it to the state-of-the-art approximation management policy DART [8]. DART is a design-time technique that uses a branch-and-bound algorithm to consider the worst-case effects of all possible approximation knob configurations for a memory hierarchy. We train DART on a set of scenes used during the policy initialization phase. The goal is to honor the QoS constraint specified by the application while maximizing energy efficiency; in our case, this means keeping the RMSE below a specified value. For a QoS constraint of 10 RMSE, DART statically sets the L1 V_DD, L2 V_DD, and DRAM T_REF knobs. The energy/frame reported in all results is normalized with respect to a fixed reference configuration of the L1 V_DD, L2 V_DD, and DRAM T_REF knobs.

In Figure 9a, a QoS constraint of 28 (RMSE) has been specified by the user, marked with a black line. The key observations are as follows: (1) Frame 32 is a key frame where a scene change occurs. Neither of the tested policies has experienced this scene previously, and the scene requires a new configuration of knobs to meet the QoS requirement. At frame 32, DART immediately violates the QoS constraint and continues to do so. AXES can take actions and reach a new configuration while remaining within the quality constraint. Initially, when AXES detects an overshoot, it penalizes the current action and acts to reduce the QoS error. This leads to a conservative state, with more room for QoS relaxation. Then, AXES increases the degree of approximation. In the subsequent cycles, AXES self-optimizes until it reaches a stable state. (2) Figure 9b shows the average power for each frame.
There is no significant change in power with DART because it uses a fixed knob configuration at runtime; the normalized energy/frame required by AXES is only slightly higher than that required by DART. (3) We define QoS overshoot as the area under the curve for the regions of QoS violations during execution. For DART, the QoS overshoot is 200, versus 50 for AXES. Thus, DART violates the QoS requirement 4× more than AXES (Figure 9a). This means that AXES can reduce QoS violations by 75% with little additional energy. These experiments demonstrate that AXES is self-optimizing, i.e., it can continuously learn configurations that meet the QoS constraint when exposed to unknown inputs.

Figure 9: AXES self-optimizing power within a quality constraint: (a) quality of service; (b) memory power normalized over exact computation; (c) L1 and L2 cache V_DD knobs; (d) DRAM refresh period knob.
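The QoS-overshoot metric defined in observation (3) can be computed from a per-frame QoS trace as a discrete area under the violation regions (a sketch of the metric, not the authors' exact integration):

```python
def qos_overshoot(qos_trace, constraint):
    """Discrete area under the curve where QoS error exceeds the constraint."""
    return sum(q - constraint for q in qos_trace if q > constraint)
```

With the reported overshoots of 200 (DART) and 50 (AXES), the violation reduction is 1 - 50/200 = 75%.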
To show that AXES is capable of coordinating interdependent memory knobs, we study AXES’ behavior when exposedto varying quality constraints. The goal is to adapt to new targets specified by the user and reconfigure the knobs tosave more energy while processing the frames. Figure 10 shows how AXES behaves in the scenario described. Thekey observations here are as follows: (1) Figure 10a shows measured QoS compared to the dynamic QoS constraint.Initially, the QoS constraint is 10. At frame 30, it is relaxed and updated to 85, thus exposing an opportunity to conserveenergy. Again, at frame 60, the constraint is changed to 30. AXES can self-adapt to find new configuration knobs thatmeet the constraints each time they are changed. (2) Figure 10b shows the normalized power for each frame. Initially,when the QoS constraint is 10, the memory power is around
80%. When the QoS constraint is relaxed to 85 at frame 30, AXES lowers the memory energy consumption by finding a new configuration of knobs, and it keeps operating in that region until the constraint changes again. Overall, AXES lowers the normalized energy/frame relative to the exact configuration. We therefore demonstrate that AXES is capable of (1) self-adapting to new quality constraints specified by applications through coordination, and (2) continuously converging on optimal configurations.

Figure 10: AXES self-adapting to user-specified quality constraints through coordination across the memory hierarchy: (a) quality of service; (b) memory power normalized over exact computation; (c) L1 and L2 cache V_DD knobs; (d) DRAM refresh period knob.
Figure 11: Additional workloads: (a) kmeans QoS; (b) blackscholes QoS; (c) kmeans memory power; (d) blackscholes memory power.
Figures 11a and 11c show AXES' results for kmeans, and Figures 11b and 11d show AXES' results for blackscholes. We observe that even though the interdependent dynamics between all three layers of the memory hierarchy and the achievable QoS/power are complex and application-dependent, AXES is able to meet the dynamic quality constraints by continuously finding new configurations corresponding to the new system goals.
Figure 12: AXES QoS at different invocation intervals (1, 2, 5, and 10 frames) for k-means. Blue ticks indicate evaluation instances.

All runtime approximation strategies suffer from two primary sources of overhead: (1) calculating the QoS value, and (2) runtime management. As mentioned in Section 5.2, to reap the benefits of approximation, AXES is not invoked for every input once the Q-values have been populated. In most cases, the quality monitor (e.g., the errors between pixels in canny edge detection) is embarrassingly parallel; if an additional core is available, the ground-truth comparison can be performed in parallel. If AXES is invoked too frequently, the compute and energy overhead of the ground-truth comparison would not be justified. In Figure 12, we compare the QoS of the k-means application at different invocation intervals (marked with blue ticks). Whenever the system goals change, the subsequent five frames are always updated to adapt to the new setting. We observe that AXES can adjust to new goals with reduced invocation frequency, at the risk of potentially missing self-optimization opportunities between invocations. Even forgoing regular invocation completely, AXES can be used in an event-driven manner (e.g., updated at design time when a new system is developed, at runtime when a new application is available for approximation, or when the goals (QoS constraints) change). State-of-the-art alternatives do not self-optimize for these situations for a full memory hierarchy.

In Figure 13, we compare the compute overhead and the energy savings of AXES for various invocation periods based on the number of frames.
Based on these observations, we invoke AXES every five frames in all of our evaluations. The overheads of the current version of AXES are not prohibitive for the presented architecture and execution scenario. However, the investigated architecture is unicore; in a many-core system, the proposed control structure will not scale well due to configuration complexity if a single agent is responsible for configuring the entire memory system.
Similarly, the current reward calculation is based on the quality reports of a single application utilizing the approximate memory segments; multiple QoS applications running concurrently and sharing approximate segments will complicate the agent. We believe a hierarchical or multi-agent architecture would effectively tackle the challenges of more complex systems; this is a topic for future investigation.
Figure 13: AXES compute overhead and power reduction (as % of baseline) for different invocation periods.
In this paper, we address the challenge of designing a technology-agnostic runtime manager for approximate memory hierarchies. To this end, we propose AXES, the first self-optimizing, model-free runtime manager for tuning approximation knobs in a unified multi-layer memory hierarchy to achieve acceptable QoS for diverse workloads while minimizing energy consumption. AXES uses temporal difference (TD) learning to learn directly from experience, without a model of the environment's dynamics. We develop an experimental case study to evaluate AXES on a modified RISC-V processing core. We demonstrate the efficacy of AXES for configuring knobs when (1) there is no quality constraint and the system objective is to minimize power, (2) there is a quality constraint and unknown inputs require reconfiguration of the knobs, and (3) the user dynamically changes the QoS constraints. AXES can automatically control approximate memory hierarchies regardless of memory technology or application. Due to these advantages, we believe that AXES has the potential to enable extremely energy-efficient systems across memory devices, parameters, and technologies.
Acknowledgements
This work was partially supported by NSF grant CCF-1704859.
References

[1] Aaron Carroll, Gernot Heiser, et al. An analysis of power consumption in a smartphone. In Proceedings of the Annual Technical Conference, volume 14, page 21, USA, 2010. USENIX Association.
[2] Majid Shoushtari, Abbas BanaiyanMofrad, and Nikil Dutt. Exploiting partially-forgetful memories for approximate computing. IEEE Embedded Systems Letters, 7:19–22, 2015.
[3] Clinton W Smullen, Vidyabhushan Mohan, Anurag Nigam, Sudhanva Gurumurthi, and Mircea R Stan. Relaxing non-volatility for fast and energy-efficient stt-ram caches. In IEEE 17th International Symposium on High Performance Computer Architecture, pages 50–61. IEEE, 2011.
[4] Song Liu, Karthik Pattabiraman, Thomas Moscibroda, and Benjamin G Zorn. Flikker: saving dram refresh-power through critical data partitioning. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 213–224, New York, NY, USA, 2011. Association for Computing Machinery.
[5] Adrian Sampson, Jacob Nelson, Karin Strauss, and Luis Ceze. Approximate storage in solid-state memories. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 25–36, New York, NY, USA, 2013. Association for Computing Machinery.
[6] Moinuddin K Qureshi, John Karidis, Michele Franceschini, Vijayalakshmi Srinivasan, Luis Lastras, and Bulent Abali. Enhancing lifetime and security of pcm-based main memory with start-gap wear leveling. In , pages 14–23, New York, NY, USA, 2009. Association for Computing Machinery.
[7] Kyungsang Cho, Yongjun Lee, Young H Oh, Gyoo-cheol Hwang, and Jae W Lee. edram-based tiered-reliability memory with applications to low-power frame buffers. In
Proceedings of the International Symposium on Low Power Electronics and Design, pages 333–338, New York, NY, USA, 2014. Association for Computing Machinery.
[8] Roohollah Yarmand, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. Dart: A framework for determining approximation levels in an approximable memory hierarchy. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28:273–286, 2019.
[9] Jiajia Jiao. Heap: A holistic error assessment framework for multiple approximations using probabilistic graphical models. Electronics, 9:373, 2020.
[10] Mohammad Taghi Teimoori, Muhammad Abdullah Hanif, Alireza Ejlali, and Muhammad Shafique. AdAM: Adaptive approximation management for the non-volatile memory hierarchies. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 785–790. IEEE, 2018.
[11] Biswadip Maity, Majid Shoushtari, Amir M Rahmani, and Nikil Dutt. Self-adaptive memory approximation: A formal control theory approach. IEEE Embedded Systems Letters, 12:33–36, 2019.
[12] Biswadip Maity, Majid Shoushtari, Amir M Rahmani, and Nikil Dutt. Simulation Infrastructure and System Dynamics of Quality Configurable Memory. CECS Tech. Rep. 19-03, 2019.
[13] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Architecture support for disciplined approximate programming. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 301–312, New York, NY, USA, 2012. Association for Computing Machinery.
[14] Felipe Sampaio, Muhammad Shafique, Bruno Zatt, Sergio Bampi, and Jörg Henkel. Approximation-aware Multi-Level Cells STT-RAM cache architecture. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), pages 79–88. IEEE, 2015.
[15] Amir Mahdi Hosseini Monazzah, Majid Shoushtari, Seyed Ghassem Miremadi, Amir M Rahmani, and Nikil Dutt. Quark: Quality-configurable approximate stt-mram cache by fine-grained tuning of reliability-energy knobs. In
IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pages 1–6. IEEE, 2017.
[16] Skanda Koppula, Lois Orosa, A Giray Yağlıkçı, Roknoddin Azizi, Taha Shahroodi, Konstantinos Kanellopoulos, and Onur Mutlu. Eden: Enabling energy-efficient, high-performance deep neural network inference using approximate dram. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 166–181, New York, NY, USA, 2019. Association for Computing Machinery.
[17] Joshua San Miguel, Mario Badr, and Natalie Enright Jerger. Load value approximation. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 127–139, USA, 2014. IEEE Computer Society.
[18] Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. General-purpose code acceleration with limited-precision analog computation. In Proceedings of the 41st Annual International Symposium on Computer Architecture, pages 505–516. IEEE, 2014.
[19] Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, and Mark Oskin. SNNAP: Approximate computing on programmable SoCs via neural acceleration. In IEEE 21st International Symposium on High Performance Computer Architecture, pages 603–614. IEEE, 2015.
[20] Haibo Zhang, Shulin Zhao, Ashutosh Pattnaik, Mahmut T. Kandemir, Anand Sivasubramaniam, and Chita R. Das. Distilling the essence of raw video to reduce memory usage and energy at edge devices. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 657–669, New York, NY, USA, 2019. Association for Computing Machinery.
[21] Michael Ringenburg, Adrian Sampson, Isaac Ackerman, Luis Ceze, and Dan Grossman. Monitoring and debugging the quality of results in approximate programs. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 399–411, New York, NY, USA, 2015. Association for Computing Machinery.
[22] Kasra Moazzemi, Biswadip Maity, Saehanseul Yi, Amir M. Rahmani, and Nikil Dutt. Hessle-free: Heterogeneous systems leveraging fuzzy control for runtime resource management. ACM Trans. Embed. Comput. Syst., 18(5s), October 2019.
[23] Beayna Grigorian, Nazanin Farahpour, and Glenn Reinman. Brainiac: Bringing reliable accuracy into neurally-implemented approximate computing. In IEEE 21st International Symposium on High Performance Computer Architecture, pages 615–626. IEEE, 2015.
[24] Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 2018.
[25] Sparsh Mittal. A survey of techniques for approximate computing.
ACM Computing Surveys (CSUR) , 48:1–33,2016.[26] Pooja Roy, Rajarshi Ray, Chundong Wang, and Weng Fai Wong. Asac: automatic sensitivity analysis forapproximate computing. In
Proceedings of the SIGPLAN/SIGBED Conference on Languages, Compilers andTools for Embedded Systems , pages 95–104, New York, NY, USA, 2014. Association for Computing Machinery.[27] Vinay K Chippa, Srimat T Chakradhar, Kaushik Roy, and Anand Raghunathan. Analysis and characterization ofinherent application resilience for approximate computing. In
Proceedings of the 50th Annual Design AutomationConference , pages 1–9, New York, NY, USA, 2013. Association for Computing Machinery.[28] Radha Venkatagiri, Abdulrahman Mahmoud, Siva Kumar Sastry Hari, and Sarita V Adve. Approxilyzer: towardsa systematic framework for instruction-level approximate computing and its application to hardware resiliency. In , pages 1–14. IEEE, 2016.[29] Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, and Saman Amarasinghe. Language andcompiler support for auto-tuning variable-accuracy algorithms. In
Proceedings of the International Symposium onCode Generation and Optimization , pages 85–96, USA, 2011. IEEE Computer Society.[30] Mahmoud Masadeh, Osman Hasan, and Sofiene Tahar. Using machine learning for quality configurable approxi-mate computing. In
Design, Automation & Test in Europe Conference & Exhibition (DATE) , pages 1575–1578.IEEE, 2019.[31] Mahmoud Masadeh, Osman Hasan, and Sofiene Tahar. Machine Learning-Based Self-Compensating ApproximateComputing. arXiv e-prints , page arXiv:2001.03783, 2020.[32] Bryan Donyanavard, Tiago Mück, Amir M Rahmani, Nikil Dutt, Armin Sadighi, Florian Maurer, and AndreasHerkersdorf. Sosa: Self-optimizing learning with self-adaptive control for hierarchical system-on-chip manage-ment. In
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture , pages685–698, New York, NY, USA, 2019. Association for Computing Machinery.[33] Chelsea C White III and Douglas J White. Markov decision processes.
European Journal of Operational Research ,39:1–16, 1989.[34] Richard S. Sutton. Learning to predict by the methods of temporal differences.
Machine Learning , 3:9–44, 1988.[35] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. In
Machine Learning , volume 8, pages 279–292.Springer Science and Business Media LLC, 1992.[36] Jonathan Balkind, Katie Lim, Fei Gao, Jinzheng Tu, David Wentzlaff, Michael Schaffner, Florian Zaruba, andLuca Benini. Openpiton+ ariane: The first open-source, smp linux-booting risc-v system scaling from one tomany cores. In
Third Workshop on Computer Architecture Research with RISC-V, CARRV . CARRV, 2019.[37] Florian Zaruba and Luca Benini. The cost of application-class processing: Energy and performance analysisof a linux-ready 1.7-ghz 64-bit risc-v core in 22-nm fdsoi technology.
IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems , 27, 2019.[38] Andrew Waterman, Yunsup Lee, David Patterson, Krste Asanovic, and Volume I User level Isa. The risc-vinstruction set manual.
Volume I: User-Level ISA’, version , 2, 2014.[39] John Canny. A computational approach to edge detection.
IEEE Transactions on pattern analysis and machineintelligence , PAMI-8:679–698, 1986.[40] Amir Yazdanbakhsh, Divya Mahajan, Hadi Esmaeilzadeh, and Pejman Lotfi-Kamran. Axbench: A multiplatformbenchmark suite for approximate computing.
IEEE Design & Test , 34:60–68, 2017.[41] Radha Venkatagiri, Khalique Ahmed, Abdulrahman Mahmoud, Sasa Misailovic, Darko Marinov, Christopher WFletcher, and Sarita V Adve. gem5-approxilyzer: An open-source tool for application-level soft error analysis. In , pages 214–221.IEEE, 2019.[42] Thomas Goldbrunner, Thomas Wild, and Andreas Herkersdorf. Memory access pattern profiling for stream-ing applications based on matlab models. In , pages 32–38. IEEE, 2018.[43] Woongki Baek and Trishul M Chilimbi. Green: a framework for supporting energy-conscious programming usingcontrolled approximation. In
Proceedings of Programming Language Design and Implementation , pages 198–209,New York, NY, USA, 2010. Association for Computing Machinery.[44] Rudolf Eigenmann et al. Harnessing parallelism in multicore systems to expedite and improve function approxi-mation. In
Languages and Compilers for Parallel Computing , pages 88–92, Cham, 2017. Springer InternationalPublishing. 20. Maity et al. [45] Trevor E Carlson, Wim Heirman, Stijn Eyerman, Ibrahim Hur, and Lieven Eeckhout. An evaluation of high-levelmechanistic core models.
ACM Transactions on Architecture and Code Optimization (TACO) , 11:1–25, 2014.[46] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen, and Norman P Jouppi. Mcpat: anintegrated power, area, and timing modeling framework for multicore and manycore architectures. In