FORECASTER: A Continual Lifelong Learning Approach to Improve Hardware Efficiency

Phat Nguyen, Abhishek Taur, and Abdullah Muzahid (Department of Computer Science and Engineering, Texas A&M University); Arnav Kansal and Mohamed Zahran (Department of Computer Science, New York University)
ABSTRACT
Computer applications are continuously evolving. However, significant knowledge can be harvested from older applications or versions and applied in the context of newer applications or versions. Such a vision can be realized with continual lifelong learning. Therefore, we propose to employ continual lifelong learning to dynamically tune hardware configurations based on an application's behavior. The goal of such tuning is to maximize hardware efficiency (i.e., maximize an application's performance while minimizing the hardware's energy consumption). Our proposed approach, FORECASTER, uses deep reinforcement learning to continually learn during the execution of an application, as well as to propagate and utilize the accumulated knowledge during subsequent executions of the same or new applications. We propose novel hardware and ISA support to implement deep reinforcement learning. We implement FORECASTER and compare its performance against prior learning-based hardware reconfiguration approaches. Our results show that FORECASTER can save as much as 17.5% of system power over a baseline setup with all resources enabled. On average, FORECASTER saves 16% of system power over the baseline setup while sacrificing an average of 4.7% of execution time.
1. INTRODUCTION
Computer architects are in a continuous quest to find the best hardware design for different program types. We cannot build a separate application-specific hardware design for every program type because doing so would be prohibitively expensive. What makes things even more challenging is that a single program passes through different phases during its execution lifetime, and each phase has a different best hardware configuration. This paper presents a step toward a solution.

Previously, designers gathered profiling information about a program's execution on a piece of hardware and then used this information to enhance either the hardware or the program. However, this means that each program must be instrumented first, and the information gathered during profiling applies only to that application. Our proposed idea is based on a simple hypothesis: any phase of a program execution has a best hardware configuration. Each phase has certain characteristics. So, if a different program has a phase with similar characteristics, it can use the same configuration to get the best performance. Therefore, if we can learn the best configuration for different program phases, we can use that knowledge to find the best configuration for new, unseen programs. In other words, there is a finite set of patterns along which hardware/software interactions can occur to give the best performance. For example, given a cache configuration, there is a finite set of memory access patterns that yield low cache misses. Conversely, given a memory access pattern, we can build the cache configuration that yields the lowest number of misses.
The main goal of this paper is to design hardware with configurable knobs that learns from its interactions with programs so that it can reconfigure itself into the configuration that achieves the best performance for new, unseen programs.
As the hardware executes more programs, it learns more patterns and can achieve better performance for more and more programs. There are several challenges that need to be tackled in order to reach this goal. First, what are the knobs to be changed? There are many structures that can be designed to be reconfigurable. Our main criterion is to pick knobs that have the biggest impact on performance and power and, at the same time, can be reconfigured with the least hardware cost and modification. Second, how do we learn the patterns of hardware/software interaction so as to suggest the best configuration? A pattern here means profiling information, such as telemetry collected from performance counters. For each pattern, there is a hardware configuration that leads to the best performance and power, or whatever other metric needs to be optimized. Clearly, the number of patterns is large and, depending on the number and types of knobs, the number of hardware configurations is also large. This is why straightforward classifiers such as Bloom filters are not a viable option. Using a neural network in a supervised deep-learning setup does not lead to good results at an early stage because it requires a large number of examples in the training phase. Therefore, we need a learning approach that does not depend on a pre-collected set of labeled training examples. We read profiling information and make changes to the hardware based on this information. That is, we make changes to the environment and get feedback about how well we are doing. This is a description of reinforcement learning.
The main contribution of this paper is a hardware scheme, called FORECASTER, that uses continual learning, from one execution to another, via deep reinforcement learning, to reconfigure certain knobs so as to obtain the best performance and power for different programs.
We implemented FORECASTER using the Multi2Sim simulator [32]. Our experimental results using the PARSEC benchmarks show that the proposed technique can save as much as 17.5% of system power over the baseline with all resources. On average, our scheme saves 16% of system power over the baseline setup while sacrificing only an average of 4.7% of execution time.

The rest of the paper is organized as follows: Section 2 presents background material; Section 3 describes the main idea of FORECASTER; Section 4 shows the detailed implementation of FORECASTER; Section 5 presents experimental results; Section 6 highlights related work; and, finally, Section 7 concludes our work.
2. BACKGROUND

2.1 Hardware Adaptation
There is a considerable amount of prior work on reconfigurable architectures [3, 4, 8, 21, 33]. However, unlike FORECASTER, the majority of this work did not use any learning [3, 21]. Among the learning-based approaches, most used offline training [4, 8, 33]. Only a few approaches utilized online training; however, they focused on a single hardware structure [12].

Choi and Yeung [6] perform microarchitectural resource distribution in an SMT processor using a hill-climbing algorithm. Bitirgen et al. [4] propose a scheme that combines the performance prediction models of multiple applications to get an aggregate performance prediction for the overall resource distribution. The scheme is coupled with a limited probabilistic search technique to find the optimal resource distribution to improve performance. Petrica et al. [21] present Flicker, a general-purpose multicore architecture that dynamically adapts to varying limits on allocated power. A Flicker core has reconfigurable lanes through the pipeline that allow tailoring an individual core to the running application with low overhead.

Dubach et al. [8] propose the use of machine learning to dynamically optimize the efficiency of processor components such as the arithmetic logic unit, instruction queues, register file, caches, branch predictor, and pipeline depth. During program execution, as soon as a phase change is detected, the hardware starts to collect counters on a predefined profiling configuration. These counters represent the usage of the hardware resources in that interval. The model then predicts the optimal configuration, and the system is reconfigured accordingly for the rest of the phase. Unlike our approach, Dubach et al. propose learning for each program separately. Moreover, they also use offline training, which can limit the adaptability of the model to future unseen programs.

There is also other work that utilizes profiling information for optimization. However, the majority of that work is related to software optimization [19]. For hardware designs, profiling information has traditionally been used to make design choices before fabrication [11].
2.2 Reinforcement Learning

Reinforcement learning is a subfield of machine learning concerned with autonomous agents that can learn, without supervision, to optimize an objective [10]. In reinforcement learning, the knowledge of the agent is built through trial and error. At each time step, the agent takes an action and observes feedback from the environment about how well it is doing and how close it is to the goal. A reinforcement learning problem typically consists of three main components:

• a set of states that represent the environment at different time steps;
• a set of possible actions that the agent can take;
• a reward function that issues a reward for each action of the agent.

A state is defined as the information about the condition of the environment at a time step. The agent observes this information and selects the most appropriate action. As the agent takes the selected action, the environment transitions from the current state to another state. After that, the reward function assigns the agent a reward. The value of this reward depends on how good the state-action pair is. Since this reward function is called the Q-function, the reward is called the Q-value. The agent accumulates knowledge by storing and updating these Q-values after each time step.

Ipek et al. [12] formulate DRAM scheduling as a reinforcement Q-learning problem with the goal of optimizing bus utilization and throughput. In every clock cycle, the agent picks one out of six possible actions available to the scheduler. The agent is given a numerical Q-value of 1 whenever it issues a command that increases data bus utilization and 0 otherwise. The state is defined as a combination of attributes that represent the state of the controller's transaction queue. Ipek et al. show that this approach can improve bus utilization and bandwidth efficiency by a significant amount compared to a state-of-the-art DRAM scheduler.

Early reinforcement Q-learning techniques use a Q-table to store the Q-values, indexed by state-action pairs. As modern problems become more and more complex, this approach becomes inefficient, since the Q-table size inflates with the number of state-action pairs. The answer to this issue is Deep Q-learning (DQN), a cross-breeding of reinforcement Q-learning and deep learning techniques. In DQN, the reward table is replaced by a multi-layer neural network that predicts the Q-value for any particular state-action pair.

There is a significant amount of work on applications of DQN [17, 27, 28, 29]. Mnih et al. [17] apply DQN to seven Atari 2600 games and show that it outperforms all previous reinforcement learning approaches. Moreover, this DQN model also manages to beat a human expert in three out of the seven games. The model uses only the raw pixels of the application screen as input and outputs the expected future reward of the taken action. DeepMind Technologies uses DQN in the series of AlphaGo programs [27, 28, 29] to solve the game of Go. The original AlphaGo version [27] outperforms all previous Go programs and is the first Go program to beat a professional human player. The latest version of the series, AlphaZero, can teach itself three different games: Go, chess, and shogi [28]. However, at the time of this paper, there is no published research on applying DQN to hardware optimization.

3. MAIN IDEA: FORECASTER

FORECASTER periodically collects hardware telemetry during the execution of a program.
The telemetry consists of various hardware event counters maintained by the processor architecture. FORECASTER uses the hardware telemetry in a deep reinforcement learning algorithm to predict the optimal configurations of tunable hardware resources. The goal of the predicted configurations is to maximize the efficiency of the hardware. The overall workflow of FORECASTER is shown in Figure 1. FORECASTER reconfigures the hardware resources according to the prediction and receives a reward after a while. FORECASTER receives a positive reward if the hardware efficiency improves due to the reconfiguration. Otherwise, it receives a zero or negative reward. Rewards are a feedback mechanism that encourages configurations associated with positive rewards while discouraging configurations associated with non-positive rewards. Based on the reward, FORECASTER updates the Q-values (used by the reinforcement learning algorithm) so that efficiency-boosting configurations are predicted more frequently. Thus, FORECASTER continually improves its prediction during the execution of an application. The next time the same or a new application executes, FORECASTER reuses the Q-values learned from prior executions and continues to improve its prediction accuracy. In other words, FORECASTER keeps learning from one execution to the next, both within and across applications, thereby realizing continual lifelong learning with the goal of maximizing hardware efficiency. In the next few sections, we elaborate on the different steps of FORECASTER.
When an application starts, FORECASTER begins with the maximum amount of hardware resources. This prevents any slowdown at the beginning. Progressively, FORECASTER tries to reconfigure the tunable hardware resources to maximize hardware efficiency. We use IPC/Power as the metric for hardware efficiency; a similar metric has been used in prior work [8]. As the tunable hardware resources, we choose the L2 and L3 caches as well as the Branch Target Buffer (BTB) and the prefetcher. We choose the caches because they are the most energy-hungry resources on a chip [13]. We choose the other resources because they can easily be clock-gated without intrusive changes to the pipeline circuitry. Although we demonstrate the effectiveness of FORECASTER with these four tunable resources, we argue that FORECASTER is general enough to accommodate any number of tunable resources. Table 1 shows the tunable resources and their possible configurations.
Tunable Resource         Configurations
BTB size                 0.5K, 1K, 1.5K, and 2K entries
Prefetcher               On, Off
L2 (private) cache       256KB, 512KB, 768KB, and 1024KB
L3 (shared LLC) cache    4MB, 8MB, 12MB, and 16MB

Table 1: List of tunable hardware resources. The initial configuration is the maximum setting of each resource.
A program usually goes through distinct phases during its execution [25]. Some phases may benefit from more cache while others might benefit from a larger BTB. We collect hardware telemetry as an approximation of how a program behaves. Modern processors provide hundreds of hardware event counters as telemetry. After inspecting the available hardware event counters, we choose the n counters most relevant to the tunable hardware resources. Let us denote the set of counters (i.e., the hardware telemetry) as T = {T_i}, 1 ≤ i ≤ n, where each T_i is an individual counter. FORECASTER collects these counters at a regular interval. At the beginning of each interval, FORECASTER uses the counters to predict the configurations of the tunable resources.

FORECASTER uses reinforcement learning, more specifically a Deep Q-learning (DQN) model, for prediction. In this model, the current configuration of the hardware resources, C, as well as the behavior of the program as specified by the telemetry, T, is provided as a state, S. In other words, S = <T, C>. Given a state S_t at a time period t, the deep Q-learning model predicts Q-values for all possible actions in that state using a deep neural network (DNN). Each action indicates a different configuration of the hardware resources. Thus, if we have N possible actions, the model predicts N different Q-values, one for each action A_i, where 1 ≤ i ≤ N. The Q-value associated with action A_i, say Q_{A_i}, is an estimate of how good the new configuration (corresponding to A_i) is at maximizing hardware efficiency. A higher Q-value implies a better configuration. Therefore, FORECASTER chooses the action associated with the maximum Q-value.

Naively designating one action for each configuration leads to a large number of actions. For example, based on Table 1, we can have 4*4*2*4 = 128 possible configurations and, hence, the same number of actions. Reinforcement learning with a large action space takes a long time to train due to the sparsity of rewards [24]. Therefore, in order to reduce the number of actions, we express each action in terms of changes to the configuration. Suppose ↑, ↓, and = indicate that a resource should be increased, decreased, or kept at the same level, respectively. If a resource has only two configurations, we use ON and OFF to indicate those configurations. With these notations, we can define an action as A = <R_i^a>, 1 ≤ i ≤ n, where R_i represents the i-th resource and a represents a change in R_i's configuration, such as ↑, ↓, =, ON, or OFF. For example, suppose the current configuration is denoted by C = <L2_512K, L3_8M, PF_OFF, BTB_0.5K>. Then, an action <L2↑, L3↓, PF_OFF, BTB↑> will create the new configuration <L2_768K, L3_4M, PF_OFF, BTB_1K>. With this encoding, the number of actions is reduced from 128 to 3*3*2*3 = 54, i.e., less than half of the initial action count. With the reduced action space, the overall prediction process is illustrated in Figure 2. FORECASTER reconfigures the tunable resources according to the predicted configuration.
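To make the delta-action encoding concrete, the sketch below enumerates the 54 actions and applies one of them to a configuration. This is our illustration rather than the authors' implementation: the helper names are hypothetical, the 2K BTB level follows Table 1, and we assume an ↑ or ↓ at the end of a resource's range simply saturates.

```python
from itertools import product

# Possible levels for each multi-level resource (Table 1).
LEVELS = {
    "L2":  [256, 512, 768, 1024],    # private L2 size, KB
    "L3":  [4, 8, 12, 16],           # shared L3 size, MB
    "BTB": [512, 1024, 1536, 2048],  # BTB entries
}

# Delta-encoded action space: 3 (L2) * 3 (L3) * 2 (PF) * 3 (BTB) = 54.
ACTIONS = list(product(["up", "down", "same"],   # L2
                       ["up", "down", "same"],   # L3
                       ["ON", "OFF"],            # prefetcher
                       ["up", "down", "same"]))  # BTB

def apply_action(config, action):
    """Return the configuration produced by one delta action.

    Moves saturate at the smallest/largest level (an assumption; the
    paper does not say what happens at the ends of the range).
    """
    l2, l3, pf, btb = action
    new = dict(config)
    for name, move in (("L2", l2), ("L3", l3), ("BTB", btb)):
        i = LEVELS[name].index(config[name])
        if move == "up":
            i = min(i + 1, len(LEVELS[name]) - 1)
        elif move == "down":
            i = max(i - 1, 0)
        new[name] = LEVELS[name][i]
    new["PF"] = pf
    return new

# The example from the text: raise L2, lower L3, keep the prefetcher
# off, and raise the BTB.
cfg = {"L2": 512, "L3": 8, "PF": "OFF", "BTB": 512}
print(len(ACTIONS))                                   # 54
print(apply_action(cfg, ("up", "down", "OFF", "up")))
# {'L2': 768, 'L3': 4, 'PF': 'OFF', 'BTB': 1024}
```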
Figure 1: Overall workflow of FORECASTER. (The application starts executing with maximum resources; FORECASTER periodically collects hardware telemetry, predicts a new configuration, reconfigures the resources, and calculates the efficiency and rewards to update the Q-values. Q-values are loaded at the start of execution and saved at the end.)

Figure 2: FORECASTER uses hardware telemetry to predict configurations. (From telemetry T_t and configuration C_t, the DNN predicts Q-values Q_1 ... Q_m for actions A_1 ... A_m; the action A_i associated with the maximum Q-value is selected and applied to produce C_{t+1}.)

We now describe how each resource is reconfigured.

The BTB has four possible configurations (Table 1). Therefore, we can partition the BTB into four sections, B1, B2, B3, and B4 (Figure 3). For the first configuration (i.e., 0.5K entries), sections (B2, B3, B4) are clock-gated. Similarly, for the second and third configurations, sections (B3, B4) and (B4) are clock-gated, respectively. The last configuration does not clock-gate any section at all. FORECASTER reconfigures all BTBs to the same configuration; this is done to simplify the prediction and reconfiguration logic in FORECASTER.

Figure 3: Logic for reconfiguring the BTB.

The prefetcher is used either completely or not at all. Therefore, the prefetcher is clock-gated entirely or not at all, and the reconfiguration logic simply generates a single clock-gating signal for the entire prefetcher.
Figure 4: Logic for reconfiguring the L2 and L3 caches.
In order to reconfigure the caches, FORECASTER makes three design choices. First, FORECASTER does not clock-gate an entire set; as a result, the address decoding logic remains unchanged. Second, from each set, FORECASTER clock-gates the invalid lines; FORECASTER never clock-gates any valid lines of the cache. Third, whenever more than the required number of cache lines in a single set satisfy the selection criteria, FORECASTER randomly chooses some of them to clock-gate. Figure 4 shows the schematic for reconfiguring the caches. The way selection logic first determines how many lines need to be clock-gated in each set. Then, it selects which ways to clock-gate based on the selection criteria and sends a signal to those ways. Because only invalid lines are clock-gated, FORECASTER does not need to worry about cache coherence issues. During the clock-gating process, the cache controller blocks any incoming request to that particular cache set. The request is handled after the clock-gating is complete.
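The way-selection policy just described can be sketched as follows (our illustration; the function and variable names are hypothetical). Invalid lines are the only candidates, valid lines are never gated, and surplus candidates are chosen at random:

```python
import random

def select_ways_to_gate(valid_bits, lines_to_gate):
    """Pick the ways of one cache set to clock-gate.

    `valid_bits[w]` is True if way w currently holds a valid line.
    Only invalid lines satisfy the selection criteria; if more of them
    exist than needed, a random subset is gated. If fewer exist, we
    gate only those (an assumption consistent with never gating a
    valid line).
    """
    candidates = [w for w, valid in enumerate(valid_bits) if not valid]
    if len(candidates) <= lines_to_gate:
        return candidates
    return random.sample(candidates, lines_to_gate)

# Example: an 8-way set where ways 1, 4, 5, and 7 are invalid and the
# new configuration shrinks the set by two ways.
print(select_ways_to_gate([True, False, True, True,
                           False, False, True, False], 2))
```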
Figure 5: Timing of various steps of FORECASTER. (In interval t, FORECASTER collects HW telemetry H_t and predicts configuration C_t; at the start of interval t+1, it calculates efficiency E_t and reward R_t, updates the Q-values, and predicts C_{t+1}.)

FORECASTER collects the hardware telemetry H_t at the beginning of an interval t and determines the new configuration C_t using the Q-value Q_t predicted by the DNN. Then, it reconfigures the hardware resources accordingly and continues the program execution. The timing is shown in Figure 5. At the beginning of the next interval, t+1, FORECASTER calculates the new efficiency that results from the reconfiguration. FORECASTER compares the new efficiency with the old one (the one calculated at the beginning of interval t). If the efficiency increases, FORECASTER receives a positive reward; if it remains the same, FORECASTER receives a reward of 0; and if it decreases, FORECASTER receives a negative reward. FORECASTER then uses the commonly used temporal difference method to update Q_{t-1} (the Q-value predicted at interval t-1) [30]. In this method, the new Q-value, Q'_{t-1}, is calculated using the following equation:

Q'_{t-1} = (1 − α) Q_{t-1} + α [r + γ Q_t]

Here, α is the learning rate, γ is the discount factor, and r is the reward. The DNN uses backpropagation to learn Q'_{t-1}.
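The reward and update step might look like the following sketch (our Python rendering of the equation above; the ±1 reward magnitudes and the α and γ values are assumptions, since the text only fixes the sign of the reward):

```python
def reward(eff_new, eff_old):
    """Sign of the change in efficiency (IPC/Power); magnitudes assumed."""
    if eff_new > eff_old:
        return 1.0
    if eff_new < eff_old:
        return -1.0
    return 0.0

def td_update(q_prev, q_curr, r, alpha=0.1, gamma=0.9):
    """Temporal-difference target: Q'_{t-1} = (1-a)Q_{t-1} + a(r + g*Q_t).

    q_prev is the Q-value predicted at interval t-1 for the chosen
    action; q_curr is the Q-value predicted at interval t. The DNN is
    then trained toward this target by backpropagation.
    """
    return (1 - alpha) * q_prev + alpha * (r + gamma * q_curr)

# Example: efficiency improved, so the chosen action's Q-value rises.
print(td_update(q_prev=0.5, q_curr=0.8, r=reward(1.2, 1.0)))  # ~0.622
```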
The use of DQN in FORECASTER provides a natural way to implement continual lifelong learning. In the DQN model, FORECASTER learns by training a DNN with Q-values. To continue learning from one execution to the next, FORECASTER stores the DNN topology and weights in a file at the end of each execution. Section 4 presents an extension to the ISA that is used to read the topology and weights of the DNN and write them back from a special file. At the beginning of the next execution, FORECASTER loads the topology and weights of the DNN and continues to learn from where it left off in the last execution. If multiple applications run concurrently on a processor, each application stores its own DNN topology and weights at the end of its respective execution. In that case, FORECASTER uses an offline process to merge the networks periodically (e.g., once a day). For merging, FORECASTER uses TensorFlow [1] to load all the DNNs and generates a number of random state samples. For each state sample, FORECASTER calculates the Q-value of each action using all the DNNs, takes the average of the Q-values, and retrains the largest DNN with these averages. Thus, the largest DNN accumulates the knowledge of all the DNNs. At the beginning of the next execution of an application, FORECASTER loads this merged DNN and continues execution.
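The offline merging step could be realized as in the following sketch, using the TensorFlow/Keras API the paper cites [1]. This is our reconstruction: the file names, sample count, and training hyperparameters are all hypothetical.

```python
import numpy as np
import tensorflow as tf

# Per-application DQNs saved at the end of their executions (names assumed).
paths = ["app_a_dqn.h5", "app_b_dqn.h5", "app_c_dqn.h5"]
models = [tf.keras.models.load_model(p) for p in paths]

# Generate random state samples; a state encodes <telemetry, configuration>.
state_dim = models[0].input_shape[1]
states = np.random.rand(4096, state_dim).astype("float32")

# Average the Q-values that all the DNNs predict for each sampled state ...
avg_q = np.mean([m.predict(states, verbose=0) for m in models], axis=0)

# ... and retrain the largest DNN toward the average, so it accumulates
# the knowledge of all the networks.
largest = max(models, key=lambda m: m.count_params())
largest.compile(optimizer="adam", loss="mse")
largest.fit(states, avg_q, epochs=5, verbose=0)
largest.save("merged_dqn.h5")
```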
4. IMPLEMENTATION
In this section, we outline the implementation of Deep Q-learning in FORECASTER as well as the extension to the ISA.
We propose to add a DQN module to the chip. The module contains a Neural Processing Unit (NPU) for implementing the DNN, along with additional buffers (an input buffer and a replay buffer) and control logic. Figure 6 shows a high-level schematic of the DQN module.

There are several NPU designs in the literature [2, 9, 23]. We propose to use an NPU similar to the one proposed by Esmaeilzadeh et al. [9]. It consists of a number of Processing Elements (PEs) and a scheduler. Each PE implements an individual artificial neuron. Each PE contains input and weight registers, a multiplier, an adder, a partial sum register, and a comparator. The input and weight registers, along with the adder, multiplier, and partial sum register, are used to calculate the dot product of inputs and weights. The comparator is used to implement the ReLU activation function [18]. The scheduler schedules each layer of the DNN onto the PEs, starting from the input layer. After calculating the Q-values, the current input and Q-values are stored in the replay buffer. When FORECASTER receives a reward and calculates the updated Q-value, the replay buffer provides the saved inputs and Q-values to the NPU to learn the new Q-value.
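Functionally, each PE performs a multiply-accumulate and the comparator applies ReLU, while the scheduler walks the layers from the input layer onward. A minimal NumPy sketch of that forward pass (our illustration; the layer sizes are placeholders, and we assume no activation on the output layer):

```python
import numpy as np

def npu_forward(state, weights, biases):
    """Evaluate the DNN layer by layer, as the NPU scheduler does.

    Each layer is a dot product of inputs and weights (the PE's
    multiply-accumulate loop) followed by ReLU, which the PE's
    comparator implements as max(0, x).
    """
    x = state
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ w + b)   # hidden layer + ReLU
    return x @ weights[-1] + biases[-1]  # Q-value output (no activation)

# Toy example: 10 telemetry/configuration inputs, two hidden layers of
# six neurons, and one Q-value per action.
rng = np.random.default_rng(0)
sizes = [(10, 6), (6, 6), (6, 54)]
Ws = [rng.standard_normal(s) for s in sizes]
bs = [np.zeros(s[1]) for s in sizes]
print(npu_forward(rng.standard_normal(10), Ws, bs).shape)  # (54,)
```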
Figure 6: Details of the DQN module.
The control logic sequences the operations to implement the DQN algorithm. The control logic also contains three special registers that store the learning rate, the discount factor, and the exploration ratio. The learning rate and discount factor are used to calculate the new Q-value (Section 3.4). The exploration ratio dictates how often the module chooses a random exploratory action as opposed to the action with the maximum Q-value. DQN uses exploratory actions to try actions that would otherwise never be selected. This is done to find a potentially better action than the one based on prior knowledge.
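The exploration ratio plays the role of ε in a standard ε-greedy policy; a minimal sketch (our illustration, with an assumed ratio):

```python
import random

def select_action(q_values, exploration_ratio=0.05):
    """Pick a random action with probability `exploration_ratio`;
    otherwise pick the action with the maximum predicted Q-value."""
    if random.random() < exploration_ratio:
        return random.randrange(len(q_values))              # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```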
Based on our experiments (Section 5.2.6) and intuition, we select the following hardware counters as telemetry: (i) the number of integer instructions, (ii) the number of logical instructions, (iii) the number of floating-point instructions, (iv) the number of memory access instructions, and (v) the number of control flow instructions. Each core collects its telemetry independently and sends it to the DQN module after every n (e.g., n = 10,000) instructions. When the DQN module has received telemetry covering a total of at least N (e.g., N = 500,000) instructions, FORECASTER assumes the start of a new interval. The DQN module aggregates the telemetry and normalizes each counter with respect to the total instruction count of the interval that just finished. The DQN module then predicts the new configuration and sends a reconfiguration message to each core.
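The interval bookkeeping could look like the sketch below (our illustration; the counter handling and the helper name are hypothetical, and N follows the example value in the text):

```python
N = 500_000  # instructions per interval (example value from the text)

def close_interval(per_core_samples):
    """Aggregate per-core counter samples and normalize an interval.

    `per_core_samples` holds one dict per telemetry message, each with
    the five counters plus the instruction count it covers. Returns
    each counter divided by the interval's total instruction count.
    """
    totals, instructions = {}, 0
    for sample in per_core_samples:
        instructions += sample["instructions"]
        for name, value in sample.items():
            if name != "instructions":
                totals[name] = totals.get(name, 0) + value
    assert instructions >= N, "interval not finished yet"
    return {name: value / instructions for name, value in totals.items()}
```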
We extend the ISA with instructions to set and get DQN configurations (i.e., the topology and weights of the DNN), and we propose a fixed format for these configurations. At the end of an execution, FORECASTER executes a sequence of getconf instructions in a loop until the entire DQN configuration has been read out and saved; the corresponding set instructions write the configuration back at the beginning of the next execution.
5. EXPERIMENTAL EVALUATION

Table 2 shows the parameters of the simulated hardware that we use to conduct the experiments. We use a modified version of Multi2Sim [32] and McPAT [15] to simulate the experimental hardware and its power consumption. The PARSEC 3.0 benchmark suite is used with small inputs. Due to resource and time constraints, all benchmarks are run either to completion or for 1 billion instructions, whichever comes first. The interval size N is set at 0.5M instructions.

Parameter                Value
CPU                      8-core @ 2.4GHz, SMT off
Private L1 cache (I/D)   32KB, 64B line, 8-way
Private L2 cache         1024KB, 64B line, 8-way
Shared L3 cache          16MB, 64B line, 16-way
Coherence protocol       Directory-based MOESI

Table 2: Parameters of the simulated hardware.

We conduct three experiments on three versions of FORECASTER:

• Experiment 1: FORECASTER is implemented with a giant table to store and update the Q-values. All applications are run five times. Each run starts with an empty Q-table. This version is essentially an adoption of the prior reinforcement learning-based approach [12] in the current usage scenario.
• Experiment 2: FORECASTER is implemented with a deep neural network to predict the Q-values. All applications are run five times. Each run starts with an untrained neural model.
• Experiment 3: FORECASTER is implemented with a deep neural network to predict the Q-values. All applications are run five times. Each run starts with the trained neural model inherited from the previous execution.

In the first experiment, each run starts with an empty Q-table, which means there is no knowledge accumulation between executions. This technique is basically the Q-learning adopted from [12]. In the second experiment, we replace the Q-table with a deep neural network to see how DQN compares to vanilla Q-learning. Experiment 3 is similar to Experiment 2 except that each execution starts with the model taken from the previous execution. The purpose of this experiment is to investigate the efficacy of knowledge accumulation.

Figure 7: Average amount of (a) L2, (b) L3, and (c) BTB turned off during the execution of streamcluster.
Figures 7(a), 7(b), and 7(c) show how FORECASTER manages the hardware resources during an execution of streamcluster. On average, FORECASTER can turn off 64%, 66%, and 66% of the L2 cache, L3 cache, and BTB, respectively. FORECASTER also deactivates the prefetcher for 26% of all intervals. Similar behavior can be seen for the other programs in the benchmark suite. FORECASTER is able to determine the best size for each structure in each phase. This can be seen from the repetitive pattern in the figures, which maps to phases in each program. In this paper, we use static phases of a fixed number of instructions. In the future, we plan to use phase detection techniques [7, 26], which is expected to make the scheme even more efficient.
Figure 8: Normalized power consumption of five executions of the applications.

Experimental results show that FORECASTER with continuous learning uses the least power compared to the other techniques, and power similar to the best static configuration, as shown in Figure 8. On average, FORECASTER with accumulated knowledge saves 16% of power across all applications compared to the baseline. This is 2% more than the version without continuous learning and 8% more than the version with the basic Q-table.
Figure 9: Normalized efficiencies of five executions of the applications.

The efficiency of each experiment is shown in Figure 9. In general, our scheme outperforms the baseline configuration in all benchmarks except canneal. Interestingly, the Q-table version gives the best efficiency compared to the two versions with the neural network. This may be because the Q-table does not require as much time to learn as the neural network. Due to time constraints, only two executions of canneal were completed for Experiment 3, which is why the neural network does not perform as expected there.
Figure 10: Normalized IPCs of five executions of the applications.

Figure 10 shows that there is not much IPC degradation when using FORECASTER. Specifically, the system IPC when running swaptions is virtually unchanged across all versions of FORECASTER. The Q-table version of FORECASTER has the most consistent performance, as it only causes a 1.2% IPC overhead on average. This result is comparable to the best static configuration. In canneal, the two versions with the deep neural network perform badly, as they degrade the system IPC by about 15%. One reason is that it takes time to train the neural network before it reaches reasonable accuracy.

Figure 11: Normalized number of cycles taken between experiments.

The normalized execution time, measured in number of cycles, is shown in Figure 11. Overall, the execution time overhead incurred by FORECASTER is less than 5%. FORECASTER tends to perform better in multi-threaded applications such as streamcluster, swaptions, and fluidanimate than in single-threaded applications such as canneal. This is because FORECASTER makes only one prediction for all cores, and the prediction largely depends on the resources of the core that is heavily used. For example, single-threaded programs use only one core, so the L2 cache of that core is mostly occupied. However, when FORECASTER reconfigures the hardware, it turns off the same amount of L2 cache on every core, even though the L2 caches of the other cores are mostly empty. This is a limitation of FORECASTER that can be the subject of future research.
The cost of the proposed design can be divided into three parts: delay (latency) cost, hardware cost, and power consumption cost. As for the latency cost, reading the hardware telemetry and making a reconfiguration decision does not happen on the critical path. The hardware continues in its old configuration until the decision for a new configuration is made.

The hardware cost consists of the DQN hardware and the extra hardware used to implement the knobs. The DQN uses a seven-layer neural network with six neurons per layer, plus an input layer of 10 neurons and an output layer of one neuron. So, we use eight processing elements: the input layer takes two cycles, as it needs to do the work of 10 neurons, followed by one cycle per subsequent layer. Each processing element (PE) is a simple execution unit that can perform one fused multiply-add operation per cycle, similar to the execution units found in traditional graphics processing units (GPUs). The PEs are organized in a design similar to the neural processing unit (NPU) described in [9]. We also need two extra registers for the old Q-value and the new Q-value (calculated by the neural network based on the reward). A simple computation unit is needed to calculate the new Q-value as shown in Section 3.4.

The hardware needed for the knobs is straightforward. The prefetcher is simply clock-gated, as its knob is on/off. The BTB also uses clock-gating depending on the configuration; since we have four configurations, a small 2x4 decoder does the job, as shown in the reconfiguration logic of Figure 3. Clock-gating the cache ways is simplified by the fact that the way-reconfiguration logic, shown in Figure 4, never gates a valid entry, so no change to the cache controller or coherence hardware is needed. The way-reconfiguration logic is not complicated because it exploits the fact that large caches (such as the L3) are usually partitioned; therefore, we have one logic circuit per partition.

The power consumption of the above hardware is not high, due to several factors. First, the extra hardware is activated only at the end of each program phase to make a prediction and reconfigure the knobs. Second, the extra power consumption is much smaller than the power saved by gating the reconfigured structures. Finally, there are several options for implementing the neural network, ranging from executing it as a software component on a CPU or GPU to designing it as a digital ASIC [9], an FPGA [14], or an analog ASIC [5, 16]. Each approach has its own characteristics of area, power, and cost.

We conduct three additional experiments in order to determine the optimal interval size, history length, and number of counters to collect. The history length experiment shows how far into the past we should look when determining the best configuration for the current interval. Simulation results show that increasing the history length from 1 to 2 intervals reduces the efficiency gains by 3%, as shown in Figure 12.

The number-of-counters experiment shows how many counters should be considered to best represent an interval. We test three sets: a 3-counter, a 5-counter, and an 8-counter set. Below is the list of the 8 counters that we collect:

• normalized number of dispatched integer instructions;
• normalized number of dispatched logical instructions;
• normalized number of dispatched floating-point instructions;
• normalized number of dispatched memory instructions;
• normalized number of dispatched control instructions;
• minimum free space across all L2 caches;
• free space of the shared L3 cache;
• branch predictor misprediction rate.

The 8-counter set includes all of the counters above. The 5-counter set includes only the normalized dispatched-instruction counters, leaving out the last three. The 3-counter set includes only the numbers of dispatched integer, memory, and control instructions. Figure 13 shows that the set of 5 counters gives the best efficiency. A set of 3 counters does not have enough representational power, while the full set of 8 counters is redundant.

The third experiment determines how big an interval should be. We test interval sizes of 0.25M, 0.5M, 1M, and 2M instructions. Simulation results show that setting the interval size at 0.5M instructions gives 0.02%, 0.11%, and 0.11% more efficiency gain than 2M, 1M, and 0.25M instructions, respectively (Figure 14).
Figure 12: Efficiency comparison between different history lengths.
Overall, the continuous learning version of FORECASTER can save up to 17.5% of power consumption in some applications and 16% on average compared to the baseline setup. It gives an efficiency gain of 4% while sacrificing 4.7% of execution time.
Figure 13: Efficiency comparison between different numbers of counters.

Figure 14: Efficiency comparison between different interval sizes.

6. RELATED WORK

Tarsa et al. [31] propose a lightweight ML framework that can be distributed through firmware updates to the microcontrollers of post-silicon CPUs. The ML model is first trained offline with a diverse collection of applications to avoid statistical blind spots. During execution, the CPU dynamically sets the issue width of a clustered hardware component while clock-gating unused resources based on the prediction of the ML model.

Pan et al. [20] present a multi-level reinforcement learning framework (MLRL) to address the scalability issue of dynamic power management in multi-core processors. MLRL effectively reduces the exponential decision process to a linear problem by exploiting a hierarchical paradigm. In MLRL, core states and Q-values are propagated from the bottom to the top of a tree structure, and decisions are then propagated back down the tree, providing an efficient control mechanism.

Ravi et al. [22] propose CHARSTAR, a clock-tree-aware resource optimizing mechanism. CHARSTAR incorporates a multi-layer perceptron with one hidden layer to predict the optimal configuration in each execution phase. The neural network takes into account the clock hierarchy and the topology overhead in order to improve the power savings. However, the offline-trained model may soon become obsolete for future unseen programs. Moreover, CHARSTAR only works for single-threaded programs, and a multi-threaded version may cause a super-linear increase in the size of the neural network model.
7. CONCLUSIONS
This work demonstrates the potential of dynamically tuning hardware components to save power with a small performance overhead. Our scheme, FORECASTER, when incorporating a continually learning deep neural network, can save up to 17.5% of power consumption compared to the baseline configuration. On average, FORECASTER can reduce power usage by 16% while sacrificing 4.7% of execution time, which leads to a 4% efficiency gain. Future research may focus on improving the efficacy of FORECASTER as well as extending its control over more hardware resources to achieve further efficiency gains.
REFERENCES

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org. [Online]. Available: http://tensorflow.org/

[2] M. M. u. Alam and A. Muzahid, "Production-run software failure diagnosis via adaptive communication tracking," in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA '16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 354–366. [Online]. Available: https://doi.org/10.1109/ISCA.2016.39

[3] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, "Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures," in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, ser. MICRO 33. New York, NY, USA: ACM, 2000, pp. 245–257. [Online]. Available: http://doi.acm.org/10.1145/360128.360153

[4] R. Bitirgen, E. Ipek, and J. F. Martinez, "Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach," in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 41. Washington, DC, USA: IEEE Computer Society, 2008, pp. 318–329. [Online]. Available: https://doi.org/10.1109/MICRO.2008.4771801

[5] V. Calayir, M. Darwish, J. Weldon, and L. Pileggi, "Analog neuromorphic computing enabled by multi-gate programmable resistive devices," in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, ser. DATE '15. San Jose, CA, USA: EDA Consortium, 2015, pp. 928–931.

[6] S. Choi and D. Yeung, "Learning-based SMT processor resource distribution via hill-climbing," in Proceedings of the 33rd Annual International Symposium on Computer Architecture, ser. ISCA '06, Jun 2006, pp. 239–251.

[7] A. S. Dhodapkar and J. E. Smith, "Managing multi-configuration hardware via dynamic working set analysis," in Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002.

[8] C. Dubach, T. M. Jones, E. V. Bonilla, and M. F. P. O'Boyle, "A predictive model for dynamic microarchitectural adaptivity control," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2010, pp. 485–496.

[9] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE Computer Society, 2012, pp. 449–460. [Online]. Available: https://doi.org/10.1109/MICRO.2012.48

[10] L. Graesser and W. L. Keng, Foundations of Deep Reinforcement Learning: Theory and Practice in Python. Boston, MA, USA: Addison-Wesley Professional, 2018.

[11] H. Hubert and B. Stabernack, "Profiling-based hardware/software co-exploration for the design of video coding architectures," IEEE Transactions on Circuits and Systems for Video Technology, Sep 2009, pp. 1680–1691.

[12] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, "Self-optimizing memory controllers: A reinforcement learning approach," in Proceedings of the 35th Annual International Symposium on Computer Architecture, ser. ISCA '08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 39–50. [Online]. Available: https://doi.org/10.1109/ISCA.2008.21

[13] C. Isci, A. Buyuktosunoglu, C. Cher, P. Bose, and M. Martonosi, "An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 347–358.

[14] M.-J. Li, A.-H. Li, Y.-J. Huang, and S.-I. Chu, "Implementation of deep reinforcement learning," in Proceedings of the 2019 2nd International Conference on Information Science and Systems, ser. ICISS 2019. New York, NY, USA: Association for Computing Machinery, 2019, pp. 232–236. [Online]. Available: https://doi.org/10.1145/3322645.3322693

[15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, Oct 2009, pp. 469–480.

[16] D. Maliuk and Y. Makris, "An analog non-volatile neural network platform for prototyping RF BIST solutions," in Proceedings of the Conference on Design, Automation & Test in Europe, ser. DATE '14. Leuven, BEL: European Design and Automation Association, 2014.

[17] V. Mnih, K. Kavukcuoglu, and D. Silver, "Human-level control through deep reinforcement learning," Nature, vol. 518, Feb 2015, pp. 529–533.

[18] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on International Conference on Machine Learning, ser. ICML '10. Madison, WI, USA: Omnipress, 2010, pp. 807–814.

[19] D. Novillo, "SamplePGO - the power of profile guided optimizations without the usability burden," Nov 2014, pp. 22–28.

[20] G.-Y. Pan, J.-Y. Jou, and B.-C. Lai, "Scalable power management using multilevel reinforcement learning for multiprocessors," ACM Trans. Des. Autom. Electron. Syst., vol. 19, no. 4, Aug. 2014. [Online]. Available: https://doi.org/10.1145/2629486

[21] P. Petrica, A. M. Izraelevitz, D. H. Albonesi, and C. A. Shoemaker, "Flicker: A dynamically adaptive architecture for power limited multicore systems," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY, USA: ACM, 2013, pp. 13–23. [Online]. Available: http://doi.acm.org/10.1145/2485922.2485924

[22] G. S. Ravi and M. H. Lipasti, "CHARSTAR: Clock hierarchy aware resource scaling in tiled architectures," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 147–160. [Online]. Available: https://doi.org/10.1145/3079856.3080212

[23] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in International Symposium on Computer Architecture (ISCA), 2016. [Online]. Available: http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/05/reagen_isca16.pdf

[24] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, "Learning by playing - solving sparse reward tasks from scratch," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 4344–4353. [Online]. Available: http://proceedings.mlr.press/v80/riedmiller18a.html

[25] T. Sherwood, E. Perelman, and B. Calder, "Basic block distribution analysis to find periodic behavior and simulation points in applications," in Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '01. USA: IEEE Computer Society, 2001, pp. 3–14.

[26] T. Sherwood, S. Sair, and B. Calder, "Phase tracking and prediction," SIGARCH Comput. Archit. News, vol. 31, no. 2, pp. 336–349, May 2003. [Online]. Available: https://doi.org/10.1145/871656.859657

[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–489, 2016.

[28] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018. [Online]. Available: https://science.sciencemag.org/content/362/6419/1140

[29] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354–359, 2017.

[30] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.

[31] S. J. Tarsa, R. B. R. Chowdhury, J. Sebot, G. Chinya, J. Gaur, K. Sankaranarayanan, C.-K. Lin, R. Chappell, R. Singhal, and H. Wang, "Post-silicon CPU adaptation made practical using machine learning," in Proceedings of the 46th International Symposium on Computer Architecture, ser. ISCA '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 14–26. [Online]. Available: https://doi.org/10.1145/3307650.3322267

[32] R. Ubal, J. Sahuquillo, S. Petit, and P. López, "Multi2Sim: A simulation framework to evaluate multicore-multithreaded processors," in Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Oct 2007, pp. 62–68.

[33] J. Wildstrom, P. Stone, E. Witchel, and M. Dahlin, "Machine learning for on-line hardware reconfiguration," in IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6–12, 2007, pp. 1113–1118. [Online]. Available: http://ijcai.org/Proceedings/07/Papers/180.pdf