FORECASTER: A Continual Lifelong Learning Approach to Improve Hardware Efficiency

Phat Nguyen, Abhishek Taur, and Abdullah Muzahid (Department of Computer Science and Engineering, Texas A&M University); Arnav Kansal and Mohamed Zahran (Department of Computer Science, New York University)
ABSTRACT
Computer applications are continuously evolving. However, significant knowledge can be harvested from older applications or versions and applied in the context of newer applications or versions. Such a vision can be realized with continual lifelong learning. Therefore, we propose to employ continual lifelong learning to dynamically tune hardware configurations based on an application's behavior. The goal of such tuning is to maximize hardware efficiency (i.e., maximize an application's performance while minimizing the hardware's energy consumption). Our proposed approach, FORECASTER, uses deep reinforcement learning to continually learn during the execution of an application, as well as to propagate and utilize the accumulated knowledge during subsequent executions of the same or new applications. We propose novel hardware and ISA support to implement deep reinforcement learning. We implement FORECASTER and compare its performance against prior learning-based hardware reconfiguration approaches. Our results show that FORECASTER can save as much as 17.5% of system power over a baseline setup with all resources enabled. On average, FORECASTER saves 16% of system power over the baseline setup while sacrificing an average of 4.7% of execution time.
1. INTRODUCTION
Computer architects are in a continuous quest to find the best hardware design for different program types. We cannot build a separate application-specific hardware design for every program type because doing so would be prohibitively expensive. What makes things even more challenging is that a single program passes through different phases during its execution lifetime, and each phase has a different best hardware configuration. This paper presents a step toward a solution.

Previously, designers gathered profiling information about a program's execution on a piece of hardware and then used this information to enhance either the hardware or the program. However, this means that each program must be instrumented first, and the information gathered during profiling applies only to that application. Our proposed idea is based on a simple hypothesis: any phase of a program execution has a best hardware configuration. Each phase has certain characteristics. So, if a different program has a phase with similar characteristics, it can use the same configuration to get the best performance. Therefore, if we can learn the best configuration for different program phases, we can use that knowledge to find the best configuration for new, unseen programs. In other words, there is a finite set of patterns along which hardware/software interactions can occur to give the best performance. For example, given a cache configuration, there is a finite set of memory access patterns that yield low cache misses. Conversely, given a memory access pattern, we can build the cache configuration that yields the lowest number of misses.
The main goal of this paper is to design hardware with configurable knobs that learns from its interactions with programs so that it can reconfigure itself into the configuration that achieves the best performance for new, unseen programs.
As the hardware executes more programs, it learns more patterns and can achieve better performance for more and more programs. There are several challenges that need to be tackled in order to reach this goal. First, what are the knobs to be changed? There are many structures that can be designed to be reconfigurable. Our main criterion is to pick knobs that have the biggest impact on performance and power and, at the same time, can be reconfigured with the least hardware cost and modification. Second, how do we learn the patterns of hardware/software interaction so as to suggest the best configuration? A pattern here means profiling information, such as telemetry collected from performance counters. For each pattern, there is a hardware configuration that leads to the best performance and power, or whatever other metric needs to be optimized. Clearly, the number of patterns is large and, depending on the number and types of knobs, the number of hardware configurations is also large. This is why straightforward classifiers such as Bloom filters are not a viable option. Using a neural network in a supervised deep-learning setup does not lead to good results at an early stage because it requires a large number of examples in the training phase. Therefore, we need a learning approach that does not depend on a pre-collected set of labeled training examples. We read profiling information and make changes to the hardware based on this information. That is, we make changes to the environment and get feedback about how well we are doing. This is a description of reinforcement learning.
The main contribution of this paper is a hardware scheme, called FORECASTER, that uses continual learning, from one execution to another, via deep reinforcement learning, to reconfigure certain knobs so as to obtain the best performance and power for different programs.
We implemented FORECASTER using the Multi2Sim simulator [32]. Our experimental results using the PARSEC benchmarks show that the proposed technique can save as much as 17.5% of system power over the baseline with all resources. On average, our scheme saves 16% of system power over the baseline setup while sacrificing only an average of 4.7% of execution time.

The rest of the paper is organized as follows: Section 2 presents background material; Section 3 describes the main idea of FORECASTER; Section 4 shows the detailed implementation of FORECASTER; Section 5 presents experimental results; Section 6 highlights related work; and, finally, Section 7 concludes our work.
2. BACKGROUND

2.1 Hardware Adaptation
There is a considerable amount of prior work on reconfigurable architectures [3, 4, 8, 21, 33]. However, unlike FORECASTER, the majority of this work did not use any learning [3, 21]. Among the learning-based approaches, most used offline training [4, 8, 33]. Only a few approaches utilized online training; however, they focused on a single hardware structure [12].

Choi and Yeung [6] perform microarchitectural resource distribution in an SMT processor using a hill-climbing algorithm. Bitirgen et al. [4] propose a scheme that combines the performance prediction models of multiple applications to get an aggregate performance prediction for the overall resource distribution. The scheme is coupled with a limited probabilistic search technique to find the optimal resource distribution to improve performance. Petrica et al. [21] present Flicker, a general-purpose multicore architecture that dynamically adapts to varying limits on allocated power. A Flicker core has reconfigurable lanes through the pipeline that allow tailoring an individual core to the running application with low overhead.

Dubach et al. [8] propose the use of machine learning to dynamically optimize the efficiency of processor components such as the arithmetic logic unit, instruction queues, register file, caches, branch predictor, and pipeline depth. During program execution, as soon as a phase change is detected, the hardware starts to collect counters on a predefined profiling configuration. These counters represent the usage of the hardware resources in that interval. The model then predicts the optimal configuration, and the system is reconfigured accordingly for the rest of the phase. Unlike our approach, Dubach et al. propose learning for each program separately. Moreover, they also use offline training, which can limit the adaptability of the model to future unseen programs.

There is also other work that utilizes profiling information for optimization. However, the majority of that work is related to software optimization [19]. For hardware designs, profiling information has traditionally been used to make design choices before fabrication [11].
2.2 Reinforcement Learning

Reinforcement learning is a subfield of machine learning concerned with autonomous agents that can learn, without supervision, to optimize an objective [10]. In reinforcement learning, the knowledge of the agent is built through trial and error. At each time step, the agent takes an action and observes feedback from the environment about how well it is doing and how close it is to the goal. A reinforcement learning problem typically consists of three main components:

• a set of states that represent the environment at different time steps;
• a set of possible actions that the agent can take;
• a reward function that issues a reward for each action of the agent.

A state is defined as the information about the condition of the environment at a time step. The agent observes this information and selects the most appropriate action. As the agent takes the selected action, the environment transitions from the current state to another state. After that, the reward function assigns the agent a reward. The value of this reward depends on how good the state-action pair is. Since this reward function is called the Q-function, the reward is called the Q-value. The agent accumulates knowledge by storing and updating these Q-values after each time step.

Ipek et al. [12] formulate DRAM scheduling as a reinforcement Q-learning problem with the goal of optimizing bus utilization and throughput. In every clock cycle, the agent picks one out of six possible actions available to the scheduler. The agent is given a numerical Q-value of 1 whenever it issues a command that increases data bus utilization and 0 otherwise. The state is defined as a combination of attributes that represent the state of the controller's transaction queue. Ipek et al. show that this approach can improve bus utilization and bandwidth efficiency by a significant amount compared to a state-of-the-art DRAM scheduler.

Early reinforcement Q-learning techniques use a Q-table to store the Q-values, indexed by state-action pairs. As modern problems become more and more complex, this approach becomes inefficient, since the Q-table size inflates with the number of state-action pairs. The answer to this issue is Deep Q-learning (DQN), a cross-breeding of reinforcement Q-learning and deep learning techniques. In DQN, the reward table is replaced by a multi-layer neural network that predicts the Q-value for any particular state-action pair.

There is a significant amount of work on applications of DQN [17, 27, 28, 29]. Mnih et al. [17] apply DQN to seven Atari 2600 games and show that it outperforms all previous reinforcement learning approaches. Moreover, this DQN model also manages to beat a human expert in three out of the seven games. The model uses only the raw pixels of the application screen as input and outputs the expected future reward of the taken action. DeepMind Technologies uses DQN in the series of AlphaGo programs [27, 28, 29] to solve the game of Go. The original AlphaGo version [27] outperforms all previous Go programs and is the first Go program to beat a professional human player. The latest version of the series, AlphaZero, can teach itself three different games: Go, chess, and shogi [28]. However, at the time of this paper, there is no published research on applying DQN to hardware optimization.

3. MAIN IDEA: FORECASTER

FORECASTER periodically collects hardware telemetry during the execution of a program.
The telemetry consists of various hardware event counters maintained by the processor architecture. FORECASTER uses the hardware telemetry in a deep reinforcement learning algorithm to predict the optimal configurations of tunable hardware resources. The goal of the predicted configurations is to maximize the efficiency of the hardware. The overall workflow of FORECASTER is shown in Figure 1. FORECASTER reconfigures the hardware resources according to the prediction and receives a reward after a while. FORECASTER receives a positive reward if the hardware efficiency improves due to the reconfiguration. Otherwise, it receives a zero or negative reward. Rewards are a feedback mechanism that encourages configurations associated with positive rewards while discouraging configurations associated with non-positive rewards. Based on the reward, FORECASTER updates the Q-values (used by the reinforcement learning algorithm) so that efficiency-boosting configurations are predicted more frequently. Thus, FORECASTER continually improves its prediction during the execution of an application. The next time the same or a new application executes, FORECASTER reuses the Q-values learned from prior executions and continues to improve its prediction accuracy. In other words, FORECASTER keeps learning from one execution to the next, both within and across applications, thereby realizing continual lifelong learning with the goal of maximizing hardware efficiency. In the next few sections, we elaborate on the different steps of FORECASTER.
When an application starts, FORECASTER begins with the maximum amount of hardware resources. This prevents any slowdown at the beginning. Progressively, FORECASTER tries to reconfigure the tunable hardware resources to maximize hardware efficiency. We use IPC/Power as the metric for hardware efficiency; a similar metric has been used in prior work [8]. As the tunable hardware resources, we choose the L2 and L3 caches as well as the Branch Target Buffer (BTB) and the prefetcher. We choose the caches because they are the most energy-hungry resources on a chip [13]. We choose the other resources because they can easily be clock-gated without intrusive changes to the pipeline circuitry. Although we demonstrate the effectiveness of FORECASTER with these four tunable resources, we argue that FORECASTER is general enough to accommodate any number of tunable resources. Table 1 shows the tunable resources and their possible configurations.
Tunable Resource         Configurations
BTB size                 0.5K, 1K, 1.5K, and 2K entries
Prefetcher               On, Off
L2 (private) cache       256KB, 512KB, 768KB, and 1024KB
L3 (shared LLC) cache    4MB, 8MB, 12MB, and 16MB

Table 1: List of tunable hardware resources. The initial configuration is the maximum setting of each resource.
A program usually goes through distinct phases during its execution [25]. Some phases may benefit from more cache while others might benefit from a larger BTB. We collect hardware telemetry as an approximation of how a program behaves. Modern processors provide hundreds of hardware event counters as telemetry. After inspecting the available hardware event counters, we choose the n counters most relevant to the tunable hardware resources. Let us denote the set of counters (i.e., the hardware telemetry) as T = {T_i}, 1 ≤ i ≤ n, where each T_i is an individual counter. FORECASTER collects these counters at a regular interval. At the beginning of each interval, FORECASTER uses the counters to predict the configurations of the tunable resources.

FORECASTER uses reinforcement learning, more specifically a Deep Q-learning (DQN) model, for prediction. In this model, the current configuration of the hardware resources, C, as well as the behavior of the program as specified by the telemetry, T, is provided as a state, S. In other words, S = <T, C>. Given a state S_t at a time period t, the deep Q-learning model predicts Q-values for all possible actions in that state using a deep neural network (DNN). Each action indicates a different configuration of the hardware resources. Thus, if we have N possible actions, the model predicts N different Q-values, one for each action A_i, where 1 ≤ i ≤ N. The Q-value associated with action A_i, say Q_{A_i}, is an estimate of how good the new configuration (corresponding to A_i) is at maximizing hardware efficiency. A higher Q-value implies a better configuration. Therefore, FORECASTER chooses the action associated with the maximum Q-value.

Naively designating one action for each configuration leads to a large number of actions. For example, based on Table 1, we can have 4*4*2*4 = 128 possible configurations and, hence, the same number of actions. Reinforcement learning with a large action space takes a long time to train due to the sparsity of rewards [24]. Therefore, in order to reduce the number of actions, we express each action in terms of changes to the configuration. Suppose ↑, ↓, and = indicate that a resource should be increased, decreased, or kept at the same level, respectively. If a resource has only two configurations, we use ON and OFF to indicate those configurations. With these notations, we can define an action as A = <R_i^a>, 1 ≤ i ≤ n, where R_i represents the i-th resource and a represents a change in R_i's configuration, such as ↑, ↓, =, ON, or OFF. For example, suppose the current configuration is denoted by C = <L2_512K, L3_8M, PF_OFF, BTB_0.5K>. Then, an action <L2↑, L3↓, PF_OFF, BTB↑> will create the new configuration <L2_768K, L3_4M, PF_OFF, BTB_1K>. With this encoding, the number of actions is reduced from 128 to 3*3*2*3 = 54, i.e., less than half of the initial action count. With the reduced action space, the overall prediction process is illustrated in Figure 2. FORECASTER reconfigures the tunable resources according to the predicted configuration.
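To make the delta-action encoding concrete, the sketch below enumerates the 54 actions and applies one of them to a configuration. This is our illustration rather than the authors' implementation: the helper names are hypothetical, the 2K BTB level follows Table 1, and we assume an ↑ or ↓ at the end of a resource's range simply saturates.

```python
from itertools import product

# Possible levels for each multi-level resource (Table 1).
LEVELS = {
    "L2":  [256, 512, 768, 1024],    # private L2 size, KB
    "L3":  [4, 8, 12, 16],           # shared L3 size, MB
    "BTB": [512, 1024, 1536, 2048],  # BTB entries
}

# Delta-encoded action space: 3 (L2) * 3 (L3) * 2 (PF) * 3 (BTB) = 54.
ACTIONS = list(product(["up", "down", "same"],   # L2
                       ["up", "down", "same"],   # L3
                       ["ON", "OFF"],            # prefetcher
                       ["up", "down", "same"]))  # BTB

def apply_action(config, action):
    """Return the configuration produced by one delta action.

    Moves saturate at the smallest/largest level (an assumption; the
    paper does not say what happens at the ends of the range).
    """
    l2, l3, pf, btb = action
    new = dict(config)
    for name, move in (("L2", l2), ("L3", l3), ("BTB", btb)):
        i = LEVELS[name].index(config[name])
        if move == "up":
            i = min(i + 1, len(LEVELS[name]) - 1)
        elif move == "down":
            i = max(i - 1, 0)
        new[name] = LEVELS[name][i]
    new["PF"] = pf
    return new

# The example from the text: raise L2, lower L3, keep the prefetcher
# off, and raise the BTB.
cfg = {"L2": 512, "L3": 8, "PF": "OFF", "BTB": 512}
print(len(ACTIONS))                                   # 54
print(apply_action(cfg, ("up", "down", "OFF", "up")))
# {'L2': 768, 'L3': 4, 'PF': 'OFF', 'BTB': 1024}
```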
Figure 1: Overall workflow of FORECASTER. (The application starts executing with maximum resources; FORECASTER periodically collects hardware telemetry, predicts a new configuration, reconfigures the resources, and calculates the efficiency and rewards to update the Q-values. Q-values are loaded at the start of execution and saved at the end.)

Figure 2: FORECASTER uses hardware telemetry to predict configurations. (From telemetry T_t and configuration C_t, the DNN predicts Q-values Q_1 ... Q_m for actions A_1 ... A_m; the action A_i associated with the maximum Q-value is selected and applied to produce C_{t+1}.)

We now describe how each resource is reconfigured.

The BTB has four possible configurations (Table 1). Therefore, we can partition the BTB into four sections, B1, B2, B3, and B4 (Figure 3). For the first configuration (i.e., 0.5K entries), sections (B2, B3, B4) are clock-gated. Similarly, for the second and third configurations, sections (B3, B4) and (B4) are clock-gated, respectively. The last configuration does not clock-gate any section at all. FORECASTER reconfigures all BTBs to the same configuration; this is done to simplify the prediction and reconfiguration logic in FORECASTER.

Figure 3: Logic for reconfiguring the BTB.

The prefetcher is used either completely or not at all. Therefore, the prefetcher is clock-gated entirely or not at all, and the reconfiguration logic simply generates a single clock-gating signal for the entire prefetcher.
Figure 4: Logic for reconfiguring the L2 and L3 caches.
In order to reconfigure the caches, FORECASTER makes three design choices. First, FORECASTER does not clock-gate an entire set; as a result, the address decoding logic remains unchanged. Second, from each set, FORECASTER clock-gates the invalid lines; FORECASTER never clock-gates any valid lines of the cache. Third, whenever more than the required number of cache lines in a single set satisfy the selection criteria, FORECASTER randomly chooses some of them to clock-gate. Figure 4 shows the schematic for reconfiguring the caches. The way selection logic first determines how many lines need to be clock-gated in each set. Then, it selects which ways to clock-gate based on the selection criteria and sends a signal to those ways. Because only invalid lines are clock-gated, FORECASTER does not need to worry about cache coherence issues. During the clock-gating process, the cache controller blocks any incoming request to that particular cache set. The request is handled after the clock-gating is complete.
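The way-selection policy just described can be sketched as follows (our illustration; the function and variable names are hypothetical). Invalid lines are the only candidates, valid lines are never gated, and surplus candidates are chosen at random:

```python
import random

def select_ways_to_gate(valid_bits, lines_to_gate):
    """Pick the ways of one cache set to clock-gate.

    `valid_bits[w]` is True if way w currently holds a valid line.
    Only invalid lines satisfy the selection criteria; if more of them
    exist than needed, a random subset is gated. If fewer exist, we
    gate only those (an assumption consistent with never gating a
    valid line).
    """
    candidates = [w for w, valid in enumerate(valid_bits) if not valid]
    if len(candidates) <= lines_to_gate:
        return candidates
    return random.sample(candidates, lines_to_gate)

# Example: an 8-way set where ways 1, 4, 5, and 7 are invalid and the
# new configuration shrinks the set by two ways.
print(select_ways_to_gate([True, False, True, True,
                           False, False, True, False], 2))
```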
Figure 5: Timing of various steps of FORECASTER. (In interval t, FORECASTER collects HW telemetry H_t and predicts configuration C_t; at the start of interval t+1, it calculates efficiency E_t and reward R_t, updates the Q-values, and predicts C_{t+1}.)

FORECASTER collects the hardware telemetry H_t at the beginning of an interval t and determines the new configuration C_t using the Q-value Q_t predicted by the DNN. Then, it reconfigures the hardware resources accordingly and continues the program execution. The timing is shown in Figure 5. At the beginning of the next interval, t+1, FORECASTER calculates the new efficiency that results from the reconfiguration. FORECASTER compares the new efficiency with the old one (the one calculated at the beginning of interval t). If the efficiency increases, FORECASTER receives a positive reward; if it remains the same, FORECASTER receives a reward of 0; and if it decreases, FORECASTER receives a negative reward. FORECASTER then uses the commonly used temporal difference method to update Q_{t-1} (the Q-value predicted at interval t-1) [30]. In this method, the new Q-value, Q'_{t-1}, is calculated using the following equation:

Q'_{t-1} = (1 − α) Q_{t-1} + α [r + γ Q_t]

Here, α is the learning rate, γ is the discount factor, and r is the reward. The DNN uses backpropagation to learn Q'_{t-1}.
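The reward and update step might look like the following sketch (our Python rendering of the equation above; the ±1 reward magnitudes and the α and γ values are assumptions, since the text only fixes the sign of the reward):

```python
def reward(eff_new, eff_old):
    """Sign of the change in efficiency (IPC/Power); magnitudes assumed."""
    if eff_new > eff_old:
        return 1.0
    if eff_new < eff_old:
        return -1.0
    return 0.0

def td_update(q_prev, q_curr, r, alpha=0.1, gamma=0.9):
    """Temporal-difference target: Q'_{t-1} = (1-a)Q_{t-1} + a(r + g*Q_t).

    q_prev is the Q-value predicted at interval t-1 for the chosen
    action; q_curr is the Q-value predicted at interval t. The DNN is
    then trained toward this target by backpropagation.
    """
    return (1 - alpha) * q_prev + alpha * (r + gamma * q_curr)

# Example: efficiency improved, so the chosen action's Q-value rises.
print(td_update(q_prev=0.5, q_curr=0.8, r=reward(1.2, 1.0)))  # ~0.622
```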
The use of DQN in FORECASTER provides a natural way to implement continual lifelong learning. In the DQN model, FORECASTER learns by training a DNN with Q-values. To continue learning from one execution to the next, FORECASTER stores the DNN topology and weights in a file at the end of each execution. Section 4 presents an extension to the ISA that is used to read the topology and weights of the DNN and write them back from a special file. At the beginning of the next execution, FORECASTER loads the topology and weights of the DNN and continues to learn from where it left off in the last execution. If multiple applications run concurrently on a processor, each application stores its own DNN topology and weights at the end of its respective execution. In that case, FORECASTER uses an offline process to merge the networks periodically (e.g., once a day). For merging, FORECASTER uses TensorFlow [1] to load all the DNNs and generates a number of random state samples. For each state sample, FORECASTER calculates the Q-value of each action using all the DNNs, takes the average of the Q-values, and retrains the largest DNN with these averages. Thus, the largest DNN accumulates the knowledge of all the DNNs. At the beginning of the next execution of an application, FORECASTER loads this merged DNN and continues execution.
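The offline merging step could be realized as in the following sketch, using the TensorFlow/Keras API the paper cites [1]. This is our reconstruction: the file names, sample count, and training hyperparameters are all hypothetical.

```python
import numpy as np
import tensorflow as tf

# Per-application DQNs saved at the end of their executions (names assumed).
paths = ["app_a_dqn.h5", "app_b_dqn.h5", "app_c_dqn.h5"]
models = [tf.keras.models.load_model(p) for p in paths]

# Generate random state samples; a state encodes <telemetry, configuration>.
state_dim = models[0].input_shape[1]
states = np.random.rand(4096, state_dim).astype("float32")

# Average the Q-values that all the DNNs predict for each sampled state ...
avg_q = np.mean([m.predict(states, verbose=0) for m in models], axis=0)

# ... and retrain the largest DNN toward the average, so it accumulates
# the knowledge of all the networks.
largest = max(models, key=lambda m: m.count_params())
largest.compile(optimizer="adam", loss="mse")
largest.fit(states, avg_q, epochs=5, verbose=0)
largest.save("merged_dqn.h5")
```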
4. IMPLEMENTATION
In this section, we outline the implementation of Deep Q-learning in FORECASTER as well as the extension to the ISA.
We propose to add a DQN module to the chip. The module contains a Neural Processing Unit (NPU) for implementing the DNN, along with additional buffers (an input buffer and a replay buffer) and control logic. Figure 6 shows a high-level schematic of the DQN module.

There are several NPU designs in the literature [2, 9, 23]. We propose to use an NPU similar to the one proposed by Esmaeilzadeh et al. [9]. It consists of a number of Processing Elements (PEs) and a scheduler. Each PE implements an individual artificial neuron. Each PE contains input and weight registers, a multiplier, an adder, a partial sum register, and a comparator. The input and weight registers, along with the adder, multiplier, and partial sum register, are used to calculate the dot product of inputs and weights. The comparator is used to implement the ReLU activation function [18]. The scheduler schedules each layer of the DNN onto the PEs, starting from the input layer. After calculating the Q-values, the current input and Q-values are stored in the replay buffer. When FORECASTER receives a reward and calculates the updated Q-value, the replay buffer provides the saved inputs and Q-values to the NPU to learn the new Q-value.
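Functionally, each PE performs a multiply-accumulate and the comparator applies ReLU, while the scheduler walks the layers from the input layer onward. A minimal NumPy sketch of that forward pass (our illustration; the layer sizes are placeholders, and we assume no activation on the output layer):

```python
import numpy as np

def npu_forward(state, weights, biases):
    """Evaluate the DNN layer by layer, as the NPU scheduler does.

    Each layer is a dot product of inputs and weights (the PE's
    multiply-accumulate loop) followed by ReLU, which the PE's
    comparator implements as max(0, x).
    """
    x = state
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ w + b)   # hidden layer + ReLU
    return x @ weights[-1] + biases[-1]  # Q-value output (no activation)

# Toy example: 10 telemetry/configuration inputs, two hidden layers of
# six neurons, and one Q-value per action.
rng = np.random.default_rng(0)
sizes = [(10, 6), (6, 6), (6, 54)]
Ws = [rng.standard_normal(s) for s in sizes]
bs = [np.zeros(s[1]) for s in sizes]
print(npu_forward(rng.standard_normal(10), Ws, bs).shape)  # (54,)
```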
Figure 6: Details of the DQN module.
The control logic sequences the operations to implement the DQN algorithm. The control logic also contains three special registers that store the learning rate, the discount factor, and the exploration ratio. The learning rate and discount factor are used to calculate the new Q-value (Section 3.4). The exploration ratio dictates how often the module chooses a random exploratory action as opposed to the action with the maximum Q-value. DQN uses exploratory actions to try actions that would otherwise never be selected. This is done to find a potentially better action than the one based on prior knowledge.
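The exploration ratio plays the role of ε in a standard ε-greedy policy; a minimal sketch (our illustration, with an assumed ratio):

```python
import random

def select_action(q_values, exploration_ratio=0.05):
    """Pick a random action with probability `exploration_ratio`;
    otherwise pick the action with the maximum predicted Q-value."""
    if random.random() < exploration_ratio:
        return random.randrange(len(q_values))              # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```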
Based on our experiments (Section 5.2.6) and intuition, we select the following hardware counters as telemetry: (i) the number of integer instructions, (ii) the number of logical instructions, (iii) the number of floating-point instructions, (iv) the number of memory access instructions, and (v) the number of control flow instructions. Each core collects its telemetry independently and sends it to the DQN module after every n (e.g., n = 10,000) instructions. When the DQN module has received telemetry covering a total of at least N (e.g., N = 500,000) instructions, FORECASTER assumes the start of a new interval. The DQN module aggregates the telemetry and normalizes each counter with respect to the total instruction count of the interval that just finished. The DQN module then predicts the new configuration and sends a reconfiguration message to each core.
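The interval bookkeeping could look like the sketch below (our illustration; the counter handling and the helper name are hypothetical, and N follows the example value in the text):

```python
N = 500_000  # instructions per interval (example value from the text)

def close_interval(per_core_samples):
    """Aggregate per-core counter samples and normalize an interval.

    `per_core_samples` holds one dict per telemetry message, each with
    the five counters plus the instruction count it covers. Returns
    each counter divided by the interval's total instruction count.
    """
    totals, instructions = {}, 0
    for sample in per_core_samples:
        instructions += sample["instructions"]
        for name, value in sample.items():
            if name != "instructions":
                totals[name] = totals.get(name, 0) + value
    assert instructions >= N, "interval not finished yet"
    return {name: value / instructions for name, value in totals.items()}
```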
We extend the ISA with instructions to set and get DQN configurations (i.e., the topology and weights of the DNN), and we propose a fixed format for these configurations. At the end of an execution, FORECASTER executes a sequence of getconf instructions in a loop until the entire DQN configuration has been read out and saved; the corresponding set instructions write the configuration back at the beginning of the next execution.
5. EXPERIMENTAL EVALUATION

Table 2 shows the parameters of the simulated hardware that we use to conduct the experiments. We use a modified version of Multi2Sim [32] and McPAT [15] to simulate the experimental hardware and its power consumption. The PARSEC 3.0 benchmark suite is used with small inputs. Due to resource and time constraints, all benchmarks are run either to completion or for 1 billion instructions, whichever comes first. The interval size N is set at 0.5M instructions.

Parameter                Value
CPU                      8-core @ 2.4GHz, SMT off
Private L1 cache (I/D)   32KB, 64B line, 8-way
Private L2 cache         1024KB, 64B line, 8-way
Shared L3 cache          16MB, 64B line, 16-way
Coherence protocol       Directory-based MOESI

Table 2: Parameters of the simulated hardware.

We conduct three experiments on three versions of FORECASTER:

• Experiment 1: FORECASTER is implemented with a giant table to store and update the Q-values. All applications are run five times. Each run starts with an empty Q-table. This version is essentially an adoption of the prior reinforcement learning-based approach [12] in the current usage scenario.
• Experiment 2: FORECASTER is implemented with a deep neural network to predict the Q-values. All applications are run five times. Each run starts with an untrained neural model.
• Experiment 3: FORECASTER is implemented with a deep neural network to predict the Q-values. All applications are run five times. Each run starts with the trained neural model inherited from the previous execution.

In the first experiment, each run starts with an empty Q-table, which means there is no knowledge accumulation between executions. This technique is basically the Q-learning adopted from [12]. In the second experiment, we replace the Q-table with a deep neural network to see how DQN compares to vanilla Q-learning. Experiment 3 is similar to Experiment 2 except that each execution starts with the model taken from the previous execution. The purpose of this experiment is to investigate the efficacy of knowledge accumulation.

Figure 7: Average amount of (a) L2, (b) L3, and (c) BTB turned off during the execution of streamcluster.
Figures 7(a), 7(b), and 7(c) show how FORECASTER manages the hardware resources during an execution of streamcluster. On average, FORECASTER can turn off 64%, 66%, and 66% of the L2 cache, L3 cache, and BTB, respectively. FORECASTER also deactivates the prefetcher for 26% of all intervals. Similar behavior can be seen for the other programs in the benchmark suite. FORECASTER is able to determine the best size for each structure in each phase. This can be seen from the repetitive pattern in the figures, which maps to phases in each program. In this paper, we use static phases of a fixed number of instructions. In the future, we plan to use phase detection techniques [7, 26], which is expected to make the scheme even more efficient.
Figure 8: Normalized power consumption of five executions of the applications.

Experimental results show that FORECASTER with continuous learning uses the least power compared to the other techniques, and power similar to the best static configuration, as shown in Figure 8. On average, FORECASTER with accumulated knowledge saves 16% of power across all applications compared to the baseline. This is 2% more than the version without continuous learning and 8% more than the version with the basic Q-table.
Figure 9: Normalized efficiencies of five executions of the applications.

The efficiency of each experiment is shown in Figure 9. In general, our scheme outperforms the baseline configuration in all benchmarks except canneal. Interestingly, the Q-table version gives the best efficiency compared to the two versions with the neural network. This may be because the Q-table does not require as much time to learn as the neural network. Due to time constraints, only two executions of canneal were completed for Experiment 3, which is why the neural network does not perform as expected there.
Figure 10: Normalized IPCs of five executions of the applications.

Figure 10 shows that there is not much IPC degradation when using FORECASTER. Specifically, the system IPC when running swaptions is virtually unchanged across all versions of FORECASTER. The Q-table version of FORECASTER has the most consistent performance, as it only causes a 1.2% IPC overhead on average. This result is comparable to the best static configuration. In canneal, the two versions with the deep neural network perform badly, as they degrade the system IPC by about 15%. One reason is that it takes time to train the neural network before it reaches reasonable accuracy.

Figure 11: Normalized number of cycles taken between experiments.

The normalized execution time, measured in number of cycles, is shown in Figure 11. Overall, the execution time overhead incurred by FORECASTER is less than 5%. FORECASTER tends to perform better in multi-threaded applications such as streamcluster, swaptions, and fluidanimate than in single-threaded applications such as canneal. This is because FORECASTER makes only one prediction for all cores, and the prediction largely depends on the resources of the core that is heavily used. For example, single-threaded programs use only one core, so the L2 cache of that core is mostly occupied. However, when FORECASTER reconfigures the hardware, it turns off the same amount of L2 cache on every core, even though the L2 caches of the other cores are mostly empty. This is a limitation of FORECASTER that can be the subject of future research.
The cost of the proposed design can be divided into three parts: delay (latency) cost, hardware cost, and power consumption cost. As for the latency cost, reading the hardware telemetry and making a reconfiguration decision does not happen on the critical path. The hardware continues in its old configuration until the decision for a new configuration is made.

The hardware cost consists of the DQN hardware and the extra hardware used to implement the knobs. The DQN uses a seven-layer neural network with six neurons per layer, plus an input layer of 10 neurons and an output layer of one neuron. So, we use eight processing elements: the input layer takes two cycles, as it needs to do the work of 10 neurons, followed by one cycle per subsequent layer. Each processing element (PE) is a simple execution unit that can perform one fused multiply-add operation per cycle, similar to the execution units found in traditional graphics processing units (GPUs). The PEs are organized in a design similar to the neural processing unit (NPU) described in [9]. We also need two extra registers for the old Q-value and the new Q-value (calculated by the neural network based on the reward). A simple computation unit is needed to calculate the new Q-value as shown in Section 3.4.

The hardware needed for the knobs is straightforward. The prefetcher is simply clock-gated, as its knob is on/off. The BTB also uses clock-gating depending on the configuration; since we have four configurations, a small 2x4 decoder does the job, as shown in the reconfiguration logic of Figure 3. Clock-gating the cache ways is simplified by the fact that the way-reconfiguration logic, shown in Figure 4, never gates a valid entry, so no change to the cache controller or coherence hardware is needed. The way-reconfiguration logic is not complicated because it exploits the fact that large caches (such as the L3) are usually partitioned; therefore, we have one logic circuit per partition.

The power consumption of the above hardware is not high, due to several factors. First, the extra hardware is activated only at the end of each program phase to make a prediction and reconfigure the knobs. Second, the extra power consumption is much smaller than the power saved by gating the reconfigured structures. Finally, there are several options for implementing the neural network, ranging from executing it as a software component on a CPU or GPU to designing it as a digital ASIC [9], an FPGA [14], or an analog ASIC [5, 16]. Each approach has its own characteristics of area, power, and cost.

We conduct three additional experiments in order to determine the optimal interval size, history length, and number of counters to collect. The history length experiment shows how far into the past we should look when determining the best configuration for the current interval. Simulation results show that increasing the history length from 1 to 2 intervals reduces the efficiency gains by 3%, as shown in Figure 12.

The number-of-counters experiment shows how many counters should be considered to best represent an interval. We test three sets: a 3-counter, a 5-counter, and an 8-counter set. Below is the list of the 8 counters that we collect:

• normalized number of dispatched integer instructions;
• normalized number of dispatched logical instructions;
• normalized number of dispatched floating-point instructions;
• normalized number of dispatched memory instructions;
• normalized number of dispatched control instructions;
• minimum free space across all L2 caches;
• free space of the shared L3 cache;
• branch predictor misprediction rate.

The 8-counter set includes all of the counters above. The 5-counter set includes only the normalized dispatched-instruction counters, leaving out the last three. The 3-counter set includes only the numbers of dispatched integer, memory, and control instructions. Figure 13 shows that the set of 5 counters gives the best efficiency. A set of 3 counters does not have enough representational power, while the full set of 8 counters is redundant.

The third experiment determines how big an interval should be. We test interval sizes of 0.25M, 0.5M, 1M, and 2M instructions. Simulation results show that setting the interval size at 0.5M instructions gives 0.02%, 0.11%, and 0.11% more efficiency gain than 2M, 1M, and 0.25M instructions, respectively (Figure 14).
Figure 12: Efficiency comparison between different history lengths.
Overall, the continuous learning version of FORECASTER can save up to 17.5% of power consumption in some applications and 16% on average compared to the baseline setup. It gives an efficiency gain of 4% while sacrificing 4.7% of execution time.
Figure 13: Efficiency comparison between different numbers of counters.

Figure 14: Efficiency comparison between different interval sizes.

6. RELATED WORK

Tarsa et al. [31] propose a lightweight ML framework that can be distributed through firmware updates to the microcontrollers of post-silicon CPUs. The ML model is first trained offline with a diverse collection of applications to avoid statistical blind spots. During execution, the CPU dynamically sets the issue width of a clustered hardware component while clock-gating unused resources based on the prediction of the ML model.

Pan et al. [20] present a multi-level reinforcement learning framework (MLRL) to address the scalability issue of dynamic power management in multi-core processors. MLRL effectively reduces the exponential decision process to a linear problem by exploiting a hierarchical paradigm. In MLRL, core states and Q-values are propagated from the bottom to the top of a tree structure, and decisions are then propagated back down the tree, providing an efficient control mechanism.

Ravi et al. [22] propose CHARSTAR, a clock-tree-aware resource optimizing mechanism. CHARSTAR incorporates a multi-layer perceptron with one hidden layer to predict the optimal configuration in each execution phase. The neural network takes into account the clock hierarchy and the topology overhead in order to improve the power savings. However, the offline-trained model may soon become obsolete for future unseen programs. Moreover, CHARSTAR only works for single-threaded programs, and a multi-threaded version may cause a super-linear increase in the size of the neural network model.
7. CONCLUSIONS
This work demonstrates the potential of dynamically tuning hardware components to save power with a small performance overhead. Our scheme, FORECASTER, when incorporating a continually learning deep neural network, can save up to 17.5% of power consumption compared to the baseline configuration. On average, FORECASTER can reduce power usage by 16% while sacrificing 4.7% of execution time, which leads to a 4% efficiency gain. Future research may focus on improving the efficacy of FORECASTER as well as extending its control over more hardware resources to achieve further efficiency gains.
REFERENCES

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org. [Online]. Available: http://tensorflow.org/

[2] M. M. u. Alam and A. Muzahid, "Production-run software failure diagnosis via adaptive communication tracking," in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA '16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 354–366. [Online]. Available: https://doi.org/10.1109/ISCA.2016.39

[3] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, "Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures," in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, ser. MICRO 33. New York, NY, USA: ACM, 2000, pp. 245–257. [Online]. Available: http://doi.acm.org/10.1145/360128.360153

[4] R. Bitirgen, E. Ipek, and J. F. Martinez, "Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach," in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 41. Washington, DC, USA: IEEE Computer Society, 2008, pp. 318–329. [Online]. Available: https://doi.org/10.1109/MICRO.2008.4771801

[5] V. Calayir, M. Darwish, J. Weldon, and L. Pileggi, "Analog neuromorphic computing enabled by multi-gate programmable resistive devices," in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, ser. DATE '15. San Jose, CA, USA: EDA Consortium, 2015, pp. 928–931.

[6] S. Choi and D. Yeung, "Learning-based SMT processor resource distribution via hill-climbing," in Proceedings of the 33rd Annual International Symposium on Computer Architecture, ser. ISCA '06, Jun 2006, pp. 239–251.

[7] A. S. Dhodapkar and J. E. Smith, "Managing multi-configuration hardware via dynamic working set analysis," in Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002.

[8] C. Dubach, T. M. Jones, E. V. Bonilla, and M. F. P. O'Boyle, "A predictive model for dynamic microarchitectural adaptivity control," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2010, pp. 485–496.

[9] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE Computer Society, 2012, pp. 449–460. [Online]. Available: https://doi.org/10.1109/MICRO.2012.48

[10] L. Graesser and W. L. Keng, Foundations of Deep Reinforcement Learning: Theory and Practice in Python. Boston, MA, USA: Addison-Wesley Professional, 2018.

[11] H. Hubert and B. Stabernack, "Profiling-based hardware/software co-exploration for the design of video coding architectures," IEEE Transactions on Circuits and Systems for Video Technology, Sep 2009, pp. 1680–1691.

[12] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, "Self-optimizing memory controllers: A reinforcement learning approach," in Proceedings of the 35th Annual International Symposium on Computer Architecture, ser. ISCA '08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 39–50. [Online]. Available: https://doi.org/10.1109/ISCA.2008.21

[13] C. Isci, A. Buyuktosunoglu, C. Cher, P. Bose, and M. Martonosi, "An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 347–358.

[14] M.-J. Li, A.-H. Li, Y.-J. Huang, and S.-I. Chu, "Implementation of deep reinforcement learning," in Proceedings of the 2019 2nd International Conference on Information Science and Systems, ser. ICISS 2019. New York, NY, USA: Association for Computing Machinery, 2019, pp. 232–236. [Online]. Available: https://doi.org/10.1145/3322645.3322693

[15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, Oct 2009, pp. 469–480.

[16] D. Maliuk and Y. Makris, "An analog non-volatile neural network platform for prototyping RF BIST solutions," in Proceedings of the Conference on Design, Automation & Test in Europe, ser. DATE '14. Leuven, BEL: European Design and Automation Association, 2014.

[17] V. Mnih, K. Kavukcuoglu, and D. Silver, "Human-level control through deep reinforcement learning," Nature, vol. 518, Feb 2015, pp. 529–533.

[18] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on International Conference on Machine Learning, ser. ICML '10. Madison, WI, USA: Omnipress, 2010, pp. 807–814.

[19] D. Novillo, "SamplePGO - the power of profile guided optimizations without the usability burden," Nov 2014, pp. 22–28.

[20] G.-Y. Pan, J.-Y. Jou, and B.-C. Lai, "Scalable power management using multilevel reinforcement learning for multiprocessors," ACM Trans. Des. Autom. Electron. Syst., vol. 19, no. 4, Aug. 2014. [Online]. Available: https://doi.org/10.1145/2629486

[21] P. Petrica, A. M. Izraelevitz, D. H. Albonesi, and C. A. Shoemaker, "Flicker: A dynamically adaptive architecture for power limited multicore systems," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY, USA: ACM, 2013, pp. 13–23. [Online]. Available: http://doi.acm.org/10.1145/2485922.2485924

[22] G. S. Ravi and M. H. Lipasti, "CHARSTAR: Clock hierarchy aware resource scaling in tiled architectures," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 147–160. [Online]. Available: https://doi.org/10.1145/3079856.3080212

[23] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in International Symposium on Computer Architecture (ISCA), 2016. [Online]. Available: http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/05/reagen_isca16.pdf

[24] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, "Learning by playing - solving sparse reward tasks from scratch," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 4344–4353. [Online]. Available: http://proceedings.mlr.press/v80/riedmiller18a.html

[25] T. Sherwood, E. Perelman, and B. Calder, "Basic block distribution analysis to find periodic behavior and simulation points in applications," in Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '01. USA: IEEE Computer Society, 2001, pp. 3–14.

[26] T. Sherwood, S. Sair, and B. Calder, "Phase tracking and prediction," SIGARCH Comput. Archit. News, vol. 31, no. 2, pp. 336–349, May 2003. [Online]. Available: https://doi.org/10.1145/871656.859657

[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–489, 2016.

[28] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018. [Online]. Available: https://science.sciencemag.org/content/362/6419/1140

[29] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354–359, 2017.

[30] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.

[31] S. J. Tarsa, R. B. R. Chowdhury, J. Sebot, G. Chinya, J. Gaur, K. Sankaranarayanan, C.-K. Lin, R. Chappell, R. Singhal, and H. Wang, "Post-silicon CPU adaptation made practical using machine learning," in Proceedings of the 46th International Symposium on Computer Architecture, ser. ISCA '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 14–26. [Online]. Available: https://doi.org/10.1145/3307650.3322267

[32] R. Ubal, J. Sahuquillo, S. Petit, and P. López, "Multi2Sim: A simulation framework to evaluate multicore-multithreaded processors," in Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Oct 2007, pp. 62–68.

[33] J. Wildstrom, P. Stone, E. Witchel, and M. Dahlin, "Machine learning for on-line hardware reconfiguration," in IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6–12, 2007, pp. 1113–1118. [Online]. Available: http://ijcai.org/Proceedings/07/Papers/180.pdf