An Energy-Efficient Heterogeneous Memory Architecture for Future Dark Silicon Embedded Chip-Multiprocessors

Abstract

Main memories play an important role in the overall energy consumption of embedded systems. Using conventional memory technologies in future nanoscale designs causes a drastic increase in leakage power consumption and temperature-related problems. Emerging non-volatile memory (NVM) technologies offer many desirable characteristics such as near-zero leakage power, high density and non-volatility. They can significantly mitigate the issue of memory leakage power in future embedded chip-multiprocessor (eCMP) systems. However, they suffer from challenges such as limited write endurance and high write energy consumption, which restrict their adoption in modern memory systems. In this article, we present a convex optimization model to design a 3D stacked hybrid memory architecture that minimizes the energy consumption of future embedded systems in the dark silicon era. The proposed approach satisfies an endurance constraint in order to design a reliable memory system. Our convex model optimizes the number and placement of eDRAM and STT-RAM memory banks on the memory layer to exploit the advantages of both technologies in future eCMPs. Energy consumption, the main challenge in the dark silicon era, is the major target of this work and is minimized by the detailed optimization model in order to design a dark-silicon-aware 3D chip-multiprocessor. Experimental results show that, in comparison with the Baseline memory design, the proposed architecture improves the energy consumption and performance of the 3D CMP by about 61.33 and 9 percent on average, respectively.



SALMAN ONSORI, ARGHAVAN ASAD, KAAMRAN RAAHEMIFAR, AND MAHMOOD FATHY

S. Onsori is with the Computer Engineering Department, Bilkent University, Ankara 06800, Turkey.
A. Asad and M. Fathy are with the Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran.
K. Raahemifar is with the Electrical and Computer Engineering Department, Ryerson University, ON M5B 2K3, Canada.
CORRESPONDING AUTHOR: S. ONSORI ([email protected])

INDEX TERMS

Heterogeneous memory architecture, non-volatile memory (NVM), convex optimization problem, 3D integration technology, energy-efficient design, dark silicon

I. INTRODUCTION

Energy consumption is an essential constraint in embedded systems since these systems are generally restricted by battery lifetime. It is widely acknowledged that the energy consumption of memory systems is a significant contributor to the overall system energy due to the integration of increasingly larger memories closer to the processor [47]. Therefore, there is a critical need to considerably reduce the energy consumption of memory architectures. Memory energy consists of two components: 1) leakage, and 2) the energy of read/write accesses. In order to reduce memory energy, both the leakage and the dynamic energy should be minimized. Moreover, 42 percent of the overall energy dissipation in the 90 nm generation [1] and over 50 percent of the overall energy dissipation in 65 nm technology [4] are due to leakage. Hence, leakage energy has become comparable to dynamic energy in current-generation memory modules and will soon exceed dynamic energy in magnitude if voltage and technology are further scaled down [3], [24].

Due to the physical limitations of two-dimensional integration technologies (2D ICs), three-dimensional chip-multiprocessors (3D CMPs) have received a lot of attention [25]–[28]. Compared with 2D designs, 3D integration technology reduces interconnection wire length, resulting in lower power consumption and shorter communication latency [23]. In addition, network-on-chip (NoC) architectures have been extended to the third dimension with the help of through-silicon vias (TSVs) [44], [45]. 3D NoCs combine the benefits of the short vertical interconnects of 3D ICs and the scalability of NoCs. Therefore, 3D NoCs have the potential to achieve better performance with higher scalability and lower power consumption.

Received 2 December 2015; revised 25 April 2016; accepted 26 April 2016. Date of publication 0 2016; date of current version 0 2016.

Digital Object Identifier 10.1109/TETC.2016.2563323
VOLUME 1, NO. X, XXXXX 2016, 2168-6750

In order to exploit 3D CMPs and benefit from the advantages of 3D NoCs, CMP architectures with 3D stacked memory systems have been proposed to reduce the power consumption of CMPs and increase their performance [7], [35], [36], [53], [54]. Stacking traditional memory systems on the core layer may drastically degrade performance and aggravate power density and temperature-related problems [46] such as negative bias temperature instability (NBTI) [42]. For example, when eDRAM/DRAM is stacked on top of cores as on-chip memory, the heat generated by the core layer can significantly increase the refresh power of the DRAM layers. In such a case, the designer needs to consider the power consumed by the refresh phase when designing the power management policy for the stacked DRAM memory or cache. Non-volatile memories (NVMs) are a newly emerging memory technology with potential application in designing new classes of memory systems due to benefits such as higher storage density and near-zero leakage power consumption [37]–[39]. Spin-transfer torque random-access memory (STT-RAM), a promising NVM candidate, combines the speed of SRAM, the density of DRAM and the non-volatility of Flash memory. In addition, excellent scalability and very high integration with conventional CMOS logic are other superior characteristics of STT-RAM [2]. Although NVMs have the benefits described above, drawbacks such as high write energy consumption, long write latency and limited write endurance prevent their direct use as a replacement for traditional memories [32], [48].

In order to overcome the aforementioned disadvantages, we use eDRAM and STT-RAM as two different types of memory banks in the stacked memory layer of a 3D eCMP. This hybrid memory architecture exploits the benefits of both memory technologies. In this work, we use a Non-Uniform Memory Architecture (NUMA) stacked directly on top of the core layer in the proposed eCMP.

Recently, dark silicon has emerged as a trend in VLSI technology [29], [30], [49], [50]. The rise of the utilization wall due to thermal and power budgets restricts the active components and results in a large region of dark silicon. Uncore components, such as the memory and cache subsystem, consume a significant amount of power [31]. Thereby, power management of uncore components is critical for maximizing design performance in the dark silicon era. Previous research has mainly focused on energy-efficient core designs [29], [40], and the design of uncore components for reducing energy consumption has rarely been explored. Heterogeneous architectures can be a promising solution to tackle the challenges of multicore scaling in the dark silicon era because of the slight improvement in CMOS technology. NVMs can be efficiently integrated with CMOS circuits in energy-efficient designs.

To the best of our knowledge, this paper is the first work to examine an energy-efficient heterogeneous memory architecture design based on a convex optimization approach for future eCMPs. We exploit 3D die-stacking and emerging NVMs to design a high-performance 3D eCMP architecture that minimizes energy consumption as a solution to combat the dark silicon challenge for future CMPs.

Figure 1 shows an overview of the proposed design using an example of 8 homogeneous cores in the lower layer and a hybrid memory architecture in the upper layer.
In the proposed heterogeneous memory system, STT-RAM, a well-known NVM candidate, is combined with eDRAM banks in the second layer.

This paper makes the following novel contributions:

• We provide a convex-optimization-based platform to design a heterogeneous memory system consisting of NVM and eDRAM memory banks.
• Our proposed model can optimally find the number of eDRAM and STT-RAM memory banks in the memory layer of the embedded 3D CMP, based on the access behavior of the mapped applications, to minimize energy consumption.
• We demonstrate that our ILP formulation extends the lifetime of the hybrid memory architecture and provides significant energy savings in comparison with the baseline designs.
• We developed a simulator with a hybrid memory and 3D NoC platform to evaluate the proposed design in an embedded 3D CMP using PARSEC benchmarks.

The rest of this paper is organized as follows. Section II gives a brief background. Section III describes related work. Section IV presents the convex optimization problem and its formulation in detail. Section V presents the evaluation results. Finally, Section VI concludes the paper.

II. BACKGROUND

A. STT-RAM TECHNOLOGY

STT-RAM has been one of the most popular NVM structures due to its scalability in sub-nanometer technology and its low writing current in comparison with conventional Magnetic Random Access Memory (MRAM).

As illustrated in Figure 2, to perform a read operation on the STT-RAM cell, the NMOS transistor is turned on and a small voltage is applied between the bit line and the source line. This voltage causes a current through the magnetic tunnel junction (MTJ). The amount of this current depends on the state of the MTJ. A current sensor senses the current and compares it with a reference current. As a result, the logic value of the cell is determined.

For a write operation, the current varies and depends on the value being written: a positive current is injected between the bit line and the source line to write one logic value, and a negative current is injected to write the other. The amount of current required for a reliable write operation is known as the threshold current, which depends on the type of material used to construct the MTJ and on its shape [14], [41].

FIGURE 1. An overview of the proposed architecture.

B. 3D DIE-STACKING TECHNOLOGY

Three-dimensional integrated circuit (3D IC) technology, in which multiple silicon layers are stacked vertically, has proven to be a promising solution for increasing the number of transistors on a chip [55]. In 3D IC designs, the critical paths can be significantly shortened and the bandwidth between processor cores and memories can be greatly increased [22], [23]. In addition to the aforementioned advantages, 3D ICs also provide heterogeneous integration, on-chip interconnect length reduction, and a modular and scalable design. Thus, 3D integration is envisioned as a solution for future many-core designs to tackle the memory wall problem. In this paper, we assume that the stacking approach is used for the 3D embedded CMP design, in which core and memory layers are vertically stacked and connected by through-silicon vias (TSVs).

III. RELATED WORK

Numerous studies [8], [9], [33], [34] have proposed hybrid architectures in which SRAM is integrated with NVMs in order to take advantage of both technologies. Energy consumption is still a primary concern in embedded systems since they are limited by battery constraints. Several techniques have been proposed to reduce the energy consumption of hybrid memory architectures in embedded systems. Fu et al. [12] presented a technique to improve energy efficiency through a sleep-aware variable partitioning algorithm that reduces the high leakage power of hybrid memories. Hajimiri et al. [11] proposed a system-level design approach that minimizes the dynamic energy of an NVM-based memory through content-aware encoding for embedded systems. Our work differs from all of these prior works in that we focus on the placement of eDRAM and STT-RAM banks in a stacked memory architecture in future CMPs to minimize energy consumption using a convex-optimization-based approach.

As mentioned before, there are obstacles to employing STT-RAM in modern memory systems without integrating it with traditional technologies. One of these obstacles is the limited number of write operations. After the number of write operations has reached its limit, it is no longer possible to write another value into an STT-RAM cell, and the stored values can only be read [43]. A number of studies have presented techniques to address the endurance problem of NVMs. Qureshi et al. [10] proposed wear-leveling techniques for a PRAM-based memory system to enhance its lifetime. Wang et al. [5] proposed an algorithm that evenly distributes write events over the address space of scratchpad memory to extend the endurance of NVM. Luo et al. [6] presented a writing technique called Min-Shift to reduce the total number of writes to NVM and to enhance its lifetime. Hu et al. [13] proposed a software wear-leveling technique to extend the lifetime of NVM in the hybrid memory structure of embedded systems. However, our paper is the first work to propose an endurance model for NVM technology. This endurance model is used as a constraint in the proposed optimization problem to design a high-endurance heterogeneous memory system with minimum energy consumption.

IV. OPTIMIZATION MODEL

In this section, we formulate our energy optimization problem to design a minimum-energy heterogeneous memory structure in an embedded 3D CMP. Figure 3 shows a block diagram of our model for designing the proposed hybrid memory with minimum energy consumption.

The outputs of our optimization problem are 1) the optimal number of eDRAM and STT-RAM memory banks, based on the memory access behavior of the mapped applications and subject to the endurance constraint, and 2) the appropriate placement of eDRAM and STT-RAM banks in the memory layer to minimize energy consumption. DRC and STC represent our optimization variables. These two binary variables indicate whether a particular memory bank in the proposed design is an eDRAM or an STT-RAM bank. Our convex optimization model finds the DRC and STC variables for each bank in the second layer. Based on these variables, the hybrid memory layer is constructed (Figure 4). After constructing the second layer and knowing the actual placement of eDRAM and STT-RAM banks on it, we can count the banks and hence find the optimal number of banks of each memory technology in our design.

FIGURE 2. Structure of an STT-RAM.

FIGURE 3. Overview of our model.

Table 1 gives the constant terms used in our convex formulation. To solve the models, we used CVX [15], an efficient convex optimization solver.

Let $P$ denote the total number of processor cores, $DR$ the total available number of eDRAM memory banks, $ST$ the total available number of STT-RAM memory banks, $(C_X, C_Y)$ the dimensions of the chip, and $(P_X, P_Y)$ the dimensions of a processor core. In this work, $DR$ and $ST$ are equal to $P$; however, these numbers can take different values.

Our approach uses 0-1 variables DRC and STC to identify the coordinates of a memory bank. We have two types of memory banks, eDRAM and STT-RAM, so we have two variables:

• $\mathrm{DRC}_{dr,x,y,l}$: indicates whether an eDRAM bank is at $(x, y)$ in layer $l = 2$.
• $\mathrm{STC}_{st,x,y,l}$: indicates whether an STT-RAM bank is at $(x, y)$ in layer $l = 2$.

Similarly, we use the variables DRMAP and STMAP for the eDRAM and STT-RAM memory banks, respectively. That is,

• $\mathrm{DRMAP}_{dr,x,y,l}$: indicates whether coordinate $(x, y)$ is assigned to an eDRAM bank in layer $l = 2$.
• $\mathrm{STMAP}_{st,x,y,l}$: indicates whether coordinate $(x, y)$ is assigned to an STT-RAM bank in layer $l = 2$.

In the constraints below, $i$ and $j$ correspond to the $x$ and $y$ coordinates, respectively.

$$\sum_{i=0}^{C_X-1} \sum_{j=0}^{C_Y-1} \left( \mathrm{DRC}_{dr,i,j,l} + \mathrm{STC}_{st,i,j,l} \right) \le 1, \quad \forall dr, st, \; l = 2 \tag{1}$$

$$\mathrm{STMAP}_{st,x,y,l} \ge \mathrm{STC}_{st,x_1,y_1,l}, \quad \forall st, x, y, x_1, y_1 \text{ such that } x_1 + T_X \ge x > x_1 \text{ and } y_1 + T_Y \ge y > y_1, \; l = 2 \tag{2}$$

$$\mathrm{DRMAP}_{dr,x,y,l} \ge \mathrm{DRC}_{dr,x_1,y_1,l}, \quad \forall dr, x, y, x_1, y_1 \text{ such that } x_1 + R_X \ge x > x_1 \text{ and } y_1 + R_Y \ge y > y_1, \; l = 2 \tag{3}$$

Also, the sum of the used STT-RAM and eDRAM banks in the second layer is equal to $P$, as follows:

$$\sum_{x=0}^{C_X-1} \sum_{y=0}^{C_Y-1} \left( \sum_{i=1}^{DR} \mathrm{DRC}_{i,x,y,l} + \sum_{i=1}^{ST} \mathrm{STC}_{i,x,y,l} \right) = P, \quad l = 2 \tag{4}$$

In this work, the memory banks and their associated router/controller in the upper layer are the same size as the cores in the lower layer. This prevents VLSI problems related to layout and TSV design.

In order to prevent multiple mappings of a coordinate in our grid, we assign each coordinate in the second layer to exactly one memory bank (eDRAM or STT-RAM):

$$\sum_{i=1}^{DR} \mathrm{DRMAP}_{i,x,y,l} + \sum_{i=1}^{ST} \mathrm{STMAP}_{i,x,y,l} = 1, \quad \forall x, y, \; l = 2 \tag{5}$$

FIGURE 4. Construction of the hybrid memory layer based on optimization variables.
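To make the placement constraints concrete, the following minimal Python sketch checks a candidate memory layer against them. It is an illustration only and not the authors' CVX model: the function name `placement_is_valid`, the set-of-triples encoding of the 0-1 variables, and the 2 × 2 example grid are hypothetical. It also assumes unit-size banks (the same size as a core, as stated above), so that the DRMAP/STMAP variables coincide with DRC/STC, and it reads Equation (1) as "each bank is placed at most once".

```python
def placement_is_valid(drc, stc, grid_x, grid_y, num_cores):
    """Check a candidate memory-layer placement.

    drc / stc are sets of (bank_id, x, y) triples for which the 0-1
    variable DRC / STC equals 1 (hypothetical encoding for this sketch).
    """
    placed = [("eDRAM", k, x, y) for (k, x, y) in drc]
    placed += [("STT-RAM", k, x, y) for (k, x, y) in stc]

    # Every placement must lie inside the C_X x C_Y grid.
    if any(not (0 <= x < grid_x and 0 <= y < grid_y) for _, _, x, y in placed):
        return False

    # A physical bank may be placed at most once (our reading of Equation (1)).
    banks = [(tech, k) for tech, k, _, _ in placed]
    if len(banks) != len(set(banks)):
        return False

    # The number of used banks equals the number of cores P (Equation (4)).
    if len(placed) != num_cores:
        return False

    # Each coordinate is assigned to exactly one bank (Equation (5)).
    coords = [(x, y) for _, _, x, y in placed]
    every_cell = {(x, y) for x in range(grid_x) for y in range(grid_y)}
    return len(coords) == len(set(coords)) and set(coords) == every_cell


# Hypothetical 2 x 2 memory layer with P = 4.
good = placement_is_valid({(0, 0, 0), (1, 1, 1)}, {(0, 0, 1), (1, 1, 0)}, 2, 2, 4)
# Two eDRAM banks share coordinate (0, 0), leaving (1, 1) unassigned.
bad = placement_is_valid({(0, 0, 0), (1, 0, 0)}, {(0, 0, 1), (1, 1, 0)}, 2, 2, 4)
print(good, bad)  # True False
```

A solver searches over all such placements; a checker like this is only useful for validating a layout that a solver (or a designer) has already produced.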

TABLE 1. Constant Terms Used in Our Optimization Problem. The values of $\mathrm{FREQ}_{p,m,r}$ and $\mathrm{FREQ}_{p,m,w}$ are obtained by collecting statistics through simulation of the code and capturing accesses to each storage block.

Constant | Definition
$P$ | Number of cores in the core layer
$DR$ | Total number of eDRAM memory banks
$ST$ | Total number of STT-RAM memory banks
$C_X$, $C_Y$ | Dimensions of the chip
$P_X$, $P_Y$ | Dimensions of a core
$R_X$, $R_Y$ | Dimensions of an eDRAM memory bank
$T_X$, $T_Y$ | Dimensions of an STT-RAM memory bank
$N$ | Number of lines in an STT-RAM memory bank
$l$ | Index of layers in the 3D CMP
$\mathrm{FREQ}_{p,m,r}$ | Number of read accesses to memory bank $m$ by core $p$
$\mathrm{FREQ}_{p,m,w}$ | Number of write accesses to memory bank $m$ by core $p$
$E^{read}_{dr}$, $E^{write}_{dr}$ | Dynamic energy consumption per read and write access by the eDRAM memory bank
$E^{read}_{st}$, $E^{write}_{st}$ | Dynamic energy consumption per read and write access by the STT-RAM memory bank
$\varphi$ | Using STT-RAM versus eDRAM ratio
$t^{r}_{dr}$, $t^{w}_{dr}$ | Read and write latency of an eDRAM bank
$t^{r}_{st}$, $t^{w}_{st}$ | Read and write latency of an STT-RAM cache bank
$P^{static}_{dr}$ | Static power consumed by each eDRAM memory bank at the maximum temperature limit
$P^{static}_{st}$ | Static power consumed by each STT-RAM memory bank at the maximum temperature limit
$\mathrm{STTLine}_{endurance}$ | Maximum write number for each line of an STT-RAM memory bank

The static power dissipation depends on the temperature. Since this optimization problem is solved at design time, we make a pessimistic worst-case temperature assumption and calculate $P^{static}_{dr}$ and $P^{static}_{st}$ at the maximum temperature limit.

$$P_{static} = \sum_{i=0}^{C_X-1} \sum_{j=0}^{C_Y-1} \left( \sum_{k=1}^{DR} \mathrm{DRC}_{k,i,j,l} \times P^{static}_{dr} + \sum_{k=1}^{ST} \mathrm{STC}_{k,i,j,l} \times P^{static}_{st} \right), \quad l = 2 \tag{6}$$

We consider the endurance problem of STT-RAM in our convex model. Hence, we use an endurance constraint for the optimal placement of eDRAM and STT-RAM memory banks. In our model, if placing an STT-RAM memory bank at a particular position would lead to the destruction of more than half of the lines of that memory due to the write frequency of the cores, an STT-RAM memory bank is not chosen for that position. This endurance constraint can be expressed as follows:

$$\frac{\sum_{i=1}^{P} \mathrm{FREQ}_{i,st,w}}{\mathrm{STTLine}_{endurance}} \times \mathrm{STC}_{st,x,y,l} < \frac{N}{2}, \quad \forall x, y, st, \; l = 2 \tag{7}$$

Figure 5 shows an overview of our endurance model. Since STT-RAM has a write endurance threshold, we can only write a limited number of times to each line of an STT-RAM bank. If the number of writes to one line exceeds the threshold, that line is destroyed. We assume a worst-case scenario in which all write operations go to one line until that line is destroyed, after which a new line is selected for the remaining write operations. When 50 percent of the lines in an STT-RAM memory bank have been destroyed, a new write operation has only a 1/2 chance of going to a valid line that has not already been destroyed; that is, a write to the STT-RAM bank is equally likely to succeed or fail. If more than half of the lines of an STT-RAM bank are destroyed, the chance of a successful write to this bank is even less than 1/2. Thus, the maximum tolerable number of destroyed lines that still guarantees writing to a healthy line with at least 50 percent probability is N/2. Increasing this amount to, for example, 3N/4 decreases the chance of writing to a healthy line of an STT-RAM bank to 1/4. On the other hand, if we decrease the amount to less than N/2, for example N/4, the chance of writing to a healthy line increases to 3/4; however, this limits our design because we can then place STT-RAM only at positions with a smaller number of write operations. We selected N/2 because it lies exactly in the middle and makes a good tradeoff between increasing the endurance of STT-RAM and maintaining flexibility in our design; however, this amount can be changed based on the purpose of the design. Note that we assume the number of lines of an STT-RAM bank is equal to N.
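The endurance check of Equation (7) and the probability argument above can be sketched in a few lines of Python. This is a simplified illustration under the same worst-case assumption (writes exhaust one line after another); the function names, the `max_fraction` parameter and all numeric values are hypothetical and are not taken from the paper.

```python
def stt_ram_allowed(writes_per_core, line_endurance, num_lines, max_fraction=0.5):
    """Return True if an STT-RAM bank may be placed at this position.

    Worst case: writes wear out one line completely before moving on,
    so (total writes / per-line endurance) lines are destroyed.
    """
    worn_lines = sum(writes_per_core) / line_endurance
    return worn_lines < max_fraction * num_lines


def healthy_write_probability(worn_lines, num_lines):
    """Chance that a new write lands on a line that is still usable."""
    return 1.0 - worn_lines / num_lines


# Hypothetical bank with N = 8 lines, each enduring 1000 writes.
print(stt_ram_allowed([1500, 2000], 1000, 8))  # True: 3.5 lines worn, below N/2
print(stt_ram_allowed([3000, 2000], 1000, 8))  # False: 5 lines worn, above N/2
print(healthy_write_probability(6, 8))         # 0.25, the 3N/4 case
print(healthy_write_probability(2, 8))         # 0.75, the N/4 case
```

Lowering `max_fraction` makes the bank more reliable but, as noted above, shrinks the set of positions where STT-RAM can be placed.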
Having specified the necessary constraints in our convex formulation, we next consider the objective function. The goal of our objective function is to minimize the energy consumption of the stacked heterogeneous memory architecture in the target 3D CMP subject to the endurance constraint. A weighted objective function is considered to capture its potential effects on power consumption and overall performance. This is achieved by the constant $\varphi$, which is used as a knob for choosing an eDRAM versus an STT-RAM bank at each $(x, y)$ coordinate in the memory layer. As mentioned before, in comparison with eDRAM, STT-RAM is slower and has higher density and near-zero leakage power. Consequently, STT-RAM banks are suitable for memory-intensive blocks and eDRAM banks for computation-intensive blocks. Therefore, by changing the value of $\varphi$, it is possible to obtain a design optimized for the designer's preference. In this work, we select a fixed value for $\varphi$; it can be set differently for other design purposes.

The static energy of eDRAM and STT-RAM banks for write and read operations is defined as the product of their static power consumption and their read and write durations:

$$E^{static}_{dr} = \left( t^{r}_{dr} + t^{w}_{dr} \right) \times P^{static}_{dr} \tag{8}$$

$$E^{static}_{st} = \left( t^{r}_{st} + t^{w}_{st} \right) \times P^{static}_{st} \tag{9}$$

In Equation (10), $E^{read}_{dr}$, $E^{write}_{dr}$, $E^{read}_{st}$ and $E^{write}_{st}$ indicate the dynamic energy consumed by eDRAM and STT-RAM banks per read and write access. Figure 6 shows eDRAM and STT-RAM banks in the second layer and illustrates the static and dynamic energy parameters of each memory technology. $E_{dynamic}$, the dynamic energy consumption of the proposed heterogeneous memory system, is calculated as:

$$E_{dynamic} = \sum_{i=0}^{C_X-1} \sum_{j=0}^{C_Y-1} \sum_{p=1}^{P} \left( \sum_{k=1}^{DR} \mathrm{DRC}_{k,i,j,l} \times \left( \mathrm{FREQ}_{p,k,r} \times E^{read}_{dr} + \mathrm{FREQ}_{p,k,w} \times E^{write}_{dr} \right) + \sum_{k=1}^{ST} \mathrm{STC}_{k,i,j,l} \times \left( \mathrm{FREQ}_{p,k,r} \times E^{read}_{st} + \mathrm{FREQ}_{p,k,w} \times E^{write}_{st} \right) \right), \quad l = 2 \tag{10}$$

Consequently, our objective function can be expressed as:

$$\text{minimize} \quad E_{Total} = \left( E^{static}_{dr} + E^{dynamic}_{dr} \right) + \varphi \cdot \left( E^{static}_{st} + E^{dynamic}_{st} \right) \tag{11}$$

To summarize, the objective function $E_{Total}$ is minimized under constraints (1) through (10). The proposed memory system and convex optimization model are very flexible. For example, in the proposed architecture we can use other types of NVM technology, such as PCM, instead of STT-RAM banks in the memory layer.

FIGURE 5. Overview of endurance model.

V. EVALUATION

In this section, we first describe the experimental environment used to evaluate the proposed architecture. We then present experiments that quantify the advantages of the proposed architecture over the baseline architectures.

A. EVALUATION SETUP

We used the GEM5 [16] full-system simulator to implement the memories and cores. To simulate the behavior of the 3D CMP design and its NoC architecture accurately, we integrated GEM5 with 3D-Noxim [18], a SystemC-based NoC simulator. We also integrated McPAT [17] with this simulation platform in order to calculate the power consumption of the design. Furthermore, the cache capacities and energy consumption of eDRAM and STT-RAM were estimated with CACTI [19] and NVSIM [20], respectively. Figure 7 shows the structure of the core layer and its network-on-chip characteristics in the proposed 3D eCMP design. The simulation platform of this work is shown in Figure 8. Tables 2 and 3 list the details of the system configuration used for the evaluation along with the parameters used in our experiments for the eDRAM and STT-RAM memory technologies. We used multithreaded workloads in our experiments. The multithreaded applications, with small working sets, are selected from the PARSEC benchmark suite [21]. Moreover, $P_{budget}$ and $T_{max}$ were set to 100 W and 80 °C for the experimental evaluation.

B. EXPERIMENTAL RESULTS

In this subsection, we evaluate the target 3D eCMP with stacked memory in four different cases: the CMP with eDRAM-only stacked memory (Baseline-eDRAM), the CMP with hybrid stacked memory that has four eDRAM banks in the middle (eDRAM-centric), the CMP with hybrid stacked memory that has the same number of eDRAM and STT-RAM banks (Hybrid-symmetric), and the CMP with the proposed hybrid stacked memory based on the convex optimization model. In the proposed method, we consider 16 eDRAM banks (4 MB each) and 16 STT-RAM banks (4 MB each) as the maximum available memory that can be used for designing the hybrid memory architecture. For evaluation purposes, the results of the proposed design are compared with those of the baseline designs, which are shown in Figure 9.
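As a rough illustration of the baseline layouts just described, the sketch below builds them as grids of bank types and totals the capacity per technology. The 4 × 4 grid (one 4 MB bank per core, 16 cores) follows the configuration in Table 2; the checkerboard arrangement used for Hybrid-symmetric is purely an assumption of this sketch, since the actual arrangement is given only in Figure 9. All function and constant names are hypothetical.

```python
from collections import Counter

GRID = 4      # assumed 4 x 4 memory layer, one bank above each of 16 cores
BANK_MB = 4   # capacity of one bank in MB


def baseline_edram():
    """eDRAM-only stacked memory."""
    return [["eDRAM"] * GRID for _ in range(GRID)]


def edram_centric():
    """Four eDRAM banks in the middle, STT-RAM everywhere else."""
    return [["eDRAM" if 1 <= x <= 2 and 1 <= y <= 2 else "STT-RAM"
             for x in range(GRID)] for y in range(GRID)]


def hybrid_symmetric():
    """Equal bank counts; the checkerboard pattern is an assumption."""
    return [["eDRAM" if (x + y) % 2 == 0 else "STT-RAM"
             for x in range(GRID)] for y in range(GRID)]


def capacity_mb(layout):
    """Total capacity per technology, in MB."""
    counts = Counter(tech for row in layout for tech in row)
    return {tech: n * BANK_MB for tech, n in counts.items()}


print(capacity_mb(baseline_edram()))
print(capacity_mb(edram_centric()))
print(capacity_mb(hybrid_symmetric()))
```

The proposed design differs from these fixed layouts in that the bank type at each coordinate is chosen per benchmark by the optimization model.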

FIGURE 6. Energy and power parameters of a memory bank in the second layer of the design.
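The per-bank energy parameters referred to in Figure 6 enter the objective of Section IV through Equations (8) to (11). The following Python sketch evaluates that weighted objective for a given assignment of bank types. It is a simplified illustration, not the authors' implementation: the names `BankTech` and `total_energy` are hypothetical, every numeric value is invented, units are arbitrary but assumed consistent, and adding the static term once per placed bank is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class BankTech:
    e_read: float    # dynamic energy per read access
    e_write: float   # dynamic energy per write access
    t_read: float    # read latency
    t_write: float   # write latency
    p_static: float  # static power at the maximum temperature limit

    def static_energy(self):
        # Equations (8) and (9): (t_read + t_write) * P_static
        return (self.t_read + self.t_write) * self.p_static


def dynamic_energy(tech, accesses):
    # accesses: list of (reads, writes) pairs, one pair per core (Equation (10))
    return sum(r * tech.e_read + w * tech.e_write for r, w in accesses)


def total_energy(layout, techs, freq, phi):
    """Weighted objective of Equation (11) for a fixed layout.

    layout maps bank id -> "eDRAM" or "STT-RAM";
    freq maps bank id -> per-core (reads, writes) counts.
    """
    energy = {"eDRAM": 0.0, "STT-RAM": 0.0}
    for bank, name in layout.items():
        tech = techs[name]
        energy[name] += tech.static_energy() + dynamic_energy(tech, freq[bank])
    return energy["eDRAM"] + phi * energy["STT-RAM"]


# Invented parameters: STT-RAM leaks little but writes are slow and costly.
techs = {
    "eDRAM": BankTech(e_read=1, e_write=1, t_read=2, t_write=2, p_static=10),
    "STT-RAM": BankTech(e_read=1, e_write=5, t_read=2, t_write=10, p_static=1),
}
freq = {0: [(10, 0)], 1: [(0, 10)]}  # bank 0 is read-heavy, bank 1 write-heavy

mixed = total_energy({0: "STT-RAM", 1: "eDRAM"}, techs, freq, phi=1.0)
all_edram = total_energy({0: "eDRAM", 1: "eDRAM"}, techs, freq, phi=1.0)
all_stt = total_energy({0: "STT-RAM", 1: "STT-RAM"}, techs, freq, phi=1.0)
print(mixed, all_edram, all_stt)  # 72.0 100.0 84.0
```

With these invented numbers the mixed layout is cheapest, which mirrors the intuition behind the hybrid design: read-heavy banks benefit from low-leakage STT-RAM, while write-heavy banks are better served by eDRAM.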

FIGURE 7. 3D eCMP configuration.

FIGURE 8. Simulation platform of the design.

Figure 10 shows the energy consumption for each PARSEC application. As shown in this figure, the proposed design reduces energy consumption by, on average, about 61.33, 32 and 36 percent compared with the Baseline-eDRAM, eDRAM-centric and Hybrid-symmetric designs, respectively. The reduced energy consumption is due to the efficient use of eDRAM and STT-RAM banks on the memory layer, which is achieved by the proposed optimization model.

Figure 11 compares the instructions per cycle (IPC) of the proposed 3D-stacked hybrid memory architecture with the baseline designs. The eDRAM and STT-RAM capacities are roughly the same. Therefore, the IPC differences among the baseline designs are due to the different read and write latencies of the eDRAM and STT-RAM memory technologies. Based on Table 3, although the read latency of eDRAM is higher than that of STT-RAM, the write problem of STT-RAM has a worse impact on IPC than the read latency of eDRAM. For example, in the Hybrid-symmetric design, half of the STT-RAM banks are replaced with eDRAM banks. Hence, Hybrid-symmetric can give a higher IPC than the Baseline-STTRAM design since the write problem of STT-RAM is mitigated by the eDRAM banks. It is also possible for Baseline-STTRAM to have a better IPC than the Hybrid-symmetric design on read-intensive benchmarks, because the many read operations in such benchmarks increase the time required to access the memory layer due to the higher read latency of eDRAM. The proposed hybrid memory architecture based on our convex optimization model has the highest IPC among all designs for all the benchmarks. Experimental results show that the proposed hybrid memory architecture gives, on average, about 9, 2.8 and 1 percent speedup over the Baseline-eDRAM, Hybrid-symmetric and eDRAM-centric designs, respectively.

TABLE 2.

Specification of the Baseline eCMP Configurations.

Component | Description
Number of Cores | 16, 4 × 4
Core Configuration | Alpha21164, 3 GHz, area 3.5 mm², 32 nm
Private Cache per Core | SRAM, 4-way, 32 B line, size 32 KB per core
On-chip Memory | Baseline-eDRAM: 64 MB (4 MB eDRAM bank on each core)
 | Baseline-STTRAM: 64 MB (4 MB STT-RAM bank on each core)
 | Hybrid-symmetric: 32 MB STT-RAM and 32 MB eDRAM (8 STT-RAM and 8 eDRAM banks, 4 MB each bank)
 | eDRAM-centric: 48 MB STT-RAM and 16 MB eDRAM (12 STT-RAM and 4 eDRAM banks, 4 MB each bank)
 | Hybrid proposed: the proposed hybrid memory based on the convex optimization model
Network Router | 2-stage wormhole switched, virtual channel flow control, 2 VCs per port, 5-flit buffer depth, 8 flits per data packet, 1 flit per address packet, 16 bytes per flit

TABLE 3.

Different Memory Technology Comparisons at 65 nm.

[Table columns: Technology | Area (mm²) | Read Latency (ns) | Write Latency (ns) | Leakage Power at 80 °C (mW) | Read Energy (nJ) | Write Energy (nJ). Rows: 128 KB SRAM, 512 KB STT-RAM, 512 KB eDRAM. The numeric entries are not preserved.]

FIGURE 9.

Different baseline designs.

FIGURE 10.

Comparison of energy consumption for the different baselines and the proposed memory architecture, normalized to Baseline-eDRAM.

FIGURE 11.

Comparison of instructions per cycle (IPC) for the different baselines and the proposed memory architecture, normalized to Baseline-eDRAM.
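The average energy and IPC results against Baseline-eDRAM (61.33 percent less energy, 9 percent speedup) can be combined into an approximate energy-delay product (EDP) figure. The short Python calculation below is a back-of-the-envelope check, not the authors' per-benchmark computation: it assumes that delay scales as 1/IPC for a fixed instruction count and that multiplying the two averages is a fair approximation. The function name is hypothetical.

```python
def edp_improvement(energy_reduction, speedup):
    """Approximate EDP improvement from average energy and speedup figures."""
    relative_energy = 1.0 - energy_reduction   # energy relative to the baseline
    relative_delay = 1.0 / (1.0 + speedup)     # delay relative to the baseline
    return 1.0 - relative_energy * relative_delay


print(round(edp_improvement(0.6133, 0.09) * 100, 1))  # 64.5
```

The result is in line with the EDP improvement of about 65 percent over Baseline-eDRAM reported for Figure 13.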

VOLUME 1, NO. X, XXXXX 2016 Onsori et al. : An Energy-Efﬁcient Heterogeneous Memory Architecture for Future Dark Silicon Embedded Chip-Multiprocessors

Figure 12 compares the lifetime of the proposed design with the Hybrid-symmetric design for each benchmark. We assume that the maximum endurable write counts for eDRAM and the different NVM technologies are as reported in Table 4 [51], [52].

To evaluate the lifetime, we assume that each benchmark runs continuously until one of the memory lines in each memory bank exceeds the maximum number of endurable writes (shown in Table 4). Figure 12 shows that the lifetime of our proposed heterogeneous memory architecture is higher than the lifetime of the baseline designs for all the benchmarks. The proposed hybrid memory architecture yields on average a 3.03 times (and up to five times) improvement in lifetime when compared with the Hybrid-symmetric memory design. Thus, our hybrid memory architecture results in a more reliable 3D eCMP design due to the optimal number and optimal placement of STT-RAM and eDRAM banks on the memory layer.

Figure 13 shows the energy-delay product (EDP) for each PARSEC application. As shown in this figure, based on the energy consumption and performance improvements of the proposed architecture, our design improves the EDP by about 65 percent on average compared with the Baseline-eDRAM design.

The hybrid memory architectures generated for the canneal and fluidanimate benchmarks by the proposed convex optimization model are shown in Figure 14. As mentioned earlier, the number and placement of banks for each memory technology (eDRAM and STT-RAM) in the memory layer are calculated in order to minimize the performance cost function of the 3D eCMP while keeping the power budget at a satisfactory level. In other words, the resulting layout depends on the distribution of threads/applications on the core layer for each individual benchmark.

VI. CONCLUSION

In this work, we proposed a convex-optimization-based model to design a heterogeneous memory organization using eDRAM and STT-RAM memory banks in order to minimize the energy consumption of future 3D eCMPs. For the first time, we included an endurance model for NVM technologies in our optimization problem in order to design a reliable hybrid memory structure. The experimental results showed that the proposed method improves the energy-delay product by 65 percent on average when compared with traditional memory designs in which a single technology is used. Furthermore, our 3D eCMP yields on average a 9 percent performance improvement when compared with the baseline designs.

REFERENCES

[1] J. Kao, S. Narendra, and A. Chandrakasan, "Subthreshold leakage modeling and reduction techniques," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2002, pp. 141–…
[2] …, "Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs," in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011, pp. 69–…
[3] …, "Resistive computation: Avoiding the power wall with low-leakage, STT-MRAM based computing," in Proc. Annu. Int. Symp. Comput. Archit., 2010, pp. 371–…
[4] …, "System-wide leakage-aware energy minimization using dynamic voltage scaling and cache reconfiguration in multitasking systems," IEEE Trans. Very Large Scale Integr. Syst., vol. 20, no. 5, pp. 902–…
[5] …, "Endurance-aware allocation of data variables on NVM-based scratchpad memory in real-time embedded systems," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 34, no. 10, pp. 1600–…
[6] …, "Enhancing lifetime of NVM based main memory with bit shifting and flipping," in Proc. IEEE 20th Int. Conf. Embedded Real-Time Comput. Syst. Appl., 2014, pp. 1–…
[7] …, "Analysis and runtime management of 3D systems with stacked DRAM for boosting energy efficiency," in Proc. Int. Conf. Des., Autom. Test Eur. Conf. Exhib., 2012, pp. 611–…

FIGURE 12. Expected lifetime comparison of the proposed design.
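The lifetime metric plotted in Figure 12 follows from the evaluation rule stated in the text: a benchmark runs continuously until a memory line exceeds its maximum number of endurable writes. The Python sketch below is one possible reading of that rule, not the authors' evaluation code. It assumes a constant write rate per line and takes the first bank to wear out as the end of the memory layer's life. The endurance limits and write rates are hypothetical placeholders, and the function names are illustrative.

```python
import math
from typing import Iterable, Sequence, Tuple


def bank_lifetime_s(endurance_writes: float, line_write_rates: Sequence[float]) -> float:
    """Seconds until the most heavily written line of a bank wears out."""
    hottest = max(line_write_rates, default=0.0)
    return math.inf if hottest <= 0 else endurance_writes / hottest


def memory_lifetime_s(banks: Iterable[Tuple[float, Sequence[float]]]) -> float:
    """Lifetime of the memory layer, assuming the first failing bank ends the run."""
    return min(bank_lifetime_s(endurance, rates) for endurance, rates in banks)


# Hypothetical endurance limits (writes) and per-line write rates (writes/second).
STTRAM_ENDURANCE = 1e12
EDRAM_ENDURANCE = 1e16

banks = [
    (STTRAM_ENDURANCE, [10.0, 1000.0, 100.0]),  # STT-RAM bank with one hot line
    (EDRAM_ENDURANCE, [5000.0, 2000.0]),        # write-heavy data kept in eDRAM
]
print(f"estimated lifetime: {memory_lifetime_s(banks):.3g} s")
```

Under this model, moving write-intensive data from an STT-RAM bank to an eDRAM bank lowers the hottest STT-RAM write rate and therefore lengthens the estimated lifetime, which is the effect the bank placement is meant to exploit.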

TABLE 4. Comparison of Maximum Possible Write Number for Various Memory Technologies.

| Technology | SRAM | eDRAM | STT-RAM | PRAM |
|---|---|---|---|---|
| Endurance | … | … | … | … |

FIGURE 13. Comparison of the energy × delay product for the different baselines and the proposed memory architecture, normalized to Baseline-eDRAM.
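As a rough cross-check of the energy × delay results, the average energy reduction (61.33 percent) and average speedup (9 percent) reported against Baseline-eDRAM can be combined into a single estimate. The Python sketch below assumes a fixed instruction count, so that delay scales as the inverse of IPC, and it multiplies the two averages, which need not equal the average of the per-benchmark products shown in the figure. The function name `edp_improvement` is illustrative.

```python
def edp_improvement(energy_reduction: float, speedup: float) -> float:
    """Fractional EDP reduction from an energy reduction and a speedup.

    energy_reduction: fraction of energy saved (0.6133 means 61.33 percent).
    speedup: fractional performance gain (0.09 means 9 percent faster).
    """
    energy_ratio = 1.0 - energy_reduction   # E_new / E_base
    delay_ratio = 1.0 / (1.0 + speedup)     # T_new / T_base, assuming delay ~ 1/IPC
    return 1.0 - energy_ratio * delay_ratio


# Average figures reported against Baseline-eDRAM.
print(f"{edp_improvement(0.6133, 0.09):.1%}")
```

The estimate comes out at roughly 64.5 percent, which is in line with the reported average EDP improvement of about 65 percent.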

FIGURE 14. Hybrid memory layer for the canneal and fluidanimate benchmarks based on the proposed convex optimization model.


[8] Z. Wang, D. A. Jimenez, C. Xu, G. Sun, and Y. Xie, "Adaptive placement and migration policy for an STT-RAM-based hybrid cache," in Proc. Int. Conf. High Perform. Comput. Archit., 2014, pp. 13–…
[9] …, "Design of hybrid second-level caches," IEEE Trans. Comput., vol. 64, no. 7, pp. 1884–…
[10] …ño, and J. Karidis, "Morphable memory system: A robust architecture for exploiting multi-level phase change memories," in Proc. 37th Annu. Int. Symp. Comput. Archit., 2010, pp. 153–…
[11] …, "Content-aware encoding for improving energy efficiency in multi-level cell resistive random access memory," in Proc. IEEE/ACM Int. Symp. Nanoscale Archit., 2013, pp. 76–…
[12] …, "Sleep-aware variable partitioning for energy-efficient hybrid PRAM and DRAM main memory," in Proc. Int. Symp. Low Power Electron. Des., 2014, pp. 75–…
[13] …, "Low overhead software wear leveling for hybrid PCM + DRAM main memory on embedded systems," IEEE Trans. Very Large Scale Integr. Syst., vol. 23, no. 4, pp. 654–…
[14] …, "Spin-transfer torque switching in magnetic tunnel junctions and spin-transfer torque random access memory," J. Phys. Condensed Matter, vol. 19, no. 16, p. 13, 2007.
[15] M. Grant, S. Boyd, and Y. Ye, "CVX: Matlab software for disciplined convex programming," …
[16] … et al., "The gem5 simulator," ACM SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, May 2011.
[17] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proc. Annu. IEEE/ACM Int. Symp. MICRO-42, 2009, pp. 469–…
[18] …
[19] …, "CACTI 6.0: A tool to model large caches," HP Laboratories, Chicago, USA, Tech. Rep. HPL-2009-85, 2009.
[20] X. Dong, C. Xu, N. Jouppi, and Y. Xie, "NVSim: A circuit-level performance, energy, and area model for emerging non-volatile memory," in Proc. Emerging Memory Technol. Springer, 2014, pp. 15–…
[21] …, "Running PARSEC 2.1 on M5," Univ. Texas Austin, Dept. Comput. Sci., Tech. Rep. TR-09-32, 2009.
[22] C. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, "Bridging the processor-memory performance gap with 3D IC technology," IEEE Des. Test Comput., vol. 22, no. 6, pp. 556–…
[23] …, "Design space exploration for 3D architectures," ACM J. Emerging Technol. Comput. Syst., vol. 2, no. 2, pp. 65–…
[24] …, "System-wide leakage-aware energy minimization using dynamic voltage scaling and cache reconfiguration in multitasking systems," IEEE Trans. Very Large Scale Integr. Syst., vol. 20, no. 5, pp. 902–…
[25] …, "An energy-efficient 3D CMP design with fine-grained voltage scaling," in Proc. Des., Autom. Test Eur. Conf. Exhib., 2011, pp. 1–…
[26] …, "Optimizing energy efficiency of 3D multicore systems with stacked DRAM under power and thermal constraints," in Proc. 49th Annu. Des. Autom. Conf., 2012, pp. 648–…
[27] …, "An examination of the architecture and system-level tradeoffs of employing steep slope devices in 3D CMPs," in Proc. Int. Symp. Comput. Archit., 2014, pp. 241–…
[28] …, "THOR: Orchestrated thermal management of cores and networks in 3D many-core architectures," in Proc. Des. Autom. Conf., 2015, pp. 773–…
[29] …, "Dark silicon and the end of multicore scaling," in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011, pp. 365–…
[30] …, "Is dark silicon real?: Technical perspective," Commun. ACM Mag., vol. 56, pp. 92–92, 2013.
[31] H. Y. Cheng, J. Zhan, J. Zhao, Y. Xie, J. Sampson, and M. J. Irwin, "Core vs. uncore: The heart of darkness," in Proc. 52nd Annu. Des. Autom. Conf., 2015, pp. 1–…
[32] …, "Compiler-assisted refresh minimization for volatile STT-RAM cache," in Proc. Des. Autom. Conf., 2013, pp. 273–…
[33] …, "Accelerating non-volatile/hybrid processor cache design space exploration for application specific embedded systems," in Proc. 20th Asia South Pacific Des. Autom. Conf., 2015, pp. 435–…
[34] …, "Prediction hybrid cache: An energy-efficient STT-RAM cache architecture," IEEE Trans. Comput., vol. 65, no. 3, pp. 940–…
[35] …, "…," in Proc. Int. Conf. Design High Performance, Low Power, Reliable 3D Integrated Circuits, 2013, pp. 537–…
[36] …, "Circuit and micro-architecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement," in Proc. 45th Annu. Des. Autom. Conf., Jun. 2008, pp. 554–…
[37] …, "Evaluating STT-RAM as an energy-efficient main memory alternative," in Proc. Int. Conf. Perform. Anal. Syst. Softw., 2013, pp. 256–…
[38] …, "Design of last-level on-chip cache using spin-torque transfer RAM (STT RAM)," IEEE Trans. Very Large Scale Integr. Syst., vol. 19, no. 3, pp. 483–…
[39] …, "Efficient data mapping and buffering techniques for multilevel cell phase-change memories," ACM Trans. Archit. Code Optimization, vol. 11, no. 4, 2014, Art. no. 40.
[40] B. Raghunathan, Y. Turakhia, S. Garg, and D. Marculescu, "Cherry-picking: Exploiting process variations in dark-silicon homogeneous chip multi-processors," in Proc. Design, Autom. Test Eur. Conf. Exhib., 2013, pp. 39–…
[41] "International technology roadmap for semiconductors (ITRS)," Semiconductor Ind. Assoc., 2013.
[42] H. Tajik, H. Homayoun, and N. Dutt, "VAWOM: Temperature and process variation aware wearout management in 3D multicore architecture," in Proc. 50th ACM/EDAC/IEEE Design Autom. Conf., 2013, pp. 1–…
[43] …, "Hybrid cache architecture with disparate memory technologies," in Proc. 36th Annu. Int. Symp. Comput. Archit., 2009, pp. 34–…
[44] …, "Assembling 2D blocks into 3D chips," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2012, pp. 228–…
[45] …, "Small-world network enabled energy efficient and robust 3D NoC architectures," in Proc. 25th Edition Great Lakes Symp. VLSI, 2015, pp. 133–…
[46] …, "Temperature aware refresh for DRAM performance improvement in 3D ICs," in Proc. 16th Int. Symp. ISQED, 2015, pp. 207–…
[47] …, "Energy management for commercial servers," Computer, vol. 36, no. 12, pp. 39–48, 2003.
[48] J. Wang, Y. Tim, W. F. Wong, Z. L. Ong, Z. Sun, and H. Li, "A coherent hybrid SRAM and STT-RAM L1 cache architecture for shared memory multicores," in Proc. Asia South Pacific Des. Autom. Conf., 2014, pp. 610–…
[49] …, "Approximate acceleration: A path through the era of dark silicon and big data," in Proc. Int. Conf. Compilers, Archit. Synthesis Embedded Syst., 2015, pp. 31–…
[50] …fique, "Dark silicon: From computation to communication," in Proc. 9th Int. Symp. Netw.-on-Chip, 2015, Art. no. 23.
[51] Y. T. Chen, J. Cong, H. Huang, B. Liu, C. Liu, M. Potkonjak, and G. Reinman, "Dynamically reconfigurable hybrid cache: An energy-efficient last-level cache design," in Proc. Des., Autom. Test Eur. Conf. Exhib., 2012, pp. 45–…
[52] …, "Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM," in Proc. High Perform. Comput. Archit., 2013, pp. 143–…
[53] …, "An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth," in Proc. High Perform. Comput. Archit., 2010, pp. 1–…
[54] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. C. Hoe, and F. Franchetti, "…," in Proc. Workshop Near-Data Process., 2014, pp. 1–…
[55] …, "…," in Proc. ACM SIGARCH Comput. Archit. News, 2008, pp. 453–…

Salman Onsori received the BS degree in computer engineering (hardware) from the Shahed University, Iran, in 2010 and the MS degree in computer architecture from the Shahid Beheshti University, Iran, in 2013. He is currently working toward the PhD degree at Bilkent University, Turkey. His current research interests include the design of emerging non-volatile memory and cache architectures, 3D chip-multiprocessors and embedded systems, as well as their hardware modelling.

Arghavan Asad received the MS degree in computer architecture from the Iran University of Science and Technology, Tehran, Iran, in 2012. She is currently working toward the PhD degree at the Iran University of Science and Technology. Her research interests include interconnection networks, low-power hardware and memory hierarchy design.

Kaamran Raahemifar received the BSc degree from the Sharif University of Technology, the MS degree from the Waterloo University, and the PhD degree from the Windsor University, all in electrical and computer engineering. He is currently a professor in the Department of Computer Engineering at Ryerson University. His research interests are in the areas of optimization in engineering, modeling, simulation, design and VLSI circuits.

Mahmood Fathy received the BS degree in electronics from the Iran University of Science and Technology, Tehran, Iran, in 1985, the MS degree in computer architecture from the Bradford University, West Yorkshire, United Kingdom, in 1987, and the PhD degree in image processing and computer architecture from the University of Manchester Institute of Science and Technology, Manchester, United Kingdom, in 1991. Since 1991, he has been an associate professor with the Department of Computer Engineering, Iran University of Science and Technology. His research interests include the quality of service in computer networks.