Aging-Aware Request Scheduling for Non-Volatile Main Memory
Shihao Song
Drexel University, USA
Anup Das
Drexel University, USA
Onur Mutlu
ETH Zürich, Switzerland
Nagarajan Kandasamy
Drexel University, USA
ABSTRACT
Modern computing systems are embracing non-volatile memory (NVM) to implement high-capacity and low-cost main memory. Elevated operating voltages of NVM accelerate the aging of CMOS transistors in the peripheral circuitry of each memory bank. Aggressive device scaling increases power density and temperature, which further accelerates aging, challenging the reliable operation of NVM-based main memory. We propose HEBE, an architectural technique to mitigate the circuit aging-related problems of NVM-based main memory. HEBE is built on three contributions. First, we propose a new analytical model that can dynamically track the aging in the peripheral circuitry of each memory bank based on the bank's utilization. Second, we develop an intelligent memory request scheduler that exploits this aging model at run time to de-stress the peripheral circuitry of a memory bank only when its aging exceeds a critical threshold. Third, we introduce an isolation transistor to decouple parts of a peripheral circuit operating at different voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations independently and off the critical path of memory read and write accesses, improving performance. We evaluate HEBE with workloads from the SPEC CPU2017 Benchmark suite. Our results show that HEBE significantly improves both performance and lifetime of NVM-based main memory.
ACM Reference Format:
Shihao Song, Anup Das, Onur Mutlu, and Nagarajan Kandasamy. 2021. Aging-Aware Request Scheduling for Non-Volatile Main Memory. In Proceedings of the Asia and South Pacific Design Automation Conference (ASPDAC '21), January 18-21, 2021, Tokyo, Japan. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3394885.3431529
DRAM has been the technology of choice for implementing main memory due to its relatively low latency and low cost. However, DRAM is a fundamental performance and energy bottleneck in almost all computing systems, and it is experiencing significant technology scaling challenges [22, 36, 40, 54, 59-61]. Recently,
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASPDAC '21, January 18-21, 2021, Tokyo, Japan
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-7999-1/21/01...$15.00
https://doi.org/10.1145/3394885.3431529
DRAM-compatible, yet more technology-scalable alternative non-volatile memory (NVM) technologies, such as Phase-Change Memory (PCM), are being explored [44, 46-48, 57, 64, 69-71, 81, 84, 85, 94, 100]. Compared to DRAM, NVM requires higher voltages to read and program memory cells. We investigate the internal architecture of the peripheral circuitry of each memory bank and find that such circuitry consists of transistors built using CMOS and FinFET [28]. When operated at high voltage and temperature, the transistor's parameters can, over time, strongly drift from their nominal values. This is called aging. In fact, in scaled technology nodes, aging happens even under nominal conditions from the very start of device use. The most important breakdown mechanism is Bias Temperature Instability (BTI) [43, 92]. Strongly depending on the workload, BTI is highly variable and largely reversible under nominal conditions upon removal of the stress voltage. However, if the peripheral circuitry is used continuously for long durations at elevated operating conditions, the BTI-induced parameter drifts in peripheral circuitry cannot be reversed [98], leading to permanent functional degradation and hardware faults.
As process technology scales down to smaller dimensions due to NVM's CMOS-compatible scaling [96], aging issues are expected to get exacerbated due to the increase in electric field and power density, which leads to higher chip temperatures and, consequently, the acceleration of BTI. Current methods for mitigating aging are overly conservative, since they estimate transistor aging in a peripheral circuitry statically, assuming worst-case operating conditions [31]. Based on such worst-case estimates, these methods de-stress each peripheral circuitry periodically at a fixed interval, without tracking its actual aging.
Therefore, these methods significantly and unnecessarily constrain performance.
Our goal is to design a dynamic policy that tracks the aging in the peripheral circuitry of each memory bank based on the operating voltages needed to serve read and write requests from the bank, and dynamically schedules its de-stress operation only when its aging exceeds a critical threshold. Our architectural approach to mitigating aging in NVM, called HEBE, is built on three contributions.
Contribution 1.
We develop a new, accurate analytical model to estimate transistor aging in the peripheral circuitry of each memory bank. Our model dynamically tracks aging in response to a memory controller's request scheduling decisions, such as serving a read (which requires 2.85V) vs. a write (which requires 3.7V). To use this model at run time, we leverage the associative property of our analytical formulation, a direct reflection of the underlying physical failure mechanism, allowing us to express aging in terms of offline-computed unit aging parameters (described in Section 3). Our memory controller uses these parameters to estimate aging in a peripheral circuitry based on the number of read and write requests that are served via the circuitry. (NVMs are also used for synaptic storage in neuromorphic computing [4, 5, 8, 53, 82]. In Greek mythology, Hebe, pronounced hee.bee, is the goddess of youth [93].)
Contribution 2.
We develop a new, intelligent memory request scheduler that prioritizes requests to banks whose peripheral circuits are currently active but not serving any memory request, over other requests, including the long-outstanding ones. This straightforward and greedy policy is controlled in two ways. First, the memory controller uses our new aging model to track the aging of a peripheral circuitry, de-stressing the circuitry only when its aging exceeds a critical threshold. Second, the memory controller uses thresholding to avoid starvation of memory requests.
Contribution 3.
We introduce an isolation transistor in each peripheral circuitry to decouple its logic blocks operating at different supply voltages during read and write accesses (see Fig. 1). The decoupled architecture allows these logic blocks to be de-stressed based on their respective aging levels. Our request scheduler exploits this decoupled architecture to schedule the long-latency de-stress operations off the critical path of accesses, reducing bank occupancy and improving performance.
We evaluate HEBE with workloads from the SPEC CPU2017 Benchmark suite [7]. Our results show that HEBE significantly improves both performance and lifetime of NVM-based main memory.
NVM, like DRAM [22, 42, 49, 78], is organized hierarchically [56, 58, 73, 81, 84, 85, 100]. For example, a 128GB NVM can have 2 channels, 1 rank/channel, and 8 banks/rank. A bank can have 64 partitions [81]. Each bit in NVM is represented by the resistance of an NVM cell: low resistance is logic '1' and high resistance is logic '0'. An NVM cell is read and programmed by driving current through it using per-bank peripheral circuitry (see Fig. 1). Peripheral circuitry consists of sense amplifiers (SA) to read and write drivers (WD) to program. A WD consists of the write pulse shaper (PS) logic, which generates the current pulses necessary for SET and RESET operations, and the verify (VF) logic, which verifies the correctness of these operations [25].
[Figure 1 depicts PCM banks alongside the per-bank peripheral circuitry: two charge pumps feeding the sense amplifier and the write driver, whose pulse shaper and verify logic blocks are separated by an isolation transistor.]
Figure 1: Architecture of NVM peripheral circuitry [81].
In addition to the regular read and write modes of operation, peripheral circuitry can also be in 1) idle mode, where it does not serve any request, and 2) de-stress mode, where it is powered down. Table 1 reports the operating voltages of the three logic blocks in a peripheral circuitry during read, write, idle, and de-stress operations [73]. Voltages higher than the nominal 1.2V supply are generated using the two on-chip charge pumps shown in Figure 1. These high voltages induce aging of the transistors in the peripheral circuitry logic blocks. We focus on BTI failures.

Figure 2: Threshold voltage (V_th) shift due to BTI.

BTI is a failure mechanism in a transistor, where positive charge is trapped in the oxide-semiconductor boundary underneath the gate [26]. BTI manifests as 1) a decrease in drain current and transconductance, and 2) an increase in off current and threshold voltage V_th. Figure 2 illustrates the stress and recovery of the threshold voltage of a transistor on application of a high (V_read / V_write) and a low (V_de-stress) voltage. We observe that both stress and recovery depend on the time of exposure to the corresponding voltage level. This implies that when peripheral circuitry is de-stressed, the BTI aging of its transistors partially recovers from stress. To compute the overhead due to de-stress operations, we assume that the memory controller issues a de-stress command to a memory bank once every t_DSI, the de-stress interval. Each de-stress operation completes within a time interval t_DSC, the de-stress cycle time. Hence, the performance overhead (i.e., data throughput loss) due to periodic de-stress is

    de-stress overhead = t_DSC / t_DSI.    (1)

The overhead due to periodic de-stress (as implemented in conservative approaches such as [31]) is significant in current NVM devices, and it is expected to become even more performance-critical in the future as NVM chip capacity increases [73].
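Equation 1 can be sketched directly as code; the t_DSC and t_DSI values below are illustrative, not taken from [73].

```python
def destress_overhead(t_dsc: float, t_dsi: float) -> float:
    """Fraction of memory time lost to periodic de-stress (Eq. 1).

    t_dsc: de-stress cycle time (time one de-stress operation takes).
    t_dsi: de-stress interval (time between consecutive de-stress commands).
    """
    if t_dsi <= 0:
        raise ValueError("de-stress interval must be positive")
    return t_dsc / t_dsi

# Illustrative values: halving the interval doubles the throughput loss.
print(destress_overhead(t_dsc=50, t_dsi=1000))  # 0.05
print(destress_overhead(t_dsc=50, t_dsi=500))   # 0.1
```

This makes the trade-off explicit: a shorter de-stress interval reverses more of the BTI drift but linearly increases the throughput loss.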
Operating mode    | Pulse shaper (PS) | Verify (VF) | Sense amplifier (SA)
Read              | 1.2V              | 1.2V        | 2.85V
Write (program)   | 3.7V              | 2.85V       | 1.2V
Idle              | 1.2V              | 1.2V        | 1.2V
De-stress         | < V_th            | < V_th      | < V_th

Table 1: Operating voltage of the three logic blocks in peripheral circuitry during read, write, idle, and de-stress [73]. The threshold voltage (V_th) of a CMOS transistor is between 0.7V and 1V at scaled nodes.

Figure 3 shows the shift in threshold voltage of a transistor in a memory bank's peripheral circuitry when executing a microbenchmark with t_DSI set to 10 and 100 requests, respectively. (The microbenchmark we use for this evaluation consists of alternating read and write requests to randomly selected PCM locations.) We observe that the shift in the threshold voltage is higher for the larger de-stress interval t_DSI. This is because when we set t_DSI to a large value, the transistor is exposed to a stress voltage for a long duration between two consecutive de-stress operations. Therefore, the large parameter drift it encounters in this duration cannot be reversed during the next de-stress operation. The parameter drift continues to accumulate over time, resulting in a large shift in the threshold voltage, as shown in the figure. Therefore, it is better to set t_DSI to a small value, which allows most of the parameter drift to be reversed during each de-stress operation, resulting in a lower threshold voltage shift.
Figure 3: Shift in V_th over time (in years) for t_DSI = 10 and 100.

However, setting a lower t_DSI leads to a higher de-stress related performance overhead, which we formulate in Equation 1. HEBE exploits this performance-reliability trade-off using its new intelligent memory request scheduler (see Sec. 5).
In this section, we introduce the new aging model of HEBE. The BTI lifetime [3, 17-20, 80, 83, 85] of a transistor is
    MTTF_BTI = A · V^(-γ) · e^(Ea/KT),    (2)

where A and γ are material-related constants, Ea is the activation energy, K is the Boltzmann constant, T is the temperature, and V is the overdrive gate voltage of the transistor. BTI failures can also be modeled using the Weibull distribution with a scale parameter α and a slope parameter β. The reliability, defined as the probability of correct operation of the transistor, at time t is given by [6, 21, 29, 86]

    R(t) = e^(-(t/α(V))^β),    (3)

with the corresponding MTTF computed as

    MTTF = ∫₀^∞ R(t) dt = α(V) · Γ(1 + 1/β),    (4)

where Γ is the Gamma function. Using the expressions for MTTF from Equations 2 and 4, and rearranging, we obtain the expression for the scale parameter α as

    α(V) = A · V^(-γ) · e^(Ea/KT) / Γ(1 + 1/β).    (5)

Figure 4 shows the operating voltage of the PS, VR, and SA blocks in the peripheral circuitry of a memory bank when serving read and write requests from the bank. We observe that the operating voltage of the logic blocks in a memory bank's peripheral circuit changes over time based on whether the peripheral circuit is idle or serving read or write requests. Existing aging models such as [15, 21, 86] assume a constant operating voltage for the logic blocks. Therefore, these models cannot be effectively used to estimate the aging in a memory bank's peripheral circuitry.
We illustrate how we formulate the aging of each of these three logic blocks in peripheral circuitry, starting with the PS logic. Let [t_i, t_i+1) be the (i+1)-th time interval with Δt_i = t_i+1 - t_i, and let V_i be the gate overdrive voltage in this interval. The reliability of the PS logic at the start of execution is

    R(t)|_(t=t0) = 1.    (6)

(Overdrive voltage is defined as the voltage between transistor gate and source (V_GS) in excess of the threshold voltage (V_th), where V_th is the minimum voltage required between gate and source to turn the transistor on.)
At the end of the first interval (i.e., after servicing the first read request), the reliability of the PS logic is

    R(t1⁻) = e^(-(t1/α(V1))^β).    (7)

Using the term θ to represent the reliability degradation during this interval [t0, t1), the reliability at the beginning of the second interval (i.e., right after the start of the first idle period) is

    R(t1⁺) = e^(-((t1 + θ)/α(V2))^β).    (8)

Due to the continuity of the reliability function, we can equate Equations 7 & 8 to compute θ as

    θ = (α(V2)/α(V1) - 1) · t1.    (9)

Substituting Eq. 9 in Eq. 8, the reliability at time t2 is

    R(t2) = e^(-(Δt1/α(V1) + Δt2/α(V2))^β).    (10)

We can extend this equation to compute the reliability of the PS logic at the end of execution (i.e., after servicing the last write request from the bank in Fig. 4) as

    R(t_s) = e^(-(Σ_{i=1}^{n} Δt_i/α(V_i))^β),    (11)
Figure 4: Operating voltage of the PS, VR, and SA logic blocks in peripheral circuitry of a bank when serving read and write requests. (Overdrive voltage = operating voltage - V_th.)

The aging A_PS of the PS logic is

    A_PS = Σ_{i=1}^{n} Δt_i/α(V_i), such that R(t_s) = e^(-(A_PS)^β),    (12)

where the scaling factor α(V_i) can be calculated using Eq. 5. We observe that Eq. 12 follows the associative property, a direct reflection of the underlying BTI failure mechanism. In other words, the aging accrued in each bank's peripheral circuitry is independent of the order in which the reads and writes are scheduled to the bank. Eq. 12 can be rewritten using memory timing parameters as

    A_PS = n_r · U_r + n_w · U_w + n_i · U_i, where    (13)
    U_r = tRC_r/α(1.2), U_w = tRC_w/α(3.7), and U_i = 1/α(1.2),

where tRC is the row cycle time, n_r and n_w are the number of read and write requests, respectively, and n_i is the number of memory clock cycles for which the PS logic is idle. U_r and U_w represent, respectively, the aging accrued in peripheral circuitry when serving a read and a write request, and U_i represents the aging accrued per clock cycle when the peripheral circuitry is idle. U_r, U_w, and U_i are called unit aging parameters, which the memory controller uses to track the aging of the PS logic in peripheral circuitry by simply recording 1) the number of read and write requests that are served from the bank, and 2) the number of idle clock cycles during workload execution. We note that these factors (read/write requests and the idle periods) cannot be known with certainty at design time. Therefore, design-time aging estimates are not accurate.
The aging of the VR and SA logic blocks (represented as A_VR and A_SA, respectively) can be computed in a similar way using Eq. 13.
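As a concrete, hypothetical instantiation of Eqs. 5 and 13, the sketch below tracks A_PS from request counts alone. The fitting constants, temperature, clock period, and voltages are placeholder assumptions, not the paper's calibrated values.

```python
import math

# Placeholder fitting constants (NOT the paper's calibrated values).
A, GAMMA, E_A = 1e6, 3.0, 0.1   # material constants; activation energy (eV)
K = 8.617e-5                    # Boltzmann constant (eV/K)
BETA = 2.0                      # assumed Weibull slope parameter

def alpha(v: float, temp_k: float = 300.0) -> float:
    """Weibull scale parameter from Eq. 5."""
    return (A * v ** -GAMMA * math.exp(E_A / (K * temp_k))
            / math.gamma(1.0 + 1.0 / BETA))

# Unit aging parameters (Eq. 13) for the PS logic, using illustrative
# row-cycle times and the PS operating voltages from Table 1.
TRC_R, TRC_W = 56.25e-9, 209.75e-9   # read/write row cycle times (s)
U_R = TRC_R / alpha(1.2)             # aging per read served
U_W = TRC_W / alpha(3.7)             # aging per write served
U_I = 1e-9 / alpha(1.2)              # aging per idle cycle (assumed 1ns cycle)

def ps_aging(n_reads: int, n_writes: int, n_idle_cycles: int) -> float:
    """A_PS = n_r*U_r + n_w*U_w + n_i*U_i (Eq. 13)."""
    return n_reads * U_R + n_writes * U_W + n_idle_cycles * U_I
```

Because Eq. 13 is linear in the request counts, the accrued aging is independent of request ordering, which is exactly the associative property HEBE's memory controller exploits at run time.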
To obtain the overall aging, we combine these individual aging values considering the peripheral circuitry to be a series failure system, where the first instance of any logic block failing causes the entire peripheral circuit to fail. Therefore, the overall aging is

    A = max{A_PS, A_VR, A_SA}.    (14)

In the baseline system, when peripheral circuitry is de-stressed, its three logic blocks (PS, VR, and SA) are de-stressed simultaneously. Once de-stressed, these logic blocks take several memory cycles (t_DSC) before they can be used to serve memory requests again; t_DSC is substantial in recent designs [73]. Therefore, frequent de-stress operations can lead to high performance overhead (Eq. 1). To reduce this overhead, we analyze the average aging of the PS, VR, and SA blocks in a memory bank's peripheral circuitry at the time when they are de-stressed during workload execution. Figure 5 plots these results for the workloads described in Section 6 with t_DSI set to 100. We make the following two key observations.
Figure 5: Average aging in arbitrary units (a.u.) of the PS, VR, and SA logic blocks in peripheral circuitry, at the time when they are de-stressed, for our evaluated workloads.
First, the average aging of these logic blocks varies widely across different workloads due to the difference in memory requests. Second, at the time when a de-stress is initiated, the average aging values of the three logic blocks differ from one another and are lower than the aging threshold. This is because, in the baseline design, a peripheral circuit is de-stressed in its entirety when the aging of any one of its three logic blocks exceeds the aging threshold.
Based on these observations and the connectivity of the logic blocks to the two charge pumps (see Fig. 1), we introduce an isolation transistor (M) to decouple the VR logic from the PS logic inside the write driver, allowing us to track and de-stress the logic blocks individually, as opposed to de-stressing the entire peripheral circuitry at once. Table 2 summarizes the new controls, which we enable using the isolation transistor. Using this new decoupled control mechanism, HEBE's request scheduler (Section 5) can de-stress logic blocks in a bank's peripheral circuitry off the critical path of accesses, lowering bank occupancy and improving performance. We observe that the read charge pump is shared between the SA and VR logic blocks (see Figure 1). Therefore, when HEBE de-stresses the SA because the SA's aging exceeds the critical aging threshold, the VR logic also gets de-stressed, preventing the write driver from serving write requests. To address this, we exploit the decoupled program- and verify-based write operations in PCM [25]. If a write request needs to be scheduled concurrently with the de-stress operation of the SA, HEBE schedules only the program step of the write operation (which utilizes the PS block) concurrently with the de-stress operation, while the verify step is scheduled after the de-stress operation completes.
Read pump   | Write pump  | PS        | VR        | SA
Baseline control:
Active      | Active      | Active    | Active    | Active
Discharged  | Discharged  | De-stress | De-stress | De-stress
Proposed decoupled control:
Active      | Active      | Active    | Active    | Active
Discharged  | Active      | Active    | De-stress | De-stress
Active      | Discharged  | De-stress | Active    | Active
Discharged  | Discharged  | De-stress | De-stress | De-stress
Table 2: Controlling de-stress ops. using charge pumps.
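The decoupled control in Table 2 reduces to a small rule, since the read charge pump feeds the SA and VR blocks and the write charge pump feeds the PS block. The function and encoding below are an illustrative sketch, not hardware.

```python
def block_actions(read_pump_active: bool, write_pump_active: bool) -> dict:
    """Map charge-pump states to per-block actions (Table 2, decoupled control).

    The read charge pump feeds the SA and VR logic; the write charge pump
    feeds the PS logic. A block whose pump is discharged is de-stressed.
    """
    ps = "active" if write_pump_active else "de-stress"
    vr = sa = "active" if read_pump_active else "de-stress"
    return {"PS": ps, "VR": vr, "SA": sa}

# Discharging only the read pump de-stresses SA and VR while PS keeps
# serving the program step of writes.
print(block_actions(read_pump_active=False, write_pump_active=True))
# {'PS': 'active', 'VR': 'de-stress', 'SA': 'de-stress'}
```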
We design a new memory request scheduling policy to control the aging in the peripheral circuitry within each memory bank using our new aging model (Section 3) and the decoupled peripheral circuit architecture (Section 4).
We describe HEBE in the context of a DRAM-PCM hybrid memory, where embedded DRAM (eDRAM) is used as a write cache to PCM main memory, as shown in Figure 6. The baseline memory controller architecture consists of a read-write queue (rwQ) to buffer PCM requests. The key idea of HEBE's scheduling policy is to 1) improve performance by lowering the de-stress overhead, 2) minimize wasted memory cycles during which a bank is idle with its peripheral circuitry accruing BTI aging, and 3) prevent a request from being delayed too much.
[Figure 6 depicts the HEBE memory controller between the multicore CPUs (with eDRAM cache) and PCM, comprising the read-write queue (rwQ), status table (sTab), access table (aTab), unit aging table (uTab), request selection, and de-stress selection components.]
Figure 6: Request and de-stress scheduling in HEBE.
Figure 6 shows the detailed design of HEBE, which introduces five new components to the baseline memory controller design, as highlighted in the figure. (Even though we use eDRAM as a cache to PCM in our implementation and evaluations, HEBE is applicable to any type of hybrid memory or standalone PCM memory.)
The first component is the status table (sTab). HEBE uses this table to record whether a memory bank is available to serve a PCM request. sTab requires one 1-bit entry for each PCM bank. For 128 banks in a 128GB PCM (see our simulation parameters in Table 3), HEBE requires 128 bits of storage for sTab.
The second component is the access table (aTab). HEBE uses this table to record the number of memory cycles for which a memory bank's peripheral circuitry has been active since the last de-stress operation of the bank. Since peripheral circuitry operates at a different voltage when it is idle than when it is serving a read or a write request, the aging model of HEBE requires the exact number of cycles for which a peripheral circuit is idle and serving read and write requests. Therefore, each aTab entry contains one 16-bit field for recording the idle cycles, and two 4-bit fields for recording the number of read and write requests. For a 128GB PCM with 1GB per bank, HEBE requires 3Kb (= 128 x 24 bits) of storage.
The third component is the unit aging table (uTab). HEBE uses this table to store the unit aging parameters (Eq. 13). Since the three unit aging parameters are the same for every peripheral circuitry in PCM, there are only three 32-bit entries in this table, one for each of these parameters, requiring a total of 96 bits for uTab.
The fourth component is the request selection logic. HEBE uses this component to select a request from the rwQ to schedule to PCM.
Figure 7 shows the flowchart of HEBE's request selection mechanism. After scheduling a request from the rwQ, the memory controller checks whether the number of clock cycles for which each request has been outstanding in the rwQ is smaller than the backlogging threshold (th_b). If the backlogging threshold is exceeded, the request is dequeued and served next. Otherwise, the memory controller selects an outstanding request from the rwQ that is to the bank whose peripheral circuitry has the highest number of idle cycles since the last time it served a request.

Figure 7: Memory request and de-stress selection in HEBE.
The final component is the de-stress selection logic. HEBE uses this component to schedule de-stress operations in PCM banks. For this purpose, HEBE uses two thresholds: the aging threshold (th_a) and the idle threshold (th_i). The aging threshold is used to control the aging of peripheral circuitry in PCM in order to achieve a target lifetime. The idle threshold is used to limit the duration during which a peripheral circuit accrues aging without doing any useful work. The de-stress selection logic is also shown in Figure 7. If the selected memory request is to a bank whose peripheral circuitry exceeds either of the two thresholds, the memory controller schedules a de-stress operation to the bank. Otherwise, the request is scheduled to PCM. (For simplicity, we have not considered process variation across the peripheral circuitry of different PCM banks.)
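The selection flow of Figure 7 can be sketched as follows; the queue and bank record structures, field names, and threshold values are illustrative assumptions, not HEBE's hardware encoding.

```python
from dataclasses import dataclass

@dataclass
class Bank:
    idle_cycles: int = 0   # cycles idle since last de-stress (from aTab)
    aging: float = 0.0     # estimated aging since last de-stress (Eq. 13)

@dataclass
class Request:
    bank: int
    wait_cycles: int       # cycles outstanding in the rwQ

def select(rwq, banks, th_b, th_a, th_i):
    """One scheduling decision: ('serve', request) or ('de-stress', bank_id)."""
    if not rwq:
        return None
    # 1) Anti-starvation: serve the oldest request once it exceeds th_b.
    oldest = max(rwq, key=lambda r: r.wait_cycles)
    if oldest.wait_cycles > th_b:
        req = oldest
    else:
        # 2) Greedy: prefer the request whose bank has idled the longest.
        req = max(rwq, key=lambda r: banks[r.bank].idle_cycles)
    # 3) De-stress first if the target bank exceeds either threshold.
    b = banks[req.bank]
    if b.aging > th_a or b.idle_cycles > th_i:
        return ("de-stress", req.bank)
    return ("serve", req)

banks = {0: Bank(idle_cycles=5, aging=100.0), 1: Bank(idle_cycles=50, aging=2000.0)}
rwq = [Request(bank=0, wait_cycles=10), Request(bank=1, wait_cycles=3)]
print(select(rwq, banks, th_b=1000, th_a=1000.0, th_i=100_000))
# ('de-stress', 1): bank 1 has idled longest, and its aging exceeds th_a
```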
HEBE requires a total storage of 3.2Kb for a PCM memory of 128GB capacity and 128 banks. The timing overhead of the request and de-stress selection is overlapped with the timing of an ongoing read or write request, incurring minimal impact on the critical path of PCM read and write accesses. Therefore, HEBE's request scheduling introduces marginal performance overhead. In fact, HEBE improves performance compared to other approaches by reducing the de-stress related performance bottleneck (see Section 7.1).
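The 3.2Kb budget follows from the per-table sizes given above; a quick arithmetic check, with the bank count and field widths as stated in the text:

```python
NUM_BANKS = 128

s_tab_bits = NUM_BANKS * 1             # sTab: 1 availability bit per bank
a_tab_bits = NUM_BANKS * (16 + 4 + 4)  # aTab: idle-cycle + read/write counters
u_tab_bits = 3 * 32                    # uTab: three shared unit aging parameters

total_bits = s_tab_bits + a_tab_bits + u_tab_bits
print(total_bits)  # 3296 bits, i.e., ~3.2Kb
```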
We evaluate HEBE for phase-change memory (PCM), one of the most mature NVM technologies. We configure PCM as main memory with eDRAM as its write cache. This is similar to the architecture of IBM POWER9 [77]. Our simulation framework includes the following components, with parameters listed in Table 3.
• A cycle-level in-house x86 multi-core simulator, configured to simulate 8 out-of-order cores.
• A main memory simulator closely matching the JEDEC Non-Volatile Dual In-line Memory Module (NVDIMM) specifications [45]. This simulator is composed of Ramulator [41], to simulate DRAM, and a cycle-level in-house NVM simulator based on NVMain [68].
• Power and latency models for DRAM and NVM based on Intel/Micron's 3D XPoint specification [73]. Energy is modeled for DRAM using DRAMPower [9] and for NVM using NVMain with parameters from [73].
Processor    | 8 cores, 3 GHz, out-of-order
L1-I/D cache | Private, 64KB per core, 4-way
L2 cache     | Shared, 4MB, 8-way
DRAM         | 8GB, Micron DDR3; 1 channel, 8 ranks/channel, 8 banks/rank, 128 sub-arrays/bank, 512 rows/sub-array
PCM          | 128GB, Micron DDR3 [73]; 2 channels, 1 rank/channel, 8 banks/rank, 64 partitions/bank, 128 tiles/partition, 4096 rows/tile

Table 3: Major simulation parameters.
Table 4 reports the timing parameters for PCM reads and writes. These parameters are based on Micron's 128GB PCM module [73].

Read  | tRCD = 3.75ns | tRAS = 55.25ns | tRP = 1ns | tRC = 56.25ns
Write | tRCD = 75ns | tBURST = 15ns | tWR = 190ns | tRP = 1ns | tRC = 209.75ns

Table 4: PCM timing parameters based on [73].
We evaluate 10 billion instructions of ten workloads from the SPEC CPU2017 benchmarks [7] (see Table 5). We evaluate the following techniques.
• Baseline [84] de-stresses the peripheral circuitry of each NVM bank with a fixed t_DSI of 100, without tracking its aging. Memory requests are scheduled using the FR-FCFS policy [76, 104].
• HEBE tracks the aging of the CMOS transistors in the peripheral circuitry of each bank and de-stresses them only when their aging exceeds the aging threshold. A peripheral circuitry is de-stressed based on the maximum aging of its logic blocks.
• Decoupled-HEBE is based on HEBE. Each peripheral circuitry is decoupled to de-stress its logic blocks independently.

single-core | 8 copies each of blender, bwaves, cactuBSSN, cam4, gcc, mcf, omnetpp, parest, roms, xalancbmk
Table 5: Evaluated workloads.
To compute aging, the slope parameter of the Weibull distribution is fixed, and the operating temperature is set to 300K. The other fitting parameters are adjusted to achieve an MTTF of 2 years in the baseline system, corresponding to a threshold voltage shift of 10%. This is typically accepted as the maximum allowed V_th degradation before timing errors begin to appear [3, 13-16, 21, 80, 83, 89, 90].

Figure 8 plots the execution time of each workload for the evaluated systems, normalized to Baseline. We make the following two key observations. First, the average execution time of HEBE is 12% lower than Baseline. This improvement comes from HEBE's lower de-stress overhead, due to its dynamic policy of opportunistically de-stressing each peripheral circuitry only when its aging exceeds the aging threshold. Baseline, on the other hand, uses a fixed de-stress interval of 100 without tracking the exact aging. Second, the average execution time of Decoupled-HEBE is 6% lower than HEBE. This improvement is because Decoupled-HEBE de-stresses the logic blocks in a memory bank's peripheral circuits off the critical path of read and write accesses from the bank, reducing bank occupancy and improving performance.
Figure 8: Execution time, normalized to Baseline.
Figure 9 plots the MTTF of each workload for the evaluated systems, normalized to Baseline. We make the following two key observations.
Figure 9: MTTF, normalized to Baseline.
First, the average MTTF of HEBE is 16% higher than Baseline. This improvement is because 1) HEBE does not allow the aging of any peripheral circuitry in PCM to exceed the aging threshold, and 2) the aging-aware access scheduling policy of HEBE minimizes the number of wasted memory cycles for which a peripheral circuitry accrues aging while being idle. Second, the average MTTF of Decoupled-HEBE is 3.4% higher than HEBE. This is because HEBE needs to wait for an ongoing PCM read or write request to complete before it can schedule the de-stress operation of a peripheral circuitry. Therefore, the circuitry continues to age before it is eventually de-stressed, lowering its MTTF. Decoupled-HEBE, on the other hand, can schedule the de-stress operation of a logic block in a bank's peripheral circuitry independently and in parallel with ongoing read and write requests to the bank. Therefore, the MTTF of Decoupled-HEBE is higher than that of HEBE.
Figure 10 plots the de-stress overhead (Eq. 1) of each workload for each evaluated system, normalized to Baseline. We make the following two key observations.
Figure 10: De-stress overhead, normalized to Baseline.
First, the average de-stress overhead of HEBE is 6.6% lower than Baseline. This improvement is because HEBE increases the de-stress interval (t_DSI) by accurately tracking the aging of each peripheral circuitry dynamically, de-stressing it only when its aging exceeds a threshold. Baseline uses a fixed t_DSI of 100. Second, the average de-stress overhead of Decoupled-HEBE is 35% lower than HEBE. This improvement is due to the reduction of the de-stress cycle time (t_DSC), which is achieved by de-stressing the logic blocks in a bank's peripheral circuitry independently and in parallel with ongoing read and write requests to the bank.
Figure 11 reports the execution time and aging of each of our workloads using HEBE, normalized to Baseline. The height of a bar represents HEBE's result with the default aging threshold of 1000 units. An error bar represents the variation obtained by changing the aging threshold from 500 units to 2000 units. We make the following observation.
Figure 11: Execution time and aging of HEBE, normalized toBaseline, as a function of the aging threshold.
When we set a stricter aging threshold (e.g., 500 units), execution time increases and aging decreases. This is because when the aging threshold is lowered, the high-voltage exposure time of the peripheral circuitry in each memory bank reduces, reducing the accrued aging. However, performance degrades due to the higher de-stress overhead (see Eq. 1). Conversely, when we relax the aging threshold (e.g., 2000 units), the de-stress interval increases, reducing the de-stress overhead and improving performance. However, aging is now higher because of longer exposure to high-voltage stress.
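The trade-off above can be sketched with a simple counting argument: over a fixed amount of accrued stress, a lower threshold triggers more de-stress operations (more overhead, longer execution time) but caps accrued aging sooner. The stress budget below is an illustrative assumption, not a measured quantity.

```python
def num_destress_ops(total_stress_units, threshold):
    """Number of de-stress operations triggered if one fires each time
    accrued aging reaches the threshold."""
    return total_stress_units // threshold

budget = 100_000  # assumed total high-voltage stress units over a run

strict  = num_destress_ops(budget, 500)   # stricter threshold: more ops
default = num_destress_ops(budget, 1000)  # default threshold
relaxed = num_destress_ops(budget, 2000)  # relaxed threshold: fewer ops
```

With these assumed values, the strict setting performs 4x as many de-stress operations as the relaxed one, which is why performance degrades while peak aging shrinks.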
Figure 12 plots the MTTF of Decoupled-HEBE at 300K, 325K, and 350K, normalized to Baseline at 300K, for each evaluated application. We observe that MTTF decreases as temperature increases: MTTF at 325K and 350K is lower than at 300K by an average of 7% and 26%, respectively. These results follow directly from our aging formulation, which incorporates temperature using the scaling parameter 𝛼 in Eq. 5. This parameter grows exponentially with temperature, resulting in a corresponding exponential increase in aging. More aging leads to a larger shift in threshold voltage.

Figure 12: MTTF at 325K and 350K normalized to Baseline.
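An exponential temperature dependence of this kind is commonly modeled with an Arrhenius-style term. The sketch below assumes such a form for 𝛼; the activation energy is an assumed value, not the paper's fitted constant from Eq. 5.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K
E_A = 0.5       # assumed activation energy in eV (illustrative only)

def alpha(temp_k):
    """Assumed Arrhenius-style temperature scaling of the aging
    parameter alpha: grows exponentially as temperature rises."""
    return math.exp(-E_A / (K_B * temp_k))
```

Because `alpha(350.0) > alpha(325.0) > alpha(300.0)`, accrued aging rises exponentially with temperature, and MTTF (which shrinks as aging grows) falls accordingly, consistent with the trend in Figure 12.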
To our knowledge, this is the first work that exploits a workload's access characteristics to dynamically control the length of the de-stress interval of the peripheral circuitry in each memory bank, improving both performance and MTTF of PCM-based main memory.

Many works propose optimizations for PCM. Recent examples include architecture optimization [47, 70, 97], performance and energy optimization [1, 37, 81, 84, 99, 100], wear leveling [85, 101, 102], and memory controller optimizations [103]. See [95] for a survey of these and other similar approaches. HEBE can be combined with most of these techniques.

Many works propose memory latency reduction, refresh optimization, energy reduction, and request scheduling methods to enhance system performance, fairness, quality of service, or security [2, 10–12, 23, 24, 27, 30, 32–35, 38, 39, 50–52, 55, 62, 63, 65–67, 72, 74–76, 79, 87, 88, 91]. None of these works consider aging of phase change memory in their scheduling decisions. Our aging-aware scheduling mechanism can be incorporated into other memory controller designs that aim to improve other metrics.
We introduce HEBE, a new mechanism that can dynamically track and control the aging of transistors in the peripheral circuitry of each memory bank, improving both performance and aging of NVM-based main memory. HEBE is built on three novel contributions. First, we propose a new, accurate analytical model to dynamically track aging in response to the memory controller's request scheduling decisions. Second, we develop a new, intelligent request scheduler that exploits this aging model at run time to decide when peripheral circuitry in NVM must be de-stressed. Third, we decouple logic blocks in peripheral circuitry operating at different voltages, allowing these blocks to be de-stressed independently and off the critical path of execution, improving performance. We evaluate HEBE for DRAM-NVM hybrid main memory and show significant performance and MTTF improvements. We conclude that HEBE is a simple yet powerful mechanism to dynamically manage aging in non-volatile main memory and improve both performance and lifetime via its intelligent request scheduling decisions.
ACKNOWLEDGMENTS
This work is supported by the National Science Foundation Faculty Early Career Development Award CCF-1942697 (CAREER: Facilitating Dependable Neuromorphic Computing: Vision, Architecture, and Impact on Programmability).
REFERENCES
[1] M. Arjomand et al., "Boosting access parallelism to PCM-based main memory," in ISCA, 2016.
[2] R. Ausavarungnirun et al., "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems," in ISCA, 2012.
[3] A. Balaji et al., "A framework to explore workload-specific performance and lifetime trade-offs in neuromorphic computing," CAL, 2019.
[4] A. Balaji et al., "Mapping spiking neural networks to neuromorphic hardware," TVLSI, 2020.
[5] A. Balaji et al., "Enabling resource-aware mapping of spiking neural networks via spatial decomposition," ESL, 2020.
[6] C. Bolchini et al., "A lightweight and open-source framework for the lifetime estimation of multicore systems," in ICCD, 2014.
[7] J. Bucek et al., "SPEC CPU2017: Next-generation compute benchmark," in ICPE, 2018.
[8] G. W. Burr et al., "Neuromorphic computing using non-volatile memory," Advances in Physics: X, 2017.
[9] K. Chandrasekar et al., "DRAMPower: Open-source DRAM power & energy estimation tool," 2012.
[10] K. K. Chang et al., "Understanding latency variation in modern DRAM chips: Experimental characterization, analysis, and optimization," in SIGMETRICS, 2016.
[11] K. K. Chang et al., "Understanding reduced-voltage operation in modern DRAM devices: Experimental characterization, analysis, and mechanisms," in SIGMETRICS, 2017.
[12] K. K.-W. Chang et al., "Improving DRAM performance by parallelizing refreshes with accesses," in HPCA, 2014.
[13] A. Das et al., "Fault-aware task re-mapping for throughput constrained multimedia applications on NoC-based MPSoCs," in RSP, 2012.
[14] A. Das et al., "Fault-tolerant network interface for spatial division multiplexing based network-on-chip," in ReCoSoC, 2012.
[15] A. Das et al., "Aging-aware hardware-software task partitioning for reliable reconfigurable multiprocessor systems," in CASES, 2013.
[16] A. Das et al., "Energy-aware dynamic reconfiguration of communication-centric applications for reliable MPSoCs," in ReCoSoC, 2013.
[17] A. Das et al., "Energy-aware task mapping and scheduling for reliable embedded computing systems," TECS, 2014.
[18] A. Das et al., "Temperature aware energy-reliability trade-offs for mapping of throughput-constrained applications on multimedia MPSoCs," in DATE, 2014.
[19] A. Das et al., "Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs," in DATE, 2014.
[20] A. Das et al., "Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems," in DAC, 2014.
[21] A. Das et al., "Reliability and energy-aware mapping and scheduling of multimedia applications on multiprocessor systems," TPDS, 2015.
[22] A. Das et al., "VRL-DRAM: Improving DRAM performance via variable refresh latency," in DAC, 2018.
[23] H. David et al., "Memory power management via dynamic voltage/frequency scaling," in ICAC, 2011.
[24] Q. Deng et al., "MemScale: Active low-power modes for main memory," in ASPLOS, 2011.
[25] M. Frulio, "Adaptive non-volatile memory programming," US Patent, 2016.
[26] R. Gao et al., "NBTI-generated defects in nanoscaled devices: Fast characterization methodology and modeling," TED, 2017.
[27] H. Hassan et al., "ChargeCache: Reducing DRAM latency by exploiting row access locality," in HPCA, 2016.
[28] D. Hisamoto et al., "FinFET-a self-aligned double-gate MOSFET scalable to 20 nm," TED, 2000.
[29] L. Huang et al., "On task allocation and scheduling for lifetime extension of platform-based MPSoC designs," TPDS, 2011.
[30] E. Ipek et al., "Self-optimizing memory controllers: A reinforcement learning approach," in ISCA, 2008.
[31] L. Jiang et al., "A low power and reliable charge pump design for phase change memories," in ISCA, 2014.
[32] S. Khan et al., "PARBOR: An efficient system-level technique to detect data-dependent failures in DRAM," in DSN, 2016.
[33] J. Kim et al., "Solar-DRAM: Reducing DRAM access latency by exploiting the variation in local bitlines," in ICCD, 2018.
[34] J. S. Kim et al., "The DRAM latency PUF: Quickly evaluating physical unclonable functions by exploiting the latency-reliability tradeoff in modern commodity DRAM devices," in HPCA, 2018.
[35] J. S. Kim et al., "D-RaNGe: Using commodity DRAM devices to generate true random numbers with low latency and high throughput," in HPCA, 2019.
[36] J. S. Kim et al., "Revisiting RowHammer: An experimental analysis of modern DRAM devices and mitigation techniques," in ISCA, 2020.
[37] N. Kim et al., "LL-PCM: Low-latency phase change memory architecture," in DAC, 2019.
[38] Y. Kim et al., "ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers," in HPCA, 2010.
[39] Y. Kim et al., "Thread cluster memory scheduling: Exploiting differences in memory access behavior," in MICRO, 2010.
[40] Y. Kim et al., "Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors," in ISCA, 2014.
[41] Y. Kim et al., "Ramulator: A fast and extensible DRAM simulator," CAL, 2016.
[42] Y. Kim et al., "A case for exploiting subarray-level parallelism (SALP) in DRAM," in ISCA, 2012.
[43] D. Kraak et al., "Parametric and functional degradation analysis of complete 14-nm FinFET SRAM," TVLSI, 2019.
[44] E. Kültürsay et al., "Evaluating STT-RAM as an energy-efficient main memory alternative," in ISPASS, 2013.
[45] A. Lalam et al., "Non-volatile dual in-line memory module (NVDIMM) multichip package," US Patent 10,199,364, 2019.
[46] B. Lee et al., "Phase-change technology and the future of main memory," IEEE Micro, 2010.
[47] B. Lee et al., "Architecting phase change memory as a scalable DRAM alternative," in ISCA, 2009.
[48] B. Lee et al., "Phase change memory architecture and the quest for scalability," CACM, 2010.
[49] D. Lee et al., "Tiered-latency DRAM: A low latency and low cost DRAM architecture," in HPCA, 2013.
[50] D. Lee et al., "Adaptive-latency DRAM: Optimizing DRAM timing for the common-case," in HPCA, 2015.
[51] J. Liu et al., "RAIDR: Retention-aware intelligent DRAM refresh," in ISCA, 2012.
[52] Y. Lu et al., "Loose-ordering consistency for persistent memory," in ICCD, 2014.
[53] A. Mallik et al., "Design-technology co-optimization for OxRRAM-based synaptic processing unit," in VLSIT, 2017.
[54] J. A. Mandelman et al., "Challenges and future directions for the scaling of dynamic random-access memory (DRAM)," IBM JRD, 2002.
[55] J. Meza et al., "Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management," CAL, 2012.
[56] J. Meza et al., "A case for small row buffers in non-volatile main memories," in ICCD, 2012.
[57] J. Meza et al., "A case for efficient hardware/software cooperative management of storage and memory," in WEED, 2013.
[58] J. Meza et al., "Evaluating row buffer locality in future non-volatile main memories," arXiv, 2018.
[59] O. Mutlu, "Memory scaling: A systems architecture perspective," in IMW, 2013.
[60] O. Mutlu, "The RowHammer problem and other issues we may face as memory becomes denser," in DATE, 2017.
[61] O. Mutlu et al., "RowHammer: A retrospective," TCAD, 2019.
[62] O. Mutlu et al., "Stall-time fair memory access scheduling for chip multiprocessors," in MICRO, 2007.
[63] O. Mutlu et al., "Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems," in ISCA, 2008.
[64] O. Mutlu et al., "Research problems and opportunities in memory systems," SUFI, 2015.
[65] K. J. Nesbit et al., "Fair queuing memory systems," in MICRO, 2006.
[66] M. Patel et al., "The reach profiler (REAPER): Enabling the mitigation of DRAM retention failures via profiling at aggressive conditions," in ISCA, 2017.
[67] S. Pelley et al., "Memory persistency," in ISCA, 2014.
[68] M. Poremba et al., "NVMain 2.0: A user-friendly memory simulator to model (non-)volatile memory systems," CAL, 2015.
[69] M. K. Qureshi, "Pay-As-You-Go: Low-overhead hard-error correction for phase change memories," in MICRO, 2011.
[70] M. K. Qureshi et al., "Scalable high performance main memory system using phase-change memory technology," in ISCA, 2009.
[71] M. K. Qureshi et al., "Improving read performance of phase change memories via write cancellation and write pausing," in HPCA, 2010.
[72] M. K. Qureshi et al., "AVATAR: A variable-retention-time (VRT) aware refresh for DRAM systems," in DSN, 2015.
[73] A. Redaelli et al., Phase Change Memory. Springer, 2017.
[74] J. Ren et al., "ThyNVM: Enabling software-transparent crash consistency in persistent memory systems," in MICRO, 2015.
[75] S. Rixner, "Memory controller optimizations for web servers," in MICRO, 2004.
[76] S. Rixner et al., "Memory access scheduling," in ISCA, 2000.
[77] S. K. Sadasivam et al., "IBM POWER9 processor architecture," IEEE Micro, 2017.
[78] V. Seshadri et al., "In-DRAM bulk bitwise execution engine," arXiv, 2019.
[79] V. Seshadri et al., "Gather-scatter DRAM: In-DRAM address translation to improve the spatial locality of non-unit strided accesses," in MICRO, 2015.
[80] S. Song et al., "A case for lifetime reliability-aware neuromorphic computing," in MWSCAS, 2020.
[81] S. Song et al., "Enabling and exploiting partition-level parallelism (PALP) in phase change memories," TECS, 2019.
[82] S. Song et al., "Compiling spiking neural networks to neuromorphic hardware," in LCTES, 2020.
[83] S. Song et al., "Improving dependability of neuromorphic computing with non-volatile memory," in EDCC, 2020.
[84] S. Song et al., "Improving phase change memory performance with data content aware access," in ISMM, 2020.
[85] S. Song et al., "Exploiting inter- and intra-memory asymmetries for data mapping in hybrid tiered-memories," in ISMM, 2020.
[86] J. Srinivasan et al., "The case for lifetime reliability-aware microprocessors," in ISCA, 2004.
[87] L. Subramanian et al., "The blacklisting memory scheduler: Achieving high performance and fairness at low cost," in ICCD, 2014.
[88] L. Subramanian et al., "BLISS: Balancing performance, fairness and complexity in memory access scheduling," TPDS, 2016.
[89] T. Titirsha et al., "Reliability-performance trade-offs in neuromorphic computing," in CUT, 2020.
[90] T. Titirsha et al., "Thermal-aware compilation of spiking neural networks to neuromorphic hardware," in LCPC, 2020.
[91] H. Usui et al., "DASH: Deadline-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators," TACO, 2016.
[92] P. Weckx et al., "Non-Monte-Carlo methodology for high-sigma simulations of circuits under workload-dependent BTI degradation: Application to 6T SRAM," in IRPS, 2014.
[93] Wikipedia contributors, "Hebe (mythology) — Wikipedia, the free encyclopedia," 2020. [Online; accessed 14-Aug-2020].
[94] H.-S. P. Wong et al., "Phase change memory," Proceedings of the IEEE, 2010.
[95] F. Xia et al., "A survey of phase change memory systems," JCST, 2015.
[96] F. Xiong et al., "Towards ultimate scaling limits of phase-change memory," in IEDM, 2016.
[97] L. Yavits et al., "WoLFRaM: Enhancing wear-leveling and fault tolerance in resistive memories using programmable address decoders," in ICCD, 2020.
[98] C. Yilmaz et al., "Modeling of NBTI-recovery effects in analog CMOS circuits," in IRPS, 2013.
[99] H. Yoon et al., "Row buffer locality aware caching policies for hybrid memories," in ICCD, 2012.
[100] H. Yoon et al., "Efficient data mapping and buffering techniques for multilevel cell phase-change memories," TACO, 2014.
[101] J. Zhang et al., "RETROFIT: Fault-aware wear leveling," CAL, 2018.
[102] X. Zhang et al., "Toss-up wear leveling: Protecting phase-change memories from inconsistent write patterns," in DAC, 2017.
[103] J. Zhao et al., "FIRM: Fair and high-performance memory control for persistent memory systems," in MICRO, 2014.
[104] W. K. Zuravleff et al., "Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order."