A Modern Primer on Processing in Memory
Onur Mutlu a,b, Saugata Ghose b,c, Juan Gómez-Luna a, Rachata Ausavarungnirun d
SAFARI Research Group
a ETH Zürich
b Carnegie Mellon University
c University of Illinois at Urbana-Champaign
d King Mongkut's University of Technology North Bangkok
Abstract
Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely felt in the data-intensive server and energy-constrained mobile systems of today.

At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend.

This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits/technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) processing using memory, which exploits analog operational properties of DRAM chips to perform massively-parallel operations in memory with low-cost changes, and (2) processing near memory, which exploits 3D-stacked memory technology design to provide high memory bandwidth and low memory latency to in-memory logic. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM.

Keywords: memory systems, data movement, main memory, processing-in-memory, near-data processing, computation-in-memory, processing using memory, processing near memory, 3D-stacked memory, non-volatile memory, energy efficiency, high-performance computing, computer architecture, computing paradigm, emerging technologies, memory scaling, technology scaling, dependable systems, robust systems, hardware security, system security, latency, low-latency computing
Contents

1 Introduction
2 Major Trends Affecting Main Memory
3 The Need for Intelligent Memory Controllers to Enhance Memory Scaling
4 Perils of Processor-Centric Design
5 Processing-in-Memory (PIM): Technology Enablers and Two Approaches

1. Introduction

Main memory, built using the Dynamic Random Access Memory (DRAM) technology, is a major component in nearly all computing systems, including servers, cloud platforms, mobile/embedded devices, and sensor systems. Across all of these systems, the data working set sizes of modern applications are rapidly growing, while the need for fast analysis of such data is increasing. Thus, main memory is becoming an increasingly significant bottleneck across a wide variety of computing systems and applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Alleviating the main memory bottleneck requires the memory capacity, energy, cost, and performance to all scale in an efficient manner across technology generations. Unfortunately, it has become increasingly difficult in recent years, especially in the past decade, to scale all of these dimensions [1, 2, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], and thus the main memory bottleneck has been worsening.

A major reason for the main memory bottleneck is the high energy and latency cost associated with data movement. In modern computers, to perform any operation on data that resides in main memory, the processor must retrieve the data from main memory. This requires the memory controller to issue commands to a DRAM module across a relatively slow and power-hungry off-chip bus (known as the memory channel). The DRAM module sends the requested data across the memory channel, after which the data is placed in the caches and registers. The CPU can perform computation on the data once the data is in its registers. Data movement from the DRAM to the CPU incurs long latency and consumes a significant amount of energy [7, 50, 51, 52, 53, 54]. These costs are often exacerbated by the fact that much of the data brought into the caches is not reused by the CPU [52, 53, 55, 56], providing little benefit in return for the high latency and energy cost.

The cost of data movement is a fundamental issue with the processor-centric nature of contemporary computer systems. The CPU is considered to be the master in the system, and computation is performed only in the processor (and accelerators). In contrast, data storage and communication units, including the main memory, are treated as unintelligent workers that are incapable of computation. As a result of this processor-centric design paradigm, data moves a lot in the system between the computation units and the communication/storage units so that computation can be done on it. With the increasingly data-centric nature of contemporary and emerging applications, the processor-centric design paradigm leads to great inefficiency in performance, energy and cost.
For example, most of the real estate within a single compute node is already dedicated to handling data movement and storage (e.g., large caches, memory controllers, interconnects, and main memory) [57, 58, 59, 60, 61], and our recent work shows that more than 62% of the entire system energy of a mobile device is spent on data movement between the processor and the memory hierarchy for widely-used mobile workloads [62].

The large overhead of data movement in modern systems, along with technology advances that enable better integration of memory and logic, have recently prompted the re-examination of an old idea that we will generally call Processing in Memory (PIM). The key idea is to place computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, in the memory controllers, or inside large caches), so that data movement between where the computation is done and where the data is stored is reduced or eliminated, compared to contemporary processor-centric systems. Processing-in-memory, also known as near-data processing (NDP), enables the ability to perform operations and execute software tasks either using (1) the memory itself, or (2) some form of processing logic (e.g., accelerators, simple cores, reconfigurable logic) inside the memory subsystem.

The idea of PIM has been around for at least five decades [63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]. However, past efforts were not widely adopted for various reasons, including 1) the difficulty of integrating processing elements with DRAM, 2) the lack of the critical memory-related scaling challenges that current technology and applications face today, and 3) the fact that the data movement bottleneck was not as critical to system cost, energy and performance as it is today. As a result of advances in modern memory architectures, e.g., the integration of logic and memory in a 3D-stacked manner, various recent works explore a range of PIM architectures for multiple different purposes (e.g., [7, 13, 50, 51, 52, 53, 62, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123]). We believe it is crucial to re-examine PIM today with a fresh perspective (i.e., with novel approaches and ideas), by exploiting new memory technologies, with realistic workloads and systems, and with a mindset to ease adoption and feasibility.

In this chapter, we explore two new approaches to enabling processing-in-memory in modern systems. The first approach minimally changes memory chips to perform simple yet powerful common operations that the chip is inherently efficient at, or could be made efficient at, performing [11, 40, 97, 104, 105, 106, 107, 108, 109, 110, 111, 112, 116, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144]. We call this approach processing using memory [11, 120, 124, 145].
Some solutions that fall under this approach take advantage of the existing DRAM design to cleverly and efficiently perform bulk operations (i.e., operations on an entire row of DRAM cells), such as bulk copy, data initialization, and bitwise operations, using the analog operational principles of DRAM [108, 109, 111, 112, 120, 124, 125, 145]. Other solutions take advantage of the analog operational principles of emerging non-volatile memory technologies to perform similar bulk operations [104] or other specialized computations like convolutions and matrix multiplications [107, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 146].

The second approach enables PIM in a more general-purpose manner by taking advantage of computation capability in conventional memory controllers [50, 51] or in the logic layer(s) of the relatively new 3D-stacked memory technologies [7, 13, 52, 53, 62, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 98, 99, 100, 101, 103, 113, 114, 115, 117, 118, 119, 147]. We call this general approach processing near memory [11]. This approach is especially catalyzed by recent advancements in 3D-stacked memory technologies that include a logic processing layer underneath the memory layers [148, 149]. In order to stack multiple layers of memory, 3D-stacked chips use vertical through-silicon vias (TSVs) to connect the layers to each other, and to the I/O drivers of the chip [149]. The TSVs provide much greater internal bandwidth within the 3D stack layers than is available externally on the memory channel. Several such 3D-stacked memory architectures, such as the Hybrid Memory Cube [150, 151] and High-Bandwidth Memory [149, 152], include a logic layer, where designers can add some processing logic (e.g., accelerators, simple cores, reconfigurable logic) to take advantage of this high internal bandwidth. Future die-stacking technologies, like monolithic 3D [153, 154, 155, 156, 157, 158, 159], can amplify the benefits of this approach by greatly improving internal bandwidth and the number of logic layers between memory layers.

Regardless of the approach taken to PIM, there are key practical adoption challenges that system architects and programmers must address to enable the widespread adoption of PIM across the computing landscape and in different domains of workloads. In addition to describing work along the two key approaches, we also discuss these challenges in this chapter, along with existing work that addresses them.

Before we describe in detail the two modern approaches to PIM in Section 5, we first describe major trends affecting main memory (Section 2), then demonstrate the many reasons why we need to have intelligent memory controllers to enhance memory scaling into the future (Section 3), followed by an analysis of the major shortcomings of the processor-centric computing paradigm, which PIM intends to augment, disrupt, and perhaps in some cases displace (Section 4).
2. Major Trends Affecting Main Memory

Main memory is a major, critical component of all computing systems, including cloud and server platforms, desktop computers, mobile and embedded devices, and sensors. It constitutes one of the main pillars of any computing platform, together with 1) the processing elements (or computational elements), which can include CPU cores, GPU cores, accelerators, or reconfigurable devices, and 2) the communication elements, which can include interconnects, network interfaces, and network processing units.

Due to its relatively low cost and low latency, Dynamic Random Access Memory (DRAM) [160] is the predominant data storage technology that is used to build main memory. The growing data working set sizes of modern applications [1, 2, 3, 4, 5, 6, 7, 161, 162, 163, 164, 165] impose an ever-increasing demand for higher DRAM capacity and performance. Unfortunately, DRAM technology scaling is becoming increasingly challenging: it is increasingly difficult to enlarge DRAM chip capacity at low cost while also maintaining DRAM performance, energy efficiency, and reliability [1, 2, 20, 24, 43, 45, 46, 166, 167]. Thus, fulfilling the increasing memory needs of modern workloads is becoming increasingly costly and difficult [2, 3, 4, 14, 17, 18, 19, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 42, 44, 45, 46, 47, 49, 52, 53, 86, 121, 168, 169, 170, 171, 172].

If CMOS technology scaling is coming to an end [173], the projections are significantly worse for DRAM technology scaling [174]. DRAM technology scaling affects all major characteristics of DRAM, including capacity, bandwidth, latency, reliability, energy, and cost. We next describe the key issues and trends in DRAM technology scaling and discuss how these trends motivate the need for intelligent memory controllers, i.e., controllers that have intelligence and computation capability to enable better scaling of main memory in terms of all metrics of interest. Such intelligent memory controllers can also more easily pave the way to, and be used as a starting substrate for, processing in memory.

The first key concern is the difficulty of scaling DRAM capacity (i.e., density, or cost per bit), bandwidth and latency at the same time. While the processing core count doubles every two years, the DRAM capacity doubles only every three years, as shown by [29], and the latter is slowing down. This trend causes the memory capacity per core to drop by approximately 30% every two years [29]. The trend is even worse for memory bandwidth per core: in the approximately two decades between 1999 and 2017, DRAM chip storage capacity (for the most commonly-used DDRx chip of the time) improved approximately 128×, while DRAM bandwidth improved only approximately 20× [31, 32, 40], as shown in Figure 1. In the same period of about two decades, DRAM latency (as measured by the row cycling time) has remained almost constant (i.e., it was reduced by only 30%, as shown in Figure 1), making it a significant performance bottleneck for many modern workloads, including in-memory databases [62, 97, 112, 175, 176, 177, 178, 179], graph processing [15, 52, 53, 180, 181], data analytics [177, 182, 183, 184], datacenter workloads [4], neural networks [7, 14, 93, 185, 186, 187, 188], and consumer workloads [7].
As low-latency computing is becoming ever more important [1, 2, 3, 12, 13, 189, 190, 191, 192], e.g., due to the ever-increasing need to process large amounts of data in real time, and predictable performance continues to be a critical concern in the design of modern computing systems [2, 25, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202], it is increasingly critical to design low-latency main memory chips.

The second key concern is that DRAM technology scaling to smaller nodes adversely affects DRAM reliability. A DRAM cell stores one bit of data in the form of charge in a capacitor, which is accessed via an access transistor and peripheral circuitry. For a DRAM cell to operate correctly, both the capacitor and the access transistor (as well as the peripheral circuitry) need to operate reliably. As the size of the DRAM cell reduces, both the capacitor and the access transistor become less reliable, more leaky, and generally more vulnerable to electrical noise and disturbance. As a result, reducing the size of the DRAM cell increases the difficulty of correctly storing and detecting the desired original value in the DRAM, as shown in various recent works that study DRAM reliability by analyzing data retention and other reliability issues of modern DRAM chips [1, 20, 23, 24, 38, 41, 42, 45, 46, 166, 167, 170, 171, 206, 207, 208, 209, 210, 211]. Hence, memory technology scaling causes memory errors to appear more frequently.
Figure 1: Scaling of DRAM capacity, bandwidth and latency between 1999 and 2017, normalized to the value in 2017. Data depicted for the most common type of DDRx chip of each year. Reproduced from [203]. Originally presented in [31, 204, 205].

For example, a study of Facebook's entire production datacenter servers showed that memory errors, and thus the server failure rate, are strongly positively correlated with the density of the chips employed in the servers [212]: the higher the density of the chip used in a server, the more likely the server is to experience a memory error and server failure. Thus, it is critical to make the main memory system more reliable in order to build reliable computing systems on top of it.

The third key issue is that the reliability problems caused by aggressive DRAM technology scaling can lead to new security vulnerabilities. The RowHammer phenomenon [20, 24, 45, 46] shows that it is possible to predictably induce errors (bit flips) in most modern DRAM chips. Repeatedly reading the same row in DRAM can corrupt data in physically-adjacent rows. Specifically, when a DRAM row is opened (i.e., activated) and closed (i.e., precharged) repeatedly (i.e., hammered) enough times within a DRAM refresh interval, one or more bits in physically-adjacent DRAM rows can be flipped to the wrong value. A very simple user-level program [213] can reliably and consistently induce RowHammer errors in vulnerable DRAM modules. The seminal paper that introduced RowHammer [20] showed that more than 85% of the chips tested, built by three major vendors between 2010 and 2014, were vulnerable to RowHammer-induced errors.
In particular, all DRAM modules from 2012 and 2013 were vulnerable, as shown in Figure 2, which depicts the observed RowHammer error vulnerability of DRAM modules manufactured between 2008 and 2014 by all three major DRAM manufacturers A, B, and C [20]. A recent technology scaling study [45] of 1580 DRAM chips, belonging to three different DRAM types and various different technology node sizes, experimentally demonstrated that the RowHammer vulnerability is getting much worse at the circuit level: fewer activations to a row can cause bit flips in the most recent chips, and recent chips experience higher bit flip rates due to RowHammer. The same work [45] also showed that existing RowHammer mitigation mechanisms will not be effective in future DRAM chips, which will be much more vulnerable to RowHammer, and thus RowHammer remains an open vulnerability that is difficult to securely protect against.
Figure 2: RowHammer vulnerability for DRAM modules manufactured between 2008 and 2014. Reproduced from [214]. Originally presented in [20, 215].
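To make the hammering access pattern concrete, below is a minimal sketch (in C++, with x86 intrinsics) in the spirit of the simple user-level program [213] cited above. The function name and loop structure are illustrative; finding two addresses x and y that map to different rows of the same DRAM bank is system-specific and not shown.

    #include <emmintrin.h>  // _mm_clflush (x86 SSE2)

    // Repeatedly activate two DRAM rows. The cache-line flushes force
    // every read to go to DRAM, so each iteration re-activates the rows.
    void hammer(volatile char *x, volatile char *y, long iterations) {
        for (long i = 0; i < iterations; ++i) {
            (void)*x;                      // activate the row containing x
            (void)*y;                      // activate the row containing y
            _mm_clflush((const void *)x);  // evict x so the next read hits DRAM
            _mm_clflush((const void *)y);  // evict y so the next read hits DRAM
        }
    }

If enough such activations occur within a single refresh interval, bits in rows physically adjacent to the rows of x and y may flip on vulnerable modules.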
The RowHammer phenomenon entails a real reliability and, perhaps even more importantly, a real and prevalent security issue. It breaks physical memory isolation between two addresses, one of the fundamental building blocks of memory, on top of which system security principles are built. With RowHammer, accesses to one row (e.g., an application page) can modify data stored in another memory row (e.g., an OS page). This was confirmed in 2015 by researchers from Google Project Zero, who developed a user-level attack that uses RowHammer to gain kernel privileges [216, 217]. Other researchers have shown how RowHammer vulnerabilities can be exploited in various ways to gain privileged access to various systems: RowHammer can be used to remotely take over a server via JavaScript [218]; a virtual machine can take over another virtual machine by inducing errors in the victim virtual machine's memory space [219]; a malicious application without permissions can take control of an Android mobile device [220]; and an attacker can gain arbitrary read/write access in a web browser on a Microsoft Windows 10 system [221]. Over the past six years, many security attacks were developed to exploit RowHammer [216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239]. Very recently, the TRRespass attack [167] showed that existing DRAM chips that are advertised to be RowHammer-resistant, as described by various DRAM vendors [240, 241], are actually vulnerable, because these mitigation mechanisms can be circumvented with a new type of RowHammer attack called many-sided hammering. For a more detailed treatment of the RowHammer problem and its consequences, as well as its root causes, modeling, and analyses, we refer the reader to [20, 24, 45, 46, 167, 211, 242].

The fourth key issue is the power and energy consumption of main memory. DRAM is inherently a power and energy hog: due to its charge-based nature, it consumes energy even when it is not used (e.g., it requires periodic memory refresh [23]). Moreover, the energy consumption of main memory is becoming worse for three major reasons. First, main memory's capacity, bandwidth, parallelism, and complexity are all increasing, causing energy consumption to naturally increase due to a higher amount of dynamic activity and higher overall static power consumption. Second, main memory has remained off the main processing chip, and thus did not benefit from the many energy reduction mechanisms that come with better integration, even though many other platform components have been integrated into the processing chip and have benefited from aggressive energy/voltage scaling mechanisms and the low-energy communication substrate on-chip. Third, the difficulties in DRAM technology scaling are making DRAM energy reduction very difficult across technology generations. In fact, some of the mechanisms that are added to DRAM chips to compensate for reliability problems in smaller technology generations, e.g., in-DRAM error correcting codes [17, 41, 170, 171, 243, 244, 245] and higher refresh rates [171, 246, 247, 248], directly increase energy consumption. As a result of these three major issues that make main memory a larger energy bottleneck, the fraction of the entire system power consumed by main memory has been increasing over the last two decades.
In 2003, Lefurgy et al. [249] showed that, in large commercial servers designed by IBM, the off-chip memory hierarchy (including, at that time, DRAM, interconnects, memory controller, and off-chip caches) consumed between 40% and 50% of the total system energy. The trend has become even worse over the past one to two decades: in recent computing systems with CPUs or GPUs, DRAM by itself is shown to account for more than 40% of the total system power [34, 43, 250, 251]. Hence, the power and energy consumption of main memory is increasing relative to that of other components in computing platforms. As energy efficiency and sustainability are critical necessities in computing platforms today, it is critical to reduce the energy and power consumption of main memory [34, 43, 49, 252, 253, 254, 255].
3. The Need for Intelligent Memory Controllers to Enhance Memory Scaling
A key promising approach to solving the four major issues above is to design intelligent memory controllers that can manage main memory better. If the memory controller is designed to be more intelligent and more programmable, it can, for example, incorporate flexible mechanisms to overcome various types of reliability issues (including RowHammer), manage latencies and energy/power consumption better based on a deep understanding of the DRAM chip and application characteristics, provide enough support for programmability to prevent security and reliability vulnerabilities that are discovered in the field, and manage various types of memory technologies that are put together as a hybrid main memory to enhance the scaling of the main memory system. We believe having intelligent memory controllers can greatly alleviate the scaling issues encountered with main memory today, as we have described in an earlier position paper [1]. This is a direction that is also supported by key hardware manufacturers in the computing industry today, as described in an informative paper written collaboratively by Intel and Samsung engineers on DRAM technology scaling issues [17].

In this section, we give several examples of how an intelligent memory controller can help overcome the circuit- and device-level issues that modern computing systems are facing at the main memory level. First, a slightly more intelligent memory controller than today's controllers can prevent the RowHammer vulnerability by probabilistically refreshing rows that are physically adjacent to an activated row, with a very low probability. This solution, called PARA (Probabilistic Adjacent Row Activation) [20], was shown to provide strong, programmable, robust guarantees against RowHammer, with very little power, performance and chip area overhead [20]. It requires a slightly more intelligent memory controller that 1) knows (or can figure out) the physical adjacency of rows in a DRAM chip, 2) is programmable enough to adjust the probability of adjacent row activation depending on the vulnerability of a chip, and 3) can issue refresh requests to physically-adjacent rows according to the probability supplied by the system or discovered online. As described by prior work [20, 24, 45, 46, 242], this solution has much lower performance and energy overheads than increasing the refresh rate across the board for the entire main memory, which is the RowHammer solution employed by existing systems in the field that have simple and rigid memory controllers [46, 248].
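The following is a minimal sketch of the PARA idea. The controller hooks rows_adjacent_to() and schedule_refresh() are hypothetical stubs assumed for illustration (neither is a real interface), and the probability value is illustrative; in a real design it would be tuned to the vulnerability of the specific chip.

    #include <random>
    #include <vector>

    // Hypothetical controller hooks, stubbed for illustration.
    static std::vector<int> rows_adjacent_to(int row) { return {row - 1, row + 1}; }
    static void schedule_refresh(int /*row*/) { /* enqueue a refresh (activation) */ }

    // PARA [20]: on each row activation, refresh the physically-adjacent
    // rows with a small programmable probability p.
    struct Para {
        double p = 0.001;  // illustrative; set based on chip vulnerability
        std::mt19937 rng{12345};
        std::uniform_real_distribution<double> coin{0.0, 1.0};

        void on_activate(int row) {
            if (coin(rng) < p)
                for (int neighbor : rows_adjacent_to(row))
                    schedule_refresh(neighbor);
        }
    };

Because the check is probabilistic and rarely fires, its performance and energy cost is tiny, yet a row would have to be hammered an improbably long time without any of its neighbors being refreshed for a bit flip to occur.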
Second, an intelligent memory controller can greatly alleviate the refresh problem in DRAM, and hence its negative consequences on energy, performance, predictability, and technology scaling, by understanding the retention time characteristics of different DRAM rows well. It is well known that the retention times of different cells in DRAM are widely different due to manufacturing process variation [23, 166]. A very large fraction of all DRAM cells are strong (i.e., they can retain data for hundreds of seconds), whereas only a small fraction of DRAM cells are weak (i.e., they can retain data for only 64 ms), as demonstrated in Figure 3 [23, 166]. Yet, today's memory controllers treat every cell as equal and refresh all rows every 64 ms, which is the worst-case retention time that is allowed. This worst-case refresh rate leads to a large number of unnecessary refreshes, and thus great energy waste and performance loss. Refresh is also shown to be the key technology scaling limiter of DRAM [17], and as such, refreshing all DRAM cells at the worst-case rate is likely to make DRAM technology scaling difficult. An intelligent memory controller can overcome the refresh problem by 1) identifying the minimum data retention time of each row (during online operation), 2) refreshing each row at the rate it really requires to be refreshed at, or 3) decommissioning weak rows such that data is not stored in them. As shown by a recent body of work whose aim is to design such an intelligent memory controller that can perform online profiling of DRAM cell retention times and online adjustment of refresh rate on a per-row basis [23, 41, 166, 206, 207, 208, 209, 210], including the works on RAIDR [23, 166], AVATAR [210] and REAPER [41], such an intelligent memory controller can eliminate more than 75% of all refreshes at very low cost, leading to significant energy reduction, performance improvement, and quality of service benefits, all at the same time, at the system level. Thus, the downsides of DRAM refresh can potentially be overcome with the design of intelligent memory controllers.
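As a concrete illustration, here is a minimal sketch of the idea behind retention-aware refresh. It assumes rows have already been profiled into a small number of retention-time bins; the names and the per-row table are illustrative (RAIDR [23, 166], for instance, stores bins compactly using Bloom filters rather than a per-row table).

    #include <cstdint>
    #include <vector>

    // Refresh each row at the rate its retention bin requires, instead of
    // refreshing every row at the 64 ms worst case.
    void refresh_tick(uint64_t now_ms,
                      const std::vector<int> &row_bin,            // per-row bin from profiling
                      const std::vector<uint64_t> &bin_period_ms, // e.g., {64, 128, 256}
                      std::vector<uint64_t> &last_refresh_ms) {
        for (size_t row = 0; row < row_bin.size(); ++row) {
            uint64_t period = bin_period_ms[row_bin[row]];
            if (now_ms - last_refresh_ms[row] >= period) {
                // issue_refresh(row);  // controller command, not modeled here
                last_refresh_ms[row] = now_ms;
            }
        }
    }

Since the vast majority of rows fall into the longer-period bins, most worst-case refreshes are skipped.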
Figure 3: Data retention times of different DRAM cells, represented as a cartoon based on experimental data obtained from real DRAM chips [256]. Retention time is location dependent, stored-value-pattern dependent, and time dependent. Reproduced from [214]. Originally presented in [257, 258].

Third, an intelligent memory controller can enable performance improvements that can overcome the limitations of memory scaling. As we discuss in Section 2, DRAM latency has remained almost constant over the last two decades, despite the fact that low-latency computing has become ever more important during that time. Similar to how intelligent memory controllers handle the refresh problem, the controllers can exploit the fact that not all cells in DRAM need the same amount of time to be accessed. Modern DRAM specifications require worst-case timing parameters that define the amount of time required to perform a memory access. In order to guarantee correct operation, the timing parameters are chosen to ensure that the worst-case cell in any DRAM chip that is acceptable (to satisfy a yield rate) can still be accessed correctly at worst-case operating temperatures [31, 33, 35, 42, 44, 47, 169]. However, access latency to cells is in fact very heterogeneous due to variation in operating conditions (e.g., across different temperatures and operating voltage levels), manufacturing process (e.g., across different chips and different parts of a chip), and access patterns (e.g., based on whether or not the cell was recently accessed). We give eight examples of how an intelligent memory controller can exploit the various different types of heterogeneity in access latency; a short sketch illustrating the first two examples follows the list.

(1) At low temperature, DRAM cells contain more charge and, as a result, can be accessed much faster than at high temperatures. We find that, averaged across 115 real DRAM modules from three major manufacturers, read and write latencies of DRAM can be reduced by 33% and 55%, respectively, when operating at relatively low temperature (55°C) compared to operating at worst-case temperature (85°C) [33, 259]. Thus, a slightly intelligent memory controller can greatly reduce memory latency by adapting the access latency to the operating temperature.

(2) Due to manufacturing process variation, the majority of cells in DRAM (across different chips or within the same chip) can be accessed much faster than the manufacturer-provided timing parameters require [31, 33, 35, 40, 44, 259]. An intelligent memory controller can profile the DRAM chip, identify which cells can be accessed reliably at low latency, and use this information to reduce access latencies by as much as 57% [31, 35, 44].

(3) In a similar fashion, an intelligent memory controller can use similar properties of manufacturing process variation to reduce the energy consumption of a computer system, by exploiting the minimum voltage required for safe operation of different parts of a DRAM chip [34, 40]. The key idea is to reduce the operating voltage of a DRAM chip from the standard specification and tolerate the resulting errors by increasing access latency on a per-bank basis, while keeping performance degradation in check.

(4) Bank conflict latencies can be dramatically reduced by making modifications in the DRAM chip such that different subarrays in a bank can be accessed mostly independently, and by designing an intelligent memory controller that can take advantage of requests that require data from different subarrays (i.e., exploit subarray-level parallelism) [21, 22].
A similar approach is also shown to reduce the performance impact of refresh, by enabling parallelization of refresh and access operations to a bank [260].

(5) Access latency to a portion of a DRAM bank can be greatly reduced by partitioning the DRAM array such that a subset of rows can be accessed much faster than the other rows, and by having an intelligent memory controller that decides what data should be placed in fast rows versus slow rows [32, 42, 110, 121, 169, 259].

(6) A recently-accessed or recently-refreshed memory row can be accessed much more quickly than the standard latency if it needs to be accessed again soon, since the recent access and refresh to the row has replenished the charge of the cells in the row. An intelligent memory controller can thus keep track of the charge level of recently-accessed/refreshed rows and use the appropriate access latency that corresponds to the stored charge level [39, 47, 168], leading to significant reductions in both access and refresh latencies. Thus, the poor scaling of DRAM latency and energy can potentially be overcome with the design of intelligent memory controllers that can facilitate a large number of effective latency and energy reduction techniques.

(7) Two recent works [172, 261] observe that the latency-reliability tradeoff in modern DRAM devices can be exploited by an intelligent memory controller to 1) generate true random numbers at low latency and high throughput [172], and 2) evaluate physical unclonable functions quickly using a DRAM device [261]. These works exploit the heterogeneity in the latency-reliability tradeoff of different cells: some cells fail truly randomly and some cells fail very consistently when accessed with a low latency that violates the timing parameters. The former type of cell can be used as a true random number generator cell, and the latter type of cell can be used as part of the challenge-response space of a DRAM-based physical unclonable function (PUF). An intelligent controller would determine the different types of cells using profiling mechanisms and enable the generation of true random numbers or PUF responses.

(8) An intelligent controller can use application and data characteristics to carefully map data across hybrid memories that consist of multiple different types of memories with different characteristics, to maximize the benefits obtained from each memory type while avoiding the downsides of each. Figure 4 depicts an example of such a hybrid main memory composed of DRAM and PCM memories, as described by several works [27, 262, 263, 264].
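The sketch below illustrates examples (1) and (2) above: an intelligent controller selecting access timings based on the operating temperature and on profiling results. All numeric values are illustrative placeholders; real values are chip-specific and must come from characterization such as [31, 33, 35, 44, 259].

    // Pick DRAM timing parameters for a memory region, given the current
    // temperature and whether profiling found the region reliably fast.
    struct Timings { int tRCD_cycles; int tRP_cycles; };

    Timings select_timings(int temperature_c, bool region_profiled_fast) {
        Timings t{14, 14};            // conservative datasheet values (illustrative)
        if (region_profiled_fast) {   // process variation: most cells are faster
            t.tRCD_cycles -= 4;
            t.tRP_cycles  -= 4;
        }
        if (temperature_c <= 55) {    // cool cells hold more charge, allowing faster access
            t.tRCD_cycles -= 2;
        }
        return t;
    }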
Figure 4: Hybrid main memory. Reproduced from [203]. Originally presented in [27].

Many proposals exist for such intelligent controllers that manage hybrid memories, e.g., [27, 36, 262, 263, 264, 265, 266, 267, 268], indicating that such an intelligent controller can enhance memory scaling by enabling the best of multiple technologies.
For example, the idea of Heterogeneous Reliability Memory [36] uses an intelligent memory controller that can communicate with both applications and memory devices to map each data element to different types of memories depending on the error vulnerability characteristics of the data element, thereby reducing memory cost. Similarly, EDEN [14] uses a memory controller that can communicate with a neural network application to map different neural network layers to different DRAM partitions with different access latency and voltage parameters, depending on the error tolerance characteristics of each layer, thereby greatly improving the energy efficiency and performance of neural network inference tasks. With increasing reliance on hybrid memories, as well as increasing heterogeneity within each memory type to solve key memory scaling issues, it has become necessary to have intelligent controllers to manage data allocation, migration, and movement across the different heterogeneous parts.
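A minimal sketch of such error-tolerance-aware placement, in the spirit of Heterogeneous Reliability Memory [36] and EDEN [14], is shown below. The Region choices and the thresholds are illustrative assumptions, not an interface proposed by those works.

    // Map a data element (e.g., a neural network layer or an application
    // page) to a memory region based on its error tolerance and hotness.
    enum class Region { ReliableDram, LowVoltageDram, DensePcm };

    Region place(double error_tolerance,    // 0 = must be error-free, 1 = fully tolerant
                 double access_frequency) { // normalized to [0, 1]
        if (error_tolerance < 0.1)  return Region::ReliableDram;   // critical data
        if (access_frequency > 0.5) return Region::LowVoltageDram; // hot but tolerant
        return Region::DensePcm;                                   // cold and tolerant
    }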
Intelligent controllers are already in widespread use in another key part of a modern computing system. In solid-state drives (SSDs) consisting of NAND flash memory, the flash memory controllers that manage the SSDs are designed to incorporate a significant level of intelligence in order to improve both performance and reliability [36, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289]. Figure 5 shows one of our real experimental infrastructures (from [277]) used for the design and evaluation of intelligent flash memory controllers.
Figure 5: Example of an intelligent flash memory controller. The figure depicts one of our real experimental infrastructures (from [277]) used for the design and evaluation of intelligent flash memory controllers: a HAPS-52 motherboard hosting a Virtex-V FPGA-based NAND controller, a Virtex-II Pro-based USB controller daughter board, and a 1x-nm NAND flash daughter board. Reproduced from [203].
Modern flash controllers need to take into account a wide variety of issues, such as remapping data, performing wear leveling to mitigate the limited lifetime of NAND flash memory devices, refreshing data based on the current wearout of each NAND flash cell, optimizing voltage levels to maximize memory lifetime, employing sophisticated error correction and recovery techniques to maximize lifetime and minimize error rates, and enforcing fairness across different applications accessing the SSD. Much of the complexity in flash controllers is a result of mitigating issues related to the scaling of NAND flash memory [36, 269, 270, 271, 272, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289]. A comprehensive review of the scaling issues of NAND flash memory and related mitigation techniques can be found in [275] (Figure 6) and [269, 270]. We argue that, in order to overcome scaling issues in main memory (DRAM), the time has come for main memory controllers to also incorporate significant intelligence. Yet, incorporating sophisticated intelligence in the DRAM controller is more challenging than doing so in a flash controller, due to the much lower access latency and much higher access bandwidth of modern DRAM devices.
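As a flavor of this intelligence, the sketch below shows one of the simplest duties listed above, wear leveling: directing each new write to the least-worn erase block. This is a toy illustration only; real controllers combine wear leveling with remapping tables, per-cell refresh, voltage tuning, and strong ECC [275].

    #include <algorithm>
    #include <vector>

    // Return the index of the erase block with the fewest erase cycles.
    int pick_block_for_write(const std::vector<int> &erase_counts) {
        return static_cast<int>(std::min_element(erase_counts.begin(),
                                                 erase_counts.end())
                                - erase_counts.begin());
    }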
Figure 6: A comprehensive review article on scaling issues of NAND flash memory and related mitigation techniques [275]. Reproduced from [203].
As we describe above, introducing intelligence into the memory controller can help us overcome a number of key challenges in memory scaling. In particular, a significant body of work has demonstrated that the key reliability, refresh, and latency/energy issues in memory can be mitigated effectively with an intelligent memory controller that intelligently and meticulously manages the many different characteristics of the underlying memory chips, which may consist of different types of memory technology. As we discuss in Section 4, such intelligence can go even further, by enabling the memory controllers (and the broader memory system) to perform application computation in order to overcome the significant data movement bottleneck in modern and future computing systems.
4. Perils of Processor-Centric Design
As described earlier, a major reason for performance and energy degradation in modern computing systems is the large amount of data movement present in these systems. Such data movement is a natural consequence of the processor-centric execution model and design paradigm [290], which creates a dichotomy between computation and memory/storage. The processor-centric design paradigm separates computation capability and memory/storage capability into two completely-separate system components (i.e., the computing unit versus the memory/storage unit) that are connected by long and energy-hungry interconnects: processing is done only in the computing unit, while data is stored in a completely separate place. As a result, data has to continuously move back and forth between the memory/storage unit and the computing unit (e.g., CPU cores or accelerators) for any computation to be performed.

In order to perform an operation on data that is stored within memory, a costly process is invoked. First, the CPU (or an accelerator) must issue a request to the memory controller, which in turn sends a series of commands across the off-chip bus to the DRAM module. Second, the data is read from the DRAM module and returned to the memory controller. Third, the data is placed in the CPU cache and registers, where it is accessible by the CPU cores. Finally, the CPU can operate (i.e., perform computation) on the data. All these steps consume substantial time and energy in order to bring the data into the CPU chip [4, 7, 291, 292].

In current computing systems, the CPU (or any accelerator) is the only system component that is able to perform computation on data. The rest of the system components are devoted only to data storage (memory, caches, disks) and data movement (interconnects); they are incapable of performing computation. As a result, current computing systems are grossly imbalanced, which leads to large amounts of energy inefficiency and lost performance. As empirical evidence of the gross imbalance caused by the processor-memory dichotomy in the design of computing systems today, we have recently observed that more than 62% of the entire system energy is spent on data movement when running four major commonly-used mobile consumer workloads (including the Chrome browser, the TensorFlow machine learning inference engine, and the VP9 video encoder and decoder) [7]. Thus, the fact that current systems can perform computation only in the computing unit (CPU cores and hardware accelerators) is causing significant waste in terms of energy, by necessitating data movement across the entire system.

At least five factors contribute to the performance loss and the energy waste associated with data movement between processor and memory. We briefly describe these next, to demonstrate the sweeping negative impact of data movement in modern computing systems.

First, the width of the off-chip bus between the memory controller and the main memory is narrow, due to pin count and cost constraints, leading to relatively low bandwidth and high latency to/from main memory.
This makes it difficult to send a large number of requests to memory in parallel, both to enable higher levels of parallelism and to tolerate the long main memory latency. As a result, systems that require higher levels of concurrency and lower latency require much higher cost, because they need wider processor-memory interconnects or more processor-memory channels, both of which lead to higher power consumption and higher hardware area overheads [2, 43, 49].

Second, current computing systems employ many sophisticated mechanisms to tolerate the long latency of data accesses to main memory. These mechanisms include complex multi-level cache hierarchies with sophisticated insertion/promotion/eviction policies and sophisticated latency tolerance/hiding mechanisms (e.g., sophisticated caching algorithms at many different caching levels, multiple complex prefetching techniques, high amounts of multithreading, complex and power-hungry out-of-order execution mechanisms). These components, while sometimes effective at improving performance, are costly in terms of both die area and energy consumption, as well as the additional latency required to access and manage them. When these components are not effective at improving performance, they result in a net energy waste and latency overheads that hurt the very performance they are designed to improve. These components also significantly increase the complexity of the system. Hence, the architectural and microarchitectural techniques used in modern systems to tolerate the consequences of the dichotomy between the processing unit and main memory lead to significant energy waste and additional system complexity. As such, we are in a vicious cycle in system design due to the processor-centric design paradigm: 1) data movement between the processor and memory already causes significant energy waste and latency; 2) to tolerate the latency of such data movement, existing systems employ many complex mechanisms whose effectiveness varies depending on the workload; and 3) these complex mechanisms in turn cause additional energy waste and latency overheads. The fundamental cause of this vicious cycle is the processor-centric execution model and design paradigm, and hence breaking out of this vicious cycle requires tackling this fundamental cause by changing the paradigm (to a data-centric one).

Third, the many caches employed in computing systems are not always effective or efficient. Much of the data brought into the caches is not reused by the CPU [52, 53, 55, 56, 293], resulting in a large waste of hardware area and memory bandwidth. For example, 1) random access to memory leads to poor locality, rendering caches almost completely ineffective; 2) strided access to memory where the stride is greater than a cache block also renders caches ineffective; and 3) even streaming access to memory, where all elements in a cache block are used in a consecutive manner, is inefficient to handle with large caches, because the block is not reused again. There are many such access patterns in a wide variety of modern workloads [49, 52, 53, 55, 56, 199, 293, 294, 295, 296] that render caches either very inefficient or unnecessary, exacerbating the energy waste due to data movement in processor-centric systems.

Fourth, many modern applications, such as graph processing [52, 53] and workloads that operate on sparse data structures, such as sparse linear algebra [15, 297] and sparse neural networks [298, 299, 300], produce random memory access patterns.
Figure 7 shows the example of PageRank [301], a graph processing algorithm with frequent random memory accesses and little computation. With such random access patterns, not only the caches but also the main memory bus and the main memory itself are very inefficient, since only a small part of each memory row and cache line retrieved all the way from main memory is actually used by the CPU. Such random accesses are fundamentally difficult to prefetch, rendering prefetchers ineffective. This example demonstrates that modern memory hierarchies are not designed to work well for the random memory access patterns that are found in many modern workloads.

    // Key bottlenecks in graph processing (PageRank kernel from Figure 7):
    // 1. frequent random memory accesses (to w), 2. little computation.
    for (v: graph.vertices) {
        for (w: v.successors) {
            w.next_rank += weight * v.rank;   // random access to w
        }
    }
Figure 7: Random memory accesses in the PageRank graph processing algorithm [301]. Reproduced from [214]. Originally depicted in [52, 302].
Fifth, the processor (as well as accelerators) and the main memory are connected to each other via long, power-hungry interconnects. These interconnects impose significant additional latency on every data access and represent a significant fraction of the energy spent on moving data to/from the DRAM memory. In fact, off-chip interconnect latency and energy consumption are a key limiter of performance and energy in modern systems [7, 25, 32, 52, 97, 108], as they greatly exacerbate the cost of data movement. Unfortunately, off-chip interconnect latency and energy are not scaling (i.e., reducing) well with the scaling of technology node generations, which mainly benefits logic [303].

The increasing disparity between processing technology and memory/communication technology has resulted in systems in which communication (data movement) costs dominate computation costs in terms of energy consumption. The energy consumption of a main memory access is between two and three orders of magnitude the energy consumption of an addition operation today. For example, [292] reports that the energy consumption of a memory access is orders of magnitude higher than that of an addition operation. Similarly, Figure 8 shows that a DRAM access consumes approximately 100-1000× the energy of a complex addition operation, based on data reported by [304]. As a result, data movement is empirically shown to account for 40% [291], 35% [292], and 62% [7] of the total system energy in scientific, mobile, and consumer applications, respectively. This energy waste due to data movement is a huge burden that greatly limits the efficiency and performance of all modern computing platforms, from datacenters with a restricted power budget to mobile devices with limited battery life.
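A back-of-the-envelope calculation shows what this ratio implies at the system level. Assuming a round 1000× energy ratio (consistent with the range above), a kernel that performs one main memory access for every ten additions spends

\[
\frac{E_{\mathrm{mem}}}{E_{\mathrm{mem}} + 10\,E_{\mathrm{add}}}
  = \frac{1000\,E_{\mathrm{add}}}{1000\,E_{\mathrm{add}} + 10\,E_{\mathrm{add}}}
  \approx 99\%
\]

of its energy on data movement rather than computation, even though, nominally, most of its work is arithmetic.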
Figure 8: Data movement versus computation energy. The figure depicts the absolute amount of energy spent on various arithmetic and data movement operations, including a double-precision floating point addition and a single DRAM access; a memory access consumes approximately 100-1000× the energy of a complex addition. Reproduced from [214], based on data reported in [304].

Overcoming this great energy inefficiency (as well as the high system design complexity) in current computing systems first requires the realization that all of these reasons are caused by the processor-centric design paradigm employed by existing computing systems. As such, a fundamental solution to all of these reasons at the same time requires a paradigm shift [305]. We believe that future computing architectures should become data-centric: they should (1) perform computation with minimal data movement, and (2) compute where it makes sense (i.e., where the data resides), as opposed to computing solely in the processor (i.e., CPU or accelerators). Thus, the traditional rigid dichotomy between the computing units and the memory/communication units needs to be broken, and a new paradigm enabling computation where the data resides needs to be invented and enabled. We refer to this general data-centric execution model and design paradigm as Processing-in-Memory (PIM).
5. Processing-in-Memory (PIM): Technology Enablers and Two Approaches
A large amount of data movement is a major result of the predominant processor-centric design paradigm of modern computers. Eliminating unnecessary data movement between memory and the processor is essential to make future computing architectures high-performance, energy-efficient and sustainable. To this end, processing-in-memory (PIM) equips the memory subsystem with the ability to perform computation.

In this section, we first describe two new technology enablers for PIM: 1) the emergence of 3D-stacked memories, and 2) the use of byte-addressable memories. These two relatively new main memory technologies provide new opportunities that can make it easier for modern computing systems to introduce and adopt PIM.

Second, we introduce two promising approaches to implementing PIM in modern architectures. The first approach, processing using memory, exploits the existing DRAM architecture and the operational principles of the DRAM circuitry to enable (bulk) processing operations within the main memory with minimal changes. This minimalist approach can be especially powerful in performing specialized computation in main memory by taking advantage of what the main memory substrate is extremely good at performing, with minimal changes to the existing memory chips. The second approach, processing near memory, exploits the ability to implement a wide variety of general-purpose processing logic in the logic layer of 3D-stacked memory, and thus the high internal bandwidth and low latency available between the logic layer and the memory layers of 3D-stacked memory. This is a more general approach, where the logic implemented in the logic layer can be general-purpose and thus can benefit a wide variety of applications.

Below, we provide a general overview of the two approaches, to show that they are broader than the specific designs we will describe in more detail. It is important for the reader to keep in mind that the two approaches can be applied to many different types of memory technologies, even though our major focus in most of this section will be on DRAM, the predominant main memory technology for several decades.

Memory manufacturers are actively developing new approaches for main memory system design, due to the DRAM technology scaling issues we described in detail in Section 2. Two promising technologies are 3D-stacked memory and byte-addressable non-volatile memory (NVM), both of which can be exploited to overcome prior barriers to introducing and implementing PIM architectures.
The first major new approach to main memory design is 3D-stacked memory [52, 149, 150, 151, 152, 306]. In a 3D-stacked memory, multiple layers of memory (typically DRAM in already-existing systems) are stacked on top of each other, as shown in Figure 9. These layers are connected together using vertical through-silicon vias (TSVs) [149, 306]. Using current manufacturing process technologies, thousands of TSVs can be placed within a single 3D-stacked memory chip. As such, the TSVs provide much greater internal memory bandwidth than the narrow memory channel. Examples of 3D-stacked DRAM available commercially include High-Bandwidth Memory (HBM) [149, 152], Wide I/O [307], Wide I/O 2 [308], and the Hybrid Memory Cube (HMC) [151]. Detailed analysis of such 3D-stacked memories and their effects on modern workloads can be found in [49, 148, 149].

In addition to the multiple layers of DRAM, a number of prominent 3D-stacked DRAM architectures, including HBM and HMC, incorporate a logic layer inside the chip [149, 151, 152]. The logic layer is typically the bottommost layer of the chip, and is connected to the same TSVs as the memory layers. The logic layer provides area inside the DRAM chip where architects can implement functionality that interacts with both the processor and the DRAM cells.
Figure 9: High-level overview of a 3D-stacked DRAM architecture. Reproduced from [309].

Currently, manufacturers make limited use of the logic layer, and there is a significant amount of area that the logic layer can provide. This presents a promising opportunity for architects to implement new PIM logic in the available area of the logic layer. We can potentially add a wide range of computational logic (e.g., general-purpose cores, accelerators, reconfigurable architectures, or a combination of all three types of logic) in the logic layer, as long as the added logic meets area, energy, and thermal dissipation constraints, which are important and potentially limiting constraints in 3D-stacked systems.
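To make the opportunity concrete, the sketch below shows how software might offload a bandwidth-bound reduction to simple cores in such a logic layer. The pim_launch()/pim_wait() runtime calls are hypothetical assumptions for illustration; no standard PNM API is implied by the text.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical PNM runtime interface (assumed, not a real API).
    void pim_launch(const char *kernel_name, const uint64_t *data,
                    size_t n, uint64_t *out);
    void pim_wait();

    // Sum a large array using logic-layer cores. The cores stream the
    // data from the DRAM layers stacked above them over TSVs, so the
    // array never crosses the narrow off-chip memory channel.
    uint64_t sum_in_memory(const uint64_t *data, size_t n) {
        uint64_t result = 0;
        pim_launch("sum_kernel", data, n, &result);
        pim_wait();
        return result;
    }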
The second major new approach to main memory design is the development of byte-addressable resistive nonvolatile memory (NVM). In order to circumvent DRAM scaling limitations, such as refresh due to charge loss, as much as possible, researchers and manufacturers have been developing new memory devices that can store data at much higher densities than the typical density available in existing DRAM manufacturing process technologies. Manufacturers are exploring at least three types of emerging NVMs to augment or replace DRAM: (1) phase-change memory (PCM) [26, 28, 264, 310, 311, 312, 313], (2) magnetic RAM (MRAM) [314, 315], and (3) metal-oxide resistive RAM (RRAM) or memristors [316, 317, 318]. All three of these NVM types are expected to provide memory access latencies and energy usage that are competitive with or close enough to DRAM, while enabling much larger capacities per chip and nonvolatility in main memory.

Since they are emerging, and their designs have not accumulated the long-term "baggage" that other main memories (DRAM) have, NVMs present architects with an opportunity to redesign how the main memory subsystem operates, from the cell and chip levels all the way up to software and algorithms. While it can be relatively difficult to modify the design of DRAM arrays due to the delicacy of DRAM manufacturing process technologies as we approach scaling limitations, NVMs have yet to approach such scaling limitations. As a result, architects can potentially design NVM memory arrays that integrate PIM functionality from the get-go. A promising direction for this functionality is the ability to manipulate NVM cells at the circuit level in order to perform logic operations using the memory cells themselves. A number of recent works have demonstrated that NVM cells can be used to perform a complete family of Boolean logic operations [104, 132, 133, 134, 135, 136], similar to such operations that can be performed in DRAM cells [109, 111, 112, 120, 124]. NVMs have also been shown to perform more sophisticated operations like multiplication [93, 107, 146], which are more difficult to implement in DRAM.

Many recent works take advantage of the memory technology innovations that we discuss in Section 5.1 to enable and implement PIM. We find that these works generally take one of two approaches, which are categorized in Table 1: (1) processing using memory or (2) processing near memory. We briefly describe each approach here. Sections 6 and 7 will provide example designs and more detail for both.
Table 1: Summary of enabling technologies for the two approaches to PIM used by recent works. Adapted from [309].

Approach                    Enabling Technologies
Processing Using Memory     SRAM; DRAM; Phase-change memory (PCM); Magnetic RAM (MRAM); Resistive RAM (RRAM)/memristors
Processing Near Memory      Logic layers in 3D-stacked memory; Silicon interposers; Logic in memory controllers

Processing using memory (PUM) exploits the existing memory architecture and the operational principles of the memory circuitry to enable operations within main memory with minimal changes. PUM makes use of the intrinsic properties and operational principles of the memory cells and cell arrays themselves, by inducing interactions between cells such that the cells and/or cell arrays can perform useful computation. Prior works show that processing using memory is possible using static RAM (SRAM) [105, 106, 130, 131], DRAM [108, 109, 110, 111, 112, 120, 124, 145, 319], PCM [104], MRAM [132, 133, 134], or RRAM/memristive [107, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144] devices. Processing-using-memory architectures enable a wide range of different functions, such as bulk (as well as finer-grained) data copy and initialization [106, 108, 110, 121, 123], bulk bitwise operations (e.g., a complete set of Boolean logic operations) [31, 46, 104, 106, 109, 111, 112, 116, 120, 122, 129, 132, 133, 134, 320], and simple arithmetic operations (e.g., addition, multiplication, implication) [105, 106, 107, 116, 128, 130, 131, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 319].

Processing near memory (PNM) involves adding or integrating PIM logic (e.g., accelerators, simple processing cores, reconfigurable logic) close to or inside the memory (e.g., [7, 13, 50, 51, 52, 53, 62, 81, 82, 83, 84, 86, 87, 88, 91, 92, 93, 94, 95, 96, 98, 99, 100, 102, 103, 113, 115, 117, 118, 119, 178, 179, 321, 322, 323, 324, 325, 326, 327]). Many of these works place PIM logic inside the logic layer of 3D-stacked memories or at the memory controller, but recent advances in silicon interposers (in-package wires that connect directly to the through-silicon vias in a 3D-stacked chip) [118, 147, 152] also allow separate logic chips to be placed in the same die package as a 3D-stacked memory while still taking advantage of the TSV bandwidth.

Note that more functionality can potentially be integrated into a memory chip using PNM than using PUM, but the two approaches can be combined to obtain even higher benefit from PIM. In Section 6, we provide a detailed overview of PUM within commodity DRAM technology. In Section 7, we provide a detailed overview of PNM within 3D-stacked DRAM technology. We note that the approaches and techniques described in Sections 6 and 7 are applicable to other types of technologies as well, with small modifications.
6. Processing Using Memory (PUM)
The PUM approach to processing-in-memory minimally modifies existing DRAM architectures to extend their functionality with computing capability. This approach takes advantage of the existing interconnects in, and the analog operational behavior of, conventional DRAM architectures (e.g., DDRx, LPDDRx, HBM), without the need for a dedicated logic layer or logic processing elements, and usually with very low overheads. Mechanisms that use this approach take advantage of the high internal bandwidth available within each DRAM cell array. There are a number of example PIM architectures that make use of the PUM approach [40, 108, 109, 110, 111, 112, 120, 124, 125]. In this section, we first focus on two such designs: RowClone, which enables in-DRAM bulk data movement operations [108], and Ambit, which enables in-DRAM bulk bitwise operations [109, 111, 112, 120]. Then, we describe a low-cost substrate that performs data reorganization for non-unit strided access patterns [97]. Finally, we discuss security primitives that the PUM approach enables directly in DRAM.
Two important classes of bandwidth-intensive memory operations are (1) bulk data copy, where a large quantity of data is copied from one location in physical memory to another; and (2) bulk data initialization, where a large quantity of data is initialized to a specific value. We refer to these two operations as bulk data movement operations. Prior research [4, 328, 329] has shown that operating systems and data center workloads spend a significant portion of their time performing bulk data movement operations. For example, a paper by Google shows that close to 5% of the execution time in Google's data center workloads is spent on executing only two data movement function calls, memset and memcopy. Therefore, accelerating these operations will likely improve system performance and energy efficiency.

We have developed a mechanism called RowClone [108], which takes advantage of the fact that bulk data movement operations do not require any computation on the part of the processor. RowClone exploits the internal organization and operation of DRAM to perform bulk data copy and initialization quickly and efficiently inside a DRAM chip. A DRAM chip contains multiple banks, where the banks are connected together and to external I/O circuitry by a shared internal bus. Each bank is divided into multiple subarrays [21, 108, 260]. Each subarray contains many rows of DRAM cells, where each column of DRAM cells is connected together across the multiple rows using bitlines.

RowClone consists of two mechanisms that take advantage of this existing DRAM structure. The first mechanism, Fast Parallel Mode, copies the data of a row inside a subarray to another row inside the same DRAM subarray by issuing back-to-back activate (i.e., row open) commands to the source and the destination row. Figure 10 illustrates the two steps of RowClone's Fast Parallel Mode. The first step activates source row A, which captures the entire row's data in the row buffer. The second step activates destination row B, which copies the contents of the row buffer into row B. Thus, two back-to-back activates in the same subarray copy source row A to destination row B, using the row buffer as a temporary buffer for row A's contents. The second mechanism, Pipelined Serial Mode, can transfer an arbitrary number of bytes from a row in one bank to another row in another bank using the shared internal bus among banks in a DRAM chip.

Figure 10: RowClone Fast Parallel Mode. Reproduced from [214].
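To make the two-step operation concrete, the following Python sketch models a subarray as rows of bytes plus a row buffer. It is a purely behavioral sketch under simplifying assumptions (no analog behavior, no timing); the Subarray class and its method names are illustrative, not RowClone's actual interface.

class Subarray:
    """Behavioral model of RowClone's Fast Parallel Mode (FPM)."""

    def __init__(self, num_rows, row_bytes=4096):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = bytearray(row_bytes)
        self.buffer_valid = False  # True while a row is latched (no PRECHARGE yet)

    def activate(self, r):
        if not self.buffer_valid:
            # Normal ACTIVATE: sense amplifiers latch row r into the row buffer.
            self.row_buffer[:] = self.rows[r]
            self.buffer_valid = True
        else:
            # Back-to-back ACTIVATE (no intervening PRECHARGE): the sense
            # amplifiers still hold the previous row's data, so they drive
            # that data into the newly opened row, overwriting it.
            self.rows[r][:] = self.row_buffer

    def precharge(self):
        self.buffer_valid = False

def rowclone_fpm(subarray, src, dst):
    """Copy row src to row dst with two consecutive ACTIVATE commands."""
    subarray.precharge()
    subarray.activate(src)  # Step 1: row src -> row buffer
    subarray.activate(dst)  # Step 2: row buffer -> row dst

The key point the sketch captures is that the copy happens entirely inside the subarray: no data crosses the memory channel, which is why the latency and energy savings reported below are possible.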
RowClone significantly reduces the raw latency and energy consumption of bulk data copy and initialization, leading to an 11.6x latency reduction and a 74.4x energy reduction for a 4KB bulk page copy (using the Fast Parallel Mode), at very low cost (only 0.01% DRAM chip area overhead) [108]. This reduction directly translates into improved performance and energy efficiency for systems running copy- or initialization-intensive workloads. Our MICRO 2013 paper [108] shows that the performance of six copy/initialization-intensive benchmarks (including the fork system call, Memcached [330], and a MySQL [331] database) improves between 4% and 66%. For the same six benchmarks, RowClone reduces energy consumption between 15% and 69%.

Recent works have improved upon the RowClone approach in various ways. First, Low-cost Interlinked Sub-Arrays (LISA) [110] provides mechanisms to enable the rapid transfer of data between adjacent subarrays in the same bank, by enhancing the connectivity of subarrays using isolation transistors. LISA reduces inter-subarray copy latency by 9.2x and DRAM energy by 48x, approaching the intra-subarray latency and energy improvements of RowClone's Fast Parallel Mode. Second, FIGARO [121] improves upon LISA by enabling fine-grained (i.e., column-granularity) data copy across subarrays within a bank, using the shared global I/O structures of the bank as an intermediate location. This work shows significant benefit from FIGARO when its principles and techniques are used to build a highly-effective yet low-cost in-DRAM cache. Third, Network-on-Memory (NoM) [123] improves the parallelism of bank-to-bank copy as well as bank read/write operations by providing more connectivity between different banks and chip I/O structures using the logic layer in 3D-stacked memory. Fourth, the ComputeDRAM work [122] shows that one can mimic the effect of RowClone's back-to-back activation mechanism in off-the-shelf DRAM chips by violating the timing parameters such that two wordlines in a subarray are activated back-to-back, as in RowClone. This work shows that such a version of RowClone can operate reliably in a variety of off-the-shelf DRAM chips tested using the SoftMC infrastructure [38, 332]. Fifth, the Pinatubo work [104] shows that RowClone can effectively be performed in emerging resistive memory chips, including phase-change memory (PCM) [26, 264].

We believe that RowClone provides very low-cost specialized support for a critical and often-used operation: data copy and initialization. In latency-critical systems, such as virtual machines, modern software is written to avoid large amounts of data copy as much as possible, exactly because data copy is expensive in modern systems (as it goes through the processor over a bandwidth-bottlenecked memory bus). Eliminating copies as much as possible complicates software design, making it less maintainable and readable. If RowClone is implemented in real chips, the need for avoiding data copies may greatly diminish due to the more-than-an-order-of-magnitude latency reduction of page copy, leading to easier-to-write and easier-to-maintain software. As such, we believe that an idea as simple as RowClone (and the work that builds on it) can have exciting and forward-looking implications, making both systems and software much faster, more efficient, and overall better.

In addition to bulk data movement and initialization, many applications make use of bulk bitwise operations, i.e., bitwise operations on large bit vectors [333, 334].
Examples of such applications include bitmap indices [335, 336, 337, 338] used in databases, bitwise scan acceleration [339] in databases, accelerated document filtering for web search [340], DNA sequence alignment [13, 115, 190, 191, 192, 341, 342], encryption algorithms [343, 344, 345], graph processing [104], and networking [334]. Accelerating bulk bitwise operations can thus significantly boost the performance and energy efficiency of a wide range of applications.

In order to avoid data movement bottlenecks when the system performs these bulk bitwise operations, we have recently proposed a new Accelerator-in-Memory for bulk Bitwise operations (Ambit) [109, 111, 112]. Unlike prior approaches, Ambit uses the analog operation of existing DRAM technology to perform bulk bitwise operations. Ambit consists of two components. The first component, Ambit-AND-OR, implements a new operation called triple-row activation, where the memory controller simultaneously activates three rows. Triple-row activation, depicted in Figure 11, performs a bitwise majority function across the cells in the three rows, due to the charge-sharing principles that govern the operation of the DRAM array. In the initial state, all three rows are closed. In the example of Figure 11, two cells are in the charged state. When the three wordlines are raised simultaneously, charge sharing results in a positive deviation on the bitline. After sense amplification, the sense amplifier drives the bitline to V_DD and, as a result, fully charges all three cells. By controlling the initial value of one of the three rows (e.g., C), we can use triple-row activation to perform a bitwise AND or OR of the other two rows, since the bitwise majority function can be expressed as C(A + B) + C̄(AB). The second component, Ambit-NOT, takes advantage of the two inverters that are part of each sense amplifier in a DRAM subarray. Ambit-NOT exploits the fact that, at the end of the sense amplification process, the voltage level of one of the inverters represents the negated logical value of the cell. The Ambit design adds a special row to the DRAM array, which is used to capture the negated value that is present in the sense amplifiers. One possible implementation of the special row [112] is a row of dual-contact cells (a 2-transistor 1-capacitor cell [346, 347]) that connects to both inverters inside the sense amplifier. With the ability to perform AND, OR, and NOT operations, Ambit is functionally complete: it can reliably perform any bulk bitwise operation completely using DRAM technology, even in the presence of significant process variation (see [112] for details).
Figure 11: Triple-row activation in Ambit. Reproduced from [112].
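The charge-sharing intuition behind triple-row activation can be captured in a few lines of Python. The sketch below is a simplified electrical model, not a circuit simulation: it assumes ideal transistors, equal cell capacitances, a bitline precharged to V_DD/2, and illustrative capacitance values.

VDD = 1.0          # normalized supply voltage
C_CELL = 1.0       # cell capacitance (illustrative value)
C_BITLINE = 8.0    # bitline capacitance (illustrative value)

def triple_row_activation(a, b, c):
    """Model of Ambit's TRA: three cells (logical 0/1) share charge with a
    bitline precharged to VDD/2; the sense amplifier then amplifies the
    deviation, producing the majority of the three inputs."""
    k = a + b + c  # number of fully charged cells
    v_bitline = (k * C_CELL * VDD + C_BITLINE * VDD / 2) / (3 * C_CELL + C_BITLINE)
    return 1 if v_bitline > VDD / 2 else 0  # positive deviation -> VDD, else 0

def ambit_and(a, b):
    return triple_row_activation(a, b, 0)  # majority(a, b, 0) = a AND b

def ambit_or(a, b):
    return triple_row_activation(a, b, 1)  # majority(a, b, 1) = a OR b

Because majority(A, B, 0) = A AND B and majority(A, B, 1) = A OR B, fixing the control row's value selects the operation; combined with Ambit-NOT, this yields the functionally complete set {AND, OR, NOT}.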
Averaged across seven commonly-used bitwise operations (not, and, or, nand, nor, xor, xnor), Ambit with 8 DRAM banks improves bulk bitwise operation throughput by 44x compared to an Intel Skylake processor [348], and by 32x compared to the NVIDIA GTX 745 GPU [349]. Compared to the DDR3 standard, Ambit reduces the energy consumption of these operations by 35x on average. Compared to HMC 2.0 [151], Ambit improves bulk bitwise operation throughput by 2.4x. When integrated directly into the HMC 2.0 device, Ambit improves throughput by 9.7x compared to processing in the logic layer of HMC 2.0.

The Ambit work also shows that porting bitmap-index-based databases as well as the BitWeaving database to execute on Ambit can greatly improve query latencies. For example, Ambit reduces end-to-end query latencies by 5.4x to 6.6x for bitmap-based databases, with larger improvements coming from cases where more data needs to be scanned in the database. For the BitWeaving database, which is specifically designed to maximize bitwise operations so that the database can be relatively easily accelerated on modern GPUs, Ambit reduces end-to-end query latencies by 4x to 12x, again with larger improvements coming from cases where more data needs to be scanned in the database. These results are clearly very promising for two important data-intensive applications.

A number of Ambit-like bitwise operation substrates have been proposed in recent years, making use of emerging resistive memory technologies, e.g., phase-change memory (PCM) [26, 28, 264, 310, 311, 313], as well as SRAM [105, 106, 130, 131] and specialized computational DRAM [116, 122, 129, 134, 319]. These substrates can perform bulk bitwise operations in a special DRAM array augmented with computational circuitry [116, 128] and in resistive memories [104] like PCM. Substrates similar to Ambit can perform simple arithmetic operations in SRAM [105, 106] and arithmetic and logical operations in memristors [107, 135, 136, 137, 138]. All of these works have shown significant benefits from performing bitwise operations using memory, for a wide variety of applications (including databases, machine learning, graph processing, and genome analysis), and using a variety of different memory technologies, including DRAM, SRAM, PCM, and memristors.

Recently, the ComputeDRAM work [122] showed that carefully violating timing parameters between activation commands can mimic the triple-row activation operation of Ambit in some existing off-the-shelf DRAM chips, using the SoftMC infrastructure [38]. Thus, in-DRAM AND and OR operations can be performed in some real off-the-shelf DRAM chips, even though such chips are clearly not designed to perform Ambit operations. This proof-of-concept demonstration shows that the ideas presented in Ambit may not be far from reality: if some existing DRAM chips that are not even designed for in-DRAM bulk bitwise operations can perform such operations, then DRAM chips that are carefully designed for such operations will hopefully be even more capable!

We believe it is extremely important to continue exploring such low-cost Ambit-like substrates, as well as more sophisticated computational substrates, for all types of memory technologies, old and new. Resistive memory technologies are fundamentally non-volatile and amenable to in-place updates, and as such, can lead to even less data movement compared to DRAM, which fundamentally requires some data movement to sense, amplify, and restore the data.
Thus, we believe it is very promising to examine the design of both conventional charge-based memory chips and emerging resistive memory chips that can incorporate Ambit-like bitwise operations and other types of suitable computation capability. Going forward, it is also critical to research frameworks that enable ease of programming of such substrates, so that many algorithms can take advantage of the massive bit-level parallelism offered by Ambit-like PUM substrates.

Many applications access data structures with different access patterns at different points in time. Depending on the layout of the data structures in physical memory, some access patterns require non-unit strides. As current memory systems are optimized to access sequential cache lines, non-unit strided accesses exhibit low spatial locality, leading to wasted memory bandwidth and wasted cache space.

Gather-Scatter DRAM (GS-DRAM) [97] is a low-cost substrate that addresses this problem. It performs in-DRAM data structure reorganization by accessing multiple values that belong to a strided access pattern using a single read/write command from the memory controller. GS-DRAM uses two key new mechanisms. First, GS-DRAM remaps the data of each cache line to different DRAM chips such that multiple values of a strided access pattern are mapped to different chips. This makes it possible to gather different parts of the strided access pattern concurrently from different chips. Figure 12 shows an example mapping on four DRAM chips: adjacent values and/or adjacent pairs of values are swapped. Second, instead of sending separate requests to each chip, the GS-DRAM memory controller communicates a pattern ID to the memory module, as Figure 12 shows. With the pattern ID, each DRAM chip computes the address to be accessed independently, via custom column translation logic (CTL) hardware that is part of the DRAM module. This way, the returned cache line contains different values of the strided pattern, gathered from different DRAM chips.

Figure 12: GS-DRAM (Gather-Scatter DRAM) data mapping and chip control overview. CTL refers to the Column Translation Logic hardware in the DRAM module. Reproduced from [97].

GS-DRAM achieves near-ideal memory bandwidth and cache utilization in real-world workloads, such as in-memory databases and matrix multiplication. For in-memory databases, GS-DRAM outperforms a transactional workload with a column-store layout by 3x and an analytics workload with a row-store layout by 2x, thereby providing the best performance for both transactional and analytical queries on databases (which in general benefit from different types of data layouts). For matrix multiplication, GS-DRAM is 10% faster than the best-performing tiled implementation of the matrix multiplication algorithm. We note that the idea of GS-DRAM is completely independent of memory technology, and thus GS-DRAM can be used with any type of memory module, including DRAM, SRAM, PCM, memristors, STT-MRAM, and RRAM.
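The following toy Python model illustrates the flavor of GS-DRAM's two mechanisms for a hypothetical module with four chips, each contributing one value per cache line. The XOR-based shuffle and the per-chip address computation are simplified stand-ins for the actual GS-DRAM mapping and CTL design in [97].

CHIPS = 4  # toy module: 4 chips, 4 values per cache line, 4 lines shown

def scatter_line(mem, line, values):
    """Mechanism 1 (data remapping): value v of cache line `line` is stored
    on chip (v XOR line), so a strided pattern lands on distinct chips."""
    for v, val in enumerate(values):
        mem[line][v ^ line] = val

def gather_pattern(mem, field):
    """Mechanism 2 (pattern ID): one command is broadcast with a pattern ID;
    each chip's column translation logic (CTL) independently derives which
    line to read, so one access gathers `field` from all four records."""
    out = [None] * CHIPS
    for chip in range(CHIPS):
        line = chip ^ field        # CTL: per-chip address computation
        out[line] = mem[line][chip]
    return out                     # out[r] = field `field` of record r

# Usage: store four 4-field records, then gather field 2 of all records.
mem = [[None] * CHIPS for _ in range(CHIPS)]
for r in range(CHIPS):
    scatter_line(mem, r, [f"r{r}f{f}" for f in range(CHIPS)])
print(gather_pattern(mem, 2))      # ['r0f2', 'r1f2', 'r2f2', 'r3f2']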
Secure computation is of critical importance in modern computing systems. Therefore, it is important for a PIM system to support fundamental security primitives that enable secure computation and security functions; doing so would enable PIM systems to execute a wider range of workloads, and to do so securely. To this end, recent work shows that processing using memory can provide two basic security primitives. By carefully violating DRAM access timing parameters and taking advantage of the resulting characteristics of different DRAM cells (i.e., whether they always fail, never fail, or fail randomly), it is possible to use DRAM to generate physical unclonable functions (PUFs) [261] and true random numbers (TRNs) [172].

PUFs are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have two key advantages: (1) DRAM is present in many modern computing systems, and (2) DRAM has high capacity and can thus provide many unique identifiers. However, traditional DRAM PUFs exhibit unacceptably high latencies and are not runtime-accessible. Our recent work, the DRAM Latency PUF [261], proposes a new class of fast, reliable DRAM PUFs that are runtime-accessible, i.e., that can be used during online operation with low latency. The key idea is to reduce the DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variation in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Some DRAM cells always fail or never fail under reduced latency, and a combination of a set of such cells can be used to generate a unique identifier for the device. Figure 13 illustrates the key idea of using the pattern of predictable access latency failures in a DRAM subarray to generate a unique DRAM device identifier. An experimental characterization of 223 LPDDR4 DRAM chips from all three major manufacturers shows that these error patterns (1) satisfy runtime-accessible PUF requirements, and (2) are quickly generated (i.e., within 88.2 ms) irrespective of operating temperature. The DRAM Latency PUF does not require any modification to existing DRAM chips: it only requires an intelligent memory controller that can change timing parameters and identify the DRAM regions and cells that can be reliably used as PUFs.
Figure 13: Key idea of the DRAM Latency PUF. Reproduced from [350].
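A minimal sketch of how such an identifier could be derived is shown below, assuming we are given several error bitmaps from repeated reduced-latency reads of the same DRAM region. The function and the stable-cell filtering policy are hypothetical simplifications of the actual DRAM Latency PUF evaluation flow [261].

import hashlib

def dram_latency_puf_id(trial_bitmaps):
    """trial_bitmaps: list of equal-length bit lists; bit i is 1 if cell i
    failed when read with reduced access latency in that trial."""
    n = len(trial_bitmaps[0])
    # Keep only cells whose behavior repeats across every trial (always
    # fail or never fail); mark unstable cells so they do not contribute.
    stable = [
        str(trial_bitmaps[0][i]) if len({t[i] for t in trial_bitmaps}) == 1 else "x"
        for i in range(n)
    ]
    # Hash the repeatable failure pattern into a fixed-size device identifier.
    return hashlib.sha256("".join(stable).encode()).hexdigest()

# Example: three repeated reduced-latency reads of an 8-cell region.
trials = [[1, 0, 0, 1, 1, 0, 1, 0],
          [1, 0, 1, 1, 1, 0, 1, 0],   # cell 2 flips -> excluded ("x")
          [1, 0, 0, 1, 1, 0, 1, 0]]
print(dram_latency_puf_id(trials))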
Intentionally violating DRAM access timing parameters can also be used to generate true random numbers inside DRAM. The technique we propose in [172] decreases the DRAM row activation latency (the timing parameter tRCD) below the datasheet specifications to induce read errors, or activation failures. As a result, some DRAM cells, called TRNG (True Random Number Generator) cells, fail truly randomly. By aggregating the resulting data from multiple such TRNG cells, our technique, called D-RaNGe, provides a high-throughput and low-latency TRNG. Figure 14 illustrates the key idea of D-RaNGe: finding and using the TRNG cells in a DRAM subarray to generate true random values. We demonstrate the effectiveness of D-RaNGe on 282 LPDDR4 devices from the three major manufacturers, and observe that the produced random data remains robust over both time and temperature variation. D-RaNGe (1) successfully passes all NIST statistical tests for randomness, and (2) generates true random numbers with over two orders of magnitude higher throughput than the state-of-the-art DRAM-based TRNG. D-RaNGe does not require any modification to existing DRAM chips: it only requires an intelligent memory controller that can change timing parameters and identify the DRAM cells that can be reliably used as TRNG cells.
Figure 14: Key idea of D-RaNGe. Reproduced from [351].
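The following behavioral sketch captures the essence of D-RaNGe. A software random number generator stands in for a real reduced-tRCD DRAM read, and real designs would first characterize which cells qualify as TRNG cells and may post-process the raw bits, both of which this toy model omits.

import random

def read_with_reduced_trcd(cell_fail_prob):
    """Stand-in for a DRAM read with tRCD lowered below spec: a TRNG cell
    returns the wrong value with probability ~0.5, behaving like a coin flip."""
    return 1 if random.random() < cell_fail_prob else 0

def drange_bits(trng_cells, num_bits):
    """Aggregate random activation failures from previously characterized
    TRNG cells (failure probability close to 0.5) into a bitstream."""
    bits = []
    while len(bits) < num_bits:
        for p_fail in trng_cells:
            bits.append(read_with_reduced_trcd(p_fail))
    return bits[:num_bits]

print(drange_bits([0.49, 0.51, 0.50], 16))  # e.g., [1, 0, 0, 1, ...]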
D-RaNGe and the DRAM Latency PUF show that commodity DRAM devices can be reliably used to generate true random numbers and unique keys with high throughput, low latency, and low power. As a result, PIM systems can effectively generate true random numbers and unique keys directly using DRAM itself. Doing so can improve the security and privacy of the system: PIM applications can directly generate random numbers or unique keys within DRAM, and do not require off-DRAM devices to generate them and transfer them over the CPU-to-DRAM bus. Thus, random numbers and unique keys no longer need to be transferred across buses, and security-critical computations can happen securely inside memory, which can vastly improve the security guarantees of a PIM-enabled system.
7. Processing Near Memory (PNM)
Processing near memory (PNM) involves adding or integrating PIM logic (e.g., accelerators, simple processing cores, reconfigurable logic) close to or inside the memory (e.g., [7, 13, 50, 51, 52, 53, 62, 81, 82, 83, 84, 86, 87, 88, 91, 92, 93, 94, 95, 96, 98, 99, 100, 102, 103, 113, 115, 117, 118, 119, 178, 179, 321, 322, 323, 324, 325, 326, 327]). Many of these works place PIM logic inside the logic layer of 3D-stacked memories [149]. This PIM processing logic, which we also refer to, interchangeably, as PIM cores or PIM engines, can execute portions of applications (from individual instructions to functions) or entire threads and applications, depending on the design of the architecture. Such PIM engines have high-bandwidth and low-latency access to the memory stacks that are on top of them, since the logic layer and the memory layers are connected via high-bandwidth vertical connections [149], e.g., through-silicon vias. In this section, we discuss several examples of how systems can make use of relatively simple PIM engines within the logic layer to avoid data movement and thus obtain significant performance and energy improvements on a wide variety of application domains.
7.1. Tesseract: Coarse-Grained Application-Level PNM Acceleration of Graph Processing

A promising approach to using PNM is to enable coarse-grained acceleration of entire applications that are heavily memory-bound. In such a fundamentally coarse-grained (i.e., application-granularity) approach, an entire application is rewritten to execute completely on the PNM substrate, potentially using a specialized programming model and specialized architecture/hardware. This approach is especially promising because it can provide the maximum performance and energy benefits achievable from PNM acceleration of a given application, since it enables the customization of the entire PNM system for the application. We believe this approach can be especially promising for widely-used data-intensive applications, such as graph processing, machine learning, databases, media processing, and genome analysis.

A popular modern application is large-scale graph processing [113, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361]. Graph processing has broad applicability and use in many domains, from social networks to machine learning, from data analytics to bioinformatics. Graph analysis workloads are known to put significant pressure on memory bandwidth due to (1) large amounts of random memory accesses across large memory regions (leading to very limited cache efficiency and very large amounts of unnecessary data transfer on the memory bus) and (2) very small amounts of computation per data item fetched from memory (leading to very limited ability to hide long memory latencies and exacerbating the energy bottleneck by exercising the huge energy disparity between memory access and computation). These two characteristics make it very challenging to scale up such workloads despite their inherent parallelism, especially with conventional architectures based on large on-chip caches and relatively scarce off-chip memory bandwidth for random access.

We can exploit the high bandwidth as well as the potential computation capability available within the logic layer of 3D-stacked memory to overcome the limitations of conventional architectures for graph processing. To this end, we design a programmable PNM accelerator for large-scale graph processing, called Tesseract [52], depicted at a high level in Figure 15. Tesseract consists of (1) a new hardware architecture that effectively utilizes the available memory bandwidth in 3D-stacked memory by placing simple in-order processing cores in the logic layer and enabling each core to manipulate data only on the memory partition it is assigned to control, (2) an efficient method of communication between different in-order cores within a 3D-stacked memory, which enables each core to request computation on data elements that reside in the memory partition controlled by another core, and (3) a message-passing-based programming interface, similar to how modern distributed systems are programmed, which enables remote function calls on data that resides in each memory partition. The Tesseract design moves functions (i.e., computations and temporary values) to the data to be updated, rather than moving data elements across different memory partitions and cores. It also includes two hardware prefetchers specialized for the memory access patterns of graph processing, which operate based on hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed Tesseract PIM architecture improves average system performance by 13.8x and achieves an 87% average energy reduction over conventional systems.

Figure 15: Overview of the Tesseract system for graph processing. Reproduced from [214]. Originally presented in [52, 302].
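The following Python sketch illustrates the message-passing programming style that Tesseract enables, for a hypothetical two-core configuration. The class, the method names, and the simple modulo vertex-to-partition mapping are our own illustrative choices, not Tesseract's actual interface.

class TesseractCore:
    """Toy model of one in-order core controlling one memory partition."""

    def __init__(self, core_id, num_cores):
        self.id, self.num_cores = core_id, num_cores
        self.vertex_value = {}   # only vertices owned by this partition
        self.inbox = []          # queued remote function calls

    def owner(self, v):
        return v % self.num_cores  # assumed simple vertex-to-partition mapping

    def remote_call(self, cores, v, func, arg):
        """Ship the computation to the data: enqueue (func, v, arg) at the
        core that owns vertex v, instead of fetching v's data here."""
        cores[self.owner(v)].inbox.append((func, v, arg))

    def drain(self):
        for func, v, arg in self.inbox:
            func(self, v, arg)   # executes with local, high-bandwidth access
        self.inbox.clear()

def accumulate(core, v, contribution):  # e.g., one PageRank-style update
    core.vertex_value[v] = core.vertex_value.get(v, 0.0) + contribution

cores = [TesseractCore(i, 2) for i in range(2)]
cores[0].remote_call(cores, 3, accumulate, 0.25)  # vertex 3 lives on core 1
for c in cores:
    c.drain()
print(cores[1].vertex_value)  # {3: 0.25}

The design choice the sketch highlights is that computation moves to the partition that owns the data, so each core only ever touches its local, high-bandwidth memory stack.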
A significant amount of recent research has built upon Tesseract to make graph processing PNM systems even more powerful [324, 325, 326, 327]. Among these, GraphP [325] proposes a new graph partitioning scheme that greatly reduces the costly communication across 3D-stacked memory chips. Better partitioning is also proposed in GraphH [324], together with a reconfigurable double-mesh network that provides higher bandwidth across 3D-stacked memory chips. GraphQ [327] employs static and structured communication patterns to eliminate irregular communication, which is one of the key bottlenecks of Tesseract. Hetraph [326] combines memristor-based analog computation units and CMOS-based digital compute cores on the logic layer of 3D-stacked memory chips, in order to use the most suitable one for each phase of computation.
Overall, combining the multiple proposals reported in these works, using the Tesseract-based PNM approach to accelerate graph processing can lead to more than two orders of magnitude improvement in both performance and energy efficiency compared to a conventional processor-centric system with high-bandwidth memory. This demonstrates the potential promise of designing an entire PNM system from the ground up for an important data-intensive application.

7.2. Function-Level PNM Acceleration of Mobile Consumer Workloads

Another promising approach to using PNM, function-level offloading, is less intrusive than Tesseract's application-granularity approach described in Section 7.1. This approach can still be coarse-grained, since the function that is offloaded to the PNM logic can potentially be arbitrarily long; however, the entire application does not need to be rewritten. This approach is promising because it can enable easier adoption of PNM while still providing significant benefits. The key question in this approach is which functions in an application should be offloaded for PNM acceleration. Several recent works tackle this question for various applications, e.g., mobile consumer workloads [7], GPGPU workloads [86, 87], graph processing and in-memory database workloads [62, 179], and a wide variety of workloads from many domains [16]. In this section, we discuss function-level PNM acceleration of mobile consumer workloads, focusing on our recent work on the topic [7].

A very popular domain of computing today consists of consumer devices, which include smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. In consumer devices, energy efficiency is a first-class concern due to the limited battery capacity and the stringent thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in modern consumer devices: across all of the popular modern mobile consumer applications we study (described in the next paragraph), 62.7% of the total system energy, on average, is spent on data movement across the memory hierarchy [7]. As described before, this large fraction spent on data movement is a direct result of the processor-centric design paradigm of modern computing systems.

We comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads [7], which account for a significant portion of the applications executed on consumer devices. These workloads include (1) the Chrome web browser [362], which is a very popular browser used on mobile devices and laptops; (2) TensorFlow Mobile [363], Google's machine learning framework, which is used in services such as Google Translate, Google Now, and Google Photos; (3) the VP9 video playback engine [364]; and (4) the VP9 video capture engine [364], both of which are used in many video services such as YouTube and Google Hangouts. We find that offloading key functions to the logic layer can greatly reduce data movement in all of these workloads. However, there are challenges to introducing PIM in consumer devices, as consumer devices are extremely stringent in terms of the area and energy budget they can accommodate for any new hardware enhancement. As a result, we need to identify what kind of in-memory logic can both (1) maximize energy efficiency and (2) be implemented at minimum possible cost, in terms of both area overhead and complexity.
We find that many of the target functions for PIM in consumer workloads are comprised of simple operations such as memcopy, memset, basic arithmetic and bitwise operations, and simple data shuffling and reorganization routines. Therefore, we can relatively easily implement these PIM target functions in the logic layer of 3D-stacked memory using either (1) a small low-power general-purpose embedded core or (2) a group of small fixed-function accelerators. Our analysis shows that a PIM core and a PIM accelerator take up no more than 9.4% and 35.4%, respectively, of the area available for PIM logic in an HMC-like [151] 3D-stacked memory architecture. Both the PIM core and the PIM accelerator eliminate a large amount of data movement, and thereby significantly reduce total system energy (by an average of 55.4% across all the workloads) and execution time (by an average of 54.2%).

As evident from these results, function-level acceleration provides significant performance and energy benefits, but the benefits are not as high as those of full application-level offloading and customization of the PNM system, as we have shown for Tesseract in Section 7.1. This is expected, since function-level offloading makes far fewer changes to the system and the programming model than application-level offloading, customization, and rethinking of the system.

7.3. Programmer-Transparent Function-Level PNM Acceleration in GPU Systems

In the last decade, Graphics Processing Units (GPUs) have become the accelerator of choice for a wide variety of data-parallel applications. They deploy thousands of in-order, SIMT (Single Instruction Multiple Thread) cores that run lightweight threads. The heavily-multithreaded GPU architecture is devised to hide the long latency of memory accesses by interleaving threads that execute arithmetic and logic operations. Despite that, many GPU applications are still very memory-bound [365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375], because the limited off-chip pin bandwidth cannot supply enough data to the running threads.

Processing near memory in 3D-stacked memory architectures presents a promising opportunity to alleviate the memory bottleneck in GPU systems. GPU cores placed in the logic layer of a 3D-stacked memory can be directly connected to the DRAM layers with high-bandwidth (and low-latency) connections. Figure 16 presents an example configuration with a main GPU system connected to four 3D-stacked memories. In the logic layer of each 3D-stacked memory, there are GPU cores (also known as streaming multiprocessors, SMs) connected to memory vault controllers via a crossbar switch. In order to leverage the potential performance benefits of such systems, it is necessary to enable computation offloading and data mapping to multiple such compute-capable 3D-stacked memories, such that GPU applications can benefit from processing-in-memory capabilities in the logic layers of such memories.

Figure 16: Overview of a PNM GPU system with a powerful main GPU and less powerful logic-layer GPUs distributed across four 3D-stacked memories. Reproduced from [86].

TOM (Transparent Offloading and Mapping) [86] proposes two mechanisms to address computation offloading and data mapping in such a system in a programmer-transparent manner. First, it introduces new compiler analysis techniques to identify code sections in GPU kernels that can benefit from offloading to PIM engines. The compiler estimates the potential memory bandwidth savings for each code block: it compares the bandwidth consumption of the code block, when executed on the regular GPU cores, to the bandwidth cost of transmitting/receiving input/output registers, when offloading to the GPU cores in the logic layers. At runtime, a final offloading decision is made based on dynamic system conditions, such as contention for processing resources in the logic layer. Second, a software/hardware cooperative mechanism predicts the memory pages that will be accessed by offloaded code, and places such pages in the same 3D-stacked memory where the code will be executed. The goal is to make PIM effective by ensuring that the data needed by the PIM cores is in the same memory stack as the code that needs it. Both mechanisms are completely transparent to the programmer, who only needs to write regular GPU code without any explicit PIM instructions or any other modification to the code. We find that TOM improves the average performance of a variety of GPGPU workloads by 30% and reduces average energy consumption by 11% with respect to a baseline GPU system without PIM offloading capabilities.

A related work [87] identifies GPU kernels that are suitable for PIM offloading by using a regression-based affinity prediction model. A concurrent kernel management mechanism uses this affinity prediction model to determine which kernels should be scheduled concurrently to maximize performance. This way, the proposed mechanism enables the simultaneous exploitation of the regular GPU cores and the in-memory GPU cores. This scheduling technique improves performance and energy efficiency by an average of 42% and 27%, respectively.
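A highly simplified version of TOM's compile-time cost-benefit test is sketched below. The function and parameter names are ours, and the real analysis operates on GPU code blocks, estimates traffic via compiler analysis, and defers the final decision to runtime, as described above.

def should_offload(bytes_loaded, bytes_stored, live_in_regs, live_out_regs,
                   bytes_per_reg=4):
    """Offload a candidate block to a memory-stack SM only if the off-chip
    traffic it avoids exceeds the cost of shipping its live-in/live-out
    registers between the main GPU and the logic layer."""
    traffic_saved = bytes_loaded + bytes_stored       # stays within the stack
    offload_cost = (live_in_regs + live_out_regs) * bytes_per_reg
    return traffic_saved > offload_cost

# A block that moves 2 KB of data but only 6 registers is worth offloading:
print(should_offload(bytes_loaded=1536, bytes_stored=512,
                     live_in_regs=4, live_out_regs=2))  # True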
7.4. Instruction-Level PNM Acceleration with PIM-Enabled Instructions (PEI)

A finer-grained approach to using PNM is instruction-level offloading. With this approach, individual instructions can be offloaded to the PNM engine and accelerated. As we describe below, this fine-grained approach can have significant benefits in terms of potential adoption, since existing processor-centric execution models already operate (i.e., perform computation) at the granularity of individual instructions, and all such machinery can be reused to make offloading as seamless as possible with existing programming models and system mechanisms. PIM-Enabled Instructions (PEI) [53] aims to provide the minimal processing-in-memory support needed to take advantage of PIM using 3D-stacked memory, in a way that achieves significant performance and energy benefits without changing the computing system significantly. To this end, PEI proposes a collection of simple instructions, which introduce small changes to the computing system and no changes to the programming model or the virtual memory system, in a system with 3D-stacked memory. These instructions, generated by the compiler or programmer to indicate potentially PIM-offloadable operations in the program, can be executed either in a traditional host CPU (which fetches and decodes them) or in the PIM engine in 3D-stacked memory.

PIM-Enabled Instructions are based on two key ideas. First, a PEI is a cache-coherent, virtually-addressed host processor instruction that operates on only a single cache block.
It requires no changes to the sequential execution and programming model, no changes to virtual memory, minimal changes to cache coherence, and no need for special data mapping to take advantage of PIM (because each PEI is restricted to a single memory module by its single-cache-block restriction). Second, a Locality-Aware Execution runtime mechanism decides dynamically where to execute a PEI (i.e., either on the host processor or on the PIM logic) based on simple locality characteristics and simple hardware predictors. This runtime mechanism executes the PEI at the location that maximizes performance. In summary, PIM-Enabled Instructions provide the illusion that PIM operations are executed as if they were host instructions: the programmer may not even be aware that the code is executing on a PIM-capable system, and the exact same program containing PEIs can be executed on conventional systems that do not implement PIM.

Figure 17 shows an example architecture that can be used to enable PEIs. In this architecture, a PEI is executed on a PEI Computation Unit (PCU). To enable PEI execution either in the host CPU or in memory, a PCU is added to each host CPU and to each vault in an HMC-like 3D-stacked memory. While the work done in a PCU for a PEI might have required multiple CPU instructions in the baseline CPU-only architecture, the CPU only needs to execute a single PEI instruction, which is sent to a central PEI Management Unit (PMU in Figure 17). The PMU, which is in charge of Locality-Aware Execution, launches the appropriate PEI operation on one of the PCUs, either on the CPU or in 3D-stacked memory.

Figure 17: Example architecture for PIM-Enabled Instructions. Reproduced from [309]. Originally presented in [53, 376].
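The sketch below illustrates the locality-aware execution idea for one hypothetical PEI (a single-cache-block atomic add). The predictor and the data structures are illustrative stand-ins; in particular, the model ignores the cache-coherence handling that a real PEI implementation performs.

def pim_add(addr, value, cache, memory, predictor):
    """Locality-aware execution of one PEI: run on the host PCU if the cache
    block is likely resident on-chip, else on the memory-side PCU."""
    if predictor.likely_cached(addr):
        # Host-side execution: operate on the (possibly cached) copy.
        cache[addr] = cache.get(addr, memory[addr]) + value
        memory[addr] = cache[addr]   # toy model: write-through for brevity
    else:
        # Memory-side execution: the cache block never crosses the bus.
        memory[addr] += value

class SimplePredictor:
    """Illustrative locality predictor: a set of recently touched addresses."""
    def __init__(self):
        self.hot = set()
    def likely_cached(self, addr):
        seen = addr in self.hot
        self.hot.add(addr)
        return seen

memory, cache = {0x40: 10}, {}
pred = SimplePredictor()
pim_add(0x40, 5, cache, memory, pred)  # first touch -> executes near memory
pim_add(0x40, 5, cache, memory, pred)  # now "hot"   -> executes on the host
print(memory[0x40])                    # 20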
Examples of PEIs are integer increment, integer minimum, floating-point addition, hash table probing, histogram bin index, Euclidean distance, and dot product [53]. Data-intensive workloads such as graph processing, in-memory data analytics, machine learning, and data mining can significantly benefit from these PEIs. Across 10 key data-intensive workloads, we observe that the use of PEIs, in combination with the Locality-Aware Execution runtime mechanism, leads to an average performance improvement of 47% and an average energy reduction of 25% over a baseline CPU, on reasonably large data set sizes. As such, the benefits provided by the fine-grained PEI approach are quite promising: with minimal changes to the system, performance and energy improve significantly. We therefore believe that the PEI mechanism can ease the adoption of PIM systems going into the future, a key issue we discuss in detail in Section 8.
7.5. Function-Level PNM Acceleration of Genome Analysis Workloads

Genome analysis is a critical data-intensive domain that can greatly benefit from acceleration [12, 13, 189, 190, 191, 192, 342, 377, 378, 379], specifically processing-in-memory acceleration. We find that function-level PNM acceleration via algorithm-architecture co-design is especially beneficial for data-intensive genome analysis workloads, as demonstrated in two of our recent works [13, 115].

GRIM-Filter [115] is an in-memory accelerator for genome seed filtering. In order to read the genome (i.e., DNA sequence) of an organism, geneticists often need to reconstruct the genome from small segments of DNA known as reads, as current DNA extraction techniques are unable to extract the entire DNA sequence. A genome read mapper can perform the reconstruction by matching the reads against a reference genome, and a core part of read mapping is a computationally-expensive dynamic programming algorithm that aligns the reads to the reference genome. One technique to significantly improve the performance and efficiency of read mapping is seed filtering [190, 191, 192, 342, 380], which reduces the number of reference genome seeds (i.e., segments) that a read must be checked against for alignment by quickly eliminating seeds with no probability of matching. GRIM-Filter proposes a state-of-the-art filtering algorithm, and places the entire algorithm inside memory [115].

GRIM-Filter represents the entire reference genome by dividing it into short continuous segments, called bins, and performs analyses on metadata associated with each bin. This metadata, represented as a bitvector, stores whether or not a particular token (a short DNA sequence) is present in the associated bin. Bitvectors are stored in DRAM in column order, such that a DRAM access to a row fetches the bits of the same token across many bitvectors, as the left block of Figure 18 shows. GRIM-Filter places custom logic for each vault in the logic layer of 3D-stacked memory (center block of Figure 18). In each vault, there are multiple per-bin logic modules that operate on the bitvector of a single bin. Each logic module consists of an incrementer, an accumulator, and a comparator, as the right block of Figure 18 shows.

Figure 18: Left block: GRIM-Filter bitvector layout within a DRAM bank. Center block: 3D-stacked DRAM with a tightly integrated logic layer stacked underneath, with TSVs for high inter-layer data transfer bandwidth. Right block: Custom GRIM-Filter logic placed in the logic layer, for each vault. Reproduced from [115].
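Functionally, the per-bin logic reduces to a count-and-compare over each bin's metadata. The Python sketch below models the filtering step, with a set of tokens standing in for each bin's stored bitvector; the function name and the threshold parameter are illustrative.

def grim_filter(read_tokens, bin_token_sets, threshold):
    """For each bin: count the read's tokens present in the bin (incrementer +
    accumulator) and keep bins whose count reaches the threshold (comparator).
    Bins that cannot possibly match are discarded before costly alignment."""
    candidate_bins = []
    for bin_id, tokens_in_bin in enumerate(bin_token_sets):
        matches = sum(1 for t in read_tokens if t in tokens_in_bin)
        if matches >= threshold:
            candidate_bins.append(bin_id)
    return candidate_bins

# Toy example with 3-character tokens extracted from a short read.
read = ["ACG", "CGT", "GTT"]
bins = [{"ACG", "CGT", "GTT", "TTA"},  # bin 0: all tokens present
        {"ACG", "TTT"},                # bin 1: only one token present
        {"CGT", "GTT", "AAA"}]         # bin 2: two tokens present
print(grim_filter(read, bins, threshold=2))  # [0, 2]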
GRIM-Filter introduces a communication protocol between the read mapper and the filter. The communication protocol allows GRIM-Filter to be integrated into a full genome read mapper (e.g., FastHASH [380], mrFAST [381], BWA-MEM [382], Minimap2 [383]) by allowing (1) the read mapper to notify GRIM-Filter about the DRAM addresses on which to execute customized in-memory filtering operations, and (2) GRIM-Filter to notify the read mapper once the filter generates a list of seeds for alignment. Across 10 real genome read sets, GRIM-Filter improves the performance of a full state-of-the-art read mapper by 3.65x over a conventional CPU-only system [115].

In a more recent work [13], we develop an algorithm-architecture co-design to accelerate approximate string matching (ASM), which is used at multiple points during the mapping process of genome analysis. ASM enables read mapping to account for sequencing errors and genetic variations in the reads. Our work, GenASM, is the first ASM acceleration framework for genome sequence analysis. GenASM performs bitvector-based ASM, which can efficiently accelerate multiple steps of genome sequence analysis. We modify the underlying ASM algorithm (Bitap [384, 385]) to significantly increase its parallelism and reduce its memory footprint. We accelerate this modified ASM algorithm, called GenASM-DC (for Distance Calculation), using an accelerator that performs very efficient distance calculation between two input strings. We also develop a novel Bitap-compatible algorithm for traceback (i.e., a method to collect information about the different types of alignment errors, or differences, between two input strings), called GenASM-TB. Using our modified algorithm and the new GenASM-TB algorithm, we design the first hardware accelerator for Bitap. Figure 19 illustrates a high-level overview of GenASM, depicting the flow of input and intermediate data in the system as well as the communication paths of the two accelerators, GenASM-DC and GenASM-TB. Our hardware accelerator, which is placed in the logic layer of 3D-stacked memory to minimize data movement overheads, consists of specialized systolic-array-based compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth, resulting in an efficient design whose performance scales linearly as we increase the number of compute units working in parallel. Our detailed performance and energy evaluations demonstrate that GenASM provides significant performance and power benefits for three different use cases in genome sequence analysis, outperforming the best prior hardware accelerators as well as software baselines by one or more orders of magnitude. We believe these results are quite promising and point to the need for further exploration of PIM accelerators in genome analysis.

Figure 19: Overview of GenASM. Different components are described in detail in [13]. Figure reproduced from [13].
DC-SRAM sub-text & sub-pattern Read bitvectors Find the traceback output
DC-Controller
Generate bitvectors
GenASM-DC GenASM-TB
TB-SRAM TB-SRAM TB-SRAM n ...
21 3 4 5 6 7
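To give a flavor of why Bitap maps well to hardware, the sketch below implements the textbook Bitap algorithm for approximate string matching (the Wu-Manber variant), which GenASM builds on. This is not GenASM's modified algorithm or its traceback (GenASM-TB differs in several ways described in [13]); it only illustrates that each text character is processed with a handful of bitwise shift/AND/OR operations over pattern-length bitvectors, exactly the kind of operation a simple accelerator can pipeline.

```python
# Textbook Bitap (Wu-Manber variant) for approximate string matching:
# a compact illustration of the algorithm family GenASM accelerates,
# not GenASM's actual modified algorithm [13].

def bitap_approximate(text, pattern, k):
    """Return end positions in `text` where `pattern` matches with
    at most k edits (substitutions, insertions, deletions)."""
    m = len(pattern)
    # Per-character match masks: bit i is 1 if pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    # r[d], bit i set: pattern[:i+1] matches a substring of the text
    # ending at the current position with at most d errors.
    r = [0] * (k + 1)
    hits = []
    for pos, c in enumerate(text):
        b = masks.get(c, 0)
        prev = r[0]
        r[0] = ((r[0] << 1) | 1) & b
        for d in range(1, k + 1):
            cur = r[d]
            r[d] = ((((cur << 1) | 1) & b)    # character match
                    | prev                    # insertion into the text
                    | (prev << 1)             # substitution
                    | (r[d - 1] << 1)         # pattern char deleted in text
                    | 1)
            prev = cur
        if r[k] & (1 << (m - 1)):  # full pattern matched with <= k edits
            hits.append(pos)
    return hits

print(bitap_approximate("GATTACAXGATTACA", "GATTACA", k=1))
```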
NATSA [118] is a near-memory processing accelerator for time series analysis. Time series analysis is a powerful technique for extracting and predicting events, with applications in epidemiology, genomics, neuroscience, astronomy, environmental sciences, economics, etc. NATSA implements matrix profile [386], the state-of-the-art algorithm for time series analysis, fully via PNM. Matrix profile operates on large amounts of time series data, but it has low arithmetic intensity. As a result, data movement represents a major performance bottleneck and energy waste, which NATSA alleviates by performing the complete time series analysis processing near memory using specialized accelerators. NATSA places energy-efficient floating point arithmetic processing units (PUs in Figure 20) close to 3D-stacked HBM memory [149, 152], connected via silicon interposers, as Figure 20 shows. NATSA improves performance by up to 14.2× (9.9× on average) and reduces energy by up to 27.2× (19.4× on average) over the state-of-the-art multi-core implementation. NATSA also improves performance by 6.3× and reduces energy by 10.2× over a general-purpose PNM platform with 64 in-order cores.
Figure 20: NATSA design and integration near HBM memory. A PU is a NATSA Processing Unit that can do energy-efficient floating point arithmetic for time series analysis (its pipeline includes dot product, dot product reutilization, distance calculation, and profile update units). Its components are described in [118]. Figure reproduced from [118].
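The sketch below is a brute-force software analogue of the matrix profile computation that NATSA accelerates. Real implementations, and NATSA's processing units, reuse streaming dot products across neighboring subsequence pairs to avoid recomputation [118, 386]; this naive version only illustrates the data access pattern and the low arithmetic intensity per element fetched, which is exactly why data movement dominates.

```python
# Naive O(n^2 * m) matrix profile: for each length-m subsequence of a
# time series, the z-normalized distance to its nearest non-trivial
# neighbor. Purely illustrative; not NATSA's optimized algorithm.
import numpy as np

def znorm_dist(a, b):
    """Z-normalized Euclidean distance between two subsequences."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return np.linalg.norm(a - b)

def matrix_profile(ts, m):
    """Distance from each length-m subsequence to its nearest neighbor."""
    n = len(ts) - m + 1
    profile = np.full(n, np.inf)
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= m:  # exclude trivially overlapping matches
                profile[i] = min(profile[i],
                                 znorm_dist(ts[i:i + m], ts[j:j + m]))
    return profile

ts = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.05 * np.random.randn(200)
print(matrix_profile(ts, m=16)[:5])
```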
8. Enabling the Adoption of PIM
Pushing some or all of the computation for a program from the CPU to memory introduces new challenges for system architects and programmers to overcome. Figure 21 lists some of these key challenges. These challenges must be addressed carefully and systematically in order for PIM to be adopted as a mainstream architecture in a wide variety of systems and workloads, and in a seamless manner that does not place a heavy burden on the vast majority of programmers. In this section, we discuss several of these system-level and programming-level challenges, and highlight a number of our works that have addressed these challenges for a wide range of PIM architectures. We believe future research should examine solutions to these challenges with an open mindset that is keen on enabling adoption, since the widespread success of the PIM paradigm critically depends on effective solutions to these challenges.
Barriers to Adoption of PIM
1. Functionality of PIM and applications/software for PIM
2. Ease of programming (interfaces and compiler/HW support)
3. System support: coherence & virtual memory
4. Runtime and compilation systems for adaptive scheduling, data mapping, access/sharing control
5. Infrastructures and models to assess benefits and feasibility
All can be solved with a change of mindset.

Figure 21: Potential barriers to the adoption of PIM. Reproduced from [203, 214].
8.1. Programming Models and Code Generation for PIM

Two open research questions to enable the adoption of PIM are (1) what should the programming models be, and (2) how can compilers and libraries alleviate the programming burden?

While PIM-Enabled Instructions [53] work well for offloading fine-grained and small amounts of computation to memory, they can potentially introduce overheads when used to take advantage of PIM for large tasks, due to the need to frequently exchange information between the PIM processing logic and the CPU. Hence, there is a need for researchers to investigate how to integrate PIM instructions with other compiler-based methods or library calls that can support PIM integration, and how these approaches can ease the burden on the programmer, by enabling seamless offloading of instructions or function/library calls. Such solutions can often be platform-dependent. One of our recent works [86] examines compiler-based mechanisms to decide what portions of code should be offloaded to PIM processing logic in a GPU-based system in a manner that is transparent to the GPU programmer. Another recent work [87] examines system-level techniques that decide which GPU application kernels are suitable for PIM execution.

As described in Section 7 with multiple promising examples, different granularities of code offloading in Processing Near Memory architectures have different implications for performance and energy as well as system complexity. These different granularities also have implications on programming and code generation complexity. Adoption-minded solutions should clearly take into account the granularity of code offloading and how a PNM system supports code execution.

Similarly, programming and code generation frameworks for Processing Using Memory approaches like Ambit are also critical for such approaches to become widely adopted. Programming model, compiler, and library support for expressing, extracting, and generating bulk bitwise operations in a program can greatly help the adoption of in-memory bulk bitwise execution models like Ambit. We believe there is exciting research to do in these directions.

Determining effective programming interfaces and the necessary as well as useful compiler/library support to effectively perform PIM remain open research and design questions, which are important for future works to tackle. A minimal sketch of what function-level offload support could look like appears below.
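The sketch below shows one hypothetical shape such support could take: an annotation that registers a function as a PIM offloading candidate and transparently falls back to CPU execution when no PIM engine is available or offloading is not profitable. Nothing here is an existing API; `pim_offload`, `PIM_AVAILABLE`, and `run_on_pim` are illustrative stand-ins for the kind of interface a PIM-aware compiler or library could provide.

```python
# A hypothetical illustration of function-level PIM offload support.
# All names are assumptions for this sketch, not a real API.

PIM_AVAILABLE = False  # would be discovered from the platform at runtime

def run_on_pim(fn, *args):
    """Placeholder for dispatching a function to a PIM engine."""
    raise NotImplementedError("no PIM engine in this sketch")

def pim_offload(fn):
    """Mark a function as a PIM offloading candidate.

    The wrapper dispatches to a PIM engine when one exists; otherwise
    it falls back to normal CPU execution, keeping program semantics
    unchanged for the programmer."""
    def wrapper(*args):
        if PIM_AVAILABLE:
            return run_on_pim(fn, *args)
        return fn(*args)  # transparent CPU fallback
    return wrapper

@pim_offload
def bulk_and(a, b):
    """A bulk bitwise operation: a natural offload candidate."""
    return [x & y for x, y in zip(a, b)]

print(bulk_and([0b1100, 0b1010], [0b1010, 0b0110]))  # -> [8, 2]
```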
8.2. PIM Runtime: Scheduling and Data Mapping

We identify four key runtime issues in PIM: (1) what code to execute near data, (2) when to schedule execution on PIM (i.e., when is it worth offloading computation to the PIM cores), (3) how to map data to multiple memory modules such that PIM execution is viable and effective, and (4) how to effectively share/partition PIM mechanisms/accelerators at runtime across multiple threads/cores to maximize performance and energy efficiency. We have already proposed several approaches to solve these four issues, yet much research remains to be done to enable a robust and effective PIM runtime system that can be effective under many conditions.

The first key issue is to identify which portions of an application are suitable for PIM. We call such portions PIM offloading candidates. While PIM offloading candidates can be identified manually by a programmer, the identification would require significant programmer effort along with a detailed understanding of the hardware tradeoffs between CPU cores and PIM cores. For architects who are adding custom PIM logic (e.g., fixed-function accelerators, which we call PIM accelerators) to memory, the tradeoffs between CPU cores and PIM accelerators may not be known before determining which portions of the application are PIM offloading candidates, since the PIM accelerators are tailored for the PIM offloading candidates. To alleviate the burden of manually identifying PIM offloading candidates, we develop a systematic toolflow for identifying PIM offloading candidates in an application [7, 62, 179, 321]. This toolflow uses a system that executes the entire application on the CPU to evaluate whether each PIM offloading candidate meets the constraints of the system under consideration. For example, when we evaluate workloads for mobile consumer devices (e.g., the Chrome web browser, TensorFlow Mobile, video playback, and video capture) [7], we use hardware performance counters and our energy model to identify candidate functions that could be PIM offloading candidates. A function is a PIM offloading candidate in a mobile consumer device if it meets the following conditions (see the sketch below for a simple software rendering of these checks):

1. It consumes a significant fraction (e.g., more than 30%) of the overall workload energy consumption, since energy reduction is a primary objective in mobile systems and workloads.
2. Its data movement consumes a significant fraction (e.g., more than 30%) of the total workload energy, to maximize the potential energy benefits of offloading to PIM.
3. It is memory-intensive (e.g., its last-level cache misses per kilo instruction, or MPKI, is greater than 10 [200, 387, 388, 389]), as the energy savings of PIM are higher when more data movement is eliminated.
4. Data movement is the single largest component of the function's energy consumption.

Figure 22 shows two example functions in Google's Mobile TensorFlow machine learning inference framework [363, 390] that are identified to be PIM offloading candidates using the afore-described methodology: packing/unpacking and quantization [7]. Note that these functions are together responsible for more than 54% of the data movement energy in the examined neural networks for this workload, which spend more than 57% of their execution energy on data movement, as depicted in Figure 22.

Figure 22: A majority of the data movement energy in the TensorFlow Mobile machine learning inference framework [363, 390] is caused by two key functions. Reproduced from [214]. Originally presented in [7, 391].
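As a simple rendering of the four conditions above, the following sketch filters profiled functions using the example thresholds from the text (30% energy, 30% data movement energy, MPKI above 10). The profile values would come from hardware performance counters and an energy model [7]; the numbers in the example are invented for illustration.

```python
# A minimal sketch of the PIM-offloading-candidate filter described in
# the text. Field values are stand-ins for performance-counter and
# energy-model measurements; the example numbers are made up.
from dataclasses import dataclass

@dataclass
class FunctionProfile:
    name: str
    energy_frac: float            # fraction of total workload energy
    data_move_energy_frac: float  # data movement's fraction of total energy
    llc_mpki: float               # last-level cache misses per kilo-instruction
    data_move_is_largest: bool    # data movement dominates this function's energy

def is_pim_offloading_candidate(p, energy_thresh=0.30,
                                move_thresh=0.30, mpki_thresh=10.0):
    return (p.energy_frac > energy_thresh              # condition 1
            and p.data_move_energy_frac > move_thresh  # condition 2
            and p.llc_mpki > mpki_thresh               # condition 3
            and p.data_move_is_largest)                # condition 4

funcs = [
    FunctionProfile("packing",      0.35, 0.31, 23.1, True),
    FunctionProfile("quantization", 0.33, 0.32, 18.4, True),
    FunctionProfile("activation",   0.10, 0.04,  1.2, False),
]
print([f.name for f in funcs if is_pim_offloading_candidate(f)])
```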
Some of our other recent works in PIM identify suitable PIM offloading candidates at different granularities. PIM-Enabled Instructions [53] propose various operations that can benefit from execution near or inside memory, such as integer increment, integer minimum, floating-point addition, hash table probing, histogram bin index, Euclidean distance, and dot product. GPU applications also contain several parts that are suitable for offloading to PIM engines [86, 87]. Bulk memory operations (copy, initialization) and bulk bitwise operations are good candidates for Ambit-like processing using DRAM approaches [108, 109, 112, 125], as we discussed earlier. For PUM approaches that can execute more complex operations (e.g., addition, multiplication) using memory, the operation complexity (i.e., the latency of an operation for a certain data type) can determine how beneficial offloading to PUM can be compared to CPU execution. A recent analytical model [392] helps to evaluate such offloading tradeoffs in memristor-based PUM [136, 137, 138].

In several of our research works, we propose runtime mechanisms for dynamic scheduling of PIM offloading candidates, i.e., mechanisms that decide whether or not to actually offload code that is marked to be potentially offloaded to PIM engines. In [53], we develop a locality-aware scheduling mechanism for PIM-enabled instructions (see the sketch below). For GPU-based systems [86, 87], we explore the combination of compile-time and runtime mechanisms for identification and dynamic scheduling of PIM offloading candidates.

The best mapping of data and code that enables the maximal benefits from PIM depends on the applications and the computing system configuration, as well as the type of PIM employed in the system. For instance, in order to be able to operate on two source arrays inside DRAM with PUM approaches [108, 109, 110, 111, 112, 120, 124, 145, 319], one key issue is how to guarantee the alignment of the two arrays inside the same DRAM subarray. Practical solutions for this issue need to involve both the memory controller and the operating system, to ensure that arrays aligned in virtual memory can also be physically aligned in DRAM. The programmer and/or the compiler also likely need to carefully annotate and communicate computation patterns on large data blocks so that the system software and the memory controller can cooperatively map the data blocks in an appropriate manner that is amenable to bulk bitwise computation via PUM. Another key issue is how to move partial results generated in one DRAM subarray to other DRAM subarrays to continue the execution with other input operands residing in those subarrays. Several of our recent works [108, 121, 123, 260] propose mechanisms for in-DRAM internal data movement that can facilitate gathering of data in appropriate rows/subarrays/banks in a DRAM chip.

Programmer-transparent data and code mapping mechanisms are especially desirable for PIM adoption. In [86], we present a software/hardware cooperative mechanism to map data and code to several 3D-stacked memory chips in regular GPU applications with relatively regular memory access patterns. This work also deals with effectively sharing PIM engines across multiple threads, as GPU code sections can be offloaded from different GPU cores to the PNM GPU cores in 3D-stacked memory chips.
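Returning to the dynamic-scheduling point above, the sketch below models the spirit of the locality-aware decision in [53]: if the data an operation touches is likely resident in on-chip caches, execute it on the CPU; otherwise execute it in memory. The trivial set-based cache model is purely an assumption for illustration.

```python
# A toy model of locality-aware dynamic offload scheduling, in the
# spirit of PIM-Enabled Instructions [53]. The "cache" here is a plain
# set of resident line addresses; real hardware uses a locality monitor.

CACHE_LINE = 64

class LocalityAwareScheduler:
    def __init__(self):
        self.resident_lines = set()  # stand-in for a cache/locality monitor

    def note_cpu_access(self, addr):
        self.resident_lines.add(addr // CACHE_LINE)

    def schedule(self, op, addr):
        """Return where a PIM-enabled operation should execute."""
        if addr // CACHE_LINE in self.resident_lines:
            return "CPU"   # high locality: cache hit likely, PIM not worth it
        return "PIM"       # low locality: avoid moving the line on-chip

sched = LocalityAwareScheduler()
sched.note_cpu_access(0x1000)
print(sched.schedule("atomic_increment", 0x1000))  # -> CPU
print(sched.schedule("atomic_increment", 0x8000))  # -> PIM
```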
Developing new approaches to data/code mapping and scheduling for a wide variety of applications and possible core and memory configurations is still necessary. In summary, there are still several key research questions that should be investigated in runtime systems for PIM, which perform scheduling and data/code mapping:

• What are simple mechanisms to enable and disable PIM execution? How can PIM execution be throttled for highest performance gains? How should data locations and access patterns affect where/whether PIM execution should occur?

• Which parts of a given application's code should be executed on PIM? What are simple mechanisms to identify when those parts of the application code can benefit from PIM?

• What are scheduling mechanisms to share PIM engines between multiple requesting cores to maximize the benefits obtained from PIM?

• What are simple mechanisms to manage access to a memory that serves both CPU requests and PIM requests?
8.3. Memory Coherence

In a traditional multithreaded execution model that makes use of shared memory, writes to memory must be coordinated between multiple CPU cores, to ensure that threads do not operate on stale data values. Since CPUs include per-core private caches, when one core writes data to a memory address, cached copies of the data held within the caches of other cores must be updated or invalidated, using a mechanism known as cache coherence. Within a modern chip multiprocessor, the per-core caches perform coherence actions over a shared interconnect, with hardware coherence protocols.

Cache coherence is a major system challenge for enabling PIM architectures as general-purpose execution engines, as PIM processing logic can modify the data it processes, and this data may also be needed by CPU cores. If PIM processing logic is coherent with the processor, the PIM programming model is relatively simple, as it remains similar to conventional shared memory multithreaded programming, which makes PIM architectures easier to adopt in general-purpose systems. Thus, allowing PIM processing logic to maintain such a simple and traditional shared memory programming model can facilitate the widespread adoption of PIM. However, employing traditional fine-grained cache coherence (e.g., a cache-block-based MESI protocol [393]) for PIM forces a large number of coherence messages to traverse the narrow processor-memory bus, potentially undoing the benefits of high-bandwidth and low-latency PIM execution. Unfortunately, solutions for coherence proposed by prior PIM works [52, 53, 86] either place some restrictions on the programming model (by eliminating coherence and requiring message-passing-based programming) or limit the performance and energy gains achievable by a PIM architecture.

We have developed a new coherence protocol, CoNDA [62, 179, 321], that maintains cache coherence between PIM processing logic and CPU cores without sending coherence requests for every memory access. Instead, as shown in Figure 23, CoNDA enables efficient coherence by having the PIM logic:

1. speculatively acquire coherence permissions for multiple memory operations over a given period of time (which we call optimistic execution; ① in the figure);
2. batch the coherence requests from the multiple memory operations into a set of compressed coherence signatures (② and ③);
3. send the signatures to the CPU to determine whether the speculation violated any coherence semantics.

Whenever the CPU receives compressed signatures from the PIM core (e.g., when the PIM kernel finishes), the CPU performs coherence resolution (④), where it checks if any coherence conflicts occurred. If a conflict exists, any dirty cache line in the CPU that caused the conflict is flushed, and the PIM core rolls back and re-executes the code that was optimistically executed.

Figure 23: High-level operation of CoNDA, a new coherence mechanism for near-data accelerators, including PNM and PUM. Reproduced from [309]. Originally presented in [179].
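The following is a highly simplified software model of this "lazy" control flow. Signatures here are exact Python sets rather than compressed hardware signatures, and the conflict-resolution model is deliberately crude; it only illustrates the optimistic-execute / batch / check / rollback-or-commit loop described above, not CoNDA's actual hardware design [62, 179, 321].

```python
# A toy model of CoNDA-style lazy coherence checking.

def conda_execute(pim_kernel, memory, cpu_dirty_lines):
    """Run `pim_kernel` optimistically until its signatures pass
    coherence resolution against the CPU's dirty lines."""
    while True:
        snapshot = dict(memory)                   # enables rollback
        read_sig, write_sig = set(), set()
        pim_kernel(memory, read_sig, write_sig)   # 1: optimistic execution
        # 2-3: ship the batched (here: exact) signatures to the CPU.
        # 4: coherence resolution: did the CPU dirty anything we touched?
        if (read_sig | write_sig) & cpu_dirty_lines:
            memory.clear()
            memory.update(snapshot)               # roll back PIM work
            cpu_dirty_lines.clear()               # model: conflicts flushed
            continue                              # re-execute the kernel
        return                                    # no conflict: commit

def kernel(mem, rsig, wsig):
    rsig.add("A")                                 # record addresses touched
    mem["B"] = mem["A"] + 1
    wsig.add("B")

mem = {"A": 41, "B": 0}
conda_execute(kernel, mem, cpu_dirty_lines={"A"})
print(mem["B"])  # 42, after one rollback and re-execution
```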
As a result of this "lazy" checking of coherence violations, CoNDA approaches near-ideal coherence behavior: the performance and energy consumption of a PIM architecture with CoNDA are, respectively, within 10.4% and 4.4% of the performance and energy consumption of a system where coherence is performed at zero latency and energy cost.

Despite the leap that CoNDA [62, 179, 321] represents for memory coherence in computing systems with PIM support, we believe that it is still necessary to explore other solutions for memory coherence that can efficiently deal with all types of workloads and PIM offloading granularities, as well as different approaches to PIM.

8.4. Virtual Memory Support

When an application needs to access its data inside the main memory, the CPU core must first perform an address translation, which converts the data's virtual address into a physical address within main memory. If the translation metadata is not available in the CPU's translation lookaside buffer (TLB), the CPU must invoke the page table walker in order to perform a long-latency page table walk that involves multiple sequential reads to the main memory and lowers the application's performance. In modern systems, the virtual memory system also provides access protection mechanisms.

A naive solution to reducing the overhead of page walks is to utilize PIM engines to perform page table walks. This can be done by duplicating the content of the TLB and moving the page walker to the PIM processing logic in main memory. Unfortunately, this is either difficult or expensive, for three reasons. First, coherence has to be maintained between the CPU's TLBs and the memory-side TLBs. This introduces extra complexity and off-chip requests. Second, duplicating the TLBs increases the storage and complexity overheads on the memory side, which should be carefully contained. Third, if main memory is shared across CPUs with different types of architectures, page table structures and the implementation of address translations can be different across the different architectures. Ensuring compatibility between the in-memory TLB/page walker and all possible types of virtual memory architecture designs can be complicated and often not even practically feasible.

To address these concerns and reduce the overhead of virtual memory, we explore a tractable solution for PIM address translation as part of our in-memory pointer chasing accelerator, IMPICA [89]. IMPICA exploits the high bandwidth available within 3D-stacked memory to traverse a chain of virtual memory pointers within DRAM, without having to look up virtual-to-physical address translations in the CPU translation lookaside buffer (TLB) and without using the page walkers within the CPU. IMPICA's key ideas are (1) to use a region-based page table, which is optimized for PIM acceleration, and (2) to decouple address calculation and memory access with two specialized engines. IMPICA improves the performance of pointer chasing operations in three commonly-used linked data structures (linked lists, hash tables, and B-trees) by 92%, 29%, and 18%, respectively. On a real database application, DBx1000, IMPICA improves transaction throughput and response time by 16% and 13%, respectively. IMPICA also reduces overall system energy consumption (by 41%, 23%, and 10% for the three commonly-used data structures, and by 6% for DBx1000).
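The sketch below illustrates the region-based translation idea conceptually: once a large virtual region is registered, translating an address within it is a single flat lookup plus an offset, so an in-memory engine can chase pointers without a multi-level page walk. The flat region table and all names here are illustrative stand-ins, not IMPICA's actual translation structure [89].

```python
# A conceptual sketch of in-memory pointer chasing with a region-based
# page table, in the spirit of IMPICA [89]. Region size is an assumption.

REGION_BITS = 21  # e.g., 2 MB regions

class RegionTranslator:
    def __init__(self):
        self.regions = {}  # virtual region number -> physical base address

    def map_region(self, vaddr_base, paddr_base):
        self.regions[vaddr_base >> REGION_BITS] = paddr_base

    def translate(self, vaddr):
        base = self.regions[vaddr >> REGION_BITS]  # one flat lookup
        return base + (vaddr & ((1 << REGION_BITS) - 1))

def chase(physical_mem, translator, head_vaddr, key):
    """Walk a linked list entirely 'inside memory': each node is
    (key, next_vaddr); no CPU TLB or page walker is involved."""
    v = head_vaddr
    while v != 0:
        node_key, next_v = physical_mem[translator.translate(v)]
        if node_key == key:
            return v
        v = next_v
    return None

tr = RegionTranslator()
tr.map_region(0x40000000, 0)             # register the list's region
mem = {0: (7, 0x40000010), 16: (42, 0)}  # two nodes in 'physical memory'
print(hex(chase(mem, tr, 0x40000000, 42)))  # -> 0x40000010
```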
Beyond the pointer chasing operations that are tackled by IMPICA [89], providing efficient mechanisms for PIM-based virtual-to-physical address translation (as well as access protection) remains a challenge for the generality of applications, especially those that access large amounts of virtual memory [372, 373, 394].

Looking forward, we recently introduced a fundamentally-new virtual memory framework, the Virtual Block Interface (VBI) [395], which proposes to delegate physical memory management duties completely to the memory controller hardware as well as other specialized hardware. Figure 24 compares VBI to conventional virtual memory at a very high level: conventional virtual memory maps per-process virtual address spaces to physical memory via page tables managed by the OS, whereas VBI maps virtual blocks to physical memory via a memory translation layer in the memory controller. Designing VBI-based PIM units that manage memory allocation and address translation can help fundamentally overcome this important virtual memory challenge of PIM systems. We refer the reader to our VBI work [395] for details.

Figure 24: The Virtual Block Interface versus conventional virtual memory. Reproduced from [395].
8.5. Data Structures for PIM

Current systems with many cores run applications with concurrent data structures to achieve high performance and scalability, with significant benefits over sequential data structures. Such concurrent data structures are often used in heavily-optimized server systems today, where high performance is critical. To enable the adoption of PIM in such many-core systems, it is necessary to develop concurrent data structures that are specifically tailored to take advantage of PIM.

Pointer chasing data structures and contended data structures require careful analysis and design to leverage the high bandwidth and low latency of 3D-stacked memories [98]. First, pointer chasing data structures, such as linked lists and skip lists, have a high degree of inherent parallelism and low contention, but a naive implementation in PIM cores is burdened by hard-to-predict memory access patterns. By combining and partitioning the data across 3D-stacked memory vaults, it is possible to fully exploit the inherent parallelism of these data structures. Second, contended data structures, such as FIFO queues, are a good fit for CPU caches because they expose high locality. However, they suffer from high contention when many threads access them concurrently. Their performance on traditional CPU systems can be improved using a new PIM-based FIFO queue [98]. The proposed PIM-based FIFO queue uses a PIM core to perform enqueue and dequeue operations requested by CPU cores. The PIM core can pipeline requests from different CPU cores for improved performance, as the sketch below illustrates.
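The following is a minimal software model of such a PIM-managed FIFO queue: CPU cores do not touch the queue's memory directly, but instead send enqueue/dequeue requests to a single PIM core that owns the queue and processes requests in order, which removes cache-line contention between requesting cores. Threads and message queues here are Python stand-ins for cores and hardware request channels; this is an illustration of the idea in [98], not its implementation.

```python
# A toy PIM-managed FIFO queue: one "PIM core" thread serializes all
# enqueue/dequeue requests from "CPU cores".
import queue
import threading
from collections import deque

class PIMQueue:
    def __init__(self):
        self.requests = queue.Queue()   # CPU -> PIM request channel
        self.data = deque()             # queue storage, owned by the PIM core
        threading.Thread(target=self._pim_core, daemon=True).start()

    def _pim_core(self):
        # The single PIM core serializes (and could pipeline) all requests.
        while True:
            op, arg, reply = self.requests.get()
            if op == "enq":
                self.data.append(arg)
                reply.put(None)
            else:  # "deq"
                reply.put(self.data.popleft() if self.data else None)

    def enqueue(self, x):
        r = queue.Queue()
        self.requests.put(("enq", x, r))
        r.get()  # wait for the PIM core's acknowledgment

    def dequeue(self):
        r = queue.Queue()
        self.requests.put(("deq", None, r))
        return r.get()

q = PIMQueue()
for i in range(3):
    q.enqueue(i)
print([q.dequeue() for _ in range(3)])  # -> [0, 1, 2]
```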
As recent work [98] shows, PIM-managed concurrent data structures can outperform state-of-the-art concurrent data structures that are designed for and executed on multiple cores. We believe and hope that future work will enable other types of data structures (e.g., hash tables, search trees, priority queues) to benefit from PIM-managed designs.

8.6. Benchmarks and Simulation Infrastructures

To ease the adoption of PIM, it is critical that we accurately assess the benefits and shortcomings of PIM. Accurate assessment of PIM requires (1) a preferably large set of real-world memory-intensive applications that have the potential to benefit significantly when executed near memory, (2) a rigorous methodology to (automatically) identify PIM offloading candidates, and (3) simulation/evaluation infrastructures that allow architects and system designers to accurately analyze the benefits and overheads of adding PIM processing logic to memory and executing code on this processing logic.

In order to explore what processing logic should be introduced near memory, and to know what properties are ideal for PIM kernels, we believe it is important to begin by developing a real-world benchmark suite of a wide variety of applications that can potentially benefit from PIM. While many data-intensive applications, such as pointer chasing and bulk memory copy, can potentially benefit from PIM, it is crucial to examine important candidate applications for PIM execution, and for researchers to agree on a common set of these candidate applications to focus the efforts of the community, as well as to enable reproducibility of results, which is important to assess the relative benefits of different ideas developed by different researchers. We believe that these applications should come from a number of popular and emerging domains. Examples of potential domains include data-parallel applications, neural networks, machine learning, graph processing, data analytics, search/filtering, mobile workloads, bioinformatics, Hadoop/Spark programs, security/cryptography, and in-memory data stores. Many of these applications have large data sets and can benefit from the high memory bandwidth and low memory latency provided by computation near memory.
In our prior work, we have started identifying several applications that can benefit from PIM in graph processing frameworks [52, 53], pointer chasing [51, 89], databases [62, 89, 97, 179, 321], consumer workloads [7], time series analysis [118], genome analysis [13, 115], machine learning [7], and GPGPU workloads [86, 87]. However, there is significant room for the methodical development of a large-scale PIM benchmark suite, for which our very recent work [16] takes the first steps.

A systematic methodology for (automatically) identifying potential PIM kernels (i.e., code portions that can benefit from PIM) within an application can, among many other benefits, (1) ease the burden of programming PIM architectures by aiding the programmer to identify what should be offloaded, (2) ease the burden of and improve the reproducibility of PIM research, (3) drive the design and implementation of PIM functional units that many types of applications can leverage, (4) inspire the development of tools that programmers and compilers can use to automate the process of offloading portions of existing applications to PIM processing logic, and (5) lead the community towards convergence on PIM designs and offloading candidates. In a very recent work that is to appear in 2021 [16], we take the first steps in developing such a methodology and the first benchmark suite for PIM. We refer the reader to that work [16] for a detailed description and analyses of the methodology and the new PIM benchmark suite. We believe this work opens up many more steps to extend the methodology and develop other new methodologies for identifying PIM kernels, as well as automatic tools (e.g., profilers, compilers, runtime systems) that implement these methodologies, generate optimized code for PIM (potentially with help from programmer annotations), coordinate offloading to PIM cores, etc.

Along these lines, our NAPEL [119] work is an early example of an ML-based performance and energy estimation framework for PNM. NAPEL leverages ensemble learning techniques to generate PNM performance and energy prediction models that are based on microarchitecture parameters and application characteristics. Figure 25 shows the high-level overview of NAPEL training and prediction, the components of which are explained in detail in [119]. Our evaluations show that NAPEL can make fast yet accurate predictions of PIM offloading suitability for previously-unseen applications on general-purpose PNM architectures.
Figure 25: Overview of NAPEL training and prediction (LLVM-based code instrumentation extracts application features such as instruction mix, memory behavior, and ILP; these are combined with microarchitecture simulation results to train an ensemble model used for performance/energy prediction on new applications). Components are explained in detail in [119]. Figure reproduced from [119].
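A toy analogue of this approach is sketched below, assuming scikit-learn is available: train an ensemble model on (application features, microarchitecture parameters) pairs, then predict offloading suitability for an unseen application. The feature names and the synthetic training data are invented for illustration; NAPEL's real features come from LLVM-based instrumentation and microarchitecture simulation, and its models predict both performance and energy [119].

```python
# Ensemble-learning sketch in the spirit of NAPEL [119]; the data and
# feature set below are synthetic stand-ins.
from sklearn.ensemble import RandomForestRegressor

# Features: [memory intensity (MPKI), instruction-level parallelism,
#            PNM core count] -> observed speedup over CPU (synthetic).
X = [[25.0, 1.2,  4], [3.0, 3.5,  4], [30.0, 1.0, 16],
     [12.0, 2.0,  8], [1.0, 4.0, 16], [22.0, 1.5,  8]]
y = [2.8, 0.7, 6.1, 1.9, 0.5, 3.4]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Predict PIM offloading suitability for a previously-unseen application.
unseen = [[27.0, 1.1, 8]]
pred = model.predict(unseen)[0]
print(f"predicted speedup: {pred:.2f} -> "
      f"{'offload' if pred > 1.0 else 'keep on CPU'}")
```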
We also need simulation infrastructures to accurately model the performance and energy of PIM hardware structures, available memory bandwidth, and communication overheads when we execute code near or inside memory. Highly-flexible and commonly-used memory simulators (e.g., Ramulator [148, 396], SoftMC [38, 332]) can be combined with full-system simulators (e.g., gem5 [397], zsim [398], gem5-gpu [399], GPGPU-Sim [400]) to provide a robust environment that can evaluate how various PIM architectures affect the entire compute stack, and can allow designers to identify memory, workload, and system characteristics that affect the efficiency of PIM execution. A powerful open-source simulation infrastructure that provides such an environment is Ramulator-PIM [401], first introduced by our NAPEL framework [119], which combines Ramulator [148, 396] and zsim [398]. Ramulator-PIM can simulate a wide range of configurations of PIM in-order and out-of-order cores and accelerators with different memory technologies.

8.7. Real PIM Hardware Systems and Prototypes

As industry and academia push toward enabling the PIM paradigm, it will be important to also provide real PIM hardware or prototypes. Such hardware can greatly enable and accelerate evaluations of both adoption and research issues in PIM, leading to learnings from real workloads executed on real systems, and thus better PIM systems over time. Such real hardware for PIM is very much useful for both PUM and PNM approaches.

We are aware of at least two such real hardware systems. First, ComputeDRAM [122], which is based on the SoftMC memory controller infrastructure [38], can potentially provide the opportunity to test the RowClone (Section 6.1) and Ambit (Section 6.2) PUM approaches on real workloads, albeit likely at reduced reliability since it exploits off-the-shelf DRAM chips, as we discussed in Section 6.2.

Second, the recent UPMEM PIM architecture [402, 403], shown in Figure 26, is the first real-world publicly-available PIM architecture. This PNM system consists of one simple processor (called a DRAM Processing Unit, DPU) implemented next to each bank in a DRAM chip. A DPU has high-bandwidth, low-latency, low-energy access to all the data in its corresponding bank. UPMEM has produced real DRAM modules that contain 16 PNM-capable DRAM chips each. Each DRAM chip includes eight 64-MB DRAM banks, each of which has a DPU attached running at a few hundred MHz. A full-blown UPMEM system configuration is expected to soon have 2560 DPUs capable of operating on 160 GB of DRAM memory. We believe the existence of such real PIM hardware can greatly enable and accelerate software and adoption-related research for PIM, specifically PNM architectures, and can set a promising and useful baseline for future research in PNM systems.
Figure 26: UPMEM PIM architecture and hardware (standard DDR4 R-DIMM modules, built in a standard 2x-nm DRAM process, combining a large number of DPU processors with DRAM chips attached to the host CPU over the DDR data bus). Reproduced from [214].
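To convey the programming pattern such bank-level PNM systems imply, the sketch below models a generic host-side offload flow: partition data across banks, run a kernel on each bank's local slice, then gather and reduce the partial results on the host. This is explicitly not the UPMEM SDK (which is C-based [402, 403]); every name here is an illustrative stand-in, and the "DPUs" run sequentially in this model rather than in parallel.

```python
# A conceptual model of offloading to bank-level PNM cores: scatter,
# per-bank compute, gather. Purely illustrative; not a real PIM API.

NUM_DPUS = 8  # e.g., one DPU per bank in a chip

def scatter(data, n):
    """Partition input across n banks (host -> DRAM copy)."""
    data = list(data)
    step = (len(data) + n - 1) // n
    return [data[i * step:(i + 1) * step] for i in range(n)]

def dpu_kernel(local_slice):
    """Runs on one DPU, touching only its own bank's data."""
    return sum(x * x for x in local_slice)

def host_offload(data):
    slices = scatter(data, NUM_DPUS)
    partials = [dpu_kernel(s) for s in slices]  # in reality: parallel DPUs
    return sum(partials)                        # gather + host-side reduce

print(host_offload(range(1000)))  # sum of squares, computed "in memory"
```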
8.8. Security Considerations

As a new processing paradigm, PIM introduces new security considerations related to its integration in real-world computing systems. First, there is a need to provide security guarantees in systems with PIM capabilities, so that applications that offload code can execute securely in PIM computation units. Naively providing access to PIM computation units for all concurrently-executing applications may lead to potentially unforeseen data leakage and other issues. Second, the ability to perform computation inside or near memory using PIM can enable the opportunity to specialize such computation mechanisms to enhance system security (as briefly discussed in Section 6.4). We cover each of these topics briefly, but envision many future ideas related to them in future PIM research and designs.

First, PIM computation units should provide at least as good security primitives as the processor-centric computation units of today. This means that there should at least be isolation between concurrently-executing processes on PIM computation units, and access control to PIM resources (both data storage and computation units) should be securely managed. Partitioning of computation units, as done in [395], can enable isolation. We believe new approaches to virtualization and cross-layer design that provide extensive hardware management capabilities in the memory controllers, such as the Virtual Block Interface (VBI) [395] or Expressive Memory [404, 405], can not only make PIM security mechanisms easier and more effective to implement, but can also provide much more enhanced PIM security mechanisms than existing systems.

As in existing systems, reliability and data integrity are important in PIM systems, especially in PUM approaches, where memory rows can be frequently activated and deactivated. The RowHammer vulnerability [20, 24, 45, 46, 167, 211] (Section 2) can potentially become exacerbated in PIM systems, but it can also be more easily preventable using an intelligent memory controller, as in PIM. The cell wearout problem due to endurance limitations in some modern NVM technologies can limit the reliability, and thus the effectiveness, of NVM-based PUM approaches [104, 136, 137, 138, 140] and thus needs to be addressed. Employing in-memory error correcting code (ECC) techniques [170, 171, 406] is likely necessary in future PIM approaches, and PIM systems should likely be designed to support ECC techniques to maintain data reliability in the presence of computation mechanisms using/near memory and increasing noise and reliability problems due to technology scaling.

Second, the PIM paradigm enables new opportunities to increase the security and privacy of computations and data, and thus entire computing systems. If data and computation stay within one chip, then the exposure of such data and computation to many attacks will likely be minimized. By eliminating data movement between memory and processor, the PIM paradigm takes a large step towards getting rid of one of the most attacker-exposed types of data movement, i.e., data movement over the main memory bus. Enabling the secure and private execution of computations in PIM systems can therefore potentially enable fundamentally more secure computing systems. This requires providing support for such secure computation, as we discussed earlier in Section 6.4. For example, our afore-described DRAM latency PUF [261] and DRAM latency True Random Number Generator [172] are notable examples of novel in-DRAM security primitives that take advantage of Processing Using Memory, and that were briefly discussed in Section 6.4. We envision that future works on PIM will provide many other security primitives, applications, and use cases.
9. Conclusion and Future Outlook
Data movement is a major performance and energy bottleneck plaguing modern computing systems. A large fraction of system energy is spent on moving data across the memory hierarchy into the processors (and accelerators), the only place where computation is performed in a modern system. Fundamentally, the large amounts of data movement are caused by the processor-centric design paradigm of modern computing systems: processing of data is performed only in the processors (and accelerators), which are far away from the data, and as a result, data moves a lot in the system to facilitate computation on it.

In this work, we argue for a paradigm shift in the design of computing systems toward a data-centric design paradigm that enables computation capability in places where data resides, and thus performs computation with minimal data movement. Processing-in-memory (PIM) is a fundamentally data-centric design approach for computing systems that enables the ability to perform operations in or near memory. Recent advances in modern memory architectures have enabled us to extensively explore two novel approaches to designing PIM architectures: PUM (Processing Using Memory) and PNM (Processing Near Memory). First, we show that PUM exploits the existing DRAM architecture and the operational principles of the DRAM circuitry, enabling a number of important and widely-used operations (e.g., memory copy, data initialization, bulk bitwise operations, data reorganization) within DRAM, with minimal changes to DRAM chips. Similar PUM approaches are also applicable to other types of memory chips, and all yield large performance and energy benefits. Second, we demonstrate that PNM can exploit the embedded computation capability in the logic layer of 3D-stacked memory in a variety of ways to provide significant performance improvements and energy savings, across a large range of application domains and computing platforms. Similar PNM approaches are applicable to different types of memories and also to memory controllers.

Despite the extensive design space that we have studied so far, a number of key challenges remain to enable the widespread adoption of PIM in future computing systems [126, 127]. Important challenges include developing easy-to-use programming models for PIM (e.g., PIM application interfaces, compilers and libraries designed to abstract away PIM architecture details from programmers), and extensive runtime support for PIM (e.g., scheduling PIM operations, sharing PIM logic among CPU threads, cache coherence, virtual memory support). We hope that providing the community with (1) a large set of memory-intensive benchmarks that can potentially benefit from PIM, (2) a rigorous methodology to identify PIM-suitable parts within an application, and (3) accurate simulation infrastructures for estimating the benefits and overheads of PIM will empower researchers to address the remaining challenges for the adoption of PIM.

We firmly believe that it is time to design principled system architectures to solve the data movement problem of modern computing systems, which is caused by the rigid dichotomy and imbalance between the computing unit (CPUs and accelerators) and the memory/storage unit. Fundamentally solving the data movement problem requires a paradigm shift to a more data-centric computing system design, where computation happens where data resides (i.e., in or near memory/storage), with minimal movement of data. Such a paradigm shift can greatly push the boundaries of future computing systems, leading to orders of magnitude improvements in energy and performance (as we demonstrated with some examples in this work), potentially enabling new applications and computing platforms.

Acknowledgments
This chapter is a drastically revised and extended ver-sion of an earlier article published in 2019 [9]. Thischapter also incorporates revised material from anotherearlier article published in 2019 [11]. The shorter, ini-tial version of this work [9] is based on a keynote talkdelivered by Onur Mutlu at the 3rd Mobile System Tech-nologies (MST) Workshop in Milan, Italy on 27 October2017 [407].The mentioned keynote talk is similar to a series oftalks given by Onur Mutlu in a wide variety of venuessince 2015 until now. This talk has evolved significantlyover time with the accumulation of new works and feed-back received from many audiences. Recent versionsof the talk were delivered as a distinguished lecture atGeorge Washington University in February 2019 [408],as an Invited Talk at ISSCC Special Forum on ”Intelli-gence at the Edge: How Can We Make Machine Learn-ing More Energy E ffi cient?”, as part of the 2019 In-ternational Solid State Circuits Conference in February2019 [409], as a keynote talk at the 29th ACM GreatLakes Symposium on VLSI [410], as a keynote talkat the International Symposium on Advanced ParallelProcessing Technology in August 2019 [411], and as akeynote talk at the 37th IEEE International Conferenceon Computer Design in November 2019 [203].This article and the associated talks are based on re-search done over the course of the past nine years in theSAFARI Research Group on the topic of processing-in-memory (PIM). We thank all of the members of the SA-FARI Research Group, and our collaborators at CarnegieMellon, ETH Z¨urich, and other universities, who havecontributed to the various works we describe in this paper.Thanks also goes to our research group’s industrial spon-sors over the past ten years, especially Alibaba, ASML,Google, Huawei, Intel, Microsoft, NVIDIA, Samsung,Seagate, and VMware. This work was also partially supported by the Intel Science and Technology Centerfor Cloud Computing, the Semiconductor Research Cor-poration, the Data Storage Systems Center at CarnegieMellon University, various NSF and NIH grants, andvarious awards, including the NSF CAREER Award,the Intel Faculty Honor Program Award, and a numberof Google and IBM Faculty Research Awards to OnurMutlu. References [1] O. Mutlu, Memory Scaling: A Systems Architecture Perspec-tive, IMW (2013).[2] O. Mutlu, L. Subramanian, Research Problems and Opportuni-ties in Memory Systems, SUPERFRI (2014).[3] J. Dean, L. A. Barroso, The Tail at Scale, Communications ofthe ACM (2013).[4] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Mose-ley, G.-Y. Wei, D. Brooks, Profiling a Warehouse-Scale Com-puter, in: ISCA, 2015.[5] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee,D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, B. Falsafi,Clearing the Clouds: A Study of Emerging Scale-Out Work-loads on Modern Hardware, in: ASPLOS, 2012.[6] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao,Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, B. Qiu,BigDataBench: A Big Data Benchmark Suite From InternetServices, in: HPCA, 2014.[7] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu,R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan,O. Mutlu, Google Workloads for Consumer Devices: MitigatingData Movement Bottlenecks, in: ASPLOS, 2018.[8] O. Mutlu, S. Ghose, J. G´omez-Luna, R. Ausavarungnirun,Enabling Practical Processing in and near Memory for Data-Intensive Computing, in: DAC, 2019.[9] O. 
Mutlu, et al., Processing Data Where It Makes Sense: En-abling In-Memory Computation, MicPro (2019).[10] O. Mutlu, Intelligent Architectures for Intelligent Ma-chines, https://people.inf.ethz.ch/omutlu/pub/intelligent-architectures-for-intelligent-machines_keynote-paper_VLSI20.pdf (2020).[11] S. Ghose, A. Boroumand, J. S. Kim, J. G´omez-Luna, O. Mutlu,Processing-in-Memory: A Workload-driven Perspective, IBMJRD (2019).[12] M. Alser, Z. Bing¨ol, D. Senol Cali, J. Kim, S. Ghose, C. Alkan,O. Mutlu, Accelerating Genome Analysis: A Primer on anOngoing Journey, IEEE Micro (2020).[13] D. S. Cali, G. S. Kalsi, Z. Bing¨ol, C. Firtina, L. Subrama-nian, J. S. Kim, R. Ausavarungnirun, M. Alser, J. Gomez-Luna,A. Boroumand, et al., GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Frameworkfor Genome Sequence Analysis, in: MICRO, 2020.[14] S. Koppula, L. Orosa, A. G. Ya˘glıkc¸ı, R. Azizi, T. Shahroodi,K. Kanellopoulos, O. Mutlu, EDEN: Enabling Energy-E ffi cient,High-Performance Deep Neural Network Inference using Ap-proximate DRAM, in: MICRO, 2019.[15] K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi,S. Koppula, N. Mansouri Ghiasi, T. Shahroodi, J. Gomez-Luna,O. Mutlu, SMASH: Co-designing Software Compression andHardware-Accelerated Indexing for E ffi cient Sparse Matrix Op-erations, in: MICRO, 2019.[16] G. F. Oliveira, J. Gomez-Luna, L. Orosa, S. Ghose, N. Vijayku-mar, I. Fernandez, M. Sadrosadati, O. Mutlu, A New Method- logy and Open-Source Benchmark Suite for Evaluating DataMovement Bottlenecks: A Near-Data Processing Case Study,in: SIGMETRICS, 2021.[17] U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains,S. Jang, J. Choi, Co-Architecting Controllers and DRAM toEnhance DRAM Process Scaling, in: The Memory Forum,2014.[18] S. A. McKee, Reflections on the Memory Wall, in: CF, 2004.[19] M. V. Wilkes, The Memory Gap and the Future of High Perfor-mance Memories, CAN (2001).[20] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilk-erson, K. Lai, O. Mutlu, Flipping Bits in Memory WithoutAccessing Them: An Experimental Study of DRAM Distur-bance Errors, in: ISCA, 2014.[21] Y. Kim, V. Seshadri, D. Lee, J. Liu, O. Mutlu, A Case forExploiting Subarray-Level Parallelism (SALP) in DRAM, in:ISCA, 2012.[22] Y. Kim, Architectural Techniques to Enhance DRAM Scaling,Ph.D. thesis, Carnegie Mellon University (2015).[23] J. Liu, B. Jaiyen, R. Veras, O. Mutlu, RAIDR: Retention-AwareIntelligent DRAM Refresh, in: ISCA, 2012.[24] O. Mutlu, The RowHammer Problem and Other Issues We MayFace as Memory Becomes Denser, in: DATE, 2017.[25] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, O. Mutlu,Decoupled Direct Memory Access: Isolating CPU and IO Traf-fic by Leveraging a Dual-Data-Port DRAM, in: PACT, 2015.[26] B. C. Lee, E. Ipek, O. Mutlu, D. Burger, Architecting PhaseChange Memory as a Scalable DRAM Alternative, in: ISCA,2009.[27] H. Yoon, J. Meza, R. Ausavarungnirun, R. A. Harding, O. Mutlu,Row Bu ff er Locality Aware Caching Policies for Hybrid Mem-ories, in: ICCD, 2012.[28] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, O. Mutlu,E ffi cient Data Mapping and Bu ff ering Techniques for MultilevelCell Phase-Change Memories, ACM TACO (2014).[29] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt,T. F. Wenisch, Disaggregated Memory for Expansion and Shar-ing in Blade Servers, in: ISCA, 2009.[30] W. A. Wulf, S. A. McKee, Hitting the Memory Wall: Implica-tions of the Obvious, CAN (1995).[31] K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh,D. Lee, T. Li, G. 
Pekhimenko, S. Khan, O. Mutlu, Understand-ing Latency Variation in Modern DRAM Chips: ExperimentalCharacterization, Analysis, and Optimization, in: SIGMET-RICS, 2016.[32] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, O. Mutlu,Tiered-Latency DRAM: A Low Latency and Low Cost DRAMArchitecture, in: HPCA, 2013.[33] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang,O. Mutlu, Adaptive-Latency DRAM: Optimizing DRAM Tim-ing for the Common-Case, in: HPCA, 2015.[34] K. K. Chang, A. G. Ya˘glıkc¸ı, S. Ghose, A. Agrawal, N. Chatter-jee, A. Kashyap, D. Lee, M. O’Connor, H. Hassan, O. Mutlu,Understanding Reduced-Voltage Operation in Modern DRAMDevices: Experimental Characterization, Analysis, and Mecha-nisms, in: SIGMETRICS, 2017.[35] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarung-nirun, G. Pekhimenko, V. Seshadri, O. Mutlu, Design-InducedLatency Variation in Modern DRAM Chips: Characterization,Analysis, and Latency Reduction Mechanisms, in: SIGMET-RICS, 2017.[36] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza,A. Kansal, J. Liu, B. Khessib, K. Vaid, O. Mutlu, CharacterizingApplication Memory Error Vulnerability to Optimize Datacen-ter Cost via Heterogeneous-reliability Memory, in: DSN, 2014. [37] Y. Luo, S. Ghose, T. Li, S. Govindan, B. Sharma, B. Kelly,A. Boroumand, O. Mutlu, Using ECC DRAM to AdaptivelyIncrease Memory Capacity, arXiv:1706.08870 [cs:AR] (2017).[38] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang,G. Pekhimenko, D. Lee, O. Ergin, O. Mutlu, SoftMC: A Flexi-ble and Practical Open-Source Infrastructure for Enabling Ex-perimental DRAM Studies, in: HPCA, 2017.[39] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee,O. Ergin, O. Mutlu, ChargeCache: Reducing DRAM Latencyby Exploiting Row Access Locality, in: HPCA, 2016.[40] K. K. Chang, Understanding and Improving the Latency ofDRAM-Based Memory Systems, Ph.D. thesis, Carnegie MellonUniversity (2017).[41] M. Patel, J. S. Kim, O. Mutlu, The Reach Profiler (REAPER):Enabling the Mitigation of DRAM Retention Failures via Pro-filing at Aggressive Conditions, in: ISCA, 2017.[42] H. Hassan, M. Patel, J. S. Kim, A. G. Yaglikci, N. Vijaykumar,N. M. Ghiasi, S. Ghose, O. Mutlu, CROW: A Low-Cost Sub-strate for Improving DRAM Performance, Energy E ffi ciency,and Reliability, in: ISCA, 2019.[43] S. Ghose, A. G. Yaglikc¸i, R. Gupta, D. Lee, K. Kudrolli, W. X.Liu, H. Hassan, K. K. Chang, N. Chatterjee, A. Agrawal,M. O’Connor, O. Mutlu, What Your DRAM Power ModelsAre Not Telling You: Lessons from a Detailed ExperimentalStudy, in: SIGMETRICS, 2018.[44] J. Kim, M. Patel, H. Hassan, O. Mutlu, Solar-DRAM: ReducingDRAM Access Latency by Exploiting the Variation in LocalBitlines, in: ICCD, 2018.[45] J. S. Kim, M. Patel, A. G. Ya˘glıkc¸ı, H. Hassan, R. Azizi,L. Orosa, O. Mutlu, Revisiting RowHammer: An ExperimentalAnalysis of Modern DRAM Devices and Mitigation Techniques,in: ISCA, 2020.[46] O. Mutlu, J. S. Kim, RowHammer: A Retrospective, IEEETransactions on Computer-Aided Design of Integrated Circuitsand Systems (2020).[47] Y. Wang, A. Tavakkol, L. Orosa, S. Ghose, N. Mansouri Ghiasi,M. Patel, J. S. Kim, H. Hassan, M. Sadrosadati, O. Mutlu, Re-ducing DRAM Latency via Charge-Level-Aware Look-AheadPartial Restoration, in: MICRO, 2018.[48] O. Mutlu, S. Ghose, R. Ausavarungnirun, Recent Advances inDRAM and Flash Memory Architectures, Invited Journal IssueIPSI Transactions on Internet Research (2018).[49] S. Ghose, T. Li, N. Hajinazar, D. S. Cali, O. 
Mutlu, Demystify-ing Complex Workload-DRAM Interactions: An ExperimentalStudy, in: SIGMETRICS, 2019.[50] M. Hashemi, Khubaib, E. Ebrahimi, O. Mutlu, Y. N. Patt, Ac-celerating Dependent Cache Misses with an Enhanced MemoryController, in: ISCA, 2016.[51] M. Hashemi, O. Mutlu, Y. N. Patt, Continuous Runahead: Trans-parent Hardware Acceleration for Memory Intensive Workloads,in: MICRO, 2016.[52] J. Ahn, S. Hong, S. Yoo, O. Mutlu, K. Choi, A ScalableProcessing-in-Memory Accelerator for Parallel Graph Process-ing, in: ISCA, 2015.[53] J. Ahn, S. Yoo, O. Mutlu, K. Choi, PIM-Enabled Instructions:A Low-Overhead, Locality-Aware Processing-in-Memory Ar-chitecture, in: ISCA, 2015.[54] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, D. Glasco,GPUs and the Future of Parallel Computing, Micro, IEEE(2011).[55] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. S. Jr., J. Emer,Adaptive Insertion Policies for High-Performance Caching, in:ISCA, 2007.[56] M. K. Qureshi, M. A. Suleman, Y. N. Patt, Line Distillation:Increasing Cache Capacity by Filtering Unused Words in CacheLines, in: HPCA, 2007.
57] R. Kumar, G. Hinton, A Family of 45nm IA Processors, in:ISSCC, 2009.[58] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl,D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain,T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla,M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries,T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De,R. V. D. Wijngaart, T. Mattson, A 48-Core IA-32 Message-passing Processor with DVFS in 45nm CMOS, in: ISSCC,2010.[59] R. Jotwani, S. Sundaram, S. Kosonocky, A. Schaefer, V. An-drade, G. Constant, A. Novak, S. Na ff ziger, An x86-64 CoreImplemented in 32nm SOI CMOS, in: ISSCC, 2010.[60] K. Gillespie, H. R. Fair, C. Henrion, R. Jotwani, S. Kosonocky,R. S. Orefice, D. A. Priore, J. White, K. Wilcox, 5.5 Steamroller:An x86-64 Core Implemented in 28nm Bulk CMOS, in: ISSCC,2014.[61] T. Singh, S. Rangarajan, D. John, C. Henrion, S. Southard,H. McIntyre, A. Novak, S. Kosonocky, R. Jotwani, A. Schaefer,E. Chang, J. Bell, M. Co, 3.2 Zen: A Next-generation High-performance x86 Core, in: ISSCC, 2017.[62] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia,K. Hsieh, K. T. Malladi, H. Zheng, O. Mutlu, LazyPIM: An E ffi -cient Cache Coherence Mechanism for Processing-in-Memory,CAL (2016).[63] H. S. Stone, A Logic-in-Memory Computer, IEEE Transactionson Computers (1970).[64] D. E. Shaw, S. J. Stolfo, H. Ibrahim, B. Hillyer, G. Wieder-hold, J. Andrews, The NON-VON Database Machine: A BriefOverview, IEEE Database Eng. Bull. (1981).[65] D. G. Elliott, W. M. Snelgrove, M. Stumm, ComputationalRAM: A Memory-SIMD Hybrid and Its Application to DSP,in: CICC, 1992.[66] P. M. Kogge, EXECUBE–A New Architecture for ScaleableMPPs, in: ICPP, 1994.[67] M. Gokhale, B. Holmes, K. Iobst, Processing in Memory: TheTerasys Massively Parallel PIM Array, IEEE Computer (1995).[68] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton,C. Kozyrakis, R. Thomas, K. Yelick, A Case for IntelligentRAM, IEEE Micro (1997).[69] M. Oskin, F. T. Chong, T. Sherwood, Active Pages: A Compu-tation Model for Intelligent Memory, in: ISCA, 1998.[70] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pat-tnaik, J. Torrellas, FlexRAM: Toward an Advanced IntelligentMemory System, in: ICCD, 1999.[71] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss,J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, G. Daglikoca,The Architecture of the DIVA Processing-in-Memory Chip, in:SC, 2002.[72] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, M. Horowitz,Smart Memories: A Modular Reconfigurable Architecture, in:ISCA, 2000.[73] D. Elliott, M. Stumm, W. M. Snelgrove, C. Cojocaru, R. McKen-zie, Computational RAM: Implementing Processors in Memory,IEEE Design & Test (1999).[74] E. Riedel, G. Gibson, C. Faloutsos, Active Storage for Large-scale Data Mining and Multimedia Applications, in: VLDB,1998.[75] K. Keeton, D. A. Patterson, J. M. Hellerstein, A Case for Intel-ligent Disks (IDISKs), SIGMOD Rec. (1998).[76] S. Kaxiras, R. Sugumar, Distributed Vector Architecture: Be-yond a Single Vector-IRAM, in: First Workshop on MixingLogic and DRAM: Chips that Compute and Remember, 1997.[77] A. Acharya, M. Uysal, J. Saltz, Active Disks: ProgrammingModel, Algorithms and Evaluation, in: ASPLOS, 1998. [78] M. Jino, J. W. S. Liu, Intelligent Magnetic Bubble Memories,in: ISCA, 1978.[79] Doty, Greenblatt, Stanley Y.W. 
Su, Magnetic Bubble MemoryArchitectures for Supporting Associative Searching of Rela-tional Databases, IEEE Transactions on Computers (1980).[80] Bongiovanni, Luccio, Maintaining Sorted Files in a MagneticBubble Memory, IEEE Transactions on Computers (1980).[81] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, F. Franchetti, Accel-erating Sparse Matrix-Matrix Multiplication with 3D-StackedLogic-in-Memory Hardware, in: HPEC, 2013.[82] S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srini-vasan, A. Buyuktosunoglu, A. Davis, F. Li, NDC: Analyzing theImpact of 3D-Stacked Memory + Logic Devices on MapReduceWorkloads, in: ISPASS, 2014.[83] D. P. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse,L. Xu, M. Ignatowski, TOP-PIM: Throughput-Oriented Pro-grammable Processing in Memory, in: HPDC, 2014.[84] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, N. S. Kim, NDA:Near-DRAM acceleration architecture leveraging commodityDRAM devices and standard memory modules, in: HPCA,2015.[85] G. H. Loh, N. Jayasena, M. Oskin, M. Nutter, D. Roberts,M. Meswani, D. P. Zhang, M. Ignatowski, A Processing inMemory Taxonomy and a Case for Studying Fixed-FunctionPIM, in: WoNDP, 2013.[86] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Conner,N. Vijaykumar, O. Mutlu, S. Keckler, Transparent O ffl oadingand Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems, in: ISCA, 2016.[87] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T.Kandemir, O. Mutlu, C. R. Das, Scheduling Techniques forGPU Architectures with Processing-in-Memory Capabilities,in: PACT, 2016.[88] B. Akin, F. Franchetti, J. C. Hoe, Data Reorganization in Mem-ory Using 3D-Stacked DRAM, in: ISCA, 2015.[89] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand,S. Ghose, O. Mutlu, Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation, in:ICCD, 2016.[90] O. O. Babarinsa, S. Idreos, JAFAR: Near-Data Processing forDatabases, in: SIGMOD, 2015.[91] J. H. Lee, J. Sim, H. Kim, BSSync: Processing Near Mem-ory for Machine Learning Workloads with Bounded StalenessConsistency Models, in: PACT, 2015.[92] M. Gao, C. Kozyrakis, HRL: E ffi cient and Flexible Reconfig-urable Logic for Near-Data Processing, in: HPCA, 2016.[93] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie,PRIME: A Novel Processing-In-Memory Architecture for Neu-ral Network Computation In ReRAM-Based Main Memory, in:ISCA, 2016.[94] B. Gu, A. S. Yoon, D.-H. Bae, I. Jo, J. Lee, J. Yoon, J.-U. Kang,M. Kwon, C. Yoon, S. Cho, J. Jeong, D. Chang, Biscuit: AFramework for Near-Data Processing of Big Data Workloads,in: ISCA, 2016.[95] D. Kim, J. Kung, S. Chai, S. Yalamanchili, S. Mukhopadhyay,Neurocube: A Programmable Digital Neuromorphic Architec-ture with High-Density 3D Memory, in: ISCA, 2016.[96] H. Asghari-Moghaddam, Y. H. Son, J. H. Ahn, N. S. Kim,Chameleon: Versatile and Practical Near-DRAM AccelerationArchitecture for Large Memory Systems, in: MICRO, 2016.[97] V. Seshadri, T. Mullins, A. Boroumand, O. Mutli, P. B. Gibbons,M. A. Kozuch, T. C. Mowry, Gather-Scatter DRAM: In-DRAMAddress Translation to Improve the Spatial Locality of Non-Unit Strided Accesses, in: MICRO, 2015.[98] Z. Liu, I. Calciu, M. Herlihy, O. Mutlu, Concurrent Data Struc-tures for Near-Memory Computing, in: SPAA, 2017.
99] M. Gao, G. Ayers, C. Kozyrakis, Practical Near-Data Processingfor In-Memory Analytics Frameworks, in: PACT, 2015.[100] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low,L. Pileggi, J. C. Hoe, F. Franchetti, 3D-Stacked Memory-SideAcceleration: Accelerator and System Design, in: WoNDP,2014.[101] Z. Sura, A. Jacob, T. Chen, B. Rosenburg, O. Sallenave,C. Bertolli, S. Antao, J. Brunheroto, Y. Park, K. O’Brien,R. Nair, Data Access Optimization in a Processing-in-MemorySystem, in: CF, 2015.[102] A. Morad, L. Yavits, R. Ginosar, GP-SIMD Processing-in-Memory, ACM TACO (2015).[103] S. M. Hassan, S. Yalamanchili, S. Mukhopadhyay, Near DataProcessing: Impact and Optimization of 3D Memory SystemArchitecture on the Uncore, in: MEMSYS, 2015.[104] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, Y. Xie, Pinatubo: AProcessing-in-Memory Architecture for Bulk Bitwise Opera-tions in Emerging Non-Volatile Memories, in: DAC, 2016.[105] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, K. Curewitz,An Energy-E ffi cient VLSI Architecture for Pattern Recognitionvia Deep Embedding of Computation in SRAM, in: ICASSP,2014.[106] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy,D. Blaauw, R. Das, Compute Caches, in: HPCA, 2017.[107] A. Shafiee, A. Nag, N. Muralimanohar, et al., ISAAC: A Con-volutional Neural Network Accelerator with In-situ AnalogArithmetic in Crossbars, in: ISCA, 2016.[108] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun,G. Pekhimenko, Y. Luo, O. Mutlu, M. A. Kozuch, P. B. Gibbons,T. C. Mowry, RowClone: Fast and Energy-E ffi cient In-DRAMBulk Data Copy and Initialization, in: MICRO, 2013.[109] V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. A. Kozuch,O. Mutlu, P. B. Gibbons, T. C. Mowry, Fast Bulk Bitwise ANDand OR in DRAM, CAL (2015).[110] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi,O. Mutlu, Low-Cost Inter-Linked Subarrays (LISA): EnablingFast Inter-Subarray Data Movement in DRAM, in: HPCA,2016.[111] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand,J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, T. C. Mowry,Buddy-RAM: Improving the Performance and E ffi ciency ofBulk Bitwise Operations Using DRAM, arXiv:1611.09988[cs:AR] (2016).[112] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand,J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, T. C. Mowry,Ambit: In-Memory Accelerator for Bulk Bitwise OperationsUsing Commodity DRAM Technology, in: MICRO, 2017.[113] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, H. Kim, GraphPIM:Enabling Instruction-Level PIM O ffl oading in Graph Comput-ing Frameworks, in: HPCA, 2017.[114] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Has-san, O. Ergin, C. Alkan, O. Mutlu, GRIM-Filter: Fast SeedFiltering in Read Mapping Using Emerging Memory Technolo-gies, arXiv:1708.04329 [q-bio.GN] (2017).[115] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Has-san, O. Ergin, C. Alkan, O. Mutlu, GRIM-Filter: Fast SeedLocation Filtering in DNA Read Mapping Using Processing-in-Memory Technologies, BMC Genomics (2018).[116] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, Y. Xie,DRISA: A DRAM-Based Reconfigurable In-Situ Accelerator,in: MICRO, 2017.[117] G. Kim, N. Chatterjee, M. O’Connor, K. Hsieh, Toward Stan-dardized Near-Data Processing with Unrestricted Data Place-ment for GPUs, in: SC, 2017.[118] I. Fernandez, R. Quislant, C. Giannoula, M. Alser, J. Gomez-Luna, E. Gutierrez, O. Plata, O. Mutlu, NATSA: A Near-Data Processing Accelerator for Time Series Analysis, in: ICCD,2020.[119] G. Singh, J. Gomez-Luna, G. Mariani, G. F. 
[120] V. Seshadri, O. Mutlu, In-DRAM Bulk Bitwise Execution Engine (2020).
[121] Y. Wang, L. Orosa, X. Peng, Y. Guo, S. Ghose, M. Patel, J. S. Kim, J. G. Luna, M. Sadrosadati, N. M. Ghiasi, et al., FIGARO: Improving System Performance via Fine-Grained In-DRAM Data Relocation and Caching, in: MICRO, 2020.
[122] F. Gao, G. Tziantzioulis, D. Wentzlaff, ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs, in: MICRO, 2019.
[123] S. H. S. Rezaei, M. Modarressi, R. Ausavarungnirun, M. Sadrosadati, O. Mutlu, M. Daneshtalab, NoM: Network-on-Memory for Inter-Bank Data Transfer in Highly-Banked Memories, CAL (2020).
[124] V. Seshadri, O. Mutlu, Simple Operations in Memory to Reduce Data Movement, in: Advances in Computers, Volume 106, 2017.
[125] V. Seshadri, Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems, Ph.D. thesis, Carnegie Mellon University (2016).
[126] S. Ghose, K. Hsieh, A. Boroumand, R. Ausavarungnirun, O. Mutlu, The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption, in: Beyond-CMOS Technologies for Next Generation Computer Design, 2019.
[127] S. Ghose, K. Hsieh, A. Boroumand, R. Ausavarungnirun, O. Mutlu, Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions, arXiv:1802.00320 [cs:AR] (2018).
[128] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, J. Yang, DrAcc: A DRAM Based Accelerator for Accurate CNN Inference, in: DAC, 2018.
[129] X. Xin, Y. Zhang, J. Yang, ELP2IM: Efficient and Low Power Bitwise Operation Processing in DRAM, in: HPCA, 2020.
[130] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, R. Das, Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks, in: ISCA, 2018.
[131] D. Fujiki, S. Mahlke, R. Das, Duality Cache for Data Parallel Acceleration, in: ISCA, 2019.
[132] S. Angizi, Z. He, D. Fan, PIMA-Logic: A Novel Processing-in-Memory Architecture for Highly Flexible and Energy-Efficient Logic Computation, in: DAC, 2018.
[133] S. Angizi, A. S. Rakin, D. Fan, CMP-PIM: An Energy-Efficient Comparator-Based Processing-in-Memory Neural Network Accelerator, in: DAC, 2018.
[134] S. Angizi, J. Sun, W. Zhang, D. Fan, AlignS: A Processing-in-Memory Accelerator for DNA Short Read Alignment Leveraging SOT-MRAM, in: DAC, 2019.
[135] Y. Levy, J. Bruck, Y. Cassuto, E. G. Friedman, A. Kolodny, E. Yaakobi, S. Kvatinsky, Logic Operations in Memory Using a Memristive Akers Array, Microelectronics Journal (2014).
[136] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, U. C. Weiser, MAGIC—Memristor-Aided Logic, IEEE TCAS II: Express Briefs (2014).
[137] S. Kvatinsky, A. Kolodny, U. C. Weiser, E. G. Friedman, Memristor-Based IMPLY Logic Design Procedure, in: ICCD, 2011.
[138] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, U. C. Weiser, Memristor-Based Material Implication (IMPLY) Logic: Design Principles and Methodologies, TVLSI (2014).
[146] A. Ankit, et al., PUMA: A Programmable Ultra-Efficient Memristor-Based Accelerator for Machine Learning Inference, in: ASPLOS, 2019.
[147] G. Singh, D. Diamantopoulos, C. Hagleitner, J. Gomez-Luna, S. Stuijk, O. Mutlu, H. Corporaal, NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling, in: FPL, 2020.
[148] Y. Kim, W. Yang, O. Mutlu, Ramulator: A Fast and Extensible DRAM Simulator, CAL (2015).
[149] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, O. Mutlu, Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost, TACO (2016).
[150] Hybrid Memory Cube Consortium, HMC Specification 1.1 (2013).
[151] Hybrid Memory Cube Consortium, HMC Specification 2.0 (2014).
[152] JEDEC, High Bandwidth Memory (HBM) DRAM, Standard No. JESD235 (2013).
[153] B. Gopireddy, J. Torrellas, Designing Vertical Processors in Monolithic 3D, in: ISCA, 2019.
[154] S. Mitra, Abundant-Data Computing: The N3XT 1,000X, in: VLSI-TSA, 2018.
[155] W. Hwang, W. Wan, S. Mitra, H.-S. P. Wong, Coming Up N3XT, After 2D Scaling of Si CMOS, in: ISCAS, 2018.
[156] S. Mitra, From Nanodevices to Nanosystems: The N3XT Information Technology, in: E3S, 2015.
[157] D. Rich, A. Bartolo, C. Gilardo, B. Le, H. Li, R. Park, R. M. Radway, M. M. Sabry Aly, H.-S. P. Wong, S. Mitra, Heterogeneous 3D Nano-Systems: The N3XT Approach?, 2020.
[158] M. M. Sabry Aly, M. Gao, G. Hills, C. Lee, G. Pitner, M. M. Shulaker, T. F. Wu, M. Asheghi, J. Bokor, F. Franchetti, K. E. Goodson, C. Kozyrakis, I. Markov, K. Olukotun, L. Pileggi, E. Pop, J. Rabaey, C. Ré, H.-S. P. Wong, S. Mitra, Energy-Efficient Abundant-Data Computing: The N3XT 1,000x, Computer (2015).
[159] M. M. Sabry Aly, T. F. Wu, A. Bartolo, Y. H. Malviya, W. Hwang, G. Hills, I. Markov, M. Wootters, M. M. Shulaker, H.-S. Philip Wong, S. Mitra, The N3XT Approach to Energy-Efficient Abundant-Data Computing, Proceedings of the IEEE (2019).
[160] R. H. Dennard, Field-Effect Transistor Memory, US Patent 3,387,286 (1968).
[161] G. Pekhimenko, T. C. Mowry, O. Mutlu, Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency, in: PACT, 2012.
[162] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry, Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework, in: MICRO, 2013.
[163] I. Churin, A. Georgiev, A CAMAC Crate Controller for the IBM PC/XT Family Computers with Built-in Selftest Features, Microprocessing and Microprogramming (1988).
[164] B. Abali, H. Franke, D. E. Poff, R. A. Saccone, C. O. Schulz, L. M. Herger, T. B. Smith, Memory Expansion Technology (MXT): Software Support and Performance, IBM Journal of Research and Development (2001).
[165] J. Friedrich, H. Le, W. Starke, J. Stuecheli, B. Sinharoy, E. J. Fluhr, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, D. Hogenmiller, F. Malgioglio, R. Nett, R. Puri, P. Restle, D. Shan, Z. T. Deniz, D. Wendel, M. Ziegler, D. Victor, The POWER8 Processor: Designed for Big Data, Analytics, and Cloud Environments, in: IEEE International Conference on IC Design Technology, 2014.
[166] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, O. Mutlu, An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms, in: ISCA, 2013.
[167] P. Frigo, E. Vannacci, H. Hassan, V. van der Veen, O. Mutlu, C. Giuffrida, H. Bos, K. Razavi, TRRespass: Exploiting the Many Sides of Target Row Refresh, in: S&P, 2020.
[168] A. Das, H. Hassan, O. Mutlu, VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency, in: DAC, 2018.
[169] H. Luo, T. Shahroodi, H. Hassan, M. Patel, A. G. Yaglikci, L. Orosa, J. Park, O. Mutlu, CLR-DRAM: A Low-Cost DRAM Architecture Enabling Dynamic Capacity-Latency Trade-Off, in: ISCA, 2020.
[170] M. Patel, J. S. Kim, H. Hassan, O. Mutlu, Understanding and Modeling On-Die Error Correction in Modern DRAM: An Experimental Study Using Real Devices, in: DSN, 2019.
[171] M. Patel, J. S. Kim, T. Shahroodi, H. Hassan, O. Mutlu, Bit-Exact ECC Recovery (BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics, in: MICRO, 2020.
[172] J. Kim, M. Patel, H. Hassan, L. Orosa, O. Mutlu, D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput, in: HPCA, 2019.
[173] P. J. Denning, T. G. Lewis, Exponential Laws of Computing Growth, Commun. ACM (Jan. 2017).
[174] International Technology Roadmap for Semiconductors (ITRS) (2009).
[175] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood, DBMSs on a Modern Processor: Where Does Time Go?, in: VLDB, 1999.
[176] P. A. Boncz, S. Manegold, M. L. Kersten, Database Architecture Optimized for the New Bottleneck: Memory Access, in: VLDB, 1999.
[177] R. Clapp, M. Dimitrov, K. Kumar, V. Viswanathan, T. Willhalm, Quantifying the Performance Impact of Memory Latency and Bandwidth for Big Data Workloads, in: IISWC, 2015.
[178] S. L. Xi, O. Babarinsa, M. Athanassoulis, S. Idreos, Beyond the Wall: Near-Data Processing for Databases, in: DaMoN, 2015.
[179] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, R. Ausavarungnirun, K. Hsieh, N. Hajinazar, K. T. Malladi, H. Zheng, O. Mutlu, CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators, in: ISCA, 2019.
[180] Y. Umuroglu, D. Morrison, M. Jahre, Hybrid Breadth-First Search on a Single-Chip FPGA-CPU Heterogeneous Platform, in: FPL, 2015.
[181] Q. Xu, H. Jeon, M. Annavaram, Graph Processing on GPUs: Where Are the Bottlenecks?, in: IISWC, 2014.
[186] S. Han, et al., EIE: Efficient Inference Engine on Compressed Deep Neural Network, in: ISCA, 2016.
[187] Y. Long, T. Na, S. Mukhopadhyay, ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration, TVLSI (2018).
[188] F. Schuiki, M. Schaffner, F. K. Gürkaynak, L. Benini, A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets, arXiv (2018).
[189] D. Senol, J. Kim, S. Ghose, C. Alkan, O. Mutlu, Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions, Briefings in Bioinformatics (BIB) (2018).
[190] M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, C. Alkan, GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping, Bioinformatics (2017).
[191] M. Alser, T. Shahroodi, J. Gomez-Luna, C. Alkan, O. Mutlu, SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs (2020).
[192] M. Alser, H. Hassan, A. Kumar, O. Mutlu, C. Alkan, Shouji: A Fast and Efficient Pre-Alignment Filter for Sequence Alignment, Bioinformatics (2019).
[193] T. Moscibroda, O. Mutlu, Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems, in: USENIX Security, 2007.
[194] O. Mutlu, T. Moscibroda, Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors, in: MICRO, 2007.
[195] O. Mutlu, T. Moscibroda, Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems, in: ISCA, 2008.
[196] L. Subramanian, Providing High and Controllable Performance in Multicore Systems Through Shared Resource Management, Ph.D. thesis, Carnegie Mellon University (2015).
[197] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, O. Mutlu, MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems, in: HPCA, 2013.
[198] H. Usui, L. Subramanian, K. Chang, O. Mutlu, DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators, TACO (2016).
[199] L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, O. Mutlu, The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory, in: MICRO, 2015.
[200] Y. Kim, M. Papamichael, O. Mutlu, M. Harchol-Balter, Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior, in: MICRO, 2010.
[201] H. Kim, D. De Niz, B. Andersson, M. Klein, O. Mutlu, R. Rajkumar, Bounding Memory Interference Delay in COTS-Based Multi-Core Systems, in: RTAS, 2014.
[202] H. Kim, D. De Niz, B. Andersson, M. Klein, O. Mutlu, R. Rajkumar, Bounding and Reducing Memory Interference in COTS-Based Multi-Core Systems, Real-Time Systems (2016).
[203] O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation, https://people.inf.ethz.ch/omutlu/pub/onur-ICCD-Keynote-EnablingInMemoryComputation-November-19-2019-unrolled.pptx, video available online; keynote talk at 37th IEEE International Conference on Computer Design (ICCD), Abu Dhabi, UAE, 19 November 2019. (2019).
[204] K. K. Chang, Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization, https://people.inf.ethz.ch/omutlu/pub/understanding-latency-variation-in-DRAM-chips_kevinchang_sigmetrics16-talk.pdf, conference talk at SIGMETRICS 2016. (2016).
[205] K. K. Chang, Understanding and Improving the Latency of DRAM-Based Memory Systems, slides available at https://safari.ethz.ch/safari_public_wp/wp-content/uploads/2018/12/kchang_defense_slides.pptx (2016).
[206] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, O. Mutlu, The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study, in: SIGMETRICS, 2014.
[207] S. Khan, D. Lee, O. Mutlu, PARBOR: An Efficient System-Level Technique to Detect Data Dependent Failures in DRAM, in: DSN, 2016.
[208] S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, O. Mutlu, A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM, CAL (2016).
[209] S. Khan, C. Wilkerson, Z. Wang, A. Alameldeen, D. Lee, O. Mutlu, Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content, in: MICRO, 2017.
[210] M. K. Qureshi, D. H. Kim, S. Khan, P. J. Nair, O. Mutlu, AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems, in: DSN, 2015.
[211] L. Cojocar, J. Kim, M. Patel, L. Tsai, S. Saroiu, A. Wolman, O. Mutlu, Are We Susceptible to Rowhammer? An End-to-End Methodology for Cloud Providers, in: S&P, 2020.
[212] J. Meza, Q. Wu, S. Kumar, O. Mutlu, Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field, in: DSN, 2015.
[213] SAFARI Research Group, RowHammer – GitHub Repository, https://github.com/CMU-SAFARI/rowhammer/.
[214] O. Mutlu, Intelligent Architectures for Intelligent Machines, https://people.inf.ethz.ch/omutlu/pub/onur-NSF-PIM-KeynoteTalk-IntelligentArchitecturesForIntelligentMachines-October-26-2020-final.pptx, video available online; keynote talk at National Science Foundation Workshop on Processing-In-Memory Technology (NSF-PIM), Virtual, 26 October 2020. (2020).
[215] Y. Kim, Flipping Bits in Memory Without Accessing Them: DRAM Disturbance Errors, https://people.inf.ethz.ch/omutlu/pub/dram-row-hammer_kim_talk_isca14.pdf, conference talk at ISCA 2014. (2014).
[216] M. Seaborn, T. Dullien, Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges, http://googleprojectzero.blogspot.com.tr/2015/03/exploiting-dram-rowhammer-bug-to-gain.html.
[218] D. Gruss, C. Maurice, S. Mangard, Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript, arXiv:1507.06955 (2015). URL http://arxiv.org/abs/1507.06955
[219] K. Razavi, B. Gras, E. Bosman, B. Preneel, C. Giuffrida, H. Bos, Flip Feng Shui: Hammering a Needle in the Software Stack, in: USENIX Security, 2016.
[220] V. van der Veen, Y. Fratantonio, M. Lindorfer, D. Gruss, C. Maurice, G. Vigna, H. Bos, K. Razavi, C. Giuffrida, Drammer: Deterministic Rowhammer Attacks on Mobile Platforms, in: CCS, 2016.
[221] E. Bosman, K. Razavi, H. Bos, C. Giuffrida, Dedup Est Machina: Memory Deduplication as an Advanced Exploitation Vector, in: S&P, 2016.
[222] Y. Xiao, et al., One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation, in: USENIX Security, 2016.
[223] D. Gruss, et al., Another Flip in the Wall of Rowhammer Defenses, in: S&P, 2018.
[224] R. Qiao, M. Seaborn, A New Approach for Rowhammer Attacks, in: HOST, 2016.
[225] S. Bhattacharya, D. Mukhopadhyay, Curious Case of RowHammer: Flipping Secret Exponent Bits Using Timing Analysis, in: CHES, 2016.
[226] Y. Jang, J. Lee, S. Lee, T. Kim, SGX-Bomb: Locking Down the Processor via Rowhammer Attack, in: SysTEX, 2017.
[227] M. T. Aga, Z. B. Aweke, T. Austin, When Good Protections Go Bad: Exploiting Anti-DoS Measures to Accelerate Rowhammer Attacks, in: HOST, 2017.
[228] P. Pessl, D. Gruss, C. Maurice, M. Schwarz, S. Mangard, DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks, in: USENIX Security, 2016.
[229] P. Frigo, et al., Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU, in: S&P, 2018.
[230] A. P. Fournaris, L. Pocero Fraile, O. Koufopavlou, Exploiting Hardware Vulnerabilities to Attack Embedded System Devices: A Survey of Potent Microarchitectural Attacks, Electronics (2017).
[231] D. Poddebniak, J. Somorovsky, S. Schinzel, M. Lochter, P. Rösler, Attacking Deterministic Signature Schemes Using Fault Attacks, in: EuroS&P, 2018.
[232] M. Lipp, et al., Nethammer: Inducing Rowhammer Faults through Network Requests, arXiv (2018).
[233] A. Tatar, et al., Throwhammer: Rowhammer Attacks over the Network and Defenses, in: USENIX ATC, 2018.
[234] A. Tatar, C. Giuffrida, H. Bos, K. Razavi, Defeating Software Mitigations Against Rowhammer: A Surgical Precision Hammer, in: RAID, 2018.
[235] S. Carre, M. Desjardins, A. Facon, S. Guilley, OpenSSL Bellcore's Protection Helps Fault Attack, in: DSD, 2018.
[236] A. Barenghi, L. Breveglieri, N. Izzo, G. Pelosi, Software-Only Reverse Engineering of Physical DRAM Mappings for Rowhammer Attacks, in: IVSW, 2018.
[237] Z. Zhang, Z. Zhan, D. Balasubramanian, X. Koutsoukos, G. Karsai, Triggering Rowhammer Hardware Faults on ARM: A Revisit, in: ASHES, 2018.
[238] S. Bhattacharya, D. Mukhopadhyay, Advanced Fault Attacks in Software: Exploiting the Rowhammer Bug, in: Fault Tolerant Architectures for Cryptography and Hardware Security, 2018.
[239] L. Cojocar, K. Razavi, C. Giuffrida, H. Bos, Exploiting Correcting Codes: On the Effectiveness of ECC Memory Against RowHammer Attacks, in: S&P, 2019.
[240] J.-B. Lee, Green Memory Solution, Samsung Electronics, Investor's Forum.
[241] Micron, DDR4 SDRAM Datasheet, p. 380.
[242] O. Mutlu, RowHammer, in: Top Picks in Hardware and Embedded Security, 2018.
[243] T.-Y. Oh, H. Chung, J.-Y. Park, K.-W. Lee, S. Oh, S.-Y. Doo, H.-J. Kim, C. Lee, H.-R. Kim, J.-H. Lee, et al., A 3.2 Gbps/pin 8 Gbit 1.0 V LPDDR4 SDRAM with Integrated ECC Engine for Sub-1 V DRAM Core Operation, IEEE Journal of Solid-State Circuits (2014).
[244] Micron Technology Inc., ECC Brings Reliability and Power Efficiency to Mobile Devices, Tech. Rep. (2017).
[245] JEDEC, JESD79-5 DDR5 SDRAM Standard (2020).
[246] N. Kwak, S.-H. Kim, K. H. Lee, C.-K. Baek, M. S. Jang, Y. Joo, S.-H. Lee, W. Y. Lee, E. Lee, D. Han, et al., A 4.8 Gb/s/pin 2 Gb LPDDR4 SDRAM with Sub-100 µA Self-Refresh Current for IoT Applications, in: ISSCC, 2017.
[247] H.-J. Kwon, E. Seo, C.-Y. Lee, Y.-H. Seo, G.-H. Han, H.-R. Kim, J.-H. Lee, M.-S. Jang, S.-G. Do, S.-H. Cho, et al., An Extremely Low-Standby-Power 3.733 Gb/s/pin 2 Gb LPDDR4 SDRAM for Wearable Devices, in: ISSCC, 2017.
[248] Apple Inc., About the Security Content of Mac EFI Security Update 2015-001, https://support.apple.com/en-us/HT204934 (2015).
[249] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, T. W. Keller, Energy Management for Commercial Servers, Computer (2003).
[250] M. Ware, K. Rajamani, M. Floyd, B. Brock, J. C. Rubio, F. Rawson, J. B. Carter, Architecting for Power Management: The IBM POWER7 Approach, in: HPCA, 2010.
[251] I. Paul, W. Huang, M. Arora, S. Yalamanchili, Harmonia: Balancing Compute and Memory Power in High-Performance GPUs, in: ISCA, 2015.
[252] H. David, C. Fallin, E. Gorbatov, U. R. Hanebutte, O. Mutlu, Memory Power Management via Dynamic Voltage/Frequency Scaling, in: 8th ACM International Conference on Autonomic Computing, 2011.
[253] Q. Deng, D. Meisner, L. Ramos, T. F. Wenisch, R. Bianchini, MemScale: Active Low-Power Modes for Main Memory, in: ASPLOS, 2011.
[254] J. Haj-Yahya, M. Alser, J. Kim, A. G. Yağlıkçı, N. Vijaykumar, E. Rotem, O. Mutlu, SysScale: Exploiting Multi-Domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors, in: ISCA, 2020.
[255] J. Haj-Yahya, Y. Sazeides, M. Alser, E. Rotem, O. Mutlu, Techniques for Reducing the Connected-Standby Energy Consumption of Mobile Devices, in: HPCA, 2020.
[256] K. Kim, J. Lee, A New Investigation of Data Retention Time in Truly Nanoscaled DRAMs, IEEE Electron Device Letters (2009).
[257] J. Liu, RAIDR: Retention-Aware Intelligent DRAM Refresh, https://people.inf.ethz.ch/omutlu/pub/liu_isca12_talk.pdf, conference talk at ISCA 2012. (2012).
[258] O. Mutlu, An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms, https://people.inf.ethz.ch/omutlu/pub/mutlu_isca13_talk.pdf, conference talk at ISCA 2013. (2013).
[259] D. Lee, Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity, Ph.D. thesis, Carnegie Mellon University (2016).
[260] K. K. Chang, D. Lee, Z. Chishti, A. R. Alameldeen, C. Wilkerson, Y. Kim, O. Mutlu, Improving DRAM Performance by Parallelizing Refreshes with Accesses, in: HPCA, 2014.
[261] J. Kim, M. Patel, H. Hassan, O. Mutlu, The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability Tradeoff in Modern DRAM Devices, in: HPCA, 2018.
[262] J. Meza, J. Chang, H. Yoon, O. Mutlu, P. Ranganathan, Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management, CAL (2012).
[263] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, O. Mutlu, Utility-Based Hybrid Memory Management, in: CLUSTER, 2017.
[264] M. K. Qureshi, V. Srinivasan, J. A. Rivers, Scalable High Performance Main Memory System Using Phase-Change Memory Technology, in: ISCA, 2009.
[265] L. E. Ramos, E. Gorbatov, R. Bianchini, Page Placement in Hybrid Memory Systems, in: ICS, 2011.
[266] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, S. Devadas, Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation, in: MICRO, 2017.
[267] W. Zhang, T. Li, Exploring Phase Change Memory and 3D Die-Stacking for Power/Thermal Friendly, Fast and Durable Memory Architectures, in: PACT, 2009.
[268] S. Song, A. Das, O. Mutlu, N. Kandasamy, Improving Phase Change Memory Performance with Data Content Aware Access, in: ISMM, 2020.
[269] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, O. Mutlu, Reliability Issues in Flash-Memory-Based Solid-State Drives: Experimental Analysis, Mitigation, Recovery, in: Inside Solid State Drives (SSDs), 2018.
[270] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, O. Mutlu, Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery, arXiv:1711.11427 [cs:AR] (2018).
[271] Y. Cai, NAND Flash Memory: Characterization, Analysis, Modeling, and Mechanisms, Ph.D. thesis, Carnegie Mellon University (2013).
[272] Y. Luo, Architectural Techniques for Improving NAND Flash Memory Reliability, Ph.D. thesis, Carnegie Mellon University (2018).
[273] A. Tavakkol, J. Gómez-Luna, M. Sadrosadati, S. Ghose, O. Mutlu, MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices, in: FAST, 2018.
[274] A. Tavakkol, M. Sadrosadati, S. Ghose, J. Kim, Y. Luo, Y. Wang, N. M. Ghiasi, L. Orosa, J. Gómez-Luna, O. Mutlu, FLIN: Enabling Fairness and Enhancing Performance in Modern NVMe Solid State Drives, in: ISCA, 2018.
[275] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, O. Mutlu, Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives, Proc. IEEE (Sep. 2017).
[276] Y. Cai, E. F. Haratsch, O. Mutlu, K. Mai, Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis, in: DATE, 2012.
[277] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. S. Unsal, K. Mai, Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime, in: ICCD, 2012.
[278] Y. Cai, O. Mutlu, E. F. Haratsch, K. Mai, Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation, in: ICCD, 2013.
[279] Y. Cai, E. F. Haratsch, O. Mutlu, K. Mai, Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis, and Modeling, in: DATE, 2013.
[280] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, K. Mai, Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories, in: SIGMETRICS, 2014.
[281] Y. Cai, Y. Luo, E. F. Haratsch, K. Mai, O. Mutlu, Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery, in: HPCA, 2015.
[282] Y. Cai, Y. Luo, S. Ghose, O. Mutlu, Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery, in: DSN, 2015.
[283] Y. Cai, S. Ghose, Y. Luo, K. Mai, O. Mutlu, E. F. Haratsch, Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques, in: HPCA, 2017.
[284] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, O. Mutlu, HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness, in: HPCA, 2018.
[285] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, O. Mutlu, Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation, in: SIGMETRICS, 2018.
[286] Y. Luo, Y. Cai, S. Ghose, J. Choi, O. Mutlu, WARM: Improving NAND Flash Memory Lifetime with Write-Hotness Aware Retention Management, in: MSST, 2015.
[287] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, O. Mutlu, Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory, JSAC (2016).
[288] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. S. Unsal, K. Mai, Error Analysis and Retention-Aware Error Management for NAND Flash Memory, Intel Technology Journal (2013).
[289] M. Kim, J. Park, G. Cho, Y. Kim, L. Orosa, O. Mutlu, J. Kim, Evanesco: Architectural Support for Efficient Data Sanitization in Modern Flash-Based Storage Systems, in: ASPLOS, 2020.
[290] A. W. Burks, H. H. Goldstine, J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946).
[291] G. Kestor, R. Gioiosa, D. J. Kerbyson, A. Hoisie, Quantifying the Energy Cost of Data Movement in Scientific Applications, in: IISWC, 2013.
[292] D. Pandiyan, C.-J. Wu, Quantifying the Energy Cost of Data Movement for Emerging Smart Phone Workloads on Mobile Platforms, in: IISWC, 2014.
[293] V. Seshadri, O. Mutlu, M. A. Kozuch, T. C. Mowry, The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing, in: PACT, 2012.
[294] S. Khan, A. R. Alameldeen, C. Wilkerson, O. Mutlu, D. A. Jimenez, Improving Cache Performance Using Read-Write Partitioning, in: HPCA, 2014.
[295] M. K. Qureshi, D. N. Lynch, O. Mutlu, Y. N. Patt, A Case for MLP-Aware Cache Replacement, in: ISCA, 2006.
[296] O. Mutlu, H. Kim, Y. N. Patt, Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses, IEEE Transactions on Computers (2006).
[297] F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. Pileggi, F. Franchetti, Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-Way Merge Parallelization, in: MICRO, 2019.
[298] A. Gondimalla, N. Chesnut, M. Thottethodi, T. Vijaykumar, SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks, in: MICRO, 2019.
[299] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, C. W. Fletcher, ExTensor: An Accelerator for Sparse Tensor Algebra, in: MICRO, 2019.
[300] M. Zhu, T. Zhang, Z. Gu, Y. Xie, Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-Wise Sparse Neural Networks on Modern GPUs, in: MICRO, 2019.
[301] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Tech. Rep., Stanford InfoLab (1999).
[302] J. Ahn, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, https://people.inf.ethz.ch/omutlu/pub/tesseract-pim-architecture-for-graph-processing_isca15-talk.pdf, conference talk at ISCA 2015. (2015).
[303] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, A. R. LeBlanc, Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions, IEEE Journal of Solid-State Circuits (1974).
[304] W. J. Dally, Challenges for Future Computing Systems, HiPEAC Keynote, 2015.
[305] T. S. Kuhn, The Structure of Scientific Revolutions, 2012.
[306] G. H. Loh, 3D-Stacked Memory Architectures for Multi-Core Processors, in: ISCA, 2008.
[307] JEDEC, Wide I/O Single Data Rate (Wide I/O SDR), Standard No. JESD229 (2011).
[308] JEDEC, Wide I/O 2 (WideIO2), Standard No. JESD229-2 (2014).
[309] S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, O. Mutlu, A Workload and Programming Ease Driven Perspective of Processing-in-Memory, arXiv:1907.12947 [cs:AR] (2019).
[310] B. C. Lee, E. Ipek, O. Mutlu, D. Burger, Phase Change Memory Architecture and the Quest for Scalability, CACM (2010).
[311] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, D. Burger, Phase-Change Technology and the Future of Main Memory, IEEE Micro (2010).
[312] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, K. E. Goodson, Phase Change Memory, Proc. IEEE (2010).
[313] P. Zhou, B. Zhao, J. Yang, Y. Zhang, A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology, in: ISCA, 2009.
[314] E. Kültürsay, M. Kandemir, A. Sivasubramaniam, O. Mutlu, Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative, in: ISPASS, 2013.
[315] H. Naeimi, C. Augustine, A. Raychowdhury, S.-L. Lu, J. Tschanz, STT-RAM Scaling and Retention Failure, Intel Technology Journal (2013).
[316] L. Chua, Memristor—The Missing Circuit Element, IEEE TCT (1971).
[317] D. B. Strukov, G. S. Snider, D. R. Stewart, R. S. Williams, The Missing Memristor Found, Nature (2008).
[318] H.-S. P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F. T. Chen, M.-J. Tsai, Metal-Oxide RRAM, Proc. IEEE (2012).
[319] S. Angizi, D. Fan, GraphiDe: A Graph Processing Accelerator Leveraging In-DRAM-Computing, in: GLSVLSI, 2019.
[320] J. A. Mandelman, R. H. Dennard, G. B. Bronner, J. K. DeBrosse, R. Divakaruni, Y. Li, C. J. Radens, Challenges and Future Directions for the Scaling of Dynamic Random-Access Memory (DRAM), IBM JRD (2002).
[321] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, N. Hajinazar, K. Hsieh, K. T. Malladi, H. Zheng, O. Mutlu, LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures, arXiv:1706.03162 [cs:AR] (2017).
[322] M. Gao, J. Pu, X. Yang, M. Horowitz, C. Kozyrakis, TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory, in: ASPLOS, 2017.
[323] M. P. Drumond Lages De Oliveira, A. Daglis, N. Mirzadeh, D. Ustiugov, J. Picorel Obando, B. Falsafi, B. Grot, D. Pnevmatikatos, The Mondrian Data Engine, in: ISCA, 2017.
[324] G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, H. Yang, GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing, IEEE TCAD (2018).
[325] M. Zhang, Y. Zhuo, C. Wang, M. Gao, Y. Wu, K. Chen, C. Kozyrakis, X. Qian, GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition, in: HPCA, 2018.
[326] Y. Huang, L. Zheng, P. Yao, J. Zhao, X. Liao, H. Jin, J. Xue, A Heterogeneous PIM Hardware-Software Co-Design for Energy-Efficient Graph Processing, in: IPDPS, 2020.
[327] Y. Zhuo, C. Wang, M. Zhang, R. Wang, D. Niu, Y. Wang, X. Qian, GraphQ: Scalable PIM-Based Graph Processing, in: MICRO, 2019.
[328] J. K. Ousterhout, Why Aren't Operating Systems Getting Faster As Fast as Hardware?, in: USENIX STC, 1990.
[329] M. Rosenblum, et al., The Impact of Architectural Trends on Operating System Performance, in: SOSP, 1995.
[330] Memcached: A High Performance, Distributed Memory Object Caching System, http://memcached.org.
[331] MySQL: An Open Source Database, https://www.mysql.com/.
[332] SAFARI Research Group, SoftMC v1.0 – GitHub Repository, https://github.com/CMU-SAFARI/SoftMC/.
[333] D. E. Knuth, The Art of Computer Programming, Volume 4 Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams (2009).
[334] H. S. Warren, Hacker's Delight, 2nd Edition, Addison-Wesley Professional, 2012.
[335] C.-Y. Chan, Y. E. Ioannidis, Bitmap Index Design and Evaluation, in: SIGMOD, 1998.
[336] E. O'Neil, P. O'Neil, K. Wu, Bitmap Index Design Choices and Their Performance Implications, in: IDEAS, 2007.
[337] FastBit: An Efficient Compressed Bitmap Index Technology, https://sdm.lbl.gov/fastbit/.
[338] K. Wu, E. J. Otoo, A. Shoshani, Compressing Bitmap Indexes for Faster Search Operations, in: SSDBM, 2002.
[339] Y. Li, J. M. Patel, BitWeaving: Fast Scans for Main Memory Data Processing, in: SIGMOD, 2013.
[340] B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, Y. He, BitFunnel: Revisiting Signatures for Search, in: SIGIR, 2017.
[341] G. Benson, Y. Hernandez, J. Loving, A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm, in: CPM, 2013.
[342] H. Xin, J. Greth, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, O. Mutlu, Shifted Hamming Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in Read Mapping, Bioinformatics (2015).
[343] P. Tuyls, H. D. L. Hollmann, J. H. van Lint, L. Tolhuizen, XOR-Based Visual Cryptography Schemes, Designs, Codes and Cryptography.
[344] J.-W. Han, C.-S. Park, D.-H. Ryu, E.-S. Kim, Optical Image Encryption Based on XOR Operations, SPIE OE (1999).
[345] S. A. Manavski, CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography, in: ICSPC, 2007.
[346] H. Kang, S. Hong, One-Transistor Type DRAM, US Patent 7701751 (2009).
[347] S.-L. Lu, Y.-C. Lin, C.-L. Yang, Improving DRAM Latency with Dynamic Asymmetric Subarray, in: MICRO, 2015.
[348] 6th Generation Intel Core Processor Family Datasheet.
[349] GeForce GTX 745.
[350] J. S. Kim, The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability Tradeoff in Modern Commodity DRAM Devices, https://people.inf.ethz.ch/omutlu/pub/dram-latency-puf_hpca18_talk.pdf, conference talk at HPCA 2018. (2018).
[351] J. S. Kim, D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput, https://people.inf.ethz.ch/omutlu/pub/drange-dram-latency-based-true-random-number-generator_hpca19-talk.pdf, conference talk at HPCA 2019. (2019).
[355] S. Hong, H. Chafi, E. Sedlar, K. Olukotun, Green-Marl: A DSL for Easy and Efficient Graph Analysis, in: ASPLOS, 2012.
[356] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, Pregel: A System for Large-Scale Graph Processing, in: SIGMOD, 2010.
[357] Harshvardhan, et al., KLA: A New Algorithmic Paradigm for Parallel Graph Computation, in: PACT, 2014.
[358] J. E. Gonzalez, et al., PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, in: OSDI, 2012.
[359] J. Shun, G. E. Blelloch, Ligra: A Lightweight Graph Processing Framework for Shared Memory, in: PPoPP, 2013.
[360] J. Xue, Z. Yang, Z. Qu, S. Hou, Y. Dai, Seraph: An Efficient, Low-Cost System for Concurrent Graph Processing, in: HPDC, 2014.
[361] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein, GraphLab: A New Framework for Parallel Machine Learning, arXiv:1006.4990 [cs:LG] (2010).
[362] Google LLC, Chrome Browser.
[363] Google LLC, TensorFlow: Mobile.
[364] A. Grange, P. de Rivaz, J. Hunt, VP9 Bitstream & Decoding Process Specification, http://storage.googleapis.com/downloads.webmproject.org/docs/vp9/vp9-bitstream-specification-v0.6-20160331-draft.pdf.
[365] V. Narasiman, C. J. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, Y. N. Patt, Improving GPU Performance via Large Warps and Two-Level Warp Scheduling, in: MICRO, 2011.
[366] A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, C. R. Das, OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, in: ASPLOS, 2013.
[367] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, C. R. Das, Orchestrated Scheduling and Prefetching for GPGPUs, in: ISCA, 2013.
[368] R. Ausavarungnirun, S. Ghose, O. Kayiran, G. H. Loh, C. R. Das, M. T. Kandemir, O. Mutlu, Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance, in: PACT, 2015.
[369] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, O. Mutlu, A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps, in: ISCA, 2015.
[370] N. Vijaykumar, K. Hsieh, G. Pekhimenko, S. Khan, A. Shrestha, S. Ghose, A. Jog, P. B. Gibbons, O. Mutlu, Zorua: A Holistic Approach to Resource Virtualization in GPUs, in: MICRO, 2016.
[371] A. Jog, O. Kayiran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, C. R. Das, Exploiting Core Criticality for Enhanced GPU Performance, in: SIGMETRICS, 2016.
[372] R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. Rossbach, O. Mutlu, MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency, in: ASPLOS, 2018.
[373] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, O. Mutlu, Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes, in: MICRO, 2017.
[374] R. Ausavarungnirun, Techniques for Shared Resource Management in Systems with Throughput Processors, Ph.D. thesis, Carnegie Mellon University (2017).
[375] C. Li, R. Ausavarungnirun, C. J. Rossbach, Y. Zhang, O. Mutlu, Y. Guo, J. Yang, A Framework for Memory Oversubscription Management in Graphics Processing Units, in: ASPLOS, 2019.
[376] J. Ahn, PIM-Enabled Instructions: A Low-Overhead, Locality-Aware PIM Architecture, https://people.inf.ethz.ch/omutlu/pub/pim-enabled-instructons-for-low-overhead-pim_isca15-talk.pdf, conference talk at ISCA 2015. (2015).
[377] O. Mutlu, Accelerating Genome Analysis: A Primer on an Ongoing Journey, https://people.inf.ethz.ch/omutlu/pub/onur-AcceleratingGenomeAnalysis-AACBB-Keynote-Feb-16-2019-FINAL.pptx, video available online; keynote talk at 2nd Workshop on Accelerator Architecture in Computational Biology and Bioinformatics (AACBB), Washington, DC, USA, February 2019. (2019).
[378] Y. Turakhia, G. Bejerano, W. J. Dally, Darwin: A Genomics Co-Processor Provides up to 15,000x Acceleration on Long Read Assembly, in: ASPLOS, 2018.
[379] D. Fujiki, A. Subramaniyan, T. Zhang, Y. Zeng, R. Das, D. Blaauw, S. Narayanasamy, GenAx: A Genome Sequencing Accelerator, in: ISCA, 2018.
[380] H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, C. Alkan, Accelerating Read Mapping with FastHASH, BMC Genomics (2013).
[381] C. Alkan, et al., Personalized Copy Number and Segmental Duplication Maps Using Next-Generation Sequencing, Nature Genetics (2009).
[382] H. Li, R. Durbin, Fast and Accurate Short Read Alignment with Burrows–Wheeler Transform, Bioinformatics (2009).
[383] H. Li, Minimap2: Pairwise Alignment for Nucleotide Sequences, Bioinformatics (2018).
[384] R. Baeza-Yates, G. H. Gonnet, A New Approach to Text Searching, Communications of the ACM (1992).
[385] S. Wu, U. Manber, Fast Text Searching: Allowing Errors, Communications of the ACM (1992).
[386] C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, E. Keogh, Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets, in: ICDM, 2016.
[387] C. Chou, P. Nair, M. K. Qureshi, Reducing Refresh Power in Mobile Devices with Morphable ECC, in: DSN, 2015.
[388] Y. Kim, D. Han, O. Mutlu, M. Harchol-Balter, ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers, in: HPCA, 2010.
[389] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, T. Moscibroda, Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning, in: MICRO, 2011.
[390] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A System for Large-Scale Machine Learning, in: OSDI, 2016.
[391] A. Boroumand, Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks, https://people.inf.ethz.ch/omutlu/pub/Google-consumer-workloads-data-movement-and-PIM_asplos18-talk.pdf, conference talk at ASPLOS 2018. (2018).
[392] K. Korgaonkar, R. Ronen, A. Chattopadhyay, S. Kvatinsky, The Bitlet Model: Defining a Litmus Test for the Bitwise Processing-in-Memory Paradigm (2019). arXiv:1910.10234.
[396] SAFARI Research Group, Ramulator – GitHub Repository, https://github.com/CMU-SAFARI/ramulator/.
[397] N. Binkert, B. Beckman, A. Saidi, G. Black, A. Basu, The gem5 Simulator, CAN (2011).
[398] D. Sanchez, C. Kozyrakis, ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems, in: ISCA, 2013.
[399] J. Power, J. Hestness, M. S. Orr, M. D. Hill, D. A. Wood, gem5-gpu: A Heterogeneous CPU-GPU Simulator, CAL (Jan. 2015).
[400] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, T. M. Aamodt, Analyzing CUDA Workloads Using a Detailed GPU Simulator, in: ISPASS, 2009.
[401] SAFARI Research Group, Ramulator-PIM: A Processing-in-Memory Simulation Framework – GitHub Repository, https://github.com/CMU-SAFARI/ramulator-pim.
[402] UPMEM, Introduction to UPMEM PIM: Processing-in-Memory (PIM) on DRAM Accelerator (2018).
[403] F. Devaux, The True Processing In Memory Accelerator, in: Hot Chips, 2019.
[404] N. Vijaykumar, A. Jain, D. Majumdar, K. Hsieh, G. Pekhimenko, E. Ebrahimi, N. Hajinazar, P. B. Gibbons, O. Mutlu, A Case for Richer Cross-Layer Abstractions: Bridging the Semantic Gap with Expressive Memory, in: ISCA, 2018.
[405] N. Vijaykumar, E. Ebrahimi, K. Hsieh, P. B. Gibbons, O. Mutlu, The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs, in: ISCA, 2018.
[406] X. Liu, D. Roberts, R. Ausavarungnirun, O. Mutlu, J. Zhao, Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability, in: MICRO, 2019.
[407] O. Mutlu, Processing Data Where It Makes Sense: Enabling In-Memory Computation, https://people.inf.ethz.ch/omutlu/pub/onur-MST-Keynote-EnablingInMemoryComputation-October-27-2017-unrolled-FINAL.pptx, keynote talk at MST (2017).
[408] O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation, https://people.inf.ethz.ch/omutlu/pub/onur-GWU-EnablingInMemoryComputation-February-15-2019-unrolled-FINAL.pptx, video available online; distinguished lecture at George Washington University (2019).
[409] O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation, https://people.inf.ethz.ch/omutlu/pub/onur-ISSCC2019-talk.pptx, invited talk at ISSCC Special Forum on "Intelligence at the Edge: How Can We Make Machine Learning More Energy Efficient?", as part of the 2019 International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, February 2019. (2019).
[410] O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation, https://people.inf.ethz.ch/omutlu/pub/onur-GLSVLSI-KeynoteTalk-EnablingInMemoryComputation-May-10-2019-unrolled.pptx, keynote talk at 29th ACM Great Lakes Symposium on VLSI (GLSVLSI), Washington, DC, USA, May 2019. (2019).
[411] O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation, https://people.inf.ethz.ch/omutlu/pub/onur-APPT-Keynote-EnablingInMemoryComputation-August-16-2019-unrolled.pptx, video available online; keynote talk at International Symposium on Advanced Parallel Processing Technology (APPT), Tianjin, China, 16 August 2019. (2019).