Processing Data Where It Makes Sense: Enabling In-Memory Computation
Onur Mutlu a,b, Saugata Ghose b, Juan Gómez-Luna a, Rachata Ausavarungnirun b,c

a ETH Zürich   b Carnegie Mellon University   c King Mongkut's University of Technology North Bangkok
Abstract
Today's systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in systems that cause performance, scalability and energy bottlenecks: (1) data access from memory is already a key bottleneck as applications become more data-intensive and memory bandwidth and energy do not scale well, (2) energy consumption is a key constraint, especially in mobile and server systems, (3) data movement is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are felt especially severely in the data-intensive server and energy-constrained mobile systems of today.

At the same time, conventional memory technology is facing many scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside DRAM chips, and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend.

In this work, we discuss some recent research that aims to practically enable computation close to data. After motivating trends in applications as well as technology, we discuss at least two promising directions for processing-in-memory (PIM): (1) performing massively-parallel bulk operations in memory by exploiting the analog operational properties of DRAM, with low-cost changes, (2) exploiting the logic layer in 3D-stacked memory technology to accelerate important data-intensive applications. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost.

Keywords: data movement, main memory, processing-in-memory, 3D-stacked memory, near-data processing
1. Introduction
Main memory, which is built using the Dynamic Random Access Memory (DRAM) technology, is a major component in nearly all computing systems. Across all of these systems, including servers, cloud platforms, and mobile/embedded devices, the data working set sizes of modern applications are rapidly growing, causing the main memory to be a significant bottleneck for these applications [1, 2, 3, 4, 5, 6, 7]. Alleviating the main memory bottleneck requires the memory capacity, energy, cost, and performance to all scale in an efficient manner. Unfortunately, it has become increasingly difficult in recent years to scale all of these dimensions [1, 2, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], and the main memory bottleneck has instead been worsening.

A major reason for the main memory bottleneck is the high cost associated with data movement. In today's computers, to perform any operation on data that resides in main memory, the memory controller must first issue a series of commands to the DRAM modules across an off-chip bus (known as the memory channel). The DRAM module responds by sending the data to the memory controller across the memory channel, after which the data is placed within a cache or registers. The CPU can perform the operation on the data only once the data is in the cache. This process of moving data from the DRAM to the CPU incurs a long latency, and consumes a significant amount of energy [7, 32, 33, 34, 35]. These costs are often exacerbated by the fact that much of the data brought into the caches is not reused by the CPU [36, 37], providing little benefit in return for the high latency and energy cost.

The cost of data movement is a fundamental issue with the processor-centric nature of contemporary computer systems, where the CPU is considered to be the master of the system and has been optimized heavily. In contrast, data storage units such as main memory are treated as unintelligent workers, and, thus, are largely not optimized. With the increasingly data-centric nature of contemporary and emerging applications, the processor-centric design approach leads to many inefficiencies. For example, within a single compute node, most of the node real estate is dedicated to handling the storage and movement of data (e.g., large on-chip caches, shared interconnects, memory controllers, off-chip interconnects, main memory) [38].

Recent advances in memory design and memory architecture have enabled the opportunity for a paradigm shift towards performing processing-in-memory (PIM), where we can redesign the computer to no longer be processor-centric and avoid unnecessary data movement. Processing-in-memory, also known as near-data processing (NDP), enables the ability to perform operations either using (1) the memory itself, or (2) some form of processing logic (e.g., accelerators, simple cores, reconfigurable logic) inside the DRAM subsystem. Processing-in-memory has been proposed for at least four decades [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53].
However, these past efforts were not adopted at large scale due to various reasons, including the difficulty of integrating processing elements with DRAM and the fact that memory technology was not facing scaling challenges as critical as those it faces today. As a result of advances in modern memory architectures, e.g., the integration of logic and memory in a 3D-stacked manner, various recent works explore a range of PIM architectures for multiple different purposes (e.g., [7, 32, 33, 34, 35, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]).

In this paper, we explore two approaches to enabling processing-in-memory in modern systems. The first approach examines a form of PIM that only minimally changes memory chips to perform simple yet powerful common operations that the chip could be made inherently very good at performing [31, 71, 82, 83, 84, 85, 86, 90, 92, 93, 94, 95, 96]. Solutions that fall under this approach take advantage of the existing DRAM design to cleverly and efficiently perform bulk operations (i.e., operations on an entire row of DRAM cells), such as bulk copy, data initialization, and bitwise operations. The second approach takes advantage of the design of emerging 3D-stacked memory technologies to enable PIM in a more general-purpose manner [7, 34, 35, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 70, 72, 73, 74, 75, 77, 87, 88, 89, 91]. In order to stack multiple layers of memory, 3D-stacked chips use vertical through-silicon vias (TSVs) to connect the layers to each other, and to the I/O drivers of the chip [97]. The TSVs provide much greater internal bandwidth than is available externally on the memory channel. Several such 3D-stacked memory architectures, such as the Hybrid Memory Cube [98, 99] and High-Bandwidth Memory [97, 100], include a logic layer, where designers can add some simple processing logic to take advantage of the high internal bandwidth.

For both approaches to PIM, there are a number of new challenges that system architects and programmers must address to enable the widespread adoption of PIM across the computing landscape and in different domains of workloads. In addition to describing work along the two key approaches, we also discuss these challenges in this paper, along with existing work that addresses them.
2. Major Trends Affecting Main Memory

The main memory is a major, critical component of all computing systems, including cloud and server platforms, desktop computers, mobile and embedded devices, and sensors. It is one of the two main pillars of any computing platform, together with the processing elements, namely CPU cores, GPU cores, or reconfigurable devices.

Due to its relatively low cost and low latency, DRAM is the predominant technology used to build main memory. Because of the growing data working set sizes of modern applications [1, 2, 3, 4, 5, 6, 7], there is an ever-increasing demand for higher DRAM capacity and performance. Unfortunately, DRAM technology scaling is becoming more and more challenging in terms of increasing the DRAM capacity and maintaining DRAM energy efficiency and reliability [1, 11, 15, 101, 102]. Thus, fulfilling the increasing memory needs of applications is becoming more and more costly and difficult [2, 3, 4, 8, 9, 10, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 35, 59, 103, 104, 105].

If CMOS technology scaling is coming to an end [106], the projections are significantly worse for DRAM technology scaling [107]. DRAM technology scaling affects all major characteristics of DRAM, including capacity, bandwidth, latency, energy and cost. We next describe the key issues and trends in DRAM technology scaling and discuss how these trends motivate the need for intelligent memory controllers, which can be used as a substrate for processing in memory.

The first key concern is the difficulty of scaling DRAM capacity (i.e., density, or cost per bit), bandwidth and latency at the same time. While the processing core count doubles every two years, the DRAM capacity doubles only every three years [20]. This causes the memory capacity per core to drop by approximately 30% every two years [20]. The trend is even worse for memory bandwidth per core: in the last 20 years, DRAM chip capacity (for the most common DDRx chip of the time) has improved around 128× while DRAM bandwidth has increased only around 20× [22, 23, 31]. In the same period of twenty years, DRAM latency (as measured by the row cycling time) has remained almost constant (i.e., reduced by only 30%), making it a significant performance bottleneck for many modern workloads, including in-memory databases [108, 109, 110, 111], graph processing [34, 112, 113], data analytics [110, 114, 115, 116], datacenter workloads [4], and consumer workloads [7]. As low-latency computing is becoming ever more important [1], e.g., due to the ever-increasing need to process large amounts of data in real time, and predictable performance continues to be a critical concern in the design of modern computing systems [2, 16, 117, 118, 119, 120, 121, 122, 123], it is increasingly critical to design low-latency main memory chips.

The second key concern is that DRAM technology scaling to smaller nodes adversely affects DRAM reliability. A DRAM cell stores each bit in the form of charge in a capacitor, which is accessed via an access transistor and peripheral circuitry. For a DRAM cell to operate correctly, both the capacitor and the access transistor (as well as the peripheral circuitry) need to operate reliably. As the size of the DRAM cell reduces, both the capacitor and the access transistor become less reliable. As a result, reducing the size of the DRAM cell increases the difficulty of correctly storing and detecting the desired original value in the DRAM cell [1, 11, 15, 101].
Hence, memory scaling causes memory errors to appear more frequently. For example, a study of Facebook's entire production datacenter servers showed that memory errors, and thus the server failure rate, increase proportionally with the density of the chips employed in the servers [124]. Thus, it is critical to make the main memory system more reliable in order to build reliable computing systems on top of it.

The third key issue is that the reliability problems caused by aggressive DRAM technology scaling can lead to new security vulnerabilities. The RowHammer phenomenon [11, 15] shows that it is possible to predictably induce errors (bit flips) in most modern DRAM chips. Repeatedly reading the same row in DRAM can corrupt data in physically-adjacent rows. Specifically, when a DRAM row is opened (i.e., activated) and closed (i.e., precharged) repeatedly (i.e., hammered) enough times within a DRAM refresh interval, one or more bits in physically-adjacent DRAM rows can be flipped to the wrong value. A very simple user-level program [125] can reliably and consistently induce RowHammer errors in vulnerable DRAM modules; a sketch of such a hammering loop appears below. The seminal paper that introduced RowHammer [11] showed that more than 85% of the chips tested, built by three major vendors between 2010 and 2014, were vulnerable to RowHammer-induced errors. In particular, all tested DRAM modules from 2012 and 2013 were vulnerable.

The RowHammer phenomenon entails a real reliability, and perhaps even more importantly, a real and prevalent security issue. It breaks physical memory isolation between two addresses, one of the fundamental building blocks of memory, on top of which system security principles are built. With RowHammer, accesses to one row (e.g., an application page) can modify data stored in another memory row (e.g., an OS page). This was confirmed by researchers from Google Project Zero, who developed a user-level attack that uses RowHammer to gain kernel privileges [126, 127]. Other researchers have shown how RowHammer vulnerabilities can be exploited in various ways to gain privileged access to various systems: RowHammer can be used to remotely take over a server via JavaScript [128]; a virtual machine can take over another virtual machine by inducing errors in the victim virtual machine's memory space [129]; a malicious application without permissions can take control of an Android mobile device [130]; or an attacker can gain arbitrary read/write access in a web browser on a Microsoft Windows 10 system [131]. For a more detailed treatment of the RowHammer problem and its consequences, we refer the reader to [11, 15, 132].
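To make the access pattern concrete, the following is a minimal C sketch of such a user-level hammering kernel, in the spirit of the code distributed with [125]. The pointers addr_x and addr_y are hypothetical: an attacker must choose two virtual addresses that map to different rows within the same DRAM bank. The clflush instruction (exposed via the _mm_clflush intrinsic) evicts the lines from the cache hierarchy so that every iteration reaches DRAM and re-activates the rows.

```c
#include <emmintrin.h>  /* _mm_clflush */
#include <stdint.h>

/* Repeatedly activate two rows in the same bank by reading two
 * addresses that map to them. Flushing the cache lines after each
 * read forces the next read to go to DRAM, which re-opens the row
 * and "hammers" its physical neighbors. Whether bit flips actually
 * occur depends on the specific DRAM module. */
static void hammer(volatile uint64_t *addr_x, volatile uint64_t *addr_y,
                   long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*addr_x;                      /* read -> row activation */
        (void)*addr_y;
        _mm_clflush((const void *)addr_x);  /* evict so next read hits DRAM */
        _mm_clflush((const void *)addr_y);
    }
}
```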
The fourth key issue is the power and energy consumption of main memory. DRAM is inherently a power and energy hog, as it consumes energy even when it is not used (e.g., it requires periodic memory refresh [14]), due to its charge-based nature. And energy consumption of main memory is becoming worse due to three major reasons. First, its capacity and complexity are both increasing. Second, main memory has remained off the main processing chip, even though many other platform components have been integrated into the processing chip and have benefited from the aggressive energy scaling and low-energy communication substrate on-chip. Third, the difficulties in DRAM technology scaling are making energy reduction very difficult with each technology generation. For example, Lefurgy et al. [133] showed in 2003 that, in large commercial servers designed by IBM, the off-chip memory hierarchy (including, at that time, DRAM, interconnects, memory controller, and off-chip caches) consumed between 40% and 50% of the total system energy. The trend has become even worse over the one to two decades since: in recent computing systems with CPUs or GPUs, DRAM alone is shown to account for more than 40% of the total system power [134, 135]. Hence, the power and energy consumption of main memory is increasing relative to that of other components in a computing platform. As energy efficiency and sustainability are critical necessities in computing platforms today, it is critical to reduce the energy and power consumption of main memory.
3. The Need for Intelligent Memory Controllers to Enhance Memory Scaling
A key promising approach to solving the four major issues above is to design intelligent memory controllers that can manage main memory better. If the memory controller is designed to be more intelligent and more programmable, it can, for example, incorporate flexible mechanisms to overcome various types of reliability issues (including RowHammer), manage latencies and power consumption better based on a deep understanding of the DRAM and application characteristics, provide enough support for programmability to prevent security and reliability vulnerabilities that are discovered in the field, and manage various types of memory technologies that are put together as a hybrid main memory to enhance the scaling of the main memory system. We provide a few examples of how an intelligent memory controller can help overcome circuit- and device-level issues we are facing at the main memory level. We believe having intelligent memory controllers can greatly alleviate the scaling issues encountered with main memory today, as we have described in an earlier position paper [1]. This is a direction that is also supported by industry today, as described in an informative paper written collaboratively by Intel and Samsung engineers on DRAM technology scaling issues [8].

First, the RowHammer vulnerability can be prevented by probabilistically refreshing rows that are adjacent to an activated row, with a very low probability. This solution, called PARA (Probabilistic Adjacent Row Activation) [11], was shown to provide strong, programmable guarantees against RowHammer, with very little power, performance and chip area overhead [11]. It requires a slightly more intelligent memory controller that knows (or can figure out) the physical adjacency of rows in a DRAM chip, and that is programmable enough to adjust the probability of adjacent row activation and issue refresh requests to adjacent rows according to the probability supplied by the system (a sketch of this decision logic appears below). As described by prior work [11, 15, 132], this solution has much lower overhead than increasing the refresh rate across the board for the entire main memory, which is the RowHammer solution employed by existing systems in the field that have simple and rigid memory controllers.
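As an illustration, here is a minimal sketch of PARA-style decision logic in a memory controller. The adjacency helpers and the refresh hook are stand-ins: real DRAM devices often remap rows internally, so a deployed controller would need the true physical adjacency from the vendor (or via reverse engineering), and the probability p would be a system-supplied parameter as described in [11].

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in adjacency: assume logical row i is physically adjacent to
 * rows i-1 and i+1. Real chips may scramble row addresses, so this
 * mapping must come from the device vendor in practice. */
static uint32_t row_above(uint32_t row) { return row + 1; }
static uint32_t row_below(uint32_t row) { return row - 1; }

/* Stub: send an activate+precharge (i.e., a refresh) to one row. */
static void issue_refresh(uint32_t row) { (void)row; }

/* PARA: on each row activation, with a small probability p, also
 * refresh one of the two physically adjacent rows. Over the many
 * activations needed to hammer a row, its neighbors get refreshed
 * with overwhelming probability before disturbance accumulates. */
static void on_row_activate(uint32_t row, double p /* e.g., 0.001 */)
{
    if ((double)rand() / RAND_MAX < p) {
        if (rand() & 1)
            issue_refresh(row_above(row));
        else
            issue_refresh(row_below(row));
    }
}
```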
Second, an intelligent memory controller can greatly alleviate the refresh problem in DRAM, and hence its negative consequences on energy, performance, predictability, and technology scaling, by understanding the retention time characteristics of different rows well. It is well known that the retention times of different cells in DRAM are widely different due to manufacturing process variation [14, 101]. Some cells are strong (i.e., they can retain data for hundreds of seconds), whereas some cells are weak (i.e., they can retain data for only 64 ms). Yet, today's memory controllers treat every cell as equal and refresh all rows every 64 ms, which is the worst-case retention time that is allowed. This worst-case refresh rate leads to a large number of unnecessary refreshes, and thus great energy waste and performance loss. Refresh is also shown to be the key technology scaling limiter of DRAM [8], and as such refreshing all DRAM cells at the worst-case rate is likely to make DRAM technology scaling difficult. An intelligent memory controller can overcome the refresh problem by identifying the minimum data retention time of each row (during online operation) and refreshing each row at the rate it really requires, or by decommissioning weak rows such that data is not stored in them. As shown by a recent body of work whose aim is to design such an intelligent memory controller that can perform inline profiling of DRAM cell retention times and online adjustment of refresh rate on a per-row basis [14, 101, 136, 137, 138, 139, 140, 141], including the works on RAIDR [14, 101], AVATAR [137] and REAPER [140], such an intelligent memory controller can eliminate more than 75% of all refreshes at very low cost, leading to significant energy reduction, performance improvement, and quality of service benefits, all at the same time (the sketch below illustrates the underlying retention-aware binning idea). Thus, the downsides of DRAM refresh can potentially be overcome with the design of intelligent memory controllers.
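To make the idea concrete, here is a minimal sketch of RAIDR-style retention-aware refresh [14], assuming rows have already been profiled into a small number of retention-time bins. The bin thresholds follow the spirit of RAIDR, but the bookkeeping is simplified: RAIDR stores the bins compactly in Bloom filters, whereas a plain array is used here only for clarity.

```c
#include <stdint.h>

#define NUM_ROWS 65536

/* Retention bin per row, filled in by a profiling step (not shown):
 * bin 0: weak rows, must be refreshed every 64 ms (worst case)
 * bin 1: rows that retain data for at least 128 ms
 * bin 2: rows that retain data for at least 256 ms */
static uint8_t retention_bin[NUM_ROWS];

/* Stub: issue a refresh command to one row. */
static void refresh_row(uint32_t row) { (void)row; }

/* Called once every 64 ms tick. A row in bin b is refreshed only
 * every 2^b ticks, so strong rows are refreshed 2x or 4x less often
 * than the worst-case rate, eliminating most refresh operations. */
static void refresh_tick(uint64_t tick)
{
    for (uint32_t row = 0; row < NUM_ROWS; row++) {
        uint8_t b = retention_bin[row];
        if ((tick & ((1u << b) - 1)) == 0)
            refresh_row(row);
    }
}
```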
Third, an intelligent memory controller can enable performance improvements that can overcome the limitations of memory scaling. As we discuss in Section 2, DRAM latency has remained almost constant over the past twenty years, despite the fact that low-latency computing has become more important during that time. Similar to how intelligent memory controllers handle the refresh problem, the controllers can exploit the fact that not all cells in DRAM need the same amount of time to be accessed. Manufacturers assign timing parameters that define the amount of time required to perform a memory access. In order to guarantee correct operation, the timing parameters are chosen to ensure that the worst-case cell in any DRAM chip that is sold can still be accessed correctly at worst-case operating temperatures [22, 24, 26, 105]. However, we find that access latency to cells is very heterogeneous due to variation in operating conditions (e.g., across different temperatures and operating voltages), manufacturing process (e.g., across different chips and different parts of a chip), and access patterns (e.g., whether or not the cell was recently accessed). We give six examples of how an intelligent memory controller can exploit these different types of heterogeneity.

(1) At low temperature, DRAM cells contain more charge, and as a result, can be accessed much faster than at high temperatures. We find that, averaged across 115 real DRAM modules from three major manufacturers, read and write latencies of DRAM can be reduced by 33% and 55%, respectively, when operating at relatively low temperature (55°C) compared to operating at worst-case temperature (85°C) [24, 142]. Thus, a slightly intelligent memory controller can greatly reduce memory latency by adapting the access latency to the operating temperature.

(2) Due to manufacturing process variation, we find that the majority of cells in DRAM (across different chips or within the same chip) can be accessed much faster than the manufacturer-provided timing parameters require [22, 24, 26, 31, 105, 142]. An intelligent memory controller can profile the DRAM chip, identify which cells can be accessed reliably at low latency, and use this information to reduce access latencies by as much as 57% [22, 26, 105].

(3) In a similar fashion, an intelligent memory controller can use similar properties of manufacturing process variation to reduce the energy consumption of a computer system, by exploiting the minimum voltage required for safe operation of different parts of a DRAM chip [25, 31]. The key idea is to reduce the operating voltage of a DRAM chip from the standard specification and tolerate the resulting errors by increasing access latency on a per-bank basis, while keeping performance degradation in check.

(4) Bank conflict latencies can be dramatically reduced by making modifications in the DRAM chip such that different subarrays in a bank can be accessed mostly independently, and designing an intelligent memory controller that can take advantage of requests that require data from different subarrays (i.e., exploit subarray-level parallelism) [12, 13].

(5) Access latency to a portion of the DRAM bank can be greatly reduced by partitioning the DRAM array such that a subset of rows can be accessed much faster than the other rows, and having an intelligent memory controller that decides what data should be placed in fast rows versus slow rows [23, 142].

(6) We find that a recently-accessed or recently-refreshed memory row can be accessed much more quickly than the standard latency if it needs to be accessed again soon, since the recent access or refresh has replenished the charge of the cells in the row. An intelligent memory controller can thus keep track of the charge level of recently-accessed/refreshed rows and use the appropriate access latency that corresponds to the charge level [30, 103, 104], leading to significant reductions in both access and refresh latencies.

Thus, the poor scaling of DRAM latency and energy can potentially be overcome with the design of intelligent memory controllers that can facilitate a large number of effective latency and energy reduction techniques.

Intelligent controllers are already in widespread use in another key part of a modern computing system. In solid-state drives (SSDs) consisting of NAND flash memory, the flash controllers that manage the SSDs are designed to incorporate a significant level of intelligence in order to improve both performance and reliability [143, 144, 145, 146, 147]. Modern flash controllers need to take into account a wide variety of issues such as remapping data, performing wear leveling to mitigate the limited lifetime of NAND flash memory devices, refreshing data based on the current wearout of each NAND flash cell, optimizing voltage levels to maximize memory lifetime, and enforcing fairness across different applications accessing the SSD. Much of the complexity in flash controllers is a result of mitigating issues related to the scaling of NAND flash memory [143, 144, 145, 148, 149]. We argue that, in order to overcome scaling issues in DRAM, the time has come for DRAM memory controllers to also incorporate significant intelligence.

As we describe above, introducing intelligence into the memory controller can help us overcome a number of key challenges in memory scaling. In particular, a significant body of work has demonstrated that the key reliability, refresh, and latency/energy issues in memory can be mitigated effectively with an intelligent memory controller. As we discuss in Section 4, this intelligence can go even further, by enabling the memory controllers (and the broader memory system) to perform application computation in order to overcome the significant data movement bottleneck in existing computing systems.
4. Perils of Processor-Centric Design
A major bottleneck against improving the overall system performance and the energy efficiency of today's computing systems is the high cost of data movement. This is a natural consequence of the von Neumann model [150], which separates computation and storage in two different system components (i.e., the computing unit versus the memory/storage unit) that are connected by an off-chip bus. With this model, processing is done only in one place, while data is stored in another, separate place. Thus, data needs to move back and forth between the memory/storage unit and the computing unit (e.g., CPU cores or accelerators).

In order to perform an operation on data that is stored within memory, a costly process is invoked. First, the CPU (or an accelerator) must issue a request to the memory controller, which in turn sends a series of commands across the off-chip bus to the DRAM module. Second, the data is read from the DRAM module and returned to the memory controller. Third, the data is placed in the CPU cache and registers, where it is accessible by the CPU cores. Finally, the CPU can operate (i.e., perform computation) on the data. All these steps consume substantial time and energy in order to bring the data into the CPU chip [4, 7, 151, 152].

In current computing systems, the CPU is the only system component that is able to perform computation on data. The rest of the system components are devoted only to data storage (memory, caches, disks) and data movement (interconnects); they are incapable of performing computation. As a result, current computing systems are grossly imbalanced, leading to large amounts of energy inefficiency and low performance. As empirical evidence of the gross imbalance caused by the processor-memory dichotomy in the design of computing systems today, we have recently observed that more than 62% of the entire system energy consumed by four major commonly-used mobile consumer workloads (including the Chrome browser, the TensorFlow machine learning inference engine, and the VP9 video encoder and decoder) is spent on moving data between the memory system and the computing units [7]. Thus, the fact that current systems can perform computation only in the computing unit (CPU cores and hardware accelerators) causes significant energy waste by necessitating data movement across the entire system.

At least five factors contribute to the performance loss and the energy waste associated with retrieving data from main memory, which we briefly describe next.

First, the width of the off-chip bus between the memory controller and the main memory is narrow, due to pin count and cost constraints, leading to relatively low bandwidth to/from main memory. This makes it difficult to send a large number of requests to memory in parallel.

Second, current computing systems deploy complex multi-level cache hierarchies and latency tolerance/hiding mechanisms (e.g., sophisticated caching algorithms at many different caching levels, multiple complex prefetching techniques, high amounts of multithreading, complex out-of-order execution) to tolerate the latency of data access from memory. These components, while sometimes effective at improving performance, are costly in terms of both die area and energy consumption, as well as the additional latency required to access/manage them. These components also increase the complexity of the system significantly.
Hence, the architectural techniques used in modern systems to tolerate the consequences of the dichotomy between the processing unit and main memory lead to significant energy waste and additional complexity.

Third, the caches are not always properly leveraged. Much of the data brought into the caches is not reused by the CPU [36, 37], e.g., in streaming or random-access applications. This renders the caches either very inefficient or unnecessary for a wide variety of modern workloads.

Fourth, many modern applications, such as graph processing [34, 35], produce random memory access patterns. In such cases, not only the caches but also the off-chip bus and the DRAM memory itself become very inefficient, since only a small part of each retrieved cache line is actually used by the CPU. Such accesses are also not easy to prefetch and often either confuse the prefetchers or render them ineffective. Modern memory hierarchies are not designed to work well for random access patterns.

Fifth, the computing unit and the memory unit are connected through long, power-hungry interconnects. These interconnects impose significant additional latency on every data access and represent a significant fraction of the energy spent on moving data to/from the DRAM memory. In fact, off-chip interconnect latency and energy consumption is a key limiter of performance and energy in modern systems [16, 23, 71, 82], as it greatly exacerbates the cost of data movement.

The increasing disparity between processing technology and memory/communication technology has resulted in systems in which communication (data movement) costs dominate computation costs in terms of energy consumption. The energy consumption of a main memory access is between two and three orders of magnitude the energy consumption of a complex addition operation today; for example, [152] reports that a memory access consumes over a hundred times the energy of an addition operation. As a result, data movement accounts for 40% [151], 35% [152], and 62% [7] of the total system energy in scientific, mobile, and consumer applications, respectively. This energy waste due to data movement is a huge burden that greatly limits the efficiency and performance of all modern computing platforms, from datacenters with a restricted power budget to mobile devices with limited battery life.

Overcoming all the reasons that cause low performance and large energy inefficiency (as well as high system design complexity) in current computing systems requires a paradigm shift. We believe that future computing architectures should become more data-centric: they should (1) perform computation with minimal data movement, and (2) compute where it makes sense (i.e., where the data resides), as opposed to computing solely in the CPU or accelerators. Thus, the traditional rigid dichotomy between the computing units and the memory/communication units needs to be broken, and a new paradigm enabling computation where the data resides needs to be invented and enabled.
5. Processing-in-Memory (PIM)
Large amounts of data movement are a major result of the predominant processor-centric design paradigm of modern computers. Eliminating unnecessary data movement between the memory unit and the compute unit is essential to make future computing architectures higher-performance, more energy-efficient, and sustainable. To this end, processing-in-memory (PIM) equips the memory subsystem with the ability to perform computation.

In this section, we describe two promising approaches to implementing PIM in modern architectures. The first approach exploits the existing DRAM architecture and the operational principles of the DRAM circuitry to enable bulk processing operations within the main memory with minimal changes. This minimalist approach can be especially powerful for performing specialized computation in main memory by taking advantage of what the main memory substrate is extremely good at performing, with minimal changes to the existing memory chips. The second approach exploits the ability to implement a wide variety of general-purpose processing logic in the logic layer of 3D-stacked memory, and thus the high internal bandwidth and low latency available between the logic layer and the memory layers of 3D-stacked memory. This is a more general approach, where the logic implemented in the logic layer can be general-purpose and thus can benefit a wide variety of applications.

The first approach to implementing processing-in-memory modifies existing DRAM architectures minimally to extend their functionality with computing capability. This approach takes advantage of the existing interconnects in, and the analog operational behavior of, conventional DRAM architectures (e.g., DDRx, LPDDRx, HBM), without the need for a dedicated logic layer or logic processing elements, and usually with very low overheads. Mechanisms that use this approach take advantage of the high internal bandwidth available within each DRAM cell array. There are a number of example PIM architectures that make use of this approach [31, 82, 83, 84, 85, 86, 92, 93]. In this section, we first focus on two such designs: RowClone, which enables in-DRAM bulk data movement operations [82], and Ambit, which enables in-DRAM bulk bitwise operations [83, 85, 86]. Then, we describe a low-cost substrate that performs data reorganization for non-unit-strided access patterns [71].
Two important classes of bandwidth-intensive memory operations are (1) bulk data copy, where a large quantity of data is copied from one location in physical memory to another; and (2) bulk data initialization, where a large quantity of data is initialized to a specific value. We refer to these two operations as bulk data movement operations. Prior research [4, 153, 154] has shown that operating systems and data center workloads spend a significant portion of their time performing bulk data movement operations. Therefore, accelerating these operations will likely improve system performance and energy efficiency.

We have developed a mechanism called RowClone [82], which takes advantage of the fact that bulk data movement operations do not require any computation on the part of the processor. RowClone exploits the internal organization and operation of DRAM to perform bulk data copy/initialization quickly and efficiently inside a DRAM chip. A DRAM chip contains multiple banks, which are connected together and to the I/O circuitry by a shared internal bus, and each bank is divided into multiple subarrays [12, 82, 155]. Each subarray contains many rows of DRAM cells, where each column of DRAM cells is connected together across the multiple rows using bitlines.

RowClone consists of two mechanisms that take advantage of this existing DRAM structure (the sketch below shows how a controller might choose between them). The first mechanism, Fast Parallel Mode, copies the data of a row inside a subarray to another row inside the same DRAM subarray by issuing back-to-back activate (i.e., row open) commands to the source and the destination rows. The second mechanism, Pipelined Serial Mode, can transfer an arbitrary number of bytes between two banks using the shared internal bus among banks in a DRAM chip.

RowClone significantly reduces the raw latency and energy consumption of bulk data copy and initialization, leading to an 11.6× latency reduction and a 74.4× energy reduction for a 4 kB bulk page copy (using the Fast Parallel Mode), at very low cost (only 0.01% DRAM chip area overhead) [82]. This reduction directly translates to improvements in the performance and energy efficiency of systems running copy- or initialization-intensive workloads. Our MICRO 2013 paper [82] shows that the performance of six copy/initialization-intensive benchmarks (including the fork system call, Memcached [156] and a MySQL [157] database) improves between 4% and 66%. For the same six benchmarks, RowClone reduces the energy consumption between 15% and 69%.
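As a sketch of how a memory controller (or a thin driver layer) might dispatch a RowClone copy, the following hypothetical logic chooses a mechanism based on where the source and destination rows reside. The function and field names are illustrative, not from [82]; in particular, [82] handles the same-bank, different-subarray case with serial transfers, which this sketch simplifies into a CPU fallback.

```c
#include <stdint.h>

/* Illustrative physical location of a DRAM row. */
typedef struct {
    uint32_t bank;
    uint32_t subarray;
    uint32_t row;
} dram_row_t;

/* Stubs standing in for controller command sequences. */
static void issue_fpm_copy(dram_row_t s, dram_row_t d) { (void)s; (void)d; }
static void issue_psm_copy(dram_row_t s, dram_row_t d) { (void)s; (void)d; }
static void cpu_copy_fallback(dram_row_t s, dram_row_t d) { (void)s; (void)d; }

/* Pick the cheapest in-DRAM copy mechanism for a row-granularity copy.
 * Fast Parallel Mode (FPM) requires source and destination to share a
 * subarray; Pipelined Serial Mode (PSM) works across banks over the
 * shared internal bus. */
static void rowclone_copy(dram_row_t src, dram_row_t dst)
{
    if (src.bank == dst.bank && src.subarray == dst.subarray)
        issue_fpm_copy(src, dst);   /* fastest: row-to-row via bitlines */
    else if (src.bank != dst.bank)
        issue_psm_copy(src, dst);   /* overlap read from src bank with
                                       write to dst bank */
    else
        cpu_copy_fallback(src, dst); /* simplified; see [82] for the
                                        intra-bank, inter-subarray case */
}
```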
In addition to bulk data movement, many applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors [158, 159]. Examples of such applications include bitmap indices [160, 161, 162, 163] and bitwise scan acceleration [164] for databases, accelerated document filtering for web search [165], DNA sequence alignment [166, 167, 168], encryption algorithms [169, 170, 171], graph processing [78], and networking [159]. Accelerating bulk bitwise operations can thus significantly boost the performance and energy efficiency of a wide range of applications.

In order to avoid data movement bottlenecks when the system performs these bulk bitwise operations, we have recently proposed a new Accelerator-in-Memory for bulk Bitwise operations (Ambit) [83, 85, 86]. Unlike prior approaches, Ambit uses the analog operation of existing DRAM technology to perform bulk bitwise operations. Ambit consists of two components. The first component, Ambit–AND–OR, implements a new operation called triple-row activation, where the memory controller simultaneously activates three rows. Triple-row activation performs a bitwise majority function across the cells in the three rows, due to the charge sharing principles that govern the operation of the DRAM array. By controlling the initial value of one of the three rows, we can use triple-row activation to perform a bitwise AND or OR of the other two rows (the short example below works through this identity). The second component, Ambit–NOT, takes advantage of the two inverters that are connected to each sense amplifier in a DRAM subarray. Ambit–NOT exploits the fact that, at the end of the sense amplification process, the voltage level of one of the inverters represents the negated logical value of the cell. The Ambit design adds a special row to the DRAM array, which is used to capture the negated value that is present in the sense amplifiers. One possible implementation of the special row [86] is a row of dual-contact cells (a 2-transistor 1-capacitor cell [172, 173]) that connects to both inverters inside the sense amplifier. With the ability to perform AND, OR, and NOT operations, Ambit is functionally complete: it can reliably perform any bulk bitwise operation completely using DRAM technology, even in the presence of significant process variation (see [86] for details).

Averaged across seven commonly-used bitwise operations, Ambit with 8 DRAM banks improves bulk bitwise operation throughput by 44× compared to an Intel Skylake processor [174], and by 32× compared to the NVIDIA GTX 745 GPU [175]. Compared to the DDR3 standard, Ambit reduces the energy consumption of these operations by 35× on average. Compared to HMC 2.0 [99], Ambit improves bulk bitwise operation throughput by 2.4×. When integrated directly into the HMC 2.0 device, Ambit improves throughput by 9.7× compared to processing in the logic layer of HMC 2.0.

A number of Ambit-like bitwise operation substrates have been proposed in recent years, making use of emerging resistive memory technologies, e.g., phase-change memory (PCM) [17, 19, 176, 177, 178, 179], SRAM, or specialized computational DRAM. These substrates can perform bulk bitwise operations in a special DRAM array augmented with computational circuitry [90] and in PCM [78]. Similar substrates can perform simple arithmetic operations in SRAM [79, 80] and arithmetic and logical operations in memristors [81, 180, 181, 182, 183]. We believe it is extremely important to continue exploring such low-cost Ambit-like substrates, as well as more sophisticated computational substrates, for all types of memory technologies, old and new. Resistive memory technologies are fundamentally non-volatile and amenable to in-place updates, and as such can lead to even less data movement compared to DRAM, which fundamentally requires some data movement to access the data. Thus, we believe it is very promising to examine the design of emerging resistive memory chips that can incorporate Ambit-like bitwise operations and other types of suitable computation capability.
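To see why controlling one row yields AND or OR, consider the bit-level majority function MAJ(A, B, C) that triple-row activation computes: with C = 0, MAJ(A, B, 0) = A AND B; with C = 1, MAJ(A, B, 1) = A OR B. The following minimal C model checks this identity on whole machine words, with each bit position modeling one DRAM column. It is a functional sketch of the logic only, not of the analog charge-sharing circuit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bitwise majority across three rows: each bit position models one
 * DRAM column taking part in a triple-row activation. */
static uint64_t majority3(uint64_t a, uint64_t b, uint64_t c)
{
    return (a & b) | (a & c) | (b & c);
}

int main(void)
{
    uint64_t a = 0xDEADBEEFCAFEF00Dull;
    uint64_t b = 0x0123456789ABCDEFull;

    /* Setting the control row to all zeros yields AND ... */
    assert(majority3(a, b, 0x0000000000000000ull) == (a & b));
    /* ... and setting it to all ones yields OR. */
    assert(majority3(a, b, 0xFFFFFFFFFFFFFFFFull) == (a | b));

    printf("MAJ-based AND/OR identities hold\n");
    return 0;
}
```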
Many applications access data structures with different access patterns at different points in time. Depending on the layout of the data structures in physical memory, some access patterns require non-unit strides. As current memory systems are optimized to access sequential cache lines, non-unit-strided accesses exhibit low spatial locality, leading to wasted memory bandwidth and wasted cache space.

Gather-Scatter DRAM (GS-DRAM) [71] is a low-cost substrate that addresses this problem. It performs in-DRAM data structure reorganization by accessing multiple values that belong to a strided access pattern using a single read/write command in the memory controller. GS-DRAM uses two key new mechanisms. First, GS-DRAM remaps the data of each cache line to different chips such that multiple values of a strided access pattern are mapped to different chips. This enables the possibility of gathering different parts of the strided access pattern concurrently from different chips. Second, instead of sending separate requests to each chip, the GS-DRAM memory controller communicates a pattern ID to the memory module. With the pattern ID, each chip computes the address to be accessed independently. This way, the returned cache line contains different values of the strided pattern gathered from different chips (the sketch below illustrates the idea).

GS-DRAM achieves near-ideal memory bandwidth and cache utilization for real-world workloads, such as in-memory databases and matrix multiplication. For in-memory databases, GS-DRAM performs 3× better than a column-store layout on a transactional workload and 2× better than a row-store layout on an analytics workload, thereby achieving the best performance for both transactional and analytical queries (which in general benefit from different data layouts). For matrix multiplication, GS-DRAM is 10% faster than the best-performing tiled implementation of the matrix multiplication algorithm.
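The following toy model gives the flavor of the pattern-ID idea under simplifying assumptions (8 chips, one value per chip per access, power-of-two strides); the address arithmetic is illustrative, not the exact function from [71].

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CHIPS 8

/* Toy model: each chip supplies one value of the returned cache line.
 * Given a pattern ID encoding a power-of-two stride, every chip
 * locally computes which element of the strided pattern it should
 * supply, so one command gathers NUM_CHIPS strided values at once. */
static uint64_t chip_local_addr(uint64_t base, uint32_t chip,
                                uint32_t stride /* pattern ID */)
{
    /* Chip i contributes the i-th element of the pattern. */
    return base + (uint64_t)chip * stride;
}

int main(void)
{
    /* Gather field 0 of an 8-field (stride-8) array of structs: one
     * access returns elements 0, 8, 16, ... instead of eight mostly
     * useless sequential cache lines. */
    for (uint32_t chip = 0; chip < NUM_CHIPS; chip++)
        printf("chip %u supplies element %llu\n", chip,
               (unsigned long long)chip_local_addr(0, chip, 8));
    return 0;
}
```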
Several works propose to place some form of processing logic (typically accelerators, simple cores, or reconfigurable logic) inside the logic layer of 3D-stacked memory [97]. This PIM processing logic, which we also refer to as PIM cores or PIM engines, interchangeably, can execute portions of applications (from individual instructions to functions) or entire threads and applications, depending on the design of the architecture. Such PIM engines have high-bandwidth and low-latency access to the memory stacks that are on top of them, since the logic layer and the memory layers are connected via high-bandwidth vertical connections [97], e.g., through-silicon vias. In this section, we discuss how systems can make use of relatively simple PIM engines within the logic layer to avoid data movement and thus obtain significant performance and energy improvements on a wide variety of application domains.
A popular modern application domain is large-scale graph processing [87, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193]. Graph processing has broad applicability and use in many domains, from social networks to machine learning, from data analytics to bioinformatics. Graph analysis workloads are known to put significant pressure on memory bandwidth due to (1) large amounts of random memory accesses across large memory regions (leading to very limited cache efficiency and very large amounts of unnecessary data transfer on the memory bus) and (2) very small amounts of computation per data item fetched from memory (leading to very limited ability to hide long memory latencies and exacerbating the energy bottleneck by exercising the huge energy disparity between memory access and computation). These two characteristics make it very challenging to scale up such workloads despite their inherent parallelism, especially with conventional architectures based on large on-chip caches and relatively scarce off-chip memory bandwidth for random access.

We can exploit the high bandwidth as well as the potential computation capability available within the logic layer of 3D-stacked memory to overcome the limitations of conventional architectures for graph processing. To this end, we design a programmable PIM accelerator for large-scale graph processing, called Tesseract [34]. Tesseract consists of (1) a new hardware architecture that effectively utilizes the available memory bandwidth in 3D-stacked memory by placing simple in-order processing cores in the logic layer and enabling each core to manipulate data only on the memory partition it is assigned to control, (2) an efficient method of communication between different in-order cores within a 3D-stacked memory to enable each core to request computation on data elements that reside in the memory partition controlled by another core, and (3) a message-passing-based programming interface, similar to how modern distributed systems are programmed, which enables remote function calls on data that resides in each memory partition (a sketch of this interface appears below). The Tesseract design moves functions to data rather than moving data elements across different memory partitions and cores. It also includes two hardware prefetchers specialized for the memory access patterns of graph processing, which operate based on hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed Tesseract PIM architecture improves average system performance by 13.8× and achieves an 87% average energy reduction over conventional systems.
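To give a feel for the message-passing interface, here is a hedged sketch of what a Tesseract-style remote function call might look like for one step of PageRank-like rank propagation. The API name pim_remote_call is a hypothetical stand-in for the primitives described in [34] (which distinguish blocking and non-blocking calls); a direct-execution stub is included so the sketch is self-contained.

```c
#include <stdint.h>

typedef struct {
    double rank;
    uint32_t out_degree;
} vertex_t;

/* Hypothetical runtime primitive: enqueue fn(target, arg) for
 * execution on the PIM core owning the memory partition that holds
 * 'target'. Non-blocking calls are batched and executed later, which
 * is how Tesseract-style designs hide remote-access latency. The
 * stub below simply executes directly so the sketch compiles. */
static void pim_remote_call(void *target, void (*fn)(void *, double),
                            double arg)
{
    fn(target, arg);
}

/* Runs locally on the owning PIM core: no vertex data crosses
 * partitions; only the function pointer and a small argument did. */
static void add_contribution(void *v, double contrib)
{
    ((vertex_t *)v)->rank += contrib;
}

/* One propagation step for vertex u: push rank to its neighbors,
 * wherever their partitions are. */
static void push_rank(vertex_t *u, vertex_t **neighbors, uint32_t n)
{
    double contrib = u->rank / u->out_degree;
    for (uint32_t i = 0; i < n; i++)
        pim_remote_call(neighbors[i], add_contribution, contrib);
}
```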
A very popular domain of computing today consists of consumer devices, which include smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. In consumer devices, energy efficiency is a first-class concern due to the limited battery capacity and the stringent thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in modern consumer devices. Across all of the popular modern applications we study (described in the next paragraph), we find that a massive 62.7% of the total system energy, on average, is spent on data movement across the memory hierarchy [7].

We comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads [7], which account for a significant portion of the applications executed on consumer devices. These workloads include (1) the Chrome web browser [194], which is a very popular browser used in mobile devices and laptops; (2) TensorFlow Mobile [195], Google's machine learning framework, which is used in services such as Google Translate, Google Now, and Google Photos; (3) the VP9 video playback engine [196]; and (4) the VP9 video capture engine [196], both of which are used in many video services such as YouTube and Google Hangouts. We find that offloading key functions to the logic layer can greatly reduce data movement in all of these workloads. However, there are challenges to introducing PIM in consumer devices, as consumer devices are extremely stringent in terms of the area and energy budget they can accommodate for any new hardware enhancement. As a result, we need to identify what kind of in-memory logic can both (1) maximize energy efficiency and (2) be implemented at minimum possible cost, in terms of both area overhead and complexity.

We find that many of the target functions for PIM in consumer workloads are comprised of simple operations such as memcopy, memset, basic arithmetic and bitwise operations, and simple data shuffling and reorganization routines. Therefore, we can relatively easily implement these PIM target functions in the logic layer of 3D-stacked memory using either (1) a small low-power general-purpose embedded core or (2) a group of small fixed-function accelerators. Our analysis shows that the area of a PIM core and a PIM accelerator take up no more than 9.4% and 35.4%, respectively, of the area available for PIM logic in an HMC-like [197] 3D-stacked memory architecture. Both the PIM core and the PIM accelerator eliminate a large amount of data movement, and thereby significantly reduce total system energy (by an average of 55.4% across all the workloads) and execution time (by an average of 54.2%).
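As a concrete illustration of offloading such simple functions, the following hedged sketch shows how a driver might route a large memset either to a hypothetical logic-layer PIM engine or to the CPU. The pim_memset_cmd interface and the size threshold are assumptions for illustration, not the mechanism of [7]; a real system would also have to keep caches coherent with the PIM write (see Section 6).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical PIM command: ask the logic-layer engine to fill a
 * physically contiguous buffer. Stubbed to "no PIM hardware" here so
 * the sketch is self-contained; returns 0 on success. */
static int pim_memset_cmd(uintptr_t phys, uint8_t value, size_t len)
{
    (void)phys; (void)value; (void)len;
    return -1;
}

/* Offload policy sketch: small fills stay on the CPU (cheap, likely
 * cache-resident); large fills go to the PIM engine so every byte is
 * not dragged through the cache hierarchy. */
static void smart_memset(void *virt, uintptr_t phys,
                         uint8_t value, size_t len)
{
    const size_t PIM_THRESHOLD = 64 * 1024; /* assumed cutoff */

    if (len >= PIM_THRESHOLD && pim_memset_cmd(phys, value, len) == 0)
        return;                  /* PIM engine wrote directly to DRAM */
    memset(virt, value, len);    /* fallback: ordinary CPU path */
}
```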
In the last decade, Graphics Processing Units (GPUs) have become the accelerator of choice for a wide variety of data-parallel applications. They deploy thousands of in-order, SIMT (Single Instruction, Multiple Thread) cores that run lightweight threads. Their multithreaded architecture is devised to hide the long latency of memory accesses by interleaving threads that execute arithmetic and logic operations. Despite that, many GPU applications are still very memory-bound [198, 199, 200, 201, 202, 203, 204, 205, 206, 207], because the limited off-chip pin bandwidth cannot supply enough data to the running threads.

3D-stacked memory architectures present a promising opportunity to alleviate the memory bottleneck in GPU systems. GPU cores placed in the logic layer of a 3D-stacked memory can be directly connected to the DRAM layers with high-bandwidth (and low-latency) connections. In order to leverage the potential performance benefits of such systems, it is necessary to enable computation offloading and data mapping to multiple such compute-capable 3D-stacked memories, such that GPU applications can benefit from processing-in-memory capabilities in the logic layers of such memories.

TOM (Transparent Offloading and Mapping) [59] proposes two mechanisms to address computation offloading and data mapping in such a system in a programmer-transparent manner. First, it introduces new compiler analysis techniques to identify code sections in GPU kernels that can benefit from PIM offloading. The compiler estimates the potential memory bandwidth savings for each code block. To this end, the compiler compares the bandwidth consumption of the code block, when executed on the regular GPU cores, to the bandwidth cost of transmitting/receiving input/output registers when offloading to the GPU cores in the logic layers (the sketch below illustrates this comparison). At runtime, a final offloading decision is made based on system conditions, such as contention for processing resources in the logic layer. Second, a software/hardware cooperative mechanism predicts the memory pages that will be accessed by offloaded code, and places such pages in the same 3D-stacked memory cube where the code will be executed. The goal is to make PIM effective by ensuring that the data needed by the PIM cores is in the same memory stack. Both mechanisms are completely transparent to the programmer, who only needs to write regular GPU code without any explicit PIM instructions or any other modification to the code. TOM improves the average performance of a variety of GPGPU workloads by 30% and reduces the average energy consumption by 11% with respect to a baseline GPU system without PIM offloading capabilities.

A related work [60] identifies GPU kernels that are suitable for PIM offloading by using a regression-based affinity prediction model. A concurrent kernel management mechanism uses the affinity prediction model and determines which kernels should be scheduled concurrently to maximize performance. This way, the proposed mechanism enables the simultaneous exploitation of the regular GPU cores and the in-memory GPU cores. This scheduling technique improves performance and energy efficiency by an average of 42% and 27%, respectively.
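The following is a minimal sketch of the kind of bandwidth comparison such a compiler pass might perform; the cost model is deliberately simplified to byte counts, and the structure fields are assumptions, not the exact analysis of [59].

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified static profile of a candidate GPU code block. */
typedef struct {
    uint64_t bytes_loaded;   /* off-chip traffic if run on GPU cores */
    uint64_t bytes_stored;
    uint64_t input_regs;     /* live-in registers to ship to PIM */
    uint64_t output_regs;    /* live-out registers to ship back */
} block_profile_t;

/* Mark a block as an offload candidate if the off-chip traffic it
 * would generate on the regular GPU cores exceeds the cost of
 * shipping its live-in/live-out registers to/from the logic layer.
 * The final decision is still made at runtime based on load. */
static bool is_offload_candidate(const block_profile_t *b)
{
    uint64_t traffic_on_gpu = b->bytes_loaded + b->bytes_stored;
    uint64_t transfer_cost  = (b->input_regs + b->output_regs) * 8;

    return traffic_on_gpu > transfer_cost;
}
```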
PIM-Enabled Instructions (PEI) [35] aim to provide the minimal processing-in-memory support needed to take advantage of PIM using 3D-stacked memory, in a way that achieves significant performance and energy benefits without changing the computing system significantly. To this end, PEI proposes a collection of simple instructions, which introduce negligible changes to the computing system and no changes to the programming model or the virtual memory system, in a system with 3D-stacked memory. These instructions, inserted by the compiler/programmer into regular program code, are operations that can be executed either in a traditional host CPU (which fetches and decodes them) or in the PIM engine in 3D-stacked memory.

PIM-Enabled Instructions are based on two key ideas. First, a PEI is a cache-coherent, virtually-addressed host processor instruction that operates on only a single cache block. It requires no changes to the sequential execution and programming model, no changes to virtual memory, minimal changes to cache coherence, and no need for special data mapping to take advantage of PIM (because each PEI is restricted to a single memory module by its single-cache-block restriction). Second, a Locality-Aware Execution runtime mechanism decides dynamically where to execute a PEI (i.e., either the host processor or the PIM logic) based on simple locality characteristics and simple hardware predictors. This runtime mechanism executes the PEI at the location that maximizes performance. In summary, PIM-Enabled Instructions provide the illusion that PIM operations are executed as if they were host instructions. Examples of PEIs are integer increment, integer minimum, floating-point addition, hash table probing, histogram bin index, Euclidean distance, and dot product [35]; a sketch of the locality-aware dispatch idea appears below. Data-intensive workloads such as graph processing, in-memory data analytics, machine learning, and data mining can significantly benefit from these PEIs. Across 10 key data-intensive workloads, we observe that the use of PEIs, in combination with the Locality-Aware Execution runtime mechanism, leads to an average performance improvement of 47% and an average energy reduction of 25% over a baseline CPU.
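As an illustration of locality-aware dispatch, the following hedged sketch models how a single PEI (here, an integer increment on one cache block) might be routed. The predictor interface is a hypothetical stand-in for the hardware locality monitor described in [35], and host-side stubs are included so the sketch is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical locality predictor: returns true if the cache block
 * holding 'addr' is likely resident (or soon reused) in the host
 * CPU's caches, in which case executing on the host is cheaper. */
static bool predict_host_locality(const void *addr)
{
    (void)addr;
    return true; /* stub: always predict host locality */
}

/* Stubs: the observable semantics are identical on both paths. */
static void host_increment(uint64_t *addr) { (*addr)++; }
static void pim_increment(uint64_t *addr)  { (*addr)++; }

/* One PEI: increment a counter that lives in a single cache block.
 * The program's result is the same either way; only the location of
 * execution changes, which is the key PEI property. */
static void pei_increment(uint64_t *addr)
{
    if (predict_host_locality(addr))
        host_increment(addr);  /* data is hot: avoid a memory round trip */
    else
        pim_increment(addr);   /* data is cold: avoid moving the block */
}
```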
6. Enabling the Adoption of PIM
Pushing some or all of the computation for a program from the CPU to memory introduces new challenges for system architects and programmers to overcome. These challenges must be addressed in order for PIM to be adopted as a mainstream architecture in a wide variety of systems and workloads, and in a seamless manner that does not place a heavy burden on the vast majority of programmers. In this section, we discuss several of these system-level and programming-level challenges, and highlight a number of our works that have addressed these challenges for a wide range of PIM architectures.
Two open research questions for enabling the adoption of PIM are (1) what the programming models should be, and (2) how compilers and libraries can alleviate the programming burden.

While PIM-Enabled Instructions [35] work well for offloading small amounts of computation to memory, they can potentially introduce overheads when taking advantage of PIM for large tasks, due to the need to frequently exchange information between the PIM processing logic and the CPU. Hence, there is a need for researchers to investigate how to integrate PIM instructions with other compiler-based methods or library calls that can support PIM integration, and how these approaches can ease the burden on the programmer by enabling seamless offloading of instructions or function/library calls (one possible shape of such a library interface is sketched below).

Such solutions can often be platform-dependent. One of our recent works [59] examines compiler-based mechanisms to decide what portions of code should be offloaded to PIM processing logic in a GPU-based system, in a manner that is transparent to the GPU programmer. Another recent work [60] examines system-level techniques that decide which GPU application kernels are suitable for PIM execution. Determining effective programming interfaces and the necessary compiler/library support for PIM remain open research and design questions, which are important for future works to tackle.

We identify four key runtime issues in PIM: (1) what code to execute near data, (2) when to schedule execution on PIM (i.e., when it is worth offloading computation to the PIM cores), (3) how to map data to multiple memory modules such that PIM execution is viable and effective, and (4) how to effectively share/partition PIM mechanisms/accelerators at runtime across multiple threads/cores to maximize performance and energy efficiency. We have already proposed several approaches to solve these four issues.

Our recent works in PIM processing identify suitable PIM offloading candidates at different granularities. PIM-Enabled Instructions [35] propose various operations that can benefit from execution near or inside memory, such as integer increment, integer minimum, floating-point addition, hash table probing, histogram bin index, Euclidean distance, and dot product. In [7], we find simple functions with intensive data movement that are suitable for PIM in consumer workloads (e.g., the Chrome web browser, TensorFlow Mobile, video playback, and video capture), as described in Section 5.2.2. Bulk memory operations (copy, initialization) and bulk bitwise operations are good candidates for in-DRAM processing [82, 83, 86, 93]. GPU applications also contain several parts that are suitable for offloading to PIM engines [59, 60].

In several of our research works, we propose runtime mechanisms for dynamic scheduling of PIM offloading candidates, i.e., mechanisms that decide whether or not to actually offload code that is marked to be potentially offloaded to PIM engines. In [35], we develop a locality-aware scheduling mechanism for PIM-enabled instructions. For GPU-based systems [59, 60], we explore the combination of compile-time and runtime mechanisms for identification and dynamic scheduling of PIM offloading candidates.
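As a purely illustrative example of the kind of library interface such research might converge on, the following sketch hides offloading behind a single call. Every name here is hypothetical; the fallback path shows how a library can preserve correctness when no PIM engine is present or the runtime declines to offload.

```c
#include <stddef.h>

typedef void (*elem_fn_t)(void *elem);

/* Hypothetical runtime hook: try to run fn over the data near memory.
 * Stubbed to "not offloaded" so the sketch is self-contained. */
static int pim_try_offload(void *data, size_t n, size_t elem_size,
                           elem_fn_t fn)
{
    (void)data; (void)n; (void)elem_size; (void)fn;
    return -1;
}

/* Library entry point: apply 'fn' to 'n' elements of 'data'. The
 * runtime decides where to run it (data placement, engine
 * availability, and expected benefit all feed the decision), hidden
 * from the programmer. */
static void pim_foreach(void *data, size_t n, size_t elem_size,
                        elem_fn_t fn)
{
    if (pim_try_offload(data, n, elem_size, fn) == 0)
        return;                      /* executed near memory */

    /* CPU fallback keeps the programming model unchanged. */
    char *p = (char *)data;
    for (size_t i = 0; i < n; i++)
        fn(p + i * elem_size);
}
```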
The best mapping of data and code that enables the maximal benefits from PIM depends on the applications and the computing system configuration. For instance, in [59], we present a software/hardware mechanism to map data and code to several 3D-stacked memory cubes in GPU applications with relatively regular memory access patterns (a simplified sketch of one such mapping policy appears after the research questions below). This work also deals with effectively sharing PIM engines across multiple threads, as GPU code sections are offloaded from different GPU cores. Developing new approaches to data/code mapping and scheduling for a large variety of applications and possible core and memory configurations is still necessary.

In summary, there are still several key research questions that should be investigated in runtime systems for PIM, which perform scheduling and data/code mapping:

• What are simple mechanisms to enable and disable PIM execution? How can PIM execution be throttled for highest performance gains? How should data locations and access patterns affect where/whether PIM execution should occur?

• Which parts of a given application's code should be executed on PIM? What are simple mechanisms to identify when those parts of the application code can benefit from PIM?

• What are scheduling mechanisms to share PIM engines between multiple requesting cores to maximize the benefits obtained from PIM?

• What are simple mechanisms to manage access to a memory that serves both CPU requests and PIM requests?
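As an illustration of the data/code mapping problem, the sketch below assigns an offloaded code block to the PIM engine of the 3D-stacked memory vault that holds the largest share of the data the block touches, loosely in the spirit of [59]. The vault count, the interleaving granularity, and the pick_engine heuristic are illustrative assumptions, not parameters of a real HMC/HBM device.

#include <cstdint>
#include <cstdio>
#include <vector>

// Assumed organization: physical addresses are interleaved across
// kVaults vaults of a 3D-stacked memory at kGrainBytes granularity.
constexpr uint64_t kVaults = 32;
constexpr uint64_t kGrainBytes = 4096;

inline uint64_t vault_of(uint64_t paddr) {
  return (paddr / kGrainBytes) % kVaults;
}

// Run an offloaded block on the engine of the vault holding most of the
// addresses it will touch, so most of its accesses stay vault-local.
uint64_t pick_engine(const std::vector<uint64_t>& touched_addrs) {
  uint64_t count[kVaults] = {};
  for (uint64_t a : touched_addrs) count[vault_of(a)]++;
  uint64_t best = 0;
  for (uint64_t v = 1; v < kVaults; v++)
    if (count[v] > count[best]) best = v;
  return best;
}

int main() {
  // Addresses touched by one offloaded code block (illustrative).
  std::vector<uint64_t> addrs = {0x0000, 0x1000, 0x1008, 0x2000};
  std::printf("run on PIM engine of vault %llu\n",
              (unsigned long long)pick_engine(addrs));
}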
6.3. Memory Coherence

In a traditional multithreaded execution model that makes use of shared memory, writes to memory must be coordinated between multiple CPU cores, to ensure that threads do not operate on stale data values. Since CPUs include per-core private caches, when one core writes data to a memory address, cached copies of the data held within the caches of other cores must be updated or invalidated, using a mechanism known as cache coherence. Within a modern chip multiprocessor, the per-core caches perform coherence actions over a shared interconnect, using hardware coherence protocols.

Cache coherence is a major system challenge for enabling PIM architectures as general-purpose execution engines, as PIM processing logic can modify the data it processes, and this data may also be needed by CPU cores. If PIM processing logic is coherent with the processor, the PIM programming model is relatively simple, as it remains similar to conventional shared memory multithreaded programming, which makes PIM architectures easier to adopt in general-purpose systems. Thus, allowing PIM processing logic to maintain such a simple and traditional shared memory programming model can facilitate the widespread adoption of PIM. However, employing traditional fine-grained cache coherence (e.g., a cache-block-based MESI protocol [208]) for PIM forces a large number of coherence messages to traverse the narrow processor-memory bus, potentially undoing the benefits of high-bandwidth and low-latency PIM execution. Unfortunately, solutions for coherence proposed by prior PIM works [35, 59] either place some restrictions on the programming model (by eliminating coherence and requiring message-passing-based programming) or limit the performance and energy gains achievable by a PIM architecture.

We have developed a new coherence protocol, LazyPIM [70, 209], that maintains cache coherence between PIM processing logic and CPU cores without sending coherence requests for every memory access. Instead, LazyPIM efficiently provides coherence by having PIM processing logic speculatively acquire coherence permissions, and then later send compressed, batched coherence lookups to the CPU to determine whether or not its speculative permission acquisition violated the coherence semantics. As a result of this "lazy" checking of coherence violations, LazyPIM approaches near-ideal coherence behavior: the performance and energy consumption of a PIM architecture with LazyPIM are, respectively, within 5.5% and 4.4% of the performance and energy consumption of a system where coherence is performed at zero latency and energy cost. A simplified sketch of this lazy, batched checking appears below.

Despite the leap that LazyPIM [70, 209] represents for memory coherence in computing systems with PIM support, we believe that it is still necessary to explore other solutions for memory coherence that can efficiently deal with all types of workloads and PIM offloading granularities.
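A minimal software sketch of this lazy, batched coherence checking follows, assuming Bloom-filter-like read/write signatures and a conflict check at kernel completion. The signature size, hash function, and data layout are illustrative assumptions; LazyPIM [70, 209] implements compressed signatures and speculative execution in hardware.

#include <bitset>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// Bloom-filter-like signature of the cache blocks a PIM kernel touched.
struct Signature {
  std::bitset<1024> bits;
  void add(uint64_t block) { bits.set(std::hash<uint64_t>{}(block) % 1024); }
  bool may_contain(uint64_t block) const {
    return bits.test(std::hash<uint64_t>{}(block) % 1024);
  }
};

struct PimKernel {
  Signature reads, writes;
  void record_read(uint64_t addr)  { reads.add(addr / 64); }   // executed speculatively
  void record_write(uint64_t addr) { writes.add(addr / 64); }
};

// At kernel completion, the batched signatures are checked against the
// blocks the CPU cores wrote while the kernel ran. A (possible) overlap
// means the speculation violated coherence and the kernel must roll back.
bool commit_ok(const PimKernel& k, const std::vector<uint64_t>& cpu_dirty_blocks) {
  for (uint64_t b : cpu_dirty_blocks)
    if (k.reads.may_contain(b) || k.writes.may_contain(b)) return false;
  return true;  // no conflict: make the kernel's writes visible
}

int main() {
  PimKernel k;
  k.record_read(0x100);                            // kernel read this address
  std::vector<uint64_t> cpu_dirty = {0x100 / 64};  // CPU wrote the same block
  std::printf("commit ok? %d\n", (int)commit_ok(k, cpu_dirty));  // 0: roll back
}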
6.4. Virtual Memory Support

When an application needs to access its data inside the main memory, the CPU core must first perform an address translation, which converts the data's virtual address into a physical address within main memory. If the translation metadata is not available in the CPU's translation lookaside buffer (TLB), the CPU must invoke the page table walker in order to perform a long-latency page table walk that involves multiple sequential reads to the main memory and lowers the application's performance. In modern systems, the virtual memory system also provides access protection mechanisms.

A naive solution to reducing the overhead of page walks is to utilize PIM engines to perform page table walks. This can be done by duplicating the content of the TLB and moving the page walker to the PIM processing logic in main memory. Unfortunately, this is either difficult or expensive for three reasons. First, coherence has to be maintained between the CPU's TLBs and the memory-side TLBs. This introduces extra complexity and off-chip requests. Second, duplicating the TLBs increases the storage and complexity overheads on the memory side, which should be carefully contained. Third, if main memory is shared across CPUs with different types of architectures, page table structures and the implementation of address translation can differ across the different architectures. Ensuring compatibility between the in-memory TLB/page walker and all possible types of virtual memory architecture designs can be complicated and often not even practically feasible.

To address these concerns and reduce the overhead of virtual memory, we explore a tractable solution for PIM address translation as part of our in-memory pointer chasing accelerator, IMPICA [62]. IMPICA exploits the high bandwidth available within 3D-stacked memory to traverse a chain of virtual memory pointers within DRAM, without having to look up virtual-to-physical address translations in the CPU translation lookaside buffer (TLB) and without using the page walkers within the CPU. IMPICA's key ideas are (1) to use a region-based page table, which is optimized for PIM acceleration, and (2) to decouple address calculation and memory access with two specialized engines (a simplified sketch of region-based translation appears at the end of this subsection). IMPICA improves the performance of pointer chasing operations in three commonly-used linked data structures (linked lists, hash tables, and B-trees) by 92%, 29%, and 18%, respectively. On a real database application, DBx1000, IMPICA improves transaction throughput and response time by 16% and 13%, respectively. IMPICA also reduces overall system energy consumption (by 41%, 23%, and 10% for the three commonly-used data structures, and by 6% for DBx1000).

Beyond the pointer chasing operations that are tackled by IMPICA [62], providing efficient mechanisms for PIM-based virtual-to-physical address translation (as well as access protection) remains a challenge for the generality of applications, especially those that access large amounts of virtual memory [210, 211, 212]. We believe it is important to explore new ideas to address this PIM challenge in a scalable and efficient manner.
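The sketch below illustrates the region-based translation idea, under the assumption that the data a PIM accelerator touches lives in a few large, contiguous virtual regions; one flat-table lookup plus an offset addition then replaces a multi-level page walk. The RegionTable structure is an illustrative stand-in, not IMPICA's exact page table format, and it covers only the translation half of the design (the two decoupled engines are a separate hardware mechanism).

#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

// One contiguous PIM region: a virtual range mapped to a physical range.
struct Region {
  uint64_t vbase, pbase, size;  // virtual base, physical base, bytes
};

struct RegionTable {
  std::vector<Region> regions;  // small: a handful of large regions
  // One lookup plus an offset add, instead of a multi-level page walk.
  std::optional<uint64_t> translate(uint64_t vaddr) const {
    for (const Region& r : regions)
      if (vaddr >= r.vbase && vaddr < r.vbase + r.size)
        return r.pbase + (vaddr - r.vbase);
    return std::nullopt;  // not a PIM region: defer to the CPU's page table
  }
};

int main() {
  RegionTable t;
  t.regions.push_back({0x10000000, 0x00200000, 1 << 20});  // one 1 MiB region
  if (auto p = t.translate(0x10000040))
    std::printf("paddr = %#llx\n", (unsigned long long)*p);
}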
6.5. Data Structures for PIM

Current systems with many cores run applications with concurrent data structures to achieve high performance and scalability, with significant benefits over sequential data structures. Such concurrent data structures are often used in heavily-optimized server systems today, where high performance is critical. To enable the adoption of PIM in such many-core systems, it is necessary to develop concurrent data structures that are specifically tailored to take advantage of PIM.

Pointer chasing data structures and contended data structures require careful analysis and design to leverage the high bandwidth and low latency of 3D-stacked memories [72]. First, pointer chasing data structures, such as linked lists and skip lists, have a high degree of inherent parallelism and low contention, but a naive implementation in PIM cores is burdened by hard-to-predict memory access patterns. By combining and partitioning the data across 3D-stacked memory vaults, it is possible to fully exploit the inherent parallelism of these data structures. Second, contended data structures, such as FIFO queues, are a good fit for CPU caches because they expose high locality. However, they suffer from high contention when many threads access them concurrently. Their performance on traditional CPU systems can be improved using a new PIM-based FIFO queue [72]. The proposed PIM-based FIFO queue uses a PIM core to perform enqueue and dequeue operations requested by CPU cores. The PIM core can pipeline requests from different CPU cores for improved performance (a simplified sketch of this request-based design appears at the end of this subsection).

As recent work [72] shows, PIM-managed concurrent data structures can outperform state-of-the-art concurrent data structures that are designed for and executed on multiple cores. We believe and hope that future work will enable other types of data structures (e.g., hash tables, search trees, priority queues) to benefit from PIM-managed designs.
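The following sketch conveys the request-based structure of such a PIM-managed FIFO queue, assuming CPU cores deposit enqueue/dequeue requests into a buffer that a single PIM core drains in order. The request format and reply mechanism are illustrative assumptions; the real design in [72] pipelines these operations in hardware rather than executing a software loop.

#include <cstdint>
#include <cstdio>
#include <deque>
#include <optional>
#include <vector>

enum class Op { Enqueue, Dequeue };
struct Request { Op op; uint64_t value; };  // value is unused for Dequeue

struct PimFifo {
  std::deque<uint64_t> data;                     // lives in PIM-attached memory
  std::vector<std::optional<uint64_t>> replies;  // one reply slot per request

  // A single PIM core drains the request buffer in order, so no locking
  // is needed on the queue data, and the queue contents never bounce
  // between CPU caches; independent requests can be pipelined in hardware.
  void service(const std::vector<Request>& reqs) {
    replies.assign(reqs.size(), std::nullopt);
    for (size_t i = 0; i < reqs.size(); i++) {
      if (reqs[i].op == Op::Enqueue) data.push_back(reqs[i].value);
      else if (!data.empty()) { replies[i] = data.front(); data.pop_front(); }
    }
  }
};

int main() {
  PimFifo q;
  std::vector<Request> reqs = {{Op::Enqueue, 7}, {Op::Enqueue, 9}, {Op::Dequeue, 0}};
  q.service(reqs);  // requests gathered from multiple CPU cores
  if (q.replies[2])
    std::printf("dequeued %llu\n", (unsigned long long)*q.replies[2]);
}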
6.6. Benchmarks and Simulation Infrastructures for PIM

To ease the adoption of PIM, it is critical that we accurately assess the benefits and shortcomings of PIM. Accurate assessment of PIM requires (1) a preferably large set of real-world memory-intensive applications that have the potential to benefit significantly when executed near memory, (2) a rigorous methodology to (automatically) identify PIM offloading candidates, and (3) simulation/evaluation infrastructures that allow architects and system designers to accurately analyze the benefits and overheads of adding PIM processing logic to memory and executing code on this processing logic.

In order to explore what processing logic should be introduced near memory, and to know what properties are ideal for PIM kernels, we believe it is important to begin by developing a real-world benchmark suite of a wide variety of applications that can potentially benefit from PIM. While many data-intensive applications, such as pointer chasing and bulk memory copy, can potentially benefit from PIM, it is crucial to examine important candidate applications for PIM execution, and for researchers to agree on a common set of these candidate applications to focus the efforts of the community, as well as to enable reproducibility of results, which is important to assess the relative benefits of different ideas developed by different researchers. We believe that these applications should come from a number of popular and emerging domains. Examples of potential domains include data-parallel applications, neural networks, machine learning, graph processing, data analytics, search/filtering, mobile workloads, bioinformatics, Hadoop/Spark programs, security/cryptography, and in-memory data stores. Many of these applications have large data sets and can benefit from the high memory bandwidth and low memory latency provided by computation near memory. In our prior work, we have started identifying several applications that can benefit from PIM in graph processing frameworks [34, 35], pointer chasing [33, 62], databases [62, 70, 71, 209], consumer workloads [7], machine learning [7], and GPGPU workloads [59, 60]. However, there is significant room for the methodical development of a large-scale PIM benchmark suite, which we hope future work provides.

A systematic methodology for (automatically) identifying potential PIM kernels (i.e., code portions that can benefit from PIM) within an application can, among many other benefits, (1) ease the burden of programming PIM architectures by aiding the programmer to identify what should be offloaded, (2) ease the burden of and improve the reproducibility of PIM research, (3) drive the design and implementation of PIM functional units that many types of applications can leverage, (4) inspire the development of tools that programmers and compilers can use to automate the process of offloading portions of existing applications to PIM processing logic, and (5) lead the community towards convergence on PIM designs and offloading candidates. A toy example of such an identification heuristic appears at the end of this subsection.

We also need simulation infrastructures to accurately model the performance and energy of PIM hardware structures, available memory bandwidth, and communication overheads when we execute code near or inside memory. Highly-flexible and commonly-used memory simulators (e.g., Ramulator [213, 214], SoftMC [29, 215]) can be combined with full-system simulators (e.g., gem5 [216], ZSim [217], gem5-gpu [218], GPGPU-Sim [219]) to provide a robust environment that can evaluate how various PIM architectures affect the entire compute stack, and can allow designers to identify memory, workload, and system characteristics that affect the efficiency of PIM execution. We believe it is critical to support the open-source development of such simulation and emulation infrastructures for assessing the benefits of a wide variety of PIM designs.
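As a toy example of what an automated identification methodology might compute, the sketch below flags functions whose profiled off-chip traffic and cache behavior suggest PIM suitability. The profile fields and thresholds are illustrative assumptions and are far simpler than the energy and performance analysis used in [7].

#include <cstdio>
#include <string>
#include <vector>

// Illustrative per-function profile gathered by a hypothetical profiler.
struct FunctionProfile {
  std::string name;
  double bytes_per_kinst;  // off-chip traffic per 1000 instructions
  double llc_miss_rate;    // fraction of last-level cache accesses that miss
};

// Flag functions that move a lot of data and get little help from caches.
bool is_pim_candidate(const FunctionProfile& f) {
  constexpr double kTrafficThreshold = 512.0;  // bytes per kilo-instruction
  constexpr double kMissThreshold = 0.5;
  return f.bytes_per_kinst > kTrafficThreshold && f.llc_miss_rate > kMissThreshold;
}

int main() {
  std::vector<FunctionProfile> profile = {
      {"texture_tiling", 900.0, 0.7},  // memcpy-like: likely PIM candidate
      {"parse_html", 120.0, 0.2},      // compute- and cache-friendly: keep on CPU
  };
  for (const auto& f : profile)
    if (is_pim_candidate(f))
      std::printf("PIM offloading candidate: %s\n", f.name.c_str());
}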
7. Conclusion and Future Outlook
Data movement is a major performance and energy bottleneck plaguing modern computing systems. A large fraction of system energy is spent on moving data across the memory hierarchy into the processors (and accelerators), the only place where computation is performed in a modern system. Fundamentally, the large amounts of data movement are caused by the processor-centric design of modern computing systems: processing of data is performed only in the processors (and accelerators), which are far away from the data; as a result, data moves a lot in the system to facilitate computation on it.

In this work, we argue for a paradigm shift in the design of computing systems toward a data-centric design that enables computation capability in places where data resides and thus performs computation with minimal data movement.
Processing-in-memory (PIM) is a fundamentally data-centric design approach for computing systems that enables the ability to perform operations in or near memory. Recent advances in modern memory architectures have enabled us to extensively explore two novel approaches to designing PIM architectures. First, with minimal changes to memory chips, we show that we can perform a number of important and widely-used operations (e.g., memory copy, data initialization, bulk bitwise operations, data reorganization) within DRAM. Second, we demonstrate how embedded computation capability in the logic layer of 3D-stacked memory can be used in a variety of ways to provide significant performance improvements and energy savings, across a large range of application domains and computing platforms.

Despite the extensive design space that we have studied so far, a number of key challenges remain to enable the widespread adoption of PIM in future computing systems [94, 95]. Important challenges include developing easy-to-use programming models for PIM (e.g., PIM application interfaces, compilers and libraries designed to abstract away PIM architecture details from programmers) and extensive runtime support for PIM (e.g., scheduling PIM operations, sharing PIM logic among CPU threads, cache coherence, virtual memory support). We hope that providing the community with (1) a large set of memory-intensive benchmarks that can potentially benefit from PIM, (2) a rigorous methodology to identify PIM-suitable parts within an application, and (3) accurate simulation infrastructures for estimating the benefits and overheads of PIM will empower researchers to address the remaining challenges for the adoption of PIM.

We firmly believe that it is time to design principled system architectures to solve the data movement problem of modern computing systems, which is caused by the rigid dichotomy and imbalance between the computing unit (CPUs and accelerators) and the memory/storage unit. Fundamentally solving the data movement problem requires a paradigm shift to a more data-centric computing system design, where computation happens in or near memory/storage, with minimal movement of data. Such a paradigm shift can greatly push the boundaries of future computing systems, leading to orders of magnitude improvements in energy and performance (as we demonstrated with some examples in this work), potentially enabling new applications and computing platforms.

Acknowledgments
This work is based on a keynote talk delivered by Onur Mutlu at the 3rd Mobile System Technologies (MST) Workshop in Milan, Italy, on 27 October 2017 [220]. The keynote talk is similar to a series of talks given by Onur Mutlu in a wide variety of venues since 2015. The talk has evolved significantly over time with the accumulation of new works and feedback received from many audiences. A recent version of the talk was delivered as a distinguished lecture at George Washington University [221].

This article and the associated talks are based on research done over the course of the past seven years in the SAFARI Research Group on the topic of processing-in-memory (PIM). We thank all of the members of the SAFARI Research Group, and our collaborators at Carnegie Mellon, ETH Zürich, and other universities, who have contributed to the various works we describe in this paper. Thanks also go to our research group's industrial sponsors over the past ten years, especially Alibaba, Google, Huawei, Intel, Microsoft, NVIDIA, Samsung, Seagate, and VMware. This work was also partially supported by the Intel Science and Technology Center for Cloud Computing, the Semiconductor Research Corporation, the Data Storage Systems Center at Carnegie Mellon University, various NSF grants, and various awards, including the NSF CAREER Award, the Intel Faculty Honor Program Award, and a number of Google Faculty Research Awards to Onur Mutlu.
References

[1] O. Mutlu, Memory Scaling: A Systems Architecture Perspective, IMW (2013).
[2] O. Mutlu, L. Subramanian, Research Problems and Opportunities in Memory Systems, SUPERFRI (2014).
[3] J. Dean, L. A. Barroso, The Tail at Scale, CACM (2013).
[4] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, D. Brooks, Profiling a Warehouse-Scale Computer, in: ISCA, 2015.
[5] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, B. Falsafi, Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware, in: ASPLOS, 2012.
[6] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, B. Qiu, BigDataBench: A Big Data Benchmark Suite From Internet Services, in: HPCA, 2014.
[7] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, O. Mutlu, Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks, in: ASPLOS, 2018.
[8] U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, J. Choi, Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling, in: The Memory Forum, 2014.
[9] S. A. McKee, Reflections on the Memory Wall, in: CF, 2004.
[10] M. V. Wilkes, The Memory Gap and the Future of High Performance Memories, CAN (2001).
[11] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, O. Mutlu, Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, in: ISCA, 2014.
[12] Y. Kim, V. Seshadri, D. Lee, J. Liu, O. Mutlu, A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM, in: ISCA, 2012.
[13] Y. Kim, Architectural Techniques to Enhance DRAM Scaling, Ph.D. thesis, Carnegie Mellon University (2015).
[14] J. Liu, B. Jaiyen, R. Veras, O. Mutlu, RAIDR: Retention-Aware Intelligent DRAM Refresh, in: ISCA, 2012.
[15] O. Mutlu, The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser, in: DATE, 2017.
[16] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, O. Mutlu, Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM, in: PACT, 2015.
[17] B. C. Lee, E. Ipek, O. Mutlu, D. Burger, Architecting Phase Change Memory as a Scalable DRAM Alternative, in: ISCA, 2009.
[18] H. Yoon, J. Meza, R. Ausavarungnirun, R. Harding, O. Mutlu, Row Buffer Locality Aware Caching Policies for Hybrid Memories, in: ICCD, 2012.
[19] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, O. Mutlu, Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories, ACM TACO (2014).
[20] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, T. F. Wenisch, Disaggregated Memory for Expansion and Sharing in Blade Servers, in: ISCA, 2009.
[21] W. A. Wulf, S. A. McKee, Hitting the Memory Wall: Implications of the Obvious, CAN (1995).
[22] K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhimenko, S. Khan, O. Mutlu, Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization, in: SIGMETRICS, 2016.
[23] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, O. Mutlu, Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture, in: HPCA, 2013.
[24] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, O. Mutlu, Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case, in: HPCA, 2015.
[25] K. K. Chang, A. G. Yağlıkçı, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap, D. Lee, M. O'Connor, H. Hassan, O. Mutlu, Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms, in: SIGMETRICS, 2017.
[26] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko, V. Seshadri, O. Mutlu, Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms, in: SIGMETRICS, 2017.
[27] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal, J. Liu, B. Khessib, K. Vaid, O. Mutlu, Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory, in: DSN, 2014.
[28] Y. Luo, S. Ghose, T. Li, S. Govindan, B. Sharma, B. Kelly, A. Boroumand, O. Mutlu, Using ECC DRAM to Adaptively Increase Memory Capacity, arXiv:1706.08870 [cs:AR] (2017).
[29] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee, O. Ergin, O. Mutlu, SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies, in: HPCA, 2017.
[30] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, O. Mutlu, ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality, in: HPCA, 2016.
[31] K. K. Chang, Understanding and Improving the Latency of DRAM-Based Memory Systems, Ph.D. thesis, Carnegie Mellon University (2017).
[32] M. Hashemi, Khubaib, E. Ebrahimi, O. Mutlu, Y. N. Patt, Accelerating Dependent Cache Misses with an Enhanced Memory Controller, in: ISCA, 2016.
[33] M. Hashemi, O. Mutlu, Y. N. Patt, Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads, in: MICRO, 2016.
[34] J. Ahn, S. Hong, S. Yoo, O. Mutlu, K. Choi, A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, in: ISCA, 2015.
[35] J. Ahn, S. Yoo, O. Mutlu, K. Choi, PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture, in: ISCA, 2015.
[36] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., J. Emer, Adaptive Insertion Policies for High-Performance Caching, in: ISCA, 2007.
[37] M. K. Qureshi, M. A. Suleman, Y. N. Patt, Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines, in: HPCA, 2007.
[38] R. Kumar, G. Hinton, A Family of 45nm IA Processors, in: ISSCC, 2009.
[39] H. S. Stone, A Logic-in-Memory Computer, TC (1970).
[40] D. E. Shaw, S. J. Stolfo, H. Ibrahim, B. Hillyer, G. Wiederhold, J. Andrews, The NON-VON Database Machine: A Brief Overview, IEEE Database Eng. Bull. (1981).
[41] D. G. Elliott, W. M. Snelgrove, M. Stumm, Computational RAM: A Memory-SIMD Hybrid and Its Application to DSP, in: CICC, 1992.
[42] P. M. Kogge, EXECUBE–A New Architecture for Scaleable MPPs, in: ICPP, 1994.
[43] M. Gokhale, B. Holmes, K. Iobst, Processing in Memory: The Terasys Massively Parallel PIM Array, IEEE Computer (1995).
[44] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, A Case for Intelligent RAM, IEEE Micro (1997).
[45] M. Oskin, F. T. Chong, T. Sherwood, Active Pages: A Computation Model for Intelligent Memory, in: ISCA, 1998.
[46] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, J. Torrellas, FlexRAM: Toward an Advanced Intelligent Memory System, in: ICCD, 1999.
[47] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, G. Daglikoca, The Architecture of the DIVA Processing-in-Memory Chip, in: SC, 2002.
[48] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, M. Horowitz, Smart Memories: A Modular Reconfigurable Architecture, in: ISCA, 2000.
[49] D. Elliott, M. Stumm, W. M. Snelgrove, C. Cojocaru, R. McKenzie, Computational RAM: Implementing Processors in Memory, IEEE Design & Test (1999).
[50] E. Riedel, G. Gibson, C. Faloutsos, Active Storage for Large-Scale Data Mining and Multimedia Applications, in: VLDB, 1998.
[51] K. Keeton, D. A. Patterson, J. M. Hellerstein, A Case for Intelligent Disks (IDISKs), SIGMOD Rec. (1998).
[52] S. Kaxiras, R. Sugumar, Distributed Vector Architecture: Beyond a Single Vector-IRAM, in: First Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, 1997.
[53] A. Acharya, M. Uysal, J. Saltz, Active Disks: Programming Model, Algorithms and Evaluation, in: ASPLOS, 1998.
[54] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, F. Franchetti, Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware, in: HPEC, 2013.
[55] S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, F. Li, NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads, in: ISPASS, 2014.
[56] D. P. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, M. Ignatowski, TOP-PIM: Throughput-Oriented Programmable Processing in Memory, in: HPDC, 2014.
[57] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, N. S. Kim, NDA: Near-DRAM Acceleration Architecture Leveraging Commodity DRAM Devices and Standard Memory Modules, in: HPCA, 2015.
[58] G. H. Loh, N. Jayasena, M. Oskin, M. Nutter, D. Roberts, M. Meswani, D. P. Zhang, M. Ignatowski, A Processing in Memory Taxonomy and a Case for Studying Fixed-Function PIM, in: WoNDP, 2013.
[59] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O'Connor, N. Vijaykumar, O. Mutlu, S. Keckler, Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems, in: ISCA, 2016.
[60] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, C. R. Das, Scheduling Techniques for GPU Architectures with Processing-in-Memory Capabilities, in: PACT, 2016.
[61] B. Akin, F. Franchetti, J. C. Hoe, Data Reorganization in Memory Using 3D-Stacked DRAM, in: ISCA, 2015.
[62] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, O. Mutlu, Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation, in: ICCD, 2016.
[63] O. O. Babarinsa, S. Idreos, JAFAR: Near-Data Processing for Databases, in: SIGMOD, 2015.
[64] J. H. Lee, J. Sim, H. Kim, BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models, in: PACT, 2015.
[65] M. Gao, C. Kozyrakis, HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing, in: HPCA, 2016.
[66] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory, in: ISCA, 2016.
[67] B. Gu, A. S. Yoon, D.-H. Bae, I. Jo, J. Lee, J. Yoon, J.-U. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, D. Chang, Biscuit: A Framework for Near-Data Processing of Big Data Workloads, in: ISCA, 2016.
[68] D. Kim, J. Kung, S. Chai, S. Yalamanchili, S. Mukhopadhyay, Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory, in: ISCA, 2016.
[69] H. Asghari-Moghaddam, Y. H. Son, J. H. Ahn, N. S. Kim, Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems, in: MICRO, 2016.
[70] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, K. Hsieh, K. T. Malladi, H. Zheng, O. Mutlu, LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory, CAL (2016).
[71] V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry, Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses, in: MICRO, 2015.
[72] Z. Liu, I. Calciu, M. Herlihy, O. Mutlu, Concurrent Data Structures for Near-Memory Computing, in: SPAA, 2017.
[73] M. Gao, G. Ayers, C. Kozyrakis, Practical Near-Data Processing for In-Memory Analytics Frameworks, in: PACT, 2015.
[74] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. C. Hoe, F. Franchetti, 3D-Stacked Memory-Side Acceleration: Accelerator and System Design, in: WoNDP, 2014.
[75] Z. Sura, A. Jacob, T. Chen, B. Rosenburg, O. Sallenave, C. Bertolli, S. Antao, J. Brunheroto, Y. Park, K. O'Brien, R. Nair, Data Access Optimization in a Processing-in-Memory System, in: CF, 2015.
[76] A. Morad, L. Yavits, R. Ginosar, GP-SIMD Processing-in-Memory, ACM TACO (2015).
[77] S. M. Hassan, S. Yalamanchili, S. Mukhopadhyay, Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore, in: MEMSYS, 2015.
[78] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, Y. Xie, Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-Volatile Memories, in: DAC, 2016.
[79] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, K. Curewitz, An Energy-Efficient VLSI Architecture for Pattern Recognition via Deep Embedding of Computation in SRAM, in: ICASSP, 2014.
[80] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, R. Das, Compute Caches, in: HPCA, 2017.
[81] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, V. Srikumar, ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, in: ISCA, 2016.
[82] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, M. A. Kozuch, P. B. Gibbons, T. C. Mowry, RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization, in: MICRO, 2013.
[83] V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. A. Kozuch, O. Mutlu, P. B. Gibbons, T. C. Mowry, Fast Bulk Bitwise AND and OR in DRAM, CAL (2015).
[84] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, O. Mutlu, Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM, in: HPCA, 2016.
[85] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, T. C. Mowry, Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM, arXiv:1611.09988 [cs:AR] (2016).
[86] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, T. C. Mowry, Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology, in: MICRO, 2017.
[87] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, H. Kim, GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks, in: HPCA, 2017.
[88] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, O. Mutlu, GRIM-Filter: Fast Seed Filtering in Read Mapping Using Emerging Memory Technologies, arXiv:1708.04329 [q-bio.GN] (2017).
[89] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, O. Mutlu, GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies, BMC Genomics (2018).
[90] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, Y. Xie, DRISA: A DRAM-Based Reconfigurable In-Situ Accelerator, in: MICRO, 2017.
[91] G. Kim, N. Chatterjee, M. O'Connor, K. Hsieh, Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs, in: SC, 2017.
[92] V. Seshadri, O. Mutlu, Simple Operations in Memory to Reduce Data Movement, in: Advances in Computers, Volume 106, 2017.
[93] V. Seshadri, Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems, Ph.D. thesis, Carnegie Mellon University (2016).
[94] S. Ghose, K. Hsieh, A. Boroumand, R. Ausavarungnirun, O. Mutlu, The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption, in: Beyond-CMOS Technologies for Next Generation Computer Design, 2019.
[95] S. Ghose, K. Hsieh, A. Boroumand, R. Ausavarungnirun, O. Mutlu, Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions, arXiv:1802.00320 [cs:AR] (2018).
[96] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, J. Yang, DrAcc: A DRAM-Based Accelerator for Accurate CNN Inference, in: DAC, 2018.
[97] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, O. Mutlu, Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost, ACM TACO (2016).
[98] Hybrid Memory Cube Consortium, HMC Specification 1.1 (2013).
[99] Hybrid Memory Cube Consortium, HMC Specification 2.0 (2014).
[100] JEDEC, High Bandwidth Memory (HBM) DRAM, Standard No. JESD235 (2013).
[101] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, O. Mutlu, An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms, in: ISCA, 2013.
[102] S. Ghose, A. G. Yaglikci, R. Gupta, D. Lee, K. Kudrolli, W. X. Liu, H. Hassan, K. K. Chang, N. Chatterjee, A. Agrawal, M. O'Connor, O. Mutlu, What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study, in: SIGMETRICS, 2018.
[103] Y. Wang, A. Tavakkol, L. Orosa, S. Ghose, N. M. Ghiasi, M. Patel, J. Kim, H. Hassan, M. Sadrosadati, O. Mutlu, Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration, in: MICRO, 2018.
[104] A. Das, H. Hassan, O. Mutlu, VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency, in: DAC, 2018.
[105] J. Kim, M. Patel, H. Hassan, L. Orosa, O. Mutlu, Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines, in: ICCD, 2018.
[106] P. J. Denning, T. G. Lewis, Exponential Laws of Computing Growth, Commun. ACM (Jan. 2017).
[107] International Technology Roadmap for Semiconductors (ITRS) (2009).
[108] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood, DBMSs on a Modern Processor: Where Does Time Go?, in: VLDB, 1999.
[109] P. A. Boncz, S. Manegold, M. L. Kersten, Database Architecture Optimized for the New Bottleneck: Memory Access, in: VLDB, 1999.
[110] R. Clapp, M. Dimitrov, K. Kumar, V. Viswanathan, T. Willhalm, Quantifying the Performance Impact of Memory Latency and Bandwidth for Big Data Workloads, in: IISWC, 2015.
[111] S. L. Xi, O. Babarinsa, M. Athanassoulis, S. Idreos, Beyond the Wall: Near-Data Processing for Databases, in: DaMoN, 2015.
[112] Y. Umuroglu, D. Morrison, M. Jahre, Hybrid Breadth-First Search on a Single-Chip FPGA-CPU Heterogeneous Platform, in: FPL, 2015.
[113] Q. Xu, H. Jeon, M. Annavaram, Graph Processing on GPUs: Where Are the Bottlenecks?, in: IISWC, 2014.
[114] A. J. Awan, M. Brorsson, V. Vlassov, E. Ayguade, Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server, in: CCBD, 2015.
[115] A. J. Awan, M. Brorsson, V. Vlassov, E. Ayguade, Micro-Architectural Characterization of Apache Spark on Batch and Stream Processing Workloads, in: BDCloud-SocialCom-SustainCom, 2016.
[116] A. Yasin, Y. Ben-Asher, A. Mendelson, Deep-Dive Analysis of the Data Analytics Workload in CloudSuite, in: IISWC, 2014.
[117] T. Moscibroda, O. Mutlu, Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems, in: USENIX Security, 2007.
[118] O. Mutlu, T. Moscibroda, Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors, in: MICRO, 2007.
[119] O. Mutlu, T. Moscibroda, Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems, in: ISCA, 2008.
[120] L. Subramanian, Providing High and Controllable Performance in Multicore Systems Through Shared Resource Management, Ph.D. thesis, Carnegie Mellon University (2015).
[121] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, O. Mutlu, MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems, in: HPCA, 2013.
[122] H. Usui, L. Subramanian, K. Chang, O. Mutlu, DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators, ACM TACO (2016).
[123] L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, O. Mutlu, The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory, in: MICRO, 2015.
[124] J. Meza, Q. Wu, S. Kumar, O. Mutlu, Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field, in: DSN, 2015.
[125] SAFARI Research Group, RowHammer – GitHub Repository, https://github.com/CMU-SAFARI/rowhammer/.
[126] M. Seaborn, T. Dullien, Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges, http://googleprojectzero.blogspot.com.tr/2015/03/exploiting-dram-rowhammer-bug-to-gain.html.
[127] M. Seaborn, T. Dullien, Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges, BlackHat (2016).
[128] D. Gruss, C. Maurice, S. Mangard, Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript, CoRR abs/1507.06955 (2015), http://arxiv.org/abs/1507.06955.
[129] K. Razavi, B. Gras, E. Bosman, B. Preneel, C. Giuffrida, H. Bos, Flip Feng Shui: Hammering a Needle in the Software Stack, in: USENIX Security, 2016.
[130] V. van der Veen, Y. Fratantonio, M. Lindorfer, D. Gruss, C. Maurice, G. Vigna, H. Bos, K. Razavi, C. Giuffrida, Drammer: Deterministic Rowhammer Attacks on Mobile Platforms, in: CCS, 2016.
[131] E. Bosman, K. Razavi, H. Bos, C. Giuffrida, Dedup Est Machina: Memory Deduplication as an Advanced Exploitation Vector, in: SP, 2016.
[132] O. Mutlu, RowHammer, in: Top Picks in Hardware and Embedded Security, 2018.
[133] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, T. W. Keller, Energy Management for Commercial Servers, Computer (Dec. 2003).
[134] M. Ware, K. Rajamani, M. Floyd, B. Brock, J. C. Rubio, F. Rawson, J. B. Carter, Architecting for Power Management: The IBM POWER7 Approach, in: HPCA, 2010.
[135] I. Paul, W. Huang, M. Arora, S. Yalamanchili, Harmonia: Balancing Compute and Memory Power in High-Performance GPUs, in: ISCA, 2015.
[136] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, O. Mutlu, The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study, in: SIGMETRICS, 2014.
[137] M. K. Qureshi, D. H. Kim, S. Khan, P. J. Nair, O. Mutlu, AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems, in: DSN, 2015.
[138] S. Khan, D. Lee, O. Mutlu, PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM, in: DSN, 2016.
[139] S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, O. Mutlu, A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM, CAL (2016).
[140] M. Patel, J. Kim, O. Mutlu, The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions, in: ISCA, 2017.
[141] S. Khan, C. Wilkerson, Z. Wang, A. Alameldeen, D. Lee, O. Mutlu, Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content, in: MICRO, 2017.
[142] D. Lee, Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity, Ph.D. thesis, Carnegie Mellon University (2016).
[143] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, O. Mutlu, Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives, Proc. IEEE (Sep. 2017).
[144] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, O. Mutlu, Reliability Issues in Flash-Memory-Based Solid-State Drives: Experimental Analysis, Mitigation, Recovery, in: Inside Solid State Drives (SSDs), 2018.
[145] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, O. Mutlu, Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery, arXiv:1711.11427 [cs:AR] (2018).
[146] A. Tavakkol, J. Gómez-Luna, M. Sadrosadati, S. Ghose, O. Mutlu, MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices, in: FAST, 2018.
[147] A. Tavakkol, M. Sadrosadati, S. Ghose, J. Kim, Y. Luo, Y. Wang, N. M. Ghiasi, L. Orosa, J. Gómez-Luna, O. Mutlu, FLIN: Enabling Fairness and Enhancing Performance in Modern NVMe Solid State Drives, in: ISCA, 2018.
[148] Y. Cai, NAND Flash Memory: Characterization, Analysis, Modeling, and Mechanisms, Ph.D. thesis, Carnegie Mellon University (2013).
[149] Y. Luo, Architectural Techniques for Improving NAND Flash Memory Reliability, Ph.D. thesis, Carnegie Mellon University (2018).
[150] A. W. Burks, H. H. Goldstine, J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946).
[151] G. Kestor, R. Gioiosa, D. J. Kerbyson, A. Hoisie, Quantifying the Energy Cost of Data Movement in Scientific Applications, in: IISWC, 2013.
[152] D. Pandiyan, C.-J. Wu, Quantifying the Energy Cost of Data Movement for Emerging Smart Phone Workloads on Mobile Platforms, in: IISWC, 2014.
[153] J. K. Ousterhout, Why Aren't Operating Systems Getting Faster As Fast as Hardware?, in: USENIX STC, 1990.
[154] M. Rosenblum, et al., The Impact of Architectural Trends on Operating System Performance, in: SOSP, 1995.
[155] K. K. Chang, D. Lee, Z. Chishti, A. R. Alameldeen, C. Wilkerson, Y. Kim, O. Mutlu, Improving DRAM Performance by Parallelizing Refreshes with Accesses, in: HPCA, 2014.
[156] Memcached: A High Performance, Distributed Memory Object Caching System, http://memcached.org.
[157] MySQL: An Open Source Database.
[158] D. E. Knuth, The Art of Computer Programming, Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams (2009).
[159] H. S. Warren, Hacker's Delight, 2nd Edition, Addison-Wesley Professional, 2012.
[160] C.-Y. Chan, Y. E. Ioannidis, Bitmap Index Design and Evaluation, in: SIGMOD, 1998.
[161] E. O'Neil, P. O'Neil, K. Wu, Bitmap Index Design Choices and Their Performance Implications, in: IDEAS, 2007.
[162] FastBit: An Efficient Compressed Bitmap Index Technology, https://sdm.lbl.gov/fastbit/.
[163] K. Wu, E. J. Otoo, A. Shoshani, Compressing Bitmap Indexes for Faster Search Operations, in: SSDBM, 2002.
[164] Y. Li, J. M. Patel, BitWeaving: Fast Scans for Main Memory Data Processing, in: SIGMOD, 2013.
[165] B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, Y. He, BitFunnel: Revisiting Signatures for Search, in: SIGIR, 2017.
[166] G. Benson, Y. Hernandez, J. Loving, A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm, in: CPM, 2013.
[167] H. Xin, J. Greth, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, O. Mutlu, Shifted Hamming Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in Read Mapping, Bioinformatics (2015).
[168] M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, C. Alkan, GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping, Bioinformatics (2017).
[169] P. Tuyls, H. D. L. Hollmann, J. H. V. Lint, L. Tolhuizen, XOR-Based Visual Cryptography Schemes, Designs, Codes and Cryptography.
[170] J.-W. Han, C.-S. Park, D.-H. Ryu, E.-S. Kim, Optical Image Encryption Based on XOR Operations, SPIE OE (1999).
[171] S. A. Manavski, CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography, in: ICSPC, 2007.
[172] H. Kang, S. Hong, One-Transistor Type DRAM, US Patent 7701751 (2009).
[173] S.-L. Lu, Y.-C. Lin, C.-L. Yang, Improving DRAM Latency with Dynamic Asymmetric Subarray, in: MICRO, 2015.
[174] 6th Generation Intel Core Processor Family Datasheet.
[175] GeForce GTX 745.
[176] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, D. Burger, Phase-Change Technology and the Future of Main Memory, IEEE Micro (2010).
[177] B. C. Lee, E. Ipek, O. Mutlu, D. Burger, Phase Change Memory Architecture and the Quest for Scalability, CACM (2010).
[178] P. Zhou, B. Zhao, J. Yang, Y. Zhang, A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology, in: ISCA, 2009.
[179] M. K. Qureshi, V. Srinivasan, J. A. Rivers, Scalable High Performance Main Memory System Using Phase-Change Memory Technology, in: ISCA, 2009.
[180] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, U. C. Weiser, MAGIC—Memristor-Aided Logic, IEEE TCAS II: Express Briefs (2014).
[181] S. Kvatinsky, A. Kolodny, U. C. Weiser, E. G. Friedman, Memristor-Based IMPLY Logic Design Procedure, in: ICCD, 2011.
[182] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, U. C. Weiser, Memristor-Based Material Implication (IMPLY) Logic: Design Principles and Methodologies, TVLSI (2014).
[183] Y. Levy, J. Bruck, Y. Cassuto, E. G. Friedman, A. Kolodny, E. Yaakobi, S. Kvatinsky, Logic Operations in Memory Using a Memristive Akers Array, Microelectronics Journal (2014).
[184] S. Salihoglu, J. Widom, GPS: A Graph Processing System, in: SSDBM, 2013.
[185] Y. Tian, A. Balmin, S. A. Corsten, S. Tatikonda, J. McPherson, From "Think Like a Vertex" to "Think Like a Graph", VLDB Endowment (2013).
[186] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, J. M. Hellerstein, Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB Endowment (2012).
[187] S. Hong, H. Chafi, E. Sedlar, K. Olukotun, Green-Marl: A DSL for Easy and Efficient Graph Analysis, in: ASPLOS, 2012.
[188] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, Pregel: A System for Large-Scale Graph Processing, in: SIGMOD, 2010.
[189] Harshvardhan, et al., KLA: A New Algorithmic Paradigm for Parallel Graph Computation, in: PACT, 2014.
[190] J. E. Gonzalez, et al., PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, in: OSDI, 2012.
[191] J. Shun, G. E. Blelloch, Ligra: A Lightweight Graph Processing Framework for Shared Memory, in: PPoPP, 2013.
[192] J. Xue, Z. Yang, Z. Qu, S. Hou, Y. Dai, Seraph: An Efficient, Low-Cost System for Concurrent Graph Processing, in: HPDC, 2014.
[193] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein, GraphLab: A New Framework for Parallel Machine Learning, arXiv:1006.4990 [cs:LG] (2010).
[194] Google LLC, Chrome Browser.
[195] Google LLC, TensorFlow: Mobile.
[196] A. Grange, P. de Rivaz, J. Hunt, VP9 Bitstream & Decoding Process Specification, http://storage.googleapis.com/downloads.webmproject.org/docs/vp9/vp9-bitstream-specification-v0.6-20160331-draft.pdf.
[197] Hybrid Memory Cube Specification 2.0 (2014).
[198] V. Narasiman, C. J. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, Y. N. Patt, Improving GPU Performance via Large Warps and Two-Level Warp Scheduling, in: MICRO, 2011.
[199] A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, C. R. Das, OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, in: ASPLOS, 2013.
[200] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, C. R. Das, Orchestrated Scheduling and Prefetching for GPGPUs, in: ISCA, 2013.
[201] R. Ausavarungnirun, S. Ghose, O. Kayıran, G. H. Loh, C. R. Das, M. T. Kandemir, O. Mutlu, Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance, in: PACT, 2015.
[202] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, O. Mutlu, A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps, in: ISCA, 2015.
[203] N. Vijaykumar, K. Hsieh, G. Pekhimenko, S. Khan, A. Shrestha, S. Ghose, A. Jog, P. B. Gibbons, O. Mutlu, Zorua: A Holistic Approach to Resource Virtualization in GPUs, in: MICRO, 2016.
[204] A. Jog, O. Kayiran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, C. R. Das, Exploiting Core Criticality for Enhanced GPU Performance, in: SIGMETRICS, 2016.
[205] R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. J. Rossbach, O. Mutlu, MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency, in: ASPLOS, 2018.
[206] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, O. Mutlu, Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes, in: MICRO, 2017.
[207] R. Ausavarungnirun, Techniques for Shared Resource Management in Systems with Throughput Processors, Ph.D. thesis, Carnegie Mellon University (2017).
[208] M. S. Papamarcos, J. H. Patel, A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories, in: ISCA, 1984.
[209] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, N. Hajinazar, K. Hsieh, K. T. Malladi, H. Zheng, O. Mutlu, LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures, arXiv:1706.03162 [cs:AR] (2017).
[210] R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. Rossbach, O. Mutlu, MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency, in: ASPLOS, 2018.
[211] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, O. Mutlu, Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes, in: MICRO, 2017.
[212] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, O. Mutlu, Mosaic: Enabling Application-Transparent Support for Multiple Page Sizes in Throughput Processors, SIGOPS Oper. Syst. Rev. (Aug. 2018).
[213] Y. Kim, W. Yang, O. Mutlu, Ramulator: A Fast and Extensible DRAM Simulator, IEEE CAL (2015).
[214] SAFARI Research Group, Ramulator – GitHub Repository, https://github.com/CMU-SAFARI/ramulator/.
[215] SAFARI Research Group, SoftMC v1.0 – GitHub Repository, https://github.com/CMU-SAFARI/SoftMC/.
[216] N. Binkert, B. Beckman, A. Saidi, G. Black, A. Basu, The gem5 Simulator, CAN (2011).
[217] D. Sanchez, C. Kozyrakis, ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems, in: ISCA, 2013.
[218] J. Power, J. Hestness, M. S. Orr, M. D. Hill, D. A. Wood, gem5-gpu: A Heterogeneous CPU-GPU Simulator, IEEE CAL (Jan. 2015).
[219] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, T. M. Aamodt, Analyzing CUDA Workloads Using a Detailed GPU Simulator, in: ISPASS, 2009.
[220] O. Mutlu, Processing Data Where It Makes Sense: Enabling In-Memory Computation, keynote talk at MST (2017), https://people.inf.ethz.ch/omutlu/pub/onur-MST-Keynote-EnablingInMemoryComputation-October-27-2017-unrolled-FINAL.pptx.
[221] O. Mutlu, Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation, distinguished lecture at George Washington University (2019), video available online, https://people.inf.ethz.ch/omutlu/pub/onur-GWU-EnablingInMemoryComputation-February-15-2019-unrolled-FINAL.pptx.