Enabling Practical Processing in and near Memory for Data-Intensive Computing
Onur Mutlu a,b    Saugata Ghose b    Juan Gómez-Luna a    Rachata Ausavarungnirun b
a ETH Zürich    b Carnegie Mellon University
ABSTRACT
Modern computing systems suffer from the dichotomy between computation on one side, which is performed only in the processor (and accelerators), and data storage/movement on the other, which all other parts of the system are dedicated to. Due to this dichotomy, data moves a lot in order for the system to perform computation on it. Unfortunately, data movement is extremely expensive in terms of energy and latency, much more so than computation. As a result, a large fraction of system energy is spent and performance is lost solely on moving data in a modern computing system.

In this work, we re-examine the idea of reducing data movement by performing Processing in Memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the idea of PIM is not new, we examine two new approaches to enabling PIM: 1) exploiting analog properties of DRAM to perform massively-parallel operations in memory, and 2) exploiting 3D-stacked memory technology design to provide high bandwidth to in-memory logic. We conclude by discussing work on solving key challenges to the practical adoption of PIM.
INTRODUCTION

Main memory, built using Dynamic Random Access Memory (DRAM), is a major component in nearly all computing systems, including servers, cloud platforms, mobile/embedded devices, and sensors. Across these systems, the data working set sizes of applications are rapidly growing, while the need for fast analysis of such data is increasing. Thus, main memory is becoming an increasingly significant bottleneck across a wide variety of computing systems and applications [9, 38, 71, 75]. The bottleneck has worsened in recent years, as it has become increasingly difficult to efficiently scale memory capacity, energy, cost, and performance across technology generations [41, 48, 49, 63, 64, 68, 69, 71, 72, 75], as evidenced by the RowHammer problem [48, 72, 74] in recent DRAM chips.

A major reason for the main memory bottleneck is the high energy and latency associated with data movement. In today's computers, to perform any operation on data, the processor must retrieve the data from main memory. This requires the memory controller to issue commands to a DRAM module across a relatively slow and power-hungry off-chip bus (known as the memory channel). The DRAM module sends the requested data across the memory channel, after which the data is placed in the caches and registers. The CPU can perform computation on the data once the data is in its registers. Data movement from the DRAM to the CPU incurs long latency and consumes significant energy [3, 4, 9, 29, 30]. These costs are often exacerbated by the fact that much of the data brought into the caches is not reused by the CPU [82, 84], providing little benefit in return for the high latency and energy cost.

The cost of data movement is a fundamental issue with the processor-centric nature of contemporary computer systems. The CPU is considered the master in the system, and computation is performed only in the processor (and accelerators).
In contrast, data storage and communication units, including the main memory, are treated as unintelligent workers that are incapable of computation. As a result of this processor-centric design paradigm, data moves a lot in the system between the computation units and communication/storage units so that computation can be done on it. With the increasingly data-centric nature of contemporary and emerging applications, the processor-centric design paradigm leads to great inefficiency in performance, energy, and cost: for example, most of the real estate within a single compute node is already dedicated to handling data movement and storage (e.g., large caches, memory controllers, interconnects, and main memory), and our recent work shows that 62% of the entire system energy of a mobile device is spent on data movement between the processor and the memory hierarchy for widely-used mobile workloads [9].

The huge overhead of data movement in modern systems, along with technology advances that enable better integration of memory and logic, has recently prompted the re-examination of an old idea that we will generally call
Processing in Memory (PIM). The key idea is to place computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked DRAM, in the memory controllers, or inside large caches), so that data movement between where the computation is done and where the data is stored is reduced or eliminated, compared to contemporary processor-centric systems.

The idea of PIM has been around for at least four decades [1, 16–18, 24, 42–44, 52, 70, 78, 79, 85, 95, 96]. However, past efforts were not widely adopted for various reasons, including 1) the difficulty of integrating processing elements with DRAM, 2) the lack of critical memory-related scaling challenges that current technology and applications face today, and 3) the fact that the data movement bottleneck was not as critical to system cost, energy, and performance as it is today. We believe it is crucial to re-examine PIM today with a fresh perspective (i.e., with novel approaches and ideas), by exploiting new memory technologies, with realistic workloads and systems, and with a mindset to ease adoption and feasibility.

In this paper, we explore two new approaches to enabling PIM in modern systems. The first approach only minimally changes memory chips to perform simple yet powerful common operations that the chip is inherently efficient at performing [12, 13, 15, 22, 23, 60, 73, 87–93]. Such solutions take advantage of the existing memory design to perform bulk operations (i.e., operations on an entire row of DRAM cells), such as bulk copy, data initialization, and bitwise operations [13, 88–91]. The second approach enables PIM in a more general-purpose manner by taking advantage of emerging 3D-stacked memory technologies [3–5, 8–11, 14, 19–21, 26, 27, 31–33, 45–47, 59, 65, 66, 76, 80, 81, 97, 102, 104].
3D-stacked memory chips have much greater internal bandwidth than is available externally on the memory channel [58], and many such chip architectures (e.g., Hybrid Memory Cube [34, 35], High-Bandwidth Memory [37, 58]) include a logic layer where designers can add some processing logic (e.g., accelerators, simple cores, reconfigurable logic) that can take advantage of this high internal bandwidth.

Regardless of the approach taken to PIM, there are key practical adoption challenges that system architects and programmers must address to enable the widespread adoption of PIM across the computing landscape and in different domains of workloads. We also briefly discuss these challenges in this paper, along with references to some existing work that addresses these challenges.
MINIMALLY CHANGING MEMORY CHIPS

Minimal modifications in existing memory chips can enable simple yet powerful computation capability inside the chip. These modifications take advantage of the existing interconnects in, and analog operational behavior of, conventional memory chips, e.g., DRAM architectures, without the need for a logic layer and usually without the need for logic processing elements. As a result, the overheads imposed on the memory chip are low. There are a number of mechanisms that use this approach to take advantage of the high internal bandwidth available within each memory cell array [12, 13, 87–91, 93]. We briefly describe one such design, Ambit, which enables in-DRAM bulk bitwise operations [88, 90, 91], by building on RowClone, which enables fast and energy-efficient in-DRAM data movement [13, 89].
Ambit: In-DRAM Bulk Bitwise Operations.
Many applications use bulk bitwise operations [51, 99] (i.e., bitwise operations on large bit vectors), such as bitmap indices, bitwise scan acceleration [62] for databases, accelerated document filtering for web search [25], DNA sequence alignment [6, 7, 47, 100], encryption algorithms [28, 98], graph processing, and networking [99]. Accelerating bulk bitwise operations can thus significantly boost the performance and energy efficiency of a wide range of applications.

We have recently proposed a new Accelerator-in-Memory for bulk Bitwise operations (Ambit) [88, 90, 91]. Unlike prior approaches, Ambit uses the analog operation of existing DRAM technology to perform bulk bitwise operations. Ambit has two components. The first component, Ambit–AND–OR, implements a new operation called triple-row activation, where the memory controller simultaneously activates three rows. Triple-row activation uses the charge sharing principles that govern the operation of the DRAM array to perform a bitwise AND or OR on two rows of data, by controlling the initial value on the third row. The second component, Ambit–NOT, takes advantage of the two inverters that are connected to each sense amplifier in a DRAM subarray, as the voltage level of one of the inverters represents the negated logical value of the cell. The Ambit design adds a special row to the DRAM array to capture this negated value. One possible implementation of the special row [91] is a row of dual-contact cells (a 2-transistor 1-capacitor cell [39, 67]), each connected to both inverters inside a sense amplifier. Even in the presence of process variation (see [91]), Ambit can reliably perform AND, OR, and NOT operations completely using DRAM technology, making it functionally (i.e., Boolean logic) complete.

Ambit provides promising performance and energy improvements.
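Functionally, triple-row activation computes the bitwise majority of the three activated rows: initializing the control row to all 0s yields AND, and to all 1s yields OR. The sketch below is a purely functional model of this behavior (function names are ours, and it omits circuit-level details; in the real DRAM array, charge sharing overwrites all three activated rows with the result, so operands are first copied into designated rows via RowClone-style operations):

```python
def triple_row_activation(row_a, row_b, row_c):
    """Functional model of Ambit's triple-row activation: charge sharing
    across three simultaneously activated DRAM rows drives each bitline
    to the bitwise majority of the three cell values."""
    return [(a & b) | (b & c) | (a & c) for a, b, c in zip(row_a, row_b, row_c)]

def bulk_and(row_a, row_b):
    # Control row initialized to all 0s: MAJ(a, b, 0) = a AND b
    return triple_row_activation(row_a, row_b, [0] * len(row_a))

def bulk_or(row_a, row_b):
    # Control row initialized to all 1s: MAJ(a, b, 1) = a OR b
    return triple_row_activation(row_a, row_b, [1] * len(row_a))
```

Together with Ambit–NOT (which supplies negation), these primitives make the substrate Boolean-logic complete, since {AND, OR, NOT} can express any bitwise function.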
Averaged across seven commonly-used bulk bitwise operations (NOT, AND, OR, NAND, NOR, XOR, XNOR), Ambit with 8 DRAM banks improves bulk bitwise operation throughput by 44× compared to an Intel Skylake processor [36], and 32× compared to the NVIDIA GTX 745 GPU [77]. Compared to DDR3 DRAM, Ambit reduces energy consumption by 35× on average. When integrated directly into the HMC 2.0 device, which has many more banks, Ambit improves operation throughput by 9.7× compared to processing in the logic layer of HMC 2.0. Our work evaluates the end-to-end benefits of Ambit on real database queries using bitmap indices and the BitWeaving database [62], showing query latency reductions of 2× to 12×, with larger benefits for larger data set sizes.

A number of Ambit-like bitwise operation substrates have been proposed in recent years, making use of emerging resistive memory technologies, e.g., phase-change memory (PCM) [55–57, 83, 101, 103], SRAM, or specialized DRAM. These substrates can perform bulk bitwise operations in a special DRAM array augmented with computational circuitry [60] and in PCM [61]. Similar substrates can perform simple arithmetic operations in SRAM [2, 40] and arithmetic and logical operations in memristors [53, 54, 94]. Resistive memory technologies are amenable to in-place updates, and can thus incorporate Ambit-like operations with even less data movement than DRAM. Thus, we believe it is extremely important to continue exploring low-cost Ambit-like substrates, as well as more sophisticated computational substrates, for all types of memory technologies, old and new.

PROCESSING IN THE LOGIC LAYER OF 3D-STACKED MEMORY

Several works propose to place some form of processing logic (typically accelerators, simple cores, or reconfigurable logic) inside the logic layer of 3D-stacked memory [58]. This PIM processing logic, which we also refer to as PIM cores, can execute portions of applications (from individual instructions to functions) or entire threads and applications, depending on the design of the architecture. The PIM cores connect to the memory stacks that are on top of them using vertical through-silicon vias [58], which provide high-bandwidth and low-latency access to data. In this section, we discuss examples of how systems can make use of relatively simple PIM cores to avoid data movement and thus obtain significant performance and energy improvements for a variety of application domains.
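The "move computation to the data" pattern that such designs exploit (most explicitly Tesseract, described below) can be sketched functionally as follows. This is a toy model with hypothetical names, not any specific design's hardware interface: each PIM core owns one memory partition, and work on data owned by another core is shipped to that core as a function call rather than pulling the data across the memory channel.

```python
class PIMCore:
    """Toy model of a logic-layer PIM core that owns one memory partition."""

    def __init__(self, core_id, num_cores):
        self.core_id = core_id
        self.num_cores = num_cores
        self.partition = {}   # data held in this core's memory stack
        self.inbox = []       # function calls shipped here by other cores

    def owner_of(self, key):
        # Simple static partitioning of the data across cores
        return key % self.num_cores

    def call(self, cores, key, fn, *args):
        """Run fn on the core that owns `key`: locally if we own it,
        otherwise enqueue the call at the owner (moving the function,
        not the data)."""
        owner = self.owner_of(key)
        if owner == self.core_id:
            fn(self.partition, key, *args)
        else:
            cores[owner].inbox.append((fn, key, args))

    def drain(self):
        # Process all function calls shipped from other cores
        for fn, key, args in self.inbox:
            fn(self.partition, key, *args)
        self.inbox.clear()

def accumulate(partition, key, value):
    # e.g., a PageRank-style contribution applied where the vertex lives
    partition[key] = partition.get(key, 0.0) + value
```

In this model, only the small (function, arguments) message crosses partitions, while every data access stays local to the owning core's high-bandwidth memory stack.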
Tesseract: Graph Processing.
A popular modern application is large-scale graph processing/analytics. Graph processing has broad applicability and use in many domains, from social networks to machine learning, from data analytics to bioinformatics. Graph analysis workloads put large pressure on memory bandwidth due to 1) frequent random memory accesses across large memory regions (leading to limited cache efficiency and unnecessary data transfer on the memory bus) and 2) the small amount of computation per data item fetched from memory (leading to limited ability to hide long memory latencies and exercising the memory energy bottleneck). These two characteristics make it very challenging to scale up such workloads despite their inherent parallelism, especially with conventional architectures based on large on-chip caches and relatively scarce off-chip memory bandwidth for random access.

To overcome the limitations of conventional architectures, we design Tesseract, a programmable PIM accelerator for large-scale graph processing [3]. Tesseract consists of 1) simple in-order PIM cores that exploit the high memory bandwidth available in the logic layer of 3D-stacked memory, where each core manipulates data only on the memory partition it is assigned to control, 2) an efficient communication interface that allows a PIM core to request computation on data elements that reside in the memory partition controlled by another core, and 3) a message-passing-based programming interface, similar to how modern distributed systems are programmed, which enables remote function calls on data that resides in each memory partition. Tesseract moves functions to data rather than moving data elements across different memory partitions and cores. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large graphs show that Tesseract improves average system performance by 13.8× and reduces average system energy by 87% over a state-of-the-art conventional system.

Consumer Workloads.
A popular domain of computing is consumer devices, including smartphones, tablets, web-based computers (e.g., Chromebooks), and wearable devices. In such devices, energy efficiency is a first-class concern due to the limited battery capacity and the stringent thermal power budget. We find that data movement is a major contributor to energy (and execution time) in modern consumer devices: across four popular workloads (described next), 62.7% of the total system energy, on average, is spent on data movement across the memory hierarchy [9].

We comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads [9]: 1) the Chrome web browser, 2) TensorFlow Mobile (Google's machine learning framework), 3) the VP9 video playback engine, and 4) the VP9 video capture engine. We find that offloading key functions (called target functions) of these workloads to PIM logic greatly reduces data movement. However, consumer devices are extremely stringent in terms of the extra area and energy they can accommodate. As a result, it is important to identify what kind of PIM logic can both 1) maximize energy efficiency and 2) be implemented at minimum possible area and energy costs.

We find that many of the target functions for PIM in consumer workloads are comprised of simple operations (e.g., memcopy/memset, basic arithmetic and bitwise operations), and can be implemented easily in the logic layer using either 1) a small low-power general-purpose core or 2) small fixed-function accelerators. Our analysis shows that a PIM core and a PIM accelerator take up no more than 9.4% and 35.4%, respectively, of the area available for PIM logic in an HMC-like [35] 3D-stacked memory architecture. Both the PIM core and PIM accelerator eliminate a large amount of data movement, and thereby significantly reduce total system energy (by an average of 55.4% across all the workloads) and execution time (by an average of 54.2%).
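As a rough back-of-the-envelope check on these numbers (our own first-order accounting, not the evaluation methodology of [9]): if 62.7% of system energy goes to data movement, then eliminating most of that movement energy by offloading the target functions yields savings on the order of the reported 55.4%.

```python
def energy_after_offload(total_j, movement_fraction, movement_eliminated):
    """First-order model (our simplification): a fraction of total system
    energy is spent on data movement, and PIM offloading eliminates part
    of that movement energy."""
    movement_j = total_j * movement_fraction
    return total_j - movement_j * movement_eliminated

# If ~88% of the 62.7% movement energy is eliminated, the total savings
# land near the reported 55.4% average reduction (0.627 * 0.884 ≈ 0.554).
savings = 1.0 - energy_after_offload(1.0, 0.627, 0.884)
print(f"{savings:.1%}")  # → 55.4%
```

The 88% figure here is back-derived from the two reported averages purely for illustration; the real savings vary per workload and per target function.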
ENABLING THE ADOPTION OF PIM

Pushing computation from the CPU into memory introduces new challenges for system architects and programmers to overcome. Many of these challenges must be addressed for PIM to be adopted in a wide variety of systems and workloads, without placing a heavy burden on most programmers [22, 73]. These challenges include: 1) how to easily program PIM systems (with good programming model, library, compiler, and tools support) [4, 32]; 2) how to design runtime systems and system software that can take advantage of PIM (e.g., runtime scheduling of code on PIM logic, data mapping) [4, 9, 32, 80]; 3) how to efficiently enable coherence between PIM logic and CPU/accelerator cores that operate on shared data [4, 10, 11]; 4) how to efficiently enable virtual memory support on the PIM logic [33]; 5) how to design high-performance data structures for PIM whose performance is better than that of concurrent data structures on multi-core machines [65]; and 6) how to accurately assess the benefits and shortcomings of PIM using realistic workload suites, rigorous analysis methodologies, and accurate and flexible simulation infrastructures [50, 86].

We believe these challenges provide exciting cross-layer research opportunities. Fundamentally solving the data movement problem requires a paradigm shift to a data-centric computing system design, where computation happens in or near memory, with minimal data movement. We argue that research toward such a paradigm shift would be very useful for both PIM as well as other potential ideas that can reduce data movement.
ACKNOWLEDGMENTS
We thank members of the SAFARI Research Group and collaborators at Carnegie Mellon, ETH Zürich, and other universities, who have contributed to the various works we describe in this paper. Thanks also to our research group's industrial sponsors over the past ten years, especially Alibaba, Google, Huawei, Intel, Microsoft, NVIDIA, Samsung, and VMware. This work was also partially supported by the Semiconductor Research Corporation and NSF.
REFERENCES
[1] A. Acharya et al. 1998. Active Disks: Programming Model, Algorithms and Evaluation. In ASPLOS.
[2] S. Aga et al. 2017. Compute Caches. In HPCA.
[3] J. Ahn et al. 2015. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. In ISCA.
[4] J. Ahn et al. 2015. PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture. In ISCA.
[5] B. Akin et al. 2015. Data Reorganization in Memory Using 3D-Stacked DRAM. In ISCA.
[6] M. Alser et al. Shouji: A Fast and Efficient Pre-Alignment Filter for Sequence Alignment. Bioinformatics (2019).
[7] M. Alser et al. GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping. Bioinformatics (2017).
[8] O. O. Babarinsa et al. 2015. JAFAR: Near-Data Processing for Databases. In SIGMOD.
[9] A. Boroumand et al. 2018. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks. In ASPLOS.
[10] A. Boroumand et al. 2019. CoNDA: Enabling Efficient Near-Data Accelerator Communication by Optimizing Data Movement. In ISCA.
[11] A. Boroumand et al. LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory. CAL (2016).
[12] K. K. Chang. 2017. Understanding and Improving the Latency of DRAM-Based Memory Systems. Ph.D. Dissertation. Carnegie Mellon Univ.
[13] K. K. Chang et al. 2016. Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM. In HPCA.
[14] P. Chi et al. 2016. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In ISCA.
[15] Q. Deng et al. 2018. DrAcc: A DRAM Based Accelerator for Accurate CNN Inference. In DAC.
[16] J. Draper et al. 2002. The Architecture of the DIVA Processing-in-Memory Chip. In SC.
[17] D. Elliott et al. Computational RAM: Implementing Processors in Memory. IEEE Design & Test (1999).
[18] D. G. Elliott et al. 1992. Computational RAM: A Memory-SIMD Hybrid and Its Application to DSP. In CICC.
[19] A. Farmahini-Farahani et al. 2015. NDA: Near-DRAM Acceleration Architecture Leveraging Commodity DRAM Devices and Standard Memory Modules. In HPCA.
[20] M. Gao et al. 2015. Practical Near-Data Processing for In-Memory Analytics Frameworks. In PACT.
[21] M. Gao et al. 2016. HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing. In HPCA.
[22] S. Ghose et al. 2018. Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions. arXiv:1802.00320 [cs.AR].
[23] S. Ghose et al. 2019. The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption. In Beyond-CMOS Technologies for Next Generation Computer Design.
[24] M. Gokhale et al. Processing in Memory: The Terasys Massively Parallel PIM Array. IEEE Computer (1995).
[25] B. Goodwin et al. 2017. BitFunnel: Revisiting Signatures for Search. In SIGIR.
[26] B. Gu et al. 2016. Biscuit: A Framework for Near-Data Processing of Big Data Workloads. In ISCA.
[27] Q. Guo et al. 2014. 3D-Stacked Memory-Side Acceleration: Accelerator and System Design. In WoNDP.
[28] J.-W. Han et al. Optical Image Encryption Based on XOR Operations. SPIE OE (1999).
[29] M. Hashemi et al. 2016. Accelerating Dependent Cache Misses with an Enhanced Memory Controller. In ISCA.
[30] M. Hashemi et al. 2016. Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads. In MICRO.
[31] S. M. Hassan et al. 2015. Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore. In MEMSYS.
[32] K. Hsieh et al. 2016. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. In ISCA.
[33] K. Hsieh et al. 2016. Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. In ICCD.
[34] Hybrid Memory Cube Consortium. 2013. HMC Specification 1.1.
[35] Hybrid Memory Cube Consortium. 2014. HMC Specification 2.0.
[36] Intel Corp. 2018. 6th Generation Intel Core Processor Family Datasheet.
[37] JEDEC. 2013. High Bandwidth Memory (HBM) DRAM. Standard No. JESD235.
[38] S. Kanev et al. 2015. Profiling a Warehouse-Scale Computer. In ISCA.
[39] H. Kang et al. 2009. One-Transistor Type DRAM. US Patent 7701751.
[40] M. Kang et al. 2014. An Energy-Efficient VLSI Architecture for Pattern Recognition via Deep Embedding of Computation in SRAM. In ICASSP.
[41] U. Kang et al. 2014. Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling. In The Memory Forum.
[42] Y. Kang et al. 1999. FlexRAM: Toward an Advanced Intelligent Memory System. In ICCD.
[43] S. Kaxiras et al. 1997. Distributed Vector Architecture: Beyond a Single Vector-IRAM. In First Workshop on Mixing Logic and DRAM: Chips that Compute and Remember.
[44] K. Keeton et al. A Case for Intelligent Disks (IDISKs). SIGMOD Rec. (1998).
[45] D. Kim et al. 2016. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. In ISCA.
[46] G. Kim et al. 2017. Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs. In SC.
[47] J. S. Kim et al. GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies. BMC Genomics (2018).
[48] Y. Kim et al. 2014. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In ISCA.
[49] Y. Kim et al. 2012. A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM. In ISCA.
[50] Y. Kim et al. Ramulator: A Fast and Extensible DRAM Simulator. CAL (2015).
[51] D. E. Knuth. 2009. The Art of Computer Programming, Volume 4 Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams.
[52] P. M. Kogge. 1994. EXECUBE–A New Architecture for Scaleable MPPs. In ICPP.
[53] S. Kvatinsky et al. MAGIC—Memristor-Aided Logic. IEEE TCAS II: Express Briefs (2014).
[54] S. Kvatinsky et al. Memristor-Based Material Implication (IMPLY) Logic: Design Principles and Methodologies. TVLSI (2014).
[55] B. C. Lee et al. 2009. Architecting Phase Change Memory as a Scalable DRAM Alternative. In ISCA.
[56] B. C. Lee et al. Phase Change Memory Architecture and the Quest for Scalability. CACM (2010).
[57] B. C. Lee et al. Phase-Change Technology and the Future of Main Memory. IEEE Micro (2010).
[58] D. Lee et al. Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost. TACO (2016).
[59] J. H. Lee et al. 2015. BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models. In PACT.
[60] S. Li et al. 2017. DRISA: A DRAM-Based Reconfigurable In-Situ Accelerator. In MICRO.
[61] S. Li et al. 2016. Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-Volatile Memories. In DAC.
[62] Y. Li et al. 2013. BitWeaving: Fast Scans for Main Memory Data Processing. In SIGMOD.
[63] J. Liu et al. 2013. An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms. In ISCA.
[64] J. Liu et al. 2012. RAIDR: Retention-Aware Intelligent DRAM Refresh. In ISCA.
[65] Z. Liu et al. 2017. Concurrent Data Structures for Near-Memory Computing. In SPAA.
[66] G. H. Loh et al. 2013. A Processing in Memory Taxonomy and a Case for Studying Fixed-Function PIM. In WoNDP.
[67] S.-L. Lu et al. 2015. Improving DRAM Latency with Dynamic Asymmetric Subarray. In MICRO.
[68] Y. Luo et al. 2017. Using ECC DRAM to Adaptively Increase Memory Capacity. arXiv:1706.08870 [cs.AR].
[69] Y. Luo et al. 2014. Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory. In DSN.
[70] K. Mai et al. 2000. Smart Memories: A Modular Reconfigurable Architecture. In ISCA.
[71] O. Mutlu. Memory Scaling: A Systems Architecture Perspective. IMW (2013).
[72] O. Mutlu. 2017. The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser. In DATE.
[73] O. Mutlu et al. Processing Data Where It Makes Sense: Enabling In-Memory Computation. Microprocessors and Microsystems (2019).
[74] O. Mutlu et al. 2019. RowHammer: A Retrospective. In IEEE TCAD.
[75] O. Mutlu et al. Research Problems and Opportunities in Memory Systems. SUPERFRI (2014).
[76] L. Nai et al. 2017. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks. In HPCA.
[77] NVIDIA Corp. 2014. GeForce GTX 745 Specification.
[78] M. Oskin et al. 1998. Active Pages: A Computation Model for Intelligent Memory. In ISCA.
[79] D. Patterson et al. A Case for Intelligent RAM. IEEE Micro (1997).
[80] A. Pattnaik et al. 2016. Scheduling Techniques for GPU Architectures with Processing-in-Memory Capabilities. In PACT.
[81] S. H. Pugsley et al. 2014. NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In ISPASS.
[82] M. K. Qureshi et al. 2007. Adaptive Insertion Policies for High-Performance Caching. In ISCA.
[83] M. K. Qureshi et al. 2009. Scalable High Performance Main Memory System Using Phase-Change Memory Technology. In ISCA.
[84] M. K. Qureshi et al. 2007. Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines. In HPCA.
[85] E. Riedel et al. 1998. Active Storage for Large-Scale Data Mining and Multimedia Applications. In VLDB.
[86] SAFARI Research Group. 2015. Ramulator: A DRAM Simulator – GitHub Repository. https://github.com/CMU-SAFARI/ramulator/.
[87] V. Seshadri. 2016. Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems. Ph.D. Dissertation. Carnegie Mellon Univ.
[88] V. Seshadri et al. Fast Bulk Bitwise AND and OR in DRAM. CAL (2015).
[89] V. Seshadri et al. 2013. RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization. In MICRO.
[90] V. Seshadri et al. 2016. Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM. arXiv:1611.09988 [cs.AR].
[91] V. Seshadri et al. 2017. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology. In MICRO.
[92] V. Seshadri et al. 2015. Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses. In MICRO.
[93] V. Seshadri et al. 2017. Simple Operations in Memory to Reduce Data Movement. In Advances in Computers, Volume 106.
[94] A. Shafiee et al. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In ISCA.
[95] D. E. Shaw et al. The NON-VON Database Machine: A Brief Overview. IEEE Database Eng. Bull. (1981).
[96] H. S. Stone. A Logic-in-Memory Computer. TC (1970).
[97] Z. Sura et al. 2015. Data Access Optimization in a Processing-in-Memory System. In CF.
[98] P. Tuyls et al. XOR-Based Visual Cryptography Schemes. Designs, Codes and Cryptography (2005).
[99] H. S. Warren. 2012. Hacker's Delight (2nd ed.). Addison-Wesley Professional.
[100] H. Xin et al. Shifted Hamming Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in Read Mapping. Bioinformatics (2015).
[101] H. Yoon et al. Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories. ACM TACO (2014).
[102] D. P. Zhang et al. 2014. TOP-PIM: Throughput-Oriented Programmable Processing in Memory. In HPDC.
[103] P. Zhou et al. 2009. A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology. In ISCA.
[104] Q. Zhu et al. 2013. Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware. In HPEC.