Accelerating Bulk Bit-Wise X(N)OR Operation in Processing-in-DRAM Platform
Shaahin Angizi and Deliang Fan
Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL
[email protected], [email protected]
ABSTRACT
With Von-Neumann computing architectures struggling to address computationally- and memory-intensive big data analytic tasks today, Processing-in-Memory (PIM) platforms are gaining growing interest. In this context, processing-in-DRAM architectures have achieved remarkable success by dramatically reducing data transfer energy and latency. However, the performance of such systems unavoidably diminishes when dealing with more complex applications seeking bulk bit-wise X(N)OR or addition operations, despite utilizing maximum internal DRAM bandwidth and in-memory parallelism. In this paper, we develop DRIM, a platform that harnesses DRAM as computational memory and transforms it into a fundamental processing unit. DRIM uses the analog operation of DRAM sub-arrays and elevates it to implement bit-wise X(N)OR operations between operands stored in the same bit-line, based on a new dual-row activation mechanism with a modest change to peripheral circuits such as sense amplifiers. The simulation results show that DRIM achieves on average 71× and 8.4× higher throughput for performing bulk bit-wise X(N)OR-based operations compared with CPU and GPU, respectively. Besides, DRIM outperforms recent processing-in-DRAM platforms with up to 3.7× better performance.

1 INTRODUCTION

In the last two decades, Processing-in-Memory (PIM) architecture, as a potentially viable way to solve the memory wall challenge, has been well explored for different applications [1-7]. The key concept behind PIM is to realize logic computation within memory to process data by leveraging the inherent parallel computing mechanism and exploiting large internal memory bandwidth. Proposals for SRAM-based PIM architectures [8, 9] can be found in recent literature. However, PIM in the context of main memory (DRAM [2, 3, 10]) has drawn much more attention in recent years, mainly due to larger memory capacities and off-chip data transfer reduction as opposed to SRAM-based PIM. Such processing-in-DRAM platforms show significantly higher throughput by leveraging multi-row activation methods to perform bulk bit-wise operations, either modifying the DRAM cell and/or the sense amplifier. For example, Ambit [2] uses a triple-row activation method to implement majority-based AND/OR logic, outperforming an Intel Skylake CPU, an NVIDIA GeForce GPU, and even HMC [11] by 44.9×, 32.0×, and 2.4×, respectively. DRISA [12] employs 3T1C- and 1T1C-based computing mechanisms and achieves 7.7× speedup and 15× better energy-efficiency over GPUs to accelerate convolutional neural networks. However, there are different challenges in such platforms that make them inefficient acceleration solutions for X(N)OR- and addition-based applications such as DNA alignment and data encryption. Due to the intrinsic complexity of X(N)OR logic, current PIM designs are not able to offer high-throughput X(N)OR-based operations despite utilizing the maximum internal bandwidth and memory-level parallelism.
This is because of the majority/AND/OR-based multi-cycle operations and the required row initialization in the previous designs.

To overcome the memory bandwidth bottleneck and address the existing challenges, we propose a high-throughput and energy-efficient PIM accelerator based on DRAM, called DRIM. DRIM exploits a new in-memory computing mechanism called Dual-Row Activation (DRA) to perform bulk bit-wise operations between operands stored in different word-lines. The DRA is developed based on the analog operation of DRAM sub-arrays with a modest change in the sense amplifier circuit, such that the X(N)OR operation can be efficiently realized on every memory bit-line. In addition, such a design addresses the reliability concerns regarding the voltage deviation on the bit-line and the multi-cycle operations of the triple-row activation method. We evaluate and compare DRIM's raw performance with conventional and PIM accelerators, including a Core-i7 Intel CPU [13], an NVIDIA GTX 1080Ti Pascal GPU [14], Ambit [2], DRISA-1T1C [3], and HMC 2.0 [11], in handling bulk bit-wise operations. We observe that DRIM achieves remarkable throughput compared to Von-Neumann computing systems (CPU/GPU) by unblocking the data movement bottleneck, with on average 71×/8.4× better throughput. DRIM outperforms other PIMs in performing X(N)OR-based operations by up to 3.7× higher throughput. We further show that a 3D-stacked DRAM built on top of DRIM can boost the throughput of the HMC by ∼×. From the energy consumption perspective, DRIM reduces the DRAM chip energy by 2.4× compared with Ambit [2] and 69× compared with copying data through the DDR4 interface.

To the best of our knowledge, this work is the first to design a high-throughput and energy-efficient X(N)OR-friendly PIM architecture exploiting DRAM arrays. We develop DRIM based on a set of novel microarchitectural and circuit-level schemes to realize a data-parallel computational unit for different applications.
2 BACKGROUND

A DRAM hierarchy at the top level is composed of channels, modules, and ranks. Each memory rank, with a data bus typically 64 bits wide, includes a set of memory chips that are manufactured in a variety of configurations and operate in unison [2, 15]. Each chip is further divided into multiple memory banks that contain 2D sub-arrays of memory cells virtually organized in memory matrices (mats). Banks within the same chip share I/O and buffers, and banks in different chips work in a lock-step manner. Each memory sub-array, as shown in Fig. 1a, has 1) a large number of rows (typically 2^9 or 2^10) holding DRAM cells, 2) a row of Sense Amplifiers (SA), and 3) a Row Decoder (RD) connected to the cells.

Figure 1: (a) DRAM sub-array organization, (b) DRAM cell and Sense Amplifier, (c) Dual-contact DRAM cell.

A DRAM cell basically consists of two elements, a capacitor (storage) and an Access Transistor (AT) (Fig. 1b). The drain and gate of the AT are connected to the Bit-line (BL) and Word-line (WL), respectively. The DRAM cell encodes binary data in the charge of the capacitor: it represents logic '1' when the capacitor is fully charged, and logic '0' when there is no charge.

• Write/Read Operation:
In the initial state, both BL and its complement BL' are precharged to ½Vdd. Technically, accessing data in a DRAM sub-array (write/read) after the initial state is done through three consecutive commands [2, 16] issued by the memory controller: 1) During activation (ACTIVATE), the target row is activated and data is copied from the DRAM cells to the SA row. Fig. 1b shows how a cell is connected to an SA via a BL. The selected cell (storing Vdd or 0) shares its charge with the BL, leading to a small change in the initial voltage of the BL (½Vdd ± δ). Then, by asserting the enable signal, the SA senses and amplifies the δ of the BL voltage towards the original value of the data through voltage amplification according to the switching threshold of the SA's inverter [16]. 2) Such data can then be transferred between the SA and the DRAM bus by a READ/WRITE command; multiple READ/WRITE commands can be issued to one row. 3) The PRECHARGE command precharges both BL and BL' again and makes the sub-array ready for the next access.

• Copy and Initialization Operations:
To enable a fast (sub-100 ns) in-memory copy operation within DRAM sub-arrays, rather than the microsecond-scale conventional operation in Von-Neumann computing systems, RowClone-Fast Parallel Mode (FPM) [17] proposes a PIM-based mechanism that does not need to send the data to the processing units. In this scheme, two back-to-back ACTIVATE commands to the source and destination rows, without a PRECHARGE command in between, lead to a multi-kilo-byte in-memory copy operation. This operation takes only 90 ns [17]. This method has been further used for row initialization, where a preset DRAM row (either '0' or '1') can be readily copied to a destination row. RowClone imposes only a 0.01% overhead on DRAM chip area [17].

• NOT Operation:
The NOT function has been implemented in different works employing Dual-Contact Cells (DCCs), as shown in Fig. 1c. A DCC is mainly designed based on the typical DRAM cell, but equipped with one more AT connected to the complementary bit-line. Such a hardware-friendly design [2, 18, 19] can be developed for a small number of rows on top of existing DRAM cells to enable an efficient NOT operation by issuing two back-to-back ACTIVATE commands [2]. In this way, the memory controller first activates WL dcc1 (Fig. 1c) of the input DRAM cell and reads the data out to the SA through BL. It then activates WL dcc2 to connect the complementary bit-line to the same capacitor, and so writes the negated result back to the DCC.
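As a rough behavioral sketch (illustrative Python, not the paper's code; the class, method names, and capacitance values are assumptions for exposition), the basic operations above — read via charge sharing plus sense amplification, RowClone-FPM copy via back-to-back ACTIVATEs, and NOT via a dual-contact cell — can be modeled as:

```python
# Illustrative model of a DRAM sub-array's basic operations.
VDD, C_CELL, C_BL = 1.0, 1.0, 8.0   # normalized, illustrative values

class SubArray:
    def __init__(self, rows, cols):
        self.mem = [[0] * cols for _ in range(rows)]
        self.sa = None                      # sense-amplifier latch row

    def precharge(self):
        self.sa = None                      # bit-lines back to 1/2 Vdd

    def activate(self, row):
        if self.sa is None:
            # Charge sharing: each cell perturbs its precharged BL by
            # +/- delta; the SA amplifies the sign back to full swing.
            out = []
            for bit in self.mem[row]:
                v = (C_CELL * VDD * bit + C_BL * VDD / 2) / (C_CELL + C_BL)
                out.append(1 if v > VDD / 2 else 0)
            self.sa = out
        else:
            self.mem[row] = list(self.sa)   # driven SAs overwrite the row

    def rowclone(self, src, dst):
        # AAP: ACTIVATE src, ACTIVATE dst, PRECHARGE (~90 ns total)
        self.activate(src)
        self.activate(dst)
        self.precharge()

    def dcc_not(self, src, out_row):
        # Sketch of the DCC flow: read via WLdcc1, write the complement
        # back via WLdcc2 (here written to a separate row for clarity).
        self.activate(src)
        self.mem[out_row] = [1 - b for b in self.sa]
        self.precharge()

arr = SubArray(8, 4)
arr.mem[0] = [1, 0, 1, 1]
arr.rowclone(0, 5)
arr.dcc_not(0, 6)
assert arr.mem[5] == [1, 0, 1, 1] and arr.mem[6] == [0, 1, 0, 0]
```

The second ACTIVATE sees already-driven sense amplifiers, so it overwrites the destination row instead of reading it — that is the whole RowClone-FPM trick.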
Figure 2: (a) Ambit's TRA [2], (b) DRISA's 3T1C [3], (c) DRISA's 1T1C-logic [3]. Glossary: Di/Dj: input row data, Dk: initialized row data, Dr: result row data.

• Other Logic Operations:
To realize logic functions in the DRAM platform, Ambit [2] extends the idea of RowClone by implementing 3-input majority (MAJ3)-based operations in memory, issuing the ACTIVATE command to three rows simultaneously followed by a single PRECHARGE command, the so-called Triple-Row Activation (TRA) method. As shown in Fig. 2a, considering one row as the control, initialized to Dk = '0'/'1', Ambit can readily implement in-memory AND2/OR2 in addition to MAJ3 functions through charge sharing between the connected cells (Dk, Di, and Dj), writing the result back to the Dr cell. It also leverages the TRA mechanism along with DCCs to realize the complementary functions. However, although Ambit shows only 1% area overhead over a commodity DRAM chip [2], it suffers from multi-cycle PIM operations to implement other functions such as XOR2/XNOR2 based on TRA. Alternatively, the DRISA-3T1C method [3] utilizes the early 3-transistor DRAM design [20], in which the cell consists of two separated read/write ATs and one more transistor to decouple the capacitor from the read BL (rBL), as shown in Fig. 2b. This transistor connects two DRAM cells in a NOR style on the rBL, naturally performing the functionally-complete NOR2 function. However, DRISA-3T1C imposes a very large area overhead (2T per cell) and still requires multi-cycle operations to implement more complex logic functions. The DRISA-1T1C method [3] performs PIM by upgrading the SA unit, adding a CMOS logic gate in conjunction with a latch, as depicted in Fig. 2c. Such an inherently multi-cycle operation can enhance the performance of a single function through the add-on CMOS circuitry, in two consecutive cycles: in the first cycle, Di is read out and stored in the latch, and in the second cycle, Dj is sensed to perform the computation. However, this design imposes excessive cycles to implement other logic functions and adds at least 12 transistors to each SA. Recently, DrAcc [21] implements a carry look-ahead adder by enhancing Ambit [2] to accelerate convolutional neural networks.
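As a behavioral sketch (illustrative Python, not taken from any of the cited designs), the TRA charge-sharing computation reduces to a 3-input majority vote, with the initialized control row Dk selecting between AND2 and OR2:

```python
# Functional sketch of Ambit-style Triple-Row Activation (TRA):
# three cells share charge on one bit-line, and the SA resolves
# the majority of the three stored bits.

def tra(d_i, d_j, d_k):
    # BL deviation follows the number of '1' cells, so the
    # sense amplifier output is the 3-input majority.
    return 1 if (d_i + d_j + d_k) >= 2 else 0

def and2(d_i, d_j):
    return tra(d_i, d_j, 0)   # Dk initialized to '0' => AND2

def or2(d_i, d_j):
    return tra(d_i, d_j, 1)   # Dk initialized to '1' => OR2

for a in (0, 1):
    for b in (0, 1):
        assert and2(a, b) == (a & b)
        assert or2(a, b) == (a | b)
```

This also makes the X(N)OR problem visible: no single setting of Dk turns a majority vote into XOR, which is why TRA-based designs need multiple cycles (and row initializations) for it.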
2.3 CHALLENGES

There are three main challenges in the existing processing-in-DRAM platforms that make them inefficient acceleration solutions for XOR-based computations, which we aim to resolve:

• Limited throughput (Challenge-1):
Due to the intrinsic complexity of X(N)OR-based logic implementations, current PIM designs (such as Ambit [2], DRISA [3], and DrAcc [21]) are not able to offer a high-throughput and area-efficient X(N)OR or addition in-memory operation, despite utilizing maximum internal DRAM bandwidth and memory-level parallelism for NOT, (N)AND, (N)OR, and MAJ/MIN logic functions. Moreover, while the DRISA-1T1C method could implement either XNOR or XOR as the add-on logic gate, it requires at least two consecutive cycles to perform the computation, which in turn limits the implementation of other logic functions. We address this challenge by proposing the DRA mechanism in Sections 3.1 and 3.4.

• Row initialization (Challenge-2):
Given an R = A op B function (op ∈ {AND2, OR2}), the TRA-based method [2, 16] takes 4 consecutive steps to calculate one result, as it relies on row initialization: 1) RowClone data of row A to row Di (copying the first operand to a computation row to avoid overwriting data); 2) RowClone of row B to Dj; 3) RowClone of the ctrl row to Dk (copying the initialized control row to a computation row); 4) TRA and RowClone of row Di to the R row (computation and writing back the result). Therefore, the TRA method needs on average 360 ns to perform such in-memory operations. When it comes to XOR2/XNOR2 operations, Ambit requires at least three row-initialization steps to process two input rows. Obviously, this row-initialization load can adversely impact the PIM's energy-efficiency, especially when dealing with big data problems. This challenge is addressed in Section 3.1 through the proposed sense amplifier, which totally eliminates the need for initialization in performing X(N)OR-based logic.

• Reliability concerns (Challenge-3):
By simultaneously activating three cells in the TRA method, the voltage deviation on the BL might be smaller than in a typical one-cell read operation. This can elongate the sense amplification state or even adversely affect the reliability of the result [2, 16]. The problem is intensified when multiple TRAs are needed to implement X(N)OR-based computations. To explore and address this challenge, we perform an extensive Monte-Carlo simulation on our design in Section 3.3.

3 DRIM ARCHITECTURE

DRIM is designed to be an independent, high-performance, energy-efficient accelerator based on the main memory architecture to accelerate different applications. The main memory organization of DRIM is shown in Fig. 3, based on the typical DRAM hierarchy. Each mat consists of multiple computational memory sub-arrays connected to a Global Row Decoder (GRD) and a shared Global Row Buffer (GRB). According to the physical address of operands within memory, DRIM's Controller (Ctrl) is able to configure the sub-arrays to perform data-parallel intra-sub-array computations. We divide DRIM's sub-array row space into two distinct regions, as depicted in Fig. 3: 1) data rows (500 rows out of 512) that include the typical DRAM cells (Fig. 1b) connected to a regular Row Decoder (RD), and 2) computation rows (12), connected to a Modified Row Decoder (MRD), which enables the multiple-row activation required for bulk bit-wise in-memory operations between operands. Eight computational rows (x1, ..., x8) include typical DRAM cells, and four rows (dcc1, ..., dcc4) are allocated to DCCs (Fig. 1c), enabling the NOT function in every sub-array. DRIM's computational sub-array is motivated by Ambit [2], but enhanced and optimized for in-memory computation.
Figure 3: The DRIM memory organization.

The computational sub-array performs both TRA and the proposed Dual-Row Activation (DRA) mechanism, leveraging charge sharing among different rows to realize logic operations, as discussed below.

• Dual-Row Single-Cycle In-Memory X(N)OR:
With a careful observation of the existing processing-in-DRAM platforms, we realized that they are not able to efficiently handle two main functions prerequisite for accelerating a variety of applications (XNOR, addition). As a result, such platforms impose excessive latency and energy on the memory chip, which can be alleviated by rethinking the SA circuit. Our key idea is to perform in-memory XNOR2 through a DRA method that alleviates and addresses the three challenges discussed in Section 2.3. To achieve this goal, we propose a new reconfigurable SA, as shown in Fig. 4a, developed on top of the existing DRAM circuitry. It consists of a regular DRAM SA equipped with add-on circuits including three inverters and one AND gate controlled by three enable signals (EnM, Enx, EnC). This design leverages the charge-sharing feature of the DRAM cell and elevates it to implement XNOR2 logic between two selected rows through static capacitive NAND/NOR functions in a single cycle. To implement capacitor-based logic, we use two different inverters with shifted Voltage Transfer Characteristics (VTC), as shown in Fig. 4b. In this way, NAND/NOR logic can be readily carried out based on high-switching-voltage (high-Vs)/low-Vs inverters with standard high-Vth/low-Vth NMOS and low-Vth/high-Vth PMOS transistors.
Figure 4: (a) New sense amplifier design for DRIM, (b) VTC and truth table of the SA's inverters.

It is worth mentioning that utilizing low-/high-threshold-voltage transistors alongside normal-threshold transistors is an established technique, and many circuits have employed it in low-power design [22-25].

Consider that the Di and Dj operands have been RowCloned from data rows to the x1 and x2 rows, and BL and its complement BL' are precharged to ½Vdd (Precharged State in Fig. 5). To implement DRA, DRIM's ctrl first activates two WLs in the computational row space (here, x1 and x2). After charge sharing, with the control bits (EnC and Enx) set as tabulated in Table 1, the input voltage of both the low- and high-Vs inverters in the reconfigurable SA can be simply derived as Vi = (n · Vdd) / C, where n is the number of DRAM cells storing logic '1' and C represents the total number of unit capacitors connected to the inverters (i.e., 2 in the DRA method).

Table 1: Control-bit status in the Sense Amplification state.
In-memory operation       EnM   Enx   EnC
W/R, Copy, NOT, TRA        1     1     0
DRA                        0     1     1
Now, the low-Vs inverter acts as a threshold detector: it outputs '1' only when Vi = 0, realizing a NOR2 function as tabulated in the truth table in Fig. 4b. At the same time, the high-Vs inverter outputs '1' whenever Vi is below its elevated switching threshold, realizing a NAND2 function. Accordingly, the XOR2 and XNOR2 functions of the input operands are realized after the CMOS AND gate on BL and BL', respectively, in a single memory cycle, based on Equation (1):

BL = NAND(Di, Dj) · OR(Di, Dj) = Di·Dj' + Di'·Dj = Di ⊕ Dj  ⇒  BL' = Di ⊙ Dj    (1)

DRIM's reconfigurable SA is especially optimized to accelerate X(N)OR2 operations, while also supporting other memory and in-memory operations (i.e., Write/Read, Copy, NOT, and TRA).
DRIM's ctrl activates the EnM and Enx control bits simultaneously (while EnC is deactivated) to perform such operations. However, in this work, we only use Ambit's TRA mechanism to directly realize the in-memory majority function (MAJ3).
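A behavioral sketch of the DRA sense-amplifier path (illustrative Python; the threshold values are assumptions standing in for the shifted-VTC inverters): two cells share charge so that Vi = n·Vdd/C with C = 2, a low-Vs inverter detects NOR2, a high-Vs inverter detects NAND2, and the AND gate yields XOR2 on BL, with XNOR2 on the complement:

```python
# Behavioral sketch of DRIM's Dual-Row Activation (DRA) path.
VDD = 1.0

def dra_xnor2(d_i, d_j, low_vs=0.25, high_vs=0.75):
    n = d_i + d_j                       # cells storing '1'
    v_i = n * VDD / 2                   # shared voltage: 0, 1/2, or 1
    nor2 = 1 if v_i < low_vs else 0     # low-Vs inverter: '1' only at 0 V
    nand2 = 1 if v_i < high_vs else 0   # high-Vs inverter: '1' below Vdd
    xor2 = nand2 & (1 - nor2)           # NAND2 AND OR2 (OR2 = NOT NOR2)
    return 1 - xor2                     # complementary bit-line: XNOR2

for a in (0, 1):
    for b in (0, 1):
        assert dra_xnor2(a, b) == (1 - (a ^ b))
```

The single-cycle property falls out of the structure: both detector outputs are available simultaneously after one charge-sharing event, so no second activation or initialized control row is needed.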
Figure 5: Dual-Row Activation to realize XNOR2.

The transient simulation results of the DRA method realizing the single-cycle in-memory XNOR2 operation are shown in Fig. 6. We can observe how the BL voltage, and accordingly the cell's capacitor, is charged to Vdd (when DiDj = 00/11) or discharged to GND (when DiDj = 01/10) during the sense amplification state. Therefore, the DRA method effectively provides single-cycle X(N)OR logic that addresses Challenge-1 and Challenge-2 discussed in Section 2.3, by eliminating the need for multiple TRA- [2] or NOR-based [3] operations as well as row-initialization steps.

Figure 6: The transient simulation of the internal DRIM sub-array signals involved in the DRA mechanism. Glossary: Vcap-Di and Vcap-Dj represent the voltage across the two selected DRAM cells' capacitors connected to WLx1 and WLx2. P.S., C.S.S., and S.A.S. are short for Precharged State, Charge Sharing State, and Sense Amplification State, respectively.

• In-Memory Adder:
DRIM's sub-array can perform addition/subtraction (add/sub) operations quite efficiently. Assuming Di, Dj, and Dk as input operands, the carry-out (Cout) of a Full-Adder (FA) can be directly generated through MAJ3(Di, Dj, Dk) = Di·Dj + Di·Dk + Dj·Dk using the TRA method. Moreover, the Sum can be readily carried out through two back-to-back XOR2 operations based on the proposed DRA mechanism.
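The per-bit-line full adder above can be sketched functionally (illustrative Python; in hardware each bit-line computes this in parallel across an entire row):

```python
# Sketch of DRIM's in-memory full adder: Cout from one TRA (MAJ3),
# Sum from two back-to-back DRA XOR2 operations.

def maj3(a, b, c):
    return 1 if a + b + c >= 2 else 0       # TRA majority

def xor2(a, b):
    return a ^ b                            # single-cycle DRA X(N)OR

def full_add(d_i, d_j, d_k):
    c_out = maj3(d_i, d_j, d_k)
    s = xor2(xor2(d_i, d_j), d_k)           # two back-to-back XOR2
    return s, c_out

# Bit-parallel across a row: every bit-line computes independently.
row_a, row_b, carry = [1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 0, 0]
results = [full_add(a, b, c) for a, b, c in zip(row_a, row_b, carry)]
assert [s for s, _ in results] == [0, 1, 1, 0]
assert [c for _, c in results] == [1, 0, 0, 1]
```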
While DRIM is meant to be an independent, high-performance, energy-efficient accelerator, we need to expose it to programmers and system-level libraries. From a programmer's perspective, DRIM is more of a third-party accelerator that can be connected directly to the memory bus or through PCI-Express lanes, rather than a memory unit; thus it is integrated similarly to GPUs. Therefore, a virtual machine and an ISA for general-purpose parallel thread execution need to be defined, similar to PTX [26] for NVIDIA. Accordingly, programs are translated at install time to the DRIM hardware instruction set discussed here to realize the functions tabulated in Table 2. The micro and control-transfer instructions are not discussed here.
DRIM is developed based on the ACTIVATE-ACTIVATE-PRECHARGE command, a.k.a. the AAP primitive, and most bulk bitwise operations involve a sequence of AAP commands. To enable the processor to efficiently communicate with DRIM, we developed four types of AAP-based instructions that differ only in the number of activated source or destination rows:
1- AAP(src, des, size), which runs the following command sequence: 1) ACTIVATE a source address (src); 2) ACTIVATE a destination address (des); 3) PRECHARGE to prepare the array for the next access. The size of the input vectors for in-memory computation must be a multiple of the DRAM row size; otherwise, the application must pad them with dummy data. The type-1 instruction is mainly used for copy and NOT functions;
2- AAP(src, des1, des2, size): 1) ACTIVATE a source address; 2) ACTIVATE two destination addresses; 3) PRECHARGE. This instruction copies a source row simultaneously to two destination rows;
3- AAP(src1, src2, des, size), which performs the DRA method by activating two source addresses and then writing the result back to a destination address;
4- AAP(src1, src2, src3, des, size), which performs the Ambit-TRA method [2] by activating three source rows and writing the MAJ3 result back to a destination address.
For instance, to implement addition in memory, as shown in Table 2, three AAP-type-2 commands double-copy the three input data rows to computational rows (x, .., x). The Sum function is realized through two back-to-back XOR2 operations with AAP-type-3. The Cout is generated by AAP-type-4 and written back to the designated data row.
We performed a comprehensive circuit-level simulation to study the effect of process variation on both the DRA and TRA methods, considering different noise sources and variation in all components, including the DRAM cell (BL/WL capacitance and transistor, shown in Fig. 7) and the SA (width/length of transistors, Vs). We ran a Monte-Carlo simulation in Cadence Spectre with the 45nm NCSU Product Development Kit (PDK) library [27] (DRAM cell parameters were taken and scaled from Rambus [28]) under 10,000 trials and increased the amount of variation from ±0% to ±30% for each method. Table 3 shows the percentage of test errors under each variation. We observe that even considering a significant ±10% variation, the percentage of erroneous DRA operations across 10,000 trials is 0%, whereas the TRA method shows a 0.18% failure rate. Therefore, DRIM offers a solution that alleviates challenge-3 by showing an acceptable voltage margin in performing operations based on the DRA mechanism. By scaling down the transistor size, the process variation effect is expected to get worse [2, 17]. Since DRIM is mainly developed based on the existing DRAM structure and operation with slight modifications, different methods currently used to tackle process variation can also be applied to DRIM. Besides, just like Ambit, DRIM chips that fail testing due to the DRA or TRA methods can potentially be considered as regular DRAM chips, alleviating DRAM yield concerns.
Table 2: The basic functions supported by DRIM.

Func.         Operation                    Command Sequence                                        AAP Type
copy          Dr <- Di                     AAP(Di, Dr)                                             1
NOT           Dr <- NOT(Di)                AAP(Di, dcc); AAP(dcc, Dr)                              1,1
MAJ/MIN†      Dr <- MAJ(Di, Dj, Dk)        AAP(Di, x); AAP(Dj, x); AAP(Dk, x); AAP(x, x, x, Dr)    1,1,1,4
XNOR2/XOR2†   Dr <- Di XNOR Dj             AAP(Di, x); AAP(Dj, x); AAP(x, x, Dr)                   1,1,3
Add/Sub†      Sum <- Di xor Dj xor Dk,     AAP(Di, x, x); AAP(Dj, x, x); AAP(Dk, x, x);            2,2,2,3,3,1,4
              Cout <- MAJ(Di, Dj, Dk)      AAP(x, x, dcc); AAP(x, dcc, dcc); AAP(dcc, Sum);
                                           AAP(x, x, x, Cout)

† Complement functions and subtraction can be realized with dcc rows.
[Figures: DRIM chip/bank/sub-array organization with the reconfigurable SA, MRD, write driver, and Ctrl add-ons, and a qualitative comparison of processing-in-DRAM designs (Ambit, DRISA-1T1C, DRISA-3T1C, GraphiDe, DRIM-I, DRIM-II) in terms of cycles and activated rows per operation, need for dual-contact cells and row initialization, and area overhead.]
Figure 7: Noise sources in the DRAM cell. Glossary: Cwbl, Cs, and Ccross are the WL-BL, BL-substrate, and BL-BL capacitances, respectively.

Table 3: Process variation analysis (test error, %).

Variation   TRA     DRA
±5%         0.00    0.00
±10%        0.18    0.00
±15%        5.5     1.2
±20%        17.1    9.6
±30%        28.4    16.4

• Throughput:
We evaluate and compare DRIM's raw performance with conventional computing units, including a Core-i7 Intel CPU [13] and an NVIDIA GTX 1080Ti Pascal GPU [14]. There is a great deal of PIM accelerators presenting reconfigurable platforms or application-specific logic in or close to the memory die [29-49]. Due to the lack of space, we restrict our comparison to four recent processing-in-DRAM platforms, Ambit [2], DRISA-1T1C [3], DRISA-3T1C [3], and HMC 2.0 [11], handling three main bulk bit-wise operations, i.e., NOT, XNOR2, and add. To have a fair comparison, we report DRIM's and the other PIM platforms' raw throughput implemented with 8 banks of 512x256 computational sub-arrays. We further develop a 3D-stacked DRAM with 256 banks in 4GB capacity, similar to HMC 2.0, for DRIM (i.e., DRIM-S), considering its computational capability. The Intel CPU consists of 4 cores and 8 threads working with two 64-bit DDR4-1866/2133 channels. The Pascal GPU has 3584 CUDA cores running at 1.5GHz [14] with a 352-bit GDDR5X interface. The HMC has 32 vaults, each with 10 GB/s bandwidth. Accordingly, we developed an in-house benchmark to run the operations repeatedly on 16/32/64 MB input vectors and report the throughput of each platform, as shown in Fig. 8.
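As a rough intuition for the gap Fig. 8 quantifies: a bus-attached processor moves two operand bytes in and one result byte out per computed byte, while an in-situ design processes whole rows across all sub-arrays and banks in parallel. All parameters in this sketch are illustrative assumptions, not DRIM's measured configuration:

```python
# Toy throughput model: bus-limited computing vs. in-situ row-parallel
# PIM. Parameter values below are illustrative assumptions only.

def bus_limited_gbps(bus_bw_gbps):
    """Result bytes/s when each computed byte costs 3 bytes of traffic
    (two operands in, one result out)."""
    return bus_bw_gbps / 3

def pim_gbps(row_bytes, subarrays_per_bank, banks, op_ns):
    """Result bytes/s when every sub-array in every bank produces one
    full row per in-memory operation of latency op_ns."""
    bytes_per_op = row_bytes * subarrays_per_bank * banks
    return bytes_per_op / op_ns  # bytes per ns equals GB/s

# e.g. dual-channel DDR4 at an assumed 34 GB/s vs. an assumed
# 8 banks x 64 sub-arrays x 2 KB rows at a 100 ns operation latency:
bus = bus_limited_gbps(34.0)        # ~11.3 GB/s of results
pim = pim_gbps(2048, 64, 8, 100.0)  # ~10486 GB/s of results
```

However crude, the model shows why throughput scales with the number of simultaneously activated rows rather than with bus width, which is the effect the figure measures.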
Figure 8: Throughput (GOps/s, Y: log scale) of different platforms for NOT, XNOR2, and add over 16/32/64 MB vectors.
We observe that: 1) Either the external or the internal DRAM bandwidth limits the throughput of the CPU, GPU, and even HMC platforms. However, HMC outperforms the CPU and GPU with ~x and 6.5x higher performance on average for bulk bit-wise operations. Besides, the PIM platforms achieve remarkable throughput compared to Von-Neumann computing systems (CPU/GPU) by unblocking the data-movement bottleneck. Regular DRIM (DRIM-R) shows on average 71x and 8.4x better throughput compared to the CPU and GPU, respectively. 2) While DRIM-R, Ambit, and the DRISA platforms achieve almost the same performance on the bulk bit-wise NOT function, DRIM-R outperforms the other PIMs in performing X(N)OR2-based operations. Our platform improves the throughput by 2.3x, 1.9x, and 3.7x compared with Ambit [2], DRISA-1T1C [3], and DRISA-3T1C [3], respectively. 3) DRIM-S can boost the throughput of the HMC by 13.5x. To sum up, DRIM's DRA mechanism effectively addresses challenge-1 by providing high-throughput bulk bit-wise X(N)OR-based operations.

• Energy: We estimate the energy that the DRAM chip consumes to perform the three bulk bit-wise operations per kilobyte for DRIM, Ambit [2], DRISA-1T1C [12], and the CPU. (This energy does not include the energy the processor consumes to perform the operation.) Note that other operations such as AND2/NAND2 and OR2/NOR2 in DRIM can be built on top of the TRA method with almost the same energy consumption as Ambit. Fig. 9 shows that DRIM achieves 2.4x and 1.6x energy reduction over Ambit [2] and DRISA-1T1C [12], respectively, when performing the bulk bit-wise XNOR2 operation. Besides, compared with copying data through the DDR4 interface, DRIM reduces the energy by 69x. As for the bit-wise in-memory add operation, DRIM outperforms Ambit, DRISA-1T1C, and the CPU with ~x, 1.7x, and 27x reductions in energy consumption, respectively.

Figure 9: Energy consumption of different platforms (Y: log scale).

• Area:
To assess the area overhead of DRIM on top of a commodity DRAM chip, four hardware cost sources must be taken into consideration. First, the add-on transistors to the SAs; in our design, each SA requires 22 additional transistors connected to each BL. Second, two rows of DCCs, each with two associated WLs; based on the estimation in [18], each DCC row imposes roughly one transistor per BL over a regular DRAM cell. Third, the 4:12 MRD overhead (originally 4:16); we modify each WL driver by adding two more transistors to the typical buffer chain, as depicted in Fig. 4a. Fourth, the Ctrl's overhead to control the enable bits; the ctrl generates the activation bits with MUX units of 6 transistors. To sum up, DRIM roughly imposes 24 DRAM rows per sub-array, which can be interpreted as ~ .3% of the DRAM chip area.

• Virtual Memory:
DRIM has its own ISA, with operations that can potentially use virtual addresses. To use virtual addresses, DRIM's ctrl must be able to translate virtual addresses to physical addresses. While in theory this looks as simple as passing the address of the page-table root to DRIM and giving DRIM's ctrl the ability to walk the page table, it is far more complicated in real-world designs. The main challenge is that the page table can be scattered across different DIMMs and channels, while DRIM operates within a single memory module. Furthermore, page-table coherence issues can arise. The other way to implement translation capabilities for DRIM is through memory-controller pre-processing of the instructions being written to DRIM's instruction registers. For instance, if the programmer writes the instruction AAP(src, des, 256), the memory controller intercepts the virtual addresses and translates them into physical addresses. Note that most systems have near-memory-controller translation capabilities, mainly to manage IOMMU and DMA accesses from I/O devices. One issue that can arise is that some operations are appropriate only if the resulting physical addresses fall within a specific plane, e.g., within the same bank. Accordingly, the compiler and the OS should work together to ensure that the operands of commands result in physical addresses that suit the operation type.

• Memory Layout and Interleaving:
While high-performance memory systems rely on channel interleaving to maximize memory bandwidth, DRIM adopts a different approach: maximizing spatial locality and allocating memory as close to the corresponding operands as possible. The main goal is to reduce data movement across memory modules and hence reduce operation latency and energy costs. As exposing a programmer directly to the memory layout is challenging, the DRIM architecture can rely on compiler passes that take the memory layout and the program as input, then assign physical addresses adequate for each operation without impacting the semantics of the application.

• Reliability:
Many ECC-enabled DIMMs rely on computing a Hamming code at the memory controller and using it to correct soft errors. Unfortunately, such a feature is not available for DRIM, as the data being processed is not visible to the memory controller. Note that this issue is common across all PIM designs. To overcome it, DRIM can potentially augment each row with additional ECC bits that are calculated and verified at the memory-module or bank level. Augmenting DRIM with reliability guarantees is left as future work.

• Cache Coherence:
When DRIM updates data directly in memory, there could be stale copies of the updated memory locations in the cache, so data-inconsistency issues may arise. Similarly, if the processor updates cached copies of memory locations that DRIM will later process, DRIM could actually use wrong/stale values. There are several ways to solve such issues for off-chip accelerators; the most common is to rely on the operating system (OS) to unmap the physical pages accessible by DRIM from any process that can run while DRIM is computing.

CONCLUSION

In this work, we presented DRIM, a high-throughput and energy-efficient PIM architecture that addresses some of the existing issues in state-of-the-art DRAM-based acceleration solutions for performing bulk bit-wise X(N)OR-based operations, i.e., limited throughput, row initialization, and reliability concerns, while incurring less than 10% overhead on top of a commodity DRAM chip.
REFERENCES
[1] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: a novel processing-in-memory architecture for neural network computation in reram-based main memory,” in
ACM SIGARCH Computer Architecture News , vol. 44,no. 3. IEEE Press, 2016, pp. 27–39.[2] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch,O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory accelerator forbulk bitwise operations using commodity dram technology,” in
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 273–287. [3] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Drisa: A dram-based reconfigurable in-situ accelerator,” in
Proceedings of the 50th Annual IEEE/ACMInternational Symposium on Microarchitecture . ACM, 2017, pp. 288–301.[4] S. Angizi, Z. He, F. Parveen, and D. Fan, “Rimpa: A new reconfigurable dual-modein-memory processing architecture with spin hall effect-driven domain wallmotion device,” in
VLSI (ISVLSI), 2017 IEEE Computer Society Annual Symposiumon . IEEE, 2017, pp. 45–50.[5] S. Angizi, Z. He, A. Awad, and D. Fan, “Mrima: An mram-based in-memoryaccelerator,”
IEEE Transactions on Computer-Aided Design of Integrated Circuitsand Systems , 2019.[6] S. Angizi, Z. He, and D. Fan, “Dima: a depthwise cnn in-memory accelerator,”in .IEEE, 2018, pp. 1–8.[7] S. Angizi, Z. He, F. Parveen, and D. Fan, “Imce: Energy-efficient bit-wise in-memory convolution engine for deep neural network,” in
Design AutomationConference (ASP-DAC), 2018 23rd Asia and South Pacific . IEEE, 2018, pp. 111–116.[8] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das,“Compute caches,” in . IEEE, 2017, pp. 481–492.[9] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw,and R. Das, “Neural cache: Bit-serial in-cache acceleration of deep neural net-works,” arXiv preprint arXiv:1805.03718 , 2018.[10] G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, and H. Yang,“Graphh: A processing-in-memory architecture for large-scale graph processing,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems et al. , “Latest advances and roadmap forin-plane and perpendicular stt-ram,” in
Memory Workshop (IMW), 2011 3rd IEEEInternational
IEEE Computer architecture letters , vol. 15, no. 1, pp. 45–49, 2016.[16] V. Seshadri, K. Hsieh, A. Boroum, D. Lee, M. A. Kozuch, O. Mutlu, P. B. Gib-bons, and T. C. Mowry, “Fast bulk bitwise and and or in dram,”
IEEE ComputerArchitecture Letters , vol. 14, no. 2, pp. 127–131, 2015.[17] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo,O. Mutlu, P. B. Gibbons, M. A. Kozuch et al. , “Rowclone: fast and energy-efficientin-dram bulk data copy and initialization,” in
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2013, pp. 185–197. [18] H. B. Kang and S. K. Hong, “One-transistor type dram,” Apr. 20, 2010, US Patent 7,701,751. [19] S.-L. Lu, Y.-C. Lin, and C.-L. Yang, “Improving dram latency with dynamic asymmetric subarray,” in . IEEE, 2015, pp. 255–266. [20] G. Sideris, “Intel 1103-mos memory that defied cores,”
Electronics , vol. 46, no. 9,pp. 108–113, 1973.[21] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang, “Dracc: a dram basedaccelerator for accurate cnn inference,” in
Proceedings of the 55th Annual DesignAutomation Conference . ACM, 2018, p. 168.[22] M. W. Allam, M. H. Anis, and M. I. Elmasry, “High-speed dynamic logic stylesfor scaled-down cmos and mtcmos technologies,” in
Proceedings of the 2000international symposium on Low power electronics and design . ACM, 2000, pp.155–160.[23] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, “1-vpower supply high-speed digital circuit technology with multithreshold-voltagecmos,”
IEEE Journal of Solid-state circuits , vol. 30, no. 8, pp. 847–854, 1995.[24] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano,M. Norishima, M. Murota, M. Kako et al. , “A 0.9-v, 150-mhz, 10-mw, 4 mm/sup 2/,2-d discrete cosine transform core processor with variable threshold-voltage (vt)scheme,”
IEEE Journal of Solid-State Circuits , vol. 31, no. 11, pp. 1770–1779, 1996.[25] K. Navi, V. Foroutan, M. R. Azghadi, M. Maeen, M. Ebrahimpour, M. Kaveh, andO. Kavehei, “A novel low-power full-adder cell with new technique in designinglogical gates based on static cmos inverter,”
Microelectronics Journal
IEEETransactions on Computer-Aided Design of Integrated Circuits and Systems , vol. 37,no. 9, pp. 1788–1801, 2018.[30] M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A hardware accel-erator for combinatorial optimization and deep learning,” in
High PerformanceComputer Architecture (HPCA), 2016 IEEE International Symposium on . IEEE,2016, pp. 1–13.[31] S. Angizi, Z. He, A. S. Rakin, and D. Fan, “Cmp-pim: an energy-efficientcomparator-based processing-in-memory neural network accelerator,” in
Pro-ceedings of the 55th Annual Design Automation Conference . ACM, 2018, p.105.[32] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memoryaccelerator for parallel graph processing,”
ACM SIGARCH Computer ArchitectureNews , vol. 43, no. 3, pp. 105–117, 2016.[33] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions: a low-overhead,locality-aware processing-in-memory architecture,” in
Computer Architecture(ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on . IEEE, 2015,pp. 336–348.[34] B. Akin, F. Franchetti, and J. C. Hoe, “Data reorganization in memory using3d-stacked dram,” in
Computer Architecture (ISCA), 2015 ACM/IEEE 42nd AnnualInternational Symposium on . IEEE, 2015, pp. 131–143.[35] R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno, R. Murphy, R. Nair, andS. Swanson, “Near-data processing: Insights from a micro-46 workshop,”
IEEEMicro , vol. 34, no. 4, pp. 36–42, 2014.[36] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, K. Hsieh, K. T. Malladi,H. Zheng, and O. Mutlu, “Lazypim: An efficient cache coherence mechanismfor processing-in-memory,”
IEEE Computer Architecture Letters , vol. 16, no. 1, pp.46–50, 2017.[37] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “Nda: Near-dramacceleration architecture leveraging commodity dram devices and standardmemory modules,” in
High Performance Computer Architecture (HPCA), 2015 IEEE21st International Symposium on . IEEE, 2015, pp. 283–295.[38] Q. Guo, T.-M. Low, N. Alachiotis, B. Akin, L. Pileggi, J. C. Hoe, and F. Franchetti,“Enabling portable energy efficiency with memory accelerated library,” in
Mi-croarchitecture (MICRO), 2015 48th Annual IEEE/ACM International Symposiumon . IEEE, 2015, pp. 750–761.[39] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, andO. Mutlu, “Accelerating pointer chasing in 3d-stacked memory: Challenges,mechanisms, evaluation,” in
Computer Design (ICCD), 2016 IEEE 34th InternationalConference on . IEEE, 2016, pp. 25–32.[40] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube:A programmable digital neuromorphic architecture with high-density 3d mem-ory,” in
Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual InternationalSymposium on . IEEE, 2016, pp. 380–392.[41] R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C.-Y. Cher,C. H. Costa, J. Doi, C. Evangelinos et al. , “Active memory cube: A processing-in-memory architecture for exascale systems,”
IBM Journal of Research andDevelopment , vol. 59, no. 2/3, pp. 17–1, 2015.[42] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu,and C. R. Das, “Scheduling techniques for gpu architectures with processing-in-memory capabilities,” in
Parallel Architecture and Compilation Techniques (PACT),2016 International Conference on . IEEE, 2016, pp. 31–44.[43] S. H. Pugsley, J. Jestes, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu,A. Davis, and F. Li, “Comparing implementations of near-data computing within-memory mapreduce workloads,”
IEEE Micro , vol. 34, no. 4, pp. 44–52, 2014.[44] P. Trancoso, “Moving to memoryland: in-memory computation for existing ap-plications,” in
Proceedings of the 12th ACM International Conference on ComputingFrontiers . ACM, 2015, p. 32.[45] X. Tang, O. Kislal, M. Kandemir, and M. Karakoy, “Data movement aware com-putation partitioning,” in
Proceedings of the 50th Annual IEEE/ACM InternationalSymposium on Microarchitecture . ACM, 2017, pp. 730–744.[46] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski,“Top-pim: throughput-oriented programmable processing in memory,” in
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2014, pp. 85–98. [47] A. Akerib and E. Ehrman, “Non-volatile in-memory computing device,” May 14, 2015, US Patent App. 14/588,419. [48] S. Angizi, Z. He, and D. Fan, “Pima-logic: a novel processing-in-memory architecture for highly flexible and energy-efficient logic computation,” in
Proceedingsof the 55th Annual Design Automation Conference . ACM, 2018, p. 162.[49] F. Parveen, S. Angizi, Z. He, and D. Fan, “Imcs2: Novel device-to-architectureco-design for low-power in-memory computing platform using coterminousspin switch,”