Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators
Benjamin Y. Cho, Jeageun Jung, and Mattan Erez
The University of Texas at Austin
{bjcho, jeageunjung, mattan.erez}@utexas.edu

Abstract—DL inference queries play an important role in diverse internet services and a large fraction of datacenter cycles are spent on processing DL inference queries. Specifically, the matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that the GEMM operations for datacenter DL inference tasks are memory-bandwidth bound, contrary to common assumptions: (1) strict query latency constraints force small-batch operation, which limits reuse and increases bandwidth demands; and (2) large and colocated models require reading the large weight matrices from main memory, again requiring high bandwidth without offering reuse opportunities. We demonstrate the large potential of accelerating these small-batch GEMMs with processing in the main CPU memory. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU that would otherwise destroy locality. Our evaluation of StepStone variants at the channel, device, and within-device PIM levels, along with optimizations that balance parallelism benefits with data-distribution overheads, demonstrates better minimum latency than a CPU and greater throughput for strict query latency constraints. End-to-end performance analysis of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU and prior main-memory acceleration approaches.
I. INTRODUCTION
With the evolution of deep learning (DL), artificial intelligence is being widely used in many internet services. We describe a new approach for reducing the latency of such DL inference tasks by accelerating their fully-connected layers with a processing in/near memory (PIM) approach. Park et al. [35] report that for important personalized recommendation and natural-language DL inference workloads, a large fraction of DL-related datacenter cycles (42%) are spent executing fully-connected (FC) layers in Facebook data centers. FC layers are executed as matrix-matrix multiplication operations
(commonly referred to as GEMM kernels) and these GEMMs dominate the overall execution time of some workloads [15], [35]. GEMMs are commonly considered compute rather than bandwidth bound based on decades of scientific-computing and DL training experience. However, we observe that DL inference GEMMs exhibit two unique traits that leave them memory-bandwidth bound in many cases, and thus amenable to PIM acceleration.

First, DL inference queries require small-batch execution to meet tight latency constraints, leading to very tall/skinny or short/fat activation matrices. Such matrices offer lower locality, increasing the importance of memory bandwidth. Second, some recommender and language models have billions of parameters (across numerous layers) and it is common for multiple models to be colocated on a single node to improve system efficiency and reduce multi-model query latency [16], [20], [33], [42]. As a result, it is common for the larger weight matrices to reside only in main memory, stressing the memory channel when executing on a CPU and often requiring low-bandwidth host-device transfers in systems with accelerators. Our experiments demonstrate that these GEMM operations are in fact bandwidth-bound on both CPU and GPU systems, and we describe how they can be accelerated with processing in/near main memory (PIM).

We present
StepStone PIM, which is integrated within the CPU main-memory system and solves the dual challenges of utilizing available GEMM locality and sharing data with the CPU under its sophisticated XOR-based DRAM address-mapping scheme. Hence, StepStone is an appealing datacenter solution because it: (1) better utilizes bandwidth within the memory system; (2) utilizes locality, enabling high performance and efficiency for datacenter DL-inference GEMM operations; (3) does not require additional memory devices or capacity, avoiding the exorbitant cost of additional memory and taking advantage of the already memory-resident matrices; and (4) offloads a low-performance workload from the CPU, freeing additional execution capacity for colocated tasks.

This unique set of StepStone capabilities is, to the best of our knowledge, not available in any prior PIM architecture and research, including in recent work that targets datacenter DL inference or processing in main memory. While recent work explored PIM acceleration for datacenter DL inference, it focuses on the embedding layers of DL inference [20], [25] rather than on the MLP GEMM operations, which require a different approach for exploiting locality. Prior work that considers integrating PIM accelerators within main memory either requires costly data replication to avoid the DRAM address-mapping challenge [4], [5], [12] or does not offer the mechanisms to exploit GEMM locality [3], [9], [20], [23].

We choose a straightforward PIM microarchitecture for StepStone that follows recent research trends. Our contributions instead lie with four key innovations. The first is the StepStone PIM GEMM parallelization and execution flow that is cognizant of the XOR-based DRAM address mapping that otherwise breaks GEMM locality. The second contribution accelerates the localization and reduction operations of the execution flow without consuming CPU core resources. The third contribution enables long-running, locality-conserving PIM GEMM kernels with the new StepStone memory-side address-generation logic. Long-running kernels relieve PIM pressure on the memory command channel, enabling high-performance colocated CPU tasks.

The fourth contribution is identifying and exploiting a new tradeoff opportunity in balancing the performance benefits of parallelization across fine-grained PIM units (PIMs) within DRAM with the data-transfer overheads of the localization/replication and reduction operations necessary for high parallelization. We explore this tradeoff by evaluating channel-, device-, and bank group-level StepStone PIMs.

To summarize our contributions:

• We identify and demonstrate that small-batch GEMM operations of DL datacenter inference workloads are bandwidth bound on CPUs and GPUs, and can hence benefit from PIM acceleration (Section II).

• We develop the novel StepStone PIM GEMM execution flow that is cognizant of the complex CPU address mapping, thus exploiting GEMM locality and improving performance over a prior PIM architecture that supports complex address mappings [9].

• We accelerate the localization and reduction operations of our new GEMM flow at the CPU memory controller to further improve performance.

• We design the novel memory-side StepStone address generator that enables long-running GEMM kernels to minimize command-channel usage, which improves PIM performance when the CPU executes concurrent memory-intensive tasks.
• We identify a new tradeoff opportunity in determining whether to target channel-, device-, or bank group-level PIMs and show the benefits of exploiting it.

• We present a detailed StepStone PIM evaluation, including end-to-end performance analysis, and conclude that StepStone is an appealing datacenter solution because of its low cost (no additional memory devices or capacity), its potential for lower latency and higher throughput, and its ability to dynamically support the execution of larger-batch and colocated tasks on the CPU.

Combining all our innovative mechanisms, StepStone is able to substantially outperform a CPU when executing GEMM operations on matrices with dimensions typical of datacenter DL-inference workloads: (1) StepStone offers lower minimum GEMM latency for these matrices; (2) higher throughput under the strictest latency constraints, which correspond to batch-1 on the CPU (though if the CPU is allowed 20% additional latency for batch-32 execution, the performance benefit shrinks); and (3) lower end-to-end DL inference latency compared to measured CPU performance.

II. MOTIVATION AND CHALLENGES
Bandwidth-bound GEMMs.
Matrix-matrix multiplication (GEMM) is commonly regarded as compute bound. However, we observe that GEMM becomes bandwidth bound and exhibits low CPU/GPU utilization when both: (1) one of the two input matrices is much larger than the other (e.g., A is large while B is “tall and skinny”) and (2) the large input matrix is in main memory. While rare in traditional linear-algebra applications, DL inference tasks in datacenters often meet both conditions.

First, DL inference queries have tight latency constraints that require small batches [35]. The corresponding GEMM operations in fully-connected layers therefore multiply a large weight matrix and a small input matrix. Second, the MLP weights are often only found in main memory because either the total size of the MLP parameters exceeds cache capacity (e.g., in recent language models [7], [21], [37]) and/or multiple models are colocated on a single node [16].

The resulting matrix sizes (Table I) are executed inefficiently on CPUs and GPUs, as shown by the roofline analysis presented in Figure 1. Each point in the figure corresponds to the performance measured on a 2.7GHz 28-core Intel Cascade Lake Xeon CPU or an NVIDIA Titan Xp GPU when multiplying a memory-resident 1024 × 4096 matrix by a 4096 × N matrix, where N represents the batch size. The left-most point for each system is when N = 1 and each point moving right represents a doubling of N. We observe that all three configurations are bandwidth bound for inference-appropriate (small) batch sizes. Further, for such small batches, GPU performance is lower than the CPU's if matrix A is in host memory because of the slow host-device bus. We conclude that processing in/near memory (PIM) is appealing for these GEMM operations of datacenter DL-inference workloads.
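To make the bandwidth-bound claim concrete, the following back-of-the-envelope sketch (our own; the peak-rate figures are illustrative assumptions, not measurements) computes the operational intensity of an M × K by K × N GEMM and compares it against a machine-balance point:

    # Operational intensity of a weight-stationary GEMM: A (M x K) times
    # B (K x N), fp32. Peak rates below are illustrative assumptions.
    def gemm_intensity(M, K, N, bytes_per_elem=4):
        flops = 2 * M * K * N  # one multiply + one add per MAC
        traffic = bytes_per_elem * (M * K + K * N + M * N)  # A, B, C each touched once
        return flops / traffic

    peak_flops = 2.4e12   # assumed CPU peak, flops/s
    peak_bw = 80e9        # assumed main-memory bandwidth, bytes/s
    balance = peak_flops / peak_bw  # intensity needed to become compute bound

    for N in (1, 4, 16, 64, 256):
        oi = gemm_intensity(1024, 4096, N)
        kind = "compute" if oi > balance else "bandwidth"
        print(f"N={N:3d}: {oi:6.2f} flops/byte -> {kind}-bound")

For small N the intensity grows roughly as N/2 flops/byte because the M × K weight traffic dominates, so small-batch inference sits far below the balance point of a modern CPU or GPU.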
PIM GEMMs with XOR-based DRAM address mapping.
We target systems in which main memory is PIM-enabled, implying a shared DRAM address space with the CPU. The CPU relies on sophisticated XOR-based DRAM address mappings to exploit bank and channel parallelism by distributing consecutive cache blocks in the physical address space (PA) across DRAM banks and channels. As a result, matrices that are contiguous in the application virtual space are distributed across PIM units (PIMs) in complex patterns.
TABLE I: Common DL-inference GEMM dimensions.

  Model   Description    Weights      Batch Size
  LM      BERT MLP       1024 × …     …
  RM      Top MLP        … × 512      1–256 [35]
  RM      Bottom MLP     512 × …      …
Fig. 1: CPU (Intel Xeon Platinum 8280) and GPU (NVIDIA Titan Xp) roofline modeling when executing bandwidth-bound GEMM operations on a memory-resident 1024 × 4096 matrix; N is swept in powers of two moving from left to right. Series: CPU (weight in main memory), GPU (weight in device memory), and GPU (weight in main memory), which incurs PCIe data-loading overhead.

Effective GEMM execution requires exploiting locality and reuse in matrix blocks, which are challenging to identify. Figure 2 illustrates this challenge for the toy address mapping of Figure 2a, targeting a system with 4 PIM units (one per rank). Addresses refer to elements of the large matrix shown in Figure 2b, which is laid out row-major in contiguous memory addresses. Logical blocks of the matrix do not form blocks within each PIM. For example, the element of the weight matrix marked in black is mapped to PIM0 and is multiplied by elements p and q from the input tensor to modify elements x and y of the output tensor. These same elements of the input tensor are also needed when reading the other weight-matrix elements in the same columns that map to PIM0, and the same two output-tensor elements are needed when reading the other weights in the same rows. Utilizing this locality requires the PIMs to correctly map between contiguous DRAM addresses within each PIM and the corresponding addresses of the input and output tensors.

Prior approaches to this challenge fall into one of three categories. The first avoids the challenge altogether by maintaining a copy of the data that is stored in a PIM-friendly layout and not accessed by the CPU [17], [22], [25]. This either duplicates substantial data arrays (possibly many GiB) [7], [32], [37] or prevents the CPU from assisting with requests that can tolerate higher response latency [15]. Furthermore, a different layout is needed for channel-, device-, and bank group-level PIMs. This either forces even more replicas or prohibits optimizations that dynamically choose the PIM level based on input characteristics (e.g., as in the XLM language model [10]).

The second approach requires the CPU to transfer this correspondence information to the PIMs for each cache block the PIM processes [3], [23]. The CPU sends special DRAM command packets that include operand and opcode information to each PIM unit, and all the transactions related to PIM execution are controlled by the host. PIMs are isolated from the address-mapping concerns, but performance scalability is poor because: (1) channel bandwidth for sending PIM command packets saturates rapidly, (2) CPU resources are required for address generation, and (3) the frequent PIM command packets severely interfere with CPU memory traffic [9].

The third approach, proposed by Cho et al. [9], aligns long vector PIM operands in memory such that all kernel operands follow the same interleaving pattern after the XOR address mapping. In this way, both the CPU and the vector-oriented PIM can process the same data. However, this vector-oriented approach cannot exploit the GEMM kernel locality. Vector-oriented execution splits the GEMM into multiple matrix-vector (GEMV) operations, requiring a larger number of kernel invocations. The straightforward implementation also requires copies across PIM units to ensure all data is local. The standalone (non main-memory) Newton PIM accelerator [17] also follows this approach. We observe that a different execution flow can be used to block both the input and output matrices to reduce copy overhead. We explain our StepStone PIM GEMM in the following section.
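Before turning to our design, the scattering effect that any such flow must handle can be reproduced in a few lines. This is our own toy construction in the spirit of Figure 2a; the chosen bit positions are illustrative assumptions, not the figure's exact mapping:

    # Toy XOR-based mapping: a 2-bit PIM (rank) ID, each bit the XOR
    # (parity) of two address bits, in the style of Figure 2a.
    def pim_id(addr):
        bit = lambda i: (addr >> i) & 1
        return ((bit(3) ^ bit(5)) << 1) | (bit(2) ^ bit(4))  # 4 PIM units

    # Treat each element of an 8 x 16 row-major matrix as one address unit.
    for row in range(4):
        print(f"row {row}:", [pim_id(row * 16 + col) for col in range(16)])
    # Consecutive elements of a row scatter across PIMs, so the slice owned
    # by one PIM unit is a complex, strided pattern rather than a tile.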
III. STEPSTONE PIM
StepStone PIM enables independent GEMM execution with PIMs under any XOR-based memory-system address mapping. In StepStone PIM, the weight matrix is partitioned and assigned to PIMs based on the underlying address mapping, maintaining internal contiguity and enabling temporal locality when each PIM unit works on its GEMM blocks. From the CPU perspective, the PIMs appear to skip within the address space and only step on those “stones” (i.e., cache blocks) that are physically local to them.
A. StepStone Architecture
The StepStone PIM architecture is, for the most part, a standard PIM. The innovation lies in how we map GEMM operations and in our unique address-generation algorithm, both discussed later in this section. StepStone is comprised of a host-side PIM controller that interfaces with PIM units (PIMs) through the memory channel to control their operation using memory-mapped PIM-side registers. As shown in Figure 3a, PIM units (PIMs) can be integrated with: (1) DRAM itself, e.g., at the bank-group level (StepStone-BG); (2) a memory module, e.g., associated with each DRAM device or buffer chip (StepStone-DV); and/or (3) a memory channel controller (StepStone-CH). We consider all three integration levels. Note that StepStone-BG accounts for device-level timing parameters, such as tRCD and tFAW, using control logic at the I/O port of each device.

Each PIM unit (Figure 3b) includes SIMD/vector units, scratchpad memory, interfaces, control logic to execute the GEMM kernel command (sub-GEMM, to be more precise), and a local address-generation unit. The pipeline is sufficiently deep to hide address-generation and access latencies (20 stages in our case). When N (e.g., the batch dimension) is large, performance is bound by the SIMD width. While wide SIMD units are desirable, arithmetic performance must be balanced with area and power constraints. Following prior work, we aim for a small fixed area budget for each StepStone-BG unit [14] and for StepStone-DV [5], a cheap rank-level PIM with no inter-device communication.
Fig. 2: An example of a bandwidth-bound GEMM operation with PIM and a toy XOR-based address mapping: (a) toy XOR-based physical-to-DRAM address mapping, where addresses refer to contiguous row-major matrix elements; (b) layout of an 8 × 16 matrix with colors indicating the element → PIM unit mapping; (c) example system with rank-level PIMs.
Fig. 3: Overview of the StepStone PIM system: (a) baseline PIM system; (b) StepStone PIM unit architecture.

We aim for a fixed ratio between SIMD area and scratchpad area and assume the additional logic is of comparable size to the scratchpad. We estimate functional-unit and scratchpad area and power at the device level with the values reported for iPIM [14], and at the device and channel levels following the methodology of Lym et al. [29]. This analysis yields nominal values of 8-wide SIMD with 8KB scratchpad capacity for each StepStone-BG unit (4 PIMs per DRAM device), and 32-wide SIMD with 32KB scratchpad capacity per StepStone-DV PIM unit. For StepStone-CH, we keep the same bandwidth to arithmetic performance ratio as StepStone-DV: 256-wide SIMD units and
256KB scratchpad capacity. We consider all three cases and conclude that StepStone-CH offers the lowest performance and requires the largest die area. One other component is the replication/reduction unit within the PIM controller, which is used to accelerate the distribution of matrix B and the reduction of partial values of C, both required for the GEMM execution described below.
B. StepStone GEMM Execution
GEMM execution starts with the large weight matrix A stored contiguously in the virtual and physical address spaces in row-major order. Therefore, A is distributed across memory devices based on the DRAM address mapping, as shown in Figure 4 (for the Intel Skylake address mapping [36] on StepStone-BG, depicting only the elements of A that map to PIM0, which we refer to as partition A0). A0 must be multiplied with elements of B and accumulated into elements of C. To maximize parallelism, we first localize private copies of B and C into each PIM unit, also shown in the figure. Localizing data copies the data into a pre-allocated per-PIM-unit memory region. Execution then proceeds with a partial dot product across rows of A0 with columns of B (B is shown transposed in the figure).

However, recall the dual challenges of identifying the corresponding indices in B and C as A0 is advanced while maximizing reuse during execution. We address these challenges by grouping together cacheline-sized memory blocks (“cache blocks”) into block groups that follow the same DRAM address-mapping distribution. We note that the grouping depends both on the address mapping and the matrix dimension. Each block group is shaded a different color in Figure 4b.
StepStone locality.
To maximize reuse, each element of B should be multiplied with as many elements of A as possible before it is overwritten in the buffer. We achieve this by executing one block group at a time: cache blocks within each group across rows reuse elements of C while those along columns reuse elements of B. No reuse exists between groups.

The number of groups required to maximize locality is determined by the number of PIM ID bits that are impacted by addresses within the matrix. For example, the matrix in Figure 4 is 16 × 512 4-byte words and starts at physical address 0, thus locations within this matrix span the lower 15 address bits. Within these bits, bits 7 and 14 determine one bank-group bit (BG0, which is also PIM ID bit 0) and bits 8, 9, 12, and 13 affect the channel bit (PIM ID bit 3). The other PIM ID bits are fixed for all locations within this matrix. (We assume that the matrix dimensions are powers of two; matrices with non-power-of-two dimensions are either padded or execution is partitioned/serialized into smaller, power-of-two matrices.) We further note that a group spans entire rows to maximize locality. We therefore exclude address bits associated with matrix columns (MCOL) from defining the group ID bits (GP0 and GP1 in the figure).

Fig. 4: Overview of GEMM execution with StepStone PIM: (a) PIM and group IDs under the Skylake address mapping, where MROW and MCOL are the row and column index bits of matrix A; (b) an example of block grouping for PIM0 ([M, K, N] = [16, 512, 4]), where each square represents 16 32-bit words (one cache block); (c) the StepStone address-generation mechanisms (instant correction and carry forwarding).
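As a software illustration of these ID computations, the sketch below derives the PIM ID and group ID as parities of address-bit sets for the 16 × 512 example. Only the two ID bits quoted above are modeled; the remaining Skylake ID bits and the exact group-bit positions are simplifying assumptions:

    # Parity of the address bits in `bits` gives one XOR-mapped ID bit.
    def parity(addr, bits):
        p = 0
        for b in bits:
            p ^= (addr >> b) & 1
        return p

    BG0     = (7, 14)         # PIM ID bit 0 (bank group), per the text
    CHANNEL = (8, 9, 12, 13)  # PIM ID bit 3 (channel), per the text

    def pim_id(addr):
        return (parity(addr, CHANNEL) << 3) | parity(addr, BG0)

    # Group ID: ID-affecting bits above the matrix-column (MCOL) field,
    # so a group spans whole rows; bit positions here are illustrative.
    def group_id(addr, gp_bits=(12, 13, 14)):
        return sum(((addr >> b) & 1) << i for i, b in enumerate(gp_bits))

    # Rows of the 16 x 512 fp32 matrix (2KB per row) that share a
    # (pim_id, group_id) pair form one block group for that PIM.
    for r in range(4):
        base = r * 512 * 4
        print(f"row {r}: PIM {pim_id(base)}, group {group_id(base)}")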
Localizing matrices B and C.
Matrix B is (partially) replicated and localized to the different PIMs before execution begins, and the localized partial-result C matrices are reduced after the GEMM. The replication and reduction, along with data reorganization for spatial locality within each PIM unit, are handled by the host and accelerated with a simple DMA engine at the PIM controller. The operation of this engine is illustrated in Figure 5 for localizing matrix B for a portion of matrix A that is distributed across PIMs 0, 1, 8, and 9. Matrix B is again represented transposed in the figure and consecutive elements in each of its rows appear as columns (e.g., a0–a3). During localization, the engine reorganizes the input matrix for each PIM unit such that accesses are sequential during its group-based execution. The outer-most loop iterates over columns of A, localizing the rows of B (which appear as columns in B^T) needed for each column in the PIMs and block groups it maps to. The PIM and group IDs are computed based on the mappings illustrated in Figure 4a. Each cache block of B is read once and then copied to all its relevant PIM-local addresses. Reductions follow a similar execution flow; a software sketch of the localization loop follows below.
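The sketch models the engine's loop structure only, under stated assumptions: `pim_of` and `group_of` are callables implementing the XOR mapping (e.g., the parity-based functions sketched earlier), and it works per element for clarity while the hardware engine operates at cache-block granularity, reading each block of B exactly once:

    # Conceptual model of B-matrix localization (replication); names and
    # granularity are illustrative, not the exact hardware interface.
    def localize_B(A_base, M, K, B_rows, pim_of, group_of, elem_size=4):
        local_B = {}  # (pim, group) -> rows of B in consumption order
        for k in range(K):                 # outer loop: columns of A
            targets = set()
            for m in range(M):             # PIM/group pairs that read column k
                addr = A_base + (m * K + k) * elem_size
                targets.add((pim_of(addr), group_of(addr)))
            for t in sorted(targets):      # replicate row k of B to each target
                local_B.setdefault(t, []).append(B_rows[k])
        return local_B

A reduction of the localized C copies walks similar (PIM, group) targets, summing partial results instead of replicating.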
Fig. 5: Input-matrix reorganization.

C. Overall Execution Flow of StepStone GEMM

The execution flow of the complete GEMM is shown in Algorithm 1. After localization, the input matrices are all aligned in the DRAM accessible by each PIM unit and execution proceeds in group order. A slight complication arises when
A is very large such that not all elements of B and C that correspond to a group can be buffered within the scratchpad. In such cases, to still utilize locality, we further block each group. This blocking can be across rows, maximizing reuse of C, and/or across columns. We process blocks of rows first because C offers greater reuse, as it is both read and written. The inner-most GEMM call is coarse-grained for the full StepStone PIM with the mapping-aware AGEN unit, but is split into multiple dot-product operations without this innovative AGEN logic. In this way we can isolate the contributions of our algorithm mapping and hardware mechanism when evaluating StepStone PIM compared to prior PIM architectures, like Chopim [9] (we denote our StepStone GEMM flow on Chopim as enhanced Chopim, or eCHO).

Note that address generation with partitioning is slightly different than as described for unpartitioned group execution. When crossing different column partitions (groups of columns that partition a row into multiple blocks), address generation must skip over those columns belonging to different partitions. This is simple to do and only requires modifying the address-generation rules to account for group and partition ID.
D. StepStone Address Generation
Within a single cache block, the address is a simple increment, but once the value of a bit that determines the PIM ID is modified, contiguous physical addresses must be skipped until the next physical address that maps to the same PIM unit and block group is reached. With a simple iterative approach that increments the address until it again falls within the same block group and PIM ID, the number of iterations required when the number of PIMs is large introduces bubbles into the execution pipeline and degrades performance.
We propose new increment-correct-and-check AGEN logic that skips to the next closest address with the same PIM and group IDs (after the simple increment falls outside the target IDs). We do this by ensuring that those address bits that are XORed to determine each ID bit always maintain their parity. We can thus skip incrementing bits that are lower than the lowest ID-affecting address bit. The AGEN logic iterates over the ID-affecting bits (from LSB to MSB), each time incrementing the next ID-affecting bit and checking whether the PIM and group IDs match their target values.

The number of iterations is limited to the number of ID-affecting bits, but can be further reduced with two additional rules. The first rule applies to adjacent address bits that both affect the same ID bit. When the lower of the two is incremented, the upper must be as well to maintain parity. This can be done directly, saving one iteration. The second rule applies to chains of contiguous address bits that each affect a different ID bit. In this case, when the first is incremented, the carry it propagates would have to be corrected over multiple iterations to maintain the parity corresponding to each bit in the chain. Thus, the chain can be skipped with the carry simply propagated directly to the next-higher address bit. These rules are shown in Figure 4c: the top part of the figure illustrates the first rule of instant parity correction, and the bottom part illustrates the second rule of forwarding the carry from bit 13 to bit 17, since bits 14–16 affect different ID bits.
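The following is a behavioral software model of the skip (our sketch, not RTL). It relies on the property stated above: bits below the lowest ID-affecting bit never change the IDs, so entire runs of cache blocks can be stepped over at once. The hardware's correct-and-check rules further bound the search to one iteration per ID-affecting bit; this model simply checks successive run bases:

    def next_local_block(addr, id_bits, ids_of, block=64):
        """Next block-aligned address after `addr` with the same PIM/group IDs.

        id_bits: ascending positions of all ID-affecting address bits.
        ids_of:  callable mapping an address to its (PIM ID, group ID) pair.
        """
        target = ids_of(addr)
        nxt = addr + block
        if ids_of(nxt) == target:      # still inside the same local run
            return nxt
        p0 = id_bits[0]                # lowest ID-affecting bit
        run = max(block, 1 << p0)      # ID-neutral low bits skipped wholesale
        cand = (addr >> p0) << p0      # base of the current run
        while True:                    # hardware collapses this scan via the
            cand += run                # instant-correction and carry-forwarding
            if ids_of(cand) == target: # rules; each run base is one check here
                return cand

For example, with `ids_of` built from the parity functions sketched in Section III-B and `id_bits = [7, 8, 9, 12, 13, 14]`, successive calls enumerate exactly the cache blocks of one PIM's partition A0.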
E. Optimizations

Multiple optimizations over the basic flow described above are possible, including fusing multiple kernel executions for matrices that are not powers of two, balancing parallel execution with the overheads of localization and reduction, and choosing a PIM level for execution (StepStone-BG vs. StepStone-DV). For brevity, we only discuss the latter two.
Choosing the PIM level.
Bank group-level StepStone-BG offers the highest potential performance when the GEMM is severely bandwidth bound (very small batches) because it accesses underutilized bandwidth within a DRAM device. (We assume that the CPU address mapping is available for the PIMs, either by reverse engineering, by CPU vendors building the PIMs, or by agreement.)
Algorithm 1: Group-based GEMM with partitioning.

    localize(B); localize(C);
    for rpart in row_partitions do
        buffer_fill(C);
        for grp in block_groups do
            for cpart in col_partitions do
                buffer_fill(B);
                if StepStone then
                    StepStone_GEMM;
                else if eCHO then
                    for row in cpart do DOT(row);
        buffer_drain(C);
    reduce(C);
An interesting tradeoff emerges as bandwidth constraints decrease with somewhat larger batches. The arithmetic intensity (data reuse per operation) in StepStone scales with the batch size (N) up to the SIMD width of each PIM unit. This results in comparable arithmetic execution times for 1 ≤ N ≤ 8 in StepStone-BG and for 1 ≤ N ≤ 32 in StepStone-DV (though obviously the execution times differ between the two PIM levels). At the same time, the overheads of localization and reduction increase with N and with the number of PIMs (number of block groups).

An optimization opportunity emerges for choosing between bank group- and device-level PIMs as a result. The best PIM level depends on the matrix dimensions and the address mapping, as these determine the number of block groups, and hence the localization and reduction overheads. We demonstrate this tradeoff in Section V and generally find that StepStone-BG is best for small N. Note that we do not discuss the algorithm for choosing the PIM level, but note that a simple heuristic that estimates execution times and overheads based on available bandwidth and transferred data volumes works well; a sketch of such a heuristic follows below.
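All rate and overhead parameters in this sketch are illustrative assumptions; the structure is what matters: per-PIM arithmetic time shrinks with more PIMs and saturates at the SIMD width, while the localization/reduction term grows with N and the number of PIMs:

    # Rough per-level time estimate for C[M,N] += A[M,K] x B[K,N] (fp32).
    def est_time(M, K, N, n_pims, simd, agg_bw, host_bw, freq=1.2e9, b=4):
        arith = (M * K / n_pims) * max(N / simd, 1) / freq  # SIMD MAC cycles
        stream = M * K * b / agg_bw                         # read A once, in parallel
        overhead = n_pims * (K * N + M * N) * b / host_bw   # replicate B, reduce C
        return max(arith, stream) + overhead

    def pick_level(M, K, N):
        bg = est_time(M, K, N, n_pims=64, simd=8,  agg_bw=1.5e12, host_bw=76.8e9)
        dv = est_time(M, K, N, n_pims=16, simd=32, agg_bw=300e9,  host_bw=76.8e9)
        return "StepStone-BG" if bg < dv else "StepStone-DV"

    for N in (1, 4, 16, 32, 64):
        print(N, pick_level(1024, 4096, N))

With these assumed parameters the heuristic selects StepStone-BG for the smallest batches and switches to StepStone-DV as N grows, matching the qualitative behavior described above.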
Small weight matrices.

A similar tradeoff between arithmetic performance and localization and reduction overhead exists when the matrices are relatively small. In such cases, it is preferable to only use a subset of the available PIMs, trading off lower arithmetic performance for reduced overheads. We show that this optimization can yield a substantial performance improvement in some cases (Section V-D). Another optimization for a relatively small matrix A is that the CPU can directly write to and read from the PIM's scratchpad memory when matrices B and C fit in it. This reduces the time to move data between DRAM and the scratchpad memory.

Optimizing execution to only utilize a subset of the PIMs comes with additional considerations when allocating memory for the large distributed input matrix A. Specifically, A must remain contiguous in virtual memory yet be mapped to just a subset of the PEs. Because the address mapping and the size of the matrix are known, it is possible to allocate physical memory frames to satisfy this mapping constraint as long as the PIM ID bits that are ignored (for subsetting) are not affected by virtual-address offset bits. In other words, this is possible with base pages only (e.g., 4KB pages). Enforcing the mapping constraints to maintain alignment with the PIMs can be done using the coloring interface introduced by Cho et al. [9] and by modifying the application's memory allocator. For example, if the goal is to execute on half the PIMs of StepStone-BG with the Skylake mapping, we keep BG0 fixed for the entire physical allocation of A. This is achieved by allocating virtual memory at 32KB granularity rather than the minimum 4KB granularity. Additionally, we must ensure that contiguous virtual addresses remain aligned in the DRAM space and therefore must also ensure that the other PIM ID bits follow a consistent mapping. We do that by coloring those bits in a way that the OS-level frame allocator maintains alignment, as proposed for Chopim [9].

IV. METHODOLOGY
System configuration.
Table II summarizes our system configuration and DRAM timing parameters. Our DRAM model is based on the DDR4 model of the Ramulator detailed DRAM simulator [24]. We implement StepStone-CH, StepStone-DV, and StepStone-BG PIMs inside Ramulator with detailed pipeline and timing models. We emulate our memory allocator and add an address-translation engine into the PIM controller on the CPU (Section III-A); address translation is infrequent (once per coarse-grained PIM command) because contiguous physical regions are allocated for PIM execution. To validate our execution flow, we modify Ramulator to read values from and write values to memory and check the final output against pre-calculated results. We also compare all addresses from the StepStone AGEN logic with a pre-generated address trace for each PIM. For all the GEMM operations with StepStone PIM, we assume the input activations reside in the CPU caches and are therefore localized to the active PIMs. In the same way, we assume the final GEMM results are reduced by the CPU.

We use the XOR-based address mappings described in DRAMA [36], acquired by reverse-engineering Intel and Samsung CPUs. To show the impact of address mapping on the same DDR4 model, we modify the address mappings of Exynos, dual-socket Haswell, Ivy Bridge, and Sandy Bridge based on the randomized method (PAE) proposed by Liu et al. [26]. By default we use Skylake's address mapping. To measure GEMM performance on real machines, we use an Intel Xeon Platinum 8280 (CPU) with Intel oneDNN [1] and an NVIDIA Titan Xp (GPU) with CUTLASS [2]. The area and power of the SIMD units are estimated based on Lym et al. [29] for StepStone-DV and StepStone-CH and on iPIM [14] for the SIMD units of StepStone-BG. We use Cacti 6.5 [31] to estimate the area and power of scratchpad memory.
Workloads.
We choose matrix sizes and aspect ratios to clearly show their impact on performance. By default, we use a 1024 × 4096 weight matrix as matrix A, with the input and output activations as matrices B and C, respectively; we refer to the largest matrix as the weight matrix A. We demonstrate the benefits of long-running kernels for concurrent CPU and PIM execution in a colocation scenario that pairs the default 1024 × 4096 GEMM with memory-intensive CPU tasks (Section V-G).

TABLE II: System configuration.

PIM configurations
StepStone-BG: 8-wide SIMD, 8KB scratchpad (per DRAM device), 1.2GHz
StepStone-DV: 32-wide SIMD, 32KB scratchpad (per buffer chip), 1.2GHz
StepStone-CH: 256-wide SIMD, 256KB scratchpad (per channel), 1.2GHz
Address mappings
ID = 0: Exynos-like address mapping (modified)
ID = 1: Haswell-like address mapping (modified)
ID = 2: Ivy Bridge-like address mapping (modified)
ID = 3: Sandy Bridge-like address mapping (modified)
ID = 4: Skylake address mapping (baseline)
DRAM timing parameters (DDR4-2400R, 4GB, x8) tBL=4, tCCDS=4, tCCDL=6, tRTRS=2, tCL=16, tRCD=16,tRP=16, tCWL=12, tRAS=39, tRC=55, tRTP=9, tWTRS=3,tWTRL=9, tWR=18, tRRDS=4, tRRDL=6, tFAW=26
Energy components
In-device RD/WR (11.3pJ/b), off-chip RD/WR (25.7pJ/b), CH/DV/BG SIMD (11.3/11.3/11.3nJ/op), CH/DV/BG scratchpad (0.03/0.1/0.3nJ/access)
DL inference parameters
RM: DLRM [32], RM3; Bottom MLP (2560-512-32), Top MLP (512-128-1), bsz = 4
LM: BERT [21], text classification (WNLI), 24 transformer blocks, MLP (1024-4096-1024), seq. length = 8, bsz = 4

Comparisons.
We compare our approach with two existing PIM approaches, PEI [3] and Chopim [9], which are capable of accelerating GEMMs in main memory and leveraging multi-level PIMs with one data layout. To make a fair comparison, we use our baseline PIM system (Figure 3) for all approaches and only vary the localization/reduction mechanisms and PIM kernel granularity. For PEI, each PIM instruction is used to process one cache block and all the other operands needed for executing the PIM instruction are written to scratchpad memory by the CPU. We evaluate two versions of Chopim. The baseline, naive Chopim (nCHO), follows the GEMV mapping approach (Section II). We also use our newly-developed StepStone flow with an enhanced Chopim (eCHO). This eCHO configuration exploits locality as well as StepStone PIM does, but requires more frequent kernel calls (Algorithm 1) and does not use accelerated localization and reduction operations.
Fig. 6: GEMM Latency comparison between different PIMoptions of StepStone PIM and the CPU. The configurationswith relaxed area constraints are labeled with * (i.e. enoughALUs and large enough scratchpad memory).V. E
VALUATION R ESULTS
In this section, we demonstrate the throughput and latencybenefits of StepStone over either a CPU or GPU, evaluatethe impact of address mapping and scratchpad capacity, andanalyze the tradeoff between arithmetic performance and over-heads as the number of active PIM units (PIMs) is varied.
A. StepStone PIM Performance Benefits
We first compare the performance of StepStone PIM to a 2.7GHz 28-core Intel Xeon Platinum 8280 Cascade Lake CPU for a representative 1024 × 4096 weight matrix (Figure 6). The bank group-level StepStone-BG offers the lowest minimum latency, better than the device-level StepStone-DV and far better than the CPU.

Alternatively, we consider maximum throughput under a latency constraint. When the latency constraint is set to the minimal latency of the CPU (CPU with batch-1), StepStone-DV offers higher throughput (more samples in about the same time). If we allow a larger-area PIM with larger scratchpads, performance improves even further. If we relax the latency constraint and allow the CPU 1.2× more time to complete an inference task, which allows batch-32 on the CPU, the throughput benefit of StepStone-DV shrinks (also with the larger scratchpad). While we use the highly-optimized Intel oneDNN library on the CPU, the performance we observe falls short of the channel-level StepStone-CH, which can fully utilize the memory-system bandwidth. Still, the finer-grained StepStone-DV (which can be implemented in buffer chips) offers substantially better performance and StepStone-BG offers far lower minimum latency.

Throughput rooflines.
The throughput benefits of StepStone are also apparent in the roofline plot presented in Figure 7, again for a 1024 × 4096 weight matrix.
Fig. 7: Roofline models for CPU, GPU, and StepStone PIMs (CPU and StepStone with the weight in main memory; GPU with the weight in either device or main memory);
measured results are for a 1K × 4K weight matrix for varying batch sizes (the left-most point of each system is for batch-1 and the batch is 2× larger for each point moving to the right).

StepStone offers greater throughput (in addition to its latency benefits) at all reasonable batch sizes. In fact, the CPU and GPU offer an advantage only once the batch is 256 samples or greater. Even if the model fits in GPU memory, StepStone offers higher performance for batches of 16 samples or less. The gap between the rooflines and the simulated performance of StepStone stems from the localization and reduction overheads.

We emphasize that StepStone PIM achieves these high performance benefits without utilizing CPU or GPU compute resources, such as cores or caches. This implies that overall system performance can increase even further by colocating additional tasks on the same node.

B. End-to-End Performance
Figure 8 compares the inference performance of one recommendation system and three language models with different PIM approaches—PEI, Chopim, and StepStone PIM—to that of a CPU. For the PIM approaches, we assume the same PIM system (Figure 3) and that GEMMs can be executed by either the CPU, device-level (PIM_DV), or BG-level PIMs (PIM_BG); the best-performing option is chosen for each GEMM. GEMMs are used for FC and projection layers. All other operations, including concatenation, GELU, softmax, sorting, batched GEMM, and some data reorganization (e.g., stack) operations, are executed on the CPU (CPU_Other).

We show the performance of two different CPU models: measured on the real system (CPU) and idealized CPU performance (iCPU). We estimate idealized performance with our StepStone-CH, which maximally utilizes memory-channel bandwidth. Overall, the measured results show that the CPU performs poorly on small-matrix GEMMs.

Naive Chopim (nCHO) executes GEMMs as multiple GEMV operations, which leads to missed locality opportunities. On the other hand, if Chopim is enhanced with StepStone block grouping (eCHO) and divides each GEMM into smaller dot-product operations, it benefits from better PIM buffer locality and the overhead for buffer fill/drain decreases significantly. However, compared to StepStone PIM, eCHO suffers from higher localization/reduction overhead. We evaluate a low-power StepStone PIM mode (STP*), where only StepStone-DV is used, and a high-performance mode
Fig. 8: End-to-end performance results for various recommendation and language models with the CPU and PIMs (normalized execution time, broken down into PIM_DV, PIM_BG, CPU_GEMM, and CPU_Other).
(STP), which selects the best-performing level per GEMM.

The execution time of DLRM is dominated by a single FC layer (92%) and GEMM execution time is long enough to amortize the localization/reduction overheads. This enables Chopim and StepStone PIM to use BG-level PIMs and benefit from their high memory bandwidth. On the other hand, PEI cannot fully utilize BG-level PIMs due to the command bandwidth bottleneck and, consequently, using more PIMs with PEI only increases overhead. GPT2 shows a similar trend, but the gaps between PEI and StepStone PIM/Chopim are greater due to a larger weight matrix than DLRM's. In BERT and XLM, the N dimension is the batch size multiplied by the sequence length after tensor reshaping, offering more efficient GEMM execution. For BERT, N becomes 32 in all FC layers whereas, for XLM, the sequence length starts at 1 and increases by 1 up to the maximum length (8 in our configuration) after each iteration. As a result, XLM utilizes BG-level PIMs when N is small and, later, switches to DV-level PIMs once arithmetic performance saturates and overheads start to dominate.

Overall, when weight matrices are larger and the batch dimension is smaller, StepStone PIM outperforms the other CPU and PIM approaches. Even with somewhat larger batches (e.g., up to N = 384 for BERT), StepStone PIM outperforms the CPU by splitting a batch into several batch-32 GEMM operations. For example, StepStone PIM achieves 12× higher performance than the CPU for BERT. Thus, StepStone PIM outperforms the CPU until N = 12 ×
32 = 384.

C. Impact of StepStone AGEN
Figure 9 shows the performance benefit of StepStone AGEN over the naive approach (explained in Section III-D). Overall, StepStone AGEN outperforms the naive approach by up to 4×, in particular when the number of active PIMs is large. Intuitively, the naive approach finds the next cache block with probability 1/n, where n is the number of active PIMs. For StepStone-BG (Figure 9a) there are 16 active PIMs and the performance difference between the two approaches is 8×. This is mainly because pairs of cache blocks are contiguous in our baseline address mapping, which equates the naive approach with our optimized AGEN. However, when a large gap in the mapping exists, the naive approach iterates many times and needs a large number of cycles to generate the next address. The DRAM burst transfer latency is 4 DRAM cycles and bubbles are introduced any time the next address cannot be generated within that time window. This does not
occur with our proposed AGEN, whose latency can always be hidden within the pipeline. The performance difference between the two approaches for this case is apparent for StepStone-DV with a large weight matrix (Figure 9b), where the gap is 2.5×.

Fig. 9: GEMM latency comparison between the naive address generator and the proposed StepStone AGEN: (a) m × k = 1024 × …; (b) m × k = 2048 × … (BG-, DV-, and CH-level PIMs).

Fig. 10: Impact of trading off between PIM execution time and replication/reduction overhead (all vs. 1/2 of the BG-level PIMs; batch sizes 16 and 32).

D. Parallelism—Distribution Overhead Tradeoffs
Figure 10 compares GEMM latency between two cases: (1) when all bank group-level PIMs are used and (2) when only half of the BG-level PIMs are active. We present the bank group-level tradeoff because we already discussed tradeoffs with respect to PIM level. When the weight matrix is small, replication and reduction overheads dominate the execution time. If we only activate half of the BG-level PIMs, the overheads decrease by at most half while arithmetic execution time doubles because parallelism is cut in half. Still, the tradeoff proves beneficial when the matrices are small (left). On the other hand, as the matrix size increases, the fraction of PIM execution time increases as well (right). This is because the PIM execution time quadruples as each dimension is doubled, whereas the localization/reduction overhead only doubles. Moreover, as the input and output matrices (i.e., activations) grow, they exceed scratchpad capacity. As a result, the fraction of execution time required for buffer fill/drain operations also increases. Even though using fewer PIM units does not offer better performance for the larger matrix, it still provides a valid tradeoff option because comparable performance is attainable in some cases while resource usage and power consumption decrease.
E. Impact of Address Mapping
Figure 11 shows the execution time of GEMM operations with different address mappings and aspect ratios of the weight matrix. To isolate the impact of the DRAM address mapping alone, we set the batch size to 4 such that the
input and output matrices fit in the scratchpad memory of all three PIM options. In bank group-level StepStone-BG, the fraction of localization overhead differs significantly across address mappings when the matrix is short and fat (i.e., 128 × …): the localization overheads with some address mappings are several times greater than those with address mappings 3 and 4, and 4× greater than those with address mapping 0. The reason for the low localization overhead with address mapping 0 is that the combination of the address mapping and matrix size interleaves addresses such that matrix columns remain contiguous within each PIM. In contrast, the tall and thin GEMM (i.e., 8192 × …) is far less sensitive to the address mapping.

Fig. 11: Sensitivity to address mapping and aspect ratio of the weight matrix (batch size = 4).

F. Impact of Scratchpad Memory Capacity
Figure 12 shows the impact of scratchpad memory capacity on GEMM latency. We analyze StepStone-BG as it has the most stringent area constraint. We search for an optimal allocation across the scratchpad partitioning options between input- and output-buffer allocations (there are only two buffers, so the search converges quickly). We find that interleaving buffer fill/drain operations with arithmetic has negligible impact on GEMM performance. The ability to execute entire kernels limits the benefits of overlapping data transfer with arithmetic,
and interleaving increases the number of row-buffer conflicts, though row-buffer locality remains high.

Larger matrices (right) tend to amortize the buffer fill/drain overheads better than smaller matrices (left). Generally, overhead increases with batch size. Interestingly, the overhead with the 2048 × … matrix deviates from this general trend.

Fig. 12: GEMM latencies for different matrices and buffer sizes (StepStone-BG).

G. Impact of Concurrent CPU Access
We expect StepStone PIM to outperform prior PIM architectures, including Chopim, by enabling longer-running GEMM operations that maintain PIM locality. Long-running operations are important when the CPU executes a memory-intensive workload concurrently with the PIMs, as both the CPU and PIMs contend for limited command-channel bandwidth. We evaluate this using the same colocation scenario used by Cho et al. for evaluating Chopim [9], as described in Section IV. While the colocated applications are not DL-related, they run readily on gem5 [6] and clearly demonstrate the impact of command-channel contention. We isolate the performance benefits of just the StepStone AGEN unit that enables long-running kernels by running the same StepStone GEMM flow on eCHO and StepStone PIM and reporting results corresponding only to GEMM execution.

StepStone PIM outperforms Chopim when the CPU intensively accesses memory concurrently with the PIMs (Figure 13). As the matrix shape changes from short-fat to tall-thin, each eCHO kernel accesses fewer cache blocks, resulting in more PIM kernel invocations and greater contention with the CPU for the command channel. As a result, PIM kernel packets are delayed and the PIMs are underutilized. With BG-level PIMs, the relative performance of Chopim to StepStone PIM is worse since even more PIMs are underutilized due to the command bandwidth bottleneck. This performance gap will increase as the number of PIMs per channel increases, increasing the importance of mechanisms that enable long-running kernels; a rough counting model follows below.
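This model reflects our assumptions, not the paper's measurements: each kernel invocation costs a fixed number of command packets on the shared channel, so the invocation count is a proxy for contention with CPU traffic:

    # Kernel invocations per GEMM for one PIM with `groups` block groups.
    def invocations(rows, groups, long_running):
        if long_running:      # StepStone AGEN: one long kernel per block group
            return groups
        return rows           # eCHO: one dot-product kernel per matrix row

    groups = 16
    for rows, cols in ((1024, 16384), (4096, 4096), (16384, 1024)):
        print(f"{rows:6d} x {cols:6d}: eCHO {invocations(rows, groups, False):6d}"
              f" calls vs StepStone {invocations(rows, groups, True)} calls")
    # Tall-thin matrices (more rows) inflate eCHO's invocation count, and
    # each invocation's command packets contend with concurrent CPU access.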
H. Power and Energy Analysis
Figure 14 shows the power and energy consumption per DRAM device of StepStone-BG and StepStone-DV. As N increases, the relative contribution of arithmetic also increases. However, overall, the power of DRAM access (either within the PIMs or for localization and reduction) dominates the power of the SIMD units. The right side of the figure shows that StepStone-BG is more energy-efficient than StepStone-DV when N is small. The main source of this energy savings is that IO energy is much smaller within a device. However, as N increases, the energy for localization and reduction dominates the energy consumption of arithmetic and StepStone-DV is more efficient. Note that, if power exceeds the delivery/cooling budget for a chip or module, performance can be throttled.

Fig. 13: Speedup of StepStone PIM (STP) over Chopim enhanced with StepStone block grouping (eCHO) when concurrent CPU access exists. The total matrix size is fixed and its aspect ratio is varied (DV-level and BG-level PIMs).

Fig. 14: Power dissipation per DRAM device (left) and energy consumption per floating-point operation (right) of StepStone-BG and StepStone-DV (weight matrix = [1024, 4096]).

VI. RELATED WORK
To the best of our knowledge, this is the first work that enables high-performance and CPU-compatible near-memory processing for accelerating bandwidth-bound fully-connected (FC) layers. We describe prior related work below and contrast our contributions with existing approaches.
Processing in main memory.
Processing in main memory implies that PIM should cooperate with the other parts of the system; otherwise, it has a system-wide impact. Considering this, some research efforts [3], [23] enable PIM at a fine granularity, such as one PIM operation per cache block. This approach can solve the complex address-mapping problem: the CPU indicates the next cache block to process with command packets and the PIM processes that cache block. Though this approach can accommodate more applications due to its flexibility, PIM performance is ultimately limited by the command bandwidth. RecNMP [20] mitigates this command bandwidth bottleneck by sending compound memory requests, but this solution does not scale when there are more than 4 PIMs per channel. Chopim [9] enables coarse-grained PIM kernels under complex address mappings. Though a GEMM can be executed with Chopim as multiple GEMV kernel calls, temporal locality cannot be exploited, which is crucial for high-performance GEMM operations. Also, Chopim does not provide efficient localization and reduction mechanisms, which incurs high overhead for executing GEMMs on PIMs. NDA [12], Chameleon [5], and MCN DIMM [4] are also based on conventional DIMM devices and propose PIM designs that practically add PEs to main memory.
GEMM acceleration with PIM.
The Active Memory Cube (AMC) [41] targets GEMM operations with in-memory processors. AMC considers address interleaving, but interleaving is only allowed within the same local PIM memory. This essentially requires partitioning into CPU and PIM memory spaces, and data must be copied from one space to the other if the processing mode changes. This approach has the same data-loading problem as discrete accelerators and does not truly realize the PIM potential of sharing the same memory between the CPU and PIM. In contrast, our solution does not have these limitations and works with any XOR-based address mapping and with PIMs at any level of the DRAM hierarchy.
PIM for machine learning.
PIM for machine learning workloads has been widely studied. Much of this research targets convolutional neural networks [8], [11], [13], [18], [19], [22], [28], [39], [40], embedding-table lookups in recommendation models [20], [25], recurrent neural networks [27], and GANs [30], [38]. In contrast, we target the tall-thin/fat-narrow GEMMs of fully-connected layers in DL inference. Newton [17] also targets fully-connected layers, like StepStone PIM. However, Newton operates as a discrete accelerator that cannot benefit from the advantages of main-memory acceleration described in Section II. More importantly, Newton does not avoid weight copies, does not exploit GEMM locality, cannot trade off parallelization-degree overheads with performance benefits, cannot selectively execute at different PIM levels or on the CPU to dynamically match changing workload characteristics, and does not support the long-running kernels necessary for concurrent bandwidth-intensive CPU tasks.
VII. CONCLUSION
We identify that small-batch GEMM operations of DL inference workloads are bandwidth bound on CPUs and GPUs and can benefit from PIM acceleration. We introduce
StepStone PIM, which enables independent PIM GEMM execution under complex CPU DRAM address mappings. The main innovation of StepStone PIM is the address-mapping cognizant GEMM blocking with matching PIM-side address generation. Our unique AGEN logic improves throughput compared to naive or host-side address generation. We explore PIM designs at three different DRAM hierarchy levels (channel, chip, and bank-group levels) and show their tradeoffs with detailed simulation results. We show that activating more PIMs for a GEMM improves arithmetic performance but adds overheads for data localization/replication and reduction. We conclude that StepStone is an appealing datacenter solution because of: (1) its low cost; (2) its potential for lower latency and higher throughput, even when implemented at the buffer-chip level within a DIMM without DRAM device modification; and (3) its locality-optimized, high-efficiency GEMM execution that frees CPU resources for other tasks.

REFERENCES
[1] "Intel oneDNN (v1.7)," https://github.com/oneapi-src/oneDNN. Intel, 2020.
[2] "NVIDIA CUTLASS (v2.2)," https://github.com/NVIDIA/cutlass. NVIDIA, 2020.
[3] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture," in International Symposium on Computer Architecture (ISCA). IEEE, 2015, pp. 336–348.
[4] M. Alian, S. W. Min, H. Asgharimoghaddam, A. Dhar, D. Wang, T. Roewer, A. McPadden, O. O'Halloran, D. Chen, J. Xiong, D. Kim, W.-m. Hwu, and N. S. Kim, "Application-transparent near-memory processing architecture with memory channel network," in The 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.
[5] H. Asghari-Moghaddam, Y. H. Son, J. H. Ahn, and N. S. Kim, "Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems," in 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–13.
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[7] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[8] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 27–39, 2016.
[9] B. Y. Cho, Y. Kwon, S. Lym, and M. Erez, "Near data acceleration with concurrent host access," in International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 818–831.
[10] A. Conneau and G. Lample, "Cross-lingual language model pretraining," in Advances in Neural Information Processing Systems, 2019, pp. 7059–7069.
[11] Q. Deng, Y. Zhang, M. Zhang, and J. Yang, "LAcc: Exploiting lookup table-based fast and accurate vector multiplication in DRAM-based CNN accelerator," in Proceedings of the 56th Annual Design Automation Conference, 2019, pp. 1–6.
[12] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, "NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules," in IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2015, pp. 283–295.
[13] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 751–764.
[14] P. Gu, X. Xie, Y. Ding, G. Chen, W. Zhang, D. Niu, and Y. Xie, "iPIM: Programmable in-memory image processing accelerator using near-bank architecture," in International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 804–817.
[15] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G. Wei, H. S. Lee, D. Brooks, and C. Wu, "DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference," in International Symposium on Computer Architecture (ISCA), 2020.
[16] U. Gupta, C.-J. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia et al., "The architectural implications of Facebook's DNN-based personalized recommendation," in IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 488–501.
[17] M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. Vijaykumar, "Newton: A DRAM-maker's accelerator-in-memory (AiM) architecture for machine learning," in 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 372–385.
[18] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "FloatPIM: In-memory acceleration of deep neural network training with high precision," in International Symposium on Computer Architecture (ISCA). IEEE, 2019, pp. 802–815.
[19] B. K. Joardar, B. Li, J. R. Doppa, H. Li, P. P. Pande, and K. Chakrabarty, "REGENT: A heterogeneous ReRAM/GPU-based architecture enabled by NoC for training CNNs," in Design, Automation & Test in Europe Conference (DATE). IEEE, 2019, pp. 522–527.
[20] L. Ke, U. Gupta, B. Y. Cho, D. Brooks, V. Chandra, U. Diril, A. Firoozshahian, K. Hazelwood, B. Jia, H.-H. S. Lee et al., "RecNMP: Accelerating personalized recommendation with near-memory processing," in International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 790–803.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[22] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: a programmable digital neuromorphic architecture with high-density 3D memory," in Proceedings of the 43rd International Symposium on Computer Architecture, 2016, pp. 380–392.
[23] G. Kim, N. Chatterjee, M. O'Connor, and K. Hsieh, "Toward standardized near-data processing with unrestricted data placement for GPUs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12.
[24] Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A fast and extensible DRAM simulator," IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, 2016.
[25] Y. Kwon, Y. Lee, and M. Rhu, "TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019, pp. 740–753.
[26] Y. Liu, X. Zhao, M. Jahre, Z. Wang, X. Wang, Y. Luo, and L. Eeckhout, "Get out of the valley: power-efficient address mapping for GPUs," in International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 166–179.
[27] Y. Long, T. Na, and S. Mukhopadhyay, "ReRAM-based processing-in-memory architecture for recurrent neural network acceleration," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2781–2794, 2018.
[28] Y. Long, X. She, and S. Mukhopadhyay, "Design of reliable DNN accelerator with un-reliable ReRAM," in Design, Automation & Test in Europe Conference (DATE). IEEE, 2019, pp. 1769–1774.
[29] S. Lym, A. Behroozi, W. Wen, G. Li, Y. Kwon, and M. Erez, "Mini-batch serialization: CNN training with inter-layer data reuse," Proceedings of Machine Learning and Systems, vol. 1, pp. 264–275, 2019.
[30] H. Mao, M. Song, T. Li, Y. Dai, and J. Shu, "LerGAN: A zero-free, low data movement and PIM-based GAN architecture," in 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 669–681.
[31] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," HP Laboratories, pp. 22–31, 2009.
[32] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, "Deep learning recommendation model for personalization and recommendation systems," CoRR, vol. abs/1906.00091, 2019. [Online]. Available: https://arxiv.org/abs/1906.00091
[33] R. Nishtala, V. Petrucci, P. Carpenter, and M. Sjalander, "Twig: Multi-agent task management for colocated latency-critical cloud services," in IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 167–179.
[34] R. Panda, S. Song, J. Dean, and L. K. John, "Wait of a decade: Did SPEC CPU 2017 broaden the performance horizon?" in IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 271–282.
[35] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur et al., "Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications," arXiv preprint arXiv:1811.09886, 2018.
[36] P. Pessl, D. Gruss, C. Maurice, M. Schwarz, and S. Mangard, "DRAMA: exploiting DRAM addressing for cross-CPU attacks," in Proceedings of the 25th USENIX Conference on Security Symposium, 2016, pp. 565–581.
[37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[38] A. S. Rakin, S. Angizi, Z. He, and D. Fan, "PIM-TGAN: A processing-in-memory accelerator for ternary generative adversarial networks," in IEEE International Conference on Computer Design (ICCD). IEEE, 2018, pp. 266–273.
[39] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 14–26.
[40] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 541–552.
[41] Z. Sura, A. Jacob, T. Chen, B. Rosenburg, O. Sallenave, C. Bertolli, S. Antao, J. Brunheroto, Y. Park, K. O'Brien et al., "Data access optimization in a processing-in-memory system," in Proceedings of the 12th ACM International Conference on Computing Frontiers, 2015, pp. 1–8.
[42] H. Zhu, D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and M. Erez, "Kelp: QoS for accelerated machine learning systems," in IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 172–184.