RecSSD: Near Data Processing for Solid State Drive Based Recommendation Inference
Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean Wu, David Brooks, Gu-Yeon Wei
Mark Wilkening (Harvard University, Cambridge, Massachusetts, USA)
Udit Gupta (Harvard University / Facebook, Cambridge, Massachusetts, USA)
Samuel Hsia (Harvard University, Cambridge, Massachusetts, USA)
Caroline Trippel (Facebook, Menlo Park, California, USA)
Carole-Jean Wu (Facebook, Menlo Park, California, USA)
David Brooks (Harvard University, Cambridge, Massachusetts, USA)
Gu-Yeon Wei (Harvard University, Cambridge, Massachusetts, USA)
ABSTRACT
Neural personalized recommendation models are used across a wide variety of datacenter applications including search, social media, and entertainment. State-of-the-art models comprise large embedding tables that have billions of parameters requiring large memory capacities. Unfortunately, large and fast DRAM-based memories levy high infrastructure costs. Conventional SSD-based storage solutions offer an order of magnitude larger capacity, but have worse read latency and bandwidth, degrading inference performance. RecSSD is a near data processing based SSD memory system customized for neural recommendation inference that reduces end-to-end model inference latency by 2× compared to using COTS SSDs across eight industry-representative models.

CCS CONCEPTS
• Hardware → External storage; • Computer systems organization → Neural networks.

KEYWORDS
near data processing, neural networks, solid state drives
ACM Reference Format:
Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean Wu, David Brooks, and Gu-Yeon Wei. 2021. RecSSD: Near Data Processing for Solid State Drive Based Recommendation Inference. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21), April 19–23, 2021, Virtual, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3445814.3446763
1 INTRODUCTION

Recommendation algorithms are used across a variety of Internet services such as social media, entertainment, e-commerce, and search [20, 37, 43–45]. In order to efficiently provide accurate, personalized, and scalable recommendations to users, state-of-the-art algorithms use deep learning based solutions. These algorithms consume a significant portion of infrastructure capacity and cycles in industry datacenters. For instance, compared to other AI-driven applications, recommendation accounts for 10× the infrastructure capacity in Facebook's datacenters [20, 28, 30]. Similar capacity requirements can be found at Google, Alibaba, and Amazon [43–45]. One of the key distinguishing features of neural recommendation models is processing categorical input features using large embedding tables. While large embedding tables enable higher personalization, they consume up to hundreds of GBs of storage [20, 32]. In fact, in many cases, the size of recommendation models is set by the amount of memory available on servers [20]. A promising alternative is to store embedding tables in SSDs. While SSDs offer orders of magnitude higher storage capacities than main memory systems, they exhibit slower read and write performance. To hide the longer SSD read and write latencies, previous SSD based systems overlap computations from other layers in the recommendation models and cache frequently accessed embedding vectors in DRAM-based main memory [17, 41, 42].

We propose RecSSD, a near data processing (NDP) solution customized for recommendation inference that improves the performance of the underlying SSD storage for embedding table operations. In order to fully utilize the internal SSD bandwidth and reduce round-trip data communication overheads between the host CPU and SSD memory, RecSSD offloads the entire embedding table operation, including gather and aggregation computations, to the SSDs. Compared to a baseline SSD, we demonstrate that RecSSD provides a 4× improvement in embedding operation latency and a 2× improvement in end-to-end model latency on a real OpenSSD system. In addition to offloading embedding operations, RecSSD exploits the locality patterns of recommendation inference queries. RecSSD demonstrates that a combination of host-side and SSD-side caching complements NDP and reduces end-to-end model inference latency. To demonstrate the feasibility and practicality of the proposed design in server-class datacenter systems, we implement RecSSD on a real, open-source Cosmos+ OpenSSD system within the Micron UNVMe driver library.

The key contributions of this paper are:

• We design RecSSD, the first NDP-based SSD system for recommendation inference. Improving the performance of conventional SSD systems, the proposed design targets the main performance bottleneck to datacenter scale recommendation execution using SSDs. Furthermore, the latency improvement enables recommendation models with higher storage capacities at reduced infrastructure cost.

• We implement RecSSD in a real system on top of the Cosmos+ OpenSSD hardware. The implementation demonstrates the viability of Flash-based SSDs for industry-scale recommendation. In order to provide a feasible solution for datacenter scale deployment, we implement RecSSD within the FTL firmware; the interface is compatible with existing NVMe protocols, requiring no hardware changes.
• We evaluate the proposed design across eight industry-representative models spanning various use cases (e.g., social media, e-commerce, entertainment). Of the eight, our real system evaluation shows that five models (whose runtime is dominated by compute-intensive FC layers) achieve comparable performance using SSD compared to DRAM. The remaining three models are dominated by memory-bound embedding table operations. On top of highly optimized hybrid DRAM-SSD systems, we demonstrate that RecSSD improves performance by up to 4× for individual embedding operations, translating into up to 2× end-to-end recommendation inference latency reduction.

2 BACKGROUND

Often found in commercial applications, recommendation systems recommend items to users by predicting said items' values in the context of the users' preferences. In fact, meticulously tuned personalized recommendation systems form the backbone of many internet services – including social media, e-commerce, and online entertainment [22, 30, 43–45] – that require real-time responses. Modern recommendation systems implement deep learning-based solutions that enable more sophisticated user modeling. Recent work shows that deep-learning based recommendation systems not only drive product success [13, 37, 40] but also dominate the datacenter capacity for AI training and inference [20, 21, 29]. Thus, there exists a need to make datacenter-scale recommendation solutions more efficient and scalable.

Overview of model architecture
As shown in Figure 1, deep learning-based recommendation models comprise both fully-connected (FC) layers and embedding tables of various sizes.

Figure 1: Recommendation models process both categorical and continuous input features. RecSSD offloads the embedding table operation (embedding gather and embedding reduce) to the NDP-SSD.

FC layers stress compute capabilities by introducing regular MAC operations while embedding table references stress memory bandwidth by introducing irregular memory lookups. Based on the specific operator composition and dimensions, recommendation models span a diverse range of architectures. For instance, the operators that combine outputs from Bottom FC and embedding table operations depend on the application use case. Furthermore, recommendation models implement a wide range of sizes of FC layers and embedding tables.
Processing categorical inputs
Unique to recommendations, models process categorical input features using embedding table operations. Embedding tables are organized such that each row is a unique embedding vector, typically comprising 16, 32, or 64 learned features (i.e., the number of columns of the table). For each inference, a set of embedding vectors, specified by a list of IDs (e.g., multi-hot encoded categorical inputs), is gathered and aggregated together. Common operations for aggregating embedding vectors include sum, averaging, concatenation, and matrix multiplication [30, 43–45]; Figure 1 shows an example using summation. Inference requests are often batched together to amortize control overheads and better utilize computational resources. Additionally, models often comprise many embedding tables. Currently, production-scale datacenters store embedding tables in DRAM while CPUs perform the embedding table operations, with optimizations such as vectorized instructions and software prefetching [2]. A minimal code sketch of this gather-and-reduce operation appears after the list below.

The embedding table operations pose unique challenges:

(1) Capacity: Industry-scale embedding tables have up to hundreds of millions of rows, leading to embedding tables that often require up to hundreds of GBs of storage.

(2) Irregular Accesses: Categorical input features are sparse, multi-hot encoded vectors. High sparsity leads to a small fraction of embedding vectors being accessed per request. Furthermore, access patterns between subsequent requests from different users can be quite different, causing embedding table operations to incur irregular accesses.

(3) Low Compute Intensity: The overall compute intensity of embedding tables is orders of magnitude lower than that of other deep learning workloads, precluding efficient execution using recently proposed SIMD, systolic array, and dataflow hardware accelerators [20].

These three features – large capacity requirements, irregular memory accesses, and low compute intensity – make Flash technology an interesting target for embedding tables.
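To make the operation concrete, the following minimal NumPy sketch captures the SparseLengthsSum-style gather-and-reduce semantics described above; the table size and names are illustrative assumptions, not values from any production model.

```python
import numpy as np

# Illustrative sizes only; industry-scale tables have millions of rows.
NUM_ROWS, DIM = 100_000, 32
table = np.random.rand(NUM_ROWS, DIM).astype(np.float32)

def sparse_lengths_sum(table, ids, lengths):
    """Gather rows of `table` by `ids` and sum them per lookup.

    `ids` concatenates the sparse row indices for all lookups in a batch;
    `lengths[i]` gives how many of those indices belong to lookup i
    (one multi-hot encoded categorical input).
    """
    out = np.zeros((len(lengths), table.shape[1]), dtype=table.dtype)
    offset = 0
    for i, n in enumerate(lengths):
        out[i] = table[ids[offset:offset + n]].sum(axis=0)  # gather + reduce
        offset += n
    return out

# One batch with two lookups of 3 and 2 sparse IDs, respectively.
result = sparse_lengths_sum(table, np.array([7, 42, 9, 3, 8]), [3, 2])
```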
Figure 2: In order to support a high performance and simple logical block interface to the host while handling the peculiarities of NAND Flash memories, SSDs are designed with a microprocessor operating alongside dedicated memory controllers.

Architecture of Flash
NAND flash memory is the most widely used SSD building block on the market. Compared to traditional disk-based storage systems, NAND flash memories offer higher performance in terms of latency and bandwidth for reads and writes [12]. Figure 2 illustrates the overall architecture of NAND Flash storage systems. To perform a read operation, the host communicates over PCIe using the NVMe protocol with a host controller on the SSD. The host requests logical blocks, which are served by a flash translation layer (FTL) running on a microprocessor on the SSD. The FTL schedules and controls an array of Flash controllers, which are organized per channel and provide specific commands to all the Flash dies (chips) on a channel as well as DMA capabilities across the multiple channels. In order to transfer data between the Flash controllers' DMA engines and the host NVMe DMA engine, the controller uses an on-board DRAM buffer.
Flash Translation Layer (FTL)
In order to maintain compatibility with existing drivers and file systems, Flash SSD systems implement the FTL. The FTL exposes a logical block device interface to the host system while managing the underlying NAND Flash memory system. This includes performing key functions such as (1) maintaining an indirect mapping between logical and physical pages, (2) maintaining a log-like write mechanism to sequentially add data in erase blocks and invalidate stale data [33], (3) garbage collection, and (4) wear leveling. As shown in Figure 2, to perform this diverse set of tasks, the FTL runs on a general purpose microprocessor.
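For intuition, here is a minimal sketch of the FTL's indirect mapping and log-like write behavior; the `flash` object is a hypothetical stand-in for the real NAND controllers, not an API from the Cosmos+ firmware.

```python
# Minimal sketch of FTL indirection with log-like writes.
l2p = {}          # logical page number -> physical page number
next_free = 0     # log head: programs always append sequentially

def ftl_write(lpn, flash):
    global next_free
    old = l2p.get(lpn)
    if old is not None:
        flash.invalidate(old)      # stale copy reclaimed later by GC
    l2p[lpn] = next_free           # remap the logical page to the log head
    flash.program(next_free)
    next_free += 1                 # NAND pages are never updated in place

def ftl_read(lpn, flash):
    return flash.read(l2p[lpn])    # one level of indirection per access
```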
Performance characteristics of SSD storage
Compared to DRAM-based main memory systems, Flash-based storage systems have orders of magnitude higher storage densities [12], enabling higher capacities at lower infrastructure cost, around 4-8× cheaper than DRAM per bit [16]. Despite these advantages, Flash poses many performance challenges. A single flash memory package provides a limited bandwidth of 32-40MB/sec [11]. In addition, writes to flash memory are often much slower, incurring O(ms) latencies. To help address these limitations, SSDs are built to expose significant internal bandwidth by organizing flash memory packages as an array of connected channels (e.g., 2-10), each handled by a single memory controller. Since logical blocks can be striped over multiple flash memory packages, data accesses can be conducted in parallel to provide higher aggregate bandwidth and hide high latency operations through concurrent work.

3 CHARACTERIZATION

To better understand the role of SSD storage for neural recommendation inference, we begin with an initial characterization. First, we present the memory access pattern characterization for recommendation models running in a cloud-scale production environment and describe the locality optimization opportunities for performing embedding execution on SSDs. Then, we take a step further to study the impact of storing embedding tables and performing the associated computation in SSDs as opposed to DRAM [26]. The characterization studies focus on embedding table operations, followed by an evaluation of end-to-end model performance.
One important property of SSD systems is that SSDs operate as block devices where data is transferred in coarse chunks. This is an important factor when considering efficient bandwidth use of SSDs. The hardware is designed for sequential disk access, where data is streamed in arbitrarily large chunks. However, larger access granularity penalizes performance for workloads that require random, sparse accesses – such as embedding table accesses in neural recommendation models. Therefore, it is important to understand the unique memory access patterns of embedding tables. Furthermore, caching techniques become even more important to exploit temporal reuse and maximize spatial locality from block accesses.

Figure 3 depicts the reuse distribution of embedding tables at granularities of 256B, 1KB, and 4KB, respectively. The x-axis represents pages accessed over the execution of real-time recommendation inference serving (sorted by the corresponding hit counts in ascending order) whereas the y-axis shows the cumulative hit counts, obtained by analyzing embedding table accesses logged for recommendation models running in a cloud-scale production environment. Access patterns to embedding tables follow a power-law distribution. Depending on the page size, the slope of the tail changes.
The majority of reuse remains concentrated in a few hot memory regions: a few hundred pages capture 30% of reuses while caching a few thousand pages can extend reuse over 50%.
Figure 3: Access patterns to neural recommendation embedding tables follow a power-law distribution.

Figure 4: Locality patterns vary significantly across different embedding tables. Using a 16-way LRU 4KB page cache of varying total capacities, the hit rate varies wildly from under 10% to over 90% across different embedding tables.
The concentration of hot pages varies across individual embedding tables. Figure 4 characterizes the memory locality patterns across different, individual embedding tables. Using a 16-way, LRU, 4KB page cache of varying cache capacities, the hit rate varies wildly from under 10% to over 90% across the different embedding tables of recommendation models running in a cloud-scale production environment. As the capacity of the page cache increases, more embedding vectors can be captured in the page cache, leading to higher reuse. With a 16MB page cache per embedding table, more than 50% of reuses can be captured across all the embedding tables analyzed in this study. The specific page cache capacity per embedding table can be further optimized for better efficiency.

Locality in embedding table accesses influences the design and performance of SSD systems in many ways. First, on-board SSD caching is difficult due to the limited DRAM capacity and the potentially large reuse distances. Despite this, the distribution of reuse and the relatively small collection of hot pages suggest reasonable effectiveness of static partitioning strategies, where hot pages can be stored in host-side DRAM. Most importantly, the varying page reuse patterns (Figure 4) suggest that, although caching can effectively deal with block access in some cases, strategies for more efficiently handling sparse accesses are also needed. Previous work [17] has thoroughly investigated advanced caching
techniques, while we propose orthogonal solutions that specifically target increasing the efficiency of sparse accesses. We evaluate our proposed techniques by using somewhat simpler caching strategies (standard LRU software caching and static partitioning) and sweeping the design space across a variety of input locality distributions.

Figure 5: Using a table configuration typical of industry-scale models [19, 20] and a range of batch sizes, the SparseLengthsSum (SLS) embedding table operation slows down significantly using SSD storage over DRAM.
Given their unique memory access patterns, storing embedding tables in SSD versus DRAM has a large impact on performance given the characteristics of the underlying memory systems. Figure 5 illustrates the performance of a single embedding table operation using DRAM versus SSD across a range of batch sizes. The embedding table has one million rows, an embedding vector dimension of 32, and 80 lookups per table, typical for industry-scale models such as Facebook's embedding-dominated recommendation networks [19, 20]. For an optimized DRAM-based embedding table operation, we analyze the performance of the SparseLengthsSum operator in Caffe2 [1]. As shown in Figure 5, compared to the DRAM baseline, accessing embedding tables stored in the SSD incurs three orders of magnitude longer latencies. This is a result of software overhead in accessing embedding tables over PCIe as well as the orders-of-magnitude lower read bandwidth in the underlying SSD system: 10K IOPS, or 10 MB/s of random read bandwidth, on the SSD versus 1 GB/s on DRAM. Thus, while SSD storage offers an appealing capacity advantage for growing industry neural recommendation models, there is significant room to improve the performance of embedding table operations.

While embedding tables enable recommendation systems to more accurately model user interests, as shown in Figure 1, embedding is only one component of end-to-end recommendation inference.
Figure 6: Performance degradation from using Flash-based embedding table operations is model dependent. Storing tables in SSD for WND, MTWND, DIEN, and NCF increases model latency by 1.01×, 1.01×, 1.09×, and 1.01× versus DRAM.

Thus, to understand the end-to-end performance impact of offloading embedding tables to SSD memory, we characterize the performance impact on recommendation inference over a representative variety of network model architectures. Our evaluations use eight open-source recommendation models [19] representing industry-scale inference use cases from Facebook, Google, and Alibaba [20, 22, 30, 43–45]. For the purposes of this study, models are clustered into two categories based on their respective performance characteristics: embedding-dominated and MLP-dominated. MLP-dominated models, such as Wide and Deep (WND), Multi-Task Wide and Deep (MTWND), Deep Interest (DIN), Deep Interest Evolution (DIEN), and Neural Collaborative Filtering (NCF), spend the vast majority of their execution time on matrix operations. On the other hand, embedding-dominated models, such as DLRM-RMC1, DLRM-RMC2, and DLRM-RMC3, spend the majority of their time processing embedding table operations. We refer the reader to [19] for detailed operator breakdowns and benchmark model characterizations.

Figure 6 shows the execution time of the eight recommendation models at a batch size of 64 when embedding tables are stored in DRAM and in SSD, respectively. The execution time for MLP-dominated models remains largely unaffected between the two memory systems. Compared to DRAM, storing tables in SSD for WND, MTWND, DIEN, and NCF increases model latency by 1.01×, 1.01×, 1.09×, and 1.01×. On the other hand, storing embedding tables in SSD instead of DRAM significantly impacts the execution time of embedding-dominated models. For instance, the execution time of embedding-dominated models such as DLRM-RMC1, DLRM-RMC2, and DLRM-RMC3 degrades by several orders of magnitude.

Given the performance characterization of individual embedding operations and end-to-end models when embeddings are stored in SSDs, we identify several opportunities for inference acceleration and optimization. First, the overwhelming majority of execution time in MLP-dominated models is devoted to matrix operations; thus SSD systems offer an exciting option for storing embedding tables in high-density storage substrates, lowering infrastructure costs for datacenter scale recommendation inference. While SSD storage is an appealing target for MLP-dominated models, there is significant room for performance improvement, particularly when embedding table operations are offloaded to SSDs for
embedding-dominated recommendation models. To bridge the performance gap, this paper proposes to use near data processing (NDP) by leveraging the existing compute capability of commodity SSDs. Previous work has shown that NDP-based SSD systems can improve performance across a variety of different application spaces such as databases and graph analytics [31, 34]. NDP solutions work particularly well when processing gather-reduce operations over large quantities of input data using lightweight computations. Embedding table operations follow this compute paradigm as well. NDP can help reduce round-trip counts and latency overheads in PCIe communication as well as improve SSD bandwidth utilization by co-locating compute with the Flash-based storage systems (more detail in Section 4).

In summary, the focus of this work is to demonstrate the viability of SSD-based storage for the MLP-dominated recommendation models and to customize NDP-based SSD systems for neural recommendation in order to unlock the advantages of SSD storage capacity for the embedding-dominated models.

4 RECSSD SYSTEM DESIGN

We present RecSSD, a near-data processing (NDP) solution for efficient embedding table operations on SSD memory. Compared to traditional SSD storage systems, RecSSD increases bandwidth to Flash memories by utilizing internal SSD bandwidth rather than external PCIe, greatly reduces unused data transmitted over PCIe by packing useful data together into returned logical blocks, and reduces command and control overheads in the host driver stack by reducing the number of I/O commands needed for the same amount of data. To maintain compatibility with the existing NVMe protocol and drivers, RecSSD is implemented within the FTL of the SSD, requiring no modifications to the hardware substrate and paving the way for datacenter scale deployment. This section describes the overall RecSSD design and implementation. First we outline how embedding operations are mapped to the FTL in SSD systems; next, we detail how RecSSD exploits temporal locality in embedding table operations to improve performance; and finally, we describe the feasibility of implementing RecSSD in real systems.
RecSSD is designed to accelerate embedding table operations for recommendation inference. In most high-level machine learning frameworks, these embedding operations are implemented as specific custom operators. These operators can be implemented using a variety of backing hardware/software technologies, typically DRAM-based data structures for conventional embedding operations. RecSSD implements embedding operations using SSD storage by moving computation into the SSD FTL, and on the host using custom NDP-based drivers within the higher-level framework operator implementation.

Given the large storage requirements, embedding table operations (e.g., SparseLengthsSum in Caffe2) span multiple pages within SSD systems. A standard FTL provides highly optimized software that supports individual page scheduling and maintenance; RecSSD operates on top of request queues and data buffers designed for individual Flash page requests and operations. In order to support multi-page SparseLengthsSum (SLS) operations, we add a scheduling
layer – with accompanying buffer space and request queues – on top of the existing page scheduling layer. The proposed SLS scheduling layer feeds individual page requests from a set of in-flight SLS requests into the existing page-level scheduler to guarantee high throughput across SLS requests. The existing page-level scheduling proceeds as normal to ensure page operations maximize the available internal memory parallelism.

Figure 7 details the proposed RecSSD design, which augments SSD systems with NDP to improve internal Flash memory bandwidth and the overall performance of embedding table operations.
Data structures
In particular, to support NDP SLS, RecSSD adds two major system components: a pending-SLS-request buffer and a specialized embedding cache. These components are colored red in Figure 7.

Each SLS operation allocates an entry in the pending SLS request buffer. Each entry contains five major elements: (Input Config) buffer space to store SLS configuration data passed from the host, (Status) various data structures storing reformatted input configuration and counters to track completion status, (Pending Flash Page Requests) a queue of pending Flash page read requests to be submitted to the low-level page request queues, (Pending Host Page Requests) a queue of pending result logical block requests to be serviced to the host upon completion, and (Result Scratchpad) buffer space for those result pages.
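A sketch of one pending-SLS-request entry is shown below; the field names are hypothetical and simply mirror the five elements above. The real implementation is C firmware with statically allocated buffers, so Python is used here purely for clarity.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SLSRequestEntry:
    # (1) Input Config: raw configuration data DMA'd from the host.
    input_config: bytes = b""
    # (2) Status: reformatted per-page input lists plus completion counters.
    ids_by_page: dict = field(default_factory=dict)  # page -> [(input, result)]
    pages_outstanding: int = 0
    # (3) Pending Flash page reads, fed into the low-level request queues.
    pending_flash_reads: deque = field(default_factory=deque)
    # (4) Pending result logical-block requests to return to the host.
    pending_host_pages: deque = field(default_factory=deque)
    # (5) Result Scratchpad: partially accumulated result embeddings.
    result_scratchpad: dict = field(default_factory=dict)  # result id -> vector
```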
Initiating embedding request
When the FTL receives an SLS request in the form of a write-like NVMe command, the FTL allocates an SLS request entry. The FTL then triggers the DMA of the configuration data from the host using the NVMe host controller (step 1a). Upon receipt of that configuration data, the FTL processes the data, initializing completion status counters and populating custom data structures containing the reformatted input data (populating element 2, Status). This processing step computes which Flash pages must be accessed and separates input embeddings by Flash page, such that the per-page processing computation can easily access its own input embeddings. During this scan of the input data, a fast path may also check for the availability of input embeddings in an embedding cache (discussed later in this section), avoiding Flash page read requests (step 2a), and otherwise placing those Flash page requests in the pending queue (step 2b). Upon completion of the configuration processing, the request entry is marked as configured, and pending Flash page requests may be pulled and fed into the low-level page request queues (step 3a). If the page already exists within the page cache, the page may be directly processed (step 3b). When the FTL receives an SLS read-like NVMe command (asynchronous with steps 2-5), it searches for the associated SLS request entry and populates the pending host page request queue (step 1b).
Issuing individual Flash requests
At the top level of the FTL scheduler polling loop, the scheduler maintains a pointer to an entry in the SLS request buffer. Before processing the low-level request queues, the scheduler fills the queues from the current SLS entry's pending Flash page request queue. The scheduler then performs an iteration of processing the low-level page request queues, and increments the SLS request buffer pointer regardless of completion, such that requests are handled fairly in a round-robin fashion.
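A minimal sketch of that round-robin policy follows. The `low_level_queues` object and its methods are hypothetical stand-ins for the existing page-level scheduler; the key property is that the SLS pointer advances every iteration regardless of completion.

```python
def scheduler_iteration(sls_entries, ptr, low_level_queues):
    """One polling-loop pass of the SLS scheduling layer (sketch)."""
    if sls_entries:
        entry = sls_entries[ptr % len(sls_entries)]
        # Feed this entry's pending Flash page reads to the page-level queues.
        while entry.pending_flash_reads and low_level_queues.has_space():
            low_level_queues.submit(entry.pending_flash_reads.popleft())
        # Advance whether or not the entry finished: round-robin fairness.
        ptr += 1
    low_level_queues.process()  # existing page-level scheduling runs as normal
    return ptr
```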
Returning individual Flash requests
Upon completion of a Flash page read request associated with an SLS request (step 4), the extraction and reduction computation is triggered for that page. The embeddings required for the request that reside in that page are read from the page buffer entry and accumulated into the appropriate result embedding in the result buffer space for that SLS request (step 5). The reformatted input configuration allows the page processing function to quickly index which embeddings need to be processed and appropriately update the completion counters.
Returning embedding requests
Again at the top level of the FTL scheduler polling loop, the scheduler checks for completed host page requests within an SLS request. If completed pages are ready, and the NVMe host controller is available, the scheduler triggers the controller to DMA the result pages back to the host (step 6). Upon completion of all result pages in an SLS request, the SLS request entry is deallocated. The NVMe host controller automatically tracks completed pages and completes NVMe host commands.
Multi-threading and Pipelining
Aside from the base NDP implementation, there are a number of conventional optimizations that can be applied on top of the NDP Flash operation. Multi-threading and software pipelining can be used to overlap NDP SLS I/O operations with the rest of the neural network computation. For this we use a threadpool of SLS workers to fetch embeddings and feed post-SLS embeddings to neural network workers. We match our SLS worker count to the number of independent available I/O queues in our SSD driver stack. We then match our neural network workers to the available CPU resources.
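A sketch of this producer/consumer pipeline under stated assumptions: `incoming_batches`, `ndp_sls_fetch`, and `run_mlp_layers` are hypothetical stand-ins for the batch source, the NDP SLS I/O call, and the dense compute stage.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

NUM_SLS_WORKERS = 4   # matched to independent I/O queues in the driver stack
NUM_NN_WORKERS = 4    # matched to available CPU cores
pooled = queue.Queue(maxsize=2 * NUM_SLS_WORKERS)

def sls_worker(batch):
    # Issue the NDP SLS I/O and hand the pooled embeddings downstream.
    pooled.put((batch, ndp_sls_fetch(batch)))

def nn_worker():
    while True:
        batch, embeddings = pooled.get()
        if batch is None:                  # sentinel: no more batches
            return
        run_mlp_layers(batch, embeddings)  # dense compute overlaps with I/O

with ThreadPoolExecutor(NUM_SLS_WORKERS) as io_pool, \
     ThreadPoolExecutor(NUM_NN_WORKERS) as nn_pool:
    for _ in range(NUM_NN_WORKERS):
        nn_pool.submit(nn_worker)
    list(io_pool.map(sls_worker, incoming_batches))  # wait for all SLS I/O
    for _ in range(NUM_NN_WORKERS):
        pooled.put((None, None))           # release the neural network workers
```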
DRAM caching is another technique that has been previously studied [17] in the context of recommendation inference serving. With our NDP implementation, there is the option for both host DRAM caching and SSD-internal DRAM caching.

Host-side DRAM Caching
Because our NDP SLS operator returns accumulated result embeddings to the host, we cannot use our workload's existing NDP SLS requests to populate a host DRAM embedding cache. In order to still make use of available host DRAM, we implement a static partitioning technique utilizing input data profiling, which partitions embedding tables such that frequently accessed embeddings are stored in host DRAM, while infrequently used embeddings are stored on the SSD. This solution is motivated by the characterization in Section 3.1, showcasing the power-law distribution of page accesses. Because there exist relatively few highly accessed embeddings, static partitioning becomes a viable solution. With this feature, our system requests the SSD embeddings using our NDP function, and post-processes the returned partial sums to include embeddings contained in the host DRAM cache.
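A sketch of this static-partitioning flow, assuming per-row access counts from profiling and reusing the hypothetical `ndp_sls_fetch` stand-in for the NDP call; the host adds its cached rows to the partial sum returned from the SSD.

```python
import numpy as np

HOT_SET_SIZE = 2048  # per-table host-DRAM budget, matching our evaluation

def build_partition(access_counts, table):
    # Profile-driven split: the most frequently accessed rows go to host DRAM.
    hot_ids = np.argsort(access_counts)[-HOT_SET_SIZE:]
    return {int(i): table[i] for i in hot_ids}

def sls_lookup(hot_cache, ids, dim):
    cold = [i for i in ids if i not in hot_cache]
    partial = ndp_sls_fetch(cold, dim)      # SSD accumulates its rows via NDP
    for i in ids:
        if i in hot_cache:
            partial += hot_cache[i]         # post-process: add host-side rows
    return partial
```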
SSD-side DRAM Caching
For host DRAM caching, it is entirely feasible to use a large, fully associative LRU software cache. However, for SSD-internal DRAM caching, we must more carefully consider the implementation overheads of our software caching techniques. The FTL runs on a relatively simple CPU with limited DRAM space. The code and libraries available are specifically designed for embedded systems, such that the code is compact, has low computation overhead, and delivers more consistent performance. The SSD FTL is designed without dynamic memory allocation and garbage collection. When implementing any level of associativity, the cost of maintaining LRU or pseudo-LRU information on every access must be balanced against cache hit-rate gains. For the current evaluation we implement a direct-mapped SSD-side DRAM cache.

Figure 7: The lifetime of an SSD-based SLS operator. The addition of an SLS request buffer and a specialized embedding cache supports the multi-page operation.
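The appeal of the direct-mapped design is that a lookup is one index computation and one tag compare, with no replacement metadata to update on each access. A sketch follows; the slot count is an illustrative assumption.

```python
NUM_SLOTS = 8192  # assumed capacity of the SSD-internal embedding cache

tags = [None] * NUM_SLOTS   # embedding ID currently held by each slot
data = [None] * NUM_SLOTS   # cached embedding vector for each slot

def cache_lookup(embedding_id):
    slot = embedding_id % NUM_SLOTS      # direct-mapped: one candidate slot
    if tags[slot] == embedding_id:
        return data[slot]                # hit: no LRU metadata to update
    return None                          # miss: caller issues a Flash read

def cache_fill(embedding_id, vector):
    slot = embedding_id % NUM_SLOTS
    tags[slot], data[slot] = embedding_id, vector  # evict the prior occupant
```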
Our custom interface maintains complete compatibility with the existing NVMe protocol, utilizing a single unused command bit to indicate embedding commands. Other than this bit, our interface simply uses the existing command structure of traditional read/write commands. Embedding processing parameters are passed to the SSD with a special write-like command, which initiates embedding processing. A subsequent read-like command gathers the resulting pages. The parameters passed include embedding vector dimensions such as attribute size and vector length, the total number of input embeddings to be gathered, the total number of resulting embeddings to be returned, and a list of (input ID, result ID) pairs specifying the input embeddings and their accumulation destinations. Adding a restriction that this list be sorted by input ID enables more efficient processing on the SSD, which contains a much less powerful CPU than the host system. The configuration-write command and result-read command are associated with each other internally in the SSD by embedding a request ID into the starting logical block address (SLBA) of the requests. The SLBA is set to the starting address of the targeted embedding table plus the unique request ID. By assuming a minimum table size and alignment constraints, the two inputs can be separated within the SSD using the modulus operator.

We also note that, in addition to maintaining compatibility with the existing NVMe protocol, by implementing support for embedding table operations purely through software within the SSD FTL, we ensure RecSSD is fully compatible with existing commodity SSD hardware. This method of implementation relies on the lightweight nature of the required computation, such that the SSD microprocessor does not become overly delayed in its scheduling functions by performing the extra reduction computation.
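A sketch of the SLBA-based pairing described above; the alignment constant is chosen here only for illustration.

```python
TABLE_ALIGN = 1 << 20  # assumed minimum table size/alignment, in logical blocks

def encode_slba(table_base_lba, request_id):
    # table_base_lba is TABLE_ALIGN-aligned and request_id < TABLE_ALIGN,
    # so both values survive in a single NVMe SLBA field.
    assert table_base_lba % TABLE_ALIGN == 0 and request_id < TABLE_ALIGN
    return table_base_lba + request_id

def decode_slba(slba):
    request_id = slba % TABLE_ALIGN       # modulus separates the two inputs
    return slba - request_id, request_id  # (table base, request ID)

# The config-write and result-read commands carry the same SLBA, letting the
# FTL associate them with one SLS request entry.
assert decode_slba(encode_slba(5 * TABLE_ALIGN, 37)) == (5 * TABLE_ALIGN, 37)
```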
5 EXPERIMENTAL METHODOLOGY

This section describes the methodology and experimental setup used to evaluate the proposed RecSSD design. Here we summarize the OpenSSD platform, Micron UNVMe, recommendation models, and input traces used. Additional details can be found in the Appendix.
OpenSSD
In order to evaluate the efficacy of offloading the SLS operator onto the SSD FTL, we implement a fully functional NDP SLS operator in the open source Cosmos+ OpenSSD system [4]. The development platform, Cosmos+ OpenSSD, is a fully functional NVMe Flash SSD with a 2TB capacity and a customizable Flash controller and FTL firmware. In order to provide a feasible solution for datacenter scale deployment, we implement RecSSD within the FTL firmware; the interface is compatible with the existing NVMe protocol, requiring no hardware changes.
Micron UNVMe
In addition to the NVMe-compatible OpenSSD system, the RecSSD interface is implemented within the Micron UNVMe driver library [10]. We modify the UNVMe driver stack to include two additional commands, built on top of the existing command structures for NVMe read/write commands and distinguished by setting an additional unused command bit, as described in Section 4. The command interface enables flexible input data and command configurations, while maintaining compatibility with the existing NVMe host controller. The UNVMe driver makes use
of a low latency userspace library, which polls for the completion of NVMe read commands, and uses the maximum number of threads/command queues.
Neural recommendation models
To evaluate RecSSD, we use a diverse set of eight industry-representative recommendation models provided in DeepRecInfra [19], implemented in Python using Caffe2 [1]. In order to evaluate the performance of end-to-end recommendation models on real systems, we integrate the SparseLengthsSum operations (embedding table operations in Caffe2) with the custom NDP solution. To offload embedding operations to RecSSD, we design a low-overhead Python-level interface using ctypes, which allows us to load the modified UNVMe driver as a shared library and call NVMe and NDP SLS I/O commands. In the future these operations could be ported into a custom Caffe2 operator function and compiled along with the other Caffe2 C++ binaries.
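A sketch of what such a ctypes bridge might look like; the exported symbol `flashrec_sls` and its signature are illustrative assumptions, not the actual libflashrec API.

```python
import ctypes
import numpy as np

# Load the modified UNVMe driver as a shared library (path from the artifact).
flashrec = ctypes.CDLL("./models/libflashrec.so")

def ndp_sls(table_id, ids, num_results, dim):
    """Illustrative wrapper; `flashrec_sls` is a hypothetical exported symbol."""
    ids64 = np.ascontiguousarray(ids, dtype=np.int64)
    out = np.zeros((num_results, dim), dtype=np.float32)
    flashrec.flashrec_sls(
        ctypes.c_int(table_id),
        ids64.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)),
        ctypes.c_int(len(ids64)),
        out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        ctypes.c_int(num_results),
        ctypes.c_int(dim),
    )
    return out
```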
Input traces and Performance Metrics
In addition to the recommendation models themselves, we instrument the networks with synthetically generated input traces. We instrument the open-source synthetic trace generators from Facebook's DLRM [30] with the locality analysis from industry-scale recommendation systems shown in Figure 4. The synthetic trace generator is instrumented with likelihood distributions for input embeddings across stack distances of previously requested embedding vectors. We generate exponential distributions based on a parameter value, K. Sweeping K generates input traces with varying degrees of locality; for instance, setting K equal to 0, 1, and 2 generates traces with 13%, 54%, and 72% unique accesses respectively [17, 20]. Given the high cache miss rates and our locality analysis, we assume a single embedding vector per SSD page of 16KB. For the evaluation results, we assume embedding tables have 1 million vectors and host-side DRAM caches store up to 2K entries per embedding table. Because our prototype limits us to single-model, single-SSD systems, we do not focus our results on latency-bounded throughput, but rather on direct request latencies, a critical metric for determining the performance viability of SSD based recommendation. We average latency results across many batches, ensuring steady-state behavior.
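A sketch of locality-controlled trace generation in this spirit: an exponential distribution over stack distances decides whether a lookup reuses a recently requested ID or draws a fresh one. The exact parameterization below is an illustrative assumption, not the DLRM generator's.

```python
import numpy as np

def generate_trace(num_lookups, table_rows, k, window=512, rng=np.random):
    """Smaller k -> heavier reuse of recently requested IDs (more locality)."""
    lru, trace = [], []
    for _ in range(num_lookups):
        # Exponential stack-distance distribution; larger k flattens it,
        # pushing more accesses past the reuse window into fresh IDs.
        dist = int(rng.exponential(scale=32.0 * 2.0 ** k))
        if dist < min(len(lru), window):
            vec_id = lru.pop(dist)                 # temporal reuse
        else:
            vec_id = int(rng.randint(table_rows))  # fresh (likely unique) ID
        lru.insert(0, vec_id)
        trace.append(vec_id)
    return trace
```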
Physical Compute Infrastructure

All experiments are run on a quad-core Intel Skylake desktop machine with 64GB (4 x 16GB) of G.SKILL TridentZ Series 288-pin DDR4-3200 (PC4 25600) SDRAM (model F4-3200C14Q-64GTZ). DRAM has nanosecond-scale latencies and tens of GB/s of throughput. Our prototype SSD system supports 10K IOPS per channel with 8 channels and a page size of 16KB, leading to a maximum sequential read throughput of just under 1.4GB/s. Newer SSD systems will have higher throughput. Single page access latencies are in the tens to hundreds of microseconds range.
6 EVALUATION

Here we present empirical results evaluating the performance of RecSSD. Overall, the results demonstrate that RecSSD provides up to 2× speedup over baseline SSD for recommendation inference. This section first analyzes the fundamental tradeoffs of RecSSD using micro-benchmarks based on embedding table operations. Following the micro-benchmark analysis, the section compares the performance of end-to-end recommendation models between baseline SSD systems and RecSSD. Expanding upon this analysis, we present performance tradeoffs between baseline SSD systems and RecSSD using both host-side and SSD-side DRAM caching in order to exploit temporal locality in embedding table accesses. Finally, the section conducts a sensitivity study on the impact of individual recommendation network architectural parameters on RecSSD performance. The sensitivity analysis provides insight into RecSSD's performance on future recommendation models.

Figure 8: The standalone performance of the SLS embedding operator. Performance is shown for both sequential and strided access patterns, using both conventional SSD interfaces and NDP interfaces, on a variety of batch sizes.
Figure 8 presents the performance of embedding table operations (i.e., SparseLengthsSum in Caffe2 [2]). For RecSSD, the execution time is categorized into four components (Config Write, Config Process, Translation, and Flash Read) over a range of batch sizes.
Config Write and Config Process represent the time taken to transfer configuration data to the SSD and the time to process the configuration, respectively; after these steps, internal data structures are populated and Flash page requests begin issuing. Translation represents the time spent processing returned Flash pages, extracting the necessary input embedding vectors, and accumulating the vectors into the corresponding result buffer space. Flash Read indicates the time in which the FTL is managing and waiting on Flash memory operations.

It is difficult to compare the computational throughput of Translation independently with the I/O bandwidth of Flash, as the computation is synchronously and tightly coupled with the I/O scheduling mechanisms within the FTL. With hardware modification this computation could be decoupled and made parallel. However, we can indirectly observe the bottleneck by observing where the dominant time is spent in the FTL, whether in translation computation or Flash read operations.

Following the characterization from Section 3.1, we study two distinct memory access patterns: SEQ and STR. The Sequential (SEQ) memory access pattern represents use cases where embedding table IDs are contiguous. This is unlikely to happen in datacenter-scale recommendation inference applications, as shown in Figure 4, but represents use cases with extremely high page locality. The Random (STR) memory access patterns are generated with strided embedding table lookup IDs and are representative of access patterns where each vector accessed is located on a unique Flash page.
Figure 9: NDP alone provides up to 7× performance improvement for some full models given a simple, naive configuration.

Given the large diversity in recommendation use cases, as evidenced by the variety of recommendation model architectures [19], the two memory access patterns allow us to study the performance characteristics of RecSSD across a wide variety of use cases. Furthermore, while current recommendation use cases exhibit sparse access patterns, future optimizations in profiling and restructuring embedding tables may increase locality.

Performance with low locality embedding accesses

Under the Random memory lookup access pattern, RecSSD achieves up to a 4× performance improvement over the baseline SSD. This performance improvement comes from increased memory-level parallelism. RecSSD increases memory-level parallelism by concurrently executing Flash requests for each embedding operation, increasing utilization of the internal SSD bandwidth. As shown in Figure 8, roughly half the time in RecSSD's FTL is spent on Translation. Given the limited hardware capability of the 1GHz, dual-core ARM A9 processor of the Cosmos+ OpenSSD system [4], we expect that with faster SSD microprocessors or custom logic, the Translation time could be significantly reduced.
Performance with high locality embedding accesses
Compared to the baseline SSD system using conventional NVMe requests, Sequential access patterns with high spatial locality result in poor NDP performance. Compared to random or low locality access patterns, sequential or high locality embedding accesses touch fewer Flash pages but require commensurate compute resources to aggregate embedding vectors. In the baseline system, the SSD page cache holds pages while embedding vector requests are sequentially streamed through the system and accumulated by the host CPU. While RecSSD also accesses fewer Flash pages, the embedding vectors are aggregated using the dual-core ARM A9 processor on the Cosmos+ OpenSSD system; this accounts for over half the execution time (Translation) as shown in Figure 8. With sequential accesses, the benefit of aggregating on the faster, server-class host Intel CPU outweighs the overhead of issuing multiple NVMe commands. We anticipate that more sophisticated processors on the NDP system would largely eliminate this slight performance degradation for RecSSD.
In addition to individual embedding table operations, here we evaluate RecSSD across a selection of industry-representative recommendation models. To start, we showcase the raw potential of NDP by presenting the simplest, naive experimental configuration. Figure 9 presents the relative speedup of RecSSD over a conventional SSD baseline, without operator pipelining and caching techniques, and using randomly generated input indices. We observe that many models exist where NDP provides no observable benefits, and for models where performance is limited by embedding operations and SSD latencies, NDP can provide substantial assistance with up to 7× speedup. The maximum speedup across models shown here is higher than that of the individual embedding operations (Figure 8) due to differences in underlying model parameters such as feature size and indices per lookup, as discussed in Section 6.4.

In addition to the end-to-end model results, we evaluate the performance of RecSSD with operator pipelining and caching. These optimization techniques, as presented in Section 4, are applied on top of both RecSSD and conventional SSD systems.

Figure 10: Relative full-model performance improvement including caching techniques. The percentages above each bar represent the hit rate of either the SSD cache (a-c) or the host partition (d-f) for RecSSD. The baseline LRU cache hit rates always follow the inverse of the locality distribution, with 84%, 44%, and 28% hits in the cache corresponding to K equal to 0, 1, and 2, respectively.

Figure 10(a-c) presents relative speedup results for RecSSD with just SSD-side caching against the conventional SSD baseline with host-side caching. RecSSD utilizes a large but direct-mapped cache within the SSD DRAM while the baseline utilizes a fully associative LRU cache within host DRAM. Batch sizes are swept between 1 and 32, along with the three input trace locality conditions K = 0, 1, 2. Hit rates for RecSSD's SSD DRAM cache are labeled above each speedup bar. The baseline LRU cache hit rates follow the inverse of the locality distribution, with 84%, 44%, and 28% hits in the cache corresponding to K equal to 0, 1, and 2 respectively. Note, the LRU cache hit rates span the diverse set of embedding access patterns from the initial characterization of production-scale recommendation models shown in Figure 4.

With high locality (i.e., low K), conventional SSD systems achieve higher performance than RecSSD. On the other hand, with low locality RecSSD outperforms the conventional baseline. This is because the direct-mapped cache hit rate cannot match that of the more complex, fully associative LRU cache on the host system, exemplified in the high batch size runs for RM1. Furthermore, RM2 has lower SSD cache hit rates compared to RM1/3, a result of the larger number of embedding lookups required per request and temporal locality occurring across requests rather than lookups. Even so, without host DRAM caching, RecSSD outperforms the baseline by up to 1.5× for lower locality traces, where many SSD pages must be accessed and the benefits of increased internal bandwidth shine.

Figure 10(d-f) presents relative speedup results for RecSSD using static table partitioning as well as SSD caching. With static table partitioning, we make use of available host DRAM by statically placing the most frequently used embeddings within the host DRAM cache, as detailed in Section 4. The hit rates labeled above each bar represent the hit rates of RecSSD in the statically partitioned host DRAM cache, not the SSD cache.

As with the conventional SSD baseline, static partitioning helps in leveraging the available host DRAM. For high temporal locality, however, it cannot match the hit rate of the fully associative LRU cache. With higher batch sizes as well as higher indices per request (seen in RM2), the hit rate asymptotically approaches 25%, the size of the static partition relative to the used ID space. Overall, Figure 10 shows that with static partitioning, RecSSD achieves a 2× performance improvement over the conventional SSD baseline. This occurs when the baseline host LRU cache has a relatively low hit rate such that many SSD pages must be accessed, while RecSSD is able to achieve comparable host DRAM hit rates with static partitioning.

In general, the above results show that the advantages of RecSSD shine when pages must be pulled from the SSD, and when the host-level caching strategies available for RecSSD (static partitioning) are of comparable effectiveness to the baseline LRU software cache. Although RecSSD shows diminishing returns with improved caching and locality, we note that because RecSSD is fully compatible with the existing NVMe interface, it can be employed in tandem with conventional strategies and switched based on the embedding table locality properties.
In this section we more closely examine the impact of the model parameters that differentiate the performance of our benchmark models. Table 1 details the parameter space of RM1/2/3. We specifically note that absolute table size does not impact our results. Growing table sizes do provide motivation to move from capacity-constrained DRAM to Flash SSDs; however, embedding lookup performance is dependent on access patterns, not absolute table size.
Table 1: Differentiating benchmark parameters.
Benchmark   Feature Size   Indices   Table Count
RM1         32             80        8
RM2         64             120       32
RM3         32             20        10

Figure 11: Examining the impact of model parameters on full model executions: (a) feature size and quantization; (b) indices and table count.

In Figure 11a we see how feature size and quantization, which affect the size of embedding vectors relative to the page size, show decreasing relative performance as this ratio grows. This is because the baseline is able to make more efficient use of block accesses as the lowest unit of memory access approaches the size of a memory block, while RecSSD must perform more computation on the SSD microprocessor per page accessed from Flash. In Figure 11b we see that although increasing table count diminishes performance, this quickly becomes outscaled by increases in performance from the increased indices per lookup. The performance loss from increasing table count is due to the implementation of our NDP interface. Because a single NDP call handles a single table, the amortization of command overheads is on a per-table basis. On the other hand, increasing the number of indices per lookup increases the amortization of this control overhead as well as the value of accumulating these embeddings on the SSD, where only one vector is sent to the host for all the indices accumulated in a single lookup.
7 RELATED WORK

SSD systems
Recent advances in flash memory technology have made it an increasingly compelling storage system for at-scale deployment. Compared to disk based solutions, SSDs offer 2.6× and 3.2× higher bandwidth per Watt and bandwidth per dollar, respectively [12]. Furthermore, given their high density and energy efficiency, SSDs are being used as datacenter DRAM replacements as well [12]. In fact, prior work from Google and Baidu highlights how modern SSD systems are being used for web-scale applications in datacenters [31, 34]. Furthermore, given recent advances in Intel's Optane technology, balancing DRAM-like latency and byte-addressability with SSD-like density, we anticipate the range of applications that leverage SSD based storage systems to widen [24]. In fact, training platforms for terabyte-scale personalized recommendation models rely heavily on SSD storage capabilities for efficient and scalable execution [41].

In order to enable highly efficient SSD execution, modern storage solutions rely on programmable memory systems [15]. Leveraging this compute capability, there has been much work on both software and hardware solutions for near data processing in SSDs for other datacenter applications [14, 18, 25, 31, 35, 36, 38, 39]. Previous works that target more general SSD NDP solutions have relied on hardware modifications, complex programming frameworks, and heavily modified driver subsystems to support the performance requirements of more complex and general computational tasks. Our system trades off this generality for simplicity and application-specific performance and cost efficiency.

Accelerating recommendation inference
Given the ubiquity of AI and machine learning workloads, there have been many proposals for accelerating deep learning workloads. In particular, recent work illustrates that recommendation workloads dominate the AI capacity in datacenters [20]. As a result, recent work proposes accelerating neural recommendation. For example, the authors in [23, 27] propose a customized memory management unit for AI accelerators (i.e., GPUs) in order to accelerate address translation operations across multi-node hardware platforms. Given the low compute intensity of embedding table operations, recent work also explores the role of near memory processing for Facebook's recommendation models [26]. Similarly, researchers have proposed the application of flash memory systems to store the large embedding tables found in Facebook's recommendation models [17], exploring advanced caching techniques to alleviate challenges with large flash page sizes. These techniques can be used in tandem with RecSSD. In this paper, we explore the role of combining near data processing and NAND flash memory systems for at-scale recommendation in order to reduce overall infrastructure cost. Furthermore, we provide a real system evaluation across a wide collection of recommendation workloads [19].
8 CONCLUSION

In this paper we propose RecSSD, a near data processing solution customized for neural recommendation inference. By offloading computations for key embedding table operations, RecSSD reduces round-trip time for data communication and improves internal SSD bandwidth utilization. Furthermore, with intelligent host-side and SSD-side caching, RecSSD enables high performance embedding table operations. We demonstrate the feasibility of RecSSD by implementing it in a real system using server-class CPUs and an OpenSSD-compatible system with the Micron UNVMe driver library. RecSSD improves embedding operation latency by up to 4× and reduces end-to-end neural recommendation inference latency by up to 2× compared to off-the-shelf SSD systems, while offering comparable performance to DRAM-based memories for MLP-dominated models. As a result, RecSSD enables highly efficient and scalable datacenter neural recommendation inference.

ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for their thoughtful comments and suggestions. We would also like to thank Glenn Holloway for his valuable technical support. This work was sponsored in part by National Science Foundation Graduate Research Fellowships (NSF GRFP), and the ADA (Applications Driving Architectures) Center.
A ARTIFACT APPENDIX

A.1 Abstract
RecSSD is composed of a number of open-sourced artifacts. First, we implement a fully-functional NDP SLS operator in the open source Cosmos+ OpenSSD system [4], provided in the RecSSD-OpenSSDFirmware repository [7]. To maintain compatibility with the NVMe protocols, the RecSSD interface is implemented within Micron's UNVMe driver library [10], provided in the RecSSD-UNVMeDriver repository [9]. To evaluate RecSSD, we use a diverse set of eight industry-representative recommendation models provided in DeepRecInfra [19], implemented in Python using Caffe2 [1] and provided in the RecSSD-RecInfra repository [8]. In addition to the models themselves, we instrument the open-source synthetic trace generators from Facebook's open-sourced DLRM [30] with our locality analysis from production-scale recommendation systems, also included in the RecSSD-RecInfra repository.
A.2 Artifact check-list (meta-information)

• Compilation: GCC, Python3, PyTorch, Caffe2, Xilinx SDK 2014.4
• Model: DeepRecInfra
• Run-time environment: Ubuntu 14.04
• Hardware: Cosmos+ OpenSSD, two Linux desktop machines, remote PDU
• How much time is needed to prepare workflow (approximately)?:
• How much time is needed to complete experiments (approximately)?: 10+ hours
• Publicly available?: Software will be open-sourced and publicly available. Required hardware platform is potentially still purchasable through original developers.
• Code licenses (if publicly available)?: GNU GPL
A.3 Description
A.3.1 How to access.
RecSSD is provided through a number of publicly available GitHub repositories [7–9], as well as a publicly available archive on Zenodo, DOI: 10.5281/zenodo.4321943.
A.3.2 Hardware dependencies.
Cosmos+ OpenSSD system [4], two Linux desktop-class machines, and a remote PDU for a fully remote workflow.
A.3.3 Software dependencies.
Xilinx SDK 2014.4 for programming the OpenSSD; the Cosmos+ OpenSSD FTL firmware and controller bitstream; and Python3, PyTorch, and Caffe2 for running the recommendation models.
A.3.4 Models.
We use the recommendation model benchmarks from DeepRecInfra [19] and the trace generation from Facebook's open-sourced DLRM [30].
A.4 Installation
To set up the SSDDev machine, start by downloading the Cosmos+ OpenSSD software available on their GitHub [3]. You will need to install Xilinx SDK 2014.4 and follow the instructions in their tutorial [5] to set up a project for the OpenSSD board. For RecSSD, we use the prebuilt bitstream and its associated firmware. After setting up the project, replace the ./GreedyFTL/src/ directory with the code from the RecSSD-OpenSSDFirmware GitHub repository. The OpenSSD tutorial contains detailed instructions on running the firmware and on the physical setup of the hardware.

To set up the SSDHost machine, download and make the RecSSD-UNVMeDriver repository. This repository provides a user-level driver library that connects the RecSSD-RecInfra recommendation models to the OpenSSD device. Once the SSDHost has been booted with the OpenSSD running, use lspci to detect the PCIe device identifier of the board, and use unvme-setup bind PCIEID to attach the driver to that specific device. Make note of ./test/unvme/libflashrec.so, which must later be copied into RecSSD-RecInfra so that the Python3 runtime can load and run the driver functions needed by our NDP techniques.

Next, download the RecSSD-RecInfra repository and copy the libflashrec.so file into ./models/libflashrec.so. Make sure to download and install Python3 and PyTorch [6].
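Because the Python models load libflashrec.so at run time, a quick way to sanity-check the setup is to load the library with ctypes, as sketched below. The commented call is hypothetical: the actual exported symbols and their signatures are defined in the RecSSD-UNVMeDriver repository and should be taken from its headers.

```python
import ctypes

# Load the user-level driver library after copying it into ./models/.
# Raises OSError if the path is wrong or the driver was not built.
flashrec = ctypes.CDLL("./models/libflashrec.so")

# Hypothetical usage only: something of the following shape would submit
# an offloaded embedding lookup, but the real symbol names and argument
# types must be taken from the RecSSD-UNVMeDriver headers.
#
#   flashrec.some_ndp_lookup(table_id, indices_ptr, num_indices, out_ptr)
```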
A.5 Experiment workflow
Detailed walk-throughs of the required technical steps are documented within the individual repositories and the OpenSSD tutorial [5]. At a high level, the expected workflow is as follows:
(1) With the SSDHost machine powered off, use the Xilinx SDK on the SSDDev machine to launch the FTL firmware on the OpenSSD.
(2) Power on and boot the SSDHost machine. Connect the UNVMe driver library to the device through unvme-setup bind.
(3) Run ./models/input/create_dist.sh within RecSSD-RecInfra to generate the desired synthetic locality patterns for the input traces.
(4) Run the Python-based experiment sweep scripts within RecSSD-RecInfra's ./models/ directory to run the various recommendation models using either the baseline SSD interface or our NDP interface.
A.6 Evaluation and expected results
Most of our results are reported as inference latency, output by scripts run on the SSDHost machine. We compare relative latency results across a large number of batches in order to capture regular, steady-state behavior. Figure 10 presents expected results for the important RM1, RM2, and RM3 models, while Figure 11 presents results for an RM3-like model as specific model parameters are tuned.

Figure 8 reports breakdowns of the time spent within the FTL for NDP requests, using microbenchmarks from the RecSSD-UNVMeDriver repository. To reproduce these results, run ./test/unvme/unvme_embed_test. Unlike the model latency results, these measurements are performed within the FTL and reported directly through output to the SDK; they must therefore be recorded from the SDK running on the SSDDev machine.

Figures 3 and 4 use proprietary industry data and are not reproducible using our open-sourced infrastructure.
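As a simple aid for the latency comparisons described above, per-batch latencies can be summarized after discarding warm-up batches, as in the sketch below; the warm-up cutoff and the percentile reported are our assumptions rather than values prescribed by the scripts.

```python
import numpy as np

def steady_state_latency(per_batch_ms, warmup=10):
    """Drop warm-up batches, then summarize the steady-state tail."""
    tail = np.asarray(per_batch_ms[warmup:], dtype=np.float64)
    return tail.mean(), np.percentile(tail, 95)

# Toy input: early batches are slower while caches warm up.
latencies = [9.0, 7.5, 6.1] + [4.2, 4.0, 4.3, 4.1] * 50
mean_ms, p95_ms = steady_state_latency(latencies, warmup=3)
print(f"steady-state mean {mean_ms:.2f} ms, p95 {p95_ms:.2f} ms")
```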
REFERENCES
[1] [n.d.]. Caffe2. https://caffe2.ai/.
[2] [n.d.]. Caffe2 Operator Catalogue. https://caffe2.ai/docs/operators-catalogue.html.
[3] [n.d.]. Cosmos+ OpenSSD GitHub repository.
[4] [n.d.]. Cosmos+ OpenSSD platform.
[5] [n.d.]. Cosmos+ OpenSSD tutorial.
[6] [n.d.]. PyTorch. https://pytorch.org/.
[7] [n.d.]. RecSSD-OpenSSDFirmware GitHub repository.
[8] [n.d.]. RecSSD-RecInfra GitHub repository.
[9] [n.d.]. RecSSD-UNVMeDriver GitHub repository.
[10] [n.d.]. UNVMe: Micron user-space NVMe driver library.
[11] Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse, and Rina Panigrahy. 2008. Design Tradeoffs for SSD Performance. In USENIX 2008 Annual Technical Conference (Boston, Massachusetts) (ATC '08). USENIX Association, USA, 57–70.
[12] David G. Andersen and Steven Swanson. 2010. Rethinking Flash in the Data Center. IEEE Micro (2010).
[14] Jaeyoung Do, Yang-Suk Kee, Jignesh M. Patel, Chanik Park, Kwanghyun Park, and David J. DeWitt. 2013. Query Processing on Smart SSDs: Opportunities and Challenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (New York, New York, USA) (SIGMOD '13). Association for Computing Machinery, New York, NY, USA, 1221–1230. https://doi.org/10.1145/2463676.2465295
[15] Jaeyoung Do, Sudipta Sengupta, and Steven Swanson. 2019. Programmable Solid-State Storage in Future Cloud Datacenters. Commun. ACM 62, 6 (2019), 54–62.
[16] Assaf Eisenman, Darryl Gardner, Islam AbdelRahman, Jens Axboe, Siying Dong, Kim Hazelwood, Chris Petersen, Asaf Cidon, and Sachin Katti. 2018. Reducing DRAM Footprint with NVM in Facebook. In Proceedings of the Thirteenth EuroSys Conference (Porto, Portugal) (EuroSys '18). Association for Computing Machinery, New York, NY, USA, Article 42, 13 pages. https://doi.org/10.1145/3190508.3190524
[17] Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. 2018. Bandana: Using Non-Volatile Memory for Storing Deep Learning Models. arXiv preprint arXiv:1811.05922 (2018).
[18] Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, Jaeheon Jeong, and Duckhyun Chang. 2016. Biscuit: A Framework for Near-Data Processing of Big Data Workloads. In Proceedings of the 43rd International Symposium on Computer Architecture (Seoul, Republic of Korea) (ISCA '16). IEEE Press, Piscataway, NJ, USA, 153–165. https://doi.org/10.1109/ISCA.2016.23
[19] Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, and Carole-Jean Wu. 2020. DeepRecSys: A System for Optimizing End-to-End At-Scale Neural Recommendation Inference. arXiv preprint arXiv:2001.02772 (2020).
[20] Udit Gupta, Xiaodong Wang, Maxim Naumov, Carole-Jean Wu, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, et al. 2019. The Architectural Implications of Facebook's DNN-Based Personalized Recommendation. arXiv preprint arXiv:1906.03109 (2019).
[21] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang. 2018. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 620–629. https://doi.org/10.1109/HPCA.2018.00059
[22] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (Perth, Australia) (WWW '17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 173–182. https://doi.org/10.1145/3038912.3052569
[23] Bongjoon Hyun, Youngeun Kwon, Yujeong Choi, John Kim, and Minsoo Rhu. 2019. NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (2019).
[24] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, et al. 2019. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. arXiv preprint arXiv:1903.05714 (2019).
[25] Y. Jin, H. W. Tseng, Y. Papakonstantinou, and S. Swanson. 2017. KAML: A Flexible, High-Performance Key-Value SSD. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 373–384. https://doi.org/10.1109/HPCA.2017.15
[26] Liu Ke, Udit Gupta, Carole-Jean Wu, Benjamin Y. Cho, Mark Hempstead, Brandon Reagen, Xuan Zhang, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Mengxing Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, and Xiaodong Wang. 2019. RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing. arXiv preprint arXiv:1912.12953 (2019).
[27] Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2019. TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 740–753.
[28] Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, and Mark Hempstead. 2020. Understanding Capacity-Driven Scale-Out Neural Recommendation Inference. arXiv preprint arXiv:2011.02084 (2020).
[29] Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Xiting Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing Su, Jiyan Yang, and Mikhail Smelyanskiy. 2020. Deep Learning Training in Facebook Data Centers: Design of Scale-Up and Scale-Out Systems. arXiv preprint arXiv:2003.09518 (2020).
[30] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. 2019. Deep Learning Recommendation Model for Personalization and Recommendation Systems. arXiv preprint arXiv:1906.00091 (2019). http://arxiv.org/abs/1906.00091
[31] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. 2014. SDF: Software-Defined Flash for Web-Scale Internet Storage Systems. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems. 471–484.
[32] Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Aravind Kalaiah, Daya Khudia, James Law, Parth Malani, Andrey Malevich, Satish Nadathur, et al. 2018. Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications. arXiv preprint arXiv:1811.09886 (2018).
[33] Mendel Rosenblum and John K. Ousterhout. 1992. The Design and Implementation of a Log-Structured File System. ACM Trans. Comput. Syst. 10, 1 (Feb. 1992), 26–52. https://doi.org/10.1145/146941.146943
[34] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. 2016. Flash Reliability in Production: The Expected and the Unexpected. In 14th USENIX Conference on File and Storage Technologies (FAST 16). 67–80.
[35] Sudharsan Seshadri, Mark Gahagan, Sundaram Bhaskaran, Trevor Bunker, Arup De, Yanqin Jin, Yang Liu, and Steven Swanson. 2014. Willow: A User-Programmable SSD. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (Broomfield, CO) (OSDI '14). USENIX Association, Berkeley, CA, USA, 67–80. http://dl.acm.org/citation.cfm?id=2685048.2685055
[36] Devesh Tiwari, Simona Boboila, Sudharshan S. Vazhkudai, Youngjae Kim, Xiaosong Ma, Peter J. Desnoyers, and Yan Solihin. 2013. Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machines. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (San Jose, CA) (FAST '13). USENIX Association, Berkeley, CA, USA, 119–132. http://dl.acm.org/citation.cfm?id=2591272.2591286
[37] Corinna Underwood. 2019. Use Cases of Recommendation Systems in Business – Current Applications and Methods. https://emerj.com/ai-sector-overviews/use-cases-recommendation-systems/
[38] Jianguo Wang, Dongchul Park, Yannis Papakonstantinou, and Steven Swanson. 2016. SSD In-Storage Computing for Search Engines. IEEE Trans. Comput. (2016).
[39] Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. 2014. An Efficient Design and Implementation of LSM-Tree Based Key-Value Store on Open-Channel SSD. In Proceedings of the Ninth European Conference on Computer Systems (Amsterdam, The Netherlands) (EuroSys '14).
[41] Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li. 2020. Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems. arXiv preprint arXiv:2003.05622 (2020).
[42] Weijie Zhao, Jingyuan Zhang, Deping Xie, Yulei Qian, Ronglai Jia, and Ping Li. 2019. AIBox: CTR Prediction Model Training on a Single Node. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM '19). Association for Computing Machinery, New York, NY, USA, 319–328. https://doi.org/10.1145/3357384.3358045
[43] Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending What Video to Watch Next: A Multitask Ranking System. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys '19). ACM, New York, NY, USA, 43–51. https://doi.org/10.1145/3298689.3346997
[44] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
[45] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.