Accelerating Recommender Systems via Hardware "scale-in"
Suresh Krishna*, Ravi Krishna*
Electrical Engineering and Computer Sciences, University of California, Berkeley
{suresh_krishna, ravi.krishna}@berkeley.edu
*Equal contribution.

Abstract—In today's era of "scale-out", this paper makes the case that a specialized hardware architecture based on "scale-in"—placing as many specialized processors as possible along with their memory systems and interconnect links within one or two boards in a rack—would offer the potential to boost large recommender system throughput by ∼12–61× for inference and ∼12–45× for training compared to the DGX-2 state-of-the-art AI platform, while minimizing the performance impact of distributing large models across multiple processors. By analyzing Facebook's representative model—Deep Learning Recommendation Model (DLRM)—from a hardware architecture perspective, we quantify the impact on throughput of hardware parameters such as memory system design, collective communications latency and bandwidth, and interconnect topology. By focusing on conditions that stress hardware, our analysis reveals limitations of existing AI accelerators and hardware platforms.

I. INTRODUCTION
Recommender systems serve to personalize user experience in a variety of applications, including predicting click-through rates for ranking advertisements [34], improving search results [30], suggesting friends and content on social networks [30], suggesting food on Uber Eats [53], helping users find houses on Zillow [54], helping contain information overload by suggesting relevant news articles [55], helping users find videos to watch on YouTube [43] and movies on Netflix [59], and several more real-world use cases [60]. An introduction to recommender system technology can be found in [58], and a set of best practices and examples for building recommender systems in [56]. The focus of this paper is recommendation systems that use neural networks, referred to as Deep Learning RecSys, or simply RecSys (while the term is often used to denote any recommender system, it specifically denotes Deep Learning recommenders here). These have recently been applied to a variety of areas with success [34] [30].

Due to their commercial importance—by improving the quality of Ads and content served to users, RecSys directly drives revenues for hyperscalers, especially under cost-per-click billing—it is no surprise that recommender systems consume the vast majority of AI inference cycles within Facebook's datacenters [1]; the situation is similar [30] at Google, Alibaba and Amazon. In addition, RecSys training now consumes >50% of AI training cycles within Facebook's datacenters [10], and demand continues to grow rapidly year over year for both training [10] and inference [31]. The unique characteristics of these workloads present challenges to datacenter hardware including CPUs, GPUs and almost all AI accelerators. Compared to other AI workloads such as computer vision or natural language processing, RecSys tend to have larger model sizes of up to 10TB [7], are memory access intensive [30], have lower compute burden [31], and rely heavily on collective communications (CC) operations [6]. These characteristics make RecSys a poor fit for many existing systems, as described in Sec. II, and call for a new approach to accelerator HW architecture.

In this paper, various HW architectures are analyzed using Facebook's DLRM [30] as a representative example, and the resulting data are used to derive the characteristics of RecSpeed, an architecture optimized for running RecSys. Specifically, we:
• Describe the DLRM workload in terms of its characteristics that impact HW throughput and latency.
• Identify HW characteristics such as memory system design, CC latency and bandwidth, and CC interconnect topology that are key determinants of upper bounds on RecSys throughput.
• Use a generalized roofline model that adds communication cost to memory and compute to show that specialized chip-level HW features can improve upper bounds on RecSys throughput both by reducing latency and by improving bandwidth. It is known that a fixed-topology quadratic interconnect can offer CC all-to-all performance gains over a ring [10].
• Explain why AI accelerators for RecSys would benefit from supporting hybrid memory systems that combine multiple forms of DRAM.
• Evaluate the practical implementation of specialized HW for RecSys and show how this has the potential to improve throughput by ∼12–61× for inference and ∼12–45× for training compared to the NVIDIA DGX-2 AI platform.
II. RELATED WORK

There are several published works that describe systems and chips for accelerating RecSys. Compared to these, our work focuses on sweeping various hardware parameters for a homogeneous system (one where a single type of processor contributes the bulk of processing and communications capability) in order to understand the impact of each upon upper-bound DLRM system throughput. As such, we do not evaluate, from the standpoint of RecSys acceleration, any of the other types of systems described below.

A. Heterogeneous Platforms
Facebook's Zion platform [2] is a specialized heterogeneous system (one that combines various processor types, each providing specialized compute and communications capability) for training RecSys. Fig. 1 shows the major components and interconnect of the Zion platform, which are summarized in Table I. Zion offers the benefit of combining very large CPU memory to hold embedding tables with high compute capability and fast interconnect from GPUs, along with 8 100GbE NICs for scale-out. AIBox [7] from Baidu is another heterogeneous system. A key innovation of AIBox is the use of SSD memory to store the parameters of a model up to 10TB in size; the system uses CPU memory as a cache for SSD. The hit rate of this combination is reported to increase as training proceeds, plateauing at about 85% after 250 training mini-batches. A single-node AIBox is reported to train a 10TB RecSys with 80% of the throughput of a 75-node MPI cluster at 10% of the cost.

TABLE I: Key Features of FB Zion [2] [10].

Feature | Example of Implementation
CPU | 8x server-class processor such as Intel Xeon
CPU memory speed | DDR4 DRAM, 6 channels/socket, up to 3200MHz, ∼25.6GB/s/channel
CPU memory capacity | Typical 1 DIMM/channel, up to 256GB/DIMM = 1.5TB/CPU
CPU interconnect | UltraPath Interconnect, coherent
CPU I/O | PCIe Gen4 x16 per CPU, ∼32GB/s
AI HW accelerator | Variable
Accelerator interconnect | 7 links per accelerator, x16 serdes lanes/link, 25G, ∼350GB/s/card raw aggregate
Accelerator memory | On-package, likely HBM
Accelerator power | Up to 500W @ 54V/48V, support for liquid cooling
Scale-out | 1x NIC per CPU, typically 100Gb Ethernet
B. Homogeneous Platforms
NVIDIA's DGX-A100 and Intel/Habana's HLS-1 are two representative examples of homogeneous AI appliances. Tables II and III respectively summarize their key characteristics.
C. In/Near Memory Processing
RecSys models tend to be limited by memory accesses to embedding table values that are then combined using pooling operators [19], which makes the integration of memory with embedding processing an attractive solution. The first approach is to modify a standard DDR4 DIMM by replacing its buffer chip with a specialized processor that can handle embedding pooling operations. Fig. 2 shows the idea behind this concept for a scenario involving two dual-rank DIMMs sharing one memory channel.
TABLE II: Key Features of NVIDIA DGX-A100 [76].

Feature | Example of Implementation
CPU | 2x AMD Rome 7742, 64 cores each
CPU memory speed | DDR4 DRAM, 8 channels/socket, up to 3200MHz, ∼25.6GB/s/channel
CPU memory capacity | Typical 1 DIMM/channel, up to 256GB/DIMM = 2TB/CPU
AI HW accelerator | 8x NVIDIA A100
Accelerator interconnect | Switched all-to-all, NVLink3, 12 links/chip, ∼600GB/s bandwidth per chip
Accelerator memory | HBM2 @ 2430MHz [75], 40GB/chip, 320GB total
System power | ∼6.5kW max.
Scale-out | 8x 200Gb/s HDR Infiniband

TABLE III: Key Features of Intel/Habana HLS-1 [25].

Feature | Example of Implementation
AI HW accelerator | 8x Habana Gaudi
Accelerator interconnect | Switched all-to-all, 10x 100Gb Ethernet per chip, of which 7 are available for interconnect
Accelerator memory | HBM, 32GB/chip, 256GB total
System power | ∼3kW max.
Scale-out | 24x 100Gb Ethernet configured as 3 links per chip

In this example, the typical buffer chip found on each DIMM is replaced by a specialized NMP (Near-Memory Processing) chip such as TensorDIMM [20] or RecNMP [19] that can access both ranks simultaneously, cache embeddings, and pool them prior to transfer over the bandwidth-constrained channel. For a high-performance server configuration with one dual-rank DIMM per memory channel, simulations [19] indicate substantial speedups.

It is also possible to provide a memory/compute module in a non-DIMM form factor. An example is Myrtle.ai's SEAL Module [77], an M.2 module specialized for recommender systems embedding processing. This is built from Bittware's 250-M2D module [78] that integrates a Xilinx Kintex UltraScale+ KU3P FPGA along with two 32-bit channels of DDR4 memory. Twelve such modules can fit within an Open Compute Glacier Point V2 carrier board [79], in the space typically taken up by a Xeon CPU with its six 64-bit channels of DDR4 memory.

The second approach is to build a processor directly into a DRAM die. UpMem [27] has built eight 500MHz processors into an 8Gb DRAM. Due to the lower speed of the DRAM process, the CPU uses a 14-stage pipeline to boost its clock rate. Several factors limit the performance gains available with this approach. These include the limited number of layers of metal (e.g. 3) in DRAM semiconductor processes, the need to place the processor far downstream from sensitive analog sense amplifier logic, and the lag of DRAM processes behind logic processes.
Fig. 1: FB Zion System: major components and interconnect (CPUs with DDR memory and UltraPath Interconnect, PCIe switches, AI HW accelerators with attached memory, and NICs for scale-out).
Fig. 2: Near-Memory Processing via modified DIMM buffer chip (an NMP chip performing pooling and caching per DIMM). Example shows two DIMMs, each dual-rank, sharing a single DDR4 memory channel.
D. DRAM-less Accelerators
DRAM-less AI accelerators include the Cerebras CS1 [15], Graphcore GC2 [13] and GC200 [16]. The CS1 and GC2 lack attached external DRAM, and the GC200 likely offers low-bandwidth access to such DRAM, since two DDR4 DIMMs are shared between four GC200s via a gateway chip.
E. Other approaches
Centaur [61] offloads embedding layer lookups and dense compute to an FPGA that is co-packaged with a CPU via coherent links. NVIDIA's Merlin [62], while not a HW platform for RecSys, is an end-to-end solution to support the development, training and deployment of RecSys on GPUs. Intel [6] describes optimizations that substantially improve DLRM throughput on a single CPU socket, to a level comparable to that of a V100 GPU, with excellent scaling properties on clusters of up to 64 CPUs.

There is also a considerable body of work on domain-specific architectures for accelerating a broad set of AI applications, and there are several surveys of the field [71] [72] [73]. An up-to-date list of commercial chips can be found in [68]. Approaches using FPGAs are described in [70] [69], and [74] describes a datacenter AI accelerator.

III. OVERVIEW OF RECOMMENDER SYSTEMS
In this section, we provide an overview of a representative RecSys, Facebook's DLRM [34], from both application and algorithm/model perspectives, including relevant deployment constraints and goals that will guide the development of our architecture. An overview of several other RecSys can be found in [30].
A. Black box model of RecSys: Inputs & Output

The focus of this paper is a RecSys for rating an individual item of content, as opposed to RecSys that process multiple items of content simultaneously [63]. Inputs to the RecSys are a description u of a user and a description c of a piece of content; the RecSys outputs the estimated probability P(u, c) ∈ (0, 1) that the user will interact with the content in some specified way. Both the user and the content are described by a set of dense features ∈ R^n and a set of sparse features ∈ {0, 1}^m. Dense features are continuous, real-valued inputs such as the click-through rate of an Ad during the last week [33], the average click-through rate of the user [33], and the number of clicks already recorded for an Ad being ranked [35]. Sparse (or categorical) features are multi-hot vectors, each with a small number of 1-indices out of many 0s, and represent information such as user ID, Ad category, and user history such as visited product pages or store pages [64] [65].

From a conceptual standpoint, the RecSys could be run on every piece of content considered for a user, and the resulting output probabilities could then be used to determine which items of content to show that user to achieve business objectives such as maximizing revenue or user engagement. Note that the RecSys output may be combined with other components in order to decide what content to ultimately show the user, as described in Sec. III-B.

B. Deployment Scenario
Fig. 3: Overview of multi-step Ad serving process consisting of filtering a large set of Ad content down to a small set of candidates, followed by ranking the candidates, the Ad auction, and displaying the winning Ad [1] [30] [51].

A typical deployment scenario of a RecSys model is illustrated in Fig. 3:
1) The user loads a page, which triggers a request to generate personalized content. This request is sent from the user's device to the company's datacenter.
2) Based on available content, an input query is generated, consisting of a set of B (the query size) features, each one representing the piece of content and the user, and possibly their interactions. B varies by recommendation model. There is typically a hierarchy of recommendation models whereby easier-to-evaluate models rank larger amounts of content first with high B, and then pass the top results to more complex models that rate smaller amounts of content with lower B [1]. B in the low to mid-hundreds is representative [30], with some queries considerably larger.
3) This query is then passed on to the RecSys. This system outputs, for each of the B pieces of potential content, the probability that the user will interact with that content in some way. In the case of advertising, this often means the probability that the user will click on the Ad. For video content, it could be metrics related to user engagement [43].
4) Based on these probabilities, the most relevant content is returned to the user, for instance "the top tens of posts" [1]. However, for Ad ranking, the probabilities generated by the RecSys are first fed to an auction where they are combined with advertiser bids to select the Ad(s) that are ultimately shown to the user [42] [51].

C. System Constraints
RecSys operate under strict constraints.
Inference Constraints: The system must return the most accurate results within the (SLA and thus inviolate) latency constraint defined statistically in Eq. 1, where PPF(D_Q, P) is the percentage point function (inverse CDF), D_Q is the distribution of the times to evaluate each query, P is a percentile, such as the 99th or 90th, and C_SLA is the latency constraint, typically in the range of "tens to hundreds of milliseconds" [1]. Tail latencies are dependent on the QPS (queries per second) throughput of the serving system [30]; one method to trade off QPS and tail latency, which [30] explores, is adjusting the query size.

PPF(D_Q, P) ≤ C_SLA    (1)
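The constraint in Eq. 1 can be checked empirically from measured query latencies; the minimal sketch below uses NumPy's percentile function as the empirical PPF. The 40 ms SLA value and the synthetic latency distribution are illustrative assumptions, not values from this paper.

```python
import numpy as np

def meets_sla(latencies_ms, percentile=99.0, c_sla_ms=40.0):
    """Empirical check of Eq. 1: PPF(D_Q, P) <= C_SLA.

    latencies_ms: measured per-query evaluation times (samples of D_Q).
    percentile:   P, e.g. 99 for a p99 tail-latency constraint.
    c_sla_ms:     C_SLA, the SLA latency bound (hypothetical 40 ms here).
    """
    ppf = np.percentile(latencies_ms, percentile)  # inverse CDF at P
    return ppf <= c_sla_ms, ppf

# Example with a synthetic, long-tailed latency distribution.
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=np.log(15.0), sigma=0.4, size=100_000)
ok, p99 = meets_sla(samples, percentile=99.0, c_sla_ms=40.0)
print(f"p99 = {p99:.1f} ms, SLA met: {ok}")
```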
Training Constraints: The system must train the most accurate model within the minimum amount of time. Recommendation models need to be retrained frequently in order to maintain accuracy when faced with changing conditions and user preferences [32], and time spent training a new model is time when that model is not contributing to revenue.

Total deployment cost for the hardware needed to run the system is also a consideration. This is measured as the TCO, or Total Cost of Ownership, of that hardware.
D. Model Overview of DLRM
The DLRM structure was open-sourced by Facebook in 2019 [34]. Fig. 4 illustrates the layers that comprise the DLRM model. DLRM is meant to be a reasonably general structure; for simplicity, the "default" implementation is used, with sum pooling for embeddings, an FM-based (Factorization Machine [36]) feature interactions layer, and exclusion of the diagonal entries from the feature interactions layer output.

Inputs to DLRM are described in Sec. III-A. We note that each sparse feature is effectively a set of indices into embedding tables. We now describe embedding tables and the other main components of DLRM.

Embedding tables are look-up tables, each viewed as a c × d matrix. Given a set of indices into a table, the corresponding rows are looked up, transposed to column vectors and combined into a single vector through an operation called pooling. Common pooling operators include sum and concatenation [30]. An example of a sparse feature is user ID—a single index denoting the user for whom Ads are being ranked, corresponding to a single vector in an embedding table [39]. We typically refer to the output of embedding tables as embedding vectors, and embedding tables themselves may be referred to as embedding layers.

Fig. 4: DLRM structure illustrating dense inputs and their associated bottom MLP, categorical inputs and embedding tables (lookups and pooling), feature interactions layer, and top MLP [34].

FC Layers: The DLRM contains two Multi-Layer Perceptrons (MLPs), which consist of stacked (or composed) fully connected (FC) layers. The "bottom MLP" processes the dense feature inputs to the model, and the "top MLP" processes the concatenation of the feature interactions layer output and the bottom MLP output. DLRM uses ReLU activations for all FC layers except the last one in the top MLP, which uses a sigmoid activation; this converts the 1-dimensional output of that layer to a click probability ∈ (0, 1).

Feature interactions layer: This layer is designed to combine the information from the dense and sparse features. The FM-based feature interactions layer forces the embedding dimension of every table and the output dimension of the final FC layer in the bottom MLP to all be equal. After the dense features are first run through the bottom MLP, the output size is b × d, where b is the batch size being used for inference or training, and d is the common embedding dimension. Further, each of the s sparse feature embedding tables produces a b × d output after pooling. These are concatenated along a new dimension to construct a b × (s+1) × d tensor which we will call A. Let A′ ∈ R^{b × d × (s+1)} be constructed by transposing the last two dimensions of A. F is then calculated as the batch matrix multiplication of A and A′ such that F ∈ R^{b × (s+1) × (s+1)}. Roughly half of these entries are duplicates, and those are discarded along with, optionally, the diagonal entries, which we opt to do. After flattening the two innermost dimensions, the result is the output matrix F′ ∈ R^{b × s(s+1)/2}. This batch of vectors, after being concatenated with the bottom MLP output, is then fed to the top MLP. In Algorithms 1 and 2, part of Sec. IV-C where we cover the implementation of DLRM from a HW architecture perspective, for simplicity, this concatenation is considered as part of the feature interactions layer. A PyTorch sketch of this computation follows.
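The sketch below illustrates the pooling and feature-interaction computation in PyTorch (the framework in which DLRM was released); the sizes (s = 3 tables, d = 4) are arbitrary illustrative values, and torch.nn.EmbeddingBag with mode='sum' provides the lookup-plus-sum-pooling step.

```python
import torch

b, s, d = 2, 3, 4            # illustrative batch size, #tables, embedding dim
rows_per_table, lookups = 100, 5

# One EmbeddingBag per sparse feature performs lookup + sum pooling.
tables = torch.nn.ModuleList(
    [torch.nn.EmbeddingBag(rows_per_table, d, mode="sum") for _ in range(s)]
)
indices = torch.randint(0, rows_per_table, (b, lookups))   # per-table indices
pooled = [t(indices) for t in tables]                      # each b x d

bottom_mlp_out = torch.randn(b, d)                         # stand-in for the bottom MLP

# Stack into A (b x (s+1) x d) and form F = A @ A' (b x (s+1) x (s+1)).
A = torch.stack([bottom_mlp_out] + pooled, dim=1)
F = torch.bmm(A, A.transpose(1, 2))

# Keep the strictly lower triangle: s(s+1)/2 unique pairwise dot products,
# discarding duplicates and the diagonal as in the default configuration.
li, lj = torch.tril_indices(s + 1, s + 1, offset=-1)
F_prime = F[:, li, lj]                                     # b x s(s+1)/2

# Concatenate with the bottom MLP output before the top MLP.
top_mlp_input = torch.cat([bottom_mlp_out, F_prime], dim=1)
print(top_mlp_input.shape)  # torch.Size([2, 10]) = d + s(s+1)/2 per sample
```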
E. RecSys vs. other AI models

The two key factors that set RecSys models apart from other AI models (such as those for computer vision or natural language processing) are arithmetic intensity—the number of compute operations relative to memory accesses—and model size. RecSys models are significantly larger and have significantly lower arithmetic intensity, as shown in Table IV, resulting in increased pressure on memory systems and interconnect structure.

IV. DLRM FROM A HARDWARE ARCHITECTURE PERSPECTIVE
A. Distributed model setup
The characteristics of recommender systems require a combination of model parallelism and data parallelism. On a large homogeneous system, dense parameters such as FC layer weights are copied onto every processor. However, embedding tables are distributed across the memory of all processors, such that no parameters are replicated. This arises from the fact that embedding tables can easily reach from several hundred GBs to multiple TBs in size [40].

While each processor can compute the dense part of the model for a batch, it cannot look up the embeddings specified by the sparse features, because some of those embeddings are stored in the memory of other processors. Thus, each processor needs to query other processors for embedding entries. This leads to the communication patterns outlined in Sec. IV-B.

TABLE IV: Recommender System vs. Other AI Models. Data from Table 1 of [31].

Category | Model Type | Size (parameters) | Arithmetic Intensity
Computer Vision | ResNeXt101-32x4-48 | 43-829M | Avg. 188-380
Language | Seq2Seq GRU/LSTM | 100M-1B | 2-20
Recommender | Embedding Layers | >10B | 1-2

There are multiple ways to split up the embedding tables. The distribution and (if beneficial) replication of tables across processors must be optimized to avoid system bottlenecks by evening out memory access loading across the memory systems of the various processors. One extreme is "full sharding" of the tables, where tables are split up at a vector level across the attached memory systems of multiple processors to get very close to uniformly distributing table lookups—at the cost of increased communication. This increased communication and the resulting stress on the HW is due to the fact that, with full sharding, each processor is prevented from doing local pooling of embeddings, thus requiring unpooled vectors to be communicated, of which there are many more than pooled vectors, as described in Sec. III-D. As such, many systems today attempt to fit entire tables into the memory of single processors (which we call "no sharding"), or to break up the tables as little as possible, as shown on slide 6 of [2]. A sketch of both placement extremes is given below.
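The following sketch illustrates the two placement extremes under simple assumptions (uniform row sizes, row-striped full sharding, greedy whole-table assignment for no sharding); the function name and the greedy heuristic are ours, not from the paper.

```python
def shard_tables(table_rows, n_procs, full_sharding):
    """Assign embedding-table rows to processors.

    table_rows:    list with the number of rows c_i in each table.
    n_procs:       number of processors.
    full_sharding: True  -> stripe every table's rows across all processors
                            (near-uniform load, but unpooled vectors must be
                            exchanged);
                   False -> place whole tables, greedily balancing total rows
                            (local pooling possible, load may be uneven).
    Returns placement[p] = list of (table_id, row_id) owned by processor p.
    """
    placement = [[] for _ in range(n_procs)]
    if full_sharding:
        for t, rows in enumerate(table_rows):
            for r in range(rows):
                placement[(t + r) % n_procs].append((t, r))  # row striping
    else:
        load = [0] * n_procs
        # Largest tables first, each to the currently least-loaded processor.
        for t in sorted(range(len(table_rows)), key=lambda i: -table_rows[i]):
            p = min(range(n_procs), key=lambda q: load[q])
            load[p] += table_rows[t]
            placement[p].extend((t, r) for r in range(table_rows[t]))
    return placement

rows = [1000, 400, 250, 250]
print([len(p) for p in shard_tables(rows, 2, full_sharding=True)])   # [950, 950]
print([len(p) for p in shard_tables(rows, 2, full_sharding=False)])  # [1000, 900]
```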
B. CC operations

The following CC operations are key to distributed RecSys throughput. For more information on CC operations, please see [8]. In all of the scenarios described below, there are n processors, numbered 1, 2, ..., n. A semantic sketch of all four operations follows the descriptions.

All-to-all: This CC primitive essentially implements a "transpose" operation. Each processor starts out with n pieces of data. Supposing A_ij denotes the j-th piece of data currently residing on the i-th processor, then the all-to-all operation will move A_ij such that it is instead on the j-th processor and is ordered as the i-th piece of data there. This operation is useful when each processor needs to send some data to every other processor.

All-reduce: All-reduce is an operation which replaces local data on each processor with the sum of data across all processors. If processor i contains data A_i, then after the all-reduce operation, all processors will contain A_R = Σ_{i=1}^{n} A_i. Efficient all-reduce algorithms exist for ring interconnects [8], which are popular in AI systems; however, all-reduce can also be implemented in equivalent time by first performing a reduce-scatter operation and then an all-gather operation.

Reduce-scatter: This operation may be best described as an all-to-all operation followed by a "local reduction." Similar to the all-to-all setup, the starting point is A_ij as the j-th piece of data on processor i. Where all-to-all will result in this being the i-th piece of data on processor j, reduce-scatter performs an extra reduction, such that the only piece of data on processor j at the end is Σ_{i=1}^{n} A_ij.

All-gather: In this operation, the starting point is a single piece of data A_i on each of the n processors, and the result of the operation is to have every A_i on every processor. For example, with 4 processors initially containing A_1, A_2, A_3, and A_4, respectively, after the all-gather operation, each processor will contain all of A_1, A_2, A_3, and A_4.
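The data movement of these four primitives can be captured in a few lines of plain Python, modeling the system as a list of per-processor buffers; this is a semantic illustration only, not a performant implementation.

```python
# data[i][j] = j-th piece of data on processor i (n x n layout).
def all_to_all(data):
    n = len(data)
    # Piece A_ij moves to processor j as its i-th piece: a transpose.
    return [[data[i][j] for i in range(n)] for j in range(n)]

def reduce_scatter(data):
    n = len(data)
    # Processor j ends up with only the sum over i of A_ij.
    return [sum(data[i][j] for i in range(n)) for j in range(n)]

def all_gather(pieces):
    # Every processor receives a copy of every A_i.
    return [list(pieces) for _ in pieces]

def all_reduce(pieces):
    # Equivalent to reduce-scatter followed by all-gather.
    total = sum(pieces)
    return [total for _ in pieces]

data = [[10, 11], [20, 21]]        # 2 processors, 2 pieces each
print(all_to_all(data))            # [[10, 20], [11, 21]]
print(reduce_scatter(data))        # [30, 32]
print(all_gather([1, 2]))          # [[1, 2], [1, 2]]
print(all_reduce([1, 2]))          # [3, 3]
```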
C. Detailed steps for DLRM Inference and Training

Please refer to the Appendix for a description of operators and an in-depth view of the steps involved in DLRM inference and training.
D. Key HW performance factors
Our performance model (Sec. V-B) helps identify several HW characteristics that are major determinants of throughput for the RecSys that we evaluate. As could be expected from the DLRM operations described in the previous three sections, these are CC performance, memory system performance for embedding table lookups, compute performance for running the dense portions of the model, and on-chip buffering. The following sections examine each of these factors in more detail.
1) CC Performance: RecSys make extensive use of CC for exchanging embedding table indices and embedding values, and for gradient averaging during training backprop. Table V summarizes HW factors that impact RecSys CC performance. In particular, all-to-all, with its smaller message sizes, is especially sensitive to latency and scales poorly with interconnects such as rings [10]. Various lower bounds on the latency and throughput of these operators have been derived for representative system architectures [8].

TABLE V: Summary of HW Collective Communications factors for RecSys.

Factor | Impact
Per-processor bandwidth | Sets upper bounds on CC all-reduce and all-to-all throughput
Interconnect topology | Major determinant of all-to-all throughput
Ring interconnect | Well-suited for all-reduce, poorly-suited for all-to-all
CC latency | Particularly important for all-to-all due to smaller message sizes
For the purposes of HW analysis on homogeneous systems (systems where all processors share equally in the communications and arithmetic components of CC), we note the following [8]:
• For an all-to-all with data volume V and n processors, the lower bound on the amount of data sent and received by each processor is V × (n−1)/n.
• Similarly, for an all-reduce, the lower bound is 2 × V × (n−1)/n.
• The above bounds impose a minimum per-processor data volume that must be transferred. As a consequence, the bandwidth per processor will limit overall all-to-all and all-reduce throughput, even as more processors are added to a system.

As a rule of thumb, for a large system with a per-processor interconnect bandwidth BW, the maximum achievable system-wide all-to-all and all-reduce throughputs are roughly BW and BW/2, respectively. The above rule of thumb works well for NVIDIA's DGX-2 system built from 16 V100 processors [9], whose achieved "all-reduce bandwidth" (the data transfer rate in each direction, per processor, during all-reduce) approaches this bound [12]. A helper encoding these bounds is sketched below.
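These bounds, combined with the simple latency/bandwidth model used for Fig. 5, give first-order CC time estimates. The helper below encodes them directly (volumes in bytes, bandwidth in bytes/s); the example numbers are arbitrary.

```python
def all_to_all_time(volume, n, latency_s, bw_bytes_per_s):
    # Lower-bound data per processor is V*(n-1)/n; add a flat latency term.
    return latency_s + (volume * (n - 1) / n) / bw_bytes_per_s

def all_reduce_time(volume, n, latency_s, bw_bytes_per_s):
    # All-reduce moves twice the per-processor all-to-all volume.
    return latency_s + (2 * volume * (n - 1) / n) / bw_bytes_per_s

# Example: 64KB all-to-all on 16 processors, 2us latency, 150GB/s per chip.
t = all_to_all_time(64e3, 16, 2e-6, 150e9)
print(f"{t * 1e6:.2f} us")  # ~2.4 us: latency dominates for small messages
```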
Fig. 5: DGX-2 all-gather bandwidth: measured values [12] compared to a simple latency/bandwidth model, as a function of data volume. All-gather time for smaller message sizes is latency dominated.

Fig. 5 illustrates the impact of latency on CC throughput. Due to the multi-microsecond latency for all-gather on the DGX-2, the time to perform an all-gather for smaller message sizes is latency dominated. Latencies for a few current systems are shown in Table VI.

TABLE VI: Latencies of interest.
System | Latency incl. SW overhead
NVIDIA DGX-2 CC all-reduce | Est. ∼ µs [12]
NVIDIA DGX-2 CC all-gather | Est. ∼ µs [12]
NVIDIA DGX-2 NVLink2 point-to-point | ∼ µs [12]
NVIDIA DGX-2 PCIe point-to-point | ∼ µs [12]
Graphcore GC2, 16-IPU system, single-destination Gather | ∼ µs [13]
Graphcore GC2, 16-IPU system, single-destination Reduce | ∼ µs [13]

The other important factor for CC, available per-chip peak bandwidth, is summarized in Table VII. Note that Broadcom's
Tomahawk 4 switch chip supports an order of magnitude more bandwidth than any AI HW chip, demonstrating the headroom available to improve per-chip bandwidth for AI applications.

TABLE VII: HW communication peak bandwidth for various chips.

Chip | Peak HW bandwidth, each direction, per chip
CPU: Intel Xeon Platinum 8180 | 62GB/s UPI, 48GB/s PCIe aggregate [6], [22], [23]
GPU: NVIDIA A100 | NVLink3, 300GB/s [21]
FPGA: Achronix Speedster AC7t1500 | 400GB/s via 112G SerDes [18]
AI HW: Graphcore GC200 | 320GB/s via 10x IPU-Links [14]
AI HW: Intel/Habana Gaudi | 125GB/s via 10x 100GbE [25]
Network Switch: Broadcom Tomahawk 4 | 3,200GB/s [26]
2) Memory System Performance: This section refers to external DRAM memory attached to an AI HW accelerator or CPU. RecSys applies considerable pressure on the memory system due to the large number of accesses to embedding tables. Furthermore, these accesses have limited spatial locality (but some temporal locality) [19], resulting in scattered memory accesses of 64B-256B in size [1] that exhibit poor DRAM page hit characteristics. Multiple ranks per DIMM, and internal banks and bank groups per memory die, help by increasing parallelism, within memory device timing parameters such as command issue rates and on-die power distribution limitations.

TABLE VIII: External memory systems for select AI HW.

Chip | Memory System
Intel Xeon CPU | DDR4, 6 channels for Xeon Gold, up to 1.5TB [5]
NVIDIA A100 | HBM, 5 stacks, 40GB total
NVIDIA TU102 RTX2080Ti | GDDR6, 11 chips, 11GB total
Habana Goya | DDR4, 2 channels, 16GB
Graphcore GC2 | None, 300MB on-chip
Graphcore GC200 | 900MB on-chip, 2x DDR4 channels for 4 chips [16]
Cerebras CS1 | None, all memory is on-wafer, 18GB total
Table VIII shows memory types for a few representative AI HW chips, and Fig. 6 shows achievable effective memory bandwidth for various memory system configurations and embedding sizes. HBM has considerably higher performance for random embedding accesses, while the typical 6-channel DDR4 server CPU memory system has far lower performance, especially for smaller embedding sizes. However, HBM and GDDR6 suffer from limited capacity compared to DDR4, as shown in Fig. 7.
3) Compute Performance: Table IX shows available compute capability of various chips. For the specific workload that we analyze, compute capability is not a limiting factor (see Sec. V-A), and this is believed to be the case for many RecSys workloads when run on specialized AI accelerators [30].
Fig. 6: Peak random embedding access bandwidth vs. embedding size (32B-256B) for common memory systems (HBM2E @ 2400MHz, 4 stacks; GDDR6 @ 16GHz, 12 chips; DDR4 @ 3200MHz, dual-rank DIMM, 6 channels), based on memory timing parameters and assuming auto-precharge. Data transfer frequency shown; device clock is half for DDR4 & HBM, one-eighth for GDDR6. DDR4 memory systems have far lower performance than HBM for embedding lookups.

Fig. 7: Total capacity of common memory systems (DDR4 6-channel with 256GB 3D TSV DIMMs, HBM2E 4 stacks, GDDR6 12 chips).

TABLE IX: Peak compute capability of various chips.
Chip | FLOPS capability
CPU: Intel Xeon Platinum 8180 | 4.1 TFLOPS FP32 [6]
GPU: NVIDIA A100 | 19.5 TFLOPS FP32, 312 TFLOPS FP16/32 [21]
FPGA: Achronix Speedster AC7t1500 | 3.84 TFLOPS FP24 @ 750MHz [17], [18]
TABLE X: Uses for on-chip buffer memory.

Item | Buffering memory
Dense weights for data parallelism | Model-dependent, replicated across each chip
Embeddings for one mini-batch | Dependent on embedding table sizes, mini-batch size, and (temporal) locality across mini-batches
Working buffers for data transfers | Used to overlap processing and transfers for input features and embedding lookups
Data during training | Activations, gradients, optimizer state [44] [52]
4) On-Chip Buffering: Buffering memory serves several purposes, as shown in Table X. While some of these are straightforward to estimate—such as the number of weights in the dense layers of a model—others are harder to quantify, such as the number of unique embedding values across one or multiple mini-batches. Analyses by Facebook [19] indicate hit rates of 40% to 60% with a 64MB cache.
E. Improving existing AI HW accelerators for RecSys

Table XI illustrates potential changes to enhance the performance of existing AI HW accelerators on RecSys workloads.

TABLE XI: Improving existing AI HW accelerators for RecSys.

Chip | Potential Changes
NVIDIA A100 | Increase on-chip memory for buffering; add DDR memory support; enable sixth HBM stack; support deeper stacks; reduce compute capability to fit die area budget
Graphcore GC200 | Add external high-speed DRAM support
Cerebras CS1 | Change from mesh to fully connected topology; add external high-speed DRAM support
All of the above | Increase I/O bandwidth; add HW support for CC; add dual external DRAM support (Sec. VII-A)

V. PERFORMANCE MODEL
A. Representative DLRM Models
Our choice of representative model is Facebook's DLRM-RM2 [30]. In order to reveal limitations of various HW platforms, two model configurations are analyzed. These are positioned at the low and high end of batch size and embedding entry size; specifically, 200 and 600 as the batch size points (roughly 200 is the median query size; 600 is significantly farther out in the query size distribution [30]) and 64B and 256B as the embedding size points (Facebook has noted that embedding sizes in bytes are typically 64B-256B [19]). We refer to the 200 batch size/64B embedding size combination as "Small batch/Small embeddings", while the 600 batch size/256B combination is "Large batch/Large embeddings".

Similarly, the two extremes of table distribution are analyzed. "Unsharded" refers to each table being able to fit within the memory attached to an AI accelerator, such that only pooled embeddings need be exchanged. "Sharded" refers to "full sharding" (see Sec. IV-A) where each table is fully distributed across the attached memory of every accelerator—a worst-case scenario. The reality will likely fall between these two extremes.

Note that the same batch sizes are used for both our inference and training performance models and that all parameters are stored in FP16 for both inference and training. As the size of the tables for the production model is not publicly available, we assume that the model is large enough to occupy the memory of all chips in the system. This is a reasonable assumption based on other recommendation systems [7]. Table XII summarizes the model configuration; a sketch of the per-sample memory traffic and compute it implies follows the table.

TABLE XII: Representative DLRM model configuration.

Parameter | Value(s)
Number of embedding tables | 40
Lookups per table | 80
Embedding size | 32 FP16 = 64B (small), 128 FP16 = 256B (large)
Number of dense features | 256
Bottom MLP | 256-128-32-Embedding dimension
Top MLP | Interactions-512-128-1
Feature interactions | Dot products, exclude diagonal
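The embedding traffic and dense-layer FLOPs implied by this configuration can be estimated directly from the table. The sketch below does so under our own simplifying assumptions (the small configuration, 2 FLOPs per multiply-accumulate, and the interaction matmul ignored as comparatively small); it is illustrative rather than a substitute for the full performance model.

```python
def dlrm_per_sample_costs(n_tables=40, lookups=80, emb_bytes=64,
                          emb_dim=32, n_dense=256):
    # Embedding bytes read from memory per sample (before pooling).
    emb_traffic = n_tables * lookups * emb_bytes

    # Dense FLOPs: bottom MLP 256-128-32-d and top MLP interactions-512-128-1,
    # counting 2 FLOPs per MAC and ignoring the small interaction matmul.
    bottom = [n_dense, 128, 32, emb_dim]
    n_interactions = n_tables * (n_tables + 1) // 2   # s(s+1)/2 dot products
    top = [n_interactions + emb_dim, 512, 128, 1]
    flops = sum(2 * a * b for a, b in zip(bottom, bottom[1:]))
    flops += sum(2 * a * b for a, b in zip(top, top[1:]))
    return emb_traffic, flops

traffic, flops = dlrm_per_sample_costs()
print(f"{traffic / 1e3:.0f} KB of embeddings, ~{flops / 1e6:.2f} MFLOPs per sample")
# Small config: ~205 KB of embedding reads vs. ~1 MFLOP of dense compute,
# illustrating the low arithmetic intensity noted in Table IV.
```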
B. Overview of Performance Model

We have developed a performance model that computes the time, memory usage and communications overhead for each of the steps detailed in Sec. IV-C. Our model takes as input a specific DLRM configuration as described in Sec. V-A, as well as various parameters that describe batch size, embedding table sharding, processing engine capabilities and configuration, memory system configuration, memory device timing parameters from vendor datasheets, communication network latencies and bandwidths, and system parameters that control the level of overlap and concurrency within the HW. In order to maximally stress the HW, our model assumes zero (temporal) locality within the embedding access stream.

Results are reported in Sec. VI for the stated configurations, assuming the HW and system exploit maximal overlap within a batch: for inference, by grouping memory accesses and overlapping memory activity with communications where possible; and for training, by pipelining the collective communications with backpropagation computations and parameter updates. For embedding parameter updates during training, the originally looked-up embeddings are buffered on-chip, thereby only requiring a write to update them instead of a read-modify-write. In particular, sufficient on-chip buffering to support the above capabilities is assumed—doing so over-estimates the performance capabilities of most current AI accelerator HW. The core of such a model reduces to a generalized roofline, sketched below.
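At its core, a model of this kind bounds throughput by the slowest of three overlapped resources: compute, memory, and communication (the "generalized roofline" of Sec. I). The minimal sketch below reflects that structure; it omits the per-step detail of the full model, and all parameter names are ours.

```python
def batch_time_upper_bound(flops, peak_flops,
                           mem_bytes, mem_bw,
                           comm_bytes, comm_bw, comm_latency):
    """Generalized roofline: time for one batch when compute, memory
    accesses, and collective communications fully overlap."""
    t_compute = flops / peak_flops
    t_memory = mem_bytes / mem_bw
    t_comm = comm_latency + comm_bytes / comm_bw
    return max(t_compute, t_memory, t_comm)

# Example: per-chip batch with 0.2 GFLOP of dense work, 40 MB of embedding
# reads, and a 2 MB all-to-all at 2us latency (all values illustrative).
t = batch_time_upper_bound(flops=0.2e9, peak_flops=100e12,
                           mem_bytes=40e6, mem_bw=400e9,
                           comm_bytes=2e6, comm_bw=500e9, comm_latency=2e-6)
print(f"batch time >= {t * 1e6:.1f} us")  # memory-bound in this example
```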
VI. EVALUATION OF SYSTEM PERFORMANCE

A. Systems Parameters

We focus our performance evaluation on homogeneous systems, where one type of processor provides the bulk of compute and communications capability. Table XIII shows the range of parameters that we consider for our reference homogeneous system. In terms of CC performance, these ranges are, for the most part, significantly in excess of what is supported by state-of-the-art AI HW accelerators. This is consistent with our goal of showing the benefit of further optimizing these parameters. In particular, the CC latency range is lower than measured numbers for NVIDIA's DGX-2 system (Sec. VII-B). The range of per-chip bandwidth spans popular training accelerators as well as about 3× beyond NVLink3.

TABLE XIII: Ranges of Key Parameters for Homogeneous Systems.

Parameter | Range
CC all-to-all | Latency of 0.5µs to 8.5µs, bandwidth 100 to 1000GB/s
CC all-reduce | Latency of 0.5µs to 8.5µs, bandwidth 100 to 1000GB/s
Bandwidth per chip | 100 to 1000GB/s aggregated across all links

B. Inference
Fig. 8 shows upper bounds on achievable system throughput for inference. For the large batch, large embeddings case, unsharded, throughput is primarily limited by memory accesses for embeddings, such that interconnect is of secondary importance. Such models would run well on existing systems, including scale-out topologies with limited bandwidth.

Latency, as opposed to bandwidth, matters most for small batch/embedding workloads, as detailed in Fig. 9. For the unsharded case this sensitivity applies at high bandwidth as well as at low bandwidth, with throughput dropping severalfold as latency increases. This is not surprising, since the typical all-to-all message sizes for this configuration are 320KB of indices per processor and 64KB for pooled embeddings—small messages that would typically fall within the latency-dominated regime of CC.

This would indicate that when batch sizes are relatively small and tables are allocated to each fit within the memory attached to each processor, it is more important to design systems to minimize latency as opposed to pushing per-chip bandwidth. Scale-out architectures, with multiple interconnect hops and long physical distances, make this difficult. On the other hand, "scale-in" system design can help reduce latency.

The sharded, small batch/embeddings case is sensitive to both latency and bandwidth: the exchange of unpooled embeddings increases all-to-all message sizes for both indices and embeddings (by 12× for embeddings), yet the resulting message sizes are still typically within the latency-dominated region of CC. However, the number of bytes of memory lookups for embedding tables also increases by 12×, such that memory lookup time becomes the dominant term.

The sharded, large batch/embeddings case depends on both latency and bandwidth due to the larger message sizes from exchanging unpooled embeddings.
Fig. 8: Inference performance upper bounds (QPS) as a function of Collective Communications latency and bandwidth: (a) small batch, small embeddings, unsharded; (b) small batch, small embeddings, sharded; (c) large batch, large embeddings, unsharded; (d) large batch, large embeddings, sharded. See text for analysis.

In all cases, higher bandwidth minimizes the throughput impact of sharding. As shown in Fig. 10, increasing bandwidth from 100GB/s to 1000GB/s substantially reduces the throughput loss from full sharding for the small batch/embeddings case. Similarly significant gains are seen for the large batch/embeddings case with sharding.

C. Training
Fig. 11 shows upper bounds on achievable system throughput for training. Similarly to inference, we note that the case of unsharded large batches/embeddings is primarily memory-bound, hence does not depend much on CC latency or bandwidth.

For the other configurations, compared to inference, the importance of optimizing bandwidth and latency is more balanced for training. For the small batch/embeddings case, bandwidth matters to training QPS, as detailed in Fig. 12. High bandwidth cuts dense all-reduce times, since the large gradient messages place this operation in the bandwidth-dominated regime.

As with inference, higher bandwidth helps mitigate the throughput penalty of sharding. In the large batch/embeddings scenario, overall throughput increases almost proportionally with bandwidth, as shown in Fig. 13. The all-to-all exchange of unpooled embeddings between processors dominates.

Fig. 9: Impact of latency on QPS for small batch size, small embedding vectors, unsharded: (a) with 100GB/s and (b) with 1,000GB/s bandwidth per chip. Latency matters at both high and low bandwidth, and there is benefit in driving CC latency down to typical network-switch port-to-port levels.

Fig. 10: High bandwidth helps minimize the performance loss from sharding, even at high latency, due to the all-to-all exchange of unpooled embedding entries between processors.

VII. RECSPEED: AN OPTIMIZED SYSTEM ARCHITECTURE FOR RECSYS

This section describes the features of RecSpeed, a hypothetical system architecture for RecSys workloads. The objectives of RecSpeed are to:
• Maximize throughput for inference and training of large RecSys models;
• Support future, ever-larger RecSys models;
• Allow implementation using existing process technologies, and fit into common datacenter power envelopes;
• Support existing SW and HW interfaces for datacenter AI server racks.
A. RecSpeed Architecture Features
Fig. 14 shows a sketch of the proposed chip and interconnect structure for a 6-chip RecSpeed system. Key features of the architecture are as follows:
• Fixed-topology quadratic point-to-point interconnect without any form of switching, to minimize latency.
• HW support for Collective Communications, to minimize synchronization and SW-induced latency.
• Fast HBM memory attached to each chip, with as many stacks as practical; as of the writing of this paper, NVIDIA's A100 has room for 6 stacks of HBM (of which only 5 are used), each 4-deep; however, 8-deep stacks are available.
• Slow bulk memory, such as DDR4.
• Optimized packaging and system design to allow "scale-in", packing as many RecSpeed chips as possible in close physical proximity.

High-density physical packaging of chips is particularly important in order to achieve maximum throughput; when interconnect moves from intra-board to the system level, the energy consumed per bit rises sharply, bandwidth falls substantially, and overhead increases markedly [29].

Fixed-topology vs. switched all-to-all interconnect:
The proposed interconnect for RecSpeed could reduce latency compared to the all-to-all switched interconnect found in NVIDIA's DGX-2 [9] or Habana's HLS-1 training system [25]. Specifically, the presence of a switch introduces several hundred nanoseconds of additional latency [45] [46]. Compared to a ring interconnect, a quadratic interconnect offers substantial performance gains for large CC all-to-all messages on an 8-node system, and for smaller message sizes the gain can be considerably higher [10].

HW support for CC: We note that the proposed interconnect structure can efficiently support both CC all-to-all and all-reduce operations with minimal latency and bandwidth usage that matches the theoretical lower bound [8].
Implementing High-Bandwidth Links:
The proposed high-bandwidth RecSpeed links can be implemented via existing technology, amounting to roughly a third of the bandwidth of a Tomahawk 4 switch chip [26].

Hybrid Memory Support:
Certain embedding tables and vectors are accessed less often than others [7]. It is therefore useful to provide a two-level memory system, with large bulk memory in the form of DDR4 combined with fast HBM memory. Tables can be allocated to one memory or the other statically—sharded or not—or the faster HBM can be used as a cache. Static allocation is preferred, as dealing with such a large cache structure where the smaller memory has latency comparable to that of the larger memory may not offer much benefit, based on prior efforts such as Intel's Knights Landing HPC architecture [47]. With the configuration shown, up to 5.5TB of memory can be provided for a RecSpeed system, of which 27% would be fast HBM. Baidu reports that their AIBox [7] system is able to effectively hide storage access time, despite the two-order-of-magnitude latency difference and the greater than one-order-of-magnitude bandwidth difference between SSD and main memory. Since the performance gap between DDR and HBM memory is smaller, it is reasonable to assume that careful system design can enable a hybrid memory system to run closer to the performance of a full HBM system.

Fig. 11: Training performance upper bounds (QPS) as a function of Collective Communications latency and bandwidth: (a) small batch, small embeddings, unsharded; (b) small batch, small embeddings, sharded; (c) large batch, large embeddings, unsharded; (d) large batch, large embeddings, sharded. See text for analysis.
B. RecSpeed Performance vs. DGX-2
Table XIV shows the system characteristics that we use for computing the performance upper bound for RecSpeed, and Table XV those for the DGX-2. Note that we assume a "modified" V100 chip with more on-chip buffering memory than the actual V100.

TABLE XIV: Numbers used for RecSpeed performance upper bounds.

Parameter | Value
CC all-to-all | Latency of 1000ns, bandwidth of 1000GB/s
CC all-reduce | Latency of 1000ns, bandwidth of 1000GB/s
HBM memory | HBM2E @ 3000MHz, 6 stacks, 96GB
DDR4 memory | 1 channel, up to 256GB 3D TSV DIMM
The resulting throughput numbers and comparison versus NVIDIA's DGX-2 estimated upper bounds are shown in Table XVI for inference and Table XVII for training. In our model, the DGX-2 is largely bound by its high CC latency, which can likely be reduced via software optimization.

Fig. 12: Impact of bandwidth on training, small batch size, small embeddings, unsharded: (a) QPS and (b) relative duration of training phases. In contrast to inference, bandwidth helps directly for training due to the all-reduce of large message sizes from averaging gradients across processors.

TABLE XV: Numbers used for NVIDIA DGX-2 performance upper bounds.

Parameter | Value
CC all-to-all | Derived from CC all-gather [12]
CC all-reduce, CC all-gather | Data from [12]
Memory system | HBM2, 4 stacks @ 2300MHz
Bandwidth per chip | 150GB/s
On-chip memory | Assumed sufficient (not the case in practice)
Limitations:
In this section, we do not discuss trade-offs and issues—important as they may be—relating to die size, power, and thermal design.

CONCLUSION
This paper reviews the features of a representative Deep Learning Recommender System and describes hardware architectures that are used to deploy this and similar workloads. The performance of this representative Deep Learning Recommender is investigated with respect to its sensitivity to hardware system design parameters for training and inference. We identify the latency of collective communications operations as a crucial, yet overlooked, bottleneck that can limit recommender system throughput on many platforms. We also identify per-chip communication bandwidth, on-chip buffering and memory system lookup rates as further factors.

Fig. 13: Impact of bandwidth on training, large batch size, large embeddings, sharded: (a) QPS and (b) relative duration of training phases (FWD = forward pass, ALLREDUCE = dense all-reduce and dense update, SPARSE UPDT = sparse all-to-all or all-gather and sparse update). In this situation, the all-to-all exchange of unpooled embeddings between processors dominates. Increasing bandwidth reduces the time spent in this all-to-all, thus speeding up the forward pass.

TABLE XVI: RecSpeed Inference Upper Bounds.

Config. | QPS | Mem. Util. | DGX-2 QPS | DGX-2 Mem. Util. | Potential Speedup
Sm. Unshard | 300K | 67% | 4.9K | 1.8% | ∼61×
Sm. Shard | 207K | 47% | 4.5K | 1.6% | ∼46×
Lg. Unshard | 56K | 93% | 4.7K | 15% | ∼12×
Lg. Shard | 30K | 50% | 2.1K | 7% | ∼14×

TABLE XVII: RecSpeed Training Upper Bounds. (All-reduce refers to dense reduction and update.)

Config. | QPS | Allred. %Time | DGX-2 QPS | DGX-2 Allred. | Potential Speedup
Sm. Unshard | 99K | 33% | 2.2K | 31% | ∼45×
Sm. Shard | 83K | 28% | 2.1K | 30% | ∼40×
Lg. Unshard | 25K | 9% | 2K | 28% | ∼13×
Lg. Shard | 16K | 6% | 1.2K | 18% | ∼13×

Fig. 14: Proposed RecSpeed architecture: (a) schematic of chip (compute & CC processors, HBM stacks, DDR memory, SerDes, I/O fabric) and (b) system interconnect example with 6 chips.

We show that a novel architecture could achieve substantial throughput gains for inference and training on recommender system workloads by improving these factors beyond the state of the art via the use of "scale-in" to pack processing chips in close physical proximity, a two-level high-performance memory system combining HBM2E and DDR4, and a quadratic point-to-point fixed-topology interconnect. Specifically, achieving CC latencies of 1µs and chip-to-chip bandwidth of 1000GB/s would offer the potential to boost recommender system throughput by ∼12–61× for inference and ∼12–45× for training compared to upper bounds for NVIDIA's DGX-2 large-scale AI platform, while minimizing the performance impact of "sharding" embedding tables.

ACKNOWLEDGMENT
The authors would like to thank Prof. Kurt Keutzer and Dr. Amir Gholami for their support and their constructive feedback.

APPENDIX: STEPS AND OPERATORS FOR DLRM
This appendix presents a high-level algorithm representing the steps for training a DLRM model in a distributed fashion with n identical processors. The forward pass is covered in Algorithm 1 (used in both inference and training; note that for training it is assumed that activations are checkpointed as needed and optimally [37] to facilitate backpropagation) and the backward pass and weight update in Algorithm 2. Note that most operations are performed concurrently across all processors as the inference query or training batch is split up across all n processors. For simplicity, it is also assumed that the concatenation of the bottom MLP output and the pairwise dot products (after duplicate removal, diagonal removal, and vectorizing is completed) is performed as part of the FeatureInteractions operation.

For training, the optimizer used is vanilla SGD. AdaGrad is reported to achieve slightly better results for DLRM on the Criteo Ad Kaggle dataset [34]; however, we use vanilla SGD in our steps and performance model for consistency with the DLRM repo. The pipelining during the dense backward pass of collective communications (i.e. all-reduce operations) with the backpropagation computations and parameter updates is not shown; in practice, this is certainly feasible as long as the all-reduce latency is acceptable.

The notations used are shown in Table XVIII.
Other operations
This section introduces additional operations (other than collective communications operations) mentioned in Algorithms 1 and 2.

FC: The forward pass of the FC layer.

FeatureInteractions: The forward pass of the dot-product feature interactions layer with exclusion of the diagonal feature interactions matrix entries.

Concat: The concatenation operation of the batched bottom MLP output and batched pooled embedding vectors along a new dimension, as mentioned in Sec. III-D.

FCBackward: This operator takes in the gradient of the loss with respect to the output of a given FC layer, uses the checkpointed input activations to the layer, and returns both the gradient of the loss with respect to the weights of the layer as well as the gradient of the loss with respect to the input to the layer. This will then be used by the next FCBackward operator.

FeatureInteractionsBackward: This operator takes in the gradient of the loss with respect to the output of the feature interactions layer and returns the gradient of the loss with respect to each of the batched inputs to the feature interactions layer, which are the output of the bottom MLP as well as the pooled embeddings resulting from the embedding lookups in the model. Note that there are no weights in the feature interactions layer, so this is sufficient for FeatureInteractionsBackward.
Expand sparse grads: Because of pooling in this model, all of the gradients on embeddings that are pooled into a single output are identical. After communicating only the gradients on the pooled vectors, these values are simply "expanded", or copied, to every unpooled vector that was summed to generate the pooled vector. This operation also averages the gradients on the pooled vectors across the batch.

Params: Shorthand operator to denote the parameters associated with a given FC layer or embedding table.

TABLE XVIII: Notations for DLRM steps.

Symbol | Usage | Meaning
n_d | Constant | Number of dense features in the model
n_c | Constant | Number of sparse features or embedding tables
c_i | Constant | Cardinality of the i-th categorical feature
l_i | Constant | Number of lookups performed on the i-th embedding table
l_b | Constant | Number of bottom MLP layers
l_t | Constant | Number of top MLP layers
d | Constant | Embedding dimension
O_B | Fwd | Output of bottom MLP up to a given layer
L_i | Fwd | Local embedding lookup indices for i-th table after indices all-to-all
O_E | Fwd | Local embedding lookup vectors (possibly pooled)
V_i | Fwd | Pooled embedding vectors (after second all-to-all) resulting from lookups for the i-th table
F | Fwd | Feature interactions input, denoted as A in the feature interactions layer description of Sec. III-D
O_T | Fwd | Output of top MLP up to a given layer
GT_i | Bckwd/update | Gradient of loss w.r.t. output of i-th top MLP FC layer
GB_i | Bckwd/update | Gradient of loss w.r.t. output of i-th bottom MLP FC layer
GE_i | Bckwd/update | Pooled gradient (i.e. on processor which doesn't own relevant embeddings) of loss w.r.t. lookups from i-th embedding table
LGE_i | Bckwd/update | Pooled gradient of loss w.r.t. lookups from i-th table, on processor which owns relevant embeddings, after all-to-all/all-gather
FGE_i | Bckwd/update | Unpooled/expanded batch-reduced gradient of loss w.r.t. lookups from i-th table, on processor which owns relevant embeddings, after all-to-all/all-gather

Algorithm 1: DLRM forward pass steps.
Input: Number of processors p; b × n_d batch of dense features D; n_c sets of sparse features, each one called S_i, each one b × l_i; bottom MLP layers FC_B_i, i ∈ [1, l_b]; top MLP layers FC_T_i, i ∈ [1, l_t]; n_c embedding tables denoted as E_i, with each table representing a c_i × d matrix.

    O_B ← D
    for i = 1 ... l_b do
        O_B ← FC_B_i(O_B)
    end
    L_1, L_2, ..., L_{n_c} ← all_to_all([S_1, S_2, ..., S_{n_c}])
    O_E ← []
    for j = 1 ... n_c do
        O_E_j ← E_j(L_j)
        if no sharding then
            O_E_j ← pool(O_E_j)
        end
        O_E.append(O_E_j)
    end
    if full sharding then
        V_1, V_2, ..., V_{n_c} ← reduce_scatter(O_E)
    end
    if no sharding then
        V_1, V_2, ..., V_{n_c} ← all_to_all(O_E)
    end
    F ← concat([O_B, V_1, V_2, ..., V_{n_c}])
    F′ ← FeatureInteractions(F)
    O_T ← F′
    for i = 1 ... l_t do
        O_T ← FC_T_i(O_T)
    end
    p ← O_T[:, 0]

Output: Predicted click probabilities vector p

Algorithm 2: DLRM backward pass and weight update steps.
Input: Number of processors p; b × n_d batch of dense features D; n_c sets of sparse features, each one called S_i, each one b × l_i; bottom MLP backward operators FCBackward_B_i, i ∈ [1, l_b]; top MLP backward operators FCBackward_T_i, i ∈ [1, l_t]; n_c embedding tables denoted as E_i, with each table representing a c_i × d matrix; learning rate γ; predictions p ∈ (0, 1)^n; labels l ∈ [0, 1]^n; loss function L(p, l) = −(1/n) Σ_{i=1}^{n} [l_i log p_i + (1 − l_i) log(1 − p_i)] with predictions p and labels l.

    L_BCE ← L(p, l)
    ∇_{p_i} L ← −(1/b) (l_i/p_i − (1 − l_i)/(1 − p_i)) for each element i
    GT_{l_t+1} ← ∇_p L
    for i = l_t, l_t − 1, ..., 1 do
        GT_i, ∇_{FC_T_i} L ← FCBackward_T_i(GT_{i+1})
    end
    GB_{l_b+1}, GE_1, ..., GE_{n_c} ← FeatureInteractionsBackward(GT_1)
    if no sharding then
        LGE_1, ..., LGE_{n_c} ← all_to_all([GE_1, ..., GE_{n_c}])
    end
    if full sharding then
        LGE_1, ..., LGE_{n_c} ← all_gather([GE_1, ..., GE_{n_c}])
    end
    FGE_1, ..., FGE_{n_c} ← expand_sparse_grads([LGE_1, ..., LGE_{n_c}])
    for i = l_b, l_b − 1, ..., 1 do
        GB_i, ∇_{FC_B_i} L ← FCBackward_B_i(GB_{i+1})
    end