
Distributed Recommendation Inference on FPGA Clusters

Abstract


Deep neural networks are widely used in personalized recommendation systems. Such models involve two major components: a memory-bound embedding layer and compute-bound fully-connected layers. Existing solutions are either slow on both stages or optimize only one of them. To implement recommendation inference efficiently under real deployment constraints, we design and implement an FPGA cluster that optimizes the performance of both stages. To remove the memory bottleneck, we take advantage of the High-Bandwidth Memory (HBM) available on the latest FPGAs for highly concurrent embedding-table lookups. To match the required DNN computation throughput, we partition the workload across multiple FPGAs interconnected by a 100 Gbps TCP/IP network. Compared to an optimized CPU baseline (16 vCPUs, AVX2-enabled) and a single-node FPGA implementation, our four-node system achieves 28.95× and 7.68× higher throughput, respectively. The proposed system also guarantees a latency of tens of microseconds per inference, significantly better than CPU- and GPU-based systems, which take at least milliseconds.
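
For concreteness, below is a minimal software sketch of the two-stage model the abstract describes: a memory-bound embedding-lookup stage followed by compute-bound fully-connected layers. This is a functional illustration only; the table count, table sizes, and layer widths are assumptions chosen for readability, not the paper's configuration, and the sketch does not itself model the HBM concurrency or the multi-FPGA partitioning.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's setup).
NUM_TABLES = 8
ROWS_PER_TABLE = 100_000
EMB_DIM = 16

rng = np.random.default_rng(0)

# Embedding tables: large, randomly accessed -> the memory-bound stage.
tables = [rng.standard_normal((ROWS_PER_TABLE, EMB_DIM), dtype=np.float32)
          for _ in range(NUM_TABLES)]

# Fully-connected layer weights -> the compute-bound stage.
fc_weights = [
    rng.standard_normal((NUM_TABLES * EMB_DIM, 256)).astype(np.float32),
    rng.standard_normal((256, 64)).astype(np.float32),
    rng.standard_normal((64, 1)).astype(np.float32),
]

def infer(sparse_ids: np.ndarray) -> float:
    """One inference; sparse_ids[t] is the row index into table t."""
    # Stage 1 (memory-bound): one random-access lookup per table.
    # In the paper's design these lookups run concurrently against
    # the FPGA's HBM channels.
    emb = np.concatenate([tables[t][sparse_ids[t]]
                          for t in range(NUM_TABLES)])
    # Stage 2 (compute-bound): dense matrix-vector products, the part
    # the paper scales out across FPGAs over a 100 Gbps TCP/IP network.
    x = emb
    for w in fc_weights[:-1]:
        x = np.maximum(x @ w, 0.0)           # hidden layers with ReLU
    logit = x @ fc_weights[-1]
    return float(1.0 / (1.0 + np.exp(-logit[0])))  # click probability

score = infer(rng.integers(0, ROWS_PER_TABLE, size=NUM_TABLES))
print(f"predicted click probability: {score:.4f}")
```

The sketch makes the bottleneck split visible: stage 1 performs scattered single-row reads over gigabyte-scale tables (bandwidth-limited), while stage 2 is regular dense arithmetic (compute-limited), which is why the paper treats the two stages with different hardware techniques.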

Pages 279-285
DOI 10.1109/FPL53798.2021.00057
Language English
Journal 2021 31st International Conference on Field-Programmable Logic and Applications (FPL)
