2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Understanding Capacity-Driven Scale-Out Neural Recommendation Inference

Abstract


Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes, which load an entire model onto a single server, cannot support this scale. One approach to supporting these models is distributed serving, or distributed inference, which partitions the memory requirements of a single large model across multiple servers. This work is a first step for the systems community toward developing novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work is the first to describe and characterize scale-out deep learning recommender inference using data-center serving infrastructure. It specifically explores latency-bounded inference systems, in contrast to the throughput-oriented training systems of other recent works. We find that the latency and compute overheads of distributed inference are largely attributable to a model's static embedding table distribution and the sparsity of inference request inputs. We evaluate three embedding table mapping strategies on three representative models and detail the challenging design trade-offs in terms of end-to-end latency, compute overhead, and resource efficiency. Overall, we observe a modest latency overhead from distributed inference: P99 latency increases by only 1% in the best-case configuration. The latency overheads result from the commodity infrastructure used and the sparsity of embedding tables. Encouragingly, we also show how distributed inference can yield efficiency improvements in data-center-scale recommendation serving.
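The abstract's central idea, mapping a model's embedding tables onto multiple servers under memory-capacity constraints, can be illustrated with a minimal sketch. The greedy capacity-balancing heuristic below, the function name, and the table sizes are all illustrative assumptions for exposition; the abstract does not specify the paper's actual three mapping strategies.

```python
def shard_tables(table_sizes, num_servers):
    """Hypothetical greedy capacity-balanced mapping: assign each
    embedding table (largest first) to the server with the least
    total memory assigned so far. Returns (table -> server) mapping
    and per-server memory load."""
    # Process tables from largest to smallest for better balance.
    order = sorted(range(len(table_sizes)), key=lambda t: -table_sizes[t])
    load = [0] * num_servers
    mapping = {}
    for t in order:
        # Place the table on the currently least-loaded server.
        s = min(range(num_servers), key=lambda i: load[i])
        mapping[t] = s
        load[s] += table_sizes[t]
    return mapping, load

# Illustrative table sizes in GB (not from the paper).
sizes = [96, 64, 48, 32, 16, 8]
mapping, load = shard_tables(sizes, num_servers=3)
print(mapping)  # which server holds each table
print(load)     # per-server memory footprint
```

A static mapping like this is computed once at model-load time; as the abstract notes, the quality of that mapping and the sparsity of request inputs then drive the latency and compute overheads seen at inference time.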

Pages 162-171
DOI 10.1109/ISPASS51385.2021.00033
Language English
Journal 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
