ArXiv | 2019

Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization

 

Abstract


GPU-accelerated computing is a key technology for realizing high-speed inference servers based on deep neural networks (DNNs). An important characteristic of GPU-based inference is that its computational efficiency, in terms of both processing speed and energy consumption, increases drastically when multiple jobs are processed together in a batch. In this paper, we formulate GPU-based inference servers as a batch-service queueing model with batch-size dependent processing times. We first show that the energy efficiency of the server increases monotonically with the arrival rate of inference jobs, which suggests that the server is most energy-efficient when operated at the highest utilization level that the latency requirement of inference jobs allows. We then derive a closed-form upper bound on the mean latency, which provides a simple characterization of the latency performance. Through simulation and numerical experiments, we show that this upper bound closely approximates the exact value of the mean latency.

Volume abs/1912.06322
DOI 10.1016/j.peva.2020.102183
Language English
Journal ArXiv

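The abstract does not spell out the model's details, but the setting it describes, a single-server batch-service queue in which the server collects all waiting jobs up to a maximum batch size and the batch service time depends on the batch size, is straightforward to simulate. The sketch below is a minimal discrete-event simulation under assumed parameters: Poisson arrivals with rate lam, a hypothetical affine service-time model s(b) = a + c*b, and jobs-per-busy-time as an energy-efficiency proxy. None of these specifics (parameter values, the affine form, the efficiency proxy) come from the paper itself.

```python
import random

def simulate(lam, B, a=1.0, c=0.1, n_jobs=100_000, seed=1):
    """Simulate a single-server batch-service queue with dynamic batching.

    When the server becomes idle, it takes every waiting job, up to the
    maximum batch size B, and serves them as one batch.  A batch of size
    b takes s(b) = a + c*b time units (assumed affine model: fixed
    overhead a plus per-job cost c), so the per-job time a/b + c shrinks
    as batches grow -- the efficiency effect described in the abstract.
    """
    rng = random.Random(seed)

    # Poisson arrivals: cumulative sums of exponential inter-arrival times.
    t, arrivals = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(lam)
        arrivals.append(t)

    server_free = 0.0    # time at which the server next becomes idle
    i = 0                # index of the first job not yet served
    total_latency = 0.0
    busy_time = 0.0

    while i < n_jobs:
        # A batch starts once the server is free and a job has arrived.
        start = max(server_free, arrivals[i])
        # Dynamic batching: take every job present at `start`, up to B.
        j = i + 1
        while j < n_jobs and j - i < B and arrivals[j] <= start:
            j += 1
        b = j - i
        service = a + c * b          # batch-size dependent service time
        done = start + service
        busy_time += service
        total_latency += sum(done - arrivals[k] for k in range(i, j))
        server_free, i = done, j

    # Jobs served per unit of busy time: a crude proxy for energy
    # efficiency under the assumption that energy is consumed only
    # while the GPU is busy.
    return total_latency / n_jobs, n_jobs / busy_time

if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0, 4.0):
        W, eff = simulate(lam, B=8)
        print(f"lam={lam:.1f}: mean latency {W:7.2f}, jobs/busy-time {eff:.2f}")
```

Running this sweep illustrates the qualitative claim in the abstract: as the arrival rate grows, batches fill up more often, so jobs served per unit of busy time rises monotonically, while the mean latency also rises and eventually dominates, which is why the paper frames the operating point as the highest utilization compatible with the latency requirement.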