Proceedings of the 1st Workshop on Machine Learning and Systems | 2021

Fast Optimisation of Convolutional Neural Network Inference using System Performance Models

Abstract


The choice of convolutional routines (or primitives) used to implement the operations in a Convolutional Neural Network (CNN) has a tremendous impact on inference time. Optimising execution latency for a target system currently requires a lengthy profiling stage: iterating over every implementation of each convolutional primitive, in the configuration of each layer, to measure its execution time on that platform. Each primitive exercises the system resources differently, so new profiling is needed whenever a different system is targeted. In this work, we replace this prohibitively expensive profiling stage with a machine-learning-based approach to performance modelling. Our approach drastically speeds up optimisation by estimating the latency of convolutional primitives, in any layer configuration, running on a target system. We reduce the time needed to optimise the execution of large neural networks on an ARM Cortex-A73 system from hours to just seconds. Our performance model also transfers easily across target platforms: we demonstrate this by training it on an Intel platform and transferring its predictive performance to AMD and ARM systems, using only a few profiled samples from the target platforms to fine-tune the model.
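The core idea — replacing per-platform profiling with a learned predictor of primitive latency — can be illustrated with a toy sketch. This is our own illustration, not the paper's model: the feature set, the synthetic "profiled" timings, and the least-squares fit are all assumptions chosen to show the shape of the approach (fit a regressor from layer-configuration features to measured latency, then query the regressor instead of profiling).

```python
# Toy performance-model sketch (illustrative only, not the paper's method):
# learn latency from layer-configuration features instead of profiling.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer-configuration features per sample:
# [in_channels, out_channels, kernel_size, spatial_size]
X = rng.integers(1, 256, size=(200, 4)).astype(float)

# Synthetic stand-in for profiled timings: latency roughly proportional
# to the multiply-accumulate count (in_ch * out_ch * k^2 * spatial), plus noise.
macs = X[:, 0] * X[:, 1] * X[:, 2] ** 2 * X[:, 3]
y = 1e-6 * macs + rng.normal(0.0, 0.01, size=200)

# A multiplicative cost model becomes linear in log space, so fit
# ordinary least squares on log-transformed features with a bias term.
F = np.column_stack([np.log1p(X), np.ones(len(X))])
w, *_ = np.linalg.lstsq(F, np.log1p(y), rcond=None)

# The fitted model now predicts latency for any layer configuration,
# replacing a profiling run with a single cheap evaluation.
pred = np.expm1(F @ w)
```

In the paper's setting the same interface applies: the model is trained once on profiled samples from one platform, and only a handful of additional samples are needed to fine-tune it for a new target.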

DOI 10.1145/3437984.3458840
