Wireless Personal Communications | 2021

Recommender System for Optimal Distributed Deep Learning in Cloud Datacenters

 
 
 
 

Abstract


With modern advancements in deep learning architectures and abundant research in areas such as computer vision, natural language processing, and forecasting, models are becoming more complex and datasets are growing exponentially in size, demanding higher-performing and faster computing machines from researchers and engineers. TensorFlow addresses this issue by providing a range of high-level distributed deep learning APIs that can scale training from a single machine to many. In this paper, we investigate the performance of computing clusters that utilize these APIs. We create clusters of different sizes and discuss the performance issues of distributed deep learning under high-latency and poor communication conditions. To address the challenge of finding the optimal cluster for fast distributed deep learning, we propose a recommendation system that, given the batch size and network latency, suggests the cluster size with the fastest training time. Our results show that, for certain algorithms, a two-machine cluster is both faster and cheaper than a four-machine cluster when network delay is high.
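The abstract refers to TensorFlow's high-level distributed training APIs. A minimal sketch of one such API, tf.distribute.MultiWorkerMirroredStrategy, is shown below; the model, dataset, and two-worker cluster are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
import tensorflow as tf

# Minimal sketch: a multi-worker distribution strategy replicates the
# model across machines; each worker reads its cluster membership from
# the TF_CONFIG environment variable (hypothetical two-machine cluster).
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the strategy scope are mirrored on every
    # worker, and gradients are aggregated over the network each step.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Every worker runs the same script; the global batch size is split
# across workers, so batch size and network latency together determine
# the wall-clock training time that the recommender tries to minimize.
# model.fit(train_dataset, epochs=5)
```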

DOI 10.1007/s11277-021-08699-3
Language English
Journal Wireless Personal Communications
