Proceedings of the 2021 International Conference on Management of Data | 2021

Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce

Abstract


All-reduce is the key communication primitive in distributed data-parallel training because of its high performance in homogeneous environments. However, all-reduce is sensitive to stragglers and communication delays, and deep learning is increasingly deployed in heterogeneous environments such as the cloud. In this paper, we propose and analyze a novel variant of all-reduce, called partial-reduce, which provides high heterogeneity tolerance and performance by decomposing the synchronous all-reduce primitive into parallel, asynchronous partial-reduce operations. We provide theoretical guarantees, proving that partial-reduce converges to a stationary point at a sub-linear rate similar to that of distributed SGD. To enforce the convergence of the partial-reduce primitive, we further propose a dynamic staleness-aware distributed averaging algorithm and implement a novel group generation mechanism to prevent possible update isolation in heterogeneous environments. We build a prototype system on a real production cluster and validate its performance under different workloads. The experiments show that it is 1.21x-2x faster than other state-of-the-art baselines.
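To make the contrast concrete, the following is a minimal sketch, not the paper's implementation: it simulates one round in which a synchronous all-reduce must wait for the slowest of N workers, while a partial-reduce averages gradients from only the P earliest-arriving workers and lets stragglers contribute to later rounds. The names (simulate delays, P, arrival_delay) and the exponential straggler model are illustrative assumptions, not the authors' algorithm or API.

    import numpy as np

    # Illustrative sketch only: partial-reduce vs. all-reduce in one round.
    # All parameters and the delay model are assumptions for demonstration.
    rng = np.random.default_rng(0)

    N = 8      # total workers
    P = 4      # partial-reduce group size (P <= N)
    DIM = 4    # gradient dimension

    # Each worker produces a local gradient and a simulated arrival delay;
    # stragglers have larger delays in a heterogeneous cluster.
    local_grads = rng.normal(size=(N, DIM))
    arrival_delay = rng.exponential(scale=1.0, size=N)

    # All-reduce: block until every worker arrives, then average all N gradients.
    allreduce_time = arrival_delay.max()
    allreduce_grad = local_grads.mean(axis=0)

    # Partial-reduce: average only the P earliest workers' gradients; the
    # remaining workers fold their updates into later rounds (not shown here).
    ready = np.argsort(arrival_delay)[:P]
    partial_time = arrival_delay[ready].max()
    partial_grad = local_grads[ready].mean(axis=0)

    print(f"all-reduce waits {allreduce_time:.2f}s, "
          f"partial-reduce waits {partial_time:.2f}s")
    print("all-reduce avg:    ", np.round(allreduce_grad, 3))
    print("partial-reduce avg:", np.round(partial_grad, 3))

In this toy round the partial-reduce finishes as soon as the fourth-fastest worker arrives, which is the source of the speedup the abstract reports; the paper's staleness-aware averaging and group generation mechanisms address the accuracy side that this sketch ignores.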

DOI 10.1145/3448016.3452773
Language English
Journal Proceedings of the 2021 International Conference on Management of Data
