2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) | 2019

Accelerating Distributed Training in Heterogeneous Clusters via a Straggler-Aware Parameter Server


Abstract


Unlike in homogeneous clusters, distributed training in heterogeneous clusters suffers significant performance degradation due to stragglers. Instead of the synchronous stochastic optimization commonly used in homogeneous clusters, we adopt an asynchronous approach, which does not wait for stragglers but suffers from the use of stale parameters. To address this problem, we design a straggler-aware parameter server (SaPS), which detects stragglers through parameter versions and mitigates their effect with a coordinator that bounds parameter staleness without waiting for stragglers. Experimental results show that SaPS converges faster than fully synchronous, fully asynchronous, and several SGD variants.
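The core idea, detecting stragglers by how far their parameter version lags and rejecting updates whose staleness exceeds a bound, can be sketched as below. This is a minimal illustration, not the paper's actual SaPS implementation; the class, method, and parameter names are assumptions.

```python
# Sketch of a staleness-bounded parameter server (illustrative only; NOT
# the paper's actual SaPS code). Parameters carry a version number, and a
# coordinator rejects gradients whose version lag exceeds a bound, forcing
# stragglers to re-pull fresh parameters instead of blocking fast workers.

class BoundedStalenessPS:
    def __init__(self, params, max_staleness=3):
        self.params = params          # model parameters (a list of floats here)
        self.version = 0              # global parameter version
        self.max_staleness = max_staleness

    def pull(self):
        """A worker fetches the current parameters and their version."""
        return list(self.params), self.version

    def is_straggler(self, worker_version):
        """Detect a straggler by how far its parameter version lags."""
        return self.version - worker_version > self.max_staleness

    def push(self, grads, worker_version, lr=0.1):
        """Apply a gradient only if its staleness is within the bound;
        otherwise reject it so the straggler re-pulls fresh parameters."""
        if self.is_straggler(worker_version):
            return False              # too stale: worker must refresh first
        for i, g in enumerate(grads):
            self.params[i] -= lr * g
        self.version += 1
        return True


ps = BoundedStalenessPS([0.0, 0.0], max_staleness=2)
_, v0 = ps.pull()                     # a slow worker pulls at version 0
for _ in range(4):                    # fast workers advance the version
    ps.push([1.0, 1.0], ps.version)
accepted = ps.push([1.0, 1.0], v0)    # straggler's update is now too stale
print(accepted)                       # False: staleness 4 exceeds bound 2
```

Fast workers proceed asynchronously as long as their staleness stays within the bound, so no one waits on stragglers, yet no update built on arbitrarily old parameters is ever applied.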

Pages 200-207
DOI 10.1109/HPCC/SmartCity/DSS.2019.00042
Language English
