2019 IEEE International Conference on Cloud Engineering (IC2E) | 2019

Understanding Synchronization Costs for Distributed ML on Transient Cloud Resources

 
 
 
 
 
 

Abstract


Cloud platforms often execute parallel batch applications, such as distributed machine learning (ML), that include numerous synchronization barriers. These barriers, which prevent any task from advancing beyond a specified point until all tasks have reached that point, significantly degrade application performance by reducing it to that of the slowest straggler task. To address the problem, researchers have proposed numerous straggler mitigation techniques, including speculatively re-executing straggler tasks and various relaxations of strict barrier semantics. While these techniques improve parallel application performance, they incur a cost in terms of the resources wasted re-executing tasks or waiting. Importantly, these costs, which are often implicit in prior work that targets dedicated resources, become explicit in the cloud, which charges for resources at fine-grained intervals. In addition, the cost difference between techniques is exacerbated in cloud platforms, since they charge substantially less for transient resources that effectively yield a probabilistic performance across a wide range. While transient resources low list price is attractive, revocations increase the frequency and severity of stragglers, which decreases parallel job performance and increases overall execution cost. To better understand the cost of synchronization, we develop simple analytical models of different straggler mitigation techniques and compare their cost and performance on on-demand and transient resources. Our analysis shows that i) transient servers offer complex tradeoffs compared to on-demand servers, and can result in higher overall costs despite their highly discounted price due to their probabilistic performance; ii) common approaches to straggler mitigation, which is a well-studied problem, are less effective using transient servers that cause frequent and severe stragglers; and iii) a recent approach to flexible synchronization offers the best cost and performance.

Volume None
Pages 145-155
DOI 10.1109/IC2E.2019.00029
Language English
Journal 2019 IEEE International Conference on Cloud Engineering (IC2E)

Full Text