2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | 2019

Aladdin: Optimized Maximum Flow Management for Shared Production Clusters

 
 
 
 
 
 
 

Abstract


The rise in popularity of long-lived applications (LLAs), such as deep learning and latency-sensitive online Web services, has brought new challenges for cluster schedulers in shared production environments. Scheduling LLAs requires support for complex placement constraints (e.g., running multiple containers of an application on different machines) and larger degrees of parallelism to provide global optimization. However, existing schedulers usually suffer from severe constraint violations, high latency, and low resource efficiency. This paper describes Aladdin, a novel cluster scheduler that can maximize resource efficiency while avoiding constraint violations: (i) it proposes a multidimensional and nonlinear capacity function to support constraint expressions; (ii) it applies an optimized maximum flow algorithm to improve resource efficiency. Experiments with an Alibaba workload trace from a 10,000-machine cluster show that Aladdin can reduce violated constraints by as much as 20%. Meanwhile, it improves resource efficiency by 50% compared with state-of-the-art schedulers.
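To make the general idea of flow-based placement concrete, the following is a minimal sketch and not Aladdin's algorithm. It assumes a single scalar "slots" capacity per machine, whereas the abstract describes a multidimensional and nonlinear capacity function. It uses a textbook Edmonds-Karp maximum flow rather than the paper's optimized variant. The names max_flow and place, and the sample applications and machines, are hypothetical. The anti-affinity constraint from the abstract (containers of one application on different machines) is encoded by giving each application-to-machine edge a capacity of 1.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Textbook Edmonds-Karp: augment along shortest residual paths."""
    # residual[u][v] holds the remaining capacity on edge u -> v.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
            residual[v][u] += 0  # ensure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def place(apps, machines):
    """apps: {app: container count}; machines: {machine: free slots}.

    Hypothetical simplification: one scalar resource per machine.
    """
    cap = defaultdict(dict)
    for app, count in apps.items():
        cap["source"][("app", app)] = count
        for m in machines:
            # Capacity 1 encodes anti-affinity: at most one container
            # of a given application on any one machine.
            cap[("app", app)][("machine", m)] = 1
    for m, slots in machines.items():
        cap[("machine", m)]["sink"] = slots
    placed, residual = max_flow(cap, "source", "sink")
    # Positive reverse residual means flow was sent app -> machine.
    assignment = {
        app: sorted(m for m in machines
                    if residual[("machine", m)][("app", app)] > 0)
        for app in apps
    }
    return placed, assignment


placed, assignment = place({"web": 3, "db": 2}, {"m1": 2, "m2": 2, "m3": 1})
print(placed, assignment)
# 5 {'web': ['m1', 'm2', 'm3'], 'db': ['m1', 'm2']}
```

In this toy instance all five containers are placed without violating anti-affinity. With three "web" containers and only two machines, the same model would place just two of them, however many free slots those machines have.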

Pages 696-707
DOI 10.1109/IPDPS.2019.00078
Language English
Journal 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
