Disaggregated Accelerator Management System for Cloud Data Centers∗
Ryousei TAKANO†a) and Kuniyasu SUZAKI†, Members
SUMMARY
A conventional data center that consists of monolithic servers is confronted with limitations including lack of operational flexibility, low resource utilization, and low maintainability. Resource disaggregation is a promising solution to address these issues. We propose a concept of disaggregated cloud data center architecture called Flow-in-Cloud (FiC) that enables an existing cluster computer system to expand an accelerator pool through a high-speed network. FlowOS-RM manages the entire pool of resources and deploys a user job on a dynamically constructed slice according to a user request. This slice consists of compute nodes and accelerators, where each accelerator is attached to the corresponding compute node. This paper demonstrates the feasibility of FiC in a proof-of-concept experiment running a distributed deep learning application on the prototype system. The result successfully warrants the applicability of the proposed system.
key words:
Resource disaggregation, Resource management, Cloud Computing
1. Introduction
The end of Moore's law is coming within a decade due to technical and economic limitations. No further drastic performance improvement of general-purpose processors is expected, and new computing paradigms and architectures are needed for explosively growing computational workloads such as big data analysis and deep learning training and inference. Specialization, or in other words domain specific architecture (DSA), is a promising research direction in the post-Moore era. Specifically, many task-specific accelerators, including Google TPU, Fujitsu DLU, Microsoft BrainWave, and the D-Wave quantum annealer, have been proposed recently. To take advantage of such accelerators, it is important to establish a resource management system that fully utilizes a variety of hardware resources, including generic processors, accelerators, and storage, depending on the workload. However, a conventional data center consists of monolithic servers and cannot provide such flexible use of computing hardware resources. It also faces limitations including lack of operational flexibility, low resource utilization, and low maintainability [2], [3].
Manuscript received January 1, 2020.
Manuscript revised January 1, 2020.
†The authors are with the Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, 305-8560 Japan.
∗This paper extends our preliminary work published as an SC2018 research poster [1]. Specifically, we conducted additional experiments with a more realistic deep learning workload. Moreover, we have clarified the novelty in the related work section.
a) E-mail: [email protected]
DOI: 10.1587/transinf.E0.D.1
To address the limitations of conventional data centers, there is an emerging interest in resource disaggregation [2]–[6], which decomposes monolithic servers into independent hardware components, including CPUs, accelerators, memory, and storage, connected through a high-speed network. In a disaggregated data center, hardware components are separated into resource pools and recombined to meet user requirements. We have proposed a new concept of disaggregated data center architecture, Flow-in-Cloud (FiC), that enables an existing PC cluster to expand an accelerator pool through a high-speed network. FlowOS manages hardware resources and application jobs on FiC. To demonstrate the feasibility of FiC and FlowOS, we are currently developing FlowOS on the prototype system of FiC. This paper focuses on the resource management system of FlowOS, called FlowOS-RM, and reports on its effectiveness for distributed deep learning applications.
2. Related Work
Some papers [2], [3] have reported that the resource utilization, e.g., of CPU or main memory, varies considerably across servers in commercial data centers. This is because it is quite difficult to assign various workloads in such a way that all resources are fully and equally consumed. As a result, resource utilization remains low. To address this problem, many studies of resource disaggregation have emerged since the middle of the 2010s. This movement started with hardware-level resource disaggregation [4]–[7] and has expanded to OS-level resource disaggregation [2], [3] in recent years. Interconnection technologies for enabling disaggregation, including Gen-Z, are being standardized and will be commercialized in the near future. The proposed system addresses device disaggregation with a special focus on accelerators, and it seamlessly extends an existing cloud data center by facilitating access to an accelerator resource pool. This characteristic allows us to benefit from the software ecosystem of an existing cluster resource management system like Apache Mesos. To the best of our knowledge, there is no existing work that considers cooperation with cluster resource management systems. Although we use an electrical interconnection network, an optical network is promising, as several researchers have proposed [4]–[7].
Several device disaggregation technologies have been proposed [8], [9]. ExpEther [8] is a PCIe-over-Ethernet technology that allows us to dynamically attach and detach remote PCIe devices through Ethernet. On the other hand, rCUDA [9] is an OS-level disaggregation technology. Although it works only with NVIDIA GPUs, it can seamlessly access a remote GPU through InfiniBand and Ethernet. Our preliminary experiment [10] shows the performance overhead of device disaggregation technologies. The host-to-device bandwidth of ExpEther is about 20% of that of a local PCIe device. While this performance degradation is a worst-case situation, some applications' performance is bounded by the amount of traffic between the host and devices. On the other hand, the impact on compute-bound applications like GEMM and convolution is negligible. The practical problem with rCUDA is the lack of compatibility; for example, it does not support cuDNN, which is heavily used in deep learning applications.
3. Flow-in-Cloud
Flow-in-Cloud (FiC) is a proof-of-concept system for a disaggregated cloud data center, and it provides the user with a slice of resources as an application execution environment. An application is divided into several tasks, and each task is optimized through the use of a suitable accelerator. In the case of deep learning, a convolution layer task is executed on a GPU, and a fully connected layer task is executed on an FPGA. We call such a set of accelerators a meta-accelerator. A slice is dynamically configured by combining meta-accelerators and attaching them to the corresponding compute nodes according to a user request, as shown in Figure 1. Compute nodes and accelerators are connected through a high-speed circuit-switched network called the FiC network, which comprises a set of FiC switch boards [11] that have a middle-grade FPGA chip (Xilinx Kintex UltraScale XCKU095), 32 10-Gbps FireFly serial connections, and DRAM. In addition, a Raspberry Pi 3 is mounted on each board as a controller; it communicates with the FPGA via GPIO for FPGA configuration and data transmission. Note that we employ circuit switching instead of packet switching because it enables a friction-less transition from an electrical network to an optical network. A circuit-switching logic and a user-defined logic written in a high-level synthesis language run on the FPGA, and the latter can be partially reconfigured in advance of application deployment.
FlowOS manages the entire pool of FiC resources and supports the execution of a user job on provided slices. FlowOS employs a layered architecture consisting of FlowOS-Job, FlowOS-RM, and FlowOS-drivers. FlowOS-Job is a heterogeneous programming framework that allows users to describe a job composed of several tasks, where each task is optimized for a specific accelerator according to the workload (a minimal sketch of such a job description is given at the end of this section). FlowOS-RM is a resource manager, described in detail in Section 4. FlowOS-driver is a proxy component that accesses underlying hardware resources such as accelerator pools and compute nodes.
Currently, we have implemented FlowOS on a small disaggregated data center environment where compute nodes and accelerators are connected through ExpEther [8] instead of the FiC network.
Fig. 1
The overview of the Flow-in-Cloud architecture: compute nodes and a pool of disaggregated accelerators (FPGA, GPU, SCM) are connected by the FiC network, built from prototype FiC switch boards, and slices are configured as meta-accelerators according to application requirements.

FlowOS provides two major features, as mentioned above: disaggregated resource management and a heterogeneous programming framework. In this paper, we focus on the former and demonstrate disaggregated device management and cooperation with a cluster resource management system. Although ExpEther cannot support direct communication among accelerators, which the FiC network originally targets, it is a reasonable alternative technology to demonstrate the concept of FiC using commodity hardware.
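As an illustration of the FlowOS-Job layer mentioned above, the following Python sketch shows how a job might be described as a set of accelerator-specific tasks. The letter does not give the actual FlowOS-Job syntax, so the structure, field names, and file names below are assumptions made purely for illustration.

# Hypothetical FlowOS-Job description (illustrative only; the real syntax is
# not specified in this letter). A job is split into tasks, each bound to the
# accelerator it is optimized for; the union of these accelerators forms the
# meta-accelerator of the slice.
flowos_job = {
    "name": "cnn-training",
    "tasks": [
        {   # convolution layers are executed on a GPU
            "name": "conv-layers",
            "accelerator": "gpu",
            "command": ["python", "run_conv.py"],
        },
        {   # fully connected layers are executed on a user-defined FPGA logic
            "name": "fc-layers",
            "accelerator": "fpga",
            "bitstream": "fc_layer_hls.bit",
            "command": ["./run_fc"],
        },
    ],
}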
4. FlowOS-RM
FlowOS-RM seamlessly works in cooperation with a cluster resource management system such as Apache Mesos [12], Kubernetes, SLURM, and so on. In other words, FlowOS-RM extends such systems to support accelerator disaggregation. FlowOS-RM works in cooperation with the following components: (1) Disaggregated device management: ExpEther is a PCIe-over-Ethernet technology that allows us to dynamically attach and detach remote PCIe devices through Ethernet. (2) OS deployment: Bare-Metal Container (BMC) [13] constructs an execution environment to run a Docker image with an application-optimized OS kernel on a node. (3) Task scheduling and execution: FlowOS-RM is implemented on top of a Mesos framework; it co-allocates nodes to meet a user requirement and launches a task on each node in the manner of Mesos.
FlowOS-RM provides users with a REST API to configure a slice and execute a job on it. Both single-node jobs and MPI-style multi-node jobs are supported. Figure 2 presents the job execution flow in FlowOS-RM, where a job is a set of tasks and each task runs on a node belonging to a slice. Table 1 summarizes each operation of FlowOS-RM. First, a slice is constructed in two steps: attach-device and launch-machine. Second, a job is launched in the following two steps: prepare-task and launch-task. After job execution, the slice is destructed in two steps: detach-device and destroy-machine. A minimal sketch of driving this life cycle through the REST API follows.
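The concrete endpoints and payloads of the FlowOS-RM REST API are not specified in this letter; the Python sketch below therefore uses an assumed host name, URL scheme, and slice description, and only the order of the six operations follows the paper.

# Minimal sketch of the FlowOS-RM slice life cycle driven over a REST API.
# The endpoint paths, payload fields, and host name are assumptions made for
# illustration; only the six-step ordering (Table 1) comes from the paper.
import requests

BASE = "http://flowos-rm.example:8080/api/v1"   # hypothetical endpoint

slice_spec = {
    "nodes": [
        {"name": "node1", "devices": ["gpu-p100-0", "gpu-p100-1"]},
        {"name": "node2", "devices": ["gpu-p100-2", "gpu-p100-3"]},
    ],
    "image": "chainermn-mnist:latest",
}

# 1-2. Slice construction: attach devices, then boot nodes under Mesos.
requests.post(f"{BASE}/slices/demo/attach-device", json=slice_spec).raise_for_status()
requests.post(f"{BASE}/slices/demo/launch-machine").raise_for_status()

# 3-4. Job execution: submit the task through Mesos, then launch it.
job = {"command": "mpiexec -n 4 python train_mnist.py"}
requests.post(f"{BASE}/slices/demo/prepare-task", json=job).raise_for_status()
requests.post(f"{BASE}/slices/demo/launch-task").raise_for_status()

# 5-6. Slice destruction: detach devices and shut the nodes down.
requests.post(f"{BASE}/slices/demo/detach-device").raise_for_status()
requests.post(f"{BASE}/slices/demo/destroy-machine").raise_for_status()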
5. Experiment
(Figure 2 depicts compute nodes, each with two ExpEther HBAs, connected through an Ethernet switch to I/O boxes holding P100 GPUs. FlowOS-RM drives ExpEther, BMC, and Mesos through its REST API: device management (1. attach-device, 5. detach-device), OS deployment (2. launch-machine, 6. destroy-machine), and task execution (3. prepare-task, 4. launch-task) via Mesos agents and containers on the compute nodes.)
Fig. 2
Job execution flow in FlowOS-RM
Table 1
Major operations in FlowOS-RM
attach-device: Attach devices to a node.
launch-machine: Boot a node with a specific OS kernel and container; the node joins the active nodes under Mesos.
prepare-task: Perform housekeeping for launching a task, including submitting the task to the corresponding node through Mesos.
launch-task: Launch a task on a node (running state).
detach-device: Detach devices from a node.
destroy-machine: Shut down a node; the node leaves the active nodes.
Table 2
Experimental Setting
Compute Node Configuration
CPU: 10-core Intel Xeon E5-2630 v4 / 2.2 GHz
M/B: Supermicro X10SRG-F
Memory: 128 GB DDR4-2133
Network: ExpEther 40G HBA
NIC: Intel I350 (Gigabit Ethernet)
Disaggregated Resources (PCIe devices)
GPU: NVIDIA Tesla P100 x4, P40 x1
NVMe: Intel SSD 750 x4
Software Configuration
OS: CentOS 7.4
Kernel: Linux 3.10.0-514.26.2.el7.x86_64
Middleware: Mesos 1.4.1, ChainerMN, OpenMPI 3.1.0, NVIDIA CUDA 8.0.61

We have conducted distributed deep learning training experiments on a four-node cluster environment as shown in Figure 3a. Each compute node has two ExpEther HBAs to connect PCIe devices, e.g., an NVIDIA Tesla P100 GPU and an Intel NVMe SSD, on I/O boxes through a 40 GbE switch. Linux runs on each compute node, and FlowOS-RM and Mesos are installed in this environment. We used two applications, handwritten character recognition (MNIST) and large-scale image classification (ImageNet), as benchmark programs; they are implemented with the distributed deep learning framework ChainerMN [14].
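The benchmark code itself is not listed in the letter. The following is a minimal sketch of ChainerMN-style distributed MNIST training of the kind run on a slice; the model, optimizer, and hyperparameters are our assumptions and are not the exact configuration used in the experiment. It would be launched across the slice nodes with, e.g., mpiexec -n 4 python train_mnist.py.

# Minimal ChainerMN distributed MNIST training sketch (assumed configuration).
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions
import chainermn


class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 1000)
            self.l2 = L.Linear(None, 10)

    def __call__(self, x):
        return self.l2(F.relu(self.l1(x)))


def main():
    # One communicator per MPI process; each process drives one GPU of the slice.
    comm = chainermn.create_communicator('pure_nccl')
    device = comm.intra_rank  # GPU id within the node

    model = L.Classifier(MLP())
    chainer.cuda.get_device_from_id(device).use()
    model.to_gpu()

    # Wrap the optimizer so gradients are all-reduced across the slice.
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.Adam(), comm)
    optimizer.setup(model)

    # Scatter the training set so each process trains on its own shard.
    train, _ = chainer.datasets.get_mnist()
    train = chainermn.scatter_dataset(train, comm, shuffle=True)
    train_iter = chainer.iterators.SerialIterator(train, batch_size=100)

    updater = training.StandardUpdater(train_iter, optimizer, device=device)
    trainer = training.Trainer(updater, (10, 'epoch'))
    if comm.rank == 0:
        trainer.extend(extensions.LogReport())
    trainer.run()


if __name__ == '__main__':
    main()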
(Figure 3a shows the physical cluster: four compute nodes, each with two ExpEther HBAs, connected through a 40G Ethernet switch to I/O boxes holding four P100 GPUs, one P40 GPU, and four NVMe SSDs. Figure 3b shows the slice configurations combining the compute nodes with different numbers of P100 GPUs per node.)
Fig. 3
Experimental Configuration

The run-task elapsed times of the 4node-1gpu, 2node-2gpu, and 1node-4gpu configurations for MNIST are 366.36, 237.31, and 104.57 seconds, respectively (Fig. 4a). MNIST is a relatively lightweight workload, and the slice construction and destruction operations account for 32% to 45% of the total execution time. Specifically, the launch-machine operation takes longer as the number of nodes increases, because downloading a container image of about 3 GB through GbE becomes the bottleneck. Some operations, including attach/detach-device and launch-task, take longer as the number of GPUs per node increases, because these operations are not parallelized. We plan to reduce this overhead by using a faster network and parallelizing the operations.
Secondly, we ran ImageNet, a more practical application, on the same slice configurations. We used the ResNet-50 model and the ILSVRC2012 dataset. In this experiment, a slice has not only GPUs but also two NVMe SSDs to store the ILSVRC2012 dataset. Unlike the MNIST experiment, the slice construction and destruction operations account for only 0.15% to 0.17% of the total execution time, as shown in Figures 4b and 4c. Generally speaking, deep learning training times tend to increase significantly, so the overhead of FlowOS-RM becomes negligible.
5.2.2 Resource sharing
Fig. 4
Slice execution life cycle, broken down per node into attach-device, launch-machine, prepare-task, run-task, detach-device, and destroy-machine: (a) MNIST on three slice configurations; (b) ImageNet on the 4node-1gpu slice configuration; (c) ImageNet on the 2node-2gpu slice configuration.

Fig. 5
Resource sharing: (a) slice configurations, plotted as GPUs allocated to Slice1–Slice4 over time (seconds).

We confirmed that disaggregated resources are shared among several slices according to user requirements. In this experiment, a user submitted four MNIST application jobs, and FlowOS-RM allocated resources to each slice in FIFO order. The slice configurations of the jobs are as follows: Slice1 and Slice2 consist of 2node-2gpu (P100), Slice3 consists of 1node-1gpu (P40), and Slice4 consists of 4node-1gpu (P100). Figure 5a shows that resource sharing among slices works as expected.
6. Conclusion and Future Work
We have demonstrated flexible and effective resource sharing on the proposed disaggregated resource management system (FlowOS-RM) for AI and big data applications in next-generation cloud data centers. We found some performance issues, but their impact is limited for long, hours-running applications like distributed deep learning training. Our future work is to replace ExpEther with the FiC network, and then to open up a new perspective on heterogeneous accelerator computing by leveraging resource disaggregation. Furthermore, in this experiment, we could not take full advantage of the potential of bare-metal containers. Thus, we plan to evaluate various applications on this system while applying performance optimization techniques such as profile-guided optimization.
Acknowledgement
The authors would like to thank Hidetaka Koie, SURIGIKEN, for support on the engineering effort, and Jason Haga for his valuable comments. This paper is partially based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
References
[1] R. Takano, K. Suzaki, and H. Koie, "FlowOS-RM: Disaggregated Resource Management System," ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Poster), 2018.
[2] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K.G. Shin, "Efficient memory disaggregation with Infiniswap," 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp.649–667, 2017.
[3] Y. Shan, Y. Huang, Y. Chen, and Y. Zhang, "LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation," 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp.69–87, 2018.
[4] K. Asanovic, "Firebox: A hardware building block for 2020 warehouse-scale computers," 2014.
[5] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodoropoulos, I. Koutsopoulos, K. Hasharoni, D. Raho, C. Pinto, F. Espina, S. Lopez-Buedo, Q. Chen, M. Nemirovsky, D. Roca, H. Klos, and T. Berends, "Rack-scale disaggregated cloud data centers: The dReDBox project vision," Design, Automation & Test in Europe Conference & Exhibition (DATE), pp.690–695, March 2016.
[6] X. Guo, F. Yan, X. Xue, G. Exarchakos, and N. Calabretta, "Performance assessment of a novel rack-scale disaggregated data center with fast optical switch," Optical Fiber Communications Conference and Exhibition, pp.1–3, 2019.
[7] HP Labs, "The Machine: Our vision for the future of computing," 2018.
[8] J. Suzuki, Y. Hidaka, J. Higuchi, T. Yoshikawa, and A. Iwata, "ExpressEther - Ethernet-Based Virtualization Technology for Reconfigurable Hardware Platform," 14th IEEE Symposium on High-Performance Interconnects (HOTI), pp.45–51, 2006.
[9] J. Duato, A.J. Pena, F. Silla, R. Mayo, and E.S. Quintana-Ortí, "rCUDA: Reducing the number of GPU-based accelerators in high performance clusters," IEEE International Conference on High Performance Computing and Simulation (HPCS), pp.224–231, 2010.
[10] R. Takano, T. Ikegami, K. Suzaki, A. Tanaka, and T. Hirofuchi, "FlowOS: A conception of system software for accelerator clouds," IPSJ SIG Technical Report 2017-HPC-163, pp.1–7, 2018 (in Japanese).
[11] K. Hironaka, A.B. Ahmed, and H. Amano, "Multi-FPGA Management on Flow-in-Cloud Prototype System," 20th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp.443–448, July 2019.
[12] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R.H. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," NSDI, pp.1–14, 2011.