Disaggregated Accelerator Management System for Cloud Data Centers∗
Ryousei TAKANO†a) and Kuniyasu SUZAKI†, Members
SUMMARY
A conventional data center that consists of monolithic servers is confronted with limitations including lack of operational flexibility, low resource utilization, and low maintainability. Resource disaggregation is a promising solution to address these issues. We propose a concept of disaggregated cloud data center architecture called Flow-in-Cloud (FiC) that enables an existing cluster computer system to expand an accelerator pool through a high-speed network. FlowOS-RM manages the entire pool of resources and deploys a user job on a dynamically constructed slice according to a user request. This slice consists of compute nodes and accelerators, where each accelerator is attached to the corresponding compute node. This paper demonstrates the feasibility of FiC in a proof-of-concept experiment running a distributed deep learning application on the prototype system. The result successfully warrants the applicability of the proposed system.
key words:
Resource disaggregation, Resource management, Cloud Computing
1. Introduction
The end of Moore's law is coming within a decade due to technical and economic limitations. No further drastic performance improvement of general-purpose processors is expected, and new computing paradigms and architectures are needed for explosively growing computational workloads such as big data analysis and deep learning training and inference. Specialization, or in other words domain specific architecture (DSA), is a promising research direction in the post-Moore era. Specifically, many task-specific accelerators, including Google TPU, Fujitsu DLU, Microsoft BrainWave, and the D-Wave quantum annealer, have been proposed recently. To take advantage of such accelerators, it is important to establish a resource management system that fully utilizes a variety of hardware resources, including generic processors, accelerators, and storage, depending on the workload. However, a conventional data center consists of monolithic servers and cannot provide such flexible use of computing hardware resources. It also faces limitations including lack of operational flexibility, low resource utilization, and low maintainability [2], [3].
Manuscript received January 1, 2020.
Manuscript revised January 1, 2020.
†The authors are with the Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, 305-8560 Japan.
∗This paper extends our preliminary work published as an SC2018 research poster [1]. Specifically, we conducted additional experiments with a more realistic deep learning workload. Moreover, we have clarified the novelty in the related work section.
a) E-mail: [email protected]
DOI: 10.1587/transinf.E0.D.1
To address the limitations of conventional data centers, there is an emerging interest in resource disaggregation [2]–[6], which decomposes monolithic servers into independent hardware components, including CPUs, accelerators, memory, and storage, connected through a high-speed network. In a disaggregated data center, hardware components are separated into resource pools and recombined to meet user requirements. We have proposed a new concept of disaggregated data center architecture, Flow-in-Cloud (FiC), that enables an existing PC cluster to expand an accelerator pool through a high-speed network. FlowOS manages hardware resources and application jobs on FiC. To demonstrate the feasibility of FiC and FlowOS, we are currently developing FlowOS on the prototype system of FiC. This paper focuses on the resource management system of FlowOS, called FlowOS-RM, and reports on its effectiveness for distributed deep learning applications.
2. Related Work
Some papers [2], [3] have reported that the resource utilization, e.g., of CPU or main memory, varies considerably across servers in commercial data centers. This is because it is quite difficult to assign various workloads in such a way that all resources are fully and equally consumed. As a result, resource utilization remains low. To address this problem, many studies of resource disaggregation have emerged since the middle of the 2010s. This movement started with hardware-level resource disaggregation [4]–[7] and has expanded to OS-level resource disaggregation [2], [3] in recent years. Interconnection technologies for enabling disaggregation, including Gen-Z, are being standardized and will be commercialized in the near future. The proposed system addresses device disaggregation with a special focus on accelerators, and it seamlessly extends an existing cloud data center by facilitating access to an accelerator resource pool. This characteristic allows us to benefit from the software ecosystem of an existing cluster resource management system like Apache Mesos. To the best of our knowledge, there is no existing work that considers cooperation with cluster resource management systems. Although we use an electrical interconnection network, an optical network is promising, as several researchers have proposed [4]–[7].
Several device disaggregation technologies have been proposed [8], [9]. ExpEther [8] is a PCIe-over-Ethernet technology that allows us to dynamically attach and detach remote PCIe devices through Ethernet. On the other hand, rCUDA [9] is an OS-level disaggregation technology. Although it works only with NVIDIA GPUs, it can seamlessly access a remote GPU through InfiniBand and Ethernet. Our preliminary experiment [10] shows the performance overhead of device disaggregation technologies. The host-to-device bandwidth of ExpEther is about 20% of that of a local PCIe device. While this performance degradation is a worst-case situation, some applications' performance is bounded by the amount of traffic between the host and devices. On the other hand, the impact on compute-bound applications like GEMM and convolution is negligible. The practical problem with rCUDA is the lack of compatibility; for example, it does not support cuDNN, which is heavily used in deep learning applications.
3. Flow-in-Cloud
Flow-in-Cloud (FiC) is a proof-of-concept system for a disaggregated cloud data center, and it provides the user with a slice of resources as an application execution environment. An application is divided into several tasks, and each task is optimized through the use of a suitable accelerator. In the case of deep learning, a convolution layer task is executed on a GPU, and a fully connected layer task is executed on an FPGA. We call such a set of accelerators a meta-accelerator. A slice is dynamically configured by combining meta-accelerators and attaching them to the corresponding compute nodes according to a user request, as shown in Figure 1. Compute nodes and accelerators are connected through a high-speed circuit-switched network called the FiC network, which comprises a set of FiC switch boards [11] that have a middle-grade FPGA chip (Xilinx Kintex UltraScale XCKU095), 32 10-Gbps FireFly serial connections, and DRAM. In addition, a Raspberry Pi 3 is mounted on each board as a controller; it communicates with the FPGA via GPIO for FPGA configuration and data transmission. Note that we employ circuit switching instead of packet switching because it enables a friction-less transition from an electrical network to an optical network. A circuit-switching logic and a user-defined logic written in a high-level synthesis language run on the FPGA, and the latter can be partially reconfigured in advance of application deployment.
FlowOS manages the entire pool of FiC resources and supports the execution of a user job on provided slices. FlowOS employs a layered architecture consisting of FlowOS-Job, FlowOS-RM, and FlowOS-drivers. FlowOS-Job is a heterogeneous programming framework that allows users to describe a job composed of several tasks, where each task is optimized for a specific accelerator according to the workload (a minimal sketch of such a job description is given at the end of this section). FlowOS-RM is a resource manager, described in detail in Section 4. FlowOS-driver is a proxy component that accesses underlying hardware resources such as accelerator pools and compute nodes.
Currently, we have implemented FlowOS on a small disaggregated data center environment where compute nodes and accelerators are connected through ExpEther [8] instead of the FiC network.
Fig. 1
The overview of the Flow-in-Cloud architecture: compute nodes and a pool of disaggregated accelerators (FPGA, GPU, SCM) are connected by the FiC network, built from prototype FiC switch boards, and slices are configured as meta-accelerators according to application requirements.

FlowOS provides two major features, as mentioned above: disaggregated resource management and a heterogeneous programming framework. In this paper, we focus on the former and demonstrate disaggregated device management and cooperation with a cluster resource management system. Although ExpEther cannot support direct communication among accelerators, which the FiC network originally targets, it is a reasonable alternative technology to demonstrate the concept of FiC using commodity hardware.
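As an illustration of the FlowOS-Job layer mentioned above, the following Python sketch shows how a job might be described as a set of accelerator-specific tasks. The letter does not give the actual FlowOS-Job syntax, so the structure, field names, and file names below are assumptions made purely for illustration.

# Hypothetical FlowOS-Job description (illustrative only; the real syntax is
# not specified in this letter). A job is split into tasks, each bound to the
# accelerator it is optimized for; the union of these accelerators forms the
# meta-accelerator of the slice.
flowos_job = {
    "name": "cnn-training",
    "tasks": [
        {   # convolution layers are executed on a GPU
            "name": "conv-layers",
            "accelerator": "gpu",
            "command": ["python", "run_conv.py"],
        },
        {   # fully connected layers are executed on a user-defined FPGA logic
            "name": "fc-layers",
            "accelerator": "fpga",
            "bitstream": "fc_layer_hls.bit",
            "command": ["./run_fc"],
        },
    ],
}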
4. FlowOS-RM
FlowOS-RM seamlessly works in cooperation with a cluster resource management system such as Apache Mesos [12], Kubernetes, SLURM, and so on. In other words, FlowOS-RM extends such systems to support accelerator disaggregation. FlowOS-RM works in cooperation with the following components: (1) Disaggregated device management: ExpEther is a PCIe-over-Ethernet technology that allows us to dynamically attach and detach remote PCIe devices through Ethernet. (2) OS deployment: Bare-Metal Container (BMC) [13] constructs an execution environment to run a Docker image with an application-optimized OS kernel on a node. (3) Task scheduling and execution: FlowOS-RM is implemented on top of a Mesos framework; it co-allocates nodes to meet a user requirement and launches a task on each node in the manner of Mesos.
FlowOS-RM provides users with a REST API to configure a slice and execute a job on it. Both single-node jobs and MPI-style multi-node jobs are supported. Figure 2 presents the job execution flow in FlowOS-RM, where a job is a set of tasks and each task runs on a node belonging to a slice. Table 1 summarizes each operation of FlowOS-RM. First, a slice is constructed in two steps: attach-device and launch-machine. Second, a job is launched in the following two steps: prepare-task and launch-task. After job execution, the slice is destructed in two steps: detach-device and destroy-machine. A minimal sketch of driving this life cycle through the REST API follows.
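The concrete endpoints and payloads of the FlowOS-RM REST API are not specified in this letter; the Python sketch below therefore uses an assumed host name, URL scheme, and slice description, and only the order of the six operations follows the paper.

# Minimal sketch of the FlowOS-RM slice life cycle driven over a REST API.
# The endpoint paths, payload fields, and host name are assumptions made for
# illustration; only the six-step ordering (Table 1) comes from the paper.
import requests

BASE = "http://flowos-rm.example:8080/api/v1"   # hypothetical endpoint

slice_spec = {
    "nodes": [
        {"name": "node1", "devices": ["gpu-p100-0", "gpu-p100-1"]},
        {"name": "node2", "devices": ["gpu-p100-2", "gpu-p100-3"]},
    ],
    "image": "chainermn-mnist:latest",
}

# 1-2. Slice construction: attach devices, then boot nodes under Mesos.
requests.post(f"{BASE}/slices/demo/attach-device", json=slice_spec).raise_for_status()
requests.post(f"{BASE}/slices/demo/launch-machine").raise_for_status()

# 3-4. Job execution: submit the task through Mesos, then launch it.
job = {"command": "mpiexec -n 4 python train_mnist.py"}
requests.post(f"{BASE}/slices/demo/prepare-task", json=job).raise_for_status()
requests.post(f"{BASE}/slices/demo/launch-task").raise_for_status()

# 5-6. Slice destruction: detach devices and shut the nodes down.
requests.post(f"{BASE}/slices/demo/detach-device").raise_for_status()
requests.post(f"{BASE}/slices/demo/destroy-machine").raise_for_status()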
5. Experiment
(Figure 2 depicts compute nodes, each with two ExpEther HBAs, connected through an Ethernet switch to I/O boxes holding P100 GPUs. FlowOS-RM drives ExpEther, BMC, and Mesos through its REST API: device management (1. attach-device, 5. detach-device), OS deployment (2. launch-machine, 6. destroy-machine), and task execution (3. prepare-task, 4. launch-task) via Mesos agents and containers on the compute nodes.)
Fig. 2
Job execution flow in FlowOS-RM
Table 1
Major operations in FlowOS-RM
attach-device: Attach devices to a node.
launch-machine: Boot a node with a specific OS kernel and container; the node joins the active nodes under Mesos.
prepare-task: Perform housekeeping for launching a task, including submitting the task to the corresponding node through Mesos.
launch-task: Launch a task on a node (running state).
detach-device: Detach devices from a node.
destroy-machine: Shut down a node; the node leaves the active nodes.
Table 2
Experimental Setting
Compute Node Configuration
CPU: 10-core Intel Xeon E5-2630 v4 / 2.2 GHz
M/B: Supermicro X10SRG-F
Memory: 128 GB DDR4-2133
Network: ExpEther 40G HBA
NIC: Intel I350 (Gigabit Ethernet)
Disaggregated Resources (PCIe devices)
GPU: NVIDIA Tesla P100 x4, P40 x1
NVMe: Intel SSD 750 x4
Software Configuration
OS: CentOS 7.4
Kernel: Linux 3.10.0-514.26.2.el7.x86_64
Middleware: Mesos 1.4.1, ChainerMN, OpenMPI 3.1.0, NVIDIA CUDA 8.0.61

We have conducted distributed deep learning training experiments on a four-node cluster environment as shown in Figure 3a. Each compute node has two ExpEther HBAs to connect PCIe devices, e.g., an NVIDIA Tesla P100 GPU and an Intel NVMe SSD, on I/O boxes through a 40 GbE switch. Linux runs on each compute node, and FlowOS-RM and Mesos are installed in this environment. We used two applications, handwritten character recognition (MNIST) and large-scale image classification (ImageNet), as benchmark programs; they are implemented with the distributed deep learning framework ChainerMN [14].
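The benchmark code itself is not listed in the letter. The following is a minimal sketch of ChainerMN-style distributed MNIST training of the kind run on a slice; the model, optimizer, and hyperparameters are our assumptions and are not the exact configuration used in the experiment. It would be launched across the slice nodes with, e.g., mpiexec -n 4 python train_mnist.py.

# Minimal ChainerMN distributed MNIST training sketch (assumed configuration).
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions
import chainermn


class MLP(chainer.Chain):
    def __init__(self):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 1000)
            self.l2 = L.Linear(None, 10)

    def __call__(self, x):
        return self.l2(F.relu(self.l1(x)))


def main():
    # One communicator per MPI process; each process drives one GPU of the slice.
    comm = chainermn.create_communicator('pure_nccl')
    device = comm.intra_rank  # GPU id within the node

    model = L.Classifier(MLP())
    chainer.cuda.get_device_from_id(device).use()
    model.to_gpu()

    # Wrap the optimizer so gradients are all-reduced across the slice.
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.Adam(), comm)
    optimizer.setup(model)

    # Scatter the training set so each process trains on its own shard.
    train, _ = chainer.datasets.get_mnist()
    train = chainermn.scatter_dataset(train, comm, shuffle=True)
    train_iter = chainer.iterators.SerialIterator(train, batch_size=100)

    updater = training.StandardUpdater(train_iter, optimizer, device=device)
    trainer = training.Trainer(updater, (10, 'epoch'))
    if comm.rank == 0:
        trainer.extend(extensions.LogReport())
    trainer.run()


if __name__ == '__main__':
    main()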
(Figure 3a shows the physical cluster: four compute nodes, each with two ExpEther HBAs, connected through a 40G Ethernet switch to I/O boxes holding four P100 GPUs, one P40 GPU, and four NVMe SSDs. Figure 3b shows the slice configurations combining the compute nodes with different numbers of P100 GPUs per node.)
Fig. 3
Experimental Configuration

The run-task elapsed times of the 4node-1gpu, 2node-2gpu, and 1node-4gpu configurations for MNIST are 366.36, 237.31, and 104.57 seconds, respectively (Fig. 4a). MNIST is a relatively lightweight workload, and the slice construction and destruction operations account for 32% to 45% of the total execution time. Specifically, the launch-machine operation takes longer as the number of nodes increases, because downloading a container image of about 3 GB through GbE becomes the bottleneck. Some operations, including attach/detach-device and launch-task, take longer as the number of GPUs per node increases, because these operations are not parallelized. We plan to reduce this overhead by using a faster network and parallelizing the operations.
Secondly, we ran ImageNet, a more practical application, on the same slice configurations. We used the ResNet-50 model and the ILSVRC2012 dataset. In this experiment, a slice has not only GPUs but also two NVMe SSDs to store the ILSVRC2012 dataset. Unlike the MNIST experiment, the slice construction and destruction operations account for only 0.15% to 0.17% of the total execution time, as shown in Figures 4b and 4c. Generally speaking, deep learning training times tend to increase significantly, so the overhead of FlowOS-RM becomes negligible.
5.2.2 Resource sharing
Fig. 4
Slice execution life cycle, broken down per node into attach-device, launch-machine, prepare-task, run-task, detach-device, and destroy-machine: (a) MNIST on three slice configurations; (b) ImageNet on the 4node-1gpu slice configuration; (c) ImageNet on the 2node-2gpu slice configuration.

Fig. 5
Resource sharing: (a) slice configurations, plotted as GPUs allocated to Slice1–Slice4 over time (seconds).

We confirmed that disaggregated resources are shared among several slices according to user requirements. In this experiment, a user submitted four MNIST application jobs, and FlowOS-RM allocated resources to each slice in FIFO order. The slice configurations of the jobs are as follows: Slice1 and Slice2 consist of 2node-2gpu (P100), Slice3 consists of 1node-1gpu (P40), and Slice4 consists of 4node-1gpu (P100). Figure 5a shows that resource sharing among slices works as expected.
6. Conclusion and Future Work
We have demonstrated flexible and effective resource sharing on the proposed disaggregated resource management system (FlowOS-RM) for AI and big data applications in next-generation cloud data centers. We found some performance issues, but their impact is limited for long, hours-running applications like distributed deep learning training. Our future work is to replace ExpEther with the FiC network, and then to open up a new perspective on heterogeneous accelerator computing by leveraging resource disaggregation. Furthermore, in this experiment, we could not take full advantage of the potential of bare-metal containers. Thus, we plan to evaluate various applications on this system while applying performance optimization techniques such as profile-guided optimization.
Acknowledgement
The authors would like to thank Hidetaka Koie, SURIGIKEN, for support on the engineering effort, and Jason Haga for his valuable comments. This paper is partially based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
References
[1] R. Takano, K. Suzaki, and H. Koie, "FlowOS-RM: Disaggregated Resource Management System," ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Poster), 2018.
[2] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K.G. Shin, "Efficient memory disaggregation with Infiniswap," 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp.649–667, 2017.
[3] Y. Shan, Y. Huang, Y. Chen, and Y. Zhang, "LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation," 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp.69–87, 2018.
[4] K. Asanovic, "Firebox: A hardware building block for 2020 warehouse-scale computers," 2014.
[5] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodoropoulos, I. Koutsopoulos, K. Hasharoni, D. Raho, C. Pinto, F. Espina, S. Lopez-Buedo, Q. Chen, M. Nemirovsky, D. Roca, H. Klos, and T. Berends, "Rack-scale disaggregated cloud data centers: The dReDBox project vision," Design, Automation & Test in Europe Conference & Exhibition (DATE), pp.690–695, March 2016.
[6] X. Guo, F. Yan, X. Xue, G. Exarchakos, and N. Calabretta, "Performance assessment of a novel rack-scale disaggregated data center with fast optical switch," Optical Fiber Communications Conference and Exhibition, pp.1–3, 2019.
[7] HP Labs, "The Machine: Our vision for the future of computing," 2018.
[8] J. Suzuki, Y. Hidaka, J. Higuchi, T. Yoshikawa, and A. Iwata, "ExpressEther - Ethernet-Based Virtualization Technology for Reconfigurable Hardware Platform," 14th IEEE Symposium on High-Performance Interconnects (HOTI), pp.45–51, 2006.
[9] J. Duato, A.J. Pena, F. Silla, R. Mayo, and E.S. Quintana-Ortí, "rCUDA: Reducing the number of GPU-based accelerators in high performance clusters," IEEE International Conference on High Performance Computing and Simulation (HPCS), pp.224–231, 2010.
[10] R. Takano, T. Ikegami, K. Suzaki, A. Tanaka, and T. Hirofuchi, "FlowOS: A conception of system software for accelerator clouds," IPSJ SIG Technical Report 2017-HPC-163, pp.1–7, 2018 (in Japanese).
[11] K. Hironaka, A.B. Ahmed, and H. Amano, "Multi-FPGA Management on Flow-in-Cloud Prototype System," 20th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp.443–448, July 2019.
[12] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R.H. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," NSDI, pp.1–14, 2011.