A Novel Software-based Multi-path RDMA Solution for Data Center Networks
Feng Tian
University of Minnesota, Minneapolis, Minnesota
Wendi Feng
University of Minnesota, Minneapolis, Minnesota
Yang Zhang
University of Minnesota, Minneapolis, Minnesota
Zhi-Li Zhang
University of Minnesota, Minneapolis, Minnesota
ABSTRACT
In this paper we propose Virtuoso, a purely software-based multi-path RDMA solution for data center networks (DCNs) that effectively utilizes the rich multi-path topology for load balancing and reliability. As a "middleware" library operating in user space, Virtuoso employs three innovative mechanisms to achieve this goal. In contrast to the existing hardware-based MP-RDMA solution, Virtuoso can be readily deployed in DCNs with existing RDMA NICs. It also decouples path selection and load balancing mechanisms from hardware features, allowing DCN operators and applications to make flexible decisions by employing the best mechanisms (as "plug-in" software library modules) as needed. Our experiments show that Virtuoso is capable of fully utilizing multiple paths with negligible CPU overheads.
KEYWORDS
Data Center Networks, RDMA, Software-based Multi-Path
Remote Direct Memory Access (RDMA) introduces the capability of directly accessing the memory of a remote server by implementing the transport logic in hardware network interface cards and bypassing CPU and kernel network stacks, thereby offering high bandwidth and low latency. Nowadays, RDMA is widely deployed over "Converged" Ethernet via RoCEv2 in modern data centers [12, 16, 26] to support machine learning and other data-intensive applications. By design, RDMA is a point-to-point transport, where each RDMA connection is mapped onto a single network path. More specifically, RDMA operations (verbs) of an RDMA connection are transported along the same network path via a single Queue Pair (QP); each message of an RDMA verb such as SEND, RECV, READ, WRITE is divided into segments of equal size and encapsulated in UDP packets, where the source and destination IP addresses of the UDP packets are set to those of the two communicating servers, the destination port is fixed at 4791, and the source port is arbitrarily chosen. These are all done automatically by the RDMA NICs (or RNICs in short), which makes port-number-based path control, as in MPTCP [7], difficult in user space.

Data center networks (DCNs) are typically built using a "spine-leaf" topological structure with rich multiple paths, especially between spine routers, for load balancing and reliability [1, 9]. As a point-to-point transport, RDMA does not take advantage of multiple paths in the underlying networks for load balancing and reliability [11, 22, 24]. For machine learning and other data-intensive applications, an RDMA read/write operation may involve remote transfer of a big chunk of data ("elephant flows"), which may not only take some time to deliver along a single path, but also cause congestion that can potentially affect "mice flows" from other applications, especially interactive applications with stringent latency requirements. MP-RDMA [17] is the first work that attempts to address this limitation of existing RDMA (or rather, RoCEv2). It focuses on the challenges of implementing a multi-path RDMA solution in hardware, in particular, the limited memory resource in RNICs. By using the source port to encode a "virtual path" id (VP id) and influence the path traversed by the RDMA UDP packets, it assumes and heavily relies on the underlying routers' ECMP mechanisms for load balancing among multiple paths. The proposed solution is emulated/prototyped using FPGA. As MP-RDMA requires replacing existing RNICs with new MP-RDMA capable NICs, it cannot be readily deployed in DCNs.

In this work, we propose and develop a purely software-based multi-path
RDMA solution, dubbed
Virtuoso. Our solution employs three key innovations. First, we create multiple virtual interfaces – each with a different (virtual) IP address of our choice – and bind them to the same physical RNIC (effectively creating multiple virtual RNICs). Hence, unlike MP-RDMA which manipulates the source port only, we control and manipulate the source IP addresses of the RDMA UDP packets for load balancing and reliability. Second, we develop a user-space middleware layer which intercepts and splits (large) messages of RDMA operations into multiple (smaller) messages, dynamically maps them onto different paths at the sender side, and judiciously merges them together at the receiver side before passing them to the applications. Performing these operations correctly while incurring as little overhead as possible (in particular, maintaining zero-copy) is nontrivial; it involves careful design and some clever tricks (see Section 3). Third, we also implement a user-space load balancer that consists of a congestion avoidance component (for lossy networks) and a path probing component, to perform application-aware load balancing.

Virtuoso offers several advantages over existing hardware-based multi-path RDMA solutions. As a purely software-based solution, it can be readily deployed in DCNs at scale with existing RDMA NICs, and works regardless of the number of physical RNICs installed on servers. In contrast to MP-RDMA, which implements "built-in" path selection, congestion control and traffic distribution mechanisms in hardware and hinges on ECMP to perform multi-path routing, Virtuoso decouples these mechanisms from hardware features, and allows DCN operators/applications to make flexible decisions by employing the best mechanisms (as "plug-in" software library modules) as needed. For example, one can explicitly manage multi-path routing by setting appropriate forwarding rules (based on source and destination IP addresses), e.g., through an SDN controller; Virtuoso allows them to guide traffic distribution decisions. Our experiments show that Virtuoso can fully utilize multiple paths with negligible CPU overheads.

RDMA allows applications to directly access remote memory with zero-copying and low CPU involvement by implementing the transport logic in hardware RNICs. RDMA over Converged Ethernet v2 (RoCEv2) has been widely deployed in data center networks to support compute- and data-intensive applications such as machine learning, as it provides low latency and high bandwidth with little CPU overhead. Normally, RDMA requires a lossless network, where Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN) are usually configured to prevent packet losses by pausing transmission and throttling traffic at the source.

RDMA is a message-based, point-to-point transport, where RDMA messages are divided into segments and encapsulated in UDP packets that are transported along a single path. Applications connect with each other using send and receive
Queue Pairs (QPs). An application initiates RDMA operations (or verbs) by posting Work Requests (WRs) (or Work Queue Elements (WQEs)), e.g., SEND/RECV or WRITE/READ, to the QP, which commands the RNIC to transfer data to the memory of a remote host. For each application, there is also one (or more) completion queue (CQ); upon completing a WR, a completion queue element (CQE) is delivered to the CQ.
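For concreteness, the following minimal sketch (ours, not code from any cited system) shows this standard verbs flow: a signaled RDMA WRITE work request is posted to a QP and the CQ is then polled for the corresponding CQE. The QP, CQ, registered memory region, and remote address/rkey are assumed to have been set up beforehand.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post one signaled WRITE WR and wait for its CQE (busy-polling for brevity). */
int post_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, void *buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 1;                        /* echoed back in the CQE */
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a CQE on completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);     /* poll until the CQE arrives */
    } while (n == 0);
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```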
The "leaf-spine" topology in modern Data Center Networks (DCNs) offers rich path diversity [1, 5, 9]. Switches and routers employ built-in Equal-Cost Multi-path (ECMP) routing based on hashes of the 5-tuple packet/flow header (⟨src IP, dst IP, src port, dst port, protocol number⟩). ECMP suffers from several issues in practice [1, 3]: e.g., it is less effective when the number of paths is large, and it cannot perform intra-flow load balancing for large elephant flows. Other (software-based) solutions, such as Valiant Load Balancing and "customized" multi-path routing algorithms (e.g., setting up explicit flow rules [1, 9, 19]), provide DCN operators and applications more control over multi-path routing and load balancing. We remark that congestion often occurs at the core layer of a DCN [2]; large "elephant" flows generated by data-intensive machine learning applications further contribute to this problem. They not only prolong their own flow completion times (FCTs), but also adversely affect other applications. It is therefore desirable to split such "elephant" flows to enable "intra-flow" load balancing across multiple (core) paths [2, 25].

MP-RDMA [17] is the first to address the challenge that RDMA/RoCEv2 cannot effectively take advantage of the rich multiple paths in DCNs [11, 22, 24]. It proposes a hardware-based solution with "built-in" path selection and congestion avoidance mechanisms. The key challenge it focuses on is the limited memory in RNICs (see also FaRM [6], LITE [27] and INFINISWAP [10], which tackle similar hardware constraints). As a hardware-based solution, it cannot be readily deployed without upgrading RNICs. It also heavily relies on ECMP for multi-path routing and load balancing.

We therefore seek a purely software-based multi-path RDMA solution operating in user space that works with existing RNICs while maintaining zero-copying and incurring as little CPU overhead as possible. A key enabling idea of our proposed solution is to create multiple virtual NICs (vNICs) and bind them to the same hardware RNIC, thereby allowing multiple IP addresses to be assigned to the same RNIC. Our solution allows a single RDMA application to create multiple virtual RDMA connections that are mapped to different paths. This is different from existing efforts in virtualizing
RNICs [4, 13, 15, 21], whose goal is to allow multiple VMs/containers to share the same RNIC with some level of isolation. Compared with "built-in" multi-path routing and load balancing mechanisms, we also believe that it is imperative to provide DCN operators and applications with flexibility in multi-path routing and load balancing decisions. For example, it has been shown that global congestion avoidance and traffic scheduling [8, 18, 20] are essential in DCNs, and applications are best aware of traffic load distribution for adaptive load balancing [14]. Similarly, Avatar [23] aims at making RDMA transport on a single RNIC efficiently shared by eliminating lock contention and providing fair data scheduling via WR multiplexing.

Virtuoso is a software-based, modular multi-path RDMA framework. Virtuoso sets up multiple virtual NICs (vNICs) on each physical RNIC using
IP aliases, each assigned a distinct IP address (see Fig. 1(a)). In practice, RDMA uses a Global ID (GID) to identify each host, and RoCEv2 binds GIDs to the IP addresses of the interfaces using the IP table. Using vNICs, Virtuoso is able to create multiple QPs using the standard RDMA libraries rdma_cm and ibv_verbs.
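To illustrate this binding (our sketch, not Virtuoso's code), once IP aliases have been added to the RNIC's network interface, each alias appears as an additional RoCEv2 GID on the device; the snippet below dumps the GID table of port 1, from which one GID per vNIC can be picked when creating QPs.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Print the GID table of port 1; for RoCEv2, an IPv4 vNIC address shows up
 * as an IPv4-mapped IPv6 GID whose last four raw bytes are the interface IP. */
void dump_gids(struct ibv_context *ctx)
{
    struct ibv_port_attr pattr;
    if (ibv_query_port(ctx, 1, &pattr))
        return;
    for (int i = 0; i < pattr.gid_tbl_len; i++) {
        union ibv_gid gid;
        if (ibv_query_gid(ctx, 1, i, &gid))
            continue;
        printf("gid[%d] -> %d.%d.%d.%d\n", i,
               gid.raw[12], gid.raw[13], gid.raw[14], gid.raw[15]);
    }
}
```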
Figure 1: Virtuoso: Software Multi-path RDMA Solution. (a) Virtuoso Overview; (b) System Design.
Virtuoso maps each QP to a distinct virtual path (VP), and uses the IP address associated with each vNIC as a VP id. As a middleware operating in user space, Virtuoso provides the same APIs (and RDMA verbs) as the standard RDMA libraries, but prefixes them with the keyword MP_, as shown in Table 1. For example, an application invokes MP_connect() to set up a Virtuoso multi-path (logical) connection, and uses MP_READ/SEND and MP_WRITE/RECV to post Virtuoso work requests (WRs), MP_WRs. On the sender side, Virtuoso decomposes a large RDMA message (hereafter simply a "flow") contained in an MP_WR into smaller "sub-flows", and distributes them to different QPs by generating the corresponding constituent WRs using the standard RDMA verbs. The sub-flows are "merged" at the receiver side. These are illustrated in the right portion of Fig. 1(b). Virtuoso consists of four major components: QP Manager, Decomposer (on the sender side), Reassembler (on the receiver side), and Path Monitor & Load Balancer.

Standard RDMA API & Verbs    Virtuoso Version
rdma_connect()               MP_rdma_connect()
rdma_disconnect()            MP_rdma_disconnect()
ibv_post_send()              MP_ibv_post_send()
WRITE/READ                   MP_WRITE/READ
SEND/RECV                    MP_SEND/RECV
Table 1: Interface & Verb Design
Virtuoso assumes that there is a single port connection between the ToR switch and the RNIC (although it can also work with multiple ports), while multiple paths exist in the core layer of the data center network. The in-network load balancing mechanism can be either ECMP (with a known hash function) or static routing.
As discussed above, an RDMA application creates a (logical) multi-path connection using Virtuoso APIs. Virtuoso maps this logical connection to multiple (virtual) paths by automatically setting up the corresponding QPs, one per path. To set up these QPs to work with the same application, we take advantage of several key features of RDMA. Recall that in RDMA, memory must be registered before any RDMA verb can be posted. The sender and receiver communicate and negotiate the address locations of the respective memory. Each RDMA transport context (registered memory, QP) is maintained inside a Protection Domain (PD); inside this PD, these contexts can be shared and accessed by all QPs within the same PD.

In order to associate the multiple QPs created by Virtuoso with the same application, the QP Manager creates them within the same PD. Furthermore, the target memory region specified by the RDMA application is also registered to this PD. This way the message in an MP_WRITE or MP_SEND can be transported through any of the QPs; in particular, a large message can be divided into smaller chunks and transported via multiple QPs for load balancing.

The advantage of this design is efficiency and flexibility: the QPs can concurrently access the same memory region without memory copying or state transfer between PDs. This, however, creates a challenge at the receiver side when the two-sided MP_SEND and MP_RECV verbs are used: the receiver does not know in advance on which QP the data will arrive, and thus on which QP to post the corresponding RECV WR. We discuss how this challenge is addressed in the Reassembler of Virtuoso, as well as how out-of-order (OOO) data is handled, in Section 3.4.

The QP Manager also creates a shared Completion Queue (CQ) for these QPs, so that it can poll this single queue to query the CQEs of the WRs posted to any of these QPs. Note that each CQE carries the corresponding WR information (e.g., the WR id). Hence, for each MP_WR (a "flow") submitted by an application, Virtuoso can determine whether its constituent WRs ("sub-flows") have been completed, thereby notifying the Decomposer to generate an MP_CQE informing the application of the completion of the transmission task.
In terms of connecting queue pairs, a transmission-parameter exchange (e.g., queue pair type (qp_type) and queue pair capabilities (max_send_wr)) is required, which involves several functions provided by the standard RDMA libraries. This procedure works like the three-way handshake in TCP/IP; however, it is handled in user space by the application instead of by the driver in the kernel. Thus, Virtuoso has to handle the parameter-exchange tasks for all of its QPs. To simplify the connection procedure, Virtuoso provides a unified interface, MP_rdma_connect(), for multi-path connection setup, which takes over the whole connection procedure from the application. Moreover, the application can also configure these parameters by submitting configurations to Virtuoso. The disconnection procedure of the QPs is similar and requires extra negotiation between the two remote sides; Virtuoso therefore also provides a unified interface, MP_rdma_disconnect(), for applications.
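A hedged sketch of what such a unified connect could look like on top of rdma_cm is shown below; the function name, the omitted event handling, and the per-path address arrays are our assumptions. One rdma_cm_id is created per virtual path and resolved with a different vNIC source address, so the packets of that path's QP carry that source IP.

```c
#include <rdma/rdma_cma.h>
#include <netinet/in.h>

/* Resolve one rdma_cm_id per path, each bound to its own vNIC source IP.
 * Event handling, QP creation and rdma_connect() are omitted for brevity. */
int mp_connect_sketch(struct rdma_cm_id **ids, struct rdma_event_channel *ch,
                      struct sockaddr_in *src, struct sockaddr_in *dst,
                      int npaths)
{
    for (int i = 0; i < npaths; i++) {
        if (rdma_create_id(ch, &ids[i], NULL, RDMA_PS_TCP))
            return -1;
        /* Bind path i to vNIC i's IP; dst[i] is the peer's matching vNIC IP. */
        if (rdma_resolve_addr(ids[i], (struct sockaddr *)&src[i],
                              (struct sockaddr *)&dst[i], 2000))
            return -1;
        /* ...then wait for RDMA_CM_EVENT_ADDR_RESOLVED, call
         * rdma_resolve_route(), create the QP, and rdma_connect(). */
    }
    return 0;
}
```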
The Decomposer component is responsible for WR generation, memory mapping, and MP_CQE generation. As with standard RDMA verbs, each MP_WR (multi-path work request) contains the relevant metadata (memory location, size) of the target memory blocks it wants to access. At the sender side, the main task of the Decomposer is to divide a large message ("flow") contained in an MP_WRITE or MP_SEND multi-path work request into smaller data chunks ("sub-flows"), and to generate the corresponding WRs for each sub-flow using the standard verb (WRITE or SEND). Likewise, an MP_READ WR that accesses a large remote memory region ("flow") is divided at the sender side into multiple READ WRs, each accessing a smaller part of the target memory region ("sub-flows"). To facilitate the matching of memory locations and sizes between the sender and receiver, Virtuoso divides the whole (application) memory space into blocks (the block size is configurable).

To decide the size of each sub-flow, the Decomposer queries the Path Monitor & Load Balancer. Based on path status, bandwidth, and congestion information, the Path Monitor & Load Balancer provides a decision on the memory-to-WR mapping, taking load balancing and congestion avoidance into account (Section 3.5). The Decomposer then generates WRs that map different blocks of the memory and passes them to the QP Manager. After these WRs are successfully posted and completed, the Decomposer is notified; it then generates a corresponding MP_CQE for the entire message to notify the application of the completion.
We first remark that Virtuoso performs the additional task of dividing a large message ("flow") contained in an MP_READ, MP_WRITE or MP_SEND into smaller messages ("sub-flows") by generating a sequence of WRs. These WRs are distributed across multiple QPs and are performed using the standard RDMA verbs (READ, WRITE or SEND). In other words, the RNIC directly reads/writes the corresponding data from or into the remote memory area in the application's memory region as indicated by the verbs. Hence, Virtuoso incurs no additional memory copying.

Out-of-order (OOO) delivery is a common issue in multi-path transport, due to parallel transmission and varying delays on multiple paths. Virtuoso leverages direct memory writing to resolve the OOO issue by placing correctly received data directly into application memory. Once the data arrives at the remote side, the sub-flows have to be merged to reconstruct the original memory region for the receiving application. Since sub-flow payloads are written to memory directly by the NIC hardware, the flow is reassembled correctly in user-space memory as long as the correct WRs are posted: into the receive queue in the MP_SEND/RECV case (to identify the target memory addresses for each sub-flow), and into the send queue in the MP_WRITE/READ case (where the receiver side is entirely passive). When Virtuoso uses WRITE/READ verbs as instructed by an application-submitted MP_WR, the receiver side is entirely passive (i.e., the receiver requires no action after memory registration). Once the sender acquires the access key of the remote memory, Virtuoso can treat the remote target memory as its own memory space without any receiver action during transmission.
Figure 2: SEND/RECV & Out-of-Order

MP_SEND/RECV verbs are a special case of OOO. Originally in RDMA, each SEND consumes a RECV in the receive queue. Moreover, a RECV (which instructs the RNIC to write data to the target memory address) is supposed to be posted before the SEND arrives, which means the target addresses need to be determined in advance. However, the arrival order and data size of each sub-flow are unpredictable, so we cannot simply generate multiple SEND/RECV WRs as in the MP_WRITE case. We therefore propose a hybrid solution combining SEND/RECV and WRITE. As illustrated in Fig. 2, WRITE verbs, which require no RECV, are used to avoid determining memory addresses on the receiver side beforehand.

Additionally, two-sided SEND/RECV needs to notify the application of completion by posting a CQE into the CQ (an MP_CQE in our case). However, the one-sided WRITE verb cannot generate CQEs on the receiver side. To this end, Virtuoso posts an extra RECV to the receive queue for receiver-notification purposes, and appends an extra SEND after the WRITE WRs to consume this RECV. Both the RECV and the SEND are empty WRs (they do not map any memory). As a result, when all the WRITE and SEND/RECV WRs are completed, CQEs are posted to the CQs on both the sender and receiver sides. After polling the CQ, Virtuoso can post an MP_CQE to notify the application, using the metadata in the CQE.

For efficiency, we classify MP_SEND into two categories: small messages and large messages. For a small message, a single SEND is used to send the entire message via an arbitrary single path; for a large message, the hybrid solution is used to load-balance the elephant flow of the message onto multiple paths.
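A minimal sketch of the notification part of this hybrid scheme follows (our naming, not Virtuoso's code): the empty SEND and the matching pre-posted empty RECV map no memory and exist only to generate the receiver-side CQE once the payload WRITEs have landed.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sender side: append a zero-length, signaled SEND after the WRITE WRs. */
int post_empty_notify_send(struct ibv_qp *qp, uint64_t flow_id)
{
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = flow_id;                 /* lets the sender match the CQE */
    wr.opcode = IBV_WR_SEND;
    wr.num_sge = 0;                     /* empty SEND: maps no memory */
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad);
}

/* Receiver side: pre-post a zero-length RECV to be consumed by that SEND. */
int post_empty_notify_recv(struct ibv_qp *qp, uint64_t flow_id)
{
    struct ibv_recv_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = flow_id;                 /* empty RECV, yields the receiver CQE */
    wr.num_sge = 0;
    return ibv_post_recv(qp, &wr, &bad);
}
```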
Load balancing is an essential task in multi-path transmission. Virtuoso employs a pre-allocation mechanism tailored to the RDMA verbs scenario. First, Virtuoso probes the path capacity (e.g., bandwidth) using historical information (or other performance tools such as iPerf, https://iperf.fr/). In the current implementation, Virtuoso initiates multiple probing flows (at least 512 KB) to estimate the capacity of each path by monitoring their flow completion times. Second, Virtuoso distributes the incoming large data traffic into multiple sub-flows such that

\[
\frac{data_{path_1}}{cap_{path_1}} = \frac{data_{path_2}}{cap_{path_2}} = \cdots = \frac{data_{path_n}}{cap_{path_n}}, \qquad \sum_{i=1}^{n} data_{path_i} = data_{total}. \tag{1}
\]

Here cap_{path_i} and data_{path_i} denote the estimated bandwidth and the allocated data size for path i, respectively. Virtuoso then maps the memory into WRs and submits them to the QPs in round-robin order, as shown in Fig. 3(a). The current design is based on the assumption that the status of the core paths is stable over short periods. Since load balancing is fully decoupled from the other components, more real-time and fine-grained user-space load balancing mechanisms will be explored in future work.
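A small sketch of the allocation implied by Eq. (1) follows; the function and variable names are ours. The total data is split across the n paths in proportion to their estimated capacities, with the rounding remainder assigned to the last path.

```c
#include <stddef.h>

/* Proportional split of data_total according to estimated path capacities. */
void allocate_subflows(const double *cap, size_t *out, int n, size_t data_total)
{
    double cap_sum = 0.0;
    for (int i = 0; i < n; i++)
        cap_sum += cap[i];
    size_t assigned = 0;
    for (int i = 0; i < n; i++) {
        out[i] = (size_t)((double)data_total * cap[i] / cap_sum);
        assigned += out[i];
    }
    out[n - 1] += data_total - assigned;   /* absorb rounding error */
}
```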
Figure 3: User-space Load Balancing. (a) Lossless Network; (b) Lossy Network.

Congestion avoidance is also required for per-sub-flow transmission. For instance, if the RNIC has insufficient resilience (e.g., Mellanox ConnectX-3 Pro) while the network is not well configured (lossy), mapping a large amount of memory into a single WR (so that the RNIC transmits the data too fast) will cause packet loss at the core switches (where the network bottleneck is located). To resolve this, Virtuoso limits the maximum chunk size of each WR using a congestion-window-based mechanism. Initially, Virtuoso probes the threshold value of the chunk size of each sub-flow by increasing the chunk size in a binary (doubling) fashion while monitoring the shared CQ. If congestion occurs (usually indicated by a CQE with the IBV_WC_RETRY_EXC_ERR error code), Virtuoso decreases the chunk size to the previous value and then searches for the maximum threshold by increasing it linearly. Moreover, WR construction and posting are slightly different in a lossy network. To avoid packet loss, Virtuoso uses multiple WRs to map the sub-flow message of each path, with the maximum chunk size determining the number of WRs; these WRs are posted in turn, each following the successful CQE of the previous one, as shown in Fig. 3(b).
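The probing logic can be sketched as below under our assumptions; the send_probe helper and the exact back-off policy are illustrative rather than Virtuoso's precise procedure. The chunk size is grown multiplicatively while completions succeed, and the last good value is returned once a CQE reports IBV_WC_RETRY_EXC_ERR.

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* Grow the per-WR chunk size until a completion error indicates congestion. */
size_t probe_chunk_size(struct ibv_cq *cq, size_t start, size_t max,
                        int (*send_probe)(size_t chunk))
{
    size_t chunk = start, last_good = start;
    while (chunk <= max) {
        if (send_probe(chunk))            /* post one probe WR of this size */
            break;
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);  /* wait for the probe's CQE */
        } while (n == 0);
        if (n != 1 || wc.status == IBV_WC_RETRY_EXC_ERR)
            return last_good;             /* congestion: keep previous value */
        last_good = chunk;
        chunk *= 2;                       /* binary increase while successful */
    }
    return last_good;
}
```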
In this section, we describe the implementation and evaluation of Virtuoso. We evaluate the performance of Virtuoso and validate that it can fully utilize multiple paths in the core of a DCN with minimal CPU overhead.
Virtuoso is implemented as a user-space "middleware" library on top of the standard RDMA libraries ibv_verbs and rdma_cm. Virtuoso contains approximately 1,500 lines of C code. Virtuoso uses a thread-free, event-based mechanism to handle the establishment of multiple QPs and data transmission. An RDMA application invokes MP_connect() to create QP connections, and uses MP_WRITE()/MP_SEND() to initiate a multi-path data transmission. Additionally, we implemented two basic modules for congestion control and load balancing; they can easily be replaced by an application's own designs.
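The event-based completion handling can be sketched with standard verbs calls as follows (a simplified illustration, not Virtuoso's actual code path): the CQ is armed for notification, the caller blocks on the completion channel instead of busy-polling, and the available CQEs are then drained.

```c
#include <infiniband/verbs.h>

/* Block on the completion channel until the CQ raises an event, then drain it. */
int wait_for_completions(struct ibv_comp_channel *ch, struct ibv_cq *cq)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    if (ibv_req_notify_cq(cq, 0))              /* arm: next CQE raises an event */
        return -1;
    if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx)) /* sleeps, no CPU burned */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);
    if (ibv_req_notify_cq(ev_cq, 0))           /* re-arm before draining */
        return -1;

    struct ibv_wc wc;
    int n = 0;
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0)     /* drain everything available */
        n++;
    return n;
}
```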
Our testbed consists of two servers connected to two Top-of-Rack (ToR) switches with multiple links between them to emulate the multi-path scenario of a spine-leaf DCN topology. The end hosts are Dell PowerEdge R430 servers with Intel Xeon ([email protected]) CPUs and 64 GB RAM. They are equipped with Mellanox ConnectX-3 40Gbps RNICs using the MLNX_OFED_LINUX-4.6-1.0.1.1 driver with the 10Gbps port speed enabled. The ToR switches are QuantaMesh T1048-LB9A (SDN) switches, which perform an IP-based path mapping as shown in Fig. 4.
Figure 4: Testbed Setup
In this experiment, we evaluate Virtuoso's path utilization and show that Virtuoso can fully utilize multiple paths to improve bandwidth in the network between the ToR switches (the core portion).

Flow completion time (FCT) is the metric we use to evaluate the performance of Virtuoso with different numbers of paths (1, 2, 4, 6, 8 and 10). For each link between the ToR switches, we limit the speed to 1Gbps, while the links between the ToRs and the servers are 10Gbps, which introduces a bottleneck in the core portion. As shown in Fig. 5, as the number of used paths increases, the FCT decreases markedly. Moreover, the benefit of using more paths holds across different message sizes (from 10 MBytes to 100 GBytes), as shown in Fig. 6, which means Virtuoso can utilize multiple paths for better transport.

Figure 5: Multiple Path Utilization (100GByte Flow)

Figure 6: Different Flow Size Comparison
As discussed for congestion avoidance, if a WR submits too much data at once, congestion occurs in the bottlenecked core network. Thus, utilizing multiple paths can potentially increase the chunk size that a single WR can submit. As shown in Table 2, with more paths, the chunk size also increases. As a result, for a fixed-size data flow, we can save CPU time by sending more data in each iteration.

Moreover, as shown in Table 2, using 2 or 4 paths decreases the average chunk size compared with the single-path scenario. The reason is that the capacity of a small number of extra paths still cannot close the gap between the core network and the RNIC. However, as more paths are used, the average chunk size on each path increases even though less data is allocated to each path in each iteration.

# of Paths    Max Size (Byte)    Avg Size (Byte)
1             700417             700417
2             1300482            650241
4             2537172            634293
6             4405660            734280
8             9371656            1171457
9             18984993           2109400
Table 2: Lossy Network Chunk Size Comparison
As discussed in Section 3, multi-path transport can also increase fairness by preventing elephant flows from blocking mice flows. To validate this, we generate a constant data flow as background traffic while a mice flow (256 KByte) is initiated every 2 seconds. Virtuoso splits the background elephant flow among 10 paths to avoid it blocking any single path. In the comparison scenario, Virtuoso does not split the elephant flow, and the mice flows share the same path used by the elephant flow. We then compare the FCT of the mice flows with and without Virtuoso's load balancing.

As shown in Fig. 7, when Virtuoso splits the elephant flow over multiple paths, the FCT of the mice flows decreases thanks to the extra available bandwidth. In the single-path scenario, which is also the case without Virtuoso's load balancing, the background traffic occupies the shared single path and blocks the mice flows; as a result, the FCT of the mice flows increases.
We use CPU usage time (CPU cycles) to evaluate the CPU overhead of Virtuoso. In this experiment, we tag the code at different points (e.g., at the end of the MP_rdma_connect() function) to measure the CPU cycles used by the different parts. The standard C library time is used to log the CPU clock at each point.

Moreover, to avoid the extra CPU usage caused by CQ polling, we use event-based completion queue polling (ibv_get_cq_event(), where the application is blocked during data transmission). In this way, we avoid the deviation caused by unnecessary CPU usage and measure only the critical CPU overhead.
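A minimal sketch of this measurement approach (ours, not the paper's exact instrumentation): a region of interest is bracketed with clock() from the standard C time library and the CPU time it consumed is logged.

```c
#include <stdio.h>
#include <time.h>

/* Measure the CPU time consumed by a code region, e.g. a connect routine. */
void timed_region(void (*fn)(void))
{
    clock_t start = clock();        /* CPU clock ticks used by this process */
    fn();
    clock_t used = clock() - start;
    printf("CPU time: %ld ticks (%.6f s)\n",
           (long)used, (double)used / CLOCKS_PER_SEC);
}
```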
Figure 7: Multiple Flows Interactions with Virtuoso
As shown in Fig. 8, as the number of used paths increases, especially for small message sizes, more CPU cycles are spent on user-space computation. However, large data sizes eliminate this side effect by increasing both bandwidth and chunk size, which reduces the number of iterations needed to transmit the same amount of data. Hence, as an overall conclusion, large data messages should always leverage multiple paths, while small data messages can use Virtuoso to steer flows away from congested paths.

Figure 8: CPU Usage Overhead Comparison
This paper presents Virtuoso, a purely software-based multi-path RDMA solution for data center networks that effectively utilizes multiple paths for load balancing and reliability. Virtuoso employs vNICs to help RDMA applications split large flows into multiple smaller sub-flows and dispatch them among multiple paths to achieve user-space load balancing. Virtuoso improves the bandwidth in the core of DCNs by utilizing multiple paths while introducing negligible CPU overhead.

Virtuoso is presented to inspire the community to leverage the flexibility of software virtualization techniques. We plan to further i) provide a fine-grained yet efficient congestion control mechanism to achieve fast and dynamic load-balancing reactions, and ii) migrate real-world applications, such as distributed TensorFlow [12], to evaluate the benefits of Virtuoso and to benefit the machine learning community.
REFERENCES
[1] Mohammad Al-Fares et al. 2008. A Scalable, Commodity Data Center Network Architecture. In ACM SIGCOMM. 63–74.
[2] Mohammad Alizadeh et al. 2014. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In ACM SIGCOMM. 503–514.
[3] Jiaxin Cao et al. 2013. Per-Packet Load-Balanced, Low-Latency Routing for Clos-Based Data Center Networks. In ACM CoNEXT. 49–60.
[4] Shoby Cherian et al. 2017. Methods and systems to achieve multi-tenancy in RDMA over converged Ethernet. (Aug. 29 2017). US Patent 9,747,249.
[5] Carolyn J Sher Decusatis et al. 2012. Communication within clouds: open standards and proprietary protocols for data center networking. IEEE Commun Mag 50, 9 (2012), 26–33.
[6] Aleksandar Dragojević et al. 2014. FaRM: Fast Remote Memory. In USENIX NSDI. 401–414.
[7] Alan Ford et al. 2012. TCP Extensions for Multipath Operation with Multiple Addresses. IETF (2012).
[8] Monia Ghobadi et al. 2012. Rethinking End-to-End Congestion Control in Software-Defined Networks. In ACM HotNets. 61–66.
[9] Albert Greenberg et al. 2009. VL2: A Scalable and Flexible Data Center Network. In ACM SIGCOMM. 51–62.
[10] Juncheng Gu et al. 2017. Efficient Memory Disaggregation with Infiniswap. In USENIX NSDI. 649–667.
[11] Chuanxiong Guo et al. 2016. RDMA over Commodity Ethernet at Scale. In ACM SIGCOMM. 202–215.
[12] Chengfan Jia et al. 2018. Improving the performance of distributed TensorFlow with RDMA. Int J Parallel Program 46, 4 (2018), 674–685.
[13] George Kalokerinos et al. 2009. FPGA implementation of a configurable cache/scratchpad memory with virtualized user-level RDMA capability. In IEEE SAMOS. 149–156.
[14] Hari Kathi et al. 2006. Data traffic load balancing based on application layer messages. (July 13 2006). US Patent App. 11/031,184.
[15] Daehyeok Kim et al. 2019. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. In USENIX NSDI. 113–126.
[16] Xiaoyi Lu et al. 2014. Accelerating Spark with RDMA for big data processing: Early experiences. In IEEE HOTI. 9–16.
[17] Yuanwei Lu et al. 2018. Multi-Path Transport for RDMA in Datacenters. In USENIX NSDI. 357–371.
[18] Yifei Lu and Shuhong Zhu. 2015. SDN-based TCP congestion control in data center networks. In IEEE IPCCC. 1–7.
[19] Niranjan Mysore et al. 2009. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In ACM SIGCOMM. 39–50.
[20] Jonathan Perry et al. 2014. Fastpass: a centralized "zero-queue" datacenter network. In ACM SIGCOMM. 307–318.
[21] Jonas Pfefferle et al. 2015. A Hybrid I/O Virtualization Framework for RDMA-capable Network Interfaces. In ACM VEE. 17–30.
[22] Jim Pinkerton. 2002. The case for RDMA. RDMA Consortium, May 29 (2002), 27.
[23] Haonan Qiu et al. 2018. Toward Effective and Fair RDMA Resource Sharing. In ACM APNet. 8–14.
[24] Ren et al. 2013. Design and Performance Evaluation of NUMA-Aware RDMA-Based End-to-End Data Transfer Systems. In ACM SC. 48.
[25] M Skyllas-Kazacos et al. 1986. New all-vanadium redox flow cell. Journal of the Electrochemical Society 133 (1986), 1057.
[26] Maomeng Su et al. 2017. RFP: When RPC is Faster than Server-Bypass with RDMA. In ACM EuroSys. 1–15.
[27] Shin-Yeh Tsai and Yiying Zhang. 2017. LITE Kernel RDMA Support for Datacenter Applications. In USENIX OSDI. 306–324.