A Novel Software-based Multi-path RDMA Solution for Data Center Networks
Feng Tian
University of Minnesota, Minneapolis, Minnesota
Wendi Feng
University of Minnesota, Minneapolis, Minnesota
Yang Zhang
University of Minnesota, Minneapolis, Minnesota
Zhi-Li Zhang
University of Minnesota, Minneapolis, Minnesota
ABSTRACT
In this paper we propose Virtuoso, a purely software-based multi-path RDMA solution for data center networks (DCNs) that effectively utilizes the rich multi-path topology for load balancing and reliability. As a "middleware" library operating in user space, Virtuoso employs three innovative mechanisms to achieve this goal. In contrast to the existing hardware-based MP-RDMA solution, Virtuoso can be readily deployed in DCNs with existing RDMA NICs. It also decouples path selection and load balancing mechanisms from hardware features, allowing DCN operators and applications to make flexible decisions by employing the best mechanisms (as "plug-in" software library modules) as needed. Our experiments show that Virtuoso is capable of fully utilizing multiple paths with negligible CPU overheads.
KEYWORDS
Data Center Networks, RDMA, Software-based Multi-Path
Remote Direct Memory Access (RDMA) introduces the capability of directly accessing the memory of a remote server by implementing the transport logic in hardware network interface cards and bypassing CPU and kernel network stacks, thereby offering high bandwidth and low latency. Nowadays, RDMA is widely deployed over "Converged" Ethernet via RoCEv2 in modern data centers [12, 16, 26] to support machine learning and other data-intensive applications. By design, RDMA is a point-to-point transport, where each RDMA connection is mapped onto a single network path. More specifically, RDMA operations (verbs) of an RDMA connection are transported along the same network path via a single Queue Pair (QP); each message of an RDMA verb such as SEND, RECV, READ, WRITE is divided into segments of equal size and encapsulated in UDP packets, where the source and destination IP addresses of the UDP packets are set to those of the two communicating servers, the destination port is fixed at 4791, and the source port is arbitrarily chosen. These are all done automatically by the RDMA NICs (or RNICs in short), which makes port-number-based path control, as in MPTCP [7], difficult in user space.

Data center networks (DCNs) are typically built using a "spine-leaf" topological structure with rich multiple paths, especially between spine routers, for load balancing and reliability [1, 9]. As a point-to-point transport, RDMA does not take advantage of multiple paths in the underlying networks for load balancing and reliability [11, 22, 24]. For machine learning and other data-intensive applications, an RDMA read/write operation may involve remote transfer of a big chunk of data ("elephant flows"), which may not only take some time to deliver along a single path, but also cause congestion that can potentially affect "mice flows" from other applications, especially interactive applications with stringent latency requirements. MP-RDMA [17] is the first work that attempts to address this limitation of existing RDMA (or rather, RoCEv2). It focuses on the challenges of implementing a multi-path RDMA solution in hardware, in particular, the limited memory resource in RNICs. By using the source port to encode a "virtual path" id (VP id) and influence the path traversed by the RDMA UDP packets, it assumes and heavily relies on the underlying routers' ECMP mechanisms for load balancing among multiple paths. The proposed solution is emulated/prototyped using FPGA. As MP-RDMA requires replacing existing RNICs with new MP-RDMA capable NICs, it cannot be readily deployed in DCNs.

In this work, we propose and develop a purely software-based multi-path
RDMA solution, dubbed
Virtuoso. Our solution employs three key innovations. First, we create multiple virtual interfaces – each with a different (virtual) IP address of our choice – and bind them to the same physical RNIC (effectively creating multiple virtual RNICs). Hence, unlike MP-RDMA which manipulates the source port only, we control and manipulate the source IP addresses of the RDMA UDP packets for load balancing and reliability. Second, we develop a user-space middleware layer which intercepts and splits (large) messages of RDMA operations into multiple (smaller) messages, dynamically maps them onto different paths at the sender side, and judiciously merges them together at the receiver side before passing them to the applications. Performing these operations correctly while incurring as little overhead as possible (in particular, maintaining zero-copy) is nontrivial; it involves careful design and some clever tricks (see Section 3). Third, we also implement a user-space load balancer that consists of a congestion avoidance component (for lossy networks) and a path probing component, to perform application-aware load balancing.

Virtuoso offers several advantages over existing hardware-based multi-path RDMA solutions. As a purely software-based solution, it can be readily deployed in DCNs at scale with existing RDMA NICs, and works regardless of the number of physical RNICs installed on servers. In contrast to MP-RDMA, which implements "built-in" path selection, congestion control and traffic distribution mechanisms in hardware and hinges on ECMP to perform multi-path routing, Virtuoso decouples these mechanisms from hardware features, and allows DCN operators/applications to make flexible decisions by employing the best mechanisms (as "plug-in" software library modules) as needed. For example, one can explicitly manage multi-path routing by setting appropriate forwarding rules (based on source and destination IP addresses), e.g., through an SDN controller; Virtuoso allows them to guide traffic distribution decisions. Our experiments show that Virtuoso can fully utilize multiple paths with negligible CPU overheads.

RDMA allows applications to directly access remote memory with zero-copying and low CPU involvement by implementing the transport logic in hardware RNICs. RDMA over Converged Ethernet v2 (RoCEv2) has been widely deployed in data center networks to support compute- and data-intensive applications such as machine learning, as it provides low latency and high bandwidth with little CPU overhead. Normally, RDMA requires a lossless network, where Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN) are usually configured to prevent packet losses by pausing transmission and throttling traffic at the source.

RDMA is a message-based, point-to-point transport, where RDMA messages are divided into segments and encapsulated in UDP packets that are transported along a single path. Applications connect with each other using send and receive
Queue Pairs (QPs). An application initiates RDMA operations (or verbs) by posting Work Requests (WRs) (or Work Queue Elements (WQEs)), e.g., SEND/RECV or WRITE/READ, to the QP, which commands the RNIC to transfer data to the memory of a remote host. For each application, there is also one (or more) completion queue (CQ); upon completing a WR, a completion queue element (CQE) is delivered to the CQ.
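For concreteness, the following minimal sketch (ours, not code from any cited system) shows this standard verbs flow: a signaled RDMA WRITE work request is posted to a QP and the CQ is then polled for the corresponding CQE. The QP, CQ, registered memory region, and remote address/rkey are assumed to have been set up beforehand.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post one signaled WRITE WR and wait for its CQE (busy-polling for brevity). */
int post_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, void *buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 1;                        /* echoed back in the CQE */
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a CQE on completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);     /* poll until the CQE arrives */
    } while (n == 0);
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```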
The "leaf-spine" topology in modern Data Center Networks (DCNs) offers rich path diversity [1, 5, 9]. Switches and routers employ built-in Equal-Cost Multi-path (ECMP) routing based on hashes of the 5-tuple packet/flow header (⟨src IP, dst IP, src port, dst port, protocol number⟩). ECMP suffers from several issues in practice [1, 3]: e.g., it is less effective when the number of paths is large, and it cannot perform intra-flow load balancing for large elephant flows. Other (software-based) solutions, such as Valiant Load Balancing and "customized" multi-path routing algorithms (e.g., setting up explicit flow rules [1, 9, 19]), provide DCN operators and applications more control over multi-path routing and load balancing. We remark that congestion often occurs at the core layer of a DCN [2]; large "elephant" flows generated by data-intensive machine learning applications further contribute to this problem. They not only prolong their own flow completion times (FCTs), but also adversely affect other applications. It is therefore desirable to split such "elephant" flows to enable "intra-flow" load balancing across multiple (core) paths [2, 25].

MP-RDMA [17] is the first to address the challenge that RDMA/RoCEv2 cannot effectively take advantage of the rich multiple paths in DCNs [11, 22, 24]. It proposes a hardware-based solution with "built-in" path selection and congestion avoidance mechanisms. The key challenge it focuses on is the limited memory in RNICs (see also FaRM [6], LITE [27] and INFINISWAP [10], which tackle similar hardware constraints). As a hardware-based solution, it cannot be readily deployed without upgrading RNICs. It also heavily relies on ECMP for multi-path routing and load balancing.

We therefore seek a purely software-based multi-path RDMA solution operating in user space that works with existing RNICs while maintaining zero-copying and incurring as little CPU overhead as possible. A key enabling idea of our proposed solution is to create multiple virtual NICs (vNICs) and bind them to the same hardware RNIC, thereby allowing multiple IP addresses to be assigned to the same RNIC. Our solution allows a single RDMA application to create multiple virtual RDMA connections that are mapped to different paths. This is different from existing efforts in virtualizing
RNICs [4, 13, 15, 21], whose goal is to allow multiple VMs/containers to share the same RNIC with some level of isolation. Compared with "built-in" multi-path routing and load balancing mechanisms, we also believe that it is imperative to provide DCN operators and applications with flexibility in multi-path routing and load balancing decisions. For example, it has been shown that global congestion avoidance and traffic scheduling [8, 18, 20] are essential in DCNs, and applications are best aware of traffic load distribution for adaptive load balancing [14]. Similarly, Avatar [23] aims at making RDMA transport on a single RNIC efficiently shared by eliminating lock contention and providing fair data scheduling via WR multiplexing.

Virtuoso is a software-based, modular multi-path RDMA framework. Virtuoso sets up multiple virtual NICs (vNICs) on each physical RNIC using
IP aliases, each assigned a distinct IP address (see Fig. 1(a)). In practice, RDMA uses a Global ID (GID) to identify each host, and RoCEv2 binds GIDs to the IP addresses of the interfaces using the IP table. Using vNICs, Virtuoso is able to create multiple QPs using the standard RDMA libraries rdma_cm and ibv_verbs.
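To illustrate this binding (our sketch, not Virtuoso's code), once IP aliases have been added to the RNIC's network interface, each alias appears as an additional RoCEv2 GID on the device; the snippet below dumps the GID table of port 1, from which one GID per vNIC can be picked when creating QPs.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Print the GID table of port 1; for RoCEv2, an IPv4 vNIC address shows up
 * as an IPv4-mapped IPv6 GID whose last four raw bytes are the interface IP. */
void dump_gids(struct ibv_context *ctx)
{
    struct ibv_port_attr pattr;
    if (ibv_query_port(ctx, 1, &pattr))
        return;
    for (int i = 0; i < pattr.gid_tbl_len; i++) {
        union ibv_gid gid;
        if (ibv_query_gid(ctx, 1, i, &gid))
            continue;
        printf("gid[%d] -> %d.%d.%d.%d\n", i,
               gid.raw[12], gid.raw[13], gid.raw[14], gid.raw[15]);
    }
}
```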
Figure 1: Virtuoso: Software Multi-path RDMA Solution. (a) Virtuoso Overview; (b) System Design.
Virtuoso maps each QP to a distinct virtual path (VP), and uses the IP address associated with each vNIC as a VP id. As a middleware operating in user space, Virtuoso provides the same APIs (and RDMA verbs) as the standard RDMA libraries, but prefixes them with the keyword MP_, as shown in Table 1. For example, an application invokes MP_connect() to set up a Virtuoso multi-path (logical) connection, and uses MP_READ/SEND and MP_WRITE/RECV to post Virtuoso work requests (WRs), MP_WRs. On the sender side, Virtuoso decomposes a large RDMA message (hereafter simply a "flow") contained in an MP_WR into smaller "sub-flows", and distributes them to different QPs by generating the corresponding constituent WRs using the standard RDMA verbs. The sub-flows are "merged" at the receiver side. These are illustrated in the right portion of Fig. 1(b). Virtuoso consists of four major components: QP Manager, Decomposer (on the sender side), Reassembler (on the receiver side), and Path Monitor & Load Balancer.

Standard RDMA API & Verbs    Virtuoso Version
rdma_connect()               MP_rdma_connect()
rdma_disconnect()            MP_rdma_disconnect()
ibv_post_send()              MP_ibv_post_send()
WRITE/READ                   MP_WRITE/READ
SEND/RECV                    MP_SEND/RECV
Table 1: Interface & Verb Design
Virtuoso assumes that there is a single port connection between the ToR switch and the RNIC (although it can also work with multiple ports), while multiple paths exist in the core layer of the data center network. The in-network load balancing mechanism can be either ECMP (with a known hash function) or static routing.
As discussed above, an RDMA application creates a (logical) multi-path connection using Virtuoso APIs. Virtuoso maps this logical connection to multiple (virtual) paths by automatically setting up the corresponding QPs, one per path. To set up these QPs to work with the same application, we take advantage of several key features of RDMA. Recall that in RDMA, memory must be registered before any RDMA verb can be posted. The sender and receiver communicate and negotiate the address locations of the respective memory. Each RDMA transport context (registered memory, QP) is maintained inside a Protection Domain (PD); inside this PD, these contexts can be shared and accessed by all QPs within the same PD.

In order to associate the multiple QPs created by Virtuoso with the same application, the QP Manager creates them within the same PD. Furthermore, the target memory region specified by the RDMA application is also registered to this PD. This way the message in an MP_WRITE or MP_SEND can be transported through any of the QPs; in particular, a large message can be divided into smaller chunks and transported via multiple QPs for load balancing.

The advantage of this design is efficiency and flexibility: the QPs can concurrently access the same memory region without memory copying or state transfer between PDs. This, however, creates a challenge at the receiver side when the two-sided MP_SEND and MP_RECV verbs are used: the receiver does not know in advance on which QP the data will arrive, and thus on which QP to post the corresponding RECV WR. We discuss how this challenge is addressed in the Reassembler of Virtuoso, as well as how out-of-order (OOO) data is handled, in Section 3.4.

The QP Manager also creates a shared Completion Queue (CQ) for these QPs, so that it can poll this single queue to query the CQEs of the WRs posted to any of these QPs. Note that each CQE carries the corresponding WR information (e.g., the WR id). Hence, for each MP_WR (a "flow") submitted by an application, Virtuoso can determine whether its constituent WRs ("sub-flows") have been completed, thereby notifying the Decomposer to generate an MP_CQE informing the application of the completion of the transmission task.
In terms of connecting queue pairs, a transmission-parameter exchange (e.g., queue pair type (qp_type) and queue pair capabilities (max_send_wr)) is required, which involves several functions provided by the standard RDMA libraries. This procedure works like the three-way handshake in TCP/IP; however, it is handled in user space by the application instead of by the driver in the kernel. Thus, Virtuoso has to handle the parameter-exchange tasks for all of its QPs. To simplify the connection procedure, Virtuoso provides a unified interface, MP_rdma_connect(), for multi-path connection setup, which takes over the whole connection procedure from the application. Moreover, the application can also configure these parameters by submitting configurations to Virtuoso. The disconnection procedure of the QPs is similar and requires extra negotiation between the two remote sides; Virtuoso therefore also provides a unified interface, MP_rdma_disconnect(), for applications.
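A hedged sketch of what such a unified connect could look like on top of rdma_cm is shown below; the function name, the omitted event handling, and the per-path address arrays are our assumptions. One rdma_cm_id is created per virtual path and resolved with a different vNIC source address, so the packets of that path's QP carry that source IP.

```c
#include <rdma/rdma_cma.h>
#include <netinet/in.h>

/* Resolve one rdma_cm_id per path, each bound to its own vNIC source IP.
 * Event handling, QP creation and rdma_connect() are omitted for brevity. */
int mp_connect_sketch(struct rdma_cm_id **ids, struct rdma_event_channel *ch,
                      struct sockaddr_in *src, struct sockaddr_in *dst,
                      int npaths)
{
    for (int i = 0; i < npaths; i++) {
        if (rdma_create_id(ch, &ids[i], NULL, RDMA_PS_TCP))
            return -1;
        /* Bind path i to vNIC i's IP; dst[i] is the peer's matching vNIC IP. */
        if (rdma_resolve_addr(ids[i], (struct sockaddr *)&src[i],
                              (struct sockaddr *)&dst[i], 2000))
            return -1;
        /* ...then wait for RDMA_CM_EVENT_ADDR_RESOLVED, call
         * rdma_resolve_route(), create the QP, and rdma_connect(). */
    }
    return 0;
}
```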
The Decomposer component is responsible for WR generation, memory mapping, and MP_CQE generation. As with standard RDMA verbs, each MP_WR (multi-path work request) contains the relevant metadata (memory location, size) of the target memory blocks it wants to access. At the sender side, the main task of the Decomposer is to divide a large message ("flow") contained in an MP_WRITE or MP_SEND multi-path work request into smaller data chunks ("sub-flows"), and to generate the corresponding WRs for each sub-flow using the standard verb (WRITE or SEND). Likewise, an MP_READ WR that accesses a large remote memory region ("flow") is divided at the sender side into multiple READ WRs, each accessing a smaller part of the target memory region ("sub-flows"). To facilitate the matching of memory locations and sizes between the sender and receiver, Virtuoso divides the whole (application) memory space into blocks (the block size is configurable).

To decide the size of each sub-flow, the Decomposer queries the Path Monitor & Load Balancer. Based on path status, bandwidth, and congestion information, the Path Monitor & Load Balancer provides a decision on the memory-to-WR mapping, taking load balancing and congestion avoidance into account (Section 3.5). The Decomposer then generates WRs that map different blocks of the memory and passes them to the QP Manager. After these WRs are successfully posted and completed, the Decomposer is notified; it then generates a corresponding MP_CQE for the entire message to notify the application of the completion.
We first remark that Virtuoso performs the additional task of dividing a large message ("flow") contained in an MP_READ, MP_WRITE or MP_SEND into smaller messages ("sub-flows") by generating a sequence of WRs. These WRs are distributed across multiple QPs and are performed using the standard RDMA verbs (READ, WRITE or SEND). In other words, the RNIC directly reads/writes the corresponding data from or into the remote memory area in the application's memory region as indicated by the verbs. Hence, Virtuoso incurs no additional memory copying.

Out-of-order (OOO) delivery is a common issue in multi-path transport, due to parallel transmission and varying delays on multiple paths. Virtuoso leverages direct memory writing to resolve the OOO issue by placing correctly received data directly into application memory. Once the data arrives at the remote side, the sub-flows have to be merged to reconstruct the original memory region for the receiving application. Since sub-flow payloads are written to memory directly by the NIC hardware, the flow is reassembled correctly in user-space memory as long as the correct WRs are posted: into the receive queue in the MP_SEND/RECV case (to identify the target memory addresses for each sub-flow), and into the send queue in the MP_WRITE/READ case (where the receiver side is entirely passive). When Virtuoso uses WRITE/READ verbs as instructed by an application-submitted MP_WR, the receiver side is entirely passive (i.e., the receiver requires no action after memory registration). Once the sender acquires the access key of the remote memory, Virtuoso can treat the remote target memory as its own memory space without any receiver action during transmission.
Figure 2: SEND/RECV & Out-of-Order

MP_SEND/RECV verbs are a special case of OOO. Originally in RDMA, each SEND consumes a RECV in the receive queue. Moreover, a RECV (which instructs the RNIC to write data to the target memory address) is supposed to be posted before the SEND arrives, which means the target addresses need to be determined in advance. However, the arrival order and data size of each sub-flow are unpredictable, so we cannot simply generate multiple SEND/RECV WRs as in the MP_WRITE case. We therefore propose a hybrid solution combining SEND/RECV and WRITE. As illustrated in Fig. 2, WRITE verbs, which require no RECV, are used to avoid determining memory addresses on the receiver side beforehand.

Additionally, two-sided SEND/RECV needs to notify the application of completion by posting a CQE into the CQ (an MP_CQE in our case). However, the one-sided WRITE verb cannot generate CQEs on the receiver side. To this end, Virtuoso posts an extra RECV to the receive queue for receiver-notification purposes, and appends an extra SEND after the WRITE WRs to consume this RECV. Both the RECV and the SEND are empty WRs (they do not map any memory). As a result, when all the WRITE and SEND/RECV WRs are completed, CQEs are posted to the CQs on both the sender and receiver sides. After polling the CQ, Virtuoso can post an MP_CQE to notify the application, using the metadata in the CQE.

For efficiency, we classify MP_SEND into two categories: small messages and large messages. For a small message, a single SEND is used to send the entire message via an arbitrary single path; for a large message, the hybrid solution is used to load-balance the elephant flow of the message onto multiple paths.
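A minimal sketch of the notification part of this hybrid scheme follows (our naming, not Virtuoso's code): the empty SEND and the matching pre-posted empty RECV map no memory and exist only to generate the receiver-side CQE once the payload WRITEs have landed.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sender side: append a zero-length, signaled SEND after the WRITE WRs. */
int post_empty_notify_send(struct ibv_qp *qp, uint64_t flow_id)
{
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = flow_id;                 /* lets the sender match the CQE */
    wr.opcode = IBV_WR_SEND;
    wr.num_sge = 0;                     /* empty SEND: maps no memory */
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad);
}

/* Receiver side: pre-post a zero-length RECV to be consumed by that SEND. */
int post_empty_notify_recv(struct ibv_qp *qp, uint64_t flow_id)
{
    struct ibv_recv_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = flow_id;                 /* empty RECV, yields the receiver CQE */
    wr.num_sge = 0;
    return ibv_post_recv(qp, &wr, &bad);
}
```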
Load balancing is an essential task in multi-path transmission. Virtuoso employs a pre-allocation mechanism tailored to the RDMA verbs scenario. First, Virtuoso probes the path capacity (e.g., bandwidth) using historical information (or other performance tools such as iPerf, https://iperf.fr/). In the current implementation, Virtuoso initiates multiple probing flows (at least 512 KB) to estimate the capacity of each path by monitoring their flow completion times. Second, Virtuoso distributes the incoming large data traffic into multiple sub-flows such that

\[
\frac{data_{path_1}}{cap_{path_1}} = \frac{data_{path_2}}{cap_{path_2}} = \cdots = \frac{data_{path_n}}{cap_{path_n}}, \qquad \sum_{i=1}^{n} data_{path_i} = data_{total}. \tag{1}
\]

Here cap_{path_i} and data_{path_i} denote the estimated bandwidth and the allocated data size for path i, respectively. Virtuoso then maps the memory into WRs and submits them to the QPs in round-robin order, as shown in Fig. 3(a). The current design is based on the assumption that the status of the core paths is stable over short periods. Since load balancing is fully decoupled from the other components, more real-time and fine-grained user-space load balancing mechanisms will be explored in future work.
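A small sketch of the allocation implied by Eq. (1) follows; the function and variable names are ours. The total data is split across the n paths in proportion to their estimated capacities, with the rounding remainder assigned to the last path.

```c
#include <stddef.h>

/* Proportional split of data_total according to estimated path capacities. */
void allocate_subflows(const double *cap, size_t *out, int n, size_t data_total)
{
    double cap_sum = 0.0;
    for (int i = 0; i < n; i++)
        cap_sum += cap[i];
    size_t assigned = 0;
    for (int i = 0; i < n; i++) {
        out[i] = (size_t)((double)data_total * cap[i] / cap_sum);
        assigned += out[i];
    }
    out[n - 1] += data_total - assigned;   /* absorb rounding error */
}
```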
Figure 3: User-space Load Balancing. (a) Lossless Network; (b) Lossy Network.

Congestion avoidance is also required for per-sub-flow transmission. For instance, if the RNIC has insufficient resilience (e.g., Mellanox ConnectX-3 Pro) while the network is not well configured (lossy), mapping a large amount of memory into a single WR (so that the RNIC transmits the data too fast) will cause packet loss at the core switches (where the network bottleneck is located). To resolve this, Virtuoso limits the maximum chunk size of each WR using a congestion-window-based mechanism. Initially, Virtuoso probes the threshold value of the chunk size of each sub-flow by increasing the chunk size in a binary (doubling) fashion while monitoring the shared CQ. If congestion occurs (usually indicated by a CQE with the IBV_WC_RETRY_EXC_ERR error code), Virtuoso decreases the chunk size to the previous value and then searches for the maximum threshold by increasing it linearly. Moreover, WR construction and posting are slightly different in a lossy network. To avoid packet loss, Virtuoso uses multiple WRs to map the sub-flow message of each path, with the maximum chunk size determining the number of WRs; these WRs are posted in turn, each following the successful CQE of the previous one, as shown in Fig. 3(b).
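The probing logic can be sketched as below under our assumptions; the send_probe helper and the exact back-off policy are illustrative rather than Virtuoso's precise procedure. The chunk size is grown multiplicatively while completions succeed, and the last good value is returned once a CQE reports IBV_WC_RETRY_EXC_ERR.

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* Grow the per-WR chunk size until a completion error indicates congestion. */
size_t probe_chunk_size(struct ibv_cq *cq, size_t start, size_t max,
                        int (*send_probe)(size_t chunk))
{
    size_t chunk = start, last_good = start;
    while (chunk <= max) {
        if (send_probe(chunk))            /* post one probe WR of this size */
            break;
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);  /* wait for the probe's CQE */
        } while (n == 0);
        if (n != 1 || wc.status == IBV_WC_RETRY_EXC_ERR)
            return last_good;             /* congestion: keep previous value */
        last_good = chunk;
        chunk *= 2;                       /* binary increase while successful */
    }
    return last_good;
}
```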
In this section, we describe the implementation and evaluation of Virtuoso. We evaluate the performance of Virtuoso and validate that it can fully utilize multiple paths in the core of a DCN with minimal CPU overhead.
Virtuoso is implemented as a user-space "middleware" library on top of the standard RDMA libraries ibv_verbs and rdma_cm. Virtuoso contains approximately 1,500 lines of C code. Virtuoso uses a thread-free, event-based mechanism to handle the establishment of multiple QPs and data transmission. An RDMA application invokes MP_connect() to create QP connections, and uses MP_WRITE()/MP_SEND() to initiate a multi-path data transmission. Additionally, we implemented two basic modules for congestion control and load balancing; they can easily be replaced by an application's own designs.
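The event-based completion handling can be sketched with standard verbs calls as follows (a simplified illustration, not Virtuoso's actual code path): the CQ is armed for notification, the caller blocks on the completion channel instead of busy-polling, and the available CQEs are then drained.

```c
#include <infiniband/verbs.h>

/* Block on the completion channel until the CQ raises an event, then drain it. */
int wait_for_completions(struct ibv_comp_channel *ch, struct ibv_cq *cq)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    if (ibv_req_notify_cq(cq, 0))              /* arm: next CQE raises an event */
        return -1;
    if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx)) /* sleeps, no CPU burned */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);
    if (ibv_req_notify_cq(ev_cq, 0))           /* re-arm before draining */
        return -1;

    struct ibv_wc wc;
    int n = 0;
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0)     /* drain everything available */
        n++;
    return n;
}
```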
Our testbed consists of two servers connected to two Top-of-Rack (ToR) switches with multiple links between them to emulate the multi-path scenario of a spine-leaf DCN topology. The end hosts are Dell PowerEdge R430 servers with Intel Xeon ([email protected]) CPUs and 64 GB RAM. They are equipped with Mellanox ConnectX-3 40Gbps RNICs using the MLNX_OFED_LINUX-4.6-1.0.1.1 driver with the 10Gbps port speed enabled. The ToR switches are QuantaMesh T1048-LB9A (SDN) switches, which perform an IP-based path mapping as shown in Fig. 4.
Figure 4: Testbed Setup
In this experiment, we evaluate Virtuoso's path utilization and show that Virtuoso can fully utilize multiple paths to improve bandwidth in the network between the ToR switches (the core portion).

Flow completion time (FCT) is the metric we use to evaluate the performance of Virtuoso with different numbers of paths (1, 2, 4, 6, 8 and 10). For each link between the ToR switches, we limit the speed to 1Gbps, while the links between the ToRs and the servers are 10Gbps, which introduces a bottleneck in the core portion. As shown in Fig. 5, as the number of used paths increases, the FCT decreases markedly. Moreover, the benefit of using more paths holds across different message sizes (from 10 MBytes to 100 GBytes), as shown in Fig. 6, which means Virtuoso can utilize multiple paths for better transport.

Figure 5: Multiple Path Utilization (100GByte Flow)

Figure 6: Different Flow Size Comparison
As discussed for congestion avoidance, if a WR submits too much data at once, congestion occurs in the bottlenecked core network. Thus, utilizing multiple paths can potentially increase the chunk size that a single WR can submit. As shown in Table 2, with more paths, the chunk size also increases. As a result, for a fixed-size data flow, we can save CPU time by sending more data in each iteration.

Moreover, as shown in Table 2, using 2 or 4 paths decreases the average chunk size compared with the single-path scenario. The reason is that the capacity of a small number of extra paths still cannot close the gap between the core network and the RNIC. However, as more paths are used, the average chunk size on each path increases even though less data is allocated to each path in each iteration.

# of Paths    Max Size (Byte)    Avg Size (Byte)
1             700417             700417
2             1300482            650241
4             2537172            634293
6             4405660            734280
8             9371656            1171457
9             18984993           2109400
Table 2: Lossy Network Chunk Size Comparison
As discussed in Section 3, multi-path transport can also increase fairness by preventing elephant flows from blocking mice flows. To validate this, we generate a constant data flow as background traffic while a mice flow (256 KByte) is initiated every 2 seconds. Virtuoso splits the background elephant flow among 10 paths to avoid it blocking any single path. In the comparison scenario, Virtuoso does not split the elephant flow, and the mice flows share the same path used by the elephant flow. We then compare the FCT of the mice flows with and without Virtuoso's load balancing.

As shown in Fig. 7, when Virtuoso splits the elephant flow over multiple paths, the FCT of the mice flows decreases thanks to the extra available bandwidth. In the single-path scenario, which is also the case without Virtuoso's load balancing, the background traffic occupies the shared single path and blocks the mice flows; as a result, the FCT of the mice flows increases.
We use CPU usage time (CPU cycles) to evaluate the CPU overhead of Virtuoso. In this experiment, we tag the code at different points (e.g., at the end of the MP_rdma_connect() function) to measure the CPU cycles used by the different parts. The standard C library time is used to log the CPU clock at each point.

Moreover, to avoid the extra CPU usage caused by CQ polling, we use event-based completion queue polling (ibv_get_cq_event(), where the application is blocked during data transmission). In this way, we avoid the deviation caused by unnecessary CPU usage and measure only the critical CPU overhead.
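A minimal sketch of this measurement approach (ours, not the paper's exact instrumentation): a region of interest is bracketed with clock() from the standard C time library and the CPU time it consumed is logged.

```c
#include <stdio.h>
#include <time.h>

/* Measure the CPU time consumed by a code region, e.g. a connect routine. */
void timed_region(void (*fn)(void))
{
    clock_t start = clock();        /* CPU clock ticks used by this process */
    fn();
    clock_t used = clock() - start;
    printf("CPU time: %ld ticks (%.6f s)\n",
           (long)used, (double)used / CLOCKS_PER_SEC);
}
```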
Figure 7: Multiple Flows Interactions with Virtuoso
As shown in Fig. 8, as the number of used paths increases, especially for small message sizes, more CPU cycles are spent on user-space computation. However, large data sizes eliminate this side effect by increasing both bandwidth and chunk size, which reduces the number of iterations needed to transmit the same amount of data. Hence, as an overall conclusion, large data messages should always leverage multiple paths, while small data messages can use Virtuoso to steer flows away from congested paths.

Figure 8: CPU Usage Overhead Comparison
This paper presents Virtuoso, a purely software-based multi-path RDMA solution for data center networks that effectively utilizes multiple paths for load balancing and reliability. Virtuoso employs vNICs to help RDMA applications split large flows into multiple smaller sub-flows and dispatch them among multiple paths to achieve user-space load balancing. Virtuoso improves the bandwidth in the core of DCNs by utilizing multiple paths while introducing negligible CPU overhead.

Virtuoso is presented to inspire the community to leverage the flexibility of software virtualization techniques. We plan to further i) provide a fine-grained yet efficient congestion control mechanism to achieve fast and dynamic load-balancing reactions, and ii) migrate real-world applications, such as distributed TensorFlow [12], to evaluate the benefits of Virtuoso and to benefit the machine learning community.
REFERENCES
[1] Mohammad Al-Fares et al. 2008. A Scalable, Commodity Data Center Network Architecture. In ACM SIGCOMM. 63–74.
[2] Mohammad Alizadeh et al. 2014. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In ACM SIGCOMM. 503–514.
[3] Jiaxin Cao et al. 2013. Per-Packet Load-Balanced, Low-Latency Routing for Clos-Based Data Center Networks. In ACM CoNEXT. 49–60.
[4] Shoby Cherian et al. 2017. Methods and systems to achieve multi-tenancy in RDMA over converged Ethernet. (Aug. 29 2017). US Patent 9,747,249.
[5] Carolyn J Sher Decusatis et al. 2012. Communication within clouds: open standards and proprietary protocols for data center networking. IEEE Commun Mag 50, 9 (2012), 26–33.
[6] Aleksandar Dragojević et al. 2014. FaRM: Fast Remote Memory. In USENIX NSDI. 401–414.
[7] Alan Ford et al. 2012. TCP Extensions for Multipath Operation with Multiple Addresses. IETF (2012).
[8] Monia Ghobadi et al. 2012. Rethinking End-to-End Congestion Control in Software-Defined Networks. In ACM HotNets. 61–66.
[9] Albert Greenberg et al. 2009. VL2: A Scalable and Flexible Data Center Network. In ACM SIGCOMM. 51–62.
[10] Juncheng Gu et al. 2017. Efficient Memory Disaggregation with Infiniswap. In USENIX NSDI. 649–667.
[11] Chuanxiong Guo et al. 2016. RDMA over Commodity Ethernet at Scale. In ACM SIGCOMM. 202–215.
[12] Chengfan Jia et al. 2018. Improving the performance of distributed TensorFlow with RDMA. Int J Parallel Program 46, 4 (2018), 674–685.
[13] George Kalokerinos et al. 2009. FPGA implementation of a configurable cache/scratchpad memory with virtualized user-level RDMA capability. In IEEE SAMOS. 149–156.
[14] Hari Kathi et al. 2006. Data traffic load balancing based on application layer messages. (July 13 2006). US Patent App. 11/031,184.
[15] Daehyeok Kim et al. 2019. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. In USENIX NSDI. 113–126.
[16] Xiaoyi Lu et al. 2014. Accelerating Spark with RDMA for big data processing: Early experiences. In IEEE HOTI. 9–16.
[17] Yuanwei Lu et al. 2018. Multi-Path Transport for RDMA in Datacenters. In USENIX NSDI. 357–371.
[18] Yifei Lu and Shuhong Zhu. 2015. SDN-based TCP congestion control in data center networks. In IEEE IPCCC. 1–7.
[19] Niranjan Mysore et al. 2009. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In ACM SIGCOMM. 39–50.
[20] Jonathan Perry et al. 2014. Fastpass: a centralized "zero-queue" datacenter network. In ACM SIGCOMM. 307–318.
[21] Jonas Pfefferle et al. 2015. A Hybrid I/O Virtualization Framework for RDMA-capable Network Interfaces. In ACM VEE. 17–30.
[22] Jim Pinkerton. 2002. The case for RDMA. RDMA Consortium, May 29 (2002), 27.
[23] Haonan Qiu et al. 2018. Toward Effective and Fair RDMA Resource Sharing. In ACM APNet. 8–14.
[24] Ren et al. 2013. Design and Performance Evaluation of NUMA-Aware RDMA-Based End-to-End Data Transfer Systems. In ACM SC. 48.
[25] M Skyllas-Kazacos et al. 1986. New all-vanadium redox flow cell. Journal of the Electrochemical Society 133 (1986), 1057.
[26] Maomeng Su et al. 2017. RFP: When RPC is Faster than Server-Bypass with RDMA. In ACM EuroSys. 1–15.
[27] Shin-Yeh Tsai and Yiying Zhang. 2017. LITE Kernel RDMA Support for Datacenter Applications. In USENIX OSDI. 306–324.