Fast and Efficient Bulk Multicasting over Dedicated Inter-Datacenter Networks
Mohammad Noormohammadpour, Cauligi S. Raghavendra, Srikanth Kandula, Sriram Rao
University of Southern California, Microsoft, Facebook
Abstract —Several organizations have built multiple datacenters connected via dedicated wide area networks over which large inter-datacenter transfers take place. This includes tremendous volumes of bulk multicast traffic generated as a result of data and content replication. Although one can perform these transfers using a single multicast forwarding tree, that can lead to poor performance as the slowest receiver on each tree dictates the completion time for all receivers. Using multiple trees per transfer, each connected to a subset of receivers, alleviates this concern. The choice of multicast trees also determines the total bandwidth usage. To further improve performance, bandwidth over dedicated inter-datacenter networks can be carved for different multicast trees over specific time periods to avoid congestion and minimize the average receiver completion times. In this paper, we break this problem into the three sub-problems of partitioning, tree selection, and rate allocation. We present an algorithm called QuickCast which is computationally fast and allows us to significantly speed up multiple receivers per bulk multicast transfer with control over extra bandwidth consumption. We evaluate QuickCast against a variety of synthetic and real traffic patterns as well as real WAN topologies. Compared to performing bulk multicast transfers as separate unicast transfers, QuickCast achieves up to . × reduction in mean completion times while at the same time using . × the bandwidth. Also, QuickCast allows the top 50% of receivers to complete between × to × faster on average compared with when a single forwarding multicast tree is used for data delivery.

Index Terms —Wide Area Networks, Data Replication, Inter-Datacenter Networks, Receiver Completion Times, Bandwidth Allocation, Traffic Engineering.
I. INTRODUCTION
Dedicated inter-datacenter networks connect dozens of geographically dispersed datacenters [2]–[4] whose traffic can generally be categorized as either user-generated or internal. User-generated traffic is in the critical path of users' quality of experience and is generated as a result of direct interaction with users. Internal traffic flows across servers that host applications in the back-end and is a product of staging data and content that will later be used to offer services to users. Compared to user-generated traffic, internal traffic is more resilient to scheduling and routing latency and is usually orders of magnitude larger in volume [2], [3], [5]. Internal data transfers over inter-datacenter networks can potentially take a long time to complete, that is, up to hours [5]. A prevalent form of internal traffic is the replication of data and content from one datacenter to multiple other datacenters, which accounts for tremendous volumes of traffic [2], [3], [6].
A preliminary version of this paper appears in INFOCOM 2018 [1].

Examples include the distribution of numerous copies of voluminous configuration files [7], multimedia content served to regional users by CDNs [8], and search index updates [2]. Such replication generates bulk multicast transfers with a predetermined set of receivers and known transfer volume, which are the focus of this paper. As bandwidth over dedicated inter-datacenter networks is managed by one organization that also operates the datacenters, it is possible to coordinate data transmissions across the end-points to avoid congestion and optimize network-wide performance metrics such as mean or tail completion times of receivers. We focus on minimizing the mean completion times of receivers while performing concurrent bulk multicast transfers, assuming that receivers of a transfer can complete at different times. Speeding up several receivers per transfer can translate to improved end-user quality of experience and increased availability. For example, faster replication of video content to regional datacenters enhances the average user's experience in social media applications, and making a newly trained model available at regional datacenters allows speedier access to new application features for millions of users. Several recent works focus on improving the performance of unicast transfers over dedicated inter-datacenter networks [2], [5], [9]–[11]. Performing bulk multicast transfers as many separate unicast transfers can lead to excessive bandwidth usage and increased completion times. Although there exists extensive work on multicasting, it is not possible to apply those solutions to our problem as existing research has focused on different goals and considers different constraints. For example, earlier research in multicasting aims at dynamically building and pruning multicast trees as receivers join or leave [12], building multicast overlays that reduce control traffic overhead and improve scalability [13], or choosing multicast trees that satisfy a fixed available bandwidth across all edges as requested by applications [14], [15], minimize congestion within datacenters [16], [17], reduce data recovery costs assuming some recovery nodes [18], or maximize the throughput of a single multicast flow [19], [20]. To our knowledge, none of the related research efforts aimed at minimizing the mean completion times of receivers for concurrent bulk multicast transfers while considering the overall bandwidth usage, which is the focus of this work.
Motivating Example:
Figure 1 shows an example of delivering a large object X, which has a volume of 100 units, from source S to destinations {t1, t2, t3, t4}. We have two types of links with capacities of 1 and 10 units of traffic per time unit. We can use a single multicast tree to connect the sender to all receivers, which allows us to transmit at the bottleneck rate of 1 to all receivers. However, one can group the receivers into two partitions, P1 and P2, and attach each partition to the sender with a separate multicast tree. We can then select transmission rates so as to minimize the mean completion times. In this case, assigning a rate of 1 to the tree attached to one partition and a rate of 9 to the tree attached to the other attains this goal while respecting the capacity of all links (the link attached to S is the bottleneck). As another possibility, we could have assigned a rate of 10 to the faster tree, allowing its two receivers to finish in 10 units of time, while suspending the other tree until time 11. As a result, the suspended tree would have started at 11, allowing its two receivers to finish at 110.
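For concreteness, the mean completion times of the options above can be worked out directly from the stated rates and the object volume of 100 units (a minimal calculation over the four receivers):

```latex
% Single tree at the bottleneck rate of 1: every receiver finishes at 100/1 = 100, so the mean is 100.
% Two trees served concurrently at rates 9 and 1:
\text{mean} = \frac{2 \cdot \tfrac{100}{9} + 2 \cdot \tfrac{100}{1}}{4} \approx 55.6
% Two trees served one after the other (rate 10 first, then the slower tree):
\text{mean} = \frac{2 \cdot 10 + 2 \cdot 110}{4} = 60
```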
Fig. 1. Motivating example: delivering object X (100 units) from S to four receivers over links of capacity 1 and 10; the figure shows the setup, the individual completion times per receiver, and the resulting mean completion times.

In this paper, we aim to improve the speed of several receivers per bulk multicast transfer without hurting the completion times of the slow receivers. In computing the completion times, we ignore the propagation and queuing latencies as the focus of this paper is on delivering bulk objects for which the transmission time dominates the propagation or queuing latency along the trees. We break the bulk multicast transfer routing and scheduling problem, with the objective of minimizing mean completion times of receivers, into three sub-problems: receiver set partitioning, multicast forwarding tree selection per receiver partition, and rate allocation per forwarding tree. We propose QuickCast, which offers an elegant solution to each one of these three sub-problems.

Receiver Set Partitioning:
As different receivers can have different completion times, a natural way to improve completion times is to partition receivers into multiple sets, with each receiver set having a separate tree. This reduces the effect of slow receivers on faster ones. We employ a partitioning technique that groups receivers of every bulk multicast transfer into multiple partitions according to their mutual distance (in hops) on the inter-datacenter graph. With this approach, the partitioning of receivers into any N > 1 partitions consumes minimal additional bandwidth on average. We also offer a configuration parameter, called the partitioning factor, that is used to decide on the right number of partitions, striking a balance between improvements in receiver completion times and the total bandwidth consumption.

Compared to [1], we have extended QuickCast by considering actual WAN topologies with non-uniform link capacity and by eliminating the constraint on the number of partitions. We have performed additional empirical evaluations on multiple tree selection techniques and several rate allocation policies.
Forwarding Tree Selection:
To avoid heavily loaded routes, multicast trees should be chosen dynamically per partition according to the receivers in that partition and the distribution of traffic load across network edges. We utilize a computationally efficient approach for forwarding tree selection that connects a sender to a partition of its receivers by assigning weights to the edges of the inter-datacenter graph and using a minimum weight Steiner tree heuristic. We define a weight assignment according to the traffic load scheduled on edges and their capacity, and empirically show that this weight assignment offers improved receiver completion times at minimal bandwidth consumption.
Rate Allocation:
Given the receiver partitions and their forwarding trees, formulating the rate allocation for minimizing mean completion times of receivers leads to a hard problem. We consider the popular scheduling policies of fair sharing, Shortest Remaining Processing Time (SRPT), and First Come First Serve (FCFS). We reason why fair sharing is preferred over policies that strictly prioritize transfers (i.e., SRPT, FCFS, etc.) for network throughput maximization when focusing on bulk multicast transfers, especially ones with many receivers per transfer. We empirically show that using max-min fairness [21], which is a form of fair sharing, we can considerably improve the average network throughput, which in turn reduces receiver completion times. In QuickCast, we apply max-min fairness for rate allocation across the multicast forwarding trees.

QuickCast assumes a logically centralized setting; it communicates with the end-points that transmit traffic for rate limiting, and with the inter-datacenter network elements that perform traffic forwarding for managing multicast forwarding trees. We evaluate QuickCast against a variety of synthetic and real traffic patterns as well as real WAN topologies. Compared to performing bulk multicast transfers as separate unicast transfers, QuickCast achieves up to . × reduction in mean receiver completion times while at the same time using . × the bandwidth. Also, QuickCast allows the top 50% of receivers to complete between × to × faster on average compared with when a single forwarding multicast tree is used for data delivery. We also show that on a real WAN topology, fair sharing offers up to . × higher throughput with 16 receivers per bulk multicast transfer compared to other scheduling policies, i.e., SRPT and FCFS.

II. RELATED WORK
Internet Multicasting:
A large body of general multicast-ing approaches have been proposed where receivers can joinmulticast groups anytime to receive required data and multicasttrees are incrementally built and pruned as nodes join or leavea multicast session such as IP multicasting [12], TCP-SMO[22] and NORM [23]. These solutions focus on building andmaintaining multicast trees, and do not consider link capacityand other ongoing multicast flows while building the trees.
Multicast Traffic Engineering:
An interesting work [14]considers the online arrival of multicast requests with a speci-fied bandwidth requirement. The authors provide an elegant solution to find a minimum weight Steiner tree for an ar-riving request with all edges having the requested availablebandwidth. This work assumes a fixed transmission rate permulticast tree, dynamic multicast receivers, and unknowntermination time for multicast sessions whereas we considervariable transmission rates over timeslots, fixed multicastreceivers, and deem a multicast tree completed when all itsreceivers download a specific volume of data. MTRSA [15]considers a similar problem to [14] but in an offline scenariowhere all multicast requests are known beforehand whiletaking into account the number of available forwarding rulesper switch. MPMC [19], [20] maximizes the throughput for asingle multicast transfer by using multiple parallel multicasttrees and coding techniques. None of these works aims tominimize the completion times of receivers while consideringthe total bandwidth consumption.
Datacenter Multicasting:
A variety of solutions have beenproposed for minimizing congestion across the intra-datacenternetwork by selecting multicast trees according to link utiliza-tion. Datacast [17] sends data over edge-disjoint Steiner treesfound by pruning spanning trees over various topologies ofFatTree, BCube, and Torus. AvRA [16] focuses on tree andFatTree topologies and builds minimum edge Steiner treesthat connect the sender to all receivers as they join. MCTCP[24] reactively schedules flows according to link utilization.These works do not aim at minimizing the completion timesof receivers and ignore the total bandwidth consumption.
Overlay Multicasting:
With overlay networks, end-hostscan form a multicast forwarding tree in the application layer.RDCM [25] populates backup overlay networks as nodes joinand transmits lost packets in a peer-to-peer fashion over them.NICE [13] creates hierarchical clusters of multicast peers andaims to minimize control traffic overhead. AMMO [26] allowsapplications to specify performance constraints for selectionof multi-metric overlay trees. DC2 [27] is a hierarchy-awaregroup communication technique to minimize cross-hierarchycommunication. SplitStream [28] builds forests of multicasttrees to distribute load across many machines. BDS [6] gen-erates an application-level multicast overlay network, createschunks of data, and transmits them in parallel over bottleneck-disjoint overlay paths to the receivers. Due to limited knowl-edge of underlying physical network topology and condition(e.g., utilization, congestion or even failures), and limited or nocontrol over how the underlying network routes traffic, overlayrouting has limited capability in managing the total bandwidthusage and distribution of traffic to minimize completion timesof receivers. In case such control and information are provided,for example by using a cross-layer approach, overlay multi-casting can be used to realize solutions such as QuickCast.
Reliable Multicasting:
Various techniques have been pro-posed to make multicasting reliable including the use ofcoding and receiver (negative or positive) acknowledgments.Experiments have shown that using positive ACKs does notlead to ACK implosion for medium scale (sub-thousand)receiver groups [22]. TCP-XM [29] allows reliable deliveryby using a combination of IP multicast and unicast for datadelivery and re-transmissions. MCTCP [24] applies standard TCP mechanisms for reliability. Another approach is forreceivers to send NAKs upon expiration of some inactivitytimer [23]. NAK suppression has been proposed to addressimplosion which can be applied by routers [30]. Forward ErrorCorrection (FEC) has been used to reduce re-transmissions[23] and improve the completion times [31] examples of whichinclude Raptor Codes [32] and Tornado Codes [33]. Thesetechniques can be applied complementary to QuickCast.
Multicast Congestion Control:
Existing approaches trackthe slowest receiver. PGMCC [34], MCTCP [24] and TCP-SMO [22] use window-based TCP like congestion control tocompete fairly with other flows. NORM [23] uses an equation-based rate control scheme. With rate allocation and end-hostbased rate limiting applied over inter-datacenter networks,need for distributed congestion control becomes minimal;however, such techniques can still be used as a backup.
Other Related Work:
CastFlow [35] precalculates multicast spanning trees which can then be used at request arrival time for fast rule installation. ODPA [36] presents algorithms for dynamic adjustment of multicast spanning trees according to specific metrics. BIER [37] has been recently proposed to improve scalability and allow frequent dynamic manipulation of multicast forwarding state in the network, and can be applied complementary to our solutions in this paper. Peer-to-peer approaches [38]–[40] aim to maximize throughput per receiver without considering physical network topology, link capacity, or total network bandwidth consumption. Store-and-Forward (SnF) approaches [41]–[44] focus on minimizing transit bandwidth costs, which does not apply to dedicated inter-datacenter networks. However, SnF can still be used to improve overall network utilization in the presence of diurnal link utilization patterns, transient bottleneck links, or for application layer multicasting. BDS [45] uses many parallel overlay paths from a multicast source to its destinations, storing and forwarding data from one destination to the next. Application of SnF for bulk multicast transfers considering the physical topology is complementary to our work in this paper and is a future direction. Recent research [46]–[48] also considers bulk multicast transfers with deadlines with the objective of maximizing the number of transfers completed before the deadlines. We note that reducing completion times is a more general objective and, for most applications, completing transfers is valuable and required even when it is not possible to meet all the deadlines [5].
III. PROBLEM STATEMENT AND CHALLENGES
We consider a scenario where bulk multicast transfers arrive at the inter-datacenter network in an online fashion. Every bulk multicast transfer R is specified with a source S_R, a set of destinations D_R, and a volume V_R in bytes (unicast and broadcast can be considered as special cases with one receiver or all other nodes as receivers). In general, no form of synchronization is required across receivers of a bulk multicast transfer and therefore, receivers are allowed to complete at different times as long as they all receive the multicast object in whole. Incoming requests are processed as they arrive by a traffic engineering server that manages the forwarding state of the whole network in a logically centralized manner for installation and eviction of multicast trees. Upon arrival of a request, this server decides on the number of partitions, the receivers that are grouped per partition, and a multicast tree per partition.

TABLE I
DEFINITION OF VARIABLES

t_now: the current timeslot
e: a directed edge
C_e: capacity of e
U_e: edge e's bandwidth utilization, 0 ≤ U_e ≤ 1
G(V, E): a directed inter-datacenter network graph
T: a directed Steiner tree
δ: duration of a timeslot
R: a bulk multicast transfer request
S_R: source datacenter of R
D_R: set of destinations of request R
V_R: original volume of R
P_R: set of partitions of request R
P: set of partitions of all transfers in the system
T_P: forwarding tree of some partition P ∈ P
r_P(t): the transmission rate over the forwarding tree of some partition P ∈ P at timeslot t
V^r_P: residual volume of some partition P ∈ P; therefore, V^r_P ≤ V_R where P ∈ P_R
L_e: edge e's total traffic load at time t_now, i.e., total outstanding bytes scaled by e's inverse capacity
p_f: a configuration parameter that determines a partitioning cost threshold
N_max: a configuration parameter that determines the maximum number of partitions per transfer

We consider a slotted timeline with a timeslot duration of δ. Periodically, the traffic engineering server computes the transmission rates for all multicast trees at the beginning of every timeslot and dispatches them to senders for rate limiting. This allows for a congestion free network since the rates are computed according to link capacity constraints and other ongoing transfers. To minimize control plane overhead, partitions and forwarding trees are fixed once they are established for an incoming transfer. In this context, the bulk multicast transfer routing and scheduling problem can be formally stated as follows. A summary of our notation is given in Table I.

Problem Statement:
Given an inter-datacenter network G(V, E) with edge capacities C_e, ∀e ∈ E, and the set of all partitions {P ∈ P | V^r_P > 0}, for a newly arriving bulk multicast transfer R(S_R, D_R, V_R), the traffic engineering server needs to compute a set of receiver partitions P_R, each with one or more receivers, and select a forwarding tree T_P for every partition P ∈ P_R. In addition, per timeslot t, the traffic engineering server needs to compute the rates r_P(t) for every {P ∈ P | V^r_P > 0}. The objective is to minimize the average time for a receiver to complete data reception while keeping the total bandwidth consumption below a certain threshold compared to the minimum possible, i.e., a minimum edge Steiner tree per transfer.
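To make the notation above concrete, the following is a minimal sketch (illustrative only, not the authors' implementation; all class and field names are our own) of the per-request state the traffic engineering server would track: a request R(S_R, D_R, V_R) split into partitions, each with its own forwarding tree T_P and residual volume V^r_P.

```java
import java.util.*;

/** Illustrative per-request state: request R(S_R, D_R, V_R) and its partitions. */
class MulticastRequest {
    final int source;              // S_R: source datacenter
    final Set<Integer> receivers;  // D_R: destination datacenters
    final double volume;           // V_R: bytes to deliver to every receiver
    final List<Partition> partitions = new ArrayList<>();  // P_R

    MulticastRequest(int source, Set<Integer> receivers, double volume) {
        this.source = source;
        this.receivers = receivers;
        this.volume = volume;
    }
}

class Partition {
    final Set<Integer> receivers;  // subset of D_R grouped together
    final Set<Integer> treeEdges;  // T_P: edge indices of the forwarding tree
    double residualVolume;         // V^r_P, decreases as rate * timeslot is delivered

    Partition(Set<Integer> receivers, Set<Integer> treeEdges, double volume) {
        this.receivers = receivers;
        this.treeEdges = treeEdges;
        this.residualVolume = volume;
    }
}
```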
Challenges:
Both the number of ways to partition receivers into subsets and the number of candidate forwarding trees per subset grow exponentially with the problem size. It is, in general, not clear how partitioning and selection of forwarding trees correlate with both receiver completion times and total bandwidth usage. Even the simple objective of minimizing the total bandwidth usage is a hard problem. Also, assuming known forwarding trees, selecting transmission rates per timeslot per tree for minimization of mean receiver completion times is a hard problem. Finally, this is an online problem with unknown future arrivals, which adds to the complexity.
IV. QUICKCAST
As stated earlier, we need to address the three sub-problems of receiver set partitioning, tree selection, and rate allocation. Since the partitioning sub-problem builds on the tree selection sub-problem, we discuss tree selection first and address rate allocation last. Because the total bandwidth usage is a function of transfer properties (number of receivers, transfer volume, and the location of the sender and receivers) and the network topology, it is highly complex to design a solution that guarantees a limit on the total bandwidth usage. Instead, we aim to reduce completion times while minimally increasing bandwidth usage.
A. Forwarding Tree Selection
The tree selection problem asks, given a network topology with knowledge of link capacities, how to choose a Steiner tree that connects a sender to all of its receivers. The objective is to minimize the completion time of the receivers (all receivers on a tree complete at the same time) while minimally increasing the total bandwidth usage. Since the total bandwidth usage is directly proportional to the number of edges on selected trees, we want to keep trees as small as possible. Reduction in completion times can be achieved by avoiding edges that have a large outstanding traffic load. In general, this can mean selecting potentially larger trees to go around such edges, if necessary. This effect can be accounted for by assigning proper weights to the edges of the inter-datacenter graph and choosing a minimum weight Steiner tree that connects the sender to a partition of receivers for some bulk multicast transfer. The minimum weight Steiner tree is a hard problem for which many heuristics exist.
Weight Assignment:
Since we focus on reducing the receiver completion times for bulk transfers, where transmission time can be orders of magnitude larger than propagation or queuing delay, conventional routing metrics such as end-to-end latency are not effective. Also, we realized that instantaneous link utilization, which has been extensively used for traffic engineering over WAN, lacks stability over longer time periods, which makes it hard to infer how it will change in the near future. Therefore, we use a new metric, referred to as the link load L_e, ∀e ∈ E, that is defined in Table I and can be computed as follows:

$$L_e = \frac{1}{C_e} \sum_{P \in \boldsymbol{P} \,|\, e \in T_P} V^{r}_{P} \qquad (1)$$

A link's load is the total outstanding volume of traffic allocated on that link (that we know will cross over that link in the future) divided by its capacity. We can compute this value since we know the volume of incoming transfers and the edges they will be using. A link's load is a measure of how busy it is expected to be in the near future. It increases as new transfers are scheduled on a link and diminishes as traffic flows through it. To select a forwarding tree from a source to a set of receivers, we use an edge weight of L_e + V_R / C_e and select a minimum weight Steiner tree. The selected tree will most likely exclude any links that are expected to be highly busy. The addition of the second element in the weight (the new request's volume divided by capacity) helps select smaller trees in case there is not much load on most edges.
Algorithm 1: Forwarding Tree Selection
/* Variables defined in Table I */
Input: Request R, partition P ∈ P_R, G(V, E), and L_e, ∀e ∈ E
Output: A forwarding tree (set of edges)

ComputeTree(P, R)
    Assign W_e = L_e + V_R / C_e, ∀e ∈ E;
    Find a minimum weight Steiner tree T_P which connects the nodes {S_R} ∪ P;
    L_e ← L_e + V_R / C_e, ∀e ∈ T_P;
    return T_P;

Algorithm 1 applies the weight assignment approach mentioned above to select a forwarding tree that balances the traffic load across available trees, and finds a minimum weight Steiner tree using the GreedyFLAC heuristic [49]. In § V, we explore a variety of weights for forwarding tree selection, as shown in Table IV, and see that this weight assignment provides consistently close to minimum values for the three performance metrics of mean and tail receiver completion times as well as total bandwidth usage.
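The following is a minimal sketch of Algorithm 1 (illustrative only, not the authors' code). Edges are indexed 0..|E|-1, and the minimum weight Steiner tree routine (GreedyFLAC [49] in the paper) is abstracted behind a hypothetical interface.

```java
import java.util.*;

/** Sketch of Algorithm 1: select a forwarding tree for a partition of request R. */
class TreeSelector {
    double[] load;      // L_e: outstanding bytes on e scaled by 1/C_e
    double[] capacity;  // C_e

    interface SteinerHeuristic {
        // returns the edge indices of a minimum weight tree over {root} ∪ terminals
        Set<Integer> minWeightSteinerTree(double[] edgeWeight, int root, Set<Integer> terminals);
    }

    Set<Integer> computeTree(int source, Set<Integer> partition, double volumeVR,
                             SteinerHeuristic steiner) {
        double[] w = new double[capacity.length];
        for (int e = 0; e < w.length; e++)
            w[e] = load[e] + volumeVR / capacity[e];      // W_e = L_e + V_R / C_e
        Set<Integer> tree = steiner.minWeightSteinerTree(w, source, partition);
        for (int e : tree)
            load[e] += volumeVR / capacity[e];            // update L_e on the selected tree
        return tree;
    }
}
```

Updating the load on the selected edges is what allows subsequent requests to route around links that already carry a large outstanding volume.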
Worst-case Complexity:
Algorithm 1 computes one minimum weight Steiner tree. For a request R, the worst-case complexity of Algorithm 1 is O(|V| |D_R| + |E|) given the complexity of GreedyFLAC [49].

B. Receiver Set Partitioning
The maximum transmission rate on a tree is that of the link with minimum capacity. To improve bandwidth utilization of the inter-datacenter backbone, we can replace a large forwarding tree with multiple smaller trees, each connecting the source to a subset of receivers. By partitioning, we isolate some receivers from the bottlenecks, allowing them to receive data at a higher rate. We aim to find a set of partitions, each with at least one receiver, that allows for reducing the average receiver completion times while minimally increasing the bandwidth usage. Bottlenecks may appear either due to competing transfers or differences in link capacity. In the former case, some edges may be shared by multiple trees, which leads to lower available bandwidth per tree. Such conditions may arise more frequently under heavy load. In the latter case, differences in link capacity can increase completion times, especially in large networks and with many receivers per transfer. Receiver set partitioning to minimize the impact of bottlenecks and reduce completion times is a sophisticated open problem. It is best if partitions are selected in a way that no additional bottlenecks are created. Also, increasing the number of partitions may in general increase bandwidth consumption (multiple smaller trees may have more edges in total compared to one large tree). Therefore, we need to come up with the right number of partitions and the receivers that are grouped per partition. We propose a partitioning approach, called hierarchical partitioning, that is computationally efficient and uses a partitioning factor to decide on the number of partitions and the receivers that are grouped in those partitions.
Number of Partitions:
Transfers may have a highly varying number of receivers. Generally, the number of partitions should be computed based on the number of receivers, where they are located in the network, and the network topology. Also, using more partitions can lead to the creation of unnecessary bottlenecks due to shared links. We compute the number of partitions per transfer according to the total traffic load on network edges, considering a threshold that limits the cost of additional bandwidth consumption.
Limitations of Partitioning:
Partitioning, in general, cannot improve tail completion times of transfers, as the tail is usually driven by physical resource constraints, i.e., low capacity links or links with high contention.
Hierarchical Partitioning:
We group receivers into partitions according to their mutual distance, defined as the number of hops on the shortest hop path that connects any two receivers. Hierarchical clustering [50] approaches such as agglomerative clustering can be used to compute the groups by initially assuming that every receiver is its own partition and then merging the two closest partitions at each step, which generates a hierarchy of partitioning solutions. Each layer of the hierarchy then gives us one possible solution with a given number of partitions. With this approach, the partitioning of receivers into any N > 1 partitions consumes minimal additional bandwidth on average compared to any other partitioning with N partitions. That is because assigning a receiver to any other partition will likely increase the total number of edges needed to connect the source to all receivers; otherwise, that receiver would not have been grouped with the other receivers in its current partition in the first place. There is, however, no guarantee since hierarchical clustering works based on a greedy heuristic. After building a partitioning hierarchy, the algorithm selects the layer with the maximum number of partitions whose total sum of tree weights stays below a threshold that can be configured as a system parameter. Choosing the maximum number of partitions allows us to minimize the effect of slow receivers given the threshold, which is a multiple of the weight of a single tree that would connect the sender to all receivers and can be looked at as a bandwidth budget. We call the multiplication coefficient the partitioning factor p_f. Algorithm 2 shows this process in detail.
Algorithm 2: Compute Partitions and Trees
/* Variables defined in Table I */
Input: Request R(S_R, D_R, V_R), G(V, E), and L_e, ∀e ∈ E
Output: Pairs of (partition, forwarding tree)

ComputePartitionsAndTrees(R, N_max)
    Assign W_e = L_e + V_R / C_e, ∀e ∈ E;
    Find the minimum weight Steiner tree T_R which connects the nodes {S_R} ∪ D_R, and its weight W_{T_R};
    foreach (α, β) ∈ D_R, α ≠ β do
        DIST_{α,β} ← number of edges on the minimum hop path from α to β;
    Compute the agglomerative clustering hierarchy for D_R using average linkage and distance DIST, which has l clusters at layer l, 1 ≤ l ≤ |D_R|;
    for l = min(N_max, |D_R|) down to 1 do
        P_l ← set of clusters at layer l of the agglomerative hierarchy, each cluster forming a partition;
        foreach P ∈ P_l do
            Find the minimum weight Steiner tree T_P which connects the nodes {S_R} ∪ P;
        if Σ_{P ∈ P_l} W_{T_P} ≤ p_f × W_{T_R} then
            foreach P ∈ P_l do
                T_P ← ComputeTree(P, R);
            return (P, T_P), ∀P ∈ P_l;
    L_e ← L_e + V_R / C_e, ∀e ∈ T_R;
    return (D_R, T_R);

The partitioning factor p_f plays a vital role in the operation of QuickCast as it determines the extra cost we are willing to pay in bandwidth for improved completion times. In general, a p_f greater than one but close to it should allow partitioning to separate very slow receivers from several other nodes. A p_f that is considerably larger than one may generate too many partitions, potentially creating many shared links that reduce throughput and additional edges that increase bandwidth usage. If p_f is less than one, a single partition is used.
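The hierarchical part of Algorithm 2 can be sketched as follows (illustrative only, not the authors' implementation). The treeWeight callback stands in for running the Steiner heuristic of Algorithm 1 over {S_R} ∪ P and returning the resulting tree weight; all names are our own.

```java
import java.util.*;

/** Sketch of hierarchical partitioning: merge receivers bottom-up by average hop
 *  distance, then pick the layer with the most partitions whose total tree weight
 *  stays within pf times the weight of a single tree over all receivers. */
class Partitioner {
    // dist[a][b]: hops on the shortest path between receivers a and b
    List<Set<Integer>> partition(double[][] dist, int nMax, double pf,
                                 java.util.function.ToDoubleFunction<Set<Integer>> treeWeight) {
        int n = dist.length;
        List<Set<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < n; i++) clusters.add(new HashSet<>(Collections.singleton(i)));
        Map<Integer, List<Set<Integer>>> layers = new HashMap<>();
        layers.put(n, deepCopy(clusters));            // layer n: one receiver per cluster

        // agglomerative clustering with average linkage
        for (int l = n - 1; l >= 1; l--) {
            int bi = 0, bj = 1; double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = avgLinkage(clusters.get(i), clusters.get(j), dist);
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));
            layers.put(l, deepCopy(clusters));        // layer l has l clusters
        }

        // pick the layer with the most partitions under the bandwidth budget
        Set<Integer> all = new HashSet<>();
        for (int i = 0; i < n; i++) all.add(i);
        double budget = pf * treeWeight.applyAsDouble(all);
        for (int l = Math.min(nMax, n); l >= 1; l--) {
            double total = 0;
            for (Set<Integer> p : layers.get(l)) total += treeWeight.applyAsDouble(p);
            if (total <= budget) return layers.get(l);
        }
        return Collections.singletonList(all);        // fallback: a single partition
    }

    private static double avgLinkage(Set<Integer> a, Set<Integer> b, double[][] dist) {
        double s = 0;
        for (int x : a) for (int y : b) s += dist[x][y];
        return s / (a.size() * b.size());
    }

    private static List<Set<Integer>> deepCopy(List<Set<Integer>> src) {
        List<Set<Integer>> out = new ArrayList<>();
        for (Set<Integer> s : src) out.add(new HashSet<>(s));
        return out;
    }
}
```

The caller would then invoke Algorithm 1 once per returned partition to obtain the final forwarding trees, as Algorithm 2 does with ComputeTree.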
Worst-case Complexity:
Algorithm 2 performs multiple calls to the GreedyFLAC algorithm [49]. It also uses hierarchical clustering with average linkage, which has a worst-case complexity of O(|D_R|). To compute the pairwise distances of receivers, we can use breadth-first search, which has a complexity of O(|V| + |E|). The worst-case complexity of Algorithm 2 is O((|V| + |E|)|D_R| + |D_R|).

C. Rate Allocation
To compute the transmission rates per tree per timeslot, one can formulate an optimization problem with capacity and demand constraints and consider minimizing the mean receiver completion times as the objective. This is, however, a hard problem and can be modeled using mixed-integer programming by assuming a binary variable per timeslot per tree that shows whether that tree has completed by that timeslot. One can come up with approximation algorithms for this problem, which we consider part of future work. In this paper, we consider the three popular scheduling policies of FCFS, SRPT, and fair sharing according to max-min fairness [21], which have been extensively used for network scheduling. These policies can be applied independently of the partitioning and forwarding tree selection techniques. Each of these three policies has its unique features. FCFS and SRPT both prioritize transfers, the former according to arrival times and the latter according to transfer volumes, and so obtain a meager fairness score [51]. SRPT has been extensively used for minimizing flow completion times within datacenters [52]–[54]. Strictly prioritizing transfers over forwarding trees (as done by SRPT and FCFS), however, can lead to low overall link utilization and increased completion times, especially when trees are large. This might happen due to bandwidth contention on shared edges, which can prevent some transfers from making progress. Fair sharing allows all transfers to make progress, which mitigates such contention and enables concurrent multicast transfers to all make progress. In § V-C, we empirically compare the performance of these scheduling policies and show that fair sharing based on max-min fairness can significantly outperform both FCFS and SRPT in average network throughput, especially with a larger number of receivers per tree. As a result, we use QuickCast along with the fair sharing policy based on max-min fairness.

The traffic engineering server recomputes the transmission rates per multicast tree every timeslot to maximize utilization and cope with inaccurate inter-datacenter link capacity measurements, imprecise rate limiting, and packets dropped due to corruption. To account for inaccurate rate limiting, dropped packets, and link capacity estimation errors, which all can lead to a difference between the actual volume of data delivered and the number of bytes transmitted, we propose that senders keep track of the actual data delivered to their receivers per forwarding tree. At the end of every timeslot, every sender reports to the traffic engineering server how much data it was able to deliver, allowing it to compute rates accordingly for the timeslot that follows. Newly arriving transfers are assigned rates starting the next timeslot.
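As an illustration of the fair sharing policy applied per timeslot, the following is a small progressive-filling sketch of max-min fair rate computation over trees (an assumed realization, not the authors' implementation; class and method names are our own). Every tree consumes its rate on all of its edges, and a tree's rate is additionally capped by what it needs to finish its residual volume within one timeslot of length δ.

```java
import java.util.*;

/** Sketch of per-timeslot max-min fair rate allocation across forwarding trees. */
class RateAllocator {
    static final double EPS = 1e-9;

    double[] allocate(List<Set<Integer>> trees, double[] residual,
                      double[] capacity, double delta) {
        int n = trees.size();
        double[] rate = new double[n];
        boolean[] frozen = new boolean[n];
        double[] capLeft = capacity.clone();

        while (true) {
            List<Integer> active = new ArrayList<>();
            for (int i = 0; i < n; i++) if (!frozen[i]) active.add(i);
            if (active.isEmpty()) break;

            // largest uniform rate increment before an edge saturates or a tree finishes
            double inc = Double.MAX_VALUE;
            for (int e = 0; e < capacity.length; e++) {
                int users = 0;
                for (int i : active) if (trees.get(i).contains(e)) users++;
                if (users > 0) inc = Math.min(inc, capLeft[e] / users);
            }
            for (int i : active) inc = Math.min(inc, residual[i] / delta - rate[i]);

            // grow all active trees equally, then freeze the ones that hit a limit
            for (int i : active) {
                rate[i] += inc;
                for (int e : trees.get(i)) capLeft[e] -= inc;
            }
            for (int i : active) {
                boolean bottlenecked = false;
                for (int e : trees.get(i)) if (capLeft[e] <= EPS) bottlenecked = true;
                if (bottlenecked || rate[i] + EPS >= residual[i] / delta) frozen[i] = true;
            }
        }
        return rate;
    }
}
```

Each round raises all unfrozen trees by the same amount, so trees crossing saturated links stop growing while the remaining trees continue to share the leftover capacity, which is the max-min fair outcome.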
V. EVALUATION
We considered various topologies and transfer size distributions as shown in Tables II and III. Also, for Algorithm 2, unless otherwise stated, we used p_f = . which limits the overall bandwidth usage while offering significant gains. In the following subsections, we first evaluated a variety of weight assignments for multicast tree selection considering receiver completion times and bandwidth usage. We showed that the weight proposed in Algorithm 1 offers close to minimum completion times with minimal extra bandwidth consumption. Next, we evaluated the proposed partitioning technique and considered two cases of N_max = 2 (as used in [1]) and N_max = |D_R|. We measured the performance of QuickCast while varying the number of receivers and showed that it offers consistent gains. We also measured the speedup observed by different receivers ranked by their speed per multicast transfer, and the effect of the partitioning factor p_f on the gains in completion times as well as bandwidth usage. In addition, we evaluated the effect of different scheduling policies on average network throughput and showed that with an increasing number of multicast receivers, fair sharing offers higher throughput compared to both FCFS and SRPT. Finally, we showed that QuickCast is computationally fast by measuring its running time, and that the maximum number of group table forwarding entries it uses across all switches is only a fraction of what is usually available in a physical switch across the several considered scenarios.

TABLE II
VARIOUS TOPOLOGIES USED IN EVALUATION

ANS [55]: A medium-sized backbone and transit network that spans across the United States with nodes and links. All links have an equal capacity of 45 Mbps.
GEANT [56]: A large-sized backbone and transit network that spans across Europe with nodes and links. Link capacity ranges from 45 Mbps to 10 Gbps.
UNINETT [57]: A large-sized backbone that spans across Norway with nodes and links. Most links have a capacity of 1, 2.5 or 10 Gbps.

Network Topologies:
Table II shows the list of topologies we considered. These topologies provide capacity information for all links, which ranges from 45 Mbps to 10 Gbps. We normalized all link capacities by dividing them by the maximum link capacity. We also assumed bidirectional links with equal capacity in either direction.
Traffic Patterns:
Table III shows the considered distributions for transfer volumes. Transfer arrivals followed a Poisson distribution with rate λ. We considered no units for time or bandwidth. For all simulations, we assumed a timeslot length of δ = . For the Pareto distribution, we considered a minimum transfer volume equal to that of full timeslots and limited the maximum transfer volume to that of full timeslots. Unless otherwise stated, we considered an average demand equal to the volume of full timeslots per transfer for all traffic distributions (we fixed the mean values of all distributions to the same value). Per simulation instance, we assumed an equal number of transfers per sender, and for every transfer, we selected the receivers from all existing nodes according to the uniform distribution (with equal probability for all nodes).

Assumptions:
We focused on computing gains and assumed accurate knowledge of inter-datacenter link capacity and precise rate control at the end-points, which together lead to a congestion free network. We also assumed no dropped packets due to corruption or errors, and no link failures.
Simulation Setup:
We developed a simulator in Java (JDK 8). We performed all simulations on one machine (Core i7-6700 and 24 GB of RAM). We used the Java implementation of GreedyFLAC [58] for minimum weight Steiner trees.
TABLE III
TRANSFER SIZE DISTRIBUTIONS (PARAMETERS IN § V)

Light-tailed: Based on the Exponential distribution.
Heavy-tailed: Based on the Pareto distribution.
Facebook Cache-Follower: Generated by cache applications over the Facebook inter-datacenter WAN [59].
Facebook Hadoop: Generated by geo-distributed analytics over the Facebook inter-datacenter WAN [59].
A. Weight Assignment Techniques for Tree Selection
We empirically evaluate and analyze several weights for the selection of forwarding trees. Table IV lists the weight assignment approaches considered for tree selection (please see Table I for the definition of variables). We considered three edge weight metrics: utilization (i.e., the fraction of a link's bandwidth currently in use), load (i.e., the total volume of traffic that an edge will carry starting from the current time), and load plus the volume of the newly arriving transfer request. We also considered the weight of a tree to be either the weight of its edge with maximum weight or the sum of the weights of its edges. An exponential weight is used to approximate selection of trees with minimum highest weight, similar to the approach used in [5]. We evaluated these weights for transfers with multiple receivers at a fixed arrival rate λ, considering both light-tailed and heavy-tailed transfer volume distributions.

TABLE IV
VARIOUS WEIGHTS FOR TREE SELECTION FOR INCOMING REQUEST R (W_e, ∀e ∈ E) AND PROPERTIES OF SELECTED TREES

1: W_e = 1; a fixed minimum edge Steiner tree
2: W_e = exp(U_e); minimum highest utilization over edges
3: W_e = exp(L_e); minimum highest load over edges
4: W_e = U_e; minimum sum of utilization over edges
5: W_e = L_e; minimum sum of load over edges
6: W_e = L_e + V_R/C_e; minimum final sum of load over edges
7: W_e = 1 + exp(U_e)/Σ_{e∈E} exp(U_e); minimum edges, min-max utilization
8: W_e = 1 + exp(L_e)/Σ_{e∈E} exp(L_e); minimum edges, min-max load
9: W_e = 1 + U_e/Σ_{e∈E} U_e; minimum edges, min-sum of utilization
10: W_e = 1 + L_e/Σ_{e∈E} L_e; minimum edges, min-sum of load
Fig. 2. Evaluation of various weights for tree selection over the ANS and GEANT topologies with light-tailed and heavy-tailed transfer volumes. Cells report how far each technique is from the minimum observed value for mean receiver completion times, tail receiver completion times, and total bandwidth used (F, S, and M refer to the scheduling policies FCFS, SRPT, and Fair Sharing, respectively).

The exponential weights that select trees with minimum highest utilization or load correspond to objectives that have been extensively used for traffic engineering over WAN. As can be seen, they have the highest bandwidth usage compared to other techniques for almost all scenarios, while their completion times are also worse than the minimum for several scenarios. Some techniques fall noticeably short of the minimum in mean completion times, while others stay close to the minimum (for all performance metrics) across all scheduling policies, topologies, and traffic patterns; these techniques offer lower completion times for the GEANT topology with non-uniform link capacity. Technique 6, which is the weight used in Algorithm 1, provides consistently close to minimum values for all three metrics.

B. Receiver Set Partitioning
Receiver set partitioning allows separation of faster receivers from the slowest (or slower) ones. This is essential to improve network utilization and speed up transfers when there are competing transfers or physical bottlenecks. For example, both GEANT and UNINETT have edges that vary by at least a factor of × in capacity. We evaluate QuickCast over a variety of scenarios.
1) Effect of Number of Receivers:
We provide an overall comparison of several schemes (QuickCast, Single Load-Aware Steiner Tree, and DCCast [60]) along with two basic solutions of using a minimum edge Steiner tree and unicast minimum hop path routing, as shown in Figure 3. We also considered both light and heavy load regimes. We used real inter-datacenter traffic patterns reported by Facebook for two applications, Cache-Follower and Hadoop [59]. Also, all schemes use fair sharing rate allocation based on max-min fairness except DCCast, which uses the FCFS policy. The minimum edge Steiner tree leads to the minimum bandwidth consumption. The unicast minimum hop path routing approach separates all receivers per bulk multicast transfer. It, however, uses a significantly larger volume of traffic and also does not offer the best mean completion times for the following reasons. First, it exhausts network capacity quickly, which increases tail completion times by a significant factor (not shown here). Second, it can lead to many additional shared links that increase contention across flows and reduce throughput per receiver. The significant increase in completion times of higher percentiles increases the average completion times of the unicast approach. With N_max = |D_R|, we see that QuickCast offers the best mean and median completion times, i.e., up to . × less compared to QuickCast with N_max = 2, up to . × less compared to unicast minimum hop routing, and up to . × less than single load-aware Steiner tree. To achieve this gain, QuickCast with N_max = |D_R| uses at most . × more bandwidth compared to using minimum edge Steiner trees, which is still . × less than the bandwidth usage of unicast minimum hop routing. We also see that while increasing the number of receivers, QuickCast with N_max = |D_R| offers consistently small median completion times by separating fast and slow receivers, since the number of partitions is not limited. Overall, we see a higher gain under light load as there is more capacity available to utilize. We also recognize that QuickCast with either N_max = 2 or N_max = |D_R| performs almost always better than unicast minimum hop routing in mean completion times.
2) Speedup by Receiver Rank:
Figure 4 shows how QuickCast can speed up multiple receivers per transfer by separating them from the slower receivers. The gains are normalized by the case where a single partition is used per bulk multicast transfer.

Fig. 3. Various schemes for bulk multicast transfers over the GEANT topology under (a) light load (λ = .) and (b) heavy load (λ = ), using the Cache-Follower and Hadoop traffic patterns in Table III. Panels report mean completion times, median completion times, and total bandwidth, normalized by the minimum (lower is better). Schemes compared: QuickCast (N_max = |D_R|), QuickCast (N_max = 2), Single Load-Aware Steiner Tree, Unicast Min-Hop Paths, Min-Edge Steiner Tree, and DCCast; all use max-min fair rates except DCCast, which uses FCFS.

In case the number of partitions is limited to two, similar to [1], the highest gain is usually obtained by the first two to three receivers, while allowing more partitions we can get considerably higher gains for a significant fraction of receivers. Also, by not limiting the number of partitions to two, we see higher gains for all receiver ranks, above × for multiple receiver ranks. This comes at the cost of the higher bandwidth consumption which we saw in the previous experiment.
3) Partitioning Factor (p_f): The performance of QuickCast as a function of the partitioning factor is shown in Figure 5, where gains are normalized by the single load-aware Steiner tree, which uses a single partition per bulk multicast transfer. We computed per receiver mean and 95th percentile completion times as well as bandwidth usage. As can be seen, bandwidth consumption increases with the partitioning factor as more requests' receivers are partitioned into two or more groups. The gains in completion times keep increasing with p_f if N_max is not limited. That, however, can ultimately lead to unicast delivery to all receivers (every receiver as a separate partition) and excessive bandwidth usage. We see a diminishing return type of curve as p_f is increased, with the highest returns coming when we increase p_f from 1 to 1.1 (marked with a green dashed line). That is because using too many partitions can saturate network capacity while not improving the separation of fast and slow nodes considerably. At p_f = . , we see up to 10% additional bandwidth usage compared to the single load-aware Steiner tree while mean completion times improve by between 40% to 50%. According to other experiments not shown here, with large p_f it is possible even to see reductions in gain that come from excessive bandwidth consumption and increased contention over capacity. Note that this experiment was performed considering four receivers per bulk multicast transfer. Using more receivers can lead to more bandwidth usage with the same p_f, an increased slope at values of p_f close to 1, and faster saturation of network capacity as we increase p_f. Therefore, using a smaller p_f is preferred with more receivers per transfer.
Fig. 4. Mean receiver completion time speedup (larger is better) of receivers compared to single load-aware Steiner tree (Algorithm 1) by their rank (receivers sorted by their speed from fastest to slowest per transfer). Panels: GEANT and UNINETT with 4 and 16 receivers per transfer (x-axis: receiver rank; y-axis: mean speedup). Receivers were selected according to the uniform distribution from all nodes, and we considered λ = . Curves compare QuickCast with N_max = |D_R| and N_max = 2 under light-tailed and heavy-tailed traffic against the single load-aware Steiner tree baseline.

Fig. 5. Performance of QuickCast as a function of the partitioning factor p_f over GEANT and UNINETT. Panels report mean speedup, 95th percentile speedup, and total bandwidth divided by that of the single load-aware Steiner tree, for N_max = |D_R| and N_max = 2 under light-tailed and heavy-tailed traffic. We assumed 4 receivers per transfer and an arrival rate of λ = .
Fig. 6. Average throughput of bulk multicast transfers under different scheduling policies (Fair Sharing based on max-min fairness, FCFS, and SRPT) over the ANS topology with light-tailed and heavy-tailed traffic. We started 100 transfers at time zero; senders and receivers were selected according to the uniform distribution. Each group of bars is normalized by the minimum in that group.
C. Effect of Rate Allocation Policies
As explained earlier in § IV-C, when scheduling traffic over large forwarding trees, fair sharing can sometimes offer significantly higher throughput and hence better completion times. We performed an experiment over the ANS topology with both light-tailed and heavy-tailed traffic distributions. The ANS topology has uniform link capacity across all edges, which helps us rule out the effect of capacity variations on the throughput obtained via different scheduling policies. We also considered an increasing number of receivers, from 4 to 8 and 16. Figure 6 shows the results. We see that fair sharing offers a higher average throughput across all ongoing transfers compared to FCFS and SRPT, and that with more receivers, the benefit of using fair sharing increases, up to . × with 16 receivers per transfer.

D. Running Time
To ensure the scalability of the proposed algorithms, we measured the running time of our algorithms over various topologies (with different sizes) and with varying rates of arrival. We assumed two arrival rates of λ = . and λ = , which account for light and heavy load regimes. We also considered eight receivers per transfer and all three topologies of ANS, GEANT, and UNINETT. We saw that the running times of Algorithms 1 and 2 remained below one millisecond and 20 milliseconds, respectively, across all of these scenarios. These numbers are less than the propagation latency between the majority of senders and receivers over the considered topologies (a simple TCP handshake would take at least twice the propagation latency). More efficient realization of these algorithms can further reduce their running time (e.g., implementation in C/C++ instead of Java).

E. Forwarding Plane Resource Usage
QuickCast can be realized using software-defined networking and OpenFlow compatible switches. To forward packets to multiple outgoing ports on switches where trees branch out to numerous edges, we can use group tables, which have been supported by OpenFlow since early versions. Besides, an increasing number of physical switch makers have added support for group tables. To allow forwarding to multiple outgoing ports, the group table entries should be of type "ALL", i.e.,
OFPGT_ALL in the OpenFlow specifications. Group tables are highly scarce (compared to TCAM entries) and so should be used with care. Some new switches support 512 or 1024 entries per switch. Another critical parameter is the maximum number of action buckets per entry, which primarily determines the maximum possible branching degree for trees. Across the switches we looked at, we found that the minimum supported value was 8 action buckets, which should be enough for WAN topologies as most of them do not have any nodes with this connectivity degree. In general, reasoning about the number of group table entries needed to realize different schemes is hard since it depends on how the trees are formed, which is highly intertwined with edge weights that depend on the distribution of load. For example, consider a complete binary tree with 8 receivers as leaves and the sender at the root. This will require 6 group table entries to transmit to all receivers, with two action buckets per intermediate node on the tree (branching at the sender does not need a group table entry). If instead we used an intermediate node to connect to all receivers with a branching degree of 8, we would only need one group table entry with eight action buckets. We measured the number of group table entries needed to realize QuickCast. We computed the average of the maximum, and the maximum of the maximum, number of entries used per switch during the simulation for the topologies of ANS, GEANT, and UNINETT, with arrival rates of λ = . and λ = , considering both light-tailed and heavy-tailed traffic patterns and assuming that each bulk multicast transfer had eight receivers. The experiment was terminated when 200 transfers had arrived. Looking at the maximum helps us see whether there are enough entries at all times to handle all concurrent transfers. Interestingly, we saw that by using multiple trees per transfer, both the average and maximum of the maximum number of group table entries used were less than when a single tree was used per transfer. One reason is that using a single tree slows down faster receivers, which may lead to more concurrent receivers that increase the number of group entries. Also, by partitioning receivers, we make subsequent trees smaller and allow them to branch out closer to their receivers, which balances the use of group table entries across the switches, reducing the maximum. Finally, by using more partitions, the maximum number of times a tree needs to branch to reach all of its receivers decreases. Across all the scenarios considered above, the maximum of the maximum group table entries at any timeslot was 123, and the average of the maximum was at most 68 for QuickCast. Furthermore, by setting N_max = |D_R|, which allows for more partitions, the maximum of the maximum group table entries decreased by up to 17% across all scenarios.
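The counting rule described above (one ALL-type group entry per branching node other than the sender) can be sketched as follows; this is an illustrative helper, not part of QuickCast, and the class and method names are our own.

```java
import java.util.*;

/** Sketch: count the OpenFlow group table entries (type OFPGT_ALL) a forwarding
 *  tree needs: one entry per node that forwards to two or more outgoing edges,
 *  excluding the sender itself. */
class GroupTableCounter {
    // tree: adjacency of the forwarding tree (node -> list of next-hop nodes)
    int entriesNeeded(Map<Integer, List<Integer>> tree, int sender) {
        int entries = 0;
        for (Map.Entry<Integer, List<Integer>> node : tree.entrySet()) {
            if (node.getKey() == sender) continue;       // branching at the sender is free
            if (node.getValue().size() >= 2) entries++;  // one ALL group, one bucket per branch
        }
        return entries;
    }
}
```

Applied to the two examples in the text, the complete binary tree over 8 receivers has 6 non-sender branching nodes (6 entries with 2 buckets each), while a single intermediate node with branching degree 8 needs only 1 entry with 8 buckets.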
VI. CONCLUSIONS AND FUTURE WORK

A variety of applications running across datacenters replicate content between geographically dispersed sites for increased availability and reliability. Such data replication generates bulk multicast transfers with an a priori known sender and set of receivers per transfer, which can be efficiently performed using multicast forwarding trees. We introduced the bulk multicast routing and scheduling problem with the objective of minimizing mean completion times of receivers and decomposed it into three sub-problems of receiver set partitioning, tree selection, and rate allocation. We then proposed QuickCast, which applies three heuristic techniques to offer approximate solutions to these three hard sub-problems. For future research, we will consider parallel trees to increase throughput, which is especially helpful under light traffic load. Also, the application of BIER, which allows dynamic and low-cost updates to the forwarding trees, opens up new research opportunities.
REFERENCES

[1] M. Noormohammadpour, C. S. Raghavendra, S. Kandula, and S. Rao, "QuickCast: Fast and Efficient Inter-Datacenter Transfers using Forwarding Tree Cohorts,"
INFOCOM , 2018.[2] S. Jain, A. Kumar, S. Manda et al. , “B4: Experience with a globally-deployed software defined wan,”
SIGCOMM , vol. 43, no. 4, pp. 3–14,2013.[3] “Building express backbone: Facebooks new long-haul network,”https://code.facebook.com/posts/1782709872057497/building-express-backbone-facebook-s-new-long-haul-network/, visited on September30, 2017.[4] “How microsoft builds its fast and reliable global network,”https://azure.microsoft.com/en-us/blog/how-microsoft-builds-its-fast-and-reliable-global-network/, visited on September 30, 2017.[5] S. Kandula, I. Menache, R. Schwartz, and S. R. Babbula, “Calendaringfor wide area networks,”
SIGCOMM , vol. 44, no. 4, pp. 515–526, 2015.[6] Y. Zhang, J. Jiang, K. Xu et al. , “BDS: A Centralized Near-optimalOverlay Network for Inter-datacenter Data Replication,” in
EuroSys ,2018, pp. 10:1–10:14.[7] C. Tang, T. Kooburat, P. Venkatachalam et al. , “Holistic ConfigurationManagement at Facebook,” in
SOSP , 2015, pp. 328–343.[8] K. Florance. (2016) How netflix works with isps around theglobe to deliver a great viewing experience. [Online]. Avail-able: https://media.netflix.com/en/company-blog/how-netflix-works-with-isps-around-the-globe-to-deliver-a-great-viewing-experience[9] C.-Y. Hong, S. Kandula, R. Mahajan et al. , “Achieving high utilizationwith software-driven wan,” in
SIGCOMM . ACM, 2013, pp. 15–26.[10] H. Zhang, K. Chen, W. Bai et al. , “Guaranteeing deadlines for inter-datacenter transfers,” in
EuroSys . ACM, 2015, p. 20.[11] X. Jin, Y. Li, D. Wei, S. Li, J. Gao, L. Xu, G. Li, W. Xu, andJ. Rexford, “Optimizing bulk transfers with software-defined opticalwan,” in
SIGCOMM . ACM, 2016, pp. 87–100. [12] M. Cotton, L. Vegoda, and D. Meyer, “IANA guidelines for IPv4multicast address assignments,” Internet Requests for Comments, pp.1–10, 2010. [Online]. Available: https://tools.ietf.org/html/rfc5771[13] S. Banerjee, B. Bhattacharjee, and C. Kommareddy, “Scalable applica-tion layer multicast,” in SIGCOMM . ACM, 2002, pp. 205–217.[14] M. Kodialam, T. V. Lakshman, and S. Sengupta, “Online multicastrouting with bandwidth guarantees: a new approach using multicastnetwork flow,”
IEEE/ACM Transactions on Networking , vol. 11, no. 4,pp. 676–686, Aug 2003.[15] L. H. Huang, H. C. Hsu, S. H. Shen et al. , “Multicast traffic engineeringfor software-defined networks,” in
INFOCOM . IEEE, 2016, pp. 1–9.[16] A. Iyer, P. Kumar, and V. Mann, “Avalanche: Data center multicast usingsoftware defined networking,” in
COMSNETS . IEEE, 2014, pp. 1–8.[17] J. Cao, C. Guo, G. Lu et al. , “Datacast: A scalable and efficient reliablegroup data delivery service for data centers,”
IEEE Journal on SelectedAreas in Communications , vol. 31, no. 12, pp. 2632–2645, 2013.[18] S. H. Shen, L. H. Huang, D. N. Yang, and W. T. Chen, “Reliablemulticast routing for software-defined networks,” in
INFOCOM , April2015, pp. 181–189.[19] A. Nagata, Y. Tsukiji, and M. Tsuru, “Delivering a file by multipath-multicast on openflow networks,” in
International Conference on Intel-ligent Networking and Collaborative Systems , 2013, pp. 835–840.[20] K. Ogawa, T. Iwamoto, and M. Tsuru, “One-to-many file transfers usingmultipath-multicast with coding at source,” in
HPCC , 2016, pp. 687–694.[21] D. Bertsekas and R. Gallager, “Data networks,” 1987.[22] S. Liang and D. Cheriton, “TCP-SMO: extending TCP to supportmedium-scale multicast applications,” in
INFOCOM , vol. 3, 2002, pp.1356–1365.[23] B. Adamson, C. Bormann, M. Handley, and J. Macker, “Nack-orientedreliable multicast (norm) transport protocol,” 2009.[24] T. Zhu, F. Wang, Y. Hua, D. Feng et al. , “Mctcp: Congestion-awareand robust multicast tcp in software-defined networks,” in
InternationalSymposium on Quality of Service , June 2016, pp. 1–10.[25] D. Li, M. Xu, M. c. Zhao, C. Guo et al. , “RDCM: Reliable data centermulticast,” in
INFOCOM , 2011, pp. 56–60.[26] A. Rodriguez, D. Kostic, and A. Vahdat, “Scalability in adaptive multi-metric overlays,” in
International Conference on Distributed ComputingSystems , 2004, pp. 112–121.[27] K. Nagaraj, H. Khandelwal, C. Killian, and R. R. Kompella, “Hierarchy-aware distributed overlays in data centers using dc2,” in
COMSNETS .IEEE, 2012, pp. 1–10.[28] M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron,and A. Singh, “Splitstream: High-bandwidth multicast in cooperativeenvironments,” in
SOSP . ACM, 2003, pp. 298–313.[29] K. Jeacle and J. Crowcroft, “Tcp-xm: unicast-enabled reliable multicast,”in
ICCCN , 2005, pp. 145–150.[30] L. H. Lehman, S. J. Garland, and D. L. Tennenhouse, “Active reliablemulticast,” in
INFOCOM , vol. 2, Mar 1998, pp. 581–589 vol.2.[31] C. Gkantsidis, J. Miller, and P. Rodriguez, “Comprehensive view of alive network coding p2p system,” in
IMC . ACM, 2006, pp. 177–188.[32] A. Shokrollahi, “Raptor codes,”
IEEE Transactions on InformationTheory , vol. 52, no. 6, pp. 2551–2567, 2006.[33] J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege, “A digitalfountain approach to reliable distribution of bulk data,” in
SIGCOMM .ACM, 1998, pp. 56–67.[34] L. Rizzo, “Pgmcc: A tcp-friendly single-rate multicast congestion con-trol scheme,” in
SIGCOMM , 2000.[35] C. A. C. Marcondes, T. P. C. Santos, A. P. Godoy, C. C. Viel, andC. A. C. Teixeira, “Castflow: Clean-slate multicast approach using in-advance path processing in programmable networks,” in
IEEE Sympo-sium on Computers and Communications , 2012, pp. 94–101.[36] J. Ge, H. Shen, E. Yuepeng et al. , “An openflow-based dynamic pathadjustment algorithm for multicast spanning trees,” in
IEEE TrustCom ,2013, pp. 1478–1483.[37] I. Wijnands, E. C. Rosen, A. Dolganow, T. Przygienda, and S. Aldrin,“Multicast Using Bit Index Explicit Replication (BIER),” RFC 8279,Nov. 2017. [Online]. Available: https://rfc-editor.org/rfc/rfc8279.txt[38] M. Hefeeda, A. Habib, B. Botev et al. , “Promise: Peer-to-peer mediastreaming using collectcast,” in
MULTIMEDIA . ACM, 2003, pp. 45–54.[39] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips, “The bittorrent p2pfile-sharing system: Measurements and analysis,” in
Proceedings of the4th International Conference on Peer-to-Peer Systems , ser. IPTPS’05.Berlin, Heidelberg: Springer-Verlag, 2005, pp. 205–216.[40] R. Sherwood, R. Braud, and B. Bhattacharjee, “Slurpie: a cooperativebulk data transfer protocol,” in
INFOCOM , vol. 2, 2004, pp. 941–951. [41] N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez, “Inter-datacenterbulk transfers with netstitcher,” in
SIGCOMM . ACM, 2011, pp. 74–85.[42] S. Su, Y. Wang, S. Jiang, K. Shuang, and P. Xu, “Efficient algorithmsfor scheduling multiple bulk data transfers in inter-datacenter networks,”
International Journal of Communication Systems , vol. 27, no. 12, 2014.[43] N. Laoutaris, G. Smaragdakis, R. Stanojevic, P. Rodriguez, and R. Sun-daram, “Delay-tolerant bulk data transfers on the internet,”
IEEE/ACMTON , vol. 21, no. 6, 2013.[44] Y. Wang, S. Su et al. , “Multiple bulk data transfers scheduling amongdatacenters,”
Computer Networks , vol. 68, pp. 123–137, 2014.[45] Y. Zhang, J. Jiang, K. Xu et al. , “BDS: A Centralized Near-optimalOverlay Network for Inter-datacenter Data Replication,” in
EuroSys , ser.EuroSys ’18, 2018, pp. 10:1–10:14.[46] M. Noormohammadpour and C. S. Raghavendra, “DDCCast: MeetingPoint to Multipoint Transfer Deadlines Across Datacenters using ALAPScheduling Policy,” arXiv preprint arXiv:1707.02027 , 2017.[47] S. Ji, S. Liu, and B. Li, “Deadline-Aware Scheduling and Routingfor Inter-Datacenter Multicast Transfers,” in , 2018, pp. 124–133.[48] L. Luo, H. Yu, and Z. Ye, “Deadline-guaranteed Point-to-MultipointBulk Transfers in Inter-Datacenter Networks,”
ICC , 2018.[49] D. Watel and M.-A. Weisser,
A Practical Greedy Approximation forthe Directed Steiner Tree Problem . Cham: Springer InternationalPublishing, 2014, pp. 200–215.[50] L. Rokach and O. Maimon,
Clustering Methods . Springer US, 2005,pp. 321–352.[51] T. Lan, D. Kao, M. Chiang, and A. Sabharwal, “An Axiomatic Theoryof Fairness in Network Resource Allocation,” in , 2010, pp. 1–9.[52] M. Alizadeh, S. Yang, M. Sharif et al. , “pFabric: Minimal Near-optimalDatacenter Transport,”
SIGCOMM Comput. Commun. Rev. , vol. 43,no. 4, pp. 435–446, August 2013.[53] W. Bai, L. Chen, K. Chen et al. , “PIAS: Practical information-agnosticflow scheduling for data center networks,”
Proceedings of the 13th ACMWorkshop on Hot Topics in Networks , p. 25, 2014.[54] Y. Lu, G. Chen, L. Luo et al. , “One more queue is enough: Minimizingflow completion time with explicit priority notification,”
INFOCOM et al. , “Inside the Social Network’s (Data-center) Network,” in
SIGCOMM . ACM, 2015, pp. 123–137.[60] M. Noormohammadpour, C. S. Raghavendra, S. Rao et al. , “DCCast: Ef-ficient Point to Multipoint Transfers Across Datacenters,” in