Providing In-network Support to Coflow Scheduling

Cristian Hernandez Benet∗, Andreas J. Kassler∗, Gianni Antichi†, Theophilus A. Benson‡ and Gergely Pongracz§
∗Karlstad University, †Queen Mary University, ‡Brown University, §Ericsson Research
∗{cristian.hernandez-benet, andreas.kassler}@kau.se, †[email protected], ‡[email protected], §[email protected]
Abstract—Many emerging distributed applications, including big data analytics, generate a number of flows that concurrently transport data across data center networks. To improve their performance, it is necessary to account for the behavior of a collection of flows, i.e., coflows, rather than individual flows. State-of-the-art solutions achieve near-optimal completion time by continuously reordering the unfinished coflows at the end-host, using network priorities. This paper shows that dynamically changing flow priorities at the end host, without taking into account in-flight packets, can cause high degrees of packet reordering, thus imposing pressure on the congestion control and potentially harming network performance in the presence of switches with shallow buffers. We present pCoflow, a new solution that integrates end-host based coflow ordering with in-network scheduling based on packet history. Our evaluation shows that pCoflow improves CCT upon state-of-the-art solutions by up to 34% for varying load.
Index Terms—Coflow, Datacenter Networks, P4, Dataplane Programming
I. INTRODUCTION
Emerging big data processing frameworks such as MapReduce [1], Spark [2] or Dryad [3] are based on a partition/aggregate programming model that allows them to distribute and parallelize processing across different machines. Such big data analytics frameworks are also becoming an important enabler for future mobile networks, with use cases ranging from incident detection at Network Operations Centers [4], traffic classification and network slicing [5] to IoT data processing [6]. A common property of such big data frameworks is that each processing stage cannot complete until all data have been transferred. As a consequence, the performance of these frameworks is a function of the behavior of the collection of flows used to transfer data in each stage [3], [7], i.e., coflows [8]. More formally, coflows are collections of flows of varying sizes with different communication endpoints. Most of the growing body of work on improving performance within data centers builds upon advances in data center load balancing techniques, e.g., Hula [9], and data center centric transports, e.g., DCTCP, which improve low-level details of the data center's network. By abstracting details about load balancing, routing, and transport, these emerging techniques can focus on the crucial aspects of the network which impact individual flow performance, i.e., controlling queuing, priorities, or rate-limits. However, existing approaches for queuing, priorities, and rate-limits within the data center do not provide the levels of dynamicity required to support recent coflow proposals, e.g., Sincronia [10].

In this work, we ask the following fundamental question: "Does the network provide sufficient primitives to faithfully support dynamic modification of coflow priorities and queues?". To answer this question, we use a large scale trace-driven simulation, which allows us to methodologically analyze a broad range of existing techniques and scenarios. Our initial observation is that current network primitives do not effectively support arbitrary modifications and changes to a coflow's queuing priorities. In particular, while the network provides the illusion that all packets in a flow can be atomically moved between queues, in practice, once a packet has been queued, it does not change queues dynamically, which leads to packets from a flow ending up in different queues. The end result of this phenomenon is that packets from a flow are spread across queues, resulting in a high degree of packet reordering when they arrive at the destination end-hosts, which reduces performance due to TCP's design. In particular, high degrees of packet reordering can in some cases cause the congestion window to shrink, with negative effects on performance.

In this paper, we argue that existing data center networks lack in-network support for dynamically changing coflow queuing priority: specifically, with existing network primitives, changing a flow's priority does not consider packets already traversing the network, thus causing inconsistencies between the in-flight and newly generated packets of the flow at switches with multiple priority queues. Motivated by these observations, we propose an in-network primitive, called pCoflow, which allows flows to temporarily maintain queue affinity until already enqueued packets are drained when flows are being reprioritized due to a coflow order update.
A key challenge in preserving flow affinity under re-prioritization lies in maintaining network state and using this state to dynamically override packet priorities and alter queuing behavior. Our work builds on emerging techniques for programmable data planes [11], [12] and uses them to maintain minimal per-coflow state, i.e., optimized packet histories, and to dynamically manage flow priorities and queue assignments based on this state. To demonstrate the strength of our approach, we propose the design (§III) of a system that integrates state-of-the-art ordering mechanisms at the end-host, such as Sincronia, with in-network scheduling based on packet history (i.e., pCoflow). We then show that the latter can be implemented in P4 starting from the PIFO abstraction [12]. Finally, we demonstrate that our approach reduces the average CCT by 15% up to 18% for 10% load and by 27% up to 34% for 90% load with respect to state-of-the-art solutions. In summary, the contributions of this paper are:
• We make the case for in-network support in the context of coflow scheduling.
• We propose pCoflow, an architecture that integrates state-of-the-art techniques performing ordering at the end-host with advanced in-network packet scheduling.
• We design a coflow aware in-network priority scheduler that avoids reordering and can be implemented using the PIFO abstraction [12].

II. MOTIVATION
Recently, there has been a tremendous effort on network designs for coflows [13], [14], [15], [16], [17], [18]. Some of these works advocate a distributed approach [17], [18], where coflows are scheduled and managed locally at the end hosts; others propose a centralized scheme [13], [14], [15], [16], where a single entity with global knowledge is in charge of managing coflows. Centralized solutions that rely on global knowledge have proved to guarantee better performance. However, the need to centrally calculate complex per-flow rate allocations has hindered the possibility of realizing them in practice [10]. Recently, Sincronia [10] has demonstrated that the key ingredient to obtain near-optimal performance is convenient coflow prioritization. Specifically, if the right ordering of coflows is given, any per-flow rate allocation mechanism leads to results provably close to the optimum, as long as (co)flow scheduling preserves the order. In other words, if coflow C_i is ordered higher than coflow C_j, all flows and packets in C_i must be prioritized over C_j. Such an important finding has opened up the possibility of a new network design where a central controller is just in charge of ordering the coflows, while leaving any per-flow prioritization to the end-hosts, thus striking the right balance between centralized and distributed schemes. In this regard, Sincronia assumes that individual flow scheduling and rate allocation is provided by a priority-enabled transport layer at the end-host, obeying a centrally managed coflow ordering controller [10]. In practice, given a coflow ordering, the end-host will continuously (re)assign the priority of a flow coherently with the coflow it belongs to, using, for example, DiffServ markings.

The need for in-network support.
Decoupling scheduling decisions, centrally managed, from the flow-rate allocation problem, controlled at the end-hosts through a priority-enabled transport layer, might generate an excessive amount of out-of-order packets, thus affecting network performance. To illustrate this problem, we used the NS2 simulator. We assumed a non-blocking big-switch network topology [19], [14], [10] and used Data Center TCP (DCTCP) [20] as state-of-the-art congestion control for data center networks. We generated traffic according to the Sincronia workload generator [10], which is based on Facebook traffic characteristics, and used an increasing number of coflows from 20 to 200. We let Sincronia calculate the coflow ordering and enforced the corresponding flow priority using DSCP marking. Furthermore, to properly enforce the correct ordering, we enhanced the big-switch abstraction with eight queues per port. Figure 2 shows the total number of duplicate ACK events obtained for different numbers of coflows. The reason lies in the Sincronia behavior that dynamically changes flow priorities, enforcing new policies from the end hosts without considering packets that are already traversing the network. Duplicated ACKs might trigger a flow to shrink its congestion window, directly impacting network performance. The effect on the CCT is shown in Figure 1. Here, we compare the CCT obtained with Sincronia against an optimal scenario where a change in coflow priority does not cause any packet reordering. The ideal case performs up to 1.5x better than Sincronia.
Fig. 1. Average coflow completion time.
Fig. 2. DupAcks for different numbers of coflows.
To better understand the cause of packet reordering, we illustrate in Figure 3 what happens when the Sincronia controller issues a change in coflow priority. Let us assume that at some point coflow 1 finishes and Sincronia increases the priority level of each remaining active coflow. This change affects new packets generated at the end hosts, while those already in flight are served with the old configuration, i.e., priority. This clearly creates packet reordering if packets of the same flow are still enqueued at a lower priority queue of a switch and newer packets with higher priority arrive, due to strict priority queuing. Reordering may trigger congestion control, which reduces the rate of the flow.
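To make the mechanism concrete, the following minimal Python sketch (our illustration, not the simulation code used in this paper) models two strict-priority queues and shows how a priority bump delivers a flow's newer packets before its older, still-enqueued ones:

```python
from collections import deque

HIGH, LOW = 0, 1
queues = {HIGH: deque(), LOW: deque()}

# Packets 1 and 2 of flow A arrive while its coflow is marked low priority;
# packets 3 and 4 arrive after a coflow order update raised its priority.
queues[LOW].extend([("flowA", 1), ("flowA", 2)])
queues[HIGH].extend([("flowA", 3), ("flowA", 4)])

def strict_priority_dequeue():
    # Always serve the highest-priority non-empty queue first.
    for prio in (HIGH, LOW):
        if queues[prio]:
            return queues[prio].popleft()
    return None

delivered = []
while any(queues.values()):
    delivered.append(strict_priority_dequeue())

# Newer packets overtake older ones: 3, 4, 1, 2 -> reordering at the receiver.
print([seq for _, seq in delivered])
```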
Fig. 3. Packet reordering after priority updates.

This simple example illustrates a wider problem: in real data center topologies, where multiple paths from source to destination are available, the amount of reordering as well as its effect on network performance can be even bigger. Indeed, to select the best path, the research community has shown the effectiveness of congestion-aware flowlet-based load-balancing approaches [21], [9]. Those schemes split flows into smaller flowlets, exploiting the burstiness of TCP. The idea is to route each flowlet over the least congested path. However, using Sincronia in combination with the mentioned solutions might trigger a reordering of not just a few packets, but entire flowlets.

III. DESIGN
Given the insights from the previous section, we ask the question: is it possible to minimize packet reordering due to priority changes by allowing switches to participate in scheduling decisions?
We answer this positively by describing pCoflow, a solution which provides in-network support for coflow scheduling. pCoflow integrates state-of-the-art ordering techniques at the end-host, e.g., Sincronia, with scheduling decisions taken in-network. The main idea is to leverage programmable switches to temporarily maintain queue affinity for newly arriving packets until already enqueued packets are drained when flows are being reprioritized due to a coflow order update. We show that this can be realized with the PIFO abstraction [12] and by taking into account the priorities of packets before and after an update, i.e., their history.
A. Design Objectives
Avoiding In-Network Reordering:
Coflow schedulers such as Sincronia ensure coflow isolation and preserve the order of coflows by delegating prioritization to a priority-enabled transport mechanism. When run together with TCP, this requires a prioritization of IP packets according to the coflow priority, which is typically implemented using a multi-level queuing system with strict priorities. Such systems schedule packets waiting at higher priority queues first. Ideally, this allows higher priority coflows to finish earlier, thus improving overall transmission time. However, the arrival, termination or changes in the remaining transmission time of coflows can lead to shifts in priority levels at end-hosts that might cause packet reordering, triggering congestion control and reducing the rate of the newly prioritized flow, achieving exactly the opposite effect. When using multi-level queuing systems, priority updates at end-hosts should therefore not result in packet reordering. We therefore aim to implement a single queue that manages coflow priorities during insert using packet histories that track the priorities of enqueued packets.
Avoiding Coflow Starvation:
Whether coflows are prioritized using per-flow rate allocation or transport layer priorities, coflow starvation must be avoided. This is necessary to ensure that large coflows do not starve short coflows. By leveraging network feedback in the form of Explicit Congestion Notification (ECN), congested network elements can signal to end-host transport layers such as TCP [22] or DCTCP [20] that congestion is building up, forcing them to scale back in rate. However, when using a single queue that manages different priorities, care must be taken in how ECN marking is applied.
B. pCoflow Design Overview

pCoflow uses state-of-the-art centralized coflow controllers such as Sincronia, which order coflows and derive their priority, and combines them with a transport layer that enforces priority scheduling (see Figure 4). The core component of pCoflow is a novel coflow aware strict-priority packet scheduler inside the data plane that is aware of the priority levels of coflow packets waiting in the queues.
Fig. 4. pCoflow Architecture Overview
C. End-host
End-hosts are responsible for marking packets with the corresponding coflow priority and sending packets over the transport protocol. We exploit a shim layer between the application and the transport layer which continuously orders the coflows using, e.g., information available from a centralized controller such as Sincronia [10]. The coflow order is translated by the end-host agent to DSCP values: the highest order coflow is mapped to the highest priority level, the second highest order coflow to the second highest priority level, and so on, while all remaining coflows are mapped to the lowest priority level. The shim layer also tags each packet with a unique coflowID, which is subsequently used by switches to avoid reordering when the coflow order is updated by the end-host shim layer. The coflowID can be provided in an extra header (e.g., using the GPE extension of VXLAN) or can be conveyed within the IP Identification field or TCP options.
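The agent's marking step can be sketched as follows in Python (an illustration under our own naming; the paper does not prescribe this API):

```python
NUM_PRIORITY_LEVELS = 8  # one per DSCP-selected queue/band, as in our setup

def coflow_priorities(ordered_coflow_ids):
    """Map coflow IDs, sorted from highest to lowest order, to priority levels.

    Priority 0 is the highest; coflows beyond the available levels all share
    the lowest level, mirroring the mapping described above.
    """
    return {cid: min(rank, NUM_PRIORITY_LEVELS - 1)
            for rank, cid in enumerate(ordered_coflow_ids)}

def mark_packet(packet, prio_map):
    # Tag the packet with its coflow's priority (carried, e.g., as a DSCP
    # value) alongside the coflowID tag switches use to avoid reordering.
    packet["dscp"] = prio_map[packet["coflow_id"]]
    return packet

order = ["c7", "c2", "c5"]           # ordering received from the controller
prio_map = coflow_priorities(order)  # {'c7': 0, 'c2': 1, 'c5': 2}
print(mark_packet({"coflow_id": "c2"}, prio_map))  # {'coflow_id': 'c2', 'dscp': 1}
```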
D. In-Network Coflow-aware Scheduler
The main objective of our coflow aware programmable scheduler is to maintain coflow priorities for the dequeue operation while avoiding reordering when coflow priorities are switched at end-hosts. Consequently, we need to maintain the relative scheduling order of buffered packets with respect to future packet arrivals, which is implemented during the push-in operation. The main idea of our approach is to assign coflows to a single queue as long as buffer space is available. In order to enable prioritization of coflows, we partition this single queue into multiple virtual priority bands, each one having a dedicated priority level and a certain buffer space. As long as there is space available at a given priority level, coflow packets matching that priority level can be inserted appropriately.

Packet reordering may only happen when the end-host increases the priority level of a given coflow. If there are still packets enqueued at the switch for the same flow at lower priority levels, the newly arriving packets are served first, leading to reordering. On the contrary, when new packets have a lower priority, there will be no reordering, as packets waiting at higher priority levels will be served first. In order to avoid packet reordering due to an end-host triggered coflow order update, we first identify the coflow priority by parsing the DSCP field in the packet header.

The insert operation has to ensure that packets with the highest priority are inserted at the first priority band of the queue, and packets with lower priority at lower priority bands. This makes sure that packets with higher priority are always served first, implementing a strict priority queuing policy. In order to avoid that a change in coflow priorities leads to packet reordering within a flow, we need to check which coflow a packet belongs to. If a Sincronia triggered reorder increases the priority of coflow C_j, a packet may arrive at a switch with higher priority while several packets of the same coflow C_j are still waiting for transmission at a lower priority level. In this case, we temporarily do not use the higher priority indicated by the end-host reordering but rather insert the newly arriving packet after the packets of the same coflow C_j. This avoids reordering and makes the transport layer transparent to priority changes. The drawback is a delayed response to priority changes in the switch.

Our packet scheduler needs to store (1) the bounds of the priority bands, and (2) for each coflow the lowest priority band that has packets waiting to be served (Figure 5). When a packet with priority p_i that belongs to coflow C_j reaches the switch, the scheduler checks the position of the last packet enqueued at priority level p_i and the position of the last packet enqueued for coflow C_j, and calculates the rank of the packet as in Equation 1.

rank = max(p_i, C_j) + 1   (1)
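A software model of Equation 1 is sketched below. The register names Priority and Coflow follow the paper; the dictionary layout and function signature are our simplification:

```python
# Priority[b]: queue position of the last packet currently in priority band b.
# Coflow[c]:   lowest-priority band that still holds packets of coflow c
#              (0 means no packets of c are enqueued).

def pifo_rank(pkt_band, coflow_id, Priority, Coflow):
    # Position of the last packet in the band matching the packet's marking.
    pos_prio = Priority[pkt_band]
    # Position of the last packet in the lowest-priority band that still
    # holds packets of the same coflow, if any; 0 otherwise.
    coflow_band = Coflow.get(coflow_id, 0)
    pos_coflow = Priority[coflow_band] if coflow_band else 0
    # Equation 1: insert behind whichever constraint sits further back.
    return max(pos_prio, pos_coflow) + 1

# Worked example from Figure 5: the packet is marked with priority 1 (band
# ends at position 2), but its coflow still has packets in band 2 (band ends
# at position 5), so the packet is inserted at rank max(2, 5) + 1 = 6.
Priority = {1: 2, 2: 5}
Coflow = {"coflow2": 2}
assert pifo_rank(1, "coflow2", Priority, Coflow) == 6
```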
pCoflow uses ECN to signal congestion to the end-host transport layer [23], [24]. When using a single queue with multiple priority bands, a single ECN marking threshold may lead to marking mostly lower priority packets, which may cause coflow starvation. To prevent this effect, pCoflow uses multiple ECN marking thresholds, one per priority band. If during enqueue we detect that the number of packets enqueued for a given priority p_i is larger than the ECN threshold for the given band, we set the ECN bit. Although the minimum and maximum marking thresholds of each priority level can be adjusted to react earlier to congestion and start marking packets before reaching the maximum threshold [25], congestion control algorithms can take several RTTs to adjust the sending rate after receiving an ECN notification. Therefore, queue sizes may temporarily exceed the defined ECN thresholds. Dropping packets may be necessary when using transport protocols that do not react to ECN or if we want to guarantee a certain buffer space for each priority. However, enforcing packet drop reduces queue elasticity. pCoflow enables adaptive queue sizes by dynamically allowing priority bands to grow and shrink. It integrates such dynamic resizing with ECN marking to signal end-points to reduce their rate.
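The per-band marking test can be sketched as follows (threshold values and counter layout are our assumptions; the paper fixes only that each band has its own threshold):

```python
ECN_THRESHOLD = {1: 200, 2: 200, 3: 200}  # packets allowed per priority band
enq_count = {1: 0, 2: 0, 3: 0}            # packets currently in each band

def enqueue_with_ecn(pkt, band):
    enq_count[band] += 1
    # Compare against this band's own occupancy, so a congested low-priority
    # band does not cause ECN back-off of higher-priority coflows.
    if enq_count[band] > ECN_THRESHOLD[band]:
        pkt["ecn_ce"] = True  # set the Congestion Experienced codepoint
    return pkt
```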
E. Implementation Feasibility

With today's switches, only a limited number of scheduling approaches is available whose parameters can be controlled by network operators. Programmable packet schedulers, however, allow operators to implement custom scheduling disciplines tuned to application requirements. Indeed, using the push-in first-out (PIFO) scheduling abstraction [12], where packets can be pushed into an arbitrary position but are always dequeued from the head, several scheduling approaches can be implemented on programmable data planes such as P4 [11]. To implement our approach, we leverage the PIFO abstraction [12], which allows packets to be pushed in at arbitrary positions, given by the packet rank, while being dequeued from the head. PIFO assumes that packet ranks increase monotonically within a flow, and dynamic reordering of already enqueued packets is not supported. Consequently, we need to consider three main issues: (i) extracting the coflow priority from the packet header; (ii) mapping packets to the correct priority bands and computing the rank; and (iii) updating the priority band bounds. For (i), we read the priority bits from the IP header ToS field. Figure 5 illustrates our approach.
Fig. 5. pCoflow Queue with coflow and priority tracking.
Mapping: We use register Priority to store the bound information for each priority band. Assuming p priority bands, we use a register to encode the end of priority band p_i. For each coflow, we track the lowest priority band that still has packets enqueued in the register array Coflow. If there is no packet enqueued for a coflow, we set the priority to 0. To calculate the PIFO rank at insert and avoid reordering, we first check Coflow to determine which lowest priority band has packets enqueued (e.g., 2 in the example). Then, we look up the position of the last packet in this band using Priority, which returns 5. We compare this with the rank of the last packet in the priority band that corresponds to the priority marked in the packet header (as the priority is one, the lookup returns 2). Equation 1 returns rank = max(2, 5) + 1 = 6, which is used for the PIFO insert operation.

Update: As in [10], we aim to map each coflow to the priority band given by its order if the current order of the coflow is less than p − 1, else we map it to band p. To avoid reordering when packets are enqueued at priority p_i and new packets for the same coflow arrive having higher priority p_{i−k}, we track which bands have packets enqueued for each coflow using one register array Enq_Packets for each priority band. We update Coflow as follows. On enqueue of a packet at the end of priority band p_i, we update the priority band bounds of all lower priority bands p_{i+k} (e.g., if p_i == 2 we update the bounds of bands 2, 3, 4, ...). We update Enq_Packets accordingly (e.g., indicating that priority band 2 now has 4 packets waiting for coflow 2). If there is no packet waiting to be transmitted in any queue (Coflow returned 0), we update Enq_Packets using the priority band corresponding to the packet priority. On dequeue, we update Enq_Packets for the priority band we dequeued from and the coflow id. If after the dequeue there are no more packets in the current priority band, we sweep the remaining lower priority bands to find the lowest priority band that still has packets waiting and update Coflow for the given coflow id. If there are no more packets waiting, we set Coflow to zero. Finally, we also update Priority for the priority band corresponding to the packet that has been dequeued and for all lower priority bands.

To track the ECN marking threshold, we use counters per priority band. If we detect that an insert leads to more packets than allowed according to the ECN threshold for a given band, we mark the ECN bit. In our example (Figure 5), if the ECN threshold is set to 2 packets, when the packet belonging to coflow 2 and priority 1 arrives at the switch, the counter associated with priority 1 returns 2, and the ECN bit is set.
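The bookkeeping above can be summarized in the following software model (a sketch with our own data layout; the switch version uses P4 register arrays and compile-time unrolled sweeps):

```python
NUM_BANDS = 3
# End position of each priority band in the queue (band 1 = highest priority).
Priority = {b: 0 for b in range(1, NUM_BANDS + 1)}
# Per coflow: lowest-priority band that still holds its packets (0 = none).
Coflow = {}
# Per band: number of enqueued packets per coflow (models Enq_Packets).
Enq_Packets = {b: {} for b in range(1, NUM_BANDS + 1)}

def on_enqueue(coflow_id, band):
    # The packet is appended at the end of its band, so the bounds of this
    # band and of all lower-priority bands shift back by one position.
    for b in range(band, NUM_BANDS + 1):
        Priority[b] += 1
    Enq_Packets[band][coflow_id] = Enq_Packets[band].get(coflow_id, 0) + 1
    Coflow[coflow_id] = max(Coflow.get(coflow_id, 0), band)

def on_dequeue(coflow_id, band):
    # Removing a packet shifts the bounds of this and all lower bands forward.
    for b in range(band, NUM_BANDS + 1):
        Priority[b] -= 1
    Enq_Packets[band][coflow_id] -= 1
    if Enq_Packets[band][coflow_id] == 0:
        del Enq_Packets[band][coflow_id]
    # Sweep the remaining lower-priority bands for packets of this coflow.
    remaining = [b for b in range(band, NUM_BANDS + 1)
                 if coflow_id in Enq_Packets[b]]
    Coflow[coflow_id] = max(remaining) if remaining else 0
```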
Remarks:
Note that pCoflow temporarily down-prioritizes all packets of a coflow if there are other packets of that coflow waiting at lower priorities, which may not always be necessary. However, a more fine-granular per-flow decision would require per-flow tracking, which may consume excessive switch resources. Note also that all sweeping operations through the multiple priority bands can be implemented as nested if-else statements, as the number of bands is determined at compile time. As pCoflow does not require maintaining per-flow state (just per priority band and coflow), the required state is reasonably small. Increasing the number of priority bands p leads to more fine granular prioritization but requires more switch resources. Note that the scheme cannot be implemented on state-of-the-art hardware such as Tofino because registers in the egress pipeline are not available in ingress; this could be solved by packet recirculation at the expense of higher complexity. PIFO, on the other hand, supports no more than around 1000 flows [12]. A variant of our scheme supporting fixed-size priority bands with limited reordering could be implemented using SP-PIFO [26].

IV. EVALUATION AND RESULTS
We implemented our scheme in the NS2 simulator and used coflow traces to compare our scheme against different configurations.
Topology:
We use a 3-tier Fat-tree topology with k=4. All links have a capacity of 40 Gbps, except the links connecting the 64 servers to the ToRs, which have a capacity of 10 Gbps. Each of the 8 ToRs has 8 servers connected that send coflows according to a given trace. Servers run a client application with the Sincronia shim layer that informs the Sincronia coordinator about the coflow information, such as coflow id, number of flows, and the sources and destination of each flow. It receives the coflow ordering and tags coflow priorities and coflow IDs.
Coflow Scheduler:
We use the online Sincronia algorithm from [10] to order coflows. We immediately recompute the order upon each coflow arrival and departure. As in [10], we map the coflow order to the Diffserv option and use 8 priority levels.
Workload:
We use [27] to create a coflow trace having the same characteristics as the Facebook trace from [10]. The trace contains 150 coflows, which are composed of 2086 total flows. In total, the intra-pod traffic was 32.8 GB and the inter-pod traffic 25.4 GB. We increase the workload by reducing inter-coflow arrival rates. As in [14], [28], we classify a coflow as short if its longest flow carries less than 5 MB, and as narrow if it has fewer than 50 flows, leading to four categories: Short and Narrow (SN), Long and Narrow (LN), Short and Wide (SW) and Long and Wide (LW).
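The categorization rule can be written compactly; the CoflowTrace record below is our own construct for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CoflowTrace:
    flow_sizes_mb: List[float]  # size of each constituent flow in MB

def categorize(c: CoflowTrace) -> str:
    short = max(c.flow_sizes_mb) < 5      # short: longest flow below 5 MB
    narrow = len(c.flow_sizes_mb) < 50    # narrow: fewer than 50 flows
    return {(True, True): "SN", (False, True): "LN",
            (True, False): "SW", (False, False): "LW"}[(short, narrow)]

print(categorize(CoflowTrace(flow_sizes_mb=[1.2, 0.4, 3.0])))  # -> SN
```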
Network Layer Load Balancing:
We compare Equal-Cost Multi-path (ECMP) and HULA [9]. HULA is a flowlet-based load-balancing scheme that forwards flowlets over the least congested path. The flowlet gap is set to 500 µs and the probing interval is set to 200 µs [9].

Transport Protocol and Queue:
We use DCTCP [29] with a standard retransmission time-out of 3 RTTs and an RTO of 200 µs as in [30]. As baseline (dsRED), we use 8 strict priority (SP) queues [31], each holding at most 500 packets. Each physical queue contains a single virtual RED queue (min_th = 200, max_th = 400) that starts marking ECN at min_th = 200 with a given probability, and the scheduler maps flows to queues according to the Diffserv field. When using pCoflow, the single queue has 8 priority bands of aggregated size. Each priority band starts marking packets at min_th = 200 per band.

Results for BigSwitch:
The first question we try to answer for pCoflow is whether it is better to drop packets once the maximum number of packets per priority band is exceeded, or to allow borrowing space from lower priority bands. Figure 8 compares pCoflow_Drop, which drops packets once the limit for a band is reached (500), with pCoflow_ECN, which adaptively adjusts the queue bands, when using Sincronia for priority ordering. Dropping packets once the threshold is exceeded avoids coflow starvation by not allowing packets from other priorities to take queue space reserved for other priorities. On the other hand, allowing coflows to temporarily exceed their reserved queue space enables flows to steadily reduce their sending rate. Although this decision may temporarily lead to coflow starvation, we note that coflows can only take more space in the queue whenever there is space left over from other coflows.

Fig. 6. Average CCT for BigSwitch.
Fig. 7. Average FCT for BigSwitch.
Fig. 8. pCoflow Queue ECN vs ECN-Dropping.
Fig. 9. Average CCT for fat-tree topology.
Fig. 10. Average FCT for fat-tree topology.
Fig. 11. Average CCT for coflows sorted by category for 90% load.

In Figure 8, we observe that by allowing coflows to temporarily exceed the priority band limit of 500 packets, we can reduce the overall CCT. This is due to DCTCP, which reacts upon the ECN marking. All remaining experiments for pCoflow are performed with adaptive queues and ECN marking. Figure 6 and Figure 7 show the average CCT and FCT for different combinations of load-balancing with and without Sincronia ordering. pCoflow improves both upon Sincronia and when not using Sincronia. When not using Sincronia, the benefits of pCoflow are attributed to the adaptive queue size. pCoflow improves upon multi-level dsRED queues when using Sincronia, since we avoid reordering and leverage the full capacity of the queue. At higher load, the benefits of pCoflow are more pronounced: the gap between our approach and the vanilla SP multi-level dsRED queue is in the range of 15-25%. This might stem from the fact that at high loads the traffic is more unstable, leading to more changes in priorities and therefore more reordering. However, as Figure 7 shows, changing the insertion order of packets can increase the tail latency of some flows and therefore increase the FCT compared to the multi-level dsRED queue at low loads. For network loads higher than about 70%, pCoflow also reduces FCT compared to the other approaches.
Fat-tree - Summary:
Figure 9 and Figure 10 show the average CCT and FCT for the Fat-tree topology. The lowest CCT is achieved by pCoflow when used with HULA (see Figure 9). When using Sincronia, the difference between ECMP and HULA is not significant, since HULA probe packets are used to identify the least congested paths; consequently, we map them to the highest priority queue or band. This can lead to situations where low priority packets are forwarded over a more congested path. Moreover, Facebook traffic is characterized by a one-to-many communication pattern, where a single node receives data from different nodes in the network. At low loads, the bottleneck is located mainly on the links between the ToR and the server, and therefore load-balancing plays a minor role. On the other hand, we can see that when we do not use Sincronia, the effect of load-balancing is more pronounced due to the congestion-aware load-balancing of HULA. When using Sincronia, pCoflow combined with HULA achieves a CCT reduction of up to 27% compared to fixed dsRED multi-level queuing with HULA. On the other hand, pCoflow can reduce CCT by 34% compared to dsRED multi-level queues when used with Sincronia and ECMP. Figure 11 analyzes the CCT for the 90% load case for each coflow category. As expected, long and wide (LW) coflows contribute the highest CCT. This is because they are the coflows that transport the largest data volume. In addition, Sincronia benefits small coflows, as they have a higher probability of being assigned to a higher priority band. Surprisingly, the load-balancing scheme plays a less important role for large flows than expected, which may be caused by HULA's unawareness of coflow properties. This leaves room for an integrated design with pCoflow.

V. RELATED WORK
There is extensive work related to scheduling coflows in data centers. Most of these schemes, including [10], [14], [13], [17], [16], rely on prior knowledge about coflows (e.g., flow sizes, server pairs). Coflow scheduling methods can be divided into two groups: distributed schedulers and centralized schedulers. Distributed schemes, including [17], [18], [32], are executed on each host, where coflows are scheduled and sorted locally. On the contrary, centralized schemes such as [13], [14], [15], [16] rely on a central controller to order coflows. Indeed, having a global view enables better scheduling decisions [33], but at the cost of a large control overhead, and therefore scalability is an issue. Aalo [28] uses priority queues to classify coflows according to the amount of data sent and does not need prior knowledge. Sincronia [10] overcomes the main problems of centralized schedulers by avoiding per-flow rate allocation. Sincronia achieves near-optimal average CCT and requires a transport layer prioritizing flows according to coflow orderings. While most of these works do not consider coflow routing, Rapier [13] and [34] demonstrate that combining scheduling and routing can lead to better performance.

VI. CONCLUSIONS AND FUTURE WORK
We presented pCoflow, an in-network support for coflow scheduling. Our work integrates state-of-the-art end-host coflow reordering approaches with in-network packet prioritization performed in the switch. Our approach uses the PIFO scheduling abstraction to build a coflow aware packet scheduler which considers packet history during priority scheduling. pCoflow thereby avoids excessive packet reordering that potentially leads to wasteful retransmissions. Our approach improves upon coflow completion time and benefits from flowlet-based load-balancing schemes such as HULA. As future work, we aim for an integrated design by defining extensions and proper interactions between pCoflow and flowlet-based load-balancing schemes such as HULA.

ACKNOWLEDGMENT
Parts of this work were supported by the Knowledge Foundation of Sweden through the Profile HITS under Grant No. 20140037.
REFERENCES
[1] J. Dean et al., "MapReduce: simplified data processing on large clusters," in Communications of the ACM, vol. 51, no. 1. ACM, 2008.
[2] M. Zaharia et al., "Spark: Cluster computing with working sets," in Hot Topics in Cloud Computing (HotCloud). USENIX Association, 2010.
[3] M. Isard et al., "Dryad: distributed data-parallel programs from sequential building blocks," in SIGOPS Operating Systems Review, vol. 41, no. 3. ACM, 2007.
[4] H. Kumar et al., "Machine Intelligence at the NOC," Ericsson Blog, 06 2018, [Online; accessed 21-April-2020].
[5] L. V. Le et al., "SDN/NFV, Machine Learning, and Big Data Driven Network Slicing for 5G," 11 2018.
[6] V. Nejkovic et al., "Big Data in 5G Distributed Applications," in High-Performance Modelling and Simulation for Big Data Applications: Selected Results of the COST Action IC1406 cHiPSet, J. Kołodziej et al., Eds. Cham: Springer International Publishing, 2019, pp. 138–162.
[7] M. Zaharia et al., "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Networked Systems Design and Implementation (NSDI). USENIX Association, 2012.
[8] M. Chowdhury et al., "Coflow: a networking abstraction for cluster applications," in Hot Topics in Networks (HotNets). ACM, 2012.
[9] N. Katta et al., "Hula: Scalable load balancing using programmable data planes," in Symposium on SDN Research (SOSR). ACM, 2016.
[10] S. Agarwal et al., "Sincronia: Near-optimal Network Design for Coflows," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2018.
[11] P. Bosshart et al., "P4: Programming Protocol-independent Packet Processors," in SIGCOMM Computer Communication Review, vol. 44, no. 3. New York, NY, USA: ACM, Jul. 2014, pp. 87–95.
[12] A. Sivaraman et al., "Programmable Packet Scheduling at Line Rate," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2016.
[13] Y. Zhao et al., "Rapier: Integrating routing and scheduling for coflow-aware data center networks," in International Conference on Computer Communications (INFOCOM). IEEE, 2015.
[14] M. Chowdhury et al., "Efficient coflow scheduling with Varys," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2014.
[15] Y. Li et al., "Efficient online coflow routing and scheduling," in Mobile Ad Hoc Networking and Computing (MobiHoc). ACM, 2016.
[16] M. Chowdhury et al., "Managing data transfers in computer clusters with orchestra," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2011.
[17] S. Luo et al., "Minimizing average coflow completion time with decentralized scheduling," in International Conference on Communications (ICC). IEEE, 2015.
[18] H. Susanto et al., "Stream: Decentralized opportunistic inter-coflow scheduling for datacenter networks," in International Conference on Network Protocols (ICNP). IEEE, 2016.
[19] M. Alizadeh et al., "pFabric: Minimal near-optimal datacenter transport," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2013.
[20] ——, "Data Center TCP (DCTCP)," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2010.
[21] ——, "CONGA: Distributed congestion-aware load balancing for datacenters," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2014.
[22] S. Floyd, "TCP and Explicit Congestion Notification," in Computer Communication Review, vol. 24, no. 5. ACM, 1994.
[23] H. Wu et al., "Tuning ECN for data center networks," in Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, 2012, pp. 25–36.
[24] W. Bai et al., "Enabling ECN in multi-service multi-queue data centers," in USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016, pp. 537–549.
[25] Y. Zhu et al., "Congestion Control for Large-Scale RDMA Deployments," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2015.
[26] A. Alcoz et al., "SP-PIFO: Approximating Push-In First-Out Behaviors using Strict-Priority Queues," in USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2020.
[27] "Coflow workload generator," https://github.com/sincronia-coflow, 2018.
[28] M. Chowdhury et al., "Efficient Coflow Scheduling Without Prior Knowledge," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2015.
[29] M. Alizadeh et al., "Data Center TCP (DCTCP)," in Proceedings of the ACM SIGCOMM 2010 Conference, ser. SIGCOMM '10. New York, NY, USA: ACM, 2010, pp. 63–74. [Online]. Available: http://doi.acm.org/10.1145/1851182.1851192
[30] A. G. Alcoz et al., "SP-PIFO: Approximating push-in first-out behaviors using strict-priority queues," in USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2020, pp. 59–76.
[31] M. A. Qadeer et al., "Differentiated services with multiple random early detection algorithm using ns2 simulator," in International Conference on Computer Science and Information Technology (ICCSIT). IEEE, 2009.
[32] F. R. Dogar et al., "Decentralized task-aware scheduling for data center networks," in Special Interest Group on Data Communication (SIGCOMM). ACM, 2014.
[33] S. Wang et al., "A survey of coflow scheduling schemes for data center networks," in Communications Magazine, vol. 56, no. 6. IEEE, 2018.
[34] H. Jahanjou et al., "Asymptotically optimal approximation algorithms for coflow scheduling."