T-RACKs: A Faster Recovery Mechanism for TCP in Data Center Networks
Ahmed M. Abdelmoniem and Brahim Bensaou
CS Department, Assiut University, Egypt
CSE Department, HKUST, Hong Kong
{amas,brahim}@cse.ust.hk
Abstract
Cloud interactive data-driven applications generate swarms of small TCP flows that compete for the small buffer space in data-center switches. Such applications require a short flow completion time (FCT) to perform their jobs effectively. However, TCP is oblivious to the composite nature of application data and artificially inflates the FCT of such flows by several orders of magnitude. This is due to TCP's Internet-centric design that fixes the retransmission timeout (RTO) to be at least hundreds of milliseconds. To better understand this problem, in this paper we use empirical measurements in a small testbed to study, at a microscopic level, the effects of various types of packet losses on TCP's performance. In particular, we single out packet losses that impact the tail end of small flows, as well as bursty losses that span a significant fraction of the small congestion window of TCP flows in data-centers, to show a non-negligible effect on the FCT. Based on this, we propose the so-called timely-retransmitted ACKs (or T-RACKs), a simple loss recovery mechanism to conceal the drawbacks of the long RTO even in the presence of heavy packet losses. Interestingly enough, T-RACKs achieves this transparently to TCP itself, as it does not require any change to TCP in the tenant's virtual machine (VM). T-RACKs can be implemented as a software shim layer in the hypervisor between the VMs and the server's NIC, or in hardware as a networking function in a SmartNIC. Simulation and real testbed results show that T-RACKs achieves remarkable performance improvements.
Index Terms
Data-center, Cross Layer, Fast Recovery, Kernel Module, TCP-Incast, Timeouts.
I. INTRODUCTION
The recent growth in data-center deployments worldwide is reshaping how the Internet and its applications operate. New, cloud-based, data-driven applications have emerged over the past years to harness the cost-effectiveness and scalability afforded by cloud computing. Most such applications rely on distributed programming and data storage frameworks such as Hadoop, HDFS, or Spark [62] for storing and processing large data sets. In such frameworks, master and aggregation nodes often require data transfers from hundreds of worker nodes to build a result. Due to the stringent timing requirements of interactive applications, a data transfer that misses a hard deadline because of excessive waiting for packet loss recovery returns a partial result (of lower quality). Hence, the quality of application results is correlated not only with the average latency but also with the tail of the latency distribution. For example, in practice, the high percentiles of the flow completion time (FCT) distribution can be anywhere between two and four orders of magnitude worse than the median or the average.

In small-scale private data-centers, CPU resources are often the bottleneck, and solutions that rely on task placement and scheduling already exist (e.g., [44]). In contrast, public data-centers seldom overload their server CPUs and usually have abundant computing resources; yet, they often adopt high over-subscription ratios in the network. As a result, network latency becomes the main performance bottleneck [52]. This is typical for many Internet-scale applications deployed on public (IaaS) clouds such as Microsoft Azure or Amazon EC2.

Measurements in real production data-centers [10, 34, 35, 49, 60] have shown over the years that applications producing small traffic flows predominate and that incast congestion events and excessive packet losses are frequent. To circumvent such problems, large corporations such as Microsoft, Facebook, and Google dedicate well-structured data-centers to deploying their time-sensitive applications. Smaller-scale private data-centers address the problem by deploying homogeneous custom-designed TCP variants (e.g., DCTCP [10]) on all the VMs in the data-center. In stark contrast, multi-tenant and public data-centers, where many-to-one (or many-to-many) communication patterns predominate, are populated with a large variety of TCP versions with different behaviors in the face of congestion [19, 25, 35]. As a direct consequence of this heterogeneous environment, unfairness in congestion resolution is inevitable and often leads to repeated packet losses and a long-tailed latency distribution for small flows. In particular, since commodity Ethernet switches are the backbone of all intra-data-center communications, their small buffer space can quickly be fully occupied by a few (large) TCP flows. This is because TCP has a natural tendency to fill up the bottleneck bandwidth of the communication path.

This raises two issues: i) small flows do not last long enough to capture their fair share of the buffer from ongoing large flows, as their TCP sending window cannot grow large enough before a packet loss is experienced; ii) when a sudden swarm of small flows (usually co-flows) surges while the buffer is occupied by other flows, incast congestion loss events become inevitable. In this case, a burst of packet losses from many such flows takes place.
Bursty losses with small congestion windows often leave an insufficient number of TCP segments in flight to trigger TCP's fast-retransmit and recovery mechanism. As a consequence, small flows often experience timeouts. The retransmission timeout (RTO), which is orders of magnitude larger than the actual round-trip time (RTT), thus contributes the lion's share to the long FCT and the number of missed deadlines experienced by small flows in data-centers.

In this paper, we study the impact of the RTO on the performance of TCP applications in data-centers and propose a simple mechanism to shield small TCP flows from the negative effects of the RTO, without changing TCP itself. Our methodology adopts a two-phased approach:
i) First, to fully understand the impact of the RTO on the FCT of small flows, we conduct an empirical study of the loss events in a small data-center, by examining the nature of the recovery mechanism invoked by TCP for each segment loss. To this end, we trace TCP traffic flows microscopically at the socket level in the Linux kernel. Then, by analyzing the collected traces, we study the frequency of occurrence of the two TCP loss recovery mechanisms (viz., RTO and Fast Retransmit and Recovery, or FRR) with respect to the TCP window size. We show that tail-end losses and bursty losses primarily cause RTOs, and while they have a less dramatic effect on the latency of large flows, their impact on the performance of small flows is tremendous.
ii) Second, to prevent RTOs from artificially inflating the actual loss recovery delays of small flows inside data-centers, and without modifying TCP, we propose, implement, and study the performance of a new mechanism to conceal the long retransmission timeout. This mechanism forces TCP in the VM to go into the FRR mode whenever a segment is estimated to be likely to experience a timeout, long before it does. We implement the resulting so-called T-RACKs in a real testbed and study its performance with realistic traces. An earlier version of this work was published in INFOCOM 2018 [4]. The implementation, simulation, and experimental code and scripts are publicly available at http://ahmedcs.github.io/T-RACKs.

In the remainder, supported by an empirical study, we show in Section III the dramatic impact of the RTO on the performance of small flows. In Section IV, we present the proposed methodology and system design. In Section V, we discuss the packet-level simulation results in detail. Then, in Section VI, we present the experimental results from the testbed deployment. We discuss important related work in Section II. Finally, we conclude the paper in Section VII.

II. RELATED WORK
Several works have found, via measurements and analysis, that TCP timeouts are the root cause of most throughput and latency problems in data-center networks [1, 5, 37, 39, 41, 42, 55, 59]. Other works [6, 22, 23, 30, 32, 38, 40, 53, 58, 63-65] analyzed the nature of incast events and packet drops in data-centers (a similar analysis was done for Content-Centric Networks (CCNs) [7, 8]). They also found that severe incast occurrences can lead to throughput collapse and longer FCT. They show in particular that throughput collapse and increased FCT are to be attributed to the timeout mechanism, which is ill-suited to data-centers. For example, [59] showed that frequent timeouts can harm the performance of latency-sensitive applications.

Numerous solutions have been proposed; these fall into one of four fundamental categories. The first mitigates the consequences of the long waiting times due to RTO by reducing the default RTO_min (the minimum RTO is 200 ms in Linux and 300 ms in Windows) to the 100 µs - 2 ms range [59]. While very useful, this approach affects the sending rates of TCP by forcing it to cut CWND to 1; it relies on a static RTO_min value, which can be ineffective in heterogeneous networks; and it imposes modifications to the TCP stack on the tenant's VM (notice that in public data-centers, under the IaaS model, the operating system and thus the protocol stack in the VM is under the full control of the tenant and cannot be modified by the cloud service provider). Our approach is fundamentally different in that it enforces the RTO_min via dupACKs, which allows Internet and data-center flows to be handled differently. Therefore, T-RACKs allows flows to have different RTO_min values, easily imposed via the flow tables.

The second approach aims at controlling queue build-up at the switches by relying on ECN marks to limit the sending rate of the servers [8, 10, 34], or by controlling the congestion window [32] or receiver window [2, 3, 60] of TCP flows. Similar approaches deployed a global traffic scheduler [12, 13, 15, 56, 57] or tracked fine-grained sub-microsecond updates in RTT to detect congestion [45]. All of these works achieved their goals and have shown that they can reduce the FCT of short flows while achieving high link utilization. However, they require modifications of the TCP stack or introduce a completely new switch design, are prone to fine-tuning of parameters, and sometimes require application-side information. They also increase the CPU utilization of the end-hosts. [45] is sensitive to traffic variations on the backward path. [21] is a new congestion control for inter-DC traffic based on characterizing the bandwidth and RTT of the bottleneck path; while effective for high bandwidth-delay product (BDP) paths, its minutes-level measurements and aggressive start can exacerbate the problems in the low-BDP networks of data-centers.

The third approach is to achieve efficient sharing of network resources or enforce flow admission control to reduce the timeout probability [18, 28, 50, 55]. [28] proposed ARS, a cross-layer system that dynamically adjusts the number of active TCP flows by batching application requests based on the congestion state sensed by the transport layer.

The last approach, which is adopted in this paper due to its simplicity and feasibility, is to recover losses using fast retransmit rather than waiting for a long timeout. For instance, TCP-PLATO [55] proposed changing the TCP state machine to tag specific packets using IP-DSCP bits; tagged packets are preferentially queued at the switch to reduce their drop probability, enabling dupACKs to be received and trigger FRR instead of waiting for the timeout. Even though TCP-PLATO is effective in reducing timeouts, its performance degrades whenever tagged packets are lost. In addition, the tagging may interfere with the operation of middle-boxes or other schemes, and most importantly, it modifies the TCP state machine of the sender and receiver.

Similar to DCTCP, DCQCN [9, 11] and HPCC [36] were proposed as end-to-end congestion control schemes implemented in custom NICs designed for RDMA over Converged Ethernet (RoCE). DCQCN and HPCC apply adaptive rate control at the link layer to throttle large flows, relying on Priority-based Flow Control (PFC) with RED-ECN marking, and on In-Network Telemetry (INT) information, respectively. DCQCN not only relies on PFC, which adds to the network overhead, but also introduces the extra cost of the explicit congestion notification packets exchanged between the end-points. HPCC requires programmable NICs and relies on the timely availability of the INT information, which not only increases the packet size by 42 bytes for each hop in the path but is also subject to congestion and contention with other traffic in the network.

More recent approaches have also identified the importance of the timeout problem in data-center environments [51, 61, 67]. For instance, the authors of [51] proposed injecting a high-priority packet after each window's worth of packets. They infer network congestion by checking the sequences of the received high/low-priority ACKs and consequently adjust the sending window to reduce buffer occupancy and detect losses early. However, this not only requires setting up priority queues (if available) in the switches but also imposes extra processing and communication overhead (especially since the congestion window typically consists of only a few packets in data-centers).
III. PROBLEM AND MOTIVATION
Before we start discussing our empirical study of TCP and presenting our solution, let us first shed some light on the motives that led us to adopt such a non-traditional approach by contrasting it against alternative methods. In particular, while our approach is straightforward, it turns out to be very effective because it stems from a full understanding of the large number of incremental mechanisms that have been added to TCP over the years. Many alternative schemes proposed in the literature deal with TCP congestion in data-centers in a classic Internet-centric manner by invoking mechanisms such as RED. This approach is flawed for three major reasons:
i) RED is a mechanism that was designed for the Internet. Its goal is to reduce the average queuing delay experienced by packets in the huge buffers of routers, which contributes a large proportion of the end-to-end delay and delay-jitter. In contrast, data-centers use high-speed switches with small buffers; therefore, the contribution of queuing delay to the total FCT is not as dramatic as in the case of the Internet, regardless of the buffer occupancy. And so, maintaining a small queue does not help the FCT.
ii) With increasing link speeds in modern data-centers, the interplay between propagation delay and transmission delay is transformed, rendering control mechanisms valid for one no longer valid for the other. For example, in data-centers with 1 Gbps network interfaces, the transmission time of a single 1500-byte IP packet is about 12 microseconds, while the round-trip time over a 600 m path at the speed of light is about 6 microseconds. So, there can be at most one packet spread over a link between any two adjacent interfaces in the network (e.g., server NIC to ToR port, ToR port to aggregation port, and so on). In contrast, with 40 Gbps network interfaces, it takes only 0.3 microseconds to complete the transmission of an IP packet, yielding a possible flight of up to 33 IP packets per hop. So, early detection and notification via buffer thresholds with the small buffers that exist in the switches are ineffective, and excessive packet losses are inevitable.
iii) Packet losses in TCP per se are not the reason for these problems; excessive congestion is. Packet losses are merely symptoms of congestion, so there is no reason to try to curtail them completely as long as we can recover from losses fast. In fact, curtailing packet losses completely results in a non-competitive behavior that yields poor performance in heterogeneous TCP environments. For example, pitting TCP Vegas against TCP New Reno, Cubic, or DCTCP results in poor performance for the former. As a consequence, eliminating packet losses in data-center networks while maintaining a high link utilization is not helpful. Instead, we propose to pinpoint the true reasons for increased delays in data-centers and to tackle such reasons directly [4].

Several measurement studies [19, 25, 35] have been conducted on data-centers and have shown that latency in such environments varies greatly. To further understand the reasons behind this, we deep-dive into a packet-level analysis of the flows and the TCP socket state variables at a microscopic level to understand TCP's behavior and its loss recovery mechanisms. An early work [59], based on data-center measurements, found that the timeout mechanism is to blame for the long waiting times and proposed the very simple yet effective solution of reducing the RTO_min value for TCP in data-center environments while using high-resolution timers to keep track of delays at the microsecond level. This approach actually solves the problem, reduces the FCT, and mitigates TCP-incast congestion effects. However, i) it requires the modification of TCP, and as such it is inappropriate for public data-centers where multiple tenants can upload their own version of the OS; and ii) there is no "magical" value of RTO_min that fits all possible environments. For instance, an RTO_min that works inside the data-center (e.g., between a web server and the back-end database server) will lead to spurious timeout events for Internet-facing connections (e.g., the connection between a web administrator's workstation and the server in the data-center).
Table I: TCP API Calls Intercepted by the LossProbe Module

Function Call        Description
tcp_set_state        Handles and updates the TCP connection state
tcp_v4_do_rcv        Handles the arrival of all types of TCP segments
tcp_retransmit_skb   Retransmits one SKB; policy decisions and retransmit-queue state updates are done by the caller
tcp_v4_send_check    Computes an IPv4 TCP checksum
Table II: TCP Socket-level State Information Logged by the LossProbe Module

DataType  Variable          Description
uint      lost_out          Count of lost packets
uint      prr_out           Count of packets sent during recovery
uint      prr_delivered     Count of packets delivered during recovery
uint      prior_cwnd        Congestion window at the start of recovery
uint      prior_ssthresh    SSThresh saved at recovery start
uint      total_retrans     Count of retransmits for the entire connection
uint      retransmit_high   Highest sequence number marked for retransmission
A recent RFC [46] proposed the so-called tail loss probe (TLP) mechanism, which recommends sending TCP probe segments whenever ACKs do not arrive within a short Probe TimeOut (PTO, set to min(2·SRTT, 10 ms) when more than one segment is in flight). In addition to requiring changes to TCP, this approach suffers from two additional problems: i) probe packets may also be lost; and ii) probe packets may worsen the in-network congestion, especially during TCP-incast.

A. Impact of RTO on the FCT
In data-centers, partition/aggregate applications that generate small flows are challenged by the combination of small buffers, large initial sending windows, an inadequate RTO_min, and slow-start's exponential increase. This combination of hardware and TCP configuration frequently leads to timeout events for such applications. In particular, when the number of flows they generate is large and roughly synchronized, incast TCP synchronized losses occur. As the loss probability increases linearly with the number of flows [43], the flow synchronization and the excessive losses lead to throughput collapse for small flows.

To illustrate this, consider a simplified fluid-flow model with N flows sharing a link of capacity C equally. Let B be the flow size in bits and n be the number of RTTs it takes to complete the transfer of one flow. The optimal throughput ρ* can simply be expressed as the ratio of the flow size to its average transfer time: ρ* = B / (nτ + BN/C). That is, it takes BN/C to transmit the B bits, plus an additional queuing and propagation delay of τ seconds for each of the n RTTs. In practice, when TCP incast congestion involving N flows results in throughput collapse, a flow experiences one or more timeouts and recovers only after waiting for the RTO. The actual throughput then writes: ρ = B / (RTO + n′τ′ + BN/C), where typically n′ ≥ n and τ′ ≥ τ. In addition, in data-centers, the typical RTT is on the order of a few hundred microseconds, while existing TCP implementations impose a minimum RTO (i.e., RTO_min) of about 200 to 300 ms. As a consequence, large flows yield values of n′ such that n′τ′ is similar to or greater than the RTO. In contrast, small flows only last for a few RTTs; therefore, RTO ≫ n′τ′.
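As a concrete sanity check, consider illustrative numbers (these specific values are assumptions chosen for illustration: a small flow of B = 10 MSS ≈ 116.8 kbits, N = 10 concurrent flows, C = 1 Gbps, n = n′ = 3 RTTs, τ = τ′ = 100 µs, and RTO = 200 ms):

    ρ* = B / (nτ + BN/C) = 116.8 kb / (0.3 ms + 1.17 ms) ≈ 79 Mb/s
    ρ  = B / (RTO + n′τ′ + BN/C) = 116.8 kb / (200 ms + 0.3 ms + 1.17 ms) ≈ 0.58 Mb/s

In other words, a single timeout turns a transfer that should complete in roughly 1.5 ms into one that takes over 200 ms, a degradation of more than two orders of magnitude.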
And so, when a small flow experiences a loss that cannot be recovered by three duplicate ACKs, it systematically incurs an FCT that is orders of magnitude larger than it should be.

B. Analyzing TCP congestion recovery
To investigate why packet losses seem to affect large flows only marginally yet degrade the performance of small flows dramatically, we collected and examined socket-level TCP flow state information from a Websearch workload [10] in a small-scale data-center testbed.
Figure 1: (a-b) show the retransmission size relative to CWND, while (c-d) show the loss position relative to CWND: (a) FRR size relative to CWND; (b) RTO size relative to CWND; (c) FRR position relative to CWND; (d) RTO position relative to CWND. Each plot aggregates the data of all servers in the data-center.

First, we implemented a socket-level monitoring module, named hereafter "LossProbe", based on KProbes/JProbes [33] in the Linux kernel. Probes are dynamic debugging tools which, in our case, allow us to intercept the TCP event handlers and API calls listed in Table I, where we log the target TCP socket-level state information. The module works as follows:
1) JProbe requires the address of the kernel function to trace; hence, the target TCP handlers of the events of interest have to be identified in the Linux kernel source code base [20]. For example, tcp_retransmit_skb is the function called in the kernel to retransmit a TCP segment.
2) Then, a handler function is defined that performs certain actions upon entry of the traced function (e.g., printing a debugging message when the target kernel function is invoked). In the probe module, that function is defined, for convenience, with the name of the original probed function prefixed by "j" (e.g., jtcp_retransmit_skb), and JProbe calls it upon entering the original function.
3) The monitoring module is dynamically installed into the kernel, and the probed workload (or experiment) is invoked. Upon entry of the targeted functions, the jprobe handler is called to collect the state information of interest and write it into an in-RAM buffer, which in turn is flushed periodically and asynchronously onto the file system to avoid artificially stalling the datapath.
Using our custom-built traffic generator, we replicate a Websearch workload [10] consisting of thousands of flows and collect measurements on the data listed in Table II from all the servers in our testbed (the code for the LossProbe module is publicly available at https://github.com/ahmedcs/TCP_loss_monitor/).
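For illustration, a minimal sketch of how such a probe can be attached to tcp_retransmit_skb on the 3.x kernels used in our testbed is shown below. The handler signature mirrors that kernel generation's tcp_retransmit_skb(struct sock *, struct sk_buff *); the module name and the particular fields printed are illustrative only, and the full LossProbe module linked above logs the complete set of variables in Table II.

    #include <linux/module.h>
    #include <linux/kprobes.h>
    #include <net/tcp.h>

    /* JProbe handler: same prototype as the probed function on 3.x kernels. */
    static int jtcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
    {
            const struct tcp_sock *tp = tcp_sk(sk);

            /* Log a few of the socket-level variables of Table II. */
            pr_info("lossprobe: retransmit cwnd=%u lost_out=%u total_retrans=%u\n",
                    tp->snd_cwnd, tp->lost_out, tp->total_retrans);

            jprobe_return();   /* mandatory: hand control back to the real function */
            return 0;          /* never reached */
    }

    static struct jprobe lossprobe = {
            .entry          = jtcp_retransmit_skb,
            .kp.symbol_name = "tcp_retransmit_skb",
    };

    static int __init lossprobe_init(void) { return register_jprobe(&lossprobe); }
    static void __exit lossprobe_exit(void) { unregister_jprobe(&lossprobe); }

    module_init(lossprobe_init);
    module_exit(lossprobe_exit);
    MODULE_LICENSE("GPL");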
Figure 2: (a) the CDF of CWND at the time of transmission of the lost packet; (b) FCT with and without TLP for the Websearch and Datamining workloads.

We summarize our findings in several figures that reflect TCP's behavior with respect to the mechanism invoked to recover from packet losses (i.e., Fast Retransmit and Recovery, or Retransmission Timeouts). Fig. 1a shows the distribution (on the ordinate) of the size of each retransmission (on the abscissa) for FRR-based recovery events; the retransmission size is computed from the sequence numbers of the lost segments and normalized to the Cwnd (a bar for 0-10 refers to the probability that 0-10% of the CWND is lost, while a bar for 90-100 refers to the probability that 90-100% of the CWND is lost). Similarly, Fig. 1b shows the same metric for RTO-based recovery events. Fig. 1c and Fig. 1d show the distribution (on the ordinate) of the loss index or position (on the abscissa) in the case of FRR- and RTO-based recovery, respectively. The index points to the first retransmitted segment (for several consecutive segment losses), relative to the Cwnd when the segment was first transmitted (i.e., before a loss is detected); here, a bar for 0-10 refers to the probability that the loss occurs in the first 0-10% of the segments in the Cwnd, while a bar for 90-100 refers to the probability that the loss occurred in the last 90-100% of the segments.

Analyzing these results, we can draw the following conclusions. Fig. 1a suggests that the FRR loss size is distributed over the range of packets with a positive skew towards the first few fractions of the window (i.e., the probability of losing more than 30% of the window is insignificant). However, Fig. 1b shows that this is not the case for RTO, which appears well distributed with a positive skew towards the tail end of the window (i.e., the probability of losing more than 30% of the window is significant). Also, there are only a few RTOs far away from the tail; these represent lost packets within the same congestion window. Fig. 1c points out that losses at the tail of the window occur with higher frequency for RTO events; in the case of FRR, however, the Cwnd is large enough for TCP to receive a sufficient number of duplicate ACKs, which allows for fast recovery. Similarly, Fig. 1d shows the same trend with a higher frequency at the tail; in this case, however, the Cwnd is relatively small and hence contains too few in-flight packets to allow for FRR, so eventually RTO recovery occurs. To elaborate, we see in Fig. 2a that the Cwnd of flows with segments that experience RTO is typically smaller than that of flows that recover via FRR.

Finally, we also show the ineffectiveness of the TLP mechanism cited above [46] in recovering tail losses in Fig. 2b. The figure shows that the TLP mechanism is not effective and, due to its additional overhead, may even increase the FCT of small flows.

In data-centers, the size of the pipeline is small: typically, with an RTT of 100 µs, a link of 1 Gbps (resp. 10 Gbps) can accommodate 8.3 packets (resp. 83 packets). In conjunction with shallow-buffered switches, the nominal TCP fair share during TCP-incast barely exceeds one packet per flow, and hence the occurrence of an RTO is highly likely. This phenomenon highlights how TCP's performance can be degraded when operating in a small-window regime with small buffers in high-bandwidth, low-delay environments like data-centers. The effect on the FCT is more severe for small, time-sensitive flows, which generally last only a few RTTs but are compelled to wait for 2 to 4 orders of magnitude of extra time due to the RTO_min rule.
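The pipeline sizes quoted above follow directly from the bandwidth-delay product, using a 1500-byte packet and the 100 µs RTT mentioned in the text:

    BDP(1 Gbps)  = C × RTT / packet size = (10^9 b/s × 100 µs) / (1500 × 8 b) ≈ 8.3 packets
    BDP(10 Gbps) ≈ 83 packets

Divided among tens of synchronized incast flows, this leaves well under one packet in flight per flow, which is precisely the regime in which a tail loss cannot generate three duplicate ACKs.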
IV. SYSTEM DESIGN AND IMPLEMENTATION
T-RACKs' design is based on the following observation: packet losses are inevitable in TCP, so the key to reducing the long latency and jitter is not to try to avoid losses completely, but instead to avoid long waiting after the losses occur. In achieving this, T-RACKs must also: (R1) improve the FCT of latency-sensitive applications by expediting the transmission of small flows' packets; (R2) be friendly to throughput-sensitive large flows (i.e., it must not sensibly degrade the throughput of large flows to satisfy the delay requirements of latency-sensitive flows); (R3) be compatible with all existing TCP versions (i.e., it must not impose modifications inside the virtual machines, and if any are needed, they shall be in the hypervisors, which are fully under the control of the data-center operator; it also must not require any extra special hardware); and (R4) finally, the mechanism must be simple enough to be easily deployable in real data-centers.

In this perspective, T-RACKs actively infers packet losses by monitoring (in the hypervisor) per-flow TCP ACK numbers and proactively triggers TCP's FRR mechanism whenever an RTO is deemed likely to take place. The goal is to help small TCP flows that would otherwise experience a timeout recover fast via FRR instead of waiting for TCP's RTO_min. The proposed mechanism intervenes only when the loss is almost certain, leading to a significant improvement of recovery times, and hence of the FCT. The T-RACKs design derives from the following arguments: i) all TCP versions adopt the FRR mechanism as a way to detect and recover from losses fast; so, if the FRR mechanism can be forced into action by the hypervisor, regardless of the nature of the loss, the resulting system is transparent to the TCP protocol in the VM and requires no changes to TCP in the VM; ii) TCP relies on a small number of duplicate ACKs to activate FRR; however, in the majority of cases (especially for short flows), there are not enough packets in flight to trigger duplicate ACKs. To achieve this, we propose to use "spoofed" TCP ACK signaling from the hypervisor to the VM. In this perspective, the hypervisor maintains a per-flow timer β = α · RTT + rand(RTT) to wait for the ACKs before it triggers FRR with spoofed duplicate ACKs.

Note that our idea is similar in spirit to the so-called TCP SNOOP protocol [16], which retransmits lost segments on behalf of the communicating end-points to filter out bit errors in low-speed wireless networks. As such, TCP SNOOP could also be applied in data-centers. However, it is expensive to implement, as it requires buffering all sent segments at the lower layers (e.g., link layer or hypervisor), which requires an ample buffer space in data-centers. T-RACKs, in contrast, triggers the retransmissions from the actual TCP protocol in the VM instead of buffering and retransmitting the packets itself. It requires no packet buffering at all; it only relies on memorizing a few state variables from the last segment and the last ACK received of each flow.

A. T-RACKs Algorithm
The T-RACKs algorithm consists broadly of three major functions: the first two maintain per-flow state information on the server (hypervisor) upon the departure and arrival of packets, as shown in Algorithm 1, and the third is a timer event handler described in Algorithm 2. In the initialization phase of Algorithm 1, an in-memory flow cache pool is created to track new flow arrivals; this approach speeds up the creation of flow objects. A hash-based flow table is created and manipulated via the Read-Copy-Update (RCU) synchronization mechanism to efficiently identify flow entries. The other parameters and variables are set in this step as well. Before each TCP segment departure, T-RACKs performs the following actions: i) the packet is hashed using its 4-tuple (source and destination IP addresses and port numbers), and the corresponding flow is identified; ii) if this is a SYN packet or the flow entry is inactive (i.e., a new flow), the flow entry is reset, then the TCP header information and options are extracted to activate a new flow record; and iii) if this is a data packet, the last sent sequence number and time of the flow are updated.

Next, upon each TCP ACK arrival, the algorithm performs the following actions: i) the flow entry is identified using its 4-tuple; if the flow is large, we ignore it, as it does not undergo recovery via T-RACKs (by doing so, the complexity of the scheme is reduced); ii) if the ACK sequence number acknowledges a new packet, the last seen ACK sequence number and time are updated and the dupACK counter is reset; if the accumulated flow size exceeds a threshold γ, the flow is marked as a large flow (so that we can stop tracking it); iii) if the ACK number acknowledges an old packet (i.e., this is a duplicate ACK), we drop the dupACK if the flow is in recovery mode, or otherwise increment the number of dupACKs seen so far; iv) the TCP header information of the ACK is updated if necessary (we discuss this part in more detail in Section IV-C).

Algorithm 2 handles the periodic global timer expiry events and performs the following actions for all active, non-large flows in the table. In a typical implementation, this timer lasts 1 ms and is processed regularly with the OS clock timer interrupt (i.e., it does not require special high-resolution timers): i) if no new ACK acknowledging new data has arrived for β seconds since the last new ACK arrival, the flow is deemed likely to experience a timeout; T-RACKs then enters into action, spoofs an ACK using the last successfully received ACK sequence number, and sends it out to the sending process or VM residing on the same end-host. An exponential backoff mechanism is activated to account for the various dupACK thresholds set by the sender's TCP or OS. ii) If, with the timer β backed off by the number of retransmissions x of the spoofed ACK, the flow still has not received a new ACK, another spoofed ACK is created and sent out to the corresponding sender. To ensure T-RACKs does not send spurious spoofed dupACKs, the algorithm backs off exponentially, i.e., after each transmission of a spoofed ACK the timer β is doubled. iii) If the backoff time approaches RTO_min (i.e., 200 ms), we stop triggering Fast-Retransmit (by resetting the soft state) and let the sender's TCP RTO timer handle the recovery of this segment. iv) Finally, if the inactivity period exceeds 1 second, the flow entry is hard reset.
Algorithm 1: T-RACKs Packet Processing

/* Initialization */
Create an in-memory flow cache pool;
Create the flow table and reset flow information;
Initialize and insert the NetFilter hooks (for a NetFilter-based implementation);

Input: α: multiplicative factor applied to the RTT in the ACK RTO (β) calculation
Input: γ: a threshold in bytes to stop tracking a flow as small
Input: φ: the dupACK threshold used by TCP flows
Input: t: the current local time counted in jiffies
Define x: the exponential backoff counter

Function Outgoing Packet Event Handler(Packet P):
    f = Hash(P);
    if SYN(P) or !f.active then
        Reset Flow(f);
        Extract TCP options (i.e., TStamp, SACK, etc.);
        Update the flow information and set f.active;
    if DATA(P) then
        Update flow info (i.e., last sent sequence number);
        f.active_time = now();

Function Incoming Packet Event Handler(Packet P):
    /* For ACKs: extract and update flow information from the incoming header */
    f = Hash(P);
    if f.long_lived then return;
    if ACK_bit_set(P) then
        Extract the required values (e.g., ACK sequence number);
        if New ACK then
            Update flow entry and state information (e.g., RTT);
            Update the last seen ACK number from the receiver;
            Reset f.dupAck_Nr = 0;
            Reset f.ACK_time = now();
            if f.lastAckNo ≥ γ then f.long_lived = true;
        else if Duplicate ACK then
            f.dupAck_Nr = f.dupAck_Nr + 1;
            /* Drop extra dup-ACKs while in recovery */
            if f.resent > 0 then Drop Dup ACK;
        Update TCP headers (i.e., TStamps, SACK, etc.);
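To make the per-flow soft state concrete, the following user-space C sketch mirrors the incoming-ACK branch of Algorithm 1 (struct and function names such as flow_entry, on_ack, and GAMMA_BYTES are illustrative rather than the module's actual identifiers, and sequence-number wrap-around is ignored for brevity):

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define GAMMA_BYTES (100 * 1024)   /* gamma: stop tracking a flow as small (illustrative) */

    struct flow_entry {                /* minimal per-flow soft state kept by the shim layer   */
            bool     active, long_lived;
            uint32_t last_ack_no;      /* highest cumulative ACK seen so far                   */
            uint32_t dupack_nr;        /* duplicate ACKs since the last new ACK                */
            uint32_t resent;           /* spoofed ACKs injected so far (recovery mode if > 0)  */
            uint64_t bytes_acked;
            uint64_t ack_time_ns;      /* arrival time of the most recent new ACK              */
            uint64_t active_time_ns;   /* time of the most recent outgoing data segment        */
            uint64_t resent_time_ns;   /* time of the most recent spoofed ACK                  */
    };

    static uint64_t now_ns(void)
    {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* Incoming-ACK handler; returns false if a duplicate ACK should be dropped. */
    static bool on_ack(struct flow_entry *f, uint32_t ack_no)
    {
            if (f->long_lived)
                    return true;                       /* large flows are not tracked         */
            if (ack_no > f->last_ack_no) {             /* new data acknowledged               */
                    f->bytes_acked += ack_no - f->last_ack_no;
                    f->last_ack_no  = ack_no;
                    f->dupack_nr    = 0;
                    f->resent       = 0;
                    f->ack_time_ns  = now_ns();
                    if (f->bytes_acked >= GAMMA_BYTES)
                            f->long_lived = true;      /* exceeded gamma: stop tracking       */
            } else {                                   /* duplicate ACK                        */
                    f->dupack_nr++;
                    if (f->resent > 0)                 /* in recovery: drop extra dupACKs      */
                            return false;
            }
            return true;
    }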
B. T-RACKs System Implementation
Algorithm 1 relies on the TCP header information of ACK packets to maintain per-flow TCP state information. In this paper, we only consider a lightweight end-host (hypervisor) shim-layer implementation to achieve this (in higher-speed networks, T-RACKs could equally be implemented as a networking function on SmartNICs). This approach is perfectly feasible even for production data-centers, because the number of flows on a server in production data-centers has been reported to be small, generally not exceeding 30-40 [10]. In addition, the number of flows tracked by T-RACKs is further reduced on average by only tracking small flows, abandoning large flows whenever they grow to reach a certain size threshold. The deployment of T-RACKs in data-centers involves hashing the flows into a hash-based flow table using the 4-tuple (i.e., SIP, DIP, Sport, and Dport) whenever SYN packets arrive or a flow sends data after a long silence period. For instance, referring to Figure 3, when VM1 on the sender end-host establishes a connection with its peer (or destination) VM on the receiving end-host, a new flow entry (S1-D1) is created in the table, as shown in Fig. 3. Also, although not shown in the algorithm code, flow entries are cleared from the table whenever a connection is closed (following the TCP connection tear-down FIN/FIN-ACK) or after a pre-set inactivity time threshold is exceeded. The flow table could track many relevant TCP-related per-flow state variables; however, for T-RACKs to perform the fast recovery function, it needs to track only a minimal set of TCP state variables (including the highest ACK sequence number seen so far and the arrival time of the most recent ACK).

The T-RACKs system uses the flow table to store and update TCP flow information, including the last ACK number, the last sent sequence number, the corresponding times, the RTT of the flow measured using the TCP timestamp option, as well as the optional TCP SACK information for each outgoing TCP flow (if TCP SACK is active, TCP's response to duplicate ACKs differs from the standard behavior, so this must be taken into account to elicit a proper reaction). T-RACKs intercepts the incoming ACKs and outgoing data to update the current state of each tracked small flow. When packets are dropped by the network and the receiver receives enough out-of-sequence data to generate sufficient (real) dupACKs, the loss is recovered via FRR by the VM without the intervention of T-RACKs. However, when the receiver fails to receive enough data to generate enough real dupACKs to trigger FRR, T-RACKs intervenes after a timer (1 ms) by sending spoofed dupACKs (or RACKs, for retransmitted ACKs) to the sender. Typically, the sender then receives enough dupACKs and RACKs to trigger FRR and retransmit the lost segment within a reasonable time, long before the TCP RTO is reached.
Algorithm 2: T-RACKs Timeout Handler

Create and initialize a timer to trigger every 1 ms;

Function Timer Expiry Event Handler:
    for Flow (f) ∈ FlowTable do
        β = α * f.RTT + rand(f.RTT);
        if !f.active or f.long_lived then Continue;
        T = MAX(f.ACK_time, f.active_time);
        if now() - T ≥ β then
            Resend the last ACK (φ − f.dupAck_Nr) times;
            Set f.resent_time = now();
            Set x = 2;
            Continue;
        if now() - f.resent_time ≥ (β << x) then
            Resend the ACK one more time;
            x = x + 1;
            Continue;
        if (now() - f.ACK_time) ≥ RTO_min then
            Stop T-RACKs recovery; soft reset the recovery state of flow (f);
            Continue;
        if (now() - f.active_time) ≥ 1 sec then deactivate_flow(f);
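Similarly, the per-flow check performed at every 1 ms tick in Algorithm 2 can be sketched as follows, reusing the flow_entry structure from the previous sketch; send_spoofed_ack() and deactivate_flow() are hypothetical hooks standing in for the actual RACK injection and table clean-up paths, and the constants are the illustrative defaults quoted in the text:

    #include <stdlib.h>

    #define ALPHA          10                        /* beta = ALPHA * RTT + rand(RTT)       */
    #define PHI            3                         /* dupACK threshold assumed for TCP FRR */
    #define RTO_MIN_NS     (200ull * 1000 * 1000)    /* 200 ms                               */
    #define IDLE_RESET_NS (1000ull * 1000 * 1000)    /* 1 s: hard reset of inactive entries  */

    void send_spoofed_ack(struct flow_entry *f);      /* hypothetical RACK injection hook     */
    void deactivate_flow(struct flow_entry *f);       /* hypothetical flow-table clean-up     */

    /* Called every 1 ms for each tracked flow; *x is the per-flow backoff exponent. */
    static void on_timer_tick(struct flow_entry *f, uint64_t rtt_ns, uint64_t now, unsigned *x)
    {
            uint64_t beta = (uint64_t)ALPHA * rtt_ns + (uint64_t)rand() % (rtt_ns + 1);
            uint64_t last = f->ack_time_ns > f->active_time_ns ? f->ack_time_ns : f->active_time_ns;

            if (!f->active || f->long_lived)
                    return;

            if (f->resent == 0 && now - last >= beta) {
                    /* likely tail or bursty loss: top up the dupACK count to the FRR threshold */
                    for (unsigned i = f->dupack_nr; i < PHI; i++)
                            send_spoofed_ack(f);
                    f->resent = 1;
                    f->resent_time_ns = now;
                    *x = 2;                                   /* start exponential backoff       */
            } else if (f->resent > 0 && now - f->resent_time_ns >= (beta << *x)) {
                    send_spoofed_ack(f);                      /* one more RACK, further backed off */
                    f->resent++;
                    (*x)++;
            }

            if (f->resent > 0 && now - f->ack_time_ns >= RTO_MIN_NS) {
                    f->resent = 0;                            /* step aside: let TCP's own RTO act */
                    f->dupack_nr = 0;
            }
            if (now - f->active_time_ns >= IDLE_RESET_NS)
                    deactivate_flow(f);
    }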
Figure 3: The T-RACKs system: it consists of an end-host module that tracks the incoming ACKs of TCP flows and generates fake (spoofed) ACKs whenever a flow times out.
Figure 4: RTT variation between the time of first transmission and the time of retransmission (i.e., ΔRTT = RTT_rt − RTT_t).

C. Practical Aspects of T-RACKs System
T-RACKs System: the system is built upon a lightweight module at the hypervisor layer that tracks a limited per-flow state. In the simplest case, it tracks TCP's identifying 4-tuple, the per-flow last ACK number, and the timestamp of the last non-dupACK packet. The system is similar in spirit to recent works [24, 27] that aim to enable virtualized congestion control in the hypervisor without cooperation from the tenant VM; however, these approaches require fully-fledged TCP state tracking and the implementation of a full TCP finite-state machine in the hypervisor, including packet queuing. In contrast, T-RACKs minimizes the overhead by tracking the minimal amount of necessary information and implementing only a subset of the retransmission mechanism, while letting the VM do the actual work of transmission and queuing.
Complexity: T-RACKs' complexity resides in its interception of ACKs to update the last-seen ACK information. Since it does not perform any computation on the ACK packets, it adds little to the load on the hosting server or to the latency (the header of incoming ACKs is only updated in some instances, e.g., to insert a fake SACK block signifying a small gap in the SACKed numbers; otherwise, RACK packets would be ignored by TCP). This claim is supported by our observations in the experiments on our data-center. A hash-based table is used to track the flow entries of active small flows. In the worst case, when hashes collide, a linear search within the linked list is necessary; however, this worst case is rare due to the small number of flows originating from a given end-host. Typically, end-host CPUs can sustain packet-processing rates of 60 Gbps; hence, the little processing required to maintain the T-RACKs state does not affect the TCP throughput.
Spurious retransmissions: T-RACKs may possibly introduce spurious retransmissions, making in-network congestion worse. However, this boils down to the same question as choosing a correct RTO value in TCP. For this purpose, we refer to a previous study [14], which showed that, even when a relatively bad RTT estimator is used, setting a relatively high minimum RTO can help avoid many of the spurious retransmissions in WAN transfers. This is supported by a subsequent study [66] showing significant changes (or variance) in Internet delays, and recent works [26, 45] show similar behavior within current data-centers. In our testbed, we also observed noticeable variation in the measured RTT. To quantify this, we measured the difference between the RTT value collected at the time of the first transmission of a packet and the one collected at the time of its fast retransmission or RTO retransmission. The collected data show a considerably large variation, ranging from a few hundred microseconds upwards (Fig. 4). Hence, the ACK RTO (β) calculation shown in Algorithm 2 strikes a balance between rapid retransmission and the risk of causing spurious retransmissions.

T-RACKs RTO β: in most of our experiments and simulations, we choose the value of the ACK RTO (β) to be at least 10 times the dominant measured RTT in the data-center. We believe, and the results show, that this value achieves a good tradeoff between avoiding many spurious retransmissions and, at the same time, not being too late in recovering from losses. We further adopt the well-known exponential backoff mechanism for subsequent ACK RTO (β) calculations until either the loss is recovered or TCP's default RTO (i.e., RTO_min) is close to being reached.
Synchronization of retransmissions: since T-RACKs relies on a timer for ACK recovery, such a timer may result in the synchronization of retransmissions from different VMs on different hosts, leading to incast-like congestion. We studied this behavior in simulation, and the results show repeated losses due to possibly synchronized retransmissions. A viable solution to de-synchronize such flows is to introduce some randomness in the ACK RTO, ultimately resulting in fewer flows experiencing repeated timeouts. We adopted this approach and added a random delay in the calculation of the RTO β, as shown above in the algorithms.
Figure 5: TCP header fields manipulated by the T-RACKs system.

TCP header manipulation: TCP does not accept any packet with an inconsistent timestamp; hence, the timestamps are updated on each ACK arrival with the local jiffies variable to keep the timestamps consistent whenever RACKs are sent. For SACK-enabled TCP, fake SACK block information needs to be inserted into incoming ACKs (those with no SACK blocks in the TCP header) to indicate a small gap, equal to the minimum segment size (i.e., 40 bytes), after the last successfully acknowledged data.
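To make the option handling concrete, the sketch below assembles the option bytes a spoofed RACK could carry: a refreshed timestamp option and a single SACK block placed a minimal 40-byte gap after the last cumulative ACK, following the layout of Fig. 5 (the function name, the caller-supplied tsval/tsecr values, and the 40-byte block length are illustrative assumptions; only the 40-byte gap itself is taken from the text above):

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    #define TCPOPT_NOP        1
    #define TCPOPT_SACK       5
    #define TCPOPT_TIMESTAMP  8

    /* Writes NOP NOP TS(kind=8,len=10) NOP NOP SACK(kind=5,len=10,one block) into buf
     * (24 bytes total, a multiple of 4) and returns the number of bytes written.      */
    static int build_rack_options(uint8_t *buf, uint32_t tsval, uint32_t tsecr,
                                  uint32_t last_ack)
    {
            uint8_t *p = buf;
            uint32_t left  = htonl(last_ack + 40);    /* 40-byte gap after the cumulative ACK */
            uint32_t right = htonl(last_ack + 80);    /* one illustrative 40-byte SACK block  */

            *p++ = TCPOPT_NOP; *p++ = TCPOPT_NOP;
            *p++ = TCPOPT_TIMESTAMP; *p++ = 10;       /* kind=8, len=10                       */
            tsval = htonl(tsval);  memcpy(p, &tsval, 4);  p += 4;
            tsecr = htonl(tsecr);  memcpy(p, &tsecr, 4);  p += 4;

            *p++ = TCPOPT_NOP; *p++ = TCPOPT_NOP;
            *p++ = TCPOPT_SACK; *p++ = 2 + 8;         /* kind=5, len=2+8 per block            */
            memcpy(p, &left, 4);  p += 4;
            memcpy(p, &right, 4); p += 4;

            return (int)(p - buf);
    }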
Security concerns: during FRR, to be able to maintain its flight size and avoid a timeout, TCP artificially inflates the window by 1 MSS for each received dupACK. This can be exploited to launch an ACK-spoofing attack [54] on the senders. RFC 5681, released in 2009, addressed this particular attack and proposed implementing a Nonce and Nonce-Reply as a way of verifying the source of dupACKs. However, such a solution would require the introduction of extra TCP headers, prohibiting its deployment in real TCP implementations. In T-RACKs, we address such attacks by dropping dupACKs once the ACK timer has expired and the flow has entered the recovery state. This approach disables the artificial Cwnd inflation during recovery and, at the same time, prevents external ACK spoofing (other than RACKs). Worth mentioning also is that RACKs are generated from the hypervisor layer, which is under the control of the trusted data-center operator.
TCP semantics: these are conceptually violated, since dupACKs should indicate that packets following the lost one were received successfully. However, according to RFC 5681, the network may replicate packets, and hence the RACKs can be treated as packets replicated from within the network.
TCP header manipulation: Fig. 5 shows the TCP header fields manipulated by the T-RACKs module; the module tracks and uses the highlighted TCP header fields of all incoming ACK packets using RCU flow entries.

V. SIMULATION ANALYSIS
In this section, we study the performance of T-RACKs to verify whether it can achieve its goals in small-scale and large-scale simulation scenarios. We first conduct a number of packet-level simulations using ns2 and compare T-RACKs' performance to that of state-of-the-art schemes. (For brevity, we refer to T-RACKs as RACK in the figures.)
A. Simulations in a Dumbbell Topology
To study how TCP behaves in response to packet losses and how likely it is to recover quickly with the help of T-RACKs, we conducted several packet-level simulation experiments covering a large variety of TCP and AQM combinations. We also conducted simulation experiments using the congestion control code imported from the Linux kernel (i.e., NewReno and Cubic). We use ns2 version 2.35 [48], which we have extended with the T-RACKs mechanism inserted as a connector between nodes and their links in the topology setup. In addition, we patched ns2 with the publicly available DCTCP patch. In our simulation experiments, we use link speeds of 1 Gbps for the sending stations, a bottleneck link of 1 Gb/s, a low RTT of 100 µs, the default TCP RTO_min of 200 ms, and a TCP initial window of 10 MSS. We use a rooted tree topology with a single bottleneck at the destination and run the experiments for 15 s. The buffer size of the bottleneck link is set to 100 packets, which is more than the bandwidth-delay product in all cases. We first designed two simulation scenarios:
1) CASE 1: small flows only.
2) CASE 2: large flows coexisting with small flows.
In CASE 2, the ratio of small flows to large flows is set to 3, and large flows send data for the whole duration of the experiment. In both cases, each small flow sends a 14.6 KB file (i.e., 10 MSS) until it completes its transmission. Small flows start with a random inter-arrival time drawn from an exponential distribution with a mean equal to the transmission time of one packet; this allows us to create clusters of small flows that start almost simultaneously, to emulate incast traffic. This process is repeated once every 3 s, which gives five rounds of incast traffic during the simulation. In each round, we draw the order of the servers generating the flows according to a uniform distribution. We study the packet losses, the likelihood of fast recovery, and the recovery time.
Figure 6: CDF of the average FCT for CASE 1 in the 20-flow scenario: (a) without T-RACKs, (b) with T-RACKs.
Figure 7: CDF of the average FCT for CASE 1 in the 80-flow scenario: (a) without T-RACKs, (b) with T-RACKs.
We first study TCP NewReno with the DropTail, RED-ECN, and Random Drop AQMs, as well as DCTCP, covering the most common TCP and AQM settings. First, we run the experiments without T-RACKs for the 20- and 80-flow traffic scenarios generated according to CASE 1 (i.e., containing only small flows). Figures 6a and 7a show the average FCT for the different schemes. The FCT is significantly high for all schemes, exceeding 200 ms for a large fraction of the flows in both the 20- and 80-flow scenarios. This result indicates that most flows are experiencing RTOs. We repeat the same simulation with T-RACKs enabled on the end-hosts to mitigate the RTOs. Figures 6b and 7b show the FCT of the different schemes with T-RACKs for the 20- and 80-flow scenarios of CASE 1. The results show a significant improvement in the average FCT for the two scenarios and for all schemes. In the 20-flow scenario, at the percentile highlighted by the horizontal black line, the reduction in the average FCT is up to ≈15 times (i.e., from 200 ms down to 13 ms). In the 80-flow scenario, the FCT is reduced by a comparable factor (up to ≈14X) across DropTail, DropRand, RED-ECN, and DCTCP. Figures 8 and 9 report the corresponding results for CASE 2; in particular, Figure 9a and Figure 9b show the results without and with T-RACKs for the 80-flow scenario.
Figure 8: CDF of the average FCT for CASE 2 in the 20-flow scenario: (a) without T-RACKs, (b) with T-RACKs.
Figure 9: CDF of the average FCT for CASE 2 in the 80-flow scenario: (a) without T-RACKs, (b) with T-RACKs.
In this 80-flow scenario, T-RACKs can further improve the performance: it manages to reduce the average FCT at the tail percentile by up to ≈37X across DropTail, DropRand, RED-ECN, and DCTCP. These results show that T-RACKs improves the performance even more in heavy-load scenarios with background traffic.
B. Simulations in Data-center Topology
End-host based schemes are assumed by default to be scalable; to verify this, we experiment with T-RACKs on a larger scale with varying workloads and flow size distributions. For this purpose, we conduct another packet-level simulation using a spine-leaf topology with nine leaf switches and four spine switches, using links of 10 Gbps for the end-hosts and an over-subscription ratio of 5 (the typical over-subscription ratio in current production data-centers is in the range of 3 to over 20). We again examine scenarios with TCP-NewReno, TCP-ECN, and DCTCP operating along with the DropTail, RED, and DCTCP AQMs, respectively. We use a per-hop link delay of 50 µs; TCP is configured with the default RTO_min of 200 ms and an initial window of 10 MSS. Persistent connections are used for successive requests. Finally, the buffer sizes on all links are set equal to the bandwidth-delay product between end-points within one physical rack. The flow size distributions of workload 1 (which represents the Websearch flow size distribution [10]) and workload 2 (which represents the Datamining flow size distribution [25]) are shown in Fig. 10a and capture a wide range of flow sizes.
Figure 10: Flow characteristics: (a) actual flow size distribution; (b) inter-arrival times for various network loads.

The flows are generated randomly from any source host to any other destination host, with the arrivals following a Poisson process with various average flow arrival rates to simulate different network loads. Fig. 10b shows the inter-arrival time distributions for traffic loads ranging from 30% to 90%. We report the average FCT for small flows and for all flows, as well as the total number of timeout events in each case. In this simulation, the T-RACKs threshold γ is set to infinity (i.e., all flows are tracked, including large flows). The T-RACKs RTO, i.e., the timeout that triggers the spoofed dupACKs, is set to 10 times the measured RTT in this experiment. That means that if an ACK with the expected sequence number is not delivered within 10 RTTs, RACK packets are spoofed to trigger FRR.
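For reference, the flow arrival process used above can be reproduced by choosing the mean arrival rate that yields a target load and then sampling exponential inter-arrival gaps; the sketch below shows one common way to do this (the load-to-rate relation and the function names are assumptions about a generic workload generator, not a description of our exact tool):

    #include <math.h>
    #include <stdlib.h>

    /* Mean flow arrival rate (flows/s) that drives a link of capacity_bps at `load`
     * (0..1), given the mean flow size of the workload's distribution in bytes.     */
    static double arrival_rate(double load, double capacity_bps, double mean_flow_bytes)
    {
            return load * capacity_bps / (mean_flow_bytes * 8.0);
    }

    /* One exponential inter-arrival sample (seconds) via inverse-transform sampling. */
    static double next_interarrival(double rate)
    {
            double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* u in (0,1) */
            return -log(u) / rate;
    }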
Figure 11: Performance as a function of the network load in (30%, 90%) for workload 1 (top) and workload 2 (bottom): (a, d) small flows, average FCT; (b, e) all flows, average FCT; (c, f) all flows, number of RTOs.

The average FCT for small flows and for all flows, as well as the total number of timeouts experienced by all flows, for workload 1 and workload 2 are shown in Fig. 11 (flow sizes in the range [0-100KB] are considered small, sizes in [100KB-10MB] medium, and sizes of 10MB and upwards large). Note that for both workloads, the FCT of small flows is dramatically degraded when they experience a timeout, regardless of the TCP congestion control version or AQM mechanism in operation. In contrast, when T-RACKs is activated, it helps small flows the most, improving their FCT as a by-product of reducing the number of timeouts they experience. We also note that the overall FCT decreases for all flows, for two reasons: 1) the threshold γ enables all flows to benefit from T-RACKs, and 2) small flows finish quicker, leaving network resources to the larger ones. We notice that for workload 2, with almost 80% of the flows being smaller than 10KB and hence experiencing fewer timeout events overall, DCTCP can improve the FCT. This improvement is due to DCTCP's ability to regulate the persistent queue length (i.e., there are few large flows to fill the buffer, unlike workload 1).
Figure 12: The same Websearch scenario as above with α varied from 1 RTT to 100 RTTs: (a) small flows, DropTail; (b) small flows, RED-ECN; (c) small flows, DCTCP.

C. Sensitivity to Choice of T-RACKs RTO
In this experiment, we study the sensitivity of T-RACKs to the preset RACK RTO value. For this purpose, we repeat the last simulation while varying the value of the RTT multiplicative factor α in the set [1, 5, 10, 50, 100]. We report the average FCT of small flows in each case for DropTail, RED, and DCTCP in Figure 12 for various loads. From the figure, we can see that the FCT is greatly affected by the choice of the parameter α. Small values of α (i.e., 1 and 5), for which β is close to the RTT, result in a relatively large FCT, which indicates that they tend to cause too many spurious retransmissions that exacerbate congestion in the network. On the other hand, excessively large values of α (i.e., 50 and 100) tend to be too conservative and result in TCP flows recovering later than they could. In the three figures, for all loads, a minimum FCT is achieved at or near a RACK RTO of 10 RTTs.

VI. LINUX KERNEL IMPLEMENTATION
VI. Linux Kernel Implementation

(Figure 13 panels: (a) Small Flows: Average FCT with error bars, (b) Small Flows: Missed Deadlines, (c) All Flows: Average FCT with error bars, (d) Small Flows: Average FCT (Datamining), (e) Small Flows: Average FCT (Educational), (f) Small Flows: Average FCT (Private DC); schemes: Reno, Cubic, and DCTCP, each with and without T-RACKs.)
Figure 13: Performance metrics of the one-to-all scenario without any background traffic for (a-c) Websearch, (d) Datamining, (e) Educational and (f) Private DC workloads, respectively.
In this section, we study the performance of the T-RACKs implementation as a loadable Linux kernel module, using synthetic workloads reproduced from the statistics of workloads found in production data-centers [10, 25]. T-RACKs is a shim layer between the VMs (or the TCP/IP stack) and the hypervisor (or the link layer). We use the NetFilter framework [47], which is an integral part of the Linux kernel. The NetFilter hooks attach to the data path between the NIC driver and the TCP/IP stack, which imposes no modifications to the TCP/IP stack of either the host OS or the guest OS.
(Figure 14 panels: (a) the testbed topology, with four racks of servers connected through ToR leaf switches, a core switch, and a NetFPGA switch; (b) the actual testbed.)
Figure 14: Testbed setup of T-RACKs in a small-scale cluster.

The module intercepts TCP packets destined to the host or its guests before they are handed to the TCP/IP stack (i.e., at post-routing). First, the 4-tuple is hashed and the associated flow index is computed via the Jenkins hash (JHash) [31]. Then, the TCP header is examined, and the proper course of action is taken based on the flag bits (i.e., SYN-ACK, FIN, or ACK), following the logic in Algorithm 1. Unlike SNOOP [16], the module does not employ any packet queues to store incoming packets; it only stores and updates flow-entry state (i.e., ACK number, arrival time, and so on). Also, unlike [59], T-RACKs does not need fine-grained high-resolution timers at the microsecond time-scale; the native OS jiffies timer is used instead, and a single timer serves all flows to handle per-flow RTO events. These design choices make T-RACKs lightweight and help reduce the server overhead.

From 14 data-center-grade servers, each equipped with 6 NICs, we built a small-scale testbed consisting of 84 virtual servers, each assigned a dedicated physical NIC. The servers are interconnected via four non-blocking leaf switches and one spine switch. The testbed is organized into four racks (racks 1, 2, 3, and 4). The servers are connected to the leaf switches, and the leaf switches to the spine switch, via 1 Gbps Ethernet links. The servers run Ubuntu Server 14.04 LTS with Linux kernel 3.18, which integrates a full implementation of DCTCP. Unless otherwise stated, T-RACKs runs with its default settings (i.e., the T-RACKs RTO and the threshold γ are set to 4 ms and 100 KB, respectively). An RTO of 4 ms is a reasonable 16 times the average RTT of ≈250 µs without queuing. We use our custom-built traffic generator to run the experiments with realistic traffic workloads. The traffic generator produces common workloads described in the literature (e.g., industrial-like Websearch [10] and Datamining [25], or institutional-like University and Private DC [17]). In addition, we deploy the iperf program [29] to emulate large background traffic (e.g., VM migrations, backups) in some scenarios. We use different scenarios to reproduce one-to-all and all-to-all flows with or without background traffic. In the one-to-all scenarios, randomly chosen clients in one rack send random requests to any of the servers in the data-center, while in the all-to-all scenario, all clients in the data-center send requests to randomly picked servers. If background traffic is introduced, we run large iperf flows from all clients to all servers to evaluate T-RACKs under sudden and persistent network load spikes. As before, we classify flows of size ≤100 KB as small, 100 KB-10 MB as medium, and ≥10 MB as large.
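As a rough illustration of these design choices, the sketch below (written against a recent Linux kernel; the hook point, table sizing, and all identifiers are our own assumptions rather than the released module) shows how a NetFilter hook can index per-flow soft state by a Jenkins hash of the 4-tuple and update it from the TCP flag bits, without queuing any packets. Registration via nf_register_net_hook(), collision checks on the 4-tuple, and the single jiffies-based timer that scans the table for expired flows are omitted for brevity.

/* Minimal sketch of a T-RACKs-style NetFilter shim (illustrative only).
 * The hook prototype shown matches recent kernels (>= 4.4) and differs
 * slightly on the 3.18 kernel used in the testbed. */
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/jhash.h>
#include <linux/jiffies.h>

#define TBL_BITS 10
#define TBL_SIZE (1 << TBL_BITS)

struct flow_entry {                  /* per-flow soft state only: no packet queue */
	__be32 saddr, daddr;
	__be16 sport, dport;
	u32 last_ack;                /* highest cumulative ACK seen so far        */
	unsigned long last_seen;     /* jiffies of last update; drives the timer  */
	bool active;
};

static struct flow_entry flow_tbl[TBL_SIZE];

static u32 flow_index(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
{
	/* Jenkins hash of the 4-tuple, folded onto the table size */
	return jhash_3words((u32)saddr, (u32)daddr,
			    ((u32)sport << 16) | (u32)dport, 0) & (TBL_SIZE - 1);
}

static unsigned int tracks_hook(void *priv, struct sk_buff *skb,
				const struct nf_hook_state *state)
{
	struct iphdr *iph = ip_hdr(skb);
	struct tcphdr *th;
	struct flow_entry *fe;

	if (!iph || iph->protocol != IPPROTO_TCP)
		return NF_ACCEPT;
	/* a real module must validate header bounds (e.g., pskb_may_pull) */
	th = (struct tcphdr *)((unsigned char *)iph + iph->ihl * 4);

	fe = &flow_tbl[flow_index(iph->saddr, iph->daddr, th->source, th->dest)];
	if (th->syn && th->ack) {            /* connection setup: (re)create entry  */
		fe->saddr = iph->saddr;  fe->daddr = iph->daddr;
		fe->sport = th->source;  fe->dport = th->dest;
		fe->last_ack = ntohl(th->ack_seq);
		fe->active = true;
	} else if (th->fin || th->rst) {     /* teardown: forget the flow           */
		fe->active = false;
	} else if (fe->active && th->ack) {  /* track the cumulative ACK and time   */
		fe->last_ack  = ntohl(th->ack_seq);
		fe->last_seen = jiffies;
	}
	return NF_ACCEPT;                    /* packets are never queued or altered */
}

static const struct nf_hook_ops tracks_ops = {
	.hook     = tracks_hook,
	.pf       = NFPROTO_IPV4,
	.hooknum  = NF_INET_PRE_ROUTING,     /* illustrative choice of hook point   */
	.priority = NF_IP_PRI_FIRST,
};

Dropping a flow's entry once its acknowledged volume exceeds γ, as suggested later in the discussion of overhead, would amount to clearing the active bit in this table when the tracked byte count crosses the threshold.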
A. Experimental Results and Discussion
One-to-all Scenario without Background Traffic: we report here the average FCT for small flows and for all flows, as well as the number of small flows that missed a deadline of 200 ms. The traffic generator is deployed on every client running on an end-host in the data-center and is set to randomly initiate 1000 requests to randomly chosen servers on any of the other racks. For the Websearch workload, Figures 13a, 13b and 13c show the average FCT and missed deadlines for small flows and the average FCT for all flows, respectively, while Figures 13d, 13e and 13f show the average FCT for small flows in the Datamining, Educational, and Private DC workloads, respectively. From these figures, we make the following observations: i) for all workloads, T-RACKs helps small flows regardless of the TCP version, both in the average FCT and in its variation, as indicated by the error bars.
(Figure 15 panels: (a) Small Flows: Average FCT, (b) Small Flows: Missed Deadlines, (c) All Flows: Average FCT, (d) Average FCT (Datamining), (e) Average FCT (Educational), (f) Average FCT (Private DC); schemes: Reno, Cubic, and DCTCP, each with and without T-RACKs.)
Figure 15: Performance metrics of the one-to-all scenario with background traffic for (a-c) Websearch, (d) Datamining, (e) Educational and (f) Private DC workloads, respectively.
Compared to Reno, Cubic and DCTCP, T-RACKs reduces the average FCT of small flows by ≈(34%, , ) for Websearch, ≈(18%, , −) for Datamining, ≈(69%, −, ) for Educational, and ≈(−, −, ) for Private DC workloads. We notice that DCTCP improves the FCT over its Reno and Cubic counterparts, and that T-RACKs further improves their performance in terms of missed deadlines in Websearch. In certain cases of the Educational and Private DC workloads, the average FCT shows a negligible increase with T-RACKs. In these workloads, the network load is very light (as shown by the small FCT without T-RACKs), and hence the added overhead of deploying the T-RACKs module outweighs its performance gains. ii) for the Websearch workload, T-RACKs reduces the missed deadlines for short flows by ≈(55%, , ) for Reno, Cubic, and DCTCP, respectively. iii) T-RACKs slightly improves the overall average FCT. This can be attributed to the fact that small flows finish their transmission quicker, leaving additional bandwidth for medium and large flows. The improvement equals ≈(16%, ) for Reno and Cubic, respectively. In Figure 13c, DCTCP with T-RACKs shows a slight increase in the average FCT of all flow types for the Websearch workload, which has many flows of medium size compared to the other workloads (Figure 10). While DCTCP is designed to improve the FCT of small flows in DC environments, T-RACKs is designed to curb RTO events so as to improve the FCT of small flows whose transmitted volume does not exceed the threshold γ. Hence, T-RACKs may add a slight overhead due to the need to maintain flow-table information for the medium and/or large flows. This overhead may be mitigated by simply skipping state maintenance for flows that exceed the threshold and become long-lived. Moreover, for TCP variants designed for the Internet (e.g., Reno and Cubic), the overhead is nearly negligible relative to the large FCT of their long-lived flows.

One-to-all Scenario with Background Traffic: to put T-RACKs under true stress, we run the same one-to-all scenario with all-to-all background traffic. Figures 15a, 15b and 15c show the average FCT and missed deadlines for small flows as well as the average FCT for all flows for Websearch, while Figures 15d, 15e and 15f show the average FCT for short flows for the Datamining, Educational, and Private DC workloads, respectively. We observe the following: i) T-RACKs improves the average FCT of small flows for all workloads, regardless of the TCP congestion control in use. As shown in the figures, compared to Reno, Cubic and DCTCP, T-RACKs reduces the average FCT of small flows by ≈(38%, , ) for Websearch, ≈(11%, , ) for Educational, and ≈(13%, , ) for Private DC workloads. The improvement increases to ≈(36%, , ) for the Datamining workload, since it includes a wider range of short flows. ii) T-RACKs reduces the missed deadlines for short flows of Websearch by ≈(40%, , ) for Reno, Cubic, and DCTCP, respectively. iii) T-RACKs still improves the overall average FCT by ≈(7%, , ) for Reno, Cubic, and DCTCP, respectively.

All-to-all Scenario without Background Traffic: we run the all-to-all scenario, where all clients initiate 1000 requests each to any of the servers in the data-center. Figures 16a, 16b, 16c and 16d show the average FCT for short flows in the Websearch, Datamining, Educational, and Private DC workloads, respectively.
(Figure 16 panels: (a) Websearch, (b) Datamining, (c) Educational, (d) Private DC; average FCT with error bars for Reno, Cubic, and DCTCP, each with and without T-RACKs.)
Figure 16: The average FCT of small flows in the all-to-all scenario for (a) Websearch, (b) Datamining, (c) Educational and (d) Private DC workloads, respectively.

The network load is considerably higher than in the previous cases, given the more complex nature of this all-to-all traffic. We can still see that T-RACKs delivers significant improvements of up to 71% in the FCT for all workloads.

In summary, the experimental results show the performance gains achieved by T-RACKs, especially for small flows, which constitute the lion's share of data-center traffic, without overly affecting the performance of larger flows. In particular, the results show that:
• T-RACKs reduces the variance of small flows' FCTs and the number of missed deadlines.
• T-RACKs maintains its gains even when bandwidth-greedy large flows hog the network.
• T-RACKs efficiently handles various workloads and is agnostic to the TCP congestion control variant.
• T-RACKs fulfills its requirements with no assumptions about, and no modifications to, the in-network hardware or the TCP/IP stack of the guest VMs.
VII. Conclusion and Future Work

In this paper, we studied packet losses and the impact of various recovery methods on flow performance. We then proposed T-RACKs, an efficient cross-layer approach for timely recovery from losses. T-RACKs improves the flow completion time of time-sensitive flows and helps avoid throughput-collapse situations. T-RACKs is deployed either at the sender side or the receiver side, as a shim layer residing between the virtual machines and the network hardware. Simulation and experimental results show that flow completion times are improved by up to an order of magnitude, missed deadlines are reduced considerably, and high link utilization is attained. T-RACKs is shown to be lightweight and practical owing to its minimal footprint on end-hosts. Finally, because it does not change TCP and adapts to any TCP flavor, T-RACKs is well suited to multi-tenant public data-centers. As part of our future work, we plan a larger-scale deployment in cloud environments such as AWS or Azure to investigate and analyze the effectiveness of the T-RACKs scheme at scale.

References

[1] A. M. Abdelmoniem and B. Bensaou. Efficient Switch-Assisted Congestion Control for Data Centers: an Implementation and Evaluation. In
IEEE IPCCC , 2015.[2] A. M. Abdelmoniem and B. Bensaou. Incast-Aware Switch-Assisted TCP Congestion Control for Data Centers. In
IEEEGLOBECOM , 2015.[3] A. M. Abdelmoniem and B. Bensaou. Reconciling Mice and Elephants in Data Center Networks. In
IEEE CLOUDNET ,2015.[4] A. M. Abdelmoniem and B. Bensaou. Curbing Timeouts for TCP-Incast in Data Centers via A Cross-Layer FasterRecovery Mechanism. In
Proceedings - IEEE INFOCOM , 2018.[5] A. M. Abdelmoniem, B. Bensaou, and A. J. Abu. Mitigating incast-tcp congestion in data centers with sdn.
Annals ofTelecommunications , 73(3), 2018.[6] A. M. Abdelmoniem, B. Bensaou, and V. Barsoum. Incastguard: An efficient tcp-incast mitigation mechanism for cloudnetworks. In
IEEE GLOBECOM , 2018.[7] A. J. Abu, B. Bensaou, and A. M. Abdelmoniem. A Markov Model of CCN Pending Interest Table Occupancy withInterest Timeout and Retries. In
IEEE ICC , 2016.[8] A. J. Abu, B. Bensaou, and A. M. Abdelmoniem. Inferring and Controlling Congestion in CCN via the Pending InterestTable Occupancy. In
IEEE Local Computer Networks (LCN) , 2016.[9] M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman. Data Center transportmechanisms: Congestion control theory and IEEE standardization. In , pages 1270–1277, 2008.[10] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data centerTCP (DCTCP).
ACM SIGCOMM CCR , 40:63, 2010.[11] M. Alizadeh, A. Kabbani, B. Atikoglu, and B. Prabhakar. Stability analysis of QCN.
ACM SIGMETRICS PerformanceEvaluation Review , 39(1):49, 2011.[12] M. Alizadeh, A. Kabbani, T. Edsall, and B. Prabhakar. Less is More : Trading a little Bandwidth for Ultra-Low Latencyin the Data Center. In
USENIX NSDI , 2012.[13] M. Alizadeh, S. Yang, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker. Deconstructing datacenter packet transport.In
ACM HotNets , 2012.[14] M. Allman and V. Paxson. On Estimating End-to-end Network Path Properties.
SIGCOMM CCR , 29:263–274, 1999.[15] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and H. Wang. Information-agnostic Flow Scheduling for Commodity DataCenters. In
USENIX NSDI , 2015.[16] H. Balakrishnan, S. Seshan, and R. H. Katz. Improving reliable transport and handoff performance in cellular wirelessnetworks.
Wireless Networks , 1:469–481, 1995.[17] T. Benson, A. Akella, and D. A. Maltz. Network Traffic Characteristics of Data Centers in the Wild. In
ACM IMC , 2010.[18] T. Benson, A. Akella, A. Shaikh, and S. Sahu. Cloudnaas: A cloud networking platform for enterprise applications. In
ACM Symposium on Cloud Computing (SoCC) , 2011.[19] T. Benson, A. Anand, A. Akella, and M. Zhang. Understanding data center traffic characteristics. In
ACM SIGCOMM ,2010.[20] bootlin.com. Elixir Cross Referencer. https://elixir.bootlin.com/linux/latest/source.[21] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson. Bbr: Congestion-based congestion control.
Commun.ACM , 60(2), 2017.[22] W. Chen, F. Ren, J. Xie, C. Lin, K. Yin, and F. Baker. Comprehensive understanding of TCP Incast problem. In
IEEEConference on Computer Communications (INFOCOM) , 2015.[23] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph. Understanding TCP incast throughput collapse in datacenternetworks. In , 2009.[24] B. Cronkite-Ratcliff, A. Bergman, S. Vargaftik, M. Ravi, N. McKeown, I. Abraham, and I. Keslassy. Virtualized CongestionControl. In
ACM SIGCOMM , 2016.[25] B. A. Greenberg, J. R. Hamilton, S. Kandula, C. Kim, P. Lahiri, A. Maltz, P. Patel, S. Sengupta, A. Greenberg, N. Jain,and D. A. Maltz. VL2: a scalable and flexible data center network. In
ACM SIGCOMM , 2009.[26] C. Guo, L. Yuan, D. Xiang, Y. Dang, R. Huang, D. Maltz, Z. Liu, V. Wang, B. Pang, H. Chen, Z.-W. Lin, and V. Kurien.Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. In
ACM SIGCOMM ,2015.[27] K. He, E. Rozner, K. Agarwal, Y. J. Gu, W. Felter, J. Carter, and A. Akella. AC/DC TCP: Virtual Congestion ControlEnforcement for Datacenter Networks. In
SIGCOMM , SIGCOMM ’16, pages 244–257, New York, NY, USA, 2016.ACM. [28] J. Huang, T. He, Y. Huang, and J. Wang. ARS: Cross-layer adaptive request scheduling to mitigate TCP incast in datacenter networks. In IEEE INFOCOM , 2016.[29] iperf. The Bandwidth Measurement Tool. https://iperf.fr/.[30] S. M. Irteza, A. Ahmed, S. Farrukh, B. N. Memon, and I. A. Qazi. On the coexistence of transport protocols in datacenters. In
IEEE ICC , 2014.[31] B. Jenkins. A hash function for hash table lookup. http://burtleburtle.net/bob/hash/doobs.html.[32] Jiao Zhang, Fengyuan Ren, Li Tang, and Chuang Lin. Taming TCP incast throughput collapse in data center networks.In
IEEE ICNP
USENIX NSDI , 2015.[35] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic. In
ACM IMC , NewYork, New York, USA, 2009.[36] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh, and M. Yu. HPCC:High Precision Congestion Control. In
ACM SIGCOMM , 2019.[37] A. M. Abdelmoniem, Y. M. Abdelmoniem, and B. Bensaou. On Network Systems Design: Pushing the PerformanceEnvelope via FPGA Prototyping. In
IEEE international Conference on Recent Trends in Computer Engineering (ITCE) ,2019.[38] A. M. Abdelmoniem and B. Bensaou. Enforcing Transport-Agnostic Congestion Control via SDN in Data Centers. In
IEEE Local Computer Networks , 2017.[39] A. M. Abdelmoniem and B. Bensaou. Hysteresis-based Active Queue Management for TCP Traffic in Data Centers. In
IEEE INFOCOM , 2019.[40] A. M. Abdelmoniem, B. Bensaou, and A. J. Abu. HyGenICC: Hypervisor-based Generic IP Congestion Control forVirtualized Data Centers. In
IEEE ICC , 2016.[41] A. M. Abdelmoniem, H. Susanto, and B. Bensaou. Taming Latency in Data centers via Active Congestion-Probing. In
IEEE ICDCS , 2019.[42] A. M. Abdelmoniem, H. Susanto, and B. Bensaou. Reducing latency in multi-tenant data centers via cautious congestionwatch. In
International Conference on Parallel Processing (ICPP) , 2020.[43] M. Mathis, J. Semke, J. Mahdavi, and T. Ott. The macroscopic behavior of the TCP congestion avoidance algorithm.
ACM SIGCOMM Computer Communication Review , 1997.[44] M. Mattess, R. N. Calheiros, and R. Buyya. Scaling MapReduce Applications Across Hybrid Clouds to Meet SoftDeadlines. In
IEEE AINA , 2013.[45] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats.TIMELY: RTT-based Congestion Control for the Datacenter. In
ACM SIGCOMM
USENIX FAST , 2008.[50] L. Popa, G. Kumar, M. Chowdhury, A. Krishnamurthy, S. Ratnasamy, and I. Stoica. FairCloud: Sharing the Network inCloud Computing.
ACM SIGCOMM CCR , 42:187–198, 2012.[51] C. Ruan, J. Wang, W. Jiang, G. Min, and Y. Pan. PTCP: A priority-based transport control protocol for timeout mitigationin commodity data center.
Future Generation Computer Systems , 102:619 – 632, 2020.[52] S. M. Rumble, D. Ongaro, R. Stutsman, M. Rosenblum, and J. K. Ousterhout. It’s time for low latency. In
USENIXHotOS , 2011.[53] A. S. Sabyasachi, H. M. D. Kabir, A. M. Abdelmoniem, and S. K. Mondal. A resilient auction framework for deadline-aware jobs in cloud spot market. In
IEEE Symposium on Reliable Distributed Systems (SRDS) , 2017.[54] S. Savage, N. Cardwell, D. Wetherall, and T. Anderson. TCP Congestion Control with a Misbehaving Receiver.
SIGCOMMComput. Commun. Rev. , 29(5):71–78, Oct. 1999.[55] S. Shukla, S. Chan, A. S.-W. Tam, A. Gupta, Y. Xu, and H. J. Chao. TCP PLATO: Packet Labelling to Alleviate Time-Out.
IEEE JSAC , 32(1), 2014.[56] H. Susanto, A. M. Abdelmoniem, H. Jin, and B. Bensaou. Creek: Inter many-to-many coflows scheduling for datacenternetworks. In
IEEE ICC , 2019.[57] H. Susanto, B. L. Ahmed M. Abdelmoniem, Honggang Zhang, and D. Towsley. A Near Optimal Multi-Faced JobScheduler for Datacenter Workloads. In
IEEE ICDCS , 2019.[58] A. S.-W. Tam, K. Xi, Y. Xu, and H. J. Chao. Preventing TCP Incast Throughput Collapse at the Initiation, Continuation,and Termination. In
International Workshop on Quality of Service, IWQoS '12, pages 29:1-29:9, Piscataway, NJ, USA, 2012. [59] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and effective fine-grained TCP retransmissions for datacenter communication. ACM SIGCOMM CCR, 39:303, 2009. [60] H. Wu, Z. Feng, C. Guo, and Y. Zhang. ICTCP: Incast congestion control for TCP in data-center networks.
IEEE/ACMTransactions on Networking , 21, 2013.[61] Y. Xu, S. Shukla, Z. Guo, S. Liu, A. S. . Tam, K. Xi, and H. J. Chao. RAPID: Avoiding TCP Incast Throughput Collapsein Public Clouds With Intelligent Packet Discarding.
IEEE JSAC , 37(8):1911–1923, 2019.[62] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In
USENIX HotCloud , 2010.[63] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz. DeTail : Reducing the Flow Completion Time Tail in DatacenterNetworks. In
ACM SIGCOMM , 2012.[64] J. Zhang, F. Ren, and C. Lin. Modeling and understanding TCP incast in data center networks. In
IEEE INFOCOM ,2011.[65] J. Zhang, F. Ren, L. Tang, and C. Lin. Modeling and Solving TCP Incast Problem in Data Center Networks.
IEEE TPDS ,26(2):478–491, 2015.[66] Y. Zhang and N. Duffield. On the Constancy of Internet Path Properties. In
ACM IMC, IMW '01, New York, NY, USA, 2001. ACM. [67] J. Zhuang, X. Jiang, G. Jin, J. Zhu, and H. Chen. PTCP: A Priority-Driven Congestion Control Algorithm to Tame TCP Incast in Data Centers.