T-RACKs: A Faster Recovery Mechanism for TCP in Data Center Networks
Ahmed M. Abdelmoniem and Brahim Bensaou
CS Department, Assiut University, Egypt
CSE Department, HKUST, Hong Kong
{amas,brahim}@cse.ust.hk
Abstract
Cloud interactive data-driven applications generate swarms of small TCP flows that compete for the small buffer space in data-center switches. Such applications require a short flow completion time (FCT) to perform their jobs effectively. However, TCP is oblivious to the composite nature of application data and artificially inflates the FCT of such flows by several orders of magnitude. This is due to TCP's Internet-centric design that fixes the retransmission timeout (RTO) to be at least hundreds of milliseconds. To better understand this problem, in this paper we use empirical measurements in a small testbed to study, at a microscopic level, the effects of various types of packet losses on TCP's performance. In particular, we single out packet losses that impact the tail end of small flows, as well as bursty losses that span a significant fraction of the small congestion window of TCP flows in data-centers, to show a non-negligible effect on the FCT. Based on this, we propose the so-called timely-retransmitted ACKs (or T-RACKs), a simple loss recovery mechanism to conceal the drawbacks of the long RTO even in the presence of heavy packet losses. Interestingly enough, T-RACKs achieves this transparently to TCP itself, as it does not require any change to TCP in the tenant's virtual machine (VM). T-RACKs can be implemented as a software shim layer in the hypervisor between the VMs and the server's NIC, or in hardware as a networking function in a SmartNIC. Simulation and real testbed results show that T-RACKs achieves remarkable performance improvements.
Index Terms
Data-center, Cross Layer, Fast Recovery, Kernel Module, TCP-Incast, Timeouts.
I. INTRODUCTION
The recent growth in data-center deployments worldwide is reshaping how the Internet and its applications operate. New, cloud-based, data-driven applications have emerged over the past years to harness the cost-effectiveness and scalability afforded by cloud computing. Most such applications rely on distributed programming and data storage frameworks such as Hadoop, HDFS, or Spark [62] for storing and processing large data sets. In such frameworks, master and aggregation nodes often require data transfers from hundreds of worker nodes to build a result. Due to the stringent timing requirements of interactive applications, a data transfer that misses a hard deadline because of excessive waiting for packet loss recovery returns a partial result (of lower quality). Hence, the quality of application results is correlated not only with the average latency but also with the tail of the latency distribution. For example, in practice, the high percentiles of the flow completion time (FCT) distribution can be anywhere between two and four orders of magnitude worse than the median or the average.

In small-scale private data-centers, CPU resources are often the bottleneck, and solutions that rely on task placement and scheduling already exist (e.g., [44]). In contrast, public data-centers seldom overload their server CPUs and usually have abundant computing resources; yet, they often adopt high over-subscription ratios in the network. As a result, network latency becomes the main performance bottleneck [52]. This is typical for many Internet-scale applications deployed on public (IaaS) clouds such as Microsoft Azure or Amazon EC2.

Measurements in real production data-centers [10, 34, 35, 49, 60] have shown over the years that applications producing small traffic flows predominate and that incast congestion events and excessive packet losses are frequent. To circumvent such problems, large corporations such as Microsoft, Facebook, and Google dedicate well-structured data-centers to deploying their time-sensitive applications. Smaller-scale private data-centers address the problem by deploying homogeneous custom-designed TCP variants (e.g., DCTCP [10]) on all the VMs in the data-center. In stark contrast, multi-tenant and public data-centers, where many-to-one (or many-to-many) communication patterns predominate, are populated with a large variety of TCP versions with different behaviors in the face of congestion [19, 25, 35]. As a direct consequence of this heterogeneous environment, unfairness in congestion resolution is inevitable and often leads to repeated packet losses and a long-tailed latency distribution for small flows. In particular, since commodity Ethernet switches are the backbone of all intra-data-center communications, their small buffer space can quickly be fully occupied by a few (large) TCP flows. This is because TCP has a natural tendency to fill up the bottleneck bandwidth of the communication path.

This raises two issues: i) small flows do not last long enough to capture their fair share of the buffer from ongoing large flows, as their TCP sending window cannot grow large enough before a packet loss is experienced; ii) when a sudden swarm of small flows (usually co-flows) surges while the buffer is occupied by other flows, incast congestion loss events become inevitable. In this case, a burst of packet losses from many such flows takes place.
Bursty losses with small congestion windows often leave an insufficient number of TCP segments in flight to trigger TCP's fast-retransmit and recovery mechanism. As a consequence, small flows often experience timeouts. The retransmission timeout (RTO), which is orders of magnitude larger than the actual round-trip time (RTT), thus contributes the lion's share to the long FCT and the number of missed deadlines experienced by small flows in data-centers.

In this paper, we study the impact of the RTO on the performance of TCP applications in data-centers and propose a simple mechanism to shield small TCP flows from the negative effects of the RTO, without changing TCP itself. Our methodology adopts a two-phased approach:
i) First, to fully understand the impact of the RTO on the FCT of small flows, we conduct an empirical study of the loss events in a small data-center, by examining the nature of the recovery mechanism invoked by TCP for each segment loss. To this end, we trace TCP traffic flows microscopically at the socket level in the Linux kernel. Then, by analyzing the collected traces, we study the frequency of occurrence of the two TCP loss recovery mechanisms (viz., RTO and Fast Retransmit and Recovery, or FRR) with respect to the TCP window size. We show that tail-end losses and bursty losses primarily cause RTOs, and while they have a less dramatic effect on the latency of large flows, their impact on the performance of small flows is tremendous.
ii) Second, to prevent RTOs from artificially inflating the actual loss recovery delays of small flows inside data-centers, and without modifying TCP, we propose, implement, and study the performance of a new mechanism to conceal the long retransmission timeout. This mechanism forces TCP in the VM to go into the FRR mode whenever a segment is estimated to be likely to experience a timeout, long before it does. We implement the resulting so-called T-RACKs in a real testbed and study its performance with realistic traces. An earlier version of this work was published in INFOCOM 2018 [4]. The implementation, simulation, and experimental code and scripts are publicly available at http://ahmedcs.github.io/T-RACKs.

In the remainder, supported by an empirical study, we show in Section III the dramatic impact of the RTO on the performance of small flows. In Section IV, we present the proposed methodology and system design. In Section V, we discuss the packet-level simulation results in detail. Then, in Section VI, we present the experimental results from the testbed deployment. We discuss important related work in Section II. Finally, we conclude the paper in Section VII.

II. RELATED WORK
Several works have found, via measurements and analysis, that TCP timeouts are the root cause of most throughput and latency problems in data-center networks [1, 5, 37, 39, 41, 42, 55, 59]. Other works [6, 22, 23, 30, 32, 38, 40, 53, 58, 63-65] analyzed the nature of incast events and packet drops in data-centers (a similar analysis was done for Content-Centric Networks (CCNs) [7, 8]). They also found that severe incast occurrences can lead to throughput collapse and longer FCT. They show in particular that throughput collapse and increased FCT are to be attributed to the timeout mechanism, which is ill-suited to data-centers. For example, [59] showed that frequent timeouts can harm the performance of latency-sensitive applications.

Numerous solutions have been proposed; these fall into one of four fundamental categories. The first mitigates the consequences of the long waiting times due to RTO by reducing the default RTO_min (the minimum RTO is 200 ms in Linux and 300 ms in Windows) to the 100 µs - 2 ms range [59]. While very useful, this approach affects the sending rates of TCP by forcing it to cut CWND to 1; it relies on a static RTO_min value, which can be ineffective in heterogeneous networks; and it imposes modifications to the TCP stack on the tenant's VM (notice that in public data-centers, under the IaaS model, the operating system and thus the protocol stack in the VM is under the full control of the tenant and cannot be modified by the cloud service provider). Our approach is fundamentally different in that it enforces the RTO_min via dupACKs, which allows Internet and data-center flows to be handled differently. Therefore, T-RACKs allows flows to have different RTO_min values, easily imposed via the flow tables.

The second approach aims at controlling queue build-up at the switches by relying on ECN marks to limit the sending rate of the servers [8, 10, 34], or by controlling the congestion window [32] or receiver window [2, 3, 60] of TCP flows. Similar approaches deployed a global traffic scheduler [12, 13, 15, 56, 57] or tracked fine-grained sub-microsecond updates in RTT to detect congestion [45]. All of these works achieved their goals and have shown that they can reduce the FCT of short flows while achieving high link utilization. However, they require modifications of the TCP stack or introduce a completely new switch design, are prone to fine-tuning of parameters, and sometimes require application-side information. They also increase the CPU utilization of the end-hosts. [45] is sensitive to traffic variations on the backward path. [21] is a new congestion control for inter-DC traffic based on characterizing the bandwidth and RTT of the bottleneck path; while effective for high bandwidth-delay product (BDP) paths, its minutes-level measurements and aggressive start can exacerbate the problems in the low-BDP networks of data-centers.

The third approach is to achieve efficient sharing of network resources or enforce flow admission control to reduce the timeout probability [18, 28, 50, 55]. [28] proposed ARS, a cross-layer system that dynamically adjusts the number of active TCP flows by batching application requests based on the congestion state sensed by the transport layer.

The last approach, which is adopted in this paper due to its simplicity and feasibility, is to recover losses using fast retransmit rather than waiting for a long timeout. For instance, TCP-PLATO [55] proposed changing the TCP state machine to tag specific packets using IP-DSCP bits; tagged packets are preferentially queued at the switch to reduce their drop probability, enabling dupACKs to be received and trigger FRR instead of waiting for the timeout. Even though TCP-PLATO is effective in reducing timeouts, its performance degrades whenever tagged packets are lost. In addition, the tagging may interfere with the operation of middle-boxes or other schemes, and most importantly, it modifies the TCP state machine of the sender and receiver.

Similar to DCTCP, DCQCN [9, 11] and HPCC [36] were proposed as end-to-end congestion control schemes implemented in custom NICs designed for RDMA over Converged Ethernet (RoCE). DCQCN and HPCC apply adaptive rate control at the link layer to throttle large flows, relying on Priority-based Flow Control (PFC) with RED-ECN marking, and on In-Network Telemetry (INT) information, respectively. DCQCN not only relies on PFC, which adds to the network overhead, but also introduces the extra cost of the explicit congestion notification packets exchanged between the end-points. HPCC requires programmable NICs and relies on the timely availability of the INT information, which not only increases the packet size by 42 bytes for each hop in the path but is also subject to congestion and contention with other traffic in the network.

More recent approaches have also identified the importance of the timeout problem in data-center environments [51, 61, 67]. For instance, the authors of [51] proposed injecting a high-priority packet after each window's worth of packets. They infer network congestion by checking the sequences of the received high/low-priority ACKs and consequently adjust the sending window to reduce buffer occupancy and detect losses early. However, this not only requires setting up priority queues (if available) in the switches but also imposes extra processing and communication overhead (especially since the congestion window typically consists of only a few packets in data-centers).
III. PROBLEM AND MOTIVATION
Before we start discussing our empirical study of TCP and presenting our solution, let us first shed some light on the motives that led us to adopt such a non-traditional approach by contrasting it against alternative methods. In particular, while our approach is straightforward, it turns out to be very effective because it stems from a full understanding of the large number of incremental mechanisms that have been added to TCP over the years. Many alternative schemes proposed in the literature deal with TCP congestion in data-centers in a classic Internet-centric manner by invoking mechanisms such as RED. This approach is flawed for three major reasons:
i) RED is a mechanism that was designed for the Internet. Its goal is to reduce the average queuing delay experienced by packets in the huge buffers of routers, which contributes a large proportion of the end-to-end delay and delay-jitter. In contrast, data-centers use high-speed switches with small buffers; therefore, the contribution of queuing delay to the total FCT is not as dramatic as in the case of the Internet, regardless of the buffer occupancy. And so, maintaining a small queue does not help the FCT.
ii) With increasing link speeds in modern data-centers, the interplay between propagation delay and transmission delay is transformed, rendering control mechanisms valid for one no longer valid for the other. For example, in data-centers with 1 Gbps network interfaces, the transmission time of a single 1500-byte IP packet is about 12 microseconds, while the round-trip time over a 600 m path at the speed of light is about 6 microseconds. So, there can be at most one packet spread over a link between any two adjacent interfaces in the network (e.g., server NIC to ToR port, ToR port to aggregation port, and so on). In contrast, with 40 Gbps network interfaces, it takes only 0.3 microseconds to complete the transmission of an IP packet, yielding a possible flight of up to 33 IP packets per hop. So, early detection and notification via buffer thresholds with the small buffers that exist in the switches are ineffective, and excessive packet losses are inevitable.
iii) Packet losses in TCP per se are not the reason for these problems; excessive congestion is. Packet losses are merely symptoms of congestion, so there is no reason to try to curtail them completely as long as we can recover from losses fast. In fact, curtailing packet losses completely results in a non-competitive behavior that yields poor performance in heterogeneous TCP environments. For example, pitting TCP Vegas against TCP New Reno, Cubic, or DCTCP results in poor performance for the former. As a consequence, eliminating packet losses in data-center networks while maintaining a high link utilization is not helpful. Instead, we propose to pinpoint the true reasons for increased delays in data-centers and to tackle such reasons directly [4].

Several measurement studies [19, 25, 35] have been conducted on data-centers and have shown that latency in such environments varies greatly. To further understand the reasons behind this, we deep-dive into a packet-level analysis of the flows and the TCP socket state variables at a microscopic level to understand TCP's behavior and its loss recovery mechanisms. An early work [59], based on data-center measurements, found that the timeout mechanism is to blame for the long waiting times and proposed the very simple yet effective solution of reducing the RTO_min value for TCP in data-center environments while using high-resolution timers to keep track of delays at the microsecond level. This approach actually solves the problem, reduces the FCT, and mitigates TCP-incast congestion effects. However, i) it requires the modification of TCP, and as such it is inappropriate for public data-centers where multiple tenants can upload their own version of the OS; and ii) there is no "magical" value of RTO_min that fits all possible environments. For instance, an RTO_min that works inside the data-center (e.g., between a web server and the back-end database server) will lead to spurious timeout events for Internet-facing connections (e.g., the connection between a web administrator's workstation and the server in the data-center).
Table I: TCP API Calls Intercepted by the LossProbe Module

Function Call        Description
tcp_set_state        Handles and updates the TCP connection state
tcp_v4_do_rcv        Handles the arrival of all types of TCP segments
tcp_retransmit_skb   Retransmits one SKB; policy decisions and retransmit-queue state updates are done by the caller
tcp_v4_send_check    Computes an IPv4 TCP checksum
Table II: TCP Socket-level State Information Logged by the LossProbe Module

DataType  Variable          Description
uint      lost_out          Count of lost packets
uint      prr_out           Count of packets sent during recovery
uint      prr_delivered     Count of packets delivered during recovery
uint      prior_cwnd        Congestion window at the start of recovery
uint      prior_ssthresh    SSThresh saved at recovery start
uint      total_retrans     Count of retransmits for the entire connection
uint      retransmit_high   Highest sequence number marked for retransmission
A recent RFC [46] proposed the so-called tail loss probe (TLP) mechanism, which recommends sending TCP probe segments whenever ACKs do not arrive within a short Probe TimeOut (PTO, set to min(2·SRTT, 10 ms) when more than one segment is in flight). In addition to requiring changes to TCP, this approach suffers from two additional problems: i) probe packets may also be lost; and ii) probe packets may worsen the in-network congestion, especially during TCP-incast.

A. Impact of RTO on the FCT
In data-centers, partition/aggregate applications that generate small flows are challenged by the combination of small buffers, large initial sending windows, an inadequate RTO_min, and slow-start's exponential increase. This combination of hardware and TCP configuration frequently leads to timeout events for such applications. In particular, when the number of flows they generate is large and roughly synchronized, incast TCP synchronized losses occur. As the loss probability increases linearly with the number of flows [43], the flow synchronization and the excessive losses lead to throughput collapse for small flows.

To illustrate this, consider a simplified fluid-flow model with N flows sharing a link of capacity C equally. Let B be the flow size in bits and n be the number of RTTs it takes to complete the transfer of one flow. The optimal throughput ρ* can simply be expressed as the ratio of the flow size to its average transfer time: ρ* = B / (nτ + BN/C). That is, it takes BN/C to transmit the B bits, plus an additional queuing and propagation delay of τ seconds for each of the n RTTs. In practice, when TCP incast congestion involving N flows results in throughput collapse, a flow experiences one or more timeouts and recovers only after waiting for the RTO. The actual throughput then writes: ρ = B / (RTO + n′τ′ + BN/C), where typically n′ ≥ n and τ′ ≥ τ. In addition, in data-centers, the typical RTT is on the order of a few hundred microseconds, while existing TCP implementations impose a minimum RTO (i.e., RTO_min) of about 200 to 300 ms. As a consequence, large flows yield values of n′ such that n′τ′ is similar to or greater than the RTO. In contrast, small flows only last for a few RTTs; therefore, RTO ≫ n′τ′.
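As a concrete sanity check, consider illustrative numbers (these specific values are assumptions chosen for illustration: a small flow of B = 10 MSS ≈ 116.8 kbits, N = 10 concurrent flows, C = 1 Gbps, n = n′ = 3 RTTs, τ = τ′ = 100 µs, and RTO = 200 ms):

    ρ* = B / (nτ + BN/C) = 116.8 kb / (0.3 ms + 1.17 ms) ≈ 79 Mb/s
    ρ  = B / (RTO + n′τ′ + BN/C) = 116.8 kb / (200 ms + 0.3 ms + 1.17 ms) ≈ 0.58 Mb/s

In other words, a single timeout turns a transfer that should complete in roughly 1.5 ms into one that takes over 200 ms, a degradation of more than two orders of magnitude.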
And so, when a small flow experiences a loss that cannot be recovered by three duplicate ACKs, it systematically incurs an FCT that is orders of magnitude larger than it should be.

B. Analyzing TCP congestion recovery
To investigate why packet losses seem to affect large flows only marginally yet degrade the performance of small flows dramatically, we collected and examined socket-level TCP flow state information from a Websearch workload [10] in a small-scale data-center testbed.
Figure 1: (a-b) show the retransmission size relative to CWND, while (c-d) show the loss position relative to CWND: (a) FRR size relative to CWND; (b) RTO size relative to CWND; (c) FRR position relative to CWND; (d) RTO position relative to CWND. Each plot aggregates the data of all servers in the data-center.

First, we implemented a socket-level monitoring module, named hereafter "LossProbe", based on KProbes/JProbes [33] in the Linux kernel. Probes are dynamic debugging tools which, in our case, allow us to intercept the TCP event handlers and API calls listed in Table I, where we log the target TCP socket-level state information. The module works as follows:
1) JProbe requires the address of the kernel function to trace; hence, the target TCP handlers of the events of interest have to be identified in the Linux kernel source code base [20]. For example, tcp_retransmit_skb is the function called in the kernel to retransmit a TCP segment.
2) Then, a handler function is defined that performs certain actions upon entry of the traced function (e.g., printing a debugging message when the target kernel function is invoked). In the probe module, that function is defined, for convenience, with the name of the original probed function prefixed by "j" (e.g., jtcp_retransmit_skb), and JProbe calls it upon entering the original function.
3) The monitoring module is dynamically installed into the kernel, and the probed workload (or experiment) is invoked. Upon entry of the targeted functions, the jprobe handler is called to collect the state information of interest and write it into an in-RAM buffer, which in turn is flushed periodically and asynchronously onto the file system to avoid artificially stalling the datapath.
Using our custom-built traffic generator, we replicate a Websearch workload [10] consisting of thousands of flows and collect measurements on the data listed in Table II from all the servers in our testbed (the code for the LossProbe module is publicly available at https://github.com/ahmedcs/TCP_loss_monitor/).
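For illustration, a minimal sketch of how such a probe can be attached to tcp_retransmit_skb on the 3.x kernels used in our testbed is shown below. The handler signature mirrors that kernel generation's tcp_retransmit_skb(struct sock *, struct sk_buff *); the module name and the particular fields printed are illustrative only, and the full LossProbe module linked above logs the complete set of variables in Table II.

    #include <linux/module.h>
    #include <linux/kprobes.h>
    #include <net/tcp.h>

    /* JProbe handler: same prototype as the probed function on 3.x kernels. */
    static int jtcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
    {
            const struct tcp_sock *tp = tcp_sk(sk);

            /* Log a few of the socket-level variables of Table II. */
            pr_info("lossprobe: retransmit cwnd=%u lost_out=%u total_retrans=%u\n",
                    tp->snd_cwnd, tp->lost_out, tp->total_retrans);

            jprobe_return();   /* mandatory: hand control back to the real function */
            return 0;          /* never reached */
    }

    static struct jprobe lossprobe = {
            .entry          = jtcp_retransmit_skb,
            .kp.symbol_name = "tcp_retransmit_skb",
    };

    static int __init lossprobe_init(void) { return register_jprobe(&lossprobe); }
    static void __exit lossprobe_exit(void) { unregister_jprobe(&lossprobe); }

    module_init(lossprobe_init);
    module_exit(lossprobe_exit);
    MODULE_LICENSE("GPL");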
Figure 2: (a) the CDF of CWND at the time of transmission of the lost packet; (b) FCT with and without TLP for the Websearch and Datamining workloads.

We summarize our findings in several figures that reflect TCP's behavior with respect to the mechanism invoked to recover from packet losses (i.e., Fast Retransmit and Recovery, or Retransmission Timeouts). Fig. 1a shows the distribution (on the ordinate) of the size of each retransmission (on the abscissa) for FRR-based recovery events; the retransmission size is computed from the sequence numbers of the lost segments and normalized to the Cwnd (a bar for 0-10 refers to the probability that 0-10% of the CWND is lost, while a bar for 90-100 refers to the probability that 90-100% of the CWND is lost). Similarly, Fig. 1b shows the same metric for RTO-based recovery events. Fig. 1c and Fig. 1d show the distribution (on the ordinate) of the loss index or position (on the abscissa) in the case of FRR- and RTO-based recovery, respectively. The index points to the first retransmitted segment (for several consecutive segment losses), relative to the Cwnd when the segment was first transmitted (i.e., before a loss is detected); here, a bar for 0-10 refers to the probability that the loss occurs in the first 0-10% of the segments in the Cwnd, while a bar for 90-100 refers to the probability that the loss occurred in the last 90-100% of the segments.

Analyzing these results, we can draw the following conclusions. Fig. 1a suggests that the FRR loss size is distributed over the range of packets with a positive skew towards the first few fractions of the window (i.e., the probability of losing more than 30% of the window is insignificant). However, Fig. 1b shows that this is not the case for RTO, which appears well distributed with a positive skew towards the tail end of the window (i.e., the probability of losing more than 30% of the window is significant). Also, there are only a few RTOs far away from the tail; these represent lost packets within the same congestion window. Fig. 1c points out that losses at the tail of the window occur with higher frequency for RTO events; in the case of FRR, however, the Cwnd is large enough for TCP to receive a sufficient number of duplicate ACKs, which allows for fast recovery. Similarly, Fig. 1d shows the same trend with a higher frequency at the tail; in this case, however, the Cwnd is relatively small and hence contains too few in-flight packets to allow for FRR, so eventually RTO recovery occurs. To elaborate, we see in Fig. 2a that the Cwnd of flows with segments that experience RTO is typically smaller than that of flows that recover via FRR.

Finally, we also show the ineffectiveness of the TLP mechanism cited above [46] in recovering tail losses in Fig. 2b. The figure shows that the TLP mechanism is not effective and, due to its additional overhead, may even increase the FCT of small flows.

In data-centers, the size of the pipeline is small: typically, with an RTT of 100 µs, a link of 1 Gbps (resp. 10 Gbps) can accommodate 8.3 packets (resp. 83 packets). In conjunction with shallow-buffered switches, the nominal TCP fair share during TCP-incast barely exceeds one packet per flow, and hence the occurrence of an RTO is highly likely. This phenomenon highlights how TCP's performance can be degraded when operating in a small-window regime with small buffers in high-bandwidth, low-delay environments like data-centers. The effect on the FCT is more severe for small, time-sensitive flows, which generally last only a few RTTs but are compelled to wait for 2 to 4 orders of magnitude of extra time due to the RTO_min rule.
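The pipeline sizes quoted above follow directly from the bandwidth-delay product, using a 1500-byte packet and the 100 µs RTT mentioned in the text:

    BDP(1 Gbps)  = C × RTT / packet size = (10^9 b/s × 100 µs) / (1500 × 8 b) ≈ 8.3 packets
    BDP(10 Gbps) ≈ 83 packets

Divided among tens of synchronized incast flows, this leaves well under one packet in flight per flow, which is precisely the regime in which a tail loss cannot generate three duplicate ACKs.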
IV. SYSTEM DESIGN AND IMPLEMENTATION
T-RACKs' design is based on the following observation: packet losses are inevitable in TCP, so the key to reducing the long latency and jitter is not to try to avoid losses completely, but instead to avoid long waiting after the losses occur. In achieving this, T-RACKs must also: (R1) improve the FCT of latency-sensitive applications by expediting the transmission of small flows' packets; (R2) be friendly to throughput-sensitive large flows (i.e., it must not sensibly degrade the throughput of large flows to satisfy the delay requirements of latency-sensitive flows); (R3) be compatible with all existing TCP versions (i.e., it must not impose modifications inside the virtual machines, and if any are needed, they shall be in the hypervisors, which are fully under the control of the data-center operator; it also must not require any extra special hardware); and (R4) finally, the mechanism must be simple enough to be easily deployable in real data-centers.

In this perspective, T-RACKs actively infers packet losses by monitoring (in the hypervisor) per-flow TCP ACK numbers and proactively triggers TCP's FRR mechanism whenever an RTO is deemed likely to take place. The goal is to help small TCP flows that would otherwise experience a timeout recover fast via FRR instead of waiting for TCP's RTO_min. The proposed mechanism intervenes only when the loss is almost certain, leading to a significant improvement of recovery times, and hence of the FCT. The T-RACKs design derives from the following arguments: i) all TCP versions adopt the FRR mechanism as a way to detect and recover from losses fast; so, if the FRR mechanism can be forced into action by the hypervisor, regardless of the nature of the loss, the resulting system is transparent to the TCP protocol in the VM and requires no changes to TCP in the VM; ii) TCP relies on a small number of duplicate ACKs to activate FRR; however, in the majority of cases (especially for short flows), there are not enough packets in flight to trigger duplicate ACKs. To achieve this, we propose to use "spoofed" TCP ACK signaling from the hypervisor to the VM. In this perspective, the hypervisor maintains a per-flow timer β = α · RTT + rand(RTT) to wait for the ACKs before it triggers FRR with spoofed duplicate ACKs.

Note that our idea is similar in spirit to the so-called TCP SNOOP protocol [16], which retransmits lost segments on behalf of the communicating end-points to filter out bit errors in low-speed wireless networks. As such, TCP SNOOP could also be applied in data-centers. However, it is expensive to implement, as it requires buffering all sent segments at the lower layers (e.g., link layer or hypervisor), which requires an ample buffer space in data-centers. T-RACKs, in contrast, triggers the retransmissions from the actual TCP protocol in the VM instead of buffering and retransmitting the packets itself. It requires no packet buffering at all; it only relies on memorizing a few state variables from the last segment and the last ACK received of each flow.

A. T-RACKs Algorithm
The T-RACKs algorithm consists broadly of three major functions: the first two maintain per-flow state information on the server (hypervisor) upon the departure and arrival of packets, as shown in Algorithm 1, and the third is a timer event handler described in Algorithm 2. In the initialization phase of Algorithm 1, an in-memory flow cache pool is created to track new flow arrivals; this approach speeds up the creation of flow objects. A hash-based flow table is created and manipulated via the Read-Copy-Update (RCU) synchronization mechanism to efficiently identify flow entries. The other parameters and variables are set in this step as well. Before each TCP segment departure, T-RACKs performs the following actions: i) the packet is hashed using its 4-tuple (source and destination IP addresses and port numbers), and the corresponding flow is identified; ii) if this is a SYN packet or the flow entry is inactive (i.e., a new flow), the flow entry is reset, then the TCP header information and options are extracted to activate a new flow record; and iii) if this is a data packet, the last sent sequence number and time of the flow are updated.

Next, upon each TCP ACK arrival, the algorithm performs the following actions: i) the flow entry is identified using its 4-tuple; if the flow is large, we ignore it, as it does not undergo recovery via T-RACKs (by doing so, the complexity of the scheme is reduced); ii) if the ACK sequence number acknowledges a new packet, the last seen ACK sequence number and time are updated and the dupACK counter is reset; if the accumulated flow size exceeds a threshold γ, the flow is marked as a large flow (so that we can stop tracking it); iii) if the ACK number acknowledges an old packet (i.e., this is a duplicate ACK), we drop the dupACK if the flow is in recovery mode, or otherwise increment the number of dupACKs seen so far; iv) the TCP header information of the ACK is updated if necessary (we discuss this part in more detail in Section IV-C).

Algorithm 2 handles the periodic global timer expiry events and performs the following actions for all active, non-large flows in the table. In a typical implementation, this timer lasts 1 ms and is processed regularly with the OS clock timer interrupt (i.e., it does not require special high-resolution timers): i) if no new ACK acknowledging new data has arrived for β seconds since the last new ACK arrival, the flow is deemed likely to experience a timeout; T-RACKs then enters into action, spoofs an ACK using the last successfully received ACK sequence number, and sends it out to the sending process or VM residing on the same end-host. An exponential backoff mechanism is activated to account for the various dupACK thresholds set by the sender's TCP or OS. ii) If, with the timer β backed off by the number of retransmissions x of the spoofed ACK, the flow still has not received a new ACK, another spoofed ACK is created and sent out to the corresponding sender. To ensure T-RACKs does not send spurious spoofed dupACKs, the algorithm backs off exponentially, i.e., after each transmission of a spoofed ACK the timer β is doubled. iii) If the backoff time approaches RTO_min (i.e., 200 ms), we stop triggering Fast-Retransmit (by resetting the soft state) and let the sender's TCP RTO timer handle the recovery of this segment. iv) Finally, if the inactivity period exceeds 1 second, the flow entry is hard reset.
Algorithm 1: T-RACKs Packet Processing

/* Initialization */
Create an in-memory flow cache pool;
Create the flow table and reset flow information;
Initialize and insert the NetFilter hooks (for a NetFilter-based implementation);

Input: α: multiplicative factor applied to the RTT in the ACK RTO (β) calculation
Input: γ: a threshold in bytes to stop tracking a flow as small
Input: φ: the dupACK threshold used by TCP flows
Input: t: the current local time counted in jiffies
Define x: the exponential backoff counter

Function Outgoing Packet Event Handler(Packet P):
    f = Hash(P);
    if SYN(P) or !f.active then
        Reset Flow(f);
        Extract TCP options (i.e., TStamp, SACK, etc.);
        Update the flow information and set f.active;
    if DATA(P) then
        Update flow info (i.e., last sent sequence number);
        f.active_time = now();

Function Incoming Packet Event Handler(Packet P):
    /* For ACKs: extract and update flow information from the incoming header */
    f = Hash(P);
    if f.long_lived then return;
    if ACK_bit_set(P) then
        Extract the required values (e.g., ACK sequence number);
        if New ACK then
            Update flow entry and state information (e.g., RTT);
            Update the last seen ACK number from the receiver;
            Reset f.dupAck_Nr = 0;
            Reset f.ACK_time = now();
            if f.lastAckNo ≥ γ then f.long_lived = true;
        else if Duplicate ACK then
            f.dupAck_Nr = f.dupAck_Nr + 1;
            /* Drop extra dup-ACKs while in recovery */
            if f.resent > 0 then Drop Dup ACK;
        Update TCP headers (i.e., TStamps, SACK, etc.);
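To make the per-flow soft state concrete, the following user-space C sketch mirrors the incoming-ACK branch of Algorithm 1 (struct and function names such as flow_entry, on_ack, and GAMMA_BYTES are illustrative rather than the module's actual identifiers, and sequence-number wrap-around is ignored for brevity):

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define GAMMA_BYTES (100 * 1024)   /* gamma: stop tracking a flow as small (illustrative) */

    struct flow_entry {                /* minimal per-flow soft state kept by the shim layer   */
            bool     active, long_lived;
            uint32_t last_ack_no;      /* highest cumulative ACK seen so far                   */
            uint32_t dupack_nr;        /* duplicate ACKs since the last new ACK                */
            uint32_t resent;           /* spoofed ACKs injected so far (recovery mode if > 0)  */
            uint64_t bytes_acked;
            uint64_t ack_time_ns;      /* arrival time of the most recent new ACK              */
            uint64_t active_time_ns;   /* time of the most recent outgoing data segment        */
            uint64_t resent_time_ns;   /* time of the most recent spoofed ACK                  */
    };

    static uint64_t now_ns(void)
    {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* Incoming-ACK handler; returns false if a duplicate ACK should be dropped. */
    static bool on_ack(struct flow_entry *f, uint32_t ack_no)
    {
            if (f->long_lived)
                    return true;                       /* large flows are not tracked         */
            if (ack_no > f->last_ack_no) {             /* new data acknowledged               */
                    f->bytes_acked += ack_no - f->last_ack_no;
                    f->last_ack_no  = ack_no;
                    f->dupack_nr    = 0;
                    f->resent       = 0;
                    f->ack_time_ns  = now_ns();
                    if (f->bytes_acked >= GAMMA_BYTES)
                            f->long_lived = true;      /* exceeded gamma: stop tracking       */
            } else {                                   /* duplicate ACK                        */
                    f->dupack_nr++;
                    if (f->resent > 0)                 /* in recovery: drop extra dupACKs      */
                            return false;
            }
            return true;
    }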
B. T-RACKs System Implementation
Algorithm 1 relies on the TCP header information of ACK packets to maintain per-flow TCP state information. In this paper, we only consider a lightweight end-host (hypervisor) shim-layer implementation to achieve this (in higher-speed networks, T-RACKs could equally be implemented as a networking function on SmartNICs). This approach is perfectly feasible even for production data-centers, because the number of flows on a server in production data-centers has been reported to be small, generally not exceeding 30-40 [10]. In addition, the number of flows tracked by T-RACKs is further reduced on average by only tracking small flows, abandoning large flows whenever they grow to reach a certain size threshold. The deployment of T-RACKs in data-centers involves hashing the flows into a hash-based flow table using the 4-tuple (i.e., SIP, DIP, Sport, and Dport) whenever SYN packets arrive or a flow sends data after a long silence period. For instance, referring to Figure 3, when VM1 on the sender end-host establishes a connection with its peer (or destination) VM on the receiving end-host, a new flow entry (S1-D1) is created in the table, as shown in Fig. 3. Also, although not shown in the algorithm code, flow entries are cleared from the table whenever a connection is closed (following the TCP connection tear-down FIN/FIN-ACK) or after a pre-set inactivity time threshold is exceeded. The flow table could track many relevant TCP-related per-flow state variables; however, for T-RACKs to perform the fast recovery function, it needs to track only a minimal set of TCP state variables (including the highest ACK sequence number seen so far and the arrival time of the most recent ACK).

The T-RACKs system uses the flow table to store and update TCP flow information, including the last ACK number, the last sent sequence number, the corresponding times, the RTT of the flow measured using the TCP timestamp option, as well as the optional TCP SACK information for each outgoing TCP flow (if TCP SACK is active, TCP's response to duplicate ACKs differs from the standard behavior, so this must be taken into account to elicit a proper reaction). T-RACKs intercepts the incoming ACKs and outgoing data to update the current state of each tracked small flow. When packets are dropped by the network and the receiver receives enough out-of-sequence data to generate sufficient (real) dupACKs, the loss is recovered via FRR by the VM without the intervention of T-RACKs. However, when the receiver fails to receive enough data to generate enough real dupACKs to trigger FRR, T-RACKs intervenes after a timer (1 ms) by sending spoofed dupACKs (or RACKs, for retransmitted ACKs) to the sender. Typically, the sender then receives enough dupACKs and RACKs to trigger FRR and retransmit the lost segment within a reasonable time, long before the TCP RTO is reached.
Algorithm 2: T-RACKs Timeout Handler

Create and initialize a timer to trigger every 1 ms;

Function Timer Expiry Event Handler:
    for Flow (f) ∈ FlowTable do
        β = α * f.RTT + rand(f.RTT);
        if !f.active or f.long_lived then Continue;
        T = MAX(f.ACK_time, f.active_time);
        if now() - T ≥ β then
            Resend the last ACK (φ − f.dupAck_Nr) times;
            Set f.resent_time = now();
            Set x = 2;
            Continue;
        if now() - f.resent_time ≥ (β << x) then
            Resend the ACK one more time;
            x = x + 1;
            Continue;
        if (now() - f.ACK_time) ≥ RTO_min then
            Stop T-RACKs recovery; soft reset the recovery state of flow (f);
            Continue;
        if (now() - f.active_time) ≥ 1 sec then deactivate_flow(f);
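Similarly, the per-flow check performed at every 1 ms tick in Algorithm 2 can be sketched as follows, reusing the flow_entry structure from the previous sketch; send_spoofed_ack() and deactivate_flow() are hypothetical hooks standing in for the actual RACK injection and table clean-up paths, and the constants are the illustrative defaults quoted in the text:

    #include <stdlib.h>

    #define ALPHA          10                        /* beta = ALPHA * RTT + rand(RTT)       */
    #define PHI            3                         /* dupACK threshold assumed for TCP FRR */
    #define RTO_MIN_NS     (200ull * 1000 * 1000)    /* 200 ms                               */
    #define IDLE_RESET_NS (1000ull * 1000 * 1000)    /* 1 s: hard reset of inactive entries  */

    void send_spoofed_ack(struct flow_entry *f);      /* hypothetical RACK injection hook     */
    void deactivate_flow(struct flow_entry *f);       /* hypothetical flow-table clean-up     */

    /* Called every 1 ms for each tracked flow; *x is the per-flow backoff exponent. */
    static void on_timer_tick(struct flow_entry *f, uint64_t rtt_ns, uint64_t now, unsigned *x)
    {
            uint64_t beta = (uint64_t)ALPHA * rtt_ns + (uint64_t)rand() % (rtt_ns + 1);
            uint64_t last = f->ack_time_ns > f->active_time_ns ? f->ack_time_ns : f->active_time_ns;

            if (!f->active || f->long_lived)
                    return;

            if (f->resent == 0 && now - last >= beta) {
                    /* likely tail or bursty loss: top up the dupACK count to the FRR threshold */
                    for (unsigned i = f->dupack_nr; i < PHI; i++)
                            send_spoofed_ack(f);
                    f->resent = 1;
                    f->resent_time_ns = now;
                    *x = 2;                                   /* start exponential backoff       */
            } else if (f->resent > 0 && now - f->resent_time_ns >= (beta << *x)) {
                    send_spoofed_ack(f);                      /* one more RACK, further backed off */
                    f->resent++;
                    (*x)++;
            }

            if (f->resent > 0 && now - f->ack_time_ns >= RTO_MIN_NS) {
                    f->resent = 0;                            /* step aside: let TCP's own RTO act */
                    f->dupack_nr = 0;
            }
            if (now - f->active_time_ns >= IDLE_RESET_NS)
                    deactivate_flow(f);
    }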
Figure 3: The T-RACKs system: it consists of an end-host module that tracks the incoming ACKs of TCP flows and generates fake (spoofed) ACKs whenever a flow times out.
Figure 4: RTT variation between the time of first transmission and the time of retransmission (i.e., ΔRTT = RTT_rt − RTT_t).

C. Practical Aspects of T-RACKs System
T-RACKs System: the system is built upon a lightweight module at the hypervisor layer that tracks a limited per-flow state. In the simplest case, it tracks TCP's identifying 4-tuple, the per-flow last ACK number, and the timestamp of the last non-dupACK packet. The system is similar in spirit to recent works [24, 27] that aim to enable virtualized congestion control in the hypervisor without cooperation from the tenant VM; however, these approaches require fully-fledged TCP state tracking and the implementation of a full TCP finite-state machine in the hypervisor, including packet queuing. In contrast, T-RACKs minimizes the overhead by tracking the minimal amount of necessary information and implementing only a subset of the retransmission mechanism, while letting the VM do the actual work of transmission and queuing.
Complexity: T-RACKs' complexity resides in its interception of ACKs to update the last-seen ACK information. Since it does not perform any computation on the ACK packets, it adds little to the load on the hosting server or to the latency (the header of incoming ACKs is only updated in some instances, e.g., to insert a fake SACK block signifying a small gap in the SACKed numbers; otherwise, RACK packets would be ignored by TCP). This claim is supported by our observations in the experiments on our data-center. A hash-based table is used to track the flow entries of active small flows. In the worst case, when hashes collide, a linear search within the linked list is necessary; however, this worst case is rare due to the small number of flows originating from a given end-host. Typically, end-host CPUs can sustain packet-processing rates of 60 Gbps; hence, the little processing required to maintain the T-RACKs state does not affect the TCP throughput.
Spurious retransmissions: T-RACKs may possibly introduce spurious retransmissions, making in-network congestion worse. However, this boils down to the same question as choosing a correct RTO value in TCP. For this purpose, we refer to a previous study [14], which showed that, even when a relatively bad RTT estimator is used, setting a relatively high minimum RTO can help avoid many of the spurious retransmissions in WAN transfers. This is supported by a subsequent study [66] showing significant changes (or variance) in Internet delays, and recent works [26, 45] show similar behavior within current data-centers. In our testbed, we also observed noticeable variation in the measured RTT. To quantify this, we measured the difference between the RTT value collected at the time of the first transmission of a packet and the one collected at the time of its fast retransmission or RTO retransmission. The collected data show a considerably large variation, ranging from a few hundred microseconds upwards (Fig. 4). Hence, the ACK RTO (β) calculation shown in Algorithm 2 strikes a balance between rapid retransmission and the risk of causing spurious retransmissions.

T-RACKs RTO β: in most of our experiments and simulations, we choose the value of the ACK RTO (β) to be at least 10 times the dominant measured RTT in the data-center. We believe, and the results show, that this value achieves a good tradeoff between avoiding many spurious retransmissions and, at the same time, not being too late in recovering from losses. We further adopt the well-known exponential backoff mechanism for subsequent ACK RTO (β) calculations until either the loss is recovered or TCP's default RTO (i.e., RTO_min) is close to being reached.
Synchronization of retransmissions: since T-RACKs relies on a timer for ACK recovery, such a timer may result in the synchronization of retransmissions from different VMs on different hosts, leading to incast-like congestion. We studied this behavior in simulation, and the results show repeated losses due to possibly synchronized retransmissions. A viable solution to de-synchronize such flows is to introduce some randomness in the ACK RTO, ultimately resulting in fewer flows experiencing repeated timeouts. We adopted this approach and added a random delay in the calculation of the RTO β, as shown above in the algorithms.
Figure 5: TCP header fields manipulated by the T-RACKs system.

TCP header manipulation: TCP does not accept any packet with an inconsistent timestamp; hence, the timestamps are updated on each ACK arrival with the local jiffies variable to keep the timestamps consistent whenever RACKs are sent. For SACK-enabled TCP, fake SACK block information needs to be inserted into incoming ACKs (those with no SACK blocks in the TCP header) to indicate a small gap, equal to the minimum segment size (i.e., 40 bytes), after the last successfully acknowledged data.
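To make the option handling concrete, the sketch below assembles the option bytes a spoofed RACK could carry: a refreshed timestamp option and a single SACK block placed a minimal 40-byte gap after the last cumulative ACK, following the layout of Fig. 5 (the function name, the caller-supplied tsval/tsecr values, and the 40-byte block length are illustrative assumptions; only the 40-byte gap itself is taken from the text above):

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    #define TCPOPT_NOP        1
    #define TCPOPT_SACK       5
    #define TCPOPT_TIMESTAMP  8

    /* Writes NOP NOP TS(kind=8,len=10) NOP NOP SACK(kind=5,len=10,one block) into buf
     * (24 bytes total, a multiple of 4) and returns the number of bytes written.      */
    static int build_rack_options(uint8_t *buf, uint32_t tsval, uint32_t tsecr,
                                  uint32_t last_ack)
    {
            uint8_t *p = buf;
            uint32_t left  = htonl(last_ack + 40);    /* 40-byte gap after the cumulative ACK */
            uint32_t right = htonl(last_ack + 80);    /* one illustrative 40-byte SACK block  */

            *p++ = TCPOPT_NOP; *p++ = TCPOPT_NOP;
            *p++ = TCPOPT_TIMESTAMP; *p++ = 10;       /* kind=8, len=10                       */
            tsval = htonl(tsval);  memcpy(p, &tsval, 4);  p += 4;
            tsecr = htonl(tsecr);  memcpy(p, &tsecr, 4);  p += 4;

            *p++ = TCPOPT_NOP; *p++ = TCPOPT_NOP;
            *p++ = TCPOPT_SACK; *p++ = 2 + 8;         /* kind=5, len=2+8 per block            */
            memcpy(p, &left, 4);  p += 4;
            memcpy(p, &right, 4); p += 4;

            return (int)(p - buf);
    }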
Security concerns: during FRR, to be able to maintain its flight size and avoid a timeout, TCP artificially inflates the window by 1 MSS for each received dupACK. This can be exploited to launch an ACK-spoofing attack [54] on the senders. RFC 5681, released in 2009, addressed this particular attack and proposed implementing a Nonce and Nonce-Reply as a way of verifying the source of dupACKs. However, such a solution would require the introduction of extra TCP headers, prohibiting its deployment in real TCP implementations. In T-RACKs, we address such attacks by dropping dupACKs once the ACK timer has expired and the flow has entered the recovery state. This approach disables the artificial Cwnd inflation during recovery and, at the same time, prevents external ACK spoofing (other than RACKs). Worth mentioning also is that RACKs are generated from the hypervisor layer, which is under the control of the trusted data-center operator.
TCP semantics: these are conceptually violated, since dupACKs should indicate that packets following the lost one were received successfully. However, according to RFC 5681, the network may replicate packets, and hence the RACKs can be treated as packets replicated from within the network.
TCP header manipulation: Fig. 5 shows the TCP header fields manipulated by the T-RACKs module; the module tracks and uses the highlighted TCP header fields of all incoming ACK packets using RCU flow entries.

V. SIMULATION ANALYSIS
In this section, we study the performance of T-RACKs to verify whether it can achieve its goals in small-scale and large-scale simulation scenarios. We first conduct a number of packet-level simulations using ns2 and compare T-RACKs' performance to that of state-of-the-art schemes. (For brevity, we refer to T-RACKs as RACK in the figures.)
A. Simulations in a Dumbbell Topology
To study how TCP behaves in response to packet losses and how likely it is to recover quickly with the help of T-RACKs, we conducted several packet-level simulation experiments covering a large variety of TCP and AQM combinations. We also conducted simulation experiments using the congestion control code imported from the Linux kernel (i.e., NewReno and Cubic). We use ns2 version 2.35 [48], which we have extended with the T-RACKs mechanism inserted as a connector between nodes and their links in the topology setup. In addition, we patched ns2 with the publicly available DCTCP patch. In our simulation experiments, we use link speeds of 1 Gbps for the sending stations, a bottleneck link of 1 Gb/s, a low RTT of 100 µs, the default TCP RTO_min of 200 ms, and a TCP initial window of 10 MSS. We use a rooted tree topology with a single bottleneck at the destination and run the experiments for 15 s. The buffer size of the bottleneck link is set to 100 packets, which is more than the bandwidth-delay product in all cases. We first designed two simulation scenarios:
1) CASE 1: small flows only.
2) CASE 2: large flows coexisting with small flows.
In CASE 2, the ratio of small flows to large flows is set to 3, and large flows send data for the whole duration of the experiment. In both cases, each small flow sends a 14.6 KB file (i.e., 10 MSS) until it completes its transmission. Small flows start with a random inter-arrival time drawn from an exponential distribution with a mean equal to the transmission time of one packet; this allows us to create clusters of small flows that start almost simultaneously, to emulate incast traffic. This process is repeated once every 3 s, which gives five rounds of incast traffic during the simulation. In each round, we draw the order of the servers generating the flows according to a uniform distribution. We study the packet losses, the likelihood of fast recovery, and the recovery time.
Figure 6: CDF of the average FCT for CASE 1 in the 20-flow scenario: (a) without T-RACKs, (b) with T-RACKs.
Figure 7: CDF of the average FCT for CASE 1 in the 80-flow scenario: (a) without T-RACKs, (b) with T-RACKs.
We first study TCP NewReno with the DropTail, RED-ECN, and Random Drop AQMs, as well as DCTCP, covering the most common TCP and AQM settings. First, we run the experiments without T-RACKs for the 20- and 80-flow traffic scenarios generated according to CASE 1 (i.e., containing only small flows). Figures 6a and 7a show the average FCT for the different schemes. The FCT is significantly high for all schemes, exceeding 200 ms for a large fraction of the flows in both the 20- and 80-flow scenarios. This result indicates that most flows are experiencing RTOs. We repeat the same simulation with T-RACKs enabled on the end-hosts to mitigate the RTOs. Figures 6b and 7b show the FCT of the different schemes with T-RACKs for the 20- and 80-flow scenarios of CASE 1. The results show a significant improvement in the average FCT for the two scenarios and for all schemes. In the 20-flow scenario, at the percentile highlighted by the horizontal black line, the reduction in the average FCT is up to ≈15 times (i.e., from 200 ms down to 13 ms). In the 80-flow scenario, the FCT is reduced by a comparable factor (up to ≈14X) across DropTail, DropRand, RED-ECN, and DCTCP. Figures 8 and 9 report the corresponding results for CASE 2; in particular, Figure 9a and Figure 9b show the results without and with T-RACKs for the 80-flow scenario.
Figure 8: CDF of the average FCT for CASE 2 in the 20-flow scenario: (a) without T-RACKs, (b) with T-RACKs.
Figure 9: CDF of the average FCT for CASE 2 in the 80-flow scenario: (a) without T-RACKs, (b) with T-RACKs.
In this 80-flow scenario, T-RACKs can further improve the performance: it manages to reduce the average FCT at the tail percentile by up to ≈37X across DropTail, DropRand, RED-ECN, and DCTCP. These results show that T-RACKs improves the performance even more in heavy-load scenarios with background traffic.
B. Simulations in Data-center Topology
End-host based schemes are assumed by default to be scalable; to verify this, we experiment with T-RACKs on a larger scale with varying workloads and flow size distributions. For this purpose, we conduct another packet-level simulation using a spine-leaf topology with nine leaf switches and four spine switches, using links of 10 Gbps for the end-hosts and an over-subscription ratio of 5 (the typical over-subscription ratio in current production data-centers is in the range of 3 to over 20). We again examine scenarios with TCP-NewReno, TCP-ECN, and DCTCP operating along with the DropTail, RED, and DCTCP AQMs, respectively. We use a per-hop link delay of 50 µs; TCP is configured with the default RTO_min of 200 ms and an initial window of 10 MSS. Persistent connections are used for successive requests. Finally, the buffer sizes on all links are set equal to the bandwidth-delay product between end-points within one physical rack. The flow size distributions of workload 1 (which represents the Websearch flow size distribution [10]) and workload 2 (which represents the Datamining flow size distribution [25]) are shown in Fig. 10a and capture a wide range of flow sizes.
Figure 10: Flow characteristics: (a) actual flow size distribution; (b) inter-arrival times for various network loads.

The flows are generated randomly from any source host to any other destination host, with the arrivals following a Poisson process with various average flow arrival rates to simulate different network loads. Fig. 10b shows the inter-arrival time distributions for traffic loads ranging from 30% to 90%. We report the average FCT for small flows and for all flows, as well as the total number of timeout events in each case. In this simulation, the T-RACKs threshold γ is set to infinity (i.e., all flows are tracked, including large flows). The T-RACKs RTO, i.e., the timeout that triggers the spoofed dupACKs, is set to 10 times the measured RTT in this experiment. That means that if an ACK with the expected sequence number is not delivered within 10 RTTs, RACK packets are spoofed to trigger FRR.
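For reference, the flow arrival process used above can be reproduced by choosing the mean arrival rate that yields a target load and then sampling exponential inter-arrival gaps; the sketch below shows one common way to do this (the load-to-rate relation and the function names are assumptions about a generic workload generator, not a description of our exact tool):

    #include <math.h>
    #include <stdlib.h>

    /* Mean flow arrival rate (flows/s) that drives a link of capacity_bps at `load`
     * (0..1), given the mean flow size of the workload's distribution in bytes.     */
    static double arrival_rate(double load, double capacity_bps, double mean_flow_bytes)
    {
            return load * capacity_bps / (mean_flow_bytes * 8.0);
    }

    /* One exponential inter-arrival sample (seconds) via inverse-transform sampling. */
    static double next_interarrival(double rate)
    {
            double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* u in (0,1) */
            return -log(u) / rate;
    }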
Figure 11: Performance as a function of the network load in (30%, 90%) for workload 1 (top) and workload 2 (bottom): (a, d) small flows, average FCT; (b, e) all flows, average FCT; (c, f) all flows, number of RTOs.

The average FCT for small flows and for all flows, as well as the total number of timeouts experienced by all flows, for workload 1 and workload 2 are shown in Fig. 11 (flow sizes in the range [0-100KB] are considered small, sizes in [100KB-10MB] medium, and sizes of 10MB and upwards large). Note that for both workloads, the FCT of small flows is dramatically degraded when they experience a timeout, regardless of the TCP congestion control version or AQM mechanism in operation. In contrast, when T-RACKs is activated, it helps small flows the most, improving their FCT as a by-product of reducing the number of timeouts they experience. We also note that the overall FCT decreases for all flows, for two reasons: 1) the threshold γ enables all flows to benefit from T-RACKs, and 2) small flows finish quicker, leaving network resources to the larger ones. We notice that for workload 2, with almost 80% of the flows being smaller than 10KB and hence experiencing fewer timeout events overall, DCTCP can improve the FCT. This improvement is due to DCTCP's ability to regulate the persistent queue length (i.e., there are few large flows to fill the buffer, unlike workload 1).
Figure 12: The same Websearch scenario as above with α varied from 1 RTT to 100 RTTs: (a) small flows, DropTail; (b) small flows, RED-ECN; (c) small flows, DCTCP.

C. Sensitivity to Choice of T-RACKs RTO
In this experiment, we study the sensitivity of T-RACKs to the preset RACK RTO value. For this purpose, we repeat the last simulation while varying the value of the RTT multiplicative factor α in the set [1, 5, 10, 50, 100]. We report the average FCT of small flows in each case for DropTail, RED, and DCTCP in Figure 12 for various loads. From the figure, we can see that the FCT is greatly affected by the choice of the parameter α. Small values of α (i.e., 1 and 5), for which β is close to the RTT, result in a relatively large FCT, which indicates that they tend to cause too many spurious retransmissions that exacerbate congestion in the network. On the other hand, excessively large values of α (i.e., 50 and 100) tend to be too conservative and result in TCP flows recovering later than they could. In the three figures, for all loads, a minimum FCT is achieved at or near a RACK RTO of 10 RTTs.

VI. LINUX KERNEL IMPLEMENTATION
VI. Linux Kernel Implementation

(Figure 13 panels: (a) Small Flows: Average FCT with error bars, (b) Small Flows: Missed Deadlines, (c) All Flows: Average FCT with error bars, (d) Small Flows: Average FCT (Datamining), (e) Small Flows: Average FCT (Educational), (f) Small Flows: Average FCT (Private DC); schemes: Reno, Cubic, and DCTCP, each with and without T-RACKs.)
Figure 13: Performance metrics of the one-to-all scenario without any background traffic for (a-c) Websearch, (d) Datamining, (e) Educational and (f) Private DC workloads, respectively.
In this section, we study the performance of the T-RACKs implementation as a loadable Linux kernel module, using synthetic workloads reproduced from the statistics of workloads found in production data-centers [10, 25]. T-RACKs is a shim layer between the VMs (or the TCP/IP stack) and the hypervisor (or the link layer). We use the NetFilter framework [47], which is an integral part of the Linux kernel. The NetFilter hooks attach to the data path between the NIC driver and the TCP/IP stack, which imposes no modifications to the TCP/IP stack of either the host OS or the guest OS.
(Figure 14 panels: (a) the testbed topology, with four racks of servers connected through ToR leaf switches, a core switch, and a NetFPGA switch; (b) the actual testbed.)
Figure 14: Testbed setup of T-RACKs in a small-scale cluster.

The module intercepts TCP packets destined to the host or its guests before they are handed to the TCP/IP stack (i.e., at post-routing). First, the 4-tuple is hashed and the associated flow index is computed via the Jenkins hash (JHash) [31]. Then, the TCP header is examined, and the proper course of action is taken based on the flag bits (i.e., SYN-ACK, FIN, or ACK), following the logic in Algorithm 1. Unlike SNOOP [16], the module does not employ any packet queues to store incoming packets; it only stores and updates flow-entry state (i.e., ACK number, arrival time, and so on). Also, unlike [59], T-RACKs does not need fine-grained high-resolution timers at the microsecond time-scale; the native OS jiffies timer is used instead, and a single timer serves all flows to handle per-flow RTO events. These design choices make T-RACKs lightweight and help reduce the server overhead.

From 14 data-center-grade servers, each equipped with 6 NICs, we built a small-scale testbed consisting of 84 virtual servers, each assigned a dedicated physical NIC. The servers are interconnected via four non-blocking leaf switches and one spine switch. The testbed is organized into four racks (racks 1, 2, 3, and 4). The servers are connected to the leaf switches, and the leaf switches to the spine switch, via 1 Gbps Ethernet links. The servers run Ubuntu Server 14.04 LTS with Linux kernel 3.18, which integrates a full implementation of DCTCP. Unless otherwise stated, T-RACKs runs with its default settings (i.e., the T-RACKs RTO and the threshold γ are set to 4 ms and 100 KB, respectively). An RTO of 4 ms is a reasonable 16 times the average RTT of ≈250 µs without queuing. We use our custom-built traffic generator to run the experiments with realistic traffic workloads. The traffic generator produces common workloads described in the literature (e.g., industrial-like Websearch [10] and Datamining [25], or institutional-like University and Private DC [17]). In addition, we deploy the iperf program [29] to emulate large background traffic (e.g., VM migrations, backups) in some scenarios. We use different scenarios to reproduce one-to-all and all-to-all flows with or without background traffic. In the one-to-all scenarios, randomly chosen clients in one rack send random requests to any of the servers in the data-center, while in the all-to-all scenario, all clients in the data-center send requests to randomly picked servers. If background traffic is introduced, we run large iperf flows from all clients to all servers to evaluate T-RACKs under sudden and persistent network load spikes. As before, we classify flows of size ≤100 KB as small, 100 KB-10 MB as medium, and ≥10 MB as large.
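As a rough illustration of these design choices, the sketch below (written against a recent Linux kernel; the hook point, table sizing, and all identifiers are our own assumptions rather than the released module) shows how a NetFilter hook can index per-flow soft state by a Jenkins hash of the 4-tuple and update it from the TCP flag bits, without queuing any packets. Registration via nf_register_net_hook(), collision checks on the 4-tuple, and the single jiffies-based timer that scans the table for expired flows are omitted for brevity.

/* Minimal sketch of a T-RACKs-style NetFilter shim (illustrative only).
 * The hook prototype shown matches recent kernels (>= 4.4) and differs
 * slightly on the 3.18 kernel used in the testbed. */
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/jhash.h>
#include <linux/jiffies.h>

#define TBL_BITS 10
#define TBL_SIZE (1 << TBL_BITS)

struct flow_entry {                  /* per-flow soft state only: no packet queue */
	__be32 saddr, daddr;
	__be16 sport, dport;
	u32 last_ack;                /* highest cumulative ACK seen so far        */
	unsigned long last_seen;     /* jiffies of last update; drives the timer  */
	bool active;
};

static struct flow_entry flow_tbl[TBL_SIZE];

static u32 flow_index(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
{
	/* Jenkins hash of the 4-tuple, folded onto the table size */
	return jhash_3words((u32)saddr, (u32)daddr,
			    ((u32)sport << 16) | (u32)dport, 0) & (TBL_SIZE - 1);
}

static unsigned int tracks_hook(void *priv, struct sk_buff *skb,
				const struct nf_hook_state *state)
{
	struct iphdr *iph = ip_hdr(skb);
	struct tcphdr *th;
	struct flow_entry *fe;

	if (!iph || iph->protocol != IPPROTO_TCP)
		return NF_ACCEPT;
	/* a real module must validate header bounds (e.g., pskb_may_pull) */
	th = (struct tcphdr *)((unsigned char *)iph + iph->ihl * 4);

	fe = &flow_tbl[flow_index(iph->saddr, iph->daddr, th->source, th->dest)];
	if (th->syn && th->ack) {            /* connection setup: (re)create entry  */
		fe->saddr = iph->saddr;  fe->daddr = iph->daddr;
		fe->sport = th->source;  fe->dport = th->dest;
		fe->last_ack = ntohl(th->ack_seq);
		fe->active = true;
	} else if (th->fin || th->rst) {     /* teardown: forget the flow           */
		fe->active = false;
	} else if (fe->active && th->ack) {  /* track the cumulative ACK and time   */
		fe->last_ack  = ntohl(th->ack_seq);
		fe->last_seen = jiffies;
	}
	return NF_ACCEPT;                    /* packets are never queued or altered */
}

static const struct nf_hook_ops tracks_ops = {
	.hook     = tracks_hook,
	.pf       = NFPROTO_IPV4,
	.hooknum  = NF_INET_PRE_ROUTING,     /* illustrative choice of hook point   */
	.priority = NF_IP_PRI_FIRST,
};

Dropping a flow's entry once its acknowledged volume exceeds γ, as suggested later in the discussion of overhead, would amount to clearing the active bit in this table when the tracked byte count crosses the threshold.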
A. Experimental Results and Discussion
One-to-all Scenario without Background Traffic: we report here the average FCT for small flows and for all flows, as well as the number of small flows that missed a deadline of 200 ms. The traffic generator is deployed on every client running on an end-host in the data-center and is set to randomly initiate 1000 requests to randomly chosen servers on any of the other racks. For the Websearch workload, Figures 13a, 13b and 13c show the average FCT and missed deadlines for small flows and the average FCT for all flows, respectively, while Figures 13d, 13e and 13f show the average FCT for small flows in the Datamining, Educational, and Private DC workloads, respectively. From these figures, we make the following observations: i) for all workloads, T-RACKs helps small flows regardless of the TCP version, both in the average FCT and in its variation, as indicated by the error bars.
(Figure 15 panels: (a) Small Flows: Average FCT, (b) Small Flows: Missed Deadlines, (c) All Flows: Average FCT, (d) Average FCT (Datamining), (e) Average FCT (Educational), (f) Average FCT (Private DC); schemes: Reno, Cubic, and DCTCP, each with and without T-RACKs.)
Figure 15: Performance metrics of the one-to-all scenario with background traffic for (a-c) Websearch, (d) Datamining, (e) Educational and (f) Private DC workloads, respectively.
Compared to Reno, Cubic and DCTCP, T-RACKs reduces the average FCT of small flows by ≈(34%, , ) for Websearch, ≈(18%, , −) for Datamining, ≈(69%, −, ) for Educational, and ≈(−, −, ) for Private DC workloads. We notice that DCTCP improves the FCT over its Reno and Cubic counterparts, and that T-RACKs further improves their performance in terms of missed deadlines in Websearch. In certain cases of the Educational and Private DC workloads, the average FCT shows a negligible increase with T-RACKs. In these workloads, the network load is very light (as shown by the small FCT without T-RACKs), and hence the added overhead of deploying the T-RACKs module outweighs its performance gains. ii) for the Websearch workload, T-RACKs reduces the missed deadlines for short flows by ≈(55%, , ) for Reno, Cubic, and DCTCP, respectively. iii) T-RACKs slightly improves the overall average FCT. This can be attributed to the fact that small flows finish their transmission quicker, leaving additional bandwidth for medium and large flows. The improvement equals ≈(16%, ) for Reno and Cubic, respectively. In Figure 13c, DCTCP with T-RACKs shows a slight increase in the average FCT of all flow types for the Websearch workload, which has many flows of medium size compared to the other workloads (Figure 10). While DCTCP is designed to improve the FCT of small flows in DC environments, T-RACKs is designed to curb RTO events so as to improve the FCT of small flows whose transmitted volume does not exceed the threshold γ. Hence, T-RACKs may add a slight overhead due to the need to maintain flow-table information for the medium and/or large flows. This overhead may be mitigated by simply skipping state maintenance for flows that exceed the threshold and become long-lived. Moreover, for TCP variants designed for the Internet (e.g., Reno and Cubic), the overhead is nearly negligible relative to the large FCT of their long-lived flows.

One-to-all Scenario with Background Traffic: to put T-RACKs under true stress, we run the same one-to-all scenario with all-to-all background traffic. Figures 15a, 15b and 15c show the average FCT and missed deadlines for small flows as well as the average FCT for all flows for Websearch, while Figures 15d, 15e and 15f show the average FCT for short flows for the Datamining, Educational, and Private DC workloads, respectively. We observe the following: i) T-RACKs improves the average FCT of small flows for all workloads, regardless of the TCP congestion control in use. As shown in the figures, compared to Reno, Cubic and DCTCP, T-RACKs reduces the average FCT of small flows by ≈(38%, , ) for Websearch, ≈(11%, , ) for Educational, and ≈(13%, , ) for Private DC workloads. The improvement increases to ≈(36%, , ) for the Datamining workload, since it includes a wider range of short flows. ii) T-RACKs reduces the missed deadlines for short flows of Websearch by ≈(40%, , ) for Reno, Cubic, and DCTCP, respectively. iii) T-RACKs still improves the overall average FCT by ≈(7%, , ) for Reno, Cubic, and DCTCP, respectively.

All-to-all Scenario without Background Traffic: we run the all-to-all scenario, where all clients initiate 1000 requests each to any of the servers in the data-center. Figures 16a, 16b, 16c and 16d show the average FCT for short flows in the Websearch, Datamining, Educational, and Private DC workloads, respectively.
(Figure 16 panels: (a) Websearch, (b) Datamining, (c) Educational, (d) Private DC; average FCT with error bars for Reno, Cubic, and DCTCP, each with and without T-RACKs.)
Figure 16: The average FCT of small flows in the all-to-all scenario for (a) Websearch, (b) Datamining, (c) Educational and (d) Private DC workloads, respectively.

The network load is considerably higher than in the previous cases, given the more complex nature of this all-to-all traffic. We can still see that T-RACKs delivers significant improvements of up to 71% in the FCT for all workloads.

In summary, the experimental results show the performance gains achieved by T-RACKs, especially for small flows, which constitute the lion's share of data-center traffic, without overly affecting the performance of larger flows. In particular, the results show that:
• T-RACKs reduces the variance of small flows' FCTs and the number of missed deadlines.
• T-RACKs maintains its gains even when bandwidth-greedy large flows hog the network.
• T-RACKs efficiently handles various workloads and is agnostic to the TCP congestion control variant.
• T-RACKs fulfills its requirements with no assumptions about, and no modifications to, the in-network hardware or the TCP/IP stack of the guest VMs.
VII. Conclusion and Future Work

In this paper, we studied packet losses and the impact of various recovery methods on flow performance. We then proposed T-RACKs, an efficient cross-layer approach for timely recovery from losses. T-RACKs improves the flow completion time of time-sensitive flows and helps avoid throughput-collapse situations. T-RACKs is deployed either at the sender side or the receiver side, as a shim layer residing between the virtual machines and the network hardware. Simulation and experimental results show that flow completion times are improved by up to an order of magnitude, missed deadlines are reduced considerably, and high link utilization is attained. T-RACKs is shown to be lightweight and practical owing to its minimal footprint on end-hosts. Finally, because it does not change TCP and adapts to any TCP flavor, T-RACKs is well suited to multi-tenant public data-centers. As part of our future work, we plan a larger-scale deployment in cloud environments such as AWS or Azure to investigate and analyze the effectiveness of the T-RACKs scheme at scale.

References

[1] A. M. Abdelmoniem and B. Bensaou. Efficient Switch-Assisted Congestion Control for Data Centers: an Implementation and Evaluation. In
IEEE IPCCC , 2015.[2] A. M. Abdelmoniem and B. Bensaou. Incast-Aware Switch-Assisted TCP Congestion Control for Data Centers. In
IEEEGLOBECOM , 2015.[3] A. M. Abdelmoniem and B. Bensaou. Reconciling Mice and Elephants in Data Center Networks. In
IEEE CLOUDNET ,2015.[4] A. M. Abdelmoniem and B. Bensaou. Curbing Timeouts for TCP-Incast in Data Centers via A Cross-Layer FasterRecovery Mechanism. In
Proceedings - IEEE INFOCOM , 2018.[5] A. M. Abdelmoniem, B. Bensaou, and A. J. Abu. Mitigating incast-tcp congestion in data centers with sdn.
Annals ofTelecommunications , 73(3), 2018.[6] A. M. Abdelmoniem, B. Bensaou, and V. Barsoum. Incastguard: An efficient tcp-incast mitigation mechanism for cloudnetworks. In
IEEE GLOBECOM , 2018.[7] A. J. Abu, B. Bensaou, and A. M. Abdelmoniem. A Markov Model of CCN Pending Interest Table Occupancy withInterest Timeout and Retries. In
IEEE ICC , 2016.[8] A. J. Abu, B. Bensaou, and A. M. Abdelmoniem. Inferring and Controlling Congestion in CCN via the Pending InterestTable Occupancy. In
IEEE Local Computer Networks (LCN) , 2016.[9] M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman. Data Center transportmechanisms: Congestion control theory and IEEE standardization. In , pages 1270–1277, 2008.[10] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data centerTCP (DCTCP).
ACM SIGCOMM CCR , 40:63, 2010.[11] M. Alizadeh, A. Kabbani, B. Atikoglu, and B. Prabhakar. Stability analysis of QCN.
ACM SIGMETRICS PerformanceEvaluation Review , 39(1):49, 2011.[12] M. Alizadeh, A. Kabbani, T. Edsall, and B. Prabhakar. Less is More : Trading a little Bandwidth for Ultra-Low Latencyin the Data Center. In
USENIX NSDI , 2012.[13] M. Alizadeh, S. Yang, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker. Deconstructing datacenter packet transport.In
ACM HotNets , 2012.[14] M. Allman and V. Paxson. On Estimating End-to-end Network Path Properties.
SIGCOMM CCR , 29:263–274, 1999.[15] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and H. Wang. Information-agnostic Flow Scheduling for Commodity DataCenters. In
USENIX NSDI , 2015.[16] H. Balakrishnan, S. Seshan, and R. H. Katz. Improving reliable transport and handoff performance in cellular wirelessnetworks.
Wireless Networks , 1:469–481, 1995.[17] T. Benson, A. Akella, and D. A. Maltz. Network Traffic Characteristics of Data Centers in the Wild. In
ACM IMC , 2010.[18] T. Benson, A. Akella, A. Shaikh, and S. Sahu. Cloudnaas: A cloud networking platform for enterprise applications. In
ACM Symposium on Cloud Computing (SoCC) , 2011.[19] T. Benson, A. Anand, A. Akella, and M. Zhang. Understanding data center traffic characteristics. In
ACM SIGCOMM ,2010.[20] bootlin.com. Elixir Cross Referencer. https://elixir.bootlin.com/linux/latest/source.[21] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson. Bbr: Congestion-based congestion control.
Commun.ACM , 60(2), 2017.[22] W. Chen, F. Ren, J. Xie, C. Lin, K. Yin, and F. Baker. Comprehensive understanding of TCP Incast problem. In
IEEEConference on Computer Communications (INFOCOM) , 2015.[23] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph. Understanding TCP incast throughput collapse in datacenternetworks. In , 2009.[24] B. Cronkite-Ratcliff, A. Bergman, S. Vargaftik, M. Ravi, N. McKeown, I. Abraham, and I. Keslassy. Virtualized CongestionControl. In
ACM SIGCOMM , 2016.[25] B. A. Greenberg, J. R. Hamilton, S. Kandula, C. Kim, P. Lahiri, A. Maltz, P. Patel, S. Sengupta, A. Greenberg, N. Jain,and D. A. Maltz. VL2: a scalable and flexible data center network. In
ACM SIGCOMM , 2009.[26] C. Guo, L. Yuan, D. Xiang, Y. Dang, R. Huang, D. Maltz, Z. Liu, V. Wang, B. Pang, H. Chen, Z.-W. Lin, and V. Kurien.Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. In
ACM SIGCOMM ,2015.[27] K. He, E. Rozner, K. Agarwal, Y. J. Gu, W. Felter, J. Carter, and A. Akella. AC/DC TCP: Virtual Congestion ControlEnforcement for Datacenter Networks. In
SIGCOMM , SIGCOMM ’16, pages 244–257, New York, NY, USA, 2016.ACM. [28] J. Huang, T. He, Y. Huang, and J. Wang. ARS: Cross-layer adaptive request scheduling to mitigate TCP incast in datacenter networks. In IEEE INFOCOM , 2016.[29] iperf. The Bandwidth Measurement Tool. https://iperf.fr/.[30] S. M. Irteza, A. Ahmed, S. Farrukh, B. N. Memon, and I. A. Qazi. On the coexistence of transport protocols in datacenters. In
IEEE ICC , 2014.[31] B. Jenkins. A hash function for hash table lookup. http://burtleburtle.net/bob/hash/doobs.html.[32] Jiao Zhang, Fengyuan Ren, Li Tang, and Chuang Lin. Taming TCP incast throughput collapse in data center networks.In
IEEE ICNP
USENIX NSDI , 2015.[35] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic. In
ACM IMC , NewYork, New York, USA, 2009.[36] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh, and M. Yu. HPCC:High Precision Congestion Control. In
ACM SIGCOMM , 2019.[37] A. M. Abdelmoniem, Y. M. Abdelmoniem, and B. Bensaou. On Network Systems Design: Pushing the PerformanceEnvelope via FPGA Prototyping. In
IEEE international Conference on Recent Trends in Computer Engineering (ITCE) ,2019.[38] A. M. Abdelmoniem and B. Bensaou. Enforcing Transport-Agnostic Congestion Control via SDN in Data Centers. In
IEEE Local Computer Networks , 2017.[39] A. M. Abdelmoniem and B. Bensaou. Hysteresis-based Active Queue Management for TCP Traffic in Data Centers. In
IEEE INFOCOM , 2019.[40] A. M. Abdelmoniem, B. Bensaou, and A. J. Abu. HyGenICC: Hypervisor-based Generic IP Congestion Control forVirtualized Data Centers. In
IEEE ICC , 2016.[41] A. M. Abdelmoniem, H. Susanto, and B. Bensaou. Taming Latency in Data centers via Active Congestion-Probing. In
IEEE ICDCS , 2019.[42] A. M. Abdelmoniem, H. Susanto, and B. Bensaou. Reducing latency in multi-tenant data centers via cautious congestionwatch. In
International Conference on Parallel Processing (ICPP) , 2020.[43] M. Mathis, J. Semke, J. Mahdavi, and T. Ott. The macroscopic behavior of the TCP congestion avoidance algorithm.
ACM SIGCOMM Computer Communication Review , 1997.[44] M. Mattess, R. N. Calheiros, and R. Buyya. Scaling MapReduce Applications Across Hybrid Clouds to Meet SoftDeadlines. In
IEEE AINA , 2013.[45] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats.TIMELY: RTT-based Congestion Control for the Datacenter. In
ACM SIGCOMM
USENIX FAST , 2008.[50] L. Popa, G. Kumar, M. Chowdhury, A. Krishnamurthy, S. Ratnasamy, and I. Stoica. FairCloud: Sharing the Network inCloud Computing.
ACM SIGCOMM CCR , 42:187–198, 2012.[51] C. Ruan, J. Wang, W. Jiang, G. Min, and Y. Pan. PTCP: A priority-based transport control protocol for timeout mitigationin commodity data center.
Future Generation Computer Systems , 102:619 – 632, 2020.[52] S. M. Rumble, D. Ongaro, R. Stutsman, M. Rosenblum, and J. K. Ousterhout. It’s time for low latency. In
USENIXHotOS , 2011.[53] A. S. Sabyasachi, H. M. D. Kabir, A. M. Abdelmoniem, and S. K. Mondal. A resilient auction framework for deadline-aware jobs in cloud spot market. In
IEEE Symposium on Reliable Distributed Systems (SRDS) , 2017.[54] S. Savage, N. Cardwell, D. Wetherall, and T. Anderson. TCP Congestion Control with a Misbehaving Receiver.
SIGCOMMComput. Commun. Rev. , 29(5):71–78, Oct. 1999.[55] S. Shukla, S. Chan, A. S.-W. Tam, A. Gupta, Y. Xu, and H. J. Chao. TCP PLATO: Packet Labelling to Alleviate Time-Out.
IEEE JSAC , 32(1), 2014.[56] H. Susanto, A. M. Abdelmoniem, H. Jin, and B. Bensaou. Creek: Inter many-to-many coflows scheduling for datacenternetworks. In
IEEE ICC , 2019.[57] H. Susanto, B. L. Ahmed M. Abdelmoniem, Honggang Zhang, and D. Towsley. A Near Optimal Multi-Faced JobScheduler for Datacenter Workloads. In
IEEE ICDCS , 2019.[58] A. S.-W. Tam, K. Xi, Y. Xu, and H. J. Chao. Preventing TCP Incast Throughput Collapse at the Initiation, Continuation,and Termination. In
International Workshop on Quality of Service, IWQoS '12, pages 29:1-29:9, Piscataway, NJ, USA, 2012. [59] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and effective fine-grained TCP retransmissions for datacenter communication. ACM SIGCOMM CCR, 39:303, 2009. [60] H. Wu, Z. Feng, C. Guo, and Y. Zhang. ICTCP: Incast congestion control for TCP in data-center networks.
IEEE/ACMTransactions on Networking , 21, 2013.[61] Y. Xu, S. Shukla, Z. Guo, S. Liu, A. S. . Tam, K. Xi, and H. J. Chao. RAPID: Avoiding TCP Incast Throughput Collapsein Public Clouds With Intelligent Packet Discarding.
IEEE JSAC , 37(8):1911–1923, 2019.[62] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In
USENIX HotCloud , 2010.[63] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz. DeTail : Reducing the Flow Completion Time Tail in DatacenterNetworks. In
ACM SIGCOMM , 2012.[64] J. Zhang, F. Ren, and C. Lin. Modeling and understanding TCP incast in data center networks. In
IEEE INFOCOM ,2011.[65] J. Zhang, F. Ren, L. Tang, and C. Lin. Modeling and Solving TCP Incast Problem in Data Center Networks.
IEEE TPDS ,26(2):478–491, 2015.[66] Y. Zhang and N. Duffield. On the Constancy of Internet Path Properties. In
ACM IMC, IMW '01, New York, NY, USA, 2001. ACM. [67] J. Zhuang, X. Jiang, G. Jin, J. Zhu, and H. Chen. PTCP: A Priority-Driven Congestion Control Algorithm to Tame TCP Incast in Data Centers.