Gang Lu⋆, Jianfeng Zhan∗,‡, Tianshu Hao∗, and Lei Wang∗,‡
⋆Beijing Academy of Frontier Science and Technology
∗Institute of Computing Technology, Chinese Academy of Sciences
‡University of Chinese Academy of Sciences
[email protected], [email protected], [email protected], wanglei [email protected]
Abstract—Although computation is becoming much more complex on data with an unprecedented scale, we argue that computers or smart devices should and will consistently provide information and knowledge to human beings in the order of a few tens of milliseconds. We coin a new term, 10-millisecond computing, to call attention to this class of workloads.

10-millisecond computing raises many challenges for both software and hardware stacks. In this paper, using a typical workload, memcached, on a 40-core server (a mainstream server in the near future), we quantitatively measure 10-ms computing's challenges to conventional operating systems. For better communication, we propose a simple metric, the outlier proportion, to measure quality of service: for N completed requests or jobs, if M jobs or requests' latencies exceed the outlier threshold t, the outlier proportion is M/N. For a 1K-scale system running Linux (version 2.6.32), LXC (version 0.7.5), or XEN (version 4.0.0), respectively, we surprisingly find that in order to reduce the service outlier proportion to 10% (10% of users will perceive QoS degradation), the outlier proportion of a single server has to be reduced by 871X, 2372X, and 2372X, accordingly. Also, we discuss the possible design spaces of 10-ms computing systems from the perspectives of datacenter architectures, networking, OS and scheduling, and benchmarking.
I. INTRODUCTION
Although computation is becoming much more complex on data with an unprecedented scale, in this paper we argue that computers or smart devices should and will consistently provide information and knowledge to human beings in the order of a few tens of milliseconds. We coin a new term, 10-millisecond (in short, 10-ms) computing, to call attention to this class of workloads.

First, determined by the nature of human beings' nervous and motor systems, the timescale for many human activities is in the order of a few hundred milliseconds [17], [10], [11]. For example, in a talk, the gaps we leave in speech to tell the other person it is 'your turn' are only a few hundred milliseconds long [17]; the response time of our visual system to a very brief pulse of light is also in this order. Second, one of the key results from early work on delays in command line interfaces is that regularity is of vital importance [17], [10], [11]: if people can predict how long they are likely to wait, they are far happier [17], [10], [11]. Third, the experiments in [10] show that perceptual events occurring within a single cycle (of this timescale) are combined into a single percept if they are sufficiently similar, indicating that our perceptual system cannot provide much finer capability. That is to say, much lower latency (i.e., less than 10 milliseconds) means nothing to human beings. So the requirement of perfect human-computer interaction comes from human beings themselves, and should be irrelevant to data scale, task complexity, and the underlying hardware and software systems.

The trend of 10-ms computing has been confirmed by the current internet services industries. Internet service providers will not lower their QoS expectations because of the complexity of the underlying infrastructures. Actually, keeping latency low is of vital importance for attracting and retaining users [15], [11], [27]. Google [34] and Amazon [27] found that moving from a 10-result page loading in 0.4 seconds to a 30-result page loading in 0.9 seconds caused a 20% decrease in traffic and revenue; moreover, delaying the page in increments of 100 milliseconds would result in substantial and costly drops in revenue.

The trend of 10-ms computing is also witnessed by other ultra-low latency applications [3], for example high-frequency trading and internet of things applications. These applications are characterized by a request-response loop involving machines instead of humans, and by operations involving multiple parallel requests/RPCs to thousands of servers [3]. Since a service processing or a job completes only when all of its requests or tasks are satisfied, the worst-case latency of the individual requests or tasks is required to be ultra-low to maintain service or job-level quality of service. Someone may argue that those applications demand lower and lower latency. However, as there are end-host stacks, NICs (network interface cards), and switches on the path of an end-to-end application at which a request or response currently experiences delay [3], we believe that in the next decade 10 ms is a reasonable latency performance goal for most end-to-end applications with ultra-low latency requirements.

Previous work [36] also demonstrates that it is advantageous to break data-parallel jobs into tiny tasks, each of which completes in hundreds of milliseconds. Ousterhout et al.
[36] demonstrate a 5.2x improvement in response times due to the use of smaller tasks: tiny tasks alleviate the long wait times seen in today's clusters for interactive jobs, and even large batch jobs can be split into small tasks that finish quickly.

However, 10-ms computing raises many challenges to both the software and hardware stacks. In this paper, we quantitatively measure the challenges it raises for conventional operating systems. memcached [1] is a popular in-memory key-value store intended for speeding up dynamic web applications by alleviating database loads; its average latency is about tens or hundreds of µs. A real-world memcached-based application usually needs to invoke several get or put memcached operations, in addition to many other procedures, to serve a single request, so we choose it as a case study on 10-millisecond computing.

Running memcached on a 40-core Linux server, we found that, as the outlier threshold decreases, the outlier proportion of a single server significantly deteriorates. Meanwhile, the outlier proportion also deteriorates as the number of cores increases. The outlier is further amplified by the system scale. For a 1K-scale system running Linux (version 2.6.32), LXC (version 0.7.5), or XEN (version 4.0.0), a typical configuration in internet services, we surprisingly find that, in order to reduce the service outlier proportion to 10% (with an outlier threshold of 100 µs), the outlier proportion of a single server needs to be reduced by 871X, 2372X, and 2372X, accordingly. We also conducted a series of experiments to reveal that current Linux systems still suffer from poor performance outliers. The operating systems we tested include Linux with different kernels: 1) 2.6.32, an old kernel released five years ago but still popularly used and in long-term maintenance; 2) 3.17.4, the latest kernel, released on November 21, 2014; 3) 2.6.35M, a modified version of 2.6.35 integrated with sloppy counters, proposed by Boyd-Wickizer et al. to solve scalability problems and mitigate kernel contentions [9], [2]; and 4) representative real time schedulers, SCHED_FIFO (First In First Out) and SCHED_RR (Round Robin). This observation indicates that the new challenges are significantly different from the traditional outlier and straggler issues widely investigated in MapReduce and other environments [32], [23], [28], [29], [37]. Furthermore, we discuss the possible design spaces and challenges from the perspectives of datacenter architectures, networking, OS and scheduling, and benchmarking.

Section II formally states the problem. Section III quantitatively measures the OS challenges in terms of reducing the outlier proportion. Section IV discusses the possible design space of 10-ms computing systems from the perspectives of datacenter architectures, networking, OS and scheduling, and benchmarking. Section V summarizes the related work. Section VI draws a conclusion.

II. PROBLEM STATEMENT
For a scale-out architecture, a feasible solution is to break data-parallel jobs into tiny tasks [36]. On the other hand, for a large-scale online service, a request is often fanned out from a root server to a large number of leaf servers (handling sub-requests), and the responses are merged via a request-distribution tree [15].

We use a probability function
Pr(T ≤ t), where T ≥ 0, to describe the distribution of the service or job-level response time (T). If SC (SC ≥ 1) leaf servers (or slave nodes) are used to handle sub-requests or tasks sent from the root server (or master node), we use T_i to denote the response time of a task or sub-request on server i. Here, for clarity, we intentionally ignore the overhead of merging responses from different sub-requests. Meanwhile, for the case of breaking a large job into tiny tasks, we only consider the simplest scenario, in which one-round tasks are merged into results, excluding iterative computation scenarios.

The service or job-level outlier proportion is defined as follows: for N completed requests or jobs, if M jobs or requests' latencies exceed the outlier threshold t, e.g., 10 milliseconds, the outlier proportion op_sj(t) is M/N.

According to [15], the service or job-level outlier proportion will be extraordinarily magnified by the system scale SC. The outlier proportion of a single server is represented by op(t) = Pr(T > t) = 1 − Pr(T ≤ t). Assuming the servers are independent of each other, the service or job-level outlier proportion op_sj(t) is given by Equation 1:

op_sj(t) = Pr(T_1 ≥ t, or T_2 ≥ t, ..., or T_SC ≥ t)    (1)
         = 1 − Pr(T_1 ≤ t) · Pr(T_2 ≤ t) · ... · Pr(T_SC ≤ t)    (2)
         = 1 − Pr(T ≤ t)^SC = 1 − (1 − Pr(T > t))^SC    (3)
         = 1 − (1 − op(t))^SC    (4)

When we deploy the XEN or LXC/Docker solutions, the service or job-level outlier proportion will be further amplified by the number K of guest OSes or containers deployed on each server:

op_sj(t) = Pr(T_1 ≥ t, or T_2 ≥ t, ..., or T_{SC·K} ≥ t)    (5)
         = 1 − (1 − op(t))^{SC·K}    (6)

Conversely, to reduce a service or job-level outlier proportion to op_sj(t), the outlier proportion of a single server must be as low as shown in Equation 7:

op(t) = 1 − (1 − op_sj(t))^{1/SC}    (7)

For example, a Bing search may access 10,000 index servers [31]. If we need to reduce the service or job-level outlier proportion to less than 10%, the outlier proportion of a single server must be as low as 1 − (0.9)^{1/10000} ≈ 1.05 × 10^{-5}. Unfortunately, Section III shows it is an impossible task for a conventional OS like Linux to provide such a capability.

From a cost-effectiveness perspective, another important performance goal is the valid throughput: how many requests or jobs finish within the outlier threshold. The valid throughput is very easy to derive from the outlier proportion: according to its definition, N is the throughput number, so the valid throughput is (N − M).

Another widely used metric is the n% tail latency. For completeness, we also include its definition. The n% tail latency is the mean latency of all requests beyond a certain percentile n [25], e.g., the 99th percentile latency. Outlier proportion and tail latency are related concepts; however, there are subtle differences between the two. Compared with the n% tail latency, the outlier proportion is much easier to use to represent both the user requirement, e.g., (N − M)/N requests or jobs completing within the outlier threshold, and the service provider's concern, e.g., the valid throughput (N − M) measuring how many requests or jobs finished within the outlier threshold. We also note that we cannot derive the valid throughput from the n% tail latency.

Fig. 1: The outlier proportion of memcached on each server changes with the outlier thresholds (from 100 to 1000 µs) and core numbers (from 1 to 40).
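To make Equations 4, 6, and 7 concrete, the following minimal Python sketch (ours, not the authors' tooling; the function names are our own) computes the service-level amplification and the per-server outlier proportion required to meet a service-level goal:

```python
def service_outlier(op_single: float, sc: int, k: int = 1) -> float:
    """Eq. 4 (and Eq. 6 when k > 1): op_sj(t) = 1 - (1 - op(t))**(SC*K)."""
    return 1.0 - (1.0 - op_single) ** (sc * k)

def required_single_outlier(op_service: float, sc: int, k: int = 1) -> float:
    """Eq. 7 (generalized with K): op(t) = 1 - (1 - op_sj(t))**(1/(SC*K))."""
    return 1.0 - (1.0 - op_service) ** (1.0 / (sc * k))

if __name__ == "__main__":
    # A 9.09% single-server outlier proportion (the worst case measured in
    # Section III) saturates a 1000-server service: op_sj is effectively 1.
    print(service_outlier(0.0909, sc=1000))             # ~1.0
    # To keep op_sj at 10% across 1000 servers, each server may exceed the
    # threshold on only ~0.0105% of requests.
    print(required_single_outlier(0.10, sc=1000))       # ~1.05e-4
    # With 4 VMs or containers per server (K = 4), the bar is ~4x tighter.
    print(required_single_outlier(0.10, sc=1000, k=4))  # ~2.63e-5
    # The Bing-style example from the text: SC = 10,000 index servers.
    print(required_single_outlier(0.10, sc=10000))      # ~1.05e-5
```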
Since the outlier proportion does not rely upon historical latency data, while the tail latency must be computed as the average latency of all requests beyond a certain percentile, the former is easier to use in QoS guarantees.
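The relationship among the three metrics can be sketched in a few lines; the latency distribution below is synthetic and purely illustrative:

```python
import random

def outlier_proportion(latencies, threshold):
    """M/N: the fraction of requests whose latency exceeds the threshold."""
    return sum(1 for l in latencies if l > threshold) / len(latencies)

def valid_throughput(latencies, threshold):
    """N - M: how many requests finished within the outlier threshold."""
    return sum(1 for l in latencies if l <= threshold)

def tail_latency(latencies, percentile=99):
    """Mean latency of all requests beyond the given percentile [25]."""
    s = sorted(latencies)
    tail = s[int(len(s) * percentile / 100):]
    return sum(tail) / len(tail)

if __name__ == "__main__":
    random.seed(42)
    # Synthetic latencies in microseconds: ~99% fast requests plus a 1%
    # heavy-tailed slice, standing in for a measured distribution.
    lat = [random.expovariate(1 / 100.0) if random.random() > 0.01
           else random.expovariate(1 / 2000.0) for _ in range(100_000)]
    t = 1000  # outlier threshold: 1000 us
    print(outlier_proportion(lat, t))  # the user-facing QoS view
    print(valid_throughput(lat, t))    # the provider's view
    # The tail latency alone does not say how many requests missed a given
    # threshold, so the valid throughput cannot be derived from it.
    print(tail_latency(lat, percentile=99))
```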
A. Related concepts

Soft real time systems are those in which timing requirements are statistically defined [7]. An example is a video conference system: it is desirable that frames are not skipped, but it is acceptable if a frame or two is occasionally missed. The goal of a soft real time system is not to reduce the service or job-level outlier proportion but to keep the average latency within a specified threshold; described in terms of tail latency, the 0% tail latency (i.e., the mean latency over all requests) must be less than the threshold. Instead, in a hard real time system, the deadlines must be guaranteed. That is to say, the service or job-level outlier proportion must be zero! For example, if during a rocket engine test the engine begins to overheat, the shutdown procedure must be completed in time [7].

Different from hard real time systems, 10-ms computing systems care not only about performance outliers, specifically the outlier proportion, but also about resource utilization, whereas hard real time systems often use dedicated systems to achieve the zero outlier proportion goal without paying much attention to resource utilization.
III. CHALLENGES FROM AN OS PERSPECTIVE
We investigate the outlier problem from the perspective of the operating system using a latency-critical workload: memcached. memcached [1] is a popular in-memory key-value store intended for speeding up dynamic web applications by alleviating database loads. Its average latency is about tens or hundreds of µs.
Fig. 2: (a) shows the outlier proportions of a single server using different Linux kernels and schedulers. (b) shows the valid throughput. The core number of a server is varied (see X axis). The outlier threshold is 100 µs.

A real-world memcached-based application usually needs to invoke several get or put memcached operations, in addition to many other procedures, to serve a single request, so we choose it as a case study on 10-millisecond computing. We increase the number of running cores and bind a memcached instance to each active core via numactl. A 40-core ccNUMA (cache coherent Non-Uniform Memory Access) machine with four NUMA domains (each with a 10-core E7-8870 processor) is used to run the memcached instances. Four 12-core servers run client emulators, which are obtained from MOSBench [9]. The host OS is SUSE11SP1 with kernel version 2.6.32 and the default scheduler CFS (Completely Fair Scheduler).

Figure 1 shows the outlier proportions under different outlier thresholds and increasing core counts. We can observe that: a) tighter outlier thresholds lead to higher outlier proportions; the outlier proportion is as low as 0.60% on a common 12-core node with an outlier threshold of 1 second, but as high as 4.50% if we reduce the outlier threshold to 100 µs. b) The outlier proportion increases gradually with the number of cores; in the worst case, it degrades to 9.09%. According to Equation 1, using 1K 40-core servers, the service or job-level outlier proportion will be up to nearly 100%.

Following the above observations, we quantitatively measure the challenges in terms of reducing the outlier proportion using a monolithic kernel (Linux) and virtualization technologies, including Xen and Linux Containers.
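The measurement setup above can be sketched as follows; the ports and memcached flags are our assumptions, since the paper only states that one single-threaded instance is bound per active core via numactl:

```python
import subprocess

def launch_memcached(core: int, port: int) -> subprocess.Popen:
    """Pin one single-threaded memcached instance to one physical core."""
    cmd = [
        "numactl", f"--physcpubind={core}",  # bind to exactly one core
        "memcached",
        "-p", str(port),   # one port per instance
        "-t", "1",         # a single worker thread
        "-m", "64",        # 64 MB of cache (illustrative)
        "-u", "nobody",    # only needed when launched as root
    ]
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    # One instance per active core, e.g. cores 0..39 of the 40-core server.
    procs = [launch_memcached(core, 11211 + core) for core in range(40)]
    for p in procs:
        p.wait()
```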
A. A monolithic kernel: Linux

We conducted a series of experiments to study whether current Linux systems still suffer from poor outlier performance. The operating systems we tested are Linux with different kernels: 1) 2.6.32, an old kernel released five years ago but still popularly used and in long-term maintenance; 2) 3.17.4, the latest kernel, released on November 21, 2014; and 3) 2.6.35M, a modified version of 2.6.35 integrated with sloppy counters, proposed by Boyd-Wickizer et al. to solve scalability problems and mitigate kernel contentions [9], [2]. Sloppy counters adopt local counters on each core and a central counter to avoid unnecessary touching of the global reference count; their evaluations show good effects on mitigating unnecessary contention on kernel objects. Besides these systems with different kernels, we also evaluated the impact of representative real time schedulers, SCHED_FIFO (First In First Out) and SCHED_RR (Round Robin).

Fig. 3: (a) shows the outlier proportions of a single server when running one Linux instance, 4 containers (LXC), and 4 VMs (XEN) on a varied number of cores (see X axis); the outlier threshold is 100 µs. (b) shows, for a 1K-server configuration, how many times the outlier proportion of each server (see Y axis) needs to be reduced to bring the service or job-level outlier proportion to 10% or 5%. The core number of a server is varied (see X axis); the outlier threshold is 100 µs.
A SCHED_FIFO process runs until it is blocked by an I/O request, is preempted by a higher-priority process, or itself invokes sched_yield. SCHED_RR is a simple enhancement of SCHED_FIFO, except that each process is allowed to run only for a maximum time slice; the time slice in SCHED_RR is set to 100 µs. In the figures, we use a dash with CFS, FIFO, or RR following the kernel version to denote the scheduler the kernel uses.

Results and discussions. Figures 2a and 2b show that these systems are not competent for a low outlier proportion under the outlier threshold of 100 µs. First, all systems have a scalability problem in terms of the valid throughput. Although the modified kernel Linux-2.6.35M behaves better than Linux-2.6.32, its throughput stops increasing after 8 cores. Second, the outlier proportions climb up to 10% with 40 cores. Besides, the latest kernel 3.17.4 still cannot surpass the older kernel 2.6.35M; such an ad hoc method of mitigating potential resource contention seems to be endless and yields limited improvements. Third, the real time schedulers neither reduce the outlier proportion nor improve the valid throughput; they show no positive effects on performance.
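For reference, the two policies evaluated above can be selected from Python's standard os module on Linux (a sketch; note that the RR quantum is a kernel-wide setting rather than a per-process parameter, so the 100 µs slice used in the experiments above requires kernel configuration):

```python
import os

def use_fifo(pid: int = 0, priority: int = 50) -> None:
    """Switch a process (0 = the calling process) to SCHED_FIFO."""
    os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))

def use_rr(pid: int = 0, priority: int = 50) -> None:
    """Switch to SCHED_RR; the round-robin quantum is kernel-wide."""
    os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(priority))
    # Inspect the quantum the kernel actually grants (in seconds).
    print(os.sched_rr_get_interval(pid))

if __name__ == "__main__":
    use_rr()  # requires root (CAP_SYS_NICE) on Linux
```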
B. Virtualization technologies

Virtualization offers multiple execution environments on a single server. Compared with a monolithic-kernel OS, according to Equation 5, the number of guest OSes or containers amplifies the outlier and hence makes the outlier proportion deteriorate. We use LXC [45] and Xen [12] to evaluate the outlier proportions. The versions of LXC and Xen are 0.7.5 and 4.0.0, respectively. The VMs on Xen are all para-virtualized. For both LXC and Xen, the node resources are divided equally among the 4 containers and 4 VMs.

LXC is based on a monolithic kernel which binds multiple processes (a process group) together to run as a full-functioned OS. A container-based system can spawn several shared, virtualized OS images, each of which is known as a container. It consists of multiple processes bound together (a process group), appearing as a full-functioned OS with an exclusive root file system and a safely shared set of system libraries. For Xen-like hardware-level virtualization, a microkernel-like hypervisor manages all virtual machines (VMs). Each VM is composed of virtual devices generated by device emulators and runs at a less privileged level. The execution of privileged instructions on a VM must be delivered to the hypervisor. Communications are based on the event channel mechanism.

We run memcached instances on four containers and four virtual machines (VMs) hosted on a single node, respectively. A request is fanned out to the four containers or VMs, and the four sub-query responses are merged in the client emulator to calculate the performance. From Figure 3a, we observe that the performance is far from the expectation. Note that when deployed on fewer than 24 cores, Xen has better outlier proportions: compared to LXC, Xen has much better performance isolation. Xen's and LXC's outlier proportions deteriorate significantly when each VM or container runs on 10 cores. We also note that the valid throughput of Xen is much lower than that of LXC and Linux.
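A quick Monte Carlo check of Equation 6 for this four-way fan-out (illustrative, with an assumed per-replica outlier probability):

```python
import random

def merged_outlier(op_single: float, k: int, trials: int = 100_000) -> float:
    """Fraction of fanned-out requests whose slowest replica misses the deadline."""
    outliers = 0
    for _ in range(trials):
        # The merged response is late if ANY of the k sub-queries is late.
        if any(random.random() < op_single for _ in range(k)):
            outliers += 1
    return outliers / trials

if __name__ == "__main__":
    random.seed(1)
    # With a 2% per-replica outlier probability and k = 4,
    # Eq. 6 predicts 1 - 0.98**4 ~= 7.8%; the simulation agrees.
    print(merged_outlier(0.02, k=4))
```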
C. Discussion
From Figure 3b, we surprisingly find that to reduce the service or job-level outlier proportion to 10% when running Linux, LXC, or XEN, the outlier proportion of a single server needs to be reduced by 871X, 2372X, and 2372X, respectively. If the target is 5%, the performance gap is 1790X, 4721X, and 4721X, respectively.

The operating system can be abstracted as a multi-thread scheduling system with internal and external interrupts. The outliers are mainly caused by tasks starved of certain resources because of preemption or interference from parallel tasks or kernel routines. Waiting and serialization can be aggravated by increasing numbers of parallel tasks and cores. Here are a few situations in which the outlier proportion may be aggravated with increasing cores (the sketch after this list illustrates the first two effects):

• Synchronization. Synchronization becomes more frequent and time-consuming, such as the lock synchronization of resource counters, cache coherence, and TLB shootdown among cores. For example, in a multiprocessor OS, TLB consistency is maintained by invalidating entries when pages are unmapped. Although the TLB invalidation itself is fast, the process of context switches, transferring IPIs (Inter-Processor Interrupts) across all possible cores, and waiting for all acknowledgements may be time-consuming [48]. On one hand, processing shootdown IPIs needs to interrupt currently running tasks. On the other hand, the time consumed in transferring and waiting can climb rapidly with increasing cores.

• Shared status. Shared status also becomes more common, such as shared queues, lists, or buffers. For example, information about software resources or kernel states is commonly stored in queues, lists, or trees, such as the per-block list that tracks open files. With increasing tasks on more cores, these structures become more heavily populated, so searching and traversing them becomes more expensive. Besides, there are many limits set by the kernel; in 10-ms computing, tiny tasks may exist in large quantities, and their accesses to system resources may easily become excessive.

• Scheduling based on limited hardware. Hardware resources such as the last level cache (LLC), the inter-core interconnect, DRAM, and the I/O hub are physically shared by all processors. Too many operations on these subsystems may exceed their capacity, so it is difficult for a scheduler to schedule tasks with an optimal solution.
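The first two effects can be illustrated with a deliberately simple user-level sketch (ours, not a kernel measurement): worker threads contending on one shared lock stand in for contended kernel structures, and the per-task outlier proportion grows with the number of workers even though each task does constant work.

```python
import threading
import time

def run(workers: int, tasks_per_worker: int = 2000,
        threshold_s: float = 100e-6) -> float:
    """Return the fraction of tasks exceeding a 100 us outlier threshold."""
    shared_lock = threading.Lock()   # stands in for a contended kernel object
    latencies = []
    sink_lock = threading.Lock()

    def worker():
        local = []
        for _ in range(tasks_per_worker):
            start = time.perf_counter()
            with shared_lock:            # serialized section
                _ = sum(range(200))      # constant amount of work per task
            local.append(time.perf_counter() - start)
        with sink_lock:
            latencies.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(1 for l in latencies if l > threshold_s) / len(latencies)

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        # The outlier proportion typically grows with the worker count,
        # even though each task's work is unchanged.
        print(n, run(n))
```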
IV. POSSIBLE DESIGN SPACE AND CHALLENGES
This section discusses the possible design space and challenges from the perspectives of datacenter architecture, networking, OS and scheduling, and benchmarking.
A. Data center architecture
Existing datacenters are built using commodity servers. A common 1000-node cluster often comes with a significant performance penalty because networks at datacenter scale have historically been oversubscribed: communication was fast within a rack, slow off-rack, and worse still for nodes whose nearest ancestor was the root [35]. The datacenter bandwidth shortage forced software developers to think in terms of rack locality, moving computation to data rather than vice versa [35]. The same holds true for storage systems. In this server-centric setting, resource shortage results in excessive competition and significantly worsens performance outliers for 10-ms computing.

Recent efforts suggest a forthcoming paradigm shift towards a disaggregated datacenter (DDC), where each resource type is built as a standalone resource pool and a network fabric interconnects these resource pools [19].

On one hand, this paradigm shift is driven largely by hardware architects. CPU, memory, and storage technologies exhibit significantly different trends in terms of cost, performance, and power scaling, and hence it is increasingly hard to adopt evolving resource technologies within a server-centric architecture [19]. By decoupling CPU, memory, and storage resources, DDC makes it easier for each resource technology to evolve independently and reduces the time-to-adoption by avoiding the burdensome process of redoing integration and motherboard design [19]. On the other hand, as resources are accessed much more uniformly than in the traditional server-centric architecture, DDC helps alleviate performance outliers. In addition, it also enables fine-grained and efficient individual resource provisioning and scheduling across jobs [20].

Unfortunately, resource disaggregation further challenges networking, since disaggregating CPU from memory and disk requires that the inter-resource communication that used to be within the scope of a server must now traverse the network fabric. Thus, to support good application-level performance, it becomes critical that the network fabric provide much lower latency communication [20].
B. Networking
In an end-to-end application within a datacenter, there are end-host stacks, NICs (network interface cards), and switches at which packets experience delay [3]. Kernel-bypass and zero-copy techniques [14], [43] significantly reduce the latency at the end-host and in the NICs.

Rumble et al. [39] argue that it should be possible to achieve end-to-end remote procedure call (RPC) latencies of 5-10 µs in large datacenters using commodity hardware and software within a few years. However, achieving that goal requires creating a new software architecture for networking with a different division of responsibility between the operating system, hardware, and applications. Over the longer term, they also think 1 µs datacenter round-trips can be achieved with more radical hardware changes, i.e., moving the NIC onto the CPU chip.

Trivedi et al. [47] also found that current Spark data processing frameworks cannot leverage network speeds higher than 10 Gbps because the high amount of CPU work done per byte moved eclipses any potential gains coming from the network. This observation indicates that 10-ms computing needs more than ultra-low-latency networking; we need to investigate a number of ways to balance CPU and network time.

C. OS and scheduling
The widely used Linux, Linux Containers (LXC) [6], and Xen [8] adopt a monolithic kernel or a virtual machine monitor (VMM) that shares numerous data structures protected by locks (share first), and then a process or virtual machine (VM) is proposed to guarantee performance isolation (then isolate). These OS architectures have an inherent structural obstacle to reducing the outlier proportion, because globally shared and contended data structures and the maintenance of global coordination aggravate performance interference and outliers, especially for 10-ms computing. Our evaluation in Section III shows that, running with three Linux kernels (2.6.32, 2.6.35, and 3.17.4) on a 40-core server, memcached exhibits a deteriorating outlier proportion as the core count increases, while its average latency does not increase much.

The shift toward 10-ms computing calls for new OS architectures for reducing the outlier proportion. Our previous work [33] presents an "isolate first, then share" (in short, IFTS) OS model that is promising in guaranteeing worst-case performance. We decompose the OS into a supervisor and several subOSes running in parallel. A subOS isolates resources first, directly managing physical resources without intervention from the supervisor, and the supervisor enables resource sharing by creating, destroying, and resizing a subOS on the fly. SubOSes and the supervisor have confined state sharing, but fast inter-subOS communication mechanisms based on shared memory and IPIs are provided on demand. On several Intel Xeon platforms, we applied the IFTS OS model to build the first working prototype, RainForest. We ported Linux 2.6.32 as a subOS and performed a comprehensive evaluation of RainForest against three Linux kernels: 1) 2.6.32; 2) 3.17.4; 3) 2.6.35M; as well as LXC (version 0.7.5) and Xen (version 4.0.0). Experimental results show that, with respect to Linux, LXC, and Xen, RainForest improves the throughput of the search service by 25.0%, 42.8%, and 42.8% under a 99th percentile latency of 200 ms, and the CPU utilization rate is improved by 16.6% to 25.1% accordingly. Our previous work shows that achieving better isolation among OS instances provides better worst-case performance and is promising for 10-ms computing workloads.

In addition to OS support, supporting 10-ms computing requires a low-latency, high-throughput task scheduler. As Ousterhout et al. demonstrate in [36], handling 100 ms tasks in a cluster with 160,000 cores (e.g., 10,000 16-core machines) requires a scheduler that can, on average, make 1.6 million scheduling decisions per second. Today's centralized schedulers have well-known scalability limits [40] that hinder their ability to support millisecond-scale tasks in a large cluster. Instead, handling large clusters and very short tasks will require a decentralized scheduler design like Sparrow [37]. In addition to providing high-throughput scheduling decisions, a framework for 10-ms computing must also reduce the overhead for launching individual tasks [36].
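The arithmetic behind the 1.6 million figure is a direct division (a worked restatement of the numbers above):

```latex
\[
\underbrace{10{,}000 \times 16}_{160{,}000\ \text{cores}}
\times
\underbrace{\tfrac{1\ \text{task}}{0.1\ \text{s}}}_{\text{per core}}
= 1.6 \times 10^{6}\ \text{scheduling decisions per second}
\]
```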
D. Benchmarking
To perform research on 10-ms computing, the first challenge is to set up a benchmark suite. Unfortunately, this is a non-trivial issue, as 10-ms computing may depend on new scheduling, execution, and programming models. So the challenge lies in how to set up a benchmark suite on existing commodity software systems before the new scheduling, execution, and programming models for 10-ms computing have been implemented. Previous work like BigDataBench [50] or TailBench [26] still has serious drawbacks; for example, TailBench [26] only provides several simple workloads whose latencies vary from microseconds to seconds.
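Even without those new models, a latency-centric harness can start small. Below is a minimal closed-loop probe (our sketch, not part of TailBench or BigDataBench; host, port, and key are assumptions) that issues GETs over memcached's text protocol and reports the outlier proportion:

```python
import socket
import time

def probe(host="127.0.0.1", port=11211, n=10_000,
          threshold_s=100e-6) -> float:
    """Issue n GETs over memcached's text protocol; return the outlier proportion."""
    s = socket.create_connection((host, port))
    outliers = 0
    for _ in range(n):
        start = time.perf_counter()
        s.sendall(b"get somekey\r\n")        # memcached ASCII protocol
        buf = b""
        while not buf.endswith(b"END\r\n"):  # every GET reply ends with END
            buf += s.recv(4096)
        if time.perf_counter() - start > threshold_s:
            outliers += 1
    s.close()
    return outliers / n

if __name__ == "__main__":
    print("outlier proportion:", probe())
```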
V. RELATED WORK
The outlier problem has been studied in many areas, such as parallel iterative convergent algorithms, where all executing threads must be synchronized [13]. Within the context of scale-out architectures, we now discuss related work on outlier sources and mitigation.
A. Sources of Outliers and Tail Latency

Hadoop MapReduce outliers. The problem of Hadoop outliers was first described in [16] and further studied in heterogeneous environments [52]. In Hadoop, task outliers are typically incurred by task skews, including load imbalance among servers, uneven data distribution, and unexpectedly long processing times [32], [23], [28], [29], [37].
Tail latency in interactive services. In today's warehouse-scale computing systems (WCSs), interactive services and batch jobs are typically co-located on the same machines to increase machine utilization. In such shared environments, resource contention and performance interference are a major source of service time variability [44], [18]. This variability is further amplified significantly when requests' queueing delay is considered, thus incurring high request tail latency.
B. Application-level techniques to mitigate outliers
Task/sub-request redundancy is a commonly applied technique to mitigate outliers and tail latency. The key idea of this technique is to execute each individual task/sub-request in multiple replicas so as to reduce its latency by using the quickest replica. In [4], [49], this technique has been applied to mitigate outlier tasks in small Hadoop jobs whose number of tasks is smaller than ten. In [46], it is applied to reduce tail latency only when the system is in an idle state. In contrast, task/sub-request reissue is a conservative redundancy technique [5], [24], [37]. This technique first sends a task/sub-request replica to one appropriate machine. The replica is judged to be an outlier if it is not completed after a short delay (e.g., the 99th percentile latency for this class of tasks/sub-requests), and then a secondary replica is sent to another machine. Both techniques work well when the service is under light load. However, when the load becomes heavier, the unnecessary execution of the same tasks/sub-requests adversely increases the outlier proportion [41].
Partial execution is another widely used technique to mitigate outliers by sacrificing result quality (e.g., prediction accuracy in classification or recommendation services). Following the anytime framework initially proposed in AI [53], this technique has been applied in the Bing search engine [24], [21] to return an intermediate, partial search result whenever the allocated processing time expires. Similar approaches have been proposed to sample a subset of input data to produce approximate results for MapReduce jobs under both time and resource constraints [30], [38], [42], [51]. Moreover, best-effort scheduling algorithms, which allocate the available processing time among tasks/sub-requests to maximize their result quality, have been developed as a complement to the partial execution technique [22], [21]. However, when the load becomes heavier, this technique incurs considerable loss in result quality to meet the response target, which is sometimes unacceptable for users.
VI. CONCLUSION

In this paper we argue that computers or smart devices should and will consistently provide information and knowledge to human beings in the order of a few tens of milliseconds, despite computation becoming much more complex on data with an unprecedented scale. We coin a new term, 10-millisecond computing, to call attention to this class of workloads.

We specifically investigate 10-ms computing's challenges to conventional operating systems. For a 1K-scale system (a typical internet service configuration) running Linux (version 2.6.32), LXC (version 0.7.5), or XEN (version 4.0.0), respectively, we surprisingly find that to reduce the service-level outlier proportion of a typical workload (memcached) to 10%, the outlier proportion of a single server needs to be reduced by 871X, 2372X, and 2372X, accordingly. We also conducted a series of experiments to reveal that state-of-the-art and state-of-the-practice Linux systems still suffer from poor performance outliers, including Linux kernel versions 3.17.4 and 2.6.35M (a modified version of 2.6.35 integrated with sloppy counters) and two representative real time schedulers. This observation indicates the new challenges are significantly different from the traditional outlier and straggler issues widely investigated in MapReduce and other environments. Also, we discuss the possible design spaces and challenges of 10-ms computing systems from the perspectives of datacenter architecture, networking, OS and scheduling, and benchmarking.

REFERENCES

[1] Memcached. Accessed Dec 2014. http://memcached.org/.
[2] Mosbench: a set of application benchmarks to measure OS scalability. Modified kernels are included. http://pdos.csail.mit.edu/mosbench/.
[3] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. Less is more: trading a little bandwidth for ultra-low latency in the data center. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 19–19. USENIX Association, 2012.
[4] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, volume 13, pages 185–198, 2013.
[5] Ganesh Ananthanarayanan, Srikanth Kandula, Albert G. Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI, volume 10, page 24, 2010.
[6] Gaurav Banga, Peter Druschel, and Jeffrey C. Mogul. Resource containers: A new facility for resource management in server systems. In OSDI, volume 99, pages 45–58, 1999.
[7] Michael Barabanov. A Linux-based real-time operating system. PhD thesis, New Mexico Institute of Mining and Technology, 1997.
[8] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. ACM SIGOPS Operating Systems Review, 37(5):164–177, 2003.
[9] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI'10, pages 1–8, Berkeley, CA, USA, 2010. USENIX Association.
[10] Stuart K. Card, Allen Newell, and Thomas P. Moran. The Psychology of Human-Computer Interaction. 1983.
[11] Stuart K. Card, George G. Robertson, and Jock D. Mackinlay. The information visualizer, an information workspace. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 181–186. ACM, 1991.
[12] David Chisnall. The Definitive Guide to the Xen Hypervisor. Pearson Education, 2008.
[13] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. Solving the straggler problem with bounded staleness. In HotOS. USENIX Association, 2013.
[14] David Cohen, Thomas Talpey, Arkady Kanevsky, Uri Cummings, Michael Krause, Renato Recio, Diego Crupnicoff, Lloyd Dickman, and Paul Grun. Remote direct memory access over the converged enhanced ethernet fabric: Evaluating the options. In High Performance Interconnects, 2009. HOTI 2009. 17th IEEE Symposium on, pages 123–130. IEEE, 2009.
[15] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.
[16] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[17] Alan Dix. Natural time. In Position Paper for CHI 96 Basic Research Symposium (April 13-14, 1996, Vancouver, BC), 1996.
[18] Mario Flajslik and Mendel Rosenblum. Network interface design for low latency request-response protocols. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference, pages 333–346. USENIX Association, 2013.
[19] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network requirements for resource disaggregation. In OSDI, pages 249–264. USENIX Association, 2016.
[20] Sangjin Han, Norbert Egi, Aurojit Panda, Sylvia Ratnasamy, Guangyu Shi, and Scott Shenker. Network support for resource disaggregation in next-generation datacenters. In Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks, page 10. ACM, 2013.
[21] Yuxiong He, Sameh Elnikety, James Larus, and Chenyu Yan. Zeta: scheduling interactive services with partial execution. In Proceedings of the Third ACM Symposium on Cloud Computing, page 12. ACM, 2012.
[22] Yuxiong He, Sameh Elnikety, and Hongyang Sun. Tians scheduling: Using partial processing in best-effort applications. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 434–445. IEEE, 2011.
[23] Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu, Bingsheng He, and Li Qi. Leen: Locality/fairness-aware key partitioning for MapReduce in the cloud. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, pages 17–24. IEEE, 2010.
[24] Virajith Jalaparti, Peter Bodik, Srikanth Kandula, Ishai Menache, Mikhail Rybalkin, and Chenyu Yan. Speeding up distributed request-response workflows. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, pages 219–230. ACM, 2013.
[25] Harshad Kasture and Daniel Sanchez. Ubik: efficient cache sharing with strict QoS for latency-critical workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 729–742. ACM, 2014.
[26] Harshad Kasture and Daniel Sanchez. TailBench: a benchmark suite and evaluation methodology for latency-critical applications. In Workload Characterization (IISWC), 2016 IEEE International Symposium on, pages 1–10. IEEE, 2016.
[27] Ron Kohavi, Randal M. Henne, and Dan Sommerfield. Practical guide to controlled experiments on the web: Listen to your customers not to the hippo. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, pages 959–967, New York, NY, USA, 2007. ACM.
[28] YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 75–86. ACM, 2010.
[29] YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. SkewTune: mitigating skew in MapReduce applications. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 25–36. ACM, 2012.
[30] Nikolay Laptev, Kai Zeng, and Carlo Zaniolo. Early accurate results for advanced analytics on MapReduce. Proceedings of the VLDB Endowment, 5(10):1028–1039, 2012.
[31] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. Tales of the tail: Hardware, OS, and application-level sources of tail latency.
[32] Jimmy Lin et al. The curse of zipf and limits to parallelization: A look at the stragglers problem in MapReduce. In , volume 1, 2009.
[33] Gang Lu, Jianfeng Zhan, Xinlong Lin, Chongkang Tan, and Lei Wang. "Isolate first, then share": a new OS architecture for the worst-case performance. arXiv preprint arXiv:1604.01378, 2016.
[34] Marissa Mayer. Rough notes from Marissa Mayer keynote. Keynote at Google I/O, 2008.
[35] Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen S. Hofmann, Jon Howell, and Yutaka Suzue. Flat datacenter storage. In OSDI, pages 1–15, 2012.
[36] Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, and Ion Stoica. The case for tiny tasks in compute clusters. In Proc. HotOS, 2013.
[37] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. Sparrow: distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 69–84. ACM, 2013.
[38] Niketan Pansare, Vinayak R. Borkar, Chris Jermaine, and Tyson Condie. Online aggregation for large MapReduce jobs. Proc. VLDB Endow., 4(11):1135–1145, 2011.
[39] Stephen M. Rumble, Diego Ongaro, Ryan Stutsman, Mendel Rosenblum, and John K. Ousterhout. It's time for low latency. In HotOS, volume 13, pages 11–11, 2011.
[40] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 351–364. ACM, 2013.
[41] Nihar B. Shah, Kangwook Lee, and Kannan Ramchandran. When do redundant requests reduce latency? Technical report, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, 2013.
[42] Yingjie Shi, Xiaofeng Meng, Fusheng Wang, and Yantao Gan. You can stop early with COLA: Online processing of aggregate queries in the cloud. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 1223–1232. ACM, 2012.
[43] Piyush Shivam, Pete Wyckoff, and Dhabaleswar Panda. EMP: zero-copy OS-bypass NIC-driven gigabit ethernet message passing. In Supercomputing, ACM/IEEE 2001 Conference, pages 49–49. IEEE, 2001.
[44] David Shue, Michael J. Freedman, and Anees Shaikh. Performance isolation and fairness for multi-tenant cloud storage. In OSDI, pages 349–362, 2012.
[45] Stephen Soltesz, Herbert Pötzl, Marc E. Fiuczynski, Andy Bavier, and Larry Peterson. Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors. In ACM SIGOPS Operating Systems Review, volume 41, pages 275–287. ACM, 2007.
[46] Christopher Stewart, Aniket Chakrabarti, and Rean Griffith. Zoolander: Efficiently meeting very strict, low-latency SLOs. In ICAC, pages 265–277, 2013.
[47] Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Radu Stoica, Bernard Metzler, Ioannis Koltsidas, and Nikolas Ioannou. On the [ir]relevance of network performance for data processing. Network, 40:60, 2016.
[48] C. Villavieja, V. Karakostas, L. Vilanova, Y. Etsion, A. Ramirez, A. Mendelson, N. Navarro, A. Cristal, and O. S. Unsal. DiDi: Mitigating the performance impact of TLB shootdowns using a shared TLB directory. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 340–349, Oct 2011.
[49] Ashish Vulimiri, Oliver Michel, P. Godfrey, and Scott Shenker. More is less: reducing latency via redundancy. In Proceedings of the 11th ACM Workshop on Hot Topics in Networks, pages 13–18. ACM, 2012.
[50] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, K. Zhan, Xiaona Li, and Bizhu Qiu. BigDataBench: A big data benchmark suite from internet services. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 488–499, Feb 2014.
[51] Yuxiang Wang, Junzhou Luo, Aibo Song, and Fang Dong. A sampling-based hybrid approximate query processing system in the cloud. In Parallel Processing (ICPP), 2014 43rd International Conference on, pages 291–300. IEEE, 2014.
[52] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy H. Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, volume 8, page 7, 2008.
[53] Shlomo Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, 17(3), 1996.