In Datacenter Performance, The Only Constant Is Change
Dmitry Duplyakin*, Alexandru Uta†‡, Aleksander Maricq*, Robert Ricci*
*University of Utah, †Vrije Universiteit Amsterdam, ‡Leiden University
*{dmdu, amaricq, ricci}@cs.utah.edu, †[email protected]

Abstract—All computing infrastructure suffers from performance variability, be it bare-metal or virtualized. This phenomenon originates from many sources: some transient, such as noisy neighbors, and others more permanent but sudden, such as changes or wear in hardware, changes in the underlying hypervisor stack, or even undocumented interactions between the policies of the computing resource provider and the active workloads. Thus, performance measurements obtained on clouds, HPC facilities, and, more generally, datacenter environments are almost guaranteed to exhibit performance regimes that evolve over time, which leads to undesirable nonstationarities in application performance. In this paper, we present our analysis of the performance of the bare-metal hardware available on the CloudLab testbed, where we focus on quantifying the evolving performance regimes using changepoint detection. We describe our findings, backed by a dataset with nearly 6.9M benchmark results collected from over 1600 machines over a period of 2 years and 9 months. These findings yield a comprehensive characterization of real-world performance variability patterns in one computing facility, a methodology for studying such patterns on other infrastructures, and contribute to a better understanding of performance variability in general.
Index Terms—Datacenter Performance; Benchmarking; Variability; Changepoint Detection; Temporal Analysis.
I. INTRODUCTION
The performance of computing infrastructure is variable, which means multiple runs of the same code on the same hardware result in slightly different performance results [1], [2]. Often, this variation is relatively small (within a few percent) and follows a pattern that can be modeled as random noise, with samples following a stationary (though potentially complicated) distribution. Under these conditions, it is possible to use sound experiment designs [3], [4], [5] and statistical techniques [2], [6] to achieve meaningful performance results. Sometimes, however, the performance distribution does not remain stationary: it exhibits some systematic change, such as a change in the median, variance, tail, or other statistical properties. Instances of such change are called changepoints, and the practice of finding them is referred to as Changepoint Detection (CPD) [7], [8].

In this study, we consider the problem of detecting performance changepoints in large computing facilities. We use data collected from CloudLab, a distributed testbed which supports cloud computing and systems research by providing raw access to programmable hardware [9]. Using 6.9M measurements collected on more than 1600 bare-metal machines during a period of 2 years and 9 months (May 2017–February 2020), we detect between 583 and 2439 changepoints (depending on the detection sensitivity). In this paper, we describe our assessment of the magnitude of these performance changes and the duration of inter-changepoint intervals, and relate groups of changepoints to several major recorded configuration changes that occurred on CloudLab over the course of this benchmarking effort.

Understanding changepoints in facility performance is helpful to both operators and users. In the operation of a large-scale facility, change over time is a fact of life: our analysis in Section IV shows that performance change is the norm rather than the exception. Hardware ages, firmware and software get updated, security patches [10] are applied, and even changes in the physical environment affect performance over time [11]. Some of these changes are planned, some are not, and some changes have unexpected performance consequences. The level of "noise" in repeated performance measurements and between different machines in the same facility makes detecting and characterizing these changes difficult. However, our work should help operators better understand their own facilities by giving them insights into the effects of planned system updates, allowing them to find unexpected changes, and helping them grasp which metrics change together.

On the user side, changepoints can help users better understand the performance of their own codes and run repeatable experiments. When one gets unexpected performance results, a natural question to ask is "Has something on the platform changed?" Our CPD-based approach helps answer this type of question: given some performance measurements collected between two points in time, we can automatically detect significant changepoints that occurred between them and identify the sets of impacted metrics, as well as the magnitudes of the observed changes. This is also important for repeatable and predictable performance research [12]: if the goal is to compare the performance of two programs where one of them was originally run months or years ago, it is essential to understand if the baseline performance of the platform has evolved since that time, and, if so, in what specific ways and by how much.

In the face of non-trivial and often unexpected performance changes, practitioners are often unprepared, especially when they do not control the underlying infrastructure. We advocate for constantly fingerprinting the performance of the underlying resources as a prerequisite to application performance explainability. Similar to the recent push for explainable artificial intelligence algorithms, making sense of systems' performance is becoming an extremely complex yet necessary endeavor, especially at large scale. Therefore, performance fingerprinting, which includes frequent benchmarking of individual components of computing infrastructure, is imperative for explainable system performance behavior. As shown in recent studies, performance evaluations in the literature do not always use experimental practices that account for variability and change over time [13].
Building benchmarks that automatically adjust based on performance variability leads to more trustworthy evaluations [14], and understanding the types and frequency of changes that occur in practice is vital to this adaptation. Additionally, understanding changes over time—especially unplanned ones—can help system designers build systems that are stable and offer consistent performance guarantees.

Toward persuading the international systems community that in datacenter performance, stability is usually the exception rather than the norm [1], and that better experimental design and practices are needed [13], [15], in this paper we make the following contributions:
1) We advocate for and describe the power of CPD in characterizing performance variability (Section IV).
2) We characterize and identify changepoints in the performance of CloudLab. We offer clear-cut examples for validating our CPD results, such as OS and BIOS patches (Section V).
3) We offer practitioners and performance engineers guidelines toward tuning the sensitivity of CPD and setting expectations about its performance (Section VI).
4) To promote reproducibility, we release as open data a large-scale archive containing several million datapoints characterizing the CPU, memory, and disk performance of a heterogeneous pool of hardware resources available on CloudLab, as well as our analysis tools (Section VII).

To give the aforementioned contributions more context, we describe the dataset we have collected in Section III and discuss CPD-related developments with potential high impact in computer systems research in Section VIII.

II. RELATED WORK / BACKGROUND
Identifying changepoints in datasets is regarded as a useful technique for pinpointing when certain characteristics of the recorded data have changed. More precisely, a changepoint is a temporal moment that separates a given dataset into two sub-datasets with different statistical characteristics. Such analysis has proven itself valuable in several types of analyses related to large-scale systems. In this section, we start by giving a brief overview of scalable changepoint detection techniques and continue by describing the applicability domains of CPD.
Changepoint Detection Techniques.
There are frequentist [7], [16] and Bayesian [8], [17], [18] approaches to changepoint detection, and both have been shown to achieve good results in online and offline settings. Although CPD is computationally demanding, there exist scalable algorithms and efficient software packages for it [19], [20], [21]. Such linear or sub-quadratic time algorithms ensure that even massive datasets can be analyzed quickly, in a scalable way, and in near real-time scenarios.
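To make the mechanics concrete, the following minimal sketch runs penalized offline detection on a synthetic trace with a single mean shift. Note that our own pipeline uses the R robseg package (Section IV); the Python ruptures package and all parameter values here are stand-ins chosen for illustration.

```python
# Minimal CPD sketch: a synthetic trace with one mean shift, segmented by
# the PELT algorithm from the `ruptures` package (a stand-in for robseg).
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# 200 "benchmark results": a ~5% shift in the mean halfway through.
signal = np.concatenate([rng.normal(100.0, 1.0, 100),
                         rng.normal(105.0, 1.0, 100)])

# PELT minimizes segmentation cost plus a per-changepoint penalty; the
# penalty plays a role similar in spirit to the threshold K in Section IV.
algo = rpt.Pelt(model="l2").fit(signal)
breakpoints = algo.predict(pen=10)
print(breakpoints)  # e.g., [100, 200]; the last index is the series end
```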
Changepoint Detection in Large-scale Systems.
In large-scale systems, changepoint detection has been used in a variety of scenarios, including decentralized sensor networks [22], performance diagnosis in distributed systems [23], identifying JVM warmups [24], and detecting denial-of-service attacks and intrusions [25], [26], [27]. These techniques prove scalable and efficient enough to inform real-time decisions in time-critical applications.
Anomaly Detection in Large-scale Systems.
Computer systems are generally designed with clear (non-)functional requirements. Engineers and administrators measure the corresponding metrics to determine when systems deviate from their normal behavior, i.e., to detect anomalous behavior. Various changepoint detection methods [28], [29], [30], [31] are also successfully used for detecting anomalies.

In contrast with our work, all the previously discussed research considers changepoints or anomalies the exception rather than the norm. In this paper, however, we make the case for change being the norm in datacenter performance. Toward understanding this behavior, we show strong empirical evidence, and we advocate for continuously monitoring for performance changes. By accounting for change, engineers, datacenter operators, and researchers can design better benchmarks, more easily explain abnormal or seemingly inexplicable performance behavior, and build better datacenters and distributed applications.

III. LARGE-SCALE BENCHMARKING
A. Previous Work
Our initial work in the area of facility-wide hardware performance analysis started in May 2017 with a study of hardware available on the CloudLab testbed. CloudLab allocates an entire bare-metal machine to one user at a time, and therefore we can study the variability of the hardware performance free from the side effects of "noisy neighbors" and virtualization [32]. More specifically, we were interested in (i) how the performance differs between supposedly identical machines and (ii) how the same machine performs over time. Over a period of 10 months, we captured nearly 900K data points from 835 machines by testing subsets of available machines several times per day. Machines were tested using network, memory, and disk microbenchmark suites in least-recently-tested order. Our observations and statistical analysis resulted in a publication where we described thirteen data-backed findings from this study, spanning both experimentalists' and facility operators' concerns [1].

Since our initial collection period, we have expanded the scope of this performance analysis to include the newest hardware and additional benchmarks, and have increased the number of machines tested per collection period. As of the writing of this paper, the dataset has grown to include over 4.3M memory and 664K disk performance measurements from over 1600 machines. We have also gathered over 2.0M CPU test results and reported initial findings from our high-level assessment of the observed variability patterns [33].
B. Ongoing Benchmarking and Analysis
Close to our current work in terms of experiment scale are the studies of Gunawi et al. [11] and Amvrosiadis et al. [34]. The former investigates the symptoms of "fail-slow" hardware and characterizes the performance anomalies generated by such issues. The latter shows that trace-driven experiment designs can lead to over-fitting of systems software toward certain behavior when not enough (varied) traces are used. Much like those studies, we investigate a rather unexplored path in our domain, namely performance analysis of cyberinfrastructure with a focus on CPD. This analysis uncovers non-trivial behavior in performance measurements and helps characterize change patterns in datacenter performance. We present the data and the findings that have the potential to inform new research in the area of adaptive benchmarking, represented by studies such as [35] and [14], and should promote the development of CPD-based tools, both for analysis and for systems management.

TABLE I: Hardware specs and test coverage.

Type | Model | Processor | Cores | Machines Tested / Total | Measurements
m400 | HPE m400 | APM X-GENE | 8 | 311 / 315 | 151K / 201K
m510 | HPE m510 | Xeon D-1548 | 8 | 266 / 270 | 214K / 575K
xl170 | HPE xl170r G9 | Xeon E5-2640v4 | 10 | 196 / 200 | 152K / 270K
c220g1 | Cisco c220m4 | Xeon E5-2630v3 | 16 | 89 / 90 | 121K / 375K
c220g5 | Cisco c220m5 | Xeon Silver 4114 | 20 | 220 / 224 | 380K / 675K
c6220 | Dell C6220 | Xeon E5-2650v2 | 16 | 60 / 64 | 98K / 175K
c6320 | Dell C6320 | Xeon E5-2683v3 | 28 | 84 / 84 | 156K / 444K
c6420 | Dell C6420 | Xeon Gold 6142 | 32 | 71 / 72 | 59K / 105K

This study expands our previous temporal analysis of performance data [1] in two important ways. First, we go beyond simply checking stationarity with the Augmented Dickey–Fuller test [36], which previously showed that most of our performance traces cannot be viewed as stationary with a high confidence level. We describe an investigation that involves quantifying individual changepoints and temporary steady states. Second, we describe an analysis that not only characterizes individual timeseries but also processes entire batches of performance timeseries and helps reveal relationships between different performance manifestations of the same system events. By relating and grouping the different changepoints we identify, we gain a better understanding of the underlying causes.
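As a concrete illustration of the stationarity check mentioned above, the short sketch below applies the Augmented Dickey–Fuller test from the statsmodels package to a synthetic performance trace; the trace and the printed interpretation are illustrative assumptions, not our actual data.

```python
# Sketch of an ADF-based stationarity check on one (synthetic) trace.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
trace = np.concatenate([rng.normal(100, 1, 150),    # regime before a change
                        rng.normal(103, 1, 150)])   # regime after a change

stat, pvalue, *_ = adfuller(trace)
# The ADF null hypothesis is nonstationarity (a unit root); a large p-value
# means the trace cannot be confidently treated as stationary.
print(f"ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")
```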
C. Collected Performance Data
We measure the CPU performance of CloudLab hardware using NPB, the NAS Parallel Benchmarks [37], version 3.3-OMP. We run 9 microbenchmarks (BT, CG, EP, FT, IS, LU, MG, SP, UA) on homogeneous pools of machines of 11 types, turning dynamic voltage and frequency scaling (DVFS) on and off. We vary the number of running threads—we run tests that use either a single thread or all available hardware threads—and also pin the computations to each of the sockets (for two-socket machines) using the numactl utility. All these parameters create an input space with 590 distinct configurations that we use to answer questions related to performance changepoints in the current work. Each run produces a record in our dataset with the runtime (in seconds) accompanied by many metadata attributes, which include machine specs, OS version, kernel release, and compiler version, among others.

Similarly, we collect measurements for 1038 memory configurations that correspond to evaluating the same CloudLab hardware using STREAM [38] tests and the microbenchmarks from Alex W. Reece's suite [39], [40] for testing Intel x86 intrinsics such as SSE and AVX instructions. In Table I, we detail some of CloudLab's hardware types—the ones we refer to throughout the paper—their specifications, and the testing coverage. The complete hardware overview can be found on the CloudLab Hardware documentation page [41].

We include 152 disk configurations in our analysis. We run fio [42] in different settings on raw I/O devices or their partitions, eight tests per device: read and write load, random and sequential tests, low and high iodepth settings; a sketch of this eight-test matrix appears below. These configurations include measurements for HDDs, SSDs, and NVMe devices, which provides opportunities for comparing entire classes of I/O devices.

We have made this entire 6.9M-measurement performance dataset publicly available as described in Section VII.
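To make the disk-test matrix concrete, here is a sketch of how the eight fio tests per device could be launched: {read, write} x {sequential, random} x {low, high} iodepth. The device path, block size, runtime, and the specific iodepth values are assumptions for illustration, not our exact settings.

```python
# Sketch: enumerate and launch the eight fio tests for one device.
import itertools
import subprocess

DEVICE = "/dev/sdb"  # hypothetical raw device under test

for op, random_io, iodepth in itertools.product(
        ["read", "write"], [False, True], [1, 32]):
    rw = ("rand" if random_io else "") + op  # read, write, randread, randwrite
    cmd = [
        "fio", f"--name={rw}-qd{iodepth}", f"--filename={DEVICE}",
        f"--rw={rw}", "--ioengine=libaio", "--direct=1", "--bs=4k",
        f"--iodepth={iodepth}", "--runtime=60", "--time_based",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # A real harness would parse bandwidth/latency from the JSON in
    # result.stdout and store it with the machine's metadata attributes.
```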
IV. CHANGEPOINT DETECTION ANALYSIS
A. Overview of CPD
We leverage a recent CPD approach that is explicitly designed to handle data with outliers and heavy-tailed noise [43]. This approach uses an efficient dynamic programming algorithm to produce minimum-cost segmentations of univariate timeseries. Not only is this approach shown to be robust to noise in the data, but it operates sequentially on the data that is given to it for processing, which makes it a great fit for rapid analysis of multiple streams of performance measurements coming from large-scale cyberinfrastructure. Robustness to noise is equally important in this context considering that outliers occur in performance results even on perfectly fine hardware due to the nondeterminism of computing systems and integral low-level attributes of their design, such as task scheduling, interrupts, instruction pipelining, and caching, among others [1], [6]. Therefore, we base our work on a method for which a handful of outliers, including large ones, does not result in changepoints being reported unless they provide a sufficient statistical basis for such outcomes. We also explore how sensitive the selected detection method is to short-term fluctuations and report our findings throughout the paper.

We use an implementation of this CPD that is available in the form of the robseg package in R [44]. We integrate it with the analysis and visualization tools we develop in Python using the rpy2 interface [45]. Following one of the repository's examples, we run this implementation with the biweight loss function (i.e., the pointwise minimum of an L2 loss and a constant), which is shown to improve the consistency and the accuracy of changepoint estimation [43]. The authors argue that this is a good choice in practice that has no performance drawbacks compared to the more common yet less outlier-resilient L2 (square error) loss. They also introduce the penalty/threshold parameter K, which relates the magnitudes of potential changepoints to the ratio of the signal to its standard deviation, and use different values with different loss functions. After our initial experiments with CloudLab performance traces, we settled on the [0.…, …] range for K values, making this hyperparameter easily tunable in the analysis dashboard we develop, as we discuss in Section VII. Most of the results we present in this paper are obtained with K = 0.…, unless noted otherwise. We would also like to note that while the authors of this CPD method evaluate it on data with heavy-tailed but synthetic t-distributed noise and on empirical data from a well drilling application [46], we apply this recently developed CPD approach in the area of computer performance analysis and, to the best of our knowledge, report on the first large-scale analysis of this kind.

Fig. 1: Outlier-resilient changepoint detection applied to memory and CPU performance measurements collected from 183 homogeneous xl170 machines over the period of 14 months, which started when these machines were first added to CloudLab. Panels: (a) multi-threaded memory copy results; (b) runtimes of multi-threaded FFT; (c) subset from (b) for a single machine.
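To illustrate why this loss tolerates outliers, the sketch below implements the biweight idea as described above: a squared-error cost that is capped at a constant. The cutoff parameter here is a simplified stand-in for robseg's actual parameterization via K.

```python
# Sketch of the biweight loss: the pointwise minimum of an L2 (squared
# error) loss and a constant. Beyond the cutoff, every residual contributes
# the same cost, so a few large outliers cannot dominate the segmentation.
import numpy as np

def l2_loss(residual):
    return residual ** 2

def biweight_loss(residual, cutoff):
    # min(residual^2, cutoff^2): quadratic near zero, flat for outliers.
    return np.minimum(residual ** 2, cutoff ** 2)

residuals = np.array([0.1, 0.5, 1.0, 8.0])   # the last value is an outlier
print(l2_loss(residuals))                    # [ 0.01  0.25  1.   64.  ]
print(biweight_loss(residuals, cutoff=2.0))  # [ 0.01  0.25  1.    4.  ]
```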
The aforementioned choices allowed us to achieve the desired results in segmentation of CloudLab's performance data in various settings, as demonstrated in Figure 1. Both the memory (a) and CPU (b,c) datasets were segmented in ways that made perfect sense visually and proved to be outlier-resilient. The latter can be noticed in the memory plot (a), where some of the most recent measurements appeared higher than the rest, while the detection algorithm did not create another changepoint to represent them; similar instances can be observed in plot (b). Either more data is needed to confirm the significance of this behavior, or we can increase the value of the threshold K if we are indeed interested in capturing such instances.

Another interesting result here is that all three of the shown segmentations—two for the data from all machines of the studied hardware type (a,b) and one for a single machine of this type (c)—agree on the change that occurred just before "05-01-19". The agreement between different benchmarks' changepoints speaks for the significance of such changes: the higher the number of agreeing benchmarks, the more "weight" these performance changes carry. We search for examples of cluster-wide performance changes using CPD and discuss several specific cases in more detail in Section V. In this study, we focus on the analysis of traces that include large sets of measurements from batches of homogeneous machines, like traces (a) and (b). We acknowledge that there is an alternative approach which involves processing smaller single-machine traces, similar to (c), one by one and then clustering the detected changepoints to extract prominent patterns. This method may provide its own benefits (e.g., resource-specific CPD that may help expose outliers), yet it falls outside the scope of our current work.

We run the robseg-based CPD on the data for all CPU, memory, and disk configurations described in the previous section and analyze the properties of the produced segmentations, defined as follows. The n-th segment—the period of testing between changepoints where performance measurements appear stationary—can be characterized by the duration d_n and the mean m_n of the measurements that fall within it. The first and the last segments in each timeseries have the beginning or the end of the testing period as one of their endpoints. Thus, the segmentations shown in Figure 1 include a total of 8 segments (3 in (a), 3 in (b), and 2 in (c)). At each changepoint, we observe a "step" that we can characterize as a relative change in the means: c_n = (m_{n+1} - m_n) / m_n × 100%. Below we describe the distributions of the empirical c_n and d_n values we obtain for different benchmarks; a sketch of these per-segment computations follows.
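A minimal sketch of these computations is shown below: given the indices where new segments begin and a per-measurement time axis, it derives each segment's duration d_n, mean m_n, and the relative step c_n. The function name and the day-based time axis are illustrative assumptions.

```python
# Sketch: derive segment durations d_n, means m_n, and relative changes c_n
# from a segmentation of one performance timeseries.
import numpy as np

def segment_stats(values, days, breakpoints):
    """`breakpoints` holds the indices where new segments begin."""
    bounds = [0, *breakpoints, len(values)]
    means, durations = [], []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        means.append(np.mean(values[lo:hi]))       # m_n
        durations.append(days[hi - 1] - days[lo])  # d_n, in days
    # c_n = (m_{n+1} - m_n) / m_n * 100%
    changes = [100.0 * (m2 - m1) / m1
               for m1, m2 in zip(means[:-1], means[1:])]
    return durations, means, changes

vals = np.array([10.0, 10.2, 9.9, 12.0, 12.1, 11.9])  # toy measurements
days = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])       # toy time axis
d, m, c = segment_stats(vals, days, [3])
print(c)  # [~19.6]: the mean rose from ~10.03 to 12.0 at the changepoint
```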
B. Changepoints and Their Characteristics

All changepoints we detect form three groups, representing CPU, memory, and disk performance changepoints. The corresponding c_n distributions that characterize the relative magnitudes of these changes are shown in three histograms in Figure 2. Roughly speaking, most CPU changes are within the [−…, …]% range, memory changes within [−…, …]%, and disk changes almost entirely within [−…, …]%. We also notice that the directions of the heavier tails in these distributions agree with the directions of typical outliers in these types of performance measurements. Thus, CPU changepoints have more positive c_n values than negative (i.e., the median is slightly greater than zero), and abnormal CPU performance results (measured with test runtimes, in units of time) appear on the high side of the usual performance levels. In contrast, for memory and disk bandwidth tests (measured in MB/s), more c_n values have the negative sign (and the medians are below zero), which corresponds to instances of degradation of such performance metrics—the scenario that is more common in practice for unexpected changes than bandwidth improvements. In Section V, we further discuss several specific multi-benchmark changes that conform with these patterns and also represent several counterexamples.

Fig. 2: Histograms of c_n values, the relative mean changes corresponding to the detected changepoints.

Fig. 3: Histograms of d_n values, the durations of segments with stable performance. The set of summarized changepoints is the same as the set shown in Figure 2, which includes a total of 1632 changepoints.

We also track the numbers of detected changepoints, for each distribution and in total. They are directly proportional to the value of K, and the number of memory changepoints is greater than the number of CPU changepoints, which, in turn, is greater than the number of disk changepoints in our dataset. Considering that we analyze uneven numbers of configurations across these types, we adjust for their counts (n_CPU = 590, n_Memory = 1038, and n_Disk = 152) and, using the notation r_X = |{c_n^X}| / n_X for the ratio of changepoints per configuration, arrive at the following: r_Memory > r_Disk > r_CPU. This holds true for all values of K in the range we have studied. To provide a concrete example, our analysis for K = 0.… yields r_CPU = 0.4, r_Memory = 1.25, and r_Disk = 0.9. This suggests that, on average in this setting, 4 memory traces have approximately 5 changepoints, 10 disk traces have 9 changepoints, and 5 CPU traces have 2 detectable changepoints, as computed below. In combination with the analysis of inter-changepoint intervals presented below, this fact should provide a reference point for researchers and practitioners pursuing performance-focused changepoint detection, especially at large scale.
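For concreteness, the arithmetic behind these ratios can be reproduced as follows; the changepoint totals in this snippet are hypothetical values chosen only to be consistent with the ratios quoted above.

```python
# Worked example of r_X = (number of changepoints) / (number of configurations).
n_configs = {"cpu": 590, "memory": 1038, "disk": 152}       # from Section III
n_changepoints = {"cpu": 236, "memory": 1298, "disk": 137}  # hypothetical totals

ratios = {k: n_changepoints[k] / n_configs[k] for k in n_configs}
print(ratios)  # {'cpu': 0.4, 'memory': ~1.25, 'disk': ~0.9}
```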
C. Steady States

The inter-changepoint intervals with stationary performance regimes can be viewed as steady states (provided there is a representative set of measurements). For these intervals, we are interested in the patterns expressed in the distributions of their durations. If we find them to include many short intervals, we may consider tuning CPD to produce fewer changepoints and only characterize the more permanent ones. At the other extreme, with less sensitive CPD we run the risk of not noticing some short-term but important changes. In Figure 3, we present what we believe are balanced duration distributions, which we obtain for K = 0.… and measure in days. Specifically, we analyze the heights of the leftmost bars in these histograms (i.e., those characterizing short segments) and notice that they are comparable to the heights of the bars representing much longer segments, at several hundred days. In other words, the short-term changes are neither too abundant nor lacking when compared to the changes that occur at larger timescales. With this summary in mind, we gain more confidence in viewing the steady states on the shorter end of this spectrum as being representative of the system's performance regimes rather than stemming from noise. Moreover, the tall bars for the segments that are about 500 days long point to the configurations that did not yield any changepoints. This is a satisfying observation indicating that there is some long-term stability in the studied performance metrics, and therefore, specific types of performance experiments can be run repeatedly over long periods of time without noticeable impact on the results caused by infrastructure changes and transient effects. We further discuss several exemplary cases in the following section.

V. INVESTIGATING CHANGE PATTERNS
In this section, we describe the most noticeable performance changes that we detect using CPD and investigate them by looking at the history of CloudLab system changes. The latter comes in the form of kernel, OS, and compiler version changes and other attributes we record in our dataset, as well as the record of maintenance procedures (reconstructed based on administrators' recollections and emails sent to testbed users). The cases where we can attribute multi-benchmark changepoints to such system changes provide validation, giving us concrete context for the observed performance changes. We present a summary of the instances we have investigated in Table II and discuss them below.

TABLE II: Details of Validated CPD Results.

Hardware | Change Direction | Time | Summary
xl170 | CPU ↓, Mem ↑ | Nov 1, 2019 | BIOS updates (see Fig. 4)
c6320, other hw | Mem ↓ | August 13, 2018 | Upgrade from Ubuntu 16 to 18 (see Fig. 5)
d430, other hw | CPU ↑, Mem ↓ | July 25, 2019 | Kernel upgrade from … to …
A. Major Changepoints

xl170 BIOS Updates
In Figure 4, we illustrate how xl170 performance traces show changes that occurred after November 1, 2019. The timeline plots (a,c) indicate that the changes can be considered positive: CPU runtimes decreased, while memory bandwidth results increased (these changes can also be seen in Figure 1). Considering that the bars in these timelines represent individual days, we confirm that the large spikes in the numbers of affected tests match the dates of updates to the BIOS conducted by administrators on these machines. We see that they are followed by several days with small numbers of related performance changes. This can be explained by a combination of how we collect our benchmarking results—we may get only a few machines tested on a day when the testbed and this particular hardware type are in high demand—and the fact that some of the tests might need more measurements than others to result in changepoints, depending on the magnitudes of the changes. In this case, however, the detection is quite accurate; it points us precisely at the performance impact of the system changes.

The specific changes that caused this changepoint were three BIOS settings adjusted according to HPE's low-latency tuning recommendations [47]. By disabling patrol scrubbing (which scans memory to correct soft errors) and the early warning of DRAM errors (through the memory pre-failure notification setting), administrators reduced the number of System Management Interrupts (SMIs) sent to the processor. Administrators also reduced the rate at which the memory controller refreshes DRAM, from 2x to 1x.

As far as the magnitudes of these hardware-specific changes are concerned, on average, CPU runtimes decreased by …% (…% maximum), and memory bandwidth increased by …% (…% maximum). In Figure 4 (b,d), we show the histograms for the full ranges of xl170's c_n values—not only for these BIOS-related changes but also for the changepoints that occurred at different times—characterizing the entire period of our benchmarking for xl170 machines.

OS Version Change
The performance effects of the testbed-wide switch from Ubuntu 16.04 as the default operating system image to Ubuntu 18.04 can be seen in the traces for most of CloudLab's hardware types, such as the c6320 hardware type shown in Figure 5. While our collection of CPU tests only began shortly after this transition, our memory results reveal the changepoints that trace back to this OS upgrade. These performance changes are predominantly negative: they reflect the security updates, many of which mitigate recent speculative execution exploits [48] at the expense of performance. c6320's memory bandwidth results decreased by …% on average (…% maximum) following the OS switch, but not all hardware types experienced this degree of change. The m510 hardware type, in contrast, showed smaller performance degradation, with an average of …% (…% maximum). In the most stable case, c220g1 machines showed a single memory changepoint with a bandwidth degradation of only …%.

This helps illustrate the point that we cannot project performance results from one hardware type to another. This conclusion is typically drawn for high-level applications, yet here we see the evidence for it in the context of performance baselines defined by the OS evolution.

Kernel Version Changes
Similar to the OS upgrade, changes in the version of the deployed Linux kernel result in many performance changepoints. One such update—from … to …—can be seen in the traces for multiple hardware types, for both CPU and memory measurements. A number of the corresponding negative changepoints around July 25, 2019, which relate to the continuing mitigation of security exploits (we have confirmed that the changelogs for this series of kernel updates describe such improvements), can be seen in Figure 5. There, CPU runtimes increased by …% on average (…% maximum), and memory bandwidth decreased by …% on average (…% maximum). The largest relative change that we can attribute to this update is the …% memory bandwidth decrease for read AVX instructions run on c220g5 machines. It is also worth noting that the discussed kernel version update coincided with a compiler change, from GCC … to …. To verify the root cause of these changepoints, we collected measurements in a series of tests that used a CloudLab machine running the older kernel and the newer GCC. By comparing these results with the pre- and post-changepoint distributions, we can claim without doubt that the kernel is indeed responsible for the performance impacts, not GCC. Our analysis of other kernel version changes, which took place before and after the aforementioned update, showed that they had more limited performance effects.

B. Stable Measurements
In addition to investigating changepoints, we recognize long steady states present in our performance measurements and describe several instances below.
Disk Performance
The most stationary measurements come from the disks installed on c6420 and … machines. Both use Seagate 1TB 7200-RPM 6G SATA HDDs (albeit different models). Their performance traces showed isolated changepoints for I/O tests with the default setting iodepth = 1. These are instances of impressive long-term performance stability, considering that other storage devices, such as Micron M500 120GB SATA3 flash and Toshiba XG3 series 256GB NVMe disks, showed 15 and 33 changepoints, respectively, on the same set of eight I/O tests. This summary agrees with what we have found in our previous work [1] about the performance of these devices based on empirical coefficients of variance: increased performance often comes at the expense of increased variability. We also notice that I/O performance changepoints mostly do not coincide with the rest of the studied changepoints, in contrast to the CPU and memory changepoints, which agree in many cases, as discussed earlier.
Fig. 4: Changepoint detection for xl170 performance data: (a) changepoints in CPU traces (changes in runtimes); (b) histogram of CPU c_n values; (c) changepoints in Memory traces (changes in bandwidth); (d) histogram of Memory c_n values. Shaded areas in (a) and (c) represent the period of benchmarking.
Fig. 5: Changepoint detection for c6320 performance data: (a) changepoints in CPU traces (changes in runtimes); (b) changepoints in Memory traces (changes in bandwidth). Shaded areas represent the period of benchmarking.
Most Stable Configurations
CPU measurements on m400 machines and memory measurements on c6420 machines showed no changepoints. This does not mean, though, that there is no variability or subtle fluctuation: we do find some performance changepoints when we increase K, but their number is still lower than the numbers of changepoints detected for other hardware. Based on all our observations, these hardware types would be the best candidates—not showing the highest performance but instead delivering the highest stability—for long-running series of experiments with an emphasis on CPU and memory performance, among the hardware available on the CloudLab testbed. In such experiments, application performance regressions can be studied in isolation from hardware- and OS-caused transient performance effects, with greater rigor and depth.

C. Discussion
We anticipate that these findings can be helpful to the community of CloudLab users in several ways. First, it is worth noting that some of the factors that caused changes (the OS and kernel updates) are actually under the users' control: while users may opt for the images and kernels that are the default at the time, they also have the option of using specific (potentially older) software stacks if performance consistency is a primary goal. Second, as noted above, some hardware types show fewer changepoints than others, underscoring the fact that the user's choice of hardware can affect how many changes in performance they experience. Third, it is notable that spinning disks provide the most consistent performance over time. If this holds for a larger set of models and usage patterns, it suggests that practical performance fingerprinting for performance explainability should put more emphasis on benchmarking CPUs, memory, and other types of I/O devices than on testing spinning disks.

While the analysis in this section necessarily concentrates on the specific changes that we have observed in CloudLab, the overall lessons can generalize to other systems. While changes are common in large, long-lived facilities, techniques such as CPD can help both administrators and users recognize these changes and track down their root causes.

VI. TUNING THE DETECTION PROCESS
As part of our evaluation of CPD's computational requirements, we perform two series of experiments: CPD for single timeseries and for batches of timeseries. In the former, we run the analysis for samples of measurements of increasing size and measure the per-invocation CPD analysis time (in seconds). We start with the 25-point sample shown in Figure 1 (c) and proceed to larger samples drawn from the data shown in Figure 1 (b). Each sample is randomly shuffled (so we can study a variety of segmentations) and passed to the CPD code 100 times. We depict the analysis times we have recorded using the boxplot in Figure 6.

Fig. 6: Time needed for single-timeseries CPD analysis.

In the latter set of experiments, which is more representative of the scenarios with multi-benchmark performance analysis for datacenters, we run CPD for all CPU timeseries back-to-back, and then repeat it for all memory data. We study how the analysis times for these two batches vary as we tune the detection by changing the value of the K threshold. In Figure 7, we show the runtimes, as well as the numbers of identified changepoints, producing visualizations that allow us to reflect on the previously used value of K. All these runtimes are collected and processed on a machine with two 6-core Xeon X5650 processors and 96 GB of memory. Below we summarize the key insights revealed by the performance results we have gathered.

CPD is fast.
It processes samples with over 3,000 points within tiny fractions of a second. The analysis time is linear with respect to the sample size. Moreover, unlike typical machine learning tools that require large datasets for training, CPD analyzes each sample independently from the rest of the data and can achieve good results even on small, 25-point samples (as we demonstrated in Figure 1 (c)). This makes CPD, and the robseg implementation in particular, suitable for fast, interactive analysis tools connected to live databases or sources of streaming data.
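The single-timeseries timing experiment can be sketched as follows; run_cpd stands in for our rpy2 call into the R robseg package (replaced here with the Python ruptures package so the snippet is self-contained), and the sample sizes are illustrative.

```python
# Sketch: time repeated CPD invocations on shuffled samples of growing size.
import time
import numpy as np
import ruptures as rpt

def run_cpd(sample):
    # Stand-in for the rpy2 call into robseg used in our pipeline.
    return rpt.Pelt(model="l2").fit(sample).predict(pen=10)

rng = np.random.default_rng(2)
for size in (25, 100, 400, 1600, 3200):
    sample = rng.normal(100, 1, size)
    rng.shuffle(sample)  # shuffled, as in the experiment described above
    start = time.perf_counter()
    run_cpd(sample)
    print(f"n={size:5d}: {(time.perf_counter() - start) * 1e3:.1f} ms")
```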
There is a "sweet spot" in the range of K values. Analysis time consistently decreases as we increase K, while the number of detected changepoints increases. This matches what is highlighted in the study that introduced robseg [43], where the authors compared two scenarios: "no change" (K = 0) and "many changes" (K = n/…, where n is the sample size). Our results complement their brief summary by demonstrating how CPD behaves over an entire range of K values that may be considered in practice. Thus, thresholds around K = 0.… appear to be good choices for CPU and memory performance analysis, as confirmed by the trade-off curves shown in Figure 7. Such values allow finding most of the changepoints detected with higher values of K without the performance drawbacks of the analysis with lower K. It is also worth noting that if we are indeed interested in making the detection more sensitive by increasing K, we would be able to do that without performance penalties. Then, we would need to select K based on the desired characteristics of changepoints and steady states, using the arguments from Section IV.

Fig. 7: Results of varying the K threshold: (a) properties of CPD for xl170 data; (b) properties of CPD for c220g1 data.

VII. OPEN ARTIFACTS
CONFIRM is an interactive analysis service running at https://confirm.fyi/. We developed it to assess the levels of performance and variability present in our dataset using scatter plots, per-machine confidence intervals, and converging overall confidence intervals, among other analysis techniques. It analyzes all performance results we have collected on CloudLab. It is worth noting that we first encountered performance changepoints while exploring specific configurations one by one using CONFIRM's scatter plots, which are similar to the ones shown in Figure 1 (without CPD segmentations).

Change Over Time is a complementary dashboard we have developed to examine the results of CPD. It is available at https://confirm.fyi/change/. It runs alongside CONFIRM, accesses the same database, and produces visualizations like the ones shown in Figures 4 and 5. The dashboard summarizes the numbers of detected changepoints and their distributions for CPU, memory, and disk measurements, and depicts the temporal relationships between them. The dashboard has a slider that allows experimenting with the CPD threshold K and going from a few (only large) performance changepoints to larger numbers of them (including smaller ones).

All data and the developed analysis code can be found at: https://gitlab.flux.utah.edu/emulab/cloudlab-ccgrid20. This repository provides access to the raw data, the collection of changepoints produced by our analysis, and the Google Colab notebooks that can be easily cloned and run.

VIII. IMPACT OF CHANGEPOINT ANALYSIS
With the approach we have laid out and the findings we have presented, we hope to demonstrate the value of CPD for comprehensive performance evaluation and operation of large computing facilities. Not only can such analysis improve facility studies in cloud computing, HPC, datacenter optimization, etc., but it can also contribute to a wide range of practical evolution-over-time studies focused on the performance of individual components and aspects of modern cyberinfrastructure: new devices, network and storage systems, compilers and computing frameworks, QoS policies, and resource sharing, among others. Better understanding of all these systems requires measuring key performance characteristics across a variety of operational regimes rather than focusing on a single or a handful of selected states.

A good analysis protocol would prescribe checking stationarity across the sampled states or time intervals and using elements of CPD where the stationarity does not persist. The same logic applies to computing applications used repeatedly in production settings. We support this protocol and have shared what we have learned from it when we discussed the stationarity assessment in our previous work [1] and described the changepoint investigation in the current study.

In a different capacity, we can also envision the use of CPD in systems reproducibility studies, which reuse published data and code artifacts for validation and further analysis. CPD can be a part of the toolchains involved in such work, complementing other temporal analysis techniques.

The confluence of ideas about establishing performance baselines and studying reproducibility leads us to think about creating an open archive for performance variability data. There is plenty of precedent for archives of performance data, from the Top500 [49] and Green500 [50] efforts to the data released by SPEC [51]. What we propose is an archive specifically targeting performance stationarity and variability. This archive would include fine-grained multi-benchmark datasets, similar to the dataset we describe in this paper. It would provide access to complete performance records for repeated benchmark runs on the same system, as well as runs across groups of theoretically identical systems. This would allow system owners and users to better understand what to expect in their environments and to bootstrap the statistical calculations and adaptive techniques needed for robust benchmarking of their hardware. An organized and documented collection of data like this is likely to empower a plethora of new studies on large-scale performance evaluations, analysis of correlations in performance measurements, outlier detection, prediction of results in untested configurations, and minimization of benchmarking time, among many other avenues within contemporary performance analysis of computer systems. In this context, CPD techniques running on the archive's data would strengthen all analyses that are able to consider individual stationary segments and their performance characteristics rather than relying on the coarse estimates obtained for entire timeseries (sometimes with nonstationarities). We hope that our dataset from this study, in addition to the recently published measurements of network performance in several clouds [52] (also showing many changepoints), will help establish such an archive and stimulate more work on novel and practical CPD.

IX. CONCLUSION AND FUTURE WORK
In the current study, we apply changepoint detection to a large dataset of measurements we have collected on the CloudLab testbed, which includes records of CPU, memory, and disk performance. We present our analysis of the detected changepoints—their distributions in terms of magnitudes and inter-changepoint intervals—and how they relate to the major recorded system changes, and we also reveal which configurations we have found to be the most stable based on the lack of changepoints. These results, coupled with the presented performance experiments and their outcomes, have convinced us of the viability and usefulness of applying robust CPD to studying large performance datasets. We expect to see more work in computer systems research using CPD techniques in the future and have shared several ideas about the types of developments that might facilitate this adoption.

As part of our future work, we plan to include the results of network bandwidth and latency tests collected on CloudLab. This will allow us to compare CloudLab's networks with the networks deployed in public clouds [13] from the variability and evolution-over-time perspectives. On the detection side, we will experiment with other changepoint detection approaches and tools, including online Bayesian detection [17] and the BreakoutDetection package [53], looking for the best choices and capabilities for datacenter performance analysis. We will work on a comprehensive comparison study of the available methods and consider evaluating them on the CloudLab performance dataset.

ACKNOWLEDGMENTS
The material in this paper is based upon work supported by the National Science Foundation under Grant Number 1743363.
REFERENCES

[1] A. Maricq, D. Duplyakin, I. Jimenez, C. Maltzahn, R. Stutsman, and R. Ricci, "Taming performance variability," in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018. [Online]. Available: flux.utah.edu/paper/maricq-osdi18
[2] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley-Interscience, Apr. 1991.
[3] T. J. Santner, B. J. Williams, and W. Notz, The Design and Analysis of Computer Experiments. Springer, 2003, vol. 1.
[4] P. Balaprakash, R. B. Gramacy, and S. M. Wild, "Active-learning-based surrogate models for empirical performance tuning," in Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 2013, pp. 1–8.
[5] D. Duplyakin, J. Brown, and R. Ricci, "Active learning in performance analysis," in …. IEEE, 2016, pp. 182–191.
[6] T. Hoefler and R. Belli, "Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2015.
[7] E. S. Page, "Continuous inspection schemes," Biometrika, vol. 41, no. 1/2, pp. 100–115, 1954.
[8] A. Smith, "A Bayesian approach to inference about a change-point in a sequence of random variables," Biometrika, vol. 62, no. 2, pp. 407–416, 1975.
[9] D. Duplyakin, R. Ricci, A. Maricq, G. Wong, J. Duerig, E. Eide, L. Stoller, M. Hibler, D. Johnson, K. Webb, A. Akella, K. Wang, G. Ricart, L. Landweber, C. Elliott, M. Zink, E. Cecchet, S. Kar, and P. Mishra, "The design and operation of CloudLab," in Proceedings of the USENIX Annual Technical Conference (ATC), 2019. [Online]. Available: flux.utah.edu/paper/duplyakin-atc19
[10] Databricks, "Meltdown and Spectre's performance impact on big data workloads in the cloud," https://databricks.com/blog/2018/01/13/meltdown-and-spectre-performance-impact-on-big-data-workloads-in-the-cloud.html, 2018.
[11] H. S. Gunawi, R. O. Suminto, R. Sears, C. Golliher, S. Sundararaman, X. Lin, T. Emami, W. Sheng, N. Bidokhti, C. McCaffrey, G. Grider, P. M. Fields, K. Harms, R. B. Ross, A. Jacobson, R. Ricci, K. Webb, P. Alvaro, H. B. Runesha, M. Hao, and H. Li, "Fail-slow at scale: Evidence of hardware performance faults in large production systems," in 16th USENIX Conference on File and Storage Technologies (FAST), 2018. [Online]. Available: https://www.usenix.org/conference/fast18/presentation/gunawi
[12] T. Patki, J. J. Thiagarajan, A. Ayala, and T. Z. Islam, "Performance optimality or reproducibility: that is the question," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2019, p. 77.
[13] A. Uta, A. Custura, D. Duplyakin, I. Jimenez, J. Rellermeyer, C. Maltzahn, R. Ricci, and A. Iosup, "Is big data performance reproducible in modern cloud networks?" in …, Feb. 2020.
[14] M. Kogias, S. Mallon, and E. Bugnion, "Lancet: A self-correcting latency measuring tool," in …, 2019, pp. 881–896.
[15] A. Uta and H. Obaseki, "A performance study of big data workloads in cloud datacenters with network variability," in Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. ACM, 2018, pp. 113–118.
[16] E. Page, "A test for a change in a parameter occurring at an unknown point," Biometrika, vol. 42, no. 3/4, pp. 523–527, 1955.
[17] R. P. Adams and D. J. MacKay, "Bayesian online changepoint detection," arXiv preprint arXiv:0710.3742, 2007.
[18] D. Stephens, "Bayesian retrospective multiple-changepoint identification," Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 43, no. 1, pp. 159–178, 1994.
[19] A. J. Scott and M. Knott, "A cluster analysis method for grouping means in the analysis of variance," Biometrics, pp. 507–512, 1974.
[20] R. Killick, P. Fearnhead, and I. A. Eckley, "Optimal detection of changepoints with a linear computational cost," Journal of the American Statistical Association, vol. 107, no. 500, pp. 1590–1598, 2012.
[21] R. Killick and I. Eckley, "changepoint: An R package for changepoint analysis," Journal of Statistical Software, vol. 58, no. 3, pp. 1–19, 2014.
[22] A. G. Tartakovsky and V. V. Veeravalli, "Quickest change detection in distributed sensor systems," in Proceedings of the 6th International Conference on Information Fusion. Australia, 2003, pp. 756–763.
[23] P. Chen, Y. Qi, P. Zheng, and D. Hou, "CauseInfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems," in IEEE INFOCOM 2014 - IEEE Conference on Computer Communications. IEEE, 2014, pp. 1887–1895.
[24] E. Barrett, C. F. Bolz-Tereick, R. Killick, S. Mount, and L. Tratt, "Virtual machine warmup blows hot and cold," Proceedings of the ACM on Programming Languages, vol. 1, no. OOPSLA, p. 52, 2017.
[25] Y. Chen, K. Hwang, and W.-S. Ku, "Distributed change-point detection of DDoS attacks: Experimental results on DETER testbed," in DETER, 2007.
[26] R. B. Blazek, H. Kim, B. Rozovskii, and A. Tartakovsky, "A novel approach to detection of denial-of-service attacks via adaptive sequential and batch-sequential change-point detection methods," in Proceedings of the IEEE Systems, Man and Cybernetics Information Assurance Workshop. Citeseer, 2001, pp. 220–226.
[27] A. G. Tartakovsky, B. L. Rozovskii, R. B. Blazek, and H. Kim, "A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods," IEEE Transactions on Signal Processing, vol. 54, no. 9, pp. 3372–3382, 2006.
[28] A. Li, L. Gu, and K. Xu, "Fast anomaly detection for large data centers," in …. IEEE, 2010, pp. 1–6.
[29] C. Wang, K. Viswanathan, L. Choudur, V. Talwar, W. Satterfield, and K. Schwan, "Statistical techniques for online anomaly detection in data centers," in …. IEEE, 2011, pp. 385–392.
[30] M. Solaimani, M. Iftekhar, L. Khan, B. Thuraisingham, and J. B. Ingram, "Spark-based anomaly detection over multi-source VMware performance data in real-time," in …. IEEE, 2014, pp. 1–8.
[31] S. C. Tan, K. M. Ting, and T. F. Liu, "Fast anomaly detection for streaming data," in Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[32] D. Novaković, N. Vasić, S. Novaković, D. Kostić, and R. Bianchini, "DeepDive: Transparently identifying and managing performance interference in virtualized environments," in …, 2013, pp. 219–230.
[33] D. Duplyakin, A. Uta, A. Maricq, and R. Ricci, "On studying CPU performance of CloudLab hardware," in Proceedings of the Workshop on Midscale Education and Research Infrastructure and Tools (MERIT). [Online]. Available: flux.utah.edu/paper/duplyakin-merit19
[34] G. Amvrosiadis, J. W. Park, G. R. Ganger, G. A. Gibson, E. Baseman, and N. DeBardeleben, "On the diversity of cluster workloads and its impact on research results," in …, 2018, pp. 533–546.
[35] T. Kalibera and R. Jones, "Rigorous benchmarking in reasonable time," ACM SIGPLAN Notices, vol. 48, no. 11, pp. 63–74, 2013.
[36] D. A. Dickey and W. A. Fuller, "Distribution of the estimators for autoregressive time series with a unit root," Journal of the American Statistical Association, vol. 74, no. 366a, pp. 427–431, 1979.
[37] D. Bailey, T. Harris, W. Saphir, R. Van Der Wijngaart, A. Woo, and M. Yarrow, "The NAS parallel benchmarks 2.0," Technical Report NAS-95-020, NASA Ames Research Center, Tech. Rep., 1995.
[38] J. D. McCalpin, "Memory bandwidth and machine balance in current high performance computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25, 1995.
[39] A. W. Reece, "Memory bandwidth demo," https://github.com/awreece/memory-bandwidth-demo, May 19, 2013.
[40] Alex W. Reece, "Achieving maximum memory bandwidth," http://codearcana.com/posts/2013/05/18/achieving-maximum-memory-bandwidth.html, 2013.
[41] The CloudLab Team, "Hardware," http://docs.cloudlab.us/hardware.html, 2018.
[42] J. Axboe, "Flexible I/O tester," https://github.com/axboe/fio, 2006–2018.
[43] P. Fearnhead and G. Rigaill, "Changepoint detection in the presence of outliers," Journal of the American Statistical Association, vol. 114, no. 525, pp. 169–183, 2019.
[44] Guillem Rigaill, "Fpop implementation for robust losses," https://github.com/guillemr/robust-fpop, 2019.
[45] L. Gautier, "rpy2: A simple and efficient access to R from Python," http://rpy.sourceforge.net/rpy2.html, 2008.
[46] J. J. O. Ruanaidh and W. J. Fitzgerald, Numerical Bayesian Methods Applied to Signal Processing. Springer Science & Business Media, 2012.
[47] Hewlett Packard Enterprise, "Configuring and tuning HP ProLiant servers for low-latency applications," Tech. Rep., Nov. 2014. [Online]. Available: https://h50146.….hpe.com/products/software/oe/linux/mainstream/support/whitepaper/pdfs/c01804533-2014-nov.pdf
[48] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher et al., "Spectre attacks: Exploiting speculative execution," in …, 2019.
[49] "Top500," https://www.top500.org/, 2019.
[50] "Green500," https://www.top500.org/green500/, 2019.
[51] Standard Performance Evaluation Corporation, "SPEC Members' Archive," https://pro.spec.org/, 2019.
[52] A. Uta, A. Custura, D. Duplyakin, I. Jimenez, J. Rellermeyer, C. Maltzahn, R. Ricci, and A. Iosup, "alexandru-uta/cloud_network_variability_data: Cloud Network Variability Data," Dec. 2019. [Online]. Available: https://doi.org/10.…
[53] Twitter, Inc., "BreakoutDetection R package," https://github.com/twitter/BreakoutDetection, 2014.