Duet Benchmarking: Improving Measurement Accuracy in the Cloud
Lubomír Bulej, Vojtěch Horký, and Petr Tůma
Charles University, Faculty of Mathematics and Physics, Department of Distributed and Dependable Systems, Prague, Czech Republic
{name.surname}@d3s.mff.cuni.cz

François Farquet and Aleksandar Prokopec
Oracle Labs, Zurich, Switzerland
{name.surname}@oracle.com

January 20, 2020
Abstract
We investigate the duet measurement procedure, which helps improve the accuracy of performance comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumption underlying the procedure. We demonstrate improvements in accuracy ranging from 2× to 12× (5× on average) for the tested ScalaBench (and DaCapo) workloads, and from 23× to 82× (37× on average) for the SPEC CPU 2017 workloads.

Rights
Uploaded to arXiv under the ACM Copyright Policy Version 9. Copyright 2020 ACM and the authors. This is the author version of the work. Posted for your personal use. Not for redistribution. The definitive Version of Record was published in
Proceedings of the 2020 ACM/SPEC International Conference on Performance Engineering (ICPE '20), April 20–24, 2020, Edmonton, AB, Canada, https://doi.org/10.1145/3358960.3379132. This work was partially supported by the ECSEL Joint Undertaking (JU) under grant agreement No 783162 and the Charles University institutional funding (SVV).

1 Introduction

At the heart of various performance comparison activities is a measurement experiment, whose statistical nature involves an inherent trade-off between execution time and sensitivity to differences in performance. Longer experiment times average over noise in the measurement data and provide more accurate results, but are also expensive both in terms of time and computing resources. Conversely, shorter execution times may cause the loss of sensitivity or report false alarms. This is a problem when automating performance test execution and evaluation [13, 21].

Importantly, the resource requirements for performance testing are not constant, but rather reflect the development activities, the test scenarios, and the desired level of sensitivity. To satisfy the changing resource requirements, it is therefore attractive to consider offloading the performance testing activities to the cloud.

A specific hurdle in this context is the fact that the cloud does not necessarily provide the performance stability required for performance testing. Performance measurements in the cloud are noisy, in part due to lack of control over hardware configuration, in part due to overhead of virtualization, but most importantly due to interference from colocated workloads of other tenants [14, 17, 16]. To illustrate this, Figure 1 shows the distribution of mean task execution times for iterations of an example benchmark from the DaCapo suite, both on a bare-metal server and on a virtual machine running in a public cloud.

Figure 1: Distribution of observed mean execution times of the avrora benchmark, running on an otherwise idle bare-metal server and on a public cloud machine. Note the min-max range, which is about 16% of the mean on the bare-metal server and about 150% in the cloud.

Our earlier work [4] introduced the idea of the duet measurement procedure, which improves measurement accuracy in shared resource environments, such as virtual machine instances in the cloud. The procedure is based on the assumption that performance fluctuations due to interference tend to impact similar tenants equally, and attempts to maximize the likelihood of such equal impact by executing the measured artifacts in parallel. The subsequent computation filters out the fluctuations by considering the relative performance of the measured artifacts together.

The assumptions of the duet measurement procedure hinge on detailed technical properties of both the measurement platform and the executing workloads. In the cloud, such properties typically cannot be controlled or guaranteed; we therefore subject the procedure to a thorough experimental evaluation with the goal of analyzing the overall behavior and documenting the observed accuracy. Based on experimental evidence, we answer specific research questions concerning the assumption underlying the procedure and explain the technical mechanisms behind the observations:
– We demonstrate improvements in accuracy that range from 2× to 12× (5× on average) for the tested ScalaBench [26] (and DaCapo [2]) workloads, and from 23× to 82× (37× on average) for the SPEC CPU 2017 workloads [28].

– We show that the accuracy improvements are due to the ability of the duet procedure to isolate synchronized interference, and that this interference arises with resource sharing.

– We evaluate how the specific patterns of concurrent execution and uneven resource utilization impact the ability of the duet procedure to measure performance differences.

As an essential overall contribution, our results indicate that cloud-based virtual machines can provide a viable platform for conducting an entire class of performance testing experiments based on comparing task execution times of benchmark workloads.

Section 2 provides additional background and motivation for performance regression testing as our specific application context. Section 3 presents an overview of the duet measurement procedure and the associated computations. Section 4 presents the experimental evaluation answering specific research questions that naturally arise when using duet measurements and observing the effects on measurement accuracy. We review related work in Section 5 and conclude the paper in Section 6.

2 Motivation

The motivation for our work is performance regression testing, that is, the task of detecting performance changes between two versions of a software project. To this end, we use benchmark workloads to exercise both versions of the project, measuring and comparing task execution times of individual workloads between the two versions.

Essential to performance regression testing is robust performance change detection. The task execution times observed on a real system are influenced by different sources of variability at different levels of granularity; the comparison therefore relies on statistical hypothesis testing to accommodate the inherent variability in the data, and the performance testing procedure must ensure that significant sources of variability are sufficiently represented in the data [8, 3, 1].

To provide sufficient variability, benchmarks repeatedly execute the same task (in a single process) and measure the task execution time in each iteration. This captures variability caused by factors that can manifest at any time during benchmark execution, and which can influence the execution time of any iteration, such as scheduling, memory caches, or background load. In addition, benchmarks are executed repeatedly to obtain execution times from multiple benchmark runs (in multiple processes). This captures variability caused by factors that can change between runs, but rarely change within a single run, such as process memory layout, or decisions of managed platforms such as the Java Virtual Machine.

As a general rule, the variability in the observed execution times determines the magnitude of performance changes that can be reliably detected in a given time, or alternatively, the time needed to detect performance changes of a given magnitude. For a quick illustration of the computational resources needed for performance regression testing, we use the open source GraalVM project [22], where the developers contribute on average 5 merge commits per day and want to test these commits for performance changes on a selection of 60 workloads from multiple benchmark suites. When using Java workloads for tests at the 99% confidence level, we can realistically assume to need data from 30 benchmark runs, each executing for 10 minutes (to get past some of the warm-up effects).
This sums up to 10 machine hours for a single experiment involving one version pair and one benchmark, and becomes 3000 machine hours per day for all experiments, which is an overwhelming figure (the arithmetic is spelled out in the sketch at the end of this section).

To pare down the resource demands, we can limit the amount of testing actually done [11, 21]; however, that alone may not solve the problem of infrastructure capacity limits. This is where cloud resources come into consideration, yet it is unclear if they are of any use for performance regression testing: the degree of control over the experimental platform, which allows obtaining accurate measurements on the local infrastructure, is not available in the cloud. Furthermore, cloud providers offer abstract virtual machine types that can run on different types of physical hosts [17], resulting in different execution times even for the same code. Finally, cloud virtual machines suffer from performance interference of neighbor workloads, which the virtualization technology cannot entirely eliminate. This also holds for continuous integration solutions executing in the cloud, such as Travis [29] or GitLab Runner [9].

In summary, we need a procedure that takes the characteristics of the cloud into account and makes it useful for performance testing, even if it only allows us to quickly process many versions and flag suspect cases for more thorough measurements on dedicated infrastructure.
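To make the cost estimate above easy to check, the following back-of-the-envelope computation reproduces it (all figures come from the text; the variable names are ours):

    # Back-of-the-envelope cost of the GraalVM testing scenario above.
    runs_per_version = 30        # benchmark runs per project version
    minutes_per_run = 10         # run length, to get past warm-up
    versions_compared = 2        # old and new version

    hours_per_experiment = runs_per_version * minutes_per_run * versions_compared / 60
    daily_hours = hours_per_experiment * 5 * 60   # 5 merge commits x 60 workloads

    print(hours_per_experiment)  # 10.0 machine hours per version pair and benchmark
    print(daily_hours)           # 3000.0 machine hours per day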
3 The Duet Measurement Procedure

Measurements in the cloud are subject to performance interference, which manifests as noise that may randomly affect any measured data. To account for the probabilistic nature of the interference, we have to repeat the measured operation enough times to obtain a representative sample of measurements, and then calculate confidence intervals for any values derived from the measurements. In experiments involving multiple workloads, there is a risk of a systematic bias in the measured data if the probability of a workload being influenced by interference is not equal for all workloads. The current best practice uses randomized interleaving of workloads [1], which, for a long enough experiment, avoids the bias by equalizing the probability of interference for all workloads.

The duet measurement procedure also avoids bias by equalizing the probability of interference, but is specifically tailored for experiments comparing the performance of two (related) workloads. The two workloads are executed in parallel, inside a virtual machine with two virtual cores, with each workload restricted to one virtual core. The workloads are synchronized using a shared memory barrier, so that their measured operations always start at the same time. This setting ensures that any external interference on the virtual machine impacts both workloads simultaneously, which equalizes the probability of interference between the workloads for each paired measurement and thus avoids the bias immediately, rather than only for a long enough experiment.

We derive the confidence interval for the ratio of task execution times, which describes the relative performance of the two workloads, using a Monte Carlo procedure based on standard bootstrap confidence interval computation [12], explained in detail in [4]:

1. For a pair of workloads x and y and an experiment with R runs of I iterations each, we denote x_{r,i} and y_{r,i} the task execution times of the respective workloads, measured in iteration i ∈ 1…I of run r ∈ 1…R.

2. For each r and i, we use the paired samples x_{r,i} and y_{r,i} to calculate the corresponding (speedup) sample s_{r,i} of the ratio between the task execution times of workloads x and y:

    ∀ r ∈ 1…R, ∀ i ∈ 1…I: s_{r,i} = x_{r,i} / y_{r,i}
3. For each run, we aggregate the speedup samples across the iterations in the run by computing the geometric mean:

    ∀ r ∈ 1…R: gms_r = (s_{r,1} · s_{r,2} · … · s_{r,I})^{1/I}
4. We aggregate the geometric means across all runs in an experiment by computing the grand geometric mean:

    ggms = (gms_1 · gms_2 · … · gms_R)^{1/R}

The value ggms represents a point estimate of the ratio of task execution times between workloads x and y, i.e., the relative performance of the two workloads.

5. We use the non-parametric bootstrap to estimate the percentile confidence interval for ggms, drawing with replacement from gms_• and computing ggms* (step 4 applied to the sample drawn from gms_•) as Monte Carlo estimates for ggms.

When the confidence interval for ggms (the mean ratio of task execution times) straddles 1.0, we consider the observed performance of the two workloads equal; otherwise, we report a performance difference.
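The following is a minimal sketch of steps 1 to 5 in Python with NumPy; it is our illustration rather than the authors' implementation, and it assumes the paired measurements are available as (R, I) arrays:

    import numpy as np

    def duet_confidence_interval(x, y, replicates=10000, alpha=0.01):
        """Point estimate and bootstrap percentile confidence interval
        for the grand geometric mean of iteration time ratios.
        x, y: arrays of shape (R, I) holding the task execution times
        of the two workloads, paired by run and iteration."""
        s = x / y                                # step 2: speedup samples
        gms = np.exp(np.log(s).mean(axis=1))     # step 3: geometric mean per run
        ggms = np.exp(np.log(gms).mean())        # step 4: grand geometric mean

        # Step 5: resample runs with replacement and recompute the
        # grand geometric mean for each bootstrap replicate.
        rng = np.random.default_rng()
        idx = rng.integers(0, len(gms), size=(replicates, len(gms)))
        boot = np.exp(np.log(gms[idx]).mean(axis=1))
        lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
        return ggms, (lo, hi)

    # A performance difference is reported when the interval
    # does not straddle 1.0:
    # ggms, (lo, hi) = duet_confidence_interval(x, y)
    # changed = not (lo <= 1.0 <= hi)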
4 Evaluation

We examine the duet measurements using multiple experiments designed to answer specific research questions. Before introducing the research questions and the experiments, we outline the experimental environment. For detailed information, please consult the appendix.

The duet measurements target shared resource environments common in clouds; most of our measurements therefore execute in clouds. As the main cloud platform, we use the Amazon Elastic Compute Cloud, specifically the t3.medium, t3a.medium, m5.large and m5a.large instance types. As our second cloud platform, we use the Travis CI infrastructure [29], which in turn uses otherwise unspecified Google Compute Engine machine instances. As our third cloud platform, we use the GitLab CI infrastructure [9] backed by Digital Ocean machine instances. In addition to the three public cloud platforms, we carry out measurements on a private cloud running the Proxmox Virtual Environment. Finally, we run bare metal measurements that are to represent the most stable baseline for comparison.

To approximate realistic workloads, we use benchmark suites: SPEC CPU 2017 [28] for statically compiled and optimized workloads, and ScalaBench [26] (with DaCapo [2]) for dynamically compiled and optimized workloads. From SPEC CPU 2017, we execute the rate workload variants (23 workloads in total). From ScalaBench and DaCapo, we execute all workloads except actors, batik, eclipse, tomcat, tradebeans and tradesoap, which fail for various reasons (20 workloads in total). We use the OpenJDK 1.8.0 JVM, run with a fixed heap size and disabled garbage collector ergonomics; other virtual machine settings were left at their defaults.

To provide information on result variance, we execute all benchmarks multiple times (on average over 20 runs for each workload on the Amazon t instances, over 40 runs on the Amazon m instances, and over 100 runs on the other platforms), and use random samples of 10 runs for all computations. On the faster execution platforms (public cloud at full speed, private cloud, bare metal), we collect the timing of the first 100 iterations or the first 10 minutes of execution within each run, whichever comes first. On the slower execution platforms (public clouds with token bucket processor allocation), it is 100 iterations or 60 minutes (a sketch of this collection policy appears at the end of this section). We do not execute the SPEC CPU 2017 workloads on the Amazon t instances and on the Travis CI infrastructure, because both lack the computing power to execute the benchmark in reasonable time.

For the SPEC CPU 2017 workloads, which exhibit virtually no startup artifacts, we use the timing of all iterations. For the ScalaBench workloads, which exhibit startup artifacts related to dynamic compilation, we discard the timing of the first half of iterations. We apply outlier filtering with winsorization in all computations, replacing at most one observation in a run with its nearest neighbor when that observation is further than 20% away from the min-max range of the remaining observations. Our bootstrap computations use 10000 replicates.

The constants above were determined by informal experiments to provide reasonable measurement time and reasonable stability across the workload spectrum. In an actual performance testing environment, the numbers would be chosen per platform and per workload using established procedures such as [10, 19]; however, introducing this practice here would prevent us from comparing different measurement procedures under similar conditions.
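As an illustration of the per-run collection policy mentioned above, the following sketch caps a run at a given iteration count or time budget, whichever comes first (the function name and structure are ours):

    import time

    def collect_run(run_once, max_iterations=100, max_seconds=600):
        """Collect per-iteration timings until either cap is reached;
        600 s matches the fast platforms, the throttled platforms
        would use 3600 s instead."""
        timings = []
        deadline = time.monotonic() + max_seconds
        while len(timings) < max_iterations and time.monotonic() < deadline:
            start = time.perf_counter()
            run_once()
            timings.append(time.perf_counter() - start)
        return timings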
The very purpose of the duet procedure is to improve the accuracy of performance comparison experiments. Our first research question directly addresses this purpose:
Are the performance comparisons made with the duet procedure more accurate than performance comparisons done using standard methods? (RQ1)
The standard way to express the measurement accuracy is to treat the individual measurements as observations of a random variable with an unknown parameter of interest, such as the mean value. The goal of the measurement is to estimate this unknown parameter, and the accuracy of this estimate characterizes the overall measurement accuracy. An intuitive way to present the accuracy of the estimate, which we also use in this paper, is with confidence intervals [12]. For the duet measurements, we use the 99% confidence intervals for the mean of ratios computed with the procedure in Section 3. As a representative standard method that we compare against, we use the common 99% bootstrap confidence intervals for the difference of means, computed using the procedure in [3], with random measurement interleaving, as recommended in [1].

We collect the accuracy information using A/A measurements, that is, we compare two sets of measurements that use the same workload and the same instance type. For each workload and instance type, the comparison gives us two confidence intervals, one for the mean ratio of the workload execution times computed using the duet procedure, and one for the difference of the mean workload execution times computed using the standard method. By construction of the experiment, the two intervals must respectively straddle 1.0 and 0.0, and the width of the two intervals expresses the accuracy achieved by the two procedures.

A direct comparison of the two confidence intervals is hindered by the fact that the intervals produced by the duet procedure are centered around 1.0, while the intervals produced by the standard method are centered around 0.0. We therefore convert both types of confidence intervals to a value expressing their width relative to mean performance: for a mean of ratios interval (ggms_lo, ggms_hi) we report ggms_hi − ggms_lo, and for a difference of means interval (diff_lo, diff_hi) we report (diff_hi − diff_lo) / mean, where mean is the sample mean computed from all samples (all samples concern the same workload and can therefore be averaged).
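Expressed as code, the conversion to relative widths is straightforward (a sketch under our naming, assuming NumPy):

    import numpy as np

    def relative_width_duet(ggms_lo, ggms_hi):
        # Ratio intervals straddle 1.0, so their width is already
        # relative to mean performance.
        return ggms_hi - ggms_lo

    def relative_width_standard(diff_lo, diff_hi, samples):
        # Difference-of-means intervals straddle 0.0 and are expressed
        # in absolute time, so the width is normalized by the sample
        # mean computed from all samples of the workload.
        return (diff_hi - diff_lo) / np.mean(samples)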
Figure 2: Accuracy expressed as relative 99% confidence interval width, 10 runs, aggregated across all workloads.

Figure 2 shows the distribution of the 99% confidence interval widths on the public cloud platforms, aggregated across all workloads. (We use the 99% confidence level throughout the presentation; other confidence levels provide reasonably similar results.) The distribution indicates that the duet procedure generally delivers narrower confidence intervals and therefore better accuracy. Table 1 aggregates the improvement in accuracy for each platform and benchmark suite, expressed as the average reduction of the relative confidence interval width. For the ScalaBench workloads, the duet procedure computes on average 5.03 times narrower intervals than the standard method. For the SPEC CPU 2017 workloads, the duet procedure computes on average 37 times narrower intervals.

Table 1: Average improvement in accuracy, per platform.

    Platform            ScalaBench   SPEC CPU 2017
    Amazon m5.large     …            …
    Amazon m5a.large    3×           …
    Amazon t3.medium    9×           —
    Amazon t3a.medium   3×           —
    GitLab CI           12×          …
    Travis CI           3×           —
    Average             5×           37×

Figure 3: Measurements on the Amazon m5a.large platform. Colors in the duet procedure distinguish samples collected in parallel.

To give an intuitive illustration of the improvement in accuracy, we look at the associated measurement costs. The mean confidence intervals tend to shrink with the square root of the sample counts. Asymptotically, this holds due to the Central Limit Theorem, but here we refer rather to empirical observations at small sample counts, where we see similar behavior. A twofold improvement in accuracy at constant sample count therefore roughly corresponds to a fourfold reduction in sample count at constant accuracy. Note that the measurement costs are also impacted by different platform requirements: where the standard method requires sufficient resources to run a single workload copy, the duet procedure requires resources for two workloads executing concurrently.

At the core of the duet procedure is the idea to expose the compared workloads to the same interference. To achieve that, the procedure modifies the way the workloads are executed and the way the results are processed. We therefore need to determine whether the observed accuracy improvements are due to the synchronized interference, rather than a side effect of the modifications in workload execution and results processing.
Can we attribute the improved accuracy exhibited by the duet procedure to both workloads suffering from synchronized interference? (RQ2)
To isolate the contribution of synchronized interference from the other modifications introduced by the duet procedure, we use the existing measurements, but adjust the confidence interval computation from Section 3. Where the duet procedure normally computes ratios from measurements collected at the same time, we now perform a random shuffle and use ratios from unrelated measurements. That way, we preserve all other aspects of the duet procedure, but obtain results that do not benefit from synchronized interference.

Figure 4: Impact of random shuffling on relative 99% confidence interval width, 10 runs, aggregated across all workloads.

Figure 4 shows the impact of shuffling on the distribution of the confidence interval widths. The distribution demonstrates that the duet procedure indeed benefits particularly from synchronized interference. We can also note that the confidence interval widths obtained with shuffling are very similar to the confidence interval widths from Figure 2 computed by the standard method. If we compute the aggregate improvement in accuracy after shuffling, an analogue of Table 1 but without synchronized interference, we obtain a total of 1.02× for the ScalaBench workloads and 1.03× for the SPEC CPU 2017 workloads, suggesting not only that the ability to deal with synchronized interference is the major factor contributing to improved accuracy, but also that other factors inherent to the duet procedure, such as the concurrent workload execution, are not a major detriment.
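A sketch of the shuffling adjustment (our code; the text does not spell out the shuffling granularity, so permuting the measurements within each run is one plausible reading):

    import numpy as np

    def shuffled_speedups(x, y, rng=None):
        """Speedup samples with the pairing deliberately broken: within
        each run, the measurements of workload y are randomly permuted,
        so the ratios no longer combine values collected at the same
        time and cannot benefit from synchronized interference."""
        if rng is None:
            rng = np.random.default_rng()
        y_shuffled = np.apply_along_axis(rng.permutation, 1, y)
        return x / y_shuffled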
The third aspect of the duet procedure we investigate is whether the presence of synchronized interference is due to the resource sharing common in clouds, or whether some other property of our experiments may account for the observed behavior.
Is the presence of synchronized interference associated with the existence of other workloads that share the same computing platform? (RQ3)
The only way to control other workloads on the same platform in the public cloud is to rent an entire physical machine; however, that option also removes the virtualization infrastructure, making an apples-to-apples comparison impossible. We therefore instead use private cloud measurements and control the utilization of the physical servers backing the virtual machine instances. In one set of measurements, we make sure each physical server runs only the measured workload. In the other set of measurements, we add a competing workload with the potential to saturate the physical server. Our competing workload is the composite configuration of the SPEC JBB 2015 benchmark, which generates a variable workload pattern across all cores of the physical server, moving between zero and peak utilization with a period of about 150 minutes. The workload approximates an enterprise business application and is therefore relevant in the cloud context.

Figure 5: Impact of resource sharing on random shuffling in private cloud, idle vs busy with competing workload, expressed as relative 99% confidence interval width, 10 runs, aggregated across all workloads.

Figure 5 demonstrates the impact of resource sharing on confidence intervals, again computed using either ratios from measurements collected at the same time, or ratios from unrelated measurements after a random shuffle. In the left-hand part of the plot, where the measurements were performed with resource contention, shuffling changes the confidence intervals significantly. In the right-hand part of the plot, where the measurements were without resource contention, shuffling has almost no effect. This confirms our hypothesis that the synchronized interference we observe and tackle with the duet procedure is indeed due to resource sharing.
The duet procedure does not always utilize the computing resources evenly. Assume A/B measurements where the duet workloads differ in length, with A shorter and B longer. The concurrent workload execution phase, as long as A, will be followed by an isolated workload execution phase, as long as the remaining part of B. This makes the execution conditions for the two workloads differ: while A always competes for the shared resources, B executes partially with and partially without such competition. It may therefore finish faster than if the computing resources were utilized evenly, making the duet procedure underestimate the workload execution time ratio.

An underestimated workload execution time ratio is not necessarily a serious issue. Our motivation is the ability to detect performance changes during regression testing. In this context, it is enough to use the cloud to reliably detect the presence of a change; additional measurements to assess the magnitude can be performed in a controlled environment. We should, however, still seek to understand the impact of uneven resource utilization on the measurements.
How does uneven resource utilization impact the estimated workload execution time ratio? (RQ4)
We answer the research question by arranging workloads with known execution time ratio in an A/B measurement and looking at the actual ratio measured and reported by the duet procedure. We do this first in the private cloud, where we have more control over the workload duration and resource utilization, and next in the public cloud, where we can use previous measurements.
Private cloud.
To get sufficient control over workload duration and resource utilization, we move from the benchmarks to four entirely artificial workloads, designed to utilize a given resource for a given operation count. We refer to the four workloads as integer (an integer loop running entirely from level 1 caches), float (a floating point computation also running entirely from level 1 caches), cache (a linear memory walk over 4 MiB of data that mostly hits in the last level cache), and memory (a random memory walk over 64 MiB of data that mostly misses in the last level cache). The integer and float workloads are sensitive mostly to hyperthreading and power management, while the cache and memory workloads add sensitivity to competition on the memory resources.

We first calibrate the artificial workloads on the private cloud platform, obtaining operation counts that yield roughly 100 ms executions. For each artificial workload, we then execute A/B measurements where A executes the workload using the calibrated operation count and B executes the same workload using twice the count of A. For the artificial workloads, the operation count translates directly into execution time; we would therefore desire to observe iteration times with the ratio of 2.0.

Figure 6: Distribution of observed mean iteration time ratios for individual artificial workloads in private cloud, idle vs busy with competing workload, 10 runs.

As Figure 6 illustrates, the observed ratio of iteration times for the two workloads is indeed very close to 2.0. We can observe the ratio decreasing slightly when the platform suffers from additional resource contention, generated again using the composite configuration of the SPEC JBB 2015 benchmark running across all cores of the physical server. This is most visible with the memory workload, which makes practical sense because, out of the four artificial workloads, memory is most sensitive to memory bandwidth, which is shared across the entire physical server. We can conclude that on the private cloud, the impact of uneven resource utilization is negligible.
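The paper does not list the artificial workload code; the following sketch shows one possible shape of the memory workload and its calibration (the names, the index-chasing scheme, and the doubling search are our assumptions):

    import time
    import numpy as np

    def memory_walk(operations, words=64 * 1024 * 1024 // 8):
        """Random memory walk over 64 MiB of int64 data: chase indices
        through a random permutation so that most accesses miss in the
        last level cache. (A real harness would build a single
        permutation cycle and use native code; the Python loop merely
        illustrates the access pattern.)"""
        chain = np.random.default_rng(0).permutation(words)
        index = 0
        start = time.perf_counter()
        for _ in range(operations):
            index = chain[index]
        return time.perf_counter() - start

    def calibrate(workload, target=0.1):
        """Coarse doubling search for an operation count that yields
        roughly 100 ms per iteration."""
        operations = 1024
        while workload(operations) < target:
            operations *= 2
        return operations

    # A then runs with the calibrated count and B with twice that count,
    # so the desired iteration time ratio is 2.0.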
Public cloud.
We can also assess the impact of uneven resource utilization using the previous A/A measurements on the public cloud. In the private cloud, we have constructed an A/B measurement where B was twice as long as A, and examined the ratio. Each A/B duet measurement had two phases, a concurrent phase where both A and B executed, and an isolated phase, where A already finished and B executed in isolation. Here, we observe that the concurrent phase of the A/B duet measurement resembles an A/A duet measurement, while the isolated phase resembles a standard isolated measurement. Comparing the iteration times from the A/A duet measurements with the iteration times from the standard isolated measurements therefore indicates how much the concurrent execution slows the workloads down. (Note that the relationship between operation count and execution time does not hold for the benchmark workloads, one reason why artificial workloads are used in the private cloud.) Where the resulting ratios are close to 1.0, the uneven resource utilization is not an issue. On the other public cloud platforms, the ratios are larger; in other words, the same workloads take longer when executed as an A/A duet measurement than when executed using a standard isolated measurement. In the hypothetical A/B measurement scenario, this translates into an underestimated workload execution time ratio.

We attribute the difference between the platforms to two factors: hyperthreading and token bucket processor allocation. On the Travis CI and Amazon m platforms, the ratios range between 1.0 and 2.0, which matches the impact of hyperthreading. (Although the Proxmox private cloud also uses hyperthreading, it does not have the same impact. This is because the private cloud schedules virtual cores across all physical cores, unlike the Amazon public cloud, which likely binds the virtual cores to the hardware threads of one physical core.) On the Amazon t platforms, the ratios exceed 2.0, likely because the token bucket processor allocation throttles the concurrent workloads executing on two virtual cores more than a single workload executing on one virtual core. (Somewhat surprisingly, this would suggest that it is more cost efficient to use Amazon t instances as single-core rather than dual-core machines.)

Returning to the research question, our results put an upper bound on how much we can underestimate the workload execution time ratio. For example, if an A/A execution takes 3 times as much time as A executing alone, and B executing alone takes 2 times as much as A, the desired ratio of 2.0 would be observed as 4/3 ≈ 1.3 (the model behind this figure is sketched below). Figure 7 suggests this would be an extreme case.

At the same time, our experiments provide a way to address this concern if required. Because the underestimated workload execution time ratio is associated with uneven resource utilization, we can simply adjust the duet procedure to continue (repeatedly) executing the shorter workload until the longer workload finishes, rather than leaving the resources of the shorter workload idle. This measure obviously removes the uneven resource utilization.
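A back-of-the-envelope model of the underestimation bound, assuming the concurrent-phase slowdown applies equally to both workloads (our simplification):

    def observed_ratio(duet_slowdown, true_ratio):
        """Worst-case ratio observed in an A/B duet measurement when
        both workloads run duet_slowdown times slower while they execute
        concurrently and B runs at full speed once A finishes.
        With A taking unit time in isolation, A finishes at
        duet_slowdown; by then B has completed 1.0 of its true_ratio
        units of work and finishes the rest in isolation."""
        a_finish = duet_slowdown
        b_finish = a_finish + (true_ratio - 1.0)
        return b_finish / a_finish

    # The example from the text: A/A takes 3 times as long as A alone,
    # B alone takes 2 times as long as A alone.
    print(observed_ratio(3.0, 2.0))  # 1.333..., i.e. 4/3 instead of 2.0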
The combined answers to the four research questions show that the duet procedure improves performance comparison accuracy on shared resource platforms by relying on the synchronized nature of resource sharing interference. Our experiments suggest the assumption of synchronized interference is safe to make on many platforms: although it hinges on a multitude of technical details, these boil down to expecting that the platforms treat similar workloads in symmetrical situations equally.

On the flip side of the same argument, the duet procedure may not improve accuracy when comparing workloads with very different bottleneck resources, such as a CPU-bound workload and an I/O-bound workload. There is no reason to expect any resource sharing interference to impact most different resources equally. This is a threat to the external validity of our results.

We can also argue that comparing workloads with different bottleneck resources is inherently fraught with issues. The relative performance of the workloads is more likely to change between platforms with different resource parameters, making comparison results less portable and therefore less useful.

A very general threat to both external and internal validity concerns the complex and diverse nature of public cloud platforms. Because cloud performance characteristics may vary significantly across platforms, our conclusions are potentially restricted to the platforms and workloads we use. Also, some of the effects we observe may be due to internal mechanisms we do not analyze. While characterizing every platform and workload is clearly not possible, we do use multiple platforms and workloads to at least partially address this concern.

We have mostly limited our experiments to the application of the duet procedure for change detection in the cloud; however, we do see more application opportunities both in the cloud and on bare metal systems. One interesting challenge is integration into CI/CD pipelines without dedicated virtual machine instances. Such platforms can possibly use fine-grained processor scheduling policies in place of binding workloads to cores, and still achieve a reasonable comparison accuracy.
5 Related Work

Our related work section includes a condensed version of an earlier analysis in [4]. We start with the paper by Laaber et al. [16], which investigates the accuracy achievable in the cloud with standard measurement procedures.

6 Conclusion

Our experimental evaluation on 23 SPEC CPU 2017 workloads and 20 ScalaBench and DaCapo workloads suggests that duet measurement in the cloud is significantly more accurate than existing methodologies based on sequential measurements. Furthermore, our evaluation confirms that the improved accuracy is because the paired workloads are subjected to synchronized external interference. This external interference is an inherent property of running the workloads in the cloud, where the underlying resources are shared with the workloads of other users: whereas earlier techniques provide the same accuracy as duet measurement when there is no resource sharing, their accuracy deteriorates considerably in the presence of sharing.

The duet measurement procedure can introduce competition on the resources between the paired workloads and uneven resource utilization patterns. We show that these effects are either negligible or bounded and therefore do not prevent the detection of performance regressions.

Our observations imply that duet measurement is a viable technique for performance regression testing on both bare metal systems and in public cloud environments that support dedicated virtual machine instances. An interesting question is whether this technique can also improve the accuracy of CI/CD pipelines without dedicated instances; we leave the answer to future work.
Appendix: Experimental Environment

The duet measurements target shared resource environments common in clouds; most of our measurements therefore execute in clouds. As the main cloud platform, we use the Amazon Elastic Compute Cloud, with machine instances allocated in three different zones (us-east-1, us-east-2, us-west-2). We use four different instance types that were the smallest general-purpose computing instances with two virtual cores and sufficient memory: t3.medium and m5.large (two virtual cores on Intel Xeon Platinum 8175M), and t3a.medium and m5a.large (two virtual cores on AMD EPYC 7571).

To provide information on result variance, we execute all benchmarks multiple times (on average over 20 runs for each workload on the Amazon t instances, over 40 runs on the Amazon m instances, and over 100 runs on the other platforms). Some virtual machine instances are used for multiple measurements. On the Amazon platform, we have an average of 20 different instances used for each workload. On GitLab CI, we have allocated 4 different instances and let the GitLab CI infrastructure choose whichever instance it decided on. With Travis CI, the allocation of instances is outside our control. We use random samples of 10 runs for all computations. These numbers were chosen to provide sufficient opportunity for exploring the nondeterministic execution behavior; our informal experiments suggest that fewer runs yield too wide confidence intervals for mean iteration time.

On the relatively fast execution platforms (public cloud at full speed, private cloud, bare metal), we collect the timing of the first 100 iterations or the first 10 minutes of execution within each run, whichever comes first. On the relatively slow execution platforms (public clouds with token bucket processor allocation), it is 100 iterations or 60 minutes. We do not execute the SPEC CPU 2017 workloads on the Amazon t instances and on the Travis CI infrastructure, because both lack the computing power to execute the benchmark in reasonable time.

For the SPEC CPU 2017 workloads, which exhibit virtually no startup artifacts, we use the timing of all iterations. For the ScalaBench workloads, which exhibit startup artifacts related to dynamic compilation, we discard the timing of the first half of iterations as cold and only use the rest as warm. The intent is to avoid measurements taken before the compilation of the hottest methods completes; however, it is not our ambition to guarantee steady state performance measurements. The diversity of the measurement configurations means we would have to rely on runtime steady state detection, which would introduce significant additional variability between individual executions. We have deemed it better to provide an apples-to-apples comparison on data that may include some dynamic artifacts, as opposed to an apples-to-oranges comparison on data that omitted a varying number of initial iterations.

Finally, in all computations we employ outlier filtering with winsorization, replacing at most one observation in a run with its nearest neighbor when that observation is further than 20% away from the min-max range of the remaining observations. Our bootstrap computations use 10000 replicates.
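One possible reading of the winsorization rule as code (our interpretation, not the authors' implementation):

    import numpy as np

    def winsorize_run(times, threshold=0.2):
        """Replace at most one observation with its nearest remaining
        neighbor when it lies more than threshold times the min-max
        range outside the range of the remaining observations."""
        t = np.asarray(times, dtype=float)
        order = np.argsort(t)
        for i in (order[-1], order[0]):      # only an extreme can qualify
            rest = np.delete(t, i)
            lo, hi = rest.min(), rest.max()
            margin = threshold * (hi - lo)
            if t[i] > hi + margin or t[i] < lo - margin:
                out = t.copy()
                out[i] = hi if t[i] > hi else lo
                return out
        return t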
Appendix: Detailed Results

For insight beyond the aggregate figures, we include complete relative confidence interval widths for all workload and platform combinations presented. The columns list triplets separated by colons:

– The interval width computed with the duet procedure.
– The interval width computed with the duet procedure after shuffling.
– The interval width computed with the standard method.

All widths are relative and expressed in percent. The best (smallest) interval width is shown in boldface. The listing also includes bare metal measurements, which show the accuracy achievable under ideally controlled conditions.
References

[1] A. Abedi and T. Brecht. Conducting Repeatable Experiments in Highly Variable Cloud Computing Environments. In ICPE, pages 287–292. ACM, 2017.
[2] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. In OOPSLA, pages 169–190. ACM, 2006.
[3] L. Bulej, T. Bureš, V. Horký, J. Kotrč, L. Marek, T. Trojánek, and P. Tůma. Unit Testing Performance with Stochastic Performance Logic. Automated Software Engineering, pages 1–49, 2016.
[4] L. Bulej, V. Horký, and P. Tůma. Initial Experiments with Duet Benchmarking: Performance Testing Interference in the Cloud. In MASCOTS, pages 249–255, 2019.
[5] D. Cerotti, M. Gribaudo, P. Piazzolla, and G. Serazzi. Flexible CPU Provisioning in Clouds: A New Source of Performance Unpredictability. In QEST, pages 230–237, 2012.
[6] J. Ericson, M. Mohammadian, and F. Santana. Analysis of Performance Variability in Public Cloud Computing. In IRI, pages 308–314, 2017.
[7] B. Farley, A. Juels, V. Varadarajan, T. Ristenpart, K. D. Bowers, and M. M. Swift. More for Your Money: Exploiting Performance Heterogeneity in Public Clouds. In SoCC, pages 20:1–20:14. ACM, 2012.
[8] A. Georges, D. Buytaert, and L. Eeckhout. Statistically Rigorous Java Performance Evaluation. In OOPSLA, 2007.
[9] GitLab Inc. GitLab Runner. https://about.gitlab.com, 2019.
[10] S. He, G. Manns, J. Saunders, W. Wang, L. Pollock, and M. L. Soffa. A Statistics-Based Performance Testing Methodology for Cloud Applications. In ESEC/FSE, pages 188–199. ACM, 2019.
[11] C. Heger, J. Happe, and R. Farahbod. Automated Root Cause Isolation of Performance Regressions During Software Development. In ICPE, pages 27–38. ACM, 2013.
[12] T. Hesterberg. What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum. arXiv:1411.5279 [stat], 2014.
[13] P. Huang, X. Ma, D. Shen, and Y. Zhou. Performance Regression Testing Target Prioritization via Performance Risk Analysis. In ICSE, pages 60–71. ACM, 2014.
[14] A. Iosup, N. Yigitbasi, and D. Epema. On the Performance Variability of Production Cloud Services. In CCGRID, pages 104–113, 2011.
[15] K. Joshi, A. Raj, and D. Janakiram. Sherlock: Lightweight Detection of Performance Interference in Containerized Cloud Services. In HPCC, pages 522–530, 2017.
[16] C. Laaber, J. Scheuner, and P. Leitner. Software Microbenchmarking in the Cloud. How Bad Is It Really? Empirical Software Engineering, 2019.
[17] P. Leitner and J. Cito. Patterns in the Chaos: A Study of Performance Variation and Predictability in Public IaaS Clouds. ACM Trans. Internet Technol., 16(3):15:1–15:23, 2016.
[18] A. Lenk, M. Menzel, J. Lipsky, S. Tai, and P. Offermann. What Are You Paying For? Performance Benchmarking for Infrastructure-as-a-Service Offerings. In CLOUD, pages 484–491, 2011.
[19] A. Maricq, D. Duplyakin, I. Jimenez, C. Maltzahn, R. Stutsman, and R. Ricci. Taming Performance Variability. In OSDI, pages 409–425. USENIX Association, 2018.
[20] J. Mukherjee, D. Krishnamurthy, and M. Wang. Subscriber-Driven Interference Detection for Cloud-Based Web Services. IEEE Trans. on Network and Service Management, 14(1):48–62, 2017.
[21] A. B. D. Oliveira, S. Fischmeister, A. Diwan, M. Hauswirth, and P. F. Sweeney. Perphecy: Performance Regression Test Selection Made Simple but Effective. In ICST, pages 103–113, 2017.
[22] Oracle. GraalVM Repository at GitHub. https://github.com/oracle/graal, 2019.
[23] Z. Ou, H. Zhuang, A. Lukyanenko, J. K. Nurminen, P. Hui, V. Mazalov, and A. Ylä-Jääski. Is the Same Instance Type Created Equal? Exploiting Heterogeneity of Public Clouds. IEEE Trans. on Cloud Computing, 1(2):201–214, 2013.
[24] S. Ristov, R. Mathá, and R. Prodan. Analysing the Performance Instability Correlation with Various Workflow and Cloud Parameters. In PDP, pages 446–453, 2017.
[25] J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. VLDB Endow., 3(1-2):460–471, 2010.
[26] A. Sewe, M. Mezini, A. Sarimbekov, and W. Binder. Da Capo Con Scala: Design and Analysis of a Scala Benchmark Suite for the Java Virtual Machine. In OOPSLA, pages 657–676. ACM, 2011.
[27] S. Shankar, J. M. Acken, and N. K. Sehgal. Measuring Performance Variability in the Clouds. IETE Technical Review, 35(6):656–660, 2018.
[28] Standard Performance Evaluation Corporation. SPEC CPU 2017. 2017.
[29] Travis CI, GmbH. Travis CI. https://travis-ci.com, 2019.

Benchmark            Amazon m5.large   Amazon m5a.large   Amazon t3.medium   Amazon t3a.medium
apparat : 14 : 17 : 25 : 20 : 26 : 46 : 30 : 51
avrora : 2.5 : 2.8 : 12 : 12 : 75 : 51 : 89 : 56
factorie : 11 : 12 : 15 : 21 : 12 : 41 : 22 : 55
fop : 5.5 : 10 : 12 : 14 : 180 : 150 : 100 : 200
h2 : 9 : 16 : 10 : 11 : 95 : 60 : 110 : 49
jython : 8.5 : 9.3 : 12 : 12 : 55 : 65 : 72 : 140
kiama : 3.8 : 3.7 : 11 : 13 : 97 : 61 : 72 : 110
luindex : 5.7 : 8.7 : 12 : 12 : 110 : 54 : 30 : 31
lusearch : 3.3 : 2.7 : 12 : 13 : 70 : 45 : 17 : 48
pmd : 4.8 : 5.7 : 12 : 13 : 160 : 98 : 120 : 66
scalac : 5.3 : 5.8 : 11 : 13 : 130 : 86 : 140 : 140
scaladoc : 5.1 : 6.4 : 13 : 12 : 120 : 87 : 110 : 110
scalap : 3.6 : 3.3 : 12 : 12 : 61 : 20 : 130 : 46
scalariform : 4.8 : 5.2 : 12 : 13 : 150 : 100 : 130 : 100
scalatest : 4.4 : 5.2 : 12 : 13 : 72 : 37 : 100 : 57
scalaxb 28 : 27 : 24 : 28 : 24 21 : 110 : 59 : 100 : 98
specs : 5.1 : 6.2 : 12 : 11 : 98 : 46 : 74 : 56
sunflow : 3.1 : 3.1 : 13 : 13 : 8.5 : 50 : 21 : 38
tmt : 10 : 9.2 : 17 : 26 : 66 : 62 : 65 : 76
xalan : 3.9 : 3.6 : 13 : 19 : 78 : 34 : 120 : 58
500.perlbench_r : 4.3 : 11 : 13 : 11 — —
502.gcc_r : 5.9 : 7.4 : 11 : 11 — —
503.bwaves_r : 2.5 : 2.1 : 10 : 12 — —
505.mcf_r : 4.4 : 3 : 9.8 : 10 — —
507.cactuBSSN_r : 5.5 : 5.3 : 12 : 13 — —
508.namd_r : 2.3 : 2.2 : 12 : 12 — —
510.parest_r : 2.4 : 2.4 : 11 : 9.7 — —
511.povray_r : 4.3 : 2 : 11 : 12 — —
519.lbm_r : 2.5 : 3.6 : 8.8 : 11 — —
520.omnetpp_r : 6.6 : 16 : 9.2 : 13 — —
521.wrf_r : 2.5 : 2.4 : 12 : 11 — —
523.xalancbmk_r : 3.8 : 4.3 : 10 : 11 — —
525.x264_r : 2.2 : 2.3 : 12 : 11 — —
526.blender_r : 3.8 : 3.3 : 12 : 13 — —
527.cam4_r : 2.7 : 2.5 : 13 : 11 — —
531.deepsjeng_r : 2.4 : 3.4 : 10 : 12 — —
538.imagick_r : 2.7 : 2.5 : 11 : 11 — —
541.leela_r : 4.5 : 2.2 : 14 : 11 — —
544.nab_r : 2.9 : 2.8 : 12 : 11 — —
548.exchange2_r : 4 : 1.9 : 11 : 11 — —
549.fotonik3d_r : 4.7 : 2.5 : 8.7 : 9 — —
554.roms_r : 2.8 : 2.5 : 10 : 13 — —
557.xz_r : 4.5 : 7.2 : 11 : 12 — —

Benchmark            Bare Metal   GitLab CI   Proxmox Busy   Proxmox Idle   Travis CI
apparat 18 : 16 : 16 22 : 75 : 79 : 33 : 29 16 : 18 : 15 12 : 22 : 18
avrora : 4.1 : 3.9 : 66 : 76 : 13 : 11 4 : 4 :
factorie : 18 : 19 : 49 : 52 : 29 : 23 : 14 : 17 : 17 : 26
fop 4.6 : : 6.5 : 69 : 73 : 24 : 21 7.7 : : 8.7 : 14 : 15
h2 : 8.8 : 10 : 51 : 53 : 22 : 21 : 3.8 : 5.6 : 17 : 26
jython : 7 : 8.6 : 64 : 69 : 29 : 27 : 8.2 : 12 : 25 : 18
kiama 3.3 : : 3.3 : 63 : 69 : 24 : 21 3.1 : 2.9 : : 13 : 20
luindex : 8.1 : 5.6 : 60 : 63 : 16 : 15 : 7.7 : 5.3 : 11 : 10
lusearch : 2.5 : 2.6 : 61 : 63 : 19 : 15 : 2.4 : 2.5 : 20 : 25
pmd 2.1 : 1.8 : : 69 : 70 : 23 : 22 1.6 : : 1.6 : 16 : 11
scalac 5.6 : : 5.6 : 66 : 66 : 21 : 19 4.8 : : 4.6 : 13 : 15
scaladoc : 4.4 : 5 : 64 : 69 : 23 : 20 4.2 : 4.3 : : 10 : 21
scalap : 3.1 : 2.4 : 56 : 58 : 17 : 14 : 2.8 : 2.9 : 9.2 : 11
scalariform : 4.3 : 5.2 : 59 : 67 : 19 : 16 : 3.1 : 4 : 10 : 15
scalatest : 1.5 : 2.3 : 69 : 67 : 21 : 16 1.5 : : 1.7 : 10 : 12
scalaxb 30 : 27 : 20 23 : 70 : 74 : 37 : 32 29 : 26 : 19 23 : 27 : 27
specs : 2.9 : 4.2 : 60 : 62 : 18 : 15 4.6 : : 5.5 : 10 : 19
sunflow : 2.3 : 2.2 : 55 : 60 : 25 : 20 : 2.8 : 3.3 : 11 : 7.7
tmt 5.1 : : 18 : 61 : 71 : 24 : 22 : 7.3 : 16 : 21 : 18
xalan : 2.4 : 2.3 : 73 : 74 : 18 : 15 2.9 : 2.7 : : 12 : 11
500.perlbench_r 0.71 : 0.65 : : 43 : 46 : 28 : 30 0.62 : 0.55 : —
502.gcc_r : 0.61 : 0.77 : 42 : 41 : 23 : 23 : 0.57 : 0.75 —
503.bwaves_r : 0.35 : 0.35 : 61 : 74 : 16 : 15 : 0.54 : 0.52 —
505.mcf_r 1.3 : : 1.3 : 41 : 42 : 19 : 16 1.3 : 1.2 : —
507.cactuBSSN_r 5 : 3.8 : : 33 : 31 : 21 : 21 0.67 : 0.65 : —
508.namd_r : 0.74 : 0.91 : 41 : 49 : 27 : 25 : 0.61 : 0.68 —
510.parest_r : 0.29 : 0.31 : 49 : 60 : 17 : 16 : 0.29 : 0.34 —
511.povray_r 0.91 : 0.93 : : 51 : 49 : 27 : 25 1 : 0.94 : —
519.lbm_r 0.36 : 0.91 : : 41 : 39 : 18 : 17 1.9 : 2.1 : —
520.omnetpp_r 1.1 : : 1.1 : 39 : 36 : 19 : 19 0.92 : : 1.1 —
521.wrf_r 0.38 : 0.68 : : 48 : 52 : 20 : 19 : 0.48 : 0.65 —
523.xalancbmk_r 1.9 : : 2 : 86 : 97 : 17 : 16 1.5 : 1.4 : —
525.x264_r : 0.27 : 0.24 : 46 : 50 : 20 : 18 0.95 : 2.7 : —
526.blender_r 0.33 : : 0.33 : 41 : 42 : 24 : 20 0.36 : 0.39 : —
527.cam4_r : 0.32 : 0.38 : 39 : 53 : 21 : 17 : 0.63 : 0.61 —
531.deepsjeng_r : 0.17 : 0.18 : 44 : 45 : 22 : 22 : 0.26 : 0.29 —
538.imagick_r 1.6 : 0.87 : : 54 : 56 : 26 : 27 0.86 : : 0.92 —
541.leela_r 0.13 : 0.14 : : 37 : 40 : 18 : 17 : 0.2 : 0.18 —
544.nab_r 0.99 : 1 : : 54 : 60 : 15 : 12 : 0.36 : 0.71 —
548.exchange2_r 0.79 : : 0.72 : 49 : 58 : 27 : 24 0.46 : 0.44 : —
549.fotonik3d_r : 1.6 : 0.65 : 29 : 32 : 18 : 23 : 1.3 : 0.92 —
554.roms_r : 1.1 : 0.45 : 37 : 41 : 20 : 20 : 1.6 : 1.1 —
557.xz_r 0.54 : 0.47 : : 36 : 37 : 19 : 17 0.58 : 0.56