BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics
Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Subho S. Banerjee
University of Illinois at Urbana-Champaign, Urbana, Illinois, [email protected]
Saurabh Jha
University of Illinois at Urbana-Champaign, Urbana, Illinois, [email protected]
Zbigniew T. Kalbarczyk
University of Illinois at Urbana-Champaign, Urbana, Illinois, [email protected]
Ravishankar K. Iyer
University of Illinois at Urbana-Champaign, Urbana, Illinois, [email protected]
Abstract
Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to nondeterminism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling of PCIe transfers.
CCS Concepts: • General and reference → Performance; Measurement; • Hardware → Error detection and error correction; Hardware accelerators; • Computing methodologies → Learning in probabilistic graphical models.

Keywords: Performance Counter, Sampling Errors, Error Detection, Error Correction, Probabilistic Graphical Model, Accelerator
1 Introduction

Hardware performance counters (HPCs) are widely used in profiling applications to characterize and find bottlenecks in application performance. Even though HPCs can count hundreds of different types of architectural and microarchitectural events, they are limited because those events are collected (i.e., multiplexed) on a fixed number of hardware registers (usually 4–10 per core). As a result, they are error-prone because of application, sampling, and asynchronous collection behaviors borne out of multiplexing. Such behavior in HPC measurements is not a new problem, and has been known for the better part of a decade [1, 12, 29, 32, 43, 44, 48].
Targeted Need. Traditional approaches to tackling HPC errors have relied on collecting measurements across several application runs, and then performing offline computations to (i) impute missing or errored measurements with new values (e.g., [43]); or (ii) drop outlier values to reduce overall error (e.g., [29]). Both approaches require time and compute resources for collecting training data and running inference, and are thus suitable only for offline analysis (like profiling). These techniques are untenable in emergent applications that use HPCs as inputs to complete a feedback loop and make dynamic real-time decisions that affect system resources using a variety of machine learning (ML) methods. Examples include online performance hotspot identification (e.g., [14]), userspace or runtime-level scheduling (e.g., [2, 4, 10, 17, 48]), and power and energy management (e.g., [13, 36, 37, 40]), as well as attack detectors and system integrity monitors [8]. In such cases, HPC measurement errors propagate, get exaggerated, and can lead to longer training times and poor decision quality (as illustrated in §6.3). This is not surprising, because ML systems are known to be sensitive to small changes in their inputs (e.g., in adversarial ML) [9, 18, 24]. As we will show in §2, HPC measurement errors can be large (as much as 58%); hence they must be explicitly handled.

This paper presents BayesPerf, a system for quantifying uncertainty and correcting errors in HPC measurements using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs. BayesPerf corrects HPC measurement errors at the system (i.e., CPU and OS) level, thereby allowing the downstream control and decision models that use HPCs to be simpler, faster, and to use less training data (if used with ML). The proposed model is based on the insight that even though individual HPC measurements might be in error, groups of different HPC measurements that are related to one another can be jointly considered, using the underlying statistical relationships between them, to reduce the measurement errors. We derive such relationships from design and implementation knowledge of the microarchitectural resources provided by CPU vendors [7, 19]. For example, the number of LLC misses, the size of DMA transactions, and the DRAM bandwidth utilization are related quantities, and can be used to reduce measurement errors in one another.

Approach & Contributions.
The key contributions are:
(1) The BayesPerf ML Model. We present a probabilistic ML model that incorporates microarchitectural relationships to combine measurements from several noisy HPCs in order to infer their true values, as well as to quantify the uncertainty in the inferred values due to noise. This allows for (a) improved decision-making, thanks to explicit quantification of HPC measurement uncertainty; and (b) a reduced need for aggressive (high-frequency) HPC sampling (which negatively impacts application performance) to capture high-fidelity measurements, thereby increasing our observability into the system.
(2) The BayesPerf Accelerator. To enable the use of the BayesPerf ML model in latency-critical, real-time decision-making tasks, this paper presents the design and implementation of an accelerator for Monte Carlo-based training and inference of the BayesPerf model. The accelerator exploits (a) high-throughput random-number generators; and (b) maximal parallelism based on the statistical relationships mentioned above, to rapidly sample multiple parts of the BayesPerf model in parallel.
(3) A Prototype Implementation. We describe an FPGA-based prototype implementation of the BayesPerf system (on a Xilinx Virtex 7 FPGA) for Linux running on Intel x86_64 (Sky Lake) and IBM ppc64 (Power9) processors. The BayesPerf system is designed to provide API compatibility with Linux's perf subsystem [27], allowing it to be used by any userspace performance monitoring tool on both x86_64 and ppc64 systems. Our experiments demonstrated that BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed, which is an overall 5.28× error reduction. Further, the BayesPerf accelerator provides an 11.8× reduction in power consumption, while adding less than 2% read-latency overhead over native HPC sampling.
(4) Increasing training and model efficiency of decision-making tasks. We demonstrate the generality of the BayesPerf system by integrating it with a high-level ML-based IO scheduler that controls transfers over a PCIe interconnect. We observed that the training time for the scheduler was reduced by 37% (a ~52 hr reduction) and that the average makespan of scheduled workloads decreased by 19%.

The remainder of the paper is organized as follows. First, in §2, we discuss the sources of HPC measurement errors. Then, in §3, we provide an overview of the design of the BayesPerf system. §4 describes the formulation, training, and inference of the ML model used to correct errors. §5 describes the accelerator that allows inference on the ML model in real time. Then, in §6, we discuss a prototype implementation and its evaluation. Finally, in §7 and §8, we put BayesPerf in perspective against traditional methods and describe future challenges, respectively.
2 HPC Measurement Errors

Every modern processor has a logical unit called the Performance Monitoring Unit (PMU), which consists of a set of HPCs. An HPC counts how many times a certain event occurs during a time interval of a program's execution. The number and configurability of the HPCs vary across processor vendors and microarchitectures. For example, modern Intel processors have three fixed HPCs (which measure ISA-related events) and eight programmable HPCs per core (which measure microarchitectural events and are split between the SMT threads on the core) [21]. The events measured by an HPC are vendor-specific and microarchitecture-dependent, and vary with processor models within the same microarchitecture. For example, an Intel Haswell CPU has 400 programmable events, compared to the 1623 events on a HaswellX CPU; both have the same number of HPC registers per core (three fixed + eight programmable) [48]. Therefore, one must carefully pick and configure which events to monitor with the available registers.

Figure 1: Errors due to event multiplexing in HPC measurements across ten application runs (average error (%) vs. number of multiplexed counters).
Reading HPCs. Performance counters can be read using:
(1) Polling: The HPCs can be read at any instant by using specific instructions to write (to configure the HPC) and read (to poll the value of an HPC) model-specific registers (MSRs) that represent HPCs. For example, x86_64 uses the rdmsr and wrmsr instructions to read from and write to MSRs, respectively; both instructions require OS-level access privilege, and hence are performed by the OS on behalf of a user. Here, one HPC is programmed to count only one event during the execution of a program. Hence, polling alone is insufficient, as the number of events that can be simultaneously measured is limited by the number of available hardware registers.
(2) Sampling: HPCs also support sampling of counters based on the occurrence of events, thereby letting multiple events timeshare a single HPC [30, 32]. This feature is enabled through a specific interrupt, called the Performance Monitoring Interrupt (PMI), which can be generated after the occurrence of a certain number of events (i.e., a predetermined threshold). The interrupt handler then polls (i.e., samples) the HPC. The multiplexing of events occurs through a separate scheduling interrupt that is triggered periodically to change the configuration of the HPCs and swap events in and out. The collected measurements are generally scaled to account for the time during which they were not scheduled on an HPC [12], and that scaling can lead to erroneous measurements, as shown in the sketch after this list. Sampling is necessary because of the severe disparity between the number of event types and the number of counters.
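To make the undercounting concrete, the following Python sketch shows the linear extrapolation that multiplexing-based sampling relies on (the same enabled-time/running-time scaling used by Linux's perf subsystem); the function and field names here are illustrative, not the kernel's actual interface.

```python
# Minimal sketch of the scaling step applied to multiplexed HPC samples:
# a counter that occupied a hardware register for only part of the
# measurement window is linearly extrapolated to the full window.
# Names are illustrative, not the kernel's actual structures.

def scale_multiplexed_count(raw_count: int,
                            time_enabled_ns: int,
                            time_running_ns: int) -> float:
    """Extrapolate a raw count to the full enabled window.

    time_enabled_ns: total time the event was requested (enabled).
    time_running_ns: time the event actually occupied a hardware counter.
    """
    if time_running_ns == 0:
        return float("nan")  # event was never scheduled; nothing to scale
    return raw_count * (time_enabled_ns / time_running_ns)

# Example: 1.2M events counted while scheduled for 25 ms of a 100 ms
# window. The scaled estimate assumes the observed rate held for the
# whole window -- exactly the assumption that introduces error.
print(scale_multiplexed_count(1_200_000, 100_000_000, 25_000_000))
```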
Sources of Errors. In addition to the errors due to event multiplexing, HPCs exhibit other modalities of measurement error. For example, HPC measurements can vary across runs because OS activity, the scheduling of programs in multitasking environments, memory layout, memory pressure, and multiprocessor interactions may all change between runs. Nondeterminism in OS behavior (e.g., the servicing of hardware interrupts) also plays a significant role in HPC measurement errors [44]. Performance counters have also been shown to overcount certain events on some processors [44]. Finally, the implementation of userspace and OS-kernel-level tools can cause different tools to provide different measurements for the same HPCs in strictly controlled environments for the same application. These variations may result from the techniques involved in acquiring the measurements, e.g., the point at which the tools start the counters, the reading technique (polling or sampling), the measurement level (thread, process, core, multiple cores), and the noise-filtering approach used.

Figure 2: Overview of the BayesPerf ML model (BayesPerf vs. traditional multiplexing over time; $\{c_1, c_2, c_3\} = \{e_a, e_b, e_c\}$ denotes an HPC configuration).
Measurement Errors. As a result of this nondeterminism, quantifying error in HPCs is difficult, as there is no way to obtain "ground truth" measurements given the inherent variation across measurements. In this paper, we define HPC error as the magnitude of the difference between corresponding HPC measurements made in two runs of a workload, one in polling mode and the other in sampling mode. The correspondence between the two HPC traces (time series) is established by dynamic time warping [5], which calculates an "alignment" between the two time-series datasets using edit distance. A minimal sketch of this error metric follows below. Fig. 1 illustrates the net effect of measurement errors on the fidelity of an HPC counter using Linux's perf subsystem. In this case, the baseline dataset was collected using polling, and the target dataset was collected using sampling, each on 10 independent application runs, capturing both variations within a single run and variations across runs. We observe a 58% average error in HPC measurements when 35 on-core events are being multiplexed on an Intel processor, compared to the baseline of polling 4 events at a time. (This definition of error is based on prior work on HPC errors [29]; the experimental setup is described in detail in §6.1.)
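The following is a minimal, illustrative Python reconstruction of that error metric: it aligns a polled trace with a sampled trace using a textbook dynamic-time-warping implementation and reports the mean relative difference over the aligned pairs. It is a sketch of the metric's definition, not the authors' exact tooling.

```python
# Sketch of the paper's error metric: DTW-align a "polling" trace with a
# "sampling" trace, then report the mean relative difference over
# aligned pairs. Illustrative reconstruction only.
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    """Classic O(len(a)*len(b)) DTW; returns the alignment path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack to recover aligned index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def hpc_error(polled: np.ndarray, sampled: np.ndarray) -> float:
    """Mean relative error between DTW-aligned measurements."""
    pairs = dtw_path(polled, sampled)
    return float(np.mean([abs(polled[i] - sampled[j]) / max(polled[i], 1e-9)
                          for i, j in pairs]))
```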
Errors in Derived Events. Such high error is particularly troubling, as it is quite conceivable that one would need to count 35 events simultaneously, particularly when measuring derived events. Derived events are obtained by combining individual HPC measurements in a mathematical expression. Consider, for example, the Backend_Bound_SMT derived event on an Intel BroadwellX processor. It measures the fraction of µop issue slots utilized in a core, and alone takes measurements from 16 HPCs to compute [7]. This information might be valuable in an OS-level scheduler that controls an SMT processor with the objective of minimizing interference between CPU-bound processes/threads. Often such information would be combined with other derived metrics like Memory_Bound and Frontend_Bound_SMT, which together would require the use of 29 unique counters. That, according to Fig. 1, would incur an average error of ~45%. The problem is further exacerbated by the fact that the HPCs need to be counted per SMT thread, per core, and per socket. For example, on an average 2-socket server system, this would imply collecting thousands of counters (i.e., 2784 HPCs = 29 counters × 2 SMT threads × 24 cores × 2 sockets).
Adding More Registers? A relevant question is whether the HPC-error problem will disappear if more HPC registers are added to future CPUs. The short answer is that it will not: as we continue to add more monitors, system complexity increases, which is untenable in commercial CPUs that are often driven by other practical considerations. Hence, HPC counters will always eventually end up introducing sampling-based error.
3 BayesPerf Overview

Key Insight. The key insight that drives this work is that microarchitectural invariants (e.g., [7, 19, 41]) can be applied to measured HPC data to estimate whether it is, in fact, in error (i.e., a detector). Further, we can quantify the "uncertainty" of an HPC measurement by quantifying the probability of its deviation from such an invariant (i.e., its egregiousness). When the above is applied to a group of HPC measurements, each targeting different microarchitectural units, the underlying invariants can be encoded as statistical relationships, i.e., joint probability distributions, which can then be composed into larger probabilistic graphical models. We then use a Bayesian inference approach to integrate the data and prior knowledge of the system to effectively attenuate the high-error measurements and significantly amplify the correct ones, all in real time. This works in practice because HPCs with lower errors are generally more numerous than those with higher errors (as verified by our observations); hence, they bias the aggregate results toward the lower-error values. As a result, BayesPerf significantly outperforms traditional, purely data-driven statistical approaches for outlier detection.
BayesPerf ML Model. Below, we provide a high-level description of the model, using the example illustrated in Fig. 2. In this example, the goal is to measure (by multiplexing) a set of events $\{e_a, \dots, e_f\}$ on a set of HPCs $\{c_1, c_2, c_3\}$.

Deciding Schedules of HPCs: BayesPerf first determines a schedule of how the events are multiplexed on the HPCs. The schedule consists of a set of HPC configurations that are collected over time. We define an HPC configuration as a mapping between counters and events that defines which counters are collected at an instant of time. The notation $\{c_1, c_2, c_3\} = \{e_a, e_b, e_c\}$ is used to define such a configuration, and implies that $c_1$ counts $e_a$, $c_2$ counts $e_b$, and so on. The scheduling process is driven primarily by the microarchitectural considerations of the available HPCs and the types of events that each one can measure, as not all HPCs can measure all events.

Figure 3: Latency overhead of reading counters with BayesPerf compared to traditional methods on an x86 CPU (average overhead in cycles for Linux, Linux + RDPMC, BayesPerf (CPU), BayesPerf (Acc), and CounterMiner).

Traditional HPC measurement tools, like the Linux perf subsystem, trigger HPC configuration changes in a round-robin manner, based on a periodic hardware timer-driven interrupt (see Fig. 2). BayesPerf uses a similar interrupt-driven approach, but does not use round-robin to build a schedule of configurations. It creates configurations of overlapping counters, such that each set of counters has "statistical relationships" to events in the preceding and subsequently scheduled configurations. For example, in Fig. 2, $e_a$ and $e_e$ are such overlapping events. As we will show in §4, these "statistical relationships" can be derived from microarchitectural invariants (i.e., domain knowledge) that tie together the resources underlying the measurements. BayesPerf encodes those invariants as generative joint- and conditional-probability distributions for the processors used in our experiments.
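The following Python sketch shows, under simplifying assumptions, how such an overlap-preserving schedule can be built: each configuration repeats one event from the previous time slice so that statistical relationships can be chained across slices. The real scheduler additionally honors validity constraints and chooses overlaps using the factor graph (§4.1); this sketch does neither, and it carries forward the last event rather than a specific one like $e_a$ in Fig. 2.

```python
# Illustrative sketch (not the authors' scheduler): given events to
# multiplex and the number of programmable HPCs, emit configurations in
# which consecutive time slices share one "overlap" event.

def overlap_schedule(events: list[str], n_counters: int) -> list[list[str]]:
    assert n_counters >= 2, "need one slot for the overlap event"
    slices, i = [], 0
    prev_overlap = None
    while i < len(events):
        config = []
        if prev_overlap is not None:
            config.append(prev_overlap)   # repeat one event from the
                                          # previous slice (the overlap)
        while len(config) < n_counters and i < len(events):
            config.append(events[i])
            i += 1
        slices.append(config)
        prev_overlap = config[-1]         # carry the last event forward
    return slices

# {c1,c2,c3} over six events, mirroring the {e_a..e_f} example of Fig. 2:
for cfg in overlap_schedule(["ea", "eb", "ec", "ed", "ee", "ef"], 3):
    print(cfg)
```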
Inferring Unscheduled Events: At each instant of time, BayesPerf then uses sampled data from the overlapping events to compute a full posterior distribution (i.e., the likely values and their associated uncertainties) of the unscheduled events using a Bayesian inference approach. Consider $e_b$ in the second time slice of Fig. 2. It is calculated using its own samples from the previous time slice and the samples of $e_a$ (which is the event repeated across time slices one and two) in the current time slice. The result of the Bayesian inference using the sampled data is a probability distribution $\Pr(e_b^t \mid e_b^{t-1}, e_a^t)$ at time $t$; this distribution not only gives us an estimate of $e_b$ (i.e., by finding the most likely value of $e_b$ under the distribution), but also quantifies the uncertainty (i.e., via the probability value $\Pr(e_b \mid \dots)$) in that estimate. The compositional nature of Bayesian inference allows chaining events across multiple time slices, if the overall set of events to be measured is large, albeit at the cost of larger uncertainty in the estimate. For example, in Fig. 2, the chain of events $(e_b \to e_a) \rightsquigarrow (e_a \to e_e) \rightsquigarrow (e_e \to e_d)$ can be used to directly estimate $e_b$ from samples of $e_a$, but also to transitively estimate it from samples of $e_e$. Here, "$\to$" describes the above statistical relationships between events in a configuration (i.e., in a single time slice), and "$\rightsquigarrow$" describes data collected between overlapping events across time slices. The BayesPerf system then allows a user to poll the posterior probability distributions of any of the events being collected. These distributions can be passed along (i.e., integrated) into higher-level ML/control frameworks or used directly to compute error bounds on HPC measurements.
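The following is a deliberately simplified Python sketch of that fusion step for a single pair of events: it assumes a known linear invariant between $e_a$ and $e_b$ and Gaussian beliefs, whereas the real model performs inference over a full factor graph. The constant k and all numeric values are illustrative assumptions.

```python
# Simplified sketch of the inference step: estimate an unscheduled event
# e_b at time t from (i) its own stale sample at t-1 and (ii) a currently
# scheduled, related event e_a, assuming a linear invariant e_b = k*e_a
# and Gaussian noise. The real model is a factor graph over many events;
# this shows only the two-factor Gaussian fusion at its core.
import numpy as np

def fuse_gaussian(mu1, var1, mu2, var2):
    """Product of two Gaussian beliefs (precision-weighted average)."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

# Belief from e_b's own (stale, noisy) sample:
mu_prev, var_prev = 4.0e6, (0.4e6) ** 2
# Belief implied by the invariant and a fresh sample of e_a (k assumed):
k, ea_sample, ea_var = 2.0, 2.1e6, (0.1e6) ** 2
mu_inv, var_inv = k * ea_sample, (k ** 2) * ea_var

mu_post, var_post = fuse_gaussian(mu_prev, var_prev, mu_inv, var_inv)
print(f"e_b estimate {mu_post:.3g} +/- {np.sqrt(var_post):.2g}")
```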
BayesPerf Accelerator. Though the BayesPerf ML model is able to provide significantly higher-quality samples from the raw HPC measurements, it introduces the additional runtime overhead of performing Bayesian inference on every new measurement polled by the user. Consider Fig. 3; it shows the average overhead (over 100 reads) of reading an HPC value using the Linux kernel's (perf subsystem) read() system call (i.e., polling), the x86_64 rdpmc instruction to read HPCs in userspace, a pure-CPU implementation of the BayesPerf ML model (using TensorFlow Probability [11, 42]), an FPGA-accelerated version of BayesPerf (described later in §5), and CounterMiner [29] (described later in §6 and used as a baseline in our evaluation). We observe that a single HPC read when the CPU implementation of BayesPerf is being used has approximately 9× longer latency than native polling of the HPC. In order to reduce that latency, we introduce an accelerator that parallelizes the process of computing posterior inference on the BayesPerf ML model. The accelerator largely builds upon our prior work [3] on MCMC accelerators, which treats a lack of statistical dependencies between variables as scope for parallel execution. Using the accelerator, BayesPerf adds less than 2% overhead in read latency compared to the native solution. Our implementation of the accelerator on a PCIe-attached FPGA device can take advantage of modern cache-coherent accelerator-processor communication protocols like CAPI [39], and essentially provides users with the same interface and the same performance characteristics they would get if they were natively polling the OS for HPC measurements.

4 The BayesPerf ML Model

In this section, we first discuss the formalization of HPCs and events for a generic CPU. Then, in §4.1, we discuss the problem of scheduling sets of performance counters onto available HPCs. Finally, in §4.3, we discuss an inference strategy to compute the posterior distribution of a single event based on the generated schedule and HPC measurement samples.
Formalism. We assume that every processor has a predetermined number of fixed and programmable HPCs; we refer to these numbers as $n_f$ and $n_p$, respectively. The HPCs themselves are indexed and referred to as $f_1 \dots f_{n_f}$ for the fixed HPCs and $c_1 \dots c_{n_p}$ for the programmable HPCs. The processor as a whole has a set $E = \{e_1, \dots, e_{n_e}\}$ of $n_e$ architectural and microarchitectural events that are measured using the $f_*$ and $c_*$. At any point in time, each programmable HPC is configured to count any one of the events in $E$. The instantaneous mapping between counters and events is called a configuration. Fixed HPCs are not considered in a configuration, as they cannot be programmed. Not all programmable HPCs can count all events (i.e., not all configurations are valid), depending on microarchitectural and implementation considerations. For example, an Intel off-core response event requires one HPC and one MSR register, and the L1D_PEND_MISS.PENDING event can be counted only on the third HPC on Haswell/Broadwell systems. Configuration validity constraints are known ahead of time, can be dynamically checked, and must always be satisfied. BayesPerf uses Linux's built-in validity checker.

A sample $s_j$ is generated from an HPC $c_i$ (i.e., an interrupt is fired to read the value of a counter and store it in memory) when a particular threshold $\tau_{i,k}$ is reached on one of the fixed HPCs $f_k$. (In general, this triggering event occurs based on the number of clock cycles or the number of instructions executed.) That process is denoted by $s_j \sim c_i$ if $f_k \ge \tau_{i,k}$. In addition to the value of the counter, the sampling process also records two time measurements, $t^i_r$ and $t^i_e$, where $t^i_e \le t^i_r$. They correspond to the total time the application has been running and the total time for which the event has been sampled (i.e., has been enabled), respectively. Traditional approaches (e.g., the one used in Linux) use these times to correct HPC undercounting errors, and assume that the true value of a performance counter is scaled according to $s_j \mapsto s_j \times t^i_r / t^i_e$.
Statistical Dependencies. Some subsets of events in $E$ have statistical relationships between them. Those statistical relationships are described by joint probability distribution functions. For example, if $e_1$ and $e_2$ share such a relationship, then it is represented by their joint probability distribution $\Pr(e_1, e_2; \Theta)$, where $\Theta$ refers to all tunable or learnable parameters of the distribution. We assert that if nothing is known about the statistical relationships between the events, then $\Pr(e_i, \dots)$ can be approximated by a neural network and trained using data from HPCs. However, for most real systems, knowledge about the underlying microarchitectural resources being counted by an HPC can be used to describe $\Pr(e_i, \dots)$. To do so, we use algebraic models of the composition of HPC measurements, built from information about the CPU microarchitecture found in processor performance manuals [7, 19, 45]. For example, in the Intel x86 Sandy Bridge microarchitecture [23, 45], the fraction of cycles a CPU is stalled because of DRAM accesses is given by $(1 - \texttt{Mem\_L3\_Hit\_Frac}) \times \texttt{STALLS\_L2\_PENDING} / \texttt{CLKS}$. Those stalls can be caused by either DRAM bandwidth issues or DRAM latency issues, which in turn can be measured as $\texttt{ORO\_DRD\_BW\_Cycles} / \texttt{CLKS}$ and $\texttt{ORO\_DRD\_Any\_Cycles} / \texttt{CLKS} - \texttt{ORO\_DRD\_BW\_Cycles} / \texttt{CLKS}$, respectively. Here, ORO_DRD_Any_Cycles, ORO_DRD_BW_Cycles, Mem_L3_Hit_Frac, STALLS_L2_PENDING, and CLKS correspond to a set of fixed and programmable events, which are related to each other via the algebraic relations described above. Given the equivalence of those computed quantities, we can compute one given values of the others. When some of these events are reported with measurement errors, the equivalence relationship becomes statistical (i.e., it captures randomness due to errors). We then define a distribution function for the individual events in which only valid combinations of the event values have a nonzero probability of occurrence.
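As a minimal illustration, the Python sketch below softens the Sandy Bridge stall identity above into a log-likelihood: exact agreement scores highest, and deviations are penalized under an assumed Gaussian tolerance (the value of sigma is a modeling choice for illustration, not from the paper).

```python
# Sketch of turning an algebraic invariant into a probability. The
# identity (using the Sandy Bridge events named above)
#   (1 - Mem_L3_Hit_Frac) * STALLS_L2_PENDING / CLKS
#     = ORO_DRD_BW_Cycles / CLKS
#       + (ORO_DRD_Any_Cycles - ORO_DRD_BW_Cycles) / CLKS
# holds exactly for true values; for noisy measurements we score how far
# a candidate assignment deviates from it.
import math

def invariant_log_likelihood(ev: dict[str, float], sigma: float = 0.02) -> float:
    lhs = (1.0 - ev["Mem_L3_Hit_Frac"]) * ev["STALLS_L2_PENDING"] / ev["CLKS"]
    rhs = ev["ORO_DRD_Any_Cycles"] / ev["CLKS"]  # BW + latency terms telescope
    resid = lhs - rhs
    return -0.5 * (resid / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

measured = {"Mem_L3_Hit_Frac": 0.6, "STALLS_L2_PENDING": 2.0e8,
            "ORO_DRD_Any_Cycles": 0.8e8, "ORO_DRD_BW_Cycles": 0.5e8,
            "CLKS": 1.0e9}
print(invariant_log_likelihood(measured))  # higher = more consistent
```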
4.1 Scheduling Performance Counters onto HPCs

Problem. Given statistical dependencies between events, we need to ensure that the configurations created for two consecutive time slices (i.e., scheduler quanta) have at least one overlapping event, in order to establish either a first-order or a transitive statistical relationship between consecutive time slices. For example, if we have four events $e_1$ to $e_4$ that are related by $f(e_1, e_2)$ and $g(e_2, e_3, e_4)$, we must ensure that samples of $e_2$ occur repeatedly across multiple time slices. The problem is to transform an original schedule of configurations $C_1 \to C_2 \to \dots \to C_n$ (given by a profiling application), where $C_i$ executes in time slice $i$, into another schedule of $C'_i$s such that the transitive statistical relationships hold and the validity criteria hold on each $C'_i$. When it is not possible to ensure the validity criteria on every $C'_i$, we break the chain of repeated events and start over again from a valid configuration.
Solution. The first step of the scheduling process is to aggregate all the statistical dependencies available for the processor in question into a graphical structure. The graph is produced by expanding the scheduled chain $C_1 \to \dots \to C_n$ using the statistical relationships between the events in the chain. In the ML/statistics community, such a graph is commonly referred to as a probabilistic graphical model, and more specifically identified as a factor graph (FG) [26]. Remember from above that the statistical dependencies between the events are specified as joint probability functions $\Pr(S_i)$, where $S_i \subseteq E$. (We use the shorthand $\Pr_i = \Pr(S_i)$.) Using those functions, we generate a bipartite FG $G = (E \cup \{\Pr_1, \dots, \Pr_n\},\ \{(e, \Pr_i) \mid e \in S_i\ \forall i\})$. The FG represents the joint distribution of all the events in the schedule, composed together from every individual joint distribution.

Now, given the FG and two consecutive configurations from a schedule, $C_t$ and $C_{t+1}$ (with events $E_t, E_{t+1} \subseteq E$, respectively), our scheduling problem reduces to (i) finding whether $E_t$ and $E_{t+1}$ share an event such that the transitive statistical dependency is met; and (ii) if they do not share such a dependency, producing the shortest sequence of $C'_*$ such that $C_t \to C'^{(1)} \to \dots \to C_{t+1}$. The solution to the first of the two problems is straightforward: we compute the Markov blanket [26] of the sets $E_t$ and $E_{t+1}$ under the factor graph. The Markov blanket $B_{x_i}$ of a variable $x_i$ in the factor graph defines a subset of $x_{\neg i}$ such that $x_i$ is conditionally independent of $x_{\neg i}$ given $B_{x_i}$. If the Markov blankets of $E_t$ and $E_{t+1}$ overlap (i.e., $B_{E_t} \cap B_{E_{t+1}} \ne \emptyset$), then we are guaranteed that there exists at least one event that shares transitive dependencies between the time slices. The second problem is a little more involved. It can be solved by finding the shortest path (assuming unit cost for each edge traversed) from each $e \in E_t$ to each $e' \in E_{t+1}$ in the FG. That can be accomplished using Dijkstra's algorithm, checking the validity of the path at every step. (Both graph computations are sketched in code after the list below.) In addition to the graph traversal, one must also apply the following optimizations to prune unnecessary $C'_*$s.

(1) Removing Common Steps:
If an intermediate step $C'_i$ exists such that the Markov blankets $B_{e_1}, B_{e_2}, \dots, B_{e_n}$ of events $e_1, e_2, \dots, e_n$ overlap, the next transition state of the schedule can be condensed. That is, if there exists an $e^* \in B_{e_1} \cap \dots \cap B_{e_n}$, then the composition of statistical relationships can happen through $e^*$ instead of through the larger set of events, i.e., $C'_{i+1} \mapsto (C'_{i+1} \setminus \{e_1, \dots, e_n\}) \cup \{e^*\}$.
(2) Removing Redundant Steps:
If there exist two steps $C'_i$ and $C'_{i+1}$ such that there is no change in the Markov blanket (i.e., $B_{E_i} = B_{E_{i+1}}$), then we can skip the transition to $C'_{i+1}$ and instead transition directly to $C'_{i+2}$. That situation can occur because the Markov blankets of the individual traversals $e \rightsquigarrow e'$ change at every step, while the union of all such blankets might not. If it does not change, we have enough statistical information to skip the $(i+1)$th step and go directly to the $(i+2)$th.
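The two graph computations referenced above can be sketched compactly in Python; with unit edge costs, Dijkstra's algorithm reduces to breadth-first search. The factor-graph encoding is simplified to a factor-to-events mapping, and validity checking of intermediate configurations is elided.

```python
# Sketch of the two scheduling checks on the factor graph. The FG is
# bipartite: event nodes and factor nodes (joint distributions). With
# unit edge costs, Dijkstra reduces to BFS.
from collections import deque

def markov_blanket(fg: dict[str, set[str]], events: set[str]) -> set[str]:
    """Events co-occurring with `events` in some factor.
    fg maps factor name -> set of event names."""
    blanket = set()
    for scope in fg.values():
        if scope & events:
            blanket |= scope
    return blanket - events

def shortest_chain(fg: dict[str, set[str]], src: str, dst: str) -> list[str]:
    """BFS over the bipartite FG from one event to another."""
    adj: dict[str, set[str]] = {}
    for factor, scope in fg.items():
        adj.setdefault(factor, set()).update(scope)
        for e in scope:
            adj.setdefault(e, set()).add(factor)
    prev, frontier = {src: None}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:                      # reconstruct the path
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                frontier.append(nxt)
    return []

fg = {"Pr1": {"e1", "e2"}, "Pr2": {"e2", "e3", "e4"}}  # f(e1,e2), g(e2,e3,e4)
print(markov_blanket(fg, {"e1"}))     # {'e2'}: overlap exists via e2
print(shortest_chain(fg, "e1", "e4")) # e1 -> Pr1 -> e2 -> Pr2 -> e4
```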
Checking Validity of the Configuration. A key challenge in determining a valid transformation of a schedule is identifying the configurations that do not satisfy the microarchitectural constraints placed on HPCs. We check the validity of a new schedule using Linux's perf_event subsystem. It allows us to iterate over all HPCs in a configuration until it reaches an event that it fails to schedule, thereafter notifying the user of the validity failure. To maximize the use of available counters, the perf iteration strategy starts with the most constrained events and proceeds to the least constrained events in a configuration. Linux's native scheduling for a group of events happens independently per PMU and per logical core. As some PMUs are shared between threads of the same core or package, their availability may change depending on what events are being measured on the other cores.

Algorithm 1: General EP algorithm.
Input: target distribution $f(\theta) = \prod_k f_k(\theta)$
Output: global approximation $g(\theta) = \prod_k g_k(\theta)$
  Choose initial $g_k(\theta)$ for $k \in \{0, \dots, K-1\}$
  until all $g_k$ converge do
    $g_{-k}(\theta) \propto g(\theta) / g_k(\theta)$ ⊲ Cavity distribution
    $g_{\setminus k}(\theta) \propto \Pr(y_k \mid \theta)\, g_{-k}(\theta)$ ⊲ MCMC
    $g_{\mathrm{new}}(\theta) \propto g_{\setminus k}(\theta)$ ⊲ Local update
    $\Delta g_k(\theta) \approx g_{\mathrm{new}}(\theta) / g(\theta)$
    $g(\theta) \leftarrow g(\theta)\, \Delta g_k(\theta)$ ⊲ Global update
  end for
  return $\{g_k(\theta) \mid k \in [0, K)\}$
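To make Alg. 1 concrete, here is a compact, self-contained Python sketch of EP for a one-dimensional parameter with Gaussian site approximations. The MCMC step (the fourth line of Alg. 1) is stood in for by self-normalized importance sampling to moment-match the tilted distribution; this is a simplification of the accelerator's samplers, and all data are synthetic.

```python
# Compact numpy sketch of Algorithm 1 for a 1-D parameter theta with
# Gaussian sites g_k. Each partition k holds one block of HPC samples.
import numpy as np

rng = np.random.default_rng(0)
y_blocks = [rng.normal(5.0, 1.0, size=20) for _ in range(4)]  # K = 4 parts
noise_sd = 1.0

# Site approximations in natural parameters (precision, precision*mean).
K = len(y_blocks)
site_prec = np.zeros(K)
site_pm = np.zeros(K)
prior_prec, prior_pm = 1e-2, 0.0                              # broad prior

for _ in range(10):                                           # "until converged"
    for k in range(K):
        g_prec = prior_prec + site_prec.sum()                 # global g(theta)
        g_pm = prior_pm + site_pm.sum()
        cav_prec = g_prec - site_prec[k]                      # cavity g_{-k}
        cav_pm = g_pm - site_pm[k]
        cav_mu, cav_sd = cav_pm / cav_prec, cav_prec ** -0.5
        # Moment-match the tilted distribution Pr(y_k|theta) * g_{-k}(theta)
        # by importance sampling (stand-in for the MCMC line of Alg. 1).
        theta = rng.normal(cav_mu, cav_sd, size=4000)
        logw = np.sum([-0.5 * ((y - theta) / noise_sd) ** 2
                       for y in y_blocks[k]], axis=0)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        mu = np.sum(w * theta)
        var = np.sum(w * (theta - mu) ** 2)
        # Local update: new site = tilted / cavity (natural parameters).
        site_prec[k] = max(1.0 / var - cav_prec, 1e-8)
        site_pm[k] = mu / var - cav_pm

g_prec = prior_prec + site_prec.sum()
print("posterior mean ~", (prior_pm + site_pm.sum()) / g_prec)  # ~5.0
```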
4.2 Modeling Measurement Errors

The first step in computing the full posterior distribution is to model errors in the capture of samples from HPCs. (Recall that we listed the sources of such errors in §2.) For a single event $e$ programmed into an HPC $c$, if the error in measurement $e_c$ can be modeled, then the measured/sampled value $m_c$ can be modeled as the true value $v_c$ plus measurement noise $e_c$, i.e., $m_c = v_c + e_c$. Here, we focus only on random errors, by assuming zero systematic error. That is a valid assumption because the only sources of systematic errors would be hardware or software bugs. We assume that the error can be modeled as $e_c \sim \mathcal{N}(0, \sigma^2)$ for some unknown variance $\sigma^2$; hence $\Pr(m_c \mid v_c) = \mathcal{N}(m_c;\, v_c, \sigma^2)$ [43]. Now, given $N$ samples of the HPC, we compute their sample mean $\mu$ and sample variance $S^2$. A scaled and shifted Student's t-distribution describes the marginal distribution of the unknown mean of a Gaussian when the dependence on the variance has been marginalized out [15], i.e., $v_c \sim \mu + \frac{S}{\sqrt{N}}\,\mathrm{Student}(\nu = N - 1)$. In all our experiments, the confidence level of the t-distribution was set to 95%. Now, since the measurement-error model for an HPC is stochastic, when samples from these models are used in the algebraic relationships described above, those relationships too become stochastic in nature. The FG becomes one unified graphical representation of all of these statistical relationships, i.e., between the errored samples and the true values of events, as well as among the different events that measure related aspects of the CPU's microarchitecture.
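The per-counter posterior described above has a closed form, sketched below with SciPy; the sample values are synthetic and only illustrate the Student's t construction.

```python
# Sketch of the per-counter noise model: given N noisy samples of an HPC,
# the marginal posterior of the true value v_c (with the Gaussian noise
# variance marginalized out) is a scaled, shifted Student's t with N-1
# degrees of freedom.
import numpy as np
from scipy import stats

samples = np.array([9.8e6, 1.02e7, 9.6e6, 1.05e7, 9.9e6])  # N noisy reads
N, mu, S = len(samples), samples.mean(), samples.std(ddof=1)

posterior = stats.t(df=N - 1, loc=mu, scale=S / np.sqrt(N))
lo, hi = posterior.interval(0.95)        # the paper's 95% confidence level
print(f"v_c estimate {mu:.3g}, 95% interval [{lo:.3g}, {hi:.3g}]")
```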
4.3 Computing the Posterior Distribution

Once we have computed a schedule that ensures that events with statistical dependencies between them are measured in consecutive time slices, the next goal is to utilize the measurements to produce a posterior distribution for an event. Recall Fig. 2. In each scheduling time slice, we have measurements/samples from the current slice and the preceding slice. However, because of the transitive statistical dependencies, we would like to jointly compute inference for the FG (i.e., compute the posterior probability of some event in the FG given the sampled data) for some $k$ time slices into the past.

Our approach to performing this computation with low-latency guarantees utilizes the idea that one can break the larger problem into $k$ smaller parts, perform inference on each of the $k$ parts, and then put the results together to get an approximate posterior inference, i.e., similar to map-reduce. There are two difficulties with such algorithms as they are usually constructed. First, each of the $k$ pieces has only partial information; as a result, for any of the pieces, a lot of computation is wasted in places that are contradicted by the other $k - 1$ pieces. Second, the inferences from the $k$ pieces must be carefully combined to ensure that the prior (which is embedded into the FG model) is not counted multiple times. We use the Expectation Propagation (EP) algorithm [16, 31, 34] (Alg. 1) to overcome those difficulties and perform the inference. The EP algorithm naturally lends itself to distributed inference on partitioned datasets [16]; hence, we can perform inference on partitions of the data, i.e., on each scheduled configuration of the HPCs. In contrast, other techniques for Bayesian inference would require us to explicitly change the inference algorithm depending on the schedule of HPCs and the structure of the FG. Such changes might not be feasible for all possible schedules or all CPU architectures. The EP algorithm works by computing an effective region of overlap over our $k$ pieces, i.e., for each piece, we use an approximate prior computed over the other $k - 1$ pieces. EP approximates the target distribution $f(\cdot)$ (in our case, the FG) with a density $g(\cdot)$ that admits the same factorization, and uses a Gaussian mean-field approximation [26].
Training. Training is not explicitly required for the proposed BayesPerf model. The advantage of using Bayesian models like FGs is that training on such models can be reduced to inference on the models' parameters. At runtime, for each time slice, we compute (infer) a full posterior distribution over the variables (i.e., $E$) and parameters (i.e., $\Theta$) of the FG, and then use maximum likelihood estimation to pick the set of parameters (i.e., $\hat{\Theta}^{(MLE)}$) that can explain a data trace generated by the system.

5 The BayesPerf Accelerator

In this section, we describe the software and hardware components in which BayesPerf is deployed. Further, we describe the architecture and implementation of the BayesPerf accelerator, which targets the execution of Alg. 1. Fig. 4 shows the architecture of the BayesPerf system, which works as follows.
Setup. BayesPerf is used by one or more "monitoring processes/threads" (labeled "Monitoring Application" in Fig. 4) to monitor hardware threads of a "Target Process." The BayesPerf user API is identical to that of the Linux perf subsystem, and hence any userspace program that uses the standard Linux interface can transparently use BayesPerf. Using this API, the monitoring process registers events of interest (see Fig. 4) with the userspace component of the BayesPerf system, labeled "BayesPerf Shim." The shim is a userspace driver [25] that replicates the API of the Linux perf subsystem.
Linux perf. The shim registers HPCs on behalf of the user process with the Linux kernel (see Fig. 4). The kernel then manages the scheduling of performance counters onto the CPU (using the scheduling algorithm described in §4.1). When the target process raises performance monitoring interrupts (PMIs), the Linux perf subsystem is responsible for reading the corresponding HPC and writing out the sampled value into a "ring buffer": a segment of memory that is mapped into the address space of both the shim and the perf subsystem. The ring buffer is a FIFO in which new samples are enqueued by the kernel and read by the userspace process. The ring buffer automatically provides a mechanism for managing backpressure between the shim and the kernel, as new samples are dropped if the ring buffer is full. (A toy model of this behavior follows below.)

Figure 4: High-level architecture of the BayesPerf system (userspace, kernel, CPU, and BayesPerf FPGA; the monitoring application interacts with the BayesPerf shim via perf_event_open, config_hpc, and poll_bayes_hpc, and the FPGA accesses the kernel-user ring buffers via CAPI-based direct memory access).
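As promised above, here is a toy Python model of the ring buffer's drop-on-full backpressure behavior; capacities and payloads are illustrative only, not the kernel's actual data structures.

```python
# Toy model of the kernel<->shim ring buffer's backpressure behavior:
# the producer (kernel PMI handler) drops new samples when the consumer
# (the BayesPerf shim) lags. Sizes and types are illustrative.
from collections import deque

class SampleRing:
    def __init__(self, capacity: int):
        self.buf = deque()
        self.capacity = capacity
        self.dropped = 0

    def push(self, sample) -> bool:          # kernel side
        if len(self.buf) >= self.capacity:
            self.dropped += 1                # full: drop, never block the PMI
            return False
        self.buf.append(sample)
        return True

    def poll(self):                          # shim side
        return self.buf.popleft() if self.buf else None

ring = SampleRing(capacity=4)
for i in range(6):
    ring.push(("cycles", i))
print(len(ring.buf), "buffered,", ring.dropped, "dropped")  # 4 buffered, 2 dropped
```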
Interfacing with the Accelerator. As we discuss in §6.1, we have prototyped the BayesPerf system on two different architectures: an Intel x86_64 and an IBM Power9 processor. The protocol for communication between the software and the BayesPerf accelerator (described later) differs between the two architectures. On the Power9 system, we leverage CAPI 2.0 [39], a protocol that extends the processor's cache-coherence protocol to PCIe-attached devices. In that case, as the accelerator can directly access host memory, it can consume samples enqueued onto the ring buffer by the kernel. It does so by snooping on cache-invalidation messages for the cache lines corresponding to the ring buffer. Similarly, outputs of the accelerator are written directly back to the shim's virtual memory space. For Intel systems, the accelerator uses the base PCIe protocol and IOMMU-mediated PCIe DMA to read HPC samples and write the computed posterior distributions. Here, the shim must actively poll for writes from the kernel to the ring buffer and, once a write has been made, initiate transfer of the samples to the FPGA. Similarly, the shim polls for interrupts from the accelerator that signify completion of computation, and initiates DMA transfers for the results. This added software interaction adds some latency overhead to the entire computation.
Finally, the monitoring application reads(polls) the results of the posterior computation in BayesPerf (la-belled as ) from ring buffers in the BayesPerf shim. These readsare always reads to the host memory of the CPU and do not needto initiate DMA requests with the accelerator. This design is ableto mask almost all the latency that is added because of the addedcomputation in BayesPerf (see Fig. 3). Multi-Threaded Applications.
Multi-Threaded Applications. OS-level monitoring contexts, such as processes or threads, are dealt with at the level of the BayesPerf shim. Hence, when an OS context switch occurs, the memory references of the perf ring buffers are changed by setting configuration registers on the accelerator using MMIO. When the new references are written, the accelerator begins pulling data from a different buffer in memory. As a result, the accelerator can be shared across threads that are concurrently executing on the host CPU.
The Accelerator. Fig. 5 illustrates the architecture of the BayesPerf accelerator. The accelerator exploits parallelism in the structure of Alg. 1 in two ways. First, we execute posterior inference on each of the $k$ time slices in parallel (recall §4.3). These parallel execution engines are labeled "EP 1" through "EP k" and "Controller" in Fig. 5. The EPs execute lines 3–6 of Alg. 1 in parallel and communicate their results to the global controller, which synchronously updates $g(\theta)$ and dispatches the new value to the idle EP. The values of the measurements from the HPCs (i.e., the inputs), as well as the latest value of $g(\theta)$, are stored in the on-board DRAM. Our target FPGA board (which we describe in §6.1) supports 4 channels of 16 GB LPDDR4 memory each. The input data and the current values of $g(\theta)$ (which together comprise ~100 MB of data) are replicated across those modules to allow concurrent reads from different EPs to proceed simultaneously.

The second level of parallelism exploited by the model is in the computation of MCMC inference within each of the EPs. Those computations are represented by the "MCMC Sampler" blocks in Fig. 5. They execute line 4 of Alg. 1 in parallel, by using MCMC to estimate $\Pr(y_k \mid \theta)$ (i.e., the likelihood that the data $y_k$ are drawn from the local approximator $g_k$). Here we leverage our prior work, AcMC² [3], a high-level synthesis compiler for MCMC applications, to generate IPs that can draw samples from the target distributions of the HPC measurements. The HPC statistical relationships (i.e., the FG) are fed into the compiler as a probabilistic program, i.e., a program in a domain-specific language that can represent statistical dependencies between program variables. AcMC² then automatically generates efficient uniform random-number generators, and automatically synthesizes the other statistical constraints in the FG. Instead of using the AcMC²-generated controllers for the MCMC samplers, we use the EPs to directly control the pipelines of the MCMC samplers; that is, (i) to set and update configuration parameters like seed values, and (ii) to update the state of a sampler with one that passes the rejection-sampling test criteria for each random walk. Allotment of the samplers to EPs, and all subsequent communication between the EPs and samplers, happens over a network-on-chip (NoC) generated with CONNECT [35]. This approach enables us to use samples from previous iterations as starting points for Markov-chain random walks. This optimization is possible only because we are using MCMC inside an EP algorithm, instead of by itself [20]. The NoC uses a butterfly topology to allow communication between EPs and samplers, as well as between the samplers themselves (as is required by AcMC²). All our experiments use a 16-port NoC, with 4 of those ports connected to the EPs and the remaining 12 to the MCMC samplers. This is the maximal configuration for which we were able to meet timing requirements on the FPGA for a 250 MHz clock.

Figure 5: Architecture of the BayesPerf accelerator (PCIe endpoint and XDMA/PSL to the host; an AXI crossbar connects the controller, DRAM, and NoC-attached EP engines with AcMC²-generated MCMC sampler IPs).

Table 1: Area & power for components of the BayesPerf FPGA for the x86_64 and ppc64 configurations.

Component  | Utilization (%)            | Power (W)
           | BRAM  DSP  FF  LUT  URAM   | Vivado  Measured
x86-PCIe   |   62   78  52   81    58   |   11.2      17.2
ppc64-CAPI |   71   66  49   79    58   |   10.5      16.1

6 Evaluation

This section discusses our experimental evaluation of the BayesPerf system and is organized as follows. First, in §6.1, we describe the experimental setup and explore the performance, power, and area requirements of the BayesPerf accelerator when programmed onto an FPGA. Then, in §6.2, we evaluate the capabilities of the BayesPerf system in correcting measurement errors in HPCs. Finally, in §6.3, we demonstrate the integration of BayesPerf with ML-based resource-management systems to improve their outcomes.
6.1 Experimental Setup

We evaluate BayesPerf on two system configurations: (i) an IBM AC922 dual-socket Power9 system (which we refer to as the "ppc64" configuration), and (ii) a dual-socket Intel Xeon E5-2695 system (which we refer to as the "x86" configuration). Both systems are populated with two NVIDIA K80 GPUs, a single FDR InfiniBand NIC, and a directly attached FPGA board (which we describe below). Both systems ran Ubuntu 18.04 with kernel version v4.15.0.
Accelerator: FPGA. The FPGA accelerator was based on the architecture in §5. All experiments were performed on an Alpha-Data ADM-PCIE-9V3 FPGA board (with a Xilinx Virtex UltraScale+ VU3P-2 FPGA) clocked at 250 MHz. For the Power9 system, the FPGA board was configured to use the CAPI 2.0 interface [39]. For the x86 configuration, the FPGA board was configured to use PCIe 3 x16 along with the Xilinx XRT drivers. The power and FPGA-utilization metrics for the two configurations of the BayesPerf accelerator are listed in Table 1. In comparison to the 100 W TDP of the Intel processor and the 190 W TDP of the Power9 processor, the FPGA performs 5.8× and 11.8× better, respectively, in terms of power consumption. The BayesPerf-ppc64 FPGA read latency is shown in Fig. 3. We observe that a single HPC read using the CPU implementation of BayesPerf has approximately 9× longer latency than native polling of the HPC. However, when the accelerator is used, BayesPerf adds less than 2% overhead in read latency compared to the native solution. Compared to the BayesPerf-ppc64 implementation that uses CAPI, BayesPerf-x86 has on average 15.8% larger latency. We attribute that slowdown to the requirement that the userspace driver actively initiate DMA transfers to the FPGA accelerator, whereas the CAPI configuration snoops for cache-invalidation messages.

6.2 Correcting Measurement Errors in HPCs

To demonstrate the efficacy of BayesPerf in correcting HPC measurement errors, we employed the 29 workloads from the HiBench suite [22], which span microbenchmarks, machine learning, SQL, web search, graph analytics, and streaming applications. They represent real-world application workloads used in a cloud environment. We used the two machines in our experiment to simulate a cluster. Each of the machines hosted 32 workers, and the Spark master was deployed on the x86 node. We measured 10 derived events for each of the microarchitectures, where each derived event corresponded to a group of HPCs to be measured and aggregated using a mathematical relationship. We do not detail the events here for lack of space; the metadata corresponding to the events can be found in the Linux kernel source tree [41] for both the x86 and ppc64 configurations. In both cases, we measured all HPCs corresponding to the first 10 metrics.
Baselines. We use three baselines for comparison. First, we use Linux's inbuilt correction mechanism, which uses the enabled time and total time (recall §4) to correct for measurement errors. This is the most realistic baseline for users who use the default configuration available in Linux. Second, we use a variance-reduction technique called CounterMiner [29] (CM), a state-of-the-art HPC correction technique used in profiling analysis. Note that CM was originally meant to be used for offline analysis. As we show in the remainder of this section, this requirement manifests as low average correction accuracy, with large variance, when CM is used for online corrections. Third, we use the online technique by Weaver et al. [43] (referred to as "WM+Pin") for correcting instruction counts in x86 processors. WM+Pin corrects only the number of instructions executed and was originally meant to correct core performance metrics like IPC or CPI. Further, it requires intercepting instructions through Pin [28] to collect opcodes for every dynamic instruction. This causes performance degradation of up to 198.2× across our benchmarks.
Error Correction. Fig. 6 shows the significant improvement in measurement values compared to the baseline. The average error across all benchmarks dropped from 39.25% and 40.1% for the "Linux (x86)" and "Linux (ppc64)" configurations, respectively, to 8.06% (i.e., 4.87× = 39.25/8.06) and 7.6% (i.e., 5.28× = 40.1/7.6). Similarly, when "BayesPerf (x86)" and "BayesPerf (ppc64)" are compared to "CM (x86)" and "CM (ppc64)," the average error dropped by 3.63× and 3.73×, respectively. That corresponds to a nearly 40% improvement in the quality of the results for the ppc64 configuration. The normalized improvement in average error for each of the benchmark applications when using BayesPerf, compared to the two baselines, is shown in Fig. 7. Recall from §3 that error in a measurement is computed as the similarity between two time-series sequences of performance-counter samples [5]. In the case of the BayesPerf counters, we used a maximum-likelihood estimator to provide the most likely value of the performance counter at a point in time. We normalize the similarity scores using the average similarity score between two runs of the application in which the HPCs were measured with polling. That way, we correct for any OS-based nondeterminism in the result, just as in §2, where the magnitude of the error is a comparison between "polling" mode and "sampling" mode under Linux and CM (see Fig. 6).

Figure 6: Error in performance counter measurements across the HiBench benchmarks (Linux, CM, and BayesPerf on x86 and ppc64).

Figure 7: Normalized improvement in performance counter error measurements across the HiBench benchmarks (BayesPerf vs. Linux and BayesPerf vs. CM on x86 and ppc64).
Scaling. Fig. 8 shows the scaling behavior of the BayesPerf method with increasing numbers of counters for the "KMeans" workload in the HiBench suite. We observe that BayesPerf consistently reduced error, by as much as 34% relative to Linux, as the number of counters scaled up from 10 to 35. Further, WM+Pin performs worse than CM, as it corrects only instruction counts; this justifies our choice of CM as the main baseline for the evaluation. Interestingly, we find that floating-point initialization, a major source of errors in [43], does not result in overcounts here, indicating that the issue has been resolved in modern CPUs.
Latency Overhead. Since BayesPerf performs significantly more computation than either the Linux or the CM configurations, it would be expected to have significantly higher latency. Recall from Fig. 3 that the difference in latency between BayesPerf (when implemented in software) and the Linux correction is nearly 9×. The BayesPerf accelerator is designed to mitigate the effects of this increased latency. Again, from Fig. 3, we see that it successfully does so, reducing the 9× difference to 2%. This is on par with native HPC reads using rdpmc as well as kernel-assisted HPC reads.
6.3 Integrating BayesPerf with ML-Based Resource Management

The core value of the BayesPerf approach, in terms of its error-correction capability, was demonstrated in the previous section. Here we demonstrate the downstream value of BayesPerf to applications that use HPCs as inputs to control system resources. Examples of such applications include online performance hotspot identification (e.g., [14]), userspace or runtime-level scheduling (e.g., [2, 4, 10, 17, 48]), and power and energy management (e.g., [13, 36, 37, 40]), as well as attack detectors and system integrity monitors [8]. Such applications often use as many as 45 HPCs [2, 17, 46].
The Problem. We now look at a situation in which BayesPerf measurements can be integrated into higher-level decision-making frameworks to make resource-management decisions. In this part of the experiment, we used HPC measurements to augment
an Apache Spark Executor [47] that needed to run a distributed shuffle operation (which is part of the HiBench TeraSort benchmark [22]). Fig. 9 illustrates the rich dynamic information that can be extracted from HPC measurements, and how it can be used by higher-level controllers. Consider the case of a PCIe interconnect that is populated with NIC and GPU devices. Here, the Spark executor uses two GPUs to perform a halo exchange (for training a deep neural network). Fig. 9 shows the performance (in this case, bandwidth) of the exchange as the "isolated" performance. If, at the same time, the application were to perform a distributed shuffle (across nodes in a cluster) using the NIC, we would observe that the original GPU-to-GPU communication is affected by PCIe bandwidth contention at shared links. That phenomenon is shown as the "contention" performance in Fig. 9, and it can cause as much as a 0–1.8× slowdown, depending on the size of the PCIe transactions. Online bandwidth and transaction-size monitoring (which is enabled by HPCs) can be used by a higher-level software framework to optimally schedule such transfers so that the performance impact of shared-resource contention is minimized. While the example is simple, it illustrates how errors in measurements can affect the ML algorithm, and hence the overall system performance.

We use two ML-based scheduling algorithms broadly based on those presented in [10] and in our prior work [2]. The first uses collaborative filtering as its core ML algorithm, and the second uses deep reinforcement learning. The goal of our ML-based scheduler was to decide which of the two NICs to use for the shuffle operation, given that the GPUs were communicating with each other and contending for PCIe bandwidth. We simulated the GPU communication by using TensorFlow to train YoloNet on the ImageNet dataset.

Figure 8: Scaling of errors with the number of events sampled (10–35 events; Linux, CM, BayesPerf, and WM+Pin on x86 and ppc64).

Figure 9: Topology of the test system in §6.3 (CPUs, PCIe switches, GPUs, NICs, the training GPU, and the BayesPerf FPGA), as well as the effect of resource contention (bandwidth in GB/s vs. message size in bytes for isolated and contended flows).
The Models. The goal of this case study was to show the sensitivity of ML models to errors in their inputs (especially those coming from HPCs). The inputs to the models included: (a) sampled HPC measurements corresponding to the numbers of allocating, full, partial, and non-snoop writes; (b) sampled HPC measurements corresponding to demand code reads and partial/MMIO reads; (c) DRAM channel bandwidth utilization; (d) memory-bus bandwidth utilization; and (e) the size of the data to be shuffled (in or out) and the NUMA node on which the data would be resident. Note that all of the above are derived events, whose computation required us to capture 32 unique HPC events. Of those, 12 were collected for each physical core (i.e., 432 HPCs = 12 events × 18 cores × 2 sockets), and 20 were off-core events collected per socket (i.e., 40 HPCs = 20 events × 2 sockets).

The first model used collaborative filtering to impute values of application performance (in this case, throughput) from the inputs above, as well as data from training workloads of the SparkBench suite in HiBench; it is based on the technique presented in [10]. The second model used a straightforward neural network: a 4-layer, fully connected, ReLU-activated neural network with 36 neurons in layer 1, 16 neurons in each of layers 2 and 3, and 2 neurons in the last layer. The two neurons in the last layer chose between the two NICs that were being decided between as part of this task. The model was trained with actor-critic reinforcement learning based on the approach described in [2]; a sketch of this network appears below. The loss function used for training the model minimized the total time taken to complete the shuffle. The model was trained on the HiBench benchmark suite without the TeraSort benchmark, and then evaluated using the TeraSort benchmark. When BayesPerf was used, the MLE estimate from the posterior distribution of the HPC was passed into the network. The GPU marked "Training GPU" in Fig. 9 was used to perform the collaborative filtering and reinforcement learning, as well as runtime inference on the system. It did not contend for the same PCIe resources as the GPUs on which the workloads were being scheduled.

Figure 10: Decrease in training time due to BayesPerf (normalized loss vs. iteration count for BayesPerf (Acc), BayesPerf (CPU), Linux, and CM).
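As a sketch, the policy network described above can be written down directly in PyTorch. The layer widths (36, 16, 16, 2) come from the text; the input feature width is not stated, so it is assumed here for illustration, and the actor-critic training loop from [2] is omitted.

```python
# Sketch of the second model's policy network as described in the text:
# 4 fully connected ReLU layers with 36, 16, 16, and 2 neurons; the two
# outputs score the two candidate NICs. Training is omitted.
import torch
import torch.nn as nn

class NicPolicy(nn.Module):
    def __init__(self, n_inputs: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 36), nn.ReLU(),
            nn.Linear(36, 16), nn.ReLU(),
            nn.Linear(16, 16), nn.ReLU(),
            nn.Linear(16, 2),             # one logit per candidate NIC
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# The feature-vector width (inputs (a)-(e) above) is not stated in the
# text; 36 is an assumption for illustration.
policy = NicPolicy(n_inputs=36)
features = torch.randn(1, 36)            # e.g., MLE estimates from BayesPerf
nic_choice = policy(features).argmax(dim=1)
print(nic_choice.item())
```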
Implementation Details: Training.
Recall from §4 that the BayesPerf model itself does not require training. However, the two models described above do. The model from [2] learns by reinforcement and hence does not have separate training and testing phases; the net epochs of data used to train it are shown in Fig. 10. For the model in [10], which has distinct training and test datasets, we calibrate against bias by using threefold cross-validation (i.e., across the applications in Fig. 6), as sketched below.
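A minimal sketch of that calibration step, assuming a scikit-learn-style estimator as a hypothetical stand-in for the model of [10]; folds are grouped by application so that no application contributes to both the training and the validation split.

```python
# Threefold cross-validation with application-level folds (a sketch).
import numpy as np
from sklearn.model_selection import GroupKFold

def cross_validate(make_model, X, y, app_labels):
    """X, y: numpy arrays; app_labels: one application id per sample."""
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=app_labels):
        model = make_model()                 # fresh estimator per fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```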
Implementation Details: Hyperparameters.
The hyperparameters used in the models are taken directly from [2] and [10]. These parameters include the learning rate, LSTM unroll length, and epoch lengths, among others. In addition, we follow the procedure set out in [10] to determine the optimal value of sparsity, sweeping over the range between 30% and 80% (a sketch of the sweep appears below). All results in this paper use the optimal value found from our sweep, 75% sparsity.
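The sweep itself reduces to a one-dimensional grid search. A sketch, assuming a hypothetical evaluate_model helper that trains and validates the model of [10] at a given sparsity (the step size is also an assumption):

```python
# Grid search over sparsity in [30%, 80%] (a sketch; step size assumed).
import numpy as np

def sweep_sparsity(evaluate_model, lo=0.30, hi=0.80, step=0.05):
    candidates = np.arange(lo, hi + 1e-9, step)       # inclusive of hi
    scores = {s: evaluate_model(sparsity=s) for s in candidates}
    return max(scores, key=scores.get)                # our sweep chose 0.75
```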
Results.
We compare the results of using the above models with and without BayesPerf, using two metrics.
Results: Training Time.
The collaborative filtering model does not have an explicit training phase. For the deep learning model, Fig. 10 illustrates the difference in training time when error-corrected measurements are used. In the figure, the loss is normalized by the time taken to run the same shuffle operation on a completely isolated system. We observed a nearly 37% reduction in the number of iterations before convergence. Each training iteration in Fig. 10 takes 63 s; the overall saving of 37% therefore corresponds to ~52 hr (a back-of-the-envelope check appears below). The reason for the reduction is apparent: a 40% error in the inputs of the neural network slows down the optimization process. Moreover, we observe that the time to convergence is affected by (a) the magnitude of the error reduction, as seen in the difference between the CM (12.5% decrease) and BayesPerf (37% decrease) configurations relative to the Linux baseline; and (b) the timeliness of the error reduction, as seen in the difference between the CPU and accelerated versions of BayesPerf (28.5% decrease).
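As a sanity check on the ~52 hr figure, the implied iteration counts can be recovered from the quoted 63 s/iteration and 37% reduction; the baseline iteration total below is our inference, not a number reported in the text.

```python
# Back-of-the-envelope check of the quoted training-time saving.
seconds_saved = 52 * 3600            # ~52 hours
iters_saved = seconds_saved / 63     # ~2,970 iterations avoided at 63 s each
baseline_iters = iters_saved / 0.37  # ~8,000 iterations without BayesPerf
print(f"~{iters_saved:,.0f} of ~{baseline_iters:,.0f} iterations saved")
```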
Results: Decision Quality.
We observe that use of the ML-based scheduler (i.e., one that makes Spark PCIe-aware) leads to roughly 15% and 22% improvements in average shuffle completion time for the two models, respectively. The addition of BayesPerf to the models results in a further roughly 8% and 19% reduction in average shuffle latency, respectively.
Error Correction in HPCs.
Measurement errors due to sampling in HPCs have been observed and reported on for the past decade [1, 12, 29, 32, 43, 44, 48]. Methods for correcting sampled HPC values can be broadly grouped into two approaches. The first group artificially imputes data in the collected samples by interpolating between two sampled events using linear or piecewise-linear interpolation (e.g., [41]); a sketch of this class of corrections appears at the end of this subsection. The advantage of such interpolation methods is that they can run in real time; however, they might not provide good imputations [48]. The second group corrects measurements by dropping outlier values instead of adding new interpolated ones. Such methods are at the other extreme: they cannot run in real time, as they need the entire trace of an application before providing corrections. For example, Lv et al. [29] use the Gumbel test for outlier detection, and Neill et al. [33] use fork-join-aware agglomerative clustering to remove outlier points. These methods are not suitable for dynamic control situations that need online HPC correction. Further, the core statistical technique used by these variance-reduction approaches assumes that the underlying distribution of the data remains unchanged; however, most workloads exhibit distinct stages in which workload behavior, and thus the underlying distribution of the HPCs, changes.

In contrast to those techniques, BayesPerf corrects measurements by using statistical relationships between events. For well-documented processors, such relationships can be known ahead of time, and the entire correction algorithm can be executed without any need to pre-collect data. The BayesPerf system (with its accelerator) allows nearly native-latency access to the corrected HPCs, thereby enabling their use in dynamic control processes.
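To make the contrast concrete, the snippet below sketches the two standard ingredients of the first (real-time) class of corrections: perf-style multiplex scaling, which extrapolates a raw count by the fraction of time the event was actually scheduled on a register (as the Linux perf tool does [27]), and linear interpolation between two samples. It is a minimal illustration of the general technique, not the code of any particular cited system.

```python
# Perf-style extrapolation of a multiplexed count, plus linear imputation.
def scale_multiplexed(raw_count: int, time_enabled: float, time_running: float) -> float:
    """Extrapolate an undercounted event by its scheduling duty cycle."""
    return raw_count * time_enabled / time_running if time_running else 0.0

def interpolate(t0: float, c0: float, t1: float, c1: float, t: float) -> float:
    """Linearly impute the counter value at time t, with t0 <= t <= t1 (cf. [41, 48])."""
    return c0 + (c1 - c0) * (t - t0) / (t1 - t0)
```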
Using HPCs in Control.
Several recent papers have explored the use of HPCs to address higher-level resource-management problems. Examples include online performance hotspot identification (e.g., [14]), userspace or runtime-level scheduling (e.g., [2, 4, 6, 10, 17, 38, 48]), power and energy management (e.g., [13, 36, 37, 40]), and attack detectors and system integrity monitors [8]. Most of the methods mentioned above do not explicitly use any techniques to correct for errors in HPC measurements. Further, while it is not impossible that some of the ML techniques inherently correct for HPC errors, there are no guarantees that they do so.
It is crucial to have reliable instrumentation and measurement in commercial CPUs, as exemplified by the inclusion of the PEBS (precise event-based sampling) and LBR (last branch record) technologies in modern Intel processors. However, as we showed in this paper, such technology alone falls short of correcting errors in HPC values that accrue because of nondeterminism and sampling artifacts. This paper presented the design and evaluation of BayesPerf, an ML model and associated accelerator that corrects noisy HPC measurements, reducing the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. BayesPerf is a first step toward a general-purpose HPC-error-correction system for real x86 and ppc64 systems today, and potentially for future processors. We believe it will not only form the basis for large-scale measurement and characterization studies that use HPC data (i.e., offline analysis), but also enable a slew of applications that use HPC data to make control decisions in a computer system (i.e., online analysis).
Acknowledgments
We thank the ASPLOS reviewers and our shepherd, Alexandre Passos, for their valuable comments that improved the paper. We appreciate S. Lumetta, W-M. Hwu, J. Xiong, and J. Applequist for their insightful discussions and comments on early drafts of this manuscript. This work is partially supported by the National Science Foundation (NSF) under grant Nos. CNS 13-37732, CNS 16-24790, and CCF 20-29049; by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration that is part of the IBM AI Horizon Network; and by IBM, Intel, and Xilinx through equipment donations. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF, IBM, Intel, or Xilinx. Saurabh Jha is supported by a 2020 IBM PhD Fellowship.
References
[1] Glenn Ammons, Thomas Ball, and James R. Larus. 1997. Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling. SIGPLAN Not. 32, 5 (May 1997), 85–96. https://doi.org/10.1145/258916.258924
[2] Subho Banerjee, Saurabh Jha, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2020. Inductive-Bias-Driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119), Hal Daumé III and Aarti Singh (Eds.). PMLR, Virtual, 629–641. http://proceedings.mlr.press/v119/banerjee20a.html
[3] Subho S. Banerjee, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. 2019. AcMC²: Accelerating Markov Chain Monte Carlo Algorithms for Probabilistic Models. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Providence, RI, USA) (ASPLOS '19). ACM, New York, NY, USA, 515–528. https://doi.org/10.1145/3297858.3304019
[4] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. 2009. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (Big Sky, Montana, USA) (SOSP '09). ACM, New York, NY, USA, 29–44. https://doi.org/10.1145/1629575.1629579
[5] Donald J. Berndt and James Clifford. 1994. Using Dynamic Time Warping to Find Patterns in Time Series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (Seattle, WA) (AAAIWS'94). AAAI Press, 359–370.
[6] Jingde Chen, Subho S. Banerjee, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. 2020. Machine Learning for Load Balancing in the Linux Kernel. In Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems (Tsukuba, Japan) (APSys '20). Association for Computing Machinery, New York, NY, USA, 67–74. https://doi.org/10.1145/3409963.3410492
[7] Intel Corp. 2016. Intel® 64 and IA-32 Architectures Software Developer Manuals. https://software.intel.com/en-us/articles/intel-sdm. Accessed 2019-03-05.
[8] S. Das, J. Werner, M. Antonakakis, M. Polychronakis, and F. Monrose. 2019. SoK: The Challenges, Pitfalls, and Perils of Using Hardware Performance Counters for Security. In 2019 IEEE Symposium on Security and Privacy (SP). 20–38. https://doi.org/10.1109/SP.2019.00021
[9] Pritam Dash, Mehdi Karimibiuki, and Karthik Pattabiraman. 2019. Out of Control: Stealthy Attacks against Robotic Vehicles Protected by Control-Based Techniques. In Proceedings of the 35th Annual Computer Security Applications Conference (San Juan, Puerto Rico) (ACSAC '19). Association for Computing Machinery, New York, NY, USA, 660–672. https://doi.org/10.1145/3359789.3359847
[10] Christina Delimitrou and Christos Kozyrakis. 2013. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (Houston, Texas, USA) (ASPLOS '13). ACM, New York, NY, USA, 77–88. https://doi.org/10.1145/2451116.2451125
[11] Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A. Saurous. 2017. TensorFlow Distributions. arXiv preprint arXiv:1711.10604 (2017).
[12] M. Dimakopoulou, S. Eranian, N. Koziris, and N. Bambos. 2016. Reliable and Efficient Performance Monitoring in Linux. In SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 396–408. https://doi.org/10.1109/SC.2016.33
[13] Yi Ding, Nikita Mishra, and Henry Hoffmann. 2019. Generative and Multi-Phase Learning for Computer Systems Optimization. In Proceedings of the 46th International Symposium on Computer Architecture (Phoenix, Arizona) (ISCA '19). ACM, New York, NY, USA, 39–52. https://doi.org/10.1145/3307650.3326633
[14] Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Providence, RI, USA) (ASPLOS '19). ACM, New York, NY, USA, 19–33. https://doi.org/10.1145/3297858.3304004
[15] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. 1995. Bayesian Data Analysis. Chapman & Hall, New York.
[16] Andrew Gelman, Aki Vehtari, Pasi Jylänki, Tuomas Sivula, Dustin Tran, Swupnil Sahai, Paul Blomstedt, John P. Cunningham, David Schiminovich, and Christian Robert. 2017. Expectation Propagation as a Way of Life: A Framework for Bayesian Inference on Partitioned Data. arXiv preprint arXiv:1412.4869 (2017).
[17] Jana Giceva, Gustavo Alonso, Timothy Roscoe, and Tim Harris. 2014. Deployment of Query Plans on Multicores. Proc. VLDB Endow. 8, 3 (Nov. 2014), 233–244. https://doi.org/10.14778/2735508.2735513
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[19] Brian Hall, Peter Bergner, Alon Shalev Housfater, Madhusudanan Kandasamy, Tulio Magno, Alex Mericas, Steve Munroe, Mauricio Oliveira, Bill Schmidt, Will Schmidt, et al. 2017. Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8. IBM Redbooks.
[20] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic Variational Inference. The Journal of Machine Learning Research 14, 1 (2013), 1303–1347.
[21] Intel. 2014. Intel 64 and IA-32 Architectures Optimization Reference Manual. Intel Corporation (Sept. 2014).
[22] Intel. 2016. SparkBench: The Big Data Micro Benchmark Suite for Spark 2.0. https://github.com/intel-hadoop/HiBench/. Accessed 19-November-2019.
[23] Intel. 2019. Top-Down Microarchitecture Analysis Method. https://software.intel.com/en-us/vtune-cookbook-top-down-microarchitecture-analysis-method. [Online; accessed 19-November-2019].
[24] Saurabh Jha, Shengkun Cui, Subho S. Banerjee, Timothy Tsai, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. 2020. ML-Driven Malware for Targeting AV Safety. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[26] Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
[27] Linux Community. 2019. perf: Linux Profiling with Performance Counters. https://perf.wiki.kernel.org/index.php/Main_Page. [Online; accessed 19-November-2019].
[28] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. ACM SIGPLAN Notices 40, 6 (2005), 190–200.
[29] Y. Lv, B. Sun, Q. Luo, J. Wang, Z. Yu, and X. Qian. 2018. CounterMiner: Mining Big Performance Data from Hardware Counters. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 613–626. https://doi.org/10.1109/MICRO.2018.00056
[30] J. M. May. 2001. MPX: Software for Multiplexing Hardware Performance Counters in Multithreaded Programs. In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS 2001). 8 pp. https://doi.org/10.1109/IPDPS.2001.924955
[31] Thomas P. Minka. 2001. Expectation Propagation for Approximate Bayesian Inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (Seattle, Washington) (UAI'01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 362–369.
[32] T. Mytkowicz, P. F. Sweeney, M. Hauswirth, and A. Diwan. 2007. Time Interpolation: So Many Metrics, So Few Registers. In 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). 286–300. https://doi.org/10.1109/MICRO.2007.27
[33] Richard Neill, Andi Drebes, and Antoniu Pop. 2017. Fuse: Accurate Multiplexing of Hardware Performance Counters Across Executions. ACM Trans. Archit. Code Optim. 14, 4, Article 43 (Dec. 2017), 26 pages. https://doi.org/10.1145/3148054
[34] Manfred Opper and Ole Winther. 2000. Gaussian Processes for Classification: Mean-Field Algorithms. Neural Comput. 12, 11 (Nov. 2000), 2655–2684. https://doi.org/10.1162/089976600300014881
[35] Michael K. Papamichael and James C. Hoe. 2012. CONNECT: Re-Examining Conventional Wisdom for Designing NoCs in the Context of FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, California, USA) (FPGA '12). ACM, New York, NY, USA, 37–46. https://doi.org/10.1145/2145694.2145703
[36] Raghavendra Pradyumna Pothukuchi, Joseph L. Greathouse, Karthik Rao, Christopher Erb, Leonardo Piga, Petros G. Voulgaris, and Josep Torrellas. 2019. Tangram: Integrated Control of Heterogeneous Computers. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). ACM, New York, NY, USA, 384–398. https://doi.org/10.1145/3352460.3358285
[37] R. P. Pothukuchi, S. Y. Pothukuchi, P. Voulgaris, and J. Torrellas. 2018. Yukta: Multilayer Resource Controllers to Maximize Efficiency. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 505–518. https://doi.org/10.1109/ISCA.2018.00049
[38] Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. 2020. FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association.
[39] IBM Journal of Research and Development 59, 1 (Jan. 2015), 7:1–7:7. https://doi.org/10.1147/JRD.2014.2380198
[40] Stephen J. Tarsa, Rangeen Basu Roy Chowdhury, Julien Sebot, Gautham Chinya, Jayesh Gaur, Karthik Sankaranarayanan, Chit-Kwan Lin, Robert Chappell, Ronak Singhal, and Hong Wang. 2019. Post-Silicon CPU Adaptation Made Practical Using Machine Learning. In Proceedings of the 46th International Symposium on Computer Architecture (Phoenix, Arizona) (ISCA '19). ACM, New York, NY, USA, 14–26. https://doi.org/10.1145/3307650.3322267
[41] Linus Torvalds. 2020. Linux Perf Subsystem Userspace Tools. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/pmu-events/arch. Accessed 2020-03-05.
[42] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. 2016. Edward: A Library for Probabilistic Modeling, Inference, and Criticism. arXiv preprint arXiv:1610.09787 (2016).
[43] Vincent M. Weaver and Sally A. McKee. 2008. Can Hardware Performance Counters Be Trusted? In 2008 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 141–150.
[44] V. M. Weaver, D. Terpstra, and S. Moore. 2013. Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 215–224. https://doi.org/10.1109/ISPASS.2013.6557172
[45] A. Yasin. 2014. A Top-Down Method for Performance Analysis and Counters Architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 35–44. https://doi.org/10.1109/ISPASS.2014.6844459
[46] Wucherl Yoo, Kevin Larson, Lee Baugh, Sangkyum Kim, and Roy H. Campbell. 2012. ADP: Automated Diagnosis of Performance Pathologies Using Hardware Events. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (London, England, UK) (SIGMETRICS '12). Association for Computing Machinery, New York, NY, USA, 283–294. https://doi.org/10.1145/2254756.2254791
[47] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 59, 11 (Oct. 2016), 56–65. https://doi.org/10.1145/2934664
[48] Gerd Zellweger, Denny Lin, and Timothy Roscoe. 2016. So Many Performance Events, So Little Time. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (Hong Kong, Hong Kong) (APSys '16). ACM, New York, NY, USA.