Using HEP experiment workflows for the benchmarking and accounting of WLCG computing resources
Andrea Valassi, Manfred Alef, Jean-Michel Barbet, Olga Datskova, Riccardo De Maria, Miguel Fontes Medeiros, Domenico Giordano, Costin Grigoras, Christopher Hollowell, Martina Javurkova, Viktor Khristenko, David Lange, Michele Michelotto, Lorenzo Rinaldi, Andrea Sciabà, Cas Van Der Laan
CERN, Geneva, Switzerland; KIT, Karlsruhe, Germany; CNRS-SUBATECH, Nantes, France; Brookhaven National Laboratory, USA; University of Massachusetts Amherst, USA; University of Iowa, USA; Princeton University, USA; INFN, Padova, Italy; Università di Bologna, Italy
Abstract.
Benchmarking of CPU resources in WLCG has been based on the HEP-SPEC06 (HS06) suite for over a decade. It has recently become clear that HS06, which is based on real applications from non-HEP domains, no longer describes typical HEP workloads. The aim of the HEP-Benchmarks project is to develop a new benchmark suite for WLCG compute resources, based on real applications from the LHC experiments. By construction, these new benchmarks are thus guaranteed to have a score highly correlated to the throughputs of HEP applications, and a CPU usage pattern similar to theirs. Linux containers and the CernVM-FS filesystem are the two main technologies enabling this approach, which had been considered impossible in the past. In this paper, we review the motivation, implementation and outlook of the new benchmark suite.
Introduction

The Worldwide LHC Computing Grid (WLCG) is a large distributed computing infrastructure serving scientific research in High Energy Physics (HEP). WLCG was set up to address the scientific computing needs of the four Large Hadron Collider (LHC) experiments, and it integrates storage and compute resources at almost 200 sites in over 40 countries [1]. While the experiment requirements are managed centrally through a well defined process [2], which matches them against the overall amounts pledged by the contributing funding agencies, the procurement and operation of hardware resources are largely delegated to the individual sites, resulting in a very diverse computing landscape.

The compute power provided by WLCG, in particular, comes from a variety of CPUs distributed worldwide, where the specific hardware deployed at one site can be quite different from that at another site, both in terms of cost and of computing performance. A common unit of measurement is therefore needed to quantify the experiment needs and the resources provided by the sites in a given year, and to allow review boards to compare these to the amounts that were actually used [3, 4]. A good evaluation metric of a compute resource, in this context, is one that is highly correlated to its application throughput, i.e. to the amount of useful "work" (e.g. the number of events processed by a HEP application) that the compute resource can do per unit time: this is the typical use case of a CPU benchmark [5]. Since 2009, in particular, HEP-SPEC06 (HS06) has been the standard CPU benchmark for all of LHC computing.

∗ e-mail: [email protected]
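To make the scale of this accounting concrete, the 2017 figures quoted below (more than 5M HS06 pledged in total, and a typical score of around 10 HS06 per CPU core) can be turned into an estimate of the number of cores involved; this is just the paper's own back-of-the-envelope arithmetic, written out:

```python
# Approximate WLCG figures for 2017, as quoted in the text.
total_pledged_hs06 = 5_000_000   # total integrated power of WLCG sites, in HS06
hs06_per_core = 10               # typical HS06 score of one worker-node CPU core

cores = total_pledged_hs06 / hs06_per_core
print(f"approximately {cores:,.0f} CPU cores")  # -> approximately 500,000 CPU cores
```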
The total integrated power of WLCG sites in 2017 [6], for instance, was more than 5M HS06: taking into account that the typical worker nodes deployed in WLCG have an HS06 score of around 10 per CPU core [7], this means that LHC computing in 2017 was supported by approximately 500k CPU cores.

While their main motivation is the overall accounting of resources, both on a yearly basis and in the planning of long term projects, HS06 and other CPU benchmarks have many other applications in LHC computing. Individual computing sites use HS06 for their procurement, to buy the CPU resources providing the amount of HS06 pledged to the HEP experiments for the lowest financial cost, also taking into account electrical power efficiency measured in HS06 per Watt. The experiments also use CPU benchmarks for scheduling and managing their jobs on distributed compute resources, to predict the processing time required to complete a given application workload and optimize its placement or smooth termination on batch queues [8] and preemptible cloud resources. CPU benchmarks may also be useful in software optimizations, to compare an application's performance to the theoretical compute power of the machine where it is run, or to the reference performance of another application.

HS06 has served the needs of WLCG for over 10 years, while many things have changed. On modern hardware, users have reported scaling deviations of up to 20% [8] from the performance predicted by HS06. It is now clear that HS06 should be replaced by a new CPU benchmark. In the following, the motivations of the choice to develop a new "HEP-Benchmarks" suite and its implementation are described. After reviewing in Sec. 2 the evolution of CPU benchmarks in HEP up until HS06, Sec. 3 describes the limitations of HS06 and the reasons why the new suite is based on the containerization of LHC experiment software workloads. Section 4 summarises the design, implementation choices and status of HEP-Benchmarks, while Section 5 reports on its outlook, a few months after the CHEP2019 conference.

Computing architectures and software paradigms, in HEP and outside of it, have significantly evolved over time and will keep on changing. This implies that CPU benchmarks, and more specifically those used for HEP computing, also need to evolve to keep up with these changes.
CERN Units
The 1992 paper by Eric McIntosh [9] is an essential read to understand the beginnings of CPU benchmarking in HEP and its later evolution. In the 1980's, the default benchmark was the "CERN Unit", whose score was derived from a small set of typical FORTRAN66 programs used in HEP at the time for event simulation and reconstruction. This was used, for instance, to grant CPU quotas to all users of CERN central systems. In the early 1990's, the definition of a CERN Unit was updated, as the task of running the old benchmark on newer machines turned out to be impossible: the set of programs was thus reviewed, to make the benchmark more portable and more representative of the then current HEP workloads and of FORTRAN77. It was already clear, however, that HEP benchmarks should further evolve, for instance to take into account a more widespread use of FORTRAN90, of double-precision arithmetic on 32-bit architectures, and possibly of vectorisation and of parallel processing.

Broadly speaking, CPU benchmarks can be grouped into three categories [10]: kernel, synthetic and application. Kernel benchmarks are based on libraries and small code fragments that often account for the majority of CPU time in user applications: an example is the LINPACK benchmark [11], based on the LINPACK matrix algebra package. Synthetic benchmarks are custom-built to include a mix of low-level instructions, e.g. floating-point or integer operations, resembling that found in user applications. Two examples are the Whetstone [12] and Dhrystone [13] benchmarks. Application benchmarks are based on actual user applications. From a user's perspective, benchmarking a machine based on the user's own application is clearly the best option, although in practice this is often impractical.

The CERN Unit had been designed to be based, as much as possible, on real HEP applications, rather than on kernel or synthetic benchmarks.
McIntosh made this clear in his 1992 paper, where, however, he also commented that new approaches might be needed in the future, as at the time he considered it virtually impossible to capture a modern event processing program involving over 100k lines of code, several external libraries and one or more databases or data sets. For reference, the CERN Unit was shipped as a tarball that required less than 50 MB of disk space in total, for unzipping, building and executing all included programs, using the compiler and operating system found on the machine to be benchmarked.
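To illustrate the distinction between these categories, the sketch below is a toy synthetic benchmark in the spirit of Whetstone or Dhrystone: it times a fixed, artificial mix of floating-point and integer operations and reports a rate. It is purely illustrative and is not part of any real benchmark suite:

```python
import time

def toy_synthetic_benchmark(n=1_000_000):
    """Time a fixed artificial mix of floating-point and integer work."""
    start = time.perf_counter()
    acc_float, acc_int = 0.0, 0
    for i in range(1, n + 1):
        acc_float += (i * 0.5) / (i + 1.0)  # floating-point instruction mix
        acc_int += (i * 3) % 7              # integer instruction mix
    elapsed = time.perf_counter() - start
    return n / elapsed  # the "score": iterations per second

print(f"toy score: {toy_synthetic_benchmark():.0f} iterations/s")
```

A kernel benchmark would instead time a real library routine (e.g. a LINPACK matrix solve), and an application benchmark would time a complete real program.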
SPEC CPU benchmarks and SI2K (SPEC CINT2000)
In his review of existing benchmarks outside HEP, one option that McIntosh mentioned as being perhaps the most useful for HEP was the SPEC benchmark suite. After the CERN Unit, indeed, all of the default CPU benchmarks used in HEP have been based on the SPEC benchmark suite, up until today. SPEC (Standard Performance Evaluation Corporation) [14], founded in 1988, is a nonprofit corporation formed to establish, maintain and endorse standardized benchmarks of computing systems. SPEC distributes two different categories of CPU benchmark suites, focusing on integer and floating-point operations. In the HEP world, several versions of the SPEC CPU integer benchmark suite have been used since the early 1990's, starting with SPEC CPU92 [15]. In particular, CINT2000 (the integer component of SPEC CPU2000), known informally as SI2K, was the CPU benchmark used in 2005 by the four LHC collaborations for their Computing Technical Design Reports [16–19].

Since about 2005, however, many presentations at HEPiX conferences pointed out a growing discrepancy between the performances of HEP applications and those predicted from the SI2K scores of the systems where these applications were run. In 2006, a HEPiX Benchmarking Working Group (BWG) was set up specifically with the task of identifying the appropriate successor of SI2K. In 2009, the BWG suggested [15] adopting a new HEP-specific benchmark, HEP-SPEC06 (HS06), based on the then latest SPEC suite, CPU2006 [20].

HEP-SPEC06: a HEP-specific version of the SPEC CPU2006 C++ benchmark suite

HS06 is based on a subset of SPEC CPU2006 including the seven benchmarks written in C++ [21], three from the integer suite and four from the floating-point suite. In line with the general approach [10] followed in SPEC CPU suites, these seven programs are not kernel or synthetic benchmarks, but represent instead real applications, mostly from scientific domains, although not from the HEP domain.
HS06 differs from the SPEC CPU2006 C++ suite in that it includes a few HEP-specific tunings: for instance, the programs must be built using gcc in 32-bit mode also on 64-bit architectures, and with other well defined compiler options [7], and they must also be executed in a specific configuration on the machine to be benchmarked, as if the available processor cores were all configured as independent single-core batch slots to run several single-process applications in parallel.

HS06 was identified as a valid successor of SI2K by the HEPiX BWG for essentially two reasons [15]. First, the HS06 score was found to be highly correlated to throughput on a large number of diverse machines in a test "lxbench" cluster, for each of many typical HEP applications. The test machines were typical WLCG worker nodes, all based on x86 architectures, but including single-core and multi-core CPUs with different speeds and from different vendors, and with a diverse range of cache and RAM sizes. The test applications covered four main HEP use cases [22], generation (GEN), simulation (SIM), digitization (DIGI) and reconstruction (RECO), including programs contributed by all four LHC experiments. The second reason for choosing HS06 was that its CPU usage pattern, as measured from CPU hardware counters using perfmon [23–25], was found to be quite similar to that observed on the CERN batch system used by the LHC experiments (in particular, the fraction of floating point operations was around 10% in both cases). The memory footprint of the SPEC CPU2006 tests in HS06, around 1 GB, was also comparable to that of typical HEP applications, requiring up to 2 GB (while the memory footprint of the older SI2K benchmark was only 200 MB).

In summary, in 2009 HS06 was chosen because, while it is based on a set of C++ applications from domains other than HEP, it had been found to be sufficiently representative of HEP's own typical applications, both in terms of throughput and of CPU usage patterns.
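The first of these two validation criteria can be expressed in a few lines: given a benchmark score and a measured application throughput for each machine in a test cluster, one checks that the Pearson correlation between the two series is close to 1. The per-machine numbers below are invented placeholders, not actual lxbench data:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-machine data: benchmark score vs. application events/second.
hs06_scores = [80, 120, 160, 200, 240]
app_throughputs = [4.1, 6.0, 7.9, 10.2, 12.1]

print(f"correlation: {pearson(hs06_scores, app_throughputs):.3f}")
```

A benchmark whose correlation to the application throughput is close to 1 across a diverse set of machines can be used as a proxy for that application's performance.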
The problem today is that, for a few years now, it has become clear [26–28] that this is no longer the case. To start with, the throughputs of HEP applications, mainly of ALICE and LHCb [8], have been reported to deviate by up to 20% on some systems from those predicted by HS06. In addition, important differences are now observed between the CPU usage patterns of HS06 and HEP applications, as measured from performance counters using Trident [28] (a tool based on libpfm [23] from perfmon): in particular, with respect to the HS06 benchmarks, HEP workloads have a lower instructions-per-cycle (IPC) ratio and may differ by 20% or more in the percentages of execution slots spent in the four categories suggested by Top-Down analysis [29] (retiring, i.e. successful, front-end bound, back-end bound and bad speculation).

More generally, HS06 benchmarks are no longer representative of WLCG software and computing today: memory footprints have increased to 2 GB or more [30] per core; 64-bit builds have replaced 32-bit builds; multi-threaded, multi-process and vectorized software solutions are becoming more common; and the hardware landscape is also more and more heterogeneous, with the emergence of non-x86 architectures such as ARM, Power9 and GPUs, especially at HPC centers.

In addition, SPEC CPU2006, on which HS06 is based, was retired in 2018, after the release of a newer SPEC CPU2017 benchmark suite. An extensive analysis [26, 27] of this new suite by the HEPiX BWG, however, pointed out that SPEC CPU2017 is affected by the same problems as HS06. In particular, SPEC CPU2017 scores were found to have a high correlation to HS06 scores, and hence still an unsatisfactory correlation to HEP workloads; also, the CPU usage patterns of SPEC CPU2017, as measured by Trident, were found to be similar to those of HS06, and quite different from those of HEP workloads.

The HEP-Benchmarks suite: using containerized HEP workloads as CPU benchmarks
The solution to the issues described above is, in theory, quite simple. Rather than testing real application benchmarks from domains other than HEP (like those in SPEC CPU2006 and CPU2017), or kernel or synthetic benchmarks, and looking for the benchmarks whose score has the highest correlation to the throughputs of typical HEP workloads, and whose CPU usage patterns look most similar to those of HEP workloads, the "obvious" approach to follow is to build a benchmark suite including precisely those typical HEP workloads.
By construction, in fact, a benchmark based on a HEP application is guaranteed to give a score and a CPU usage pattern that are the most representative of that application. This is precisely the approach followed in the new HEP-Benchmarks [31] suite, which we are building within the HEPiX BWG to make it the successor of HS06, as described in the next section. The central package of this project, hep-workloads, is a collection of workloads from the four LHC experiments, covering all of the GEN, SIM, DIGI and RECO use cases.

In retrospect, this is the same approach on which the CERN Unit was based. As discussed in the previous section, the CERN Unit was eventually discontinued because in the early 1990's it seemed no longer possible to capture a complex HEP application with all of its software and data dependencies. A problem which seemed impossible to solve 30 years ago, however, can be much more easily addressed using the technologies available today. The reason why it is now possible to encapsulate HEP workloads in the hep-workloads package, in particular, is the availability of two enabling technologies: first and foremost, Linux containers [32], which allow the packaging and distribution of HEP applications with all of their dependencies, including the full O/S; and, in addition, the cvmfs shrinkwrap utility [33, 34], which makes it possible to selectively capture which specific software and data files are needed to execute a HEP workload in a portable and reproducible way, out of the much larger LHC experiment software installations on the cvmfs (CernVM-FS) filesystem [35].
The development of what has now become the HEP-Benchmarks project started in mid-2017 as a proof-of-concept study by one member of the HEPiX BWG, using the ATLAS kit-validation (KV) [36] and a CMS workload as first examples. This work took off on a larger scale towards the end of 2018, when many more collaborators from the BWG and the four LHC experiments joined the effort. The project is maintained on the gitlab infrastructure at CERN [31], which is also used for Continuous Integration (CI) builds and tests, for issue tracking and for documentation. HEP-Benchmarks includes three main components, which are mapped to separate gitlab repositories and are described in the following subsections.

The hep-workloads package: HEP reference workloads
The hep-workloads package is the core of the HEP-Benchmarks suite. It contains all the code and infrastructure, both common and workload-specific, to build a standalone container for each of the HEP software workloads it includes. Individual images are built, tested and versioned in the package's gitlab CI and are then distributed via its container registry [37]. Images are built as Docker containers [32], but they can also be executed via Singularity [38]. A single command is enough to download a specific benchmark and execute it using the embedded pre-compiled libraries and binaries. A benchmark summary file in json format and more detailed logs are stored in a results directory. For instance, to download the latest version of the LHCb GEN/SIM image, execute it using either Docker or Singularity, and store results in the host /tmp directory, it is enough to run one of the two following commands:

  docker run -v /tmp:/results \
    gitlab-registry.cern.ch/hep-benchmarks/hep-workloads/lhcb-gen-sim-bmk:latest

  singularity run -B /tmp:/results \
    gitlab-registry.cern.ch/hep-benchmarks/hep-workloads/lhcb-gen-sim-bmk:latest

The main result in the json file is the benchmark score for the given workload, measured as an absolute throughput in number of events processed per wall-clock time. This is essentially derived from the total wall-clock time to complete the processing of a predefined number of events. The json file also contains all relevant metadata about how the benchmark was run, as well as more detailed results, including memory and CPU time usage.

Some of the HEP workloads, like ALICE and LHCb GEN/SIM, are single-process (SP) and single-threaded (ST); others use parallelism to save memory on multi-core CPUs, via multi-threading (MT) techniques like CMS RECO [39], or via multi-process (MP) techniques involving forking and copy-on-write, like ATLAS RECO [40].
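A run can thus be scripted end to end: execute the container with one of the commands above, then read the score back from the json summary written to the results directory. The helper below sketches that second step; the field names used here ("events", "wall_time_s") and the summary layout are illustrative assumptions, not the suite's actual schema:

```python
import json

def read_score(summary_path):
    """Return the throughput score (events per second) from a json summary.

    NOTE: "events" and "wall_time_s" are hypothetical field names used for
    illustration; consult the suite's documentation for the real schema.
    """
    with open(summary_path) as f:
        summary = json.load(f)
    # Score = predefined number of events / total wall-clock time to process them.
    return summary["events"] / summary["wall_time_s"]
```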
For MT/MP applications, the number of threads/processes is fixed to that used by the experiment in production, and every image is executed in such a way as to fill all available logical cores on the machine that is benchmarked, by launching an appropriate number of identical copies of each application. When more than one copy is executed, the benchmark score is the sum of their throughputs. The number of logical cores, derived from the nproc command, is equal to the number of physical cores for machines configured with hyper-threading disabled, but it is higher if this is enabled. On a machine with 16 physical cores and 2x hyper-threading, for instance, by default 32 copies of the ST/SP LHCb GEN/SIM and 8 copies of the 4xMT CMS RECO benchmarks are executed. All of these parameters are, in any case, configurable.

The design of the hep-workloads package relies on the fact that all four LHC experiments install and distribute their pre-compiled software libraries using the cvmfs file system. To add a new workload, experiment experts just need to prepare an orchestrator script, which sets the runtime environment, runs one or more copies of the application, and finally parses the output logs to determine the event throughput, which is used as the benchmark score. A common benchmark driver harmonises the control flow and error checking in all workloads, making it easier to debug them. The build procedure in the gitlab CI includes the following four steps.

1. First, an interim Docker container is built, where /cvmfs is, as usual, a directory managed by the network-aware cvmfs service [35], which is able to retrieve missing files via http. In addition, the cvmfs shrinkwrap [33, 34] tracing mechanism is enabled.

2. One copy of the workload application from that interim image is then executed: this generates a shrinkwrap trace file specifying which files were accessed from /cvmfs.

3. The final standalone Docker container is built, where /cvmfs is a local folder, including all files identified by tracing during the previous step. This includes all relevant experiment libraries and executables, pre-compiled for the O/S chosen for this image.

4. This final container is tested, by running the workload using both Docker and Singularity. If tests succeed, the image is pushed to the gitlab registry [37].

A key element of this approach is reproducibility: repeated runs of each workload are meant to always process the same events and produce the same results. This is essential for resource benchmarking, to ensure that timing measurements on two different nodes correspond to the same computational load. It is also important during the build process, where tests of the final container may fail if they need different /cvmfs files from those identified while tracing the interim container. Strict reproducibility can be guaranteed for ST (including MP) workloads, but not for the CMS MT workloads, where the sharing of processing across different software threads may lead to small differences in execution paths; however, this is not considered an issue for benchmarking, and no errors have been observed during the build process either.

Workload images vary in size between 500 MB (ATLAS GEN) and 4 GB (CMS DIGI). GEN containers are generally the smallest because event generation is CPU intensive with almost no input data, while DIGI and RECO images are much larger as event digitisation and reconstruction are more I/O intensive and large reference data files must be shipped within the workload containers. Docker images are internally made up of layers (and this structure is maintained when they are converted to Singularity).
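The copy-placement rule described above (fill all logical cores with identical copies; the score is the sum of the copies' throughputs) can be sketched as follows, using the 16-core, 2x hyper-threading example from the text:

```python
import os

def plan_copies(n_logical_cores, threads_per_copy):
    """Number of identical copies needed to fill all logical cores."""
    return n_logical_cores // threads_per_copy

def benchmark_score(per_copy_throughputs):
    """With more than one copy, the score is the sum of their throughputs."""
    return sum(per_copy_throughputs)

n_logical = os.cpu_count()  # what `nproc` reports, counting hyper-threads if enabled

# Example from the text: 16 physical cores with 2x hyper-threading -> 32 logical cores.
assert plan_copies(32, 1) == 32  # ST/SP LHCb GEN/SIM: 32 copies
assert plan_copies(32, 4) == 8   # 4xMT CMS RECO: 8 copies
```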
Taking into account that bug fixing and feature improvements have often led to a rapid development cycle, the hep-workloads CI has been optimized to stack these layers in the order which makes them as cacheable as possible. The bottom layers contain what changes least often, like the O/S and data files, while the higher layers include experiment software and common and workload-specific scripts.

The hep-score package: a new CPU benchmark for HEP
The aim of the hep-score package is to combine the benchmark scores derived from the individual HEP workloads into a single number, a "HEPscore". The package is highly configurable, allowing the definition of a combined HEPscore from any combination of specific versions of individual workloads, with specific MT/MP settings. The numbers of events to process can also be tuned, to choose the appropriate compromise between benchmark precision and execution time. The prototypes that are currently being developed, for instance, derive a combined score from a geometric mean of the throughputs of ATLAS (GEN, SIM and DIGI/RECO), CMS (GEN/SIM, DIGI and RECO) and LHCb (GEN/SIM) benchmarks, and take between 6 and 16 hours to complete. This is similar to what was done for HS06, whose combined score was derived from the geometric mean of the 7 individual SPEC CPU2006 C++ benchmarks. A normalization factor can also be added for each individual benchmark, to redefine its relative score on a machine as a ratio, i.e. as the throughput on that machine divided by the throughput on a reference machine. Unlike absolute throughputs, which can a priori take any value and have the dimensions of events processed per second, these relative scores are quite practical because they are dimensionless numbers. In particular, the relative scores of individual benchmarks, and a fortiori the combined relative score defined as the geometric mean of a subset of these scores, are all equal to 1 on the reference machine.

Having said that, it should be stressed that it is impossible to characterize a computer system's performance by a single metric. This concept was very well expressed by Kaivalya Dixit, long-time president of the SPEC corporation, who even warned about "the danger and the folly" [10] of relying on either a single performance number or a single benchmark. There are, however, many use cases where a single number is needed, and the accounting of WLCG resources is presently one of them.
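The combination just described can be sketched numerically: normalize each workload's absolute throughput (events per second) to the same workload's throughput on a reference machine, then take the geometric mean of the resulting dimensionless ratios. The workload names and numbers below are illustrative, not an actual HEPscore configuration:

```python
from math import prod

def relative_scores(throughputs, reference):
    """Normalize absolute throughputs (events/s) to a reference machine."""
    return {wl: throughputs[wl] / reference[wl] for wl in throughputs}

def combined_score(rel):
    """Geometric mean of the relative scores (dimensionless)."""
    return prod(rel.values()) ** (1.0 / len(rel))

ref = {"atlas-sim": 1.0, "cms-reco": 4.0, "lhcb-gen-sim": 2.5}  # reference machine
new = {"atlas-sim": 1.2, "cms-reco": 5.0, "lhcb-gen-sim": 3.0}  # machine under test

print(f"combined relative score: {combined_score(relative_scores(new, ref)):.3f}")

# On the reference machine itself, every ratio is 1 and the combined score is 1.
assert combined_score(relative_scores(ref, ref)) == 1.0
```

The geometric mean (rather than the arithmetic mean) is the conventional choice for combining such ratios, and was also used for the HS06 combined score.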
This is not a technical issue: it is a policy issue, and its discussion is beyond the scope of this paper. On a technical level, our design of the hep-score package takes this into account by allowing the definition of a highly configurable combined score, but also by the fact that the detailed scores of individual workloads are also stored in the report of any HEPscore execution. This is very important because it provides a mechanism to analyse a posteriori the performance of individual HEP workloads.

The hep-benchmark-suite package: a toolkit for benchmark execution and result collection
The hep-benchmark-suite package, finally, is a toolkit to coordinate the execution of several benchmarks, including not only HEPscore, but also HS06, SPEC CPU2017, KV and others. Results are collected in a global json document that can then be uploaded to a database. This package is used for what are currently the main activities of our team, to demonstrate the readiness of HEP-Benchmarks as a replacement of HS06: first, for testing that individual HEP workloads are reliable and give stable results (typically, within 5% or better) on repeated runs on the same system; second, for the study of their correlations to one another, and to HS06 and other benchmarks. To this end, a wide range of x86 worker nodes has been collected, similar to the lxbench cluster that had been used in the initial comparisons of HS06 and SI2K.
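The first of these stability checks can be expressed as a coefficient of variation over the scores of repeated runs on the same system; the run scores below are invented for illustration:

```python
from statistics import mean, stdev

def coefficient_of_variation(scores):
    """Relative spread (stdev/mean) of repeated benchmark runs, as a fraction."""
    return stdev(scores) / mean(scores)

runs = [10.4, 10.2, 10.5, 10.3, 10.4]  # hypothetical repeated-run scores
cv = coefficient_of_variation(runs)
print(f"run-to-run spread: {cv:.1%}")
assert cv < 0.05  # within the ~5% stability target quoted in the text
```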
Outlook

To date, WLCG pledged compute resources have essentially consisted only of x86 CPUs. This processor architecture has therefore been the main focus of developments in the HEP-Benchmarks project so far. The design of its components, however, and specifically that of the core hep-workloads package, is quite general and can easily be extended to non-x86 architectures such as ARM or Power9, and even to other compute resources such as GPUs, which are becoming important for WLCG because of their widespread adoption in the latest supercomputers at HPC centers. By and large, the large scale GEN, SIM, DIGI and RECO production workloads of the four LHC experiments are not yet ready [41] to be moved from traditional WLCG x86 resources to GPUs, but we should be ready to benchmark these resources when this happens. Within the HEP-Benchmarks project, a new hep-workloads-GPU package has therefore been added, to prototype the benchmarking of GPU workloads, including a software workload from the LHC accelerator domain, SixTrack [42]. Work is also in progress to integrate a prototype of the CMS RECO workload on heterogeneous resources [43].
References

[1] S. Campana, Computing challenges of the future, Update of the European Strategy for Particle Physics, Grenada (2019). https://indico.cern.ch/event//contributions/
[2] LHC computing (WLCG): Past, present, and future, Proc. Int. School of Physics "Enrico Fermi", Varenna (2014). https://doi.org//
[3] WLCG status report, CERN-RRB-2019-123, WLCG RRB, October 2019. https://indico.cern.ch/event//contributions/
[4] Computing resources scrutiny group report, CERN-RRB-2019-080, WLCG RRB, October 2019. https://indico.cern.ch/event//contributions/
[5] Computer benchmarking: paths and pitfalls, IEEE Spectrum, 38-43 (1987). https://doi.org//MSPEC.1987.6448963
[6] HEP Software Foundation, A Roadmap for HEP Software and Computing R&D for the 2020s, Comput. Softw. Big Sci., 7 (2019). https://doi.org//s41781-018-0018-8
[7] HEPiX Benchmarking WG web site. https://w3.hepix.org/benchmarking.html
[8] P. Charpentier, Benchmarking worker nodes using LHCb productions and comparing with HEP-SPEC06, Proc. CHEP2016, San Francisco, J. Phys. Conf. Ser., 082011 (2017). https://doi.org/////
[9] E. McIntosh, Benchmarking computers for HEP, 15th CERN School of Computing, L'Aquila (1992), CERN-CN-92-13. https://doi.org//CERN-1993-003.186
[10] K. M. Dixit, Overview of the SPEC Benchmarks, in Jim Gray (Ed.), The Benchmark Handbook for Database and Transaction Systems (2nd Edition), Morgan Kaufmann (1993). https://jimgray.azurewebsites.net/benchmarkhandbook/toc.htm
[11] J. Dongarra, P. Luszczek, A. Petitet, The LINPACK Benchmark: past, present and future, Conc. Comp. Pract. Exper., 803-820 (2003). https://doi.org//cpe.728
[12] H. J. Curnow, B. A. Wichmann, A synthetic benchmark, The Computer Journal, 43-49 (1976). https://doi.org//comjnl/
[13] Dhrystone: a synthetic systems programming benchmark, Comm. ACM, 1013-1030 (1984). https://doi.org//
[14] SPEC web site. https://spec.org
[15] M. Michelotto et al., A comparison of HEP code with SPEC benchmarks on multi-core worker nodes, Proc. CHEP2009, Prague, J. Phys. Conf. Ser., 052009 (2010). https://doi.org/////
[16] ALICE Computing TDR (2005). https://cds.cern.ch/record/
[17] ATLAS Computing TDR (2005). https://cds.cern.ch/record/
[18] CMS Computing TDR (2005). https://cds.cern.ch/record/
[19] LHCb Computing TDR (2005). https://cds.cern.ch/record/
[20] SPEC CPU2006 benchmark descriptions, ACM SIGARCH Comp. Arch. News, 1-17 (2006). https://doi.org//
[21] C++ benchmarks in SPEC CPU2006, ACM SIGARCH Comp. Arch. News, 77-83 (2007). https://doi.org//
[22] The CMSSW benchmarking suite: using HEP code to measure CPU performance, Proc. CHEP2009, Prague, J. Phys. Conf. Ser., 052016 (2010). https://doi.org/////
[23] Perfmon2: a flexible performance monitoring interface for Linux, Proc. OLS2006, Ottawa. https:///doc/ols//ols2006v1-pages-269-288.pdf
[24] A. Hirstius, CPU-level performance monitoring with Perfmon, HEPiX Spring 2008, CERN. https://indico.cern.ch/event//contributions/
[25] An update on perfmon and the struggle to get into the Linux kernel, Proc. CHEP2009, Prague, J. Phys. Conf. Ser., 042048 (2010). https://doi.org/////
[26] Next Generation of HEP CPU Benchmarks, Proc. CHEP2018, Sofia, EPJ Web of Conf., 08011 (2019). https://doi.org//epjconf/
[27] Next Generation of HEP CPU Benchmarks, Proc. ACAT2019, Saas Fee. https://indico.cern.ch/event//contributions/
[28] Trident: An Automated System Tool for Collecting and Analyzing Performance Counters, Proc. CHEP2018, Sofia, EPJ Web of Conf., 08024 (2019). https://doi.org//epjconf/
[29] A Top-Down method for performance analysis and counters architecture, Proc. 2014 IEEE ISPASS, Monterey. https://doi.org//ISPASS.2014.6844459
[30] J. Elmsheuser et al., ATLAS Grid Workflow Performance Optimization, Proc. CHEP2018, Sofia, EPJ Web of Conf., 03021 (2019). https://doi.org//epjconf/
[31] The HEP-Benchmarks project. https://gitlab.cern.ch/hep-benchmarks
[32] Docker, What is a container?. https:///resources/what-container
[33] CernVM-FS Shrinkwrap. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html
[34] P. S. M. Teuber, Efficient unpacking of required software from CERNVM-FS, CERN Openlab Report (2019). https://doi.org//zenodo.2574461
[35] J. Blomer et al., Distributing LHC application software and conditions databases using the CernVM file system, Proc. CHEP2010, Taipei, J. Phys. Conf. Ser., 042003 (2011). https://doi.org/////
[36] Benchmarking the ATLAS software through the Kit Validation engine, Proc. CHEP2009, Prague, J. Phys. Conf. Ser., 042037 (2010). https://doi.org/////
[37] The hep-workloads container registry. https://gitlab.cern.ch/hep-benchmarks/hep-workloads/container_registry
[38] G. M. Kurtzer, V. Sochat, M. W. Bauer, Singularity: Scientific containers for mobility of compute, PLoS ONE, e0177459 (2017). https://doi.org//journal.pone.0177459
[39] E. Sexton-Kennedy et al., Implementation of a Multi-threaded Framework for Large-scale Scientific Applications, Proc. ACAT2014, Prague, J. Phys. Conf. Ser., 012034 (2015). https://doi.org/////
[40] Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework, Proc. CHEP2015, Okinawa, J. Phys. Conf. Ser., 072050 (2015). https://doi.org/////
[41] Overview of the GPU efforts for WLCG production workloads, Pre-GDB on benchmarking, CERN (2019). https://indico.cern.ch/event//contributions/
[42] SixTrack Version 5, Proc. IPAC2019, Melbourne; J. Phys. Conf. Ser., 012129 (2019). https://doi.org/////
[43] Heterogeneous online reconstruction at CMS, to appear in Proc. CHEP2019, Adelaide. https://indico.cern.ch/event//contributions//