Heterogeneous reconstruction of tracks and primary vertices with the CMS pixel tracker
Andrea Bocci, Matti Kortelainen, Vincenzo Innocente, Felice Pantaleo, Marco Rovere
CERN, European Organization for Nuclear Research, Meyrin, Switzerland
Fermi National Accelerator Laboratory, Batavia, Illinois, U.S.A.
E-mail: [email protected]
Abstract:
The High-Luminosity upgrade of the LHC will see the accelerator reach an instantaneous luminosity of 7 × 10³⁴ cm⁻² s⁻¹ with an average pileup of 200 proton-proton collisions. These conditions will pose an unprecedented challenge to the online and offline reconstruction software developed by the experiments. The computational complexity will exceed by far the expected increase in processing power for conventional CPUs, demanding an alternative approach. Industry and High-Performance Computing (HPC) centres are successfully using heterogeneous computing platforms to achieve higher throughput and better energy efficiency by matching each job to the most appropriate architecture. In this paper we describe the results of a heterogeneous implementation of the pixel track and vertex reconstruction chain on Graphics Processing Units (GPUs). The framework has been designed and developed to be integrated in the CMS reconstruction software, CMSSW. The speed-up achieved by leveraging GPUs allows more complex algorithms to be executed, obtaining better physics output and a higher throughput.
Keywords: GPU, Particle Track Reconstruction, Vertex Reconstruction, Heterogeneous Computing, Patatrack

1 Introduction

The High-Luminosity upgrade of the LHC [1] will pose unprecedented challenges to the reconstruction software used by the experiments, due to the increase both in instantaneous luminosity and readout rate. In particular, the CMS experiment at CERN [2] has been designed with a two-level trigger system: the Level 1 Trigger, implemented on custom-designed electronics, and the
High Level Trigger (HLT), a streamlined version of the CMS offline reconstruction software running on a computer farm. A software trigger system requires a trade-off between the complexity of the algorithms running on the available computing resources, the sustainable output rate, and the selection efficiency. When the HL-LHC becomes operational, it will reach a luminosity of 7 × 10³⁴ cm⁻² s⁻¹ with an average pileup of 200 proton-proton collisions. To fully exploit the higher luminosity, the CMS experiment will increase the full readout rate from 100 kHz to 750 kHz [3]. The higher luminosity, pileup and input rate present an exceptional challenge to the HLT, which will require a processing power larger than today's by more than an order of magnitude. This exceeds by far the expected increase in processing power for conventional CPUs, demanding alternative solutions.
A promising approach to mitigate this problem is heterogeneous computing. Heterogeneous computing systems gain performance and energy efficiency not by merely increasing the number of same-kind processors, but by employing different co-processors specifically designed to handle specific tasks in parallel. Industry and High-Performance Computing (HPC) centres are successfully exploiting heterogeneous computing platforms to achieve higher throughput and better energy efficiency by matching each job to the most appropriate architecture.
In order to investigate the feasibility of a heterogeneous approach in a typical High Energy Physics experiment, the authors developed a novel pixel track and vertex reconstruction chain within the official CMS reconstruction software, CMSSW [4]. The input to this chain is the RAW data coming directly from the detector's front-end electronics, while the output consists of legacy pixel tracks and vertices that can be transparently re-used by other components of the CMS reconstruction. The results shown in this article are based on Open Data released by CMS, while the data formats were derived from the CMS Experiment [5].
The development of a heterogeneous reconstruction faces several fundamental challenges:
• the adoption of a different programming paradigm;
• the experimental reconstruction framework and its scheduling must accommodate heterogeneous processing;
• the heterogeneous algorithms should achieve the same or better physics performance and processing throughput as their CPU counterparts;
• it must be possible to run and validate on conventional machines, without any dedicated resources.
This article is organized as follows: Section 2 describes the CMS heterogeneous framework, Section 3 discusses the algorithms developed in the Patatrack pixel track and vertex reconstruction workflow, Section 4 presents the physics results and the computational performance and compares them to the CMS pixel track reconstruction used at the HLT for data taking in 2018, and Section 5 contains our conclusions.

2 The CMS heterogeneous framework

The backbone of the CMS data processing software,
CMSSW, is a rather generic framework that processes independent chunks of data [4]. These chunks of data are called events, and in CMS correspond to one full readout of the detector. Consecutive events with uniform calibration data are grouped into luminosity blocks, which are further grouped into longer runs. The data are processed by modules that communicate via a type-safe C++ container called the event (or luminosity block or run for the larger units). An analyzer can only read data products, a producer can read and write new data products, and a filter can, in addition, decide whether the processing of a given event should be stopped. Data products become immutable (more precisely, const in the C++11 sense) after being inserted into the event.
During the Long Shutdown 1 and Run 2 of the LHC, the CMSSW framework gained multi-threading capabilities [6–8] implemented with the Intel Threading Building Blocks (TBB) library. The threading model employs task-level parallelism to process concurrently independent modules within the same or different events, multiple events within the same or different luminosity blocks, and intervals of validity of the calibration data. Currently, run boundaries incur a barrier-style synchronization point in the processing. A recent extension is the concept of the external worker, a generic mechanism that allows producers in CMSSW to offload asynchronous work outside of the framework scheduler. More details on the concept of the external worker and its interaction with CUDA can be found in [9].
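The following standalone CUDA sketch illustrates the offload-and-notify idea behind the external worker: work is submitted asynchronously to the GPU on a stream and a host callback fires once it has completed, so the calling CPU thread is free to process other modules in the meantime. This is not the CMSSW interface itself (which is described in [9]); the names offloadWork and onWorkDone are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel standing in for a reconstruction step offloaded to the GPU.
__global__ void scaleKernel(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Host callback invoked by the CUDA runtime once all preceding work on the
// stream has completed; conceptually, this is where an external worker would
// notify the framework scheduler that the results are ready to be collected.
void CUDART_CB onWorkDone(void* userData) {
  std::printf("GPU work for event %zu done, notify the framework\n", (size_t)userData);
}

void offloadWork(cudaStream_t stream, float* d_data, int n, size_t eventId) {
  scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 2.f);
  // Enqueue the notification; the calling CPU thread returns immediately.
  cudaLaunchHostFunc(stream, onWorkDone, (void*)eventId);
}

int main() {
  const int n = 1 << 20;
  float* d_data;
  cudaMalloc(&d_data, n * sizeof(float));
  cudaMemset(d_data, 0, n * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  offloadWork(stream, d_data, n, 42);   // asynchronous: no blocking here
  cudaStreamSynchronize(stream);        // only needed for this standalone example
  cudaStreamDestroy(stream);
  cudaFree(d_data);
}
```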
3 The Patatrack pixel track and vertex reconstruction

Precise track reconstruction becomes more challenging at higher pileup, as the number of vertices and the number of tracks increase, making the pattern recognition and the classification of hits produced by the same charged particle a harder combinatorial problem. To mitigate the complexity of the problem, the authors developed parallel algorithms that can perform the track reconstruction on GPUs, starting from the "raw data" of the CMS pixel detector, as described later in this section. The steps performed during the track and vertex reconstruction are illustrated in Fig. 1.
Figure 1. Steps involved in the track and vertex reconstruction, starting from the pixel "raw data".
The data structures (structure of arrays, SoA) used by the parallel algorithms are optimized for coalesced memory access on the GPU and differ substantially from the ones used by the standard reconstruction in CMS (legacy data formats). The data transfers between CPU and GPU, and the conversions between legacy and optimised formats, are very time-consuming operations. For this reason the authors decided to design a fully contained chain of modules that runs on the GPU starting from the "raw data" and produces the final tracks and vertices as output. While a "mixed CPU-GPU workflow" is not supported, for validation purposes the intermediate data products can be transferred from the GPU to the CPU and converted to the corresponding legacy data formats.
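As an illustration of the difference between a legacy array-of-structures layout and the SoA layout used on the GPU, consider the simplified hit container below. The field names and sizes are illustrative assumptions, not the actual CMSSW data formats; the point is that consecutive threads reading the same field of consecutive hits access contiguous memory, which the GPU can coalesce into a single transaction.

```cuda
#include <cstdint>

// Legacy-style array of structures (AoS): the fields of one hit are contiguous,
// but the same field of consecutive hits is strided in memory.
struct HitAoS {
  float x, y, z;
  uint16_t detId;
  uint16_t charge;
};

// Structure of arrays (SoA): the same field of consecutive hits is contiguous,
// so threads i, i+1, i+2, ... perform coalesced reads and writes.
struct HitsSoA {
  static constexpr int kMaxHits = 48 * 1024;
  float x[kMaxHits];
  float y[kMaxHits];
  float z[kMaxHits];
  uint16_t detId[kMaxHits];
  uint16_t charge[kMaxHits];
  int nHits;
};

__global__ void shiftHitsZ(HitsSoA* hits, float dz) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < hits->nHits)
    hits->z[i] += dz;   // coalesced: thread i touches z[i]
}
```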
Figure 2. Longitudinal sketch of the pixel detector geometry.

3.1 Local reconstruction

The CMS "Phase 1" pixel detector [10], installed in 2017, will serve as the vertex detector until the major "Phase 2" upgrade for the HL-LHC. It consists of 1856 sensors with 66,560 pixels each, for a total of 124 million pixels, corresponding to about 2 m² of total area. The pixel size is 100 µm × 150 µm and the thickness of the sensitive volume is 285 µm. The sensors are arranged in four "barrel" layers and six "endcap" disks, three on each side, to provide four-hit pixel coverage up to a pseudorapidity of |η| < 2.5. The CMS pixel detector geometry is sketched in Fig. 2. The barrel layers extend for about 26 cm along the beam line on either side of the interaction point.
The first step of the chain is the local reconstruction, which reconstructs the information about the individual hits in the detector. During this phase, the digitized information is unpacked and interpreted to create digis: each digi represents a single pixel with a charge above the signal-over-noise threshold, and contains information about the collected charge and the local row and column position in the module grid. This process is parallelized on two levels: information coming from different modules is processed in parallel by independent blocks of threads, while each digi within a module is assigned a unique index and is processed by a different thread.
Neighbouring digis are grouped together to form clusters using an iterative process. Digis within each module are laid out on a two-dimensional grid using their row, column and unique index information. Each digi is then assigned to a thread. If two or more adjacent digis are found, the one with the smaller index becomes the seed for the others. This procedure is repeated until all the digis have been assigned to a seed and no further changes are possible. The outcome of the clusterization is a cluster index for each digi: a thread is allocated to each seed; a global atomic counter is increased by all threads, returning the unique cluster index for each seed, and thus for each digi (a minimal sketch of this step is given at the end of this subsection).
Finally, the shape of the clusters and the charge of the digis are used to determine the hit position and its uncertainty in the coordinates local to the module, as described in [11].
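The sketch below illustrates, for the digis of a single module, the iterative label-propagation clustering just described: each thread owns one digi, repeatedly adopts the smallest label among its neighbours until nothing changes, and the seeds then obtain a unique cluster index from an atomic counter. The data layout, the brute-force neighbour scan and the assumption that one block covers all the digis of a module are simplifications for illustration, not the actual CMSSW implementation.

```cuda
#include <cstdint>

// One thread block per module, one thread per digi (illustrative layout; the
// block size is assumed to be at least nDigis for this simplified sketch).
__global__ void clusterizeModule(const uint16_t* row, const uint16_t* col,
                                 int nDigis, int* clusterId, int* nClusters) {
  int i = threadIdx.x;
  if (i < nDigis) clusterId[i] = i;            // every digi starts as its own seed
  __shared__ bool changed;
  do {
    __syncthreads();
    if (threadIdx.x == 0) changed = false;
    __syncthreads();
    if (i < nDigis) {
      for (int j = 0; j < nDigis; ++j) {        // brute-force neighbour scan
        int dr = (int)row[i] - (int)row[j];
        int dc = (int)col[i] - (int)col[j];
        bool adjacent = dr >= -1 && dr <= 1 && dc >= -1 && dc <= 1;
        if (adjacent && clusterId[j] < clusterId[i]) {
          clusterId[i] = clusterId[j];          // adopt the smaller label
          changed = true;
        }
      }
    }
    __syncthreads();
  } while (changed);
  // Seeds (label == own index) obtain a unique cluster index from an atomic
  // counter; the other digis then pick up the index from their seed.
  if (i < nDigis && clusterId[i] == i)
    clusterId[i] = -(1 + atomicAdd(nClusters, 1));
  __syncthreads();
  if (i < nDigis && clusterId[i] >= 0)
    clusterId[i] = clusterId[clusterId[i]];
  // Final cluster indices are encoded as negative numbers -(index + 1).
}
```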
3.2 n-tuplets

Clusters are linked together to form n-tuplets that are later fitted to extract the final track parameters. The n-tuplet production proceeds through the following steps:
• creation of doublets;
• connection of doublets;
• identification of root doublets;
• depth-first search (DFS) from each root doublet.
The doublets are created by connecting hits belonging to adjacent pairs of pixel detector layers, as illustrated by the solid arrows in Fig. 3. To account for geometrical and detector inefficiencies, doublets are also created between chosen pairs of non-adjacent layers, as illustrated by the dashed arrows in Fig. 3.

Figure 3. Combinations of pixel layers that can create doublets directly (solid arrows), or by skipping a layer to account for geometrical acceptance (dashed arrows).

Various selection criteria are applied to reduce the combinatorics. The following criteria have a strong impact on timing and physics performance:
• p_T^min: searching for low transverse momentum tracks can be very computationally expensive. Setting a minimum threshold for p_T limits the possible curvature, hence reducing the number of possible combinations of hits.
• R_max and z_max: the maximum transverse and longitudinal distances of closest approach with respect to the beam spot. Tracks produced within a radius of less than 1 mm around the beam spot are called prompt tracks. Searching for detached tracks with a larger value of R_max leads to an increase in combinatorics. These "alignment criteria" are illustrated in Fig. 4.
• n_hits: requiring a high number of hits in the n-tuplets leads to a purer set of tracks, so the cuts can be loosened, while a lower number of hits yields a higher efficiency at the cost of a higher fake rate.

Figure 4. Windows opened in the transverse and longitudinal planes. The outer hit is colored in red, the inner hits in blue [12].

Hits within each layer are arranged in a tiled data structure along the azimuthal (φ) direction for optimal performance. The search for compatible hit pairs is performed in parallel by different threads, each starting from a different outer hit. The pairs of inner and outer hits that satisfy the alignment criteria and have compatible cluster sizes along the z direction form a doublet. The cuts applied during the doublet building are described in Table 1, and their impact on the physics results and reconstruction time is reported in Tables 2 and 3.
The doublets that share a common hit are tested for compatibility to form a triplet. The compatibility requires that the three hits are aligned in the R-z plane, and that the circumference passing through them intersects the beam-spot compatibility region defined by R_max. All doublets from all layer pairs are tested in parallel.
All compatible doublets form a directed acyclic graph. All the doublets whose inner hit lies on BPix1 are marked as root doublets. To reconstruct "outer" triplets, doublets starting on BPix2 or the two FPix1 layers and without inner neighbours are also marked as root. Each root doublet is subsequently assigned to a different thread that performs a depth-first search (DFS) over the directed acyclic graph starting from it. A DFS is used because one may want to search for all the n-tuplets up to n hits. The advantage of this approach is that the buckets containing triplets and quadruplets are disjoint sets, as a triplet could not have been extended further to become a quadruplet [12].
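As an illustration of this step, the sketch below performs an iterative depth-first search from a root doublet over a simplified cell graph, storing an n-tuplet only when the path cannot be extended further, so that triplet and quadruplet buckets stay disjoint. A path of k cells corresponds to an n-tuplet of k + 1 hits. The Cell structure, the fixed-size stack and all names are illustrative assumptions, not the actual CMSSW data model.

```cuda
// A cell is a doublet of hits; outerNeighbours lists the cells whose inner hit
// coincides with this cell's outer hit (illustrative, simplified data model).
struct Cell {
  int innerHit;
  int outerHit;
  int nNeighbours;
  int outerNeighbours[8];
};

// Iterative DFS from one root cell. minDepth/maxDepth are expressed in cells
// (e.g. 2 and 3 for triplets and quadruplets) and must not exceed 8 here.
__device__ void dfsFromRoot(const Cell* cells, int root, int minDepth, int maxDepth,
                            int* tuples, int* nTuples, int tupleStride) {
  int stack[8];    // current path of cell indices
  int cursor[8];   // next neighbour to visit at each depth
  int depth = 0;
  stack[0] = root;
  cursor[0] = 0;
  while (depth >= 0) {
    const Cell& c = cells[stack[depth]];
    bool atMax = (depth + 1 == maxDepth);
    if (!atMax && cursor[depth] < c.nNeighbours) {
      int next = c.outerNeighbours[cursor[depth]++];
      ++depth;                                   // extend the path
      stack[depth] = next;
      cursor[depth] = 0;
    } else {
      // Store only paths that cannot be extended further and are long enough.
      if ((atMax || c.nNeighbours == 0) && depth + 1 >= minDepth) {
        int slot = atomicAdd(nTuples, 1);
        for (int d = 0; d <= depth; ++d)
          tuples[slot * tupleStride + d] = stack[d];
      }
      --depth;                                   // backtrack
    }
  }
}

// One thread per root doublet, as described in the text.
__global__ void findNtuplets(const Cell* cells, const int* roots, int nRoots,
                             int minDepth, int maxDepth,
                             int* tuples, int* nTuples, int tupleStride) {
  int r = blockIdx.x * blockDim.x + threadIdx.x;
  if (r < nRoots)
    dfsFromRoot(cells, roots[r], minDepth, maxDepth, tuples, nTuples, tupleStride);
}
```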
Cut      Description
PhiHist  binned φ window between inner and outer hit, using a 128-bin histogram
PhiW     PhiHist + tuned φ window between inner and outer hit
ZW       window in z for the inner hit
ZIP      cut on the impact parameter along the beam axis
PT       cut on the curvature assuming zero transverse impact parameter (equivalent to a cut on the TIP for high-p_T tracks)
CSZ      cut on the cluster size compatibility

Table 1. Description of the cuts applied during the reconstruction of doublets.
Cuts                  Doublets    n-tuplets  Tracks  Unconnected fraction
PhiHist               1,268,193   23,254     1,256   0.966
PhiHist+ZW            866,316     18,301     1,266   0.966
PhiHist+ZW+ZIP        269,410     11,235     1,265   0.926
PhiW+ZW               594,739     13,403     1,212   0.958
PhiW+ZW+ZIP           185,642     8,327      1,214   0.919
PhiW+ZW+ZIP+CSZ       129,307     6,060      1,087   0.915
PhiW+ZW+ZIP+PT        164,567     7,273      1,141   0.921
PhiW+ZW+ZIP+PT+CSZ    115,248     5,270      999     0.918

Table 2. Average number of doublets, n-tuplets and final tracks per event, as well as the fraction of cells not connected, for each set of doublet reconstruction cuts (described in Table 1), running over a sample of tt̄ events with an average pileup of 50 and an average of 15,000 hits per event.

                      Time in µs
Cuts                  doublets  connect  DFS    clean
PhiHist               6,123     15,127   1,690  1,976
PhiHist+ZW            950       6,582    778    538
PhiHist+ZW+ZIP        310       488      354    237
PhiW+ZW               552       2,995    549    377
PhiW+ZW+ZIP           271       265      274    183
PhiW+ZW+ZIP+CSZ       291       187      216    154
PhiW+ZW+ZIP+PT        259       156      246    125
PhiW+ZW+ZIP+PT+CSZ    280       108      192    114

Table 3. Time spent in the three components of the n-tuplet building (doublet creation, doublet connection, DFS), as well as in the Fishbone and ambiguity resolution algorithms ("clean"), for each set of doublet reconstruction cuts (described in Table 1). It should be noted that using very relaxed cuts requires larger memory buffers on the GPU, up to 12 GB, while running with the last four sets requires less than 2 GB of memory.

3.3 Fishbone n-tuplets

Full hit coverage in the instrumented pseudorapidity range is achieved in modern pixel detectors via partially overlapping sensitive layers. This, at the same time, mitigates the impact of possible localized hit inefficiencies. With this design, though, requiring at most one hit per layer can lead to several n-tuplets corresponding to the same particle. This is particularly relevant in the forward region due to the design of the pixel forward disks, illustrated in Fig. 5: up to four hits in the same layer can be found in localized forward areas.
The Fishbone n-tuplet resolves these ambiguities by merging overlapping doublets. The Fishbone mechanism is active while creating the doublets: among all the aligned doublets that share the same outermost hit, only the shortest one is kept. In this way the ambiguities are resolved and a single Fishbone n-tuplet is created. Furthermore, among all the tracks that share a hit doublet, only the ones with the largest number of hits are retained.

Figure 5. A typical Fishbone n-tuplet. The shaded areas indicate partially overlapping modules in the same layer.
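A minimal sketch of the Fishbone selection is given below: for every outer hit, among the doublets pointing back to it whose directions are nearly parallel, only the shortest one survives. The Doublet record, the CSR-style index per outer hit and the pairwise comparison are simplified assumptions for illustration.

```cuda
// Illustrative doublet record.
struct Doublet {
  int innerHit;
  int outerHit;
  float dir[3];    // unit vector from the inner to the outer hit
  float length;    // distance between the two hits
  bool killed;     // set to true when removed by the Fishbone
};

__device__ float dot3(const float* a, const float* b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// One thread per outer hit; firstDoubletOfHit gives, for each outer hit, the
// range of doublets pointing to it (illustrative CSR-style layout).
__global__ void fishbone(Doublet* doublets, const int* firstDoubletOfHit,
                         int nOuterHits, float cosAlignCut) {
  int hit = blockIdx.x * blockDim.x + threadIdx.x;
  if (hit >= nOuterHits) return;
  int begin = firstDoubletOfHit[hit];
  int end = firstDoubletOfHit[hit + 1];
  for (int i = begin; i < end; ++i)
    for (int j = i + 1; j < end; ++j) {
      // Two doublets are "aligned" when their directions are nearly parallel.
      if (dot3(doublets[i].dir, doublets[j].dir) < cosAlignCut) continue;
      if (doublets[i].length <= doublets[j].length)
        doublets[j].killed = true;   // keep only the shorter doublet
      else
        doublets[i].killed = true;
    }
}
```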
3.4 n-tuplet fit

The "Phase 1" upgraded pixel detector has one more barrel layer and one additional disk on each side with respect to the previous detector. The possibility of using four (or more) hits from distinct layers opens new opportunities for the pixel track fitting method. It is possible not only to obtain a better statistical estimation of the track parameters (d_z, cot θ, d_0, p_T and φ [13]) thanks to the additional point, but also to include in the fitting procedure more realistic effects, such as the energy loss and the multiple scattering of the particle due to its interaction with the material of the detector.
The pixel track reconstruction developed by the authors includes a multiple-scattering-aware fit: the Broken Line fit [14]. It follows three main steps:
• a fast pre-fit in the transverse plane gives an estimate of the track momentum, used to compute the multiple scattering contribution;
• a line fit in the s-z plane;
• a circle fit in the transverse plane.
The d_z and cot θ track parameters and their covariance matrices are derived from the line fit, while d_0, p_T and φ, and their covariance matrices, are derived from the circle fit. The final track parameters and covariance matrix are computed by combining the individual results. The fits are performed in parallel over all n-tuplets, using one thread per n-tuplet. The fit implementation uses the Eigen C++ library [15], which natively supports CUDA.

3.5 Ambiguity resolution

Tracks that share a hit doublet are considered "ambiguous" and only the one with the best χ² is retained. Triplets are considered "ambiguous" if they share one hit: only the one with the smallest transverse impact parameter is retained.

3.6 Pixel vertices

The fitted pixel tracks are subsequently used to form pixel vertices. Vertices are searched for as clusters in the z coordinate of the point of closest transverse approach of each track to the beam line (z). Only tracks with at least 4 hits and a p_T larger than a configurable threshold (0.5 GeV) are considered. For each track with an uncertainty on z lower than a configurable threshold, the local density of close-by tracks is computed. Tracks enter the density calculation if they are within a certain ∆z cut and if their χ² compatibility is below a configurable maximum. Tracks with a local density greater than 1 are considered as seeds for a vertex. Each track is then linked to another track that has a higher local density, if the distance between the two tracks is smaller than the ∆z cut and if their χ² compatibility is below the same maximum. All the tracks that are logically linked starting from each seed become part of the same vertex candidate. Each vertex candidate is promoted to a final vertex if it contains at least 2 tracks.
This algorithm is easily parallelizable and, in one dimension as in this case, requires no iterations. It is less sensitive to noise (fake tracks) and has a lower merge rate than a standard DBSCAN [16]. It is much faster than any hierarchical algorithm [17] or algorithms based on deterministic annealing [11]. The position of each vertex along the beam line is obtained from the z of the contributing tracks. Vertices with a χ² larger than a given threshold (9 per degree of freedom) are split in two using a k-means algorithm. Finally, the vertices are sorted by the sum of the squared transverse momenta (Σ p_T²) of the contributing tracks. The vertex with the largest Σ p_T² is labelled as the "primary" vertex, i.e. the vertex corresponding to the signal (triggering) event.
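The sketch below illustrates the core of such a one-dimensional density-based clustering in z: each track counts its neighbours within ∆z, links itself to the closest denser neighbour, and the links are followed up to a seed that identifies the vertex candidate. The array names are assumptions, and the χ² compatibility test and the seed/candidate promotion step are omitted for brevity; this is an illustration of the idea, not the Patatrack implementation.

```cuda
#include <cmath>

// ztrack[i]: z of track i at its point of closest approach to the beam line.

// Step 1: local density = number of tracks within deltaZ (brute force for clarity).
__global__ void computeDensity(const float* ztrack, int nTracks, float deltaZ,
                               int* density) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  int rho = 0;
  for (int j = 0; j < nTracks; ++j)
    if (fabsf(ztrack[j] - ztrack[i]) < deltaZ) ++rho;
  density[i] = rho;
}

// Step 2: link each track to the closest denser track within deltaZ;
// a track with no denser neighbour keeps a link to itself (a potential seed).
__global__ void linkToDenser(const float* ztrack, const int* density, int nTracks,
                             float deltaZ, int* linkTo) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  linkTo[i] = i;
  float best = deltaZ;
  for (int j = 0; j < nTracks; ++j) {
    float d = fabsf(ztrack[j] - ztrack[i]);
    if (d < best && density[j] > density[i]) { best = d; linkTo[i] = j; }
  }
}

// Step 3: follow the links up to the seed; all tracks reaching the same seed
// belong to the same vertex candidate.
__global__ void collectVertices(const int* linkTo, int nTracks, int* vertexSeed) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  int j = i;
  while (linkTo[j] != j) j = linkTo[j];
  vertexSeed[i] = j;   // index of the seed track of this candidate
}
```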
4 Results

In this section the performance of the Patatrack reconstruction is evaluated and compared to the track reconstruction that CMS used for data taking in 2018 (referred to in the following as CMS-2018) [12].
4.1 Simulated dataset

The performance studies have been carried out using 20,000 tt̄ simulated events from CMS Open Data [5], with an average of 50 superimposed pileup collisions at a centre-of-mass energy √s = 13 TeV, using detector design conditions.

4.2 Physics performance

The efficiency is defined as the fraction of simulated tracks, N_sim, having produced at least three hits in the pixel detector, that are associated with at least one reconstructed track, N_rec:

    efficiency = N_rec / N_sim .    (4.1)

A reconstructed pixel track is associated with a simulated track if all the hits that it contains come from that simulated track. The efficiency is computed only for tracks coming from the hard interaction and not for those from the pileup. The CPU and GPU versions of the Patatrack workflow produce the same physics results, as shown in Fig. 6. For this reason, no further distinction is made between the workflows running on CPU and GPU in the discussion of the physics results.
The quadruplet efficiency is significantly improved by the Patatrack quadruplets workflow with respect to CMS-2018, as shown in Fig. 7. The main reasons for this improvement are the possibility to skip a layer outside the geometrical acceptance when building doublets and the use of different Cellular Automaton cuts for the barrel and the end-caps. The efficiency can be further improved by including the pixel tracks built from triplets (Patatrack triplets).
The fake rate is defined as the fraction of all the reconstructed tracks coming from a reconstructed primary vertex that are not associated uniquely to a simulated track. In the case of a fake track, the set of hits used to reconstruct the track does not belong to the same simulated track. As shown in Fig. 8, the fake rate of the Patatrack quadruplets is improved with respect to the CMS-2018 pixel reconstruction in the end-cap region, mainly thanks to the different treatment of the end-caps in the Cellular Automaton. The inclusion of the pixel tracks built from Patatrack triplets slightly increases the fake rate for the tracks coming from the primary vertices, since loosening the requirement on the number of hits decreases the quality of the selection cuts.
A duplicate track is a reconstructed track matched to a simulated track that has itself been matched to at least two reconstructed tracks. The introduction of the Fishbone algorithm improves the duplicate rejection in the Patatrack workflows by up to two orders of magnitude with respect to the CMS-2018 pixel track reconstruction, as shown in Fig. 9.
For historical reasons the CMS-2018 pixel reconstruction does not perform a fit of the n-tuplets in the transverse plane, and considers instead only the first three hits for the track parameter estimation. Furthermore, the errors on the track parameters are taken from a look-up table parameterized in η and p_T. The improvement brought by the Broken Line fit to the accuracy of the fitted parameters can be quantified by looking at the resolutions, defined as

    resolution = σ(fitted value − true value) .    (4.2)

The p_T resolution is improved by up to a factor of 2 when compared to the CMS-2018 pixel tracking (Fig. 10). The resolution of the transverse impact parameter d_xy also improves, especially in the barrel (Fig. 11).
Figure 6. Comparison of the pixel track reconstruction efficiency of the CPU and GPU versions of the Patatrack pixel reconstruction for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) efficiency vs p_T (|η| < 2.5), (b) efficiency vs η (p_T > 0.9 GeV).
Figure 7. Pixel track reconstruction efficiency for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) efficiency vs p_T, (b) efficiency vs η (p_T > 0.9 GeV). The Patatrack reconstruction producing pixel tracks from n-tuplets with at least three hits (triplets) and at least four hits (quadruplets) is compared to the CMS-2018 reconstruction.
Figure 8. Pixel track reconstruction fake rate, for tracks from the primary vertex, for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) fake rate vs p_T, (b) fake rate vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.
Figure 9. Pixel track reconstruction duplicate rate for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) duplicate rate vs p_T, (b) duplicate rate vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.
Figure 10. Pixel track relative p_T resolution for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) resolution vs p_T, (b) resolution vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.

The CMS-2018 pixel tracking behaves better in the longitudinal plane than it does in the transverse plane. However, the improvement brought by the Broken Line fit to the estimate of the longitudinal impact parameter d_z is still visible for higher-p_T tracks, as shown in Fig. 12.
The number of reconstructed vertices, together with the capability to separate two close-by vertices, has been measured to estimate the performance of the vertexing algorithm. This capability can be quantified by measuring the vertex merge rate, i.e. the probability of reconstructing two different simulated vertices as a single vertex. Figure 13 shows how the vertexing performance evolves with the number of simulated proton interactions.

4.3 Computing performance

The hardware and software configurations used to carry out the computing performance measurements are:
• a dual-socket Intel Xeon Gold 6130 [18], 2 × 16 physical cores, 64 hardware threads;
• a single NVIDIA T4 [19];
• NVIDIA CUDA 11 with the Multi-Process Service [20];
• CMSSW_11_1_2_Patatrack [21].
A CMSSW reconstruction sequence that runs only the pixel reconstruction modules described in Section 3 was created. More than one event can be processed in parallel by different CPU threads; these can perform asynchronous operations, such as kernel launches and memory transfers, in parallel on the same GPU. The maximum number of events that can be processed in parallel today is about 80, limited by the amount of GPU memory that must be allocated for each event.
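The sketch below shows, in a simplified standalone form, how several CPU threads can each drive their own CUDA stream on the same GPU and how the event throughput can be measured in events per second. It illustrates the measurement concept only; the kernel, the worker structure and the numbers are assumptions, not the CMSSW benchmark itself.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void dummyReco(float* data, int n) {        // stand-in for the pixel chain
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 0.5f + 1.f;
}

// Each worker owns one CUDA stream and processes its share of the events, so
// kernels and copies from different workers can overlap on the same GPU.
void worker(int eventsPerWorker, int n) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  float* d_data;
  cudaMalloc(&d_data, n * sizeof(float));
  cudaMemset(d_data, 0, n * sizeof(float));
  for (int ev = 0; ev < eventsPerWorker; ++ev)
    dummyReco<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
  cudaStreamSynchronize(stream);
  cudaFree(d_data);
  cudaStreamDestroy(stream);
}

int main() {
  const int nWorkers = 8, eventsPerWorker = 1000, n = 1 << 18;
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> pool;
  for (int w = 0; w < nWorkers; ++w)
    pool.emplace_back(worker, eventsPerWorker, n);
  for (auto& t : pool) t.join();
  double seconds =
      std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
  std::printf("throughput: %.1f events/s\n", nWorkers * eventsPerWorker / seconds);
}
```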
Figure 11. Pixel track transverse impact parameter (d_xy) resolution, in µm, for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) resolution vs p_T, (b) resolution vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.
Figure 12. Pixel track longitudinal impact parameter (d_z) resolution, in µm, for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) resolution vs p_T, (b) resolution vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.
Figure 13. Pixel vertex reconstruction efficiency and merge rate for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) number of reconstructed vertices vs number of simulated interactions, (b) vertex merge rate vs number of simulated interactions. The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.

In a data streaming application the measurement of the throughput, i.e. the number of reconstructed events per unit time, is a more representative metric than the measurement of the latency. The benchmark runs 8 independent CMSSW jobs, each reconstructing 8 events in parallel with 8 CPU threads. The throughput of the CMS-2018 reconstruction has been compared to that of the Patatrack quadruplets and triplets workflows. The test includes both the GPU and the CPU versions of the Patatrack workflows. The Patatrack workflows run with three different configurations:
1. no copy: the SoA containing the results stays in the memory where it has been produced;
2. copy, no conversion: the SoA containing the results is copied to the host, if initially produced on the GPU;
3. copy, conversion: the SoA containing the results is copied to the host and converted to the legacy CMS-2018 pixel tracks and vertices data formats.
These configurations are useful to understand the impact of optimizing a potential consumer of the GPU results, either so that it runs on the GPU in the same reconstruction sequence or so that it can consume GPU-friendly data structures, with respect to interfacing the Patatrack workflows to the existing framework without any further optimization.
The results of the benchmark are shown in Table 4.

                           Throughput in events/s
Configuration     Triplets CPU  Triplets GPU  Quadruplets CPU  Quadruplets GPU  CMS-2018
no copy           611           870           892              1386             476
copy, no conv.    —             867           —                1372             —
conversion        585           861           855              1352             —

Table 4. Throughput of the Patatrack triplets and quadruplets workflows when executed on GPU and CPU, compared to the CMS-2018 reconstruction. The benchmark is configured to reconstruct 64 events in parallel. Three different configurations have been compared: in "no copy" the result is not copied from the memory of the device where it was initially produced; in "copy, no conv." the SoA containing the result produced on the GPU is copied to the host memory; in "conversion" the SoA containing the result is copied to the host memory (if needed) and then converted to the legacy data format used for the pixel tracks and vertices by the CMS reconstruction.

The benchmark shows that a single NVIDIA T4 can achieve almost three times the performance of a full dual-socket Intel Xeon
Skylake node when running the Patatrack pixel quadruplets reconstruction. Achieving even better physics performance by also producing pixel tracks from triplets has the effect of almost halving the throughput. Copying the results from the GPU memory to the host memory has a small impact on the throughput, thanks to the possibility of hiding the latency by overlapping the execution of kernels with the copies. Converting the SoA results to the legacy data format also has a small impact on the throughput, but comes with a hidden cost: the conversion takes almost 100% of the machine's processing power. This can be avoided by migrating all the consumers to the SoA data format.
5 Conclusions

The future runs of the Large Hadron Collider (LHC) at CERN will pose significant challenges to the event reconstruction software, due to the increase in both event rate and complexity. For track reconstruction algorithms, the number of combinations that have to be tested does not scale linearly with the number of simultaneous proton collisions.
The work described in this article presents innovative ways to solve the problem of tracking in a pixel detector such as the CMS one, by making use of heterogeneous computing systems in a production-like data taking environment, while being integrated in the CMS experimental software framework, CMSSW. The assessment of the physics and timing performance of the Patatrack reconstruction demonstrated that it can improve the physics performance while being significantly faster than the existing implementation. The possibility to configure the Patatrack reconstruction workflow to run on CPU, or to transfer and convert the results to the CMS data format, makes it possible to run and validate the workflow on conventional machines, without any dedicated resources.
This work sets the foundations for the development of heterogeneous algorithms in HEP, both from the algorithmic and from the framework scheduling points of view. Other parts of the reconstruction, e.g. calorimeters or Particle Flow, will be able to benefit from an algorithmic and data structure redesign in order to run efficiently on GPUs. The ability to run on other accelerators with performance-portable code is also being explored, to ease the maintainability and testability of a single source.
Acknowledgements
We thank our colleagues of the CMS collaboration for publishing high quality simulated data under the open access policy. We also would like to thank the Patatrack students and alumni R. Ribatti and G. Tomaselli for their hard work and dedication. We thank the CERN openlab for providing a platform for discussion on heterogeneous computing and for facilitating knowledge transfer and support between Patatrack and industrial partners. This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics.
References

[1] G. Apollinari, O. Brüning, T. Nakamoto and L. Rossi, High-Luminosity Large Hadron Collider (HL-LHC), arXiv:1705.08830 (2017).
[2] CMS Collaboration, The CMS experiment at the CERN LHC, JINST 3 (2008) S08004.
[3] CMS Collaboration, The Phase-2 Upgrade of the CMS Level-1 Trigger, Tech. Rep. CERN-LHCC-2020-004, CMS-TDR-021, CERN, Geneva (Apr. 2020).
[4] C.D. Jones, M. Paterno, J. Kowalkowski, L. Sexton-Kennedy and W. Tanenbaum, The new CMS event data model and framework, in Proceedings of the International Conference on Computing in High Energy and Nuclear Physics (CHEP06), 2006, https://indico.cern.ch/event/408139/.
[5] CMS Collaboration, TTToHadronic TuneCP5 13TeV-powheg-pythia8 in FEVTDEBUGHLT format for LHC Phase2 studies, CERN Open Data Portal.
[6] C.D. Jones and E. Sexton-Kennedy, Stitched together: transitioning CMS to a hierarchical threaded framework, J. Phys.: Conf. Ser. (2014) 022034.
[7] C.D. Jones, L. Contreras, P. Gartung, D. Hufnagel and L. Sexton-Kennedy, Using the CMS threaded framework in a production environment, J. Phys.: Conf. Ser. (2015) 072026.
[8] C.D. Jones et al., CMS event processing multi-core efficiency status, J. Phys.: Conf. Ser. (2017) 042008.
[9] A. Bocci, D. Dagenhart, V. Innocente, C. Jones, M. Kortelainen, F. Pantaleo et al., Bringing heterogeneity to the CMS software framework, arXiv:2004.04334 (2020).
[10] A. Dominguez, D. Abbaneo, K. Arndt, N. Bacchetta, A. Ball, E. Bartz et al., CMS Technical Design Report for the Pixel Detector Upgrade, Tech. Rep. CERN-LHCC-2012-016, CMS-TDR-11 (Sep. 2012).
[11] CMS Collaboration, Description and performance of track and primary-vertex reconstruction with the CMS tracker, JINST 9 (2014) P10009.
[12] F. Pantaleo, New Track Seeding Techniques for the CMS Experiment, 2017.
[13] R. Frühwirth, M. Regler, R. Bock, H. Grote and D. Notz, Data analysis techniques for high-energy physics, 2000.
[14] V. Blobel, A new fast track-fit algorithm based on broken lines, Nucl. Instrum. Meth. A 566 (2006) 14.
[15] G. Guennebaud, B. Jacob et al., "Eigen v3", http://eigen.tuxfamily.org, 2010.
[16] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in KDD, E. Simoudis, J. Han and U.M. Fayyad, eds., pp. 226–231, AAAI Press, 1996, http://dblp.uni-trier.de/db/conf/kdd/kdd96.html.
[17] R. Campello, D. Moulavi and J. Sander, Density-based clustering based on hierarchical density estimates, in Advances in Knowledge Discovery and Data Mining, J. Pei, V. Tseng, L. Cao, H. Motoda and G. Xu, eds., vol. 7819, Berlin, Heidelberg, Springer (2013).
[18] Intel Corp., "Intel Xeon Gold 6130 Processor", 2020.
[19] NVIDIA Corp., "NVIDIA T4 Tensor Core GPU", 2020.
[20] NVIDIA Corp., "NVIDIA Multi-Process Service", https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf, 2020.
[21] CMS Patatrack, "CMSSW_11_1_2_Patatrack", https://github.com/cms-patatrack/cmssw/tree/CMSSW_11_1_2_Patatrack, 2020.