Heterogeneous reconstruction of tracks and primary vertices with the CMS pixel tracker
Andrea Bocci, Matti Kortelainen, Vincenzo Innocente, Felice Pantaleo, Marco Rovere
CERN, European Organization for Nuclear Research, Meyrin, Switzerland
Fermi National Accelerator Laboratory, Batavia, Illinois, U.S.A.
E-mail: [email protected]
Abstract:
The High-Luminosity upgrade of the LHC will see the accelerator reach an instantaneous luminosity of 7 × 10³⁴ cm⁻² s⁻¹ with an average pileup of 200 proton-proton collisions. These conditions will pose an unprecedented challenge to the online and offline reconstruction software developed by the experiments. The computational complexity will exceed by far the expected increase in processing power for conventional CPUs, demanding an alternative approach. Industry and High-Performance Computing (HPC) centres are successfully using heterogeneous computing platforms to achieve higher throughput and better energy efficiency by matching each job to the most appropriate architecture. In this paper we describe the results of a heterogeneous implementation of the pixel track and vertex reconstruction chain on Graphics Processing Units (GPUs). The framework has been designed and developed to be integrated in the CMS reconstruction software, CMSSW. The speed-up achieved by leveraging GPUs allows more complex algorithms to be executed, obtaining better physics output and a higher throughput.
Keywords: GPU, Particle Track Reconstruction, Vertex Reconstruction, Heterogeneous Computing, Patatrack

1 Introduction

The High-Luminosity upgrade of the LHC [1] will pose unprecedented challenges to the reconstruction software used by the experiments, due to the increase both in instantaneous luminosity and readout rate. In particular, the CMS experiment at CERN [2] has been designed with a two-level trigger system: the Level 1 Trigger, implemented on custom-designed electronics, and the
High Level Trigger (HLT), a streamlined version of the CMS offline reconstruction software running on a computer farm. A software trigger system requires a trade-off between the complexity of the algorithms running on the available computing resources, the sustainable output rate, and the selection efficiency. When the HL-LHC becomes operational, it will reach a luminosity of 7 × 10³⁴ cm⁻² s⁻¹ with an average pileup of 200 proton-proton collisions. To fully exploit the higher luminosity, the CMS experiment will increase the full readout rate from 100 kHz to 750 kHz [3]. The higher luminosity, pileup and input rate present an exceptional challenge to the HLT, which will require a processing power larger than today's by more than an order of magnitude. This exceeds by far the expected increase in processing power for conventional CPUs, demanding alternative solutions.
A promising approach to mitigate this problem is heterogeneous computing. Heterogeneous computing systems gain performance and energy efficiency not by merely increasing the number of same-kind processors, but by employing different co-processors specifically designed to handle specific tasks in parallel. Industry and High-Performance Computing (HPC) centres are successfully exploiting heterogeneous computing platforms to achieve higher throughput and better energy efficiency by matching each job to the most appropriate architecture.
In order to investigate the feasibility of a heterogeneous approach in a typical High Energy Physics experiment, the authors developed a novel pixel track and vertex reconstruction chain within the official CMS reconstruction software, CMSSW [4]. The input to this chain is the RAW data coming directly from the detector's front-end electronics, while the output consists of legacy pixel tracks and vertices that can be transparently re-used by other components of the CMS reconstruction. The results shown in this article are based on Open Data released by CMS, while the data formats were derived from the CMS Experiment [5].
The development of a heterogeneous reconstruction faces several fundamental challenges:
• the adoption of a different programming paradigm;
• the experimental reconstruction framework and its scheduling must accommodate heterogeneous processing;
• the heterogeneous algorithms should achieve the same or better physics performance and processing throughput as their CPU counterparts;
• it must be possible to run and validate on conventional machines, without any dedicated resources.
This article is organized as follows: Section 2 describes the CMS heterogeneous framework, Section 3 discusses the algorithms developed in the Patatrack pixel track and vertex reconstruction workflow, Section 4 presents the physics results and the computational performance and compares them to the CMS pixel track reconstruction used at the HLT for data taking in 2018, and Section 5 contains our conclusions.

2 The CMS heterogeneous framework

The backbone of the CMS data processing software,
CMSSW, is a rather generic framework that processes independent chunks of data [4]. These chunks of data are called events, and in CMS correspond to one full readout of the detector. Consecutive events with uniform calibration data are grouped into luminosity blocks, which are further grouped into longer runs. The data are processed by modules that communicate via a type-safe C++ container called the event (or luminosity block or run for the larger units). An analyzer can only read data products, a producer can read and write new data products, and a filter can, in addition, decide whether the processing of a given event should be stopped. Data products become immutable (more precisely, const in the C++11 sense) after being inserted into the event.
During the Long Shutdown 1 and Run 2 of the LHC, the CMSSW framework gained multi-threading capabilities [6–8] implemented with the Intel Threading Building Blocks (TBB) library. The threading model employs task-level parallelism to process concurrently independent modules within the same or different events, multiple events within the same or different luminosity blocks, and intervals of validity of the calibration data. Currently, run boundaries incur a barrier-style synchronization point in the processing. A recent extension is the concept of the external worker, a generic mechanism that allows producers in CMSSW to offload asynchronous work outside of the framework scheduler. More details on the concept of the external worker and its interaction with CUDA can be found in [9].
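The following standalone CUDA sketch illustrates the offload-and-notify idea behind the external worker: work is submitted asynchronously to the GPU on a stream and a host callback fires once it has completed, so the calling CPU thread is free to process other modules in the meantime. This is not the CMSSW interface itself (which is described in [9]); the names offloadWork and onWorkDone are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel standing in for a reconstruction step offloaded to the GPU.
__global__ void scaleKernel(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Host callback invoked by the CUDA runtime once all preceding work on the
// stream has completed; conceptually, this is where an external worker would
// notify the framework scheduler that the results are ready to be collected.
void CUDART_CB onWorkDone(void* userData) {
  std::printf("GPU work for event %zu done, notify the framework\n", (size_t)userData);
}

void offloadWork(cudaStream_t stream, float* d_data, int n, size_t eventId) {
  scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 2.f);
  // Enqueue the notification; the calling CPU thread returns immediately.
  cudaLaunchHostFunc(stream, onWorkDone, (void*)eventId);
}

int main() {
  const int n = 1 << 20;
  float* d_data;
  cudaMalloc(&d_data, n * sizeof(float));
  cudaMemset(d_data, 0, n * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  offloadWork(stream, d_data, n, 42);   // asynchronous: no blocking here
  cudaStreamSynchronize(stream);        // only needed for this standalone example
  cudaStreamDestroy(stream);
  cudaFree(d_data);
}
```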
3 The Patatrack pixel track and vertex reconstruction

Precise track reconstruction becomes more challenging at higher pileup, as the number of vertices and the number of tracks increase, making the pattern recognition and the classification of hits produced by the same charged particle a harder combinatorial problem. To mitigate the complexity of the problem, the authors developed parallel algorithms that can perform the track reconstruction on GPUs, starting from the "raw data" of the CMS pixel detector, as described later in this section. The steps performed during the track and vertex reconstruction are illustrated in Fig. 1.
Figure 1. Steps involved in the track and vertex reconstruction, starting from the pixel "raw data".
The data structures (structure of arrays, SoA) used by the parallel algorithms are optimized for coalesced memory access on the GPU and differ substantially from the ones used by the standard reconstruction in CMS (legacy data formats). The data transfers between CPU and GPU, and the conversions between legacy and optimised formats, are very time-consuming operations. For this reason the authors decided to design a fully contained chain of modules that runs on the GPU starting from the "raw data" and produces the final tracks and vertices as output. While a "mixed CPU-GPU workflow" is not supported, for validation purposes the intermediate data products can be transferred from the GPU to the CPU and converted to the corresponding legacy data formats.
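As an illustration of the difference between a legacy array-of-structures layout and the SoA layout used on the GPU, consider the simplified hit container below. The field names and sizes are illustrative assumptions, not the actual CMSSW data formats; the point is that consecutive threads reading the same field of consecutive hits access contiguous memory, which the GPU can coalesce into a single transaction.

```cuda
#include <cstdint>

// Legacy-style array of structures (AoS): the fields of one hit are contiguous,
// but the same field of consecutive hits is strided in memory.
struct HitAoS {
  float x, y, z;
  uint16_t detId;
  uint16_t charge;
};

// Structure of arrays (SoA): the same field of consecutive hits is contiguous,
// so threads i, i+1, i+2, ... perform coalesced reads and writes.
struct HitsSoA {
  static constexpr int kMaxHits = 48 * 1024;
  float x[kMaxHits];
  float y[kMaxHits];
  float z[kMaxHits];
  uint16_t detId[kMaxHits];
  uint16_t charge[kMaxHits];
  int nHits;
};

__global__ void shiftHitsZ(HitsSoA* hits, float dz) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < hits->nHits)
    hits->z[i] += dz;   // coalesced: thread i touches z[i]
}
```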
Figure 2. Longitudinal sketch of the pixel detector geometry.

3.1 Local reconstruction

The CMS "Phase 1" pixel detector [10], installed in 2017, will serve as the vertex detector until the major "Phase 2" upgrade for the HL-LHC. It consists of 1856 sensors with 66,560 pixels each, for a total of 124 million pixels, corresponding to about 2 m² of total area. The pixel size is 100 µm × 150 µm and the thickness of the sensitive volume is 285 µm. The sensors are arranged in four "barrel" layers and six "endcap" disks, three on each side, to provide four-hit pixel coverage up to a pseudorapidity of |η| < 2.5. The CMS pixel detector geometry is sketched in Fig. 2. The barrel layers extend for about 26 cm along the beam line on either side of the interaction point.
The first step of the chain is the local reconstruction, which reconstructs the information about the individual hits in the detector. During this phase, the digitized information is unpacked and interpreted to create digis: each digi represents a single pixel with a charge above the signal-over-noise threshold, and contains information about the collected charge and the local row and column position in the module grid. This process is parallelized on two levels: information coming from different modules is processed in parallel by independent blocks of threads, while each digi within a module is assigned a unique index and is processed by a different thread.
Neighbouring digis are grouped together to form clusters using an iterative process. Digis within each module are laid out on a two-dimensional grid using their row, column and unique index information. Each digi is then assigned to a thread. If two or more adjacent digis are found, the one with the smaller index becomes the seed for the others. This procedure is repeated until all the digis have been assigned to a seed and no further changes are possible. The outcome of the clusterization is a cluster index for each digi: a thread is allocated to each seed; a global atomic counter is increased by all threads, returning the unique cluster index for each seed, and thus for each digi (a minimal sketch of this step is given at the end of this subsection).
Finally, the shape of the clusters and the charge of the digis are used to determine the hit position and its uncertainty in the coordinates local to the module, as described in [11].
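The sketch below illustrates, for the digis of a single module, the iterative label-propagation clustering just described: each thread owns one digi, repeatedly adopts the smallest label among its neighbours until nothing changes, and the seeds then obtain a unique cluster index from an atomic counter. The data layout, the brute-force neighbour scan and the assumption that one block covers all the digis of a module are simplifications for illustration, not the actual CMSSW implementation.

```cuda
#include <cstdint>

// One thread block per module, one thread per digi (illustrative layout; the
// block size is assumed to be at least nDigis for this simplified sketch).
__global__ void clusterizeModule(const uint16_t* row, const uint16_t* col,
                                 int nDigis, int* clusterId, int* nClusters) {
  int i = threadIdx.x;
  if (i < nDigis) clusterId[i] = i;            // every digi starts as its own seed
  __shared__ bool changed;
  do {
    __syncthreads();
    if (threadIdx.x == 0) changed = false;
    __syncthreads();
    if (i < nDigis) {
      for (int j = 0; j < nDigis; ++j) {        // brute-force neighbour scan
        int dr = (int)row[i] - (int)row[j];
        int dc = (int)col[i] - (int)col[j];
        bool adjacent = dr >= -1 && dr <= 1 && dc >= -1 && dc <= 1;
        if (adjacent && clusterId[j] < clusterId[i]) {
          clusterId[i] = clusterId[j];          // adopt the smaller label
          changed = true;
        }
      }
    }
    __syncthreads();
  } while (changed);
  // Seeds (label == own index) obtain a unique cluster index from an atomic
  // counter; the other digis then pick up the index from their seed.
  if (i < nDigis && clusterId[i] == i)
    clusterId[i] = -(1 + atomicAdd(nClusters, 1));
  __syncthreads();
  if (i < nDigis && clusterId[i] >= 0)
    clusterId[i] = clusterId[clusterId[i]];
  // Final cluster indices are encoded as negative numbers -(index + 1).
}
```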
3.2 n-tuplets

Clusters are linked together to form n-tuplets that are later fitted to extract the final track parameters. The n-tuplet production proceeds through the following steps:
• creation of doublets;
• connection of doublets;
• identification of root doublets;
• depth-first search (DFS) from each root doublet.
The doublets are created by connecting hits belonging to adjacent pairs of pixel detector layers, as illustrated by the solid arrows in Fig. 3. To account for geometrical and detector inefficiencies, doublets are also created between chosen pairs of non-adjacent layers, as illustrated by the dashed arrows in Fig. 3.

Figure 3. Combinations of pixel layers that can create doublets directly (solid arrows), or by skipping a layer to account for geometrical acceptance (dashed arrows).

Various selection criteria are applied to reduce the combinatorics. The following criteria have a strong impact on timing and physics performance:
• p_T^min: searching for low transverse momentum tracks can be very computationally expensive. Setting a minimum threshold for p_T limits the possible curvature, hence reducing the number of possible combinations of hits.
• R_max and z_max: the maximum transverse and longitudinal distances of closest approach with respect to the beam spot. Tracks produced within a radius of less than 1 mm around the beam spot are called prompt tracks. Searching for detached tracks with a larger value of R_max leads to an increase in combinatorics. These "alignment criteria" are illustrated in Fig. 4.
• n_hits: requiring a high number of hits in the n-tuplets leads to a purer set of tracks, so the cuts can be loosened, while a lower number of hits yields a higher efficiency at the cost of a higher fake rate.

Figure 4. Windows opened in the transverse and longitudinal planes. The outer hit is colored in red, the inner hits in blue [12].

Hits within each layer are arranged in a tiled data structure along the azimuthal (φ) direction for optimal performance. The search for compatible hit pairs is performed in parallel by different threads, each starting from a different outer hit. The pairs of inner and outer hits that satisfy the alignment criteria and have compatible cluster sizes along the z direction form a doublet. The cuts applied during the doublet building are described in Table 1, and their impact on the physics results and reconstruction time is reported in Tables 2 and 3.
The doublets that share a common hit are tested for compatibility to form a triplet. The compatibility requires that the three hits are aligned in the R-z plane, and that the circumference passing through them intersects the beam-spot compatibility region defined by R_max. All doublets from all layer pairs are tested in parallel.
All compatible doublets form a directed acyclic graph. All the doublets whose inner hit lies on BPix1 are marked as root doublets. To reconstruct "outer" triplets, doublets starting on BPix2 or the two FPix1 layers and without inner neighbours are also marked as root. Each root doublet is subsequently assigned to a different thread that performs a depth-first search (DFS) over the directed acyclic graph starting from it. A DFS is used because one may want to search for all the n-tuplets up to n hits. The advantage of this approach is that the buckets containing triplets and quadruplets are disjoint sets, as a triplet could not have been extended further to become a quadruplet [12].
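As an illustration of this step, the sketch below performs an iterative depth-first search from a root doublet over a simplified cell graph, storing an n-tuplet only when the path cannot be extended further, so that triplet and quadruplet buckets stay disjoint. A path of k cells corresponds to an n-tuplet of k + 1 hits. The Cell structure, the fixed-size stack and all names are illustrative assumptions, not the actual CMSSW data model.

```cuda
// A cell is a doublet of hits; outerNeighbours lists the cells whose inner hit
// coincides with this cell's outer hit (illustrative, simplified data model).
struct Cell {
  int innerHit;
  int outerHit;
  int nNeighbours;
  int outerNeighbours[8];
};

// Iterative DFS from one root cell. minDepth/maxDepth are expressed in cells
// (e.g. 2 and 3 for triplets and quadruplets) and must not exceed 8 here.
__device__ void dfsFromRoot(const Cell* cells, int root, int minDepth, int maxDepth,
                            int* tuples, int* nTuples, int tupleStride) {
  int stack[8];    // current path of cell indices
  int cursor[8];   // next neighbour to visit at each depth
  int depth = 0;
  stack[0] = root;
  cursor[0] = 0;
  while (depth >= 0) {
    const Cell& c = cells[stack[depth]];
    bool atMax = (depth + 1 == maxDepth);
    if (!atMax && cursor[depth] < c.nNeighbours) {
      int next = c.outerNeighbours[cursor[depth]++];
      ++depth;                                   // extend the path
      stack[depth] = next;
      cursor[depth] = 0;
    } else {
      // Store only paths that cannot be extended further and are long enough.
      if ((atMax || c.nNeighbours == 0) && depth + 1 >= minDepth) {
        int slot = atomicAdd(nTuples, 1);
        for (int d = 0; d <= depth; ++d)
          tuples[slot * tupleStride + d] = stack[d];
      }
      --depth;                                   // backtrack
    }
  }
}

// One thread per root doublet, as described in the text.
__global__ void findNtuplets(const Cell* cells, const int* roots, int nRoots,
                             int minDepth, int maxDepth,
                             int* tuples, int* nTuples, int tupleStride) {
  int r = blockIdx.x * blockDim.x + threadIdx.x;
  if (r < nRoots)
    dfsFromRoot(cells, roots[r], minDepth, maxDepth, tuples, nTuples, tupleStride);
}
```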
Cut      Description
PhiHist  binned φ window between inner and outer hit, using a 128-bin histogram
PhiW     PhiHist + tuned φ window between inner and outer hit
ZW       window in z for the inner hit
ZIP      cut on the impact parameter along the beam axis
PT       cut on the curvature assuming zero transverse impact parameter (equivalent to a cut on the TIP for high-p_T tracks)
CSZ      cut on the cluster size compatibility

Table 1. Description of the cuts applied during the reconstruction of doublets.
Cuts                  Doublets    n-tuplets  Tracks  Unconnected fraction
PhiHist               1,268,193   23,254     1,256   0.966
PhiHist+ZW            866,316     18,301     1,266   0.966
PhiHist+ZW+ZIP        269,410     11,235     1,265   0.926
PhiW+ZW               594,739     13,403     1,212   0.958
PhiW+ZW+ZIP           185,642     8,327      1,214   0.919
PhiW+ZW+ZIP+CSZ       129,307     6,060      1,087   0.915
PhiW+ZW+ZIP+PT        164,567     7,273      1,141   0.921
PhiW+ZW+ZIP+PT+CSZ    115,248     5,270      999     0.918

Table 2. Average number of doublets, n-tuplets and final tracks per event, as well as the fraction of cells not connected, for each set of doublet reconstruction cuts (described in Table 1), running over a sample of tt̄ events with an average pileup of 50 and an average of 15,000 hits per event.

                      Time in µs
Cuts                  doublets  connect  DFS    clean
PhiHist               6,123     15,127   1,690  1,976
PhiHist+ZW            950       6,582    778    538
PhiHist+ZW+ZIP        310       488      354    237
PhiW+ZW               552       2,995    549    377
PhiW+ZW+ZIP           271       265      274    183
PhiW+ZW+ZIP+CSZ       291       187      216    154
PhiW+ZW+ZIP+PT        259       156      246    125
PhiW+ZW+ZIP+PT+CSZ    280       108      192    114

Table 3. Time spent in the three components of the n-tuplet building (doublet creation, doublet connection, DFS), as well as in the Fishbone and ambiguity resolution algorithms ("clean"), for each set of doublet reconstruction cuts (described in Table 1). It should be noted that using very relaxed cuts requires larger memory buffers on the GPU, up to 12 GB, while running with the last four sets requires less than 2 GB of memory.

3.3 Fishbone n-tuplets

Full hit coverage in the instrumented pseudorapidity range is achieved in modern pixel detectors via partially overlapping sensitive layers. This, at the same time, mitigates the impact of possible localized hit inefficiencies. With this design, though, requiring at most one hit per layer can lead to several n-tuplets corresponding to the same particle. This is particularly relevant in the forward region due to the design of the pixel forward disks, illustrated in Fig. 5: up to four hits in the same layer can be found in localized forward areas.
The Fishbone n-tuplet resolves these ambiguities by merging overlapping doublets. The Fishbone mechanism is active while creating the doublets: among all the aligned doublets that share the same outermost hit, only the shortest one is kept. In this way the ambiguities are resolved and a single Fishbone n-tuplet is created. Furthermore, among all the tracks that share a hit doublet, only the ones with the largest number of hits are retained.

Figure 5. A typical Fishbone n-tuplet. The shaded areas indicate partially overlapping modules in the same layer.
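A minimal sketch of the Fishbone selection is given below: for every outer hit, among the doublets pointing back to it whose directions are nearly parallel, only the shortest one survives. The Doublet record, the CSR-style index per outer hit and the pairwise comparison are simplified assumptions for illustration.

```cuda
// Illustrative doublet record.
struct Doublet {
  int innerHit;
  int outerHit;
  float dir[3];    // unit vector from the inner to the outer hit
  float length;    // distance between the two hits
  bool killed;     // set to true when removed by the Fishbone
};

__device__ float dot3(const float* a, const float* b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// One thread per outer hit; firstDoubletOfHit gives, for each outer hit, the
// range of doublets pointing to it (illustrative CSR-style layout).
__global__ void fishbone(Doublet* doublets, const int* firstDoubletOfHit,
                         int nOuterHits, float cosAlignCut) {
  int hit = blockIdx.x * blockDim.x + threadIdx.x;
  if (hit >= nOuterHits) return;
  int begin = firstDoubletOfHit[hit];
  int end = firstDoubletOfHit[hit + 1];
  for (int i = begin; i < end; ++i)
    for (int j = i + 1; j < end; ++j) {
      // Two doublets are "aligned" when their directions are nearly parallel.
      if (dot3(doublets[i].dir, doublets[j].dir) < cosAlignCut) continue;
      if (doublets[i].length <= doublets[j].length)
        doublets[j].killed = true;   // keep only the shorter doublet
      else
        doublets[i].killed = true;
    }
}
```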
3.4 n-tuplet fit

The "Phase 1" upgraded pixel detector has one more barrel layer and one additional disk on each side with respect to the previous detector. The possibility of using four (or more) hits from distinct layers opens new opportunities for the pixel track fitting method. It is possible not only to obtain a better statistical estimation of the track parameters (d_z, cot θ, d_0, p_T and φ [13]) thanks to the additional point, but also to include in the fitting procedure more realistic effects, such as the energy loss and the multiple scattering of the particle due to its interaction with the material of the detector.
The pixel track reconstruction developed by the authors includes a multiple-scattering-aware fit: the Broken Line fit [14]. It follows three main steps:
• a fast pre-fit in the transverse plane gives an estimate of the track momentum, used to compute the multiple scattering contribution;
• a line fit in the s-z plane;
• a circle fit in the transverse plane.
The d_z and cot θ track parameters and their covariance matrices are derived from the line fit, while d_0, p_T and φ, and their covariance matrices, are derived from the circle fit. The final track parameters and covariance matrix are computed by combining the individual results. The fits are performed in parallel over all n-tuplets, using one thread per n-tuplet. The fit implementation uses the Eigen C++ library [15], which natively supports CUDA.

3.5 Ambiguity resolution

Tracks that share a hit doublet are considered "ambiguous" and only the one with the best χ² is retained. Triplets are considered "ambiguous" if they share one hit: only the one with the smallest transverse impact parameter is retained.

3.6 Pixel vertices

The fitted pixel tracks are subsequently used to form pixel vertices. Vertices are searched for as clusters in the z coordinate of the point of closest transverse approach of each track to the beam line (z). Only tracks with at least 4 hits and a p_T larger than a configurable threshold (0.5 GeV) are considered. For each track with an uncertainty on z lower than a configurable threshold, the local density of close-by tracks is computed. Tracks enter the density calculation if they are within a certain ∆z cut and if their χ² compatibility is below a configurable maximum. Tracks with a local density greater than 1 are considered as seeds for a vertex. Each track is then linked to another track that has a higher local density, if the distance between the two tracks is smaller than the ∆z cut and if their χ² compatibility is below the same maximum. All the tracks that are logically linked starting from each seed become part of the same vertex candidate. Each vertex candidate is promoted to a final vertex if it contains at least 2 tracks.
This algorithm is easily parallelizable and, in one dimension as in this case, requires no iterations. It is less sensitive to noise (fake tracks) and has a lower merge rate than a standard DBSCAN [16]. It is much faster than any hierarchical algorithm [17] or algorithms based on deterministic annealing [11]. The position of each vertex along the beam line is obtained from the z of the contributing tracks. Vertices with a χ² larger than a given threshold (9 per degree of freedom) are split in two using a k-means algorithm. Finally, the vertices are sorted by the sum of the squared transverse momenta (Σ p_T²) of the contributing tracks. The vertex with the largest Σ p_T² is labelled as the "primary" vertex, i.e. the vertex corresponding to the signal (triggering) event.
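The sketch below illustrates the core of such a one-dimensional density-based clustering in z: each track counts its neighbours within ∆z, links itself to the closest denser neighbour, and the links are followed up to a seed that identifies the vertex candidate. The array names are assumptions, and the χ² compatibility test and the seed/candidate promotion step are omitted for brevity; this is an illustration of the idea, not the Patatrack implementation.

```cuda
#include <cmath>

// ztrack[i]: z of track i at its point of closest approach to the beam line.

// Step 1: local density = number of tracks within deltaZ (brute force for clarity).
__global__ void computeDensity(const float* ztrack, int nTracks, float deltaZ,
                               int* density) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  int rho = 0;
  for (int j = 0; j < nTracks; ++j)
    if (fabsf(ztrack[j] - ztrack[i]) < deltaZ) ++rho;
  density[i] = rho;
}

// Step 2: link each track to the closest denser track within deltaZ;
// a track with no denser neighbour keeps a link to itself (a potential seed).
__global__ void linkToDenser(const float* ztrack, const int* density, int nTracks,
                             float deltaZ, int* linkTo) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  linkTo[i] = i;
  float best = deltaZ;
  for (int j = 0; j < nTracks; ++j) {
    float d = fabsf(ztrack[j] - ztrack[i]);
    if (d < best && density[j] > density[i]) { best = d; linkTo[i] = j; }
  }
}

// Step 3: follow the links up to the seed; all tracks reaching the same seed
// belong to the same vertex candidate.
__global__ void collectVertices(const int* linkTo, int nTracks, int* vertexSeed) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nTracks) return;
  int j = i;
  while (linkTo[j] != j) j = linkTo[j];
  vertexSeed[i] = j;   // index of the seed track of this candidate
}
```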
4 Results

In this section the performance of the Patatrack reconstruction is evaluated and compared to the track reconstruction that CMS used for data taking in 2018 (referred to in the following as CMS-2018) [12].
4.1 Simulated dataset

The performance studies have been carried out using 20,000 tt̄ simulated events from CMS Open Data [5], with an average of 50 superimposed pileup collisions at a centre-of-mass energy √s = 13 TeV, using detector design conditions.

4.2 Physics performance

The efficiency is defined as the fraction of simulated tracks, N_sim, having produced at least three hits in the pixel detector, that are associated with at least one reconstructed track, N_rec:

    efficiency = N_rec / N_sim .    (4.1)

A reconstructed pixel track is associated with a simulated track if all the hits that it contains come from that simulated track. The efficiency is computed only for tracks coming from the hard interaction and not for those from the pileup. The CPU and GPU versions of the Patatrack workflow produce the same physics results, as shown in Fig. 6. For this reason, no further distinction is made between the workflows running on CPU and GPU in the discussion of the physics results.
The quadruplet efficiency is significantly improved by the Patatrack quadruplets workflow with respect to CMS-2018, as shown in Fig. 7. The main reasons for this improvement are the possibility to skip a layer outside the geometrical acceptance when building doublets and the use of different Cellular Automaton cuts for the barrel and the end-caps. The efficiency can be further improved by including the pixel tracks built from triplets (Patatrack triplets).
The fake rate is defined as the fraction of all the reconstructed tracks coming from a reconstructed primary vertex that are not associated uniquely to a simulated track. In the case of a fake track, the set of hits used to reconstruct the track does not belong to the same simulated track. As shown in Fig. 8, the fake rate of the Patatrack quadruplets is improved with respect to the CMS-2018 pixel reconstruction in the end-cap region, mainly thanks to the different treatment of the end-caps in the Cellular Automaton. The inclusion of the pixel tracks built from Patatrack triplets slightly increases the fake rate for the tracks coming from the primary vertices, since loosening the requirement on the number of hits decreases the quality of the selection cuts.
A duplicate track is a reconstructed track matched to a simulated track that has itself been matched to at least two reconstructed tracks. The introduction of the Fishbone algorithm improves the duplicate rejection in the Patatrack workflows by up to two orders of magnitude with respect to the CMS-2018 pixel track reconstruction, as shown in Fig. 9.
For historical reasons the CMS-2018 pixel reconstruction does not perform a fit of the n-tuplets in the transverse plane, and considers instead only the first three hits for the track parameter estimation. Furthermore, the errors on the track parameters are taken from a look-up table parameterized in η and p_T. The improvement brought by the Broken Line fit to the accuracy of the fitted parameters can be quantified by looking at the resolutions, defined as

    resolution = σ(fitted value − true value) .    (4.2)

The p_T resolution is improved by up to a factor of 2 when compared to the CMS-2018 pixel tracking (Fig. 10). The resolution of the transverse impact parameter d_xy also improves, especially in the barrel (Fig. 11).
Figure 6. Comparison of the pixel track reconstruction efficiency of the CPU and GPU versions of the Patatrack pixel reconstruction for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) efficiency vs p_T (|η| < 2.5), (b) efficiency vs η (p_T > 0.9 GeV).
Figure 7. Pixel track reconstruction efficiency for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) efficiency vs p_T, (b) efficiency vs η (p_T > 0.9 GeV). The Patatrack reconstruction producing pixel tracks from n-tuplets with at least three hits (triplets) and at least four hits (quadruplets) is compared to the CMS-2018 reconstruction.
Figure 8. Pixel track reconstruction fake rate, for tracks from the primary vertex, for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) fake rate vs p_T, (b) fake rate vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.
Figure 9. Pixel track reconstruction duplicate rate for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) duplicate rate vs p_T, (b) duplicate rate vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.
Figure 10. Pixel track relative p_T resolution for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) resolution vs p_T, (b) resolution vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.

The CMS-2018 pixel tracking behaves better in the longitudinal plane than it does in the transverse plane. However, the improvement brought by the Broken Line fit to the estimate of the longitudinal impact parameter d_z is still visible for higher-p_T tracks, as shown in Fig. 12.
The number of reconstructed vertices, together with the capability to separate two close-by vertices, has been measured to estimate the performance of the vertexing algorithm. This capability can be quantified by measuring the vertex merge rate, i.e. the probability of reconstructing two different simulated vertices as a single vertex. Figure 13 shows how the vertexing performance evolves with the number of simulated proton interactions.

4.3 Computing performance

The hardware and software configurations used to carry out the computing performance measurements are:
• a dual-socket Intel Xeon Gold 6130 [18], 2 × 16 physical cores, 64 hardware threads;
• a single NVIDIA T4 [19];
• NVIDIA CUDA 11 with the Multi-Process Service [20];
• CMSSW_11_1_2_Patatrack [21].
A CMSSW reconstruction sequence that runs only the pixel reconstruction modules described in Section 3 was created. More than one event can be processed in parallel by different CPU threads; these can perform asynchronous operations, such as kernel launches and memory transfers, in parallel on the same GPU. The maximum number of events that can be processed in parallel today is about 80, limited by the amount of GPU memory that must be allocated for each event.
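The sketch below shows, in a simplified standalone form, how several CPU threads can each drive their own CUDA stream on the same GPU and how the event throughput can be measured in events per second. It illustrates the measurement concept only; the kernel, the worker structure and the numbers are assumptions, not the CMSSW benchmark itself.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void dummyReco(float* data, int n) {        // stand-in for the pixel chain
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 0.5f + 1.f;
}

// Each worker owns one CUDA stream and processes its share of the events, so
// kernels and copies from different workers can overlap on the same GPU.
void worker(int eventsPerWorker, int n) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  float* d_data;
  cudaMalloc(&d_data, n * sizeof(float));
  cudaMemset(d_data, 0, n * sizeof(float));
  for (int ev = 0; ev < eventsPerWorker; ++ev)
    dummyReco<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
  cudaStreamSynchronize(stream);
  cudaFree(d_data);
  cudaStreamDestroy(stream);
}

int main() {
  const int nWorkers = 8, eventsPerWorker = 1000, n = 1 << 18;
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> pool;
  for (int w = 0; w < nWorkers; ++w)
    pool.emplace_back(worker, eventsPerWorker, n);
  for (auto& t : pool) t.join();
  double seconds =
      std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
  std::printf("throughput: %.1f events/s\n", nWorkers * eventsPerWorker / seconds);
}
```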
Figure 11. Pixel track transverse impact parameter (d_xy) resolution, in µm, for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) resolution vs p_T, (b) resolution vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.
Figure 12. Pixel track longitudinal impact parameter (d_z) resolution, in µm, for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) resolution vs p_T, (b) resolution vs η (p_T > 0.9 GeV). The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.
Figure 13. Pixel vertex reconstruction efficiency and merge rate for simulated tt̄ events with an average of 50 superimposed pileup collisions: (a) number of reconstructed vertices vs number of simulated interactions, (b) vertex merge rate vs number of simulated interactions. The Patatrack triplets and quadruplets workflows are compared to the CMS-2018 reconstruction.

In a data streaming application the measurement of the throughput, i.e. the number of reconstructed events per unit time, is a more representative metric than the measurement of the latency. The benchmark runs 8 independent CMSSW jobs, each reconstructing 8 events in parallel with 8 CPU threads. The throughput of the CMS-2018 reconstruction has been compared to that of the Patatrack quadruplets and triplets workflows. The test includes both the GPU and the CPU versions of the Patatrack workflows. The Patatrack workflows run with three different configurations:
1. no copy: the SoA containing the results stays in the memory where it has been produced;
2. copy, no conversion: the SoA containing the results is copied to the host, if initially produced on the GPU;
3. copy, conversion: the SoA containing the results is copied to the host and converted to the legacy CMS-2018 pixel tracks and vertices data formats.
These configurations are useful to understand the impact of optimizing a potential consumer of the GPU results, either so that it runs on the GPU in the same reconstruction sequence or so that it can consume GPU-friendly data structures, with respect to interfacing the Patatrack workflows to the existing framework without any further optimization.
The results of the benchmark are shown in Table 4.

                           Throughput in events/s
Configuration     Triplets CPU  Triplets GPU  Quadruplets CPU  Quadruplets GPU  CMS-2018
no copy           611           870           892              1386             476
copy, no conv.    —             867           —                1372             —
conversion        585           861           855              1352             —

Table 4. Throughput of the Patatrack triplets and quadruplets workflows when executed on GPU and CPU, compared to the CMS-2018 reconstruction. The benchmark is configured to reconstruct 64 events in parallel. Three different configurations have been compared: in "no copy" the result is not copied from the memory of the device where it was initially produced; in "copy, no conv." the SoA containing the result produced on the GPU is copied to the host memory; in "conversion" the SoA containing the result is copied to the host memory (if needed) and then converted to the legacy data format used for the pixel tracks and vertices by the CMS reconstruction.

The benchmark shows that a single NVIDIA T4 can achieve almost three times the performance of a full dual-socket Intel Xeon
Skylake node when running the Patatrack pixel quadruplets reconstruction. Achieving even better physics performance by also producing pixel tracks from triplets has the effect of almost halving the throughput. Copying the results from the GPU memory to the host memory has a small impact on the throughput, thanks to the possibility of hiding the latency by overlapping the execution of kernels with the copies. Converting the SoA results to the legacy data format also has a small impact on the throughput, but comes with a hidden cost: the conversion takes almost 100% of the machine's processing power. This can be avoided by migrating all the consumers to the SoA data format.
5 Conclusions

The future runs of the Large Hadron Collider (LHC) at CERN will pose significant challenges to the event reconstruction software, due to the increase in both event rate and complexity. For track reconstruction algorithms, the number of combinations that have to be tested does not scale linearly with the number of simultaneous proton collisions.
The work described in this article presents innovative ways to solve the problem of tracking in a pixel detector such as the CMS one, by making use of heterogeneous computing systems in a production-like data taking environment, while being integrated in the CMS experimental software framework, CMSSW. The assessment of the physics and timing performance of the Patatrack reconstruction demonstrated that it can improve the physics performance while being significantly faster than the existing implementation. The possibility to configure the Patatrack reconstruction workflow to run on CPU, or to transfer and convert the results to the CMS data format, makes it possible to run and validate the workflow on conventional machines, without any dedicated resources.
This work sets the foundations for the development of heterogeneous algorithms in HEP, both from the algorithmic and from the framework scheduling points of view. Other parts of the reconstruction, e.g. calorimeters or Particle Flow, will be able to benefit from an algorithmic and data structure redesign in order to run efficiently on GPUs. The ability to run on other accelerators with performance-portable code is also being explored, to ease the maintainability and testability of a single source.
Acknowledgements
We thank our colleagues of the CMS collaboration for publishing high quality simulated data under the open access policy. We also would like to thank the Patatrack students and alumni R. Ribatti and G. Tomaselli for their hard work and dedication. We thank the CERN openlab for providing a platform for discussion on heterogeneous computing and for facilitating knowledge transfer and support between Patatrack and industrial partners. This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics.
References

[1] G. Apollinari, O. Brüning, T. Nakamoto and L. Rossi, High-Luminosity Large Hadron Collider (HL-LHC), arXiv:1705.08830 (2017).
[2] CMS Collaboration, The CMS experiment at the CERN LHC, JINST 3 (2008) S08004.
[3] CMS Collaboration, The Phase-2 Upgrade of the CMS Level-1 Trigger, Tech. Rep. CERN-LHCC-2020-004, CMS-TDR-021, CERN, Geneva (Apr. 2020).
[4] C.D. Jones, M. Paterno, J. Kowalkowski, L. Sexton-Kennedy and W. Tanenbaum, The new CMS event data model and framework, in Proceedings of the International Conference on Computing in High Energy and Nuclear Physics (CHEP06), 2006, https://indico.cern.ch/event/408139/.
[5] CMS Collaboration, TTToHadronic TuneCP5 13TeV-powheg-pythia8 in FEVTDEBUGHLT format for LHC Phase2 studies, CERN Open Data Portal.
[6] C.D. Jones and E. Sexton-Kennedy, Stitched together: transitioning CMS to a hierarchical threaded framework, J. Phys.: Conf. Ser. (2014) 022034.
[7] C.D. Jones, L. Contreras, P. Gartung, D. Hufnagel and L. Sexton-Kennedy, Using the CMS threaded framework in a production environment, J. Phys.: Conf. Ser. (2015) 072026.
[8] C.D. Jones et al., CMS event processing multi-core efficiency status, J. Phys.: Conf. Ser. (2017) 042008.
[9] A. Bocci, D. Dagenhart, V. Innocente, C. Jones, M. Kortelainen, F. Pantaleo et al., Bringing heterogeneity to the CMS software framework, arXiv:2004.04334 (2020).
[10] A. Dominguez, D. Abbaneo, K. Arndt, N. Bacchetta, A. Ball, E. Bartz et al., CMS Technical Design Report for the Pixel Detector Upgrade, Tech. Rep. CERN-LHCC-2012-016, CMS-TDR-11 (Sep. 2012).
[11] CMS Collaboration, Description and performance of track and primary-vertex reconstruction with the CMS tracker, JINST 9 (2014) P10009.
[12] F. Pantaleo, New Track Seeding Techniques for the CMS Experiment, 2017.
[13] R. Frühwirth, M. Regler, R. Bock, H. Grote and D. Notz, Data analysis techniques for high-energy physics, 2000.
[14] V. Blobel, A new fast track-fit algorithm based on broken lines, Nucl. Instrum. Meth. A 566 (2006) 14.
[15] G. Guennebaud, B. Jacob et al., "Eigen v3", http://eigen.tuxfamily.org, 2010.
[16] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in KDD, E. Simoudis, J. Han and U.M. Fayyad, eds., pp. 226–231, AAAI Press, 1996, http://dblp.uni-trier.de/db/conf/kdd/kdd96.html.
[17] R. Campello, D. Moulavi and J. Sander, Density-based clustering based on hierarchical density estimates, in Advances in Knowledge Discovery and Data Mining, J. Pei, V. Tseng, L. Cao, H. Motoda and G. Xu, eds., vol. 7819, Berlin, Heidelberg, Springer (2013).
[18] Intel Corp., "Intel Xeon Gold 6130 Processor", 2020.
[19] NVIDIA Corp., "NVIDIA T4 Tensor Core GPU", 2020.
[20] NVIDIA Corp., "NVIDIA Multi-Process Service", https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf, 2020.
[21] CMS Patatrack, "CMSSW_11_1_2_Patatrack", https://github.com/cms-patatrack/cmssw/tree/CMSSW_11_1_2_Patatrack, 2020.