Parallelizing the Unpacking and Clustering of Detector Data for Reconstruction of Charged Particle Tracks on Multi-core CPUs and Many-core GPUs
Giuseppe Cerati, Peter Elmer, Brian Gravelle, Matti Kortelainen, Vyacheslav Krutelyov, Steven Lantz, Mario Masciovecchio, Kevin McDermott, Boyana Norris, Allison Reinsvold Hall, Michael Reid, Daniel Riley, Matevž Tadel, Peter Wittich, Bei Wang, Frank Würthwein, Avraham Yagil
Proceedings of CTD 2020
PROC-CTD2020-37
January 28, 2021
Cornell University, Ithaca, NY 14853, USA; Fermi National Accelerator Laboratory, Batavia, IL 60510, USA; Princeton University, Princeton, NJ 08544, USA; UC San Diego, La Jolla, CA 92093, USA; University of Oregon, Eugene, OR 97403, USA
ABSTRACT

We present results from parallelizing the unpacking and clustering steps of the raw data from the silicon strip modules for the reconstruction of charged particle tracks. Throughput is further improved by concurrently processing multiple events, using nested OpenMP parallelism on CPU or CUDA streams on GPU. The new implementation, along with earlier work in developing a parallelized and vectorized implementation of the combinatoric Kalman filter algorithm, has enabled efficient global reconstruction of the entire event on modern computer architectures. We demonstrate the performance of the new implementation on Intel Xeon and NVIDIA GPU architectures.

PRESENTED AT

Connecting the Dots Workshop (CTD 2020)
April 20-30, 2020

INTRODUCTION

The reconstruction of charged particle trajectories (tracking) is a pivotal element of the reconstruction chain in the Compact Muon Solenoid (CMS) experiment [1], as it measures the direction and momentum of charged particles, which are then used as input for nearly all high-level reconstruction algorithms: vertex reconstruction, b-tagging, lepton reconstruction, jet isolation, and missing transverse momentum reconstruction. Tracking is by far the most time-consuming step of event reconstruction, and its processing time scales poorly with detector occupancy. This poses a computing challenge for the upcoming upgrade of the accelerator from the Large Hadron Collider (LHC) to the High-Luminosity LHC (HL-LHC), where the instantaneous luminosity will increase by an order of magnitude, leading to a corresponding increase in the number of overlapping proton-proton collisions.

To address this challenge, the parallel Kalman filter tracking project mkFit was established in 2014 with the goal of enabling efficient tracking on modern computing architectures. Over the last six years, we have made significant progress towards developing a parallelized and vectorized implementation of the combinatoric Kalman filter algorithm for tracking [2, 3, 4, 5]. This allows the efficient global reconstruction of the entire event within the projected online CPU budget. It also opens the possibility of deploying mkFit in the CMS High Level Trigger (HLT), where the performance requirements are particularly strict. The current goal is to test the algorithm online in Run 3 of the LHC.

Global reconstruction necessarily entails the unpacking and clustering of the hit information from all silicon strip tracker modules before the hits are processed by mkFit. The current CMS HLT, on the other hand, performs hit and track reconstruction on demand, i.e., only for software-selected regions of the detector. Therefore, we have recently begun to investigate how to implement the unpacking and clustering steps efficiently for the entire detector at once. This document highlights the latest developments in enabling parallelization of unpacking and clustering on modern computing architectures. We start from a standalone version of the unpacker, which uses simulated raw data and calibration data, along with simplified versions of many related CMSSW classes.
UNPACKING AND CLUSTERING

The CMS strip tracker data are organized by Front End Drivers (FEDs). Each FED consists of 96 optical channels with 256 strips per channel [6]. In every CMS event, only the strips measured to be above a threshold are recorded. In zero-suppression mode, the measured signal within a FED is stored in a variant of the compressed sparse row (CSR) format, where the channel and strip numbers correspond to the row and column numbers (Figure 1). Besides the event-based strip tracker data, pre-measured calibration data, e.g., gain and noise values for each strip, must also be unpacked per FED channel.

Figure 1: Raw data format from the CMS strip tracker.

Data layout is paramount to performance, and we therefore choose the structure-of-arrays (SoA) layout. The SoA approach maximizes spatial locality for streaming access and thus improves sustained bandwidth on both modern CPUs and GPUs. The unpacking step transforms all event-based strip tracker data and pre-measured calibration data to the SoA format in order to provide optimal performance for the clustering step. Specifically, we first construct a map of the channel locations and attributes on the CPU; this construction can only be done sequentially, but it can overlap with the raw data transfer. We then unpack the data to the SoA format, which can be done concurrently for each channel on a multicore CPU or a GPU (see the first sketch below). Unpacking is the most time-consuming step, since it involves irregular data access patterns, and it is particularly costly on the GPU.

The CMS clustering is based on the "three thresholds" algorithm [6]. First, only strips with a signal-to-noise ratio larger than the "channel threshold" are retained; we call these active strips. Our parallel implementation exploits the fact that every candidate cluster contains at least one active strip whose signal-to-noise ratio also exceeds the "seed threshold"; we call these seed strips. We start by building an array of indices of the seed strips. We then form a cluster around each seed strip by determining its left and right boundaries, computing the cluster charge, and checking the charge against the "cluster threshold". For each cluster that passes all three thresholds, we output its left and right boundaries, its centroid, and optionally its ADC values. Figure 2 shows the workflow of the parallel clustering algorithm: the seed-seeking stage is parallelized over all strips, and the cluster-forming stage is parallelized over all seed strips (see the second sketch below). The input data for these tests is a simulated sample of tt̄ events with an average pileup of 70 per event, using the 2018 Phase 1 CMS geometry with realistic detector conditions. For this TTBar PU70 data sample, the numbers of strips and of seed strips per event are roughly 800,000 and 150,000, respectively.

Figure 2: Parallel clustering algorithm workflow.
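To picture the per-channel unpacking, the following is a minimal CUDA sketch. The names (ChannelInfo, unpackChannels, adcSoA, stripIdSoA) are our own illustrative assumptions, not the standalone code's, and the payload encoding is deliberately simplified; the real zero-suppressed FED format is more involved.

```cpp
#include <cstdint>

// Hypothetical channel-map entry, built sequentially on the host while the
// raw data transfer is in flight: where each FED channel's payload starts
// in the raw buffer, where its strips go in the SoA output, and its length.
struct ChannelInfo {
  uint32_t rawOffset;  // byte offset of this channel's payload in the raw buffer
  uint32_t soaOffset;  // first SoA slot belonging to this channel
  uint16_t length;     // number of recorded (zero-suppressed) strips
  uint16_t fedCh;      // FED/channel id, used to look up calibration data
};

// One thread block per channel; the threads of the block copy that channel's
// strips into contiguous SoA slots. Decoding is simplified to one strip-id
// byte and one ADC byte per strip for the purpose of this sketch.
__global__ void unpackChannels(const uint8_t* raw, const ChannelInfo* channels,
                               int nChannels, uint16_t* stripIdSoA,
                               uint16_t* adcSoA) {
  int c = blockIdx.x;
  if (c >= nChannels) return;
  ChannelInfo ch = channels[c];
  for (int i = threadIdx.x; i < ch.length; i += blockDim.x) {
    stripIdSoA[ch.soaOffset + i] = raw[ch.rawOffset + 2 * i];
    adcSoA[ch.soaOffset + i]     = raw[ch.rawOffset + 2 * i + 1];
  }
}
```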
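The two parallel clustering stages can likewise be sketched in CUDA as below. Kernel and parameter names are illustrative assumptions; the real implementation additionally handles channel boundaries, small gaps between active strips, gain calibration, and de-duplication of clusters containing more than one seed, all omitted here.

```cpp
#include <cstdint>

// Stage 1: one thread per strip; collect the indices of seed strips
// (signal-to-noise ratio above the seed threshold) into a compact list.
__global__ void seekSeeds(const uint16_t* adc, const float* noise, int nStrips,
                          float seedThreshold, int* seedIndices, int* nSeeds) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nStrips) return;
  if (adc[i] > seedThreshold * noise[i]) {
    int slot = atomicAdd(nSeeds, 1);  // append this strip to the seed list
    seedIndices[slot] = i;
  }
}

// Stage 2: one thread per seed strip; grow the candidate cluster over
// contiguous active strips (S/N above the channel threshold), then apply
// the cluster threshold to the total charge.
__global__ void formClusters(const uint16_t* adc, const float* noise, int nStrips,
                             const int* seedIndices, const int* nSeeds,
                             float channelThreshold, float clusterThreshold,
                             int* left, int* right, bool* accepted) {
  int s = blockIdx.x * blockDim.x + threadIdx.x;
  if (s >= *nSeeds) return;
  int l = seedIndices[s], r = l;
  while (l > 0 && adc[l - 1] > channelThreshold * noise[l - 1]) --l;
  while (r + 1 < nStrips && adc[r + 1] > channelThreshold * noise[r + 1]) ++r;
  float charge = 0.f, noise2 = 0.f;
  for (int i = l; i <= r; ++i) {
    charge += adc[i];
    noise2 += noise[i] * noise[i];  // cluster noise adds in quadrature
  }
  left[s] = l;
  right[s] = r;
  accepted[s] = charge > clusterThreshold * sqrtf(noise2);
}
```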
In order to maximize hardware usage and improve the overall throughput of the standalone program, we have also enabled concurrent processing of multiple events. On the CPU, this is achieved with OpenMP nested parallelism, creating two levels of OpenMP parallel regions, one inside the other: the outer level handles different events, while the inner level handles sets of strips within a single event. To maintain good memory locality with nested parallelism, it is important that threads are given affinity to specific cores and that memory is placed local to the processor that accesses it. On the GPU, we use CUDA streams to process multiple events concurrently on the same device; this not only maximizes GPU utilization but also reduces data transfer overhead by overlapping communication with computation. For GPU memory allocation we rely on a cached memory allocator developed in CMSSW [7], based on the one in the CUB library [8], which significantly reduces the allocation and deallocation overhead when processing multiple events. Both multi-event schemes are sketched below.
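A minimal sketch of the two-level OpenMP scheme follows, assuming hypothetical Event and processStripRange stand-ins for the real per-event work; the N and M arguments correspond to the NxM decomposition shown later in Table 1.

```cpp
#include <omp.h>
#include <cstddef>
#include <vector>

// Hypothetical event type and per-event worker: the worker would unpack
// and cluster the subset of strips owned by inner thread tid.
struct Event { /* raw FED buffers, SoA arrays, cluster output, ... */ };
void processStripRange(Event& ev, int tid, int nThreads) { /* ... */ }

// Two-level decomposition: N events in flight (outer team) and M threads
// cooperating on the strips of each event (inner team). For memory
// locality, threads should be pinned to cores, e.g. with OMP_PLACES=cores
// and OMP_PROC_BIND=spread,close.
void processAllEvents(std::vector<Event>& events, int N, int M) {
  omp_set_max_active_levels(2);  // enable nested parallel regions
#pragma omp parallel for num_threads(N) schedule(dynamic)
  for (std::size_t e = 0; e < events.size(); ++e) {
#pragma omp parallel num_threads(M)  // inner, per-event team
    processStripRange(events[e], omp_get_thread_num(), M);
  }
}
```

With N x M = 14 x 2, this corresponds to the best CPU configuration reported in Table 1.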
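The multi-event GPU pipeline can be pictured as below, with one CUDA stream per in-flight event so that the copies and kernels of different events overlap. EventBuffers and unpackAndCluster are illustrative placeholders; in the real code the device buffers come from the CMSSW caching allocator [7] rather than direct cudaMalloc calls.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-event buffer set; host buffers must be pinned for
// cudaMemcpyAsync to actually overlap with computation.
struct EventBuffers {
  const uint8_t* hRaw;  uint8_t* dRaw;   size_t rawBytes;  // raw FED data
  uint32_t* hOut;       uint32_t* dOut;  size_t outBytes;  // cluster output
  int nStrips;
};

// Stand-in for the chain of unpacking and clustering kernels.
__global__ void unpackAndCluster(const uint8_t* raw, uint32_t* out, int nStrips) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nStrips) out[i] = raw[i];  // placeholder for the real work
}

// One CUDA stream per in-flight event: asynchronous copies and kernel
// launches of different events overlap, hiding transfer latency.
void processEventsOnGpu(std::vector<EventBuffers>& events, int nStreams) {
  std::vector<cudaStream_t> streams(nStreams);
  for (auto& s : streams) cudaStreamCreate(&s);

  for (std::size_t e = 0; e < events.size(); ++e) {
    cudaStream_t s = streams[e % nStreams];  // round-robin over streams
    EventBuffers& ev = events[e];
    cudaMemcpyAsync(ev.dRaw, ev.hRaw, ev.rawBytes, cudaMemcpyHostToDevice, s);
    int threads = 256, blocks = (ev.nStrips + threads - 1) / threads;
    unpackAndCluster<<<blocks, threads, 0, s>>>(ev.dRaw, ev.dOut, ev.nStrips);
    cudaMemcpyAsync(ev.hOut, ev.dOut, ev.outBytes, cudaMemcpyDeviceToHost, s);
  }
  for (auto& s : streams) { cudaStreamSynchronize(s); cudaStreamDestroy(s); }
}
```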
PERFORMANCE RESULTS

We evaluate the parallel performance of the standalone program on TigerGPU, the GPU cluster of Princeton University's research computing facility. Each node in TigerGPU is equipped with two 2.4 GHz Intel Xeon Broadwell E5-2680 v4 processors on the host and four 1328 MHz NVIDIA P100 GPUs as accelerators. For the CPU results, each test uses all 28 cores with nested parallelism. For the GPU results, we use OpenMP host threads to submit multiple events concurrently to a single GPU. Table 1 shows the wall-clock time and throughput in running 840 events, where all events are copies of a single TTBar PU70 event. On the CPU, the best performance is observed when running 14 events concurrently with 2 cores per event, while on the GPU the best performance comes from running 2 events/streams concurrently. Note that the performance comparison is between running the code on a single GPU and on all 28 CPU cores.

2 x 2.4 GHz Xeon Broadwell E5-2680 v4

  N x M    time (s)   throughput (events/s)
  1 x 28     1.70        492
  2 x 14     2.10        400
  4 x 7      1.67        502
  7 x 4      1.57        532
  14 x 2     1.50        560
  28 x 1     2.10        400

1328 MHz NVIDIA P100 (single GPU)

  N        time (s)   throughput (events/s)
  1          1.36        615
  2          1.29        649
  4          1.41        595
  7          1.46        574
  14         1.53        548
  28         1.54        546

Table 1: Performance comparison in running 840 TTBar PU70 events. N is the event concurrency and M is the CPU parallelization within one event. The reported number for each test is the average of 10 trials.
CONCLUSIONS

We have enabled parallelization of the unpacking and clustering of CMS silicon strip detector data and demonstrated its performance on both multi-core CPUs and many-core GPUs. Overall, we observe that a single GPU (P100) outperforms a two-socket Intel Broadwell CPU (2 x 14 cores, 2.4 GHz Xeon E5-2680 v4) by 24% on average. Next, we will integrate the implementation with CMSSW and convert the clusters to global hit coordinates in order to provide the necessary input for mkFit.

ACKNOWLEDGEMENTS
This work is supported by the U.S. National Science Foundation under grants PHY-1520969, PHY-1521042, PHY-1520942, and PHY-1624356, and under Cooperative Agreement OAC-1836650, and by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program. This research used resources of Princeton University's research computing facility.
References

[1] S. Chatrchyan et al. The CMS Experiment at the CERN LHC. JINST, 3:S08004, 2008.

[2] Giuseppe Cerati et al. Parallelized Kalman-filter-based reconstruction of particle tracks on many-core processors and GPUs. EPJ Web of Conferences, 150:00006, 2017.

[3] Giuseppe Cerati et al. Parallelized and vectorized tracking using Kalman filters with CMS detector geometry and events. EPJ Web of Conferences, 214:02002, 2019.

[4] Giuseppe Cerati et al. Speeding up Particle Track Reconstruction in the CMS Detector using a Vectorized and Parallelized Kalman Filter Algorithm. arXiv e-prints, page arXiv:1906.11744, June 2019.

[5] Giuseppe Cerati et al. Reconstruction of Charged Particle Tracks in Realistic Detector Geometry Using a Vectorized and Parallelized Kalman Filter Algorithm. arXiv e-prints, page arXiv:2002.06295, February 2020.

[6] The CMS Collaboration. Description and performance of track and primary-vertex reconstruction with the CMS tracker. Journal of Instrumentation, 9(10):P10009, October 2014.

[7] Andrea Bocci, David Dagenhart, Vincenzo Innocente, Christopher Jones, Matti Kortelainen, Felice Pantaleo, and Marco Rovere. Bringing heterogeneity to the CMS software framework. arXiv e-prints.

[8] CUB: a library of CUDA collective primitives. https://nvlabs.github.io/cub/.