GPU-based reconstruction and data compression at ALICE during LHC Run 3
David Rohr∗, on behalf of the ALICE collaboration
European Organization for Nuclear Research (CERN), Geneva, Switzerland
∗ e-mail: [email protected]
Abstract.
In LHC Run 3, ALICE will increase the data taking rate significantly to 50 kHz continuous read-out of minimum bias Pb–Pb collisions. The reconstruction strategy of the online-offline computing upgrade foresees a first synchronous online reconstruction stage during data taking, enabling detector calibration, and a posterior calibrated asynchronous reconstruction stage. The significant increase in the data rate poses challenges for online and offline reconstruction as well as for data compression. Compared to Run 2, the online farm must process 50 times more events per second and achieve a higher data compression factor. ALICE will rely on GPUs to perform the processing and data compression of the Time Projection Chamber (TPC), the biggest contributor to the data rate, in real time. With GPUs available in the online farm, we are evaluating their usage also for the full tracking chain during the asynchronous reconstruction for the silicon Inner Tracking System (ITS) and the Transition Radiation Detector (TRD). The software is written in a generic way, such that it can also run on processors on the WLCG with the same reconstruction output. We give an overview of the status and the current performance of the reconstruction and the data compression implementations on the GPU for the TPC and for the global reconstruction.

ALICE (A Large Ion Collider Experiment [1]) is one of the four major experiments at the LHC (Large Hadron Collider) at CERN. It is a dedicated heavy-ion experiment studying lead collisions at the LHC at unprecedented energies. During the second long LHC shutdown in 2019 and 2020, the LHC upgrade will provide a higher Pb–Pb collision rate, and ALICE will upgrade many of its detectors and systems [2]. In particular, the main tracking detectors TPC (Time Projection Chamber) and ITS (Inner Tracking System) will be upgraded [3], and the computing scheme will change with the O2 online-offline computing upgrade [4].

ALICE will upgrade the detectors for LHC Run 3 and switch from the current triggered read-out of up to 1 kHz of Pb–Pb events to a continuous read-out of 50 kHz minimum bias Pb–Pb events. The continuous read-out of pp collisions will happen at rates between 200 kHz and 1 MHz. ALICE is abandoning the hardware triggers and will switch to full online processing in software. During data taking, the synchronous processing will serve two main objectives: detector calibration and data compression. With a flat budget and the yearly increases of storage capacity, recording and storing raw data as today is prohibitively expensive at 50 to 100 times the data rate.
[Figure 1 shows the data flow: data links from the detectors (> 3.5 TB/s) feed the readout nodes (> 600 GB/s); the synchronous processing on the online farm (local processing, event/time frame building, calibration and compression) runs during data taking and writes compressed raw data to a disk buffer (< 100 GB/s) and to permanent storage; the asynchronous processing (reprocessing with full calibration, full reconstruction) runs during periods without beam and produces the reconstructed data.]
Figure 1. Illustration of the ALICE computing strategy for Run 3, with synchronous processing during data taking, and asynchronous processing in periods without beam.

ALICE aims at a compression of the TPC data, the largest contributor to the raw data size, by a factor of 20 compared to the zero-suppressed raw data size of Run 2. By producing the calibration during data taking, ALICE will reduce the number of offline reconstruction passes over the data, where today the first two passes serve the calibration. The output of the synchronous data processing will be compressed time frames, which are stored to an on-site disk buffer and from there written to tape. When the computing farm is not fully used for the synchronous processing, e.g. in periods without beam or during pp data taking, it will perform a part of the asynchronous reconstruction, which reprocesses the data and generates the final reconstruction output. The part of the asynchronous processing that exceeds the capacity of the farm will be done on the grid. This asynchronous stage will employ the same algorithms and software as the synchronous stage, but with different settings, additional reconstruction steps, and the final calibration. Figure 1 gives an overview of the O2 computing.

The reconstruction of the central barrel detectors of ALICE, foremost the TPC, is the most computing-intense part of event reconstruction and the focus of this paper. Therefore, ALICE foresees the usage of graphics cards (GPUs) to accelerate these steps. In parallel, a similar effort on a smaller scale has started to investigate whether the forward detector reconstruction could leverage GPUs in the same way. The core part is the tracking of the TPC, which was adapted from the ALICE High Level Trigger [5] and improved to match the Run 2 offline reconstruction in terms of efficiency and resolution. Several new algorithms have been implemented for the GPU reconstruction, in particular for the Inner Tracking System (ITS) [6]. Another addition is the data compression of the TPC, which consists of a track model compression step [7] and an entropy encoding step, which will most likely use ANS [8] encoding. We foresee two GPU processing scenarios:

• Baseline scenario: This contains the minimum set of reconstruction steps on the GPU required to perform the synchronous reconstruction on the online processing farm at the peak data rate assumed for LHC Run 3. This scenario defines the size of the online processing farm, in particular the number of processor cores and GPUs.

• Optimistic scenario: The asynchronous reconstruction will perform many processing steps of the synchronous reconstruction (except for the calibration and the data compression) one more time, thus it can leverage the available GPU algorithms. Since there are many more steps in the asynchronous reconstruction, it will inevitably be CPU bound if all these steps are processed by the processor while there are no additional steps on the GPU. Therefore we aim to offload more processing steps onto the GPU, and a promising candidate is the complete central barrel tracking chain.

Figure 2 gives an overview of the corresponding reconstruction steps.
[Figure 2 shows the chain: TPC Cluster Finder, TPC Track Finding, TPC Track Fit, TPC Track Merging, TPC dE/dx, TPC Track Model Compression, TPC Entropy Compression, TPC Calibration, TPC cluster removal and TPC < 10 MeV/c identification, ITS Track Finding, ITS Track Fit, ITS Vertexing, ITS Afterburner, TPC-ITS Matching, TRD Tracking, TOF Matching, Global Fit and V0 Finding; common GPU components are sorting, material lookup, memory reuse and the GPU API framework; each step is marked as in operation, nearly ready, being studied, or development not started, and the GPU barrel tracking chain is split into the baseline and the optimistic scenario.]
Figure 2. Overview of the relevant processing steps in the central barrel tracking and reconstruction chain.
We show steps of the baseline scenario with a yellow label, and the additional ones of the optimistic scenario with a white label. Green boxes indicate steps that are already fully integrated and tested on the GPU, and blue boxes those where a GPU implementation is principally ready but not fully deployed, with no significant risks left that could prevent an eventual GPU processing.

For the baseline scenario, most steps are essentially ready, except for the TPC Cluster Finder and the identification of TPC tracks below 10 MeV/c. The TPC Cluster Finder is a last-minute addition to the baseline GPU processing. It was originally designated to run on the FPGA in the readout servers, but it has become likely that the FPGA resources will be insufficient to house the full TPC cluster finding. Therefore, it was moved to the GPU reconstruction. Consequently, the GPU implementation started late, but it is already in considerably good shape, with only minor features, like the propagation of Monte Carlo labels, missing. The situation is different for the identification of tracks with low transverse momentum, for which we do not yet have a working prototype that achieves the required efficiency and performance. As will be discussed in section 4, it is not clear whether this step will be needed.

The work plan foresees to consolidate the baseline steps first, facilitating a full system test, and then to integrate the steps of the optimistic scenario following the order defined by the reconstruction chain graph. This will make sure that a consecutive set of steps runs on the GPU, avoiding unnecessary intermediate data transfers back and forth. The blocking part for now is the matching of TPC to ITS tracks, which is required for many posterior steps. The entropy compression, which belongs to the synchronous baseline scenario, is not required to run on the GPU since we already have a sufficiently fast CPU implementation, but it could free up CPU resources for other tasks. From the estimates of the CPU processing times based on the Run 2 offline reconstruction, the vertex finding represents a considerable CPU load, while the task seems to be parallel and suited for GPUs. Therefore, it will make sense to follow the barrel tracking graph up to the V0 finding.

The ALICE Run 3 processing will be based on time frames, which consist of the data recorded over a period of time. The current design foresees 10 to 20 milliseconds, which translates to around 500 to 1000 heavy-ion events at the peak interaction rate of 50 kHz. Reconstruction of the time frames is performed independently. This means that tracks from drift detectors like the TPC might extend from one time frame into the next one. Such tracks will not be reconstructible, which will lead to a loss of statistics of less than 1%, but it simplifies the reconstruction significantly. However, it also means that time frames must be reconstructed as a whole and cannot be split further into parts. The TPC is the largest contributor to the data volume. This means the GPU must either be able to hold the required TPC data for a full time frame, or the processing must happen in an approach similar to a ring buffer, with the data streamed in and out. The first approach would be preferable, since the latter makes the software more complicated.
In the end, however, a mixture of the two is required: at least the cluster finder will use a ring buffer for its input, since the GPU cannot store the full TPC raw data at once.

Many of the steps also use a large scratch memory, which is used temporarily by individual processing steps to store transient results. These steps must run consecutively on the GPU and reuse the memory of the previous steps. Therefore we manage the GPU memory manually. A large buffer is allocated ahead of time. Memory is given to certain reconstruction steps, and then reused for the following steps. For the TPC cluster finding (and also for the TPC track finding [5]), the TPC volume is split into 36 sectors, which are processed in a pipeline. The raw data is needed only for the cluster finding step, so once a sector is finished, its raw data can be removed from the GPU.

Figure 3 gives an overview of the memory allocation. The large buffer is split into a left and a right part. The left part aggregates data that will persist (e.g. the clusters obtained from the TPC cluster finder, which are used by several subsequent steps). The right part houses transient data, which is used only by one reconstruction step and will be overwritten for the next one. In addition, segments in the middle can be handed out as scratch buffers temporarily. The illustration shows from the first to the second row how TPC clusters are aggregated in the persistent region, while the input buffers in the non-persistent region are reused. Multiple kernels can run in parallel. They can belong to the same reconstruction step when it runs in a pipeline, or to independent reconstruction steps. Input data can also be persistent if used many times (like ITS hits), and there may be gaps in the persistent region, because in some cases only upper bounds for buffer sizes are available, resulting in a gap to the next buffer. Over time the size of the persistent region increases, leaving less scratch space, but this does not create a shortage, since the most memory-intense tasks are the TPC cluster finding and track finding, which run at the beginning. At the end of the processing of one time frame, a special optimization can already preload the first TPC raw data buffers for the next time frame, which minimizes GPU idle time between time frames.

Figure 4 shows that the processing time of the individual steps depends essentially linearly on the input data size, and thus on the length of the time frame, with only minor fluctuations.
[Figure 3 shows three snapshots of the GPU memory: persistent data (TPC hits per sector, ITS hits, TPC tracks, ITS tracks, matches) grows from one end of the buffer, while non-persisting input data (TPC raw data per sector) and non-persistent scratch data for the algorithms are reused at the other end; the last snapshot includes the preloaded raw data of the next time frame.]
Figure 3. Illustration of the memory allocation strategy during the processing of a time frame.
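The two-ended allocation described above (persistent data growing from one side, transient scratch data from the other, inside one large preallocated buffer) can be pictured as a simple arena. The following is a minimal C++ sketch under that assumption; the class and method names (TimeFrameArena, allocPersistent, allocScratch) are hypothetical and not the actual O2 memory manager.

// Minimal sketch of a two-ended arena inside one large, preallocated GPU buffer
// (hypothetical names, not the actual O2 memory manager): persistent buffers
// (e.g. TPC clusters, ITS hits, tracks) grow from the left and live until the
// end of the time frame; scratch buffers grow from the right and are released
// wholesale before the next reconstruction step reuses the space.
#include <cassert>
#include <cstddef>

class TimeFrameArena {
 public:
  // base and capacity are assumed to be aligned to the largest alignment used below.
  TimeFrameArena(void* base, std::size_t capacity)
      : mBase(static_cast<char*>(base)), mCapacity(capacity) {}

  // Persistent allocation: bump the left pointer; never freed individually.
  void* allocPersistent(std::size_t size, std::size_t align = 256) {
    mLeft = alignUp(mLeft, align);
    assert(mLeft + size <= mCapacity - mRightUsed && "time frame buffer exhausted");
    void* ptr = mBase + mLeft;
    mLeft += size;
    return ptr;
  }

  // Scratch allocation: bump from the right end.
  void* allocScratch(std::size_t size, std::size_t align = 256) {
    const std::size_t needed = alignUp(size, align);
    assert(mRightUsed + needed <= mCapacity - mLeft && "time frame buffer exhausted");
    mRightUsed += needed;
    return mBase + mCapacity - mRightUsed;
  }

  // Drop all scratch buffers of a finished step, so the next step reuses the space.
  void releaseScratch() { mRightUsed = 0; }

  // Drop everything at the end of the time frame.
  void reset() { mLeft = 0; mRightUsed = 0; }

 private:
  static std::size_t alignUp(std::size_t x, std::size_t a) { return (x + a - 1) / a * a; }
  char* mBase = nullptr;
  std::size_t mCapacity = 0;
  std::size_t mLeft = 0;       // bytes used by persistent buffers (left end)
  std::size_t mRightUsed = 0;  // bytes used by scratch buffers (right end)
};

In such a scheme the persistent region only grows over the time frame while the scratch region is recycled between consecutive steps; the real implementation additionally has to deal with upper-bound buffer sizes, per-sector pipelining, and the preloading of raw data for the next time frame.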
Figure 4. Performance of several GPU reconstruction steps versus the size of the input data.
This is important since it proves that the approach of processing long time frames at once does not produce an overhead in processing time. With the current implementation, the ALICE reconstruction will need 16 GB of GPU memory for the processing of a 10 ms time frame. Processing longer time frames will require GPUs with more memory, which will probably be much more expensive. While some optimizations are possible, it is not clear whether the memory requirements can be reduced sufficiently to operate on GPUs with less than 16 GB of memory without a ring buffer.
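The ring-buffer alternative, which at least the TPC cluster finder will use for its raw-data input, can be pictured as a small pool of sector-sized slots that are refilled as soon as the cluster finder has consumed them. The following is a hedged C++ sketch with hypothetical names and stub bodies (uploadSectorRawData, runClusterFinder, the slot count), not the actual O2 code.

// Hedged sketch of ring-buffered raw-data staging for the TPC cluster finder
// (hypothetical names and stub bodies, not the actual O2 implementation).
// Only a few sector-sized slots of raw data reside on the GPU at any time;
// a slot is reused as soon as the cluster finder is done with it.
#include <array>
#include <vector>

constexpr int kNSectors = 36;   // TPC sectors processed in a pipeline
constexpr int kNSlots = 4;      // raw-data slots resident on the GPU at once

struct RawSlot { std::vector<char> data; };   // stands in for a GPU raw-data buffer
struct SectorClusters { /* persistent output of the cluster finder */ };

// Stub for the host-to-GPU copy of one sector's raw data.
void uploadSectorRawData(int sector, RawSlot& slot) { slot.data.assign(1, char(sector)); }
// Stub for the GPU cluster finder kernel(s) on one sector.
SectorClusters runClusterFinder(const RawSlot&) { return {}; }

std::array<SectorClusters, kNSectors> processTimeFrame() {
  std::array<RawSlot, kNSlots> slots;          // the "ring" of raw-data buffers
  std::array<SectorClusters, kNSectors> clusters;
  for (int sector = 0; sector < kNSectors; ++sector) {
    RawSlot& slot = slots[sector % kNSlots];   // reuses the slot of sector - kNSlots
    // In a real pipeline this upload would be overlapped (e.g. via asynchronous
    // copies) with the cluster finding of the previous sectors.
    uploadSectorRawData(sector, slot);
    clusters[sector] = runClusterFinder(slot);
    // From here on the raw data of this sector is no longer needed; only the
    // clusters are kept (in the persistent region of the time-frame buffer).
  }
  return clusters;
}

With such a scheme only a few slots of raw data need to reside on the GPU at any time, at the price of a more complicated software design, which is why holding the full time frame is preferred where the memory allows it.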
TPC Cluster Removal

The compression of the TPC data involves multiple steps:

• Clusterization of the raw data into clusters (lossy) [5].
• Track model and entropy compression (lossless) [5, 7].
• Removal of clusters of tracks that will not be used for physics analysis (lossy) [9].

The third point, the removal of clusters, can be realized in two ways:

• Strategy A: positive identification of the clusters and tracks to be removed.
• Strategy B: identification of good tracks to be kept, and removal of everything else.

In both cases, the clusters in a tube around the good tracks are protected from removal to ensure the optimal tracking resolution with the final calibration, which can modify the cluster attachment. Strategy B will be faster and remove more clusters than strategy A, but it bears the risk of removing clusters of good tracks if the synchronous reconstruction was unable to reconstruct a good track, or reconstructed it incompletely. Since the tracking algorithm is the same, the difference will depend to the largest extent on the calibration. Currently, both strategies are developed in parallel until the implications of strategy B are fully understood. So far, strategy A lacks the identification of tracks below 10 MeV/c as shown in Fig. 2, which reduces the achievable reduction factors significantly. Currently, strategy A yields a final data rate of 87.7 to 118.1 GB/s of compressed raw data transferred to the storage for 50 kHz Pb–Pb, while strategy B achieves 71.7 to 89.9 GB/s. The O2 Technical Design Report (TDR) assumes an output rate of 88 GB/s. In these ranges, the higher bound represents the current state of the software, while the lower bound uses Monte Carlo information to estimate the highest achievable reduction if track merging and the protection of clusters in the tube around good tracks were 100% efficient.
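To make the difference between the two strategies concrete, the following is a minimal sketch of a strategy-B style selection: only clusters attached to good tracks, or lying inside the protection tube around them, are kept, and everything else is dropped. All types, names, and the tube test are hypothetical illustrations, not the actual O2 compression code.

// Hedged sketch of a strategy-B style cluster selection (hypothetical types and
// names, not the actual O2 implementation): keep clusters attached to good
// tracks or inside a protection tube around them, drop everything else.
#include <cstdint>
#include <functional>
#include <vector>

struct Cluster { float x, y, z; bool keep = false; };
struct Track   { std::vector<uint32_t> clusterIds; bool isGood = false; };

void markClustersToKeep(std::vector<Cluster>& clusters, const std::vector<Track>& tracks,
                        const std::function<bool(const Track&, const Cluster&)>& insideTube) {
  for (const Track& trk : tracks) {
    if (!trk.isGood) continue;            // strategy B: only good tracks protect clusters
    for (uint32_t id : trk.clusterIds) {
      clusters[id].keep = true;           // clusters attached to a good track are kept
    }
  }
  // Also protect unattached clusters inside the tube around good tracks, so that a
  // refit with the final calibration can still pick them up.
  for (Cluster& cl : clusters) {
    if (cl.keep) continue;
    for (const Track& trk : tracks) {
      if (trk.isGood && insideTube(trk, cl)) { cl.keep = true; break; }
    }
  }
  // Clusters with keep == false are removed before the entropy compression.
}

Strategy A would instead positively mark the clusters and tracks to be removed (for instance those of tracks below 10 MeV/c), which avoids the risk of dropping clusters of good tracks but currently achieves a smaller reduction.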
We have presented the online computing strategy of ALICE for Run 3 and the central barrel tracking chain, which is the most promising candidate for GPU usage. The implementation of the software for the baseline scenario, which will run the computing-intense synchronous reconstruction steps on GPUs, is nearly finished, while the offloading of additional steps for the optimistic scenario is ongoing. By reusing the same memory for consecutive processing steps, GPUs with 16 GB of memory can process time frames of around 10 ms. Longer time frames will require more GPU memory, and the usage of GPUs with less memory would necessitate significant software changes and the inclusion of a ring buffer. TPC data reduction strategy B yields the data rates foreseen in the TDR today, but bears the risk of losing good tracks. Strategy A does not bear this risk but does not yet achieve the desired data reduction factors. Its implementation and optimization are ongoing.

References

[1] ALICE Collaboration, "The ALICE experiment at the CERN LHC", J. Inst. 3, S08002 (2008)
[2] ALICE Collaboration, "Upgrade of the ALICE Experiment: Letter of Intent", CERN-LHCC-2012-012 (2012)
[3] ALICE Collaboration, "Technical Design Report for the Upgrade of the ALICE Time Projection Chamber", CERN-LHCC-2013-020 (2013)
[4] ALICE Collaboration, "Technical Design Report for the Upgrade of the Online-Offline Computing System", CERN-LHCC-2015-006, ALICE-TDR-019 (2015)
[5] ALICE Collaboration, "Real-time data processing in the ALICE High Level Trigger at the LHC", Comput. Phys. Commun. 242, 25 (2019), arXiv:1812.08036
[6] M. Puccio for the ALICE Collaboration, "Tracking in high-multiplicity events", Proceedings of Science 287