Investigation of heterogeneous computing platforms for real-time data analysis in the CBM experiment
V. Singhal, S. Chattopadhyay, V. Friese

Abstract
Future experiments in high-energy physics will pose stringent requirements to computing, in particular to real-time data processing. As an example, the CBM experiment at FAIR, Germany, intends to perform online data selection exclusively in software, without using any hardware trigger, at extreme interaction rates of up to 10 MHz. In this article, we describe how heterogeneous computing platforms, Graphics Processing Units (GPUs) and CPUs, can be used to solve the associated computing problems on the example of the first-level event selection process sensitive to J/ψ decays using muon detectors. We investigate and compare pure parallel computing paradigms (POSIX threads, OpenMP, MPI) and heterogeneous parallel computing paradigms (CUDA, OpenCL) on both CPU and GPU architectures, and demonstrate that the problem under consideration can be accommodated with a moderate deployment of hardware resources, provided optimal use is made of their compute power. In addition, we compare OpenCL and pure parallel computing paradigms on CPUs and show that OpenCL can be considered as a single parallel paradigm for all hardware resources.

Keywords:
Heterogeneous Computing, GPU, CBM, CUDA, OpenCL
Email address: [email protected] (V. Singhal)
Homi Bhabha National Institute, Kolkata, India
Variable Energy Cyclotron Centre, 1/AF Bidhan Nagar, Kolkata-700 064, India
GSI Darmstadt, D-64291 Darmstadt, Germany
Preprint submitted to Computer Physics Communications, February 6, 2020

1. Introduction

The computing demands for modern experiments in high-energy particle and nuclear physics are increasingly challenging. This holds in particular for experiments relying heavily on real-time data processing. An example is the Compressed Baryonic Matter experiment (CBM) [1] planned at the future Facility for Antiproton and Ion Research (FAIR) [2] situated at Darmstadt, Germany. CBM will study strongly interacting matter by investigating collisions of heavy nuclei at very high interaction rates of up to 10^7/s, enabling access to extremely rare physics probes [3]. A key feature of the experiment is a data taking concept without any hardware trigger, which will lead to a raw data rate of about 1 TB/s [4]. All data from the various detector systems will be forwarded to a computing farm, where the collision data will be inspected in real time for potentially interesting physics signatures and consequently either accepted or rejected, such that the full raw data flow is reduced to several GB/s suitable for storage. In this article, we study the software implementation of an event selection process sensitive to events containing J/ψ decay candidates. Since this process must be suitable for online deployment, efficient use of the available computing architectures, exploiting their parallel processing features, is mandatory [5].

In modern computing architectures, both multi-core CPU systems and auxiliary accelerators like Graphics Processing Units (GPUs) play a pivotal role. GPUs consist of a set of multiprocessors designed to obtain the best performance for graphics computing (fast mathematical calculations for the purpose of rendering images). Nevertheless, their computational power can also be used, within certain limits, for general-purpose computing [6], an application type which is becoming more and more popular. Extracting good performance from GPU processors, however, is not trivial and requires architecture-specific optimizations [7].

Apart from choosing the computer architecture adequate to the problem, the application developer has to select the programming environment that makes optimal use of concurrency features. An overview and discussion of parallel programming models for both multi-core CPU systems and heterogeneous systems is given in [8]. On an NVIDIA GPU, the choice is between the proprietary API CUDA, the application of which is restricted to NVIDIA hardware, and OpenCL, an open-standard framework that allows executing parallel code across various hardware platforms [9]. Since CUDA is designed specifically for NVIDIA GPUs, one could expect that the advantage of having code portable between platforms, as offered by OpenCL, may come with a performance penalty. Some work was thus dedicated to comparing the performance of CUDA and OpenCL implementations of the same computational problem on GPU. While e.g. Karimi et al. [10] find CUDA to perform slightly better than OpenCL for their specific problem, Fang et al. [11] conclude, on an extensive benchmark suite, that OpenCL does not perform worse, provided a fair comparison is done. This situation motivates us to investigate our computing problem, the selection of
J/ψ candidates from heavy-ion collision events, on GPU using both CUDA and OpenCL, and to compare the respective performances.

Using OpenCL enables the development of portable code, which can also be applied on multi-core CPU architectures without the use of accelerator co-processors. This feature is of particular convenience when writing applications at a time when the architecture they will run on is not yet decided. However, possible performance penalties with respect to native CPU concurrency standards such as OpenMP or MPI need to be considered. While some comparisons of pure parallel programming paradigms are available in the literature [12, 13], comparisons of OpenCL to these standards are scarce; one example is to be found in [14]. Moreover, general conclusions are hard to draw; the proper choice of programming framework will depend on the nature of the concrete problem. We thus extend our investigations by implementing the event selection algorithm in OpenMP, MPI and pthreads and confronting the performance findings with those obtained with the OpenCL implementation.

The article is organized as follows. After a short introduction of the experiment and its computing challenge in section 2, section 3 outlines the event selection algorithm. Section 4 describes the computing platforms and testing conditions used for this study. The implementation of the algorithm and its performance on various computing platforms and for different parallel computing paradigms are reported in sections 5 and 6. The results are summarized in section 7, followed by an outlook and acknowledgments.
2. The CBM experiment and its muon detection system
The Compressed Baryonic Matter (CBM) experiment [1] is a dedicated relativistic heavy-ion collision experiment at the upcoming FAIR accelerator centre at Darmstadt, Germany [2]. For CBM, energetic ions (p_beam = 3.3A − 35A GeV for Au ions) will be collided with a fixed target, producing a large number of particles which will be registered in several detector systems serving for momentum determination and particle identification. The left panel of Fig. 1 shows the experimental setup for the measurement of muons, comprising the main tracking system STS inside a magnetic dipole field, the muon detection system (MUCH), a Transition Radiation Detector used for intermediate tracking, and the Time-of-Flight Detector.

Figure 1: Left: the CBM experiment setup (muon configuration) with its detector systems; right: the CBM muon detection system with a pictorial view of a J/ψ decay into µ+ and µ−. Triplets of tracking stations (magenta) are placed between absorber slabs (yellow).

The MUCH system [15], shown in the right-hand panel of Fig. 1, allows the separation of muons from other particles by several massive absorbers interlaid with position-sensitive detector stations. In its full configuration, the system comprises one graphite and five iron slabs of thicknesses varying from 20 cm to 100 cm. The absorbers are separated by 30 cm gaps, each housing three layers of position-sensitive gas detectors. The distance between two detector layers is 10 cm. The last station, consisting of a detector triplet, is called the trigger station and is positioned after a one metre thick iron absorber. The detector layers are constructed from GEM modules with a pad segmentation in r-φ geometry. The pad size varies in the radial direction according to the hit density profile, such as to obtain an approximately uniform occupancy. The full system comprises about 5 · readout channels. It is currently investigated to use RPC instead of GEM detectors in the two downstream stations for cost reasons; in this study, GEM detectors are assumed throughout. The position resolution depends on the radial distance from the beam line through the corresponding pad size; e.g., the station-1 pad size varies from 0.32 cm to 1.71 cm.

One objective of the experiment is the measurement of J/ψ meson production via the decay into a muon pair (J/ψ → µ+µ−). At CBM energies, however, the J/ψ multiplicity is expected to be extremely small, of the order of 10− per collision [16]. The experiment will thus be operated at extreme interaction rates in order to obtain sufficient statistics. To reduce the total data rate of about 1 TB/s to a recordable value of a few GB/s, J/ψ candidate events must be selected in real time during data taking, rejecting the vast majority of the collisions. The J/ψ trigger signature is obvious from the right panel of Fig. 1: two muons traversing the absorber system and being registered simultaneously in the trigger station. The trajectories of the muon candidates have to point back to the collision vertex (target), since the J/ψ decays promptly at the vertex [17].

The trigger signature will be evaluated exclusively in software on the so-called FLES (First-Level Event Selection) computing cluster [18, 19] hosted in the Green Cube building at GSI. Efficient trigger algorithms are a prerequisite for the affordability of the FLES cluster.
3. The real-time event selection process
The signature for J/ψ → µ+µ− candidate events is a rather simple one and is visualized in the right panel of Fig. 1. The two daughter muons, having high momentum because of the large q value of the decay, traverse all absorber layers and reach the trigger station, while hadrons, electrons, and low-momentum muons are absorbed. Since the J/ψ decays promptly (lifetime τ ≈ 7 · 10^-21 s), the decay products practically originate from the primary (collision) vertex, i.e., from the target. Owing again to the high momentum of the muons, their trajectories can be approximated by straight lines even in the bending plane of the dipole magnetic field, which has a bending power of 1 Tm [20]. The trigger station, consisting of three detector layers, provides three position measurements, allowing to check the back-pointing to the primary vertex. The signature of a candidate event is thus the simultaneous registration of two particles in the trigger station which can be extrapolated backward to the target.

To study the feasibility of this selection procedure, Fig. 2 (left) shows the number of hits per event in the trigger station (combined for all three layers) for a sample of background events (not containing any J/ψ decay) simulated in the CBM setup. The average number is about 45 for the trigger station and 15 for each of its layers.
Figure 2: Left: event-wise hit distribution on the last MUCH station (combined over its three layers); right: x-y distribution of background hits in the trigger station for 1000 events. The accumulation at the periphery is due to secondaries not shielded by the absorber.

The dominant sources of this background are secondary muons originating from weak decays of Λ and K between the target and the trigger station. The right panel of Fig. 2 shows the spatial distribution in the transverse plane (x-y) of these background hits. The void area in the centre is not instrumented, since it hosts the vacuum pipe for the non-interacting beam [15].

The algorithm corresponding to the signature described in the previous section operates on all data from the MUCH detector within one event (collision). It comprises the following steps (an illustrative sketch of the per-triplet fit is given at the end of this section):

1. Create all possible triplets of hits in the trigger station, with one hit from each layer.
2. For each triplet:
   (a) Fit the hits of the triplet plus the event vertex (0, 0) by a straight line in both the x-z (bending) plane and the y-z (non-bending) plane, i.e., x = m_xz · z and y = m_yz · z.
   (b) Compute the Mean Square Deviation (MSD) of the triplet fit. A triplet is rejected if the MSD (per degree of freedom) is above a threshold (0.03 for MSD_xz, 0.025 for MSD_yz). The MSD cut suppresses random combinations of hits as well as real triplets from secondary tracks. To illustrate, Fig. 3 shows the MSD distribution for the x-z plane (left) and the y-z plane (right) for signal tracks (central Au+Au events at p_beam = 35A GeV, blue), overlaid with all triplets in background events (red); the vertical (green) line indicates the cut value. The threshold values result from an optimization procedure, taking into account signal efficiency and background suppression.
3. Select the event if it contains at least two triplets passing the MSD cut.

Figure 3: MSD_xz (left) and MSD_yz (right) event-normalised distributions for signal tracks in 1000 signal events (Au+Au at 35A GeV), overlaid with all triplets in background events; the vertical (green) line marks the cut value.

For the application to simulated events used in the following, data are read from a file produced by the detector simulation (GEANT3 [21] or GEANT4 [22] plus a detailed detector response implementation). When deployed for data taking, the algorithm will receive the input (hit) data from the online data stream, which is aggregated by the data acquisition software and supplied to the compute nodes through remote direct memory access (RDMA) [23]. Hit data are grouped into events; they provide three-dimensional coordinate information (x, y, z).

In order to assess the performance of the trigger algorithm, the following sets of simulated data were produced and studied:

D1. Signal events containing only one decay J/ψ → µ+µ−. The phase-space distribution of the J/ψ was generated using the PLUTO generator [24].
D2. Background events (central Au+Au at p_beam = 35A GeV) generated using the UrQMD model [25].
D3. Background events with one embedded decay J/ψ → µ+µ− in every UrQMD event.

The following performance figures are used:

a) efficiency (E): the fraction of embedded signal events selected by the algorithm;
b) efficiency under acceptance (EUA): the fraction of selected embedded signal events in which both decay muons have hits in all three layers of the trigger station;
c) background suppression factor (BSF): the ratio of all background events to background events selected by the trigger algorithm.

The efficiency is mainly determined by the geometrical coverage of the muon detection system. Typical values are about 39%. With the cut values named above, the EUA is 85.4%. The reason for it to be smaller than unity is one or both signal tracks not passing the MSD cut. The BSF of 71.4 shows that the primary aim of the algorithm, the suppression of a large fraction of the input data rate, is reached; the probability to find a chance pair of triplets passing the MSD cut is, however, not negligible.

Further background suppression will be achieved by full track reconstruction in the STS and MUCH detectors. This will provide the track momentum and a more precise determination of the track impact parameter on the target plane, allowing to better separate primary from secondary muons. Full track reconstruction, however, is algorithmically involved [26] and thus requires significant computing resources. Owing to the first-level trigger algorithm described here, it needs only be applied on a data rate already reduced by a factor of about 70.
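To make the selection concrete, the following minimal C++ sketch illustrates steps 2 and 3 for one event. It is an illustration only, not the code used in this study: the hit structure, the helper names and the normalisation of the MSD (here simply per hit) are our own assumptions, and the fit is constrained through the target, so the vertex point at (0, 0) contributes no residual.

// Illustrative sketch: straight-line fit of a trigger-station triplet through
// the target at (0,0,0) and the MSD cut described in section 3.
#include <cstddef>
#include <vector>

struct Hit { float x, y, z; };   // hit coordinates in cm

// Least-squares slope of coord = m * z (line constrained through the origin).
static float fitSlope(const float* coord, const float* z, std::size_t n) {
  float num = 0.f, den = 0.f;
  for (std::size_t i = 0; i < n; ++i) { num += coord[i] * z[i]; den += z[i] * z[i]; }
  return num / den;
}

// Mean square deviation (here per hit) of the fitted line.
static float msd(const float* coord, const float* z, std::size_t n, float m) {
  float s = 0.f;
  for (std::size_t i = 0; i < n; ++i) { const float r = coord[i] - m * z[i]; s += r * r; }
  return s / n;
}

// True if the triplet (one hit per trigger-station layer) passes both MSD cuts.
bool acceptTriplet(const Hit& h1, const Hit& h2, const Hit& h3,
                   float cutXZ = 0.03f, float cutYZ = 0.025f) {
  const float x[3] = {h1.x, h2.x, h3.x};
  const float y[3] = {h1.y, h2.y, h3.y};
  const float z[3] = {h1.z, h2.z, h3.z};
  const float mxz = fitSlope(x, z, 3);
  const float myz = fitSlope(y, z, 3);
  return msd(x, z, 3, mxz) < cutXZ && msd(y, z, 3, myz) < cutYZ;
}

// An event is selected if at least two triplets pass the cut (step 3).
bool selectEvent(const std::vector<Hit>& l1, const std::vector<Hit>& l2,
                 const std::vector<Hit>& l3) {
  int nAccepted = 0;
  for (const Hit& a : l1)
    for (const Hit& b : l2)
      for (const Hit& c : l3)
        if (acceptTriplet(a, b, c) && ++nAccepted >= 2) return true;
  return false;
}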
4. Heterogeneous computing
Modern computers come with a variety of concepts for concurrent data processing on many-core architectures [27]: dedicated co-processors like NVIDIA or AMD GPUs, the Many Integrated Core (MIC) architecture by Intel such as the Xeon Phi, the Accelerated Processing Unit (APU) of AMD, the Cell processor of IBM, and others. Nowadays, a single system comprising more than one host CPU and more than one GPU is called a heterogeneous system [28]. The architectures of such systems are based on the SIMT (Single Instruction Multiple Thread) technology and come with a plenitude of processing units, allowing to run many threads in parallel. However, in order to make use of their computing potential, adequate programming paradigms are needed. The pure parallel programming paradigms POSIX Threads (pthread) [29], OpenMP [30] and MPI (Message Passing Interface) [31, 32, 33] have long been available for utilizing multi-core CPU architectures like those of Intel, AMD, or IBM. For many-core architectures, Apple developed OpenCL (Open Computing Language), now managed by the Khronos Group [9], and NVIDIA came with CUDA [34, 35], a proprietary parallel programming API that can be used for NVIDIA hardware only.

The choice of architecture and programming paradigm may well depend on the specific computing problem to be solved. For the algorithm described in the previous section, we have tested implementations on the following two heterogeneous platforms:

S1. A Dell T7500 workstation comprising two Intel Xeon 2.8 GHz six-core processors with 2 GB/core RAM, together with two NVIDIA GPUs (Tesla C2075 [36] and Quadro 4000 [37]).
S2. An AMD-based HP server with four AMD Opteron 2.6 GHz processors comprising 16 cores each with 4 GB/core RAM (64 cores in total).

The algorithm was implemented on these setups with different pure and heterogeneous parallel programming models, namely:

a) using both Intel processors (12 cores in total) of the S1 setup with pthread, OpenMP, MPI and OpenCL;
b) using all four AMD processors (64 cores in total) of the S2 setup with pthread, OpenMP, MPI and OpenCL;
c) using the Tesla C2075 GPU of the S1 setup (448 cores) with CUDA and OpenCL;
d) using the Quadro 4000 GPU of the S1 setup (256 cores) with CUDA and OpenCL.

The event selection process described in section 3 was implemented using OpenCL 1.1 on both CPU and GPU architectures, and CUDA 4.0 on the NVIDIA GPUs. In addition, it was implemented on CPUs with pthread, OpenMP 3.1 and MPI 3.3. The results presented in this study were obtained with Scientific Linux CERN release 6.10 (Carbon), kernel version 2.6.32-754.el6.x86_64, on the CPUs. We used the gcc compiler 4.8.2 with the optimization option -O2 and the architecture-specific option -march=native, which we found to provide the best results [38]. Timing measurements were performed using std::chrono library functions; a minimal example is sketched below. As relevant performance figures, we report the average execution time per event or the average throughput. Note that the thread creation, distribution and aggregation times are not accounted for in the MPI timing measurements, whereas they are for pthread, OpenMP and OpenCL on CPU.
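As an illustration of such a measurement (our own sketch, with a hypothetical processEvents() standing in for the selection routine), the average per-event execution time can be obtained as follows:

// Minimal timing sketch using std::chrono.
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for running the event selection on nEvents events.
void processEvents(int nEvents) {
  volatile long dummy = 0;
  for (int i = 0; i < nEvents; ++i) dummy += i;
}

// Average per-event execution time in microseconds, as quoted in the figures.
double averageTimePerEventUs(int nEvents) {
  const auto start = std::chrono::steady_clock::now();
  processEvents(nEvents);
  const auto stop = std::chrono::steady_clock::now();
  const double us =
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  return us / nEvents;
}

int main() {
  std::printf("average time per event: %.3f us\n", averageTimePerEventUs(20000));
  return 0;
}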
5. Investigation on NVIDIA GPUs
In this section, we explore the NVIDIA GPU architecture, the memory arrangement, and the implementation and optimization of the event selection process on NVIDIA GPUs. We compare different heterogeneous parallel programming paradigms (CUDA and OpenCL) on NVIDIA's Tesla and Quadro GPUs.
The Tesla C2075 GPU [36] used in this work consists of an array of 14 so-called Streaming Multiprocessors (SM). Each SM contains 32 Scalar Processor (SP) cores. The smallest executable unit of parallelism on a GPU comprises 32 threads, known as a warp of threads. In total, therefore, 14 · 32 = 448 cores are available in a single GPU card. All SPs within an SM share resources such as registers and memory. The Instruction Issue Unit distributes the same instruction to each SP inside an SM; thus, they execute the same instruction at any time, a concept known as SIMT (Single Instruction Multiple Threads).

Figure 4: Grid, block and thread architecture of a GPU [39].

Figure 4 shows a diagrammatic representation of the GPU architecture. The thread hierarchy has two levels. At the topmost level there exists a grid of thread blocks. At the second level, the thread blocks are organized as an array of threads. A function to be executed on the GPU is known as a kernel. Kernel execution takes place in the form of a batch of threads organized as a grid of thread blocks. The thread blocks are scheduled across the SMs. Each block comprises many warps of 32 threads. Threads belonging to the same warp execute the same instruction over different data. The efficiency of computation is best when the threads follow the same execution path for the majority of the computation [40]. Execution divergence, when threads of a warp follow different execution paths, is handled automatically inside the hardware with a slight penalty on execution time. The size of the thread blocks (number of threads per block) and the number of blocks can be managed by the programmer, as illustrated in the sketch below.
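A minimal, generic CUDA sketch of this launch model (not the selection kernel itself; names and sizes are illustrative) looks as follows:

// Minimal CUDA sketch of the grid/block/thread launch model: each thread
// derives a global index and works on one element (here, one event id).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void markEvents(int* flags, int nEvents) {
  // Global thread index within the whole grid.
  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < nEvents) flags[tid] = 1;   // placeholder work: one thread per event
}

int main() {
  const int nEvents = 20000;
  int* dFlags = nullptr;
  cudaMalloc(&dFlags, nEvents * sizeof(int));

  // The programmer chooses the block size; the grid size covers all events.
  const int threadsPerBlock = 256;
  const int blocks = (nEvents + threadsPerBlock - 1) / threadsPerBlock;
  markEvents<<<blocks, threadsPerBlock>>>(dFlags, nEvents);
  cudaDeviceSynchronize();

  cudaFree(dFlags);
  std::printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
  return 0;
}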
The Compute Unified Device Architecture (CUDA) has been developed by NVIDIA for performing general-purpose computing on NVIDIA GPUs using parallel computation features. The CUDA memory organization is hierarchical. Each thread has its own private local memory apart from the registers, of which there are 32,768 per SM. The Tesla C2075 GPU includes a configurable (as instruction or data) L1 cache per SM block and a unified L2 cache for all processor cores. All threads inside a block share a dedicated memory space, the per-block shared memory. The size of the block-local shared memory is 48 KB; its lifetime equals that of the block, and it is characterized by low memory access times. Shared memory comprises a sequence of 32-bit words called banks. There also exists a global memory (6 GB) shared by all threads across all thread blocks, having the lifetime of the application. The access time to the global memory is larger than that to the other memories. Global memory comprises a sequence of 128-byte segments, and at any time, memory requests for 16 threads (a half warp) are serviced together. Each segment corresponds to a memory transaction. If the threads in a half warp access data spread across different memory segments (uncoalesced memory request), the corresponding multiple memory transactions would lower the performance [41].

5.3. Investigation with CUDA
The event selection algorithm described in section 3 was implemented in the C language using the CUDA API and then compiled with the NVIDIA compiler (nvcc v7.0.27). To optimize the event selection code for the GPU, we analysed the program and arranged the memory according to the GPU architecture as discussed above. Multiple events are processed at the same time, one event being allocated to one thread. After development and implementation of the event selection process using CUDA, we concentrated on optimizing the CPU to GPU data transfer time, which is significant as the data volume is large.

In our first approach, the entire event hit data of the setup, consisting of the STS and MUCH detectors, was transferred to the GPU. The following steps were taken:

1. Data are read in from a file; in the actual experiment, the data will be deployed in shared memory by the data acquisition software [18]. The file I/O times are thus not included in the following timing measurements. Since access to the shared memory will be performed in parallel to the algorithmic computation, we do not expect a significant impact on the final result.
2. Memory on the GPU device is allocated for a chosen number of events n_ev (cudaMalloc).
3. Hit data for n_ev events are transferred from CPU to GPU (cudaMemcpy).
4. The number of blocks b and the number of threads per block t are selected to optimise the GPU computation time.
5. The event selection algorithm is executed in GPU threads for n_ev events in parallel, with n_ev ≤ b · t.
6. The list of selected events is transferred back to the CPU (cudaMemcpy).

The GPU schedules and balances the selected number of blocks and threads on the available SMs (14 for the C2075) and SPs (32 each). For processing, the data container with the hit coordinate array is arranged as

x_11, y_11, z_11, x_12, y_12, z_12, ..., x_1n, y_1n, z_1n,
x_21, y_21, z_21, x_22, y_22, z_22, ..., x_2n, y_2n, z_2n,
...,
x_m1, y_m1, z_m1, x_m2, y_m2, z_m2, ..., x_mn, y_mn, z_mn,

where x, y, z are the hit coordinates in configuration space; the first index denotes the event, the second one the consecutive number of the hit in the event.

Our investigation showed that this data arrangement suffers from uncoalesced memory access to the x, y, z coordinates (see also [41]). By its SIMT architecture, CUDA executes 32 threads of a block simultaneously; therefore all 32 threads should read from the global memory in a single or double read instruction. To cope with this, we rearranged the data such that coalesced memory access is possible. First, we introduced separate data containers for each coordinate axis, and second, hits of different events are arranged together:

x_11, x_21, ..., x_m1, x_12, x_22, ..., x_m2, ..., x_1n, x_2n, ..., x_mn
y_11, y_21, ..., y_m1, y_12, y_22, ..., y_m2, ..., y_1n, y_2n, ..., y_mn
z_11, z_21, ..., z_m1, z_12, z_22, ..., z_m2, ..., z_1n, z_2n, ..., z_mn

In the course of further optimizing the process, we found that the majority of the time is taken by the global read of data by each thread, thereby requiring a reduction in the global read time, since the data reside in the global memory and not in the shared or private memory of the GPU. By construction of the algorithm, the number of global reads for each thread is proportional to the number of events n_ev. Each event contains about 5000 hits, and every global read takes around 300–400 clock cycles [40]. For the computation, however, only a small fraction of these data are used, namely the hits in the last (trigger) station, which are about 15 per layer per event (see Fig. 2). Thus, we introduced a filtering of the data on the CPU host side, such that only hit data in the trigger station are transferred to the GPU [42]; a code sketch of this arrangement is given below.

The importance of the optimization steps is illustrated in Fig. 5 (left), showing the per-event GPU execution time for the various implementations on the Tesla GPU and, respectively, the per-event CPU to GPU data transfer time. This study was performed for up to 4,000 events because of memory limitations of the GPU. The processing time is reduced by a factor of two from the first implementation (i1) to the one properly using coalesced memory (i2). The data transfer time is the same for both implementations since the same data are transferred. Filtering of input data at the host side (i3) gives a reduction by about two orders of magnitude for the per-event execution time compared to (i2) and one order of magnitude for the per-event data transfer time.
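The following sketch illustrates optimizations i2 and i3 in simplified form: host-side filtering of the trigger-station hits, a structure-of-arrays layout in which the j-th hit of all events is stored contiguously (so that the threads of a warp, each handling one event, read neighbouring addresses), and the transfer to the GPU. It is an illustration under simplifying assumptions (a fixed maximum number of trigger-station hits per event, padded arrays, hypothetical names and a placeholder selection), not the actual implementation.

// Sketch of the coalesced (structure-of-arrays) layout and host-side filtering:
// only trigger-station hits are copied to the GPU, stored as x[j*M + ev] so that
// threads of a warp (one event per thread) access consecutive addresses.
#include <vector>
#include <cuda_runtime.h>

struct Hit { float x, y, z; int station; };   // simplified hit record
constexpr int kMaxHits = 64;                  // assumed cap on trigger-station hits per event
constexpr int kTrigger = 4;                   // assumed id of the trigger station

__global__ void selectEvents(const float* x, const float* y, const float* z,
                             const int* nHits, int* selected, int nEvents) {
  const int ev = blockIdx.x * blockDim.x + threadIdx.x;  // one event per thread
  if (ev >= nEvents) return;
  // Coalesced read: hit j of event ev sits at index j * nEvents + ev.
  // ... triplet building and MSD cut as in section 3 ...
  selected[ev] = (nHits[ev] >= 6);             // placeholder decision
}

void runOnGpu(const std::vector<std::vector<Hit>>& events) {
  const int M = static_cast<int>(events.size());
  std::vector<float> hx(M * kMaxHits, 0.f), hy(M * kMaxHits, 0.f), hz(M * kMaxHits, 0.f);
  std::vector<int> hn(M, 0);

  // Host-side filtering (implementation i3): keep only trigger-station hits.
  for (int ev = 0; ev < M; ++ev) {
    for (const Hit& h : events[ev]) {
      if (h.station != kTrigger || hn[ev] >= kMaxHits) continue;
      const int j = hn[ev]++;
      hx[j * M + ev] = h.x;  hy[j * M + ev] = h.y;  hz[j * M + ev] = h.z;
    }
  }

  float *dx, *dy, *dz; int *dn, *dsel;
  cudaMalloc(&dx, hx.size() * sizeof(float));
  cudaMalloc(&dy, hy.size() * sizeof(float));
  cudaMalloc(&dz, hz.size() * sizeof(float));
  cudaMalloc(&dn, M * sizeof(int));
  cudaMalloc(&dsel, M * sizeof(int));
  cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, hy.data(), hy.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dz, hz.data(), hz.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dn, hn.data(), M * sizeof(int), cudaMemcpyHostToDevice);

  const int tpb = 256;
  selectEvents<<<(M + tpb - 1) / tpb, tpb>>>(dx, dy, dz, dn, dsel, M);

  std::vector<int> sel(M);
  cudaMemcpy(sel.data(), dsel, M * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(dx); cudaFree(dy); cudaFree(dz); cudaFree(dn); cudaFree(dsel);
}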
Figure 5: Processing time per event in microseconds as a function of the number of events. The left panel compares the implementations i1, i2 and i3 with CUDA on the Tesla GPU (see text). A comparison of the execution times on GPU (implementation i3) and on CPU (single thread) is shown in the right panel.

Figure 5 (right) and Table 1 compare the per-event GPU execution time and data transfer time for implementation i3 with the single-threaded execution time on CPU. The data filtering on the host side relaxes the restrictions imposed by the limited GPU memory, such that a larger number of events (we tested up to 80,000) can be processed at a time. The data transfer time is lower by one order of magnitude compared to the execution time; moreover, it can be hidden by performing computation and transfer in parallel [23]. The measurements demonstrate the importance of loading the GPU with sufficient data in order to make optimal use of its capacity. Compared to the single-threaded execution on the CPU (using optimization in the gcc compiler), we obtain a speed-up of 4.55 for a data set of 40k events by using the Tesla GPU. Table 1 shows that for more than 40k events, the speed-up with respect to the single-threaded CPU is slightly reduced, indicating that the optimal data load on the GPU is reached with this amount of events. Our investigations show that about 3 · events per second can be processed on a single Tesla GPU.

Table 1: Results for the event selection algorithm on the Tesla GPU.

The previous section has demonstrated that making optimal use of a GPU with the CUDA API is far from trivial and requires sophisticated optimization of the data arrangement. The OpenCL programming paradigm [9] offers a portable, architecture-independent alternative. Figure 6 compares the per-event execution times of the implementations in CUDA and in OpenCL on the Tesla and Quadro GPUs.
Figure 6: Execution time per event for the CUDA and OpenCL implementations on the NVIDIA Tesla and Quadro GPUs.

We find the OpenCL code execution time to be slightly higher than that of the CUDA code on both Tesla and Quadro, possibly indicating that CUDA is better optimized for the NVIDIA GPU architectures. However, the difference is modest and seems a reasonable price for the flexibility offered by architecture-independent code. Comparing Tesla and Quadro, we find the Tesla GPU to be more powerful, which becomes visible at large enough input data (number of events). The hardware differences between these two GPU cards are manifold: processor speed, global memory size, number of computing cores, etc. [36, 37]. We conclude that the Tesla GPU seems more appropriate for our problem than the Quadro GPU.
6. Investigations on multi-core CPU
An alternative to using GPU accelerator co-processors is to make use of the multi-core CPU architecture present in contemporary computers [43, 44]. Concurrency on CPU cores can be established using pthread, OpenMP, MPI, and OpenCL, all of which are open-source programming paradigms, where OpenCL is primarily developed for many-core or GPU architectures. A preliminary study using pthread and OpenMP only was presented in [38], demonstrating the importance of the proper choice of compiler options. Here, we study in addition MPI and, in particular, OpenCL. We tested implementations of the event selection algorithm for all four of these programming paradigms on the two platforms S1 and S2; the GPUs of the S1 setup were idle. Hardware parallelism was exploited in the simplest way by processing one event per thread (see subsection 5.3); a minimal sketch of this scheme with OpenMP is given below.

Figure 7: Throughput (number of events processed per second) obtained with the pthread, OpenMP and MPI implementations as a function of the number of cores used in parallel for a sample of 20k events. The left panel shows the results for setup S1 (Intel, 12 physical cores), the right panel those for setup S2 (AMD, 64 physical cores).

Figure 7 (left) compares the throughput (number of events processed per second) on the Intel Xeon processors (2 x 6 cores) of setup S1 as a function of the number of cores (threads) used in parallel for a sample of 20,000 events. For all three pure parallel programming implementations we find a linear scaling with the number of threads up to 12 threads, from which point on the throughput decreases again. This signals that from this point onwards, the context switching time starts to dominate the total process time. The same test was performed on setup S2 (4 x AMD Opteron, 16 cores each), as shown in the right panel of Figure 7, with similar results for 64 threads. OpenCL treats the underlying device, in this case the CPU, as a single compute unit; therefore, different timing results cannot be gathered by varying the number of cores.
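For illustration, the one-event-per-thread scheme can be sketched with OpenMP as follows (a simplified example with hypothetical names, not the benchmark code); the pthread and MPI implementations distribute the same per-event loop over explicitly created threads or ranks.

// Simplified OpenMP sketch: the event loop is distributed over threads,
// one event per loop iteration, with no shared state between events.
#include <omp.h>
#include <vector>
#include <cstdio>

struct Event { /* trigger-station hits ... */ };

// Placeholder for the triplet-based selection of section 3.
bool selectEvent(const Event&) { return false; }

int main() {
  std::vector<Event> events(20000);             // batch of events to be inspected
  std::vector<char> selected(events.size(), 0);

  omp_set_num_threads(12);                      // e.g. the 12 physical Intel cores of setup S1
  // One event per iteration; iterations are independent and statically scheduled.
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < static_cast<long>(events.size()); ++i)
    selected[i] = selectEvent(events[i]) ? 1 : 0;

  long nSel = 0;
  for (char s : selected) nSel += s;
  std::printf("selected %ld of %zu events\n", nSel, events.size());
  return 0;
}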
Figure 8: Execution time per event as a function of the number of events processed at a time for the implementations with pthread, OpenMP, MPI and OpenCL on the Intel and AMD CPUs. The number of threads equals that of the available physical cores (12 for Intel, 64 for AMD).
For both setups, we find the throughput to scale with the number of threads / physical cores (the speed-up is 35 for AMD and 11 for Intel), which is to be expected for pure data-level parallelism. On the Intel setup, the performances obtained with pthread, OpenMP and MPI are similar, with MPI showing a slightly higher throughput. On the AMD setup, both pthread and OpenMP are less performant than MPI, although hardware-specific compiler flags were used. As was already pointed out in section 4, thread spawning and distribution times are not accounted for in the MPI implementation; they contribute in proportion to the number of threads. We attribute our findings to the fact that the event selection process does not use shared memory or inter-thread communication, as explained earlier in section 5. Unlike the other frameworks, MPI statically binds its processes to CPU cores. The similarity of the results for the pthread and the OpenMP implementations is to be expected since, internally, OpenMP uses pthread for spawning multiple threads.

The performances obtained with OpenCL, pthread, OpenMP and MPI on the two hardware architectures are compared in Fig. 8 for different numbers of events processed at a time (up to 80,000). The number of threads is 12 for Intel and 64 for AMD, as shown to be optimal by Fig. 7. On both platforms, we find the execution times for pthread and OpenMP to decrease with the number of events and then saturate, indicating a minimal data size (about 20k events) from which on the process overhead can be neglected. On Intel, OpenCL performs clearly worse than the other implementations, whereas it is slightly better than OpenMP and pthread on AMD; here, MPI is found to clearly give the best execution speed. As a reason for OpenCL performing worse than e.g. MPI, we have to acknowledge the fact that OpenCL was primarily designed as a GPU programming tool; thus, its performance depends on the number of thread invocations. OpenCL also produces vectorized code in an automatized way, whereas manual vectorization and/or compiler optimization flags need to be used for better performance with the other implementations. The device-selection sketch at the end of this section illustrates how the same OpenCL host code can target either the CPU or a GPU.

A comparison of the Intel and AMD processors is not straightforward because of the differences in the number of computing units and theoretical peak performances. Considering only the throughput per core, AMD appears less performant than Intel for all implementations, since the number of parallel threads is 5 times larger than for the Intel CPUs, but the per-event execution time is only 2.5 times smaller. A complete assessment, however, would also have to take into account the costs for purchase and operation, which is beyond the scope of this article.
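To illustrate the portability aspect discussed above (a simplified sketch of our own, without error handling; not the benchmark code), the same OpenCL host code can target either the CPU or a GPU by requesting a different device type, while the kernel source stays unchanged:

// Minimal OpenCL host sketch: pick CPU or GPU at run time and build the same
// kernel source for it. Error handling is omitted for brevity.
#include <CL/cl.h>
#include <cstdio>

static const char* kSource =
    "__kernel void mark(__global int* flags) {"
    "  flags[get_global_id(0)] = 1;"            /* placeholder per-event work */
    "}";

int main(int argc, char** argv) {
  const cl_device_type type =
      (argc > 1 && argv[1][0] == 'c') ? CL_DEVICE_TYPE_CPU : CL_DEVICE_TYPE_GPU;

  cl_platform_id platform;  cl_device_id device;  cl_int err;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, type, 1, &device, nullptr);

  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);  // OpenCL 1.1 API

  cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
  clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
  cl_kernel kernel = clCreateKernel(prog, "mark", &err);

  const size_t nEvents = 20000;                  // one work-item per event
  cl_mem flags = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, nEvents * sizeof(cl_int),
                                nullptr, &err);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &flags);
  clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &nEvents, nullptr,
                         0, nullptr, nullptr);
  clFinish(queue);

  std::printf("ran on %s device\n", type == CL_DEVICE_TYPE_CPU ? "CPU" : "GPU");
  clReleaseMemObject(flags); clReleaseKernel(kernel); clReleaseProgram(prog);
  clReleaseCommandQueue(queue); clReleaseContext(ctx);
  return 0;
}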
7. Conclusions
We have described the development of an event selection algorithm for the CBM-MUCH detector data and a systematic study of the implementation of the event selection process using different parallel computing paradigms: pthread, OpenMP, MPI and OpenCL for multi-core CPU architectures, and CUDA and OpenCL for many-core architectures like NVIDIA GPUs. For both platforms, the event selection procedure suppresses the archival data rate by almost two orders of magnitude without reducing the signal efficiency, thus satisfying the CBM requirements for high-rate data taking.

On GPUs, we have found a speed-up of 4.5 with respect to the optimized single-thread execution on CPU. This result, however, is only obtained after careful optimization of the implementation in CUDA. OpenCL on NVIDIA GPUs is found to perform only slightly worse than CUDA. Our results show that about 3 · events per second can be processed on a single GPU card of the NVIDIA Tesla family. Present hardware supports up to four GPUs on a single motherboard. This suggests that the targeted CBM interaction rate of 10^7 events per second can be accommodated by a small number of servers properly equipped with GPUs.

In a multi-core CPU environment, we have compared OpenCL, pthread, OpenMP and MPI as open-source concurrency paradigms. A linear scaling of the data throughput with the number of parallel threads is observed up to the number of available physical cores. In the powerful S2 setup with in total 64 AMD cores, we find that about 2 · events can be processed per second, which is already close to the targeted event rate of 10^7/s. This demonstrates that SIMD instructions provided by modern CPUs are essential to achieve the required throughput, and that the computing demands of the CBM experiment for the real-time selection of J/ψ candidate events can be met by properly making use of the parallel capacities of heterogeneous computing architectures. As an example, the NVIDIA Tesla GPU of setup S1 could be placed into setup S2 to achieve the desired goal.

Comparing the different programming paradigms, we find the cross-platform OpenCL to be a proper choice for heterogeneous computing environments typical of modern architectures, which combine CPU cores with GPU-like accelerator cards. For such systems, OpenCL provides a suitable solution to simultaneously exploit all available compute units for a given application. It also provides the flexibility to adapt to future improvements in computing architectures, which is of particular importance for CBM as an experiment in the construction stage. This flexibility, however, comes at the price of a reduced performance on CPU when compared to pure parallel programming paradigms.
8. Outlook
The data selection procedure developed and investigated in this article relies on data aggregated into events, each corresponding to a single nucleus-nucleus interaction. The data acquisition of the CBM experiment, however, will deliver free-streaming data not associated to single events by a hardware trigger. To properly account for this situation, not only the spatial coordinates, but also the time measurement of each hit must be considered. This will increase the complexity of the current, rather simple algorithm.

We are working together with the CBM collaboration towards extending the algorithm to event building and selection from the real online data stream, and we will also investigate the throughput on multi-core and many-core platforms in parallel using hybrid programming [45]. In addition, other algorithmic approaches to the trigger problem will be investigated, reducing the combinatorics by a more selective triplet construction.

Our study shows that the computational problem can be solved with reasonable expenditure on CPUs, but also on GPUs as co-processors, or by a combination of both [46, 47]. It does not yet include a full exploitation of possible measures for further acceleration, like using vendor-specific compilers (Intel) or manual code vectorization. Such investigations will be performed in the future as prerequisites for a decision on the hardware architecture, which of course will have to balance performance with acquisition and running costs.
References

[1] https://fair-center.eu/for-users/experiments/cbm-and-hades/cbm.html, June 2019
[2] https://fair-center.eu, June 2019
[3] T. Ablyazimov et al. (CBM collaboration), "Challenges in QCD matter physics – The scientific programme of the Compressed Baryonic Matter experiment at FAIR", Eur. Phys. J. A 53, 60 (2017), doi:10.1140/epja/i2017-12248-y
[4] V. Friese (for the CBM collaboration), "The high-rate data challenge: computing for the CBM experiment", J. Phys.: Conf. Ser. 898, 112003 (2017), doi:10.1088/1742-6596/898/10/112003
[5] V. Friese, "Computational Challenges for the CBM Experiment", in: G. Adam, J. Busa, M. Hnatic (eds.), Mathematical Modeling and Computational Science, MMCP 2011, Lecture Notes in Computer Science, vol. 7125, Springer, Berlin, Heidelberg 2012, doi:10.1007/978-3-642-28212-6_2
[6] D. B. Kirk and W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach", 2nd Edition, Morgan Kaufmann, 2013
[7] S. Ryoo et al., "Program Optimization Space Pruning for a Multithreaded GPU", in: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO 08), ACM, New York NY 2008, doi:10.1145/1356058.1356084
[8] J. Diaz, C. Munoz-Caro, and A. Nino, "A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era", IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 8, pp. 1369-1386 (2012), doi:10.1109/TPDS.2011.308
[9] , October 2018
[10] K. Karimi, N. G. Dickson and F. Hamze, "A Performance Comparison of CUDA and OpenCL", arXiv:1005.2581 (2010)
[11] J. Fang, A. L. Varbanescu and H. Sips, "A Comprehensive Performance Comparison of CUDA and OpenCL", in: International Conference on Parallel Processing, IEEE 2011, doi:10.1109/ICPP.2011.45
[12] A. A. Alabboud, S. Hasan, N. A. A. Hamid and A. Y. Tuama, "Performance Analysis of MPI Approaches and Pthread in Multi Core Systems", Journal of Engineering and Applied Sciences 12, 609-616 (2017), doi:10.3923/jeasci.2017.609.616
[13] A. C. Sodan, "Message-Passing and Shared-Data Programming Models: Wish vs. Reality", in: 19th International Symposium on High Performance Computing Systems and Applications, IEEE 2005, doi:10.1109/HPCS.2005.34
[14] K. Karimi, "The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming", arXiv:1503.06532 (2015)
[15] The CBM Collaboration,