GPU-accelerated machine learning inference as a service for computing in neutrino experiments
Michael Wang, Tingjun Yang, Maria Acosta Flechas, Philip Harris, Benjamin Hawks, Burt Holzman, Kyle Knoepfel, Jeffrey Krupa, Kevin Pedro, Nhan Tran
Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Northwestern University, Evanston, IL 60208, USA

Correspondence: Michael Wang, [email protected]
ABSTRACT
Machine learning algorithms are becoming increasingly prevalent and performant in the reconstruction of events in accelerator-based neutrino experiments. These sophisticated algorithms can be computationally expensive. At the same time, the data volumes of such experiments are rapidly increasing. The demand to process billions of neutrino events with many machine learning algorithm inferences creates a computing challenge. We explore a computing model in which heterogeneous computing with GPU coprocessors is made available as a web service. The coprocessors can be efficiently and elastically deployed to provide the right amount of computing for a given processing task. With our approach, Services for Optimized Network Inference on Coprocessors (SONIC), we integrate GPU acceleration specifically for the ProtoDUNE-SP reconstruction chain without disrupting the native computing workflow. With our integrated framework, we accelerate the most time-consuming task, track and particle shower hit identification, by a factor of 17. This results in a factor of 2.7 reduction in the total processing time when compared with CPU-only production. For this particular task, only 1 GPU is required for every 68 CPU threads, providing a cost-effective solution.
Fundamental particle physics has pushed the boundaries of computing for decades. As detectors have become more sophisticated and granular, particle beams more intense, and data sets larger, the biggest fundamental physics experiments in the world have been confronted with massive computing challenges. The Deep Underground Neutrino Experiment (DUNE) [1], the future flagship neutrino experiment based at Fermi National Accelerator Laboratory (Fermilab), will conduct a rich program in neutrino and underground physics, including determination of the neutrino mass hierarchy [2] and measurements of CP violation [3] in neutrino mixing using a long-baseline accelerator-based neutrino beam, detection and measurements of atmospheric and solar neutrinos [4], searches for supernova-burst neutrinos [5] and other neutrino bursts from astronomical sources, and searches for physics at the grand unification scale via proton decay [6].
The detectors will consist of four modules, of which at least three are planned to be 17 kton Liquid Argon Time Projection Chambers (LArTPCs). Charged particles produced in neutrino or other particle interactions will travel through and ionize the argon, with ionization electrons drifted over many meters in a strong electric field and detected on planes of sensing wires or printed-circuit-board charge collectors. The result is essentially a high-definition image of a neutrino interaction, which naturally lends itself to applications of machine learning (ML) techniques designed for image classification, object detection, and semantic segmentation. ML can also aid in other important applications, like noise reduction and anomaly or region-of-interest detection.

Due to the number of channels and long readout times of the detectors, the data volume produced by the detectors will be very large: uncompressed continuous readout of a single module will exceed a terabyte per second. Because that amount of data is impossible to collect and store, let alone process, and because most of that data will not contain interactions of interest, a real-time data selection scheme must be employed to identify data containing neutrino interactions. With a limit on total bandwidth of roughly 30 PB of data per year for all DUNE modules, that data selection scheme, with accompanying compression, must effectively reduce the data rate by several orders of magnitude.

In addition to applications in real-time data selection, accelerated ML inference that can scale to process large data volumes will be important for offline reconstruction and selection of neutrino interactions. The individual events are expected to have a size on the order of a few gigabytes, and extended readout events (associated, for example, with supernova burst events) may be significantly larger, reaching the 100 TB scale per module. It will be a challenge to efficiently analyze that data without an evolution of the computing models and technology that can handle data retrieval, transport, parallel processing, and storage in a cohesive manner. Moreover, similar computing challenges exist for a wide range of existing neutrino experiments such as MicroBooNE [7] and NOvA [8].

In this paper, we focus on the acceleration of the inference of deep ML models as a solution for processing large amounts of data in the ProtoDUNE single-phase apparatus (ProtoDUNE-SP) [9] reconstruction workflow. For ProtoDUNE-SP, ML inference is the most computationally intensive part of the full event processing chain and is run repeatedly on hundreds of millions of events. A growing trend to improve computing power has been the development of hardware that is dedicated to accelerating certain kinds of computations. Pairing a specialized coprocessor with a traditional CPU, referred to as heterogeneous computing, greatly improves performance. These specialized coprocessors utilize natural parallelization and provide higher data throughput. In this study, the coprocessors employed are graphics processing units (GPUs); however, the approach can accommodate multiple types of coprocessors in the same workflow. ML algorithms, and in particular deep neural networks, are a driver of this computing architecture revolution.

For optimal integration of GPUs into the neutrino event processing workflow, we deploy them "as a service." The specific approach is called Services for Optimized Network Inference on Coprocessors (SONIC) [10, 11, 12], which employs a client-server model.
The primary processing job, including the clients, runs on the CPU, as is typically done in particle physics, and the ML model inference is performed on a GPU server. This can be contrasted with a more traditional model in which a GPU is directly connected to each CPU node. The SONIC approach allows a more flexible computing architecture for accelerating particle physics computing workflows, providing the optimal number of heterogeneous computing resources for a given task.

The rest of this paper is organized as follows. We first discuss the related works that motivated and informed this study. In Section 2, we describe the tasks for ProtoDUNE-SP event processing and the
specific reconstruction task for which an ML algorithm has been developed. We detail how the GPU coprocessors are integrated into the neutrino software framework as a service on the client side and how we set up and scale out GPU resources in the cloud. In Section 3, we present the results, which include single job and batch job multi-CPU/GPU latency and throughput measurements. Finally, in Section 4, we summarize the study and discuss further applications and future work.
Related Work
Inference as a service was first employed for particle physics in Ref. [10]. This initial study utilized custom field programmable gate arrays (FPGAs) manufactured by Intel Altera and provided through the Microsoft Brainwave platform [13]. These FPGAs achieved low-latency, high-throughput inference for large convolutional neural networks such as ResNet-50 [14] using single-image batches. This acceleration of event processing was demonstrated for the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC), using a simplified workflow focused on inference with small batch sizes. Our study with GPUs for neutrino experiments focuses on large batch size inferences. GPUs are also used in elements of the event simulation for the IceCube Neutrino Observatory [15]; recently, the GPU-based elements of that simulation were deployed at large scale in a cloud burst [16].

Modern deep ML algorithms have been embraced by the neutrino reconstruction community because popular computer vision and image processing techniques are highly compatible with the neutrino reconstruction task and the detectors that collect the data. NOvA has applied a custom convolutional neural network (CNN), inspired by GoogLeNet [17], to the classification of neutrino interactions in their segmented liquid scintillator-based detector [18]. MicroBooNE, which uses a LArTPC detector, has conducted an extensive study of various CNN architectures and demonstrated their effectiveness in classifying and localizing single particles in a single wire plane, classifying neutrino events and localizing neutrino interactions in a single plane, and classifying neutrino events using all three planes [19]. In addition, MicroBooNE has applied a class of CNNs known as semantic segmentation networks to 2D images formed from real data acquired from the LArTPC collection plane, in order to classify each pixel as being associated with an EM particle, another type of particle, or background [20]. DUNE, which will also use LArTPC detectors, has implemented a CNN based on the SE-ResNet [21] architecture to classify neutrino interactions in simulated DUNE far detector events [22]. Lastly, a recent effort has successfully demonstrated an extension of the 2D pixel-level semantic segmentation network from MicroBooNE to three dimensions, using submanifold sparse convolutional networks [23, 24].
In this study, we focus on a specific computing workflow, the ProtoDUNE-SP reconstruction chain, to demonstrate the power and flexibility of the SONIC approach. ProtoDUNE-SP, assembled and tested at the CERN Neutrino Platform (the NP04 experiment at CERN) [25], is designed to act as a test bed and full-scale prototype for the elements of the first far detector module of DUNE. It is currently the largest LArTPC ever constructed and is vital to develop the technology required for DUNE. This includes the reconstruction algorithms that will extract physics objects from the data obtained using LArTPC detectors, as well as the associated computing workflows.

In this section, we will first describe the ProtoDUNE-SP reconstruction workflow and the ML model that is the current computing bottleneck. We will then describe the SONIC approach and how it was integrated into the LArTPC reconstruction software framework. Finally, we will describe how this approach can be scaled up to handle even larger workflows with heterogeneous computing.
The workflow used in this paper is the full offline reconstruction chain for the ProtoDUNE-SP detector, which is a good representative of event reconstruction in present and future accelerator-based neutrino experiments. In each event, ionizing particles pass through the liquid argon, emitting scintillation light that is recorded by photodetectors. The corresponding pulses are reconstructed as optical hits. These hits are grouped into flashes, from which various parameters are determined, including time of arrival, spatial characteristics, and number of photoelectrons detected.

After the optical reconstruction stage, the workflow proceeds to the reconstruction of LArTPC wire hits. It begins by applying a deconvolution procedure to recover the original waveforms by disentangling the effects of electronics and field responses after noise mitigation. The deconvolved waveforms are then used to find and reconstruct wire hits, providing information such as time and collected charge. Once the wire hits have been reconstructed, the 2D information provided by the hits in each plane is combined with that from the other planes in order to reconstruct 3D space points. This information is primarily used to resolve ambiguities caused by the induction wires in one plane wrapping around into another plane.

The disambiguated collection of reconstructed 2D hits is then fed into the next stage, which consists of a modular set of algorithms provided by the Pandora software development kit [26]. This stage finds the high-level objects associated with particles, like tracks, showers, and vertices, and assembles them into a hierarchy of parent-daughter nodes that ultimately point back to the candidate neutrino interaction.

The final module in the chain,
EmTrackMichelId, is an ML algorithm that classifies reconstructed wire hits as being track-like, shower-like, or Michel electron-like [27]. This algorithm begins by constructing 48 × 48 pixel images whose two dimensions are the time t and the wire number w in the plane. These images, called patches, are centered on the peak time and wire number of the reconstructed hit being classified. The value of each pixel corresponds to the measured charge deposition in the deconvolved waveforms for the wire number and time interval associated with the row and column, respectively, of the pixel. Inference is performed on these patches using a convolutional neural network. Importantly, over the entire ProtoDUNE-SP detector, there are many 48 × 48 patches to be classified, such that a typical event may have ∼55,000 patches to process. Because of the way the data is processed per wire plane, those ∼55,000 patches are processed in batches with average sizes of either 235 or 1693. We will explore the performance implications of this choice in the next section.
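For illustration, the following minimal numpy sketch shows the kind of patch construction described above. The array layout, function name, and edge handling are our assumptions; the actual implementation lives in LArSoft's PointIdAlg algorithm, described later in this section.

```python
import numpy as np

def make_patch(adc, wire, peak_tick, size=48):
    """Crop a size x size patch of deconvolved ADC values centered on a hit.

    adc:       2D array [n_wires, n_ticks] of deconvolved waveforms, one plane
    wire:      wire number of the reconstructed hit
    peak_tick: peak time (in ticks) of the reconstructed hit
    """
    half = size // 2
    patch = np.zeros((size, size), dtype=np.float32)  # zero-padded at edges
    w0, t0 = wire - half, peak_tick - half
    # clip the requested window to the boundaries of the plane
    ws = slice(max(w0, 0), min(w0 + size, adc.shape[0]))
    ts = slice(max(t0, 0), min(t0 + size, adc.shape[1]))
    patch[ws.start - w0:ws.stop - w0, ts.start - t0:ts.stop - t0] = adc[ws, ts]
    return patch

# batches of such patches are then classified together
# (average batch sizes of 235 or 1693, as noted above)
```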
The neural network employed by the EmTrackMichelId module of the ProtoDUNE-SP reconstruction chain consists of a 2D convolutional layer followed by two fully connected (FC) layers. The convolutional layer takes each of the 48 × 48 pixel patches described in Section 2.1 and applies 48 separate 5 × 5 convolutions to it, using stride lengths of 1, to produce 48 corresponding 44 × 44 pixel feature maps. These feature maps are then fed into the first FC layer consisting of 128 neurons, which is, in turn, connected to the second FC layer of 32 neurons. Rectified linear unit (ReLU) activation functions are applied after the convolutional layer and each of the two FC layers. Dropout layers are implemented between the convolutional layer and the first FC layer and between the two FC layers to help prevent overfitting. The second FC layer splits into two separate branches. The first branch terminates in three outputs that are constrained to sum to one by a softmax activation function, and the second branch terminates in a single output with a sigmoid activation function that limits its value to the range 0 to 1. The total number of trainable parameters in this model is 11,900,420.

Figure 1. Architecture of the neural network used by the EmTrackMichelId module in the ProtoDUNE-SP reconstruction chain: the output of a convolutional (2DConv) layer is flattened and fed into two fully connected layers.
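This architecture can be written down compactly in Keras; the sketch below reproduces the layer structure and the quoted count of 11,900,420 trainable parameters. The dropout rates and layer names are our assumptions, not taken from the production training configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

patch = layers.Input(shape=(48, 48, 1))                             # one t-w patch
x = layers.Conv2D(48, (5, 5), strides=1, activation="relu")(patch)  # 48 maps of 44x44
x = layers.Dropout(0.2)(x)                                          # rate assumed
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)                         # first FC layer
x = layers.Dropout(0.2)(x)                                          # rate assumed
x = layers.Dense(32, activation="relu")(x)                          # second FC layer
track_shower_michel = layers.Dense(3, activation="softmax")(x)      # outputs sum to one
extra = layers.Dense(1, activation="sigmoid")(x)                    # value in [0, 1]
model = tf.keras.Model(patch, [track_shower_michel, extra])
model.summary()  # reports 11,900,420 trainable parameters
```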
ProtoDUNE-SP reconstruction code is based on the LArSoft C++ software framework [28], which provides a common set of tools shared by many LArTPC-based neutrino experiments. Within this framework, EmTrackMichelId, which is described in Section 2.1, is a LArSoft "module" that makes use of the PointIdAlg "algorithm." EmTrackMichelId passes the wire number and peak time associated with a hit to PointIdAlg, which constructs the patch and performs the inference task to classify it.

In this study, we follow the SONIC approach that is also in development for other particle physics applications. It is a client-server model, in which the coprocessor hardware used to accelerate the neural network inference is separate from the CPU client and accessed as a (web) service. The neural network inputs are sent via TCP/IP network communication to the GPU. In this case, a synchronous, blocking call is used: the thread makes the web service request and then waits for the response from the server side, proceeding only once the server sends back the network output. In ProtoDUNE-SP, the CPU usage of the workflow described in Section 2.1 is dominated by the neural network inference. Therefore, a significant increase in throughput can still be achieved despite including the latency from the remote call while the CPU waits for the remote inference. An asynchronous, non-blocking call would be slightly more efficient, as it would allow the CPU to continue with other work while the remote call was ongoing. However, this would require significant development in LArSoft for applications of task-based multithreading, as described in Ref. [29].
In this work, the functionality to access the GPU as a service was realized by implementing the C++ client interface, provided by the Nvidia Triton inference server [30], in a new subclass of the InterfaceModel class in LArSoft. This class is used by PointIdAlg to access the underlying ML model that performs the inference. The patches constructed by PointIdAlg are put into the proper format and transmitted to the GPU server for processing, while a blocking operation ensues until inference results are received from the server. Communication between the client and server is achieved through remote procedure calls based on gRPC [31]. The desired model interface subclass and its parameters are selected at run time by the user through a FHiCL [32] configuration file. The code is found in the LArSoft/larrecodnn package [33]. On the server side, we deploy Nvidia T4 GPUs targeting data center acceleration.

This flexible approach has several advantages:

• Rather than requiring one coprocessor per CPU with a direct connection over PCIe, many worker nodes can send requests to a single GPU, as depicted in Fig. 2. This allows heterogeneous resources to be allocated and re-allocated based on demand and task, providing significant flexibility and potential cost reduction. The CPU-GPU system can be "right-sized" to the task at hand, and with modern server orchestration tools, described in the next section, it can elastically deploy coprocessors.

• There is a reduced dependency on open-source ML frameworks in the experimental code base. Otherwise, the experiment would be required to integrate and support separate C++ APIs for every framework in use.

• In addition to coprocessor resource scaling flexibility, this approach allows the event processing to use multiple types of heterogeneous computing hardware in the same job, making it possible to match the computing hardware to the ML algorithm. The system could, for example, use both FPGA and GPU servers to accelerate different tasks in the same workflow.

There are also challenges to implementing a computing model that accesses coprocessors as a service. Orchestration of the client-server model can be more complicated, though we find that this is facilitated with modern tools like the Triton inference server and Kubernetes. In Section 4, we will discuss future plans to demonstrate production at full scale. Networking from client to server incurs additional latency, which may lead to bottlenecks from limited bandwidth. For this particular application, we account for and measure the additional latency from network bandwidth; it is a small, though non-negligible, contribution to the overall processing.

The Triton software also handles load balancing for servers that provide multiple GPUs, further increasing the flexibility of the server. In addition, the Triton server can host multiple models from various ML frameworks. One particularly powerful feature of the Triton inference server is dynamic batching, which combines multiple requests into optimally-sized batches in order to perform inference as efficiently as possible for the task at hand. This effectively enables simultaneous processing of multiple events without any changes to the experiment software framework, which assumes one event is processed at a time.
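In LArSoft the client is implemented in C++, but the same blocking request pattern can be sketched with Triton's Python client. The endpoint, model name, and tensor names below are illustrative assumptions, not the production configuration:

```python
import numpy as np
import tritonclient.grpc as grpcclient  # Triton gRPC client library

# connect to the (load-balanced) Triton endpoint
client = grpcclient.InferenceServerClient(url="triton.example.org:8001")

# one batch of patches, shaped as the model expects
patches = np.random.rand(1693, 48, 48, 1).astype(np.float32)
inp = grpcclient.InferInput("patch_input", list(patches.shape), "FP32")
inp.set_data_from_numpy(patches)

# synchronous, blocking call: returns only once the server has answered
result = client.infer(
    model_name="emtrackmichelid",  # illustrative model name
    inputs=[inp],
    outputs=[grpcclient.InferRequestedOutput("track_shower_michel"),
             grpcclient.InferRequestedOutput("extra")])
scores = result.as_numpy("track_shower_michel")  # (1693, 3) class scores
```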
We performed tests on many different combinations of computing hardware, which provided a deeper understanding of networking limitations within both Google Cloud and on-premises data centers. Even though the Triton Inference Server does not consume significant CPU power, the number of CPU cores provisioned for the node did have an impact on the maximum ingress bandwidth achieved in the early tests.
Figure 2.
Depiction of the client-server model using Triton, where multiple CPU processes on the client side access the AI model on the server side.
Figure 3.
The Google Kubernetes Engine setup, which demonstrates how the Local Compute FermiGrid farm communicates with the GPU server and how the server is orchestrated through Kubernetes.

To scale the Nvidia T4 GPU throughput flexibly, we deployed a Google Kubernetes Engine (GKE) cluster for server-side workloads. The cluster is deployed in the US-Central data center, which is located in Iowa; this impacts the data travel latency. The cluster was configured using a Deployment and ReplicaSet. These are Kubernetes artifacts for application deployment, management, and control. They hold resource requests, container definitions, persistent volumes, and other information describing the desired state of the containerized infrastructure. Additionally, a load-balancing service was deployed to distribute incoming network traffic among the Pods. We implemented Prometheus-based monitoring, which provided insight into three aspects: system metrics for the underlying virtual machine, Kubernetes metrics on the overall health and state of the cluster, and inference-specific metrics gathered from the Triton Inference Server via a built-in Prometheus publisher. All metrics were visualized through a Grafana instance, also deployed within the same cluster. The setup is depicted in Fig. 3.
A Pod is a group of one or more containers with shared storage and network, and a specification for how to run the containers. A Pod's contents are always co-located and co-scheduled, and run in a shared context within Kubernetes Nodes [34]. We kept the Pod to Node ratio at 1:1 throughout the studies, with each Pod running an instance of the Triton Inference Server (v20.02-py3) from the Nvidia Docker repository. The Pod hardware requests aim to maximize the use of allocatable virtual CPU (vCPU) and memory and to use all GPUs available to the container.

In this scenario, one might naively assume that a small instance such as n1-standard-2, with 2 vCPUs, 7.5 GB of memory, and different GPU configurations (1, 2, 4, or 8), would be able to handle the workload, which would be distributed evenly over the GPUs. After performing several tests, we found that horizontal scaling would allow us to increase our ingress bandwidth, because Google Cloud imposes a hard limit on network bandwidth of 2 Gbit/s per vCPU, up to a theoretical maximum of 32 Gbit/s for each virtual machine [35]. Given these parameters, we found that the ideal setup for optimizing ingress bandwidth was to provision multiple Pods on 16-vCPU machines with fewer GPUs per Pod; a trimmed manifest of this kind is sketched below. For GPU-intensive tests, we took advantage of having a single point of entry, with Kubernetes balancing the load and provisioning multiple identical Pods behind the scenes, with the total GPU requirement defined as the sum of the GPUs attached to each Pod.
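For concreteness, a Deployment of the kind described above might look roughly as follows; the image path, labels, and resource numbers are illustrative assumptions rather than the exact production manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
spec:
  replicas: 2                     # horizontal scaling: more Pods, fewer GPUs each
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:20.02-py3  # image path assumed
        resources:
          limits:
            cpu: "15"             # most of a 16-vCPU node, for ingress bandwidth
            memory: 48Gi
            nvidia.com/gpu: 2     # all GPUs attached to the Pod's node
```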
Using the setup described in the previous section to deploy GPUaaS for accelerating machine learning inference, we measure the performance and compare against the default CPU-only workflow in ProtoDUNE-SP. First, we describe the baseline CPU-only performance. We then measure the server-side performance in different testing configurations, in both standalone and realistic conditions. Finally, we scale up the workflow and make detailed measurements of performance. We also derive a model for how we expect performance to scale and compare it to our measured results.
To compare against our heterogeneous computing system, we first measure the throughput of the CPU-only process. The workflow processes events from a Monte Carlo simulation of cosmic ray events in ProtoDUNE-SP, produced with the Corsika generator [36]. The radiological backgrounds, including 39Ar, 42Ar, 222Rn, and 85Kr, are also simulated, using the RadioGen module in LArSoft. Each event corresponds to a 3 ms readout window with a sampling rate of 0.5 µs per time tick. The total number of electronic channels is 15360. A white noise model with an RMS of a few ADC counts was used. The workflows are executed on grid worker nodes running Scientific Linux 7, with CPU types shown in Table 1. The fraction of all clients that ran with an average client-side batch size of 1693 for each CPU type is shown in the second column of this table. Of these clients, 64% ran on nodes with 10 Gbps network interface cards (NICs), and the remainder ran on nodes with 1 Gbps NICs.

We measure the time it takes for each module in the reconstruction chain to run and divide the modules into two basic categories: the non-ML modules and the ML module. The time values are given in Table 2. Of the CPU time in the ML module, we measure that 10 s is dedicated to data preprocessing to prepare for neural network inference, and the rest of the time, 210 s, is spent in inference. This is the baseline performance to which we will compare our results using GPUaaS.
CPU type                          fraction (%)
AMD EPYC 7502 @ 2.5 GHz               11.7
AMD Opteron 6134 @ 2.3 GHz             0.6
AMD Opteron 6376 @ 2.3 GHz             4.6
Intel Xeon E5-2650 v2 @ 2.6 GHz       30.8
Intel Xeon E5-2650 v3 @ 2.3 GHz        5.2
Intel Xeon E5-2670 v3 @ 2.3 GHz        7.3
Intel Xeon E5-2680 v4 @ 2.4 GHz       17.3
Intel Xeon Gold 6140 @ 2.3 GHz        22.6

Table 1. CPU types and distribution for the grid worker nodes used for the "big-batch" clients (see text for more details).

              ML module   non-ML modules   Total
Wall time (s)    220           110          330

Table 2. The average CPU-only wall time per job for the different module categories.
To get a standardized measure of the performance, we first use standard tools for benchmarking the GPU performance. Then we perform a stress test on our GPUaaS instance to understand the server-side performance under high load.
Server standalone performance
The baseline performance of the GPU server running the EmTrackMichelId model is measured using the perf_client tool included in the Nvidia Triton inference server package. The tool emulates a simple client by generating requests over a defined time period; it then reports the latency and throughput, repeating the test until the results are stable. We define the baseline performance as the throughput obtained at the saturation point of the model on the GPU. We attain this by increasing the client-side request concurrency—the maximum number of unanswered requests from the client—until the throughput saturates. We find that the model reaches this limit quickly, at a client-side concurrency of only 2 requests. At this point, the throughput is approximately 22,000 inferences per second per GPU, which corresponds to an event processing time of about 2.5 s for the ∼55,000 patches in a typical event. This is the baseline expectation of the performance of the GPU server.
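A representative invocation of the tool might look as follows; the flags follow our reading of the v1.x perf_client documentation, and the model name and endpoint are assumptions:

```
perf_client -u triton.example.org:8001 -i gRPC \
            -m emtrackmichelid -b 235 -t 2
# -b: batch size per request; -t: client-side request concurrency
```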
Saturated server stress test

To understand the behavior of the GPU server performance in a more realistic setup, we set up many simultaneous CPU processes that make inference requests to the GPU. This saturates the GPUs, keeping the pipeline of inference requests as full as possible, and we measure several quantities from the GPU server in this scenario. To maximize throughput, we activate the dynamic batching feature of Triton, which allows the server to combine multiple requests together in order to take advantage of the efficient batch processing of the GPU. This requires only one line in the server configuration file (a configuration sketch is shown after Fig. 4).

In this setup, we run 400 simultaneous CPU processes that send requests to the GPU inference server. This is the same compute farm described in Sec. 3.1. The jobs are held in an idle state until all jobs are
allocated CPU resources and all input files are transferred to local storage on the grid worker nodes, at which point the event processing begins simultaneously. This ensures that the GPU server is handling inference requests from all the CPU processes at the same time. This test uses a batch size of 1693. We monitor the following performance metrics of the GPU server in 10-minute intervals:

• GPU server throughput: for the 4-GPU server, we measure that the server performs about 122,000 inferences per second for large batch size with dynamic batching; this amounts to 31,000 inferences per second per GPU, as shown in Fig. 4 (top left). This is higher than the measurement from the standalone performance client by a factor of ∼1.4. For large batch size without dynamic batching, we observe similar throughput, while for small batch size without dynamic batching, we find that performance is somewhat worse, close to the standalone client performance of 22,000 inferences per second per GPU.

• GPU processing usage: we monitor how occupied the GPU processing units are. We find that the GPU is ∼60% occupied during saturated processing, as shown in Fig. 4 (top right).

• GPU batch throughput: we measure how many batches of inferences are processed by the server. The batch size sent by the CPU processes is 1693 on average, but dynamic batching prefers to run at a typical batch size of 5358, as shown in Fig. 4 (bottom).

Figure 4. Top left: The number of inferences per second processed by the 4-GPU server, which saturates at approximately 126,000. Top right: The GPU usage, which peaks around 60%. Bottom: The number of total batches processed by the 4-GPU server. The incoming batches are sent to the server with size 1693, but are combined up to size 5358 for optimal performance.
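For reference, the dynamic batching behavior seen in Fig. 4 (bottom) is switched on by a single stanza in the model's config.pbtxt on the server. A minimal sketch follows; the model name, backend, and batch sizes are illustrative, and an empty dynamic_batching { } stanza is already sufficient:

```
name: "emtrackmichelid"            # illustrative model name
platform: "tensorflow_savedmodel"  # assumed serving backend
max_batch_size: 8192
dynamic_batching {
  preferred_batch_size: [ 5358 ]   # batch size the scheduler aims to build
}
```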
In the previous section, we discussed the GPU server performance in detail. With that information, we study the performance of the entire system and the overall improvement expected in throughput. To describe the important elements of the as-a-service computing model, we first define some concepts and variables. Many of these elements have been described in previous sections, but we collect them here to present a complete model.

• $t_{\mathrm{CPU}}$ is the total time for CPU-only inference, as described in Sec. 3.1. This is measured to be 330 s.

• $p$ is the fraction of $t_{\mathrm{CPU}}$ that can be accelerated, and conversely $1-p$ is the fraction of the processing that is not being accelerated. The ML inference task takes 220 s, but we subtract $t_{\mathrm{preprocess}}$, the time required for data preprocessing, which still occurs on the CPU even when the algorithm is offloaded to the GPU. This results in $p = 0.64$.

• $t_{\mathrm{GPU}}$ is the time explicitly spent doing inference on the GPU. We measure throughputs of 22,000–31,000 inferences per second per GPU, depending on whether or not dynamic batching is used. For 55,000 inferences per event, this corresponds to 1.8 s (2.5 s) when dynamic batching is enabled (disabled with small batch).

• $t_{\mathrm{preprocess}}$ is the time spent on the CPU for preprocessing to prepare the input data to be transmitted to the server in the correct format. This is measured to be 10 s.

• $t_{\mathrm{transmit}}$ is the latency incurred from transmitting the neural network input data. For 55,000 inferences per event, with each input a 48 × 48 image at 32 bits per pixel, the total amount of data transmitted is about 4.1 Gigabits per event. Sec. 2.4 specifies the per-vCPU bandwidth limit on the server side, while Sec. 3.1 specifies 1 Gbps or 10 Gbps link speeds on the client side. Therefore, the communication bottleneck varies between 1 Gbps and 10 Gbps, such that the total latency for transmitting data is between 0.4 s and 4.1 s.

• $t_{\mathrm{travel}}$ is the travel latency to go from the Fermilab data center, which hosts the CPU processes, to the GCP GPUs. This depends on the number of requests $N_{\mathrm{request}}$ and the latency per request $t_{\mathrm{request}}$. The latter can vary based on networking conditions, but we measure it to be a few tens of milliseconds. The small batch size of about 256 images requires $N_{\mathrm{request}} = 214$ to transmit the 55,000 images, while the large batch size of about 1720 images requires $N_{\mathrm{request}} = 32$. Given these parameters, the travel latency per event is roughly seven times larger for the small batch size than for the large one.

• $t_{\mathrm{latency}} = t_{\mathrm{preprocess}} + t_{\mathrm{transmit}} + t_{\mathrm{travel}}$ summarizes the additional latency distinct from the actual GPU processing.

• $t_{\mathrm{ideal}}$ is the total processing time assuming the GPU is always available; this is described in more detail in the next section.

• $N_{\mathrm{CPU}}$ and $N_{\mathrm{GPU}}$ are the numbers of simultaneously running CPU processes and GPUs, respectively.

With each element of the system latency now defined, we can model the performance of SONIC. Initially, we assume blocking modules and zero communication latency. We define $p$ as the fraction of the event which can be accelerated, such that the total time of a CPU-only job is trivially

$$t_{\mathrm{CPU}} = (1-p)\,t_{\mathrm{CPU}} + p\,t_{\mathrm{CPU}}. \quad (1)$$

We replace the time for the accelerated module with the GPU latency terms:

$$t_{\mathrm{ideal}} = (1-p)\,t_{\mathrm{CPU}} + t_{\mathrm{GPU}} + t_{\mathrm{latency}}. \quad (2)$$

This reflects the ideal scenario when the GPU is always available for the CPU job. We also include $t_{\mathrm{latency}}$, which accounts for the preprocessing, bandwidth, and travel time to the GPU. The value of $t_{\mathrm{GPU}}$ is fixed, unless the GPU is saturated with requests. We define this condition in terms of how many GPU requests can be made while a single CPU is processing an event. The GPU saturation condition is therefore

$$\frac{N_{\mathrm{CPU}}}{N_{\mathrm{GPU}}} > \frac{t_{\mathrm{ideal}}}{t_{\mathrm{GPU}}}, \quad (3)$$

where $t_{\mathrm{ideal}}$ is given by Eq. (2), the processing time assuming the GPU is not saturated. There are two regimes, unsaturated and saturated GPU, which correspond to $N_{\mathrm{CPU}}/N_{\mathrm{GPU}} < t_{\mathrm{ideal}}/t_{\mathrm{GPU}}$ and $N_{\mathrm{CPU}}/N_{\mathrm{GPU}} > t_{\mathrm{ideal}}/t_{\mathrm{GPU}}$, respectively. We can compute the total latency ($t_{\mathrm{SONIC}}$) to account for both cases:

$$t_{\mathrm{SONIC}} = (1-p)\,t_{\mathrm{CPU}} + t_{\mathrm{GPU}} \left[\max\left(1,\ \frac{N_{\mathrm{CPU}}}{N_{\mathrm{GPU}}} - \frac{t_{\mathrm{ideal}}}{t_{\mathrm{GPU}}}\right)\right] + t_{\mathrm{latency}}. \quad (4)$$

Therefore, the total latency is constant when the GPUs are not saturated and increases linearly, proportional to $t_{\mathrm{GPU}}$, in the saturated case. Substituting Eq. (2) for $t_{\mathrm{ideal}}$, the saturated case simplifies to

$$t_{\mathrm{SONIC}} = t_{\mathrm{GPU}} \times \frac{N_{\mathrm{CPU}}}{N_{\mathrm{GPU}}}. \quad (5)$$

To test the performance of the SONIC approach, we use the "Saturated server stress test" setup described in Section 3.2. We vary the number of simultaneous jobs from 1 to 400 CPU processes. To test different computing model configurations, we run the inference with two different batch sizes: 235 (small batch) and 1693 (large batch). This size is specified at run time through a parameter for the
EmTrackMichelId module in the FHiCL [32] configuration file describing the workflow. With the small batch size, inferences are carried out in approximately 235 batches per event. Increasing the batch size to 1693 reduces the number of inference calls sent to the Triton server to 32 batches per event, which decreases the travel latency. We also test the performance impact of enabling or disabling dynamic batching on the server.

In Fig. 5 (left), we show the performance results for the latency of the EmTrackMichelId module for small batch size vs. large batch size, with dynamic batching turned off. The most important performance feature is the basic trend: the processing time is flat as a function of the number of simultaneous CPU processes, up to 190 (270) processes for small (large) batch size. After that, the processing time begins to grow, as the GPU server becomes saturated and additional latency is incurred while service requests are queued. For example, in the large batch case, the performance of the EmTrackMichelId module is constant whether there are 1 or 270 simultaneous CPU processes making requests to the server. Therefore, using fewer than 270 simultaneous CPU processes with the 4-GPU server is an inefficient use of the GPU resources; we find that the optimal ratio of CPU processes to a single GPU is 68:1.

As described in Section 3.3, 10 s of the module time is spent on the CPU for preprocessing to prepare the inputs for neural network inference. The term $t_{\mathrm{travel}}$ is computed from the measured round-trip time for a single service request; since the small batch size requires 214 requests per event while the large batch size requires only 32, the total $t_{\mathrm{travel}}$ per event is roughly seven times larger for the small batch size, and the difference between the corresponding processing times for the two batch sizes roughly corresponds to this extra travel latency. We also see that in the small batch size configuration the GPU server saturates earlier, at about 190 simultaneous CPU processes, compared with about 270 simultaneous processes for the large batch size. This is because the GPU is more efficient with larger batch size: at a batch size of 235 (1693), the GPU server can process about 80,000 (125,000) images per second. The overall performance using the SONIC approach is compared to the model from Section 3.3, and the measurements match our expectations fairly well.

In Fig. 5 (right), we show the performance of the SONIC approach for large batch size with dynamic batching enabled or disabled, considering up to 400 simultaneous CPU processes. We find that, for our particular model, the large batch size of 1693 is already optimal and the performance is the same with or without dynamic batching. We also find that the model for large batch size matches the data well.
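The latency model of Eqs. (2)–(5) is simple enough to evaluate directly. The short Python sketch below does so, using representative values from this section; the default arguments are illustrative inputs, not new measurements:

```python
def sonic_time(n_cpu, n_gpu, t_cpu=330.0, p=0.64, t_gpu=1.8, t_latency=12.0):
    """Per-event processing time from the SONIC latency model, Eq. (4).

    t_cpu:     CPU-only time per event [s]; p: accelerable fraction
    t_gpu:     GPU inference time per event [s]
    t_latency: preprocessing + transmission + travel time per event [s]
    """
    serial = (1.0 - p) * t_cpu                # work that stays on the CPU
    t_ideal = serial + t_gpu + t_latency      # Eq. (2): GPU always available
    # Eq. (4): flat below saturation, linear in n_cpu/n_gpu above it
    gpu_term = t_gpu * max(1.0, n_cpu / n_gpu - t_ideal / t_gpu)
    return serial + gpu_term + t_latency

for n_cpu in (100, 270, 400):                 # 4-GPU server, as in Fig. 5
    print(n_cpu, round(sonic_time(n_cpu, 4), 1))
```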
Figure 5. Processing time for the EmTrackMichelId module as a function of simultaneous CPU processes, using a Google Kubernetes 4-GPU cluster. Left: small batch size vs. large batch size, with dynamic batching turned off. Right: large batch size performance with dynamic batching turned on and off. In both plots, the dotted lines indicate the predictions of the latency model, specifically Eq. (4).
EmTrackMichelId module now takes about s perevent in the most optimal configuration. This should be compared against the CPU-based inference, whichtakes s on average. The EmTrackMichelId module is accelerated by a factor of 17, and the total eventprocessing time goes from s to s on average, a factor of 2.7 reduction in the overall processingtime.Finally, it is important to note that throughout our studies using commercially available cloud computing,we have observed that there are variations in the GPU performance. This could result from a number offactors beyond our control, related to how CPU and GPU resources are allocated and configured in thecloud. Often, these factors are not even exposed to the users and therefore difficult to monitor. That said,
the GPU performance, i.e., the number of inferences per second, is a non-dominant contribution to the total job latency. Volatility in the GPU throughput primarily affects the optimal ratio of CPU processes to GPUs. We observe variations at the 30%–40% level, and in this study, we generally present conservative performance numbers.
In this study, we demonstrate for the first time the power of accessing GPUs as a service with the Services for Optimized Network Inference on Coprocessors (SONIC) approach to accelerate computing for neutrino experiments. We integrate the Nvidia Triton inference server into the LArSoft software framework, which is used for event processing and reconstruction in liquid argon neutrino experiments. We explore the specific example of the ProtoDUNE-SP reconstruction workflow. The reconstruction processing time is dominated by the EmTrackMichelId module, which runs neural network inference for a fairly traditional convolutional neural network algorithm over thousands of patches of the ProtoDUNE-SP detector. In the standard CPU-only workflow, the module consumes 65% of the overall CPU processing time.

We explore the SONIC approach, which abstracts the neural network inference as a web service. A 4-GPU server is deployed using the Nvidia Triton inference server, which includes powerful features such as load balancing and dynamic batching. The inference server is orchestrated using Google Cloud Platform's Kubernetes Engine. The SONIC approach provides flexibility in dynamically scaling the GPUaaS to match the inference requests from the CPUs, right-sizing the heterogeneous resources for optimal usage of computing. It also provides flexibility in dealing with different machine learning (ML) software frameworks and tool flows, which are constantly improving and changing, as well as flexibility in the heterogeneous computing hardware itself, such that different GPUs, FPGAs, or other coprocessors could be deployed together to accelerate neural network algorithms. In this setup, the EmTrackMichelId module is accelerated by a factor of 17, and the total event processing time goes from 330 s to about 123 s on average, resulting in a factor of 2.7 reduction in the overall processing time. We find that the optimal ratio of GPUs to simultaneous CPU processes is 1 to 68.

With these promising results, there are a number of interesting directions for further studies.

• Integration into full-scale production: A natural next step is to deploy this workflow at full scale, moving from 400 simultaneous CPU processes up to 1000–2000. While this should be fairly straightforward, there will be other interesting operational challenges in running multiple production campaigns. For example, the ability to instantiate the server as needed from the client side would be preferable, and the GPU resources should scale in an automated way when they become saturated. There are also operational challenges to ensure the right model is being served and that server-side metadata is preserved automatically.

• Server platforms: Related to the point above, physics experiments would ultimately prefer to run the servers without relying on the cloud, instead using local servers in lab and university data centers. Preliminary tests have been conducted with a single GPU server at the Fermilab Feynman Computing Center. Larger-scale tests are necessary, including the use of cluster orchestration platforms. Finally, a similar setup should be explored at high performance computing (HPC) centers, where a large amount of GPU resources may be available.

• Further GPU optimization: Thus far, the studies have not explored significant optimization of the actual GPU operations. In this paper, a standard 32-bit floating point implementation of the model was loaded in the Triton inference server. A simple extension would be to try model optimization using reduced numerical precision or quantization [37, 38].
• More types of heterogeneous hardware: In this study, we have deployed GPUs as a service, while in other studies, FPGAs and ASICs as a service were also explored. For this particular model and use case, with large batch sizes, GPUs already perform very well. However, the inference for other ML algorithms may be more optimal on different types of heterogeneous computing hardware. Therefore, it is important to study our workflow on other platforms and types of hardware.

By capitalizing on the synergy of ML and parallel computing technology, we have introduced SONIC, a non-disruptive computing model that provides accelerated heterogeneous computing with coprocessors, to neutrino physics computing. We have demonstrated large speed improvements in the ProtoDUNE-SP reconstruction workflow and anticipate more applications across neutrino physics and high energy physics more broadly.
ACKNOWLEDGMENTS
We acknowledge the Fast Machine Learning collective as an open community of multi-domain experts and collaborators; this community was important for the development of this project. We acknowledge the DUNE collaboration for providing the ProtoDUNE-SP reconstruction code and simulation samples. We would like to thank Tom Gibbs and Geetika Gupta from Nvidia for their support of this project. We thank Andrew Chappell, Javier Duarte, and Steven Timm for their detailed feedback on the manuscript.

M. A. F., B. Ha., B. Ho., K. K., K. P., N. T., M. W., and T. Y. are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. N. T. and B. Ha. are partially supported by the U.S. Department of Energy Early Career Award. K. P. is partially supported by the High Velocity Artificial Intelligence grant as part of the Department of Energy High Energy Physics Computational HEP program. P. H. is supported by NSF grants.
REFERENCES

[1] DUNE collaboration, S. Jones et al., Deep Underground Neutrino Experiment (DUNE), Far Detector Technical Design Report, Volume 1: Introduction to DUNE.
[2] X. Qian and P. Vogel, Neutrino Mass Hierarchy, Prog. Part. Nucl. Phys. (2015) 1.
[3] H. Nunokawa, S. J. Parke and J. W. Valle, CP Violation and Neutrino Oscillations, Prog. Part. Nucl. Phys. (2008) 338.
[4] F. Capozzi, S. W. Li, G. Zhu and J. F. Beacom, DUNE as the Next-Generation Solar Neutrino Experiment, Phys. Rev. Lett. (2019) 131803.
[5] K. Scholberg, Supernova Neutrino Detection, Ann. Rev. Nucl. Part. Sci. (2012) 81.
[6] DUNE collaboration, V. A. Kudryavtsev, Underground physics with DUNE, J. Phys. Conf. Ser. (2016) 062032.
[7] MicroBooNE collaboration, R. Acciarri et al., Design and Construction of the MicroBooNE Detector, JINST (2017) P02017.
[8] NOvA collaboration, D. Ayres et al., The NOvA Technical Design Report.
[9] DUNE collaboration, B. Abi et al., First results on ProtoDUNE-SP liquid argon time projection chamber performance from a beam test at the CERN Neutrino Platform.
[10] J. Duarte et al., FPGA-accelerated machine learning inference as a service for particle physics computing, Comput. Softw. Big Sci. (2019) 13.
[11] K. Pedro, "SonicCMS." [software] version v5.0.0 (accessed 2020-02-17), https://github.com/hls-fpga-machine-learning/SonicCMS, 2019.
[12] J. Krupa et al., GPU coprocessors as a service for deep learning inference in high energy physics.
[13] A. Caulfield, E. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman et al., A cloud-scale acceleration architecture, IEEE Computer Society, October 2016.
[14] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, 2016.
[15] F. Halzen and S. R. Klein, IceCube: An Instrument for Neutrino Astronomy, Rev. Sci. Instrum. (2010) 081101.
[16] I. Sfiligoi, F. Würthwein, B. Riedel and D. Schultz, Running a pre-exascale, geographically distributed, multi-cloud scientific simulation, High Performance Computing (2020) 23.
[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov et al., Going Deeper with Convolutions, arXiv:1409.4842 (2014).
[18] A. Aurisano, A. Radovic, D. Rocco, A. Himmel, M. Messier, E. Niner et al., A convolutional neural network neutrino event classifier, JINST (2016) P09001.
[19] R. Acciarri et al., Convolutional neural networks applied to neutrino events in a liquid argon time projection chamber, JINST (2017) P03011.
[20] MicroBooNE collaboration, C. Adams et al., Deep neural network for pixel-level electromagnetic particle identification in the MicroBooNE liquid argon time projection chamber, Phys. Rev. D (2019) 092001.
[21] J. Hu, L. Shen, S. Albanie, G. Sun and E. Wu, Squeeze-and-Excitation Networks.
[22] DUNE collaboration, B. Abi et al., Neutrino interaction classification with a convolutional neural network in the DUNE far detector, 2020.
[23] B. Graham and L. van der Maaten, Submanifold Sparse Convolutional Networks.
[24] DeepLearnPhysics collaboration, L. Dominé and K. Terao, Scalable deep convolutional neural networks for sparse, locally dense liquid argon time projection chamber data, Phys. Rev. D (2020) 012005.
[25] F. Pietropaolo, Review of Liquid-Argon Detectors Development at the CERN Neutrino Platform, J. Phys. Conf. Ser. (2017) 012038.
[26] J. Marshall and M. Thomson, Pandora particle flow algorithm, in International Conference on Calorimetry for the High Energy Frontier, p. 305, 2013.
[27] L. Michel, Interaction between four half-spin particles and the decay of the µ-meson, Proc. Phys. Soc. A (1950) 514.
[28] E. Snider and G. Petrillo, LArSoft: Toolkit for Simulation, Reconstruction and Analysis of Liquid Argon TPC Neutrino Detectors, J. Phys. Conf. Ser. (2017) 042057.
[29] A. Bocci, D. Dagenhart, V. Innocente, M. Kortelainen, F. Pantaleo and M. Rovere, Bringing heterogeneity to the CMS software framework.
[30] Nvidia, "Triton Inference Server." [software] version v1.8.0 (accessed 2020-02-17), https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/index.html, 2019.
[31] Google, "gRPC." [software] version v1.19.0 (accessed 2020-02-17), https://grpc.io/, 2018.
[32] "The Fermilab Hierarchical Configuration Language," https://cdcvs.fnal.gov/redmine/projects/fhicl/wiki.
[33] L. Garren, M. Wang, T. Yang et al., "larrecodnn." [software] version v08_02_00 (accessed 2020-04-07), https://github.com/LArSoft/larrecodnn, 2020.
[34] The Kubernetes Authors © The Linux Foundation, "Concepts - Workloads - Pods," 2020.
[35] Google LLC, "Compute Engine Documentation - Machine types," 2020.
[36] D. Heck, J. Knapp, J. Capdevielle, G. Schatz and T. Thouw, CORSIKA: A Monte Carlo code to simulate extensive air showers, Tech. Rep. FZKA-6019, 1998.
[37] C. N. Coelho, A. Kuusela, H. Zhuang, T. Aarrestad, V. Loncar, J. Ngadiuba et al., Ultra Low-latency, Low-area Inference Accelerators using Heterogeneous Deep Quantization with QKeras and hls4ml.
[38] A. Pappalardo, G. Franco and N. Fraser, Xilinx/brevitas: Pretrained 4b MobileNet V1 r2, 2020. doi:10.5281/zenodo.3979501.