FPGA-accelerated machine learning inference as a service for particle physics computing
Javier Duarte, Philip Harris, Scott Hauck, Burt Holzman, Shih-Chieh Hsu, Sergo Jindariani, Suffian Khan, Benjamin Kreis, Brian Lee, Mia Liu, Vladimir Lončar, Jennifer Ngadiuba, Kevin Pedro, Brandon Perez, Maurizio Pierini, Dylan Rankin, Nhan Tran, Matthew Trahms, Aristeidis Tsaris, Colin Versteeg, Ted W. Way, Dustin Werran, Zhenbin Wu
Computing and Software for Big Science
This is a post-peer-review, pre-copyedit version of this article. The final authenticated version is available online at: https://doi.org/10.1007/s41781-019-0027-2.
Received: 29 April 2019 / Accepted: 20 August 2019
Abstract
Large-scale particle physics experiments face challenging demands for high-throughput computing resources both now and in the future. New heterogeneous computing paradigms on dedicated hardware with increased parallelization, such as Field Programmable Gate Arrays (FPGAs), offer exciting solutions with large potential gains. The growing applications of machine learning algorithms in particle physics for simulation, reconstruction, and analysis are naturally deployed on such platforms. We demonstrate that the acceleration of machine learning inference as a web service represents a heterogeneous computing solution for particle physics experiments that potentially requires minimal modification to the current computing model. As examples, we retrain the ResNet-50 convolutional neural network to demonstrate state-of-the-art performance for top quark jet tagging at the LHC and apply a ResNet-50 model with transfer learning for neutrino event classification. Using Project Brainwave by Microsoft to accelerate the ResNet-50 image classification model, we achieve average inference times of 60 (10) milliseconds with our experimental physics software framework using Brainwave as a cloud (edge or on-premises) service, representing an improvement by a factor of approximately 30 (175) in model inference latency over traditional CPU inference in current experimental hardware. A single FPGA service accessed by many CPUs achieves a throughput of 600–700 inferences per second using an image batch of one, comparable to large batch-size GPU throughput and significantly better than small batch-size GPU throughput. Deployed as an edge or cloud service for the particle physics computing model, coprocessor accelerators can have a higher duty cycle and are potentially much more cost-effective.

Keywords particle physics · heterogeneous computing · FPGA · machine learning

J.D., B.H., S.J., B.K., M.L., K.P., N.T., and A.T. are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. P.H. and D.R. are supported by a Massachusetts Institute of Technology University grant. M.P., J.N., and V.L. received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement no 772369). V.L. also received funding from the Ministry of Education, Science, and Technological Development of the Republic of Serbia under project ON171017. S-C.H. is supported by the DOE Office of Science, Office of High Energy Physics Early Career Research program under Award No. DE-SC0015971. S.H., M.T., and D.W. are supported by F5 Networks. Z.W. is supported by the National Science Foundation under Grants No. 1606321 and 115164.

Javier Duarte, Burt Holzman, Sergo Jindariani, Benjamin Kreis, Mia Liu, Kevin Pedro, Nhan Tran, Aristeidis Tsaris: Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
Philip Harris, Dylan Rankin: Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Scott Hauck, Shih-Chieh Hsu, Matthew Trahms, Dustin Werran: University of Washington, Seattle, WA 98195, USA
Suffian Khan, Brian Lee, Brandon Perez, Colin Versteeg, Ted W. Way: Microsoft, Redmond, WA 98052, USA
Vladimir Lončar: CERN, CH-1211 Geneva 23, Switzerland; Institute of Physics Belgrade, University of Belgrade, Serbia
Jennifer Ngadiuba, Maurizio Pierini: CERN, CH-1211 Geneva 23, Switzerland
Zhenbin Wu: University of Illinois at Chicago, Chicago, IL 60607, USA
1 Introduction
With large datasets and high data acquisition rates, high-performance and high-throughput computing resources are an essential element of the experimental particle physics program. These experiments are constantly increasing in both sophistication of detector technology and intensity of particle beams. As such, particle physics datasets are growing in size just as the algorithms that process the data are growing in complexity. For example, the high luminosity phase of the Large Hadron Collider (HL-LHC) will deliver 15 times more data than the current LHC run. The HL-LHC will collide bunches of protons at a rate of 40 MHz, and the collision environment will have 5 times as many particles per collision [1]. The Compact Muon Solenoid (CMS) experiment will be upgraded for the HL-LHC with up to 10 times more readout channels. Through a series of online filters, CMS aims to store HL-LHC collision events at a rate of 5 kHz. Such a data rate leads to datasets that are exabytes in scale [2]. Future neutrino experiments such as the Deep Underground Neutrino Experiment (DUNE) [3] and cosmology experiments like the Square Kilometre Array (SKA) [4] are expected to produce datasets at the exabyte scale.

In the past, the physics and computing communities relied largely on the progress of silicon technologies to handle growing computing requirements. However, at present, improvement in single processor performance is stalling due to changes in the scaling of power consumption [5]. The current particle physics computing paradigms will not suffice to simulate, process, and analyze the massive datasets that the next-generation experimental facilities will deliver. New technologies that provide order-of-magnitude improvements are needed.

Concurrently, the ubiquity of sophisticated detectors with complex outputs has led to the quick adoption of machine learning (ML) algorithms as tools to reconstruct physics processes.
Neutrino experiments currently use state-of-the-art convolutional neural networks (CNNs) [6,7], such as GoogLeNet and ResNet-50 [8], to perform neutrino event reconstruction and identification. At the LHC, ML methods are used in all stages of the ATLAS, CMS, LHCb, and ALICE experiments, from low-level calibration of individual reconstructed particles [9] to high-level optimization of final-state event topologies [10]. ML was a vital component of the Higgs boson discovery [11,12] and is now being explored for the first level of processing: low latency, sub-microsecond online filtering applications [13,14]. Across big science, such as cosmology and large astrophysical surveys, similar trends exist as the experiments grow and the data rates increase.

While the computing challenge in particle physics is a vital concern for current and future experiments, it is not unique. With the rise of so-called "big data," the Internet of Things (IoT), and the increase in the quantity of data across a wide range of scientific fields, the sophisticated large-scale processing of big data has become a global challenge. At the forefront of this trend is the need for new computing resources to handle both the training and inference of large ML models.

In this paper, we focus on the inference of deep ML models as a solution for processing large datasets; inference is computationally intensive and runs repeatedly on hundreds of billions of events. A growing trend to improve computing power has been the development of hardware that is dedicated to accelerating certain kinds of computations.
Pairing a specialized coprocessor with a traditional CPU, referred to as heterogeneous computing, greatly improves performance. These specialized coprocessors, including GPUs, Field Programmable Gate Arrays (FPGAs), and Application Specific Integrated Circuits (ASICs), utilize natural parallelization and provide higher data throughput. ML algorithms, and in particular deep neural networks, are at the forefront of this computing revolution due to their high parallelizability and common computational needs.

To capitalize on this new wave of heterogeneous computing and specialized hardware, particle physicists have two primary options:

1. Adapt domain-specific algorithms to run on specialized accelerator hardware. This option takes advantage of specific human expert knowledge, but can be challenging to implement on new and ever-changing hardware platforms with different computing paradigms (such as CUDA or Verilog). New portable development environments (e.g. OpenCL, Kokkos) can potentially provide cross-hardware solutions.
2. Design ML algorithms to replace domain-specific algorithms. This option has the advantage of running natively on specialized hardware using open-source software stacks, but it can be a challenge to map specific physics problems onto ML solutions.
In this paper, we explore how such heterogeneous computing resources can be deployed within the current computing model for particle physics in a scalable and non-disruptive way. While accelerating domain-specific algorithms on specialized hardware is possible, in this paper we study the second option, where a ML algorithm is adapted to solve a challenge and accelerated using a specialized hardware platform. We present physics results for a publicly available top quark tagging dataset for the LHC [15] and discuss how this could be applied to neutrino experiments such as NOvA [16]. This study focuses on the newly available Microsoft Project Brainwave platform, which deploys FPGA coprocessors as a service at datacenter scale [17]. Brainwave provides a first scalable platform to study, though other such options exist. Results from this study will serve as a performance benchmark for any similar systems and will provide valuable lessons for applying new technologies to particle physics computing.

The rest of this paper is organized as follows. In Section 2, we describe the requirements of the particle physics computing model that is used in collider experiments at the LHC and neutrino experiments such as DUNE, and we detail the challenges facing this computing model in the future. In Section 3, we explore some example use cases to be deployed on the Microsoft Brainwave platform: we train and evaluate a dedicated model identifying particles at the LHC and discuss the potential application for neutrino physics. In Section 4, we describe the Microsoft Brainwave platform and how we integrate it into our experimental computing model to accelerate ML inference. In Section 5, we present latency results from tests of FPGA coprocessors as a service and compare the results to benchmark values for CPUs and GPUs.
We also provide first studies on the scalability of such an approach. Finally, in Section 6, we conclude by summarizing the study and discussing the next steps required for further development of this program.

2 Particle physics computing model

A filtering system, referred to as the trigger, reduces the rate of events to a manageable level to be recorded for offline processing. The trigger is typically divided into multiple tiers. The first tier (Level-1, L1) is performed with custom electronics with very low latency (1–10 µs), where the latency is a fixed size for every event. The second step (high level trigger, HLT) is performed on more standard computing resources and has a variable per-event latency of 10–100 ms. Finally, offline analysis of the saved events passing the HLT can take significantly longer, though ultimately the offline processing time is limited by available computing resources.

In this paper, we consider the possible gains from heterogeneous computing resources as applied to both the HLT and offline processing steps. When considering how best to use new optimized computing resources for physics, we must understand the implications of the event processing model described above. An example of the current computing model is shown in Fig. 1. Event data is processed, often sequentially, across multiple CPU threads.

It is important to note that the basic processing unit is a single event, and performing the same task for multiple events (batching) becomes significantly more complex to manage. Because each event contains potentially millions of channels of information, it is optimal to load the needed components of that event into memory and then execute all desired algorithms for that event. The tasks themselves, denoted in Fig. 1 as modules, can be very complex, either with time-consuming physics-based algorithms or, as is becoming more popular, machine learning algorithms. There may be dozens or even hundreds of modules executed for each event.
It can be seen that the most time-consuming and complex tasks will be the latency bottleneck in event processing.

2.2 Upcoming computing challenges

In the next decade, the HL-LHC upgrade will increase the LHC collision rate by an order of magnitude. The CMS detector will undergo a series of upgrades to be able to cope with the increased collision rate and the associated increase in radiation levels, which would damage parts of the current detector beyond the point of recovery. The detector upgrades include a new pixel tracker with almost 2 billion readout channels and a high granularity endcap calorimeter with 6 million channels [18]. Both of these constitute more than an order-of-magnitude increase in channels compared to
Fig. 1: A diagram of the computing model used in the CMS software.

the current systems. Another consequence of the HL-LHC upgrade will be an increase in the rate of multiple collisions per proton bunch crossing (pileup). While the current LHC configuration results in about 30 collisions per bunch crossing, this value will increase to about 200 collisions at the HL-LHC.

The consequence is that the upgraded CMS detector will have to record and process more events, each of which contains more channels and more energy deposits from pileup. The time to analyze these extremely complex events is currently simulated to be approximately 300 seconds. The impact on the CPU resources needed by CMS is depicted in Fig. 2 [2]. The relative increase in computing resources required for the HL-LHC is more than a factor of 10 greater than current needs. Similarly, the DUNE experiment, the largest liquid argon neutrino detector ever designed, will comprise roughly 1 million channels with megahertz sampling and millisecond integration times [3]. Both of these frontier experiments will need new solutions for event processing to be able to make sense of the large datasets that will be delivered in the next decades.
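To make the scale of these numbers concrete, the quoted 5 kHz HL-LHC recording rate can be turned into a rough yearly data volume. In the sketch below, the rate is from the text, while the per-event size and live time are assumed illustrative round numbers, not CMS specifications:

```python
# Back-of-the-envelope estimate of the HL-LHC offline data volume.
# The 5 kHz recording rate is quoted in the text; the event size and
# yearly live time are assumed round numbers for illustration only.
RATE_HZ = 5_000                    # events recorded per second (from the text)
EVENT_SIZE_MB = 5                  # assumed raw event size in megabytes
LIVE_SECONDS_PER_YEAR = 7_000_000  # assumed data-taking live time per year

def yearly_volume_pb(rate_hz: float, event_mb: float, live_s: float) -> float:
    """Return the recorded data volume in petabytes per year (1 PB = 1e9 MB)."""
    return rate_hz * event_mb * live_s / 1e9

print(f"~{yearly_volume_pb(RATE_HZ, EVENT_SIZE_MB, LIVE_SECONDS_PER_YEAR):.0f} PB/year")
# prints "~175 PB/year"
```

Under these assumptions, a decade of running reaches the exabyte scale quoted above.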
In this section, we highlight examples of machine learning models relevant for physics to test on accelerator hardware. These are not meant as realistic examples, but rather as a proof-of-concept to be expanded when more mature physics models can be accelerated on coprocessors.

3.1 ResNet-50 and other models

At the moment, only a limited number of neural network architectures are available for acceleration on the Brainwave platform. The available models—ResNet-50, VGG-16 [19], and DenseNet-121 [20]—are CNNs optimized for image classification. These CNNs typically contain several convolutional layers that extract meaningful features of the image. This part of the network is the most computationally intensive and is often called the "featurizer." The final part of the network is much smaller and typically includes a few fully connected layers with the final output corresponding to a set of probabilities for each category. This part of the network is called the "classifier." In our study, we focus on the ResNet-50 model. The FPGA is used to accelerate the featurizer step of the ResNet-50 inference, while the classifier step is performed on the CPU. In total, ResNet-50 contains approximately 25 million parameters and requires approximately 4 G-ops (4 × 10^9 operations) for a single inference. While the neural network architectures are fixed, the weights can be retrained within one of these available network architectures. We use this workflow to train a ResNet-50 neural network for a physics-specific task in Sec. 3.2 and Sec. 3.3.

Fig. 2: Estimated CPU resource needs for CMS in the next decade [2]. THS06 stands for tera (10^12) HEP-SPEC06, a standard measure of CPU performance used in high-energy physics.

Even
with a restricted architecture, the number of ML tasks that can be performed with these sophisticated image recognition models is substantial. We will explore two: classification of boosted top quarks and neutrino flavor classification.

However, we also stress that this is a proof-of-concept study to demonstrate the improvements for physics computing from heterogeneous computing platforms as a service. As the technology matures rapidly, we will also see an improvement in the software toolsets associated with this new hardware. We expect the capability to translate any model to specialized hardware to become available in the near future. In fact, several tools are working towards this capability [21,22,23].

3.2 Top tagging at the LHC

At the LHC, quarks and gluons originating from the proton collisions produce collimated sprays of particles in the detector called jets. Studying the substructure of these jets is an important tool for identifying their origin. There are broad physics applications, from studying Higgs boson properties, to searching for new physics beyond the standard model such as supersymmetry and dark matter, to measuring the properties of quantum chromodynamics (QCD). Because this task involves highly-correlated and high-dimensionality inputs, it is an active area of R&D for ML algorithms in particle physics. Various representations of the data have been considered, including fixed 2D images, variable-length sets, and graphs.

In this case study, we consider the task of classifying collimated decays of top quarks in a jet from more common jets originating from lighter quarks or gluons. There are many ML approaches to this challenge in the literature [24], and a public dataset, developed from one of these studies, has been created for comparison [25,15]. The Pythia8 [26,27] generator is used to produce fully hadronic tt̄ events for signal (known as "top quark jets") and QCD dijet events for background (known as "QCD jets") produced in 14 TeV proton-proton collisions. No multiple parton interactions or pileup interactions are included; their inclusion would require improving our neural network model.
Delphes [28] with the ATLAS detector configuration is used to simulate detector effects. The Delphes E-flow candidates are clustered using FastJet [29,30] into anti-kT [31] jets with size parameter R = 0.8. Jets with transverse momentum (pT) between 550 and 650 GeV and |η| < 2 are selected, where η is the pseudorapidity. Top quark jets are required to satisfy generator-level matching criteria: the jet must be matched to a parton-level top quark and all of its decay products within ∆R = 0.8, where ∆R = √((∆η)^2 + (∆φ)^2) and φ is the azimuthal angle. Up to 200 jet constituent four-momenta are stored.

The Brainwave platform allows the use of custom weights for specific applications, computed by training predefined CNNs. In this training, we treat the jets as 2D grayscale images in the η-φ plane and send them as input to the ResNet-50 algorithm. Jet images are created by summing jet constituent pT in a 2D grid of 224 × 224 in η and φ. To conform to the ResNet-50 architecture, the images are normalized such that each image has a range between 0 and 255 and duplicated 3 times, once for each RGB channel. We illustrate the images for QCD and top quark jets in Fig. 3, where the images are averaged over 5,000 jets. Top quark jets have a 3-prong nature, which manifests as a broader radiation pattern when averaged over many jets. For our specific task, after the primary
ResNet-50 featurizer, we add our own custom classifier, which comprises one fully connected layer of width 1024 with ReLU [33] activation and another fully connected layer of width 2 with softmax activation. The training dataset contains about 1.2 million events, while the validation and test datasets each have approximately 400,000 events. The training is performed by minimizing the categorical cross-entropy loss function using the Adam algorithm [34] with an initial learning rate of 10^− and a minibatch size of 64 over 10 epochs on an NVIDIA Tesla V100 GPU. The best model is chosen based on the smallest average loss evaluated on the validation dataset. The training for this particular ResNet-50 model is unique because there is a particular quantized version of ResNet-50 that needs to be "fine-tuned," or trained with a smaller learning rate. The quantized model is initialized using the weights from the trained floating point model and trained with an initial learning rate of 10^− and a minibatch size of 32 for 10 additional epochs. Finally, as the quantized model evaluated with the Brainwave FPGA service differs numerically from the quantized model evaluated on the local GPU, an additional fine-tuning is applied to the classifier after evaluating the ResNet-50 features on Brainwave. This fine-tuning of the classifier layers is performed over 100 epochs using the validation data with the Adam algorithm, an initial learning rate of 10^−, and a batch size of 128. On a single V100 GPU, the initial floating point training time is approximately 1.5 hours per epoch, while the "fine-tuned" training is approximately 4 hours per epoch. The classifier layer training is significantly faster, only minutes per epoch.

Fig. 3: A comparison of QCD (left) and top (right) jet images averaged over 5,000 jets.

After training, we evaluate the performance of our trained ResNet-50 top tagger. The receiver operator
characteristic (ROC) curve is a graph of the false positive rate (background QCD jet efficiency) as a function of the true positive rate (top quark jet efficiency). It is customary to report three metrics for the performance of the network on the top tagging dataset: model accuracy, area under the ROC curve (AUC), and background rejection power at a fixed signal efficiency of 30%, 1/ε_B (ε_S = 30%). Fig. 4 shows the ROC curve comparison for the transfer learning version of ResNet-50 as well as the fully retrained featurizer with custom weights. In Table 1, the accuracy, AUC, and 1/ε_B (ε_S = 30%) values are listed for each model considered.

Model            Accuracy   AUC      1/ε_B (ε_S = 30%)
Floating point   0.9009     0.9797    670.8
Quant.           0.8413     0.9754    414.6
Quant., f.t.     0.9296     0.9825    970.7
Brainwave        0.9257     0.9821    934.8
Brainwave, f.t.  0.9348     0.9830    999.6

Table 1: The performance of the evaluated models on the top tagging dataset.

The performance of the retrained ResNet-50 compared to other models developed for this dataset is state-of-the-art; the best performance is 1/ε_B (ε_S = 30%) ≈ 1000. However, it should be noted that the best-performing models to date (ResNeXt50 and a directed graph CNN) [32,24] are within a factor of a few in size with respect to the ResNet-50 model. We emphasize here that this study is a proof-of-concept for the physics performance and that there are many other very challenging, computationally intensive algorithms where machine learning is being explored. We anticipate that for these looming challenges, the size of the models will continue to grow to meet the demands of new experiments.

Fig. 4: The ROC curves showing the performance of the floating point and quantized versions (before fine-tuning, after fine-tuning, and using the Brainwave service) of the ResNet-50 top tagging model.

3.3 Neutrino flavor identification at NOvA

Neutrino event classification can also benefit from accelerating the inference of large ML models. In this section, due to a lack of publicly available neutrino datasets, we do not fully quantify the performance of a particular model. Instead, we present a workflow to demonstrate that this work is applicable beyond the LHC.
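The image formation and the background-rejection metric used in Sec. 3.2 are simple enough to sketch directly. The 224 × 224 grid and 0–255 scaling follow the text; the constituent tuples, the grid half-width, and the classifier scores are toy stand-ins for the real dataset:

```python
from typing import List, Tuple

def jet_image(constituents: List[Tuple[float, float, float]],
              n_pix: int = 224, half_width: float = 1.0) -> List[List[float]]:
    """Sum constituent pT into an n_pix x n_pix grid in (eta, phi) relative
    to the jet axis, then scale the brightest pixel to 255 (grayscale)."""
    img = [[0.0] * n_pix for _ in range(n_pix)]
    for eta, phi, pt in constituents:
        i = int((eta + half_width) / (2.0 * half_width) * n_pix)
        j = int((phi + half_width) / (2.0 * half_width) * n_pix)
        if 0 <= i < n_pix and 0 <= j < n_pix:
            img[i][j] += pt
    peak = max(max(row) for row in img) or 1.0
    return [[255.0 * v / peak for v in row] for row in img]

def background_rejection(sig_scores, bkg_scores, eff_s=0.30):
    """1/eps_B at fixed signal efficiency: choose the score threshold that
    keeps a fraction eff_s of signal, then count surviving background."""
    cut = sorted(sig_scores, reverse=True)[max(0, round(eff_s * len(sig_scores)) - 1)]
    eps_b = sum(s >= cut for s in bkg_scores) / len(bkg_scores)
    return 1.0 / eps_b if eps_b > 0 else float("inf")
```

At the Table 1 working point, a rejection of ≈ 1000 means roughly one QCD jet in a thousand survives a selection that keeps 30% of top quark jets.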
We illustrate the type of classification task needed for neutrino experiments by using simulated neutrino events and cosmic data from the NOvA experiment. NOvA pioneered the application of convolutional neural networks (CNNs) in particle physics in 2016 by becoming the first experiment to use a CNN in a published result [7,35]. In our study, we use transfer learning with ResNet-50 to distinguish between the different detector signatures associated with various neutrino interaction types and associated backgrounds. We extract features from neutrino interaction events using the ResNet-50 featurizer (pre-trained using the ImageNet dataset [36]) and retrain the final fully connected classifier layers to perform neutrino event classification. Specifically, 500,000 simulated neutrino events with cosmic data overlays were used for training, with the following five categories: charged current electron neutrino, charged current muon neutrino, charged current tau neutrino, neutral current neutrino interactions, and cosmic ray tracks. These events are highly amenable to classification by CNN architectures such as ResNet-50.

We then applied the transfer learning ResNet-50 model to a separate test set of 150,000 events. As a visual example, we show three simulated neutrino interaction events in Fig. 5 that are selected with probability larger than 0.9. On the left (middle, right) is an example event originating from an electron (muon, tau) neutrino charged current interaction. While the optimal use of ML to improve neutrino event reconstruction and classification is an active area of research, the most successful approach thus far employs CNN architectures, which work well with the homogeneous nature of the neutrino detectors. While the transfer learning approach does not yield state-of-the-art performance for neutrino event classification, we expect that a full retraining of
ResNet-50 would be more successful, which is the subject of future work.

Current neutrino experiments, including NOvA and others, are potentially exciting applications of coprocessors as a service. A large fraction of their event reconstruction time is already consumed by inference of large CNNs [37]. Therefore, they stand to gain significantly from accelerating network inference. The approach outlined in Section 4 could provide a non-disruptive solution to accelerate neutrino computing performance in the present as well as in the future.

Fig. 5: Example visualizations of simulated neutrino events correctly classified by our ResNet-50 model with probability greater than 0.9: electron neutrino (left), muon neutrino (middle), and tau neutrino (right). The top and bottom rows are the top and side views from the NOvA detector. (NOvA's beam energy and baseline prohibit long-baseline tau neutrino appearance searches, but the event is shown for illustration purposes.)

– By considering ML algorithms, we can greatly benefit from developments outside of the field of particle physics. Industry and academic investment in ML is growing rapidly, and there is a vast amount of research on specialized hardware for ML that could be utilized within the community.
– Often, ML algorithms are quite parallelizable, making them amenable to acceleration on specialized hardware. For some physics-based algorithms, this is not possible, while for others it could require substantial investment to rewrite for new, often changing computing hardware.

We, therefore, focus on ML acceleration in our study. To capitalize on the ML-focused hardware developments, we rely on the continued research and development of ML applications for particle physics tasks. This is an active area of research with growing interest, as indicated by recent work across many neutrino and collider experiments [38,39] and initiatives such as the HEP.TrkX project [40] and the Tracking ML Kaggle Challenge [41]. Additionally, ML has the potential to provide event simulation [42], another computationally intensive part of the chain.

One challenge is to integrate FPGA coprocessors into the computing model without disrupting the current multithreaded paradigm, where several modules process an event in parallel. A natural method for integrating heterogeneous resources is via a network service. This client-server model is flexible enough to be used locally by a single user or within a computing farm where a single thread communicates with the server. In the particular case investigated here, we use the gRPC package [43], an open-source Remote Procedure Call (RPC) system initially developed by Google, to interface with the Brainwave system. gRPC uses protocol buffers (protobuf) [44] for data serialization and transmission. This setup defines a communication method between the FPGA coprocessor resources and an experiment's primary CPU-based computing datacenters. This is illustrated in Fig. 7, where a module running on a CPU farm performs fast inference of a particular ML algorithm via gRPC. First, we test the performance of a single task that makes a request to a single cloud service, which performs a remote access to the Brainwave platform (we refer synonymously to a cloud service as one accessed remotely). However, scaling up the number of requests is natural for the Brainwave system, which is capable of load balancing of service requests.

Fig. 6: A schematic of the Microsoft Brainwave acceleration platform [17].

One may also consider a case where the FPGA coprocessor resources are located at the same datacenter as the CPUs, as a so-called edge resource (we refer synonymously to an edge service as one accessed on-premises, or on-prem). This is illustrated in Fig. 8. In this scenario, the same
gRPC interface protocols are used to communicate with the FPGA hardware, and the software access for fast inference is unchanged. To benchmark this scenario, we run our application on a virtual machine (VM) in the cloud datacenter. Results comparing both these scenarios with other hardware from the literature are presented in Section 5.

Fig. 7: An illustration of FPGA-accelerated ML cloud resources integrated into the experimental physics computing model as a service.
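In the real deployment, the request is a protobuf message carried over gRPC to the Brainwave service. As a sketch of the data exchanged (a 224 × 224 × 3 image in, a list of class probabilities out), the round trip can be mimicked in-process with JSON serialization and a stand-in service function; all names and the toy two-class output here are ours, not part of the Brainwave API:

```python
import json
import math

IMAGE_SHAPE = (224, 224, 3)  # ResNet-50 input size, per the text

def serialize_request(image):
    """Flatten the image into a JSON payload (stand-in for protobuf)."""
    flat = [px for row in image for pixel in row for px in pixel]
    return json.dumps({"shape": IMAGE_SHAPE, "data": flat})

def inference_service(payload):
    """Stand-in for the remote FPGA service: deserializes the request and
    returns toy two-class softmax 'probabilities' (e.g. top vs. QCD)."""
    request = json.loads(payload)
    mean = sum(request["data"]) / len(request["data"])
    exps = [math.exp(mean), math.exp(-mean)]
    return json.dumps({"probs": [e / sum(exps) for e in exps]})

# Client side: serialize one image, call the "service," decode the reply.
image = [[[0.5] * 3 for _ in range(224)] for _ in range(224)]
probs = json.loads(inference_service(serialize_request(image)))["probs"]
# probs sums to 1; with this uniform image, probs[0] = 1/(1 + e^-1) ≈ 0.731
```

The key design point carried over from the text is that the client only ever sees a serialized request and reply, so the same code path works whether the service is a remote cloud endpoint or an on-premises edge resource.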
Fig. 8: An illustration of FPGA-accelerated ML edge resources integrated into the experimental physics computing model as a service.

4.2 Particle physics computing model with services

For our demonstration study, we use the CMS experiment software framework,
CMSSW [45]. This software uses Intel Thread Building Blocks [46] for task-based multithreading. A typical module, such as those depicted in Fig. 1, has a produce function that obtains data from an event, operates on it, and then outputs derived data. This pattern assumes that all of the operations occur on the same machine.

Our goal is to utilize the Brainwave hardware as a service to perform inference of a large ML model such as ResNet-50. Within CMSSW, a hook to the gRPC system is established using a special feature called ExternalWork. Optimal use of both CPU and heterogeneous computing resources requires that requests be transmitted asynchronously, freeing up a CPU thread to do other work rather than forcing it to wait until a request is complete. The ExternalWork pattern accomplishes this by splitting the simpler pattern described above into two steps. The first step, the acquire function, obtains data from an event, launches an asynchronous call to a heterogeneous resource, and then returns. Once the call is complete, a callback function is executed to place the corresponding produce function for the module back into the task queue. This is depicted in Fig. 9.
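The acquire/callback/produce split described above can be sketched conceptually. CMSSW itself is C++ built on Intel TBB; the Python sketch below is only an illustration of the control flow, with a thread pool standing in for the external FPGA resource and a plain queue standing in for the framework's task scheduler (all names are illustrative, not CMSSW APIs).

```python
# Conceptual sketch of the ExternalWork pattern: acquire() launches the
# asynchronous external call and returns immediately; when the call finishes,
# a callback re-queues produce() for the framework to run later.
import queue
from concurrent.futures import ThreadPoolExecutor

task_queue = queue.Queue()           # stand-in for the framework's task queue
coprocessor = ThreadPoolExecutor()   # stand-in for the FPGA/GPU service

def acquire(event):
    """Obtain data from the event, launch the async call, and return."""
    future = coprocessor.submit(lambda x: x * 2, event["data"])  # fake inference
    # once the external call completes, place produce() back in the task queue
    future.add_done_callback(
        lambda f: task_queue.put((produce, event, f.result())))

def produce(event, result):
    """Second step: attach the derived data to the event."""
    event["prediction"] = result

event = {"data": 21}
acquire(event)                  # returns at once; the CPU thread is free
fn, ev, res = task_queue.get()  # the scheduler later picks up the callback
fn(ev, res)
print(event["prediction"])      # → 42
```

The key point, as in the text, is that no CPU thread blocks between the launch of the external call and the execution of produce.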
Fig. 9: A diagram of the ExternalWork feature in CMSSW, showing the communication between the software and external processors such as FPGAs.

In this case, the event data provided to the service is a
TensorFlow tensor with the appropriate size (224 × 224 × 3) for inference with ResNet-50. A list of the classification results is returned to the module, which employs ExternalWork. For simplicity, we refer to the full chain of inference as a service within our experimental software stack as "Services for Optimized Network Inference on Coprocessors," or SONIC [47].
We test the SONIC package within CMSSW, measuring the total end-to-end latency of an inference request using Brainwave. In a simple test, we create an image from a jet (as described in Sec. 3) from a simulated CMS dataset. We take reconstructed particle candidates and combine them as pixels in a 2D grayscale image tensor input to the ResNet-50 model (as in Sec. 3.2).

We perform two latency tests: remote and on-premises, or on-prem. The remote test communicates with the Brainwave system as a cloud service, as illustrated in Fig. 7. For this test, we execute our experimental software, CMSSW, on the local Fermilab CPU cluster (Intel Xeon 2.6 GHz) in Illinois, US, and communicate via gRPC with the service located at the Azure East 2 Datacenter in Virginia, US. The on-prem tests are executed at the same datacenter as the Brainwave FPGA coprocessors. We run a VM in the Azure East 2 Datacenter, deploying CMSSW inside a Docker container, and communicate with the FPGA coprocessors located in the same facility.

We measure the total round-trip latency of the inference request as seen by CMSSW, starting from the transmission of the image and ending with the receipt of the classification results. The latencies are shown in Fig. 10 on a linear latency scale (top) and a logarithmic latency scale (bottom). The on-prem performance is shown in orange, with a mean inference time of 10 ms, and the remote performance is shown in blue, with a mean inference time of 60 ms. From internal Brainwave timing tests, the featurizer inference step performed on the FPGA takes 1.8 ms, and the classifier inference step performed on the CPU takes a similar amount of time. The remaining time in the 10 ms is primarily used for network transmission.
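The jet-image preprocessing described above, reconstructed particle candidates combined as pixels of a grayscale image sized for ResNet-50, can be sketched as follows. The ±1.0 binning window around the jet axis is an assumption for this sketch (and azimuthal wrap-around is ignored), not a documented choice.

```python
# Illustrative sketch: histogram particle candidates (eta, phi, pT) into a
# 224x224 grayscale image, then replicate the channel to the 224x224x3 shape
# that ResNet-50 expects. Window size is an assumption for this sketch.
def jet_image(particles, jet_eta, jet_phi, size=224, half_window=1.0):
    """particles: list of (eta, phi, pt); returns a size x size x 3 nested list."""
    img = [[0.0] * size for _ in range(size)]
    scale = size / (2.0 * half_window)
    for eta, phi, pt in particles:
        i = int((eta - jet_eta + half_window) * scale)  # row index from eta
        j = int((phi - jet_phi + half_window) * scale)  # column index from phi
        if 0 <= i < size and 0 <= j < size:
            img[i][j] += pt                             # pixel intensity = sum of pT
    # replicate the single grayscale channel three times for the RGB-shaped input
    return [[[img[i][j]] * 3 for j in range(size)] for i in range(size)]

image = jet_image([(0.1, 0.2, 30.0), (0.1, 0.2, 10.0), (-0.5, 0.0, 5.0)], 0.0, 0.0)
```

Two candidates falling in the same cell simply add their transverse momenta, which is what makes the image a 2D energy-deposit map rather than a particle list.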
Fig. 10: Total round-trip inference latencies for ResNet-50 on the Brainwave system, both remote and on-prem. The top plot is linear in time and the bottom plot is logarithmic in time.

The remote performance can be as fast as 30 ms, with a median value of 50 ms, and there are long tails out to hundreds of ms at the per-mille level. The measured latency is strongly dependent on network conditions, which can cause the structures seen in Fig. 10. Due to the speed of light, there is a hard physical limit on the transmission time of the signal to the Azure East 2 Datacenter and back to Fermilab, which we estimate to be around 10 ms. The physical distance between the experimental computing cluster and the remote datacenter will limit any cloud-based inference speeds.

After comparing the remote versus on-prem latency, we performed a scaling test to estimate how many coprocessor services would be needed to support large-scale deployment in a production environment. A given number of simultaneous processes were run using the batch system at Fermilab, and the round-trip latency was measured. All jobs connected to a single Brainwave service. This test corresponds to a "worst-case" estimate of the scaling of a single service, because each process only executed the Brainwave test module that performs inference on jet images. In an actual production process, the test module would run alongside many other modules (see Fig. 1), greatly reducing the probability of simultaneous requests to the cloud service. The results of the test are shown in Fig. 11. The mean, standard deviation, and long tail of the round-trip latency all tend to increase with more simultaneous jobs, but only moderately. It should also be noted that some calls timed out during the largest-scale test with 500 simultaneous processes, leading to a failure rate of 1.8%, while the other tests had zero or negligible failures.

We also measure the throughput based on the total time for each simultaneous process to complete serial processing of 5000 jet images.
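Two of the numbers quoted in this section can be cross-checked with a quick calculation. The ~1000 km one-way Illinois-to-Virginia route length and the fiber refractive index are assumptions for this sketch, and the 83 ms mean latency is an illustrative value consistent with the measured throughput, not a quoted measurement.

```python
# Back-of-envelope checks for the ~10 ms network floor and the scaling of
# service throughput with concurrent clients. Route length, fiber index, and
# the example mean latency are assumptions, not measurements from the paper.
c = 3.0e8                           # speed of light in vacuum [m/s]
v_fiber = c / 1.5                   # signal speed in optical fiber, ~2/3 c
round_trip_s = 2 * 1.0e6 / v_fiber  # two-way transit over ~1000 km
print(f"{round_trip_s * 1e3:.0f} ms")  # → 10 ms, the quoted hard floor

# Little's law relates round-trip latency to sustained service throughput:
# n concurrent single-image clients sustain roughly n / latency inferences/s,
# until the FPGA's own per-image time becomes the bottleneck.
def service_throughput(n_clients, mean_latency_s):
    return n_clients / mean_latency_s

print(round(service_throughput(50, 0.083)))  # → 602, near the ~600 at 50 jobs
```

This is why latency barely degrades while throughput climbs: adding clients fills the service pipeline rather than lengthening each request.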
These results are shown in Fig. 12. Though the round-trip latency for a single request has a large variance, the total time to process the full series of images is remarkably consistent. This demonstrates the efficient load balancing performed by the Brainwave server.

With the total time measured for all simultaneous processes to complete, we can compute the total throughput of the Brainwave service. Recall from above that while the cloud service inference round-trip latency is 60 ms on average, the latency for the featurizer inference on the FPGA itself is approximately 1.8 ms. When we run multiple simultaneous CPU processes that all send requests to one service, we fully populate the pipeline of data streaming into the service. This keeps the FPGA occupied, increasing its duty cycle and the total inference throughput of the service. This is illustrated in Fig. 12, where we show the throughput of the service in inferences per second as a function of the number of simultaneous CPU processes accessing the service. As the number of simultaneous processes increases, the number of inferences per second increases because of the increased pressure on the pipeline of the FPGA service.

Fig. 11: Top: Mean round-trip inference latencies for ResNet-50 on the Brainwave system for different numbers of simultaneous processes. The error bars represent the standard deviation. Bottom: The full distributions displayed in "violin" style. The vertical bars indicate the extrema. The horizontal axis scale is arbitrary.

The mean latency, shown in Fig. 11, does not degrade much as the number of simultaneous jobs increases from 1 to 50, while the throughput increases by a factor of nearly 40 (600 inferences per second). The throughput of the service plateaus at around 650 inferences per second; it is limited by the inference time on the FPGA, which is, at best, 1.8 ms. From these studies, we find that it is more efficient and also more cost-effective to have multiple simultaneous CPU processes connect to a single FPGA service.

The ratio of simultaneous processes to FPGA services depends on the other tasks in the process; typical physics processes run many modules. The tests we have performed represent the most pessimistic scenario, because each process only executes the Brainwave test module. Therefore, in more realistic workloads, where many tasks are run per process and a majority of those tasks run on the CPU, we expect that one FPGA service will be able to serve one model for many more than 50 simultaneous CPU processes. Detailed studies of these more realistic workloads will be performed in the future.

Fig. 12: Top: Throughput of the FPGA service in inferences per second for different numbers of simultaneous processes. The error bars represent the standard deviation. Bottom: Mean total time and distribution (in seconds) to process 5000 jet images through ResNet-50 on the Brainwave system for different numbers of simultaneous processes. The vertical bars indicate the extrema. The horizontal axis scale is arbitrary.

5.2 CPU/GPU comparisons

Next, we compare the performance of the Brainwave platform to CPU and GPU performance for the same
ResNet-50 model. Such comparisons can be greatly affected by many details of the entire computing stack and vary widely even within the literature. Nonetheless, to get a sense of the relative performance, we perform two types of tests. First, we run our own standalone Python benchmark tests with the azure-ML implementation of ResNet-50 as well as the TensorFlow implementation of the ResNet-50 model. Here, we verify our results against the literature. While many more detailed studies exist, these benchmarks validate our numbers against other similar tests. Second, we import the ResNet-50 model file provided by Brainwave into CMSSW and perform inference on the local CPU with the version of TensorFlow currently in the CMSSW release.

The standalone Python benchmark results for CPUs are presented in Fig. 13. The CPU used in these tests is an Intel i7 3.6 GHz. For the CPU, we compare the number of cores used for either the Brainwave implementation of ResNet-50 or the conventional TensorFlow ResNet-50. The performance is shown versus the image batch size; particle physics applications typically vary in their batch sizes from 1 to 100. As expected, the performance is stable versus batch size. For both models, we observe roughly the same inference time, ranging from roughly 180 ms to 500 ms. Additionally, we observe that the model inference time is close to optimal when using 4 cores, with small improvements beyond.

Figure 14 shows the inference times on GPUs. It is important to note that the GPU used in these tests, an NVidia GTX 1080 Ti, is connected directly to the CPU rather than using RPC over a network for communication. Therefore, these results cannot be compared directly to either the remote or on-prem Brainwave performance; however, they provide a useful characterization of limiting performance. The purple GPU points utilize the Brainwave implementation of
CMSSW for custom models in the future andrepresents the closest direct comparison of a GPU withthe Brainwave FPGA implementation. The other GPUlines consist of the official
ResNet-50 as provided within
TensorFlow . The official
ResNet-50 can have better in-ference times by factors of a few. An optimized versionof
ResNet-50 is also available. It gives a 0–20% reduc-tion in inference with respect to the official
ResNet-50 .All of the GPU benchmarks also follow the expectedtrend for large image batch sizes, with an improvementin the aggregate performance. The per-image latencyfor a batch of one image is found to be anywhere from5 to 10 times worse than the ultimate performance ona GPU.Within
CMSSW, we find that importing the protobuf model of ResNet-50 can take approximately 5 minutes. Once the model is imported, subsequent inferences take, on average, 1.75 seconds per inference. This benchmark point can most closely be compared with the standalone single-thread CPU performance shown in Fig. 13, approximately 500 ms. The main differences between the standalone performance and the CMSSW tests are two-fold: the TensorFlow version (1.06 vs. 1.10) and the processor speed (2.6 GHz vs. 3.6 GHz). (It takes significant effort to adapt TensorFlow to be compatible with the multithreading pattern used in CMSSW, and hence the latest version of TensorFlow is usually not available in the experiment's software.) It is not uncommon for hardware across the global computing grid of the CMS experiment to vary significantly in performance, which is another consideration when deploying both on-prem and remote services.

Fig. 13: Standalone CPU inference time per image (top) and images per second (bottom) as a function of batch size for the TensorFlow official ResNet-50 model compared with the Azure ResNet-50 model, each run on 1, 4, or 8 CPU cores. The dashed line indicates a time of 10 ms, consistent with the on-prem inference time of the Brainwave system.

To summarize the total inference time for a batch of one image, we present Brainwave, CPU, and GPU performance in Table 2. The most straightforward comparison with the current CMSSW performance of 1.75 seconds is the 10 (60) ms on-prem (remote) that it would take to perform inference with Brainwave. This represents a factor of 175 (30) speedup for Brainwave on-prem (remote) over the current CMSSW
CPU performance. We can extrapolate from Table 2 that, for more modern versions of TensorFlow and CPUs, the CMSSW CPU inference time could improve to approximately 500 ms.
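The speedup factors quoted above follow directly from the measured times and can be recomputed:

```python
# Recomputing the quoted speedup factors from the measured inference times.
cmssw_cpu = 1.75                 # s, current CMSSW CPU inference (Table 2)
on_prem, remote = 0.010, 0.060   # s, Brainwave mean latencies
print(round(cmssw_cpu / on_prem))  # → 175 (on-prem speedup)
print(round(cmssw_cpu / remote))   # → 29, quoted as a factor of ~30
```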
Table 2: A summary comparison of total inference time for Brainwave, CPU, and GPU performance.

Type             Hardware               ⟨Inference time⟩   Max throughput   Setup
CPU              Xeon 2.6 GHz, 1 core   1.75 s             0.6 img/s        CMSSW, TF v1.06
CPU              i7 3.6 GHz, 1 core     500 ms             2 img/s          python, TF v1.10
CPU              i7 3.6 GHz, 8 core     200 ms             5 img/s          python, TF v1.10
GPU (batch=1)    NVidia GTX 1080        100 ms             10 img/s         python, TF v1.10
GPU (batch=32)   NVidia GTX 1080        9 ms               111 img/s        python, TF v1.10
GPU (batch=1)    NVidia GTX 1080        7 ms               143 img/s        TF internal, TF v1.10
GPU (batch=32)   NVidia GTX 1080        1.5 ms             667 img/s        TF internal, TF v1.10
Brainwave        Altera Arria 10        10 ms              660 img/s        CMSSW, on-prem
Brainwave        Altera Arria 10        60 ms              660 img/s        CMSSW, remote

Fig. 14: Standalone GPU inference time per image (top) and images per second (bottom) as a function of batch size for the TensorFlow official ResNet-50 model compared with the Azure ResNet-50 model. The dashed line indicates a time of 10 ms, consistent with the on-prem inference time of the Brainwave system.

GPU comparisons can be more nuanced, depending on the model implementation and batch sizes. (For that matter, CPU comparisons can also be nuanced when considering devices with many cores and large RAM; however, such devices do not fit in with the CMSSW computing model.) However, for a batch of one image, we can say that the
Brainwave inference latencies, both on-prem and remote (including network latencies), are of a similar order to local, physically connected GPU inference times. The GPU and Brainwave have similar maximum throughput, about 660 images per second, though the former only achieves this with a large batch size and the latter achieves this when accessed by many CPUs simultaneously. It should be emphasized that Brainwave achieves this performance using single-image requests and including network infrastructure for deployment as a service, while the GPU requires a large batch size for the same performance and is directly connected to the CPU via PCIe (Peripheral Component Interconnect express). As will be described in Sec. 6, future studies are needed to better understand the scalability and cost of different heterogeneous computing architectures. The performance of other coprocessors as services, including GPUs, is another item for future study.
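The throughput entries in Table 2 are simply reciprocals of the per-image times, which also makes the batch-of-one penalty for the directly attached GPU explicit (numbers taken from the TF-internal rows of the table):

```python
# Throughput from per-image latency, using the TF-internal GPU entries of
# Table 2, and the resulting penalty for running with a batch of one image.
def img_per_s(per_image_s):
    return 1.0 / per_image_s

gpu_batch1, gpu_batch32 = 7e-3, 1.5e-3     # s/img at batch sizes 1 and 32
print(round(img_per_s(gpu_batch1)))        # → 143 img/s, batch of 1
print(round(img_per_s(gpu_batch32)))       # → 667 img/s, batch of 32
print(round(gpu_batch1 / gpu_batch32, 1))  # → 4.7x batch-of-one penalty
```

Brainwave, by contrast, reaches its ~660 img/s with batch-of-one requests by filling its pipeline with many concurrent clients rather than by batching.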
6 Summary and outlook

The current computing model for particle physics will not suffice to keep up with the expected future increases in dataset size, detector complexity, and event multiplicity. Single-threaded CPU performance has stagnated in recent years; therefore, it is no longer viable to rely on improvements in the clock speed of general-purpose computing. Industry trends toward heterogeneous computing, i.e., mixed hardware computing platforms with CPUs communicating with GPUs, FPGAs, and ASICs as coprocessors, provide a potential solution that can perform calculations more than an order of magnitude faster than CPUs. The new coprocessor hardware is geared toward machine learning algorithms, which are parallelizable, high-performing even with reduced precision, and energy efficient. Therefore, to best utilize the new computing hardware, it is important to adopt machine learning algorithms in particle physics computing. Fortunately, machine learning is very common in particle physics, from simulation to reconstruction and analysis, and its usage continues to grow.

In this paper, we explore the potential of FPGAs to accelerate machine learning inference for particle physics computing. We focus on the acceleration of the ResNet-50 convolutional neural network model and adapt it to physics applications. As an example, we interpret jets, collimated sprays of particles produced in LHC collisions, as 2D images that are classified by ResNet-50. We keep the same architecture but train new weights to distinguish top quark jets from light quark and gluon jets. Using a publicly available dataset, we compare our model against other state-of-the-art models in the literature and find similarly excellent performance. We also discuss the potential for Brainwave to be used in other particle physics applications. For example, neutrino experiments deploy large convolutional neural networks for event reconstruction, and large network inferences are a bottleneck in their current computing workflow. Coprocessor-accelerated machine learning inference could be deployed for such neutrino experiments today.

We accelerate ResNet-50 using the newly available Microsoft Brainwave platform, which deploys FPGA coprocessors as a service. We find that using machine learning acceleration as a service is a simple yet very high-performing approach that can be integrated into modern particle physics experimental software with little disruption. Using open-source RPC protocols, we can communicate with Brainwave from our datacenters with our experimental software to accelerate machine learning inference. We refer to this workflow as SONIC (Services for Optimized Network Inference on Coprocessors).

Even including the network transit time from the Fermilab datacenter in Illinois to the Microsoft datacenter in Virginia, the inference latency is still 30 times faster than our current, default CPU performance. We test Brainwave both as a cloud service and as an edge (on-premises) service, with ResNet-50 inferences averaging 60 and 10 ms, respectively. For the edge scenario, including network service infrastructure, this is comparable to the performance of a GPU connected directly to the CPU for a batch of one image, which is important for the particle physics event processing model. We also study the scalability of the SONIC workflow by having many batch CPU jobs make requests to a single FPGA service. We find that, even in very extreme scenarios where the job's only task is to access the Brainwave service, 50–100 simultaneous CPU jobs can be executed with little drop in latency while greatly improving the throughput of the FPGA, to the point where a GPU can only be competitive with large batch sizes. This result suggests that a setup with many CPUs connecting to one service will be more than sufficient for our computing needs and will be more cost-effective.

This proof-of-concept work has potentially revolutionary implications for many large-scale scientific experiments. Further academic studies and industry developments will help to bring this technology to maturity; we highlight a few in particular.

– Continue efforts to design machine learning algorithms to replace particle physics algorithms. New commercial coprocessors are being designed with machine learning applications in mind, and particle physics should capitalize on this.
– Develop tools for generically translating models and explore a broad offering of potential hardware. While we have explored a specific ResNet-50 network architecture, machine learning algorithms for different types of physics applications will require very different network architectures. We will need to explore all the available tools to automate network translation for specialized hardware. The various hardware options coming onto the market should be explored and benchmarked.
– Continue to build infrastructure and study scalability and cost. We have developed a minimal experimental software framework for communicating with Brainwave. This will have to grow in sophistication for authentication, communication, flexibility, and scalability to operate within the worldwide grid computing paradigm.

Future heterogeneous computing architectures are a powerful and exciting solution to particle physics computing challenges. This study is the first demonstration of how to integrate them into our physics algorithms and our computing model to enable new discoveries in fundamental physics.
Acknowledgements
We would like to thank the entire Microsoft Azure Machine Learning, Bing, and Project Brainwave teams for the development of the acceleration platform and the opportunity to preview and study it. In particular, we would like to acknowledge Doug Burger, Eric Chung, Jeremy Fowers, Daniel Lo, Kalin Ovtcharov, and Andrew Putnam for their support and enthusiasm. We would like to thank Lothar Bauerdick and Oliver Gutsche for seed funding through USCMS computing operations. We would like to thank Alex Himmel and other NOvA collaborators for support and comments on the manuscript.
Part of this work was conducted at "iBanks," the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro, and the Kavli Foundation for their support of "iBanks." Part of this work was conducted using Google Cloud resources provided by the MIT Quest for Intelligence program. Part of this work is supported through IRIS-HEP under NSF grant 1836650. We thank the organizers of the publicly available top tagging dataset (and others like it) for providing benchmarks for the physics community.

The authors thank the NOvA collaboration for the use of its Monte Carlo software tools and data and for the review of this manuscript. This work was supported by the US Department of Energy and the US National Science Foundation. NOvA receives additional support from the Department of Science and Technology, India; the European Research Council; the MSMT CR, Czech Republic; the RAS, RMES, and RFBR, Russia; CNPq and FAPEG, Brazil; and the State and University of Minnesota. We are grateful for the contributions of the staff at the Ash River Laboratory, Argonne National Laboratory, and Fermilab.

On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
1. G. Apollinari, I. Béjar Alonso, O. Brüning, M. Lamont, L. Rossi. https://cds.cern.ch/record/2116337 (2015)
2. HEP Software Foundation. https://arxiv.org/abs/1712.06982 (2017)
3. R. Acciarri, et al. https://arxiv.org/abs/1601.05471 (2016)
4. G. Mellema, et al., Exper. Astron., 235 (2013). DOI 10.1007/s10686-013-9334-5
5. National Research Council (2011). DOI 10.17226/12980
6. R. Acciarri, et al., JINST (03), P03011 (2017). DOI 10.1088/1748-0221/12/03/P03011
7. A. Aurisano, A. Radovic, D. Rocco, A. Himmel, M.D. Messier, E. Niner, G. Pawloski, F. Psihas, A. Sousa, P. Vahle, JINST (09), P09001 (2016). DOI 10.1088/1748-0221/11/09/P09001
8. K. He, X. Zhang, S. Ren, J. Sun, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). DOI 10.1109/CVPR.2016.90
9. S. Chatrchyan, et al., JINST, P09009 (2013). DOI 10.1088/1748-0221/8/09/P09009
10. T.Q. Nguyen, D. Weitekamp, D. Anderson, R. Castello, O. Cerri, M. Pierini, M. Spiropulu, J.R. Vlimant. https://arxiv.org/abs/1807.00083 (2018)
11. S. Chatrchyan, et al., Phys. Lett. B, 30 (2012). DOI 10.1016/j.physletb.2012.08.021
12. G. Aad, et al., Phys. Lett. B, 1 (2012). DOI 10.1016/j.physletb.2012.08.020
13. J. Duarte, et al., JINST (07), P07027 (2018). DOI 10.1088/1748-0221/13/07/P07027
14. J.F. Low, A.W. Brinkerhoff, E.L. Busch, A.M. Carnes, I.K. Furic, S. Gleyzer, K. Kotov, A. Madorsky, J.T. Rorie, B. Scurlock, W. Shi, D.E. Acosta, Tech. Rep. CMS-CR-2017-361, CERN, Geneva (2017). URL https://cds.cern.ch/record/2289251
15. Gregor Kasieczka, Michael Russell, Tilman Plehn. Top Tagging Reference Dataset. https://goo.gl/XGYju3 (2017)
16. D.S. Ayres, et al. (2007). DOI 10.2172/935497
17. A. Caulfield, E. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.Y. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, D. Burger (IEEE Computer Society, 2016)
18. CMS Collaboration, Technical Proposal for the Phase-II Upgrade of the Compact Muon Solenoid. CMS Technical Proposal CERN-LHCC-2015-010, CMS-TDR-15-02 (2015). URL https://cds.cern.ch/record/2020886
19. K. Simonyan, A. Zisserman (2014). URL https://arxiv.org/abs/1409.1556
20. G. Huang, Z. Liu, K.Q. Weinberger, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p. 2261 (2017). DOI 10.1109/CVPR.2017.243
21. Xilinx. Xilinx ML Suite. https://github.com/Xilinx/ml-suite (2018)
22. Tensorflow. Using TPUs. (2018)
23. Intel. Intel distribution of OpenVINO toolkit. https://software.intel.com/en-us/openvino-toolkit (2018)
24. G. Kasieczka, et al. The Machine Learning Landscape of Top Taggers. https://arxiv.org/abs/1902.09914 (2019)
25. A. Butter, G. Kasieczka, T. Plehn, M. Russell, SciPost Phys. (3), 028 (2018). DOI 10.21468/SciPostPhys.5.3.028
26. T. Sjöstrand, S. Ask, J.R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C.O. Rasmussen, P.Z. Skands, Comput. Phys. Commun., 159 (2015). DOI 10.1016/j.cpc.2015.01.024
27. P. Skands, S. Carrazza, J. Rojo, Eur. Phys. J. C (8), 3024 (2014). DOI 10.1140/epjc/s10052-014-3024-y
28. J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, M. Selvaggi, JHEP, 057 (2014). DOI 10.1007/JHEP02(2014)057
29. M. Cacciari, G.P. Salam, G. Soyez, Eur. Phys. J. C, 1896 (2012). DOI 10.1140/epjc/s10052-012-1896-2
30. M. Cacciari, G.P. Salam, Phys. Lett. B, 57 (2006). DOI 10.1016/j.physletb.2006.08.037
31. M. Cacciari, G.P. Salam, G. Soyez, JHEP, 063 (2008). DOI 10.1088/1126-6708/2008/04/063
32. H. Qu, L. Gouskos. ParticleNet: Jet Tagging via Particle Clouds. https://arxiv.org/abs/1902.08570 (2019)
33. V. Nair, G.E. Hinton, in Proceedings of ICML, vol. 27 (2010), pp. 807–814
34. D.P. Kingma, J. Ba. Adam: A method for stochastic optimization. https://dblp.org/rec/bib/journals/corr/KingmaB14 (2014)
35. P. Adamson, et al., Phys. Rev. Lett. (23), 231801 (2017). DOI 10.1103/PhysRevLett.118.231801
36. J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, in CVPR09 (2009)
37. Private communications with Alex Himmel (Fermilab)
38. A. Radovic, M. Williams, D. Rousseau, M. Kagan, D. Bonacorsi, A. Himmel, A. Aurisano, K. Terao, T. Wongjirad, Nature (7716), 41 (2018). DOI 10.1038/s41586-018-0361-2
39. K. Albertsson, et al., J. Phys. Conf. Ser. (2), 022008 (2018). DOI 10.1088/1742-6596/1085/2/022008
40. S. Farrell, D. Anderson, P. Calafiura, G. Cerati, L. Gray, J. Kowalkowski, M. Mudigonda, Prabhat, P. Spentzouris, M. Spiropoulou, A. Tsaris, J.R. Vlimant, S. Zheng, EPJ Web Conf., 00003 (2017). DOI 10.1051/epjconf/201715000003
41. CERN. TrackML Particle Tracking Challenge. (2018)
42. M. Paganini, L. de Oliveira, B. Nachman, Phys. Rev. D97 (1), 014021 (2018). DOI 10.1103/PhysRevD.97.014021
43. Google. gRPC. version v1.14.0. https://grpc.io/ (2018)
44. Google. Protocol Buffers. https://github.com/protocolbuffers/protobuf (2019)
45. CMS Collaboration. CMSSW. version CMSSW_10_2_0. https://github.com/cms-sw/cmssw (2018)
46. Intel. Thread Building Blocks. version 2018 U1 (2018)
47. K. Pedro. SonicCMS. version v3.1.0. https://github.com/hls-fpga-machine-learning/SonicCMS