Distance-Weighted Graph Neural Networks on FPGAs for Real-Time Particle Reconstruction in High Energy Physics
Yutaro Iiyama, Gianluca Cerminara, Abhijay Gupta, Jan Kieseler, Vladimir Loncar, Maurizio Pierini, Shah Rukh Qasim, Marcel Rieger, Sioni Summers, Gerrit Van Onsem, Kinga Wozniak, Jennifer Ngadiuba, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Dylan Rankin, Sergo Jindariani, Mia Liu, Kevin Pedro, Nhan Tran, Edward Kreinar, Zhenbin Wu
FERMILAB-PUB-20-405-E-SCD
Yutaro Iiyama
International Center for Elementary Particle Physics (ICEPP), University of Tokyo, Tokyo 113-0033, Japan

Gianluca Cerminara, Abhijay Gupta, Jan Kieseler, Vladimir Loncar∗, Maurizio Pierini, Shah Rukh Qasim†, Marcel Rieger, Sioni Summers, Gerrit Van Onsem, Kinga Wozniak‡
European Organization for Nuclear Research (CERN), CH-1211 Geneva 23, Switzerland

Jennifer Ngadiuba
California Institute of Technology, Pasadena, CA 91125, USA

Giuseppe Di Guglielmo
Columbia University, New York, NY 10027, USA

Javier Duarte
University of California San Diego, La Jolla, CA 92093, USA

Philip Harris, Dylan Rankin
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Sergo Jindariani, Mia Liu, Kevin Pedro, Nhan Tran§
Fermi National Accelerator Laboratory, Batavia, IL 60510, USA

Edward Kreinar
HawkEye360, Herndon, VA 20170, USA

Zhenbin Wu
University of Illinois at Chicago, Chicago, IL 60607, USA

August 11, 2020
∗ Also at Institute of Physics Belgrade, Pregrevica 118, Belgrade, Serbia
† Also at Manchester Metropolitan University, Manchester M15 6BH, UK
‡ Also at University of Vienna, 1010 Vienna, Austria
§ Also at Northwestern University, Evanston, IL 60208, USA

Abstract

Graph neural networks have been shown to achieve excellent performance for several crucial tasks in particle physics, such as charged particle tracking, jet tagging, and clustering. An important domain for the application of these networks is the FPGA-based first layer of real-time data filtering at the CERN Large Hadron Collider, which has strict latency and resource constraints. We discuss how to design distance-weighted graph networks that can be executed with a latency of less than 1 µs on an FPGA. To do so, we consider a representative task associated with particle reconstruction and identification in a next-generation calorimeter operating at a particle collider. We use a graph network architecture developed for such purposes, and apply additional simplifications to match the computing constraints of Level-1 trigger systems, including weight quantization. Using the hls4ml library, we convert the compressed models into firmware to be implemented on an FPGA. The performance of the synthesized models is presented both in terms of inference accuracy and resource usage.

Keywords: deep learning · FPGA · fast inference · graph networks · imaging calorimeter

At the CERN Large Hadron Collider (LHC), high-energy physics (HEP) experiments collect signals generated by the particles produced in high-energy proton collisions that occur every 25 ns, when two proton beams cross. The readout from the detectors that capture the particles emerging from the collision is filtered by a real-time processing system, known as the trigger, that discards uninteresting collision events based on a set of predefined algorithms.
The trigger system is structured in two stages: a Level-1 trigger (L1T), implemented with custom electronics on-detector and field-programmable gate arrays (FPGAs), and a high-level trigger (HLT), consisting of a computer farm, possibly including co-processor accelerators such as graphics processing units (GPUs) and FPGAs. Because of asynchronous event processing at the HLT, the accept/reject decision has to be reached with a typical latency of O(100) ms. At the L1T, however, a decision must be taken within a fixed latency of O(1) µs. The main limitations are the synchronous, "hard-deadline" nature of the processing system and the limited size of the memory buffer for the data from each beam crossing.

While HLT algorithms have a complexity comparable to those used offline to produce the final physics results, a typical L1T algorithm consists of simpler rules based on coarser objects in order to satisfy the latency constraint. Consequently, the resolution of quantities computed at the L1T is typically poor compared to offline quantities. Recently, the successful deployment of the first machine learning (ML) L1T algorithm, based on a boosted decision tree (BDT), at the LHC [1] has changed this tendency, raising interest in using ML inference as a fast-to-execute approximation of complex algorithms with good accuracy. This first example consisted of a large, pre-computed table of input and output values implementing a BDT, which raises the question of how to deploy more complex architectures. This question motivated the creation of hls4ml [2], a library designed to facilitate the deployment of ML algorithms on FPGAs.

A typical hls4ml workflow begins with a neural network model that is implemented and trained using Keras [3], PyTorch [4], or TensorFlow [5]. The trained model is passed to hls4ml, directly or through the ONNX [6] interface, and converted to C++ code that can be processed by a high-level synthesis (HLS) compiler to produce FPGA firmware. By design, hls4ml targets low-latency applications. To this end, its design prioritizes all-on-chip implementations of the most common network components. Its functionality has been demonstrated with dense neural networks (DNNs) [2] and extended to also support BDTs [7]. Extensions to convolutional and recurrent neural networks are in development. The library comes with handles to compress the model by quantization, down to binary and ternary precision [8]. Recently, support for QKeras [9] models has been added, in order to allow for quantization-aware training of models [10]. While hls4ml applications go beyond HEP, its development has been driven by the LHC L1T use case.

Graph neural networks (GNNs) are among the complex architectures whose L1T implementations are in high demand, given the growing list of examples showing how well GNNs can deal with tasks related to HEP [11–22]. In fact, while the irregular geometry of a typical HEP detector complicates the use of computer vision techniques such as convolutional neural networks, GNNs can naturally deal with the sparse and irregular nature of HEP data.

In this work, we show how a graph model can be efficiently deployed on FPGAs to perform inference within O(1) µs for HEP-related problems. We consider the distance-weighted architecture GarNet, introduced in Ref. [15], which is designed to keep resource consumption under control by reducing the number of operations as much as possible. It has been demonstrated to perform well for a HEP-related task, namely particle reconstruction in a calorimeter. For these reasons, it represents a good candidate for our purpose. The firmware implementation of GarNet presented in this work has been included in hls4ml, representing the first graph-based algorithm available in the library.

We present a case study of a neural network algorithm based on GarNet, applied to the task of identifying the nature of an incoming particle and simultaneously estimating its energy from the energy deposition patterns in a simulated imaging calorimeter. The inference accuracy of the firmware implementation of the algorithm is compared against its offline counterpart running on processors (CPUs and GPUs). The latency and resource utilization of the translated FPGA firmware are reported, along with a discussion of their implications for real-world usage of similar algorithms.

This paper is structured as follows. In Section 2, we briefly recount related work. Section 3 defines the main problem by outlining the challenges in designing a graph network compatible with L1T latency and resource constraints. Section 4 describes how GarNet addresses these challenges, and introduces a simplified form of the algorithm with a better affinity to a firmware implementation. The case study using a calorimeter simulation is presented in Section 5, with detailed descriptions of the task setup, the model architecture, the training results, and a summary of the FPGA firmware synthesis. Finally, conclusions are given in Section 6.
Graph neural networks are gaining interest in HEP applications, mainly due to their intrinsic value when dealing with sparse input datasets, which are very common in HEP. A recent review of applications of GNNs to HEP problems may be found in Ref. [22]. In particular, dynamic GNNs [15, 23–25] are relevant for particle reconstruction tasks, such as tracking [20] and calorimetry [15].

In order to fully exploit DL-based particle reconstruction, the capability of deploying DL models to the real-time data filtering system of a HEP detector is essential. For the FPGA-based L1T systems of the LHC, automatic tools for network-to-circuit conversion, such as the hls4ml library, are emerging. Using hls4ml, several solutions for HEP-specific tasks (e.g., jet tagging) have been provided [2, 7, 8, 10], exploiting models with simpler architectures than what is shown here. This tool has been applied extensively to tasks in the HL-LHC upgrade of the CMS L1T system, including an autoencoder for anomaly detection, and DNNs for muon energy regression and identification, tau lepton identification, and vector boson fusion event classification [26].
In the framework of Ref. [27], a graph is a triplet ($V$, $E$, $U$), where $V$ is a set of entities (vertices), each possessing some attributes in a fixed format, $E$ is a set of pairwise relations (edges) between the elements of $V$, potentially possessing some additional attributes, and $U$ is a set of global (graph-level) attributes. While a GNN can be any neural network that acts on such graphs, in this work we specifically consider graph networks (GNs) [27], i.e., architectures that consist of repeatable graph-to-graph mapping blocks (GN blocks). Each GN block performs some combination of operations such as edge feature transformation, aggregation of neighbors' features at each vertex, vertex feature transformation, global aggregation of edge and vertex features, and global feature transformation. A GN takes a graph as an input sample, where the cardinality of $V$ may differ from sample to sample, and infers its properties, which may be anything from a global scalar, such as a classification label for the sample, to new edge attributes.

To be usable as a part of an LHC L1T system, an algorithm must execute within $\mathcal{O}(1)$ µs and have the throughput to accept all inputs from each beam crossing, every 25 ns. Time-multiplexing, whereby $N$ copies of the algorithm accept inputs from $N$ different beam crossings, may be used to decrease the throughput requirement by a factor of $N$. Additionally, there is a practical constraint that the firmware implementation should fit in the FPGA resources of the system, i.e., utilize resources such as digital signal processing units (DSPs), look-up tables (LUTs), flip-flops (FFs), and block RAM (BRAM) within the limits of chips available on the market. Satisfying these requirements with a GNN can be challenging for multiple reasons, listed below.

• Model depth: Within each GN block, vertices exchange information with other directly connected vertices or with global attributes. Therefore, to expand the receptive field of each vertex beyond the nearest neighbors, multiple GN blocks must be repeated in the network. Given that the various transformations within each GN block are often themselves multilayer perceptrons (MLPs), GNN models tend to be quite deep. Deep networks go against the latency requirement, as each perceptron layer uses at least one clock cycle on an FPGA under a straightforward implementation, and also against the resource usage requirement, because MLPs make heavy use of multiplications.

• Input size: Typically, for problems where the application of GNNs is interesting, the cardinality of $V$ is at least $\mathcal{O}(10^2)$. Even with the high degree of parallelism of FPGAs, the finiteness of the compute resources means that such large inputs have to be processed serially to a certain extent, increasing the latency and the interval before a new input can be accepted, known as the initiation interval (II). Longer IIs lead to lower throughput.

• Memory usage: Related to the problem of the input size, if the algorithm requires temporary retention of features for all vertices or edges, the memory usage may be prohibitive for an FPGA firmware implementation.

• Memory access pattern: Except for certain cases, algorithms that have both $V$ and $E$ in the input usually require random memory access, for example when reading or writing features of the vertices at the ends of the edges. This poses a challenge in FPGA firmware design, not only because it implies that there must be a large enough memory bank to store all vertex and/or edge data, but also because random memory access is itself a costly operation [28]. The exceptions include when $E$ is trivial ($E = \emptyset$ or the graph is complete) and when all samples have an identical graph topology. In such cases, the memory access pattern of the algorithm is known at compile time and can therefore be statically scheduled in the FPGA firmware.

The case of $E = \emptyset$ is a rather extreme solution to the last challenge, but it is also attractive in terms of memory usage. In fact, even without explicit input edge features, a GNN can infer regional and non-local properties of the graph by globally gathering the vertex features and then scattering the gathered information back to the vertices. This information flow can also be mediated by a learnable attention mechanism [29]. The attention mechanism suppresses information from vertices that are considered unimportant, effectively forming "soft" edges among the unsuppressed vertices.

In the next section, we study a GNN architecture with these exact properties, then discuss the modifications to the architecture that make it suitable for an FPGA firmware implementation.

hls4ml framework

In this work, we consider
GarNet [15] as a specific example of a GNN. A GarNet layer is a GN block that takes as input a set of $V$ vertices, each possessing $F_\mathrm{in}$ features, and returns the same set of vertices with $F_\mathrm{out}$ features. In a GarNet layer, the $F_\mathrm{in}$ features of each vertex are encoded into an internal representation and gathered at $S$ aggregators. A distance parameter between each of the aggregators and the vertices is also computed from the vertex attributes. The information gathered at the aggregators is then sent back to the individual vertices and decoded into $F_\mathrm{out}$ features. Communications between the vertices and aggregators are weighted by a decreasing function of the distance parameter, implementing an attention mechanism that allows the network to learn a dynamic, nontrivial graph structure from the vertex input alone.

The original GarNet algorithm, while already using less compute and memory resource than other similar GNN architectures in Refs. [15, 23], is still challenging to implement as fast, high-throughput FPGA firmware. The biggest problem arises from the use of the input feature vector as a part of the input to the decoder, which requires retention of the input data until the last steps of the algorithm. An immediate consequence of this requirement is a longer II, because the processing of a new sample cannot start while the input data for the current sample are still in use. Furthermore, the input feature vector is already used to compute the distance parameter as well as the internal representation of each vertex, so reusing the input in the decoder creates a complex data flow, restricting the options for pipelining the algorithm. We therefore designed a modified
GarNet algorithm with a simplified processing flow:

• Input transformation (Figs. 1a and 1b): An encoder network converts the features $g_{jv}$ ($j = 1, \ldots, F_\mathrm{in}$) of the $v$th vertex ($v = 1, \ldots, V$) into an internal learned representation vector $f_{iv}$ ($i = 1, \ldots, F_\mathrm{LR}$). In parallel, another network (the distance calculator) also acts on $g_{jv}$ and computes the distance parameters $d_{av}$ ($a = 1, \ldots, S$) between the vertices and the $S$ aggregators. Implicitly, this means that a complete bipartite graph with $VS$ edges is built from $V$ and $S$, where $S$ is the set of aggregators (Fig. 1b). The encoder and distance calculator networks are both single-layer perceptrons with linear activation functions, so one can write them as linear transformations
$$f_{iv} = \sum_{j=1}^{F_\mathrm{in}} w_{ij} g_{jv} + b_{i}, \qquad (1)$$
$$d_{av} = \sum_{j=1}^{F_\mathrm{in}} \alpha_{aj} g_{jv} + \beta_{a}, \qquad (2)$$
where $(w_{ij}, b_{i})$ and $(\alpha_{aj}, \beta_{a})$ are the kernels and biases of the encoder and distance calculator networks, respectively.

• Aggregation (Fig. 1c): The learned representation vectors $f_{iv}$ of the vertices are weighted by a potential function $W_{av} = \exp(-d_{av})$ and averaged across the vertices. In other words, the $i$th averaged feature $h_{ia}$ of aggregator $a$ is written as
$$h_{ia} = \frac{1}{V_{\max}} \sum_{v=1}^{V} W_{av} f_{iv}. \qquad (3)$$
The factor $V_{\max}$ in the denominator is the maximum possible value of the vertex multiplicity $V$ (as $V$ may have a different value for each input sample). Through this normalization by a common factor, the information about the size of the sample (the cardinality of $V$) is effectively encoded into $h_{ia}$.

• Output transformation (Figs. 1d and 1e): The aggregated features are sent back to the vertices using the same weights,
$$\tilde{f}_{iav} = W_{av} h_{ia}, \qquad (4)$$
and then transformed by a single-layer decoder network with a linear activation function into the final output representation $g_{kv}$ ($k = 1, \ldots, F_\mathrm{out}$). With the kernel $u$ and bias $c$ of the decoder, this is written as
$$g_{kv} = \sum_{i=1}^{F_\mathrm{LR}} \sum_{a=1}^{S} u_{kia} \tilde{f}_{iav} + c_{k}. \qquad (5)$$

Figure 1: Processing flow of the modified GarNet algorithm: (a) The input features ($g_{jv}$) of each vertex are processed by a linear network, which returns a new set of features ($f_{iv}$) and the vertex's distance from the $S$ aggregators ($d_{av}$). (b) A graph is built in the learned space, using the $d_{av}$ distances. (c) A message is gathered by each aggregator, as a weighted sum across the vertices of $f_{iv}$, with $W_{av} = \exp(-d_{av})$ as weights. (d) A message from each aggregator ($\tilde{f}_{iav}$) is passed back to each vertex, with the same $W_{av}$ weight. (e) The aggregated outputs of each vertex are given as input to a neural network, which returns the learned representation.

This simplified algorithm differs from the original design in the following ways. First, only the mean over the vertices is computed at the aggregators, whereas the maximum is also used in the original design. In other words, the aggregators in the original design have
$$h^{\max}_{ia} = \max_{v} W_{av} f_{iv} \qquad (6)$$
as an additional set of features. Secondly, as already noted, the input feature vector is not used as a part of the input to the decoder network. In the original GarNet design, the decoder is expressed as
$$g_{kv} = \sum_{i=1}^{F_\mathrm{LR}} \sum_{a=1}^{S} W_{av} \left( u_{kia} h_{ia} + u^{\max}_{kia} h^{\max}_{ia} \right) + \sum_{i=1}^{F_\mathrm{in}} w'_{ki} g_{iv} + c_{k}, \qquad (7)$$
with additional sets of kernel weights $u^{\max}$ and $w'$. Finally, the original design applies a nonlinear (tanh) activation function to the decoder, while the simplified version uses a linear activation. In the specific case considered in the next section, these simplifications result in negligible degradation of the network performance. In the remainder of this paper, this simplified version of the algorithm is referred to as GarNet.

It is worth pointing out that while the
GarNet layer uses only linear activation functions for all of its internal neural networks, it can still learn nonlinear functions through the nonlinearity of the potential function $W_{av}$. On the other hand, having no nonlinear activation functions allows a compact FPGA firmware implementation of the layer, consisting mostly of multiplications and additions. The only substantial computation comes with the exponential function, whose values can be pre-computed with sufficient granularity and stored.

An FPGA firmware implementation of the GarNet layer using Vivado [30] HLS is integrated into the hls4ml library. The HLS source code is written in C++ and is provided as a template, from which an HLS function for a GarNet layer can be instantiated by specifying the configurable parameters such as $S$, $F_\mathrm{LR}$, and $F_\mathrm{out}$. In the following, we provide some noteworthy details of the implementation.

In the HLS source code of GarNet, all quantities appearing in the computation are expressed as either integers or fixed-point numbers with a fractional precision of at least eight bits. In particular, the distance parameter $d_{av}$ is represented with three integer bits, eight fractional bits, and one sign bit. During the layer computation, $d_{av}$ is reinterpreted as a 12-bit unsigned integer, which is used to retrieve the corresponding pre-computed value of $W_{av}$ from a table with 4,096 entries.

The processing flow in Eqs. (1) to (5) is compactified in the hls4ml implementation by exploiting the linearity of the encoder, the average aggregation, and the decoder. Equations (1), (3), and (5) can be combined into
$$g_{kv} = \sum_{a=1}^{S} W_{av} \left( \sum_{j=1}^{F_\mathrm{in}} \tilde{w}_{kja} G_{ja} + \tilde{b}_{ka} L_{a} \right) + c_{k}, \qquad (8)$$
where
$$\tilde{w}_{kja} = \sum_{i=1}^{F_\mathrm{LR}} u_{kia} w_{ij}, \quad \tilde{b}_{ka} = \sum_{i=1}^{F_\mathrm{LR}} u_{kia} b_{i}, \quad G_{ja} = \frac{1}{V_{\max}} \sum_{v=1}^{V} W_{av} g_{jv}, \quad L_{a} = \frac{1}{V_{\max}} \sum_{v=1}^{V} W_{av}. \qquad (9)$$
In particular, the kernel and bias tensors of the encoder and decoder are contracted into $\tilde{w}$ and $\tilde{b}$ at logic synthesis time, resulting in fewer steps to arrive at the output from the input.

With this simplification, the input data from each sample are encoded into $W_{av}$, $G_{ja}$, and $L_{a}$. Therefore, a new sample can be processed as soon as these three quantities from the previous sample are computed. In other words, the II of the overall GarNet layer depends on the number of clock cycles needed to compute the three quantities. Furthermore, $G_{ja}$ and $L_{a}$ can be derived trivially from $W_{av}$, making the latency of the computation of the latter the critical determinant of the throughput of the algorithm.

The computation of $W_{av}$ is performed independently on each vertex, and is therefore parallelizable across the vertices.
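As a concrete check of this contraction, the staged processing flow of Eqs. (1)–(5) and the combined form of Eqs. (8)–(9) can both be sketched in NumPy. This is an illustrative re-implementation from the equations above, not the hls4ml C++ template; the array shapes, the zero-padding convention for unused vertices, and the random inputs are our own choices.

```python
import numpy as np

def garnet_forward(g, n_vert, w, b, alpha, beta, u, c, v_max):
    """Staged forward pass of the simplified GarNet layer, Eqs. (1)-(5).

    g           : (v_max, F_in) zero-padded vertex features; first n_vert rows are real
    w, b        : encoder kernel (F_LR, F_in) and bias (F_LR,)
    alpha, beta : distance-calculator kernel (S, F_in) and bias (S,)
    u, c        : decoder kernel (F_out, F_LR, S) and bias (F_out,)
    """
    g = g.copy()
    g[n_vert:] = 0.0
    f = g @ w.T + b                        # Eq. (1): learned representation f_iv
    d = g @ alpha.T + beta                 # Eq. (2): distance parameters d_av
    W = np.exp(-d)                         # potential function W_av
    W[n_vert:] = 0.0                       # padded vertices do not participate
    h = W.T @ f / v_max                    # Eq. (3): aggregator features h_ia
    f_tilde = W[:, None, :] * h.T[None, :, :]        # Eq. (4): f~_iav, (v, i, a)
    return np.einsum('kia,via->vk', u, f_tilde) + c  # Eq. (5): output g_kv

def garnet_forward_contracted(g, n_vert, w, b, alpha, beta, u, c, v_max):
    """Equivalent single-step form of Eqs. (8)-(9), with pre-contracted kernels."""
    g = g.copy()
    g[n_vert:] = 0.0
    W = np.exp(-(g @ alpha.T + beta))
    W[n_vert:] = 0.0
    w_tilde = np.einsum('kia,ij->kja', u, w)   # contracted kernel, Eq. (9)
    b_tilde = np.einsum('kia,i->ka', u, b)     # contracted bias, Eq. (9)
    G = g.T @ W / v_max                        # G_ja, shape (F_in, S)
    L = W.sum(axis=0) / v_max                  # L_a, shape (S,)
    inner = np.einsum('kja,ja->ka', w_tilde, G) + b_tilde * L
    return W @ inner.T + c                     # Eq. (8): output g_kv

rng = np.random.default_rng(0)
V_MAX, F_IN, F_LR, S, F_OUT, N_VERT = 16, 4, 8, 4, 8, 11
g = rng.normal(size=(V_MAX, F_IN))
params = dict(w=rng.normal(size=(F_LR, F_IN)), b=rng.normal(size=F_LR),
              alpha=rng.normal(size=(S, F_IN)), beta=rng.normal(size=S),
              u=rng.normal(size=(F_OUT, F_LR, S)), c=rng.normal(size=F_OUT))
out_staged = garnet_forward(g, N_VERT, v_max=V_MAX, **params)
out_fast = garnet_forward_contracted(g, N_VERT, v_max=V_MAX, **params)
assert np.allclose(out_staged, out_fast)
```

Running the two functions on the same random inputs confirms that the contracted form reproduces the staged computation exactly, which is what allows the kernel contraction to be performed once at logic synthesis time.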
In a fully parallelized implementation, there would be $V_{\max}$ logic units (one unit per vertex) operated simultaneously. However, with $V$ typically as large as $\mathcal{O}(10^2)$ or greater, this configuration would consume too much of the FPGA resources and would not fit on a single chip. Therefore, the hls4ml implementation of GarNet allows a partial parallelization of the algorithm, controlled by a parameter called the reuse factor ($R_\mathrm{reuse}$). For $R_\mathrm{reuse} > 1$, the logic unit to compute $W_{av}$ is cloned $V_{\max}/R_\mathrm{reuse}$ times, such that each unit is reused serially up to $R_\mathrm{reuse}$ times. This serial reuse is fully pipelined with a local II of one clock cycle. The latency $T_W$ for computing $W_{av}$ for all vertices is therefore given by
$$T_W = T_{W,0} + R_\mathrm{reuse}, \qquad (10)$$
where $T_{W,0} \sim 20$ is the number of clock cycles needed to compute $W_{av}$ for one vertex. The value of $T_{W,0}$ depends on the numerical precision of the fixed-point numbers used in the computation.

Finally, the kernel and bias of the encoder and the kernel of the decoder can be quantized, such that each element takes only the values $-1$, $0$, or $1$ (ternary quantization) [31]. In the quantized version of the algorithm, the contracted kernel and bias $\tilde{w}$ and $\tilde{b}$ have elements that are $\mathcal{O}(1)$ integers. Multiplication of small integers with fixed-point numbers can be performed in FPGAs using LUTs rather than DSPs, which are usually the scarcer resource. Multiplications with LUTs also proceed faster than those with DSPs.

As a case study, the hls4ml implementation of
GarNet is applied to a representative task for the LHC L1T, namely reconstructing electrons and pions in a simulated 3D imaging calorimeter. In the following, we first describe the dataset used for the study, then define the task and the architectures of the ML models, and finally present the inference performance of the models and the resource usage of the synthesized firmware.
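The ternary quantization described in the previous section restricts each encoder and decoder weight to $\{-1, 0, 1\}$. A minimal NumPy sketch of the rounding step is given below; the snap-to-zero threshold of 0.5 is our assumption for illustration, and in the paper the quantization is actually handled by QKeras during training rather than by post-hoc rounding.

```python
import numpy as np

def ternarize(x, threshold=0.5):
    """Snap each weight to -1, 0, or +1.

    The 0.5 snap-to-zero threshold is an assumption for illustration;
    QKeras configures its own ternarization during quantization-aware training.
    """
    q = np.zeros_like(x)
    q[x > threshold] = 1.0
    q[x < -threshold] = -1.0
    return q

w = np.array([[0.8, -0.2, 0.1], [-1.3, 0.6, -0.4]])
w_q = ternarize(w)
# w_q == [[1, 0, 0], [-1, 1, 0]]: multiplying by these values reduces to
# sign flips and additions, which map onto LUTs instead of DSPs
```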
The calorimeter is a multi-layered full-absorption detector with a geometry similar to the one described in Ref. [15]. The detector is made entirely of tungsten, which serves as both absorber and sensitive material, and no noise or threshold effects in the readout electronics are simulated. While this homogeneous calorimeter design is not a faithful representation of a modern sampling calorimeter, the simplification allows us to evaluate the performance of the ML models decoupled from detector effects.

The calorimeter extends 36 cm in $x$ and $y$ and has a total depth in $z$ of 2 m, corresponding to approximately 20 nuclear interaction lengths and 170 radiation lengths. The coordinate origin is placed at the center of the front face of the calorimeter. The calorimeter is segmented into 50 layers along $z$, with each layer divided into small square cells in the $x$-$y$ plane, forming a three-dimensional imaging detector. Cells are oriented so that their sides are parallel to the $x$ and $y$ axes. The tiling of the cells in each layer is uniform except in one quadrant, where the cell sides are half as long as those in the other area. The aim of this tiling is to incorporate the irregularity of the geometry of a real-life particle physics calorimeter. The quadrant with smaller cells and the remainder of the layer are called the high granularity (HG) and low granularity (LG) regions, respectively. The first 25 layers in $z$ correspond to the electromagnetic calorimeter, with a layer thickness of 1 cm and cell dimensions of 2.25 cm × […]. […] Geant4 [32].

Each event used in this study contains a high-energy primary particle and low-energy pileup particles, which represent backgrounds from simultaneous additional proton-proton interactions. The primary particle is either an electron (e−) or a charged pion (π±), shot at the calorimeter with momentum aligned along the $z$ axis, i.e., perpendicular to the front face of the calorimeter. The $x$ and $y$ coordinates of the particle's origin are randomly sampled from a uniform distribution in a 10 cm × 10 cm region centered at $x = y = 0$. With this procedure, we aim to mimic a realistic situation in which the actual calorimeter extends over a much larger surface, and the area covered by the geometry used in this study represents a portion of it. The value of the particle momentum is drawn randomly for each event from a uniform distribution between 10 GeV and 100 GeV. The pileup particles consist of photons (γ) and π±. The number of pileup particles is randomly sampled from a Poisson distribution with a mean of 40, with the π± multiplicity fixed to twice the γ multiplicity. This setup approximates the flux of pileup particles expected at a pseudorapidity η = 2 in a Δη × Δφ = 0.[…] × 0.[…] […] µ = 0.[…] c = 0.[…] the $z$ direction. The radius of the cylinder is set at 6.4 cm, so that the resulting cluster contains 95% of the energy of the primary particle in 50% of the pion events. Because electromagnetic showers generally have a narrower energy spread than hadronic showers, all of the electron events have at least 95% of the energy contained in the same cylinder. Typical events, with momenta of the primary particles around 50 GeV and total pileup energy close to the median of the distribution, are shown in Fig. 3a. The hits in the figure are colored by the fraction of the hit energy due to the primary particle (the primary fraction, $f_\mathrm{prim}$) to aid visualization.

The dataset used in this study thus contains one cluster per sample, given as an array of hits in the cluster, and one integer indicating the number of hits in the sample. Only hits with an energy greater than 120 MeV are considered. Each cluster contains at most 128 hits, sorted by hit energy in decreasing order. Note that the sorting of the hits has no effect on the neural network, and is only relevant when truncating the list of hits to consider smaller clusters, as explored later. In fact, 0.2% of the events resulted in clusters with more than 128 hits, for which the lowest energy hits were discarded from the dataset.
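The energy ordering and truncation described above can be sketched as follows, assuming each hit is stored as a row of (x, y, z, energy); this array layout is our illustrative choice, not the published file format.

```python
import numpy as np

MAX_HITS = 128

def prepare_cluster(hits):
    """Sort hits by energy (descending) and keep at most MAX_HITS.

    hits: (N, 4) array with columns x, y, z, energy (energy in column 3).
    Returns the truncated hit array and the remaining hit count.
    """
    order = np.argsort(hits[:, 3])[::-1]   # descending in energy
    hits = hits[order][:MAX_HITS]          # lowest-energy hits are dropped
    return hits, len(hits)

rng = np.random.default_rng(1)
raw = rng.uniform(size=(150, 4))           # a toy cluster with 150 hits
cluster, n_hits = prepare_cluster(raw)
assert n_hits == 128
assert cluster[0, 3] == raw[:, 3].max()    # highest-energy hit comes first
```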
Each hit is represented by four numbers, corresponding to the hit coordinates, given in $x$, $y$, and $z$, and the hit energy. The $x$ and $y$ coordinates are relative to the seed cell. The dataset consists of 500,000 samples, split evenly and randomly into e− and π± events, stored as NumPy [35] arrays in HDF5 format [36]. The dataset, together with the ground truth information, is available on the Zenodo platform [37].
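Because the network input has the fixed shape $V_{\max} \times F_\mathrm{in}$ while the hit count varies per cluster, each sample can be zero-padded to $V_{\max}$ rows and stored together with its integer hit count, as in the dataset description above. The sketch below illustrates that convention; the exact array names and dtypes are our assumptions, not the published dataset layout.

```python
import numpy as np

V_MAX, F_IN = 128, 4   # at most 128 hits, four features (x, y, z, energy)

def pad_cluster(hits):
    """Zero-pad a (n_hits, 4) hit array to (V_MAX, F_IN) and return the count.

    The integer count tells the network (and the firmware) how many rows
    are real vertices; the padded rows carry no information.
    """
    n = min(len(hits), V_MAX)
    padded = np.zeros((V_MAX, F_IN), dtype=np.float32)
    padded[:n] = hits[:n]
    return padded, n

hits = np.random.default_rng(2).uniform(size=(57, 4)).astype(np.float32)
x, n = pad_cluster(hits)
assert x.shape == (V_MAX, F_IN) and n == 57
assert np.all(x[n:] == 0.0)
```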
[Figure 3, panel titles: (a) "Electron 49.2 (48.0) GeV, Pileup 64.9 (21.4) GeV" and "Pion 50.9 (33.1) GeV, Pileup 59.7 (17.2) GeV"; (b) the same events, colored by ΔE_pred/Δh; axes x, y, z in cm; legend: Clustered / Unclustered.]
Figure 3: Examples of electron (left) and pion (right) events. Values in parentheses in the graph titles are the respective energy depositions contained in the cluster around the seed hit. Points represent hits in the detector, with their coordinates at the center of the corresponding detector cells and the size of the markers proportional to the square root of the hit energy. Opaque points are within the cluster, while translucent ones are not. In (a), the point color scale from blue to red corresponds to the primary fraction (see Section 5.1 for the definition). In (b), the color scale from blue to green corresponds to $\Delta E_\mathrm{pred}/\Delta h$, an indication of the importance the neural network model places on individual hits for energy regression. See Section 5.3 for details.

The task in this study is to identify the nature of the primary particle and to simultaneously predict its energy, given the hits in the cluster. The ability to reliably identify the particle type and estimate its energy at the cluster level in a local calorimeter trigger system greatly enhances the efficacy of high-level algorithms downstream in the L1T system, such as particle-flow reconstruction [38–40]. However, because of the distortion of the energy deposition pattern in the cluster due to pileup, particle identification based on collective properties of the hits, such as the depth of the energy center of mass, can achieve only modest accuracy. Furthermore, only half of the pion events have 95% of the energy deposition from the pion contained in the cluster, requiring substantial extrapolation in the energy prediction. The task is thus both practically relevant and sufficiently nontrivial as a test bench of a
GarNet-based ML model.

The architecture of the model is as follows. First, the input data, represented by a two-dimensional array of $V_{\max} \times F_\mathrm{in}$ numbers per cluster, are processed by a stack of three GarNet layers. The parameters ($S$, $F_\mathrm{LR}$, $F_\mathrm{out}$) are (4, […], 8) for the first two layers and (8, […], 16) for the last layer. The output of the last GarNet layer is averaged across the vertices for each of the 16 features. The resulting array of 16 numbers is then passed through two fully connected layers with 16 and 8 nodes and ReLU [41] activation. The data flow is split into two branches in the final step. The first branch consists of a fully connected layer with a single node, whose output is activated by a sigmoid function and is interpreted as the classification prediction, i.e., the predicted probability that the primary particle is an electron. The other branch also consists of a single-node fully connected layer, but with a linear activation of the output, which is interpreted as the predicted value of the energy of the particle.

This model is built in Keras [3], using the implementation of GarNet available in Ref. [42]. In total, the model has 3,402 trainable parameters (2,976 in the three GarNet layers), whose values are optimized through a supervised training process using the Adam optimizer [43]. The input is processed in batches of 64 samples during training. The overall objective function minimized in the training is a weighted sum of objective functions for the classification and regression tasks:
$$L = \beta L_\mathrm{class} + (1 - \beta) L_\mathrm{reg}, \qquad (11)$$
with $\beta = 0.01$. The objective function for classification, $L_\mathrm{class}$, is the binary cross entropy in each batch between the truth labels (electrons are represented by 1 and pions by 0) and the classification output of the model. The objective function for regression, $L_\mathrm{reg}$, is the batch mean of the relative squared error
$$L_\mathrm{reg} = \left[ (E_\mathrm{pred} - E_\mathrm{true})/E_\mathrm{true} \right]^{2}, \qquad (12)$$
where $E_\mathrm{pred}$ and $E_\mathrm{true}$ are the predicted and true energies of the primary particle, respectively. The training is performed on 400,000 training and 100,000 validation samples over a few hundred epochs, with early stopping when the value of the objective function does not improve for ten consecutive epochs. Keeping the full training dataset in RAM and using two NVIDIA GeForce RTX 2080 Ti GPUs in parallel, each epoch takes roughly 30 seconds to process.

Additionally, we prepare a model in which the encoders and decoders of the GarNet layers are quantized as ternary networks using QKeras [9, 10], which performs quantization-aware training with the straight-through estimator, quantizing the layers during the forward pass but not the backward pass [44–46, 10]. In the following, this model is referred to as the quantized model, and the original model as the continuous model. The quantized model is trained with the same objective function and training hyperparameters as the continuous model.

To evaluate the inference performance of the trained models, reference algorithms are defined separately for the classification and regression subtasks. The reference algorithm for classification (cut-based classification) computes the energy-weighted mean $\bar{z}$ and standard deviation $\sigma_z$ of the $z$ coordinates of the hits,
$$\bar{z} = \frac{\sum_{i=1}^{V} z_{i} h_{i}}{\sum_{i=1}^{V} h_{i}} \quad \text{and} \quad \sigma_{z} = \sqrt{\frac{\sum_{i=1}^{V} (z_{i} - \bar{z})^{2} h_{i}}{\sum_{i=1}^{V} h_{i}}}, \qquad (13)$$
where $i$ is the index of the hits in the cluster and $z_{i}$ and $h_{i}$ are the $z$ coordinate and energy of the $i$th hit. The cluster is labeled as an electron if $\bar{z} < \bar{z}^\mathrm{cut}$ and $\sigma_{z} < \sigma_{z}^\mathrm{cut}$, where $\bar{z}^\mathrm{cut}$ and $\sigma_{z}^\mathrm{cut}$ are predefined thresholds. Pions, and hadrons in general, tend to penetrate deeper into an absorbing detector and create showers of secondary particles with a larger transverse size than electrons and photons. For regression, the reference algorithm (weight-based regression) predicts the energy of the primary particle through the formula
$$E_\mathrm{pred}^\mathrm{ref} = \sum_{i=1}^{V} w_{l(i)} \left( h_{i} + b_{l(i)} \right), \qquad (14)$$
where $l(i)$ is the detector $z$ layer of hit $i$. Parameters $\{w_{l}, b_{l}\}$ ($l = 1, \ldots,$
QKeras [9, 10], which performs quantization-aware training with the straight-through estimator by quantizing the layers during a forward pass but not a backward pass [44-46, 10]. In the following, this model is referred to as the quantized model, and the original model as the continuous model. The quantized model is trained with the same objective function and training hyperparameters as the continuous model.

To evaluate the inference performance of the trained models, reference algorithms are defined separately for the classification and regression subtasks. The reference algorithm for classification (cut-based classification) computes the energy-weighted mean z̄ and standard deviation σ_z of the z coordinates of the hits,

    z̄ = Σ_{i=1}^{V} z_i h_i / Σ_{i=1}^{V} h_i   and   σ_z = √( Σ_{i=1}^{V} (z_i − z̄)² h_i / Σ_{i=1}^{V} h_i ),    (13)

where i is the index of hits in the cluster and z_i and h_i are the z coordinate and energy of the i-th hit. The cluster is labeled as an electron if z̄ < z̄^cut and σ_z < σ_z^cut, where z̄^cut and σ_z^cut are predefined thresholds. Pions, and hadrons in general, tend to penetrate deeper into an absorbing detector and create showers of secondary particles with a larger transverse size than electrons and photons. For regression, the reference algorithm (weight-based regression) predicts the energy of the primary particle through the formula

    E_pred^ref = Σ_{i=1}^{V} w_{l(i)} (h_i + b_{l(i)}),    (14)

where l(i) is the detector z layer of hit i. Parameters {w_l, b_l} (l = 1, . . . ,
50) are determined by minimizing L_reg over the training dataset, using E_pred^ref as the predicted energy. Particle identification based on the energy deposition profile of the cluster and energy estimation based on a weighted sum of hit energies are both common strategies in conventional, non-ML-based event reconstruction approaches.

Performance of the trained continuous and quantized models, evaluated using the validation sample, is shown in Fig. 4. For each ML model, the inference results based on the original
Keras model and the HLS model, converted using hls4ml, are shown. The HLS model provides a realistic emulation of the synthesized FPGA firmware.

The classification performance is given in terms of receiver operating characteristic (ROC) curves that trace the electron identification efficiency (true positive fraction) and pion rejection efficiency (true negative fraction) for different thresholds of the classifiers. The two
GarNet-based models perform similarly and better than the cut-based reference in terms of the electron identification efficiency for a given pion rejection efficiency. A detailed comparison of the four sets of results from the GarNet-based models in the inset reveals that the continuous model performs slightly better than the quantized model, and that the difference between the Keras and HLS implementations is smaller for the quantized model.

The regression performance is given in terms of the response (E_pred/E_true). Distributions of the response are summarized in 10 GeV bins of E_true, separately for the continuous model, the quantized model, and the weight-based reference. In each summary, the horizontal line in the box corresponds to the median of the distribution, the top and bottom of the box to the upper and lower quartiles, and the upper and lower ends of the whiskers to the 95th and 5th percentiles. The GarNet-based models exhibit narrower spreads of the response distributions in most of the bins, with the continuous model again performing slightly better than the quantized model.
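The cut-based classification of Eq. (13) and the weight-based regression of Eq. (14) used as references above are simple enough to sketch directly. The following is a minimal NumPy sketch; the threshold values and per-layer parameters passed in are illustrative placeholders, not the ones fitted in this study.

```python
import numpy as np

def cut_based_classify(z, h, z_bar_cut, sigma_z_cut):
    """Cut-based classification (Eq. (13)): label a cluster as an electron if
    the energy-weighted mean and standard deviation of the hit z coordinates
    are both below predefined thresholds."""
    z = np.asarray(z, dtype=float)
    h = np.asarray(h, dtype=float)
    z_bar = np.sum(z * h) / np.sum(h)
    sigma_z = np.sqrt(np.sum((z - z_bar) ** 2 * h) / np.sum(h))
    return bool(z_bar < z_bar_cut and sigma_z < sigma_z_cut)

def weight_based_regress(h, layer, w, b):
    """Weight-based regression (Eq. (14)): per-layer affine correction of hit
    energies, summed over hits; w[l] and b[l] are fitted per detector layer."""
    h = np.asarray(h, dtype=float)
    w = np.asarray(w, dtype=float)
    b = np.asarray(b, dtype=float)
    layer = np.asarray(layer)
    return float(np.sum(w[layer] * (h + b[layer])))
```

Both functions operate on one cluster at a time; in the reference algorithms the thresholds and the {w_l, b_l} parameters are optimized once on the training dataset and then held fixed.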
Figure 4: Classification (left) and regression (right) inference performance of the continuous and quantized GarNet-based models and the reference algorithms. Results from the Keras and HLS implementations are shown for the GarNet-based models. The classification performance is quantified with a ROC curve of electron identification efficiency versus pion rejection efficiency. The inset in the left graph shows a close-up view of the efficiency range 0.90-0.96 for both axes. The regression performance is quantified as the response (E_pred/E_true) in 10 GeV bins of E_true. The horizontal line in the box corresponds to the median of the distribution, the top and bottom of the box to the upper and lower quartiles, and the upper and lower ends of the whiskers to the 95th and 5th percentiles.

The differences between the Keras and HLS implementations are due to the numerical precision in the computation. While the former represents all fractional numbers as 32-bit floating-point numbers, the latter employs fixed-point numbers with bit widths of at most 18. Consequently, for the quantized model, where the encoder and decoder of the
GarNet layers employ integer weights for inference, the difference between the two implementations is smaller.

For both subtasks, the GarNet-based models generally outperform the reference algorithms. The reference algorithm has a narrower spread of the response in some energy bins for the regression subtask. However, it is important to note that the weights and biases appearing in Eq. (14) are optimized for a specific pileup profile, while in a real particle collider environment the pileup flux changes dynamically, even on the timescale of a few hours. In contrast, algorithms based on inference of properties of individual hits, such as the GarNet-based models presented in this study, are expected to be able to identify hits due to pileup even under different pileup environments and thus to have a stable inference performance with respect to changes in pileup flux. Since a detailed evaluation of the application-specific performance of GarNet is not within the scope of this work, we leave this and other possible improvements to the model architecture and training to future studies.

To verify that GarNet can infer relations between individual vertices without edges E in the input, the following test is performed. Using the two events shown in Fig. 3, the energy of each hit in the clusters is increased one at a time by 10%, and the inference with the continuous model is performed for each perturbed event. If the model has learned to perfectly distinguish the primary particle from pileup at the vertex level, a small change in the energy of a hit from pileup should result in no change in the predicted particle energy. In Fig. 3b, each hit in the cluster is colored by the ratio of the change of the predicted particle energy to the amount of perturbation (ΔE_pred/Δh). While some hits in Fig. 3a with f_prim = 0 appear with ΔE_pred/Δh > 0, an overall correlation between f_prim and ΔE_pred/Δh is observed. The occurrence of larger ΔE_pred/Δh for hits associated with the primary particle indicates that the GarNet-based model is learning the structure of the graph.
The latency, II, and resource usage of the FPGA firmware synthesized from the HLS implementations are summarized in Table 1. Vitis Core Development Kit 2019.2 [47] is used for synthesis, with a Xilinx Kintex UltraScale FPGA (part number xcku115-flvb2104-2-i) as the target device and a clock frequency of 200 MHz. The reported resource usage numbers reflect the synthesis estimates from Vivado HLS. The latency and II reported here are the maximum values for samples with the full V_max vertices; the actual HLS implementation allows early termination of the serial reuse of the vertex-processing logic unit for samples with fewer vertices. The area under the ROC curve (AUC) and the overall response root mean square (RMS) are used to summarize the performance.

Table 1: Summary of the latency, II, FPGA resource usage metrics, and inference accuracy metrics of the synthesized firmware. The reported resource usage numbers reflect the synthesis estimates from Vivado HLS. The target FPGA is a Xilinx Kintex UltraScale FPGA (part number xcku115-flvb2104-2-i), which has 5,520 DSPs, 663,360 LUTs, 1,326,720 FFs, and 77.8 Mb of BRAM [48]. The utilized percentages of the targeted FPGA resources are denoted in the square brackets.

Model       V_max  R_reuse  Latency   Interval  DSP (10^3)  LUT (10^3)  FF (10^3)  BRAM (Mb)   ROC AUC  Response RMS
                            (cycles)  (cycles)
Continuous  128    32       155       55        3.1 [56%]   57 [9%]     39 [2.9%]  1.8 [2.3%]  0.98     0.23
Quantized   128    32       148       50        1.6 [29%]   70 [11%]    41 [3.1%]  1.9 [2.4%]  0.98     0.24
Quantized   64     16       99        34        1.6 [29%]   63 [9%]     38 [2.9%]  1.8 [2.3%]  0.96     0.24
Quantized   32     8        75        26        1.4 [25%]   52 [8%]]    33 [2.5%]  1.8 [2.3%]  0.86     0.37
Quantized   16     4        63        22        1.5 [27%]   57 [9%]     37 [2.8%]  1.8 [2.3%]  0.64     0.36

Comparing the continuous and quantized models with V_max = 128, the former has a longer latency and II and consumes substantially more DSPs. On the other hand, the quantized model uses more LUTs, mainly for the multiplications in the GarNet encoders and decoders, as discussed in Section 4.
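Table 1 reports latency and II in clock cycles at the 200 MHz synthesis clock, so converting to wall-clock time is a matter of multiplying by the 5 ns clock period. A minimal sketch of that arithmetic, including the time-multiplexing factor implied by the 25 ns LHC beam-crossing period:

```python
import math

CLOCK_PERIOD_NS = 5.0   # 200 MHz synthesis clock
BX_PERIOD_NS = 25.0     # LHC beam-crossing period

def cycles_to_ns(cycles):
    """Convert a cycle count from Table 1 into wall-clock time."""
    return cycles * CLOCK_PERIOD_NS

# Quantized model with V_max = 128: latency 148 cycles, II 50 cycles
latency_ns = cycles_to_ns(148)  # 740 ns total execution time
ii_ns = cycles_to_ns(50)        # 250 ns between consecutive accepted inputs

# Parallel copies of the logic needed to accept one cluster per beam crossing
multiplex_factor = math.ceil(ii_ns / BX_PERIOD_NS)  # 10
```

The same conversion applies to the reduced-V_max rows of the table; only the cycle counts change.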
However, it is known that the expected LUT usage tends to be overestimated in Vivado HLS, while the expected DSP usage tends to be accurate [8, 2]. The DSP usage of 3.1 × 10³ for the continuous model is well within the limit of the target device, but is more than what is available on a single die slice (2.8 × 10³) [48]. The quantized model fits in one slice in all metrics. Given the small difference in inference performance between the two models, it is clear that the quantized model is advantageous for this specific case study.

The latency of the synthesized quantized model at 148 clock periods, corresponding to 740 ns, satisfies the LHC L1T requirement of O(1) µs execution. However, the II of 50 clock periods (250 ns) implies that the logic must be time-multiplexed tenfold to be able to process a single cluster per LHC beam-crossing period of 25 ns. With O(100) or more clusters expected per beam crossing in the collision environment of the HL-LHC, the throughput of the synthesized firmware is therefore inadequate for a reasonably sized L1T calorimeter system with O(100) FPGAs, and requires down-scoping or implementation improvements.

The simplest down-scoping measure is to reduce the size of the input. This is effective because the most prominent factor driving both the latency and the II of the firmware is R_reuse (see Eq. (10)), which in turn is determined by V_max to be able to fit the logic in a single chip. To test how short the II can be made while retaining a reasonable inference performance, additional models with V_max = 64, 32, and 16 are trained and synthesized into FPGA firmware. Clusters with more hits than V_max are truncated by discarding the lowest-energy hits. The fractions of truncated clusters for the three V_max values are 27%, 85%, and 99%, respectively. The results of the synthesis of the additional models are given in the last three rows of Table 1.
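The cluster truncation just described keeps only the V_max highest-energy hits. A minimal sketch, assuming each hit is a row of a feature array with the energy in a known column (the column index here is an assumption about the feature layout, not taken from the paper):

```python
import numpy as np

def truncate_cluster(hits, v_max, energy_col=3):
    """Keep the v_max highest-energy hits of a cluster, discarding the
    lowest-energy ones, and zero-pad to a fixed v_max rows so that the
    input array shape matches what the firmware expects."""
    hits = np.asarray(hits, dtype=float)
    order = np.argsort(hits[:, energy_col])[::-1]  # descending hit energy
    kept = hits[order[:v_max]]
    padded = np.zeros((v_max, hits.shape[1]))
    padded[: len(kept)] = kept
    return padded
```

Clusters with fewer than V_max hits are unaffected apart from the zero padding; only clusters exceeding V_max lose their lowest-energy hits.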
The values of the FPGA resource usage metrics are similar in all quantized models because the ratio V_max/R_reuse is kept at 4. The area under the ROC curve (AUC) and the root-mean-square (RMS) of the response are considered as metrics for the inference performance. Only a modest degradation of performance is observed by truncating the clusters to V_max = 64, while the II is reduced by 16 clocks as a direct result of the reduction of R_reuse by the same amount. This working point might thus represent a reasonable compromise between inference performance and throughput. Further cluster truncation results in a considerable loss of inference accuracy. It is also clear that reducing R_reuse has a diminishing return in terms of shorter II, and improvements to other parts of the algorithm are necessary to further reduce the II.

In this paper, we presented an implementation of a graph neural network algorithm as FPGA firmware with O(1) µs execution time. General considerations and challenges in implementing graph neural networks for real-time trigger systems at particle collider experiments are outlined, along with how algorithms such as GarNet address these issues. We then described the simplified version of
GarNet, which is now available as a general-purpose graph network layer in the hls4ml library. An example use case of a machine learning model based on the simplified version of GarNet, applied to data from a simulation of a small imaging calorimeter, is presented. The model is able to learn to predict the identity and the energy of the particles detected at the calorimeter with high accuracy, while its firmware implementation executes in 740 ns and fits easily in a commercially available FPGA. Although the throughput of the firmware is not sufficient to make the model readily deployable in a submicrosecond, real-time collider trigger system, its variants with reduced input size are shown to have higher throughput with reasonable inference performance. These results demonstrate that fast inference of graph neural networks on FPGAs is possible, and that with hls4ml, various graph-based machine learning architectures can be automatically translated into firmware.
Acknowledgments
We acknowledge the Fast Machine Learning collective as an open community of multi-domain experts and collaborators. This community was important for the development of this project.

M. P., A. G., K. W., S. S., V. L., and J. N. are supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 772369). S. J., M. L., K. P., and N. T. are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. P. H. is supported by a Massachusetts Institute of Technology University grant. Z. W. is supported by the National Science Foundation under Grants No. 1606321 and 115164.
References

[1] CMS Collaboration, "Boosted decision trees in the Level-1 muon endcap trigger at CMS", J. Phys. Conf. Ser. 1085 (2018) 042042, doi:10.1088/1742-6596/1085/4/042042.
[2] J. Duarte et al., "Fast inference of deep neural networks in FPGAs for particle physics", J. Instrum. 13 (2018) P07027, doi:10.1088/1748-0221/13/07/P07027, arXiv:1804.06913.
[3] F. Chollet et al., "Keras", (2015). https://keras.io.
[4] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library", in Advances in Neural Information Processing Systems 32, H. Wallach et al., eds., p. 8026. Curran Associates, Inc., Red Hook, New York, 2019.
[5] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems", (2015).
[6] J. Bai, F. Lu, K. Zhang et al., "ONNX: Open neural network exchange", (2019). https://github.com/onnx/onnx.
[7] S. Summers et al., "Fast inference of boosted decision trees in FPGAs for particle physics", J. Instrum. 15 (2020) P05026, doi:10.1088/1748-0221/15/05/P05026, arXiv:2002.02534.
[8] V. Loncar et al., "Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml", Mach. Learn.: Sci. Technol. (2020), doi:10.1088/2632-2153/aba042, arXiv:2003.06308.
[9] Google, "QKeras", (2020). https://github.com/google/qkeras.
[10] C. N. Coelho et al., "Ultra low-latency, low-area inference accelerators using heterogeneous deep quantization with QKeras and hls4ml", arXiv:2006.10159.
[11] I. Henrion et al., "Neural message passing for jet physics", in Deep Learning for Physical Sciences Workshop at the 31st Conference on Neural Information Processing Systems. 2017.
[12] M. Abdughani, J. Ren, L. Wu, and J. M. Yang, "Probing stop pair production at the LHC with graph neural networks", J. High Energy Phys. 08 (2019) 055, doi:10.1007/JHEP08(2019)055, arXiv:1807.09088.
[13] IceCube Collaboration, "Graph neural networks for IceCube signal classification", arXiv:1809.06166.
[14] J. Arjona Martínez et al., "Pileup mitigation at the Large Hadron Collider with graph neural networks", Eur. Phys. J. Plus (2019) 333, doi:10.1140/epjp/i2019-12710-3, arXiv:1810.07988.
[15] S. R. Qasim, J. Kieseler, Y. Iiyama, and M. Pierini, "Learning representations of irregular particle-detector geometry with distance-weighted graph networks", Eur. Phys. J. C (2019) 608, doi:10.1140/epjc/s10052-019-7113-9, arXiv:1902.07987.
[16] H. Qu and L. Gouskos, "ParticleNet: Jet tagging via particle clouds", Phys. Rev. D 101 (2020) 056019, doi:10.1103/PhysRevD.101.056019, arXiv:1902.08570.
[17] E. A. Moreno et al., "JEDI-net: a jet identification algorithm based on interaction networks", Eur. Phys. J. C (2020) 58, doi:10.1140/epjc/s10052-020-7608-4, arXiv:1908.05318.
[18] E. A. Moreno et al., "Interaction networks for the identification of boosted H → bb decays", Phys. Rev. D 102 (2020) 012010, doi:10.1103/PhysRevD.102.012010, arXiv:1909.12285.
[19] C. Jin, S.-z. Chen, and H.-H. He, "Classifying the cosmic-ray proton and light groups on the LHAASO-KM2A experiment with the graph neural network", arXiv:1910.07160.
[20] X. Ju et al., "Graph neural networks for particle reconstruction in high energy physics detectors", in Machine Learning and the Physical Sciences Workshop at the 33rd Annual Conference on Neural Information Processing Systems. 2019. arXiv:2003.11603.
[21] E. Bernreuther et al., "Casting a graph net to catch dark showers", arXiv:2006.08639.
[22] J. Shlomi, P. Battaglia, and J.-R. Vlimant, "Graph neural networks in particle physics", arXiv:2007.13681. Submitted to Mach. Learn.: Sci. Technol.
[23] Y. Wang et al., "Dynamic graph CNN for learning on point clouds", ACM Trans. Graph. (2019), doi:10.1145/3326362, arXiv:1801.07829.
[24] J. Kieseler, "Object condensation: one-stage grid-free multi-object reconstruction in physics detectors, graph and image data", (2020). arXiv:2002.03605.
[25] L. Gray, T. Klijnsma, and S. Ghosh, "A dynamic reduction network for point clouds", (2020). arXiv:2003.08013.
[26] CMS Collaboration, "The Phase-2 upgrade of the CMS Level-1 trigger", CMS Technical Design Report CERN-LHCC-2020-004. CMS-TDR-021, CERN, 2020.
[27] P. W. Battaglia et al., "Relational inductive biases, deep learning, and graph networks", (2018). arXiv:1806.01261.
[28] M. Besta et al., "Graph processing on FPGAs: Taxonomy, survey, challenges", (2019). arXiv:1903.06697.
[29] P. Veličković et al., "Graph attention networks", in International Conference on Learning Representations. 2018. arXiv:1710.10903.
[30] D. O'Loughlin et al., "Xilinx Vivado High Level Synthesis: Case studies", p. 352-356. IET, 2014.
[31] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization", in International Conference on Learning Representations. 2017. arXiv:1612.01064.
[32] GEANT4 Collaboration, "Geant4—a simulation toolkit", Nucl. Instrum. Methods Phys. Res. A 506 (2003) 250, doi:10.1016/S0168-9002(03)01368-8.
[33] G. Apollinari et al., eds., "High-Luminosity Large Hadron Collider (HL-LHC): Technical Design Report V. 0.1". CERN Yellow Reports: Monographs. CERN, Geneva, 2017. doi:10.23731/CYRM-2017-004.
[34] CMS Collaboration, "The Phase-2 upgrade of the CMS endcap calorimeter", CMS Technical Design Report CERN-LHCC-2017-023. CMS-TDR-019, 2017.
[35] S. van der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: A structure for efficient numerical computation", Comput. Sci. Eng. 13 (2011), no. 2, 22-30, doi:10.1109/MCSE.2011.37.
[36] The HDF Group, "Hierarchical Data Format, version 5", (1997-2020).
[37] Y. Iiyama and J. Kieseler, "Simulation of an imaging calorimeter to demonstrate GarNet on FPGA", 2020. doi:10.5281/zenodo.3888910.
[38] ALEPH Collaboration, "Performance of the ALEPH detector at LEP", Nucl. Instrum. Methods Phys. Res. A (1995) 481, doi:10.1016/0168-9002(95)00138-7.
[39] CMS Collaboration, "Particle-flow reconstruction and global event description with the CMS detector", J. Instrum. 12 (2017) P10003, doi:10.1088/1748-0221/12/10/P10003, arXiv:1706.04965.
[40] ATLAS Collaboration, "Jet reconstruction and performance using particle flow with the ATLAS Detector", Eur. Phys. J. C (2017) 466, doi:10.1140/epjc/s10052-017-5031-2, arXiv:1703.10485.
[41] A. F. Agarap, "Deep learning using rectified linear units (ReLU)", (2018). arXiv:1803.08375.
[42] S. R. Qasim, J. Kieseler, Y. Iiyama, and M. Pierini, "caloGraphNN", (2019). https://github.com/jkiesele/caloGraphNN.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", (2014). arXiv:1412.6980.
[44] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations", in Advances in Neural Information Processing Systems 28, C. Cortes et al., eds., p. 3123. Curran Associates, Inc., 2015. arXiv:1511.00363.
[45] B. Moons, K. Goetschalckx, N. Van Berckelaer, and M. Verhelst, "Minimum energy quantized neural networks", p. 1921, IEEE. 2017.
[46] S. Zhou et al., "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients", (2016). arXiv:1606.06160.
[47] V. Kathail, "Xilinx Vitis Unified Software Platform", p. 173. ACM, New York, NY, USA, 2020. doi:10.1145/3373087.3375887.
[48] Xilinx, Inc., "UltraScale FPGA product tables and product selection guide", (2020).