Distance-Weighted Graph Neural Networks on FPGAs for Real-Time Particle Reconstruction in High Energy Physics
Yutaro Iiyama, Gianluca Cerminara, Abhijay Gupta, Jan Kieseler, Vladimir Loncar, Maurizio Pierini, Shah Rukh Qasim, Marcel Rieger, Sioni Summers, Gerrit Van Onsem, Kinga Wozniak, Jennifer Ngadiuba, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Dylan Rankin, Sergo Jindariani, Mia Liu, Kevin Pedro, Nhan Tran, Edward Kreinar, Zhenbin Wu
FERMILAB-PUB-20-405-E-SCD
Yutaro Iiyama
International Center for Elementary Particle Physics (ICEPP), University of Tokyo, Tokyo 113-0033, Japan

Gianluca Cerminara, Abhijay Gupta, Jan Kieseler, Vladimir Loncar∗, Maurizio Pierini, Shah Rukh Qasim†, Marcel Rieger, Sioni Summers, Gerrit Van Onsem, Kinga Wozniak‡
European Organization for Nuclear Research (CERN), CH-1211 Geneva 23, Switzerland

Jennifer Ngadiuba
California Institute of Technology, Pasadena, CA 91125, USA

Giuseppe Di Guglielmo
Columbia University, New York, NY 10027, USA

Javier Duarte
University of California San Diego, La Jolla, CA 92093, USA

Philip Harris, Dylan Rankin
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Sergo Jindariani, Mia Liu, Kevin Pedro, Nhan Tran§
Fermi National Accelerator Laboratory, Batavia, IL 60510, USA

Edward Kreinar
HawkEye360, Herndon, VA 20170, USA

Zhenbin Wu
University of Illinois at Chicago, Chicago, IL 60607, USA

August 11, 2020
∗ Also at Institute of Physics Belgrade, Pregrevica 118, Belgrade, Serbia
† Also at Manchester Metropolitan University, Manchester M15 6BH, UK
‡ Also at University of Vienna, 1010 Vienna, Austria
§ Also at Northwestern University, Evanston, IL 60208, USA

Abstract

Graph neural networks have been shown to achieve excellent performance for several crucial tasks in particle physics, such as charged particle tracking, jet tagging, and clustering. An important domain for the application of these networks is the FPGA-based first layer of real-time data filtering at the CERN Large Hadron Collider, which has strict latency and resource constraints. We discuss how to design distance-weighted graph networks that can be executed with a latency of less than 1 µs on an FPGA. To do so, we consider a representative task associated with particle reconstruction and identification in a next-generation calorimeter operating at a particle collider. We use a graph network architecture developed for such purposes, and apply additional simplifications to match the computing constraints of Level-1 trigger systems, including weight quantization. Using the hls4ml library, we convert the compressed models into firmware to be implemented on an FPGA. The performance of the synthesized models is presented both in terms of inference accuracy and resource usage.

Keywords: deep learning · FPGA · fast inference · graph networks · imaging calorimeter

At the CERN Large Hadron Collider (LHC), high-energy physics (HEP) experiments collect signals generated by the particles produced in high-energy proton collisions that occur every 25 ns, when two proton beams cross. The readout from the detectors that capture the particles emerging from the collision is filtered by a real-time processing system, known as the trigger, that discards uninteresting collision events based on a set of predefined algorithms.
The trigger system is structured in two stages: a Level-1 trigger (L1T), implemented with custom electronics on-detector and field-programmable gate arrays (FPGAs), and a high-level trigger (HLT), consisting of a computer farm, possibly including co-processor accelerators such as graphics processing units (GPUs) and FPGAs. Because of asynchronous event processing at the HLT, the accept/reject decision has to be reached with a typical latency of O(100) ms. At the L1T, however, a decision must be taken within a fixed latency of O(1) µs. The main limitations are the synchronous, "hard-deadline" nature of the processing system and the limited size of the memory buffer for the data from each beam crossing.

While HLT algorithms have a complexity comparable to those used offline to produce the final physics results, a typical L1T algorithm consists of simpler rules based on coarser objects in order to satisfy the latency constraint. Consequently, the resolution of quantities computed at the L1T is typically poor compared to offline quantities. Recently, the successful deployment of the first machine learning (ML) L1T algorithm, based on a boosted decision tree (BDT), at the LHC [1] has changed this tendency, raising interest in using ML inference as a fast-to-execute approximation of complex algorithms with good accuracy. This first example consisted of a large, pre-computed table of input and output values implementing a BDT, which raises the question of how to deploy more complex architectures. This question motivated the creation of hls4ml [2], a library designed to facilitate the deployment of ML algorithms on FPGAs.

A typical hls4ml workflow begins with a neural network model that is implemented and trained using Keras [3], PyTorch [4], or TensorFlow [5]. The trained model is passed to hls4ml, directly or through the ONNX [6] interface, and converted to C++ code that can be processed by a high-level synthesis (HLS) compiler to produce FPGA firmware. By design, hls4ml targets low-latency applications. To this end, its design prioritizes all-on-chip implementations of the most common network components. Its functionality has been demonstrated with dense neural networks (DNNs) [2] and extended to also support BDTs [7]. Extensions to convolutional and recurrent neural networks are in development. The library comes with handles to compress the model by quantization, down to binary and ternary precision [8]. Recently, support for QKeras [9] models has been added, in order to allow for quantization-aware training of models [10]. While hls4ml applications go beyond HEP, its development has been driven by the LHC L1T use case.

Graph neural networks (GNNs) are among the complex architectures whose L1T implementations are in high demand, given the growing list of examples showing how well GNNs can deal with tasks related to HEP [11–22]. In fact, while the irregular geometry of a typical HEP detector complicates the use of computer vision techniques such as convolutional neural networks, GNNs can naturally deal with the sparse and irregular nature of HEP data.

In this work, we show how a graph model can be efficiently deployed on FPGAs to perform inference within O(1) µs for HEP-related problems. We consider the distance-weighted architecture GarNet, introduced in Ref. [15], which is designed to keep resource consumption under control by reducing the number of operations as much as possible. It has been demonstrated to perform well for a HEP-related task, namely particle reconstruction in a calorimeter. For these reasons, it represents a good candidate for our purpose. The firmware implementation of GarNet presented in this work has been included in hls4ml, representing the first graph-based algorithm available in the library.

We present a case study of a neural network algorithm based on GarNet, applied to the task of identifying the nature of an incoming particle and simultaneously estimating its energy from the energy deposition patterns in a simulated imaging calorimeter. The inference accuracy of the firmware implementation of the algorithm is compared against its offline counterpart running on processors (CPUs and GPUs). The latency and resource utilization of the translated FPGA firmware are reported, along with a discussion of their implications for real-world usage of similar algorithms.

This paper is structured as follows. In Section 2, we briefly recount related work. Section 3 defines the main problem by outlining the challenges in designing a graph network compatible with L1T latency and resource constraints. Section 4 describes how GarNet addresses these challenges, and introduces a simplified form of the algorithm with a better affinity to a firmware implementation. The case study using a calorimeter simulation is presented in Section 5, with detailed descriptions of the task setup, the model architecture, the training results, and a summary of the FPGA firmware synthesis. Finally, conclusions are given in Section 6.
Graph neural networks are gaining interest in HEP applications, mainly due to their intrinsic value when dealing with sparse input datasets, which are very common in HEP. A recent review of applications of GNNs to HEP problems may be found in Ref. [22]. In particular, dynamic GNNs [15, 23–25] are relevant for particle reconstruction tasks, such as tracking [20] and calorimetry [15].

In order to fully exploit DL-based particle reconstruction, the capability of deploying DL models to the real-time data filtering system of a HEP detector is essential. For the FPGA-based L1T systems of the LHC, automatic tools for network-to-circuit conversion, such as the hls4ml library, are emerging. Using hls4ml, several solutions for HEP-specific tasks (e.g., jet tagging) have been provided [2, 7, 8, 10], exploiting models with simpler architectures than what is shown here. This tool has been applied extensively to tasks in the HL-LHC upgrade of the CMS L1T system, including an autoencoder for anomaly detection, and DNNs for muon energy regression and identification, tau lepton identification, and vector boson fusion event classification [26].
In the framework of Ref. [27], a graph is a triplet ($V$, $E$, $U$), where $V$ is a set of entities (vertices), each possessing some attributes in a fixed format, $E$ is a set of pairwise relations (edges) between the elements of $V$, potentially possessing some additional attributes, and $U$ is a set of global (graph-level) attributes. While a GNN can be any neural network that acts on such graphs, in this work we specifically consider graph networks (GNs) [27], i.e., architectures that consist of repeatable graph-to-graph mapping blocks (GN blocks). Each GN block performs some combination of operations such as edge feature transformation, aggregation of neighbors' features at each vertex, vertex feature transformation, global aggregation of edge and vertex features, and global feature transformation. A GN takes a graph as an input sample, where the cardinality of $V$ may differ from sample to sample, and infers its properties, which may be anything from a global scalar, such as a classification label for the sample, to new edge attributes.

To be usable as a part of an LHC L1T system, an algorithm must execute within $\mathcal{O}(1)$ µs and have the throughput to accept all inputs from each beam crossing, every 25 ns. Time-multiplexing, whereby $N$ copies of the algorithm accept inputs from $N$ different beam crossings, may be used to decrease the throughput requirement by a factor of $N$. Additionally, there is a practical constraint that the firmware implementation should fit in the FPGA resources of the system, i.e., utilize resources such as digital signal processing units (DSPs), look-up tables (LUTs), flip-flops (FFs), and block RAM (BRAM) within the limits of chips available on the market. Satisfying these requirements with a GNN can be challenging for multiple reasons, listed below.

• Model depth: Within each GN block, vertices exchange information with other directly connected vertices or with global attributes. Therefore, to expand the receptive field of each vertex beyond the nearest neighbors, multiple GN blocks must be repeated in the network. Given that the various transformations within each GN block are often themselves multilayer perceptrons (MLPs), GNN models tend to be quite deep. Deep networks go against the latency requirement, as each perceptron layer uses at least one clock cycle on an FPGA under a straightforward implementation, and also against the resource usage requirement, because MLPs make heavy use of multiplications.

• Input size: Typically, for problems where the application of GNNs is interesting, the cardinality of $V$ is at least $\mathcal{O}(10^2)$. Even with the high degree of parallelism of FPGAs, the finiteness of the compute resources means that such large inputs have to be processed serially to a certain extent, increasing the latency and the interval before a new input can be accepted, known as the initiation interval (II). Longer IIs lead to lower throughput.

• Memory usage: Related to the problem of the input size, if the algorithm requires temporary retention of features for all vertices or edges, the memory usage may be prohibitive for an FPGA firmware implementation.

• Memory access pattern: Except for certain cases, algorithms that have both $V$ and $E$ in the input usually require random memory access, for example when reading or writing features of the vertices at the ends of the edges. This poses a challenge in FPGA firmware design, not only because it implies that there must be a large enough memory bank to store all vertex and/or edge data, but also because random memory access is itself a costly operation [28]. The exceptions include when $E$ is trivial ($E = \emptyset$ or the graph is complete) and when all samples have an identical graph topology. In such cases, the memory access pattern of the algorithm is known at compile time and can therefore be statically scheduled in the FPGA firmware.

The case of $E = \emptyset$ is a rather extreme solution to the last challenge, but it is also attractive in terms of memory usage. In fact, even without explicit input edge features, a GNN can infer regional and non-local properties of the graph by globally gathering the vertex features and then scattering the gathered information back to the vertices. This information flow can also be mediated by a learnable attention mechanism [29]. The attention mechanism suppresses information from vertices that are considered unimportant, effectively forming "soft" edges among the unsuppressed vertices.

In the next section, we study a GNN architecture with these exact properties, then discuss the modifications to the architecture that make it suitable for an FPGA firmware implementation.

hls4ml framework

In this work, we consider
GarNet [15] as a specific example of a GNN. A GarNet layer is a GN block that takes as input a set of $V$ vertices, each possessing $F_\mathrm{in}$ features, and returns the same set of vertices with $F_\mathrm{out}$ features. In a GarNet layer, the $F_\mathrm{in}$ features of each vertex are encoded into an internal representation and gathered at $S$ aggregators. A distance parameter between each of the aggregators and the vertices is also computed from the vertex attributes. The information gathered at the aggregators is then sent back to the individual vertices and decoded into $F_\mathrm{out}$ features. Communications between the vertices and aggregators are weighted by a decreasing function of the distance parameter, implementing an attention mechanism that allows the network to learn a dynamic, nontrivial graph structure from the vertex input alone.

The original GarNet algorithm, while already using less compute and memory resource than other similar GNN architectures in Refs. [15, 23], is still challenging to implement as fast, high-throughput FPGA firmware. The biggest problem arises from the use of the input feature vector as a part of the input to the decoder, which requires retention of the input data until the last steps of the algorithm. An immediate consequence of this requirement is a longer II, because the processing of a new sample cannot start while the input data for the current sample are still in use. Furthermore, the input feature vector is already used to compute the distance parameter as well as the internal representation of each vertex, so reusing the input in the decoder creates a complex data flow, restricting the options for pipelining the algorithm. We therefore designed a modified
GarNet algorithm with a simplified processing flow:

• Input transformation (Figs. 1a and 1b): An encoder network converts the features $g_{jv}$ ($j = 1, \ldots, F_\mathrm{in}$) of the $v$th vertex ($v = 1, \ldots, V$) into an internal learned representation vector $f_{iv}$ ($i = 1, \ldots, F_\mathrm{LR}$). In parallel, another network (the distance calculator) also acts on $g_{jv}$ and computes the distance parameters $d_{av}$ ($a = 1, \ldots, S$) between the vertices and the $S$ aggregators. Implicitly, this means that a complete bipartite graph with $VS$ edges is built from $V$ and $S$, where $S$ is the set of aggregators (Fig. 1b). The encoder and distance calculator networks are both single-layer perceptrons with linear activation functions, so one can write them as linear transformations
$$f_{iv} = \sum_{j=1}^{F_\mathrm{in}} w_{ij} g_{jv} + b_{i}, \qquad (1)$$
$$d_{av} = \sum_{j=1}^{F_\mathrm{in}} \alpha_{aj} g_{jv} + \beta_{a}, \qquad (2)$$
where $(w_{ij}, b_{i})$ and $(\alpha_{aj}, \beta_{a})$ are the kernels and biases of the encoder and distance calculator networks, respectively.

• Aggregation (Fig. 1c): The learned representation vectors $f_{iv}$ of the vertices are weighted by a potential function $W_{av} = \exp(-d_{av})$ and averaged across the vertices. In other words, the $i$th averaged feature $h_{ia}$ of aggregator $a$ is written as
$$h_{ia} = \frac{1}{V_{\max}} \sum_{v=1}^{V} W_{av} f_{iv}. \qquad (3)$$
The factor $V_{\max}$ in the denominator is the maximum possible value of the vertex multiplicity $V$ (as $V$ may have a different value for each input sample). Through this normalization by a common factor, the information about the size of the sample (the cardinality of $V$) is effectively encoded into $h_{ia}$.

• Output transformation (Figs. 1d and 1e): The aggregated features are sent back to the vertices using the same weights,
$$\tilde{f}_{iav} = W_{av} h_{ia}, \qquad (4)$$
and then transformed by a single-layer decoder network with a linear activation function into the final output representation $g_{kv}$ ($k = 1, \ldots, F_\mathrm{out}$). With the kernel $u$ and bias $c$ of the decoder, this is written as
$$g_{kv} = \sum_{i=1}^{F_\mathrm{LR}} \sum_{a=1}^{S} u_{kia} \tilde{f}_{iav} + c_{k}. \qquad (5)$$

Figure 1: Processing flow of the modified GarNet algorithm: (a) The input features ($g_{jv}$) of each vertex are processed by a linear network, which returns a new set of features ($f_{iv}$) and the vertex's distance from the $S$ aggregators ($d_{av}$). (b) A graph is built in the learned space, using the $d_{av}$ distances. (c) A message is gathered by each aggregator, as a weighted sum across the vertices of $f_{iv}$, with $W_{av} = \exp(-d_{av})$ as weights. (d) A message from each aggregator ($\tilde{f}_{iav}$) is passed back to each vertex, with the same $W_{av}$ weight. (e) The aggregated outputs of each vertex are given as input to a neural network, which returns the learned representation.

This simplified algorithm differs from the original design in the following ways. First, only the mean over the vertices is computed at the aggregators, whereas the maximum is also used in the original design. In other words, the aggregators in the original design have
$$h^{\max}_{ia} = \max_{v} W_{av} f_{iv} \qquad (6)$$
as an additional set of features. Secondly, as already noted, the input feature vector is not used as a part of the input to the decoder network. In the original GarNet design, the decoder is expressed as
$$g_{kv} = \sum_{i=1}^{F_\mathrm{LR}} \sum_{a=1}^{S} W_{av} \left( u_{kia} h_{ia} + u^{\max}_{kia} h^{\max}_{ia} \right) + \sum_{i=1}^{F_\mathrm{in}} w'_{ki} g_{iv} + c_{k}, \qquad (7)$$
with additional sets of kernel weights $u^{\max}$ and $w'$. Finally, the original design applies a nonlinear (tanh) activation function to the decoder, while the simplified version uses a linear activation. In the specific case considered in the next section, these simplifications result in negligible degradation of the network performance. In the remainder of this paper, this simplified version of the algorithm is referred to as GarNet.

It is worth pointing out that while the
GarNet layer uses only linear activation functions for all of its internal neural networks, it can still learn nonlinear functions through the nonlinearity of the potential function $W_{av}$. On the other hand, having no nonlinear activation functions allows a compact FPGA firmware implementation of the layer, consisting mostly of multiplications and additions. The only substantial computation comes with the exponential function, whose values can be pre-computed with sufficient granularity and stored.

An FPGA firmware implementation of the GarNet layer using Vivado [30] HLS is integrated into the hls4ml library. The HLS source code is written in C++ and is provided as a template, from which an HLS function for a GarNet layer can be instantiated by specifying the configurable parameters such as $S$, $F_\mathrm{LR}$, and $F_\mathrm{out}$. In the following, we provide some noteworthy details of the implementation.

In the HLS source code of GarNet, all quantities appearing in the computation are expressed as either integers or fixed-point numbers with a fractional precision of at least eight bits. In particular, the distance parameter $d_{av}$ is represented with three integer bits, eight fractional bits, and one sign bit. During the layer computation, $d_{av}$ is reinterpreted as a 12-bit unsigned integer, which is used to retrieve the corresponding pre-computed value of $W_{av}$ from a table with 4,096 entries.

The processing flow in Eqs. (1) to (5) is compactified in the hls4ml implementation by exploiting the linearity of the encoder, the average aggregation, and the decoder. Equations (1), (3), and (5) can be combined into
$$g_{kv} = \sum_{a=1}^{S} W_{av} \left( \sum_{j=1}^{F_\mathrm{in}} \tilde{w}_{kja} G_{ja} + \tilde{b}_{ka} L_{a} \right) + c_{k}, \qquad (8)$$
where
$$\tilde{w}_{kja} = \sum_{i=1}^{F_\mathrm{LR}} u_{kia} w_{ij}, \quad \tilde{b}_{ka} = \sum_{i=1}^{F_\mathrm{LR}} u_{kia} b_{i}, \quad G_{ja} = \frac{1}{V_{\max}} \sum_{v=1}^{V} W_{av} g_{jv}, \quad L_{a} = \frac{1}{V_{\max}} \sum_{v=1}^{V} W_{av}. \qquad (9)$$
In particular, the kernel and bias tensors of the encoder and decoder are contracted into $\tilde{w}$ and $\tilde{b}$ at logic synthesis time, resulting in fewer steps to arrive at the output from the input.

With this simplification, the input data from each sample are encoded into $W_{av}$, $G_{ja}$, and $L_{a}$. Therefore, a new sample can be processed as soon as these three quantities from the previous sample are computed. In other words, the II of the overall GarNet layer depends on the number of clock cycles needed to compute the three quantities. Furthermore, $G_{ja}$ and $L_{a}$ can be derived trivially from $W_{av}$, making the latency of the computation of the latter the critical determinant of the throughput of the algorithm.

The computation of $W_{av}$ is performed independently on each vertex, and is therefore parallelizable across the vertices.
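As a concrete check of this contraction, the staged processing flow of Eqs. (1)–(5) and the combined form of Eqs. (8)–(9) can both be sketched in NumPy. This is an illustrative re-implementation from the equations above, not the hls4ml C++ template; the array shapes, the zero-padding convention for unused vertices, and the random inputs are our own choices.

```python
import numpy as np

def garnet_forward(g, n_vert, w, b, alpha, beta, u, c, v_max):
    """Staged forward pass of the simplified GarNet layer, Eqs. (1)-(5).

    g           : (v_max, F_in) zero-padded vertex features; first n_vert rows are real
    w, b        : encoder kernel (F_LR, F_in) and bias (F_LR,)
    alpha, beta : distance-calculator kernel (S, F_in) and bias (S,)
    u, c        : decoder kernel (F_out, F_LR, S) and bias (F_out,)
    """
    g = g.copy()
    g[n_vert:] = 0.0
    f = g @ w.T + b                        # Eq. (1): learned representation f_iv
    d = g @ alpha.T + beta                 # Eq. (2): distance parameters d_av
    W = np.exp(-d)                         # potential function W_av
    W[n_vert:] = 0.0                       # padded vertices do not participate
    h = W.T @ f / v_max                    # Eq. (3): aggregator features h_ia
    f_tilde = W[:, None, :] * h.T[None, :, :]        # Eq. (4): f~_iav, (v, i, a)
    return np.einsum('kia,via->vk', u, f_tilde) + c  # Eq. (5): output g_kv

def garnet_forward_contracted(g, n_vert, w, b, alpha, beta, u, c, v_max):
    """Equivalent single-step form of Eqs. (8)-(9), with pre-contracted kernels."""
    g = g.copy()
    g[n_vert:] = 0.0
    W = np.exp(-(g @ alpha.T + beta))
    W[n_vert:] = 0.0
    w_tilde = np.einsum('kia,ij->kja', u, w)   # contracted kernel, Eq. (9)
    b_tilde = np.einsum('kia,i->ka', u, b)     # contracted bias, Eq. (9)
    G = g.T @ W / v_max                        # G_ja, shape (F_in, S)
    L = W.sum(axis=0) / v_max                  # L_a, shape (S,)
    inner = np.einsum('kja,ja->ka', w_tilde, G) + b_tilde * L
    return W @ inner.T + c                     # Eq. (8): output g_kv

rng = np.random.default_rng(0)
V_MAX, F_IN, F_LR, S, F_OUT, N_VERT = 16, 4, 8, 4, 8, 11
g = rng.normal(size=(V_MAX, F_IN))
params = dict(w=rng.normal(size=(F_LR, F_IN)), b=rng.normal(size=F_LR),
              alpha=rng.normal(size=(S, F_IN)), beta=rng.normal(size=S),
              u=rng.normal(size=(F_OUT, F_LR, S)), c=rng.normal(size=F_OUT))
out_staged = garnet_forward(g, N_VERT, v_max=V_MAX, **params)
out_fast = garnet_forward_contracted(g, N_VERT, v_max=V_MAX, **params)
assert np.allclose(out_staged, out_fast)
```

Running the two functions on the same random inputs confirms that the contracted form reproduces the staged computation exactly, which is what allows the kernel contraction to be performed once at logic synthesis time.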
In a fully parallelized implementation, there would be $V_{\max}$ logic units (one unit per vertex) operated simultaneously. However, with $V$ typically as large as $\mathcal{O}(10^2)$ or greater, this configuration would consume too much of the FPGA resources and would not fit on a single chip. Therefore, the hls4ml implementation of GarNet allows a partial parallelization of the algorithm, controlled by a parameter called the reuse factor ($R_\mathrm{reuse}$). For $R_\mathrm{reuse} > 1$, the logic unit to compute $W_{av}$ is cloned $V_{\max}/R_\mathrm{reuse}$ times, such that each unit is reused serially up to $R_\mathrm{reuse}$ times. This serial reuse is fully pipelined with a local II of one clock cycle. The latency $T_W$ for computing $W_{av}$ for all vertices is therefore given by
$$T_W = T_{W,0} + R_\mathrm{reuse}, \qquad (10)$$
where $T_{W,0} \sim 20$ is the number of clock cycles needed to compute $W_{av}$ for one vertex. The value of $T_{W,0}$ depends on the numerical precision of the fixed-point numbers used in the computation.

Finally, the kernel and bias of the encoder and the kernel of the decoder can be quantized, such that each element takes only the values $-1$, $0$, or $1$ (ternary quantization) [31]. In the quantized version of the algorithm, the contracted kernel and bias $\tilde{w}$ and $\tilde{b}$ have elements that are $\mathcal{O}(1)$ integers. Multiplication of small integers with fixed-point numbers can be performed in FPGAs using LUTs rather than DSPs, which are usually the scarcer resource. Multiplications with LUTs also proceed faster than those with DSPs.

As a case study, the hls4ml implementation of
GarNet is applied to a representative task for the LHC L1T, namely reconstructing electrons and pions in a simulated 3D imaging calorimeter. In the following, we first describe the dataset used for the study, then define the task and the architectures of the ML models, and finally present the inference performance of the models and the resource usage of the synthesized firmware.
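The ternary quantization described in the previous section restricts each encoder and decoder weight to $\{-1, 0, 1\}$. A minimal NumPy sketch of the rounding step is given below; the snap-to-zero threshold of 0.5 is our assumption for illustration, and in the paper the quantization is actually handled by QKeras during training rather than by post-hoc rounding.

```python
import numpy as np

def ternarize(x, threshold=0.5):
    """Snap each weight to -1, 0, or +1.

    The 0.5 snap-to-zero threshold is an assumption for illustration;
    QKeras configures its own ternarization during quantization-aware training.
    """
    q = np.zeros_like(x)
    q[x > threshold] = 1.0
    q[x < -threshold] = -1.0
    return q

w = np.array([[0.8, -0.2, 0.1], [-1.3, 0.6, -0.4]])
w_q = ternarize(w)
# w_q == [[1, 0, 0], [-1, 1, 0]]: multiplying by these values reduces to
# sign flips and additions, which map onto LUTs instead of DSPs
```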
The calorimeter is a multi-layered full-absorption detector with a geometry similar to the one described in Ref. [15]. The detector is made entirely of tungsten, which serves as both absorber and sensitive material, and no noise or threshold effects in the readout electronics are simulated. While this homogeneous calorimeter design is not a faithful representation of a modern sampling calorimeter, the simplification allows us to evaluate the performance of the ML models decoupled from detector effects.

The calorimeter extends 36 cm in $x$ and $y$ and has a total depth in $z$ of 2 m, corresponding to approximately 20 nuclear interaction lengths and 170 radiation lengths. The coordinate origin is placed at the center of the front face of the calorimeter. The calorimeter is segmented into 50 layers along $z$, with each layer divided into small square cells in the $x$-$y$ plane, forming a three-dimensional imaging detector. Cells are oriented so that their sides are parallel to the $x$ and $y$ axes. The tiling of the cells in each layer is uniform except in one quadrant, where the cell sides are half as long as those in the other area. The aim of this tiling is to incorporate the irregularity of the geometry of a real-life particle physics calorimeter. The quadrant with smaller cells and the remainder of the layer are called the high granularity (HG) and low granularity (LG) regions, respectively. The first 25 layers in $z$ correspond to the electromagnetic calorimeter, with a layer thickness of 1 cm and cell dimensions of 2.25 cm × […]. […] Geant4 [32].

Each event used in this study contains a high-energy primary particle and low-energy pileup particles, which represent backgrounds from simultaneous additional proton-proton interactions. The primary particle is either an electron (e−) or a charged pion (π±), shot at the calorimeter with momentum aligned along the $z$ axis, i.e., perpendicular to the front face of the calorimeter. The $x$ and $y$ coordinates of the particle's origin are randomly sampled from a uniform distribution in a 10 cm × 10 cm region centered at $x = y = 0$. With this procedure, we aim to mimic a realistic situation in which the actual calorimeter extends over a much larger surface, and the area covered by the geometry used in this study represents a portion of it. The value of the particle momentum is drawn randomly for each event from a uniform distribution between 10 GeV and 100 GeV. The pileup particles consist of photons (γ) and π±. The number of pileup particles is randomly sampled from a Poisson distribution with a mean of 40, with the π± multiplicity fixed to twice the γ multiplicity. This setup approximates the flux of pileup particles expected at a pseudorapidity η = 2 in a Δη × Δφ = 0.[…] × 0.[…] […] µ = 0.[…] c = 0.[…] the $z$ direction. The radius of the cylinder is set at 6.4 cm, so that the resulting cluster contains 95% of the energy of the primary particle in 50% of the pion events. Because electromagnetic showers generally have a narrower energy spread than hadronic showers, all of the electron events have at least 95% of the energy contained in the same cylinder. Typical events, with momenta of the primary particles around 50 GeV and total pileup energy close to the median of the distribution, are shown in Fig. 3a. The hits in the figure are colored by the fraction of the hit energy due to the primary particle (the primary fraction, $f_\mathrm{prim}$) to aid visualization.

The dataset used in this study thus contains one cluster per sample, given as an array of hits in the cluster, and one integer indicating the number of hits in the sample. Only hits with an energy greater than 120 MeV are considered. Each cluster contains at most 128 hits, sorted by hit energy in decreasing order. Note that the sorting of the hits has no effect on the neural network, and is only relevant when truncating the list of hits to consider smaller clusters, as explored later. In fact, 0.2% of the events resulted in clusters with more than 128 hits, for which the lowest energy hits were discarded from the dataset.
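The energy ordering and truncation described above can be sketched as follows, assuming each hit is stored as a row of (x, y, z, energy); this array layout is our illustrative choice, not the published file format.

```python
import numpy as np

MAX_HITS = 128

def prepare_cluster(hits):
    """Sort hits by energy (descending) and keep at most MAX_HITS.

    hits: (N, 4) array with columns x, y, z, energy (energy in column 3).
    Returns the truncated hit array and the remaining hit count.
    """
    order = np.argsort(hits[:, 3])[::-1]   # descending in energy
    hits = hits[order][:MAX_HITS]          # lowest-energy hits are dropped
    return hits, len(hits)

rng = np.random.default_rng(1)
raw = rng.uniform(size=(150, 4))           # a toy cluster with 150 hits
cluster, n_hits = prepare_cluster(raw)
assert n_hits == 128
assert cluster[0, 3] == raw[:, 3].max()    # highest-energy hit comes first
```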
Each hit is represented by four numbers, corresponding to the hit coordinates, given in $x$, $y$, and $z$, and the hit energy. The $x$ and $y$ coordinates are relative to the seed cell. The dataset consists of 500,000 samples, split evenly and randomly into e− and π± events, stored as NumPy [35] arrays in HDF5 format [36]. The dataset, together with the ground truth information, is available on the Zenodo platform [37].
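Because the network input has the fixed shape $V_{\max} \times F_\mathrm{in}$ while the hit count varies per cluster, each sample can be zero-padded to $V_{\max}$ rows and stored together with its integer hit count, as in the dataset description above. The sketch below illustrates that convention; the exact array names and dtypes are our assumptions, not the published dataset layout.

```python
import numpy as np

V_MAX, F_IN = 128, 4   # at most 128 hits, four features (x, y, z, energy)

def pad_cluster(hits):
    """Zero-pad a (n_hits, 4) hit array to (V_MAX, F_IN) and return the count.

    The integer count tells the network (and the firmware) how many rows
    are real vertices; the padded rows carry no information.
    """
    n = min(len(hits), V_MAX)
    padded = np.zeros((V_MAX, F_IN), dtype=np.float32)
    padded[:n] = hits[:n]
    return padded, n

hits = np.random.default_rng(2).uniform(size=(57, 4)).astype(np.float32)
x, n = pad_cluster(hits)
assert x.shape == (V_MAX, F_IN) and n == 57
assert np.all(x[n:] == 0.0)
```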
[Figure 3, panel titles: (a) "Electron 49.2 (48.0) GeV, Pileup 64.9 (21.4) GeV" and "Pion 50.9 (33.1) GeV, Pileup 59.7 (17.2) GeV"; (b) the same events, colored by ΔE_pred/Δh; axes x, y, z in cm; legend: Clustered / Unclustered.]
Figure 3: Examples of electron (left) and pion (right) events. Values in parentheses in the graph titles are the respective energy depositions contained in the cluster around the seed hit. Points represent hits in the detector, with their coordinates at the center of the corresponding detector cells and the size of the markers proportional to the square root of the hit energy. Opaque points are within the cluster, while translucent ones are not. In (a), the point color scale from blue to red corresponds to the primary fraction (see Section 5.1 for the definition). In (b), the color scale from blue to green corresponds to $\Delta E_\mathrm{pred}/\Delta h$, an indication of the importance the neural network model places on individual hits for energy regression. See Section 5.3 for details.

The task in this study is to identify the nature of the primary particle and to simultaneously predict its energy, given the hits in the cluster. The ability to reliably identify the particle type and estimate its energy at the cluster level in a local calorimeter trigger system greatly enhances the efficacy of high-level algorithms downstream in the L1T system, such as particle-flow reconstruction [38–40]. However, because of the distortion of the energy deposition pattern in the cluster due to pileup, particle identification based on collective properties of the hits, such as the depth of the energy center of mass, can achieve only modest accuracy. Furthermore, only half of the pion events have 95% of the energy deposition from the pion contained in the cluster, requiring substantial extrapolation in the energy prediction. The task is thus both practically relevant and sufficiently nontrivial as a test bench of a
GarNet-based ML model.

The architecture of the model is as follows. First, the input data, represented by a two-dimensional array of $V_{\max} \times F_\mathrm{in}$ numbers per cluster, are processed by a stack of three GarNet layers. The parameters ($S$, $F_\mathrm{LR}$, $F_\mathrm{out}$) are (4, […], 8) for the first two layers and (8, […], 16) for the last layer. The output of the last GarNet layer is averaged across the vertices for each of the 16 features. The resulting array of 16 numbers is then passed through two fully connected layers with 16 and 8 nodes and ReLU [41] activation. The data flow is split into two branches in the final step. The first branch consists of a fully connected layer with a single node, whose output is activated by a sigmoid function and is interpreted as the classification prediction, i.e., the predicted probability that the primary particle is an electron. The other branch also consists of a single-node fully connected layer, but with a linear activation of the output, which is interpreted as the predicted value of the energy of the particle.

This model is built in Keras [3], using the implementation of GarNet available in Ref. [42]. In total, the model has 3,402 trainable parameters (2,976 in the three GarNet layers), whose values are optimized through a supervised training process using the Adam optimizer [43]. The input is processed in batches of 64 samples during training. The overall objective function minimized in the training is a weighted sum of objective functions for the classification and regression tasks:
$$L = \beta L_\mathrm{class} + (1 - \beta) L_\mathrm{reg}, \qquad (11)$$
with $\beta = 0.01$. The objective function for classification, $L_\mathrm{class}$, is the binary cross entropy in each batch between the truth labels (electrons are represented by 1 and pions by 0) and the classification output of the model. The objective function for regression, $L_\mathrm{reg}$, is the batch mean of the relative squared error
$$L_\mathrm{reg} = \left[ (E_\mathrm{pred} - E_\mathrm{true})/E_\mathrm{true} \right]^{2}, \qquad (12)$$
where $E_\mathrm{pred}$ and $E_\mathrm{true}$ are the predicted and true energies of the primary particle, respectively. The training is performed on 400,000 training and 100,000 validation samples over a few hundred epochs, with early stopping when the value of the objective function does not improve for ten consecutive epochs. Keeping the full training dataset in RAM and using two NVIDIA GeForce RTX 2080 Ti GPUs in parallel, each epoch takes roughly 30 seconds to process.

Additionally, we prepare a model in which the encoders and decoders of the GarNet layers are quantized as ternary networks using QKeras [9, 10], which performs quantization-aware training with the straight-through estimator, quantizing the layers during the forward pass but not the backward pass [44–46, 10]. In the following, this model is referred to as the quantized model, and the original model as the continuous model. The quantized model is trained with the same objective function and training hyperparameters as the continuous model.

To evaluate the inference performance of the trained models, reference algorithms are defined separately for the classification and regression subtasks. The reference algorithm for classification (cut-based classification) computes the energy-weighted mean $\bar{z}$ and standard deviation $\sigma_z$ of the $z$ coordinates of the hits,
$$\bar{z} = \frac{\sum_{i=1}^{V} z_{i} h_{i}}{\sum_{i=1}^{V} h_{i}} \quad \text{and} \quad \sigma_{z} = \sqrt{\frac{\sum_{i=1}^{V} (z_{i} - \bar{z})^{2} h_{i}}{\sum_{i=1}^{V} h_{i}}}, \qquad (13)$$
where $i$ is the index of the hits in the cluster and $z_{i}$ and $h_{i}$ are the $z$ coordinate and energy of the $i$th hit. The cluster is labeled as an electron if $\bar{z} < \bar{z}^\mathrm{cut}$ and $\sigma_{z} < \sigma_{z}^\mathrm{cut}$, where $\bar{z}^\mathrm{cut}$ and $\sigma_{z}^\mathrm{cut}$ are predefined thresholds. Pions, and hadrons in general, tend to penetrate deeper into an absorbing detector and create showers of secondary particles with a larger transverse size than electrons and photons. For regression, the reference algorithm (weight-based regression) predicts the energy of the primary particle through the formula
$$E_\mathrm{pred}^\mathrm{ref} = \sum_{i=1}^{V} w_{l(i)} \left( h_{i} + b_{l(i)} \right), \qquad (14)$$
where $l(i)$ is the detector $z$ layer of hit $i$. Parameters $\{w_{l}, b_{l}\}$ ($l = 1, \ldots,$
QKeras [9, 10], which performs quantization-aware training with the straight-through estimator by quantizing the layers during a forward pass but not a backward pass [44-46, 10]. In the following, this model is referred to as the quantized model, and the original model as the continuous model. The quantized model is trained with the same objective function and training hyperparameters as the continuous model.

To evaluate the inference performance of the trained models, reference algorithms are defined separately for the classification and regression subtasks. The reference algorithm for classification (cut-based classification) computes the energy-weighted mean z̄ and standard deviation σ_z of the z coordinates of the hits,

    z̄ = Σ_{i=1}^{V} z_i h_i / Σ_{i=1}^{V} h_i   and   σ_z = √( Σ_{i=1}^{V} (z_i − z̄)² h_i / Σ_{i=1}^{V} h_i ),    (13)

where i is the index of hits in the cluster and z_i and h_i are the z coordinate and energy of the i-th hit. The cluster is labeled as an electron if z̄ < z̄^cut and σ_z < σ_z^cut, where z̄^cut and σ_z^cut are predefined thresholds. Pions, and hadrons in general, tend to penetrate deeper into an absorbing detector and create showers of secondary particles with a larger transverse size than electrons and photons. For regression, the reference algorithm (weight-based regression) predicts the energy of the primary particle through the formula

    E_pred^ref = Σ_{i=1}^{V} w_{l(i)} (h_i + b_{l(i)}),    (14)

where l(i) is the detector z layer of hit i. Parameters {w_l, b_l} (l = 1, . . . ,
50) are determined by minimizing L_reg over the training dataset, using E_pred^ref as the predicted energy. Particle identification based on the energy deposition profile of the cluster and energy estimation based on a weighted sum of hit energies are both common strategies in conventional, non-ML-based event reconstruction approaches.

Performance of the trained continuous and quantized models, evaluated using the validation sample, is shown in Fig. 4. For each ML model, the inference results based on the original
Keras model and the HLS model, converted using hls4ml, are shown. The HLS model provides a realistic emulation of the synthesized FPGA firmware.

The classification performance is given in terms of receiver operating characteristic (ROC) curves that trace the electron identification efficiency (true positive fraction) and pion rejection efficiency (true negative fraction) for different thresholds of the classifiers. The two
GarNet-based models perform similarly and better than the cut-based reference in terms of the electron identification efficiency for a given pion rejection efficiency. A detailed comparison of the four sets of results from the GarNet-based models in the inset reveals that the continuous model performs slightly better than the quantized model, and that the difference between the Keras and HLS implementations is smaller for the quantized model.

The regression performance is given in terms of the response (E_pred/E_true). Distributions of the response are summarized in 10 GeV bins of E_true, separately for the continuous model, the quantized model, and the weight-based reference. In each summary, the horizontal line in the box corresponds to the median of the distribution, the top and bottom of the box to the upper and lower quartiles, and the upper and lower ends of the whiskers to the 95th and 5th percentiles. The GarNet-based models exhibit narrower spreads of the response distributions in most of the bins, with the continuous model again performing slightly better than the quantized model.
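The cut-based classification of Eq. (13) and the weight-based regression of Eq. (14) used as references above are simple enough to sketch directly. The following is a minimal NumPy sketch; the threshold values and per-layer parameters passed in are illustrative placeholders, not the ones fitted in this study.

```python
import numpy as np

def cut_based_classify(z, h, z_bar_cut, sigma_z_cut):
    """Cut-based classification (Eq. (13)): label a cluster as an electron if
    the energy-weighted mean and standard deviation of the hit z coordinates
    are both below predefined thresholds."""
    z = np.asarray(z, dtype=float)
    h = np.asarray(h, dtype=float)
    z_bar = np.sum(z * h) / np.sum(h)
    sigma_z = np.sqrt(np.sum((z - z_bar) ** 2 * h) / np.sum(h))
    return bool(z_bar < z_bar_cut and sigma_z < sigma_z_cut)

def weight_based_regress(h, layer, w, b):
    """Weight-based regression (Eq. (14)): per-layer affine correction of hit
    energies, summed over hits; w[l] and b[l] are fitted per detector layer."""
    h = np.asarray(h, dtype=float)
    w = np.asarray(w, dtype=float)
    b = np.asarray(b, dtype=float)
    layer = np.asarray(layer)
    return float(np.sum(w[layer] * (h + b[layer])))
```

Both functions operate on one cluster at a time; in the reference algorithms the thresholds and the {w_l, b_l} parameters are optimized once on the training dataset and then held fixed.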
Figure 4: Classification (left) and regression (right) inference performance of the continuous and quantized GarNet-based models and the reference algorithms. Results from the Keras and HLS implementations are shown for the GarNet-based models. The classification performance is quantified with a ROC curve of electron identification efficiency versus pion rejection efficiency. The inset in the left graph shows a close-up view of the efficiency range 0.90-0.96 for both axes. The regression performance is quantified as the response (E_pred/E_true) in 10 GeV bins of E_true. The horizontal line in the box corresponds to the median of the distribution, the top and bottom of the box to the upper and lower quartiles, and the upper and lower ends of the whiskers to the 95th and 5th percentiles.

The differences between the Keras and HLS implementations are due to the numerical precision in the computation. While the former represents all fractional numbers as 32-bit floating-point numbers, the latter employs fixed-point numbers with bit widths of at most 18. Consequently, for the quantized model, where the encoder and decoder of the
GarNet layers employ integer weights for inference, the difference between the two implementations is smaller.

For both subtasks, the GarNet-based models generally outperform the reference algorithms. The reference algorithm has a narrower spread of the response in some energy bins for the regression subtask. However, it is important to note that the weights and biases appearing in Eq. (14) are optimized for a specific pileup profile, while in a real particle collider environment the pileup flux changes dynamically, even on the timescale of a few hours. In contrast, algorithms based on inference of properties of individual hits, such as the GarNet-based models presented in this study, are expected to be able to identify hits due to pileup even under different pileup environments and thus to have a stable inference performance with respect to changes in pileup flux. Since a detailed evaluation of the application-specific performance of GarNet is not within the scope of this work, we leave this and other possible improvements to the model architecture and training to future studies.

To verify that GarNet can infer relations between individual vertices without edges E in the input, the following test is performed. Using the two events shown in Fig. 3, the energy of each hit in the clusters is increased one at a time by 10%, and the inference with the continuous model is performed for each perturbed event. If the model has learned to perfectly distinguish the primary particle from pileup at the vertex level, a small change in the energy of a hit from pileup should result in no change in the predicted particle energy. In Fig. 3b, each hit in the cluster is colored by the ratio of the change of the predicted particle energy to the amount of perturbation (ΔE_pred/Δh). While some hits in Fig. 3a with f_prim = 0 appear with ΔE_pred/Δh > 0, an overall correlation between f_prim and ΔE_pred/Δh is observed. The occurrence of larger ΔE_pred/Δh for hits associated with the primary particle indicates that the GarNet-based model is learning the structure of the graph.
The latency, II, and resource usage of the FPGA firmware synthesized from the HLS implementations are summarized in Table 1. Vitis Core Development Kit 2019.2 [47] is used for synthesis, with a Xilinx Kintex UltraScale FPGA (part number xcku115-flvb2104-2-i) as the target device and a clock frequency of 200 MHz. The reported resource usage numbers reflect the synthesis estimates from Vivado HLS. The latency and II reported here are the maximum values for samples with the full V_max vertices; the actual HLS implementation allows early termination of the serial reuse of the vertex-processing logic unit for samples with fewer vertices. The area under the ROC curve (AUC) and the overall response root mean square (RMS) are used to summarize the performance.

Table 1: Summary of the latency, II, FPGA resource usage metrics, and inference accuracy metrics of the synthesized firmware. The reported resource usage numbers reflect the synthesis estimates from Vivado HLS. The target FPGA is a Xilinx Kintex UltraScale FPGA (part number xcku115-flvb2104-2-i), which has 5,520 DSPs, 663,360 LUTs, 1,326,720 FFs, and 77.8 Mb of BRAM [48]. The utilized percentages of the targeted FPGA resources are denoted in the square brackets.

Model       V_max  R_reuse  Latency   Interval  DSP (10^3)  LUT (10^3)  FF (10^3)  BRAM (Mb)   ROC AUC  Response RMS
                            (cycles)  (cycles)
Continuous  128    32       155       55        3.1 [56%]   57 [9%]     39 [2.9%]  1.8 [2.3%]  0.98     0.23
Quantized   128    32       148       50        1.6 [29%]   70 [11%]    41 [3.1%]  1.9 [2.4%]  0.98     0.24
Quantized   64     16       99        34        1.6 [29%]   63 [9%]     38 [2.9%]  1.8 [2.3%]  0.96     0.24
Quantized   32     8        75        26        1.4 [25%]   52 [8%]]    33 [2.5%]  1.8 [2.3%]  0.86     0.37
Quantized   16     4        63        22        1.5 [27%]   57 [9%]     37 [2.8%]  1.8 [2.3%]  0.64     0.36

Comparing the continuous and quantized models with V_max = 128, the former has a longer latency and II and consumes substantially more DSPs. On the other hand, the quantized model uses more LUTs, mainly for the multiplications in the GarNet encoders and decoders, as discussed in Section 4.
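Table 1 reports latency and II in clock cycles at the 200 MHz synthesis clock, so converting to wall-clock time is a matter of multiplying by the 5 ns clock period. A minimal sketch of that arithmetic, including the time-multiplexing factor implied by the 25 ns LHC beam-crossing period:

```python
import math

CLOCK_PERIOD_NS = 5.0   # 200 MHz synthesis clock
BX_PERIOD_NS = 25.0     # LHC beam-crossing period

def cycles_to_ns(cycles):
    """Convert a cycle count from Table 1 into wall-clock time."""
    return cycles * CLOCK_PERIOD_NS

# Quantized model with V_max = 128: latency 148 cycles, II 50 cycles
latency_ns = cycles_to_ns(148)  # 740 ns total execution time
ii_ns = cycles_to_ns(50)        # 250 ns between consecutive accepted inputs

# Parallel copies of the logic needed to accept one cluster per beam crossing
multiplex_factor = math.ceil(ii_ns / BX_PERIOD_NS)  # 10
```

The same conversion applies to the reduced-V_max rows of the table; only the cycle counts change.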
However, it is known that the expected LUT usage tends to be overestimated in Vivado HLS, while the expected DSP usage tends to be accurate [8, 2]. The DSP usage of 3.1 × 10³ for the continuous model is well within the limit of the target device, but is more than what is available on a single die slice (2.8 × 10³) [48]. The quantized model fits in one slice in all metrics. Given the small difference in inference performance between the two models, it is clear that the quantized model is advantageous for this specific case study.

The latency of the synthesized quantized model at 148 clock periods, corresponding to 740 ns, satisfies the LHC L1T requirement of O(1) µs execution. However, the II of 50 clock periods (250 ns) implies that the logic must be time-multiplexed tenfold to be able to process a single cluster per LHC beam-crossing period of 25 ns. With O(100) or more clusters expected per beam crossing in the collision environment of the HL-LHC, the throughput of the synthesized firmware is therefore inadequate for a reasonably sized L1T calorimeter system with O(100) FPGAs, and requires down-scoping or implementation improvements.

The simplest down-scoping measure is to reduce the size of the input. This is effective because the most prominent factor driving both the latency and the II of the firmware is R_reuse (see Eq. (10)), which in turn is determined by V_max to be able to fit the logic in a single chip. To test how short the II can be made while retaining a reasonable inference performance, additional models with V_max = 64, 32, and 16 are trained and synthesized into FPGA firmware. Clusters with more hits than V_max are truncated by discarding the lowest-energy hits. The fractions of truncated clusters for the three V_max values are 27%, 85%, and 99%, respectively. The results of the synthesis of the additional models are given in the last three rows of Table 1.
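The cluster truncation just described keeps only the V_max highest-energy hits. A minimal sketch, assuming each hit is a row of a feature array with the energy in a known column (the column index here is an assumption about the feature layout, not taken from the paper):

```python
import numpy as np

def truncate_cluster(hits, v_max, energy_col=3):
    """Keep the v_max highest-energy hits of a cluster, discarding the
    lowest-energy ones, and zero-pad to a fixed v_max rows so that the
    input array shape matches what the firmware expects."""
    hits = np.asarray(hits, dtype=float)
    order = np.argsort(hits[:, energy_col])[::-1]  # descending hit energy
    kept = hits[order[:v_max]]
    padded = np.zeros((v_max, hits.shape[1]))
    padded[: len(kept)] = kept
    return padded
```

Clusters with fewer than V_max hits are unaffected apart from the zero padding; only clusters exceeding V_max lose their lowest-energy hits.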
The values of the FPGA resource usage metrics are similar in all quantized models because the ratio V_max/R_reuse is kept at 4. The area under the ROC curve (AUC) and the root-mean-square (RMS) of the response are considered as metrics for the inference performance. Only a modest degradation of performance is observed by truncating the clusters to V_max = 64, while the II is reduced by 16 clocks as a direct result of the reduction of R_reuse by the same amount. This working point might thus represent a reasonable compromise between inference performance and throughput. Further cluster truncation results in a considerable loss of inference accuracy. It is also clear that reducing R_reuse has a diminishing return in terms of shorter II, and improvements to other parts of the algorithm are necessary to further reduce the II.

In this paper, we presented an implementation of a graph neural network algorithm as FPGA firmware with O(1) µs execution time. General considerations and challenges in implementing graph neural networks for real-time trigger systems at particle collider experiments are outlined, along with how algorithms such as GarNet address these issues. We then described the simplified version of
GarNet, which is now available as a general-purpose graph network layer in the hls4ml library. An example use case of a machine learning model based on the simplified version of GarNet, applied to data from a simulation of a small imaging calorimeter, is presented. The model is able to learn to predict the identity and the energy of the particles detected at the calorimeter with high accuracy, while its firmware implementation executes in 740 ns and fits easily in a commercially available FPGA. Although the throughput of the firmware is not sufficient to make the model readily deployable in a submicrosecond, real-time collider trigger system, its variants with reduced input size are shown to have higher throughput with reasonable inference performance. These results demonstrate that fast inference of graph neural networks on FPGAs is possible, and that with hls4ml, various graph-based machine learning architectures can be automatically translated into firmware.
Acknowledgments
We acknowledge the Fast Machine Learning collective as an open community of multi-domain experts and collaborators. This community was important for the development of this project.

M. P., A. G., K. W., S. S., V. L., and J. N. are supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 772369). S. J., M. L., K. P., and N. T. are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. P. H. is supported by a Massachusetts Institute of Technology University grant. Z. W. is supported by the National Science Foundation under Grants No. 1606321 and 115164.
References

[1] CMS Collaboration, "Boosted decision trees in the Level-1 muon endcap trigger at CMS", J. Phys. Conf. Ser. 1085 (2018) 042042, doi:10.1088/1742-6596/1085/4/042042.
[2] J. Duarte et al., "Fast inference of deep neural networks in FPGAs for particle physics", J. Instrum. 13 (2018) P07027, doi:10.1088/1748-0221/13/07/P07027, arXiv:1804.06913.
[3] F. Chollet et al., "Keras", (2015). https://keras.io.
[4] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library", in Advances in Neural Information Processing Systems 32, H. Wallach et al., eds., p. 8026. Curran Associates, Inc., Red Hook, New York, 2019.
[5] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems", (2015).
[6] J. Bai, F. Lu, K. Zhang et al., "ONNX: Open neural network exchange", (2019). https://github.com/onnx/onnx.
[7] S. Summers et al., "Fast inference of boosted decision trees in FPGAs for particle physics", J. Instrum. 15 (2020) P05026, doi:10.1088/1748-0221/15/05/P05026, arXiv:2002.02534.
[8] V. Loncar et al., "Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml", Mach. Learn.: Sci. Technol. (2020), doi:10.1088/2632-2153/aba042, arXiv:2003.06308.
[9] Google, "QKeras", (2020). https://github.com/google/qkeras.
[10] C. N. Coelho et al., "Ultra low-latency, low-area inference accelerators using heterogeneous deep quantization with QKeras and hls4ml", arXiv:2006.10159.
[11] I. Henrion et al., "Neural message passing for jet physics", in Deep Learning for Physical Sciences Workshop at the 31st Conference on Neural Information Processing Systems. 2017.
[12] M. Abdughani, J. Ren, L. Wu, and J. M. Yang, "Probing stop pair production at the LHC with graph neural networks", J. High Energy Phys. 08 (2019) 055, doi:10.1007/JHEP08(2019)055, arXiv:1807.09088.
[13] IceCube Collaboration, "Graph neural networks for IceCube signal classification", arXiv:1809.06166.
[14] J. Arjona Martínez et al., "Pileup mitigation at the Large Hadron Collider with graph neural networks", Eur. Phys. J. Plus (2019) 333, doi:10.1140/epjp/i2019-12710-3, arXiv:1810.07988.
[15] S. R. Qasim, J. Kieseler, Y. Iiyama, and M. Pierini, "Learning representations of irregular particle-detector geometry with distance-weighted graph networks", Eur. Phys. J. C (2019) 608, doi:10.1140/epjc/s10052-019-7113-9, arXiv:1902.07987.
[16] H. Qu and L. Gouskos, "ParticleNet: Jet tagging via particle clouds", Phys. Rev. D 101 (2020) 056019, doi:10.1103/PhysRevD.101.056019, arXiv:1902.08570.
[17] E. A. Moreno et al., "JEDI-net: a jet identification algorithm based on interaction networks", Eur. Phys. J. C (2020) 58, doi:10.1140/epjc/s10052-020-7608-4, arXiv:1908.05318.
[18] E. A. Moreno et al., "Interaction networks for the identification of boosted H → bb decays", Phys. Rev. D 102 (2020) 012010, doi:10.1103/PhysRevD.102.012010, arXiv:1909.12285.
[19] C. Jin, S.-z. Chen, and H.-H. He, "Classifying the cosmic-ray proton and light groups on the LHAASO-KM2A experiment with the graph neural network", arXiv:1910.07160.
[20] X. Ju et al., "Graph neural networks for particle reconstruction in high energy physics detectors", in Machine Learning and the Physical Sciences Workshop at the 33rd Annual Conference on Neural Information Processing Systems. 2019. arXiv:2003.11603.
[21] E. Bernreuther et al., "Casting a graph net to catch dark showers", arXiv:2006.08639.
[22] J. Shlomi, P. Battaglia, and J.-R. Vlimant, "Graph neural networks in particle physics", arXiv:2007.13681. Submitted to Mach. Learn.: Sci. Technol.
[23] Y. Wang et al., "Dynamic graph CNN for learning on point clouds", ACM Trans. Graph. (2019), doi:10.1145/3326362, arXiv:1801.07829.
[24] J. Kieseler, "Object condensation: one-stage grid-free multi-object reconstruction in physics detectors, graph and image data", (2020). arXiv:2002.03605.
[25] L. Gray, T. Klijnsma, and S. Ghosh, "A dynamic reduction network for point clouds", (2020). arXiv:2003.08013.
[26] CMS Collaboration, "The Phase-2 upgrade of the CMS Level-1 trigger", CMS Technical Design Report CERN-LHCC-2020-004. CMS-TDR-021, CERN, 2020.
[27] P. W. Battaglia et al., "Relational inductive biases, deep learning, and graph networks", (2018). arXiv:1806.01261.
[28] M. Besta et al., "Graph processing on FPGAs: Taxonomy, survey, challenges", (2019). arXiv:1903.06697.
[29] P. Veličković et al., "Graph attention networks", in International Conference on Learning Representations. 2018. arXiv:1710.10903.
[30] D. O'Loughlin et al., "Xilinx Vivado High Level Synthesis: Case studies", p. 352-356. IET, 2014.
[31] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization", in International Conference on Learning Representations. 2017. arXiv:1612.01064.
[32] GEANT4 Collaboration, "Geant4—a simulation toolkit", Nucl. Instrum. Methods Phys. Res. A 506 (2003) 250, doi:10.1016/S0168-9002(03)01368-8.
[33] G. Apollinari et al., eds., "High-Luminosity Large Hadron Collider (HL-LHC): Technical Design Report V. 0.1". CERN Yellow Reports: Monographs. CERN, Geneva, 2017. doi:10.23731/CYRM-2017-004.
[34] CMS Collaboration, "The Phase-2 upgrade of the CMS endcap calorimeter", CMS Technical Design Report CERN-LHCC-2017-023. CMS-TDR-019, 2017.
[35] S. van der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: A structure for efficient numerical computation", Comput. Sci. Eng. 13 (2011), no. 2, 22-30, doi:10.1109/MCSE.2011.37.
[36] The HDF Group, "Hierarchical Data Format, version 5", (1997-2020).
[37] Y. Iiyama and J. Kieseler, "Simulation of an imaging calorimeter to demonstrate GarNet on FPGA", 2020. doi:10.5281/zenodo.3888910.
[38] ALEPH Collaboration, "Performance of the ALEPH detector at LEP", Nucl. Instrum. Methods Phys. Res. A (1995) 481, doi:10.1016/0168-9002(95)00138-7.
[39] CMS Collaboration, "Particle-flow reconstruction and global event description with the CMS detector", J. Instrum. 12 (2017) P10003, doi:10.1088/1748-0221/12/10/P10003, arXiv:1706.04965.
[40] ATLAS Collaboration, "Jet reconstruction and performance using particle flow with the ATLAS Detector", Eur. Phys. J. C (2017) 466, doi:10.1140/epjc/s10052-017-5031-2, arXiv:1703.10485.
[41] A. F. Agarap, "Deep learning using rectified linear units (ReLU)", (2018). arXiv:1803.08375.
[42] S. R. Qasim, J. Kieseler, Y. Iiyama, and M. Pierini, "caloGraphNN", (2019). https://github.com/jkiesele/caloGraphNN.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", (2014). arXiv:1412.6980.
[44] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations", in Advances in Neural Information Processing Systems 28, C. Cortes et al., eds., p. 3123. Curran Associates, Inc., 2015. arXiv:1511.00363.
[45] B. Moons, K. Goetschalckx, N. Van Berckelaer, and M. Verhelst, "Minimum energy quantized neural networks", p. 1921, IEEE. 2017.
[46] S. Zhou et al., "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients", (2016). arXiv:1606.06160.
[47] V. Kathail, "Xilinx Vitis Unified Software Platform", p. 173. ACM, New York, NY, USA, 2020. doi:10.1145/3373087.3375887.
[48] Xilinx, Inc., "UltraScale FPGA product tables and product selection guide", (2020).