Point Cloud Transformers applied to Collider Physics
V. Mikuni, F. Canelli
University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
E-mail: [email protected]
Abstract:
Methods for processing point cloud information have seen great success in collider physics applications. One recent breakthrough in machine learning is the use of Transformer networks to learn semantic relationships between sequences in language processing. In this work, we apply a modified Transformer network called Point Cloud Transformer as a method to incorporate the advantages of the Transformer architecture to an unordered set of particles resulting from collision events. To compare the performance with other strategies, we study jet-tagging applications for highly-boosted particles.

The interactions between elementary particles are described by the Standard Model (SM) of particle physics. Particle colliders are used to study these interactions by comparing experimental signatures to SM predictions. At every collision, hundreds of particles can be created, detected, and reconstructed by particle detectors. Extracting relevant physics quantities from this space is a challenging task that is often accomplished through the use of physics-motivated summary statistics that reduce the data dimensionality to manageable quantities. A recent approach is to interpret the set of reconstructed particles as points in a point cloud. Point clouds represent a set of unordered objects, described in a well-defined space, and are often used for applications in self-driving vehicles, robotics, and augmented reality, to name a few. With this approach, information from each bunch crossing in a particle collider is interpreted as a point cloud, where the goal is to use this high-dimensional set of reconstructed particles to extract relevant information. However, extracting information from point clouds can itself be challenging.
One novel approach is to use the Transformer architecture [1] to learn the semantic relationship between objects. Transformers have yielded great success in recent years when applied to natural language processing (NLP), often showing superior performance compared to previous well-established methods. The advantage of this architecture is the capability of learning semantic affinities between objects without losing information over long sequences. Transformers are also easily parallelizable, a huge computational advantage over sequential architectures like gated recurrent [2] and long short-term memory [3] neural networks. The Transformer network has already been applied to problems outside NLP, with examples in image recognition [4, 5].

The Transformer architecture is not readily applicable to point clouds. Since point clouds are intrinsically unordered, the Transformer structure has to be modified to define a self-attention operation that respects the data symmetries, such as permutation invariance. A recent approach introduced in [6] addresses these issues through the development of Point Cloud Transformers (PCT). In this work, we first introduce the key features developed for PCT and use a modified version, applied to a high energy physics task in the form of jet-tagging. Results are compared with other approaches using three different public datasets.
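The permutation-symmetry requirement above can be made concrete with a small NumPy sketch (illustrative only, not the PCT implementation): plain dot-product attention is permutation-equivariant, and an order-insensitive pooling of its output yields a permutation-invariant summary.

```python
import numpy as np

def attention(x, wq, wk, wv):
    """Plain dot-product self-attention over a set of N points."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T                                 # (N, N) pairwise affinities
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    a = scores / scores.sum(axis=1, keepdims=True)   # row-wise softmax
    return a @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                          # 5 points, 4 features each
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))

perm = rng.permutation(5)
out, out_p = attention(x, wq, wk, wv), attention(x[perm], wq, wk, wv)

# Equivariance: permuting the inputs permutes the outputs the same way.
assert np.allclose(out[perm], out_p)
# Invariance: averaging over points removes the dependence on the order.
assert np.allclose(out.mean(axis=0), out_p.mean(axis=0))
```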
Neural network architectures that treat collision events as point clouds have recently grown in number, given their state-of-the-art performance when applied to different collider physics problems. A few examples of such applications are jet-tagging [7, 8], secondary vertex finding [9], event reconstruction [10–13], and jet parton assignment [14]. A comprehensive review of the different methods is given in [15].

Two applications in particular are relevant for the following discussion of the PCT implementation: the ParticleNet [16] and ABCNet [17] architectures. The former introduces the EdgeConv operation, initially developed in [3]. This operation uses a k-nearest neighbors approach to create local patches inside a point cloud. The local information is then used to create high-level features for each point that retain the information of the local neighborhood. ABCNet, on the other hand, uses the local information to define an attention mechanism, first introduced in [18] and applied in [19]. A similar concept of attention mechanisms is defined for PCT, where a self-attention layer is used to provide the relationship importance between all particles in the set.

Jet-tagging is a common task used to benchmark different algorithms applied to collider physics. While a number of algorithms have been proposed in recent years, special attention will be given to algorithms with results on public datasets. In [16], results are presented for both quark-gluon and top-quark datasets, while [20] introduces a multiclassification sample containing five different jet categories. Each dataset is described in Sec. 5.
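The k-nearest-neighbors locality used by EdgeConv, picking for each particle its k closest neighbors in the pseudorapidity-azimuth plane, can be sketched as follows (a NumPy illustration with made-up inputs, not the ParticleNet or ABCNet code):

```python
import numpy as np

def knn_indices(eta, phi, k):
    """Indices of the k nearest neighbours of each particle in (eta, phi).

    The azimuthal difference is wrapped into [-pi, pi] so that particles
    on either side of the phi = +/-pi boundary are treated as close.
    """
    deta = eta[:, None] - eta[None, :]
    dphi = phi[:, None] - phi[None, :]
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
    dr2 = deta**2 + dphi**2               # squared Delta R distance
    np.fill_diagonal(dr2, np.inf)         # a point is not its own neighbour
    return np.argsort(dr2, axis=1)[:, :k]

eta = np.array([0.00, 0.05, 0.40, 0.42, -0.30])
phi = np.array([0.00, 0.02, 3.10, -3.10, 0.50])
# Particles 2 and 3 sit on opposite sides of the phi boundary, yet the
# wrapped distance correctly makes them mutual nearest neighbours.
assert 3 in knn_indices(eta, phi, k=2)[2]
```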
The Transformer implementation applied to point clouds requires two main building blocks: the feature extraction and the self-attention (SA) layers. The feature extractor is used to map the input point cloud F_in ∈ R^{N×d_in} to a higher-dimensional representation F_e ∈ R^{N×d_out}. This step is used to achieve a higher level of abstraction for each point present in the point cloud. In this work, two different strategies are compared: an architecture consisting of stacked one-dimensional convolutional (Conv1D) layers, and a second option based on EdgeConv blocks. The EdgeConv block consists of an EdgeConv operation [3] followed by two two-dimensional convolutional (Conv2D) layers and an average pooling operation. The EdgeConv operation uses a k-nearest neighbors approach to define a vicinity for each point in the point cloud. This enhances the ability of the network to extract information from the local neighborhood around each point. The first strategy is referred to as simple PCT (SPCT), while the second is referred to as just PCT.

The second main building block is the usage of an offset attention defined as a self-attention (SA) layer. The output of the feature extractor F_e is used as the input of the first SA layer. The goal of the SA layer is to determine the relationship between all particles of the point cloud through an offset attention mechanism. This approach differs from the one taken in ABCNet, where self-attention and neighboring attention coefficients are defined for a neighborhood of each particle. In the same terms defined in the original Transformer work [1], three different matrices are built from linear transformations of the original inputs. These matrices are called query (Q), key (K), and value (V). The linear transformations are accomplished through the usage of Conv1D layers such that:

Q, K, V = F_e · (W_q, W_k, W_v), (3.1)
Q, K ∈ R^{N×d_a}, V ∈ R^{N×d_out}. (3.2)

The matrices (W_q, W_k, W_v) contain the trainable linear coefficients introduced by the convolutional operation. The attention weights A are then calculated by first multiplying the query matrix with the transpose of the key matrix:

A = Q · K^T, A ∈ R^{N×N}. (3.3)

A softmax operation is then applied to each row of A to normalize the coefficients for all points. The last step is to define the offset attention. First, the attention weights are multiplied by the value matrix, resulting in the self-attention F_sa with

F_sa = A · V, F_sa ∈ R^{N×d_out}. (3.4)

The difference between F_e and F_sa is passed through a Conv1D layer with the same output dimension d_out. The result of this layer is the offset added to the original inputs F_e. Different levels of abstraction can be achieved by stacking multiple SA layers, using the output of each SA layer as the input for the next.

To complete the general architecture, the SA layers are combined through a simple concatenation followed by an average pooling operation, leaving the entire architecture invariant under permutations of the input points. The output of this operation is passed through fully connected layers before reaching the output layer, normalized through a softmax operation. All convolutional and fully connected layers are followed by the nonlinear ReLU activation function, with the exception of the convolutional operations inside the SA layers and the output layer.

The general PCT network and the main building blocks are shown in Fig. 1. The training details are explained in Sec. 4.
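In index-free NumPy form, one SA layer following Eqs. (3.1)-(3.4) can be sketched as below. Since the Conv1D layers here have kernel size 1, they act as per-point linear maps and plain matrix products stand in for them; the weights are random placeholders, not trained values.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax, normalizing the attention coefficients per point."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sa_layer(f_e, w_q, w_k, w_v, w_off):
    """One offset self-attention layer, NumPy sketch of Eqs. (3.1)-(3.4)."""
    q, k, v = f_e @ w_q, f_e @ w_k, f_e @ w_v   # Eqs. (3.1)-(3.2)
    a = softmax_rows(q @ k.T)                   # Eq. (3.3) + row softmax
    f_sa = a @ v                                # Eq. (3.4)
    offset = (f_e - f_sa) @ w_off               # Conv1D on the difference
    return f_e + offset                         # offset added back to F_e

rng = np.random.default_rng(1)
n, d_out, d_a = 30, 8, 4                        # N particles, feature dims
f_e = rng.normal(size=(n, d_out))
w_q, w_k = rng.normal(size=(d_out, d_a)), rng.normal(size=(d_out, d_a))
w_v, w_off = rng.normal(size=(d_out, d_out)), rng.normal(size=(d_out, d_out))

out = sa_layer(f_e, w_q, w_k, w_v, w_off)
assert out.shape == (n, d_out)   # same shape as the input, so SA layers stack
```

The convolutions inside the SA layer carry no activation, matching the exception noted in the text, and the shape-preserving output is what allows several SA layers to be chained and later concatenated.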
Figure 1. General network architecture (left), feature extractor (middle), and self-attention layer (right). d_in, d_out, and d_c represent the input feature, output feature, and fully connected layer sizes.

The PCT implementation is done using TensorFlow v1.14 [21]. An NVIDIA GTX 1080 Ti graphics card is used for the training and evaluation steps. For all architectures, the Adam optimizer [22] is used, with a learning rate starting from 0.001 and decreasing by a factor of 2 every 20 epochs, to a minimum of 10^-6. The training is performed with a mini-batch size of 64 and a maximum of 200 epochs. If no improvement is observed in the evaluation set for 15 consecutive epochs, the training is stopped. The epoch with the lowest classification loss on the test set is stored for further evaluation.
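The schedule above can be written down framework-agnostically; the helpers below reproduce the step decay and the early-stopping rule described in the text, with the surrounding training loop left to the framework of choice (a sketch, not the original training script).

```python
def learning_rate(epoch, base=1e-3, drop_every=20, factor=0.5, floor=1e-6):
    """Step-decay schedule: halve the rate every `drop_every` epochs."""
    return max(base * factor ** (epoch // drop_every), floor)

class EarlyStopper:
    """Stop when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=15):
        self.patience, self.best, self.bad_epochs = patience, float("inf"), 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# The rate starts at 1e-3, is halved every 20 epochs, and is floored at 1e-6.
assert learning_rate(0) == 1e-3
assert learning_rate(25) == 5e-4
assert learning_rate(200) == 1e-6

# With patience 2, two consecutive non-improving epochs trigger the stop.
stop = EarlyStopper(patience=2)
history = [stop.should_stop(loss) for loss in (1.0, 2.0, 2.5)]
assert history == [False, False, True]
```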
Different performance metrics are compared for (S)PCT applied to a jet classification task on different public datasets. Jets are collimated sprays of particles resulting from the hadronization and fragmentation of energetic partons. Jets can show distinguishing features depending on the elementary particle that initiated the jet. Traditional methods use this information to define physics-motivated observables [23] that can distinguish different jet categories.

The PCT architecture uses two EdgeConv blocks, each defining the k nearest neighbors of each point with k = 20. The initial distances are calculated in the pseudorapidity-azimuth (η-φ) space, of the form ΔR = √(Δη² + Δφ²). The distances used for the second EdgeConv block are calculated using the full feature space produced in the output of the previous EdgeConv block.

Besides the feature extractor, PCT uses three SA layers while SPCT uses two. The outputs of all SA layers are concatenated for both PCT and SPCT. However, the output of the last EdgeConv block is also added during concatenation with a skip connection. The detailed architectures used during training for PCT and SPCT are shown in Fig. 2.
Figure 2. SPCT (left) and PCT (middle) architectures used for all jet-tagging classification tasks. The EdgeConv block structure used in PCT is shown on the right. Numbers in parentheses represent layer sizes.
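Putting the pieces together, a forward pass through the SPCT branch of Fig. 2 can be sketched in NumPy. The weights are random placeholders and the layer sizes are an illustrative reading of the figure, not the trained model; the attention width d_a = 16 is likewise an assumption for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sa_layer(f, d_a=16):
    """Offset self-attention with random placeholder weights."""
    d = f.shape[1]
    w_q, w_k = rng.normal(size=(d, d_a)), rng.normal(size=(d, d_a))
    w_v, w_off = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    a = softmax_rows((f @ w_q) @ (f @ w_k).T)
    return f + (f - a @ (f @ w_v)) @ w_off

n, d_in = 100, 16                        # up to 100 particles, 16 features
x = rng.normal(size=(n, d_in))

# Feature extractor: two per-point Conv1D layers (kernel size 1 -> matmul).
h = relu(x @ rng.normal(size=(16, 128)))
h = relu(h @ rng.normal(size=(128, 64)))

# Two stacked SA layers; their outputs are concatenated.
s1 = sa_layer(h)
s2 = sa_layer(s1)
h = relu(np.concatenate([s1, s2], axis=1) @ rng.normal(size=(128, 128)))

# Average pooling over particles gives a permutation-invariant vector.
pooled = h.mean(axis=0)                  # (128,)

# Fully connected head ending in a softmax over N_categories = 5.
logits = relu(pooled @ rng.normal(size=(128, 64))) @ rng.normal(size=(64, 5))
probs = np.exp(logits - logits.max())
probs /= probs.sum()
assert probs.shape == (5,) and np.isclose(probs.sum(), 1.0)
```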
PCT and SPCT receive as inputs the particles found inside the jets. The input features vary between applications, depending on the available content of each public dataset. For all comparisons, up to 100 particles per jet are used. If more particles are found inside a jet, the list is truncated; otherwise it is zero-padded up to 100.
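The fixed-length input described above, truncate to 100 particles or zero-pad, might look like this (NumPy sketch; the 100-particle cap is from the text, and the 16 features per particle match the first dataset below):

```python
import numpy as np

def pad_or_truncate(particles, max_part=100):
    """Return a (max_part, n_features) array: truncated or zero-padded."""
    particles = np.asarray(particles, dtype=float)
    n, n_feat = particles.shape
    out = np.zeros((max_part, n_feat))
    out[:min(n, max_part)] = particles[:max_part]
    return out

jet = np.random.default_rng(3).normal(size=(37, 16))  # 37 particles, 16 features
assert pad_or_truncate(jet).shape == (100, 16)
assert np.all(pad_or_truncate(jet)[37:] == 0.0)       # zero-padded tail
```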
For this study, samples containing simulated jets originating from W bosons, Z bosons, top quarks, light quarks, and gluons produced in √s = 13 TeV proton-proton collisions are used. The samples are available at [24]. This dataset is created using a parametric description of a generic LHC detector, described in [25, 26]. The jets are clustered with the anti-k_T algorithm [27] with radius parameter R = 0.8, while also requiring that the jet p_T is around 1 TeV, which ensures that most of the decay products of the generated particles are found inside a single jet. The training and testing sets contain 567k and 63k jets, respectively. The performance comparison is reported using the official evaluation set, containing 240k jets. For each particle, a set of 16 kinematic features is used. These distributions were chosen to match the particle features used in [20] to facilitate the comparison.

The area under the curve (AUC) for each evaluation is calculated by taking each jet category as signal while the remaining categories are treated as background. The results are shown in Tab. 1.

Table 1. Area under the curve for each jet category reported on the HLS4ML LHC jet dataset. Results for all methods are taken as the average of 10 trainings with random network initialization. If the uncertainty is not quoted, the variation is negligible compared to the expected value. Bold results represent the algorithm with the highest performance.
Algorithm               Gluon    Light quark   W boson   Z boson   Top quark
DNN [20]                0.9384   0.9026        0.9537    0.9459    0.9620
GRU [20]                0.9040   0.8962        0.9192    0.9042    0.9350
CNN [20]                0.8945   0.9007        0.9102    0.8994    0.9494
JEDI-net [20]           0.9529   0.9301        0.9739    0.9679    0.9683
JEDI-net with ΣO [20]   0.9528   0.9290        0.9695    0.9649    0.9677
SPCT                    0.9585   0.9370        0.9767    0.9799    0.9730
PCT                     …        …             …         …         …

The top tagging dataset consists of jets containing the hadronic decay products of top quarks (treated as signal) together with jets generated through QCD dijet events (treated as background). The samples are available at [28]. The events are generated with Pythia8 [29], with detector simulation done through Delphes [30]. The jets are clustered with the anti-k_T algorithm with radius parameter R = 0.8. Only jets with transverse momentum p_T ∈ [550, 650] GeV and rapidity |y| < 2 are kept. The official training, testing, and evaluation splitting is used, containing 1.2M/400k/400k events, respectively. For each particle, a set of 7 input features is used. These distributions are the same ones used in [16] to facilitate the comparison between algorithms. The AUC and background rejection power, defined as the inverse of the background efficiency for a fixed signal efficiency, are listed in Tab. 2, with a reduced number of algorithms as reported in [16]. A more complete, although slightly outdated, list is available at [31].

Table 2. Comparison between the performance reported for different classification algorithms on the top tagging dataset. The uncertainty quoted corresponds to the standard deviation of nine trainings with different random weight initialization. If the uncertainty is not quoted, the variation is negligible compared to the expected value. Bold results represent the algorithm with the highest performance.

Algorithm               Acc      AUC      1/εB (εS = 0.5)   1/εB (εS = 0.3)
ResNeXt-50 [16]         0.936    0.9837   302 ± 5           1147 ± 58
JEDI-net [20]           0.9263   0.9786   -                 590.4
JEDI-net with ΣO [20]   0.9300   0.9807   -                 774.6
SPCT                    0.931    0.9813   230 ± 10          851 ± 12
PCT                     …        …        …                 1287 ± …

The dataset used for these studies is available from [32]. It consists of stable particles, excluding neutrinos, clustered into jets using the anti-k_T algorithm with R = 0.4. The quark-initiated sample (treated as signal) is generated using Z(νν) + (u, d, s) processes, while the gluon-initiated data (treated as background) are generated using Z(νν) + g processes. Both samples are generated using Pythia8 [29] without detector effects. Jets are required to have p_T ∈ [500, 550] GeV and rapidity |y| < 1.7 for the reconstruction. For the training, testing, and evaluation, the recommended splitting is used, with 1.6M/200k/200k events, respectively. Each particle contains the four-momentum and the expected particle type (electron, muon, photon, or charged/neutral hadron). For each particle, a set of 13 kinematic features is used. These features are chosen to match the ones used in [16, 17]. The AUC and background rejection power are listed in Tab. 3.

Table 3. Comparison between the performance reported for different classification algorithms on the quark and gluon dataset. The uncertainty quoted corresponds to the standard deviation of nine trainings with different random weight initialization. If the uncertainty is not quoted, the variation is negligible compared to the expected value. Bold results represent the algorithm with the highest performance.
Algorithm         Acc     AUC      1/εB (εS = 0.5)   1/εB (εS = 0.3)
ResNeXt-50 [16]   0.821   0.9060   30.9              80.8
P-CNN [16]        0.827   0.9002   34.7              91.0
PFN [32]          -       0.9005   34.7 ± …          …
SPCT              0.824   0.899    34.4 ± …          …
PCT               …       …        …                 …

Besides the algorithm performance, the computational cost is also an important figure of merit. To compare the amount of computational resources required to evaluate each model, the number of trainable weights and the number of floating point operations (FLOPs) are computed. The comparison of these quantities for different algorithms is shown in Tab. 4.
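Both figures of merit are simple to estimate from the layer shapes alone. The sketch below counts trainable weights from a list of weight-tensor shapes and approximates the FLOPs of a per-point linear layer; the shapes used are hypothetical, not the exact (S)PCT graphs.

```python
import numpy as np

def n_trainable(weight_shapes):
    """Total trainable parameters from a list of weight-tensor shapes."""
    return int(sum(np.prod(s) for s in weight_shapes))

def matmul_flops(n_points, d_in, d_out):
    """Approximate FLOPs of a kernel-size-1 Conv1D: one multiply and one
    add per weight, applied independently to each of the n_points."""
    return 2 * n_points * d_in * d_out

# Hypothetical per-point Conv1D stack: (in, out) kernels plus bias vectors.
shapes = [(16, 128), (128,), (128, 64), (64,)]
assert n_trainable(shapes) == 10432

# The first layer alone, applied to 100 particles:
assert matmul_flops(100, 16, 128) == 409600
```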
Table 4. Number of trainable weights and floating point operations (FLOPs) for each model under consideration.
Algorithm               Weights   FLOPs
ResNeXt-50 [16]         1.46M     -
P-CNN [16]              348k      -
PFN [32]                82k       -
ParticleNet-Lite [16]   26k       -
ParticleNet [16]        366k      -
ABCNet [17]             230k      -
DNN [20]                14.7k     27k
GRU [20]                15.6k     46k
CNN [20]                205.5k    400k
JEDI-net [20]           33.6k     116M
JEDI-net with ΣO [20]   8.8k      458M
SPCT                    55.4k     20M
PCT                     153.9k    381M

While PCT shows a better overall AUC compared to SPCT, the improvement in performance from the usage of EdgeConv blocks comes with a cost in computational complexity. SPCT, on the other hand, provides a good balance between performance and computational cost, requiring almost 20 times fewer FLOPs and 3 times fewer trainable weights compared to PCT.

The SA module defines the relative importance between all points in the set through the attention weights. We can use this information to identify the regions inside a jet that have high importance for a chosen particle. To visualize the particle importance, the HLS4ML LHC jet dataset is used to create a pixelated image of a jet in the transverse plane. The average jet image of 100k examples in the evaluation set is used. For each image, a simple preprocessing strategy is applied to align the different images. First, the whole jet is translated such that the particle with the highest transverse momentum in the jet is centered at (0,0). This particle is also used as the reference particle from which the attention weights are shown. Next, the full jet image is rotated, making the second most energetic particle aligned with the positive y-coordinate. Lastly, the image is flipped in the x-coordinate if the third most energetic particle is located on the negative x-axis; otherwise the image is left as is. These transformations are also used in other jet image studies such as [17, 33]. The pixel intensity for each jet image is taken from the attention weights after the softmax operation is applied, expressing the particle importance with respect to the most energetic particle in the event. A comparison of the extracted images for each SA layer and for each jet category is shown in Fig. 3.
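The alignment steps described above, centre on the hardest particle, rotate the second hardest onto the positive y-axis, and flip so the third hardest has non-negative x, can be sketched as follows (NumPy, acting on (x, y) coordinates in the translated plane; an illustration of the procedure, not the exact analysis code):

```python
import numpy as np

def align(points, pt):
    """Translate/rotate/flip a jet so its three hardest particles line up."""
    order = np.argsort(pt)[::-1]          # hardest particle first
    pts = points - points[order[0]]       # hardest particle at the origin
    # Rotate the second-hardest particle onto the positive y-axis.
    x2, y2 = pts[order[1]]
    theta = np.arctan2(x2, y2)            # angle measured from the +y axis
    c, s = np.cos(theta), np.sin(theta)
    pts = pts @ np.array([[c, -s], [s, c]]).T
    # Flip in x if the third-hardest particle fell on the negative x side.
    if pts[order[2], 0] < 0:
        pts[:, 0] *= -1
    return pts

pt = np.array([100.0, 50.0, 20.0])        # hypothetical transverse momenta
pts = np.array([[0.1, 0.2], [0.3, 0.5], [-0.2, 0.1]])
out = align(pts, pt)
assert np.allclose(out[0], 0.0)           # hardest particle at the origin
assert abs(out[1, 0]) < 1e-9 and out[1, 1] > 0   # second on the +y axis
assert out[2, 0] >= 0                     # third on the non-negative x side
```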
Figure 3. Average jet image for each jet category (columns) and for each self-attention layer (rows). The pixel intensities represent the overall particle importance compared to the most energetic particle in the jet.
The different SA layers are able to extract different information for each jet. In particular, the jet substructure is exploited, resulting in an increased relevance of harder subjets in the case of Z boson, W boson, and top quark initiated jets. On the other hand, light quark and gluon initiated jets have a more homogeneous radiation pattern, resulting in a more homogeneous picture.
In this work, a new method based on the Transformer architecture was applied to a high energy physics application. The Point Cloud Transformer (PCT) modifies the usual Transformer architecture to be applicable to the set of unordered points present in a point cloud. This method has the advantage of extracting semantic affinities between the points through a self-attention mechanism. We evaluate the performance of this architecture on several jet-tagging datasets by testing two different implementations: one that exploits neighborhood information through EdgeConv operations, and a simpler form, called simple PCT (SPCT), that connects all points through convolutional layers. Both approaches have shown state-of-the-art performance compared to other publicly available results. While the classification performance of SPCT is slightly lower than that of the standard PCT, the number of floating point operations required to evaluate the model decreases by almost a factor of 20. This reduced computational complexity can be exploited in environments with limited computing resources or in applications that require fast inference responses.

A different advantage of (S)PCT is the visualization of the self-attention coefficients to understand which points have a greater importance for the classification task. Traditional methods often define physics-motivated observables to distinguish the different types of jets. PCT, on the other hand, exploits subjet information by learning affinities on a particle-by-particle basis, resulting in images with distinct features for jets of different decay modes.
The authors would like to thank Jean-Roch Vlimant for helpful comments during the development of this work. This research was supported in part by the Swiss National Science Foundation (SNF) under contract No. 200020-182037 and by the Forschungskredit of the University of Zurich, grant No. FK-20-097.
References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017) [arXiv:1706.03762].
[2] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, CoRR abs/1412.3555 (2014) [arXiv:1412.3555].
[3] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, Dynamic graph CNN for learning on point clouds, CoRR abs/1801.07829 (2018) [arXiv:1801.07829].
[4] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, J. Gonzalez, K. Keutzer, and P. Vajda, Visual transformers: Token-based image representation and processing for computer vision, 2020.
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, 2020.
[6] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, PCT: Point cloud transformer, 2020.
[7] P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy Flow Networks: Deep Sets for Particle Jets, JHEP (2019) 121, [arXiv:1810.05165].
[8] M. J. Dolan and A. Ore, Equivariant Energy Flow Networks for Jet Tagging, arXiv:2012.00964.
[9] J. Shlomi, S. Ganguly, E. Gross, K. Cranmer, Y. Lipman, H. Serviansky, H. Maron, and N. Segol, Secondary Vertex Finding in Jets with Neural Networks, arXiv:2008.02831.
[10] M. J. Fenton, A. Shmakov, T.-W. Ho, S.-C. Hsu, D. Whiteson, and P. Baldi, Permutationless Many-Jet Event Reconstruction with Symmetry Preserving Attention Networks, arXiv:2010.09206.
[11] J. Duarte and J.-R. Vlimant, Graph Neural Networks for Particle Tracking and Reconstruction, arXiv:2012.01249.
[12] J. Pata, J. Duarte, J.-R. Vlimant, M. Pierini, and M. Spiropulu, MLPF: Efficient machine-learned particle-flow reconstruction using graph neural networks, arXiv:2101.08578.
[13] X. Ju et al., Graph Neural Networks for Particle Reconstruction in High Energy Physics detectors, 2020, [arXiv:2003.11603].
[14] J. S. H. Lee, I. Park, I. J. Watson, and S. Yang, Zero-Permutation Jet-Parton Assignment using a Self-Attention Network, arXiv:2012.03542.
[15] J. Shlomi, P. Battaglia, and J.-R. Vlimant, Graph Neural Networks in Particle Physics, arXiv:2007.13681.
[16] H. Qu and L. Gouskos, Jet tagging via particle clouds, Phys. Rev. D (Mar, 2020) 056019.
[17] V. Mikuni and F. Canelli, ABCNet: An attention-based method for particle tagging, Eur. Phys. J. Plus (2020), no. 6 463, [arXiv:2001.05311].
[18] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, Graph attention networks, 2017.
[19] C. Chen, L. Z. Fragonara, and A. Tsourdos, GAPNet: Graph Attention based Point Neural Network for Exploiting Local Feature of Point Cloud, arXiv:1905.08705.
[20] E. A. Moreno, O. Cerri, J. M. Duarte, H. B. Newman, T. Q. Nguyen, A. Periwal, M. Pierini, A. Serikova, M. Spiropulu, and J.-R. Vlimant, JEDI-net: a jet identification algorithm based on interaction networks, arXiv:1908.05318.
[21] M. Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[22] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, arXiv e-prints (Dec, 2014) [arXiv:1412.6980].
[23] J. Thaler and K. Van Tilburg, Identifying Boosted Objects with N-subjettiness, JHEP (2011) 015, [arXiv:1011.2268].
[24] M. Pierini, J. M. Duarte, N. Tran, and M. Freytsis, HLS4ML LHC jet dataset (100 particles), Jan., 2020.
[25] E. Coleman, M. Freytsis, A. Hinzmann, M. Narain, J. Thaler, N. Tran, and C. Vernieri, The importance of calorimetry for highly-boosted jet substructure, JINST (2018), no. 01 T01003, [arXiv:1709.08705].
[26] J. Duarte et al., Fast inference of deep neural networks in FPGAs for particle physics, JINST (2018), no. 07 P07027, [arXiv:1804.06913].
[27] M. Cacciari, G. P. Salam, and G. Soyez, The anti-k_T jet clustering algorithm, JHEP (2008) 063, [arXiv:0802.1189].
[28] G. Kasieczka, T. Plehn, J. Thompson, and M. Russel, Top quark tagging reference dataset, Mar., 2019.
[29] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands, An Introduction to PYTHIA 8.2, Comput. Phys. Commun. (2015) 159-177, [arXiv:1410.3012].
[30] DELPHES 3 Collaboration, J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, DELPHES 3, A modular framework for fast simulation of a generic collider experiment, JHEP (2014) 057, [arXiv:1307.6346].
[31] A. Butter et al., The Machine Learning Landscape of Top Taggers, SciPost Phys. (2019) 014, [arXiv:1902.09914].
[32] P. T. Komiske, E. M. Metodiev, and J. Thaler, Energy flow networks: deep sets for particle jets, Journal of High Energy Physics (Jan, 2019).
[33] P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, Deep learning in color: towards automated quark/gluon jet discrimination, JHEP (2017) 110, [arXiv:1612.01551].