Hashing and metric learning for charged particle tracking

Abstract

We propose a novel approach to charged particle tracking at high intensity particle colliders based on Approximate Nearest Neighbors search. With hundreds of thousands of measurements per collision to be reconstructed e.g. at the High Luminosity Large Hadron Collider, the currently employed combinatorial track finding approaches become inadequate. Here, we use hashing techniques to separate measurements into buckets of 20-50 hits and increase their purity using metric learning. Two different approaches are studied to further resolve tracks inside buckets: Local Fisher Discriminant Analysis and Neural Networks for triplet similarity learning. We demonstrate the proposed approach on simulated collisions and show significant speed improvement with bucket tracking efficiency of 96% and a fake rate of 8% on unseen particle events.


Sabrina Amrouche ∗, Moritz Kiehn, Tobias Golling

Université de Genève

Andreas Salzburger

European Organization for Nuclear Research (CERN)


Modern high energy hadron collider experiments, such as the ATLAS and CMS experiments at the LHC at CERN, produce vast amounts of data to be analyzed. This is driven by the fact that, in order to maximize the research potential, many proton-proton collisions are performed with many particles emerging from the interaction region. Future upgrades of existing accelerators or new accelerators, such as the High Luminosity Large Hadron Collider by 2026 or the proposed FCC collider, will significantly increase this complexity. Finding the trajectories of particles produced in such collisions is a crucial step and yet, due to limitations of the current approaches, it is the most CPU intensive task in reconstruction.

A trajectory is the trail a particle leaves when interacting with the detector material and causing a detectable signal in the traversed devices. The measurements are referred to as hits, and the process of grouping together the hits generated by the same particle is called tracking. Tracking in high energy physics experiments is particularly challenging due to the high number of trajectories that need to be resolved simultaneously (1). This is mainly due to the absence of hit-particle descriptive characteristics (identity), which makes all hits possible candidates for a trajectory. Any two points can potentially be associated into a trajectory that can in turn be evaluated only after it is completed. This would be comparable to tracking cars that change colors and shapes at every frame. (In contrast to cars, however, we have quasi-deterministic equations of motion that govern the particle trajectory.)

In this study we consider high-luminosity scenarios where the target is to rapidly label more than 100K hits generated by more than 10K particles. We propose to combine search techniques and deep learning to solve the tracking problem efficiently.

∗ Corresponding author

Second Workshop on Machine Learning and the Physical Sciences (NeurIPS 2019), Vancouver, Canada.

Proposed approach: hashing and metric learning

Finding similar items in large datasets is a challenging yet well studied problem. A high dimensionality of the records makes it even more problematic to define a valid similarity metric. Approximate Nearest Neighbors (ANN) techniques are often used to reduce the search complexity (2). Formally, the ANN problem is defined as follows: given a set S of n data points in a metric space X, the task is to index the points so that, given a query point q ∈ X, the data points most similar to q are quickly found. ANN indexes items into hashes or buckets that contain approximate neighbors to a query point. As a result, for a given item of interest, searching for its neighbors becomes restricted to the bucket only.

We propose to use ANN to index particle tracks and therefore reduce the tracking to only be applied within a single bucket. The neighbors of a given query point are the hits from its true particle. The main appeal of ANN for tracking is the fast index query time, which potentially allows a particle to be retrieved from one query point. Moreover, by using queries across the hit point cloud, the approach is easily parallelizable. The ANN notion of similarity is redefined to find neighbors as points aligned along the same trajectory (particle neighbors) by using a suitable distance function. The hashes can be regarded as average track estimates that are refined with metric learning. Within a bucket, we use metric learning to map to an optimal embedding where hits from the same particle naturally cluster with a simple metric.

For the conducted experiments we use the TrackML challenge dataset (3). The dataset simulation assumes an average of 200 overlaid simultaneous (so-called pile-up) events with one signal event. The latter has been chosen to be a top quark pair production, although for the scope of this study only minor importance is given to the actual physics event.
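
The bucket indexing described in this section can be illustrated with a minimal sketch. It assumes a single binary tree of random hyperplane splits on direction-normalized vectors, which is one common way to obtain angular-distance buckets; it is not the implementation of any particular library, and the names build_index, query and max_leaf are hypothetical. The hits are synthetic random points, not TrackML data.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so that only its direction matters."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _split(points, ids, max_leaf, rng):
    """Recursively partition hit ids until each leaf (bucket) is small enough."""
    if len(ids) <= max_leaf:
        return {"leaf": ids}
    # Assumed split rule: the hyperplane through the origin that is
    # equidistant from two randomly chosen unit vectors.
    a, b = rng.sample(ids, 2)
    normal = tuple(pa - pb for pa, pb in zip(points[a], points[b]))
    left = [i for i in ids if dot(points[i], normal) > 0]
    right = [i for i in ids if dot(points[i], normal) <= 0]
    if not left or not right:
        # Degenerate split (e.g. duplicate directions): keep an oversized leaf.
        return {"leaf": ids}
    return {
        "normal": normal,
        "left": _split(points, left, max_leaf, rng),
        "right": _split(points, right, max_leaf, rng),
    }


def build_index(hits, max_leaf=20, seed=0):
    """Build one random-hyperplane tree over the hit directions."""
    rng = random.Random(seed)
    points = [normalize(h) for h in hits]
    return _split(points, list(range(len(points))), max_leaf, rng)


def query(tree, hit):
    """Descend to the leaf for this hit and return the hit ids in its bucket."""
    q = normalize(hit)
    node = tree
    while "leaf" not in node:
        node = node["left"] if dot(q, node["normal"]) > 0 else node["right"]
    return node["leaf"]


# Synthetic stand-in for [x, y, z] hit positions.
rng = random.Random(1)
hits = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(300)]
tree = build_index(hits, max_leaf=20, seed=0)
bucket = query(tree, hits[0])
print(len(bucket), 0 in bucket)
```

A query only walks from the root to one leaf, so its cost grows with the tree depth rather than with the number of hits. A single tree can separate true neighbors that fall on opposite sides of a split, which is why practical ANN libraries typically combine several trees.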
The data points are simulated by ACTS (A Common Tracking Software, an open source library available at cern.ch/acts), a fast simulation emulating particle collisions as they might occur in a detector with high accuracy. A particle leaves on average 12 hits in the simulated TrackML detector, where each hit is characterized by 5 features: the global coordinates x, y, z and two angles φ and θ that are obtained from the cluster shape and indicate the track direction with respect to the detection sensor. In the remainder of this paper we describe the two components of the proposed approach applied to the TrackML dataset.

We choose to build the ANN index as a tree based partition method (binary trees) where the points are assigned according to their feature values only (unsupervised). In the first experiment, we randomly query hits and request their closest neighbors. We then analyze the content of each bucket and record the size of the leading (largest) track. The average query time is 0.05 ms on an Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz.

Figure 1 shows the size of the leading particle in a bucket of 20 neighbors and 50 neighbors for 5000 random queries. The ANN index was built on [x, y, z] vectors with the angular distance using the open source ANN library Annoy (4) and without any pre-filtering of the dataset (100K hits). The particle size distribution shows that most buckets contain valid particles, with up to 15 hits from the same particle in a 20 neighbors bucket.

Since the particle label associated with every point is available from simulation, we can measure the number of queries necessary to reconstruct a full event assuming an optimal bucket tracking. (Indeed, current pattern recognition algorithms at the LHC show a close to 100% efficiency for reconstructable particles, i.e. particles that leave sufficient hits in the detector to be found.) A track is marked as reconstructed if 80% of its hits are found inside the same bucket (purity ≥ 0.8). Figure 2 shows the efficiencies for bucket sizes of 20 and 50 as a function of the number of queries. The efficiency is defined as the number of reconstructed tracks divided by the total number of tracks. In Figure 2 we consider only particles with a transverse momentum p_T greater than [value missing in source] and at least 4 hits per track. With only 2000 random queries, 20 neighbors buckets retrieve slightly more than 95% of the particles while 50 neighbors buckets retrieve over 99%. The efficiencies are for µ = 200 events. Querying sequentially 2000 random buckets of 50 hits takes 0.09 s on the architecture previously mentioned and with the Python API provided by the Annoy library.
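
The bucket-level quantities used above (leading track size, the 80% reconstruction criterion and the efficiency) can be written down as a short sketch. The function names and the toy hit-to-particle mapping are hypothetical, and the transverse momentum cut is left out because it requires momentum information that this toy example does not carry.

```python
from collections import Counter


def leading_track_size(bucket, particle_of):
    """Number of hits of the most frequent particle in a bucket."""
    counts = Counter(particle_of[h] for h in set(bucket))
    return max(counts.values()) if counts else 0


def reconstructed_tracks(buckets, particle_of, min_fraction=0.8, min_hits=4):
    """Particles with at least min_fraction of their hits inside one bucket."""
    track_size = Counter(particle_of.values())
    found = set()
    for bucket in buckets:
        in_bucket = Counter(particle_of[h] for h in set(bucket))
        for pid, n in in_bucket.items():
            enough_hits = track_size[pid] >= min_hits
            if enough_hits and n >= min_fraction * track_size[pid]:
                found.add(pid)
    return found


def efficiency(buckets, particle_of, min_fraction=0.8, min_hits=4):
    """Reconstructed tracks divided by all tracks passing the hit-count cut."""
    track_size = Counter(particle_of.values())
    eligible = [p for p, n in track_size.items() if n >= min_hits]
    if not eligible:
        return 0.0
    found = reconstructed_tracks(buckets, particle_of, min_fraction, min_hits)
    return len(found) / len(eligible)


# Toy truth: particle "A" has 5 hits, "B" has 4 hits, "C" has only 2 hits.
particle_of = {0: "A", 1: "A", 2: "A", 3: "A", 4: "A",
               5: "B", 6: "B", 7: "B", 8: "B",
               9: "C", 10: "C"}
buckets = [[0, 1, 2, 3, 4, 5, 6, 9]]  # one queried bucket
print(leading_track_size(buckets[0], particle_of))
print(efficiency(buckets, particle_of))
```

In this toy case the bucket holds all hits of "A" but only half of "B", and "C" is excluded by the hit-count cut, so one of the two eligible tracks counts as reconstructed. Adding more queried buckets can only increase the efficiency, which matches its use as a function of the number of queries.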

[...] 0.05 ms in our study setup. The speed has the potential to scale almost linearly when running on multiple processes. We also proposed two approaches to tracking in buckets, LFDA and Neural Nets, which both showed promising accuracy results in finding tracks, with up to 95% tracking efficiency. A future research direction is to extend the similarity learning to quadruplets and ultimately to the full bucket.

References

[1] Mankel, Rainer. "Pattern recognition and event reconstruction in particle physics experiments." Reports on Progress in Physics 67.4 (2004): 553.
[2] Arya, Sunil, et al. "An optimal algorithm for approximate nearest neighbor searching fixed dimensions." Journal of the ACM (JACM) 45.6 (1998): 891-923.
[3] Rousseau, David, et al. "The TrackML challenge." NIPS 2018 - 32nd Annual Conference on Neural Information Processing Systems. 2018.
[4] Aumüller, Martin, Erik Bernhardsson, and Alexander Faithfull. "ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms." International Conference on Similarity Search and Applications. Springer, Cham, 2017.
[5] Sugiyama, Masashi. "Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis." Journal of Machine Learning Research 8.May (2007): 1027-1061.
[6] Davis, Jason V., et al. "Information-theoretic metric learning." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.