Muon Identification Using Deep Neural Networks with the Muon Telescope Detector at STAR
J. D. Brandenburg∗,a,b, Frank Geurts,a
a Rice University, b Brookhaven National Lab
Abstract
The installation of the Muon Telescope Detector opened new possibilities for studying dimuon production at STAR. However, backgrounds from hadron punch-through and weak decays of pions and kaons make the identification of primary muons challenging. In this paper we present a study of shallow and deep neural networks trained as classifiers for the purpose of muon identification using information from the Muon Telescope Detector at STAR. The performance of shallow neural networks is presented as a function of the number of neurons in their hidden layer. A hyperparameter optimization for determining the optimal deep neural network classifier architecture is presented. The optimized deep neural network is compared with shallow neural networks, boosted decision trees, likelihood ratios, and traditional cut-based PID techniques. The superiority of the deep neural network based muon identification technique is demonstrated and compared with traditional PID through the measurement of the φ meson and the ψ(2S) in p+p collisions at √s = 200 GeV. The deep neural network based PID simultaneously provides higher signal efficiency, signal-to-background ratio, and significance of the φ peak compared to traditional PID techniques. Finally, a deep neural network assisted technique for measuring the muon purity in data is presented and discussed.

Keywords: muon identification, shallow neural networks, deep neural networks, multivariate classifiers, STAR, Muon Telescope Detector
Preprint submitted to Nucl. Instrum. Meth. A, August 16, 2019
∗ Corresponding author.
E-mail address: [email protected] (J. D. Brandenburg).
1. Introduction
In 2014 the Solenoidal Tracker at RHIC (STAR) completed its installation of the Muon Telescope Detector (MTD). The MTD has made muon identification over a large momentum range possible for the first time at STAR. However, even with the MTD, identification of pure muons can be challenging due to backgrounds from hadron punch-through. The identification of dimuon pairs is further obscured by secondary muons originating from the weak decays π → µ + ν and K → µ + ν. We are motivated to explore the possible improvements over traditional techniques in single muon identification and muon pair identification that can be obtained by employing modern supervised learning algorithms.

In this paper we explore classification techniques using artificial neural networks (ANN) for improving muon identification using the information provided by the MTD at STAR. In Sect. 2, a brief description of the relevant STAR subsystems is provided and the variables used for muon identification are defined. In Sect. 3, the dataset details are provided and the procedure used to generate the training samples is described. In Sect. 4, the use of ANN classifiers for muon identification is explored. Both shallow and deep neural networks are compared, and the techniques used to determine the optimal deep neural network architecture are discussed and presented. In Sect. 5, the performance of the DNN based muon identification is compared with that of traditional techniques in p+p collisions at √s = 200 GeV. In this section, the use of the trained DNN for data-driven muon purity measurements is also presented. Finally, a summary is presented in Sect. 6.

2. STAR Detector

The STAR detector is a multi-purpose detector designed with large, uniform acceptance in 0 < φ < 2π and |η| < 1. The relevant STAR subsystems used for this study are the Time Projection Chamber (TPC), the magnet system, the Time-of-Flight (TOF) detector, and the Muon Telescope Detector (MTD) [1–3]. The TPC provides charged-particle tracking and particle identification via ionization energy loss (dE/dx) measurements. The TPC sits within a 0.5 T magnetic field, allowing the charge (q) and transverse momenta (p_T) of tracks to be measured from the curvature of their trajectories. The TPC covers 2π in azimuth and approximately |η| < 1 for collisions at the center of the detector, and provides a momentum measurement with percent-level resolution at mid-rapidity.

The Time-of-Flight (TOF) detector is installed outside the TPC at a radius of 210 cm and provides precise timing information, with a timing resolution of ∼90 ps in heavy-ion collisions [4]. The TOF detector covers 2π in azimuth and approximately |η| < 0.9. The MTD is installed outside the magnet steel and covers ∼45% in the azimuthal direction for |η| < 0.5. It measures the timing (σ ≈ 100 ps) and position of hits from tracks that penetrate the magnet steel.

Figure 1: A schematic of an MTD module. The strips are 87 cm long and run along the local z axis. Each module contains 12 strips along the local y axis. Each strip is 3.6 cm wide with a space of 0.6 cm between strips.

Double-ended readout allows the local Z position of hits to be measured via the difference in time between the two ends of a strip. Within each module, the local Y position of a hit is measured by determining which of the 12 strips within the module registered the hit. ∆Z and ∆Y are the residuals between the measured local positions and the projected positions in the local Z and Y directions, respectively. Figure 1 shows a schematic of the local MTD module coordinates and the ∆Z and ∆Y calculation. The full list of variables used in this study for muon identification is:

• ∆TOF – the difference between the time-of-flight calculated using a muon hypothesis and the time-of-flight measured by the MTD.
• ∆Z – the difference between the local Z position calculated using a muon hypothesis and the position measured by the MTD.
• ∆Y – the difference between the local Y position calculated using a muon hypothesis and the position measured by the MTD, taken from the center of the matched strip.
• cell – the geometric strip index, ranging from 0 to 11, with 0 and 11 at the outside edges of each module. The average amount of steel between the interaction point and the MTD module is lowest at the edges.
• module – the geometric module index, ranging from 0 to 4.
• backleg – the geometric backleg index, ranging from 0 to 29. The amount of material between the interaction point and the MTD backlegs varies as a function of backleg, since the detector is not fully symmetric in the φ direction.
• nσπ – the dE/dx information measured by the TPC. For simplicity, the value normalized by the expectation for the π and corrected for detector resolution is used; the value of nσπ for muons is on average ∼+0.5.
• DCA – the distance of closest approach of the track to the primary collision vertex.
• p_T – the transverse momentum of the track. The ∆TOF, ∆Y, and ∆Z resolutions depend strongly on p_T.
• q – the track charge, measured from the curvature of the trajectory.

These variables are used as the inputs when training the neural network classifiers in Sect. 4.
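As a concrete illustration, the variables above can be packed into a fixed-order feature vector before being handed to any classifier. The sketch below is a minimal Python example; the field names and the example values are our own stand-ins, not taken from the STAR software.

```python
import numpy as np

# Hypothetical single reconstructed track; field names mirror the PID
# variables listed above (the values are illustrative only).
track = {
    "dTOF": 0.12,    # ns: muon-hypothesis TOF minus measured MTD TOF
    "dZ": 3.5,       # cm: local-Z residual
    "dY": -2.1,      # cm: local-Y residual
    "cell": 5,       # strip index, 0-11
    "module": 2,     # module index, 0-4
    "backleg": 17,   # backleg index, 0-29
    "nSigmaPi": 0.6, # normalized dE/dx
    "dca": 0.8,      # cm
    "pT": 1.9,       # GeV/c
    "charge": +1,
}

FEATURES = ["dTOF", "dZ", "dY", "cell", "module", "backleg",
            "nSigmaPi", "dca", "pT", "charge"]

def feature_vector(trk):
    """Order the PID variables into the fixed input layout a classifier expects."""
    return np.array([trk[name] for name in FEATURES], dtype=float)

x = feature_vector(track)
```

Keeping a single canonical feature ordering like this avoids silent mismatches between the training and evaluation stages.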
3. Dataset and Training Samples
The data used for this study were collected by the STAR detector from p+p collisions at √s = 200 GeV during the 2015 RHIC run. The events were selected using the dimuon trigger, which requires that at least two MTD signals be measured within a timing window. The primary vertex of each event was required to be within ±100 cm of the center of the detector along z. In total, the dimuon trigger recorded 300M events, corresponding to a total sampled luminosity of 122 pb⁻¹ [5].

Figure 2: Simulated MTD cell (a) and ∆Z (b) distributions for signal and background sources. The effect of varying amounts of steel in the φ direction can be clearly seen in the cell distribution: hadrons are significantly more likely to punch through the steel guarding the edge cells (at 0 and 11, respectively) than the central cells.

Muon candidate tracks were required to have a minimum p_T, to have a distance of closest approach (DCA) to the collision vertex below a maximum value, and to have a sufficient number of TPC hits in the dE/dx measurement to ensure a reasonable dE/dx resolution. Finally, muon candidate tracks were required to project to an active MTD volume and to be matched to MTD hits that fired the trigger.
In Sect. 4, the training and use of ANNs to perform a two-class classification distinguishing signal muons from various types of backgrounds is discussed. This type of ANN based classification is an example of supervised learning and therefore requires labeled datasets for the training phase. A Monte Carlo (MC) simulation procedure is used to generate the labeled signal and background datasets needed to train the supervised learning algorithms discussed in Sect. 4. We define our signal class as primary muon tracks, i.e. those originating from the primary interaction vertex. In contrast, the background class includes all other sources of tracks that match to a hit in the MTD and result in a reconstructed track in the tracker. The main sources of background are:

• punch-through hadrons: e.g., π±, K±, and p/p̄
• charged-pion weak decays: π → µ + ν
• charged-kaon weak decays: K → µ + ν

The procedure used to forward model the signal and backgrounds consists of three main steps: a kinematic event generator, a simulation of the STAR detector, and a full event reconstruction. First, events are generated with kinematics approximating a p+p collision at √s = 200 GeV. Each track in the event is randomly chosen to be a µ, π, K, or p. The kinematics of each particle are sampled from flat distributions in p_T, |η| < 0.8, and −π < φ < π. The particle species and kinematics are then fed into a GEANT3 [7] based simulation of the full STAR geometry. The GEANT3 simulation performs decays of unstable particles and models the energy loss of particles traversing media and their interactions with detector materials. Finally, full event reconstruction is performed on the result of the GEANT3-based simulation. This step performs charged-particle reconstruction using the simulated hits in the TPC, determines the event's primary interaction vertex, and computes the dE/dx of the reconstructed tracks. After tracking is complete, the tracks are matched to the simulated MTD hits. The result of this simulation is a set of the PID variables for each of the signal and background processes. Examples of the MTD cell and ∆Z variables are shown for signal and background in Figs. 2a and 2b, respectively.

∆TOF distributions from Data
A data-driven approach is employed to determine the MTD ∆TOF distributions separately for the signal and background classes. For this procedure, 1D cuts are applied to all PID variables except the ∆TOF. With the cuts listed in Table 1, a relatively pure J/ψ sample can be obtained. Figure 3a shows the unlike-sign and like-sign distributions near the J/ψ mass after applying the cuts listed in Table 1. Daughter tracks from the J/ψ are used to extract the ∆TOF probability distribution function (PDF) for signal. Specifically, the signal PDF is extracted from the J/ψ mass peak (3.0 < M < 3.2 GeV/c²), with the background under the peak estimated using the like-sign pairs in the same mass region. The ∆TOF from the like-sign background is properly scaled and subtracted from the peak region to remove background contributions. The signal ∆TOF PDF is shown in Fig. 4. The background ∆TOF PDF is extracted from tracks passing an inverted set of cuts meant to exclude all signal muons. These cuts are shown in the right-hand column of Table 1.

Table 1: Cuts used for determining the signal and background ∆Time-of-Flight PDFs. The J/ψ selection includes 3.0 < M_µµ < 3.2 GeV/c², a DCA cut, −1 < nσπ < 3, |∆Y| < 3σ (+0.5σ for p_T > 3 GeV/c), |∆Z| < 3σ (+0.5σ for p_T > 3 GeV/c), and p_T^leading > 1.5 GeV/c.

The background ∆TOF distribution is further separated into the contributions from π, K, and p using timing information from the TOF detector. The sub-sample of tracks that match to both the MTD and TOF is used to extract the MTD ∆TOF distribution for π, K, and p separately. The β⁻¹ = c/v distribution measured by the TOF detector is shown in Fig. 3b for all background tracks matched to both the MTD and TOF. In this figure there are clear β⁻¹ bands corresponding to pions, kaons, and protons. The MTD ∆TOF distributions for these three species were extracted by selecting around a given β⁻¹ band.

Figure 3: The invariant mass distribution for unlike-sign and like-sign pairs near the J/ψ mass (a), obtained by cutting on all MTD PID variables except the ∆TOF distribution; a p_T^leading > 1.5 GeV/c cut is applied to further improve the purity in the J/ψ mass region (signal region: 3.0 < M < 3.2 GeV/c²). The β⁻¹ vs. momentum distribution for all tracks passing basic QA cuts that are matched to hits in the MTD and the BTOF detectors (b); the β⁻¹ calculated from the BTOF information shows clear contributions from π, K, and p/p̄.

Figure 4: The ∆TOF distributions for µ± from J/ψ, π±, K±, and p/p̄ in Run 15 p+p collisions at √s = 200 GeV.

K_S → π+π− and φ → K+K− Decays
Selecting K_S → π+π− decays in data provides a π±-enhanced sample that can be used to test the validity of the MC simulation procedure for the π± background sources. The selection of K_S candidates is carried out by applying the topological selection cuts listed in Table 2. In order to increase the available statistics for the comparison, only one of the K_S daughters is required to have a matching hit in the MTD. Figure 5a shows the π+π− invariant mass distribution near the K_S mass used to select π± daughter tracks. The π± ∆Y, ∆Z, and cell distributions are computed using the unlike-sign distribution minus the scaled like-sign distribution for each variable in the K_S mass region (497 ± 25 MeV/c²).

Distributions with an enhanced kaon yield can be selected from the daughters of φ → K+K− decays. The K+K− invariant mass distribution around M_φ is shown in Fig. 5b for the case in which one track is matched to an MTD hit. The K± ∆Y, ∆Z, and cell distributions are computed using the unlike-sign distribution minus the scaled like-sign distribution for each variable in the φ mass region around M_φ = 1.019 GeV/c².

Table 2: Cuts used to select K_S → π+π− decays, including the pair mass window (starting at 0.472 < M_ππ), a minimum decay length, a p_T-dependent pointing requirement, and a |nσπ| cut. The daughter pions provide a π-enhanced sample that can be compared to the π Monte Carlo simulation.

Figure 5: The M_π+π− distribution near the K_S mass for the case in which only one track is matched to an MTD hit (a), and the M_K+K− distribution near the φ mass for the same matching requirement (b); the unlike-sign and scaled like-sign distributions define the signal and background regions.

The comparisons between the ∆Y, ∆Z, and MTD cell distributions from MC and data for π± and K± tracks are shown in Figs. 6a and 6b. The data/simulation ratios show that the ∆Y, ∆Z, and MTD cell distributions agree well within the precision of the comparison.
4. Training and Evaluation of Neural Networks
Figure 6: The ∆Y (circles) and ∆Z (stars) data/simulation ratios for both π± (open) and K± (closed) (a). The MTD cell data/simulation ratio for both π± and K± (b).

In this section, dense multilayer perceptrons (MLP), a type of feed-forward ANN, are trained as continuous classifiers for the purpose of muon identification. First, shallow artificial neural networks (SNN) are discussed. A shallow artificial neural network is defined by the presence of a single hidden layer of neurons between the input and output layers. The universal approximation theorem [8, 9] states that a feed-forward ANN with certain activation functions and at least one hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of Rⁿ. However, the universal approximation theorem makes no claim about the size of the hidden layer required to approximate a given function. In practice, the number of neurons in the hidden layer (N_H) may need to be intractably large to approximate the desired function with acceptable error. In addition, with an increasing number of neurons, the risk of over-training can increase, resulting in a model capable of representing the input data with small error but with very poor generalization performance.

In this section, an exploration of the performance of a large set of SNNs as a function of the number of neurons in their hidden layer is presented. The models are trained using the Toolkit for Multivariate Data Analysis with ROOT (TMVA) [10]. Table 3 lists the parameters used in the training phase for all models. Each model is trained on a random subset of 100K signal events and 100K background events. A disjoint testing sample is drawn from 250K signal and background events.

Figure 7: An example of a dense multilayer perceptron neural network architecture, with inputs p_T, charge, DCA, nσπ, MTD ∆Y, MTD ∆Z, MTD ∆TOF, MTD cell, MTD backleg, and MTD module, plus a bias term. The shallow neural networks have only a single hidden layer of neurons between the input and output layers; the deep neural networks have two or more. Bias neurons in the hidden layers are marked with a "B".
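To make the architecture concrete, the forward pass of such a dense MLP, using the "sum" input function and tanh activation listed in Table 3, can be sketched in a few lines. The weights below are random placeholders, not a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights, biases):
    """Forward pass of a dense MLP: each layer computes tanh(W @ a + b),
    i.e. a weighted sum followed by a tanh activation. The output layer
    is squashed the same way, so the response lies in (-1, 1)."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(W @ a + b)
    return a

n_inputs, n_hidden = 10, 20
# One hidden layer of 20 neurons -> a shallow network; appending more
# (W, b) pairs to these lists would make the network deep.
weights = [rng.normal(size=(n_hidden, n_inputs)) * 0.1,
           rng.normal(size=(1, n_hidden)) * 0.1]
biases = [np.zeros(n_hidden), np.zeros(1)]

x = rng.normal(size=n_inputs)  # one stand-in feature vector
response = mlp_forward(x, weights, biases)
```

The bias vectors play the role of the "B" neurons in Fig. 7: they shift each weighted sum before the activation is applied.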
Figure 8: The signal vs. background rejection power as a function of the number of neurons (N_H) in the hidden layer of a shallow neural network. The performance of the SNNs is quantified using the AUC, the area under the background-rejection vs. signal-efficiency curve (see Sect. 5.1). The points are the mean values of 10 models trained with different random samples. The uncertainties show the ±1σ spread of the models assuming a Gaussian variance.

For each value of N_H, 10 models were trained with different randomized training and testing samples. The performance of each trained SNN is quantified using the area under the curve (AUC) of the background rejection versus signal efficiency distribution (higher is better). The results of the SNN scan are summarized in Fig. 8, where the AUC is shown as a function of N_H. Each point shows the mean response of 10 models, with uncertainties that show the 1σ variation between the responses of the 10 models assuming a Gaussian variance. The background rejection power of the SNN shows clear improvement as N_H is increased up to N_H ≈ 30. Above N_H ≈ 30, adding more neurons provides progressively smaller improvements in the background rejection power.
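A scan like this can be reproduced in spirit with any ML toolkit. Below is a small sketch using scikit-learn in place of TMVA, on a synthetic two-Gaussian dataset standing in for the labeled signal and background samples; the AUC values it prints are illustrative only, not the values in Fig. 8.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic stand-in for the labeled samples: two overlapping Gaussian
# blobs in a 10-dimensional feature space (background=0, signal=1).
n = 2000
X = np.vstack([rng.normal(0.0, 1.0, size=(n, 10)),
               rng.normal(0.7, 1.0, size=(n, 10))])
y = np.r_[np.zeros(n), np.ones(n)]

# Shuffle, then hold out a disjoint testing sample.
idx = rng.permutation(2 * n)
X_train, y_train = X[idx[:3000]], y[idx[:3000]]
X_test, y_test = X[idx[3000:]], y[idx[3000:]]

for n_hidden in (2, 10, 30):
    clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), activation="tanh",
                        max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"N_H = {n_hidden:3d}: AUC = {auc:.3f}")
```

Averaging several such trainings per N_H value, as done for Fig. 8, additionally quantifies the run-to-run spread from the random initialization and sampling.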
Deep neural networks (DNN), in contrast to SNNs, which contain only a single hidden layer, contain two or more hidden layers. The additional hidden layers can allow a network to learn complex relationships between input features with far fewer neurons and connections than a shallow network would need. Depending on the application, it is also common for DNNs to combine various types of layers, such as convolutional layers, to promote the learning of specific types of relationships.

Table 3: Parameters used in the training phase for the shallow and deep neural networks.

Neuron activation function: tanh
Estimator type: mean square
Neuron input function: sum
Training method: back-propagation
Learning rate: 0.02
Decay rate: 0.01
Learning mode: sequential

The optimal DNN architecture was determined by scanning the hyperparameters against three criteria:

• signal vs. background rejection power;
• prefer the simplest NN architecture (fewer neurons is better and fewer hidden layers is better);
• prefer a monotonically increasing S/B as a function of NN response.

These three criteria are considered to determine the optimal set of DNN hyperparameters. Each DNN was trained using the parameters listed in Table 3, with only the architecture-related parameters varying. Training DNNs can require significantly more time and larger labeled samples compared to SNNs to reach convergence. The DNNs were trained with 1M signal and 1M background events and took between 10 and 100 times longer to train than the set of SNNs, depending on the specific architecture. However, the time cost required to train DNNs can be greatly reduced by employing modern libraries like TensorFlow that have been heavily optimized for parallelized network training on GPUs [13].
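For comparison, a deep architecture with a few small hidden layers (e.g. three layers of 14 neurons, as quoted for the DNN in Fig. 9) can be trained the same way as a shallow one. This is again a scikit-learn sketch on synthetic data, not the TMVA training actually used for the paper's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)

# Synthetic labeled sample standing in for the simulated tracks.
n = 2500
X = np.vstack([rng.normal(0.0, 1.0, size=(n, 10)),
               rng.normal(0.6, 1.2, size=(n, 10))])
y = np.r_[np.zeros(n), np.ones(n)]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# A shallow 1x20 network versus a deep 3x14 network; tanh matches the
# activation listed in Table 3.
for name, layers in [("shallow 1x20", (20,)), ("deep 3x14", (14, 14, 14))]:
    clf = MLPClassifier(hidden_layer_sizes=layers, activation="tanh",
                        max_iter=600, random_state=0)
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

On real, correlated PID features the deep network's advantage is typically larger than on a toy dataset like this, since the extra layers can build composite features out of the raw inputs.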
5. Results and Applications
In the previous section, neural networks were trained as classifiers for the purpose of separating signal muons from various background sources. The performance of the neural network based classifiers is compared using modified receiver operating characteristic (ROC) curves in Fig. 9, plotting the background rejection power (1 − ε_bg) vs. the signal efficiency (ε_sig). The performance of a classifier can be succinctly summarized with the area under the curve (AUC) of the background rejection vs. signal efficiency curve. An ideal classifier is able to reject 100% of the background while providing 100% signal efficiency and has an AUC of 1. On the other hand, a random-guess classifier has a 50/50 chance of correctly guessing the class and has an AUC of 0.5.

Figure 9: The background rejection (1 − ε_bg) versus the signal efficiency (ε_sig) for several different multivariate classifiers and traditional 1D cuts: ideal classifier (AUC = 1.0), traditional 1D cuts (AUC = 0.661), 1D likelihood ratios (AUC = 0.826), shallow neural network with HL = 20 (AUC = 0.907), boosted decision trees with N = 250 (AUC = 0.948), and deep neural network with HL = 3×14 (AUC = 0.969).

The neural network classifiers shown in Fig. 9 are also compared with classifiers employing optimized 1D cuts, 1D likelihood ratios, and boosted decision trees (BDTs). The cuts used in the 1D cut classifier were optimized on the J/ψ peak in p+p collisions at √s = 200 GeV. Both the 1D likelihood ratio classifier and the BDTs were trained using the TMVA package. The 1D likelihood ratio classifier was trained with default parameters, using spline interpolation when building the feature PDFs. The track p_T and charge (q) variables were removed from the 1D likelihood classifiers since they should not be used directly for muon identification. Additionally, since 1D likelihoods cannot properly incorporate the p_T dependence of the ∆TOF, ∆Y, and ∆Z features, the 1D likelihood classifier was evaluated only for tracks in a narrow p_T range around 1.4 GeV/c. A more thorough look at using likelihood ratios for muon identification with the MTD can be found in [3]. The BDT classifier was trained with NTrees = 250 and MaxDepth = 5, with all other parameters set to the defaults.
Figure 10: Raw yield extraction of the φ meson using optimized traditional 1D PID techniques, for Run 15 p+p collisions at √s = 200 GeV with |y_µµ| < 0.5, |η_µ| < 0.5, and p_T^µ > 1.1 GeV/c: N_raw^φ ≈ 244.8, mass = 1.015 GeV/c², width = 0.014 GeV/c², S/B = 0.191, S/√(S+B) = 6.270.

Figure 11: Raw yield extraction of the φ meson using the DNN-based PID, for the same selection: N_raw^φ ≈ 281.3, mass = 1.016 GeV/c², width = 0.014 GeV/c², S/B = 0.336, S/√(S+B) = 8.407.

5.2. Muon Identification in Data

The DNN classifier out-performed the other multivariate classifiers investigated in Sect. 4, based on an analysis of the background rejection power vs. signal efficiency evaluated on a testing sample of simulated events. We can further test the performance of the DNN classifier by applying it to the dimuon data collected from p+p collisions at √s = 200 GeV. The decays of resonances to muons, like the φ → µ+µ− decay, provide a self-analyzing set of data for testing muon identification techniques. Muon pairs are selected in the data by first evaluating the DNN response for all muon candidates in an event. Pairs are then formed from oppositely charged muons. Signal pairs are selected based on the pair DNN response r_pair:

r_pair = √(r_a² + r_b²)    (1)

where r_a and r_b are the DNN responses for the paired muons a and b, respectively. The DNN was specifically optimized to promote a response of r ≈ 1 for signal and r ≈ 0 for background, so the maximum response for a µ+µ− pair is r_pair ≈ √2. The optimal r_pair cut for selecting φ → µ+µ− decays was determined by maximizing the φ significance (S/√(S+B)) in steps of r_pair = 0.01. The signal and background contributions were extracted by fitting the raw µ+µ− invariant mass spectrum in the φ mass region (M_µµ > 0.85 GeV/c²). A 4th-order polynomial was used to model the background and a Gaussian was used for the φ meson peak. The optimal cut was found to be r_pair > 1.36, which provides a φ meson significance of ∼8.4 and an S/B ratio of 0.33. Figures 10 and 11 show the raw φ meson yield extraction fits using the traditional 1D cuts optimized on the J/ψ and using the DNN-based muon identification, respectively. The DNN-based muon identification simultaneously provides higher
S/B ratio, significance, and signal efficiency compared to the optimized 1D muon identification. In Fig. 12, the raw µ+µ− invariant mass spectrum is shown for optimized 1D cut-based muon identification and compared with the DNN-based muon identification.

Figure 12: Comparison of the raw M_µµ invariant mass distribution using optimized 1D cut-based muon identification versus the DNN-based muon identification, for Run 15 p+p collisions at √s = 200 GeV with p_T^µ > 1.0 GeV/c, |η_µ| < 0.5, and |y_µµ| < 0.5 (STAR preliminary). The distributions are scaled to each other in an intermediate-mass window starting at M_µµ = 1.5 GeV/c² to make the comparison easier.

In addition to improving the S/B and significance of the ω and φ mesons, the DNN-based muon identification allows the ψ(2S) to become visible.

Since no individual feature among the set of PID features clearly separates signal from background contributions, it is not possible to fit any one of the features in order to extract the muon purity of tracks in data. Given the signal and background PDFs for each of the 8 PID features (neglecting p_T and q), one could in principle conduct a simultaneous fit to all 8 distributions in order to extract the yields of the signal and background contributions. Since each distribution would need to be fit with µ, π, K, and p contributions, this would require simultaneously fitting 8 distributions with 32 templates constrained by 4 free yield parameters. While possible, in practice a simultaneous fit with so many distributions and templates is technically challenging and often proves unstable.

Instead, the complexity of the problem can be greatly reduced by simply fitting the DNN response for muon candidates with the template shapes for the signal and background components. Since the DNN combines all PID features into a single response, only a single distribution needs to be fit, with the 4 template shapes for the signal and background species each carrying a free yield parameter. Figure 13 shows the result of this procedure applied to muon candidate tracks in the range 1.5 < p_T < 1.55 GeV/c. The template for each component is computed by evaluating the DNN on simulated tracks in the same kinematic region as those in the data. The data/fit ratio shown in the lower panel of Fig. 13 shows that the fit is capable of describing the DNN response for muon candidates to within ∼20% over the entire range of DNN responses.

After determining the yield of each signal and background contribution, the DNN response can be projected back onto all of the 8 PID features to verify that the DNN is properly combining the information from all variables. Ensuring that the projection onto each PID feature results in a good description of the data is a strong demonstration that the DNN is not over-training on artifacts in the training samples. Projections onto the ∆Z and DCA features are shown in Figs. 14a and 14b. This technique allows the increased signal vs. background separation power provided by the DNN-based muon identification to be leveraged for data-driven muon purity measurements. At the same time, the ability to project the muon purity fit results back onto the PID features provides a data-driven strategy to test for over-training and poor model generalization.
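A template fit of this kind can be prototyped compactly. The sketch below uses non-negative least squares on binned histograms as a simple stand-in; the template shapes, the yields, and the choice of `nnls` are all illustrative assumptions, not the fit actually performed on the STAR data (a binned Poisson-likelihood fit would be the more rigorous choice).

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)
nbins = 50
edges = np.linspace(0.0, 1.0, nbins + 1)

def template(loc, scale):
    """Build a unit-area DNN-response template from toy simulated tracks.
    The locations/widths below are made up for illustration."""
    samples = np.clip(rng.normal(loc, scale, 200_000), 0.0, 1.0)
    h, _ = np.histogram(samples, bins=edges)
    return h / h.sum()

# Hypothetical templates: signal muons peak near a response of 1,
# hadron backgrounds pile up near 0.
T = np.column_stack([template(0.90, 0.15),   # mu
                     template(0.15, 0.15),   # pi
                     template(0.30, 0.20),   # K
                     template(0.20, 0.15)])  # p

# Toy "data": a Poisson fluctuation around a known yield mixture.
true_yields = np.array([3300.0, 5300.0, 900.0, 400.0])
data = rng.poisson(T @ true_yields)

# Solve min ||T @ yields - data|| subject to yields >= 0.
fitted, _ = nnls(T, data.astype(float))
print(dict(zip(["mu", "pi", "K", "p"], np.round(fitted))))
```

Because the templates are normalized to unit area, the fitted parameters come out directly as per-species yields; dividing by their sum gives the purities. Note that strongly overlapping background templates (as for π and p here) can trade yield against each other, which is why projecting the result back onto the individual PID features is a valuable cross-check.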
Figure 13: The top panel shows the DNN response for muon candidates in the range 1.5 < p_T < 1.55 GeV/c. A template fit is conducted to extract the contributions from µ (red), π (blue), K (orange), and p (magenta); the fitted yields are µ: 0.334 ± 0.017, π: 0.532 ± 0.014, K: 0.086 ± 0.004, and p: 0.037, with χ²/NDF = 244.46/106 = 2.31. The lower panel shows the ratio of the data over the sum of the contributions.

Figure 14: The result of the DNN response fit for the µ, π, K, and p contributions projected back onto the ∆Z (a) and DCA (b) distributions. The ratio of fit over data is shown in the lower panels of each figure.

6. Summary

The installation of the Muon Telescope Detector has made muon identification possible at STAR over a large p_T range. With only a single layer of steel acting as a hadron absorber, backgrounds from hadron punch-through and weak decays make primary muon identification challenging. Several quantities measured by the STAR tracker and MTD are used to train shallow and deep neural network classifiers for the purpose of muon identification. The deep neural network classifier out-performed the other multivariate classifiers investigated in Sect. 4, based on an analysis of the background rejection power vs. signal efficiency evaluated on a testing sample of simulated events. When applied to dimuon-triggered p+p collisions at √s = 200 GeV, the DNN-based PID simultaneously provides higher S/B ratio, significance, and efficiency for the φ-meson yield extraction. At higher masses, the DNN-based muon identification makes the ψ(2S) state significantly more visible in the raw M_µµ distribution compared to optimized 1D cut-based muon identification. Finally, an application of the trained DNN for data-driven muon purity measurements is presented.
7. Acknowledgements
We thank the STAR Collaboration for the use of the experimental data shown in this paper and for the operation of this system during RHIC running periods as part of STAR standard shift crew operations. This work was funded by the U.S. DOE Office of Science under contract No. DE-FG02-10ER41666.
References

[1] K. H. Ackermann et al. (STAR Collaboration), Nucl. Instr. and Meth. A 499 (2003) 624.
[2] M. Anderson et al. (STAR Collaboration), Nucl. Instr. and Meth. A 499 (2003) 659–678.
[3] T. Huang, R. Ma, B. Huang, et al., Nucl. Instr. and Meth. A (2016) 88–93.
[4] W. Llope, F. Geurts, J. Mitchell, et al., Nucl. Instr. and Meth. A (2004) 252–273.
[5] T. Todoroki, Nucl. Phys. A 967 (2017) 572–575.
[6] C. Yang, X. J. Huang, C. M. Du, et al., Nucl. Instr. and Meth. A (2014) 1–6.
[7] R. Brun, A. C. McPherson, P. Zanarini, et al., GEANT3, CERN Program Library Long Writeup W5013.
[8] C. Debao, Approx. Theory its Appl. (1993) 17–28.
[9] K. Hornik, M. Stinchcombe, H. White, Neural Networks 2 (1989) 359–366.
[10] J. Therhaag, AIP Conf. Proc. (2012) 1013–1016.
[11] B. Efron, J. Am. Stat. Assoc. (1987) 171–185.
[12] B. Efron, Ann. Stat. 7 (1979) 1–26.
[13] M. Abadi, A. Agarwal, P. Barham, et al., TensorFlow: Large-scale machine learning on heterogeneous systems.