Timing and characterization of shaped pulses with MHz ADCs in a detector system: a comparative study and deep learning approach
Prepared for submission to JINST
Pengcheng Ai,a Dong Wang,a,1 Guangming Huang,a Ni Fang,a Deli Xu,a Fan Zhang b

a Central China Normal University, No.152 Luoyu Road, Wuhan, Hubei 430079, P.R. China
b Hubei University of Technology, No.28 Nanli Road, Wuhan, Hubei 430068, P.R. China

1 Corresponding author.
E-mail: [email protected]
Abstract: Timing systems based on Analog-to-Digital Converters are widely used in the design of previous high energy physics detectors. In this paper, we propose a new method based on deep learning to extract the time information from a finite set of ADC samples. Firstly, a quantitative analysis of the traditional curve fitting method regarding three kinds of variations (long-term drift, short-term change and random noise) is presented with simulation illustrations. Next, a comparative study between curve fitting and the neural networks is made to demonstrate the potential of deep learning in this problem. Simulations show that the dedicated network architecture can greatly suppress the noise RMS and improve timing resolution in non-ideal conditions. Finally, experiments are performed with the ALICE PHOS FEE card. The performance of our method is more than 20% better than curve fitting in the experimental condition.

Keywords: Analysis and statistical methods; Pattern recognition, cluster finding, calibration and fitting methods; Front-end electronics for detector readout; Timing detectors

1 Introduction

Pulse timing is a common problem in high energy physics [1], optics [2], telecommunication [3] and many other applied physics disciplines. Among feasible methods, fast electronic readout systems provide a cost-effective and robust solution with relatively high timing resolution. In many engineering circumstances, we care more about availability and practicality than technical indicators, and electronic timing systems are usually good candidates for these applications.

In high energy physics, accurate timing, along with energy and position information, is needed to reconstruct collision events so as to discriminate against backgrounds [4] and identify phenomena of interest [5]. Several kinds of detectors can provide the time information. For example, Time-of-Flight detectors can measure the time of incoming events directly; Time Projection Chambers (TPC) and calorimeters can measure the pulse signal and infer the time afterwards; silicon detectors and pixel sensors can measure the hit information and offer an auxiliary time stamp, and so on. The final reconstructed event is a combination and coincidence of multiple sources of detectors.

There are two major branches of timing systems: systems based on Analog-to-Digital Converters (ADC) and systems based on Time-to-Digital Converters (TDC). In general, TDC-based systems are specialized in time measurement and can achieve a precision of tens of picoseconds [6] when configured properly. In spite of their high precision, the major drawback of TDC-based systems is that they lack the amplitude information which is critical in some applications. If both time and amplitude are of interest, ADC-based systems are good alternatives to TDC-based systems. The empirical timing precision of ADC-based systems is on the order of nanoseconds.

For ADC-based systems, a typical work flow can be described as follows. The original signal from TPCs or calorimeters is preprocessed by Charge Sensitive pre-Amplifiers (CSA) to get a step-like signal. Afterwards, this signal is fed to Front-End Electronics (FEE). The signal conditioning on the FEE board includes buffering, amplifying and bandpass filtering by CR-RC^n shapers. Finally, the signal is sampled by ADCs with the prescribed precision and data depth.
The recorded ADC samples can serve multiple purposes. For a classification task, the shaped pulse signal can be used to discriminate between particles or physical events [7-10]. For a regression task, timing or other pulse information is extracted from the digitized pulse signal [11].

To obtain the time from a finite set of ADC samples, we can use an estimated fitting function and perform curve fitting to get estimated values of the underlying parameters. Curve fitting is a standard inference method in the time domain and it shows promising properties under certain conditions (see section 3.1.2). However, its applicability and accuracy rely heavily on the fitting function and the ideal form of noise. As a result, the actual performance of curve fitting is limited by the experimental conditions of ADC-based systems [12].

Recently, deep learning [13] as a renewed machine learning technique has progressed rapidly. It has been successfully used for particle/event discrimination and identification at the pulse level [14], the pixel level [15] and the voxel (three-dimensional) level [16]. In view of the fact that neural networks are applicable to classification tasks as well as regression tasks, it is meaningful to explore the capability of deep learning in the above-mentioned pulse timing problem.

In this paper, we mainly discuss the deep learning approach to pulse timing based on a comparison between curve fitting and the proposed method. Section 2 briefly introduces the project background and the mathematical form of the researched pulse. Section 3 explains the traditional curve fitting method by theoretical analysis and simulation studies. Section 4 gives a comparative study and the details of the new deep learning approach. Section 5 discusses the experiments we conduct and shows the experimental results. Finally, a conclusion is drawn in section 6.
2 Background

The ALICE PHOS detectors [17] refer to the Photon Spectrometers designed for the ALICE experiment [18]. The detectors were produced in 2007 and scheduled for the first p+p collisions at LHC in 2008 [19]. The scintillator is made of lead tungstate crystals and is mainly used to detect high energy photons (up to 80 GeV). An Avalanche Photo-Diode (APD) receives the scintillation light and converts it to an electrical signal, which is applied to a CSA near the APD. The output of the CSA is connected to the FEE card via a flat cable.

The FEE card has 32 independent readout channels, each of which is connected to two shaper sections with high gain and low gain. The CR-RC^2 signal shapers are made up of discrete components on a 12-layer Printed Circuit Board (PCB). For each channel, there are two overlapping 10-bit ADCs at the terminations of the two shapers, which give an equivalent dynamic range of 14 bits. The sampling rate of the ADCs is fixed to 10 MS/s. The same readout plan and PCB layout were adopted by the ALICE EMCal detectors [20], the ALICE Electromagnetic Calorimeters. The major difference between the PHOS and EMCal FEE cards lies in the shaping time of the shapers. For PHOS, the designated shaping time is 1 µs; for EMCal, different resistors and capacitors are used to achieve a shaping time of 100 ns.

The CR-RC^2 shaper is a bandpass filter in the frequency domain. In the time domain, its response to an ideal step signal can be formulated as the equation below:

    f(t) = K ((t - t_0)/tau_p)^2 * e^{-2(t - t_0)/tau_p} + b,   for t >= t_0
    f(t) = b,                                                   for t < t_0        (2.1)

where t_0 is the start time and b is the pedestal. K is originally defined as Q*A/C_f, a variable related to the energy of the incoming photon, where Q is the APD charge, A is the shaper gain and C_f is the charging capacitance of the CSA. In our simulations, without changing the nature of the problem, we use K as a normalization factor for numerical purposes. tau_p is the peaking time, defined as the interval between the start of the semi-Gaussian pulse and the moment when f(t) reaches its maximum value. The relation between the shaping time tau and the peaking time tau_p is tau_p = n * tau. For the CR-RC^2 shaper structure, n equals 2, so the peaking times for the PHOS and EMCal are 2 µs and 200 ns, respectively.

Since the CR-RC^n shaper is representative of most applications in high energy physics, in the latter sections we center on the pulse function in equation 2.1 to discuss different timing methods.

3 The curve fitting method

Curve fitting is a traditional model fitting technique mainly aimed at finding parameterized mathematical relations between two or more variables. Classical linear curve fitting can be directly solved by the least squares method, and nonlinear curve fitting can be solved by the trust region and Levenberg-Marquardt methods [21]. In the pulse timing scenario, the main purpose of curve fitting is to determine the desired parameters related to the time information. In the following subsections, we analyze the curve fitting method in terms of its capability to reveal the ground-truth parameters under various conditions.
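As a concrete reference for the analysis below, the pulse function of equation 2.1 can be written in a few lines. This is a minimal sketch in Python, with the parameter names chosen to mirror the notation of the text:

    import numpy as np

    def shaped_pulse(t, K, t0, tau_p, b):
        """CR-RC^2 response to a step input (equation 2.1).

        K     -- normalization factor (related to the deposited energy)
        t0    -- start time of the pulse
        tau_p -- peaking time (tau_p = 2 * tau for a CR-RC^2 shaper)
        b     -- pedestal
        """
        t = np.asarray(t, dtype=float)
        x = (t - t0) / tau_p
        # Before the start time the output sits at the pedestal.
        return np.where(t >= t0, K * x**2 * np.exp(-2.0 * x) + b, b)

Note that the pulse peaks at t = t0 + tau_p with amplitude K * e^{-2} + b, consistent with the definition of the peaking time above.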
3.1 Theoretical analysis

3.1.1 Problem formulation

We consider the following nonlinear least squares problem:

    min_beta S = min_beta sum_{i=1}^{n} r_i^2 = min_beta sum_{i=1}^{n} [y_i - f(t_i; beta, theta)]^2        (3.1)

where S is the sum of squared residuals to minimize, r_i is the i-th residual, y_i is the i-th observed value (from the ADC), and t_i is the i-th time value. There is some noise residing in the observed value y_i, and we denote this noise term as n_i. Besides, beta are the fitting parameters and theta are the system parameters. The division into fitting parameters and system parameters is made according to our understanding of the problem and practical issues. It is not recommended to set two parameters with high correlation as fitting parameters at the same time, as this will cause instability in the fitting process.

It should be noted that the above formulation is a general framework for the fitting problem. Usually we choose a function family f(t; beta, theta_m) for curve fitting. However, f(t; beta, theta_m) is only a subset of the underlying possible functions f(t; beta, theta). We denote the reference fitting function as f(t; beta, theta_0) in section 3.1.2 and section 3.1.3.

3.1.2 The ideal condition

In this part, we assume that the selected fitting function is accurate (i.e. theta is fixed to theta_0 and theta_m = theta_0), and the noise distribution is strictly Gaussian with a fixed variance sigma^2. Under these assumptions, the distribution of the observed value can be written as:

    y_i = f(t_i; beta, theta_0) + n_i ~ N(f(t_i; beta, theta_0), sigma^2)        (3.2)

Since the Gaussian probability density is P(x | mu, sigma) = (1/(sqrt(2*pi)*sigma)) * e^{-(x - mu)^2 / (2*sigma^2)}, the corresponding log-likelihood function is:

    L(y_1, y_2, ..., y_n; beta, theta_0) = ln prod_{i=1}^{n} P(y_i | f(t_i; beta, theta_0), sigma)
                                         = -(1/(2*sigma^2)) sum_{i=1}^{n} [y_i - f(t_i; beta, theta_0)]^2 + const        (3.3)

Equation 3.3 implies that, in the ideal condition, using curve fitting to minimize the sum of squared residuals S is equivalent to maximizing the log-likelihood function of the noise distributions. In other words, curve fitting gives the maximum likelihood estimators of the fitting parameters. This claim reveals the statistical properties of the curve fitting method. It is based on a hypothesis of Gaussian noise distributions, which is a useful prior when our knowledge about the system is limited.

3.1.3 Quantitative analysis of drift, change and noise

In reality, the assumptions in section 3.1.2 are usually not valid. Variations in the fitting function and the noise make the problem much more complicated. In this paper, we consider three types of variations which are representative in high energy physics:
1. Long-term drift. This kind of variation refers to the deviation in the system parameters theta after the circuit board is fabricated. It can also represent a persistent change between two calibration runs. It affects the pulse function consistently, so the event-by-event characteristics of the ADC sampling values stay the same.

2. Short-term change. This kind of variation refers to the deviation in the system parameters theta between two events. It changes according to the current status of the detector, but its effect is near-identical for all ADC sampling values in a single event. In other words, the event-by-event characteristics change during the operation of the experiment.

3. Random noise. This kind of variation refers to the randomized noise n_i residing in the observed value y_i. It varies between ADC samples in a single event. Since it is random, the actual value of the noise is not predictable. However, its statistical features can be determined in advance.

Next, we introduce these variations into the curve fitting. We only consider variations near the reference point, so that the fitting result will not be rejected by the fitting process (i.e. without increasing the chi-square criterion significantly). When the above variations are present, by using a first-order approximation we can formulate y_i as:

    y_i = f(t_i; beta_0, theta_0) + sum_j (d f(t_i; beta_0, theta_0) / d theta_j) * Delta theta_j + n_i        (3.4)

Since we use the reference system parameters in the curve fitting, a non-ideal y_i will cause a change in the fitting parameters. By using the first-order approximation:

    f(t_i; beta, theta_0) = f(t_i; beta_0, theta_0) + sum_j (d f(t_i; beta_0, theta_0) / d beta_j) * Delta beta_j        (3.5)

Curve fitting tries to minimize the sum of squared residuals by varying beta. By applying the first-order necessary condition for a minimum, we get the following equations:

    grad_beta S = grad_beta sum_{i=1}^{n} r_i^2 = grad_beta [ sum_{i=1}^{n} (y_i - f(t_i; beta, theta_0))^2 ] = 0        (3.6)

    (J^T J) Delta beta = J^T (P Delta theta + n)        (3.7)

where

    J_ij = d f(t_i; beta_0, theta_0) / d beta_j,    P_ij = d f(t_i; beta_0, theta_0) / d theta_j

If J^T J is nonsingular, the deviation in the fitting parameters can be solved as:

    Delta beta = (J^T J)^{-1} J^T (P Delta theta + n)        (3.8)

In general, equation 3.8 is a generalization of linear curve fitting to nonlinear cases. It implies that, under first-order approximations, the deviation of the fitting parameters around the reference point is linearly dependent on the deviation of the system parameters and the random noise.

3.2 Simulation studies

To demonstrate the accuracy of first-order approximations for our pulse function, we compare the results of calculating equation 3.8 to the results of directly applying curve fitting. For the pulse function in equation 2.1, we divide the parameters in the following way without inducing a complicated function family:

    beta = {K, t_0},    theta = {tau_p, b}        (3.9)

In the following simulations, the reference system parameters are tau_p = 2.0 and b = 0.1, with K and t_0 fixed at reference values, and the pulse is sampled from t = 0 to t = 3.2 at a period of 0.1, so there are a total of 33 points (time is measured in µs, so the sampling period corresponds to the 10 MS/s of the PHOS ADCs). The value of K ensures that the amplitude is renormalized to a range in the interior of (0, 1). This parameterization is in accord with the PHOS electronics with 1 µs shaping time (section 5.1).
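A minimal sketch of the fitting procedure used in this section, reusing the shaped_pulse helper from section 2 and the parameter division of equation 3.9: only K and t_0 are free, while tau_p and b are held at their reference values. The true values K = 1.0 and t_0 = 0.5 in the usage example are illustrative placeholders.

    import numpy as np
    from scipy.optimize import curve_fit

    TAU_P_REF, B_REF = 2.0, 0.1          # reference system parameters (theta_0)

    def fit_pulse(t_samples, y_samples, K_init=1.0, t0_init=0.0):
        """Nonlinear least squares (equation 3.1) for beta = {K, t0},
        with theta = {tau_p, b} fixed at the reference values."""
        model = lambda t, K, t0: shaped_pulse(t, K, t0, TAU_P_REF, B_REF)
        popt, _ = curve_fit(model, t_samples, y_samples, p0=[K_init, t0_init])
        return popt  # fitted [K, t0]

    # Usage: 33 samples at a 0.1 us period (10 MS/s), as in the simulations.
    t = np.arange(33) * 0.1
    y = shaped_pulse(t, 1.0, 0.5, TAU_P_REF, B_REF) + 0.014 * np.random.randn(33)
    K_fit, t0_fit = fit_pulse(t, y)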
Long-term drift and short-term change

These two kinds of variations are associated with the system parameters theta. We separate tau_p and b and study their influence on the fitting parameters K and t_0 respectively. The simulation results are shown in figure 1. The solid line is calculated from first-order approximations, and the solid dots are generated from curve fitting. It can be seen that in a region near the reference point the first-order approximations are fairly accurate. This is especially true for the (t_0, tau_p) and (K, b) pairs, which have high correlations. In the other two pairs, the discrepancy between first-order approximations and curve fitting is determined by higher-order effects.

Random noise
According to equation 3.8, if the per-sample noise is Gaussian, the linear mapping will propagate the noise to the fitting parameters directly, so the distribution of the fitting parameters will also be Gaussian. On the other hand, if the per-sample noise is not Gaussian, the linear mapping will work in a similar way. In order to study the distribution of the fitting parameters in these non-Gaussian cases, we select two representative noise distributions: the crystal ball distribution [22] and the Moyal distribution [23]. The former has a long tail on the left-hand side and the latter has a long tail on the right-hand side. Their probability density functions are:

    crystal ball: f(x; beta, m) = N * e^{-x^2/2},       for x > -beta
                  f(x; beta, m) = N * A * (B - x)^{-m}, for x <= -beta        (3.10)

    Moyal: f(x) = e^{-(x + e^{-x})/2} / sqrt(2*pi)        (3.11)

For the crystal ball distribution, we choose beta = m = 3, shift its center and downscale it by 0.01. For the Moyal distribution, we shift its center and downscale it by 0.00625. In addition, in order to study the non-negative effect in detector electronics, we clip the noise to force the noise values to 0 if they become negative.

Figure 1. A gathering of figures for drift and change simulations: (a) K vs. tau_p, (b) t_0 vs. tau_p, (c) K vs. b, (d) t_0 vs. b. Each figure compares the result from first-order (linear) approximations and the result from curve fitting directly.

The simulation results are shown in figure 2. For each figure, we calculate the first-order approximations and run a Monte Carlo simulation with a volume of 1000. It can be seen that although the noise distributions have strong non-Gaussian features, the distributions of the fitting parameters have Gaussian shapes. The mean and standard deviation calculated from equation 3.8 characterize the distributions from curve fitting very well. This implies that, for a medium number of sampling points (33 in this case), the distributions of the fitting parameters accord with the law of large numbers, a statistical property of many independent random variables.

In conclusion, the first-order approximations can describe curve fitting in a simple and convenient way. Different variations can be viewed as independent forces that drive the deviation of the fitting parameters, and their relation is additive. This paves the way for the comparison in section 4.2, where we will demonstrate the potential of deep learning against this perspective.

Figure 2. A gathering of figures for noise simulations: (a) K vs. shift of clipped crystal ball, (b) t_0 vs. shift of clipped crystal ball, (c) K vs. shift of clipped Moyal, (d) t_0 vs. shift of clipped Moyal. Each figure compares the average and error band from first-order (linear) approximations with Monte Carlo simulations of curve fitting.
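Both noise models are available directly in scipy, so the clipped noise used above can be drawn as in the sketch below. The loc/scale values are illustrative placeholders for the shifted, downscaled distributions described in the text.

    import numpy as np
    from scipy.stats import crystalball, moyal

    rng = np.random.default_rng(seed=0)
    n_samples = 33

    # Crystal ball noise with beta = m = 3; loc/scale stand in for the
    # shift and 0.01 downscaling used in the simulations.
    cb_noise = crystalball.rvs(beta=3, m=3, loc=0.0, scale=0.01,
                               size=n_samples, random_state=rng)

    # Moyal noise, similarly shifted and downscaled by 0.00625.
    my_noise = moyal.rvs(loc=0.0, scale=0.00625, size=n_samples, random_state=rng)

    # Model the non-negative effect of the detector electronics by
    # clipping negative noise values to zero.
    cb_noise = np.clip(cb_noise, 0.0, None)
    my_noise = np.clip(my_noise, 0.0, None)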
4 The deep learning approach

Deep learning is a major breakthrough of recent years. It is based on neural networks, but its focus has shifted to building intricate network architectures for real-world applications (e.g. images, voice, natural language). It started with image classification tasks [24, 25] and spread to other domains of artificial intelligence [26, 27]. Furthermore, it has been applied to high energy physics in recent literature [28-31]. In the following subsections, we discuss how to use deep learning to solve the pulse timing problem.
4.1 Neural network basics

The concept of neural networks is fundamental to deep learning. The basic element of a neural network is called a neuron. A neuron has N inputs and one output, and it has N weights and one bias as its parameters. It computes the products of the inputs and the weights in an element-wise manner, adds them together with the bias, and applies a nonlinear activation function to the sum. Many such neurons can act on the same inputs and form a layer. For a neural network with one intermediate layer, the output unit is also a neuron. The only intermediate layer is also called the hidden layer.

A deep neural network usually refers to a network with more than one hidden layer. By taking the output of the former layer as the input, hidden layers can be stacked. Increasing the depth of the network gains additional power to extract structured features and reduces the number of parameters needed to approximate some functions. In general, neural networks have promising mathematical characteristics. They are supported by the universal approximation theorem [32, 33], which states that neural networks can approximate mathematical functions to arbitrary precision given enough neurons and layers.

One successful network structure is the convolutional neural network [24]. It is based on the ideas of weight sharing and shift invariance. Instead of connecting a neuron to all inputs, we compute the output of a neuron from a vicinity (e.g. a 2D patch) of the input. Besides, the weights used to produce the output are shared across different places. By taking these measures, the number of parameters in a neural network can be greatly reduced and the efficiency can be improved dramatically.

To train a neural network, we need (input, label) pairs. The input is propagated through the neural network and compared to the label to compute a loss function. Then the loss is used to update the parameters of the whole network by the back-propagation algorithm. The updating formula is usually based on the gradient descent method, i.e. moving the parameters in a direction which reduces the loss function. The loss function is usually the cross entropy along with the softmax function for a classification task [24], and the mean square error and its derivatives for a regression task.
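The neuron described above reduces to one line of arithmetic. A minimal numpy illustration (the tanh activation is just one possible choice):

    import numpy as np

    def neuron(x, w, b, activation=np.tanh):
        """One neuron: element-wise products of inputs and weights,
        summed together with the bias, then a nonlinear activation."""
        return activation(np.dot(w, x) + b)

    x = np.array([0.2, -0.5, 1.0])   # N inputs
    w = np.array([0.1, 0.4, -0.3])   # N weights
    out = neuron(x, w, b=0.05)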
4.2 Comparison with curve fitting

With the knowledge above, we are ready to discuss the potential of deep learning and compare it to the curve fitting method. The study is carried out with respect to the variations in section 3.1.3.
Long-term drift
From the analysis in section 3, the long-term drift will introduce a bias into the fitting parameters. In a large detector system, correcting the bias is a tremendous task, and even impractical in some cases. For one thing, unlike the discussion in section 3.2, system parameters are hidden in the function and sometimes have very sophisticated forms. For another, the non-uniformity of different cells makes the problem even more complicated. Furthermore, if we view the bias in the non-Gaussian noise as a kind of long-term drift, the total effect is a mixture of several aspects. To tackle the bias challenge, we can use a regression neural network to correct for the influence of the long-term drift on the fitting parameters. Without loss of generality, we can assume that the last layer of the neural network has the form y = f(x; w, b) = sum_i w_i * x_i + b. Since the last layer has a bias parameter b, a persistent shift in the system will be counteracted by the bias parameter b through the training process. As long as the training label is sufficiently accurate, the bias can be greatly reduced by the neural network.

Short-term change
For curve fitting, the short-term change has a direct impact on the precision of the fitting parameters. In equation 3.8, it can be seen that the event-by-event variations of the system parameters theta will result in fluctuations of the fitting parameters. The primary cause of this phenomenon is that curve fitting treats each set of ADC samples as an independent and complete set of features. However, different sets of ADC samples belong to the same function family, and an overall understanding of the function family is beneficial to the explanation of an individual set of features. The optimization of neural networks is such a global process, which helps to establish this overall understanding. To see this point, we can rewrite the mapping of the neural network as:

    beta' = g(f(t; beta, theta) + n; W, B)        (4.1)

where f(t; beta, theta) = (f(t_1; beta, theta), f(t_2; beta, theta), ..., f(t_n; beta, theta)) is the vector of sampling points, and W, B are the weights and biases of the neural network. When we optimize the model, the training label changes consistently with the underlying fitting parameters beta but remains the same when the system parameters theta vary. As a result of training, the weights W and biases B of the neural network follow a gradient descent direction such that the change of beta' is proportional to the change of beta but orthogonal to the change of theta. In other words, training increases the sensitivity to variations of the fitting parameters and reduces the sensitivity to variations of the system parameters. In this way, the influence of the short-term change can be greatly alleviated.

Random noise
We have already analyzed the Gaussian noise with an accurate fitting function in section 3.1.2. Here we focus on noise with more complex forms. According to the central limit theorem, the distributions of the fitting parameters take Gaussian shapes when noise is present. This is a degenerative process and can lose original information. To help understand this claim, we might think of the development of modern physics: when instrumentation was not so advanced, people could only observe macroscopic phenomena, which were normally distributed according to statistical laws; once the hardware had improved, people could measure the microscopic mechanisms, and the fine structures could be found. In our problem, curve fitting does not sufficiently utilize the information in each time point, and the loss of information cannot be retrieved. On the other hand, we already know that neural networks have fine-grained (micro) structures. This offers an opportunity to achieve better performance than curve fitting in non-Gaussian settings. Since the nonlinear mapping in the activation functions can implement a complicated function family, it is possible to use neural networks to retrieve the original information from noisy inputs.

In conclusion, deep learning is a good alternative to the traditional curve fitting method in terms of drift, change and noise when used in an appropriate way.
4.3 Network architecture and training

In this part, we discuss the implementation issues of deep learning in the specific pulse timing problem. Although neural networks are promising according to the analysis in section 4.2, this does not mean that any structure will perform well. When facing a new problem, practitioners need to customize the network structure to make it suitable for the problem settings.

We design our network architecture based on ideas from [34, 35]. A diagram of the adopted architecture is shown in figure 3. In principle, the network is comprised of two parts: a denoising autoencoder and a regression network.

The denoising autoencoder [36] is a network which tries to recover the original unstained input from its noisy version. A typical autoencoder is made up of a pyramid structure which performs feature extraction (encoding), and an inverted pyramid structure which restores the original data (decoding).

Figure 3. A diagram of the network architecture: the original inputs are interpolated to form the network inputs; the denoising autoencoder (encoder and decoder layers built from convolutions and deconvolutions, linked by skip connections) produces the network outputs; a regression network of fully-connected layers gives the final output.

We add the following features to the autoencoder prototype to improve its performance:
1. Convolution and deconvolution. In the encoder layers and decoder layers, we use convolution [24] and deconvolution [37] operations to replace fully-connected layers. These operations utilize the locality of input features and extract structured patterns from the data. In the convolution, we use many groups of parameters (each called a filter or kernel) to compute the output (called a feature map). Each filter has its own weights and bias, and it moves across the input feature map to produce a one-dimensional output. Many filters result in a feature map with many channels of one-dimensional data. In the deconvolution, the operation between the input and the output is transposed. For the same stride and padding, the output shape of a deconvolution operation is the same as the input shape of the corresponding convolution operation.

2. Skip connections. Optimizing a deep neural network suffers from the problems of vanishing/exploding gradients. Even when these problems are handled by normalization, a degradation problem still affects the performance of the model. In [25], a dedicated structure called the residual network was suggested to solve this problem. In view of this work, we implement skip connections between the encoder layers and decoder layers to overcome these issues when training a deep network. Except for the last layer, every layer in the encoder is directly copied to the corresponding layer in the decoder. At the decoding side, the channels from the encoder and the channels from the main passage of the network are concatenated. In this way, the relation between long-range layers is preserved so that it is easier for the network to learn valuable features from the input.
3. Leaky ReLU. The Rectified Linear Unit (ReLU) [38] is a kind of activation function widely used in deep learning. Since ReLUs force the output to zero when the input is negative, they block the flow of information for a considerable number of neurons in a network. The leaky ReLU [39] was proposed to solve this problem. Unlike the ReLU, the leaky ReLU has a gradual slope on the negative x-axis, so it has a non-zero gradient even when the input is negative. In our network, we use leaky ReLUs in the encoder layers.
Table 1. Specification for the denoising autoencoder.

Convolution
No. | stride | filter width | out channels | leaky ReLU
1   | 2      | 4            | 64           | No ReLU
2   | 2      | 4            | 128          | Yes (0.2)
3   | 2      | 4            | 256          | Yes (0.2)
4   | 2      | 4            | 512          | Yes (0.2)
5   | 2      | 4            | 512          | Yes (0.2)
6   | 2      | 4            | 512          | Yes (0.2)
7   | 2      | 4            | 512          | Yes (0.2)
8   | 2      | 4            | 512          | Yes (0.2)

Deconvolution
No. | stride | filter width | out channels | dropout
8   | 2      | 4            | 1024         | Yes (0.5)
7   | 2      | 4            | 1024         | Yes (0.5)
6   | 2      | 4            | 1024         | Yes (0.5)
5   | 2      | 4            | 1024         | No
4   | 2      | 4            | 512          | No
3   | 2      | 4            | 256          | No
2   | 2      | 4            | 128          | No
1   | 2      | 4            | 1            | No
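The architecture of table 1 can be sketched in PyTorch as below. This is our reading, not the authors' code: we assume the decoder channel counts in table 1 are the widths after skip concatenation, that the input is the interpolated 256-point waveform (cf. figure 3), and that padding and activation placement follow the pix2pix convention of [34].

    import torch
    import torch.nn as nn

    class DenoisingAutoencoder(nn.Module):
        """U-Net-style 1-D denoising autoencoder following table 1.
        Input and output shape: (batch, 1, 256)."""
        def __init__(self):
            super().__init__()
            enc_ch = [1, 64, 128, 256, 512, 512, 512, 512, 512]
            self.encoders = nn.ModuleList()
            for i in range(8):
                layers = []
                if i > 0:                    # layer 1 has no activation (table 1)
                    layers.append(nn.LeakyReLU(0.2))
                layers.append(nn.Conv1d(enc_ch[i], enc_ch[i + 1], 4,
                                        stride=2, padding=1))
                self.encoders.append(nn.Sequential(*layers))

            dec_out = [512, 512, 512, 512, 256, 128, 64, 1]
            self.decoders = nn.ModuleList()
            in_ch = 512                      # bottleneck width
            for i in range(8):
                layers = [nn.ReLU(),
                          nn.ConvTranspose1d(in_ch, dec_out[i], 4,
                                             stride=2, padding=1)]
                if i < 3:                    # dropout on the first three decoder layers
                    layers.append(nn.Dropout(0.5))
                self.decoders.append(nn.Sequential(*layers))
                # after concatenating the skip connection, the next input doubles
                in_ch = dec_out[i] * 2

        def forward(self, x):
            skips = []
            for enc in self.encoders:
                x = enc(x)
                skips.append(x)
            skips = skips[:-1]               # the last encoder layer is not copied
            for i, dec in enumerate(self.decoders):
                x = dec(x)
                if i < 7:                    # concatenate encoder and decoder channels
                    x = torch.cat([x, skips[-(i + 1)]], dim=1)
            return x

With 8 stride-2 convolutions, the 256-point input is reduced to a length-1 bottleneck and then restored to 256 points, so the post-concatenation decoder widths (1024, 1024, 1024, 1024, 512, 256, 128, 1) match the table.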
Specifically, the denoising autoencoder is a network with 8 convolution layers and 8 deconvolution layers, as specified in table 1. A softmax layer is used between the denoising autoencoder and the regression network.

Training such a network can be divided into the following two steps:

1. Autoencoder pre-training. It is strongly recommended to pre-train the denoising autoencoder as the first step of the training process. Based on the function of the autoencoder, we need to estimate the form of the noise and generate (noisy input, unstained input) pairs as the (input, label) to train the network (a sketch of this pair generation is given after this list). To be more specific, first we randomly generate a set of sampling points according to the pulse function. Then we add per-sample noise to the sampling points according to the probability distribution of the estimated form of the noise. If the expression of the short-term change is known, it can also be used. Actually, even a rough estimate can improve the final performance significantly (see section 5). In this stage, only simulation data is used.
2. End-to-end finetuning. After pre-training, we can use experimental data (if available) to perform an end-to-end finetuning of the whole network. A precise label indicating the ground-truth parameter is used at the far end of the network to generate a loss function. There are two options in finetuning. The first option is to keep the autoencoder unchanged and only finetune the regression network; if there are no distinct changes in the pulse function compared to the pre-training stage, this option can be used. The second option is to finetune the whole network together; in this case, the pre-trained network only works as an optimal starting point for finetuning, and the capacity of the model is larger (which also implies overfitting issues).
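Step 1 amounts to generating (noisy input, unstained input) pairs from the pulse function. A sketch under a Gaussian-noise assumption, reusing the shaped_pulse helper from section 2; the K and t0 ranges and the 0.1 µs grid are illustrative placeholders:

    import numpy as np

    def make_pretraining_pair(rng, n_points=32, ratio=8, noise_std=0.014):
        """Generate one (noisy input, clean target) pair for pre-training."""
        K = rng.uniform(0.4, 1.0)          # hypothetical range
        t0 = rng.uniform(-0.1, 0.1)        # hypothetical range
        t_coarse = np.arange(n_points) * 0.1                  # 10 MS/s samples
        t_fine = np.arange(n_points * ratio) * (0.1 / ratio)  # super-resolved grid
        clean = shaped_pulse(t_fine, K, t0, 2.0, 0.1)
        noisy = (shaped_pulse(t_coarse, K, t0, 2.0, 0.1)
                 + noise_std * rng.standard_normal(n_points))
        # The coarse noisy samples are interpolated to the label resolution
        # before entering the network (cf. figure 3).
        noisy_interp = np.interp(t_fine, t_coarse, noisy)
        return noisy_interp, clean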
4.4 Simulation studies

In this part, we run simulations of the proposed neural network regarding the variations discussed in section 3.1.3. Since the advantage of the neural network model for the long-term drift is evident from the discussion in section 4.2, we do not run simulations for this kind of variation.

In order to study the variations, first we need to generate the simulation dataset. The pulse function is the same as in section 3.2. In the following simulations, we choose K uniformly sampled in a fixed interval and t_0 uniformly sampled in an interval around zero. The reference values for tau_p and b are 2.0 and 0.1, respectively. The pulse for the noisy input (or the input with short-term change) is sampled from t = 0 to t = 3.2 at a period of 0.1. We drop the last point when training, so there are 32 points. The same pulse for the label is sampled at a super-resolution ratio of 8 in the same interval, so there are a total of 256 points. We gather the simulation samples into two separate datasets: the training dataset has 40000 samples and the test dataset has 10000 samples.

To calculate the timing resolution, we test the different methods on the test dataset and get the predicted values of the start time t_0. For curve fitting, the predicted values are the fitting parameters. For regression networks, the predicted values are the outputs of the networks. Then we use the difference between the predicted values and the ground-truth values to make a Gaussian fit. The standard deviation of the Gaussian fit is a measure of the timing resolution.

Figure 4. The simulation results of the denoising autoencoder for the short-term change. (left) A typical example of the inputs, the outputs and the targets (label) of the autoencoder for a sample chosen from the test dataset. (right) The RMS of amplitude between the inputs/outputs and the ground-truth targets over the whole test dataset, with Gaussian fits (inputs: mu = 0.01110, sigma = 0.00843; outputs: mu = 0.00310, sigma = 0.00191).
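The resolution measure described above is a Gaussian fit of the residuals; a minimal sketch:

    import numpy as np
    from scipy.stats import norm

    def timing_resolution(predicted_t0, true_t0):
        """Gaussian fit of the residuals: the standard deviation measures
        the timing resolution and the mean measures the system bias."""
        residuals = np.asarray(predicted_t0) - np.asarray(true_t0)
        mu, sigma = norm.fit(residuals)
        return mu, sigma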
Short-term change
To study the effects of the short-term change, we introduce the baseline shift, i.e. variations of the pedestal b. The baseline shift is a common type of short-term change, especially when the event rate is high so that nearby events interplay. To construct the dataset, first we add the same shift to all sampling points in an event. The shift is randomly sampled from a Gaussian distribution with mean 0.1 and standard deviation 0.014. The training targets of the denoising autoencoder are set to have the pedestal b = 0.1, which is the standard value used in curve fitting.

Table 2. Simulation results for the short-term change. The table compares different neural network models with curve fitting.

model                       | note                   | converged  | timing resolution (µs)
fitting original data       | —                      | —          | 0.01217
fitting autoencoder outputs | only base network      | —          | 0.00296
regression net v1           | base network fixed     | successful | 0.00303
regression net v2           | base network trainable | successful | 0.00182

The results are shown in figure 4 and table 2. In the left figure, we can see that although the pedestal b and the amplitude K are both random and highly correlated, the denoising autoencoder can effectively perceive the change in the pedestal and cancel it. The right figure shows the distribution of the RMS over the whole test dataset: the average RMS is reduced from 0.01110 to 0.00310, by a factor of 3.58. In the table, we compare the timing resolution achieved by curve fitting and by the neural networks. From the first two lines, it can be seen that fitting the outputs of the denoising autoencoder is better than fitting the original data, which demonstrates the effectiveness of the neural network structure. The result of regression network v1, with the base network fixed, is slightly worse than fitting the outputs of the denoising autoencoder. The best result (1.82 ns) comes from regression network v2 with the base network trainable, and it outperforms the curve fitting results significantly. This implies that, for the short-term change, when we choose a proper starting point and finetune the whole network, the result can be even better than the autoencoder alone.
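The dataset construction described above can be sketched as follows, reusing the shaped_pulse helper from section 2; the K and t0 ranges are illustrative placeholders:

    import numpy as np

    def make_shifted_event(rng, n_points=32, ratio=8):
        """One event of the short-term change dataset: a per-event pedestal,
        shared by all sampling points, replaces the reference b = 0.1."""
        b_event = rng.normal(0.1, 0.014)               # per-event baseline shift
        K, t0 = rng.uniform(0.4, 1.0), rng.uniform(-0.1, 0.1)
        t_coarse = np.arange(n_points) * 0.1
        t_fine = np.arange(n_points * ratio) * (0.1 / ratio)
        noisy_input = shaped_pulse(t_coarse, K, t0, 2.0, b_event)
        clean_target = shaped_pulse(t_fine, K, t0, 2.0, 0.1)  # standard pedestal
        return noisy_input, clean_target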
Random noise

We analyze two representative kinds of noise: the Gaussian noise and the clipped Moyal noise (see section 3.2).
In the first place, we add Gaussian noise with zero mean and 0.014 standard deviation. This introduces a noise ratio (noise std. deviation over average amplitude) that is more intense than in reality. The results are shown in figure 5 and table 3.

Figure 5. The simulation results of the denoising autoencoder for the Gaussian noise, plotted in the same way as figure 4 (inputs: mu = 0.01390, sigma = 0.00172; outputs: mu = 0.00372, sigma = 0.00155).

Table 3. Simulation results for the Gaussian noise. The table compares different neural network models with curve fitting.

model                       | note                          | converged  | timing resolution (µs)
fitting original data       | maximum likelihood estimator  | —          | 0.01206
only regression net         | no base network               | failed     | 0.26756
fitting autoencoder outputs | only base network             | —          | 0.01249
regression net v1           | base network fixed            | successful | 0.01530
regression net v2           | base network trainable        | successful | 0.01261

The right figure shows the distribution of the RMS over the whole test dataset: the average of the noise RMS is reduced from 0.01390 to 0.00372, by a factor of 3.74. In the table, we use three neural network models and compare their performance with curve fitting. Since the Gaussian noise is the most common case, in this analysis we add the regression network alone for comparison. According to section 3.1.2, fitting the original data gives the result of the maximum likelihood estimator, which is the theoretical lower bound. It can be seen that the network architecture is important to achieve optimal performance. When we use only the regression network, the model fails to converge and gives a result worse than the sampling period. However, when we use the autoencoder-regression architecture, the model converges successfully. The best neural network result comes from regression network v2 with the base network trainable, showing the advantage of model capacity in this problem.
In the second place, we analyze the clipped Moyal noise. The original Moyal distribution is shifted to location 0.004, rescaled by 0.006 and then clipped for noise generation. Again, the noise is more intense than in reality. The results are shown in figure 6 and table 4.

Figure 6. The simulation results of the denoising autoencoder for the clipped Moyal noise, plotted in the same way as figure 4 (inputs: mu = 0.01722, sigma = 0.00313; outputs: mu = 0.00093, sigma = 0.00046).

Table 4. Simulation results for the clipped Moyal noise. The table compares different neural network models with curve fitting.

model                       | note                   | converged  | timing resolution (µs)
fitting original data       | —                      | —          | 0.01203
fitting autoencoder outputs | only base network      | —          | 0.00324
regression net v1           | base network fixed     | successful | 0.00463
regression net v2           | base network trainable | successful | 0.00487

In the left figure, we can see that the unique structure of the denoising autoencoder can very well extract the clue of the ground-truth target from the noisy input. To further illustrate this, we plot the distribution of the RMS on the test dataset in the right figure: the average of the noise RMS is reduced from 0.01722 to 0.00093, by a factor of 18.52, exceeding the results of the former simulations. In the table, we compare the timing resolution between curve fitting and the neural networks. From the first two lines, it can be seen that curve fitting with the denoising autoencoder alone improves the timing resolution significantly. Besides, when the regression networks are added, the models converge successfully and show competitive results. In this case, keeping the base network fixed (regression network v1) is slightly better than making the base network trainable (regression network v2), which demonstrates the good baseline provided by the autoencoder.

To conclude the simulation results, the network architecture proposed in section 4.3 can very well tackle the non-ideal conditions. Finetuning the whole network together achieves better results than fitting the outputs of the autoencoder when the short-term change is applied, but slightly worse results when the random noise is applied. Finetuning the regression network alone can sometimes achieve better results than finetuning the whole network, especially when the base network is accurate. In experimental conditions, it is not always possible to provide exact training targets for the denoising autoencoder as in the simulations; thus, finetuning the regression network with the precise time label is vital to improve the performance of the whole network.
5 Experiments

We build a hardware test platform to study pulse timing in a real-world environment. A photograph of the platform is shown in figure 7. The test platform is based on the PHOS detector (section 2). We use a pulse generator to produce pulses with ~50 ns width and ~10 Hz frequency. This pulse signal drives an LED to produce light for the PHOS crystal. The scintillation light is collected by the APD and passed to the CSA. Then it is transmitted to the CR-RC^2 shaper on the FEE card. The output of the CR-RC^2 shaper is hardwired to the AD9656 data acquisition board, which is connected to the HPDAQ motherboard for TCP/IP communications. The AD9656 is a 4-channel ADC chip with 2.8 V dynamic range, 16-bit precision and 125 MHz sampling rate.

Figure 7. A photograph of the hardware test platform with the PHOS detector, AD9656 data acquisition board and HPDAQ.
Choosing such a high-speed ADC chip makes it possible to compare the performance of curve fitting and the neural network model with different numbers of sampling points.

To prepare the datasets, we watch two channels of signals simultaneously. One channel is the trigger signal driving the LED, and the other channel is the output of the shaper on the FEE card. We randomly choose a fixed-interval section from the most salient part of the output pulse. Then we add a label to the pulse according to the interval between the trigger signal and the selected section. This label is used to train the neural network and serves as the baseline for curve fitting. We normalize the amplitude of the ADC sampling points to a range similar to section 3.2 and section 4.4. We collect 80000 samples for the training dataset and 20000 samples for the test dataset.

5.1 1 µs shaping time

In this part, we conduct experiments with the 1 µs shaping time (2 µs peaking time), which is the ALICE PHOS specification. The sampling section has a span of 3072 ns. We choose 2^k + 1 sampling points and use linear (for k >= 2) or quadratic (for k = 1) interpolation when training the neural network. We pre-train the model under the assumption of Gaussian noise with the parameterization in section 4.4. Then we finetune the whole network using the experimental data. The base network is trainable when finetuning.

We analyze 6 different conditions with k = 1, 2, 3, 4, 5, 6. This gives an approximate sampling rate of 0.625 MHz, 1.25 MHz, 2.5 MHz, 5 MHz, 10 MHz and 20 MHz, respectively. For each condition, we perform an independent training process starting from the same pre-trained model. Then we test our model on the corresponding test dataset and make a Gaussian fit of the residuals (the difference between the regression outputs and the time labels) to get the mean and the standard deviation. The standard deviation of the Gaussian fit is a measure of the timing resolution, and the mean is a measure of the system bias. For curve fitting, we use the same sampling points and fit the residuals (the difference between the fitting parameters and the time labels) to a Gaussian distribution.

We use a batch size of 16 when training the neural network, and the training proceeds for 10 epochs. The final result and error bar (1 sigma error) for the neural network are calculated from the test results paused at even numbers of training epochs.

Figure 8. Experimental results for the 1 µs shaping time: (a) timing resolution and (b) system bias of curve fitting and the neural network versus the number of sampling points (3, 5, 9, 17, 33, 65).

The main result is shown in figure 8. In the left figure, it can be seen that the neural network works steadily better than curve fitting. With as few as 3 sampling points, the two methods already achieve relatively good performance. When the number of sampling points increases, the results improve slightly. When we use 17 or more sampling points, the performance of curve fitting hits a plateau, but the neural network still improves. The best performance achieved by the neural network is 8.22 ± 0.11 ns, which is 27.3% better than curve fitting (11.31 ns).

In the right figure, the system bias is greatly reduced by the neural network model compared to curve fitting. From direct observation, the interval between the start of the trigger signal and the start of the shaped pulse is approximately 15 sampling points (120 ns), which is close to the results from curve fitting (137.94 ns to 148.11 ns). The bias of the neural network fluctuates around the horizontal axis. Since the bias is a fixed value for a given model, it can be calibrated in the same way as for curve fitting, and the burden of calibration is considerably alleviated.
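The point-selection scheme used in both experiments can be sketched as below, under our reading that the 2^k + 1 retained points are interpolated back onto the full-rate grid before entering the network:

    import numpy as np
    from scipy.interpolate import interp1d

    def resample_event(waveform, k):
        """Keep 2**k + 1 evenly spaced points of a high-rate waveform and
        interpolate them back onto the full grid, as done before training:
        linear interpolation for k >= 2, quadratic for k = 1."""
        n = len(waveform)
        idx = np.linspace(0, n - 1, 2**k + 1).astype(int)
        kind = 'linear' if k >= 2 else 'quadratic'
        f = interp1d(idx, waveform[idx], kind=kind)
        return f(np.arange(n))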
Figure 9. Experimental results for the 100 ns shaping time: (a) timing resolution and (b) system bias of curve fitting and the neural network versus the number of sampling points (3, 5, 9, 17, 33).

5.2 100 ns shaping time

In this part, we conduct experiments with the 100 ns shaping time (200 ns peaking time), which is the ALICE EMCal specification. We replace resistors and capacitors in the CR-RC^2 shaper on the FEE card to achieve the shorter shaping time. The sampling section has a span of 256 ns. We choose 2^k + 1 sampling points with k = 1, 2, 3, 4, 5. This gives a sampling rate of 7.8125 MHz, 15.625 MHz, 31.25 MHz, 62.5 MHz and 125 MHz, respectively. Other experimental conditions and procedures are similar to section 5.1.

To determine the label for curve fitting and the neural network with a precision superior to the sampling period, we fit the trigger signal to the square pulse response of a second-order system:

    Y_step(t) = K [ 1 + (T_1/(T_2 - T_1)) e^{-(t - t_s)/T_1} - (T_2/(T_2 - T_1)) e^{-(t - t_s)/T_2} ] u(t - t_s)        (5.1)

    Y_square(t) = Y_step(t) - Y_step(t - w)        (5.2)

where u(t) is the step function, Y_step(t) is the overdamped step response of a second-order system and w is the width of the square pulse. K and t_s are the parameters to be fitted; the other parameters are fixed according to the circuit specification and the experimental observation. t_s is used as the label to judge the quality of curve fitting and to train the neural network.

The main result is shown in figure 9. In the left figure, the timing resolution has improved significantly compared to the 1 µs shaping time. Again, the neural network outperforms curve fitting. When the number of sampling points increases from 3 to 33, the precision of the neural network and of curve fitting increases slightly, and the trend gradually levels off. The neural network achieves the optimal result of 1.37 ± 0.03 ns at 17 sampling points, which is 24.7% better than curve fitting (1.82 ns).

In the right figure, the system bias of the neural network model is much smaller than that of curve fitting. Basically, curve fitting has a large system bias (90.16 ns to 91.73 ns), which is in accord with the approximately 96 ns from direct observation, but the neural network model suppresses the absolute value of the bias to less than 2 ns. This facilitates the calibration and improves the overall stability of the timing system.
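The label fit of equations 5.1 and 5.2 can be sketched as below; the time constants T_1, T_2 and the pulse width are placeholders standing in for the values fixed by the circuit specification and experimental observation (units of ns):

    import numpy as np
    from scipy.optimize import curve_fit

    T1, T2, WIDTH = 10.0, 40.0, 50.0   # placeholder circuit constants, ns

    def y_step(t, K, ts):
        """Overdamped step response of a second-order system (equation 5.1)."""
        dt = np.maximum(t - ts, 0.0)   # u(t - ts): zero before the start time
        return K * (1 + T1 / (T2 - T1) * np.exp(-dt / T1)
                      - T2 / (T2 - T1) * np.exp(-dt / T2))

    def y_square(t, K, ts):
        """Square pulse response (equation 5.2)."""
        return y_step(t, K, ts) - y_step(t - WIDTH, K, ts)

    def fit_trigger(t, samples):
        """Fit K and ts to the trigger waveform; ts serves as the time label."""
        (K, ts), _ = curve_fit(y_square, t, samples, p0=[1.0, 0.0])
        return ts

Note that the bracket in y_step evaluates to exactly zero at dt = 0, so clipping dt reproduces the step function u(t - t_s) without a separate mask.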
5.3 Discussion about the experimental results

In the above experiments, a relation between the shaping time of the shaper and the timing resolution can be observed. The experimental results show that decreasing the shaping time can potentially improve the timing resolution when other conditions are the same. In the frequency domain, a shorter shaping time means a bandpass filter with a higher cut-off frequency, so more information about the original event is kept. In the time domain, a shorter shaping time can alleviate the long-range misfit problem. To be more specific, in the experiments with 1 µs shaping time, sampling points are far away from the desired start time t_0; thus any slight discrepancy between the fitted model and the ideal model will cause a large deviation in the value of t_0. A similar issue applies to the neural network if we view the discrepancy as an intrinsic error and a source of misunderstanding. With the 100 ns shaping time, the distance between the sampling points and the start time is shortened and the long-range problem is properly handled.

On the other hand, when the shorter shaping time is used, the influence of the three kinds of variations (especially short-term change and random noise) is relatively more significant. Besides, since the width of the LED pulse is less than 50 ns, signal integrity issues (especially overshooting) affect the precision of the fitted label. As a result, the improvement of the timing resolution is worse than estimates based on a proportional hypothesis.

6 Conclusion

The classic curve fitting method uses a Gaussian noise hypothesis, and its performance is guaranteed by its statistical properties. However, when long-term drift, short-term change and random noise are present in the pulse function, the limitations of curve fitting emerge. Among the possible alternatives, neural networks show strong resistance to these three kinds of variations through their delicate structure and optimization process. Simulations and experiments demonstrate their superiority over curve fitting.

Nevertheless, neural networks have their special requirements, which pose new challenges to the design of the detector system. Since most deep learning methods are based on supervised learning, an accurate label for training is needed. Sometimes acquiring the label is not an easy task, especially when the detector system has complex geometric structures and intricate components. This raises the demand for a traceable design, i.e. a design scheme in which the timing information can be traced back internally through the calibration process. From this perspective, we sincerely hope our work will provide a new way of thinking in the future design of timing systems.

Acknowledgments
This research is supported by the National Natural Science Foundation of China (Grant Numbers 11875146, 11505074, 11605051).
References

[1] R. Grzywacz, Applications of digital pulse processing in nuclear spectroscopy, Nucl. Instrum. Meth. B (2003) 649.
[2] E. Samain, Timing of optical pulses by a photodiode in the Geiger mode, Appl. Opt. (1998) 502.
[3] C. Han, I. F. Akyildiz and W. H. Gerstacker, Timing acquisition and error analysis for pulse-based terahertz band wireless systems, IEEE Trans. Veh. Technol. (2017) 10102.
[4] ALICE collaboration, Performance of the ALICE experiment at the CERN LHC, Int. J. Mod. Phys. A (2014) 1430044.
[5] ATLAS collaboration, Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC, Phys. Lett. B (2012) 1 [arXiv:1207.7214].
[6] P. Antonioli and S. Meneghini, A 20 ps TDC readout module for the ALICE time of flight system: design and test results, in Proceedings of the 9th Workshop on Electronics for LHC Experiments, Amsterdam, The Netherlands, 29 September - 3 October 2003, pp. 311-315 [CERN-2003-006].
[7] G. Mauri, M. Mariotti, F. Casinini, F. Sacchetti and C. Petrillo, Pulse shape analysis of neutron signals in Si-based detectors, arXiv:1805.01261.
[8] K. Mahata, A. Shrivastava, J.A. Gore, S.K. Pandit, V.V. Parkar, K. Ramachandran et al., Particle identification using digital pulse shape discrimination in a nTD silicon detector with a 1 GHz sampling digitizer, Nucl. Instrum. Meth. A (2018) 20 [arXiv:1804.01985].
[9] LUX collaboration, Liquid xenon scintillation measurements and pulse shape discrimination in the LUX dark matter detector, Phys. Rev. D (2018) 112002 [arXiv:1802.06162].
[10] Y. Ashida, H. Nagata, Y. Koshio, T. Nakaya and R. Wendell, Separation of gamma-ray and neutron events with CsI(Tl) pulse shape analysis, Prog. Theor. Exp. Phys. (2018) 043H01.
[11] J. Kaspar et al., Design and performance of SiPM-based readout of PbF2 crystals for high-rate, precision timing applications, 2017 JINST P01009 [arXiv:1611.03180].
[12] P. J. Fish, Electronic noise and low noise design, Macmillan International Higher Education (2017).
[13] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature (2015) 436.
[14] J. Griffiths, S. Kleinegesse, D. Saunders, R. Taylor and A. Vacheret, Pulse shape discrimination and exploration of scintillation signals using convolutional neural networks, arXiv:1807.06853.
[15] MicroBooNE collaboration, A deep neural network for pixel-level electromagnetic particle identification in the MicroBooNE liquid argon time projection chamber, arXiv:1808.07269.
[16] P. Ai, D. Wang, G. Huang and X. Sun, Three-dimensional convolutional neural networks for neutrinoless double-beta decay signal/background discrimination in high-pressure gaseous Time Projection Chamber, 2018 JINST P08015 [arXiv:1803.01482].
[17] H. Muller, R. Pimenta, Z. Yin, D. Zhou, X. Cao, Q. Li et al., Configurable electronics with low noise and 14-bit dynamic range for photodiode-based photon detectors, Nucl. Instrum. Meth. A (2006) 768.
[18] ALICE collaboration, The ALICE experiment at the CERN LHC, 2008 JINST S08002.
[19] H. Torii, The ALICE PHOS calorimeter, J. Phys. Conf. Ser. (2009) 012045.
[20] A. Fantoni, The ALICE Electromagnetic Calorimeter: EMCAL, J. Phys. Conf. Ser. (2011) 012043.
[21] D. W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, J. Soc. Ind. Appl. Math. (1963) 431.
[22] J. Gaiser, Charmonium spectroscopy from radiative decays of the J/psi and psi', Appendix F, Ph.D. thesis, SLAC, 1982.
[23] C. Walck, Hand-book on statistical distributions for experimentalists, Tech. Rep., Particle Physics Group, Fysikum, University of Stockholm (1996).
[24] A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[25] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[26] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Processing Mag. (2012) 82.
[27] Y. Shen, X. He, J. Gao, L. Deng and G. Mesnil, Learning semantic representations using convolutional neural networks for web search, in Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 373-374.
[28] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman and A. Schwartzman, Jet-images – deep learning edition, JHEP (2016) 69 [arXiv:1511.05190].
[29] E. Racah, S. Ko, P. Sadowski, W. Bhimji, C. Tull, S.-Y. Oh et al., Revealing fundamental physics from the Daya Bay neutrino experiment using deep neural networks, in Proceedings of the 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016, pp. 892-897.
[30] J. Renner, A. Farbin, J. M. Vidal, J. Benlloch-Rodríguez, A. Botas, P. Ferrario et al., Background rejection in NEXT using deep neural networks, 2017 JINST T01004 [arXiv:1609.06202].
[31] R. Acciarri, C. Adams, R. An, J. Asaadi, M. Auger, L. Bagby et al., Convolutional neural networks applied to neutrino events in a liquid argon time projection chamber, 2017 JINST P03011 [arXiv:1611.05531].
[32] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks (1989) 359.
[33] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signal. (1992) 455.
[34] P. Isola, J.-Y. Zhu, T. Zhou and A. A. Efros, Image-to-image translation with conditional adversarial networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5967-5976.
[35] V. Kuleshov, S. Z. Enam and S. Ermon, Audio super resolution using neural networks, arXiv:1708.00853.
[36] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio and P.-A. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. (2010) 3371.
[37] H. Noh, S. Hong and B. Han, Learning deconvolution network for semantic segmentation, in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520-1528.
[38] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[39] B. Xu, N. Wang, T. Chen and M. Li, Empirical evaluation of rectified activations in convolutional network, arXiv:1505.00853.
[40] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. (2014) 1929.