Improved deep learning techniques in gravitational-wave data analysis
Heming Xia, Lijing Shao,∗ Junjie Zhao, and Zhoujian Cao

Department of Astronomy, School of Physics, Peking University, Beijing 100871, China
Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing 100871, China
National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012, China
School of Physics and State Key Laboratory of Nuclear Physics and Technology, Peking University, Beijing 100871, China
Department of Astronomy, Beijing Normal University, Beijing 100875, China

∗ [email protected]

(Dated: December 25, 2020)

In recent years, convolutional neural network (CNN) and other deep learning models have been gradually introduced into the area of gravitational-wave (GW) data processing. Compared with the traditional matched-filtering techniques, CNN has significant advantages in efficiency in GW signal detection tasks. In addition, matched-filtering techniques are based on the template bank of the existing theoretical waveforms, which makes it difficult to find GW signals beyond theoretical expectation. In this paper, based on the task of GW detection of binary black holes, we introduce the optimization techniques of deep learning, such as batch normalization and dropout, to CNN models. Detailed studies of model performance are carried out. Through this study, we recommend the use of batch normalization and dropout techniques in CNN models in GW signal detection tasks. Furthermore, we investigate the generalization ability of CNN models on different parameter ranges of GW signals. We point out that CNN models are robust to the variation of the parameter range of the GW waveform. This is a major advantage of deep learning models over matched-filtering techniques.

I. INTRODUCTION
On 11 February 2016, the LIGO and Virgo Collaborations announced the first detection of a gravitational-wave (GW) signal, observed on 14 September 2015 and named GW150914 [1–3]. The detected GW signal comes from a binary black hole (BBH) merger. The masses of the BBH components are estimated to be 29 M⊙ and 36 M⊙. The successful observation of GW signals has provided valuable experimental data for GW astronomy, setting off a wave of GW research. So far, 50 GW signals from compact binary coalescences have been successfully detected [4–10]. Except for the binary neutron star (BNS) merger event GW170817 [11] and three other, strictly speaking, unclear merger events (GW190814 [7], GW190425 [8], and GW190426_152155 [10]), the remaining 46 signals all come from BBH mergers.

Currently, both LIGO and Virgo mainly use the matched-filtering method [12–15] to detect GW signals. This method builds a theoretical waveform template bank to match the monitored data and captures trigger signals as candidates for further verification [12]. Matched filtering plays a vital role in the processing of GW signal detection. However, it has shortcomings that should not be overlooked [16, 17]. The matched-filtering method requires a full search over the template bank to match a signal, which limits the data processing speed and, if a big template bank is used, has difficulty meeting the needs of real-time observation. In addition, with the continuous expansion of theoretical waveforms in an enlarging parameter space, the search space of matched filtering increases, which leads to an increase of data processing time and a reduction in the processing speed [17].

In recent years, many machine learning methods have been developed for GW signal detection tasks [18–23]. In the machine learning field, after AlexNet won the championship in the ImageNet competition in 2012 [24], deep learning algorithms stood out and achieved great success in many fields such as image classification, natural language processing, and speech recognition [25, 26]. In terms of classification tasks, compared with traditional machine learning algorithms, many deep learning algorithms, including convolutional neural networks (CNNs), have made significant progress in model accuracy and model complexity. Besides, the characteristics of the deep learning algorithm allow the time-consuming training process to be completed offline before the actual data analysis. This greatly reduces the amount of calculation in the online process and meets the need of real-time detection [26].

At present, deep learning methods, especially CNNs, have been widely explored in GW data processing [27–35]. In 2017, George and Huerta [27] first applied CNNs to GW signal detection tasks. They generated mock GW signals from BBH mergers and added them into white Gaussian noise to generate simulation datasets. They pointed out that the sensitivity, namely the fraction of signals which are correctly identified, of CNNs is similar to that of the matched-filtering method, while the speed is greatly improved. Work in the same period by Gabbard et al. [28] compared the false alarm rate and the receiver operating characteristic (ROC) curve between CNN models and the matched-filtering method, leading to a similar conclusion. Subsequently, the application of deep learning in the field of GW signal detection has been expanded greatly. Krastev [29] indicates that deep learning models work better on BNS mergers than BBH mergers.
More deep learning models, such as residual networks, fully convolutional networks, and other structures, have been introduced [34–39], and the research field has been continuously expanding. Variational autoencoders and Bayesian neural networks are used for parameter estimation of GW signals [30, 31]. The long short-term memory network has made progress in the field of GW signal noise reduction, and it is shown to effectively remove environmental noise and restore the GW signal under noise [32]. In the sky localization searching task of GW signals, deep learning methods such as CNNs have also achieved good results [33].

However, almost all deep learning algorithms such as CNNs used in the current studies are basic models. It means that they can be further optimized. In addition, many studies have pointed out that deep learning models can maintain a certain degree of robustness to GW signals beyond the range of the training set parameters [27, 28, 35]. However, there is no specific research on this aspect. Grounded on the above two points, we conduct experiments on the optimization effects of several deep learning techniques in the field of GW signal detection. The result shows that, compared with the basic model, the model with improved techniques achieves better performance. On the low signal-to-noise ratio (SNR) dataset, the model with multiple improved techniques has an accuracy rate of 84% on the testing set, 12% higher than that of the basic model, and an area under curve (AUC) score of 0.91, 6% higher than that of the basic model. On the overall dataset, our model with multiple improved techniques has an accuracy rate of 94% on the testing set, 4% higher than that of the basic model, and an AUC score of 0.98, 2% higher than that of the basic model. Moreover, we make a detailed study of the robustness of CNN models on GW signal detection tasks. Our experiments show that the CNN model has good robustness for data of different parameter ranges for masses and spins.

This paper is organized as follows. In Sec. II, we give a brief overview of deep learning. In Sec. III, we introduce the simulated datasets to be used in our experiments. Then the improved techniques for CNNs and the corresponding experimental results are shown in Secs. IV and V, respectively. In Sec. VI, we investigate the generalization ability of the CNN model in different parameter ranges.

II. DEEP LEARNING
Traditional machine learning methods include k-nearest neighbor, decision tree, support vector machine, and so on [40]. The advantage of machine learning methods is that, to some extent, they can replace the process of human learning. Through training on a large dataset, these models can learn the relationships in the data so as to classify, predict, and help humans make decisions [41]. However, when traditional machine learning methods are applied to specific tasks, they usually have difficulty processing the original data. In some cases, researchers have to manually extract data features and put them into the algorithms. Besides, traditional machine learning methods are usually limited by their fixed model structure. It is difficult for these algorithms to achieve rapid improvement in computing power and accuracy [26].

Deep learning overcomes some limitations of traditional machine learning methods. Its algorithm is derived from the neural network, which is a sub-field of traditional machine learning. This algorithm overcomes the limitation on the depth of the neural network and increases the computational power of the model. So far, many deep learning models, such as the CNN, the residual network, and the long short-term memory network, have been proposed.

FIG. 1. Architecture of a neuron.
At present, deep learning has achieved substantial success in face recognition, automatic driving, speech processing, and many other fields [26]. For the sake of a self-contained work, below we briefly review the principles of deep learning, including the structure of neurons, the basic principle of neural networks, and the model of CNN.
A. Neuron
The idea of neural networks in machine learning evolved from biological models. Generally speaking, a neural network is a network of parallel interconnections composed of simple adaptive units [42]. Its organization can simulate the interaction of the biological nervous system with real-world objects. The neuron is the "simple unit" in the above definition. In 1943, McCulloch and Pitts [43] abstracted it into the simple model shown in Fig. 1, namely the "M-P neuron model". In this model, each neuron receives input data x = (x₁, x₂, ..., xₙ) from previous neurons. The input is multiplied by the weights w = (w₁, w₂, ..., wₙ), plus the bias b, and then is injected into the activation function σ to obtain the output y. A single neuron can be represented by the following formula using vectors,

y = σ(w xᵀ + b). (1)

The activation function σ introduces nonlinear operations into neurons. Otherwise, the structure of neurons will simply be a superposition of linear operations, and the power of the network will be greatly reduced. The activation function σ comes in many forms [44]. In this paper, we use the most commonly used activation function, ReLU [45],

σ(x) = x for x > 0, and σ(x) = 0 for x ≤ 0. (2)

In general, the current neuron only outputs positive values after calculating the weighted sum of the data from the previous n neurons. At the feature level, it can be understood as follows: after a linear combination of n features, the neuron inputs the combined feature into the activation function to obtain the output feature.
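To make the M-P neuron concrete, here is a minimal sketch of Eqs. (1) and (2) in Python with NumPy; the input values, weights, and bias below are illustrative only.

import numpy as np

def relu(x):
    # Eq. (2): pass positive values through, zero out the rest
    return np.maximum(x, 0.0)

def neuron(x, w, b):
    # Eq. (1): weighted sum of the inputs plus a bias,
    # followed by the nonlinear activation
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs from n = 3 previous neurons
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias
y = neuron(x, w, b)              # a single scalar output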
B. Neural Network
The simplest example of a neural network is the Fully Connected Neural Network (FCNN). The structure of a FCNN is shown in Fig. 2. A manually specified number of neurons constitutes each layer, which is called a "dense layer" of the network. Neurons within the same layer are independent of each other. Each neuron receives data from the former layer, calculates its result, and puts it into all neurons of the next layer [46]. Note that the number of parameters of the first layer in a FCNN has a linear relationship with the input data size [26]. As the input data size increases, the number of parameters in the network increases correspondingly, slowing down the learning speed of the model and increasing the requirement of data storage [25].

FIG. 2. Structure of a FCNN.

The neural network is a supervised learning algorithm, whose characteristic is to use labeled data (i.e., data with the correct output y) for training, and to test the model on unlabeled data. A loss function is defined to measure the difference between the model output ŷ and the correct y. The expectation for training is to make the value of the loss function as small as possible [26]. The parameters to be learned in the neural network are the weights w and biases b in each layer of neurons. According to the gradient descent strategy, parameters in the neural network are updated in the direction where the value of the loss function decreases [47]. The gradient descent strategy reads,

θ_new = θ_old − α ∇_θ J(θ), (3)

where θ is the parameter to be updated, J is the loss function, and α is the manually specified learning rate which controls the speed of parameter updates.

The learning of a neural network is based on a training dataset collected and annotated by human beings. However, when we finally apply the model to real tasks, we hope the neural network has good generalization ability. The generalization ability of a neural network is defined as the ability of the network to handle unseen patterns [48]. In other words, this concept measures how accurately an algorithm is able to predict outcome values for previously unseen
data [49]. The testing set is a good type of unseen data. The data characteristics of the testing set are similar to those of the training set. In the meantime, its distribution is independent of the training set, and it does not appear in the training process [26]. Therefore, the final result on the testing set is a reliable index to evaluate the performance of a neural network.

However, if the number of parameters in the network is large compared to the number of samples in the training set, overfitting occurs [50]. If there are too many parameters in the neural network, the fitting capacity of the model becomes too strong, which makes the model fit the (noisy) characteristics of the training dataset too closely in the learning process. As a result, the model is only effective for the data samples that appear in the training set, resulting in the so-called overfitting problem [51]. Therefore, how to design a neural network structure with appropriate layer and neuron numbers and with strong generalization ability is an outstanding issue.

FIG. 3. Structure of a CNN.

C. Convolutional Neural Network
As shown in Fig. 3, the structure of a CNN is divided into convolutional layers, pooling layers, and dense layers [26]. Each convolutional layer is composed of a specified number of kernels. Each kernel multiplies the input feature values with weights and adds biases to obtain outputs. Different kernels acquire different parameter values after training. The pooling layer itself does not contain any parameters. Take the max-pooling layer as an example. After receiving the input data, this layer scans the data according to a specified stride within a window of a certain length. Then, it outputs the maximum value of the data in each scanning window [52]. Therefore, the pooling layer compresses the data. It checks all the features in the scanning window and chooses the most important one [53]. There are other pooling methods, e.g., average pooling, which outputs the average value of the data in the window [54].

The pooling layer plays an important role in enlarging the receptive field of a CNN. After the data passes through the pooling layer, the original data length gets shortened, and the kernel in the next convolutional layer can handle a larger range of data than the convolution window in the previous layer, thereby expanding the convolutional layers' overall operation range over the data [24]. After extracting the features of the input data through convolutional layers and pooling layers, the flattened feature map is put into the fully connected layer to obtain the final output [26].

Since the parameters to be learned by a convolutional layer are only the parameter values in the convolution kernels, they are independent of the input data size. Therefore, a CNN reduces the number of free parameters, allowing the network to be deeper with fewer parameters [54]. Due to its unique model structure and powerful data processing capability, the CNN is being widely used [26].
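As a concrete illustration of the layer types just described, the following PyTorch sketch traces the shape of a one-dimensional input through a convolutional layer and a max-pooling layer; the channel number and kernel sizes are illustrative choices, not the exact settings of this work.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 4096)   # (batch, channels, length), e.g. a 1-s strain at 4096 Hz

conv = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=16)
pool = nn.MaxPool1d(kernel_size=4, stride=4)

h = conv(x)         # -> (1, 16, 4081): 16 learned kernels slide over the input
h = torch.relu(h)   # nonlinearity after the convolution
h = pool(h)         # -> (1, 16, 1020): keep the largest value in each window of 4
print(h.shape)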
III. SIMULATED DATASET

In this section we discuss our strategy to simulate GW data and build different datasets for machine learning studies.

A. Data Obtaining
Usually, GW waveforms are divided into three stages: inspiral, merger, and ringdown. The signal detected by a single GW detector is,

h(t) = F₊(t) h₊(t) + F×(t) h×(t), (4)

where h₊ and h× are the two polarization modes of GWs, and F₊ and F× are the corresponding pattern functions of these two polarization modes as functions of the sky localization and the polarization angle [55].

In this work, we focus on GW signals generated by BBH mergers. We use the effective-one-body numerical-relativity (EOBNR) model with aligned spins [56] to simulate the waveform. Without losing generality, in addition to the SNR of GW signals, we focus on the intrinsic parameters, i.e. masses and spins. The extrinsic parameters, such as the polarization angle and the sky localization, are all fixed to fiducial values. The possible precession effect of BBHs is not considered, since we are using the aligned-spin waveform. The spin parameter is denoted by χ. We set the luminosity distance D_L = 100 Mpc, and neglect the redshift effect of the GW signal. Such a setting makes us focus on the machine learning algorithm, and the assumptions can easily be relaxed when needed. It is worth mentioning that, in practice we have tested the effects brought by the inclusion of extrinsic parameters, such as the sky location of the GW source, the inclination of the BBH orbit, and the polarization angle of the GW. Consistent optimization effects were obtained with extrinsic parameters. Because the dependence of the GW waveform on extrinsic parameters is much simpler than that on intrinsic parameters for GWs from spin-aligned BBHs, in the following we focus on the intrinsic parameters. It is straightforward to augment our machine learning data analysis with extrinsic parameters.

We use the open-source tool provided by Gebhard et al. [35] to generate data. This tool generates GW signals based on the PyCBC (https://pycbc.org/) [57] and LALSuite (https://github.com/lscsoft/lalsuite) platforms. With given parameters, the analog signals from LALSuite contain two time series, i.e. the two polarization modes of the GW signal. The above sequences are combined with the corresponding antenna functions F₊,× according to Eq. (4). The signal offset caused by the distance difference between LIGO's Hanford and Livingston detectors is properly introduced.

The final GW signal sequence used is,

s(t) = h(t) + n(t), (5)

where h(t) is the GW waveform obtained by the above simulation in Eq. (4), and n(t) is the detector's noise in the strain. The advanced LIGO (aLIGO) power spectral density (PSD) at the "zero-detuned high-power" design sensitivity (aLIGOZeroDetHighPower) [58] is used to simulate the Gaussian noise. After inserting the analog waveform into the noise, we can calculate the SNR of the strain. A rescaling of it, corresponding to a rescaling in the distance, can achieve other desired SNR values [55]. Finally, we get the GW strain with specific SNR values. The strain needs to be preprocessed before being used as the final data, similarly to PyCBC's GW data processing. The preprocessing stage includes two steps. The first one is data whitening. The aLIGO design sensitivity is used to whiten the original strain, and to filter out the spectral components of the environmental noise so as to properly scale its influence on the strain. The second step is filtering. We filter out the frequency components below 20 Hz to eliminate the influence of the Newtonian noise at low frequencies.
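For illustration, the generation pipeline described above can be sketched with PyCBC's public API as follows. The approximant name, extrinsic parameters, and target SNR are placeholder assumptions, and the whitening step is omitted for brevity.

import numpy as np
from pycbc.waveform import get_td_waveform
from pycbc.detector import Detector
from pycbc.psd import aLIGOZeroDetHighPower
from pycbc.noise import noise_from_psd
from pycbc.filter import sigma, highpass
from pycbc.types import TimeSeries

delta_t = 1.0 / 4096   # 4096 Hz sampling rate, 1-s samples

# Two polarizations of an aligned-spin BBH waveform (illustrative parameters)
hp, hc = get_td_waveform(approximant="SEOBNRv4", mass1=36.0, mass2=29.0,
                         spin1z=0.0, spin2z=0.0, distance=100.0,
                         delta_t=delta_t, f_lower=20.0)

# Project onto the Hanford detector with fiducial extrinsic parameters, Eq. (4)
det = Detector("H1")
fp, fc = det.antenna_pattern(right_ascension=0.0, declination=0.0,
                             polarization=0.0, t_gps=1126259462)
h = fp * hp + fc * hc

# Crop/pad the signal into a 1-s segment with the merger near a fixed position
seg = np.zeros(4096)
n = min(len(h), 4096)
seg[-n:] = h.numpy()[-n:]
h = TimeSeries(seg, delta_t=delta_t)

# Gaussian noise colored by the aLIGO design PSD, to be added as in Eq. (5)
psd = aLIGOZeroDetHighPower(4096 // 2 + 1, 1.0, 20.0)   # delta_f = 1 Hz
noise = noise_from_psd(4096, delta_t, psd, seed=0)

# Rescale the signal to a target SNR (equivalent to rescaling the distance)
h *= 10.0 / sigma(h, psd=psd, low_frequency_cutoff=20.0)
strain = noise + h
strain = highpass(strain, 20.0)   # remove components below 20 Hz
# (whitening against the design PSD would follow here, as described above)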
B. Dataset Building

This work involves experiments on multiple datasets which share the same structure. They have the following characteristics: (i) within the given parameter ranges, the data parameters, masses m₁, m₂ and spins χ₁, χ₂, are all randomly sampled; (ii) the ratio of samples containing a GW signal to samples of pure noise in each dataset is 1 to 1; (iii) the duration of each sample is 1 second and the sampling rate is 4096 Hz, that is, each sample is a time series with a length of 4096; (iv) considering the symmetry of the mass parameters in a BBH, we use the m₁ ≥ m₂ convention to sample masses; (v) the time-series data of the samples all use the Hanford detector sequence (the H1 sequence); (vi) in this work we do not consider the influence of the sky position of the GW signal in the strain, that is to say, the peak of the GW signal is located at the same position of the strain in all datasets. These specifics are natural for a study of this kind.

In total, we construct 5 datasets for training and 5 datasets for testing, which are annotated as Training Datasets and Testing Datasets hereafter.
TABLE I. Parameters for Training Datasets.

Dataset   m₁ (M⊙)    m₂ (M⊙)    χ₁,₂           SNR
1.1       [10, 80]   [10, 80]   0              [7, …]
1.2       [10, 80]   [10, 80]   0              [7, …]
2.1       [30, 60]   [30, 60]   0              8
3.1       30         45         [−0.5, 0.5]    8
4.1       [10, 80]   [10, 80]   [−0.5, 0.5]    [7, …]

TABLE II. Parameters for Testing Datasets. Parameters that vary across sub-datasets are given in the [min, step, max] format.

Dataset   m₁ (M⊙)       m₂ (M⊙)       χ₁,₂              SNR
1         [10, 80]      [10, 80]      0                 [7, …, …]
2         [10, …, 80]   [10, …, 80]   0                 8
3         30            45            [−0.5, …, 0.5]    8
4         [10, 80]      [10, 80]      [−0.5, 0.5]       [7, …]
5         [10, 80]      [10, 80]      [−0.5, 0.5]       [7, …, …]

• Training Datasets. Each Training Dataset consists of a training set and a testing set, which contain 5000 and 500 samples, respectively. When we train the model on a dataset, the model is first trained on the training set, and then tested on the testing set.

• Testing Datasets. Testing Datasets contain many sub-testing sets (annotated as sub-datasets below) to test the model performance on different parameter ranges. Each sub-dataset contains 500 samples.

The parameters for the Training Datasets used in this work are shown in Table I. Take Dataset 1.1 as an example. This dataset only considers three parameters: the component masses of the BBH, m₁ and m₂, and the SNR ρ. The spin parameters of the BBH are set to zero. The mass range of both black holes is mᵢ ∈ [10 M⊙, 80 M⊙] (i = 1, 2), with m₁ ≥ m₂ as defaulted; the mass distribution is shown in Fig. 4. Figure 5 gives two GW examples from Dataset 1.2, with a large SNR (upper panel) and a small SNR (lower panel).

The parameter settings of the Testing Datasets are shown in Table II. As shown above, the parameters used to construct each sub-dataset are given in the [min, step, max] format, where min and max define the overall sampling range of the parameter, and each sub-dataset is constructed with a uniform step size. Take Testing Dataset 1 as an example, which consists of 16 sub-datasets.
FIG. 4. Mass distribution in Dataset 1.1. We use m₁ ≥ m₂ as a convention to randomly sample mass parameters. Cyan points are the training set samples and red points are the testing set samples.

The SNR sampling range of the first sub-dataset starts at 7, and subsequent sub-datasets shift this range by a uniform step. In this way, we test the model on different parameter intervals and explore the model robustness with the change of parameter values.

IV. IMPROVING TECHNIQUES

A. Model Baseline
After preliminary experiments on networks with different depths and hyperparameter values, we decide to use a network structure similar to that of George and Huerta [27] as the baseline CNN model. Its structure is shown in Fig. 6.

The model receives the GW strain as input and outputs a discriminant value for classification. In the model, the classification threshold is set to 0.5, that is, a sample with a discriminant value greater than 0.5 is judged as a positive sample (i.e., containing a GW signal), otherwise it is judged as a negative one (i.e., pure noise). The baseline model contains three convolutional layers. The numbers of channels of the convolutional layers are 16, 32, and 64, and the sizes of their convolution kernels are 16, 8, and 8, respectively. After each convolutional layer, a pooling layer and an activation layer are provided. The pooling layer uses maximum pooling with a pooling window of 4 and a step size of 4, which means that the feature sequence is down-sampled to a quarter of the original sequence length while retaining the feature with the largest value. The activation layer uses the ReLU function shown in Eq. (2). Subsequently, the feature sequences extracted by the convolutional structure are input to the fully connected layers to achieve discriminative classification. The model outputs discriminant values to determine whether the sequence contains a GW signal.
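In PyTorch, the baseline architecture described above can be sketched as follows. The convolution and pooling settings follow the text; the 128-unit hidden dense layer is an assumption, since the dense-layer sizes are not fully specified here.

import torch.nn as nn

class BaselineCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=16), nn.MaxPool1d(4, 4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8), nn.MaxPool1d(4, 4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8), nn.MaxPool1d(4, 4), nn.ReLU(),
        )
        # with a 4096-sample input, the feature map is 64 channels x 61 samples
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 61, 128),   # hidden size is an assumption
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),              # discriminant value in (0, 1)
        )

    def forward(self, x):              # x: (batch, 1, 4096)
        return self.classifier(self.features(x))

# a sample is judged to contain a GW signal if the discriminant value > 0.5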
FIG. 5. Examples of the simulated GW strain: (m₁, m₂) = (75.…, …) M⊙ with ρ = ….56 (upper); (m₁, m₂) = (67.…, …) M⊙ with ρ = ….41 (lower). The orange line is the normalized GW signal in the sample strain.

FIG. 6. Basic CNN model architecture that is used in this work [27].
In this work, cross entropy is used as the loss function to update the gradient. Cross entropy is an important concept proposed by Shannon [59] in information theory. It is often used to measure the difference between the predicted distribution and the true distribution [59]. Let the label of each sample be y and the discriminant value given by the model be p. The cross entropy is,

J = −[ y log p + (1 − y) log(1 − p) ]. (6)

We use ADAM [60] as the gradient descent strategy to update our model parameters. This strategy inputs the samples into the model in batches to calculate the gradient and update the parameters. After trying batch sizes of 5, 10, 25, 50, 100, 200, 250, and 500, we take the value 25, with which the model has the highest accuracy. The learning rate is set to 5 × 10⁻…, and is reduced by a factor of 10 every 20 epochs to avoid overfitting the model.

The code implementation of our work is based on the PyTorch framework [61], which uses the
CUDA Deep Neural Network library (cuDNN) [62] to accelerate model operations on the GPU. Our work deploys experiments on an NVIDIA TITAN X GPU.
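Putting the pieces together, a minimal training setup consistent with this description (binary cross entropy of Eq. (6), ADAM realizing the update of Eq. (3), batch size 25, and a tenfold learning-rate decay every 20 epochs) might look like the following sketch. The initial learning rate and epoch number are placeholders, and a DataLoader named train_loader yielding (strain, label) batches of 25 samples is assumed.

import torch
import torch.nn as nn

model = BaselineCNN()                 # the sketch from Sec. IV A
criterion = nn.BCELoss()              # cross entropy of Eq. (6)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # placeholder value
# reduce the learning rate by a factor of 10 every 20 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

num_epochs = 60                       # placeholder value
for epoch in range(num_epochs):
    for x, y in train_loader:         # batches of 25 samples
        optimizer.zero_grad()
        p = model(x).squeeze(1)       # discriminant values
        loss = criterion(p, y.float())
        loss.backward()               # gradient of the loss
        optimizer.step()              # ADAM parameter update
    scheduler.step()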
B. Improving Techniques
We now discuss several improvements: dropout, batch normalization, and the 1 × 1 convolution.

Dropout was first proposed by Hinton et al. [63] in 2012 to alleviate the overfitting problem of neural networks. Subsequently, dropout has become one of the most widely used techniques in deep learning [64]. Its basic idea is that, in each batch of training, a probability p is manually specified, so that each neuron in the fully connected layer stops working with probability p, i.e., its output is set to zero.

The advantages of dropout are the following [63]. First, in the training process of each batch, because the neurons of each layer are deactivated with probability p, the network structure of each training pass is different. The overall model training is equivalent to a joint decision-making process between multiple neural networks with different structures, which helps to alleviate the problem of overfitting. Second, with dropout, two neurons do not necessarily appear in the same network structure each time, so that the parameter update no longer depends on the joint decision of some neurons with fixed relationships. At the feature level, this technique prevents the decision-making from over-depending on certain features and forces the model to learn more robust feature representations [26].

Batch normalization is a neural network training technique proposed by Ioffe and Szegedy [65] in 2015. Its specific idea is the following. In the training process of each batch, after the data passes through the activation layer, the activation values of each batch of data are normalized. That is, the average value of the sample data of each batch is normalized to 0, and the variance is normalized to 1. In a batch of data of length m, the activated data are denoted B = {x₁, ..., xₘ}, and the batch-normalized data are denoted y = {y₁, ..., yₘ}. Then the batch normalization algorithm is expressed as [65]

µ_B ← (1/m) Σᵢ xᵢ , (7)

σ²_B ← (1/m) Σᵢ (xᵢ − µ_B)² , (8)

x̂ᵢ ← (xᵢ − µ_B) / √(σ²_B + ε) , (9)

yᵢ ← γ x̂ᵢ + β , (10)

where the sums run over i = 1, ..., m, and γ and β are parameters learned in the gradient update. The purpose of this last step is to allow the batch-normalized result to recover the original input data if needed, which maintains the possibility of retaining the original structure. The parameter ε in Eq. (9) is used to prevent an invalid calculation when the variance σ²_B is zero.

The batch normalization technique keeps the mean and variance of the input data distribution of each layer in the CNN within a certain range. Thus, each layer of the network does not need to adapt to changes in the distribution of its input data, which is conducive to accelerating the learning of the model and speeding up training [26]. At the same time, batch normalization suppresses the problem that small changes in parameters are amplified with the deepening of the network layers, making the network less sensitive to the model parameters and the gradient updates more stable. Besides, due to the use of the dropout technique, the number of effective neurons in the model decreases, and the fitting speed of the network slows down [65]. Considering that batch normalization has a significant improving effect on the fitting speed of the model, dropout and batch normalization techniques are often introduced into the structure of the neural network at the same time [66].

The 1 × 1 convolution, also known as "network in network", was proposed by Lin et al. [67] in 2014. In one-dimensional convolution, it is a convolution kernel in which the window length is 1. At each position of the input sequence, the 1 × 1 convolution performs a weighted combination of the features across channels. In this way, cross-channel information interaction and the combination of multi-dimensional features of the original data are realized, and the model's ability to represent data features is enhanced [26].

TABLE III. The extended techniques used for each model.

Model      Dropout   Batch norm.   1 × 1 conv.
ConvNet1
ConvNet2   √
ConvNet3             √
ConvNet4                           √
ConvNet5   √         √
ConvNet6   √         √             √

TABLE IV. AUC scores on Datasets 1.1 and 1.2. The highest value in each column is highlighted.

Dataset    1.1    1.2
ConvNet1   0.86   0.96
ConvNet2   0.90   0.97
ConvNet3   0.89   0.97
ConvNet4   0.90   0.97
ConvNet5   0.91   0.98
ConvNet6   0.91   …
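To illustrate how the three techniques slot into the baseline network of Sec. IV A, the sketch below extends one convolutional block and the classifier head; the dropout probability, channel numbers, and layer sizes are illustrative assumptions rather than the exact ConvNet6 configuration.

import torch.nn as nn

# one convolutional block extended with the techniques discussed above
block = nn.Sequential(
    nn.Conv1d(16, 32, kernel_size=8),
    nn.MaxPool1d(4, 4),
    nn.ReLU(),
    nn.BatchNorm1d(32),                 # normalize activations, Eqs. (7)-(10)
    nn.Conv1d(32, 32, kernel_size=1),   # 1 x 1 convolution: combine channels pointwise
)

# dropout is applied in the fully connected part of the network
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 253, 128),           # illustrative sizes
    nn.ReLU(),
    nn.Dropout(p=0.5),                  # deactivate neurons with probability p during training
    nn.Linear(128, 1),
    nn.Sigmoid(),
)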
V. SIMULATION RESULTS
Based on the above techniques, we extend the basic CNN model described in Sec. IV A. Dropout, batch normalization, and the 1 × 1 convolution are introduced into the baseline model, and the resulting extended models are listed in Table III. The models are trained on Dataset 1.1, with a low SNR range ρ ∈ [7, …], and on Dataset 1.2, with a wider SNR range ρ ∈ [7, …]. During training, a validation set is used to monitor the training effect. When the validation set is input into the model, it does not participate in the gradient update, but is only used to calculate the accuracy and the loss function value.

The validation results of each model in the training process are shown in Fig. 7.

FIG. 7. Model comparison on Datasets 1.1 (upper panels) and 1.2 (lower panels). ConvNet1 to ConvNet6 are the models shown in Table III. We use the accuracy (fraction of samples correctly classified) and the validation loss (loss function value on the validation set) as our metrics to track the model performance in the training process.

As shown in the figure, on Dataset 1.1 (low SNR), the accuracy of each extended network
model is significantly improved compared with that of the basic model (ConvNet1). The validation results show that the final accuracy of each extended network is stable at 83%–87%, while the accuracy of the basic model is below 80%. Among them, ConvNet5, which uses dropout and batch normalization, achieves the highest accuracy in the stable stage after enough iterations. ConvNet6, which uses all improving techniques, namely dropout, batch normalization, and the 1 × 1 convolution, performs comparably. On Dataset 1.2, with its wider SNR range, there is less difference between the models. In this situation, the extended models are still superior to the basic model. In terms of accuracy, ConvNet5 reaches the optimal value in the stable stage of the late training epochs. Its performance is slightly better than that of ConvNet6. In terms of the value of the loss function, ConvNet5 and ConvNet6 have similar results in the stable stage; actually, ConvNet5 is slightly better than ConvNet6 there. In addition, as the SNR distribution range of the dataset expands, the differences between the samples increase, so that the overfitting problem of the model is reduced.

After training, the models are applied to the testing sets for
final evaluation. The testing results are shown in Fig. 8.

FIG. 8. Results on the testing sets of Datasets 1.1 (left panels) and 1.2 (right panels). We use both the ROC curve and the accuracy as our evaluation metrics. As seen in the lower panels, the accuracy in identifying noise and signal, as well as the total accuracy, are illustrated separately.

The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied [68]; see Sec. IV A. When the threshold changes, the curve reflects the fraction of positive samples correctly identified (true positive rate) versus the fraction of negative samples incorrectly identified (false positive rate) [69]. The area under the curve (annotated as AUC) is equal to the probability that the model ranks a randomly chosen positive sample higher than a negative one [68].

The AUC scores of different models on Datasets 1.1 and 1.2 are shown in Table IV. It can be seen that on Dataset 1.1, ConvNet5 and ConvNet6 both achieve the highest accuracy and AUC score, but the classification rate of GW signals of ConvNet6 is slightly lower than that of ConvNet5 (shown in Fig. 8). On Dataset 1.2, ConvNet5 achieves both the highest accuracy and AUC score.

Based on the above study, we find that the dropout, batch normalization, and 1 × 1 convolution techniques all improve the performance of CNN models in GW signal detection tasks.
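The ROC curve, the AUC score, and the accuracy at the 0.5 threshold can be computed directly from the discriminant values, for instance with scikit-learn; the label and score arrays below are placeholders for the testing-set outputs.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0])                # placeholder labels (1 = signal)
y_score = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1])   # placeholder discriminant values

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # ROC: TPR vs FPR as threshold varies
auc = roc_auc_score(y_true, y_score)                 # area under the ROC curve
accuracy = np.mean((y_score > 0.5) == y_true)        # accuracy at the 0.5 threshold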
VI. GENERALIZATION AND ROBUSTNESS

In the work of George and Huerta [27], who first introduced the CNN structure into the field of GW data processing, the generalization ability of deep learning algorithms in the task of GW signal detection was discussed. Subsequent studies have pointed out that this ability is a major advantage of deep learning algorithms over matched filtering [28, 35]. As mentioned in Sec. II B, the generalization ability refers to the ability of the detection algorithm to respond to signals that are outside of the distribution of the training data [49]. Here we present the first dedicated study of the generalization ability of the CNN model in the task of GW signal detection, including the generalization characteristics of the CNN model over multiple parameter ranges, such as masses and aligned spins.

Following the discussion in Sec. V, we use ConvNet5 for the following experiments, and hereinafter we refer to it as "the model". Considering that the discussion in Sec. V is based only on the performance of the model on datasets where the spins are set to zero, we conduct a further test of the ConvNet5 model on Dataset 4.1. Dataset 4.1 introduces spin parameters on the basis of Dataset 1.2, and is used to test the overall performance of the model in the GW detection task. After 30 epochs of training on Dataset 4.1, the model reaches an accuracy of 93%, which shows that the model still has a good ability to detect GW signals when the spin parameters are considered.

FIG. 9. Model robustness on BBH masses (left) and spins (right). The colorbar and numbers are the accuracy of the model tested on different sub-datasets (in specific mass/spin ranges) of the Testing Datasets. The mirror part of m₂ ≥ m₁ is also plotted.

In the following subsections, we test the generalization ability of the model in masses and in spins, as well as the robustness of the nonspinning model on spinning data.

A. Mass Parameters
To simplify the investigation, we first study the generalization ability of the model in a GW detection task that only considers mass parameters. The Training Dataset used in this section is Dataset 2.1 (see Table I). In this dataset, the SNR of the GW signal is 8, the mass sampling range of the BBH is [30 M⊙, 60 M⊙], and the spins are set to zero. After 30 epochs of training, the accuracy on the testing set is 93%. Subsequently, we use Testing Dataset 2 to test the model on sub-datasets with the mass sampling range of [10 M⊙, 80 M⊙]. The accuracy is shown in the left panel of Fig. 9.

We find that the model works best on sub-datasets within the original training parameter range of [30 M⊙, 60 M⊙]. For most of the sub-datasets beyond the training parameter range (i.e., outside of [30 M⊙, 60 M⊙]), high accuracy is still achieved, indicating that the model does have a certain generalization ability in the mass parameters. It can be seen from the data in the figure that the model is more effective for BBHs of high masses, while the accuracy for low-mass BBHs is lower. Studying the signal patterns corresponding to the mass parameters, we find that this is caused by the fact that the model more easily identifies the data fluctuations introduced by short signals. As the mass decreases, the GW signal in band becomes longer, which is relatively closer to the characteristics of random noise. This is harder to identify with our CNN model. We show example waveforms of different masses and spins in Fig. 10. It is clear from the upper panels that GW signals of smaller masses are longer, thus they are harder to detect with our CNN models. We leave the discussion of further improvement of our CNN models in this parameter space to future studies.
Now we discuss the generalization ability of the model in the spin parameters. We use Dataset 3.1 as an illustration. The SNR of this dataset is 8, the masses of the BBH are 30 M⊙ and 45 M⊙, and the range of the spin parameters is χᵢ ∈ [−0.5, 0.5] (i = 1, 2). After training, we use Testing Dataset 3 to test the model on sub-datasets with χ ranging in [−0.5, 0.5]. The accuracy is shown in the right panel of Fig. 9. The model generalizes over the spin range, while the accuracy decreases for larger aligned-spin values: with a larger χ value, the GW signals become longer (the "hang-up" effect), thus harder to detect with our CNN model. Further improvements in this aspect will be subject to future studies.
What if the model is only trained with nonspinning GW signals, but is tested with spinning signals? We use Dataset 1.2 (see Table I), which only involves the mass parameters, for training. After 30 epochs of training, we test the model on Testing Dataset 4 (see Table II), which contains spin parameters as well. We obtain an accuracy close to the ∼94% achieved on Dataset 1.2, where only the mass parameters are included. It shows the generalization ability of the model
that is only trained with nonspinning GWs, but tested on data containing spin parameters.

We further use Testing Dataset 5 (see Table II) to test the model on sub-datasets with different SNR distributions. The result is shown in Fig. 11. As shown in the figure, the model has the best detection efficiency on data with high SNR, and it is slightly worse on data with low SNR. However, even on the sub-dataset with the lowest SNR range, the accuracy is still ∼83%.

FIG. 10. Simulated GW strain examples of different masses and spins. The upper panels are strain examples of increasing masses from left to right: (10 M⊙, … M⊙), (45 M⊙, … M⊙), and (80 M⊙, … M⊙). The lower panels contain examples of different spins.

FIG. 11. Robustness of CNN models that are only trained with nonspinning GWs on the spinning GW data for different SNRs.

D. Conclusion
From the above study, it can be seen that the CNN model shows good generalization ability for data of different parameter ranges, which is a major advantage of the CNN model over the matched-filtering method. The matched-filtering method is based on an existing template bank. Its search for, and response to, GW signals are limited to the existing waveforms. GW signals beyond the existing waveforms cannot be easily detected. The generalization ability of the CNN model in the task of GW signal detection will help us to discover signals beyond the existing templates. It may also be able to detect signals that imprint orbital eccentricity [70], orbital precession [71], and deviations from general relativity [72]. Such a generalization ability will undoubtedly play an essential role in searches of GW signals beyond what we have in the template bank. More studies are needed along this direction.
For the first time, this work specifically studied the effects on CNN models brought by improving techniques such as dropout, batch normalization, and the 1 × 1 convolution. Detailed comparisons between models with different improving techniques were made. Our simplified experiments show that dropout, batch normalization, and the 1 × 1 convolution improve the performance of CNN models in GW signal detection tasks. As the difficulty of the task continues to increase, the requirements for feature extraction and processing are getting higher and higher. The network depth of the CNN is limited by the gradient propagation problem, resulting in a limitation on its performance [26]. The proposal of the residual network structure greatly alleviates the problem of gradient propagation in CNNs and makes the depth limit of CNN models rapidly expand [73]. In the field of time-series processing, the long short-term memory model is widely used in multiple tasks, such as speech processing and stock price prediction, and has achieved excellent results [74]. At present, some studies have introduced the above structures into GW data processing [32, 34]. However, in the task of GW signal detection, there is still a lack of a unified investigation of these models. Comparing these models (including CNN models) on GW signal detection tasks, in terms of the recognition rate, model size, running time, and other indicators, is crucial to find an optimal model structure suitable for real-world tasks. It is a worthwhile direction for future studies.

For the sake of simplifying the investigation, this work only uses simulated noise to construct the GW sample data. In order to make the model more in line with the characteristics of real detection, real noise from aLIGO/Virgo data can be used. At the same time, the data used in this article are only for experimental purposes, so the amount of data is limited. By increasing the number of samples in the dataset, the performance of the model can be further improved. Ensemble learning is also an optimization technique that is worth considering, where the basic idea is to train multiple neural network models, and to use voting and other common decision-making strategies in order to make up for the decision-making defects of a single model and to improve the decision-making ability of the overall model [75]. These aspects deserve detailed studies on their own.
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (11975027, 11991053, 11690023, 11721303), the Young Elite Scientists Sponsorship Program by the China Association for Science and Technology (2018QNRC001), the Max Planck Partner Group Program funded by the Max Planck Society, and the High-performance Computing Platform of Peking University. It was partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences through Grant No. XDB23010200.

[1] B. P. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. Lett. 116, 061102 (2016), arXiv:1602.03837 [gr-qc].
[2] B. Abbott et al. (LIGO, VIRGO), "Observation of Gravitational Waves from a Binary Black Hole Merger," in
Centennial ofGeneral Relativity: A Celebration , edited by C. A. Z. Vascon-cellos (2017) pp. 291–311.[3] J. Liu, G. Wang, Y.-M. Hu, T. Zhang, Z.-R. Luo, Q.-L. Wang,and L. Shao, Chin. Sci. Bull. , 1502 (2016).[4] B. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. X ,031040 (2019), arXiv:1811.12907 [astro-ph.HE].[5] A. H. Nitz, C. Capano, A. B. Nielsen, S. Reyes, R. White,D. A. Brown, and B. Krishnan, Astrophys. J. , 195 (2019),arXiv:1811.01921 [gr-qc].[6] R. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. D ,043015 (2020), arXiv:2004.08342 [astro-ph.HE].[7] R. Abbott et al. (LIGO Scientific, Virgo), Astrophys. J. Lett. , L44 (2020), arXiv:2006.12611 [astro-ph.HE].[8] B. Abbott et al. (LIGO Scientific, Virgo), Astrophys. J. Lett. , L3 (2020), arXiv:2001.01761 [astro-ph.HE].[9] R. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. Lett. ,101102 (2020), arXiv:2009.01075 [gr-qc].[10] R. Abbott et al. (LIGO Scientific, Virgo), (2020),arXiv:2010.14527 [gr-qc].[11] B. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. Lett. ,161101 (2017), arXiv:1710.05832 [gr-qc].[12] L. S. Finn, Phys. Rev. D , 5236 (1992), arXiv:gr-qc / et al. , Astrophys. J. , 136 (2012),arXiv:1107.2665 [astro-ph.IM].[14] S. A. Usman et al. , Class. Quant. Grav. , 215004 (2016),arXiv:1508.02357 [gr-qc]. [15] B. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. D ,122004 (2016), [Addendum: Phys.Rev.D 94, 069903 (2016)],arXiv:1602.03843 [gr-qc].[16] R. Smith, S. E. Field, K. Blackburn, C.-J. Haster, M. P¨urrer,V. Raymond, and P. Schmidt, Phys. Rev. D , 044031 (2016),arXiv:1604.08253 [gr-qc].[17] I. Harry, S. Privitera, A. Boh´e, and A. Buonanno, Phys. Rev. D , 024012 (2016), arXiv:1603.02444 [gr-qc].[18] N. J. Cornish and T. B. Littenberg, Class. Quant. Grav. ,135012 (2015), arXiv:1410.3835 [gr-qc].[19] A. J. Chua, C. R. Galley, and M. Vallisneri, Phys. Rev. Lett. , 211101 (2019), arXiv:1811.05491 [astro-ph.IM].[20] A. J. Chua and M. Vallisneri, Phys. Rev. Lett. , 041102(2020), arXiv:1909.05966 [gr-qc].[21] N. Mukund, S. Abraham, S. Kandhasamy, S. Mitra, and N. S.Philip, Phys. Rev. D , 104059 (2017), arXiv:1609.07259[astro-ph.IM].[22] R. E. Colgan, K. R. Corley, Y. Lau, I. Bartos, J. N. Wright,Z. Marka, and S. Marka, Phys. Rev. D , 102003 (2020),arXiv:1911.11831 [astro-ph.IM].[23] E. Cuoco, J. Powell, M. Cavagli`a, et al. , Machine Learning:Science and Technology , 011002 (2020), arXiv:2005.03745.[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Commun. ACM , 84–90 (2017).[25] J. Schmidhuber, Neural Networks , 85, arXiv:1404.7828.[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016) .[27] D. George and E. Huerta, Phys. Rev. D , 044039 (2018),arXiv:1701.00008 [astro-ph.IM].[28] H. Gabbard, M. Williams, F. Hayes, and C. Messenger,Phys. Rev. Lett. , 141103 (2018), arXiv:1712.06041 [astro-ph.IM]. [29] P. G. Krastev, Phys. Lett. B , 135330 (2020),arXiv:1908.03151 [astro-ph.IM].[30] H. Gabbard, C. Messenger, I. S. Heng, F. Tonolini, andR. Murray-Smith, (2019), arXiv:1909.06296 [astro-ph.IM].[31] H. Shen, E. Huerta, Z. Zhao, E. Jennings, and H. Sharma,(2019), arXiv:1903.01998 [gr-qc].[32] H. Shen, D. George, E. A. Huerta, and Z. Zhao, in ICASSP 2019: IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP) (2019) pp. 3237–3241,arXiv:1711.09919 [gr-qc].[33] C. Chatterjee, L. Wen, K. Vinsen, M. Kovalam, and A. Datta,Phys. Rev. D , 103025 (2019), arXiv:1909.06367 [astro-ph.IM].[34] C. Dreissigacker, R. Sharma, C. Messenger, R. Zhao, andR. Prix, Phys. Rev. 
D , 044009 (2019), arXiv:1904.13291[gr-qc].[35] T. D. Gebhard, N. Kilbertus, I. Harry, and B. Sch¨olkopf, Phys.Rev. D , 063015 (2019), arXiv:1904.08693 [astro-ph.IM].[36] D. George, H. Shen, and E. Huerta, Phys. Rev. D , 101501(2018).[37] M. Chen, Y. Zhong, Y. Feng, D. Li, and J. Li, ScienceChina Physics, Mechanics, and Astronomy , 129511 (2020),arXiv:2003.13928 [astro-ph.IM].[38] J. P. Marulanda, C. Santa, and A. E. Romano, Phys. Lett. B , 135790 (2020), arXiv:2004.01050 [gr-qc].[39] H. Wang, S. Wu, Z. Cao, X. Liu, and J.-Y. Zhu, Phys. Rev. D , 104003 (2020), arXiv:1909.13442 [astro-ph.IM].[40] T. M. Mitchell, Machine Learning (McGraw-Hill, New York,1997).[41] C. M. Bishop,
Pattern Recognition and Machine Learning (In-formation Science and Statistics) (Springer-Verlag, Berlin, Hei-delberg, 2006).[42] T. Kohonen, Neural Networks , 3 (1988).[43] W. S. McCulloch and W. Pitts, The Bulletin of MathematicalBiophysics , 115 (1943).[44] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall,(2018), arXiv:1811.03378.[45] V. Nair and G. E. Hinton, in Proceedings of the 27th Inter-national Conference on International Conference on MachineLearning (Omnipress, Madison, WI, USA, 2010) p. 807–814.[46] J. Hopfield, Proc. Nat. Acad. Sci. , 2554 (1982).[47] S. Sra, S. Nowozin, and S. Wright, Optimization for MachineLearning , Neural Information Processing Series (MIT Press).[48] S. Urolagin, K. V. Prema, and N. V. S. Reddy, in
Proceedingsof the 2011 International Conference on Advanced Comput-ing, Networking and Security , ADCONS’11 (Springer-Verlag,Berlin, Heidelberg, 2011) p. 171–178.[49] M. Mohri, A. Rostamizadeh, and A. Talwalkar,
Foundations ofMachine Learning, Second Edition , Adaptive Computation andMachine Learning Series (MIT Press).[50] S. Lawrence and C. L. Giles, in
Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Net-works. IJCNN 2000. Neural Computing: New Challenges andPerspectives for the New Millennium , Vol. 1 (IEEE, 2000) pp.114–119. [51] D. M. Hawkins, Journal of Chemical Information and Com-puter Sciences , 1.[52] K. Yamaguchi, K. Sakamoto, T. Akabane, and Y. Fujimoto, in The First International Conference on Spoken Language Pro-cessing, ICSLP 1990, Kobe, Japan, November 18-22, 1990 (ISCA, 1990).[53] D. Ciregan, U. Meier, and J. Schmidhuber, in (2012) pp.3642–3649, arXiv:1202.2745 [cs].[54] H. H. Aghdam and E. J. Heravi,
Guide to Convolutional Neu-ral Networks: A Practical Application to Tra ffi c-Sign Detectionand Classification , 1st ed. (Springer Publishing Company, In-corporated, 2017).[55] C. Cutler and E. E. Flanagan, Phys. Rev. D , 2658 (1994),arXiv:gr-qc / et al. , Phys.Rev. D , 044028 (2017), arXiv:1611.03703 [gr-qc].[57] A. Nitz, I. Harry, D. Brown, C. M. Biwer, J. Willis, et al. ,“gwastro / pycbc: Pycbc release v1.16.9,” (2020).[58] D.Shoemaker, LIGO Document (2010).[59] C. E. Shannon, Bell System Technical Journal , 379 (1948).[60] D. P. Kingma and J. Ba, (2014), arXiv:1412.6980 [cs.LG].[61] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, et al. ,(2019), arXiv:1912.01703 [cs.LG].[62] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran,B. Catanzaro, and E. Shelhamer, CoRR abs / (2014),arXiv:1410.0759.[63] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, andR. R. Salakhutdinov, (2012), arXiv:1207.0580 [cs.NE].[64] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov, Journal of Machine Learning Research ,1929 (2014).[65] S. Io ff e and C. Szegedy, in Proceedings of the 32nd Inter-national Conference on International Conference on MachineLearning - Volume 37 , ICML’15 (JMLR.org, 2015) p. 448–456.[66] X. Li, S. Chen, X. Hu, and J. Yang, in / CVF Con-ference on Computer Vision and Pattern Recognition (CVPR) (IEEE) pp. 2677–2685.[67] M. Lin, Q. Chen, and S. Yan, (2013), arXiv:1312.4400[cs.NE].[68] T. Fawcett, Pattern Recognition Letters , 861.[69] D. Powers, Mach. Learn. Technol. (2008).[70] X. Liu, Z. Cao, and L. Shao, Phys. Rev. D , 044049 (2020),arXiv:1910.00784 [gr-qc].[71] S. Babak, A. Taracchini, and A. Buonanno, Phys. Rev. D ,024010 (2017), arXiv:1607.05661 [gr-qc].[72] L. Shao, N. Sennett, A. Buonanno, M. Kramer, and N. Wex,Phys. Rev. X , 041025 (2017), arXiv:1704.07561 [gr-qc].[73] K. He, X. Zhang, S. Ren, and J. Sun, in (2016)pp. 770–778, arXiv:1512.03385 [cs.CV].[74] S. Hochreiter and J. Schmidhuber, Neural Comput. ,1735–1780 (1997).[75] D. Opitz and R. Maclin, Journal of Artificial Intelligence Re-search11