Improved deep learning techniques in gravitational-wave data analysis
Heming Xia, Lijing Shao,∗ Junjie Zhao, and Zhoujian Cao

Department of Astronomy, School of Physics, Peking University, Beijing 100871, China
Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing 100871, China
National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012, China
School of Physics and State Key Laboratory of Nuclear Physics and Technology, Peking University, Beijing 100871, China
Department of Astronomy, Beijing Normal University, Beijing 100875, China

∗ [email protected]

(Dated: December 25, 2020)

In recent years, convolutional neural network (CNN) and other deep learning models have been gradually introduced into the area of gravitational-wave (GW) data processing. Compared with the traditional matched-filtering techniques, CNN has significant advantages in efficiency in GW signal detection tasks. In addition, matched-filtering techniques are based on the template bank of the existing theoretical waveforms, which makes it difficult to find GW signals beyond theoretical expectation. In this paper, based on the task of GW detection of binary black holes, we introduce the optimization techniques of deep learning, such as batch normalization and dropout, to CNN models. Detailed studies of model performance are carried out. Through this study, we recommend the use of batch normalization and dropout techniques in CNN models in GW signal detection tasks. Furthermore, we investigate the generalization ability of CNN models on different parameter ranges of GW signals. We point out that CNN models are robust to the variation of the parameter range of the GW waveform. This is a major advantage of deep learning models over matched-filtering techniques.

I. INTRODUCTION
On 11 February 2016, the LIGO and Virgo Collaborations announced the first detection of a gravitational-wave (GW) signal, observed on 14 September 2015 and named GW150914 [1–3]. The detected GW signal comes from a binary black hole (BBH) merger. The masses of the BBH components are estimated to be 29 M⊙ and 36 M⊙. The successful observation of GW signals has provided valuable experimental data for GW astronomy, setting off a wave of GW research. So far, 50 GW signals from compact binary coalescences have been successfully detected [4–10]. Except for the binary neutron star (BNS) merger event GW170817 [11] and three other, strictly speaking, unclear merger events (GW190814 [7], GW190425 [8], and GW190426_152155 [10]), the remaining 46 signals all come from BBH mergers.

Currently, both LIGO and Virgo mainly use the matched-filtering method [12–15] to detect GW signals. This method builds a theoretical waveform template bank to match the monitored data and captures trigger signals as candidates for further verification [12]. Matched filtering plays a vital role in the processing of GW signal detection. However, it has shortcomings that should not be overlooked [16, 17]. The matched-filtering method requires a full search over the template bank to match a signal, which limits the data processing speed and, if a big template bank is used, has difficulty meeting the needs of real-time observation. In addition, with the continuous expansion of theoretical waveforms in an enlarging parameter space, the search space of matched filtering increases, which leads to an increase of data processing time and a reduction in the processing speed [17].

In recent years, many machine learning methods have been developed for GW signal detection tasks [18–23]. In the machine learning field, after AlexNet won the championship in the ImageNet competition in 2012 [24], deep learning algorithms stood out and achieved great success in many fields such as image classification, natural language processing, and speech recognition [25, 26]. In terms of classification tasks, compared with traditional machine learning algorithms, many deep learning algorithms, including convolutional neural networks (CNNs), have made significant progress in model accuracy and model complexity. Besides, the characteristics of the deep learning algorithm allow the time-consuming training process to be completed offline before the actual data analysis. This greatly reduces the amount of calculation in the online process and meets the need of real-time detection [26].

At present, deep learning methods, especially CNNs, have been widely explored in GW data processing [27–35]. In 2017, George and Huerta [27] first applied CNNs to GW signal detection tasks. They generated mock GW signals from BBH mergers and added them into white Gaussian noise to generate simulation datasets. They pointed out that the sensitivity, namely the fraction of signals which are correctly identified, of CNNs is similar to that of the matched-filtering method, while the speed is greatly improved. Work in the same period by Gabbard et al. [28] compared the false alarm rate and the receiver operating characteristic (ROC) curve between CNN models and the matched-filtering method, leading to a similar conclusion. Subsequently, the application of deep learning in the field of GW signal detection has been expanded greatly. Krastev [29] indicates that deep learning models work better on BNS mergers than BBH mergers.
More deep learning models, such as residual networks, fully convolutional networks, and other structures, have been introduced [34–39], and the research field has been continuously expanding. Variational autoencoders and Bayesian neural networks are used for parameter estimation of GW signals [30, 31]. The long short-term memory network has made progress in the field of GW signal noise reduction, and it is shown to effectively remove environmental noise and restore the GW signal under noise [32]. In the sky localization searching task of GW signals, deep learning methods such as CNNs have also achieved good results [33].

However, almost all deep learning algorithms such as CNNs used in the current studies are basic models. It means that they can be further optimized. In addition, many studies have pointed out that deep learning models can maintain a certain degree of robustness to GW signals beyond the range of the training set parameters [27, 28, 35]. However, there is no specific research on this aspect. Grounded on the above two points, we conduct experiments on the optimization effects of several deep learning techniques in the field of GW signal detection. The result shows that, compared with the basic model, the model with improved techniques achieves better performance. On the low signal-to-noise ratio (SNR) dataset, the model with multiple improved techniques has an accuracy rate of 84% on the testing set, 12% higher than that of the basic model, and an area under curve (AUC) score of 0.91, 6% higher than that of the basic model. On the overall dataset, our model with multiple improved techniques has an accuracy rate of 94% on the testing set, 4% higher than that of the basic model, and an AUC score of 0.98, 2% higher than that of the basic model. Moreover, we make a detailed study of the robustness of CNN models on GW signal detection tasks. Our experiments show that the CNN model has good robustness for data of different parameter ranges for masses and spins.

This paper is organized as follows. In Sec. II, we give a brief overview of deep learning. In Sec. III, we introduce the simulated datasets to be used in our experiments. Then the improved techniques for CNNs and the corresponding experimental results are shown in Secs. IV and V, respectively. In Sec. VI, we investigate the generalization ability of the CNN model in different parameter ranges.

II. DEEP LEARNING
Traditional machine learning methods include k-nearest neighbor, decision tree, support vector machine, and so on [40]. The advantage of machine learning methods is that, to some extent, they can replace the process of human learning. Through training on a large dataset, these models can learn the relationships in the data so as to classify, predict, and help humans make decisions [41]. However, when traditional machine learning methods are applied to specific tasks, they usually have difficulty processing the original data. In some cases, researchers have to manually extract data features and put them into the algorithms. Besides, traditional machine learning methods are usually limited by their fixed model structure. It is difficult for these algorithms to achieve rapid improvement in computing power and accuracy [26].

Deep learning overcomes some limitations of traditional machine learning methods. Its algorithm is derived from the neural network, which is a sub-field of traditional machine learning. This algorithm overcomes the limitation on the depth of the neural network and increases the computational power of the model. So far, many deep learning models, such as the CNN, the residual network, and the long short-term memory network, have been proposed.

FIG. 1. Architecture of a neuron.
At present, deep learning has achieved substantial success in face recognition, automatic driving, speech processing, and many other fields [26]. For the sake of a self-contained work, below we briefly review the principles of deep learning, including the structure of neurons, the basic principle of neural networks, and the model of CNN.
A. Neuron
The idea of neural networks in machine learning evolved from biological models. Generally speaking, a neural network is a network of parallel interconnections composed of simple adaptive units [42]. Its organization can simulate the interaction of the biological nervous system with real-world objects. The neuron is the "simple unit" in the above definition. In 1943, McCulloch and Pitts [43] abstracted it into the simple model shown in Fig. 1, namely the "M-P neuron model". In this model, each neuron receives input data x = (x₁, x₂, ..., xₙ) from previous neurons. The input is multiplied by the weights w = (w₁, w₂, ..., wₙ), plus the bias b, and then is injected into the activation function σ to obtain the output y. A single neuron can be represented by the following formula using vectors,

y = σ(w xᵀ + b). (1)

The activation function σ introduces nonlinear operations into neurons. Otherwise, the structure of neurons will simply be a superposition of linear operations, and the power of the network will be greatly reduced. The activation function σ comes in many forms [44]. In this paper, we use the most commonly used activation function, ReLU [45],

σ(x) = x for x > 0, and σ(x) = 0 for x ≤ 0. (2)

In general, the current neuron only outputs positive values after calculating the weighted sum of the data from the previous n neurons. At the feature level, it can be understood as follows: after a linear combination of n features, the neuron inputs the combined feature into the activation function to obtain the output feature.
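To make the M-P neuron concrete, here is a minimal sketch of Eqs. (1) and (2) in Python with NumPy; the input values, weights, and bias below are illustrative only.

import numpy as np

def relu(x):
    # Eq. (2): pass positive values through, zero out the rest
    return np.maximum(x, 0.0)

def neuron(x, w, b):
    # Eq. (1): weighted sum of the inputs plus a bias,
    # followed by the nonlinear activation
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs from n = 3 previous neurons
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias
y = neuron(x, w, b)              # a single scalar output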
B. Neural Network
The simplest example of a neural network is the Fully Connected Neural Network (FCNN). The structure of a FCNN is shown in Fig. 2. A manually specified number of neurons constitutes each layer, which is called a "dense layer" of the network. Neurons within the same layer are independent of each other. Each neuron receives data from the former layer, calculates its result, and puts it into all neurons of the next layer [46]. Note that the number of parameters of the first layer in a FCNN has a linear relationship with the input data size [26]. As the input data size increases, the number of parameters in the network increases correspondingly, slowing down the learning speed of the model and increasing the requirement of data storage [25].

FIG. 2. Structure of a FCNN.

The neural network is a supervised learning algorithm, whose characteristic is to use labeled data (i.e., data with the correct output y) for training, and to test the model on unlabeled data. A loss function is defined to measure the difference between the model output ŷ and the correct y. The expectation for training is to make the value of the loss function as small as possible [26]. The parameters to be learned in the neural network are the weights w and biases b in each layer of neurons. According to the gradient descent strategy, parameters in the neural network are updated in the direction where the value of the loss function decreases [47]. The gradient descent strategy reads,

θ_new = θ_old − α ∇_θ J(θ), (3)

where θ is the parameter to be updated, J is the loss function, and α is the manually specified learning rate which controls the speed of parameter updates.

The learning of a neural network is based on a training dataset collected and annotated by human beings. However, when we finally apply the model to real tasks, we hope the neural network has good generalization ability. The generalization ability of a neural network is defined as the ability of the network to handle unseen patterns [48]. In other words, this concept measures how accurately an algorithm is able to predict outcome values for previously unseen
data [49]. The testing set is a good type of unseen data. The data characteristics of the testing set are similar to those of the training set. In the meantime, its distribution is independent of the training set, and it does not appear in the training process [26]. Therefore, the final result on the testing set is a reliable index to evaluate the performance of a neural network.

However, if the number of parameters in the network is large compared to the number of samples in the training set, overfitting occurs [50]. If there are too many parameters in the neural network, the fitting capacity of the model becomes too strong, which makes the model fit the (noisy) characteristics of the training dataset too closely in the learning process. As a result, the model is only effective for the data samples that appear in the training set, resulting in the so-called overfitting problem [51]. Therefore, how to design a neural network structure with appropriate layer and neuron numbers and with strong generalization ability is an outstanding issue.

FIG. 3. Structure of a CNN.

C. Convolutional Neural Network
As shown in Fig. 3, the structure of a CNN is divided into convolutional layers, pooling layers, and dense layers [26]. Each convolutional layer is composed of a specified number of kernels. Each kernel multiplies the input feature values with weights and adds biases to obtain outputs. Different kernels acquire different parameter values after training. The pooling layer itself does not contain any parameters. Take the max-pooling layer as an example. After receiving the input data, this layer scans the data according to a specified stride within a window of a certain length. Then, it outputs the maximum value of the data in each scanning window [52]. Therefore, the pooling layer compresses the data. It checks all the features in the scanning window and chooses the most important one [53]. There are other pooling methods, e.g., average pooling, which outputs the average value of the data in the window [54].

The pooling layer plays an important role in enlarging the receptive field of a CNN. After the data passes through the pooling layer, the original data length gets shortened, and the kernel in the next convolutional layer can handle a larger range of data than the convolution window in the previous layer, thereby expanding the convolutional layers' overall operation range over the data [24]. After extracting the features of the input data through convolutional layers and pooling layers, the flattened feature map is put into the fully connected layer to obtain the final output [26].

Since the parameters to be learned by a convolutional layer are only the parameter values in the convolution kernels, they are independent of the input data size. Therefore, a CNN reduces the number of free parameters, allowing the network to be deeper with fewer parameters [54]. Due to its unique model structure and powerful data processing capability, the CNN is being widely used [26].
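As a concrete illustration of the layer types just described, the following PyTorch sketch traces the shape of a one-dimensional input through a convolutional layer and a max-pooling layer; the channel number and kernel sizes are illustrative choices, not the exact settings of this work.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 4096)   # (batch, channels, length), e.g. a 1-s strain at 4096 Hz

conv = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=16)
pool = nn.MaxPool1d(kernel_size=4, stride=4)

h = conv(x)         # -> (1, 16, 4081): 16 learned kernels slide over the input
h = torch.relu(h)   # nonlinearity after the convolution
h = pool(h)         # -> (1, 16, 1020): keep the largest value in each window of 4
print(h.shape)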
III. SIMULATED DATASET

In this section we discuss our strategy to simulate GW data and build different datasets for machine learning studies.

A. Data Obtaining
Usually, GW waveforms are divided into three stages: inspiral, merger, and ringdown. The signal detected by a single GW detector is,

h(t) = F₊(t) h₊(t) + F×(t) h×(t), (4)

where h₊ and h× are the two polarization modes of GWs, and F₊ and F× are the corresponding pattern functions of these two polarization modes as functions of the sky localization and the polarization angle [55].

In this work, we focus on GW signals generated by BBH mergers. We use the effective-one-body numerical-relativity (EOBNR) model with aligned spins [56] to simulate the waveform. Without losing generality, in addition to the SNR of GW signals, we focus on the intrinsic parameters, i.e. masses and spins. The extrinsic parameters, such as the polarization angle and the sky localization, are all fixed to fiducial values. The possible precession effect of BBHs is not considered, since we are using the aligned-spin waveform. The spin parameter is denoted by χ. We set the luminosity distance D_L = 100 Mpc, and neglect the redshift effect of the GW signal. Such a setting makes us focus on the machine learning algorithm, and the assumptions can easily be relaxed when needed. It is worth mentioning that, in practice we have tested the effects brought by the inclusion of extrinsic parameters, such as the sky location of the GW source, the inclination of the BBH orbit, and the polarization angle of the GW. Consistent optimization effects were obtained with extrinsic parameters. Because the dependence of the GW waveform on extrinsic parameters is much simpler than that on intrinsic parameters for GWs from spin-aligned BBHs, in the following we focus on the intrinsic parameters. It is straightforward to augment our machine learning data analysis with extrinsic parameters.

We use the open-source tool provided by Gebhard et al. [35] to generate data. This tool generates GW signals based on the PyCBC (https://pycbc.org/) [57] and LALSuite (https://github.com/lscsoft/lalsuite) platforms. With given parameters, the analog signals from LALSuite contain two time series, i.e. the two polarization modes of the GW signal. The above sequences are combined with the corresponding antenna functions F₊,× according to Eq. (4). The signal offset caused by the distance difference between LIGO's Hanford and Livingston detectors is properly introduced.

The final GW signal sequence used is,

s(t) = h(t) + n(t), (5)

where h(t) is the GW waveform obtained by the above simulation in Eq. (4), and n(t) is the detector's noise in the strain. The advanced LIGO (aLIGO) power spectral density (PSD) at the "zero-detuned high-power" design sensitivity (aLIGOZeroDetHighPower) [58] is used to simulate the Gaussian noise. After inserting the analog waveform into the noise, we can calculate the SNR of the strain. A rescaling of it, corresponding to a rescaling in the distance, can achieve other desired SNR values [55]. Finally, we get the GW strain with specific SNR values. The strain needs to be preprocessed before being used as the final data, similarly to PyCBC's GW data processing. The preprocessing stage includes two steps. The first one is data whitening. The aLIGO design sensitivity is used to whiten the original strain, and to filter out the spectral components of the environmental noise so as to properly scale its influence on the strain. The second step is filtering. We filter out the frequency components below 20 Hz to eliminate the influence of the Newtonian noise at low frequencies.
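For illustration, the generation pipeline described above can be sketched with PyCBC's public API as follows. The approximant name, extrinsic parameters, and target SNR are placeholder assumptions, and the whitening step is omitted for brevity.

import numpy as np
from pycbc.waveform import get_td_waveform
from pycbc.detector import Detector
from pycbc.psd import aLIGOZeroDetHighPower
from pycbc.noise import noise_from_psd
from pycbc.filter import sigma, highpass
from pycbc.types import TimeSeries

delta_t = 1.0 / 4096   # 4096 Hz sampling rate, 1-s samples

# Two polarizations of an aligned-spin BBH waveform (illustrative parameters)
hp, hc = get_td_waveform(approximant="SEOBNRv4", mass1=36.0, mass2=29.0,
                         spin1z=0.0, spin2z=0.0, distance=100.0,
                         delta_t=delta_t, f_lower=20.0)

# Project onto the Hanford detector with fiducial extrinsic parameters, Eq. (4)
det = Detector("H1")
fp, fc = det.antenna_pattern(right_ascension=0.0, declination=0.0,
                             polarization=0.0, t_gps=1126259462)
h = fp * hp + fc * hc

# Crop/pad the signal into a 1-s segment with the merger near a fixed position
seg = np.zeros(4096)
n = min(len(h), 4096)
seg[-n:] = h.numpy()[-n:]
h = TimeSeries(seg, delta_t=delta_t)

# Gaussian noise colored by the aLIGO design PSD, to be added as in Eq. (5)
psd = aLIGOZeroDetHighPower(4096 // 2 + 1, 1.0, 20.0)   # delta_f = 1 Hz
noise = noise_from_psd(4096, delta_t, psd, seed=0)

# Rescale the signal to a target SNR (equivalent to rescaling the distance)
h *= 10.0 / sigma(h, psd=psd, low_frequency_cutoff=20.0)
strain = noise + h
strain = highpass(strain, 20.0)   # remove components below 20 Hz
# (whitening against the design PSD would follow here, as described above)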
B. Dataset Building

This work involves experiments on multiple datasets which share the same structure. They have the following characteristics: (i) within the given parameter ranges, the data parameters, masses m₁, m₂ and spins χ₁, χ₂, are all randomly sampled; (ii) the ratio of samples containing a GW signal to samples of pure noise in each dataset is 1 to 1; (iii) the duration of each sample is 1 second and the sampling rate is 4096 Hz, that is, each sample is a time series with a length of 4096; (iv) considering the symmetry of the mass parameters in a BBH, we use the m₁ ≥ m₂ convention to sample masses; (v) the time-series data of the samples all use the Hanford detector sequence (the H1 sequence); (vi) in this work we do not consider the influence of the sky position of the GW signal in the strain, that is to say, the peak of the GW signal is located at the same position of the strain in all datasets. These specifics are natural for a study of this kind.

In total, we construct 5 datasets for training and 5 datasets for testing, which are annotated as Training Datasets and Testing Datasets hereafter.
TABLE I. Parameters for Training Datasets.

Dataset   m₁ (M⊙)    m₂ (M⊙)    χ₁,₂           SNR
1.1       [10, 80]   [10, 80]   0              [7, …]
1.2       [10, 80]   [10, 80]   0              [7, …]
2.1       [30, 60]   [30, 60]   0              8
3.1       30         45         [−0.5, 0.5]    8
4.1       [10, 80]   [10, 80]   [−0.5, 0.5]    [7, …]

TABLE II. Parameters for Testing Datasets. Parameters that vary across sub-datasets are given in the [min, step, max] format.

Dataset   m₁ (M⊙)       m₂ (M⊙)       χ₁,₂              SNR
1         [10, 80]      [10, 80]      0                 [7, …, …]
2         [10, …, 80]   [10, …, 80]   0                 8
3         30            45            [−0.5, …, 0.5]    8
4         [10, 80]      [10, 80]      [−0.5, 0.5]       [7, …]
5         [10, 80]      [10, 80]      [−0.5, 0.5]       [7, …, …]

• Training Datasets. Each Training Dataset consists of a training set and a testing set, which contain 5000 and 500 samples, respectively. When we train the model on a dataset, the model is first trained on the training set, and then tested on the testing set.

• Testing Datasets. Testing Datasets contain many sub-testing sets (annotated as sub-datasets below) to test the model performance on different parameter ranges. Each sub-dataset contains 500 samples.

The parameters for the Training Datasets used in this work are shown in Table I. Take Dataset 1.1 as an example. This dataset only considers three parameters: the component masses of the BBH, m₁ and m₂, and the SNR ρ. The spin parameters of the BBH are set to zero. The mass range of both black holes is mᵢ ∈ [10 M⊙, 80 M⊙] (i = 1, 2), with m₁ ≥ m₂ as defaulted; the mass distribution is shown in Fig. 4. Figure 5 gives two GW examples from Dataset 1.2, with a large SNR (upper panel) and a small SNR (lower panel).

The parameter settings of the Testing Datasets are shown in Table II. As shown above, the parameters used to construct each sub-dataset are given in the [min, step, max] format, where min and max define the overall sampling range of the parameter, and each sub-dataset is constructed with a uniform step size. Take Testing Dataset 1 as an example, which consists of 16 sub-datasets.
FIG. 4. Mass distribution in Dataset 1.1. We use m₁ ≥ m₂ as a convention to randomly sample mass parameters. Cyan points are the training set samples and red points are the testing set samples.

The SNR sampling range of the first sub-dataset starts at 7, and subsequent sub-datasets shift this range by a uniform step. In this way, we test the model on different parameter intervals and explore the model robustness with the change of parameter values.

IV. IMPROVING TECHNIQUES

A. Model Baseline
After preliminary experiments on networks with different depths and hyperparameter values, we decide to use a network structure similar to that of George and Huerta [27] as the baseline CNN model. Its structure is shown in Fig. 6.

The model receives the GW strain as input and outputs a discriminant value for classification. In the model, the classification threshold is set to 0.5, that is, a sample with a discriminant value greater than 0.5 is judged as a positive sample (i.e., containing a GW signal), otherwise it is judged as a negative one (i.e., pure noise). The baseline model contains three convolutional layers. The numbers of channels of the convolutional layers are 16, 32, and 64, and the sizes of their convolution kernels are 16, 8, and 8, respectively. After each convolutional layer, a pooling layer and an activation layer are provided. The pooling layer uses maximum pooling with a pooling window of 4 and a step size of 4, which means that the feature sequence is down-sampled to a quarter of the original sequence length while retaining the feature with the largest value. The activation layer uses the ReLU function shown in Eq. (2). Subsequently, the feature sequences extracted by the convolutional structure are input to the fully connected layers to achieve discriminative classification. The model outputs discriminant values to determine whether the sequence contains a GW signal.
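In PyTorch, the baseline architecture described above can be sketched as follows. The convolution and pooling settings follow the text; the 128-unit hidden dense layer is an assumption, since the dense-layer sizes are not fully specified here.

import torch.nn as nn

class BaselineCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=16), nn.MaxPool1d(4, 4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8), nn.MaxPool1d(4, 4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8), nn.MaxPool1d(4, 4), nn.ReLU(),
        )
        # with a 4096-sample input, the feature map is 64 channels x 61 samples
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 61, 128),   # hidden size is an assumption
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),              # discriminant value in (0, 1)
        )

    def forward(self, x):              # x: (batch, 1, 4096)
        return self.classifier(self.features(x))

# a sample is judged to contain a GW signal if the discriminant value > 0.5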
FIG. 5. Examples of the simulated GW strain: (m₁, m₂) = (75.…, …) M⊙ with ρ = ….56 (upper); (m₁, m₂) = (67.…, …) M⊙ with ρ = ….41 (lower). The orange line is the normalized GW signal in the sample strain.

FIG. 6. Basic CNN model architecture that is used in this work [27].
In this work, cross entropy is used as the loss function to update the gradient. Cross entropy is an important concept proposed by Shannon [59] in information theory. It is often used to measure the difference between the predicted distribution and the true distribution [59]. Let the label of each sample be y and the discriminant value given by the model be p. The cross entropy is,

J = −[ y log p + (1 − y) log(1 − p) ]. (6)

We use ADAM [60] as the gradient descent strategy to update our model parameters. This strategy inputs the samples into the model in batches to calculate the gradient and update the parameters. After trying batch sizes of 5, 10, 25, 50, 100, 200, 250, and 500, we take the value 25, with which the model has the highest accuracy. The learning rate is set to 5 × 10⁻…, and is reduced by a factor of 10 every 20 epochs to avoid overfitting the model.

The code implementation of our work is based on the PyTorch framework [61], which uses the
CUDA Deep Neural Network library (cuDNN) [62] to accelerate model operations on the GPU. Our work deploys experiments on an NVIDIA TITAN X GPU.
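Putting the pieces together, a minimal training setup consistent with this description (binary cross entropy of Eq. (6), ADAM realizing the update of Eq. (3), batch size 25, and a tenfold learning-rate decay every 20 epochs) might look like the following sketch. The initial learning rate and epoch number are placeholders, and a DataLoader named train_loader yielding (strain, label) batches of 25 samples is assumed.

import torch
import torch.nn as nn

model = BaselineCNN()                 # the sketch from Sec. IV A
criterion = nn.BCELoss()              # cross entropy of Eq. (6)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # placeholder value
# reduce the learning rate by a factor of 10 every 20 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

num_epochs = 60                       # placeholder value
for epoch in range(num_epochs):
    for x, y in train_loader:         # batches of 25 samples
        optimizer.zero_grad()
        p = model(x).squeeze(1)       # discriminant values
        loss = criterion(p, y.float())
        loss.backward()               # gradient of the loss
        optimizer.step()              # ADAM parameter update
    scheduler.step()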
B. Improving Techniques
We now discuss several improvements: dropout, batch normalization, and the 1 × 1 convolution.

Dropout was first proposed by Hinton et al. [63] in 2012 to alleviate the overfitting problem of neural networks. Subsequently, dropout has become one of the most widely used techniques in deep learning [64]. Its basic idea is that, in each batch of training, a probability p is manually specified, so that each neuron in the fully connected layer stops working with probability p, i.e., its output is set to zero.

The advantages of dropout are the following [63]. First, in the training process of each batch, because the neurons of each layer are deactivated with probability p, the network structure of each training pass is different. The overall model training is equivalent to a joint decision-making process between multiple neural networks with different structures, which helps to alleviate the problem of overfitting. Second, with dropout, two neurons do not necessarily appear in the same network structure each time, so that the parameter update no longer depends on the joint decision of some neurons with fixed relationships. At the feature level, this technique prevents the decision-making from over-depending on certain features and forces the model to learn more robust feature representations [26].

Batch normalization is a neural network training technique proposed by Ioffe and Szegedy [65] in 2015. Its specific idea is the following. In the training process of each batch, after the data passes through the activation layer, the activation values of each batch of data are normalized. That is, the average value of the sample data of each batch is normalized to 0, and the variance is normalized to 1. In a batch of data of length m, the activated data are denoted B = {x₁, ..., xₘ}, and the batch-normalized data are denoted y = {y₁, ..., yₘ}. Then the batch normalization algorithm is expressed as [65]

µ_B ← (1/m) Σᵢ xᵢ , (7)

σ²_B ← (1/m) Σᵢ (xᵢ − µ_B)² , (8)

x̂ᵢ ← (xᵢ − µ_B) / √(σ²_B + ε) , (9)

yᵢ ← γ x̂ᵢ + β , (10)

where the sums run over i = 1, ..., m, and γ and β are parameters learned in the gradient update. The purpose of this last step is to allow the batch-normalized result to recover the original input data if needed, which maintains the possibility of retaining the original structure. The parameter ε in Eq. (9) is used to prevent an invalid calculation when the variance σ²_B is zero.

The batch normalization technique keeps the mean and variance of the input data distribution of each layer in the CNN within a certain range. Thus, each layer of the network does not need to adapt to changes in the distribution of its input data, which is conducive to accelerating the learning of the model and speeding up training [26]. At the same time, batch normalization suppresses the problem that small changes in parameters are amplified with the deepening of the network layers, making the network less sensitive to the model parameters and the gradient updates more stable. Besides, due to the use of the dropout technique, the number of effective neurons in the model decreases, and the fitting speed of the network slows down [65]. Considering that batch normalization has a significant improving effect on the fitting speed of the model, dropout and batch normalization techniques are often introduced into the structure of the neural network at the same time [66].

The 1 × 1 convolution, also known as "network in network", was proposed by Lin et al. [67] in 2014. In one-dimensional convolution, it is a convolution kernel in which the window length is 1. At each position of the input sequence, the 1 × 1 convolution performs a weighted combination of the features across channels. In this way, cross-channel information interaction and the combination of multi-dimensional features of the original data are realized, and the model's ability to represent data features is enhanced [26].

TABLE III. The extended techniques used for each model.

Model      Dropout   Batch norm.   1 × 1 conv.
ConvNet1
ConvNet2   √
ConvNet3             √
ConvNet4                           √
ConvNet5   √         √
ConvNet6   √         √             √

TABLE IV. AUC scores on Datasets 1.1 and 1.2. The highest value in each column is highlighted.

Dataset    1.1    1.2
ConvNet1   0.86   0.96
ConvNet2   0.90   0.97
ConvNet3   0.89   0.97
ConvNet4   0.90   0.97
ConvNet5   0.91   0.98
ConvNet6   0.91   …
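To illustrate how the three techniques slot into the baseline network of Sec. IV A, the sketch below extends one convolutional block and the classifier head; the dropout probability, channel numbers, and layer sizes are illustrative assumptions rather than the exact ConvNet6 configuration.

import torch.nn as nn

# one convolutional block extended with the techniques discussed above
block = nn.Sequential(
    nn.Conv1d(16, 32, kernel_size=8),
    nn.MaxPool1d(4, 4),
    nn.ReLU(),
    nn.BatchNorm1d(32),                 # normalize activations, Eqs. (7)-(10)
    nn.Conv1d(32, 32, kernel_size=1),   # 1 x 1 convolution: combine channels pointwise
)

# dropout is applied in the fully connected part of the network
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 253, 128),           # illustrative sizes
    nn.ReLU(),
    nn.Dropout(p=0.5),                  # deactivate neurons with probability p during training
    nn.Linear(128, 1),
    nn.Sigmoid(),
)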
V. SIMULATION RESULTS
Based on the above techniques, we extend the basic CNN model described in Sec. IV A. Dropout, batch normalization, and the 1 × 1 convolution are introduced into the baseline model, and the resulting extended models are listed in Table III. The models are trained on Dataset 1.1, with a low SNR range ρ ∈ [7, …], and on Dataset 1.2, with a wider SNR range ρ ∈ [7, …]. During training, a validation set is used to monitor the training effect. When the validation set is input into the model, it does not participate in the gradient update, but is only used to calculate the accuracy and the loss function value.

The validation results of each model in the training process are shown in Fig. 7.

FIG. 7. Model comparison on Datasets 1.1 (upper panels) and 1.2 (lower panels). ConvNet1 to ConvNet6 are the models shown in Table III. We use the accuracy (fraction of samples correctly classified) and the validation loss (loss function value on the validation set) as our metrics to track the model performance in the training process.

As shown in the figure, on Dataset 1.1 (low SNR), the accuracy of each extended network
model is significantly improved compared with that of the basic model (ConvNet1). The validation results show that the final accuracy of each extended network is stable at 83%–87%, while the accuracy of the basic model is below 80%. Among them, ConvNet5, which uses dropout and batch normalization, achieves the highest accuracy in the stable stage after enough iterations. ConvNet6, which uses all improving techniques, namely dropout, batch normalization, and the 1 × 1 convolution, performs comparably. On Dataset 1.2, with its wider SNR range, there is less difference between the models. In this situation, the extended models are still superior to the basic model. In terms of accuracy, ConvNet5 reaches the optimal value in the stable stage of the late training epochs. Its performance is slightly better than that of ConvNet6. In terms of the value of the loss function, ConvNet5 and ConvNet6 have similar results in the stable stage; actually, ConvNet5 is slightly better than ConvNet6 there. In addition, as the SNR distribution range of the dataset expands, the differences between the samples increase, so that the overfitting problem of the model is reduced.

After training, the models are applied to the testing sets for
final evaluation. The testing results are shown in Fig. 8.

FIG. 8. Results on the testing sets of Datasets 1.1 (left panels) and 1.2 (right panels). We use both the ROC curve and the accuracy as our evaluation metrics. As seen in the lower panels, the accuracy in identifying noise and signal, as well as the total accuracy, are illustrated separately.

The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied [68]; see Sec. IV A. When the threshold changes, the curve reflects the fraction of positive samples correctly identified (true positive rate) versus the fraction of negative samples incorrectly identified (false positive rate) [69]. The area under the curve (annotated as AUC) is equal to the probability that the model ranks a randomly chosen positive sample higher than a negative one [68].

The AUC scores of different models on Datasets 1.1 and 1.2 are shown in Table IV. It can be seen that on Dataset 1.1, ConvNet5 and ConvNet6 both achieve the highest accuracy and AUC score, but the classification rate of GW signals of ConvNet6 is slightly lower than that of ConvNet5 (shown in Fig. 8). On Dataset 1.2, ConvNet5 achieves both the highest accuracy and AUC score.

Based on the above study, we find that the dropout, batch normalization, and 1 × 1 convolution techniques all improve the performance of CNN models in GW signal detection tasks.
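The ROC curve, the AUC score, and the accuracy at the 0.5 threshold can be computed directly from the discriminant values, for instance with scikit-learn; the label and score arrays below are placeholders for the testing-set outputs.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0])                # placeholder labels (1 = signal)
y_score = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1])   # placeholder discriminant values

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # ROC: TPR vs FPR as threshold varies
auc = roc_auc_score(y_true, y_score)                 # area under the ROC curve
accuracy = np.mean((y_score > 0.5) == y_true)        # accuracy at the 0.5 threshold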
VI. GENERALIZATION AND ROBUSTNESS

In the work of George and Huerta [27], who first introduced the CNN structure into the field of GW data processing, the generalization ability of deep learning algorithms in the task of GW signal detection was discussed. Subsequent studies have pointed out that this ability is a major advantage of deep learning algorithms over matched filtering [28, 35]. As mentioned in Sec. II B, the generalization ability refers to the ability of the detection algorithm to respond to signals that are outside of the distribution of the training data [49]. Here we present the first dedicated study of the generalization ability of the CNN model in the task of GW signal detection, including the generalization characteristics of the CNN model over multiple parameter ranges, such as masses and aligned spins.

Following the discussion in Sec. V, we use ConvNet5 for the following experiments, and hereinafter we refer to it as "the model". Considering that the discussion in Sec. V is based only on the performance of the model on datasets where the spins are set to zero, we conduct a further test of the ConvNet5 model on Dataset 4.1. Dataset 4.1 introduces spin parameters on the basis of Dataset 1.2, and is used to test the overall performance of the model in the GW detection task. After 30 epochs of training on Dataset 4.1, the model reaches an accuracy of 93%, which shows that the model still has a good ability to detect GW signals when the spin parameters are considered.

FIG. 9. Model robustness on BBH masses (left) and spins (right). The colorbar and numbers are the accuracy of the model tested on different sub-datasets (in specific mass/spin ranges) of the Testing Datasets. The mirror part of m₂ ≥ m₁ is also plotted.

In the following subsections, we test the generalization ability of the model in masses and in spins, as well as the robustness of the nonspinning model on spinning data.

A. Mass Parameters
To simplify the investigation, we first study the generalization ability of the model in a GW detection task that only considers mass parameters. The Training Dataset used in this section is Dataset 2.1 (see Table I). In this dataset, the SNR of the GW signal is 8, the mass sampling range of the BBH is [30 M⊙, 60 M⊙], and the spins are set to zero. After 30 epochs of training, the accuracy on the testing set is 93%. Subsequently, we use Testing Dataset 2 to test the model on sub-datasets with the mass sampling range of [10 M⊙, 80 M⊙]. The accuracy is shown in the left panel of Fig. 9.

We find that the model works best on sub-datasets within the original training parameter range of [30 M⊙, 60 M⊙]. For most of the sub-datasets beyond the training parameter range (i.e., outside of [30 M⊙, 60 M⊙]), high accuracy is still achieved, indicating that the model does have a certain generalization ability in the mass parameters. It can be seen from the data in the figure that the model is more effective for BBHs of high masses, while the accuracy for low-mass BBHs is lower. Studying the signal patterns corresponding to the mass parameters, we find that this is caused by the fact that the model more easily identifies the data fluctuations introduced by short signals. As the mass decreases, the GW signal in band becomes longer, which is relatively closer to the characteristics of random noise. This is harder to identify with our CNN model. We show example waveforms of different masses and spins in Fig. 10. It is clear from the upper panels that GW signals of smaller masses are longer, thus they are harder to detect with our CNN models. We leave the discussion of further improvement of our CNN models in this parameter space to future studies.
Now we discuss the generalization ability of the model in the spin parameters. We use Dataset 3.1 as an illustration. The SNR of this dataset is 8, the masses of the BBH are 30 M⊙ and 45 M⊙, and the range of the spin parameters is χᵢ ∈ [−0.5, 0.5] (i = 1, 2). After training, we use Testing Dataset 3 to test the model on sub-datasets with χ ranging in [−0.5, 0.5]. The accuracy is shown in the right panel of Fig. 9. The model generalizes over the spin range, while the accuracy decreases for larger aligned-spin values: with a larger χ value, the GW signals become longer (the "hang-up" effect), thus harder to detect with our CNN model. Further improvements in this aspect will be subject to future studies.
What if the model is only trained with nonspinning GW signals, but is tested with spinning signals? We use Dataset 1.2 (see Table I), which only involves the mass parameters, for training. After 30 epochs of training, we test the model on Testing Dataset 4 (see Table II), which contains spin parameters as well. We obtain an accuracy close to the ∼94% achieved on Dataset 1.2, where only the mass parameters are included. It shows the generalization ability of the model
that is only trained with nonspinning GWs, but tested on data containing spin parameters.

We further use Testing Dataset 5 (see Table II) to test the model on sub-datasets with different SNR distributions. The result is shown in Fig. 11. As shown in the figure, the model has the best detection efficiency on data with high SNR, and it is slightly worse on data with low SNR. However, even on the sub-dataset with the lowest SNR range, the accuracy is still ∼83%.

FIG. 10. Simulated GW strain examples of different masses and spins. The upper panels are strain examples of increasing masses from left to right: (10 M⊙, … M⊙), (45 M⊙, … M⊙), and (80 M⊙, … M⊙). The lower panels contain examples of different spins.

FIG. 11. Robustness of CNN models that are only trained with nonspinning GWs on the spinning GW data for different SNRs.

D. Conclusion
From the above study, it can be seen that the CNN model shows good generalization ability for data of different parameter ranges, which is a major advantage of the CNN model over the matched-filtering method. The matched-filtering method is based on an existing template bank. Its search for, and response to, GW signals are limited to the existing waveforms. GW signals beyond the existing waveforms cannot be easily detected. The generalization ability of the CNN model in the task of GW signal detection will help us to discover signals beyond the existing templates. It may also be able to detect signals that imprint orbital eccentricity [70], orbital precession [71], and deviations from general relativity [72]. Such a generalization ability will undoubtedly play an essential role in searches of GW signals beyond what we have in the template bank. More studies are needed along this direction.
For the first time, this work specifically studied the effects on CNN models brought by improving techniques such as dropout, batch normalization, and the 1 × 1 convolution. Detailed comparisons between models with different improving techniques were made. Our simplified experiments show that dropout, batch normalization, and the 1 × 1 convolution improve the performance of CNN models in GW signal detection tasks. As the difficulty of the task continues to increase, the requirements for feature extraction and processing are getting higher and higher. The network depth of the CNN is limited by the gradient propagation problem, resulting in a limitation on its performance [26]. The proposal of the residual network structure greatly alleviates the problem of gradient propagation in CNNs and makes the depth limit of CNN models rapidly expand [73]. In the field of time-series processing, the long short-term memory model is widely used in multiple tasks, such as speech processing and stock price prediction, and has achieved excellent results [74]. At present, some studies have introduced the above structures into GW data processing [32, 34]. However, in the task of GW signal detection, there is still a lack of a unified investigation of these models. Comparing these models (including CNN models) on GW signal detection tasks, in terms of the recognition rate, model size, running time, and other indicators, is crucial to find an optimal model structure suitable for real-world tasks. It is a worthwhile direction for future studies.

For the sake of simplifying the investigation, this work only uses simulated noise to construct the GW sample data. In order to make the model more in line with the characteristics of real detection, real noise from aLIGO/Virgo data can be used. At the same time, the data used in this article are only for experimental purposes, so the amount of data is limited. By increasing the number of samples in the dataset, the performance of the model can be further improved. Ensemble learning is also an optimization technique that is worth considering, where the basic idea is to train multiple neural network models, and to use voting and other common decision-making strategies in order to make up for the decision-making defects of a single model and to improve the decision-making ability of the overall model [75]. These aspects deserve detailed studies on their own.
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (11975027, 11991053, 11690023, 11721303), the Young Elite Scientists Sponsorship Program by the China Association for Science and Technology (2018QNRC001), the Max Planck Partner Group Program funded by the Max Planck Society, and the High-performance Computing Platform of Peking University. It was partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences through Grant No. XDB23010200.

[1] B. P. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. Lett. 116, 061102 (2016), arXiv:1602.03837 [gr-qc].
[2] B. Abbott et al. (LIGO, VIRGO), "Observation of Gravitational Waves from a Binary Black Hole Merger," in
Centennial ofGeneral Relativity: A Celebration , edited by C. A. Z. Vascon-cellos (2017) pp. 291–311.[3] J. Liu, G. Wang, Y.-M. Hu, T. Zhang, Z.-R. Luo, Q.-L. Wang,and L. Shao, Chin. Sci. Bull. , 1502 (2016).[4] B. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. X ,031040 (2019), arXiv:1811.12907 [astro-ph.HE].[5] A. H. Nitz, C. Capano, A. B. Nielsen, S. Reyes, R. White,D. A. Brown, and B. Krishnan, Astrophys. J. , 195 (2019),arXiv:1811.01921 [gr-qc].[6] R. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. D ,043015 (2020), arXiv:2004.08342 [astro-ph.HE].[7] R. Abbott et al. (LIGO Scientific, Virgo), Astrophys. J. Lett. , L44 (2020), arXiv:2006.12611 [astro-ph.HE].[8] B. Abbott et al. (LIGO Scientific, Virgo), Astrophys. J. Lett. , L3 (2020), arXiv:2001.01761 [astro-ph.HE].[9] R. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. Lett. ,101102 (2020), arXiv:2009.01075 [gr-qc].[10] R. Abbott et al. (LIGO Scientific, Virgo), (2020),arXiv:2010.14527 [gr-qc].[11] B. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. Lett. ,161101 (2017), arXiv:1710.05832 [gr-qc].[12] L. S. Finn, Phys. Rev. D , 5236 (1992), arXiv:gr-qc / et al. , Astrophys. J. , 136 (2012),arXiv:1107.2665 [astro-ph.IM].[14] S. A. Usman et al. , Class. Quant. Grav. , 215004 (2016),arXiv:1508.02357 [gr-qc]. [15] B. Abbott et al. (LIGO Scientific, Virgo), Phys. Rev. D ,122004 (2016), [Addendum: Phys.Rev.D 94, 069903 (2016)],arXiv:1602.03843 [gr-qc].[16] R. Smith, S. E. Field, K. Blackburn, C.-J. Haster, M. P¨urrer,V. Raymond, and P. Schmidt, Phys. Rev. D , 044031 (2016),arXiv:1604.08253 [gr-qc].[17] I. Harry, S. Privitera, A. Boh´e, and A. Buonanno, Phys. Rev. D , 024012 (2016), arXiv:1603.02444 [gr-qc].[18] N. J. Cornish and T. B. Littenberg, Class. Quant. Grav. ,135012 (2015), arXiv:1410.3835 [gr-qc].[19] A. J. Chua, C. R. Galley, and M. Vallisneri, Phys. Rev. Lett. , 211101 (2019), arXiv:1811.05491 [astro-ph.IM].[20] A. J. Chua and M. Vallisneri, Phys. Rev. Lett. , 041102(2020), arXiv:1909.05966 [gr-qc].[21] N. Mukund, S. Abraham, S. Kandhasamy, S. Mitra, and N. S.Philip, Phys. Rev. D , 104059 (2017), arXiv:1609.07259[astro-ph.IM].[22] R. E. Colgan, K. R. Corley, Y. Lau, I. Bartos, J. N. Wright,Z. Marka, and S. Marka, Phys. Rev. D , 102003 (2020),arXiv:1911.11831 [astro-ph.IM].[23] E. Cuoco, J. Powell, M. Cavagli`a, et al. , Machine Learning:Science and Technology , 011002 (2020), arXiv:2005.03745.[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Commun. ACM , 84–90 (2017).[25] J. Schmidhuber, Neural Networks , 85, arXiv:1404.7828.[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016) .[27] D. George and E. Huerta, Phys. Rev. D , 044039 (2018),arXiv:1701.00008 [astro-ph.IM].[28] H. Gabbard, M. Williams, F. Hayes, and C. Messenger,Phys. Rev. Lett. , 141103 (2018), arXiv:1712.06041 [astro-ph.IM]. [29] P. G. Krastev, Phys. Lett. B , 135330 (2020),arXiv:1908.03151 [astro-ph.IM].[30] H. Gabbard, C. Messenger, I. S. Heng, F. Tonolini, andR. Murray-Smith, (2019), arXiv:1909.06296 [astro-ph.IM].[31] H. Shen, E. Huerta, Z. Zhao, E. Jennings, and H. Sharma,(2019), arXiv:1903.01998 [gr-qc].[32] H. Shen, D. George, E. A. Huerta, and Z. Zhao, in ICASSP 2019: IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP) (2019) pp. 3237–3241,arXiv:1711.09919 [gr-qc].[33] C. Chatterjee, L. Wen, K. Vinsen, M. Kovalam, and A. Datta,Phys. Rev. D , 103025 (2019), arXiv:1909.06367 [astro-ph.IM].[34] C. Dreissigacker, R. Sharma, C. Messenger, R. Zhao, andR. Prix, Phys. Rev. 
D , 044009 (2019), arXiv:1904.13291[gr-qc].[35] T. D. Gebhard, N. Kilbertus, I. Harry, and B. Sch¨olkopf, Phys.Rev. D , 063015 (2019), arXiv:1904.08693 [astro-ph.IM].[36] D. George, H. Shen, and E. Huerta, Phys. Rev. D , 101501(2018).[37] M. Chen, Y. Zhong, Y. Feng, D. Li, and J. Li, ScienceChina Physics, Mechanics, and Astronomy , 129511 (2020),arXiv:2003.13928 [astro-ph.IM].[38] J. P. Marulanda, C. Santa, and A. E. Romano, Phys. Lett. B , 135790 (2020), arXiv:2004.01050 [gr-qc].[39] H. Wang, S. Wu, Z. Cao, X. Liu, and J.-Y. Zhu, Phys. Rev. D , 104003 (2020), arXiv:1909.13442 [astro-ph.IM].[40] T. M. Mitchell, Machine Learning (McGraw-Hill, New York,1997).[41] C. M. Bishop,
Pattern Recognition and Machine Learning (In-formation Science and Statistics) (Springer-Verlag, Berlin, Hei-delberg, 2006).[42] T. Kohonen, Neural Networks , 3 (1988).[43] W. S. McCulloch and W. Pitts, The Bulletin of MathematicalBiophysics , 115 (1943).[44] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall,(2018), arXiv:1811.03378.[45] V. Nair and G. E. Hinton, in Proceedings of the 27th Inter-national Conference on International Conference on MachineLearning (Omnipress, Madison, WI, USA, 2010) p. 807–814.[46] J. Hopfield, Proc. Nat. Acad. Sci. , 2554 (1982).[47] S. Sra, S. Nowozin, and S. Wright, Optimization for MachineLearning , Neural Information Processing Series (MIT Press).[48] S. Urolagin, K. V. Prema, and N. V. S. Reddy, in
Proceedingsof the 2011 International Conference on Advanced Comput-ing, Networking and Security , ADCONS’11 (Springer-Verlag,Berlin, Heidelberg, 2011) p. 171–178.[49] M. Mohri, A. Rostamizadeh, and A. Talwalkar,
Foundations ofMachine Learning, Second Edition , Adaptive Computation andMachine Learning Series (MIT Press).[50] S. Lawrence and C. L. Giles, in
Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Net-works. IJCNN 2000. Neural Computing: New Challenges andPerspectives for the New Millennium , Vol. 1 (IEEE, 2000) pp.114–119. [51] D. M. Hawkins, Journal of Chemical Information and Com-puter Sciences , 1.[52] K. Yamaguchi, K. Sakamoto, T. Akabane, and Y. Fujimoto, in The First International Conference on Spoken Language Pro-cessing, ICSLP 1990, Kobe, Japan, November 18-22, 1990 (ISCA, 1990).[53] D. Ciregan, U. Meier, and J. Schmidhuber, in (2012) pp.3642–3649, arXiv:1202.2745 [cs].[54] H. H. Aghdam and E. J. Heravi,
Guide to Convolutional Neu-ral Networks: A Practical Application to Tra ffi c-Sign Detectionand Classification , 1st ed. (Springer Publishing Company, In-corporated, 2017).[55] C. Cutler and E. E. Flanagan, Phys. Rev. D , 2658 (1994),arXiv:gr-qc / et al. , Phys.Rev. D , 044028 (2017), arXiv:1611.03703 [gr-qc].[57] A. Nitz, I. Harry, D. Brown, C. M. Biwer, J. Willis, et al. ,“gwastro / pycbc: Pycbc release v1.16.9,” (2020).[58] D.Shoemaker, LIGO Document (2010).[59] C. E. Shannon, Bell System Technical Journal , 379 (1948).[60] D. P. Kingma and J. Ba, (2014), arXiv:1412.6980 [cs.LG].[61] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, et al. ,(2019), arXiv:1912.01703 [cs.LG].[62] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran,B. Catanzaro, and E. Shelhamer, CoRR abs / (2014),arXiv:1410.0759.[63] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, andR. R. Salakhutdinov, (2012), arXiv:1207.0580 [cs.NE].[64] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov, Journal of Machine Learning Research ,1929 (2014).[65] S. Io ff e and C. Szegedy, in Proceedings of the 32nd Inter-national Conference on International Conference on MachineLearning - Volume 37 , ICML’15 (JMLR.org, 2015) p. 448–456.[66] X. Li, S. Chen, X. Hu, and J. Yang, in / CVF Con-ference on Computer Vision and Pattern Recognition (CVPR) (IEEE) pp. 2677–2685.[67] M. Lin, Q. Chen, and S. Yan, (2013), arXiv:1312.4400[cs.NE].[68] T. Fawcett, Pattern Recognition Letters , 861.[69] D. Powers, Mach. Learn. Technol. (2008).[70] X. Liu, Z. Cao, and L. Shao, Phys. Rev. D , 044049 (2020),arXiv:1910.00784 [gr-qc].[71] S. Babak, A. Taracchini, and A. Buonanno, Phys. Rev. D ,024010 (2017), arXiv:1607.05661 [gr-qc].[72] L. Shao, N. Sennett, A. Buonanno, M. Kramer, and N. Wex,Phys. Rev. X , 041025 (2017), arXiv:1704.07561 [gr-qc].[73] K. He, X. Zhang, S. Ren, and J. Sun, in (2016)pp. 770–778, arXiv:1512.03385 [cs.CV].[74] S. Hochreiter and J. Schmidhuber, Neural Comput. ,1735–1780 (1997).[75] D. Opitz and R. Maclin, Journal of Artificial Intelligence Re-search11