Deep learning for gravitational-wave data analysis: A resampling white-box approach
Manuel D. Morales, Javier M. Antelis, Claudia Moreno, Alexander I. Nesterov
Manuel D. Morales,∗ Javier M. Antelis,† Claudia Moreno,‡ and Alexander I. Nesterov§
Departamento de Física, Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Av. Revolución 1500, Guadalajara, Jalisco, 44430, México
Tecnológico de Monterrey, Escuela de Ingeniería y Ciencias, Av. Gral. Ramón Corona 2514, Zapopan, Jalisco, C.P. 45138, México
(Dated: September 10, 2020)

In this work, we apply Convolutional Neural Networks (CNNs) to detect gravitational wave (GW) signals of compact binary coalescences, using single-interferometer data from LIGO detectors. As a novel contribution, we adopt a resampling white-box approach to advance towards a statistical understanding of the uncertainties intrinsic to CNNs in GW data analysis. Resampling is performed through repeated k-fold cross-validation experiments and, for the white-box approach, the behavior of the CNNs is described mathematically in detail. Through a Morlet wavelet transform, strain time series are converted to time-frequency images, which in turn are reduced in size before generating the input datasets. Moreover, to reproduce more realistic experimental conditions, we work only with data of non-Gaussian noise and hardware injections, removing the freedom to set signal-to-noise ratio (SNR) values in GW templates by hand. After hyperparameter adjustments, we find that resampling smooths the stochasticity of mini-batch stochastic gradient descent by reducing mean accuracy perturbations by a factor of 3.6. The CNNs were quite precise in detecting noise but not sensitive enough to recall GW signals, meaning that CNNs are better suited for noise reduction than for the generation of GW triggers. However, applying a post-analysis, we find that for GW signals of SNR ≥ .80 with H1 data and SNR ≥ .80 with L1 data, CNNs could remain tentative alternatives for detecting GW signals. Besides, from receiver operating characteristic curves we find that the CNNs perform much better than Naive Bayes and Support Vector Machines models and, at a significance level of 5%, we estimate that the predictions of the CNNs are significantly different from those of a random classifier. Finally, we elucidate that the performance of CNNs is highly class-dependent because of the distribution of the probabilistic scores output by the softmax layer.

PACS numbers: 04.30.-w, 07.05.Kf, 07.05.Mh
Keywords: Gravitational Waves, Deep Learning, Convolutional Neural Networks, Binary Black Holes, LIGO, Probabilistic Binary Classification, Resampling Regime, White-Box Testing, Uncertainty
I. INTRODUCTION
Beginning with the discovery by the LIGO and Virgo collaborations [1], observation of gravitational wave (GW) events emitted by compact binary coalescences (CBCs) has made GW astronomy, to some extent, a routine practice. Until now, eleven GW events (GWTC-1-confident) have been observed during the O1 and O2 scientific runs [2], of which ten correspond to GWs from binary black hole (BBH) sources and, notably, one to a binary neutron star (BNS) system [3] with a reported electromagnetic counterpart [4] that ushered in an era of multimessenger astronomy. More recently, more than a dozen GW events have been detected during the O3 run, including the first observation of a BBH with asymmetric masses [5], and a CBC of a black hole and another compact object that could be either a low-mass black hole or a heavy neutron star, without a detected electromagnetic counterpart [6]. The O3 runs have been recorded with the network of the LIGO Livingston (L1) and Hanford (H1) twin observatories and the Virgo (V1) observatory.

The sensitivity of GW detectors has been remarkably increased in recent years. Here GW data analysis has had a crucial role in quantifying and mitigating as much as possible the effects of non-Gaussian noise, mainly short-time transients called "glitches" [7], for which there are still no full explanations of their physical causes. Monitoring the signal-to-noise ratio (SNR) of all detections is an essential procedure here, being indeed the heart of the standard algorithm used for the detection and characterization of GWs in LIGO and Virgo, namely Matched Filtering (MF) [8, 9]. This technique is not original to GW data analysis, but rather has been broadly used in electrical engineering, particularly in signal processing, for decades [10]. In general, MF is based on the assumption that the signals to be searched for are embedded in additive Gaussian noise. In particular, for GW data analysis, the starting point is the raw strain data (scalar time series) output by each detector of a network of detectors, having beforehand a wide bank of theoretical GW templates.
Then, MF performs a one-by-one search over all possible combinations, correlating the whitened strain data with each template. Each correlation actually defines an SNR value and, consequently, the goal is to select those matches that maximize the SNR, mediated by several statistical and coincidence tests between detectors of the network [11]. A standard pipeline that currently includes MF and is used in LIGO and Virgo analyses is PyCBC [12, 13].

Searches performed by LIGO and Virgo for CBC signals have generally resorted to a bank of theoretical templates with inspiral, merger, and ringdown stages, in order to cover a limited space of mass and spin values. Here MF has shown itself to be a powerful tool, having a crucial role in all confirmed detections of GWs emitted by CBCs. However, there are significant reasons for exploring alternative detection strategies. To begin with, the noise output by interferometric detectors is non-Gaussian, which conflicts with the assumption of Gaussian noise in MF techniques. Generally, whitening techniques [14] are applied to the raw strain data before the detection and characterization algorithms, but glitches remain, raising the need for applying statistical consistency tests [15] and coincidence procedures such as cross-correlation [16, 17] for a network of detectors, which in turn raises the problem of how to systematically deal with single-detector data.

Elsewhere, one would want to perform more general GW searches. Natural next steps are to include precessing spins [18], orbital eccentricity [19], and neutron star tidal deformability [20], among other aspects.
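The correlation step at the heart of MF can be sketched in a toy form as below; this is only an illustrative numpy implementation under the idealized whitened-Gaussian-noise assumption, not the production PyCBC analysis, and the sinusoidal "template" is purely hypothetical.

```python
import numpy as np

def matched_filter_snr(strain_white, template_white):
    """Toy matched filter: slide a whitened, energy-normalized template
    over whitened strain and use the peak |cross-correlation| as an SNR proxy."""
    template = template_white / np.sqrt(np.sum(template_white ** 2))
    corr = np.correlate(strain_white, template, mode="valid")
    return np.max(np.abs(corr))

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, 4096)

# Hypothetical short chirp-like template (a windowed sinusoid, for illustration)
signal = np.sin(2 * np.pi * 0.05 * np.arange(256)) * np.hanning(256)

data = noise.copy()
data[1000:1256] += 5.0 * signal        # inject a scaled copy into the noise

snr_with = matched_filter_snr(data, signal)
snr_without = matched_filter_snr(noise, signal)
assert snr_with > snr_without          # the injection raises the peak correlation
```

In the real pipeline the correlation is computed in the frequency domain, weighted by the detector PSD, and maximized over a full template bank; the sketch only shows why the peak correlation acts as a detection statistic.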
The more general a GW search, the wider the parameter space, and MF becomes computationally prohibitive. Besides, there are sources other than CBCs that could produce nondeterministic GWs [21] which, in principle, cannot be anticipated through theoretical waveforms and, therefore, make any analysis with MF impossible. Phenomena such as core-collapse supernovae (CCSNe) [22], pulsar glitches [23], and neutron stars collapsing into black holes [24], among others, fall into this category. Here it should be stressed that LIGO and Virgo pipelines currently have a standard data analysis algorithm for all-sky searches of GW transients without prior theoretical templates, namely coherent WaveBurst (cWB) [25]. Under the hood, this algorithm includes standard cleaning and whitening procedures, the Wilson-Daubechies-Meyer wavelet transform, clustering of time-frequency samples based on power excesses, and reconstruction of waveforms by maximum likelihood estimation. cWB can also analyze data from a network of detectors [26]. However, cWB rests upon the same assumption as MF, namely, that the noise background is Gaussian, which, as mentioned, is not the case for the outputs of interferometric detectors, even after applying whitening.

Actually, even if we are not interested in replacing standard algorithms such as MF and/or cWB, it is still pertinent to have alternative algorithms, if only for independent verification and/or for increasing the confidence level of GW detections. In this context, Machine Learning (ML) [27] and its successor, Deep Learning (DL) [28], emerge as promising alternatives or, at least, complementary tools to MF and/or cWB. These techniques assume nothing regarding the background noise. In addition, they have shown themselves remarkably useful for analyzing enormous amounts of data through sequential or online learning processes, which could be a significant improvement for more general GW searches in real time.
Even more, if we do not have deterministic templates to predict GW signals, such as those emitted by CCSNe, or to anticipate noise artifacts such as non-Gaussian glitches, unsupervised ML and DL algorithms (which do not need prior labeled training samples) could be interesting for future explorations.

It is well known that, in recent years, ML techniques have had a remarkable boom, with many successful applications in both industry and academia: recommender systems [29], fraud detection [30], genomics [31], and astronomy [32], among others. The overlap between ML and physics is particularly interesting, because it works as a cross-interaction. For instance, we have applications of ML to particular subfields of physics (e.g. cosmology [33] and chemical physics [34]), conceptual developments in ML algorithms inspired by theoretical physical insights [35], and AI enhancements with quantum computing [36]. Carleo et al. [37] provide an illustrative review of ML and the physical sciences.

In GW data analysis, the implementation of ML algorithms goes back no more than a few years. Biswas et al. [38] contributed a pioneering work, in which Artificial Neural Networks (ANNs), Support Vector Machines, and Random Forest algorithms were used to detect glitches in data from the H1 and L1 detectors, recorded during the S4 and S6 runs. Competitive performance results, back then, were obtained. Later, works focused on several problems were published, with better results; for instance: ANNs for the detection of GWs associated with short gamma-ray bursts [39], Dictionary Learning for denoising [40], Difference Boosting Neural Networks and Hierarchical Clustering for the detection of glitches [41], ML and citizen science for glitch classification [42], and background reduction for CCSNe searches with single-detector data [43], among others.

Applications of DL, in particular Convolutional Neural Network (CNN) algorithms, are even more recent in GW data analysis.
Gabbard et al. [44] and George and Huerta [45] provided the first works, in which CNNs were implemented to detect simulated GW signals from BBHs embedded in Gaussian noise. They claimed that the performance of the CNNs was similar to, and much better than, MF, respectively. Later, George and Huerta extended their work by embedding simulated GWs and analytical glitches in real non-Gaussian noise data of LIGO detectors, with similar results [46]. Thereafter, CNN algorithms have been applied to several instrumental and physical problems, showing further improvements; for instance: detection of glitches [47], trigger generation for locating the coalescence time of GWs emitted by BBHs [48], detection of GWs from BBHs and BNSs [49], detection of GWs emitted by CCSNe using both phenomenological [50] and numerical [51] waveforms, and detection of continuous GWs from isolated neutron stars [52], among others.

From a practical point of view, previous works applying CNN algorithms to GW detection and characterization reach competitive performance results according to standard metrics such as accuracy, loss functions, and false alarm rates, showing in the first place the feasibility of DL in GW data analysis. However, from a formal statistical point of view, we warn that, beyond mere applications, we are in need of deeper explorations that seriously take into account the inevitable uncertainty involved in DL algorithms before putting them forward as real alternatives to the standard pipelines in LIGO and Virgo.

For instance, as Gebhard et al. [48] critically pointed out, a CNN algorithm does not input waveform templates as isolated samples, but also the distribution of these templates over the whole training and testing datasets. Indeed, in all the previously mentioned works of DL applied to GW data analysis, the distributions of samples were set class-balanced entirely by hand.
It is a common practice in ML and DL to draw on artificially balanced datasets to make CNN algorithms easily tractable with respect to hyperparameter tuning, choice of performance metrics, and misclassification cost. Nonetheless, when uncertainty is taken into account, the real frequency of occurrence of samples cannot be ignored, because it defines how reliable our decision criterion for classifying is when an algorithm outputs a score for a single input sample. These kinds of details are well known in the ML community [53], and need to be seriously explored in GW data analysis beyond just hands-on approaches. Deeper multidisciplinary research is necessary to advance this field.

For this research, our general goal is to draw on CNN architectures to perform a standard GW detection. In particular, beginning from a training set containing single-interferometer strain data, and then transforming it into time-frequency images with a Morlet wavelet transform, the aim is to distinguish those samples which are only non-Gaussian noise from those samples that contain a GW embedded in non-Gaussian noise; in ML language, this is just a binary classification. Moreover, as a novel contribution, here we take into consideration two ingredients to advance towards a statistically informed understanding of the involved uncertainties, namely: resampling, and a white-box approach.

In this work we are not interested in reaching higher performance metric values than those reported in previous works, nor in testing new CNN architectures, nor in using the latest real and/or simulated data; but rather in facing the question of how to cleverly deal with the uncertainties of CNNs, which is an inescapable requisite if we claim that DL techniques are real alternatives to the current pipelines.
Moreover, as a secondary result that is useful for reproducibility purposes, we want to show that CNN algorithms, even considering repeated experiments, can easily be run on a single local CPU and reach good performance results, without resorting to expensive high-performance computing; a result that we achieve mainly by transforming the raw strain data to low-resolution images in the time-frequency representation.

In particular, our resampling white-box approach has several subtleties, as follows.

The full problem of dealing with arbitrarily imbalanced datasets is, for now, beyond this research. Nonetheless, even though we work with balanced datasets, a good starting point is to include stochasticity by resampling, which in turn we define as repeated experiments of a global k-fold cross-validation (CV) routine. This stochasticity is different from that usually introduced in each learning epoch by taking a mini-batch of the whole training set for updating the model parameters (e.g. by a stochastic gradient descent algorithm) and, thereby, minimizing the cross-entropy. For this research, we consider the above two sources of stochasticity.

Stochastic resampling helps to alleviate the artificiality introduced by a balanced dataset, because the initial splitting into k folds is totally random and each whole k-fold CV experiment is not reproducible in a deterministic fashion. Besides, this approach sets up an experimental scenario in which uncertainties are even more evident and need to be seriously treated beyond just reporting single-valued metrics. Indeed, in the most common situations with really big datasets (i.e. millions, billions, or more samples), with statistical tools for decreasing the use of system resources, data changing over time, and so on, CNN algorithms are generally set up such that their predictions are not deterministic, leading to distributions of performance metrics instead of single-valued metrics, and demanding formal probabilistic analyses.
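The resampling regime described above can be sketched in a self-contained way as follows; the dataset is synthetic and the model is a nearest-class-mean placeholder standing in for the CNN (both are assumptions for illustration), but the repeated k-fold mechanics are the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic balanced binary dataset standing in for the image samples
X = rng.normal(size=(400, 16))
X[200:] += 1.0                       # class-2 samples are shifted
y = np.array([1] * 200 + [2] * 200)

def nearest_mean_accuracy(X_tr, y_tr, X_te, y_te):
    """Placeholder classifier (nearest class mean) standing in for a CNN."""
    mu1, mu2 = X_tr[y_tr == 1].mean(0), X_tr[y_tr == 2].mean(0)
    d1 = np.linalg.norm(X_te - mu1, axis=1)
    d2 = np.linalg.norm(X_te - mu2, axis=1)
    return np.mean(np.where(d1 <= d2, 1, 2) == y_te)

# Repeated k-fold CV: every repetition reshuffles the fold assignment,
# so each full experiment yields a fresh sample of the accuracy distribution
k, repeats, accs = 5, 10, []
for _ in range(repeats):
    idx = rng.permutation(len(y))
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        accs.append(nearest_mean_accuracy(X[train], y[train], X[fold], y[fold]))

accs = np.array(accs)
assert accs.shape == (50,)           # k x repeats accuracies: a distribution
assert accs.mean() > 0.5             # clearly above a random classifier
```

The output of each full experiment is thus a set of k accuracies, and the repeated experiments yield the distribution on which statistical tests can later be performed.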
Given these distributions, with their inherent uncertainties, a statistical paired-sample test is necessary to formally conclude how close or far our CNNs are from a totally random classifier, and we performed such a test in this work.

Our white-box approach works as a complementary tool to understand how uncertainties influence the performance of CNN algorithms. The first aspect here is to explicitly (mathematically) describe how each layer of a CNN works and why these layers are chosen. This is actually the most basic explanatory procedure considering that, from a fundamental point of view, we still do not have analytical theories to explain the low generalization errors in DL or to ensure, beforehand, good performance of a certain CNN architecture; in practice, we only have many CNN architectures previously implemented for other problems, which need to be tested in a trial-and-error process for each new problem. This explicit description then assists us in understanding why, given the output probabilistic scores, classification is class-dependent and threshold-dependent given the distribution of these scores.

With regard to data, we decided to use recordings from the S6 LIGO run, separately from the H1 and L1 detectors, considering GW signals of CBCs generated only by hardware injections. This approach reproduces more realistic conditions and further strengthens the performance results of the CNNs, because it removes beforehand the instrumental freedom of generating the distribution of numerical relativity templates with respect to SNR values entirely by hand. Keep in mind that, in real experimental conditions, the SNR values of GW events, and even their frequency of occurrence, are given as (unmodifiable) facts, because these directly depend on the nature of the astrophysical sources and the GW detectors.

Hereinafter, this paper is organized as follows.
In Section II we describe the technical details of the methodology and its implementation, including datasets, pre-processing, CNN architectures, model training, validation, and the selection of performance metrics. Later, in Section III, we present the results and discussion, including learning monitoring, hyperparameter adjustments, performance metrics, and statistical tests. Finally, in Section IV, we include conclusions. In Appendices A and B we present brief descriptions of the Naive Bayes and Support Vector Machines models, respectively, which are the classic ML algorithms that we compare with our CNN algorithms, as shown in Section III.

II. METHODS AND MATERIALS

A. Problem statement
As the starting point of our problem, we have the i-th slice of raw strain data recorded at one of the LIGO detectors. In mathematical notation, this slice of data is expressed by a column vector of time series in $N_{\rm slice}$ dimensions,

$$\mathbf{s}^{\rm raw}_i(t) = \left[ s_i(t_0), s_i(t_1), \ldots, s_i(t_{N_{\rm slice}-1}) \right]^T, \qquad (1)$$

where the sampling time is $t_s = t_j - t_{j-1}$ with $j = 1, 2, \ldots, N_{\rm slice}-1$, the sampling frequency is $f_s = 1/t_s$ and, of course, the time length of the slice is $T_{\rm slice} = N_{\rm slice}/f_s$ (in seconds) with $N_{\rm slice}$ points of data. The next step is to model the above slice of data theoretically; therefore we introduce the following expression:

$$s^{\rm raw}_i(t) = \begin{cases} n_i(t) & \text{if there is not a GW}, \\ n_i(t) + h_i(t) & \text{if there is a GW}, \end{cases} \qquad (2)$$

where $n_i(t)$ is the non-Gaussian noise from the detector and $h_i(t)$ the strain from a GW whose duration, arrival time, and waveform are unknown. Notice that $h_i(t)$ is actually the response of the detector at its output when a GW is present at its input; thus its amplitude depends not only on the intensity of the incoming GW, but also on the location of the source in the sky, the polarization of the GW, and the location of the detector on the Earth's surface. If we work with data of one detector, Eq. 2 is not affected but, if we consider data from two or more detectors together (i.e. data from a network of detectors), the GW strain $h_i(t)$ will need to be multiplied by a scale factor specific to each detector, and the noise $n_i(t)$ will be different.

For this research, the problem that we address is to decide whether a segment of strain data contains only noise, or noise plus an unknown GW signal. In practice, then, we implement CNN algorithms that perform a binary classification, inputting strain samples of time length $T_{\rm win} < T_{\rm slice}$ and deciding, for each sample, whether it does not contain a GW signal (i.e. the sample ∈ class 1), or contains a GW signal (i.e. the sample ∈ class 2).

B. Dataset description
In order to build our strain samples, we use real data provided by the LIGO detectors, which are freely available on the LIGO-Virgo Gravitational Wave Open Science Center (GWOSC). We decided to use data of the sixth science run (S6), recorded from July 07, 2009 to October 20, 2010. During this run, the detectors achieved a sensitivity given by a power spectral density around 10⁻, with uncertainties up to ±15% in amplitude [54]. Each downloaded strain data slice has a time length of $T_{\rm slice}$ = 4096 s and a sampling frequency of $f_s$ = 4096 Hz.

S6 contains hardware injections already added to the noise strain data. These injections are generated by physically displacing the mirrors of the detectors, simulating the effects of GWs, and are used for experimental tests and calibration of the detectors. For this research, we work solely with hardware injections of GWs emitted by CBCs. For each injection, we know the coalescence (or merger) time $t_c$ in GPS time, the masses $m_1$ and $m_2$ in solar mass units $M_\odot$, the distance $D$ to the source in Mpc, and the expected and recovered signal-to-noise ratios, namely ${\rm SNR}_{\rm exp}$ and ${\rm SNR}_{\rm rec}$, respectively. All this information is provided by LIGO on the aforementioned website.

For a detailed exploration, consider Fig. 1, which shows the distributions of the CBC injections in S6 data, from the H1 and L1 detectors, giving information about the masses of, and distances from, the sources. These injections simulate stellar BBHs located at the scale of nearby galaxies, covering a distance range from 10 Mpc to 100 Mpc. For data from both detectors there is a high occurrence of sources with total mass $M = m_1 + m_2$ from 1 $M_\odot$ to 12 $M_\odot$ approximately. These GW events are included in the first three bins of the histograms, lying mainly at distances of about 10 Mpc to 30 Mpc according to the scatter plots, except for some outliers in the subregion $m_1 > m_2$. For larger total masses, most of the BBHs are more distant, yet their frequency of occurrence clearly decreases, except for those events appearing adjacent to the vertical axis, with $m_2$ taking its smallest values independently of $m_1$, describing BBHs at closer distances with a high frequency of occurrence.

For this research, we discard data blocks with missing strain data, i.e. containing NaN (not a number) entries, and blind injections, i.e. those with no information except $t_c$.
We downloaded 501 segments of data from H1 and 402 segments of data from L1, in which the science data contained in the time series $s^{\rm raw}(t)$ are fully available.
FIG. 1. Distributions of the CBC GW injections present in the S6 data from the LIGO detectors H1 (left panel) and L1 (right panel). In particular, the masses of, and distances from, the sources are shown. Except for those GW events lying just adjacent to the vertical axis and some outliers, we can observe the general trend that as the total mass $M = m_1 + m_2$ increases, GW events tend to be less frequent and more distant. Data obtained from the LIGO-Virgo GWOSC.

C. Data pre-processing
With the LIGO raw strain data segments at hand, the next methodological step is data pre-processing. This has three stages, namely data cleaning, construction of strain samples, and application of a wavelet transform.
1. Data cleaning
Data cleaning or data conditioning is a standard stage to reduce the noise and flatten the power spectral density (PSD). It consists of three steps. First, for each i-th slice of raw strain data that we introduced in Eq. 1, segments around the coalescence times $t_c$ are extracted via a Blackman window of time length 128 s. Second, a whitening is applied to each l-th of these segments as:

$$s^{\rm white}_l(t) = \sum_{k=1}^{N} \frac{\tilde{s}^{\rm raw}_l(f_k)}{\sqrt{S^{\rm raw}_l(f_k)}}\, e^{i 2\pi t k/N}, \qquad (3)$$

where $\tilde{s}^{\rm raw}_l(f_k)$ is the N-point discrete Fourier transform of the 128 s segment of raw strain data $s^{\rm raw}_l(t)$, $S^{\rm raw}_l(f_k)$ the N-point two-sided PSD of the raw data, $N$ the number of points of data in the 128 s raw strain segment, and $i$ the imaginary unit. In theory, the PSD is defined as the Fourier transform of the autocorrelation of the raw data. We implemented Welch's estimate [55], computing the PSD between 0 Hz and 2048 Hz. After whitening, the autocorrelation of the data satisfies

$$R(\tau) = \langle s^{\rm white}(t+\tau)\, s^{\rm white}(t) \rangle = \sigma^2 \delta[\tau], \qquad (4)$$

with $\sigma^2$ denoting the variance and $\delta[\tau]$ the discrete-time unit impulse function. Finally, a Butterworth band-pass filter from 20 Hz to 1000 Hz is applied to the already whitened segment. This filtering removes extreme frequencies that are outside our region of interest and discards 16 s at the edges of the segment to avoid roll-off effects, resulting in a new segment $s^{\rm white+bpf} = s^{\rm clean}$ of time length $T_{\rm clean}$ = 96 s. Fig. 2 shows the effects of applying the whitening and the band-pass filtering to the raw strain data, both in the time domain and in the amplitude spectral density (ASD), which is computed as $\sqrt{\rm PSD}$. In particular, from the time-domain plots, it can be seen that after the cleaning the amplitude of the strain data is reduced by 5 orders of magnitude, from $10^{-16}$ to $10^{-21}$; the effect of data corruption at the edges of the data slice, after whitening, is clearly visible in the middle plot.
Also notice that before applying the data cleaning, the ASD shows the well-known noise profile that describes the sensitivity of the LIGO detectors, which is the sum of all contributions of noise sources, namely seismic activity, thermal fluctuations, and variable laser intensity, among others.
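The cleaning chain of Eqs. (3)-(4) can be sketched as below; scipy's Welch estimator and a one-sided ASD are used here as a stand-in for the paper's two-sided formulation, and the synthetic Gaussian segment replaces real S6 strain.

```python
import numpy as np
from scipy.signal import welch, butter, filtfilt

fs = 4096  # Hz, sampling frequency of the S6 strain data

def whiten(strain, fs):
    """Whiten a segment by dividing its spectrum by the Welch-estimated ASD
    (one-sided here, unlike the two-sided PSD in Eq. (3))."""
    freqs, psd = welch(strain, fs=fs, nperseg=fs)         # 1 Hz resolution
    spectrum = np.fft.rfft(strain - strain.mean())
    f_rfft = np.fft.rfftfreq(len(strain), d=1.0 / fs)
    asd = np.sqrt(np.interp(f_rfft, freqs, psd))
    return np.fft.irfft(spectrum / asd, n=len(strain))

def bandpass(strain, fs, f_lo=20.0, f_hi=1000.0, order=4):
    """Butterworth band-pass between 20 Hz and 1000 Hz (forward-backward)."""
    b, a = butter(order, [f_lo, f_hi], btype="band", fs=fs)
    return filtfilt(b, a, strain)

rng = np.random.default_rng(0)
segment = rng.normal(size=128 * fs)        # stand-in for a 128 s raw segment
clean = bandpass(whiten(segment, fs), fs)
clean = clean[16 * fs:-16 * fs]            # drop 16 s edges -> 96 s segment
assert len(clean) == 96 * fs
```

The edge trimming mirrors the 16 s removal described above, turning each 128 s whitened segment into a 96 s cleaned one.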
2. Strain samples
The next stage of the pre-processing is the building of the strain samples that will be input to our CNNs. This is

[Footnote: Explicitly, the discrete-time unit impulse function is defined by $\delta[\tau] = 1$ for $\tau = 0$, and $\delta[\tau] = 0$ for $\tau \neq 0$.]
FIG. 2. Illustration of the raw strain data cleaning. The left panel shows three time-series segments: first, a segment of raw strain data of 128 s after applying the Blackman window; second, the resulting strain data after the whitening, in which the spurious effects at the edges are noticeable; finally, the cleaned segment of data after applying the band-pass filtering and edge removal to the previously whitened segment. In the right panel we plot the same data, but now in the amplitude spectral density (ASD) vs. frequency space.

schematically depicted in Fig. 3. This procedure has two steps. First, the l-th cleaned strain data segment, denoted as $s^{\rm clean}_l(t)$, is split into overlapped windows of duration $T_{\rm win}$, identifying whether or not each window contains an injection; the time length $T_{\rm win}$ is, at the least, of the order of the injected waveform duration. Here classes are assigned: if a window of data contains an injection, then class 2 (C2) is assigned; on the other hand, if that window does not contain an injection, class 1 (C1) is assigned. This class assignment is applied to all windows of data. Next, from the set of all tagged windows we select four consecutive ones of C1, and four consecutive ones of C2. This procedure is applied to each segment of clean data $s^{\rm clean}_l(t)$ and, to avoid confusion with notation, we denote the k-th windowed strain sample as:

$$\mathbf{s}^{\rm win}_k(t) = \left[ s_k(t_0), s_k(t_1), \ldots, s_k(t_{N_{\rm win}-1}) \right]^T, \qquad (5)$$

where $N_{\rm win} = f_s T_{\rm win}$ according to the time length of the samples, with $f_s$ the sampling frequency. In practice, as we initially have 501 segments from H1 and 402 segments from L1, then 501 × 8 = 4,008 and 402 × 8 = 3,360 strain samples, respectively, are generated. Consequently, the index k appearing in Eq. 5 takes values from 0 to 4007 for the H1 detector data, and from 0 to 3359 for the L1 detector data. Besides, $T_{\rm win}$ is a resolution measure and, as will be seen later, we run our code with several values of $T_{\rm win}$ in order to identify which of them are optimal with respect to the performance of the CNNs.
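The windowing and labeling procedure can be sketched as follows; the exact overlap between windows is not stated in the text, so the 75% overlap (step of $T_{\rm win}/4$, as suggested by Fig. 3) is an assumption, as is the synthetic noise segment.

```python
import numpy as np

fs = 4096                      # Hz
T_win = 1.0                    # s, one of several window lengths explored
N_win = int(fs * T_win)
step = N_win // 4              # assumed 75% overlap (T_win/4 step, per Fig. 3)

def window_samples(segment, t_injection, fs):
    """Split a cleaned segment into overlapped windows and label each one:
    class 1 (noise only) or class 2 (noise + injection)."""
    samples, labels = [], []
    inj_idx = int(t_injection * fs)
    for start in range(0, len(segment) - N_win + 1, step):
        window = segment[start:start + N_win]
        label = 2 if start <= inj_idx < start + N_win else 1
        samples.append(window)
        labels.append(label)
    return np.array(samples), np.array(labels)

segment = np.random.default_rng(0).normal(size=96 * fs)  # 96 s cleaned segment
samples, labels = window_samples(segment, t_injection=48.0, fs=fs)

# Keep four consecutive windows of each class, as described above
c1 = samples[labels == 1][:4]
c2 = samples[labels == 2][:4]
assert c1.shape == (4, N_win) and c2.shape == (4, N_win)
```

Applying this to every cleaned segment yields the eight samples per segment that make up the 4,008 (H1) and 3,360 (L1) sample sets.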
3. Wavelet transform
Some works in GW data analysis have used raw strain time series directly as input to deep convolutional neural
FIG. 3. Strain sample generation by sliding windows. A CBC GW injection is located at coalescence time $t_c$ in the l-th raw strain data segment of 128 s. After cleaning, the segment is split into overlapped windows, individually identifying whether each window contains the GW injection. Finally, among all these windows, a set of eight strain samples is selected: four with noise alone (C1) and four with noise plus the mentioned CBC GW injection (C2). This set of eight samples will be part of the input dataset of our CNNs. This procedure is applied to all raw strain data segments.

networks (e.g. [45] and [48]); however, we will not follow this approach. We rather decided to apply CNNs to what they were designed for in their origins [56], and to what they have dramatically improved at over the last decades [57], namely image recognition. Then, we need a method to transform our strain vectors into image matrices, i.e. grids of pixels. For this research, we decided to use the wavelet transform (WT), which in signal processing is a well-known approach for working in the time-frequency representation [58]. One of the great advantages of the WT is that, by using a localized wavelet as kernel (also called the "mother wavelet"), it allows one to visualize tiny changes of the frequency structure in time and, therefore, to improve the search for GW candidates, which arise as non-stationary short-duration transients in addition to the detector noise.

In general, there are several wavelets that can be used as the kernel. Here we decided to work with the complex Morlet wavelet [59], which in its discrete version has the following form:

$$\psi(t_n, f_j) = \frac{1}{\sqrt{\sigma_{tj}\sqrt{\pi}}} \exp\left[-\frac{t_n^2}{2\sigma_{tj}^2}\right] \exp\left(2 i \pi f_j t_n\right), \qquad (6)$$

having a Gaussian form along time and along frequency, with standard deviations $\sigma_t$ and $\sigma_f$, respectively.
Moreover, these standard deviations are not independent of each other, since they are related by $\sigma_{tj} = 1/(2\pi\sigma_{fj})$ and $\sigma_{fj} = f_j/\delta w$, where $\delta w$ is the width of the wavelet and $f_j$ its center in the frequency domain.

It is important to clarify that the Morlet wavelet has a Gaussian shape, so it does not have compact support. For this reason, the mesh on which we define Eq. 6 (i.e. the set of discrete values for the time $t_n$ and the frequency $f_j$) is infinite by definition. Furthermore, the resolutions $\Delta t = t_n - t_{n-1}$ and $\Delta f = f_j - f_{j-1}$ are solely constrained by our system resources and/or by the extent to which we want to economize these resources.

Then, to perform the WT of the strain sample $s^{\rm win}_k(t)$ (defined by Eq. 5) with respect to the kernel wavelet $\psi(t_n, f_j)$ (defined by Eq. 6), we just need to compute the following convolution operation:

$$W s_k[t_n, f_j] = \sum_{m=0}^{N_{\rm win}-1} s_k(t_m)\, \psi^*(t_{m-n}, f_j), \qquad (7)$$

where $s_k(t_m)$ is the m-th element of the column vector $s^{\rm win}_k(t)$. Besides, $n = 0, 1, \ldots, N_{\rm time}$ and $j = 0, 1, \ldots, N_{\rm freq}$, where $N_{\rm time}$ and $N_{\rm freq}$ define the size of each k-th image generated by the WT, with $W s_k[t_n, f_j]$ just the (n, j) element or pixel of each generated image. In practice, we set our WT such that it outputs images with dimensions $N_{\rm time}$ = 4096 and $N_{\rm freq}$ = 47 pixels.

In general, the grid of pixels defined by all values $t_n$ and $f_j$ depends on the formulation of the problem. Here we chose frequencies varying from $f_1$ = 40 Hz to $f_{N_{\rm freq}}$ = 500 Hz, with a resolution of $\Delta f$ = 10 Hz, given that this is consistent with the GW signals that we want to detect.

[Footnote: The WT suits our purposes better than standard procedures such as the Fourier transform. The latter draws on kernels describing simple harmonic oscillations, i.e. without temporal localization, making it the default option for searches of steady-state signals rather than transient signals.]
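The discrete Morlet transform of Eqs. (6)-(7) can be sketched as below; the wavelet width $\delta w = 6$ is an assumed value (the text does not fix it), the random sample stands in for a real cleaned strain window, and scipy's cubic spline `zoom` stands in for the bicubic resizing described next.

```python
import numpy as np
from scipy.ndimage import zoom

def morlet(t, f_j, delta_w=6.0):
    """Complex Morlet kernel of Eq. (6); delta_w = 6 is an assumed width."""
    sigma_f = f_j / delta_w
    sigma_t = 1.0 / (2.0 * np.pi * sigma_f)
    return (1.0 / np.sqrt(sigma_t * np.sqrt(np.pi))
            * np.exp(-t ** 2 / (2.0 * sigma_t ** 2))
            * np.exp(2j * np.pi * f_j * t))

def wavelet_image(sample, fs, freqs):
    """One convolution (Eq. (7)) per analysis frequency -> N_time x N_freq map."""
    n = len(sample)
    t = (np.arange(n) - n // 2) / fs        # kernel grid centered on the window
    image = np.empty((n, len(freqs)))
    for j, f_j in enumerate(freqs):
        conv = np.convolve(sample, np.conj(morlet(t, f_j)), mode="same")
        image[:, j] = np.abs(conv)          # pixel magnitude at (t_n, f_j)
    return image

fs = 4096
freqs = np.arange(40, 501, 10)              # 40 Hz to 500 Hz, 10 Hz steps -> 47
sample = np.random.default_rng(0).normal(size=fs)   # a 1 s strain sample
image = wavelet_image(sample, fs, freqs)
assert image.shape == (4096, 47)            # N_time x N_freq

small = zoom(image, (32 / 4096, 16 / 47), order=3)  # reduced input for the CNNs
assert small.shape == (32, 16)
```

The frequency grid reproduces the 47-bin layout stated in the text, and the final reduction anticipates the 32 × 16 images that feed the CNNs.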
In addition, as we apply the WT to each cleaned strain data sample, we have discrete time values varying from t = 0 s to t_{N_time} = T_win − 1/f_s, with a resolution of Δt = t_s = 1/f_s, where f_s = 4096 Hz is the sampling frequency of the initial segments of strain data.

Although the size of the output images is not too large, we decided to apply a resizing to reduce them from 4096 × 47 to 32 × 16 pixels. Keep in mind that, as we need to analyse several thousand images, using their original size would be unnecessarily expensive for system resources. The resizing was performed by a bicubic interpolation, in which each pixel of a reduced image is a weighted average of the pixels in the nearest 4-by-4 neighborhood.

In summary, after applying the WT and the resizing to each strain data sample, we generated an image dataset {X_i, y_i}_{i=1}^{N_set}, where X_i ∈ R^{N_time × N_freq}, with N_time = 32 and N_freq = 16, depicts each image as a matrix of pixels, y_i ∈ {1, 2} are the classes that we are working with, and N_set ∈ {4008, 3360} depending on whether we are using data from H1 or L1, respectively.

Fig. 4 shows two representative strain data samples (one belonging to C1 and the other to C2) as time-domain signals, their time-frequency representations according to the WT with a Morlet wavelet as kernel, and their resized forms. Both samples were generated from a strain data segment of 4094 s recorded by the L1 detector at GPS time 932335616. The image sample on the right shows a GW transient with a variable frequency approximately between 100 Hz and 400 Hz. It is important to clarify that, before entering the CNNs, image samples are rearranged such that the whole dataset has a size of N_set × 32 × 16 × 1, denoting N_set images of size 32 × 16 with a single channel.

D. CNN architectures
Historically, before CNNs were developed, traditional feedforward Deep Neural Networks (DNNs, which are just ANNs with more than one hidden fully connected layer) were not able to produce good results as more data and/or layers were included [28]. For image recognition this was a big issue, and it was not overcome until the seminal works of Fukushima [60] and LeCun et al. [61]. Both works were bioinspired, developing new algorithms that took as reference the experiments of Hubel and Wiesel [62–64] about feature learning in the visual cortex of cats and monkeys. The already mentioned work of LeCun et al. developed the first CNN architecture, namely the LeNet-5 architecture.

What distinguishes CNNs from DNNs is that the former use the convolution operation in one or more of their layers,

[Footnote] In image processing, it is usual to specify the size of an image as height × width. However, as we are defining our images by a standard 2D discrete plot (x axis vs. y axis), we preferred to invert the convention by putting sizes as width × height.
FIG. 4. Visualization of two image samples of length T_win: one with noise only (left panels) and one containing a hardware injection with M1 = 8.17, M2 = 6.33, D = 6.44, and SNR = 21.28 (right panels), where M1 and M2 are the component masses, D is the distance in Mpc, and SNR is the expected signal-to-noise ratio. Upper panels show strain data in the time domain, middle panels show time-frequency representations after applying a WT, and bottom panels show images resized by a bicubic interpolation.

rather than (or in addition to) fully connected layers [28], leading to three advantages that cannot be found in the latter. Firstly, weight sharing which, unlike the general matrix multiplication performed by fully connected layers, is a tremendous help for reducing system requirements. Secondly, hierarchical learning: by applying several convolutional kernels or filters, a CNN is able to extract features from the input image from low levels to higher levels. Finally, their significant improvements in performance with respect to classic ML algorithms, provided that there are enough data. These advantages are just practical because, from a fundamental point of view, there are still no theories able to explain why CNNs have a small generalization error or, even, to establish analytical criteria to unequivocally choose a specific architecture for performing a particular task [65]. Even so, there are several standards for CNN design and certain mathematical guidelines that we can adopt to understand how a CNN learns from data inside the black box.

Our CNN algorithms consist of two main stages in terms of functionality, namely feature extraction and classification. The first stage begins by inputting images from a training dataset, and the second stage ends by outputting predicted classes for each image sample, to be compared with the actual classes. Obviously, classification is our ultimate goal, consisting of a perceptron stage plus an activation function, as usual in ANNs. However, feature extraction is actually the core of CNNs, providing the ability of image recognition itself.
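The weight-sharing advantage mentioned above is easy to quantify with the paper's own first-layer dimensions (a 32 × 16 input and 5 × 5 kernels); the comparison below against a hypothetical dense map between the same layer sizes is our own illustration, not a computation from the paper.

```python
# Parameter count: one shared 5x5 kernel versus a dense (fully connected)
# map between the same layer sizes (32x16 input -> 28x12 feature map).
in_h, in_w = 32, 16                              # input image, pixels
k_h, k_w = 5, 5                                  # convolutional kernel
out_h, out_w = in_h - k_h + 1, in_w - k_w + 1    # valid convolution

conv_params = k_h * k_w + 1                      # shared weights + one bias
dense_params = (in_h * in_w) * (out_h * out_w) + (out_h * out_w)  # full matrix + biases
```

For a single feature map, the shared kernel needs 26 learnable parameters, while a dense map between the same two layers would need over 170,000.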
There are three sub-stages here that are very common in many CNNs, including ours: convolution, which applies an affine transform; detection, for recognizing nonlinearities; and pooling, which performs a downsampling. These sub-stages are implemented through layers that, as a whole, define a stack. Then, this stack can be connected to a second stack, and so on, until the last stack is connected to the classifier. As will be seen later, for our hyperparameter adjustments we tested CNN architectures with 1, 2, and 3 stacks.

Knowing the general functionality of CNNs is a must but, to contribute with a white-box approach, we need to understand what each layer does. For this reason, we proceed to mathematically describe each kind of layer that was used, not only those involved in the already mentioned (sub)stages of the CNN, but also those that were required in our hands-on implementation with the MATLAB Deep Learning Toolbox [66]. Keep in mind that the output of a layer is the input of the next layer, as detailed in Fig. 5 for a single-stack CNN.
Image Input Layer. Inputs images and applies a zero-center normalization. Denoting the i-th input sample as the matrix of pixels X_i ∈ R^{N_time × N_freq}, belonging to a dataset of N_train same-size training images, this layer outputs the normalized image:

X_i^norm(j, k) = X_i(j, k) − (1/N_train) Σ_{i=1}^{N_train} X_i(j, k),   (8)

where the second term is the average image of the whole dataset. Normalization, also known as histogram stretching, is useful for dimension scaling, putting changes in each attribute (i.e. each pixel (j, k) along all images) on a common scale. As normalization does not distort relative intensities too seriously and helps to enhance the contrast of images, we can apply it to the entire training dataset, independently of the class each image belongs to.

Convolution Layer. Convolves each image X_i^norm with C_K sliding kernels of dimension K_time × K_freq. This layer just applies an affine transform to the input images. Denoting each l-th kernel by K_l, with l ∈ {1, 2, ..., C_K}, this layer outputs C_K feature maps, each of them an image composed of the elements or pixels:

M_i^l(p, q) = Σ_m Σ_n X_i^norm(m, n) K_l(p − m + 1, q − n + 1) + b,   (9)

where b is a bias term, and the indices p and q run over all values that lead to legal subscripts of X_i^norm(m, n) and K_l(p − m + 1, q − n + 1). Depending on the parametrization of the subscripts m and n, the dimension of the images M_i^l can vary. If we include the width M_time and height M_freq of the output maps (in pixels) in a two-dimensional vector just for notation, these spatial sizes are computed by:

(M_time, M_freq) = [(N_time, N_freq) − (K_time, K_freq) + 2(pad_time, pad_freq)] / (str_time, str_freq) + (1, 1),   (10)

where str (i.e. stride) is the step size in pixels with which a kernel moves over X_i^norm, and pad (i.e. padding) denotes time rows and/or frequency columns of pixels added to X_i^norm for moving the kernel beyond the borders of the former.
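Eq. 10 reduces, per dimension, to the familiar size rule for a valid convolution; the pooling size rule of Eq. 13 later in the text has the same form. A one-line sketch (ours, not from the paper's code), checked against the paper's first convolutional layer:

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Spatial output size along one dimension (Eq. 10, and Eq. 13 for pooling):
    floor((n - k + 2*pad) / stride) + 1."""
    return (n - k + 2 * pad) // stride + 1

# First convolutional layer: 32x16 input, 5x5 kernels, stride 1, no padding.
m_time = conv_output_size(32, 5)
m_freq = conv_output_size(16, 5)
# A subsequent 2x2 max pooling with stride 2 halves each dimension.
p_time = conv_output_size(m_time, 2, stride=2)
```

This reproduces the 28 × 12 feature maps (and the 14-pixel pooled width) listed in Table I.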
During the training, the components of the kernels and the bias terms are iteratively learned from certain initial values; then, once the CNN has captured and fixed optimal values for these parameters, the convolution is applied to all testing images.

ReLU Layer. Applies the Rectified Linear Unit (ReLU) activation function to each neuron (pixel) of each feature map M_i^l obtained from the previous convolutional layer, outputting the following:

R_i^l(p, q) = max{0, M_i^l(p, q)}.   (11)

In practice, this layer works as a detector of nonlinearities in the input sample images, and its neurons can output true zero values, generating sparse interactions and reducing system requirements. Besides, this layer does not lead to saturation in hidden units during the learning, because its form, given by Eq. 11, does not converge to finite asymptotic values.

[Footnote] Saturation is an effect in which an activation function located in a hidden layer of a CNN converges rapidly to its finite extreme values, making the CNN insensitive to small variations of the input data in most of its domain. In feedforward networks, activation functions such as sigmoid or tanh are prone to saturation, hence their use is discouraged except when the output layer of the CNN has a cost function able to compensate their saturation [28], as for example the cross-entropy function.
Max Pooling Layer. Downsamples each feature map R_i^l with the maximum over local sliding regions L_R of dimension P_time × P_freq. Each pixel of a resulting reduced feature map m_i^l is given by the following:

m_i^l(r, s) = max {R_i^l(p, q)}, ∀ (p, q) ∈ L_R,   (12)

where the ranges of the indices r and s depend on the spatial sizes of the output maps; and these sizes, i.e. the width m_time and height m_freq, included in a two-dimensional vector just for notation, are computed by:

(m_time, m_freq) = [(M_time, M_freq) − (P_time, P_freq) + 2(pad_time, pad_freq)] / (str_time, str_freq) + (1, 1),   (13)

where the padding and stride values have the same meanings as in the convolutional layer. Interestingly, the max pooling layer leaves output values invariant under small translations in the input images, which could be useful for working with a network of detectors, a case in which a GW signal from the same source has several versions depending on the number of detectors being considered, with their respective time delays. More generally, this layer contributes to reducing overfitting and the number of neurons per map, and hence the system requirements.

Fully Connected Layer. This is the classic perceptron layer used in classical ANNs and performs the binary classification itself. It maps all images m_i^l to the two-dimensional vector h_i by the affine transformation:

h_i = W^T m_i^flat + b,   (14)

where m_i^flat is a vector of N_fc dimensions, with N_fc the total number of neurons considering all input feature maps, b a two-dimensional bias vector, and W a weight matrix of dimension 2 × N_fc. Similarly to the convolutional layer, the elements of W and b are the model parameters to be learned in the training, beginning from certain initial values appropriately chosen. Notice that the vector m_i^flat flattens all feature maps m_i^l (with l = 1, 2, ..., C_K) into a single column vector; therefore, information about the topology of the sample images (i.e. their edges) is lost.

Softmax Layer. Applies the softmax activation function to each component j of the vector h_i:

y_ji = softmax(h_ji) = e^{h_ji} / Σ_j e^{h_ji},   (15)

where j = 1, 2. The outputs y_ji can be interpreted as posterior probabilities of class c_j conditioned by the model parameters. That is to say, y_ji(θ) = P_i(c_j | θ), where θ is a multidimensional vector containing all model parameters. This is the reason why it is common to refer to the quantity y_ji(θ) ∈ [0, 1] as the output score of the CNN.
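Eq. 15 can be sketched in a few lines of numpy (our own illustration; the input values are hypothetical, and the max-shift is a standard numerical-stability trick that cancels in the ratio and does not change Eq. 15):

```python
import numpy as np

def softmax(h):
    """Softmax of Eq. 15, shifted by max(h) for numerical stability."""
    e = np.exp(h - np.max(h))
    return e / e.sum()

# Hypothetical two-dimensional output of the fully connected layer.
h = np.array([2.0, -1.0])
scores = softmax(h)    # output scores P(c1|theta), P(c2|theta)
```

By construction the two scores are in [0, 1] and sum to one, which is what allows them to be read as class posteriors.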
Classification Layer. Stochastically takes Ñ training image samples and computes the cross-entropy cost function which, for our binary problem, takes the standard form

E(θ) = − Σ_{i=1}^{Ñ} Σ_{j=1}^{2} t_ji ln y_ji(θ),   (16)

where t_ji = 1 if the i-th sample belongs to class c_j and t_ji = 0 otherwise.

As detailed, the model parameters are included in the vector θ, which in turn appears in the cross-entropy function (16). Starting from given initial values, these parameters have to be learned by a minimization of the cross-entropy, taking Ñ < N_train random image samples, i.e. a mini-batch. For the model parameter updating, we draw on the well-known gradient descent algorithm, including a momentum term to boost the iterations. Denoting the model parameters at the r-th iteration or epoch as θ^(r), their update at the (r + 1)-th iteration is given by the following optimization rule:

θ^(r+1) = θ^(r) − α ∇_{θ^(r)} E(θ^(r)) + γ (θ^(r) − θ^(r−1)),   (17)

where ∇_{θ^(r)} is the gradient with respect to the model parameters and E the cross-entropy, in addition to two empirical quantities to be set by hand, namely the learning rate α and the momentum γ. Given that we are computing the cross-entropy with Ñ < N_train random samples, the above rule is called the mini-batch stochastic gradient descent (SGD) algorithm. Besides, for all our experiments we set a learning rate α = 1.00, a momentum γ = 0.9, and a mini-batch size of Ñ = 128 image samples.

With regard to the initialization of weights, we draw on the Glorot initializer [68]. This scheme independently samples values from a uniform distribution with a mean equal to zero and a variance given by 2/(n_in + n_out), where n_in = K_time K_freq and n_out = K_time K_freq C_K for convolutional layers, and n_in = C_K size(m_i^flat) and n_out = 2 C_K for the fully connected layer (remember that C_K is the number of kernels of dimension K_time × K_freq, and m_i^flat is the vector inputted by the fully connected layer). All biases, on the other hand, are initialized with zeros.

Finally, whether we work with data from the H1 or L1 detector, N_train will be considerably greater than Ñ = 128; nevertheless, its specific value depends on our global validation technique, to be explained in subsection II F.
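The update rule of Eq. 17 can be exercised on a toy objective. The sketch below (ours, not the paper's MATLAB code) minimizes a simple quadratic E(θ) = ||θ||²/2, whose gradient is just θ, standing in for the cross-entropy; α and γ are the values quoted in the text.

```python
import numpy as np

def sgd_momentum_step(theta, theta_prev, grad, alpha=1.0, gamma=0.9):
    """One update of Eq. 17:
    theta_(r+1) = theta_r - alpha * grad(E) + gamma * (theta_r - theta_prev)."""
    return theta - alpha * grad + gamma * (theta - theta_prev)

theta_prev = np.array([1.0, -2.0])   # theta^(0); hypothetical starting point
theta = theta_prev.copy()
for _ in range(200):
    grad = theta                     # gradient of E(theta) = 0.5 * ||theta||^2
    theta, theta_prev = sgd_momentum_step(theta, theta_prev, grad), theta
final_norm = np.linalg.norm(theta)
```

For this quadratic, the iterates spiral toward the minimum at the origin, with the momentum term reusing the previous displacement θ^(r) − θ^(r−1) to keep the iteration moving.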
F. Global and local validation

For this research, we used only real LIGO strain data with a given and limited number of CBC hardware injections. In practice, this approach removes the instrumental ability to arbitrarily generate big datasets, unlike the classical approach in which one can perform an unlimited number of software injections with numerical and/or analytical templates. Even when using synthetic data, and because of system resource limitations, the order of magnitude of generated templates is quite small compared to the big data regime of petabytes and beyond. Therefore, a global validation technique is actually required to reach good statistical confidence and to perform a fair model evaluation.

FIG. 5. Single-stack CNN architecture used as the basis for this research. N_train same-size image samples are simultaneously inputted but, for simplicity, we detail the procedure for a single image; besides, although this image is shown colorized, the training image dataset occupies just one channel. An image X_i feeds the CNN: normalization, then convolution producing C_K feature maps, ReLU, and max pooling form the feature extraction stack, followed by the fully connected and softmax layers in the classification stage. A two-dimensional vector is outputted, giving us the posterior probabilities y_1i = P_i(C1 | θ) of class 1 (noise alone) and y_2i = P_i(C2 | θ) of class 2 (noise + GW), both conditioned by the model parameters included in the vector θ. After that, the cross-entropy E(θ) is computed. The notation for input (output) matrices and vectors is the same introduced in section II D to mathematically describe the several kinds of layers used in the CNNs.
For these reasons, we implemented the k-fold CV technique [69], which consists of the following recipe: i) split the original dataset into k nonoverlapping subsets to perform k trials; then ii) for each i-th trial, use the i-th subset for testing, leaving the rest of the data for training; and finally iii) compute the average of each performance metric across all trials.

It is known that the value of k in k-fold CV defines a trade-off between bias and variance. When k = N (i.e. leave-one-out cross-validation), the estimation is unbiased but the variance can be high and, when k is small, the estimation will have lower variance but bias could be a problem [70]. For several classic ML algorithms, previous works have suggested that k = 10 represents the best trade-off option ([71–73]), so we decided to take this value as a first approach. Even more, following works [74] and [75], we decided to perform 10 repetitions of the 10-fold cross-validation process [53], in order to reach a good stability of our estimates, to present fair values of the cross-entropy function given the stochastic nature of our resampling approach and, more importantly, to obtain information about the distribution of the accuracy (and other standard metrics), in which uncertainty is involved.

Moreover, k-fold CV helps to alleviate the artificiality introduced by balanced datasets, because the initial splitting into k folds is totally random. In research [48], the authors warn about the fact that CNNs not only capture GW templates alone, but also transfer to the test stage the exact same probability distribution given in the training set. This claim is true, but it is important to keep in mind that working with balanced datasets, as a first approach, is simply motivated by the fact that many standard performance metrics give excessively optimistic results for the classes of higher frequency in an imbalanced dataset, and dealing with arbitrarily unbalanced datasets is not a trivial task.
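The repeated k-fold recipe above amounts to generating 10 × 10 = 100 train/test splits. A minimal index-level sketch (our own, using only numpy; the paper's implementation is in MATLAB):

```python
import numpy as np

def repeated_kfold_indices(n, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` independent runs of
    k-fold CV: shuffle, split into k nonoverlapping folds, and use each
    fold once for testing with the remaining folds for training."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)
        folds = np.array_split(perm, k)
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            yield train, test

# With the H1 dataset size quoted later in the text (4008 samples):
splits = list(repeated_kfold_indices(4008))
```

Each of the 100 splits partitions all 4008 indices, with the test fold of roughly N/k samples held out and disjoint from its training set.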
Anyway, including k-fold CV as a random resampling, starting from a balanced dataset, is desirable for statistical purposes.

With the sizes of our datasets detailed at the end of subsection II C 3 and the 10-fold CV, we have a test set of N_test = floor[4008/10] = 400 samples obtained from the H1 detector, and of N_test = floor[3360/10] = 336 samples from the L1 detector. Consequently, N_train = 3608 for H1 data, and N_train = 3024 for L1 data.

Layer | Activations (per image sample) | Learnables (per image sample)
Image Input | 32 × 16 × 1 | –
Convolution of size 5 × 5, C_K kernels (strides: 1, paddings: 0) | 28 × 12 × C_K | Weights: 5 × 5 × 1 × C_K; Biases: 1 × 1 × C_K
ReLU | 28 × 12 × C_K | –
Max Pooling of size 2 × 2 (stride 2) | 14 × 6 × C_K | –
Convolution of size 5 × 5, C_K kernels (strides: 1, paddings: 0) | 10 × 2 × C_K | Weights: 5 × 5 × C_K × C_K; Biases: 1 × 1 × C_K
ReLU | 10 × 2 × C_K | –
Max Pooling of size 2 × 2 (stride 2) | 5 × 1 × C_K | –
Convolution of size 4 × 1, C_K kernels (strides: 1, paddings: 0) | 2 × 1 × C_K | Weights: 4 × 1 × C_K × C_K; Biases: 1 × 1 × C_K
ReLU | 2 × 1 × C_K | –
Max Pooling of size 2 × 1 (stride 2) | 1 × 1 × C_K | –
Fully Connected | 1 × 1 × 2 | Weights: 2 × C_K; Biases: 2 × 1
Softmax | 1 × 1 × 2 | –
Output Cross-Entropy | – | –

TABLE I. CNN architecture of 3 stacks used for this research. The number of kernels in the convolutional layers is variable, taking values C_K ∈ {8, 12, 16, 20, 24, 28, 32}. Part of this illustration is also valid for the CNN architectures of 1 or 2 stacks (also implemented in this research), in which the image input layer is followed by the next three or six layers as feature extractor, respectively, and then by the fully connected layer up to the output cross-entropy layer in the classification stage. All these CNNs were implemented with the MATLAB Deep Learning Toolbox.

Local validation, that is to say, validation performed within a learning epoch, is also a crucial ingredient of our methodology. In particular, our algorithm splits the training set of N_train data into two subsets, one for the training itself and the other, of size N_valid, for validation.
Validation works as a preparatory mini-test which is useful for monitoring learning and generalization during the training process.

With regard to regularization techniques, local validation was performed once per floor[N_train/Ñ] epochs, and the cross-entropy was monitored with a validation patience p = 5 (value set by hand), which simply means that if E^(r+1) ≥ E^(r) occurs p times during the validation, then the training process is automatically stopped. Besides, to avoid overfitting, training samples were randomized before training and before validation, solely in the first learning epoch. This randomization is performed such that the link between each training image X_i and its respective class y_i is left intact, because we do not want to "mislead" our CNN when it learns from known data.

G. Performance metrics

Once our CNN is trained, the goal is to predict the classes of unseen data, i.e. data on which the model was not trained, and to achieve a good performance. Hence, performance metrics in the test are really crucial. Considering the last layer of our CNNs, a metric that is natural to monitor during the training and validation process is the cross-entropy. Other metrics that we use come from counting predictions. As our task has to do with a binary classification, these metrics can be computed from the elements of a 2 × 2 confusion matrix:

                      Actual values
Predicted values      Positive              Negative              total
  Positive            True Positives (TP)   False Positives (FP)  P'
  Negative            False Negatives (FN)  True Negatives (TN)   N'
total                 P                     N

Metric | Definition | What does it measure?
Accuracy | (TP + TN)/(P + N) | How often a correct classification is made
Precision | TP/P' | How many examples retrieved as relevant are truly relevant
Recall | TP/P | How many truly relevant examples are retrieved
Fall-out | FP/N | How many examples retrieved as relevant are non-relevant
F1 score | 2 (Precision × Recall)/(Precision + Recall) | Harmonic mean of precision and recall
G mean1 | √(Recall × Fall-out) | Geometric mean of recall and fall-out

TABLE II.
Confusion matrix for a binary classifier and the consequent standard performance metrics. In general, to have a complete view of the classifier, it is suitable to draw on, at least, accuracy, precision, recall, and fall-out. F1 score and G mean1 are useful metrics for imbalanced multilabel classifications, and they also give important moderation features which help in model evaluation. Each metric has its probabilistic interpretation.

Besides the counting metrics, we also draw on the ROC and precision-recall curves, among others. The first curve describes the performance of the CNN in the fall-out (or false positive rate) vs. recall (or true positive rate) space, and the second one in the recall vs. precision space. Given that these curves are generated by varying a threshold, each point of the curves represents the performance of the CNN at a specific threshold. As we worked with balanced datasets, plotting ROC curves will be enough. Finally, F1 score and G mean1 are metrics that summarize pairs of other metrics in a single value and are, in general, useful for imbalanced multiclass classifications. Anyway, we decided to compute these two last metrics because they give useful moderation features for performance evaluation (more details in subsection III C).

In order to ensure that our results are statistically significant, we also want to perform a shuffling. As mentioned in subsection II F, our algorithm already performs a randomization over the training set, before the training and before the validation, in order to prevent overfitting. However, the shuffling applied here is more radical because it breaks the link between training images and their respective classes; it is made by a random permutation over the indices i, solely for the matrices X_i belonging to the training set {X_i, y_i}_{i=1}^{N_train}, before each training. Then, if this shuffling is present and our results are truly significant, we expect the accuracy in testing to be lower than that computed when the shuffling is not present, reaching values around 0.5. This procedure yields two sets of accuracy values, D and D_shuff, obtained without and with shuffling, respectively.
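All the counting metrics of Table II follow directly from the four confusion-matrix entries; a minimal sketch with hypothetical counts (the function and values are ours, for illustration only):

```python
from math import sqrt

def metrics(tp, fp, fn, tn):
    """Standard metrics of Table II from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn              # actual positives, actual negatives
    precision = tp / (tp + fp)           # TP / P'
    recall = tp / p                      # TP / P
    fallout = fp / n                     # FP / N
    return {
        "accuracy": (tp + tn) / (p + n),
        "precision": precision,
        "recall": recall,
        "fall-out": fallout,
        "f1": 2 * precision * recall / (precision + recall),
        "g_mean1": sqrt(recall * fallout),   # as defined in Table II
    }

m = metrics(tp=80, fp=10, fn=20, tn=90)      # hypothetical counts
```

With these counts, for instance, accuracy is 170/200 = 0.85 while recall is only 80/100 = 0.80, showing how the single-label metrics can diverge from the multi-label one.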
Then, with the means of each of these sets, namely µ and µ_shuff, the goal is to test the null hypothesis H_0: µ − µ_shuff = 0, and then we just need to compute the p-value, i.e. the probability of the resulting accuracies being possible assuming that the mentioned null hypothesis is correct, given a level of significance (more details about the shuffling and the consequent statistical tests are presented in subsection III E).

III. RESULTS AND DISCUSSION

A. Learning monitoring per fold

While the mini-batch SGD was running, we monitored the cross-entropy and accuracy evolution along the epochs. This is the very first check to ensure that our CNNs were properly learning from data and that our local validation criteria stopped the learning algorithm at the right moment. If our CNNs were correctly implemented, we expected the cross-entropy to be minimized, reaching values as close as possible to 0, and the accuracy to reach values as close as possible to 1. If this check gave wrong results, then there would be no point in computing subsequent metrics.

Two representative examples of this check, using data from the H1 and L1 detectors, are shown in Fig. 6. Both were performed during a single fold of a 10-fold CV experiment, from the first to the last mini-batch SGD epoch, using a fixed time resolution T_win.

FIG. 6. Evolution of the cross-entropy (left panels) and accuracy (right panels) as a function of the epochs of the learning process, using data from the H1 detector (upper panels) and the L1 detector (lower panels).
On trend, the cross-entropy decreases and the accuracy increases, even if there is a clear stochastic component due to the mini-batch SGD learning algorithm.

When the CNN finishes its learning process, the cross-entropy reaches values of 0.184 with H1 data and 0.357 with L1 data. Because mini-batches of Ñ < N_train samples are taken, stochastic noise is introduced. Besides, when using data from the L1 detector, some anomalous peaks appear between epochs 350 and 400, but this is not a problem because the CNN normally continues its learning process and the trends in both metrics are not affected. In the end, we can observe this resilience effect because of our validation patience criterion, which is implemented to prevent our CNN algorithm from prematurely stopping and/or to dispense with manually adjusting the total number of epochs for each learning fold.

Still focusing on the SGD fluctuations, the zoomed plots in Fig. 6 show their order of magnitude (the highest peak minus the lowest peak). When we work with H1 data, the cross-entropy fluctuations are about 0.130.

B. Hyperparameter adjustments

Our CNN models introduce several hyperparameters, namely the number of stacks, the size and number of kernels, the size of the pooling regions, the number of layers per stack, the stride, and the padding, among many others. Presenting a systematic study of all hyperparameters is beyond the scope of our research. However, given the CNN architectures shown in Table I, we decided to study and adjust two of them, namely the number of stacks and the number of kernels in the convolutional layers. In addition, although it is not a hyperparameter of the CNN, the time length T_win of the samples is a resolution that also needed to be set to reach an optimal performance, so we included it in the following analyses.

The first hyperparameter adjustment is shown in Fig. 7 and was implemented to find the optimal number of stacks and time length resolution, according to the resulting mean accuracies.
Left panels show the distribution of mean accuracy for all 10 repetitions or runs of the entire 10-fold CV experiment, as a function of T_win. Each of these mean accuracies, which we denote as Acc_CV^i, is the average over all fold accuracies of the i-th run of the 10-fold CV. In addition, right panels show the mean of mean accuracies among all 10 runs, i.e. Acc = (1/10) Σ_{i=1}^{10} Acc_CV^i, as a function of T_win. Inside the right plots we have included small boxplots that, as will be seen next, are useful to study the dispersion and skewness of the mean accuracy distributions (circles inside the boxplots are distribution mean values). Line plots show the contributions of our three CNN architectures, i.e. with 1, 2, and 3 stacks.

Consider the top panels of Fig. 7 for H1 data. Notice from the right plot that, for all CNN architectures, the mean of mean accuracies shows a trend to decrease as T_win grows, and this decrease occurs more pronouncedly when we work with fewer stacks. Besides, the highest mean of mean accuracies is reached for T_win = 0.75 s with a CNN of 3 stacks. Then, to decide if this setting is optimal, we need to explore the left upper plot of Fig. 7 together with the boxplots inside the right upper plot. From the left plot, we have that not only the mentioned setting gives high mean accuracy values, but also T_win = 0.25 s, T_win = 0.75 s with 2 and 3 stacks, and even T_win = 1.25 s reach mean accuracies around 0.9. The setting of T_win = 1.25 s with 3 stacks can be discarded because its maximum mean accuracy is clearly an outlier and, to elucidate which of the remaining settings is optimal, we need to explore the boxplots.

Here it is crucial to assimilate that the optimal setting to choose actually depends on what specifically we want. If the dispersion does not worry us too much and we want a high probability of occurrence for many high values of mean accuracy, the setting of T_win = 0.25 s with 2 stacks is the best, because its distribution has a slightly negative skewness, concentrating most of its probability mass at upper mean accuracy values.
On the other hand, if we prefer more stable estimates, working with less dispersion at the expense of a clear positive skewness (in fact, having a high mass concentration in a region that does not reach mean accuracy values as high as the range from the median to the third quartile in the previous setting), the setting of T_win = 0.75 s with 3 stacks is the natural choice. In practice, we would like to work with greater dispersions if they help to reach the highest mean accuracy values but, as all our boxplots have similar maximum values, we decided to maintain our initial choice of T_win = 0.75 s with 3 stacks for H1 data.

From the left lower scatter plot of Fig. 7 it can be seen that, regardless of the number of stacks, the data dispersion at larger values of T_win is greater than at smaller values. This, together with the decrease of the mean of mean accuracies as T_win increases, motivates discarding all settings with the largest T_win values, leaving T_win = 0.5 s with 2 stacks, T_win = 0.75 s with 2 and 3 stacks, and T_win = 1.00 s with 3 stacks. Now, exploring the boxplots inside the right panel, we notice that, even if the settings with 3 stacks reach the highest mean accuracy values, their positive skewness toward lower mean accuracy values is not great. Then, the two remaining settings, which in fact have negative skewness toward higher mean accuracy values, are the optimal options, and again choosing one or the other will depend on to what extent we tolerate data dispersion. Unlike the upper panels, here a larger dispersion increases the probability of reaching higher mean accuracy values; therefore, we finally decided to work with the setting of T_win = 0.75 s with 2 stacks for L1 data.

To find the optimal number of kernels in the convolution layers, we performed the adjustment shown in Fig. 8, again separately for the data from each LIGO detector. Considering the information provided by the previous adjustment, we set T_win = 0.75 s, with 3 stacks for H1 data and 2 stacks for L1 data.
Twin/s M ean a cc u r a cy pe r ea c h r un Stack setting using data from H1 detector Twin/s M ean o f m ean a cc u r a c i e s a m ong a ll r un s Stack setting using data from H1 detector No. stacks , Twin M ean a cc u r a cy pe r ea c h r un Twin/s M ean a cc u r a cy pe r ea c h r un Stack setting using data from L1 detector Twin/s M ean o f m ean a cc u r a c i e s a m ong a ll r un s Stack setting using data from L1 detector Twin/s, No. stacks M ean a cc u r a cy pe r ea c h r un FIG. 7. Hyperparameter adjustment to find best CNN architectures, i.e. number of stacks, and time resolution T win . Leftpanels show all samples of mean accuracies, and right panels the mean of mean accuracies among all runs; all in function of T win for 1, 2, and 3 stacks in the CNN. Besides, some small boxplots are included inside right plots to have more clear informationabout dispersion and skewness of distribution of mean accuracies. Based on this adjustment, we conclude that T win = 0 . zontal spreading inside each boxplot was made to avoidvisual overlap of markers, and it does not mean that sam-ples was obtained with number of kernels different fromalready speficied in horizontal axis.Let us concentrate on kernels adjustment for H1 datain the left panel of Fig. 8. From these results we havethat a CNN with 12 kernels give us more stable results byfar, because most of its mean accuracies lie in the small-est dispersion region –discarding outliers, half of meanaccuracies are concentrated in a tiny interquartile regionlocated near to 0 . per se (values from 0 . 800 to 0 . 898 are actually good), but ratherbecause, unlike other cases, the nearly zero skewness of its distribution is not prone to boost sample values be-yond third quartile as it is appeared. Configuration with8 kernels has a distribution mean very close to the set-ting with 24 kernels, and even, reaches two mean accu-racy values about 0 . . 
895 (and hence boxplots located toward relatively higher mean accuracy values), these last four configurations offer the best options. In the end, we decided to work with 32 kernels, because this setting groups a whole set of desirable features, including the highest mean of mean accuracies.

FIG. 8. Adjustment to find the optimal number of kernels in the CNN architecture. Both panels show boxplots for the distributions of mean accuracies as a function of the number of kernels, with the values of T_win and number of stacks found in the previous hyperparameter adjustments: T_win = 0.75 s with 3 stacks for H1 data (left panel) and T_win = 0.75 s with 2 stacks for L1 data (right panel). Based on the location of the mean accuracy samples, their dispersion, the presence of outliers, and the skewness of the distributions, CNN architectures with 32 kernels and 16 kernels are the optimal choices to reach the highest mean accuracy values when working with data from the H1 and L1 detectors, respectively.

Let us now move to the kernels adjustment for L1 data in the right panel of Fig. 8. Here the situation is easier to analyze, because the performance differences appear visually clearer than those for data from H1. Settings with 8, 20, and 28 kernels lead to mediocre performances, especially the first one, which has a high dispersion and 70% of its samples below the values reached by the other settings. Remember that if T_win = 0.50 s and we work with a CNN architecture of 2 stacks and 20 kernels in the convolution layers, the order of magnitude of the SGD fluctuations in accuracy is about 0.

C. Confusion matrices and standard metrics

In general, accuracy gives information about the probability of a successful classification, whether we are classifying a noise-alone sample (C1) or a noise-plus-GW sample (C2); that is to say, it is a performance metric with a multi-label focus.
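Incidentally, the resampled accuracy statistics used throughout these adjustments are simple to reproduce; a minimal Python sketch with hypothetical fold accuracies (placeholders, not values from our runs):

```python
def mean_accuracy_per_run(fold_accuracies):
    """Acc_CV for one run: the mean over the k fold accuracies of a k-fold CV."""
    return sum(fold_accuracies) / len(fold_accuracies)

def mean_of_means(runs):
    """Mean of mean accuracies among all runs, the quantity plotted in Fig. 7."""
    per_run = [mean_accuracy_per_run(r) for r in runs]
    return sum(per_run) / len(per_run)

# Two hypothetical runs of a 3-fold CV (placeholder values).
runs = [[0.88, 0.90, 0.89], [0.91, 0.90, 0.89]]
print(mean_of_means(runs))  # ~0.895
```

In a repeated 10-fold CV, `runs` would hold 10 lists of 10 fold accuracies each, and the boxplots of Figs. 7 and 8 would be drawn over the 10 per-run means.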
However, we would like to elucidate to what extent our CNNs are proficient at separately detecting samples of each class; it is then useful to introduce performance metrics with a single-label focus. A standard tool is the confusion matrix, which is shown in Fig. 9 for data from each detector. As we are under a resampling regime, each element of the confusion matrices is computed considering the entire amount of 100 N_test detections, which in turn result from concatenating all the prediction vectors of dimension N_test outputted by the 10 runs of the 10-fold CV.

FIG. 9. Confusion matrices computed from the testing; C1 is the label for noise only and C2 the label for noise plus GW injection. Our CNN has 32 kernels and 3 stacks when run with H1 data, and 16 kernels and 2 stacks when run with L1 data; the time resolution is T_win = 0.75 s. Working with H1 data, 8533/(8533 + 427) ≈ 0.952 of predictions for C1 are correct and 1413/(8533 + 1413) ≈ 0.142 of predictions for C2 are incorrect; working with L1 data, 6075/(6075 + 445) ≈ 0.932 of predictions for C1 are correct and 1842/(6075 + 1842) ≈ 0.233 of predictions for C2 are incorrect. These results show that our CNN classifies noise samples very precisely, at the cost of reaching a non-negligible number of false GW predictions.

A first glance at the confusion matrices shown in Fig. 9 reveals that our CNNs perform better at detecting noise-alone samples than at detecting noise-plus-GW samples, because the (C1,C1) element is greater than the (C2,C2) element for both matrices. Yet the number of successful predictions of noise plus GW is reasonably good, because it considerably surpasses a totally random performance, described by successful detections of the order of 50% of the total negative samples. Moreover, from Fig.
9 we have that, based on the wrong predictions, the CNNs are more likely to make a type II error than a type I error, because (C2,C1) > (C1,C2) for both confusion matrices. Thinking more carefully, this result leads to an advantage and a disadvantage. The advantage is that our CNN performs a "conservative" detection of noise-alone samples, in the sense that a sample will not be classified as belonging to class 1 unless the CNN is very sure; that is to say, the CNN is quite precise in detecting noise samples. Using H1 data, 8533/(8533 + 427) ≈ 0.952 of the samples predicted as C1 belong to this class, and using L1 data the fraction is 6075/(6075 + 445) ≈ 0.932. This is an important benefit if, for instance, we wanted to apply our CNNs to remove noise samples from a segment of data with a narrow margin of error, in addition to another detection algorithm and/or analysis focused on generating triggers. Nonetheless, the disadvantage is that a non-negligible number of noise samples are lost by wrongly classifying them as GW event samples. In terms of false negative rates, we have that 1413/(8533 + 1413) ≈ 0.142 of the actual noise samples are misclassified with H1 data, and 1842/(6075 + 1842) ≈ 0.233 of the actual noise samples are misclassified with L1 data. (We are using the notation (row, column) to represent each element of a confusion matrix.) This would be a serious problem if our CNN were implemented to decide whether an individual trigger is actually a GW signal and not a noise sample, either Gaussian or non-Gaussian noise.

Keep in mind that, according to statistical decision theory, there will always be a trade-off between type I and type II errors [78]. Hence, given our CNN architecture and datasets, it is not possible to reduce the value of the (C2,C1) element without increasing the value of the (C1,C2) element. In principle, keeping the total number of training samples, we could generalize the CNN architecture to a multi-label classification that further specifies the noise, including several kinds of glitches, as was implemented in works such as [47] and [51].
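The single-label rates quoted above follow directly from the pooled confusion-matrix counts; a minimal Python check using the H1 numbers of Fig. 9, with the (predicted, actual) convention of the text:

```python
# H1 confusion-matrix counts quoted in the text, keyed as (predicted, actual).
counts = {("C1", "C1"): 8533, ("C1", "C2"): 427, ("C2", "C1"): 1413}

# Precision for C1: fraction of samples predicted as noise that are truly noise.
prec_c1 = counts[("C1", "C1")] / (counts[("C1", "C1")] + counts[("C1", "C2")])

# Fraction of actual noise samples misclassified as GW events.
miss_c1 = counts[("C2", "C1")] / (counts[("C1", "C1")] + counts[("C2", "C1")])

print(round(prec_c1, 3), round(miss_c1, 3))  # 0.952 0.142
```

The same two lines reproduce the L1 rates of 0.932 and 0.233 when the L1 counts (6075, 445, 1842) are substituted.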
Indeed, starting from our current problem, such a multiclass generalization could be motivated by redistributing the current false negative counts (C2,C1) into new elements of a bigger confusion matrix, where several false positive predictions would be converted into new successful detections located along a new, longer diagonal. Nonetheless, it is not clear how to keep constant the bottom edge of the diagonal of the original binary confusion matrix when the number of noise classes is increased; not to mention that this approach can actually be seen as a totally different problem rather than a generalization.

With regard to the misclassified GW event samples, despite being far fewer than the misclassified noise samples, we would like to understand more about them. We therefore decided to study the ability of the CNN to detect GW events depending on the values of their expected SNR, values that are provided with the LIGO hardware injections. Results are shown in Fig. 10: the upper panel with data from the H1 detector, and the lower panel with data from the L1 detector. Both panels include a blue histogram for the actual (injected) GW events that come from the testing set, a gray histogram with the GW events detected by the CNN, and the bin-by-bin discrepancy between both histograms as scatter points. As a first approach, we defined this bin-by-bin discrepancy as the relative error:

Rel Err(i) = [N_det(i) − N_test(i)] / N_det(i),   (18)

where N_det and N_test are the detected GW count and the injected GW count, respectively, and the index i represents a bin. Here we set 29 same-length bins for both histograms, starting from a lower edge of SNR = 9.; both histograms consider all 100 N_test predictions given our resampling regime.

By comparing most bins that appear in both panels of Fig. 10, we have that the detected GW count is greater the more actual injections there are in the testing set. Besides, most GW events are concentrated in a region of smaller SNR values. For H1 data, most events are in the first six bins, namely from SNR = 9.
9; with 6520 actual GW events and 5191 detected GW events, representing approximately 72.77% and 68.78% of the total number of actual GW events and detected GW events, respectively. For L1 data, on the other hand, most events are in the first seven bins, from SNR = 10. to .6; with 5040 actual GW events and 3277 detected GW events, representing approximately 77.30% and 70.05% of the total number of actual GW events and detected GW events, respectively.

The above information about counts is relevant, but the most important results come from the relative errors. From these we observe, in both panels, a clear trend of detecting a greater percentage of the actual GW events as those events have greater SNR values. Besides, if we focus on the upper panel of Fig. 10, corresponding to H1 data, we have that the GW count relative errors are the greatest in the first four bins; from SNR ≥ .80 onwards, the relative errors stochastically approach zero, and indeed a relative error exactly equal to zero is reached in 10 of the 29 bins. For L1 data, shown in the lower panel, we observe a similar behaviour of the relative error: the greatest relative errors appear in the first six bins and, from SNR ≥ .80 onwards, the relative errors stochastically approach zero. Indeed, here the relative error exactly equal to zero is reached for the first time at a smaller SNR value than with H1 data, although once zero values begin to appear, relative errors further away from zero than with H1 data also appear. This last result is statistically consistent with the fact that, according to Fig. 9, the negative predictive value, NPV = (C2,C2)/[(C2,C1) + (C2,C2)], is smaller in the confusion matrix for H1 data (NPV ≈ 0.842) than for L1 data. Recall that we did not set the SNR values of the injections by hand, and this is desirable because, in real experimental conditions, the SNR values of GW signals depend solely on the nature of the astrophysical sources and the conditions of the detectors, aspects that obviously are not modifiable during an observation run.
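The bin-by-bin discrepancy of Eq. (18) is simple to compute from the two histograms; a Python sketch with hypothetical bin counts (not the counts of Fig. 10):

```python
def rel_err(n_det, n_test):
    """Eq. (18): [N_det(i) - N_test(i)] / N_det(i) for each bin i.
    Bins with N_det(i) = 0 would need special handling under this definition."""
    return [(d - t) / d for d, t in zip(n_det, n_test)]

# Hypothetical detected and injected counts for four SNR bins.
print(rel_err([80, 90, 100, 50], [100, 99, 100, 50]))  # [-0.25, -0.1, 0.0, 0.0]
```

Negative values mean fewer detections than injections in a bin, which is the behavior seen at low SNR in both panels.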
If the CNN is able to deal with this limiting scenario beforehand, it does not learn more than what is strictly necessary, avoiding overoptimistic results or, even worse, underperformance. Indeed, because of this aspect we can transparently conclude that our CNN per se is more sensitive to stochastically detect GW signals when SNR ≥ .80 for H1 data and SNR ≥ .80 for L1 data.

Continuing with our analysis, Table III shows a summary of several metrics that we previously defined in Table II; again, these metrics were computed by counting the entire amount of 100 N_test predictions, given that we repeated the 10-fold CV experiment 10 times. From the table, working with L1 data, we observe that the mean recall is 76.8% and the mean precision is 93.2%; that is, most samples predicted as noise are actually noise alone. For H1 data the results are even better, with a mean precision of 95.2% and a lower mean fall-out.

(In addition to the SNR values, the frequency of occurrence of GW events also represents an important challenge to generate a more realistic dataset emulating a record of astrophysical data, even though this leads us to work with highly imbalanced datasets. Furthermore, for a more realistic situation, we could internally describe each bin of the histograms in Fig. 10 as a random sampling in which the counts themselves take random values following a distribution; indeed, this hypothesis is usually assumed to perform systematic statistical comparisons between two histograms.)

FIG. 10. Histograms for counting GW samples present in the test set and GW samples detected by the CNN.
This count was made from all 100 N_test predictions, because we have 10 × 10 = 100 SGD learning-testing runs. Besides, the relative error between both histograms is shown as scatter points, with their respective standard deviations computed from the 10 repetitions of the 10-fold CV routine. From the plots we have that our CNN detects more CBC GW events insofar as they have SNR ≥ .80 for H1 data (upper panel) and SNR ≥ .80 for L1 data (bottom panel).

These observations are consistent with the differences between the mean recall and the mean precision. Besides, although fall-out plus precision is theoretically exactly 1, here we are considering means among several stochastic realizations of these metrics, so the summation slightly differs from 1 for both H1 and L1 data.

The dispersion of the metrics is also shown in Table III. For data from a given detector (H1 or L1), we observe that the standard deviations of accuracy, precision, recall, and fall-out are of the same order of magnitude. This is expected because these metrics were computed directly from the same resampling of data predictions. (For an N-label classification with imbalanced datasets and N > 2, the number of quantities needed to detail the model performance is quite larger; hence the need to draw on metrics such as the F1 score and G mean1.)

TABLE III. Standard metrics computed from the testing, with T_win = 0.75 s.

Standard metrics with H1 data
Metric     Mean   Min    Max    SD
Accuracy   0.897  0.888  0.907  …
Precision  0.952  0.941  0.968  …
Recall     0.858  0.844  0.873  …
Fall-out   …      …      …      …
F1 score   0.903  0.894  0.912  …
G mean1    0.213  0.179  0.237  …

Standard metrics with L1 data
Metric     Mean   Min    Max    SD
Accuracy   0.825  0.780  0.853  …
Precision  0.932  0.905  0.956  …
Recall     0.768  0.724  0.793  …
Fall-out   …      …      0.127  …
F1 score   0.842  0.804  0.866  …
G mean1    0.256  0.211  0.303  …

For H1 or L1 data, we have that the standard deviation of the F1 score is also of the same order of magnitude as that of the other metrics. However, with G mean1 we observe a slightly smaller dispersion with L1 data than with H1 data, which is consistent with the little improvement reported in the mean of the G mean1.
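The F1 entry of Table III can be cross-checked from the mean precision and recall; a one-line Python computation assuming the standard harmonic-mean definition of F1:

```python
def f1(precision, recall):
    """Standard F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Mean precision and recall for H1 data from Table III.
print(round(f1(0.952, 0.858), 3))  # 0.903, matching the F1 mean reported for H1
```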
Anyway, this reported improvement is actually marginal, because all the other metrics report a better performance of our CNN when working with data from the H1 detector. In the next subsection we give more reasons to reach this conclusion.

D. ROC comparative analyses

As mentioned in subsection II E, all the performance metrics shown in Table II depend on a chosen fixed threshold for assigning a class to each image sample. Until now, the previous analyses used a threshold of 0.5. To feed the NB and SVM models, each image X_i ∈ R^{N_time × N_freq} is flattened into a vector x_i ∈ R^{N_time N_freq × 1}, such that all the columns of X_i, from the first to the last, are concatenated into one big single column. For the NB model, we assume that our train set follows a Gaussian distribution, with mean and variance obtained from maximum likelihood estimation (MLE). For the SVM model, on the other hand, we applied a normalization to each component of x_i along all training samples, and we used a linear kernel.

Keep in mind that there is no definitive criterion to generate ROC curves under the resampling regime. Then, following the same approach taken in subsection III C for computing the confusion matrices, we considered the whole set of 100 N_test predictions made by our 10 × 10 = 100 learning-testing processes. In practice, this approach avoids averaging point-by-point and helps to smooth the ROC curves by increasing their number of discrete generative steps. For all ROC curves we set T_win = 0.75 s for the strain samples and, for the ROC curves describing the performance of our CNN, we used the same hyperparameter adjustments for stacks and kernels that we found to be the best in subsection III B.

The results of our comparative analyses are shown in Fig. 11. Notice that, in both panels, all the ROC curves are in general quite distant from the 45-degree diagonal of totally random performance, which is fairly good. Even so, depending on the dataset used, they have different performances.
When the models learn (and are tested) with H1 data, their performances are better than with L1 data. Now, if we focus separately on each panel, we have that, for almost all thresholds, the CNN model has the best performance, the NB model the worst performance, and the SVM model is in the middle. However, as shown in the zoomed plots, the ROC curves in both panels have some peculiarities: in the left zoomed plot, we observe that our CNN has the best performance only until its ROC curve reaches a certain (fall-out, recall) point close to (1, 1).

FIG. 11. ROC curves for the CNNs, an SVM classifier with a linear kernel, and a Gaussian NB classifier, using data from the H1 detector (left panel) and the L1 detector (right panel). For each ROC curve, its optimal operating point (OOP) is also shown. We set T_win = 0.75 s and the same CNN hyperparameter adjustments used in the previous studies. The general trend, for almost all thresholds, is that our CNN has the best performance, followed by the SVM classifier, and finally by the NB classifier. Zoomed plots show small changes in the performances near the point (1, 1).

Notice that, on all the ROC curves, a specific point has been highlighted. This is called the "optimal operating point" (OOP) and corresponds to the particular optimal threshold (OT) at which a classifier has the best trade-off between the cost of missing noise-alone samples, cost(FN), and the cost of raising false noise-alone detections, cost(FP). In the ROC space, this trade-off is defined by isocost lines of constant expected cost:

cost_exp = P [1 − Recall] cost(FN) + N [Fall-out] cost(FP),   (19)

where P = TP + FN and N = FP + TN. Assuming, as a first approach, that cost(FN) = cost(FP) = 0.5, the OOP of each curve is the point touched by the isocost line closest to the corner (0, 1) of the ROC plot.
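A ROC curve can also be summarized by the area under it, computed with a trapezoidal approximation; a minimal Python sketch over hypothetical (fall-out, recall) points:

```python
def auc_trapezoid(roc_points):
    """Area under a ROC curve via the trapezoidal rule.
    roc_points: (fall_out, recall) pairs sorted by increasing fall-out,
    including the endpoints (0, 0) and (1, 1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

print(auc_trapezoid([(0, 0), (1, 1)]))          # 0.5 (random classifier)
print(auc_trapezoid([(0, 0), (0, 1), (1, 1)]))  # 1.0 (perfect classifier)
```

In practice, the points would be the pooled (fall-out, recall) pairs swept over all thresholds.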
For each ROC curve, the OOP, OT, and expected cost are included in Table IV. Notice from this table that, for the CNN classifier, the OT with H1 data is not equal to the exact fifty-fifty chance value of 0.5. (If cost(FN) and cost(FP) were different and/or the dataset were imbalanced with respect to classes C1 and C2, the OOP and OT would be near one of their extremes. In that situation, the ROC analysis would be more sensitive to statistical fluctuations, making it difficult to take statistically significant decisions with respect to the class with many fewer detections and/or samples. This situation would require more data to deal with the imbalance, or alternative analyses such as precision-recall curves or cost curves.) Besides, the obtained expected costs define isocost lines close to the point (0, 1) of the ROC space.

In general, the relative performance between models can change depending on whether their ROC curves intersect. Because of this, we would like to have a metric that summarizes, regardless of thresholds, the performance of a model in a single scalar. Here we used the total area under the ROC curve (AUC) [79]; this is a standard metric that gives the performance of the CNN averaged over all possible trade-offs between TP predictions and FP predictions. Moreover, we can use this metric to make a final choice among all the models: the best model corresponds to the highest AUC value. In practice, we computed this metric by a trapezoidal approximation, and the results are also included in Table IV. We have that, for both datasets, AUC_NB < AUC_SVM < AUC_CNN, allowing us to conclude that, among the three models, the CNN definitely has the best performance, followed by the SVM classifier, and finally by the NB classifier.

E. Shuffling and output scoring

As mentioned in subsection II G, two related analyses were performed to ensure that the results are statistically significant.
The first analysis was to run our CNN algorithm including a shuffling of the training samples before each training, with the peculiarity of removing the links between each sample and its known label.

TABLE IV. Optimal operating point (OOP), optimal threshold (OT), optimal expected cost, and AUC for each ROC curve of Fig. 11.

Data  Model  OOP         OT     Optimal cost_exp  AUC
H1    CNN    (0.…, 0.…)  0.430  0.…               0.…
H1    SVM    (0.…, 0.…)  0.228  0.…               0.…
H1    NB     (0.…, 0.…)  0.379  0.143             0.…
L1    CNN    (0.…, 0.…)  0.472  0.…               0.…
L1    SVM    (0.…, 0.…)  0.418  0.157             0.…
L1    NB     (0.…, 0.…)  0.920  0.175             0.…

A comparison of the distributions of the mean accuracies along all the runs of the 10-fold CV experiment, with and without shuffling, is shown in Fig. 12; remember that each point of the boxplots, i.e. the i-th mean accuracy Acc^i_CV as defined in subsection III B, comes from the i-th run of the whole 10-fold CV. From this plot we have that shuffling radically affects the results. Whether we work with data from the H1 detector or the L1 detector, if shuffling is present the distribution of mean accuracies moves towards lower values and increases its dispersion. With H1 data, the mean of mean accuracies decreases from 0.897 to 0.494 and the standard deviation increases from 0.564 to 1.34; with L1 data, the mean of mean accuracies falls from 0.825 to 0.…, and the standard deviation increases from 0.98 to 2.…

FIG. 12. Shuffling and paired-sample t-test for elucidating whether the predictions of our CNN are statistically significant. Boxplots of the mean accuracies resulting from each run of the 10-fold CV are shown for H1 and L1 data, with and without shuffling. By shuffling, we mean that the training set is shuffled before each training such that the link between each sample and its known class is broken. Besides, the p-values of the mentioned paired-sample t-test are included, p-value_H1 = 2.7482e-14 and p-value_L1 = 5.6119e-10, showing that the predictions of the CNN are significantly different from a totally random process, with a significance level of 5%.

Moreover, depending on whether shuffling is absent or present, the boxplot of mean accuracies has positive or negative skewness, respectively.
This makes sense because, without shuffling, the higher the mean accuracies, the greater the effort of the CNN to reach performances with those accuracies; there is no free lunch, and we expect a higher concentration of samples below the median than above it and, therefore, a positive skewness. On the other hand, if shuffling is present, we know from the basics of probability that adding new points, one by one, to the mean accuracy distribution is actually a stochastically symmetric process around 0.5 along the k-fold CV experiment. Then, given that here we obtained medians slightly below 0.5 (0.499 with H1 data and 0.492 with L1 data), it is expected that there is a higher concentration of points above the medians, and hence boxplots with negative skewness, because this works as a balance to maintain the symmetry of the stochastic occurrences (i.e. the boxplot points) around the 0.5 value.

The second analysis was a paired-sample t-test between the two datasets of mean accuracies,

D = {Acc^i_CV}_{i=1,...,N_D},  D_shuff = {Acc^i_CV,shuff}_{i=1,...,N_D},   (20)

without and with shuffling, respectively. Then, with the means of each dataset at hand, μ and μ_shuff, the task is to test the null hypothesis H0: μ − μ_shuff = 0 by computing the test statistic

t = (μ − μ_shuff) / (σ / √N_D),   (21)

where σ is the standard deviation of the paired differences, and the p-value then follows from the t distribution. Assuming a significance level α = 0.05 (a standard similarity threshold between D and D_shuff), we have that: i) if p > α = 0.05, we accept H0, or ii) if p < α = 0.05, we reject H0. The resulting p-values are shown in Fig. 12, namely 2.7482 × 10⁻¹⁴ with H1 data and 5.6119 × 10⁻¹⁰ with L1 data. These values are much less than α = 0.05; hence we reject the null hypothesis and conclude that, at a significance level of 5% (or a confidence level of 95%), the distribution D is significantly different from D_shuff. This is actually a quite good result.

As a final analysis, we focus on the output scoring of the CNNs. As explained in subsection II D, our CNNs output scores that are probabilities generated by the softmax layer.
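Before moving to the output scores, note that the statistic of Eq. (21) takes only a few lines to implement; a pure-Python sketch with hypothetical per-run mean accuracies (not the values of Fig. 12):

```python
import math

def paired_t_statistic(d, d_shuff):
    """t = mean(diff) / (sd(diff) / sqrt(N)) for paired differences diff = d - d_shuff."""
    diff = [a - b for a, b in zip(d, d_shuff)]
    n = len(diff)
    mean = sum(diff) / n
    var = sum((x - mean) ** 2 for x in diff) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n)

# Hypothetical mean accuracies for 4 runs, without and with shuffling.
t = paired_t_statistic([0.90, 0.91, 0.89, 0.90], [0.50, 0.49, 0.51, 0.50])
print(round(t, 2))  # a large |t| gives p << 0.05, so H0 is rejected
```

In practice, d and d_shuff would be the 10 per-run mean accuracies of D and D_shuff, and the p-value would be read from the t distribution with N_D − 1 degrees of freedom.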
After the training, these probabilities are defined by our classes, c1 (noise alone) and c2 (noise plus GW), conditioned on the model parameters in the vector θ once they have already been learned; namely, y_j(θ) = P(c_j | θ) (with j = 1, 2) for each input image sample. Histograms describing the distribution of these probabilities, considering all our 100 N_test predictions, are included in Fig. 13; all the histograms were made using 28 same-length bins. Here we have important results.

Firstly, in both panels of Fig. 13 we have that the distribution for y1 and the distribution for y2 are multimodal, and each one has three different modes or maxima. In addition, we observe that both distributions are asymmetric. Given a multimodal distribution, there is no univocal definition of its center: it can be its mean, its median, its center of probability mass, among others. Here we decided to define the center of the distribution as the optimal threshold (OT), because this metric is directly related to our decision criterion for assigning a class to the output score. The closer to the OT a probabilistic occurrence is located, the greater the uncertainty in deciding which class the CNN is actually predicting with that probability. The OT values were already computed and presented in Table IV, and we include them in the panels of Fig. 13 as dashed lines.

For the y1 probability, we have that on the left-hand side of the OT there is a low concentration of occurrences except for the left edge bin, which contains 0.31 of all the occurrences counted along all bins if we work with H1 data, and 0.21 of all the occurrences with L1 data. In contrast, on the right-hand side of the OT we have more dispersed occurrences around the two remaining modes; with H1 data, one of these remaining modes is located at the right edge bin. The distribution of y2 is similar to that of y1 except that it is inverted along the horizontal axis: the highest mode is at the right edge bin, the two remaining modes are on the left-hand side of the OT, and so on.
For y2, the fraction of counted occurrences in the right edge bin is also the same: 0.31 with H1 data and 0.21 with L1 data.

The above results actually mean that, given our datasets, our CNNs are more optimistic predicting GW samples than predicting noise-alone samples or, equivalently, more pessimistic predicting noise-alone samples than predicting GW events. Hence, even though our input datasets are exactly class-balanced, the predictive behavior of our CNNs is highly class dependent. Under a frequentist approach, the asymmetric shape of the distributions for y1 and y2 is the statistical reason why the CNNs have a high precision and a non-negligible false negative rate or, said more simply, why the CNNs are more "conservative" classifying samples as noise alone than classifying them as GW events; and this is coherent with what we interpreted from the confusion matrices shown in Fig. 9.

It is also important to notice how the distribution of occurrences, either for y1 or y2, changes depending on the data. Remember that, from the previous ROC comparative analyses, we found that working with H1 data reaches a better performance than working with L1 data, and here we also observe this improvement from another point of view, because the more uncertainty our CNNs have for predicting a specific class, the more the occurrences are concentrated (i.e. skewed) towards the OT. Furthermore, if our network had not learned anything, e.g. because of a shuffling such as the one we applied previously, then we would have all the probabilities distributed as a Gaussian centered at the default threshold, 0.5, i.e. a totally random performance; although it is not explicitly included here, we checked these random values with our code and visualizations.

IV. CONCLUSIONS

In this work, we applied CNN algorithms for the detection of GW signals from CBCs, through a binary classification with samples of non-Gaussian noise only and samples of non-Gaussian noise plus GW injections.
As a crucial part of the data pre-processing, we applied a Morlet wavelet transform to convert our time series vectors (i.e. strain data) into time-frequency matrices (i.e. image data), and the resulting images were automatically decoded in the convolutional stacks of our CNNs. Besides, the images in the time-frequency representation were reduced in such a way that all our CNNs could easily be run on a single local CPU, reaching good performance results.

The significant novel contribution of our work is the adoption of a resampling white-box approach, motivated by the need to advance towards a statistically informed understanding of the uncertainties involved in CNNs. Moreover, as a manner of reproducing a more realistic experimental setting for testing our CNN algorithms, we drew on single-interferometer data from the LIGO S6 run, considering only raw strain data with noise and hardware injections of GW signals; that is to say, we removed the instrumental freedom of generating distributions of simulated GW signals as intense as one wants.

Because of the stochasticity introduced by the repeated 10-fold CV experiments, almost all tasks were more complex than in a simple deterministic fashion, but this also forced us to acquire useful tools to overcome too optimistic or too pessimistic evaluations of the CNN algorithms. The hyperparameter adjustments required careful interpretations of the mean accuracy distributions. We tested several CNN architectures and found two that achieve optimal performances, one with 3 stacks and 32 kernels for data from the H1 detector, and the other with 2 stacks and 16 kernels for data from the L1 detector, in both cases with a resolution of T_win = 0.75 s. In addition, we found that resampling smooths the stochasticity of mini-batch SGD, reducing the mean accuracy perturbations by a factor of 3.6.
This result serves as a recommendation for future works to run CNNs in the resampling regime.

FIG. 13. Histograms of the CNN output probabilistic scores, y1 = P(c1|θ) and y2 = P(c2|θ), working with data from the H1 detector (left panel) and from the L1 detector (right panel); the optimal threshold is marked as a dashed line. As can be observed from the skewness towards the edge bins, the CNNs are more optimistic predicting GW samples than predicting noise-alone samples. Besides, with H1 data there is less uncertainty than with L1 data, because with L1 data there is more population of occurrences in the middle bins. Here we counted all 100 N_test predictions.

From the analyses of confusion matrices and standard metrics, we found that, whether working with H1 data or L1 data, the CNN algorithms are quite precise to detect noise samples but not sensitive enough to recover GW signals. These results mean that, in general, CNN algorithms are better suited for noise reduction than for the generation of GW triggers. However, these conclusions are not totally definitive because, if we consider only GW events of SNR ≥ .80 with H1 data and SNR ≥ .80 with L1 data, CNNs could remain as tentative alternatives for detecting GW signals. We say tentative because the mentioned conditions for the SNR values actually depend on the nature of the initial datasets inputted to the CNNs which, in our case, are still class-balanced. At this point, it is evident that more research is necessary to begin dealing with arbitrarily class-imbalanced data.

With ROC curves, we also compared the CNNs with two other classic ML models, NB and SVM, finding that the CNNs have much better performances than the mentioned classic ML models, a result that is consistent with what has been reported in recent years in most works on DL applied to GW detection. From the ROC analyses we also found the optimal thresholds, which are very useful parameters to establish a statistical decision criterion for the classification, namely 0.
430 for H1 data, and 0.… for L1 data.

V. ACKNOWLEDGMENTS

M.D.M and A.I.N. acknowledge the support of PRODEP, 511-6/ . This research has made use of data from a service of the LIGO Laboratory, the LIGO Scientific Collaboration and the Virgo Collaboration. LIGO is funded by the U.S. National Science Foundation. Virgo is funded by the French Centre National de Recherche Scientifique (CNRS), the Italian Istituto Nazionale della Fisica Nucleare (INFN) and the Dutch Nikhef, with contributions by Polish and Hungarian institutes.

Appendix A: Naive Bayes classifiers

Naive Bayes (NB) is a family of classifiers based on the well-known Bayes theorem and the naive assumption that all the observed data are independent of each other and identically distributed (i.i.d.). Considering a binary classification, we have that, given an i-th input feature vector x_i = (x_i1, x_i2, ..., x_id), a NB model outputs two posterior categorical probabilities, namely:

y_ji(x_i) = P_i(c_j | x_i), with j = 1, 2.   (A1)

Then, the class with the highest output probability is chosen as the prediction for the input vector x_i, based on the maximum a posteriori (MAP) estimation as the decision criterion. Under the hood, NB classifiers output posterior probabilities that, according to Bayes theorem, are:

P_i(c_j | x_i) ∝ P_i(x_i | c_j) P(c_j),   (A2)

where we can ignore the marginal normalization constant, i.e. the denominator that would convert the proportionality into an equality, because the class predictions are made regardless of it. Now, to compute the probabilities appearing on the right-hand side of Eq. A2, we need three ingredients. Firstly, the likelihood function which, assuming that our features are independent and i.i.d., is computed by:

P_i(x_i | c_j) = ∏_{k=1}^{d} P_i(x_ik | c_j).
Secondly, the prior marginal probability P(c_j), which is computed as the relative frequency of class c_j in the train set. Finally, the values of the conditional probabilities P_i(x_{ik} | c_j) depend on our assumption about how the features are distributed. Indeed, this last assumption is an important aspect, because it defines what kind of NB classifier we will be working with.

For our study, we chose P_i(x_{ik} | c_j) as a Gaussian distribution, whose mean and variance (i.e. our model parameters) are computed from the train set through maximum likelihood estimation (MLE). Under a frequentist approach, MLE is the standard procedure of maximizing the likelihood function or, equivalently, of minimizing the negative log-likelihood as the cost function. In practice, we implemented our Gaussian NB classifier with the MATLAB Statistics and Machine Learning Toolbox [80]. Given the dimension of the images in the time-frequency representation, and keeping consistency with our notation, d = N_time N_freq.

Appendix B: Support Vector Machine classifiers

Support Vector Machines (SVM) are a family of regressors and binary classifiers based on an optimization and geometrical framework. For a SVM classifier, we start with an i-th input feature vector x_i = (x_{i1}, x_{i2}, ..., x_{id}). Next, assuming that the data are not linearly separable, we apply a transformation φ in order to map our data into a feature space of d̃ > d dimensions, leading to the new feature vector φ(x_i) = (φ_{i1}, φ_{i2}, ..., φ_{id̃}). If our dataset is linearly separable in this higher-dimensional space, then the binary SVM classifier outputs a (non-probabilistic) score given by the function:

f(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b, \tag{B1}

where w = (w_1, w_2, ..., w_{d̃}) is a weight vector and b a bias term, both to be learned. As a decision criterion, if f(x_i) > 0 the predicted class is 1 (positive), and if f(x_i) < 0 the predicted class is 2 (negative).

Beyond scoring, the function f(x) has a geometrical interpretation.
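Before turning to the geometry, the scoring and decision rule can be sketched in a few lines of Python. This is only an illustrative sketch, not the MATLAB implementation used in this work; the weight vector and bias below are hypothetical toy values, and φ is taken as the identity map for simplicity:

```python
def svm_score(x, w, b):
    """Non-probabilistic SVM score f(x) = w^T x + b (Eq. B1, with phi
    taken as the identity map for simplicity)."""
    return sum(wk * xk for wk, xk in zip(w, x)) + b

def svm_predict(x, w, b):
    """Decision criterion: class 1 (positive) if f(x) > 0, else class 2 (negative)."""
    return 1 if svm_score(x, w, b) > 0 else 2

# Toy learned parameters; these values are hypothetical, for illustration only
w = [0.5, -1.0]
b = 0.25

print(svm_score([2.0, 0.5], w, b))     # f = 0.75 > 0, so class 1
print(svm_predict([-2.0, 1.0], w, b))  # f = -1.75 < 0, so class 2
```

In a trained classifier, w and b would instead be the solution of the optimization problem formulated next.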
As illustrated in Fig. 14, f(x) = 0 is a hyperplane that separates samples of one class from those of the other class, with the decision boundary located along the intersection line of the hyperplane f(x) = 0 with the higher-dimensional feature hypersurface (not necessarily a hyperplane) defined by the transformation φ. The decision boundary is specifically formulated such that it has a margin defined by the support vectors (i.e. the closest samples to the decision boundary) and by the intersections of the hyperplanes f(x) = ±1 with that hypersurface. Then, learning the parameters w and b can be expressed as the objective (with N the number of training instances):

\min_{\mathbf{w}, b, \boldsymbol{\zeta}} \left( \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{N} \zeta_i \right), \tag{B2}

such that

t_i f(\mathbf{x}_i) \geq 1 - \zeta_i, \quad i = 1, 2, \ldots, N, \tag{B3}

where ζ_i ≥ 0 are slack variables measuring how much the i-th instance is allowed to violate the margin, C > 0 is a hyperparameter that balances minimization of w^T w (to increase the margin as much as possible) against minimization of Σ ζ_i (to reduce margin violations as much as possible), and t_i = +1 or −1 depending on whether the i-th instance belongs to class 1 or class 2, respectively. Eqs. B2 (objective) and B3 (constraints), taken together, denote a convex and linearly constrained optimization problem, which is a particular case of a Quadratic Programming (QP) problem.

FIG. 14. Geometric illustration of a SVM classifier. After applying a transformation φ to the input feature vectors x_i, the dispersion of the data samples occurs in a higher-dimensional hypersurface defined by φ_{i1} and φ_{i2}. The decision boundary is given by the intersection of the hyperplane f(x) = 0 with the mentioned hypersurface. Besides, this intersection has a margin that is defined by the support vectors and the intersection lines of the hyperplanes f(x) = ±1.

Notice that, because of the constraint given by Eq. B3, computing φ(x_i) could be resource intensive, and even computationally prohibitive, if we have many features. However, this can be dodged with the kernel trick.
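The computational payoff of the kernel trick can be previewed with a small numerical check. The example below uses a degree-2 polynomial kernel (a standard textbook case, not the linear kernel adopted in this work): evaluating K(x, z) = (1 + x^T z)^2 directly in the original 2-D space agrees with the dot product of an explicit 6-dimensional feature map, without ever constructing that higher-dimensional space.

```python
import math

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2-D input
    (textbook construction, shown only to verify the kernel identity)."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """K(x, z) = (1 + x^T z)^2, evaluated directly in the original 2-D space."""
    return (1.0 + dot(x, z)) ** 2

x = [1.0, 2.0]
z = [3.0, -1.0]

print(poly_kernel(x, z))    # 4.0
print(dot(phi(x), phi(z)))  # same value up to floating-point rounding, no 6-D mapping needed
```

With d features, the explicit quadratic map has O(d^2) components, while the kernel evaluation stays O(d), which is precisely why the dual formulation below works with K instead of φ.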
Because of the Representer Theorem [81], w can always be rewritten as a linear combination w = Σ_{i=1}^{N} α_i t_i φ(x_i). Then, Eqs. B2 and B3 can be expressed as:

\min_{\boldsymbol{\alpha}} \left( \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j t_i t_j \, \boldsymbol{\phi}(\mathbf{x}_i)^T \boldsymbol{\phi}(\mathbf{x}_j) - \sum_{i=1}^{N} \alpha_i \right), \tag{B4}

such that

0 \leq \alpha_i \leq C, \quad i = 1, 2, \ldots, N, \tag{B5}

which is the dual form of our linearly constrained objective, and also a QP problem. Moreover, we notice that the non-linear transformation appearing as a dot product in Eq. B4 can always be rewritten as:

\boldsymbol{\phi}(\mathbf{x}_i)^T \boldsymbol{\phi}(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j), \tag{B6}

where the function K is called the kernel. This kernel depends on the original features and can take several forms depending on the problem we are dealing with. For this research, we implemented a linear kernel, K(x_i, x_j) = x_i^T x_j, which is the default option for a binary SVM classifier.

In general, there are several techniques to solve QP problems, both in their primal and dual forms. A detailed exploration of these is beyond the scope of this paper. However, it is worth mentioning that in our SVM classifier we implemented a Sequential Minimal Optimization (SMO) routine with the MATLAB Statistics and Machine Learning Toolbox [80]. For more details about SMO, see reference [82].

[1] B. P. Abbott et al. Observation of Gravitational Waves from a Binary Black Hole Merger. Phys. Rev. Lett., 116:061102, 2016.
[2] B. P. Abbott et al. GWTC-1: A Gravitational-Wave Transient Catalog of Compact Binary Mergers Observed by LIGO and Virgo during the First and Second Observing Runs. Phys. Rev. X, 9:031040, 2019.
[3] B. P. Abbott et al. GW170817: Observation of Gravitational Waves from a Binary Neutron Star Inspiral. Phys. Rev. Lett., 119:161101, 2017.
[4] B. P. Abbott et al. Multi-messenger Observations of a Binary Neutron Star Merger. ApJ, 848:L12, 2017.
[5] R. Abbott et al. GW190412: Observation of a Binary-Black-Hole Coalescence with Asymmetric Masses. arXiv:2004.08342, 2020.
[6] R. Abbott et al.
GW190814: Gravitational Waves from the Coalescence of a 23 Solar Mass Black Hole with a 2.6 Solar Mass Compact Object. ApJ, 896:L44, 2020.
[7] B. P. Abbott et al. Characterization of transient noise in advanced LIGO relevant to gravitational wave signal GW150914. Class. Quantum Grav., 33:134001, 2016.
[8] B. Allen, W. G. Anderson, P. R. Brady, D. A. Brown, and J. D. E. Creighton. FINDCHIRP: An algorithm for detection of gravitational waves from inspiraling compact binaries. Phys. Rev. D, 85:122006, 2012.
[9] S. Babak et al. Searching for gravitational waves from binary coalescence. Phys. Rev. D, 87:024033, 2013.
[10] G. L. Turin. An introduction to matched filters. IRE Transactions on Information Theory, 6:311–329, 1960.
[11] J. M. Antelis and C. Moreno. Obtaining gravitational waves from inspiral binary systems using LIGO data. Eur. Phys. J. Plus, 132:10, 2017.
[12] S. A. Usman et al. The PyCBC search for gravitational waves from compact binary coalescence. Class. Quantum Grav., 33:215004, 2016.
[13] A. H. Nitz, T. D. Canton, D. Davis, and S. Reyes. Rapid detection of gravitational waves from compact binary mergers with PyCBC Live. Phys. Rev. D, 98:024050, 2018.
[14] E. Cuoco, G. Cella, and G. M. Guidi. Whitening of non-stationary noise from gravitational wave detectors. Class. Quantum Grav., 21:S801–6, 2004.
[15] B. Allen. A chi-squared time-frequency discriminator for gravitational wave detection. Phys. Rev. D, 71:062001, 2005.
[16] B. P. Abbott et al. All-sky search for long-duration gravitational-wave transients in the second Advanced LIGO observing run. Phys. Rev. D, 99:104033, 2019.
[17] J. M. Antelis and C. Moreno. An independent search of gravitational waves in the first observation run of advanced LIGO using cross-correlation. Gen. Relativ. Gravit., 51:61, 2019.
[18] I. Harry, S. Privitera, A. Bohé, and A. Buonanno. Searching for Gravitational Waves from Compact Binaries with Precessing Spins. Phys. Rev. D, 94:024012, 2016.
[19] E. A. Huerta et al.
Complete waveform model for compact binaries on eccentric orbits. Phys. Rev. D, 95:024038, 2017.
[20] J. E. Thompson, E. Fauchon-Jones, S. Khan, E. Nitoglia, F. Pannarale, T. Dietrich, and M. Hannam. Modeling the gravitational wave signature of neutron star black hole coalescences: PhenomNSBH. Phys. Rev. D, 101:124059, 2020.
[21] B. P. Abbott et al. All-sky search for short gravitational-wave bursts in the second Advanced LIGO and Advanced Virgo run. Phys. Rev. D, 100:024017, 2019.
[22] C. L. Fryer and K. C. B. New. Gravitational waves from gravitational collapse. Living Rev. Rel., 14:1, 2011.
[23] N. Andersson and G. L. Comer. Probing neutron-star superfluidity with gravitational-wave data. Phys. Rev. Lett., 87:241101, 2001.
[24] L. Baiotti, I. Hawke, and L. Rezzolla. On the gravitational radiation from the collapse of neutron stars to rotating black holes. Class. Quantum Grav., 24:S187–S206, 2007.
[25] S. Klimenko, I. Yakushin, A. Mercer, and G. Mitselmakher. Coherent method for detection of gravitational wave bursts. Class. Quantum Grav., 25:114029, 2008.
[26] S. Klimenko et al. Method for detection and reconstruction of gravitational wave transients with networks of advanced detectors. Phys. Rev. D, 93:042004, 2016.
[27] Ch. M. Bishop. Pattern Recognition and Machine Learning. Springer, Singapore, 2006.
[28] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, Massachusetts; London, England, 2016.
[29] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor (Eds.). Recommender Systems Handbook. Springer, New York, 2011.
[30] X. Zhou, S. Cheng, M. Zhu, C. Guo, S. Zhou, P. Xu, Z. Xue, and W. Zhang. A state of the art survey of data mining-based fraud detection and credit scoring. MATEC Web of Conferences, 189:03002, 2018.
[31] M. K. K. Leung, A. Delong, B. Alipanahi, and B. J. Frey. Machine Learning in Genomic Medicine: A Review of Computational Problems and Data Sets. Proceedings of the IEEE, 104:176–197, 2016.
[32] C. J.
Fluke and C. Jacobs. Surveying the reach and maturity of machine learning and artificial intelligence in astronomy. WIREs Data Mining Knowl. Discov., 10:e1349, 2020.
[33] C. Bonnett et al. Redshift distributions of galaxies in the Dark Energy Survey Science Verification shear catalogue and implications for weak lensing. Phys. Rev. D, 94:042005, 2016.
[34] L. Zhang, J. Han, H. Wang, R. Car, and W. E. Deep Potential Molecular Dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett., 120:143001, 2018.
[35] T. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge Equivariant Convolutional Networks and the Icosahedral CNN. Proceedings of the 36th International Conference on Machine Learning, 97:1321–1330, 2019.
[36] M. Schuld and F. Petruccione. Supervised Learning with Quantum Computers. Springer, Switzerland, 2018.
[37] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová. Machine learning and the physical sciences. Rev. Mod. Phys., 91:045002, 2019.
[38] R. Biswas et al. Application of machine learning algorithms to the study of noise artifacts in gravitational-wave data. Phys. Rev. D, 88:062003, 2013.
[39] K. Kim, I. W. Harry, K. A. Hodge, Y.-M. Kim, C.-H. Lee, H. K. Lee, J. J. Oh, S. H. Oh, and E. J. Son. Application of artificial neural network to search for gravitational-wave signals associated with short gamma-ray bursts. Class. Quantum Grav., 32:245002, 2015.
[40] A. Torres-Forné, A. Marquina, J. A. Font, and J. Ibáñez. Denoising of gravitational wave signals via dictionary learning algorithms. Phys. Rev. D, 94:124040, 2016.
[41] N. Mukund, S. Abraham, S. Kandhasamy, S. Mitra, and N. S. Philip. Transient classification in LIGO data using difference boosting neural network. Phys. Rev. D, 95:104059, 2017.
[42] M. Zevin et al. Gravity Spy: integrating Advanced LIGO detector characterization, machine learning, and citizen science. Class. Quantum Grav., 34:064003, 2017.
[43] M. Cavaglià, S.
Gaudio, T. Hansen, K. Staats, M. Szczepanczyk, and M. Zanolin. Improving the background of gravitational-wave searches from core collapse supernovae: A machine learning approach. To be published, draft shared by the authors, 2019.
[44] H. Gabbard, M. Williams, F. Hayes, and C. Messenger. Matching Matched Filtering with Deep Networks for Gravitational-Wave Astronomy. Phys. Rev. Lett., 120:141103, 2018.
[45] D. George and E. A. Huerta. Deep neural networks to enable real-time multimessenger astrophysics. Phys. Rev. D, 97:044039, 2018.
[46] D. George and E. A. Huerta. Deep learning for real-time gravitational wave detection and parameter estimation: Results with Advanced LIGO data. Phys. Lett. B, 778:64–70, 2018.
[47] M. Razzano and E. Cuoco. Image-based deep learning for classification of noise transients in gravitational wave detectors. Class. Quantum Grav., 35:095016, 2018.
[48] T. D. Gebhard, N. Kilbertus, I. Harry, and B. Schölkopf. Convolutional neural networks: A magic bullet for gravitational-wave detection? Phys. Rev. D, 100:063015, 2019.
[49] P. G. Krastev. Real-time detection of gravitational waves from binary neutron stars using artificial neural networks. Phys. Lett. B, 803:135330, 2020.
[50] P. Astone, P. Cerda-Duran, I. Di Palma, M. Drago, F. Muciaccia, C. Palomba, and F. Ricci. New method to observe gravitational waves emitted by core collapse supernovae. Phys. Rev. D, 98:122002, 2018.
[51] A. Iess, E. Cuoco, F. Morawski, and J. Powell. Core-Collapse supernova gravitational-wave search and deep learning classification. Mach. Learn.: Sci. Technol., 1:025014, 2020.
[52] A. L. Miller et al. How effective is machine learning to detect long transient gravitational waves from neutron stars in a real search? Phys. Rev. D, 100:062005, 2019.
[53] N. Japkowicz and M. Shah. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, New York, 2011.
[54] J. Abadie et al.
Sensitivity Achieved by the LIGO and Virgo Gravitational Wave Detectors during LIGO's Sixth and Virgo's Second and Third Science Runs. arXiv:1203.2674v2, 2012.
[55] P. Welch. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15:70–73, 1967.
[56] Y. LeCun. Generalization and Network Design Strategies. Technical Report CRG-TR-89-4, University of Toronto, 1989.
[57] Y. LeCun, Y. Bengio, and G. Hinton. Deep Learning. Nature, 521:436–444, 2015.
[58] S. Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 3rd edition, Boston, 2009.
[59] R. Kronland-Martinet, J. Morlet, and A. Grossmann. Analysis of sound patterns through wavelet transforms. Int. J. Patt. Recogn. Art. Intell., 1:273–302, 1987.
[60] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybernetics, 36:193–202, 1980.
[61] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[62] D. H. Hubel. Single unit activity in striate cortex of unrestrained cats. J. Physiol., 147:226–238, 1959.
[63] D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. J. Physiol., 148:574–591, 1959.
[64] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195:215–243, 1968.
[65] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.
[66] The MathWorks Inc. MATLAB Deep Learning Toolbox. Natick, MA, United States, 2020a.
[67] J. S. Bridle. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In F. Fogelman-Soulié and J.
Hérault, editors, Neurocomputing: Algorithms, Architectures and Applications, pages 227–236. Springer-Verlag, Berlin, 1989.
[68] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 9:249–256, 2010.
[69] F. Mosteller and J. W. Tukey. Data analysis, including statistics. In G. Lindzey and E. Aronson, editors, Handbook of Social Psychology, Vol. 2, first edition, Chapter 10. Addison-Wesley, Reading, MA, 1968.
[70] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
[71] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[72] S. M. Weiss and N. Indurkhya. Decision tree pruning: Biased or optimal? In B. Hayes-Roth and R. E. Korf, editors, Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 626–632. AAAI Press and MIT Press, Seattle, WA, 1994.
[73] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 2:1137–1143, 1995.
[74] A. M. Molinaro, R. Simon, and R. M. Pfeiffer. Prediction Error Estimation: A Comparison of Resampling Methods. Bioinformatics, 21(15):3301–3307, 2005.
[75] J.-H. Kim. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Comput. Stat. Data Anal., 53(11):3735–3745, 2009.
[76] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874, 2006.
[77] J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning, pages 233–240, 2006.
[78] S. M. Kay. Statistical Signal Processing, Volume II: Detection Theory.
Prentice Hall PTR, New Jersey, 1998.
[79] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.
[80] The MathWorks Inc. MATLAB Statistics and Machine Learning Toolbox. Natick, MA, United States, 2020a.
[81] G. Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics, Philadelphia, 1990.
[82] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918, 2005.