Local earthquakes detection: A benchmark dataset of 3-component seismograms built on a global scale
Fabrizio Magrini a,*, Dario Jozinović a,b, Fabio Cammarano a, Alberto Michelini b, Lapo Boschi c,d,e

a Department of Science, Università degli Studi Roma Tre, Italy
b Istituto Nazionale di Geofisica e Vulcanologia (INGV), Rome, Italy
c Dipartimento di Geoscienze, Università degli Studi di Padova, Italy
d Sorbonne Université, CNRS, INSU, Institut des Sciences de la Terre de Paris, ISTeP UMR 7193, F-75005, Paris, France
e Istituto Nazionale di Geofisica e Vulcanologia, Bologna, Italy
Keywords:
Benchmark dataset; Earthquake detection algorithm; Supervised machine learning; Seismology
Abstract
Machine learning is becoming increasingly important in scientific and technological progress, due to its ability to create models that describe complex data and generalize well. The wealth of publicly-available seismic data nowadays requires automated, fast, and reliable tools to carry out a multitude of tasks, such as the detection of small, local earthquakes in areas characterized by sparsity of receivers. Such applications of machine learning, however, should be built on a large amount of labeled seismograms, which is neither immediate to obtain nor to compile. In this study we present a large dataset of seismograms recorded along the vertical, north, and east components of 1487 broad-band or very broad-band receivers distributed worldwide; it includes 629,095 3-component seismograms generated by 304,878 local earthquakes and labeled as EQ, and 615,847 labeled as noise (AN). Application of machine learning to this dataset shows that a simple Convolutional Neural Network of 67,939 parameters allows discriminating between earthquake and noise single-station recordings, even when applied in regions not represented in the training set. Achieving an accuracy of 96.7, 95.3, and 93.2% on the training, validation, and test set, respectively, we show that the large variety of geological and tectonic settings covered by our data supports the generalization capabilities of the algorithm, and makes it applicable to real-time detection of local events. We make the database publicly available, intending to provide the seismological and broader scientific community with a benchmark for time-series to be used as a testing ground in signal processing.
1. Introduction
Natural earthquakes are the shaking of the Earth's surface caused by a sudden release of elastic energy from geological faults, which generates mechanical waves called seismic waves. The strength of an earthquake, generally indicated by its magnitude, is proportional to the logarithm of the energy released (e.g. Båth, 1955), and determines our ability to perceive the ground motion due to seismic-wave propagation. Over the last century, the possibility of recording the arrival times of different seismic phases at sensitive instruments (i.e. seismographs) has enabled seismologists to image and understand the Earth's interior and dynamics. Among these seismic phases are the compressional (P) and shear (S) waves, which are generally the first to be recorded at a seismic receiver when an earthquake occurs, and can be considered the two fundamental types of seismic waves, in that they generate all the others (e.g. surface waves) by interacting with the discontinuities within the Earth.

The enhancement and spreading of seismic sensors around the world, together with the theoretical progress made in seismology over the last decades, nowadays allow exploiting not only seismic signals emitted by earthquakes, but also those connected to ambient noise (e.g. Shapiro and Campillo, 2004; Boschi and Weemstra, 2015). This study aims to provide a dataset of labeled seismograms generated by both local earthquakes and noise, and recorded at a large number of seismic receivers distributed around the world (Fig. 4). The choice of collecting only local earthquake data is motivated by the fact that small-magnitude events, which generate relatively small amplitudes and are easily attenuated, are often problematic to detect but provide valuable information about

* Corresponding author. E-mail address: [email protected] (F. Magrini).
Artificial Intelligence in Geosciences 1 (2020). https://doi.org/10.1016/j.aiig.2020.04.001. Received 7 February 2020; received in revised form 20 April 2020; accepted 23 April 2020; available online 5 July 2020. 2666-5441/© CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

earthquake processes (Brodsky, 2019). Cataloging small earthquakes could be important, for example, for better understanding how earthquakes interact with one another, their reoccurrence, nucleation stage and foreshock evolution (Ross et al., 2019).

This global dataset is intended to be used for carrying out a multitude of seismological and signal-processing tasks on single-station recordings, and its size particularly suits machine learning (ML) applications. ML is becoming increasingly important in scientific and technological progress, due to its ability to create models that describe complex data. In the field of seismology, ML algorithms have often proved more reliable than expert scientists in recognizing seismic-phase arrivals (e.g. Zhu et al., 2019) and determining physical quantities associated with the earthquake (e.g. Ross et al., 2018). This has important implications, e.g., for the improvement of modern earthquake early-warning techniques and therefore for the mitigation of risk (Meier et al., 2019). ML applications, however, always require a large number of samples to induce these models to generalize well, i.e. to properly classify data not represented in the training set (for an overview of applications of ML in seismology see, e.g., Kong et al., 2019). At the present time, the availability of seismological benchmark datasets like the one presented here is very limited. To our knowledge, the only instance of something similar in size has been assembled and published in a recent, independent study by Mousavi et al. (2019a).
The impressive work carried out by these authors, however, led to a dataset of different characteristics, arising from, e.g., different processing and geographic distribution of the seismograms collected. We hope that a collection of time-series like the one presented here may benefit not only seismologists, but a broader community including data scientists interested in informative data such as seismograms recorded on the Earth's surface. After explaining the procedure adopted for an automated labeling of the waveforms (Section 2), we describe in detail the features of the dataset (Section 3). An application of supervised ML to a binary classification problem is presented in Section 4. Specifically, we show the ability of a Convolutional Neural Network (Krizhevsky et al., 2012) trained on our dataset to distinguish earthquakes from noise in unlabeled data (i.e. the test set) based on single-station recordings. Possible applications of the dataset and conclusions are presented in Sections 5 and 6, respectively.
2. Labeling and downloads
We searched for seismic data recorded at more than 1500 publicly-available, broad-band or very broad-band seismic stations equipped with sensors oriented along the vertical (Z), north (N), and east (E) components. For each receiver, recordings that satisfied the quality criteria explained in the following paragraphs were downloaded, demeaned, detrended, tapered with a 5% cosine taper, and bandpass-filtered between 0.1 and 5 Hz before deconvolving the instrumental response to physical units (velocity). Each 3-component seismogram was then cut into time windows of 27 s sampled at 20 Hz, and labeled as earthquake (EQ) or noise (AN) following an automated procedure, presented below.

To download EQ waveforms we relied on several catalogues of seismic events (see Data and resources); for each location, we used the catalogue with the largest number of earthquakes reported, and selected only seismic events satisfying 3 quality parameters. Since this study focuses on local earthquakes, we set (1) the maximum hypocentral distance of an earthquake with respect to a receiver to 134 km. Events in the catalogue satisfying this condition are subject to a further selection based on a criterion of (2) time separation from other events: if the considered event is disturbed by other events that occurred at about the same time in the vicinity of the station, the event is discarded. In practice, we only use those whose origin times are at least 100 s before and 600 s after the closest available events in the catalogue with epicentral distances of up to 1.7° (~189 km). This conservative choice is motivated by the need to avoid arrivals of seismic phases from different local events within the same waveforms.

The last requirement for an event in the catalogue to pass the quality selection is determined by a (3) perceptibility radius (see Nuttli and Zollweg, 1974, and our implementation detailed below) that is a function of
Fig. 1.
Perceptibility radius used for collecting earthquake data (red) compared with those obtained by using a logarithmic (black) and a linear (gray) relation between hypocentral distance and magnitude. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

both magnitude and hypocentral distance (Fig. 1): for each magnitude, the perceptibility radius indicates the maximum acceptable hypocentral distance above which the event is discarded, since no visible signal can likely be detected. The strategy of avoiding such events is motivated by the fact that the capability of a receiver to record an earthquake decreases with hypocentral distance and strongly depends on the attenuation properties of the Earth, which can vary significantly depending on, e.g., the local geology and/or the tectonic environment (e.g. Dalton et al., 2008; Dalton and Faul, 2010). The choice of the perceptibility radius is critical in compromising the trade-off between the number of rejected events and the quality of the downloaded data; however, accounting for its spatial variability in order to optimize the labeling at each location covered by our dataset would be, at least, impractical. For this reason, our choice of the perceptibility radius is empirical, and has been made after visual inspection of its performance in terms of quality of the labeled waveforms vs. rejection rate. In this regard, we visually checked a large number of seismograms (more than 40,000) and excluded from the dataset those stations which appeared too noisy and therefore did not bring clear evidence of earthquakes in the data.
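For reference, the waveform pre-processing described at the beginning of this section (demean, detrend, 5% cosine taper, 0.1-5 Hz bandpass, 27-s windows at 20 Hz) can be sketched with SciPy. This is a simplified stand-in for the authors' pipeline: the instrument-response deconvolution is omitted, the Butterworth filter order and the naive decimation are our assumptions, and the synthetic input trace is purely illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, detrend
from scipy.signal.windows import tukey

def preprocess(trace, fs, fmin=0.1, fmax=5.0, out_fs=20, win_s=27):
    """Demean, detrend, cosine-taper, bandpass, decimate, and window a trace."""
    x = trace - trace.mean()            # demean
    x = detrend(x)                      # remove linear trend
    x = x * tukey(len(x), alpha=0.1)    # cosine taper (5% at each end, one
                                        # common reading of a "5% taper")
    sos = butter(4, [fmin, fmax], btype="bandpass", fs=fs, output="sos")
    x = sosfiltfilt(sos, x)             # zero-phase bandpass, 0.1-5 Hz
    step = int(fs // out_fs)            # naive decimation to 20 Hz
    x = x[::step]
    return x[: win_s * out_fs]          # 27 s * 20 Hz = 540 samples

fs = 100.0                              # assumed original sampling rate
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(0)
trace = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)
window = preprocess(trace, fs)
print(window.shape)                     # (540,)
```

In a production pipeline one would use ObsPy's Stream/Trace methods (which also handle response removal) rather than raw SciPy calls.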
The bad quality of such waveforms can be ascribed either to an inappropriate definition of the perceptibility radius or strong ambient noise at those locations, or to relatively large errors in the event information provided in the catalogues (due to, e.g., scarce seismic coverage in the surroundings of certain receivers).

In practice, the perceptibility radius has been defined using the functions linspace and geomspace of the NumPy Python library (Oliphant, 2006): we chose the maximum acceptable hypocentral distances within an interval between 4 km and 120 km using 21 points (a) spaced evenly, i.e. Linear = linspace(4, 120, 21), and (b) spaced evenly on a logarithmic scale, i.e. Logarithmic = geomspace(4, 120, 21); each of these two arrays has then been associated with a set of 21 magnitudes evenly spaced between 0.3 and 2.3, to obtain the gray line and black curve shown in Fig. 1, respectively. Magnitudes above 2.4 are always accepted, provided condition (1) is met. The perceptibility radius employed in this study has been obtained by defining the array of the hypocentral distances as a weighted average of (a) and (b): Used = (Linear + Logarithmic)/2 (red curve in Fig. 1).

For each event that met the above conditions, 3-component seismograms starting 4 s before the expected arrival time of the P-wave at the receiver (calculated using IASP91 as 1-D background model; Kennett and Engdahl, 1991) were downloaded and labeled as EQ (Fig. 2). Concerning the labeling of noise data (Fig. 3), we followed the same criterion of time separation from the closest events reported in the catalogue, already described above: each waveform labeled as AN is randomly downloaded, provided its start time and end time are separated from the closest events by at least 100 s and 600 s, respectively.
It might be noted that this approach does not prevent labeling as AN recordings of ground motion generated by seismic events at epicentral distances greater than 1.7° and strong enough to be detected. This choice, however, is supported by two considerations. (1) The Gutenberg-Richter relation (Gutenberg and Richter, 1944) says that the probability of occurrence of an earthquake decreases, to a good approximation, exponentially with increasing magnitude; this circumstance alone makes the probability of randomly labeling as AN an earthquake strong enough to be perceptible at the station location relatively small. In addition, (2) the characteristics of a seismogram recording seismic waves generated by a strong, distant earthquake will be substantially different from those of a
Fig. 2.
Four randomly selected seismograms (Z, N, E components from top to bottom of each panel) which passed the selection criteria explained in Section 2.1 and were consequently labeled as earthquakes. Each recording brings evidence of particle motion due to a seismic event. Station codes, start times of the waveforms, origin times and magnitudes of the earthquakes (as indicated by the catalogue providers) are reported in the following. (a) station code: CI.PALA, start time: 2018-10-20 23:16:46, event time: 2018-10-20 23:16:45, magnitude: 2.5; (b) station code: AE.U15A, start time: 2015-03-03 14:18:56, event time: 2015-03-03 14:18:44, magnitude: 2.4; (c) station code: IV.ATPC, start time: 2014-03-03 10:08:02, event time: 2014-03-03 10:08:04, magnitude: 1.3; (d) station code: CN.MOBC, start time: 2014-06-26 15:15:11, event time: 2014-06-26 15:15:03, magnitude: 2.3.
seismogram produced by a local earthquake. Indeed, the chosen duration of the waveforms (27 s), together with the maximum hypocentral distance (134 km) used for labeling EQ, implies that EQ seismograms will necessarily record the P, S, and surface waves generated by a seismic event, due to simple velocity-time-distance considerations (for average velocities of these seismic phases see, e.g., Stein and Wysession, 2009). However, since these seismic waves travel at different velocities, the same would not happen in the case of a more distant earthquake, thus without affecting the possibility of discriminating between noise and earthquake waveforms.
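The velocity-time-distance argument above can be checked with back-of-the-envelope numbers; the average crustal velocities used here (P ≈ 6.0 km/s, S ≈ 3.5 km/s) are our own round figures, not values from the paper.

```python
# Does a 27-s window starting 4 s before the P arrival also contain the
# S arrival, in the worst case of a 134-km hypocentral distance?
vp, vs = 6.0, 3.5           # assumed average crustal velocities (km/s)
dist = 134.0                # maximum hypocentral distance (km)

t_p = dist / vp             # ~22.3 s after origin time
t_s = dist / vs             # ~38.3 s after origin time
window_start = t_p - 4.0    # each EQ waveform begins 4 s before P
s_in_window = t_s - window_start   # seconds into the window at which S arrives

print(round(s_in_window, 1))       # ~20 s, comfortably inside the 27-s window
```

For a distant (teleseismic) event, the S-minus-P delay grows far beyond 23 s, so P and S can no longer fall in the same 27-s window, which is the point made in the text.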
3. Final dataset
The above procedure allowed us to collect 1,244,942 3-component seismograms recorded at 1487 receivers distributed worldwide (Fig. 4): 615,847 labeled as AN and 629,095 as EQ. EQ data have been retrieved from a total of 304,878 different earthquakes (Fig. 5), whose magnitude distribution is shown to follow the Gutenberg and Richter (1944) distribution in Fig. 6, at least down to magnitudes of ~1.5. For lower magnitudes, the decrease in the number of earthquakes in our EQ data can be ascribed both to the insufficient completeness of the catalogues and to the conservative choice of the perceptibility radius adopted in the downloads.

Fig. 3.
Four randomly selected seismograms (Z, N, E components from top to bottom of each panel) which passed the selection criteria explained in Section 2.2 and were consequently labeled as noise. Station codes and start times of the waveforms are reported in the following. (a) station code: OK.ELIS, start time: 2017-05-02 05:46:30; (b) station code: IV.ATMI, start time: 2015-09-11 12:05:38; (c) station code: IV.FIAM, start time: 2018-08-05 20:25:34; (d) station code: FR.TURF, start time: 2016-11-25 22:48:35.
Fig. 4.
Location of the receivers whose recordings are collected in our dataset. Red, blue, and green indicate the training, validation, and test sets as employed in the ML application (Section 4), respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

We make the dataset publicly available through https://doi.org/10.5281/zenodo.3648232 as a single file in HDF5 binary data format. Fig. 7 summarizes the structure of the database, which we dubbed LEN-DB (Local Earthquakes and Noise DataBase). The labeled data are split into 2 HDF5-Groups: EQ and AN. Each of these groups contains as many
HDF5-Datasets as the number of 3-component seismograms; these are labeled according to the format net sta starttime, where net, sta, and starttime represent the seismic network, station, and start time of the seismograms. Each HDF5-Dataset (i.e. each triplet of seismograms) has attributes, which allow accessing the respective metadata. Attributes of AN data consist of the station and waveform information: net (network code), sta (station code), stla (station latitude, in degrees, north positive), stlo (station longitude, in degrees, east positive), stel (station elevation, in meters), starttime and endtime (start time and end time of the waveforms, respectively). For EQ data, information about the event is also reported: mag (magnitude), evla (epicenter latitude, in degrees, north positive), evlo (epicenter longitude, in degrees, east positive), evdp (depth of the hypocenter with respect to the nominal sea level given by the WGS84 geoid, in meters), otime (event origin time), dist (epicentral distance, in meters), az (event-to-station azimuth, in degrees), and baz (station-to-event azimuth, in degrees). In addition, one HDF5-Group allows accessing the stations' metadata through as many HDF5-Datasets as the number of receivers employed for collecting the waveforms.
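A layout of this kind can be navigated with h5py. The snippet below builds a minimal mock file following the `net sta starttime` naming convention and a subset of the attributes listed above; the key, attribute values, and the (3, 540) array orientation are illustrative assumptions (whether LEN-DB stores waveforms as (3, 540) or (540, 3) should be checked against the real file).

```python
import os
import tempfile
import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "mock_lendb.hdf5")

# Build a tiny LEN-DB-like file: groups "EQ" and "AN", one dataset per
# 3-component seismogram, metadata stored as HDF5 attributes.
with h5py.File(path, "w") as f:
    eq = f.create_group("EQ")
    key = "IV ATPC 2014-03-03T10:08:02"        # "net sta starttime"
    ds = eq.create_dataset(key, data=np.zeros((3, 540), dtype="float32"))
    ds.attrs["net"] = "IV"
    ds.attrs["sta"] = "ATPC"
    ds.attrs["mag"] = 1.3
    f.create_group("AN")

# Read it back the way one would iterate over LEN-DB itself
with h5py.File(path, "r") as f:
    name, ds = next(iter(f["EQ"].items()))
    waveform = ds[()]                          # (3, 540) array: Z, N, E
    mag = float(ds.attrs["mag"])

print(name, waveform.shape, mag)
```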
4. Machine learning application
We present in this Section a simple application of the dataset to a signal classification problem. Specifically, we trained a Convolutional Neural Network (CNN) (Krizhevsky et al., 2012) to discriminate between recordings of noise and recordings of earthquakes; the trained model therefore represents a single-station local-earthquake detection algorithm.

The architecture of the CNN ensures that the algorithm is invariant to a certain degree of data translation and rotation. In other words, when applied to a time-series, the CNN learns its characteristics regardless of their position in time (Chollet, 2018). The inputs to the CNN are the 27-s 3-component seismograms sampled at 20 Hz. Each input is normalized using the maximum value among the triplet of seismograms; the maximum is stored and serves as complementary data to the normalized time-series. The architecture of the algorithm is a slightly modified version of ConvNetQuake, a CNN adopted by Perol et al. (2018) and Lomax et al. (2019) for detection and characterization of local and global earthquakes, respectively. The output of the algorithm consists of a real number between 0 and 1, which classifies a given waveform as EQ or AN upon rounding to the closest integer.

The CNN has been set up using the Keras Python library (Chollet et al., 2015). The input layer consists of the normalized waveform array of dimensions (540, 3). This layer is followed by 8 stacked L2-regularized convolutional layers (with regularization constant set to 0.0002), whose number of features is progressively halved by employing max-pooling (e.g. Scherer et al., 2010). The last convolutional layer is flattened and the extracted features, together with the maximum used in the normalization, are then fed to a fully-connected layer with 256 neurons. This configuration resulted in 67,939 model parameters. The rectified linear unit (ReLU) activation function (e.g. Nair and Hinton, 2010) is used throughout the whole architecture except for the last layer, where a fully-connected layer with one neuron returns the classification of the waveform using a sigmoid activation function. Cross-entropy (e.g. Goodfellow et al., 2016) and Adam (Kingma and Ba, 2014) are the loss function and the optimization algorithm used to train the model, respectively.

We split the dataset on a geographical basis (Fig. 4), using 884,073 (452,147 EQ, 431,926 AN), 266,407 (128,698 EQ, 137,709 AN), and 94,462 (48,250 EQ, 46,212 AN) 3-component seismograms for the training, validation, and test sets, respectively; the magnitude distributions of the earthquakes used in these three subsets are illustrated in the supplementary materials. The strategy of splitting the dataset geographically is adopted to prevent the waveforms of different subsets from carrying common geological/tectonic information buried in the signal, which could induce the model to overfit the data due to information leakage (see, e.g., Chollet, 2018). In other words, the large number of stations from different geological and tectonic settings is likely to make the algorithm learn general features of EQ and AN signals, regardless of the characteristics of the specific areas; splitting the dataset on a geographical basis therefore helps the generalization capabilities of the model (a thorough discussion of the topic can be found, e.g., in Goodfellow et al., 2016). Due to the large number of samples and to limitations in the available RAM, we randomly split the training data into three separate subsets; during training, at each epoch one of them is randomly chosen and used, so that each subset contributes equally to the learning process of the model. The training process required ~30 min on an Nvidia 1060 4 GB for 100 epochs, using a batch size of 512 samples. The trained Keras model is available at https://github.com/djozinovi/LEN-DB.

The above procedure yielded an overall accuracy of 96.7% and 95.3% on the training and validation sets, respectively; graphs of accuracy and loss as a function of epoch for both training and validation sets are shown in Fig. 8. The performance of the trained model has then been validated on the test set, on which we achieved an overall accuracy of 93.2%.

Fig. 5. Spatial distribution of the 304,878 earthquakes exploited for collecting 3-component seismograms labeled as EQ.

Fig. 6. Magnitude distribution of the 304,878 earthquakes exploited for collecting 3-component seismograms labeled as EQ.

Fig. 7. Schematic representation of the structure of the database.

Fig. 8. Accuracy (top) and loss (bottom) as a function of epoch achieved on the training and validation sets. The final model has been trained for 100 epochs; larger epochs are shaded in gray. In each subplot, the red (training) and blue (validation) curves indicate the running average of the actual accuracy/loss values (red and blue dots for training and validation, respectively). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
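The per-waveform normalization described in this Section (divide each triplet by its maximum and feed that maximum to the network as a complementary input) can be sketched in NumPy; using the maximum absolute value is our interpretation of "the maximum value among the triplet", and the synthetic input is illustrative.

```python
import numpy as np

def normalize_triplet(seismogram):
    """seismogram: array of shape (540, 3) -- Z, N, E sampled at 20 Hz.
    Returns the normalized waveform and the scale factor that serves as
    complementary input to the network."""
    peak = np.abs(seismogram).max()      # maximum over all 3 components
    return seismogram / peak, peak

rng = np.random.default_rng(0)
x = rng.normal(scale=1e-6, size=(540, 3))   # synthetic ground-velocity triplet
x_norm, peak = normalize_triplet(x)

print(x_norm.shape, np.abs(x_norm).max())   # (540, 3) 1.0
```

Keeping the peak as a separate feature lets the classifier use absolute amplitude information that the normalization would otherwise discard.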
Fig. 9 shows the confusion matrices (e.g. Sammut and Webb, 2017) obtained using our algorithm to classify the waveforms of the three subsets individually; the percentage of False Negatives is larger than that of False Positives for the training, validation, and test sets alike, albeit by a small margin. This is ascribed to the difficulty of our simple model of 67,939 parameters in detecting earthquakes in the presence of relatively high noise levels. In this regard, some seismic networks included in the test set proved to be problematic, contributing to decrease the overall accuracy, as illustrated by Table 1. Among them, the KR (Kyrgyzstan) network showed the largest number of undetected earthquakes (i.e. False Negatives). The relatively large number of EQ waveforms belonging to this seismic network turned out to be an important factor in determining the overall accuracy of the subset. In fact, excluding the KR network from the test set raises its overall accuracy from 93.2% to 95.5%, values consistent with those achieved on the validation and training sets.

Visual inspection of the waveforms misclassified as AN showed that, for the majority of them, the ground motion caused by the earthquake reported in the catalogue is only barely visible, even for magnitudes ≥ 3 (Fig. 10); these checks confirm the high quality of the labeled seismograms collected in our dataset. It is worth noting the performance of the CNN on the global network IU, which offers an insight into the strong geographic variability of the waveforms; as opposed to a very small number of False Negatives associated with IU at the locations included in the test and training sets (see Tables 1 and 2), a relatively poor accuracy is observed for the same network on the validation set (Table 3). In analogy with the KR network, this can be ascribed to high noise levels at specific sensors. On the other hand, the high overall accuracy achieved on the three subsets in the presence of such variability of the waveforms with location provides evidence of the good generalization properties of the detection algorithm.
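The effect of removing the KR network can be verified from the counts in Table 1 and the test-set totals quoted in Section 4. The exact result depends on the rounding of the quoted 93.2%, so the check below only expects agreement within a couple of tenths of a percent.

```python
# Test-set totals from Section 4 and KR counts from Table 1
total = 48250 + 46212                  # EQ + AN test seismograms = 94,462
correct = 0.932 * total                # implied by the 93.2% overall accuracy

kr_total = 25227 + 3603 + 684 + 18511  # TN + FN + FP + TP for network KR
kr_correct = 25227 + 18511             # TN + TP

acc_without_kr = (correct - kr_correct) / (total - kr_total)
print(round(100 * acc_without_kr, 1))  # close to the quoted 95.5%
```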
5. Possible applications
We have shown in Section 4 that the trained model can be applied to detect small earthquakes in regions that were not represented in the training set. The high accuracy achieved (96.7, 95.3, and 93.2% on the training, validation, and test set, respectively) brings evidence that the same method can be applied to real-time detection of earthquakes on individual stations, by streaming continuous data in batches of 27 s, on the condition of pre-processing the seismograms as in Section 2. In this regard, a possible way to further improve the performance of the algorithm would be to introduce one or more recurrent layers into the model (e.g. Mousavi et al., 2019b), to account for the temporal relation between the seismic phases arriving at the receivers. Our algorithm also suits the analysis of past recordings with the
Fig. 9.
Confusion matrices of the (a) training, (b) validation, and (c) test set. In each subpanel, the colorscale indicates the number of 3-component seismograms employed. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 1
True Negatives (TN), False Negatives (FN), False Positives (FP), and True Positives (TP) resulting from the application of the detection algorithm to the test set; the results are shown for each seismic network (NET) individually. Accuracies (in percentage) in correctly classifying AN and EQ waveforms are indicated as ACC AN and ACC EQ, respectively, while ACC indicates the overall accuracy.
NET   TN     FN    FP   TP     ACC AN  ACC EQ  ACC
CN    184    0     6    66     96.8    100     97.7
DK    1254   32    52   950    96.0    96.7    96.3
G     1291   292   22   1854   98.3    86.4    90.9
GE    413    14    13   279    96.9    95.2    96.2
IC    775    66    48   364    94.2    84.7    90.9
II    4186   204   65   4495   98.5    95.7    97.0
IU    2296   92    45   2562   98.1    96.5    97.3
JP    8438   1032  77   12871  99.1    92.6    95.1
KR    25227  3603  684  18511  97.4    83.7    91.1
NO    747    9     3    552    99.6    98.4    99.1
PL    379    4     7    398    98.2    99.0    98.6
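The per-network accuracies in Table 1 follow directly from the confusion-matrix counts; the small helper below reproduces, e.g., the CN and KR rows (the function name is ours).

```python
def network_scores(tn, fn, fp, tp):
    """Accuracies (in %) as defined for Table 1, rounded to one decimal."""
    acc_an = 100 * tn / (tn + fp)                 # correctly classified noise
    acc_eq = 100 * tp / (tp + fn)                 # correctly classified earthquakes
    acc = 100 * (tn + tp) / (tn + fn + fp + tp)   # overall accuracy
    return round(acc_an, 1), round(acc_eq, 1), round(acc, 1)

print(network_scores(184, 0, 6, 66))              # CN row  -> (96.8, 100.0, 97.7)
print(network_scores(25227, 3603, 684, 18511))    # KR row  -> (97.4, 83.7, 91.1)
```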
purpose of enriching the existing catalogues by detecting small, local earthquakes that could not be detected by other commonly employed methods (for example the STA/LTA; e.g. Withers et al., 1998); by not relying on multiple stations, this would prove especially useful in areas with a scarce density of receivers. We have shown that the large variety of geological and tectonic settings covered by our data supports the generalization capabilities of the detection algorithm. On the other hand, for carrying out specific tasks like improving the completeness of the catalogues in certain locations, it might be beneficial to focus only on a portion of the dataset; training a machine learning algorithm on those data would then enable incorporating the geographic characteristics of the investigated region in the trained model, possibly leading to higher accuracy in the detection. In addition, it might be worth investigating whether such a collection of AN data could allow to efficiently simulate seismic ambient noise and extract information on the distribution of noise sources contributing to the recordings of a specific area. Something similar would be particularly useful for constraining the attenuation properties of the region (e.g. Tsai, 2011; Boschi et al., 2019).

Other possible applications of our dataset are connected to signal-processing tasks. Denoising of the waveforms (e.g. Mousavi et al., 2016) and detection of anomalies due to, e.g., electronic failures of the sensors are an example; when dealing with real data, it is common to observe anomalous, meaningless signals which might introduce a bias in the results of a study. In fact, while visually checking the seismograms collected, we noticed the presence of a few instances of such anomalies; although we estimated the amount of such signals to be very small in comparison to the number of healthy waveforms in our dataset, we

Fig. 10.
Nine randomly selected seismograms (Z, N, E components from top to bottom of each panel) belonging to the KR network and misclassified as AN. Buried in the seismograms is the evidence of earthquakes characterized by magnitudes ≥ 3. Station codes, start times of the waveforms, origin times and magnitudes of the earthquakes (as indicated by the catalogue providers) are reported in the following. (a) station code: KR.DRK, start time: 2012-04-30 17:43:56, event time: 2012-04-30 17:43:39, magnitude: 3.2; (b) station code: KR.BTK, start time: 2018-08-14 23:05:19, event time: 2018-08-14 23:05:05, magnitude: 3.4; (c) station code: KR.MNAS, start time: 2018-08-25 22:17:44, event time: 2018-08-25 22:17:38, magnitude: 4.9; (d) station code: KR.DRK, start time: 2017-06-20 16:50:17, event time: 2017-06-20 16:50:06, magnitude: 3.0; (e) station code: KR.ANVS, start time: 2018-06-09 23:51:52, event time: 2018-06-09 23:51:42, magnitude: 3.0; (f) station code: KR.ANVS, start time: 2013-09-14 10:00:38, event time: 2013-09-14 10:00:21, magnitude: 3.1; (g) station code: KR.DRK, start time: 2018-08-12 17:18:58, event time: 2018-08-12 17:18:46, magnitude: 3.1; (h) station code: KR.TOKL, start time: 2012-06-10 21:29:00, event time: 2012-06-10 21:28:44, magnitude: 3.2; (i) station code: KR.SFK, start time: 2018-08-08 20:19:47, event time: 2018-08-08 20:19:32, magnitude: 4.3.
decided to keep those anomalies within the dataset, leaving such a task to be tackled by future studies and different methodological approaches.
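The real-time use proposed at the beginning of this section amounts to slicing a continuous, pre-processed 20-Hz stream into consecutive 27-s (540-sample) batches and scoring each with the trained classifier. A minimal sketch, assuming non-overlapping windows and a generic `classify` callable standing in for the trained Keras model, is:

```python
import numpy as np

WIN = 27 * 20   # 540 samples per window at 20 Hz

def windows(stream):
    """Split a continuous (n_samples, 3) stream into non-overlapping
    (540, 3) windows, dropping any incomplete tail."""
    n = stream.shape[0] // WIN
    return stream[: n * WIN].reshape(n, WIN, 3)

def detect(stream, classify):
    """classify: callable mapping a (540, 3) window to a score in [0, 1];
    scores above 0.5 are declared earthquakes (EQ), as in Section 4."""
    return ["EQ" if classify(w) > 0.5 else "AN" for w in windows(stream)]

# Toy example with a stand-in, amplitude-based classifier
stream = np.zeros((1200, 3))
stream[600:700, :] = 5.0     # a burst falling in the second window
labels = detect(stream, lambda w: float(np.abs(w).max() > 1.0))
print(labels)                # ['AN', 'EQ']
```

In practice one would likely use overlapping windows so that an event straddling a window boundary is not missed; that refinement is omitted here for brevity.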
6. Conclusions
We compiled a large dataset of seismograms recorded along the vertical, north, and east components of 1487 broad-band or very broad-band receivers distributed worldwide, including 629,095 3-component seismograms generated by 304,878 local earthquakes and labeled as EQ, and 615,847 labeled as noise (AN). We used the dataset to train a Convolutional Neural Network (CNN) to discriminate noise from earthquake data, and showed that the trained model can be applied to detect small earthquakes in regions that were not represented in the training set. The high accuracy achieved (96.7, 95.3, and 93.2% on the training, validation, and test set, respectively) confirms both the high quality of the labeled seismograms collected in the dataset and the good generalization properties of the detection algorithm.

The availability of similar benchmark datasets is, at the present time, very limited. To our knowledge, the only instance of something comparable in size and built on a global scale has been published in a recent, independent work by Mousavi et al. (2019a); differences, however, arise from the distribution of the receivers, the processing, and the duration of the 3-component seismograms. Our global dataset is intended to be used for carrying out a multitude of seismological and signal-processing tasks based on single-station recordings; importantly, its size suits machine learning applications. For this reason, we believe that our large collection of waveforms will benefit not only seismologists, but a broader community including data scientists interested in informative data such as seismograms recorded on the Earth's surface.

Data and resources

Catalogues of seismic events were downloaded from the Istituto Nazionale di Geofisica e Vulcanologia (INGV) Seismological Data Centre (2006), the International Seismological Centre (2019) (Storchak et al., 2013, 2015; Giacomo et al., 2018), and IRIS Data Services.
The facilities of IRIS Data Services, and specifically … Réseau Sismologique et géodésique Français (1995), Institut De Physique Du Globe De Paris (IPGP) & Ecole Et Observatoire Des Sciences De La Terre De Strasbourg (EOST) (1982), GEOFON Data Centre (1993), Federal Institute for Geosciences and Natural Resources (BGR) (1976), Albuquerque Seismological Laboratory (ASL)/USGS (1988, 1990, 1992, 1993), The Finnish National Seismic Network, GFZ Data Services (1980), Scripps Institution of Oceanography (1986), Istituto Nazionale di Geofisica e Vulcanologia (INGV) Seismological Data Centre (2006), Kyrgyz Institute of Seismology, KIS (2007), MedNet Project Partner Institutions (1990), UC San Diego: Central and Eastern US Network (2013), ZAMG - Zentralanstalt für Meteorologie und Geodynamik (1987), Oklahoma Geological Survey: Oklahoma Seismic Network (1978), Penn State University: Pennsylvania State Seismic Network (2004), University Of Montana: University of Montana Seismic Network (2017), International Federation of Digital Seismograph Networks: XV Seismic Network (2014) (Tape and West, 2014; Tape et al., 2018).

Acknowledgments
We are grateful to three anonymous reviewers for their insightful and constructive reviews. We thank the makers of ObsPy (Beyreuther et al., 2010). Graphics were created with Python Matplotlib (Hunter, 2007). The Grant to the Department of Science, Roma Tre University (MIUR-Italy Dipartimenti di Eccellenza, ARTICOLO 1, COMMI 314–337, LEGGE 232/2016) is gratefully acknowledged.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.aiig.2020.04.001.
References
Alaska Earthquake Center, Univ of Alaska Fairbanks, 1987. Alaska Regional Network. dataset/Seismic Network. doi:10.7914/SN/AK, last accessed January 2020.
Albuquerque Seismological Laboratory (ASL)/USGS, 1988. Global Seismograph Network - IRIS/USGS. International Federation of Digital Seismograph Networks dataset/Seismic Network. doi:10.7914/SN/IU, last accessed January 2020.
Albuquerque Seismological Laboratory (ASL)/USGS, 1990. United States National Seismic Network. International Federation of Digital Seismograph Networks dataset/Seismic Network. doi:10.7914/SN/US, last accessed January 2020.
Table 2
Same as Table 1, but obtained on the training set.
NET TN FN FP TP AN-ACC(%) EQ-ACC(%)
AE 642 6 7 282 98.9 97.9
AK 21315 921 760 28596 96.6 96.9
CH 39171 2212 431 34417 98.9 94
CI 26044 2506 581 46416 97.8 94.9
CN 25973 625 466 22654 98.2 97.3
CZ 3666 161 42 3957 98.9 96.1
FN 1468 35 13 1150 99.1 97
FR 55384 5902 585 48631 99 89.2
G 520 42 1 425 99.8 91
GB 1865 13 8 596 99.6 97.9
GE 2807 116 58 1582 98 93.2
GR 5114 235 67 3758 98.7 94.1
HE 3028 39 14 2910 99.5 98.7
II 8257 151 41 11074 99.5 98.7
IU 6795 57 98 7308 98.6 99.2
IV 116349 5402 3120 119243 97.4 95.7
MN 22572 1497 300 20280 98.7 93.1
N4 10555 127 255 7698 97.6 98.4
NO 1489 22 2 901 99.9 97.6
OE 12630 234 148 12813 98.8 98.2
OK 41179 1084 731 38321 98.3 97.2
PE 954 35 15 450 98.5 92.8
PL 620 49 0 676 100 93.2
UM 2902 11 46 3644 98.4 99.7
US 10264 153 253 8330 97.6 98.2
XV 3874 87 138 5696 96.6 98.5
Table 3
Same as Tables 1 and 2, but obtained on the validation set.
NET TN FN FP TP AN-ACC(%) EQ-ACC(%)
AF 1807 75 41 1369 97.8 94.8
AU 4027 290 107 1444 97.4 83.3
BR 598 4 2 215 99.7 98.2
G 5791 733 207 5327 96.5 87.9
GE 52076 1771 1674 44331 96.9 96.2
GT 264 12 6 63 97.8 84
II 1782 47 28 1509 98.5 97
IU 27099 4374 753 21619 97.3 83.2
MN 6208 44 137 6689 97.8 99.3
NZ 22626 1640 407 25024 98.2 93.8
TU 12634 385 203 12360 98.4 97
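The two accuracy columns in Tables 2 and 3 follow directly from the per-network confusion counts: AN accuracy is TN/(TN + FP) (the fraction of noise recordings classified as noise) and EQ accuracy is TP/(TP + FN) (the fraction of earthquake recordings detected). The following minimal Python check (not the authors' code; row values copied from the tables) reproduces the tabulated figures:

```python
def class_accuracies(tn, fn, fp, tp):
    """Per-class accuracies (%) from binary confusion counts.

    AN (noise) accuracy: true negatives over all noise recordings.
    EQ (earthquake) accuracy: true positives over all earthquake recordings.
    """
    an_acc = 100.0 * tn / (tn + fp)
    eq_acc = 100.0 * tp / (tp + fn)
    return round(an_acc, 1), round(eq_acc, 1)

# AE row of Table 2 (training set): TN=642, FN=6, FP=7, TP=282
print(class_accuracies(642, 6, 7, 282))      # (98.9, 97.9)

# AF row of Table 3 (validation set): TN=1807, FN=75, FP=41, TP=1369
print(class_accuracies(1807, 75, 41, 1369))  # (97.8, 94.8)
```

Any row of either table can be verified the same way, which also makes clear that the two columns weight the noise and earthquake classes independently of their relative abundance at each network.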
Chollet, F., et al., 2015. Keras. https://keras.io.
Dalton, C.A., Faul, U.H., 2010. The oceanic and cratonic upper mantle: clues from joint interpretation of global velocity and attenuation models. Lithos 120 (1–2).
Dalton, C.A., Ekström, G., Dziewoński, A.M., 2008. The global attenuation structure of the upper mantle. J. Geophys. Res.: Solid Earth 113 (B9).
Federal Institute for Geosciences and Natural Resources (BGR), 1976. German Regional Seismic Network (GRSN). https://doi.org/10.25928/MBX6-HR74, last accessed January 2020.
GEOFON Data Centre, 1993. GEOFON Seismic Network. Deutsches GeoForschungsZentrum GFZ. https://doi.org/10.14470/TR560404, last accessed January 2020.
Geological Survey of Canada, 1980. Canadian National Seismograph Network. International Federation of Digital Seismograph Networks dataset/Seismic Network. doi:10.7914/SN/CN, last accessed January 2020.
Giacomo, D.D., Engdahl, E.R., Storchak, D.A., 2018. The ISC-GEM Earthquake Catalogue (1904–2014): status after the extension project. Earth Syst. Sci. Data 10, 1877–1899.
Hunter, J.D., 2007. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9 (3), 90–95. https://doi.org/10.1109/MCSE.2007.55.
Institut De Physique Du Globe De Paris (IPGP), Ecole Et Observatoire Des Sciences De La Terre De Strasbourg (EOST), 1982. GEOSCOPE, French Global Network of Broad Band Seismic Stations. Institut de Physique du Globe de Paris (IPGP). https://doi.org/10.18715/GEOSCOPE.G, last accessed January 2020.
Institute of Geophysics, Academy of Sciences of the Czech Republic, 1973. Czech Regional Seismic Network. International Federation of Digital Seismograph Networks dataset/Seismic Network. doi:10.7914/SN/CZ, last accessed January 2020.
International Federation of Digital Seismograph Networks, 2014. XV Seismic Network. dataset/Seismic Network. doi:10.7914/SN/XV_2014, last accessed January 2020.
International Seismological Centre, 2019. ISC-GEM Earthquake Catalogue. https://doi.org/10.31905/d808b825, last accessed January 2020.
IRIS Data Services. IRIS Data Management Center. http://ds.iris.edu/ds/nodes/dmc/, last accessed January 2020.
Istituto Nazionale di Geofisica e Vulcanologia (INGV) Seismological Data Centre, 2006. Rete Sismica Nazionale (RSN). Istituto Nazionale di Geofisica e Vulcanologia (INGV), Italy. https://doi.org/10.13127/SD/X0FXNH7QFY, last accessed January 2020.
Kennett, B., Engdahl, E., 1991. Travel times for global earthquake location and phase identification. Geophys. J. Int. 105 (2), 429–465.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
Lomax, A., Michelini, A., Jozinović, D., 2019. An investigation of rapid earthquake characterization using single-station waveforms and a convolutional neural network. Seismol. Res. Lett. 90 (2A), 517–529.
MedNet Project Partner Institutions, 1990. Mediterranean Very Broadband Seismographic Network (MedNet). Istituto Nazionale di Geofisica e Vulcanologia (INGV). https://doi.org/10.13127/SD/FBBBTDTD6Q, last accessed January 2020.
Meier, M.-A., Ross, Z.E., Ramachandran, A., Balakrishna, A., Nair, S., Kundzicz, P., Li, Z., Andrews, J., Hauksson, E., Yue, Y., 2019.
Reliable real-time seismic signal/noise discrimination with machine learning. J. Geophys. Res.: Solid Earth 124 (1), 788–800.
Mousavi, S.M., Sheng, Y., Zhu, W., Beroza, G.C., 2019a. STanford EArthquake Dataset (STEAD): a global data set of seismic signals for AI. IEEE Access.
Mousavi, S.M., Zhu, W., Sheng, Y., Beroza, G.C., 2019b. CRED: a deep residual network of convolutional and recurrent units for earthquake signal detection. Sci. Rep. 9 (1), 1–14.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
Réseau Sismologique et géodésique Français, 1995. RESIF-RLBP French Broad-band Network, RESIF-RAP Strong Motion Network and Other Seismic Stations in Metropolitan France. https://doi.org/10.15778/RESIF.FR, last accessed January 2020.
Ross, Z.E., Meier, M.-A., Hauksson, E., 2018. P wave arrival picking and first-motion polarity determination with deep learning. J. Geophys. Res.: Solid Earth 123 (6), 5120–5129.
Storchak, D.A., Di Giacomo, D., Bondár, I., Engdahl, E.R., Harris, J., Lee, W.H., Villaseñor, A., Bormann, P., 2013. Public release of the ISC-GEM global instrumental earthquake catalogue (1900–2009). Seismol. Res. Lett. 84 (5), 810–815.
Storchak, D.A., Di Giacomo, D., Engdahl, E.R., Harris, J., Bondár, I., Lee, W.H., Bormann, P., Villaseñor, A., 2015. The ISC-GEM global instrumental earthquake catalogue (1900–2009): introduction. Phys. Earth Planet. Inter. 239, 48–63.