Earthquake magnitude and location estimation from real time seismic waveforms with a transformer network
A PREPRINT
Jannes Münchmeyer 1,2,*, Dino Bindi 1, Ulf Leser 2, Frederik Tilmann 1,3
1 Helmholtz-Zentrum Potsdam, Deutsches GeoForschungsZentrum GFZ, Potsdam, Germany
2 Institut für Informatik, Humboldt-Universität zu Berlin, Berlin, Germany
3 Institut für geologische Wissenschaften, Freie Universität Berlin, Berlin, Germany
* To whom correspondence should be addressed: [email protected]
January 7, 2021

ABSTRACT
Precise real-time estimates of earthquake magnitude and location are essential for early warning and rapid response. While recently multiple deep learning approaches for fast assessment of earthquakes have been proposed, they usually rely on either seismic records from a single station or from a fixed set of seismic stations. Here we introduce a new model for real-time magnitude and location estimation using attention-based transformer networks. Our approach incorporates waveforms from a dynamically varying set of stations and outperforms deep learning baselines in both magnitude and location estimation performance. Furthermore, it outperforms a classical magnitude estimation algorithm considerably and shows promising performance in comparison to a classical localization algorithm. Our model is applicable to real-time prediction and provides realistic uncertainty estimates based on probabilistic inference. In this work, we furthermore conduct a comprehensive study of the requirements on training data, the training procedures and the typical failure modes. Using three diverse and large-scale data sets, we conduct targeted experiments and a qualitative error analysis. Our analysis gives several key insights. First, we can precisely pinpoint the effect of large training data sets; for example, a four times larger training set reduces average errors for both magnitude and location prediction by more than half, and reduces the required time for real-time assessment by a factor of four. Second, the basic model systematically underestimates large magnitude events. This issue can be mitigated, and in some cases completely resolved, by incorporating events from other regions into the training through transfer learning. Third, location estimation is highly precise in areas with sufficient training data, but is strongly degraded for events outside the training distribution, sometimes producing massive outliers.
Our analysis suggests that these characteristics are not only present for our model, but for most deep learning models for fast assessment published so far. They result from the black-box modeling, and their mitigation will likely require imposing physics-derived constraints on the neural network. These characteristics need to be taken into consideration for practical applications.
Recently, multiple studies investigated deep learning on raw seismic waveforms for the fast assessment of earthquake parameters, such as magnitude (e.g. Lomax et al. (2019), Mousavi & Beroza (2020), van den Ende & Ampuero (2020)), location (e.g. Kriegerowski et al. (2019), Mousavi & Beroza (2019), van den Ende & Ampuero (2020)) and peak ground acceleration (e.g. Jozinović et al. (2020)). Deep learning is well suited for these tasks, as it does not rely on manually selected features, but can learn to extract relevant information from the raw input data. This property allows the models to use the full information contained in the waveforms of an event. However, the models published so far use fixed time windows and can not be applied to data of varying length without retraining. Similarly, except the model by van den
Ende & Ampuero (2020), all models process either waveforms from only a single seismic station or rely on a fixed set of seismic stations defined at training time. The model by van den Ende & Ampuero (2020) enables the use of a variable station set, but combines measurements from multiple stations using a simple pooling mechanism. While it has not been studied so far in a seismological context, it has been shown in the general domain that set pooling architectures are in practice limited in the complexity of functions they can model (Lee et al., 2019).

Here we introduce a new model for magnitude and location estimation based on the architecture recently introduced for the transformer earthquake alerting model (TEAM) (Münchmeyer et al., 2020a), a deep learning based earthquake early warning model. While TEAM estimated PGA at target locations, our model estimates magnitude and hypocentral location of the event. We call our adaptation TEAM-LM, TEAM for location and magnitude estimation. We use TEAM as a basis due to its flexible multi-station approach and its ability to process incoming data effectively in real time, issuing updated estimates as additional data become available. Similar to TEAM, TEAM-LM uses mixture density networks to provide probability distributions rather than merely point estimates as predictions. For magnitude estimation, our model outperforms two state of the art baselines, one using deep learning (van den Ende & Ampuero, 2020) and one classical approach (Kuyuk & Allen, 2013). For location estimation, our model outperforms a deep learning baseline (van den Ende & Ampuero, 2020) and shows promising performance in comparison to a classical localization algorithm.

We note a further deficiency of previous studies for deep learning in seismology. Many of these pioneering studies focused their analysis on the average performance of the proposed models.
Therefore, little is known about the conditions under which these models fail, the impact of training data characteristics, the possibility of sharing knowledge across world regions, and the effect of specific training strategies. All of these are of particular interest when considering practical application of the models.

To address these issues and provide guidance for practitioners, we perform a comprehensive evaluation of TEAM-LM on three large and diverse data sets: a regional broadband data set from Northern Chile, a strong motion data set from Japan, and another strong motion data set from Italy. These data sets differ in their seismotectonic environment (North Chile and Japan: subduction zones; Italy: dominated by both convergent and divergent continental deformation), their spatial extent (North Chile: regional scale; Italy and Japan: national catalogs), and the instrument type (North Chile: broadband; Italy and Japan: strong motion). All three data sets contain hundreds of thousands of waveforms. North Chile is characterized by a relatively sparse station distribution, but a large number of events and a low magnitude of completeness. There are far more stations in the Italy and Japan data sets, but a smaller number of earthquakes. This selection of diverse data sets allows for a comprehensive analysis, giving insights for different use cases. Our targeted experiments show that the observed characteristics are rooted in the principal structure used by TEAM-LM, i.e., the black-box approach of learning a very flexible model from data, without imposing any physical constraints. As this black-box approach is common to all current fast assessment models using deep learning, our findings can be transferred to these models. This finding is further backed by comparison to the results from previous studies.
For this study we use three data sets (Table 1, Figure 1): one from Northern Chile, one from Italy and one from Japan. The Chile data set is based on the catalog by Sippl et al. (2018) with the magnitude values from Münchmeyer et al. (2020b). While there were minor changes in the seismic network configuration during the time covered by the catalog, the station set used in the construction of this catalog had been selected to provide a high degree of stability of the location accuracy throughout the observational period (Sippl et al., 2018). Similarly, the magnitude scale has been carefully calibrated to achieve a high degree of consistency in spite of significant variations of attenuation (Münchmeyer et al., 2020b). This data set therefore contains the highest quality labels among the data sets in this study. For the Chile data set, we use broadband seismograms from the fixed set of 24 stations used for the creation of the original catalog and magnitude scale. Although the Chile data set has the smallest number of stations of the three data sets, it comprises three to four times as many waveforms as the other two due to the large number of events.

The data sets for Italy and Japan are more focused on early warning, containing fewer events and only strong motion waveforms. They are based on catalogs from the INGV (ISIDe Working Group, 2007) and the NIED KiK-net (National Research Institute For Earth Science And Disaster Resilience, 2019), respectively. The data sets each encompass a larger area than the Chile data set and include waveforms from significantly more stations. In contrast to the Chile data set, the station coverage differs strongly between different events, as only stations recording the event are considered. In particular, KiK-net stations do not record continuous waveforms, but operate in trigger mode, only saving waveforms if an event triggered at the station. For Japan each station comprises two sensors, one at the surface and one borehole sensor.
Therefore for Japan we have 6-component recordings (3 surface, 3 borehole) available instead of the 3-component recordings for Italy and Chile.
Figure 1: Overview of the data sets. The top row shows the spatial station distribution, the second row the spatial event distribution. The event depth is encoded using color. Higher resolution versions of the maps can be found in the supplementary material (Figures SM 1, SM 2, SM 3). The bottom row shows the distributions of the event magnitudes. The magnitude scales are the peak displacement based M_A, local magnitude M_L, moment magnitude M_W, body wave magnitude m_b and M_JMA, a magnitude primarily using peak displacement.
Table 1: Overview of the data sets. The lower boundary of the magnitude category is the 5th percentile of the magnitude; this limit is chosen as each data set contains a small number of unrepresentative very small events. The upper boundary is the maximum magnitude. Magnitudes are given with two digit precision for Chile, as the precision of the underlying catalog is higher than for Italy and Japan. The Italy data set uses different magnitudes for different events, which are M_L (> 90 % of the events), M_W (< 10 %) and m_b (remaining events).

                      Chile                        Italy              Japan
Magnitude scale       M_A                          M_L, M_W, m_b      M_JMA
Magnitude             [...]                        [...]              [...]
Depth [km]            0 - 102 - 183                0 - 10 - 617       0 - 19 - 682
Distance [km]         0.1 - 180 - 640              0.1 - 180 - 630    0.2 - 120 - 3190
Events                96,133                       7,055              13,512
Unique stations       24                           1,080              697
Traces                1,605,983                    494,183            372,661
Traces per event      16.7                         70.3               27.6
Sensor type           BB                           SM                 SM & SM-borehole
Catalog source        Münchmeyer et al. (2020b)    INGV               NIED

A full list of seismic networks used in this study can be found in the appendix (Table SM 1).

For each data set we use the magnitude scale provided in the catalog. For the Chile catalog, this is M_A, a peak displacement based scale, but without the Wood-Anderson response and therefore saturation-free for large events (Münchmeyer et al., 2020b; Deichmann, 2018). For Japan M_JMA is used. M_JMA combines different magnitude scales, but similarly to M_A primarily uses horizontal peak displacement (Doi, 2014). For Italy the catalog provides different magnitude types approximately dependent on the size of the event: M_L (> 90 % of the events), M_W (< 10 %) and m_b (remaining events). The test set contains the largest events of the 2016 Central Italy sequence, which occurred in August (M_W = 6.[...]) and October (M_W = 6.[...]). Notably, the largest event in the test set is significantly larger than the largest event in the training set (M_W = 6.[...] L'Aquila event in 2007), representing a challenging test case. For Italy, we assign the remaining events to training and development set randomly with a 6:1 ratio.

The Italy data set has been published as Münchmeyer et al. (2020). Download instructions for the Japan data set are available at https://github.com/yetinam/TEAM. A data publication for the Chile data set with GFZ data services is under preparation.
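The random 6:1 train/development split for Italy can be sketched as follows; the helper name, seed and event IDs are illustrative assumptions, not taken from the published implementation.

```python
import random

# Hedged sketch of an event-level 6:1 train/development split.
# Splitting by event (not by trace) keeps all traces of one event
# on the same side of the split.
def split_events(event_ids, ratio=6, seed=42):
    rng = random.Random(seed)          # fixed seed for reproducibility
    ids = sorted(event_ids)
    rng.shuffle(ids)
    n_dev = len(ids) // (ratio + 1)    # 1 part dev for every 6 parts train
    return ids[n_dev:], ids[:n_dev]    # (train, dev)
```

Splitting at the event level avoids a knowledge leak in which traces of the same event end up in both the training and the development set.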
We build a model for real time earthquake magnitude and location estimation based on the core ideas of the transformer earthquake alerting model (TEAM), as published in Münchmeyer et al. (2020a). TEAM is an end-to-end peak ground acceleration (PGA) model calculating probabilistic PGA estimates based on incoming waveforms from a flexible set of stations. It employs the transformer network method (Vaswani et al., 2017), an attention based neural network which was developed in the context of natural language processing (NLP), at the core of its algorithm. Here, we adapt TEAM to calculate real time probabilistic estimates of event magnitude and hypocentral location. As our model closely follows the architecture and key ideas of TEAM, we use the name TEAM-LM to refer to the location and magnitude estimation model.

Figure 2: Overview of the adapted transformer earthquake alerting model, showing the input, the feature extraction, the feature combination, the magnitude/location estimation and the output. For simplicity, not all layers are shown; only their order and combination is visualized schematically. For the exact number of layers and the size of each layer please refer to Tables SM 2 to SM 4. Please note that the number of input stations is variable, due to the self-attention mechanism in the feature combination.

Similar to TEAM, TEAM-LM consists of three major components (Figure 2): a feature extraction, which generates features from raw waveforms at single stations; a feature combination, which aggregates features across multiple stations; and an output estimation. Here, we briefly discuss the core ideas of the TEAM architecture and training and put a further focus on the necessary changes for magnitude and location estimation. For a more detailed account of TEAM and TEAM-LM we refer to Münchmeyer et al. (2020a), Tables SM 2 to SM 4 and the published implementation.

The input to TEAM consists of three component seismograms from multiple stations and their locations. TEAM aligns all seismograms to start and end at the same times t_start and t_end. We choose t_start to be 5 seconds before the first P arrival at any station. This allows the model to understand the noise conditions at all stations. We limit t_end to be at latest t_start + 30 s. In a real-time scenario t_end is the current time, i.e., it determines the available amount of waveforms, and we use the same approach to imitate real-time waveforms in training and evaluation. The waveforms are padded with zeros to a length of 30 s to achieve constant length input to the feature extraction.

TEAM uses a CNN architecture for feature extraction, which is applied separately at each station.
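The alignment and zero padding of the input windows can be sketched as follows; the sampling rate, function name and array layout are assumptions for illustration, not taken from the published code.

```python
import numpy as np

SAMPLING_RATE = 100   # Hz, assumed for illustration
WINDOW_S = 30         # fixed input length in seconds
PRE_P_S = 5           # noise context before the first P arrival

def prepare_window(waveform, first_p_s, t_now_s):
    """Cut the window [first P - 5 s, current time] from a (3, n) record
    and zero-pad it to a constant (3, 30 s * rate) array."""
    t_start = max(first_p_s - PRE_P_S, 0.0)
    # in real time, only samples up to the current time are available
    t_end = min(t_now_s, t_start + WINDOW_S)
    i0 = int(t_start * SAMPLING_RATE)
    i1 = int(t_end * SAMPLING_RATE)
    cut = waveform[:, i0:i1]
    padded = np.zeros((3, WINDOW_S * SAMPLING_RATE))
    padded[:, :cut.shape[1]] = cut   # zeros imitate not-yet-recorded data
    return padded
```

During training, drawing the current time at random in this cutting logic imitates real-time waveforms.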
The feature extraction architecture consists of several convolution and pooling layers, followed by a multi-layer perceptron (Table SM 2). To avoid scaling issues, each input waveform is normalized through division by its peak amplitude. As the amplitude is expected to be a key predictor for the event magnitude, we provide the logarithm of the peak amplitude as a further input to the multi-layer perceptron inside the feature extraction network. We ensure that this transformation does not introduce a knowledge leak by calculating the peak amplitude only based on the waveforms until t_end. The full feature extraction returns one vector for each station, representing the measurements at the station.

The feature vectors from multiple stations are combined using a transformer network (Vaswani et al., 2017). Transformers are attention based neural networks, originally introduced for natural language processing. A transformer takes a set of n vectors as input, and outputs again n vectors which now incorporate the context of each other. The attention mechanism allows the transformer to put special emphasis on inputs that it considers particularly relevant and thereby model complex inter-station dependencies. Importantly, the parameters of the transformer are independent of the number of input vectors n, allowing a transformer to be trained and applied on variable station sets. To give the transformer a notion of the position of the stations, TEAM encodes the latitude, longitude and elevation of the stations using a sinusoidal embedding and adds this embedding to the feature vectors.

TEAM adds the position embeddings of the PGA targets as additional inputs to the transformer. In TEAM-LM, we aim to extract information about the event itself, whose position we do not know in advance. To achieve this, we add an event token, which is a vector with the same dimensionality as the positional embedding of a station location, and which can be thought of as a query vector.
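The per-station amplitude handling described above (peak normalization plus a log peak amplitude side channel) can be sketched as follows; shapes and names are illustrative assumptions.

```python
import numpy as np

def normalize_station(waveform, n_available):
    """waveform: (3, n) zero-padded array; n_available: number of samples
    up to the current time. Returns the peak-normalized waveform and the
    log10 peak amplitude. Restricting the peak to the available samples
    avoids a knowledge leak from not-yet-recorded data."""
    peak = np.abs(waveform[:, :n_available]).max()
    peak = max(peak, 1e-10)   # guard against all-zero (fully padded) traces
    return waveform / peak, np.log10(peak)
```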
This event token approach is inspired by the so-called sentence tokens in NLP that are used to extract holistic information on a sentence (Devlin et al., 2018). The elements of this event query vector are learned during the training procedure.

From the transformer output, we only use the output corresponding to the event token, which we term the event embedding and which is passed through another multi-layer perceptron predicting the parameters and weights of a mixture of Gaussians (Bishop, 1994). We use N = 5 Gaussians for magnitude and N = 15 Gaussians for location estimation. For computational and stability reasons, we constrain the covariance matrix of the individual Gaussians for location estimation to a diagonal matrix to reduce the output dimensionality. Even though uncertainties in latitude, longitude and depth are known to generally be correlated, this correlation can be modeled with diagonal covariance matrices by using the mixture.

The model is trained end-to-end using a log-likelihood loss with the Adam optimizer (Kingma & Ba, 2014). We train separate models for magnitude and for location. As we observed difficulties in the onset of the optimization when starting from a fully random initialization, we pretrain the feature extraction network. To this end we add a mixture density network directly after the feature extraction and train the resulting network to predict magnitudes from single station waveforms. We then discard the mixture density network and use the weights of the feature extraction as initialization for the end-to-end training. We use this pretraining method for both magnitude and localization networks.

Similarly to the training procedure for TEAM, we make extensive use of data augmentation during training. First, we randomly select a subset of up to 25 stations from the available station set. We limit the maximum number to 25 for computational reasons. Second, we apply temporal blinding by zeroing waveforms after a random time t_end.
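The mixture-density output described above is trained with a log-likelihood loss; the following is a minimal sketch of the negative log-likelihood of a diagonal-covariance Gaussian mixture (assumed array layout, not the released implementation).

```python
import numpy as np

def mixture_nll(weights, means, stds, target):
    """weights: (N,) mixture weights summing to 1; means, stds: (N, D)
    per-component parameters with diagonal covariance; target: (D,)."""
    # log density of each Gaussian component at the target
    log_comp = -0.5 * np.sum(((target - means) / stds) ** 2
                             + np.log(2 * np.pi * stds ** 2), axis=1)
    # numerically stable log-sum-exp over the weighted components
    m = log_comp.max()
    return -(m + np.log(np.sum(weights * np.exp(log_comp - m))))
```

Although each component has a diagonal covariance, correlations between latitude, longitude and depth can still be expressed through the arrangement of the components within the mixture.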
Such temporal blinding allows TEAM-LM to be applied to real time data. To avoid knowledge leaks for Italy and Japan, we
only use stations as inputs that triggered before time t_end for these data sets. This is not necessary for Chile, as there the maximum number of stations per event is below 25 and waveforms for all events are available for all stations active at that time, irrespective of whether the station actually recorded the event. Third, we oversample large magnitude events, as they are strongly underrepresented in the training data set. We discuss the effect of this augmentation in further detail in the Results section. In contrast to the station selection during training, in evaluation we always use the 25 stations picking first. Again, we only use stations and their waveforms as input once they triggered, thereby ensuring that the station selection does not introduce a knowledge leak.

Recently, van den Ende & Ampuero (2020) suggested a deep learning method capable of incorporating waveforms from a flexible set of stations. Their architecture uses a similar CNN based feature extraction as TEAM-LM. In contrast to TEAM-LM, for feature combination it uses maximum pooling to aggregate the feature vectors from all stations instead of a transformer. In addition they do not add predefined position embeddings, but concatenate the feature vector for each station with the location coordinates and apply a multi-layer perceptron to get the final feature vectors for each station. The model of van den Ende & Ampuero (2020) is both trained and evaluated on 100 s long waveforms. Therefore it is not suitable for real time processing. Furthermore, the authors did not publish an implementation of the model. For these reasons we are not able to compare the exact model of van den Ende & Ampuero (2020) to TEAM-LM.

To still compare TEAM-LM to the techniques introduced in this approach, we implemented a model based on the key concepts of van den Ende & Ampuero (2020).
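The baseline's set aggregation can be sketched in a few lines; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def max_pool_stations(features):
    """features: (n_stations, d) array of per-station feature vectors.
    The element-wise maximum yields one (d,) event-level feature vector
    that is invariant to station order and count."""
    return features.max(axis=0)
```

Because the maximum is taken element-wise, the result is identical for any ordering or number of stations, but unlike attention it cannot weight the evidence of one station against another.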
As we aim to evaluate the performance differences arising from the conceptual changes, rather than from different hyperparameters, e.g., the exact size and number of the convolutional layers, we use the same architecture as TEAM-LM for the feature extraction and the mixture density output. Additionally we train the model for real time processing using zero padding. In comparison to TEAM-LM, we replace the transformer with a maximum pooling operation and remove the event token.

We evaluate two different representations for the position encoding. In the first, we concatenate the positions to the feature vectors as proposed by van den Ende & Ampuero (2020). In the second, we add the position embeddings element-wise to the feature vectors as for TEAM-LM. In both cases, we run a three-layer perceptron over the combined feature and position vector for each station, before applying the pooling operation.

We use the fast magnitude estimation approach of Kuyuk & Allen (2013) as a classical, i.e., non deep-learning, baseline for magnitude. The magnitude is estimated from the horizontal peak displacement in the first seconds of the P wave. As this approach needs to know the hypocentral distance, it requires knowledge of the event location. We simply provide the method with the catalog hypocenter. While this would not be possible in real time, and therefore gives the method an unfair advantage over the deep learning approaches, it allows us to focus on the magnitude estimation capabilities. Furthermore, in particular for Italy and Japan, the high station density usually allows for sufficiently well constrained location estimates at early times. For a full description of this baseline, see supplement section SM 2.

As a classical location baseline we employ NonLinLoc (Lomax et al., 2000) with the 1D velocity models from Graeber & Asch (1999) (Chile), Ueno et al. (2002) (Japan) and Matrullo et al. (2013) (Italy). For the earliest times after the event detection, usually only few picks are available.
Therefore we apply two heuristics. Until at least 3/5/5 (Chile/Japan/Italy) picks are available, the epicenter is estimated as the arithmetic mean of the coordinates of the stations with picked arrivals so far, while the depth is set to the median depth in the training data set. Until at least 4/7/7 picks are available, we apply NonLinLoc, but fix the depth to the median depth in the data set. We require higher numbers of picks for Italy and Japan, as the pick quality is lower than in Chile but the station density is higher. This leads to worse early NonLinLoc estimates in Italy and Japan compared to Chile, but improves the performance of the heuristics.
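The early-time fallback heuristic can be sketched as follows; the function name and coordinate handling are illustrative assumptions (a careful implementation would average on the sphere rather than on raw latitude/longitude values).

```python
import numpy as np

def heuristic_location(picked_coords, median_depth_km):
    """picked_coords: (n, 2) array of (lat, lon) of the stations with
    picked arrivals so far. Returns the arithmetic-mean epicenter and
    the training-set median depth as a fallback hypocenter."""
    lat, lon = np.asarray(picked_coords, dtype=float).mean(axis=0)
    return lat, lon, median_depth_km
```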
We first compare the estimation capabilities of TEAM-LM to the baselines in terms of magnitude (Figure 3). We evaluate the models at fixed times t = 0.5 s, 1 s, 2 s, 4 s, 8 s, 16 s, 25 s after the first P arrival at any station in the network. In addition to presenting selected results here, we provide tables with the results of further experiments in the supplementary material (Tables SM 6–SM 16).

TEAM-LM outperforms the classical magnitude baseline consistently. On two data sets, Chile and Italy, the performance of TEAM-LM with only 0.5 s of data is superior to the baseline with 25 s of data. Even on the third data set, Japan, TEAM-LM requires only approximately a quarter of the time to reach the same precision as the classical baseline and achieves significantly higher precision after 25 s. The RMSE for TEAM-LM stabilizes after 16 s for all data sets
Figure 3: RMSE of the mean magnitude predictions from TEAM-LM, the pooling model with sinusoidal location embeddings (POOL-E), the pooling model with concatenated positions (POOL-C) and the classical baseline method. The time indicates the time since the first P arrival at any station; the RMSE is provided in magnitude units [m.u.]. Error bars indicate ± standard deviation when training the model with different random initializations. For better visibility, error bars are provided with a small x-offset. Standard deviations were obtained from six realisations. Note that the uncertainty of the provided means is smaller than the given standard deviation by a factor of √6, due to the number of samples. We provide no standard deviation for the baseline, as it does not depend on a model initialization.

Figure 4: RMSE comparison of the TEAM-LM mean magnitude predictions for different magnitude buckets. Linestyles indicate the model type: trained only on the target data (solid line), using transfer learning (dashed), classical baseline (dotted). For Chile/Italy/Japan we count events as small if their magnitude is below [...]/[...]/[...] and as large if their magnitude is at least [...]/[...]/[...]. The time indicates the time since the first P arrival at any station; the RMSE is provided in magnitude units [m.u.].

with final values of 0.08 m.u. for Chile, 0.20 m.u. for Italy and 0.22 m.u. for Japan. The performance differences between TEAM-LM and the classical baseline result from the simplified modelling assumptions of the baseline. While the relationship between early peak displacement and magnitude only holds approximately, TEAM-LM can extract more nuanced features from the waveform. In addition, the relationship for the baseline was originally calibrated for a moment magnitude scale.
While all magnitude scales have an approximate 1:1 relationship with moment magnitude, this might introduce further errors.

We further note that the performance of the classical baseline for Italy is consistent with the results reported by Festa et al. (2018). They analyzed early warning performance in a slightly different setting, looking only at the 9 largest events in the 2016 Central Italy sequence. However, they report a RMSE of 0.28 m.u. for the PRESTO system 4 s after the first alert, which matches approximately the 8 s value in our analysis. Similarly, Leyton et al. (2018) analyze how fast magnitudes can be estimated in subduction zones, and obtain values of [...] across all events and [...] for the largest events (M_w > [...]) at 30 s after origin time. This matches the observed performance of the classical baseline for Japan. For Chile, our classical baseline performs considerably worse, likely caused by the many small events with bad SNR compared to the event set considered by Leyton et al. (2018). However, TEAM-LM still outperforms the performance numbers reported by Leyton et al. (2018) by a factor of more than 2.
Figure 5: True and predicted magnitudes without upsampling or transfer learning (left column), with upsampling but without transfer learning (middle column) and with upsampling and transfer learning (right column). All plots show predictions after 8 seconds. In the transfer column for Chile and Japan we show results after fine-tuning on the target data set; for Italy we show results from the model without fine-tuning, as this model performed better. For the largest events in Italy (M > [...]) we additionally show the results after fine-tuning with pale blue dots. We suspect the degraded performance of the fine-tuned model results from the fact that the largest training event (M_W = 6.[...]) is considerably smaller than the largest test event (M_W = 6.[...]). Vertical lines indicate the borders between small, medium and large events as defined in Figure 4. The orange lines show the running 5th, 50th and 95th percentiles in 0.2 m.u. buckets. Percentile lines are only shown if sufficiently many data points are available. The very strong outlier for Japan (true ∼ [...], predicted ∼ [...]) [...]
Improvements for TEAM-LM in comparison to the deep learning baseline variants are much smaller than in comparison to the classical approach. Still, for the Japan data set at late times, TEAM-LM offers improvements of up to 27 % for magnitude. For the Italy data set, the baseline variants are on par with TEAM-LM. For Chile, only the baseline with position embeddings is on par with TEAM-LM. Notably, for the Italy and Japan data sets, the standard deviation between multiple runs with different random model initialization is considerably higher for the baselines than for TEAM-LM (Figure 3, error bars). This indicates that the training of TEAM-LM is more stable with regard to model initialization.

The gains of TEAM-LM can be attributed to two differences: the transformer for station aggregation and the position embeddings. In our experiments we ruled out further differences, e.g., size and structure of the feature extraction CNN, by using identical network architectures for all parts except the feature combination across stations. We think that the gains from using a transformer can be explained by its attention mechanism. The attention allows the transformer to focus on specific stations, for example the stations which have recorded the longest waveforms so far. In contrast, the maximum pooling operation is less flexible. We suspect that the high gains for Japan result from the wide spatial distribution of seismicity and the resulting highly variable station distribution. While in Italy most events are in Central Italy and in Chile the number of stations is limited, the seismicity in Japan occurs along the whole subduction zone with additional onshore events. This complexity can likely be handled better with the flexibility of the transformer than with a pooling operation.

In many use cases, the performance of magnitude estimation algorithms for large magnitude events is of particular importance.
In Figure 4 we compare the RMSE of TEAM-LM and the classical baselines for subsets binned by catalog magnitude and observe a clear dependence on the event magnitude. For all data sets the RMSE for large events is higher than for intermediate sized events, which in turn is higher than for small events. On the other hand, the decrease in RMSE over time is strongest for large events. This general pattern can also be observed for the classical baseline, even though the difference in RMSE between magnitude buckets is smaller. As both variants of the deep learning baseline show very similar trends to TEAM-LM, we omit them from this discussion.

We discuss two possible causes for these effects: (i) the magnitude distribution in the training set restricts the quality of the model optimization, (ii) inherent characteristics of large events. Cause (i) arises from the Gutenberg-Richter distribution of magnitudes. As large magnitudes are rare, the model has significantly fewer examples to learn from for large magnitudes than for small ones. This should impact the deep learning models the strongest, due to their high number of parameters. Cause (ii) has a geophysical origin. As large events have longer rupture durations, the information gain from longer waveform recordings is larger for large events. We probe the likely individual contributions of these causes in the following.

Estimations for large events not only show lower precision, but are also biased (Figure 5, middle column). For Chile and Italy a clear saturation sets in for large events. Interestingly, the saturation starts at different magnitudes, which are around 5.5 for Italy and 6.0 for Chile. For Japan, events up to magnitude 7 are predicted without obvious bias. This saturation behavior is not only visible for TEAM-LM, but has also been observed in prior studies, e.g., in Mousavi & Beroza (2020) (their Figs. 3, 4). In their work, with a network trained on significantly smaller events, the saturation already set in around magnitude 3.
The different saturation thresholds indicate that the primary cause for saturation is not the longer rupture duration of large events or other inherent event properties, as in cause (ii), but is instead likely related to the low number of training examples for large events, rendering it nearly impossible to learn their general characteristics, as in cause (i). This explanation is consistent with the much higher saturation threshold for the Japanese data set, whose training data contain a comparably large number of large events, encompassing the year 2011 with the Tohoku event and its aftershocks.

As a further check of cause (i), we trained models without upsampling large magnitude events during training, thereby reducing the occurrence of large magnitude events to the natural distribution observed in the catalog (Figure 5, left column). While the overall performance stays similar, the performance for large events is degraded on each of the data sets. Large events are on average underestimated even more strongly. We tried different upsampling rates, but were not able to achieve significantly better performance for large events than the configuration of the preferred model presented in this paper. This shows that upsampling yields improvements, but cannot solve the issue completely, as it does not introduce actual additional data. On the other hand, the performance gains for large events from upsampling seem to cause no observable performance drop for smaller events. As the magnitude distribution in most regions approximately follows a Gutenberg-Richter law with b ≈ 1, upsampling rates similar to the ones used in this paper will likely work for other regions as well.

We evaluate the epicentral error distributions in terms of the 50th, 90th, 95th and 99th error percentiles (Figure 6). In terms of the median epicentral error, TEAM-LM outperforms all baselines in all cases, except for the classical baseline
Figure 6: Violin plots and error quantiles of the distributions of the epicentral errors for TEAM-LM, the pooling baseline with position embeddings (POOL-E), the pooling baseline with concatenated position (POOL-C), TEAM-LM with transfer learning (TEAM-TRA) and a classical baseline. Vertical lines mark the 50th, 90th, 95th and 99th error percentiles, with smaller markers indicating higher quantiles. The time indicates the time since the first P arrival at any station. We compute errors based on the mean location predictions. A similar plot for hypocentral errors is available in the supplementary material (Figure SM 4).
at late times in Italy. For all data sets, TEAM-LM shows a clear decrease in median epicentral error over time. The decrease is strongest for Chile, going from 19 km at 0.5 s to 2 km at 25 s. For Italy the decrease is from 7 km to 2 km, for Japan from 22 km to 14 km. For all data sets the error distributions are heavy tailed. While for Chile even the errors at high quantiles decrease considerably over time, these quantiles stay nearly constant for Italy and Japan.

Similar to the difficulties for large magnitudes, the characteristics of the location estimation point to insufficient training data as the source of errors. The Chile data set covers the smallest region and has by far the lowest magnitude of completeness, leading to the highest event density. Consequently, the location estimation performance is best and outliers are very rare. For the Italy and Japan data sets, significantly more events occurred in regions with only few training events, causing strong outliers. The errors for the Japanese data set are highest, presumably related to the large number of offshore events with consequently poor azimuthal coverage.

We expect a further difference from the number of unique stations. While for a small number of unique stations, as in the Chile data set, the network can mostly learn to identify the stations using their position embeddings, it might be unable to do so for a larger number of stations with fewer training examples per station. Therefore, the task is significantly more complicated for Italy and Japan, where the concept of station locations has to be learned simultaneously with the localization task. This holds true even though we encode the station locations using continuously varying position embeddings.
Furthermore, whereas for moderate and large events waveforms from all stations of the Chilean network will show the earthquake and can contribute information, the limitation to 25 stations in the current TEAM-LM implementation does not allow a full exploitation of the information contained in the hundreds of recordings of larger events in the Japanese and Italian data sets. This will matter in particular for out-of-network events, where the wavefront curvature, and thus the event distance, can only be estimated properly by considering stations with later arrivals.

Looking at the classical baseline, we see that it performs considerably worse than TEAM-LM in the Chile data set in all location quantiles, better than TEAM-LM in all but the highest quantiles at late times in the Italy data set, and worse than TEAM-LM at late times in the Japan data set. This strongly different behavior can largely be explained by the pick quality and the station density in the different data sets. While the Chile data set contains high quality automatic picks, obtained using the MPX picker (Aldersons, 2004), the Italy data set uses a simple STA/LTA and the Japan data set uses triggers from KiK-net. This reduces location quality for Italy and Japan, in particular when only a low number of picks is available for location. On the other hand, the very good median performance of the classical approach for Italy can be explained by the very high station density, giving a strong prior on the location. An epicentral error of around 2 km after 8 s is furthermore consistent with the results of Festa et al. (2018).
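A minimal version of such an STA/LTA trigger is sketched below; the window lengths and threshold are illustrative, not the parameters used for the Italy catalog.

```python
def sta_lta(trace, n_sta, n_lta):
    """Short-term over long-term average of absolute amplitudes, evaluated
    once the long window is filled. A pick is declared where the ratio
    exceeds a threshold."""
    ratios = []
    for i in range(n_lta, len(trace) + 1):
        sta = sum(abs(x) for x in trace[i - n_sta:i]) / n_sta
        lta = sum(abs(x) for x in trace[i - n_lta:i]) / n_lta
        ratios.append(sta / lta if lta > 0 else 0.0)
    return ratios

def trigger_index(trace, n_sta, n_lta, threshold=3.0):
    """Index of the first sample whose STA/LTA ratio exceeds the threshold,
    or None if no trigger occurs."""
    for j, r in enumerate(sta_lta(trace, n_sta, n_lta)):
        if r > threshold:
            return n_lta + j - 1  # convert back to a trace sample index
    return None
```

Because the short window reacts faster than the long one, an impulsive arrival drives the ratio up; on emergent or noisy onsets the trigger is late or missed, which is one reason such picks are of lower quality than dedicated picker output.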
Considering the reduction in error due to the high station density in Italy, we note that the wide station spacing in Chile likely causes higher location errors than would be achievable with a denser seismic network designed for early warning.

In addition to the pick quality, the assumption of a 1D velocity model for NonLinLoc introduces a systematic error into the localization, in particular for the subduction regions in Japan and Chile, where the 3D structure deviates considerably from the 1D model. Because of these limitations, the classical baseline could be improved by employing more proficient pickers or fine-tuned velocity models. Nonetheless, in particular the results from Chile, where the classical baseline has access to high quality P picks, suggest that TEAM-LM can, given sufficient training data, outperform classical real-time localization algorithms.

For magnitude estimation, no consistent performance differences are visible between the baseline approach with position embeddings and the approach with concatenated coordinates, as originally proposed by van den Ende & Ampuero (2020). In contrast, for location estimation, the approach with embeddings consistently outperforms the approach with concatenated coordinates. We speculate that the positional embeddings show better performance because they explicitly encode information on how to interpolate between locations at different scales, enabling an improved exploitation of the information from stations with few or no training examples. This is more important for location estimation, where an explicit notion of relative position is required. In contrast, magnitude estimation can use further information, like frequency content, which is less position dependent.
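A sketch of such a multi-scale sinusoidal embedding of a continuous coordinate is shown below; the dimension count and the scale range are assumptions for illustration, not the values used by TEAM-LM.

```python
import math

def position_embedding(coord_km, dims=8, min_scale=10.0, max_scale=1000.0):
    """Embed a continuous coordinate (e.g. northing in km) as sine/cosine
    pairs at geometrically spaced spatial scales, so that nearby coordinates
    receive similar vectors and interpolation is meaningful at several
    scales simultaneously."""
    half = dims // 2
    emb = []
    for i in range(half):
        scale = min_scale * (max_scale / min_scale) ** (i / max(1, half - 1))
        emb.append(math.sin(coord_km / scale))
        emb.append(math.cos(coord_km / scale))
    return emb
```

The geometric spacing of scales is what gives the embedding its interpolation property: coarse components place a station in the network at large, while fine components resolve its position relative to close neighbors.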
A common strategy for mitigating data sparsity is the injection of additional information from related data sets through transfer learning (Pan & Yang, 2009), in our use case waveforms from other source regions. This way, the model is supposed to be taught the properties of earthquakes that are consistent across regions, e.g., attenuation due to geometric spreading or the magnitude dependence of source spectra. Note that a similar knowledge transfer is implicitly part of the classical baseline, as it was calibrated using records from multiple regions.

Here, we conduct a transfer learning experiment inspired by the transfer learning used for TEAM. We first train a model jointly on all data sets and then fine-tune it to each of the target data sets. This way, the model has more training examples, which is of special relevance for the rare large events, but is still adapted specifically to the target data set.
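The pretrain-then-fine-tune scheme can be illustrated with a toy scalar least-squares model; the real experiment fine-tunes the full TEAM-LM network, and the learning rate and step counts here are arbitrary choices.

```python
def sgd_step(w, batch, lr):
    """One gradient step for the scalar linear model y = w * x under a
    mean squared error loss."""
    g = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * g

def pretrain_then_finetune(joint_data, target_data, lr=0.05, steps=200):
    """Transfer learning schematic: first fit on the union of all regions,
    then continue training on the target region only, starting from the
    pretrained parameters instead of a random initialization."""
    w = 0.0
    for _ in range(steps):
        w = sgd_step(w, joint_data, lr)   # joint pretraining on all data sets
    for _ in range(steps):
        w = sgd_step(w, target_data, lr)  # fine-tuning on the target data set
    return w
```

The fine-tuned parameters end up adapted to the target data, but the pretraining determines the starting point, which for a high-dimensional network influences which solution the optimization reaches.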
As the Japan and Italy data sets contain acceleration traces, while the Chile data set contains velocity traces, we first integrate the Japan and Italy waveforms to obtain velocity traces. This does not have a significant impact on the model performance, as visible in the full results tables (Tables SM 6 to SM 9).

Transfer learning reduces the saturation for large magnitudes (Figure 5, right column). For Italy the saturation is even completely eliminated. For Chile, while the largest magnitudes are still underestimated, we see a clearly lower level of underestimation than without transfer learning. Results for Japan for the largest events show nearly no difference, which is expected as the Japan data set contains the majority of large events and therefore does not gain significant additional large training examples through transfer learning. The positive impact of transfer learning is also reflected in the lower RMSE for large and intermediate events for Italy and Chile (Figure 4). These results do not only offer a way of mitigating saturation for large events, but also represent further evidence for data sparsity as the reason for the underestimation.

We tried the same transfer learning scheme for mitigating mislocations (Figure 6). For this experiment we shifted the coordinates of stations and events such that the data sets spatially overlap. We note that this shifting is not expected to have any influence on the single data set performance, as the relative locations of events and stations within a data set stay unchanged and the model nowhere uses absolute locations. The transfer learning approach is reasonable, as mislocations might result from data sparsity, similarly to the underestimation of large magnitudes. However, none of the models shows significantly better performance than the preferred models, and in some instances performance even degrades.
We conducted additional experiments where shifts were applied separately for each event, but observed even worse performance.

We hypothesize that this behavior indicates that the TEAM-LM localization does not primarily rely on travel time analysis, but rather employs some form of fingerprinting of earthquakes. These fingerprints could be specific scattering patterns for certain source regions and receivers. Note that similar fingerprints are exploited in traditional template matching approaches (e.g., Shelly et al., 2007). While the travel time analysis should be mostly invariant to shifts and therefore be transferable between data sets, the fingerprinting is not. This would also explain why the transfer learning, where all training samples were already in the pretraining data set and therefore their fingerprints could be extracted, outperforms the shifting of single events, where fingerprints do not relate to earthquake locations.
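The rigid shift used in this experiment can be sketched as follows; because stations and events are moved together, all relative offsets, and hence travel times, are unchanged, which is why a purely travel-time-based localization would be unaffected by it.

```python
def shift_data_set(events, stations, dlat, dlon):
    """Apply one rigid horizontal shift to all events and stations of a
    data set so that different regions can be made to overlap spatially.
    Events are (lat, lon, depth) tuples, stations are (lat, lon) tuples."""
    shifted_events = [(lat + dlat, lon + dlon, depth) for lat, lon, depth in events]
    shifted_stations = [(lat + dlat, lon + dlon) for lat, lon in stations]
    return shifted_events, shifted_stations
```

A fingerprinting scheme, in contrast, ties a source region to the identity of specific stations and their scattering patterns, so it does not survive per-event shifts that break this association.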
Another common method to improve the quality of machine learning systems in the face of data sparsity is multi-task learning (Ruder, 2017), i.e., having a network with multiple outputs for different objectives and training it simultaneously on all objectives. This approach has previously been employed for seismic source characterisation (Lomax et al., 2019), but without an empirical analysis of the specific effects of multi-task learning.

We perform an experiment in which we train TEAM-LM to predict magnitude and location concurrently. The feature extraction and the transformer parts are shared, and only the final MLPs and the mixture density networks are specific to each task. This method is known as hard parameter sharing. The intuition is that the individual tasks share some similarity; e.g., in our case, the correct estimation of the magnitude likely requires an assessment of the attenuation and geometric spreading of the waves and therefore some understanding of the source location. This similarity is then expected to drive the model towards learning a solution for the problem that is more general, rather than specific to the training data. The reduced number of free parameters implied by hard parameter sharing is also expected to improve the generality of the derived model, provided the remaining degrees of freedom are still sufficient to extract the relevant information from the training data for each sub-task.

Unfortunately, we actually experience a moderate degradation of performance for either location or magnitude in each data set (Tables SM 6 to SM 12) when following a multi-task learning strategy. The RMSE of the mean epicenter estimate increases by at least one third for all times and data sets, and the RMSE for magnitude stays nearly unchanged for the Chile and Japan data sets, but increases by ∼20 % for the Italy data set. Our results therefore exhibit a case of negative transfer.

While it is generally not known under which circumstances multi-task learning shows positive or negative influence (Ruder, 2017), negative transfer usually seems to be caused by insufficiently related tasks. In our case, we suspect that while the tasks are related in the sense of the underlying physics, the training data set is large enough that the similarities relevant for both tasks can already be learned from a single objective. At the same time, the particularities of the two objectives are learned less well. Furthermore, we discussed earlier that both magnitude and location might not actually use travel time or attenuation based approaches, but rather frequency characteristics for magnitude and a fingerprinting scheme for location. These approaches would be less transferable between the two tasks. We conclude
that hard parameter sharing does not improve magnitude and location estimation. Future work is required to see whether other multi-task learning schemes can be applied beneficially.
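Hard parameter sharing combines the task losses into a single objective for the shared encoder. A schematic of this combination is below; the squared-error terms stand in for the mixture density log-likelihood losses actually used, and the weighting `alpha` is an illustrative choice.

```python
def multitask_loss(pred_mag, true_mag, pred_loc, true_loc, alpha=0.5):
    """Weighted sum of a magnitude loss and a location loss; in hard
    parameter sharing the gradients of both terms flow into the shared
    feature extractor and transformer, while each head only receives its
    own term."""
    loss_mag = (pred_mag - true_mag) ** 2
    loss_loc = sum((p - t) ** 2 for p, t in zip(pred_loc, true_loc))
    return alpha * loss_mag + (1.0 - alpha) * loss_loc
```

Whether such a combined objective helps depends on how much useful representation the tasks actually share; in our experiments, sharing the trunk hurt rather than helped.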
As all location error distributions are heavy tailed, we visually inspect the largest deviations between predicted and catalog locations to understand the behavior of the localization mechanism of TEAM-LM. We base this analysis on the Chile data set (Figure 7), as it has generally the best location estimation performance, but observations are similar for the other data sets (Figures SM 5 and SM 6).

Nearly all mislocated events are outside the seismic network, and location predictions are generally biased towards the network. This matches the expected errors for traditional localization algorithms. In contrast to traditional algorithms, events are not only predicted to be closer to the network, but they are also predicted as lying in regions with a higher event density in the training set (Figure 7, inset). This suggests that not enough similar events were included in the training data set. Similarly, Kriegerowski et al. (2019) observed a clustering tendency when predicting the location of swarm earthquakes with deep learning.

We investigated two subgroups of mislocated events: the Iquique sequence, consisting of the Iquique mainshock, foreshocks and aftershocks, and mine blasts. The Iquique sequence is visible in the north-western part of the study area. All events are predicted approximately 0.5° too far east. The area is both outside the seismic network and has no events in the training set. This systematic mislocation may pose a serious threat in applications such as early warning when confronted with a major change in the seismicity pattern, as is common in the wake of major earthquakes or during sudden swarm activity, which are also periods of heightened seismic hazard.

For mine blasts, we see one mine in the northeast and one in the southwest (marked by red circles in Figure 7). While all events are located close by, the locations are systematically mispredicted in the direction of the network and exhibit scatter.
Mine blasts show a generally lower location quality in the test set, although they make up only a small fraction of the events.

Our analysis so far showed the importance of the amount of training data. To quantify the impact of data availability on magnitude and location estimation, we trained models using only fractions of the training and validation data (Figure 8). We use the Chile data set for this analysis, as it contains by far the most events. We subsample the events by only using each kth event in chronological order, with k = 2, 4, 8, 16, 32, 64. This strategy approximately maintains the magnitude and location distribution of the full set. We point out that TEAM-LM only uses information of the event under consideration and does not take the events before or after it into account. Therefore, the 'gaps' between events introduced by the subsampling do not negatively influence TEAM-LM.

For all times after the first P arrival, we see a clear increase in the magnitude RMSE for a reduction in the number of training samples. While the impact of reducing the data set by half is relatively small, using only a quarter of the data already leads to a twofold increase in RMSE at late times. Even more relevant in an early warning context, a fourfold smaller data set results in an approximately fourfold increase in the time needed to reach the same precision as with the full data. This relationship seems to hold approximately across all subsampled data sets: reducing the data set k-fold increases the time to reach a certain precision by a factor of k.

We make three further observations from comparing the predictions to the true values (Figure SM 7). First, for nearly all models the RMSE changes only marginally between 16 s and 25 s, but the RMSE of this plateau increases significantly with a decreasing number of training events. Second, the lower the amount of training data, the lower is the saturation threshold above which all events are strongly underestimated.
In addition, for 1/32 and 1/64 of the full data set, an 'inverse saturation' effect is noticeable for the smallest magnitudes. Third, while for the full data set and the largest subsets all large events are estimated at approximately the saturation threshold, if at most one quarter of the training data is used, the largest events fall even significantly below the saturation threshold. For the models trained on the smallest subsets (1/8 to 1/64), the higher the true magnitude, the lower the predicted magnitude becomes. We assume that the larger the event is, the further away it is from the training distribution, and it is therefore estimated approximately at the most dense region of the training label distribution. These observations support the hypothesis that underestimations of large magnitudes for the full data set are caused primarily by insufficient training data.

While the RMSE for epicenter estimation shows a similar behavior as the RMSE for magnitude, there are subtle differences. If the amount of training data is halved, the performance only degrades mildly and only at later times. However, the performance degradation is much more severe than for magnitude if only a quarter or less of the training data is available. This demonstrates that location estimation with high accuracy requires catalogs with a high event density.

The strong degradation further suggests insights into the inner workings of TEAM-LM. Classically, localization should be a task where interpolation leads to good results, i.e., the travel times for an event in the middle of two others should be approximately the average of the travel times for the other events. Following this argument, if the network were able to use interpolation, it should not suffer such significant degradation when faced with fewer data. This provides further evidence that the network does not actually learn some form of triangulation, but only an elaborate fingerprinting scheme, backing the finding from the qualitative analysis of location errors.

Figure 7: The 200 events with the highest location errors in the Chile data set, overlaid on the spatial event density in the training data set. The location estimations use 16 s of data. Each event is denoted by a yellow dot for the estimated location, a green cross for the true location and a line connecting both. Stations are shown by black triangles. The event density is calculated using a Gaussian kernel density estimation and does not take into account the event depth. The inset shows the event density at the true event location in comparison to the event density at the predicted event location for the 200 events. Red circles mark locations of mine blast events. The inset waveforms show one example of a waveform from a mine blast (top) and an example waveform of an earthquake (bottom, 26 km depth) of similar magnitude (MA ≈ 2) at a similar distance (60 km) on the transverse component. Similar plots for Italy and Japan can be found in the supplementary material (Figures SM 5 and SM 6).

Figure 8: RMSE for magnitude and epicentral location at different times for models trained on differently sized subsets of the training set in Chile. The line color encodes the fraction of the training and validation set used in training. All models were evaluated on the full Chilean test set. We note that the variance of the curves with fewer data is higher, due to the increased stochasticity from model training and initialization.
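The chronological subsampling behind the data-size experiment is straightforward; because the order of events is preserved rather than drawn at random, the magnitude and location distributions of the full catalog are approximately maintained.

```python
def subsample_catalog(events, k):
    """Keep every k-th event of a chronologically sorted catalog, shrinking
    it k-fold while approximately preserving its magnitude and location
    distributions."""
    return events[::k]
```

Applying this with k = 2, 4, 8, ... yields the nested training sets compared in Figure 8.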
Often, large events are of the greatest concern, and, as discussed, they generally showed poorer performance because they are not well represented in the training data. It therefore appears plausible that a model optimized for large events might perform better than a model trained on both large and small events. To test this hypothesis, we employed an extreme version of the upsampling strategy by training a set of models only on large events, which might avoid tuning the model to seemingly irrelevant small events. In fact, these models perform significantly worse than the models trained on the full data set, even for the large events (Tables SM 6 to SM 12). Therefore, even if the events of interest are only the large ones, training on more complete catalogs is still beneficial, presumably by giving the network more comprehensive information on the regional propagation characteristics and possibly site effects.
So far we only analyzed the mean predictions of TEAM-LM. As many application scenarios, for example early warning, require quantified uncertainties, TEAM-LM outputs not only these mean predictions, but a full probability density. Figure 9 shows the development of magnitude uncertainties for events from different magnitude classes in the Chile data set. As we average over multiple events, each set of lines can be seen as a prototype event of a certain magnitude.
Figure 9: Magnitude predictions and uncertainties in the Chile data set as a function of time since the first P arrival. Solid lines indicate median predictions, while dashed lines show the 20th and 80th quantiles of the prediction. Each color represents a different magnitude bucket. For each magnitude bucket, we sampled 1,000 events around this magnitude and combined their predictions. If fewer than 1,000 events were available within the bucket, all available events were combined.

Magnitude estimates start low directly after t = 0, followed by a slow convergence to the final magnitude estimate. We suspect that the magnitude estimation always converges from below because, due to the Gutenberg-Richter distribution, lower magnitudes are a priori more likely. The uncertainties are largest directly after t = 0 and subsequently decrease, with the highest uncertainties for the largest events. As we do not use transfer learning in this approach, there is a consistent underestimation of the largest magnitude events, visible from the incorrect median predictions for magnitudes 5 and 6.

While the Gaussian mixture model is designed to output uncertainties, it cannot be assumed that the predicted uncertainties are indeed well calibrated, i.e., that they actually match the real error distribution. Having well calibrated uncertainties is crucial for downstream tasks that rely on them. Neural networks trained with a log-likelihood loss generally tend to be overconfident (Snoek et al., 2019; Guo et al., 2017), i.e., to underestimate the uncertainties. This overconfidence is probably caused by the strong overparametrization of neural network models. To assess the quality of our uncertainty estimates for magnitude, we use the observation that for a specific event i, the predicted Gaussian mixture implies a cumulative distribution function F^i_pred : R → [0, 1]. Given the observed magnitude y^i_true, we can calculate u_i = F^i_pred(y^i_true). If y^i_true is indeed distributed according to F^i_pred, then u_i needs to be uniformly distributed on [0, 1].
We test this based on the u_i of all events in the test set using P-P plots (Figure 10). Further details on the method can be found in the supplementary material (Section SM 3). Note that good calibration is a necessary but not sufficient condition for a good probabilistic forecast. An example of a perfectly calibrated but mostly useless probabilistic prediction would be the marginal distribution of the labels.

Figure 10 shows the P-P plots of u in comparison to a uniform distribution. For all data sets and all times the model is significantly miscalibrated, as estimated using Kolmogorov-Smirnov test statistics (Section SM 3). Miscalibration is considerably stronger for Italy and Japan than for Chile. More precisely, the model is always overconfident, i.e., it estimates narrower confidence bands than the actually observed errors. Further, in particular at later times, the model is biased towards underestimating the magnitudes. This is least visible for Chile. We speculate that this is a result of the large training data set for Chile, which ensures that for most events the density of training events in their magnitude range is high.

To mitigate the miscalibration, we trained ensembles (Hansen & Salamon, 1990), a classical method to improve calibration. Instead of a single neural network, a set of n neural networks, in our case n = 10, is trained; all have the same structure, but different initialization and batching in training. The networks therefore represent a sample of size n from the posterior distribution of the model parameters given the training data. For Italy and Japan, this improves calibration considerably (Figure 10). For Chile, the ensemble model, in contrast to the single model, exhibits underconfidence, i.e., it estimates too broad uncertainty bands.

The maximum distance d∞ between the empirical cumulative distribution function of u and a uniform distribution is the test statistic of the Kolmogorov-Smirnov test.
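The calibration check can be sketched as follows: map each observed magnitude through the predicted mixture CDF and measure the maximum deviation d∞ of the resulting u values from uniformity. The mixture components in the test are purely illustrative.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of a single Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, components):
    """CDF of a Gaussian mixture; components are (weight, mu, sigma)
    tuples with weights summing to one."""
    return sum(w * gaussian_cdf(x, mu, s) for w, mu, s in components)

def ks_distance(us):
    """d_inf: maximum distance between the empirical CDF of the u_i and
    the uniform CDF on [0, 1]; small values indicate good calibration."""
    us = sorted(us)
    n = len(us)
    return max(
        max(abs((i + 1) / n - u), abs(i / n - u)) for i, u in enumerate(us)
    )
```

For a perfectly calibrated model the u_i scatter uniformly on [0, 1] and d∞ shrinks with the number of events; systematic deviations produce the sigmoidal P-P curves discussed below.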
While d∞ is reduced by nearly half for some of the ensemble
Figure 10: P-P plots of the CDFs of the empirical quantile compared to the expected uniform distribution. The P-P plot shows (CDF_{u_i}(z), CDF_uniform(z)) for z ∈ [0, 1]. The expected uniform distribution is shown as the diagonal line; the misfit is indicated as a shaded area. The value in the upper corner provides d∞, the maximum distance between the diagonal and the observed CDF. d∞ can be interpreted as the test statistic for a Kolmogorov-Smirnov test. Curves consistently above the diagonal indicate a bias towards underestimation, curves below the diagonal a bias towards overestimation. Sigmoidal curves indicate overconfidence, mirrored sigmoids underconfidence. See supplementary section SM 3 for a further discussion of the plotting methodology and its connection to the Kolmogorov-Smirnov test.
Figure 11: 90 % confidence areas for sample events around 5 example locations. For each location the 5 closest events are shown. Confidence areas belonging to the same location are visualized using the same color. Confidence areas were chosen as curves of constant likelihood, such that the probability mass above the likelihood equals 0.9. To visualize the result in 2D we marginalize out the depth. Triangles denote station locations for orientation. The top row shows results from a single model, while the bottom row shows results from an ensemble of 10 models.
results, the Kolmogorov-Smirnov test indicates that even the distributions from the ensemble models deviate highly significantly from a uniform distribution. A table with d∞ for all experiments can be found in the supplementary material (Table SM 9).

To evaluate the location uncertainties qualitatively, we plot confidence ellipses for a set of events in Chile (Figure 11). Again we compare the predictions from a single model to the predictions of an ensemble. At early times, the uncertainty regions mirror the seismicity around the station with the first arrival, showing that the model correctly learned the prior distribution. Uncertainty ellipses at late times approximately match the expected uncertainty ellipses for classical methods, i.e., they are small and fairly round for events inside the seismic network, where there is good azimuthal coverage, and larger and elliptical for events outside the network. Location uncertainties are not symmetric around the mean prediction, but show higher likelihood towards the network than away from it. Location errors for the ensemble model are smoother than those from the single model, but show the same features. The uncertainty ellipses are slightly larger, suggesting that the single model is again overconfident.

In addition to improving calibration, ensembles also lead to slight improvements in the accuracy of the mean predictions, in terms of both magnitude RMSE and epicentral location error (Tables SM 6 to SM 12). Due to the high computational demand of training ensembles, all other results reported in this paper are calculated without ensembling.

In this study we adapted TEAM to build TEAM-LM, a real time earthquake source characterization model, and used it to study the pitfalls and particularities of deep learning for this task. We showed that TEAM-LM achieves state of the art performance in magnitude estimation, outperforming both a classical baseline and a deep learning baseline.
Given sufficiently large catalogs, magnitude can be assessed with a low standard deviation.
Acknowledgements
We thank the National Research Institute for Earth Science and Disaster Resilience for providing the catalog and waveform data for our Japan data set. We thank the Istituto Nazionale di Geofisica e Vulcanologia and the Dipartimento della Protezione Civile for providing the catalog and waveform data for our Italy data set. We thank Christian Sippl for providing the P picks for the Chile catalog. We thank Sebastian Nowozin for insightful discussions on neural network calibration and probabilistic regression. Jannes Münchmeyer acknowledges the support of the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS). We use obspy (Beyreuther et al., 2010), tensorflow (Abadi et al., 2016) and color scales from Crameri (2018).

An implementation of TEAM-LM and TEAM is available at https://github.com/yetinam/TEAM. The Italy data set has been published as Münchmeyer et al. (2020). Download instructions for the Japan data set are available at https://github.com/yetinam/TEAM. A data publication for the Chile data set with GFZ Data Services is under preparation.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al., 2016. TensorFlow: A system for large-scale machine learning, in USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 265–283.

Aldersons, F., 2004. Toward three-dimensional crustal structure of the Dead Sea region from local earthquake tomography.

Asch, G., Tilmann, F., Schurr, B., & Ryberg, T., 2011. Seismic network 5E: MINAS Project (2011/2013).

Beyreuther, M., Barsch, R., Krischer, L., Megies, T., Behr, Y., & Wassermann, J., 2010. ObsPy: A Python toolbox for seismology, Seismological Research Letters, (3), 530–533.

Bishop, C. M., 1994. Mixture density networks, Tech. rep., Aston University.

Cesca, S., Sobiesiak, M., Tassara, A., Olcay, M., Günther, E., Mikulla, S., & Dahm, T., 2009. The Iquique Local Network and PicArray.

Crameri, F., 2018. Geodynamic diagnostics, scientific visualisation and StagLab 3.0, Geoscientific Model Development, (6), 2541–2562.

Deichmann, N., 2018. Why Does ML Scale 1:1 with 0.5 log ES?, Seismological Research Letters, (6), 2249–2255.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805.

Dipartimento di Fisica, Università degli studi di Napoli Federico II, 2005. Irpinia Seismic Network (ISNet).

Doi, K., 2014. Seismic network and routine data processing - Japan Meteorological Agency, Summary of the Bulletin of the International Seismological Centre, (7-12), 25–42.

EMERSITO Working Group, 2018. Seismic network for site effect studies in Amatrice area (central Italy) (SESAA).

Festa, G., Picozzi, M., Caruso, A., Colombelli, S., Cattaneo, M., Chiaraluce, L., Elia, L., Martino, C., Marzorati, S., Supino, M., & Zollo, A., 2018. Performance of Earthquake Early Warning Systems during the 2016–2017 Mw 5–6.5 Central Italy Sequence, Seismological Research Letters, (1), 1–12.

GEOFON Data Center, 1993. GEOFON seismic network.

Geological Survey-Provincia Autonoma di Trento, 1981. Trentino seismic network.

Graeber, F. M. & Asch, G., 1999. Three-dimensional models of P wave velocity and P-to-S velocity ratio in the southern central Andes by simultaneous inversion of local earthquake data, Journal of Geophysical Research: Solid Earth, (B9), 20237–20256.

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q., 2017. On calibration of modern neural networks, arXiv preprint arXiv:1706.04599.

Hansen, L. K. & Salamon, P., 1990. Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, (10), 993–1001.

ISIDe Working Group, 2007. Italian Seismological Instrumental and Parametric Database (ISIDe).

Istituto Nazionale di Geofisica e Vulcanologia (INGV), 2008. INGV experiments network.
Istituto Nazionale di Geofisica e Vulcanologia (INGV), Istituto di Geologia Ambientale e Geoingegneria (CNR-IGAG), Istituto per la Dinamica dei Processi Ambientali (CNR-IDPA), Istituto di Metodologie per l'Analisi Ambientale (CNR-IMAA), & Agenzia Nazionale per le nuove tecnologie, l'energia e lo sviluppo economico sostenibile (ENEA), 2018. Centro di microzonazione sismica network, 2016 central Italy seismic sequence (CentroMZ).

Istituto Nazionale di Geofisica e Vulcanologia (INGV), Italy, 2006. Rete Sismica Nazionale (RSN).

Jozinović, D., Lomax, A., Štajduhar, I., & Michelini, A., 2020. Rapid prediction of earthquake ground shaking intensity using raw waveform data and a convolutional neural network, Geophysical Journal International, (2), 1379–1389.

Kingma, D. P. & Ba, J., 2014. Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.

Kriegerowski, M., Petersen, G. M., Vasyura-Bathke, H., & Ohrnberger, M., 2019. A Deep Convolutional Neural Network for Localization of Clustered Earthquakes Based on Multistation Full Waveforms, Seismological Research Letters, (2A), 510–516.

Kuyuk, H. S. & Allen, R. M., 2013. A global approach to provide magnitude estimates for earthquake early warning alerts, Geophysical Research Letters, (24), 6329–6333.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., & Teh, Y. W., 2019. Set Transformer: A framework for attention-based permutation-invariant neural networks, in International Conference on Machine Learning, pp. 3744–3753, PMLR.

Leyton, F., Ruiz, S., Baez, J. C., Meneses, G., & Madariaga, R., 2018. How Fast Can We Reliably Estimate the Magnitude of Subduction Earthquakes?, Geophysical Research Letters, (18), 9633–9641.

Lomax, A., Virieux, J., Volant, P., & Berge-Thierry, C., 2000. Probabilistic earthquake location in 3D and layered models, in Advances in Seismic Event Location, pp. 101–134, Springer.

Lomax, A., Michelini, A., & Jozinović, D., 2019. An Investigation of Rapid Earthquake Characterization Using Single-Station Waveforms and a Convolutional Neural Network, Seismological Research Letters, (2A), 517–529.

Matrullo, E., De Matteis, R., Satriano, C., Amoroso, O., & Zollo, A., 2013. An improved 1-D seismic velocity model for seismological studies in the Campania–Lucania region (southern Italy), Geophysical Journal International, (1), 460–473.

MedNet Project Partner Institutions, 1990. Mediterranean Very Broadband Seismographic Network (MedNet).

Mousavi, S. M. & Beroza, G. C., 2019. Bayesian-Deep-Learning Estimation of Earthquake Location from Single-Station Observations, arXiv:1912.01144 [physics].

Mousavi, S. M. & Beroza, G. C., 2020. A Machine-Learning Approach for Earthquake Magnitude Estimation, Geophysical Research Letters, (1), e2019GL085976.

Mousavi, S. M., Zhu, W., Sheng, Y., & Beroza, G. C., 2018. CRED: A Deep Residual Network of Convolutional and Recurrent Units for Earthquake Signal Detection, arXiv:1810.01965.

Münchmeyer, J., Bindi, D., Leser, U., & Tilmann, F., 2020. Fast earthquake assessment and earthquake early warning dataset for Italy.

Münchmeyer, J., Bindi, D., Leser, U., & Tilmann, F., 2020a. The transformer earthquake alerting model: A new versatile approach to earthquake early warning, Geophysical Journal International.

Münchmeyer, J., Bindi, D., Sippl, C., Leser, U., & Tilmann, F., 2020b. Low uncertainty multifeature magnitude estimation with 3-D corrections and boosting tree regression: Application to North Chile, Geophysical Journal International, (1), 142–159.

National Research Institute For Earth Science And Disaster Resilience, 2019. NIED K-NET, KiK-net.

OGS (Istituto Nazionale di Oceanografia e di Geofisica Sperimentale), 2016. North-East Italy Seismic Network (NEI).

OGS (Istituto Nazionale di Oceanografia e di Geofisica Sperimentale) and University of Trieste, 2002. North-East Italy Broadband Network (NI).

Pan, S. J. & Yang, Q., 2009. A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, (10), 1345–1359.

Perol, T., Gharbi, M., & Denolle, M., 2018. Convolutional neural network for earthquake detection and location, Science Advances, (2), e1700578.

Presidency of Council of Ministers - Civil Protection Department, 1972. Italian Strong Motion Network (RAN).

Raissi, M., Perdikaris, P., & Karniadakis, G. E., 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 686–707.
RESIF - Réseau Sismologique et géodésique Français, 1995a. RESIF-RLBP French broad-band network, RESIF-RAP strong motion network and other seismic stations in metropolitan France.

RESIF - Réseau Sismologique et géodésique Français, 1995b. Réseau Accélérométrique Permanent (French Accelerometric Network) (RAP).

Ruder, S., 2017. An overview of multi-task learning in deep neural networks, arXiv preprint arXiv:1706.05098.

Shelly, D. R., Beroza, G. C., & Ide, S., 2007. Non-volcanic tremor and low-frequency earthquake swarms, Nature, (7133), 305–307.

Sippl, C., Schurr, B., Asch, G., & Kummerow, J., 2018. Seismicity Structure of the Northern Chile Forearc From >100,000 Double-Difference Relocated Hypocenters, Journal of Geophysical Research: Solid Earth, (5), 4063–4087.

Snoek, J., Ovadia, Y., Fertig, E., Lakshminarayanan, B., Nowozin, S., Sculley, D., Dillon, J., Ren, J., & Nado, Z., 2019. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift, in Advances in Neural Information Processing Systems, pp. 13969–13980.

Ueno, H., Hatakeyama, S., Aketagawa, J., Funasaki, J., & Hamada, N., 2002. Improvement of hypocenter determination procedures in the Japan Meteorological Agency, Quart. J. Seism., 123–134.

Universidad de Chile, 2013. Red Sismologica Nacional.

Universita della Basilicata, 2005. UniBas.

University of Genova, 1967. Regional Seismic Network of North Western Italy. International Federation of Digital Seismograph Networks.

van den Ende, M. P. A. & Ampuero, J.-P., 2020. Automated Seismic Source Characterization Using Deep Graph Neural Networks, Geophysical Research Letters, (17), e2020GL088690.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I., 2017. Attention is all you need, in Advances in Neural Information Processing Systems, pp. 5998–6008.

Wigger, P., Salazar, P., Kummerow, J., Bloch, W., Asch, G., & Shapiro, S., 2016. West-Fissure- and Atacama-Fault Seismic Network (2005/2012).
Table SM 1: Seismic networks

Region  Network  Reference
Chile   GE       GEOFON Data Center (1993)
        C, C1    Universidad de Chile (2013)
        8F       Wigger et al. (2016)
        IQ       Cesca et al. (2009)
        5E       Asch et al. (2011)
Italy   3A       Istituto Nazionale di Geofisica e Vulcanologia (INGV) et al. (2018)
        BA       Universita della Basilicata (2005)
        FR       RESIF - Réseau Sismologique et géodésique Français (1995a)
        GU       University of Genova (1967)
        IT       Presidency of Council of Ministers - Civil Protection Department (1972)
        IV       Istituto Nazionale di Geofisica e Vulcanologia (INGV), Italy (2006)
        IX       Dipartimento di Fisica, Università degli studi di Napoli Federico II (2005)
        MN       MedNet Project Partner Institutions (1990)
        NI       OGS (Istituto Nazionale di Oceanografia e di Geofisica Sperimentale) and University of Trieste (2002)
        OX       OGS (Istituto Nazionale di Oceanografia e di Geofisica Sperimentale) (2016)
        RA       RESIF - Réseau Sismologique et géodésique Français (1995b)
        ST       Geological Survey-Provincia Autonoma di Trento (1981)
        TV       Istituto Nazionale di Geofisica e Vulcanologia (INGV) (2008)
        XO       EMERSITO Working Group (2018)
Japan   KiK-Net  National Research Institute For Earth Science And Disaster Resilience (2019)
SM 1 Data sources

SM 2 Classical magnitude estimation baseline
For magnitude estimation we compare TEAM-LM to a classical baseline. To this end we use the peak displacement based approach proposed by Kuyuk & Allen (2013). At each station, we bandpass filter the signal between 0.5 Hz and 3 Hz and discard traces with insufficient signal to noise ratio. We extract the peak displacement PD from the horizontal components in the first 6 s of the P wave, while only including samples before the S onset. We use the relationship

M = c_1 log(PD) + c_2 log(R) + c_3 + N(0, σ)    (1)

from Kuyuk & Allen (2013) to estimate magnitudes from peak displacement, with the values of c_1, c_2 and σ given by Kuyuk & Allen (2013). These parameters were calibrated using data from California and Japan, but the authors state that the relationship can be applied to earthquake source zones around the world. To account for a constant offset between different magnitude scales, we optimized c_3 separately for each dataset such that the predictions do not have a systematic bias compared to the ground truth.

We average the predictions from multiple stations, effectively assuming independence between the predictions. To obtain earlier predictions, we already calculate magnitude estimates at a station once at least 1 s of P wave data has been recorded. We assign higher weights to stations with longer P wave records, with weights increasing linearly from 0.11 for 1 s of waveforms to 1.0 for 6 s of data. Thereby, while getting early estimates from the first stations, new data from later stations does not perturb the prediction strongly until enough data has been recorded.

As the estimation relies on the hypocentral distance R between station and event, the method requires an estimate of the hypocentral location. We provide the method with the cataloged hypocentral location. While this is an unrealistically optimistic assumption for an actual real time determination, it allows us to put our focus on the magnitude estimation capabilities. We note that this gives the baseline an advantage compared to TEAM-LM, which has no information on the earthquake location.

For some events in the Chile catalog, the SNR criterion is not fulfilled at any station due to the inclusion of smaller magnitude events and the higher distances between stations and events. For these events the baseline does not issue a magnitude estimate. We exclude these events from the evaluation of the baseline, leading to an optimistic assessment of the baseline's performance.
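The weighting and averaging scheme described above can be sketched as follows. This is our illustration, not the original implementation: all function names are ours, the coefficients are placeholders for the Kuyuk & Allen (2013) values, base-10 logarithms are assumed, and filtering and SNR checks are omitted.

```python
import math

def station_magnitude(pd, r, c1, c2, c3):
    """Single-station magnitude from peak displacement PD and hypocentral
    distance R, following Eq. (1) without the noise term.
    Base-10 logarithms are assumed; c1, c2, c3 are placeholder coefficients."""
    return c1 * math.log10(pd) + c2 * math.log10(r) + c3

def station_weight(p_seconds):
    """Weight of a station, increasing linearly from 0.11 at 1 s of
    P waveform to 1.0 at 6 s, clamped outside that range."""
    t = min(max(p_seconds, 1.0), 6.0)
    return 0.11 + (1.0 - 0.11) * (t - 1.0) / 5.0

def network_magnitude(stations):
    """Weighted average of single-station estimates; `stations` is a list
    of (magnitude, seconds_of_p_wave) tuples."""
    weights = [station_weight(t) for _, t in stations]
    mags = [m for m, _ in stations]
    return sum(w * m for w, m in zip(weights, mags)) / sum(weights)
```

With this weighting, a station that has just triggered contributes roughly a tenth of the weight of a station with a full 6 s record, so early single-station estimates are refined rather than overturned as more data arrives.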
Table SM 2: Architecture of the feature extraction network. The input shape of the waveform data is (time, channels). FC denotes fully connected layers. As FC layers can be regarded as 0D convolutions, we write the output dimensionality in the filters column. The "Concatenate scale" layer concatenates the log of the peak amplitude to the output of the convolutions. Note that, depending on the existence of borehole data, the number of input filters for the first Conv1D varies.

Layer              Filters  Kernel size  Stride
Conv2D             8        5, 1         5, 1
Conv2D             32       16, 3        1, 3
Flatten to 1D
Conv1D             64       16           1
MaxPool1D                   2            2
Conv1D             128      16           1
MaxPool1D                   2            2
Conv1D             32       8            1
MaxPool1D                   2            2
Conv1D             32       8            1
Conv1D             16       4            1
Flatten to 0D
Concatenate scale
FC                 500
FC                 500
FC                 500

Table SM 3: Architecture of the transformer network.

Feature  Value
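To illustrate the downsampling behaviour of the convolutional stack in Table SM 2, the following sketch propagates the temporal dimension through the layers. It assumes 'same' padding, which the table does not specify, and only tracks the time axis of the first two Conv2D layers; all names are ours.

```python
def conv_out(length, kernel, stride, padding="same"):
    """Output length of a 1-D convolution or pooling layer."""
    if padding == "same":
        return -(-length // stride)          # ceil division
    return (length - kernel) // stride + 1   # 'valid' padding

def feature_extractor_length(time_steps):
    """Temporal size after the convolutional stack of Table SM 2,
    assuming 'same' padding (an assumption, not stated in the table)."""
    n = conv_out(time_steps, 5, 5)   # Conv2D, stride 5 along time
    n = conv_out(n, 16, 1)           # Conv2D, stride 1 along time
    for kernel, stride in [(16, 1), (2, 2), (16, 1), (2, 2),
                           (8, 1), (2, 2), (8, 1), (4, 1)]:
        n = conv_out(n, kernel, stride)
    return n
```

For a 30 s record at 100 Hz (3000 samples), the stack reduces the time axis by a factor of 40 before the flatten and fully connected layers.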
SM 3 Calibration estimation
Calibration of a model describes whether the predicted uncertainties match the observed values, i.e., whether the observation y_true was drawn from a distribution with cumulative distribution function (CDF) F_pred. Unfortunately, for each event i, only one observation of the magnitude y^i_true and one prediction F^i_pred are available. To this end, we define the random variable u_i = F^i_pred(y^i_true). If y^i_true is distributed according to F^i_pred, then u_i must be uniformly distributed on [0, 1]. This follows from the definition of the CDF: if F is a CDF and U a uniform random variable on [0, 1], then F^-1(U) is distributed according to F.

We take the u_i of all events as samples of a random variable U and compare U to a uniform random variable on [0, 1]. The maximum difference between the empirical CDF of U and a uniform variable is the test statistic of a Kolmogorov-Smirnov test, d∞. As the number of events n is large, the critical value d_α for a confidence threshold α can be estimated as:

d_α = √(-log α) / √n    (2)

For the chosen α, this gives values of d_α of 0.015 (Chile), 0.054 (Japan) and 0.039 (Italy). This is considerably below the observed values d∞, even using ensembles, indicating that U differs highly significantly from a uniform distribution.

SM 4 Figures and tables
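The calibration check of section SM 3 above can be sketched as follows, assuming for simplicity a Gaussian predictive distribution per event (TEAM-LM actually outputs Gaussian mixtures); all function names are ours.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pit(y_true, mu, sigma):
    """Probability integral transform value u_i = F_pred(y_true)."""
    return normal_cdf(y_true, mu, sigma)

def ks_statistic(u):
    """d_inf: maximum deviation between the empirical CDF of the PIT
    values u and the CDF of a uniform distribution on [0, 1]."""
    u = sorted(u)
    n = len(u)
    return max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))

def critical_value(alpha, n):
    """Critical value d_alpha = sqrt(-log(alpha)) / sqrt(n), following
    the large-n approximation of Eq. (2)."""
    return math.sqrt(-math.log(alpha)) / math.sqrt(n)
```

A well calibrated model yields a d∞ below the critical value; uniformly spaced PIT values give a statistic close to zero, while PIT values piled up near 0 or 1 (over- or underconfidence) push it towards one.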
Table SM 4: Architecture of the mixture density network.

Feature                                        Value
Dimensions fully connected layers (magnitude)  150, 100, 50, 30, 10
Dimensions fully connected layers (location)   150, 100, 50, 50, 50
Mixture size (magnitude)                       5
Mixture size (location)                        15
Base distribution                              Gaussian

Figure SM 1: Event and station distribution for Chile. In the map, events are indicated by dots, stations by triangles. The event depth is encoded using color.
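The Gaussian mixture output parametrized in Table SM 4 can be sketched as below. This is a minimal illustration with our own helper names, assuming mixture weights that sum to one; the actual model is implemented in tensorflow.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a single Gaussian component."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Density of a Gaussian mixture; `components` is a list of
    (weight, mu, sigma) tuples with weights summing to one."""
    return sum(w * gaussian_pdf(x, mu, s) for w, mu, s in components)

def mixture_mean(components):
    """Mean prediction of the mixture: weight-averaged component means."""
    return sum(w * mu for w, mu, _ in components)

def nll(y, components):
    """Negative log likelihood of one observation, the usual
    mixture density network training loss (Bishop, 1994)."""
    return -math.log(mixture_pdf(y, components))
```

A mixture with 5 components (magnitude) or 15 components (location, one Gaussian per spatial dimension set) can represent the multimodal and asymmetric uncertainties discussed in the main text, which a single Gaussian cannot.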
Figure SM 2: Event and station distribution for Italy. In the map, events are indicated by dots, stations by triangles. The event depth is encoded using color.

Table SM 5: Experiment names for the results tables

Name                 Explanation
Baseline             Baseline method
Plain                Model trained with a single loss for magnitude or location
Multi-task           Model trained with both a loss for magnitude and for location
High magnitudes      Model only trained on events with magnitude above 4.5 (Chile), 3.5 (Italy), 5.5 (Japan)
Transfer             Model trained with transfer learning and a single loss
Multi-task transfer  Model trained with transfer learning and both losses
Joint                Model trained on all datasets jointly with a single loss
Joint multi-task     Model trained on all datasets jointly with both losses
Velocity             Model trained with acceleration traces integrated to velocity
Ensemble             Model trained as an ensemble of size 10
No upsampling        Model trained without upsampling large magnitudes
Pooling (Emb)        Model trained with the pooling architecture using position embeddings
Pooling (Concat)     Model trained with the pooling architecture and concatenated coordinates
Figure SM 3: Event and station distribution for Japan. In the map, events are indicated by dots, stations by triangles. The event depth is encoded using color. There are ∼20 additional events far offshore in the catalog, which are outside the displayed map region.
Figure SM 4: Distribution of the hypocentral errors for TEAM-LM, the pooling baseline with position embeddings (POOL-E), the pooling baseline with concatenated position (POOL-C), TEAM-LM with transfer learning (TEAM-TRA) and a classical baseline. Vertical lines mark the 50th, 90th, 95th and 99th error percentiles. The time indicates the time since the first P arrival at any station. We use the mean predictions.
Figure SM 5: The 100 events with the highest location error in the Italy dataset, overlaid on the spatial event density in the training dataset. The estimations use 16 s of data. Each event is denoted by a dot for the estimated location, a cross for the true location and a line connecting both. Stations are not shown as station coverage is dense. The event density is calculated using a Gaussian kernel density estimation and does not take into account the event depth. The inset shows the event density at the true event location in comparison to the event density at the predicted event location.
Figure SM 6: The 200 events with the highest location error in the Japan dataset, overlaid on the spatial event density in the training dataset. The estimations use 16 s of data. Each event is denoted by a dot for the estimated location, a cross for the true location and a line connecting both. Stations are not shown as station coverage is dense. The event density is calculated using a Gaussian kernel density estimation and does not take into account the event depth. The inset shows the event density at the true event location in comparison to the event density at the predicted event location.
Figure SM 7: True and predicted magnitudes after 8 seconds using only parts of the datasets for training. All plots show the Chile dataset. The fraction in the corner indicates the amount of training and validation data used for model training. All models were evaluated on the full test dataset.

Table SM 6: Test set RMSE of the magnitude estimate across all magnitudes. For some experiments we additionally provide standard deviations, obtained from six runs with different random model initializations; in these cases the provided mean is the mean over the six runs. Note that the provided standard deviation denotes the empirical standard deviation of a single run; the uncertainty of the mean is therefore expected to be smaller by a factor of √6. Due to computational constraints we provide standard deviations only for a selected set of experiments. Columns: 0.5 s, 1.0 s, 2.0 s, 4.0 s, 8.0 s, 16.0 s, 25.0 s for Chile | Italy | Japan.

Baseline        -    0.55 0.52 0.47 0.42 0.42 0.43 | -    0.77 0.68 0.46 0.35 0.35 0.35 | -    0.65 0.51 0.44 0.38 0.31 0.30
Multi-task      0.31 0.25 … | … | …
No Upsampling   0.31 0.25 0.21 0.16 0.12 0.09 0.08 | … | …
Pooling (Emb)   0.31 0.25 0.21 0.16 0.13 0.09 0.09 | 0.36 0.33 0.29 0.25 0.21 0.20 0.21 | 0.66 0.49 0.41 0.36 0.31 0.28 0.28

Table SM 7: Test set mean absolute error (MAE) of the magnitude estimate across all magnitudes. Columns as in Table SM 6.

Baseline        -    0.43 0.41 0.37 0.33 0.33 0.34 | -    0.63 0.53 0.34 0.25 0.24 0.24 | -    0.50 0.40 0.34 0.28 0.23 0.22
Multi-task      0.23 0.18 …
No Upsampling   …

Table SM 8: Test set R2 score across all magnitudes. Columns as in Table SM 6.

Baseline        -    0.53 0.56 0.60 0.62 0.59 0.57 | -    -2.26 -1.83 -0.34 0.21 0.14 0.10 | -    0.26 0.54 0.66 0.75 0.82 0.84
Multi-task      0.74 0.84 …
High magnitudes 0.03 0.03 0.02 0.03 0.03 0.02 0.01 | -0.29 -0.29 -0.27 -0.26 -0.17 -0.14 -0.15 | 0.02 0.07 0.17 0.32 0.45 0.57 0.50
Plain           0.74 0.83 0.89 …
No Upsampling   0.74 0.83 0.88 …
Pooling (Emb)   0.74 0.83 0.88 …
Velocity        -    -    -    -    -    -    -    | 0.11 0.21 0.48 0.69 0.79 0.78 0.77 | 0.32 0.63 0.76 0.80 0.86 0.90 0.90
Joint multi-task 0.73 0.83 …

Table SM 9: Test set Kolmogorov-Smirnov test statistic d∞ across all magnitudes. Columns as in Table SM 6.

Baseline        -    -    -    -    -    -    -    | -    -    -    -    -    -    -    | -    -    -    -    -    -    -
Multi-task      0.05 0.03 0.03 0.06 0.04 0.03 0.04 | 0.09 0.10 0.14 0.16 0.17 0.22 0.22 | 0.22 0.17 0.17 0.18 0.19 0.19 0.20
High magnitudes 0.15 0.16 0.15 0.15 0.14 0.13 0.14 | 0.13 0.16 0.19 0.22 0.22 0.25 0.24 | 0.16 0.14 0.12 0.09 0.12 0.11 0.11
Plain           0.04 0.04 0.04 0.05 0.04 0.04 0.03 | 0.11 0.13 0.16 0.15 0.16 0.15 0.16 | 0.19 0.13 0.12 0.13 0.14 0.12 0.11

Table SM 10: Test set RMSE of the magnitude estimate for large events. Columns as in Table SM 6.

Baseline        -    0.95 …
No Upsampling   1.56 1.30 1.05 0.82 0.74 0.56 0.51 | 1.03 0.91 0.80 0.64 0.48 0.43 0.43 | 1.46 1.22 1.02 0.98 0.78 0.52 0.50
Pooling (Emb)   1.40 1.11 0.88 0.68 0.55 0.42 0.37 | 1.02 0.94 0.82 0.62 0.49 0.45 0.44 | 1.23 1.00 0.87 0.87 0.71 0.44 0.41
Joint           2.50 2.08 1.68 1.37 1.26 1.01 0.77 | 2.06 2.61 1.48 1.14 …

Table SM 11: Test set MAE of the magnitude estimate for large events. Columns as in Table SM 6.
Baseline        -    0.74 0.57 0.46 …
No Upsampling   1.19 0.97 0.73 0.55 0.44 0.37 0.33 | 0.81 …

Table SM 12: Test set R2 score of the magnitude estimate for large events. Columns as in Table SM 6.

Baseline        -     -0.95 -0.09 …
No Upsampling   -4.26 -2.68 -1.39 -0.44 -0.19 0.32 0.44 | -3.18 -2.30 -1.55 -0.62 0.08 0.28 0.27 | -8.93 -5.95 -3.88 -3.48 -1.81 -0.25 -0.17
Pooling (Emb)   -3.24 -1.69 -0.69 -0.02 0.35 0.61 0.71 | -3.18 -2.59 -1.69 -0.60 -0.03 0.13 0.15 | -6.01 -3.69 -2.55 -2.52 -1.36 0.10 0.19
…               … -3.95 -3.39 -1.89 -0.51 | -7.90 -4.79 -2.89 -2.53 -1.58 0.10 0.04

Table SM 13: Test set root mean square hypocentral error. We note that only 4 out of 10 models for the Italy location ensemble converged; we used only the converged models for the ensemble evaluation. Columns as in Table SM 6.

Baseline        78.97 77.62 74.32 69.53 44.28 17.51 17.58 | 66.92 75.15 91.77 189.05 172.65 69.88 56.09 | 82.89 88.03 90.80 98.55 184.61 373.34 332.11
Multi-task      48.49 40.17 33.10 24.79 17.69 12.63 11.86 | 67.06 65.52 61.91 58.09 55.13 54.93 56.25 | 140.83 141.21 144.12 139.64 135.31 134.77 134.13
High magnitudes 124.94 113.56 113.02 119.20 116.61 90.28 93.75 | 192.27 190.97 189.82 188.28 186.77 187.36 188.10 | 393.78 377.11 365.31 383.16 352.42 344.42 352.45
Plain           43.96 36.99 28.66 20.74 12.52 8.97 8.83 | …
Pooling (Emb)    44.04 37.00 30.42 25.30 18.56 15.10 14.40 | 65.73 65.00 61.25 59.02 56.91 57.07 58.21 | 98.95 95.17 93.88 91.29 88.20 87.26 87.13
Pooling (Concat) 53.19 44.53 38.02 33.01 27.09 22.31 21.80 | 181.39 181.41 181.47 181.52 181.46 181.47 181.52 | 163.43 155.61 154.13 152.15 149.51 148.23 148.29
Transfer         44.92 37.96 30.76 21.24 13.78 9.05 9.25 | …
Multi-task transfer 51.07 41.46 32.91 23.00 14.56 10.80 10.46 | 55.78 54.26 50.60 47.10 43.82 45.37 49.68 | 98.96 91.05 90.85 87.61 86.29 87.75 88.27
Velocity            -     -     -     -     -     -     -     | 75.50 74.75 71.02 68.44 66.97 66.67 68.56 | 77.41 69.06 67.91 64.79 62.96 61.95 60.64
Joint multi-task    52.36 43.38 35.14 25.23 17.32 14.82 14.64 | 62.87 61.27 56.59 54.38 51.56 51.81 53.81 | 111.41 100.82 100.55 97.64 97.20 97.92 97.39
Joint               49.55 42.75 35.52 24.32 15.69 10.12 9.76 | 67.38 65.06 60.95 58.38 58.10 59.13 60.29 | 198.47 196.29 195.19 196.00 194.68 194.46 195.26

Table SM 14: Test set mean absolute hypocentral error. We note that only 4 out of 10 models for the Italy location ensemble converged; we used only the converged models for the ensemble evaluation. Columns as in Table SM 6.
Baseline        73.30 71.60 67.38 58.00 28.15 13.80 13.92 | 26.03 32.11 50.13 111.88 71.80 18.57 15.70 | 43.66 46.38 49.29 54.04 78.95 154.71 124.94
Multi-task      33.78 27.71 22.39 15.81 10.99 8.20 7.89 | 27.09 25.25 22.88 20.27 18.87 18.76 19.22 | 67.48 66.47 65.97 63.09 59.05 57.13 56.76
High magnitudes 106.43 98.18 99.49 103.49 104.08 82.75 85.67 | 120.07 119.24 119.07 117.63 116.63 116.50 116.93 | 229.31 221.00 218.40 218.67 208.64 203.95 207.15
Plain           30.41 25.01 19.06 12.40 7.26 5.42 5.30 | …
Multi-task transfer 35.86 28.93 22.46 14.69 9.10 6.77 6.60 | 24.66 23.51 21.80 19.75 18.74 19.00 20.18 | 48.75 43.28 40.07 37.48 35.21 34.39 34.21
Velocity            -     -     -     -     -     -     -     | 32.69 31.80 29.85 28.20 27.17 27.25 27.85 | 43.11 39.53 37.60 34.71 32.58 31.59 31.09
Joint multi-task    38.31 31.64 25.02 17.28 11.90 9.78 9.65 | 30.48 28.98 27.13 25.74 24.53 24.56 25.30 | 54.41 48.50 44.99 42.34 40.29 39.34 38.94
Joint               34.75 29.17 23.04 14.61 8.66 5.95 5.70 | 24.61 22.52 19.82 17.05 15.81 15.72 16.22 | 124.42 121.32 118.92 117.11 113.93 112.59 112.77

Table SM 15: Test set root mean square epicentral error. We note that only 4 out of 10 models for the Italy location ensemble converged; we used only the converged models for the ensemble evaluation. Columns as in Table SM 6.
Baseline        65.49 63.86 59.74 52.07 29.72 13.82 13.82 | 55.77 64.82 67.69 84.57 116.35 66.82 53.17 | 73.93 79.82 82.69 86.91 160.74 360.57 325.00
Multi-task      44.19 36.51 29.74 22.46 16.20 11.29 10.46 | 61.81 60.39 58.22 56.28 53.67 53.46 54.20 | 138.89 139.48 142.85 138.34 134.03 133.63 133.03
High magnitudes 121.57 111.06 110.52 115.92 114.20 88.24 90.81 | 187.58 186.33 185.25 183.85 182.34 182.95 183.69 | 387.59 367.30 352.89 375.67 338.82 331.22 338.71
Plain           39.48 32.95 25.25 18.34 11.24 7.73 7.58 | …
Multi-task transfer 46.17 37.03 28.98 20.61 12.97 9.45 9.12 | 49.79 48.63 45.65 43.73 40.68 42.06 45.25 | 95.24 87.75 87.95 84.75 83.85 85.67 86.23
Velocity            -     -     -     -     -     -     -     | 70.61 70.14 66.79 65.30 63.80 63.76 64.77 | 72.34 64.27 63.83 60.69 59.33 58.74 57.48
Joint multi-task    47.10 38.58 30.60 22.14 14.93 13.05 12.80 | 56.11 54.73 50.60 48.94 46.58 47.09 48.61 | 108.26 97.65 97.53 94.64 94.43 95.36 94.80
Joint               44.46 37.75 30.83 21.41 14.44 8.86 8.53 | 62.76 61.32 58.91 57.21 57.06 57.98 58.71 | 197.03 195.09 194.19 195.16 194.27 194.13 194.85

Table SM 16: Test set mean absolute epicentral error. We note that only 4 out of 10 models for the Italy location ensemble converged; we used only the converged models for the ensemble evaluation. Columns as in Table SM 6.
Baseline        58.27 56.23 51.33 40.82 18.73 10.25 10.28 | 21.58 26.72 33.11 45.59 46.45 14.30 16.85 | …
…               … 14.44 12.30 10.55 10.12 10.67 | 27.90 23.94 22.08 19.80 18.12 16.67 16.36