The transformer earthquake alerting model: A new versatile approach to earthquake early warning
Non-peer reviewed manuscript submitted to Geophysical Research Letters
Jannes Münchmeyer∗, Dino Bindi, Ulf Leser, Frederik Tilmann
Helmholtzzentrum Potsdam, Deutsches GeoForschungsZentrum GFZ, Potsdam, Germany; Institut für Informatik, Humboldt-Universität zu Berlin, Berlin, Germany; Institut für geologische Wissenschaften, Freie Universität Berlin, Berlin, Germany
∗To whom correspondence should be addressed: [email protected]
September 18, 2020

Abstract
Earthquake early warning aims to provide advance notice of incoming strong shaking to enable preventive action and mitigate seismic risk. Its usefulness depends on accuracy, the relation between true, missed and false alerts, and timeliness, the time between a warning and the arrival of strong shaking. Here we propose a novel early warning method, the deep-learning based transformer earthquake alerting model (TEAM). TEAM analyzes raw strong motion waveforms of an arbitrary number of stations at arbitrary locations in real time, making it easily adaptable to changing seismic networks. For two regions with high seismic hazard, Japan and Italy, it outperforms existing early warning methods considerably, offering accurate and timely warnings. Using domain adaptation, TEAM even provides reliable alerts for events larger than any in the training data, a property of highest importance as records from very large events are rare in many regions.
The concept of earthquake early warning has been around for over a century, but the necessary instrumentation and methodologies have only been developed in the last three decades [1, 2]. Early warning systems aim to raise alerts if shaking levels likely to cause damage are going to occur. Existing methods split into two main classes: source estimation based and propagation based. The former, like EPIC [3] or FinDer [4], estimate the source properties of an event, i.e., its location or fault extent and magnitude, and then use a ground motion prediction equation (GMPE) to infer shaking at target sites. They provide long warning times, but incur a large aleatoric uncertainty due to simplified assumptions in the source estimation and in the GMPE [5]. Propagation based methods, like PLUM [5], infer the shaking at a given location from measurements at nearby seismic stations. Predictions are more accurate, but warning times are reduced, as warnings require measurements of strong shaking at nearby stations [6].

Recently, machine learning methods, particularly deep learning methods, have emerged as a tool for fast assessment of earthquakes. Under certain circumstances, they led to improvements in various tasks, e.g., estimation of magnitude [7, 8], location [9, 10] or peak ground acceleration (PGA) [11]. Nonetheless, no existing method is applicable to early warning because they lack real-time capabilities, instead requiring fixed waveform windows after the P arrival. Furthermore, many of these approaches are designed for single station input, missing out on the potential of jointly using multiple stations across the network. Those methods combining data from multiple stations assume a fixed station set determined at training time, limiting their adaptability to changing networks. Finally, existing methods systematically underestimate the strongest shaking and the highest magnitudes, as these are rare and therefore underrepresented in the training data (Fig. 6, 8 in [11], Fig. 3, 4 in [8]). However, early warning systems must also be able to provide reliable warnings for earthquakes larger than any previously seen in a region.

Here, we present the transformer earthquake alerting model (TEAM), a deep learning method for early warning, combining the advantages of both classical early warning strategies while avoiding the deficiencies of prior deep learning approaches. We evaluate TEAM on two datasets from regions with high seismic hazard, namely Japan and Italy. Due to their complementary seismicity, this allows us to evaluate the capabilities of TEAM across scenarios. We compare TEAM to two state-of-the-art warning methods, of which one is prototypical for source based warning and one for propagation based warning.
For our study we use two nation-scale datasets from highly seismically active regions with dense seismic networks, namely Japan (13,512 events, years 1997-2018, Figure S1) and Italy (7,055 events, years 2008-2019, Figure S2). Their seismicity is complementary, with predominantly subduction plate interface or Wadati-Benioff zone events for Japan, many of them offshore, and shallow, crustal events for Italy. We split both datasets into training, development and test sets with ratios of 60:10:30. We use the training set for model training, the development set for model selection, and the test set only for the final evaluation. We split the Japan dataset chronologically, yielding the events between August 2013 and December 2018 as test set. For Italy, we test on all events in 2016, as these are of particular interest, encompassing most of the central Italy sequence with the Mw = 6.2 and Mw = 6.5 Norcia events [12]. Especially the latter event is notably larger than any in the training set (the Mw = 6. L'Aquila event of 2009), thereby challenging the extrapolation capabilities of TEAM.

Both datasets consist of strong motion waveforms. For Japan each station comprises two sensors, one at the surface and one borehole sensor, while for Italy only surface recordings are available. As the instrument response in the frequency band of interest is flat, we do not restitute the waveforms, but only apply a gain correction. This has the advantage that it can trivially be done in real time. The data and preprocessing are further described in appendix A.
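To illustrate the 60:10:30 chronological split described above, a minimal sketch follows. The catalog representation (a list of (origin time, label) pairs) and the function name are hypothetical simplifications for illustration, not the format used in the study.

```python
from datetime import datetime

def chronological_split(events, ratios=(0.6, 0.1, 0.3)):
    """Split a catalog into train/dev/test sets by origin time.

    `events` is a list of (origin_time, event) tuples; the default
    ratios mirror the 60:10:30 split used in the text. Sorting by
    origin time ensures the test set strictly postdates the training set.
    """
    events = sorted(events, key=lambda e: e[0])
    n = len(events)
    n_train = int(ratios[0] * n)
    n_dev = int(ratios[1] * n)
    train = events[:n_train]
    dev = events[n_train:n_train + n_dev]
    test = events[n_train + n_dev:]
    return train, dev, test

# toy usage: ten events with increasing origin times
catalog = [(datetime(2000 + i, 1, 1), f"ev{i}") for i in range(10)]
train, dev, test = chronological_split(catalog)
print(len(train), len(dev), len(test))  # → 6 1 3
```

A chronological split (rather than a random shuffle) avoids leaking aftershocks of test events into training, which is why the text calls it the scenario most similar to actual early warning operation.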
The early warning workflow with TEAM encompasses three separate steps (Figure 1): event detection, PGA estimation and thresholding. We do not further consider the event detection task here, as it forms the base of all methods discussed and affects them similarly. The PGA estimation, resulting in PGA probability densities for a given set of target locations, is the heart of TEAM and described in detail below. In the last step, thresholding, TEAM issues warnings at all target locations where the predicted exceedance probability p for fixed PGA thresholds surpasses a predefined probability α.

The PGA assessment with TEAM is again subdivided into three components: feature extraction, feature combination, and PGA estimation (Figure S3). We present the key ideas for each of these components and their combination in the following. A complete description of the TEAM architecture and training procedure can be found in appendix B. For feature extraction TEAM uses a convolutional neural network (CNN) over the currently recorded waveforms at all input stations. CNNs are well established for feature extraction from seismic waveforms, as they are able to recognize complex features independent of their position in the trace. On the other hand, CNN based feature extraction usually requires a fixed input length, inhibiting real-time processing. We enable real-time processing by zero-padding the waveforms to 30 s length, thereby maintaining a constant input length. We combine the extracted waveform features with sinusoidal vector representations of the station locations, yielding one vector per station. A transformer [13] uses these vectors and the target locations to generate one vector per target location. Transformers are attention-based neural networks for combining information from a flexible number of input vectors in a learnable way.
This architecture, processing a varying number of inputs, together with the explicitly encoded locations, allows TEAM to handle varying sets of stations and targets. The vectors resulting from the transformer, each representing predictions at one target, are fed into a mixture density network [14], which computes PGA densities.

To mitigate the systematic underestimation of high PGA values observed in previous machine learning models, TEAM oversamples large events and PGA targets close to the epicenter during training, which reduces the inherent bias in the data towards smaller PGAs. When learning from small catalogs or when applied to regions where events substantially larger than all training events can be expected, e.g., because of known locked fault patches or historic records, TEAM can additionally use domain adaptation. To this end the training procedure is modified to include events from other regions that are similar to the expected events in the target region. While this measure, to some degree, blurs geological properties of the target region, it provides realistic waveforms and PGAs of large events with similar characteristics, leading to models performing better for large events.

Our Italy dataset is an example of this situation. Accordingly, TEAM applies domain adaptation to this case: it first trains a joint model using data from Japan and from Italy, which is then fine-tuned using the Italy data on its own, except for the addition of a few large, shallow, onshore events from Japan. We chose these events, as for Italy one also expects large, shallow, crustal events due to its tectonic setting and earthquake history.
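As a sketch of the thresholding step, the following shows how an exceedance probability could be derived from a predicted mixture density and compared against α. The assumption that the mixture is Gaussian over log10(PGA) in units of g, the function names, and the example value α = 0.4 are illustrative choices, not specifications from the study.

```python
import math

def exceedance_probability(weights, means, stds, threshold):
    """P(PGA > threshold) under a Gaussian mixture over log10(PGA).

    weights/means/stds describe the mixture predicted for one target;
    the survival function of each Gaussian component is accumulated.
    """
    prob = 0.0
    for w, mu, sigma in zip(weights, means, stds):
        z = (math.log10(threshold) - mu) / sigma
        prob += w * 0.5 * math.erfc(z / math.sqrt(2))
    return prob

def issue_warnings(predictions, pga_threshold, alpha=0.4):
    """Return targets whose exceedance probability surpasses alpha.

    `predictions` maps each target to its (weights, means, stds) mixture.
    """
    alerts = []
    for target, (w, mu, sigma) in predictions.items():
        if exceedance_probability(w, mu, sigma, pga_threshold) > alpha:
            alerts.append(target)
    return alerts
```

Because the full density is predicted, both the PGA threshold and α can be varied after the fact without rerunning the model, which is the property the discussion section highlights.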
Figure 1: Schematic view of TEAM's early warning workflow for the October 2016 Norcia event (Mw = 6.5), 2.5 s after the first P wave pick. a. An event is detected through triggering at multiple seismic stations. The waveform colors correspond to the stations highlighted with orange to magenta outlines. The circles indicate the approximate current position of the P (dashed) and S (solid) wavefronts. b. TEAM's inputs are raw waveforms and station coordinates; it estimates probability densities for the PGA at a set of targets. A more detailed TEAM overview is given in Figure S3. c. The exceedance probabilities for a fixed set of PGA thresholds are calculated from the estimated PGA probability densities. If the probability exceeds a threshold α, a warning is issued. The figure visualizes a 10%g PGA level, resulting in warnings for the highlighted stations. The colors correspond to the stations with green outlines in a. d. The real-time shake map shows the highest PGA levels for which a warning is issued. Stations are colored according to their current warning level. The table lists all stations for which warnings have already been issued.
We compare TEAM to two state-of-the-art early warning methods, one using source estimation and one propagation based. As source estimation based method we use the estimated point source approach (EPS), which estimates magnitudes from peak displacement during the P-wave onset [15] and then applies a GMPE [16] to predict the PGA. For simplicity, our implementation assumes knowledge of the final catalog epicentre, which is impossible in real time, leading to overoptimistic results for EPS. As propagation based method we chose PLUM [5], which issues warnings if a station within a radius r of the target exceeds the level of shaking. Additionally we apply the GMPE used in EPS to catalog location and magnitude as an approximate upper accuracy bound for point source algorithms (Catalog-GMPE or C-GMPE). C-GMPE is a theoretical bound that cannot be realized in real time. It can be considered as an estimate of the modeling error for point source approaches. A detailed description of the baseline methods can be found in appendix C.

We compare the alert performance of all methods for PGA thresholds from light (1%g) to very strong (20%g) shaking, regarding precision, the fraction of alerts actually exceeding the PGA threshold, and recall, the fraction of issued alerts among all cases where the PGA threshold was exceeded [17, 18]. As precision and recall trade off against each other
Figure 2: Warning statistics for the three early warning models (TEAM, EPS, PLUM) for the Japan and Italy datasets. In addition, statistics are provided for C-GMPE, which can only be evaluated post-event due to its reliance on catalog magnitude and location. a. Precision and recall curves across a range of exceedance probability thresholds α. As PLUM has no tuning parameter, its performance is shown as a point. Enlarged markers show the configurations yielding the highest F1 scores. Numbers in the corner give the area under the precision recall curve (AUC), a standard measure quantifying the predictive performance across thresholds. b. Precision, recall and F1 score at different PGA thresholds using the F1-optimal value α. Threshold probabilities α were chosen independently for each method and PGA threshold. c. Number of events and traces exceeding each PGA threshold for training and test set. Training set numbers include development events and show the numbers before oversampling is applied. For Italy the training and test event curves overlap due to similar numbers of events.

depending on α, we additionally use the summary metric F1 score, the harmonic mean of precision and recall. The evaluation metrics and full setup of the evaluation are defined in appendix D. Additionally, we discuss the calibration of the predicted probabilities from TEAM and EPS in appendix E.

TEAM outperforms both EPS and PLUM for both datasets and all PGA thresholds, indicated by the precision-recall curves of TEAM lying to the top right of the baseline curves (Figure 2a). For any baseline method configuration, there is a TEAM configuration surpassing it both in precision and in recall. Improvements are larger for Japan, but still substantial for Italy. Next, we selected α values yielding the highest F1 score separately for each PGA threshold and method. Again, TEAM outperforms both baselines significantly on both datasets, irrespective of the PGA level (Figure 2b).
Further performance statistics are available in tables S1 and S2. All methods degrade with increasing PGA levels, particularly for Japan. This degradation is intrinsic to early warning for high thresholds due to the very low prior probability of strong shaking [6, 18, 17]. Furthermore, shortage of training data with high PGA values results in less well constrained model parameters.
Figure 3: Warning time statistics. a. Area under the precision recall curve for different minimum warning times. All alerts with shorter warning times are counted as false negatives. b. Warning time histograms for true alerts at F1-optimal α, chosen separately for each threshold and method. Warning time dependence on hypocentral distance is shown in Figure S6.

Using domain adaptation techniques, TEAM copes well with the Italy data, even though the largest test event (Mw = 6.5) is significantly larger than the largest training event (Mw = 6. ), and three further test events have Mw ≥ . . To assess the impact of this technique, we compared TEAM's results to a model trained without it (Figures S4, S5). While for low PGA thresholds differences are small, at high PGA levels they grow to more than 20 points in F1 score. For large events, TEAM strongly outperforms TEAM without domain adaptation even for low PGA thresholds. This shows that domain adaptation does not only allow the model to predict higher PGA values, but also to accurately assess the region of lighter shaking for large events.

The previous evaluation considered prediction accuracy irrespective of the warning time, which is the time between issuing of the warning and the first exceedance of the level of shaking. In fact, TEAM consistently outperforms both baselines across different required warning times and irrespective of the PGA threshold (Figure 3a). PLUM warning times in particular are short and only competitive at high PGA thresholds, where potential maximum warning times are naturally short due to the proximity between stations with strong shaking and the epicenter [19].

Warning times depend on α: a lower α value naturally leads to longer warning times but also to more false positive warnings. At F1-optimal thresholds α, EPS and TEAM have similar warning time distributions (Figure 3b, Table S3), but lowering α leads to stronger increases in warning times for TEAM.
For instance, at 10%g, lowering α from 0.5 to 0.2 increases average warning times of TEAM by 2.3 s/1.2 s (Japan/Italy), but only by 1.1 s/0.1 s for EPS. Short times as measured here are critical in real applications: first, they reduce the time available for countermeasures; second, real warning times will be shorter than reported here due to telemetry and compute delays. However, compute delays
Figure 4: Scenario analysis of the 22 November 2016 MJ = 7. Fukushima earthquake, the largest test event located close to shore. Maps show the warning levels for each method (top three rows) at different times (shown in the corner; t = 0 s corresponds to the P arrival at the closest station). Dots represent stations and are colored according to the PGA recorded during the full event, i.e., the prediction target. The bottom row shows (left to right) the catalog based GMPE predictions, the warning time distributions, and the true positives (TP), false negatives (FN) and false positives (FP) for each method, both at a 2%g PGA threshold. EPS and GMPE shake map predictions do not include station terms, but they are included for the bottom row histograms.

for TEAM are very mild: analysing the Norcia event (25 input stations, 246 target sites) for one time step took only 0.15 s on a standard workstation using non-optimized code.
We analyze differences between the methods using one example event from each dataset (Japan: Figure 4, Italy: Figure S7). All methods underestimate the shaking in the first seconds (left column, Figures 4, S7). However, TEAM is the quickest to detect the correct extent of the shaking. Additionally, it estimates even fine-grained regional shaking details in real time (middle and right columns). In contrast, shake maps for EPS remain overly simplified due to the assumptions inherent to GMPEs (right column and bottom left panel). PLUM produces very good PGA estimates, but exhibits the worst warning times.

Notably, TEAM predictions at later times correspond even better to the measured PGA than C-GMPE estimates, although these are based on the final magnitude (top right and bottom left panels). For the Japan data, this is also visible in Figure 2, showing higher accuracy of TEAM's predictions compared to C-GMPE for all thresholds except 20%g. We assume TEAM's superior performance is rooted in both global and local aspects. Global aspects are the abilities to exploit variations in the waveforms, e.g., frequency content, to model complex event characteristics, such as stress drop, radiation pattern or directivity, and to compare to events in the training set. Local aspects include understanding regional effects, e.g., frequency dependent site responses, and the ability to consider shaking at proximal stations. Thereby, combining a global event view with propagation aspects, TEAM can be seen as a hybrid model between source estimation and propagation based approaches.
In this study we presented the transformer earthquake alerting model (TEAM). TEAM outperforms existing early warning methods in terms of both alert performance and warning time. Using a flexible machine learning model, TEAM is able to extract information about an event from raw waveforms and leverage this information to model the complex dependencies of ground motion. We point out two further aspects that make TEAM appealing to users. First, TEAM can adapt to various user requirements by combining two thresholds, one for the shaking level and one for the exceedance probability. As TEAM outputs probability density functions over the PGA, these thresholds can easily be adjusted by individual users on the fly, e.g., by setting sliders in an early warning system. Second, deep learning models typically exhibit large performance improvements with larger training datasets [20] due to their high number of model parameters. In our study this is reflected in the better performance on the Japan dataset, which is twice as large. This indicates that TEAM's performance can be improved simply by collecting more comprehensive catalogs, which happens automatically over time.
Acknowledgements
We thank the National Research Institute for Earth Science and Disaster Resilience (NIED) for providing the catalog and waveform data for our Japan dataset. We thank the Istituto Nazionale di Geofisica e Vulcanologia and the Dipartimento della Protezione Civile for providing the catalog and waveform data for our Italy dataset. J. M. acknowledges the support of the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS). We thank Matteo Picozzi for discussions on earthquake early warning that helped improve the study design. Publications of the code and datasets used in this study are under preparation.
References

[1] R. M. Allen, P. Gasparini, O. Kamigaichi, and M. Böse. The Status of Earthquake Early Warning around the World: An Introductory Overview. Seismological Research Letters, 80(5):682–693, September 2009.
[2] Richard M. Allen and Diego Melgar. Earthquake Early Warning: Advances, Scientific Challenges, and Societal Needs. Annual Review of Earth and Planetary Sciences, 47(1):361–388, 2019.
[3] Angela I. Chung, Ivan Henson, and Richard M. Allen. Optimizing Earthquake Early Warning Performance: ElarmS-3. Seismological Research Letters, 90(2A):727–743, March 2019.
[4] M. Böse, D. E. Smith, C. Felizardo, M.-A. Meier, T. H. Heaton, and J. F. Clinton. FinDer v.2: Improved real-time ground-motion predictions for M2–M9 with seismic finite-source characterization. Geophysical Journal International, 212(1):725–742, January 2018.
[5] Yuki Kodera, Yasuyuki Yamada, Kazuyuki Hirano, Koji Tamaribuchi, Shimpei Adachi, Naoki Hayashimoto, Masahiko Morimoto, Masaki Nakamura, and Mitsuyuki Hoshiba. The Propagation of Local Undamped Motion (PLUM) Method: A Simple and Robust Seismic Wavefield Estimation Approach for Earthquake Early Warning. Bulletin of the Seismological Society of America, 108(2):983–1003, April 2018.
[6] Men-Andrin Meier, Yuki Kodera, Maren Böse, Angela Chung, Mitsuyuki Hoshiba, Elizabeth Cochran, Sarah Minson, Egill Hauksson, and Thomas Heaton. How Often Can Earthquake Early Warning Systems Alert Sites With High-Intensity Ground Motion? Journal of Geophysical Research: Solid Earth, 125(2):e2019JB017718, 2020.
[7] Anthony Lomax, Alberto Michelini, and Dario Jozinović. An Investigation of Rapid Earthquake Characterization Using Single-Station Waveforms and a Convolutional Neural Network. Seismological Research Letters, 90(2A):517–529, 2019.
[8] S. Mostafa Mousavi and Gregory C. Beroza. A Machine-Learning Approach for Earthquake Magnitude Estimation. Geophysical Research Letters, 47(1):e2019GL085976, 2020.
[9] Marius Kriegerowski, Gesa M. Petersen, Hannes Vasyura-Bathke, and Matthias Ohrnberger. A Deep Convolutional Neural Network for Localization of Clustered Earthquakes Based on Multistation Full Waveforms. Seismological Research Letters, 90(2A):510–516, 2019.
[10] S. Mostafa Mousavi and Gregory C. Beroza. Bayesian-Deep-Learning Estimation of Earthquake Location from Single-Station Observations. arXiv:1912.01144 [physics], December 2019.
[11] Dario Jozinović, Anthony Lomax, Ivan Štajduhar, and Alberto Michelini. Rapid prediction of earthquake ground shaking intensity using raw waveform data and a convolutional neural network. Geophysical Journal International, 222(2):1379–1389, 2020.
[12] Mauro Dolce and Daniela Di Bucci. The 2016–2017 Central Apennines Seismic Sequence: Analogies and Differences with Recent Italian Earthquakes. In Kyriazis Pitilakis, editor, Recent Advances in Earthquake Engineering in Europe: 16th European Conference on Earthquake Engineering, Thessaloniki 2018, Geotechnical, Geological and Earthquake Engineering, pages 603–638. Springer International Publishing, Cham, 2018.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[14] Christopher M. Bishop. Mixture density networks. Technical report, Aston University, 1994.
[15] Huseyin Serdar Kuyuk and Richard M. Allen. A global approach to provide magnitude estimates for earthquake early warning alerts. Geophysical Research Letters, 40(24):6329–6333, 2013.
[16] Georgia Cua and Thomas H. Heaton. Characterizing Average Properties of Southern California Ground Motion Amplitudes and Envelopes. EERL Report, Earthquake Engineering Research Laboratory, Pasadena, CA, 2009.
[17] Men-Andrin Meier. How "good" are real-time ground motion predictions from Earthquake Early Warning systems? Journal of Geophysical Research: Solid Earth, 122(7):5561–5577, 2017.
[18] Sarah E. Minson, Annemarie S. Baltay, Elizabeth S. Cochran, Thomas C. Hanks, Morgan T. Page, Sara K. McBride, Kevin R. Milner, and Men-Andrin Meier. The Limits of Earthquake Early Warning Accuracy and Best Alerting Strategy. Scientific Reports, 9(1):2478, February 2019.
[19] Sarah E. Minson, Men-Andrin Meier, Annemarie S. Baltay, Thomas C. Hanks, and Elizabeth S. Cochran. The limits of earthquake early warning: Timeliness of ground motion estimates. Science Advances, 4(3):eaaq0504, March 2018.
[20] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 843–852, 2017.
[21] Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D. Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980, 2019.
[22] Kazi R. Karim and Fumio Yamazaki. Correlation of JMA instrumental seismic intensity with strong motion parameters. Earthquake Engineering & Structural Dynamics, 31(5):1191–1212, 2002.
[23] Elizabeth S. Cochran, Julian Bunn, Sarah E. Minson, Annemarie S. Baltay, Deborah L. Kilb, Yuki Kodera, and Mitsuyuki Hoshiba. Event Detection Performance of the PLUM Earthquake Early Warning Algorithm in Southern California. Bulletin of the Seismological Society of America, 109(4):1524–1541, August 2019.
[24] National Research Institute for Earth Science and Disaster Resilience. NIED K-NET, KiK-net, 2019.
[25] Istituto Nazionale di Geofisica e Vulcanologia (INGV), Istituto di Geologia Ambientale e Geoingegneria (CNR-IGAG), Istituto per la Dinamica dei Processi Ambientali (CNR-IDPA), Istituto di Metodologie per l'Analisi Ambientale (CNR-IMAA), and Agenzia Nazionale per le nuove tecnologie, l'energia e lo sviluppo economico sostenibile (ENEA). Centro di microzonazione sismica network, 2016 Central Italy seismic sequence (CentroMZ), 2018.
[26] Università della Basilicata. UniBAS, 2005.
[27] RESIF - Réseau Sismologique et géodésique Français. RESIF-RLBP French broad-band network, RESIF-RAP strong motion network and other seismic stations in metropolitan France, 1995.
[28] University of Genova. Regional Seismic Network of North Western Italy. International Federation of Digital Seismograph Networks, 1967.
[29] Presidency of Council of Ministers - Civil Protection Department. Italian Strong Motion Network (RAN), 1972.
[30] Istituto Nazionale di Geofisica e Vulcanologia (INGV), Italy. Rete Sismica Nazionale (RSN), 2006.
[31] Dipartimento di Fisica, Università degli Studi di Napoli Federico II. Irpinia Seismic Network (ISNet), 2005.
[32] MedNet Project Partner Institutions. Mediterranean Very Broadband Seismographic Network (MedNet), 1990.
[33] OGS (Istituto Nazionale di Oceanografia e di Geofisica Sperimentale) and University of Trieste. North-East Italy Broadband Network (NI), 2002.
[34] OGS (Istituto Nazionale di Oceanografia e di Geofisica Sperimentale). North-East Italy Seismic Network (NEI), 2016.
[35] RESIF - Réseau Sismologique et géodésique Français. Réseau Accélérométrique Permanent (French Accelerometric Network) (RAP), 1995.
[36] Geological Survey - Provincia Autonoma di Trento. Trentino Seismic Network, 1981.
[37] Istituto Nazionale di Geofisica e Vulcanologia (INGV). INGV Experiments Network, 2008.
[38] EMERSITO Working Group. Seismic Network for Site Effect Studies in Amatrice Area (Central Italy) (SESAA), 2018.
[39] David J. Wald, Vincent Quitoriano, Thomas H. Heaton, and Hiroo Kanamori. Relationships between Peak Ground Acceleration, Peak Ground Velocity, and Modified Mercalli Intensity in California. Earthquake Spectra, 15(3):557–564, August 1999.
A Data and Preprocessing
For our study we use two datasets, one from Japan, one from Italy. The Japan dataset consists of 13,512 events between 1997 and 2018 from the NIED KiK-net catalog [24]. The data was obtained from NIED and consists of triggered strong motion records. Each trace contains 15 s of data before the trigger and has a total length of 120 s. Each station consists of two three-component strong motion sensors, one at the surface and one borehole sensor. We split the dataset chronologically with ratios of 60:10:30 between training, development and test set. The training set ends in March 2012, the test set begins in August 2013. Events in between are used as development set. We decided to use a chronological split to ensure a scenario most similar to the actual application in an early warning setting.

The Italy dataset consists of 7,055 events between 2008 and 2019 from the INGV catalog. We use data from the 3A [25], BA [26], FR [27], GU [28], IT [29], IV [30], IX [31], MN [32], NI [33], OX [34], RA [35], ST [36], TV [37] and XO [38] networks. We use all events from 2016 as test set and the remaining events as training and development sets. The test set consists of 31% of the events, a similar fraction as in the Japan dataset. We shuffle events between training and development set. While a chronological split would have been the default choice, we decided to use 2016 for testing, as it contains a long seismic sequence in central Italy with several very large events in August and October. Further details on the statistics of both datasets can be found in Table S4.

Before training we extract, align and preprocess the waveforms and store them in HDF5 format. As alignment requires the first P pick, we need approximate picks for the datasets. For Japan we use the trigger times provided by NIED. Our preprocessing accounts for misassociated triggers. For Italy we use an STA/LTA trigger around the predicted P arrival.
While triggering needs to be handled differently in an application scenario, we use this simplified approach as our evaluation metrics depend only very weakly on the precision of the picks.
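The STA/LTA trigger mentioned above can be sketched as follows; the window lengths, the threshold of 4, and the function names are hypothetical defaults for illustration, and production systems would typically use an optimized implementation such as the one shipped with ObsPy.

```python
import numpy as np

def sta_lta(trace, fs=100, sta_win=0.5, lta_win=10.0):
    """Classic STA/LTA characteristic function on a 1-D trace.

    Windows are in seconds; returns the ratio of the short-term to the
    long-term average absolute amplitude at each sample (zero where the
    long-term window is not yet filled).
    """
    x = np.abs(trace)
    n_sta = int(sta_win * fs)
    n_lta = int(lta_win * fs)
    csum = np.concatenate(([0.0], np.cumsum(x)))
    ratio = np.zeros_like(x, dtype=float)
    for i in range(n_lta, len(x)):
        sta = (csum[i + 1] - csum[i + 1 - n_sta]) / n_sta
        lta = (csum[i + 1] - csum[i + 1 - n_lta]) / n_lta
        ratio[i] = sta / lta if lta > 0 else 0.0
    return ratio

def pick_trigger(trace, fs=100, threshold=4.0):
    """Return the first sample index where the STA/LTA ratio exceeds threshold."""
    ratio = sta_lta(trace, fs)
    above = np.nonzero(ratio > threshold)[0]
    return int(above[0]) if above.size else None
```

Because the short window reacts to an amplitude increase well before the long window does, the ratio spikes within a few samples of the P onset, which is sufficient for the coarse alignment described above.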
B TEAM - Architecture and training details
TEAM conducts end-to-end PGA estimation. Inputs to our model are three-component waveforms at 100 Hz sampling rate from multiple stations and the corresponding station coordinates. In case of available borehole sensors, we treat the station as having 6 components (3 surface, 3 borehole). In addition the model is provided with a set of output locations at which the PGA should be predicted. These can be anywhere within the spatial domain of the model and need not be identical with station locations in the training set.

The input waveforms from all stations are aligned in time, i.e., all start at the same time t0 and end at the same time t1. We define t0 to be 5 s before the first P wave arrival at any station. This guarantees that each waveform starts with at least 5 s of noise, allowing the model to understand the current noise conditions at a station. In a real-time scenario t1 is the current time, i.e., limited by the waveforms available up to that point in time. We limit t1 to be at most t0 + 30 s. To obtain a constant length input for TEAM, we pad all waveforms after t1 with zeros up to a total length of 30 s.

The PGA estimation consists of three main components, shown in yellow in Figure S3. The first component is a station-wise feature extraction using a convolutional neural network (CNN). The second component is a feature combination that combines the information from the different stations based on their locations and calculates feature representations for the target locations. The last step is a mixture density estimation calculating probabilistic estimates for the PGA at the target locations. The model has about 13.3 million parameters in total. All components are trained jointly end-to-end.

B.1 Feature extraction network
The feature extraction is conducted separately for each station. Nonetheless, the same convolutional neural network (CNN) is applied at all stations, i.e., the same model with the same model weights.

As amplitudes of seismic waveforms can span several orders of magnitude, the first layer of the network normalizes the traces by dividing through their peak. All components of one station are normalized jointly, such that the amplitude ratio between the components stays unaltered. As the peak amplitude of the trace is a key predictor, we logarithmize the value and concatenate it to the feature vector after passing through all the convolutional layers, prior to the fully connected layers.

We apply a set of convolutional and max-pooling layers to the waveforms. We use convolutional layers as this allows the model to extract translation-invariant features and as convolutional kernels can be interpreted as modeling frequency features. We concatenate the output of the convolutions and the logarithm of the peak amplitude. This vector is fed into a multi-layer perceptron to generate the final feature vector for the station. All layers use ReLU activations. A detailed overview of the number and specifications of the layers in the feature extraction model can be found in Table S5.
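The joint peak normalization and the logarithmized peak feature can be sketched as follows; the small guard against all-zero (padded) traces is our own addition:

```python
import numpy as np

def normalize_with_log_peak(trace):
    """Normalize all components of one station jointly by their common peak,
    so the amplitude ratio between components stays unaltered. The log peak
    is returned as an auxiliary scalar feature, to be concatenated to the
    feature vector after the convolutional layers.

    trace: (n_samples, n_components) array.
    """
    peak = np.abs(trace).max()
    peak = max(peak, 1e-10)   # guard against all-zero padded traces (our assumption)
    return trace / peak, np.log10(peak)
```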
B.2 Feature combination network
The feature extraction provides one feature vector per input station representing the waveforms. As additional input, the model is provided with the location of the stations, represented by latitude, longitude and elevation. The targets for the PGA estimation are specified by latitude, longitude and elevation as well.

We use a transformer network [13] for the feature combination. Given a set of n input vectors, a transformer produces n output vectors capturing combined information from all the vectors in a learnable way. We use transformers for two main reasons. First, they are permutation equivariant, i.e., changing the order of input or output stations does not have any impact on the output. This is essential, as there exists no natural ordering of the input stations. Second, they can handle variable input sizes, as the number of parameters of transformers is independent of the number of input vectors. This property allows application of the model to different sets of stations and a flexible number of target locations.

To incorporate the locations of the stations we use predefined position embeddings. As proposed in [13], we use pairs of sinusoidal functions, sin(πx/λ_i) and cos(πx/λ_i), with different wavelengths λ_i. We use 200 dimensions for latitude and longitude, respectively, and the remaining 100 dimensions for elevation. We anticipate two advantages of sinusoidal embeddings for representing the station position. First, keeping the position embeddings fixed instead of learnable reduces the number of parameters and therefore likely provides better representations for stations with only few input measurements or sites not contained in the training set. Second, sinusoidal embeddings guarantee that shifts can be represented by linear transformations, independently of the location to which they are applied. As the attention mechanism in transformers is built on linear projections and dot products, this should allow for more efficient attention scores, at least in the first transformer layers.
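The sinusoidal position embeddings can be sketched as follows. The function name and the wavelength ranges are our own illustrative choices; only the dimension split (200 each for latitude and longitude, 100 for elevation) follows the text:

```python
import numpy as np

def position_embedding(lat, lon, elev, dim_latlon=200, dim_elev=100):
    """Sinusoidal embedding of a station/target position (500 dims in total).
    Wavelengths are geometrically spaced; the ranges are assumptions."""
    def sincos(x, dim, wl_min, wl_max):
        n = dim // 2
        # geometrically spaced wavelengths between wl_min and wl_max
        wavelengths = wl_min * (wl_max / wl_min) ** (np.arange(n) / max(n - 1, 1))
        return np.concatenate([np.sin(np.pi * x / wavelengths),
                               np.cos(np.pi * x / wavelengths)])
    return np.concatenate([
        sincos(lat, dim_latlon, 0.01, 10.0),    # degrees latitude
        sincos(lon, dim_latlon, 0.01, 10.0),    # degrees longitude
        sincos(elev, dim_elev, 0.1, 100.0),     # kilometers elevation
    ])
```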
As proposed in the original transformer paper [13], the position embeddings are added element-wise to the feature vectors to form the input of the transformer. We calculate position embeddings of the target locations in the same way.

As in our model input and output size of the transformer are identical, we only use the transformer encoder stack [13] with six encoder layers. Inputs are the feature vectors with position embeddings from all input stations and the position embeddings of the output locations. By applying masking to the attention we ensure that no attention weight is put on the vectors corresponding to the output locations. This guarantees that each target only affects its own PGA value and not any other PGA values. As the self-attention mechanism of the transformer has quadratic computational complexity in the number of inputs, we restrict the maximum number of input stations to 25 (see training details for the selection procedure). Further details on the hyperparameters can be found in Table S6. The transformer returns one output vector for each input vector. We discard the vectors corresponding to the input stations and only keep the vectors corresponding to the targets.

B.3 Mixture density output
Similar to the feature extraction, the output calculation is conducted separately for each target, while sharing the same model and weights between all targets. We use a mixture density network [14] to predict probability densities for the PGA. We model the probability density as a mixture of m = 5 Gaussian random variables. Using a mixture of Gaussians instead of a single Gaussian allows the model to predict more complex, non-Gaussian distributions, e.g., asymmetric distributions. The functional form of the Gaussian mixture is Σ_{i=1}^m α_i φ_{μ_i,σ_i}(x). We write φ_{μ_i,σ_i} for the density of a normal distribution with mean μ_i and standard deviation σ_i. The values α_i are non-negative weights for the different Gaussians with the property Σ_{i=1}^m α_i = 1. The mixture density network uses a multi-layer perceptron to predict the parameters α_i, μ_i and σ_i. The hidden dimensions are 150, 100, 50, 30, 10. The activation function is ReLU for the hidden layers, linear for the μ and σ outputs and softmax for the α output.

B.4 Training details
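The Gaussian-mixture density of Section B.3 and the negative log-likelihood used as training loss can be sketched as:

```python
import numpy as np

def mixture_density(x, alphas, mus, sigmas):
    """Density of the Gaussian mixture sum_i alpha_i * phi_{mu_i, sigma_i}(x)."""
    alphas = np.asarray(alphas, float)
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
    # per-component normal densities phi_{mu_i, sigma_i}(x)
    comps = (np.exp(-0.5 * ((x - mus) / sigmas) ** 2)
             / (sigmas * np.sqrt(2 * np.pi)))
    return float(np.sum(alphas * comps))

def nll(x, alphas, mus, sigmas):
    """Negative log-likelihood of one observation, the training loss."""
    return -np.log(mixture_density(x, alphas, mus, sigmas))
```

In the model, x is the (suitably transformed) PGA value and the parameters are the outputs of the mixture density network.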
We train the model end-to-end using the negative log-likelihood as loss function. To increase the amount of training data and to train the model on shorter segments of data we apply various forms of data augmentation. Each data augmentation is computed anew each time a particular waveform sample is shown, such that the effective training samples vary.

First, if our dataset contains more stations for an event than the maximum number of 25 allowed by the model, we subsample. We introduce a bias into the subsampling to favor stations closer to the event. We use up to twenty targets for PGA prediction. Similarly to the input stations, we subsample if more targets are available and bias the subsampling towards stations close to the event. This bias ensures that targets with higher PGA values are shown more often during training.

Second, we apply station blinding, meaning we zero out a set of stations in terms of both waveforms and coordinates. The number of stations to blind is uniformly distributed between zero and the total number of stations available minus one. In combination with the first point, this guarantees that the model also learns to predict PGA values at sites where no waveform inputs are available.
Third, we apply temporal blinding. We uniformly select a time t between 1 s before the first P pick and 25 s after. All waveforms are set to zero after time t. The model therefore only uses data available at time t. Even though we never apply TEAM to times before the first P pick, we include these in the training process to ensure TEAM learns a sensible prior distribution. We observed that this leads to better early predictions. As information about the triggering station distribution would introduce a knowledge leak if available from the beginning, we zero out all waveforms and coordinates from stations that did not trigger until time t.

Fourth, we oversample large magnitude events. As large magnitude events are rare, we artificially increase their number in the training set. An event with magnitude M ≥ M_0 is used λ^(M − M_0) times in each training epoch, with λ = 1. and M_0 = 5 for Japan and M_0 = 4 for Italy. This event-based oversampling implicitly increases the number of high PGA values in the training set, too.

We apply all data augmentations on the training and the development set, to ensure that the development set properly represents the task we are interested in. As this introduces stochasticity into the development set metrics, we evaluate the development set three times after each epoch and average the results. In contrast, at test time we do not apply any data augmentation, except temporal blinding for modelling real-time application. If more than 25 stations are available for a test set event, we select the 25 stations with the earliest arrivals for evaluation.

We train our model using the Adam optimizer. We emphasize that the model is only trained on predicting the PGA probability density and does not use any information on the PGA thresholds used for evaluation.
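The blinding and oversampling augmentations above can be sketched as follows; the array interfaces are our own, and the oversampling factor λ is left as a parameter:

```python
import numpy as np

def blind_stations(waveforms, coords, rng):
    """Station blinding: zero out a uniformly chosen number of stations
    (between 0 and n-1), in both waveforms and coordinates."""
    n = waveforms.shape[0]                     # (stations, samples, components)
    n_blind = rng.integers(0, n)               # uniform in {0, ..., n-1}
    idx = rng.choice(n, size=n_blind, replace=False)
    w, c = waveforms.copy(), coords.copy()
    w[idx] = 0.0
    c[idx] = 0.0
    return w, c

def temporal_blind(waveforms, trigger_times, t, sampling_rate=100):
    """Temporal blinding: zero everything after time t, and fully zero
    stations that have not triggered by time t."""
    w = waveforms.copy()
    w[:, int(t * sampling_rate):, :] = 0.0
    w[trigger_times > t] = 0.0
    return w

def oversampling_count(magnitude, m0, lam):
    """Times an event appears per epoch: lambda ** (M - M0) for M >= M0."""
    if magnitude < m0:
        return 1
    return int(round(lam ** (magnitude - m0)))
```

In the model, zeroing a non-triggered station's coordinates would accompany the waveform zeroing, as described in the text.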
We start from a fixed initial learning rate and decrease it by a factor of 3 after 5 epochs without a decrease in validation loss. For the final evaluation we use the model from the epoch with the lowest loss on the development set. We apply gradient clipping with a value of 1.0. We use a batch size of 64. We train the model for 100 epochs.

To improve the calibration of the predicted probability densities we use ensembles [21]. We use an ensemble of 10 models and average the predicted probability densities, weighting each ensemble member identically. To increase the diversity between the ensemble members, we also modify the position encodings between the members by rotating the latitude and longitude values of stations and targets, using a different rotation angle for each of the 10 members.

For the Italy model we use domain adaptation by modifying the training procedure. We first train a model jointly on the Italy and Japan datasets, according to the configuration described above. We use the resulting model weights as initialization for the Italy model. For this training we reduce the number of PGA targets to 4, leading to a higher fraction of high PGA values in the training data, and also reduce the learning rate. In addition, we train jointly on an auxiliary dataset comprised of 77 events from Japan. The events were chosen to be shallow, crustal and onshore, with magnitudes between 5.0 and 7.2. We shift the coordinates of the stations to lie in Italy. We use 85% of the auxiliary events in the training set and 15% in the development set.

We implemented the model using Tensorflow. We trained each model on one GeForce RTX 2080 Ti or Tesla V100. Training of a single model takes approximately 5 h for the Japan dataset, 10 h for the joint model and 1 h for the Italy dataset. We benchmarked the inference performance of TEAM on a common workstation with GPU acceleration (Intel i7-7700, Nvidia Quadro P2000).
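The ensemble averaging of densities and the per-member coordinate rotation can be sketched as follows; rotating about a single reference point is an illustrative simplification:

```python
import numpy as np

def rotate_coords(lat, lon, angle_deg, lat0=0.0, lon0=0.0):
    """Rotate a station/target position about a reference point; used to vary
    the position encodings between ensemble members."""
    a = np.deg2rad(angle_deg)
    x, y = lon - lon0, lat - lat0
    return lat0 + x * np.sin(a) + y * np.cos(a), lon0 + x * np.cos(a) - y * np.sin(a)

def ensemble_density(member_densities):
    """Equal-weight average of the probability densities predicted by the
    individual ensemble members, evaluated on a common grid."""
    return np.mean(member_densities, axis=0)
```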
Running TEAM with ensembling at a single timestep took 0.15 s for all 246 PGA targets of the Norcia event. As our implementation is not optimized for run time, we expect an optimized implementation to yield considerably lower run times, enabling a real-time application of TEAM with a high update rate and low compute latency.

C Baseline methods
We compare TEAM to two baseline methods, EPS and PLUM. We do not compare to any deep learning baseline, because we are not aware of any published deep learning method for early warning that can actually be applied in real time. For the EPS method we use a GMPE based on the functional form by Cua and Heaton [16] and add a quadratic magnitude term as proposed by Meier [17]. We make further minor adjustments to accommodate the wider range of magnitudes in our datasets. The functional form of the GMPE is:

log(pga) = a_1 M + a_2 max(M − M_0, 0)^2 + b (R_d + C(M)) + d log(R_d + C(M)) + e + δS + N(0, σ)   (1)

C(M) := c_1 exp(c_2 max(0, M − 5)) (arctan(M − 5) + π/2)   (2)

R_d := sqrt(R^2 + H_d^2)   (3)

We write M for magnitude, R for epicentral distance, δS for the station bias, and e for an error term. We use m/s² as unit for PGA and km as unit for all length measurements. We use a pseudo-depth H_d, depending on the event depth and the dataset. This allows modeling the stronger attenuation with distance for shallow events. For Italy we set H_d = 5 km for events shallower than 20 km and H_d = 50 km for all other events. For Japan we set H_d = 5 km for events shallower than 20 km, H_d = 40 km for events between 20 km and 200 km, and set H_d to the actual depth for all deeper events, to account for a few very deep events. We set M_0 = 4 for Italy and M_0 = 6 for Japan.

We fix c_1 and c_2 at the values proposed by Cua and Heaton [16] and optimize the other parameters using linear regression. We perform the optimization iteratively to obtain station bias terms, using the union of training and development set. To avoid noise samples in the calibration we only use stations for which R_d lies below a magnitude-dependent distance cutoff, chosen separately for Japan and Italy. The calibrated GMPEs have residual values σ of 0.29 for Italy and 0.33 for Japan, matching values commonly reported for GMPEs. For magnitude estimation, EPS uses the relation

M = c_1 log(P_d) + c_2 log(R) + c_3 + N(0, σ)   (4)

to estimate magnitudes from peak displacement P_d. We use the coefficient values from [15], except for c_3, which needed to be adjusted (separately for Italy and Japan) as we do not use moment magnitude. We combine the predictions in probability space assuming independence between the predictions from different stations. We weight stations based on the length of the P wave window recorded so far. We use the mean value of the single-station magnitude estimates for PGA estimation. For both the application of the GMPE and the magnitude estimation we use the catalog hypocenters.
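A sketch of a GMPE with the functional form of eqs. (1)-(3); all fitted coefficients are left as parameters, and the natural logarithm and the Italy pseudo-depth rule are assumed for illustration:

```python
import numpy as np

def gmpe_log_pga(M, R, depth, coeffs, m0, c1, c2, station_bias=0.0):
    """Median log(PGA) for the functional form of eqs. (1)-(3).

    M: magnitude, R: epicentral distance (km), depth: event depth (km).
    coeffs = (a1, a2, b, d, e) are the fitted parameters; c1, c2 are the
    fixed C(M) coefficients. Natural log is an assumption here.
    """
    a1, a2, b, d, e = coeffs
    # pseudo-depth H_d: Italy rule (5 km shallow, 50 km otherwise) for illustration
    h = 5.0 if depth < 20.0 else 50.0
    r_d = np.sqrt(R ** 2 + h ** 2)                                  # eq. (3)
    c_m = c1 * np.exp(c2 * max(0.0, M - 5.0)) * (np.arctan(M - 5.0) + np.pi / 2)  # eq. (2)
    return (a1 * M + a2 * max(M - m0, 0.0) ** 2                     # eq. (1)
            + b * (r_d + c_m) + d * np.log(r_d + c_m) + e + station_bias)
```

The stochastic term N(0, σ) in eq. (1) describes the residual scatter and is not part of the median prediction.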
As the quality of real-time location estimates will be worse, this leads to inflated performance measures for EPS.

As second baseline, we adapted the PLUM algorithm [5]. While the original paper applies PLUM to seismic intensities, we apply it to PGA values. This adaptation is possible, as approximately linear and especially monotonic relations exist between intensity and PGA [22]. The PGA prediction at a station s at time t is the maximum of all PGA values observed up to time t at stations s′ within a radius r of s. Therefore, a warning for a certain threshold at a station is issued once the threshold has been exceeded at any station within the radius r. Due to the different station densities in Italy and Japan we used different values for r: r = 15 km for Italy and r = 30 km for Japan. Following the findings of Cochran et al. [23], we do not use site correction terms in our implementation of PLUM, as they only have a minor impact on the performance.

D Evaluation metrics
We analyze the performance of the early warning algorithms using PGA thresholds of 1%g, 2%g, 5%g, 10%g and 20%g, approximately matching Modified Mercalli Intensity (MMI) III (light) to VII (very strong) [39]. We calculate PGA from the absolute value of the two horizontal components. For the PGA values of the Japanese data, we use the surface stations and not the borehole stations.

A warning at a site should be issued if at any time during the event the PGA threshold is exceeded at the site. We consider a warning correct (true positive, TP) if a warning for a certain threshold was issued and the threshold was actually exceeded later during the event. Missed warnings (false negative, FN) are all cases where the PGA threshold was exceeded, but no warning was issued or the warning was issued after the PGA threshold was first exceeded. We consider a warning false (false positive, FP) if a warning was issued, but the threshold was not exceeded. All remaining cases are true negatives (TN).

As the number of true negatives depends strongly on the inclusion criteria of the catalog, we use metrics independent of the true negatives. As summary statistics we use precision, TP/(TP+FP), measuring the fraction of correct warnings among all warnings, and recall, TP/(TP+FN), measuring the fraction of possible correct warnings that was issued. We use the F1 score = 2 · precision · recall / (precision + recall) as a combined statistic. Any analysis using a fixed α uses the value maximizing the F1 score, which is specific to each method and PGA threshold. For an analysis independent of the threshold α we use the area under the precision recall curve (AUC). We evaluate a fixed grid of α values between 0 and 1 and add additional points at (0, 1) and (1, 0) to the precision recall curve to approximate the AUC. For the comparison of PLUM using AUC in Figure 3, we introduce an artificial precision recall line for PLUM with a fixed negative slope going through the observed precision and recall values.

We define the warning time as the time between the issuance of a warning and the first exceedance of the threshold. We consider a zero-latency system and do not impose a minimum warning time. For comparing warning times between methods or different parameter combinations, we only use the subset of station event pairs where both methods/parameter combinations issued correct warnings.

We evaluate PLUM continuously, i.e., warnings are issued immediately at the exceedance of a threshold. TEAM and EPS are evaluated every 0.1 s, starting 1 s after the first P arrival for EPS and 0.5 s after the first P arrival for TEAM. We use a longer time before the first prediction for EPS as the early results of EPS are unstable. Warnings are not retracted, i.e., even if the model later estimates a shaking level below the warning threshold, the warning stays active.

E Model calibration
Even though TEAM and EPS give probabilistic predictions, it is not guaranteed that these predictions are well calibrated, i.e., that the predicted confidence values actually correspond to observed probabilities. Calibrated probabilities are essential for threshold selection, as they are required to balance the expected costs of taking action against the expected costs of not taking action. We note that while good calibration is a necessary condition for a good model, it is not sufficient, as a model constantly predicting the marginal distribution of the labels would always be perfectly calibrated, yet not very useful.

Figures S9 and S10 show the calibration diagrams for Japan and Italy at different times after the first P arrival. For low PGA thresholds the calibration is near optimal for the Japan models at all times. For 10%g and 20%g EPS is underconfident, while the calibration of TEAM is still good. For these thresholds the calibration improves with time since the first P arrival.

For Italy the calibration diagrams are more diverse. EPS is generally slightly overconfident, except for the low confidence values at low PGA thresholds. TEAM shows good calibration for all PGA thresholds except 20%g, where it is underconfident. We suspect that the worse calibration for the largest events is caused by the domain adaptation strategy. In conclusion, both models are generally well calibrated, with a slightly better calibration for TEAM.
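The per-bin statistics behind such calibration (reliability) diagrams can be sketched as follows; the binning interface is our own:

```python
import numpy as np

def calibration_bins(confidences, outcomes, n_bins=10):
    """Reliability-diagram statistics: for each confidence bin, the empirical
    accuracy (fraction of traces actually exceeding the threshold) and the
    number of traces in the bin. Bins without samples get accuracy NaN."""
    confidences = np.asarray(confidences, float)
    outcomes = np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    acc = np.full(n_bins, np.nan)
    count = np.zeros(n_bins, int)
    for i in range(n_bins):
        mask = (confidences >= edges[i]) & (confidences < edges[i + 1])
        if i == n_bins - 1:
            mask |= confidences == 1.0     # include the right edge in the last bin
        count[i] = mask.sum()
        if count[i]:
            acc[i] = outcomes[mask].mean()
    return acc, count
```

For a perfectly calibrated model, the accuracy in each bin matches the bin's confidence level.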
Figure S1: Map of the event and station distribution in the Japan data set. Stations are shown as black triangles, events as dots. The event color encodes the event magnitude. Additional far offshore events outside the displayed map region are contained in the catalog.
Figure S2: Map of the event and station distribution in the Italy data set. Stations present in the training set are shown as black triangles, while stations only present in the test set are shown as yellow triangles. Events are shown as dots with the color encoding the event magnitude. All events with magnitudes above 5.5 are shown as stars. The red stars indicate large training events, while the yellow stars indicate large test events. The inset shows the central Italy region with intense seismicity and high station density in the test set. Moment magnitudes for the largest test events are given in the inset.
Figure S3: Overview of the transformer earthquake alerting model. Colors encode whether a box is a neural network, an input or an output. The main components of the model are the feature extraction (a CNN), the feature combination (a transformer) and the PGA estimation at targets (a mixture density network). Further inputs are the target coordinates, which are input just like stations, but with a zero feature vector. Ten instances of this network are trained independently and the results ensemble-averaged.
Figure S4: True positives (TP), false negatives (FN) and false positives (FP) for the events in the Italy test set causing the largest shaking. The methods are the transformer earthquake alerting model without domain adaptation (TEAM base), the transformer earthquake alerting model (TEAM), the estimated point source algorithm (EPS) and PLUM. In addition, a GMPE with full catalog information is included for reference. Values α were chosen separately for each threshold and method to yield the highest F1 score for the whole test set, but are kept constant across all events. TEAM with domain adaptation outperforms TEAM without domain adaptation consistently across all thresholds. This indicates that the domain adaptation not only allows TEAM to better predict higher levels of shaking, but also to better assess large events in general.
Figure S5: Precision, recall and F1 score at different PGA thresholds for Italy, including TEAM without domain adaptation. Threshold values α were chosen independently for each method and PGA threshold to yield the highest F1 score. The methods are the transformer earthquake alerting model without domain adaptation (TEAM Base), the transformer earthquake alerting model (TEAM), the estimated point source (EPS) model and the PLUM model. In addition the graph shows the performance of C-GMPE, a GMPE with full catalog information, for reference.

Figure S6: Warning time and hypocentral distance between station and event for each true alert at F1-optimal α. The white area corresponds roughly to the range of possible warning times and is bounded by the 90th percentile of the times between first detection of an event (i.e., arrival of the P wave at the closest station) and first exceedance of the PGA threshold in recordings at that approximate distance.
Figure S7: Scenario analysis of the 30 October 2016 M_w 6.5 Norcia earthquake, the largest event in the Italy test set. See Fig. 4 in the main paper for further explanations. The bottom row diagrams for this scenario analysis use a 10%g PGA threshold.
Figure S8: Predictions and residuals of the GMPEs derived in this study. All PGA values are given as log units using m/s². Every point refers to one recording. Solid lines indicate running means, dashed lines denote the running standard deviation around the running mean. Orange crosses denote means and standard deviations for magnitude ranges with insufficient data to infer a continuous line. Window sizes are 0.24 m.u./10 km (Italy) and 0.44 m.u./53 km (Japan). Overall σ is 0.29 for Italy and 0.33 for Japan. The plotted magnitude values have been offset by random values between -0.05 and 0.05 m.u. for increased visibility.
Figure S9: Calibration diagrams for Japan at different times after the first P detection and different PGA thresholds. The confidence is defined as the probability of exceeding the PGA threshold as predicted by the model. Each bar represents the traces with a confidence value inside the limits of the bar. Its height is given by the accuracy, the fraction of traces actually exceeding the threshold among all traces in the bar. For a perfectly calibrated model, the confidence equals the accuracy. This is indicated by the dashed line. We note that accuracy estimations for the high PGA thresholds are strongly impacted by stochasticity due to the small number of samples.
Figure S10: Calibration diagrams for Italy at different times after the first P detection and different PGA thresholds. For a further description see the caption of figure S9.