A gray-box model for a probabilistic estimate of regional ground magnetic perturbations: Enhancing the NOAA operational Geospace model with machine learning
Enrico Camporeale, M. D. Cash, H. J. Singer, C. C. Balch, Z. Huang, G. Toth
manuscript submitted to JGR: Space Physics
E. Camporeale (1,2), M. D. Cash (3), H. J. Singer (3), C. C. Balch (3), Z. Huang (4), G. Toth (4)

(1) CIRES, University of Colorado, Boulder, CO, USA
(2) Center for Mathematics and Computer Science (CWI), Amsterdam, Netherlands
(3) NOAA Space Weather Prediction Center, Boulder, CO 80305, USA
(4) Department of Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor, MI, USA
Key Points:
• We present a new model to forecast the maximum value of dB/dt over 20-minute intervals at specific locations
• The model provides a probabilistic forecast of exceeding a pre-defined threshold at a given location
• The ML-enhanced algorithm consistently improves the predictive metrics of the physics-based model
Corresponding author: Enrico Camporeale, [email protected]
Abstract
We present a novel algorithm that predicts the probability that the time derivative of the horizontal component of the ground magnetic field, dB/dt, exceeds a specified threshold at a given location. This quantity provides important information that is physically relevant to Geomagnetically Induced Currents (GIC), which are electric currents associated with sudden changes in the Earth's magnetic field due to Space Weather events. The model follows a 'gray-box' approach by combining the output of a physics-based model with machine learning. Specifically, we combine the University of Michigan's Geospace model, which is operational at the NOAA Space Weather Prediction Center, with a boosted ensemble of classification trees. We discuss the problem of re-calibrating the output of the decision trees to obtain reliable probabilities. The performance of the model is assessed by typical metrics for probabilistic forecasts: Probability of Detection, Probability of False Detection, True Skill Statistic, Heidke Skill Score, and the Receiver Operating Characteristic curve. We show that the ML-enhanced algorithm consistently improves all the metrics considered.
Geomagnetically induced currents (GIC) represent one of the most severe risks posed by space weather events to our infrastructure on the ground, such as high-voltage power transmission systems. GICs are caused by sudden variations of the Earth's magnetic field that, through Faraday's law, induce a variation of the electric field (Boteler et al., 1998; Pirjola et al., 2000; Lanzerotti, 2001; Pulkkinen et al., 2005; Pirjola, 2007; Schrijver & Mitchell, 2013). The induced electric fields responsible for GICs can be estimated from the amplitude of the time derivative of magnetic fluctuations, often denoted as dB/dt, when combined with information about local Earth conductivity characteristics (Boteler & Pirjola, 1998; Pirjola, 2002; Viljanen et al., 2004; Ngwira et al., 2008; Horton et al., 2012). Hence, much attention has been dedicated to understanding and forecasting dB/dt (Viljanen, 1997; Viljanen et al., 2001).

Previous works on forecasting dB/dt can generally be divided into empirical and physics-based models. Empirical models exploit the statistical relationships between input quantities, such as solar wind observations recorded by satellites orbiting around L1 (the first Lagrangian point), and the observed dB/dt at a specific station, with a typical time lag ranging between 15 and 60 minutes. Those statistical relationships can then be encoded into a regression model, in the form of a neural network or a linear filter. Empirical models include Gleisner and Lundstedt (2001); Weigel et al. (2002, 2003); Wintoft (2005); Wintoft et al. (2005); Weimer (2013); Wintoft et al. (2015); Lotz and Cilliers (2015).

On the other hand, physics-based models follow the evolution in time and space of the plasma and the electromagnetic field surrounding Earth and derive the ground magnetic field perturbation from physical laws.
Typically the spatial domain is divided into sub-regions: the MHD approximation is used in the outer magnetosphere, while the inner magnetosphere and the transition to the ionosphere are modeled by including kinetic processes. Examples of physics-based models that can, in principle, forecast dB/dt given the conditions of the solar wind observed at L1 are OpenGGCM (Open General Geospace Circulation Model, Raeder et al. (1998)), GAMERA (Grid Agnostic MHD for Extended Research Applications, Zhang et al. (2019)), and SWMF (Space Weather Modeling Framework, Tóth et al. (2005)). Several works have assessed the ability of physics-based models to forecast geomagnetic perturbations and, more generally, to recover plasma and field conditions as observed in the data (see, e.g., Yu and Ridley (2008); Welling and Ridley (2010); Pulkkinen et al. (2011); Rastätter et al. (2011, 2013); Gordeev et al. (2015); Jordanova et al. (2018); Welling (2019)). The validation and comparison of different models for predicting dB/dt was specifically tackled in Pulkkinen et al. (2013) in order to support selecting a model to transition to operations at NOAA's Space Weather Prediction Center (SWPC). As a result of that comparison, the University of Michigan's SWMF model, henceforth referred to as the Geospace model, was selected for transition to real-time operations.

In this paper we present a new model for predicting whether dB/dt will exceed given thresholds in a given time interval at specific locations. The model builds on the physics-based Geospace model. We show that the skill of the physics-based model can be considerably enhanced with a machine learning technique, improving all the performance metrics considered.
The Geospace model that runs operationally at NOAA/SWPC is a version of the Space Weather Modeling Framework developed by the University of Michigan (Tóth et al., 2005, 2012), which couples the following three physics domains: the outer magnetosphere is solved by BATS-R-US (Block-Adaptive Tree Solar-wind Roe-type Upwind Scheme) (Gombosi et al., 2004), the inner magnetosphere by the Rice Convection Model (RCM) (Toffoletto et al., 2003), and the ionosphere electrodynamics by the Ridley Ionosphere Model (RIM) (Ridley et al., 2004). A detailed description of the Geospace model and its modules can be found in Pulkkinen et al. (2013); Tóth et al. (2014).
In defining the problem, we follow the strategy introduced in Pulkkinen et al. (2013) and later adopted in Tóth et al. (2014) and Welling et al. (2017). Specifically, we define

dB/dt = max_{[t, t+∆t]} sqrt[ (dB_n/dt)^2 + (dB_e/dt)^2 ]    (1)

as the maximum value of the time derivative of the horizontal magnetic field over an interval ∆t, where n and e denote the north and east components of the magnetic field, respectively. More specifically, we restrict the time interval to ∆t = 20 minutes, and we cast the problem as a classification task. Namely, our model predicts the probability that dB/dt will exceed a given threshold at a given location in a 20-minute interval. We use overlapping intervals with a 1-minute stride between subsequent intervals. Henceforth we simply refer to dB/dt as defined in Eq. (1).

As a proof of concept, we will show results for the following three magnetic stations: Fresno, California (geomagnetic latitude: 43.12°N, operated by USGS, code: FRN); Ottawa, Canada (geomagnetic latitude: 54.88°N, operated by GSC, code: OTT); and Iqaluit, Canada (geomagnetic latitude: 73.25°N, operated by GSC, code: IQA), hence testing our new method for low, mid, and high magnetic latitudes, respectively. The reported magnetic coordinates are derived from the International Geomagnetic Reference Field (IGRF), 12th generation (Thébault et al., 2015). The extension of this method to any other station is straightforward.

The need for enhancing a physics-based approach with machine learning can be appreciated by analyzing the accuracy of the Geospace model in predicting dB/dt. Figure 1 shows the number of instances of a given dB/dt value observed in the simulation (vertical axis) versus the corresponding value observed in the data (horizontal axis), both in logarithmic scale (FRN, OTT, and IQA stations shown in the left, middle, and right panels, respectively). Each column (i.e., a fixed observed value) is normalized to its maximum value.
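For concreteness, Eq. (1) with overlapping 20-minute windows and a 1-minute stride can be sketched as follows. This is an illustrative reconstruction, not the operational code; the input arrays are assumed to hold precomputed 1-minute derivatives of the north and east field components.

```python
import numpy as np

def dbdt_max(db_n, db_e, window=20, stride=1):
    """Maximum horizontal dB/dt over sliding windows, per Eq. (1).

    db_n, db_e : 1-minute samples of dB_n/dt and dB_e/dt (nT/s).
    Returns one value per window start; windows overlap with a
    1-minute stride, as in the paper.
    """
    # Horizontal amplitude of the time derivative at each minute
    h = np.sqrt(np.asarray(db_n) ** 2 + np.asarray(db_e) ** 2)
    starts = range(0, len(h) - window + 1, stride)
    return np.array([h[s:s + window].max() for s in starts])
```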
The statistics are computed over a two-year interval (see below). The solid red line represents a perfect match between predicted and observed values. Figure 1 shows that the simulations tend to underestimate dB/dt for large values (particularly at high latitude) and overestimate it for small values (particularly at low latitude). Also, the range of observed and predicted values depends on the geomagnetic latitude, as expected.
The paper is organized as follows. Section 2 introduces the data used for this study and the corresponding time periods covered. Section 3 describes the methodology, including the machine learning technique, the performance metrics, and the features chosen for the model. Section 4 presents the results of the new model, comparing its performance with the output of the Geospace model alone, and emphasizes the probabilistic nature of the forecast. Finally, in Section 5 we draw conclusions and make final remarks about future directions.
The magnetic field historical records have been obtained from the International Real-time Magnetic Observatory Network (INTERMAGNET). The one-minute data in IAGA-2002 format were retrieved for the period 2001-01-01 to 2019-05-05 (ftp://ftp.seismo.nrcan.gc.ca/intermagnet/minute/variation/IAGA2002/) for the three stations (FRN, OTT, IQA), consisting of about 9.45M valid entries per station. The output of the Geospace model used for this work covers the time period 2017-05-28 to 2019-05-05, about 1,000,000 one-minute output values. In addition, we also use the output of the Geospace model discussed in Pulkkinen et al. (2013), evaluated over a time period covering 6 geomagnetic storms. Those simulation outputs are made available from the NASA Community Coordinated Modeling Center (CCMC) (https://ccmc.gsfc.nasa.gov/RoR_WWW/pub/dBdt/out/).

The Geospace model outputs the magnetic field at the location of the three stations at one-minute resolution. However, the inner boundary of the global MHD model is at 2.5 Earth radii (R_E). Therefore, the magnetic perturbations at the geomagnetic observatories are calculated from the currents using Biot-Savart integrals, taking into account the following three contributions: the currents inside the BATS-R-US domain, the field-aligned currents in the gap region between 1 and 2.5 R_E radial distance, and the Pedersen and Hall currents in the ionosphere electrodynamics model RIM (Yu & Ridley, 2008).

In order to assess the accuracy of a trained machine learning model, it is important that the performance metrics are calculated on a portion of the data that has not been used for training (so-called unseen data). Moreover, when dealing with temporal datasets, it is equally important that the training and test sets are temporally disjoint, so as to minimize the temporal correlations between the two and to ensure that the machine learning algorithm actually learns some patterns and does not merely memorize the training data.
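The temporally disjoint split described above can be sketched as follows; the dates and variable names here are illustrative, not the paper's code.

```python
import numpy as np

def temporal_split(times, values, cutoff):
    """Split a time series into temporally disjoint train/test sets:
    everything strictly before `cutoff` goes to training, the rest to
    test.  This avoids the leakage from temporal autocorrelation that
    a random shuffle would introduce."""
    times = np.asarray(times)
    values = np.asarray(values)
    return values[times < cutoff], values[times >= cutoff]
```

With the paper's periods, `cutoff` would be 2019-01-01, placing 2017-05-28 to 2018-12-31 in the training set and 2019-01-01 to 2019-05-05 in the test set.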
For all our experiments and results, we use as training set the period covering 2017-05-28 to 2018-12-31 and as test set the period 2019-01-01 to 2019-05-05. In addition, three of the six storm events used in Pulkkinen et al. (2013) have been added to the training set (events numbered 1, 3, 4), and the rest have been used for testing. We have verified that the distribution of dB/dt is approximately equal between training and test sets. After excluding gaps in the data, the split results in about 450,000 data points in the training set and 90,000 in the test set.

As mentioned in the Introduction, the goal of this work is not to predict the precise value of dB/dt for any given 20-minute interval, but rather to estimate the probability that a pre-defined threshold will be exceeded. Hence, the first task is to define such thresholds. In this paper, we slightly deviate from Pulkkinen et al. (2013), which focused on the following four thresholds: (0.3, 0.7, 1.1, 1.5) nT/s, independent of the station considered. Instead, we define thresholds specific to each location, by analyzing the overall distribution of dB/dt observed in the INTERMAGNET data (∼19 years of data) and choosing the following percentiles as thresholds: 60%, 70%, 80%, 90%, 95%. The resulting thresholds are summarized in Table 1.
Table 1. Thresholds considered for each station (in nT/s)

Station   60%     70%     80%     90%     95%
FRN       0.012   0.014   0.018   0.027   0.038
OTT       0.03    0.038   0.05    0.076   0.11
IQA       0.24    0.32    0.45    0.73    1.11
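The station-specific thresholds of Table 1 are simply empirical percentiles of the long-term dB/dt distribution; a minimal sketch (the NaN handling for data gaps is an assumption, and the input array is hypothetical):

```python
import numpy as np

def station_thresholds(dbdt_history, percentiles=(60, 70, 80, 90, 95)):
    """Station-specific dB/dt thresholds (as in Table 1), computed as
    percentiles of the long-term observed distribution; NaNs (data
    gaps) are ignored."""
    return {p: float(np.nanpercentile(dbdt_history, p)) for p in percentiles}
```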
The task under consideration is a probabilistic classification: for a given station, the model outputs the probability that dB/dt will exceed a specified threshold value. Such a probabilistic outcome can be interpreted as a deterministic binary prediction (i.e., positive/negative) by simply assigning 'positive' to all predictions above a certain probability, and 'negative' otherwise. Once the probabilistic outcome is interpreted as a binary prediction, one can calculate the following quantities, defined over a certain number of predictions:

• P = total number of observed positives (event occurrences);
• N = total number of observed negatives (event non-occurrences);
• TP = True Positives: number of predicted positives that are observed positives;
• FP = False Positives: number of predicted positives that are observed negatives;
• TN = True Negatives: number of predicted negatives that are observed negatives;
• FN = False Negatives: number of predicted negatives that are observed positives;

and the following performance metrics:

• TPR = TP/P = True Positive Rate (also called Probability of Detection, Sensitivity, Hit Rate);
• FPR = FP/N = False Positive Rate (also called Probability of False Detection, False Alarm Rate);
• TSS = TPR − FPR = True Skill Statistic;
• HSS = 2(TP · TN − FN · FP) / (P(FN + TN) + N(TP + FP)) = Heidke Skill Score.

The TPR measures the ability to find all positive events, and a perfect classifier results in TPR = 1; the FPR measures the probability of wrongly classifying a negative as a positive, and a perfect classifier results in FPR = 0. Hence, TSS is a useful metric that combines both types of information and should be as close as possible to 1. Moreover, in a Receiver Operating Characteristic (ROC) curve, TPR and FPR are respectively on the vertical and horizontal axes, and TSS measures the distance to the diagonal (no-skill) line (Krzanowski & Hand, 2009). Finally, the HSS measures the skill of a method compared to a baseline represented by random chance.
HSS has been used in Pulkkinen et al. (2013) and is used here for comparison with previous studies.

The baseline accuracy is represented by the True Skill Statistic and the Heidke Skill Score yielded by the Geospace model alone, that is, by calculating dB/dt directly from the simulation output and comparing to observations. Figure 2 shows the TSS (left) and HSS (right), where blue, red, and yellow lines are for the FRN, OTT, and IQA stations, respectively. The scores are computed over all the data for which we have Geospace simulation outputs, and are shown for different thresholds, represented on the horizontal axis in terms of their percentile calculated over ∼19 years of observational data. One can notice that both scores are latitude dependent. Interestingly, while the TSS increases with higher percentiles (less frequent events) for FRN and OTT, the opposite is true for IQA. A different behaviour is also noticeable with regard to the HSS: the FRN and OTT stations peak around the 70th percentile, while IQA peaks approximately at the 35th percentile. In general, Figure 2 shows that the Geospace model performs better at predicting large thresholds of dB/dt at mid and low latitudes than at high latitudes.
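The contingency counts and skill scores defined above translate directly into code; a minimal sketch with hypothetical label/probability arrays:

```python
def binary_scores(y_true, y_prob, threshold=0.5):
    """Contingency counts and the skill scores used in the paper, for
    probabilistic predictions binarized at `threshold`."""
    tp = fp = tn = fn = 0
    for obs, prob in zip(y_true, y_prob):
        pred = prob >= threshold
        if pred and obs:
            tp += 1
        elif pred and not obs:
            fp += 1
        elif not pred and obs:
            fn += 1
        else:
            tn += 1
    p, n = tp + fn, fp + tn          # observed positives / negatives
    tpr = tp / p if p else 0.0       # probability of detection
    fpr = fp / n if n else 0.0       # probability of false detection
    tss = tpr - fpr                  # True Skill Statistic
    hss = 2 * (tp * tn - fn * fp) / (p * (fn + tn) + n * (tp + fp))
    return {"TPR": tpr, "FPR": fpr, "TSS": tss, "HSS": hss}
```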
A variety of methods exist in the machine learning arena to perform a probabilistic classification task. For this work, we have opted to use a boosted ensemble of classification trees. The method of choice is called RobustBoost (Freund, 2009). In this section we provide a short introduction and appropriate references.

Let us assume we want to assign a label y ∈ {0, 1} to a data point x = {x_1, x_2, ..., x_D}, where D is the dimensionality of x. The task is a supervised binary classification, meaning that we make use of a large dataset of labeled examples to infer a pattern between the inputs x and the binary outputs y, which can then be used to infer the label of new data points that have not been used to train the model. A decision tree is a simple method that recursively partitions the D-dimensional hyperspace of input variables one dimension at a time, thus creating a tree-like structure. In other words, by taking as decision boundaries hyperplanes defined by simple inequalities such as x_i < c, a decision tree divides the input space into a number of hypercubes, where a given label is assigned to all the data belonging to the same hypercube. Decision trees have the great advantage of being very transparent and easily interpretable. In fact, one can simply follow the tree structure from top to bottom to understand how a label is associated with a given data point. In order to choose where to set up a decision boundary (i.e., the value of the constant c) and along which variable, a partition criterion is followed. Two standard partition criteria are the Gini index and the information gain. The Gini index measures the reduction in class impurity, which is defined as the probability that two randomly chosen data points belonging to the same partition have different labels. At a given iteration when growing a tree, the best partition is the one that most reduces this impurity, or in other words that minimizes the probability of a data point being mislabeled. The information gain is based on the entropy measured at each node, and the optimal split is the one that minimizes the global entropy (or maximizes information). A reference monograph on decision trees is Breiman (2017).

The accuracy of a classification tree can be improved by using a boosting strategy. Boosting refers to a class of algorithms that makes use of an ensemble of (not very accurate) predictions to produce a much more accurate one. The members of the ensemble are called weak learners, and their weighted sum is referred to as a strong learner. In the context of classification trees, the weak learners are represented by trees that are grown to only a few layers. One of the most successful and widely applied boosting techniques is AdaBoost (short for Adaptive Boosting), introduced in the seminal paper by Freund and Schapire (1997). This is an algorithm that iteratively adds members to an existing ensemble. The newest member increasingly focuses on the data points that were misclassified by the previous members, and the weights of each member are iteratively adjusted. AdaBoost is typically less prone to overfitting than other algorithms, but it is very sensitive to outliers, because it will keep focusing on the few data points that are misclassified, eventually at the expense of the remaining dataset. A modification of AdaBoost that adds robustness to the algorithm (in the sense of not being so sensitive to outliers) is RobustBoost, introduced in Freund (2009). RobustBoost can be intuitively understood as "giving up" on data points that are so far on the incorrect side of a decision boundary that they are unlikely to be correctly classified even after many iterations.
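For concreteness, the Gini criterion for a candidate split x_i < c can be written down for binary labels; this is an illustrative sketch, not the paper's implementation:

```python
def gini(labels):
    """Gini impurity of a partition: the probability that two points
    drawn from it carry different binary labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n          # fraction of positive labels
    return 2 * p1 * (1 - p1)

def split_impurity(x, y, c):
    """Weighted impurity of the two partitions induced by x_i < c;
    the best split is the one minimizing this quantity."""
    left = [yi for xi, yi in zip(x, y) if xi < c]
    right = [yi for xi, yi in zip(x, y) if xi >= c]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```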
In this work we have used the MATLAB (R2019a) implementation of RobustBoost, which is included in the Statistics and Machine Learning Toolbox. We have tested and compared the following boosting techniques: AdaBoost, GentleBoost, LogitBoost, RobustBoost, and Bagging; although their results were comparable, RobustBoost is the algorithm that consistently yielded the best results.
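The paper's models use MATLAB's RobustBoost, which is not reproduced here. As a language-neutral illustration of the underlying boosting idea described above (iteratively re-weighting misclassified points and summing weighted weak learners), the following is a minimal AdaBoost with decision stumps, run on synthetic data; all names and data are illustrative:

```python
import numpy as np

def fit_stump(X, y, w):
    """Best decision stump (feature j, threshold c, polarity) under
    sample weights w.  Labels y are +/-1."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - c) < 0, -1, 1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, c, pol, err)
    return best

def adaboost(X, y, rounds=50):
    """Minimal AdaBoost (Freund & Schapire, 1997): each round
    up-weights the points the current ensemble misclassifies."""
    n = len(y)
    w = np.full(n, 1 / n)
    ensemble = []
    for _ in range(rounds):
        j, c, pol, err = fit_stump(X, y, w)
        err = max(err, 1e-12)                    # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # weak-learner weight
        pred = np.where(pol * (X[:, j] - c) < 0, -1, 1)
        w *= np.exp(-alpha * y * pred)           # boost misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, c, pol))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted sum of weak learners (the strong learner)."""
    score = sum(a * np.where(p * (X[:, j] - c) < 0, -1, 1)
                for a, j, c, p in ensemble)
    return np.sign(score)
```

RobustBoost differs from this sketch mainly in how it treats persistently misclassified points (effectively giving up on far outliers rather than re-weighting them indefinitely).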
In machine learning jargon, a feature is an explanatory variable that is used as an input for a given algorithm. The present work builds on the idea presented in Tóth et al. (2014). The main finding was that a strong correlation exists between the observed dB/dt and the observed maximum variation in the amplitude of the magnetic field within the same 20-minute interval. In fact, the correlation is almost linear when both quantities are expressed in their logarithms. Tóth et al. (2014) argued that the magnetic perturbations relative to the background dipole value obtained by the Geospace model simulations are much more reliable than the values of dB/dt computed directly from the field.

Feature selection refers to the procedure of selecting the most informative inputs used in a machine learning algorithm, making sure that the number of selected features is large enough for the algorithm to be accurate, but not too large, in order to prevent overfitting and to optimize computational efficiency. The initial selection of features is done by a visual exploratory analysis of the correlation between candidate features and the target dB/dt. We have found that several quantities correlate well when plotted in logarithmic scale. As an example, we show in Figure 3 such correlations for the IQA station. The complete list of initially identified features is in Table 2. Here, the lead time of the model forecasts, ∆T, is defined as the propagation time of the solar wind between the L1 point (where the solar wind is measured) and the outer boundary of the computational domain, approximately at the Earth's bow shock. In the OMNI dataset used for this study, solar wind quantities are conveniently time-shifted to account for the propagation time between L1 and the bow shock.
Hence, we have shifted back in time the measurements of Sym-H only, using the time shift provided by the OMNI dataset (at 1-minute resolution).

The procedure chosen to reduce the number of features is known as backward elimination, and it works as follows. First, we train a linear regression model using all the features listed in Table 2, thus assuming the output (log(dB/dt), measured from magnetometer data) to be a linear combination of the inputs, each weighted by a coefficient. The model returns both the values of the coefficients and their standard deviations. The t-statistic (t-value) is defined as the ratio between a coefficient and its standard deviation. Inputs with low t-values (in absolute value) are deemed non-informative. Therefore, we iteratively reduce the number of features by eliminating the one with the smallest t-value, and we re-train a new linear model at each iteration with the remaining features. In this way we rank all the features listed in Table 2 (first column). Moreover, for each iteration we record the coefficient of determination R² as a metric for the goodness of fit. The final ranking of features is represented in Figure 4, which shows how the value of R² changes as features are added. The features on the horizontal axis are sorted in order of importance from left (most important) to right (least important), and each circle corresponds to a linear model that uses the named feature in addition to all the ones listed to its left. Not surprisingly, the past value of dB/dt is the most informative feature, yielding by itself an R² value of 0.805. However, the next two are features determined by the Geospace model output, namely the difference between the maximum and minimum values of the North and East components of the magnetic field in a 20-minute window. This justifies the gray-box philosophy of combining inputs from simulation outputs with past observations.
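The backward-elimination loop can be sketched as follows. The feature names are hypothetical, and the OLS coefficients and t-statistics are computed by hand with NumPy rather than with the authors' (unspecified) fitting routine:

```python
import numpy as np

def t_values(X, y):
    """OLS coefficients and their t-statistics (coef / std. error)."""
    Xd = np.column_stack([np.ones(len(y)), X])        # add intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    dof = len(y) - Xd.shape[1]
    sigma2 = resid @ resid / dof                      # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    return beta[1:], beta[1:] / se[1:]                # drop the intercept

def backward_eliminate(X, y, names, keep=5):
    """Iteratively drop the feature with the smallest |t| until `keep`
    features remain, refitting a linear model at each step."""
    idx = list(range(X.shape[1]))
    while len(idx) > keep:
        _, t = t_values(X[:, idx], y)
        idx.pop(int(np.argmin(np.abs(t))))
    return [names[i] for i in idx]
```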
On the basis of the backward elimination procedure, we decide to use the top 5 features of Figure 4, noticing that R² tends to plateau with more than 5 features. The procedure has been run on the combined training sets for all three stations. However, to avoid overfitting, at each iteration only 50% of the combined training set has been used to train the linear model. This explains the small fluctuations of R² during the plateau, which otherwise would be monotonically increasing if subsequent models were trained on identical data.

Table 2. Ranking of features. T denotes the time at which dB/dt is predicted and ∆T is the solar wind propagation time.

Rank  Feature                     Meaning                                                              Data source    Time     Selected
1     log(dB/dt)                  target at previous time                                              magnetometer   T − ∆T   yes
2     log(max(Bn) − min(Bn))      range of North component of magnetic field predicted by simulation   Geospace       T        yes
3     log(max(Be) − min(Be))      range of East component of magnetic field predicted by simulation    Geospace       T        yes
4     log(Bz)                     z-component (GSM) of interplanetary magnetic field                   OMNI dataset   T − ∆T   yes
5     SymH                        geomagnetic index Sym-H                                              OMNI dataset   T − ∆T   yes
6     dB/dt_geo                   target predicted by simulation output                                Geospace       T        no
7     log(max(B) − min(B))_geo    range of magnetic field amplitude predicted by simulation            Geospace       T − ∆T   no
8     log(n)                      solar wind density                                                   OMNI dataset   T − ∆T   no
9     log(max(B) − min(B))        range of magnetic field amplitude observed                           magnetometer   T − ∆T   no
10    log(E)                      electric field                                                       OMNI dataset   T − ∆T   no
11    log(|Vx|)                   x-component of solar wind speed                                      OMNI dataset   T − ∆T   no

In this section we show the results of our model in terms of the True Positive Rate (TPR, or probability of detection), False Positive Rate (FPR, or probability of false detection), True Skill Statistic, and Heidke Skill Score discussed in Section 3.1. Essentially, a different classifier is trained for each station and each threshold. The proposed gray-box approach is compared against two alternative approaches: a white-box approach, where one simply uses the value of dB/dt predicted by the Geospace model, and a black-box approach, where similar machine learning classifiers are trained, with the only difference of not using the inputs coming from the Geospace model. In other words, among the top 5 features listed in Table 2, the black-box models use only the three that do not come from the Geospace output. Figure 5 shows the TPR (left) and FPR (right) for the three stations as functions of the different threshold levels (see Table 1). Figure 6 shows TSS (left) and HSS (right) with the same format.
One can notice that both the black- and gray-box models largely outperform the corresponding white-box models. Moreover, although the results depend on station and threshold, the gray-box model further improves on the black-box model, especially for large thresholds, which are the cases of most interest for space weather. On the other hand, whenever the white-box model yields large values for the probability of detection (e.g., for the FRN station), the probability of false detection is also large, resulting in low values for both TSS and HSS.
As anticipated in the Introduction, the goal of this work is not to provide a binary classification, but rather to estimate the probability of exceeding pre-defined thresholds. In principle, classification trees can output probabilities, which are simply calculated as the observed ratio between positives and negatives on a given leaf (the final node of a decision tree), calculated over the whole training set. A well-known problem with classification trees is that such probabilities are often mis-calibrated (Niculescu-Mizil & Caruana, 2005). Calibration refers to the consistency between the predicted probability assigned to an event and the actual frequency observed for that event. For instance, in the binary classification setting, if we collect all the instances in which a model predicts a probability p for a 'positive' outcome (in our case, exceeding a threshold), that model is well-calibrated if, on average, a positive is actually observed with frequency p (the frequency being calculated over all those instances). One way to visualize the relationship between predicted probabilities and observed frequencies is through a reliability diagram (DeGroot & Fienberg, 1983). To construct such a diagram for binary classification, one discretizes the predicted probabilities in bins. For each bin, the average predicted frequency (horizontal axis) is plotted against the true fraction of positive cases in that bin (vertical axis). A perfect calibration results in a diagonal straight line. Figure 7 shows the reliability diagrams for FRN, OTT, and IQA, respectively in the top, middle, and bottom rows. Each panel refers to a different threshold (see Table 1), and the blue circles represent the calibration of the boosted ensemble models, as trained by the MATLAB routine. One can clearly see that such predictions are mis-calibrated. We apply a simple calibration strategy, in which a mapping between old and new probabilities is derived by linearly interpolating the blue circles.
For instance, a probability of 40% might be re-calibrated to a new value of 30%. To perform re-calibration fairly, we have derived the reliability diagram and the corresponding calibration map from the training set only. Figure 7 shows the reliability diagram calculated over the test set. The red diamonds represent the re-calibrated reliability diagrams, which clearly show that all the models have been properly re-calibrated. The re-calibrated values are the ones that should be used to provide a probabilistic prediction.
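The binning and linear-interpolation re-calibration described above can be sketched as follows; the bin count and inputs are illustrative, since the paper does not specify its binning:

```python
import numpy as np

def reliability_curve(y_true, y_prob, bins=10):
    """Reliability diagram: mean predicted probability vs. observed
    positive frequency, per probability bin."""
    edges = np.linspace(0, 1, bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, bins - 1)
    mean_pred, obs_freq = [], []
    for b in range(bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())
            obs_freq.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(obs_freq)

def recalibrate(y_prob, mean_pred, obs_freq):
    """Map raw probabilities to calibrated ones by linear interpolation
    through the reliability-diagram points (derived from training data)."""
    return np.interp(y_prob, mean_pred, obs_freq)
```

In use, `reliability_curve` would be computed on the training set only, and `recalibrate` then applied to the raw test-set probabilities.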
Another important diagnostic for a probabilistic model is the ROC curve. In order to interpret a probabilistic prediction in terms of true/false positives/negatives (see Sec. 3.1), a probability threshold needs to be used to separate the predicted positives from the negatives. In the limit where such a threshold is pushed to 0%, all the predictions become positive, which means that both the true positive rate (TPR) and the false positive rate (FPR) are equal to 1 (all positives are correctly predicted, but all negatives are mis-classified). In the opposite limit, when the threshold is 100% and all predictions are negative, both TPR and FPR become equal to 0 (no positives are predicted, but all negatives are correctly predicted). The ROC curve is a continuous curve in (FPR, TPR) space that connects these extreme scenarios (TPR=FPR=1 and TPR=FPR=0) by gradually changing the threshold from 0% to 100%. The optimal prediction is TPR=1 and FPR=0, and the optimal threshold is the point on the ROC curve with maximum distance from the diagonal. ROC curves for the FRN, OTT, and IQA stations are shown in Figure 8, respectively in the left, middle, and right panels. Different colors denote the five different thresholds, and a filled circle represents the optimal values (which have been used in the previous figures). Note that the True Skill Statistic (TSS) is the vertical distance between the ROC curve and the diagonal line (TPR=FPR), which represents no skill (i.e., a climatological forecast). The ROC curves demonstrate the general tendency of the models to improve their True Skill Statistic for higher thresholds, as already shown in previous figures. Moreover, it is important to realize that the re-calibration described in the previous section does not affect the ROC curve. In fact, by mapping old to new probabilities, the points on a given ROC curve get shifted along the same curve.
In other words, what changes through re-calibration is the value of the optimal threshold, but not the corresponding values of TPR, FPR, and skill scores. In practice, because the un-calibrated models tend to be overconfident (i.e., below the diagonal line in the reliability diagram), re-calibration changes the optimal threshold from 50% to larger values. For instance, it can be that for a given model one needs to interpret as positives the predictions with probabilities larger than 80%, rather than 50%.
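The threshold sweep that traces the ROC curve, and the TSS-maximizing optimal threshold, might look like the following sketch (inputs illustrative):

```python
import numpy as np

def roc_curve(y_true, y_prob):
    """Sweep the probability threshold from 1 down to 0, recording
    (FPR, TPR) pairs.  The optimal threshold maximizes TSS = TPR - FPR,
    i.e. the vertical distance of the curve from the no-skill diagonal."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    thresholds = np.unique(np.concatenate(([0.0, 1.0], y_prob)))[::-1]
    P = y_true.sum()
    N = len(y_true) - P
    fpr, tpr = [], []
    for th in thresholds:
        pred = y_prob >= th
        tpr.append((pred & (y_true == 1)).sum() / P)
        fpr.append((pred & (y_true == 0)).sum() / N)
    tss = np.array(tpr) - np.array(fpr)
    return np.array(fpr), np.array(tpr), thresholds[int(np.argmax(tss))]
```

Because re-calibration is a monotone mapping of probabilities, it re-labels the points along this curve without moving them, consistent with the discussion above.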
We have developed a model that estimates the probability of dB/dt exceeding a given threshold, for three stations ranging from low to mid and high latitudes (FRN, OTT, and IQA). Five different thresholds were chosen for each station, by calculating the 60, 70, 80, 90, and 95 percentiles of the observed dB/dt over a long-span historic dataset (∼19 years). We expect the model to improve over time by better capturing properties of the physical system. As already noted in Tóth et al. (2014), the maximum perturbation of the magnetic field within a 20-minute interval correlates very strongly with dB/dt, and hence it can be used as a predictor in a machine learning algorithm. The chosen machine learning algorithm is an ensemble of classification trees, adaptively boosted via RobustBoost (Freund, 2009), and the performance metrics that we have analyzed are the True Positive Rate (TPR, or probability of detection), the False Positive Rate (FPR, or probability of false detection), the True Skill Statistic (TSS), and the Heidke Skill Score (HSS). Finally, we have discussed the issue of re-calibration and the ROC curves relative to all models.

Overall, the gray-box approach proposed in this paper consistently enhances the results of the corresponding white-box approach, in which one would directly take the results of the Geospace model as predictors of dB/dt. Indeed, Figure 9 summarizes the findings of the previous Figures by comparing the True Skill Statistic (left panel) and the Heidke Skill Score (right panel) of the Geospace model alone (horizontal axis) against the corresponding results obtained by applying machine learning (vertical axis). Different symbols are used for the three different stations, and the region above the diagonal black solid line denotes an improvement.

The new model will be a valuable addition to the operational capabilities of the NOAA Space Weather Prediction Center.
It will be straightforward to extend the model to include several stations spanning a range of latitudes and longitudes. We are currently investigating the optimal strategy for representing, in a compact graphical display, the probabilistic outcomes for several stations and several thresholds, such that the SWPC forecasters can extract valuable real-time information on a regional scale and experiment with how to incorporate such information into their forecasts.
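For completeness, the four metrics summarized above can be written compactly in terms of the 2×2 contingency table of hits (a), false alarms (b), misses (c), and correct rejections (d). The sketch below (an illustration with our own function name, not the operational implementation) uses the standard definitions TPR = a/(a+c), FPR = b/(b+d), TSS = TPR − FPR, and HSS = 2(ad − bc)/[(a+c)(c+d) + (a+b)(b+d)]:

```python
import numpy as np

def skill_scores(y_obs, y_pred):
    """TPR, FPR, TSS, and HSS from the 2x2 contingency table.

    y_obs, y_pred: boolean arrays (event observed / event predicted).
    """
    y_obs = np.asarray(y_obs, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    a = np.sum(y_pred & y_obs)      # hits (true positives)
    b = np.sum(y_pred & ~y_obs)     # false alarms (false positives)
    c = np.sum(~y_pred & y_obs)     # misses (false negatives)
    d = np.sum(~y_pred & ~y_obs)    # correct rejections (true negatives)
    tpr = a / (a + c)               # probability of detection
    fpr = b / (b + d)               # probability of false detection
    tss = tpr - fpr                 # True Skill Statistic (Peirce score)
    hss = 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    return tpr, fpr, tss, hss
```

A perfect forecast yields TSS = HSS = 1, while a forecast that always predicts the positive class (the degenerate 0% threshold) yields TSS = HSS = 0, consistent with the no-skill diagonal of the ROC diagram.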
Acknowledgments

The ground magnetometer data used in this study are available at ftp.seismo.nrcan.gc.ca/intermagnet/minute/variation/IAGA2002/. The Geospace model outputs discussed in Pulkkinen et al. (2013) are made available by the NASA Community Coordinated Modeling Center (CCMC) (https://ccmc.gsfc.nasa.gov/RoR_WWW/pub/dBdt/out/). All the data and codes will be made available as a Zenodo/Github repository after the manuscript is accepted for publication.
Figure 1.
2D histogram of the counts of dB/dt as obtained from the Geospace simulation (vertical axis) vs the corresponding measured values (horizontal axis). Both axes are in logarithmic scale, and the heat-map is normalized column-wise with respect to the maximum value of each column, for better visualization. The FRN, OTT, and IQA stations are shown in the left, middle, and right panels, respectively.
Figure 2.
True Skill Statistic (left) and Heidke Skill Score (right) obtained from the predictions of the Geospace model, for the different stations (blue for FRN, red for OTT, and yellow for IQA), as functions of the different threshold percentiles. The percentiles are calculated on the distribution of observed dB/dt for a given station over a period of ∼19 years of data.
Figure 3.
2D histogram of the counts of the target variable dB/dt at the IQA station (vertical axis) and the 6 features described in Sec. 3.3. Each heat-map is normalized column-wise with respect to its maximum value.

Figure 4.
Coefficient of determination R² for the linear models trained successively on a larger number of features. Each symbol represents a model trained with the feature reported on the horizontal axis in addition to all the features to its left (see Table 2). The most important features are to the left and the least important to the right.
Figure 5.
Probability of detection (left) and probability of false detection (right) vs different thresholds (horizontal axis). Blue, red, and yellow lines denote, respectively: a black-box model trained without using the Geospace output, a gray-box model that uses both past observations and the Geospace output, and a white-box model that uses only the Geospace outputs.
Figure 6.
True Skill Statistic (left) and Heidke Skill Score (right) vs different thresholds (horizontal axis). Blue, red, and yellow lines denote, respectively: a black-box model trained without using the Geospace output, a gray-box model that uses both past observations and the Geospace output, and a white-box model that uses only the Geospace outputs.
Figure 7.
Reliability diagrams for different thresholds (increasing from left to right panels). Blue circles indicate the results of the non-calibrated models, and red diamonds indicate the reliability achieved after re-calibration. The FRN, OTT, and IQA stations are shown in the top, middle, and bottom rows, respectively.
Figure 8.
ROC curves (TPR vs FPR) for different thresholds. Filled dots indicate the optimal points along a given ROC curve. The FRN, OTT, and IQA stations are shown in the left, middle, and right panels, respectively.
Figure 9.
Comparison of the True Skill Statistic (left) and Heidke Skill Score (right) for models using the output of the Geospace model alone (horizontal axis) vs the model presented in this paper (combining Geospace outputs with machine learning, vertical axis). The diagonal black line indicates no improvement.
References
Boteler, D., & Pirjola, R. (1998). The complex-image method for calculating the magnetic and electric fields produced at the surface of the earth by the auroral electrojet. Geophysical Journal International, (1), 31–40.
Boteler, D., Pirjola, R., & Nevanlinna, H. (1998). The effects of geomagnetic disturbances on electrical systems at the earth's surface. Advances in Space Research, (1), 17–27.
Breiman, L. (2017). Classification and regression trees. Routledge.
Camporeale, E. (2019). The challenge of machine learning in space weather nowcasting and forecasting. Space Weather, (8).
Camporeale, E., Wing, S., & Johnson, J. (2018). Machine learning techniques for space weather. Elsevier.
DeGroot, M. H., & Fienberg, S. E. (1983). The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), (1-2), 12–22.
Freund, Y. (2009). A more robust boosting algorithm. arXiv preprint arXiv:0905.2138.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, (1), 119–139.
Gleisner, H., & Lundstedt, H. (2001). A neural network-based local model for prediction of geomagnetic disturbances. Journal of Geophysical Research: Space Physics, (A5), 8425–8433.
Gombosi, T. I., Powell, K. G., De Zeeuw, D. L., Clauer, C. R., Hansen, K. C., Manchester, W. B., ... others (2004). Solution-adaptive magnetohydrodynamics for space plasmas: Sun-to-earth simulations. Computing in Science & Engineering, (2), 14.
Gordeev, E., Sergeev, V., Honkonen, I., Kuznetsova, M., Rastätter, L., Palmroth, M., ... Wiltberger, M. (2015). Assessing the performance of community-available global MHD models using key system parameters and empirical relationships. Space Weather, (12), 868–884.
Horton, R., Boteler, D., Overbye, T. J., Pirjola, R., & Dugan, R. C. (2012). A test case for the calculation of geomagnetically induced currents. IEEE Transactions on Power Delivery, (4), 2368–2373.
Jordanova, V. K., Delzanno, G. L., Henderson, M. G., Godinez, H. C., Jeffery, C., Lawrence, E. C., ... others (2018). Specification of the near-earth space environment with SHIELDS. Journal of Atmospheric and Solar-Terrestrial Physics, 148–159.
Krzanowski, W. J., & Hand, D. J. (2009). ROC curves for continuous data. Chapman and Hall/CRC.
Lanzerotti, L. J. (2001). Space weather effects on technologies. Space Weather, 11–22.
Lotz, S., & Cilliers, P. (2015). A solar wind-based model of geomagnetic field fluctuations at a mid-latitude station. Advances in Space Research, (1), 220–230.
Ngwira, C. M., Pulkkinen, A., McKinnell, L.-A., & Cilliers, P. J. (2008). Improved modeling of geomagnetically induced currents in the South African power network. Space Weather, (11).
Niculescu-Mizil, A., & Caruana, R. (2005). Obtaining calibrated probabilities from boosting. In UAI (p. 413).
Pirjola, R. (2002). Review on the calculation of surface electric and magnetic fields and of geomagnetically induced currents in ground-based technological systems. Surveys in Geophysics, (1), 71–90.
Pirjola, R. (2007). Space weather effects on power grids. In Space weather-physics and effects (pp. 269–288). Springer.
Pirjola, R., Boteler, D., Viljanen, A., & Amm, O. (2000). Prediction of geomagnetically induced currents in power transmission systems. Advances in Space Research, (1), 5–14.
Pulkkinen, A., Kuznetsova, M., Ridley, A., Raeder, J., Vapirev, A., Weimer, D., ... others (2011). Geospace environment modeling 2008–2009 challenge: Ground magnetic field perturbations. Space Weather, (2).
Pulkkinen, A., Lindahl, S., Viljanen, A., & Pirjola, R. (2005). Geomagnetic storm of 29–31 October 2003: Geomagnetically induced currents and their relation to problems in the Swedish high-voltage power transmission system. Space Weather, (8).
Pulkkinen, A., Rastätter, L., Kuznetsova, M., Singer, H., Balch, C., Weimer, D., ... others (2013). Community-wide validation of geospace model ground magnetic field perturbation predictions to support model transition to operations. Space Weather, (6), 369–385.
Raeder, J., Berchem, J., & Ashour-Abdalla, M. (1998). The geospace environment modeling grand challenge: Results from a global geospace circulation model. Journal of Geophysical Research: Space Physics, (A7), 14787–14797.
Rastätter, L., Kuznetsova, M., Glocer, A., Welling, D., Meng, X., Raeder, J., ... others (2013). Geospace environment modeling 2008–2009 challenge: Dst index. Space Weather, (4), 187–205.
Rastätter, L., Kuznetsova, M., Vapirev, A., Ridley, A., Wiltberger, M., Pulkkinen, A., ... Singer, H. (2011). Geospace environment modeling 2008–2009 challenge: Geosynchronous magnetic field. Space Weather, (4), 1–15.
Ridley, A., Gombosi, T., & DeZeeuw, D. (2004). Ionospheric control of the magnetosphere: Conductance. In Annales Geophysicae (Vol. 22, pp. 567–584).
Schrijver, C. J., & Mitchell, S. D. (2013). Disturbances in the US electric grid associated with geomagnetic activity. Journal of Space Weather and Space Climate, A19.
Thébault, E., Finlay, C. C., Beggan, C. D., Alken, P., Aubert, J., Barrois, O., ... others (2015). International geomagnetic reference field: the 12th generation. Earth, Planets and Space, (1), 79.
Toffoletto, F., Sazykin, S., Spiro, R., & Wolf, R. (2003). Inner magnetospheric modeling with the Rice Convection Model. Space Science Reviews, (1-2), 175–196.
Tóth, G., Meng, X., Gombosi, T. I., & Rastätter, L. (2014). Predicting the time derivative of local magnetic perturbations. Journal of Geophysical Research: Space Physics, (1), 310–321.
Tóth, G., Sokolov, I. V., Gombosi, T. I., Chesney, D. R., Clauer, C. R., De Zeeuw, D. L., ... others (2005). Space weather modeling framework: A new tool for the space science community. Journal of Geophysical Research: Space Physics, (A12).
Tóth, G., Van der Holst, B., Sokolov, I. V., De Zeeuw, D. L., Gombosi, T. I., Fang, F., ... others (2012). Adaptive numerical algorithms in space weather modeling. Journal of Computational Physics, (3), 870–903.
Viljanen, A. (1997). The relation between geomagnetic variations and their time derivatives and implications for estimation of induction risks. Geophysical Research Letters, (6), 631–634.
Viljanen, A., Nevanlinna, H., Pajunpää, K., & Pulkkinen, A. (2001). Time derivative of the horizontal geomagnetic field as an activity indicator. In Annales Geophysicae (Vol. 19, pp. 1107–1118).
Viljanen, A., Pulkkinen, A., Amm, O., Pirjola, R., & Korja, T. (2004). Fast computation of the geoelectric field using the method of elementary current systems and planar earth models. In Annales Geophysicae (Vol. 22, pp. 101–113).
Weigel, R., Klimas, A., & Vassiliadis, D. (2003). Solar wind coupling to and predictability of ground magnetic fields and their time derivatives. Journal of Geophysical Research: Space Physics, (A7).
Weigel, R., Vassiliadis, D., & Klimas, A. (2002). Coupling of the solar wind to temporal fluctuations in ground magnetic fields. Geophysical Research Letters, (19), 21–1.
Weimer, D. R. (2013). An empirical model of ground-level geomagnetic perturbations. Space Weather, (3), 107–120.
Welling, D. (2019). Magnetohydrodynamic models of B and their use in GIC estimates. Geomagnetically Induced Currents from the Sun to the Power Grid, 43–65.
Welling, D., Anderson, B., Crowley, G., Pulkkinen, A., & Rastätter, L. (2017). Exploring predictive performance: A reanalysis of the geospace model transition challenge. Space Weather, (1), 192–203.
Welling, D., & Ridley, A. (2010). Validation of SWMF magnetic field and plasma. Space Weather, (3).
Wintoft, P. (2005). Study of the solar wind coupling to the time difference horizontal geomagnetic field. In Annales Geophysicae (Vol. 23, pp. 1949–1957).
Wintoft, P., Wik, M., Lundstedt, H., & Eliasson, L. (2005). Predictions of local ground geomagnetic field fluctuations during the 7–10 November 2004 events studied with solar wind driven models. In Annales Geophysicae (Vol. 23, pp. 3095–3101).
Wintoft, P., Wik, M., & Viljanen, A. (2015). Solar wind driven empirical forecast models of the time derivative of the ground magnetic field. Journal of Space Weather and Space Climate, A7.
Yu, Y., & Ridley, A. J. (2008). Validation of the space weather modeling framework using ground-based magnetometers. Space Weather, (5), 1–20.
Zhang, B., Sorathia, K. A., Lyon, J. G., Merkin, V. G., Garretson, J. S., & Wiltberger, M. (2019). Gamera: A three-dimensional finite-volume MHD solver for non-orthogonal curvilinear geometries. The Astrophysical Journal Supplement Series, (1), 20.