Ensemble Forecasting of Major Solar Flares: Methods for Combining Models
Jordan A. Guerra, Sophie A. Murray, D. Shaun Bloomfield, Peter T. Gallagher
Submitted to
Journal of Space Weather and Space Climate. © The author(s), under the Creative Commons Attribution 4.0 International License (CC BY 4.0)
J. A. Guerra, S. A. Murray, D. S. Bloomfield, and P. T. Gallagher

Physics Department, Villanova University, 800 E Lancaster Ave., Villanova, PA 19085, USA; e-mail: [email protected]⋆
School of Physics, Trinity College Dublin, Ireland
School of Cosmic Physics, Dublin Institute for Advanced Studies, Ireland
Department of Mathematics, Physics and Electrical Engineering, Northumbria University, Newcastle upon Tyne, NE1 8ST, UK
ABSTRACT
One essential component of operational space weather forecasting is the prediction of solar flares. With a multitude of flare forecasting methods now available online, it is still unclear which of these methods performs best, and none are substantially better than climatological forecasts. Space weather researchers are increasingly looking towards methods used by the terrestrial weather community to improve current forecasting techniques. Ensemble forecasting has been used in numerical weather prediction for many years as a way to combine different predictions in order to obtain a more accurate result. Here we construct ensemble forecasts for major solar flares by linearly combining the full-disk probabilistic forecasts from a group of operational forecasting methods (ASAP, ASSA, MAG4, MOSWOC, NOAA, and MCSTAT). Forecasts from each method are weighted by a factor that accounts for the method's ability to predict previous events, and several performance metrics (both probabilistic and categorical) are considered. It is found that most ensembles achieve a better skill metric (between 5% and 15%) than any of the members alone. Moreover, over 90% of ensembles perform better (as measured by forecast attributes) than a simple equal-weights average. Finally, ensemble uncertainties are highly dependent on the internal metric being optimized, and they are estimated to be less than 20% for probabilities greater than 0.2. This simple multi-model, linear ensemble technique can provide operational space weather centres with the basis for constructing a versatile ensemble forecasting system – an improved starting point to their forecasts that can be tailored to different end-user needs.

Key words.
Solar flare forecasting – Ensembles – Weighted linear combination
1. Introduction
Predicting when a solar flare may occur is perhaps one of the most challenging tasks in space weather forecasting due to the intrinsic nature of the phenomenon itself (magnetic energy storage
⋆ Corresponding author
Fig. 1.
Left:
Number of flare forecasting methods publicly available per year since 1992. For each year, existing methods (grey) and new methods (red) are displayed. Since 2010 the number of flare forecasting methods has increased at an average of approximately three every two years. This information was partially gathered from Leka et al. (2019), the NASA/GSFC Community Coordinated Modeling Center (CCMC) archive of forecasts, and other operational centre online archives. The earliest date when the first forecast was made available in these sources was used for the purposes of this figure.
Right: forecast variance vs. average forecast for a six-method group of probabilistic forecasts for M-class flares between 2015 and 2017. Variance is lower when the average forecast is closer to zero or one.

by turbulent shear flows + unknown triggering mechanism + magnetic reconnection), the lack of more appropriate remote-sensing data, and the rarity of the events, particularly for large (i.e., X-class) flares (Leka and Barnes, 2018; Hudson, 2007). Yet the need for more accurate, time-sensitive, user-specific, and versatile forecasts remains relevant as the technological, societal, and economic impact of these events becomes more evident with time (Tsagouri et al., 2013). In the past decade the number of flare forecasting methods has increased rapidly, at an average rate of approximately three new methods every two years (Fig. 1, left panel), aided in part by the availability of data from the Solar Dynamics Observatory (SDO; Pesnell et al., 2012), which provides high-quality solar imagery with an operational-like routine.

Differences in input data, training sets, and empirical and/or statistical models used among different forecasting methods make it difficult to directly compare performances across all methods (Barnes et al., 2016). Just looking at the probabilistic forecasts for M-class flares from a group of six different methods (see Section 2) during three years (2015–2017) reveals that the variance of the probabilities around the average value is significantly larger away from 0 and 1 (Fig. 1, right panel). That is, for a particular time, if solar conditions are such that there is only a low chance of observing an M-class flare, all methods report similar low chances. Similarly, if solar conditions favor a high chance of observing the flare, all methods seem to report similar high chances.
However, if there is only a moderate chance of observing a flare, forecasts in this case can range from low to very high. In an operational environment, space weather forecasters are often faced with the responsibility of issuing alerts and making decisions based on forecasts like those described above. However, in cases where different methods provide very different forecasts, it can be difficult to know which method will be more accurate given the specific solar conditions. It is in these cases that using the different forecasts to create a combined, more-accurate prediction may be advantageous. Ensemble forecasting, although successfully used in terrestrial weather practice for decades, is fairly new in space weather (Knipp, 2016; Murray, 2018). In the field of flare forecasting, Guerra et al. (2015) demonstrated the applicability of multi-model input ensemble forecasting for flare occurrence within a particular active region. Using a small statistical sample (only four forecasting methods and 13 full-passage active-region hourly forecast time series), the authors showed that linearly combining probabilistic forecasts, using combination weights based on the performance history of each method, makes more accurate forecasts. In addition, Guerra et al. (2015) suggested that combining forecasts which are intrinsically different (i.e., automatic/software versus human-influenced/experts) has the potential to improve the prediction in comparison to the case in which only forecasts of a similar type are used. However, the small data sample used in that analysis (events and forecasts) is not statistically significant enough to fully quantify how much ensembles can improve the prediction of flares. In this study, the ensemble forecasting method presented in Guerra et al.
(2015) is expanded to include more forecasting methods and a larger data sample, with a particular focus on analyzing full-disk forecasts, which are used more widely by operational centers. The effects of considering different performance metrics and linear combination schemes are modelled and tested. Section 2 presents and briefly describes the data sample employed (forecasts, the forecasting methods used to create them, and observed events). In Section 3 the ensemble models are described. Main results are presented and discussed in Section 4; the discussion is organized around the constructed ensembles and comparisons among them, and a brief demonstration of uncertainty analysis is also included. Finally, conclusions and potential future work are outlined in Section 5.
2. Forecast Data Sample
In this investigation, full-disk probabilistic forecast time series for the occurrence of M- and X-class flares were used from six different operational methods. Table 1 presents and describes the forecasting methods (i.e., members) used for ensemble construction. Many of them are available on the NASA Community Coordinated Modeling Center Flare Scoreboard, located at https://ccmc.gsfc.nasa.gov/challenges/flare.php. Four out of six methods (MAG4, ASSA, ASAP, MCSTAT) are fully automated, while the remaining two (NOAA, MOSWOC) are considered human-influenced – i.e., the raw forecasts, produced by trained software, are adjusted according to a human forecaster's expertise and knowledge. All methods listed in Table 1 produce forecasts at least every 24 hours, and forecast probabilities consist of the likelihood (0−1) of observing at least one flare of the given class within the forecast window ∆t. A time span of three years (2014, 2015, and 2016) was considered in this study. This particular time period was chosen in order to maximize both the number of methods to be included and the number of forecasts without significant gaps in the data.

It is important to highlight that, in order to combine the forecasts from different sources, these need to correspond to the same forecast window duration. For all methods but one, forecasts correspond to a 24-hour window. For the exception, ASSA, ∆t =
12 hours. In this case, because of the Poisson-statistics nature of that method, a 12-hour forecast can be transformed into a 24-hour forecast as illustrated in Guerra et al. (2015). In addition, for methods such as MCSTAT and ASAP, which
Table 1.
Flare forecasting methods included in the ensemble forecast (members). Name, developer/issuer/responsible institution, details of the predictive model, the archive or location used to retrieve forecasts, and references for each method are presented.

Method  | Issuer/Responsible           | Predicting method                                        | Source                         | Reference
MAG4    | U. of Alabama, MSFC          | Forecasting curve + free-energy proxy; fully automated   | iSWA (1)                       | Falconer et al. (2011, 2014)
ASSA    | Korean Space Weather Center  | McIntosh class + Poisson statistics; fully automated     | iSWA (1)                       | ASSA Manual (2)
ASAP    | U. of Bradford, UK           | McIntosh class; sunspot-group area; neural network       | iSWA (1)                       | Colak and Qahwaji (2008, 2009)
NOAA    | NOAA SWPC                    | Table look-up + persistence + climatology; human corrected | swpc.noaa.gov                | Crown (2012)
MOSWOC  | UK Met Office                | McIntosh class + Poisson statistics; human corrected     | metoffice.gov.uk/space-weather | Murray et al. (2017)
MCSTAT  | Trinity College Dublin       | McIntosh class + Poisson statistics; fully automated     | solarmonitor.org               | Gallagher et al. (2002), Bloomfield et al. (2012)

(1) iswa.ccmc.gsfc.nasa.gov
(2) http://spaceweather.rra.go.kr/images/assa/ASSA_GUI_MANUAL_112.pdf

provide forecasts for individual active regions, the full-disk forecasts can be calculated according to Murray et al. (2017),

P_{fd} = 1 - \prod_i (1 - P_i) ,   (1)

where P_i is the probability of flaring for the i-th active region present on the disk. The product is taken over the total number of regions at the forecast issue time.

Major flares of GOES M- and X-classes were studied here, since C-class flare forecasts are not typically issued by operational centres. Figure 2 presents the 24-hour probabilistic forecast data for M-class flares, including histograms of values (left panels) and the full 3-year time series (right panels). Data are color-coded according to the forecasting method – from top to bottom, black corresponds to MAG4, blue to ASSA, green to ASAP, red to NOAA, purple to MOSWOC, and gold to MCSTAT.
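The two conversions just described – rescaling ASSA's 12-hour Poisson probabilities to a 24-hour window, and combining per-region probabilities into a full-disk value via Equation 1 – can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are invented here, the Poisson rescaling follows the standard relation P(≥1 event in t) = 1 − e^{−λt} (the exact procedure is given in Guerra et al. 2015), and Equation 1 itself assumes active regions flare independently.

```python
import numpy as np

def rescale_poisson(p, t_in=12.0, t_out=24.0):
    """Rescale an occurrence probability from a t_in-hour window to a
    t_out-hour window, assuming a homogeneous Poisson process:
    P(>=1 event in t) = 1 - exp(-lam * t), hence
    p_out = 1 - (1 - p)**(t_out / t_in)."""
    p = np.clip(np.asarray(p, dtype=float), 0.0, 1.0 - 1e-12)
    return 1.0 - (1.0 - p) ** (t_out / t_in)

def full_disk_probability(region_probs):
    """Eq. (1): P_fd = 1 - prod_i (1 - P_i), treating active regions
    as independent sources of flares."""
    p = np.asarray(region_probs, dtype=float)
    return 1.0 - np.prod(1.0 - p)
```

For the 12-to-24-hour case the rescaling reduces to P_24 = 1 − (1 − P_12)²; for example, a 12-hour probability of 0.5 becomes 0.75.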
All forecasts for M-class flares show similar characteristics – probability values span nearly the full 0.0−1.0 range, with a decreasing frequency from low to high probability bins. In the case of MAG4, however, the highest frequency is concentrated in the lowest-probability bin while some bins are empty. On the other hand, forecasts for X-class flares (not displayed) show a variety of upper limits for their probability ranges – between 0.25 (ASSA) and 0.90 (ASAP).

During the study time period (2014–2016) a total of 18 X-class flares and 348 M-class flares were observed. However, due to the definition of forecasts stated above, events are defined as the days on which at least one flare of a particular class was observed. Therefore, a time series of events is constructed by assigning 1.0 (positive) to flaring days and 0.0 (negative) otherwise. Since multiple flares can occur during the same day, the number of event days is not equal to the number of flares observed. Event days are displayed in the right-hand panels of Figure 2 by vertical gray lines. In total, 189 and 17 days between 1 January 2014 and 31 December 2016
Fig. 2.
Data sample. Probabilistic forecasts and events for M-class flares (histograms, left panels; time series, right panels). From the top, the forecasting methods (colors) are: MAG4 (black), ASSA (blue), ASAP (green), NOAA (red), MOSWOC (purple), and MCSTAT (gold). In the right panels, vertical grey lines signal positive events, i.e., days when at least one M-class flare was observed.
Table 2.
Matrix of Pearson's R correlation coefficients calculated among the M-class flare forecast time series shown in Figure 2, right panels. The mean value for each method is calculated using all the non-zero values in the column and row that correspond to that method.

         ASSA    ASAP    NOAA    MOSWOC  MCSTAT  Mean
MAG4     0.615   0.431   0.689   0.718   0.653   0.621
ASSA     –       0.534   0.661   0.705   0.769   0.657
ASAP     –       –       0.476   0.512   0.543   0.499
NOAA     –       –       –       0.938   0.835   0.720
MOSWOC   –       –       –       –       0.849   0.744
MCSTAT   –       –       –       –       –       0.730

have M- and X-class flares, respectively, yielding climatological event-day frequencies of 0.172 and 0.016, respectively.

Visual inspection of the time series in Figure 2 reveals a certain level of correlation across all forecasting methods. This observation is not unexpected since all methods use parameterizations of the same photospheric magnetic field or sunspot-related data as a starting point. Table 2 displays the linear correlation (Pearson's R) coefficients calculated between pairs of forecasting methods using the time-series data in Figure 2 (right panels). The last column of Table 2 shows the average correlation value for each method – the average of all non-zero entries in the table in the column and row corresponding to that method. Average correlation coefficients for M-class flare forecasts range between ∼0.50 and ∼0.74 (Table 2), while the corresponding coefficients for X-class flare forecasts (not shown) span a similar range.
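The correlation matrix and per-method means of Table 2 follow from standard tools; a minimal sketch (function name illustrative), where `forecasts` is an (M × N) array with one member time series per row:

```python
import numpy as np

def correlation_summary(forecasts):
    """Pearson R matrix between member forecast time series, plus each
    method's mean correlation with all other methods (the last column
    of Table 2, i.e. the average of the off-diagonal entries)."""
    r = np.corrcoef(np.asarray(forecasts, dtype=float))
    m = r.shape[0]
    off_diag = r[~np.eye(m, dtype=bool)].reshape(m, m - 1)
    return r, off_diag.mean(axis=1)
```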
3. Ensemble Models
Given a group of M probabilistic forecast time series, all corresponding to the same type of event, a combined or ensemble prediction can be obtained by linear combination (LC) as in Guerra et al. (2015),

P_c(\{w_i\}, \{P_i\}; t) = \sum_{i=0}^{M-1} w_i P_i(t) ,   (2)

in which the index i corresponds to the i-th member in the group \{P_i; i = 0, \ldots, M-1\}. The combining weight, w_i, determines the contribution of each member time series (i.e., forecasting method) to the ensemble prediction. The problem is thus reduced to finding an appropriate set of combination weights \{w_i; i = 0, \ldots, M-1\} that makes the ensemble prediction more accurate than any of the individual ensemble members. Three particular options for determining the combination weights are explored in this investigation: 1) error-variance minimization (performance history); 2) constrained metric optimization; 3) unconstrained metric optimization. Each of these options is explained in the following sections.

3.1. Performance History

The simplest and most straightforward way to determine the set of combination weights is by looking at the performance history of each member (Armstrong, 2001). By doing this, higher weights are assigned to members with a relatively good forecasting track record and lower weights to forecasts with poor performance (Genre et al., 2013). Given that each forecast time series consists of the same number and range of discrete times, weights can be calculated as (Stock and Watson, 2004),

w_i = \frac{m_i^{-1}}{\sum_{j=0}^{M-1} m_j^{-1}} ,   (3)

where each member's weight is proportional to the reciprocal of its m_i, the cumulative sum of past partial errors,

m_i = \sum_{k=0}^{N-1} (P_{i,k} - E_k)^2 .   (4)

In the equation above, E_k is the events time series and the index k labels the discrete time range, \{t_k; k = 0, \ldots, N-1\}.
Equation 4 corresponds to the unnormalized Brier score – the mean squared error (MSE) for probabilistic forecasts – since the N^{-1} normalization coefficient cancels out in the ratio of Equation 3. Equation 4 implies that members with smaller partial errors have larger weights. Moreover, from Equations 3 and 4 it is easy to prove that

\sum_{i=0}^{M-1} w_i = 1 ,   (5)

with w_i > 0. This means that the combination weights are constrained to add up to unity. This is important when the forecasts are probabilistic – the value of P_c cannot exceed 1. It can also be seen from Equations 3 and 4 that the combination weights depend on the temporal range and resolution of the time series (forecasts and events).
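Equations 3 and 4 can be sketched in a few lines (a minimal illustration; the function name is invented here):

```python
import numpy as np

def history_weights(forecasts, events):
    """Combination weights from performance history (Eqs. 3-4):
    w_i is proportional to 1 / m_i, where m_i is member i's cumulative
    squared error against the binary event series. The weights are
    non-negative and sum to unity (Eq. 5)."""
    P = np.asarray(forecasts, dtype=float)   # shape (M, N): one member per row
    E = np.asarray(events, dtype=float)      # shape (N,): 1.0 on event days
    m = np.sum((P - E) ** 2, axis=1)         # Eq. 4: cumulative partial errors
    return (1.0 / m) / np.sum(1.0 / m)       # Eq. 3
```

For example, a member whose forecasts track the events closely accumulates a small m_i and therefore receives most of the weight.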
3.2. Constrained Metric Optimization

Alternatively, an optimal set of combination weights can be found by solving the optimization problem,

\frac{d}{dw_i} \mathcal{M}(P_c, E) = 0 , \quad i = 0, \ldots, M-1 ,   (6)

where \mathcal{M} corresponds to a performance metric (a loss function that quantifies the difference between forecasts and events), and P_c is the linear combination given by Equation 2. In this case the solution to Equation 6, \{w_i^{con}\}^⋆, must also satisfy the constraint given in Equation 5. When using combination weights as described in this section and Section 3.1, the linear combination in Equation 2 is known as a constrained linear combination (CLC; Granger and Ramanathan, 1984).

3.3. Unconstrained Metric Optimization

On the other hand, an unconstrained linear combination (ULC; Granger and Ramanathan, 1984) can be constructed by adding a weighted contribution of the climatological frequency as an additional probabilistic forecast. This results in the linear combination of Equation 2 becoming,

P_c(\{w_i\}, \{P_i\}; t) = \sum_{i=0}^{M-1} w_i^{unc} P_i(t) + w_E \bar{E}(t) ,   (7)

where \bar{E}(t) is a time series with a constant value equal to the climatological frequency (calculated over the studied time period), and w_E is its combination weight. In this case, Equation 5 becomes,

\sum_{i=0}^{M-1} w_i^{unc} + w_E = 1 ,   (8)

with the w_i^{unc} and w_E capable of taking positive or negative values. In this case, the sum of the combination weights for the ensemble's members (i.e., without w_E) is not constrained to any value; hence, this particular linear combination is called unconstrained. Solving Equation 6 with Equations 7 and 8 provides a different group of ensembles given the optimal set of unconstrained weights, \{w_i^{unc}\}^⋆. In this case, \bar{E}(t) functions as a benchmark level that takes into account the level of flaring activity over the three-year time period studied here.
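One concrete instance of the constrained problem (Eqs. 5–6) is sketched below, using the Brier score as the metric and SciPy's SLSQP solver, as the paper does; the function name, the non-negativity bounds, and the single equal-weights starting point are illustrative simplifications rather than the authors' exact setup:

```python
import numpy as np
from scipy.optimize import minimize

def fit_clc_weights(forecasts, events):
    """Constrained linear combination: minimize the Brier score of
    P_c = sum_i w_i P_i subject to sum_i w_i = 1 (Eq. 5), w_i >= 0."""
    P = np.asarray(forecasts, dtype=float)   # shape (M, N)
    E = np.asarray(events, dtype=float)      # shape (N,)
    M = P.shape[0]

    def brier(w):                            # loss: mean squared error
        return np.mean((w @ P - E) ** 2)

    res = minimize(
        brier,
        x0=np.full(M, 1.0 / M),              # equal-weights initial guess
        method="SLSQP",
        bounds=[(0.0, 1.0)] * M,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return res.x
```

The unconstrained variant (Eqs. 7–8) follows the same pattern with an extra climatology member, wider bounds, and the sum constraint applied to the full weight vector including w_E.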
4. Results
In order to solve the optimization problem of Equation 6 (using either constrained or unconstrained linear combinations) and thus find the combination weights, a metric or loss function must be used. The constructed ensemble forecasts will differ from each other as much as the metrics are intrinsically different. The metrics employed in this work are presented in Table 3.

Table 3. Performance metrics tested in the optimization process. Each metric produces a different set of combination weights (i.e., a different ensemble). In each case a label is shown in parentheses that is used throughout the rest of the manuscript. Categorical metrics are calculated using 2 × 2 contingency tables built with a probability threshold P_th. See Appendix A for their definitions.

Probabilistic                                        Categorical
Brier score (BRIER)                                  Brier score (BRIER_C)
Mean absolute error (MAE)                            True skill statistic (TSS)
Linear correlation coefficient (LCC)                 Heidke skill score (HSS)
Rank (nonlinear) correlation coefficients            Accuracy (ACC)
  (NLCC_ρ, NLCC_τ)                                   Critical success index (CSI)
Reliability (REL)                                    Gilbert skill score (GSS)
Resolution (RES)
Relative Operating Characteristic (ROC) curve area

NLCC_ρ and NLCC_τ are Spearman's rank correlation and Kendall's τ correlation, respectively.

Probabilistic metrics are used as well as the more traditional categorical metrics (Murray et al., 2017), although ensemble methods are versatile enough to fulfill the requirements of operational environments by allowing the use of any metric that might be of particular interest. Equation 6, along with the corresponding constraint of Equation 5 or 8, was solved with the
Scipy optimization software (Oliphant, 2007) using the Sequential Least SQuares Programming (SLSQP; Kraft, 1988) solver method. SLSQP is an iterative method for solving nonlinear optimization problems with and without bounded value constraints. Initial-guess values for \{w_i\} are provided to the routine, while the derivatives (with respect to the weights) are calculated numerically. The SLSQP method only performs minimization of the function value; therefore, for those metrics in Table 3 that require maximization (e.g., LCC, NLCC, ROC), the negative value of the metric is used as the function to minimize. For some of the optimization metrics, the resulting weights showed sensitivity to the initial-guess values given to the SLSQP solver – possibly due to the metric being noisy at the resolution of the solver. Therefore, in order to ensure that the solution \{w_i\}^⋆ corresponds to a global minimum, for each ensemble the solver is executed 500 times with randomly selected initial values – in [0, 1] for the constrained case and [−1, 1] for the unconstrained case – at every step. This results in a distribution of values for each weight. In most cases, these distributions (not shown here) are normal in shape, so the mean value is used as the final optimized weight, with an associated standard error (deviation) of up to ∼
10% of the mean value. However, in a few cases, distributions appeared wider due to the noisy nature of the metric (loss) function.

In the following sections, only the results for M-class flare events are presented and discussed, with mention of the results for X-class flares. Corresponding plots for X-class flare events can be found in Appendix B. It is worth keeping in mind that, due to the relatively low number of X-class event days, results for M+ (i.e., flares of M-class and above) will be similar to those for only M-class flares because these flares dominate the statistics in the sample used.
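The multi-start procedure described above can be sketched as follows. This is an illustration only: the loss here is a synthetic noisy surface standing in for a real forecast metric, and 50 restarts are used instead of 500 for brevity.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

def noisy_loss(w):
    """Stand-in for a noisy metric surface with local minima
    (a real application would evaluate a forecast metric here)."""
    return np.sum((w - 0.3) ** 2) + 0.01 * np.sum(np.sin(40.0 * w))

# Re-run the solver from many random starting points and average the
# solutions, mirroring the 500-restart procedure described in the text.
solutions = []
for _ in range(50):
    x0 = rng.uniform(-1.0, 1.0, size=3)      # random initial guess (ULC range)
    res = minimize(noisy_loss, x0, method="SLSQP")
    if res.success:
        solutions.append(res.x)

w_mean = np.mean(solutions, axis=0)          # final weights: mean over restarts
w_err = np.std(solutions, axis=0)            # spread ~ sensitivity to the start
```

The standard deviation across restarts gives exactly the kind of per-weight uncertainty quoted above (up to ∼10% of the mean in most cases).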
Fig. 3.
Ensemble combination weights for the optimization of probabilistic metrics (Table 3, left column) on M-class flare forecasts.
Left panel corresponds to combination weights calculated from performance history (see text for details), while
Middle and
Right panels correspond to the constrained and unconstrained linear combinations, respectively. Weights are presented using the same color scheme as Figure 2 for each forecast-method member. Note that ULC weights corresponding to the climatological forecast member are displayed in gray in the right panel.

Figure 3 shows the optimized combination weights \{w_i\}^⋆ for the performance history (left panel) and the probabilistic metrics (outlined in Table 3). The middle and right panels correspond to the constrained (\{w_i^{con}\}^⋆) and unconstrained (\{w_i^{unc}\}^⋆) linear combinations. Combination weights are displayed according to the color code used in Figure 2. The right panel shows that in the ULC case some combination weights acquire negative values, as expected. It is worth noting that negative values do not necessarily imply that such a member's performance is worse than that of members with positive weights, because it is this particular combination that is necessary to optimize the chosen metric. It is clear that the ensembles (i.e., the sets of combination weights) are generally very different for the optimization of differing metrics and the type of linear combination. However, some general characteristics are observed: 1) human-adjusted members appear in most ensembles with major (positive) contributions – i.e., larger magnitudes than the equal-weighting values of w_eq^{con} = 1/6 = 0.167 and w_eq^{unc} = 1/7 = 0.142 for the CLC and ULC cases, respectively; 2) combination weights for members that are zero in the CLC case tend to show negative values in the ULC case; 3) for most ULC ensembles, the climatological forecast member has a positive weight, implying that, for the ensemble members considered and the time range studied, the level of activity might have been underforecast by some of the members.

It is also clear from Figure 3 that using the ULC approach results in the formation of ensembles with more members having non-zero weights (i.e., more diverse ensemble membership). For X-class flare events, the resulting ensembles are more sensitive to the metric used (see Fig. C.1). No clear tendency arises in that case; however, these results seem highly dependent on the low number of X-class flares in the studied sample.

Figure 4, on the other hand, is the categorical-metric counterpart of Figure 3. For categorical metrics, the threshold value used to transform probabilistic forecasts to deterministic
Fig. 4.
Similar to Figure 3, but for the optimization of categorical metrics (Table 3, right column).

forecasts is determined during the optimization process. See Appendix A for details of this thresholding procedure. For categorical-metric ensembles, it is observed that: 1) unlike probabilistic-metric ensembles, weights determined using the CLC approach seem to consistently show non-zero values for most metrics; 2) for both CLC and ULC, the ensembles seem more similar to each other in terms of the combination weights (i.e., the same members – MAG4, NOAA, and MOSWOC – appear to dominate in most ensembles); 3) weights for the climatological forecast member take negative values in most ensembles, contrary to the probabilistic case.
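The thresholding step can be illustrated with the true skill statistic. Note that in the paper the threshold is determined jointly with the weights during the optimization; the simple grid scan below (function names invented here) is only a sketch of the underlying idea:

```python
import numpy as np

def tss(prob_forecast, events, threshold):
    """True skill statistic of a probabilistic forecast converted to a
    yes/no forecast via a probability threshold (2x2 contingency table)."""
    yes = np.asarray(prob_forecast, dtype=float) >= threshold
    obs = np.asarray(events, dtype=float) > 0.5
    hits = np.sum(yes & obs)
    misses = np.sum(~yes & obs)
    false_alarms = np.sum(yes & ~obs)
    correct_negs = np.sum(~yes & ~obs)
    pod = hits / (hits + misses)                     # probability of detection
    pofd = false_alarms / (false_alarms + correct_negs)
    return pod - pofd

def best_threshold(prob_forecast, events, grid=np.linspace(0.05, 0.95, 19)):
    """Scan candidate thresholds and keep the one maximizing TSS."""
    scores = [tss(prob_forecast, events, t) for t in grid]
    return grid[int(np.argmax(scores))], max(scores)
```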
As indicated above, in order to determine combination weights such as those in Figures 3 and 4, the value of a chosen metric is optimized. In Figure 5 these optimized metric values are presented for M-class flare forecasts using the ULC approach. The left panel corresponds to the probabilistic-metric-optimized ensembles, while the right panel shows the categorical-metric-optimized ensembles. For the probabilistic metrics (Fig. 5, left panel), several values are presented: grey box-and-whiskers show the individually calculated metrics for all members (the top and bottom of the box represent the first and third quartiles, the horizontal line in between corresponds to the median, and the whiskers signal the maximum and minimum); a metric value for the equal-weights ensemble (arithmetic mean; black circle); and the value for the best-performing ensemble (red circle; using the weights from Figure 3,
middle panel). For more convenient visualization, those metrics that are minimized (i.e., BRIER, MAE, and REL) are displayed as 1 − (metric value). In this way, better-performing metric values are concentrated towards the upper limit (i.e., 1) of the range. For all the metrics in the left panel of Figure 5, the observed tendency is that the best-performing ensemble yields a metric value greater than that of the equal-weights ensemble (\mathcal{M}(Best-Perf. Ensemble) > \mathcal{M}(Eq.-w Ensemble)) which, in turn, produces a metric value greater than the median of the members' individual metric values (\mathcal{M}(Eq.-w Ensemble) > \bar{\mathcal{M}}_i). However, the equal-weights ensemble metric value often lies above the median value, meaning that one or two members perform better in
Fig. 5.
Left:
For probabilistic metrics, three values are shown per metric: 1) metric values for the ensemble members, displayed as box-and-whiskers; 2) the metric value of the equal-weights ensemble (black circle); 3) the metric value of the optimized (or best-performing) ensemble (red circle).
Right:
Metric and threshold values for categorical-metric ensembles. Gray and blue box-and-whiskers correspond to the ensemble members' metrics and thresholds, respectively. Red circles and diamonds correspond to the optimized-ensemble metrics and thresholds. For better comparison, metrics in both panels that require minimization (i.e., BRIER, MAE, and REL) are displayed as 1 − (metric value).

this metric than the equal-weights ensemble. In addition, the best-performing ensemble (red circle) often produces a metric value higher than the upper limit of the box, and sometimes even higher than the maximum. This implies that the best-performing ensemble also often performs better than the best-performing individual member.

The right panel of Figure 5 displays metric values and probability threshold values for the categorical metrics. In a similar way to the probabilistic-metric case, box-and-whiskers correspond to the ensemble members' values, while symbols correspond to the values of the best-performing ensemble; grey and blue box-and-whiskers correspond to metric and threshold values, respectively. The equal-weights ensemble is not shown here for practical reasons, but the results are similar to the probabilistic case. This plot shows two clear tendencies: 1) all categorical-metric-optimized ensembles achieved a metric value larger than the median of their members – indeed, all red circles lie outside the whiskers' range, that is, the optimized-ensemble metric is larger than that of the best-performing individual member; 2) the probability threshold values found to maximize the metrics are lower than the average probability threshold across the members.

In the case of X-class flares (Appendix Fig. C.2), results show similar tendencies to those described above. Also, for both the M- and X-class event levels, optimized metric values for the CLC case (not shown here) follow patterns similar to the ULC case (i.e., \mathcal{M}(Opt. Ensemble) > \mathcal{M}(Eq.-w Ensemble) > \bar{\mathcal{M}}_i).
However, metric values for ULC ensembles are typically up to 5% (M-class) and 15% (X-class) higher than those of CLC ensembles. These results demonstrate that using the ULC approach when constructing ensembles achieves more optimal values for both probabilistic and categorical metrics. The improvement in metric values appeared larger for the X-class event level, perhaps suggesting that ensembles might be particularly useful for rare events. However, with the low number of events available for this class, the statistical significance of such a suggestion cannot be shown.

The performance of each ensemble output may be evaluated using a variety of probabilistic validation metrics. It is important to clarify that the results in this section do not correspond to those of a validation process, because the metrics were calculated in-sample. These results are intended to demonstrate that the choice of optimization metric for constructing an ensemble is fundamentally important if the best-performing forecasts are desired. It is worth noting that 'best performing' can mean different things depending on the end-user (see Section 5 for further discussion), and here a selection of commonly used metrics is used simply to showcase the usefulness of this technique.

Following the operational flare forecasting validation measures used by Murray et al. (2017) and Leka et al. (2019), ROC curves and reliability diagrams are displayed in Figure 6 for a selection of M-class forecast ensemble members and final optimized ensembles (see Appendix C.4 for the equivalent X-class case). Reliability diagrams are conditioned on the forecasts, indicating how closely the forecast probabilities of an event correspond to the actual observed frequency of events. These are good companions to ROC curves, which are instead conditioned on the observations.
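A ROC curve of the kind shown in Figure 6 is built by sweeping the probability threshold and plotting the probability of detection (POD) against the probability of false detection (POFD); a minimal sketch (function name illustrative, simple trapezoidal area):

```python
import numpy as np

def roc_curve_area(p, e, n_thresh=101):
    """ROC curve (POD vs. POFD swept over probability thresholds) and
    the area under it, computed with a trapezoidal rule."""
    p = np.asarray(p, dtype=float)
    obs = np.asarray(e, dtype=float) > 0.5
    pod, pofd = [], []
    for t in np.linspace(1.0, 0.0, n_thresh):  # descending: curve runs (0,0)->(1,1)
        yes = p >= t
        pod.append(np.sum(yes & obs) / max(np.sum(obs), 1))
        pofd.append(np.sum(yes & ~obs) / max(np.sum(~obs), 1))
    pod, pofd = np.array(pod), np.array(pofd)
    area = np.sum(np.diff(pofd) * (pod[1:] + pod[:-1]) / 2.0)
    return pofd, pod, area
```

An area of 1 indicates perfect discrimination, while 0.5 corresponds to no skill (the diagonal).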
ROC curves present forecast discrimination, and the area under the curve provides a useful measure of the discriminatory ability of a forecast. The ROC area for all forecasts, as well as the Brier score, is presented in Appendix Tables B.1 and B.2 for M- and X-class forecasts, respectively. The Brier score measures the mean square probability error, and can be broken down into components of reliability, resolution, and uncertainty, which are also listed in these tables.

For each table the scores are grouped in order of original input forecasts, ensembles from probabilistic-metric optimization, and ensembles from categorical-metric optimization. Results for both the CLC and ULC approaches are included. In general, the M-class forecast results are better than the X-class results, although that is to be expected considering the small number of events in the time period used (only 17 X-class event days compared to 189 M-class event days out of 1096 total days). Most values are as expected, with overall good Brier scores but poor resolution, and only a few resulting forecasts with a 'poor' ROC score (in the range 0.5–0.7). It is interesting to see in Table B.1 that overall the equal-weights ensemble outperforms MAG4, which is the best of the automated (without human input) M-class forecasts, but that the human-edited MOSWOC forecast is the best performing overall in the original members group.

Group rankings are also included in both Appendix Tables B.1 and B.2, calculated by first ranking the forecasts based on all four scores separately, and then taking an average of the rankings and re-ranking within each group. Although the broad study of Leka et al. (2019) found that no single forecasting method displayed high performance over many skill metrics, this ranking averaging is done here in order to observe whether there are major differences between probabilistic and categorical metrics.
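The group-ranking procedure described above (rank the forecasts on each score separately, average the per-score ranks, then re-rank the averages) can be sketched as follows. The function name `group_rank` and the input layout are illustrative choices for this sketch, not code from the study:

```python
import numpy as np

def group_rank(scores, higher_is_better):
    """Average-rank grouping: rank forecasts on each score separately,
    average the per-score ranks, then re-rank the averages (rank 1 = best).
    scores: dict {forecast_name: [score_1, ..., score_k]}
    higher_is_better: one bool per score column (e.g. False for Brier,
    True for ROC area)."""
    names = list(scores)
    table = np.array([scores[n] for n in names], dtype=float)
    ranks = np.zeros_like(table)
    for j, better_high in enumerate(higher_is_better):
        # negate "higher is better" columns so ascending order = best first
        col = -table[:, j] if better_high else table[:, j]
        # argsort of argsort yields 0-based ranks; +1 makes rank 1 the best
        ranks[:, j] = col.argsort().argsort() + 1
    mean_rank = ranks.mean(axis=1)
    final = mean_rank.argsort().argsort() + 1
    return dict(zip(names, final.astype(int)))
```

For example, with a (hypothetical) Brier score and ROC area per forecast, `group_rank({"A": [0.10, 0.9], "B": [0.12, 0.8]}, [False, True])` ranks A first on both scores and therefore first overall.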
The top performers for each group in M-class forecasts are MOSWOC, NLCC_ρ unc, and CSI, while for X-class the best performing forecasts are MOSWOC, LCC unc, and CSI. It is worth noting that rankings may change quite significantly depending on the metrics used; therefore, the raw forecast data is freely provided for the reader to compare the results using any metric of their own interest (see Acknowledgements). Table 4 summarizes the top five performers for each metric

Table 4.
Rankings of evaluation metrics for M-class flare forecasts. For each metric, the top five ensembles are displayed.
Rank   Brier Score   Reliability   Resolution   ROC Area
1      LCC unc       NLCC_ρ unc    LCC unc      NLCC_τ unc
2      BRIER unc     BRIER unc     CSI unc      NLCC_ρ unc
3      NLCC_ρ unc    BRIER_C       HSS unc      ROC unc
4      HSS unc       GSS           BRIER unc    BRIER unc
5      ROC unc       REL unc       NLCC_ρ unc   LCC unc

based on their rankings separately. In this table, both the BRIER unc and NLCC_ρ unc ensembles appear in the top five of all four evaluation metrics, while the LCC unc ensemble appears in three of them. Therefore, these three ensembles will often be used as a sample of "overall" top performers in the following sections.

For comparison purposes, Figures 6 and C.4 display ROC and reliability plots for a selection of these top-performing methods based on the rankings in the three groups. The upper row compares the best original ensemble members of each different method type and output, namely MOSWOC (human-edited, black line), MAG4 (automated, turquoise line), the equal-weights ensemble (blue line), and one of the top-performing probabilistic ensembles as per Table 4, NLCC_ρ unc (purple line). The other rows compare forecasts within ranking groups; for example, the middle row shows the best constrained vs. unconstrained probabilistic weighted methods, and the lower row the constrained vs. unconstrained categorical weighted methods. For the ROC curves in the left column, better-performing methods should tend towards the upper left corner of the plot. For the reliability diagrams in the right column, methods should preferably lie in the grey-shaded zone of 'positive skill' around the diagonal; if they tend toward the horizontal line they are becoming comparable to climatology.

These figures provide an easier illustrative depiction of the scores presented in the tables. For example, the ROC curve in the upper row of Figure 6 highlights the clear improvement that the ensembles have over the automated MAG4 method, with all other curves similarly good for M-class forecasts.
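The scores tabulated in Appendix B can be reproduced from a series of forecast probabilities and event outcomes with short routines such as the following sketch. The binned Murphy-style decomposition (10 bins assumed here) and the rank-sum form of the ROC area are standard constructions; the function names are illustrative, not the authors' code:

```python
import numpy as np

def brier_decomposition(p, e, n_bins=10):
    """Murphy decomposition of the Brier score: BS = REL - RES + UNC.
    p: forecast probabilities in [0, 1]; e: binary event outcomes."""
    p, e = np.asarray(p, float), np.asarray(e, float)
    n = len(p)
    base_rate = e.mean()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # np.digitize puts p == 1.0 in an overflow bin; clip it back
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    rel = res = 0.0
    for k in range(n_bins):
        mask = idx == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        p_k = p[mask].mean()   # mean forecast in bin k
        o_k = e[mask].mean()   # observed event frequency in bin k
        rel += n_k * (p_k - o_k) ** 2
        res += n_k * (o_k - base_rate) ** 2
    unc = base_rate * (1.0 - base_rate)
    return rel / n, res / n, unc

def roc_area(p, e):
    """ROC area via the rank-sum (Mann-Whitney) identity: the fraction
    of (event, non-event) pairs that the forecast orders correctly."""
    p, e = np.asarray(p, float), np.asarray(e, int)
    pos, neg = p[e == 1], p[e == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A perfectly sharp, perfectly reliable forecast gives zero reliability penalty, resolution equal to the climatological uncertainty, and a ROC area of 1.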
The reliability diagrams of Figure 6 show that most methods/ensembles generally over-forecast (i.e., data points lie to the right of and below the centre diagonal line), except the NLCC_ρ unc ensemble. The plots for X-class forecasts in Appendix Figure C.4 clearly highlight the issues related to rare-event forecasting, with poorer results across the board for all methods/ensembles compared to the M-class forecast results.

There are two main uncertainty sources associated with the linearly-combined (ensemble) probabilistic forecasts P_c constructed here: 1) the uncertainty associated with the weighted average; 2) systematic uncertainties associated with the input data for Equations 2 and 7 (i.e., forecasts and weights). Thus,

u^2(P_c) = u_1^2 + u_2^2 ,   (9)

Fig. 6.
ROC curves (left column) and reliability diagrams (right column) for M-class flare forecasts, comparing the top-ranking individual method types and final ensemble performer (upper row), and constrained and unconstrained ensembles based on probabilistic (middle row) and categorical (lower row) metrics. Note that the centre diagonal line in the ROC curves represents no skill, while for the reliability diagrams it indicates perfect reliability. The shaded areas in the reliability diagrams indicate regions that contribute positively to the Brier skill score (not shown/used here).

where u_1 can be calculated as a weighted standard deviation. A simplified version (due to the constraints in Equations 5 and 8) of the standard error of the weighted mean (SEM) formulation presented by Gatz and Smith (1995),

u_1^2 = \frac{M}{M-1} \sum_{i=1}^{M} w_i^2 (P_i - P_c)^2 ,   (10)

is adopted here. Equation 10 corresponds to the typical SEM corrected by the factor M/(M − 1). The systematic term is

u_2^2 = \sum_{i=1}^{M} \left[ w_i^2 u^2(P_i) + P_i^2 u^2(w_i) \right] ,   (11)

where u(P_i) in the first term is the uncertainty associated with the probabilistic forecasts of the i-th ensemble member and u(w_i) in the second term is the uncertainty of the i-th member combination weight. Most ensemble members in this study do not have uncertainties associated with their forecasts. As mentioned in Section 4.1, combination weights such as those in Figures 3 and 4 (as well as Appendix Figs. C.1 and C.2 for the X-class case) correspond to the mean values of normal distributions. Therefore their uncertainties can be represented by the corresponding standard deviation, σ(w_i). Since Equation 11 implies that the more members the ensemble has the larger the uncertainty, the systematic errors must be normalized by the number of members with non-zero weights, M′. Therefore, Equation 11 reduces to

u_2^2 = \frac{1}{M'} \sum_{i=1}^{M} P_i^2 \sigma^2(w_i) .   (12)

Figure 7 displays the fractional errors calculated with Equations 9, 10, and 12 for those three metrics that repeatedly appeared in Table 4: LCC (grey), BRIER (red), and NLCC_ρ (black). The left and right plots in Figure 7 compare the constrained (CLC) and unconstrained (ULC) cases for all three metrics.
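A minimal sketch of the uncertainty estimate of Equations 9, 10, and 12 follows. The quadrature (squared-sum) combination of the two terms and the function name `ensemble_uncertainty` are assumptions of this illustration:

```python
import numpy as np

def ensemble_uncertainty(P, w, sigma_w):
    """Total uncertainty of the combined forecast P_c = sum_i w_i P_i:
    a weighted-SEM statistical term (Eq. 10) plus a systematic term from
    the weight uncertainties sigma(w_i) (Eq. 12), combined in quadrature
    (Eq. 9).  P, w, sigma_w: per-member forecasts, weights, weight sigmas."""
    P, w, sigma_w = map(np.asarray, (P, w, sigma_w))
    M = len(P)
    Pc = np.sum(w * P)
    # Eq. (10): standard error of the weighted mean (Gatz & Smith 1995),
    # corrected by the factor M / (M - 1)
    u1_sq = (M / (M - 1)) * np.sum(w**2 * (P - Pc) ** 2)
    # Eq. (12): systematic term, normalised by the number of members
    # with non-zero weights, M'
    M_prime = np.count_nonzero(w)
    u2_sq = np.sum(P**2 * sigma_w**2) / M_prime
    # Eq. (9): total uncertainty
    return Pc, np.sqrt(u1_sq + u2_sq)
```

With equal weights, zero weight uncertainty, and two members at 0.4 and 0.6, this yields P_c = 0.5 with the spread of the members as the only uncertainty source.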
Uncertainties in both linear-combination cases (CLC and ULC) show a similar trend – fractional errors are larger for low probabilities than for high probabilities. In the case of CLC (Fig. 7, left panel), fractional errors appear to decrease continuously with increasing probability value in a slow, non-linear way. In this case, the LCC-optimized ensemble provides the lowest errors, with values of approximately 5% as P → 1.0 that increase as P →
0. On the other hand, ULC ensembles show a slow non-linear decrease at low probability values, but fractional errors then seem to reach a constant level. In this case, the BRIER-optimized ensemble gives the overall lowest errors, ranging between 0.5% and 5% for P ≳ 0.2.

Two caveats apply to these estimates. First, the total weight uncertainty contains contributions from both the spread of the weight distribution and the uncertainty of its mean value, u^2(w_i) = \sigma^2(w_i) + u^2(\bar{w}_i); the latter contribution can be calculated by error propagation through the mean-value expression. However, the u(\bar{w}_i) term is not included in the present results, since the SLSQP solver does not provide such uncertainties directly. Second, the uncertainty values presented here (i.e., Figure 7) were calculated with both weights and forecasts in-sample. For out-of-sample uncertainties, a similar behavior can be expected. The total uncertainty u(P_c) grows with increasing probability value, making the fractional uncertainty decrease as seen in Figure 7.
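The metric-optimized combination itself can be sketched with SciPy's SLSQP solver, which the text identifies as the optimizer used. This toy version minimizes the in-sample Brier score of the linear combination and is an illustration under that single metric, not the study's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def optimise_weights(forecasts, events, constrained=True):
    """Find combination weights w_i that minimise the Brier score of the
    linear-combination forecast P_c = sum_i w_i P_i, using SLSQP.
    forecasts: (M, N) array of member probabilities; events: (N,) 0/1.
    constrained=True keeps weights in [0, 1] (CLC); False allows
    negative weights (ULC).  Weights always sum to one."""
    M = forecasts.shape[0]

    def brier(w):
        # clip so the combined forecast stays a valid probability
        Pc = np.clip(w @ forecasts, 0.0, 1.0)
        return np.mean((Pc - events) ** 2)

    cons = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    bounds = [(0.0, 1.0)] * M if constrained else None
    res = minimize(brier, x0=np.full(M, 1.0 / M), method="SLSQP",
                   bounds=bounds, constraints=cons)
    return res.x
```

Starting from the equal-weights combination, the solver shifts weight toward the member whose forecasts best match the observed events; swapping `brier` for any other differentiable metric gives the other metric-optimized ensembles.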
Fig. 7. Fractional uncertainties as a function of ensemble probability. The left and right panels compare the CLC and ULC cases for the three top-performing ensembles of Table 4 for M-class flares, consisting of the metrics linear correlation coefficient (LCC; grey), Brier score (BRIER; red), and non-linear rank correlation coefficient ρ (NLCC_ρ; black).

The values of the weights (Equation 10) and their uncertainties (Equation 12), which are always calculated in-sample, should not affect this specific behavior; instead, they should only determine the rate of growth and overall level of the uncertainties.
5. Concluding Remarks
This investigation presented the modeling and implementation of multi-model input ensemble forecasts for major solar flares. Using probabilistic forecasts from six different forecasting methods that are publicly available online in at least a 'near-operational' state, three different schemes for linearly combining forecasts were tested: track history (i.e., variance minimization), metric-optimized constrained, and metric-optimized unconstrained linear combinations. In the last two cases, a group of 13 forecast validation metrics (8 probabilistic and 5 categorical) were used as objective functions to be optimized and thus find the optimal ensemble combination weights. The resulting ensemble forecasts for the study period (2014–2016, inclusive) were compared to each other and ranked using four widely used probabilistic performance metrics: Brier score, reliability, resolution, and ROC area. Finally, the uncertainties of each ensemble were studied.

A total of 28 ensembles were constructed to study M- and X-class flare forecasts. The vast majority of ensembles performed better – as measured by the four metrics – not only than all the ensemble members but also than the equal-weights ensemble. This means that even though a simple equal-weights average of forecasts will be a more accurate forecast than any one of the original ensemble members on their own, according to the results of this investigation it is not necessarily the optimal linear combination. For both flare event levels, different optimization metrics lead to differing ensemble combination weights, with non-zero weights from both automated and human-influenced members. When the combination weights are forced to have only positive values (i.e., a constrained linear combination), it is observed that optimization of the more mathematical metrics (i.e., Brier, LCC, NLCC, and MAE) does not necessarily include all members, in contrast to
optimization of the more attribute-related metrics (i.e., RES, REL, ROC). When the ensemble member weights are allowed to take negative values as well, those previously zero-weighted members are observed to have negative weights. It is important to highlight that a negative weight does not mean that a member is less important than one with a positive-valued weight, since it is the overall linear combination (positive and negative weights) that achieves the optimal metric value. The tendency is similar for all categorical metrics.

The optimized combination weights provided final metric values greater than both the metric calculated using an equal-weights combination and the average metric value across all members. However, only in the M-class case, with a greater number of event days in the time period studied, did every optimized ensemble show a metric value better than all of the ensemble members. As expected, the relatively low number of X-class event days in our data sample is not enough to make every optimized ensemble better than every member. It is in these cases that the choice of optimizing metric is of great importance. This conclusion is valid for both probabilistic and categorical metrics and, in the latter case, probability thresholds for ensembles were always observed to be lower than the average threshold among the members. When using an unconstrained linear combination, metric values are typically up to 5% (for M-class forecasts) and 15% (for X-class forecasts) better than for ensembles using a constrained linear combination.

When looking at the top five performing ensembles in each separate skill metric used for this in-sample evaluation, three metrics repeatedly appear for M-class flares: BRIER, LCC, and NLCC_ρ. It is worth noting that similar scores to those in Table B.1 were found in previous flare forecast validation studies. The tendency of forecasts to over-forecast was also found by Murray et al.
(2017) and Sharpe and Murray (2017) for the validation of MOSWOC forecasts. Interestingly, however, in this work the highest probability bins in the reliability diagrams of Figure 6 also over-forecast, while Murray et al. (2017) found under-forecasting for high probabilities. Brier scores also generally agree with these earlier works, although the comparison study of Barnes et al. (2016) found slightly lower values. However, it is difficult to gain any meaningful insight when inter-comparing works that used different-sized data sets over different time periods, and, as mentioned above, these results do not correspond to those of a validation process because the metrics were calculated in-sample.

It is particularly interesting to note how well the simple equal-weights ensemble performs in this work compared to the more complex weighting schemes. While equal-weights ensembles will rarely outperform the human-edited forecasts, they have been successful in outperforming the best of the automated methods (Murray, 2018). These could be a helpful starting point for forecasters when issuing operational forecasts before additional information or more complex model results are obtained. However, the weighting schemes do provide a level of flexibility that simple averages cannot; they allow operational centers to tailor their forecasts depending on what measure of performance a user cares about most (e.g., do they want to mitigate against misses or false alarms?). In this work, only a selection of metrics is highlighted based on current standards used by the community. However, the data used here are provided with open access so that readers can perform their own analysis (see Acknowledgements).

Ensemble models possess the great advantage of estimating forecast uncertainties, even in cases when none of the members have associated uncertainties.
The two main sources of uncertainty for multi-model ensemble prediction are statistical and systematic; the former is quantified by the weighted standard deviation, while the latter depends (mostly) on the uncertainty of the combination weights. For both constrained and unconstrained linear combination ensembles, fractional uncertainties are observed to decrease non-linearly with increasing ensemble probability. However, the overall values of the uncertainties are lower for the unconstrained linear combination ensembles. The lowest values of fractional uncertainty (∼0.5−5% for P ≳ 0.2) are achieved by the BRIER ensemble. The main factor making the difference between constrained and unconstrained ensembles resides in the number of non-zero weights: the more members in an ensemble, the smaller the uncertainty.

The results presented in this study demonstrate that multi-model ensemble predictions of solar flares are flexible and versatile enough to be implemented and used in operational environments with metrics that satisfy user-specific needs. The evaluation of the ensemble forecasts is deferred to future work, since the intention of the present study is to illustrate how operational centers may implement an ensemble forecasting system for major solar flares using any number of members and any optimization metric.

Acknowledgements.
The forecast data used here is available via Zenodo (https://doi.org/10.5281/zenodo.3964552). The analysis made use of the Python SciPy (Oliphant, 2007) and R verification (Gilleland, 2015) packages. S. A. M. is supported by Air Force Office of Scientific Research (AFOSR) award number FA9550-19-1-7010, and previously by the Irish Research Council Postdoctoral Fellowship Programme and AFOSR award number FA9550-17-1-039. Initial funding for J. A. G. was provided by the European Union Horizon 2020 research and innovation programme under grant agreement No. 640216 (FLARECAST project; http://flarecast.eu). The authors thank the anonymous reviewers for their comments and recommendations.

References
Armstrong, J. S., 2001. Combining Forecasts. Springer US, Boston, MA. DOI:10.1007/
Astrophysical Journal, , 89. DOI:10.3847/
Bloomfield, D. S., P. A. Higgins, R. T. J. McAteer, and P. T. Gallagher, 2012. Toward Reliable Benchmarking of Solar Flare Forecasting Methods. ApJ Lett., , L41. DOI:10.1088/
Colak, T., and R. Qahwaji, 2008. Automated McIntosh-Based Classification of Sunspot Groups Using MDI Images. Sol. Phys., , 277–296. DOI:10.1007/s11207-007-9094-3.
Colak, T., and R. Qahwaji, 2009. Automated Solar Activity Prediction: A hybrid computer platform using machine learning and solar imaging for automated prediction of solar flares. Space Weather, , S06001. DOI:10.1029/
Space Weather, , S06006. DOI:10.1029/
Falconer, D., A. F. Barghouty, I. Khazanov, and R. Moore, 2011. A tool for empirical forecasting of major flares, coronal mass ejections, and solar particle events from a proxy of active-region free magnetic energy. Space Weather, , S04003. DOI:10.1029/
Space Weather, , 306–317. DOI:10.1002/
Solar Physics, (1), 171–183. DOI:10.1023/A:1020950221179.
Gatz, D. F., and L. Smith, 1995. The standard error of a weighted mean concentration – I. Bootstrapping vs other methods. Atmospheric Environment, (11), 1185–1193. DOI:10.1016/
International Journal of Forecasting, (1), 108–121.
Gilleland, E., 2015. verification: Weather Forecast Verification Utilities (v1.42). URL https://cran.r-project.org/package=verification.
Granger, C. W. J., and R. Ramanathan, 1984. Improved methods of combining forecasts. Journal of Forecasting, (2), 197–204. DOI:10.1002/for.3980030207.
Guerra, J. A., A. Pulkkinen, and V. M. Uritsky, 2015. Ensemble forecasting of major solar flares: First results. Space Weather, , 626–642. DOI:10.1002/
Astrophys. J. Lett., (1), L45–L48. DOI:10.1086/
Space Weather, , 52–53. DOI:10.1002/
Kraft, D., 1988. A software package for sequential quadratic programming. Forschungsbericht, Deutsche Forschungs- und Versuchsanstalt für Luft- und Raumfahrt (DFVLR).
Leka, K., and G. Barnes, 2018. Chapter 3 – Solar Flare Forecasting: Present Methods and Challenges. In N. Buzulukova, ed., Extreme Events in Geospace, 65–98. Elsevier. ISBN 978-0-12-812700-1. DOI:10.1016/B978-0-12-812700-1.00003-0.
Leka, K. D., S.-H. Park, K. Kusano, J. Andries, G. Barnes, et al., 2019. A Comparison of Flare Forecasting Methods. II. Benchmarks, Metrics, and Performance Results for Operational Solar Flare Forecasting Systems. Astrophys. J. Suppl. Ser., (2), 36. DOI:10.3847/1538-4365/ab2e12.
Murray, S. A., 2018. The Importance of Ensemble Techniques for Operational Space Weather Forecasting. Space Weather, (7), 777–783. DOI:10.1029/
Murray, S. A., et al., 2017. Flare forecasting at the Met Office Space Weather Operations Centre. Space Weather, , 577–588. DOI:10.1002/
Oliphant, T. E., 2007. Python for Scientific Computing. Computing in Science & Engineering, (3), 10–20. DOI:10.1109/MCSE.2007.58.
Pesnell, W. D., B. J. Thompson, and P. C. Chamberlin, 2012. The Solar Dynamics Observatory (SDO). Sol. Phys., , 3–15. DOI:10.1007/s11207-011-9841-3.
Sharpe, M. A., and S. A. Murray, 2017. Verification of Space Weather Forecasts Issued by the Met Office Space Weather Operations Centre. Space Weather, , 1383–1395. DOI:10.1002/
Journal of Forecasting, (6), 405–430. DOI:10.1002/for.928.
Tsagouri, I., A. Belehaki, et al., 2013. Progress in space weather modeling in an operational environment. J. Space Weather Space Clim., , A17. DOI:10.1051/swsc/

Appendix A: Categorical metrics definitions
Probabilistic forecasts P are transformed into categorical ones by choosing a probability threshold value, P_th, and then applying the transformation

P_cat = 1 if P ≥ P_th ;  P_cat = 0 if P < P_th .   (A.1)

In this investigation the chosen value of P_th corresponds, in every case, to that which optimizes the value of the metric in use. This threshold value is determined during the optimization process by constructing a metric-vs-P_th curve and finding the value that minimizes or maximizes the specific metric, depending on whether small or large values indicate better forecast performance.
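The metric-vs-P_th scan can be sketched as follows; the function names are illustrative, and the True Skill Score (defined below, with the other categorical metrics) is used as the example metric:

```python
import numpy as np

def best_threshold(P, E, metric, maximise=True, n_steps=101):
    """Scan probability thresholds and return the one that optimises the
    given categorical metric, applying Eq. (A.1): P_cat = 1 where P >= P_th.
    metric(a, b, c, d) takes the four contingency-table counts."""
    thresholds = np.linspace(0.0, 1.0, n_steps)
    P, E = np.asarray(P), np.asarray(E)
    values = []
    for p_th in thresholds:
        cat = (P >= p_th).astype(int)
        a = np.sum((cat == 1) & (E == 1))  # hits
        b = np.sum((cat == 1) & (E == 0))  # false alarms
        c = np.sum((cat == 0) & (E == 1))  # misses
        d = np.sum((cat == 0) & (E == 0))  # correct negatives
        values.append(metric(a, b, c, d))
    values = np.array(values)
    k = values.argmax() if maximise else values.argmin()
    return thresholds[k], values[k]

def tss(a, b, c, d):
    """True Skill Score: TSS = (ad - bc) / ((a + c)(b + d))."""
    return (a * d - b * c) / ((a + c) * (b + d))
```

For a metric where small values are better (e.g., the categorical Brier score), `maximise=False` selects the minimizing threshold instead.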
Table A.1. Contingency table for deterministic (yes/no) forecasts and event classes.

                          Event Observed: Yes (1)   Event Observed: No (0)
Event Forecast: Yes (1)   a (hits)                  b (false alarms)
Event Forecast: No (0)    c (misses)                d (correct negatives)

A 2x2 contingency table (Table A.1) summarizes the four possible outcomes in the case of deterministic forecasts (P_cat) and events (E). The categorical metrics are derived from Table A.1 as follows:

– Brier score: BRIER_C = (1/N) Σ (P_cat − E)^2
– True Skill Score: TSS = (ad − bc) / [(a + c)(b + d)]
– Heidke Skill Score: HSS = (a + d − e) / (n − e), with n = a + b + c + d and e = [(a + b)(a + c) + (b + d)(c + d)] / n
– Accuracy: ACC = (a + d) / n
– Critical Success Index: CSI = a / (a + b + c)
– Gilbert Skill Score: GSS = (a − a_random) / (a + b + c − a_random), with a_random = (a + c)(a + b) / n
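The definitions above translate directly into code; `categorical_scores` is an illustrative helper for this sketch, not part of the study's pipeline:

```python
def categorical_scores(a, b, c, d):
    """Categorical metrics from the 2x2 contingency table of Table A.1:
    a hits, b false alarms, c misses, d correct negatives."""
    n = a + b + c + d
    e = ((a + b) * (a + c) + (b + d) * (c + d)) / n   # chance agreement
    a_rand = (a + c) * (a + b) / n                    # chance hits
    return {
        "TSS": (a * d - b * c) / ((a + c) * (b + d)),
        "HSS": (a + d - e) / (n - e),
        "ACC": (a + d) / n,
        "CSI": a / (a + b + c),
        "GSS": (a - a_rand) / (a + b + c - a_rand),
    }
```

A perfect deterministic forecast (b = c = 0) scores 1 on every one of these metrics, which provides a quick sanity check of the formulas.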
Table B.1. Table with validation metrics for M-class flare forecasts. Note that there are 189 event days and 907 non-event days out of 1096 total days, and for all cases the decomposed Brier uncertainty is 0.143.
Grouping          Forecast/Ensemble   Brier Score   Reliability   Resolution   ROC Area   Group Rank
Members           ASAP                0.151         0.0163        0.0079       0.575      7
                  ASSA                0.150         0.0235        0.0167       0.738      5
                  MAG4                0.126         0.0064        0.0237       0.772      4
                  MCSTAT              0.183         0.0606        0.0200       0.769      6
                  MOSWOC              0.116         0.0056        0.0327       0.842      1
                  NOAA                0.116         0.0070        0.0335       0.838      2
                  Equal-weights       0.121         0.0046        0.0264       0.816      3
Prob.-optimized   BRIER               0.110         0.0009        0.0338       0.848      8
                  BRIER unc           0.107         0.0007        0.0368       0.853      2
                  LCC                 0.109         0.0016        0.0355       0.848      6
                  LCC unc             0.106         0.0019        0.0387       0.853      2
                  MAE                 0.126         0.0064        0.0237       0.772      15
                  MAE unc             0.127         0.0082        0.0244       0.811      15
                  NLCC_ρ unc          0.107         0.0007        0.0366       0.854      1
                  NLCC_τ unc          0.109         0.0011        0.0351       0.856      2
                  REL                 0.114         0.0013        0.0298       0.831      14
                  REL unc             0.111         0.0008        0.0322       0.841      12
                  RES                 0.110         0.0009        0.0332       0.841      11
                  RES unc             0.114         0.0010        0.0322       0.832      13
                  ROC                 0.109         0.0021        0.0357       0.847      8
                  ROC unc             0.108         0.0010        0.0357       0.853      5
Cat.-optimized    ACC                 0.112         0.0023        0.0335       0.890      10
                  ACC unc             0.126         0.0131        0.0297       0.625      11
                  BRIER_C             0.111         0.0008        0.0327       0.891      8
                  BRIER_C unc         0.129         0.0156        0.0289       0.596      12
                  CSI                 0.109         0.0013        0.0350       0.889      1
                  CSI unc             0.111         0.0062        0.0376       0.630      2
                  GSS                 0.110         0.0008        0.0338       0.839      3
                  GSS unc             0.129         0.0221        0.0360       0.878      7
                  HSS                 0.111         0.0033        0.0349       0.889      7
                  HSS unc             0.108         0.0021        0.0372       0.620      5
                  TSS                 0.111         0.0013        0.0348       0.856      3
                  TSS unc             0.130         0.0227        0.0359       0.879      9

Table B.2.
Table with validation metrics for X-class flare forecasts. Note that there are 17 event days and 1,079 non-event days out of 1096 total days, and for all cases the decomposed Brier uncertainty is 0.015.
Grouping          Forecast/Ensemble   Brier Score   Reliability   Resolution   ROC Area   Group Rank
Members           ASAP                0.047         0.0319        0.0002       0.534      7
                  ASSA                0.018         0.0028        0.0000       0.716      6
                  MAG4                0.016         0.0030        0.0024       0.767      2
                  MCSTAT              0.026         0.0115        0.0008       0.878      3
                  MOSWOC              0.017         0.0049        0.0035       0.879      1
                  NOAA                0.018         0.0046        0.0015       0.834      3
                  Equal-weights       0.019         0.0045        0.0008       0.874      3
Prob.-optimized   BRIER               0.016         0.0030        0.0027       0.888      4
                  BRIER unc           0.015         0.0023        0.0023       0.820      7
                  LCC                 0.016         0.0030        0.0027       0.896      1
                  LCC unc             0.015         0.0031        0.0033       0.879      1
                  MAE                 0.015         0.0024        0.0024       0.768      13
                  MAE unc             0.016         0.0020        0.0015       0.680      15
                  NLCC_ρ unc          0.016         0.0028        0.0017       0.906      3
                  NLCC_τ unc          0.020         0.0067        0.0018       0.919      7
                  REL                 0.017         0.0027        0.0014       0.896      7
                  REL unc             0.015         0.0019        0.0018       0.865      10
                  RES                 0.017         0.0040        0.0027       0.895      14
                  RES unc             0.016         0.0024        0.0014       0.894      10
                  ROC                 0.018         0.0052        0.0025       0.908      10
                  ROC unc             0.026         0.0132        0.0021       0.887      16
Cat.-optimized    ACC                 0.017         0.0033        0.0018       0.890      2
                  ACC unc             0.016         0.0032        0.0027       0.625      7
                  BRIER_C             0.017         0.0033        0.0017       0.891      2
                  BRIER_C unc         0.016         0.0032        0.0027       0.596      9
                  CSI                 0.016         0.0030        0.0026       0.889      1
                  CSI unc             0.015         0.0038        0.0045       0.630      5
                  GSS                 0.022         0.0076        0.0009       0.839      12
                  GSS unc             0.019         0.0055        0.0021       0.878      9
                  HSS                 0.016         0.0034        0.0030       0.889      2
                  HSS unc             0.015         0.0038        0.0046       0.620      6
                  TSS                 0.021         0.0068        0.0010       0.856      11
                  TSS unc             0.018         0.0050        0.0024       0.879      7

Appendix B: Forecast Comparison Metrics

Appendix C: X-class Flare Forecast Results
This section contains results similar to those presented in Sections 4.1 and 4.2 for the case of X-class flare forecasts.
Fig. C.1.
Same as Figure 3, but for X-class flare forecasts.
Table C.1.
Rankings of evaluation metrics for X-class flare forecasts. For each metric, the top five performing ensembles are displayed.
Rank   Brier score   Reliability   Resolution   ROC area
1      HSS unc       REL unc       HSS unc      NLCC_τ unc
2      CSI unc       MAE unc       CSI unc      NLCC_τ
Fig. C.2.
Same as Figure 4, but for X-class flare forecasts.
Fig. C.3.
Same as Figure 5, but for X-class flare forecasts.
Fig. C.4.