Improving Outbreak Detection with Stacking of Statistical Surveillance Methods
Moritz Kulessa
Knowledge Engineering Group, Technische Universität Darmstadt
[email protected]

Eneldo Loza Mencía
Knowledge Engineering Group, Technische Universität Darmstadt
[email protected]

Johannes Fürnkranz
Knowledge Engineering Group, Technische Universität Darmstadt
[email protected]
ABSTRACT
Epidemiologists use a variety of statistical algorithms for the early detection of outbreaks. The practical usefulness of such methods highly depends on the trade-off between the detection rate of outbreaks and the chances of raising a false alarm. Recent research has shown that the use of machine learning for the fusion of multiple statistical algorithms improves outbreak detection. Instead of relying only on the binary output (alarm or no alarm) of the statistical algorithms, we propose to make use of their p-values for training a fusion classifier. In addition, we also show that adding further features and adapting the labeling of an epidemic period may improve performance even more. For comparison and evaluation, a new measure is introduced which captures the performance of an outbreak detection method with respect to a low rate of false alarms more precisely than previous works. Our results on synthetic data show that it is challenging to improve the performance with a trainable fusion method based on machine learning. In particular, the use of a fusion classifier that is only based on the binary outputs of the statistical surveillance methods can make the overall performance worse than directly using the underlying algorithms. However, the use of p-values and additional information for the learning is promising, enabling the identification of more valuable patterns to detect outbreaks.

ACM Reference Format:
Moritz Kulessa, Eneldo Loza Mencía, and Johannes Fürnkranz. 2019. Improving Outbreak Detection with Stacking of Statistical Surveillance Methods. In Proceedings of epiDAMINK 2019: Epidemiology meets Data Mining and Knowledge discovery, Workshop held in conjunction with ACM SIGKDD 2019 (epiDAMINK '19). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
The early detection of infectious disease outbreaks is of great significance for public health. The spread of such outbreaks could be diminished tremendously by applying control measures as early as possible, which indeed can save lives and reduce suffering [19]. For
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. epiDAMINK '19, August 05, 2019, Anchorage, Alaska - USA. © 2019 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00. https://doi.org/10.1145/nnnnnnn.nnnnnnn

that purpose, statistical algorithms have been developed to automate and improve outbreak detection. Such methods raise alarms in the case that an unusually high number of infections is detected, which results in a further investigation by an epidemiologist [10]. Ideally, such algorithms are completely automated while still being applicable to a wide spectrum of different infections and syndromes [20]. However, if not chosen wisely or configured properly, they may also raise many false alarms, which can overwhelm the epidemiologist. In particular for large surveillance systems, where many time series for different diseases and different locations are monitored simultaneously, the false alarm rate is a major concern and therefore highly determines the practical usefulness of an outbreak detection method [23]. However, regulating the false alarm rate usually has an impact on the ability to detect outbreaks. Finding a good trade-off between those measures is one of the major challenges in outbreak detection [1, 19].

Traditional outbreak detection methods rely on historic data to fit a parametric distribution which is then used to check the statistical significance of the current observation.
Choosing the significance level for the statistical method beforehand makes the evaluation difficult. In line with Kleinman and Abrams [15], we propose a method which uses the p-values of the statistical methods in order to evaluate their performance. In particular, we propose a variant of Receiver Operating Characteristic (ROC) curves which shows the false alarm rate on the x-axis and the detection rate, in contrast to the true positive rate, on the y-axis. By using the area under the partial ROC curve [17], we are able to obtain a measure for the performance of an algorithm satisfying a particular constraint on the false alarm rate (e.g., less than 1% false alarms). This criterion serves as the main measure for our evaluations and enables a precise analysis of the trade-off between the false alarm rate and the detection rate of outbreak detection methods.

Prior work on outbreak detection mainly focuses on forecasting the number of infections for a disease (e.g., [3, 4]). However, only little research has been devoted to using supervised machine learning (ML) techniques for improving algorithms which can raise alarms. Jafarpour et al. [11] used
Bayesian networks to identify the determinants for detection performance in order to find appropriate algorithm configurations for outbreak detection methods. Furthermore, classification algorithms and voting schemes have been used for the fusion of outbreak detection methods on univariate time series [12, 24] as well as on multi-stream time series [2, 16, 18]. However, the examined approaches only rely on the binary output (alarm or no alarm) of the underlying statistical methods for the fusion, which limits the information about a particular observation. Prior research in the area of ML has shown that more precise information about the underlying models improves the overall performance of the fusion [25]. Therefore, we propose an approach for the fusion of outbreak detection methods which uses the p-values of the underlying statistical methods. Moreover, one can also incorporate different information for the outbreak detection (e.g., weather data, holidays, statistics about the data, ...) by simply augmenting the data with additional attributes. As a first step, we put our focus on improving the performance of outbreak detection methods using a univariate time series as the only source of information. Furthermore, the way outbreaks are labeled in the data also has a major influence on the learnability of outbreak detectors. Thus, we propose adaptations for the labeling of outbreaks in order to maximize the detection rate of ML algorithms.

The key idea of our approach is to learn to combine the predictions of commonly used statistical outbreak detection methods with a trainable ML algorithm. Thus, we first need to generate a series of aligned prediction vectors, each consisting of one entry per method. This sequence can then be used for training the ML model.

Let us denote with C = (c_1, c_2, ..., c_n) ∈ ℕ^n the time series of infection counts for a particular disease. Many methods rely on a sliding-window approach which uses the previous m counts as reference values for fitting a particular parametric distribution. The mean µ(t) and the variance σ²(t) can be computed over these m reference values as follows:

$$\mu(t) = \frac{1}{m}\sum_{i=1}^{m} c_{t-i} \qquad \sigma^2(t) = \frac{1}{m}\sum_{i=1}^{m}\left(c_{t-i} - \mu(t)\right)^2$$

On the fitted distribution, a statistical significance test is performed in order to identify suspicious spikes. For the purpose of outbreak detection, we rely on one-tailed tests for the statistical algorithms in order to only capture observations of unusually high numbers of infections. For a particular observed count c_t and a fitted distribution p(x), the p-value is computed as the probability $\int_{c_t}^{\infty} p(x)\,dx$ of observing c_t or higher counts. Hence, small p-values represent uncommonly high counts c_t. The sensitivity of raising an alarm is regulated by the significance level α: if the p-value falls below the threshold α, an alarm is raised.

We have chosen to base our work on the following methods, which are all implemented in the R package surveillance [22]: EARS C1 and
EARS C2 are variants of the Early Aberration Reporting System [7, 9] which rely on the assumption of a Gaussian distribution. The difference between C2 and C1 lies in the added gap of two time points between the reference values and the current observed count c_t, so that the distributions of c_t are assumed as follows:

$$c_t^{C1} \sim \mathcal{N}\left(\mu(t), \sigma^2(t)\right) \qquad c_t^{C2} \sim \mathcal{N}\left(\mu(t-2), \sigma^2(t-2)\right)$$

EARS C3 combines the result of the C2 method over a period of three previous observations. For convenience of notation, the incidence counts c_t for the C3 method are transformed according to the statistics so that they fit the normal distribution:

$$\left[\frac{c_t - \mu(t-2)}{\sqrt{\sigma^2(t-2)}} + \sum_{i=1}^{2}\max\left(0,\; \frac{c_{t-i} - \mu(t-2-i)}{\sqrt{\sigma^2(t-2-i)}} - 1\right)\right]^{C3} \sim \mathcal{N}(0, 1)$$

Despite the inaccurate assumption of the Gaussian distribution for low counts, the EARS variants are often included in comparative studies due to their simplicity and still serve as a competitive baseline [1, 7, 8].
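As an illustration, the Gaussian window-based p-value computation of C1 and C2 might look as follows in code (a minimal sketch using scipy that follows the description above, not the exact surveillance package implementation; function names are ours):

```python
import numpy as np
from scipy.stats import norm

def window_stats(counts, t, m=7, gap=0):
    """Mean and variance over the m reference values ending `gap` steps before t."""
    ref = counts[t - gap - m : t - gap]
    return ref.mean(), ref.var()

def ears_c1_pvalue(counts, t, m=7):
    # C1: the reference window directly precedes the current count
    mu, var = window_stats(counts, t, m, gap=0)
    # upper-tail probability of observing counts[t] or more under N(mu, sigma^2)
    return norm.sf(counts[t], loc=mu, scale=np.sqrt(var))

def ears_c2_pvalue(counts, t, m=7):
    # C2: a gap of two time points between the window and the current count
    mu, var = window_stats(counts, t, m, gap=2)
    return norm.sf(counts[t], loc=mu, scale=np.sqrt(var))
```

A sudden spike relative to the reference window yields a p-value close to zero, while a count equal to the window mean yields 0.5.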
Bayes method.
In contrast to the family of C-algorithms, the Bayes algorithm relies on the assumption of a negative binomial distribution:

$$c_t^{Bayes} \sim NB\left(m \cdot \mu(t) + \tfrac{1}{2},\ \tfrac{m}{m+1}\right)$$

RKI method.
Since the Gaussian distribution is not suitable for count data with a low mean, the RKI algorithm, as implemented by Salmon et al. [22], assumes a Poisson distribution:

$$c_t^{RKI} \sim \begin{cases} \text{Poisson}\left(\lfloor \mu(t) \rfloor + 1\right) & \text{if } \mu(t) \leq 20 \\ \mathcal{N}\left(\mu(t), \sigma^2(t)\right) & \text{otherwise} \end{cases}$$

All of these methods have in common that they require comparably little historic data on their own, which allows us to train the ML method on longer sequences. Moreover, such methods are universally applicable and serve as drop-in approaches for surveillance systems, since they only rely on the detection of a local increase in incidence without the need to capture effects like seasonality and trend.

The combination of information from several sources in order to obtain a unified picture is known as fusion [14].
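For illustration, the Bayes and RKI p-value computations just described could be sketched as follows (assuming the reconstructed parameterizations above and scipy's (n, p) convention for the negative binomial; function names are ours):

```python
import math
import numpy as np
from scipy.stats import nbinom, norm, poisson

def bayes_pvalue(counts, t, m=7):
    mu = counts[t - m : t].mean()
    # scipy's NB(n, p): size n = m*mu + 1/2, success probability p = m/(m+1)
    n, p = m * mu + 0.5, m / (m + 1)
    # for a discrete distribution, P(X >= c_t) = sf(c_t - 1)
    return nbinom.sf(counts[t] - 1, n, p)

def rki_pvalue(counts, t, m=7):
    ref = counts[t - m : t]
    mu, var = ref.mean(), ref.var()
    if mu <= 20:
        # low-mean regime: Poisson with mean floor(mu) + 1
        return poisson.sf(counts[t] - 1, math.floor(mu) + 1)
    # otherwise fall back to the Gaussian approximation
    return norm.sf(counts[t], loc=mu, scale=math.sqrt(var))
```

Note the `sf(c_t - 1)` shift: for count distributions the upper tail must include the observed count itself.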
Classifier fusion is a special case which combines the outputs of multiple classifiers in order to improve classification performance. In our context, the statistical algorithms for syndromic surveillance can be seen as classifiers, each classifying the current observation into the classes alarm or no alarm. A straightforward way of combining the predictions of multiple outbreak detection methods is to simply vote and follow the majority prediction. A more sophisticated approach consists of training a classifier that uses the predictions of the detection methods as input and is trained on the desired output, a technique that is known in ML as stacking [26].

Recent work in the area of outbreak detection and fusion has focused on fusing the information obtained by simultaneously monitoring multiple time series for a particular disease. Lau et al. [16] have shown that the performance of statistical algorithms can already be improved by combining them with simple voting schemes. Mnatsakanyan et al. [18] could further improve the performance using Bayesian networks and including further information about the patients (e.g., age) as additional attributes. Moreover, Burkom et al. [2] have used a hierarchy of Bayesian networks in order to incorporate additional information about health surveillance data and environmental sensors. However, all of these fusion methods aim to capture the degree of dependence between the monitored time series, relying on spatial correlations.

Only little research has been devoted to improving the performance of statistical algorithms on univariate time series. In particular, Texier et al. [24] have used the ML technique hierarchical mixture of experts [13] to combine the output of the methods from EARS.

Figure 1: Example for the creation of training data for the learning algorithm, including the statistical algorithms Bayes and RKI with a window size of one (w = 1) and the mean over the previous four counts (m = 4) as features. On the left-hand side, the time series for a particular disease is visualized at the center, representing the number of cases of infections over time. The computed p-values of the statistical algorithms (underneath) and the label indicating an outbreak for each observation (above) are placed at the respective time index t. Using this information, the data instances can be created as shown on the right: each particular time point is represented by one training instance, labeled according to the original targets O_0.

However, the authors note that all algorithms rely on the assumption of a Gaussian distribution, which limits their diversity. In contrast, Jafarpour et al. [12] have used a variety of classification algorithms (logistic regression, CART and
Bayesian networks) for the fusion of outbreak detection methods. As underlying statistical algorithms they have used the Cumulative Sum (CUSUM), two Exponentially Weighted Moving Average algorithms, the EARS methods (C1, C2, C3) and the Farrington algorithm [19]. In general, the results of Texier et al. [24] and Jafarpour et al. [12] indicate that ML improves the ability to detect outbreaks, while simple voting schemes (e.g., weighted voting and majority vote) did not perform well. Moreover, the algorithms have not been evaluated with respect to data which include seasonality and trend.
In this work, we show that the availability of additional information can further improve the performance of the fusion classifier. Therefore, we first propose to use the p-values of the statistical methods for the fusion in order to include information about the certainty of an alarm, and then show how to add additional external information to the learning process of the ML algorithm. Finally, we investigate different variants for labeling outbreaks.

p-values

Given base estimators g_1(x), ..., g_K(x), a fusion combiner is a function h(g_1(x), ..., g_K(x)) that combines the predictions of the base functions. In the simple case of binary voting, i.e., g_i(x) ∈ {0, 1}, the combiner is h(x) = (1/K) Σ_i g_i(x) with a threshold of 0.5. In stacking, the function h: X^K → O is learned by training a machine learning classifier on a set of previous observations (g_1(x_1), ..., g_K(x_1)), ..., (g_1(x_n), ..., g_K(x_n)), derived from applying the g_i on the x_t, with associated targets o_1, ..., o_n ∈ O. We refer to this as the training set, in contrast to the evaluation set, which contains new, unseen observations. In outbreak detection, the instances x_t correspond to the points in the time series C of infection counts c_t, and o_t ∈ {0, 1} denotes the labeling of a time point as belonging to an outbreak (1) or not (0).

Previous approaches [12, 24] used the binary alarms ({0, 1}) of the base outbreak detectors. In this work, we instead propose to base our stacking model on the p-values, i.e., g_i(x) ∈ [0, 1], provided by the underlying statistical approaches (cf. Sec. 2). In fact, the p-values can directly be seen as the certainty of currently observing an outbreak, enabling the learning algorithm to make use of the base estimations in a much more fine-grained way. This information is otherwise lost when using binary alarms, which are obtained by just applying a fixed threshold on the computed p-values.
In addition to circumventing the difficulty of tuning such a threshold, previous studies on stacking have shown empirically that using the raw predictions can improve over the discretized option [25].

Figure 1 visualizes an example of how the data for the learning algorithm is created by using the p-values of the statistical algorithms Bayes and RKI. The columns RKI_t and Bayes_t represent the computed p-values for the current observation, while the other columns (mean_t, RKI_{t-1} and Bayes_{t-1}) represent additional information explained in the following section.

The use of a trainable fusion method allows us to include additional information which can help to decide whether a given alarm should be raised or not. As additional features, we propose to include the mean of the counts over the last m time points (the same number of time points as used by the statistical methods), which can give us evidence about the reliability of a particular outcome. For example, the assumption of a Gaussian distribution for a low mean of count data (≤ 20) is known to be imprecise. Therefore, a learning algorithm might induce in this scenario that the p-values of the statistical methods C1, C2 and C3 may not be trustworthy. Moreover, under the assumption that a time series is stationary, an unusually high mean can also be a good indicator to detect an outbreak, especially in the case that an outbreak arises slowly over time. The column mean_t in Figure 1 illustrates how the mean over the last four observed counts (m = 4) is added as an additional feature.

Finally, we also include the output of the statistical methods for previous time points in a window of a user-defined size w as additional features. For the example in Figure 1, we have used a window size of one (w = 1), which includes the previous output of both statistical algorithms.
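Putting the pieces of this section together, the construction of the stacking instances and the fusion classifier could be sketched as follows (a sketch with hypothetical toy inputs; in practice the p-value series would come from the statistical detectors, and the random forest settings mirror the experimental setup described in the evaluation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_instances(pvals, counts, labels, w=1, m=4):
    """Build one training instance per time point from K detector p-value
    series (pvals has shape [K, n]), adding the mean over the last m counts
    and the p-values of the previous w time points as features."""
    K, n = pvals.shape
    X, y = [], []
    for t in range(max(w, m), n):
        feats = [counts[t - m : t].mean()]        # mean feature
        for d in range(w, -1, -1):                # lagged, then current p-values
            feats.extend(pvals[:, t - d])
        X.append(feats)
        y.append(labels[t])
    return np.array(X), np.array(y)

# hypothetical toy data: two detectors, 200 time points
rng = np.random.default_rng(0)
pvals = rng.uniform(size=(2, 200))
counts = rng.poisson(5, size=200)
labels = rng.integers(0, 2, size=200)

X, y = build_instances(pvals, counts, labels)
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # alarm scores in [0, 1]
```

The classifier's probability for the outbreak class then plays the role of a fused alarm score that can be thresholded or evaluated like a p-value-based score.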
A major challenge for ML algorithms is that the duration of an outbreak period is not clearly defined [23]. A simple strategy, which we refer to as O_0, is to label all time points positive as long as cases for the particular epidemic are reported (e.g., time points prior to the peak of an outbreak and a few time points after the peak). In this case, the goal of the learning algorithm is to predict most time points in an ongoing epidemic as positive, regardless of their time stamp. Indeed, our early results indicate that the predictor learns to recognize the fading-out of an outbreak (e.g., weeks 40 to 42 in Figure 1). This is due to the fact that the peak of the outbreak is included in the reference values, which results in a considerably high mean µ(t) for the significance test. Because of this, unusually high p-values are generated for the counts after the peak, which provide sufficient evidence for the stacking algorithm to raise an alarm. However, this also increases the number of false alarms, as the ML approach learns to raise alarms when the count is decreasing outside an epidemic period.

To avoid this, we propose three adaptations of O_0: O_1 labels all time points until the peak (the point with the maximum number of counts during the period) as positive. O_2 instead skips the time points whose count is decreasing compared to the immediately previous count (i.e., it labels all increasing counts until reaching the peak). Finally, O_3 labels only the peak of the outbreak as positive. Figure 1 visualizes an example outbreak with the corresponding different options to label the epidemic period on the top-left.

Instead of manually adjusting the α parameter of the statistical methods and examining the results individually, as is mostly done in previous works, we propose to evaluate the p-value as it is done by Kleinman and Abrams [15]. In particular, the p-value can be interpreted as a score which sorts examples according to the degree to which they indicate an alarm.
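The labeling variants O_0 to O_3 described above can be made concrete as follows (a sketch for a single outbreak; the treatment of the first outbreak week under O_2 is our assumption):

```python
import numpy as np

def label_outbreak(n, start, outbreak_counts, scheme="O0"):
    """Return a binary label vector of length n for one outbreak whose
    case counts outbreak_counts begin at time index start."""
    y = np.zeros(n, dtype=int)
    cases = np.asarray(outbreak_counts)
    peak = int(np.argmax(cases))
    if scheme == "O0":        # every time point with reported cases
        active = range(len(cases))
    elif scheme == "O1":      # all points up to and including the peak
        active = range(peak + 1)
    elif scheme == "O2":      # only increasing points up to the peak
        active = [0] + [i for i in range(1, peak + 1) if cases[i] > cases[i - 1]]
    elif scheme == "O3":      # only the peak itself
        active = [peak]
    for i in active:
        y[start + i] = 1
    return y
```

For an outbreak with case counts 1, 3, 2, 5, 4, 2, the schemes label six, four, three, and one time point(s) as positive, respectively.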
This allows us to analyze an algorithm with ROC curves [5]. A ROC curve can be used to examine the trade-off between the true positive rate (i.e., the probability of raising an alarm in case of an actual outbreak) and the false alarm rate (i.e., the probability of falsely raising an alarm when no outbreak is ongoing).

Figure 2: ROC curve using the detection rate on the y-axis. The better-than-chance performance is lifted above the diagonal since the detection rate is an interval-based metric.

In order to only focus on high-specificity results (e.g., with a false alarm rate below 1%), which is of major importance for many medical applications, we only consider partial ROC curves. By using the partial area under the ROC curve as proposed in [17], we obtain a simple measure to evaluate the performance of an algorithm satisfying a particular constraint on the false alarm rate. We refer to this measure as pAUC_e, where the parameter e defines the maximum allowed false alarm rate to be considered. It is computed as

$$pAUC_e = \frac{1}{e}\int_0^e ROC(f)\, df$$

where ROC(f) denotes the true positive rate given a false alarm rate of f. However, alarms raised in cases when the epidemic has already been detected are typically not very decisive and informative anymore. To incorporate this, we consider the detection rate, which represents the proportion of recognized outbreaks (i.e., the outbreaks in which at least one alarm is raised during their activity). Following Kleinman and Abrams [15] and Jafarpour et al. [12], we therefore use a ROC-curve-like representation with the detection rate on the y-axis instead of the true positive rate, and use dAUC_e to refer to the partial area under this curve. Figure 2 shows an example of this representation and visualizes the area of dAUC_e.
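The pAUC_e measure can be computed directly from its definition (a sketch using scikit-learn's roc_curve; note that the scores must be oriented so that higher means more alarming, e.g. 1 − p-value, and that the dAUC_e variant would additionally require grouping alarms by outbreak):

```python
import numpy as np
from sklearn.metrics import roc_curve

def pauc(y_true, scores, e=0.01):
    """Partial area under the ROC curve up to false alarm rate e,
    normalized by e (1.0 = perfect detection within the constraint)."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    # restrict the curve to fpr <= e and interpolate the point at exactly e
    mask = fpr <= e
    fx = np.append(fpr[mask], e)
    fy = np.append(tpr[mask], np.interp(e, fpr, tpr))
    # trapezoidal integration of the restricted curve
    return float(np.sum(np.diff(fx) * (fy[1:] + fy[:-1]) / 2)) / e
```

A detector that separates the classes perfectly scores 1.0, while one that ranks all outbreaks below the normal weeks scores close to 0 within the constraint.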
Kleinman and Abrams [15] proposed to use weighted ROC curves to also incorporate the influence of the measure timeliness (mean time to detect an outbreak). However, we argue that the weighting with the timeliness introduces a trade-off (importance of timeliness over detection rate) and a loss in interpretability of the absolute numbers.

The key aspect of our experimental evaluation is to demonstrate that the fusion of p-values leads to a further improvement in performance compared to only using the binary output of the statistical algorithms. However, for a deeper understanding of our proposed approaches, we first performed experiments to evaluate the influence of including additional features for the stacking in Section 6.2, followed by an analysis of adapting the labeling for the learning in Section 6.3. Finally, using the obtained knowledge about the effect of the proposed techniques, we compare them with the underlying statistical algorithms in Section 6.4, which represents our main result.

As an implementation baseline for the statistical methods, we have used the R package surveillance [22] and adapted the implementation of the methods EARS (C1, C2 and C3), Bayes and RKI in order to also return p-values. All methods use the previous 7 time points as reference values, which is the standard configuration. For the ML part, we rely on the Python library scikit-learn [21]. To keep the evaluation simple, we use a random forest classifier. Basically, it learns an ensemble of randomized decision trees, which has proven to be robust in performance theoretically and practically [6, 27]. Each model is composed of 100 decision trees with a minimum number of instances per leaf of 5 and default settings otherwise.
To allow comparability between the fusion methods, we also evaluated the approach which only combines the binary outputs of the statistical methods, as proposed in [12, 24], and which we refer to as the standard fusion. Our preliminary experiments have shown that α = 0.5% for the underlying statistical methods performs best for this fusion approach. For all evaluations, we focused on our proposed evaluation measure dAUC_{1%}, where we fixed the constraint on the false alarm rate to be less than 1%.

Our evaluation is based on synthetic data which have been proposed by Noufaily et al. [19]. In total, 42 different test cases are used, which reflect a wide range of application scenarios, allowing to analyze the effects of trend (T), seasonality (S1) and biannual seasonality (S2) explicitly. For each parameter configuration, 100 time series are generated, each containing a total of 624 weeks. Following Noufaily et al. [19], the last 49 weeks of each time series serve as evaluation data, which include exactly one outbreak, whereas the first 575 weeks contain four outbreaks and represent the so-called baseline data. Each outbreak starts at a randomly drawn week, and the number of cases per outbreak is generated with a Poisson distribution with the mean equal to a constant k times the standard deviation of the counts observed at the starting week. The outbreak cases are then distributed over time using a log-normal distribution with mean 0 and standard deviation 0.5. We evaluated each stacking configuration separately for each test case, using the baseline data of the 100 time series for training (in total 57,500 weeks including 400 outbreaks) and the remaining 4,900 weeks for testing (100 outbreaks), respectively. The statistical methods were applied separately for each time series in order to obtain the p-values as inputs for the learner as well as the predictions on the evaluation set.

Instead of reporting the average over dAUC_{1%} scores, which could have different scales for different test cases, we determined a ranking over the compared methods for each test case. Afterwards, each method's rank is averaged across the evaluated test cases to obtain an overall rank. In order to evaluate the effects of trend and seasonality explicitly, we average the rankings only over the test cases which include these effects. To differentiate between our proposed approaches, we use the notation M(a, o, w), where M ∈ {P, S} specifies whether the p-value fusion (P) or the standard fusion (S) has been used, a ∈ {¬µ, µ} whether the mean is included, o ∈ {O_0, O_1, O_2, O_3} which labeling has been used for the learning, and w the window size which has been used for the evaluation. In total, we tested 192 configurations, from which we compare only a small subset, respectively, depending on the analyzed aspect.

The first aspect to review concerns the inclusion of the mean count over the last seven time points. Therefore, we have analyzed the effect of this feature independent of the other parameters, using O_0 for the labeling of the outbreak and window size w =
0. Theresults for the average rank are displayed in Table 1. Comparingthe standard to the p -value fusion method reveals a beneficial effectespecially for the p -value approach, for which the variant includingthe mean achieves an average rank of 1 .
31 over 1 .
91. In contrast,the average ranks of 3 .
36 over 3 .
43 for the standard method notonly shows that there are issues regarding the usage of the meanfor some of the test case configurations, but also the substantialgap between using the binary outputs and the more fine-grained p -values. A closer examination reveals that the best improvementfor both fusion methods can be achieved on time series withouttrend and seasonality. By adding effects like trend and seasonality,the mean changes over time, making it difficult for the learningalgorithm to use this information. In contrast to the standard fusion,the p -value fusion method still enhances by including the meanover the previous time points.The observation that the p -value fusion method is superior to thestandard fusion can also be seen when comparing different windowsizes. The results of this experiment, using O for the labeling of theoutbreak and not including the mean, are displayed in Table 2. Inparticular, no window configuration of the standard fusion methodcan outperform any of the p -value configurations with respect to theaverage rank. Overall, a window size of 1 performed best for bothfusion approaches. Being able to compare to the most immediateprevious output of the underlying statistical algorithms seems tomake it easier to detect anomalies. In contrast, larger window sizesharm the overall performance, which suggests that the additionalinformation is not relevant for detecting sudden changes and ratherconfuses the learner. Interestingly, on certain combinations of trendand seasonality a larger window size for the p -value fusion methodseems to be beneficial. Actually, the increase of the window sizealso results in taking a further look back in the past allowing todetect effects like trend and seasonality achieving good results onthe test cases which only contain biannual seasonality. However,the observed results for larger window sizes are inconsistent acrossthe different test cases, making it difficult to draw valid conclusions. 
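The average-rank aggregation used in the result tables can be sketched as follows (a sketch; rows correspond to test cases, columns to the compared methods, and higher measure values are better):

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(scores):
    """scores: shape [n_test_cases, n_methods], holding e.g. dAUC values
    (higher is better). Within each test case the methods are ranked
    (rank 1 = best, ties share the mean rank); the ranks are then
    averaged over the test cases to obtain each method's overall rank."""
    per_case = np.apply_along_axis(rankdata, 1, -np.asarray(scores))
    return per_case.mean(axis=0)
```

Ranking per test case before averaging makes the comparison invariant to the different scales the measure can take across test cases.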
In addition to augmenting the input data, we have evaluated the effect of adapting the labeling of the epidemic period for the training of the stacking algorithm. The comparison shown in Table 3 was performed without the augmentation.
Table 1: Comparison of including or not including the mean in the data for the ML algorithms: Overall denotes all test cases; {(¬)T, (¬)S1, (¬)S2} denotes only the test cases (not) containing trend, annual seasonality, and biannual seasonality, respectively. Each particular subset, fulfilling constraints on seasonality and trend, includes 7 test cases.

Table 2: Comparison of different window sizes (not including the mean and using the labeling O_0).

Table 3: Comparison of the different labeling strategies for the epidemics (not using the mean and w = 0).

Table 4: Comparison of the standard fusion, the p-value fusion and each individual statistical algorithm.
Approach       Overall  {¬T,¬S,¬S}  {¬T,S,¬S}  {¬T,S,S}  {T,¬S,¬S}  {T,S,¬S}  {T,S,S}
C1             5.381    6.429       5.429      4.143     5.714      5.714     4.857
C2             4.810    4.571       4.000      4.286     5.857      5.286     4.857
C3             4.690    5.429       4.571      4.286     4.857      4.429     4.571
Bayes          2.595    4.000       3.143
S(µ, O, 1)     5.238    3.000       6.000      5.714     4.714      5.857     6.143
P(µ, O, 1)

[Figure 3: box plots of dAUC% and pAUC% over k = 1, ..., 10 for C1, C2, C3, Bayes, RKI, S(·, O, 1) and P(·, O, 1)]

Figure 3: Results for the measures dAUC% and pAUC%. Each box plot represents the distribution of measure values for a particular method computed over all test cases for a fixed outbreak size defined by the parameter k (a bigger value for k indicates more cases per outbreak).

In general, we can observe that by narrowing the labeling of the outbreak to particular events (i.e., O, O or O) a better performance can be achieved. This effect is clearly visible for the p-value fusion method and less obvious for the standard fusion method, for which the adaptation O seems to be an exception. In particular, learning only the peaks (O) achieved the best results for both fusion approaches. The benefit of this variant is that the learner can actually focus on the identification of strong and sudden peaks, which is indeed the main goal of outbreak detection. However, in case of biannual seasonality the frequent change of the counts over the season results in many random peaks, which apparently makes it difficult for the stacking approach to distinguish between an epidemic peak and a peak caused by random effects.
On the test cases without trend ({¬T, S, S}), outbreaks are better identifiable by also including the fading of the outbreak (O), whereas on the test cases which contain trend ({T, S, S}) the best option seems to be O, which includes only the increasing counts until the peak of the outbreak is reached.

Considering the results of the previous experiment, we have chosen to evaluate the p-value and the standard fusion approach with a window size of 1, the adaptation of the labeling O, and including the mean. In order to draw conclusions, we have also evaluated the underlying statistical methods themselves, which serve as a baseline. The results for the average rank are presented in Table 4. Here, we can observe that p-value fusion achieves the best rating across all test cases. In contrast, the performance of the standard fusion approach is often worse than that of the underlying statistical algorithms. In line with Texier et al. [24] and Jafarpour et al. [12], we can observe an improvement of the standard fusion approach on the time series without trend and seasonality. However, this improvement is not consistent for all compared test cases, resulting only in an average rank of 3, whereas the p-value approach always achieves the best result. Indeed, the ability to detect outbreaks with the standard fusion approach is reduced since it is based on the output of the statistical algorithms given a particular pre-defined significance level α. This limits the information about sudden changes encapsulated in the training data, which makes it difficult for the ML algorithm to identify valuable patterns. A closer examination reveals that trend and seasonality have an impact on the evaluated stacking approaches. In particular, by learning over the baseline data of time series which include trend, the learner is fed with observations which are not representative for the future (evaluation data) due to the changed circumstances.
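The chosen configuration (p-values plus the mean, window size w = 1) can be illustrated with the following sketch of how a stacking instance might be assembled; the function and variable names are hypothetical and only convey the idea under our assumptions:

```python
import numpy as np

def build_stacking_instances(p_values, counts, w=1):
    """Sketch: one stacking instance per time point, consisting of the
    current and the w previous p-value vectors of the surveillance
    methods plus the mean count of all preceding observations."""
    n, m = p_values.shape                        # time points x surveillance methods
    instances = []
    for t in range(w, n):
        window = p_values[t - w:t + 1].ravel()   # (w + 1) * m p-value features
        mean_count = counts[:t].mean()           # mean of the previous observations
        instances.append(np.concatenate([window, [mean_count]]))
    return np.asarray(instances)

# toy data: 3 surveillance methods, 100 time points
rng = np.random.default_rng(0)
p_values = rng.uniform(size=(100, 3))
counts = rng.poisson(5.0, size=100)
X = build_stacking_instances(p_values, counts, w=1)
print(X.shape)   # → (99, 7): 2 * 3 p-values + 1 mean feature per instance
```

Any off-the-shelf classifier, e.g. from scikit-learn [21], could then be trained on such instances together with the (possibly adapted) outbreak labels.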
Moreover, the learning algorithm usually assumes that the instances in the learning data set are independent and identically distributed, which does not allow capturing concept drift. Our proposed approaches are not designed to adjust to these settings, but we believe that further investigation of the influence of trend and seasonality, and of how they can be handled, is an interesting avenue for future work.

Furthermore, we have evaluated the approaches with respect to the number of cases per outbreak. In contrast to the previous experiments, where the value for the parameter k (used to define the number of cases per outbreak) was randomly drawn between 1 and 10, we have fixed this parameter to a particular value for all time series of the 42 test cases. The results for the measure dAUC across the 42 test cases with a fixed value for the parameter k are visualized as box plots, representing minimum, first quartile, mean, third quartile and maximum, in Figure 3. In addition to dAUC, we include the analysis of the pAUC measure and compare to the original labeling O in order to further investigate the effect of the labeling on detection rate and true positive rate.

As the number of cases per outbreak increases, all methods are more likely to obtain a better performance. While the C1, C2, C3 and RKI methods achieve comparable results across all outbreak sizes, we are surprised to observe that the Bayes method has a better performance in case of larger outbreaks. This contradicts our expectation that the RKI method should obtain the best results among these methods, since the Poisson assumption was specifically used to generate the synthetic data.
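The low-false-alarm focus of such measures can be illustrated with a standardized partial AUC that integrates the ROC curve only up to a small false alarm rate; the exact definitions of dAUC and pAUC in this work may differ, so the following is only an assumed sketch:

```python
import numpy as np

def partial_auc(scores, labels, max_fpr=0.05):
    """Sketch of a partial AUC restricted to false positive rates up to
    max_fpr, normalized to [0, 1] (illustrative stand-in for the
    pAUC-style measures discussed in the text)."""
    order = np.argsort(-np.asarray(scores))       # rank alarms by decreasing score
    y = np.asarray(labels)[order]
    pos, neg = y.sum(), len(y) - y.sum()
    tpr = np.concatenate([[0.0], np.cumsum(y) / pos])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / neg])
    keep = fpr <= max_fpr                          # only the low-false-alarm region
    fpr_c, tpr_c = fpr[keep], tpr[keep]
    if fpr_c[-1] < max_fpr:                        # close the region exactly at max_fpr
        tpr_c = np.append(tpr_c, np.interp(max_fpr, fpr, tpr))
        fpr_c = np.append(fpr_c, max_fpr)
    # trapezoidal area under the truncated ROC curve
    area = np.sum((fpr_c[1:] - fpr_c[:-1]) * (tpr_c[1:] + tpr_c[:-1]) / 2)
    return area / max_fpr                          # 1.0 = perfect ranking

# a perfect ranking reaches 1.0 even in the low false-alarm region
print(partial_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0], max_fpr=0.5))
# → 1.0
```

Normalizing by max_fpr keeps the measure on a fixed [0, 1] scale, so methods can be compared at whatever false alarm budget epidemiologists consider acceptable.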
Regarding the p-value fusion approaches, the results confirm the better overall performance across all outbreak sizes, while the performance of the standard fusion approach gets worse compared to the other methods with an increasing number of cases per outbreak. This gives further evidence that the standard fusion is not ideal. A closer examination of the graphs for the measures dAUC and pAUC reveals the difference made by the adaptation of the labeling for the learning. In particular, without the adaptation the ML algorithm achieves a tremendously better performance for the trade-off between the true positive rate and the false alarm rate. However, this also has an effect on the ability to detect outbreaks, as discussed in Section 4.3, yielding a slightly worse result for the measure dAUC than with the adapted labeling.

CONCLUSIONS

In this work, we introduced an approach for the fusion of outbreak detection methods using machine learning, more specifically stacking. The original idea is to use the alarm or no alarm prediction of the underlying statistical algorithms as input to the learner. We improved that setup by incorporating the p-values instead, which contain more information about the certainty of an event than the simple binary outputs. In addition, we proposed to add additional information to the learning data and to adapt the labeling of an outbreak in order to improve the ability to detect outbreaks. For evaluation, we proposed a measure based on ROC curves which better adapts to the specific need for a very low false alarm rate but still considers the trade-off with the detection rate.

Our experimental results on synthetic data show that the fusion of p-values improves the performance compared to the underlying statistical algorithms. Contrary to previous work, we could also observe that the simple fusion of binary outputs using stacking does not always lead to an improvement.
By incorporating additional information into the learning data, more specifically the mean count of the previous observations and the previous outputs of the statistical methods, the machine learning algorithm is able to capture more reliable patterns to detect outbreaks. Furthermore, the labeling of an outbreak has an influence on the ability of the classification algorithm to detect outbreaks. By focusing on the peak of an outbreak during the learning process, a better performance in detecting sudden changes can be achieved.

The effectiveness of the proposed method still has to be confirmed on real data. Nevertheless, our results suggest that p-value stacking is generally well suited for combining the outcomes of established methods for outbreak detection with only a low risk of decreasing performance. Moreover, stacking allows enriching the detection with additional signals and sources of information in a highly flexible way. However, a major challenge remains the treatment of the outbreak annotations during training, since these labels are inherently non-binary (endemic vs. epidemic) and additionally noisy and unreliable for real data.

ACKNOWLEDGMENTS
This work was supported by the Innovation Committee of the Federal Joint Committee (G-BA) [ESEG project, grant number 01VSF17034].
REFERENCES

[1] G. Bédubourg and Y. Le Strat. 2017. Evaluation and comparison of statistical methods for early temporal detection of outbreaks: A simulation-based study.
PLOS ONE
Statistics in Medicine
Proceedings of the SIAM International Conference on Data Mining. 262–270.
[4] D. Farrow, L. Brooks, S. Hyun, R. J. Tibshirani, D. Burke, and R. Rosenfeld. 2017. A human judgment approach to epidemiological forecasting.
PLOS ComputationalBiology
Pattern Recognition Letters
Journal ofMachine Learning Research
Statistics inMedicine
Journal of Emerging InfectiousDiseases
Journal ofUrban Health
BMC MedicalInformatics and Decision Making
Journal of Biomedical Informatics
AMIA AnnualSymposium Proceedings
Neural Computation
Information Fusion
Statistical Methods in Medical Research
Journal of Emerging Infectious Diseases
Statistics in Medicine
Journal of theAmerican Medical Informatics Association
Statistics in Medicine
Bioinformatics. In press.
[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research
Journal of StatisticalSoftware
Technometrics
BMC Medical Informatics and Decision Making
Journal of ArtificialIntelligence Research
Neural Networks