Dynamic Virtual Graph Significance Networks for Predicting Influenza
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. XX, AUGUST 2020
Jie Zhang, Pengfei Zhou, and Hongyan Wu
Abstract—Graph-structured data and their related algorithms have attracted significant attention in many fields, such as influenza prediction in public health. However, the variable influenza seasonality, occasional pandemics, and required domain knowledge pose great challenges to constructing an appropriate graph, which can impair the strength of the current popular graph-based algorithms for data analysis. In this study, we develop a novel method, Dynamic Virtual Graph Significance Networks (DVGSN), which can supervisedly and dynamically learn from similar "infection situations" at historical timepoints. Representation learning on the dynamic virtual graph can tackle the varied seasonality and pandemics and therefore improve predictive performance. Extensive experiments on real-world influenza data demonstrate that DVGSN significantly outperforms the current state-of-the-art methods. To the best of our knowledge, this is the first attempt to supervisedly learn a dynamic virtual graph for time-series prediction tasks. Moreover, the proposed method needs less domain knowledge to build a graph in advance and has rich interpretability, which makes it more acceptable in fields such as public health and the life sciences.
Index Terms—Representation Learning, Dynamic Virtual Graph, Influenza Prediction, Time Series.
INTRODUCTION

With the growing emergence of graph-structured data such as social networks and biological networks [1], [2], algorithms to analyze graph data have attracted significant attention, such as Graph Convolutional Networks (GCNs) [3], [4], [5] and Graph Attention Networks (GAT) [6]. The structure of the graph data exerts a significant impact on the performance of these algorithms, because they heavily depend on the neighborhood relationships of the graph. For example, GNNs learn the embedding of a node by iteratively aggregating and integrating the embeddings of its neighbors. However, finding all the influential neighbor nodes and measuring their edge weights appropriately to construct a graph is nontrivial in many cases, such as in the life sciences and public health, which require substantial domain knowledge.

Influenza prediction is an important interdisciplinary problem between computer science and public health. Influenza circulates worldwide and places a heavy burden on people's health every year [7], [8]. The strong infectivity and outbreaks of influenza are estimated to have resulted in approximately 35 million cases of symptomatic illness, 16 million outpatient medical visits, 490 thousand influenza-associated hospitalizations, and 34 thousand deaths in the 2018-2019 influenza season in the United States [9]. The influenza virus undergoes high mutation rates and frequent genetic re-assortment [10], [11], [12]. To help clinics, hospitals, pharmaceutical companies, and governments better prepare for influenza in a timely manner, we need a reliable model to predict influenza trends.

• J. Zhang and P. Zhou are with the Department of Smart Health, SenseTime, Shanghai, CN, 200233; H. Wu is with the Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, CN, 518055.
E-mail: [email protected]; [email protected]; [email protected]
There are two main challenges in predicting influenza. [Challenge 1]
Influenza seasonality usually varies from one season to another in timing, severity, and duration [13], [14]. Table 1 shows descriptive statistics of the Influenza-Like Illness (ILI) rates of the influenza seasons from 2003-2004 to 2016-2017 in the United States. The ratios of standard deviation to mean for the "Highest ILI Rate" and "Duration" columns are 33% and 39%, respectively. Such irregular variation handicaps predictive methods. [Challenge 2]
Influenza pandemics occur occasionally but can completely disorder the seasonality for years. A pandemic is a serious worldwide outburst, resulting from the emergence of a new type of virus and leading to much higher ILI rates, several close and consecutive peaks, and a much longer duration, as the piece of the curve around 2009 in Figure 1 shows. Such a "mutated" outbreak makes prediction more difficult.

Fig. 1: The ILI rates of the influenza seasons from 2003-2004 to 2016-2017 in the United States. The ILI rate is defined as the number of ILI patients divided by the number of all-illness patients.

The existing machine/deep learning models, such as XGBoost (XGB), Temporal Pattern Attention Long Short-Term Memory (TPA-LSTM), Temporal Convolutional Networks (TCN), and Transformer, use the current and historical values in a user-defined time window as input to predict
TABLE 1: The columns illustrate the variation of influenza in timing, severity, and duration, respectively. "Duration" counts the consecutive weeks during which the ILI rates are all over 0.01.
Seasons      Peak Week (year/week)   Highest ILI Rate   Duration (weeks)
2003-2004    2003/06                 0.0317             28
2004-2005    2003/52                 0.0706             25
2005-2006    2005/07                 0.0475             37
2006-2007    2005/52                 0.0305             31
2007-2008    2007/07                 0.0327             31
2008-2009    2008/07                 0.0542             31
2009-2010    2009/06                 0.0334             91
             2009/21                 0.0421
             2009/42                 0.0762
2010-2011    2011/05                 0.0444             37
2011-2012    2012/11                 0.0229             41
2012-2013    2012/52                 0.0603             44
2013-2014    2013/52                 0.0439             43
2014-2015    2014/52                 0.0611             41
2015-2016    2016/10                 0.0359             42
2016-2017    2017/06                 0.0481             44
MEAN         -                       0.0460             40
standard deviation (SD)  -           0.0152             16
SD/MEAN      -                       33%                39%

future values. These methods do not consider similarities outside the time window. Although one can simply increase the length of the time window to include more information, there are always some timepoints outside the window. Besides, the bigger the time window is, the fewer training instances are left, which makes the predictive model unreliable. If a method can dynamically find historical timepoints that have similar "infection situations" as auxiliary information, the model could tackle the varied seasonality and the occasional pandemics. However, how to accurately represent the "infection situations" poses a challenge, since a situation should include information on the influenza severity, tendency, duration, and other factors that may be beyond our knowledge.

In this study, we develop a novel method, namely Dynamic Virtual Graph Significance Networks (DVGSN), as Figure 2 illustrates. DVGSN constructs a virtual graph for influenza prediction. In the virtual graph, a virtual node represents a timepoint. The embedding of a virtual node represents the "infection situation" at that timepoint. A virtual edge connects two virtual nodes at two timepoints, and the edge weight measures the significance of the virtual edge.
Since the timepoints connected by the virtual edges can be outside the time window, DVGSN can break the limitation of the time window by learning from neighbor nodes and improve predictive accuracy.

A natural static graph defined beforehand with domain knowledge may not align well with the specific analytical task. As a result, the "neighborhood" in an "unsupervised graph" can be improper for the specific analytical task and damage the analytical outcomes. Different from a natural graph with static nodes and edges, in a dynamic virtual graph every node and edge are supervisedly and dynamically learned during the training procedure of the prediction task.

Fig. 2: The architecture of the proposed method, DVGSN.

Moreover, a virtual graph naturally has rich interpretability. For example, the similar "infection situations" found by the virtual graph can provide clues about how the virtual graph finds similarities and how the proposed method performs prediction for pandemics. This interpretability makes the proposed method more acceptable, especially in epidemiology and public health, where research usually emphasizes the interpretability of predictive models for further government measures.

The contributions of this work are summarized as follows.
(1) To the best of our knowledge, this is the first attempt to supervisedly learn a dynamic virtual graph for time-series prediction.
(2) The proposed method needs less domain knowledge to build a graph in advance and has rich interpretability, which is indispensable in epidemiology, public health, and the like.
(3) We carry out extensive experiments on real-world data, and the experimental results show that the proposed method significantly outperforms the state-of-the-art methods.
RELATED WORK

This section describes previous work from the points of view of influenza prediction and graph-based deep learning.
The machine/deep learning models for forecasting influenza or other time-series data mainly fall into two groups. First, some researchers focus on looking for effective "features". For example, search engine query data are used for predicting influenza in
Google Flu Trends [15], [16]. Twitter data are also used in other research [17], [18]. However, these models usually suffer from unreliable sources of huge amounts of information, such as internet searches. For example, Google's algorithm was quite vulnerable to overfitting to seasonal terms unrelated to the flu, like "high school basketball". This example also demonstrates the importance of model interpretability. Second, other researchers focus on looking for effective "models", such as RF [19], [20], [21], Gradient Boosting [19], [21], the Multilayer Perceptron (MLP) [19], [21], Long Short-Term Memory (LSTM) [19], [21], [22], the Transformer (TFR) [23], and so on. Deep-learning-based methods, e.g., the Transformer, are drawing more attention for their accuracy, while most of them suffer from poor interpretability.

Moreover, statistical models and dynamic analysis models are considered easily accessible tools for simulating patterns of infection by influenza, such as the SI, SIS, and SIR models and their variants [24]. However, their parameters are subject to change, and approximating the parameters is difficult [25], such as the basic reproduction number R0, population mobility, etc.

For mining a natural graph, such as Cora [26] and Digg [27], Graph Neural Networks (GNNs) are usually used, such as GCN, GAT, and the Graph Isomorphism Network (GIN) [28]. In an analytical task without a natural graph, a graph can be constructed beforehand to leverage powerful GNNs. Researchers need to use domain knowledge, such as medicine and transportation [29], or mathematical calculation, such as Euclidean distance [30], to construct a graph beforehand. Nonetheless, all of these graphs can be thought of as "unsupervised graphs" because the calculation for the construction is not updated by backpropagation for the specific analytical task. In other words, an "unsupervised graph" may NOT align with the specific analytical task.
As a result,the “neighborhood” in an “unsupervised graph” could beimproper for the specific analytical task and damage theanalytical outcomes. In this study, we develop a method toconstruct a “supervised graph”, which could dynamicallyand supervisedly learn the effective information from otherinstances during the training procedure in the specific ana-lytical task.
PROBLEM STATEMENT

We formally define the prediction task of the classic machine/deep learning algorithms as Formula 1:

\hat{y}^{(v,q)} = [\hat{y}^{(v+1)}, \hat{y}^{(v+2)}, \dots, \hat{y}^{(v+q)}]^T = [f_1(O^{(v,p)}), f_2(O^{(v,p)}), \dots, f_q(O^{(v,p)})]^T    (1)

where \hat{y}^{(v,q)} \in R^q is the vector to be predicted, v is a given timepoint, and q is the predictive window size; \hat{y}^{(v+i)} is the predicted value of the upcoming i-th week, and f_i(\cdot) is a time-series model that predicts the value of the upcoming i-th week; O^{(v,p)} = [o_v, o_{(v-1)}, \dots, o_{(v-p+1)}] are the observed time-series values with a time lag p, o_v is the current observed value, and o_{(v-i)} is the value of the past i-th week.

There are two types of time-series prediction: (a) single-step prediction and (b) multi-step prediction. A single-step prediction predicts the value one step in advance (q = 1 in Formula 1), and a multi-step prediction predicts consecutive values with a bigger predictive window size (q > 1 in Formula 1).

As Formula 1 shows, the classic methods heavily depend on the observations in the time window but do not consider historical similarities outside the time window.

Table 2 presents the notations used in this work.

TABLE 2: Notations and Explanations.

Notation          Explanation
O                 the observed time-series data
p                 the time lag
q                 the predictive window size
y^{(v,q)}         the vector of the true values of the predictive window size q for the node v in the virtual graph
\hat{y}^{(v,q)}   the vector of the predicted values of the predictive window size q for the node v in the virtual graph
G                 the virtual graph
V                 the set of all virtual nodes
E                 the set of all virtual edges
\chi              the observed matrix of the ILI rates
S                 the node embedding matrix of the virtual graph
T                 the adjacency matrix of the virtual graph

As shown in Table 1 and Figure 1, influenza seasons vary in timing, severity, and duration, and pandemics mutate the influenza outbreaks.
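As a concrete sketch of the windowing in Formula 1, the following Python snippet builds the input windows O^(v,p) and target vectors y^(v,q) from a raw series. The function name and the toy series are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical sketch of the windowing in Formula 1: each timepoint v is
# mapped to an input window O^(v,p) of the p most recent observations and
# a target vector y^(v,q) of the q upcoming values.
def make_windows(series, p, q):
    """X[i] = [o_v, o_{v-1}, ..., o_{v-p+1}], Y[i] = [o_{v+1}, ..., o_{v+q}]."""
    series = np.asarray(series, dtype=float)
    X, Y = [], []
    for v in range(p - 1, len(series) - q):
        X.append(series[v - p + 1 : v + 1][::-1])  # most recent value first
        Y.append(series[v + 1 : v + q + 1])
    return np.stack(X), np.stack(Y)

ili = np.arange(10) / 100.0          # toy stand-in for weekly ILI rates
X, Y = make_windows(ili, p=3, q=2)
# number of instances: n = |O| - p - q + 1 = 10 - 3 - 2 + 1 = 6,
# matching the node count used later in Algorithm 1
assert X.shape == (6, 3) and Y.shape == (6, 2)
```

Note that increasing p or q shrinks n, which is exactly the "fewer training instances" trade-off discussed in the introduction.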
Dynamically looking for similar "infection situations", instead of sticking to a fixed periodicity (roughly one year), could be a key to handling the varied seasonality and pandemics in influenza prediction. We formally define the concept of a dynamic virtual graph below.
Dynamic Virtual Graph. Different from a natural graph with static nodes and edges, we define a virtual graph as G = (V, E) with a set of nodes (V) and a set of edges (E), in which every node and edge are supervisedly or semi-supervisedly and dynamically learned during the training procedure of the prediction task.

In this study, each node is a function that can be trained to capture "infection situations", and each edge describes the significance of the similarity. Figure 3 gives an illustration of a virtual graph for predicting influenza. A virtual
Fig. 3: An illustration of a virtual graph. A virtual node represents a timepoint in the format "year/week". A virtual edge connects two virtual nodes at two timepoints. The values in the boxes above the edges are the significances of the virtual edges.

node v (v ∈ V) represents a timepoint, such as 2017/12, which means the 12th week of 2017. The embedding of the virtual node, representing the comprehensive "infection situation" at the timepoint, is denoted as s_v for the node v. A virtual edge e (e ∈ E) connects two virtual nodes at two timepoints, and the edge weight measures the significance of the virtual edge.

The virtual node representation will be used (1) to perform "infection-situation" embedding for the subsequent representation learning and (2) to learn the significance of the similarity among different "infection situations". How to define a proper function for the virtual node representation vector s_v of the node v is a pivotal problem. An "infection-situation" embedding vector needs to include comprehensive infective information, such as:
(a) the timing, severity, and duration of the infection at a given timepoint;
(b) the first-order differences ("speed") and the second-order differences ("acceleration") at a given timepoint;
(c) the tendency (upward, downward, fluctuation, or a turning point) at a given timepoint;
(d) descriptive statistics (mean, median, maximum, minimum, variance, and the like) at a given timepoint.

Many previous studies investigated how to "unsupervisedly" represent the "situation" for prediction [31], [32], [33], [34]. Some use current and past values in the time-series data as input. Others use domain knowledge, such as varying lag structures at different steps [32], to present "situation" vectors. Nonetheless, these methods of "unsupervised representation" are separate from the model training. In this study, we propose a supervised representation of the "situation" to learn a more appropriate representation for a given analytical task.

Formula 2 defines the node representation of "infection situations":

s_v = \tau(o_v, o_{(v-1)}, \dots, o_{(v-p+1)}; \theta_\tau) = \tau(O^{(v,p)}; \theta_\tau) = \sigma(W_{MLP-2}(\sigma(W_{MLP-1} O^{(v,p)})))    (2)

where \tau is a neural network, \theta_\tau is the parameter set of \tau, and \sigma is an activation function, for which we use Exponential Linear Units (ELU) in this work. A two-layer Multilayer Perceptron (MLP) is adopted; W_{MLP-1} \in R^{p \times d} and W_{MLP-2} \in R^{d \times d} are the trainable weights of the first and second layers of the MLP, respectively, where d is the hidden dimension.

There are three reasons why we project the original feature space nonlinearly into a high-dimensional space as the embedding vector of a virtual node:

(a) To perform time-series feature extraction. Time-series analyses usually adopt arbitrary feature engineering, such as Kalman filtering [35]. Selecting effective feature engineering is nontrivial, usually experience-based and time-consuming. Since an MLP can theoretically simulate any function [36], [37], [38], an MLP can perform effective representation in a high-dimensional space.

(b) To supervisedly and dynamically learn virtual node representations.
Since the MLP is part of the entire end-to-end learning, the projected embedding vectors are supervisedly learned from and for a specific analytical task. Such a supervisedly learned embedding space is expected to work better than the original feature space, which is static and cannot be updated or learned, and thereby improve accuracy.

(c) To project the virtual node embedding into an appropriate space.
The high-dimensional embedding vectors of two virtual nodes will be used to define the significance of a virtual edge. A high-dimensional space that represents a variety of complex time-series characteristics can work better than the original feature space, which consists only of the ILI rates of the current and past few weeks.
For a given node, different neighbors may have different similarities in "infection situations", so the significance of each virtual edge needs to be decided. In this study, we measure the significance of the virtual edge between the nodes u and v by performing an inner product after linear projection and instance normalization of the high-dimensional "infection-situation" embedding vectors, as Formula 3 illustrates:

t_{(v,u)} = \kappa(s_v, s_u) = inst\_norm[W_{line\_proj} s_v] \odot inst\_norm[W_{line\_proj} s_u]    (3)

where t_{(v,u)} is the significance of the virtual edge from the node u to v, inst\_norm(\cdot) is instance normalization, W_{line\_proj} \in R^{d \times d} is the trainable weight of the linear projection, and \odot is the inner product.

The significance t_{(v,u)} of the virtual edge from the node u to v has some properties:
(a) -1 \le t_{(v,u)} \le 1: the value range of the virtual edge significance is [-1, 1].
(b) t_{(v,v)} = 1: the significance of the self-loop virtual edge is 1.
(c) t_{(v,u)} = t_{(u,v)}: the virtual edge significance is symmetric.
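The three properties can be checked numerically with a small sketch of Formula 3. The division by d below is an assumption made so that the self-significance is exactly 1; the paper's exact instance-normalization scaling is not fully specified in this text, so treat this as one plausible realization.

```python
import numpy as np

def inst_norm(x, eps=1e-5):
    # per-instance normalization: zero mean, unit variance over the features
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Sketch of Formula 3 with illustrative shapes: project both embeddings with
# the same (assumed) W_line_proj, instance-normalize, take the inner product,
# and scale by d so the self-loop significance t(v,v) is ~1.
rng = np.random.default_rng(1)
d = 16
W = rng.normal(size=(d, d))          # stand-in for W_line_proj

def significance(s_v, s_u):
    a, b = inst_norm(W @ s_v), inst_norm(W @ s_u)
    return float(a @ b) / d          # ||inst_norm(x)||^2 / d ~= 1

s_v, s_u = rng.normal(size=d), rng.normal(size=d)
assert abs(significance(s_v, s_v) - 1.0) < 1e-3   # property (b)
assert abs(significance(s_v, s_u)
           - significance(s_u, s_v)) < 1e-12      # property (c)
assert -1.0 <= significance(s_v, s_u) <= 1.0      # property (a)
```

Property (a) follows from the Cauchy-Schwarz inequality once both vectors are normalized to (approximately) the same length.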
The virtual graph, composed of the virtual nodes representing the "infection situation" at each timepoint and the virtual edges with the similarity significances, is input into GNNs. For a given node, the GNNs iteratively aggregate and integrate the embeddings of its neighbors to learn a representation vector h_v. Formula 4 illustrates the l-th iteration of aggregation and integration:

h_v^{(l)} = INT(h_v^{(l-1)}, AGG(h_u^{(l-1)}; \forall u \in N(v)))    (4)

where h_v^{(l)} and h_u^{(l)} are the representation vectors at the l-th iteration/layer of the given node v and the neighbor node u, respectively; N(v) is the set of neighbor nodes; AGG(\cdot) is a function that aggregates the infective information from the neighbors N(v), i.e., the timepoints that have similar "infection situations"; and INT(\cdot) is a function that integrates the infective information from the given timepoint v itself with the infective information aggregated by AGG(\cdot) over its neighbors N(v).

A variety of AGG(\cdot) and INT(\cdot) functions have been proposed in previous studies [6], [28]. In this work, we initialize and update the node representations as Formula 5 shows:

h_v^{(0)} = s_v
h_v^{(l)} = \sigma(W_{GNN-l} \cdot \sum_{u \in N(v) \cup \{v\}} t_{(v,u)} \cdot h_u^{(l-1)})    (5)

where W_{GNN-l} \in R^{d \times d} is the trainable weight of the l-th GNN layer. In this work, we adopt two GNN layers.

The sum of the initial virtual node representation h_v^{(0)}, the representation vector after the first GNN layer h_v^{(1)}, and that after the second GNN layer h_v^{(2)} is input into a regressive layer (implemented as a linear layer) to produce the final prediction, as Formula 6 shows:

\hat{y}^{(v,q)} = W_{regr}(h_v^{(0)} + h_v^{(1)} + h_v^{(2)})    (6)

where W_{regr} \in R^{q \times d} is the trainable weight of the final regressive layer.
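Formulas 5 and 6 can be sketched in a few lines of numpy. Everything here is a toy forward pass under assumed shapes: a fully connected virtual graph of n nodes, a random significance matrix T, and random weights; the real model learns T and the weights end to end.

```python
import numpy as np

rng = np.random.default_rng(2)

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

# Toy sketch of Formulas 5 and 6: two significance-weighted aggregation
# layers over all n virtual nodes, then a linear readout on the sum of the
# three representations. T plays the role of the n x n matrix of t(v,u).
n, d, q = 5, 16, 3
S = rng.normal(size=(n, d))            # h^(0): initial node embeddings
T = rng.uniform(-1, 1, size=(n, n))    # assumed significances in [-1, 1]
T = (T + T.T) / 2                      # symmetric, as property (c) requires
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
W_regr = rng.normal(scale=0.1, size=(q, d))

H0 = S
H1 = elu((T @ H0) @ W1.T)              # h^(1)_v = elu(W1 · sum_u t(v,u) h^(0)_u)
H2 = elu((T @ H1) @ W2.T)              # second GNN layer
Y_hat = (H0 + H1 + H2) @ W_regr.T      # Formula 6: one q-step forecast per node
assert Y_hat.shape == (n, q)
```

Note the aggregation is a weighted sum, not a normalized mean; this distinction is revisited when DVGSN is compared with GAT.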
The loss function is defined as Formula 7 shows:

L = \frac{1}{n \times q} \sum_{v \in V} ||y^{(v,q)} - \hat{y}^{(v,q)}||^2 + \lambda ||T||_F    (7)

where \frac{1}{n \times q} \sum_{v \in V} ||y^{(v,q)} - \hat{y}^{(v,q)}||^2 is the predictive loss in Mean Square Error (MSE), with n the number of virtual nodes; y^{(v,q)} and \hat{y}^{(v,q)} are the vectors of the true and predicted values, respectively; T is the adjacency matrix of the virtual graph; ||T||_F is the penalty term that limits the complexity of the virtual graph and improves the robustness of the model, where ||\cdot||_F denotes the matrix Frobenius norm; and \lambda is an adjustable hyperparameter that balances the two parts of the loss.

The entire algorithm of the proposed method is shown in Figure 4 and Algorithm 1.

Fig. 4: The structure of the proposed DVGSN.

(a) The limitation of the time window

The prediction for a given timepoint v by DVGSN can be simplified and formalized as Formula 8:

\hat{y}^{(v,q)} = f_\varsigma(s_v, s_u, E; \theta_\varsigma)
= f_\varsigma(s_v, s_u, \kappa(s_v, s_u); \theta_\varsigma)
= f_\varsigma(\tau(O^{(v,p)}), \tau(O^{(u,p)}), \kappa(\tau(O^{(v,p)}), \tau(O^{(u,p)})); \theta_\tau, \theta_\varsigma)
= f_\varsigma(\tau(O), \kappa(\tau(O)); \theta_\tau, \theta_\varsigma)
= f_\varsigma(O; \theta)    (8)

where f_\varsigma is the algorithm of DVGSN and \theta represents the set of all parameters in DVGSN.

To perform prediction for the timepoint v, all the observed time-series data O in the training dataset are input into DVGSN. In comparison, the input to the classic machine/deep learning methods (as Formula 1 shows) is just
Algorithm 1 The proposed DVGSN.
Input: the observed time-series data O; the time lag p; the predictive window size q; the number of training epochs I; the adjustable hyperparameter λ.
Output: the predictive model for influenza in the United States.
  Prepare the observed matrix: χ = [O^{(v,p)}], ∀v ∈ V;
  Prepare the target matrix: Y = [y^{(v,q)}], ∀v ∈ V;
  Calculate the number of virtual nodes: n = |O| − p − q + 1;
  for i = 1, 2, ..., I do
    S ← elu(1d_conv(elu(1d_conv(χ))));
    T ← inst_norm(1d_conv(S)) (inst_norm(1d_conv(S)))^T;
    H_0 ← S; H_1 ← GNN(S, T); H_2 ← GNN(H_1, T);
    Ŷ ← regressive_layer(H_0 + H_1 + H_2);
    L ← (1/(n×q)) ||Y − Ŷ||_F^2 + λ ||T||_F;
    Perform backpropagation and update the parameters;
  end for

the observed values O^{(v,p)} in the user-defined time window with the time lag p. Inputting all the observed time-series data O lets DVGSN capture similar "infection situations" from all the timepoints instead of sticking to fixed static nodes and edges.

(b) Other differences

The virtual nodes and virtual edges in DVGSN are supervisedly learned during the training procedure of the specific prediction task, while the other existing GNN-based methods use a static graph defined beforehand. Another difference lies in the
AGG(\cdot) function in Formula 4. The l-th iteration layer of the attentive GNNs (such as GAT) and of DVGSN is illustrated in Formula 9 and Formula 10, respectively:

h_v^{(l)} = \sigma(W \cdot \sum_{u \in N(v) \cup \{v\}} \alpha_{(v,u)} \cdot h_u^{(l-1)})
\alpha_{(v,u)} = \frac{\exp(\sigma(a^T [W h_v^{(l-1)} || W h_u^{(l-1)}]))}{\sum_{k \in N(v)} \exp(\sigma(a^T [W h_v^{(l-1)} || W h_k^{(l-1)}]))}, \forall u \in N(v)    (9)

where \alpha_{(v,u)} is the normalized attention coefficient, (\cdot)^T represents matrix transposition, and || is the concatenation operation.

h_v^{(l)} = \sigma(W_{GNN-l} \cdot \sum_{u \in N(v) \cup \{v\}} t_{(v,u)} \cdot h_u^{(l-1)})
t_{(v,u)} = inst\_norm(W_{line\_proj} s_v) \odot inst\_norm(W_{line\_proj} s_u), \forall u \in N(v)    (10)

In GAT, the aggregation \alpha_{(v,u)} \cdot h_u^{(l-1)} forms a mean (precisely, a weighted mean) of the embedding vectors in the neighborhood, since \sum_{u \in N(v)} \alpha_{(v,u)} = 1 regardless of the graph structure. Comparatively, in DVGSN, the aggregation \sum_u t_{(v,u)} \cdot h_u^{(l-1)} is a sum of the embedding vectors in the neighborhood, on the condition that -1 \le t_{(v,u)} \le 1 holds. Theoretically, the expressive power of mean-based aggregators is weaker than that of sum-based aggregators, because a sum captures the full multiset while a mean captures only the proportion/distribution of elements of a given type [28].

EXPERIMENTS

We scraped the influenza data of the United States from week 2003/30 to week 2017/30 from "FluView Interactive" [39], a website of the Centers for Disease Control and Prevention, National Center for Immunization and Respiratory Diseases. The weekly ILI rates are calculated and used in this work. Figure 1 illustrates the time-series plot of the ILI rates. The piece of the curve around 2009, which has three consecutive peaks, is the 2009 pandemic. Table 1 summarizes the descriptive statistics of the influenza seasons from 2003-2004 to 2016-2017. The "mean ± standard deviation" of the "Highest ILI Rate" column is 0.0460 ± 0.0152. Besides, the standard deviation (0.0152
) is around 33% of the mean (0.0460), presenting a considerable variance in severity. The "mean ± standard deviation" of the "Duration" column is 40 ± 16 weeks, and the standard deviation (16) is around 40% of the mean (40), presenting a considerable variance in duration. Moreover, the "Peak Week" column demonstrates that the timing of the influenza seasons varies year by year.

In this work, the baseline models include a variety of state-of-the-art models. We do not compare with the SI-based models because it is difficult to obtain the values of parameters such as the numbers of susceptible and infected individuals.

• Autoregression (AR). An AR model is a statistical method that uses observed values from current and past time steps as input and implements a linear regression to predict future values.

• k-Nearest Neighbors (k-NN). The k-NN regression examines the values of a chosen number (k) of data points surrounding a target data point and uses the mean of those values as the prediction. The k-NN regression can be used for time-series prediction [40].

• Random Forest (RF). An RF model is an ensemble of decision trees trained with the "bagging" method, which leverages a combination of learning models and thereby increases overall performance [41].

• XGB. The XGB regression implements the framework of Gradient Boosting with parallel boosting, which consists of iteratively learning weak regressors with respect to a distribution and adding them to a final strong regressor [42].

• Multilayer Perceptron (MLP). An MLP is a neural network in which each node in a layer is fully connected to every node in the adjacent layers, fitting a nonlinear function. The MLP can be used for time-series analyses by mapping current and past values to one or multiple future predictive values [43].

• TPA-LSTM. The TPA-LSTM uses a set of filters to extract time-invariant temporal patterns. The extraction is similar to transforming time-series data into its "frequency domain" for forecasting [44].
• TCN. A TCN uses a causal and dilated convolutional network to predict sequential data [45], such as time series [46].

• TFR. The TFR is a model relying entirely on the attention mechanism to compute representations of its input and output, without using sequence-aligned RNNs or convolution [47]. The TFR can also be used for time-series prediction [48].

• GAT. The GAT algorithm is a type of GNN that applies the attention mechanism on graphs [6].

• GIN. The GIN is a theoretically designed model for analyzing the expressive power of GNNs in capturing different graph structures. The GIN is provably the most expressive among the class of GNNs and is theoretically as powerful as the Weisfeiler-Lehman graph isomorphism test [28].
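The simplest baseline above, AR, can be sketched as an ordinary least-squares fit from the p most recent values to the next value. The sine-plus-noise series is a toy stand-in, not the paper's data or setup.

```python
import numpy as np

# Hedged sketch of the AR baseline: linear regression from the p most recent
# observations to the next value, fitted by least squares on toy data.
rng = np.random.default_rng(4)
series = np.sin(np.arange(200) / 8.0) + 0.01 * rng.normal(size=200)

p = 6
X = np.stack([series[i : i + p] for i in range(len(series) - p)])
y = series[p:]
X1 = np.hstack([X, np.ones((len(X), 1))])          # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ coef
mse = np.mean((y - y_hat) ** 2)
assert mse < 0.01   # a noisy sine is nearly perfectly AR-predictable
```

Multi-step prediction (q > 1) would fit one such regressor per horizon step, matching the f_i in Formula 1.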
For all the baseline models and the proposed model, we randomly initialize the parameters with the uniform distribution and use the Adam optimizer [49] with a learning rate of 0.001. We set all hidden layer sizes to 256 units and use the ELU activation function. We set the number of epochs to 200 and choose the parameters with the best result on the validation set. To ensure fairness, we split the datasets and use the same training set (current timepoints from 2003/41 to 2012/02), validation set (current timepoints from 2012/03 to 2014/42), and testing set (current timepoints from 2014/43 to 2017/29 when the predictive window size is 1; from 2014/43 to 2017/27 when it is 3; and from 2014/43 to 2017/24 when it is 6) for all the models in this work. We set all the models to the same depth. For the MLP, we use a five-layer structure containing an input layer, hidden layers, and an output (regressive) layer. For TPA-LSTM, we feed the feature vectors into a TPA-LSTM layer, then input the hidden state into a 3-layer MLP, and finally apply an output layer. For TFR, we use two TFR blocks, which correspond to four layers. For GAT, we use four GAT layers and one output layer. For the proposed method, in each backpropagation step, we use all the training data as model input and randomly select a batch of nodes from the training set to calculate the loss and update the model parameters. In this work, to apply the GIN and GAT algorithms, we also connect a given virtual node to all the other virtual nodes, as DVGSN does. Our source code and dataset are available at https://github.com/aI-area/DVGSN.
To perform a comprehensive comparison, we carry out three series of experiments. We set the predictive window size to 1 for short-period prediction, and to 3 and 6, respectively, to test the predictive ability over longer periods. The loss functions of all the models average the predictive MSEs of the future weeks. To evaluate the robustness of the models, we also test each algorithm with a time lag of 6, 9, and 12, respectively. The value of λ in this group of experiments is 0.01.

Table 3 presents the results of all the models. The results show that the proposed DVGSN significantly outperforms all the baseline methods in all the prediction tasks with the predictive window size being 1, 3, and 6, which proves that DVGSN can satisfy both short- and long-period prediction tasks.

Time lag.
DVGSN shows a slight advantage with a time lag of 9. The results suggest that the most recent 9 historical observations are sufficient to construct the virtual graph nodes and reflect the current trend. A longer time lag may offer no further help, since the virtual graph can learn similar historical situations by itself. The result also shows that selecting the time-lag hyperparameter for DVGSN is relatively easy.

The other baseline methods, including kNN, RF, XGB, MLP, and TPA-LSTM, show better performance with a short time lag of 6 for the short-period prediction (predictive window size 1), while a bigger time lag of 12 is better for the longer-period prediction task with a predictive window size of 3. A longer time lag can offer more support for the longer prediction task. However, as introduced previously, the time lag is limited, and a longer one reduces the number of training instances.

The situation of TCN, TFR, GAT, and GIN is more like a compromise between the above two cases. They show better performance with a short time lag of 6 for the short-period prediction, and a slight advantage with a time lag of 9 for the longer prediction task. The graph-based solutions could reduce their dependence on the time lag. At the same time, the fact that they cannot benefit from a longer time lag for the short-period prediction may be caused by their fixed static graph mode.
ABLATION STUDY
To verify the effectiveness of the constructed dynamic graph, we designed a variant, denoted as "DVGSN(fixed)", in which the virtual edges are fixed instead of being learned. A given node is connected to the nodes at the timepoints one week ago and one year (52 weeks) ago, considering the periodicity and time-series nature of influenza. The other preprocessing steps are the same as those in the proposed method. We also adjust the hyperparameter λ to demonstrate the effectiveness of the penalty term in the loss function.

Fig. 5: The average MSEs under the different λs.

Comparison between the fixed and dynamic graph.
Table 4 compares the performance of the fixed and the dynamic graphs. In 8 of the 9 cases, the dynamic graphs perform better.
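The "DVGSN(fixed)" ablation graph described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: each timepoint node is connected to the nodes one week ago and one year (52 weeks) ago, and the edge weights are simply fixed to 1 here (the excerpt says only that they are fixed rather than learned).

```python
import numpy as np

def fixed_virtual_adjacency(n_timepoints):
    """Sketch of the "DVGSN(fixed)" virtual graph: each timepoint node
    connects to the nodes one week ago and one year (52 weeks) ago,
    reflecting influenza's weekly continuity and annual periodicity.
    Edge weights are fixed to 1 for illustration."""
    A = np.zeros((n_timepoints, n_timepoints))
    for t in range(n_timepoints):
        if t >= 1:
            A[t, t - 1] = 1.0    # neighbor: one week ago
        if t >= 52:
            A[t, t - 52] = 1.0   # neighbor: one year (52 weeks) ago
    return A
```

Nodes earlier than week 52 have only the week-ago neighbor, since no node exists one year before them.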
TABLE 3: The MSEs of all the models. The bold font indicates the best performance in each test. p and q refer to the time lag and the predictive window size, respectively.

p   q   AR      kNN     RF      XGB     MLP     TPA-LSTM  TCN     TRF     GAT     GIN
6   1   0.0796  0.1177  0.0897  0.1071  0.0795  0.0774    0.0794  0.0867  1.2326  1.1891
12  1   0.0778  0.2113  0.0844  0.0916  0.0878  0.0799    0.0840  0.0930  1.2299  1.2733

[remaining rows and the DVGSN column not recovered]
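The training objective used in these experiments, the average of the per-week predictive MSEs plus the λ-weighted penalty on the virtual graph examined in the ablation, can be sketched as follows. The exact penalty form is not given in this excerpt; an L1 norm on the learned virtual-edge weights is assumed here for illustration.

```python
import numpy as np

def dvgsn_loss(pred, target, edge_weights, lam=0.01):
    """Sketch of the training objective (assumptions flagged below):
    the average of the per-week predictive MSEs over the q future weeks,
    plus a lambda-weighted penalty that discourages a complex virtual
    graph with dense edges.  The penalty is assumed to be an L1 norm on
    the learned virtual-edge weights; the excerpt does not state its
    exact form.  pred, target: arrays of shape (batch, q)."""
    mse_per_week = ((pred - target) ** 2).mean(axis=0)  # MSE of each future week
    predictive_loss = mse_per_week.mean()               # average over the q weeks
    penalty = np.abs(edge_weights).sum()                # assumed sparsity penalty
    return predictive_loss + lam * penalty
```

With λ = 0 the penalty vanishes and only the predictive MSE remains; a very large λ lets the penalty dominate, which matches the V-shaped curve discussed below.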
TABLE 4: Comparison between the fixed and dynamic graph. The bold font indicates the better performance in each pair of comparison. [table values not recovered]

TABLE 5: Comparison among different λs in DVGSN. The bold font indicates the better performance in each pair of comparison. [table values not recovered]

Comparison among different λs in DVGSN. A bigger λ exerts a heavier penalty on a complex virtual graph with dense edges of small weights. Table 5 shows the comparison among different λs in DVGSN. In most cases, "DVGSN (λ ≠ 0)" outperforms "DVGSN (λ = 0)". In conclusion, restricting the complexity of the virtual graph improves the performance.

Figure 5 compares the average MSEs under the different λs. The X-axis and Y-axis represent the λs on a logarithmic scale and the column-average MSEs from Table 5, respectively. The curve roughly presents the shape of the letter "V". That is probably because the penalty term cannot improve robustness when λ is too small (close to zero), while when λ is too big, the loss focuses on the penalty term and neglects the predictive MSE loss. In short, in this work, DVGSN performs best when λ is approximately 0.01.

MODEL INTERPRETATION
This section explains how the proposed method works.
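The similarity computation interpreted in this section can be sketched as follows. This is an illustrative sketch, not the model's actual code: the matrix W stands in for DVGSN's learned projection parameters, and a dot product in the projected space is assumed for the significance score, which, as in the figures below, can be positive (similar situations) or negative (opposite trends).

```python
import numpy as np

def similarity_significance(current, history, W):
    """Illustrative sketch: project the "infection situation" (the ILI
    rates of the current and past weeks) into a higher-dimensional
    space with an assumed projection matrix W, then score each
    historical timepoint by a dot product in that space."""
    z_cur = W @ current                                   # projected current situation
    scores = np.array([float(z_cur @ (W @ h)) for h in history])
    order = np.argsort(-scores)                           # most positively similar first
    return scores, order
```

Ranking the scores from most positive to most negative recovers the three groups visualized in Figure 6: positively similar, dissimilar (near-zero), and negatively similar historical timepoints.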
Figure 6 illustrates the similarity that the virtual graph learns. The X-axis represents the time series from the past 11th week to the future 3rd week (p = 12 and q = 3). The Y-axis represents the ILI rate. The date in the format of "year/week" above each column of subfigures is the "current" timepoint. The model predicts the ILI rates of the future 3 weeks (on the right side of the red dashed line). In each subfigure, the red curves represent the "infection situations" of the given timepoints. The ILI rates of the current and past 11 weeks (on the left side of the red dashed line) are projected into a high-dimensional space to calculate the significance of the similarity (the green float in each subfigure). As a result, the blue curves in the top two subfigures are the most positively similar "infection situations" that the virtual graph finds; the blue curves in the third row of subfigures represent dissimilar "infection situations"; and the blue curves in the bottom two subfigures are the most negatively similar "infection situations" that the virtual graph finds.

Fig. 6: Illustrative examples of the positive similarities, dissimilarities, and negative similarities. The X-axis represents the weeks from the past 11th week (p = 12) to the future 3rd week (q = 3) to predict. The red dashed lines separate past (including current) and future timepoints. The Y-axis represents the ILI rate. The date above each column of subfigures is the given timepoint. The green float in each subfigure is the significance of the similarity between the given virtual node (the red curve) and the neighbor node (the blue curve).

This section explores whether DVGSN can deal with the varied influenza seasonality. Figure 7 gives two examples. The X-axis represents the time series from the past 9th week to the upcoming 3rd week (p = 9 and q = 3). The red curves represent the "infection situations" of the given timepoints; the blue and green curves represent the two most similar "infection situations". We find that the two most similar timepoints do not correspond to the same week of the previous years, which demonstrates that DVGSN can tackle varied influenza seasonality instead of sticking to a periodicity of 52 weeks.

Fig. 7: Illustrative examples of how DVGSN deals with the varied seasonality. The X-axis represents the weeks from the past 9th week to the upcoming 3rd week (p = 9 and q = 3). The "infection situations" of the given timepoints are represented by the red curves; the two most similar "infection situations" are represented by the blue and green curves. The number after "w:" in the figure legend is the significance of the similarity.

This section explores how DVGSN learns from the historical "infection situations" and predicts the pandemic. We present five examples from the 2009 pandemic in Figure 8. The X-axis and Y-axis represent the time series from 2002/40 to 2017/30 and the ILI rates, respectively. In each subfigure, the red piece represents the "infection situation" of the given timepoint in the pandemic. The timepoints in the five subfigures are 2009/07, 2009/13, 2009/37, 2009/51, and 2010/11, which represent a rise, a fall, a rebound after reaching a bottom, a drop after reaching a peak, and fluctuations after a huge drop in the 2009 pandemic, respectively. The two yellow pieces in the curves represent two of the most similar "infection situations" that DVGSN learns. The number after "w:" in each figure legend is the significance of the similarity. By comparing the past "infection situations" and future tendencies between the red piece and the two yellow pieces in the five examples, we conclude that DVGSN can find and learn from similar "infection situations" outside the pandemic and thereby make a reliable model for influenza prediction.

Fig. 8: This figure illustrates how DVGSN learns from the historical "infection situations" and predicts the pandemic. The X-axis represents the time series from 2002/40 to 2017/30.

CONCLUSION
In this work, we proposed a method, DVGSN, which can find similar "infection situations" outside the time window and therefore improve the predictive accuracy for influenza. The extensive experiments on real-world influenza data demonstrate that DVGSN significantly outperforms the current state-of-the-art methods. Besides, the proposed method has rich interpretabilities, which provide clues about how the model performs prediction for influenza. Another strong point of the proposed method is that it needs less domain knowledge to build a graph in advance, which can be very difficult in medical-science-related fields. Like all deep-learning-based methods, the proposed method depends on enough data to train the model. Hopefully, this method can help us better prepare for influenza outbreaks and work well on other public-health-related analytical tasks.

ACKNOWLEDGMENTS
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB38040200). We would like to thank all the authors of the open-source code used in the baseline methods.

REFERENCES

[1] A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato, "Finding statistically significant communities in networks,"
PLoS One, vol. 6, no. 4, 2011.
[2] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.
[4] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[5] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[6] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[7] H. Influenza, "Key facts about influenza (flu) vaccine,"
Journal of Virology, vol. 73, no. 3, pp. 1878–1884, 1999.
[12] P. Suárez, J. Valcárcel, and J. Ortín, "Heterogeneity of the mutation rates of influenza A viruses: Isolation of mutator mutants," Journal of Virology, vol. 66, no. 4, pp. 2491–2494, 1992.
[13] J. Puig-Barberà, A. Tormos, A. Sominina, E. Burtseva, O. Launay, M. A. Ciblak, A. Natividad-Sancho, A. Buigues-Vila, S. Martínez-Úbeda, and C. Mahé, "First-year results of the global influenza hospital surveillance network: 2012–2013 northern hemisphere influenza season," BMC Public Health, vol. 14, no. 1, p. 564.
Nature, vol. 457, 2009.
[16] K. Lee, A. Agrawal, and A. Choudhary, "Forecasting influenza levels using real-time social media streams," in , 2017.
[17] S. Molaei, M. Khansari, H. Veisi, and M. Salehi, "Predicting the spread of influenza epidemics by analyzing twitter messages," Health and Technology.
[18] J. Li and C. Cardie, "Early stage influenza detection from twitter," Computer Science, 2013.
[19] A. Darwish, Y. Rahhal, and A. Jafar, "A comparative study on predicting influenza outbreaks using different feature spaces: application of influenza-like illness data from early warning alert and response system in Syria," BMC Research Notes, vol. 13, no. 1, pp. 1–8, 2020.
[20] M. J. Kane, N. Price, M. Scotch, and P. Rabinowitz, "Comparison of ARIMA and random forest time series models for prediction of avian influenza H5N1 outbreaks," BMC Bioinformatics, vol. 15, no. 1, p. 276.
[21] J. Zhang and K. Nawata, "A comparative study on predicting influenza outbreaks," Bioscience Trends, 2017.
[22] R. Yin, E. Luusua, J. Dabrowski, Y. Zhang, and C. K. Kwoh, "Tempel: Time-series mutation prediction of influenza A viruses via attention-based recurrent neural networks," Bioinformatics, 2020.
[23] N. Wu, B. Green, X. Ben, and S. O'Banion, "Deep transformer models for time series forecasting: The influenza prevalence case," arXiv preprint arXiv:2001.08317, 2020.
[24] V. Dukic, H. F. Lopes, and N. G. Polson, "Tracking epidemics with google flu trends data and a state-space SEIR model," Journal of the American Statistical Association, vol. 107, 2012.
[25] Q. Wu, X. Fu, Z. Jin, and M. Small, "Influence of dynamic immunization on epidemic spreading in networks," Physica A: Statistical Mechanics and its Applications, vol. 419, pp. 566–574.
[26] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, pp. 93–93, 2008.
[27] T. Hogg and K. Lerman, "Social dynamics of digg," EPJ Data Science, vol. 1, no. 1, p. 5, 2012.
[28] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" arXiv preprint arXiv:1810.00826, 2018.
[29] X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, and Y. Liu, "Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3656–3663.
[30] S. Fu, X. Yang, and W. Liu, "The comparison of different graph convolutional neural networks for image recognition," in Proceedings of the 10th International Conference on Internet Multimedia Computing and Service, 2018, pp. 1–6.
[31] J. Alberg and Z. C. Lipton, "Improving factor-based quantitative investing by forecasting company fundamentals," arXiv preprint arXiv:1711.04837, 2017.
[32] S. F. Crone and S. Häger, "Feature selection of autoregressive neural network inputs for trend time series forecasting," in . IEEE, 2016, pp. 1515–1522.
[33] R. J. Frank, N. Davey, and S. P. Hunt, "Time series prediction and neural networks," Journal of Intelligent and Robotic Systems, vol. 31, no. 1-3, pp. 91–103, 2001.
[34] M. Ghiassi, H. Saidane, and D. Zimbra, "A dynamic artificial neural network model for forecasting time series events," International Journal of Forecasting, vol. 21, no. 2, pp. 341–362, 2005.
[35] I. K. F. Links, "An introduction to the kalman filter," 1995.
[36] K. Hornik, "Multilayer feedforward networks are universal approximators," vol. 2, no. 5, pp. 359–366, 1989.
[37] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Networks, vol. 6, no. 6, pp. 861–867.
[38] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks.
Water Resources Management, vol. 34, no. 1, pp. 263–282, 2020.
[41] H. Wu, Y. Cai, Y. Wu, R. Zhong, Q. Li, J. Zheng, D. Lin, and Y. Li, "Time series analysis of weekly influenza-like illness rate using a one-year period of factors in random forest regression," Bioscience Trends, 2017.
[42] R. A. Abbasi, N. Javaid, M. N. J. Ghuman, Z. A. Khan, S. U. Rehman et al., "Short term load forecasting using xgboost," in Workshops of the International Conference on Advanced Information Networking and Applications. Springer, 2019, pp. 1120–1131.
[43] J. Cao, Z. Li, and J. Li, "Financial time series forecasting model based on CEEMDAN and LSTM," Physica A: Statistical Mechanics and its Applications, vol. 519, pp. 127–139, 2019.
[44] S.-Y. Shih, F.-K. Sun, and H.-y. Lee, "Temporal pattern attention for multivariate time series forecasting," Machine Learning, vol. 108, no. 8-9, pp. 1421–1441, 2019.
[45] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[46] P. Hewage, A. Behera, M. Trovati, E. Pereira, M. Ghahremani, F. Palmieri, and Y. Liu, "Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station," Soft Computing, pp. 1–30, 2020.
[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[48] Y. Wang, L. Wang, Q. Chang, and C. Yang, "Effects of direct input–output connections on multilayer perceptron neural networks for time series prediction," Soft Computing, pp. 1–10, 2019.
[49] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980.