Dynamic Virtual Graph Significance Networks for Predicting Influenza
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. XX, AUGUST 2020
Jie Zhang, Pengfei Zhou, and Hongyan Wu
Abstract—Graph-structured data and their related algorithms have attracted significant attention in many fields, such as influenza prediction in public health. However, the variable influenza seasonality, occasional pandemics, and required domain knowledge pose great challenges to constructing an appropriate graph, which can impair the strength of the current popular graph-based algorithms for data analysis. In this study, we develop a novel method, Dynamic Virtual Graph Significance Networks (DVGSN), which can supervisedly and dynamically learn from similar "infection situations" at historical timepoints. Representation learning on the dynamic virtual graph can tackle the varied seasonality and pandemics and therefore improve predictive performance. Extensive experiments on real-world influenza data demonstrate that DVGSN significantly outperforms the current state-of-the-art methods. To the best of our knowledge, this is the first attempt to supervisedly learn a dynamic virtual graph for time-series prediction tasks. Moreover, the proposed method needs less domain knowledge to build a graph in advance and has rich interpretability, which makes it more acceptable in fields such as public health and the life sciences.
Index Terms—Representation Learning, Dynamic Virtual Graph, Influenza Prediction, Time Series.
INTRODUCTION

With the growing emergence of graph-structured data such as social networks and biological networks [1], [2], algorithms to analyze graph data have attracted significant attention, such as Graph Convolutional Networks (GCNs) [3], [4], [5] and Graph Attention Networks (GAT) [6]. The structure of the graph data exerts a significant impact on the performance of these algorithms, because they heavily depend on the neighborhood relationships of the graph. For example, GNNs learn the embedding of a node by iteratively aggregating and integrating the embeddings of its neighbors. However, finding all the influential neighbor nodes and measuring their edge weights appropriately to construct a graph is nontrivial in many cases, such as in the life sciences and public health, which require substantial domain knowledge.

Influenza prediction is an important interdisciplinary problem between computer science and public health. Influenza circulates worldwide and places a heavy burden on people's health every year [7], [8]. The strong infectivity and outbreaks of influenza are estimated to have resulted in approximately 35 million cases of symptomatic illness, 16 million outpatient medical visits, 490 thousand influenza-associated hospitalizations, and 34 thousand deaths in the 2018-2019 influenza season in the United States [9]. The influenza virus undergoes high mutation rates and frequent genetic re-assortment [10], [11], [12]. To help clinics, hospitals, pharmaceutical companies, and governments better prepare for influenza in a timely manner, we need a reliable model to predict influenza trends.

• J. Zhang and P. Zhou are with the Department of Smart Health, SenseTime, Shanghai, CN, 200233; H. Wu is with the Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, CN, 518055.
E-mail: [email protected]; [email protected]; [email protected]
There are two main challenges in predicting influenza. [Challenge 1]
Influenza seasonality usually varies from one season to another in timing, severity, and duration [13], [14]. Table 1 shows descriptive statistics of the Influenza-Like Illness (ILI) rates of the influenza seasons from 2003-2004 to 2016-2017 in the United States. The ratios of standard deviation to mean for the "Highest ILI Rate" and "Duration" columns are 33% and 39%, respectively. Such irregular variation handicaps predictive methods. [Challenge 2]
Influenza pandemics occur occasionally but can completely disorder the seasonality for years. A pandemic is a serious worldwide outburst, resulting from the emergence of a new type of virus and leading to much higher ILI rates, several close and consecutive peaks, and a much longer duration, as the piece of the curve around 2009 in Figure 1 shows. Such a "mutated" outbreak makes prediction more difficult.

Fig. 1: The ILI rates of the influenza seasons from 2003-2004 to 2016-2017 in the United States. The ILI rate is defined as the number of ILI patients divided by the number of all-illness patients.

The existing machine/deep learning models, such as XGBoost (XGB), Temporal Pattern Attention Long Short-Term Memory (TPA-LSTM), Temporal Convolutional Networks (TCN), and Transformer, use the current and historical values in a user-defined time window as input to predict
TABLE 1: The columns illustrate the variation of influenza in timing, severity, and duration, respectively. "Duration" counts the consecutive weeks during which the ILI rates are all over 0.01.
Seasons      Peak Week (year/week)   Highest ILI Rate   Duration (weeks)
2003-2004    2003/06                 0.0317             28
2004-2005    2003/52                 0.0706             25
2005-2006    2005/07                 0.0475             37
2006-2007    2005/52                 0.0305             31
2007-2008    2007/07                 0.0327             31
2008-2009    2008/07                 0.0542             31
2009-2010    2009/06                 0.0334             91
             2009/21                 0.0421
             2009/42                 0.0762
2010-2011    2011/05                 0.0444             37
2011-2012    2012/11                 0.0229             41
2012-2013    2012/52                 0.0603             44
2013-2014    2013/52                 0.0439             43
2014-2015    2014/52                 0.0611             41
2015-2016    2016/10                 0.0359             42
2016-2017    2017/06                 0.0481             44
MEAN         -                       0.0460             40
standard deviation (SD)  -           0.0152             16
SD/MEAN      -                       33%                39%

future values. These methods do not consider similarities outside the time window. Although one can simply increase the length of the time window to include more information, there are always some timepoints outside the window. Besides, the bigger the time window is, the fewer training instances are left, which makes the predictive model unreliable. If a method can dynamically find historical timepoints that have similar "infection situations" as auxiliary information, the model could tackle the varied seasonality and the occasional pandemics. However, how to accurately represent the "infection situations" poses a challenge, since a situation should include information on the influenza severity, tendency, duration, and other factors that may be beyond our knowledge.

In this study, we develop a novel method, namely Dynamic Virtual Graph Significance Networks (DVGSN), as Figure 2 illustrates. DVGSN constructs a virtual graph for influenza prediction. In the virtual graph, a virtual node represents a timepoint. The embedding of a virtual node represents the "infection situation" at that timepoint. A virtual edge connects two virtual nodes at two timepoints, and the edge weight measures the significance of the virtual edge.
Since the timepoints connected by the virtual edges can be outside the time window, DVGSN can break the limitation of the time window by learning from neighbor nodes and improve predictive accuracy.

A natural static graph defined beforehand with domain knowledge may not align well with the specific analytical task. As a result, the "neighborhood" in an "unsupervised graph" can be improper for the specific analytical task and damage the analytical outcomes. Different from a natural graph with static nodes and edges, in a dynamic virtual graph every node and edge are supervisedly and dynamically learned during the training procedure of the prediction task.

Fig. 2: The architecture of the proposed method, DVGSN.

Moreover, a virtual graph naturally has rich interpretability. For example, the similar "infection situations" found by the virtual graph can provide clues about how the virtual graph finds similarities and how the proposed method performs prediction for pandemics. This interpretability makes the proposed method more acceptable, especially in epidemiology and public health, where research usually emphasizes the interpretability of predictive models for further government measures.

The contributions of this work are summarized as follows.
(1) To the best of our knowledge, this is the first attempt to supervisedly learn a dynamic virtual graph for time-series prediction.
(2) The proposed method needs less domain knowledge to build a graph in advance and has rich interpretability, which is indispensable in epidemiology, public health, and the like.
(3) We carry out extensive experiments on real-world data, and the experimental results show that the proposed method significantly outperforms the state-of-the-art methods.
RELATED WORK

This section describes previous work from the points of view of influenza prediction and graph-based deep learning.
The machine/deep learning models for forecasting influenza or other time-series data mainly fall into two groups. First, some researchers focus on looking for effective "features". For example, search engine query data are used for predicting influenza in
Google Flu Trends [15], [16]. Twitter data are also used in other research [17], [18]. However, these models usually suffer from unreliable sources of huge amounts of information, such as internet searches. For example, Google's algorithm was quite vulnerable to overfitting to seasonal terms unrelated to the flu, like "high school basketball". This example also demonstrates the importance of model interpretability. Second, other researchers focus on looking for effective "models", such as RF [19], [20], [21], Gradient Boosting [19], [21], the Multilayer Perceptron (MLP) [19], [21], Long Short-Term Memory (LSTM) [19], [21], [22], the Transformer (TFR) [23], and so on. Deep-learning-based methods, e.g., the Transformer, are drawing more attention for their accuracy, while most of them suffer from poor interpretability.

Moreover, statistical models and dynamic analysis models are considered easily accessible tools for simulating patterns of infection by influenza, such as the SI, SIS, and SIR models and their variants [24]. However, their parameters are subject to change, and approximating the parameters is difficult [25], such as the basic reproduction number R0, population mobility, etc.

For mining a natural graph, such as Cora [26] and Digg [27], Graph Neural Networks (GNNs) are usually used, such as GCN, GAT, and the Graph Isomorphism Network (GIN) [28]. In an analytical task without a natural graph, a graph can be constructed beforehand to leverage powerful GNNs. Researchers need to use domain knowledge, such as medicine and transportation [29], or mathematical calculation, such as Euclidean distance [30], to construct a graph beforehand. Nonetheless, all of these graphs can be thought of as "unsupervised graphs" because the calculation for the construction is not updated by backpropagation for the specific analytical task. In other words, an "unsupervised graph" may NOT align with the specific analytical task.
As a result,the “neighborhood” in an “unsupervised graph” could beimproper for the specific analytical task and damage theanalytical outcomes. In this study, we develop a method toconstruct a “supervised graph”, which could dynamicallyand supervisedly learn the effective information from otherinstances during the training procedure in the specific ana-lytical task.
PROBLEM STATEMENT

We formally define the prediction task of the classic machine/deep learning algorithms as Formula 1:

\hat{y}^{(v,q)} = [\hat{y}^{(v+1)}, \hat{y}^{(v+2)}, \dots, \hat{y}^{(v+q)}]^T = [f_1(O^{(v,p)}), f_2(O^{(v,p)}), \dots, f_q(O^{(v,p)})]^T    (1)

where \hat{y}^{(v,q)} \in R^q is the vector to be predicted, v is a given timepoint, and q is the predictive window size; \hat{y}^{(v+i)} is the predicted value of the upcoming i-th week, and f_i(\cdot) is a time-series model that predicts the value of the upcoming i-th week; O^{(v,p)} = [o_v, o_{(v-1)}, \dots, o_{(v-p+1)}] are the observed time-series values with a time lag p, o_v is the current observed value, and o_{(v-i)} is the value of the past i-th week.

There are two types of time-series prediction: (a) single-step prediction and (b) multi-step prediction. A single-step prediction predicts the value one step in advance (q = 1 in Formula 1), and a multi-step prediction predicts consecutive values with a bigger predictive window size (q > 1 in Formula 1).

As Formula 1 shows, the classic methods heavily depend on the observations in the time window but do not consider historical similarities outside the time window.

Table 2 presents the notations used in this work.

TABLE 2: Notations and Explanations.

Notation          Explanation
O                 the observed time-series data
p                 the time lag
q                 the predictive window size
y^{(v,q)}         the vector of the true values of the predictive window size q for the node v in the virtual graph
\hat{y}^{(v,q)}   the vector of the predicted values of the predictive window size q for the node v in the virtual graph
G                 the virtual graph
V                 the set of all virtual nodes
E                 the set of all virtual edges
\chi              the observed matrix of the ILI rates
S                 the node embedding matrix of the virtual graph
T                 the adjacency matrix of the virtual graph

As shown in Table 1 and Figure 1, influenza seasons vary in timing, severity, and duration, and pandemics mutate the influenza outbreaks.
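As a concrete sketch of the windowing in Formula 1, the following Python snippet builds the input windows O^(v,p) and target vectors y^(v,q) from a raw series. The function name and the toy series are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical sketch of the windowing in Formula 1: each timepoint v is
# mapped to an input window O^(v,p) of the p most recent observations and
# a target vector y^(v,q) of the q upcoming values.
def make_windows(series, p, q):
    """X[i] = [o_v, o_{v-1}, ..., o_{v-p+1}], Y[i] = [o_{v+1}, ..., o_{v+q}]."""
    series = np.asarray(series, dtype=float)
    X, Y = [], []
    for v in range(p - 1, len(series) - q):
        X.append(series[v - p + 1 : v + 1][::-1])  # most recent value first
        Y.append(series[v + 1 : v + q + 1])
    return np.stack(X), np.stack(Y)

ili = np.arange(10) / 100.0          # toy stand-in for weekly ILI rates
X, Y = make_windows(ili, p=3, q=2)
# number of instances: n = |O| - p - q + 1 = 10 - 3 - 2 + 1 = 6,
# matching the node count used later in Algorithm 1
assert X.shape == (6, 3) and Y.shape == (6, 2)
```

Note that increasing p or q shrinks n, which is exactly the "fewer training instances" trade-off discussed in the introduction.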
Dynamically looking for similar "infection situations", instead of sticking to a fixed periodicity (roughly one year), could be a key to handling the varied seasonality and pandemics in influenza prediction. We formally define the concept of a dynamic virtual graph below.
Dynamic Virtual Graph. Different from a natural graph with static nodes and edges, we define a virtual graph as G = (V, E) with a set of nodes (V) and a set of edges (E), in which every node and edge are supervisedly or semi-supervisedly and dynamically learned during the training procedure of the prediction task.

In this study, each node is a function that can be trained to capture "infection situations", and each edge describes the significance of the similarity. Figure 3 gives an illustration of a virtual graph for predicting influenza. A virtual
Fig. 3: An illustration of a virtual graph. A virtual node represents a timepoint in the format "year/week". A virtual edge connects two virtual nodes at two timepoints. The values in the boxes above the edges are the significances of the virtual edges.

node v (v ∈ V) represents a timepoint, such as 2017/12, which means the 12th week of 2017. The embedding of the virtual node, representing the comprehensive "infection situation" at the timepoint, is denoted as s_v for the node v. A virtual edge e (e ∈ E) connects two virtual nodes at two timepoints, and the edge weight measures the significance of the virtual edge.

The virtual node representation will be used (1) to perform "infection-situation" embedding for the subsequent representation learning and (2) to learn the significance of the similarity among different "infection situations". How to define a proper function for the virtual node representation vector s_v of the node v is a pivotal problem. An "infection-situation" embedding vector needs to include comprehensive infective information, such as:
(a) the timing, severity, and duration of the infection at a given timepoint;
(b) the first-order differences ("speed") and the second-order differences ("acceleration") at a given timepoint;
(c) the tendency (upward, downward, fluctuation, or a turning point) at a given timepoint;
(d) descriptive statistics (mean, median, maximum, minimum, variance, and the like) at a given timepoint.

Many previous studies investigated how to "unsupervisedly" represent the "situation" for prediction [31], [32], [33], [34]. Some use current and past values in the time-series data as input. Others use domain knowledge, such as varying lag structures at different steps [32], to present "situation" vectors. Nonetheless, these methods of "unsupervised representation" are separate from the model training. In this study, we propose a supervised representation of the "situation" to learn a more appropriate representation for a given analytical task.

Formula 2 defines the node representation of "infection situations":

s_v = \tau(o_v, o_{(v-1)}, \dots, o_{(v-p+1)}; \theta_\tau) = \tau(O^{(v,p)}; \theta_\tau) = \sigma(W_{MLP-2}(\sigma(W_{MLP-1} O^{(v,p)})))    (2)

where \tau is a neural network, \theta_\tau is the parameter set of \tau, and \sigma is an activation function, for which we use Exponential Linear Units (ELU) in this work. A two-layer Multilayer Perceptron (MLP) is adopted; W_{MLP-1} \in R^{p \times d} and W_{MLP-2} \in R^{d \times d} are the trainable weights of the first and second layers of the MLP, respectively, where d is the hidden dimension.

There are three reasons why we project the original feature space nonlinearly into a high-dimensional space as the embedding vector of a virtual node:

(a) To perform time-series feature extraction. Time-series analyses usually adopt arbitrary feature engineering, such as Kalman filtering [35]. Selecting effective feature engineering is nontrivial, usually experience-based and time-consuming. Since an MLP can theoretically simulate any function [36], [37], [38], an MLP can perform effective representation in a high-dimensional space.

(b) To supervisedly and dynamically learn virtual node representations.
Since the MLP is part of the entire end-to-end learning, the projected embedding vectors are supervisedly learned from and for a specific analytical task. Such a supervisedly learned embedding space is expected to work better than the original feature space, which is static and cannot be updated or learned, and thereby improve accuracy.

(c) To project the virtual node embedding into an appropriate space.
The high-dimensional embedding vectors of two virtual nodes will be used to define the significance of a virtual edge. A high-dimensional space that represents a variety of complex time-series characteristics can work better than the original feature space, which consists only of the ILI rates of the current and past few weeks.
For a given node, different neighbors may have different similarities in "infection situations", so the significance of each virtual edge needs to be decided. In this study, we measure the significance of the virtual edge between the nodes u and v by performing an inner product after linear projection and instance normalization of the high-dimensional "infection-situation" embedding vectors, as Formula 3 illustrates:

t_{(v,u)} = \kappa(s_v, s_u) = inst\_norm[W_{line\_proj} s_v] \odot inst\_norm[W_{line\_proj} s_u]    (3)

where t_{(v,u)} is the significance of the virtual edge from the node u to v, inst\_norm(\cdot) is instance normalization, W_{line\_proj} \in R^{d \times d} is the trainable weight of the linear projection, and \odot is the inner product.

The significance t_{(v,u)} of the virtual edge from the node u to v has some properties:
(a) -1 \le t_{(v,u)} \le 1: the value range of the virtual edge significance is [-1, 1].
(b) t_{(v,v)} = 1: the significance of the self-loop virtual edge is 1.
(c) t_{(v,u)} = t_{(u,v)}: the virtual edge significance is symmetric.
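The three properties can be checked numerically with a small sketch of Formula 3. The division by d below is an assumption made so that the self-significance is exactly 1; the paper's exact instance-normalization scaling is not fully specified in this text, so treat this as one plausible realization.

```python
import numpy as np

def inst_norm(x, eps=1e-5):
    # per-instance normalization: zero mean, unit variance over the features
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Sketch of Formula 3 with illustrative shapes: project both embeddings with
# the same (assumed) W_line_proj, instance-normalize, take the inner product,
# and scale by d so the self-loop significance t(v,v) is ~1.
rng = np.random.default_rng(1)
d = 16
W = rng.normal(size=(d, d))          # stand-in for W_line_proj

def significance(s_v, s_u):
    a, b = inst_norm(W @ s_v), inst_norm(W @ s_u)
    return float(a @ b) / d          # ||inst_norm(x)||^2 / d ~= 1

s_v, s_u = rng.normal(size=d), rng.normal(size=d)
assert abs(significance(s_v, s_v) - 1.0) < 1e-3   # property (b)
assert abs(significance(s_v, s_u)
           - significance(s_u, s_v)) < 1e-12      # property (c)
assert -1.0 <= significance(s_v, s_u) <= 1.0      # property (a)
```

Property (a) follows from the Cauchy-Schwarz inequality once both vectors are normalized to (approximately) the same length.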
The virtual graph, composed of the virtual nodes representing the "infection situation" at each timepoint and the virtual edges with the similarity significances, is input into GNNs. For a given node, the GNNs iteratively aggregate and integrate the embeddings of its neighbors to learn a representation vector h_v. Formula 4 illustrates the l-th iteration of aggregation and integration:

h_v^{(l)} = INT(h_v^{(l-1)}, AGG(h_u^{(l-1)}; \forall u \in N(v)))    (4)

where h_v^{(l)} and h_u^{(l)} are the representation vectors at the l-th iteration/layer of the given node v and the neighbor node u, respectively; N(v) is the set of neighbor nodes; AGG(\cdot) is a function that aggregates the infective information from the neighbors N(v), i.e., the timepoints that have similar "infection situations"; and INT(\cdot) is a function that integrates the infective information from the given timepoint v itself with the infective information aggregated by AGG(\cdot) over its neighbors N(v).

A variety of AGG(\cdot) and INT(\cdot) functions have been proposed in previous studies [6], [28]. In this work, we initialize and update the node representations as Formula 5 shows:

h_v^{(0)} = s_v
h_v^{(l)} = \sigma(W_{GNN-l} \cdot \sum_{u \in N(v) \cup \{v\}} t_{(v,u)} \cdot h_u^{(l-1)})    (5)

where W_{GNN-l} \in R^{d \times d} is the trainable weight of the l-th GNN layer. In this work, we adopt two GNN layers.

The sum of the initial virtual node representation h_v^{(0)}, the representation vector after the first GNN layer h_v^{(1)}, and that after the second GNN layer h_v^{(2)} is input into a regressive layer (implemented as a linear layer) to produce the final prediction, as Formula 6 shows:

\hat{y}^{(v,q)} = W_{regr}(h_v^{(0)} + h_v^{(1)} + h_v^{(2)})    (6)

where W_{regr} \in R^{q \times d} is the trainable weight of the final regressive layer.
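Formulas 5 and 6 can be sketched in a few lines of numpy. Everything here is a toy forward pass under assumed shapes: a fully connected virtual graph of n nodes, a random significance matrix T, and random weights; the real model learns T and the weights end to end.

```python
import numpy as np

rng = np.random.default_rng(2)

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

# Toy sketch of Formulas 5 and 6: two significance-weighted aggregation
# layers over all n virtual nodes, then a linear readout on the sum of the
# three representations. T plays the role of the n x n matrix of t(v,u).
n, d, q = 5, 16, 3
S = rng.normal(size=(n, d))            # h^(0): initial node embeddings
T = rng.uniform(-1, 1, size=(n, n))    # assumed significances in [-1, 1]
T = (T + T.T) / 2                      # symmetric, as property (c) requires
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
W_regr = rng.normal(scale=0.1, size=(q, d))

H0 = S
H1 = elu((T @ H0) @ W1.T)              # h^(1)_v = elu(W1 · sum_u t(v,u) h^(0)_u)
H2 = elu((T @ H1) @ W2.T)              # second GNN layer
Y_hat = (H0 + H1 + H2) @ W_regr.T      # Formula 6: one q-step forecast per node
assert Y_hat.shape == (n, q)
```

Note the aggregation is a weighted sum, not a normalized mean; this distinction is revisited when DVGSN is compared with GAT.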
The loss function is defined as Formula 7 shows:

L = \frac{1}{n \times q} \sum_{v \in V} ||y^{(v,q)} - \hat{y}^{(v,q)}||^2 + \lambda ||T||_F    (7)

where \frac{1}{n \times q} \sum_{v \in V} ||y^{(v,q)} - \hat{y}^{(v,q)}||^2 is the predictive loss in Mean Square Error (MSE), with n the number of virtual nodes; y^{(v,q)} and \hat{y}^{(v,q)} are the vectors of the true and predicted values, respectively; T is the adjacency matrix of the virtual graph; ||T||_F is the penalty term that limits the complexity of the virtual graph and improves the robustness of the model, where ||\cdot||_F denotes the matrix Frobenius norm; and \lambda is an adjustable hyperparameter that balances the two parts of the loss.

The entire algorithm of the proposed method is shown in Figure 4 and Algorithm 1.

Fig. 4: The structure of the proposed DVGSN.

(a) The limitation of the time window

The prediction for a given timepoint v by DVGSN can be simplified and formalized as Formula 8:

\hat{y}^{(v,q)} = f_\varsigma(s_v, s_u, E; \theta_\varsigma)
= f_\varsigma(s_v, s_u, \kappa(s_v, s_u); \theta_\varsigma)
= f_\varsigma(\tau(O^{(v,p)}), \tau(O^{(u,p)}), \kappa(\tau(O^{(v,p)}), \tau(O^{(u,p)})); \theta_\tau, \theta_\varsigma)
= f_\varsigma(\tau(O), \kappa(\tau(O)); \theta_\tau, \theta_\varsigma)
= f_\varsigma(O; \theta)    (8)

where f_\varsigma is the algorithm of DVGSN and \theta represents the set of all parameters in DVGSN.

To perform prediction for the timepoint v, all the observed time-series data O in the training dataset are input into DVGSN. In comparison, the input to the classic machine/deep learning methods (as Formula 1 shows) is just
Algorithm 1 The proposed DVGSN.
Input: the observed time-series data O; the time lag p; the predictive window size q; the number of training epochs I; the adjustable hyperparameter λ.
Output: the predictive model for influenza in the United States.
  Prepare the observed matrix: χ = [O^{(v,p)}], ∀v ∈ V;
  Prepare the target matrix: Y = [y^{(v,q)}], ∀v ∈ V;
  Calculate the number of virtual nodes: n = |O| − p − q + 1;
  for i = 1, 2, ..., I do
    S ← elu(1d_conv(elu(1d_conv(χ))));
    T ← inst_norm(1d_conv(S)) (inst_norm(1d_conv(S)))^T;
    H_0 ← S; H_1 ← GNN(S, T); H_2 ← GNN(H_1, T);
    Ŷ ← regressive_layer(H_0 + H_1 + H_2);
    L ← (1/(n×q)) ||Y − Ŷ||_F^2 + λ ||T||_F;
    Perform backpropagation and update the parameters;
  end for

the observed values O^{(v,p)} in the user-defined time window with the time lag p. Inputting all the observed time-series data O lets DVGSN capture similar "infection situations" from all the timepoints instead of sticking to fixed static nodes and edges.

(b) Other differences

The virtual nodes and virtual edges in DVGSN are supervisedly learned during the training procedure of the specific prediction task, while the other existing GNN-based methods use a static graph defined beforehand. Another difference lies in the
AGG(\cdot) function in Formula 4. The l-th iteration layer of the attentive GNNs (such as GAT) and of DVGSN is illustrated in Formula 9 and Formula 10, respectively:

h_v^{(l)} = \sigma(W \cdot \sum_{u \in N(v) \cup \{v\}} \alpha_{(v,u)} \cdot h_u^{(l-1)})
\alpha_{(v,u)} = \frac{\exp(\sigma(a^T [W h_v^{(l-1)} || W h_u^{(l-1)}]))}{\sum_{k \in N(v)} \exp(\sigma(a^T [W h_v^{(l-1)} || W h_k^{(l-1)}]))}, \forall u \in N(v)    (9)

where \alpha_{(v,u)} is the normalized attention coefficient, (\cdot)^T represents matrix transposition, and || is the concatenation operation.

h_v^{(l)} = \sigma(W_{GNN-l} \cdot \sum_{u \in N(v) \cup \{v\}} t_{(v,u)} \cdot h_u^{(l-1)})
t_{(v,u)} = inst\_norm(W_{line\_proj} s_v) \odot inst\_norm(W_{line\_proj} s_u), \forall u \in N(v)    (10)

In GAT, the aggregation \alpha_{(v,u)} \cdot h_u^{(l-1)} forms a mean (precisely, a weighted mean) of the embedding vectors in the neighborhood, since \sum_{u \in N(v)} \alpha_{(v,u)} = 1 regardless of the graph structure. Comparatively, in DVGSN, the aggregation \sum_u t_{(v,u)} \cdot h_u^{(l-1)} is a sum of the embedding vectors in the neighborhood, on the condition that -1 \le t_{(v,u)} \le 1 holds. Theoretically, the expressive power of mean-based aggregators is weaker than that of sum-based aggregators, because a sum captures the full multiset while a mean captures only the proportion/distribution of elements of a given type [28].

EXPERIMENTS

We scraped the influenza data of the United States from week 2003/30 to week 2017/30 from "FluView Interactive" [39], a website of the Centers for Disease Control and Prevention, National Center for Immunization and Respiratory Diseases. The weekly ILI rates are calculated and used in this work. Figure 1 illustrates the time-series plot of the ILI rates. The piece of the curve around 2009, which has three consecutive peaks, is the 2009 pandemic. Table 1 summarizes the descriptive statistics of the influenza seasons from 2003-2004 to 2016-2017. The "mean ± standard deviation" of the "Highest ILI Rate" column is 0.0460 ± 0.0152. Besides, the standard deviation (0.0152
) is around 33% of the mean (0.0460), presenting a considerable variance in severity. The "mean ± standard deviation" of the "Duration" column is 40 ± 16 weeks, and the standard deviation (16) is around 40% of the mean (40), presenting a considerable variance in duration. Moreover, the "Peak Week" column demonstrates that the timing of the influenza seasons varies year by year.

In this work, the baseline models include a variety of state-of-the-art models. We do not compare with the SI-based models because it is difficult to obtain the values of parameters such as the numbers of susceptible and infected individuals.

• Autoregression (AR). An AR model is a statistical method that uses observed values from current and past time steps as input and implements a linear regression to predict future values.

• k-Nearest Neighbors (k-NN). The k-NN regression examines the values of a chosen number (k) of data points surrounding a target data point and uses the mean of those values as the prediction. The k-NN regression can be used for time-series prediction [40].

• Random Forest (RF). An RF model is an ensemble of decision trees trained with the "bagging" method, which leverages a combination of learning models and thereby increases overall performance [41].

• XGB. The XGB regression implements the framework of Gradient Boosting with parallel boosting, which consists of iteratively learning weak regressors with respect to a distribution and adding them to a final strong regressor [42].

• Multilayer Perceptron (MLP). An MLP is a neural network in which each node in a layer is fully connected to every node in the adjacent layers, fitting a nonlinear function. The MLP can be used for time-series analyses by mapping current and past values to one or multiple future predictive values [43].

• TPA-LSTM. The TPA-LSTM uses a set of filters to extract time-invariant temporal patterns. The extraction is similar to transforming time-series data into its "frequency domain" for forecasting [44].
• TCN. A TCN uses a causal and dilated convolutional network to predict sequential data [45], such as time series [46].

• TFR. The TFR is a model relying entirely on the attention mechanism to compute representations of its input and output, without using sequence-aligned RNNs or convolution [47]. The TFR can also be used for time-series prediction [48].

• GAT. The GAT algorithm is a type of GNN that applies the attention mechanism on graphs [6].

• GIN. The GIN is a theoretically designed model for analyzing the expressive power of GNNs in capturing different graph structures. The GIN is provably the most expressive among the class of GNNs and is theoretically as powerful as the Weisfeiler-Lehman graph isomorphism test [28].
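The simplest baseline above, AR, can be sketched as an ordinary least-squares fit from the p most recent values to the next value. The sine-plus-noise series is a toy stand-in, not the paper's data or setup.

```python
import numpy as np

# Hedged sketch of the AR baseline: linear regression from the p most recent
# observations to the next value, fitted by least squares on toy data.
rng = np.random.default_rng(4)
series = np.sin(np.arange(200) / 8.0) + 0.01 * rng.normal(size=200)

p = 6
X = np.stack([series[i : i + p] for i in range(len(series) - p)])
y = series[p:]
X1 = np.hstack([X, np.ones((len(X), 1))])          # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ coef
mse = np.mean((y - y_hat) ** 2)
assert mse < 0.01   # a noisy sine is nearly perfectly AR-predictable
```

Multi-step prediction (q > 1) would fit one such regressor per horizon step, matching the f_i in Formula 1.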
For all the baseline models and the proposed model, we randomly initialize the parameters with the uniform distribution and use the Adam optimizer [49] with a learning rate of 0.001. We set all hidden layer sizes to 256 units and use the ELU activation function. We set the number of epochs to 200 and choose the parameters with the best result on the validation set. To ensure fairness, we split the datasets and use the same training set (current timepoints from 2003/41 to 2012/02), validation set (current timepoints from 2012/03 to 2014/42), and testing set (current timepoints from 2014/43 to 2017/29 when the predictive window size is 1; from 2014/43 to 2017/27 when it is 3; and from 2014/43 to 2017/24 when it is 6) for all the models in this work. We set all the models to the same depth. For the MLP, we use a five-layer structure containing an input layer, hidden layers, and an output (regressive) layer. For TPA-LSTM, we feed the feature vectors into a TPA-LSTM layer, then input the hidden state into a 3-layer MLP, and finally apply an output layer. For TFR, we use two TFR blocks, which correspond to four layers. For GAT, we use four GAT layers and one output layer. For the proposed method, in each backpropagation step, we use all the training data as model input and randomly select a batch of nodes from the training set to calculate the loss and update the model parameters. In this work, to apply the GIN and GAT algorithms, we also connect a given virtual node to all the other virtual nodes, as DVGSN does. Our source code and dataset are available at https://github.com/aI-area/DVGSN.
To perform a comprehensive comparison, we carry out three series of experiments. We set the predictive window size to 1 for short-period prediction, and to 3 and 6, respectively, to test the predictive ability over longer periods. The loss functions of all the models average the predictive MSEs of the future weeks. To evaluate the robustness of the models, we also test each algorithm with a time lag of 6, 9, and 12, respectively. The value of λ in this group of experiments is 0.01.

Table 3 presents the results of all the models. The results show that the proposed DVGSN significantly outperforms all the baseline methods in all the prediction tasks with the predictive window size being 1, 3, and 6, which proves that DVGSN can satisfy both short- and long-period prediction tasks.

Time lag.
DVGSN shows a slight advantage with a time lag of 9. The results suggest that the most recent 9 historical observations are sufficient to construct the virtual graph nodes and reflect the current trend. A longer time lag may offer no further help, since the virtual graph can learn similar historical situations by itself. The result also shows that selecting the time-lag hyperparameter for DVGSN is relatively easy.

The other baseline methods, including kNN, RF, XGB, MLP, and TPA-LSTM, show better performance with a short time lag of 6 for the short-period prediction (predictive window size 1), while a bigger time lag of 12 is better for the longer-period prediction task with a predictive window size of 3. A longer time lag can offer more support for the longer prediction task. However, as introduced previously, the time lag is limited, and a longer one reduces the number of training instances.

The situation of TCN, TFR, GAT, and GIN is more like a compromise between the above two cases. They show better performance with a short time lag of 6 for the short-period prediction, and a slight advantage with a time lag of 9 for the longer prediction task. The graph-based solutions could reduce their dependence on the time lag. At the same time, the fact that they cannot benefit from a longer time lag for the short-period prediction may be caused by their fixed static graph mode.
ABLATION STUDY
To verify the effectiveness of the constructed dynamic graph, we designed a variant, denoted as "DVGSN(fixed)", in which the virtual edges are fixed instead of being learned. A given node is connected to the nodes at the timepoints one week ago and one year (52 weeks) ago, considering the periodicity and time-series nature of influenza. The other preprocessing steps are the same as those in the proposed method. We also adjust the hyperparameter λ to demonstrate the effectiveness of the penalty term in the loss function.

Fig. 5: The average MSEs under the different λs.

Comparison between the fixed and dynamic graph.
Table 4 compares the performance of the fixed and the dynamic graphs. In 8 of the 9 cases, the dynamic graphs perform better.
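The "DVGSN(fixed)" ablation graph described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: each timepoint node is connected to the nodes one week ago and one year (52 weeks) ago, and the edge weights are simply fixed to 1 here (the excerpt says only that they are fixed rather than learned).

```python
import numpy as np

def fixed_virtual_adjacency(n_timepoints):
    """Sketch of the "DVGSN(fixed)" virtual graph: each timepoint node
    connects to the nodes one week ago and one year (52 weeks) ago,
    reflecting influenza's weekly continuity and annual periodicity.
    Edge weights are fixed to 1 for illustration."""
    A = np.zeros((n_timepoints, n_timepoints))
    for t in range(n_timepoints):
        if t >= 1:
            A[t, t - 1] = 1.0    # neighbor: one week ago
        if t >= 52:
            A[t, t - 52] = 1.0   # neighbor: one year (52 weeks) ago
    return A
```

Nodes earlier than week 52 have only the week-ago neighbor, since no node exists one year before them.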
TABLE 3: The MSEs of all the models. The bold font indicates the best performance in each test. p and q refer to the time lag and the predictive window size, respectively.

p   q   AR      kNN     RF      XGB     MLP     TPA-LSTM  TCN     TRF     GAT     GIN
6   1   0.0796  0.1177  0.0897  0.1071  0.0795  0.0774    0.0794  0.0867  1.2326  1.1891
12  1   0.0778  0.2113  0.0844  0.0916  0.0878  0.0799    0.0840  0.0930  1.2299  1.2733

[remaining rows and the DVGSN column not recovered]
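The training objective used in these experiments, the average of the per-week predictive MSEs plus the λ-weighted penalty on the virtual graph examined in the ablation, can be sketched as follows. The exact penalty form is not given in this excerpt; an L1 norm on the learned virtual-edge weights is assumed here for illustration.

```python
import numpy as np

def dvgsn_loss(pred, target, edge_weights, lam=0.01):
    """Sketch of the training objective (assumptions flagged below):
    the average of the per-week predictive MSEs over the q future weeks,
    plus a lambda-weighted penalty that discourages a complex virtual
    graph with dense edges.  The penalty is assumed to be an L1 norm on
    the learned virtual-edge weights; the excerpt does not state its
    exact form.  pred, target: arrays of shape (batch, q)."""
    mse_per_week = ((pred - target) ** 2).mean(axis=0)  # MSE of each future week
    predictive_loss = mse_per_week.mean()               # average over the q weeks
    penalty = np.abs(edge_weights).sum()                # assumed sparsity penalty
    return predictive_loss + lam * penalty
```

With λ = 0 the penalty vanishes and only the predictive MSE remains; a very large λ lets the penalty dominate, which matches the V-shaped curve discussed below.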
TABLE 4: Comparison between the fixed and dynamic graph. The bold font indicates the better performance in each pair of comparison. [table values not recovered]

TABLE 5: Comparison among different λs in DVGSN. The bold font indicates the better performance in each pair of comparison. [table values not recovered]

Comparison among different λs in DVGSN. A bigger λ exerts a heavier penalty on a complex virtual graph with dense edges of small weights. Table 5 shows the comparison among different λs in DVGSN. In most cases, "DVGSN (λ ≠ 0)" outperforms "DVGSN (λ = 0)". In conclusion, restricting the complexity of the virtual graph improves the performance.

Figure 5 compares the average MSEs under the different λs. The X-axis and Y-axis represent the λs on a logarithmic scale and the column-average MSEs from Table 5, respectively. The curve roughly presents the shape of the letter "V". That is probably because the penalty term cannot improve robustness when λ is too small (close to zero), while when λ is too big, the loss focuses on the penalty term and neglects the predictive MSE loss. In short, in this work, DVGSN performs best when λ is approximately 0.01.

MODEL INTERPRETATION
This section explains how the proposed method works.
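The similarity computation interpreted in this section can be sketched as follows. This is an illustrative sketch, not the model's actual code: the matrix W stands in for DVGSN's learned projection parameters, and a dot product in the projected space is assumed for the significance score, which, as in the figures below, can be positive (similar situations) or negative (opposite trends).

```python
import numpy as np

def similarity_significance(current, history, W):
    """Illustrative sketch: project the "infection situation" (the ILI
    rates of the current and past weeks) into a higher-dimensional
    space with an assumed projection matrix W, then score each
    historical timepoint by a dot product in that space."""
    z_cur = W @ current                                   # projected current situation
    scores = np.array([float(z_cur @ (W @ h)) for h in history])
    order = np.argsort(-scores)                           # most positively similar first
    return scores, order
```

Ranking the scores from most positive to most negative recovers the three groups visualized in Figure 6: positively similar, dissimilar (near-zero), and negatively similar historical timepoints.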
Figure 6 illustrates the similarity that the virtual graph learns. The X-axis represents the time series from the past 11th week to the future 3rd week (p = 12 and q = 3). The Y-axis represents the ILI rate. The date in the format of "year/week" above each column of subfigures is the "current" timepoint. The model predicts the ILI rates of the future 3 weeks (on the right side of the red dashed line). In each subfigure, the red curves represent the "infection situations" of the given timepoints. The ILI rates of the current and past 11 weeks (on the left side of the red dashed line) are projected into a high-dimensional space to calculate the significance of the similarity (the green float in each subfigure). As a result, the blue curves in the top two subfigures are the most positively similar "infection situations" that the virtual graph finds; the blue curves in the third row of subfigures represent dissimilar "infection situations"; and the blue curves in the bottom two subfigures are the most negatively similar "infection situations" that the virtual graph finds.

Fig. 6: Illustrative examples of the positive similarities, dissimilarities, and negative similarities. The X-axis represents the weeks from the past 11th week (p = 12) to the future 3rd week (q = 3) to predict. The red dashed lines separate past (including current) and future timepoints. The Y-axis represents the ILI rate. The date above each column of subfigures is the given timepoint. The green float in each subfigure is the significance of the similarity between the given virtual node (the red curve) and the neighbor node (the blue curve).

This section explores whether DVGSN can deal with the varied influenza seasonality. Figure 7 gives two examples. The X-axis represents the time series from the past 9th week to the upcoming 3rd week (p = 9 and q = 3). The red curves represent the "infection situations" of the given timepoints; the blue and green curves represent the two most similar "infection situations". We find that the two most similar timepoints do not correspond to the same week of the previous years, which demonstrates that DVGSN can tackle varied influenza seasonality instead of sticking to a periodicity of 52 weeks.

Fig. 7: Illustrative examples of how DVGSN deals with the varied seasonality. The X-axis represents the weeks from the past 9th week to the upcoming 3rd week (p = 9 and q = 3). The "infection situations" of the given timepoints are represented by the red curves; the two most similar "infection situations" are represented by the blue and green curves. The number after "w:" in the figure legend is the significance of the similarity.

This section explores how DVGSN learns from the historical "infection situations" and predicts the pandemic. We present five examples from the 2009 pandemic in Figure 8. The X-axis and Y-axis represent the time series from 2002/40 to 2017/30 and the ILI rates, respectively. In each subfigure, the red piece represents the "infection situation" of the given timepoint in the pandemic. The timepoints in the five subfigures are 2009/07, 2009/13, 2009/37, 2009/51, and 2010/11, which represent a rise, a fall, a rebound after reaching a bottom, a drop after reaching a peak, and fluctuations after a huge drop in the 2009 pandemic, respectively. The two yellow pieces in the curves represent two of the most similar "infection situations" that DVGSN learns. The number after "w:" in each figure legend is the significance of the similarity. By comparing the past "infection situations" and future tendencies between the red piece and the two yellow pieces in the five examples, we conclude that DVGSN can find and learn from similar "infection situations" outside the pandemic and thereby make a reliable model for influenza prediction.

Fig. 8: This figure illustrates how DVGSN learns from the historical "infection situations" and predicts the pandemic. The X-axis represents the time series from 2002/40 to 2017/30.

CONCLUSION
In this work, we proposed a method, DVGSN, which can find similar "infection situations" outside the time window and therefore improve the predictive accuracy for influenza. The extensive experiments on real-world influenza data demonstrate that DVGSN significantly outperforms the current state-of-the-art methods. Besides, the proposed method has rich interpretabilities, which provide clues about how the model performs prediction for influenza. Another strong point of the proposed method is that it needs less domain knowledge to build a graph in advance, which can be very difficult in medical-science-related fields. Like all deep-learning-based methods, the proposed method depends on enough data to train the model. Hopefully, this method can help us better prepare for influenza outbreaks and work well on other public-health-related analytical tasks.

ACKNOWLEDGMENTS
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB38040200). We would like to thank all the authors of the open-source code used in the baseline methods.

REFERENCES

[1] A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato, "Finding statistically significant communities in networks,"
PLoS One, vol. 6, no. 4, 2011.
[2] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.
[4] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[5] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[6] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[7] H. Influenza, "Key facts about influenza (flu) vaccine,"
Journal of Virology, vol. 73, no. 3, pp. 1878–1884, 1999.
[12] P. Suárez, J. Valcárcel, and J. Ortín, "Heterogeneity of the mutation rates of influenza A viruses: Isolation of mutator mutants," Journal of Virology, vol. 66, no. 4, pp. 2491–2494, 1992.
[13] J. Puig-Barberà, A. Tormos, A. Sominina, E. Burtseva, O. Launay, M. A. Ciblak, A. Natividad-Sancho, A. Buigues-Vila, S. Martínez-Úbeda, and C. Mahé, "First-year results of the global influenza hospital surveillance network: 2012–2013 northern hemisphere influenza season," BMC Public Health, vol. 14, no. 1, p. 564.
Nature, vol. 457, 2009.
[16] K. Lee, A. Agrawal, and A. Choudhary, "Forecasting influenza levels using real-time social media streams," in , 2017.
[17] S. Molaei, M. Khansari, H. Veisi, and M. Salehi, "Predicting the spread of influenza epidemics by analyzing twitter messages," Health and Technology.
[18] J. Li and C. Cardie, "Early stage influenza detection from twitter," Computer Science, 2013.
[19] A. Darwish, Y. Rahhal, and A. Jafar, "A comparative study on predicting influenza outbreaks using different feature spaces: application of influenza-like illness data from early warning alert and response system in Syria," BMC Research Notes, vol. 13, no. 1, pp. 1–8, 2020.
[20] M. J. Kane, N. Price, M. Scotch, and P. Rabinowitz, "Comparison of ARIMA and random forest time series models for prediction of avian influenza H5N1 outbreaks," BMC Bioinformatics, vol. 15, no. 1, p. 276.
[21] J. Zhang and K. Nawata, "A comparative study on predicting influenza outbreaks," Bioscience Trends, 2017.
[22] R. Yin, E. Luusua, J. Dabrowski, Y. Zhang, and C. K. Kwoh, "Tempel: Time-series mutation prediction of influenza A viruses via attention-based recurrent neural networks," Bioinformatics, 2020.
[23] N. Wu, B. Green, X. Ben, and S. O'Banion, "Deep transformer models for time series forecasting: The influenza prevalence case," arXiv preprint arXiv:2001.08317, 2020.
[24] V. Dukic, H. F. Lopes, and N. G. Polson, "Tracking epidemics with google flu trends data and a state-space SEIR model," Journal of the American Statistical Association, vol. 107, 2012.
[25] Q. Wu, X. Fu, Z. Jin, and M. Small, "Influence of dynamic immunization on epidemic spreading in networks," Physica A: Statistical Mechanics and its Applications, vol. 419, pp. 566–574.
[26] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, pp. 93–93, 2008.
[27] T. Hogg and K. Lerman, "Social dynamics of digg," EPJ Data Science, vol. 1, no. 1, p. 5, 2012.
[28] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" arXiv preprint arXiv:1810.00826, 2018.
[29] X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, and Y. Liu, "Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3656–3663.
[30] S. Fu, X. Yang, and W. Liu, "The comparison of different graph convolutional neural networks for image recognition," in Proceedings of the 10th International Conference on Internet Multimedia Computing and Service, 2018, pp. 1–6.
[31] J. Alberg and Z. C. Lipton, "Improving factor-based quantitative investing by forecasting company fundamentals," arXiv preprint arXiv:1711.04837, 2017.
[32] S. F. Crone and S. Häger, "Feature selection of autoregressive neural network inputs for trend time series forecasting," in . IEEE, 2016, pp. 1515–1522.
[33] R. J. Frank, N. Davey, and S. P. Hunt, "Time series prediction and neural networks," Journal of Intelligent and Robotic Systems, vol. 31, no. 1-3, pp. 91–103, 2001.
[34] M. Ghiassi, H. Saidane, and D. Zimbra, "A dynamic artificial neural network model for forecasting time series events," International Journal of Forecasting, vol. 21, no. 2, pp. 341–362, 2005.
[35] I. K. F. Links, "An introduction to the kalman filter," 1995.
[36] K. Hornik, "Multilayer feedforward networks are universal approximators," vol. 2, no. 5, pp. 359–366, 1989.
[37] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Networks, vol. 6, no. 6, pp. 861–867.
[38] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks.
Water Resources Management, vol. 34, no. 1, pp. 263–282, 2020.
[41] H. Wu, Y. Cai, Y. Wu, R. Zhong, Q. Li, J. Zheng, D. Lin, and Y. Li, "Time series analysis of weekly influenza-like illness rate using a one-year period of factors in random forest regression," Bioscience Trends, 2017.
[42] R. A. Abbasi, N. Javaid, M. N. J. Ghuman, Z. A. Khan, S. U. Rehman et al., "Short term load forecasting using xgboost," in Workshops of the International Conference on Advanced Information Networking and Applications. Springer, 2019, pp. 1120–1131.
[43] J. Cao, Z. Li, and J. Li, "Financial time series forecasting model based on CEEMDAN and LSTM," Physica A: Statistical Mechanics and its Applications, vol. 519, pp. 127–139, 2019.
[44] S.-Y. Shih, F.-K. Sun, and H.-y. Lee, "Temporal pattern attention for multivariate time series forecasting," Machine Learning, vol. 108, no. 8-9, pp. 1421–1441, 2019.
[45] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[46] P. Hewage, A. Behera, M. Trovati, E. Pereira, M. Ghahremani, F. Palmieri, and Y. Liu, "Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station," Soft Computing, pp. 1–30, 2020.
[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[48] Y. Wang, L. Wang, Q. Chang, and C. Yang, "Effects of direct input–output connections on multilayer perceptron neural networks for time series prediction," Soft Computing, pp. 1–10, 2019.
[49] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980.