Radflow: A Recurrent, Aggregated, and Decomposable Model for Networks of Time Series
Alasdair Tran, Alexander Mathews, Cheng Soon Ong, Lexing Xie
Australian National University; Data61, CSIRO
{alasdair.tran,alex.mathews,chengsoon.ong,lexing.xie}@anu.edu.au
ABSTRACT
We propose a new model for networks of time series that influence each other. Graph structures among time series are found in diverse domains, such as web traffic influenced by hyperlinks, product sales influenced by recommendation, or urban transport volume influenced by road networks and weather. There has been recent progress in graph modeling and in time series forecasting, respectively, but an expressive and scalable approach for a network of series does not yet exist. We introduce Radflow, a novel model that embodies three key ideas: a recurrent neural network to obtain node embeddings that depend on time, the aggregation of the flow of influence from neighboring nodes with multi-head attention, and the multi-layer decomposition of time series. Radflow naturally takes into account dynamic networks where nodes and edges change over time, and it can be used for prediction and data imputation tasks. On real-world datasets ranging from a few hundred to a few hundred thousand nodes, we observe that Radflow variants are the best performing model across a wide range of settings. The recurrent component in Radflow also outperforms N-BEATS, the state-of-the-art time series model. We show that Radflow can learn different trends and seasonal patterns, that it is robust to missing nodes and edges, and that correlated temporal patterns among network neighbors reflect influence strength. We curate WikiTraffic, the largest dynamic network of time series with 366K nodes and 22M time-dependent links spanning five years. This dataset provides an open benchmark for developing models in this area, with applications that include optimizing resources for the web. More broadly, Radflow has the potential to improve forecasts in correlated time series networks such as the stock market, and impute missing measurements in geographically dispersed networks of natural phenomena.
ACM Reference Format:
Alasdair Tran, Alexander Mathews, Cheng Soon Ong, Lexing Xie. 2021. Radflow: A Recurrent, Aggregated, and Decomposable Model for Networks of Time Series. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442381.3449945
KEYWORDS
time series, networks, graphs, sequence models, wikipedia
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19–23, 2021, Ljubljana, Slovenia. © 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-8312-7/21/04. https://doi.org/10.1145/3442381.3449945
1 INTRODUCTION

Forecasting time series is a long-standing research problem that is applicable to econometrics, marketing, astronomy, ocean sciences, and other domains. Similarly, networks are the subject of active research with broad relevance to areas such as transportation, internet infrastructure, signaling in biological processes, and online media. In this paper, we are concerned with forecasting among a network of time series with mutual influences. Tools for tackling this problem will help answer questions about complex systems evolving over time in the above application domains and beyond. In the context of recent progress in end-to-end learning for time series and large networks, there are three prominent challenges.

The first is expressiveness, i.e. building models to represent richer classes of functions. Recent works in predicting book sales [6] and online video views [38] employed simple aggregation [6] or linear combinations [38] of the last observation of incoming nodes. Recent graph neural networks [8, 14, 35] provide flexible aggregation among network neighbors, but do not readily apply to time series. N-BEATS [21] is the new state of the art on time series benchmarks, using a stack of neural modules to decompose history; but this architecture does not provide a usable representation for a network of series, as the neural modules do not explicitly encode the temporal structure of the data. There are several graph-to-sequence tasks [4, 39] considered in the natural language processing (NLP) domain, but the networks of time series problem, cast in such terminology, is graphs-of-sequences to graphs-of-sequences.

The second challenge is scale. Our goal is to model longitudinal (e.g., daily) time series spanning a few years, and large networks on the order of hundreds of thousands of nodes. This requires scalability in the time series component, the graph component, as well as their interactions. The recently proposed T-GCN model [40], for example, nests a graph neural network within a recurrent neural network, whose space and time complexity prevent it from scaling to web-scale networks; the networks used in its evaluation contain only a few hundred nodes.

The third is the dynamic nature of links and nodes in the network. For example, Wu et al. [38] reported that 50% of online video recommendation links appear in fewer than 5 out of 63 days of observation, and we observe that more than 100K new Wikipedia pages were created over the first half of 2020 alone. Dynamic networks are an active topic of attention for graph neural networks [18, 22, 33], but these existing algorithms are designed for link prediction and node classification, not for time series forecasting.

We propose a novel neural model for networks of time series that tackles all three challenges. We adopt a Recurrent structure that affords time-sensitive Aggregations of network flow on top of the Decomposition principle of time series; hence the name Radflow. It is more expressive than N-BEATS [21] because it can generate node embeddings to handle graph inputs. It is more scalable than T-GCN as it can process hundreds of thousands of nodes via network attention and importance-weighted node sampling. The structure of Radflow allows it to take dynamically changing nodes and edges as inputs, makes it tolerant to missing data, and is suitable for multivariate time series. Moreover, its multi-head attention strategy and layered decompositions provide interpretations over network influences and time.

Radflow is evaluated on four datasets. Two are urban traffic networks consisting of several hundred nodes; two are large-scale datasets: VevoMusic containing 61K videos [38] and a newly curated WikiTraffic dataset containing 366K pages and 22M dynamic links. On both VevoMusic and WikiTraffic, Radflow without network information is consistently better than the comparable N-BEATS [21]. Among models with network information, Radflow variants perform the best in both imputation and forecasting tasks. In particular, Radflow outperforms the state-of-the-art ARNet [38] by 19% in SMAPE score on VevoMusic. We find that the layers in the recurrent component capture different seasonality and trends, while attention over the network captures the time-varying influence from neighboring nodes. Fig. 1 illustrates the task of predicting 28 days of view counts on Yellow vests movement, based on the historical traffic of that page and the traffic of the neighboring pages. Radflow correctly predicts the sharp drop that is observed during the test period.

Figure 1: Overview of the Radflow model, centered around the WikiTraffic subgraph of Yellow vests movement (a social movement in France since 2018, shown on the left). Each node is a page with a time series of view counts, shown in its individual mini-panel. Edge strengths correspond to average attention scores in the forecast period. The final forecast is produced by summing over the output from eight layers of the recurrent component and the network flow aggregation component (shown on the right). Radflow correctly predicts a sharp drop in traffic volume for Yellow vests movement from 3 June 2020 to 30 June 2020, due to information from neighboring nodes such as Black bloc and Turning Point USA.

Our key contributions include:
(1) Radflow, an end-to-end neural forecasting model for dynamic networks of multivariate time series that is scalable to hundreds of thousands of nodes.
(2) Interpretable predictions over time series patterns via layered decompositions, and over network neighbors via multi-head attention.
(3) Consistently outperforming state-of-the-art time series forecasting models and networked series forecasting models on real-world datasets across a diverse set of tasks.
(4) WikiTraffic, the largest dynamic network of time series, containing multi-dimensional traffic data from 366K Wikipedia pages and 22M dynamic links over five years. The dataset, code, and pretrained models are available on GitHub (https://github.com/alasdairtran/radflow).

2 RELATED WORK

Time series modeling has an extensive literature spanning many fields. Classical approaches [11] include exponential smoothing and autoregressive integrated moving average (ARIMA) models. Exponential smoothing uses exponentially decaying weights of past observations, while extensions can incorporate both trends and seasonality [3, 10, 36]. ARIMA [2] models aim to describe autocorrelations in a time series using a linear combination of past observations and forecasting errors; extensions can also incorporate seasonality. In recent years, neural network approaches have become more popular. Wu et al. [37] used transformers to forecast influenza activity. Zhu and Laptev [42] used a Bayesian neural network to model uncertainty in forecasting.

Oreshkin et al. [21] were the first to show that a pure neural model without any time series-specific component can outperform existing statistical techniques on the benchmark datasets M3 [16], M4 [17], and Tourism [1]. Their proposed model, N-BEATS, treats time series prediction as a non-linear multivariate regression problem that outputs a fixed-length vector as the forecast. N-BEATS' key modeling component is the stacking of layers, each of which takes as input the residual time series calculated by the previous layers.
However, N-BEATS only works with one-dimensional time series and does not produce a time series representation vector at each step, making it difficult to use in settings with dynamic network information. We address both of these shortcomings by adopting a recurrent network structure that produces time series representations (which we call embeddings) at any time step, while still taking advantage of the residual stacking idea from N-BEATS.

Another type of time series is discrete events happening in continuous time, often described by temporal point processes. Predictive approaches using point processes [19] require the data to contain details of individual events rather than the aggregate statistics that are more common in large-scale web data due to privacy and storage constraints. Point process estimation models are typically quadratic in the number of events, which is too expensive for large-scale data like VevoMusic and WikiTraffic.

Network effects in online services are an active area of research that studies how links between online items determine properties such as visibility, influence, and future behavior. Social networks and information networks are frequently studied in this context. On Twitter, Su et al. [29] showed that the introduction of a new network-based recommender system resulted in a substantial change to the network structure, exacerbating the "rich get richer" phenomenon. On Wikipedia, the link structure has been used to track the evolution of emerging topics [12] and the flow of traffic caused by exogenous events [41]. Zhu et al. [41] showed that when a Wikipedia article gains attention from an exogenous event, it can lead to a substantial rise in attention on downstream hyperlinked articles. Kämpf et al. [12] showed that the evolution of an emerging topic can be tracked and predicted using Wikipedia page views and internal links. The product recommendation network on Amazon has been shown to affect purchasing decisions [20]. Wu et al. [38] showed that the network induced by YouTube's recommender system leads to a flow of influence between music videos.

One recent approach to modeling the network effect is to use graph neural networks to generate low-dimensional embeddings of nodes in a graph. Early methods such as node2vec [7] and DeepWalk [24] are transductive, designed mainly to work with a fixed graph. Recent models can be applied in an inductive setting that requires generating embeddings for nodes not seen during training. This is done, for example, by sampling and aggregating node features from the local neighborhood in GraphSage [8]. Various aggregation methods have been proposed, including max-pooling [8] and mean-pooling [14]. Veličković et al. [35] proposed the Graph Attention Network (GAT), which uses a modified version of multi-head attention [34] to aggregate the neighborhood. In our proposed architecture, network embedding and aggregation are key components, and a variety of network aggregation mechanisms can be used (see Section 4.2 and Section 6). Our suggested aggregation mechanism is similar to GAT, but instead uses the original, and more common, dot-product formulation of multi-head attention.
As networks of time series are a new area of research, there are few forecasting methods and a limited number of datasets. Early approaches ignore the network structure and instead treat each node as an independent series [25, 30]. Wu et al. [38] incorporated the local network structure into an autoregressive time series model, but the architecture only works with a static graph. Zhao et al. [40] proposed a new Recurrent Neural Network (RNN) cell called T-GCN that takes into account the structure of a static graph by incorporating a Graph Convolution Network (GCN) component. The bundling of these two components and the lack of neighborhood sampling make T-GCN too computationally expensive to apply to graphs of more than a few hundred nodes.

A related problem is predicting how edges in networks change, such as with point processes [32, 33] or two-dimensional attention over graphs and time [27]. We do not tackle this problem; instead we assume that the dynamic graph is observed (e.g., generated by a recommender system or crowd-sourcing), and the prediction target is a time series on each node rather than the evolving graph itself. Our work is the first forecasting model optimized for large dynamic networks of time series.
3 PROBLEM FORMULATION

Consider the problem of time series forecasting in a graph. The input is a graph $G = (\mathbf{V}, \mathbf{E})$, consisting of $N$ nodes denoted $\mathbf{V} = \{\mathbf{v}^1, \mathbf{v}^2, \ldots, \mathbf{v}^N\}$, and $M$ edges. Each node $\mathbf{v}^j$ is associated with a multivariate time series having $T$ observations:

$$\mathbf{v}^j = [\mathbf{v}^j_1, \mathbf{v}^j_2, \ldots, \mathbf{v}^j_T] \quad (1)$$

where $\mathbf{v}^j_t \in \mathbb{R}^D$ is the $D$-dimensional observation vector of node $\mathbf{v}^j$ at time step $t$. When the time series has only one value per time step (univariate), then $D = 1$. We use $\mathbf{v}^j_{[t:s]}$ to denote a subsequence of $\mathbf{v}^j$ containing all observations from time $t$ to $s$, where $t \le s$:

$$\mathbf{v}^j_{[t:s]} = [\mathbf{v}^j_t, \mathbf{v}^j_{t+1}, \ldots, \mathbf{v}^j_{s-1}, \mathbf{v}^j_s] \quad (2)$$

If node $\mathbf{v}^i$ has the potential to directly influence the time series of node $\mathbf{v}^j$ at time step $t$, then we add a directed edge $e^{ij}_t$ from $\mathbf{v}^i$ to $\mathbf{v}^j$, and $\mathbf{v}^i$ becomes a neighbor of $\mathbf{v}^j$. We define $\mathcal{N}_t(\mathbf{v}^j)$ to be the set of neighbors of $\mathbf{v}^j$ at time step $t$. Edges may appear and disappear over time, thus $G$ is a dynamic graph. We can represent $G$ as an adjacency array $\mathbf{A} \in \mathbb{R}^{N \times N \times T}$. For an unweighted directed graph, the entry $a_{ijt}$ in $\mathbf{A}$ is 1 if edge $e^{ij}_t$ exists, and 0 otherwise.

We now define the time series forecasting problem as it applies to dynamic graphs. The forecast length $F$ is the number of future time steps for which the model will make predictions, while the backcast length $B$ is the number of past observations available for making such predictions. Suppose we are currently at time $t = 0$. To forecast the time series of node $\mathbf{v}^j$ (which we shall call the ego node) from time step 1 to $F$, the forecast model

$$\hat{\mathbf{v}}^j_{[1,F]} = \text{ForecastModel}\big(\mathbf{v}^j_{[-B+1:0]},\; \mathcal{V}_{\mathcal{N}(\mathbf{v}^j)}\big) \quad (3)$$

will take two sets of inputs: the $B$ most recent observations of $\mathbf{v}^j$ and the information from its neighbors. This leads to two different settings, both of which will be evaluated in Section 7. The first is Imputation, in which we observe the true values of the neighbors at the time of prediction. This amounts to using the ground-truth observations of the neighbors during the forecast period:

$$\mathcal{V}_{\mathcal{N}(\mathbf{v}^j)} = \big\{\mathbf{v}'_{[-B+1:F]} \;\big|\; \mathbf{v}' \in \mathcal{N}(\mathbf{v}^j)\big\} \quad (4)$$

This is the setting used by Wu et al. [38] and is most useful when the main goal is to fill in missing data in the time series, or to interpret the influence between nodes. The second setting is Forecast, where we first use our best pure time series model to predict the future observations of each neighbor. These predictions are then used in the full model to forecast $\mathbf{v}^j$ itself.

In both settings, the final output of the model is

$$\hat{\mathbf{v}}^j_{[1,F]} = [\hat{\mathbf{v}}^j_1, \hat{\mathbf{v}}^j_2, \ldots, \hat{\mathbf{v}}^j_F] \quad (5)$$

corresponding to the forecast values for the next $F$ time steps. Here we use the hat notation to denote the model's predictions, e.g., $\hat{\mathbf{v}}^j_t$ is the forecast of the ground truth $\mathbf{v}^j_t$.

4 THE RADFLOW MODEL

Radflow consists of two main modules: a recurrent component and a flow aggregation component. The recurrent component models all the time series in the graph independently, while the flow aggregation component additively adjusts the predictions based on the neighboring time series. The forecast $\hat{\mathbf{v}}^j_t$ of node $\mathbf{v}^j$ at time step $t$ is obtained by summing the outputs of the two modules,

$$\hat{\mathbf{v}}^j_t = \hat{\mathbf{v}}^{jR}_t + \hat{\mathbf{v}}^{jA}_t \quad (6)$$

where $\hat{\mathbf{v}}^{jR}_t$ is the forecast contribution from the recurrent component and $\hat{\mathbf{v}}^{jA}_t$ is the contribution from the aggregation component. Note that $\hat{\mathbf{v}}^{jA}_t$ is itself a function of $\hat{\mathbf{v}}^{jR}_t$.

4.1 The Recurrent Component

We predict time series by breaking them down into $L$ components using stacked recurrent blocks. The recurrent component is also designed to feed into the flow aggregation component, which uses the node vectors to aggregate information in a neighborhood. Fig. 2 shows a schematic diagram of the recurrent component.

Figure 2: An overview of Radflow's recurrent block (Section 4.1). Shown here is the first of $L$ blocks at time $t$, which takes a projected representation $\mathbf{z}^{j1}_t$ of the raw observations as input, and produces the backcast vector $\mathbf{p}^{j1}_t$, the forecast vector $\mathbf{q}^{j1}_t$, and the node vector $\mathbf{u}^{j1}_t$. The backcast vector is subtracted from $\mathbf{z}^{j1}_t$ to obtain the (residual) input for the next block. The forecast vector is an additive component of the overall forecast $\hat{\mathbf{v}}^{jR}_{t+1}$ in Eq. (15). The node vector is used when aggregating the neighbors (Section 4.2).

We first project the historical observations of the time series into a latent space in $\mathbb{R}^H$, where $H$ is the hidden state size:

$$\mathbf{z}^{j1}_t = \mathbf{W}_D \mathbf{v}^j_t \quad (7)$$

Here $\mathbf{W}_D \in \mathbb{R}^{H \times D}$ is a learnable weight matrix. For an intuitive justification of this projection, consider the special case where $D = 1$ and $\mathbf{W}_D$ is the all-ones vector. Then $\mathbf{z}^{j1}_t$ would contain $H$ copies of the observation $v^j_t$, which resembles running an ensemble of $H$ different time series models in parallel.

The recurrent component of our model consists of $L$ blocks. Let $\mathbf{z}^{j\ell}_t$ be the input to block $\ell$ for node $\mathbf{v}^j$ at step $t$. In particular, the vector $\mathbf{z}^{j1}_t$ computed in Eq. (7) is the input to the first block. Each block outputs three vectors: the backcast vector $\mathbf{p}^{j\ell}_t$, the forecast vector $\mathbf{q}^{j\ell}_t$, and the node vector $\mathbf{u}^{j\ell}_t$:

$$(\mathbf{p}^{j\ell}_t, \mathbf{q}^{j\ell}_t, \mathbf{u}^{j\ell}_t) = \text{Block}_\ell(\mathbf{z}^{j\ell}_t) \quad (8)$$

with $\mathbf{p}^{j\ell}_t, \mathbf{q}^{j\ell}_t, \mathbf{u}^{j\ell}_t \in \mathbb{R}^H$. Inside each block, we have an LSTM cell followed by feedforward layers. The LSTM cell operates first, accepting as input the time series residual computed by the previous block $\mathbf{z}^{j\ell}_t \in \mathbb{R}^H$, the previous time step's hidden state $\mathbf{h}^{j\ell}_{t-1} \in \mathbb{R}^H$, and the cell state $\mathbf{c}^{j\ell}_{t-1} \in \mathbb{R}^H$. The LSTM cell produces a hidden output $\mathbf{h}^{j\ell}_t$, which is then passed through three different feedforward layers:

$$\mathbf{p}^{j\ell}_t = \text{FeedForward}^P_\ell(\mathbf{h}^{j\ell}_t) \quad (9)$$
$$\mathbf{q}^{j\ell}_t = \text{FeedForward}^Q_\ell(\mathbf{h}^{j\ell}_t) \quad (10)$$
$$\mathbf{u}^{j\ell}_t = \text{FeedForward}^U_\ell(\mathbf{h}^{j\ell}_t) \quad (11)$$

Each of the feedforward layers consists of two linear projections with a GELU activation after the first:

$$\text{FeedForward}(\mathbf{h}) = \mathbf{W}_{FF2}\, \text{GELU}(\mathbf{W}_{FF1} \mathbf{h}) \quad (12)$$

The GELU activation function is a stochastic variant of ReLU that has been shown to outperform ReLU in sequence-to-sequence models [9]. It is defined as $\text{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard Gaussian cumulative distribution function.

The first output $\mathbf{p}^{j\ell}_t$ is the component of the projected time series captured by block $\ell$. Subsequent blocks depend on the residual value of the projected time series after removing this component:

$$\mathbf{z}^{j,\ell+1}_t = \mathbf{z}^{j\ell}_t - \mathbf{p}^{j\ell}_t \quad (13)$$

The second output $\mathbf{q}^{j\ell}_t$ is block $\ell$'s contribution to the forecast of the next time step. The final forecast representation of the recurrent component is the sum over all blocks,

$$\hat{\mathbf{q}}^{jR}_{t+1} = \sum_{\ell=1}^{L} \mathbf{q}^{j\ell}_t \quad (14)$$

where $\hat{\mathbf{q}}^{jR}_{t+1} \in \mathbb{R}^H$. We then project this into $\mathbb{R}^D$ to obtain the forecast contribution from the recurrent component, i.e. the first term in Eq. (6):

$$\hat{\mathbf{v}}^{jR}_{t+1} = \mathbf{W}_R\, \hat{\mathbf{q}}^{jR}_{t+1} \quad (15)$$
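To make the block structure concrete, here is a minimal PyTorch sketch of the recurrent component in Eqs. (7) to (16). This is our own illustrative reconstruction, not the released implementation: the class names, the tensor layout, and the use of nn.LSTM over the whole sequence (rather than an explicitly unrolled LSTM cell) are all assumptions.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Two linear projections with a GELU in between (Eq. 12)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.ff1 = nn.Linear(hidden_size, hidden_size)
        self.ff2 = nn.Linear(hidden_size, hidden_size)

    def forward(self, h):
        return self.ff2(torch.nn.functional.gelu(self.ff1(h)))


class RecurrentBlock(nn.Module):
    """One of the L blocks: an LSTM followed by three feedforward heads (Eqs. 8-11)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.backcast_head = FeedForward(hidden_size)  # p, Eq. (9)
        self.forecast_head = FeedForward(hidden_size)  # q, Eq. (10)
        self.node_head = FeedForward(hidden_size)      # u, Eq. (11)

    def forward(self, z):
        # z: [batch, time, hidden], the residual input from the previous block
        h, _ = self.lstm(z)
        return self.backcast_head(h), self.forecast_head(h), self.node_head(h)


class RecurrentComponent(nn.Module):
    """Stack of L blocks with residual inputs (Eq. 13) and summed outputs (Eqs. 14-16)."""
    def __init__(self, n_features: int, hidden_size: int, n_blocks: int = 8):
        super().__init__()
        self.proj_in = nn.Linear(n_features, hidden_size, bias=False)   # W_D, Eq. (7)
        self.proj_out = nn.Linear(hidden_size, n_features, bias=False)  # W_R, Eq. (15)
        self.blocks = nn.ModuleList(
            [RecurrentBlock(hidden_size) for _ in range(n_blocks)])

    def forward(self, v):
        # v: [batch, time, n_features], log-transformed observations
        z = self.proj_in(v)
        q_sum = torch.zeros_like(z)
        u_sum = torch.zeros_like(z)
        for block in self.blocks:
            p, q, u = block(z)
            z = z - p            # next block sees the residual, Eq. (13)
            q_sum = q_sum + q    # forecast contribution, Eq. (14)
            u_sum = u_sum + u    # node embedding, Eq. (16)
        # At each step t, proj_out(q_sum) is the one-step-ahead forecast of Eq. (15).
        return self.proj_out(q_sum), u_sum
```

At step $t$, the first returned tensor plays the role of $\hat{\mathbf{v}}^{jR}_{t+1}$ and the second is the node embedding $\mathbf{u}^j_t$ that feeds the flow aggregation component described next.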
4.2 The Flow Aggregation Component

The flow aggregation component models the influence between the time series of neighboring nodes in the network. This component takes as input time-dependent embeddings from the recurrent component of each node in the neighborhood, and produces as output the second term in Eq. (6). Each embedding summarizes the time series of the corresponding node up to the current time. Let $\mathbf{u}^j_t$ be the embedding of the ego node $\mathbf{v}^j$ at time step $t$, formed by summing the node vectors $\mathbf{u}^{j\ell}_t$ over all $L$ blocks:

$$\mathbf{u}^j_t = \sum_{\ell=1}^{L} \mathbf{u}^{j\ell}_t \quad (16)$$

In the Imputation setting, the set of embeddings of all neighbors of the ego at time $t+1$ is

$$\mathcal{U}^j_{t+1} = \big\{\mathbf{u}^i_{t+1} \;\big|\; i \text{ s.t. } \mathbf{v}^i \in \mathcal{N}_{t+1}(\mathbf{v}^j)\big\} \quad (17)$$

while in the Forecast setting, we simply replace the ground truth $\mathbf{u}^i_{t+1}$ with the forecast $\hat{\mathbf{u}}^i_{t+1}$. We now project the ego's embedding into the query space,

$$\mathbf{u}^{Qj}_t = \mathbf{W}_Q \mathbf{u}^j_t \quad (18)$$

and the neighbors' embeddings into the key and value spaces,

$$\mathbf{u}^{Ki}_{t+1} = \mathbf{W}_K \mathbf{u}^i_{t+1} \quad \forall i \text{ s.t. } \mathbf{v}^i \in \mathcal{N}_{t+1}(\mathbf{v}^j) \quad (19)$$
$$\mathbf{u}^{Vi}_{t+1} = \mathbf{W}_V \mathbf{u}^i_{t+1} \quad \forall i \text{ s.t. } \mathbf{v}^i \in \mathcal{N}_{t+1}(\mathbf{v}^j) \quad (20)$$

The aggregated embedding $\tilde{\mathbf{u}}^j_{t+1}$ is then the weighted sum of the values with a GELU activation,

$$\tilde{\mathbf{u}}^j_{t+1} = \text{GELU}\Big(\sum_i \lambda_i \mathbf{u}^{Vi}_{t+1}\Big) \quad (21)$$

where the weights $\lambda_i$, called attention scores, are computed from the dot product between the query and the keys, followed by a softmax. Note that the ego node is not included in the aggregation; instead it is added separately,

$$\hat{\mathbf{u}}^j_{t+1} = \mathbf{W}_E \mathbf{u}^j_t + \mathbf{W}_N \tilde{\mathbf{u}}^j_{t+1} \quad (22)$$

which is then projected down to $\mathbb{R}^D$,

$$\hat{\mathbf{v}}^{jA}_{t+1} = \mathbf{W}_A \hat{\mathbf{u}}^j_{t+1} \quad (23)$$

The vector $\hat{\mathbf{v}}^{jA}_{t+1}$ is the forecast contribution from the flow aggregation component, i.e. the second term in Eq. (6).

We call the full model with multi-head attention Radflow. Note that the flow aggregation component and the recurrent component are decoupled, so we can easily substitute the multi-head attention with another node aggregation method. In particular, if we replace Eq. (21) with a simple arithmetic average of the neighbors,

$$\tilde{\mathbf{u}}^j_{t+1} = \frac{1}{|\mathcal{N}_{t+1}(\mathbf{v}^j)|} \sum_i \mathbf{u}^i_{t+1} \quad (24)$$

we obtain the original formulation of GraphSage [8]. We call the model that uses Eq. (24) instead of Eq. (21) Radflow-GraphSage. In addition to adopting Eq. (24), a further simplification is to remove the linear projection in Eq. (22) when adding the ego's embedding to its neighbors'. We call this variant Radflow-MeanPooling.

Our multi-head attention neighborhood aggregation is similar to the Graph Attention Network (GAT) [35]. To compute the attention score in GAT, we first concatenate the ego node's embedding with the neighbor's, and then feed the result through a single feedforward network followed by a LeakyReLU. In contrast, we revert to the original multi-head attention [34], where attention scores are computed with a simple dot product. We also add zero attention, in which a node has the option not to attend to any neighbor. We show empirically in Section 7 that our simpler method outperforms GAT in almost all settings.
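The aggregation step can be sketched in a few lines. We use PyTorch's stock nn.MultiheadAttention, whose add_zero_attn flag plays the role of the zero attention described above; its internal per-head projections and dot-product scoring match Eqs. (18) to (21) up to implementation details. Class and variable names are ours.

```python
import torch
import torch.nn as nn


class FlowAggregation(nn.Module):
    """Single-step flow aggregation over a sampled neighborhood (Eqs. 17-23)."""
    def __init__(self, hidden_size: int, n_features: int, n_heads: int = 4):
        super().__init__()
        # add_zero_attn gives the node the option of attending to nothing
        self.attn = nn.MultiheadAttention(hidden_size, n_heads,
                                          add_zero_attn=True, batch_first=True)
        self.proj_ego = nn.Linear(hidden_size, hidden_size)   # W_E, Eq. (22)
        self.proj_nbrs = nn.Linear(hidden_size, hidden_size)  # W_N, Eq. (22)
        self.proj_out = nn.Linear(hidden_size, n_features)    # W_A, Eq. (23)

    def forward(self, u_ego, u_nbrs):
        # u_ego: [batch, hidden]; u_nbrs: [batch, n_neighbors, hidden]
        agg, scores = self.attn(u_ego.unsqueeze(1), u_nbrs, u_nbrs)
        agg = torch.nn.functional.gelu(agg.squeeze(1))        # Eq. (21)
        u_hat = self.proj_ego(u_ego) + self.proj_nbrs(agg)    # Eq. (22)
        return self.proj_out(u_hat), scores                   # Eq. (23)
```

Replacing the attention call with a plain average, u_nbrs.mean(dim=1), would give the Radflow-GraphSage variant of Eq. (24); additionally dropping proj_ego and proj_nbrs would give Radflow-MeanPooling.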
4.3 Comparisons with Existing Models

Comparison with N-BEATS. The process of feeding residuals of time series into deep network layers is inspired by N-BEATS. However, N-BEATS takes residuals from the raw scalar observations, whereas our approach calculates the residuals from the vector-valued projections of the time series, as shown in Eq. (7). Moreover, N-BEATS is not easily adapted to the dynamic graph setting since it does not produce embeddings that depend on time: N-BEATS treats the forecasting task as a multivariate regression problem, where every step can see every other step in the history. This allows us to obtain an embedding for the whole series but not for an individual step. In our proposed architecture, the node vector $\mathbf{u}^{j\ell}_t$ is used to construct the time-dependent embedding of each step, as shown in Eq. (16).

Comparison with transformers. In the last few years, transformers [34] have become the sequence model of choice in the NLP domain. Despite their success in NLP tasks, little progress has been made in time series forecasting. Most recently, Wu et al. [37] designed a transformer to forecast flu cases, but their model provides only a marginal improvement over the LSTM baseline. Our preliminary investigation indicated that LSTMs perform better than transformers in the time series setting. We hypothesize that the strict temporal ordering of the LSTM encodes time series more naturally, while text, which often has a latent tree structure, is more naturally encoded by the transformer with its attention mechanism and position encodings.

Comparison with ARNet. The most relevant non-neural aggregation approach is ARNet [38], a forecasting model for scalar-valued time series in which the prediction is computed as

$$\hat{v}^j_t = \sum_{k=1}^{p} \alpha^j_k v^j_{t-k} + \sum_{\mathbf{v}^i \in \mathcal{N}(\mathbf{v}^j)} \beta_{ij} v^i_t \quad (25)$$

where the first term is an autoregressive model of order $p$ (in days) and the second term models the network effect. The learnable parameters $\beta_{ij}$ can be interpreted as edge weights that control the proportion of views propagating from node $i$ to node $j$. Although ARNet is simple with a straightforward interpretation, the model assumes that the network is static. Furthermore, Wu et al. [38] only evaluated on the Imputation setting and not on the Forecast setting where the future observations are unknown. We show in Section 7 that the added complexity of Radflow allows it to both incorporate dynamic graphs and function in the Forecast setting.

Comparison with T-GCN. The closest model to ours is T-GCN [40], where a modified GRU cell performs a graph convolution before computing the update and reset gates. Unlike Radflow, which fetches subgraphs from disk and computes the network information only once after the final LSTM layer, T-GCN requires the entire network to be in memory and aggregates the network at every time step in every layer. T-GCN therefore does not scale to larger datasets in either space or time complexity. Our proposed architecture, on the other hand, can easily handle dynamic networks of hundreds of thousands of nodes.
Table 1: Key statistics on two snapshots (the first and last day) of VevoMusic and WikiTraffic.
                         VevoMusic               WikiTraffic
                         1 Sep 18    2 Nov 18    1 Jul 15     30 Jun 20
Number of nodes          60,663      60,664      329,255      366,145
Number of edges          1,189,460   1,192,478   12,869,374   17,417,749
Nodes with in-edges      56,852      56,672      303,440      356,120
Mean in-degree           21          21          42           49
Median in-degree         10          10          14           17
Nodes with out-edges     60,553      60,545      319,426      362,148
Mean out-degree          20          20          40           48
Median out-degree        19          19          26           31
Diameter                 32          27          15           22
Average path length      8.4         8.3         4.1          4.0
Clustering coefficient   0.17        0.15        0.014        0.015
5 DATASETS

The empirical validation of Radflow is carried out on two small static networks, Los-loop and SZ-taxi [40], and two large-scale dynamic networks, VevoMusic [38] and WikiTraffic. Of these, WikiTraffic is a new dataset that we have collected, and it is the largest dynamic network of time series to date. This section describes each dataset in detail.
Los-loop and SZ-taxi [40] contain time series of traffic speeds and road network information. Los-loop is a network of 207 sensors, measuring traffic speeds at 5-minute intervals from 1 March to 7 March 2012. There is an edge between two sensors if they are close to each other. SZ-taxi is a network of 156 roads in the Luohu District in Shenzhen, containing 15-minute interval traffic speeds from 1 January to 31 January 2015. If two roads are connected, an edge is formed between them. Both are static networks, with Los-loop containing 2,833 edges and SZ-taxi containing 532. We use these datasets to compare Radflow against T-GCN [40].
VevoMusic [38] is a YouTube video network containing 60,740 music videos from 4,435 unique artists. Each node in the network corresponds to a video and is associated with a time series of daily view counts collected over the course of 63 days from 1 September 2018 to 2 November 2018. A directed edge from video $u$ to video $v$ is present on day $t$ if $v$ appears on $u$'s list of recommendations on day $t$.

To ensure a fair comparison, we use the chronological train-test split of Wu et al. [38], in which we train on the first 49 days, validate on the next 7 days, and test on the final 7 days. We also follow the original setup, which computes evaluation metrics on the 13,710 nodes with at least one incoming edge. This makes the differences between network and non-network models more apparent.

We collected the new WikiTraffic network dataset, which contains 366K nodes and 22 million unique page pairs that have an edge on at least one day over a five-year period. On any given day, we have up to 17 million links, as shown in Table 1. WikiTraffic is similar to VevoMusic in that both exhibit a strong weekly seasonality (Fig. 3). Both have dynamic links, although WikiTraffic links are more stable overall (Fig. 4).

Figure 3: Ground-truth time series averaged across all samples in VevoMusic (top) and WikiTraffic (bottom, split into desktop and non-desktop traffic). Shaded areas correspond to the test period. Strong weekly seasonality can be observed in both datasets.

The data collection starts with the raw dump of the English Wikipedia (https://dumps.wikimedia.org) containing the full revision history of 17 million articles. From this we collect daily view counts from 1 July 2015 to 30 June 2020. We then remove articles with fewer than 100 average daily views in the final 140 days, which leaves us with 366,802 pages. The view counts are split into two categories: views from desktop users and from non-desktop users. We set the final 28 days as the test period, the 28 days before that as the validation period, and the remaining days for training. Forecasting 28 days in advance allows us to test the robustness of the model when predicting a substantial time into the future.

Furthermore, since WikiTraffic is an order of magnitude larger than other networked time series datasets, we can set aside nodes to be used only during testing. The train-test split is thus divided both by time and by node, providing a stronger test of a model's ability to generalize. To be useful for evaluating network-based forecasting models, the test set should form its own network, so we choose nodes that are connected. We start with four seed categories: Programming Languages, Star Wars, Global Warming, and Global Health, each with many subcategories. Starting with each seed category, we collect all pages in that category and all subcategories within four levels. This provides 2,434 pages for our test set. Finally, we consider two versions of the dataset: a univariate version where we predict the total view count of a page, and a bivariate version where we predict the desktop and non-desktop traffic separately.

Figure 4: The distribution of link durations. In VevoMusic most links are short-lived (50% appear on fewer than five days), likely due to YouTube diversifying its recommendations. In contrast, links in WikiTraffic are more stable, with half persisting throughout the entire five years.

Prior to our work, Google created a dataset containing two years of traffic from 145K randomly sampled Wikipedia pages for a Kaggle competition. However, this dataset contains no network information and includes low-traffic pages, which are noisy. Rozemberczki et al. [26] curated a small hyperlink Wikipedia network on specialized topics (chameleons, crocodiles, and squirrels) with only monthly view counts. Consonni et al. [5] introduced WikiLinkGraphs, containing all dynamic links from 2001 to 2018, but they did not collect traffic information. In contrast, our WikiTraffic is the largest dynamic network of time series, enabling detailed studies of information flow and user interests on a large scale.
6 EXPERIMENTAL SETUP

Evaluation is done by predicting view counts of the final 7 days on VevoMusic and the final 28 days on WikiTraffic. For Los-loop and SZ-taxi, we predict the speeds in the final hour (the final 12 steps in Los-loop and the final 4 steps in SZ-taxi). As stated in Section 3, we consider both the Forecast and Imputation settings for the large datasets. On VevoMusic, we evaluate on two different networks: the full dynamic network, which we call VevoMusic (dynamic), and the static network, which we call VevoMusic (static). To construct the static version, Wu et al. [38] used a majority smoothing method to remove edges that occur only briefly and made the remaining edges persistent in all time steps. Their best model, ARNet, was only evaluated on this static network in the Imputation setting. On WikiTraffic, we consider two networks: one is a network of univariate time series of view counts, while the other is a network of bivariate time series where desktop and non-desktop traffic are split.

Following prior forecasting work [16, 17], our main evaluation metric is the Symmetric Mean Absolute Percentage Error for forecast horizon $F$:

$$\text{SMAPE-}F = \frac{200}{\mathcal{T} F D} \sum_{j=1}^{\mathcal{T}} \sum_{t=1}^{F} \sum_{d=1}^{D} \frac{\big|v^j_{td} - \hat{v}^j_{td}\big|}{\big|v^j_{td}\big| + \big|\hat{v}^j_{td}\big|} \quad (26)$$

where $\mathcal{T}$ is the number of samples in the test set, $F$ is the forecast horizon, $D$ is the dimension of the time series, and $\hat{v}^j_{td}$ is the forecast value of the ground truth $v^j_{td}$. SMAPE is interpretable, with an upper bound of 200 and a lower bound of 0. It is scale-independent, ensuring that prediction errors are considered relative to the magnitude of the sequence. This is important because it prevents nodes with a large number of views from dominating the evaluation measure. A lower SMAPE corresponds to a better fit, and it is 0 if and only if the prediction matches the ground truth perfectly. For the two small networks of univariate time series, Los-loop and SZ-taxi, we additionally report the Root Mean Square Error,

$$\text{RMSE-}F = \sqrt{\frac{1}{\mathcal{T} F} \sum_{j=1}^{\mathcal{T}} \sum_{t=1}^{F} \big(v^j_t - \hat{v}^j_t\big)^2} \quad (27)$$

and the Mean Absolute Error,

$$\text{MAE-}F = \frac{1}{\mathcal{T} F} \sum_{j=1}^{\mathcal{T}} \sum_{t=1}^{F} \big|v^j_t - \hat{v}^j_t\big| \quad (28)$$

similar to what was done in the T-GCN paper [40].
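For concreteness, below is a minimal NumPy sketch of Eq. (26); RMSE and MAE follow the same pattern. Treating a zero denominator as zero error is our own convention for pages with no views, not something specified above.

```python
import numpy as np

def smape(v_true, v_pred):
    """SMAPE-F of Eq. (26). Inputs have shape [n_series, F, D];
    the result lies in [0, 200]."""
    num = np.abs(v_true - v_pred)
    denom = np.abs(v_true) + np.abs(v_pred)
    ratio = np.where(denom == 0, 0.0, num / np.where(denom == 0, 1.0, denom))
    return 200 * ratio.mean()
```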
6.1 Models Compared

We compare 8 time series baselines, 7 variants of networked time series models, and 7 further variants of Radflow in an ablation study. Results of the following 8 time series baselines are in Table 3 and Table 4.

(1) Copying Previous Step: We use the final observation before the test period as the prediction. This is the final day in VevoMusic and WikiTraffic, the final 5 minutes in Los-loop, and the final 15 minutes in SZ-taxi.

(2) Copying Previous Week: Since we observe weekly seasonality in both VevoMusic and WikiTraffic (see Fig. 3), a stronger baseline is to copy observations in the final week just before the test period and use them as the predictions.

(3) AR: The autoregressive (AR) model used by Wu et al. [38] for the static VevoMusic network.

(4) Seasonal ARIMA: We train an ARIMA$(p, d, q)(P, D, Q)_m$ model separately for each time series, where $p, d, q, P, D, Q$ are the AR, difference, MA, seasonal AR, seasonal difference, and seasonal MA terms, respectively. These are learned automatically using the pmdarima package (a usage sketch follows this list). The number of periods in a season, $m$, is set to 7 days for VevoMusic and WikiTraffic.

(5) Individual LSTMs: The LSTM baseline used by Wu et al. [38]. It is trained separately for each time series, with no weight sharing across network nodes.

(6) LSTM: The standard LSTM model with weight sharing. Unlike variant (5), this method uses only one set of LSTM weights for the entire dataset.

(7) N-BEATS: The neural regression with residual stacking from Oreshkin et al. [21]. The implementation consists of eight stacks, each containing one generic block. A generic block internally uses four fully-connected layers, followed by a fork into the forecast and backcast space. For the bivariate WikiTraffic, we train two separate N-BEATS models.

(8) Radflow-NoNetwork: Radflow with only the recurrent component, i.e. the first term $\hat{\mathbf{v}}^{jR}_t$ in Eq. (6). It does not take any contribution from the network.

Results for the forecasting models that use the network structure are presented in Tables 4 and 5.

(9) T-GCN: The model proposed by Zhao et al. [40] that uses a modified GRU cell to aggregate nodes. Since T-GCN does not scale to larger datasets (see Section 4.3), we only compare Radflow against T-GCN on Los-loop and SZ-taxi [40].
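As an illustration of baseline (4), a seasonal ARIMA model could be fit to one series with pmdarima's auto_arima; $m = 7$ matches the weekly seasonality above, while the remaining settings are our own choices rather than the paper's.

```python
import pmdarima as pm

def fit_seasonal_arima(series, horizon):
    """Fit an ARIMA(p,d,q)(P,D,Q)_7 model on a single series and
    forecast `horizon` steps ahead; orders are selected automatically."""
    model = pm.auto_arima(series, seasonal=True, m=7,
                          suppress_warnings=True, error_action='ignore')
    return model.predict(n_periods=horizon)
```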
Table 2: Hyperparameters of Radflow-NoNetwork (8) and Radflow (15) on WikiTraffic (univariate). We calibrate the hidden size to ensure that the number of parameters of all models is within 5% of each other. See the Appendix [31] for the hyperparameters of other model variants.

                             Radflow-NoNetwork   Radflow (1H)   Radflow (2H)
Number of layers (L)         8                   8              8
Dropout probability          0.1                 0.1            0.1
Number of attention heads    -                   4              4
Backcast (seed) length       112                 112            112
Forecast length              28                  28             28
(10) ARNet: The state-of-the-art model [38] for the VevoMusic (static) dataset in the Imputation setting (see Section 4.3).

(11) LSTM-MeanPooling: The same architecture as (6) but with mean-pooling node aggregation, using the hidden output of the final LSTM layer as a node's representation.

(12–15) Radflow: Variants of our proposed architecture with different network aggregation techniques: a simple mean (12), GraphSage (13), Graph Attention Network (14), and our full Radflow model with multi-head attention (15). Table 2 outlines the hyperparameters used in the full model.

Finally, we conduct an ablation study to test the key components of our architecture (Table 6). Starting with the best model (15), we substitute one component with an alternative:

(16–20) Radflow with other node embeddings: As shown in Section 4, the LSTM cell contains a hidden state $\mathbf{h}$ which is used to produce three vectors, $\mathbf{p}$, $\mathbf{q}$, and $\mathbf{u}$. Instead of having a separate output $\mathbf{u}$ to represent a node, we could alternatively reuse the cell's hidden state $\mathbf{h}$ (16), the backcast representation $\mathbf{p}$ (17), or the forecast representation $\mathbf{q}$ (18). We could also concatenate different representations, such as $[\mathbf{h}; \mathbf{p}]$ (19) or $[\mathbf{h}; \mathbf{p}; \mathbf{q}]$ (20).

(21) Radflow with no final projection: We ignore the linear projection in Eq. (22) and add the ego node's embedding to its neighbors' directly.

(22) Radflow with one attention head: This final variant tests the effect of having only one attention head instead of the default four in the full model.
6.2 Implementation Details

Web-scale time series observations often vary greatly in scale. An unpopular page might get zero views, while a popular page might receive millions of visits daily. To ensure similar scaling, both the inputs and outputs of our models are log-transformed time series. Outputs are exponentiated before computing SMAPE, RMSE, and MAE. Missing views are imputed by propagating the last valid observation forward. We do not apply any other preprocessing techniques to the time series, such as trend or seasonality removal.

Table 3: Performance of time series models with no network information. We report mean SMAPE-7 on VevoMusic (22 Oct 18 – 2 Nov 18) and mean SMAPE-28 on WikiTraffic (3 Jun 20 – 30 Jun 20). Rows are numbered according to Sec 6.1. See the Appendix [31] for statistical significance tests.

                              VevoMusic   WikiTraffic (univariate)   WikiTraffic (bivariate)
(1) Copying Previous Step     14.0        22.5                       26.8
(2) Copying Previous Week     10.3        21.0                       25.4
(3) AR [38]                   10.2        -                          -
(4) Seasonal ARIMA            9.67        19.6                       22.8
(5) Individual LSTMs [38]     9.99        -                          -
(6) LSTM                      8.68        16.6                       20.4
(7) N-BEATS                   8.64        16.6                       20.3
(8) Radflow-NoNetwork         8.42        16.1

Table 4: Forecast performance on the static traffic networks. On Los-loop, we report mean SMAPE-12, RMSE-12, and MAE-12. On SZ-taxi, we report mean SMAPE-4, RMSE-4, and MAE-4.

                              Los-loop                     SZ-taxi
                              SMAPE    RMSE    MAE         SMAPE    RMSE    MAE
(1) Copying Previous Step     3.92     3.40    2.39

Table 2 shows key hyperparameters of Radflow. We trained all models on the SMAPE objective using the Adam optimizer [13], with the weight decay factor decoupled from the learning rate [15]. We warm up the learning rate in the first 5,000 steps and then linearly decay it afterward for 10 epochs, each of which consists of 10,000 steps. We clip the gradient norm at 0.1. All our models are implemented in PyTorch [23]. For a fair comparison, we fix the number of layers of all variants to eight and ensure that the sizes of all variants are within 5% of each other. VevoMusic experiments were trained on a Titan V GPU and WikiTraffic experiments on a Titan RTX GPU; the Titan RTX has twice the memory of the Titan V and is needed to train the two-hop Radflow on WikiTraffic. All pure time series models converge very quickly, taking no more than three hours to train. Models with one-hop aggregation take up to 17 hours to train, while models with two-hop aggregation can take up to two days. We pick the model from the epoch with the lowest SMAPE score on the validation set as our best model.
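The schedule above can be sketched as follows. The peak learning rate is a placeholder, since the reported value did not survive extraction; everything else (5,000 warmup steps, linear decay over 10 epochs of 10,000 steps, decoupled weight decay, gradient clipping at 0.1) follows the description above.

```python
import torch

PEAK_LR = 1e-4                               # placeholder, not the paper's value
WARMUP_STEPS, TOTAL_STEPS = 5_000, 100_000   # 10 epochs x 10,000 steps

def lr_lambda(step: int) -> float:
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS           # linear warmup
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

model = torch.nn.Linear(8, 8)                # stand-in for Radflow
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```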
Unlike previous approaches such as T-GCN, our models do not require the whole graph to be in memory during training. Instead, we store the graph in the HDF5 format and load only one batch at a time directly from disk.
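A minimal h5py sketch of this batch-at-a-time loading is shown below; the file layout (a single views dataset indexed by node and day) is hypothetical, not the released format.

```python
import h5py
import numpy as np

def load_subgraph(path, ego_id, neighbor_ids, t0, t1):
    """Load the ego's series and its sampled neighbors' series for [t0, t1)."""
    with h5py.File(path, 'r') as f:
        views = f['views']        # shape [n_nodes, n_days, D]; hypothetical layout
        ego = views[ego_id, t0:t1]
        nbrs = np.stack([views[i, t0:t1] for i in neighbor_ids])
    return ego, nbrs
```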
To keep the computation tractable, we devise an importance-based neighborhood sampling technique. Instead of the common uniform sampling used, for example, by Hamilton et al. [8], we propose a two-stage approach to select neighbors. First, we assign each neighbor a score $\sum_d v^j_{td} \,/\, (\text{outdegree}(\mathbf{v}^j_t) + 1)$. This score is the total view count of the neighbor at time step $t$, normalized by the outdegree of the neighbor at that time. A self-loop is added to avoid division by zero. Intuitively, a neighbor with a larger number of views will have a greater influence, but the influence will be more diffuse if that neighbor has many outlinks. Using these scores, we remove neighbors in the bottom 10th percentile of each ego node's neighborhood, which reduces the noise induced by aggregation.

In the second stage, we sample four neighbors during training with probability proportional to the number of time steps in which the neighbor appears in the backcast period. During evaluation, we find using all nodes to be computationally infeasible due to the large data size. Thus for each ego node, we choose only the 16 most frequently appearing neighbors in the one-hop setting and the top eight in the two-hop setting.
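A sketch of this two-stage selection, under our own data layout assumptions, is given below.

```python
import numpy as np

def select_neighbors(total_views, outdegree, presence, n_sample=4, rng=None):
    """total_views[i]: neighbor i's total views at step t;
    outdegree[i]: its outdegree at step t;
    presence[i]: number of backcast steps in which the edge to i exists."""
    rng = rng or np.random.default_rng()
    # Stage 1: drop the bottom 10th percentile by normalized view score.
    score = total_views / (outdegree + 1)      # +1 from the self-loop
    keep = np.where(score >= np.percentile(score, 10))[0]
    # Stage 2: sample in proportion to how often the edge appears.
    prob = presence[keep] / presence[keep].sum()
    n = min(n_sample, keep.size)
    return rng.choice(keep, size=n, replace=False, p=prob)
```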
Table 5: Performance of models with network information. We report mean SMAPE-7 on VevoMusic (22 Oct 18 – 2 Nov 18) and mean SMAPE-28 on WikiTraffic (3 Jun 20 – 30 Jun 20). Rows are numbered according to Sec 6.1. Bold numbers indicate the best model(s) within a column. Refer to the Appendix [31] for the p-values from the dependent t-test for paired samples between models with similar performance.

                          VevoMusic (static)        VevoMusic (dynamic)       WikiTraffic (univariate)   WikiTraffic (bivariate)
                          Forecast    Imputation    Forecast    Imputation    Forecast    Imputation     Forecast    Imputation
                          1H    2H    1H    2H      1H    2H    1H    2H      1H    2H    1H    2H       1H    2H    1H    2H
(10) ARNet [38]           -     -     9.02  -       -     -     -     -       -     -     -     -        -     -     -     -
(11) LSTM-MeanPooling     8.60  8.67  8.14  8.13    8.80  9.03  7.91  7.90    16.8  16.7  15.5  15.2     20.2  19.9  19.2  18.9
(12) Radflow-MeanPooling  8.34  8.44  7.82  7.81    8.42
(13) Radflow-GraphSage
(14) Radflow-GAT          8.52  8.50  7.88  7.74    8.43  8.39  7.44  7.28    16.2
(15) Radflow

7 RESULTS AND ANALYSIS

We first discuss the prediction performance of different model variants (Sections 7.1 and 7.2, Tables 3 to 6). When applicable, we report the p-value (denoted as $p$) from the dependent t-test for paired samples. All differences discussed in this section are statistically significant. For more detailed significance tests, see the Appendix [31]. We then present a visual interpretation of different layers in the recurrent component (Section 7.3), followed by insights provided by the network aggregation component (Section 7.4). Finally, we present two preliminary studies on the potential applications of models like Radflow: the robustness of predictions when the network is not fully observed (Section 7.5.1), and the relationship between traffic surges on nodes and their attention scores (Section 7.5.2).

7.1 Forecasting Without Networks
Table 3 summarizes the comparison between Radflow-NoNetwork and the corresponding time series forecasting baselines (1–7). The LSTM variant (6) outperforms both AR (3) and Seasonal ARIMA (4), showing the robust performance of flexible neural models. Furthermore, it also outperforms models trained on individual time series (3, 4, 5), highlighting the advantage of using large amounts of training data. N-BEATS (7) outperforms LSTM (6) by a small margin of 0.04 SMAPE on VevoMusic, while Radflow-NoNetwork outperforms all other baselines, showing promise in combining the recurrent structure with the residual stacking idea in our architecture. Additionally, Radflow-NoNetwork outperforms ARNet (10), the state of the art for VevoMusic that uses network information (Table 5), indicating that having the right model outweighs having more information for this task.

Table 6: Ablation study on the key components of Radflow on one-hop VevoMusic networks. See Section 7.2.

                                        VevoMusic (static)   VevoMusic (dynamic)
(15) Radflow                            7.67                 7.32
(16) Radflow (h as embeddings)          7.78                 7.51
(17) Radflow (p as embeddings)          7.77                 7.49
(18) Radflow (q as embeddings)          7.84                 7.35
(19) Radflow ([h; p] as embeddings)     7.75                 7.39
(20) Radflow ([h; p; q] as embeddings)  7.76                 7.38
(21) Radflow (no final projection)      7.80                 7.33
(22) Radflow (one attention head)       7.77                 7.43

7.2 Forecasting With Networks
Tables 4 and 5 summarize the performance of models (9–15) on the four networked time series datasets. Our full model (15) outperforms T-GCN on Los-loop by a non-trivial margin on all metrics. SZ-taxi is noisier, and no model is able to beat the SMAPE of copying the previous step. This is because SZ-taxi contains many consecutive zero measurements, which the copying baseline is able to take advantage of. On non-zero test measurements, Radflow outperforms all other variants. See the Appendix [31] for these results.

Across all eight Forecast settings on VevoMusic and WikiTraffic, the top-performing models are all Radflow variants. Compared to Radflow-NoNetwork (8), incorporating one-hop neighbors improves the SMAPE score on VevoMusic from 8.42 to 8.33. On WikiTraffic (univariate), using one-hop GraphSage (13) improves SMAPE from 16.1 to 15.9. This confirms the recently reported network effects in YouTube and Wikipedia traffic [38, 41]. Note that WikiTraffic is much larger and more diverse, making consistent improvements harder, hence the smaller effect size compared to VevoMusic.

Finally, compared to the smoothed static network, using the dynamic network in VevoMusic only improves the performance for some model variants. This is because the static graph was constructed by smoothing the edges (i.e., uncommon edges were removed), which reduces noise in the absence of ground-truth views from the neighbors. More generally, although Radflow is designed to handle dynamic graphs, it is impossible to know a priori whether dynamic edges will improve the prediction performance in a given data domain.

Imputation. Having ground-truth observations of neighboring nodes during the forecast period leads to substantially better SMAPE scores for all models. Models that perform well in the Forecast setting also perform well in the Imputation setting, with Radflow (15) achieving the best SMAPE in six of the eight Imputation settings. The performance gain going from one hop to two hops is more evident in the Imputation setting. For example, there is a boost from 7.67 to 7.63 on the static VevoMusic, and from 7.32 to 7.27 on the dynamic VevoMusic. Compared to the previous state-of-the-art ARNet [38], our best model achieves a SMAPE score that is 19% better (from 9.02 to 7.27). On WikiTraffic, using two-hop neighbors uniformly lowers the performance, while the training time more than doubles. Compared to VevoMusic, each Wikipedia page has many more links, most of which would be ignored by the reader. The network effect in WikiTraffic is thus weaker than in VevoMusic (see Fig. 6), leading to more noise being introduced in the two-hop setting.

Among the variants with attention, the weighted multi-head attention of Radflow yields the best performance in 10 out of 16 tasks across Forecast and Imputation, while GraphSage, which puts an equal weight on all neighbors, is the second best (best in 6 out of 16 tasks).
This indicates that simpler attention mechanisms (inner-product attention in Radflow and node averaging in Radflow-GraphSage) are preferred over more complex ones (Radflow-GAT), but the aggregation should not be too simple (Radflow-MeanPooling).

Table 6 presents an ablation study that tests other key components in Radflow. Overall, we find that having more than one attention head helps (15 vs 22), as does a linear projection before node aggregation (15 vs 21). It is also preferable to have a separate projection on the output of the LSTM cell to obtain the node embeddings $\mathbf{u}$, rather than reusing the hidden state $\mathbf{h}$, the backcast representation $\mathbf{p}$, or the forecast representation $\mathbf{q}$ (15 vs 16–20). All differences are statistically significant.

7.3 Interpreting the Recurrent Layers

The recurrent component of Radflow is decomposable into $L = 8$ layers, via Eqs. (14) and (15). Fig. 5 shows the layer-wise contribution to the forecast from Radflow-NoNetwork. The results are averaged over all 2,434 test pages in WikiTraffic over the 28-day test period, and re-scaled to the same range for readability. We observe that component 8 encodes strong weekly seasonality that is consistent across all test pages (with a small confidence interval). Components 1–5 encode varying levels of a decreasing trend, whereas component 7 encodes an increasing trend. Overall, weekly seasonality is visible in all components except the first, confirming the common intuition that representations learned via neural networks are often over-complete.

Figure 5: Average prediction over all test pages from each layer of Radflow-NoNetwork on the 28-day test period of WikiTraffic. The light shade corresponds to the 95% confidence interval. Different layers capture different seasonal and trend patterns. See Section 7.3.

7.4 Interpreting the Network Component

7.4.1 Network contribution. Radflow can identify settings where the network information becomes important for the final predictions. Fig. 6 shows the contribution to the final forecast made by the network component (the second term of Eq. (6)). In both VevoMusic and WikiTraffic, there is an approximately inverse linear relationship between the network contribution and the log of average daily views, indicating that forecasts of less popular nodes rely more heavily on the network. For nodes with a similar number of daily views, VevoMusic exhibits a stronger network effect than WikiTraffic, suggesting that video recommendation links and Wikipedia hyperlinks are different, warranting further investigation (e.g., with user-level data).

Figure 6: The contribution from the network on the final forecasts. The network component of less popular nodes is larger. For nodes with similar daily view counts, VevoMusic (left) exhibits a stronger network effect than WikiTraffic (right). See Section 7.4.

7.4.2 Attention flow. The attention scores in Radflow capture some information about the flow of traffic. In particular, we observe that the model pays more attention to a neighboring node on days that have a spike in traffic (see the Appendix [31] for a specific example). This motivates us to visualize the attention flow among WikiTraffic nodes in an interactive web app [28]. In this visualization, nodes are represented in a graph (bottom panel) and as time series (top panel). Edge weights are attention scores from Radflow multiplied by the traffic on the source node.

Fig. 7 is a screenshot of the subgraph centered around the Wikipedia page of Kylo Ren, a character in the Star Wars series played by the actor Adam Driver. From the figure, we observe that the time series of the two nodes (Kylo Ren in blue, Adam Driver in pink) have synchronized spikes at the same time as the release of major Star Wars movies. Furthermore, there appears to be more traffic flowing from Kylo Ren to Adam Driver than the other way, as indicated by the thickness of the edges.

Figure 7: An interactive web app [28] for attention flow, available at https://attentionflow.ml. Shown here is the subgraph for the Wikipedia page Kylo Ren, a character in Star Wars played by Adam Driver. The thickness of an edge is determined by the product of the attention score and the traffic volume of the source node. See Section 7.4.2.
7.5 Preliminary Application Studies

7.5.1 Robustness to missing data. We consider the robustness of Radflow to missing data. This setting is relevant when collecting and observing data from all nodes is very costly (e.g., sites spread out over large geographical areas) or when nodes are simply unavailable (e.g., sites whose data are proprietary). To this end, Fig. 8 shows our evaluation of Radflow on the imputation task for VevoMusic with a percentage of time series values (left pane) or edges (right pane) deleted at random. As more nodes become missing, the performance of Radflow decays at a much slower rate than Radflow-NoNetwork, indicating that Radflow is effective at imputing and mitigating missing values using other neighbors. Similarly, Radflow is relatively robust to missing edges. With 40% of edges missing, the performance of the two-hop model drops by only 1% in SMAPE. Even with 80% of edges missing, Radflow is still substantially better than Radflow-NoNetwork.

Figure 8: The effect of missing view counts (left) and missing edges (right) on VevoMusic using Radflow. As we delete more data, Radflow's performance degrades at a much slower rate than Radflow-NoNetwork (in blue). See Section 7.5.1.

Figure 9: The effect of doubling a neighbor's view count during the forecast period. Left: scatter plot of attention scores before (x-axis) and after (y-axis) doubling. Right: scatter plot of the relative increase of the ego node's views (x-axis) and attention scores after doubling (y-axis). Attention scores decrease as a neighbor becomes more popular (left). A higher attention corresponds to a larger flow of traffic from the neighbor to the ego node (right). See Section 7.5.2.
7.5.2 Effect of traffic surges. In addition to the example of an actual traffic spike in Fig. 1, we further perform an evaluation to visualize the effects of hypothetical sudden changes in node traffic. This scenario is useful for planning resources such as edge network caching for mobile videos, or for estimating the economic demand associated with nodes, such as advertising. For each test node in the imputation task on WikiTraffic, we pick one neighbor and double its views on one forecasting day. Fig. 9 (left) shows that as a page becomes more popular, the attention on it uniformly decreases, indicating that attention tends to dampen node traffic spikes rather than amplify them. More evidently, we can see from Fig. 9 (right) that a higher attention score is positively correlated with a larger effect on the ego node's traffic. This indicates that despite the non-linear relationship between attention scores and the network component in Section 4, one could qualitatively infer the traffic flow from one page to another using attention scores.
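The intervention itself can be reproduced in a few lines; the sketch below assumes a model API like the hypothetical FlowAggregation above that returns both forecasts and attention scores, which may not match the released code.

```python
import copy

def doubling_effect(model, ego_series, nbr_series, nbr_idx, day):
    """Double one neighbor's views on one forecast day and compare attention."""
    _, scores_before = model(ego_series, nbr_series)
    perturbed = copy.deepcopy(nbr_series)
    perturbed[nbr_idx, day] *= 2          # the hypothetical intervention
    _, scores_after = model(ego_series, perturbed)
    return scores_before[..., nbr_idx], scores_after[..., nbr_idx]
```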
CONCLUSION

We propose Radflow, an end-to-end model for forecasting a network of time series. It is expressive, with a stack of recurrent units representing different components of the time series; scalable to hundreds of thousands of nodes in a network, using multi-head attention and importance sampling on neighbors; able to represent underlying networks that change over time; and suitable for multivariate networked series with missing nodes and edges. We achieve state-of-the-art results on recent web-scale networked time series. We also show that the stack of recurrent units successfully decomposes time series into different seasonal and trend effects, and that the network attention aggregates and explains influence between nodes. Future work includes extending Radflow to other networked data types, such as events over continuous time. One can explore a wide range of applications, such as imputing geographic data or allocating network resources. It would also be interesting to investigate causal reasoning or counterfactual modeling with Radflow-like structures.
ACKNOWLEDGMENTS
This research is supported in part by the Australian Research Council Project DP180101985 and AOARD project 20IOA064. We thank NVIDIA for providing us with Titan V GPUs for experimentation.
REFERENCES
[1] George Athanasopoulos, Rob Hyndman, Haiyan Song, and Doris C. Wu. 2011. The tourism forecasting competition. International Journal of Forecasting 27, 3 (2011), 822–844.
[2] G. E. P. Box and G. M. Jenkins. 1970. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
[3] Robert Goodell Brown. 1959. Statistical Forecasting for Inventory Control. McGraw-Hill.
[4] Deng Cai and Wai Lam. 2020. Graph Transformer for Graph-to-Sequence Learning. In AAAI. 7464–7471.
[5] Cristian Consonni, David Laniado, and Alberto Montresor. 2019. WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13. 598–607.
[6] Vasant Dhar, Tomer Geva, Gal Oestreicher-Singer, and Arun Sundararajan. 2014. Prediction in economic networks. Information Systems Research 25, 2 (2014), 264–284.
[7] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
[8] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
[9] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
[10] C. C. Holt. 1957. Forecasting seasonals and trends by exponentially weighted averages (O.N.R. Memorandum No. 52). Carnegie Institute of Technology (1957).
[11] R. J. Hyndman and G. Athanasopoulos. 2018. Forecasting: Principles and Practice (2nd ed.). OTexts, Melbourne, Australia. OTexts.com/fpp2
[12] Mirko Kämpf, Eric Tessenow, Dror Y. Kenett, and Jan W. Kantelhardt. 2015. The detection of emerging trends using Wikipedia traffic data and context networks. PLoS ONE 10, 12 (2015), e0141892.
[13] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
[14] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
[15] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
[16] Spyros Makridakis and Michele Hibon. 2000. The M3-Competition: results, conclusions and implications. International Journal of Forecasting 16, 4 (2000), 451–476.
[17] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2018. The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting 34, 4 (2018), 802–808.
[18] Franco Manessi, Alessandro Rozza, and Mario Manzo. 2020. Dynamic Graph Convolutional Networks. Pattern Recognition 97 (2020).
[19] Swapnil Mishra, Marian-Andrei Rizoiu, and Lexing Xie. 2016. Feature driven and point process approaches for popularity prediction. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. 1069–1078.
[20] Gal Oestreicher-Singer, Barak Libai, Liron Sivan, Eyal Carmi, and Ohad Yassin. 2013. The network value of products. Journal of Marketing 77, 3 (2013), 1–14.
[21] Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2020. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations.
[22] Aldo Pareja, Giacomo Domeniconi, Jian Jhen Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, and Charles E. Leiserson. 2019. EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs. arXiv:1902.10191 (2019).
[23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop.
[24] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). Association for Computing Machinery, New York, NY, USA, 701–710.
[25] N. Petluri and E. Al-Masri. 2018. Web Traffic Prediction of Wikipedia Pages. 5427–5429.
[26] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. 2019. Multi-scale Attributed Node Embedding. arXiv:cs.LG/1909.13021.
[27] Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang. 2020. DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks. In Proceedings of the 13th International Conference on Web Search and Data Mining. 519–527.
[28] Minjeong Shin, Alasdair Tran, Siqi Wu, Alexander Mathews, Rong Wang, Georgiana Lyall, and Lexing Xie. 2021. AttentionFlow: Visualising Influence in Networks of Time Series. In The 14th International Conference on Web Search and Data Mining, Demo (WSDM '21).
[29] Jessica Su, Aneesh Sharma, and Sharad Goel. 2016. The effect of recommendations on network structure. In Proceedings of the 25th International Conference on World Wide Web. 1157–1167.
[30] Gabor Szabo and Bernardo A. Huberman. 2010. Predicting the popularity of online content. Commun. ACM 53, 8 (2010), 80–88.
[31] Alasdair Tran, Alexander Mathews, Cheng Soon Ong, and Lexing Xie. 2021. Radflow: A Recurrent, Aggregated, and Decomposable Model for Networks of Time Series — Supplementary Materials. https://github.com/alasdairtran/radflow
[32] Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. 2017. Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. 3462–3471.
[33] Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. 2019. DyRep: Learning representations over dynamic graphs. In International Conference on Learning Representations.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. arXiv:1706.03762 (2017).
[35] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations.
[36] Peter R. Winters. 1960. Forecasting sales by exponentially weighted moving averages. Management Science 6, 3 (1960), 324–342.
[37] Neo Z. Wu, Bradley A. Green, Xue Ben, and Shawn O'Banion. 2020. Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. arXiv:2001.08317 (2020).
[38] Siqi Wu, Marian-Andrei Rizoiu, and Lexing Xie. 2019. Estimating Attention Flow in Online Video Networks. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 183 (Nov. 2019), 25 pages. https://doi.org/10.1145/3359285
[39] Kun Xu, Lingfei Wu, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin. 2018. Graph2Seq: Graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823 (2018).
[40] Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2019. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems (2019), 1–11.
[41] Kai Zhu, Dylan Walker, and Lev Muchnik. 2020. Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia. Information Systems Research (2020).
[42] Lingxue Zhu and Nikolay Laptev. 2017. Deep and Confident Prediction for Time Series at Uber. (2017), 103–110.
A ON LOS-LOOP AND SZ-TAXI
We first discuss some peculiarities found in SZ-taxi. Fig. 10 shows that SZ-taxi is noisier than Los-loop, which explains SZ-taxi's worse SMAPE scores across all model variants in Table 4. We also see a drop in the median speed during the training period of SZ-taxi (Fig. 10, bottom). This leads to a small positive bias in the final predictions of Radflow during the test period (Fig. 11); no such bias is observed in the copying baseline.
Upon closer inspection of the ground-truth measurements in SZ-taxi's test period, we find that 28% of the data points are exactly zero, and many of these zero measurements occur consecutively. Recall that the SMAPE metric is sensitive to zero values: if the ground-truth value is zero and the predicted value is non-zero, we obtain the maximum SMAPE of 200, whereas RMSE and MAE do not suffer from this problem at zero. This explains why, in Table 4, the Radflow variant gets a relatively poor SMAPE score but good RMSE and MAE. To confirm this effect, Table 7 presents metrics both on all test measurements and on only the non-zero test measurements. When we exclude the zero measurements, SMAPE correlates better with RMSE and MAE, and Radflow now outperforms the copying baseline.
Finally, from Table 8, we see that on both datasets our Radflow-NoNetwork has statistically similar performance to T-GCN. Our full Radflow outperforms T-GCN on Los-loop and has comparable performance to T-GCN on SZ-taxi.
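To make the zero-value pathology concrete, here is a minimal sketch of SMAPE on the 0 to 200 scale; the treatment of 0/0 terms (counted as zero) is our assumption:

import numpy as np

def smape(y_true, y_pred, eps=1e-12):
    """SMAPE on the 0-200 scale. A term hits its maximum of 200 whenever
    the ground truth is zero but the prediction is not."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = np.abs(y_true) + np.abs(y_pred)
    terms = 200.0 * np.abs(y_true - y_pred) / np.maximum(denom, eps)
    return terms.mean()

y, y_hat = np.array([0.0, 0.0, 5.0]), np.array([1.0, 2.0, 5.0])
print(smape(y, y_hat))              # zeros dominate: (200 + 200 + 0) / 3 = 133.3
mask = y != 0                       # evaluate only the non-zero measurements
print(smape(y[mask], y_hat[mask]))  # 0.0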
Figure 10: The median of the ground-truth time series across all samples in Los-loop (top) and SZ-taxi (bottom).
Figure 11: Prediction errors on SZ-taxi, for the Copying Previous Step baseline and for Radflow.

Table 7: Forecast performance on SZ-taxi. We report mean SMAPE-4, RMSE-4, and MAE-4, both on all test measurements and on only the non-zero measurements.

                              All Measurements          Non-zero Measurements
Model                         SMAPE   RMSE   MAE        SMAPE   RMSE   MAE
(1) Copying Previous Step
Table 8: Selected p-values from the dependent t-test for paired samples on Los-loop and SZ-taxi. The models are numbered according to Sec 6.1. We only show model pairs where the p-value is at least 0.001. For SZ-taxi, we distinguish between the metrics computed on all test measurements (SZ-taxi-a) and on only non-zero measurements (SZ-taxi-n).

Dataset     Group 1: Model              SMAPE   Group 2: Model   SMAPE   p-value
Los-loop    (1) Copying Previous Step   3.92    (9) T-GCN        3.97    0.674
Los-loop    (8) Radflow-NoNetwork       3.60    (9) T-GCN        3.97    0.002
SZ-taxi-a   (8) Radflow-NoNetwork       80.2    (9) T-GCN        80.5    0.949
SZ-taxi-a   (9) T-GCN                   80.5    (15) Radflow     77.5    0.509
SZ-taxi-n   (8) Radflow-NoNetwork       32.7    (9) T-GCN        33.1    0.874
SZ-taxi-n   (9) T-GCN                   33.1    (15) Radflow     29.0    0.090
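The p-values in Tables 8 and 11 come from the dependent t-test for paired samples. A minimal sketch using scipy, with synthetic per-node scores standing in for the models' actual SMAPE values:

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# Synthetic per-node SMAPE scores for two models evaluated on the same
# test nodes; real usage would load the models' actual per-node scores.
smape_model_1 = rng.normal(loc=3.60, scale=1.0, size=1000)
smape_model_2 = rng.normal(loc=3.97, scale=1.0, size=1000)

t_stat, p_value = ttest_rel(smape_model_1, smape_model_2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")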
B FURTHER DISCUSSION ON RADFLOW
We start by discussing possible interpretations of the attention scores in Radflow. We then present results on how the performance is affected by the topic category and the popularity of a page in WikiTraffic. Finally, we provide further hyperparameters and statistical significance tests for experiments on both VevoMusic and WikiTraffic.
B.1 Attention scores
Analyzing attention scores can provide insights into both the data and the model. To this end, we show how two types of information are captured by the scores: the flow of traffic from neighboring pages and the time series correlation.
B.1.1 Spikes from neighbors.
Fig. 12 shows view counts of Andy Gavin's (a video game programmer) Wikipedia page and three linked articles, with shading indicating attention scores, i.e. 𝜆 in Eq. (21). The scores are extracted from Radflow in the imputation setting. During the forecast period, details of a new video game, Crash Bandicoot 4, were released, leading to a spike in traffic on Andy Gavin's page (the designer of the original Crash Bandicoot). Linked articles, e.g., Naughty Dog (the company Gavin co-founded), exhibit similar spikes, to which substantial attention is applied. This example supports the intuition that network attention is important when an exogenous event causes traffic on neighboring nodes to change rapidly.
B.1.2 Time series correlation.
To further investigate what kind of information the attention scores capture, Fig. 13 presents a density plot of correlation coefficients between the time series of an ego node and that of a neighbor, against the average attention score on the neighbor during the forecast period. On both VevoMusic and WikiTraffic, we observe that if two time series have a very low to negative correlation, the model almost never outputs a high attention score. On WikiTraffic, we also see that positively correlated time series rarely result in a low attention score.

Figure 12: An example where Radflow attends to three neighbors (Naughty Dog, Evan Wells, and Crash Bandicoot (video game)) as it forecasts the traffic of Andy Gavin (video game programmer and entrepreneur). A darker blue corresponds to a higher attention on the corresponding day. The network effect is important when there is a surge in traffic.

Figure 13: Density map of network attention scores (x-axis) and the correlation coefficient of a node with its neighbor (y-axis), for VevoMusic and WikiTraffic.
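A single point of this density map can be computed as follows. This is a minimal sketch with hypothetical inputs; the real analysis uses the attention scores extracted from the trained model:

import numpy as np

def fig13_point(ego_series, neighbor_series, neighbor_attn):
    """One point of the Fig. 13 density map: mean attention on the
    neighbor over the forecast period (x) against the Pearson
    correlation between the two series (y)."""
    corr = np.corrcoef(ego_series, neighbor_series)[0, 1]
    return float(np.mean(neighbor_attn)), float(corr)

# Hypothetical example: two strongly correlated series.
ego = np.array([10., 12., 50., 13., 11.])
nbr = np.array([20., 22., 90., 25., 21.])
attn = np.array([0.4, 0.4, 0.8, 0.5, 0.4])
print(fig13_point(ego, nbr, attn))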
B.2 Effect of Page Category and Popularity
On WikiTraffic, we break our model's performance down by both page category and page popularity, to see whether either affects performance. From Fig. 14, we see that for both Radflow-NoNetwork and Radflow there is no significant difference in performance among the four test categories: Global Health, Global Warming, Programming, and Star Wars. Furthermore, the improvement in SMAPE from adding network information is consistent across these categories.
In contrast, the popularity of a page has a significant impact on how accurate the predictions are. From Fig. 15, we observe that there exists an optimal range of popularity where traffic is most predictable. Our model is best at forecasting pages that have between 200 and 1,000 daily visits. Pages with fewer than 50 daily visits are fairly difficult to forecast; this matches the observation we made on SZ-taxi, where it is also difficult to forecast low traffic speeds. More interestingly, our model's performance also drops slightly on very popular pages (those with more than 1,000 daily views). This could be due to the traffic of these popular pages being driven mostly by events external to the network.

Figure 14: Performance broken down by categories in the test set of WikiTraffic (univariate). The dashed lines indicate mean SMAPE-28. Blue boxplots correspond to Radflow-NoNetwork; green boxplots correspond to Radflow with one-hop aggregation. All categories benefit from the network information.

Figure 15: Radflow-NoNetwork's performance on WikiTraffic (univariate), broken down by the popularity of a page (average daily view counts). The dashed lines indicate mean SMAPE-28. There exists an optimal range of popularity where traffic is the most predictable, resulting in a low SMAPE.
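A sketch of this popularity breakdown, assuming per-page average views and SMAPE scores are available; the bucket edges are ours, chosen to mirror the ranges discussed above:

import numpy as np

def smape_by_popularity(avg_views, smape, edges=(0, 50, 200, 1000, np.inf)):
    """Group per-page SMAPE by average daily views using the given
    (hypothetical) bucket edges."""
    bucket = np.digitize(avg_views, edges[1:-1])  # bucket index 0..len(edges)-2
    return {f"[{edges[i]}, {edges[i+1]})": float(smape[bucket == i].mean())
            for i in range(len(edges) - 1) if np.any(bucket == i)}

# Hypothetical per-page statistics.
views = np.array([10, 600, 5000, 120, 800])
scores = np.array([40.0, 12.0, 18.0, 25.0, 13.0])
print(smape_by_popularity(views, scores))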
B.3 Hyperparameters and Significance Tests
This final section provides further details on the hyperparameters and statistical significance tests. In the main paper, Table 2 shows the hyperparameters of our two key models, Radflow-NoNetwork and Radflow. Tables 9 and 10 provide the hyperparameters of the remaining model variants. Finally, Table 11 contains the p-values from the dependent t-test for paired samples on VevoMusic and WikiTraffic. We note that if two models differ by more than one digit in the third significant figure, it is sufficient to conclude that the difference is statistically significant in these datasets (i.e. p-value < 0.05).
Table 9: Hyperparameters of pure time series models. Rows are numbered according to Sec 6.1. All model variants use 8 layers with a dropout of 0.1. For each dataset, we calibrate the hidden size so that the numbers of parameters of all models are within 5% of each other. Note that the N-BEATS model for WikiTraffic (bivariate) consists of two separate models, one predicting the desktop traffic and the other predicting the combined mobile/app traffic; each of these two models contains 829,760 parameters.

Dataset                    Model                    Hidden Size   Parameters
VevoMusic                  (6) LSTM                 164           1,625,078
                           (7) N-BEATS              192           1,630,480
                           (8) Radflow-NoNetwork    128           1,589,762
WikiTraffic (univariate)   (6) LSTM                 164           1,625,078
                           (7) N-BEATS              176           1,610,176
                           (8) Radflow-NoNetwork    128           1,589,762
WikiTraffic (bivariate)    (6) LSTM                 164           1,625,900
                           (7) N-BEATS              120           1,659,520
                           (8) Radflow-NoNetwork    128           1,590,020
Los-loop                   (8) Radflow-NoNetwork    73            260,758
                           (9) T-GCN [40]           292           261,060
SZ-taxi                    (8) Radflow-NoNetwork    73            260,758
                           (9) T-GCN [40]           294           262,252
Table 10: Hyperparameters of time series models with network information. Rows are numbered according to Sec 6.1. All model variants use 8 layers with a dropout of 0.1. For each dataset, we calibrate the hidden size so that the numbers of parameters of all models are within 5% of each other. Some model variants were not evaluated under the two-hop setting and are indicated with a hyphen.

                                                                   One Hop                    Two Hops
Dataset                    Model                                   Hidden Size  Parameters    Hidden Size  Parameters
VevoMusic /                (11) LSTM-MeanPooling                   160          1,598,562     160          1,650,082
WikiTraffic (univariate)   (12) Radflow-MeanPooling                120          1,632,604     120          1,632,604
                           (13) Radflow-GraphSage                  118          1,606,928     118          1,634,894
                           (14) Radflow-GAT                        120          1,647,364     120          1,662,124
                           (15) Radflow                            116          1,608,112     112          1,576,180
                           (16) Radflow (𝒉 as embeddings)          124          1,586,088     -            -
                           (17) Radflow (𝒑 as embeddings)          124          1,586,088     -            -
                           (18) Radflow (𝒒 as embeddings)          124          1,586,088     -            -
                           (19) Radflow ([𝒉;𝒑] as embeddings)      116          1,632,588     -            -
                           (20) Radflow ([𝒉;𝒑;𝒒] as embeddings)    104          1,639,564     -            -
                           (21) Radflow (no final projection)      116          1,608,112     -            -
                           (22) Radflow (one head)                 116          1,608,112     -            -
WikiTraffic (bivariate)    (11) LSTM-MeanPooling                   160          1,599,364     160          1,650,884
                           (12) Radflow-MeanPooling                120          1,632,968     120          1,632,968
                           (13) Radflow-GraphSage                  118          1,607,286     118          1,635,252
                           (14) Radflow-GAT                        120          1,647,728     120          1,662,488
                           (15) Radflow                            116          1,608,464     112          1,576,520
Los-loop                   (15) Radflow                            64           260,036       -            -
SZ-taxi                    (15) Radflow                            64           260,036       -            -
Table 11: Selected p-values from the dependent t-test for paired samples on VevoMusic and WikiTraffic. Models are numbered according to Sec 6.1. We only show pairs where the p-value is at least 0.001. We observe that if two models differ by more than one digit in the third significant figure, their difference is statistically significant in these datasets (p < 0.05).

Group 1: Model                 Hops   SMAPE   Group 2: Model             Hops   SMAPE   p-value

VevoMusic (static), Forecast
(6) LSTM                       0H     8.68    (7) N-BEATS                0H     8.64    0.006
(12) Radflow-MeanPooling       1H     8.34    (15) Radflow               1H     8.33    0.171
(13) Radflow-GraphSage         1H     8.39    (13) Radflow-GraphSage     2H     8.37    0.036
(13) Radflow-GraphSage         1H     8.39    (15) Radflow               2H     8.39    0.938
(14) Radflow-GAT               1H     8.52    (14) Radflow-GAT           2H     8.50    0.018
(13) Radflow-GraphSage         2H     8.37    (15) Radflow               2H     8.39    0.014

VevoMusic (static), Imputation
(11) LSTM-MeanPooling          1H     8.14    (11) LSTM-MeanPooling      2H     8.13    0.117
(12) Radflow-MeanPooling       1H     7.82    (12) Radflow-MeanPooling   2H     7.81    0.376
(15) Radflow                   1H     7.67    (13) Radflow-GraphSage     2H     7.64    0.001
(13) Radflow-GraphSage         2H     7.64    (15) Radflow               2H     7.63    0.017

VevoMusic (dynamic), Forecast
(12) Radflow-MeanPooling       1H     8.42    (13) Radflow-GraphSage     1H     8.43    0.618
(12) Radflow-MeanPooling       1H     8.42    (14) Radflow-GAT           1H     8.43    0.586
(13) Radflow-GraphSage         1H     8.43    (14) Radflow-GAT           1H     8.43    0.987
(15) Radflow                   1H     8.37    (14) Radflow-GAT           2H     8.39    0.001
(13) Radflow-GraphSage         2H     8.46    (15) Radflow               2H     8.45    0.064

VevoMusic (dynamic), Imputation
(11) LSTM-MeanPooling          1H     7.91    (11) LSTM-MeanPooling      2H     7.90    0.702
(13) Radflow-GraphSage         1H     7.46    (14) Radflow-GAT           1H     7.44    0.007
(13) Radflow-GraphSage         2H     7.27    (14) Radflow-GAT           2H     7.28    0.091
(13) Radflow-GraphSage         2H     7.27    (15) Radflow               2H     7.27    0.871
(14) Radflow-GAT               2H     7.28    (15) Radflow               2H     7.27    0.110

WikiTraffic (univariate), Forecast
(6) LSTM                       0H     16.6    (7) N-BEATS                0H     16.6    0.471
(6) LSTM                       0H     16.6    (11) LSTM-MeanPooling      2H     16.7    0.002
(6) LSTM                       0H     16.6    (13) Radflow-GraphSage     2H     16.7    0.067
(7) N-BEATS                    0H     16.6    (12) Radflow-MeanPooling   2H     16.5    0.002
(7) N-BEATS                    0H     16.6    (13) Radflow-GraphSage     2H     16.7    0.016
(8) Radflow-NoNetwork          0H     16.1    (14) Radflow-GAT           1H     16.2    0.019
(8) Radflow-NoNetwork          0H     16.1    (15) Radflow               1H     16.2    0.027
(12) Radflow-MeanPooling       1H     16.5    (12) Radflow-MeanPooling   2H     16.5    0.105
(14) Radflow-GAT               1H     16.2    (15) Radflow               1H     16.2    0.948
(11) LSTM-MeanPooling          2H     16.7    (13) Radflow-GraphSage     2H     16.7    0.139
(14) Radflow-GAT               2H     16.0    (15) Radflow               2H     16.0    0.934

WikiTraffic (univariate), Imputation
(12) Radflow-MeanPooling       1H     15.1    (14) Radflow-GAT           1H     15.1    0.844
(12) Radflow-MeanPooling       1H     15.1    (13) Radflow-GraphSage     2H     15.0    0.001
(14) Radflow-GAT               1H     15.1    (12) Radflow-MeanPooling   2H     15.1    0.001
(14) Radflow-GAT               1H     15.1    (13) Radflow-GraphSage     2H     15.0    0.001
(11) LSTM-MeanPooling          2H     15.2    (12) Radflow-MeanPooling   2H     15.1    0.098
(11) LSTM-MeanPooling          2H     15.2    (14) Radflow-GAT           2H     15.2    0.625
(12) Radflow-MeanPooling       2H     15.1    (14) Radflow-GAT           2H     15.2    0.064

WikiTraffic (bivariate)