Spatial-Temporal Fusion Graph Neural Networks for Traffic Flow Forecasting
Mengzhang Li, Zhanxing Zhu
Peking University, Beijing, China
{mcmong, zhanxing.zhu}@pku.edu.cn

Abstract
Spatial-temporal data forecasting of traffic flow is a challenging task because of complicated spatial dependencies and dynamic trends of temporal patterns between different roads. Existing frameworks typically utilize a given spatial adjacency graph and sophisticated mechanisms for modeling spatial and temporal correlations. However, limited representations of the given spatial graph structure, with incomplete adjacent connections, may restrict the effective spatial-temporal dependency learning of those models. Furthermore, existing methods are ill-equipped for complicated spatial-temporal data: they usually utilize separate modules for spatial and temporal correlations, or they only use independent components capturing localized or global heterogeneous dependencies. To overcome these limitations, this paper proposes a novel Spatial-Temporal Fusion Graph Neural Network (STFGNN) for traffic flow forecasting. First, a data-driven method of generating a "temporal graph" is proposed to compensate for several existing correlations that the spatial graph may not reflect. STFGNN can effectively learn hidden spatial-temporal dependencies by a novel fusion operation of various spatial and temporal graphs, treated for different time periods in parallel. Meanwhile, by integrating this fusion graph module and a novel gated convolution module into a unified layer, STFGNN can handle long sequences by learning more spatial-temporal dependencies with layers stacked. Experimental results on several public traffic datasets demonstrate that our method consistently achieves state-of-the-art performance compared with other baselines.

Introduction
The forecasting task of spatial-temporal data, especially traffic data, has been widely studied recently because (1) traffic forecasting is one of the most important parts of Intelligent Transportation Systems (ITS), which have a great effect on daily life; and (2) its data structure is also representative in reality: other location-based data such as wind energy stations, air monitoring stations and cell towers can all be formulated as this spatial-temporal data structure.

Recently, graph modeling of spatial-temporal data has been in the spotlight with the development of graph neural networks. Many works have achieved impressive performance on prediction accuracy. Although significant improvements have been made in incorporating graph structure into spatial-temporal data forecasting models, these models still face several shortcomings.

Code available at: https://github.com/MengzhangLI/STFGNN

Figure 1: Example of spatial-temporal dependencies in a network. Yellow lines indicate spatial adjacency in reality. Districts that play the same role in the traffic network are likely to have similar temporal patterns, which are represented by green dashed lines.

The first limitation is the lack of an informative graph construction. Taking Figure 1 as an example, distant nodes may have certain correlations, i.e., they may share a similar "temporal pattern". For instance, during rush hours, most roads near office buildings (in business districts) encounter traffic jams in the same period. However, most existing models only utilize the given spatial adjacency matrix for graph modeling, and ignore the temporal similarity between nodes when modeling the adjacency matrix. Some works have already made attempts to improve the representation of the graph. A mask matrix (Song et al. 2020) and a self-adaptive matrix (Wu et al. 2019) are introduced to adjust the existing spatial adjacency matrix, but these learnable matrices both lack the ability to represent the complicated spatial-temporal dependencies in the graph. The temporal self-attention modules of transformers (Xu et al. 2020; Wang et al. 2020) can also extract dynamic spatial-temporal correlations that the predetermined spatial graph may not reflect. However, they may overfit the spatial-temporal dependencies due to dynamic changes and noisy information in real data, especially in long-range prediction tasks, where autoregressive models can hardly avoid error accumulation.

Besides, current studies of spatial-temporal data forecasting are ineffective at capturing dependencies between local and global correlations. RNN/LSTM-based models (Li et al. 2017; Zhang et al.
2018) are time-consuming and may suffer gradient vanishing or explosion when capturing long-range sequences. The sequential procedure of transformers (Park et al. 2019; Wang et al. 2020; Xu et al. 2020) may still be time-consuming in inference. CNN-based methods need to stack layers to capture global correlations of long sequences: STGCN (Yu, Yin, and Zhu 2017) and GraphWaveNet (Wu et al. 2019) may lose local information if the dilation rate increases. STSGCN (Song et al. 2020) proposes a novel localized spatial-temporal subgraph that synchronously captures local correlations, but it is only designed locally and ignores global information. When data are missing, the situation is more severe: it may only learn local noise.

To capture both local and global complicated spatial-temporal dependencies, we present a novel CNN-based framework called the Spatial-Temporal Fusion Graph Neural Network (STFGNN). Motivated by dynamic time warping (Berndt and Clifford 1994), we propose a novel data-driven method for graph construction: the temporal graph is learned based on similarities between time series. Then several graphs can be integrated into a spatial-temporal fusion graph to obtain hidden spatial-temporal dependencies. Moreover, to break the local and global correlation trade-off, a gated dilated convolution module is introduced, whose larger dilation rate can capture long-range dependencies.

The main contributions of this work are as follows.

• We construct a novel graph by a data-driven method, which preserves hidden spatial-temporal dependencies. This data-driven adjacency matrix is able to extract correlations that the given spatial graph may not present. Then, we propose a novel spatial-temporal fusion graph module to capture spatial-temporal dependencies synchronously.

• We propose an effective framework to capture local and global correlations simultaneously, by assembling a gated dilated CNN module with the spatial-temporal fusion graph module in parallel.
Long-range spatial-temporal dependencies can also be extracted with layers stacked.

• To make thorough comparisons and test performance in complicated cases, extensive experiments are conducted on eight real-world datasets used in previous works. The results show our model consistently outperforms baselines, which strongly proves that our proposed model can handle complicated traffic situations in reality with different traffic characteristics, road numbers and missing value ratios.

Figure 2: Two time series and their warping path calculated by the DTW and fast-DTW algorithms. The red zone is the searching zone of fast-DTW defined by "Searching Length" T.

Related Works
Graph Convolution Network
Graph convolution networks are widely applied in many graph-based tasks such as classification (Kipf and Welling 2016) and clustering (Chiang et al. 2019), and come in two types. One extends convolutions to graphs in the spectral domain by finding the corresponding Fourier basis (Bruna et al. 2013). GCN (Kipf and Welling 2016) is a representative work and constructs typical baselines in many tasks. The other generalizes to spatial neighbors by typical convolution. GAT (Veličković et al. 2017), which introduces the attention mechanism into the graph field, and GraphSAGE (Hamilton, Ying, and Leskovec 2017), which generates node embeddings by sampling and aggregating features locally, are both typical works.
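For reference, the propagation rule of the representative GCN above can be sketched as follows; this is a minimal NumPy sketch of H' = σ(D̂^{-1/2}(A + I)D̂^{-1/2} H W), not tied to any particular library:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step (Kipf & Welling 2016):
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)     # ReLU nonlinearity
```

Spectral variants differ in how the filter is parameterized, but this symmetric normalization of the adjacency matrix is the common baseline form.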
Spatial-Temporal Forecasting
Spatial-temporal prediction plays an important role in many application areas. To incorporate spatial dependencies more effectively, recent works introduce graph convolutional networks (GCN) to learn the traffic network. DCRNN (Li et al. 2017) utilizes bi-directional random walks on the traffic graph to model spatial information and captures temporal dynamics by gated recurrent units (GRU). Transformer models such as (Wang et al. 2020; Park et al. 2019) utilize spatial and temporal attention modules in the transformer for spatial-temporal modeling. They can be more efficient in training than LSTMs but still make predictions step by step due to their auto-regressive structures. STGCN (Yu, Yin, and Zhu 2017) and GraphWaveNet (Wu et al. 2019) employ graph convolution in the spatial domain and 1-D convolution along the time axis; they process graph information and time series separately. STSGCN (Song et al. 2020) attempts to incorporate spatial and temporal blocks altogether by a localized spatial-temporal synchronous graph convolution module, regardless of global mutual effects.
Algorithm 1: Temporal Graph Generation

Input: N time series from V (|V| = N)
Initialization: W reset to the zero matrix; TDL: Temporal Distance Calculation defined in Alg. 2
for i = 1, 2, ..., N do
    for j = 1, 2, ..., N do
        dist_{i,j} = TDL(V_i, V_j)  (Alg. 2)
    end
    Sort the k (k ≤ N) smallest elements and their indices j = {j_1, j_2, ..., j_k} s.t. dist_{i,j_1} ≤ dist_{i,j_2} ≤ ... ≤ dist_{i,j_k}
    if j̃ ∈ j then W_{i,j̃} = W_{j̃,i} = 1;
end
return weighted matrix W of the temporal graph G.

Similarity of Temporal Sequences
The methods for measuring the similarity between time series can be divided into three categories: (1) timestep-based, such as Euclidean distance, reflecting point-wise temporal similarity; (2) shape-based, such as Dynamic Time Warping (Berndt and Clifford 1994), according to the trend appearance; (3) change-based, such as the Gaussian Mixture Model (GMM) (Povinelli et al. 2004), which reflects the similarity of the data generation process.

Dynamic Time Warping is a typical algorithm to measure the similarity of time series. Given two time series X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_m), a series distance matrix M ∈ R^{n×m} can be introduced whose entry is M_{i,j} = |x_i − y_j|. Then the cost matrix M_c can be defined:

M_c(i, j) = M_{i,j} + min(M_c(i, j−1), M_c(i−1, j), M_c(i−1, j−1))   (1)

After iterating over i and j, dist(X, Y) = M_c(n, m) is the final distance between X and Y under the best alignment, which can represent the similarity between the two time series.

From Eq. (1) we can tell that Dynamic Time Warping is an algorithm based on dynamic programming, and its core is solving the warping curve, i.e., the matchup of series points x_i and y_j. In other words, the "warping path"

Ω = (ω_1, ω_2, ..., ω_λ),  max(n, m) ≤ λ ≤ n + m

is generated through iterations of Eq. (1). Its element ω_λ = (i, j) denotes the matchup of x_i and y_j.

Preliminaries
We can represent the road network as a graph G = (V, E, A_SG), where V is a finite set of nodes with |V| = N, corresponding to the observations of N sensors or roads; E is a set of edges; and A_SG ∈ R^{N×N} is a spatial adjacency matrix representing the proximity or distance between nodes. Denote the observed graph signal X^{(t)}_G ∈ R^{N×d}, which represents the observation of the spatial graph G at time step t and whose elements are the d observed traffic features (e.g., speed, volume) of each sensor. The aim of traffic forecasting is to learn a function f from the previous T observations to predict the next T′ traffic observations from the N correlated sensors on the road network:

[X^{(t−T+1)}_G, ..., X^{(t)}_G] −f→ [X^{(t+1)}_G, ..., X^{(t+T′)}_G]   (2)

(In this paper, N denotes the number of traffic roads/nodes, while n denotes the given length of a certain time series; they are entirely different.)

Algorithm 2:
Temporal Distance Calculation (TDL)

Input: X = (x_1, ..., x_n) ∈ R^{n×d}, Y = (y_1, ..., y_m) ∈ R^{m×d}, Searching Length T
for i = 1, 2, ..., n do
    for j = max(0, i − T), ..., min(m, i + T + 1) do
        M_{i,j} = |x_i − y_j|;
        if i = 0 and j = 0 then M_C(i, j) = M_{i,j};
        else if i = 0 then M_C(i, j) = M_{i,j} + M_C(i, j−1);
        else if j = 0 then M_C(i, j) = M_{i,j} + M_C(i−1, j);
        else if j = i − T then M_C(i, j) = M_{i,j} + min(M_C(i−1, j−1), M_C(i−1, j));
        else if j = i + T then M_C(i, j) = M_{i,j} + min(M_C(i−1, j−1), M_C(i, j−1));
        else M_C(i, j) = M_{i,j} + min(M_C(i−1, j−1), M_C(i, j−1), M_C(i−1, j));
    end
end
return dist(X, Y) = M_C(n, m)

Spatial-Temporal Fusion Graph Neural Networks
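Algorithms 1 and 2 above can be sketched together in Python; this is a minimal sketch assuming 1-d series, where `fast_dtw` is the band-restricted recursion of Alg. 2 and `temporal_graph` the k-nearest-neighbor construction of Alg. 1:

```python
import numpy as np

def fast_dtw(x, y, T):
    """DTW restricted to the band |i - j| <= T (Alg. 2). Entries outside
    the band stay +inf, which enforces the boundary cases automatically."""
    n, m = len(x), len(y)
    Mc = np.full((n + 1, m + 1), np.inf)
    Mc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - T), min(m, i + T) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            Mc[i, j] = cost + min(Mc[i, j - 1], Mc[i - 1, j], Mc[i - 1, j - 1])
    return Mc[n, m]

def temporal_graph(series, T, k):
    """Binary temporal-graph matrix W (Alg. 1): connect each node to its
    k nearest neighbours under the fast-DTW distance, symmetrically."""
    N = len(series)
    W = np.zeros((N, N))
    for i in range(N):
        d = np.array([fast_dtw(series[i], series[j], T) for j in range(N)])
        for j in np.argsort(d)[1:k + 1]:   # skip the node itself (d = 0)
            W[i, j] = W[j, i] = 1
    return W
```

In practice k is chosen so that the sparsity of the resulting temporal graph roughly matches the given spatial graph, as discussed in the next section.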
We present the framework of the Spatial-Temporal Fusion Graph Neural Network in Figure 3. It consists of (1) an input layer, (2) stacked Spatial-Temporal Fusion Graph Neural Layers and (3) an output layer. The input and output layers are one and two fully-connected layers, respectively, each followed by an activation such as ReLU (Nair and Hinton 2010). Every Spatial-Temporal Fusion Graph Neural Layer is constructed from several Spatial-Temporal Fusion Graph Neural Modules (STFGN Modules) in parallel and a Gated CNN Module which includes two parallel 1D dilated convolution blocks.
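As a rough sketch of this wiring (our reading of Figure 3(c); module internals are stubbed and the names are illustrative, not from the released code):

```python
import numpy as np

def stfgnn_layer(h, stfgn_modules, gated_cnn):
    """One STFGN layer as described above: several STFGN modules run in
    parallel, their outputs are concatenated along the time axis, and the
    Gated CNN output is added to the result."""
    parallel = np.concatenate([m(h) for m in stfgn_modules], axis=0)
    return parallel + gated_cnn(h)
```

Here `h` is a (time, feature) array; the real STFGN and Gated CNN modules are defined in the following subsections.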
Spatial-Temporal Fusion Graph Construction
The aim of generating the temporal graph is to achieve a graph structure with more accurate dependencies and genuine relations than the spatial graph. The temporal graph is then incorporated into a novel spatial-temporal fusion graph, which makes the deep learning model lightweight because this fusion graph already carries the correlation information of each node with (1) its spatial neighbors, (2) nodes with similar temporal patterns, and (3) its own previous or later situation along the time axis.

However, generating the temporal graph based on the similarity of time series by DTW is not easy: it is a typical dynamic programming algorithm with computational complexity O(n^2). This might be unacceptable for many applications because real-world time series are usually very long. To reduce the complexity of DTW, we restrict its "Search Length" T. The search space of the warping path is circumscribed by:

ω_k = (i, j),  |i − j| ≤ T   (3)

Consequently, the computational complexity of DTW is reduced from O(n^2) to O(Tn), which makes its application to large-scale spatial-temporal data possible. We name it "fast-DTW".

As shown in Figure 2, given two roads' time series of length |X| and |Y|, respectively, the distance between the two time series, M_c(|X|, |Y|), can be calculated by Eq. (1). The warping path of fast-DTW is restricted near the diagonal (red zone in Figure 2); consequently, calculating the matchup (i, j) of the element ω_λ = (i, j) of the warping path Ω is not as expensive as in the full DTW algorithm.

The setting of the threshold α, which determines how many smallest distances are kept as neighbors in Alg. 1, is tricky, and we analyze it in the experiments section. Empirically, we keep the sparsity of the temporal graph A_TG almost the same as that of the spatial graph A_SG.

Figure 3(b) is an example of the spatial-temporal fusion graph. It consists of three kinds of N × N matrices: the spatial graph A_SG, which is given by the dataset; the temporal graph A_TG, generated by Alg.
1; and the temporal connectivity graph A_TC, whose element is nonzero iff it connects the same node at the previous and next time steps. Take the A_TC entry within the red circle in Figure 3(b) for instance. It denotes the connection of the same node between time steps 2 and 3 (current time step t = 2). For each node l ∈ {1, 2, ..., N}, i = (t + 1) · N + l = 3N + l and j = t · N + l = 2N + l, and then A_STFG(i, j) = 1. To sum up, the temporal connectivity graph denotes connections of the same node at proximate time steps.

Finally, the spatial-temporal fusion graph A_STFG ∈ R^{KN×KN} is generated. It operates on the sliced input data of each STFGN Module:

h = [X^{(t)}_G, ..., X^{(t+K−1)}_G] ∈ R^{K×N×d×C}   (4)

which is sliced iteratively from the total input data:

X = [X^{(t)}_G, ..., X^{(t+T−1)}_G] ∈ R^{T×N×d×C}   (5)

Here each X^{(t)}_G denotes the high-dimensional feature of the original observation, and C is the number of input feature channels of the STFGN Module, which is also the number of output feature channels of the input layer.

Spatial-Temporal Fusion Graph Neural Module
In the Spatial-Temporal Fusion Graph Neural Module (STFGN Module), the lightweight deep learning model extracts hidden spatial-temporal dependencies by several simple operations: matrix multiplication with the spatial-temporal fusion graph A_STFG, residual connections and max pooling. In this paper, the regular spectral filter (such as the Laplacian in graph convolution) is replaced with a more simplified and time-saving operation: matrix multiplication. Through several matrix multiplications with A_STFG, each node in the network can aggregate spatial dependencies from A_SG, temporal pattern correlations from A_TG, and its own proximate correlations along the time axis from A_TC.

The gating mechanism of LSTM/RNN is also utilized in the graph multiplication block. In the STFGN Module, gated linear units (GLU) are used for generalization in graph multiplication via their nonlinear activation. The graph multiplication block is formulated as:

h^{l+1} = (A* h^l W_1 + b_1) ⊙ σ(A* h^l W_2 + b_2)   (6)

where h^l denotes the l-th hidden state of a certain STFGN Module; A* is shorthand for the spatial-temporal fusion graph A_STFG ∈ R^{KN×KN}; W_1, W_2 ∈ R^{C×C} and b_1, b_2 ∈ R^C are the model parameters of the GLU; ⊙ denotes the Hadamard product; and σ denotes the sigmoid function.

By stacking L graph multiplication blocks, more complicated and non-local spatial-temporal dependencies can be aggregated. Residual connections (He et al. 2016) are also introduced for each block. Max pooling is operated on the concatenation of the hidden states, h_M = MaxPool([h^1, ..., h^L]) ∈ R^{K×N×d×C}. Finally, the part of this concatenation corresponding to the middle time step is cropped, keeping

h_o = h_M[⌊K/2⌋ : ⌊K/2⌋ + 1, :, :, :] ∈ R^{1×N×d×C}

Figure 3(b) shows that this cropped feature already contains complicated heterogeneity. In each matrix multiplication, A_SG in the middle of the diagonal (corresponding to the cropped location of the concatenation) transmits information from spatial neighbors.
A_TC in the horizontal and vertical directions gives each node its own information along the time axis. A_TG in the corners enhances information from nodes with similar temporal patterns.

The input data is processed by multiple STFGN Modules independently in parallel, which is time-saving and can capture more complicated correlations. The concatenation of all STFGN Module outputs is then added to the Gated CNN output and becomes the input of the next STFGN layer. Note that the combined STFGN Module output has size R^{(T−K+1)×N×d×C}, i.e., each STFGN layer cuts the input from T to T − K + 1 in the time dimension. This means STFGN layers can be stacked up to ⌊T/(K−1)⌋ − 1 layers.

Gated Convolution Module
Although A_STFG can extract global spatial-temporal dependencies through the integration of A_TG, the correlations it contains come mostly from distant nodes (like the example in Figure 1). Long-range spatial-temporal dependencies of a node itself are also important, and they are very challenging for many CNN-based works (Yu, Yin, and Zhu 2017; Wu et al. 2019; Song et al. 2020) because the inborn structure of CNNs can hardly outperform auto-regressive models like transformers (Park et al. 2019; Wang et al. 2020).

Figure 3: Detailed framework of STFGNN. (a) is an example of the input of the Spatial-Temporal Fusion Graph, which is generated iteratively along the time axis. (b) is an example of the Spatial-Temporal Fusion Graph, whose sizes K are 4 and 3, respectively. It consists of three kinds of adjacency matrices in R^{N×N}: the spatial graph A_SG, the temporal graph A_TG and the temporal connectivity graph A_TC. The A_TC within the red circle is taken as an instance in the body text. (c) is the overall structure of STFGNN; its Gated CNN module and STFGN Modules are in parallel. (d) is the detailed architecture of the Spatial-Temporal Fusion Graph Modules; each module is trained independently, in parallel, on input iteratively generated from (a).

Different from previous works like GraphWaveNet and STGCN, dilated convolution with a large dilation rate is introduced in this paper. Given the total input data X ∈ R^{T×N×d×C}, it takes the form:

Y = φ(Θ_1 ∗ X + a) ⊙ σ(Θ_2 ∗ X + b)   (7)

Similar to Eq. (6), φ(·) and σ(·) are the tanh and sigmoid functions, respectively. Θ_1 and Θ_2 are two independent 1D convolution operations with dilation rate K − 1.
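A minimal sketch of Eq. (7) for a single channel, assuming a causal kernel of size 2 (the kernel size is our simplification; the paper only specifies the dilation rate):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """1-D dilated convolution, kernel size 2, valid positions only:
    y[t] = w[0] * x[t] + w[1] * x[t + dilation]."""
    y = np.zeros(len(x) - dilation)
    for t in range(len(y)):
        y[t] = w[0] * x[t] + w[1] * x[t + dilation]
    return y

def gated_conv(x, w1, w2, dilation):
    """Gated temporal convolution, Eq. (7): Y = tanh(Θ1 * X) ⊙ σ(Θ2 * X)."""
    gate = 1.0 / (1.0 + np.exp(-dilated_conv1d(x, w2, dilation)))
    return np.tanh(dilated_conv1d(x, w1, dilation)) * gate
```

With dilation rate K − 1, the receptive field grows with stacked layers while each layer stays cheap, which is the trade-off this module exploits.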
It enlarges the receptive field along the time axis and thus strengthens the model's ability to extract sequential dependencies.

Huber loss is chosen as the loss function; the objective is:

L(X̂^{(t+1):(t+T)}_G, Θ) = (1 / (T × N × d)) Σ_{i=1}^{T} Σ_{j=1}^{N} Σ_{k=1}^{d} h(X̂^{(t+i)}_{G,j,k}, X^{(t+i)}_{G,j,k})   (8)

h(Ŷ, Y) =
(1/2)(Ŷ − Y)^2,           if |Ŷ − Y| ≤ δ
δ|Ŷ − Y| − (1/2)δ^2,      if |Ŷ − Y| > δ

δ is a hyperparameter that controls the sensitivity to the squared error loss.

Experiments
Datasets
We verify the performance of STFGNN on eight public traffic network datasets: METR-LA and PEMS-BAY, released by (Li et al. 2017); PeMSD7(M) and PeMSD7(L), released by (Yu, Yin, and Zhu 2017); and PEMS03, PEMS04, PEMS07 and PEMS08, released by (Song et al. 2020). METR-LA records four months of traffic flow on 207 sensors on the highways of Los Angeles County. PEMS-BAY records six months of traffic speed on 325 sensors in the Bay Area. PEMS03, PEMS04, PEMS07 and PEMS08 are constructed from four districts in California, respectively. All these data are collected from the Caltrans Performance Measurement System (PeMS) and aggregated into 5-minute windows, which means there are 288 points in the traffic flow for one day. The spatial adjacency network for each dataset is constructed from the actual road network based on distance. Z-score normalization is adopted to standardize the data inputs. The detailed information is shown in Table 1.

Table 1: Dataset description and statistics.

Dataset   Model         15min (MAE / MAPE% / RMSE)  30min (MAE / MAPE% / RMSE)  60min (MAE / MAPE% / RMSE)
METR-LA   ARIMA         3.99 / 9.60 / 8.21          5.15 / 12.70 / 10.45        6.90 / 17.40 / 13.23
          FC-LSTM       3.44 / 9.60 / 6.30          3.77 / 10.90 / 7.23         4.37 / 13.20 / 8.69
          DCRNN         2.77 / 7.30 / 5.38          3.15 / 8.80 / 6.45          3.60 / 10.50 / 7.60
          STGCN         2.88 / 7.62 / 5.74          3.47 / 9.57 / 7.24          4.59 / 12.70 / 9.40
          STSGCN        - / - / -                   - / - / -                   - / - / -
          GraphWaveNet  2.69 / 6.90 / 5.15          3.07 / 8.37 / 6.22          3.53 / 10.01 / 7.37
          STGRAT        2.60 / 6.60 / 5.07          3.01 / 8.15 / 6.21          3.49 / 10.01 / 7.42
          STGNN         2.62 / 6.55 / 4.99          2.98 / 7.77 / 5.88          3.49 / 9.69 / 6.94
          STFGNN        - / - / -                   - / - / -                   - / - / -
PEMS-BAY  ARIMA         1.62 / 3.50 / 3.30          2.33 / 5.40 / 4.76          3.38 / 8.30 / 6.50
          FC-LSTM       2.05 / 4.80 / 4.19          2.20 / 5.20 / 4.55          2.37 / 5.70 / 4.96
          DCRNN         1.38 / 2.90 / 2.95          1.74 / 3.90 / 3.97          2.07 / 4.90 / 4.74
          STGCN         1.36 / 2.90 / 2.96          1.81 / 4.17 / 4.27          2.49 / 5.79 / 5.69
          STSGCN        - / - / -                   - / - / -                   - / - / -
          GraphWaveNet  1.30 / 2.73 / 2.74          1.63 / 3.67 / 3.70          1.95 / 4.63 / 4.52
          STGRAT        1.29 / 2.67 / 2.71          1.61 / 3.63 / 3.69          1.95 / 4.64 / 4.54
          STGNN         1.17 / - / -                - / - / -                   - / - / -
          STFGNN        - / - / -                   - / - / -                   - / - / -

Table 2: Performance comparison of STFGNN and baseline models on METR-LA and PEMS-BAY datasets.
Baseline Methods
We compare STFGNN with the following models:

• ARIMA: Auto-Regressive Integrated Moving Average model (Box and Pierce 1970), a typical statistical model in the time series field.
• FC-LSTM: Long Short-Term Memory network, a recurrent neural network with fully connected LSTM hidden units (Sutskever, Vinyals, and Le 2014).
• DCRNN: Diffusion Convolutional Recurrent Neural Network, which integrates graph convolution into an encoder-decoder gated recurrent unit (Li et al. 2017).
• STGCN: Spatio-Temporal Graph Convolutional Networks, which integrates graph convolution into a 1D convolution unit (Yu, Yin, and Zhu 2017).
• ASTGCN(r): Attention-Based Spatial-Temporal Graph Convolutional Networks, which introduces spatial and temporal attention mechanisms into the model; only the recent component modeling periodicity is used, to keep the comparison fair (Guo et al. 2019).
• GraphWaveNet: a framework combining an adaptive adjacency matrix in graph convolution with 1D dilated convolution (Wu et al. 2019).
• STSGCN: Spatial-Temporal Synchronous Graph Convolutional Networks, which utilizes a localized spatial-temporal subgraph module to model localized correlations independently (Song et al. 2020).
• STGRAT: Spatio-Temporal Graph Attention Network, which utilizes adapted spatial and temporal self-attention modules in a transformer model (Park et al. 2019).
• STGNN: Spatial-Temporal Graph Neural Network, a transformer model with a learnable positional attention mechanism and a sequential component (Wang et al. 2020).
Experiment Settings
To make a fair comparison with previous baselines, we split the data in exactly the same way: with ratio 7 : 1 : 2 on METR-LA, PEMS-BAY, PeMSD7(M) and PeMSD7(L), and 6 : 2 : 2 on PEMS03, PEMS04, PEMS07 and PEMS08, into training, validation and test sets. One hour (12 continuous time steps) of historical data is used to predict the next hour's 12 continuous time steps. STFGNN is evaluated more than 10 times on each public dataset.

Experiments are conducted in an environment with one Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz and one NVIDIA TESLA V100 16GB GPU. Generating the temporal graph A_TG by fast-DTW in Alg. 1 costs less than 30 minutes on most public datasets. The Searching Length T in the fast-DTW algorithm is 12, which is the largest prediction horizon in our traffic forecasting task. The sparsity of A_TG is 0.01. The model contains 3 STFGN Layers, each of which contains 8 independent STFGN Modules and 1 gated convolution module with dilation rate 3, because the size K of the spatial-temporal fusion graph we use is 4. Elements of all three kinds of graphs are Booleanized to 0 or 1 for the sake of simplification. Each convolution has 64 filters. We train our model using the Adam optimizer with learning rate 0.001. The threshold parameter δ of the loss function is 1, the batch size is 32 and the number of training epochs is 200.

Experiment Results and Analysis
Table 2, Table 3 and Table 4 show the thorough comparison between different models. The results show our STFGNN outperforms the baseline models consistently and overwhelmingly on every dataset, except for a single metric, short-range MAPE on PEMS-BAY, which is slightly larger than STGNN's.
Results on METR-LA and PEMS-BAY
Due to their relatively small node numbers (207 and 325) and smoother time series (their metric values are much smaller than the counterparts on the four PEMS datasets), METR-LA and PEMS-BAY are widely used for performance evaluation. Table 2 shows that models based on transformers and LSTM/RNN usually outperform the others. Transformer models are good at short-range prediction because of the strong learning ability of their complicated network architectures. But the gap in each metric between STFGNN and the transformer models becomes larger as the prediction range increases, because error accumulation caused by noise in real data is inevitable for all auto-regressive models.

The modules of STSGCN only extract local spatial-temporal dependencies, and they only use multiplication operations (fully-connected networks and adjacency-matrix multiplication). Thus frequent missing values disturb its local learning module, and smooth time series magnify its limited representation ability.

Dataset  Metric   FC-LSTM  DCRNN  STGCN  ASTGCN(r)  GraphWaveNet  STSGCN  STFGNN
PEMS03   MAE      21.33    18.18  17.49  17.69      -             -       -
         MAPE(%)  23.33    18.91  17.15  19.40      -             -       -
         RMSE     35.11    30.31  30.12  29.66      -             -       -
PEMS04   MAE      27.14    24.70  22.70  22.93      -             -       -
         MAPE(%)  18.20    17.12  14.59  16.56      -             -       -
         RMSE     41.59    38.12  35.55  35.22      -             -       -
PEMS07   MAE      29.98    25.30  25.38  28.05      -             -       -
         MAPE(%)  13.20    11.66  11.08  13.92      -             -       -
         RMSE     45.94    38.58  38.78  42.57      -             -       -
PEMS08   MAE      22.20    17.86  18.02  18.61      -             -       -
         MAPE(%)  14.20    11.45  11.40  13.08      -             -       -
         RMSE     34.06    27.83  27.83  28.16      -             -       -

Table 3: Performance comparison of STFGNN and baseline models on PEMS03, PEMS04, PEMS07 and PEMS08 datasets.
Results on PEMS datasets
Following the metrics adopted by the previous baseline (Song et al. 2020), Table 3 compares the performance of STFGNN and other models for 60-minute-ahead prediction on the PEMS03, PEMS04, PEMS07 and PEMS08 datasets. All four datasets are not particularly smooth; the relatively poor performance of GraphWaveNet reveals its struggle, because it cannot stack its spatial-temporal layers and enlarge the receptive field of the 1D CNN concurrently.
Results on PeMSD7 datasets
Although the scale of the PeMSD7(L) dataset (1026 nodes) might be challenging for some transformer and LSTM/RNN models, the absence of missing values and the relatively smooth characteristics are friendly to deep learning models. The consistently best results of STFGNN on PeMSD7(M) and PeMSD7(L) prove its fundamentally good representation ability.
Ablation Experiments
To verify the effectiveness of the different parts of STFGNN, we conduct ablation experiments on PEMS04 and PEMS08. Table 5 shows the MAE, MAPE and RMSE metrics; "Model Elements" represents each configuration. Some conclusions can be drawn:

• For the composition of A_STFG, a larger A_STFG means more complicated heterogeneity in spatial-temporal dependencies can be extracted, regardless of fewer stacking layers.
• The sparsity of A_TG is an important hyperparameter, which determines the performance of STFGNN. Empirically, it is set based on the sparsity of the prior spatial graph. We also
demonstrate that, with a proper sparsity of A_TG, a traffic forecasting model free of spatial information is possible, which has promising application value when A_SG is unavailable.
• The Gated Convolution Module remedies the long-range learning ability of the STFGN Modules, which improves the performance of STFGNN.

Dataset    Horizon  Metric   FC-LSTM  DCRNN  STGCN  GraphWaveNet  STSGCN  STFGNN
PeMSD7(M)  15 min   MAE      -        -      -      -             -       -
                    MAPE(%)  8.60     5.30   5.20   4.93          4.62    -
                    RMSE     6.20     4.04   4.07   4.01          3.59    -
           30 min   MAE      -        -      -      -             -       -
                    MAPE(%)  9.55     7.39   7.27   6.89          5.83    -
                    RMSE     7.03     5.58   5.70   5.48          4.63    -
           60 min   MAE      -        -      -      -             -       -
                    MAPE(%)  10.10    9.85   9.77   8.04          7.62    -
                    RMSE     7.51     7.19   7.55   6.25          6.01    -
PeMSD7(L)  15 min   MAE      -        -      -      -             -       -
                    MAPE(%)  11.10    5.51   5.56   5.22          7.00    -
                    RMSE     7.68     4.45   4.32   4.23          5.19    -
           30 min   MAE      -        -      -      -             -       -
                    MAPE(%)  11.41    8.18   7.98   7.27          7.84    -
                    RMSE     7.94     6.31   6.21   5.72          5.86    -
           60 min   MAE      -        -      -      -             -       -
                    MAPE(%)  11.69    11.91  11.17  9.45          9.24    -
                    RMSE     8.20     8.33   8.27   7.05          6.89    -

Table 4: Performance comparison of STFGNN and baseline models on PeMSD7(M) and PeMSD7(L) datasets.

Dataset  Model Elements  MAE    MAPE(%)  RMSE
PEMS04   STSGCN          21.19  13.90    33.65
         [ST, T_sp]      -      -        -
         [ST, T_sp]      -      -        -
         [ST, T_sp]      -      -        -
         [T, T_sp, Θ]    -      -        -
         [T, T_sp, Θ]    -      -        -
         [ST, T_sp, Θ]   -      -        -
PEMS08   STSGCN          17.13  10.96    26.80
         [ST, T_sp]      -      -        -
         [ST, T_sp]      -      -        -
         [ST, T_sp]      -      -        -
         [T, T_sp, Θ]    -      -        -
         [T, T_sp, Θ]    -      -        -
         [ST, T_sp, Θ]   -      -        -

Table 5: Ablation study of STFGNN on PEMS04 and PEMS08. ST means A_STFG with size K = 4; T means A_SG is entirely replaced by A_TG in A_STFG; the two T_sp variants mean the nonzero ratio of A_TG is about 5% and 1%, respectively; Θ indicates that the gated convolution module is added into each STFGN layer. The default STFGNN configuration used in this paper is [ST, T_sp, Θ].

Conclusion
In this paper, we present a novel framework for spatial-temporal traffic data forecasting. Our model captures hidden spatial dependencies effectively through a novel data-driven graph and its fusion with the given spatial graph. By integrating the STFGN Module with a novel Gated CNN module, which enlarges the receptive field on temporal sequences, and stacking them, STFGNN can learn localized spatial-temporal heterogeneity and global spatial-temporal homogeneity simultaneously. Detailed experiments and analysis reveal the advantages and defects of previous models, which in turn demonstrate STFGNN's consistently strong performance.
Acknowledgements
This project is supported by the National Defense Basic Scientific Research Project, China (No. JCKY2018204C004), the National Natural Science Foundation of China (No. 61806009 and 61932001), the Beijing Nova Program (No. 202072) from the Beijing Municipal Science & Technology Commission, and PKU-Baidu Funding 2019BD005.
References
Berndt, D. J.; and Clifford, J. 1994. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, 359–370. Seattle, WA, USA.

Box, G. E.; and Pierce, D. A. 1970. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association.

Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.

Chiang, W.-L.; Liu, X.; Si, S.; Li, Y.; Bengio, S.; and Hsieh, C.-J. 2019. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 257–266.

Guo, S.; Lin, Y.; Feng, N.; Song, C.; and Wan, H. 2019. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 922–929.

Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 1024–1034.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.

Nair, V.; and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.

Park, C.; Lee, C.; Bahng, H.; Kim, K.; Jin, S.; Ko, S.; Choo, J.; et al. 2019. ST-GRAT: A spatio-temporal graph attention network for traffic forecasting. arXiv preprint arXiv:1911.13181.

Povinelli, R. J.; Johnson, M. T.; Lindgren, A. C.; and Ye, J. 2004. Time series classification using Gaussian mixture models of reconstructed phase spaces. IEEE Transactions on Knowledge and Data Engineering.

Song, C.; Lin, Y.; Guo, S.; and Wan, H. 2020. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 914–921.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.

Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.

Wang, X.; Ma, Y.; Wang, Y.; Jin, W.; Wang, X.; Tang, J.; Jia, C.; and Yu, J. 2020. Traffic flow prediction via spatial temporal graph neural network. In Proceedings of The Web Conference 2020, 1082–1092.

Wu, Z.; Pan, S.; Long, G.; Jiang, J.; and Zhang, C. 2019. Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121.

Xu, M.; Dai, W.; Liu, C.; Gao, X.; Lin, W.; Qi, G.-J.; and Xiong, H. 2020. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908.

Yu, B.; Yin, H.; and Zhu, Z. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.

Zhang, J.; Shi, X.; Xie, J.; Ma, H.; King, I.; and Yeung, D.-Y. 2018. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294.