Inductive Graph Neural Networks for Spatiotemporal Kriging
Yuankai Wu
McGill University [email protected]
Dingyi Zhuang
McGill University [email protected]
Aurelie Labbe
HEC Montreal [email protected]
Lijun Sun
McGill University [email protected]
Abstract
Time series forecasting and spatiotemporal kriging are the two most important tasks in spatiotemporal data analysis. Recent research on graph neural networks has made substantial progress in time series forecasting, while little attention is paid to the kriging problem—recovering signals for unsampled locations/sensors. Most existing scalable kriging methods (e.g., matrix/tensor completion) are transductive, and thus full retraining is required when we have a new sensor to interpolate. In this paper, we develop an Inductive Graph Neural Network Kriging (IGNNK) model to recover data for unsampled sensors on a network/graph structure. To generalize the effect of distance and reachability, we generate random subgraphs as samples and reconstruct the corresponding adjacency matrix for each sample. By reconstructing all signals on each sample subgraph, IGNNK can effectively learn the spatial message passing mechanism. Empirical results on several real-world spatiotemporal datasets demonstrate the effectiveness of our model. In addition, we also find that the learned model can be successfully transferred to the same type of kriging tasks on an unseen dataset. Our results show that: 1) GNN is an efficient and effective tool for spatial kriging; 2) inductive GNNs can be trained using dynamic adjacency matrices; and 3) a trained model can be transferred to new graph structures.
With recent advances in information and communications technologies (ICT), large-scale spatiotemporal datasets are collected from various applications, such as traffic sensing and climate monitoring. Analyzing these spatiotemporal datasets has attracted considerable attention. Time series forecasting and spatiotemporal kriging are two essential tasks in spatiotemporal analysis [1, 2]. While recent research advances in deep learning have made substantial progress in time series forecasting (e.g., [3, 4, 5]), little attention is paid to the spatiotemporal kriging application. The goal of spatiotemporal kriging is to perform signal interpolation for unsampled locations given the observed signals from sampled locations during the same period. The interpolation results can produce a fine-grained and high-resolution realization of spatiotemporal data, which can be used to enhance real-world applications such as travel time estimation and disaster evaluation. In addition, a better kriging model can achieve higher estimation accuracy/reliability with fewer sensors, thus reducing the operation and maintenance cost of a sensor network.

For general spatiotemporal kriging problems (e.g., in Euclidean domains), a well-developed approach is Gaussian process (GP) regression [1, 6], which uses a flexible kernel structure to characterize
spatiotemporal correlations. However, GP has two limitations: (1) the model is computationally expensive and thus cannot deal with large-scale datasets, and (2) it is difficult to model networked systems using existing graph kernel structures. To solve large-scale kriging problems in a networked system, graph-regularized matrix/tensor completion has emerged as an effective solution (e.g., [2, 7, 8, 9]). Combining low-rank structure and spatiotemporal regularizers, these models can simultaneously characterize both global consistency and local consistency in the data. However, matrix/tensor completion is essentially transductive: for new sensors/nodes introduced to the network, we cannot directly apply a previously trained model; instead, we have to retrain the full model for the new graph structure even after only minor changes (i.e., after introducing a new sensor). In addition, the low-rank scheme is ineffective in accommodating time-varying/dynamic graph structures. For example, some sensors may retire over time without being replaced, and some sensors may be introduced at new locations. In such cases, the network structure itself is not consistent over time, making it challenging to utilize the full information.

Preprint. Under review.
Figure 1: Framework of IGNNK. (a) Training process: in each sample, a subset of nodes is masked and the GNN is trained to reconstruct the signals of all nodes. (b) Illustration of real-time kriging, where the goal is to perform interpolation for virtual sensors. Note that the set of observed sensors during [t, t + h) is not necessarily the same as the set during training [1, p]: for example, during [t, t + h), one sensor in X is removed and a new sensor is added to the network.

Recent research has explored the potential of modeling spatiotemporal data using Graph Neural Networks (GNNs). GNNs are powerful in characterizing complex spatial dependencies through their message passing mechanism [10]. They also demonstrate the ability and inductive power to generalize the message passing mechanism to unseen nodes or even entirely new (sub)graphs [11, 12, 13]. Inspired by these works, here we develop an Inductive Graph Neural Network Kriging (IGNNK) model to solve real-time spatiotemporal kriging problems on dynamic network structures. Different from graphs in recommender systems governed by certain topology, our spatial graph actually contains valuable location information, which allows us to quantify the exact pairwise “distance” beyond “hops”. In particular, for directed networks such as a highway network, the distance matrix will be asymmetric, and it actually captures the degree of “reachability” from one sensor to another [14, 15]. To better leverage the distance information, IGNNK trains a GNN with the goal of reconstructing information on random subgraph structures (see Figure 1). We first randomly select a subset of nodes from all available sensors and create a corresponding subgraph.
We mask some of them as missing and train the GNN to reconstruct the full signals of all nodes (including both the observed and the masked nodes) on the subgraph (Figure 1(a)). This training scheme allows the GNN to effectively learn the message passing mechanism, which can be further generalized to unseen nodes/graphs. Next, given observed signals from the same or even an entirely new network structure, the trained model can perform kriging through reconstruction (Figure 1(b)).

We compare IGNNK with other state-of-the-art kriging methods on five real-world spatiotemporal datasets. IGNNK achieves the best performance on all datasets, suggesting that the model can effectively generalize spatiotemporal dynamics on a sensor network. To demonstrate the transferability of IGNNK, we apply the trained models from two traffic speed datasets (METR-LA and SeData) to a new one (PeMS-Bay), and we find that the two models offer very competitive performance even though the new dataset is never seen.

The network spatiotemporal kriging problem can be considered a special matrix completion problem in which several rows are completely missing. A common solution is to leverage the network structure as side information [7, 16]. In a spatiotemporal setting, network kriging is similar to the problems presented in [2, 8, 9, 17]. Low-rank tensor models are developed by capturing dependencies among variables [2] or incorporating spatial autoregressive dynamics [9]. Different from previous approaches, we try to gain inductive ability using GNNs. Our method is closely related to the following research directions.
GNNs for spatiotemporal datasets
Graph Convolutional Networks (GCNs) are the most commonly used GNNs. The generalized convolution operation on graphs was first introduced in [18]. Defferrard et al. [19] proposes to use Chebyshev polynomial filters on the eigenvalues to approximate the convolutional filters. Kipf et al. [20] further simplifies the graph convolution operation using the first-order approximation of the Chebyshev polynomial. To model temporal dynamics on graphs, GCNs are combined with recurrent neural networks (RNNs) and temporal convolutional networks (TCNs). One of the early methods using GCNs to filter inputs and hidden states in RNNs is [21]. Later studies have integrated different convolution strategies, such as diffusion convolution [14], gated convolution [22], attention mechanisms [23], and graph WaveNet [24]. As these models are essentially designed for temporal forecasting problems, they are not suitable for the spatiotemporal kriging task.
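For concreteness, the first-order GCN propagation rule of [20] can be sketched in NumPy. The toy graph, feature sizes, and random weights below are illustrative choices, not taken from any of the cited papers:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One first-order GCN layer [20]: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU activation

# Toy graph: 3 nodes in a line, 2 input features, 4 output features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
rng = np.random.default_rng(0)
H1 = gcn_layer(A, rng.normal(size=(3, 2)), rng.normal(size=(2, 4)))
```

Each layer mixes a node's features only with those of its one-hop neighbors, which is why such models are typically stacked to enlarge the receptive field.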
Inductive GNNs
Hamilton et al. [11] is the first to show that GNNs are capable of learning both transductive and inductive node representations. Some recent studies have developed inductive GNNs for recommender systems by extracting user/item embeddings on the user-item bipartite graphs. For example, Zhang et al. [25] masks a part of the observed user and item embeddings and trains a GCN to reconstruct these masked embeddings. Zhang and Chen [26] uses local graph patterns around a rating (i.e., an observed entry in the matrix) and builds a one-hop subgraph to train a GCN, providing inductive power to generalize to unseen users/items as well as transferability. Zeng et al. [13] proposes a graph sampling approach to construct subgraphs to train GNNs for large graphs. The full GNN trained using sampled subgraphs shows superior performance. More recently, Appleby et al. [17] develops a kriging convolutional network (KCN), which constructs a local subgraph by a nearest-neighbor algorithm and trains a GNN to reconstruct each individual node's static label. However, the nearest-neighbor approach (i.e., local structure) in these studies might be insufficient and ineffective in generalizing the global patterns and the effect of distance in spatial networks and spatiotemporal data [2, 14]. In addition, KCN ignores the fact that the labels of nodes are temporally evolving in spatiotemporal datasets.
Spatiotemporal kriging refers to the task of interpolating/recovering time series/signals at unsampled locations/sensors given signals from sampled locations/sensors. We use the terms “node”, “sensor”, and “location” interchangeably throughout the paper. This study focuses on spatiotemporal kriging on networks: the spatial domain becomes an irregular network structure instead of a 2D surface. Consider a set of traffic sensors on a highway network as an example: we can model the sensors as nodes in a network, and the edges can be defined based on the topology of the highway network [14]. In this case, the objective of spatiotemporal kriging is to recover traffic state time series at locations with no sensors. Thus, kriging can be considered the process of setting up virtual sensors. We illustrate the real-time network kriging problem in Figure 1(b). Let [t_1, t_2] = {t_1, t_1 + 1, ..., t_2 − 1, t_2} denote a set of time points. Suppose we have data from n sensors during a historical period [1, p] (n = 8 in Figure 1(b)). We denote by a multivariate time series matrix X ∈ R^{n × p} the available data, with each row being the signal collected from a sensor. We use X as training data. Let [t, t + h) = {t, t + 1, ..., t + h − 1} be the period over which we will perform kriging. During this period, assume we have data Y_t^s ∈ R^{n_t^s × h} available from n_t^s sensors (n_t^s = 8 in Figure 1(b)), and we are interested in interpolating the signals Y_t^u ∈ R^{n_t^u × h} on n_t^u virtual sensors (i.e., new/unsampled nodes; n_t^u = 2 in Figure 1(b)). Note that both X and Y_t^s might be corrupted with missing values. Moreover, it is possible that n ≠ n_t^s, as some sensors may retire and new sensors may be introduced. Our goal is to estimate Y_t^u given Y_t^s. To achieve this, we design IGNNK to learn and generalize the message passing mechanism in training data X.
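In NumPy terms, the objects above have the following shapes; all sizes are illustrative toy values, not the paper's datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 8, 100      # n sensors observed over the historical period [1, p]
h = 4              # length of the kriging window [t, t+h)
n_s, n_u = 8, 2    # observed sensors and virtual sensors during [t, t+h)

X = rng.normal(size=(n, p))       # training data, one row per sensor
Y_s = rng.normal(size=(n_s, h))   # observed signals during [t, t+h)

# Kriging input: observed rows stacked over zero rows for the virtual
# sensors, plus a binary mask marking which rows carry real observations.
Y_masked = np.vstack([Y_s, np.zeros((n_u, h))])
mask = np.vstack([np.ones((n_s, h)), np.zeros((n_u, h))])
```

The model's job is then to fill in the zeroed rows of `Y_masked` using the graph structure and the observed rows.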
As a first step of IGNNK, we develop a sampling procedure—Algorithm 1—to generate a set of subgraphs for training. The key idea is to randomly sample a subset of nodes to get X_sample and build the corresponding adjacency matrix W_sample. The kriging problem is very different from applications in recommender systems, where the graph encodes only topological information (i.e., a social network or co-purchasing network). In the spatial setting, our graph is essentially fully connected, in which we expect edge weights to diminish with the Euclidean/travel distance between a pair of nodes. To better characterize the effect of “distance”, we simply choose a purely random sampling scheme to generate sample subgraphs, instead of creating a local subgraph for each node [11, 17]. We create a mask matrix M_sample to keep some nodes as observed and the rest as “unsampled”. We then use the generated masks M_sample, graph signals X_sample, and adjacency matrices W_sample to train a GNN. The input data X_sample itself could contain missing values, which we will also mark as unknown. This will enable IGNNK to perform spatial interpolation under missing data scenarios.

Algorithm 1
Subgraph signal and random mask generation
Require:
Historical data X from sampled locations over period [1, p] (size n × p).
Parameters: window length h, sample size per iteration S, and maximum iteration I_max.
for iteration = 1 : I_max do
    for sample = 1 : S do
        Generate random integers n_o (number of nodes selected as observed) and n_m (number of nodes selected as missing) with n_o + n_m ≤ n.
        Randomly sample n_o + n_m indices without replacement from [1, n] to obtain I_sample = {i_1, ..., i_{n_o}, ..., i_{n_o + n_m}}.
        Randomly choose a time point j within range [1, p − h]. Let J_sample = [j, j + h).
        Obtain submatrix signal X_sample = X[I_sample, J_sample] with size (n_o + n_m) × h.
        Construct adjacency matrix W_sample ∈ R^{(n_o + n_m) × (n_o + n_m)} for nodes in I_sample.
        Generate a mask matrix M_sample of size (n_o + n_m) × h, with M_sample[i, :] = 1 if i ∈ [1, n_o], and 0 otherwise.
    end for
    Use the sets {X_sample}, {M_sample}, {W_sample} to train GNNs.
end for

3.3 GNN architecture

The second step of IGNNK is to train a GNN model to reconstruct the full matrix X_sample on the subgraph given the incomplete signals X_sample^M = X_sample ⊙ M_sample, where ⊙ denotes the Hadamard product. In order to achieve higher forecasting power, previous studies mainly combine GNNs with sequence-learning models—such as RNNs and TCNs—to jointly capture spatiotemporal correlations, in particular long-term temporal dependencies. However, for our real-time kriging task, the recovery window h is relatively short. Therefore, we simply assume that all time points in the recovery window h are correlated with each other, and model a length-h signal as h features. Real-world spatial networks are often directed with an asymmetric distance matrix (see, e.g., [15]).
To characterize the stochastic nature of spatial and directional dependencies, we adopt Diffusion Graph Convolutional Networks (DGCNs) [14] as the basic building block of our architecture:

H_{l+1} = Σ_{k=1}^{K} T_k(W̄_f) H_l Θ_{f,l}^k + T_k(W̄_b) H_l Θ_{b,l}^k,   (1)

where W̄_f = W_sample / rowsum(W_sample) and W̄_b = W_sample^T / rowsum(W_sample^T) are the forward transition matrix and the backward transition matrix, respectively. Here we use two transition matrices because the adjacency matrix can be asymmetric in a directed graph; in an undirected graph, W̄_f = W̄_b. K is the order of the diffusion convolution; the Chebyshev polynomial is used to approximate the convolution process in DGCN, and we have T_k(X) = 2X T_{k−1}(X) − T_{k−2}(X) defined in a recursive manner with T_0(X) = I and T_1(X) = X; Θ_{f,l}^k and Θ_{b,l}^k are learning parameters of the l-th layer that control how each node transforms received information; H_{l+1} is the output of the l-th layer. Unlike traditional GNNs using a fixed spatial structure, in IGNNK each sample has its own subgraph structure. Thus, the adjacency matrices W_sample capturing neighborhood information and message passing direction are also different across samples.

Figure 2: Graph neural network structure of IGNNK.

Figure 2 illustrates the whole graph neural network structure of IGNNK, which is a simple 3-layer DGCN. The input to the first layer is the masked signals H_0 = X_sample^M. Next, we obtain H_1 following Eq. (1) with parameters Θ_{f,0}^k ∈ R^{h × z} and Θ_{b,0}^k ∈ R^{h × z}. Since the masked nodes only pass messages to their neighbors in the first layer, a one-layer GCN cannot produce desirable features.
Therefore, we add another layer of DGCN (i.e., H_2) to produce more generalized representations:

H_2 = σ( Σ_{k=1}^{K} T_k(W̄_f) H_1 Θ_{f,1}^k + T_k(W̄_b) H_1 Θ_{b,1}^k ) + H_1,   (2)

where Θ_{f,1}^k ∈ R^{z × z} and Θ_{b,1}^k ∈ R^{z × z} are parameters of the second DGCN layer, and σ(·) is a nonlinear activation function. The reason we add H_1 to H_2 is that H_1 contains the information about sensors with missing data.

After obtaining the final graph representation, we use another DGCN to output the reconstruction:

X̂ = Σ_{k=1}^{K} T_k(W̄_f) H_2 Θ_{f,2}^k + T_k(W̄_b) H_2 Θ_{b,2}^k,   (3)

where Θ_{f,2}^k ∈ R^{z × h} and Θ_{b,2}^k ∈ R^{z × h} are learning parameters of the last layer. Note that we keep the structure of IGNNK as parameter-economic as possible. One can always introduce more complex structures to further improve model capacity and performance.

As stated in Section 3.1, the problem we address here is to recover the information of unsampled sensors. A straightforward approach is to define the training loss function only on the masked signals. However, from a methodological perspective, we prefer IGNNK to generalize to both dynamic graph structures and unseen nodes [11]. To make the learned message passing mechanism more generalized for all nodes, we use the total reconstruction error on both observed and unseen nodes as our loss function:

J = Σ_sample ‖X̂_sample − X_sample‖_F.   (4)

We next illustrate how to perform kriging to obtain the matrix Y_t^u for the virtual sensors in Figure 1(b). First, we create the (n_t^s + n_t^u) × h binary mask matrix M_t and the (n_t^s + n_t^u) × (n_t^s + n_t^u) adjacency matrix W_t given the topology of the sensor network during [t, t + h). Then, we feed W_t and the “masked” signals Y_t^M = [Y_t^s; Y_t^u] with Y_t^u = 0 to the trained IGNNK model to get Ŷ_t^M = [Ŷ_t^s; Ŷ_t^u]. The estimation Ŷ_t^u ∈ R^{n_t^u × h} is the final kriging result for those virtual sensors.
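Putting Algorithm 1 and Eqs. (1)–(4) together, one training step can be sketched in NumPy. Parameter shapes follow the text (h → z → z → h); the choice of ReLU for σ, the random initialization, and the fixed n_o and n_m are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def cheb(M, K):
    """Chebyshev polynomials T_1(M), ..., T_K(M), with T_0 = I and T_1 = M."""
    T = [np.eye(M.shape[0]), M]
    for _ in range(K - 1):
        T.append(2 * M @ T[-1] - T[-2])
    return T[1:K + 1]

def dgcn(H, W, thetas_f, thetas_b):
    """Diffusion graph convolution, Eq. (1); one parameter matrix per order k."""
    W_f = W / W.sum(axis=1, keepdims=True)      # forward transition matrix
    W_b = W.T / W.T.sum(axis=1, keepdims=True)  # backward transition matrix
    K = len(thetas_f)
    return sum(Tf @ H @ tf + Tb @ H @ tb
               for Tf, Tb, tf, tb in zip(cheb(W_f, K), cheb(W_b, K), thetas_f, thetas_b))

def ignnk_forward(X_masked, W, p1, p2, p3):
    """Three-layer IGNNK: Eq. (1), then Eq. (2) with skip connection, then Eq. (3)."""
    H1 = dgcn(X_masked, W, *p1)                  # h -> z
    H2 = np.maximum(dgcn(H1, W, *p2), 0.0) + H1  # z -> z, sigma = ReLU, plus H1
    return dgcn(H2, W, *p3)                      # z -> h

# One training sample (Algorithm 1) with n_o = 5 observed, n_m = 3 masked nodes.
rng = np.random.default_rng(0)
n, p, h, z, K = 20, 200, 6, 10, 2
X, W_full = rng.normal(size=(n, p)), rng.random((n, n))
idx = rng.choice(n, size=8, replace=False)       # I_sample
j = rng.integers(0, p - h)                       # J_sample = [j, j+h)
X_s, W_s = X[np.ix_(idx, range(j, j + h))], W_full[np.ix_(idx, idx)]
M_s = np.zeros((8, h))
M_s[:5] = 1.0                                    # first n_o rows observed

params = [([rng.normal(size=s) * 0.1 for _ in range(K)],
           [rng.normal(size=s) * 0.1 for _ in range(K)])
          for s in [(h, z), (z, z), (z, h)]]
X_hat = ignnk_forward(X_s * M_s, W_s, *params)
loss = np.linalg.norm(X_hat - X_s)               # Eq. (4), one sample
```

At kriging time, the same forward pass is applied to [Y_t^s; 0] with the test-time adjacency matrix W_t; only the parameters are reused, so the subgraph size may differ from training.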
In this section, we assess the performance of IGNNK using five real-world spatiotemporal datasets from diverse applications: (1)
METR-LA is a traffic speed dataset from 207 sensors in Los Angeles over four months (Mar 1, 2012 to Jun 30, 2012); (2)
NREL registers solar power output of 137 photovoltaic power plants in Alabama in 2006; (3)
USHCN contains monthly precipitation at 1218 locations from 1899 to 2019; (4)
SeData is also a traffic speed dataset, collected from 323 loop detectors in the Seattle highway network; (5)
PeMS-Bay is a traffic speed dataset similar to METR-LA, consisting of traffic speed time series from 325 sensors in the Bay Area from Jan 1, 2017 to May 13, 2017. We use PeMS-Bay to demonstrate the transferability of IGNNK. The sampling frequency for METR-LA, NREL, SeData, and PeMS-Bay is 5 min. In terms of adjacency structure, we compute the pairwise Euclidean distance matrices for NREL and USHCN; both METR-LA and PeMS-Bay have a travel-distance-based adjacency matrix (i.e., “reachability”), which is asymmetric; SeData has a simple binary adjacency matrix (1 if two sensors are neighbors and 0 otherwise). We summarize the detailed information and the rules/methods used to construct the adjacency matrices for each dataset in Appendix A.
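A common convention for turning pairwise distances into a weighted adjacency matrix is a thresholded Gaussian kernel, as used for the travel-distance graphs in [14]. The exact rules for each dataset are in Appendix A; the kernel below, with the distance standard deviation as bandwidth and a fixed sparsification threshold, is only one such convention:

```python
import numpy as np

def gaussian_kernel_adjacency(dist, threshold=0.1):
    """W_ij = exp(-(d_ij / sigma)^2), zeroing entries below `threshold`.

    `sigma` is the standard deviation of the distances, a common normalization.
    For directed travel distances, `dist` (and hence W) may be asymmetric.
    """
    sigma = dist.std()
    W = np.exp(-(dist / sigma) ** 2)
    W[W < threshold] = 0.0  # sparsify: drop negligible long-range edges
    return W

# Toy symmetric distance matrix for three sensors.
dist = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 2.0],
                 [5.0, 2.0, 0.0]])
W = gaussian_kernel_adjacency(dist)
```

With this construction, nearby sensors get edge weights close to 1 while distant pairs are pruned, which matches the intuition that edge weight should diminish with distance.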
Our implementation of IGNNK is available at https://github.com/Kaimaoge/IGNNK. We compare the performance of IGNNK with the following widely used kriging/interpolation and matrix/tensor factorization models. (1) kNN: K-nearest neighbors, which estimates the information of unknown nodes by averaging the values of the K spatially closest sensors on the network. (2) OKriging: ordinary kriging, a well-developed spatial interpolation model [1]. (3) KPMF: Kernelized Probabilistic Matrix Factorization, which incorporates graph kernel information [7] into matrix factorization using a regularized Laplacian kernel. (4)
GLTL: Greedy Low-rank Tensor Learning, a transductive tensor factorization model for spatiotemporal cokriging [2]. GLTL deals with multiple variables using a [location × time × variable] tensor structure. We implement a reduced matrix version of GLTL, as we only have one variable for all the datasets. (5) sRMGCNN [27]: a transductive matrix completion framework that uses GCNs to learn graph features. The model is designed for recommender systems, but it can be easily applied to kriging as a matrix completion problem. To evaluate the performance of different models, we randomly select ≈
25% of nodes as “unsampled” locations/virtual sensors (n_t^u) and keep the rest as observed for all five datasets. For IGNNK, we take data from the first 70% of time points as a training set X and test the kriging performance on the following 30% of time points. For simplicity, we keep the values of n_o and n_m fixed in our implementation, instead of generating random numbers as in Algorithm 1. We choose h = 24 (i.e., 2 hours) for the three traffic datasets, h = 16 (i.e., 80 min) for NREL, and h = 6 (i.e., 6 months) for USHCN. We perform kriging by a rolling-window approach on [t, t + h), [t + h, t + 2h), [t + 2h, t + 3h), etc. It should be noted that, since the training and test nodes are both given in the datasets, we can pre-compute the pairwise adjacency matrix for all nodes and obtain W_sample by directly selecting a submatrix from the full adjacency matrix. The detailed configuration of IGNNK is given in Appendix B. For matrix/tensor-based models (i.e., KPMF, GLTL, and sRMGCNN), we use the entire dataset as input but mask the same 25% “unsampled” nodes as in IGNNK. We use the first 70% of time points from the unsampled nodes as a validation set to select the optimal hyperparameters, and evaluate recovery performance on the following 30% of time points. For these models, the recovery for the whole 30% test period can be generated at once, and they use more information compared with IGNNK. The full Gaussian process model for spatiotemporal kriging is computationally very expensive. Thus, we implement OKriging and kNN for each time step t separately (i.e., h = 1, without considering temporal dependencies). Again, we use the first 70% of time points from the unsampled sensors for validation and the remaining 30% for testing. All the baseline models are tuned using validation (see Appendix C). Since OKriging uses longitude/latitude as input, it is not appropriate for road-distance-based transportation networks. Therefore, OKriging is only applied to the NREL and USHCN datasets.
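The rolling-window evaluation above can be sketched as follows; `evaluate_rolling` and the trivial `mean_model` stand-in are purely illustrative helpers, not part of our released implementation:

```python
import numpy as np

def evaluate_rolling(model, Y, unsampled, h):
    """Rolling-window kriging on [t, t+h), [t+h, t+2h), ...; RMSE/MAE are
    computed on the `unsampled` rows only. `model(Y_masked, mask)` is a
    stand-in for a trained kriging model returning the full (n x h) signal."""
    n, T = Y.shape
    mask = np.ones((n, 1))
    mask[unsampled] = 0.0            # hide the virtual-sensor rows
    errs = []
    for t in range(0, T - h + 1, h):
        window = Y[:, t:t + h]
        Y_hat = model(window * mask, mask)
        errs.append((Y_hat - window)[unsampled])
    e = np.concatenate([x.ravel() for x in errs])
    return np.sqrt(np.mean(e ** 2)), np.mean(np.abs(e))

# Trivial stand-in model: predict the column mean of the observed rows.
def mean_model(Y_masked, mask):
    obs = mask[:, 0] > 0
    return np.tile(Y_masked[obs].mean(axis=0), (Y_masked.shape[0], 1))

Y = np.ones((5, 12))
rmse, mae = evaluate_rolling(mean_model, Y, unsampled=[1], h=4)
# For constant data the stand-in is exact: rmse == mae == 0.0.
```

Scoring only the hidden rows, rather than the full reconstruction, keeps the comparison consistent across transductive and inductive baselines.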
As the adjacency matrix of SeData is given as a binary matrix instead of using actual distances, we cannot get effective neighbors for kNN. Thus, kNN is not applied to SeData.

Table 1 shows the kriging performance of IGNNK and the other baseline models on four datasets. As can be seen, the proposed method consistently outperforms the other baseline models, providing the lowest RMSE and MAE on all datasets.

Table 1: Kriging performance comparison of different models on four datasets.

Model     | METR-LA        | NREL          | USHCN         | SeData
          | RMSE    MAE    | RMSE   MAE    | RMSE   MAE    | RMSE    MAE
IGNNK     |                |               |               |
kNN       | 11.071  6.927  | 4.192  2.850  | 3.400  2.086  | -       -
KPMF      | 12.851  7.890  | 8.771  7.408  | 6.663  4.847  | 13.060  8.339
GLTL      | 9.668   6.559  | 4.840  3.372  | 5.047  3.396  | 6.989   4.285
sRMGCNN   | 14.581  9.047  | 8.361  6.445  | 5.844  4.621  | 9.381   5.718
OKriging  | -       -      | 3.470  2.381  | 3.231         | -       -

As can be seen, IGNNK achieves good performance on the two spatial datasets—NREL and USHCN—where OKriging fits the task very well. There are two cases worth mentioning in Table 1. First, kNN and OKriging also give results comparable to IGNNK on the USHCN data; the MAE of OKriging is even lower than that of IGNNK. This is due to the fact that the sensors have a high density and precipitation often shows smooth spatial variation (see Figure 6 in Appendix D). In this case, local spatial consistency might be powerful enough for kriging, and thus we do not see much improvement from IGNNK. For SeData, GLTL also shows good performance. A potential reason is a data limitation: the adjacency structure of SeData is given as a binary matrix of sensor connectivity (i.e., 1 if two sensors are neighbors next to each other on a highway segment, and 0 otherwise). In this case, W only encodes the topology, instead of the full pairwise distance information as in the other datasets. In fact, relative distance plays a critical role in the kriging task on a highway network due to the complex causal structures and dynamics of traffic flow [14]. As a result, we expect IGNNK to be less powerful due to the lack of the “distance” effect.
We provide some examples of spatial and temporal visualizations of IGNNK and other baseline models in Appendix D.

Transfer learning performance
We next demonstrate the transferability of IGNNK. Our experiments are based on three datasets—METR-LA, SeData, and PeMS-Bay—collected from three different highway networks using the same type of sensors (i.e., loop detectors). The PeMS-Bay dataset is very similar to METR-LA: both datasets register traffic speed at a 5-min frequency, and both of them provide travel distances between sensors. The two datasets are used side by side in [14]. SeData has the same format; however, as mentioned, the dataset provides a simple binary adjacency matrix showing connectivity. To show the effect of the adjacency matrix, we train two sets of models separately following the same approach as in the kriging experiments: one with a Gaussian kernel adjacency matrix based on pairwise travel distance (as in METR-LA), and the other with the binary connectivity matrix (as in SeData).

The top part of Table 2 shows the results obtained using the Gaussian kernel adjacency (Gaussian) and the binary adjacency (Binary), respectively. Not surprisingly, IGNNK-Gaussian offers the best accuracy, clearly outperforming all other models. In general, Gaussian offers better performance than Binary for all the baseline models except sRMGCNN. Similar to the kriging experiment on SeData, we find that Binary IGNNK and Binary GLTL demonstrate comparable performance.

Table 2: Kriging performance of different models on PeMS-Bay. The last two rows show the transferability of the two IGNNK models trained on METR-LA and SeData.

                          | Gaussian               | Binary
Model                     | RMSE    MAE     MAPE   | RMSE    MAE    MAPE
IGNNK                     |                        | 9.245   5.394  13.26%
kNN                       | 7.431   4.245   9.13%  | -       -      -
KPMF                      | 7.332   4.293   9.21%  | 10.065  5.985  16.03%
GLTL                      | 8.846   4.486   10.25% | 8.504   4.962  12.24%
sRMGCNN                   | 13.460  10.318  21.41% | 13.191  8.718  19.59%
IGNNK transfer (METR-LA)  |                        |
IGNNK transfer (SeData)   |                        | 11.484  6.456  15.10%
Figure 3: Kriging performance on unknown nodes in the PeMS-Bay dataset. We provide the full baseline comparison in Appendix D.

We then apply the two IGNNK models—trained on METR-LA and SeData, respectively—directly on the test data of PeMS-Bay and obtain the kriging errors (the last two rows in Table 2). As can be seen, IGNNK (METR-LA) provides comparable performance to, if not better than, the second-best model—Gaussian GLTL; however, IGNNK (SeData) does not offer encouraging results in this case. Our experiment shows that IGNNK can effectively learn the effect of pairwise distance from METR-LA and then generalize the results to PeMS-Bay. Figure 3 shows an example of the kriging results of these models. While no model can reproduce the sudden drop during 6:00 am–9:00 am, we can see that Gaussian provides much better recovery during non-peak hours than Binary. The analysis on transferability further confirms the critical role of distance in kriging tasks.
In this paper, we introduce IGNNK as a novel framework for spatiotemporal kriging. Instead of learning transductive latent features, the training scheme provides IGNNK with additional generalization and inductive power. Thus, we can apply a trained model directly to perform kriging for any new locations of interest without retraining. Our numerical experiments show that IGNNK consistently outperforms other baseline models on five real-world spatiotemporal datasets. Besides, IGNNK demonstrates remarkable transferability in our traffic data kriging task. Our results also suggest that “distance” information in a graph plays a critical role in spatiotemporal kriging, which is different from applications in recommender systems where graphs essentially encode topological information. The flexibility of this model allows us to model time-varying systems, such as moving sensors (e.g., probe vehicles) or crowdsourcing systems, which create a dynamic network structure.

There are several directions for future work. First, we can adapt IGNNK to accommodate multivariate datasets as a spatiotemporal tensor (e.g., [2]). Second, better temporal models, such as RNNs and TCNs, can be incorporated to characterize complex temporal dependencies. This will allow us to perform kriging over a much longer time window with better temporal dynamics. Third, one can further combine time series forecasting with kriging in an integrated framework, providing forecasting results for both existing and virtual sensors for better decision making.
Acknowledgments and Disclosure of Funding
This research is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Fonds de recherche du Québec - Nature et technologies (FRQNT), and the Canada Foundation for Innovation (CFI). Y. Wu would like to thank the Institute for Data Valorisation (IVADO) for providing a scholarship to support this study. The authors declare no competing financial interests with respect to this work.
References

[1] Noel Cressie and Christopher K Wikle. Statistics for Spatio-temporal Data. John Wiley & Sons, 2015.
[2] Mohammad Taha Bahadori, Qi Rose Yu, and Yan Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems, pages 3491–3499, 2014.
[3] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, pages 7785–7794, 2018.
[4] David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems, pages 6824–6834, 2019.
[5] Rajat Sen, Hsiang-Fu Yu, and Inderjit S Dhillon. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. In Advances in Neural Information Processing Systems, pages 4838–4847, 2019.
[6] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[7] Tinghui Zhou, Hanhuai Shan, Arindam Banerjee, and Guillermo Sapiro. Kernelized probabilistic matrix factorization: Exploiting graphs and side information. In Proceedings of the SIAM International Conference on Data Mining, pages 403–414, 2012.
[8] Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, Linhong Zhu, Rose Yu, and Yan Liu. Latent space model for road networks to predict time-varying traffic. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1525–1534, 2016.
[9] Koh Takeuchi, Hisashi Kashima, and Naonori Ueda. Autoregressive tensor factorization for spatio-temporal predictions. In IEEE International Conference on Data Mining (ICDM), pages 1105–1110, 2017.
[10] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[11] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[12] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[13] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. GraphSAINT: Graph sampling based inductive learning method. In International Conference on Learning Representations, 2020.
[14] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.
[15] Juan C Martínez Mori and Samitha Samaranayake. Bounded asymmetry in road networks. Scientific Reports, 9:11951, 2019.
[16] Nikhil Rao, Hsiang-Fu Yu, Pradeep K Ravikumar, and Inderjit S Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In Advances in Neural Information Processing Systems, pages 2107–2115, 2015.
[17] Gabriel Appleby, Linfeng Liu, and Li-Ping Liu. Kriging convolutional networks.
[18] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations, 2014.
[19] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[20] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[21] Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pages 362–373. Springer, 2018.
[22] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In International Joint Conference on Artificial Intelligence, pages 3634–3640, 2018.
[23] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294, 2018.
[24] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph WaveNet for deep spatial-temporal graph modeling. In International Joint Conference on Artificial Intelligence, pages 1907–1913, 2019.
[25] Jiani Zhang, Xingjian Shi, Shenglin Zhao, and Irwin King. STAR-GCN: Stacked and reconstructed graph convolutional networks for recommender systems. arXiv preprint arXiv:1905.13129, 2019.
[26] Muhan Zhang and Yixin Chen. Inductive matrix completion based on graph neural networks. arXiv preprint arXiv:1904.12058, 2019.
[27] Federico Monti, Michael Bronstein, and Xavier Bresson. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems, pages 3697–3707, 2017.
[28] Manajit Sengupta, Yu Xie, Anthony Lopez, Aron Habte, Galen Maclaurin, and James Shelby. The national solar radiation data base (NSRDB). Renewable and Sustainable Energy Reviews, 89:51–60, 2018.
[29] Matthew J Menne, Claude N Williams Jr, and Russell S Vose. The US historical climatology network monthly temperature data, version 2. Bulletin of the American Meteorological Society, 90(7):993–1008, 2009.
[30] Zhiyong Cui, Ruimin Ke, and Yinhai Wang. Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143, 2018.
[31] Daniel G Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Southern African Institute of Mining and Metallurgy, 52(6):119–139, 1951.
[32] Kathleen M Carley, Dave Columbus, and Ariel Azoulay. AutoMap user's guide 2012. Technical report, Carnegie Mellon University, Institute for Software Research International, Pittsburgh, PA, 2012.
[33] Sungyong Seo and Yan Liu. Differentiable physics-informed graph networks. arXiv preprint arXiv:1902.02950, 2019.
A Data description
We use five real-world spatiotemporal datasets to assess our model (see Table 3 for a summary). They are:
METR-LA contains traffic speed data collected by highway loop detectors in Los Angeles [14]. Following the work of Li et al. [14], we select four months of data from 207 sensors at 5-min frequency, from Mar 1st 2012 to Jun 30th 2012. For "0" entries in X (missing values), the corresponding entries in the mask matrices are also set to 0. We build the adjacency matrix following [14]:

W_{ij} = \exp\left( -\left( \frac{\operatorname{dist}(v_i, v_j)}{\sigma} \right)^2 \right),    (5)

where W_{ij} is the adjacency weight between sensors v_i and v_j, dist(v_i, v_j) denotes the road network distance, and σ = 6696 m is the standard deviation of the sensor distances. We further set W_{ij} = 0 if W_{ij} < 0.1 to sparsify the graph. The final adjacency matrix is obtained directly from the GitHub repository of [14].

NREL provides several energy datasets; here we choose the Alabama Solar Power Data for Integration Studies [28]. The solar power data contain 5-minute solar power collected by 137 photovoltaic power plants in 2006. Similarly, we follow [2] and construct the adjacency matrix from the geometric distances between power plants:

W_{ij} = \exp\left( -\frac{\operatorname{dist}(v_i, v_j)}{\sigma} \right),    (6)

where dist(v_i, v_j) is calculated with the Haversine formula based on longitude and latitude, and we set σ = 7. km. The maximum capacity of each power plant is provided in this dataset, and each time series is normalized to [0, 1] by dividing its values by the corresponding capacity. In this dataset, half of the observations are zeros due to sunrise and sunset. These zero observations would undermine the training of both IGNNK and the baseline models, so we only keep the data from 7 am–7 pm each day. Note that we regard "0" entries in NREL as true observations; therefore, we do not impose additional masks for missing values when processing the data.

USHCN consists of 1218 time series of monthly precipitation from 1899 to 2019, collected by the U.S.
Historical Climatology Network (USHCN) [29]. We apply the same rule as in NREL to construct the dynamic adjacency matrix, with σ = 50 km, and we further set W_{ij} = 0 below a small threshold. As the time span of this dataset is much longer than the others and its variance-to-mean ratio exceeds 500, it helps examine how different models perform on time series with substantial oscillations.

SeData is also a traffic speed dataset. It registers 5-min traffic speed data from 323 loop detectors on the Seattle highway over one year [30]. It should be noted that SeData provides neither exact coordinates nor topology information for the sensors; instead, it provides a binary adjacency matrix based on neighborhood. We build the adjacency matrix using the classical undirected graph definition:

W_{ij} = \begin{cases} 1, & \text{if } i \text{ and } j \text{ are neighbors}, \\ 0, & \text{otherwise}. \end{cases}    (7)

PeMS refers to the Performance Measurement System (PeMS) for California highways, collected by the California Transportation Agency (CalTrans). As in the work of Li et al. [14], we select 325 sensors in the Bay Area (PeMS-Bay) between Jan 1st 2017 and May 13th 2017; the temporal frequency of PeMS-Bay is also 5-min. We use PeMS-Bay to demonstrate the transferability of IGNNK, and the adjacency matrix construction is adapted to each transfer learning task accordingly.
The adjacency matrix used for training IGNNK on PeMS-Bay, and for the model transferred from METR-LA, is constructed with the thresholded Gaussian kernel in Eq. (5).

METR-LA and PeMS-Bay: https://github.com/liyaguang/DCRNN
SeData: https://github.com/zhiyongc/Seattle-Loop-Data

Table 3: Summary of the datasets.

Dataset    Sensors  Time length (frequency)  Adjacency matrix               Usage            Missing
METR-LA    207      34272 (5-min)            Threshold Gaussian kernel      IGNNK training   8.11%
NREL       137      105120 (5-min)           Geo-distance Gaussian kernel   IGNNK training   -
USHCN      1218     1440 (1-month)           Geo-distance Gaussian kernel   IGNNK training   3.07%
SeData     323      105120 (5-min)           0-1 undirected graph           IGNNK training   -
PeMS-Bay   325      52116 (5-min)            Gaussian kernel                IGNNK training   0.003%
PeMS-Bay   325      52116 (5-min)            0-1 graph (SeData)             Transfer learning 0.003%
PeMS-Bay   325      52116 (5-min)            Gaussian kernel (METR-LA)      Transfer learning 0.003%
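As an illustration of the adjacency construction above, the thresholded Gaussian kernel of Eq. (5) can be sketched as follows. This is a minimal sketch under our own naming (function name and toy distances are illustrative, not part of the released code); the Eq. (6) variant simply drops the squaring.

```python
import numpy as np

def gaussian_kernel_adjacency(dist, sigma, threshold=0.0):
    """Thresholded Gaussian kernel weights from a pairwise distance matrix.

    dist: (N, N) array of pairwise distances (road-network or Haversine).
    sigma: kernel width, e.g. the standard deviation of the distances.
    threshold: weights below this value are zeroed to sparsify the graph.
    """
    W = np.exp(-(dist / sigma) ** 2)
    W[W < threshold] = 0.0
    return W

# Toy example with three sensors; the far-away pair gets pruned.
dist = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 4.0],
                 [5.0, 4.0, 0.0]])
W = gaussian_kernel_adjacency(dist, sigma=dist.std(), threshold=0.1)
```

In practice, `sigma` would be the standard deviation of the observed sensor distances (6696 m for METR-LA), and `dist` would come from the road network or the Haversine formula.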
B Implementation details of IGNNK
Why use diffusion convolution: There exists a large body of work on GCN structures. We compared diffusion convolution [14] with both GCN [20] and ChebyNet [19], and found that diffusion convolution outperforms the other two baselines. In addition, diffusion convolution is a structure specifically designed for spatiotemporal data, and it can be applied to directed networks.
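For concreteness, the bidirectional diffusion convolution of [14] can be sketched in NumPy as below. This is a simplified sketch under our own notation (the actual DGCN implementation also includes biases and nonlinear activations between layers):

```python
import numpy as np

def diffusion_conv(X, W, thetas_f, thetas_b):
    """Bidirectional diffusion convolution over a directed graph.

    X: (N, F) node features; W: (N, N) weighted adjacency matrix.
    thetas_f / thetas_b: lists of (F, F_out) weight matrices, one per
    diffusion step, for the forward (D_O^{-1} W) and backward
    (D_I^{-1} W^T) random walks.
    """
    eps = 1e-10  # guard against isolated nodes with zero degree
    T_f = W / (W.sum(axis=1, keepdims=True) + eps)      # out-degree walk
    T_b = W.T / (W.T.sum(axis=1, keepdims=True) + eps)  # in-degree walk
    out = 0.0
    H_f, H_b = X, X
    for th_f, th_b in zip(thetas_f, thetas_b):
        out = out + H_f @ th_f + H_b @ th_b
        H_f, H_b = T_f @ H_f, T_b @ H_b  # advance one diffusion step
    return out

rng = np.random.default_rng(0)
W = rng.random((5, 5))
X = rng.random((5, 3))
K = 2  # diffusion order, as in Table 4
out = diffusion_conv(X, W, [rng.random((3, 4)) for _ in range(K + 1)],
                     [rng.random((3, 4)) for _ in range(K + 1)])
```

Because the forward and backward walks are normalized separately, the layer handles asymmetric (directed) adjacency matrices, which is why it suits road networks.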
The parameters of the GNNs: We use 3-layer DGCNs for the experiments on the different datasets. The parameters for training the GNNs are given in Table 4. The number of iterations I_max is determined by the temporal length of each dataset, scaled by the window length h and batch size S.

Table 4: IGNNK parameters for each dataset.

Parameters                            METR-LA  NREL    USHCN  SeData  PeMS-Bay (0-1 / Gaussian)
window length h                       24       16      6      24      24
number of evaluation windows (test)   428      1971    72     1314    2171
number of kriging nodes (n_ut)        50       30      300    80      80
sampled observed nodes size n_o       100      100     900    240     240
sampled masked nodes size n_m         50       30      300    80      80
hidden feature dimension z            100      100     100    350     100
activation function σ                 relu     relu    relu   relu    relu
batch size (S)                        4        8       8      4       4
number of iterations (I_max)          186750   287250  30750  574500  285000
order of diffusion convolution        2        2       2      2       2

We use the Adam optimizer with learning rate 0.0001 to optimize the GNNs. Figure 4 shows the test RMSE curves on METR-LA. The test RMSE of DGCN is stable after 200 epochs of training, without overfitting. We also plot the learning curves using 3 layers of GCN and ChebyNet. For GCN, we use the selu activation because it outperforms relu in our experiments; for ChebyNet, the convolution order is set to 3.
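The sampling-related parameters above (n_o, n_m, h) can be read together as one training iteration: sample a random subgraph and time window, hide n_m of the n_o nodes, and reconstruct all signals. A simplified sketch under our own naming (batching over S samples is omitted):

```python
import numpy as np

def sample_training_batch(X, W, n_o, n_m, h, rng):
    """Sample a random subgraph and time window and build the masked input.

    X: (N, T) full signal matrix; W: (N, N) adjacency matrix.
    Returns the masked subgraph signals, the subgraph adjacency, the 0/1
    mask (n_m of the n_o sampled nodes are hidden), and the target window.
    """
    N, T = X.shape
    nodes = rng.choice(N, size=n_o, replace=False)  # random subgraph
    t0 = rng.integers(0, T - h)                     # random time window
    X_sub = X[nodes, t0:t0 + h]                     # (n_o, h) window
    W_sub = W[np.ix_(nodes, nodes)]                 # subgraph adjacency
    mask = np.ones((n_o, h))
    masked = rng.choice(n_o, size=n_m, replace=False)
    mask[masked] = 0.0                              # hide "unsampled" sensors
    return X_sub * mask, W_sub, mask, X_sub

rng = np.random.default_rng(0)
X = rng.random((207, 1000))
W = rng.random((207, 207))
X_in, W_sub, mask, target = sample_training_batch(X, W, 100, 50, 24, rng)
```

Because a fresh subgraph and adjacency matrix are drawn every iteration, the learned message passing is not tied to one fixed graph, which is what makes the model inductive.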
Evaluation and real-time kriging: For model evaluation, we provide the information Y^s_t during the time period [t, t + h) of the new graph and mask the unsampled nodes. We use the reconstruction errors on the unsampled nodes Y^u_t for evaluation. We perform rolling recovery, i.e., [t + h, t + 2h) with Y^s_{t+h} and [t + 2h, t + 3h) with Y^s_{t+2h}, using the same model on the test data.

Evaluation of transfer learning performance: The PeMS-Bay data between Jan 1st 2017 and May 13th 2017 are used to evaluate the model. We apply the two trained models, IGNNK (METR-LA) and IGNNK (SeData), to the PeMS-Bay dataset with 80 evaluation sensors. Note that we also define and reconstruct the corresponding binary adjacency matrix W_{ij} when applying IGNNK (SeData).
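The rolling recovery above can be sketched as follows; the `model` callable is a stand-in for a trained IGNNK, and all names here are ours:

```python
import numpy as np

def rolling_kriging(model, Y_s, h):
    """Recover signals window by window with a fixed trained model.

    Y_s: (n_s, T) observed signals; model(window) -> reconstruction of the
    same shape. Each window [t, t+h) is interpolated independently, i.e.
    no retraining is needed at test time.
    """
    n_s, T = Y_s.shape
    outputs = []
    for t in range(0, T - h + 1, h):
        outputs.append(model(Y_s[:, t:t + h]))
    return np.concatenate(outputs, axis=1)

# Toy "model" that simply echoes its input, to show the window mechanics.
Y = np.arange(24, dtype=float).reshape(2, 12)
out = rolling_kriging(lambda w: w, Y, h=4)
```

In the real pipeline, `model` would take both the window and the reconstructed adjacency matrix of the new graph, and the unsampled rows of each window would be zero-filled before the call.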
Figure 4: Test RMSE learning curves on the METR-LA test set for DGCN, ChebyNet, and GCN.
Computing infrastructure description: We use an Alienware R8 workstation with an 8-core CPU at 3832.495 MHz, a GeForce RTX 2080 Ti GPU (8 GB), and 12 GB memory for the Python-based experiments. Training an IGNNK model takes about 15–30 minutes, depending on the dataset.
C Benchmark settings
C.1 Baseline models

kNN
For METR-LA/PeMS-Bay, the distance is defined by the road distance in the transportation network; for NREL/USHCN, it is defined by the Haversine distance. Rather than simply averaging the K nearest neighbors, we use inverse distance weighting for prediction:

\hat{x} = \frac{\sum_{i=1}^{K} x_i / d_i}{\sum_{i=1}^{K} 1 / d_i},

where d_i is the distance between the unknown node and the corresponding observed neighbor. For each dataset, we choose the optimal K by comparing the kriging RMSE errors.
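The inverse-distance-weighted kNN prediction can be sketched as below (function name and toy values are ours):

```python
import numpy as np

def idw_knn(dists, values, k):
    """Inverse-distance-weighted prediction from the k nearest observed nodes.

    dists: (n_obs,) distances from the unknown node to the observed nodes.
    values: (n_obs,) observed signal values at one time point.
    """
    idx = np.argsort(dists)[:k]               # k nearest neighbors
    w = 1.0 / np.maximum(dists[idx], 1e-10)   # guard against zero distance
    return np.sum(w * values[idx]) / np.sum(w)

d = np.array([1.0, 2.0, 10.0])
v = np.array([60.0, 30.0, 0.0])
x_hat = idw_knn(d, v, k=2)  # (60/1 + 30/2) / (1 + 1/2) = 50.0
```

The distant third sensor is excluded by the k cutoff, and the closer of the two remaining neighbors dominates the weighted average.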
OKriging
Ordinary kriging is a geostatistical method that interpolates values by a Gaussian process governed by a prior spatial covariance. We implement ordinary kriging with Automap [31, 32], which automates the kriging process by adaptively fitting variograms and testing a number of different models. In our experimental setting, we test spherical, exponential, Gaussian, Matérn, and Stein variograms, and pick the one with the smallest residual sum of squares. Note that we apply the method separately for each snapshot window [t, t + h) in the test data. In other words, at each time point t, we provide the corresponding column vector in Y^s_t to the algorithm to obtain the interpolated column vector as the kriging result. This task is embarrassingly parallel over all time points.
Kernelized Probabilistic Matrix Factorization [7] exploits a graph kernel as side information to perform matrix completion; it can infer rows/columns with no observed entries. It was originally designed for recommender systems, but it extends to spatiotemporal tasks. Given an input matrix with N sensors and T time points, we use the same adjacency matrices as in Appendix A to build the Laplacian graph kernels. We implement this model using the MATLAB code provided by the authors (https://people.eecs.berkeley.edu/~tinghuiz/). We focus on its spatial kriging performance and impose an identity matrix as the temporal kernel. In the implementation, we use stochastic gradient descent for optimization, with the maximum number of iterations set to 100. We perform a grid search over three critical parameters using validation RMSE: the variance σ, the size of the latent dimension D, and the graph weight γ. The tuned parameters for KPMF are listed in Table 5.

Table 5: KPMF parameters for each dataset.

Parameter  METR-LA  NREL  USHCN  SeData  PeMS-Bay
D          20       20    20     20      40

GLTL

Greedy Low-rank Tensor Learning [2] is a model for both cokriging and forecasting tasks with multiple variables; the number of variables is 1 for our datasets. We implement it using the MATLAB source code provided by the authors (http://roseyu.com/code.html) and choose the Orthogonal algorithm, as it shows superior performance in [2]. We set the maximum number of iterations to 100. We use the same adjacency matrices as for IGNNK. For directed networks (i.e., METR-LA, SeData, and PeMS-Bay), we set W = (W + W^T)/2 to obtain the undirected version. The Laplacian matrix is computed as L = D − W, where D is a diagonal matrix with D_{ii} = \sum_j W_{ij}. Following the implementation in [2], we also rescale the Laplacian matrix by L = L / \max_{ij} L_{ij}. The other critical parameter is µ, the weight of the Laplacian regularizer, which we tune by grid search, validating on the first 70% of time points of the "unsampled" sensors. We reach convergence before 100 iterations on all datasets. The tuned µ for each dataset is listed in Table 6.

sRMGCNN

Separable Recurrent Multi-Graph CNN is a transductive matrix completion framework that uses GCNs to learn graph structures. We implement it using the source code on GitHub (https://github.com/fmonti/mgcnn/). We need to define the largest possible entry value during training, which can be viewed as the highest rating in a recommender system; this value is chosen separately for each of the five datasets.
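The Laplacian construction used for the GLTL regularizer can be sketched as follows (function name ours):

```python
import numpy as np

def rescaled_laplacian(W):
    """Symmetrized, rescaled graph Laplacian for the GLTL regularizer.

    A directed adjacency matrix is first symmetrized; the Laplacian
    L = D - W is then rescaled by its largest entry, following the
    implementation in [2].
    """
    W = (W + W.T) / 2.0                 # undirected version
    D = np.diag(W.sum(axis=1))          # degree matrix
    L = D - W
    return L / L.max()

# Toy directed chain 0 -> 1 -> 2 with unequal weights.
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
L = rescaled_laplacian(W)
```

The result is symmetric with zero row sums, as required of a combinatorial Laplacian, and its largest entry is normalized to 1.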
C.2 Implementation of matrix completion models
There are two approaches to applying the matrix completion models. The first is to train the models on the whole matrix of each dataset presented in Table 3. The second is to train and validate the model on the first 70% of time points to obtain optimal parameters, and then train another model on the remaining 30% (test) with those parameters. Both approaches provide the whole reconstruction/interpolation at once. For all the transductive matrix completion models, we find that the first approach, which uses the information of all time points (i.e., the whole matrix), gives better results than the second, because it can also leverage historical information. We implement all these algorithms on a workstation with a Xeon E5-2698 v4 CPU and a Tesla V100 GPU.
C.3 Evaluation metrics
We use the following metrics to compare model performance.

Root Mean Square Error (RMSE):

\mathrm{RMSE}(x, \hat{x}) = \sqrt{ \frac{1}{|N|} \sum_{i \in N} (x_i - \hat{x}_i)^2 }.

Mean Absolute Error (MAE):

\mathrm{MAE}(x, \hat{x}) = \frac{1}{|N|} \sum_{i \in N} | x_i - \hat{x}_i |.

Mean Absolute Percentage Error (MAPE):

\mathrm{MAPE}(x, \hat{x}) = \frac{1}{|N|} \sum_{i \in N} \left| \frac{x_i - \hat{x}_i}{x_i} \right|.

D Visualization of results
We provide spatial and temporal visualizations to show the performance of IGNNK. For the spatial visualization, we select crowded and uncrowded periods to examine how well our model captures sharp peaks and fits the overall trends.
Spatial visualization
Since SeData has no geographical information, we cannot plot its sensor locations. In Figures 5, 6, and 7, we depict the spatial visualizations for the METR-LA, USHCN, and NREL datasets.

Figure 5: Spatial visualization of kriging performance during crowded and uncrowded periods on the METR-LA dataset. Compared are the ground truth, IGNNK, kNN, and GLTL (from left to right in each row).

Figure 6: Spatial visualization of kriging performance during high- and low-precipitation periods for the USHCN dataset.

For all figures, we select the crowded/high-precipitation/noon periods to illustrate how IGNNK captures peaks, and the uncrowded/low-precipitation/evening periods to examine how well it fits trends. IGNNK consistently outperforms the other baseline models; the corresponding evaluation metrics can be found in Table 1.
Temporal visualization
Figure 8 gives the temporal visualization of the kriging results. For each dataset, we select a time window to demonstrate the performance of the different models. As can be seen, IGNNK provides a better fit to the true data than the other baseline models.
Transfer learning temporal visualization
Figure 9 shows the results for the transfer learning tasks. We select one of the unknown sensors in PeMS-Bay as an example. Figure 9(a) shows a period in which IGNNK outperforms kNN; Figure 9(b) gives another period in which kNN performs better than IGNNK. In both examples, we show the results from the transferred models. As can be seen, the transferred models from METR-LA and SeData provide performance comparable to, if not better than, GLTL, the best transductive model. This result verifies that our model can indeed learn the inductive message passing mechanism. Detailed quantitative results can be found in Table 2.

Figure 7: Spatial visualization of kriging performance at noon and in the evening for the NREL solar system in Alabama.

Figure 8: Temporal visualization of kriging results on (a) METR-LA, (b) USHCN, (c) NREL, and (d) SeData, comparing the true values with IGNNK, KPMF, GLTL, kNN, OKriging, and sRGCNN.
Neural Network Weights Visualization
To better illustrate the generalization ability in transfer learning and the power of the diffusion process, we visualize the network weights Θ in Eq. (1) of the three DGCN layers, as shown in Figure 10. We group models that share the same adjacency matrix and average their parameters to make them equally sized. Even though the parameters in the first layer differ greatly across models, they ultimately tend toward similar distributions. This gives an intuitive understanding of why transfer learning works: the message passing mechanism is learned by a deep GNN. Furthermore, Figure 10(c) shows that the weights of models with binary adjacency matrices tend to be sharper than those of the Gaussian-kernel-based ones. This might explain why they are less accurate than the Gaussian-kernel-based models, as they focus more on local smoothness and capture less long-range information.

Figure 9: Kriging performance on unknown nodes in the PeMS-Bay dataset for all baseline models: (a) a period in which transfer is successful; (b) a period in which transfer is unsuccessful.
E Explanation with Partial Differential Equation
Many real-world dynamic systems can be represented as a relation between temporal and spatial derivatives:

\sum_{i=1}^{M} \frac{\partial^i u}{\partial t^i} = \sum_{i=1}^{N} a_i(u, x) \frac{\partial^i u}{\partial x^i},    (8)

where u is a physical quantity (e.g., traffic speed or precipitation), x is the spatial direction, and t is the time point. M and N are the highest orders of the temporal and spatial derivatives, respectively. It has been suggested that any partial differential equation (PDE) of the form of Equation 8 can be represented by a graph neural network [33].

To perform spatiotemporal kriging, we are particularly interested in how the target physical quantity u varies along the spatial direction. As stated before, we model the interaction of u in the temporal direction with fully connected neural networks. The continuous version of the simplified (1-layer) IGNNK model can be represented by the following PDE:

u(x, t) = L \frac{\partial a(u(x, t))}{\partial x},    (9)

where L is the Laplace operator determined by the GNN structure used, and a(u(x, t)) is a function represented by a fully connected neural network. The goal is to obtain the parameters of the function a(·); the physical quantity u at unknown locations x can then be estimated.
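As a toy analogue (not the exact model above), one explicit Euler step of a graph heat equation du/dt = −L a(u) shows how a Laplacian-based GNN layer behaves like a discrete PDE integrator; all names here are illustrative:

```python
import numpy as np

def graph_pde_step(u, L, a, dt):
    """One explicit Euler step of du/dt = -L a(u) on a graph.

    L: graph Laplacian standing in for the spatial differential operator;
    a: a pointwise function (learned in IGNNK, fixed here). Stacking such
    steps is the discrete analogue of a Laplacian-based GNN layer.
    """
    return u - dt * L @ a(u)

# Path graph 0-1-2: with a = identity this is heat diffusion, which
# smooths an initial spike toward the uniform state.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
u = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    u = graph_pde_step(u, L, lambda x: x, dt=0.1)
```

Because the Laplacian has zero row sums, each step conserves the total "mass" of the signal while spreading it across neighbors, which is the intuition behind using graph message passing to recover values at unobserved nodes.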
Figure 10: Histograms of the averaged diffusion learning parameters Θ for models sharing the same adjacency matrix (METR-LA / PeMS-Bay Gaussian, and SeData / PeMS-Bay binary): (a) first-layer DGCN; (b) second-layer DGCN; (c) third-layer DGCN.