Inductive Representation Learning on Temporal Graphs

Abstract

Inductive representation learning on temporal graphs is an important step toward scalable machine learning on real-world dynamic networks. The evolving nature of temporal dynamic graphs requires handling new nodes as well as capturing temporal patterns. The node embeddings, which are now functions of time, should represent both the static node features and the evolving topological structures. Moreover, node and topological features can be temporal as well, and the node embeddings should also capture their patterns. We propose the temporal graph attention (TGAT) layer to efficiently aggregate temporal-topological neighborhood features as well as to learn the time-feature interactions. For TGAT, we use the self-attention mechanism as a building block and develop a novel functional time encoding technique based on the classical Bochner's theorem from harmonic analysis. By stacking TGAT layers, the network recognizes the node embeddings as functions of time and is able to inductively infer embeddings for both new and observed nodes as the graph evolves. The proposed approach handles both node classification and link prediction tasks, and can be naturally extended to include temporal edge features. We evaluate our method with transductive and inductive tasks under temporal settings with two benchmark datasets and one industrial dataset. Our TGAT model compares favorably to state-of-the-art baselines as well as the previous temporal graph embedding approaches.


Published as a conference paper at ICLR 2020

INDUCTIVE REPRESENTATION LEARNING ON TEMPORAL GRAPHS

Da Xu*, Chuanwei Ruan*, Evren Korpeoglu, Sushant Kumar, Kannan Achan
Walmart Labs, Sunnyvale, CA 94086, USA
{Da.Xu,Chuanwei.Ruan,EKorpeoglu,SKumar4,KAchan}@walmartlabs.com


1 INTRODUCTION

The technique of learning lower-dimensional vector embeddings on graphs has been widely applied to graph analysis tasks (Perozzi et al., 2014; Tang et al., 2015; Wang et al., 2016) and deployed in industrial systems (Ying et al., 2018; Wang et al., 2018a). Most graph representation learning approaches only accept static or non-temporal graphs as input, despite the fact that many graph-structured data are time-dependent. In social networks, citation networks, question answering forums and user-item interaction systems, graphs are created as temporal interactions between nodes. Using the final state as a static portrait of the graph is reasonable in some cases, such as the protein-protein interaction network, as long as node interactions are timeless in nature. Otherwise, ignoring the temporal information can severely diminish the modelling efforts and even cause questionable inference. For instance, models may mistakenly utilize future information for predicting past interactions during training and testing if the temporal constraints are disregarded. More importantly, the dynamic and evolving nature of many graph-related problems demands explicit modelling of the timeliness whenever nodes and edges are added, deleted or changed over time.

Learning representations on temporal graphs is extremely challenging, and it is not until recently that several solutions have been proposed (Nguyen et al., 2018; Li et al., 2018; Goyal et al., 2018; Trivedi et al., 2018). We summarize the challenges in three points. Firstly, to model the temporal dynamics, node embeddings should be not only projections of topological structures and node features but also functions of the continuous time. Therefore, in addition to the usual vector space, temporal representation learning should operate in some functional space as well. Secondly, graph topological structures are no longer static since the nodes and edges evolve over time, which poses temporal constraints on neighborhood aggregation methods. Thirdly, node features and topological structures can exhibit temporal patterns. For example, node interactions that took place long ago may have less impact on the current topological structure and thus the node embeddings. Also, some nodes may possess features that allow them to have more regular or recurrent interactions with others. We provide sketched plots for visual illustration in Figure 1.

* Both authors contributed equally to this research.

Figure 1: Visual illustration of several complications arising from temporal graphs. (A) The generation process of a temporal graph and its snapshots. The static graphs in the snapshots reflect only partial temporal information. (B) The final state of the temporal graph when projected onto the time-independent 2-D plane. Besides the missing temporal information, the multi-edge situation arises as well. (C) When predicting the link between nodes A and C at time t, the message-passing paths should be subject to temporal constraints. The solid lines give the appropriate directions, and the dashed lines violate the temporal constraints.

Similar to its non-temporal counterparts, in real-world applications, models for representation learning on temporal graphs should be able to quickly generate embeddings whenever required, in an inductive fashion.

GraphSAGE (Hamilton et al., 2017a) and the graph attention network (GAT) (Veličković et al., 2017) are capable of inductively generating embeddings for unseen nodes based on their features; however, they do not consider the temporal factors. Most temporal graph embedding methods can only handle transductive tasks, since they require re-training or computationally expensive gradient calculations to infer embeddings for unseen nodes or node embeddings for a new time point. In this work, we aim to develop an architecture that inductively learns representations for temporal graphs such that the time-aware embeddings (for unseen and observed nodes) can be obtained via a single network forward pass. The key to our approach is the combination of the self-attention mechanism (Vaswani et al., 2017) and a novel functional time encoding technique derived from Bochner's theorem from classical harmonic analysis (Loomis, 2013).

The motivation for adapting self-attention to inductive representation learning on temporal graphs is to identify and capture relevant pieces of the temporal neighborhood information. Both the graph convolutional network (GCN) (Kipf & Welling, 2016a) and GAT implicitly or explicitly assign different weights to neighboring nodes (Veličković et al., 2017) when aggregating node features. The self-attention mechanism was initially designed to recognize the relevant parts of the input sequence in natural language processing. As a discrete-event sequence learning method, self-attention outputs a vector representation of the input sequence as a weighted sum of individual entry embeddings. Self-attention enjoys several advantages such as parallelized computation and interpretability (Vaswani et al., 2017). Since it captures sequential information only through the positional encoding, temporal features cannot be handled. Therefore, we are motivated to replace positional encoding with some vector representation of time. Since time is a continuous variable, the mapping from the time domain to vector space has to be functional. We gain insights from harmonic analysis and propose a theoretically grounded functional time encoding approach that is compatible with the self-attention mechanism. The temporal signals are then modelled by the interactions between the functional time encoding and node features as well as the graph topological structures.

To evaluate our approach, we consider future link prediction on the observed nodes as the transductive learning task, and on the unseen nodes as the inductive learning task. We also examine the dynamic node classification task using node embeddings (temporal versus non-temporal) as features to demonstrate the usefulness of our functional time encoding. We carry out extensive ablation studies and sensitivity analysis to show the effectiveness of the proposed functional time encoding and TGAT layer.

2 RELATED WORK

Graph representation learning. Spectral graph embedding models operate on the graph spectral domain by approximating, projecting or expanding the graph Laplacian (Kipf & Welling, 2016a; Henaff et al., 2015; Defferrard et al., 2016). Since their training and inference are conditioned on the specific graph spectrum, they are not directly extendable to temporal graphs. Non-spectral approaches, such as GAT, GraphSAGE and MoNET (Monti et al., 2017), rely on localized neighbourhood aggregations and thus are not restricted to the training graph. GraphSAGE and GAT also have the flexibility to handle evolving graphs inductively. To extend classical graph representation learning approaches to the temporal domain, several attempts have been made by cropping the temporal graph into a sequence of graph snapshots (Li et al., 2018; Goyal et al., 2018; Rahman et al., 2018; Xu et al., 2019b), and some others work with temporally persistent nodes (edges) (Trivedi et al., 2018; Ma et al., 2018). Nguyen et al. (2018) propose a node embedding method based on temporal random walks and report state-of-the-art performance. However, their approach only generates embeddings for the final state of the temporal graph and cannot be directly applied to the inductive setting.

Self-attention mechanism. Self-attention mechanisms often have two components: the embedding layer and the attention layer. The embedding layer takes an ordered entity sequence as input. Self-attention uses the positional encoding, i.e. each position $k$ is equipped with a vector $p_k$ (fixed or learnt) which is shared for all sequences. For the entity sequence $e = (e_1, \ldots, e_l)$, the embedding layer takes the sum or concatenation of entity embeddings (or features) ($z \in \mathbb{R}^d$) and their positional encodings as input:

$$Z_e = \big[ z_{e_1} + p_1, \ldots, z_{e_l} + p_l \big]^\top \in \mathbb{R}^{l \times d}, \quad \text{or} \quad Z_e = \big[ z_{e_1} \,\|\, p_1, \ldots, z_{e_l} \,\|\, p_l \big]^\top \in \mathbb{R}^{l \times (d + d_{pos})}, \tag{1}$$

where $\|$ denotes the concatenation operation and $d_{pos}$ is the dimension of the positional encoding. Self-attention layers can be constructed using the scaled dot-product attention, which is defined as:

$$\text{Attn}(Q, K, V) = \text{softmax}\Big( \frac{Q K^\top}{\sqrt{d}} \Big) V, \tag{2}$$

where $Q$ denotes the 'queries', $K$ the 'keys' and $V$ the 'values'. In Vaswani et al. (2017), they are treated as projections of the output $Z_e$: $Q = Z_e W_Q$, $K = Z_e W_K$, $V = Z_e W_V$, where $W_Q$, $W_K$ and $W_V$ are the projection matrices. Since each row of $Q$, $K$ and $V$ represents an entity, the dot-product attention takes a weighted sum of the entity 'values' in $V$, where the weights are given by the interactions of entity 'query-key' pairs. The hidden representation for the entity sequence under the dot-product attention is then given by $h_e = \text{Attn}(Q, K, V)$.

3 TEMPORAL GRAPH ATTENTION NETWORK ARCHITECTURE

We first derive the mapping from the time domain to the continuous differentiable functional domain as the functional time encoding, such that the resulting formulation is compatible with the self-attention mechanism as well as backpropagation-based optimization frameworks. The same idea was explored in a concurrent work (Xu et al., 2019a). We then present the temporal graph attention layer and show how it can be naturally extended to incorporate edge features.

3.1 FUNCTIONAL TIME ENCODING

Recall that our starting point is to obtain a continuous functional mapping $\Phi: T \to \mathbb{R}^{d_T}$ from the time domain to the $d_T$-dimensional vector space to replace the positional encoding in (1). Without loss of generality, we assume that the time domain can be represented by the interval starting from the origin: $T = [0, t_{max}]$, where $t_{max}$ is determined by the observed data. For the inner-product self-attention in (2), often the 'key' and 'query' matrices ($K$, $Q$) are given by the identity or a linear projection of $Z_e$ defined in (1), leading to terms that only involve inner products between positional (time) encodings. Consider two time points $t_1, t_2$ and the inner product between their functional encodings $\langle \Phi(t_1), \Phi(t_2) \rangle$. Usually, the relative timespan, rather than the absolute value of time, reveals critical temporal information. Therefore, we are more interested in learning patterns related to the timespan $|t_1 - t_2|$, which should ideally be expressed by $\langle \Phi(t_1), \Phi(t_2) \rangle$ to be compatible with self-attention.

Formally, we define the temporal kernel $\mathcal{K}: T \times T \to \mathbb{R}$ with $\mathcal{K}(t_1, t_2) := \langle \Phi(t_1), \Phi(t_2) \rangle$ and $\mathcal{K}(t_1, t_2) = \psi(t_1 - t_2)$, $\forall t_1, t_2 \in T$, for some $\psi: [-t_{max}, t_{max}] \to \mathbb{R}$. The temporal kernel is then translation-invariant, since $\mathcal{K}(t_1 + c, t_2 + c) = \psi(t_1 - t_2) = \mathcal{K}(t_1, t_2)$ for any constant $c$. Generally speaking, functional learning is extremely complicated since it operates on infinite-dimensional spaces, but now we have transformed the problem into learning the temporal kernel $\mathcal{K}$ expressed by $\Phi$. Nonetheless, we still need to figure out an explicit parameterization for $\Phi$ in order to conduct efficient gradient-based optimization. Classical harmonic analysis theory, i.e. Bochner's theorem, motivates our final solution. We point out that the temporal kernel $\mathcal{K}$ is positive-semidefinite (PSD) and continuous, since it is defined via a Gram matrix and the mapping $\Phi$ is continuous. Therefore, the kernel $\mathcal{K}$ defined above satisfies the assumptions of Bochner's theorem, which we state below.

Theorem 1 (Bochner's Theorem). A continuous, translation-invariant kernel $\mathcal{K}(x, y) = \psi(x - y)$ on $\mathbb{R}^d$ is positive definite if and only if there exists a non-negative measure on $\mathbb{R}$ such that $\psi$ is the Fourier transform of the measure.

Consequently, when scaled properly, our temporal kernel $\mathcal{K}$ has the alternate expression:

$$\mathcal{K}(t_1, t_2) = \psi(t_1 - t_2) = \int_{\mathbb{R}} e^{i\omega(t_1 - t_2)} p(\omega)\, d\omega = \mathbb{E}_\omega\big[ \xi_\omega(t_1)\, \xi_\omega(t_2)^* \big], \tag{3}$$

where $\xi_\omega(t) = e^{i\omega t}$. Since the kernel $\mathcal{K}$ and the probability measure $p(\omega)$ are real, we extract the real part of (3) and obtain:

$$\mathcal{K}(t_1, t_2) = \mathbb{E}_\omega\big[ \cos(\omega(t_1 - t_2)) \big] = \mathbb{E}_\omega\big[ \cos(\omega t_1)\cos(\omega t_2) + \sin(\omega t_1)\sin(\omega t_2) \big]. \tag{4}$$

The above formulation suggests approximating the expectation by the Monte Carlo integral (Rahimi & Recht, 2008), i.e.

$$\mathcal{K}(t_1, t_2) \approx \frac{1}{d} \sum_{i=1}^{d} \cos(\omega_i t_1)\cos(\omega_i t_2) + \sin(\omega_i t_1)\sin(\omega_i t_2), \quad \omega_1, \ldots, \omega_d \overset{\text{i.i.d.}}{\sim} p(\omega).$$

Therefore, we propose the finite dimensional functional mapping to $\mathbb{R}^d$ as:

$$t \mapsto \Phi_d(t) := \sqrt{\frac{1}{d}} \big[ \cos(\omega_1 t), \sin(\omega_1 t), \ldots, \cos(\omega_d t), \sin(\omega_d t) \big], \tag{5}$$

and it is easy to show that $\langle \Phi_d(t_1), \Phi_d(t_2) \rangle \approx \mathcal{K}(t_1, t_2)$. As a matter of fact, we prove the stochastic uniform convergence of $\langle \Phi_d(t_1), \Phi_d(t_2) \rangle$ to the underlying $\mathcal{K}(t_1, t_2)$ and show that it takes only a reasonable number of samples to achieve proper estimation, which is stated in Claim 1.

Claim 1. Let $p(\omega)$ be the corresponding probability measure stated in Bochner's Theorem for the kernel function $\mathcal{K}$. Suppose the feature map $\Phi$ is constructed as described above using samples $\{\omega_i\}_{i=1}^{d}$; then we only need $d = \Omega\big( \frac{1}{\epsilon^2} \log \frac{\sigma_p^2 t_{max}}{\epsilon} \big)$ samples to have

$$\sup_{t_1, t_2 \in T} \big| \Phi_d(t_1)' \Phi_d(t_2) - \mathcal{K}(t_1, t_2) \big| < \epsilon$$

with any probability for $\forall \epsilon > 0$, where $\sigma_p^2$ is the second moment with respect to $p(\omega)$.

The proof is provided in the supplementary material.

By applying Bochner's theorem, we convert the problem of kernel learning to distribution learning, i.e. estimating the $p(\omega)$ in Theorem 1. A straightforward solution is to apply the reparameterization trick by using auxiliary random variables with a known marginal distribution, as in variational autoencoders (Kingma & Welling, 2013). However, the reparameterization trick is often limited to certain distributions such as the 'location-scale' family, which may not be rich enough for our purpose. For instance, when $p(\omega)$ is multimodal it is difficult to reconstruct the underlying distribution via direct reparameterizations. An alternate approach is to use the inverse cumulative distribution function (CDF) transformation. Rezende & Mohamed (2015) propose using parameterized normalizing flows, i.e. a sequence of invertible transformation functions, to approximate an arbitrarily complicated CDF and efficiently sample from it. Dinh et al. (2016) further consider stacking bijective transformations, known as affine coupling layers, to achieve more effective CDF estimation. The above methods learn the inverse CDF function $F_\theta^{-1}(\cdot)$ parameterized by flow-based networks and draw samples from the corresponding distribution. On the other hand, if we consider a non-parameterized approach for estimating the distribution, then learning $F^{-1}(\cdot)$ and obtaining $d$ samples from it is equivalent to directly optimizing the $\{\omega_1, \ldots, \omega_d\}$ in (4) as free model parameters.
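As a minimal sketch of the encoding in (5), the snippet below builds $\Phi_d(t)$ in plain Python and checks two properties stated above: the inner product depends only on the timespan, and it approximates the kernel in (4). The function names (`time_encoding`, `kernel_estimate`) are illustrative and do not come from the paper. Drawing the frequencies from a Gaussian is an assumption made purely so that the kernel has a known closed form; in the non-parametric variant described here the frequencies would instead be free parameters learned with the rest of the model.

```python
import math
import random


def time_encoding(t, omegas):
    """Finite-dimensional map from (5): sqrt(1/d) * [cos(w_1 t), sin(w_1 t), ...]."""
    scale = math.sqrt(1.0 / len(omegas))
    features = []
    for w in omegas:
        features.append(scale * math.cos(w * t))
        features.append(scale * math.sin(w * t))
    return features


def kernel_estimate(t1, t2, omegas):
    """Inner product <Phi_d(t1), Phi_d(t2)>, i.e. (1/d) * sum_i cos(w_i * (t1 - t2))."""
    phi1 = time_encoding(t1, omegas)
    phi2 = time_encoding(t2, omegas)
    return sum(a * b for a, b in zip(phi1, phi2))


# Assumption for illustration only: frequencies sampled from N(0, sigma^2).
rng = random.Random(0)
sigma = 0.5
omegas = [rng.gauss(0.0, sigma) for _ in range(20000)]

t1, t2, shift = 3.0, 1.5, 10.0
estimate = kernel_estimate(t1, t2, omegas)
# Shifting both time points by the same constant should not change the value.
shifted = kernel_estimate(t1 + shift, t2 + shift, omegas)
# For a Gaussian p(omega), E[cos(omega * delta)] = exp(-sigma^2 * delta^2 / 2).
exact = math.exp(-0.5 * (sigma * (t1 - t2)) ** 2)
```

Under the Gaussian assumption, `estimate` should land close to `exact`, and `shifted` should match `estimate` up to floating-point error, which is the translation invariance the kernel construction relies on.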
In practice, we find the flow-based and the non-parametric approaches to have highly comparable performance (see supplementary material). Therefore we focus on the non-parametric approach, since it is more parameter-efficient and trains faster (as no sampling during training is required).

The above functional time encoding is fully compatible with self-attention; thus it can replace the positional encodings in (1), and its parameters are jointly optimized as part of the whole model.

Figure 2: The architecture of the $l$-th TGAT layer with $k = 3$ attention heads for node $v_0$ at time $t$.

3.2 TEMPORAL GRAPH ATTENTION LAYER

We use $v_i$ and $x_i \in \mathbb{R}^{d}$ to denote node $i$ and its raw node features. The proposed TGAT architecture depends solely on the temporal graph attention layer (TGAT layer). In analogy to GraphSAGE and GAT, the TGAT layer can be thought of as a local aggregation operator that takes the temporal neighborhood with their hidden representations (or features) as well as timestamps as input, and the output is the time-aware representation for the target node at any time point $t$. We denote the hidden representation output for node $i$ at time $t$ from the $l$-th layer as $\tilde{h}_i^{(l)}(t)$.

Similar to GAT, we perform masked self-attention to take account of the structural information (Veličković et al., 2017). For node $v_0$ at time $t$, we consider its neighborhood $\mathcal{N}(v_0; t) = \{v_1, \ldots, v_N\}$ such that the interaction between $v_0$ and $v_i \in \mathcal{N}(v_0; t)$, which takes place at time $t_i$, is prior to $t$. (Node $v_i$ may have multiple interactions with $v_0$ at different time points. For the sake of presentation clarity, we do not explicitly differentiate such recurring interactions in our notations.) The input of the TGAT layer is the neighborhood information $Z = \big\{ \tilde{h}_1^{(l-1)}(t_1), \ldots, \tilde{h}_N^{(l-1)}(t_N) \big\}$ and the target node information with some time point $\big( \tilde{h}_0^{(l-1)}(t), t \big)$. When $l = 1$, i.e. for the first layer, the inputs are just raw node features. The layer produces the time-aware representation of target node $v_0$ at time $t$, denoted by $\tilde{h}_0^{(l)}(t)$, as its output. Due to the translation-invariant assumption for the temporal kernel, we can alternatively use $\{t - t_1, \ldots, t - t_N\}$ as interaction times, since $|t_i - t_j| = \big| (t - t_i) - (t - t_j) \big|$ and we only care about the timespan.

In line with the original self-attention mechanism, we first obtain the entity-temporal feature matrix as

$$Z(t) = \Big[ \tilde{h}_0^{(l-1)}(t) \,\|\, \Phi_{d_T}(0),\; \tilde{h}_1^{(l-1)}(t_1) \,\|\, \Phi_{d_T}(t - t_1),\; \ldots,\; \tilde{h}_N^{(l-1)}(t_N) \,\|\, \Phi_{d_T}(t - t_N) \Big]^\top \quad \text{(or use sum)} \tag{6}$$

and forward it to three different linear projections to obtain the 'query', 'key' and 'value':

$$q(t) = \big[ Z(t) \big]_0 W_Q, \quad K(t) = \big[ Z(t) \big]_{1:N} W_K, \quad V(t) = \big[ Z(t) \big]_{1:N} W_V,$$

where $W_Q, W_K, W_V \in \mathbb{R}^{(d + d_T) \times d_h}$ are the weight matrices that are employed to capture the interactions between time encoding and node features. For notational simplicity, in the following discussion we treat the dependence of the intermediate outputs on the target time $t$ as implicit. The attention weights $\{\alpha_i\}_{i=1}^{N}$ of the softmax function output in (2) are given by:

$$\alpha_i = \exp\big( q^\top K_i \big) \Big/ \Big( \sum_{j} \exp\big( q^\top K_j \big) \Big).$$

The attention weight $\alpha_i$ reveals how node $i$ attends to the features of node $v_0$ within the topological structure defined as $\mathcal{N}(v_0; t)$ after accounting for their interaction time with $v_0$. The self-attention therefore captures the temporal interactions with both node features and topological features and defines a local temporal aggregation operator on the graph. The hidden representation for any node $v_i \in \mathcal{N}(v_0; t)$ is given by $\alpha_i V_i$. The mechanism can be effectively shared across all nodes for any time point. We then take the row-wise sum from the above dot-product self-attention output as the hidden neighborhood representation, i.e.

$$h(t) = \text{Attn}\big( q(t), K(t), V(t) \big) \in \mathbb{R}^{d_h}.$$

To combine the neighborhood representation with the target node features, we follow GraphSAGE and concatenate the neighbourhood representation with the target node's feature vector $z_0$.
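The attention step of this section can be sketched for a single target node and a single head as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names (`time_encoding`, `project`, `tgat_attention`) are hypothetical, the frequencies and weight matrices are fixed toy values instead of learned parameters, and the softmax follows the unscaled form of $\alpha_i$ given above.

```python
import math


def time_encoding(dt, omegas):
    """sqrt(1/d) * [cos(w_1 dt), sin(w_1 dt), ..., cos(w_d dt), sin(w_d dt)]."""
    scale = math.sqrt(1.0 / len(omegas))
    out = []
    for w in omegas:
        out.extend([scale * math.cos(w * dt), scale * math.sin(w * dt)])
    return out


def project(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def tgat_attention(h_target, t, neighbors, omegas, W_Q, W_K, W_V):
    """Single-head temporal attention for one target node.

    neighbors is a list of (hidden_vector, interaction_time) pairs.
    Returns (attention weights, aggregated neighborhood representation).
    """
    # Masking: only interactions strictly before the target time may be used.
    valid = [(h, ti) for h, ti in neighbors if ti < t]
    if not valid:
        return [], [0.0] * len(W_V[0])

    # Target row of Z(t): features concatenated with the encoding of timespan 0.
    q = project(h_target + time_encoding(0.0, omegas), W_Q)

    keys, values = [], []
    for h, ti in valid:
        z = h + time_encoding(t - ti, omegas)  # features || Phi(t - t_i)
        keys.append(project(z, W_K))
        values.append(project(z, W_V))

    # Softmax over query-key scores (shifted by the max for numerical stability).
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Weighted sum of the value rows gives the neighborhood representation h(t).
    dim = len(values[0])
    h_out = [sum(a * v[j] for a, v in zip(alphas, values)) for j in range(dim)]
    return alphas, h_out


# Toy example: 2-d features, 2 frequencies (a 4-d time encoding), d_h = 2.
# Reusing one matrix for W_Q, W_K and W_V is a simplification for the demo.
omegas = [1.0, 0.1]
W = [[0.1 * ((i + j) % 3) for j in range(2)] for i in range(6)]
neighbors = [([0.5, 0.5], 9.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 12.0)]
alphas, h_out = tgat_attention([1.0, 0.0], 10.0, neighbors, omegas, W, W, W)
```

In the toy call the third neighbor interacts at time 12.0, after the target time 10.0, so it is masked out and only two attention weights are returned.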
We then pass the concatenated vector to a feed-forward neural network to capture non-linear interactions between the features as in (Vaswani et al., 2017):

$$\tilde{h}_0^{(l)}(t) = \text{FFN}\big( h(t) \,\|\, x_0 \big) \equiv \text{ReLU}\big( [h(t) \,\|\, x_0] W_0^{(l)} + b_0^{(l)} \big) W_1^{(l)} + b_1^{(l)},$$

$$W_0^{(l)} \in \mathbb{R}^{(d_h + d) \times d_f}, \quad W_1^{(l)} \in \mathbb{R}^{d_f \times d}, \quad b_0^{(l)} \in \mathbb{R}^{d_f}, \quad b_1^{(l)} \in \mathbb{R}^{d},$$

where $\tilde{h}_0^{(l)}(t) \in \mathbb{R}^{d}$ is the final output representing the time-aware node embedding at time $t$ for the target node. Therefore, the TGAT layer can be implemented for the node classification task using the semi-supervised learning framework proposed in Kipf & Welling (2016a), as well as for the link prediction task with the encoder-decoder framework summarized by Hamilton et al. (2017b).

Veličković et al. (2017) suggest that using multi-head attention improves performance and stabilizes training for GAT. For generalization purposes, we also show that the proposed TGAT layer can be easily extended to the multi-head setting. Consider the dot-product self-attention outputs from a total of $k$ different heads, i.e. $h^{(i)} \equiv \text{Attn}^{(i)}\big( q(t), K(t), V(t) \big)$, $i = 1, \ldots, k$. We first concatenate the $k$ neighborhood representations into a combined vector and then carry out the same procedure:

$$\tilde{h}_0^{(l)}(t) = \text{FFN}\big( h^{(1)}(t) \,\|\, \ldots \,\|\, h^{(k)}(t) \,\|\, x_0 \big).$$

Just like GraphSAGE, a single TGAT layer aggregates the localized one-hop neighborhood, and by stacking $L$ TGAT layers the aggregation extends to $L$ hops. Similar to GAT, our approach does not restrict the size of the neighborhood. We provide a graphical illustration of our TGAT layer in Figure 2.

3.3 EXTENSION TO INCORPORATE EDGE FEATURES

We show that the TGAT layer can be naturally extended to handle edge features in a message-passing fashion. Simonovsky & Komodakis (2017) and Wang et al. (2018b) modify classical spectral-based graph convolutional networks to incorporate edge features. Battaglia et al. (2018) propose general graph neural network frameworks where edge features can be processed. For temporal graphs, we consider the general setting where each dynamic edge is associated with a feature vector, i.e. the interaction between $v_i$ and $v_j$ at time $t$ induces the feature vector $x_{i,j}(t)$. To propagate edge features during the TGAT aggregation, we simply extend the $Z(t)$ in (6) to:

$$Z(t) = \Big[ \ldots,\; \tilde{h}_i^{(l-1)}(t_i) \,\|\, x_{0,i}(t_i) \,\|\, \Phi_{d_T}(t - t_i),\; \ldots \Big] \quad \text{(or use summation)}, \tag{7}$$

such that the edge information is propagated to the target node's hidden representation, and then passed on to the next layer (if one exists). The remaining structures stay the same as in Section 3.2.

3.4 TEMPORAL SUB-GRAPH BATCHING

Stacking $L$ TGAT layers is equivalent to aggregating over the $L$-hop neighborhood. For each $L$-hop sub-graph that is constructed during batch-wise training, all message passing directions must be aligned with the observed chronological order. Unlike the non-temporal setting where each edge appears only once, in temporal graphs two nodes can have multiple interactions at different time points. Whether or not to allow loops that involve the target node should be judged case by case. Sampling from the neighborhood, also known as neighborhood dropout, may speed up and stabilize model training. For temporal graphs, neighborhood dropout can be carried out uniformly or weighted by the inverse timespan, such that more recent interactions have a higher probability of being sampled.

3.5 COMPARISONS TO RELATED WORK

The functional time encoding technique and the TGAT layer introduced in Sections 3.1 and 3.2 solve several critical challenges, and the TGAT network intrinsically connects to several prior methods.

• Instead of cropping temporal graphs into a sequence of snapshots or constructing time-constrained random walks, which inspired most of the current temporal graph embedding methods, we directly learn the functional representation of time. The proposed approach is motivated by and thus fully compatible with the well-established self-attention mechanism. Also, to the best of our knowledge, no previous work has discussed the temporal-feature interactions for temporal graphs, which are also considered in our approach.

• The TGAT layer is computationally efficient compared with RNN-based models, since the masked self-attention operation is parallelizable, as suggested by Vaswani et al. (2017). The per-batch time complexity of the TGAT layer with $k$ heads and $l$ layers can be expressed as $O\big( (k\tilde{N})^l \big)$, where $\tilde{N}$ is the average neighborhood size, which is comparable to GAT. When using multi-head attention, the computation for each head can be parallelized as well.

• The inference with TGAT is entirely inductive. With an explicit functional expression $\tilde{h}(t)$ for each node, the time-aware node embeddings can be easily inferred for any timestamp via a single network forward pass. Similarly, whenever the graph is updated, the embeddings for both unseen and observed nodes can be quickly inferred in an inductive fashion similar to that of GraphSAGE, and the computations can be parallelized across all nodes.

• GraphSAGE with mean pooling (Hamilton et al., 2017a) can be interpreted as a special case of the proposed method, where the temporal neighborhood is aggregated with equal attention coefficients. GAT is like the time-agnostic version of our approach but with a different formulation for self-attention, as they refer to the work of Bahdanau et al. (2014). We discuss the differences in detail in the Appendix. It is also straightforward to show our connections with memory networks (Sukhbaatar et al., 2015) by thinking of the temporal neighborhoods as memory. The techniques developed in our work may also help adapt GAT and GraphSAGE to temporal settings, as we show in our experiments.
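The remark that mean pooling corresponds to equal attention coefficients can be checked numerically with a small sketch. The helper names (`softmax`, `attend`) and the toy value vectors are illustrative assumptions, not taken from the paper; the point is only that identical query-key scores give weights of $1/N$, which turns the attention-weighted sum into the neighborhood mean.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(values, scores):
    """Attention-weighted sum of neighbor value vectors."""
    alphas = softmax(scores)
    dim = len(values[0])
    return [sum(a * v[j] for a, v in zip(alphas, values)) for j in range(dim)]


values = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]

# Identical scores give identical coefficients 1/N, i.e. mean pooling.
uniform = attend(values, [0.0, 0.0, 0.0])
mean = [sum(column) / len(values) for column in zip(*values)]

# Unequal scores shift the weight toward the higher-scoring neighbor.
skewed = attend(values, [0.0, 0.0, 2.0])
```

Here `uniform` coincides with `mean` up to floating-point error, while `skewed` moves toward the third neighbor's values, which is the extra flexibility that learned, time-dependent attention scores provide over mean pooling.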

4 EXPERIMENT AND RESULTS

We test the performance of the proposed method against a variety of strong baselines (adapted for temporal settings when possible) and competing approaches, for both the inductive and transductive tasks on two benchmark datasets and one large-scale industrial dataset.

4.1 DATASETS

Real-world temporal graphs consist of time-sensitive node interactions, evolving node labels as well as new nodes and edges. We choose the following datasets, which cover all of these scenarios.

Reddit dataset. We use the data from active users and their posts under subreddits, leading to a temporal graph with 11,000 nodes, ∼ […]

Wikipedia dataset. We use the data from top edited pages and active users, yielding a temporal graph ∼ […]

Industrial dataset. We choose 70,000 popular products and 100,000 active customers as nodes from the online grocery shopping website and use the customer-product purchase as temporal edges (∼ […]

4.2 TRANSDUCTIVE AND INDUCTIVE LEARNING TASKS

Since the majority of temporal information is reflected via the timely interactions among nodes, we choose to use a more revealing link prediction setup for training. Node classification is then treated as the downstream task using the obtained time-aware node embeddings as input.

Dataset sources: http://snap.stanford.edu/jodie/reddit.csv (Reddit), http://snap.stanford.edu/jodie/wikipedia.csv (Wikipedia), https://grocery.walmart.com/ (Industrial).

| Method | Reddit Accuracy | Reddit AP | Wikipedia Accuracy | Wikipedia AP | Industrial Accuracy | Industrial AP |
|---|---|---|---|---|---|---|
| GAE | 74.31 (0.5) | 93.23 (0.3) | 72.85 (0.7) | 91.44 (0.1) | 68.92 (0.3) | 81.15 (0.2) |
| VGAE | 74.19 (0.4) | 92.92 (0.2) | 78.01 (0.3) | 91.34 (0.3) | 67.81 (0.4) | 80.87 (0.3) |
| DeepWalk | 71.43 (0.6) | 83.10 (0.5) | 76.67 (0.5) | 90.71 (0.6) | 65.87 (0.3) | 80.93 (0.2) |
| Node2vec | 72.53 (0.4) | 84.58 (0.5) | 78.09 (0.4) | 91.48 (0.3) | 66.64 (0.3) | 81.39 (0.3) |
| CTDNE | 73.76 (0.5) | 91.41 (0.3) | 79.42 (0.4) | 92.17 (0.5) | 67.81 (0.3) | 80.95 (0.5) |
| GAT | 92.14 (0.2) | 97.33 (0.2) | 87.34 (0.3) | 94.73 (0.2) | 69.58 (0.4) | 81.51 (0.2) |
| GAT+T | 92.47 (0.2) | 97.62 (0.2) | 87.57 (0.2) | 95.14 (0.4) | 70.15 (0.3) | 82.66 (0.4) |
| GraphSAGE | 92.31 (0.2) | 97.65 (0.2) | 85.93 (0.3) | 93.56 (0.3) | 70.19 (0.2) | 83.27 (0.3) |
| GraphSAGE+T | 92.58 (0.2) | 97.89 (0.3) | 86.31 (0.3) | 93.72 (0.3) | 71.84 (0.3) | 84.95 (0.) |
| Const-TGAT | 91.39 (0.2) | 97.86 (0.2) | 86.03 (0.4) | 93.50 (0.3) | 68.52 (0.2) | 81.91 (0.3) |
| TGAT | (0.3) | (0.2) | (0.2) | (0.1) | (0.2) | (0.1) |

Table 1: Transductive learning task results for predicting future edges of nodes that have been observed in training data. All results are converted to percentages by multiplying by 100, and the standard deviations computed over ten runs are given in parentheses. The best and second-best results in each column are highlighted in bold font and underlined. GraphSAGE is short for GraphSAGE-LSTM.

| Method | Reddit Accuracy | Reddit AP | Wikipedia Accuracy | Wikipedia AP | Industrial Accuracy | Industrial AP |
|---|---|---|---|---|---|---|
| GAT | 89.86 (0.2) | 95.37 (0.3) | 82.36 (0.3) | 91.27 (0.4) | 68.28 (0.2) | 79.93 (0.3) |
| GAT+T | 90.44 (0.3) | 96.31 (0.3) | 84.82 (0.3) | 93.57 (0.3) | 69.51 (0.3) | 81.68 (0.3) |
| GraphSAGE | 89.43 (0.1) | 96.27 (0.2) | 82.43 (0.3) | 91.09 (0.3) | 67.49 (0.2) | 80.54 (0.3) |
| GraphSAGE+T | 90.07 (0.2) | 95.83 (0.2) | 84.03 (0.4) | 92.37 (0.5) | 69.66 (0.3) | 82.74 (0.3) |
| Const-TGAT | 88.28 (0.3) | 94.12 (0.2) | 83.60 (0.4) | 91.93 (0.3) | 65.87 (0.3) | 77.03 (0.4) |
| TGAT | (0.2) | (0.3) | (0.2) | (0.3) | (0.3) | (0.2) |

Table 2: Inductive learning task results for predicting future edges of unseen nodes.

The transductive task examines embeddings of the nodes that have been observed in training, via the future link prediction task and node classification. To avoid violating temporal constraints, we predict links that take place strictly after all observations in the training data.
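One way to respect this constraint is a chronological split of the timestamped edge list, sketched below. The function names (`chronological_split`, `transductive_edges`), the `(source, destination, timestamp)` tuple layout and the split fraction are hypothetical choices for illustration; the text above does not specify them.

```python
def chronological_split(edges, train_fraction=0.7):
    """Split (src, dst, timestamp) edges so every held-out edge is strictly later.

    Edges sharing the cutoff timestamp all stay in the training part, so no
    held-out interaction is simultaneous with a training observation.
    """
    ordered = sorted(edges, key=lambda e: e[2])
    cut = int(len(ordered) * train_fraction)
    if cut == 0:
        return [], ordered
    cutoff = ordered[cut - 1][2]
    train = [e for e in ordered if e[2] <= cutoff]
    future = [e for e in ordered if e[2] > cutoff]
    return train, future


def transductive_edges(train, future):
    """Keep future edges whose endpoints were both observed during training."""
    seen = {node for src, dst, _ in train for node in (src, dst)}
    return [e for e in future if e[0] in seen and e[1] in seen]


edges = [
    ("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 2.0),
    ("a", "b", 3.0), ("c", "d", 4.0), ("a", "d", 5.0),
]
train, future = chronological_split(edges, train_fraction=0.5)
evaluation = transductive_edges(train, future)
```

In this toy graph node `d` never appears before the cutoff, so the two future edges touching it are excluded from the transductive evaluation set and only the later `a`-`b` interaction remains.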

Inductive task examines the inductive learning capability using the inferred representations of un-seen nodes, by predicting the future links between unseen nodes and classify them based on theirinferred embedding dynamically. We point out that it sufﬁces to only consider the future sub-graphfor unseen nodes since they are equivalent to new graphs under the non-temporal setting.As for the evaluation metrics , in the link prediction tasks, we ﬁrst sample an equal amount of nega-tive node pairs to the positive links and then compute the average precision ( AP ) and classiﬁcation accuracy . In the downstream node classiﬁcation tasks, due to the label imbalance in the datasets,we employ the area under the ROC curve ( AUC ).4.3 B

ASELINES

Transductive task: for link prediction of observed nodes, we choose to compare our approach with the state-of-the-art graph embedding methods: GAE and VGAE (Kipf & Welling, 2016b). For complete comparisons, we also include the skip-gram-based node2vec (Grover & Leskovec, 2016) as well as the spectral-based DeepWalk model (Perozzi et al., 2014), using the same inner-product decoder as GAE for link prediction. The CTDNE model based on the temporal random walk has been reported with superior performance on transductive learning tasks (Nguyen et al., 2018), so we include CTDNE as the representative for temporal graph embedding approaches.

Inductive task: few approaches are capable of managing inductive learning on graphs even in the non-temporal setting. As a consequence, we choose GraphSAGE and GAT as baselines after adapting them to the temporal setting. In particular, we equip them with the same temporal sub-graph batching described in Section 3.4 to maximize their usage of temporal information. Also, we implement the extended version for the baselines to include edge features in the same way as ours (in Section 3.3). We experiment on different aggregation functions for GraphSAGE, i.e. GraphSAGE-mean, GraphSAGE-pool and GraphSAGE-LSTM. In accordance with the original work of Hamilton et al. (2017a), GraphSAGE-LSTM gives the best validation performance among the three approaches, which is reasonable under the temporal setting since LSTM aggregation takes account of the sequential information. Therefore we report the results of GraphSAGE-LSTM.

In addition to the above baselines, we implement a version of TGAT with all temporal attention weights set to equal value (Const-TGAT). Finally, to show that the superiority of our approach owes to both the time encoding and the network architecture, we experiment with the enhanced GAT and GraphSAGE-mean by concatenating the proposed time encoding to the original features during temporal aggregations (GAT+T and GraphSAGE+T).

Figure 3: Results of node classification task in the ablation study.

Dataset      Reddit       Wikipedia    Industrial

GAE          58.39 (0.5)  74.85 (0.6)  76.59 (0.3)
VGAE         57.98 (0.6)  73.67 (0.8)  75.38 (0.4)
CTDNE        59.43 (0.6)  75.89 (0.5)  78.36 (0.5)
GAT          64.52 (0.5)  82.34 (0.8)  87.43 (0.4)
GAT+T        64.76 (0.6)  82.95 (0.7)  88.24 (0.5)
GraphSAGE    61.24 (0.6)  82.42 (0.7)  88.28 (0.3)
GraphSAGE+T  62.31 (0.7)  82.87 (0.6)  89.81 (0.3)
Const-TGAT   60.97 (0.5)  75.18 (0.7)  82.59 (0.6)
TGAT               (0.7)        (0.7)        (0.3)

Table 3: Dynamic node classification task results, where the reported metric is the AUC.

4.4 EXPERIMENT SETUP

We use the time-sensitive link prediction loss function for training the $l$-layer TGAT network:

$$\ell = \sum_{(v_i, v_j, t_{ij}) \in \mathcal{E}} -\log\Big(\sigma\big(-\tilde{h}_i^{l}(t_{ij})^{\top} \tilde{h}_j^{l}(t_{ij})\big)\Big) - Q \cdot \mathbb{E}_{v_q \sim P_n(v)} \log\Big(\sigma\big(\tilde{h}_i^{l}(t_{ij})^{\top} \tilde{h}_q^{l}(t_{ij})\big)\Big), \qquad (8)$$

where the summation is over the observed edges on $v_i$ and $v_j$ that interact at time $t_{ij}$, $\sigma(\cdot)$ is the sigmoid function, $Q$ is the number of negative samples and $P_n(v)$ is the negative sampling distribution over the node space.

As for tuning hyper-parameters, we fix the node embedding dimension and the time encoding dimension to be the original feature dimension for simplicity, and then select the number of TGAT layers from { } and the number of attention heads from { }, according to the link prediction AP score in the validation dataset. Although our method does not put restriction on the neighborhood size during aggregations, to speed up training, especially when using the multi-hop aggregations, we use neighborhood dropout (selected among p = { }) with the uniform sampling. During training, we use . as learning rate for Reddit and Wikipedia dataset and . for the industrial dataset, with Glorot initialization and the Adam SGD optimizer. We do not experiment on applying regularization since our approach is parameter-efficient and only requires $\Omega\big((d + d_T) d_h + (d_h + d) d_f + d_f d\big)$ parameters for each attention head, which is independent of the graph and neighborhood size. Using two TGAT layers and two attention heads with dropout rate as 0.1 gives the best validation performance. For inference, we inductively compute the embeddings for both the unseen and observed nodes at each time point that the graph evolves, or when the node labels are updated.
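As a rough illustration of the training objective in (8), the following NumPy sketch evaluates the loss for a batch of edges. It is only a sketch under stated assumptions: the function names and array shapes are hypothetical, the expectation over $P_n(v)$ is replaced by a sum over $Q$ pre-drawn negative samples per edge, and the signs inside the sigmoids follow (8) exactly as printed.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def link_prediction_loss(h_src, h_dst, h_neg):
    """Monte Carlo estimate of the loss in (8) for a batch of edges.

    h_src, h_dst: (E, d) embeddings of v_i and v_j at time t_ij.
    h_neg:        (E, Q, d) embeddings of Q sampled negative nodes v_q
                  (assumed to be drawn from P_n(v) beforehand).
    """
    pos_score = np.sum(h_src * h_dst, axis=1)           # (E,)
    neg_score = np.einsum("ed,eqd->eq", h_src, h_neg)   # (E, Q)
    # First term of (8): -log sigma(-h_i^T h_j), sign kept as printed.
    pos_term = -log_sigmoid(-pos_score)
    # Second term: Q * E[log sigma(h_i^T h_q)], estimated by a sum over Q samples.
    neg_term = -np.sum(log_sigmoid(neg_score), axis=1)
    return float(np.sum(pos_term + neg_term))
```

With all-zero embeddings every sigmoid equals 1/2, so each edge contributes $(1 + Q)\log 2$, which gives a quick sanity check of the implementation.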
The inferred embeddings are then used as features for the future link prediction and dynamic node classification with a multilayer perceptron.

We further conduct an ablation study to demonstrate the effectiveness of the proposed functional time encoding approach. We experiment on abandoning time encoding or replacing it with the original positional encoding (both fixed and learnt). We also compare the uniform neighborhood dropout to sampling with inverse timespan (where the recent edges are more likely to be sampled); this comparison is provided in the supplementary material along with other implementation details and setups for baselines.

4.5 RESULTS

The results in Table 1 and Table 2 demonstrate the state-of-the-art performance of our approach on both transductive and inductive learning tasks. In the inductive learning task, our TGAT network significantly improves upon the upgraded GraphSAGE-LSTM and GAT in accuracy and average precision by at least 5% for both metrics, and in the transductive learning task TGAT consistently outperforms all baselines across datasets. While GAT+T and GraphSAGE+T slightly outperform or tie with GAT and GraphSAGE-LSTM, they are nevertheless outperformed by our approach. On one hand, the results suggest that the time encoding has the potential to extend non-temporal graph representation learning methods to temporal settings. On the other, we note that the time encoding still works the best with our network architecture, which is designed for temporal graphs. Overall, the results demonstrate the superiority of our approach in learning representations on temporal graphs over prior models. We also see the benefits from assigning temporal attention weights to neighboring nodes, where GAT significantly outperforms the Const-TGAT in all three tasks. The dynamic node classification outcome (in Table 3) further suggests the usefulness of our time-aware node embeddings for downstream tasks as they surpass all the baselines. The ablation study results of Figure 3 reveal the effectiveness of the proposed functional time encoding approach in capturing temporal signals as it outperforms the positional encoding counterparts.

4.6 ATTENTION ANALYSIS

To shed some light on the temporal signals captured by the proposed TGAT, we analyze the pattern of the attention weights $\{\alpha_{ij}(t)\}$ as functions of both time $t$ and node pairs $(i, j)$ in the inference stage. Firstly, we analyze how the attention weights change with respect to the timespans of previous interactions, by plotting the attention weights $\{\alpha_{jq}(t_{ij}) \mid q \in \mathcal{N}(v_j; t_{ij})\} \cup \{\alpha_{ik}(t_{ij}) \mid k \in \mathcal{N}(v_i; t_{ij})\}$ against the timespans $\{t_{ij} - t_{jq}\} \cup \{t_{ij} - t_{ik}\}$ when predicting the link for $(v_i, v_j, t_{ij}) \in \mathcal{E}$ (Figure 4a). This gives us an empirical estimation of $\alpha(\Delta t)$, where a smaller $\Delta t$ means a more recent interaction. Secondly, we analyze how the topological structures affect the attention weights as time elapses. Specifically, we focus on the topological structure of the recurring neighbours, by finding out what attention weights the model puts on the neighbouring nodes with different numbers of reoccurrences. Since the functional forms of all $\{\alpha_{ij}(\cdot)\}$ are fixed after training, we are able to feed in different target times $t$ and then record their values on neighbouring nodes with different numbers of occurrences (Figure 4b).

From Figure 4a we observe that TGAT captures the pattern of having less attention on more distant interactions in all three datasets. In Figure 4b, we see that when predicting an interaction further in the future, TGAT assigns more importance to neighbouring nodes that have a higher number of occurrences. The patterns of the attention weights are meaningful, since the more recent and repeated actions often have larger influence on users' future interests.

Figure 4: Attention weight analysis. We apply the Loess smoothing method for visualization.
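The first analysis (attention weight against timespan) can be approximated with a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name and inputs are hypothetical, the attention weights and their timespans are assumed to have been collected at inference time already, and simple equal-width binning with per-bin means is used here as a stand-in for the Loess smoothing applied in Figure 4.

```python
import numpy as np


def binned_attention_profile(timespans, weights, n_bins=10):
    """Empirical estimate of alpha(delta_t) via per-bin mean attention weight.

    timespans: 1-D array of t_ij - t_jq (or t_ij - t_ik) values.
    weights:   1-D array of the matching attention weights.
    Returns the bin centers and the mean weight per bin (NaN for empty bins).
    """
    timespans = np.asarray(timespans, dtype=float)
    weights = np.asarray(weights, dtype=float)
    edges = np.linspace(timespans.min(), timespans.max(), n_bins + 1)
    # Bin index of each observation; clip so the maximum lands in the last bin.
    idx = np.clip(np.digitize(timespans, edges) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=weights, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    means = np.full(n_bins, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, means
```

A decreasing sequence of per-bin means would correspond to the pattern reported for Figure 4a, namely less attention on more distant interactions.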

5 CONCLUSION AND FUTURE WORK

We introduce a novel time-aware graph attention network for inductive representation learning on temporal graphs. We adapt the self-attention mechanism to handle the continuous time by proposing a theoretically-grounded functional time encoding. Theoretical and experimental analyses demonstrate the effectiveness of our approach for capturing temporal-feature signals in terms of both node and topological features on temporal graphs. The self-attention mechanism often provides useful model interpretations (Vaswani et al., 2017), which is an important direction of our future work. Developing tools to visualize the evolving graph dynamics and temporal representations efficiently is another important direction for both research and application. Also, the functional time encoding technique has huge potential for adapting other deep learning methods to the temporal graph domain.

REFERENCES

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852, 2016.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. Dyngem: Deep embedding method for dynamic graphs. arXiv preprint arXiv:1805.11273, 2018.

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM, 2016.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017a.

William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017b.

Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016a.

Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016b.

Taisong Li, Jiawei Zhang, Philip S Yu, Yan Zhang, and Yonghong Yan. Deep dynamic network embedding for link prediction. IEEE Access, 6:29219–29230, 2018.

Lynn H Loomis. Introduction to abstract harmonic analysis. Courier Corporation, 2013.

Yao Ma, Ziyi Guo, Zhaochun Ren, Eric Zhao, Jiliang Tang, and Dawei Yin. Streaming graph neural networks. arXiv preprint arXiv:1810.10627, 2018.

Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124, 2017.

Giang Hoang Nguyen, John Boaz Lee, Ryan A Rossi, Nesreen K Ahmed, Eunyee Koh, and Sungchul Kim. Continuous-time dynamic network embeddings. In Companion Proceedings of the The Web Conference 2018, pp. 969–976. International World Wide Web Conferences Steering Committee, 2018.

James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: Liwc 2001. Mahwah: Lawrence Erlbaum Associates, 71(2001):2001, 2001.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pp. 1177–1184, 2008.

Mahmudur Rahman, Tanay Kumar Saha, Mohammad Al Hasan, Kevin S Xu, and Chandan K Reddy. Dylink2vec: Effective feature representation for link prediction in dynamic networks. arXiv preprint arXiv:1804.05755, 2018.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3693–3702, 2017.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pp. 2440–2448, 2015.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pp. 1067–1077. International World Wide Web Conferences Steering Committee, 2015.

Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. Representation learning over dynamic graphs. arXiv preprint arXiv:1803.04051, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. ACM, 2016.

Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. Billion-scale commodity embedding for e-commerce recommendation in alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 839–848. ACM, 2018a.

Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018b.

Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Self-attention with functional time representation learning. In Advances in Neural Information Processing Systems, pp. 15889–15899, 2019a.

Da Xu, Chuanwei Ruan, Kamiya Motwani, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Generative graph convolutional network for growing graphs. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3167–3171. IEEE, 2019b.

Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Product knowledge graph embedding for e-commerce. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 672–680, 2020.

Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983. ACM, 2018.

A APPENDIX

A.1 PROOF FOR CLAIM 1

Proof. The proof is also shown in our concurrent work Xu et al. (2019a). We also provide it here for completeness. To prove the results in Claim 1, we alternatively show that under the same condition,

$$\Pr\Big(\sup_{t_1, t_2 \in T} \big|\Phi^{\mathcal{B}}_d(t_1)' \Phi^{\mathcal{B}}_d(t_2) - \mathcal{K}(t_1, t_2)\big| \ge \epsilon\Big) \le 4\sigma_p \sqrt{\frac{t_{\max}}{\epsilon}} \exp\Big(-\frac{d\epsilon^2}{32}\Big). \qquad (9)$$

Define the score $S(t_1, t_2) = \Phi^{\mathcal{B}}_d(t_1)' \Phi^{\mathcal{B}}_d(t_2)$. The goal is to derive a uniform upper bound for $S(t_1, t_2) - \mathcal{K}(t_1, t_2)$. By assumption $S(t_1, t_2)$ is an unbiased estimator for $\mathcal{K}(t_1, t_2)$, i.e. $E[S(t_1, t_2)] = \mathcal{K}(t_1, t_2)$. Due to the translation-invariant property of $S$ and $\mathcal{K}$, we let $\Delta(t) \equiv S(t_1, t_2) - \mathcal{K}(t_1, t_2)$, where $t \equiv t_1 - t_2$ for all $t_1, t_2 \in [0, t_{\max}]$. Also we define $s(t_1 - t_2) := S(t_1, t_2)$. Therefore $t \in [-t_{\max}, t_{\max}]$, and we use $t \in \tilde{T}$ as the shorthand notation. The LHS in (9) now becomes $\Pr\big(\sup_{t \in \tilde{T}} |\Delta(t)| \ge \epsilon\big)$.

Note that $\tilde{T} \subseteq \cup_{i=0}^{N-1} T_i$ with $T_i = \big[-t_{\max} + \frac{2i\, t_{\max}}{N},\ -t_{\max} + \frac{2(i+1)\, t_{\max}}{N}\big]$ for $i = 0, \ldots, N-1$. So $\cup_{i=0}^{N-1} T_i$ is a finite cover of $\tilde{T}$. Define $t_i = -t_{\max} + \frac{(2i+1)\, t_{\max}}{N}$, then for any $t \in T_i$, $i = 0, \ldots, N-1$, we have

$$|\Delta(t)| = |\Delta(t) - \Delta(t_i) + \Delta(t_i)| \le |\Delta(t) - \Delta(t_i)| + |\Delta(t_i)| \le L_\Delta |t - t_i| + |\Delta(t_i)| \le L_\Delta \frac{t_{\max}}{N} + |\Delta(t_i)|, \qquad (10)$$

where $L_\Delta = \max_{t \in \tilde{T}} \|\nabla \Delta(t)\|$ (since $\Delta$ is differentiable) with the maximum achieved at $t^*$. So we may bound the two events separately.

For $|\Delta(t_i)|$ we simply notice that trigonometric functions are bounded between $[-1, 1]$, and therefore $-1 \le \Phi^{\mathcal{B}}_d(t_1)' \Phi^{\mathcal{B}}_d(t_2) \le 1$. Hoeffding's inequality for bounded random variables immediately gives us:

$$\Pr\Big(|\Delta(t_i)| > \frac{\epsilon}{2}\Big) \le \exp\Big(-\frac{d\epsilon^2}{16}\Big).$$

So applying the Hoeffding-type union bound to the finite cover gives

$$\Pr\Big(\cup_{i=0}^{N-1} \Big\{|\Delta(t_i)| \ge \frac{\epsilon}{2}\Big\}\Big) \le N \exp\Big(-\frac{d\epsilon^2}{16}\Big). \qquad (11)$$

For the other event we first apply the Markov inequality and obtain:

$$\Pr\Big(L_\Delta \frac{t_{\max}}{N} \ge \frac{\epsilon}{2}\Big) = \Pr\Big(L_\Delta \ge \frac{\epsilon N}{2 t_{\max}}\Big) \le \frac{4 t_{\max} E[L_\Delta^2]}{\epsilon N}. \qquad (12)$$

Also, since $E[s(t_1 - t_2)] = \psi(t_1 - t_2)$, we have

$$E[L_\Delta^2] = E\|\nabla s(t^*) - \nabla \psi(t^*)\|^2 = E\|\nabla s(t^*)\|^2 - E\|\nabla \psi(t^*)\|^2 \le E\|\nabla s(t^*)\|^2 = \sigma_p^2, \qquad (13)$$

where $\sigma_p^2$ is the second moment with respect to $p(\omega)$.

Combining (10), (11), (12) and (13) gives us:

$$\Pr\Big(\sup_{t \in \tilde{T}} |\Delta(t)| \ge \epsilon\Big) \le N \exp\Big(-\frac{d\epsilon^2}{16}\Big) + \frac{4 t_{\max} \sigma_p^2}{\epsilon N}. \qquad (14)$$

It is straightforward to examine that the RHS of (14) is a convex function of $N$ and is minimized by $N^* = 2\sigma_p \sqrt{\frac{t_{\max}}{\epsilon}} \exp\big(\frac{d\epsilon^2}{32}\big)$. Plug $N^*$ back into (14) and we obtain (9). We then solve for $d$ according to (9) and obtain the results in Claim 1.

A.2 COMPARISONS BETWEEN THE ATTENTION MECHANISM OF TGAT AND GAT

In this part, we provide detailed comparisons between the attention mechanism employed by our proposed TGAT and the GAT proposed by Veličković et al. (2017). Other than the obvious fact that GAT does not handle temporal information, the main difference lies in the formulation of attention weights. While GAT depends on the attention mechanism proposed by Bahdanau et al. (2014), our architecture refers to the self-attention mechanism of Vaswani et al. (2017). Firstly, the attention mechanism used by GAT does not involve the notions of 'query', 'key' and 'value' nor the dot-product formulation introduced in (2). As a consequence, the attention weight between node $v_i$ and its neighbor $v_j$ is computed via

$$\alpha_{ij} = \frac{\exp\Big(\mathrm{LeakyReLU}\big(a^{\top} [W h_i \,\|\, W h_j]\big)\Big)}{\sum_{k \in \mathcal{N}(v_i)} \exp\Big(\mathrm{LeakyReLU}\big(a^{\top} [W h_i \,\|\, W h_k]\big)\Big)},$$

where $a$ is a weight vector, $W$ is a weight matrix, $\mathcal{N}(v_i)$ is the neighborhood set for node $v_i$ and $h_i$ is the hidden representation of node $v_i$. It is then obvious that their computation of $\alpha_{ij}$ is very different from our approach. In TGAT, after expanding the expressions in Section 3, the attention weight is computed by:

$$\alpha_{ij}(t) = \frac{\exp\Big(\big([\tilde{h}_i(t_i) \,\|\, \Phi_{d_T}(t - t_i)] W_Q\big)^{\top} \big([\tilde{h}_j(t_j) \,\|\, \Phi_{d_T}(t - t_j)] W_K\big)\Big)}{\sum_{k \in \mathcal{N}(v_i; t)} \exp\Big(\big([\tilde{h}_i(t_i) \,\|\, \Phi_{d_T}(t - t_i)] W_Q\big)^{\top} \big([\tilde{h}_k(t_k) \,\|\, \Phi_{d_T}(t - t_k)] W_K\big)\Big)}.$$

Intuitively speaking, the attention mechanism of GAT relies on the parameter vector $a$ and the LeakyReLU(.) to capture the hidden factor interactions between entities in the sequence, while we use the linear transformation followed by the dot-product to capture pair-wise interactions of the hidden factors between entities and the time embeddings. The dot-product formulation is important for our approach. From the theoretical perspective, the time encoding functional form is derived according to the notion of temporal kernel $\mathcal{K}$ and its inner-product decomposition (Section 3). As for the practical performances, we see from Table 1, 2 and 3 that even after we equip GAT with the same time encoding, the performance is still inferior to our TGAT.

A.3 DETAILS ON DATASETS AND PREPROCESSING

Reddit dataset: this benchmark dataset contains users interacting with subreddits by posting under the subreddits. The timestamps tell us when the user makes the posts. The dataset uses the posts made in a one-month span, and selects the most active users and subreddits as nodes, giving a total of 11,000 nodes and around 700,000 temporal edges. The user posts have textual features that are transformed into a 172-dimensional vector representing the linguistic inquiry and word count (LIWC) categories (Pennebaker et al., 2001). The dynamic binary labels indicate if a user is banned from posting under a subreddit. Since node features are not provided in the original dataset, we use the all-zero vector instead.

Wikipedia dataset: the dataset also collects one month of interactions induced by users' editing the Wikipedia pages. The top edited pages and active users are considered, leading to ∼

Industrial dataset: we obtain the large-scale customer-product interaction graph from the online grocery shopping platform grocery.walmart.com. We select ∼ ∼ features into a 100-dimensional vector space as preprocessing. The user nodes and edges do not possess features.

                                      Reddit                Wikipedia             Industrial
Feature dimension                     172                   172                   100
Feature type                          LIWC category vector  LIWC category vector  document embeddings
Timespan                              30 days               30 days               30 days
% Training nodes                      90%                   90%                   90%
% Unseen nodes                        10%                   10%                   10%
% Training edges                      ∼                     ∼                     ∼
Future edges between observed nodes   ∼                     ∼                     ∼
Future edges between unseen nodes     ∼                     ∼                     ∼
                                      366                   217                   5,236
Label type                            binary                binary                binary
Positive label meaning                banned from posting   banned from editing   interested in dietary products

Table 4: Data statistics for the three datasets. Since we sample a proportion of unseen nodes, the percentages of the edge statistics reported here are approximations.

We then split the temporal graphs chronologically into 70%-15%-15% for training, validation and testing according to the time epochs of edges, as illustrated in Figure 5 with the Reddit dataset. Since all three datasets have a relatively stationary edge count distribution over time, using the 70 and 85 percentile time points to split the dataset results in approximately 70%-15%-15% of total edges, as suggested by Figure 5.

Figure 5: The distribution of temporal edge count for the Reddit dataset, and the illustration on the train-validation-test splitting.

To ensure that an appropriate amount of future edges among the unseen nodes will show up during validation and testing, for each dataset, we randomly sample 10% of nodes, mask them during training and treat them as unseen nodes by only considering their interactions in the validation and testing period. This manipulation is necessary since the new nodes that show up during the validation and testing period may not have much interaction among themselves. The statistics for the three datasets are summarized in Table 4.
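The chronological split and the masking of unseen nodes described above can be sketched as follows. This is an illustrative sketch rather than the authors' preprocessing code: the function name, the edge-list representation (`src`, `dst`, `ts` arrays) and the fixed random seed are assumptions.

```python
import numpy as np


def temporal_split(src, dst, ts, unseen_frac=0.1, seed=0):
    """Chronological 70%-15%-15% split with a masked set of unseen nodes.

    src, dst, ts: equal-length 1-D arrays describing temporal edges.
    Returns boolean masks (train, val, test) over edges and the unseen node ids.
    """
    src, dst, ts = (np.asarray(a) for a in (src, dst, ts))
    # Split points at the 70th and 85th percentiles of the edge timestamps.
    t_val, t_test = np.quantile(ts, [0.70, 0.85])

    # Randomly sample a fraction of nodes to be treated as unseen.
    rng = np.random.default_rng(seed)
    nodes = np.unique(np.concatenate([src, dst]))
    n_unseen = int(round(unseen_frac * len(nodes)))
    unseen = rng.choice(nodes, size=n_unseen, replace=False)

    # Edges touching an unseen node are masked out of the training period.
    touches_unseen = np.isin(src, unseen) | np.isin(dst, unseen)
    train = (ts <= t_val) & ~touches_unseen
    val = (ts > t_val) & (ts <= t_test)
    test = ts > t_test
    return train, val, test, unseen
```

Inductive evaluation would then be restricted to validation and test edges whose endpoints are in `unseen`, while transductive evaluation uses the remaining future edges.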

Preprocessing. For the Node2vec and DeepWalk baselines, which only take static graphs as input, the graph is constructed using all edges in the training data regardless of temporal information. For DeepWalk, we treat the recurrent edges as appearing only once, so the graph is unweighted. Although our approach handles both directed and undirected graphs, for the sake of training stability of the baselines, we treat the graphs as undirected. For Node2vec, we use the count of recurrent edges as their weights and construct the weighted graph. For all three datasets, the obtained graphs in both cases are undirected and do not have isolated nodes. Since we choose from active users and popular items, the graphs are all connected.

For the graph convolutional network baselines, i.e. GAE and VGAE, we construct the same undirected weighted graph as for Node2vec. Since GAE and VGAE do not take edge features as input, we use the posts/edits as user node features. For each user in the Reddit and Wikipedia datasets, we take the average of their post/edit feature vectors as the node feature. For the industrial dataset where user features are not available, we use the all-zero feature vector instead.

As for the downstream dynamic node classification task, we use the same training, validation and testing dataset as above. Since we aim at predicting the dynamic node labels, for the Reddit and Wikipedia datasets we predict if the user node is banned and for the industrial dataset we predict the customers' business labels, at different time points. Due to the label imbalance, in each batch when training the node label classifier, we conduct stratified sampling such that the label distributions are similar across batches.

A.4 EXPERIMENT SETUP FOR BASELINES

For all baselines, we set the node embedding dimension to d = 100 to keep in accordance with our approach.

Transductive baselines.

Since Node2vec and DeepWalk do not provide room for task-specific manipulation or hacking, we do not modify their default loss function and input format. For both approaches, we select the number of walks among { } and the walk-length among { } according to the validation AP. Setting number of walks = 80 and walk-length = 30 gives slightly better validation performance compared to others for both approaches. Notice that both Node2vec and DeepWalk use the sigmoid function with embedding inner-products as the decoder to predict neighborhood probabilities. So when predicting whether $v_i$ and $v_j$ will interact in the future, we use $\sigma(-z_i^{\top} z_j)$ as the score, where $z_i$ and $z_j$ are the node embeddings. Notice that Node2vec has the extra hyper-parameters $p$ and $q$, which control the likelihood of immediately revisiting a node in the walk and the interpolation between breadth-first strategy and depth-first strategy. After selecting the optimal number of walks and walk-length under p = 1 and q = 1, we further tune the different values of p in { } while fixing q = 1. According to validation, p = 0 . and . give comparable optimal performance.

For the GAE and VGAE baselines, we experiment on using one, two and three graph convolutional layers as the encoder (Kipf & Welling, 2016a) and use ReLU(.) as the activation function. By referencing the official implementation, we also set the dimension of hidden layers to 200. Similar to previous findings, using two layers gives significantly better performance than using only one layer. Adding the third layer, on the other hand, shows almost identical results for both models. Therefore the results reported are based on a two-layer GCN as the encoder. For GAE, we use the standard inner-product decoder as in our approach and optimize over the reconstruction loss, and for VGAE, we restrict the Gaussian latent factor space (Kipf & Welling, 2016b). Since we have eliminated the temporal information when constructing the input, we find that the optimal hyper-parameters selected according to the tuning have similar patterns as in the previous non-temporal settings.

For the temporal network embedding model CTDNE, the walk length for the temporal random walk is also selected among { }, where setting walk length to 80 gives slightly better validation outcome. The original paper considers several temporal edge selection (sampling) methods (uniform, linear and exponential) and finds uniform sampling to give the best performance (Nguyen et al., 2018). Since our setting is similar to theirs, we adopt the uniform sampling approach.

Inductive baselines. For the GraphSAGE and GAT baselines, as mentioned before, we train the models in an identical way as our approach with the temporal subgraph batching, despite several slight differences. Firstly, the aggregation layers in GraphSAGE usually consider a fixed neighborhood size via sampling, whereas our approach can take an arbitrary neighborhood as input. Therefore, we only consider the most recent $d_{\text{sample}}$ edges during each aggregation for all layers, and we find $d_{\text{sample}} = 20$ gives the best performance among { }. Secondly, GAT implements a uniform neighborhood dropout. We also experiment with the inverse timespan sampling for neighborhood dropout, and find that it gives slightly better performance but at the cost of computational efficiency, especially for large graphs. We consider aggregating over one, two and three-hop neighborhoods for both GAT and GraphSAGE. When working with three hops, we only experiment on GraphSAGE with the mean pooling aggregation. In general, using two hops gives comparable performance to using three hops. Notice that computations with three hops are costly, since the number of edges during aggregation increases exponentially with the number of hops. Thus we stick to using two hops for GraphSAGE, GAT and our approach. It is worth mentioning that when implementing GraphSAGE-LSTM, the input neighborhood sequences of LSTM are also ordered by their interaction time.
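The restriction to the most recent $d_{\text{sample}}$ interactions can be illustrated with a small helper. This is a hedged sketch, not the baselines' actual code: the function name and the per-node arrays `nbr_ids` and `nbr_ts` are hypothetical, and only interactions strictly before the target time are kept to respect the temporal constraint.

```python
import numpy as np


def most_recent_neighbors(nbr_ids, nbr_ts, t, d_sample=20):
    """Return the d_sample most recent interactions of a node before time t.

    nbr_ids: 1-D array of neighbor node ids (one entry per interaction).
    nbr_ts:  1-D array of the matching interaction timestamps.
    The result is sorted by interaction time, oldest first.
    """
    if d_sample <= 0:
        raise ValueError("d_sample must be positive")
    nbr_ids = np.asarray(nbr_ids)
    nbr_ts = np.asarray(nbr_ts)
    # Keep only interactions that happened strictly before the target time.
    valid = nbr_ts < t
    ids, ts = nbr_ids[valid], nbr_ts[valid]
    # Stable sort by time, then keep the last d_sample entries.
    order = np.argsort(ts, kind="stable")[-d_sample:]
    return ids[order], ts[order]
```

Because the output is ordered by interaction time, the same helper could also supply the time-ordered neighborhood sequence that an LSTM aggregator expects.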

Node classiﬁcation with baselines.

The dynamic node classiﬁcation with

GraphSAGE and

GAT can be conducted similarity to ourapproach, where we inductively compute the most up-to-date node embeddings and then input themas features to an MLP classiﬁer. For the transductive baselines, it is not reasonable to predict thedynamic node labels with only the ﬁxed node embeddings. Instead, we combine the node embeddingwith the other node embedding it is interacting with when the label changes, e.g. combine theuser embedding with the Wikipedia page embedding that the user attempts on editing when thesystem bans the user. To combine the pair of node embeddings, we experimented on summation,concatenation and bi-linear transformation. Under summation and concatenation, the combinedembeddings are then used as input to an MLP classiﬁer, where the bi-linear transformation directlyoutputs scores for classiﬁcation. The validation outcomes suggest that using concatenation withMLP yields the best performance.A.5 I

MPLEMENTATION DETAILS

Training. We implement Node2vec using the official C code on a 16-core Linux server with 500 GB of memory. DeepWalk is implemented with the official Python code. We refer to the PyTorch Geometric library for implementing the GAE and VGAE baselines (Fey & Lenssen, 2019). To accommodate the temporal setting and incorporate edge features, we develop off-the-shelf implementations of GraphSAGE and GAT in PyTorch by referencing their original implementations. We also implement our model using PyTorch. All the deep learning models are trained on a machine with one Tesla V100 GPU. We use Glorot initialization and the Adam optimizer for all models, and apply an early-stopping strategy during training, where we terminate the training process if the validation AP score does not improve for 10 epochs.

Downstream node classification. As discussed before, we use a three-layer MLP as the classifier, with the (combined) node embeddings from all the experimented approaches as input features, for all three datasets. The MLP is also trained in PyTorch with Glorot initialization and the Adam optimizer. The $\ell$-norm regularization parameter $\lambda$ is selected case-by-case from a candidate set during training. The early-stopping strategy is also employed.

A.6 SENSITIVITY ANALYSIS AND EXTRA ABLATION STUDY

Firstly, we focus on the output node embedding dimension as well as the functional time encoding dimension in this sensitivity analysis. The reported results are averaged over five runs. We experiment with five values each for $d$ and $d_T$, and the results are reported in Figures 7a and 7b. The remaining model setups reported in Section 4.4 are untouched when varying $d$ or $d_T$. We observe slightly better outcomes when increasing either $d$ or $d_T$ on the industrial dataset. The patterns on the Reddit and Wikipedia datasets are almost identical.

Code references for the baselines in A.5: Node2vec: https://github.com/snap-stanford/snap/tree/master/examples/node2vec ; DeepWalk: https://github.com/phanein/deepwalk ; GraphSAGE: https://github.com/williamleif/GraphSAGE ; GAT: https://github.com/PetarV-/GAT

Figure 6: Extra ablation study. (a) Comparison between uniform and inverse timespan weighted sampling on the link prediction task. (b) Comparison between three different ways of learning the functional time encoding, on the link prediction task.

Figure 7: Sensitivity analysis on the Industrial dataset. (a) Sensitivity analysis on the node embedding dimension. (b) Sensitivity analysis on the time embedding dimension. (c) Sensitivity analysis on the number of attention heads and layers (hops) with $d = 100$ and $d_T = 100$.

Secondly, we compare the two methods of learning the functional encoding, i.e. using a flow-based model or using the non-parametric method introduced in Section 3.1. We experiment with two state-of-the-art flow-based CDF learning methods: normalizing flow (Rezende & Mohamed, 2015) and RealNVP (Dinh et al., 2016). We use the default model setups and hyper-parameters in their reference implementations (https://github.com/ex4sperans/variational-inference-with-normalizing-flows and https://github.com/chrischute/real-nvp). We provide the results in Figure 6b. As we mentioned before, using flow-based models leads to outcomes highly comparable to those of the non-parametric approach, but they require longer training time since they implement sampling during each training batch. However, it is possible that carefully tuned flow-based models can lead to nontrivial improvements, which we leave to future work.

Finally, we provide a sensitivity analysis on the number of attention heads and layers for TGAT. Recall that by stacking two layers in TGAT we are aggregating information from the two-hop neighbourhood. For both accuracy and AP, using three-head attention and two layers gives the best outcome. In general, the results are relatively stable with respect to the number of heads, and stacking two layers leads to significant improvements compared with using only a single layer.

The ablation study comparing uniform neighborhood dropout with inverse timespan sampling is given in Figure 6a. The two experiments are carried out under the same setting, which we reported in Section 4.4. We see that using inverse timespan sampling gives slightly worse performance. This is within expectation, since uniform sampling has an advantage in capturing recurrent patterns, which can be important for predicting user actions. On the other hand, the results also suggest the effectiveness of the proposed time encoding for capturing such temporal patterns. Moreover, we point out that using inverse timespan sampling slows down training, particularly for large graphs, where a weighted sampling is conducted over a large number of nodes for each training batch construction. Nonetheless, inverse timespan sampling can help capture the more recent interactions, which may be more useful for certain tasks. Therefore, we suggest choosing the neighborhood dropout method according to the specific use case.
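The difference between uniform and inverse timespan neighborhood sampling can be made concrete with a small sketch. It assumes that the sampling probability of a past interaction at time $t_i$ is proportional to $1 / (t - t_i)$ for a query time $t$; the function names, the `eps` smoothing term and the use of sampling with replacement are illustrative assumptions rather than details of the experiments reported here.

```python
import random
from typing import List, Optional, Sequence


def inverse_timespan_probs(
    neighbor_times: Sequence[float], t: float, eps: float = 1e-6
) -> List[float]:
    """Probabilities proportional to 1 / (t - t_i) for each past interaction."""
    if any(ti > t for ti in neighbor_times):
        raise ValueError("all interactions must occur at or before t")
    # eps (an assumption) avoids division by zero when t_i equals t.
    weights = [1.0 / (t - ti + eps) for ti in neighbor_times]
    total = sum(weights)
    return [w / total for w in weights]


def sample_neighbors(
    neighbor_times: Sequence[float],
    t: float,
    k: int,
    weighted: bool = True,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Draw k neighbor indices (with replacement), uniformly or by inverse timespan."""
    if not neighbor_times:
        return []
    rng = rng or random.Random()
    probs = inverse_timespan_probs(neighbor_times, t) if weighted else None
    # With weights=None, random.choices samples uniformly.
    return rng.choices(range(len(neighbor_times)), weights=probs, k=k)


times = [1.0, 5.0, 9.0]
print(inverse_timespan_probs(times, t=10.0))  # most mass on the newest interaction
print(sample_neighbors(times, t=10.0, k=5, rng=random.Random(0)))
```

Because the weights depend on the query time $t$, they have to be recomputed for every target node in every batch, which is consistent with the slower training observed for inverse timespan sampling on large graphs.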