[PDF] Continuous-Time Relationship Prediction in Dynamic Heterogeneous Information Networks

Abstract

Online social networks, World Wide Web, media and technological networks, and other types of so-called information networks are ubiquitous nowadays. These information networks are inherently heterogeneous and dynamic. They are heterogeneous as they consist of multi-typed objects and relations, and they are dynamic as they are constantly evolving over time. One of the challenging issues in such heterogeneous and dynamic environments is to forecast those relationships in the network that will appear in the future. In this paper, we try to solve the problem of continuous-time relationship prediction in dynamic and heterogeneous information networks. This implies predicting the time it takes for a relationship to appear in the future, given its features that have been extracted by considering both heterogeneity and temporal dynamics of the underlying network. To this end, we first introduce a feature extraction framework that combines the power of meta-path-based modeling and recurrent neural networks to effectively extract features suitable for relationship prediction regarding heterogeneity and dynamicity of the networks. Next, we propose a supervised non-parametric approach, called Non-Parametric Generalized Linear Model (NP-GLM), which infers the hidden underlying probability distribution of the relationship building time given its features. We then present a learning algorithm to train NP-GLM and an inference method to answer time-related queries. Extensive experiments conducted on synthetic data and three real-world datasets, namely Delicious, MovieLens, and DBLP, demonstrate the effectiveness of NP-GLM in solving continuous-time relationship prediction problem vis-a-vis competitive baselines

Full PDF

11Continuous-Time Relationship Prediction in DynamicHeterogeneous Information Networks

SINA SAJADMANESH,

Department of Computer Engineering, Sharif University of Technology, Iran

SOGOL BAZARGANI,

Department of Computer Engineering, Sharif University of Technology, Iran

JIAWEI ZHANG,

IFM Lab, Department of Computer Science, Florida State University, United States

HAMID R. RABIEE,

Department of Computer Engineering, Sharif University of Technology, IranOnline social networks, World Wide Web, media and technological networks, and other types of so-called information networks are ubiquitous nowadays. These information networks are inherently heterogeneous and dynamic . They are heterogeneous as they consist of multi-typed objects and relations, and they aredynamic as they are constantly evolving over time. One of the challenging issues in such heterogeneous anddynamic environments is to forecast those relationships in the network that will appear in the future. In thispaper, we try to solve the problem of continuous-time relationship prediction in dynamic and heterogeneousinformation networks. This implies predicting the time it takes for a relationship to appear in the future,given its features that have been extracted by considering both heterogeneity and temporal dynamics ofthe underlying network. To this end, we first introduce a feature extraction framework that combines thepower of meta-path-based modeling and recurrent neural networks to effectively extract features suitable forrelationship prediction regarding heterogeneity and dynamicity of the networks. Next, we propose a supervisednon-parametric approach, called

Non-Parametric Generalized Linear Model (Np-Glm), which infers the hiddenunderlying probability distribution of the relationship building time given its features. We then presenta learning algorithm to train Np-Glm and an inference method to answer time-related queries. Extensiveexperiments conducted on synthetic data and three real-world datasets, namely Delicious, MovieLens, andDBLP, demonstrate the effectiveness of Np-Glm in solving continuous-time relationship prediction problemvis-à-vis competitive baselines.CCS Concepts: •

Information systems → Data mining ; Social recommendation ; •

Computing method-ologies → Machine learning ;Additional Key Words and Phrases: Link Prediction, Social Network Analysis, Heterogeneous Network,Non-Parametric Modeling, Recurrent Neural Network, Autoencoder

Link prediction is the problem of prognosticating a certain relationship, like interaction or collabo-ration, between two entities in a networked system that are not connected already [23]. Due tothe popularity and ubiquity of networked systems in the real world, such as social, economic, orbiological networks, this problem has attracted a considerable attention in recent years and hasfound its applications in various interdisciplinary domains, such as viral marketing, bioinformatics,recommender systems, and social network analysis [43]. For example, suggesting new friends inan online social network [21] or predicting drug-target interactions in a biological network [7]are two quite different problems, but can both cast as the prediction task of friendship links anddrug-target links, respectively.

Authors’ addresses: Sina Sajadmanesh, Department of Computer Engineering, Sharif University of Technology, Azadi Ave,Tehran, Tehran, 1458889694, Iran, [email protected]; Sogol Bazargani, Department of Computer Engineering, SharifUniversity of Technology, Azadi Ave, Tehran, Tehran, 1458889694, Iran; Jiawei Zhang, IFM Lab, Department of ComputerScience, Florida State University, 1017 Academic Way, Tallahassee, Florida, 32304, United States, [email protected]; HamidR. Rabiee, Department of Computer Engineering, Sharif University of Technology, Azadi Ave, Tehran, Tehran, 1458889694,Iran, [email protected]. 1556-4681/2018/5-ART1 $15.00https://doi.org/10.1145/nnnnnnn.nnnnnnnACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. a r X i v : . [ c s . S I] M a y :2 Sajadmanesh et al. The problem of link prediction has a long literature and is studied extensively in the last decade.Initial works on link prediction problem mostly concentrated on homogeneous networks, whichare composed of single type of nodes connected by links of the same type [21, 22, 40]. However,many of today’s networks, such as online social networks or bibliographic networks, are inherently heterogeneous , in which multiple types of nodes are interconnected using multiple types of links[31, 37]. For example, a bibliographic network may contain author, paper, venue, etc. as differentnode types; and write, publish, cite, and so on as diverse link types that bind nodes with differenttypes to each other. In these heterogeneous networks, the concept of a link can be generalizedto a relationship, which can be constructed by combining different links with different types.For instance, the author-cite-paper relationship can be defined in a bibliographic network as acombination of author-write-paper and paper-cite-paper links. Analogously, one can generalize thelink prediction to relationship prediction in heterogeneous networks which tries to predict complexrelationships instead of links [34].While most of the studies on the link/relationship prediction in heterogeneous networks utilizea static snapshot of the underlying network, many of these networks are dynamic in nature, whichmeans that new nodes and linkages are continually added to the network, and some existing nodesand links may be removed from the network over time. For example, in online social networks, suchas Facebook, new users are joining in the network every day, and new friendship links are beingadded to the network gradually. This dynamic characteristic causes the structure of the network tochange and evolve over time, and taking these changes into account can significantly boost thequality of link prediction task [28].In recent years, newer studies have shifted from traditional link prediction on static and homo-geneous networks toward newer domains, considering heterogeneity and dynamicity of networks[10, 12, 14, 25, 29]. However, most of these works merely focus on one of these aspects, disregardingthe other. Although there are quite a few studies that address both the challenges of heterogeneityand dynamicity [2, 30], to the best of our knowledge, all of them have ultimately formulated thelink prediction problem as a binary classification task, i.e., predicting whether a link will appear inthe network in the future. However, in dynamic networks, new links are continually appearingover time. So a much more interesting problem, which we call it continuous-time link prediction in this paper, is to predict when a link will emerge or appear between two nodes in the network.Examples of this problem include predicting the time at which two individuals become friends ina social network or the time when two authors collaborate on writing a paper in a bibliographicnetwork [34]. Inferring the link formation time in advance can be very useful in many concreteapplications in different disciplines, such as socialogy, economics, biology, and epidemiology, wherethe interactions between entities can be modeled via timed links. For example in the biologicalcontext, predicting the marker proteins interaction time in a gene regulatory network will lead topredicting tumor progression and prognosis [38]. As another example in online social networks, ifthe recommender system could predict the relationship building time between two people, then itcan issue a friendship suggestion close to that time since it will have a relatively higher chance tobe accepted. Good continuous-time link prediction results will lead to denser connections amongusers, and can greatly improve users’ engagement that is the ultimate goal of online social networks[20].In this paper, we aim to solve the problem of continuous-time relationship prediction, in whichwe forecast the relationship building time between two nodes in a dynamic and heterogeneousenvironment. This problem is very challenging from the technical perspective, and cannot be solvedtrivially for three main reasons. First, the formulation of continuous-time relationship prediction isquite different from the conventional link prediction due to the involvement of temporal dynamicsof the network and the necessity of considering network evolution time-line. Second, we only

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:3 know the building time of those relationships that are already present at the network and for therest of them that are yet to happen, which are excessive in number versus the existing ones, welack such information. Finally, as opposed to the works concerning the binary link prediction,there are very rare works in the literature on continuous-time link prediction that attempt toanswer the “when” question. To the best of our knowledge, the only work that has studied thecontinuous-time relationship prediction problem so far is proposed by Sun et al. [34]. They infera probability distribution over time for each pair of nodes given their features and answer time-related queries about the relationship building time between the two nodes using the inferreddistribution. However, the drawback of their method, not to mention neglecting the temporaldynamics of the network, is that it mainly relies on the assumption that relationship building timesare coming from a certain probability distribution that must be fixed beforehand. This assumptionthough simplifying is very restrictive, because in real applications this distribution is unknown, andconsidering any specific one as a priori could be far from reality or limit the solution generality.In order to address the above challenges, we propose a supervised non-parametric methodto solve the problem of continuous-time relationship prediction. To this end, we first formallydefine the continuous-time relationship prediction problem and formulate the approach to solveit generally. Then, we introduce our novel feature extraction framework which leverages meta-path-based modeling and recurrent neural networks to deal with heterogeneity and dynamicity ofinformation networks. Next, we present

Non-Parametric Generalized Linear Model (Np-Glm) whichmodels the distribution of relationship building time given the extracted features. The strengthof this non-parametric model is that it is capable of learning the underlying distribution of therelationship building time, as well as the contribution of each extracted feature in the network.Afterward, we propose an inference algorithm to answer queries, like the most probable time bywhich a relationship will appear between two nodes or the probability of relationship creationbetween them during a specific period. Finally, we conduct comprehensive experiments over asynthetic dataset to verify the correctness of Np-Glm’s learning algorithm, and on three real-worlddataset - DBLP, Delicious, and MovieLens - to demonstrate the effectiveness and generality of theproposed method in predicting the relationship building time versus the relevant baselines. As asummary, we can enumerate our major contributions as follows:(i) The proposed feature extraction framework can tackle heterogeneity of the data as well ascapturing the temporal dynamics of the network by incorporating meta-path-based featuresinto a recurrent neural network based autoencoder.(ii) Our non-parametric model takes a unique approach toward learning the underlying distri-bution of relationship building time without imposing any significant assumptions on theproblem.(iii) Extensive evaluations over both synthetic and real-world datasets are performed to investigatethe effectiveness of the proposed method.(iv) To the best of our knowledge, this paper is the first one which studies the continuous-timerelationship prediction problem in both dynamic and heterogeneous network configurations.The rest of this paper is organized as follows. In Section 2, we provide introductory backgroundson the concept and formally define the problem of continuous-time relationship prediction. Thenin Section 3, we introduce our novel feature extraction framework. Next, we go through the detailsof the proposed Np-Glm method in Section 4, explaining its learning method and how it answersinference queries. Experiments on synthetic data and real-world datasets are described in Section 5and 6, respectively. Section 7 discusses the related works, and finally in Section 8, we conclude thepaper.

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :4 Sajadmanesh et al.

PaperV enueT erm Author writepublishmention cite (a) DBLP

BookmarkT aд U ser posthas-tag contact (b) Delicious

Movie U serCountryT aд GenreActorDirector ratehas-tag has-genreplay-indirectproduced-in (c) MovieLensFig. 1. Schema of three different heterogeneous networks. Underlined characters are used as abbreviationsfor corresponding node types.

In this section, we introduce some important concepts and definitions used throughout the paperand formally define the problem of continuous-time relationship prediction.

An information network is heterogeneous if it contains multiple kinds of nodes and links. Formally,it is defined as a directed graph G = ( V , E ) where V = (cid:208) i V i is the set of nodes comprising theunion of all the node sets V i of type i . Similarly, E = (cid:208) j E j is the set of links constituted by theunion of all the link sets E j of type j . Now we bring the definition of the network schema [35] whichis used to describe a heterogeneous information network at a meta-level: Definition 2.1. (Network Schema) The schema of a heterogeneous network G is a graph S G = (V , E) where V is the set of different node types and E is the set of different link types in G .In this paper, we focus on three different heterogeneous and dynamic networks: (1) DBLPbibliographic network ; (2) Delicious bookmarking network ; and (3) MovieLens recommenda-tion network . The schema of these networks is depicted in Fig. 1. As an example, in the bib-liographic network, V = { Author , Paper , V enue , T erm } is the set of different node types, and E = { write , publish , mention , cite } is the set of different link types.Analogous to homogeneous networks where an adjacency matrix is used to represent whetherpairs of nodes are linked to each other or not, in heterogeneous networks, we define HeterogeneousAdjacency Matrices to represent the connectivity of nodes of different types: http://dblp.uni-trier.de/ http://delicious.com/ https://movielens.org/ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:5 Definition 2.2. (Heterogeneous Adjacency Matrix) Given a heterogeneous network G with schema S G = (V , E) , for each link type ε ∈ E denoting the relation between node types ν i , ν j ∈ V , theheterogeneous adjacency matrix M ε is a binary | V ν i | × | V ν j | matrix representing whether nodes oftype ν i are in relation with nodes of type ν j with link type ε or not.For instance, in the bibliographic network, the heterogeneous adjacency matrix M write is abinary matrix where each row is associated with an author and each column is associated with apaper, and M write ( i , j ) indicates if the author i has written the paper j .As we mentioned in the Introduction section about heterogeneous networks, the concept of alink can be generalized to a relationship. In this case, a relationship could be either a single link or acomposite relation constituted by the concatenation of multiple links that together have a particularsemantic meaning. For example, the co-authorship relation in the bibliographic network with theschema shown in Fig. 1a, can be defined as the combination of two Author-write-Paper links, making

Author-write-Paper-write-Author relation. When dealing with link or relationship prediction inheterogeneous networks, we must exactly specify what kind of link or relationship we are goingto predict. This specific relation to be predicted is called the

Target Relation [34]. For example, inDBLP bibliographic network we aim to predict if and when an author will cite a paper from anotherauthor. Thus the target relation, in this case, would be

Author-write-Paper-cite-Paper-write-Author . An information network is dynamic when its nodes and linkage structure can change over time.That is, in a dynamic information network, all nodes and links are associated with a birth anddeath time. More formally, a dynamic network at the timestamp τ is defined as G τ = ( V τ , E τ ) where V τ and E τ are respectively the set of nodes and the set of links existing in the network atthe timestamp τ .In this paper, we consider the case that an information network is both dynamic and hetero-geneous. This means that all network entities are associated with a type, and can possibly havebirth and death times, regardless of their types. The bibliographic network is an example of bothdynamic and heterogeneous one. Whenever a new paper is published, a new Paper node will beadded to the network, alongside with the corresponding new

Author , Term , and

Venue nodes (ifthey don’t exist yet). New links will be formed among these newly added nodes to indicate the write , publish and mention relationships. Some linkages might also form between the existing nodesand the new ones, like new cite links connecting the new paper with the existing papers in itsreference list.In order to formally describe the state of a heterogeneous and dynamic network at any timestamp τ , we define the time-aware heterogeneous adjacency matrix in the following. Definition 2.3. (Time-Aware Heterogeneous Adjacency Matrix) Given a dynamic heterogeneousnetwork G τ with schema S G = (V , E) , for each link type ε ∈ E denoting the relation between nodetypes ν i , ν j ∈ V , the time-aware heterogeneous adjacency matrix M τε is a binary matrix representingif nodes of type ν i are in relation with nodes of type ν j with link type ε at the timestamp τ . Moreformally, for a ∈ ν i and b ∈ ν j we have: M τε ( a , b ) = (cid:40) , if ( a , b ) ∈ ε and bt ( a , b ) < τ ≤ dt ( a , b ) , otherwisewhere bt ( a , b ) and dt ( a , b ) denote the birth and the death time of the link ( a , b ) , respectively. ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :6 Sajadmanesh et al. t t + ∆ t + ∆ . . . t + k ∆ t ∆ Feature Extraction Window ( Φ = k ∆ ) Observation Window ( Ω ) Fig. 2. The evolutionary timeline of the network data.

Suppose that we are given a dynamic and heterogeneous information network as G τ lastly observedat the timestamp τ , together with its network schema S G . Now, given the target relation R , theaim of continuous-time relationship prediction is to forecast the building time t ≥ τ of the targetrelation R between any node pair ( a , b ) in G τ .In order to solve this problem given a pair of nodes like ( a , b ) , we try to train a supervised modelthat can predict a point estimate on the time it takes for the relationship of type R to be formedbetween them. The input to such a model will be a feature vector x corresponding to the node pair ( a , b ) . The model will then output with a continuous variable t that indicates when the relationshipof type R will be built between a and b . To train such a model, we need to assemble a datasetcomprising the feature vectors of all the node pairs between which the relation R have already beenformed. The process of selecting sample node pairs, extracting their feature vector, and trainingthe supervised model are explained in the subsequent sections. In this section, we present our feature extraction framework that is designed to have three ma-jor characteristics: First, it effectively considers different type of nodes and links available in aheterogeneous information network and regards their impact on the building time of the targetrelationship. Second, it takes the temporal dynamics of the network into account and leveragesthe network evolution history instead of simply aggregating it into a single snapshot. Finally, theextracted features are suitable for not only the link prediction problem but also the generalized relationship prediction . We will incorporate these features in the proposed non-parametric model inSection 4 to solve the continuous-time relationship prediction problem.

To solve the problem of continuous-time relationship prediction in dynamic networks, we needto pay attention to the temporal history of the network data from two different points of view.First, we have to mind the evolutionary history of the network for feature extraction, so that theextracted features reflect the changes made in the network over time. Second, we have to specifythe exact relationship building time for each pair of nodes that have formed the target relationship.This is because our goal is to train a supervised model to predict a continuous variable, which inthis case is the building time of the target relationship. Hence, for each sample pair of nodes, weneed a feature vector x , associated with a target variable t that indicates the building time of thetarget relationship between them.Suppose that we have observed a dynamic network G τ recorded in the interval t < τ ≤ t .According to Fig. 2, we split this interval into two parts: the first part for extracting the feature x , and the second for determining the target variable t . We refer to the first interval as FeatureExtraction Window whose length is denoted by Φ , and the second as Observation Window , whoselength is denoted by Ω . Now, based on the existence of the target relationship in the observation ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:7 window, all the node pairs in the network will fall within either one of the following three differentgroups:(1) Node pairs that form the target relationship before the beginning of the observation window(in the feature extraction window).(2) Node pairs that form the target relationship in the observation window for the first time (notexisting before in the feature extraction window).(3) Node pairs that do not form the target relationship (neither in the feature extraction windownor in the observation window).The node pairs in the 2nd and 3rd categories constitute our data samples, and will be used in thelearning procedure to train the supervised model. For such pairs, we extract their feature vector x using the history available in the feature extraction window. For each node pair in the 2nd category,we see that the target relationship between them has been created at a time like t r ∈ ( t + Φ , t ] . Sowe set t = t r − ( t + Φ ) as the time it takes for the relationship to form since the beginning of theobservation window. For these samples, we also set an auxiliary variable y = observed their exact building time. On the other hand, For node pairs in the 3rd category,we haven’t seen their exact building time, but we know that it should be definitely after t . Forsuch samples, that we call censored samples, we set t = t − ( t + Φ ) that is equal to the lengthof the observation window Ω , and set y = t . As a result, each datasample is associated with a triple ( x , y , t ) representing its feature vector, observation status, andthe time it takes for the target relationship to be formed, respectively. In this part, we describe how to utilize the temporal history of the network in the feature extractionwindow in order to extract features for continuous-time relationship prediction problem. We firstbegin with the meta-path-based feature set for heterogeneous information networks, and thenincorporate these features into a recurrent neural network based autoencoder to exploit the temporaldynamics of the network as well. Hereby, we begin by defining the concept of meta-path [35]:

Definition 3.1 (Meta-Path).

In a heterogeneous information network, a meta-path is a directedpath following the graph of the network schema to describe the general relations that can bederived from the network. Formally speaking, given a network schema S G = (V , E) , the sequence ν ε −→ ν ε −→ . . . ν k − ε k − −−−→ ν k is a meta-path defined on S G where ν i ∈ V and ε i ∈ E .Meta-paths are commonly used in heterogeneous information networks to describe multi-typedrelations that have concrete semantic meanings. For example, in the bibliographic network whoseschema is shown in Fig. 1a, we can define the co-authorship relation by the following meta-path: Author write −−−−−→

Paper write ←−−−−−

Author or simply by A → P ← A . Another example is the author citation relation, which in this paper isused as the target relation for DBLP network. It can be specified as: Author write −−−−−→

Paper cite −−−→

Paper write ←−−−−−

Author abbreviated as A → P → P ← A .We can extend the concept of the heterogeneous adjacency matrix, which is used to indicaterelationships between nodes of different types, to meta-path adjacency matrix , which we will use to ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :8 Sajadmanesh et al.

Table 1. Similarity Meta-Paths in Different Networks

Network Meta-Path Semantic Meaning D B L P A → P ← A Authors co-write a paper A → P ← A → P ← A Authors have common co-author A → P ← V → P ← A Authors publish in the same venue A → P → T ← P ← A Authors use the same term A → P → P ← P ← A Authors cite the same paper A → P ← P → P ← A Authors are cited by the same paper D e l i c i o u s U ↔ U ↔ U Users have common contact U → B ← U Users post the same bookmark U → B → T ← B ← U Users post bookmarks with the same tag M o v i e L e n s M → A ← M Movies share an actor M → C ← M Movies belong to the same country M → D ← M Movies have the same director M → G ← M Movies have the same genre M → T ← M Movies have the same tag U → M ← U Users rate common movie U → M → A ← M ← U Users rate movies sharing an actor U → M → C ← M ← U Users rate movies from the same country U → M → D ← M ← U Users rate movies of the same director U → M → G ← M ← U Users rate movies with the same genre U → M → T ← M ← U Users rate movies with the same tag indicate the number of path instances between two nodes of (possibly) different types, as explainedbelow.

Definition 3.2. (Meta-path Adjacency Matrix) Given a heterogeneous network G with schema S G = (V , E) , and the meta-path ν ε −→ ν ε −→ . . . ν k − ε k − −−−→ ν k defined over S G denoting therelation between node types ν i , ν j ∈ V , the meta-path adjacency matrix M Ψ is defined as: M Ψ = k − (cid:214) i = M ε i which indicates the number of path instances between any node pair u ∈ ν and v ∈ ν k followingthe meta-path Ψ . The time-aware counterpart of meta-path adjacency matrix is defined analogouslyby using the time-aware heterogeneous adjacency matrix.Among the possible meta-paths that can be defined on a network schema, there are some thatcapture the similarity between two nodes. For example, the co-authorship meta-path A → P ← A in a bibliographic network creates a sense of similarity between two Author nodes. These typeof meta-paths, called similarity meta-paths , are widely used to define topological features forlink prediction problem in heterogeneous networks [29, 33, 46]. Table 1 presents a number ofsimilarity meta-paths that can be defined on DBLP, Delicious, and MovieLens networks to capturethe heterogeneous similarity between different node types.

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:9

The concept of similarity meta-paths can be extended to define heterogeneous features suitablefor relationship prediction problem, where we have a target relation. Here we follow the sameapproach as in [34] which suggests the following three meta-path-based building blocks to describefeatures for relationship prediction problem, given a target relation between two nodes of type A and B : (1) A similarity A target B (2) A target B similarity B (3) A relation C relation B where ⇝ denotes a meta-path, with labels similarity and target denoting a similarity meta-pathand the target relation, respectively. The relation label denotes an arbitrary meta-path relatingtwo nodes of possibly different types. The first block tells that there are some nodes of type A similar to a single node of the same type that has made the target relationship with a node oftype B . Therefore, those similar nodes may also form the target relation with the type B node. Ananalogous intuition is behind the second block. For the third, it says that some nodes of type A are in relation with some type C nodes, which are themselves in relation with some nodes of type B . Hence, it is likely that type A nodes form some relationships, such as the target relationship,with type B nodes. We refer to the meta-paths that are created using these three blocks as featuremeta-paths .As an example in DBLP bibliographic network, for the target relation, we use A → P → P ← A as a meta-path denoting the author citation relation. In Addition, Paper-cite-Author ( P → P → A )and Author-cite-Paper ( A → P → P ) are also used as the arbitrary relations, and the similaritymeta-paths for DBLP network from Table 1 are used to define the features for author citationrelationship prediction.After specifying feature meta-paths, we need a method to quantify them as numeric features.Due to the dynamicity of the network, different links are emerging and vanishing from the networkover time. Therefore, the quantifying method must handle this dynamicity. Here, we formallydefine Time-Aware Meta-Path-based Features : Definition 3.3 (Time-Aware Meta-Path-based Feature).

Suppose that we are given a dynamicheterogeneous network G τ along with its network schema S G = (V , E) , and a target Relation A ⇝ B . For a given pair of nodes a ∈ A and b ∈ B , and a feature meta-path Ψ = A ε −→ ν ε −→ . . . ν n − ε n −−→ B defined on S G , the time-aware meta-path-based feature at the timestamp τ is thenumber of path instances between a and b following Ψ : f τ Ψ ( a , b ) = M τ Ψ [ a , b ] This way, for any pair of nodes, we can quantify the number of path instances of a particularmeta-path at any specific timestamp τ . Although this quantification requires matrix multiplication,it can be done efficiently due to the following reasons:(1) The heterogeneous adjacency matrices are highly sparse, thus for calculating meta-pathadjacency matrices, we can considerably reduce the time complexity of each single matrixmultiplication by using fast sparse matrix multiplication algorithms [18].(2) The process of calculating the meta-path adjacency matrices is highly parallelizable, as thecorresponding meta-paths decouples into simpler similarity meta-paths, which themselvesdecouple further into link types. Therefore, we can calculate the adjacency matrix of different ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :10 Sajadmanesh et al. similarity meta-paths in parallel, and then multiply them together to obtain the featuremeta-path adjacency matrices.(3) Due to the similarity meta-paths sharing common sub-paths, computation time for thesimilarity meta-paths can also be saved using dynamic programming to avoid recalculatingpreviously computed products. For example, for the DBLP dataset, if the target relation is A → P → P ← A , then by using the similarity meta-paths shown in the Table 1, the path A → P → P will appear in all the following feature meta-paths: A → P → P ← P ← AA → P → P → P ← AA → P → P ← A Therefore, we can calculate M A → P → P once and then reuse it in the calculation of the adjacencymatrices of the above meta-paths.(4) Finally, the symmetry of the similarity meta-paths further reduces the number of products,because we can calculate the matrix corresponding to half of the path, and then multiplythe resulting matrix by its transpose. For instance, the adjacency matrix of the similaritymeta-path A → P ← V → P ← A can be calculated as X · X T where X = M write · M publish ,reducing the number of multiplications from three to two.So far we proposed a method to calculate the time-aware meta-path-based features, which is thenumber of path instances of a particular meta-path at the timestamp τ . If we set this timestamp tothe end of the feature extraction window, it is as though we are aggregating the whole networkinto a single snapshot observed at time t + Φ . In order to avoid such an aggregation, we divide thefeature extraction window into a sequence of k contiguous intervals of a constant size ∆ , as shownin Fig. 2. By doing so, we intend to extract time-aware features in each sub-window that results in amultivariate time series containing the information about the temporal evolution of the topologicalfeatures between any pair of nodes. With this in mind, we define Dynamic Meta-Path-based TimeSeries as follows:

Definition 3.4 (Dynamic Meta-Path-based Time Series).

Suppose that we are given a dynamicheterogeneous network G τ observed in a feature extraction window of size Φ ( t < τ ≤ t + Φ ),along with its network schema S G = (V , E) and a target relation A ⇝ B . Also suppose that thefeature extraction window is divided into k fragments of size ∆ . For a given pair of nodes a ∈ A and b ∈ B in G t + Φ , and a meta-path Ψ defined on S G , the dynamic meta-path-based time series of ( a , b ) is calculated as: x i Ψ ( a , b ) = f t + i ∆Ψ ( a , b ) − f t + ( i − ) ∆Ψ ( a , b ) i = . . . k For each feature meta-path designed using the triple building blocks described before, we get aunique time series. For each time step, we put the corresponding values from all the time seriesinto a vector. Consequently, we get a multivariate time series where each time step is vector-valued.For example, if we have d feature meta-paths Ψ to Ψ d , then each time step of the resulting timeseries for any node pair ( a , b ) will become: x ia , b = [ x i Ψ ( a , b ) , . . . , x i Ψ d ( a , b )] T , i = . . . k We refer to this vector-valued time series as

Multivariate Meta-Path-based Time Series . Such mul-tivariate time series reflect how topological features change between two nodes across differentsnapshots of the network. Based on the level of the network dynamicity, it can capture increas-ing/decreasing trends or even periodic/re-occurring patterns.

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:11 . . . . . . x x x k x x xx x k x k − x Fig. 3. The architecture of the LSTM Autoencoder used for dynamic feature extraction. The first k stepsdepicts the manner of the working of the encoder LSTM, while the second k steps describes the decoderLSTM. The output of the k th stage is used as the feature vector x , which is fed into the decoder k times toproduce the input sequence in the reversed order. Now it’s time to convert the multivariate meta-path-based time series into a single feature vectorso that we can use it as the input to our non-parametric model that will be discussed in the nextsection. A trivial solution would be to stack all the vector-valued time steps of the multivariatetime series into a single vector. However, this approach will result in a very high dimensionalvector as the number of time steps increases and can lead to difficulties in the learning proceduredue to the curse of dimensionality. This is in contrast with our expectation that more time stepswould bring more information about the history of the network and should result in a betterprediction model. To overcome this problem, we combine the power of recurrent neural networks,especially Long Short Term Memory (LSTM) units [17], which have proven to be very successful inhandling time series and sequential data, with Autoencoders [4], which are widely used to learnalternative representations of the data such that the learned representation can reconstruct theoriginal input. Our goal is to transform the multivariate meta-path-based time series into a compactvector representation such that the resulting vector holds as much information from the originalmultivariate time series as possible.Inspired by the work of Dai and Le on semi-supervised sequence learning [9], we design anautoencoder that learns how to take a multivariate time series as input and compress it into a latentvector representation. The architecture of such autoencoder is illustrated in Fig. 3. The autoencoderconsists of two components: (1) the encoder, which takes the input data and transforms it into alatent representation; and (2) the decoder, which takes the encoded representation and transformsit back to the input space. The autoencoder is trained in such a way that it can reconstruct theoriginal input data.As the purpose of using the autoencoder in this paper is to compress multivariate time series,instead of using simple feed-forward neural networks, both encoder and decoder are built usingLSTMs. The input to the encoder LSTM is a multivariate time series of length k . The encoder acceptsthe vector-valued time steps of the input multivariate time series sequentially. After receiving the k th time step, the output of the encoder LSTM will be the compressed feature vector that we willuse as the input to the Np-Glm method. In order to train the encoder to learn how to compress theinput time series, it is matched with a decoder LSTM. The decoder LSTM receives k copies of thecompressed feature vector one after another, and with a proper loss function (such as mean squarederror) it is forced to reconstruct the original multivariate time series in reverse order. Reversingthe output sequence will make the optimization of the model easier since it causes the decoder torevert back the changes made by the encoder to the input sequence.The benefits of using the LSTM autoencoder is three-fold: (1) since the autoencoder can re-construct the original time series, which reflects the temporal dynamics of the network, we getminimum information loss in the compressed feature vector; (2) as we can set the dimensionalityof the compressed feature vector to any desired value, we can evade the curse of dimensionality; ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :12 Sajadmanesh et al.

Table 2. Characteristics of Some Probability Distributions Used for Event-Time Modeling

Distribution Density function Survival function Intensity function Cumulative intensity f T ( t ) S ( t ) λ ( t ) Λ ( t ) Exponential α exp (− αt ) exp (− αt ) α αt Rayleigh tσ exp (− t σ ) exp (− t σ ) tσ t σ Gompertz αe t exp (cid:8) − α ( e t − ) (cid:9) exp (cid:8) − α ( e t − ) (cid:9) αe t αe t Weibull αt α − β α exp (cid:110) −( tβ ) α (cid:111) exp (cid:110) −( tβ ) α (cid:111) αt α − β α ( tβ ) α and (3) due to the inherent dynamicity of recurrent neural networks and LSTMs, when we receive ( k + ) th snapshot of the network, we can easily fine-tune the previous autoencoder that was learnedwith k snapshots to consider the new snapshot as well, instead of repeating the whole learningprocedure from scratch.To conclude this section, we quickly review the whole procedure of processing the network data,training the autoencoder, and assembling a training dataset for the supervised model to predict thebuilding time of a particular target relation:(1) The network evolution timeline is split into the feature extraction window and the observationwindow.(2) Those node pairs that have either formed the target relationship in the observation window(observed samples) or have not formed the target relationship at all (censored samples) areselected as sample node pairs.(3) By extracting feature meta-paths based on the target relation and similarity meta-paths, amultivariate time series can be obtained for each sample node pair. Thus if we have N samplenode pairs, we will have a dataset of N multivariate time series.(4) The LSTM autoencoder is trained using the dataset of N multivariate time series to learnhow to compress time series into feature vectors.(5) For each sample node pair, the corresponding multivariate time series is compressed into afeature vector x using the learned encoder LSTM.(6) For each observed node pair, the feature vector x is labeled with y = t denoting the time it takes for the node pair to form the target relationship.For censored node pairs, y is set to zero and t becomes equal to the size of the observationwindow.(7) Finally, we will have a dataset of the form { x , y , t } i , i = . . . N that will be used to train thesupervised model.We explain our proposed non-parametric model in the next section that takes the learnedrepresentation as the feature vector x and attempts to predict the corresponding event time t . In this section we introduce our proposed model, called

Non-Parametric Generalized Linear Model ,to solve the problem of continuous-time relationship prediction based on the extracted features.Since the relationship building time is treated as a continuous random variable, we attempt tomodel the probability distribution of this time, given the features of the target relationship. Thus,if we denote the target relationship building time by t and its features by x , our aim is to modelthe probability density function f T ( t | x ) . A conventional approach to modeling this function is ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:13 to fix a parametric distribution for t (e.g. Exponential distribution) and then relate x to t using aGeneralized Linear Model [34]. The major drawback of this approach is that we need to know theexact distribution of the relationship building time, or at least, we could guess the best one thatfits. The alternative way that we follow is to learn the shape of f T ( t | x ) from the data using anon-parametric solution.In the rest of this section, we first bring the necessary theoretical backgrounds related to theconcept, then we go through the details of the proposed model. In the end, we explain the learningand inference algorithms of Np-Glm. Here we define some essential concepts that are necessary to study before we proceed to theproposed model. Generally, the formation of a relationship between two nodes in a network cansimply be considered as an event with its occurring time as a random variable T coming from adensity function f T ( t ) . Regarding this, we can have the following definitions: Definition 4.1 (Survival Function).

Given the density f T ( t ) , the survival function denoted by S ( t ) ,is the probability that an event occurs after a certain value of t , which means: S ( t ) = P ( T > t ) = ∫ ∞ t f T ( t ) dt (1) Definition 4.2 (Intensity Function).

The intensity function (or failure rate function), denoted by λ ( t ) , is the instantaneous rate of event occurring at any time t given the fact that the event has notoccurred yet: λ ( t ) = lim ∆ t → P ( t ≤ T ≤ t + ∆ t | T ≥ t ) ∆ t (2) Definition 4.3 (Cumulative Intensity Function).

The cumulative intensity function, denoted by Λ ( t ) , is the area under the intensity function up to a point t : Λ ( t ) = ∫ t λ ( t ) dt (3)The relations between density, survival, and intensity functions come directly from their defini-tions as follows: λ ( t ) = f T ( t ) S ( t ) (4) S ( t ) = exp (− Λ ( t )) (5) f T ( t ) = λ ( t ) exp (− Λ ( t )) (6)Table 2 shows the density, survival, intensity, and cumulative intensity functions of some widely-used distributions for event time modeling. Looking at Eq. 6, we see that the density function can be specified uniquely with its intensityfunction. Since the intensity function often has a simpler form than the density itself, if we learnthe shape of the intensity function, then we can infer the entire distribution eventually. Therefore,we focus on learning the shape of the conditional intensity function λ ( t | x ) from the data, andthen accordingly infer the conditional density function f T ( t | x ) based on the learned intensity.In order to reduce the hypothesis space of the problem and avoid the curse of dimensionality, we ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :14 Sajadmanesh et al. assume that λ ( t | x ) , which is a function of both t and x , can be factorized into two separate positivefunctions as the following: λ ( t | x ) = д ( w T x ) h ( t ) (7)where д is a function of x which captures the effect of features via a linear transformation usingcoefficient vector w independent of t , and h is a function of t which captures the effect of timeindependent of x . This assumption, referred to as proportional hazards condition [5], holds in GLMformulations of many event-time modeling distributions, such as the ones shown in Table 2. Ourgoal is now to fix the function д and then learn both the coefficient vector w and the function h from the training data. In order to do so, we begin with the likelihood function of the data whichcan be written as follows: N (cid:214) i = f T ( t i | x i ) y i P ( T ≥ t i | x i ) − y i (8)The likelihood consists of the product of two parts: The first part is the contribution of thosesamples for which we have observed their exact building time, in terms of their density function.The second part on the other hand, is the contribution of the censored samples, for which we usethe probability of the building time being greater than the recorded one. By applying Eq. 5 and 6we can write the likelihood in terms of the intensity function: N (cid:214) i = (cid:2) λ ( t i | x i ) exp {− Λ ( t i | x i )} (cid:3) y i exp {− Λ ( t i | x i )} − y i (9)By merging the exponentials and applying Eq. 3 and 7, the likelihood function becomes: N (cid:214) i = (cid:2) д ( w T x i ) h ( t i ) (cid:3) y i exp {− д ( w T x i ) ∫ t i h ( t ) dt } (10)Since we don’t know the form of h ( t ) , we cannot directly calculate the integral appeared inthe likelihood function. To deal with this problem, we treat h ( t ) as a non-parametric function byapproximating it with a piecewise constant function that changes just in t i s. Therefore, the integralover h ( t ) , denoted by H ( t ) , becomes a series: H ( t i ) = ∫ t i h ( t ) dt ≃ i (cid:213) j = h ( t j )( t j − t j − ) (11)assuming samples are sorted by t in increasing order, without loss of generality. The function H ( t ) defined above plays an important role in both learning and inference phases. In fact, both thelearning and inference phases rely on H ( t ) instead of h ( t ) , which we will see later in this paper.Replacing the above series in the likelihood, taking the logarithm and negating, we end up withthe following negative log-likelihood function, simply called the loss function , denoted by L : L ( w , h ) = N (cid:213) i = (cid:110) д ( w T x i ) i (cid:213) j = h ( t j )( t j − t j − ) − y i (cid:2) log д ( w T x i ) + log h ( t i ) (cid:3) (cid:111) (12)The loss function depends on both the vector w and the function h ( t ) . In the next part, we explainan iterative learning algorithm to learn both w and h ( t ) collectively. ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:15

Minimizing the loss function (Eq. 12) relies on the choice of the function д . There are no particularlimits on the choice of д except that it must be a non-negative function. For example, both quadraticand exponential functions of w T x will do the trick. Here, we proceed with д ( w T x ) = exp ( w T x ) since it makes the loss function convex with respect to w . Subsequent equations can be derived forother choices of д analogously.Setting the derivative of the loss function with respect to h ( t k ) to zero yields a closed formsolution for h ( t k ) : h ( t k ) = y k ( t k − t k − ) (cid:205) Ni = k exp ( w T x i ) (13)By applying Eq. 11, we get the following for H ( t i ) : H ( t i ) = i (cid:213) j = y j (cid:205) Nk = j exp ( w T x k ) (14)which depends on the vector w . On the other hand, we cannot obtain a closed form solution for w from the loss function. Therefore, we turn to use Gradient-based optimization methods to find theoptimal value of w . The loss function with respect to w is as follows: L ( w ) = N (cid:213) i = (cid:8) exp ( w T x i ) H ( t i ) − y i w T x i (cid:9) + Const . (15)which depends on the function H . As the learning of both w and H depends on each other, theyshould be learned collectively. Here, we use an iterative algorithm to learn w and H alternatively.We begin with a random vector w ( ) . Then in each iteration τ , we first update H ( τ ) via Eq. 14 using w ( τ − ) . Next, we optimize Eq. 15 using the values of H ( τ ) ( t i ) to obtain w ( τ ) . We continue this routineuntil convergence. Since this procedure successively reduces the value of the loss function, andas the loss function (i.e. the negative log-likelihood) is bounded from below, the algorithm willultimately converge to a stationary point. The pseudo code of the learning procedure is given inAlgorithm 1. In this part, we explain how to answer the common inference queries based on the inferred dis-tribution f T ( t | x ) . Suppose that we have learned the vector w and the function H using thetraining samples ( x i , y i , t i ) , i = . . . N following Algorithm 1. Afterward, for a testing relationship R associated with a feature vector x R , the following queries can be answered: What is the probability for the relationship R to be formed betweentime t α and t β ? This is equivalent to calculating P ( t α ≤ T ≤ t β | x R ) , which by definition is: P ( t α ≤ T ≤ t β | x R ) = S ( t α | x R ) − S ( t β | x R ) = exp {− д ( w T x R ) H ( t α )} − exp {− д ( w T x R ) H ( t β )} (16)The problem here is to obtain the values of H ( t α ) and H ( t β ) , as t α and t β may not be among t i s ofthe training samples, for which H is estimated. To calculate H ( t α ) , we find k ∈ { , , . . . , N } suchthat t k ≤ t α < t k + . Due to the piecewise constant assumption for the function h , we get: h ( t α ) = H ( t α ) − H ( t k ) t α − t k (17) ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :16 Sajadmanesh et al.

Algorithm 1:

The learning algorithm of Np-Glm

Input: X N × d = ( x , . . . x N ) T as d -dimensional feature vectors, y N × as observation states, and t N × asrecorded times. Output:

Learned parameters w d × and H N × . converдed ← False ; threshold ← − ; τ ← L ( τ ) = ∞ ;Initialize w ( τ ) with random values; while Not converдed do τ ← τ + H ( τ ) using w ( τ − ) ;Minimize Eq. 15 to obtain w ( τ ) using H ( τ ) ;Use Eq. 12 to obtain L ( τ ) using w ( τ ) and H ( τ ) ; if (cid:13)(cid:13)(cid:13) L ( τ ) − L ( τ − ) (cid:13)(cid:13)(cid:13) < threshold then converдed ← True ; endend w ← w ( τ ) ; H ← H ( τ ) ; On the other hand, since h only changes in t i s, we have: h ( t α ) = h ( t k + ) = H ( t k + ) − H ( t k ) t k + − t k (18)Combining Eq. 17 and 18, we get: H ( t α ) = H ( t k ) + ( t α − t k ) H ( t k + ) − H ( t k ) t k + − t k (19)Following the similar approach, we can calculate H ( t β ) , and then answer the query using Eq. 16.The dominating operation here is to find the value of k . Since we have t i s sorted beforehand, thisoperation can be done using a binary search with O ( log N ) time complexity. By how long the target relationship R will be formed with probability α ? Thisquestion is equivalent to find the time t α such that P ( T ≤ t α | x R ) = α . By definition, we have:1 − P ( T ≤ t α | x R ) = S ( t α | x R ) = exp {− д ( w T x R ) H ( t α )} = − α Taking logarithm of both sides and rearranging, we get: H ( t α ) = − log ( − α ) д ( w T x R ) (20)To find t α , we first find k such that H ( t k ) ≤ H ( t α ) < H ( t k + ) . We eventually have t k ≤ t α < t k + since H is a non-decreasing function due to non-negativity of the function h . Therefore, we againend up with Eq. 19, by rearranging which we get: t α = ( t k + − t k ) H ( t α ) − H ( t k ) H ( t k + ) − H ( t k ) + t k (21) ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:17

Algorithm 2:

Synthetic dataset generation algorithm.

Input:

The number of observed samples N o , the number of censored samples N c , the dimension of thefeature vectors d , and the desired distribution dist Output:

Synthetically generated data X N × d , y N × , and t N × . N ← N o + N c ;Draw a weight vector w ∼ N( , I d ) , where I d is the d -dimensional identity matrix;Draw scalar intercept b ∼ N( , ) ; for i ← to N do Draw feature vector x i ∼ N( , I d ) ;Set distribution parameter α i ← exp ( w T x i + b ) ; if dist == Rayleiдh then

Draw t i ∼ α i t exp {− . α i t } ; else if dist == Gompertz then

Draw t i ∼ α i e t exp {− α i ( e t − )} ; end Sort pairs ( x i , t i ) by t i in ascending order; for i ← to N o do y i ← endfor i ← ( N o + ) to N do y i ← end By combining the Eq. 20 and 21, we can obtain the value of t α , which is the answer to the quantilequery. It worth mentioning that if α = . t α becomes the median of the distribution f T ( t | x R ) .Here again the dominant operation is to find the value of k , which due to the non-decreasingproperty of the function H can be found using a binary search with O ( log N ) time complexity. Generating random samples from the inferred distribution can easilybe carried out using the Inverse-Transform sampling algorithm. To pick a random sample from theinferred distribution f T ( t | x ) , we first generate uniform random variable u ∼ U ni f orm ( , ) . Then,we find k such that S ( t k + | x ) ≤ u ≤ S ( t k | x ) . We output t k + as the generated sample. Again,searching for the suitable value of k is the dominant operation which can be undertaken via binarysearch with O ( log N ) time complexity. We use synthetic data to verify the correctness of Np-Glm and its learning algorithm. Since Np-Glmis a non-parametric method, we generate synthetic data using various parametric models withpreviously known random parameters and evaluate how well Np-Glm can learn the parametersand the underlying distribution of the generated data.

We consider generalized linear models of two widely used distributions for event-time modeling,Rayleigh and Gompertz, as the ground truth models for generating synthetic data. Algorithm 2is used to generate a total of N data samples with d -dimensional feature vectors, consisting N o non-censored (observed) samples and remaining N c = N − N o censored ones. For all syntheticexperiments, we generate 10-dimensional feature vectors ( d = ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :18 Sajadmanesh et al.

20 60 100 140 1801 . . . . . . . Iteration l o g L N = N = N = (a) Rayleigh distribution

10 20 30 40 501 . . . . . . . . . Iteration l o g L N = N = N = (b) Gompertz distributionFig. 4. Convergence of Np-Glm’s average log-likelihood ( log L ) for different number of training samples ( N ).Censoring ratio has been set to 0.5.

10 30 50 70 90 − . − . − . − . . . . . . Iteration l o g L

5% censoring25% censoring50% censoring (a) Rayleigh distribution

10 20 30 40 500 . . . . . . . . Iteration l o g L

5% censoring25% censoring50% censoring (b) Gompertz distributionFig. 5. Convergence of Np-Glm’s average log-likelihood ( log L ) for different censoring ratios with 1K samples. Since Np-Glm’s learning is done in an iterative manner, we firstanalyze whether this algorithm converges as the number of iterations increases. We recorded thelog-likelihood of Np-Glm, averaged over the number of training samples N in each iteration. Werepeated this experiment for N ∈ { , , } with a fixed censoring ratio of 0.5, whichmeans half of the samples are censored. The result is depicted in Fig. 4. We can see that thealgorithm successfully converges with a rate depending on the underlying distribution. For the caseof Rayleigh, it requires about 100 iterations to converge but for Gompertz, this reduces to about 30.Also, we see that using more training data leads to achieving more log-likelihood as expected.In Fig. 5, we fixed N = H ( t ) for all t in the observation window.Therefore, as the censoring ratio increases, the observation window is decreased, so Np-Glm hasto infer a fewer number of parameters, leading to a faster convergence. Note that as opposed toFig. 4, here a higher log-likelihood doesn’t necessarily indicate a better fit, due to the likelihoodmarginalization we get by censored samples. Next, we evaluated how good Np-Glm can infer the parametersused to generate synthetic data. To this end, we varied the number of training samples N and ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:19

100 300 500 700 9000 . . . . . . N M A E

0% censoring25% censoring50% censoring (a) Rayleigh distribution

100 300 500 700 9000 . . . . . . N M A E

0% censoring25% censoring50% censoring (b) Gompertz distributionFig. 6. Np-Glm’s mean absolute error (MAE) vs the number of training samples ( N ) for different censoringratios. . . . . . . N c M A E N o = N o = N o = (a) Rayleigh distribution . . . . . . N c M A E N o = N o = N o = (b) Gompertz distributionFig. 7. Np-Glm’s mean absolute error (MAE) vs the number of censored samples ( N c ) for different number ofobserved samples ( N o ). measured the mean absolute error (MAE) between the learned weight vector ˆ w and the groundtruth. Fig. 6 illustrates the result for different censoring ratios. It can be seen that as the number oftraining samples increases, the MAE gradually decreases. The other point to notice is that morecensoring ratio results in a higher error due to the information loss we get by censoring.In another experiment, we investigated whether censored samples are informative or not. Forthis purpose, we fixed the number of observed samples N o and changed the number of censoredsamples from 0 to 200. We measured the MAE between the learned w and the ground truth for N o ∈ { , , } . The result is shown in Fig. 7. It clearly demonstrates that adding more censoredsamples causes the MAE to dwindle up to an extent, after which we get no substantial improvement.This threshold is dependent on the underlying distribution. In this case, for Rayleigh and Gompertzit is about 80 and 120, respectively. Finally, we assess the running time of Np-Glm’s learning algorithmagainst the size of the training data when it becomes relatively large. To this end, we varied thenumber of samples from 10K to 100M and measured the average running time of the learningalgorithm of Np-Glm on a single machine whose specification is reported in Table 3. Fig. 8 depictsthe result in log-log scale for Rayleigh and Gompertz distributions under different censoring ratiosselected from the set { . , . , . } . It can be seen from the figure that the running time scales ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :20 Sajadmanesh et al.

Table 3. PC Specification and Configuration

Operating System Windows 10CPU Intel Core i7 1.8 GHzRAM 12 GB DDR IIIGPU Nvidia GeForce GT 750Disk Type SSDProgramming Language Python 3.6 N T ( s e c o n d s )

0% censoring25% censoring50% censoring (a) Rayleigh distribution N T ( s e c o n d s )

0% censoring25% censoring50% censoring (b) Gompertz distributionFig. 8. Np-Glm’s average running time ( T ) measured in seconds vs the number of training samples ( N ) in log − log scale for different censoring ratios. linearly with the number of training samples since the number of parameters to be inferred inNp-Glm as a non-parametric model depends on the size of the training data. The censoring ratiothough negligible in scale can impact the running time of the algorithm, with more censoringratio resulting in less running time. This is because higher censoring ratio reduces the observationwindow, which in turn reduces the number of parameters. We apply Np-Glm with the proposed feature set on a number of real-world datasets to evaluate itseffectiveness and compare its performance in predicting the relationship building time vis-à-visstate of the art models.

We use the DBLP bibliographic citation network, provided by [36], which has bothattributes of dynamicity and heterogeneity. The network contains four types of objects: authors,papers, venues, and terms. The network schema of this dataset is depicted in Fig. 1a. Each paper isassociated with a publication date, with a granularity of one year. Based on the publication venueof the papers, we limited the original DBLP dataset to those papers that are published in venuesrelative to the theoretical computer science. This resulted in having about 16k authors and 37kpapers published from 1969 to 2016 in 38 venues.

Another dynamic and heterogeneous dataset we use in our experiments is theDelicious bookmarking dataset from [6], with a network schema presented in Fig. 1b. It containsthree types of objects, namely users, bookmarks, and tags, whose numbers are about 1.7k, 31k, and22k, respectively. The dataset includes bookmarking timestamps from May 2006 to October 2010.

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:21

Table 4. Demographic Statistics of Real-World Datasets

Dataset Time Span Entity CountDBLP From 1969 to 2016 Nodes

Author

V enue Paper

T erm

U ser

T aд

Bookmark

U ser

Genre Movie

T aд

Actor

Country Director

The third heterogeneous dataset with dynamic characteristics has been ex-tracted from MovieLens personalized movie recommendation website, provided by [15]. The datasetcomprises seven types of objects, that are users, movies, tags, genres, actors, directors, and countries,as illustrated by the network schema in the Fig. 1c. It contains about 1.4k users and 5.6k movies,with user-movie rating timestamps ranging from September 1997 to January 2009.The demographic statistics of all datasets are presented in Table 4.

To challenge the performance of Np-Glm, we use a number ofbaselines introduced in the following: • Generalized Linear Model (Glm) : This is the state-of-the-art method proposed in [34]. We usethe GLM-based framework with Exponential and Weibull distributions, denoted as Exp-Glmand Wbl-Glm used in [34]. • Censored Regression Model (Crm) : This model, also called type II Tobit model, is designed toestimate linear relationships between variables when there is censoring in the dependentvariable. In other words, it is an extension to the ordinary least squares linear regression forcencored data [39]. The structural equation in this model is: t ∗ = w T x + ϵ where ϵ is a normally distributed error term and t ∗ is a latent variable which is observedwithin the observation window and censored otherwise. Accordingly, the observed t is definedas: t = (cid:40) t ∗ if y = Ω if y = w is learned using maximum likelihood estimation (more details in[3]). • Additive Regression Model (Arm) : This model is another regression method suggested byAalen for censored data [1]. Like Np-Glm, it specifies the intensity function, but instead of a

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :22 Sajadmanesh et al. multiplicative linear model, the Aalen’s model is additive: λ ( t | x ) = d (cid:213) i = w i ( t ) x i The learning algorithm infers ∫ t w i ( t ) dt instead of estimating individual w i s. For more detailsabout the learning algorithm, the reader can refer to [19].For all models, we consider the median of the distribution f T ( t | x test ) as the predicted time forany test sample and then compare it to the ground truth time t test .To examine the effect of considering different feature extractors on the performance of themodels, we use another dynamic feature extractor and a static one against the proposed LSTMAutoencoder: • Exponential Smoothing : This dynamic feature extractor previously used in [14] is an expo-nentially weighted moving average over the features extracted in all the snapshots of thenetwork, which is calculated as: f i = (cid:40) x , if i = α x i + ( − α ) f i − , otherwisewhere f i is the smoothed feature after i th snapshot, x i is the i th step of the dynamic meta-path-based time series, and α ∈ ( , ) is the smoothing factor. We then set x = f k as the finalfeature vector if we have k snapshots in total. • Single Snapshot : This static feature extractor considers the whole network as a single snapshot,neglecting its temporal dynamics. This feature extractor is equivalent to the one proposed in[34].

We assess different methods using a number of evaluation metricswhich are described in the following: • Mean Absolute Error (MAE): This metric measures the expected absolute error between thepredicted time values and the ground truth:

MAE ( t , ˆ t ) = N N (cid:213) i = (cid:12)(cid:12) t i − ˆ t i (cid:12)(cid:12) • Mean Relative Error (MRE): This metric calculates the expected relative absolute error betweenthe predicted time values and the ground truth:

MRE ( t , ˆ t ) = N N (cid:213) i = (cid:12)(cid:12)(cid:12)(cid:12) t i − ˆ t i t i (cid:12)(cid:12)(cid:12)(cid:12) • Root Mean Squared Error (RMSE): This metric computes the root of the expected squarederror between the predicted time values and the ground truth:

RMSE ( t , ˆ t ) = (cid:118)(cid:117)(cid:116) N N (cid:213) i = (cid:0) t i − ˆ t i (cid:1) • Mean Squared Logarithmic Error (MSLE): This measures the expected value of the squaredlogarithmic error between the predicted time values and the ground truth:

MSLE ( t , ˆ t ) = N N (cid:213) i = (cid:0) log ( + t i ) − loд ( + ˆ t i ) (cid:1) ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:23 • Median Absolute Error (MDAE): It is the median of the absolute errors between the predictedtime values and the ground truth:

MDAE ( t , ˆ t ) = median ( (cid:12)(cid:12) t − ˆ t (cid:12)(cid:12) . . . (cid:12)(cid:12) t N − ˆ t N (cid:12)(cid:12) )• Maximum Threshold Prediction Accuracy (ACC): This measures for what fraction of samples,a model have a lower absolute error than a given threshold:

ACC ( t , ˆ t ) = N N (cid:213) i = (cid:0)(cid:12)(cid:12) t i − ˆ t i (cid:12)(cid:12) < threshold (cid:1) • Concordance Index (CI): This metric is one of the most widely used performance measures forsurvival models that estimates how good a model performs at ranking predicted times [16]. Itcan be seen as the fraction of all the sample pairs whose predicted timestamps are correctlyordered among all samples that can be ordered, and is considered as the generalization ofthe Area Under Receiver Operating Characteristic Curve (AUC) when we are dealing withcensored data [32].

For DBLP dataset, we confine the data samples to those authors whohave published more than 5 papers in the feature extraction window of each experiment. Followingthe triple building blocks described for feature extraction in Section 3, and using the similaritymeta-paths in Table 1, we start the feature extraction process with 19 feature meta-paths. In allexperiments, the author citation relation ( A → P → P ← A ) is chosen as the target relation. Forthe case of the Delicious dataset, we select user-user relation ( U ↔ U ) as the target relation, anddesign 6 feature meta-paths via the similarity meta-paths in Table 1. Regarding the MovieLensdataset, we limit the actor list to the top three for each movie. To imply a notion of “like” relationbetween user and movie, we only consider ratings above 4 in the scale of 5. For this dataset, thetarget relation is set to user rate movie ( U → M ), based on which, we design 11 final meta-paths.For the sake of convenience, we convert the scale of time differences from timestamp to month inDelicious and MovieLens datasets.Except for parameter settings analysis (subsection 6.3.3) where we will analyze the effect ofdifferent parameters on the performance of different models, in the rest of the experiments in thissection we set the length of the observation window Ω to 6 for all three datasets. For DBLP dataset,the number of snapshots k is set to 6, while for the other two datasets we set k =

12. We also fixthe time difference between network snapshots ∆ to 1 in all cases. These settings lead to havinga feature extraction window of size Φ = Φ =

12 months for Delicious andMovieLens. Accordingly, the number of labeled instances for DBLP, Delicious, and MovieLens areabout 3.4K, 3.9K, and 7.8K, respectively. About half of the labeled samples are censored ones, whichare picked uniformly at random among all the possible candidates.We implemented the LSTM autoencoder using Keras deep learning library [8]. We used meansquare error loss function, linear activation function, and Adadelta optimizer [45] with defaultparameters. For all datasets, we set the dimension of the encoded feature as twice as the inputdimension and trained the autoencoder in 50 epochs. For exponential smoothing feature extractor,the smoothing factor α were tuned to maximize the performance on the training dataset. ForNp-Glm, the data samples were ordered according to their corresponding time variables, as themodel needs the samples sorted by their recorded time. We use 5-fold cross-validation and reportthe average results for all the experiments in this section. ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :24 Sajadmanesh et al.

Table 5. Comprehensive Performance Comparison of Different Methods

Dataset Feature Model MAE MRE RMSE MSLE MDAE CI D B L P LSTM Autoencoder(Dynamic) Np-Glm .

99 0 .

95 2 .

43 0 .

30 1 .

73 0 . Wbl-Glm 2.33 1.10 2.85 0.36 2.08 0.58Exp-Glm 3.11 1.39 3.88 0.52 2.58 0.50Crm 3.08 1.06 3.32 2.04 2.98 0.37Arm 2.95 1.33 4.48 0.48 1.48 0.56Exp. Smoothing(Dynamic) Np-Glm 2.15 1.07 2.54 0.32 1.98 0.53Wbl-Glm 2.50 1.22 2.89 0.38 2.46 0.58Exp-Glm 3.20 1.49 3.73 0.51 3.06 0.45Crm 2.55 0.97 3.05 1.58 2.11 0.55Arm 6.75 2.83 7.86 1.17 6.39 0.60Single Snapshot(Static) Np-Glm 2.76 1.35 3.07 0.44 2.88 0.50Wbl-Glm 2.81 1.38 3.16 0.45 2.88 0.48Exp-Glm 3.28 1.57 3.70 0.53 3.30 0.14Crm 2.96 1.03 3.21 1.73 2.97 0.38Arm 3.89 1.76 5.45 0.66 2.11 0.46 D e l i c i o u s LSTM Autoencoder(Dynamic) Np-Glm .

10 1 .

20 2 .

55 0 .

35 2 .

05 0 . Wbl-Glm 2.37 1.31 2.89 0.40 2.16 0.57Exp-Glm 3.21 1.58 3.84 0.54 2.89 0.55Crm 6.38 3.10 6.55 1.33 6.87 0.43Arm 5.20 2.56 6.23 0.86 4.99 0.52Exp. Smoothing(Dynamic) Np-Glm 2.25 1.36 2.74 0.40 2.11 0.66Wbl-Glm 2.61 1.64 3.20 0.47 2.17 0.56Exp-Glm 3.52 1.99 4.54 0.62 3.20 0.39Crm 3.28 3.69 3.84 2.07 2.88 0.43Arm 6.36 3.24 7.80 1.09 6.72 0.56Single Snapshot(Static) Np-Glm 2.33 1.46 2.80 0.41 2.17 0.61Wbl-Glm 2.65 1.62 3.23 0.47 2.26 0.43Exp-Glm 3.35 1.91 4.17 0.59 2.75 0.35Crm 3.06 2.05 3.47 1.53 2.84 0.38Arm 5.79 2.76 6.69 1.16 5.89 0.37 M o v i e L e n s LSTM Autoencoder(Dynamic) Np-Glm .

48 3 .

08 3 .

04 0 .

55 2 .

14 0 . Wbl-Glm 3.06 3.61 3.79 0.65 2.60 0.56Exp-Glm 3.79 2.70 4.60 0.78 3.48 0.45Crm 3.07 3.47 3.74 2.02 2.51 0.40Arm 5.53 5.63 7.41 1.12 3.80 0.53Exp. Smoothing(Dynamic) Np-Glm 2.69 3.35 3.18 0.59 2.61 0.66Wbl-Glm 3.09 3.62 3.59 0.66 2.95 0.52Exp-Glm 3.52 2.86 4.05 0.74 3.26 0.43Crm 3.18 3.37 3.68 1.90 2.56 0.48Arm 9.39 8.60 10.06 1.83 9.26 0.52Single Snapshot(Static) Np-Glm 2.92 3.44 3.45 0.67 3.36 0.50Wbl-Glm 2.99 3.52 3.51 0.69 3.37 0.49Exp-Glm 3.42 2.89 3.86 0.78 3.82 0.49Crm 3.14 3.48 3.63 2.20 3.55 0.35Arm 5.71 5.50 7.30 1.23 5.18 0.47

In the rest of this section, we first assess how well different methods perform over various datasetsand compare their performance based on different measures. Next, we discuss the efficiency of ourproposed method by measuring and comparing its running time against the other baselines. Finally,we analyze the effect of different parameters and problem configurations on the performance ofcompetitive methods.

In the first set of experiments, we evaluate the predic-tion power of different models combined with different feature extractors on DBLP, Delicious andMovieLens datasets. MAE, MRE, RMSE, MSLE, MDAE, and CI of all models using both dynamic and

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:25 . . . . . . . . . . . . . Absolute Error P r e d i c t i o n A cc u r a c y Np-GlmWbl-GlmCrmArm (a) DBLP . . . . . . . . . . . . . Absolute Error P r e d i c t i o n A cc u r a c y Np-GlmWbl-GlmCrmArm (b) Delicious . . . . . . . . . . . . . Absolute Error P r e d i c t i o n A cc u r a c y Np-GlmWbl-GlmCrmArm (c) MovieLensFig. 9. Prediction accuracy of different methods vs the maximum tolerated absolute error on different datasets. . . . . . . M A E Np-GlmWbl-Glm (a) Mean Absolute Error . . . . . . . C I Np-GlmWbl-Glm (b) Concordance IndexFig. 10. Effect of choosing different number of snapshots on performance of different methods using Deliciousdataset. . . . . . . . . M A E Np-GlmWbl-Glm (a) Mean Absolute Error . . . . . . . . C I Np-GlmWbl-Glm (b) Concordance IndexFig. 11. Effect of choosing different number of snapshots on performance of different methods using Movie-Lens dataset. static feature sets has been shown in Table 5. We see that in all three networks, Np-Glm with theLSTM Autoencoder feature set is superior to the other methods under all performance measures.For instance, our model Np-Glm can obtain an MAE of 1.99 for DBLP dataset, which is 15% lower

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :26 Sajadmanesh et al. . . . . . . . . . . ∆ M A E Np-GlmWbl-Glm (a) Mean Absolute Error . . . . . . . . . . ∆ C I Np-GlmWbl-Glm (b) Concordance IndexFig. 12. Effect of choosing different values for ∆ on performance of different methods using Delicious dataset. . . . . . . . . . . . . ∆ M A E Np-GlmWbl-Glm (a) Mean Absolute Error . . . . . . . . . . . ∆ C I Np-GlmWbl-Glm (b) Concordance IndexFig. 13. Effect of choosing different values for ∆ on performance of different methods using MovieLensdataset. than the MAE obtained by its closest competitor, Wbl-Glm. As of CI, Np-Glm achieves 0.62 onDBLP, which is 7% better than Wbl-Glm. On Delicious dataset, Np-Glm improves MAE and CIby 11% and 23%, respectively, relative to Wbl-Glm. Similarly, Np-Glm reduces MAE by 19% andincreases CI by 25%. Comparable results hold for other performance measures as well. Accordingly,Wbl-Glm, which has two degrees of freedom, has shown a better performance compared to othermodels. That is while Np-Glm, as a non-parametric model with highly tunable shape, outperformsall the other “less-flexible” models by learning the true distribution of the data.Moreover, it is evident from Table 5 that using the dynamic features learned with the LSTMautoencoder has boosted the performance of all models over different datasets, and has outperformedthe other feature extractors. Based on the results presented in Table 5, the alternative dynamicfeature extractor, exponential smoothing, has performed better than the static single snapshotfeature extractor, yet not better than the proposed LSTM Autoencoder. Comparing the LSTMAutoencoder with exponential smoothing feature extractor, over the DBLP dataset, the proposedfeature extractor has achieved 7% less MAE and 17% more CI with Np-Glm. Over Delicious, Np-Glm with LSTM-based features reduces MAE by about 7% and improves CI by 7%. Finally, on theMovieLens dataset, combining LSTM Autoencoder and Np-Glm leads to an improvement of 8% and7% under MAE and CI, respectively. The other models behave more or less similarly when they ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:27

Table 6. Comparison of Computational Time Measured in Seconds

Dataset Model Feature ExtractorSingle Snapshot Exp. Smoothing LSTM AutoencoderDBLP Np-Glm 35.92 98.35 93.59Wbl-Glm 36.01 74.34 79.83Delicious Np-Glm 2.01 110.67 128.43Wbl-Glm 1.89 97.44 123.53MovieLens Np-Glm 19.60 177.015 232.77Wbl-Glm 19.70 154.86 213.7 are combined with different feature extractors. This result clearly demonstrates that our featureextraction framework is performing well on capturing the temporal dynamics of the networks.In the next experiment, we investigated the performance of different models using the LSTMautoencoder feature extraction framework under maximum threshold prediction accuracy. Toevaluate the prediction accuracy of a model, we record the fraction of test samples for which thedifference between their true times and predicted ones are lower than a given threshold, called tolerated error . The results are plotted in Fig. 9 where we varied the tolerated error in the range { . , . , . . . , . } . We can see from the figure that Np-Glm and Wbl-Glm perform comparably, yetNp-Glm outperforms Wbl-Glm in all cases. For example on MovieLens dataset (Fig. 9c), Np-Glmcan predict the relationship building time of all the test samples with 100% accuracy by an error of3 months, whereas for Wbl-Glm, this is reduced to 90%. Similarly, on the Delicious dataset, Np-Glmwith 3 months of tolerated error achieves around 80% accuracy, which is about 12% more thanWbl-Glm. In this part, we analyze and compare the running time of Np-Glmand Wbl-Glm models utilizing different feature extractors, namely LSTM autoencoder, exponentialsmoothing, and single snapshot. All the algorithms were implemented in Python and were runon a Windows 10 PC with Intel Core i7 1.8 GHz CPU and 12GB of RAM. The full specification ofthe host machine is reported in Table 3. We measured the running time of all the methods duringa complete training and test procedure, including feature extraction, learning, and inference. Forexponential smoothing feature extractor, we included the time required for tuning the smoothingfactor α using a separate validation set, while for LSTM based framework the training time of theautoencoder is counted toward total running time. Table 6 presents the results over each of theDBLP, Delicious, and MovieLens datasets. Since a considerable amount of running time is spenton feature extraction, dynamic feature extraction frameworks require more time to process thenetwork data as opposed to static single snapshot feature extractors. However, the proposed LSTMautoencoder performs comparably to exponential smoothing in terms of running time. Even thoughLSTM autoencoder is a bit slower than the other dynamic feature extractors, it demonstrates higherprediction performance compared to models utilizing exponential smoothing. For example, onMovieLens network with more than 20K nodes and 1 million links, Np-Glm with LSTM autoencoderrequires less than four minutes to process the whole network, extract features and learn from about6K labeled samples, and perform prediction for about 2K instances on a typical PC. The performance of different models is influenced by twoparameters, the number of snapshots k , and the time difference between snapshots ∆ , as theseparameters determine the length of the feature extraction window Φ . In this set of experiments,we investigate how these parameters affect the performance of our model Np-Glm and its closest ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :28 Sajadmanesh et al. competitor Wbl-Glm over Delicious and MovieLens datasets using the proposed LSTM basedfeature extraction framework.Firstly, The effect of increasing the number of snapshots on achieved MAE and CI by Np-Glm andWbl-Glm over Delicious and MovieLens datasets is illustrated in Fig. 10 and Fig. 11, respectively.For both datasets, we set ∆ = . Ω =

18 and varied the number of snapshots in the range of 3to 18. As we can see in both figures, increasing the number of snapshots results in lower predictionerror and higher accuracy. This is due to the fact that as the number of snapshots grows, a longerhistory of the network is taken into account. Therefore, different models can benefit from moreinformation about the temporal dynamics of the network given to them through the extractedfeature vector.Finally, the impact of choosing different values for ∆ is analyzed on the performance of Np-Glmand Wbl-Glm in terms of MAE and CI. The results for Delicious and MovieLens datasets aredepicted in Fig. 12 and Fig. 13, respectively. In this experiment, the number of snapshots andobservation window length are accordingly set to 6 and 24. Different values of ∆ are selected fromthe set { . , . , . . . , . } . As illustrated in both figures, by increasing ∆ up to an extent, we witnessthat the performance of models improves gradually. That is because increasing the value of ∆ leadsto a wider feature extraction window. However, since the number of snapshots is constant, we seeno performance improvement when the value of ∆ becomes greater than a certain threshold. Thisis due to the fact that short-term temporal evolution of the network will be ignored when the valueof ∆ becomes too wide. The problem of link prediction has been studied extensively in recent years and many approacheshave been proposed to solve this problem [41, 42]. Previous work on time-aware link prediction hasmostly considered temporality in analyzing the long-term network trend over time [11]. Authorsin [28] have shown that temporal metrics are an extremely valuable new contribution to linkprediction, and should be used in future applications. Dunlavy et al . focused on the problem ofperiodic temporal link prediction [13]. They concentrated on bipartite graphs that evolve over timeand also considered a weighted matrix that contained multilayer data and tensor-based methodsfor predicting future links. Oyama et al . solved the problem of cross-temporal link prediction,in which the links among nodes in different time frames are inferred [26]. They mapped dataobjects in different time frames into a common low-dimensional latent feature space and identifiedthe links on the basis of the distance between the data objects. Özcan et al . proposed a novellink prediction method for evolving networks based on NARX neural network [27]. They take thecorrelation between the quasi-local similarity measures and temporal evolutions of link occurrencesinformation into account by using NARX for multivariate time series forecasting. Yu et al . developeda novel temporal matrix factorization model to explicitly represent the network as a function oftime [44]. They provided results for link prediction as a specific example and showed that theirmodel performs better than the state-of-the-art techniques.The most relevant works to this study are available in [2, 14, 24, 30, 34]. The Authors in [14]approach the problem of time series link prediction by extracting simple temporal features fromthe time series, such as mean, (weighted) moving average, and exponential smoothing besides sometopological features like common neighbor and Adamic-Adar. But their method is designed forhomogeneous networks and fail to consider the heterogeneity of modern networks. Aggarwal et al .[2] tackle the link prediction problem in both dynamic and heterogeneous information networksusing a dynamic clustering approach alongside content-based and structural models. However,they aim to solve the conventional link prediction problem, not the continuous-time relationshipprediction studied in this paper. In [30], the authors proposed a feature set, called TMLP, well suited

ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:29 for link prediction in dynamic and heterogeneous information networks. Although their proposedfeature set copes with both dynamicity and heterogeneity of the network, it cannot be extended forthe generalized problem of relationship prediction and is only designed for solving the simpler linkprediction problem. Milani Fard et al . developed an approach called MetaDynaMix which utilizesa set of latent and topological features for predicting a target relationship between two nodes ina dynamic heterogeneous information network [24]. They combine meta path-based topologicalfeatures with inferred latent features that take temporal network evolutions into account, in orderto capture both heterogeneity and dynamicity of the network.Most of the aforementioned works answered the question of whether a link will appear inthe network. To the best of our knowledge, the only work that has focused on the continuous-time relationship prediction problem is proposed by Sun et al . [34], in which a generalized linearmodel based framework is suggested to model the relationship building time. They consider thebuilding time of links as independent random variables coming from a pre-specified distributionand model the expectation as a function of a linear predictor of the extracted topological features.A shortcoming of this model is that we need to exactly specify the underlying distribution ofrelationship building times. We came over this problem by learning the distribution from the datausing a non-parametric solution. Furthermore, we considered the temporal dynamics of the networkwhich has been entirely ignored in their work.

In this paper, we studied the problem of continuous-time relationship prediction in both dynamicand heterogeneous information networks. To effectively tackle this problem, we first introduced anovel feature extraction framework based on meta-path modeling and recurrent neural networkautoencoders to systematically extract features that take both the temporal dynamics and hetero-geneous characteristics of the network into account for solving the continuous-time relationshipproblem. We then proposed a supervised non-parametric model, called Np-Glm, which exploits theextracted features to predict the relationship building time in information networks. The strengthof our model is that it does not impose any significant assumptions on the underlying distribu-tion of the relationship building time given its features, but tries to infer it from the data via anon-parametric approach. Extensive experiments conducted on a synthetic dataset and real-worlddatasets from DBLP, Delicious, and MovieLens demonstrated the correctness of our method and itseffectiveness in predicting the relationship building time.For future work, we would like to design a unified architecture to combine feature extractionstep with the learning algorithm in an integrated deep learning framework. Moreover, althoughthe propsed method is able to scale to large information networks with thousands of nodes, it isnot currently extensible to web-scale information networks where the number of nodes is in thescale of hundreds of millions. Learning temporal non-parametric models within an extremely hugedataset is a challenging problem and is an interesting and important future work. As calculatingmeta-path-based features are the primary computational bottleneck of our method, to make thelearning process scalable, we set to investigate node embedding and approximation techniques.

ACKNOWLEDGMENTS

This work is partially supported by NSF through grant IIS-1763365.

REFERENCES [1] Odd O Aalen. 1989. A linear regression model for the analysis of life times.

Statistics in medicine

8, 8 (1989), 907–925.[2] Cham Aggarwal, Yan Xie, and Philip S Yu. 2012. On dynamic link inference in heterogeneous networks. In

Proceedingsof the 2012 SIAM International Conference on Data Mining . SIAM, 415–426.ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. :30 Sajadmanesh et al. [3] Takeshi Amemiya. 1984. Tobit models: A survey.

Journal of econometrics

24, 1-2 (1984), 3–61.[4] Yoshua Bengio et al. 2009. Learning deep architectures for AI.

Foundations and trends® in Machine Learning

2, 1 (2009),1–127.[5] Norman E Breslow. 1975. Analysis of Survival Data under the Proportional Hazards Model.

International StatisticalReview / Revue Internationale de Statistique

43, 1 (1975), 45–57.[6] Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion inRecommender Systems (HetRec 2011). In

Proceedings of the 5th ACM conference on Recommender systems (RecSys 2011) .ACM, New York, NY, USA.[7] Xing Chen, Ming-Xi Liu, and Gui-Ying Yan. 2012. Drug–target interaction prediction by random walk on theheterogeneous network.

Molecular BioSystems

8, 7 (2012), 1970–1978.[8] François Chollet et al. 2015. Keras. https://keras.io. (2015).[9] Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In

Advances in Neural Information ProcessingSystems . 3079–3087.[10] Darcy Davis, Ryan Lichtenwalter, and Nitesh V Chawla. 2011. Multi-relational link prediction in heterogeneousinformation networks. In

Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conferenceon . IEEE, 281–288.[11] Yugchhaya Dhote, Nishchol Mishra, and Sanjeev Sharma. 2013. Survey and analysis of temporal link predictionin online social networks. In

Advances in Computing, Communications and Informatics (ICACCI), 2013 InternationalConference on . IEEE, 1178–1183.[12] Yuxiao Dong, Jie Tang, Sen Wu, Jilei Tian, Nitesh V Chawla, Jinghai Rao, and Huanhuan Cao. 2012. Link predictionand recommendation across heterogeneous social networks. In

Data Mining (ICDM), 2012 IEEE 12th InternationalConference on . IEEE, 181–190.[13] Daniel M Dunlavy, Tamara G Kolda, and Evrim Acar. 2011. Temporal link prediction using matrix and tensorfactorizations.

ACM Transactions on Knowledge Discovery from Data (TKDD)

5, 2 (2011), 10.[14] Alireza Hajibagheri, Gita Sukthankar, and Kiran Lakkaraju. 2016. Leveraging network dynamics for improved linkprediction. In

Social, Cultural, and Behavioral Modeling: 9th International Conference, SBP-BRiMS 2016, Washington, DC,USA, June 28-July 1, 2016, Proceedings 9 . Springer, 142–151.[15] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context.

ACM Trans. Interact.Intell. Syst.

5, 4, Article 19 (Dec. 2015), 19 pages.[16] Frank E Harrell Jr, Robert M Califf, David B Pryor, Kerry L Lee, Robert A Rosati, et al. 1982. Evaluating the yield ofmedical tests.

Jama

Neural computation

9, 8 (1997), 1735–1780.[18] Ellis Horowitz, Sartaj Sahni, and Susan Anderson-Freed. 1983.

Fundamentals of data structures . Vol. 20. Computerscience press Rockville, MD.[19] David W Hosmer Jr, Stanley Lemeshow, and Susanne May. 2011.

Applied survival analysis: regression modeling oftime-to-event data . Vol. 618. John Wiley & Sons.[20] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a newsmedia?. In

Proceedings of the 19th international conference on World wide web . ACM, 591–600.[21] David Liben-Nowell and Jon Kleinberg. 2007. The link prediction problem for social networks. journal of the Associationfor Information Science and Technology

58, 7 (2007), 1019–1031.[22] Ryan N Lichtenwalter, Jake T Lussier, and Nitesh V Chawla. 2010. New perspectives and methods in link prediction. In

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining . ACM, 243–252.[23] Linyuan Lü and Tao Zhou. 2011. Link prediction in complex networks: A survey.

Physica A: Statistical Mechanics andits Applications

Proceedings of the European Conference on Information Retrieval (ECIR) . Springer, 12 pages.[25] Behnaz Moradabadi and Mohammad Reza Meybodi. 2017. A novel time series link prediction method: Learningautomata approach.

Physica A: Statistical Mechanics and its Applications

482 (2017), 422–432.[26] Satoshi Oyama, Kohei Hayashi, and Hisashi Kashima. 2011. Cross-temporal link prediction. In

Data Mining (ICDM),2011 IEEE 11th International Conference on . IEEE, 1188–1193.[27] Alper Özcan and Şule Gündüz Öğüdücü. 2016. Temporal Link Prediction Using Time Series of Quasi-Local NodeSimilarity Measures. In

Machine Learning and Applications (ICMLA), 2016 15th IEEE International Conference on . IEEE,381–386.[28] Anet Potgieter, Kurt A April, Richard JE Cooke, and Isaac O Osunmakinde. 2009. Temporality in link prediction:Understanding social complexity.

Emergence: Complexity and Organization

11, 1 (2009), 69.[29] S. Sajadmanesh, H. R. Rabiee, and A. Khodadadi. 2016. Predicting anchor links between heterogeneous social networks.In . 158–163.ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 1, Article 1. Publication date: May 2018. ontinuous-Time Relationship Prediction 1:31 [30] Niladri Sett, Saptarshi Basu, Sukumar Nandi, and Sanasam Ranbir Singh. 2017. Temporal link prediction in multi-relational network.

World Wide Web (2017), 1–25.[31] Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and S Yu Philip. 2017. A survey of heterogeneous information networkanalysis.

IEEE Transactions on Knowledge and Data Engineering

29, 1 (2017), 17–37.[32] Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, and Vikas C Raykar. 2008. On ranking insurvival analysis: Bounds on the concordance index. In

Advances in neural information processing systems . 1209–1216.[33] Yizhou Sun, Rick Barber, Manish Gupta, Charu C Aggarwal, and Jiawei Han. 2011. Co-author relationship predictionin heterogeneous bibliographic networks. In

Advances in Social Networks Analysis and Mining (ASONAM), 2011International Conference on . IEEE, 121–128.[34] Yizhou Sun, Jiawei Han, Charu C. Aggarwal, and Nitesh V. Chawla. 2012. When Will It Happen?: RelationshipPrediction in Heterogeneous Information Networks. In

Proceedings of the Fifth ACM International Conference on WebSearch and Data Mining (WSDM ’12) . ACM, New York, NY, USA, 663–672.[35] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. 2011. Pathsim: Meta path-based top-k similaritysearch in heterogeneous information networks.

Proceedings of the VLDB Endowment

4, 11 (2011), 992–1003.[36] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: Extraction and Mining ofAcademic Social Networks. In

Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining (KDD ’08) . ACM, New York, NY, USA, 990–998.[37] Ben Taskar, Ming-Fai Wong, Pieter Abbeel, and Daphne Koller. 2004. Link prediction in relational data. In

Advances inneural information processing systems . 659–666.[38] Ian W Taylor, Rune Linding, David Warde-Farley, Yongmei Liu, Catia Pesquita, Daniel Faria, Shelley Bull, Tony Pawson,Quaid Morris, and Jeffrey L Wrana. 2009. Dynamic modularity in protein interaction networks predicts breast canceroutcome.

Nature biotechnology

27, 2 (2009), 199.[39] James Tobin. 1958. Estimation of Relationships for Limited Dependent Variables.

Econometrica

DataMining, 2007. ICDM 2007. Seventh IEEE International Conference on . IEEE, 322–331.[41] Peng Wang, BaoWen Xu, YuRong Wu, and XiaoYu Zhou. 2015. Link prediction in social networks: the state-of-the-art.

Science China Information Sciences

58, 1 (2015), 1–38.[42] Tingli Wang and Guoqiong Liao. 2014. A review of link prediction in social networks. In

Management of e-Commerceand e-Government (ICMeCG), 2014 International Conference on . IEEE, 147–150.[43] Stanley Wasserman and Katherine Faust. 1994.

Social network analysis: Methods and applications . Vol. 8. Cambridgeuniversity press.[44] Wenchao Yu, Charu C Aggarwal, and Wei Wang. 2017. Temporally Factorized Network Modeling for EvolutionaryNetwork Analysis. In

Proceedings of the Tenth ACM International Conference on Web Search and Data Mining . ACM,455–464.[45] Matthew D Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).[46] Jiawei Zhang, Philip S Yu, and Zhi-Hua Zhou. 2014. Meta-path based multi-network collective link prediction. In