On a Bernoulli Autoregression Framework for Link Discovery and Prediction
Xiaohan Yan∗, Avleen S. Bijral∗
Microsoft, One Microsoft Way, Redmond, WA 98052
{xiaoya,avbijral}@microsoft.com
July 24, 2020
Abstract
We present a dynamic prediction framework for binary sequences that is based on a Bernoulli generalization of the auto-regressive process. Our approach lends itself easily to variants of the standard link prediction problem for a sequence of time dependent networks. Focusing on this dynamic network link prediction/recommendation task, we propose a novel problem that exploits additional information via a much larger sequence of auxiliary networks and has important real-world relevance. To allow discovery of links that do not exist in the available data, our model estimation framework introduces a regularization term that presents a trade-off between the conventional link prediction and this discovery task. In contrast to existing work our stochastic gradient based estimation approach is highly efficient and can scale to networks with millions of nodes. We show extensive empirical results on both actual product-usage based time dependent networks and also present results on a Reddit based data set of time dependent sentiment sequences.
Predicting and recommending links in a future network is a problem of significant interest in many communities, with applications ranging from recommending friends Aiello et al. (2012) and patent partners Wu et al. (2013) to disease propagation Myers and Leskovec (2010) and detection of abnormal communication Huang and Lin (2009). At a high level, link prediction is not markedly different from the problem of matrix completion when the network is represented as an adjacency matrix. However, in many cases past link information is predictive of the future. In one such application, Richard et al. (2010) proposed an approach for time dependent graphs where past adjacency matrices and a forecast of a network feature sequence are used to estimate links at a future time. This is a dynamic extension of the static link prediction problem Liben-Nowell and Kleinberg (2007), with origins in using network properties of a static snapshot to predict links. As a simple example, consider the problem of predicting communication between two users over a network: past information is clearly predictive.

In this paper we discuss one variant of the dynamic link prediction problem and, in contrast to Richard et al. (2010), present a highly scalable model with an efficient estimation approach. The underlying idea behind our approach is to exploit a sequence of auxiliary networks with feature data to inform links in the network sequence of interest. This is a fairly common scenario in a large corporation with multiple products and services: it can be immensely useful to use network information gathered from a product A to inform usage (via link prediction) on a product B. In an online retail setting such as Amazon, a variety of user-item interactions are possible, including "view", "add-to-cart" and "purchase".

∗ Both authors contributed equally to this research.
One could generate networks at a user-item level for each type of event, and leverage the "view" network or the "add-to-cart" network to infer connections on the "purchase" network, i.e., when a user eventually buys an item. In such scenarios we have two network sequences: a main sequence that is expected to be sparse, and a relatively dense auxiliary sequence with additional feature information that is predictive of a link in the main network. The introduction of the auxiliary sequence serves two purposes: it provides candidate links to predict from, and a feature set that is possibly predictive of these links. However, the model we propose is general and holds even for a single network sequence with feature information. In the standard matrix completion framework, the observed entries are directly modeled with low rank constraints on the matrix. Instead, we model link probabilities using a Bernoulli parametrization, incorporate an auto-regressive structure to allow for dependence on past data, and also include an additional dependence on logit transformed feature data. The equivalent of a low rank constraint in this framework is a relatively under-parametrized model, as we shall see in Section 3.

Existing work on link prediction has primarily focused on static networks, with methods ranging from matrix factorization Koren (2008) and node neighborhood measures Sarkar et al. (2011); Liben-Nowell and Kleinberg (2007) to network diffusion Myers and Leskovec (2010). For an extensive survey see Wang et al. (2015). In contrast, (dynamic) link prediction in the presence of a sequence of networks is relatively unexplored. Vu et al. (2011) propose a longitudinal network evolution approach that models edges at time t as a multivariate counting process where the intensity function incorporates edge level features (network derived) using either the multiplicative Cox or additive Aalen form.
Our setting is different in that we have two sequences of networks to exploit, and the design of our estimation method enables us to discover new links, which is not possible there. Moreover, their estimation methods require computation of large weight matrices and it is not clear if this can scale to networks with millions of nodes. Additionally, many real world networks tend to have low rank structures due to similar local node characteristics, and explicit introduction of these constraints is a desirable feature. Closer in structure to our model is the nonparametric approach described in Sarkar et al. (2012), where the edge probabilities are assumed to be Bernoulli distributed with a link function connecting the edge/node probabilities to the node/edge level features. Besides the difference in the problem setting, Sarkar et al. (2012) use a nonparametric estimator for the link function based on a kernel weighted local neighborhood approach (in time and feature space). This requires a search over edge features that are similar to the edge being modeled and can be quite expensive; even with the use of fast search techniques it is unlikely to scale to extremely large graphs.

In the matrix completion line of work, Richard et al. (2010) propose a graph feature tracking (GFT) framework for dynamic networks where the entire network at a future step is estimated by solving an appropriately regularized optimization problem. The optimization reflects a trade-off between fidelity to the last observed network, agreement with forecast features, and low rank structure (see (6)).

In Section 5.2 we apply our approach on a hyperlink network from Reddit and show its improved performance over a baseline approach. The Reddit hyperlink network represents a single network sequence, and the applicability of our approach demonstrates its potential wide usage in modeling time-dependent binary sequences. The rest of the work is organized as follows. In Section 3 we propose the BAR model with its estimation problem in Section 3.1.
We compare the BAR model with baselines (GFT, logistic) in a simulation study in Section 4. On two real datasets in Section 5, one from actual products A and B (to maintain anonymity we don't reveal the actual product names) and the other from Reddit, we apply the proposed model to show its superior performance in link prediction and discovery. Finally, we conclude the work with discussions in Section 6.

Let the adjacency matrices A^(t) ∈ {0, 1}^{n×n} and B^(t) ∈ {0, 1}^{N×N} be the sequences of main and auxiliary networks for t ∈ {1, . . . , T}. Corresponding to the edges in the auxiliary sequence we also have a sequence of feature tensors F^(t) ∈ R^{N×N×d} such that, for all (i, j) with B_ij^(t) = 1, we have F_ij^(t) ∈ R^d. We assume that an edge in the main network sequence exists at least once in the auxiliary sequence and that the numbers of nodes (N and n) are fixed over time. This is not an issue in practice since we can always set N to be the largest value over t and assume all-0 rows at other times.

There are two important aspects of our model definition. First, we believe that the feature sequence and the history of connections are predictive of a link in the future. As an example, in the context of an email network this could imply that an increased exchange between two nodes is possibly indicative of a future calendar event. Second, the probability of link formation evolves with time and in that sense depends on the entire sequence of edge level features. For any link A_ij^(t) we have

A_ij^(t) | F^(1), . . . , F^(t)  ~ind  Bernoulli(Q_ij^(t))
Q_ij^(t) = λ Q_ij^(t−1) + (1 − λ) P_ij^(t)
P_ij^(t) = { (1 + exp(−β′F_ij^(t)))^(−1)  if B_ij^(t) = 1;  0 otherwise }   (1)

Here λ ∈ [0, 1] controls the trade-off between the dependence on the past and the current feature vector for a link (i, j). The model expresses our belief that, given the features and link history, a connection at time t is independent (but non-homogeneous) of all other links, and that this probability evolves as an auto-regressive model with weight on the probability of connection at time t − 1. We call Model (1) Bernoulli Auto-Regressive (BAR). Note that the recursion of Q_ij^(t) in Model (1) allows us to express the probability of connection at a given time as a weighted sum of past probabilities. This weighted sum, along with a non-linear dependence on the features, provides flexibility in modeling arbitrary sequences with observation level features. This is more general than modeling the probabilities as a logit transform of lagged feature vectors. In the choice of P_ij^(t) we can also use different link functions to extend our model to sequences of discrete outcomes of K categories where K > 2. Note that P_ij^(t) is valued between 0 and 1 so that Q_ij^(t) is a valid Bernoulli parameter.

For certain applications BAR may be under-parametrized. We can extend the model by having different coefficient vectors in the link probability for different sets of connections. These sets can be determined by using clustering on the features as a pre-processing step. More precisely, we can have for P_ij^(t)

P_ij^(t) = (1 + exp(−β_k′ F_ij^(t)))^(−1)  if B_ij^(t) = 1 and C_ij^(t) = k,   (2)

where C_ij^(t) is the cluster membership for link (i, j) at time t. We assume only one cluster for edge membership for the rest of the work.

The decaying contribution of historical features is more obvious once we express the recursion of Q_ij^(t) in Model (1) equivalently as

Q_ij^(t) = λ^t Q_ij^(0) + Σ_{s=1}^{t} λ^(t−s) (1 − λ) P_ij^(s).   (3)

The longer the time that has elapsed since time s, the smaller the weight (λ^(t−s)(1 − λ)) applied to the probability P_ij^(s) and its underlying features from time s.
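As a sanity check on the equivalence of the recursion in (1) and the closed form (3), the following minimal Python sketch (our illustration, not the authors' code) computes Q_ij^(t) both ways for an arbitrary sequence of P values and confirms they agree:

```python
import numpy as np

def q_recursive(q0, p_seq, lam):
    """Q(t) via the recursion Q(t) = lam * Q(t-1) + (1 - lam) * P(t)."""
    q = q0
    for p in p_seq:
        q = lam * q + (1.0 - lam) * p
    return q

def q_closed_form(q0, p_seq, lam):
    """Closed form (3): Q(t) = lam^t Q(0) + sum_s lam^(t-s) (1 - lam) P(s)."""
    t = len(p_seq)
    total = lam**t * q0
    for s, p in enumerate(p_seq, start=1):
        total += lam**(t - s) * (1.0 - lam) * p
    return total

rng = np.random.default_rng(0)
p_seq = rng.uniform(size=10)   # arbitrary P(t) values in [0, 1]
for lam in (0.0, 0.3, 0.9, 1.0):
    assert abs(q_recursive(0.2, p_seq, lam) - q_closed_form(0.2, p_seq, lam)) < 1e-12
```

The two extremes behave as described in the text: with lam = 1 the recursion stays at Q(0), and with lam = 0 it tracks the current P(t) exactly.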
The diminishing contribution from features over elapsed time is a valid assumption in many applications (e.g., attribution models). The choice of λ governs the decay rate: larger λ makes Q_ij^(t) more dependent on historical probabilities, whereas smaller λ magnifies the dependence on current features. To illustrate the relation between Q_ij^(t) and λ, we consider a simplified scenario: we let Q_ij^(0) be 0.2 and P_ij^(t) be 0.8 for all t. In Figure 1 we show how Q_ij^(t) evolves with t at various λ values. At the extremes, Q_ij^(t) takes the constant value of Q_ij^(0) (when λ = 1) or P_ij^(t) (when λ = 0). For 0 < λ < 1, the larger λ is, the more slowly the contribution of historical features to Q_ij^(t) decays and the more slowly Q_ij^(t) deviates from its initial assignment (0.2) towards the probability derived from current features (0.8).

Figure 1: Evolution of Q_ij^(t) over t at various λ, under Q_ij^(0) = 0.2 and P_ij^(t) = 0.8.

Model estimation is performed by maximizing the log-likelihood function for Model (1), since for each (i, j) we can obtain a closed form expression for the probability of connection Q_ij^(t) at time t using the recursive formula (3). However, as we mentioned before, the main network can be very sparse and so we cannot hope to learn much from a small number of observations. To cast our network to a wider set of connections we add a regularization term to the log-likelihood and propose to minimize the following problem:

minimize_{β ∈ R^d}   -Log-Likelihood + α Σ_{t=1}^{T} || (Q^(t) − B^(t)) Φ^(t) ||_F^2   (4)

where Φ^(t) = F^(t) ⊗ 1_N ∈ R^{N×d} are aggregated features at sender-level, i.e., Φ_iℓ^(t) = Σ_{j: B_ij^(t)=1} F_ijℓ^(t) encodes the ℓ-th feature aggregated over recipients of sender i.

The regularization term provides a way to minimize the deviation of first order statistics (for the features) between the network probabilities of interest and the observed auxiliary network. In other words, we want network aggregated features to be similar for the main and auxiliary sequences. The regularization and the probabilities P_ij^(t) are precisely what allow us to learn from auxiliary data, and Problem (4) represents a trade-off between the tasks of link prediction and discovery.
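To make the regularization term concrete, here is a small Python sketch (our illustration, with randomly generated Q, B and F of toy sizes) that builds the sender-level aggregate Φ and evaluates one time step of the penalty in (4), checking it against the elementwise triple sum:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 3
B = (rng.uniform(size=(N, N)) < 0.4).astype(float)   # auxiliary adjacency at time t
F = rng.uniform(size=(N, N, d))                      # edge-level features at time t
Q = rng.uniform(size=(N, N))                         # model link probabilities at time t

# Phi[i, l] aggregates feature l over recipients j of sender i with B_ij = 1
Phi = np.einsum('ij,ijl->il', B, F)

# one time step of the regularizer: || (Q - B) Phi ||_F^2
penalty = np.sum(((Q - B) @ Phi) ** 2)

# naive check against the explicit triple sum
naive = sum(
    sum((Q[i, j] - B[i, j]) * Phi[j, l] for j in range(N)) ** 2
    for i in range(N) for l in range(d)
)
assert abs(penalty - naive) < 1e-9
```

The matrix form makes the cost of the penalty O(N^2 d) per time step, which is what the stochastic estimation in Appendix A subsamples over.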
If we have much more observed data in the main network sequence and it is of interest to predict the prevalence of existing connections in the future, we should give more weight to the log-likelihood term (choose a small α); if instead our goal is to discover newer connections, we should focus on minimizing the regularizer term (with a larger α). If there is only one network sequence with feature information, we simply take α = 0 to accommodate the missing B^(t) and adjust the dependence of P^(t) onto A^(t) (see Section 5.2 for an example). These adjustments make our model generalizable to a single network sequence with relevant features.

We propose a stochastic gradient descent (SGD) algorithm to solve Problem (4). Our SGD algorithm exploits data sparsity in the main and auxiliary networks and is scalable to networks with millions of nodes. We refer readers to Appendix A for details of the algorithm.

4 Simulation Study
We simulate data from the generative model (1) and perform several experiments to gain more insight into its behavior and performance compared to competing models. Moreover, since the network sequences and edge features are highly likely to be statistically dependent, we employ settings that are indicative of real-world applications. To generate the first network in the auxiliary sequence we sample Erdős-Rényi graphs with different combinations of N and edge generation probabilities p. For subsequent time steps (T = 15) we vary

B_ij^(t) = { B_ij^(t−1) + 1 w.p. p_add if B_ij^(t−1) = 0;  B_ij^(t−1) − 1 w.p. p_del if B_ij^(t−1) = 1 }.   (5)

For every edge B_ij^(t) we sample the features F_ij^(t) ∈ R^d such that each element is Poisson(μ_t) distributed, where μ_t = μ_{t−1} + ε_t, with ε_t ~ N(0, σ²I), is a simple random walk. The coefficient vector β ~ Unif[0, 1]^d is normalized to have unit Euclidean norm. Finally, the main network sequence A_ij^(t) is generated by Bernoulli trials with probability Q_ij^(t) computed using the recursion in (1), where Q_ij^(0) is initialized as the maximum likelihood estimate of the edge probability.

We compare the BAR model with two competing models. The first model is graph feature tracking (GFT) from Richard et al. (2010), which incorporates a feature prediction and regularization procedure for estimating A^(T+1). Specifically, Richard et al. (2010) define an n × k matrix of features Ω that can capture the dynamics of A^(t), and project A^(t) with a linear feature map G^(t) = A^(t)Ω. With historical feature maps {G^(1), . . . , G^(T)}, Richard et al. (2010) propose using ridge regression to estimate Ĝ to substitute for A^(T+1)Ω. In the directed graph setting, we let the features Ω = V(:, 1:k)Σ^(−1)(1:k, 1:k), where V(:, 1:k) and Σ(1:k, 1:k) are the k leading singular vectors and singular values from the SVD of A^(T) = UΣV′, respectively.
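The generative procedure for the auxiliary sequence and its features can be sketched as follows; the specific values of p, p_add, p_del, the Poisson means and the network size are illustrative placeholders, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, d = 50, 15, 4
p, p_add, p_del = 0.05, 0.005, 0.01   # illustrative values only

B = [(rng.uniform(size=(N, N)) < p).astype(int)]   # Erdos-Renyi start
mu = np.full(d, 3.0)                                # random-walk Poisson means
F = []
for t in range(1, T):
    prev = B[-1]
    u = rng.uniform(size=(N, N))
    add = (prev == 0) & (u < p_add)                 # edge appears w.p. p_add
    delete = (prev == 1) & (u < p_del)              # edge disappears w.p. p_del
    B.append(np.where(add, 1, np.where(delete, 0, prev)))
    mu = np.maximum(mu + rng.normal(scale=0.1, size=d), 0.1)  # keep means positive
    F.append(rng.poisson(mu, size=(N, N, d)))       # Poisson features for every pair

assert all(set(np.unique(b)) <= {0, 1} for b in B)
```

The main sequence A^(t) would then be drawn from Bernoulli(Q^(t)) with Q^(t) computed by the recursion in (1) over these features.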
We solve Problem (6) to get a GFT estimator for A^(T+1):

min_{S ∈ R^{n×n}}  (1/2) ||S − A^(T)||_F^2 + (ν/2) ||SΩ − Ĝ||_F^2 + τ ||S||_*   (6)

where ||S||_* is the nuclear norm for inducing low-rankness in S. Since the GFT estimator is not guaranteed to have its values between 0 and 1, we threshold any value of S outside [0, 1] to its closer boundary. In addition to GFT, we compare the BAR model to a logistic model with features being averaged historical features up to the current time:

A_ij^(t) | F^(1), . . . , F^(t)  ~ind  Bernoulli( (1 + exp(−β′ F_ij^avg(t)))^(−1) )   (7)

where the features averaged across historical time, F^avg(t), are defined as follows: F_ij^avg(1) = F_ij^(1) and, for t ≥ 2,

F_ij^avg(t) = { F_ij^avg(t−1) · (t−1)/t  if F_ij^(t) does not exist;  (F_ij^avg(t−1) · (t−1) + F_ij^(t)) / t  if F_ij^(t) exists }.   (8)

All three models under comparison have access to historical data, though they leverage the data differently.

Table 1: For each simulation scenario (Task, Parameter), Node Degree (No. Links/No. Nodes) for the main network in the test period and AUC ROC for each model (averaged over 10 repetitions). The scenarios set N = n = 10k with p_add tied to p, varying p and p_del; recoverable node degrees include 17.1 (p_del = 0.005) and 13.3 (p_del = 0.05).

The comparisons are made over two tasks: a standard setting where networks of different densities are considered, and a setting with varying network dependence in which adjacent network variation is gradually increased. For each scenario, we fix N and n to 10 thousand and p_add to a fixed multiple of p, but vary p and p_del over a range of values. We include all simulation settings in Table 1. For the task with varying network dependence, we gradually increase p_del but fix all other parameters. Larger p_del increases the probability of deleting an existing edge, injecting more variability into networks over time. By varying these parameters we highlight the advantages of our approach over a more traditional matrix completion framework (e.g., GFT from Richard et al. (2010)) and also gain insight into the workings of our method. Under each parameter setting, we generate 10 data samples and fit each model on the samples. We finely tune each model with training data from the first 14 periods and test on the last period. We report the averaged AUC ROC over 10 repetitions for each model in Table 1. On all tasks BAR significantly outperforms GFT and the logistic model by achieving higher AUC ROC. When the network becomes more varied and more sparse with increasing p_del, the challenge increases for all models; BAR still consistently does better than the other two methods under this scenario.

We examine the performance of the BAR model in two real experiments. The first experiment uses product B as an auxiliary network to predict and discover connections on product A (the main network). The superior performance of the BAR model demonstrates the connections between the two underlying products. Moreover, the BAR model can fully leverage such connections to infer links on the product of interest (A). Our second experiment focuses on sentiment classification with a Reddit hyperlink network.
We construct temporal network sequences for hyperlinks and use the BAR model (along with the baselines) to classify sentiment for the Reddit posts. We show that the BAR model performs at least as well as the logistic baselines even when minimal time dependence is present in the network sequence.

Products from large corporations often provide a connected ecosystem for users where the modality of connection is defined by a specific product or feature. For example, there could be a communication messaging product and a social network. The different networks within such companies can have varying levels of connectivity.

Figure 2: Decomposition of the test set. (Left) The set of 1s (i.e., links) from the test period is partitioned into two subsets, depending on whether the link existed in the train period (1_existed) or not (1_new). (Right) The set of 0s (i.e., non-link pairs) from the test period is partitioned into four subsets by user membership: between main network users (0_main2main), between auxiliary network users (0_aux2aux), and from main network users to auxiliary network users and vice versa (0_main2aux and 0_aux2main).
In our case, product B's network is much bigger and denser than A's. Such an application is perfectly aligned with the framework we present here.

In the experiment we let product A's graph correspond to the main network and B's to the auxiliary network. The underlying intuition is that network-defining activity on product B can be indicative of users who might be interested in the feature provided by A. To prepare data, we extract directed links among users from A and B at the weekly level. The network data covers a period of 32 weeks where the last week is reserved as a test set. We down-sample B's network for computational reasons and use 15 weekly edge-level features corresponding only to activity. Note that we are also unable to detail the exact data and size descriptions due to privacy concerns. Finally, we tune the BAR model over a grid of α values between 0 and 1. For comparison, we use the same logistic model described in the simulation study in Section 4 that accounts for time dependence through historical features. Since GFT does not scale to large networks we are unable to make a comparison to it. We evaluate the model performance on the hold-out test set.

We have two tasks at hand: link prediction, which aims at accurately predicting recurrent links, and link discovery, which targets discovering new links that never existed before. We split the test set to properly evaluate the performance on the two tasks. The test set is made up of A's links from the last week (i.e., the set of ones) and pairs of users from A or B that did not connect on A in that week (i.e., the set of zeros). The test set is highly imbalanced: the set of zeros is larger than the set of ones by orders of magnitude. From a practical standpoint, a predicted link on A is more likely to hold if one of the two users already uses A. For link discovery, such an edge prediction can be directly translated into a recommendation setting.
Hence, restricting to zero entries with exactly one user being an existing A user helps improve prediction accuracy for any model, and alleviates the imbalance in the test set. As shown in the left panel of Figure 2, we partition the set of ones into links that existed in training (indexed as 1_existed) and those that never existed before (indexed as 1_new). The set 1_existed is suitable for gauging model performance on link prediction, whereas the set 1_new allows us to assess the link discovery task. In the right panel of Figure 2, we partition the set of zeros into four segments according to the memberships of the sender and receiver: 0_main2main, 0_aux2aux, 0_main2aux and 0_aux2main. We refer readers to Figure 2 for details of the partition.

Figure 3: Recall among top N predictions for (Left) Link Prediction and (Right) Link Discovery. The top N predictions are selected among 0_main2aux ∪ 0_aux2main and the corresponding set of ones by task (1_existed for link prediction and 1_new for link discovery).

Table 2: Performance on link prediction and link discovery on A's network over the test period. For each task, the set of ones is 1_existed (link prediction) or 1_new (link discovery) and the set of zeros is 0_main2aux ∪ 0_aux2main; AUC ROC is reported for BAR and the logistic baseline.

In general, link discovery is more challenging than link prediction since performance on the link discovery task is difficult to measure in the real setting, specifically on offline data. This is because the generative mechanism of new links isn't known and our model assumption bases them on B usage. A more reliable testing mechanism is perhaps randomized control trials.
The contrast in the difficulty of the two tasks is reflected in Table 2: the BAR model achieves almost perfect AUC ROC on link prediction, while its corresponding metric for link discovery is much lower. Moreover, on both tasks the BAR model outperforms the logistic baseline. We also evaluate model performance with recall, which is particularly suitable for evaluating link discovery. For each task, we compute recall from the top N predictions among the corresponding set of ones together with 0_main2aux ∪ 0_aux2main. As N increases, we include more candidates, which results in an improvement of recall for any model. In both panels of Figure 3, the BAR model achieves significantly better recall than the baseline at every N. However, link discovery is more challenging, which is reflected in the upper bound of the metric (see the right panel of Figure 3). For link prediction (see the left panel of Figure 3), the BAR model with a tuned α again performs best.
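The recall-among-top-N metric used in Figure 3 can be computed directly from model scores; below is a minimal sketch (our illustration, with made-up scores and labels, not the paper's data):

```python
import numpy as np

def recall_at_n(scores, labels, n):
    """Fraction of true links recovered among the n highest-scored candidates."""
    order = np.argsort(scores)[::-1][:n]   # indices of the top-n scores
    return labels[order].sum() / labels.sum()

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])   # model scores for candidate pairs
labels = np.array([1, 0, 0, 1, 1])             # ground-truth links in the test pool
assert recall_at_n(scores, labels, 3) == 2 / 3  # top-3 recovers 2 of the 3 true links
```

Sweeping n from small to large reproduces the monotone improvement in recall noted in the text: once n reaches the pool size, recall is 1 by construction.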
Figure 4: Variation of the number of nodes (Top Row) and the number of links (Bottom Row, showing all, positive, and negative links) of the Reddit hyperlink network over time at different aggregation levels: (Left) Daily, (Middle) Weekly and (Right) Monthly.
In the previous experiment we demonstrated significant improvement of the BAR model over a more standard approach in predicting and discovering links on networks with millions of nodes. Here we show that the proposed model is easily generalizable to any dynamic binary classification task that spans over time. The Reddit hyperlink network Kumar et al. (2018) encodes the directed connections between two subreddit communities on Reddit from Jan 2014 to April 2017. The subreddit-to-subreddit hyperlink network is extracted from Reddit posts containing hyperlinks from one subreddit to another. The hyperlinks are time-stamped, directed and have attributes (a.k.a. features). Using crowd-sourcing and a text-based classifier, Kumar et al. (2018) assigned a binary label to each hyperlink, where label 1 indicates that the source expresses a positive sentiment towards the target and -1 that the source sentiment is negative towards the target.

Our goal for the experiment is to classify sentiment for the hyperlinks with the BAR model and competing ones. The BAR model is designed for discrete times; however, the time stamps in the hyperlink network are continuous. To discretize the time stamps, we aggregate the hyperlinks at various granularities: daily, weekly and monthly. At each granularity we average the feature vectors corresponding to the same pair of (source subreddit, target subreddit) within each time window. Among the 86 features (see Kumar et al. (2018) for details) we omit 11 that are count features, because they are likely to be dominated by longer posts after the averaging. For sentiment labels within each time window, we take the majority vote as the label for (source subreddit, target subreddit). We experiment on the first 14 months of data (or roughly equivalently, the first 60 weeks).

Reddit hyperlink network: https://snap.stanford.edu/data/soc-RedditHyperlinks.html
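The windowed aggregation described above (average the features, majority-vote the label per source-target pair) can be sketched as follows; the events and the tie-breaking rule toward +1 are our own toy assumptions, not details from the paper:

```python
from collections import defaultdict

# toy hyperlink events: (day, source, target, label, feature); labels in {+1, -1}
events = [
    (0, "a", "b", 1, 2.0), (3, "a", "b", 1, 4.0), (5, "a", "b", -1, 6.0),
    (9, "c", "b", -1, 1.0),
]

def aggregate(events, window=7):
    """Bucket events into time windows; average features, majority-vote the label."""
    buckets = defaultdict(list)
    for day, src, dst, label, feat in events:
        buckets[(day // window, src, dst)].append((label, feat))
    out = {}
    for key, rows in buckets.items():
        labels = [l for l, _ in rows]
        feats = [f for _, f in rows]
        vote = 1 if sum(labels) >= 0 else -1   # ties broken toward +1 here
        out[key] = (vote, sum(feats) / len(feats))
    return out

agg = aggregate(events)
assert agg[(0, "a", "b")] == (1, 4.0)    # votes {+1, +1, -1} -> +1; mean(2, 4, 6) = 4
assert agg[(1, "c", "b")] == (-1, 1.0)
```

Changing `window` from 7 to 1 or 30 gives the daily and monthly granularities compared in Table 3.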
Table 3: Network statistics (node degree, % recurrent links), train/test periods (T_train, T_test) and AUC ROC for each model (BAR, Logistic (Raw), Logistic) at each aggregation level (Daily, Weekly, Monthly). At the daily level, node degree is 5.29, 21.60% of links are recurrent, T_train = 365 and T_test = 60.
We modify the BAR model to adapt to the single network scenario. Let A_ij^(t) ∈ {−1, 1} represent the sentiment from the i-th subreddit to the j-th subreddit at time window t. We require the edge-level features F_ij^(t) to be from the network and update the dependency of P_ij^(t) onto A_ij^(t):

P_ij^(t) = (1 + exp(−β′F_ij^(t)))^(−1)  if A_ij^(t) ∈ {1, −1}.   (9)

With a single network we turn off the link discovery option by setting α to zero in Problem (4). For comparison we consider two variations of the logistic model. The first variation keeps using averaged historical features for classifying sentiment at the current time window, as used in Section 4 and Section 5.1, which we label as Logistic in Table 3. The second variation, which we label as Logistic (Raw) in Table 3, simply uses the features from the current time window for classifying sentiment. Table 3 compares the performance on the corresponding test period across aggregation levels. As we aggregate over longer time windows (from daily to monthly), the network becomes sparser, with node degree continuously decreasing. Comparing the two variations of the logistic model, it performs better without averaging historical features, which indicates there is limited time dependence in the sentiment classification task. Indeed, our approach performs best with daily aggregation, when there is more time dependence, as evidenced by the highest percentage of recurrent links. At all aggregation levels, the BAR model outperforms the logistic ones, showing its strength on binary classification tasks even with limited time dependence. As we mentioned before, the competing GFT approach does not scale to large datasets. Moreover, the sentiment classification problem is not directly solvable with GFT's matrix completion framework in (6). Thus, we leave the GFT approach out of this comparison.
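The Logistic baseline's averaged historical features, defined in (8), can be maintained with a constant-memory running update; the trace below is a made-up toy example, not data from the experiments:

```python
def update_avg(avg_prev, f_t, t, present):
    """Running average (8): decay toward fewer observations when F(t) is absent,
    fold the new feature in when it is present."""
    if not present:
        return avg_prev * (t - 1) / t
    return (avg_prev * (t - 1) + f_t) / t

# features observed at t = 1 and t = 3, absent at t = 2
avg = 4.0                                        # F_avg(1) = F(1) = 4.0
avg = update_avg(avg, None, 2, present=False)    # decays to 2.0
avg = update_avg(avg, 5.0, 3, present=True)      # (2.0 * 2 + 5.0) / 3 = 3.0
assert abs(avg - 3.0) < 1e-12
```

Note that the absent-step rule in (8) shrinks the average toward zero, so a pair with a sparse history carries a correspondingly weaker feature signal into the logistic model (7).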
Conclusions
We proposed a generalized auto-regressive model for link prediction and discovery with network sequences. The proposed BAR model can exploit signals from the auxiliary network to infer connections on the main network. We also developed an optimization framework for estimating the model, and a stochastic gradient descent algorithm that is scalable to networks with millions of nodes. To demonstrate the superior performance of our model, we applied the BAR model along with baselines on both simulated and real data, and showed significant improvement in BAR's performance over the baselines. Our experiment on product A's network shows practical use of the BAR model at scale, while the Reddit hyperlink experiment validates the generalizability of BAR. As a future line of research, an extension of the BAR model to continuous time would be useful for many real-world applications. We also plan to investigate the theoretical properties of our model.
The authors thank Sushama Murthy and Richard Johnston for supporting the work.
Appendix
A Algorithm for BAR Model Optimization
We derive the log-likelihood part and the regularizer part of Problem (4) as follows:

$$\text{Log-Likelihood} = \sum_{t=1}^{T} \left[ \sum_{\left\{(i,j):\, A^{(t)}_{ij}=1\right\}} \log\left(Q^{(t)}_{ij}\right) + \sum_{\left\{(i,j):\, A^{(t)}_{ij}=0\right\}} \log\left(1 - Q^{(t)}_{ij}\right) \right] \quad (10)$$

$$\text{Regularizer} = \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{\ell=1}^{d} \left( \sum_{j=1}^{N} \left(Q^{(t)}_{ij} - B^{(t)}_{ij}\right) \Phi^{(t)}_{j\ell} \right)^2 \quad (11)$$

For the log-likelihood part (10), we represent the summands using $f(\beta, t, i, j)$ and $g(\beta, t, i, j)$, defined as follows:

- $f(\beta, t, i, j) := \log\left(Q^{(t)}_{ij}\right) = \log\left(\lambda^{t} Q^{(0)}_{ij} + \sum_{s=1}^{t} \lambda^{t-s}(1-\lambda) P^{(s)}_{ij}\right)$ when $A^{(t)}_{ij} = 1$;
- $g(\beta, t, i, j) := \log\left(1 - Q^{(t)}_{ij}\right) = \log\left(1 - \lambda^{t} Q^{(0)}_{ij} - \sum_{s=1}^{t} \lambda^{t-s}(1-\lambda) P^{(s)}_{ij}\right)$ when $A^{(t)}_{ij} = 0$.

The gradients of $f(\beta, t, i, j)$ and $g(\beta, t, i, j)$ can be derived by the chain rule:

- $\nabla_\beta f(\beta, t, i, j) = \dfrac{1}{\lambda^{t} Q^{(0)}_{ij} + \sum_{s=1}^{t} \lambda^{t-s}(1-\lambda) P^{(s)}_{ij}} \cdot \sum_{s=1}^{t} \lambda^{t-s}(1-\lambda) \nabla_\beta P^{(s)}_{ij}$;
- $\nabla_\beta g(\beta, t, i, j) = -\dfrac{1}{1 - \lambda^{t} Q^{(0)}_{ij} - \sum_{s=1}^{t} \lambda^{t-s}(1-\lambda) P^{(s)}_{ij}} \cdot \sum_{s=1}^{t} \lambda^{t-s}(1-\lambda) \nabla_\beta P^{(s)}_{ij}$;

where $\nabla_\beta P^{(s)}_{ij} = P^{(s)}_{ij}\left(1 - P^{(s)}_{ij}\right) F^{(s)}_{ij}$ if $B^{(s)}_{ij} = 1$ and $0$ otherwise. The gradient of the log-likelihood is

$$\nabla_\beta \text{Log-Likelihood} = \sum_{t=1}^{T} \left[ \sum_{\left\{(i,j):\, A^{(t)}_{ij}=1\right\}} \nabla_\beta f(\beta, t, i, j) + \sum_{\left\{(i,j):\, A^{(t)}_{ij}=0\right\}} \nabla_\beta g(\beta, t, i, j) \right].$$
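The per-edge log-likelihood gradient can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: it assumes $P^{(s)}_{ij} = \sigma(\beta^\top F^{(s)}_{ij})$ when $B^{(s)}_{ij} = 1$ and $P^{(s)}_{ij} = 0$ otherwise, which is consistent with the stated gradient $\nabla_\beta P^{(s)}_{ij} = P^{(s)}_{ij}(1 - P^{(s)}_{ij}) F^{(s)}_{ij}$ gated by $B$; the data layout is our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_loglik_grad(beta, t, i, j, Q0, F, B, A, lam):
    """Gradient of the (i, j) log-likelihood summand at time t.

    Assumes P^(s)_ij = sigmoid(beta . F^(s)_ij) when B^(s)_ij = 1 and 0
    otherwise (our reading of the model). F[s][i, j] is the d-dimensional
    feature vector of edge (i, j) at time s; Q0 is the initial probability
    matrix Q^(0); lam is the decay parameter lambda.
    """
    # Q^(t)_ij = lam^t * Q^(0)_ij + sum_s lam^(t-s) * (1 - lam) * P^(s)_ij
    Q = (lam ** t) * Q0[i, j]
    grad_Q = np.zeros_like(beta)
    for s in range(1, t + 1):
        if B[s][i, j] == 1:  # sparsity in B limits the work per edge
            P = sigmoid(beta @ F[s][i, j])
            w = lam ** (t - s) * (1.0 - lam)
            Q += w * P
            grad_Q += w * P * (1.0 - P) * F[s][i, j]  # w * grad_beta P^(s)_ij
    if A[t][i, j] == 1:            # grad of f = log(Q)
        return grad_Q / Q
    return -grad_Q / (1.0 - Q)     # grad of g = log(1 - Q)
```

The sparsity of $B^{(s)}$ noted below is what makes the inner loop cheap: for most $(s, i, j)$ the gated term contributes nothing to either $Q^{(t)}_{ij}$ or its gradient.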
For the regularizer part (11), we represent the summand using $h(\beta, t, i, \ell) = \left( \sum_{j=1}^{N} \left(Q^{(t)}_{ij} - B^{(t)}_{ij}\right) \Phi^{(t)}_{j\ell} \right)^2$, whose gradient can be derived by the chain rule as

$$\nabla_\beta h(\beta, t, i, \ell) = 2 \left( \sum_{j=1}^{N} \left(Q^{(t)}_{ij} - B^{(t)}_{ij}\right) \Phi^{(t)}_{j\ell} \right) \cdot \sum_{j=1}^{N} \Phi^{(t)}_{j\ell} \sum_{s=1}^{t} \lambda^{t-s}(1-\lambda) \nabla_\beta P^{(s)}_{ij}.$$

The gradient of the regularizer is

$$\nabla_\beta \text{Regularizer} = \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{\ell=1}^{d} \nabla_\beta h(\beta, t, i, \ell).$$

A gradient descent update at the $k$-th iteration with step size $\eta$ has the form

$$\beta_k \leftarrow \beta_{k-1} - \eta \left( -\nabla_\beta \text{Log-Likelihood} + \alpha \cdot \nabla_\beta \text{Regularizer} \right).$$

We propose a computationally feasible substitute for full gradient descent. At the $k$-th iteration, we randomly sample a time stamp $t$, a sender node $i_{\text{lk}}$ for the log-likelihood gradient, and a sender node $i_{\text{reg}}$ for the regularizer gradient. The resulting stochastic gradient descent (SGD) algorithm is detailed in Algorithm 1. To efficiently compute $\text{grad}_{\text{lk}}$ and $\text{grad}_{\text{reg}}$ in Lines 8–9 of the algorithm, we leverage the sparsity of the auxiliary sequence $B^{(s)}$: for a given $i$ there are only a limited number of $j$'s with $B^{(s)}_{ij} = 1$. Hence, only a small number of non-zero $\nabla_\beta P^{(s)}_{ij}$ terms need to be accounted for in computing the gradients of $f(\cdot)$, $g(\cdot)$, and $h(\cdot)$.

References
Aiello, L. M., Barrat, A., Schifanella, R., Cattuto, C., Markines, B., and Menczer, F. (2012). Friendship prediction and homophily in social media. ACM Transactions on the Web (TWEB), 6(2):1–33.

Huang, Z. and Lin, D. K. (2009). The time-series link prediction problem with applications in communication surveillance. INFORMS Journal on Computing, 21(2):286–303.

Koren, Y. (2008). Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434.

Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In
Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee.

Algorithm 1: Stochastic Gradient Descent for Solving Problem (4)

1: Input: $A^{(t)}, B^{(t)}, F^{(t)}, \Phi^{(t)}, \lambda, \alpha$, and $\eta$
2: Initialize $\beta_0 \leftarrow 0$, $k \leftarrow 0$
3: while $\beta_k$ has not converged do
4: &nbsp;&nbsp; $k \leftarrow k + 1$
5: &nbsp;&nbsp; Sample $t$ from $\{1, \ldots, T\}$
6: &nbsp;&nbsp; Sample $i_{\text{lk}}$ from $\left\{ i : A^{(t)}_{ij} = 1 \text{ for some } j \right\}$
7: &nbsp;&nbsp; Sample $i_{\text{reg}}$ from $\left\{ i : B^{(t)}_{ij} = 1 \text{ for some } j \right\}$
8: &nbsp;&nbsp; $\text{grad}_{\text{lk}} \leftarrow \sum_{\left\{ j : A^{(t)}_{i_{\text{lk}} j} = 1 \right\}} \nabla_\beta f\left(\beta_{k-1}, t, i_{\text{lk}}, j\right) + \sum_{\left\{ j : A^{(t)}_{i_{\text{lk}} j} = 0 \right\}} \nabla_\beta g\left(\beta_{k-1}, t, i_{\text{lk}}, j\right)$
9: &nbsp;&nbsp; $\text{grad}_{\text{reg}} \leftarrow \sum_{\ell=1}^{d} \nabla_\beta h\left(\beta_{k-1}, t, i_{\text{reg}}, \ell\right)$
10: &nbsp; $\beta_k \leftarrow \beta_{k-1} - \eta \left( -\text{grad}_{\text{lk}} + \alpha \cdot \text{grad}_{\text{reg}} \right)$
11: end while
12: Output: $\beta_k$

Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks.
Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

Myers, S. and Leskovec, J. (2010). On the convexity of latent social network inference. In Advances in Neural Information Processing Systems, pages 1741–1749.

Richard, E., Baskiotis, N., Evgeniou, T., and Vayatis, N. (2010). Link discovery using graph feature tracking. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 1966–1974. Curran Associates, Inc.

Richard, E., Gaïffas, S., and Vayatis, N. (2014). Link prediction in graphs with autoregressive features. Journal of Machine Learning Research, 15(1):565–593.

Sarkar, P., Chakrabarti, D., and Jordan, M. (2012). Nonparametric link prediction in dynamic networks. arXiv preprint arXiv:1206.6394.

Sarkar, P., Chakrabarti, D., and Moore, A. W. (2011). Theoretical justification of popular link prediction heuristics. In Twenty-Second International Joint Conference on Artificial Intelligence.

Vu, D. Q., Hunter, D., Smyth, P., and Asuncion, A. U. (2011). Continuous-time regression models for longitudinal networks. In Advances in Neural Information Processing Systems, pages 2492–2500.

Wang, P., Xu, B., Wu, Y., and Zhou, X. (2015). Link prediction in social networks: the state-of-the-art. Science China Information Sciences, 58(1):1–38.

Wu, S., Sun, J., and Tang, J. (2013). Patent partner recommendation in enterprise social networks. In