Tracking Temporal Evolution of Graphs using Non-Timestamped Data
Sujit Rokka Chhetri * University of California
Palash Goyal * USC Information Sciences
Arquimedes Canedo * Siemens Corporate
ABSTRACT
Datasets to study the temporal evolution of graphs are scarce. To encourage research on novel dynamic graph learning algorithms, we introduce YoutubeGraph-Dyn (available at https://github.com/palash1992/YoutubeGraph-Dyn), an evolving graph dataset generated from real-world YouTube interactions. YoutubeGraph-Dyn provides intra-day time granularity (416 snapshots taken every 6 hours over a period of 104 days), multi-modal relationships that capture different aspects of the data, and multiple attributes, including timestamped and non-timestamped data, word embeddings, and integers. Our data collection methodology emphasizes the creation of time-evolving graphs from non-timestamped data. In this paper, we provide various graph statistics of YoutubeGraph-Dyn and test state-of-the-art graph clustering algorithms to detect community migration, as well as time series analysis and recurrent neural network algorithms to forecast non-timestamped data.
Graph learning is a growing field tasked with learning representations of networks [11, 21]. Since many real-world problems can be represented with graphs, there are several graph datasets available to the community to test new learning algorithms. However, real-world networks evolve temporally and exhibit dynamic patterns that these datasets fail to capture. Recent work on dynamic graph learning [7, 8] has shown that these methods are capable of predicting the evolution of networks. Unfortunately, there are very few real-world dynamic graph datasets available. These include Hep-th, a collaboration graph of authors in a High Energy Physics Theory conference [5], composed of instances spanning from January 1993 to April 2003; and Autonomous Systems (AS) [13], a communication network of who-talks-to-whom from the BGP (Border Gateway Protocol) logs, composed of 733 instances spanning from November 8, 1997 to January 2, 2000. These dynamic graph datasets have some limitations. First, the granularity of the temporal data is greater than a day and therefore it is not possible to study fast dynamics (i.e., by the hour). Second, only one type of relationship is encoded (e.g., who-collaborates-with-whom, who-talks-to-whom) and therefore it is not possible to study related dynamics and their interdependencies. Third, nodes and edges do not provide attributes and therefore these datasets are not suitable for testing deep learning algorithms.
To accelerate the research on dynamic graph representation learning, this paper introduces
YoutubeGraph-Dyn, a new dynamic graph dataset generated from the YouTube API that captures different interactions between channels, videos, and comments. YouTube is a global-scale social network where people interact through video views, comments, likes, dislikes, and subscriptions. These multi-modal interactions occur in real time and their rich semantics provide an opportunity to analyze real-world temporal patterns. YoutubeGraph-Dyn addresses the limitations of existing dynamic graph datasets as follows:
• Fine granularity: graph snapshots every 6 hours.
• Multi-modal: encoding various types of relationships between channels and videos through comments, ownership, and communities.
• Multiple attributes: both timestamped and non-timestamped attributes, such as word embeddings, lists of word embeddings, and integers.
• Reproducibility: to encourage the further advancement of dynamic graph analysis and other deep learning techniques, we make the YoutubeGraph-Dyn dataset available to the community.
* All authors contributed equally to this research.
People in social networks form communities. YoutubeGraph-Dyn provides a ground truth to study the evolution of communities over time. In particular, YoutubeGraph-Dyn's multi-modal interactions motivate new applications in the prediction of user migration to new communities. Similarly, social networks provide platforms to spread ideas. YoutubeGraph-Dyn's multiple attributes motivate new applications in the prediction of content virality, peer influence, and content popularity.
In this paper, we introduce YoutubeGraph-Dyn's key characteristics. To do so, we analyze the predictability of YouTube channels' subscriber, video, and comment counts with different models: autoregressive integrated moving average (ARIMA) models, and long short term memory (LSTM) and gated recurrent unit (GRU) recurrent neural networks.
Communities in YoutubeGraph-Dyn are analyzed through clustering analysis using a state-of-the-art graph embedding algorithm [20] on an induced channel-comment-video temporal graph. This graph captures the interactions between communities through comments in YouTube videos.
This paper is organized as follows. Section 2 presents the related work to distinguish YoutubeGraph-Dyn's contributions. Section 3 presents our data collection methodology emphasizing non-timestamped data. Section 4 presents YoutubeGraph-Dyn's statistics and evaluates community clustering and data forecasting. Section 5 discusses the limitations of YoutubeGraph-Dyn and provides an outlook for future work. Section 6 concludes.
YouTube data has been instantiated into various datasets over the years. Most of these datasets have one or more of the limitations discussed in the introduction.
In [2], snapshots were taken every 2-3 days. A snapshot is a directed graph, where each video is a node in the graph. If a video b is in the related video list of video a, then there is a directed edge from a to b. Notice that the related video list is the result of YouTube's recommender system and it is unclear what criteria are used to determine how two videos are related. This dataset only includes commentCount as an attribute, but no actual comments.
The Youtube-D, Youtube-U [16], and IMC 2007 datasets [15, 16] were collected to analyze the growth of the user base in the social network as a whole. These datasets consist of user-to-user graphs with daily snapshots between December 10th, 2006 and January 15th, 2007, and between February 8th, 2007 and July 23rd, 2007, representing 201 days of growth. These datasets have also been analyzed to identify network communities [19, 22].
Although not suitable for graph learning, the YouTube-8M dataset [1, 6] is worth mentioning.
The YouTube-8M dataset is a multi-label video classification dataset composed of 8 million videos, 500K hours of machine-generated annotations, and a vocabulary of 4800 visual entities. One possible direction for future work would be to induce video-to-annotation, visual entities-to-annotations, or video-to-visual entities temporal graphs.
A multi-attribute dataset very similar to YoutubeGraph-Dyn, but targeted at sentiment analysis and natural language processing, is [12]. It consists of daily snapshots of up to 200 trending YouTube videos in the United States and the United Kingdom. In comparison, YoutubeGraph-Dyn features more than 6,000 channels and 277,604 videos.
Dynamic graph representation techniques [7, 8, 24, 25] have used dynamic graph data available in various domains. TIMERS [24] uses multiple distinct snapshots of Facebook, Wikipedia and DBLP graphs. dyngraph2vec [7] uses High Energy Physics collaboration networks and Autonomous Systems router networks. The data used in such learning methods is scarce and often has a granularity of a few days or months. Furthermore, for many of these networks, the number of snapshots available is not sufficient to learn long temporal dependencies.
This paper aims to introduce a new dynamic dataset that captures information about the interactions between channels, videos and comments in YouTube. Our dataset has a granularity of 6 hours, enabling its use for more fine-grained temporal analysis.
To understand the temporal dynamics in the YouTube graph, we need to analyze the YouTube signals that are available through the YouTube API [23]. We collected the metadata from 6,342 channels, 277,604 videos, and more than 20 million comments from 2018/07/13 18:00:00 to 2018/10/26 with a frequency of 6 hours (00:00, 06:00, 12:00, and 18:00).
These 6,342 channels were selected based on their mentions, or their videos' mentions, on Twitter. The rationale behind this design choice is that content with the potential of becoming popular is being discussed in other social networks beyond YouTube, e.g., on Twitter. We bootstrapped the original channel list from 16,209 YouTube video links posted on Twitter from 2018/07/06 to 2018/07/12. We filtered the Twitter Firehose by the hashtag youtube and parsed the tweets looking for YouTube URLs.
In this paper, we focus on the three main abstractions in YouTube: (1) a channel provides a user a presence in the social network that allows them to upload videos; (2) a video is the unit of exchange in YouTube, which other users can view, like, dislike, and comment on; and (3) a comment thread is a collection of comments and comment replies associated with a video. Because the path to fame is a process that occurs in time, we need to distinguish between timestamped and non-timestamped data.
Figure 1: Sample graphs generated from the dataset. (a) Complete graph induced from the interactions between channels, videos, and comments. (b) Channels-to-videos bipartite graph induced from comment interactions.
Timestamped data provides a date; for example, when a comment was made or when a video was uploaded. From timestamped data, we can induce any snapshot of the network by filtering the data by time. Therefore, a single sample suffices to induce any snapshot in time.
Non-timestamped data does not provide an explicit date; for example, a video's current view count or a channel's current number of videos. To associate a date with non-timestamped data, one must sample it explicitly and with a certain frequency to capture changes. To illustrate this problem, consider a channel whose subscriber count was 100 twenty-four hours ago, 350 eighteen hours ago, 500 twelve hours ago, and is 200 now. If sampled daily, we would only see a 2× growth in subscribers, but we would miss the growth from 100 to 500 and the reduction to 200. Thus, capturing fine-grained network dynamics requires more frequent sampling than what is available in current datasets.
Table 1 shows the collected data attributes with a unique identifier (UID) that help us to build the network. Figure 1 shows a sample network induced from the relationships between UIDs. Purple nodes represent channels with their UIDs (X, Y, Z), pink nodes represent videos with their UIDs (a, b, c, d), and blue nodes represent comments with their UIDs (1, 2, 3, 4, 5, 6). The direction of the arrow represents the originator (source) and the recipient (sink). In Section 4 we provide the details of the types of graphs that can be induced from this data and analyze the key connectivity metrics in the network.
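The subscriber-count example above can be replayed in a few lines. The sketch below is only an illustration: the helper names `observe` and `summarize` are hypothetical, and the sample list simply encodes the four values from the example (hours ago, subscriber count).

```python
# Hypothetical samples of one channel's subscriber count, taken from the
# example in the text: (hours ago, subscriber count).
samples = [(24, 100), (18, 350), (12, 500), (0, 200)]

def observe(samples, period_hours):
    """Keep only the samples a collector polling every `period_hours` would see."""
    return [(h, c) for h, c in samples if h % period_hours == 0]

def summarize(observed):
    """Net growth ratio and highest value among the observed samples."""
    counts = [c for _, c in observed]
    return {"net_ratio": counts[-1] / counts[0], "peak": max(counts)}

daily = summarize(observe(samples, 24))
six_hourly = summarize(observe(samples, 6))
print(daily)       # {'net_ratio': 2.0, 'peak': 200}
print(six_hourly)  # {'net_ratio': 2.0, 'peak': 500}
```

Both sampling rates report the same net 2× growth, but only the 6-hour collector observes the peak of 500 subscribers and the subsequent drop.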
Table 2 shows the collected non-timestamped attributes. For each, we added the snapshot date when they were sampled. Although some attributes such as a channel's title, description, and topic categories (assigned by YouTube) are unlikely to change over time, they may be indicators of popularity. For example, by referencing trending topics, a channel owner may drive more visits to their channel. The same intuition applies to a video's title, description, tags (assigned by the user), duration, and category ID (assigned by YouTube).
Table 1: Unique ID attributes.
| Source | Attribute | Type |
|---|---|---|
| Channel | Channel UID | String |
| Channel | List of video UIDs | List of strings |
| Video | Video UID | String |
| Video | Parent channel UID | String |
| Video | Comment thread UID | List of strings |
| Comment Thread | Comment thread UID | String |
| Comment Thread | Parent video UID | String |
| Comment Thread | Parent channel UID | String |
A channel's subscriber count, view count, and video count are expected to change more frequently and may be stronger indicators of popularity. So are a video's view count, like count, dislike count, and comment count. In Section 4 we provide a detailed analysis of channels' subscriber count. It is worth mentioning that although we are able to sample a channel's subscriber count, we were not able to collect the subscribers' channel UIDs because this requires explicit authorization from the channel's owner. Access to the subscriber list would provide a more detailed insight into the network, since a subscription is a strong indicator of interest within the social network. However, without subscriber lists, the social network can still be analyzed via comments.
Comments are the most frequently changing attribute and are a good indicator of popularity. Channels and videos with more comment activity are likely to be more popular. We capture the comment text as non-timestamped data because users are allowed to modify or delete their comments. Also, a channel owner is allowed to disable comments for a video.
These are important events that are captured by our dataset but also present a data collection problem because they cause gaps in the data.
There are two main reasons for the gaps in the dataset: server errors and user settings changes. Whenever our data collector detects a server error, it attempts to get that data three times and, if it does not succeed, it adds that task to a queue for one more retry before the process terminates. User settings changes include disabling the comments of a video and deleting a video. Gaps are marked as NaN and we interpolate the missing data points to eliminate these gaps.
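A minimal sketch of this gap handling is given below. It is not the authors' collector: the function names are hypothetical, `IOError` stands in for a server error, and linear interpolation is an assumption, since the text does not say which interpolation scheme was used.

```python
import math

def fetch_with_retries(fetch, task, retry_queue, attempts=3):
    """Try `fetch(task)` up to `attempts` times; on failure, queue the task and return NaN."""
    for _ in range(attempts):
        try:
            return fetch(task)
        except IOError:           # stand-in for a server error
            continue
    retry_queue.append(task)      # to be retried once more before the process terminates
    return float("nan")           # the gap is marked as NaN

def interpolate_gaps(values):
    """Linearly interpolate interior runs of NaN; leading/trailing NaNs are left untouched."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            filled[i] = filled[left] + (filled[right] - filled[left]) * (i - left) / span
    return filled

def always_failing(task):
    """Simulated fetcher that always hits a server error."""
    raise IOError("simulated server error")

retry_queue = []
value = fetch_with_retries(always_failing, "channel-X", retry_queue)
series = [100.0, value, float("nan"), 130.0, 140.0]
filled = interpolate_gaps(series)
print(filled)  # [100.0, 110.0, 120.0, 130.0, 140.0]
```

Interpolation hides the gap from downstream models, so keeping the original NaN mask alongside the filled series may be useful when a deleted video or disabled comment thread is itself the event of interest.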
Table 2: Non-timestamped data attributes.
| Source | Attribute | Type |
|---|---|---|
| Channel | Snapshot date | Date |
| Channel | Title | String |
| Channel | Description | String |
| Channel | Topic categories | List of strings |
| Channel | Subscriber count | Int |
| Channel | View count | Int |
| Channel | Video count | Int |
| Channel | Video list | List of videos |
| Video | Snapshot date | Date |
| Video | Title | String |
| Video | Description | String |
| Video | Tags (user defined) | List of strings |
| Video | Duration | Int (seconds) |
| Video | Category ID | Int |
| Video | View count | Int |
| Video | Like count | Int |
| Video | Dislike count | Int |
| Video | Comment count | Int |
| Comment Thread | Snapshot date | Date |
| Comment Thread | Comments text | String |
From the temporal dataset, multiple graphs can be induced for analyzing the interactions within the YouTube community. In this paper, we present one of these graphs. The induced graph consists of the interactions between user communities through the comments on YouTube videos. This interaction is captured for multiple time steps. The steps for generating the temporal graphs are as follows:
(1) Channel-comment-video graph: First, a total of 6,336 channels are monitored every 6 hours to gather their channel, video, and comment thread data. This is done for a total of 416 time steps. For each time step, we construct a bipartite graph consisting of channels and videos (see Figure 3 (a)). An edge is added between a channel and a video only if the channel has commented on the video.
(2) Graph filtering: In the next step, we perform filtering to remove the channels that are not present in the original list of 6,336 channels (see Figure 3 (b)). This is done because the channel, video, and comment thread data is not collected for channels outside the original list. Moreover, we prune the bipartite graph to remove videos that have fewer than 1,000 views and channels that have fewer than 1,000 or more than 10 million subscribers. This is done to remove extreme outliers from the graph.
(3) Channel-video ownership addition: In the next step, we add edges between the channels and videos based on ownership (see Figure 3 (c)). This is done to keep track of how many videos a channel uploads over a period of time.
(4) Channel-community: In the final step (see Figure 3 (d)), we add edges between channels based on category identity. There is a total of 64 categories in the current dataset for the channels. One or more categories are assigned by YouTube to each channel. These categories include film, hobby, lifestyle, music, society, sports, technology, video game, etc.
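The four steps above can be sketched on toy data as follows. This is an illustrative reading of the procedure, not the released code: the record layout, the field names (`subscribers`, `categories`, `owner`, `views`), and the edge labels are assumptions, and category edges are weighted by the number of shared categories.

```python
from itertools import combinations

# Hypothetical toy records; field names are illustrative, not the dataset schema.
channels = {
    "X": {"subscribers": 5_000, "categories": {"music"}},
    "Y": {"subscribers": 20_000, "categories": {"music", "film"}},
    "Z": {"subscribers": 50, "categories": {"film"}},   # too few subscribers
}
videos = {
    "a": {"owner": "X", "views": 12_000},
    "b": {"owner": "Y", "views": 300},                   # too few views
    "c": {"owner": "Y", "views": 4_000},
}
# (commenting channel, video) pairs observed in one snapshot; "Q" is not monitored.
comments = [("Y", "a"), ("X", "c"), ("Z", "a"), ("Q", "a"), ("X", "b")]

def induce_graph(channels, videos, comments,
                 min_views=1_000, min_subs=1_000, max_subs=10_000_000):
    """Return {(source, target, relation): weight} for one snapshot."""
    keep_ch = {c for c, d in channels.items()
               if min_subs <= d["subscribers"] <= max_subs}
    keep_v = {v for v, d in videos.items() if d["views"] >= min_views}
    edges = {}
    # Steps 1-2: comment edges, restricted to monitored channels and unfiltered nodes.
    for ch, v in comments:
        if ch in keep_ch and v in keep_v:
            edges[(ch, v, "commented_on")] = 1
    # Step 3: ownership edges between a channel and the videos it uploaded.
    for v in keep_v:
        owner = videos[v]["owner"]
        if owner in keep_ch:
            edges[(owner, v, "uploaded_video")] = 1
    # Step 4: channel-channel edges weighted by the number of shared categories.
    for c1, c2 in combinations(sorted(keep_ch), 2):
        shared = channels[c1]["categories"] & channels[c2]["categories"]
        if shared:
            edges[(c1, c2, "category")] = len(shared)
    return edges

graph = induce_graph(channels, videos, comments)
for edge, weight in sorted(graph.items()):
    print(edge, weight)
```

Running the function once per 6-hour snapshot would yield the sequence of temporal graphs; here channel Z, the unmonitored channel Q, and video b are all dropped by the filters.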
We characterize the induced graph in terms of nodes and edges, average clustering coefficient, number of triangles, graph diameter, and average community size. The statistics are as follows:
Nodes and Edges.
The temporal statistics of nodes and edges are presented in Figure 2. The total number of nodes grows from 4,213 on day 1 to 8,845 on day 104. Similarly, the number of edges grows from 456,887 on day 1 to 566,594 on day 104. The number of nodes increases almost linearly over time. The edges are weighted, as two channels can have more than one categoryID in common.
Figure 2: Statistics of the temporal graphs over 104 days (Time interval = 6 hours). Panels: edges, nodes, average clustering coefficient, highest degree, number of triangles, transitivity, diameter, average degree connectivity for the highest degree (HD), and average community size.
Average clustering coefficient.
The average clustering coefficient of the weighted graph is calculated as follows [17]:
$$C = \frac{1}{n} \sum_{v \in G} c_v \tag{1a}$$
$$c_u = \frac{1}{\deg(u)\,(\deg(u) - 1)} \sum_{v,w} \left( \hat{w}_{uv}\, \hat{w}_{uw}\, \hat{w}_{vw} \right)^{1/3} \tag{1b}$$
where $n$ is the number of nodes in $G$, $\hat{w}_{uv}$ are the edge weights, and these weights are normalized by the maximum weight of the network, $\hat{w}_{uv} = w_{uv} / \max(w)$ [10]. From Figure 2, it can be seen that the average clustering coefficient decreases as the time steps increase. More precisely, the average clustering coefficient is 0.3453 on day 1 and 0.1980 on day 104.
Number of Triangles.
For the induced graph, the number of triangles present at each time step is presented in Figure 2. The total number of triangles varies from 436,761 on day 1 to 537,365 on day 104. It can be seen from the figure that this change in the number of triangles is closely related to the increase in the number of edges over time.
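As a small worked example of the weighted clustering coefficient and the triangle count, the sketch below evaluates both on a toy graph. It assumes the geometric-mean definition used by NetworkX (weights normalized by the maximum weight, cube root of the product of the three triangle weights); the adjacency-dictionary layout and function names are hypothetical.

```python
from itertools import combinations

def weighted_clustering(adj):
    """adj: {node: {neighbor: weight}} for an undirected graph. Returns {node: c_u}."""
    w_max = max(wt for nbrs in adj.values() for wt in nbrs.values())
    result = {}
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            result[u] = 0.0
            continue
        total = 0.0
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:  # u, v, w form a triangle
                prod = (adj[u][v] / w_max) * (adj[u][w] / w_max) * (adj[v][w] / w_max)
                total += prod ** (1.0 / 3.0)
        # Unordered pairs were visited, so multiply by 2 to match the
        # 1 / (deg(u) (deg(u) - 1)) normalization over ordered pairs.
        result[u] = 2.0 * total / (k * (k - 1))
    return result

def average_clustering(adj):
    """Mean of the per-node coefficients (Equation 1a)."""
    coeffs = weighted_clustering(adj)
    return sum(coeffs.values()) / len(coeffs)

def count_triangles(adj):
    """Each triangle is seen once from each of its three corners."""
    corners = sum(1 for u in adj for v, w in combinations(adj[u], 2) if w in adj[v])
    return corners // 3

# Toy graph: triangle A-B-C with equal weights, plus a pendant node D attached to C.
toy = {
    "A": {"B": 2.0, "C": 2.0},
    "B": {"A": 2.0, "C": 2.0},
    "C": {"A": 2.0, "B": 2.0, "D": 1.0},
    "D": {"C": 1.0},
}
print(weighted_clustering(toy)["A"])  # 1.0
print(count_triangles(toy))           # 1
print(average_clustering(toy))
```

Node C has three neighbors but only one closed pair, so its coefficient is 1/3, which pulls the average below the value of the fully clustered nodes A and B.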
Graph Diameter.
The diameter of the graph is measured by calculating the maximum eccentricity of the graph. The eccentricity of a node $u$ in a graph is defined as the maximum distance from node $u$ to all the other nodes in the graph. The diameter of the induced graph is almost constant for most of the time steps. In fact, the diameter only takes the values 8, 9, and 12.
Transitivity.
The transitivity of the induced graph is defined as follows:
$$T = 3\, \frac{\#\text{triangles}}{\#\text{triads}} \tag{2}$$
It is also known as the fraction of all possible triangles that are present in the graph. From Figure 2, it can be seen that the transitivity of the induced graph slightly decreases over time. It changes from 0.8176 on day 1 to 0.8114 on day 104.
Figure 3: Inducing graph from the dataset. (a) Generation of bipartite graph. (b) Filtering of channels and videos. (c) Channel-video ownership. (d) Channel communities through video categories.
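Both measures are easy to check on a toy graph. The sketch below computes the diameter as the maximum breadth-first-search eccentricity and the transitivity as the fraction of triads (two edges sharing a node) that are closed, which equals 3 × triangles / triads in the NetworkX convention. The adjacency layout and function names are hypothetical, and the graph is assumed to be connected and unweighted for these two measures.

```python
from collections import deque
from itertools import combinations

def eccentricity(adj, source):
    """Largest shortest-path distance (in hops) from `source` to any other node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) != len(adj):
        raise ValueError("graph is not connected")
    return max(dist.values())

def diameter(adj):
    """Maximum eccentricity over all nodes."""
    return max(eccentricity(adj, u) for u in adj)

def transitivity(adj):
    """Closed triads divided by all triads; closed triads = 3 * number of triangles."""
    closed = 0
    triads = 0
    for u, nbrs in adj.items():
        for v, w in combinations(nbrs, 2):
            triads += 1
            if w in adj[v]:
                closed += 1
    return closed / triads if triads else 0.0

# Toy graph: triangle A-B-C plus a pendant node D attached to C.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(diameter(toy))      # 2
print(transitivity(toy))  # 0.6
```

The toy graph has one triangle and five triads, so the transitivity is 3 × 1 / 5 = 0.6.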
Highest degree.
The average degree connectivity of a graph is calculated as follows [10]:
$$k_{nn,i}^{w} = \frac{1}{s_i} \sum_{j \in N(i)} w_{ij}\, k_j \tag{3}$$
where $s_i$ represents the weighted degree of node $i$, $w_{ij}$ represents the weight of the edge that links $i$ and $j$, $N(i)$ represents the neighbors of node $i$, and $k_j$ is the degree of node $j$. The highest degree in the graph changes from 633 on day 1 to 704 on day 104. The average connectivity for the highest degree is 767.90 on day 1 (degree 633) and 851.77 on day 104 (degree 704).
Average community size.
For the given 6,336 channels, the total number of categories collected is 64. For these 64 communities, the average community size of channels and videos is shown in Figure 2.
Overall, we observe that the graph becomes slightly sparser with time, as evidenced by the decreasing clustering coefficient and transitivity, which denote the lack of strong community formation between the channels and videos over time. However, the average community size increases over time. This suggests that many communities extend their influence to other channels and videos, increasing community size.
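The weighted average neighbor degree for a single node, $k_{nn,i}^{w} = \frac{1}{s_i} \sum_{j \in N(i)} w_{ij} k_j$, can be illustrated on a toy graph. The sketch below is a direct transcription of that formula under an assumed adjacency-dictionary layout; the function name is hypothetical, and it reports the per-node value for the highest-degree node rather than reproducing any particular library's aggregation over all nodes of a given degree.

```python
def weighted_avg_neighbor_degree(adj):
    """adj: {node: {neighbor: weight}}. Returns {node: k_nn^w} per Equation (3)."""
    result = {}
    for i, nbrs in adj.items():
        s_i = sum(nbrs.values())                      # weighted degree of i
        weighted = sum(w_ij * len(adj[j]) for j, w_ij in nbrs.items())
        result[i] = weighted / s_i if s_i else 0.0
    return result

# Toy weighted graph; A is the highest-degree node (degree 3).
toy = {
    "A": {"B": 2.0, "C": 1.0, "D": 1.0},
    "B": {"A": 2.0},
    "C": {"A": 1.0, "D": 1.0},
    "D": {"A": 1.0, "C": 1.0},
}
knn = weighted_avg_neighbor_degree(toy)
hub = max(toy, key=lambda node: len(toy[node]))
print(hub, len(toy[hub]), knn[hub])  # A 3 1.5
```

For node A, the weighted sum is 2·1 + 1·2 + 1·2 = 6 and the weighted degree is 4, giving 1.5: the heavy edge to the low-degree node B pulls the value down.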
Figure 4 presents the clustering analysis performed on the induced graphs. We utilized a state-of-the-art static graph embedding algorithm called Structural Deep Network Embedding (SDNE) [20]. SDNE utilizes fully connected deep auto-encoders to embed the given graph into a low-dimensional representation. We used three neural network layers in each of the encoder (with output sizes of 500, 300, and 128) and the decoder (with output sizes of 300, 500, and the total number of nodes). The high-dimensional graph is embedded into a vector of size 128. This embedding is then reduced to 2 dimensions using t-Distributed Stochastic Neighbor Embedding (t-SNE) [14]. The resulting points are then plotted with the nodes colored according to the ground truth (categoryIDs). The t-SNE plots for time steps 1, 25, 50, and 100 are shown in Figure 4. The black nodes represent the videos, and the other colored nodes represent channels belonging to different communities. These results show that YoutubeGraph-Dyn captures community migration dynamics on channels.
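To make the layer sizes concrete, the sketch below lists the input and output width of each layer in the encoder and decoder described above and counts their weights. It assumes plain dense layers with biases and a hypothetical node count of 1,000; it does not implement SDNE's training objective, and the function names are illustrative.

```python
def dense_stack(sizes):
    """Return (n_in, n_out, n_params) per dense layer, counting weights and biases."""
    return [(a, b, a * b + b) for a, b in zip(sizes, sizes[1:])]

def autoencoder_layers(num_nodes, encoder_sizes=(500, 300, 128)):
    """Encoder maps num_nodes -> 500 -> 300 -> 128; decoder mirrors it back to num_nodes."""
    encoder = dense_stack((num_nodes,) + tuple(encoder_sizes))
    decoder = dense_stack(tuple(reversed(encoder_sizes)) + (num_nodes,))
    return encoder, decoder

encoder, decoder = autoencoder_layers(num_nodes=1000)  # hypothetical node count
for name, layers in (("encoder", encoder), ("decoder", decoder)):
    for n_in, n_out, n_params in layers:
        print(f"{name}: {n_in} -> {n_out} ({n_params} parameters)")
total = sum(p for _, _, p in encoder + decoder)
print(total)  # 1379528
```

Because the first and last layers scale with the number of nodes, they dominate the parameter count as the graph grows over the 416 snapshots.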
Predicting the future values of non-timestamped data is an important task for graph learning algorithms in order to properly capture the temporal evolution of the graph. In this section, we compare three different models: autoregressive integrated moving average (ARIMA) models [4], and long short term memory (LSTM) [9] and gated recurrent unit (GRU) [3] recurrent neural networks. We evaluate their performance on the prediction of a channel's subscriber count. Unlike other non-timestamped data such as likes, dislikes, and comment counts, which tend to grow over time, the channel subscriber count presents different trends. Some channels gain subscribers, other channels lose subscribers, others keep the same number of subscribers, and others show fluctuations over time.
The ARIMA model is based on statsmodels [18]. We train one model per channel capable of forecasting the next value. The first value is predicted using 70% of the data. We report the mean squared error (MSE) between the actual and predicted values. The $p$, $d$, $q$ parameters of the model are individually reported.
The recurrent neural network (RNN) models are implemented in Keras. Unlike ARIMA, a single RNN model is created for all channels. We split the dataset into 70% for training and 30% for testing. Furthermore, we create $b$ batches of $k$-length samples to train the models. These models consist of an input LSTM or GRU layer, $l \geq 0$ hidden LSTM or GRU layers, and a dense output layer with a ReLU activation function. The embedding size $e$ determines the dimensionality of the output space for each of the layers. We use the Adam optimizer with a learning rate of 1e-4 and a mean squared error loss function.
Figure 5 shows the normalized subscriber count trends (in blue) for three channels: Channel A, Channel B, and Channel C. Channel A shows a steady increase in subscriber count, followed by a plateau, and finishes with a steady increase in subscribers. Channel B shows a steady decrease in subscribers.
Channel C shows a sharp increase in subscribers and manages to keep those subscribers until the end of the time steps. Figure 5(a) shows the ARIMA predictions (in orange) for the three channels. Figure 5(b) shows the LSTM predictions, and Figure 5(c) shows the GRU predictions using a lookback of 5, a batch size of 50, and an embedding size of 10. The three models succeed at predicting the general trends of the three channels. In particular, the ARIMA models provide a more accurate prediction of the next subscriber count values for the three channels. The LSTM and GRU models give good predictions for Channel A but are less accurate for Channels B and C. Figure 6 shows the distribution of MSE for various subscriber count predictor model parameterizations. The tick represents the median MSE for all 6,342 channels.
Figure 4: Community clustering for the temporal graphs (for snapshots 1, 25, 50, and 100).
Figure 5: Channel's subscriber count predictions for Channel A, Channel B, and Channel C. (a) ARIMA predictions. (b) LSTM predictions. (c) GRU predictions.
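The evaluation protocol can be sketched without any modeling library. The example below assumes a 70/30 chronological split and uses a random-walk forecast (next value equals the last observed value), which is what an ARIMA(0, 1, 0) model without drift predicts; it also shows how a series can be cut into lookback windows for the RNN models. The function names and the toy series are hypothetical, and this is not the statsmodels or Keras code used in the paper.

```python
def walk_forward_mse(series, train_percent=70):
    """One-step-ahead MSE of a random-walk forecast on the held-out tail of the series."""
    split = len(series) * train_percent // 100
    errors = []
    for t in range(split, len(series)):
        prediction = series[t - 1]            # forecast = last observed value
        errors.append((series[t] - prediction) ** 2)
    return sum(errors) / len(errors)

def lookback_windows(series, k):
    """Turn a series into (k previous values, next value) pairs for RNN training."""
    return [(series[i:i + k], series[i + k]) for i in range(len(series) - k)]

# Hypothetical subscriber counts sampled every 6 hours (steady growth of 2 per step).
subscribers = [100, 102, 104, 106, 108, 110, 112, 114, 116, 118]
print(walk_forward_mse(subscribers))        # 4.0
windows = lookback_windows(subscribers, k=5)
print(len(windows), windows[0])             # 5 ([100, 102, 104, 106, 108], 110)
```

On a steadily growing series the random-walk forecast is always one step behind, which is why its squared error equals the squared step size; higher-order ARIMA settings or the RNNs can reduce this lag.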
The arima_x_y_z models are the configurations for different $x = p$, $y = d$, $z = q$ values. For example, arima_0_1_0 represents an ARIMA model parameterized as $p = 0$, $d = 1$, $q = 0$. On the other hand, the RNN model names indicate the $k$-length of the $b$ batches, the embedding size $e$, and the $l$ hidden layers of type gru or lstm. For example, k10_b50_e10_l0_ngru represents a GRU model with no hidden layers, an embedding size of 10, and a batch size of 50 with 10-length samples per batch.
In general, the ARIMA models achieve a smaller MSE compared to the RNN models. However, the RNN models generalize the data better, as a single RNN model is trained for all 6,342 channels, whereas an individual ARIMA model is trained per channel. For the RNN models, the k = configurations provide the best predictions. Both the ARIMA and RNN models can be used to forecast YoutubeGraph-Dyn's non-timestamped data.
Figure 6: Exploration of hyperparameters for ARIMA, LSTM, and GRU models.
In this section, we summarize the limitations of YoutubeGraph-Dyn and provide our perspective on possible extensions and future research. First, the channel-comment-video temporal graph was the main focus of this paper. This is motivated by the need for better datasets for dynamic graph learning. However, YoutubeGraph-Dyn can be used to induce other types of graphs. Second, YoutubeGraph-Dyn provides finer time granularity compared to similar datasets; we showed that 6-hour intervals are sufficient to capture channel community migration dynamics. However, YoutubeGraph-Dyn's time granularity may not be adequate to capture faster dynamics in the network. Third, YoutubeGraph-Dyn is limited to a single social network (YouTube). We anticipate that future dynamic graph learning research will require richer temporal graphs that go beyond a single network.
In this paper, we focused on channel-comment-video graphs. However, there are other types of graphs that can be induced from the raw data. Some of the graphs that can be induced are as follows:
• User-comment-user graph: A user-comment-user graph can be built from comment threads when two users interact through a reply. This graph could be used, for example, to study the structure and dynamics of influencers.
• Channel-category-channel graph: A channel-category-channel graph can be built when a channel uploads a video in a category that it has in common with other channels. Since channels can have multiple common categories, weighted edges can be added among the channels. As channels will have videos uploaded under different categories, this graph will help to study the trends followed by the channels. Moreover, it will also allow us to study the evolution of the category community.
• User-category-user graph: A user-category-user graph can be constructed by adding edges between users who comment on videos with a similar video category. Unlike the user-comment-user graph, where an edge between users is added when they interact through a reply, a user-category-user graph does not require interaction. Moreover, unlike in the channel-category-channel graph, the users may not have uploaded any video. This graph will help us analyze the evolution of the user community with respect to the video category.
• Subscriber count ego network: To study the popularity of the channels, a subscriber count ego network may be created. Since the subscriber count spans a wide range of values, it has to be binned first. A logarithmic scale of base 10 may appropriately capture the range of subscribers. After binning the subscriber count, the channels with a specific number of subscribers are added to the ego network. Moreover, an edge between the channels is added based on the category of videos uploaded by them. This graph will allow us to study how various unpopular channels evolved to become popular by publishing videos on topics that are most popular at a particular period of time.
• Channel:video-sentiment-user graph: The comments provided by the users on the videos may be processed to perform sentiment analysis and ranked based on positive or negative sentiment with weights ranging from 1 to 10, 1 being most negative and 10 being most positive. Based on this, a channel:video-sentiment-user graph may be created, where there are two types of nodes: a channel:video node and a user node. The weighted edge between them is based on the sentiment score. This graph will allow us to study the overall standing of channels in the user community.
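The base-10 binning suggested for the subscriber count ego network can be sketched as follows. The bin index is the order of magnitude of the count, computed here from the number of decimal digits (equivalent to the floor of the base-10 logarithm for positive integers); the channel names, counts, and the choice to place zero-subscriber channels in bin 0 are assumptions for illustration.

```python
from collections import defaultdict

def subscriber_bin(count):
    """Order-of-magnitude bin: 0 for 0-9, 1 for 10-99, 2 for 100-999, and so on."""
    if count < 1:
        return 0  # assumption: channels without subscribers share the lowest bin
    return len(str(int(count))) - 1

def bin_channels(subscriber_counts):
    """Group channel UIDs by the bin of their subscriber count."""
    bins = defaultdict(list)
    for channel, count in subscriber_counts.items():
        bins[subscriber_bin(count)].append(channel)
    return dict(bins)

# Hypothetical channels and subscriber counts.
counts = {"X": 5_000, "Y": 20_000, "Z": 50, "W": 7_500}
binned = bin_channels(counts)
print(binned)  # {3: ['X', 'W'], 4: ['Y'], 1: ['Z']}
```

Tracking how a channel moves between bins across snapshots gives a coarse view of its path from unpopular to popular, which is the question this ego network is meant to address.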
The YouTube API restricts the number of requests (i.e., credits) that one account can perform per hour. Therefore, the work can be distributed among various accounts to either increase the volume of channels, comments, and videos or to take intra-hour snapshots of the data. In our case, three accounts (one per author) were sufficient to take snapshots of the 6,342 channels, including all their videos and comments, every 6 hours. The motivation behind intra-hour snapshots is to capture faster network dynamics such as virality. We anticipate that intra-hour granularity will provide interesting dynamics for new algorithms.
In many cases, users have a presence in several social networks such as YouTube, Twitter, and Instagram. Analyzing multi-network interactions is an interesting area of research that requires multi-network temporal graphs. However, the methodology to create such a dataset must first be developed and we leave that for future work.
YoutubeGraph-Dyn takes a first step in filling the gap that exists in datasets that capture the evolution of graphs. Our approach, based on non-timestamped data from the YouTube API, allowed us to create a channel-comment-video time-evolving graph. Using SDNE, a state-of-the-art graph embedding algorithm, we demonstrated that YoutubeGraph-Dyn captures community migration dynamics in YouTube channels. YoutubeGraph-Dyn provides intra-day graph snapshots taken every 6 hours over 104 days, resulting in a total of 416 time steps. A key differentiator of YoutubeGraph-Dyn is the encoding of various types of relationships and multiple attributes in the form of word embeddings and integers. We showed that RNNs and ARIMA models can be used to predict non-timestamped data such as subscriber count.
REFERENCES
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. CoRR abs/1609.08675 (2016). http://arxiv.org/abs/1609.08675
[2] Xu Cheng, Cameron Dale, and Jiangchuan Liu. 2008. Dataset for Statistics and Social Network of YouTube Videos. (2008). http://netsg.cs.sfu.ca/youtubedata/
[3] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In International Conference on Machine Learning. 2067–2075.
[4] Javier Contreras, Rosario Espinola, Francisco J Nogales, and Antonio J Conejo. 2003. ARIMA models to predict next-day electricity prices. IEEE transactions on power systems 18, 3 (2003), 1014–1020.
[5] Johannes Gehrke, Paul Ginsparg, and Jon Kleinberg. 2003. Overview of the 2003 KDD Cup. SIGKDD Explor. Newsl. 5, 2 (Dec. 2003), 149–151.
[6] Google. 2016. Youtube-8M Dataset. (2016). https://research.google.com/youtube8m/
[7] Palash Goyal, Sujit Rokka Chhetri, and Arquimedes Canedo. 2018. dyngraph2vec: Capturing Network Dynamics using Dynamic Graph Representation Learning. CoRR abs/1809.02657 (2018). http://arxiv.org/abs/1809.02657
[8] Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. 2018. DynGEM: Deep Embedding Method for Dynamic Graphs. CoRR abs/1805.11273 (2018). http://arxiv.org/abs/1805.11273
[9] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems 28, 10 (2017), 2222–2232.
[10] Aric Hagberg, Pieter Swart, and Daniel S Chult. 2008. Exploring network structure, dynamics, and function using NetworkX. Technical Report. Los Alamos National Lab. (LANL), Los Alamos, NM (United States).
[11] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. CoRR
Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). ACM, 177–187.
[14] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
[15] Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and Analysis of Online Social Networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC'07). San Diego, CA.
[16] Alan E. Mislove. 2009. Online Social Networks: Measurement, Analysis, and Applications to Distributed Information Systems. Ph.D. Dissertation. Rice University.
[17] Jari Saramäki, Mikko Kivelä, Jukka-Pekka Onnela, Kimmo Kaski, and Janos Kertesz. 2007. Generalizations of the clustering coefficient to weighted complex networks. Physical Review E
Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1225–1234.
[21] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2019. A Comprehensive Survey on Graph Neural Networks. CoRR abs/1901.00596 (2019). http://arxiv.org/abs/1901.00596
[22] Jaewon Yang and Jure Leskovec. 2012. Defining and Evaluating Network Communities based on Ground-truth. CoRR abs/1205.6233 (2012). http://arxiv.org/abs/1205.6233
[23] YouTube. 2019. Data API. (2019). https://developers.google.com/youtube/v3/
[24] Ziwei Zhang, Peng Cui, Jian Pei, Xiao Wang, and Wenwu Zhu. 2018. Timers: Error-bounded svd restart on dynamic networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
[25] Linhong Zhu, Dong Guo, Junming Yin, Greg Ver Steeg, and Aram Galstyan. 2016. Scalable temporal latent space inference for link prediction in dynamic social networks. IEEE Transactions on Knowledge and Data Engineering 28, 10 (2016), 2765–2777.