Twitch Gamers: a Dataset for Evaluating Proximity Preserving and Structural Role-based Node Embeddings
Benedek Rozemberczki
The University of Edinburgh
Edinburgh, United Kingdom
[email protected]
Rik Sarkar
The University of Edinburgh
Edinburgh, United Kingdom
[email protected]
ABSTRACT
Proximity preserving and structural role-based node embeddings have become a prime workhorse of applied graph mining. Novel node embedding techniques are often tested on a restricted set of benchmark datasets. In this paper, we propose a new diverse social network dataset called Twitch Gamers with multiple potential target attributes. Our analysis of the social network and node classification experiments illustrate that Twitch Gamers is suitable for assessing the predictive performance of novel proximity preserving and structural role-based node embedding algorithms.
1 INTRODUCTION
The prediction of unknown node attributes using vertex features is a central problem in both theoretical and applied graph mining research. One way to create high quality node features is to embed the vertices in a Euclidean space. Node embedding algorithms are frequently used as an upstream unsupervised feature extraction method to distill useful features for downstream supervised models. Their success is mainly due to their favorable algorithmic qualities, such as runtime and memory efficiency. In addition to efficiency, the extracted node representations are known to be robust to hyperparameter changes [12, 13, 16], and the learned features are reusable when new downstream machine learning tasks come up [1, 7]. Node embedding techniques are typically evaluated on a limited number of public benchmark datasets [7, 8, 12–14, 21], which are not compatible with newly proposed attribute-based algorithms [15, 16, 20]. This highlights the need for new benchmark datasets which are rich in attributes.
Present work. In order to foster node embedding research, we publicly release Twitch Gamers: a medium sized undirected social network of online streamers with multiple interesting vertex attributes. Using Twitch Gamers, the predictive performance of a node embedding algorithm can be tested on multiple new challenging node classification and vertex level regression problems. Potential machine learning tasks include the identification of dead accounts, the selection of users that stream explicit content, and broadcaster language prediction. Our work creates an opportunity for the assessment of numerous existing node representation learning techniques and newly developed vertex embedding procedures.
Main contributions. The most important contributions of our paper can be summarized as follows:
(1) We release Twitch Gamers: a new social network dataset which we specifically collected for benchmarking the vertex classification performance of proximity preserving and structural role-based node embedding techniques.
(2) We carry out a descriptive analysis of the social network and the underlying generic vertex features and argue that it is suitable for testing novel node embedding methods.
(3) We evaluate the performance of standard node embedding algorithms under various train/test split regimes.
The rest of our work has the following structure. We overview the related work on node embedding procedures in Section 2. We discuss the data collection and the dataset itself in Section 3. We perform a descriptive analysis of the social network and the generic vertex attributes in Section 4. In Section 5 we showcase the predictive performance of various well known node embedding techniques on the Twitch Gamers dataset. The paper concludes with Section 6, where we discuss potential future work.
2 RELATED WORK
Given a graph 𝐺 = (𝑉, 𝐸), node embedding techniques learn a function 𝑓 : 𝑉 → R^𝑑 which maps the nodes 𝑣 ∈ 𝑉 into a 𝑑-dimensional Euclidean space. When generic vertex features are not available for node classification, proximity preserving and structural role-based node embedding techniques are suitable for distilling high quality reusable feature sets [1]. We will utilize linear runtime node embedding algorithms to showcase that Twitch Gamers is suitable for multi-aspect testing of feature extraction.
Proximity preserving node embedding algorithms [6, 10, 12, 13, 16] learn this embedding by preserving a certain notion of proximity in the embedding space, such as pairwise truncated random walk transition probabilities. This way nodes that are close to each other in the graph are also close in the embedding space. Structural role-based node embedding techniques, on the other hand, preserve structural similarity in the embedding space. Nodes which have similar structural properties, such as centrality and transitivity, are close to each other in the embedding space [5, 7, 8].
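As a concrete illustration of the proximity preserving idea, the sketch below samples truncated random walks from an adjacency list; co-occurrence statistics of such walks are what methods like DeepWalk [12] implicitly factorize. This is a minimal illustrative sketch, not code from any of the cited libraries; the function name and the adjacency-list representation are our own assumptions.

```python
import random

def truncated_random_walks(adj, walks_per_node=2, walk_length=5, seed=42):
    """Sample truncated random walks from an adjacency list.

    `adj` maps each node to a list of its neighbours. Each walk starts
    at a node and repeatedly moves to a uniformly chosen neighbour,
    stopping after `walk_length` nodes (or at a node with no neighbours).
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for node in adj:
            walk = [node]
            for _ in range(walk_length - 1):
                neighbours = adj[walk[-1]]
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```

Nodes that co-occur often in these walks are close in the graph, and a proximity preserving embedding places them close in R^𝑑 as well.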
3 THE DATASET
Twitch is a streaming service where users can broadcast live streams of playing computer games. As users can follow each other, there is an underlying social network which can be accessed through the public API. In April 2018 we crawled the largest connected component of this social network with snowball sampling, starting from the user called Lowko. The released Twitch Gamers dataset is a clean subset of the original social network. We filtered out nodes and edges based on the following principled steps:
(1) No missing attributes. We only kept nodes that have all of the vertex attributes present.
(2) Mutual relationships. We discarded relationships which are asymmetric and only included mutual edges in the released dataset.
(3) Member of the largest component. We only considered nodes which are part of the largest connected component.
The result of this three step data cleaning process is an undirected, single component social network with approximately 168 thousand nodes and 6.79 million edges. Vertices in this restricted subsample do not have any missing node attributes. We summarized the name, meaning, and type of the available generic node attributes in Table 1.
Name                 | Meaning                             | Type
---------------------|-------------------------------------|------------
Identifier           | Numeric vertex identifier.          | Index
Dead Account         | Inactive user account.              | Categorical
Broadcaster Language | Languages used for broadcasting.    | Categorical
Affiliate Status     | Affiliate status of the user.       | Categorical
Explicit Content     | Explicit content on the channel.    | Categorical
Creation Date        | Joining date of the user.           | Date
Last Update          | Last stream of the user.            | Date
View Count           | Number of views on the channel.     | Count
Account Lifetime     | Days between first and last stream. | Count

Table 1: The name, meaning, and type of vertex attributes in the Twitch Gamers dataset.
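The three-step cleaning procedure described above can be sketched in plain Python. This is an illustrative reimplementation under the assumption that the raw crawl is given as a set of directed follower edges and a per-node attribute dictionary (both representations are hypothetical), not the actual pipeline used to produce the dataset.

```python
from collections import deque

def clean_graph(edges, attributes):
    """Apply the three cleaning steps: (1) no missing attributes,
    (2) mutual relationships only, (3) largest connected component.

    `edges` is a set of directed (u, v) pairs; `attributes` maps each
    node to a dict of its attribute values (None marks a missing value).
    """
    # (1) Keep only nodes with every attribute present.
    valid = {n for n, a in attributes.items()
             if all(v is not None for v in a.values())}
    # (2) Keep only reciprocated relationships, stored as undirected edges.
    mutual = {frozenset((u, v)) for u, v in edges
              if (v, u) in edges and u != v and u in valid and v in valid}
    # (3) Find the largest connected component with BFS.
    adj = {}
    for e in mutual:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    kept_edges = {e for e in mutual if tuple(e)[0] in best}
    return best, kept_edges
```

Applied to the raw crawl, steps of this kind yield the undirected, single component network described above.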
Categorical attributes such as Dead Account, Affiliate Status, and Explicit Content can be used as targets for binary classification, while Broadcaster Language can be used for multi-class node classification with more than 20 categories. The vertex attributes View Count and Account Lifetime can serve as targets for count data regression problems at the node level. Various other supervised and unsupervised machine learning tasks can be performed on the dataset, such as link prediction and community detection with ground truth labels. The Twitch Gamers dataset is publicly available at https://github.com/benedekrozemberczki/datasets.
Figure 1: The percentage of intra- and inter-class edges conditional on the Dead Account, Affiliate Status, and Explicit Content attributes.
4 DESCRIPTIVE ANALYSIS
Our descriptive analysis of Twitch Gamers focuses on the interaction of graph topology and attributes. Specifically, we investigate which potential target attributes can be predicted well with neighbourhood-preserving and structural role-based techniques.
Figure 2: The percentage of intra- and inter-class edges conditional on the Broadcaster Language attribute.
We plotted the ratio of inter- and intra-class edges conditional on the categorical attributes in Figures 1 and 2. These results show that users who broadcast in more commonly spoken languages (English, German, French) are more likely to have connections with users who broadcast in the same language. This suggests that proximity preserving node embedding techniques will extract expressive features that can predict Broadcaster Language precisely. We also see that Twitch users who churned from the platform are well embedded in the social network and do not form communities. When it comes to the Affiliate Status and Explicit Content attributes, we cannot highlight particular insights about the related linking behaviour of vertices.
Figure 3: The box plots of degree centrality and clustering coefficient conditional on the Dead Account, Affiliate Status, and Explicit Content attributes.

We used boxplots to visualize the distribution of the log transformed degree and clustering coefficient conditional on the categorical vertex attributes. We plotted these boxplots of the structural features in Figures 3 and 4. Based on these plots we can deduce that users who broadcast in more commonly spoken languages are well connected. At the same time, their friends are less likely to be connected – this potentially hints at their hub-like role. The results obtained for the other attributes are also intuitive: (i) users who churned from the platform are less central in the social network; (ii) broadcasters who use explicit language are less popular; (iii) those who obtain affiliate status are generally well connected in the social network. These findings hint that all of the categorical features can be embedded with the use of structural role-based node embedding techniques.
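The two structural features used in these boxplots can be computed directly from an adjacency structure. A minimal sketch, assuming the graph is stored as a dictionary of neighbour sets (the degree is simply the neighbour-set size), is:

```python
def clustering_coefficient(adj, node):
    """Local clustering coefficient of `node`: the fraction of its
    neighbour pairs that are themselves connected.

    `adj` maps each node to the set of its neighbours in an
    undirected graph.
    """
    neighbours = list(adj[node])
    k = len(neighbours)  # the degree of the node
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbours[j] in adj[neighbours[i]])
    return 2.0 * links / (k * (k - 1))
```

A high degree combined with a low clustering coefficient is exactly the hub-like pattern we observe for broadcasters in commonly spoken languages.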
Figure 4: The box plots of degree centrality and clustering coefficient conditional on the Broadcaster Language attribute.
5 EXPERIMENTAL EVALUATION
We use Twitch Gamers to evaluate the predictive value of features extracted with popular node embedding algorithms. The target attributes of node classification were the Explicit Content, Broadcaster Language, Dead Account, and Affiliate Status variables. In our experiments we used the open-source Karate Club [17] library with the default hyperparameter settings of the node embedding procedures. Specifically, we tested the performance of the following proximity preserving node embeddings:
(1) Diff2Vec [18, 19] factorizes a pointwise mutual information (henceforth PMI) matrix derived from a diffusion process.
(2) DeepWalk [12] decomposes the PMI matrix of summed normalized adjacency matrix powers with implicit factorization.
(3) Walklets [13] factorizes the PMI matrix of normalized adjacency matrix powers to obtain multi-scale node embeddings.
(4) RandNE [22] smooths an orthogonal node embedding matrix with powers of the adjacency matrix.
We evaluated the value of features extracted with these structural role-based node embedding algorithms:
(1) Role2Vec [2] decomposes the PMI matrix of node–tree feature co-occurrences with an implicit factorization technique.
(2) ASNE [9] factorizes a target matrix obtained by concatenating the adjacency matrix and a structural feature matrix which includes one-hot encodings of the log degree and clustering coefficient.
(3) MUSAE [15] learns multi-scale structural role-based node embeddings from matrices obtained by multiplying the structural feature matrix with adjacency matrix powers.
(4) FEATHER [20] distills node embeddings from graph characteristic functions of the log transformed degree and clustering coefficient.
We used the scikit-learn [3, 11] implementation of logistic regression with the default hyperparameter settings to predict the node labels, using the node embeddings as input features. It has to be noted that these default settings involve the use of weight regularization; because of this, each node embedding dimension was normalized. The classifiers were trained with various highly skewed train/test data split ratios, utilizing less than 1% of the data for training. We plotted mean macro-averaged AUC scores on the test set, calculated from 10 random seed train/test splits, in Figures 5 and 6.
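The AUC metric we report has a direct probabilistic reading for a binary target: the probability that a randomly drawn positive example receives a higher score than a randomly drawn negative one, with ties counting one half. The sketch below computes this definition exactly; it is an illustrative O(n²) implementation, not the scikit-learn routine we actually used.

```python
def auc_score(y_true, scores):
    """Area under the ROC curve, computed from its probabilistic
    definition: the probability that a random positive example
    (label 1) is scored above a random negative example (label 0).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5  # ties count as half a correct ordering
    return total / (len(pos) * len(neg))
```

For the multi-class Broadcaster Language target, macro-averaging amounts to computing this one-versus-rest score per class and taking the mean.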
Figure 5: Predictive performance of proximity preserving node embedding techniques on classification tasks measured by area under the curve scores on the test set as a function of training set ratio.
The most important finding based on our results is that the target attributes in Twitch Gamers are suitable for testing the predictive power of features extracted with both proximity preserving and structural role-based node embeddings. These results showcase that certain node embedding techniques have a considerable advantage on the downstream tasks. We also see evidence that proximity preserving algorithms extract features which are more useful for predicting Broadcaster Language. This was expected based on our empirical analysis, as it is an attribute which most probably strongly influences linking behaviour. Another similarly intuitive finding is that structural role-based embedding techniques [2] create more expressive features for predicting the Dead Account target variable. This is not surprising: our descriptive analysis had shown that users who churned from the platform have idiosyncratic structural attributes. Our results also verify the known fact that multi-scale proximity preserving node embeddings, such as Walklets [13] and GraRep [4], outperform techniques like DeepWalk [12] that pool information from low- and higher-order proximities.
Figure 6: Predictive performance of structural role-based node embedding techniques on classification tasks measured by area under the curve scores on the test set as a function of training set ratio.
6 CONCLUSIONS
We introduced Twitch Gamers, a medium sized social network dataset with a rich set of potential target attributes. Our descriptive analysis of the dataset demonstrated that both proximity preserving and structural role-based node embeddings can potentially distill high quality features for node classification. We verified this supposition with a series of experiments. Our findings show that Twitch Gamers can serve as an important benchmark to assess novel node embedding techniques. We are particularly excited that the prediction of certain vertex attributes turned out to be a challenging machine learning task.
ACKNOWLEDGEMENTS
Benedek Rozemberczki was supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1).
REFERENCES
[1] N. Ahmed, R. A. Rossi, J. Lee, T. Willke, R. Zhou, X. Kong, and H. Eldardiry. 2020. Role-based Graph Embeddings. IEEE Transactions on Knowledge and Data Engineering (2020), 1–1.
[2] Nesreen K. Ahmed, Ryan Rossi, John Boaz Lee, Xiangnan Kong, Theodore L. Willke, Rong Zhou, and Hoda Eldardiry. 2018. Learning Role-based Graph Embeddings. Proceedings of the 26th IJCAI Conference on Artificial Intelligence – Statistical Relational AI Workshop (2018).
[3] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API Design for Machine Learning Software: Experiences from the Scikit-Learn Project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 108–122.
[4] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 891–900.
[5] Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. 2018. Learning Structural Node Embeddings via Diffusion Wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1320–1329.
[6] Aditya Grover and Jure Leskovec. 2016. Node2Vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
[7] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. 2012. RolX: Structural Role Extraction and Mining in Large Graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1231–1239.
[8] Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina Eliassi-Rad, Hanghang Tong, and Christos Faloutsos. 2011. It's Who You Know: Graph Mining Using Recursive Structural Features. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 663–671.
[9] Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2018. Attributed Social Network Embedding. IEEE Transactions on Knowledge and Data Engineering 30, 12 (2018), 2257–2270.
[10] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric Transitivity Preserving Graph Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1105–1114.
[11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-Learn: Machine Learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
[12] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
[13] Bryan Perozzi, Vivek Kulkarni, Haochen Chen, and Steven Skiena. 2017. Don't Walk, Skip!: Online Learning of Multi-scale Network Embeddings. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, 258–265.
[14] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and Node2Vec. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. ACM, 459–467.
[15] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. 2019. Multi-Scale Attributed Node Embedding. arXiv preprint arXiv:1909.13021 (2019).
[16] Benedek Rozemberczki, Ryan Davies, Rik Sarkar, and Charles Sutton. 2019. GEMSEC: Graph Embedding with Self Clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, 65–72.
[17] Benedek Rozemberczki, Oliver Kiss, and Rik Sarkar. 2020. Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). ACM.
[18] Benedek Rozemberczki, Oliver Kiss, and Rik Sarkar. 2020. Little Ball of Fur: A Python Library for Graph Sampling. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). ACM, 3133–3140.
[19] Benedek Rozemberczki and Rik Sarkar. 2018. Fast Sequence-Based Embedding with Diffusion Graphs. In International Workshop on Complex Networks. Springer, 99–107.
[20] Benedek Rozemberczki and Rik Sarkar. 2020. Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). ACM.
[21] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-Scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web. 1067–1077.
[22] Ziwei Zhang, Peng Cui, Haoyang Li, Xiao Wang, and Wenwu Zhu. 2018. Billion-Scale Network Embedding with Iterative Random Projection. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 787–796.