Twitch Gamers: a Dataset for Evaluating Proximity Preserving and Structural Role-based Node Embeddings
Benedek Rozemberczki
The University of Edinburgh
Edinburgh, United Kingdom
[email protected]
Rik Sarkar
The University of Edinburgh
Edinburgh, United Kingdom
[email protected]
ABSTRACT
Proximity preserving and structural role-based node embeddings have become a prime workhorse of applied graph mining. Novel node embedding techniques are often tested on a restricted set of benchmark datasets. In this paper, we propose a new diverse social network dataset called Twitch Gamers with multiple potential target attributes. Our analysis of the social network and node classification experiments illustrate that Twitch Gamers is suitable for assessing the predictive performance of novel proximity preserving and structural role-based node embedding algorithms.
1 INTRODUCTION
The prediction of unknown node attributes using vertex features is a central problem in both theoretical and applied graph mining research. One way to create high quality node features is to embed the vertices in a Euclidean space. Node embedding algorithms are frequently used as an upstream unsupervised feature extraction method to distill useful features for downstream supervised models. Their success is mainly due to their favorable algorithmic qualities, such as runtime and memory efficiency. In addition to efficiency, the extracted node representations are known to be robust to hyperparameter changes [12, 13, 16], and the learned features are reusable when new downstream machine learning tasks come up [1, 7]. Node embedding techniques are typically evaluated on a limited number of public benchmark datasets [7, 8, 12–14, 21], which are not compatible with newly proposed attribute-based algorithms [15, 16, 20]. This highlights the need for new benchmark datasets which are rich in attributes.
Present work. In order to foster node embedding research, we publicly release Twitch Gamers: a medium sized undirected social network of online streamers with multiple interesting vertex attributes. Using Twitch Gamers, the predictive performance of a node embedding algorithm can be tested on multiple new challenging node classification and vertex level regression problems. Potential machine learning tasks include the identification of dead accounts, the selection of users that stream explicit content, and broadcaster language prediction. Our work creates an opportunity for the assessment of numerous existing node representation learning techniques and newly developed vertex embedding procedures.
Main contributions. The most important contributions of our paper can be summarized as follows:
(1) We release Twitch Gamers: a new social network dataset which we specifically collected for benchmarking the vertex classification performance of proximity preserving and structural role-based node embedding techniques.
(2) We carry out a descriptive analysis of the social network and the underlying generic vertex features and argue that it is suitable for testing novel node embedding methods.
(3) We evaluate the performance of standard node embedding algorithms under various train/test split regimes.
The rest of our work has the following structure. We overview the related work on node embedding procedures in Section 2. We discuss the data collection and the dataset itself in Section 3. We perform a descriptive analysis of the social network and the generic vertex attributes in Section 4. In Section 5 we showcase the predictive performance of various well known node embedding techniques on the Twitch Gamers dataset. The paper concludes with Section 6, where we discuss potential future work.
2 RELATED WORK
Given a graph 𝐺 = (𝑉, 𝐸), node embedding techniques learn a function 𝑓 : 𝑉 → R^𝑑 which maps the nodes 𝑣 ∈ 𝑉 into a 𝑑-dimensional Euclidean space. When generic vertex features are not available for node classification, proximity preserving and structural role-based node embedding techniques are suitable for distilling high quality reusable feature sets [1]. We will utilize linear runtime node embedding algorithms to showcase that Twitch Gamers is suitable for multi-aspect testing of feature extraction.
Proximity preserving node embedding algorithms [6, 10, 12, 13, 16] learn this embedding by preserving a certain notion of proximity in the embedding space, such as pairwise truncated random walk transition probabilities. This way nodes that are close to each other in the graph are also close in the embedding space. Structural role-based node embedding techniques, on the other hand, preserve structural similarity in the embedding space. Nodes which have similar structural properties, such as centrality and transitivity, are close to each other in the embedding space [5, 7, 8].
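As a concrete illustration of the proximity preserving idea, the sketch below samples truncated random walks from an adjacency list; co-occurrence statistics of such walks are what methods like DeepWalk [12] implicitly factorize. This is a minimal illustrative sketch, not code from any of the cited libraries; the function name and the adjacency-list representation are our own assumptions.

```python
import random

def truncated_random_walks(adj, walks_per_node=2, walk_length=5, seed=42):
    """Sample truncated random walks from an adjacency list.

    `adj` maps each node to a list of its neighbours. Each walk starts
    at a node and repeatedly moves to a uniformly chosen neighbour,
    stopping after `walk_length` nodes (or at a node with no neighbours).
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for node in adj:
            walk = [node]
            for _ in range(walk_length - 1):
                neighbours = adj[walk[-1]]
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```

Nodes that co-occur often in these walks are close in the graph, and a proximity preserving embedding places them close in R^𝑑 as well.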
3 THE DATASET
Twitch is a streaming service where users can broadcast live streams of playing computer games. As users can follow each other, there is an underlying social network which can be accessed through the public API. In April 2018 we crawled the largest connected component of this social network with snowball sampling, starting from the user called Lowko. The released Twitch Gamers dataset is a clean subset of the original social network. We filtered out nodes and edges based on the following principled steps:
(1) No missing attributes. We only kept nodes that have all of the vertex attributes present.
(2) Mutual relationships. We discarded relationships which are asymmetric and only included mutual edges in the released dataset.
(3) Member of the largest component. We only considered nodes which are part of the largest connected component.
The result of this three step data cleaning process is an undirected, single component social network with approximately 168 thousand nodes and 6.79 million edges. Vertices in this restricted subsample do not have any missing node attributes. We summarized the name, meaning, and type of the available generic node attributes in Table 1.
Name                 | Meaning                             | Type
---------------------|-------------------------------------|------------
Identifier           | Numeric vertex identifier.          | Index
Dead Account         | Inactive user account.              | Categorical
Broadcaster Language | Languages used for broadcasting.    | Categorical
Affiliate Status     | Affiliate status of the user.       | Categorical
Explicit Content     | Explicit content on the channel.    | Categorical
Creation Date        | Joining date of the user.           | Date
Last Update          | Last stream of the user.            | Date
View Count           | Number of views on the channel.     | Count
Account Lifetime     | Days between first and last stream. | Count

Table 1: The name, meaning, and type of vertex attributes in the Twitch Gamers dataset.
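The three-step cleaning procedure described above can be sketched in plain Python. This is an illustrative reimplementation under the assumption that the raw crawl is given as a set of directed follower edges and a per-node attribute dictionary (both representations are hypothetical), not the actual pipeline used to produce the dataset.

```python
from collections import deque

def clean_graph(edges, attributes):
    """Apply the three cleaning steps: (1) no missing attributes,
    (2) mutual relationships only, (3) largest connected component.

    `edges` is a set of directed (u, v) pairs; `attributes` maps each
    node to a dict of its attribute values (None marks a missing value).
    """
    # (1) Keep only nodes with every attribute present.
    valid = {n for n, a in attributes.items()
             if all(v is not None for v in a.values())}
    # (2) Keep only reciprocated relationships, stored as undirected edges.
    mutual = {frozenset((u, v)) for u, v in edges
              if (v, u) in edges and u != v and u in valid and v in valid}
    # (3) Find the largest connected component with BFS.
    adj = {}
    for e in mutual:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    kept_edges = {e for e in mutual if tuple(e)[0] in best}
    return best, kept_edges
```

Applied to the raw crawl, steps of this kind yield the undirected, single component network described above.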
Categorical attributes such as Dead Account, Affiliate Status, and Explicit Content can be used as targets for binary classification, while Broadcaster Language can be used for multi-class node classification with more than 20 categories. The vertex attributes View Count and Account Lifetime can serve as targets for count data regression problems at the node level. Various other supervised and unsupervised machine learning tasks can be performed on the dataset, such as link prediction and community detection with ground truth labels. The Twitch Gamers dataset is publicly available at https://github.com/benedekrozemberczki/datasets.
Figure 1: The percentage of intra- and inter-class edges conditional on the Dead Account, Affiliate Status, and Explicit Content attributes.
4 DESCRIPTIVE ANALYSIS
Our descriptive analysis of Twitch Gamers focuses on the interaction of graph topology and attributes. Specifically, we investigate which potential target attributes can be predicted well with neighbourhood-preserving and structural role-based techniques.
Figure 2: The percentage of intra- and inter-class edges conditional on the Broadcaster Language attribute.
We plotted the ratio of inter- and intra-class edges conditional on the categorical attributes in Figures 1 and 2. These results show that users who broadcast in more commonly spoken languages (English, German, French) are more likely to have connections with users who broadcast in the same language. This suggests that proximity preserving node embedding techniques will extract expressive features that can predict Broadcaster Language precisely. We also see that Twitch users who churned from the platform are well embedded in the social network and do not form communities. When it comes to the Affiliate Status and Explicit Content attributes, we cannot highlight particular insights about the related linking behaviour of vertices.
Figure 3: The box plots of degree centrality and clustering coefficient conditional on the Dead Account, Affiliate Status, and Explicit Content attributes.

We used boxplots to visualize the distribution of the log transformed degree and clustering coefficient conditional on the categorical vertex attributes. We plotted these boxplots of the structural features in Figures 3 and 4. Based on these plots we can deduce that users who broadcast in more commonly spoken languages are well connected. At the same time, their friends are less likely to be connected – this potentially hints at their hub-like role. The results obtained for the other attributes are also intuitive: (i) users who churned from the platform are less central in the social network; (ii) broadcasters who use explicit language are less popular; (iii) those who obtain affiliate status are generally well connected in the social network. These findings hint that all of the categorical features can be embedded with the use of structural role-based node embedding techniques.
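The two structural features used in these boxplots can be computed directly from an adjacency structure. A minimal sketch, assuming the graph is stored as a dictionary of neighbour sets (the degree is simply the neighbour-set size), is:

```python
def clustering_coefficient(adj, node):
    """Local clustering coefficient of `node`: the fraction of its
    neighbour pairs that are themselves connected.

    `adj` maps each node to the set of its neighbours in an
    undirected graph.
    """
    neighbours = list(adj[node])
    k = len(neighbours)  # the degree of the node
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbours[j] in adj[neighbours[i]])
    return 2.0 * links / (k * (k - 1))
```

A high degree combined with a low clustering coefficient is exactly the hub-like pattern we observe for broadcasters in commonly spoken languages.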
Figure 4: The box plots of degree centrality and clustering coefficient conditional on the Broadcaster Language attribute.
5 EXPERIMENTAL EVALUATION
We use Twitch Gamers to evaluate the predictive value of features extracted with popular node embedding algorithms. The target attributes of node classification were the Explicit Content, Broadcaster Language, Dead Account, and Affiliate Status variables. In our experiments we used the open-source Karate Club [17] library with the default hyperparameter settings of the node embedding procedures. Specifically, we tested the performance of the following proximity preserving node embeddings:
(1) Diff2Vec [18, 19] factorizes a pointwise mutual information (henceforth PMI) matrix derived from a diffusion process.
(2) DeepWalk [12] decomposes the PMI matrix of summed normalized adjacency matrix powers with implicit factorization.
(3) Walklets [13] factorizes the PMI matrix of normalized adjacency matrix powers to obtain multi-scale node embeddings.
(4) RandNE [22] smooths an orthogonal node embedding matrix with powers of the adjacency matrix.
We evaluated the value of features extracted with these structural role-based node embedding algorithms:
(1) Role2Vec [2] decomposes the PMI matrix of node–tree feature co-occurrences with an implicit factorization technique.
(2) ASNE [9] factorizes a target matrix obtained by concatenating the adjacency matrix and a structural feature matrix which includes one-hot encodings of the log degree and clustering coefficient.
(3) MUSAE [15] learns multi-scale structural role-based node embeddings from matrices obtained by multiplying the structural feature matrix with adjacency matrix powers.
(4) FEATHER [20] distills node embeddings from graph characteristic functions of the log transformed degree and clustering coefficient.
We used the scikit-learn [3, 11] implementation of logistic regression with the default hyperparameter settings to predict the node labels, using the node embeddings as input features. It has to be noted that these default settings involve the use of weight regularization; because of this, each node embedding dimension was normalized. The classifiers were trained with various highly skewed train/test data split ratios, utilizing less than 1% of the data for training. We plotted mean macro-averaged AUC scores on the test set, calculated from 10 random seed train/test splits, in Figures 5 and 6.
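The AUC metric we report has a direct probabilistic reading for a binary target: the probability that a randomly drawn positive example receives a higher score than a randomly drawn negative one, with ties counting one half. The sketch below computes this definition exactly; it is an illustrative O(n²) implementation, not the scikit-learn routine we actually used.

```python
def auc_score(y_true, scores):
    """Area under the ROC curve, computed from its probabilistic
    definition: the probability that a random positive example
    (label 1) is scored above a random negative example (label 0).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5  # ties count as half a correct ordering
    return total / (len(pos) * len(neg))
```

For the multi-class Broadcaster Language target, macro-averaging amounts to computing this one-versus-rest score per class and taking the mean.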
Figure 5: Predictive performance of proximity preserving node embedding techniques on classification tasks measured by area under the curve scores on the test set as a function of training set ratio.
The most important finding based on our results is that the target attributes in Twitch Gamers are suitable for testing the predictive power of features extracted with both proximity preserving and structural role-based node embeddings. These results showcase that certain node embedding techniques have a considerable advantage on the downstream tasks. We also see evidence that proximity preserving algorithms extract features which are more useful for predicting Broadcaster Language. This was expected based on our empirical analysis, as it is an attribute which most probably strongly influences linking behaviour. Another similarly intuitive finding is that structural role-based embedding techniques [2] create more expressive features for predicting the Dead Account target variable. This is not surprising: our descriptive analysis had shown that users who churned from the platform have idiosyncratic structural attributes. Our results also verify the known fact that multi-scale proximity preserving node embeddings, such as Walklets [13] and GraRep [4], outperform techniques like DeepWalk [12] that pool information from low- and higher-order proximities.
Figure 6: Predictive performance of structural role-based node embedding techniques on classification tasks measured by area under the curve scores on the test set as a function of training set ratio.
6 CONCLUSIONS
We introduced Twitch Gamers, a medium sized social network dataset with a rich set of potential target attributes. Our descriptive analysis of the dataset demonstrated that both proximity preserving and structural role-based node embeddings can potentially distill high quality features for node classification. We verified this supposition with a series of experiments. Our findings show that Twitch Gamers can serve as an important benchmark to assess novel node embedding techniques. We are particularly excited that the prediction of certain vertex attributes turned out to be a challenging machine learning task.
ACKNOWLEDGEMENTS
Benedek Rozemberczki was supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1).
REFERENCES
[1] N. Ahmed, R. A. Rossi, J. Lee, T. Willke, R. Zhou, X. Kong, and H. Eldardiry. 2020. Role-based Graph Embeddings. IEEE Transactions on Knowledge and Data Engineering (2020), 1–1.
[2] Nesreen K. Ahmed, Ryan Rossi, John Boaz Lee, Xiangnan Kong, Theodore L. Willke, Rong Zhou, and Hoda Eldardiry. 2018. Learning Role-based Graph Embeddings. Proceedings of the 26th IJCAI Conference on Artificial Intelligence – Statistical Relational AI Workshop (2018).
[3] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API Design for Machine Learning Software: Experiences from the Scikit-Learn Project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 108–122.
[4] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 891–900.
[5] Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. 2018. Learning Structural Node Embeddings via Diffusion Wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1320–1329.
[6] Aditya Grover and Jure Leskovec. 2016. Node2Vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
[7] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. 2012. RolX: Structural Role Extraction and Mining in Large Graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1231–1239.
[8] Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina Eliassi-Rad, Hanghang Tong, and Christos Faloutsos. 2011. It's Who You Know: Graph Mining Using Recursive Structural Features. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 663–671.
[9] Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2018. Attributed Social Network Embedding. IEEE Transactions on Knowledge and Data Engineering 30, 12 (2018), 2257–2270.
[10] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric Transitivity Preserving Graph Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1105–1114.
[11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-Learn: Machine Learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
[12] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
[13] Bryan Perozzi, Vivek Kulkarni, Haochen Chen, and Steven Skiena. 2017. Don't Walk, Skip!: Online Learning of Multi-scale Network Embeddings. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, 258–265.
[14] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and Node2Vec. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. ACM, 459–467.
[15] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. 2019. Multi-Scale Attributed Node Embedding. arXiv preprint arXiv:1909.13021 (2019).
[16] Benedek Rozemberczki, Ryan Davies, Rik Sarkar, and Charles Sutton. 2019. GEMSEC: Graph Embedding with Self Clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, 65–72.
[17] Benedek Rozemberczki, Oliver Kiss, and Rik Sarkar. 2020. Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). ACM.
[18] Benedek Rozemberczki, Oliver Kiss, and Rik Sarkar. 2020. Little Ball of Fur: A Python Library for Graph Sampling. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). ACM, 3133–3140.
[19] Benedek Rozemberczki and Rik Sarkar. 2018. Fast Sequence-Based Embedding with Diffusion Graphs. In International Workshop on Complex Networks. Springer, 99–107.
[20] Benedek Rozemberczki and Rik Sarkar. 2020. Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). ACM.
[21] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-Scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web. 1067–1077.
[22] Ziwei Zhang, Peng Cui, Haoyang Li, Xiao Wang, and Wenwu Zhu. 2018. Billion-Scale Network Embedding with Iterative Random Projection. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 787–796.