AttrE2vec: Unsupervised Attributed Edge Representation Learning
Piotr Bielak (a), Tomasz Kajdanowicz (a), Nitesh V. Chawla (a,b)
(a) Department of Computational Intelligence, Wroclaw University of Science and Technology, Poland
(b) Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
Abstract
Representation learning has overcome the often arduous and manual featurization of networks through (unsupervised) feature learning, as it results in embeddings that can be applied to a variety of downstream learning tasks. Representation learning on graphs has focused mainly on shallow (node-centric) or deep (graph-based) learning approaches. While there have been approaches that work on homogeneous and heterogeneous networks with multi-typed nodes and edges, there is a gap in learning edge representations. This paper proposes a novel unsupervised inductive method called AttrE2vec, which learns a low-dimensional vector representation for edges in attributed networks. It systematically captures the topological proximity, attribute affinity, and feature similarity of edges. Contrary to current advances in edge embedding research, our proposal extends the body of methods providing representations for edges, capturing graph attributes in an inductive and unsupervised manner. Experimental results show that, compared to contemporary approaches, our method builds more powerful edge vector representations, reflected by higher quality measures (AUC, accuracy) in downstream tasks such as edge classification and edge clustering. This is also confirmed by analyzing low-dimensional embedding projections.
Keywords: representation learning, graphs, edge embedding, random walk, neural network, attributed graph.
1. Introduction
Complex networks, including attributed and heterogeneous networks, are ubiquitous: from recommender systems to citation networks and biological systems [1]. These networks present a multitude of machine learning problem statements, including node classification, link prediction, and community detection. A fundamental aspect of any such machine learning (ML) task, transductive or inductive, is the availability of featurized data. Traditionally, researchers have identified several network characteristics suited to specific ML tasks and used them for the learning algorithm. This practice is arduous, as it often entails customizing features for each specific ML task, and it is also limited to the computable characteristics.

This has led to a surge in (unsupervised) algorithms and methods that learn embeddings from networks, such that these embeddings form the featurized representation of the network for the ML tasks [2, 3, 4, 5, 6]. This area of research is generally known as representation learning in networks. The embeddings produced by representation learning methods are generally agnostic to the end use case, as they are generated in an unsupervised fashion. Traditionally, the focus was on representation learning in homogeneous networks, i.e., networks that have a single type of nodes and edges and no attributes attached to the nodes and edges [4].

Preprint submitted to Information Sciences, January 1, 2021

Figure 1: Our proposed AttrE2vec model compared to other methods in the task of attributed graph embedding. Colors denote edge features. On the left, we see a graph where the features are aligned with substructures of the graph. On the right, the features were shuffled (ca. 50%). Traditional approaches fail to build robust representations, whereas our method includes feature information to construct the embedding vectors.

Existing representation learning models mainly focus on transductive learning, where a model can only be trained using the entire input graph. This means that the model requires all the nodes and a fixed structure of the network in the training phase, e.g., Node2vec [7], DeepWalk [8] and, to some extent, GCN [9]. Besides, there have been methods focused on heterogeneous networks that incorporate differently typed nodes and edges, as well as content at each node [10, 11]. On the other hand, a less explored and exploited approach is the inductive setting. In this approach, only a part of the network is used to train the model, which then infers embeddings for new nodes.
Several attempts have been made in the inductive setting, including EP-B [12], GraphSAGE [13], GAT [14], SDNE [15], TADW [16], AHNG [17] and PVECB [18]. There is also recent progress on heterogeneous graph embedding, e.g., MIFHNE [19] or models based on graph neural networks [20].

State-of-the-art network embedding techniques are mostly unsupervised, i.e., they aim at learning low-dimensional representations that preserve the structure of the input graph, e.g., GraphSAGE [13], DANE [21], line2vec [22], RCAN [23]. Nevertheless, semi-supervised or supervised methods can learn vector representations, but only for a specific downstream prediction task, e.g., TADW [16] or FSCNMF [24]. Hence, it has been shown in the literature that not much supervision is required to learn the embeddings.

In recent years, proposed models have mainly focused on graphs that do not contain attributes related to nodes and edges [4]. This is especially noticeable for edge attributes: the majority of proposed approaches consider node attributes only, omitting the richness of the edge feature space while learning the representation. Nevertheless, models such as DANE [21], GraphSAGE [13], SDNE [15] or CAGE [25] make use of node features, and EGNN [26], NEWEE [27] and EGAT [28] consume edge attributes.

Table 1: Comparison of the most representative graph embedding methods, with their abilities to learn representations with or without attributes, their reasoning types, and short characteristics. The most prominent and appropriate methods, selected for comparison with AttrE2vec in the experiments, are marked in bold in the original.
Method (year)               Capabilities    Family
(capability marks, left to right, cover: Representation – Nodes, Edges; Attributed – Nodes, Edges; Reasoning – Transduct., Induct.)

Supervised
ECN [29] (2016)             ✓✓              neigh. aggr.
GCN [9] (2017)              ✓✓✓✓            GCN/GNN
ECC [30] (2017)             ✓✓✓             GCN, DL
FSCNMF [24] (2018)          ✓✓✓             GCN
GAT [14] (2018)             ✓✓✓✓            AE, DL
Planetoid [31] (2018)       ✓✓✓✓            GNN
EGNN [26] (2019)            ✓✓✓✓✓✓          GNN
EdgeConv [32] (2019)        ✓✓              GNN
EGAT [28] (2019)            ✓✓✓✓✓✓          GNN
Attribute2vec [33] (2020)   ✓✓✓             GCN

Unsupervised
DeepWalk [8] (2014)         ✓✓              RW, skip-gram
TADW [16] (2015)            ✓✓✓             RW, MF
LINE [34] (2015)            ✓✓              RW, skip-gram
Node2vec [7] (2016)         ✓✓              RW, skip-gram
SDNE [15] (2016)            ✓✓✓✓            AE
GraphSAGE [13] (2017)       ✓✓✓✓            RW
EP-B [12] (2017)            ✓✓✓✓            AE
Struc2vec [35] (2017)       ✓✓              RW, skip-gram
DANE [21] (2018)            ✓✓✓✓            AE
Line2vec [22] (2019)        ✓✓              RW, skip-gram
NEWEE [27] (2019)           ✓✓✓✓            RW, skip-gram
AttrE2vec (2020)            ✓✓✓✓✓           RW, AE, DL
Both node-based embedding methods and methods inspired by graph neural networks do not generalize effectively to both transductive and inductive settings, especially when there are attributes associated with edges. This work is motivated by the idea of unsupervised learning on networks with attributed edges, such that the embeddings are generalizable across tasks and are inductive.

To that end, we develop AttrE2vec, a novel unsupervised learning model that adapts an auto-encoder and a self-attention network with a feature reconstruction and a graph structural loss. To learn an edge representation, AttrE2vec splits the edge neighborhood into two parts, one for each end node of the edge, and then generates random edge walks in both neighborhoods. All walks are aggregated over the node and edge attributes using one of the proposed strategies (Avg, Exp, GRU, ConcatGRU). These are accumulated with the original node and edge features and then fed to an attention and a dense layer to encode the edge. The embeddings are subsequently inferred via a two-part loss function, combining a feature reconstruction loss and a graph structural loss. As a consequence, AttrE2vec can explicitly incorporate feature information from nodes and edges many hops away to effectively produce plausible edge embeddings in the inductive setting.

In summary, our main contributions are as follows:

• we propose AttrE2vec, a novel unsupervised method that learns a low-dimensional vector representation for attributed edges;
• we exploit the concept of graph-topology-driven edge feature aggregation, from simple strategies to learnable GRU-based ones, capturing the topological proximity of edges and the similarity of edge features;
• the proposed method is inductive and can produce representations for edges not present in the training phase;
• we conduct various experiments and show that our AttrE2vec method achieves superior performance over all baseline methods on edge classification and clustering tasks.
2. Related work and Research Gap
Embedding information networks has received significant interest from the research community. We refer the reader to the survey articles [4, 5, 3, 2] for a comprehensive overview of network embedding and cite only the most prominent works that are relevant here.
Unsupervised network embedding methods use only the network structure or the original attributes of nodes and edges to construct embeddings. The most common method is DeepWalk [8], which, in two phases, constructs node neighborhoods by performing fixed-length random walks and employs the skip-gram [7] model to preserve the co-occurrences between nodes and their neighbors. This two-phase framework later inspired further network embedding methods, which propose different strategies for constructing node neighborhoods or for modeling co-occurrences between nodes, e.g., Node2vec [7], Struc2vec [35], GraphSAGE [13], line2vec [22] or NEWEE [27]. Another group of unsupervised methods utilizes auto-encoders or graph neural networks to obtain embeddings. SDNE [15] uses an auto-encoder architecture to preserve first- and second-order proximities by jointly optimizing the neighborhood reconstruction loss. Other auto-encoder based representatives are EP-B [12] and DANE [21].
Supervised network embedding methods are constructed as end-to-end methods for particular tasks, like node classification or link prediction. These methods require the network structure, the attributes of nodes and edges (if the method is capable of using them) and some annotated target, like a node class. Representatives are ECN [29], ECC [30], FSCNMF [24], GAT [14], Planetoid [31], EGNN [26], GCN [9], EdgeConv [32], EGAT [28] and Attribute2vec [33].

Edge representation learning has already been tackled by several methods, i.e., ECN [29], EGNN [26], line2vec [22], EdgeConv [32], EGAT [28]. However, none of these methods is able to directly take the attributes of edges into account while also performing the learning in an unsupervised manner. All the characteristics of the representative node and edge representation learning methods are grouped in Table 1.
3. Method
In the following paragraphs, we explain our three-fold motivation for proposing AttrE2vec.

Edge embeddings. For a decade, network processing approaches have gathered more and more attention, as graph data is produced in an increasing number of systems. Network embedding has traditionally provided the notion of vectorizing nodes, used in node classification or clustering. However, edge representation learning has not gathered comparable attention and has been accomplished through transformations of node embeddings [36]. Nevertheless, such an approach is problematic. For instance, inferring the edge type from the embeddings of neighboring nodes may not be the best choice for edge type classification in heterogeneous social networks. We claim that efficient edge clustering, edge attribute regression, or link prediction tasks require dedicated and specific edge representations. We expect that a representation learning approach devoted strictly to edges provides more powerful vector representations than traditional methods, which require node embeddings trained upfront and transform node embeddings to represent edges.
Inductive embedding methods. The vast majority of contemporary network representation learning methods are transductive (see Table 1). This means that any change to the graph requires retraining the whole method to provide predictions for unseen cases; such a property limits the applicability of these methods due to high computational costs. In contrast, the inductive approach builds a predictive ability that can be applied to unseen cases without retraining; in general, inductive methods have a lower computational cost. Considering these advantages, we expect modern edge embedding methods to be inductive.
Encoding graph attributes in embeddings. Much real-world data exhibits rich attribute sets or meta-data that contain crucial information, e.g., about the similarity of nodes or edges. Traditionally, graph representation learning has focused on exploiting the network structure, omitting the related content. Thus, we may expect to consume attributes as a regularizer over the structure. This allows overcoming the limitation that arises when the only edge-discriminating information is encoded in the edges' attributes, not in the graph's structure; relying only on the network structure would produce inconclusive embeddings.

3.2. Attributed graph edge embedding
We denote an attributed graph as G = (V, E), where V is a set of nodes and E = {(u, v) ∈ V × V} is a set of edges. Every node u and every edge e = (u, v) has associated features: m_u ∈ R^{d_V} and f_uv ∈ R^{d_E}, where M ∈ R^{|V| × d_V} and F ∈ R^{|E| × d_E} are the node and edge feature matrices, respectively. By d_V we denote the dimensionality of the node feature space, and by d_E the dimensionality of the edge feature space. The edge embedding task is defined as learning a function g: E → R^d, which takes an edge and outputs its low-dimensional vector representation. Note that the embedding dimension d should be much smaller than the original edge feature dimensionality, i.e., d << d_E. More specifically, we aim at using the topological structure of the graph together with the node and edge attributes: f: (E, F, M) → R^d.

Figure 2: Overview of the AttrE2vec model. The model first computes edge random walks in the two neighborhoods of a given edge (u, v). The walks of each neighborhood are aggregated into S_u and S_v. Both are combined with the edge features f_uv using an Encoder module, which results in the edge embedding vector h_uv. The loss function consists of two parts: a structural loss (L_cos) and a feature reconstruction loss (L_MSE).

In contrast to traditional node embedding methods, we shift the focus from nodes to edges and consider the graph from an edge perspective. Given any edge e = (u, v), we can observe three natural sources of knowledge: the edge attributes themselves and the two neighborhoods, N_u and N_v, located behind nodes u and v, respectively. In AttrE2vec, we exploit all three sources jointly.

First, we obtain aggregations (summaries) S_u, S_v of both neighborhoods N_u, N_v. We want to capture the topological structure of the neighborhood, so we perform k edge random walks of length L, which start from node u (or v, respectively) and use uniformly distributed neighbor sampling (DeepWalk-like) to obtain the next edge. Each i-th walk w^i_u started from node u is hence a sequence of edges:

RW(G, k, L, u) → {w^1_u, w^2_u, ..., w^k_u}
w^i_u ≡ ((u, u_1), (u_1, u_2), ..., (u_{L-1}, u_L))

Next, we take the attributes of the edges (and nodes, if applicable) in each random walk and aggregate them into a single vector using the walk aggregation model Agg_w:

S^i_u = Agg_w(w^i_u, F, M)

Later, the aggregated walks are combined using the neighborhood aggregation model Agg_n, which summarizes the neighborhood into S_u (and S_v, respectively). The proposed implementations of these aggregations are given in Section 3.4:

S_u = Agg_n({S^1_u, S^2_u, ..., S^k_u})

Finally, we obtain the low-dimensional edge embedding h_uv using an encoder (Enc) module. It combines the edge attributes f_uv with the summarized neighborhood information S_u, S_v.
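The uniform edge random walk RW(G, k, L, u) can be sketched as follows (a minimal illustration; the function names and the adjacency-list representation are our assumptions, not the authors' code):

```python
import random

def edge_random_walk(adj, start, length):
    """One uniform random walk of `length` edges starting at node `start`.

    `adj` maps each node to a list of its neighbors; the walk is returned
    as a list of consecutive edges (u_{n-1}, u_n), as in w^i_u above.
    """
    walk, current = [], start
    for _ in range(length):
        nxt = random.choice(adj[current])  # uniform neighbor sampling (DeepWalk-like)
        walk.append((current, nxt))
        current = nxt
    return walk

def rw(adj, k, length, u):
    """RW(G, k, L, u): k independent edge walks started from node u."""
    return [edge_random_walk(adj, u, length) for _ in range(k)]
```

Each returned walk is a sequence of L edges whose endpoints chain together, which is exactly the input the walk aggregation model Agg_w consumes.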
We employ a simple Multilayer Perceptron (MLP) with 3 inputs (each of a size equal to the edge feature dimensionality) and an attention mechanism over these inputs, which lets us check how much of the information of each input is used to create the embedding vector (see Figure 3):

h_uv = Enc(f_uv, S_u, S_v)

Figure 3: Encoder module architecture
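A NumPy sketch of such an encoder follows. The single learned score vector, the softmax attention over the three inputs, and the layer sizes are illustrative assumptions, not the authors' exact architecture:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class Encoder:
    """h_uv = Enc(f_uv, S_u, S_v): attention over the three inputs, then a dense layer."""

    def __init__(self, edge_dim, emb_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One score vector yields an attention weight per input (f_uv, S_u, S_v).
        self.w_score = rng.normal(size=(edge_dim, 1)) / np.sqrt(edge_dim)
        self.w_out = rng.normal(size=(3 * edge_dim, emb_dim)) / np.sqrt(3 * edge_dim)

    def __call__(self, f_uv, s_u, s_v):
        inputs = np.stack([f_uv, s_u, s_v], axis=1)     # (batch, 3, edge_dim)
        alpha = softmax(inputs @ self.w_score, axis=1)  # how much each input contributes
        attended = alpha * inputs                       # reweighted inputs
        return attended.reshape(len(f_uv), -1) @ self.w_out  # (batch, emb_dim)

enc = Encoder(edge_dim=260, emb_dim=64)
h_uv = enc(np.ones((8, 260)), np.zeros((8, 260)), np.zeros((8, 260)))
```

The attention weights alpha can be inspected directly, which is what the authors use to check how much each input contributes to the embedding.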
The overall illustration of the method is given in Figure 2, and the inference procedure is shown in Algorithm 1.

Algorithm 1: AttrE2vec inference algorithm
  Data: graph G, edge list xe, edge features F, node features M
  Params: number of random walks k, random walk length L
  Result: edge embedding vectors h_uv
  begin
    foreach (u, v) in xe do
      foreach i in (1...k) do
        w^i_u = RW(G, L, u)
        S^i_u = Agg_w(w^i_u, F, M)
        w^i_v = RW(G, L, v)
        S^i_v = Agg_w(w^i_v, F, M)
      end
      S_u = Agg_n({S^1_u, ..., S^k_u})
      S_v = Agg_n({S^1_v, ..., S^k_v})
      h_uv = Enc(f_uv, S_u, S_v)
    end
  end

For the neighborhood aggregation model Agg_n, we use an average over the vectors S^i_u, as there is no particular ordering of these vectors (each one was generated by an equally important random walk). For the walk aggregation model Agg_w, we propose the following variants:

• average – computes a simple average of the edge attribute vectors in the random walk:

  S^i_u = (1/L) Σ_{n=1}^{L} f_{u_{n-1} u_n}

• exponential – computes a weighted average, where the weights are the exponents of the negated position in the random walk, so that edges further away are less important than near ones:

  S^i_u = (1/L) Σ_{n=1}^{L} e^{-n} f_{u_{n-1} u_n}

• GRU – uses a Gated Recurrent Unit [37] architecture, where the hidden and input dimensions are equal to the edge attribute dimension; the aggregated representation is the output of the last hidden vector; the aggregation starts at the end of the random walk and proceeds towards its beginning:

  S^i_u = GRU({f_{u_{L-1} u_L}, ..., f_{u_1 u_2}, f_{u u_1}})

• ConcatGRU – similar to the GRU-based aggregator, but it also uses node feature information by concatenating the node attributes with the edge attributes; hence, the GRU input size is equal to the sum of the edge and node dimensions. If no node features are available, one could use network-specific features, like degree or betweenness, or more advanced techniques, like Node2vec. The hidden dimension size and the aggregation direction remain unchanged:

  S^i_u = ConcatGRU({f_{u_{L-1} u_L} ⊕ m_{u_{L-1}}, ..., f_{u u_1} ⊕ m_u})

AttrE2vec is designed to make the most of the edge attributes and of the information about the structure of the network.
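The non-learnable aggregators are straightforward to sketch in NumPy (illustrative; the GRU variants would replace these functions with a learned recurrent unit run from the end of the walk to its beginning):

```python
import numpy as np

def agg_avg(walk_feats):
    """Average aggregator: S^i_u = (1/L) * sum_{n=1..L} f_{u_{n-1} u_n}.

    `walk_feats` is an (L, d_E) array of edge attribute vectors along one walk.
    """
    return walk_feats.mean(axis=0)

def agg_exp(walk_feats):
    """Exponential aggregator: the n-th edge is weighted by e^{-n},
    so edges further along the walk contribute less."""
    L = len(walk_feats)
    weights = np.exp(-np.arange(1, L + 1))
    return (weights[:, None] * walk_feats).sum(axis=0) / L

def agg_neighborhood(walk_summaries):
    """Agg_n: a plain average over the k walk summaries S^1_u ... S^k_u."""
    return np.mean(walk_summaries, axis=0)
```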
Therefore, we propose a loss function that consists of two main parts:

• structural loss L_cos – a cosine embedding loss; it minimizes the cosine distance between a given embedding h and the embeddings of edges sampled from its random walks, h+ (positives), and simultaneously maximizes the cosine distance between the embedding h and the embeddings of edges sampled from the set of all edges in the graph, h− (negatives), excluding those in the random walks:

  L_cos = (1/|B|) Σ_{h_uv ∈ B} [ Σ_{h+_uv} (1 − cos(h_uv, h+_uv)) + Σ_{h−_uv} cos(h_uv, h−_uv) ]

  where B denotes a minibatch of edges and |B| the minibatch size;

• feature reconstruction loss L_MSE – the mean squared error of the actual edge features and the outputs of a decoder (implemented as a 3-layer MLP, see Figure 4) that reconstructs the edge features from the edge embeddings:

  L_MSE = (1/|B|) Σ_{(h_uv, f_uv) ∈ B} (DEC(h_uv) − f_uv)²

  where B denotes a minibatch of edges and |B| the minibatch size.

Figure 4: Decoder module architecture
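Both loss terms, and the λ-mixing that combines them, can be sketched as follows (NumPy, per-edge for clarity; all names are illustrative):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def structural_loss(h, positives, negatives):
    """Per-edge L_cos term: pull random-walk co-occurring edges (positives)
    towards h and push randomly sampled edges (negatives) away."""
    pull = sum(1.0 - cosine(h, hp) for hp in positives)
    push = sum(cosine(h, hn) for hn in negatives)
    return pull + push

def feature_loss(decoded, f_uv):
    """Per-edge L_MSE term: squared error of the decoder output vs. true features."""
    return float(np.mean((decoded - f_uv) ** 2))

def total_loss(l_cos, l_mse, lam=0.5):
    """L = lambda * L_cos + (1 - lambda) * L_MSE, with lambda in [0, 1]."""
    return lam * l_cos + (1.0 - lam) * l_mse
```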
We combine the values of the above loss functions using a mixing parameter λ ∈ [0, 1]:

  L = λ L_cos + (1 − λ) L_MSE

4. Experiments

To evaluate the proposed model's performance, we perform three tasks: edge classification, edge clustering, and embedding visualization, on three real-world datasets. We first train our model on a small subset of edges (the inductive setting). Then, we use the model to infer embeddings for the edges from the test set. Finally, we evaluate them in all downstream tasks: by predicting the class of edges in citation graphs (edge classification), by applying the K-means++ algorithm (edge clustering, as defined in [22]), and by the dimensionality reduction method T-SNE (embedding visualization). In all experiments, we compare our model to several baselines and contemporary methods, see Table 1. Eventually, we check the influence of AttrE2vec's hyperparameters and perform an ablation study on artificially generated datasets. We implement our model in the popular deep learning framework PyTorch. All experiments were performed on an NVIDIA GTX 1080 Ti. Upon acceptance in the journal, we will make our code available at https://github.com/attre2vec/attre2vec and include our DVC [38] pipeline so that all experiments can be easily reproduced.
Table 2: Datasets used in the experiments.

Name      Features (initial)   Features (pre-proc.)   Number of                    Training instances
          node     edge        node     edge          nodes    edges    classes    inductive   transductive
Cora      1 433    0           32       260           2 485    5 069    7+1        160         5 069
Citeseer  3 703    0           32       260           2 110    3 668    6+1        140         3 668
Pubmed    500      0           32       260           19 717   44 324   3+1        80          44 324
To gather comparable evaluation evidence, we focus on well-known datasets from the literature: Cora [39], Citeseer [39] and Pubmed [40]. These are citation networks of scientific papers in several research areas, where nodes are the papers and edges denote citations between papers. We summarize basic statistics about the datasets before and after the pre-processing steps in Table 2. The raw datasets contain node features only, in the form of high-dimensional sparse bags of words. For Cora and Citeseer, these are binary vectors indicating which of the most popular words were used in a given paper; for Pubmed, the features are TF-IDF vectors. To adjust the datasets to our problem setting, we apply the following pre-processing steps to obtain edge-level features, which are used to train and evaluate our AttrE2vec model:

• we create dense vector representations of the nodes' features by applying Doc2vec [41] in the PV-DBOW variant, with a target dimension size of 128;
• for each edge (u, v) and its symmetrical version (v, u) (necessary to perform uniform, undirected random walks), we extract features based on the binary BoW vectors of u and v (for Pubmed, transformed from TF-IDF to binary BoW) and 256 features as the concatenation of the Doc2vec features of nodes u and v;
• we apply standardization (StandardScaler in Scikit-Learn [42]) to the edge feature matrix.

Moreover, we extract new node features as 32-dimensional Node2vec embeddings, to make possible the evaluation of one of our model versions (AttrE2vec with the ConcatGRU aggregator), which generalizes upon both edge and node attributes.

In the raw datasets, each node is labeled with the research area the paper comes from. To apply this knowledge in the edge classification setting, we use the following rule: if an edge connects two nodes of the same class (research area), the edge receives this class; if the two nodes have different classes, the edge between them is assigned a cross-domain citation class.

To ensure a fair comparison, we follow the dataset preparation scheme from EP-B [12]: for each dataset (Cora, Citeseer, Pubmed) we sample 10 train/validation/test sets, where the train set consists of 20 edges per class, and the validation and test sets contain 1 000 randomly chosen edges each. When reporting the resulting metrics, we show the mean values over these ten sampled sets (together with the standard deviation).

We compare our method against several baseline methods. In the simplest case, we use the edge features obtained during the pre-processing phase for all datasets (further referred to as
Doc2vec). Many standard approaches employ simple transformations of node embeddings to obtain edge embeddings. The authors of Node2vec [36] proposed binary operators such as averaging, the Hadamard product, or the L1 and L2 norms of vector differences. Here, we use the following methods to obtain node embeddings: DeepWalk [8], Node2vec [36], SDNE [43] and Struc2vec [35]. In preliminary experiments, we evaluated these methods and found that the Average operator and an embedding size of 64 give the best results. We use these models in two setups: (a)
Avg(M, M) – using only the averaged node features, and (b) Avg(M, M) ⊕ F – as before, but concatenated with the edge features from the dataset (in total, 324-dimensional vectors). We also evaluated a scheme computing a 64-dimensional PCA reduction of the concatenated features, to match the 64-dimensional embeddings of our model, but it turned out to perform poorly. Note that SDNE is capable of inductive reasoning, but due to the non-availability of such an implementation, we decided to evaluate this method in the transductive scheme (which works in favor of the method).

Figure 5: Architecture of the MLP(M, M).
Figure 6: Architecture of the MLP(M, M, F).

We also extend our body of baselines with two more sophisticated approaches, namely two dense autoencoder architectures. In the first setting,
MLP(M, M), we train a model (see Figure 5) that reconstructs the concatenated embeddings of connected nodes. In the second baseline, MLP(M, M, F), the autoencoder (see Figure 6) is extended with edge attributes. In both settings, we employ the mean squared error as the model loss function. The output of the encoders (the embeddings) is used in the downstream tasks. The input node embeddings are obtained using the methods mentioned above, i.e., DeepWalk, Node2vec, SDNE, and Struc2vec. The last baseline is Line2vec [22], which is directly dedicated to edges; we use an embedding size of 64.

To evaluate our model in an inductive setting, we must ensure that test edges are unseen during the model training procedure, so we remove them from the graph. Note that all baselines (except for GraphSage, see Table 1) require all edges during the training phase (i.e., they are transductive methods). After each training epoch of
AttrE2vec, we evaluate the embeddings using an L2-regularized Logistic Regression (LR) classifier and compute the AUC. The regression model is trained on edge embeddings from the train set and evaluated on edge embeddings from the validation set. We take the model with the highest AUC value on the validation set.

Table 3: AUC values for edge classification. F denotes the edge attributes (also referred to as "Doc2vec"), M – node attributes (e.g., embeddings computed using "Node2vec"), ⊕ – the concatenation operator, Avg(M, M) – the average operator on node embeddings, MLP(·) – the encoder output of an MLP autoencoder trained on the given attributes. AUC in bold shows the highest value and AUC in italic – the second highest value.
Method group/name                               Vector size   Citeseer   Cora   Pubmed
Transductive
Edge features only; F (Doc2vec)                 260           86.13      –      –
Line2vec                                        64            86.19      –      –
Avg(M, M): DeepWalk                             64            58.40      –      –
Avg(M, M): Node2vec / SDNE / Struc2vec          64            –          –      –
MLP(M, M): DeepWalk                             64            55.88      –      –
MLP(M, M): Node2vec / SDNE / Struc2vec          64            –          –      –
Avg(M, M) ⊕ F: DeepWalk                         324           86.13      –      –
Avg(M, M) ⊕ F: Node2vec / SDNE / Struc2vec      324           –          –      –
MLP(M, M, F): DeepWalk                          64            84.58      –      –
MLP(M, M, F): Node2vec / SDNE / Struc2vec       64            –          –      –
Inductive
Avg(M, M): GraphSage                            64            54.84      –      –
MLP(M, M): GraphSage                            64            55.19      –      –
Avg(M, M) ⊕ F: GraphSage                        324           86.14      –      –
MLP(M, M, F): GraphSage                         64            84.63      –      –
AttrE2vec (our): Avg                            64            –          –      –
AttrE2vec (our): Exp                            64            88.91      –      –
AttrE2vec (our): ConcatGRU                      64            88.56      –      –
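The downstream classification protocol (an L2-regularized logistic regression trained on train-set edge embeddings, scored by one-vs-rest AUC) can be sketched with scikit-learn; the data below is a synthetic stand-in for edge embeddings, not the actual datasets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for 64-dimensional edge embeddings with three classes:
# the first coordinate carries the class signal.
y_train = np.repeat(np.arange(3), 40)
X_train = rng.normal(size=(120, 64))
X_train[:, 0] += 3.0 * y_train

y_val = np.repeat(np.arange(3), 34)
X_val = rng.normal(size=(102, 64))
X_val[:, 0] += 3.0 * y_val

# L2-regularized logistic regression on the "train" embeddings,
# evaluated by (one-vs-rest) AUC on the "validation" embeddings.
clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_val, clf.predict_proba(X_val), multi_class="ovr")
```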
001 to optimize our model’s pa-rameters. We also set the size of positive samples to | h + | = 5 and negative samplesto | h − | = 10 in the cosine embedding loss. The mixing coefficient is set to λ = 0 . Avg( M , M ) , achieve poor results of about 50-60% AUC. However, if these are combined with the edge features from the datasets Avg( M , M ) ⊕F , the AUC values increase significantly to about 86%, 88% and 79% forCiteseer, Cora, and Pubmed, respectively. Unfortunately, this results in an even highervector dimensionality (324).The MLP-based approach results lead to similar conclusions. Using only node em-beddings MLP( M , M ) we achieve quite poor results of about 50% (on Pubmed) up to60% (on Cora). With MLP( M , M , F ) approach we observe that edge features improvethe classification results. The AUC values are still slightly worse than concatenationoperator ( Avg( M , M ) ⊕F ), but we can reduce the edge embedding size to 64.The Line2vec [22] algorithm achieves very good results, without considering edgefeatures information – we get about 86%, 92% and 85% AUC for Citeseer, Cora, andPubmed, respectively. These values are higher than for any other baseline approach.Our model performs the best among all evaluated methods. For Citeseer, we gainabout 3 percent points compared to the best baselines: Line2vec, Struc2vec ( Avg( M , M ) ⊕F )or GraphSage ( Avg( M , M ) ⊕F ). Note that the algorithm is trained only on 140 edgesin the inductive setting, whereas all transductive baselines require the whole graph fortraining. The gains on Cora are 2 pp, and on Pubmed we achieve up to 4pp (and upto 8pp compared only to GraphSage ( Avg( M , M ) ⊕F ) ). Our model with the Average(Avg) aggregator works the best, whereas the Gated Recurrent Unit (GRU) aggregatorachieves the second-best results. Similarly to Line2vec [22], we apply the K-Means++ algorithm on the resulting em-bedding vectors and compute an unsupervised clustering accuracy [46]. We summarizethe results in Table 4. 
Our model performs best in all but one case and achieves significantly better results than the other baseline methods. The only exception is the Pubmed dataset, where Line2vec achieves the best clustering accuracy. The other baseline methods perform similarly to how they did in the edge classification task, so we do not discuss them in detail and encourage the reader to go through the results.
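Unsupervised clustering accuracy matches predicted cluster ids to ground-truth classes via an optimal one-to-one assignment (the Hungarian algorithm). A sketch assuming scikit-learn and SciPy, run on toy data:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy: count co-occurrences of cluster ids and
    class labels, then find the best one-to-one matching between them."""
    n = int(max(y_true.max(), y_pred.max())) + 1
    counts = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    row, col = linear_sum_assignment(counts.max() - counts)  # maximize matched pairs
    return counts[row, col].sum() / len(y_true)

# Toy example: three well-separated blobs clustered with K-means++.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 50)
X = rng.normal(scale=0.1, size=(150, 2)) + 5.0 * y[:, None]
pred = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit_predict(X)
acc = clustering_accuracy(y, pred)
```

Since cluster ids are arbitrary, the Hungarian matching step is what makes the accuracy invariant to label permutations.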
For all tested baseline methods and our proposed AttrE2vec method, we compute 2-dimensional projections of the produced embeddings using the T-SNE [47] method and visualize them in Figure 7. In our subjective opinion, these plots correspond to the AUC scores reported in Table 3: the higher the AUC, the better the group separation. In detail, the raw Doc2vec edge features seem to form groups, but they overlap to some degree. We cannot observe any pattern in the node-embedding-based settings (Avg(M, M) and MLP(M, M)); they appear quasi-random. When concatenated with the edge attributes (Avg(M, M) ⊕ F and MLP(M, M, F)), we observe slightly better grouping, but it is still not satisfying. The AttrE2vec model produces much better-formed groups, with only a little overlap. To summarize, based on the observed group separability and the AUC metrics, our approach works best among all methods.

Figure 7: 2-D T-SNE projections of embedding vectors for all evaluated methods. Columns denote the aggregation approach, besides F, which denotes the edge attributes, and g(E), which is an edge embedding obtained from the graph structure only. Rows gather the particular methods.

Table 4: Accuracy on edge clustering. F denotes the edge attributes (also referred to as "Doc2vec"), M – node attributes (e.g., embeddings computed using "Node2vec"), ⊕ – the concatenation operator, Avg(M, M) – the average operator on node embeddings, MLP(·) – the encoder output of an MLP autoencoder trained on the given attributes. Accuracy in bold shows the highest value and accuracy in italic – the second highest value.
Method group/name                      Vector size   Citeseer     Cora   Pubmed
Transductive
  Edge features only; F (Doc2vec)          260       54.13 ± …      …       …
  Line2vec                                  64       54.73 ± …      …       …
  Avg(M, M): DeepWalk                       64       28.89 ± …      …       …
  Avg(M, M): …                               …            …         …       …
  MLP(M, M): DeepWalk                       64       26.36 ± …      …       …
  MLP(M, M): …                               …            …         …       …
  Avg(M, M) ⊕ F: DeepWalk                  324       54.13 ± …      …       …
  Avg(M, M) ⊕ F: …                           …            …         …       …
  MLP(M, M, F): DeepWalk                    64       48.74 ± …      …       …
  MLP(M, M, F): …                            …            …         …       …
Inductive
  Avg(M, M): GraphSage                      64       18.79 ± …      …       …
  MLP(M, M): GraphSage                      64       18.92 ± …      …       …
  Avg(M, M) ⊕ F: GraphSage                 324       54.06 ± …      …       …
  MLP(M, M, F): GraphSage                   64       48.79 ± …      …       …
  AttrE2vec (our): Avg                      64       59.82 ± …      …       …
  AttrE2vec (our): …                         …            …         …       …
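For reference, 2-D projections like those in Figure 7 can be produced with an off-the-shelf t-SNE implementation. A minimal sketch, assuming scikit-learn, with synthetic stand-ins for the learned 64-dimensional edge embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for 64-dimensional edge embeddings of two edge classes.
rng = np.random.default_rng(42)
emb = np.vstack([rng.normal(0.0, 1.0, (100, 64)),
                 rng.normal(5.0, 1.0, (100, 64))])

# Project to 2-D; the result can be scatter-plotted, colored by edge class.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
```

Note that t-SNE preserves local neighborhoods rather than global distances, so group separation in the plot is a qualitative, not quantitative, signal.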
5. Hyperparameter Sensitivity of AttrE2vec
We investigate the effect of the hyperparameters by considering each of them independently, i.e., varying a given parameter while preserving the default values of all others. The evaluation is applied to the two inductive variants of our model: with the Average aggregator and with the GRU aggregator. We use all three datasets (Cora, Citeseer, Pubmed) and report AUC values. We examine the following hyperparameters (an asterisk denotes the default value):

• the length of a single random walk, L,

• the number of random walks, k,

• the embedding size, d = {16, 32, 64∗},

• the loss mixing parameter, λ (default λ = 0.5).

Figure 8: Effects of hyperparameters on the Cora, Citeseer and Pubmed datasets.

The results of all experiments are summarized in Figure 8. We observe that the trends are similar for both aggregation variants, Avg and GRU, so we discuss them based only on the Average aggregator.

In general, the higher the number of random walks k and the length of a single random walk L, the better the results. Even higher values of these parameters may help, but they significantly increase the random-walk computation time and the model training time.

Unsurprisingly, the embedding size (embedding dimension) follows the same trend: with more dimensions, we can fit more information into the created representations. However, since the goal of an embedding is a low-dimensional vector representation, the dimensionality should be kept reasonable. Our chosen values (16, 32, 64) seem plausible when working with 260-dimensional edge features.

As for the loss mixing parameter λ, we observe that too-high values negatively influence the model performance. The greater the value, the more important the structural loss becomes, while the feature loss simultaneously becomes less relevant. Choosing λ = 0 causes the loss function to consider feature reconstruction only and to completely ignore the embedding loss.
This yields significantly worse results and confirms that our approach of combining feature reconstruction with a structural embedding loss is justified. In general, the best results are achieved when both loss factors have equal influence (λ = 0.5).
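To make the role of λ concrete, the sketch below mixes a cosine-based structural term with a feature-reconstruction term as L = λ · L_struct + (1 − λ) · L_feat, so λ = 0 keeps only the reconstruction term and λ = 1 only the structural one. This is a NumPy illustration with assumed tensor shapes and function names, not the exact training objective:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity along the last axis, with broadcasting.
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / np.maximum(den, 1e-12)

def mixed_loss(edge_emb, pos_emb, neg_emb, feat_recon, feat_true, lam=0.5):
    """lam * structural loss + (1 - lam) * feature reconstruction loss.

    edge_emb:   (B, d)    anchor edge embeddings
    pos_emb:    (B, P, d) embeddings of positive (similar) edges
    neg_emb:    (B, N, d) embeddings of negative (dissimilar) edges
    feat_recon: (B, f)    auto-encoder reconstruction of the edge attributes
    feat_true:  (B, f)    original edge attributes
    """
    anchor = edge_emb[:, None, :]
    pos = 1.0 - cosine_similarity(anchor, pos_emb)             # pull positives closer
    neg = np.maximum(cosine_similarity(anchor, neg_emb), 0.0)  # push negatives apart
    structural = pos.mean() + neg.mean()
    feature = ((feat_recon - feat_true) ** 2).mean()
    return lam * structural + (1.0 - lam) * feature
```

With λ = 0 the returned value reduces to the mean-squared reconstruction error, matching the behavior described above.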
6. Ablation study
We performed an ablation study to check whether our method AttrE2vec is robust to noise introduced into an artificially generated network. We use a barbell graph, i.e., two complete graphs connected by a path: edges in the two complete graphs form classes 1 and 3, and edges on the path form class 2. We check how AttrE2vec, which includes both the structural and the feature-based loss, performs under different amounts of such noise.

We introduce noise by shuffling p% of all edge pairs that come from different classes, i.e., an edge of class 2 (originally located on the path) may be swapped with one from the complete graphs (classes 1 or 3). We use our AttrE2vec model with the Average aggregator in the transductive setting (due to the graph size) and report the edge classification AUC for different values of p and for λ ∈ {0, 0.5, 1}. These values of the mixing parameter λ allow us to check how the model behaves when working only with the feature-based loss (λ = 0), only with the structural loss (λ = 1), and with both losses at equal importance (λ = 0.5). We repeat the experiment for each (p, λ) pair, due to the randomness of the shuffling procedure, and report the mean and standard deviation of the AUC in Figure 9.

Figure 9: AttrE2vec performance for various noise levels p and mixing parameter values λ ∈ {0, 0.5, 1}.

Figure 10: 2-D representations of ideal and noisy graph edges using AttrE2vec with λ ∈ {0, 0.5, 1}.

Using only the feature loss, or a combination of both losses, allows us to achieve nearly 100% AUC in the classification task. The fluctuations appear due to the low number of training epochs and the local-optima problem. The performance of the model that uses only the structural loss (λ = 1) decreases with higher shuffling probabilities; from a certain point it starts improving slightly, because shuffling then amounts to a complete swap of two classes, i.e., all features and classes from one part of the graph are exchanged with all features and classes from another part.

We also demonstrate how our method reacts to noisy data for various λ ∈ {0, 0.5, 1}. There are two graphs: one where the features are aligned with the substructures of the graph, and a second with shuffled features (ca. 50%); see Figure 10. With λ = 0.5, AttrE2vec still produces well-separated edge groups on the noisy graph.
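The noise-injection step can be sketched as follows. The function name and the exact pairing policy are assumptions; the idea matches the description above: repeatedly pick a pair of edges with different classes and swap both their attribute vectors and their labels.

```python
import numpy as np

def shuffle_cross_class_edges(features, labels, p, rng=None):
    """Swap features and labels of roughly p * |E| edges, in cross-class pairs."""
    if rng is None:
        rng = np.random.default_rng()
    features = features.copy()
    labels = labels.copy()
    n_swaps = int(p * len(labels) / 2)  # each swap perturbs two edges
    for _ in range(n_swaps):
        i, j = rng.choice(len(labels), size=2, replace=False)
        if labels[i] != labels[j]:  # only edges from different classes are swapped
            features[[i, j]] = features[[j, i]]
            labels[[i, j]] = labels[[j, i]]
    return features, labels
```

Because swaps only exchange existing rows, the multiset of features and labels is preserved; only the assignment of features to graph positions is perturbed, which is exactly what makes the structural-only model (λ = 1) degrade.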
7. Conclusions and future work
We introduced AttrE2vec – a novel unsupervised and inductive embedding model that learns attributed edge representations by leveraging a self-attention network with an auto-encoder over the attribute space and a structural loss on aggregated random walks. AttrE2vec can directly aggregate feature information from edges and nodes many hops away to infer embeddings not only for edges present during training but also for new ones. Extensive experimental results show that AttrE2vec obtains state-of-the-art results in edge classification and edge clustering on the Cora, Pubmed, and Citeseer datasets.

Acknowledgments
The work was partially supported by the National Science Centre, Poland, grants No. 2016/21/D/ST6/02948 and 2016/23/B/ST6/01735, as well as by the statutory funds of the Department of Computational Intelligence, Wrocław University of Science and Technology.
References

[1] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, J. Leskovec, R. Barzilay, P. Battaglia, Y. Bengio, M. Bronstein, S. Günnemann, W. Hamilton, T. Jaakkola, S. Jegelka, M. Nickel, C. Re, L. Song, J. Tang, M. Welling, R. Zemel, Open Graph Benchmark: Datasets for machine learning on graphs (May 2020). arXiv:2005.00687.
[2] D. Zhang, J. Yin, X. Zhu, C. Zhang, Network representation learning: A survey, IEEE Transactions on Big Data 6 (1) (2018) 3–28. doi:10.1109/tbdata.2018.2850013.
[3] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive survey on graph neural networks, IEEE Transactions on Neural Networks and Learning Systems (2019) 1–21. doi:10.1109/TNNLS.2020.2978386.
[4] B. Li, D. Pi, Network representation learning: a systematic literature review, Neural Computing and Applications 32 (21) (2020) 16647–16679. doi:10.1007/s00521-020-04908-5.
[5] I. Chami, S. Abu-El-Haija, B. Perozzi, C. Ré, K. Murphy, Machine learning on graphs: A model and comprehensive taxonomy (2020). arXiv:2005.03675.
[6] S. Bahrami, F. Dornaika, A. Bosaghzadeh, Joint auto-weighted graph fusion and scalable semi-supervised learning, Information Fusion 66 (2021) 213–228.
[7] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864. doi:10.1145/2939672.2939754.
[8] B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), ACM, New York, NY, USA, 2014, pp. 701–710. doi:10.1145/2623330.2623732.
[9] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: 5th International Conference on Learning Representations (ICLR 2017), 2017, pp. 1–14. arXiv:1609.02907.
[10] Y. Dong, N. V. Chawla, A. Swami, metapath2vec: Scalable representation learning for heterogeneous networks, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2017, pp. 135–144. doi:10.1145/3097983.3098036.
[11] S. Wang, V. V. Govindaraj, J. M. Górriz, X. Zhang, Y. Zhang, Covid-19 classification by FGCNet with deep feature fusion from graph convolutional network and convolutional neural network, Information Fusion 67 (2021) 208–229.
[12] A. García-Durán, M. Niepert, Learning graph representations with embedding propagation, in: Advances in Neural Information Processing Systems, 2017, pp. 5120–5131.
[13] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Advances in Neural Information Processing Systems, 2017, pp. 1025–1035.
[14] P. Veličković, A. Casanova, P. Liò, G. Cucurull, A. Romero, Y. Bengio, Graph attention networks, in: 6th International Conference on Learning Representations (ICLR 2018), 2018, pp. 1–12. arXiv:1710.10903.
[15] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1225–1234. doi:10.1145/2939672.2939753.
[16] C. Yang, Z. Liu, D. Zhao, M. Sun, E. Y. Chang, Network representation learning with rich text information, in: IJCAI International Joint Conference on Artificial Intelligence, 2015, pp. 2111–2117.
[17] M. Liu, J. Liu, Y. Chen, M. Wang, H. Chen, Q. Zheng, AHNG: Representation learning on attributed heterogeneous network, Information Fusion 50 (2019) 221–230.
[18] L. Lan, P. Wang, J. Zhao, J. Tao, J. Lui, X. Guan, Improving network embedding with partially available vertex and edge content, Information Sciences 512 (2020) 935–951. doi:10.1016/j.ins.2019.09.083.
[19] B. Li, D. Pi, Y. Lin, I. Khan, L. Cui, Multi-source information fusion based heterogeneous network embedding, Information Sciences 534 (2020) 53–71. doi:10.1016/j.ins.2020.05.012.
[20] C. Zhang, D. Song, C. Huang, A. Swami, N. V. Chawla, Heterogeneous graph neural network, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2019, pp. 793–803. doi:10.1145/3292500.3330961.
[21] H. Gao, H. Huang, Deep attributed network embedding, in: IJCAI International Joint Conference on Artificial Intelligence, 2018, pp. 3364–3370. doi:10.24963/ijcai.2018/467.
[22] S. Bandyopadhyay, A. Biswas, N. Murty, R. Narayanam, Beyond node embedding: A direct unsupervised edge representation framework for homogeneous networks (2019). arXiv:1912.05140.
[23] Y. Chen, T. Qian, Relation constrained attributed network embedding, Information Sciences 515 (2020) 341–351. doi:10.1016/j.ins.2019.12.033.
[24] S. Bandyopadhyay, H. Kara, A. Kannan, M. N. Murty, FSCNMF: Fusing structure and content via non-negative matrix factorization for embedding information networks (2018). arXiv:1804.05313.
[25] D. Nozza, E. Fersini, E. Messina, CAGE: Constrained deep attributed graph embedding, Information Sciences 518 (2020) 56–70. doi:10.1016/j.ins.2019.12.082.
[26] J. Kim, T. Kim, S. Kim, C. D. Yoo, Edge-labeling graph neural network for few-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11–20. arXiv:1905.01436, doi:10.1109/CVPR.2019.00010.
[27] Q. Li, Z. Cao, J. Zhong, Q. Li, Graph representation learning with encoding edges, Neurocomputing 361 (2019) 29–39. doi:10.1016/j.neucom.2019.07.076.
[28] L. Gong, Q. Cheng, Exploiting edge features for graph neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9203–9211. doi:10.1109/CVPR.2019.00943.
[29] C. Aggarwal, G. He, P. Zhao, Edge classification in networks, in: 2016 IEEE 32nd International Conference on Data Engineering (ICDE 2016), IEEE, 2016, pp. 1038–1049. doi:10.1109/ICDE.2016.7498311.
[30] M. Simonovsky, N. Komodakis, Dynamic edge-conditioned filters in convolutional neural networks on graphs, in: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017, pp. 29–38. doi:10.1109/CVPR.2017.11.
[31] T. D. Bui, S. Ravi, V. Ramavajjala, Neural graph learning: Training neural networks using graphs (2018) 64–71. doi:10.1145/3159652.3159731.
[32] Y. Wang, Y. Sun, M. M. Bronstein, J. M. Solomon, Z. Liu, S. E. Sarma, Dynamic Graph CNN for learning on point clouds, ACM Transactions on Graphics 38 (5) (2019) 146. doi:10.1145/3326362.
[33] T. Wanyan, C. Zhang, A. Azad, X. Liang, D. Li, Y. Ding, Attribute2vec: Deep network embedding through multi-filtering GCN (Apr 2020). arXiv:2004.01375.
[34] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, Q. Mei, LINE: Large-scale information network embedding, in: Proceedings of the 24th International Conference on World Wide Web (WWW 2015), 2015, pp. 1067–1077. doi:10.1145/2736277.2741093.
[35] L. F. Ribeiro, P. H. Saverese, D. R. Figueiredo, struc2vec: Learning node representations from structural identity, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 385–394. doi:10.1145/3097983.3098061.
[36] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 855–864.
[37] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling (Dec 2014). arXiv:1412.3555.
[38] R. Kuprieiev, D. Petrov, R. Valles, P. Redzyński, C. da Costa-Luis, A. Schepanovski, I. Shcheklein, S. Pachhai, J. Orpinel, F. Santos, A. Sharma, Zhanibek, D. Hodovic, P. Rowlands, Earl, A. Grigorev, N. Dash, G. Vyshnya, maykulkarni, Vera, M. Hora, xliiv, W. Baranowski, S. Mangal, C. Wolff, nik123, O. Yoktan, K. Benoy, A. Khamutov, A. Maslakov, DVC: Data Version Control – git for data & models (May 2020). doi:10.5281/zenodo.3859749.
[39] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (3) (2008) 93. doi:10.1609/aimag.v29i3.2157.
[40] G. Namata, B. London, L. Getoor, B. Huang, Query-driven active surveying for collective classification, in: Proceedings of the Workshop on Mining and Learning with Graphs, Edinburgh, Scotland, UK, 2012, pp. 1–8.
[41] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: 31st International Conference on Machine Learning (ICML 2014), 2014, pp. 2931–2939. arXiv:1405.4053.
[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[43] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), ACM, New York, NY, USA, 2016, pp. 1225–1234. doi:10.1145/2939672.2939753.
[44] D. Q. Nguyen, T. D. Nguyen, D. Phung, A self-attention network based node embedding model (Jun 2020). arXiv:2006.12100.
[45] I. Loshchilov, F. Hutter, Decoupled weight decay regularization (Nov 2017). arXiv:1711.05101.
[46] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: Proceedings of the 33rd International Conference on Machine Learning, Vol. 48 of Proceedings of Machine Learning Research, PMLR, 2016, pp. 478–487.
[47] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.