SLAPS: Self-Supervision Improves Structure Learning for Graph Neural Networks
Bahare Fatemi*    Layla El Asri    Seyed Mehran Kazemi

* This work was done during an internship at Borealis AI. University of British Columbia; Borealis AI. Correspondence to: Bahare Fatemi <[email protected]>. Pre-print. Copyright 2021 by the author(s).

Abstract
Graph neural networks (GNNs) work well when the graph structure is provided. However, this structure may not always be available in real-world applications. One solution to this problem is to infer a task-specific latent structure and then apply a GNN to the inferred graph. Unfortunately, the space of possible graph structures grows super-exponentially with the number of nodes, and so the task-specific supervision may be insufficient for learning both the structure and the GNN parameters. In this work, we propose the Simultaneous Learning of Adjacency and GNN Parameters with Self-supervision, or SLAPS, a method that provides more supervision for inferring a graph structure through self-supervision. A comprehensive experimental study demonstrates that SLAPS scales to large graphs with hundreds of thousands of nodes and outperforms several models that have been proposed to learn a task-specific graph structure on established benchmarks.
1. Introduction
Graph representation learning has grown rapidly and found applications in domains where data points define a graph (Chami et al., 2020; Kazemi et al., 2020). Graph neural networks (GNNs) (Scarselli et al., 2008) have been a key component of the success of the research in this area. Following the success of graph convolutional networks (GCNs) (Kipf & Welling, 2017) on semi-supervised node classification, several other GNN variants have been proposed for different prediction tasks on graphs (Hamilton et al., 2017; Veličković et al., 2018; Gilmer et al., 2017; Battaglia et al., 2018), and the power of these models has been studied theoretically (Xu et al., 2019; Sato, 2020).

The performance of GNNs highly depends on the quality of the input graph structure and deteriorates when the graph structure is noisy (see Zügner et al., 2018; Dai et al., 2018; Fox & Rajamanickam, 2019). The need for a clean graph structure impedes the applicability of GNNs to domains where one has access to a set of nodes and their features but not to an underlying graph structure, or only has access to a noisy structure. Examples of such domains include brain signal classification (Jang et al., 2019), computer-aided diagnosis (Cosmo et al., 2020), analysis of computer programs (Johnson et al., 2020), and particle reconstruction (Qasim et al., 2019).

In this paper, we develop a model that learns the GNN parameters and an adjacency matrix simultaneously. Our goal is to learn a structure that maximizes the GNN performance on the downstream task. This is different from works that aim at discovering node relations or dependencies, e.g., in probabilistic graphical models. Since the number of possible graph structures grows super-exponentially with the number of nodes (Stanley, 1973) and obtaining node labels is typically costly, the number of available labels may not be enough for learning both the GNN parameters and an adjacency matrix, especially for semi-supervised node classification. Our main contribution is to supplement the classification task with a well-motivated self-supervised task that helps learn a high-quality adjacency matrix. The self-supervised task is generic and can be combined with several existing approaches. It works by masking some input features (or adding noise to them) and training a separate GNN aiming at updating the adjacency matrix in such a way that it can recover the masked (or noisy) features.

We experiment with several datasets. For datasets with a graph structure, we only feed the node features to our model. The model operates on the node features and an adjacency that is learned simultaneously from data. We compare our model with different classes of methods: some which do not use the graph structure for predicting labels, some which use a fixed k-Nearest Neighbors (kNN) graph built based on a chosen similarity metric, and some which initialize the graph with kNN but then revise it throughout training. We show that our model consistently outperforms these methods. We also show that the self-supervised task is key to the high performance of our model. As an additional contribution, we provide an implementation for simultaneous structure and parameter learning that scales to graphs with hundreds of thousands of nodes.
2. Related Work
Existing methods that relate to this work can be grouped into the following categories.
Similarity Graph:
One approach for inferring a graph structure is to select a similarity metric and set the edge weight between two nodes to be their similarity (Roweis & Saul, 2000; Tenenbaum et al., 2000; Belkin et al., 2006). To obtain a sparse structure, one may create a kNN similarity graph, only connect pairs of nodes whose similarity surpasses some predefined threshold, or do sampling. As an example, Gidaris & Komodakis (2019) create a (fixed) kNN graph using the cosine similarity of the node features. Wang et al. (2019b) extend this idea by creating a fresh graph in each layer of the GNN based on the node embedding similarities in that layer, as opposed to fixing a graph solely based on the initial features. Instead of choosing a single similarity metric, Halcrow et al. (2020) fuse several (potentially weak) measures of similarity. The quality of the predictions of these methods depends heavily on the choice of the similarity metric(s). Moreover, designing an appropriate similarity metric may not always be straightforward.
Fully-connected Graph:
Another approach is to start with a fully-connected graph and assign edge weights using the available meta-data, or to employ GNN variants such as graph attention networks (Veličković et al., 2018; Zhang et al., 2018) which provide weights for each edge via an attention mechanism. This approach has been used in computer vision (e.g., Suhail & Sigal, 2019), natural language processing (e.g., Zhu et al., 2019), and few-shot learning (e.g., Garcia & Bruna, 2017). The complexity of this approach grows rapidly with the number of nodes, making it applicable only to small graphs. Zhang et al. (2020) propose to define local neighborhoods for each node and only assume that these local neighborhoods are fully connected. Their approach, however, relies on an initial graph structure to define the local neighborhoods.
Learnable Graph:
Instead of a similarity graph based on the initial features, one may use a graph generator with learnable parameters. Li et al. (2018b) create a fully-connected graph based on a bilinear similarity function with learnable parameters. Franceschi et al. (2019) sample graph structures from a learnable fully-connected structure and employ a bi-level optimization setup for simultaneously learning the GNN parameters and the structure. Yang et al. (2019) update the input adjacency matrix based on the inductive bias that nodes belonging to the same class should be connected to each other and nodes belonging to different classes should be disconnected. Chen et al. (2020) propose an iterative approach that alternates between projecting the nodes to a latent space and constructing an adjacency matrix from the latent representations. A common approach in this category is to learn a projection of the nodes to a latent space where node similarities correspond to edge weights. Wu et al. (2018) project the nodes to a latent space by learning weights for each of the input features. Cosmo et al. (2020) and Qasim et al. (2019) use a multi-layer perceptron for the projection. Yu et al. (2020) use a GNN for the projection, which uses the initial node features as well as an initial graph structure, aiming at providing a revised graph structure to the task-specific GNN. In our experiments, we compare with several approaches from this category.
Leveraging Domain Knowledge:
In some applications, one may leverage domain knowledge to guide the model toward learning specific structures. For example, Johnson et al. (2020) leverage abstract syntax trees and regular languages in learning graph structures of Python programs that aid reasoning for downstream tasks. Jin et al. (2020b) train GNNs that are robust to adversarial attacks by learning a cleaned version of the input structure. They use the domain knowledge that clean adjacency matrices are often sparse and low-rank and exhibit feature smoothness along connected nodes. In our paper, we experiment with general-purpose datasets without access to domain knowledge.
Proposed Method:
Our model falls within the learnable graph category. We supplement the training with a self-supervised objective to increase the amount of supervision available for learning a structure. Our self-supervised task is inspired by, and similar to, pre-training strategies for GNNs (Hu et al., 2020b;c; Jin et al., 2020a; You et al., 2020; Zhu et al., 2020) (specifically, we adopt the multi-task learning framework of You et al. (2020)), but it differs from this line of work in that we use self-supervision for learning a graph structure, whereas the above methods use it to learn better (and, in some cases, transferable) GNN parameters.
3. Background and Notation
We use lowercase letters to denote scalars, bold lowercase letters to denote vectors, and bold uppercase letters to denote matrices. I represents an identity matrix. For a vector v, we represent its i-th element as v_i. For a matrix M, we represent its i-th row as M_i and the element at the i-th row and j-th column as M_ij. For an attributed graph, we use n, m, and f to represent the number of nodes, edges, and features respectively, and denote the graph as G = {V, A, X}, where V = {v_1, ..., v_n} is a set of nodes, A ∈ R^{n×n} is an adjacency matrix with A_ij indicating the weight of the edge from v_i to v_j (A_ij = 0 implies no edge), and X ∈ R^{n×f} is a matrix whose rows correspond to node features or attributes.
Graph convolutional networks (GCNs) are a powerful variant of GNNs. For a graph G = {V, A, X} with a degree matrix D, layer l of the GCN architecture can be defined as H^{(l)} = σ(Â H^{(l−1)} W^{(l)}), where Â represents a normalized adjacency matrix, H^{(l−1)} ∈ R^{n×d_{l−1}} represents the node representations in layer l−1 with H^{(0)} = X, W^{(l)} ∈ R^{d_{l−1}×d_l} is a weight matrix, σ is an activation function such as ReLU (Nair & Hinton, 2010), and H^{(l)} ∈ R^{n×d_l} contains the updated node embeddings. For undirected graphs where the adjacency is symmetric, Â = D^{−1/2}(A + I)D^{−1/2} corresponds to a row-and-column normalized adjacency with self-loops, and for directed graphs where the adjacency is not necessarily symmetric, Â = D^{−1}(A + I) corresponds to a row-normalized adjacency matrix with self-loops. Here, D is a (diagonal) degree matrix for (A + I) defined as D_ii = 1 + Σ_j A_ij.
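To make the layer definition concrete, the following is a minimal dense-tensor sketch of one GCN layer in PyTorch. This is illustrative code under our own assumptions, not the authors' implementation:

```python
import torch

def gcn_layer(A, H, W):
    """One GCN layer: H_out = ReLU(A_hat @ H @ W), where A_hat is the
    symmetrically normalized adjacency with self-loops (undirected case)."""
    A_tilde = A + torch.eye(A.shape[0], device=A.device)  # A + I (self-loops)
    deg = A_tilde.sum(dim=1)                              # D_ii = 1 + sum_j A_ij
    d_inv_sqrt = deg.pow(-0.5)
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2
    return torch.relu(A_hat @ H @ W)
```

A two-layer GCN classifier then amounts to applying such a layer twice, with the final activation replaced by a softmax over class logits.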
4. Proposed Method: SLAPS
SLAPS consists of four components: 1) a generator, 2) an adjacency processor, 3) a classifier, and 4) self-supervision. Figure 1 illustrates these components. We describe each component in more detail below and motivate the need for self-supervision.
4.1. Generator

The generator is a function G : R^{n×f} → R^{n×n} with parameters θ_G which takes the node features X ∈ R^{n×f} as input and produces a (perhaps sparse, non-normalized, and non-symmetric) matrix Ã ∈ R^{n×n} as output. We consider the following two generators and leave experimenting with more sophisticated graph generators (e.g., You et al., 2018; Liu et al., 2019) and models with tractable adjacency computations (e.g., Choromanski et al., 2020) as future work.

Full Parameterization (FP):
For this generator, θ_G ∈ R^{n×n} and the generator function is defined as Ã = G_FP(X; θ_G) = θ_G. That is, the generator ignores the input node features and directly optimizes the adjacency matrix. This generator is similar to the one proposed by Franceschi et al. (2019), except that they treat each element of Ã as the parameter of a Bernoulli distribution and sample graph structures from these Bernoulli distributions. The advantages of this generator include its simplicity and flexibility for learning any adjacency matrix. Its disadvantages include adding n² parameters to the model, which limits scalability and makes the model susceptible to overfitting.

MLP-kNN:
Here, θ_G corresponds to the weights of a multi-layer perceptron (MLP) and Ã = G_MLP(X; θ_G) = kNN(MLP(X)), where MLP : R^{n×f} → R^{n×f'} is an MLP that produces a matrix X' with updated node representations, and kNN : R^{n×f'} → R^{n×n} produces a sparse matrix. Let M ∈ R^{n×n} with M_ij = 1 if v_j is among the top k similar nodes to v_i and 0 otherwise, and let S ∈ R^{n×n} with S_ij = Sim(X'_i, X'_j) for some differentiable similarity function Sim (we used cosine). Then Ã = kNN(X') = M ⊙ S, where ⊙ represents the Hadamard (element-wise) product. With this formulation, in the forward phase of the network, one can first compute the matrix M using an off-the-shelf k-nearest neighbors algorithm and then compute the similarities in S only for pairs of nodes where M_ij = 1. In our experiments, we compute exact k-nearest neighbors; one can approximate it using locality-sensitive hashing approaches for larger graphs (see, e.g., Halcrow et al., 2020; Kitaev et al., 2020). In the backward phase of our model, we compute the gradients only with respect to those elements in S whose corresponding value in M is 1 (i.e. those elements S_ij such that M_ij = 1); the gradient with respect to the other elements is 0. Since S is computed based on X', the gradients flow to the elements in X' (and consequently to the weights of the MLP) through S.
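A dense sketch of this generator is given below. It computes all pairwise cosine similarities for clarity, whereas the formulation above only needs S at the positions where M_ij = 1; the mask M is built as a constant tensor, which reproduces the stated gradient behavior. The identity initialization anticipates the smart initialization described next, and assumes non-negative input features (as in bag-of-words data). The class and helper names are our own:

```python
import torch
import torch.nn.functional as F

def knn_graph(X_prime, k):
    """A~ = M ⊙ S: keep, for each node, the cosine similarities of its
    k most similar nodes; gradients flow only through the kept entries."""
    Z = F.normalize(X_prime, dim=1)
    S = Z @ Z.t()                                   # pairwise cosine similarities
    topk = S.topk(k, dim=1).indices
    M = torch.zeros_like(S).scatter_(1, topk, 1.0)  # constant 0/1 mask
    return M * S                                    # Hadamard product

class MLPkNNGenerator(torch.nn.Module):
    def __init__(self, f, k):
        super().__init__()
        # identity ("smart") initialization: the MLP initially outputs its
        # input, so the generated graph starts out equal to A_kNN
        self.W1 = torch.nn.Parameter(torch.eye(f))
        self.W2 = torch.nn.Parameter(torch.eye(f))
        self.k = k

    def forward(self, X):
        return knn_graph(torch.relu(X @ self.W1) @ self.W2, self.k)
```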
Smart Initialization: In our experiments, we found the initialization of the generator parameters (i.e. θ_G) to be important. Let A^{kNN} represent an adjacency matrix created by applying a kNN function on the initial node features. One smart initialization for θ_G is to initialize it in a way that the generator generates A^{kNN} before training starts (i.e. Ã = A^{kNN} before training starts). Such an initialization can be trivially done for the FP generator by initializing θ_G to A^{kNN}. For MLP-kNN, we consider two variants. In one, hereafter referred to simply as MLP, we keep the input dimension the same throughout the layers. In the other, hereafter referred to as MLP-D, we consider MLPs with diagonal weight matrices (i.e., except the main diagonal, all other parameters in the weight matrices are zero). For both variants, we initialize the weight matrices in θ_G with the identity matrix to ensure that the output of the MLP is initially the same as its input, and so the kNN graph created on these outputs is equivalent to A^{kNN}. MLP-D can be thought of as assigning different weights to different features and then computing node similarities. Note that, alternatively, one may use other MLP variants but pre-train the weights to output A^{kNN} before the main training starts.

Figure 1. Overview of SLAPS. At the top, a generator receives the node features and produces a non-symmetric, non-normalized adjacency having (possibly) both positive and negative values (Section 4.1). The adjacency processor makes the values positive, and symmetrizes and normalizes the adjacency (Section 4.2). The resulting adjacency and the node features go into GNN_C, which predicts the node classes (Section 4.3). At the bottom, some noise is added to the node features. The resulting noisy features and the generated adjacency go into GNN_DAE, which then denoises the features (Section 4.5).

4.2. Adjacency Processor

The output Ã of the generator may have both positive and negative values, and may be non-symmetric and non-normalized. To ensure all values are positive and to make the adjacency symmetric and normalized, we let:

A = D^{−1/2} ( (P(Ã) + P(Ã)^T) / 2 ) D^{−1/2}    (1)

Here P is a function with a non-negative range applied element-wise to its input. In our experiments, when using an MLP generator, we let P be the ReLU function applied element-wise on Ã. When using the fully-parameterized (FP) generator, applying ReLU results in a gradient-flow problem, as any edge whose corresponding value in Ã becomes less than or equal to zero stops receiving gradient updates. For this reason, for FP we apply the ELU (Clevert et al., 2015) function to the elements of Ã and then add a value of 1. The sub-expression (P(Ã) + P(Ã)^T)/2 makes the resulting matrix symmetric. To understand the reason for taking the mean of P(Ã) and P(Ã)^T, assume Ã is generated by G_MLP. If v_j is among the k most similar nodes to v_i and vice versa, then the strength of the connection between v_i and v_j will remain the same. However, if, say, v_j is among the k most similar nodes to v_i but v_i is not among the top k for v_j, then taking the average of the similarities reduces the strength of the connection between v_i and v_j. Finally, once we have a symmetric adjacency with non-negative values, we compute the degree matrix D for (P(Ã) + P(Ã)^T)/2 and normalize the matrix by multiplying it on the left and right by D^{−1/2}.
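A direct rendering of the adjacency processor of Equation 1, again a dense sketch under the same assumptions as the earlier snippets:

```python
import torch
import torch.nn.functional as F

def process_adjacency(A_tilde, generator="mlp"):
    """Equation (1): A = D^-1/2 ((P(A~) + P(A~)^T) / 2) D^-1/2."""
    if generator == "fp":
        P = F.elu(A_tilde) + 1.0     # ELU + 1: keeps gradients flowing for FP
    else:
        P = F.relu(A_tilde)          # ReLU suffices for the MLP generators
    A_sym = (P + P.t()) / 2.0        # symmetrize by taking the mean
    d_inv_sqrt = A_sym.sum(dim=1).clamp(min=1e-12).pow(-0.5)
    return d_inv_sqrt[:, None] * A_sym * d_inv_sqrt[None, :]
```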
4.3. Classifier

The classifier is a function GNN_C : R^{n×f} × R^{n×n} → R^{n×|C|} with parameters θ_{GNN_C}. It takes the node features X and the generated adjacency A as input and provides for each node the logits for each class, where C corresponds to the set of classes and |C| to the number of classes. We use a two-layer GCN for which θ_{GNN_C} = {W^{(1)}, W^{(2)}} and define our classifier as GNN_C(A, X; θ_{GNN_C}) = A ReLU(A X W^{(1)}) W^{(2)}, but other GNN variants can be used as well (recall that A is normalized). The training loss L_C for the classification task is computed by taking the softmax of the logits to produce a probability distribution for each node and then computing the cross-entropy loss.

4.4. Supervision Starvation

One may create a model using only the three components described so far, corresponding to the top part of Figure 1. As we explain here, however, such a model may suffer severely from supervision starvation. The same problem also applies to many existing approaches to the problem studied in this work, as they can be formulated as combinations of variants of these three components.

Consider a scenario in training the model described above where two unlabeled nodes v_i and v_j are not directly connected to any labeled nodes according to the generated structure. Then, since a two-layer GCN makes predictions for the nodes based on their two-hop neighbors, the classification loss (i.e. L_C) is not affected by the edge between v_i and v_j, and this edge receives no supervision. (While this problem may be alleviated to some extent by increasing the number of layers of the GCN, deeper GCNs typically provide inferior results due to issues such as oversmoothing; see, e.g., Li et al., 2018a; Oono & Suzuki, 2020.) Figure 2 provides an example of such a scenario. We call the edges that do not affect the loss function L_C (and consequently do not receive supervision) no-supervision edges. These edges are clearly problematic: although they may not affect the training loss, the predictions at test time depend on them, and if their values are learned without enough supervision, the model may make poor predictions at test time. A natural question regarding the extent of the problem caused by such edges is what proportion of edges are no-supervision edges. The following theorem formally establishes the extent of the problem for Erdős-Rényi graphs (Erdős & Rényi, 1959). An Erdős-Rényi graph with n nodes and m edges is a graph chosen uniformly at random from the collection of all graphs with n nodes and m edges.
Figure 2. Using a two-layer GCN, the predictions made for the labeled nodes are not affected by the dashed edge.
Theorem 1. Let G(n, m) be an Erdős-Rényi graph with n nodes and m edges. Assume we have labels for q nodes selected uniformly at random. The probability of an edge being a no-supervision edge with a two-layer GCN is equal to (1 − q/n)(1 − q/(n−1)) ∏_{i=1}^{2q} (1 − (m−1)/(binom(n,2) − i)).

We defer the proof to Appendix A. To put the numbers from the theorem in perspective, let us consider three established benchmarks for semi-supervised node classification, namely Cora, Citeseer, and Pubmed (the statistics for these datasets can be found in Appendix B). For an Erdős-Rényi graph with similar statistics as the Cora dataset (n = 2708, m = 5429, q = 140), the probability of an edge being a no-supervision edge is approximately 0.59 according to the above theorem. For Citeseer and Pubmed, this number is approximately 0.76 and 0.97 respectively.

While Theorem 1 is stated for Erdős-Rényi graphs where the labeled nodes have been selected uniformly at random, in real-world applications the problem may be even more severe as, e.g., the labels may not be distributed evenly in different parts of the graph.
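For concreteness, the Cora number above can be reproduced directly from the theorem. With binom(2708, 2) = 3,665,278, and approximating the 280-term product by a single power since the factors are nearly equal:

```latex
\left(1-\tfrac{140}{2708}\right)\left(1-\tfrac{140}{2707}\right)
\prod_{i=1}^{280}\left(1-\tfrac{5428}{\binom{2708}{2}-i}\right)
\;\approx\; 0.899 \times \left(1-\tfrac{5428}{3{,}665{,}278}\right)^{280}
\;\approx\; 0.899 \times 0.661 \;\approx\; 0.59.
```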
4.5. Self-Supervision

To increase the amount of supervision for learning the structure and remedy the problem pointed out in Section 4.4, we propose a self-supervised approach based on denoising autoencoders (Vincent et al., 2008). Let GNN_DAE : R^{n×f} × R^{n×n} → R^{n×f} be a GNN with parameters θ_{GNN_DAE} that takes node features and a normalized adjacency produced by a generator as input and provides updated node features with the same dimension as output. We train GNN_DAE such that it receives a noisy version X̃ of the features X as input and produces the denoised features X as output. Let idx represent the indices corresponding to the elements of X to which we have added noise, and let X_idx represent the values at these indices. During training, we minimize:

L_DAE = L(X_idx, GNN_DAE(X̃, A; θ_{GNN_DAE})_idx)    (2)

where A is the generated adjacency matrix and L is a loss function. For datasets where the features consist of binary vectors, idx consists of r percent of the indices of X whose values are 1 and rη percent of the indices whose values are 0, both selected uniformly at random in each epoch. Both r and η (corresponding to the negative ratio) are hyperparameters. In this case, we add noise by setting the 1s in the selected mask to 0s, and L is the binary cross-entropy loss. For datasets where the input features are continuous numbers, idx consists of r percent of the indices of X selected uniformly at random in each epoch. We add noise either by replacing the values at idx with 0 or by adding independent Gaussian noise to each of the features. In this case, L is the mean-squared error loss.

This self-supervision uses the intuition that the node features are correlated with the node labels, and helps by incorporating the inductive bias that a graph structure that is appropriate for predicting the node features is also appropriate for predicting the node labels. Although some edges may not receive supervision from the main task (i.e. from GNN_C; see Section 4.4), the supervision provided by this task (i.e. from GNN_DAE) helps learn an appropriate weight for them.
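The following sketch illustrates the masking scheme and the loss of Equation 2 for binary features. Here gnn_dae stands for any GNN with the stated signature, and the sampling proportions follow our reading of the description above (r as a percentage, η as the negative ratio); treat the details as an assumption rather than the reference implementation:

```python
import torch
import torch.nn.functional as F

def masked_dae_loss(gnn_dae, X, A, r=10.0, eta=5.0):
    """L_DAE for binary features: mask r% of the 1-entries (set them to 0),
    include r*eta% of the 0-entries as negatives, and reconstruct."""
    ones = (X > 0.5).nonzero(as_tuple=False)
    zeros = (X <= 0.5).nonzero(as_tuple=False)
    pos = ones[torch.randperm(len(ones))[: int(r / 100 * len(ones))]]
    neg = zeros[torch.randperm(len(zeros))[: int(r * eta / 100 * len(zeros))]]
    idx = torch.cat([pos, neg], dim=0)

    X_noisy = X.clone()
    X_noisy[pos[:, 0], pos[:, 1]] = 0.0          # noise: selected 1s become 0s

    X_rec = gnn_dae(X_noisy, A)                  # denoised feature logits
    return F.binary_cross_entropy_with_logits(
        X_rec[idx[:, 0], idx[:, 1]], X[idx[:, 0], idx[:, 1]])
```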
4.6. Training

Our final model is trained to minimize L = L_C + λ L_DAE, where L_C is the classification loss, L_DAE is the denoising autoencoder loss (see Equation 2), and λ is a hyperparameter controlling the relative importance of the two losses.

To verify the merit of GNN_DAE for learning an adjacency matrix in isolation, we also consider a variant of SLAPS named SLAPS2s that is trained in two stages. We first train the GNN_DAE model by minimizing the loss L_DAE described in Equation 2. Recall that L_DAE depends on the parameters θ_G of the generator and the parameters θ_{GNN_DAE} of the denoising autoencoder. After every t epochs of training, we fix the adjacency matrix, train a classifier with the fixed adjacency matrix, and measure classification accuracy on the validation set. We select the epoch that produces the adjacency providing the best validation accuracy for the classifier. Note that in SLAPS2s, the adjacency matrix is trained only based on GNN_DAE.
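Putting the pieces together, a bare-bones joint training loop might look as follows. It reuses the hypothetical process_adjacency and masked_dae_loss sketches from above, and gnn_c stands for any classifier with the signature of Section 4.3; none of this is the authors' released code:

```python
import torch
import torch.nn.functional as F

def train_slaps(generator, gnn_c, gnn_dae, X, y, train_mask,
                lam=10.0, lr=0.01, epochs=200):
    """Joint objective: L = L_C + lambda * L_DAE."""
    params = (list(generator.parameters()) + list(gnn_c.parameters())
              + list(gnn_dae.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        A = process_adjacency(generator(X))          # Sections 4.1 and 4.2
        loss_c = F.cross_entropy(gnn_c(X, A)[train_mask], y[train_mask])
        loss_dae = masked_dae_loss(gnn_dae, X, A)    # Equation (2)
        (loss_c + lam * loss_dae).backward()
        opt.step()
```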
5. Experiments
Baselines:
We compare our proposal to several baselines with different properties. The first baseline is a multi-layer perceptron (MLP), which does not take the graph structure into account. We also compare against MLP-GAM* (Stretcu et al., 2019), which learns a fully-connected graph structure and uses this structure to supplement the loss function of the MLP toward predicting similar labels for neighboring nodes. Our third baseline is label propagation (LP) (Zhu & Ghahramani, 2002), a well-known model for semi-supervised learning. Similar to Franceschi et al. (2019), we also consider a baseline named kNN-GCN where we create a kNN graph based on the node features and feed this graph to a GCN; the graph structure remains fixed in this approach. We also compare with baselines that learn the graph structure from data, including LDS (Franceschi et al., 2019), GRCN (Yu et al., 2020), DGCNN (Wang et al., 2019b), and IDGL (Chen et al., 2020). We feed a kNN graph to the models requiring an initial graph structure.
Table 1. Results of SLAPS and the baselines on established node classification benchmarks. † indicates results taken from Franceschi et al. (2019); ‡ indicates results taken from Stretcu et al. (2019). Bold and underlined values indicate best and second-best mean performances respectively. OOM indicates out of memory.

Model     Generator | Cora     | Citeseer | Cora390  | Citeseer370 | Pubmed   | ogbn-arxiv
MLP       −         | . ± . †  | . ± . †  | . ± .    | . ± .       | . ± .    | . ± .
MLP-GAM*  −         | . ‡      | . ‡      | −        | −           | . ‡      | −
LP        −         | . ± .    | . ± .    | . ± .    | . ± .       | . ± .    | OOM
kNN-GCN   −         | . ± . †  | . ± . †  | . ± .    | . ± .       | . ± .    | . ± .
LDS       −         | −        | −        | . ± . †  | . ± . †     | OOM      | OOM
GRCN      −         | . ± .    | . ± .    | . ± .    | . ± .       | . ± .    | OOM
DGCNN     −         | . ± .    | . ± .    | . ± .    | . ± .       | . ± .    | OOM
IDGL      −         | . ± .    | . ± .    | . ± .    | . ± .       | . ± .    | OOM
SLAPS     FP        | . ± .    | . ± .    | . ± .    | . ± .       | OOM      | OOM
SLAPS     MLP       | . ± .    | . ± .    | . ± .    | . ± .       | . ± .    | . ± .
SLAPS     MLP-D     | . ± .    | . ± .    | . ± .    | . ± .       | . ± .    | . ± .
Datasets: We use three established benchmarks in the GNN literature, namely Cora, Citeseer, and Pubmed (Sen et al., 2008), as well as a newly released dataset named ogbn-arxiv (Hu et al., 2020a) that is orders of magnitude larger than the other three and is more challenging due to the more realistic split of the data into train, validation, and test sets. For all the datasets, we only feed the node features to the models and not the graph structure. Following Franceschi et al. (2019), we also experiment with several classification (non-graph) datasets available in scikit-learn (Pedregosa et al., 2011), including Wine, Cancer, Digits, and 20news. Dataset statistics can be found in Appendix B. For Cora and Citeseer, the LDS model uses the train data for learning the parameters of its classification GCN, half of the validation set for learning the parameters of the adjacency matrix (in their bi-level optimization setup, these are considered hyperparameters), and the other half of the validation set for early stopping and tuning the other hyperparameters. Besides experimenting with the original setups of these two datasets, we also consider a setup that is closer to that of LDS: we use the train set and half of the validation set for training, and the other half of the validation set for early stopping and hyperparameter tuning. We name the modified versions Cora390 and Citeseer370 respectively, where the number following the dataset name corresponds to the number of labels used for training. We also follow a similar procedure for the scikit-learn datasets.
Implementation:
We defer the implementation details and the best hyperparameter settings for our model on all the datasets to Appendix C. Code and data are available at https://github.com/BorealisAI/SLAPS-GNN.
The results of SLAPS and the baselines on the node classification benchmarks are in Table 1. Considering the baselines first, we see that learning a fully-connected graph in MLP-GAM* makes it outperform MLP. kNN-GCN significantly outperforms MLP on Cora and Citeseer but underperforms on Pubmed and ogbn-arxiv. This shows the importance of the similarity metric and the graph structure that is fed into the GCN; a low-quality structure can harm model performance. LDS outperforms MLP, but the fully parameterized adjacency matrix of LDS results in memory issues for Pubmed and ogbn-arxiv. As for GRCN, it was shown in the original paper that GRCN can revise a good initial adjacency matrix and provide a substantial boost in performance. However, as evidenced by the results, if the initial graph structure is somewhat poor, GRCN's performance becomes on par with kNN-GCN. IDGL is the best-performing baseline.

SLAPS consistently outperforms the baselines on all datasets, in some cases by large margins. Among the generators, the winner is dataset-dependent, with MLP-D mostly outperforming MLP on datasets with many features and MLP outperforming on datasets with small numbers of features. Using the software that was publicly released by the authors, the baselines that learn a graph structure fail on ogbn-arxiv; our implementation, on the other hand, scales to such large graphs.

Table 2 reports the results for the scikit-learn datasets and compares with LDS and IDGL. On three out of four datasets, SLAPS outperforms the other two baselines. Among the datasets on which we can train SLAPS with the FP generator, 20news has the largest number of nodes (9,607 nodes). On this dataset, we observed that the FP generator suffers from overfitting and produces weaker results compared to the other generators due to its large number of parameters.
Table 2. Results on classification datasets. † indicates results taken from Franceschi et al. (2019). Bold and underlined values indicate best and second-best mean performances respectively.

Model  Generator | Wine     | Cancer   | Digits   | 20news
LDS    −         | . ± . †  | . ± . †  | . ± . †  | . ± . †
IDGL   −         | . ± .    | . ± .    | . ± .    | . ± .
SLAPS  FP        | . ± .    | . ± .    | . ± .    | . ± .
SLAPS  MLP       | . ± .    | . ± .    | . ± .    | . ± .
SLAPS  MLP-D     | . ± .    | . ± .    | . ± .    | . ± .

SLAPS2s: To provide more insight into the value provided by the self-supervision task on the learned adjacency, we conduct experiments with SLAPS2s. Recall from Section 4.6 that in SLAPS2s, the adjacency is learned only based on the self-supervision task, and the node labels are only used for early stopping, hyperparameter tuning, and training GNN_C. Figure 3(a) shows the performance of SLAPS and SLAPS2s on Cora and compares them with kNN-GCN. Although SLAPS2s does not use the node labels in learning an adjacency matrix, it outperforms kNN-GCN (with a notable improvement when using an FP generator). With an FP generator, SLAPS2s even achieves competitive performance with SLAPS; this is mainly because FP does not leverage the supervision provided by GNN_C toward learning generalizable patterns that can be used for nodes other than those in the training set. These results corroborate the effectiveness of the self-supervision task for learning an adjacency matrix. Besides, the results show that learning the adjacency using both self-supervision and the task-specific node labels results in higher predictive accuracy.

The value of λ: Figure 3(b) shows the performance of SLAPS on Cora and Citeseer with different values of λ. When λ = 0, corresponding to removing self-supervision, the model performance is somewhat poor. As soon as λ becomes positive, both models see a large boost in performance, showing that self-supervision is crucial to the high performance of SLAPS. Increasing λ further provides larger boosts until it becomes so large that the self-supervision loss dominates the classification loss and the performance deteriorates. Note that with λ = 0, SLAPS with the MLP generator becomes a variant of the model proposed by Cosmo et al. (2020), but with a different similarity function.

The effect of the training set size:
According to Theorem 1, a smaller q (corresponding to the training set size) results in more no-supervision edges in each epoch. To explore the effect of self-supervision as a function of q, we compared SLAPS with and without self-supervision on Cora and Citeseer while reducing the number of labeled nodes per class from 20 to 5. We used the FP generator for this experiment. With 5 labeled nodes per class, adding self-supervision provides substantially larger improvements on Cora and Citeseer than the corresponding improvements when using 20 labeled nodes per class. This provides empirical evidence for Theorem 1.

The value of k in kNN: Figure 3(c) shows the performance of SLAPS on Cora for three graph generators as a function of k in kNN. For all three cases, the value of k plays a major role in model performance. The FP generator is the least sensitive because in FP, k only affects the initialization of the adjacency matrix, but the model can subsequently change the number of neighbors of each node. For MLP and MLP-D, however, the number of neighbors of each node remains close to k (but not necessarily equal, as the adjacency processor can add or remove some edges), and the two generators become more sensitive to k. For larger values of k, the extra flexibility of the MLP generator enables removing some of the unwanted edges through the function P or reducing their weights, resulting in MLP being less sensitive to large values of k compared to MLP-D.

Symmetrization:
To symmetrize the adjacency, in Equation 1 we took the average of P(Ã) and P(Ã)^T. Here we also consider two other choices: 1) max(P(Ã), P(Ã)^T), and 2) not symmetrizing the adjacency (i.e. using P(Ã)). Figure 3(d) compares these three choices on Cora and Citeseer with an MLP generator (other generators produced similar results). On both datasets, symmetrizing the adjacency provides a performance boost. Compared to mean symmetrization, max symmetrization performs slightly worse. This may be because max symmetrization does not distinguish between the case where both v_i and v_j are among the k most similar nodes of each other and the case where only one of them is among the k most similar nodes of the other.

Noisy graphs: So far, we have shown that self-supervision helps learn a better graph structure for GNNs. Here, we verify whether self-supervision is also helpful when a noisy structure is provided as input.
Figure 3. The performance of SLAPS: (a) compared to SLAPS2s on Cora with different generators; (b) with the MLP graph generator on Cora and Citeseer as a function of λ; (c) with different graph generators on Cora as a function of k in kNN; (d) on Cora and Citeseer with different adjacency symmetrizations; (e) compared to SLAPS with λ = 0 and to GCN when noisy graphs are provided as input (ρ indicates the percentage of perturbations); and (f) the odds of two nodes in the test set sharing the same label as a function of the edge weights learned by SLAPS.

Toward this goal, we experiment with Cora and Citeseer and provide noisy versions of the input graph. We perturb the graph structure by replacing ρ percent of the edges in the original structure (selected uniformly at random) with random edges. (The generator used in this experiment is MLP; other generators produced similar results.) Figure 3(e) shows the performance of SLAPS with and without self-supervision (λ = 0 corresponds to no self-supervision). We also report the results of vanilla GCN on these perturbed graphs for comparison. It can be seen that self-supervision consistently provides a boost in performance, especially for higher values of ρ.

Following the noisy-graph experiment above, we compared the learned and original structures by measuring the number of random edges added during perturbation but removed by the model, and the number of edges removed during the perturbation but recovered by the model. For both ρ = 25% and ρ = 50% on Cora, SLAPS removed a substantially larger share of the noisy edges and recovered a larger share of the removed edges than SLAPS with λ = 0. This provides evidence that self-supervision is helpful for structure learning.
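A sketch of the perturbation used here, assuming an edge list stored in a 2 × m tensor; the way replacement edges are sampled is our simplification:

```python
import torch

def perturb_edges(edge_index, n, rho):
    """Replace a fraction rho of the edges, chosen uniformly at random,
    with edges between uniformly sampled node pairs."""
    m = edge_index.shape[1]
    k = int(rho * m)
    keep = torch.randperm(m)[k:]                 # surviving original edges
    random_edges = torch.randint(0, n, (2, k))   # k random replacement edges
    return torch.cat([edge_index[:, keep], random_edges], dim=1)
```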
Cluster assumption: Many graph-based semi-supervised classification models are based on the cluster assumption, according to which nearby nodes are more likely to share the same label (Chapelle & Zien, 2005). To verify the quality of the learned adjacency, for every pair of nodes in the test set, we compute the odds of the two nodes sharing the same label as a function of the normalized weight of the edge connecting them. Figure 3(f) reports these odds for different weight intervals (recall that A is row-and-column normalized). For both Cora and Citeseer, nodes connected with higher edge weights are more likely to share the same label than nodes with lower or zero edge weights. Specifically, for the highest interval of edge weights, v_i and v_j are almost 2.5 and 2.0 times more likely to share the same label on Cora and Citeseer respectively. Note that SLAPS may connect nodes based on a different criterion than the one used in the original datasets, and so the learned adjacencies in this experiment do not necessarily resemble the original structures.
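One plausible reading of this metric, the likelihood of a shared label within a weight bin relative to the base rate over all test pairs, can be computed as follows (a sketch, not the authors' evaluation script):

```python
import torch

def same_label_odds(A, labels, test_idx, lo, hi):
    """Relative likelihood that two test nodes whose (normalized) edge
    weight falls in [lo, hi) share a label, versus a random test pair."""
    sub_A = A[test_idx][:, test_idx]
    same = labels[test_idx][:, None] == labels[test_idx][None, :]
    off_diag = ~torch.eye(len(test_idx), dtype=torch.bool)
    in_bin = (sub_A >= lo) & (sub_A < hi) & off_diag
    return same[in_bin].float().mean() / same[off_diag].float().mean()
```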
6. Conclusion
We proposed SLAPS, a model for simultaneously learning the parameters of a graph neural network and a graph structure over the nodes with the help of self-supervision. We explored the design space of SLAPS by comparing different graph generation and symmetrization approaches, and we showed the effectiveness of our model through a comprehensive set of experiments and analyses.
References
Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7(Nov):2399–2434, 2006.
Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., and Murphy, K. Machine learning on graphs: A model and comprehensive taxonomy. arXiv preprint arXiv:2005.03675, 2020.
Chapelle, O. and Zien, A. Semi-supervised classification by low density separation. In AISTATS, volume 2005, pp. 57–64. Citeseer, 2005.
Chen, Y., Wu, L., and Zaki, M. J. Deep iterative and adaptive learning for graph neural networks. In The First International Workshop on Deep Learning on Graphs: Methodologies and Applications (with AAAI), February 2020. URL https://dlg2019.bitbucket.io/aaai20.
Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
Cosmo, L., Kazi, A., Ahmadi, S.-A., Navab, N., and Bronstein, M. Latent patient network learning for automatic diagnosis. arXiv preprint arXiv:2003.13620, 2020.
Dai, H., Li, H., Tian, T., Huang, X., Wang, L., Zhu, J., and Song, L. Adversarial attack on graph structured data. arXiv preprint arXiv:1806.02371, 2018.
Erdős, P. and Rényi, A. On random graphs. Publicationes Mathematicae Debrecen, 6:290–297, 1959.
Fox, J. and Rajamanickam, S. How robust are graph neural networks to structural noise? arXiv preprint arXiv:1912.10206, 2019.
Franceschi, L., Niepert, M., Pontil, M., and He, X. Learning discrete structures for graph neural networks. In ICML, 2019.
Garcia, V. and Bruna, J. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
Gidaris, S. and Komodakis, N. Generating classification weights with GNN denoising autoencoders for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–30, 2019.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In ICML, pp. 1263–1272, 2017.
Halcrow, J., Mosoi, A., Ruth, S., and Perozzi, B. Grale: Designing networks for graph learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2523–2532, 2020.
Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NeurIPS, 2017.
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020a.
Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., and Leskovec, J. Strategies for pre-training graph neural networks. In ICLR, 2020b.
Hu, Z., Dong, Y., Wang, K., Chang, K.-W., and Sun, Y. GPT-GNN: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1857–1867, 2020c.
Jang, S., Moon, S.-E., and Lee, J.-S. Brain signal classification via learning connectivity structure. arXiv preprint arXiv:1905.11678, 2019.
Jin, W., Derr, T., Liu, H., Wang, Y., Wang, S., Liu, Z., and Tang, J. Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141, 2020a.
Jin, W., Ma, Y., Liu, X., Tang, X., Wang, S., and Tang, J. Graph structure learning for robust graph neural networks. arXiv preprint arXiv:2005.10203, 2020b.
Johnson, D. D., Larochelle, H., and Tarlow, D. Learning graph structure with a finite-state automaton layer. arXiv preprint arXiv:2007.04929, 2020.
Kazemi, S. M., Goel, R., Jain, K., Kobyzev, I., Sethi, A., Forsyth, P., and Poupart, P. Representation learning for dynamic graphs: A survey. JMLR, 2020.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018a.
Li, R., Wang, S., Zhu, F., and Huang, J. Adaptive graph convolutional neural networks. arXiv preprint arXiv:1801.03226, 2018b.
Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. In NeurIPS, pp. 13556–13566, 2019.
Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
Oono, K. and Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS-W, 2017.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
Qasim, S. R., Kieseler, J., Iiyama, Y., and Pierini, M. Learning representations of irregular particle-detector geometry with distance-weighted graph networks. The European Physical Journal C, 79(7):1–11, 2019.
Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
Sato, R. A survey on the expressive power of graph neural networks. arXiv preprint arXiv:2003.04078, 2020.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.
Stanley, R. P. Acyclic orientations of graphs. Discrete Mathematics, 5(2):171–178, 1973.
Stretcu, O., Viswanathan, K., Movshovitz-Attias, D., Platanios, E., Ravi, S., and Tomkins, A. Graph agreement models for semi-supervised learning. In NeurIPS, pp. 8713–8723, 2019.
Suhail, M. and Sigal, L. Mixture-kernel graph attention network for situation recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10363–10372, 2019.
Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103, 2008.
Wang, M., Yu, L., Zheng, D., Gan, Q., Gai, Y., Ye, Z., Li, M., Zhou, J., Huang, Q., Ma, C., et al. Deep graph library: Towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315, 2019a.
Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., and Solomon, J. M. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019b.
Wu, X., Zhao, L., and Akoglu, L. A quest for structure: Jointly learning the graph structure and semi-supervised classification. In CIKM, pp. 87–96, 2018.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In ICLR, 2019.
Yang, L., Kang, Z., Cao, X., Jin, D., Yang, B., and Guo, Y. Topology optimization based graph convolutional network. In IJCAI, pp. 4054–4061, 2019.
You, J., Ying, R., Ren, X., Hamilton, W. L., and Leskovec, J. GraphRNN: Generating realistic graphs with deep auto-regressive models. arXiv preprint arXiv:1802.08773, 2018.
You, Y., Chen, T., Wang, Z., and Shen, Y. When does self-supervision help graph convolutional networks? arXiv preprint arXiv:2006.09136, 2020.
Yu, D., Zhang, R., Jiang, Z., Wu, Y., and Yang, Y. Graph-revised convolutional network. In ECML PKDD, 2020.
Zhang, J., Shi, X., Xie, J., Ma, H., King, I., and Yeung, D.-Y. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294, 2018.
Zhang, J., Zhang, H., Xia, C., and Sun, L. Graph-BERT: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140, 2020.
Zhu, H., Lin, Y., Liu, Z., Fu, J., Chua, T.-S., and Sun, M. Graph neural networks with generated parameters for relation extraction. arXiv preprint arXiv:1902.00756, 2019.
Zhu, Q., Du, B., and Yan, P. Self-supervised training of graph convolutional networks. arXiv preprint arXiv:2006.02380, 2020.
Zhu, X. and Ghahramani, Z. Learning from labeled and unlabeled data with label propagation. 2002.
Zhu, X., Ghahramani, Z., and Lafferty, J. D. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 912–919, 2003.
Zügner, D., Akbarnejad, A., and Günnemann, S. Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2847–2856, 2018.
A. Proof of Theorem 1
Theorem 1. Let G(n, m) be an Erdős-Rényi graph with n nodes and m edges. Assume we have labels for q nodes selected uniformly at random. The probability of an edge being a no-supervision edge with a two-layer GCN is equal to (1 − q/n)(1 − q/(n−1)) ∏_{i=1}^{2q} (1 − (m−1)/(binom(n,2) − i)).

Proof.
To compute the probability of an edge being a no-supervision edge, we first compute the probability of the two nodes of the edge being unlabeled themselves, and then the probability of the two nodes not being connected to any labeled nodes. Let v and u represent two nodes connected by an edge.

With n nodes and q labels, the probability of a node being labeled is q/n. Therefore, Pr(v is unlabeled) = 1 − q/n and Pr(u is unlabeled | v is unlabeled) = 1 − q/(n−1). Therefore, Pr(v is unlabeled, u is unlabeled) = (1 − q/n)(1 − q/(n−1)).

Since there is an edge between v and u, there are m−1 edges remaining. Also, there are binom(n,2) − 1 pairs of nodes that can potentially have an edge between them. Therefore, the probability of v being disconnected from the first labeled node is 1 − (m−1)/(binom(n,2) − 1). If v is disconnected from the first labeled node, there are still m−1 edges remaining and there are now binom(n,2) − 2 pairs of nodes that can potentially have an edge between them. So the probability of v being disconnected from the second labeled node given that it is disconnected from the first labeled node is 1 − (m−1)/(binom(n,2) − 2). With similar reasoning, we can see that the probability of v being disconnected from the i-th labeled node given that it is disconnected from the first i−1 labeled nodes is 1 − (m−1)/(binom(n,2) − i).

We can follow similar reasoning for u. The probability of u being disconnected from the first labeled node given that v is disconnected from all q labeled nodes is 1 − (m−1)/(binom(n,2) − q − 1). That is because there are still m−1 edges remaining and binom(n,2) − q − 1 pairs of nodes that can potentially be connected with an edge. We can also see that the probability of u being disconnected from the i-th labeled node given that it is disconnected from the first i−1 labeled nodes and that v is disconnected from all q labeled nodes is 1 − (m−1)/(binom(n,2) − q − i).

As the probability of the two nodes being unlabeled and the probability of them not being connected to any labeled nodes in the graph are independent, their joint probability is the product of the probabilities computed above, which is equal to (1 − q/n)(1 − q/(n−1)) ∏_{i=1}^{2q} (1 − (m−1)/(binom(n,2) − i)). □

B. Dataset statistics
The statistics of the datasets used in the experiments can be found in Table 4.
C. Implementation Details
We implemented our model in PyTorch (Paszke et al., 2017), used the deep graph library (DGL) (Wang et al., 2019a) for the sparse operations, and used Adam (Kingma & Ba, 2014) as the optimizer. We performed early stopping and hyperparameter tuning based on the accuracy on the validation set for all datasets except Wine and Cancer. For these two datasets, validation accuracy reached 100 percent with many hyperparameter settings, making it difficult to select the best set of hyperparameters, so we instead used the validation cross-entropy loss.

We fixed the maximum number of epochs for all experiments. We use two-layer GCNs for both GNN_C and GNN_DAE as well as for the baselines, and two-layer MLPs throughout the paper (for the experiments on ogbn-arxiv, although the original paper uses models with three layers and with batch normalization after each layer, to be consistent with our other experiments we used two layers and removed the normalization). We used two learning rates, one for GNN_C, denoted lr_C, and one for the other parameters of the models, denoted lr_DAE; we tuned both learning rates from the set {0.01, 0.001}. We added dropout layers after the first layer of the GNNs. We also added dropout to the adjacency matrix for both GNN_C and GNN_DAE, denoted dropout_C and dropout_DAE respectively, and tuned these values from the set {0.25, 0.5}. We set the hidden dimension of GNN_C to the same value for all datasets except for ogbn-arxiv, for which we used a larger value. We used cosine similarity for building the kNN graphs and tuned the value of k from the set {10, 15, 20, 30}. We tuned λ (which controls the relative importance of the two losses) from the set {0.1, 1, 10, 100, 500}, and tuned r and η from the sets {1, 5, 10} and {1, 5} respectively.
Table 3. Best set of hyperparameters for different datasets, chosen on the validation set.

Dataset      Generator | lr_C   | lr_DAE | dropout_C | dropout_DAE | k  | λ    | r  | η
Cora         FP        | 0.001  | 0.01   | 0.5       | 0.25        | 30 | 10   | 10 | 5
Cora         MLP       | 0.01   | 0.001  | 0.25      | 0.5         | 20 | 10   | 10 | 5
Cora         MLP-D     | 0.01   | 0.001  | 0.25      | 0.5         | 15 | 10   | 10 | 5
Citeseer     FP        | 0.01   | 0.01   | 0.5       | 0.5         | 30 | 1    | 10 | 1
Citeseer     MLP       | 0.01   | 0.001  | 0.25      | 0.5         | 30 | 10   | 10 | 5
Citeseer     MLP-D     | 0.001  | 0.01   | 0.5       | 0.5         | 20 | 10   | 10 | 5
Cora390      FP        | 0.01   | 0.01   | 0.25      | 0.5         | 20 | 100  | 10 | 5
Cora390      MLP       | 0.01   | 0.001  | 0.25      | 0.5         | 20 | 10   | 10 | 5
Cora390      MLP-D     | 0.001  | 0.001  | 0.25      | 0.5         | 20 | 10   | 10 | 5
Citeseer370  FP        | 0.01   | 0.01   | 0.5       | 0.5         | 30 | 1    | 10 | 1
Citeseer370  MLP       | 0.01   | 0.001  | 0.25      | 0.5         | 30 | 10   | 10 | 5
Citeseer370  MLP-D     | 0.01   | 0.01   | 0.25      | 0.5         | 20 | 10   | 10 | 5
Pubmed       MLP       | 0.01   | 0.01   | 0.5       | 0.5         | 15 | 10   | 10 | 5
Pubmed       MLP-D     | 0.01   | 0.01   | 0.25      | 0.25        | 15 | 100  | 5  | 5
ogbn-arxiv   MLP       | 0.01   | 0.001  | 0.25      | 0.5         | 15 | 10   | 1  | 5
ogbn-arxiv   MLP-D     | 0.01   | 0.001  | 0.5       | 0.25        | 15 | 10   | 1  | 5
Wine         FP        | 0.01   | 0.001  | 0.5       | 0.5         | 20 | 0.1  | 5  | 5
Wine         MLP       | 0.01   | 0.001  | 0.5       | 0.25        | 20 | 0.1  | 5  | 5
Wine         MLP-D     | 0.01   | 0.01   | 0.25      | 0.5         | 10 | 1    | 5  | 5
Cancer       FP        | 0.01   | 0.001  | 0.5       | 0.25        | 20 | 0.1  | 5  | 5
Cancer       MLP       | 0.01   | 0.001  | 0.5       | 0.5         | 20 | 1.0  | 5  | 5
Cancer       MLP-D     | 0.01   | 0.01   | 0.5       | 0.5         | 20 | 0.1  | 5  | 5
Digits       FP        | 0.01   | 0.001  | 0.25      | 0.5         | 20 | 0.1  | 5  | 5
Digits       MLP       | 0.01   | 0.001  | 0.25      | 0.5         | 20 | 10   | 5  | 5
Digits       MLP-D     | 0.01   | 0.01   | 0.25      | 0.25        | 20 | 0.1  | 5  | 5
20news       FP        | 0.01   | 0.01   | 0.5       | 0.5         | 20 | 500  | 5  | 5
20news       MLP       | 0.001  | 0.001  | 0.25      | 0.5         | 20 | 500  | 5  | 5
20news       MLP-D     | 0.01   | 0.01   | 0.25      | 0.25        | 20 | 100  | 5  | 5
Table 4. Dataset statistics.

Dataset     | Nodes   | Edges     | Classes | Features | Label rate
Cora        | 2,708   | 5,429     | 7       | 1,433    | 0.052
Citeseer    | 3,327   | 4,732     | 6       | 3,703    | 0.036
Pubmed      | 19,717  | 44,338    | 3       | 500      | 0.003
ogbn-arxiv  | 169,343 | 1,166,243 | 40      | 128      | 0.537
Wine        | 178     | 0         | 3       | 13       | 0.112
Cancer      | 569     | 0         | 2       | 30       | 0.035
Digits      | 1,797   | 0         | 10      | 64       | 0.056
20news      | 9,607   | 0         | 10      | 236      | 0.021