Simple and Deep Graph Convolutional Networks


Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, Yaliang Li

Abstract

Graph convolutional networks (GCNs) are a powerful deep learning approach for graph-structured data. Recently, GCNs and subsequent variants have shown superior performance in various application areas on real-world datasets. Despite their success, most of the current GCN models are shallow, due to the over-smoothing problem. In this paper, we study the problem of designing and analyzing deep graph convolutional networks. We propose GCNII, an extension of the vanilla GCN model with two simple yet effective techniques: Initial residual and Identity mapping. We provide theoretical and empirical evidence that the two techniques effectively relieve the problem of over-smoothing. Our experiments show that the deep GCNII model outperforms the state-of-the-art methods on various semi- and full-supervised tasks. Code is available at https://github.com/chennnM/GCNII.

1. Introduction

Graph convolutional networks (GCNs) (Kipf & Welling, 2017) generalize convolutional neural networks (CNNs) (LeCun et al., 1995) to graph-structured data. To learn the graph representations, the “graph convolution” operation applies the same linear transformation to all the neighbors of a node, followed by a nonlinear activation function. In recent years, GCNs and their variants (Defferrard et al., 2016; Veličković et al., 2018) have been successfully applied to a wide range of applications, including social analysis (Qiu et al., 2018; Li & Goldwasser, 2019), traffic prediction (Guo et al., 2019; Li et al., 2019), biology (Fout et al., 2017; Shang et al., 2019), recommender systems (Ying et al., 2018), and computer vision (Zhao et al., 2019; Ma et al., 2019).

Despite their enormous success, most of the current GCN models are shallow. Most of the recent models, such as GCN (Kipf & Welling, 2017) and GAT (Veličković et al., 2018), achieve their best performance with 2-layer models. Such shallow architectures limit their ability to extract information from high-order neighbors. However, stacking more layers and adding non-linearity tends to degrade the performance of these models. Such a phenomenon is called over-smoothing (Li et al., 2018b), which suggests that as the number of layers increases, the representations of the nodes in GCN are inclined to converge to a certain value and thus become indistinguishable.

Author affiliations: School of Information, Renmin University of China; Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Lab of Big Data Management and Analysis Methods; MOE Key Lab of Data Engineering and Knowledge Engineering; School of Data Science, Fudan University; Alibaba Group. Correspondence to: Zhewei Wei. Proceedings of the International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

ResNet (He et al., 2016) solves a similar problem in computer vision with residual connections, which is effective for training very deep neural networks. Unfortunately, adding residual connections in the GCN models merely slows down the over-smoothing problem (Kipf & Welling, 2017); deep GCN models are still outperformed by 2-layer models such as GCN or GAT.

Recently, several works try to tackle the problem of over-smoothing. JKNet (Xu et al., 2018) uses dense skip connections to combine the output of each layer to preserve the locality of the node representations. DropEdge (Rong et al., 2020) suggests that by randomly removing a few edges from the input graph, one can relieve the impact of over-smoothing. Experiments (Rong et al., 2020) suggest that the two methods can slow down the performance drop as we increase the network depth. However, for semi-supervised tasks, the state-of-the-art results are still achieved by the shallow models, and thus the benefit brought by increasing the network depth remains in doubt.

On the other hand, several methods combine deep propagation with shallow neural networks. SGC (Wu et al., 2019) attempts to capture higher-order information in the graph by applying the $K$-th power of the graph convolution matrix in a single neural network layer. PPNP and APPNP (Klicpera et al., 2019a) replace the power of the graph convolution matrix with the Personalized PageRank matrix to solve the over-smoothing problem. GDC (Klicpera et al., 2019b) further extends APPNP by generalizing Personalized PageRank (Page et al., 1999) to an arbitrary graph diffusion process. However, these methods perform a linear combination of neighbor features in each layer and lose the powerful expression ability of deep nonlinear architectures, which means they are still shallow models.

In conclusion, it remains an open problem to design a GCN model that effectively prevents over-smoothing and achieves state-of-the-art results with truly deep network structures. Due to this challenge, it is even unclear whether the network depth is a resource or a burden in designing new graph neural networks. In this paper, we give a positive answer to this open problem by demonstrating that the vanilla GCN (Kipf & Welling, 2017) can be extended to a deep model with two simple yet effective modifications. In particular, we propose Graph Convolutional Network via Initial residual and Identity mapping (GCNII), a deep GCN model that resolves the over-smoothing problem. At each layer, initial residual constructs a skip connection from the input layer, while identity mapping adds an identity matrix to the weight matrix. The empirical study demonstrates that the two surprisingly simple techniques prevent over-smoothing and improve the performance of GCNII consistently as we increase its network depth. In particular, the deep GCNII model achieves new state-of-the-art results on various semi-supervised and full-supervised tasks.

We also provide theoretical analysis for multi-layer GCN and GCNII models. It is known (Wu et al., 2019) that by stacking $K$ layers, the vanilla GCN essentially simulates a $K$-th order polynomial filter with predetermined coefficients. (Wang et al., 2019) points out that such a filter simulates a lazy random walk that eventually converges to the stationary vector and thus leads to over-smoothing. On the other hand, we prove that a $K$-layer GCNII model can express a polynomial spectral filter of order $K$ with arbitrary coefficients. This property is essential for designing deep neural networks. We also derive the closed form of the stationary vector and analyze the rate of convergence for the vanilla GCN.
Our analysis implies that nodes with high degrees are more likely to suffer from over-smoothing in a multi-layer GCN model, and we perform experiments to confirm this theoretical conjecture.

2. Preliminaries

Notations.

Consider a simple and connected undirected graph $G = (V, E)$ with $n$ nodes and $m$ edges. We define the self-looped graph $\tilde{G} = (V, \tilde{E})$ to be the graph with a self-loop attached to each node in $G$. We use $\{1, \dots, n\}$ to denote the node IDs of $G$ and $\tilde{G}$, and $d_j$ and $d_j + 1$ to denote the degree of node $j$ in $G$ and $\tilde{G}$, respectively. Let $A$ denote the adjacency matrix and $D$ the diagonal degree matrix. Consequently, the adjacency matrix and diagonal degree matrix of $\tilde{G}$ are defined to be $\tilde{A} = A + I$ and $\tilde{D} = D + I$, respectively. Let $X \in \mathbb{R}^{n \times d}$ denote the node feature matrix, that is, each node $v$ is associated with a $d$-dimensional feature vector $X_v$. The normalized graph Laplacian matrix is defined as $L = I_n - D^{-1/2} A D^{-1/2}$, which is a symmetric positive semidefinite matrix with eigendecomposition $U \Lambda U^T$. Here $\Lambda$ is a diagonal matrix of the eigenvalues of $L$, and $U \in \mathbb{R}^{n \times n}$ is a unitary matrix that consists of the eigenvectors of $L$. The graph convolution operation between signal $x$ and filter $g_\gamma(\Lambda) = \mathrm{diag}(\gamma)$ is defined as $g_\gamma(L) * x = U g_\gamma(\Lambda) U^T x$, where the parameter $\gamma \in \mathbb{R}^n$ corresponds to a vector of spectral filter coefficients.

Vanilla GCN.

(Kipf & Welling, 2017) and (Defferrard et al., 2016) suggest that the graph convolution operation can be further approximated by the $K$-th order polynomial of Laplacians

$$U g_\theta(\Lambda) U^T x \approx U \left( \sum_{\ell=0}^{K} \theta_\ell \Lambda^\ell \right) U^T x = \left( \sum_{\ell=0}^{K} \theta_\ell L^\ell \right) x,$$

where $\theta \in \mathbb{R}^{K+1}$ corresponds to a vector of polynomial coefficients. The vanilla GCN (Kipf & Welling, 2017) sets $K = 1$, $\theta_0 = 2\theta$ and $\theta_1 = -\theta$ to obtain the convolution operation $g_\theta * x = \theta \left( I + D^{-1/2} A D^{-1/2} \right) x$.
Finally, by the renormalization trick, (Kipf & Welling, 2017) replaces the matrix $I + D^{-1/2} A D^{-1/2}$ by a normalized version $\tilde{P} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} = (D + I_n)^{-1/2} (A + I_n) (D + I_n)^{-1/2}$ and obtains the Graph Convolutional Layer

$$H^{(\ell+1)} = \sigma\left( \tilde{P} H^{(\ell)} W^{(\ell)} \right), \tag{1}$$

where $\sigma$ denotes the ReLU operation.

SGC (Wu et al., 2019) shows that by stacking $K$ layers, GCN corresponds to a fixed polynomial filter of order $K$ on the graph spectral domain of $\tilde{G}$. In particular, let $\tilde{L} = I_n - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ denote the normalized graph Laplacian matrix of the self-looped graph $\tilde{G}$. Consequently, applying a $K$-layer GCN to a signal $x$ corresponds to $\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \right)^K x = \left( I_n - \tilde{L} \right)^K x$. (Wu et al., 2019) also shows that by adding a self-loop to each node, $\tilde{L}$ effectively shrinks the underlying graph spectrum.

APPNP.

(Klicpera et al., 2019a) uses Personalized PageRank to derive a fixed filter of order $K$. Let $f_\theta(X)$ denote the output of a two-layer fully connected neural network on the feature matrix $X$. PPNP's model is defined as

$$H = \alpha \left( I_n - (1 - \alpha) \tilde{A} \right)^{-1} f_\theta(X). \tag{2}$$

Due to the property of Personalized PageRank, such a filter preserves locality and thus is suitable for classification tasks. (Klicpera et al., 2019a) also proposes APPNP, which replaces $\alpha \left( I_n - (1 - \alpha) \tilde{A} \right)^{-1}$ with an approximation derived by a truncated power iteration. Formally, APPNP with $K$-hop aggregation is defined as

$$H^{(\ell+1)} = (1 - \alpha) \tilde{P} H^{(\ell)} + \alpha H^{(0)}, \tag{3}$$

where $H^{(0)} = f_\theta(X)$. By decoupling feature transformation and propagation, PPNP and APPNP can aggregate information from multi-hop neighbors without increasing the number of layers in the neural network.

JKNet.

The first deep GCN framework is proposed by (Xu et al., 2018). At the last layer, JKNet combines all previous representations $\left[ H^{(1)}, \dots, H^{(K)} \right]$ to learn representations of different orders for different graph substructures. (Xu et al., 2018) proves that 1) a $K$-layer vanilla GCN model simulates random walks of $K$ steps in the self-looped graph $\tilde{G}$ and 2) by combining all representations from the previous layers, JKNet relieves the problem of over-smoothing.

DropEdge.

A recent work (Rong et al., 2020) suggests that randomly removing some edges from $\tilde{G}$ retards the convergence speed of over-smoothing. Let $\tilde{P}_{\mathrm{drop}}$ denote the renormalized graph convolution matrix with some edges removed at random. The vanilla GCN equipped with DropEdge is defined as

$$H^{(\ell+1)} = \sigma\left( \tilde{P}_{\mathrm{drop}} H^{(\ell)} W^{(\ell)} \right). \tag{4}$$
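The renormalized matrix and the propagation rules in equations (1) and (3) can be sketched in a few lines of NumPy. This is only an illustrative sketch on a hypothetical 4-node path graph with dense matrices: the helper names (`renormalized_adjacency`, `gcn_layer`, `appnp_propagate`), the random features, and the value `alpha = 0.1` are assumptions made for the example. The closed form checked at the end is written with the normalized matrix $\tilde{P}$, since that is the fixed point of iteration (3); reading equation (2) this way is also an assumption.

```python
import numpy as np

def renormalized_adjacency(A):
    """P~ = (D + I)^{-1/2} (A + I) (D + I)^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # attach a self-loop to every node
    deg_tilde = A_tilde.sum(axis=1)       # degrees d_j + 1 in the self-looped graph
    inv_sqrt = 1.0 / np.sqrt(deg_tilde)
    return inv_sqrt[:, None] * A_tilde * inv_sqrt[None, :]

def gcn_layer(P, H, W):
    """One graph convolutional layer as in equation (1): ReLU(P~ H W)."""
    return np.maximum(P @ H @ W, 0.0)

def appnp_propagate(P, H0, alpha, num_steps):
    """Truncated power iteration of equation (3), started at H^(0)."""
    H = H0.copy()
    for _ in range(num_steps):
        H = (1.0 - alpha) * (P @ H) + alpha * H0
    return H

# Hypothetical toy data: path graph 0 - 1 - 2 - 3 with random features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = renormalized_adjacency(A)
rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
H1 = gcn_layer(P, H0, W)

alpha = 0.1                                # assumed teleport probability
H_iter = appnp_propagate(P, H0, alpha, num_steps=300)
# Fixed point of iteration (3): alpha * (I - (1 - alpha) P~)^{-1} H^(0).
H_closed = alpha * np.linalg.solve(np.eye(4) - (1.0 - alpha) * P, H0)
print(np.max(np.abs(H_iter - H_closed)))   # gap between iteration and closed form
```

Because every step of the iteration contracts the error by a factor of at most $1 - \alpha$, a few hundred steps are more than enough on a graph this small; real implementations would use sparse matrices instead of dense ones.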

3. GCNII Model

It is known (Wu et al., 2019) that by stacking $K$ layers, the vanilla GCN simulates a polynomial filter $\left( \sum_{\ell=0}^{K} \theta_\ell \tilde{L}^\ell \right) x$ of order $K$ with fixed coefficients $\theta$ on the graph spectral domain of $\tilde{G}$. The fixed coefficients limit the expressive power of a multi-layer GCN model and thus lead to over-smoothing. To extend GCN to a truly deep model, we need to enable GCN to express a $K$-order polynomial filter with arbitrary coefficients. We show this can be achieved by two simple techniques: Initial residual connection and Identity mapping. Formally, we define the $\ell$-th layer of GCNII as

$$H^{(\ell+1)} = \sigma\left( \left( (1 - \alpha_\ell) \tilde{P} H^{(\ell)} + \alpha_\ell H^{(0)} \right) \left( (1 - \beta_\ell) I_n + \beta_\ell W^{(\ell)} \right) \right), \tag{5}$$

where $\alpha_\ell$ and $\beta_\ell$ are two hyperparameters to be discussed later. Recall that $\tilde{P} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the graph convolution matrix with the renormalization trick. Note that compared to the vanilla GCN model (equation (1)), we make two modifications: 1) We combine the smoothed representation $\tilde{P} H^{(\ell)}$ with an initial residual connection to the first layer $H^{(0)}$; 2) We add an identity mapping $I_n$ to the $\ell$-th weight matrix $W^{(\ell)}$.

Initial residual connection.

To simulate the skip connection in ResNet (He et al., 2016), (Kipf & Welling, 2017) proposes a residual connection that combines the smoothed representation $\tilde{P} H^{(\ell)}$ with $H^{(\ell)}$. However, it is also shown in (Kipf & Welling, 2017) that such a residual connection only partially relieves the over-smoothing problem; the performance of the model still degrades as we stack more layers.

We propose that, instead of using a residual connection to carry the information from the previous layer, we construct a connection to the initial representation $H^{(0)}$. The initial residual connection ensures that the final representation of each node retains at least a fraction of $\alpha_\ell$ from the input layer even if we stack many layers. In practice, we can simply set $\alpha_\ell$ = 0. or . so that the final representation of each node consists of at least a fraction of the input feature. We also note that $H^{(0)}$ does not necessarily have to be the feature matrix $X$. If the feature dimension $d$ is large, we can apply a fully-connected neural network on $X$ to obtain a lower-dimensional initial representation $H^{(0)}$ before the forward propagation.

Finally, we recall that APPNP (Klicpera et al., 2019a) employs a similar approach to the initial residual connection in the context of Personalized PageRank. However, (Klicpera et al., 2019a) also shows that performing multiple non-linearity operations on the feature matrix will lead to over-fitting and thus result in a performance drop. Therefore, APPNP applies a linear combination between different layers and thus remains a shallow model. This suggests that the idea of initial residual alone is not sufficient to extend GCN to a deep model.

Identity mapping.

To amend the deficiency of APPNP, we borrow the idea of identity mapping from ResNet. At the $\ell$-th layer, we add an identity matrix $I_n$ to the weight matrix $W^{(\ell)}$. In the following, we summarize the motivations for introducing identity mapping into our model.

• Similar to the motivation of ResNet (He et al., 2016), identity mapping ensures that a deep GCNII model achieves at least the same performance as its shallow version does. In particular, by setting $\beta_\ell$ sufficiently small, deep GCNII ignores the weight matrix $W^{(\ell)}$ and essentially simulates APPNP (equation (3)).

• It has been observed that frequent interaction between different dimensions of the feature matrix (Klicpera et al., 2019a) degrades the performance of the model in semi-supervised tasks. Mapping the smoothed representation $\tilde{P} H^{(\ell)}$ directly to the output reduces such interaction.

• Identity mapping is proved to be particularly useful in semi-supervised tasks. It is shown in (Hardt & Ma, 2017) that a linear ResNet of the form $H^{(\ell+1)} = H^{(\ell)} \left( W^{(\ell)} + I_n \right)$ satisfies the following properties: 1) The optimal weight matrices $W^{(\ell)}$ have small norms; 2) The only critical point is the global minimum. The first property allows us to put strong regularization on $W^{(\ell)}$ to avoid over-fitting, while the latter is desirable in semi-supervised tasks where training data is limited.

• (Oono & Suzuki, 2020) theoretically proves that the node features of a $K$-layer GCN will converge to a subspace and incur information loss. In particular, the rate of convergence depends on $s^K$, where $s$ is the maximum singular value of the weight matrices $W^{(\ell)}$, $\ell = 0, \dots, K - 1$. By replacing $W^{(\ell)}$ with $(1 - \beta_\ell) I_n + \beta_\ell W^{(\ell)}$ and imposing regularization on $W^{(\ell)}$, we force the norm of $W^{(\ell)}$ to be small. Consequently, the singular values of $(1 - \beta_\ell) I_n + \beta_\ell W^{(\ell)}$ will be close to 1. Therefore, the maximum singular value $s$ will also be close to 1, which implies that $s^K$ is large, and the information loss is relieved.

The principle of setting $\beta_\ell$ is to ensure that the decay of the weight matrix adaptively increases as we stack more layers. In practice, we set $\beta_\ell = \log\left( \frac{\lambda}{\ell} + 1 \right) \approx \frac{\lambda}{\ell}$, where $\lambda$ is a hyperparameter.

Connection to iterative shrinkage-thresholding.

Recently, there has been work on optimization-inspired network structure design (Zhang & Ghanem, 2018; Papyan et al., 2017). The idea is that a feedforward neural network can be considered as an iterative optimization algorithm to minimize some function, and it was hypothesized that better optimization algorithms might lead to better network structure (Li et al., 2018a). Thus, theories in numerical optimization algorithms may inspire the design of better and more interpretable network structures. As we will show next, the use of identity mappings in our structure is also well-motivated from this. We consider the LASSO objective:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Bx - y \|_2^2 + \lambda \| x \|_1.$$

Similar to compressive sensing, we consider $x$ as the signal we are trying to recover, $B$ as the measurement matrix, and $y$ as the signal we observe. In our setting, $y$ is the original feature of a node, and $x$ is the node embedding the network tries to learn. As opposed to standard regression models, the design matrix $B$ consists of unknown parameters and will be learned through back propagation. So, this is in the same spirit as the sparse coding problem, which has been used to design and to analyze CNNs (Papyan et al., 2017). Iterative shrinkage-thresholding algorithms are effective for solving the above optimization problem, in which the update in the $(t+1)$-th iteration is:

$$x^{t+1} = P_{\mu_t \lambda}\left( x^t - \mu_t B^T B x^t + \mu_t B^T y \right).$$

Here $\mu_t$ is the step size, and $P_\theta(\cdot)$ (with $\theta > 0$) is the entry-wise soft thresholding function:

$$P_\theta(z) = \begin{cases} z - \theta, & \text{if } z \ge \theta, \\ 0, & \text{if } |z| < \theta, \\ z + \theta, & \text{if } z \le -\theta. \end{cases}$$

Now, if we reparameterize $-B^T B$ by $W$, the above update formula becomes quite similar to the one used in our method. More specifically, we have $x^{t+1} = P_{\mu_t \lambda}\left( (I + \mu_t W) x^t + \mu_t B^T y \right)$, where the term $\mu_t B^T y$ corresponds to the initial residual, and $I + \mu_t W$ corresponds to the identity mapping in our model (5). The soft thresholding operator acts as the nonlinear activation function, which is similar to the effect of ReLU activation. In conclusion, our network structure, especially the use of identity mapping, is well-motivated from iterative shrinkage-thresholding algorithms for solving LASSO.
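The correspondence between the shrinkage-thresholding update and the GCNII layer can be made concrete with a small NumPy sketch. It is not the authors' implementation: the function names, the random problem data, the step size $1/\sigma_{\max}(B)^2$, and the values of `alpha`, `beta` and `lam` are all assumptions chosen for illustration. The sketch also assumes the identity matrix in equation (5) has the size of the hidden dimension, so that the matrix product is well defined.

```python
import numpy as np

def soft_threshold(z, theta):
    """Entry-wise soft thresholding P_theta(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def ista_step(x, B, y, mu, lam):
    """One update P_{mu*lam}(x - mu B^T B x + mu B^T y)."""
    return soft_threshold(x - mu * (B.T @ (B @ x)) + mu * (B.T @ y), mu * lam)

def ista_step_reparam(x, W, residual, mu, lam):
    """Same update with W = -B^T B and residual = B^T y."""
    return soft_threshold((np.eye(x.shape[0]) + mu * W) @ x + mu * residual, mu * lam)

def lasso_objective(x, B, y, lam):
    return 0.5 * np.sum((B @ x - y) ** 2) + lam * np.sum(np.abs(x))

def gcnii_layer(P, H, H0, W, alpha, beta):
    """Equation (5): ReLU(((1 - alpha) P H + alpha H0) ((1 - beta) I + beta W))."""
    support = (1.0 - alpha) * (P @ H) + alpha * H0           # initial residual
    mixing = (1.0 - beta) * np.eye(W.shape[0]) + beta * W    # identity mapping
    return np.maximum(support @ mixing, 0.0)

# Hypothetical LASSO instance.
rng = np.random.default_rng(0)
B = rng.normal(size=(20, 5))
y = rng.normal(size=20)
lam = 0.1
mu = 1.0 / np.linalg.norm(B, 2) ** 2      # step size 1 / (largest eigenvalue of B^T B)

x = np.zeros(5)
objective_values = [lasso_objective(x, B, y, lam)]
max_reparam_gap = 0.0
for _ in range(50):
    x_next = ista_step(x, B, y, mu, lam)
    x_alt = ista_step_reparam(x, -B.T @ B, B.T @ y, mu, lam)
    max_reparam_gap = max(max_reparam_gap, np.max(np.abs(x_next - x_alt)))
    x = x_next
    objective_values.append(lasso_objective(x, B, y, lam))

# With beta = 0 and non-negative inputs, the layer reduces to the APPNP update (3).
P = np.array([[0.5, 0.5], [0.5, 0.5]])    # P~ of a single edge with self-loops
H0 = np.abs(rng.normal(size=(2, 3)))
W = rng.normal(size=(3, 3))
out = gcnii_layer(P, H0, H0, W, alpha=0.1, beta=0.0)
expected = 0.9 * (P @ H0) + 0.1 * H0
```

In both `ista_step_reparam` and `gcnii_layer`, a fixed input term (`residual` or `H0`) is added at every step and the learned matrix enters only as a perturbation of the identity, which is the structural similarity the text points to.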

4. Spectral Analysis

We consider the following GCN model with residual connection:

$$H^{(\ell+1)} = \sigma\left( \left( \tilde{P} H^{(\ell)} + H^{(\ell)} \right) W^{(\ell)} \right). \tag{6}$$

Recall that $\tilde{P} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the graph convolution matrix with the renormalization trick. (Wang et al., 2019) points out that equation (6) simulates a lazy random walk with the transition matrix $\frac{I_n + \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}}{2}$. Such a lazy random walk eventually converges to the stationary state and thus leads to over-smoothing. We now derive the closed form of the stationary vector and analyze the rate of such convergence. Our analysis suggests that the convergence rate of an individual node depends on its degree, and we conduct experiments to back up this theoretical finding. In particular, we have the following theorem.

Theorem 1.

Assume the self-looped graph $\tilde{G}$ is connected. Let $h^{(K)} = \left( \frac{I_n + \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}}{2} \right)^K \cdot x$ denote the representation obtained by applying a $K$-layer renormalized graph convolution with residual connection to a graph signal $x$. Let $\lambda_{\tilde{G}}$ denote the spectral gap of the self-looped graph $\tilde{G}$, that is, the least nonzero eigenvalue of the normalized Laplacian $\tilde{L} = I_n - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. We have

1) As $K$ goes to infinity, $h^{(K)}$ converges to $\pi = \frac{\langle \tilde{D}^{1/2} \mathbf{1}, x \rangle}{2m + n} \cdot \tilde{D}^{1/2} \mathbf{1}$, where $\mathbf{1}$ denotes an all-one vector.

2) The convergence rate is determined by

$$h^{(K)} = \pi \pm \left( \sum_{i=1}^{n} x_i \right) \cdot \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K \cdot \mathbf{1}. \tag{7}$$

Recall that $n$ and $m$ are the numbers of nodes and edges in the original graph $G$. We use the operator $\pm$ to indicate that for each entry $h^{(K)}(j)$ and $\pi(j)$, $j = 1, \dots, n$,

$$\left| h^{(K)}(j) - \pi(j) \right| \le \left( \sum_{i=1}^{n} x_i \right) \cdot \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K.$$

The proof of Theorem 1 can be found in the supplementary materials. There are two consequences from Theorem 1. First of all, it suggests that the $K$-th representation of GCN $h^{(K)}$ converges to a vector $\pi = \frac{\langle \tilde{D}^{1/2} \mathbf{1}, x \rangle}{2m + n} \cdot \tilde{D}^{1/2} \mathbf{1}$. Such convergence leads to over-smoothing as the vector $\pi$ only carries two kinds of information: the degree of each node, and the inner product between the initial signal $x$ and the vector $\tilde{D}^{1/2} \mathbf{1}$.

Convergence rate and node degree.

Equation (7) suggests that the convergence rate depends on the summation of feature entries $\sum_{i=1}^{n} x_i$ and the spectral gap $\lambda_{\tilde{G}}$. If we take a closer look at the relative convergence rate for an individual node $j$, we can express its final representation $h^{(K)}(j)$ as

$$h^{(K)}(j) = \sqrt{d_j + 1} \left( \sum_{i=1}^{n} \frac{\sqrt{d_i + 1}}{2m + n} x_i \pm \frac{\sum_{i=1}^{n} x_i \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K}{\sqrt{d_j + 1}} \right).$$

This suggests that if a node $j$ has a higher degree $d_j$ (and hence a larger $\sqrt{d_j + 1}$), its representation $h^{(K)}(j)$ converges faster to the stationary state $\pi(j)$. Based on this fact, we make the following conjecture.

Conjecture 1.

Nodes with higher degrees are more likely to suffer from over-smoothing.

We will verify Conjecture 1 on real-world datasets in our experiments.
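The convergence described by Theorem 1 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not part of the paper's proof: the toy graph, the signal `x`, and the variable names are hypothetical, and the bound is read as $\left(\sum_i x_i\right)\left(1 - \lambda_{\tilde{G}}^2/2\right)^K$ with a non-negative signal.

```python
import numpy as np

# Hypothetical toy graph: a star centered at node 0 plus one extra edge 1 - 2.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
n, m = 5, len(edges)
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

deg_tilde = A.sum(axis=1) + 1.0                       # d_j + 1
P = (A + np.eye(n)) / np.sqrt(np.outer(deg_tilde, deg_tilde))
T = 0.5 * (np.eye(n) + P)                             # lazy random walk operator

x = np.array([1.0, 0.0, 2.0, 0.5, 3.0])               # non-negative graph signal
sqrt_deg = np.sqrt(deg_tilde)
pi = (sqrt_deg @ x) / (2 * m + n) * sqrt_deg          # stationary vector of Theorem 1

# Spectral gap: least nonzero eigenvalue of L~ = I - P~ (eigvalsh sorts ascending).
lam = np.linalg.eigvalsh(np.eye(n) - P)[1]

errors, bounds = [], []
h = x.copy()
for K in range(1, 61):
    h = T @ h                                         # h^(K) = T^K x
    errors.append(np.max(np.abs(h - pi)))             # worst entry-wise deviation
    bounds.append(x.sum() * (1.0 - lam ** 2 / 2.0) ** K)

print(errors[-1], bounds[-1])
```

On this graph the observed deviation stays below the bound at every depth and shrinks geometrically, which matches the qualitative picture of over-smoothing: after enough layers, $h^{(K)}$ retains only node degrees and a single inner product with $x$.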

We consider the spectral domain of the self-looped graph $\tilde{G}$. Recall that a polynomial filter of order $K$ on a graph signal $x$ is defined as $\left( \sum_{\ell=0}^{K} \theta_\ell \tilde{L}^\ell \right) x$, where $\tilde{L}$ is the normalized Laplacian matrix of $\tilde{G}$ and the $\theta_\ell$'s are the polynomial coefficients. (Wu et al., 2019) proves that a $K$-layer GCN simulates a polynomial filter of order $K$ with fixed coefficients $\theta$. As we shall prove later, such fixed coefficients limit the expressive power of GCN and thus lead to over-smoothing. On the other hand, we show a $K$-layer GCNII model can express a $K$-order polynomial filter with arbitrary coefficients.

Theorem 2.

Consider the self-looped graph $\tilde{G}$ and a graph signal $x$. A $K$-layer GCNII can express a $K$-order polynomial filter $\left( \sum_{\ell=0}^{K} \theta_\ell \tilde{L}^\ell \right) x$ with arbitrary coefficients $\theta$.

The proof of Theorem 2 can be found in the supplementary materials. Intuitively, the parameter $\beta$ allows GCNII to simulate the coefficient $\theta_\ell$ of the polynomial filter.

Expressive power and over-smoothing.

The ability to express a polynomial filter with arbitrary coefficients is essential for preventing over-smoothing. To see why this is the case, recall that Theorem 1 suggests a $K$-layer vanilla GCN simulates a fixed $K$-order polynomial filter $\tilde{P}^K x$, where $\tilde{P}$ is the renormalized graph convolution matrix. Over-smoothing is caused by the fact that $\tilde{P}^K x$ converges to a distribution isolated from the input feature $x$, thus incurring vanishing gradients. DropEdge (Rong et al., 2020) slows down the rate of convergence, but eventually will fail as $K$ goes to infinity.

On the other hand, Theorem 2 suggests that deep GCNII converges to a distribution that carries information from both the input feature and the graph structure. This property alone ensures that GCNII will not suffer from over-smoothing even if the number of layers goes to infinity. More precisely, Theorem 2 states that a $K$-layer GCNII can express $h^{(K)} = \left( \sum_{\ell=0}^{K} \theta_\ell \tilde{L}^\ell \right) \cdot x$ with arbitrary coefficients $\theta$. Since the renormalized graph convolution matrix is $\tilde{P} = I_n - \tilde{L}$, it follows that a $K$-layer GCNII can express $h^{(K)} = \left( \sum_{\ell=0}^{K} \theta'_\ell \tilde{P}^\ell \right) \cdot x$ with arbitrary coefficients $\theta'$. Note that with a proper choice of $\theta'$, $h^{(K)}$ can carry information from both the input feature and the graph structure even with $K$ going to infinity. For example, APPNP (Klicpera et al., 2019a) and GDC (Klicpera et al., 2019b) set $\theta'_i = \alpha (1 - \alpha)^i$ for some constant $0 < \alpha < 1$. As $K$ goes to infinity, $h^{(K)} = \left( \sum_{\ell=0}^{K} \theta'_\ell \tilde{P}^\ell \right) \cdot x$ converges to the Personalized PageRank vector of $x$, which is a function of both the adjacency matrix $\tilde{A}$ and the input feature vector $x$. The difference between GCNII and APPNP/GDC is that 1) the coefficient vector $\theta$ in our model is learned from the input feature and the label, and 2) we impose a ReLU operation at each layer.
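The contrast between the fixed filter $\tilde{P}^K x$ and a filter with coefficients $\theta'_\ell = \alpha (1 - \alpha)^\ell$ can be seen on a toy example. The following sketch is illustrative only: the 4-node path graph, the two one-hot signals, the helper names, and the values `alpha = 0.1` and `K = 200` are assumptions for the demonstration.

```python
import numpy as np

def renormalized_adjacency(A):
    """P~ = (D + I)^{-1/2} (A + I) (D + I)^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return inv_sqrt[:, None] * A_tilde * inv_sqrt[None, :]

def fixed_filter(P, x, K):
    """Fixed K-order filter P~^K x."""
    h = x.copy()
    for _ in range(K):
        h = P @ h
    return h

def ppr_filter(P, x, alpha, K):
    """Polynomial filter sum_{l=0}^{K} alpha (1 - alpha)^l P~^l x."""
    term = x.copy()
    out = alpha * term
    for l in range(1, K + 1):
        term = P @ term
        out = out + alpha * (1.0 - alpha) ** l * term
    return out

# Hypothetical path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = renormalized_adjacency(A)

# Two different signals with the same inner product with D~^{1/2} 1.
x1 = np.array([1.0, 0.0, 0.0, 0.0])
x2 = np.array([0.0, 0.0, 0.0, 1.0])

K, alpha = 200, 0.1
smooth_gap = np.max(np.abs(fixed_filter(P, x1, K) - fixed_filter(P, x2, K)))
ppr_gap = np.max(np.abs(ppr_filter(P, x1, alpha, K) - ppr_filter(P, x2, alpha, K)))
print(smooth_gap, ppr_gap)
```

After many steps the fixed filter maps both signals to essentially the same vector, so the two inputs can no longer be told apart, while the Personalized PageRank coefficients keep a clear dependence on where the signal started.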

5. Other Related Work

Spectral-based GCN has been extensively studied for the past few years. (Li et al., 2018c) improves flexibility by learning a task-driven adaptive graph for each graph data while training. (Xu et al., 2019) uses the graph wavelet basis instead of the Fourier basis to improve sparseness and locality. Another line of work focuses on attention-based GCN models (Veličković et al., 2018; Thekumparampil et al., 2018; Zhang et al., 2018), which learn the edge weights at each layer based on node features. (Abu-El-Haija et al., 2019) learns neighborhood mixing relationships by mixing neighborhood information at various distances but still uses a two-layer model. (Gao & Ji, 2019; Lee et al., 2019) are devoted to extending pooling operations to graph neural networks. For unsupervised information, (Velickovic et al., 2019) trains a graph convolutional encoder by maximizing mutual information. (Pei et al., 2020) builds structural neighborhoods in the latent space of graph embedding for aggregation to extract more structural information. (Dave et al., 2019) uses a single representation vector to capture both topological information and nodal attributes in graph embedding. Many sampling-based methods have been proposed to improve the scalability of GCN. (Hamilton et al., 2017) uses a fixed size of neighborhood samples through layers, and (Chen et al., 2018a; Huang et al., 2018) propose efficient variants based on importance sampling. (Chiang et al., 2019) constructs minibatches based on graph clustering.

Table 1. Dataset statistics.

Dataset     Classes   Nodes     Edges     Features
Cora        7         2,708     5,429     1,433
Citeseer    6         3,327     4,732     3,703
Pubmed      3         19,717    44,338    500
Chameleon   4         2,277     36,101    2,325
Cornell     5         183       295       1,703
Texas       5         183       309       1,703
Wisconsin   5         251       499       1,703
PPI         121       56,944    818,716   50

6. Experiments

In this section, we evaluate the performance of GCNII against the state-of-the-art graph neural network models on a wide variety of open graph datasets.

Dataset and experimental setup.

We use three standard citation network datasets Cora, Citeseer, and Pubmed (Sen et al., 2008) for semi-supervised node classification. In these citation datasets, nodes correspond to documents, and edges correspond to citations; each node feature corresponds to the bag-of-words representation of the document and belongs to one of the academic topics. For full-supervised node classification, we also include Chameleon (Rozemberczki et al., 2019), Cornell, Texas, and Wisconsin (Pei et al., 2020). These datasets are web networks, where nodes and edges represent web pages and hyperlinks, respectively. The feature of each node is the bag-of-words representation of the corresponding page. For inductive learning, we use Protein-Protein Interaction (PPI) networks (Hamilton et al., 2017), which contain 24 graphs. Following the setting of previous work (Veličković et al., 2018), we use 20 graphs for training, 2 graphs for validation, and the rest for testing. Statistics of the datasets are summarized in Table 1.

Besides GCNII (5), we also include GCNII*, a variant of GCNII that employs different weight matrices for the smoothed representation $\tilde{P} H^{(\ell)}$ and the initial residual $H^{(0)}$. Formally, the $(\ell + 1)$-th layer of GCNII* is defined as

$$H^{(\ell+1)} = \sigma\left( (1 - \alpha_\ell) \tilde{P} H^{(\ell)} \left( (1 - \beta_\ell) I_n + \beta_\ell W_1^{(\ell)} \right) + \alpha_\ell H^{(0)} \left( (1 - \beta_\ell) I_n + \beta_\ell W_2^{(\ell)} \right) \right).$$

Table 2. Summary of classification accuracy (%) results on Cora, Citeseer, and Pubmed. The number in parentheses corresponds to the number of layers of the model.

Method        Cora        Citeseer    Pubmed
GCN           81.5        71.1        79.0
GAT           83.1        70.8        78.5
APPNP         83.3        71.8        80.1
JKNet         81.1 (4)    69.8 (16)   78.1 (32)
JKNet(Drop)   83.3 (4)    72.6 (16)   79.2 (32)
Incep(Drop)   83.5 (64)   72.7 (4)    79.5 (4)
GCNII         ± (64)  ± (32)  80.2 ± ± ± ± (16)

As mentioned in Section 3, we set $\beta_\ell = \log\left( \frac{\lambda}{\ell} + 1 \right) \approx \frac{\lambda}{\ell}$, where $\lambda$ is a hyperparameter. For the semi-supervised node classification task, we apply the standard fixed training/validation/testing split (Yang et al., 2016) on three datasets Cora, Citeseer, and Pubmed, with 20 nodes per class for training, 500 nodes for validation and 1,000 nodes for testing. For baselines, we include two recent deep GNN models: JKNet (Xu et al., 2018) and DropEdge (Rong et al., 2020). As suggested in (Rong et al., 2020), we equip DropEdge on three backbones: GCN (Kipf & Welling, 2017), JKNet (Xu et al., 2018) and IncepGCN (Rong et al., 2020). We also include three state-of-the-art shallow models: GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018) and APPNP (Klicpera et al., 2019a).

We use the Adam SGD optimizer (Kingma & Ba, 2015) with a learning rate of 0.01 and early stopping with a patience of 100 epochs to train GCNII and GCNII*. We set $\alpha_\ell$ = 0. and L regularization to . for the dense layer on all datasets. We perform a grid search to tune the other hyper-parameters for models with different depths based on the accuracy on the validation set. More details of hyper-parameters are listed in the supplementary materials.

Comparison with SOTA.

Table 2 reports the mean classification accuracy with the standard deviation on the test nodes of GCN and GCNII after 100 runs. We reuse the metrics already reported in (Fey & Lenssen, 2019) for GCN, GAT, and APPNP, and the best metrics reported in (Rong et al., 2020) for JKNet, JKNet(Drop) and Incep(Drop). Our results demonstrate that GCNII and GCNII* achieve new state-of-the-art performance across all three datasets. Notably, GCNII outperforms the previous state-of-the-art methods by at least …. It is also worthwhile to note that the two recent deep models, JKNet and IncepGCN with DropEdge, do not seem to offer significant advantages over the shallow model APPNP. On the other hand, our method achieves this result with a 64-layer model, which demonstrates the benefit of deep network structures.

Table 3. Summary of classification accuracy (%) results with various depths.

Dataset    Method    2      4      8      16     32     64
Cora       GCN       …      …      …      …      …      …
           GCNII     82.2   82.6   84.2   84.6   85.4   …
           GCNII*    80.2   82.3   82.8   83.5   84.9   …
Citeseer   GCN       …      …      …      …      …      …
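As a small illustration of the schedule $\beta_\ell = \log(\lambda/\ell + 1)$ used in these experiments, the sketch below computes it for every layer so that it can be compared with the approximation $\lambda/\ell$. The function name `beta_schedule` and the choice to index layers from $\ell = 1$ (so that the division is well defined) are assumptions made for this example, not details stated in the text above.

```python
import numpy as np


def beta_schedule(lam, num_layers):
    """Return beta_l = log(lam / l + 1) for l = 1, ..., num_layers.

    Assumption: layers are indexed from 1, so lam / l never divides by zero.
    """
    layers = np.arange(1, num_layers + 1)
    return np.log(lam / layers + 1.0)


# Example: lambda = 0.5 on a 64-layer model (the Cora setting in Table 6).
betas = beta_schedule(0.5, 64)
approx = 0.5 / np.arange(1, 65)  # the approximation lambda / l
```

Because $\log(1 + t) \approx t$ for small $t$, the two arrays agree closely in the deeper layers, where $\lambda/\ell$ is small, and differ most at $\ell = 1$.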

A detailed comparison with other deep models.

Table 3 summarizes the results for the deep models with various numbers of layers. We reuse the best-reported results for JKNet, JKNet(Drop) and Incep(Drop) (https://github.com/DropEdge/DropEdge). We observe that on Cora and Citeseer, the performance of GCNII and GCNII* consistently improves as we increase the number of layers. On Pubmed, GCNII and GCNII* achieve the best results with 16 layers, and maintain similar performance as we increase the network depth to 64. We attribute this quality to the identity mapping technique. Overall, the results suggest that with initial residual and identity mapping, we can resolve the over-smoothing problem and extend the vanilla GCN into a truly deep model. On the other hand, the performance of GCN with DropEdge and JKNet drops rapidly as the number of layers exceeds 32, which means they still suffer from over-smoothing.

Table 4. Summary of micro-averaged F1 scores on PPI.

Method                               PPI
GraphSAGE (Hamilton et al., 2017)    61.2
VR-GCN (Chen et al., 2018b)          97.8
GaAN (Zhang et al., 2018)            98.71
GAT (Veličković et al., 2018)        97.3
JKNet (Xu et al., 2018)              97.6
GeniePath (Liu et al., 2019)         98.5
Cluster-GCN (Chiang et al., 2019)    99.36
GCNII                                99.53 ± …
GCNII*                               … ± …

We now evaluate GCNII in the task of full-supervised node classification. Following the setting in (Pei et al., 2020), we use 7 datasets: Cora, Citeseer, Pubmed, Chameleon, Cornell, Texas, and Wisconsin. For each dataset, we randomly split the nodes of each class into 60%, 20%, and 20% for training, validation and testing, and measure the performance of all models on the test sets over 10 random splits, as suggested in (Pei et al., 2020). We fix the learning rate to 0.01, the dropout rate to 0.5 and the number of hidden units to 64 on all datasets, and perform a hyper-parameter search to tune the other hyper-parameters based on the validation set. The detailed configuration of all models for full-supervised node classification can be found in the supplementary materials. Besides the previously mentioned baselines, we also include three variants of Geom-GCN (Pei et al., 2020), as they are the state-of-the-art models on these datasets.

Table 5 reports the mean classification accuracy of each model. We reuse the metrics already reported in (Pei et al., 2020) for GCN, GAT, and Geom-GCN. We observe that GCNII and GCNII* achieve new state-of-the-art results on 6 out of 7 datasets, which demonstrates the superiority of the deep GCNII framework. Notably, GCNII* outperforms APPNP by over 12% on the Wisconsin dataset. This result suggests that by introducing non-linearity into each layer, the predictive power of GCNII is stronger than that of the linear model APPNP.
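The per-class random split described above (60% training, 20% validation, 20% testing, repeated over 10 random splits) could be produced along the following lines. This is a minimal sketch and not the exact procedure of (Pei et al., 2020): the function name `per_class_split`, the rounding rule, and the use of the split index as the random seed are assumptions made for illustration.

```python
import numpy as np


def per_class_split(labels, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly split node indices class by class into train/val/test sets.

    Assumption: set sizes are rounded per class; the remainder goes to test.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for c in np.unique(labels):
        # Shuffle the indices of all nodes that carry label c.
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = int(round(train_frac * len(idx)))
        n_val = int(round(val_frac * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return np.array(train), np.array(val), np.array(test)


# Ten random splits, one per seed (a toy label vector stands in for a dataset).
toy_labels = [0] * 10 + [1] * 20
splits = [per_class_split(toy_labels, seed=s) for s in range(10)]
```

Reported accuracy would then be the mean of the test accuracy over the ten splits.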

Table 5. Mean classification accuracy of full-supervised node classification.

Method        Cora         Cite.        Pubm.        Cham.       Corn.        Texa.        Wisc.
GCN           85.77        73.68        88.13        28.18       52.70        52.16        45.88
GAT           86.37        74.32        87.62        42.93       54.32        58.38        49.41
Geom-GCN-I    85.19        …            …            …           …            …            …
GCNII         … (64)       77.08 (64)   89.57 (64)   60.61 (8)   74.86 (16)   69.46 (32)   74.12 (16)
GCNII*        88.01 (64)   77.13 (64)   … (64)       … (8)       … (16)       … (32)       … (16)

For the inductive learning task, we apply 9-layer GCNII and GCNII* models with 2048 hidden units on the PPI dataset. We fix the following hyperparameters: $\alpha_\ell$ = 0.…, $\lambda$ = 1.… and a learning rate of 0.001. Due to the large volume of training data, we set the dropout rate to 0.2 and the weight decay to zero. Following (Veličković et al., 2018), we also add a skip connection from the $\ell$-th layer to the $(\ell + 1)$-th layer of GCNII and GCNII* to speed up the convergence of the training process. We compare GCNII with the following state-of-the-art methods: GraphSAGE (Hamilton et al., 2017), VR-GCN (Chen et al., 2018b), GaAN (Zhang et al., 2018), GAT (Veličković et al., 2018), JKNet (Xu et al., 2018), GeniePath (Liu et al., 2019), and Cluster-GCN (Chiang et al., 2019). The metrics are summarized in Table 4.

In concordance with our expectations, the results show that GCNII and GCNII* achieve new state-of-the-art performance on PPI. In particular, GCNII achieves this performance with a 9-layer model, while all baseline models have at most 5 layers. This suggests that larger predictive power can also be leveraged by increasing the network depth in the task of inductive learning.
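For readers unfamiliar with the metric in Table 4, the micro-averaged F1 score pools true positives, false positives and false negatives over all nodes and all labels before computing a single F1 value. The sketch below shows this for binary multi-label predictions such as those on PPI; the function name `micro_f1` and the convention of returning 0 when there are no positives at all are assumptions made for this example.

```python
import numpy as np


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary multi-label arrays of shape (nodes, labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # Pool the counts over every (node, label) pair.
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    # Assumed convention: define the score as 0 when nothing is positive.
    return float(2.0 * tp / denom) if denom > 0 else 0.0
```

With pooled counts, frequent labels contribute more to the score than rare ones, which is the usual reason for reporting the micro average on multi-label data.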

Recall that Conjecture 1 suggests that nodes with higher degrees are more likely to suffer from over-smoothing. To verify this conjecture, we study how the classification accuracy varies with node degree in the semi-supervised node classification task on Cora, Citeseer, and Pubmed. More specifically, we group the nodes of each graph according to their degrees. The $i$-th group consists of nodes with degrees in the range $[2^i, 2^{i+1})$ for $i = 0, \ldots, \infty$. For each group, we report the average classification accuracy of GCN with residual connection with various network depths in Figure 1.

We have the following observations. First of all, we note that the accuracy of the 2-layer GCN model increases with the node degree. This is as expected, as nodes with higher degrees generally gain more information from their neighbors. However, as we extend the network depth, the accuracy of high-degree nodes drops more rapidly than that of low-degree nodes. Notably, GCN with 64 layers is unable to classify nodes with degrees larger than 100. This suggests that over-smoothing indeed has a greater impact on nodes with higher degrees.

Figure 2 shows the results of an ablation study that evaluates the contributions of our two techniques: initial residual connection and identity mapping. We make three observations from Figure 2: 1) Directly applying identity mapping to the vanilla GCN retards the effect of over-smoothing marginally. 2) Directly applying initial residual connection to the vanilla GCN relieves over-smoothing significantly. However, the best performance is still achieved by the 2-layer model. 3) Applying identity mapping and initial residual connection simultaneously ensures that the accuracy increases with the network depth. This result suggests that both techniques are needed to solve the problem of over-smoothing.
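The degree grouping used in the over-smoothing analysis above, where group $i$ holds nodes with degree in $[2^i, 2^{i+1})$, can be sketched as follows. The function name `accuracy_by_degree_group` and the decision to skip degree-0 nodes (which fall outside every range) are assumptions made for this example.

```python
def accuracy_by_degree_group(degrees, correct):
    """Average accuracy per degree group, where group i covers [2**i, 2**(i+1)).

    degrees: iterable of non-negative integer node degrees.
    correct: iterable of 0/1 flags, 1 if the node was classified correctly.
    """
    sums, counts = {}, {}
    for d, c in zip(degrees, correct):
        d = int(d)
        if d < 1:
            continue  # assumption: degree-0 nodes belong to no group
        # For d >= 1, bit_length() - 1 equals floor(log2(d)) exactly.
        i = d.bit_length() - 1
        sums[i] = sums.get(i, 0.0) + float(c)
        counts[i] = counts.get(i, 0) + 1
    return {i: sums[i] / counts[i] for i in sorted(sums)}
```

Using the integer bit length avoids floating-point rounding at exact powers of two, so a node of degree 8 is always placed in group 3.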

7. Conclusion

We propose GCNII, a simple and deep GCN model that prevents over-smoothing by initial residual connection and identity mapping. The theoretical analysis shows that GCNII is able to express a $K$-order polynomial filter with arbitrary coefficients. For vanilla GCN with multiple layers, we provide theoretical and empirical evidence that nodes with higher degrees are more likely to suffer from over-smoothing. Experiments show that the deep GCNII model achieves new state-of-the-art results on various semi- and full-supervised tasks. Interesting directions for future work include combining GCNII with the attention mechanism and analyzing the behavior of GCNII with the ReLU operation.

Acknowledgements

This research was supported in part by National Natural Science Foundation of China (No. 61832017, No. 61932001 and No. 61972401), by Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098, by the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China under Grant 18XNLG21, by Shanghai Science and Technology Commission (Grant No. 17JC1420200), by Shanghai Sailing Program (Grant No. 18YF1401200) and a research fund supported by Alibaba Group through Alibaba Innovative Research Program.

Figure 1. Semi-supervised node classification accuracy vs. degree. (Panels: Cora, Citeseer, Pubmed. x-axis: Degree; y-axis: Accuracy. Curves: GCNII-64, GCN-2, GCN-8, GCN-16, GCN-64.)

Figure 2. Ablation study on initial residual and identity mapping. (Panels: Cora, Citeseer, Pubmed. x-axis: Layers; y-axis: Accuracy. Curves: GCN, GCN+InitialResidual, GCN+IdentityMapping, GCNII.)

References

Abu-El-Haija, S., Perozzi, B., Kapoor, A., Alipourfard, N., Lerman, K., Harutyunyan, H., Steeg, G. V., and Galstyan, A. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML, 2019.
Chen, J., Ma, T., and Xiao, C. Fastgcn: Fast learning with graph convolutional networks via importance sampling. In ICLR, 2018a.
Chen, J., Zhu, J., and Song, L. Stochastic training of graph convolutional networks with variance reduction. In ICML, 2018b.
Chiang, W., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In KDD, pp. 257–266. ACM, 2019.
Chung, F. Four proofs for the cheeger inequality and graph partition algorithms. In Proceedings of ICCM, volume 2, pp. 378, 2007.
Dave, V. S., Zhang, B., Chen, P., and Hasan, M. A. Neural-brane: Neural bayesian personalized ranking for attributed network embedding. Data Science and Engineering, 4(2):119–131, 2019.
Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3837–3845, 2016.
Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
Fout, A., Byrd, J., Shariat, B., and Ben-Hur, A. Protein interface prediction using graph convolutional networks. In NeurIPS, pp. 6530–6539, 2017.
Gao, H. and Ji, S. Graph u-nets. In ICML, 2019.
Guo, S., Lin, Y., Feng, N., Song, C., and Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In AAAI, 2019.
Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. In NeurIPS, 2017.
Hardt, M. and Ma, T. Identity matters in deep learning. In ICLR, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

Huang, W., Zhang, T., Rong, Y., and Huang, J. Adaptive sampling towards fast graph representation learning. In NeurIPS, pp. 4563–4572, 2018.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
Klicpera, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. In ICLR, 2019a.
Klicpera, J., Weißenberger, S., and Günnemann, S. Diffusion improves graph learning. In NeurIPS, pp. 13333–13345, 2019b.
LeCun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
Lee, J., Lee, I., and Kang, J. Self-attention graph pooling. In ICML, 2019.
Li, C. and Goldwasser, D. Encoding social information with graph convolutional networks for political perspective detection in news media. In ACL, 2019.
Li, H., Yang, Y., Chen, D., and Lin, Z. Optimization algorithm inspired deep neural network structure design. arXiv preprint arXiv:1810.01638, 2018a.
Li, J., Han, Z., Cheng, H., Su, J., Wang, P., Zhang, J., and Pan, L. Predicting path failure in time-evolving graphs. In KDD. ACM, 2019.
Li, Q., Han, Z., and Wu, X. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018b.
Li, R., Wang, S., Zhu, F., and Huang, J. Adaptive graph convolutional neural networks. In AAAI, 2018c.
Liu, Z., Chen, C., Li, L., Zhou, J., Li, X., Song, L., and Qi, Y. Geniepath: Graph neural networks with adaptive receptive paths. In AAAI, 2019.
Ma, J., Wen, J., Zhong, M., Chen, W., and Li, X. MMM: multi-source multi-net micro-video recommendation with clustered hidden item representation learning. Data Science and Engineering, 4(3):240–253, 2019.
Oono, K. and Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020.
Page, L., Brin, S., Motwani, R., and Winograd, T. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
Papyan, V., Romano, Y., and Elad, M. Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research, 18(1):2887–2938, 2017.
Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. Geom-gcn: Geometric graph convolutional networks. In ICLR, 2020.
Qiu, J., Tang, J., Ma, H., Dong, Y., Wang, K., and Tang, J. Deepinf: Social influence prediction with deep learning. In KDD, pp. 2110–2119. ACM, 2018.
Rong, Y., Huang, W., Xu, T., and Huang, J. Dropedge: Towards deep graph convolutional networks on node classification. In ICLR, 2020.
Rozemberczki, B., Allen, C., and Sarkar, R. Multi-scale attributed node embedding, 2019.
Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.
Shang, J., Xiao, C., Ma, T., Li, H., and Sun, J. Gamenet: Graph augmented memory networks for recommending medication combination. In AAAI, 2019.
Thekumparampil, K. K., Wang, C., Oh, S., and Li, L.-J. Attention-based graph neural network for semi-supervised learning, 2018.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph Attention Networks. In ICLR, 2018.
Veličković, P., Fedus, W., Hamilton, W. L., Liò, P., Bengio, Y., and Hjelm, R. D. Deep graph infomax. In ICLR, 2019.
Wang, G., Ying, R., Huang, J., and Leskovec, J. Improving graph attention networks with large margin-based constraints. arXiv preprint arXiv:1910.11945, 2019.
Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In ICML, pp. 6861–6871, 2019.
Xu, B., Shen, H., Cao, Q., Qiu, Y., and Cheng, X. Graph wavelet neural network. In ICLR, 2019.
Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
Yang, Z., Cohen, W. W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In KDD, pp. 974–983. ACM, 2018.
Zhang, J. and Ghanem, B. Ista-net: Interpretable optimization-inspired deep network for image compressive sensing. In CVPR, pp. 1828–1837, 2018.
Zhang, J., Shi, X., Xie, J., Ma, H., King, I., and Yeung, D. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. In UAI, 2018.
Zhao, L., Peng, X., Tian, Y., Kapadia, M., and Metaxas, D. N. Semantic graph convolutional networks for 3d human pose regression. In CVPR, pp. 3425–3435, 2019.

A. Proofs

A.1. Proof of Theorem 2

Proof.

For simplicity, we assume the signal vector $x$ to be non-negative. Note that we can convert $x$ into a non-negative input layer $H^{(0)}$ by a linear transformation. We consider a weaker version of GCNII by fixing $\alpha_\ell$ = 0.… and fixing the weight matrix $(1 - \beta_\ell) I_n + \beta_\ell W^{(\ell)}$ to be $\gamma_\ell I_n$, where $\gamma_\ell$ is a learnable parameter. We have
\[ H^{(\ell+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \left( H^{(\ell)} + x \right) \gamma_\ell I_n \right). \]
Since the input feature $x$ is non-negative, we can remove the ReLU operation:
\[ H^{(\ell+1)} = \gamma_\ell \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \left( H^{(\ell)} + x \right) = \gamma_\ell \left( \left( I_n - \tilde{L} \right) \cdot \left( H^{(\ell)} + x \right) \right). \]
Consequently, we can express the final representation as
\[ H^{(K-1)} = \left( \sum_{\ell=0}^{K-1} \left( \prod_{k=K-\ell-1}^{K-1} \gamma_k \right) \left( I_n - \tilde{L} \right)^{\ell} \right) x. \tag{8} \]
On the other hand, a polynomial filter of graph $\tilde{G}$ can be expressed as
\[ \left( \sum_{k=0}^{K-1} \theta_k \tilde{L}^k \right) x = \left( \sum_{k=0}^{K-1} \theta_k \left( I_n - \left( I_n - \tilde{L} \right) \right)^k \right) x = \left( \sum_{k=0}^{K-1} \theta_k \left( \sum_{\ell=0}^{k} (-1)^{\ell} \binom{k}{\ell} \left( I_n - \tilde{L} \right)^{\ell} \right) \right) x. \]
Switching the order of summation, it follows that a $K$-order polynomial filter $\left( \sum_{k=0}^{K-1} \theta_k \tilde{L}^k \right) x$ can be expressed as
\[ \left( \sum_{k=0}^{K-1} \theta_k \tilde{L}^k \right) x = \left( \sum_{\ell=0}^{K-1} \left( \sum_{k=\ell}^{K-1} \theta_k (-1)^{\ell} \binom{k}{\ell} \right) \left( I_n - \tilde{L} \right)^{\ell} \right) x. \tag{9} \]
To show that GCNII can express an arbitrary $K$-order polynomial filter, we need to prove that there exists a solution $\gamma_\ell$, $\ell = 0, \ldots, K-1$, such that the corresponding coefficients of $\left( I_n - \tilde{L} \right)^{\ell}$ in equations (8) and (9) are equivalent. More precisely, we need to show that the equation system
\[ \prod_{k=K-\ell-1}^{K-1} \gamma_k = \sum_{k=\ell}^{K-1} \theta_k (-1)^{\ell} \binom{k}{\ell}, \quad \ell = 0, \ldots, K-1, \]
has a solution $\gamma_\ell$, $\ell = 0, \ldots, K-1$. Since the left-hand side is a partial product of $\gamma_k$ from $K-\ell-1$ to $K-1$, we can solve the equation system by
\[ \gamma_{K-\ell-1} = \sum_{k=\ell}^{K-1} \theta_k (-1)^{\ell} \binom{k}{\ell} \Bigg/ \sum_{k=\ell-1}^{K-1} \theta_k (-1)^{\ell-1} \binom{k}{\ell-1}, \tag{10} \]
for $\ell = 1, \ldots, K-1$ and $\gamma_{K-1} = \sum_{k=0}^{K-1} \theta_k$. Note that the above solution may fail when $\sum_{k=\ell-1}^{K-1} \theta_k (-1)^{\ell-1} \binom{k}{\ell-1} = 0$. In this case, we can set $\gamma_{K-\ell-1}$ sufficiently large so that equation (10) is still a good approximation. We also note that this case is rare because it implies that the $K$-order filter ignores all features from the $\ell$-hop neighbors. This proves that a $K$-layer GCNII can express the $K$-th order polynomial filter $\left( \sum_{k=0}^{K-1} \theta_k \tilde{L}^k \right) x$ with arbitrary coefficients $\theta$.

A.2. Proof of Theorem 1
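The limit that the proof in this subsection establishes, namely that $h^{(K)}$ approaches $\frac{\langle \tilde{D}^{1/2}\mathbf{1}, x \rangle}{2m+n} \tilde{D}^{1/2}\mathbf{1}$ as $K$ grows, can be checked numerically on a small connected graph. The NumPy sketch below is only an illustration: the function name `lazy_propagation`, the 4-node path graph, and the choice $K = 200$ are assumptions made for this example.

```python
import numpy as np


def lazy_propagation(adj, x, num_layers):
    """Return (h, limit) for K = num_layers lazy propagation steps.

    h     = ((I + D~^{-1/2} A~ D~^{-1/2}) / 2)^K x
    limit = <D~^{1/2} 1, x> / (2m + n) * D~^{1/2} 1
    """
    adj = np.asarray(adj, dtype=float)
    x = np.asarray(x, dtype=float)
    n = adj.shape[0]
    a_tilde = adj + np.eye(n)              # add self-loops
    deg_tilde = a_tilde.sum(axis=1)        # d_i + 1 for every node
    inv_sqrt = 1.0 / np.sqrt(deg_tilde)
    p_sym = inv_sqrt[:, None] * a_tilde * inv_sqrt[None, :]
    lazy = 0.5 * (np.eye(n) + p_sym)
    h = np.linalg.matrix_power(lazy, num_layers) @ x
    sqrt_deg = np.sqrt(deg_tilde)
    # deg_tilde.sum() equals 2m + n on the self-looped graph.
    limit = (sqrt_deg @ x) / deg_tilde.sum() * sqrt_deg
    return h, limit


# Hypothetical example: a path graph on 4 nodes with an arbitrary signal.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
h, limit = lazy_propagation(path, [1.0, -2.0, 0.5, 3.0], 200)
```

In the limit, the entry for node $j$ depends on the input only through a single inner product and is scaled by $\sqrt{d_j + 1}$, which is the sense in which the representations become indistinguishable up to node degree.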

To prove Theorem 1, we need the following Cheeger inequality (Chung, 2007) for lazy random walks.

Lemma 1 ((Chung, 2007)). Let $p_i^{(K)} = \left( \frac{I_n + \tilde{A} \tilde{D}^{-1}}{2} \right)^K e_i$ be the $K$-th transition probability vector from node $i$ on the connected self-looped graph $\tilde{G}$. Let $\lambda_{\tilde{G}}$ denote the spectral gap of $\tilde{G}$. The $j$-th entry of $p_i^{(K)}$ can be bounded by
\[ \left| p_i^{(K)}(j) - \frac{d_j + 1}{2m + n} \right| \le \sqrt{\frac{d_j + 1}{d_i + 1}} \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K. \]

Proof of Theorem 1. Note that $I_n = \tilde{D}^{-1/2} \tilde{D}^{1/2}$, so we have
\[ h^{(K)} = \left( \frac{I_n + \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}}{2} \right)^K \cdot x = \left( \tilde{D}^{-1/2} \left( \frac{I_n + \tilde{A} \tilde{D}^{-1}}{2} \right) \tilde{D}^{1/2} \right)^K \cdot x = \tilde{D}^{-1/2} \left( \frac{I_n + \tilde{A} \tilde{D}^{-1}}{2} \right)^K \cdot \left( \tilde{D}^{1/2} x \right). \]
We express $\tilde{D}^{1/2} x$ as a linear combination of the standard basis:
\[ \tilde{D}^{1/2} x = (D + I_n)^{1/2} x = \sum_{i=1}^{n} \left( x(i) \sqrt{d_i + 1} \right) \cdot e_i, \]
and it follows that
\[ h^{(K)} = \tilde{D}^{-1/2} \left( \frac{I_n + \tilde{A} \tilde{D}^{-1}}{2} \right)^K \cdot \sum_{i=1}^{n} \left( x(i) \sqrt{d_i + 1} \right) \cdot e_i = \sum_{i=1}^{n} x(i) \sqrt{d_i + 1} \cdot \tilde{D}^{-1/2} \left( \frac{I_n + \tilde{A} \tilde{D}^{-1}}{2} \right)^K \cdot e_i. \]
We note that $\left( \frac{I_n + \tilde{A} \tilde{D}^{-1}}{2} \right)^K \cdot e_i = p_i^{(K)}$ is the $K$-th transition probability vector of a random walk from node $i$. By Lemma 1, the $j$-th entry of $p_i^{(K)}$ can be bounded by
\[ \left| p_i^{(K)}(j) - \frac{d_j + 1}{2m + n} \right| \le \sqrt{\frac{d_j + 1}{d_i + 1}} \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K, \]
or equivalently,
\[ p_i^{(K)}(j) = \frac{d_j + 1}{2m + n} \pm \sqrt{\frac{d_j + 1}{d_i + 1}} \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K. \]
Therefore, we can express the $j$-th entry of $h^{(K)}$ as
\[ h^{(K)}(j) = \left( \sum_{i=1}^{n} \sqrt{d_i + 1}\, x(i) \cdot \tilde{D}^{-1/2} p_i^{(K)} \right)(j) \]
\[ = \sum_{i=1}^{n} \sqrt{d_i + 1}\, x(i) \frac{1}{\sqrt{d_j + 1}} \cdot \left( \frac{d_j + 1}{2m + n} \pm \sqrt{\frac{d_j + 1}{d_i + 1}} \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K \right) \]
\[ = \sum_{i=1}^{n} \frac{\sqrt{(d_j + 1)(d_i + 1)}}{2m + n}\, x(i) \pm \sum_{i=1}^{n} x(i) \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K. \]
This proves
\[ h^{(K)} = \frac{\left\langle \tilde{D}^{1/2} \mathbf{1}, x \right\rangle}{2m + n} \tilde{D}^{1/2} \mathbf{1} \pm \left( \sum_{i=1}^{n} x_i \right) \cdot \left( 1 - \frac{\lambda_{\tilde{G}}^2}{2} \right)^K \cdot \mathbf{1}, \]
and the Theorem follows.

B. Hyper-parameters details

Table 6 summarizes the training configuration of GCNII for the semi-supervised task. $L_d$ and $L_c$ denote the weight decay for the dense layer and the convolutional layer, respectively. The searched hyper-parameters include the number of layers, hidden dimension, dropout, $\lambda$ and $L_c$ regularization.

Table 7 summarizes the training configuration of all models for the full-supervised task. We use the full-supervised hyper-parameter setting from DropEdge for JKNet and IncepGCN on citation networks. For other cases, grid search was performed over the following search space: layers (4, 8, 16, 32, 64), dropedge (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), $\alpha_\ell$ (0.1, 0.2, 0.3, 0.4, 0.5), $\lambda$ (0.5, 1, 1.5), $L_2$ regularization (1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6).

Table 6. The hyper-parameters for Table 2.

Dataset    Hyper-parameters
Cora       layers: 64, $\alpha_\ell$: 0.1, lr: 0.01, hidden: 64, $\lambda$: 0.5, dropout: 0.6, $L_c$: 0.01, $L_d$: 0.0005
Citeseer   layers: 32, $\alpha_\ell$: 0.1, lr: 0.01, hidden: 256, $\lambda$: 0.6, dropout: 0.7, $L_c$: 0.01, $L_d$: 0.0005
Pubmed     layers: 16, $\alpha_\ell$: 0.1, lr: 0.01, hidden: 256, $\lambda$: 0.4, dropout: 0.5, $L_c$: 0.0005, $L_d$: 0.0005

Table 7. The hyper-parameters for Table 5.

Dataset   Method     Hyper-parameters
Cora      APPNP      $\alpha$: 0.1, L2: 0.0005, lr: 0.01, hidden: 64, dropout: 0.5
Cora      GCNII      layers: 64, $\alpha_\ell$: 0.2, lr: 0.01, hidden: 64, $\lambda$: 0.5, dropout: 0.5, L2: 0.0001
Cite.     APPNP      $\alpha$: 0.5, L2: 0.0005, lr: 0.01, hidden: 64, dropout: 0.5
Cite.     GCNII      layers: 64, $\alpha_\ell$: 0.5, lr: 0.01, hidden: 64, $\lambda$: 0.5, dropout: 0.5, L2: 5e-6
Pubm.     APPNP      $\alpha$: 0.4, L2: 0.0001, lr: 0.01, hidden: 64, dropout: 0.5
Pubm.     GCNII      layers: 64, $\alpha_\ell$: 0.1, lr: 0.01, hidden: 64, $\lambda$: 0.5, dropout: 0.5, L2: 5e-6
Cham.     APPNP      $\alpha$: 0.1, L2: 1e-6, lr: 0.01, hidden: 64, dropout: 0.5
Cham.     JKNet      layers: 32, lr: 0.01, hidden: 64, dropedge: 0.7, dropout: 0.5, L2: 0.0001
Cham.     IncepGCN   layers: 8, lr: 0.01, hidden: 64, dropedge: 0.9, dropout: 0.5, L2: 0.0005
Cham.     GCNII      layers: 8, $\alpha_\ell$: 0.2, lr: 0.01, hidden: 64, $\lambda$: 1.5, dropout: 0.5, L2: 0.0005
Corn.     APPNP      $\alpha$: 0.5, L2: 0.005, lr: 0.01, hidden: 64, dropout: 0.5
Corn.     JKNet      layers: 4, lr: 0.01, hidden: 64, dropedge: 0.5, dropout: 0.5, L2: 5e-5
Corn.     IncepGCN   layers: 16, lr: 0.01, hidden: 64, dropedge: 0.7, dropout: 0.5, L2: 5e-5
Corn.     GCNII      layers: 16, $\alpha_\ell$: 0.5, lr: 0.01, hidden: 64, $\lambda$: 1, dropout: 0.5, L2: 0.001
Texa.     APPNP      $\alpha$: 0.5, L2: 0.001, lr: 0.01, hidden: 64, dropout: 0.5
Texa.     JKNet      layers: 32, lr: 0.01, hidden: 64, dropedge: 0.8, dropout: 0.5, L2: 5e-5
Texa.     IncepGCN   layers: 8, lr: 0.01, hidden: 64, dropedge: 0.8, dropout: 0.5, L2: 5e-6
Texa.     GCNII      layers: 32, $\alpha_\ell$: 0.5, lr: 0.01, hidden: 64, $\lambda$: 1.5, dropout: 0.5, L2: 0.0001
Wisc.     APPNP      $\alpha$: 0.5, L2: 0.005, lr: 0.01, hidden: 64, dropout: 0.5
Wisc.     JKNet      layers: 8, lr: 0.01, hidden: 64, dropedge: 0.8, dropout: 0.5, L2: 5e-5
Wisc.     IncepGCN   layers: 8, lr: 0.01, hidden: 64, dropedge: 0.7, dropout: 0.5, L2: 0.0001
Wisc.     GCNII      layers: 16, $\alpha_\ell$: 0.5, lr: 0.01, hidden: 64, $\lambda$: 1, dropout: 0.5, L2