Mask-GVAE: Blind Denoising Graphs via Partition
Jia Li, Mengzhou Liu, Honglei Zhang, Pengyun Wang, Yong Wen, Lujia Pan, Hong Cheng
Jia Li, The Chinese University of Hong Kong, [email protected]
Mengzhou Liu, The Chinese University of Hong Kong, [email protected]
Honglei Zhang, Tianjin University, [email protected]
Pengyun Wang, Huawei Noah's Ark Lab, [email protected]
Yong Wen, Huawei Noah's Ark Lab, [email protected]
Lujia Pan, Huawei Noah's Ark Lab, [email protected]
Hong Cheng, The Chinese University of Hong Kong, [email protected]
ABSTRACT
We present Mask-GVAE, a variational generative model for blind denoising large discrete graphs, in which "blind denoising" means we don't require any supervision from clean graphs. We focus on recovering graph structures via deleting irrelevant edges and adding missing edges, which has many applications in real-world scenarios, for example, enhancing the quality of connections in a co-authorship network. Mask-GVAE makes use of the robustness in low eigenvectors of graph Laplacian against random noise and decomposes the input graph into several stable clusters. It then harnesses the huge computations by decoding probabilistic smoothed subgraphs in a variational manner. On a wide variety of benchmarks, Mask-GVAE outperforms competing approaches by a significant margin on PSNR and WL similarity.
CCS CONCEPTS
• Mathematics of computing → Graph algorithms; • Computing methodologies → Unsupervised learning.

KEYWORDS
graph denoising; graph clustering; graph autoencoder
ACM Reference Format:
Jia Li, Mengzhou Liu, Honglei Zhang, Pengyun Wang, Yong Wen, Lujia Pan, and Hong Cheng. 2021. Mask-GVAE: Blind Denoising Graphs via Partition. In
Proceedings of the Web Conference 2021 (WWW ’21), April 19–23, 2021, Ljubljana, Slovenia.
ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3442381.3449899
1 INTRODUCTION
Recently, graph learning models [28, 60] have achieved remarkable progress in many graph related tasks. Compared with other machine learning models that build on the i.i.d. assumption, graph learning
models require as input a more sophisticated graph structure. Meanwhile, how to construct a graph structure from raw data is still an open problem. For instance, in a social network the common practice to construct graph structures is based on observed friendships between users [32]; in a protein-protein interaction (PPI) network the common practice is based on truncated spatial closeness between proteins [10]. However, these natural choices of graphs may not necessarily describe well the intrinsic relationships between the node attributes in the data [11, 45], e.g., observed friendship does not indicate true social relationship in a social network [66], and truncated spatial closeness may incorporate noisy interactions and miss true interactions between proteins in a PPI network [58]. In this work, we assume we are given a degraded graph structure, e.g., one having missing/irrelevant edges.
We aim to recover a graph structure that removes irrelevant edges and adds missing edges.
As in practice noisy-clean graph pairs are rare [39], we propose to denoise the input noisy graph without any supervision from its clean counterpart, which is referred to as blind graph denoising. Latent generative models such as variational autoencoders (VAEs) [26] have shown impressive performance on denoising images [20] and speech [7]. As images and speech exist in a continuous space, it is easy to utilize gradient descent weapons to power the denoising process. On the contrary, our problem setting involves a large discrete structure, thus it is challenging to generalize current deep generative models to our problem setting.

We present Mask-GVAE, the first variational generative model for blind denoising large graphs. A key insight of Mask-GVAE is that graph Laplacian eigenvectors associated with small eigenvalues (low eigenvectors) are stable against random edge perturbations if there is a significant cluster structure [12, 23, 46]. From another viewpoint, [19] finds that the stability of low eigenvectors lays the foundation of the robustness of spectral clustering [57]. Likewise, many real-world graphs do hold such distinct substructures, e.g., PPI networks [1], social networks [32], and co-authorship networks [67]. We show an illustrative example in Figure 1. In Mask-GVAE, we first use graph neural networks (GNNs) [28, 33] and normalized cut [54] for the fast estimation of low eigenvectors, i.e., the cluster mask. We then encode the latent variables and generate probabilistic smoothed subgraphs, conditioned on the cluster mask and latent variables. A discrete graph is then sampled upon the denoised subgraphs to circumvent the non-differentiability problem [55].
Figure 1: Graph Laplacian eigenvectors of a clean graph and its noisy counterpart. $u_2$ (the eigenvector w.r.t. the second smallest eigenvalue) is stable against random noise while $u_N$ (the eigenvector w.r.t. the largest eigenvalue) fluctuates.

An important requirement of Mask-GVAE is the ability to fast estimate stable cluster masks. While there are some neural networks proposed for computing cluster masks [3, 5, 41, 52], they either rely on other outsourcing tools such as K-means [52], or require as input supervised information [5]. We propose an end-to-end neural network to encode the cluster mask in an unsupervised manner. It differs from the above methods as (1) it uses a GNN for fast computation of low eigenvectors, as GNNs quickly shrink high eigenvectors and keep low eigenvectors [33, 60]; (2) it generalizes normalized cut [54] to work as the loss function, since the optimal solution of the spectrally relaxed normalized cut coincides with the low eigenvectors of the normalized Laplacian [57].

Another challenge for Mask-GVAE is how to incorporate hard domain-specific constraints, i.e., the cluster mask, into variational graph generation. In the literature, GVAE [30] constructs parse trees based on the input graph and then uses Recurrent Neural Networks (RNNs) to encode to and decode from these parse trees. It utilizes a binary mask to delete invalid vectors. NeVAE [50] and CGVAE [37] both leverage GNNs to generate graphs which match the statistics of the original data. They make use of a similar mask mechanism to forbid edges that violate syntactical constraints. Our model also uses the mask mechanism to generate cluster-aware graphs.

Our contributions are summarized as follows.
• We study the blind graph denoising problem. Compared with the state-of-the-art methods that mainly focus on graphs with limited sizes, our solution is the first one that does not rely on explicit eigen-decomposition and can be applied to large graphs.
• We present Mask-GVAE, the first variational generative model for blind denoising large graphs. Mask-GVAE achieves superior denoising performance to all competitors.
• We theoretically prove that low eigenvectors of the graph Laplacian are stable against random edge perturbations if there is a significant cluster structure, which lays the foundation of many subgraph/cluster based denoising methods.
• We evaluate Mask-GVAE on five benchmark data sets. Our method outperforms competing methods by a large margin on peak signal-to-noise ratio (PSNR) and Weisfeiler-Lehman (WL) similarity.

The remainder of this paper is organized as follows. Section 2 gives the problem definition. Section 3 describes our methodology. We provide a systematic theoretical study of the stability of low eigenvectors in Section 4. We report the experimental results in Section 5 and discuss related work in Section 6. Finally, Section 7 concludes the paper.
2 PROBLEM DEFINITION
Consider a graph $G = (V, A, X)$ where $V = \{v_1, v_2, \ldots, v_N\}$ is the set of nodes. We use an $N \times N$ adjacency matrix $A$ to describe the connections between nodes in $V$. $A_{ij} \in \{0, 1\}$ represents whether there is an undirected edge between nodes $v_i$ and $v_j$. We use $X = \{x_1, x_2, \ldots, x_N\}$ to denote the attribute values of nodes in $V$, where $x_i \in \mathbb{R}^d$ is a $d$-dimensional vector.

We assume a small portion of the given graph structure $A$ is degraded due to noise, incomplete data preprocessing, etc. The corruptions are two-fold: (1) missing edges, e.g., missing friendship links among users in a social network, and (2) irrelevant edges, e.g., incorrect interactions among proteins in a protein-protein interaction network. The problem of graph denoising is thus defined to recover a graph $\hat{G} = (V, \hat{A}, X)$ from the given one $G = (V, A, X)$. In this work, we don't consider noisy nodes and features, leaving this for future work.

Graph Laplacian regularization has been widely used as a signal prior in denoising tasks [11, 45]. Given the adjacency matrix $A$ and the degree matrix $D$ with $D_{ii} = \sum_j A_{ij}$, the graph Laplacian matrix is defined as $L = D - A$. Recall $L$ is a positive semidefinite matrix, $X_{:,k}^\top L X_{:,k} = \frac{1}{2}\sum_{i,j=1}^{N} A_{ij} (X_{ik} - X_{jk})^2$ measures the sum of pairwise distances between nodes in $G$, and $X_{:,k}$ is the $k$-th column vector of $X$. In this work, we consider that the recovered structure $\hat{A}$ should be coherent with respect to the features $X$. In this context, the graph denoising problem has the following objective function:

$$\arg\min \mathcal{L} = \sum_{i<j} |A_{ij} - \hat{A}_{ij}| + \omega\, \mathrm{Tr}\big(X^\top \hat{L} X\big), \quad (1)$$

where the first term is a fidelity term ensuring the recovered structure $\hat{A}$ does not deviate from the observation $A$ too much, and the second term is the graph Laplacian regularizer. $\hat{L}$ is the graph Laplacian matrix for $\hat{A}$, $\omega \ge 0$ is a weighting parameter, and $\mathrm{Tr}(\cdot)$ is defined as the sum of elements on the main diagonal of a given square matrix. In this work, we focus on blind graph denoising, i.e., we are unaware of the clean graph and the only knowledge we have is the observed graph $G = (V, A, X)$.
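To make Eq. 1 concrete, the following sketch evaluates the objective for dense, symmetric 0/1 adjacency matrices; the function name and the NumPy representation are ours, not part of the paper.

```python
import numpy as np

def denoising_objective(A, A_hat, X, omega=1.0):
    """Evaluate Eq. 1 for dense, symmetric 0/1 adjacency matrices.

    A: observed adjacency (N, N); A_hat: recovered adjacency (N, N);
    X: node attributes (N, d); omega: regularization weight (>= 0).
    """
    # Fidelity term: count each modified node pair once (upper triangle).
    fidelity = np.abs(np.triu(A - A_hat, k=1)).sum()
    # Graph Laplacian regularizer Tr(X^T L_hat X) over the recovered graph.
    L_hat = np.diag(A_hat.sum(axis=1)) - A_hat
    smoothness = np.trace(X.T @ L_hat @ X)
    return fidelity + omega * smoothness
```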
Our problem formulation is applicable to many real applications, as illustrated below.

Application 1. In many graph-based tasks where graph structures are not readily available, one needs to construct a graph structure first [18, 45, 64]. As an instance, the k-nearest neighbors algorithm (k-NN) is a popular graph construction method in image segmentation tasks [14]. One property of a graph constructed by k-NN is that each pixel (node) has a fixed number of nearest neighbors, which inevitably introduces irrelevant and missing edges. Thus, graph denoising can be used to get a denoised graph $\hat{A}$ by admitting small modifications with the observed signal prior $A$.
Application 2. Consider a knowledge graph (KG) where nodes represent entities and edges encode predicates. It is safe to assume there are some missing connections and irrelevant ones, as KGs are known for their incompleteness [65]. In this context, the task of graph denoising is to modify the observed KG such that its quality can be enhanced.
Application 3.
Consider a co-authorship network as another example, where a node represents an author and an edge represents the co-authorship relation between two authors. Due to data quality issues in the input bibliographic data (e.g., name ambiguity) [24], the constructed co-authorship network may contain noise in the form of irrelevant and missing edges. In this scenario, our graph denoising method can be applied to the co-authorship network to remove the noise.

We then contrast the difference between graph denoising and other related areas, including adversarial learning and link prediction, below.

Graph denoising vs. adversarial learning.
An adversarial learning method considers a specific task, e.g., node classification [68]. In this regard, there is a loss function (e.g., cross-entropy for node classification) guiding the attack and defense. Differently, we consider general graph denoising without any task-specific loss. We thus don't compare with the literature of adversarial learning in this work.

Graph denoising vs. link prediction.
While link prediction is termed as predicting whether two nodes in a network are likely to have a link [36], currently most methods [36, 66] focus on inferring missing links from an observed network. Differently, graph denoising considers that the observed network contains both missing links and irrelevant links. Moreover, while link prediction can take advantage of separate trustworthy training data to learn the model, our blind setting means our model needs to denoise the structures based on the noisy input itself.
3 METHODOLOGY
Motivated by the recent advancements of discrete VAEs [38] and denoising VAEs [7, 20] on images and speech, we propose to extend VAEs to the problem of denoising large discrete graphs. However, current graph VAEs still suffer from scalability issues [48, 49] and cannot be generalized to large-scale graph settings. As another issue, current VAEs suffer from a particular local optimum known as component collapsing [25], meaning a good optimization of the prior term results in a bad reconstruction term; if we directly attached Eq. 1 to the loss function of a VAE, the situation would be worse. In this work, we present Mask-GVAE, which first decomposes a large graph into stable subgraphs and then generates smoothed subgraphs in a variational manner. Specifically, Mask-GVAE consists of two stages, one computing the cluster mask and the other generating the denoised graph.
A cluster mask $C$ encodes the stable low eigenvectors of the graph Laplacian. Specifically, $C \in \{0, 1\}^{N \times K}$ is a binary mask, where $K$ is the number of clusters, $C_{ik} = 1$ if node $v_i$ belongs to cluster $k$ and $C_{ik} = 0$ otherwise.

The loss function.
The definition of graph cut is:

$$\frac{1}{K} \sum_{k} cut(V_k, \bar{V}_k), \quad (2)$$

where $V_k$ is the node set assigned to cluster $k$, $\bar{V}_k = V \setminus V_k$, and $cut(V_k, \bar{V}_k) = \sum_{i \in V_k, j \in \bar{V}_k} A_{ij}$ calculates the number of edges with one end point inside cluster $V_k$ and the other in the rest of the graph. Taking $C$ into consideration, the graph cut can be re-written as:

$$\frac{1}{K} \sum_{k} \big(C_{:,k}^\top D C_{:,k} - C_{:,k}^\top A C_{:,k}\big) = \frac{1}{K} \mathrm{Tr}\big(C^\top L C\big), \quad (3)$$

in which $D$ and $L$ are the degree and Laplacian matrices respectively. $C_{:,k}^\top D C_{:,k}$ stands for the number of edges with at least one end point in $V_k$ and $C_{:,k}^\top A C_{:,k}$ counts the number of edges with both end points in cluster $V_k$. The normalized cut [41, 54] thus becomes:

$$\frac{1}{K} \mathrm{Tr}\big((C^\top L C) \oslash (C^\top D C)\big), \quad (4)$$

where $\oslash$ is element-wise division. Note an explicit constraint is that $C^\top C$ is a diagonal matrix; we thus apply a penalization term [34], which results in a differentiable unsupervised loss function:

$$\mathcal{L}_u = \frac{1}{K} \mathrm{Tr}\big((C^\top L C) \oslash (C^\top D C)\big) + \varphi \left\lVert \frac{K}{N} C^\top C - I_K \right\rVert_F, \quad (5)$$

where $\lVert \cdot \rVert_F$ represents the Frobenius norm of a matrix.
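The relaxed normalized cut of Eq. 4 plus the orthogonality penalty of Eq. 5 can be computed directly from a (possibly soft) assignment matrix; below is a minimal NumPy sketch, where the function name and the numerical stabilizer `eps` are our additions.

```python
import numpy as np

def unsupervised_loss(C, A, phi=1.0, eps=1e-9):
    """L_u of Eq. 5 for a (possibly soft) assignment matrix C of shape (N, K)."""
    N, K = C.shape
    D = np.diag(A.sum(axis=1))
    L = D - A
    # Eq. 4: element-wise division of the two K x K matrices, then the trace.
    ncut = np.trace((C.T @ L @ C) / (C.T @ D @ C + eps)) / K
    # Penalty steering C^T C towards a scaled identity (balanced clusters).
    penalty = np.linalg.norm((K / N) * (C.T @ C) - np.eye(K), ord="fro")
    return ncut + phi * penalty
```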
The network architecture. Our architecture is similar to [34, 41], which has two main parts: (1) node embedding, and (2) cluster assignment. In the first part, we leverage two-layer graph neural networks, e.g., GCN [28] or Heatts [33], to get two-hop neighborhood-aware node representations, so that nodes that are densely connected and have similar attributes can be represented with similar node embeddings. In the second part, based on the node embeddings, we use two-layer perceptrons and a softmax function to assign similar nodes to the same cluster. The output of this neural network structure is the cluster mask $C$, which is trained by minimizing the unsupervised loss $\mathcal{L}_u$ in Eq. 5.
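As a non-authoritative illustration of this two-part architecture, the sketch below stacks two GCN-style propagation layers [28] and a two-layer perceptron head; the weight shapes and the normalization details are assumptions rather than the exact Heatts [33] operator.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cluster_mask(A, X, W1, W2, W3, W4):
    """Two GCN-style layers for node embedding, then a two-layer
    perceptron with softmax for cluster assignment.

    Assumed shapes: W1 (d, h), W2 (h, h), W3 (h, h), W4 (h, K).
    Returns a soft mask C of shape (N, K); rows can be hardened by argmax.
    """
    # Symmetrically normalized adjacency with self-loops, as in GCN [28].
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_norm = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Two-hop neighborhood-aware node representations.
    H = np.maximum(A_norm @ X @ W1, 0.0)
    H = np.maximum(A_norm @ H @ W2, 0.0)
    # Cluster assignment head.
    return softmax(np.maximum(H @ W3, 0.0) @ W4)
```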
In this subsection, we describe our method (illustrated in Figure 2), which produces a denoised graph $\hat{A}$ that is conditioned on the cluster mask $C$ and meets the fidelity constraint. As discussed, we use a latent variable model parameterized by neural networks to generate the graph $\hat{G}$. Specifically, we focus on learning a parameterized distribution over the graph $G$ and the cluster mask $C$ as follows:

$$P(\hat{A} \mid G, C) = \int q_\phi(Z \mid G, C)\, p_\theta(\hat{A} \mid G, C, Z)\, dZ, \quad (6)$$

where $q_\phi(Z \mid G, C)$ is the encoder and $p_\theta(\hat{A} \mid G, C, Z)$ is the decoder. While the encoder is straightforward and we can use the corresponding encoders in existing work [27], the decoder is hard due to the following two factors:
• Discrete decoding: generating discrete graphs is challenging.
• Cluster awareness: existing graph generation methods cannot explicitly incorporate cluster information.
To satisfy discrete decoding, we decouple the decoder into two steps: (1) a probabilistic graph decoder, which produces a probabilistic graph $\bar{A}$ in training, and (2) discrete graph refinement, in which we sample a discrete graph $\hat{A}$ in testing based on the prior knowledge $G$ and $\bar{A}$. To address cluster awareness, we propose to directly incorporate cluster information into the learning process by using the mask [30] mechanism, which powers the model with the ability to generate a denoised graph with smoothed subgraphs.

Graph encoder.
We follow VGAE [27] in using the mean field approximation to define the variational family:

$$q_\phi(Z \mid G, C) = \prod_{i=1}^{N} q_{\phi_i}(z_i \mid A, X), \quad (7)$$

where $q_{\phi_i}(z_i \mid A, X)$ follows the predefined prior distribution, namely an isotropic Gaussian with diagonal covariance. The parameters for the variational marginals $q_{\phi_i}(z_i \mid A, X)$ are specified by a two-layer GNN [28, 33]:

$$\mu, \sigma = \mathrm{GNN}_\phi(A, X), \quad (8)$$

where $\mu$ and $\sigma$ are the vectors of means and standard deviations for the variational marginals $\{q_{\phi_i}(z_i \mid A, X)\}_{i=1}^{N}$, and $\phi = \{\phi_i\}_{i=1}^{N}$ is the parameter set for encoding.
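A minimal sketch of sampling from the variational posterior of Eqs. 7-8 via the standard reparameterization trick; `gnn_mu` and `gnn_log_sigma` stand in for the two output heads of the shared two-layer GNN and are placeholders, not the paper's exact module.

```python
import numpy as np

def sample_latents(A, X, gnn_mu, gnn_log_sigma, rng=None):
    """Draw one sample Z ~ q_phi(Z | G, C) via the reparameterization trick.

    gnn_mu / gnn_log_sigma: callables returning (N, latent_dim) arrays,
    assumed to be the two output heads of the encoder GNN of Eq. 8.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mu = gnn_mu(A, X)
    log_sigma = gnn_log_sigma(A, X)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps
```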
Probabilistic graph decoder. We first compute the edge probability:

$$p_\theta(\bar{A}_{ij} \mid G, C, Z) = \bar{A}_{ij} = \mathrm{sigmoid}\big(W_{a_2}\, \mathrm{ReLU}(W_{a_1} E_{ij})\big), \quad (9)$$

where $E_{ij} = [Z_i | X_i] \odot [Z_j | X_j]$, $Z_i | X_i$ is a concatenation of $Z_i$ and $X_i$, and $\odot$ represents element-wise multiplication. Intuitively, we compute the edge probability with two-layer perceptrons and a sigmoid function.
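A sketch of Eq. 9 over dense matrices; `Wa1` and `Wa2` denote the two perceptron weights (bias terms omitted, shapes are our assumptions).

```python
import numpy as np

def edge_probabilities(Z, X, Wa1, Wa2):
    """Eq. 9: probabilistic adjacency A_bar from latents Z and attributes X.

    Assumed shapes: Wa1 (m, h), Wa2 (h, 1) with m = latent_dim + d.
    """
    H = np.concatenate([Z, X], axis=1)       # [Z_i | X_i] for every node
    E = H[:, None, :] * H[None, :, :]        # (N, N, m) element-wise products
    hidden = np.maximum(E @ Wa1, 0.0)        # ReLU(W_a1 E_ij)
    logits = (hidden @ Wa2)[..., 0]          # (N, N)
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid
```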
The loss function. We approximate $p(\bar{A} \mid G, C, Z)$ by:

$$p(\bar{A} \mid G, C, Z) = \prod_{(CC^\top)_{ij}=1} \bar{A}_{ij} \prod_{(CC^\top)_{ij}=0} (1 - \bar{A}_{ij}), \quad (10)$$

where $(CC^\top)_{ij} = 1$ indicates node $i$ and node $j$ are in the same cluster and $(CC^\top)_{ij} = 0$ otherwise. When $K$ is large, it is beneficial to re-scale the two terms over $(CC^\top)_{ij} = 1$ and $(CC^\top)_{ij} = 0$. We further define

$$S = CC^\top \odot A, \quad (11)$$

where $S_{ij} = 1$ indicates node $i$ and node $j$ are adjacent and in the same cluster w.r.t. the cluster mask $C$. With $S$, $(A - S)$ can be used to denote the inter-cluster connections.

The overall loss for Mask-GVAE is:

$$\mathcal{L}(\phi, \theta; G, C) = \mathcal{L}_{prior} - \mathbb{E}_{q_\phi(Z \mid G, C)}\big(\log p(\bar{A} \mid G, C, Z)\big), \quad (12)$$

where $\mathcal{L}_{prior} = \mathrm{KL}(q(Z \mid G, C) \,\|\, p(Z))$ is the prior loss with $p(Z) = \prod_{i=1}^{N} P(z_i) = \prod_{i=1}^{N} \mathcal{N}(z_i \mid 0, I)$. Intuitively, Eq. 12 takes into consideration both the objective function in Eq. 1 and the cluster structures.
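Under our reading of Eq. 10 above, one single-sample evaluation of the loss in Eq. 12 could look as follows; the masking via $CC^\top$ and the closed-form Gaussian KL are standard, while the diagonal handling is our choice.

```python
import numpy as np

def vae_loss(A_bar, C, mu, log_sigma, eps=1e-9):
    """One-sample estimate of Eq. 12: KL prior term minus the masked
    log-likelihood of Eq. 10 (diagonal entries of CC^T are ignored here)."""
    same = ((C @ C.T) > 0) & ~np.eye(C.shape[0], dtype=bool)
    diff = ~((C @ C.T) > 0)
    # Eq. 10: edges encouraged inside clusters, discouraged across them.
    log_lik = np.log(A_bar[same] + eps).sum() \
            + np.log(1.0 - A_bar[diff] + eps).sum()
    # Closed-form KL(q || N(0, I)) for a diagonal Gaussian posterior.
    kl = 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma)
    return kl - log_lik
```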
Discrete graph refinement. At testing time, we draw discrete samples from the probabilistic graph $\bar{G} = (V, \bar{A}, X)$. To ensure the denoised graph $\hat{A}$ does not deviate from the observation $A$ too much, we set a budget $\Delta$ to measure their difference. We split the budget into two parts, $\Delta/2$ for deleting edges and $\Delta/2$ for adding edges. To delete edges, we consider the inter-cluster connections with $(A - S)_{ij} = 1$ and compute

$$E^-_{ij} = \frac{\exp\{(1 - \bar{A}_{ij})\}}{\sum \exp\{(1 - \bar{A}_{ij})\}} \quad \text{if } (A - S)_{ij} = 1.$$

We sample without replacement $\Delta/2$ edges according to $E^-_{ij}$. The sampled edge set is then deleted from the original graph structure $A$. To add edges, we are allowed to add connections for intra-cluster nodes with $(CC^\top - A)_{ij} = 1$ and compute

$$E^+_{ij} = \frac{\exp\{\bar{A}_{ij}\}}{\sum \exp\{\bar{A}_{ij}\}} \quad \text{if } (CC^\top - A)_{ij} = 1.$$

We sample $\Delta/2$ edges according to $E^+_{ij}$. The sampled edge set is added into the original graph structure $A$. With this strategy, we can generate the final denoised graph structure $\hat{A}$.
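A sketch of the refinement procedure under our reading of the sampling weights $E^-$ and $E^+$, treating the graph as undirected and assuming an even integer budget $\Delta$.

```python
import numpy as np

def refine(A, A_bar, C, budget, rng=None):
    """Spend budget/2 on deleting inter-cluster edges and budget/2 on adding
    intra-cluster non-edges, sampling without replacement (Delta = budget)."""
    rng = np.random.default_rng(0) if rng is None else rng
    same = (C @ C.T) > 0
    iu = np.triu_indices_from(A, k=1)       # undirected: upper triangle only
    A_hat = A.copy()

    def draw(mask, scores, k):
        cand = np.flatnonzero(mask[iu])
        k = min(k, cand.size)
        if k == 0:
            return np.empty(0, dtype=int)
        w = np.exp(scores[iu][cand])
        return cand[rng.choice(cand.size, size=k, replace=False, p=w / w.sum())]

    # E-: inter-cluster edges (A - S), weighted by (1 - A_bar_ij).
    for t in draw((A > 0) & ~same, 1.0 - A_bar, budget // 2):
        i, j = iu[0][t], iu[1][t]
        A_hat[i, j] = A_hat[j, i] = 0
    # E+: intra-cluster non-edges (CC^T - A), weighted by A_bar_ij.
    for t in draw((A == 0) & same, A_bar, budget // 2):
        i, j = iu[0][t], iu[1][t]
        A_hat[i, j] = A_hat[j, i] = 1
    return A_hat
```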
Connection between Mask-GVAE and Eq. 1. The first term of Eq. 1 is used to ensure the proximity between the input graph $A$ and the denoised graph $\hat{A}$, which is achieved in Mask-GVAE by the reconstruction capacity of VAEs and the discrete budget in the sampling stage. Next, we illustrate the connection between the graph Laplacian term in Eq. 1 and Mask-GVAE. We can re-write the graph Laplacian regularization when $d = 1$ as:

$$X^\top L X \Big|_{d=1} = \frac{1}{2}\Big(\underbrace{\sum_{S_{ij}=1} (X_i - X_j)^2}_{intra\text{-}cluster} + \underbrace{\sum_{(A-S)_{ij}=1} (X_i - X_j)^2}_{inter\text{-}cluster}\Big). \quad (13)$$

By decomposing the Laplacian regularization into these two terms, we focus on intra/inter-cluster connections when adding/deleting edges. As attributed graph clustering aims to discover groups of nodes in a graph such that the intra-group nodes are not only more similar but also more densely connected than the inter-group ones [59], this heuristic contributes to the overall optimization of the graph Laplacian term.

Figure 2: The proposed Mask-GVAE. It first utilizes the cluster mask $C$ and the noisy graph $G$ to learn the probabilistic denoised graph $\bar{G}$. The learning process is conditioned on $C$ as cluster results are robust against random noise (Proposition 4.1). With the optimum of the learning process, it then draws a discrete graph $\hat{G}$ at test time to meet the fidelity constraint.

Our solution consists of a cluster mask module $f(\cdot)$ and a denoised graph generation module $g(\cdot)$. As the error signals of the denoised graph generation module are obtained from the cluster mask module, and the cluster mask module needs the denoised graphs as inputs for robust cluster results, we design an iterative framework to alternate between minimizing the losses of both $g(\cdot)$ and $f(\cdot)$. We refer to Algorithm 1 for details of the training procedure. We denote the parameters of the cluster mask module $f(\cdot)$ as $\mathcal{W}_f$ and the parameters of the denoised graph generation module $g(\cdot)$ as $\mathcal{W}_g$. At the beginning of Algorithm 1, we use the cluster mask module $f(\cdot)$ to get the initial cluster results. We then exploit the denoised graph generator $g(\cdot)$ so as to get a denoised graph $\hat{G}$ (line 4). Based on that, we utilize the idea of randomized smoothing [8, 22] and feed both $G$ and $\hat{G}$ into the cluster mask module $f(\cdot)$ to compute $\mathcal{L}_u$ (line 7). With the optimum of $f(\cdot)$, we get the robust cluster mask $C$ to power the learning process of $g(\cdot)$ (line 9).

We analyze the computational complexity of our proposed method. The intensive parts of Mask-GVAE contain the computation of Eq. 5 and Eq. 12. Aside from this, the convolution operation takes $O(|E|dh)$ [28] for one input graph instance with $|E|$ edges, where $h$ is the number of feature maps of the weight matrix. Regarding Eq. 5, the core is to compute the matrices $(C^\top L C)$ and $(C^\top D C)$. Using sparse-dense matrix multiplications, the complexity is $O(|E|K)$. For Eq. 12, the intensive part is the topology reconstruction term. As the optimization is conditioned on existing edges, the complexity is $O(|E|)$. Thus, it leads to a complexity of $O(|E|K + |E|dh)$ for one input graph instance.
Algorithm 1: Training the model
Input: $G$, $K$. Output: $\hat{G}$.
Initialize parameters $\mathcal{W}_g$, $\mathcal{W}_f$;
repeat
  $C \leftarrow f(G, K)$;
  $\hat{G} \leftarrow g(G, C)$;
  $\mathcal{L}_{prior} \leftarrow \mathrm{KL}(q(Z \mid G, C) \,\|\, p(Z))$;
  $\mathcal{L}(\phi, \theta; G, C) \leftarrow \mathcal{L}_{prior} - \mathbb{E}_{q_\phi(Z \mid G, C)}(\log p(\bar{A} \mid G, C, Z))$;
  $\mathcal{L}_u \leftarrow f(\{G, \hat{G}\})$;
  // Update parameters according to gradients
  $\mathcal{W}_g \leftarrow \mathcal{W}_g - \nabla_{\mathcal{W}_g} \mathcal{L}(\phi, \theta; G, C)$;
  $\mathcal{W}_f \leftarrow \mathcal{W}_f - \nabla_{\mathcal{W}_f} \mathcal{L}_u$;
until deadline;
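A minimal Python rendering of Algorithm 1, assuming the two modules expose prediction, loss, and gradient-step helpers; all names here are placeholders rather than the paper's implementation.

```python
def train(G, K, f, g, num_iters=200):
    """Alternating optimization of Algorithm 1. f: cluster mask module,
    g: denoised graph generation module (placeholder APIs)."""
    for _ in range(num_iters):            # 'until deadline'
        C = f.predict(G, K)               # C <- f(G, K)
        G_hat = g.denoise(G, C)           # G_hat <- g(G, C)
        g.step(g.loss(G, C))              # update W_g by the gradient of Eq. 12
        f.step(f.loss([G, G_hat]))        # update W_f by the gradient of Eq. 5
                                          # on both graphs (randomized smoothing)
    return g.denoise(G, f.predict(G, K))
```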
We compare with VGAE [27] and one of our baselines, DOMINANT [9], whose operations require several convolution layers with $O(|E|dh)$ and a topology reconstruction term on all possible node pairs with $O(N^2)$. Thus, the complexity of VGAE and DOMINANT is $O(N^2 + |E|dh)$. As observed in the experiments, usually we have $|E|K \le N^2$ for large graphs, e.g., Citeseer, Pubmed, Wiki, Reddit. When compared with NE [58] and ND [13], our solution wins as both NE and ND rely on explicit eigen-decomposition.

4 THEORETICAL ANALYSIS
The stability of the cluster mask under random noise is vital for Mask-GVAE, and that stability is dominated by the stability of the low eigenvectors of the graph Laplacian [57]. In this part, we initiate a systematic study of the stability of low eigenvectors, using the notion of average sensitivity [56], which is the expected size of the symmetric difference of the output low eigenvectors before and after we randomly remove a few edges. Let the clean graph be $G' = (V, A', X)$, the degree matrix be $D'$ and the clean Laplacian matrix be $L' = D' - A'$. We denote the $i$-th smallest eigenvalue of $L'$ as $\lambda'_i$, and the corresponding eigenvector as $\mathbf{u}'_i$. We consider the noisy graph $G = (V, A, X)$ to be generated from the clean one $G'$ by the following procedure: for each edge $(i, j)$ in $G'$, remove it with probability $q$ independently. Following [19], we analyze the bi-partition of the graph data, as a larger number of clusters can be found by applying the bi-partition algorithm recursively.

Assumption 4.1. Let the node degrees of the clean graph satisfy $\sum_{i=1}^{N} d'^2_i = N^\chi$ and the number of edges be $m' = N^\varphi$; the following properties hold. (1) $\exists\, \epsilon > 1$, s.t., $\frac{\epsilon}{\epsilon-1} \lambda'_2 < \lambda'_3$ and $\lambda'_2 \ge \max(\epsilon q \lambda'_N, \epsilon \log N)$. (2) $q \le \frac{\lambda'_2}{\kappa N^\beta}$ and $\beta \ge \max(\chi, \varphi)$ with $\kappa > 0$.

Assumption 4.1.1 implies the graph has at most one outstanding sparse cut by the higher-order Cheeger inequality [31, 46]. It has been discussed that the eigenspaces of a Laplacian with such a large eigengap are stable against edge noise [57]. Assumption 4.1.2 indicates that the probability of edge noise is small w.r.t. the number of nodes. To better understand Assumption 4.1, let us consider the example in Figure 1. It is easy to check $\chi$, $\varphi$, $\lambda'_2$, $\lambda'_3$ and $\lambda'_N$ for this graph; let $q = 0.01$, then we can get the corresponding $\beta$ and $\kappa$. Denote $\mathbb{E}[\sin(\angle(\mathbf{u}'_2, \mathbf{u}_2))]$ as the expected $\sin(\angle(\cdot, \cdot))$ for the angle between $\mathbf{u}'_2$ and $\mathbf{u}_2$; then the following proposition holds.

Proposition 4.1. Under Assumption 4.1, $\mathbb{E}[\sin(\angle(\mathbf{u}'_2, \mathbf{u}_2))]$ under random noise satisfies $\mathbb{E}[\sin(\angle(\mathbf{u}'_2, \mathbf{u}_2))] \le 2/\kappa$.

Please refer to Appendix A for the proof. For the example in Figure 1, one can check that $\mathbb{E}[\sin(\angle(\mathbf{u}'_2, \mathbf{u}_2))]$ is small and satisfies this bound.
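Proposition 4.1 is easy to probe numerically. The self-contained experiment below uses an assumed two-block random graph rather than the exact graph of Figure 1; it removes each edge with probability $q = 0.01$ and compares the rotation of $\mathbf{u}_2$ against that of $\mathbf{u}_N$.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

def sin_angle(u, v):
    c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.sqrt(max(0.0, 1.0 - c * c))

# Two dense 50-node clusters, sparsely interconnected.
N, half = 100, 50
P = np.full((N, N), 0.02)
P[:half, :half] = P[half:, half:] = 0.5
A1 = np.triu(rng.random((N, N)) < P, 1).astype(float)
A_clean = A1 + A1.T

# Remove each edge independently with probability q = 0.01.
drop = np.triu(rng.random((N, N)) < 0.01, 1)
A_noisy = A_clean * (1.0 - (drop + drop.T))

_, U_c = np.linalg.eigh(laplacian(A_clean))   # eigenvalues ascending
_, U_n = np.linalg.eigh(laplacian(A_noisy))
print("sin angle of u_2:", sin_angle(U_c[:, 1], U_n[:, 1]))    # near 0
print("sin angle of u_N:", sin_angle(U_c[:, -1], U_n[:, -1]))  # usually larger
```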
5 EXPERIMENTS
We first validate the effectiveness of our graph clustering algorithm. Then we evaluate Mask-GVAE on blind graph denoising tasks.

Table 1: Statistics of graphs used in graph clustering

Data      Nodes   Edges   Classes  Features
Citeseer  3,327   4,732   6        3,703
Pubmed    19,717  44,338  3        500
Wiki      2,405   17,981  17       4,973

Data and baselines.
We use three benchmark data sets, i.e., Pubmed, Citeseer [51] and Wiki [63]. Statistics of the data sets can be found in Table 1. As baselines, we compare against (1) Spectral Clustering (SC) [57], which only takes the node adjacency matrix as the affinity matrix; (2) Node2vec [15] + Kmeans (N2v&K), which first uses Node2vec to derive node embeddings and then utilizes K-means to generate cluster results; and (3) VGAE [27] + Kmeans (VGAE&K).
Setup.
For our method, we let the output dimension of the second-layer perceptrons equal the number of clusters. We use the same network architecture throughout all the experiments. Our implementation is based on Tensorflow. We train the model using a full-batch Adam optimizer with exponential decay, and fix the penalty weight $\varphi$ in Eq. 5 across experiments.

Results.
The clustering accuracy (ACC), normalized mutual information (NMI) and macro F1 score (F1) are shown in Figure 3. Our method outperforms the competitors on all data sets. As ours does not rely on K-means to derive cluster memberships, this clustering performance indicates the effectiveness of our framework on graph clustering tasks.
Baselines.
We use the following approaches as our baselines:
• DOMINANT [9], a graph neural network that performs anomaly detection on attributed graphs. It computes the degree of anomaly by the distance between the recovered features/edges and the input features/edges. In this work, we consider the top anomalous edges as noisy edges and remove them from the input graphs.
• NE [58], a blind graph denoising algorithm that recovers the graph structure based on diffusion algorithms. It is designed to remove irrelevant edges in biological networks. It relies on explicit eigen-decomposition and cannot be applied to large graphs.
• ND [13], a blind graph denoising algorithm that solves an inverse diffusion process to remove the transitive edges. It also relies on explicit eigen-decomposition and cannot be applied to large graphs.
• E-Net [61], a non-blind graph enhancement neural network that draws inspiration from the link prediction method [66]. For a fair comparison, we replace the supervision (clean graphs) used in E-Net with the noisy version.
• Our-ablation, an ablation of our framework that does not utilize the cluster mask. It resembles VGAE [27] with two differences: (1) we use a multi-layer perceptron in decoders while VGAE uses a non-parameterized version, and (2) we replace GCN [28] with Heatts [33].
Metrics.
We follow image denoising [45, 64] in using peak signal-to-noise ratio (PSNR) when the clean graph $G' = (V, A', X)$ is known:

$$\mathrm{MSE} = \frac{\sum_{i=1}^{N} \sum_{j<i} (A'_{ij} - \hat{A}_{ij})^2}{N(N-1)/2}. \quad (14)$$

PSNR is then defined as:

$$\mathrm{PSNR} = 10 \log_{10}\Big(\frac{1}{\mathrm{MSE}}\Big). \quad (15)$$

As for the structural similarity, we leverage Weisfeiler-Lehman Graph Kernels [53]:

$$WL = \frac{\langle \phi(\hat{A}), \phi(A') \rangle}{\sqrt{\langle \phi(\hat{A}), \phi(\hat{A}) \rangle \langle \phi(A'), \phi(A') \rangle}}, \quad (16)$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product and $\phi(\cdot)$ represents a vector derived by Weisfeiler-Lehman Graph Kernels. For all metrics, a greater value denotes a better performance.
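For reference, PSNR as defined by Eqs. 14-15 over the $N(N-1)/2$ unordered node pairs can be computed as follows; the WL kernel feature map $\phi$ is left to a graph-kernel library.

```python
import numpy as np

def psnr(A_clean, A_hat):
    """Eqs. 14-15 with binary adjacency matrices (peak value 1)."""
    N = A_clean.shape[0]
    diff = np.triu(A_clean - A_hat, k=1)
    mse = (diff ** 2).sum() / (N * (N - 1) / 2)
    return np.inf if mse == 0 else 10.0 * np.log10(1.0 / mse)
```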
Figure 3: Graph clustering performance comparison of different methods (SC, N2v&K, VGAE&K, Ours) on Pubmed, Citeseer and Wiki.

Figure 4: Influence of the estimated noise $\Delta$ on the denoising performance of MUTAG (left) and PTC-MR (right).

Data.
We use five graph classification benchmarks as the clean graphs: IMDB-Binary and IMDB-Multi [62], connecting actors/actresses based on movie appearances; Reddit-Binary [62], connecting users through responses in Reddit online discussions; MUTAG [29], containing mutagenic compounds; and PTC [29], containing compounds tested for carcinogenicity. For each $A'$, we add noise to edges to produce $A$ by the following methods: (1) randomly adding 10% nonexistent edges, and (2) randomly removing 10% of edges with respect to the existing edges in $A'$, as adopted in E-Net [61]. We refer to Table 2 for detailed information about the obtained noisy graphs.
For Mask-GVAE, we adopt the same settings as in the experiments on graph clustering, except that we use the Bayesian Information Criterion (BIC) to decide the optimal number of clusters. For all methods, we denoise the given graphs with the same budget, i.e., we add 10% edges to recover missing edges and remove 10% edges to delete irrelevant edges, based on the intermediate probabilistic graphs.
Results.
Table 2 lists the experimental results on the five data sets. We analyze the results from the following two perspectives.
Scalability: Both ND and NE suffer from a long running time and take more than 1 day on the larger data set Reddit-Binary, as they both rely on explicit eigen-decomposition. On the contrary, E-Net and our method can take advantage of GNNs and avoid explicit eigen-decomposition, which makes the algorithms scalable to very large networks.
Performance: Among all approaches, Mask-GVAE achieves the best performance on all data sets in terms of the two measures PSNR and WL similarity. It also beats Our-ablation, which shows the effectiveness of the cluster mask.
Running time.
Table 3 lists the running time on three selected data sets. As can be seen, on small-sized graphs like PTC-MR, Mask-GVAE requires a longer time as it needs to compute the cluster mask. On medium-sized graphs like IMDB-MULTI, the running time of Mask-GVAE is comparable to that of DOMINANT and NE/ND. On large-sized graphs like REDDIT-BIN, Mask-GVAE uses less time compared with all baselines, as the optimization of denoised graph generation in Mask-GVAE is conditioned on the observed edges.
Sensitivity.
We test the sensitivity of Mask-GVAE to the degree of noise and the modularity [42] of given graphs. We target random cluster graphs [43] and generate 200 synthetic graphs with an average of 100 nodes. We set the degree of noise to 10%, 20% and 30%. We control the modularity to be 0.05 (weak cluster structure) and 0.35 (strong cluster structure). Table 4 lists the results. As can be seen, our solution consistently outperforms baselines in most cases, regardless of the degree of noise and cluster structure. In addition, we observe most methods perform better in PSNR on strongly clustered graphs (modularity = 0.35), which shows the importance of clusters in current denoising approaches.
Estimating the degree of noise.
Estimating the degree of noise for the given inputs is still an open problem [16]. In this work, we use the budget $\Delta$ to represent our estimation of the degree of noise in the given graphs. In this part, we evaluate how the budget $\Delta$ affects the denoising performance. Taking MUTAG/PTC-MR with 10% noise as an example, we vary $\Delta$ from 0% to 20% and plot the corresponding denoising performance in Figure 4. As we increase $\Delta$, the curve of PSNR is quite flat, indicating our model is robust to the estimated noise on PSNR. As for WL, it first increases then drops, meaning that an appropriate noise estimation is essential for the performance of our model on structural similarity.
Table 2: Blind denoising performance comparison of different methods on benchmarks

Datasets       IMDB-BIN  IMDB-MULTI  REDDIT-BIN  MUTAG  PTC-MR
(No. Graphs)   1000      1500        2000        188    344
(Avg. Nodes)   19.8      13.0        508.5       17.9   14.3
(Avg. Edges)   193.1     65.9        497.8       19.8   14.7

For each data set, PSNR and WL (mean ± std) are reported for DOMINANT [9], ND [13], NE [58], E-Net [61], Our-ablation and Mask-GVAE; ND and NE are marked "-" on REDDIT-BIN, and Mask-GVAE obtains the best PSNR and WL on every data set.

Table 3: Running time (in seconds) comparison of different methods
Data sets    IMDB-MULTI  REDDIT-BIN  PTC-MR
DOMINANT     73          1895        11
ND           41          -           6
NE           43          -           14
E-Net        216         2114        19
Mask-GVAE    79          1043        21
Visualization.
We target Citeseer and add 30% noise. In Figure 5, we derive node embeddings before/after Mask-GVAE by Node2vec [15] and project the embeddings into a two-dimensional space with t-SNE, in which different colors denote different classes. For the noisy Citeseer, nodes of different classes are mixed up, as reflected by the geometric distances between different colors. Mask-GVAE can make a distinct difference between classes.

Figure 5: Two-dimensional visualization of node embeddings before and after the denoising of Mask-GVAE on Citeseer: (a) Citeseer with 30% noise; (b) Citeseer denoised by Mask-GVAE.
Case study.
To have a better understanding of how Mask-GVAE works, we target a subgraph of a co-authorship network which consists of 11 scholars in the areas of Data Base (DB) and Data Mining (DM). The subgraph is based on Google Scholar (https://scholar.google.com/) and constructed as follows: (1) for each scholar, we use a one-hot representation with 300 dimensions encoding his/her research keywords, and (2) we construct an adjacency matrix $A$ by denoting $A_{ij} = A_{ji} = 1$ if the two scholars are co-authors. We then apply Mask-GVAE with appropriate $K$ and $\Delta$ to this subgraph.
Figure 6: A case to show how Mask-GVAE denoises graphs (a co-authorship subgraph including Jiawei Han, Yizhou Sun, Philip S. Yu, Jeffrey Xu Yu, Xuemin Lin, Xiaofang Zhou, Ying Zhang, Chuan Xiao, Wenjie Zhang, Wei Wang (UNSW) and Wei Wang (UCLA)).
6 RELATED WORK
Graph Laplacian in denoising tasks. Several studies have utilized the graph Laplacian in various denoising tasks, including image denoising [45, 64], signal denoising [11], and 3D point cloud denoising [18]. As graph structures are not readily available in these domains, these studies differ in how they construct graph structures, i.e., structures are only created as an auxiliary source to recover a high quality image/signal/3D point cloud. Only recently, E-Net [61] has proposed to adopt graph Laplacian regularization in non-blind graph denoising tasks, in order to restore a denoised graph with global smoothing and sparsity.

Graph denoising. Although there have been many studies on image denoising, graph denoising has been studied less; in particular, the study of blind denoising large discrete graphs is still lacking.
Table 4: Performance comparison of different methods on the random cluster graphs

Degree of noise   10%            20%            30%
Modularity        PSNR    WL     PSNR    WL     PSNR    WL     (0.05 and 0.35 each)
DOMINANT          …
ND                …
NE                …
E-Net             …
Mask-GVAE         76.32 48.94%, 78.01 50.07% | 72.73 48.37%, 74.66 36.82% | 70.49 36.14%, 72.68 35.99%
It is worth noting that the study of signal denoising on graphs [4, 44] is different from the study of graph structure denoising. When it comes to structure denoising, ND [13] formulates the problem as the inverse of network convolution, and introduces an algorithm that removes the combined effect of all indirect paths by exploiting eigen-decomposition. NE [58] recovers the graph structure based on diffusion algorithms and follows the intuition that nodes connected by high-weight paths are more likely to have direct edges. Low-rank estimation [17] and sparse matrix estimation [47] assume the given graph is incomplete and noisy, and thus aim to recover the structure with the property of low-rank/sparsity. [40] infers the hidden graph structures based on the heuristic that a set of hidden, constituent subgraphs are combined to generate the observed graph. Inspired by link prediction [66], E-Net [61] enhances the quality of graph structure via exploiting subgraph characteristics and GNNs. Moreover, E-Net requires supervision from the clean graphs, which is different from our blind setting.
Utilization of substructure. It is a common practice to utilize certain substructures (i.e., clusters or subgraphs) in denoising tasks [13, 40, 61] or other related areas [6, 66]. The underlying ideas can be generally classified into two groups. The first group is related to scalability, i.e., substructures make the algorithms scalable to very large networks. Representative works include ND [13], E-Net [61], and Cluster-GCN [6]. The second group considers that substructures can provide meaningful context for the tasks at hand. For example, SEAL [66] shows that local subgraphs preserve rich information related to link existence. However, all these methods are heuristic and shed little light on when these substructure-based ideas fail. In this work, we prove that in order to make substructure heuristics work, the given graph should have a distinct cluster structure. Moreover, it is the robustness of low eigenvectors of the graph Laplacian matrix that lays the foundation of these heuristics.
Discrete graph generation. Directly generating discrete graphs with gradient descent methods is intractable. In this regard, several efforts have been made to bypass the difficulty: [2] learns a policy network with reinforcement learning, [21] approximates the discrete data by the Gumbel distribution, and [55] circumvents the problem by formulating the loss on a probabilistic graph and drawing discrete graphs thereafter, which we follow in this work.
7 CONCLUSION
In this paper, we present Mask-GVAE, the first variational generative model for blind denoising large discrete graphs. Given the huge search space of selecting proper candidates, we decompose the graph into several subgraphs and generate smoothed clusters in a variational manner, which is based on the assumption that low eigenvectors are robust against random noise. The effectiveness of Mask-GVAE is validated on five graph benchmarks, with a significant improvement on PSNR and WL similarity.
ACKNOWLEDGMENTS
The work described in this paper was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No.: CUHK 14205617] and [Project No.: CUHK 14205618], Huawei Technologies Research and Development Fund, and NSFC Grant No. U1936205.
A PROOF OF PROPOSITION 4.1
Lemma A.1.
If Assumption 4.1 holds, then $P(\lambda_2 \ge (1 - \frac{1}{\epsilon})\lambda'_2) \ge 1 - N^{-1}$.

We first prove Lemma A.1. Let $E$ be the set of edges of $G'$ and $F$ be the set of edges removed from $E$. For an edge $e = (i, j) \in E$, let $E_e = E_{ii} + E_{jj} - E_{ij} - E_{ji}$, where $E_{ij}$ is the matrix with 1 in position $(i, j)$ and 0 elsewhere, and let $E_F = \sum_{e \in F} E_e$. Let $X_e$ be an indicator random variable of the event that $e \in F$. Thus, $E_F = \sum_{e \in E} X_e E_e$ and $E_F$ is the Laplacian matrix of the induced graph $F$. It is easy to get $P[X_e = 1] = q$ and $\mathbb{E}[E_F] = qL'$. Let $\mu = \lambda_{max}(\mathbb{E}[E_F]) = q\lambda'_N$. Furthermore, note that $0 \le \lambda(X_e E_e) \le \lambda_{max}(X_e E_e) \le 2$. By the matrix Chernoff bound in Peng and Yoshida [46], for any $s > 0$, we have

$$P(\lambda_{max}(E_F) \ge (1+s)\mu) \le N \Big(\frac{\exp(s)}{(1+s)^{1+s}}\Big)^{\mu}.$$

Setting $s$ such that $(1+s)\mu = \max(\mu, \log N)$, we get $P(\lambda_{max}(E_F) \ge (1+s)\mu) \le N^{-1}$. Thus with probability at least $1 - N^{-1}$,

$$\lambda_{max}(E_F) \le \max(\mu, \log N) = \max(q\lambda'_N, \log N) \le \frac{\lambda'_2}{\epsilon}$$

holds. Due to the fact that $\lambda_2 \ge \lambda'_2 - \lambda_{max}(E_F)$, $P(\lambda_2 \ge (1 - \frac{1}{\epsilon})\lambda'_2) \ge 1 - N^{-1}$ holds.

With Lemma A.1, we prove Proposition 4.1 based on the well-known Davis-Kahan perturbation theorem. It is easy to show that $\mathbb{E}[\| L - L' \|_F] = q \| L' \|_F$. Define $\delta_2 = \min\{ |\lambda_j - \lambda'_2|, j \ne 2 \}$; we can get the following bound according to the Davis-Kahan theorem:

$$\mathbb{E}[\sin(\angle(\mathbf{u}_2, \mathbf{u}'_2))] \le \mathbb{E}\Big[\frac{2}{\delta_2} \| L - L' \|_F\Big] = \frac{2}{\delta_2} (q \| L' \|_F).$$

It is easy to check $\| L' \|^2_F = O(\max(N^\chi, N^\varphi))$. By Lemma A.1, we have $P(\lambda_2 \ge (1 - \frac{1}{\epsilon})\lambda'_2) \ge 1 - N^{-1}$. Note that if $\frac{\epsilon}{\epsilon-1}\lambda'_2 < \lambda'_3$, then $\delta_2 = \Omega(\lambda'_2)$. Thus, if $\beta > \max(\chi, \varphi)$, then $\mathbb{E}[\sin(\angle(\mathbf{u}_2, \mathbf{u}'_2))] \to 0$; if $\beta = \max(\chi, \varphi)$, then $\mathbb{E}[\sin(\angle(\mathbf{u}_2, \mathbf{u}'_2))] \le 2/\kappa$.
REFERENCES
[1] Yong-Yeol Ahn, James P Bagrow, and Sune Lehmann. 2010. Link communities reveal multiscale complexity in networks. Nature (2010).
[2] International Conference on Learning Representations (ICLR) (2017).
[3] Sandro Cavallari, Vincent W Zheng, Hongyun Cai, Kevin Chen-Chuan Chang, and Erik Cambria. 2017. Learning community embedding with community detection and node embedding on graphs. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM). ACM, 377–386.
[4] Siheng Chen, Aliaksei Sandryhaila, José MF Moura, and Jelena Kovacevic. 2014. Signal denoising on graphs via graph filtering. IEEE, 872–876.
[5] Zhengdao Chen, Xiang Li, and Joan Bruna. 2019. Supervised community detection with line graph neural networks. International Conference on Learning Representations (ICLR) (2019).
[6] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 257–266.
[7] Jan Chorowski, Ron J Weiss, Samy Bengio, and Aäron van den Oord. 2019. Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 12 (2019), 2041–2053.
[8] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. 2019. Certified adversarial robustness via randomized smoothing. ICML (2019).
[9] Kaize Ding, Jundong Li, Rohit Bhanushali, and Huan Liu. 2019. Deep anomaly detection on attributed networks. In Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM, 594–602.
[10] P. D. Dobson and A. J. Doig. 2003. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology (2003).
[11] IEEE Transactions on Signal Processing 64, 23 (2016), 6160–6173.
[12] Justin Eldridge, Mikhail Belkin, and Yusu Wang. 2018. Unperturbed: spectral analysis beyond Davis-Kahan. In Proceedings of Algorithmic Learning Theory, Vol. 83. PMLR, 321–358.
[13] Soheil Feizi, Daniel Marbach, Muriel Médard, and Manolis Kellis. 2013. Network deconvolution as a general method to distinguish direct dependencies in networks. Nature Biotechnology (2013), 726–733.
[14] Pedro F Felzenszwalb and Daniel P Huttenlocher. 2004. Efficient graph-based image segmentation. International Journal of Computer Vision 59, 2 (2004), 167–181.
[15] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD). 855–864.
[16] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. 2019. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1712–1722.
[17] Cho-Jui Hsieh, Kai-Yang Chiang, and Inderjit S Dhillon. 2012. Low rank modeling of signed networks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD). 507–515.
[18] Wei Hu, Xiang Gao, Gene Cheung, and Zongming Guo. 2020. Feature graph learning for 3d point cloud denoising. IEEE Transactions on Signal Processing (2020).
[19] In Proceedings of the 21st International Conference on Neural Information Processing Systems (NeurIPS). 705–712.
[20] Daniel Im Jiwoong Im, Sungjin Ahn, Roland Memisevic, and Yoshua Bengio. 2017. Denoising criterion for variational auto-encoding framework. In Thirty-First AAAI Conference on Artificial Intelligence. 2059–2065.
[21] Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. International Conference on Learning Representations (ICLR) (2016).
[22] Jinyuan Jia, Binghui Wang, Xiaoyu Cao, and Neil Zhenqiang Gong. 2020. Certified Robustness of Community Detection against Adversarial Structural Perturbation via Randomized Smoothing. In Proceedings of The Web Conference 2020. 2718–2724.
[23] Brian Karrer, Elizaveta Levina, and Mark EJ Newman. 2008. Robustness of community structure in networks. Physical Review E 77, 4 (2008), 046119.
[24] Jinseok Kim and Jana Diesner. 2016. Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks. Journal of the Association for Information Science and Technology 67, 6 (2016), 1446–1461.
[25] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems. 4743–4751.
[26] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. International Conference on Learning Representations (ICLR) (2013).
[27] Thomas N Kipf and Max Welling. 2016. Variational Graph Auto-Encoders. NeurIPS Workshop on Bayesian Deep Learning (2016).
[28] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
[29] Nils Kriege and Petra Mutzel. 2012. Subgraph matching kernels for attributed graphs. ICML (2012), 291–298.
[30] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. 2017. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning (ICML). 1945–1954.
[31] James R Lee, Shayan Oveis Gharan, and Luca Trevisan. 2014. Multiway spectral partitioning and higher-order cheeger inequalities. Journal of the ACM (JACM) 61, 6 (2014), 1–30.
[32] Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. 2019. Semi-Supervised Graph Classification: A Hierarchical Graph Perspective. In The World Wide Web Conference (WWW). 972–982.
[33] Jia Li, Jianwei Yu, Jiajin Li, Honglei Zhang, Kangfei Zhao, Yu Rong, Hong Cheng, and Junzhou Huang. 2020. Dirichlet Graph Variational Autoencoder. In NeurIPS.
[34] Jia Li, Honglei Zhang, Zhichao Han, Yu Rong, Hong Cheng, and Junzhou Huang. 2020. Adversarial attack on community detection by hiding individuals. In Proceedings of The Web Conference 2020. 917–927.
[35] Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Mingjie Li, Wenjie Zhang, and Xuemin Lin. 2019. Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering (2019).
[36] David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58, 7 (2007), 1019–1031.
[37] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. 2018. Constrained graph variational autoencoders for molecule design. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS). 7806–7815.
[38] Guy Lorberbom, Andreea Gane, Tommi Jaakkola, and Tamir Hazan. 2019. Direct Optimization through arg max for Discrete Variational Auto-Encoder. In Advances in Neural Information Processing Systems. 6200–6211.
[39] Angshul Majumdar. 2018. Blind denoising autoencoder. IEEE Transactions on Neural Networks and Learning Systems 30, 1 (2018), 312–317.
[40] Quaid D Morris and Brendan J Frey. 2004. Denoising and untangling graphs using degree priors. In Proceedings of the 16th International Conference on Neural Information Processing Systems (NeurIPS). 385–392.
[41] Azade Nazi, Will Hang, Anna Goldie, Sujith Ravi, and Azalia Mirhoseini. 2019. GAP: Generalizable Approximate Graph Partitioning Framework. International Conference on Learning Representations Workshop (2019).
[42] Mark EJ Newman. 2006. Modularity and community structure in networks. Proceedings of the National Academy of Sciences (2006).
[43] Physical Review Letters.
[44] The Journal of Machine Learning Research 18, 1 (2017), 6410–6445.
[45] Jiahao Pang and Gene Cheung. 2017. Graph Laplacian regularization for image denoising: Analysis in the continuous domain. IEEE Transactions on Image Processing 26, 4 (2017), 1770–1785.
[46] Pan Peng and Yuichi Yoshida. 2020. Average Sensitivity of Spectral Clustering. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD).
[47] Emile Richard, Pierre-André Savalle, and Nicolas Vayatis. 2012. Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML). 51–58.
[48] Guillaume Salha, Romain Hennequin, Jean-Baptiste Remy, Manuel Moussallam, and Michalis Vazirgiannis. 2020. FastGAE: Fast, Scalable and Effective Graph Autoencoders with Stochastic Subgraph Decoding. arXiv preprint arXiv:2002.01910 (2020).
[49] Guillaume Salha, Romain Hennequin, Viet Anh Tran, and Michalis Vazirgiannis. 2019. A degeneracy framework for scalable graph autoencoders. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19) (2019).
[50] Bidisha Samanta, Abir De, Gourhari Jana, Pratim Kumar Chattaraj, Niloy Ganguly, and Manuel Gomez Rodriguez. 2019. NeVAE: A deep generative model for molecular graphs. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI). 1110–1117.
[51] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI Magazine (2008).
[52] International Conference on Learning Representations (ICLR) (2018).
[53] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (2011), 2539–2561.
[54] Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8 (2000), 888–905.
[55] Martin Simonovsky and Nikos Komodakis. 2018. GraphVAE: Towards generation of small graphs using variational autoencoders. In Artificial Neural Networks and Machine Learning (ICANN). Springer, 412–422.
[56] Nithin Varma and Yuichi Yoshida. 2019. Average sensitivity of graph algorithms. arXiv preprint arXiv:1904.03248 (2019).
[57] Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing 17, 4 (2007), 395–416.
[58] Bo Wang, Armin Pourshafeie, Marinka Zitnik, Junjie Zhu, Carlos D Bustamante, Serafim Batzoglou, and Jure Leskovec. 2018. Network enhancement as a general method to denoise weighted biological networks. Nature Communications 9, 1 (2018), 1–8.
[59] Meng Wang, Chaokun Wang, Jeffrey Xu Yu, and Jun Zhang. 2015. Community detection in social networks: an in-depth benchmarking study with a procedure-oriented framework. VLDB 8, 10 (2015), 998–1009.
[60] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML). PMLR, 6861–6871.
[61] J. Xu, Y. Yang, C. Wang, Z. Liu, J. Zhang, L. Chen, and J. Lu. 2020. Robust Network Enhancement from Flawed Networks. IEEE Transactions on Knowledge and Data Engineering (2020), 1–1.
[62] P. Yanardag and S.V.N. Vishwanathan. 2015. Deep Graph Kernels. In KDD. 1365–1374.
[63] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Chang. 2015. Network representation learning with rich text information. In The International Joint Conference on Artificial Intelligence (IJCAI).
[64] Jin Zeng, Jiahao Pang, Wenxiu Sun, and Gene Cheung. 2019. Deep graph Laplacian regularization for robust denoising of real images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop.
[65] Chuxu Zhang, Huaxiu Yao, Chao Huang, Meng Jiang, Zhenhui Li, and Nitesh V Chawla. 2020. Few-Shot Knowledge Graph Completion. AAAI (2020).
[66] Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS). 5171–5181.
[67] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. 2009. Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment 2, 1 (2009), 718–729.
[68] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. 2018. Adversarial Attacks on Neural Networks for Graph Data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.