Graph Embedding VAE: A Permutation Invariant Model of Graph Structure
Tony Duan∗  Juho Lee
Abstract
Generative models of graph structure have applications in biology and the social sciences. The state of the art is GraphRNN, which decomposes the graph generation process into a series of sequential steps. While effective for graphs of modest size, it loses its permutation invariance for larger graphs. Instead, we present a permutation-invariant latent-variable generative model that relies on graph embeddings to encode structure. Using tools from the random graph literature, our model is highly scalable to large graphs, with likelihood evaluation and generation in $O(|V| + |E|)$.

We focus on learning a generative model of undirected graph structure without node labels. Let $G = (V, E)$ denote a graph represented by a symmetric adjacency matrix $\mathbf{A} \in \{0, 1\}^{|V| \times |V|}$. Note that we are interested in the inductive (across graphs) setting rather than the transductive (single graph) setting.

Permutation invariance.
We begin with a few useful definitions (Zaheer et al., 2017).
Definition 1.
A function $f : \mathcal{X}^n \to \mathcal{Y}$ is permutation-invariant if and only if it satisfies $f(\pi x) = f(x)$ for any permutation $\pi \in S_n$, the set of permutations of the indices $\{1, \dots, n\}$.

Definition 2.
A function $f : \mathcal{X}^n \to \mathcal{Y}^n$ is permutation-equivariant if and only if it satisfies $f(\pi x) = \pi f(x)$ for any permutation $\pi \in S_n$, the set of permutations of the indices $\{1, \dots, n\}$.

Within the context of graphs, the Message Passing Neural Network (MPNN) (Gilmer et al., 2017) is the most popular permutation-equivariant model. In its simplest form, an MPNN acts on a set of node features $\mathbf{X}$ given a fixed adjacency matrix $\mathbf{A}$ by layers of message passing, described by
$$\mathrm{MPNNLayer}(\mathbf{X}; \mathbf{A}) = \sigma(\mathbf{A}\mathbf{X}\mathbf{W}),$$
where $\sigma$ is a non-linearity. Another permutation-equivariant model is the Set Transformer, which is composed of layers called the Induced Self-Attention Block (ISAB). Based on multihead attention (Vaswani et al., 2017), an ISAB computes the pairwise interactions between the $n$ elements in $\mathbf{X}$:
$$\mathrm{ISAB}(\mathbf{X}) = \mathrm{MultiheadAttention}(\mathbf{X}, \mathrm{MultiheadAttention}(\mathbf{I}, \mathbf{X})),$$
where $\mathbf{I}$ is a set of $m$ trained inducing points. Instead of computing self-attention directly on $\mathbf{X}$, which requires $O(n^2)$ time, an ISAB indirectly compares the elements of $\mathbf{X}$ via the reference points $\mathbf{I}$, reducing the time complexity to $O(nm)$. It is worth noting that an ISAB is a special case of an MPNN with a fully connected adjacency matrix $\mathbf{A}$.
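As a concrete illustration, the following is a minimal NumPy sketch of a single message-passing layer and a single-head, unnormalized induced attention block; the weight shapes, the ReLU non-linearity, and the single-head simplification are our own assumptions rather than details taken from this paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mpnn_layer(X, A, W):
    """Simplest-form MPNN layer: sigma(A X W), permutation-equivariant in the nodes."""
    return np.maximum(A @ X @ W, 0.0)          # ReLU non-linearity (assumed)

def attention(Q, K, V):
    """Single-head scaled dot-product attention (simplification of multihead attention)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def isab(X, I):
    """Induced Self-Attention Block: attend inducing points to X, then X to the summary.
    Cost is O(n * m) in the number of elements n and inducing points m."""
    H = attention(I, X, X)                     # m x d summary of the n elements
    return attention(X, H, H)                  # n x d output, permutation-equivariant

# toy usage
n, d, m = 6, 4, 2
rng = np.random.default_rng(0)
A = (rng.random((n, n)) < 0.3).astype(float); A = np.triu(A, 1); A = A + A.T
X, W, I = rng.normal(size=(n, d)), rng.normal(size=(d, d)), rng.normal(size=(m, d))
print(mpnn_layer(X, A, W).shape, isab(X, I).shape)   # (6, 4) (6, 4)
```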
Joint permutation invariance.
To satisfy permutation invariance with respect to arbitrary node re-orderings, we want to learn a likelihood model $p_\theta$ such that
$$p_\theta(\mathbf{P}\mathbf{A}\mathbf{P}^\top) = p_\theta(\mathbf{A})$$
for all permutation matrices $\mathbf{P}$.

∗ Work done as an intern at AITRICS. Presented at the NeurIPS 2019 Workshop on Graph Representation Learning, Vancouver, Canada.
Figure 1: Graphical model summarizing our approach. We use a permutation-equivariant graph embedding $\mathbf{A} \to \mathbf{Z}_0$ to encode adjacency matrices into latent space, then apply a normalizing flow $\mathbf{Z}_0 \leftrightarrow \mathbf{Z}$. We rely on an approximate posterior for reconstruction $\mathbf{Z} \to \mathbf{A}$.

Note that this is a different type of symmetry than the permutation equivariance we have already described, because both the rows and the columns of $\mathbf{A}$ must be re-ordered when the permutation is applied. Following Bloem-Reddy and Teh (2019), we call this type of symmetry joint invariance.

Definition 3.
A function $f : \mathcal{X}^{n \times n} \to \mathcal{Y}$ is jointly permutation-invariant if and only if it satisfies $f(\pi x \pi^\top) = f(x)$ for any permutation $\pi \in S_n$, the set of permutations of the indices $\{1, \dots, n\}$.

Definition 4.
A function $f : \mathcal{X}^{n \times n} \to \mathcal{Y}^n$ is jointly permutation-equivariant if and only if it satisfies $f(\pi x \pi^\top) = \pi f(x)$ for any permutation $\pi \in S_n$, the set of permutations of the indices $\{1, \dots, n\}$.
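To make the joint symmetry concrete, the following short NumPy check (our own illustration, not from the paper) applies the same permutation to the rows and columns of an adjacency matrix and verifies that a jointly invariant statistic, here the degree histogram, is unchanged, while the node-wise degree vector is merely equivariant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = (rng.random((n, n)) < 0.4).astype(int)
A = np.triu(A, 1); A = A + A.T                      # symmetric, no self-loops

perm = rng.permutation(n)
P = np.eye(n, dtype=int)[perm]                      # permutation matrix
A_perm = P @ A @ P.T                                # joint permutation of rows and columns

degrees = A.sum(axis=1)                             # equivariant: entries are re-ordered
degrees_perm = A_perm.sum(axis=1)
assert np.array_equal(degrees_perm, degrees[perm])

hist = lambda d: np.bincount(d, minlength=n)        # invariant: histogram ignores ordering
assert np.array_equal(hist(degrees), hist(degrees_perm))
print("joint permutation checks passed")
```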
Latent-variable models.
Instead of learning the distribution over the adjacency matrix $p_\theta(\mathbf{A})$ directly, we introduce a latent variable $\mathbf{Z}$. Following the standard formulation of variational auto-encoders (Kingma and Welling, 2014), we introduce a variational approximation to model the intractable posterior:
$$p_\theta(\mathbf{A}) = \int_{\mathbf{Z}} p(\mathbf{Z})\, p_\theta(\mathbf{A} \mid \mathbf{Z})\, d\mathbf{Z}, \qquad q_\phi(\mathbf{Z} \mid \mathbf{A}) \approx p_\theta(\mathbf{Z} \mid \mathbf{A}).$$
This puts our method in the same line of work as VGAE (Kipf and Welling, 2016) and Graphite (Grover et al., 2019). Both are latent-variable models with permutation-equivariant decoders $p_\theta(\mathbf{A} \mid \mathbf{Z}, \mathbf{X})$ and permutation-equivariant encoders $q_\phi(\mathbf{Z} \mid \mathbf{A}, \mathbf{X})$, instantiated as message-passing neural networks (Gilmer et al., 2017). However, these prior works rely on the availability of node features $\mathbf{X}$ for message passing. In the absence of node features, an arbitrary ordering of nodes is used to learn initial embeddings (by setting $\mathbf{X} = \mathbf{I}_n$), resulting in the loss of permutation equivariance.

Graph embeddings.
How do we break symmetries between nodes in the absence of node features? We explore the use of a graph embedding to encode the structure of the graph. Formally, we employ a function
$\mathrm{Embed} : \{0,1\}^{|V| \times |V|} \to \mathbb{R}^{|V| \times P}$ that satisfies joint permutation equivariance. The textbook example of a graph embedding method is the Laplacian Eigenmap (Belkin and Niyogi, 2003; Verma and Zhang, 2017), defined via the eigendecomposition of the Laplacian matrix:
$$\mathrm{Embed}(\mathbf{A}) = \mathbf{\Phi}^\top, \qquad \mathbf{D} - \mathbf{A} = \mathbf{\Phi} \mathbf{\Lambda} \mathbf{\Phi}^\top.$$
The Laplacian Eigenmap is our canonical example of an embedding method, though in experiments we investigate Locally Linear Embeddings as well (Roweis and Saul, 2000). More recently developed deep learning embeddings built on stochastic random walks could theoretically be employed as well (Perozzi et al., 2014; Grover and Leskovec, 2016; Abu-El-Haija et al., 2018). However, we note that such methods are typically invariant to permutations of the embedding dimensions, resulting in a different type of symmetry, so we leave their investigation to future work.
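A minimal sketch of this embedding step, assuming the unnormalized Laplacian and the eigenvectors of its $P$ smallest eigenvalues (the choice stated in the experimental section); the function name and the use of NumPy are our own.

```python
import numpy as np

def laplacian_eigenmap(A, P):
    """Embed a symmetric {0,1} adjacency matrix into R^{|V| x P} using the
    eigenvectors of the P smallest eigenvalues of the unnormalized Laplacian D - A."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)       # ascending eigenvalues for symmetric L
    return eigvecs[:, :P]                      # |V| x P node embeddings

# the embedding rows are re-ordered when the graph's nodes are re-ordered, i.e. it is
# jointly permutation-equivariant up to sign / eigenbasis ambiguity
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z0 = laplacian_eigenmap(A, P=2)
print(Z0.shape)                                # (4, 2)
```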
Encoder.
We begin our variational posterior with a graph embedding, then apply a normalizing flow to improve expressivity (Rezende and Mohamed, 2015). Letting $f_\phi$ denote a differentiable invertible transformation (potentially composed of a chain of simpler such transformations), we have
$$\mathbf{Z}_0 \mid \mathbf{A} \sim \mathrm{Normal}(\mathrm{Embed}(\mathbf{A}), \sigma^2 \mathbf{I}), \qquad \mathbf{Z} = f_\phi(\mathbf{Z}_0), \qquad \log q_\phi(\mathbf{Z} \mid \mathbf{A}) = \log q(\mathbf{Z}_0 \mid \mathbf{A}) - \log \left| \det \frac{\partial f_\phi}{\partial \mathbf{Z}_0} \right|.$$
We parameterize $f_\phi$ as a Neural Spline Flow over coupling layers (Durkan et al., 2019) for adequate expressivity. Coupling and $1 \times 1$ convolutions are performed over the $P$ dimensions of the embeddings. To ensure permutation equivariance while allowing dependencies between nodes, the splines for each coupling layer are parameterized by a stack of ISABs. We note that stacking self-attention layers over the node embeddings captures potentially complex interactions between nodes that have typically been modeled via message-passing neural networks. However, the use of ISABs requires only $O(|V| m)$ complexity instead of $O(|V|^2)$ complexity, where $m$ is the number of inducing points.

Decoder.
Our decoder applies the inverse flow $f_\phi^{-1}$, an ISAB stack, and then a Bernoulli-Exponential link:
$$\mathbf{Z} \sim \mathrm{Normal}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{Z}^* = \mathrm{ISABStack}(f_\phi^{-1}(\mathbf{Z})), \qquad \mathbf{A} \mid \mathbf{Z}^* \sim \mathrm{Bernoulli\text{-}Exponential}(\mathbf{Z}^* \mathbf{Z}^{*\top}).$$
The Bernoulli-Exponential link (Zhou, 2015; Caron, 2012) is defined by augmenting the model with truncated Exponential random variables $\mathbf{M} \in [0, 1]^{|V| \times |V|}$ corresponding to entries of the adjacency matrix $\mathbf{A}$. Letting $z^*_i, z^*_j$ denote the rows of $\mathbf{Z}^*$ corresponding to the $i$-th and $j$-th nodes,
$$a_{i,j} \mid z^*_i, z^*_j \sim \mathrm{Bernoulli}(m_{i,j} < 1), \qquad m_{i,j} \mid z^*_i, z^*_j \sim \mathrm{Exponential}(z_i^{*\top} z_j^*).$$
The joint log-likelihood can then be expressed as
$$\log p_\theta(\mathbf{A}, \mathbf{M} \mid \mathbf{Z}^*) = \sum_{(i,j) \in E} \Big( \log(z_i^{*\top} z_j^*) - z_i^{*\top} z_j^* (m_{i,j} - 1) \Big) - \frac{1}{2} \Bigg( \Big(\sum_{i=1}^{n} z^*_i\Big)^{\!\top} \Big(\sum_{i=1}^{n} z^*_i\Big) - \sum_{i=1}^{n} \| z^*_i \|^2 \Bigg).$$
The advantage of the Bernoulli-Exponential link over a traditional (e.g. logistic) link function is scalability: the joint log-likelihood can be calculated in $O(|E| + |V|)$ instead of $O(|V|^2)$. We sample from the analytic posterior for inference, noting that we only need to sample the $|E|$ auxiliary variables where $a_{i,j} = 1$ ($\delta_0$ below denotes the Dirac delta function centered at $0$). Since the auxiliary random variables are continuous, the re-parameterization trick can be used:
$$q(\mathbf{M} \mid \mathbf{A}, \mathbf{Z}^*) = \prod_{i<j} \Big[ (1 - a_{i,j})\, \delta_0(m_{i,j}) + a_{i,j}\, \mathrm{TruncatedExponential}\big(m_{i,j};\; z_i^{*\top} z_j^*,\; [0, 1)\big) \Big].$$
The high-level idea is summarized in Figure 1. We fit by optimizing the ELBO,
$$\begin{aligned} \log p_\theta(\mathbf{A}) &\geq \mathbb{E}_{q_\phi(\mathbf{Z}, \mathbf{M} \mid \mathbf{A})}\big[ \log p_\theta(\mathbf{A}, \mathbf{Z}, \mathbf{M}) - \log q_\phi(\mathbf{Z}, \mathbf{M} \mid \mathbf{A}) \big] \\ &= \mathbb{E}_{q(\mathbf{Z}, \mathbf{M} \mid \mathbf{A})}\big[ \log p(\mathbf{Z}) + \log p_\theta(\mathbf{A}, \mathbf{M} \mid \mathbf{Z}) - \log q_\phi(\mathbf{Z} \mid \mathbf{A}) - \log q(\mathbf{M} \mid \mathbf{A}, \mathbf{Z}) \big] \\ &= \mathbb{E}_{q(\mathbf{Z}, \mathbf{M} \mid \mathbf{A})}\left[ \log p(\mathbf{Z}) + \log p_\theta(\mathbf{A}, \mathbf{M} \mid \mathbf{Z}) - \log q(\mathbf{Z}_0 \mid \mathbf{A}) - \log q(\mathbf{M} \mid \mathbf{A}, \mathbf{Z}) + \log \left| \det \frac{\partial f_\phi}{\partial \mathbf{Z}_0} \right| \right]. \end{aligned}$$
We call our jointly permutation-invariant generative model the Graph Embedding VAE (GE-VAE).
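To illustrate why the Bernoulli-Exponential link scales, here is a hedged NumPy sketch of the joint log-likelihood above, computed from an edge list rather than the dense adjacency matrix; the function name and edge-list representation are our own choices, not the paper's implementation.

```python
import numpy as np

def bernoulli_exponential_loglik(Z_star, edges, M_edges):
    """log p(A, M | Z*) for the Bernoulli-Exponential link, in time linear in
    |E| + |V| (up to the embedding dimension P).

    Z_star:  (n, P) non-negative node embeddings (rates z_i^T z_j must be positive)
    edges:   (|E|, 2) array of node index pairs (i, j) with i < j where a_ij = 1
    M_edges: (|E|,) auxiliary variables m_ij in [0, 1) for the observed edges
    """
    zi = Z_star[edges[:, 0]]                      # (|E|, P)
    zj = Z_star[edges[:, 1]]                      # (|E|, P)
    rates = np.sum(zi * zj, axis=1)               # z_i^T z_j for each edge
    edge_term = np.sum(np.log(rates) - rates * (M_edges - 1.0))

    s = Z_star.sum(axis=0)                        # sum_i z_i, shape (P,)
    all_pairs_term = 0.5 * (s @ s - np.sum(Z_star * Z_star))   # sum_{i<j} z_i^T z_j
    return edge_term - all_pairs_term

# toy usage with non-negative embeddings
rng = np.random.default_rng(0)
Z_star = rng.random((6, 3))
edges = np.array([[0, 1], [1, 2], [3, 4]])
M_edges = rng.random(3)                           # stand-ins for truncated-Exponential draws
print(bernoulli_exponential_loglik(Z_star, edges, M_edges))
```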
Deep generative models of graphs.
So far the most successful inductive model of graph structure has been GraphRNN (You et al., 2018), which fits an auto-regressive model to sequences of node and edge formations derived from $\mathbf{A}$. The factorization implied by each sequence depends on the chosen node ordering, so the model is not permutation invariant. However, by amortizing over sampled breadth-first orderings, the model is approximately permutation invariant for modest graph sizes $|V|$. Graphite (Grover et al., 2019) is a variational auto-encoder model with permutation-equivariant MPNNs for encoding and decoding, but in the absence of node features it relies on an arbitrary node ordering by setting $\mathbf{X} = \mathbf{I}$. Graph Normalizing Flow (GNF) (Liu et al., 2019) uses permutation-equivariant MPNNs to parameterize the coupling layers of a normalizing flow model. It initializes node embeddings by sampling $\mathbf{X} \sim \mathrm{Normal}(\mathbf{0}, \sigma^2 \mathbf{I})$, which is invariant to node re-orderings. However, it requires a separate decoder for generating samples $\mathbf{A} \mid \mathbf{Z}$ by reverse message passing. We note that without $\mathbf{X}$, both Graphite and GNF require message passing over a fully connected $\mathbf{A}$ to sample new graphs, a step which we replace with the Set Transformer.

Permutation invariant and equivariant models.
Zaheer et al. (2017) first introduced permutation invariance and equivariance in the context of deep models. Herzig et al. (2018) introduced graph permutation invariance, which is similar to our notion of joint permutation equivariance under permutations of $\mathbf{A}$, but assumes the presence of unique node features $\mathbf{X}$ to break symmetry. Hartford et al. (2018) discuss exchangeable matrix invariance, which reflects separate symmetries under separate permutations of the rows and columns of $\mathbf{A}$, but not joint permutations. Bloem-Reddy and Teh (2019) provide a review of the above definitions that capture the symmetry in graphs.

                                     Graph Embedding VAE                      GraphRNN
Dataset      max |V|   max |E|   bits/dim   degree   cluster   orbit    degree   cluster   orbit
Community        160      1945      0.297    0.011     0.056   0.002     0.014     0.002   0.039
Ego              399      1071      0.155    0.116     0.711   0.163     0.077     0.316   0.030
Grid             361       684      0.071    0.779     0.026   0.509         −         −
Protein          500      1575      0.114    0.591     1.563   0.451     0.034     0.935   0.217

Table 1: Test set MMD and log-likelihood graph generation statistics comparing our method GE-VAE with GraphRNN. Results for GraphRNN are reported from You et al. (2018).

Figure 2: (Top) Generated graphs for the Community and Ego datasets. (Bottom) Interpolation in latent space between a four-community and a two-community graph structure.

We experiment with several datasets, following the GraphRNN (You et al., 2018) codebase. (1) Community: 3500 two-community graphs with Erdős–Rényi clusters. (2) Ego: 757 3-hop ego networks extracted from Citeseer (Sen et al., 2008). (3) Grid: 3500 standard 2D grid graphs. (4) Protein: 918 protein graphs over amino acids (Dobson and Doig, 2003). All datasets were split roughly into training and test sets. In order to handle graphs of various sizes within a dataset, we take only the eigenvectors corresponding to the smallest $P$ eigenvalues of each graph's unnormalized Laplacian. We implement masking in all self-attention steps, and maximize the reconstruction log-probability per edge (i.e. per dimension). We evaluate by reporting Maximum Mean Discrepancy (MMD) statistics over degree distributions, clustering coefficient distributions, and orbit count statistics (Table 1). For the GE-VAE we also report estimated test set log-likelihoods. We exhibit visualizations of generated graphs in Figure 2. Overall, we find that the GE-VAE is competitive with GraphRNN on the Community and Ego datasets, but is outperformed by GraphRNN on the Grid and Protein datasets. We suspect that this is due to the extremely multimodal nature of these graphs, which is difficult to capture with a latent-variable model.
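As a rough illustration of the evaluation protocol, the sketch below computes an MMD between the degree distributions of two sets of graphs; the Gaussian kernel, bandwidth, and histogram featurization are our own assumptions and may differ from the kernel choices used in the GraphRNN benchmark.

```python
import numpy as np

def degree_histogram(A, max_degree):
    """Normalized degree histogram of one graph, as a fixed-length feature vector."""
    degs = A.sum(axis=1).astype(int)
    h = np.bincount(degs, minlength=max_degree + 1)[: max_degree + 1]
    return h / h.sum()

def mmd_squared(X, Y, sigma=1.0):
    """Biased squared MMD between two samples of feature vectors, Gaussian kernel (assumed)."""
    def k(A_, B_):
        d2 = ((A_[:, None, :] - B_[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# toy usage: compare degree statistics of two small sets of random graphs
rng = np.random.default_rng(0)
def random_graph(n, p):
    A = (rng.random((n, n)) < p).astype(float)
    A = np.triu(A, 1)
    return A + A.T

real = np.stack([degree_histogram(random_graph(20, 0.2), 19) for _ in range(8)])
gen  = np.stack([degree_histogram(random_graph(20, 0.4), 19) for _ in range(8)])
print(mmd_squared(real, gen))
```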
Embedding limitations.
The primary limitation of our method is its heavy dependence on graph embeddings to encode the structure of the adjacency matrix in a jointly permutation-equivariant way. There is a heavy upfront computational cost to calculating embeddings for large graphs. Moreover, graph embeddings (such as the Laplacian Eigenmap) are not generally scale-invariant; as graph size increases, the representations encoded in each dimension of the node embeddings do not straightforwardly map between graphs (see Figure 3 in the Appendix).

Meta-learning embeddings.
A meta-learned graph embedding model trained simultaneously with our latent-variable model would allow for more flexible representations. Such a process would need to be permutation-equivariant, and also able to break symmetries by being position-aware (a simple MPNN would not suffice) (You et al., 2019). We leave this for future work.

Acknowledgement

This work was supported by NRF-2019M3E5D4065965 project.

References

Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., and Alemi, A. A. (2018). Watch Your Step: Learning Node Embeddings via Graph Attention. In Advances in Neural Information Processing Systems 31, pages 9180–9190.

Belkin, M. and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396.

Bloem-Reddy, B. and Teh, Y. W. (2019). Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082.

Caron, F. (2012). Bayesian nonparametric models for bipartite graphs. In Advances in Neural Information Processing Systems 25, pages 2051–2059.

Dobson, P. D. and Doig, A. J. (2003). Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. (2019). Neural Spline Flows. arXiv preprint arXiv:1906.04032.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural Message Passing for Quantum Chemistry. In International Conference on Machine Learning, pages 1263–1272.

Grover, A. and Leskovec, J. (2016). node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864.

Grover, A., Zweig, A., and Ermon, S. (2019). Graphite: Iterative Generative Modeling of Graphs. In International Conference on Machine Learning, pages 2434–2444.

Hartford, J., Graham, D., Leyton-Brown, K., and Ravanbakhsh, S. (2018). Deep Models of Interactions Across Sets. In International Conference on Machine Learning, pages 1909–1918.

Herzig, R., Raboh, M., Chechik, G., Berant, J., and Globerson, A. (2018). Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction. In Advances in Neural Information Processing Systems 31, pages 7211–7221.

Kingma, D. P. and Welling, M. (2014). Auto-Encoding Variational Bayes. In International Conference on Learning Representations. arXiv:1312.6114.

Kipf, T. N. and Welling, M. (2016). Variational Graph Auto-Encoders. arXiv preprint arXiv:1611.07308.

Kumar, A., Poole, B., and Murphy, K. (2019).
Learning Generative Samplers using Relaxed Injective Flow. Technical report.

Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. (2019). Graph Normalizing Flows. arXiv preprint arXiv:1905.13177.

Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710.

Rezende, D. and Mohamed, S. (2015). Variational Inference with Normalizing Flows. In International Conference on Machine Learning, pages 1530–1538.

Roweis, S. T. and Saul, L. K. (2000). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. (2008). Collective Classification in Network Data. AI Magazine, 29(3):93–93.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is All you Need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Verma, S. and Zhang, Z.-L. (2017). Hunt For The Unique, Stable, Sparse And Fast Feature Learning On Graphs. In Advances in Neural Information Processing Systems 30, pages 88–98.

You, J., Ying, R., and Leskovec, J. (2019). Position-aware Graph Neural Networks. In International Conference on Machine Learning, pages 7134–7143.

You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. (2018). GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models. In International Conference on Machine Learning, pages 5708–5717.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017). Deep Sets. In Advances in Neural Information Processing Systems 30, pages 3391–3401.

Zhou, M. (2015). Infinite Edge Partition Models for Overlapping Community Detection and Link Prediction. In International Conference on Artificial Intelligence and Statistics, pages 1135–1143.

Appendix

Injective flow perspective.
Suppose we instead replace the first, stochastic step of the encoder, $\mathbf{Z}_0 \mid \mathbf{A} \sim \mathrm{Normal}(\mathrm{Embed}(\mathbf{A}), \sigma^2 \mathbf{I})$, with the deterministic $\mathbf{Z}_0 = \mathrm{Embed}(\mathbf{A})$. Then applying $f_\phi \circ \mathrm{Embed}$ results in a series of transformations of $\mathbf{A}$ into the latent variable $\mathbf{Z}$. Crucially, graph embedding methods are not invertible, since there will always exist embeddings that do not correspond to any adjacency matrix, so we cannot interpret the composition as a series of invertible flows. However, when the graph embedding is injective (as is the case when the Laplacian Eigenmap is used (Verma and Zhang, 2017)), the result is an injective flow (Kumar et al., 2019):
$$\log p_\phi(\mathbf{A}) = \log p\big(f_\phi(\mathrm{Embed}(\mathbf{A}))\big) + \frac{1}{2} \log \left| \det J_{f_\phi \circ \mathrm{Embed}}(\mathbf{A})^\top J_{f_\phi \circ \mathrm{Embed}}(\mathbf{A}) \right|.$$

Log-likelihood evaluation.
We compute test set log-likelihood by Monte Carlo importance sampling with the variational posterior, using 128 samples:
$$\log p_\theta(\mathbf{A}) = \log \int_{\mathbf{Z}} p_\theta(\mathbf{A}, \mathbf{Z})\, d\mathbf{Z} = \log \mathbb{E}_{q_\phi(\mathbf{Z} \mid \mathbf{A})}\left[ \frac{p_\theta(\mathbf{Z}, \mathbf{A})}{q_\phi(\mathbf{Z} \mid \mathbf{A})} \right] = \log \mathbb{E}_{q_\phi(\mathbf{Z} \mid \mathbf{A})}\left[ \frac{p(\mathbf{Z})\, p_\theta(\mathbf{A} \mid \mathbf{Z})\, \left| \det \frac{\partial f_\phi}{\partial \mathbf{Z}_0} \right|}{q(\mathbf{Z}_0 \mid \mathbf{A})} \right].$$
We take the sum of the upper-triangular entries of $\log p_\theta(\mathbf{A} \mid \mathbf{Z})$ and then divide by $|V|(|V|-1)/2$ to calculate the number of bits per dimension, independent of the number of nodes in the graph.

Generating large-scale graphs.
We can generate large-scale graphs efficiently by interpreting the Bernoulli-Exponential link as augmented Poisson random variables instead of augmented Exponential random variables. Recalling that the marginal is $p(a_{i,j} = 1) = 1 - e^{-z_i^{*\top} z_j^*}$, let
$$a_{i,j} \mid z^*_i, z^*_j \sim \mathrm{Bernoulli}(m_{i,j} \neq 0), \qquad m_{i,j} \mid z^*_i, z^*_j \sim \mathrm{Poisson}(z_i^{*\top} z_j^*).$$
It follows that the total number of edges is distributed as
$$E = \sum_{i<j} m_{i,j} \sim \mathrm{Poisson}\Bigg( \sum_{d=1}^{P} \frac{1}{2}\bigg[ \Big(\sum_{i=1}^{n} z^*_{i,d}\Big)\Big(\sum_{i=1}^{n} z^*_{i,d}\Big) - \sum_{i=1}^{n} z^{*\,2}_{i,d} \bigg] \Bigg) \stackrel{\mathrm{def}}{=} \sum_{d=1}^{P} E_d.$$
So we can first sample the total number of edges $E$ by sampling $E_d$ (corresponding to each dimension), then sample the nodes $(i, j)$ corresponding to each edge by picking $p(i) \mid E_d \propto z^*_{i,d}$.
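A hedged NumPy sketch of this generation procedure under the Poisson interpretation: sample a per-dimension edge count, then draw endpoints with probability proportional to $z^*_{i,d}$. Discarding duplicate pairs and self-loops, as done here, is our own simplification.

```python
import numpy as np

def sample_edges(Z_star, rng):
    """Sample an edge set from the Poisson view of the Bernoulli-Exponential link
    in roughly O(|E| + |V|) time, without forming the dense |V| x |V| matrix."""
    n, P = Z_star.shape
    edges = set()
    for d in range(P):
        col = Z_star[:, d]
        rate_d = 0.5 * (col.sum() ** 2 - np.sum(col ** 2))   # sum_{i<j} z_{i,d} z_{j,d}
        E_d = rng.poisson(rate_d)                             # edges contributed by dimension d
        if E_d == 0:
            continue
        p = col / col.sum()                                   # endpoint probabilities ∝ z_{i,d}
        i = rng.choice(n, size=E_d, p=p)
        j = rng.choice(n, size=E_d, p=p)
        for a, b in zip(i, j):
            if a != b:                                        # drop self-loops (simplification)
                edges.add((min(a, b), max(a, b)))             # undirected, deduplicated
    return edges

rng = np.random.default_rng(0)
Z_star = rng.random((100, 4)) * 0.2                           # non-negative embeddings
print(len(sample_edges(Z_star, rng)))
```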