Neural Message Passing on High Order Paths
Daniel Flam-Shepherd∗, Tony Wu∗, Pascal Friederich, Alán Aspuru-Guzik
University of Toronto, Karlsruhe Institute of Technology, CIFAR, Vector Institute

Abstract
Graph neural networks have achieved impressive results in predicting molecular properties, but they do not directly account for local and hidden structures in the graph such as functional groups and molecular geometry. At each propagation step, GNNs aggregate only over first order neighbours, ignoring important information contained in subsequent neighbours as well as the relationships between those higher order connections. In this work, we generalize graph neural networks to pass messages and aggregate across higher order paths. This allows information to propagate over various levels and substructures of the graph. We demonstrate our model on a few tasks in molecular property prediction.
Graph Neural Networks (GNNs) are a powerful tool for representation learning across differentdomains involving relational data such as molecules [1] or social and biological networks [2]. Thesemodels learn node embeddings in a message passing framework [3] by passing and aggregatingnode and edge feature information across the graph using neural networks. The learned noderepresentations can then be used for any downstream procedure such as node or graph classificationor regression. In particular, GNNs have been used to drastically reduce the computation time forpredicting molecular properties [3].However, current GNN models still suffer from limitations as they only propagate information acrossneighbouring edges and, after propagation, use simple pooling of final node embeddings [1, 4]. Thismeans that, in most models, nodes only learn about the larger neighbourhood surrounding them overmany propagation steps. This makes it difficult for GNNs to learn higher order graph structure andimpossible to learn in a single propagation layer. However, such long range correlations are importantfor many domains, in particular, when learning chemical properties that depend on rings, branches,functional groups or molecular geometry.The only way to directly account for higher order graph properties is to pass messages over additionalneighbours in every propagation layer of the GNN. Notice how much larger the neighbourhood of theatom gets when you consider second and third order neighbours (Figure 1). This work focuses ongeneralizing message passing neural networks to accomplish this.a) b) c)Figure 1: the neighbourhood of one atom comprised of a) st b) nd and c) rd order neighbours preprint (2019) a r X i v : . [ c s . L G ] F e b .1 Motivations There are many factors pertaining to molecular graphs that motivate the development of our model.In this section we discuss, in more depth, the limitations of GNNs with respect to specific aspects ofmolecules that motivate our model. These include molecular substructures like rings and functionalgroups, molecular geometry as characterized by internal coordinates as well as stereochemistry. HO Figure 2: Cyclohexanol molecular substructures play an important role in determining molec-ular properties for example functional groups are responsible for thechemical reactions a molecule undergoes. By only aggregating overneighbours, GNN cannot learn about these larger substructures in a singlepropagation layer. On the other hand, by passing messages over largerneighbourhoods, in every layer we could directly learn about these struc-tures. Furthermore, We could directly indicate if the path that a messageis traveling on contains a simple functional group like alcohol (ROH) orpasses through a larger functional group. For example, in figure 2, atomsin the neighbourhood of OH could receive messages of length two ormore indicating that an alcohol group is in their neighbourhood. molecular geometry is the three dimensional arrangement of atoms in amolecule and influences several properties, including the reactivity, polarity and biological activity ofthe molecule. An important application of GNNs is predicting quantum mechanical properties ofmolecules, which are heavily dependent on the geometry of the molecule. The 3D configuration of amolecule can be fully specified by 1) bond lengths – the distance between two bonded atoms, 2) bondangles – the angle formed between three neighbouring atoms, and 3) dihedral angles between fourconsecutive atoms. 
In fact, the potential energy is typically modeled as a sum of terms involving each of these three. Current GNN approaches to quantum chemistry incorporate neighbouring geometry by using bond distances as edge features [3], but do not directly account for the relative orientation of neighbouring atoms and bonds – a framework that could do so would be advantageous.

Figure 3: rotation of a bond

Stereochemistry involves the relative spatial arrangement of atoms in molecules, specifically stereoisomers, which are molecules with the same discrete graph but a different three-dimensional orientation of atoms. Examples include enantiomers, which are mirror images of molecules, and cis-trans isomers, which differ only through the rotation of a functional group. Even if they use interatomic distances as edge features, GNNs will have limited ability to distinguish stereoisomers, since these molecules only differ through the relative orientation of atoms.

In general, at every propagation step, GNNs should learn representations over each node's extended neighbourhood to encode the relationships between nodes in that neighbourhood.
We generalize MPNNs to aggregate across larger neighbourhoods by passing messages along simple paths of higher order neighbours. We describe the general framework in Section 3. We experiment with various molecular property prediction tasks and a node classification task in citation networks. Our specific contributions are two-fold:

• We devise a simple extension to any message passing neural network that learns representations over larger node neighbourhoods within each propagation layer, by simply augmenting the message function to aggregate over additional neighbours.

• By summing over additional neighbours, we enable the use of path features such as bond angles for paths of length two and dihedral angles for paths of length three, thus encoding the full molecular geometry and orientation so that MPNNs can distinguish isomers.
2 Related Work and Background

The first graph neural network model was proposed by [5] and many variants have recently been proposed [4, 6, 7]. Our focus is on the general framework of neural message passing from [3]. In this section we review relevant GNN models and their use in molecular deep learning.

Message passing neural networks (MPNNs) operate on graphs $G$ with $n$ nodes, each with a feature vector $x_v \in \mathbb{R}^f$ that specifies what kind of atom the node is, among other possible features. There are $n \times n$ edge feature vectors $e_{vw} \in \mathbb{R}^e$ that specify what kind of bond atoms $v, w$ share. The forward pass has two phases: a message passing phase and a readout phase. The message passing phase runs for $T$ propagation steps and is defined in terms of message functions $M_t$ and node update functions $U_t$. During the message passing phase, hidden states $h_v^t$ at each node in the graph are updated based on messages $m_v^{t+1}$ according to

$$m_v^{t+1} = \sum_{w \in \mathcal{N}_v} M_t(h_v^t, h_w^t, e_{vw}), \qquad h_v^{t+1} = U_t(h_v^t, m_v^{t+1}), \qquad y = \mathrm{Readout}(\{h_v^T\}_{v \in G})$$

The message node $v$ receives aggregates over its neighbours $\mathcal{N}_v$, in this case by simple summation. We then read out predictions $y$ based on the final node embeddings.
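To make the message passing phase concrete, the following is a minimal NumPy sketch of one propagation step with sum aggregation; the linear message and update functions (the weights W_msg and W_upd) are illustrative placeholders, not the specific functions used in [3].

```python
import numpy as np

def message_passing_step(h, e, neighbours, W_msg, W_upd):
    """One MPNN step: m_v = sum_w M(h_v, h_w, e_vw); h_v <- U(h_v, m_v).

    h          : (n, d) node hidden states
    e          : (n, n, de) edge feature vectors
    neighbours : list of neighbour index lists, one per node
    W_msg      : (2*d + de, d) weights of a linear message function (illustrative)
    W_upd      : (2*d, d)      weights of a linear update function (illustrative)
    """
    n, d = h.shape
    m = np.zeros((n, d))
    for v in range(n):
        for w in neighbours[v]:
            # message function M_t: a linear map of the concatenation [h_v, h_w, e_vw]
            m[v] += np.concatenate([h[v], h[w], e[v, w]]) @ W_msg
    # update function U_t: concatenate state and message, apply a dense layer with ReLU
    return np.maximum(0.0, np.concatenate([h, m], axis=1) @ W_upd)

def readout(h):
    # simple sum readout over the final node embeddings
    return h.sum(axis=0)
```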
Molecular Deep Learning
Recently, GNNs have superseded machine learning methods involving hand-crafted feature representations at predicting molecular properties. For example, neural fingerprints generalize standard molecular fingerprints with a differentiable variant that achieves better predictive accuracy [1]. Another model, SchNet [8], defines a continuous-filter convolutional neural network for modeling quantum interactions and achieves state of the art results.
Higher Order GNNs.
Recent work has generalized graph convolutional networks (GCNs) [7] to higher order structure by repeatedly mixing feature representations of neighbours at various distances [9], or by casting GCNs into a general framework inspired by the path integral formulation of quantum mechanics [10]. Both of these works are based on powers of the adjacency matrix and do not directly account for the relationship between higher order neighbours. Another work [11] proposes k-dimensional GNNs in order to take higher order graph structure at multiple scales into account. GNNs and higher order GNNs do not incorporate the relationship between higher order neighbours, which would allow for features that depend on that relationship, namely 'path features'.

Path augmented transformer. Another model, based on the transformer architecture [12], accounts for long range dependencies in molecular graphs by augmenting the edge feature tensor to include some (shortest) path features such as bond type, conjugacy, inter-atomic distance and ring membership.

Structured transformer. A few recently proposed graph neural networks have incorporated directional information. The first [13] builds a model for proteins that considers the local change in the coordinate system for each atom in the chain.
3D GCN. [14] builds a three-dimensional graph convolutional network for predicting molecular properties and biochemical activities from the 3D molecular graph, by augmenting the standard GCN layer with the relative atomic position vector.

Directional message passing. [15] embeds the messages passed between atoms such that each message is associated with a direction in coordinate space; the messages are rotationally equivariant since the associated directions rotate with the molecule. Their message passing scheme transforms messages based on the angle between them in order to encode direction.

Figure 4: Message function and path features for a) a standard MPNN and b) an MPNN passing messages on paths of length 3 in a molecule, with path features involving molecular geometry.

3 Message Passing on Paths

We extend the message passing framework by propagating information from every node's higher order neighbours instead of aggregating messages from only nearest neighbours. The message passing phase is augmented such that hidden states $h_v^t$ at each node in the graph are updated based on messages over all simple paths up to length $\ell$ from its neighbourhood:

$$m_v^{t+1} = \sum_{p \in \mathcal{P}_v^{\ell}} M_t(h_v^t, p) = \sum_{v_1 \in \mathcal{N}_v} \sum_{\substack{v_2 \in \mathcal{N}_{v_1} \\ v_2 \neq v}} \cdots \sum_{\substack{v_\ell \in \mathcal{N}_{v_{\ell-1}} \\ v_\ell \neq v_{\ell-1}, \ldots, v}} M_t(h_v^t, p_{v \to v_\ell}) \tag{1}$$

where we define $p$ to be a path in $\mathcal{P}_v^{\ell}$, the set of all simple paths starting from node $v$ with length $\ell$, and $p_{v \to v_\ell}$ to be the path features along path $p$ from node $v$ to node $v_\ell$. We only sum over simple paths, excluding loops and multiple inclusions of the same node.

For graphs with a large number of nodes and edges, passing messages along paths becomes very expensive and, as in GraphSage [2], sampling a subset of paths of higher order neighbours is necessary. However, for molecules, where the number of neighbours is usually small ($\leq 4$), this is not necessary. Furthermore, one can include domain specific path features in the message function. We describe two examples of these path features below.

Molecular substructures. We can incorporate whether the path travels through a molecular substructure by considering paths of at least length 2, where we have a message function that sums over 2 neighbouring atoms $v \to w \to y$. Along with their node and edge features, the possible path features include ring features, i.e. a one hot indication of whether any atoms are in (specific) rings, as well as whether the path is a functional group (ROH) or lies within a larger functional group:

$$m_v^{t+1} = \sum_{w \in \mathcal{N}_v} \sum_{\substack{y \in \mathcal{N}_w \\ y \neq v}} M_t(h_v^t, p_{v \to y}), \qquad p_{v \to y} = \begin{bmatrix} h_w^t & h_y^t & e_{vw} & p_{vy} \end{bmatrix} \tag{2}$$
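As a concrete illustration of equations (1) and (2), the sketch below enumerates the simple paths of a fixed length from each node and sums one message per path. Using the concatenated node and edge features as the path representation $p_{v \to v_\ell}$, and a linear message function W_msg, are illustrative assumptions rather than the exact design.

```python
import numpy as np

def simple_paths(neighbours, v, length):
    """Enumerate all simple paths [v, v1, ..., v_length] starting at node v."""
    paths = [[v]]
    for _ in range(length):
        # extend each partial path by one unvisited neighbour (keeps paths simple)
        paths = [p + [w] for p in paths for w in neighbours[p[-1]] if w not in p]
    return paths

def path_messages(h, e, neighbours, length, W_msg):
    """m_v = sum over simple paths p of M(h_v, p), where p is represented by the
    concatenated node and edge features along the path (cf. equations (1)-(2))."""
    n, d = h.shape
    m = np.zeros((n, d))
    for v in range(n):
        for p in simple_paths(neighbours, v, length):
            feats = [h[v]]
            for a, b in zip(p[:-1], p[1:]):          # nodes and edges along the path
                feats += [h[b], e[a, b]]
            m[v] += np.concatenate(feats) @ W_msg    # linear message function (illustrative)
    return m
```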
Molecular geometry. Consider paths of length 3, where we have a message function that sums over 3 neighbouring atoms $v \to w \to y \to x$. Along paths of length three, additional features include the two bond angles $\alpha_{vwy}$ and $\alpha_{wyx}$ and the dihedral angle $\varphi_{vwyx}$ between the planes defined by the pairs of atoms $(v, w)$ and $(y, x)$. Effectively, messages passed over 3 consecutive neighbours contain information about the entire molecular geometry (see Figure 4):

$$m_v^{t+1} = \sum_{w \in \mathcal{N}_v} \sum_{\substack{y \in \mathcal{N}_w \\ y \neq v}} \sum_{\substack{x \in \mathcal{N}_y \\ x \neq w, v}} M_t(h_v^t, p_{v \to x}), \qquad p_{v \to x} = \begin{bmatrix} h_w^t & h_y^t & h_x^t & e_{vw} & e_{wy} & e_{yx} & \alpha_{vwy} & \alpha_{wyx} & \varphi_{vwyx} \end{bmatrix} \tag{3}$$

Dataset                  | QM8               | ESOL              | CEP
Units                    | MAE in eV (×10⁻³) | RMSE in log Mol/L | Percent
neural fingerprint [1]   | 13.80             | –                 | –
Molnet [16]              | –                 | –                 | –
MPNN                     | –                 | –                 | –
path transformer [12]    | 10.20             | –                 | –
Path MPNN                | 8.70              | –                 | –

Table 1: Mean predictive accuracy on the various datasets and baselines.
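The geometric features in equation (3) can be computed directly from atomic coordinates; below is a minimal sketch using the standard bond angle and dihedral definitions (the function names and the feature assembly are our own, not code from the paper).

```python
import numpy as np

def bond_angle(r_v, r_w, r_y):
    """Bond angle alpha_vwy (radians) formed at atom w by atoms v, w, y."""
    a, b = r_v - r_w, r_y - r_w
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def dihedral_angle(r_v, r_w, r_y, r_x):
    """Dihedral phi_vwyx (radians) between the planes (v, w, y) and (w, y, x)."""
    b1, b2, b3 = r_w - r_v, r_y - r_w, r_x - r_y
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)      # normals of the two planes
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(np.dot(m1, n2), np.dot(n1, n2))

def geometry_path_features(h, e, coords, path):
    """Assemble p_{v->x} of equation (3) for a length-3 path (v, w, y, x)."""
    v, w, y, x = path
    return np.concatenate([
        h[w], h[y], h[x],                            # node hidden states along the path
        e[v, w], e[w, y], e[y, x],                   # edge features along the path
        [bond_angle(coords[v], coords[w], coords[y]),
         bond_angle(coords[w], coords[y], coords[x]),
         dihedral_angle(coords[v], coords[w], coords[y], coords[x])],
    ])
```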
Figure 5: Example molecules from the datasets considered: a) QM8, b) ESOL, c) CEP.

4 Experiments

We compare the performance of our model against a few baselines on a variety of molecular property prediction tasks involving different datasets. These tasks include:

• ESOL [17]: predicting the aqueous solubility of 1144 molecules.

• QM8 [18]: predicting 16 electronic spectra values, calculated using density functional theory, for 21786 organic molecules with 8 or fewer heavy atoms (C, O, N and F).

• CEP: predicting the photovoltaic efficiency of 20000 organic molecules from the Harvard Clean Energy Project [19].
We use the following basic MPNN model, augmented along the lines of Section 3 in order to pass messages over paths (Path MPNN):

$$m_v^{t+1} = \mathrm{Attention}_{w \in \mathcal{N}_v}\, M_t(h_v^t, h_w^t, e_{vw}), \qquad U_t = \sigma([h_v^t, m_v^t]), \qquad y = \mathrm{Set2Set}_{v \in G}(\{h_v^T, x_v\})$$

This uses graph attention [6] as the aggregation method and the message function from the interaction networks model of [20], which is a simple concatenation of node and edge features. The node update function concatenates incoming messages with the current node state and feeds the result through a dense layer. After propagation through the message passing layers, we use the set2set model [21] as the readout function to combine the node hidden features into a fixed-size hidden vector. For QM8 we pass messages over paths of length three and use the path features for molecular geometry specified in equation (3). For ESOL and CEP we pass messages over paths of length two and use the path features for molecular substructures specified in equation (2). The models are trained using root mean squared error (RMSE) as the loss. Model evaluation is done using the mean absolute error (MAE) of the molecular properties for the QM8 dataset, RMSE for ESOL and percent error for CEP.

We use the top performing model from MoleculeNet [16] (Molnet) for each dataset. We also benchmark against the differentiable version of circular fingerprints from [1] (neural fingerprints). To highlight the importance of path features, we also compare the performance of the MPNN model we used without passing messages on paths. The last benchmark is the Path-Augmented Graph Transformer Network (PAGTN), since this model is similarly built to model longer-range dependencies in molecular graphs. As can be seen in Table 1, for QM8, ESOL and CEP, passing messages over paths leads to a substantial improvement in predictive accuracy.
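To illustrate the base model's aggregation, here is a minimal single-head sketch of attention-weighted message aggregation with the concatenation message function; the scoring vector a_att is a simplified assumption rather than the exact parameterization of [6].

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def attention_step(h, e, neighbours, W_msg, a_att, W_upd):
    """m_v = sum_w alpha_vw * M(h_v, h_w, e_vw), with attention weights alpha
    computed from a learned scoring vector a_att (simplified, single head)."""
    n, d = h.shape
    m = np.zeros((n, d))
    for v in range(n):
        if not neighbours[v]:
            continue
        msgs = np.stack([np.concatenate([h[v], h[w], e[v, w]]) @ W_msg
                         for w in neighbours[v]])    # one message per neighbour
        alpha = softmax(msgs @ a_att)                # attention weights over neighbours
        m[v] = alpha @ msgs                          # attention-weighted aggregation
    # update: concatenate state and message, pass through a dense layer
    return np.maximum(0.0, np.concatenate([h, m], axis=1) @ W_upd)
```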
Comparison with other Higher Order GNNs
In a separate experiment, we compare the Path MPNN with other GNNs that use higher order neighbours and do not use edge features. We consider the standard task of semi-supervised node classification on the CORA dataset.
It contains sparse bag-of-words feature vectors for each document and a list of citation links between documents, which we use as undirected edges in the adjacency matrix. Each document has a class label. Altogether, the network has 2,708 nodes and 5,429 edges, with 7 classes and 1,433 features.
Model       | Test accuracy
GCN [7]     | 81.5
MixHop [9]  | 81.9
PAN [10]    | 82.0
Path GCN    | 82.4

Table 2: Test accuracy for semi-supervised node classification on CORA.
We use the experimental setup of [7]. We sum over paths of length 3 while uniformly sampling a single second order and third order neighbour. Our base MPNN is a GCN [7] with message function

$$m_v^{t+1} = \sum_{w \in \mathcal{N}_v} \hat{A}_{vw} h_w^t, \qquad U_t = \sigma(m_v^t)$$

where $\sigma$ is a dense layer with sigmoid activation. For a citation network, the path features are just the node and edge features connecting $v$ to nodes that are $\ell$ hops away, i.e. $p_{v \to v_\ell} = \{h_{v_1}^t, e_{v v_1}, \ldots, h_{v_\ell}^t, e_{v_\ell v_{\ell-1}}\}$. We compare with two other higher order GCN variants, MixHop [9] and PAN [10] (path integral graph convolution), both of which use powers of the adjacency matrix to aggregate GCN layers over higher order neighbours. As the results table above shows, our model achieves accuracy similar to the baselines.
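A minimal sketch of the sampled length-3 path aggregation described above, assuming a normalized adjacency matrix A_hat as in [7]; the uniform sampling of one second and one third order neighbour follows the setup in the text, while the linear path function W_path is an illustrative placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_path_gcn_message(h, A_hat, neighbours, W_path):
    """m_v aggregates sampled length-3 simple paths: for each first order
    neighbour w, sample one second order neighbour y and one third order
    neighbour x, then apply a linear function of the path's node features."""
    n, d = h.shape
    m = np.zeros((n, d))
    for v in range(n):
        for w in neighbours[v]:                      # all first order neighbours
            cand2 = [y for y in neighbours[w] if y != v]
            if not cand2:
                continue
            y = rng.choice(cand2)                    # sample one 2nd order neighbour
            cand3 = [x for x in neighbours[y] if x not in (v, w)]
            if not cand3:
                continue
            x = rng.choice(cand3)                    # sample one 3rd order neighbour
            path_feats = np.concatenate([h[w], h[y], h[x]])
            m[v] += A_hat[v, w] * (path_feats @ W_path)
    return m
```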
Limitations
In this work we only considered very simple message functions; in general, it is not straightforward to construct message functions over paths. For example, the message function from [3] maps edge features to a square matrix using a neural net – incorporating more neighbours and their edge and path features into this kind of message function introduces many design challenges.

Conclusion

We introduce a general GNN framework based on message passing over simple paths of higher order neighbours. This allows us to use path features in addition to node and edge features, which is very useful for molecular graphs, as many informative features are characterized by the paths between atoms. We benchmarked our framework on molecular property prediction tasks and a node classification task in citation networks.
References

[1] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.

[2] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[3] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1263–1272. JMLR.org, 2017.

[4] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[5] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

[6] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[7] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[8] Kristof T. Schütt, Pieter-Jan Kindermans, Huziel E. Sauceda, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions, 2017.

[9] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-order graph convolution architectures via sparsified neighborhood mixing. arXiv preprint arXiv:1905.00067, 2019.

[10] Zheng Ma, Ming Li, and Yuguang Wang. PAN: Path integral based convolution for deep graph neural networks. arXiv preprint arXiv:1904.10996, 2019.

[11] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4602–4609, 2019.

[12] Benson Chen, Regina Barzilay, and Tommi Jaakkola. Path-augmented graph transformer network. arXiv preprint arXiv:1905.12712, 2019.

[13] John Ingraham, Vikas Garg, Regina Barzilay, and Tommi Jaakkola. Generative models for graph-based protein design. In Advances in Neural Information Processing Systems, pages 15794–15805, 2019.

[14] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[15] Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. In International Conference on Learning Representations, 2020.

[16] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

[17] John S Delaney. ESOL: estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.

[18] Lars Ruddigkeit, Ruud Van Deursen, Lorenz C Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012.

[19] Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S Sánchez-Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M Brockway, and Alán Aspuru-Guzik. The Harvard Clean Energy Project: large-scale computational screening and design of organic photovoltaics on the World Community Grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.

[20] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.

[21] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.