Heterogeneous Molecular Graph Neural Networks for Predicting Molecule Properties
Zeren Shui
Computer Science & Engineering, University of Minnesota, Minneapolis, MN
[email protected]
George Karypis
Computer Science & Engineering, University of Minnesota, Minneapolis, MN
[email protected]
Abstract—As they carry great potential for modeling complex interactions, graph neural network (GNN)-based methods have been widely used to predict quantum mechanical properties of molecules. Most of the existing methods treat molecules as molecular graphs in which atoms are modeled as nodes. They characterize each atom's chemical environment by modeling its pairwise interactions with the other atoms in the molecule. Although these methods have achieved great success, only a limited number of works explicitly take many-body interactions, i.e., interactions among three or more atoms, into consideration. In this paper, we introduce a novel graph representation of molecules, the heterogeneous molecular graph (HMG), in which nodes and edges are of various types, to model many-body interactions. HMGs have the potential to carry complex geometric information. To leverage the rich information stored in HMGs for chemical prediction problems, we build heterogeneous molecular graph neural networks (HMGNN) on the basis of a neural message passing scheme. HMGNN incorporates global molecule representations and an attention mechanism into the prediction process. The predictions of HMGNN are invariant to translation and rotation of atom coordinates, and to permutation of atom indices. Our model achieves state-of-the-art performance in 9 out of 12 tasks on the QM9 dataset.
Index Terms—Heterogeneous molecular graphs, many-body interactions, graph neural networks, molecular property prediction
I. INTRODUCTION
Predicting quantum mechanical properties of molecules based on their structures is important for molecule screening and drug design. We can compute exact molecular properties by solving the many-body Schrödinger equation. However, a closed-form solution to this equation is only available for simple systems. Although researchers have developed methods such as Density Functional Theory (DFT) [1] to approximate the solution, the computational cost of these methods scales poorly, worse than O(n^3) w.r.t. the number of electrons.

Recently, researchers have been developing machine learning methods that are orders of magnitude faster with a moderate compromise in prediction accuracy. Among the machine learning approaches, graph neural network (GNN)-based methods attract a lot of research attention due to their ability to model complex interactions among atoms. These methods treat molecules as molecular graphs (e.g., distance graphs [2]–[5], chemical graphs [6], K-nearest neighbor graphs [7]) in which atoms are modeled as nodes. They compute an atom's low-dimensional representation as a function of its features and the characteristics of its graph neighbors. The low-dimensional representations are then used to estimate the local contribution of the atoms to the desired property, or to compute a global representation of the molecule for downstream predictions. Our implementation is available at https://github.com/shuix007/HMGNN.

The many-body expansion (MBE) [8]–[10] is an important scheme that computes the energy of an N-particle system as the sum of the contributions of many-body terms:

E = Σ_i E_i + Σ_{i<j} E_{ij} + Σ_{i<j<k} E_{ijk} + ⋯   (1)

where E_i, E_{ij}, and E_{ijk} denote one-, two-, and three-body contributions, respectively.
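For concreteness, truncating Eq. 1 after the two-body terms for a three-atom molecule such as water gives the following approximation; this is a worked instance of the standard expansion, not an equation from the paper:

```latex
E \;\approx\; \underbrace{E_1 + E_2 + E_3}_{\text{one-body terms}}
  \;+\; \underbrace{E_{12} + E_{13} + E_{23}}_{\text{two-body terms}},
\qquad \text{neglecting the single three-body term } E_{123}.
```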
II. RELATED WORK

Traditionally, prediction of many important molecular properties, such as atomization energies, relies on methods that approximate the solution of the many-body Schrödinger equation, such as density functional theory (DFT) and its variants [14]. This class of methods involves solving complex linear systems and has a computational complexity worse than O(n^3), where n is the number of atoms.

Recent years have seen a surge in data-driven methods that train machine learning models to learn patterns from molecule databases. The learned patterns are assumed to be general in chemical space and can be used to estimate properties of unknown compounds. These attempts started from [15], [16], which feed hand-crafted molecule descriptors (e.g., the Coulomb matrix, bag of bonds) into regression models such as linear regression and random forests. These methods rely heavily on the quality of the crafted descriptors and have limited representation power.

Recently, graph neural networks (GNN) have achieved great success in graph-related applications [17]–[20]. In chemistry, researchers have developed GNN-based methods for learning tasks over graph-represented molecules. The authors of [6] introduced a generic framework over chemical graphs that models interactions between atoms in a message passing fashion. In [3], [4], [21], the authors designed neural network structures that have no dependency on hand-crafted features but learn molecule representations from only atom types and coordinates. Since GNNs possess a hierarchical structure, i.e., they iteratively apply GNN layers on graphs to encode each node's multi-hop neighbors into its embedding, the GNN-based methods of [22] and [2] further decompose the atom-wise prediction into layer-wise atom predictions to fit the MBE framework. Although these methods include many-body contributions in their final predictions, they do not explicitly model many-body representations and interactions. Some recent works have incorporated many-body interactions and representations by updating edge embeddings along message passing [7] or by passing messages on line graphs of the corresponding molecular graphs [23]. However, these methods capture only partial many-body interactions and lack many-body predictions.

Equivariant neural networks are another class of neural network methods that have been applied to chemical prediction problems. The notion of a group equivariant neural network was first introduced by [24] in the domain of image processing. Later, researchers developed neural network methods that are equivariant to continuous rotations for learning representations of 3D objects, including molecules [25]–[27]. These methods achieve rotation invariance by transforming objects from Euclidean space to Fourier space and conducting the computations in Fourier space. In these methods, each many-body interacts only with itself but not with other many-bodies. Thus, they are not optimal for predicting molecule properties.

III. NOTATIONS AND DEFINITIONS

We denote matrices by bold upper-case letters (e.g., W) and vectors by bold lower-case letters (e.g., x). We denote entries of a matrix/vector by lower-case letters with subscripts (e.g., x_{ij} / x_i). We use superscripts to indicate variables at the t-th message passing layer (e.g., h^{(t)}). We denote molecular graphs by G = (V, E), where V and E represent the sets of nodes (atoms) and edges, respectively. Two atoms are connected in a molecular graph when the Euclidean distance between them is less than a cutoff threshold c > 0. Each edge in the graph is associated with a distance to store the geometric structure of the molecule. We define a p-body in a molecular graph G as a p-clique of the graph. We refer to the value of p as the order of the many-body.

IV. HETEROGENEOUS MOLECULAR GRAPH AND MANY-BODY INTERACTIONS

In this section, we illustrate the construction of heterogeneous molecular graphs (HMG) and how we leverage the heterogeneous structure of HMGs to model many-body representations and interactions.

A. Heterogeneous Molecular Graph

An HMG is a graph in which nodes are many-bodies and edges are defined by various types of geometric and set relations. HMGs are constructed from molecular graphs. We denote an HMG of order N of a molecular graph G as H_N(G) = ({V_p}, {E_pq}), where 1 ≤ p ≤ q ≤ N, V_p is the set of p-bodies in G (i.e., all p-cliques of G), and E_pq is the set of edges between V_p and V_q. We denote the order p of the p-bodies as the node type and p-q as the type of the edges that connect nodes of order p and nodes of order q.

Fig. 1: An example of a heterogeneous molecular graph (HMG). (a) A formaldehyde molecule. (b) A molecular graph of the formaldehyde molecule. (c) A heterogeneous molecular graph of the molecular graph. Figure 1(a) shows the spatial structure of a formaldehyde (CH_2O) molecule. Each atom in the molecule is associated with three-dimensional coordinates in Euclidean space. Figure 1(b) is the molecular graph of the formaldehyde molecule with a cutoff distance c = 2. We convert atom coordinates to pairwise distances to guarantee translation and rotation invariance of the representation. We denote edges whose distances are less than c using black solid lines, and edges that are removed by the cutoff using black dotted lines. Figure 1(c) is an HMG of order two constructed from the molecular graph. There are two types of nodes (1-bodies and 2-bodies, denoted by yellow and blue circles, respectively) and three types of edges (1-1 and 2-2 edges, denoted by yellow and blue lines, respectively, and 1-2 edges, denoted by black dashed lines) in the HMG. Edges between nodes of the same order are associated with features that depict the geometric relation between the nodes (distance for 1-1 edges, angle for 2-2 edges).
Given two nodes i ∈ V_p and j ∈ V_q, when they are of the same order, i.e., p = q, i and j are connected if they share p − 1 atoms. A special case is p = q = 1: instead of building a complete graph, we use the edge set E of the molecular graph to define the connections. When the two nodes are of different orders, say p < q, (i, j) ∈ E_pq if i is a sub-graph of j. An example HMG is shown in Figure 1. With this formulation, we can explicitly model up to N-body representations by node embeddings and up to (N+1)-body interactions by message passing.

In an HMG, each node i of order p is associated with a discrete feature Z_{p,i} that indicates its atomic composition, and a continuous feature x_{p,i} that describes aspects of its geometry. Note that nodes of order 1 do not have continuous features, since they are points in Euclidean space and have no geometric structure. Each edge (i, j) is associated with an edge feature e_{p,ij} when i and j are of the same order p. The edge feature characterizes the geometric relation between the two nodes, e.g., the distance between atoms or the angle between bonds. In this paper, we use a hash function to map the set of atomic numbers of the atoms to Z_{p,i}. Construction of continuous node features and edge features requires feature engineering, especially when the order of the many-bodies is high. We illustrate how we convert geometric information to feature vectors up to the second order in Section VI-A.
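To make the construction above concrete, the following is a minimal sketch of building an order-two HMG from atom coordinates; the function and variable names are ours, not from the authors' released code, and feature engineering beyond pair distances and bond angles is omitted:

```python
import numpy as np
from itertools import combinations

def build_hmg_order2(coords, cutoff):
    """Sketch: node and edge sets of an order-two HMG.

    1-bodies are atoms; 2-bodies are 2-cliques (atom pairs within the
    cutoff). 2-2 edges connect pairs sharing one atom; 1-2 edges connect
    an atom to every pair containing it (the sub-graph relation)."""
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

    # 1-1 edges: the molecular (distance) graph; edge feature = distance.
    edges_11 = [(i, j, dist[i, j]) for i, j in combinations(range(n), 2)
                if dist[i, j] < cutoff]

    # 2-bodies: one node per atom pair; continuous feature = pair length.
    two_bodies = [(i, j) for i, j, _ in edges_11]

    # 1-2 edges: atom -> pair containment.
    edges_12 = [(a, k) for k, pair in enumerate(two_bodies) for a in pair]

    # 2-2 edges: pairs sharing exactly one atom; feature = angle there.
    edges_22 = []
    for k1, k2 in combinations(range(len(two_bodies)), 2):
        shared = set(two_bodies[k1]) & set(two_bodies[k2])
        if len(shared) == 1:
            s = shared.pop()
            u = next(a for a in two_bodies[k1] if a != s)
            v = next(a for a in two_bodies[k2] if a != s)
            du, dv = coords[u] - coords[s], coords[v] - coords[s]
            cos = du @ dv / (np.linalg.norm(du) * np.linalg.norm(dv))
            edges_22.append((k1, k2, np.arccos(np.clip(cos, -1.0, 1.0))))
    return edges_11, two_bodies, edges_12, edges_22
```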
B. Message Passing on Heterogeneous Molecular Graphs

The message passing framework consists of two phases, message passing and node update. On molecular graphs, each node (atom) i sends/receives messages to/from its neighbors and uses the received messages to update its embedding:

m_i^{(t)} = Σ_{j ∈ N(i)} f^{(t)}( h_i^{(t)}, h_j^{(t)}, e_{ij} ),
h_i^{(t+1)} = g^{(t)}( h_i^{(t)}, m_i^{(t)} ).   (2)

In Eq. 2, N(i) is the set of neighbor nodes of i, h_i^{(t)} is the node (atom) embedding of i, m_i^{(t)} is the aggregation of messages from i's neighbor nodes, e_{ij} is the edge feature associated with the edge between i and j, f(·) is a message function that maps the embeddings of the sender and the receiver and the corresponding edge feature to a message vector, and g(·) is a node update function that combines the incoming message and the old embedding into the new node embedding. Both f(·) and g(·) are learnable.

Message passing on HMGs is different from that on molecular graphs due to the heterogeneous structure of HMGs. Nodes in HMGs are of different orders and they pass messages through edges of different types. A message passing framework needs to learn edge-type-specific message functions and order-specific node update functions to capture this heterogeneous structure. Moreover, the framework should allow inter-order message passing so that the node embeddings can capture information from other orders. For example, by receiving messages from 1-bodies, 2-bodies can encode edge angle information into their embeddings.

Let i ∈ V_p be a node of order p in an HMG and h_{p,i}^{(t)} be its embedding at the t-th layer. We design the message passing framework as

m_{q,i}^{(t)} = Σ_{j ∈ N_q(i)} f_{qp}^{(t)}( h_{p,i}^{(t)}, h_{q,j}^{(t)}, e_{ij} ),
h_{p,i}^{(t+1)} = g_p^{(t)}( h_{p,i}^{(t)}, m_{1,i}^{(t)}, m_{2,i}^{(t)}, ..., m_{N,i}^{(t)} ),   (3)

where N_q(i) is the set of nodes of order q that are connected to i, m_{q,i}^{(t)} denotes the aggregated messages from node i's neighbor nodes of order q, e_{ij} denotes the edge feature between i and j if they are of the same order, and f_{qp}(·) and g_p(·) are learnable functions specific to edge type q-p and node type (order) p, respectively. Compared to the message passing framework on molecular graphs, which has only two functions to learn, this framework possesses a larger model capacity and is able to model many-body interactions explicitly.

V. HETEROGENEOUS MOLECULAR GRAPH NEURAL NETWORKS

We present Heterogeneous Molecular Graph Neural Networks (HMGNN) for the purpose of predicting molecule properties. An HMGNN contains four types of modules: an input module, an interaction module, an output module, and a fusion module. All the modules except the fusion module are order specific. HMGNNs learn functions for message passing on heterogeneous molecular graphs to compute local node representations, and use a readout function to combine the representations into a global molecule representation. HMGNNs compute node-wise contributions to the target property and aggregate them based on their orders. The final prediction is a weighted combination of the predictions of all orders, where the weights are computed by an attention mechanism from the global molecule representation. An HMGNN is learned by optimizing a loss function that forces the prediction of each order and the fused prediction to be close to the true target. Since the construction of heterogeneous molecular graphs and the associated features relies on atom pairwise distances and atomic numbers but not on atom coordinates, HMGNNs are invariant under both translations and rotations. HMGNNs are also permutation invariant to atom indices, as the message aggregation function in Eq. 3 and the readout function are permutation invariant [28]. Figure 2 shows an overview of the architecture of HMGNN.

A. Input Module

The input module of HMGNN converts the raw features of nodes to latent embeddings. As described in Section IV-A, each node i ∈ V_p in an HMG is associated with a discrete feature Z_{p,i} and a continuous feature x_{p,i}. We use an embedding lookup table to map the discrete feature Z_{p,i} to a real-valued vector e_{Z_{p,i}} and apply a fully connected layer to the concatenation of the latent vector e_{Z_{p,i}} and the continuous feature x_{p,i} to get the initial node embedding

h_{p,i}^{(1)} = φ( W_p^{in} ( e_{Z_{p,i}} ‖ x_{p,i} ) + b_p^{in} ),   (4)

where W_p^{in} and b_p^{in} are learnable parameters for nodes of order p (p-bodies), φ(·) is an element-wise activation function, and ‖ denotes the concatenation of vectors.
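A minimal PyTorch sketch of Eq. 4 for one node order follows; layer sizes and names are our assumptions, and the shifted softplus helper is reused in the next sketch. For 1-bodies, which lack continuous features, a zero-width feature tensor can be passed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shifted_softplus(x):
    # phi(x) = ln(0.5 e^x + 0.5) = softplus(x) - ln 2
    return F.softplus(x) - 0.6931471805599453

class InputModule(nn.Module):
    """Sketch of Eq. 4 for nodes of a single order p."""
    def __init__(self, num_node_types, cont_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(num_node_types, hidden_dim)   # lookup for Z_{p,i}
        self.lin = nn.Linear(hidden_dim + cont_dim, hidden_dim)  # W_p^in, b_p^in

    def forward(self, Z, x):
        # h^(1)_{p,i} = phi(W_p^in (e_{Z_{p,i}} || x_{p,i}) + b_p^in)
        return shifted_softplus(self.lin(torch.cat([self.embed(Z), x], dim=-1)))
```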
B. Interaction Module

HMGNN stacks T interaction modules to encode information from across far reaches of the heterogeneous molecular graph into the node embeddings. Each interaction module takes the output embeddings of the previous module and updates the embeddings. Note that edges between nodes of the same order have features while other edges do not. As a result, we parametrize the message functions between nodes of the same order as

m_{p,i}^{(t)} = Σ_{j ∈ N_p(i)} G_p^{(t)} e_{p,ij} ⊙ φ( W_{pp}^{(t)} h_{p,j}^{(t)} + b_{pp}^{(t)} ),   (5)

and the message functions along edges without features as

m_{q,i}^{(t)} = Σ_{j ∈ N_q(i)} φ( W_{qp}^{(t)} h_{q,j}^{(t)} + b_{qp}^{(t)} ).   (6)

In Eq. 5 and Eq. 6, N_p(i) and N_q(i) denote the sets of neighbor nodes of order p and order q of node i, respectively, ⊙ denotes the Hadamard product, and G, W, and b are learnable parameters. A node embedding h_{p,i}^{(t)} is then updated as a function of its old embedding and the incoming messages,

h_{p,i}^{(t+1)} = h_{p,i}^{(t)} + φ( W_p^{(t)} ( h_{p,i}^{(t)} ‖ m_{1,i}^{(t)} ‖ ⋯ ‖ m_{N,i}^{(t)} ) + b_p^{(t)} ),   (7)

where ‖ denotes the concatenation of vectors. The interaction module then refines the node embeddings with two consecutive fully connected layers with residual connections [29].
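The per-layer computation of Eqs. 5–7 can be sketched as follows for an order-two HMG; the gather/scatter edge layout, the names, and the restriction to two orders are our simplifications of the general framework:

```python
class InteractionModule(nn.Module):
    """Sketch of Eqs. 5-7 for the nodes of one order p in an order-two HMG."""
    def __init__(self, dim, edge_dim):
        super().__init__()
        self.gate = nn.Linear(edge_dim, dim, bias=False)  # G_p^(t) in Eq. 5
        self.msg_same = nn.Linear(dim, dim)               # W_pp^(t), b_pp^(t)
        self.msg_cross = nn.Linear(dim, dim)              # W_qp^(t), b_qp^(t)
        self.update = nn.Linear(3 * dim, dim)             # W_p^(t) in Eq. 7

    def forward(self, h_p, h_q, same_src, same_dst, e_same, cross_src, cross_dst):
        # Eq. 5: gated messages along same-order edges (with edge features).
        m_same = torch.zeros_like(h_p)
        msg = self.gate(e_same) * shifted_softplus(self.msg_same(h_p[same_src]))
        m_same.index_add_(0, same_dst, msg)
        # Eq. 6: plain messages along inter-order (feature-less) edges.
        m_cross = torch.zeros_like(h_p)
        m_cross.index_add_(0, cross_dst,
                           shifted_softplus(self.msg_cross(h_q[cross_src])))
        # Eq. 7: residual update from the concatenated old state and messages.
        cat = torch.cat([h_p, m_same, m_cross], dim=-1)
        return h_p + shifted_softplus(self.update(cat))
```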
C. Output Module

Each many-body order p possesses a specific output module that passes the output of its interaction module, the final node embeddings h_{p,i}^{(T+1)}, through a sequence of linear mappings and an aggregation process to compute the estimated value of the target property. First, we use a fully connected layer to convert the node embeddings to node predictions

ỹ_{p,i} = w_p^{out} h_{p,i}^{(T+1)} + b_p^{out},   (8)

where w_p^{out} and b_p^{out} are learnable parameters for nodes of order p. Then we follow [2] and scale the predictions with scaling parameters that are specific to the discrete feature Z_{p,i} of the nodes:

ŷ_{p,i} = s_{Z_{p,i}} ỹ_{p,i} + r_{Z_{p,i}},   (9)

where s and r are learnable embedding lookup tables that map Z_{p,i} to the corresponding scaling factors and shifts. The goal of the scaling layer is to adapt the magnitude of the predictions to different unit systems of the target property.

D. Fusion Module

The fusion module computes a global molecule representation out of the final node embeddings and uses the global representation to weigh the predictions of the different orders. We sum the final node embeddings h_{p,i}^{(T+1)} of each p-body to form an order-specific representation and concatenate them into an intermediate representation

ṽ = Σ_{i ∈ V_1} h_{1,i}^{(T+1)} ‖ Σ_{i ∈ V_2} h_{2,i}^{(T+1)} ‖ ⋯ ‖ Σ_{i ∈ V_N} h_{N,i}^{(T+1)}.   (10)

Fig. 2: Computation flow of heterogeneous molecular graph neural networks (HMGNN) for many-bodies up to order two. We use □ to represent the input to a function. The activation function is the shifted softplus, φ(x) = ln(0.5 e^x + 0.5). Each many-body order p owns its own input module, interaction module, and output module. For each node i of order p, an input module converts the discrete and continuous features of the node to an initial node embedding h_{p,i}^{(1)}. HMGNN passes the initial embeddings through a stack of T interaction modules to encode information from neighbor nodes of different orders into the node embedding. The outputs of the last interaction module, the final node embeddings h_{p,i}^{(T+1)}, are then fed into a fusion module and an output module to compute a weight vector α and predictions ŷ_{p,i}, respectively. HMGNN sums the predictions per many-body order and computes the final prediction as a weighted sum of these summed predictions.

Since node embeddings of different orders are computed by different parameters and the number of nodes per order also varies, the distributions of the order-specific representations can be dramatically different from each other. In order to unify the distributions of the representations and to accelerate training, we apply batch normalization [30] followed by a fully connected layer on the intermediate representation to obtain the global representation:

v = BatchNorm(ṽ),   z = φ( W v + b ).   (11)

Then we pass the global representation through an attention layer to compute the weight α_p that measures the importance of the predictions of order p:

α_p = exp( LeakyReLU( z^T a_p ) ) / Σ_{q=1}^{N} exp( LeakyReLU( z^T a_q ) ),   (12)

where the a_p are learnable vectors and Σ_p α_p = 1. We can understand the global representation as a query to the knowledge base distilled in the vectors a_p for assigning contributions to the predictions of different orders. This gives the model better flexibility and explainability in dealing with different molecules.

E. Final Prediction

Inspired by the many-body expansion, we decompose the final prediction as a weighted sum of the predictions of the different orders:

ŷ = α_1 Σ_{i ∈ V_1} ŷ_{1,i} + α_2 Σ_{i ∈ V_2} ŷ_{2,i} + α_3 Σ_{i ∈ V_3} ŷ_{3,i} + ⋯,   (13)

where the weights α_p are computed by the fusion module.

F. Model Training

Since all the modules in HMGNNs except for the fusion module are order specific, and the final prediction is a weighted average of the predictions per order, training HMGNNs by optimizing objective functions that depend only on the final (fused) prediction may cause gradient vanishing issues for the parameters of some orders, so that these parameters do not learn enough and lose their prediction utility. To avoid this issue, we treat the computation of each order as a separate prediction task and propose a multi-task objective function that forces the predictions of all orders, together with the final prediction, to be close to the true target:

L = 1/(N+1) ( |ŷ − y| + Σ_{p=1}^{N} |ŷ_p − y| ) + λ ‖Θ‖²,   (14)

where ŷ_p = Σ_{i ∈ V_p} ŷ_{p,i} is the order-specific prediction, Θ denotes all trainable parameters of the model, and λ ≥ 0 is a hyper-parameter that controls the strength of the L2 regularization to prevent the model from overfitting. This objective function preserves the gradient flow for the parameters of each order and gives higher training importance to the orders to which the fusion module assigns larger weights.

G. Complexity Analysis

The time and space complexity of HMGNN depends linearly on the number of nodes and edges in an HMG. The number of nodes determines the complexity of the input module and the output module, while the number of edges determines the complexity of message passing.

Let G be a molecular graph with N atoms and H_P(G) be its HMG that explicitly models up to P-bodies. We assume G is a complete graph for the worst-case scenario. The number of nodes of order p in H_P(G) is \binom{N}{p}. Let i ∈ V_p be a node of order p (i.e., a p-body); i is connected to nodes of order q, where q ∈ {1, ..., P}. When q < p, the number of order-q neighbors of node i is \binom{p}{q}, as i is connected to all q-bodies that are sub-graphs of i; when q = p, the number of order-p neighbors of i is p(N − p), since i is connected to the p-bodies that share p − 1 atoms with i; when q > p, the number of q-body neighbors of i is \binom{N−p}{q−p}.
As a result, the complexity of message passing is

Σ_{p=1}^{P} \binom{N}{p} ( Σ_{q=1}^{p−1} \binom{p}{q} + Σ_{q=p+1}^{P} \binom{N−p}{q−p} + p(N − p) ),   (15)

and the complexity of the input/output modules of HMGNN is Σ_{p=1}^{P} \binom{N}{p}.

In this paper, we experiment with HMGs and HMGNNs for up to 2-bodies; consequently, the time complexity and space complexity of our model are both O(N^3). Modern computing architectures such as graphics processing units (GPU) and tensor processing units (TPU) are optimized to accelerate this computation. Empirically, HMGNNs can generate property predictions for 10000 randomly drawn molecules from the QM9 dataset in 4 seconds.

VI. EXPERIMENTS

We conduct experiments to investigate three research questions in regard to many-body modeling and the HMGNN model:
• How does HMGNN perform in molecule property prediction tasks compared against the current state-of-the-art methods?
• How do many-body representations, interactions, and predictions contribute to the prediction?
• What is the utility of the components of HMGNN?

TABLE I: Target properties in the QM9 dataset.

  Target    Description
  µ         Dipole moment
  α         Isotropic polarizability
  ε_HOMO    Energy of the highest occupied molecular orbital (HOMO)
  ε_LUMO    Energy of the lowest occupied molecular orbital (LUMO)
  Δε        Gap, the difference between LUMO and HOMO
  ⟨R²⟩      Electronic spatial extent
  ZPVE      Zero point vibrational energy
  U_0       Internal energy at 0 K
  U         Internal energy at 298.15 K
  H         Enthalpy at 298.15 K
  G         Free energy at 298.15 K
  C_v       Heat capacity at 298.15 K

A. Implementation Details

We experiment with HMGs and HMGNNs for many-bodies up to order two. There are two types of nodes (1-bodies and 2-bodies), two types of edges with edge features (1-1 and 2-2 edges), and one type of edge without edge features (1-2 edges). Since 1-bodies are atoms, they only have discrete features. Each 2-body i is determined by its two end atoms and the distance between them, d_{2,i}.

There are three types of geometries that we need to model: the distance d_{ij} ∈ (0, c) between 1-bodies i and j, the length l_{2,i} ∈ (0, c) of 2-bodies, and the angle θ_{ij} ∈ [0, π] between 2-bodies i and j. We use a set of K radial basis functions (RBF) to convert the scalar geometries to real-valued vector features. Let x ∈ [a, b] be a scalar input and x ∈ R^K be the real-valued output of the RBFs; the k-th entry of x is computed as

x_k = exp( −β_k ( exp(−x) − µ_k )² ),   (16)

where µ_k and β_k specify the center and width of x_k. For the distance d_{ij} between 1-bodies, we multiply its feature vector by a continuous, monotonically decreasing function ψ(d_{ij}) with ψ(0) = 1 and ψ(c) = 0. With this formulation, a 1-body node has less influence to/from its distant 1-body neighbors. We follow [2] and set the values of µ_k to be equally spaced between exp(−a) and exp(−b), while β_k = (2K^{−1}(exp(−a) − exp(−b)))^{−2}. The goal of using RBFs is to decorrelate the scalar features to accelerate training [3]. We apply three different sets of RBFs to convert the distance d_{ij}, the length l_{2,i}, and the angle θ_{ij} to the corresponding features e_{1,ij}, x_{2,i}, and e_{2,ij}, respectively.

We fix the latent dimension of the node embeddings and the number of interaction modules T across all experiments, and use the shifted softplus function as the activation function. For ZPVE, U_0, U, H, G, and C_v, the cutoff distance is c = 3, while for the other targets c = 5.
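Eq. 16 and the smooth cutoff ψ can be sketched as follows; the cosine form of ψ is our assumption, since the text only states the boundary conditions ψ(0) = 1 and ψ(c) = 0:

```python
import numpy as np

def rbf_features(x, K, a, b):
    """Sketch of Eq. 16: expand a scalar geometry x in [a, b] into K RBF
    values, with centers equally spaced in exp(-x) space as in [2]."""
    mu = np.linspace(np.exp(-a), np.exp(-b), K)         # centers mu_k
    beta = (2.0 / K * (np.exp(-a) - np.exp(-b))) ** -2  # shared width beta_k
    return np.exp(-beta * (np.exp(-x) - mu) ** 2)

def smooth_cutoff(d, c):
    """A monotonically decreasing psi with psi(0) = 1 and psi(c) = 0."""
    return 0.5 * (np.cos(np.pi * np.minimum(d, c) / c) + 1.0)

# Example: featurize a distance of 1.2 with K = 8 basis functions and c = 5.
feat = rbf_features(1.2, K=8, a=0.0, b=5.0) * smooth_cutoff(1.2, c=5.0)
```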
We initialize the weights of the fully connected layers with random orthogonal matrices scaled by the Glorot initialization scheme [31] and the biases to zero. To learn the parameters of HMGNN, we run the AMSGrad algorithm [32] on mini-batches with L2 regularization of strength λ, decay the learning rate by a constant factor at a fixed interval of gradient steps, and stop training if the MAE on the validation set does not decrease within a fixed number of steps. We implement HMGNN using the Deep Graph Library (DGL) [33], [34].

B. Experimental Setting

We evaluate the performance of the proposed model on the QM9 dataset [12], [13]. QM9 is a widely used benchmark for evaluating models that predict molecule properties. It consists of around 134K equilibrium molecules associated with geometric, energetic, electronic, and thermodynamic properties. The properties are described in Table I. These molecules contain up to nine heavy atoms (C, O, N, and F). We randomly select 110K molecules for training and split the remaining molecules into a validation set and a test set. We conduct model selection for the different targets on the validation set and report the mean absolute error (MAE) of the best performing models. For properties with atomic reference values (U_0, U, H, G, C_v), we subtract the per-atom-type reference values from the original value to form the target. Since Δε is defined as the gap between ε_LUMO and ε_HOMO, we predict it as Δε = ε_LUMO − ε_HOMO. In our experiments, we convert the units of ε_HOMO, ε_LUMO, Δε, ZPVE, U_0, U, H, and G to eV.

We compare the performance of HMGNN with six state-of-the-art methods: enn-s2s [6], SchNet [3], neural message passing with edge updates (NMP-edge) [7], Cormorant [25], PhysNet [2], and the directional message passing neural network (DimeNet) [23]. Results of enn-s2s, SchNet, NMP-edge, Cormorant, and DimeNet are from the corresponding papers. We take the results of PhysNet from [23].
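As a concrete reference for the target preprocessing described above, a minimal sketch; the array names and layouts are hypothetical:

```python
import numpy as np

def subtract_atomic_references(y, atom_type_counts, ref_values):
    """For targets with atomic reference values (U_0, U, H, G, C_v):
    remove the per-atom-type baseline so the model fits the residual.
    atom_type_counts[m, t] = number of atoms of type t in molecule m;
    ref_values[t] = reference value of one atom of type t."""
    return y - atom_type_counts @ ref_values

def homo_lumo_gap(eps_homo, eps_lumo):
    """Delta-epsilon is predicted indirectly from the two orbital energies."""
    return eps_lumo - eps_homo
```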
C. Prediction Performance

We show the prediction performance of HMGNN and the competing methods on the 12 properties of QM9 in Table II. Our proposed method sets the new state of the art on 9 out of the 12 target properties. HMGNN's performance aligns with the best results on the remaining targets, with the exception of ⟨R²⟩. We also present the performance of summing over the predictions of the 1-bodies (HMGNN-1) and the 2-bodies (HMGNN-2), respectively. Although the performance of HMGNN-2 is consistently worse than that of HMGNN-1, their weighted combination outperforms either standalone prediction. This demonstrates the effectiveness of the fusion module driven by the global molecule representations and the attention mechanism, and that explicitly modeling and computing predictions of many-bodies can be beneficial for chemical prediction tasks.

We analyze the effect of a critical hyper-parameter, the cutoff distance c, on the prediction performance of four types of properties. We choose U_0 to represent properties related to atomization energies (U_0, U, H, G), C_v to represent thermodynamic properties (C_v), ZPVE to represent properties related to the fundamental vibrations of the molecule (ZPVE), and µ to represent electronic properties (µ, α, ε_HOMO, ε_LUMO, Δε, ⟨R²⟩) [6]. We present the training and test mean absolute error (MAE) of HMGNNs on HMGs constructed with three different cutoff distances c in Figure 3.

When constructing molecular graphs as well as HMGs, the larger the cutoff distance we choose, the less geometric information about the molecules we lose. However, a large cutoff value does not always lead to better performance. In Figure 3, although the training error decreases across all four targets as the cutoff value increases, the test error shows an increasing trend for three properties. This is a signal that the model over-fits the training set on those three properties, which we attribute to the large model capacity of HMGNNs: since they have one set of parameters for each many-body order, an HMGNN of order N possesses N times the number of parameters of a normal GNN-based model.

D. Ablation Study

In this section, we conduct an ablation study on two targets (U_0 and C_v) to demonstrate the importance of the multi-task learning loss, inter-order message passing, and the explicit modeling of high-order bodies in improving the performance of molecular property prediction. We propose three variants of the HMGNN model and show their results in Table III.

1) Remove MTL (Multi-Task Learning loss): This variant has the same specification as the default model. It differs from the default model in that it is trained by minimizing the naive loss |ŷ − y| instead of the multi-task learning loss proposed in Eq. 14. As shown in Table III, the 2-bodies of this variant lose their prediction power while the fusion module gives all attention weights to the 1-bodies; as a result, the performance of this variant is worse than the default HMGNN. Furthermore, the prediction of the 1-bodies (i.e., HMGNN-1) is also less accurate than in the default model.

2) Remove IOMP (Inter-Order Message Passing): This variant removes the edges/messages between 1-bodies and 2-bodies; as a result, information is not shared between the two orders. We can see that the performance of HMGNN-1 and HMGNN drops in the prediction of both U_0 and C_v. This demonstrates the importance of inter-order message passing. However, the prediction accuracy of HMGNN-2 on U_0 is better than in the models with inter-order message passing. This might be because 2-bodies (both distance and angle) contain more geometric information than 1-bodies (only distance).

3) Remove HO (High-Order modeling): This variant removes the high-order modeling (2-body interactions, representations, and predictions) and is similar to existing GNN-based prediction methods (e.g., PhysNet). As shown in Table III, this method performs worse than the HMGNN-1 of the variant that removes the multi-task learning loss. This provides further evidence of the effectiveness of inter-order message passing.

E. Visualization of Attention Weights

In Figure 4, we show the attention scores of the 1-body predictions generated by the fusion module for predicting U_0, C_v, µ, and ZPVE on the test set. Since we only experiment with many-bodies up to the second order, the attention weight of the 2-bodies is one minus that of the 1-bodies. On the four types of chemical properties, the 1-body contribution dominates the prediction for most of the molecules. However, the 2-body predictions also receive a considerable amount of attention.
TABLE II: Mean absolute error on QM9 with 110K training molecules. In each row, boldface marks the best performing method. Columns HMGNN-1 and HMGNN-2 correspond to the performance of summing over the predictions of 1-bodies and 2-bodies, respectively. ("–" marks an entry not available here.)

  Target   Unit        enn-s2s  SchNet  NMP-edge  Cormorant  PhysNet  DimeNet  HMGNN-1  HMGNN-2  HMGNN
  µ        D           0.030    0.033   0.029     0.038      0.0529   0.0286   0.0276   0.0283   –
  α        a_0³        –        –       –         –          –        –        –        –        –
  ε_HOMO   meV         43       41      36.7      34         32.9     27.8     24.94    26.31    –
  ε_LUMO   meV         37       34      30.8      38         27.4     –        –        –        –
  Δε       meV         69       63      58.0      61         42.5     34.8     33.44    35.02    –
  ⟨R²⟩     a_0²        –        –       –         –          –        –        –        –        –
  ZPVE     meV         –        –       –         –          –        –        –        –        –
  U_0      meV         19       14      10.5      22         8.15     8.02     6.19     9.06     5.92
  U        meV         19       19      10.6      21         8.34     7.89     7.22     11       –
  H        meV         17       14      11.3      21         8.42     8.11     6.35     8.37     –
  G        meV         19       14      12.2      20         9.40     8.98     7.95     11.06    –
  C_v      cal/mol·K   –        –       –         –          –        –        0.0241   0.0250   0.0233

Fig. 3: Effect of the cutoff distance c on prediction performance for four target properties: (a) U_0 (test MAE in meV), (b) C_v (test MAE in cal/mol·K), (c) ZPVE (test MAE in meV), (d) µ (test MAE in D).

TABLE III: Ablation study on U_0 and C_v.

  Target  Architecture   HMGNN-1  HMGNN-2   HMGNN
  U_0     Default        6.19     9.06      5.92
          Remove MTL     8.22     9716.95   8.22
          Remove IOMP    10.26    8.18      7.88
          Remove HO      10.08    –         –
  C_v     Default        0.0241   0.0250    0.0233
          Remove MTL     0.0247   1.4022    0.0247
          Remove IOMP    0.0297   0.0275    0.0244
          Remove HO      0.0289   –         –

Fig. 4: Histograms of the attention weights of the 1-body predictions generated by the fusion module for predicting (a) U_0, (b) C_v, (c) ZPVE, and (d) µ.

VII. CONCLUSION

We propose a novel heterogeneous graph-based molecule representation, the heterogeneous molecular graph (HMG), to model many-body representations and interactions. Inspired by the many-body expansion of energy surfaces, we design a heterogeneous molecular graph neural network (HMGNN) to leverage the rich information stored in HMGs for molecular prediction tasks. HMGNN follows a message passing paradigm and leverages global molecule representations using an attention mechanism. We propose to train HMGNNs by optimizing a multi-task learning loss. HMGNN achieves state-of-the-art performance on 9 out of 12 properties on the QM9 dataset. Experiments also show that the multi-task learning loss improves the generalization of the model. In this paper, we only model many-bodies up to the second order; future work should aim to model many-bodies of third and higher orders and also to enable HMGNNs for another important chemical prediction task, molecular dynamics simulation.

VIII. ACKNOWLEDGEMENT

This work was supported in part by NSF (1447788, 1704074, 1757916, 1834251), the Army Research Office (W911NF1810344), Intel Corp, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute. We are grateful to Mingjian Wen for his fruitful comments, corrections, and inspiration.

REFERENCES

[1] P. Hohenberg and W. Kohn, "Inhomogeneous electron gas," Physical Review, vol. 136, no. 3B, p. B864, 1964.
[2] O. T. Unke and M. Meuwly, "PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges," Journal of Chemical Theory and Computation, vol. 15, no. 6, pp. 3678–3693, 2019.
[3] K. Schütt, P.-J. Kindermans, H. E. S. Felix, S. Chmiela, A. Tkatchenko, and K.-R. Müller, "SchNet: A continuous-filter convolutional neural network for modeling quantum interactions," in Advances in Neural Information Processing Systems, 2017, pp. 991–1001.
[4] K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, "SchNet—a deep learning architecture for molecules and materials," The Journal of Chemical Physics, vol. 148, no. 24, p. 241722, 2018.
[5] C. Lu, Q. Liu, C. Wang, Z. Huang, P. Lin, and L. He, "Molecular property prediction: A multilevel quantum interactions modeling perspective," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 1052–1060.
[6] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 1263–1272.
[7] P. B. Jørgensen, K. W. Jacobsen, and M. N. Schmidt, "Neural message passing with edge updates for predicting properties of molecules and materials," arXiv preprint arXiv:1806.03146, 2018.
[8] F. H. Stillinger and T. A. Weber, "Computer simulation of local order in condensed phases of silicon," Physical Review B, vol. 31, no. 8, p. 5262, 1985.
[9] M. J. Elrod and R. J. Saykally, "Many-body effects in intermolecular forces," Chemical Reviews, vol. 94, no. 7, pp. 1975–1997, 1994.
[10] K. Yao, J. E. Herr, and J. Parkhill, "The many-body expansion combined with neural networks," The Journal of Chemical Physics, vol. 146, no. 1, p. 014106, 2017.
[11] S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv preprint arXiv:1706.05098, 2017.
[12] L. Ruddigkeit, R. Van Deursen, L. C. Blum, and J.-L. Reymond, "Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17," Journal of Chemical Information and Modeling, vol. 52, no. 11, pp. 2864–2875, 2012.
[13] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules," Scientific Data, vol. 1, p. 140022, 2014.
[14] R. G. Parr, "Density functional theory of atoms and molecules," in Horizons of Quantum Chemistry. Springer, 1980, pp. 5–15.
[15] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, "Machine learning prediction errors better than DFT accuracy," arXiv preprint arXiv:1702.05532, 2017.
[16] A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. R. Kermode, G. Csányi, and M. Ceriotti, "Machine learning unifies the modeling of materials and molecules," Science Advances, vol. 3, no. 12, p. e1701816, 2017.
[17] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[18] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ
[19] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[20] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, "Hierarchical graph representation learning with differentiable pooling," in Advances in Neural Information Processing Systems, 2018, pp. 4800–4810.
[21] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, "Quantum-chemical insights from deep tensor neural networks," Nature Communications, vol. 8, no. 1, pp. 1–8, 2017.
[22] N. Lubbers, J. S. Smith, and K. Barros, "Hierarchical modeling of molecular energies using a deep neural network," The Journal of Chemical Physics, vol. 148, no. 24, p. 241715, 2018.
[23] J. Klicpera, J. Groß, and S. Günnemann, "Directional message passing for molecular graphs," in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=B1eWbxStPH
[24] T. Cohen and M. Welling, "Group equivariant convolutional networks," in International Conference on Machine Learning, 2016, pp. 2990–2999.
[25] B. Anderson, T. S. Hy, and R. Kondor, "Cormorant: Covariant molecular neural networks," in Advances in Neural Information Processing Systems, 2019, pp. 14510–14519.
[26] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley, "Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds," arXiv preprint arXiv:1802.08219, 2018.
[27] R. Kondor, T. S. Hy, H. Pan, B. M. Anderson, and S. Trivedi, "Covariant compositional networks for learning graphs," 2018. [Online]. Available: https://openreview.net/forum?id=S1TgE7WR-
[28] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=ryGs6iA5Km
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[30] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[31] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[32] S. J. Reddi, S. Kale, and S. Kumar, "On the convergence of Adam and beyond," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=ryQu7f-RZ
[33] D. Zheng, M. Wang, Q. Gan, Z. Zhang, and G. Karypis, "Learning graph neural networks with Deep Graph Library," in Companion Proceedings of the Web Conference 2020, 2020, pp. 305–306.
[34] M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma et al., "Deep Graph Library: Towards efficient and scalable deep learning on graphs," arXiv preprint arXiv:1909.01315, 2019.