3DMolNet: A Generative Network for Molecular Structures
Vitali Nesterov, Mario Wieser, Volker Roth
Department of Mathematics and Computer Science, University of Basel, Switzerland
[email protected]
Abstract
With the recent advances in machine learning for quantum chemistry, it is now possible to predict the chemical properties of compounds and to generate novel molecules. Existing generative models mostly use a string- or graph-based representation, but the precise three-dimensional coordinates of the atoms are usually not encoded. First attempts in this direction have been proposed, where autoregressive or GAN-based models generate atom coordinates. These either lack a latent space in the autoregressive setting, such that a smooth exploration of the compound space is not possible, or cannot generalize to varying chemical compositions. We propose a new approach to efficiently generate molecular structures that are not restricted to a fixed size or composition. Our model is based on the variational autoencoder, which learns a translation-, rotation-, and permutation-invariant low-dimensional representation of molecules. Our experiments yield a mean reconstruction error below 0.05 Å, outperforming the current state-of-the-art methods by a factor of four, which is even lower than the spatial quantization error of most chemical descriptors. The compositional and structural validity of newly generated molecules has been confirmed by quantum chemical methods in a set of experiments.
Introduction
An exploration of the chemical compound space (CCS) is essential in the design process of new materials. Considering that the number of small organic molecules is of enormous magnitude (Kenakin 2006), a systematic coverage of molecules in a combinatorial manner is intractable. Furthermore, a precise approximation of quantum chemical properties by solving the Schrödinger equation is computationally demanding even for a tiny number of mid-sized molecules. In recent years, an entirely new research field emerged, aiming at a statistical approach that trains regression models for quantum chemical property prediction to overcome these computational drawbacks (Schütt et al. 2018; Faber et al. 2017; Christensen, Faber, and von Lilienfeld 2019). More recently, deep latent variable models have been introduced to solve the inverse problem and to generate novel molecules. Here, the major focus of the research community is on deep latent variable models that use either string- or graph-based molecular representations (Gomez-Bombarelli et al. 2018; Kusner, Paige, and Hernández-Lobato 2017; Jin, Barzilay, and Jaakkola 2018; Janz et al. 2018). The precise spatial configuration of the molecules is usually not encoded. This leads to various drawbacks: for instance, molecules which have the same chemical composition but different geometries, so-called isomers, cannot be distinguished. Furthermore, a targeted exploration of the CCS with respect to quantum chemical properties is rather limited, and with respect to geometries it is impossible.

Preprint. Work in progress.

Figure 1: A schematic illustration of the decoding process of 3DMolNet. The model learns a low-dimensional representation Z of molecules (red plane). Each latent encoding can be decoded to a nuclear charge matrix C, a Euclidean distance matrix D, and a bond matrix B using neural networks (blue dotted arrows). From this, we recover the 3-d molecular structure of a molecule (brown arrows).

To account for these challenges, Gebauer, Gastegger, and Schütt (2019) proposed G-SchNet, an autoregressive model for the generation of atom types and corresponding 3-d points. Rotation- and translation-invariance is achieved by using Euclidean distances between atoms. Permutation-invariant features are extracted with continuous-filter convolutions (Schütt et al. 2018). Due to the autoregressive nature of the model, it has no continuous latent space representation for a smooth exploration of the CCS. Furthermore, the probability of completing complex structures decays with the number of sampling steps. Hoffmann and Noé (2019) introduced EDMNet, which generates Euclidean distance matrices (EDMs) in a one-shot fashion; these can then be transformed into actual coordinates. For the generation of the EDMs, a GAN architecture (Goodfellow et al. 2014) is used, while the critic network is based on the SchNet architecture (Schütt et al. 2018).
This approach, however, is limited to a fixed chemical composition to circumvent the permutation problem. In addition, GANs are frequently difficult to train, and a smooth exploration of nearby areas in the latent space is not possible either.

To overcome the aforementioned limitations, we propose 3DMolNet, a generative network for 3-d molecular structures. A graphical sketch of our approach is illustrated in Figure 1. The molecule representation consists of two core components, a nuclear charge matrix and an EDM. Since some bond types have high variance in their bond lengths, an exact assignment of bond types often cannot be derived from the distances alone. For this reason, we additionally include an explicit bond matrix in our representation. To overcome the permutation problem, we use an InChI-based canonical ordering of the atoms.

Due to the absence of canonical identifiers for most of the hydrogen atoms, we include only heavy atoms in our representation. This is not problematic, since the backbone of a molecule is its most important part. The hydrogens are explicitly defined once the core structure of a molecule is known, and a quantum mechanical approximation of the hydrogen coordinates is a computationally easy optimization task: given a fixed core structure, the many-body problem has significantly fewer degrees of freedom. Additionally, the positions of the hydrogen atoms are entirely determined only if the stereospecificity is defined.

Our model is based on the variational autoencoder (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014) architecture to learn a low-dimensional latent representation of the molecules. To recover a molecule, we decode each representation component separately and obtain the entire molecular structure in the post-processing steps.

In summary, the main contributions of this work are:

1. We introduce a translation-, rotation-, and permutation-invariant molecule representation. To overcome the permutation problem, a canonical ordering of the atoms is used.
2. We propose a VAE-based model to learn a continuous low-dimensional representation of the molecules and to generate 3-d molecular structures in a one-shot fashion.
3. We demonstrate on the QM9 dataset that the proposed model is able to reconstruct almost all chemical compositions with up to 9 heavy atoms. The reconstruction of coordinates for heavy atoms yields an RMSD below 0.05 Å, outperforming the state-of-the-art method by almost a factor of four.

Related Work
Variational Autoencoder.
In recent years, the variational autoencoder (VAE) (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014) became an important tool for various machine learning tasks such as semi-supervised learning (Kingma et al. 2014) and conditional image generation (Sohn, Lee, and Yan 2015). An important line of research are VAEs for disentangled representation learning (Bouchacourt, Tomioka, and Nowozin 2018; Klys, Snell, and Zemel 2018; Lample et al. 2017; Creswell et al. 2017). More recently, variational autoencoders have shown strong connections to the information bottleneck principle (Tishby, Pereira, and Bialek 1999; Alemi et al. 2017; Achille and Soatto 2018), where multiple model extensions have been proposed in the context of sparsity (Wieczorek et al. 2018), causality (Parbhoo et al. 2020b,a), archetypal analysis (Keller et al. 2019, 2020), or invariant subspace learning (Wieser et al. 2020).
Graph-based Deep Generative Models.
Deep generative models are heavily used in computational chemistry to generate novel molecular structures based on graph representations. Using a text-based graph representation, the SMILES string, Gomez-Bombarelli et al. (2018) introduced a chemical VAE to generate novel molecules. However, this method has limitations in generating syntactically correct SMILES strings, and several methods have been proposed to overcome this drawback (Kusner, Paige, and Hernández-Lobato 2017; Dai et al. 2018; Janz et al. 2018). Another line of research deals with explicit graph-based molecular descriptors. Liu et al. introduce a constrained graph VAE to generate valid molecular structures. Subsequently, improved approaches have been introduced by (Jin, Barzilay, and Jaakkola 2018; Jin et al. 2019; Li et al. 2018).
Coordinate-based Deep Generative Models.
More recently, deep generative models for spatial structures have been proposed to overcome a key limitation of graph-based descriptors: such representations do not take into account the precise three-dimensional configuration of the atoms. Mansimov et al. estimated the 3-d configurations of molecules from molecular graphs. Hoffmann and Noé learned the distribution of spatial structures by generating translation- and rotation-invariant Euclidean distance matrices; however, their approach is limited to molecules with the same chemical composition. Current state-of-the-art results are provided by G-SchNet (Gebauer, Gastegger, and Schütt 2019), which is based on an autoregressive model that, however, lacks a latent representation of molecules and is inefficient and error-prone when sampling complex and large structures.

Preliminaries
Variational Autoencoder
The variational autoencoder (VAE) (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014) is a deep latent variable model that combines a generative process (decoder) with variational inference (encoder) to learn a probabilistic model. The encoder models a posterior distribution q(z | x), which we assume to be a Gaussian distribution parametrized by μ_z and σ_z, and the decoder models the generative distribution p(x | z). To learn the probabilistic model, we optimize the following lower bound:

L_VAE ≥ E_{z ∼ q(z|x)}[log p(x | z)] − D_KL[q(z | x) || p(z)].    (1)

Due to the deterministic nature of neural networks, the random variable Z is reparametrized as z = μ_z + σ_z ∗ ε, where ε ∼ N(0, I).
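As a minimal sketch of the reparametrization trick and of the closed-form KL term for a diagonal Gaussian posterior (plain NumPy with our own function names, not the authors' implementation):

```python
import numpy as np

def reparametrize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); the noise is sampled outside
    # the deterministic network path, so gradients can flow through mu, sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL[ N(mu, diag(sigma^2)) || N(0, I) ] for a diagonal Gaussian.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu, log_var = np.zeros(4), np.zeros(4)   # posterior equal to the prior
z = reparametrize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)  # 0.0: no divergence from the prior
```

In a real training loop the KL term is computed from the encoder outputs and added to the reconstruction loss of Equation 1.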
Quantum-mechanical Background

A molecule as a quantum-mechanical system is fully described by the Schrödinger equation (SE). It is sufficient to consider the non-relativistic and time-independent formulation HΨ = EΨ. With an additional elimination of the external electromagnetic field, the Hamiltonian H is determined by the external potential

v(r) = Σ_i Z_i / |r − R_i|,    (2)

where Z ∈ N^N is the set of nuclear charges, R ∈ R^{N×3} are the corresponding atomic positions, and r ∈ R^{M×3} are the positions of the stationary electronic states. The interatomic energy contributions result from the solution of the full electronic many-body problem and the nuclear Coulomb repulsion energy

Z_i Z_j / |R_i − R_j|.    (3)

Neglecting the former, a physical system depends only on a set of nuclear charge numbers and the corresponding atom coordinates in three-dimensional space.

Methods
The goal of our learning framework is to capture the distribution of molecular structures, which are expressed in terms of chemical compositions and geometries. To overcome the permutation problem, we use a canonical ordering of the atoms. In our approach we exclude the hydrogen atoms from the representation, since they can easily be added and their coordinates can be cheaply optimized with quantum-mechanical calculations. Our model involves a VAE-based architecture for a smooth exploration of the chemical space. To enforce the quality of the generated EDMs, additional geometry loss terms are added. The molecular core structure is reconstructed from the estimated representation components, and the entire molecule is recovered in the post-processing steps.
Representation
A physical many-body system depends only on a set of nuclear charge numbers Z_i ∈ N and corresponding coordinates R_i ∈ R^3. Our molecule representation captures this information with a nuclear charge matrix C ∈ N^{N×N} and a Euclidean distance matrix D ∈ R^{N×N}. Additionally, a bond matrix B ∈ N^{M×3} is included to encode explicit bond types. We define the nuclear charge matrix as

C_ij = Z_i Z_j,    (4)

i.e., C = Z Z^⊤. The distance matrix is defined as

D_ij = ||R_i − R_j||.    (5)

The bond matrix B includes the indices of connected atoms in the first and second columns; the third column contains the bond type values. Although the bond matrix could in principle be extracted from the distances, the variance of the bond lengths is relatively high for some bond types, such that an assignment from the distances is often ambiguous. To handle molecules of different sizes, the charge, distance, and bond matrices are padded with zero rows and/or columns to a fixed size. The final molecule representation is the set of the three components:

X = {C, D, B}.    (6)
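A sketch of how the padded charge and distance matrices could be assembled for one molecule (plain NumPy; the function name and the toy C–O fragment are ours, chosen only for illustration):

```python
import numpy as np

def molecule_representation(Z, R, n_max):
    """Build the zero-padded nuclear charge matrix C = Z Z^T and the
    Euclidean distance matrix D for one molecule.
    Z: (N,) nuclear charges, R: (N, 3) coordinates, n_max: padded size."""
    N = len(Z)
    C = np.zeros((n_max, n_max))
    D = np.zeros((n_max, n_max))
    C[:N, :N] = np.outer(Z, Z)                 # C_ij = Z_i * Z_j
    diff = R[:, None, :] - R[None, :, :]
    D[:N, :N] = np.linalg.norm(diff, axis=-1)  # D_ij = ||R_i - R_j||
    return C, D

# Toy example: a C-O pair 1.13 apart (in Angstrom), padded to 9 heavy atoms.
Z = np.array([6.0, 8.0])
R = np.array([[0.0, 0.0, 0.0], [1.13, 0.0, 0.0]])
C, D = molecule_representation(Z, R, n_max=9)
```

Padding to a fixed n_max (nine heavy atoms for QM9) lets molecules of different sizes share one tensor shape.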
Canonical Identifier

A major obstacle for a learning algorithm is an arbitrary atom ordering. To overcome this problem we generate a unique atom ordering for each molecule. Different canonicalization algorithms exist; we obtain the canonical identifiers (CIs) with an InChI-based algorithm implemented in the CDK package (Steinbeck et al. 2003). The generation of CIs involves a structure normalization, an InChI-based canonical labelling, and a tree traversal (O'Boyle 2012). With this method, we get CIs for each heavy atom but only for explicit hydrogen atoms; these are indicated isotopes, dihydrogen and hydrogen ions, and hydrogen atoms attached to tetrahedral stereocentres with defined stereochemistry. Due to the absence of canonical labels for most of the hydrogen atoms, our representation involves only heavy atoms.
Model
As previously stated, our aim is to generate novel molecular compositions and corresponding 3-d structures (see Figure 1). Therefore, we want to learn both a low-dimensional and a rotation-, translation-, and permutation-invariant representation of molecules (see Figure 2). To do so, we reformulate the standard VAE framework (Equation 1) by defining each representation component C, D, and B as an independent random variable. This leads to an extended parametric formulation:

L_VAE ≥ E_{z ∼ q(z | c, d, b)}[log p(c, d, b | z)] − β D_KL[q(z | c, d, b) || p(z)].    (7)

Figure 2: An illustration of the 3DMolNet framework. In the preprocessing step, we apply a canonical ordering and remove all hydrogen atoms. We then get the representation X as a concatenation of the nuclear charge matrix C, the Euclidean distance matrix D, and the bond matrix B. Subsequently, we map X into a low-dimensional latent representation Z using an encoder neural network parametrized with φ. Here, (x) denotes the mean and the surrounding circle the variance of our estimate. To decode a molecule, we use three separate neural networks parametrized with ψ, τ, and θ, which decode the representation components C, D, and B separately. In the post-processing step, we recover the chemical and structural composition of the heavy atoms with a multidimensional scaling algorithm, add hydrogen atoms with Open Babel (O'Boyle et al. 2011), and fine-tune their coordinates with MOPAC (Stewart 2016) while fixing the positions of the heavy atoms.

Encoder.
The encoder term is the KL-divergence between the posterior q_φ(z | c, d, b) and the prior p(z), denoted as

D_KL[q_φ(z | c, d, b) || p(z)],

where φ are the neural network parameters. We define the posterior as a Gaussian distribution and assume a Gaussian prior p(z) = N(0, I).

Decoder.
The decoder term is the negative log-likelihood of p(c, d, b | z). As we assume conditional independence between c, d, and b, we can factorize the joint distribution as follows:

E_{z ∼ q_φ(z | c, d, b)}[log p(c, d, b | z)] = E_{z ∼ q_φ(z | c, d, b)}[log p_θ(c | z) + log p_τ(d | z) + log p_ψ(b | z)],

where ψ, τ, and θ denote the neural network parameters of the respective decoders. We define each log-likelihood term as the mean absolute error between a particular representation component and its reconstructed counterpart.

Geometry Loss.
To further improve the quality of the EDM reconstruction, we additionally penalize the negative eigenvalues and the rank of the Gram matrix. To do so, we first define the geometric centering matrix

J = I − (1/N) 1 1^⊤,    (8)

where I is the identity matrix and 1 is the all-ones vector. The Gram matrix is defined as

G = −(1/2) J D^{∘2} J,    (9)

where D^{∘2} denotes the entrywise squared distance matrix. The matrix D is an EDM if the Gram matrix G is positive semi-definite. To enforce this property, we penalize negative eigenvalues of the Gram matrix G. Therefore, we first obtain the eigenvalues λ via the eigenvalue decomposition of G and apply the ReLU function to their negatives, λ̄ = ReLU(−λ). The corresponding loss term is defined as

L_EV = λ̄^⊤ λ̄.    (10)

Since the atom coordinates exist in an at most 3-dimensional embedding, we additionally penalize a rank of the Gram matrix G larger than k = 3. For this, we sort the eigenvalues in descending order, λ_1 ≥ λ_2 ≥ · · · ≥ λ_N. The rank loss is obtained as

L_R = Σ_{i=k+1}^{N} λ_i.    (11)
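The two geometry penalties can be sketched as follows (a plain NumPy illustration, not the authors' code; we square the distances when forming the Gram matrix, and a distance matrix of genuine 3-d points should incur numerically zero penalty):

```python
import numpy as np

def geometry_losses(D, k=3):
    """Eigenvalue and rank penalties on the Gram matrix derived from a
    candidate distance matrix D (plain distances, squared internally)."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    G = -0.5 * J @ (D**2) @ J                # Gram matrix
    lam = np.linalg.eigvalsh(G)              # eigenvalues, ascending
    neg = np.maximum(-lam, 0.0)              # ReLU(-lambda)
    L_ev = float(neg @ neg)                  # eigenvalue penalty
    lam_desc = np.sort(lam)[::-1]
    L_rank = float(np.sum(lam_desc[k:]))     # rank penalty beyond k = 3
    return L_ev, L_rank

rng = np.random.default_rng(1)
R = rng.standard_normal((6, 3))              # a genuine 3-d point set
D = np.linalg.norm(R[:, None] - R[None, :], axis=-1)
L_ev, L_rank = geometry_losses(D)            # both ~0 for a valid 3-d EDM
```

In training, a differentiable eigendecomposition would be used so that these penalties backpropagate into the distance decoder.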
Overall Training Objective.

The total loss function is given by

L_total = L_VAE + L_EV + L_R.    (12)
Recovering Molecular Structures

After having introduced our model, we now describe the mechanism to obtain the entire molecule from the nuclear charge, distance, and bond matrices. To obtain the atom and coordinate pairs, we first symmetrize the generated nuclear charge matrix C and the distance matrix D, and set the diagonal of the distance matrix D to zero. The reconstructed floating-point values of the bond matrix B are rounded to integer values. Subsequently, we recover the set of nuclear charge numbers and corresponding coordinates {Z_i, R_i}. To do so, we use the classical multidimensional scaling algorithm. We then use Open Babel to read in the molecular structure by setting the types and coordinates of the atoms and assigning bonds and the corresponding bond types. Lastly, we reconstruct a complete molecule by adding hydrogen atoms with Open Babel. To get initial positions of the hydrogen atoms, we use an efficient force-field method within Open Babel (O'Boyle et al. 2011) and fine-tune them using a semi-empirical ab initio approximation with MOPAC (Stewart 2016).

Since the ordering of the added hydrogen atoms differs from the ordering in the initial molecule, the best matching identity assignment has to be found in order to meaningfully calculate the deviation of the structures. Given the coordinates of the added and the target hydrogen atoms, we use the Hungarian algorithm (Kuhn 1955) to minimize the assignment costs.
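The coordinate-recovery step via classical multidimensional scaling can be sketched as follows (plain NumPy, our own function name; coordinates are recovered only up to a rigid transform, so the pairwise distances, not the raw coordinates, are what one can verify):

```python
import numpy as np

def coords_from_edm(D, k=3):
    """Classical multidimensional scaling: recover k-dimensional coordinates
    (up to rotation, reflection, and translation) from a distance matrix D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * J @ (D**2) @ J                # Gram matrix of centered points
    lam, V = np.linalg.eigh(G)
    idx = np.argsort(lam)[::-1][:k]          # k largest eigenvalues
    lam_k = np.maximum(lam[idx], 0.0)        # clip tiny numerical negatives
    return V[:, idx] * np.sqrt(lam_k)

R = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0],
              [0.0, 1.2, 0.0], [0.3, 0.4, 1.1]])
D = np.linalg.norm(R[:, None] - R[None, :], axis=-1)
R_rec = coords_from_edm(D)
D_rec = np.linalg.norm(R_rec[:, None] - R_rec[None, :], axis=-1)
# D_rec reproduces D, although R_rec may be a rotated/reflected copy of R
```

For the hydrogen-matching step, an off-the-shelf Hungarian solver such as SciPy's `linear_sum_assignment` can be used on the pairwise-distance cost matrix.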
Experiments and Discussion

Dataset
For our experiments, we use the QM9 dataset (Ramakrishnan et al. 2014), which includes 133,885 small organic molecules consisting of up to nine heavy atoms (C, O, N, and F). Each molecule has geometric, energetic, electronic, and thermodynamic properties obtained from Density Functional Theory (DFT) calculations. The chemical space of QM9 is based on the GDB-17 dataset (Ruddigkeit et al. 2012), enumerating more than 166 billion molecules of up to 17 heavy atoms. GDB-17 is systematically generated based on molecular connectivity graphs and approaches a complete and unbiased enumeration of the chemical compound space of small and stable organic molecules. However, the molecules are generated based on a set of fixed rules, which do not cover all valid molecular structures that could be uncovered with deep generative models.
Experimental Setup
We train 3DMolNet on a set of 50K randomly selected molecules from the QM9 dataset. For validation, 5K molecules are randomly selected; the remaining molecules are used for evaluation as a test set. We use Open Babel to check the validity of the valences. We also remove molecules from the QM9 dataset whose bonds in the provided SMILES representation are inconsistent with the bonds assigned by Open Babel. To compare molecular geometries, we apply Procrustes analysis and calculate the root-mean-square deviation (RMSD) of pairwise atomic coordinates between generated and ground-truth structures. To compare with G-SchNet, we used the code and trained model provided with its publication.
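The geometry comparison can be sketched with the Kabsch algorithm, a common way to perform the rigid Procrustes alignment before computing the RMSD (plain NumPy; our own illustrative implementation, not the authors' evaluation code):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between point sets P and Q (shape (N, 3)) after optimal
    rigid alignment (translation + proper rotation, Kabsch algorithm)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))       # avoid improper rotations
    Rot = U @ np.diag([1.0, 1.0, d]) @ Vt
    return float(np.sqrt(np.mean(np.sum((P @ Rot - Q)**2, axis=1))))

# A rotated and translated copy of a point set should give RMSD ~ 0.
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([1.0, -2.0, 0.5])
rmsd = kabsch_rmsd(P, Q)
```

The sign correction on the smallest singular direction ensures a proper rotation, so mirror images are not silently matched.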
Architecture

Our model consists of an encoder network and three separate decoder networks for the generation of the nuclear charge matrix C, the distance matrix D, and the bond matrix B. The encoder and each decoder have a standard fully connected architecture. The latent space is set to 64 dimensions. The compression parameter β is reduced after each epoch.

Reconstruction Accuracy
To judge the quality of 3DMolNet, we evaluate the reconstruction accuracy of the nuclear charge numbers, the atom coordinates, and the bond matrix. Our experiments show that the chemical compositions are exactly reconstructed in more than 99% of the cases; the bond matrix is exactly reconstructed in 98% of the cases. To evaluate the reconstructed geometries, we only accept molecules with exactly reconstructed atoms, bonds, and bond types. The reconstruction of heavy atom coordinates yields an RMSD of 0.048 Å. Using a force field to optimize the coordinates of the hydrogen atoms, an RMSD of 0.21 Å is achieved; applying a further optimization step with a semi-empirical ab initio method, the RMSD drops to 0.16 Å (see Table 1).

To compare with G-SchNet, we first investigate the reconstruction accuracy of the chemical compositions. According to our experiments, its chemical compositions are exactly reconstructed in only 19% of the cases. Even though this result is obtained with teacher forcing as a one-step-ahead prediction, it still significantly lags behind our accuracy rates. According to published numbers, G-SchNet yields an RMSD of 0.18 Å for the reconstruction of heavy atom coordinates. This is roughly four times as high as our result. Including the hydrogen atoms, G-SchNet achieves an RMSD of 0.23 Å; even with the fast force-field solution for optimizing hydrogen coordinates we achieve comparable results, and with the more precise approximation our method clearly outperforms G-SchNet.

A comparison with EDMNet in terms of the reconstruction accuracy of the coordinates cannot be made directly, due to the nature of its GAN-based architecture. Beyond that, a fair comparison is not possible anyway, since EDMNet is trained only on a subset of molecules of the same chemical composition.

Interpolation between Molecules
An important and useful property of our model is a continuous and smooth latent space, which allows a gradual exploration of nearby chemical compositions and structures. In a series of experiments, we evaluate and discuss the outcomes of an interpolation between molecular domains. Our start and target molecules are chosen from the test set. To estimate the quality of the sampled structures, we additionally relax the entire geometry of the molecules in each interpolation step.

In general, we observe smooth transitions between molecular structures and chemical compositions, meaning that structure- and composition-related molecules are mapped close to each other in the latent space, i.e., the latent space is structured with respect to compositional features. Figure 3a depicts an example of a smooth transition between structurally related molecules, yet consisting of different atom types. Here, the first and the last molecule are recovered from the mean of data points picked out of the test set. While moving towards the target molecule, we plot each intermediate molecule whenever the decoded chemical composition changes. The additional gray plots depict the corresponding relaxed geometries of the molecules.

In the second example (see Figure 3b), the chemical composition as well as the molecular geometry changes, as the interpolation takes place between less related molecules. During the interpolation process we discovered molecular geometries which closely correspond to their relaxed counterparts. However, we also found molecules which deviate from the relaxed geometries. The most common deviations occur in the reconstruction of the hydrogen coordinates. Further deviations are caused by the core sub-structures of the molecule, naturally leading to significantly different outcomes for the hydrogen coordinates.
This phenomenon ismore closely discussed in the next paragraph.A direct comparison of our results with the G-SchNet andEDMNet is at this point not possible, since those modelsdo not allow a continuous exploration of the latent space.Hence, an interpolation between molecular domains cannotbe performed. In case of the EDMNet, even a transition be-tween different chemical compositions and molecule sizes isnot possible.
Molecule Discovery
To discover novel molecules within the QM9 dataset, we randomly sample from the latent space around the mean regions of the molecules from the training set. To identify a novel chemical composition, we first generate a set of canonical SMILES strings for the entire QM9 dataset. We then compare the canonical SMILES string of the generated molecule with the QM9 references. A molecule is accepted and considered novel if no identical canonical SMILES string is found within the QM9 dataset; molecules with invalid bonds or bond types are rejected. We were able to identify more than 20K novel molecules with new chemical compositions (see Figure 4). To investigate the quality of the generated molecular structures, we relaxed the structures with MOPAC and computed the pairwise atom distances, yielding an RMSD of 0.32 Å.
Figure 3: An illustration of the interpolation steps between related molecules. The colored atoms and bonds represent generated molecules, while the gray structures represent relaxed counterparts. The hydrogen atoms are removed. (a) depicts an interpolation between different chemical compositions within a fixed structural domain, while (b) shows an interpolation between different geometries as well.

In Figure 4, we illustrate a selection of the discovered molecules with increasing RMSD values between the generated structures and the corresponding equilibrium states. One sees that the heavy atoms deviate the least from the ground truth, while the hydrogen atoms contribute most to the deviation. However, from the chemical perspective the exact reconstruction of the hydrogen atoms is least important, since most of them are not restricted to a fixed configuration. In terms of a qualitative exploration of the chemical space, the heavy atoms contribute most to the structural features of the molecules.

Figure 4: An illustration of novel molecules discovered with 3DMolNet. The colored atoms and bonds represent generated structures and the translucent plot is the relaxed counterpart. The molecules are labeled with the RMSD for the relaxation: (a) 0.05, (b) 0.07, (c) 0.12, (d) 0.17, (e) 0.24, (f) 0.39, (g) 0.42, (h) 0.55.

To put our results into relation, for G-SchNet a median of around 0.3 Å is reported, which is comparable to the result achieved in our experiments. However, a direct comparison between G-SchNet and our model is not appropriate, since our molecules are sampled from a continuous latent space, which can also produce distorted and unstable molecular structures. In fact, it turned out to be problematic to fairly reason about the structural closeness to the equilibrium conformations of the relaxed counterparts. Due to the high-dimensional latent space, it is computationally demanding to find mean coordinates within a latent space domain of discovered molecules; this is, however, important in order to decode the least distorted version of a molecule. Furthermore, some novel molecules could lie within a domain of isomers, i.e., within a family of different molecular structures with the same chemical composition, which are mapped closely in the latent space. Hence, sampling in the nearby regions can generate considerably different relaxations and worsen the results. Additionally, generated molecules require a further criterion to filter out potentially chemically unstable structures, or even to penalize those reconstructions during training.
Conclusion
We have developed 3DMolNet to efficiently generate novel 3-d molecular structures of variable size and chemical composition. In comparison, recent methods are either autoregressive or GAN-based. The former lack a latent space entirely, while the latter are restricted by GAN-specific drawbacks, such that both types of model do not allow a smooth exploration of the chemical space. Additional restrictions of the GAN-based model allow only the generation of fixed chemical compositions.

To address these limitations, we made three distinct contributions:

1. We introduced a translation-, rotation-, and permutation-invariant molecule representation involving a canonical ordering of the atom and coordinate pairs.
2. We proposed an extended version of the variational autoencoder, which allows us to efficiently generate 3-d molecular structures and to explore molecular domains within a continuous low-dimensional representation of the molecules.
3. We achieved a high reconstruction precision of the atom coordinates, which is below 0.05 Å and in the range of the typical spatial quantization error of common chemical descriptors. Furthermore, our model almost perfectly reconstructs exact chemical compositions and bond types on a QM9 test set.
Future Work
In future work, we want to improve the quality of the geometries by including molecular stability constraints to generate energetically more reasonable configurations. Furthermore, we want to extend the model with a decoder for hydrogen atoms. Lastly, to improve the process of latent space exploration, we would like to investigate alternative forms of latent representations.
Acknowledgments
We would like to thank Anders S. Christensen, Felix A. Faber, Puck van Gerwen, O. Anatole von Lilienfeld, and Jimmy C. Kromann for insightful comments and helpful discussions. Vitali Nesterov is partially supported by the NCCR MARVEL funded by the Swiss National Science Foundation. Mario Wieser is partially supported by the NCCR MARVEL and grant 51MRP0158328 (SystemsX.ch HIV-X) funded by the Swiss National Science Foundation.

References
Achille, A.; and Soatto, S. 2018. Information Dropout: Learning Optimal Representations Through Noisy Computation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Alemi, A. A.; Fischer, I.; Dillon, J. V.; and Murphy, K. 2017. Deep Variational Information Bottleneck. In International Conference on Learning Representations.
Bouchacourt, D.; Tomioka, R.; and Nowozin, S. 2018. Multi-Level Variational Autoencoder: Learning Disentangled Representations From Grouped Observations. In AAAI Conference on Artificial Intelligence.
Christensen, A. S.; Faber, F. A.; and von Lilienfeld, O. A. 2019. Operators in quantum machine learning: Response properties in chemical space. The Journal of Chemical Physics.
Creswell, A.; Mohamied, Y.; Sengupta, B.; and Bharath, A. A. 2017. Adversarial Information Factorization. arXiv:1711.05175.
Dai, H.; Tian, Y.; Dai, B.; Skiena, S.; and Song, L. 2018. Syntax-Directed Variational Autoencoder for Structured Data. In International Conference on Learning Representations.
Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S.; Dahl, G. E.; Vinyals, O.; Kearnes, S.; Riley, P. F.; and von Lilienfeld, O. A. 2017. Prediction errors of molecular machine learning models lower than hybrid DFT error. Journal of Chemical Theory and Computation.
Gebauer, N.; Gastegger, M.; and Schütt, K. 2019. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. In Advances in Neural Information Processing Systems.
Gomez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernandez-Lobato, J. M.; Sanchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
Hoffmann, M.; and Noé, F. 2019. Generating valid Euclidean distance matrices. arXiv:1910.03131.
Janz, D.; van der Westhuizen, J.; Paige, B.; Kusner, M. J.; and Hernández-Lobato, J. M. 2018. Learning a generative model for validity in complex discrete structures. In International Conference on Learning Representations.
Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction Tree Variational Autoencoder for Molecular Graph Generation. In International Conference on Machine Learning.
Jin, W.; Yang, K.; Barzilay, R.; and Jaakkola, T. S. 2019. Learning Multimodal Graph-to-Graph Translation for Molecular Optimization. In International Conference on Learning Representations.
Keller, S. M.; Samarin, M.; Torres, F. A.; Wieser, M.; and Roth, V. 2020. Learning Extremal Representations with Deep Archetypal Analysis. arXiv:2002.00815.
Keller, S. M.; Samarin, M.; Wieser, M.; and Roth, V. 2019. Deep Archetypal Analysis. In German Conference on Pattern Recognition.
Kenakin, T. 2006. A Pharmacology Primer: Theory, Applications, and Methods. Elsevier.
Kingma, D. P.; Mohamed, S.; Jimenez Rezende, D.; and Welling, M. 2014. Semi-supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems.
Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations.
Klys, J.; Snell, J.; and Zemel, R. 2018. Learning Latent Subspaces in Variational Autoencoders. In Advances in Neural Information Processing Systems.
Kuhn, H. W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly.
Kusner, M. J.; Paige, B.; and Hernández-Lobato, J. M. 2017. Grammar Variational Autoencoder. In International Conference on Machine Learning.
Lample, G.; Zeghidour, N.; Usunier, N.; Bordes, A.; Denoyer, L.; and Ranzato, M. 2017. Fader Networks: Manipulating Images by Sliding Attributes. In Advances in Neural Information Processing Systems.
Li, Y.; Vinyals, O.; Dyer, C.; Pascanu, R.; and Battaglia, P. W. 2018. Learning Deep Generative Models of Graphs. CoRR abs/1803.03324.
Liu, Q.; Allamanis, M.; Brockschmidt, M.; and Gaunt, A. 2018. Constrained Graph Variational Autoencoders for Molecule Design. In Advances in Neural Information Processing Systems.
Mansimov, E.; Mahmood, O.; Kang, S.; and Cho, K. 2019. Molecular Geometry Prediction using a Deep Generative Graph Neural Network. Scientific Reports.
O'Boyle, N.; Banck, M.; James, C.; Morley, C.; Vandermeersch, T.; and Hutchison, G. 2011. Open Babel: An open chemical toolbox.
O'Boyle, N. M. 2012. Towards a Universal SMILES representation: A standard method to generate canonical SMILES based on the InChI. Journal of Cheminformatics.
Machine Learning for Healthcare (MLHC).
Parbhoo, S.; Wieser, M.; Wieczorek, A.; and Roth, V. 2020b. Information Bottleneck for Estimating Treatment Effects with Systematically Missing Covariates. Entropy.
Ramakrishnan, R.; Dral, P. O.; Rupp, M.; and von Lilienfeld, O. A. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data.
Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning.
Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; and Reymond, J.-L. 2012. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. Journal of Chemical Information and Modeling.
Schütt, K. T.; Sauceda, H. E.; Kindermans, P.-J.; Tkatchenko, A.; and Müller, K.-R. 2018. SchNet – A deep learning architecture for molecules and materials. The Journal of Chemical Physics.
Sohn, K.; Lee, H.; and Yan, X. 2015. Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems.
Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; and Willighagen, E. 2003. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. Journal of Chemical Information and Computer Sciences.
Stewart, J. J. P. 2016. MOPAC2016. Stewart Computational Chemistry. URL OpenMOPAC.net.
Tishby, N.; Pereira, F. C.; and Bialek, W. 1999. The information bottleneck method. In Allerton Conference on Communication, Control and Computing.
Wieczorek, A.; Wieser, M.; Murezzan, D.; and Roth, V. 2018. Learning Sparse Latent Representations with the Deep Copula Information Bottleneck. In International Conference on Learning Representations.
Wieser, M.; Parbhoo, S.; Wieczorek, A.; and Roth, V. 2020. Inverse Learning of Symmetry Transformations. arXiv e-prints.