Advances of Machine Learning in Molecular Modeling and Simulation
Mojtaba Haghighatlari (mojtabah@buffalo.edu) and Johannes Hachmann (hachmann@buffalo.edu)

Department of Chemical and Biological Engineering, University at Buffalo, The State University of New York, Buffalo, NY 14260, United States
Computational and Data-Enabled Science and Engineering Graduate Program, University at Buffalo, The State University of New York, Buffalo, NY 14260, United States
New York State Center of Excellence in Materials Informatics, Buffalo, NY 14203, United States
In this review, we highlight recent developments in the application of machine learning for molecular modeling and simulation. After giving a brief overview of the foundations, components, and workflow of a typical supervised learning approach for chemical problems, we showcase areas and state-of-the-art examples of their deployment. In this context, we discuss how machine learning relates to, supports, and augments more traditional physics-based approaches in computational research. We conclude by outlining challenges and future research directions that need to be addressed in order to make machine learning a mainstream chemical engineering tool.
I. MACHINE LEARNING FROM A CHEMICAL PERSPECTIVE
Over the past few years, data science has started to offer a fresh perspective on tackling complex chemical questions, such as discovering and designing chemical systems with tailored property profiles, revealing intricate structure-property relationships (SPRs), and exploring the vastness of chemical space [1]. Data-derived prediction models serve as surrogates for physics-based models that are at the heart of traditional modeling and simulation work. They are attractive because they are usually dramatically less demanding than physics-based models and can thus be deployed in studies of correspondingly larger scope and scale. If trained on experimental data, they are also not subject to the approximations made in physics-based models and may thus not exhibit the resulting discrepancies with respect to non-idealized experimental findings. Of course, data-derived models have their own intrinsic errors and limitations, which we will address in the course of this review.

Machine learning (ML) is a data mining technique used to create data-derived models. It enables us to extract complex and often hidden correlations (and thus ideally insights, patterns, rules, and guidance) from given data sets and to encapsulate them in mathematical form. ML is commonly categorized into four types, i.e., supervised, semi-supervised, unsupervised, and reinforcement learning. The main difference between these types is in essence the amount of information (i.e., labeling, context) that is available for the target variable that serves as the ground truth for the training of an ML algorithm. While all ML types have found application in chemical research [2], supervised learning has so far been most commonly used, and this review will thus focus on it. The popularity of supervised learning may be due to its heuristic and intuitive approach to learning, which is similar to a scientist's way of gaining insights into SPRs.
A supervised prediction model can be thought of as a function f : X → Y that maps an input x ∈ X to an output y ∈ Y, where x in this context is the feature representation of a chemical system and y its target property. If the variables x and y are continuous (numerical), then the mapping is a regression; if they are discrete (categorical), then it is a classification.

We can utilize a host of supervised ML algorithms to train and optimize the model f to approximate the output value for a given input. Two popular algorithms that have been widely used are artificial neural networks (ANNs) and kernel methods. Both can be thought of as transforming the input x into a new feature (latent variable) space, in which it becomes linearly correlated with the output y [3]. The transformation itself is typically highly non-linear. A major advantage of ANNs is their capacity to transform features sequentially through several layers, which is referred to as deep learning. Kernel methods, on the other hand, usually transform features in a one-step process using kernel functions. Unlike in ANNs, this process is predefined prior to the tuning of the model's parameters, and it is thus less flexible in learning the best latent variable space. The advantage of kernel methods is their superior performance in finding global solutions, even for small-size data sets where ANNs have deficits. Support vector machines and kernel ridge regression are two common examples of kernel-based algorithms.

The relationship between a molecular structure and its properties is deterministic, i.e., there exists an exact mapping from fundamental physics (i.e., the Schrödinger equation). This mapping is ultimately the foundation for traditional modeling and simulation techniques.
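As a minimal illustration of the kernel approach mentioned above, the following sketch fits a kernel ridge regression model with a Gaussian (RBF) kernel using only NumPy. The 1D toy data set and all hyperparameter values are hypothetical stand-ins, not taken from any study cited here.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Gaussian kernel: k(a, b) = exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_krr(X, y, lam=1e-3, gamma=0.5):
    # Solve (K + lam*I) alpha = y for the dual coefficients
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_new, gamma=0.5):
    # f(x) = sum_i alpha_i * k(x, x_i)
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy 1D "descriptor -> property" data
X = np.linspace(0, 3, 30).reshape(-1, 1)
y = np.sin(2 * X[:, 0])
alpha = fit_krr(X, y)
y_hat = predict_krr(X, alpha, X)
print(float(np.max(np.abs(y_hat - y))))  # training error should be small
```

Note how the one-step kernel transformation is fixed before training: only the dual coefficients alpha are fitted, which is the predefined-transformation property contrasted with ANNs in the text.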
The topology of ML models is generally very flexible (as, e.g., expressed in the universal approximation theorem for ANNs), so that they can learn and recover the underlying SPRs of a problem, even from simple chemical representations (assuming no significant loss of information within this representation).

We can consider a feature representation method as a function g : M → X that maps a basic chemical representation m ∈ M to a feature input x ∈ X (typically called a descriptor). The representation m may contain spatial or at least topological information that defines a molecule and is expressed, e.g., in atomic coordinates, simplified molecular-input line-entry system (SMILES) [4], international chemical identifier (InChI), or other formats.

A common feature space X is spanned by structural descriptors. Some ML approaches also utilize physical or (physico-)chemical properties as descriptors, such that g corresponds to a simulation or some other type of calculation (including those from first principles). As this approach builds physics into the feature space, it has a certain appeal and has gained corresponding popularity. However, it is important to point out that the computational cost of obtaining such descriptors (which include optimized geometries) may easily make this the bottleneck of an ML approach, in which case it will limit its utility as an efficient surrogate for the prediction of y. This issue has to be considered as part of a cost-benefit analysis.

Another class of descriptors is designed to capture the local environment of each atom in a molecule [5]. This approach considers a molecule as a graph with atom and bond (i.e., node and edge) features. Each atom can interact with all other atoms in its immediate vicinity, which results in an update of the corresponding local atomic features. Incidentally, this approach has its roots in both chemical and data sciences: In the context of molecular simulations, cutoff radii have long been used to exploit the short-ranged nature of intermolecular interactions. In data science, the idea of dynamic irregular graphs provides the underpinning of graph convolutional neural networks. The overlap of the two disciplines in this area has led to many methodological advances for the generation of descriptors. Results from a number of recent studies suggest that an ensemble of local features (rather than a global representation) is able to provide a more robust solution to the challenges involved with variant graph size and the order of atoms in molecules [6, 7].

The descriptors discussed so far are essentially hand-crafted to explicitly expose certain structural, physical, or (physico-)chemical information x from m and provide a structured (i.e., tabular) feature representation. Alternatively, the feature generation g can also be merged into the prediction model f, and both will be jointly optimized, e.g., through hidden layers (latent space) in deep learning. This class of descriptors is called learned features [8].

FIG. 1. The major tasks and mathematical setup of a supervised machine learning workflow: For a given data set {M, Y}, in which for a number of molecules in basic chemical representation m ∈ M the target property y ∈ Y is given, we apply a feature representation method as a function g : M → X that maps M to a feature input space X. After cleaning and other preprocessing steps, we use {X, Y} to formulate an ML model f : X → Y that maps the feature input space X to the output label space Y. The ML model is trained on the training subset of {X, Y}, and subsequently validated and optimized on its testing subset, so that it minimizes the prediction error for Y.

The overall ML workflow for chemical problems encompasses a number of steps as shown in Fig. 1, including parsing, cleaning, and preprocessing a chemical data set {M, Y}, compiling an array of descriptors via g, as well as training, evaluation, and validation of the prediction model f.

II. APPLICATIONS OF MACHINE LEARNING IN CHEMICAL RESEARCH
In the following section, we summarize three application areas of ML in chemical research, with particular consideration of the inherent structure of the associated data sets, types of representation, and connections to traditional modeling. We limit the scope of our discussion to molecular systems, which still cover a broad range of use cases.
A. Discovery and Design of New Compounds
The application of ML for the exploration of chemical space and the creation of new compounds (ranging from small molecules to polymers and materials) can be divided into two distinct approaches: (i) discovery, i.e., ML is used to generate fast prediction models for properties of interest, with which large-scale surveys of chemical space can be conducted in order to identify compounds that exhibit desired property profiles; (ii) design, i.e., ML is used to develop a quantitative understanding of the SPRs of interest, which can be inverted to pursue the targeted, rational design (or inverse engineering) of compounds with particular properties. While the core activity, i.e., the ML of SPRs, is the same in both cases, its use follows different perspectives.
Discovery.
The idea of employing data-derived prediction models instead of physics-based models (or experimentation) as a means to characterize candidates in the search for new molecules may be one of the earliest applications in chemistry for which the use of ML was proposed. Traditional molecular modeling and simulations have been used for this purpose for many years. More recently, they have also been employed in the context of virtual high-throughput screening studies, in which they are tasked with assessing entire libraries of candidate compounds (see Fig. 2 and, e.g., Refs. [9, 10]). However, the computational footprint, in particular of first-principles approaches, is limiting both individual as well as large-scale studies that seek to identify compounds with specifically targeted properties.

The application of data-derived prediction models enables us to dramatically accelerate the survey of chemical space, often by several orders of magnitude. A speed-up of that magnitude allows a corresponding increase in the scale and scope that is viable for screening efforts. (It is thus sometimes referred to as hyperscreening.) The candidate libraries are typically generated from a collection of moieties and patterns that are of interest in a given context [11]. The combination of such a set of building blocks leads to a molecular library for a particular domain in chemical space, i.e., the candidates belong to the same distribution [12]. A number of experimental or high-level computational training sets have been developed for specific classes of molecules [13, 14]. Since these data sets focus on relatively similar compounds from the same distribution, the choice of representation and the ML model training are arguably less challenging compared to more universally applicable models. The extrapolative use of data-derived prediction models outside the domain for which they were trained has to be conducted with great care and caution, as they are least reliable here.
This is a conceptual challenge, as screening studies are often interested in compounds with extreme properties that are likely at the margins of the distribution, where the predictions are least reliable. Iterative retraining of ML models allows us to shift the training data distribution into particular areas of interest, thus making them more robust for use in discovery.

A reasonably diverse collection of molecules can be found in the open-source QM9 data set, originally extracted from the GDB-17 chemical universe of 166 billion organic molecules [15]. The QM9 data comprise computed geometries and properties for 134,000 molecules at the density functional theory (DFT) level. Due to the diversity of molecular structures and the broad range of calculated properties, QM9 plays an important role as a benchmark data set for new models and methods [16, 17]. Its contribution to method developments can be compared to that of the MNIST data set in the hand-written character recognition community [18]. In contrast to, e.g., data sets from first-principles modeling, those from data-derived models have so far rarely been used for the generation of new reference data. Yet, they have played an important role in a number of methodological advances in the field. As a result, the reported accuracies for many of the recent ML prediction models surpass those of traditional molecular modeling and simulations [19].
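The surrogate-accelerated screening funnel described above can be caricatured in a few lines: a cheap (imperfect) model ranks a large library, and only a small shortlist is passed to the expensive reference method. The 1D "property" and both scoring functions are hypothetical stand-ins for an ML surrogate and, e.g., a DFT calculation.

```python
import random

random.seed(0)

def expensive_reference(x):      # stand-in for a first-principles evaluation
    return (x - 0.7) ** 2        # lower is better; optimum at x = 0.7

def cheap_surrogate(x):          # stand-in for a trained ML surrogate
    return (x - 0.7) ** 2 + random.gauss(0, 0.01)  # small model error

library = [random.random() for _ in range(10_000)]   # candidate "compounds"
shortlist = sorted(library, key=cheap_surrogate)[:100]  # hyperscreening step
best = min(shortlist, key=expensive_reference)          # refine top candidates
print(abs(best - 0.7))  # distance of the winner from the true optimum
```

The funnel works despite the surrogate's noise because the expensive method only has to adjudicate among the 100 pre-ranked candidates rather than all 10,000, which is exactly the cost structure motivating hyperscreening.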
Design.
While discovery is still based on a traditional trial-and-error process, albeit one drastically accelerated by ML, the notion of a deliberate, de novo design of new compounds represents a different research paradigm. It addresses the problem that even rapid and efficient hyperscreening studies can only scratch the surface of the practically infinite molecular space. Instead, the design paradigm seeks to utilize insights into the SPRs obtained from ML for the targeted creation of systems with specific properties. The understanding of how changes in the molecular structure (or a compound's features) lead to changes in the desired properties can be inverted to gain a property-to-structure mapping. The mathematical structure of a data-derived SPR prediction model (e.g., the dominant features, principal components, latent variables, or learned features) yields a foundation for inverse design. Models that are less easy to interpret can be projected onto surrogate models for which the extraction of guidelines is easier. A key challenge is to realize the simultaneous enhancement of different properties. The emerging design rules can be used to formulate individual compounds [20], but also to identify high-value domains in chemical space, which can be enumerated in screening libraries (e.g., by sampling compounds similar to a lead compound). The latter approach effectively interfaces the discovery and design perspectives and allows both physics-based and data-derived modeling studies to be more targeted.

FIG. 2. Flowchart showing a computational funnel typical for high-throughput virtual screening (HTVS) studies. The neural network schemes on the left and right represent deep generative model architectures that can conceptually replace different elements of the screening funnel. Both generative adversarial networks (GANs) and variational autoencoders (VAEs) include two networks that revolutionize the conventional generation and analysis steps by probabilistic means. A deep reinforcement learning (RL) network can also be trained to bias the generation towards promising candidates.

Another approach that is very promising and has caused much excitement is the application of generative models (see Fig. 2). For instance, Sanchez-Lengeling et al. have shown that a generative adversarial network (GAN) in tandem with reinforcement learning can outperform evolutionary algorithms in biasing the generative process toward the extreme regions of a property distribution [21, 22]. The use of GANs for molecular design and library generation is very recent, and a number of concerns and challenges still need to be overcome in their development. Two of the principal challenges of GANs (and other generative approaches) are the rate at which invalid (i.e., chemically irrelevant or nonsensical) structures are generated, and their ability to produce topologically different molecules compared to the underlying training data [23, 24]. Another example of generative models are variational autoencoders (VAEs), which learn the distribution of an embedded space and thus enable tuning in that space [25]. Recurrent neural networks (RNNs) operate in a sequential manner, akin to creating new molecules one atom at a time. One benefit of RNNs is their memory mechanism that allows them to remember the effects of previous sequences [26].
B. Creation of New Modeling Techniques
Instead of replacing physics-based with data-derived modeling entirely, as outlined in Sec. II A, ML can also be used to (i) calibrate and correct the results of physics-based models to account for some of their systematic errors; (ii) complement traditional modeling and simulation approaches (i.e., employ combinations of physics-based and data-derived models); and (iii) facilitate the development of new physics-based modeling techniques.

The calibration approach allows us to improve the predictive performance of physics-based models and obtain high-quality results at the cost of lower-quality methods. It can also help bridge the gap between experiment and theory that results from the inherent approximations in the latter (see, e.g., Ref. [10]). Transfer learning is an ML design methodology that has been a particularly successful technique in this context [27, 28]. In the combination approach, we only utilize ML for aspects for which no good physics-based models are available or where their use is impractical (e.g., because of insufficient accuracy, prohibitive cost, or other numerical issues). We thus retain as much of the physical foundations and robustness of traditional modeling as possible, while being pragmatic about the parts of a problem where that is not possible (see, e.g., Refs. [29–31]).

The development of entirely new modeling techniques by means of ML has seen encouraging pioneering efforts, in particular for force fields (FFs) and DFT. The major driving force behind ML-generated FFs is the lack of generalizability in the classical FFs and the interatomic potentials that underpin them. This is an area where ML is apparently able to bridge the accuracy and versatility typically seen from quantum chemistry and the efficiency of molecular mechanics simulations. A recent line of research has focused on learning interatomic potentials from quantum chemical data sets [32].
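The calibration idea can be sketched as a delta-learning correction: train a cheap model of the residual between low- and high-level results, then add it to new low-level predictions. The synthetic "cheap" and "expensive" data below are hypothetical; a linear residual model stands in for whatever ML regressor one would actually use.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a cheap method systematically distorts a property
# that a (scarce) high-level reference method gets right.
x = rng.uniform(0, 2, 50)                 # a 1D descriptor
high = np.sin(x) + 0.5 * x                # "expensive" reference values
low = 0.8 * (np.sin(x) + 0.5 * x) - 0.1   # biased cheap estimates

# Delta-learning: fit the residual (high - low) as a linear function of
# the cheap result, then use it to correct new cheap predictions.
A = np.vstack([low, np.ones_like(low)]).T
coef, *_ = np.linalg.lstsq(A, high - low, rcond=None)
corrected = low + A @ coef

print(float(np.abs(high - low).mean()))        # error before correction
print(float(np.abs(high - corrected).mean()))  # error after correction
```

Because the toy bias here is exactly linear in the cheap value, the correction is nearly perfect; real systematic errors are only partially learnable, but the same train-on-residuals structure applies.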
There are two specific challenges involved in this application that make it distinct from prediction models for molecular properties. One is the need for a diverse sampling of non-equilibrium chemical conformations, as both ML and classical FFs perform poorly outside of their applicability domain. Access to a diverse collection of high-quality training samples is thus essential in creating ML FFs. For instance, Botu et al. have improved on previous work by diversifying their training data, e.g., by adding more atomic environments and applying clustering methods [33]. Smith et al. have pushed the normal mode sampling method to obtain single point energies for more than 20 million conformations generated for 58,000 small molecules [34]. The results of these efforts were shown to be efficiently generalizable, even for the simulation of more complex phenomena. The second important challenge is to conserve the consistency between potential energies and forces, as discussed by Chmiela et al. [35]. They provide a robust solution to this challenge by developing gradient-domain ML models (which reproduce global FFs by training in the force domain and incorporating both energies and forces) in an automated fashion, thus learning accurate ML FFs.

In the DFT context, ML is used to create new functionals for different terms in the electronic Hamiltonian. The exact forms of several functionals (e.g., the kinetic energy functional for interacting electrons or the exchange-correlation functional in the Kohn–Sham formalism) are unknown and otherwise approximated by physical reasoning. ML-generated functionals allow DFT to avoid common failures, such as in accurately describing bond-breaking processes.
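The energy-force consistency requirement has a simple structural solution: if forces are defined as the exact negative gradient of the predicted energy, consistency holds by construction. The sketch below uses a hypothetical Morse-like pair potential as a stand-in for a trained ML energy model and checks its analytic force against a numerical derivative; it is an illustration of the principle, not the gradient-domain method of Ref. [35].

```python
import numpy as np

# Stand-in "learned" energy model: a Morse-like pair potential with
# hypothetical parameters (depth D, width a, equilibrium distance r0).
D, a, r0 = 1.0, 1.5, 1.2

def energy(r):
    return D * (1.0 - np.exp(-a * (r - r0))) ** 2

def force(r):
    # Analytic F = -dE/dr of the model above: consistent by construction.
    e = np.exp(-a * (r - r0))
    return -2.0 * D * a * (1.0 - e) * e

# Consistency check against a central finite difference of the same energy
r, h = 1.5, 1e-6
f_num = -(energy(r + h) - energy(r - h)) / (2 * h)
print(abs(force(r) - f_num) < 1e-6)  # → True
```

In modern ML FF codes this differentiation is done automatically (backpropagation through the energy network), so the same guarantee extends to many-atom models.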
Different ML functionals for specific classes of molecules, target properties, and electronic structure situations are being developed, as are fast methods that, e.g., learn energy functionals directly without having to solve the Kohn–Sham equations, thus making them a viable approach for ab initio molecular dynamics simulations [36].
C. Predictions of Chemical Reactions and Catalyst Systems
Research on chemical reactions is another field that has been benefiting from the advances in ML methodology. ML has been paving the way for a better understanding of chemical transformations with numerous real-world implications. SMILES are often the representation of choice for both the inputs (reactants and reagents) and outputs (products) of data-derived models for chemical reactions. These models are trained on known reactions to recognize structural patterns that may undergo bond-breaking or -formation in the course of a reaction or catalytic process [37]. One particularly important data set for this application domain is the result of the US patent reaction extraction by Lowe [38].

The progress in predicting organic reactions and their products has been particularly noteworthy in recent years. Nam et al. have introduced sequence-to-sequence models to address the reaction prediction task similar to linguistic translation problems [39]. More recently, Schwaller et al. outperformed a similar approach in an end-to-end template-free model with a focus on the attention mechanism and a new tokenization strategy [40]. Coley et al. introduced a graph convolutional neural network approach with competitive performance. It was used for the prediction of reaction products as well as reactive sites of the reagents that are most likely to initiate a reaction [41]. One major contribution of the last two studies is the development of web applications to facilitate easy access to their models. These tools are available via the IBM RXN and ASKCOS websites, respectively [42, 43].

A promising direction of ongoing work is the prediction of reaction pathways and mechanisms. All these efforts ultimately aim for a practical and more generalizable implementation of retrosynthetic analysis, which has been a grand challenge in organic chemistry for many years [44].
Insights regarding the synthetic feasibility of virtual compounds are also a key concern for the screening library generation and molecular design efforts discussed in Sec. II A.
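The SMILES-as-text view underlying the sequence models above can be made concrete with a toy tokenizer. This is a deliberately simplified sketch, not the tokenization scheme of the cited studies: it only illustrates why multi-character tokens must be kept intact.

```python
import re

# Toy SMILES tokenizer for sequence models: bracket atoms, two-letter
# elements, and ring-bond digits must stay whole, otherwise 'Cl' would
# be misread as carbon followed by a stray 'l'.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[=#()\d]|[A-Za-z]")

def tokenize(smiles):
    return TOKEN.findall(smiles)

print(tokenize("CC(=O)Oc1ccccc1"))  # phenyl acetate
print(tokenize("ClCCl"))            # → ['Cl', 'C', 'Cl']
```

Note the alternation order in the pattern: 'Cl' and 'Br' are tried before single letters, which is what prevents the mis-split.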
III. OUTLOOK ON FUTURE DIRECTIONS

A. Feature Representations
As discussed in Sec. I, the descriptors of a given molecular system are an abstraction of its detailed nature (as well as a numerical representation). The choice of a suitable feature space is still our first and most effective means to infuse physics into ML models. There have been efforts to define criteria for the development of efficient descriptors [5], e.g., that they are (1) invariant to the symmetries of the underlying physics; (2) easy to interpret; (3) expressed in a direct and concise form to avoid redundancy and the curse of dimensionality; and (4) computationally efficient. However, developing molecular representations that adhere to all these criteria has been an exceedingly difficult task. More importantly, there is now agreement that ML approaches may require different types of descriptors to recover the entirety of SPRs of molecular systems. Further research into the creation of new descriptors (including fingerprint schemes) as well as the formulation of additional criteria will be necessary for the foreseeable future. The accessibility and flexibility of deep learning models can accelerate future developments via learned features and theory-informed models.
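Criterion (1) can be illustrated with a toy geometric descriptor: the sorted list of all pairwise interatomic distances is unchanged by translation, rotation, and atom reordering. This is a pedagogical sketch only; it ignores atom types, scales quadratically, and cannot distinguish some mirror-image structures, so it is not a production descriptor.

```python
import numpy as np

def sorted_distance_descriptor(coords):
    # Toy descriptor obeying criterion (1): invariant to translation,
    # rotation, and permutation of the atom ordering.
    coords = np.asarray(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(coords), k=1)   # upper triangle: unique pairs
    return np.sort(d[iu])

mol = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shifted = mol + np.array([5.0, -2.0, 3.0])   # translated copy
permuted = mol[[2, 0, 1]]                    # atoms reordered

print(np.allclose(sorted_distance_descriptor(mol),
                  sorted_distance_descriptor(shifted)))   # → True
print(np.allclose(sorted_distance_descriptor(mol),
                  sorted_distance_descriptor(permuted)))  # → True
```

Raw Cartesian coordinates fail both checks, which is precisely why descriptor design starts from invariances rather than from the raw representation m.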
B. Machine Learning for Small Data
While ML ideas became popular during the recent 'big data' wave (i.e., in chemistry with the emergence of large-scale screening results from high-level first-principles modeling), large data sets are in practice more often than not unavailable. In fact, problems for which data is (still) sparse tend to be of particular interest. As the data generation (both from experiment and modeling) is often a limiting factor, we will have to strive to reduce its cost or the number of data points needed to obtain ML models of a desired accuracy. It is thus essential to put an emphasis on developing ML methods that achieve better performance on small data sets. As mentioned in Sec. II B, transfer learning is a promising approach in this context. We will also need to employ smart sampling methods and identify data points that are most important for the training of ML models. Active learning strategies offer a path towards this goal [45–47]. Many of these techniques are of general-purpose utility, but some will have to be tailored towards the specific problem settings of data-derived models for chemistry.
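One generic active learning recipe, query-by-committee, can be sketched in a few lines: train an ensemble on bootstrap resamples of the scarce labeled data and request a label where the ensemble disagrees most. The polynomial committee, target function, and data sizes below are all hypothetical illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_poly_committee(x, y, n_models=5, degree=3):
    # Query-by-committee: a small ensemble trained on bootstrap samples.
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), len(x))
        models.append(np.polyfit(x[idx], y[idx], degree))
    return models

def disagreement(models, x_pool):
    preds = np.array([np.polyval(m, x_pool) for m in models])
    return preds.std(axis=0)  # high std = the committee disagrees

target = lambda x: np.sin(3 * x)    # hypothetical unknown SPR
x_train = rng.uniform(0, 1, 8)      # labeled data is scarce
y_train = target(x_train)
x_pool = np.linspace(0, 3, 300)     # unlabeled candidate pool

models = fit_poly_committee(x_train, y_train)
query = x_pool[np.argmax(disagreement(models, x_pool))]
print(float(query))  # next point to label (likely in the unexplored x > 1)
```

The selected query concentrates new labeling effort where the current models are least certain, which is how active learning reduces the number of expensive reference calculations needed.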
C. Software and Tool Development
The idea to utilize ML and other data mining techniques in the chemical domain is so recent that much of the basic infrastructure has not yet been developed, or is still in its early stages [1]. The majority of tools and expertise tend to be technically involved, labor intensive, or otherwise unavailable to the community at large. Many researchers are now starting to pursue open-source software development projects to tackle this situation [48]. However, the lack of rigorous development guidelines remains a challenge that researchers from domain science need to overcome to make their efforts lasting and sustainable. The Molecular Sciences Software Institute (MolSSI) is one of the pioneers in establishing best practices and guidance for early-stage software developments in this field [49, 50].
IV. CONCLUSIONS
In this review, we discussed how ML can advance traditional modeling and simulation by (partially) replacing them (i.e., choosing data-derived over physics-based models or combining the two); calibrating, augmenting, or otherwise correcting their results; targeting studies and their objectives; and providing the means to effectively mine their results for a deeper understanding of hidden SPRs. Many ML models are still built on data provided by modeling and simulation, often as part of virtual high-throughput screening studies, and combining ML and traditional modeling infuses physics and robustness into the resulting data-derived prediction models. These and other emerging ML techniques have been enabling accelerated discovery and rational design in numerous areas of chemistry. Its early successes indicate that ML is bound to become a mainstream tool in chemical research. Yet, there is still much to (machine) learn on how to develop the full potential of ML in chemistry.
COMPETING FINANCIAL INTERESTS
The authors declare that they have no competing financial interests.
ACKNOWLEDGMENTS
MH gratefully acknowledges support by Phase-I and Phase-II Software Fellowships (grant No. ACI-1547580-479590) of the National Science Foundation (NSF) Molecular Sciences Software Institute (grant No. ACI-1547580) at Virginia Tech. JH acknowledges support by the NSF CAREER program under grant No. OAC-1751161, the NSF Big Data Spokes program under grant No. IIS-1761990, and funding by the New York State Center of Excellence in Materials Informatics (grant No. CMI-1148092-8-75163).
ANNOTATIONS

• [1] This NSF workshop report compiles the opinions of a group of active researchers in the field regarding the current challenges and future opportunities offered by data-driven approaches in the chemical domain.
• [20] This review discusses the recent advances in inverse molecular design using deep generative models.
• [34] In this study, deep learning is used to fit interatomic potentials and develop the so-called ANI model for transferable data-derived potentials with comparable accuracy to the reference DFT calculations.
• [37] This study surveys the role of ML in synthesis planning and the prediction of reaction outcomes.
• [48] This paper presents a software ecosystem for the development and broader dissemination of techniques at the different stages of a molecular data mining workflow.

REFERENCES

[1] Johannes Hachmann, Theresa L. Windus, John A. McLean, Vanessa Allwardt, Alexandra C. Schrimpe-Rutledge, Mohammad Atif Faiz Afzal, and Mojtaba Haghighatlari, Framing the role of big data and modern data science in chemistry, Tech. Rep. (2018).
[2] Kirstin Alberi, Marco Buongiorno Nardelli, Andriy Zakutayev, Lubos Mitas, Stefano Curtarolo, Anubhav Jain, Marco Fornari, Nicola Marzari, Ichiro Takeuchi, Martin L. Green, Mercouri Kanatzidis, Mike F. Toney, Sergiy Butenko, Bryce Meredig, Stephan Lany, Ursula Kattner, Albert Davydov, Eric S. Toberer, Vladan Stevanovic, Aron Walsh, Nam-Gyu Park, Alán Aspuru-Guzik, Daniel P. Tabor, Jenny Nelson, James Murphy, Anant Setlur, John Gregoire, Hong Li, Ruijuan Xiao, Alfred Ludwig, Lane W. Martin, Andrew M. Rappe, Su-Huai Wei, and John Perkins, "The 2019 materials by design roadmap," J. Phys. D: Appl. Phys., 013001 (2018).
[3] Matthias Rupp, O. Anatole von Lilienfeld, and Kieron Burke, "Guest Editorial: Special Topic on Data-Enabled Theoretical Chemistry," J. Chem. Phys. (2018), 10.1063/1.5043213, arXiv:1806.02690.
[4] David Weininger, "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules," J. Chem. Inf. Model., 31–36 (1988).
[5] Albert P. Bartók, Risi Kondor, and Gábor Csányi, "On representing chemical environments," Phys. Rev. B, 184115 (2013).
[6] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley, "Molecular graph convolutions: moving beyond fingerprints," J. Comput. Aided Mol. Des., 595–608 (2016).
[7] Truong Son Hy, Shubhendu Trivedi, Horace Pan, Brandon M. Anderson, and Risi Kondor, "Predicting molecular properties with covariant compositional networks," J. Chem. Phys. (2018), 10.1063/1.5024797.
[8] Kristof T. Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R. Müller, and Alexandre Tkatchenko, "Quantum-chemical insights from deep tensor neural networks," Nat. Commun., 6–13 (2016), arXiv:1609.08259.
[9] Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S. Sánchez-Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M. Brockway, and Alán Aspuru-Guzik, "The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the World Community Grid," J. Phys. Chem. Lett., 2241–2251 (2011).
[10] Johannes Hachmann, Roberto Olivares-Amaya, Adrian Jinich, Anthony L. Appleton, Martin A. Blood-Forsythe, László R. Seress, Carolina Román-Salgado, Kai Trepte, S. Atahan-Evrenk, Süleyman Er, Supriya Shrestha, Rajib Mondal, Anatoliy Sokolov, Zhenan Bao, and Alán Aspuru-Guzik, "Lead candidates for high-performance organic photovoltaics from high-throughput quantum chemistry: the Harvard Clean Energy Project," Energy Environ. Sci., 698–704 (2014).
[11] Zheng Gong, Yanze Wu, Liang Wu, and Huai Sun, "Predicting thermodynamic properties of alkanes by high-throughput force field simulation and machine learning," J. Chem. Inf. Model., 2502–2516 (2018).
[12] Edward O. Pyzer-Knapp, Changwon Suh, Rafael Gómez-Bombarelli, Jorge Aguilera-Iparraguirre, and Alán Aspuru-Guzik, "What is high-throughput virtual screening? A perspective from organic materials discovery," Annu. Rev. Mater. Res., 195–216 (2015).
[13] Keith T. Butler, Daniel W. Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh, "Machine learning for molecular and materials science," Nature, 547–555 (2018), arXiv:1402.6991v1.
[14] Zheng Li, Noushin Omidvar, Wei Shan Chin, Esther Robb, Amanda Morris, Luke Achenie, and Hongliang Xin, "Machine-learning energy gaps of porphyrins with molecular graph representations," J. Phys. Chem. A, 4571–4578 (2018).
[15] Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules," Sci. Data, 1–7 (2014).
[16] Grégoire Ferré, Terry Haut, and Kipton Barros, "Learning molecular energies using localized graph kernels," J. Chem. Phys., 114107 (2017).
[17] Christopher R. Collins, Geoffrey J. Gordon, O. Anatole von Lilienfeld, and David J. Yaron, "Constant size descriptors for accurate machine learning models of molecular properties," J. Chem. Phys. (2018), 10.1063/1.5020441.
[18] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/ (2013).
[19] Felix A. Faber, Luke Hutchison, Bing Huang, Justin Gilmer, Samuel S. Schoenholz, George E. Dahl, Oriol Vinyals, Steven Kearnes, Patrick F. Riley, and O. Anatole von Lilienfeld, "Prediction errors of molecular machine learning models lower than hybrid DFT error," J. Chem. Theory Comput., 5255–5264 (2017).
[20] Benjamin Sanchez-Lengeling and Alán Aspuru-Guzik, "Inverse molecular design using machine learning: Generative models for matter engineering," Science, 360–365 (2018).
[21] Benjamin Sanchez-Lengeling, Carlos Outeiral, Gabriel L. Guimaraes, and Alán Aspuru-Guzik, "Optimizing distributions over molecular space. An objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC)," ChemRxiv preprint (2017), 10.26434/chemrxiv.5309668.v3.
[22] Evgeny Putin, Arip Asadulaev, Yan Ivanenkov, Vladimir Aladinskiy, Benjamin Sanchez-Lengeling, Alán Aspuru-Guzik, and Alex Zhavoronkov, "Reinforced adversarial neural computer for de novo molecular design," J. Chem. Inf. Model., 1194–1204 (2018).
[23] Artur Kadurin, Alexander Aliper, Andrey Kazennov, Polina Mamoshina, Quentin Vanhaelen, Kuzma Khrabrov, and Alex Zhavoronkov, "The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology," Oncotarget, 10883–10890 (2016), arXiv:1703.10593.
[24] Mariya Popova, Olexandr Isayev, and Alexander Tropsha, "Deep reinforcement learning for de novo drug design," Sci. Adv. (2018), 10.1126/sciadv.aap7885, arXiv:1711.10907.
[25] Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik, "Automatic chemical design using a data-driven continuous representation of molecules," ACS Cent. Sci., 268–276 (2018), arXiv:1610.02415.
[26] Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, and Mark P. Waller, "Generating focused molecule libraries for drug discovery with recurrent neural networks," ACS Cent. Sci., 120–131 (2018), arXiv:1701.01329.
[27] Garrett B. Goh, Charles Siegel, Abhinav Vishnu, and Nathan O. Hodas, "Using rule-based labels for weak supervised learning: A ChemNet for transferable chemical property prediction," arXiv preprint (2017), arXiv:1712.02734.
[28] Mohammad M. Sultan and Vijay S. Pande, "Transfer learning from Markov models leads to efficient sampling of related systems," J. Phys. Chem. B, 5291–5299 (2018).
[29] Mohammad Atif Faiz Afzal, Chong Cheng, and Johannes Hachmann, "Combining first-principles and data modeling for the accurate prediction of the refractive index of organic polymers," J. Chem. Phys., 241712 (2018).
[30] João Marcelo Lamim Ribeiro, Pablo Bravo, Yihang Wang, and Pratyush Tiwary, "Reweighted autoencoded variational Bayes for enhanced sampling (RAVE)," J. Chem. Phys., 072301 (2018), arXiv:1802.03420.
[31] Chen Wei, Aik Rui Tan, and Andrew L. Ferguson, "Collective variable discovery and enhanced sampling using autoencoders: Innovations in network architecture and error function design," J. Chem. Phys., 072312 (2018).
[32] Alireza Khorshidi and Andrew A. Peterson, "Amp: A modular approach to machine learning in atomistic simulations," Comput. Phys. Commun., 310–324 (2016).
[33] Venkatesh Botu, Rohit Batra, James Chapman, and Rampi Ramprasad, "Machine learning force fields: Construction, validation, and outlook," J. Phys. Chem. C, 511–522 (2017), arXiv:1610.02098.
[34] Justin S. Smith, Olexandr Isayev, and Adrian E. Roitberg, "ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost," Chem. Sci., 3192–3203 (2017), arXiv:1610.08935.
[35] Stefan Chmiela, Alexandre Tkatchenko, Huziel E. Sauceda, Igor Poltavsky, Kristof T. Schütt, and Klaus-Robert Müller, "Machine learning of accurate energy-conserving molecular force fields," Sci. Adv.
(2017), 10.1126/sciadv.1603015.[36] Felix Brockherde, Leslie Vogt, Li Li, Mark E. Tuckerman,Kieron Burke, and Klaus-Robert M¨uller, “Bypassingthe Kohn-Sham equations with machine learning,”Nat. Commun. (2017), 10.1038/s41467-017-00839-3,1609.02815.[37] Connor W. Coley, William H. Green, and Klavs F.Jensen, “Machine Learning in Computer-Aided SynthesisPlanning,” Acc. Chem. Res. , 1281–1289 (2018), thisstudy surveys the role of ML in synthesis planning andthe prediction of reaction outcomes.[38] Daniel M. Lowe, “Patent reaction extraction: down-loads, https://bitbucket.org/dan2097/patent-reaction-extraction/downloads,” (2014).[39] Juno Nam and Jurae Kim, “Linking the neuralmachine translation and the prediction of organicchemistry reactions,” arXiv:1612.09529 , 1–19 (2016),arXiv:1612.09529.[40] Philippe Schwaller, Th´eophile Gaudin, D´avid L´anyi,Costas Bekas, and Teodoro Laino, “”Found in Trans-lation”: predicting outcomes of complex organic chem-istry reactions using neural sequence-to-sequence mod-els,” Chem. Sci. , 6091–6098 (2018), arXiv:1711.04810.[41] Connor W Coley, Wengong Jin, Luke Rogers, Timothy FJamison, Tommi S Jaakkola, William H Green, ReginaBarzilay, and Klavs F Jensen, “A graph-convolutionalneural network model for the prediction of chemical re-activity,” Chem. Sci. , 370–377 (2019).[42] “IBM RXN for Chemistry,” https://rxn.res.ibm.com (2018).[43] “ASKCOS,” http://askcos.mit.edu (2018).[44] Tomasz Klucznik, Barbara Mikulak-Klucznik, Michael P.McCormack, Heather Lima, Sara Szymku´c, Man-ishabrata Bhowmick, Karol Molga, Yubai Zhou, LindseyRickershauser, Ewa P. Gajewska, Alexei Toutchkine, Pi-otr Dittwald, Micha l P. Startek, Gregory J. Kirkovits,Rafa l Roszak, Ariel Adamski, Bianka Sieredzi´nska,Milan Mrksich, Sarah L.J. 
Trice, and Bartosz A.Grzybowski, “Efficient syntheses of diverse, medici-nally relevant targets planned by computer and ex-ecuted in the laboratory,” Chem , 522–532 (2018),arXiv:j.chempr.2018.02.002 [10.1016].[45] Florian H¨ase, Lo¨ıc M. Roch, Christoph Kreis-beck, and Al´an Aspuru-Guzik, “PHOEN-ICS: A universal deep Bayesian optimizer,”Prepr. https//arxiv.org/abs/1801.01469 (2018), 10.1021/acscentsci.8b00307,arXiv:1801.01469.[46] Kevin Tran and Zachary W. Ulissi, “Active learn-ing across intermetallics to guide discovery of elec-trocatalysts for CO2reduction and H2evolution,”Nat. Catal. , 696–703 (2018).[47] Konstantin Gubaev, Evgeny V. Podryabinkin,Gus L.W. Hart, and Alexander V. Shapeev, “Ac-celerating high-throughput searches for new al-loys with active learning of interatomic potentials,”Comput. Mater. Sci. , 148–156 (2019).[48] Johannes Hachmann, Mohammad Atif Faiz Afzal, Mo-jtaba Haghighatlari, and Yudhajit Pal, “Building anddeploying a cyberinfrastructure for the data-driven de-sign of chemical systems and the exploration of chem-ical space,” Mol. Simul. , 921–929 (2018), this paperpresents a software ecosystem for the development andbroader dissemination of techniques at the differentstages of a molecular data mining workflow.[49] Anna Krylov, Theresa L. Windus, Taylor Barnes, EliseoMarin-Rimoldi, Jessica A. Nash, Benjamin Pritchard,Daniel G.A. Smith, Doaa Altarawy, Paul Saxe, Ce-cilia Clementi, T. Daniel Crawford, Robert J. Har-rison, Shantenu Jha, Vijay S. Pande, and TeresaHead-Gordon, “Perspective: Computational chemistrysoftware and its advancement as illustrated throughthree grand challenge cases for molecular science,”J. Chem. Phys. , 180901 (2018).[50] Nancy Wilkins-Diehr and T. Daniel Crawford, “NSF’sinaugural software institutes: The science gateways com-munity institute and the molecular sciences software in-stitute,” Comput. Sci. Eng.20