Deep Learning and Knowledge-Based Methods for Computer Aided Molecular Design -- Toward a Unified Approach: State-of-the-Art and Future Directions
Abdulelah S. Alshehri a,b, Rafiqul Gani c, Fengqi You a,*

a Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY 14853, USA
b Department of Chemical Engineering, College of Engineering, King Saud University, P.O. Box 800, Riyadh 11421, Saudi Arabia
c PSE for SPEED Company, Skyttemosen 6, DK-3450 Allerød, Denmark

July 5, 2020
Abstract
The optimal design of compounds through manipulating properties at the molecular level is often the key to considerable scientific advances and improved process systems performance. This paper highlights key trends, challenges, and opportunities underpinning Computer-Aided Molecular Design (CAMD) problems. A brief review of knowledge-driven property estimation methods and solution techniques, as well as corresponding CAMD tools and applications, is first presented. In view of the computational challenges plaguing knowledge-based methods and techniques, we survey the current state-of-the-art applications of deep learning to molecular design as a fertile approach towards overcoming computational limitations and navigating uncharted territories of the chemical space. The main focus of the survey is on deep generative modeling of molecules under various deep learning architectures and different molecular representations. Further, the importance of benchmarking and empirical rigor in building deep learning models is spotlighted. The review also presents a detailed discussion of the current perspectives and challenges of knowledge-based and data-driven CAMD and identifies key areas for future research directions. Special emphasis is placed on the fertile avenue of the hybrid modeling paradigm, in which deep learning approaches are exploited while leveraging the accumulated wealth of knowledge-driven CAMD methods and tools.

* Corresponding author. Phone: (607) 255-1162; Fax: (607) 255-9166; E-mail: [email protected]
Keywords: Computer-aided molecular design, product design, process design, deep generative modeling, deep learning

Introduction
Materials have been a pivotal part of the modern economy and a prospective solution route to the world's most pressing scientific problems, which span disciplines and impact communities [1], [2]. Designing molecules to improve the functionality and efficiency of products and processes can bring tangible environmental, technological, and economic benefits in energy harvesting and storage, medical diagnostics and therapy, and carbon capture and utilization, to mention but a few. The design process of chemicals involves deriving new variants from existing molecules or creating entirely novel ones [1]. Yet, molecular design has been a daunting trial-and-error, experiment-based and/or heuristic rule-based process, constrained by resources and limited to a small class of known molecular structures [3], [4]. In spite of century-long efforts in chemical synthesis and the large set of synthesized molecules (~10^8), the so-called chemical space is still an unexplored galaxy, with the number of small organic molecules populating the space estimated at more than 10^60 [5], [6]. Revolutionary advances are likely to emerge from novel molecules in the uncharted territories of the chemical space. Efforts to systematically characterize and link properties to molecules date back to the 1940s with the development of an early group contribution (GC) method for predicting the heats and free energies of organic compounds [7]. A recent GC-based model succeeded in reaching chemical accuracy for the enthalpy of formation of organic compounds. To some extent, advancements in property prediction methods have been driven by the need to build a framework for the reverse problem. This ultimate purpose was best emphasized by the organic photochemist George S. Hammond in his speech upon receiving the Norris Award in Physical Organic Chemistry in 1968: "The most fundamental and lasting objective of synthesis is not production of new compounds, but production of properties" [8].
A few decades later, the availability of powerful computational resources, in conjunction with the growing demand for chemical products, gave rise to the field of computer-aided molecular design (CAMD), which combines property estimation methods with mathematical optimization to find promising molecular structures that best meet target properties [9]. In this context, GC-based methods are the most used class of property estimation methods in CAMD due to their easy incorporation within mathematical models and their qualitatively correct estimates [10]. Identified as a key challenge in the new millennium, the importance of CAMD stems from the value and potential unlocked by connecting molecular-level design with properties at the macroscopic level [11]. Propelled partly by the exponential growth in computing power and partly by advances in mathematical optimization algorithms and property estimation models, CAMD has attracted growing research activity and enjoyed enormous progress across multiple scales, ranging from single-molecule design to mixture design and integrated product and process design. Yet, knowledge-based methods either suffer from the inadequate number of available property models or require significant computational resources to explore the staggeringly vast discrete design space. As such, severe limitations on the quality of solutions are imposed by the exponential growth of complexity with the size of molecules. Hence, the growing market demand for specialty chemicals and global environmental concerns highlight the urgent need for smarter, more efficient, and more effective methods to navigate the chemical space in order to accelerate innovation and discovery in molecular design. Machine learning has emerged to outperform many conventional algorithms and artificial intelligence techniques across various fields, such as computer vision [12], [13], speech recognition [14], [15], and natural language processing [16], [17].
In particular, the subfield of deep learning has shown a remarkable ability to discover intricate structures in high-dimensional data and transform complex representations into higher abstract levels [18]. For the problem of property prediction, deep learning has not only outperformed other machine learning methods for building accurate quantitative structure-property relationship (QSPR) models [19], but has also recently closed the gap between QSPR predictive models and quantum chemistry-based Density Functional Theory (DFT) on the formation energy of materials [20]. As mathematical optimization plays a central role in CAMD, deep learning has enabled unforeseen transformative advances on the theoretical and practical aspects of optimization, as in the linearization of mixed-integer quadratic programming problems [21], optimization under uncertainty [22], human-level control [23], and battery charging protocols [24]. Further, deep generative models have recently made large strides in their ability to model molecular distributions and synthesize realistic molecules from learned distributions, as evidenced by numerous successful applications in molecular design [25], [26]. The promise of integrating deep learning in CAMD is the better performance of computational models, offering a more extensive and less demanding landscape for optimization and characterization, leading to novel technologies and reductions in computational demand [27]. Several sources in the literature have pointed out potential hybrid routes for harnessing the promise of deep learning while leveraging the accumulated wealth of scientific knowledge by developing theory-guided neural networks in quantum chemistry [28], fluid mechanics [29], and system dynamics [30]. A number of excellent reviews of the knowledge-based CAMD paradigm are available in the literature [4], [9], [31], [32].
Instead of merely focusing on knowledge-based methods and techniques, this review lays out a comparative description of both the knowledge-based CAMD and the emerging deep learning approaches, identifies challenges and gaps in these approaches, and explores opportunities and future directions for the approaches and their combination. In the following section, we describe and discuss the key principles of the knowledge-based CAMD paradigm, including property estimation methods and solution techniques for organic molecules, along with their tools and applications. Section 3 surveys the current state-of-the-art in the nascent area of deep learning for molecular design, covering three main elements: molecular representations, major deep generative architectures, and benchmarking and evaluation metrics. The survey starts by assessing the relative merits of different molecular representations, which in turn define the information and structures that can be exploited in learning-based model development. It is followed by a discussion detailing the mathematical structure, respective strengths, and reported results of several deep learning architectures. Further, special emphasis is placed on benchmarks for evaluating the generation of molecules and the optimization of their properties in learning-based models. In Section 4, we offer a thorough discussion of the perspectives and challenges of the knowledge-based and data-driven CAMD paradigms to identify important future research directions for this rapidly evolving field. We also highlight the potential of a hybrid modeling paradigm combining the strengths of knowledge-driven CAMD methods and powerful deep learning techniques for more effective, accurate, reliable, and efficient molecular design. The stages of the classical knowledge-based CAMD and recently developed data-driven frameworks are illustrated in Figure 1.

Figure 1. Diagrammatic comparison of computer-aided molecular design methods.

Knowledge-based Computer-Aided Molecular Design
A generic solution framework for CAMD involves two components for molecular modeling: (1) a model for the estimation of molecular properties and the thermodynamics of their mixtures, and (2) computational optimization algorithms for guiding the search in the large, discrete chemical space. In this section, we summarize several foundational aspects of the CAMD problem: property prediction methods for pure organic components and mixtures, mathematical formulations of CAMD classes, solution methods and techniques, software tools, and common applications. This section also offers comparative descriptions of the knowledge-based CAMD framework components.
Property Estimation Methods
Finding a solution to a CAMD problem depends on the ability to efficiently and accurately estimate target properties [33]. The estimation involves the construction of pure compound models as well as functional and thermodynamic models for mixtures. Despite the staggering number of mechanistic, empirical, and semi-empirical property estimation methods, most of them are not applicable in CAMD. Semi-empirical models have been the most popular class of methods due to their lower computational cost and their representation of molecules, which can be easily encoded within optimization models [4], [34]. However, such models enforce further constraints on the formulation of the CAMD problem in the form of application ranges, availability of parameters, and inherent uncertainty within the estimation models. The generation of practical CAMD solutions typically follows a hierarchical ranking process, in which the computational effort is adjusted as the search space is narrowed in subsequent abstraction levels, shifting property prediction from simple GC-based methods to complex computations and experimental measurements [3]. At the most essential level, the CAMD and property prediction problems require molecules to be decomposed into their basic chemical building units: atoms, bonds, rings, functional groups, atomic types, charges, and so on. The various representations of building units give rise to different QSPR methods. It should be noted that there is a distinction between the GC family of methods and QSPRs, although GC-based methods can be classified as a special case of QSPRs [10], [35]. In this context, QSPRs with descriptors can characterize properties very well, whereas GC methods offer comparable performance with fewer parameters [10]. In the next subsections, widely applied QSPR methods for pure components and mixture property models are summarized and comparatively evaluated.
We note that the following property prediction discussion revolves around organic molecules and systems, despite recent extensions of these methods to other chemical classes such as ionic liquids [36] and electrolytes [37].
Pure Component Quantitative Structure-Property Relationships
Group contribution (GC) methods, topological indices (TIs), and signature descriptors (SDs) are the most popular semi-empirical models for pure component property prediction in CAMD. In particular, most methods in CAMD applications are based on the GC or additivity approach. First proposed in 1949, this approach is probably the earliest method for estimating several properties of compounds [7]. Additivity or contribution methods express the properties of molecules as functions of the number of occurrences of molecular fragments called functional groups [10], [38]. The accuracies of earlier GC models have been significantly improved through the development of more elaborate models that include polyfunctional and structural groups and interaction terms. GC-based models have proven capable of achieving high levels of accuracy, estimating the enthalpy of formation within chemical accuracy for a broad range of organic molecules [39]. A prevalent GC method in CAMD applications is the GC+ method, which performs the estimation at three distinct levels: a first-order level using contributions from simple groups, F; a second-order level for polyfunctional compounds and the identification of isomers, S; and a third-order level for larger structural features, such as fused rings, T [33], [40]. Improved models and additional features of the GC+ method have been proposed, such as revised parameters with uncertainty estimates [41] and 22 environment-related properties [42]. For a vector of the number of occurrences of each group, n_g, and a vector of regression coefficients, c_g, the GC+ model takes the form:

f(P) = Σ_{g∈F} c_g n_g + Σ_{g∈S} c_g n_g + Σ_{g∈T} c_g n_g    (1)

The GC+ model includes 182 first-order, 122 second-order, and 66 third-order groups. Past GC-based models have suffered from accuracy degradation for large multifunctional molecules [10].
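The additive structure of Eq. (1) can be sketched in a few lines of code. The group sets and coefficient values below are hypothetical placeholders for illustration only, not parameters from any published GC+ table:

```python
# Minimal sketch of a GC+-style additive property estimate, Eq. (1):
# sum contributions c_g * n_g over first-, second-, and third-order groups.
# All group names and coefficients here are hypothetical.

def gc_plus_estimate(occurrences, first, second, third):
    """Return f(P) = sum over the three contribution levels of c_g * n_g."""
    total = 0.0
    for level in (first, second, third):
        for group, c_g in level.items():
            total += c_g * occurrences.get(group, 0)
    return total

# Hypothetical contributions (arbitrary units) for a small acyclic alcohol.
first_order  = {"CH3": 0.55, "CH2": 0.42, "OH": 1.30}   # simple groups, F
second_order = {"CH(OH)CH3": 0.08}                      # polyfunctional correction, S
third_order  = {}                                       # no fused-ring features, T

n_g = {"CH3": 2, "CH2": 1, "OH": 1, "CH(OH)CH3": 0}     # group occurrence vector
print(gc_plus_estimate(n_g, first_order, second_order, third_order))
```

In practice, the left-hand side f(P) is a property-specific function (often nonlinear, e.g., logarithmic for the boiling point), which is inverted after the additive sum is evaluated.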
However, recent models have addressed this limitation, reporting acceptable performance for complex molecules such as amino acids and lipids [43], [44]. We point the interested reader to a recent review for more detailed coverage of the current limitations and opportunities in GC methods [10]. The TI and SD QSPR methods are based on chemical graph theory, where the atoms and bonds of a chemical structure are represented as nodes and edges in a graph [45]. TIs are computed using functions that capture several features of the chemical graph, such as connectivity and atomic types [46]. Signatures in SDs are systematic codification procedures over an alphabet of atomic types, characterizing the neighborhood of the atoms present in a molecule [47]. In these classes of QSPRs, instead of pairing groups with regression coefficients as in GC, indices and descriptors are used to relate properties to different molecular graph features. TIs take several forms and functions, such as the sum of the shortest distances between all pairs of nodes, known as the Wiener index [48], and the sum of bond contributions in the Randić index [49]. Similar to TIs, SDs translate molecular graphs into molecular descriptors, but SDs preserve connectivity and structural information through node coloring that differentiates between different atoms and different environments of the same atom type [4]. It should be noted that both the TI and SD methods have seen limited application in CAMD [4]. When comparing these popular approaches for CAMD applications, it is useful to focus on their predictive and discriminative powers, together with their applicability within optimization models. In discussing the generalizability of predictions and the robustness of regression variables, it is essential to highlight that chemical graph methods are more reflective of the molecular structure [45], owing to their expression as functions of the entire chemical graph, as opposed to the independently contributing groups of the GC class of methods. Nevertheless, this highly descriptive nature, together with the lack of established TI and SD models for wider classes of molecules, significantly degrades their predictive power, leading to overfitting issues and several limitations on the design space. Despite this weaker predictive power, TIs and SDs are far more discriminatively powerful than GC methods, as demonstrated by their ability to discriminate between similar chemical structures, including stereoisomers. Still, for general CAMD problems, chemical graph-based QSPR methods are more problematic to incorporate into optimization models, rendering some problems intractable since they require more binary and discrete variables [4]. It is worth noting that the size and complexity issues have been alleviated by decomposition-based methods, which are covered in subsection 2.3.4. In these methods, the pre-analysis stage of the framework substantially reduces the search space by removing redundant binary variables [34].
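To make the TI idea concrete, the Wiener index mentioned above can be computed directly from a molecular graph as the sum of shortest-path distances over all unordered atom pairs. The sketch below uses breadth-first search on an adjacency list, with the hydrogen-suppressed graph of n-butane as a small example:

```python
# Wiener index of a molecular graph: the sum of shortest-path distances
# over all unordered pairs of atoms (nodes). BFS suffices because all
# bonds (edges) have unit length in the chemical graph.
from collections import deque

def wiener_index(adj):
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # BFS from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # count each unordered pair once
        total += sum(d for node, d in dist.items() if node > src)
    return total

# n-butane as a hydrogen-suppressed path graph: C0-C1-C2-C3
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wiener_index(butane))  # → 10
```

The value 10 for n-butane matches the pairwise distances 1+1+1+2+2+3; branching lowers the index (isobutane gives 9), which is what lets such indices discriminate between isomers.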
Mixture Property Models
For the problem of mixture property prediction, properties can be classified into functional and equilibrium-based. In the case of functional properties, the estimation is based on pure component QSPR methods and a mixing rule for a given set of molecules and their compositions with a predefined phase identity. In the second class, equilibrium-based properties are predicted using calculation algorithms for vapor-liquid equilibrium (VLE), liquid-liquid equilibrium (LLE), and solid-liquid equilibrium (SLE). Generating estimates for mixture properties requires the integration of pure component and equilibrium-based property models to predict properties and calculate phase behaviors [9], [50]. Mixture thermodynamic models have been well studied across different disciplines, with numerous models applied in CAMD, including UNIFAC [51], SAFT [52], and COSMO-based methods [40], [53]. In the problem of mixture design, the use of COSMO-based methods is advantageous over other methods, as they do not require binary interaction parameters, which pose a challenge to the search algorithm in design optimization. Moreover, the collection of COSMO methods provides an accuracy level comparable to the quantum chemistry-based method, DFT [4], [53].
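The functional-property route above (pure component QSPR estimates plus a mixing rule) can be illustrated with the simplest possible mixing rule, a linear composition-weighted average. Real CAMD mixture models typically use nonlinear rules and activity-coefficient models such as UNIFAC; this linear form and the pure component values below are only an illustrative sketch:

```python
# Sketch of a functional mixture property via a linear mixing rule:
#   P_mix = sum_i x_i * P_i
# where x_i are mole fractions and P_i are pure component estimates
# (e.g., from a GC-based QSPR). Values below are hypothetical.

def linear_mixing_rule(fractions, pure_props):
    """Composition-weighted average of pure component property values."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "compositions must sum to 1"
    return sum(x * p for x, p in zip(fractions, pure_props))

# Hypothetical molar heat capacities (J/mol/K) for a binary solvent blend
x = [0.4, 0.6]
cp_pure = [112.0, 81.0]
print(round(linear_mixing_rule(x, cp_pure), 2))
```

Equilibrium-based properties cannot be obtained this way; they require the VLE/LLE/SLE algorithms and thermodynamic models discussed above.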
Mathematical Optimization Formulation for CAMD
Several alternative optimization formulations of the CAMD problem have been developed in the literature to address the problem at different scales and design objectives. The levels of the problem and their applications fall broadly into three classes: single-molecule design, mixture/blend design, and integrated process and product design. The last design class encloses both or either of the other two classes within a process/product problem with an explicit relationship between the molecule/mixture and the process/product. A generic mathematical formulation of the integrated process/product design problem is the following mixed-integer nonlinear programming (MINLP) model [31]:

max_{x,z,y,w,v}  C^T z + f(x)          (2)

s.t.  P_r = f(x, z, w, v, θ)           (3)

      θ^L ≤ φ(y) ≤ θ^U                 (4)

      θ^L ≤ φ(θ, y) ≤ θ^U              (5)

      θ^L ≤ φ(θ, y, x) ≤ θ^U           (6)

      θ^L ≤ φ(θ, y, x, z) ≤ θ^U        (7)

      S^L ≤ S(y, η) ≤ S^U              (8)

      Bx + C^T z ≤ D                   (9)

where x is a vector of continuous variables (e.g., operating conditions, flow rates), z is a vector of measured-controlled variables (e.g., temperature, pressure), y is a vector of binary variables (e.g., descriptor identity, molecule identity), and w and v are vectors of input variables and manipulated-design variables, respectively. Eq. (3) expresses the process model equations as functions of the model variables and the property functions, θ. Eqs. (4)-(7) provide upper and lower bounds, θ^L and θ^U, on the different classes of pure component and mixture properties, with φ representing the structure-based property functions. The last two constraints ensure molecular and flowsheet feasibility, respectively, given the constant vectors B and D [50]. The above model represents an integrated process and product design problem, but many variations of CAMD problems can be derived from it. For instance, the subset of Eqs. (4)-(8) gives the molecular feasibility problem, where molecules that satisfy target properties are generated and tested for structural validity. The addition of the cost function to the feasibility problem turns it into the single-molecule design problem, which can determine the optimal molecular structure [31].

CAMD Solution Methods
Enumeration Methods
Early efforts in the search for molecules and mixtures for specific applications were carried out using enumeration methods. Solving the CAMD problem with this class of approaches is combinatorial: chemically feasible molecules are generated and their target properties estimated, followed by screening the properties against specified property constraints and ranking the molecules using an evaluation metric [50]. Given the low computational burden of evaluating properties from QSPRs, this sequential "generate-and-test" approach is particularly efficient for selecting an optimal molecule from a small pool of molecules in the chemical space. Yet, the major obstacle to applying this class of solution methods lies in limiting the design space to a practical size for larger problems. The combinatorial explosion that arises in large design spaces can be circumvented by controlling the generation and testing steps through screening. A number of algorithms have been developed to incorporate knowledge-based evaluation and sequential interval analysis for efficiently reducing the number of candidate molecules [54], [55].
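The generate-and-test loop described above can be sketched over a toy design space of group-count vectors. The groups, additive contributions, and property window below are all hypothetical; a real implementation would also enforce structural feasibility (e.g., valence) rules in the generation step:

```python
# Sketch of sequential "generate-and-test" enumeration for CAMD:
# (1) generate candidate group-count vectors, (2) estimate a target
# property with a cheap additive QSPR, (3) screen against property
# bounds, (4) rank survivors. Groups and coefficients are hypothetical.
from itertools import product

GROUPS = ["CH3", "CH2", "OH"]
CONTRIB = {"CH3": 0.55, "CH2": 0.42, "OH": 1.30}   # hypothetical c_g values

def estimate(counts):
    """Cheap additive property estimate for a group-count vector."""
    return sum(CONTRIB[g] * n for g, n in zip(GROUPS, counts))

def generate_and_test(max_per_group=3, lo=1.5, hi=2.5):
    candidates = []
    for counts in product(range(max_per_group + 1), repeat=len(GROUPS)):
        if sum(counts) == 0:
            continue                      # skip the empty "molecule"
        p = estimate(counts)
        if lo <= p <= hi:                 # screen against property bounds
            candidates.append((counts, p))
    # rank by closeness to the midpoint of the target window
    target = (lo + hi) / 2
    return sorted(candidates, key=lambda cp: abs(cp[1] - target))

best_counts, best_value = generate_and_test()[0]
print(best_counts, best_value)
```

The exhaustive `product` loop is exactly where the combinatorial explosion appears: its size grows as (max_per_group + 1) raised to the number of groups, which is why screening and interval-analysis reductions are essential at scale.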
Mathematical Programming
In addition to enumeration methods, the field of CAMD has benefited from rapid advances in mathematical optimization algorithms for identifying solutions to problems with nonconvexities and nonlinearities in the molecular system model. This class of methods guarantees optimality for convex problems and serves as a theoretical basis for heuristics and decomposition methods [4]. Before presenting a few relevant optimization frameworks for CAMD problems, we note the challenges faced in the development of the CAMD algorithmic framework. Sharing the same challenge as the enumeration methods, the binary variables associated with descriptors and structural interactions give rise to a combinatorial explosion. Even for just the 182 first-order descriptors in Eq. (1), the number of possible group combinations for a maximum molecule size of 15 descriptors grows combinatorially large. Moreover, current representations of molecular structures in CAMD are plagued by redundancy, where each molecule is not given by a single unique representation [56]. These challenges are further compounded by the inherent complexity of synthesis and the cost of production. To address these barriers, many researchers have recognized the need for strategy-oriented techniques, ranging from exploiting the problem structure to altering its formulation. Early efforts involved exploiting the problem structure to implement exact optimization algorithms such as the branch-and-reduce algorithm [57], [58]. A noteworthy exploitation involved branching on molecular descriptors and property values to efficiently compute all feasible solutions using a single branching tree in conjunction with a feasibility pre-solver [59]. Other attempts rely on the application of reformulation techniques and the development of multi-stage frameworks to sidestep nonlinearity and nonconvexity while guarding against the exclusion of possible optimal solutions from the design space [60], [61].
Developments in this direction include the extension of a decomposition method [56] through the simultaneous consideration of first-order and second-order functional groups, illustrating the conservation of the original feasible space [61]. Other improvements came about through the introduction of novel constraints [59], [62], [63], and the application of reformulation methods using linearization techniques [56], [64], [65] and convexified terms [60], [66].

Derivative-Free Optimization
When the chemical design space is too large to be handled by exact optimization algorithms, an intuitive alternative to knowledge-based frameworks is a high-level search strategy that guides the search process. In the absence of an algebraic form of the optimization problem, this class of black-box methods searches the design space either stochastically, using adaptive operations, or deterministically, by following a specific set of operations. Stochastic methods, or metaheuristics, apply high-level selection strategies to sample the chemical space, evaluating the objective function for each sample point and exploiting the accumulated search experience to identify regions of the chemical space with high-quality molecules [9]. Alternatively, given a specific starting point, deterministic methods evaluate molecules with fixed procedures and arrive at the same final molecule [67]. Within these fixed procedures, different variants of local and global search methods are used, such as trust-region methods [68] and SNOBFIT [69]. The two subclasses of derivative-free optimization (DFO) methods can be applied both directly, without an underlying model, and indirectly, in connection with a surrogate model that offers derivative approximations [70]. In CAMD, many DFO approaches have been applied to optimize the generation of molecular structures. Within the stochastic subclass of methods, natural selection-based genetic algorithms are by far the most commonly used metaheuristic in CAMD, with applications that span from encoding molecular structures to optimizing discrete variables in integrated product and process design problems [71]-[74]. Moreover, Tabu search algorithms have proven highly effective in molecular design problems such as metal catalysts, ionic liquids, and polymers [75]-[78]. In contrast, deterministic methods have only been adopted recently and to a limited extent [67], [79].
A study comparing 27 DFO algorithms on the problem of mixture design found global and surrogate-based methods to perform better than other solution strategies [67]. However, it should be noted that the nature of DFO algorithms causes the quality of solutions to be highly dependent on the search space, time limits, and hyperparameters that govern the search. Further, the developed models tend to be applicable only to a specific class of molecular design problems [70], [80].
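The flavor of a stochastic DFO search can be conveyed with a deliberately minimal metaheuristic, a random-restart hill climb over group-count vectors. This is far simpler than the genetic and Tabu search algorithms cited above, and the objective, groups, and contributions are hypothetical; it only illustrates how such methods trade guarantees for scalability by sampling and locally improving candidates:

```python
# Sketch of a stochastic derivative-free search over group-count vectors:
# random restarts + steepest-ascent hill climbing on a black-box score.
# Group contributions and the target value are hypothetical.
import random

CONTRIB = [0.55, 0.42, 1.30]              # hypothetical group contributions
TARGET = 2.0                              # desired property value

def score(counts):
    """Black-box objective: closer to the target property is better."""
    est = sum(c * n for c, n in zip(CONTRIB, counts))
    return -abs(est - TARGET)

def neighbors(counts, max_n=3):
    """All vectors reachable by changing one group count by +/-1."""
    for i in range(len(counts)):
        for delta in (-1, 1):
            n = counts[i] + delta
            if 0 <= n <= max_n:
                yield counts[:i] + (n,) + counts[i + 1:]

def hill_climb(restarts=20, seed=0):
    rng = random.Random(seed)             # fixed seed: reproducible search
    best = None
    for _ in range(restarts):
        cur = tuple(rng.randint(0, 3) for _ in CONTRIB)
        while True:                       # steepest ascent to a local optimum
            best_nb = max(neighbors(cur), key=score)
            if score(best_nb) <= score(cur):
                break
            cur = best_nb
        if best is None or score(cur) > score(best):
            best = cur
    return best

print(hill_climb())
```

As the surrounding text notes, the returned solution depends on the restart budget and the random seed: each restart may terminate in a different local optimum, and only the best one found is reported, with no optimality guarantee.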
Decomposition Methods
The combinatorial nature of the design space in CAMD, coupled with nonlinearities in property models, renders many CAMD problems unamenable to standard optimization algorithms. A well-known contributor to the success of solving many challenging combinatorial optimization problems is the application of decomposition methods [81]. This is also true for the CAMD MINLP problem, for which many problem-specific decomposition approaches have been developed. This family of methods decomposes the original optimization problem into easier subproblems that are solved either sequentially or iteratively. In the sequential approach, the optimization problem is constructed by the sequential addition of constraints of increasing complexity. The iterative approach, on the other hand, produces feasible solutions by approximating the constraints of subproblems and iterating between the subproblems to compute upper and lower bounds on the original problem [4]. These approaches make the MINLP problem easier to solve and often provide optimality guarantees for convex problems [82]. While most decomposition approaches in the literature have been dedicated to the more complex problems of mixture and integrated process/product design [83]-[85], single-molecule design remains a starting point and a building block for the other classes of problems [67], [82]. Although decomposition frameworks have achieved success in solving problems of significance, the interdependence between the molecular design level and the product/process design level is often oversimplified [86]. In practice, there is a complex nonlinear relationship between molecules and process performance, with the optimal solution corresponding to an intricate trade-off between different molecular properties and process variables. In such an interlinked system, sequential or iterative decomposition of molecular design problems may be suboptimal [83], [87].
A more detailed relationship between molecular design (solvent) and process design (crystallization-based chemical process) has been reported recently [88].
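The sequential decomposition idea can be sketched as two stages: a cheap molecular-level subproblem that screens candidates against property bounds, followed by a more expensive process-level evaluation applied only to the survivors. All numbers, group contributions, and the stand-in "process model" below are hypothetical placeholders:

```python
# Sketch of a sequential decomposition for CAMD:
# Subproblem 1 (molecular level): enumerate group-count vectors and keep
#   those satisfying cheap property bounds.
# Subproblem 2 (process level): evaluate a costlier process objective on
#   the survivors only. All values here are hypothetical.
from itertools import product

CONTRIB = {"CH3": 0.55, "CH2": 0.42, "OH": 1.30}   # hypothetical QSPR

def molecular_subproblem(lo=1.5, hi=2.5, max_n=3):
    feasible = []
    for counts in product(range(max_n + 1), repeat=3):
        est = sum(c * n for c, n in zip(CONTRIB.values(), counts))
        if sum(counts) > 0 and lo <= est <= hi:
            feasible.append((counts, est))
    return feasible

def process_subproblem(candidates):
    # stand-in for a costly process simulation: prefer candidates close
    # to the property target, with a small penalty on molecule size
    def cost(counts, est):
        return abs(est - 2.0) + 0.05 * sum(counts)
    return min(candidates, key=lambda ce: cost(*ce))

survivors = molecular_subproblem()
best = process_subproblem(survivors)
print(len(survivors), best)
```

This sketch also exposes the limitation discussed above: because the molecular screen is decided before the process objective is seen, a molecule slightly outside the property window that would excel at the process level is discarded, which is exactly why sequential decomposition can be suboptimal for strongly interlinked systems.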
CAMD Tools and Applications
CAMD Tools
There exist several databases and software packages for CAMD [89]. A simple approach for providing fast solutions to single-molecule design problems is database search, either with listed target properties or with the help of efficient QSPR methods [90]. There is an increasing number of chemical databases that contain molecular structures in machine-readable formats along with some molecular and property data. Examples include GDB-17 [91], the largest publicly available database with more than 166 billion compounds, and PubChemQC [92], which has around four million compounds with property data generated from DFT calculations. As database search solutions may be suboptimal, violate other process and mixture constraints, and exclude novel molecules, a superior class of tools is referred to as model-based design, in which a problem-specific mathematical optimization model is formulated and solved. Leading packages include ProCAMD [90], AMODEO [56], and OptCAMD [34]. This class of tools is highly versatile across a large variety of problems, for which it can provide globally optimal solutions, but such tools still suffer from drawbacks related to the size and complexity of molecules, the lack of property models for some molecules and products, and the accuracy of available property models [89].
Applications
A vast array of CAMD applications has grown over the past few decades, spanning different problem scales, property estimation methods, and solution techniques. Progressively, applications have played an integral role in the development of QSPR models, solution algorithms, and frameworks. Here, we give an overview of some popular applications that have attracted much interest from the CAMD community. It is worth noting that CAMD efforts have focused on major applications in the chemical, environmental, and pharmaceutical industries, such as liquid-liquid extraction, CO2 capture, polymer design, and reaction and product solvents. The development of single-molecule design applications started in the early 1980s, initially with generate-and-test approaches that were later extended to include improved screening methods and mathematical programming. In early examples of such advances, generate-and-test and mathematical programming approaches were applied to liquid-liquid extraction [93], [94]. The efficiency of mathematical programming-based and metaheuristic frameworks was illustrated in polymer design case studies [64], [80]. Later work further explored the more challenging variants of mixture design. Initial work in mixture design considered the feasibility problem for solvent design [4], [84], [95], followed by incorporating equations of state into optimization models for refrigerant mixture design [62]. Notably, recent efforts developed decomposition methods in conjunction with derivative-free optimization and quantum chemistry-based COSMO/COSMO-RS thermodynamics, demonstrating their efficiency on instances in solvent design, liquid-liquid extraction, and reaction solvents [40], [67], [79]. Although most CAMD methods and applications focus on single-molecule and mixture design, the solutions to these classes of problems are usually integrated into a process or a final product.
As such, an explicit relationship between a molecule and a process/product is necessary to optimize the process/product performance. The few resources that developed applications for the integrated problem predominantly used decomposition methods [83], [86], [96]–[98]. Integrated product/process design applications using mathematical programming [67], [99], [100] and metaheuristics [101]–[103] are also available. A brief list of references classified by design level and major application is given in Table 1.

Table 1.
A brief list of popular CAMD applications classified by design level and application.
Design Level | Application | References
Single-Molecule | Liquid-liquid extraction | Austin et al. [40], Brignole et al. [104], Diwekar and Xu [71], Gani and Brignole [93], Gani et al. [105], Gebreslassie and Diwekar [106], Harper et al. [107], Harper and Gani [55], Karunanithi et al. [82], Kim and Diwekar [108], Marcoulaki and Kokossis [109], Odele and Macchietto [94], Ourique and Telles [110], Scheffczyk et al. [111]
Single-Molecule | Polymer design | Brown et al. [112], Camarda and Maranas [60], Eslick et al. [77], Maranas [64], Pavurala and Achenie [113], Venkatasubramanian et al. [74], [80], Zhang et al. [61]
Single-Molecule | Reaction solvents | Wang and Achenie [114], Gani et al. [115], Folic et al. [116], [117], Struebing et al. [118], Zhou et al. [119]
Single-Molecule | Refrigerant design | Churi and Achenie [120], Duvedi and Achenie [57], Gani et al. [105], Joback et al. [121], Marcoulaki and Kokossis [109], Ourique and Telles [110], Sahinidis et al. [59], Samudra and Sahinidis [122]
Mixture | Solvent design | Austin et al. [40], [67], [79], Buxton et al. [99], Conte et al. [123], Duvedi and Achenie [62], Gani and Fredenslund [95], Karunanithi et al. [82], Klein et al. [84]
Integrated Process/Product | CO2 capture | Bardow et al. [83], Burger et al. [100], Gopinath et al. [86], Lampe et al. [96], Pereira et al. [124], Stavrou et al. [97]
Integrated Process/Product | Gas absorption | Buxton et al. [99], Papadopoulos and Linke [103], Bommareddy et al. [98], Zhou et al. [72]

Deep Learning for Molecular Design
An ever-increasing volume of research has been reported on applying deep learning models to molecular generation and property prediction. To address CAMD problems, generative models are used in conjunction with predictive QSPR models that relate learned feature representations of molecular descriptors to target chemical, physical, or biological properties of structures. In addition to outperforming other machine learning methods in property prediction, deep learning has recently demonstrated the capability to produce property predictions comparable to DFT calculations [18], [20]. Thus, the focus of this section is on generative models and their synergy with predictive models for molecular design and optimization. A high-level description of popular architectures for generative models in the literature is displayed in Figure 2.
Figure 2.
High-level diagrammatic description of common architectures in the literature. The section starts by reviewing molecular representations used as input to models. The core of the section is devoted to a comparative and contrastive summary of published generative models and methods that hold the most potential for integrating molecular generation with property prediction. A brief discussion of evaluation metrics and benchmarking platforms concludes the section.
Molecular Representations
Central to the learning process is the digital encoding of an expressive molecular structure representation that captures the structural information about a molecule using a specific format and definite rules. The selection of the method that translates structural information into a machine-readable format is termed featurization or feature engineering. In the context of molecular design and discovery, popular representations fall into two classes: 2-dimensional (2D) and 3-dimensional (3D). The suitable selection of a molecular representation requires insight into the specific problem and the intended machine learning algorithm. However, the best-performing representation for a given learning task or algorithm is not always apparent, remaining an open research question in cheminformatics [125]. Figure 3 illustrates a few commonly used molecular representations, including the Coulomb matrix [126], fingerprints [127], InChI [128], SMILES [129], Junction-Tree [130], and molecular graph convolutions [131].
Figure 3.
A few commonly applied molecular representations in deep learning, illustrated for methanol.

Before shedding light on the different molecular representation methods, it is imperative to highlight the key invariances that certain representations capture: permutation, translational, and rotational. Permutation invariance allows the representation to remain unchanged under a reordering, or permutation, of atoms. Representations unaltered by translational and rotational operations in the chemical space satisfy the other two invariances [132]. Alongside these invariances, two further properties are desirable in machine-readable representations: uniqueness and invertibility. Whereas uniqueness is fulfilled when each molecule is represented by a single expression, invertibility goes in the opposite direction and is satisfied by a one-to-one mapping between representations and their corresponding molecules. Since the generative power of deep learning is fundamental to the problem of molecular design, most implementations use invertible representations [133]. A wealth of publicly available compound databases is at the disposal of researchers undertaking molecular design. Table 2 provides a list of public databases of molecules, together with their sizes and downloadable representations. It is noteworthy that the molecular representations provided by these public databases can be validated and translated into different molecular representations using several software tools and packages, such as RDKit [134], OpenBabel [135], and OpenEye [136]. The sources of the property values included in these databases range from text-mined experimental data and pseudo-experimental DFT calculations to predicted values with varying degrees of confidence. Thus, it is critical to maintain a degree of skepticism about the accuracy of property values and to factor in uncertainties and underlying sources.
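Permutation invariance can be illustrated with a toy example: a composition-based descriptor (atom counts) is unchanged when the atom list is reordered, while a naive order-dependent encoding is not. The atom lists below are illustrative, not a real cheminformatics encoding:

```python
from collections import Counter

# Toy illustration of permutation invariance: two orderings of the same
# (hypothetical) atom list for ethanol, C2H6O.
ethanol = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
permuted = ["O", "H", "C", "H", "H", "C", "H", "H", "H"]

# A composition descriptor (atom counts) is permutation-invariant ...
assert Counter(ethanol) == Counter(permuted)

# ... while a naive order-dependent string encoding is not.
assert "".join(ethanol) != "".join(permuted)

print(dict(Counter(ethanol)))  # {'C': 2, 'O': 1, 'H': 6}
```

A representation built only from counts is invariant but not invertible (many molecules share one formula), which is why generative models favor richer, invertible encodings.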
Popular 2D and 3D molecular representations and their relative merits are reviewed in the following subsections.
Table 2.
A list of representative publicly available chemical databases including descriptions of their content, the number of compounds, and available molecular representations.
Database | Content | Size | Representations | Ref.
GDB-17 | Enumeration of small organic molecules up to 17 atoms of C, N, O, S, and halogens. | 166,400,000,000 | SMILES | [91]
ZINC15 | Commercially available compounds. | >750,000,000 | SMILES, InChI, InChI Key, 2D SDF, 3D SDF | [137]
SureChEMBL | Compounds extracted from patent documents. | 17,000,000 | SMILES, InChI Key | [138]
eMolecules | Commercially available compounds. | 5,900,000 | SMILES, 2D SDF | [139]
PubChemQC | Compounds with quantum chemistry-estimated properties based on DFT methods. | >3,500,000 | SMILES, InChI | [92]
ChEMBL | Bioactive molecules with drug-like properties. | 1,900,000 | SMILES, InChI, InChI Key, 2D SDF | [140]
SuperNatural | Natural molecules with physicochemical properties and toxicity class. | 325,508 | SMILES | [141]
2D Representations
Learning-based molecular design models built on 2D representations have enjoyed remarkable success, although such representations entail the loss of conformational and bond-distance information [142]. As described in the previous section, using chemical graph theory to represent molecules is a natural way to apply the principles of bonding and valence rules, yet the most popular representation in the data-driven molecular design literature is the string-based SMILES [128], [133]. Similar to most graph methods, SMILES strings are not unique representations. This non-uniqueness can be resolved by converting a standard SMILES into its canonical form; the non-uniqueness itself, however, has proven quite useful as a data augmentation technique for machine learning [143]. Another common string-based representation is the IUPAC International Chemical Identifier (InChI) [144]. It is reported that using InChI in a variational autoencoder-based approach led to substantially inferior performance compared to SMILES in the learning task, due to InChI's more complex syntax that involves counting and arithmetic [25]. Recent work showed that a more meaningful chemical latent space is learned by translating between InChI and SMILES using variational autoencoders [145]. A typical preprocessing step for string-based representations is converting them to one-hot encodings or molecular descriptors [25], [146]. In molecular generation, achieving a perfect rate of valid reconstruction for SMILES from the latent space remains an open problem, despite the availability of well-developed deep learning architectures for various sequence generation applications, such as music synthesis and natural language processing [142]. Many other graph representations, such as 2D images, tensors, and MACCS keys, have been applied to molecular generation with varying success rates.
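The one-hot preprocessing step mentioned above can be sketched as follows; the character vocabulary here is illustrative, not a complete SMILES alphabet:

```python
# Minimal sketch of the one-hot preprocessing step for string representations:
# each SMILES character becomes a 0/1 vector over a fixed alphabet.

smiles = "CCO"  # ethanol
alphabet = sorted(set("CNOclnos()[]=#123456789"))  # illustrative vocabulary
index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot(s, index):
    """Encode a string as a list of one-hot vectors (len(s) x |alphabet|)."""
    vectors = []
    for ch in s:
        v = [0] * len(index)
        v[index[ch]] = 1
        vectors.append(v)
    return vectors

encoded = one_hot(smiles, index)
print(len(encoded), sum(encoded[0]))  # 3 rows, exactly one active bit per row
```

In practice the resulting matrix is padded to a fixed maximum length before being fed to a recurrent or convolutional model.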
Inspired by the success in image classification of Google's Inception-ResNet deep convolutional neural network (CNN) [147], a deep CNN, Chemception, was developed for property prediction using image data of 2D drawings of molecules encoded in single-channel greyscale [148]. Also, 2D image embeddings combined with standard SMILES were used as input to a heteroencoder, resulting in a better latent space compared to SMILES and canonical SMILES [143]. Operating in the space of graphs through the use of vectors and tensors to encode adjacency and connectivity information has been shown to be a promising alternative. Several successful implementations of tensor-based representations have achieved perfect reconstruction validity and high rates of novelty for small molecules [149], [150]. It should be noted that more compact representations are applicable for small datasets and specific classes of molecules in molecular generation. Molecules conditioned on gene expression profiles were generated using the 166-bit MACCS keys, with a valid decoding rate below 10% [151], [152]. It is also worth noting that such compact representations are better suited for regression tasks, with many successful implementations in property prediction using bags of bonds, fingerprints, Coulomb matrices, and molecular descriptors [146], [153]–[155].
3D Representations
Owing to the variance of 3D representations under translational, rotational, and permutation operations, describing molecules in 3D space might not be best suited for generative models [142]. One approach expresses molecules on a 3D grid using voxels with localized channels for distinct atom types and nuclear charge information. This approach encounters the challenges of high dimensionality and sparsity, leading to high complexity and poor performance of the generative model [133]. To ease the computational effort and avoid sparsity, a wave transformation-based approach was proposed, replacing each atom with concentric waves diverging from its center using the wave transform kernel [156]. Another key contribution on this front is a novel, though not invertible, tensor field neural network framework that is equivariant to 3D rotations, translations, and permutations of the 3D coordinates of points [132], [157]. The challenges and limitations of this class of representations underline the need for research into efficient, invertible 3D representations for molecular design.
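A minimal sketch of the voxel idea, assuming hypothetical water-like atom coordinates (in angstroms) and a coarse grid; note that only three of the many grid cells end up occupied, illustrating the sparsity problem described above:

```python
# Sketch of voxelizing a molecule in 3D: atoms (hypothetical coordinates) are
# binned onto a coarse grid, with one "channel" per element type.

atoms = [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)]
resolution = 0.5            # grid spacing in angstroms
origin = (-2.0, -2.0, -2.0)  # corner of the bounding box

# Sparse stand-in for a dense (channel, i, j, k) occupancy tensor.
voxels = {}
for element, x, y, z in atoms:
    i = int((x - origin[0]) / resolution)
    j = int((y - origin[1]) / resolution)
    k = int((z - origin[2]) / resolution)
    voxels[(element, i, j, k)] = 1

# A dense 8x8x8 grid per channel would hold 512 cells; only 3 are occupied.
print(sorted(voxels))
```

The dense tensor a generative model must reproduce is almost entirely zeros, and the encoding is neither rotation- nor translation-invariant, which motivates the wave-transform and equivariant-network alternatives cited above.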
Deep Learning Architectures for Molecular Design
Recurrent Neural Networks (RNNs)
RNNs are a prevalent class of deep learning models for sequence generation, constituting a building block of other generative deep learning architectures. Equipped with parameter-sharing and graph-unrolling schemes, this architecture is capable of mapping relationships and dependencies between molecular character sequences of arbitrary lengths by introducing a notion of time, or order, to the model [158]. Arbitrary-length sequences are handled by incorporating a hidden state whose activation depends on that of the previous element in the molecular sequence. For generative models in molecular design, an RNN produces a probability distribution over the next character of the sequence based on its current hidden state, where the last character in the sequence is a special termination value. For a given sequence s = (s_1, s_2, \ldots, s_T), the probability of the sequence can be expressed as:

p(s) = p(s_1)\, p(s_2 \mid s_1) \cdots p(s_T \mid s_1, \ldots, s_{T-1}) = \prod_{t=1}^{T} p(s_t \mid s_1, \ldots, s_{t-1})    (10)

For N training sequences, a model parameterized by \theta learns the probability distribution over the sequences by maximizing the log-likelihood on the sequence space:

\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_\theta\left(s_t^{(n)} \mid s_1^{(n)}, \ldots, s_{t-1}^{(n)}\right)    (11)

The vanishing and exploding gradient problems tend to arise when training RNNs to capture long-term dependencies, making the application of gradient-based optimization unsuitable [159]. A dominant approach to alleviating this issue is the development of more sophisticated activation functions that involve gating mechanisms for element-wise nonlinearities following affine transformations [160]. Attempts in this direction have resulted in two types of recurrent units that have been shown to perform well for sequence modeling: the long short-term memory (LSTM) unit and the more recent gated recurrent unit (GRU) [161], [162].
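As a concrete illustration of the chain-rule factorization in Eq. (10), the sketch below scores a short SMILES-like sequence under a toy next-token model; the conditional probability table is hypothetical, not a trained RNN:

```python
import math

# Eq. (10) as code: the probability of a sequence factorizes over next-token
# conditionals. The table below maps a prefix to p(next token | prefix) and is
# purely illustrative; an RNN would compute these from its hidden state.
cond = {
    (): {"C": 0.6, "O": 0.3, "<end>": 0.1},
    ("C",): {"C": 0.5, "O": 0.3, "<end>": 0.2},
    ("C", "C"): {"O": 0.7, "<end>": 0.3},
    ("C", "C", "O"): {"<end>": 1.0},
}

def log_prob(sequence, cond):
    """Sum of log p(s_t | s_1..s_{t-1}), terminated by a special <end> token."""
    total = 0.0
    for t, token in enumerate(list(sequence) + ["<end>"]):
        total += math.log(cond[tuple(sequence[:t])][token])
    return total

lp = log_prob("CCO", cond)
print(round(math.exp(lp), 3))  # p("CCO") = 0.6 * 0.5 * 0.7 * 1.0 = 0.21
```

Summing such log-probabilities over a training set and averaging gives exactly the objective in Eq. (11).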
For the task of molecular generation, several papers have demonstrated the potential of LSTM-based RNNs, reporting validity rates of generated structures as high as 97.7% [143], [163], [164]. A remarkable outcome was also achieved using a three-layer GRU-based RNN, with 94% of the SMILES generated by the prior corresponding to valid molecules [165]. Despite such successes, it has been suggested that the introduction of an external stack is essential for the generation of valid SMILES, owing to the limited counting ability of LSTM and GRU units [26]. Instead, that work employs a stack-augmented RNN (Stack-RNN) that defines a new memory cell on top of GRU recurrent units to better infer algorithmic patterns [166]. Under the Stack-RNN generative model, 95% of SMILES were found to be valid, compared to an 86% validity rate for the same RNN architecture excluding the stack memory. Such differences in validity rates may result from a variety of factors, including different databases, sampling approaches, and molecular validation tools. Several property prediction models have been implemented in conjunction with SMILES-based RNN architectures. Two frameworks compared several predictive algorithms for the property prediction task, reporting the Support Vector Machine (SVM) and Gradient Boosting Trees (GBT) as the best classifiers of Dopamine receptor D2 activity and half-maximal inhibitory concentration (pIC50), respectively [164], [165]. Deep learning models are also commonly applied for the predictive task. With a SMILES string as the input vector, the ReLeaSE framework uses a multilayer neural network connected to an embedding recurrent LSTM layer for predicting four properties [26]. Generative and predictive models are coupled under a reinforcement learning system for optimizing target properties of generated molecules, which will be covered in the Reinforcement Learning subsection.
Autoencoders
An autoencoder (AE) is a multilayer neural network trained to recover its input as its output by means of an internal hidden layer describing a code of the input. Connected through the code layer, AEs consist of two neural networks serving as an encoder and a decoder. In molecular generation, the encoder translates each representation, e.g., a SMILES string, into a fixed-dimensional vector, while the decoder performs the opposite stochastic mapping operation [158]. These two networks aim to learn the identity function, whereas the role of the latent representation is to induce the networks to learn a reduced representation that captures the most salient and descriptive information of the input [167]. By exploring theoretical relationships between the latent space and AEs [168], the formulation of the variational autoencoder (VAE) was proposed in 2014 and first applied to molecular design in 2017 [167], [169]. Later developments in the VAE formulation included new models for the semi-supervised VAE (SSVAE), which incorporates approximate Bayesian inference with advances in variational methods to improve the quality of the generative modeling approach [170]. Another dominant class of probabilistic autoencoders is the adversarial autoencoder (AAE), which employs the GAN framework as a variational inference algorithm for latent variables [171], [172].
Variational Autoencoders
VAEs are latent variable models comprising latent variables z^{(i)} drawn from a prior p(z) and fed into a decoder p_\theta(x \mid z). The central idea behind VAEs is to sample values of the latent variable that are likely to have generated x and to use only such values in computing the probability of x in the training set. To perform this task, a new encoding function q_\phi(z \mid x) is needed to provide a distribution over z values that are likely to return x. Training is done by maximizing the variational lower bound given by Eq. (12), which allows the loss function to be written as in Eq. (13):

\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)    (12)

\mathcal{L}_{loss}(\theta, \phi; x) = -\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)    (13)

where the first term on the right-hand side of Eq. (13) is the expected reconstruction loss and the Kullback–Leibler divergence term serves as a regularizer. VAEs have proven to possess a very powerful generalization ability due to the stochasticity encoded within the learning method, describing molecules as continuous probability distributions instead of discrete fixed points. This is particularly advantageous for the problem of molecular design, as the probabilistic nature of the formulation forces the latent space to have robust and diverse representations. Further, this feature is especially useful for constructing open-ended spaces of chemical compounds, which allow not only the generation of new molecules but also interpolation between existing ones [25]. Early VAE implementations for molecular generation have suffered from low validity rates of the SMILES output by the decoder, suggesting underlying issues within the latent space [25], [169]. The character VAE (CVAE) and grammar VAE (GVAE) reported valid decoding rates of 0.7% and 7.2%, respectively [169].
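The Kullback–Leibler regularizer in the VAE loss has a well-known closed form when the approximate posterior is a diagonal Gaussian and the prior is a standard normal. The short sketch below (with arbitrary illustrative latent values) evaluates it:

```python
import math

# For q(z|x) = N(mu, diag(sigma^2)) and prior p(z) = N(0, I), the KL term in
# Eq. (13) reduces to: 0.5 * sum_i (mu_i^2 + sigma_i^2 - log sigma_i^2 - 1).
def kl_to_standard_normal(mu, sigma):
    return 0.5 * sum(
        m * m + s * s - math.log(s * s) - 1.0 for m, s in zip(mu, sigma)
    )

# Illustrative latent codes (values are arbitrary):
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0: q equals the prior
print(kl_to_standard_normal([1.0, -1.0], [1.0, 1.0]))  # 1.0: mean shift only
```

The term is zero exactly when the encoder's output matches the prior, which is what pushes the latent space toward the smooth, densely packed structure exploited for interpolation.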
Another VAE model resulted in a decoding rate of 4% for randomly sampled points from the latent space and rates ranging from 73% to 79% when sampling around known molecules. The introduction of syntax-directed translation to the SMILES used as input to the VAE increased the valid decoding rate to 43.5%. VAEs with graph representations as input have shown much better performance, with several sources reporting up to 100% validity of generated molecules using hypergraph grammar, constrained graphs, and graph-to-graph translation [173]–[175].

Supervised/Semi-Supervised Autoencoders
A unique advantage that SAE/SSAE models offer is the ability to connect property prediction to the molecular generation paradigm and conditional sampling [176]. This type of architecture suits the CAMD problem by working with partially labeled datasets in which target properties are available only for a subset of molecules. To incorporate chemical properties as an output variable y, a generative semi-supervised model can be adopted under the assumption that the target properties follow a Gaussian distribution [170], [176]. Since the exact form of the posterior distribution is intractable, an approximate form is used to estimate the posterior distributions over y and z. The variational lower bound for labeled data is given as follows:

\log p_\theta(x, y) \geq \mathbb{E}_{z \sim q_\phi(z \mid x, y)}\left[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z \mid x, y)\right]    (14)
For molecules with an unobserved label, the target property y is treated as a latent variable over which posterior inference is performed, resulting in the variational bound below:

\log p_\theta(x) \geq \mathbb{E}_{(y, z) \sim q_\phi(y, z \mid x)}\left[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(y, z \mid x)\right]    (15)

Under this class of models, it is possible to generate molecules with target properties by sampling from a conditional generative distribution given a set of desired properties, obviating the need for any optimization procedure. A leading result was reported by an SSVAE model that achieves a >99% validity rate of generated SMILES, with more than 92% of the SMILES being unique [176]. It is noteworthy that the same concepts, with slight variations, can be followed to construct a semi-supervised AAE (SSAAE), which was adopted recently for generating molecules that optimize the half-maximal inhibitory concentration [177]. We anticipate that this class of models will be most promising in constructing an extensive chemical latent space owing to its ability to learn meaningful disentanglements of data [178].

Adversarial Autoencoders
AAEs are similar to VAEs, the only difference being the regularization term: the AAE replaces the Kullback–Leibler divergence term with an adversarial training procedure that imposes a prior distribution on the code vector of the AE [171]. Building on the GAN framework, the adversarial process simultaneously trains two neural networks: a generative model, G, that captures the data distribution and generates new samples, and a discriminative model, D, that distinguishes between the prior distribution and the encoding produced by G [172]. The overall loss function, combining the adversarial loss for the discriminator and the reconstruction loss, can be expressed as:

\mathcal{L}(\theta, \phi) = -\mathbb{E}_{x \sim p_d(x)} \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathbb{E}_{z \sim p(z)}\left[\log D(z)\right] - \mathbb{E}_{x \sim p_d(x)}\left[\log\left(1 - D(q_\phi(z \mid x))\right)\right]    (16)

Only a few implementations of AAEs have been reported in the literature, with severe difficulties in the adversarial training process, even for small datasets [179]. Common across all AE-based models, differentiable predictive models can be trained for property prediction using continuous latent representations that correspond to a subset of molecules. This enables gradient-based optimization methods to navigate the chemical design space toward optimizing a given objective of target properties, avoiding the complications induced by the discrete nature of the chemical space [25]. A special algorithm with a property-oriented decoder was able to identify molecules with property values 121% higher than those found by Bayesian optimization and reinforcement learning [180]. Similar results are reported for the molecular hypergraph grammar variational approach (MHG-VAE) for molecular optimization when target function evaluations are limited, outperforming reinforcement learning and the GAN-based Graph Convolutional Policy Network (GCPN) in terms of computational cost [173], [181].
Generative Adversarial Networks (GANs)
GANs constitute another class of generative models that construct a latent space to simplify molecular representations into a compressed representation shared across the molecular domain. Serving as the theoretical basis of the adversarial training idea introduced earlier for AAEs, in molecular design the framework pits a generative neural network against an adversarial neural network that aims to discriminate between the distribution of generated molecules and the original molecules used for training [172]. For a generative model, G, and a discriminative model, D, the theoretical minimax objective function is expressed as:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]    (17)
Optimizing the above function is done iteratively and with some alterations to the function to provide better gradients early in learning [172]. Also, GANs offer much more flexibility than AEs in defining the objective function, using the Jensen-Shannon divergence, f-divergences, or a combination of them [182]. However, training GANs is characterized by instability and delicacy of parameters, as it requires locating a Nash equilibrium of a non-convex game with continuous, high-dimensional parameters [182], [183]. Though GANs remain a current research focus, improvements to training stability and sample quality have been proposed and implemented for molecular generation, such as Wasserstein GAN (WGAN) and SeqGAN [183], [184]. Several extensions to the GAN framework have been adopted to generate molecules, with limitations associated with the generation of valid SMILES representations. An Objective-Reinforced GAN (ORGAN) model has a molecule validity rate that fluctuates from 0.01% to 99.8%, an outcome attributed to the sparsity and roughness of the chemical space, which leads to poor generator performance [185]. Among other issues in adversarial training settings, the vanishing gradient problem arises from the convergence of the minimax objective to a saddle point, where the gradient of the generator vanishes given a discriminator that perfectly labels the real and generated data. In an attempt to escape saddle points, an RNN-based differentiable neural computer (DNC) with access to an external memory cell was implemented, leading to a 76% rate of valid SMILES [186], [187]. A parallel validity rate was achieved by an adversarial threshold neural computer (ATNC) structure, which acts as a filtering agent between the generator network and the discriminator and reinforcement learning networks [188].
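The link between the minimax objective and the Jensen-Shannon divergence can be checked numerically: for a fixed generator, the optimal discriminator is D*(x) = p_d(x) / (p_d(x) + p_g(x)), and the value of Eq. (17) at D* equals -log 4 + 2·JSD(p_d || p_g). The discrete distributions below are illustrative stand-ins for the data and generator distributions over a tiny sample space:

```python
import math

# Illustrative data and generator distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_gen = [0.2, 0.3, 0.5]

def value_at_optimal_d(p_d, p_g):
    """Eq. (17) evaluated at the optimal discriminator D*(x) = pd/(pd+pg)."""
    v = 0.0
    for a, b in zip(p_d, p_g):
        d_star = a / (a + b)
        v += a * math.log(d_star) + b * math.log(1.0 - d_star)
    return v

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda p1, p2: sum(a * math.log(a / b) for a, b in zip(p1, p2))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

lhs = value_at_optimal_d(p_data, p_gen)
rhs = -math.log(4.0) + 2.0 * jsd(p_data, p_gen)
print(abs(lhs - rhs) < 1e-12)  # True: the identity holds numerically
```

When p_g matches p_data, the JSD vanishes and the value saturates at -log 4, which is exactly the saddle-point regime where the generator's gradient dies out.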
A more effective and stable approach combined the SeqGAN framework for sequential data with the WGAN loss function, obtaining a valid SMILES rate of 80.3% [189]. Graph-based GAN implementations have demonstrated near-perfect validity of generated molecules. An implicit GAN and reinforcement learning framework for molecules achieved validity rates of up to 99.8%. However, this approach attains a maximum unique-molecule rate of 3.2%, alluding to the issue of mode collapse, wherein the generator rotates through a small set of molecules that seem most plausible to the discriminator [150], [190], [191]. Further, on the property optimization front, superior performance was reported with the Graph Convolutional Policy Network (GCPN) [181]. Yet, two issues are often encountered with graph-based GANs: limited diversity in the generated molecules and handling the graph isomorphism problem.

Reinforcement Learning
Reinforcement learning has been applied in the literature to generate molecules with desirable properties and to fine-tune the performance of the GAN generator [189]. The framework requires a learning agent to learn a stochastic policy that maps states to action probability vectors so as to maximize a reward [192]. Commonly applied reinforcement learning approaches in CAMD include the two main model-free routes to learning an optimal policy: policy search methods and value function methods [165]. Reinforcement learning embodies a natural environment for a more powerful generate-and-test approach to the CAMD problem. SMILES with optimized properties were generated by a partially observable Markov Decision Process (MDP) RNN using policy search methods. The use of policy search methods is argued to be more intuitive, as it can start from a prior sequence model as an initial policy, requiring shorter episodes and leading to an optimal policy [165]. A comparative study for optimizing molecules towards property targets was carried out between several policy-based methods: Proximal Policy Optimization (PPO), REINFORCE, a hybrid advantage actor-critic method (A2C), and Hillclimb-MLE [163], [193]–[195]. The study ranked the reinforcement learning methods based on 19 benchmarks for molecular generation and design, suggesting that the Hillclimb-MLE method outperformed the rest given sufficient computational time and sample evaluations [163]. In contrast to model-free methods, a few model-based reinforcement learning methods coupled with adversarial training have been implemented. A notable model-based implementation is ORGAN, discussed in the GAN section [185]. The demonstrated success of the ORGANIC framework paved the way for further efforts to build on the framework and its use for benchmarking new GAN-based reinforcement learning approaches [150], [187], [188], [190].
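As a minimal sketch of the policy-search idea (a deliberately simplified stand-in for sequence-level molecular generation, with hypothetical property-based rewards), the REINFORCE update below nudges a softmax policy toward the highest-reward action:

```python
import math
import random

# Toy REINFORCE (model-free policy search): a softmax policy over three
# discrete "actions" learns to favor the one with the highest reward.
random.seed(0)
theta = [0.0, 0.0, 0.0]    # one logit per action
rewards = [0.1, 1.0, 0.3]  # hypothetical rewards, e.g., property scores

def softmax(logits):
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

learning_rate = 0.1
for _ in range(2000):
    probs = softmax(theta)
    action = random.choices(range(3), weights=probs)[0]
    reward = rewards[action]
    # REINFORCE: grad of log pi(action) is one_hot(action) - probs,
    # scaled here by the raw reward (no baseline, for simplicity).
    for i in range(3):
        grad = (1.0 if i == action else 0.0) - probs[i]
        theta[i] += learning_rate * reward * grad

print(max(range(3), key=lambda i: softmax(theta)[i]))
```

In molecular generation, the "action" is emitting the next SMILES token or graph edit, and the reward is a property score computed on the finished molecule; adding a baseline or critic (as in A2C/PPO) reduces the variance of this update.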
Table 3.
A detailed list of published molecular generation and optimization works.
Acronyms: (AAE) Adversarial Autoencoder; (ATNC) Adversarial Threshold Neural Computer; (BI) Bayesian Inversion; (BL) Bayesian Learning; (BNN) Bayesian Neural Network; (BO) Bayesian Optimization; (CGVAE) Conditional Graph Variational Autoencoder; (CVAE) Conditional Variational Autoencoder; (DNN) Deep Neural Network; (DRD2) Dopamine Receptor D2; (GAN) Generative Adversarial Network; (GBT) Gradient Boosted Trees; (GCPN) Graph Convolutional Policy Network; (GGNN) Gated Graph Neural Network; (GP) Gaussian Process; (GVAE) Graph Variational Autoencoder; (HBA) number of hydrogen bond acceptors; (HBD) number of hydrogen bond donors; (HL) HOMO-LUMO gap; (JAK2) Janus kinase 2; (JT) Junction Tree; (LogP) partition coefficient; (MHG) Molecular Hypergraph Grammar; (MW) molecular weight; (pIC50) half-maximal inhibitory concentration; (POD) Property-Oriented Decoder; (QED) Quantitative Estimate of Drug-likeness; (RL) Reinforcement Learning; (RNN) Recurrent Neural Network; (RO5) Lipinski's rule of five; (SA) synthetic accessibility; (SGP) Sparse Gaussian Process; (SSAAE) Semi-Supervised Adversarial Autoencoder; (SSVAE) Semi-Supervised Variational Autoencoder; (SVM) Support Vector Machine; (Tm) melting temperature; (TPSA) topological polar surface area; (U) internal energy; (VAE) Variational Autoencoder.

Generative Model | Representation | Predictive Model | Predicted/Optimized Features | Ntrain | Database | Ref.
RNN | SMILES | RNN/RL | MW, LogP, HBD, HBA | 1,735,442 | ChEMBL | Neil et al. [163]
RNN | SMILES | DNN/RNN | Tm, LogP, pIC50, JAK2 | 1,500,000 | ChEMBL21 | Popova et al. [26]
RNN | SMILES | SVM | DRD2 | 1,500,000 | ChEMBL | Olivecrona et al. [165]
RNN | SMILES | GBT | pIC50 | 1,400,000 | ChEMBL | Segler et al. [164]
BL | Fragments | BI | HL, U | 60,000 | PubChem | Ikebata et al. [196]
VAE | SMILES | DNN | LogP, HBD, HBA, TPSA | 500,000 | ZINC | Lim et al. [197]
VAE | SMILES | DNN/GP | LogP, QED, SA | 358,000 | QM9/ZINC | Gómez-Bombarelli et al. [25]
VAE | Graph (embedded vectors) | BO/RL/POD | LogP, QED | 20,000 | ZINC/QM9 | Samanta et al. [180]
VAE + BNN | SMILES | BNN/BO | LogP, QED, SA, … | … | … | …
AAE (druGAN) | MACCS | NA | Predefined anticancer properties | 6,252 | HMS LINCS | Kadurin et al. [179]
CGVAE | Graph | GGNN | QED | 250,000 | QM9/ZINC/XEPDB | Liu et al. [174]
CVAE/GVAE | SMILES | BO | LogP | 250,000 | ZINC | Kusner et al. [169]
JT-VAE | Graph | SGP | LogP, SA, … | … | … | …
JT-VAE+GAN | Graph | GNN | LogP, QED, DRD2 | 250,000 | ZINC | Jin et al. [175]
MHG-VAE | Graph | GP | LogP, SA, … | … | … | …
SSVAE | SMILES | RNN | MW, LogP, QED | 310,000 | ZINC | Kang and Cho [176]
SSAAE | SMILES | Disentanglement | LogP, SA | 1,800,000 | ZINC | Polykovskiy et al. [177]
GAN | SMILES | RL | LogP, SA, QED | 5,000 | ZINC | Guimaraes et al. [189]
GAN (ORGAN) | SMILES | RL | MW, LogP, TPSA | 15,000 | ZINC/ChemDiv | Putin et al. [187]
GAN | SMILES | RL | Tm, QED, RO5 | 15,000 | GDB-17/ZINC | Sanchez-Lengeling et al. [185]
GAN | Grammar SMILES | NA | Predefined transcriptomic profile | 19,800 | L1000 CMap | Méndez-Lucio et al. [152]
GAN (ATNC) | SMILES | RL | MW, LogP, TPSA | 15,000 | ChemDiv | Putin et al. [188]
GAN (Cycle) | Graph | RL | LogP, rings | 250,000 | ZINC | Maziarka et al. [190]
GAN | Graph | RL | LogP, SA, QED | 133,885 | QM9 | Cao and Kipf [150]
GAN (GCPN) | Graph | RL | LogP, QED, MW | 250,000 | ZINC | You et al. [181]
Benchmarking in Molecular Design
Despite the extraordinary advances in deep learning models, their performance is not directly evident from CAMD results, especially for generative models, for which no clear-cut criteria exist for comparing generated molecules. Many researchers have observed that the emphasis on yielding impressive empirical results has not been matched by a parallel emphasis on empirical rigor [199]–[201]. It is argued that this creates a bias towards implementing newer approaches that are claimed to outperform classical ones [199]. Two large-scale studies evaluating the performance of new generative approaches found no evidence that newer approaches consistently score better than the original formulations when the latter are given sufficient hyperparameter tuning and random restarts [133], [201], [202]. As such, empirical rigor and standardized benchmarks and datasets are critical to driving progress towards better CAMD models and algorithms. Testing promising new generative models on consistent tasks and comparing their performance to classical methods is vital to accelerating the pace of progress toward automated molecular discovery and optimization. A molecular design framework should be assessed on two elements: the characteristics of the generated molecules and the optimization of an objective function of target properties. In the literature, generative models have been widely assessed on only a few selected properties, such as LogP, QED, or SA, making the assessment process difficult and uninformative [203]. While many metrics and reward functions have been presented and applied in the literature, two deep learning benchmarking platforms for molecular design, MOSES and GuacaMol, have been developed for the evaluation of models and algorithms in a controlled setting [204], [205].
MOSES
MOlecular SEtS (MOSES) offers a combined implementation of a benchmarking platform that consists of data preprocessing tools, a standard database, and evaluation metrics, along with state-of-the-art molecular generation models. In contrast to GuacaMol, MOSES is solely concerned with evaluating the task of molecular generation. To compare a generated set of molecules, G, against a reference set of molecules, R, taken from the training set, the platform defines five main evaluation metrics. Most of the metrics are similarity measures: fragment similarity (Frag), scaffold similarity (Scaff), Fréchet ChemNet Distance (FCD), nearest neighbor similarity (SNN), and internal diversity (IntDiv_p). The platform also includes auxiliary metrics that are commonly used for small-molecule drug discovery, including MW, LogP, SA, QED, and the natural product-likeness score (NP). Frag computes the cosine distance between the fragment frequencies of the generated and reference molecules, f_G and f_R, using the BRICS algorithm [206] to decompose molecules into chemical fragments. Similarly, Scaff calculates the cosine distance between the scaffold frequencies, s, which are produced by applying the Bemis–Murcko algorithm to remove the side-chain atoms [207]. The SNN metric is expressed as the average of the Tanimoto similarity, T, between a fragment representation of a generated molecule, m_G, and its nearest neighbor from the reference set, m_R:

SNN(G, R) = \frac{1}{|G|} \sum_{m_G \in G} \max_{m_R \in R} T(m_G, m_R)   (18)

Another metric that uses the Tanimoto similarity to assess the diversity of generated molecules is the internal diversity metric proposed by Benhenda [208]. Internal diversity differs from the other similarity metrics in that it offers insight into the diversity of the generated molecules themselves, allowing flaws in the generative model, such as mode collapse, to be detected:

IntDiv_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)^p}   (19)

Similar in principle to the Fréchet Inception Distance [209], FCD computes the distance between the distributions of molecules in the dataset and of generated molecules using the activations of the penultimate layer of the "ChemNet" LSTM-RNN, with mean vectors, μ, and covariance matrices, Σ, for each distribution [203]:

FCD(G, R) = \left\| \mu_G - \mu_R \right\|^2 + \mathrm{Tr}\left( \Sigma_G + \Sigma_R - 2 \left( \Sigma_G \Sigma_R \right)^{1/2} \right)   (20)
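To make the SNN and IntDiv_p definitions concrete, the minimal sketch below computes both metrics with molecules represented as hypothetical binary fingerprints, encoded as sets of on-bit indices (a toy stand-in for the Morgan fingerprints used in practice):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints (sets of on-bit indices)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def snn(gen, ref):
    """Eq. (18): average similarity of each generated molecule to its
    nearest neighbor in the reference set."""
    return sum(max(tanimoto(g, r) for r in ref) for g in gen) / len(gen)

def intdiv(gen, p=1):
    """Eq. (19): one minus the p-th power mean of all pairwise similarities
    within the generated set; low values flag mode collapse."""
    sims = [tanimoto(a, b) ** p for a in gen for b in gen]
    return 1.0 - (sum(sims) / len(sims)) ** (1.0 / p)
```

For example, with gen = [{1, 2}, {3, 4}] and ref = [{1, 2, 3}], snn averages the two nearest-neighbor similarities 2/3 and 1/4, while intdiv returns 0.5, reflecting two mutually dissimilar generated molecules.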
GuacaMol
GuacaMol outlines two categories of quantitative benchmarks for molecular design models: distribution-learning and goal-directed. The distribution-learning category uses five benchmarks to quantify the quality of a generative model trained to reproduce the distribution of molecules in the training set. On the other hand, goal-directed benchmarks employ robust and simple scoring functions to disentangle the selection of a good scoring function from the problem of molecular optimization. Under the goal-directed benchmarks, the objective function represents a combination of different molecular features, such as physicochemical properties and structural features [205]. The two classes of benchmarks are evaluated separately to better analyze the performance of a molecular design framework. However, such evaluation rests on the assumption that there is no one-to-one correspondence between the two tasks, an assumption that may not hold for reinforcement learning-based approaches. Even so, the benchmarking platform offers a unique quantitative route to advance the comparability of models in terms of molecular optimization. The following paragraphs summarize the key metrics within the two categories of the platform.
Distribution-learning Benchmarks.
This class of benchmarks assesses the molecule generation task using the following five benchmarks: validity, uniqueness, novelty, Fréchet ChemNet Distance, and Kullback–Leibler divergence. The validity benchmark determines the ability of a model to generate chemically valid molecules, which can be verified using software packages such as RDKit [134]. A generative model is further evaluated with the two overlapping benchmarks of uniqueness and novelty: uniqueness measures the ability of the model to generate diverse molecules with no repetitions, whereas novelty computes the fraction of generated molecules that are not present in the training set. Lastly, GuacaMol includes a common metric for distribution reconstruction, the Kullback–Leibler divergence, which captures diversity by measuring how the distribution of generated molecules differs from that of the training set [210].
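The three count-based benchmarks reduce to simple set operations once SMILES strings are canonicalized. The sketch below (an illustration, not the platform's implementation) assumes canonical SMILES are given, with None marking invalid generations:

```python
def distribution_benchmarks(generated, training):
    """Sketch of validity, uniqueness, and novelty scores.
    `generated`: list of canonical SMILES; None marks an invalid molecule.
    `training`: iterable of canonical SMILES seen during training."""
    valid = [s for s in generated if s is not None]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated),   # fraction parseable
        "uniqueness": len(unique) / len(valid),    # fraction non-duplicated
        "novelty": len(novel) / len(unique),       # fraction unseen in training
    }
```

A real implementation would first canonicalize each string (e.g., with RDKit), so that two different SMILES spellings of the same molecule count as duplicates.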
Goal-directed Benchmarks.
Under this category, the objective function is defined as a combination of two or more molecular features, including but not limited to the presence of substructures, similarity to other molecules, and structural and physicochemical properties. GuacaMol establishes 20 benchmarks that compile combinations of four main benchmark families: similarity, rediscovery, isomers, and median molecules. The similarity metric quantifies the crucial form of inverse screening in many CAMD problems, which aims to generate a molecule based on its similarity or dissimilarity to a given molecule. Closely related to similarity is the rediscovery metric, which seeks to rediscover a target molecule and is of special importance in de novo drug design. As isomers can have very dissimilar properties, an isomer metric is included to evaluate the flexibility of a generative model in enumerating the isomers of a given molecular formula. Moreover, the median molecules metric is included to maximize a molecule's similarity to several neighboring molecules simultaneously and to reward encoding more molecular structures [205]. It is worth noting that this class of benchmarks is not concerned with the selection of the best scoring function, but rather considers the complex combinations and trade-offs of molecular features.
Perspectives and Future Directions
In the preceding sections, we provided a critical survey of knowledge-based and data-driven methods and tools for molecular design and optimization. The literature presents an increasingly complex and rich array of molecular representations, QSPRs, solution methods, model architectures, algorithms, and tools. Even with many successful implementations and the sheer volume of research in this direction, the challenges encountered in practice point to the exigency for the molecular design community to devise multifaceted strategies and frameworks directed towards closing the loop. While the emergent deep learning methods have yet to surpass knowledge-based methods on molecular design tasks, we predict that major advancements in this application domain will come about through systems that integrate rich chemical knowledge with representation learning [205], [211]. Here, we offer our outlook on current approaches along with the major opportunities, challenges, and trends for this nascent class of solution methods.
Hybrid Knowledge-based and Data-driven Approaches
The principles of the knowledge-based and data-driven CAMD approaches are fundamentally different. Knowledge-based methods directly explore the chemical space while encoding the rules for structural validity and bounding property targets [3]. On the other hand, deep learning-based CAMD methods approximate the structure of the subset of the chemical space observable as input data, constructing a latent space that preserves the features required for reconstructing the chemical space [25]. In this way, deep learning approaches are promising candidates for replacing and complementing their knowledge-based counterparts by side-stepping limitations associated with the complexity of molecular systems. For example, a potential alternative to complex property prediction models is the use of input-convex neural networks, which would allow for direct optimization over molecules [212]. As model-based methods currently stand, frameworks and tools have long been established and used to solve problems of significance in academia and industry [89]. Yet several challenges lie ahead for applying knowledge-based methods to many different classes of molecules, including the paucity of property data, the reliability and predictive power of property models, and the accessibility of solution strategies for multiscale design [89], [213]. Conversely, while holding the promise to transform molecular design, deep generative models are still in their infancy and guided by the empirical nature of model development in machine learning. It is hence imperative to confront these challenges in the development of future data-driven or hybrid approaches. At present, knowledge-based decomposition methods remain at the forefront of CAMD, with quite a few established models.
For instance, the preeminent OptCAMD framework tackles several complexity issues and is capable of integrating machine learning methods for property prediction, demonstrating considerable success in several case studies [34], [214]. Further, a comparative study reported that genetic algorithms consistently perform as well as or better than present-day state-of-the-art deep learning models on molecular design tasks, at a much lower computational cost [205], [211]. However, rapid advances are expected to follow as the nascent deep learning literature in molecular design presents proof-of-concept demonstrations in inorganic solid-state functional materials and reticular frameworks [215], [216]. In the short term, we anticipate that hybrid systems involving decomposition, deep learning, and knowledge-based methods hold the most potential to solve problems of significance. In this context, many sources in the literature have addressed the efficacy of anchoring deep learning algorithms in scientific theory through synergistic coupling of response and loss functions, selection of theory-compliant model architectures, constraining the space of probabilistic models with theory-based priors, and including theory-based regularization terms [29], [217]. Demonstrated successes of the fusion of deep learning and scientific knowledge include physics-informed neural networks [218], interfaces between quantum mechanics and neural networks [28], and molecular deep tensor neural networks such as SchNet [219], among others. As progress in the field of deep generative learning accelerates, we anticipate that more sophisticated methods will emerge for better integration of chemistry knowledge, resulting in improved performance and broader-ranging latent chemical spaces [220].
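To make the notion of a theory-based regularization term concrete, the toy sketch below (synthetic data; not drawn from any cited framework) fits a group-contribution-style linear property model by gradient descent while penalizing predictions that violate a known physical lower bound, here nonnegativity of the property:

```python
import random

random.seed(0)

# Toy group-occurrence features and a property obeying a known bound y >= 0.
X = [(random.random(), random.random()) for _ in range(50)]
w_true = (1.0, 2.0)
y = [x[0] * w_true[0] + x[1] * w_true[1] + random.gauss(0, 0.01) for x in X]

w = [0.0, 0.0]
lam = 10.0     # weight of the theory-based penalty
lr = 0.1       # gradient-descent step size
for _ in range(3000):
    g = [0.0, 0.0]
    for x, target in zip(X, y):
        pred = x[0] * w[0] + x[1] * w[1]
        err = pred - target        # data-fit residual
        viol = min(pred, 0.0)      # theory term: active only when pred < 0
        for j in range(2):
            g[j] += 2 * (err + lam * viol) * x[j] / len(y)
    w = [w[j] - lr * g[j] for j in range(2)]
```

The penalty gradient vanishes wherever the prediction already respects the bound, so the theory term steers learning without distorting the data fit in the feasible region.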
Property Data Availability
The issue of limited availability of property data plagues the development of property prediction models, which serve as the underpinning of the CAMD problem. The complexity of this issue is compounded when mixtures or products are considered. The collection of reliable property models can be enlarged using theory-based methods, data-driven methods, or their combinations [89]. Despite the groundbreaking leap that deep learning has brought to property prediction, this data-intensive class of methods faces data availability as a major limitation to establishing confidence in the generalizability of its predictive models [19], [20]. Further, although sizable datasets exist for several chemical properties, data for many central properties that are more expensive to compute or measure remain limited. A viable solution route is to generate pseudo-experimental property data using reference methods, such as quantum mechanics [221], DFT [20], and COSMO-RS [53], leading to greater model availability and data standardization [10], [89]. Different approaches could also be explored to address the paucity of property data, including data fusion, transfer learning, active learning, and text mining [222], [223]. Future studies exploring the cost-accuracy tradeoff through the generation and testing of low- and high-fidelity property data may offer insights into relating certain properties to elemental structures and their transferability to larger sets of molecular structures [224]. Given the successful applications of text and image mining in extracting novel compounds and synthesis routes [138], [225], the task of extracting property data from the published literature remains a potential route to substantially increase the volume of readily available data.
Additional progress in property data mining and aggregation is expected to mitigate some of the challenges associated with the absence of engineering knowledge in structure-property relationships for different classes of molecules, such as crystal structures [226], alloys [227], proteins and nucleic acids [228], [229], polymers [230], ionic liquids [231], and biologics [232], [233]. This enlargement of open-source datasets and the library of available property models is vital to accelerating further development of this field [89].
Molecular Representations
As seen, the choice of an expressive representation of molecules is vital to the success of the predictive and generative tasks. Based on a rigorous study that controlled properties by varying elemental compositions and atomic configurations, no single representation appears to fit all properties [234]. CAMD is often performed at multiple layers and different scales of complexity [31]. As distinctions between isomers or similar structures become more important, representations become more complex and mechanistic for improved property prediction [9]. While 3D representations suffer from sparsity, invertibility, and invariance issues, there is a shift towards the use of 3D representations in the literature, with several methods tackling such issues [156], [157]. Further, constructing chemical latent spaces from a combination of representations could lead to significant performance improvements: a heteroencoder encoding several representations was shown to outperform an encoder with a single representation input [143]. Thus, for molecular design frameworks with multiple design stages, it may be best to include several molecular representations, with each design stage assigned the representation most suitable for the given design task [31]. The use of non-atom-centric representations of molecules, as 3D structures and electrons, has also recently been suggested to describe molecular systems more accurately [204]. Looking ahead, innovation with available representations and the development of novel ones appear to be promising creative endeavors.
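As a simple illustration of how a representation choice shapes model input, the sketch below one-hot encodes a SMILES string over a fixed character alphabet. The alphabet here is a hypothetical minimal one; real SMILES-based models use far larger token sets and handle multi-character tokens such as Cl and Br:

```python
def one_hot_smiles(smiles, alphabet, max_len):
    """Encode a SMILES string as a (max_len x vocab) one-hot matrix,
    padding with spaces -- a common input format for SMILES VAEs/RNNs."""
    vocab = {ch: i for i, ch in enumerate(alphabet + " ")}
    padded = smiles.ljust(max_len, " ")
    return [[1 if vocab[ch] == j else 0 for j in range(len(vocab))]
            for ch in padded]
```

Because the matrix depends entirely on the chosen alphabet and maximum length, the same molecule can have many encodings, which is one reason canonicalization and tokenization choices matter for SMILES-based models.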
Generative Models
The current pace and thrust of developments in generative deep learning present interesting new avenues for molecular generation and optimization. There is much to be accomplished, from improving the generalizability of models and relaxing the i.i.d. assumption to reflect real-world data, to covering unexplored areas of the chemical space and performing operations in the space by transforming learned embeddings into practical rules. As the entire field of deep learning advances, novel methods applicable to our problem of interest emerge, such as multi-level VAEs [178], [235], the Boltzmann Generator [236], and GraphNVP [237], [238]. Further, the literature points to many alternative routes and several extensions of interest. The use of token embeddings has been suggested to construct more robust RNN-based models. Another way to improve RNNs is by modifying them to incorporate more complex memory and attention mechanisms [142], [239]. Given the extensive success that embedding algorithms have achieved in natural language processing, it is also possible to adapt or tweak other architectures, such as the Transformer [240] and sequence-to-sequence learning [241]. Since implementations with graphs as molecular representations assume the graphs to be static, incorporating RNNs to build dynamic graphs has been demonstrated to address their invertibility [180]. Also, two tree-graph-based methods identify tree decomposition optimization as a route to improve their generalization, using goodness criteria such as the minimum description length [242] or Bayesian approaches [130], [173], [243]. With novelty as a main scoring criterion, many implementations of the powerful GAN framework still suffer from mode collapse. A possible approach to alleviating this issue is to incorporate recent developments in dynamically controlled attention RNNs for text generation tasks [238], [244], [245].
Given that the generator model in GANs is often trained as a stochastic policy in a reinforcement learning framework, several approaches that could lead to more stable training and better generators have yet to be applied to molecular design. These include, but are not limited to, maximum entropy inverse reinforcement learning [246], actor-critic methods [247], and multi-objective constrained reinforcement learning [248]. In light of the success of hierarchical representations of large molecules (polymers, proteins, and metal frameworks) [249], [250], hierarchical GANs may also have the capability to learn a domain-invariant feature representation of more complex and larger molecules [251].
Evaluation Metrics
The development of consistent tasks and general evaluation metrics to compare generative and predictive models and rapidly screen candidate molecules and products is essential to speeding up material discovery. The adopted metrics should rigorously and independently assess the tasks of molecular generation and molecular optimization. Pioneering efforts in the development of evaluation metrics include the MOSES [204] and GuacaMol [205] platforms, which provide open-source libraries with datasets, baseline models, and evaluation metrics. Yet, as many current metrics are based on heuristics and rules of medicinal chemistry, novel benchmarks are required to holistically design molecular solutions for broader spectra of applications and disciplines [204]. Further, with most proposed metrics predominantly focusing on the evaluation of generative models, it is imperative to develop metrics around the property prediction task, especially for quantifying uncertainty. Such metrics are particularly essential for opaque methods such as neural networks, in order to describe model ignorance and identify molecules for which experimental or pseudo-experimental data may be needed [252], [253]. As such, higher levels of collaborative work are needed to refine and extend the library of open-source benchmarking platforms and methods.
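One simple route to such an uncertainty measure, sketched below on toy data (hypothetical descriptors; a bootstrap ensemble is just one crude proxy, not any platform's method), is to refit a property model on bootstrap resamples and use the spread of its predictions to flag molecules the model is ignorant about:

```python
import random

random.seed(1)

def fit_2d(xs, ys):
    """Least-squares fit y = w1*x1 + w2*x2 via the 2x2 normal equations."""
    s11 = sum(x[0] * x[0] for x in xs)
    s12 = sum(x[0] * x[1] for x in xs)
    s22 = sum(x[1] * x[1] for x in xs)
    b1 = sum(x[0] * t for x, t in zip(xs, ys))
    b2 = sum(x[1] * t for x, t in zip(xs, ys))
    det = s11 * s22 - s12 * s12
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)

# Toy property data: 40 molecules with two descriptors each (hypothetical).
X = [(random.random(), random.random()) for _ in range(40)]
y = [x[0] * 1.0 + x[1] * -0.5 + random.gauss(0, 0.05) for x in X]

x_new = (0.5, 0.5)   # descriptor vector of a candidate molecule
preds = []
for _ in range(50):
    idx = [random.randrange(40) for _ in range(40)]   # bootstrap resample
    w1, w2 = fit_2d([X[i] for i in idx], [y[i] for i in idx])
    preds.append(x_new[0] * w1 + x_new[1] * w2)

mean = sum(preds) / len(preds)
std = (sum((p - mean) ** 2 for p in preds) / len(preds)) ** 0.5
# A large std flags a molecule for which (pseudo-)experimental data is needed.
```

The same recipe applies to any black-box property model: the ensemble spread approximates epistemic uncertainty, while the residual noise level bounds the aleatoric part.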
Integrated Product and Process Design
The application of deep learning to CAMD is merely a starting facet of a complex autonomous system that concurrently integrates the generation, optimization, and synthesis of chemical products. Even though many CAMD methods and tools have long been developed, only a limited number of models have been implemented in industry. Additionally, there is no published work that concurrently considers synthesis routes and product/process design [89]. Deep learning models have been proposed for Computer-Aided Synthesis Planning (CASP) in order to accelerate the synthesis process. This type of model takes molecules as input and generates feasible synthesis routes with purchasable starting compounds [254]. Further exploitation of deep learning advances towards closing the design loop involves enabling autonomous experimentation and synthesis in self-driving laboratories [255]. As a pivotal step towards unleashing a “Moore’s law for scientific discovery”, the development of integrated design and synthesis systems requires synergy between theoretical and experimental researchers [256]. The deployment of multi-scale modeling and efficient property estimation models within process design will lead to improved systematic methodologies for rapid prediction and evaluation in product/process design. As the field stands, the lack of systematic methods renders the evaluation process ineffective, excluding potentially superior molecules or mixtures [31]. In order to expand the portfolio of chemical products and improve the efficiency of processes, it is important to develop reliable, fast, and sustainable design and simulation tools. Hybrid product and process design systems linking knowledge-based decomposition methods with deep learning-based techniques will be necessary to approach current material design problems, such as polymers for membranes, zeolites for adsorbents, metals for catalysts, and enzymes for bio-catalysts.
Such design problems, however, are less tenable due to the lack of property data and the complexities of structure-property relationships. As seen, the role of researchers in the short run is to integrate learning-based models into the knowledge-based design of products and processes, offering a roadmap of the knowledge gaps and challenges that need to be addressed.
Conclusion
Although knowledge-based CAMD methods have had longstanding success in the systematic screening and identification of promising molecules, the emergence of deep learning approaches for molecular design holds transformative potential in the near future. In this paper, we reviewed recent progress, limitations, and opportunities in CAMD for both knowledge-based and deep learning-based approaches. In addition to offering a detailed review of knowledge-based property prediction methods and solution techniques, the article also presented a survey of state-of-the-art deep generative CAMD models, examining various molecular representations, deep learning architectures, and evaluation benchmarks. The comparative descriptions of the key elements underlying the knowledge-based and data-driven CAMD methods revealed several challenges and opportunities from multiple facets. Building on the discussions of the current challenges, we identified a key promising path forward represented by hybrid methods, which harness the powerful capabilities of deep learning while leveraging the accumulated wealth of knowledge in this rich domain. Future work could be directed towards building large and diverse datasets of property data, developing more expressive molecular representations, advancing deep generative models unanchored in the current assumptions, establishing better benchmarking methods, and integrating products and processes in the design loop. As seen, success in these endeavors largely hinges on work and innovations around integrating the various forms in which relevant chemistry and physics theory and knowledge are represented.
References [1] H. N. J. Schifferstein and L. Wastiels, “Chapter 2 - Sensing Materials: Exploring the Building Blocks for Experiential Design,” E. Karana, O. Pedgley, and V. B. T.-M. E. Rognoli, Eds. Boston: Butterworth-Heinemann, 2014, pp. 15–26. [2]
S. L. Moskowitz, “The Coming of the Advanced‐Materials Revolution,” in
The Advanced Materials Revolution: Technology and Economic Growth in the Age of Globalization , Hoboken, NJ, USA: John Wiley & Sons, Inc., 2008, pp. 11–20. [3] R. Gani, “Computer-Aided Methods and Tools for Chemical Product Design,”
Chemical Engineering Research and Design , vol. 82, no. 11, pp. 1494–1504, Nov. 2004. [4] N. D. Austin, N. V. Sahinidis, and D. W. Trahan, “Computer-aided molecular design: An introduction and review of tools, applications, and solution techniques,”
Chemical Engineering Research and Design , vol. 116, pp. 2–26, Dec. 2016. [5] P. Kirkpatrick and C. Ellis, “Chemical space,”
Nature , vol. 432, no. 7019, pp. 823–823, Dec. 2004. [6] R. S. Bohacek, C. McMartin, and W. C. Guida, “The art and practice of structure-based drug design: A molecular modeling perspective,”
Medicinal Research Reviews , vol. 16, no. 1, pp. 3–50, Jan. 1996. [7] J. L. Franklin, “Prediction of Heat and Free Energies of Organic Compounds,”
Industrial & Engineering Chemistry , vol. 41, no. 5, pp. 1070–1076, May 1949. [8] “New Horizons in chemical space,”
Nature Reviews Drug Discovery , vol. 3, no. 5, p. 375, 2004. [9] R. Gani, L. E. K. Achenie, and V. Venkatasubramanian, “Chapter 1 - Introduction to CAMD,” in
Computer Aided Molecular Design: Theory and Practice , Luke E.K. Achenie and Rafiqul Gani and Venkat Venkatasubramanian, Ed. Elsevier, 2003, pp. 3–21. [10] R. Gani, “Group contribution-based property estimation methods: advances and perspectives,”
Current Opinion in Chemical Engineering , vol. 23, pp. 184–196, Mar. 2019. [11] I. E. Grossmann, “Challenges in the new millennium: product discovery and design, enterprise and supply chain optimization, global life cycle assessment,”
Computers & Chemical Engineering , vol. 29, no. 1, pp. 29–39, Dec. 2004. [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,”
Communications of the ACM , vol. 60, no. 6, pp. 84–90, May 2017. [13] J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation,” in
Advances in Neural Information Processing Systems , Jun. 2014, pp. 1799–1807. [14] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,”
IEEE Signal Processing Magazine , vol. 29, no. 6, pp. 82–97, Nov. 2012. [15] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” in , May 2013, pp. 8614–8618. [16] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural Language Processing (almost) from Scratch,”
Journal of Machine Learning Research , vol. 12, pp. 2493–2537, Mar. 2011. [17] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,”
Science , vol. 349, no. 6245, pp. 255–260, Jul. 2015. [18] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”
Nature , vol. 521, no. 7553, pp. 436–444, May 2015. [19] J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, and V. Svetnik, “Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships,”
Journal of Chemical Information and Modeling , vol. 55, no. 2, pp. 263–274, Feb. 2015. [20] D. Jha, K. Choudhary, F. Tavazza, W. Liao, A. Choudhary, C. Campbell, and A. Agrawal, “Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning,”
Nature Communications , vol. 10, no. 1, p. 5316, Dec. 2019. [21] P. Bonami, A. Lodi, and G. Zarpellon, “Learning a Classification of Mixed-Integer Quadratic Programming Problems,” in
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) , Springer, 2018, pp. 595–604. [22] C. Ning and F. You, “Optimization under uncertainty in the era of big data and deep learning: When machine learning meets mathematical programming,”
Computers & Chemical Engineering , vol. 125, pp. 434–448, Jun. 2019. [23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,”
Nature , vol. 518, no. 7540, pp. 529–533, Feb. 2015. [24] P. M. Attia, A. Grover, N. Jin, K. A. Severson, T. M. Markov, Y.-H. Liao, M. H. Chen, B. Cheong, N. Perkins, Z. Yang, P. K. Herring, M. Aykol, S. J. Harris, R. D. Braatz, S. Ermon, and W. C. Chueh, “Closed-loop optimization of fast-charging protocols for batteries with machine learning,”
Nature , vol. 578, no. 7795, pp. 397–402, 2020. [25] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik, “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules,”
ACS Central Science , vol. 4, no. 2, pp. 268–276, Feb. 2018. [26] M. Popova, O. Isayev, and A. Tropsha, “Deep reinforcement learning for de novo drug design,”
Science Advances , vol. 4, no. 7, p. eaap7885, Jul. 2018. [27] Y. Gil, M. Greaves, J. Hendler, and H. Hirsh, “Amplify scientific discovery with artificial intelligence,”
Science , vol. 346, no. 6206, pp. 171–172, Oct. 2014. [28] K. T. Schütt, M. Gastegger, A. Tkatchenko, K.-R. Müller, and R. J. Maurer, “Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions,”
Nature Communications , vol. 10, no. 1, p. 5024, Dec. 2019. [29] M. Raissi, A. Yazdani, and G. E. Karniadakis, “Hidden fluid mechanics: Learning velocity 5 and pressure fields from flow visualizations,”
Science , vol. 367, no. 6481, pp. 1026–1030, Feb. 2020. [30] Z. Y. Wan, P. Vlachas, P. Koumoutsakos, and T. Sapsis, “Data-assisted reduced-order modeling of extreme events in complex dynamical systems,”
PLOS ONE , vol. 13, no. 5, pp. 1–22, May 2018. [31] R. Gani, “Chemical product design: challenges and opportunities,”
Computers & Chemical Engineering , vol. 28, no. 12, pp. 2441–2457, Nov. 2004. [32] L. Y. Ng, F. K. Chong, and N. G. Chemmangattuvalappil, “Challenges and opportunities in computer-aided molecular design,”
Computers & Chemical Engineering , vol. 81, pp. 115–129, Oct. 2015. [33] J. Marrero and R. Gani, “Group-contribution based estimation of pure component properties,”
Fluid Phase Equilibria , vol. 183–184, pp. 183–208, Jul. 2001. [34] Q. Liu, L. Zhang, L. Liu, J. Du, A. K. Tula, M. Eden, and R. Gani, “OptCAMD: An optimization-based framework and tool for molecular and mixture product design,”
Computers & Chemical Engineering , vol. 124, pp. 285–301, May 2019. [35]
J. C. Dearden, “Quantitative structure‐property relationships for prediction of boiling point, vapor pressure, and melting point,”
Environmental Toxicology and Chemistry , vol. 22, no. 8, p. 1696, 2003. [36] Y. Chen, G. M. Kontogeorgis, and J. M. Woodley, “Group Contribution Based Estimation Method for Properties of Ionic Liquids,”
Industrial & Engineering Chemistry Research , vol. 58, no. 10, pp. 4277–4292, Mar. 2019. [37] S. H. Kim, A. Anantpinijwatna, J. W. Kang, and R. Gani, “Analysis and modeling of alkali halide aqueous solutions,”
Fluid Phase Equilibria , vol. 412, pp. 177–198, Mar. 2016. [38] L. Constantinou and R. Gani, “New group contribution method for estimating properties of pure compounds,”
AIChE Journal , vol. 40, no. 10, pp. 1697–1710, Oct. 1994. [39] A. S. Hukkerikar, R. J. Meier, G. Sin, and R. Gani, “A method to estimate the enthalpy of formation of organic compounds with chemical accuracy,”
Fluid Phase Equilibria , vol. 348, pp. 23–32, Jun. 2013. [40] N. D. Austin, N. V. Sahinidis, and D. W. Trahan, “A COSMO-based approach to computer-aided mixture design,”
Chemical Engineering Science , vol. 159, pp. 93–105, Feb. 2017. [41] A. S. Hukkerikar, B. Sarup, A. Ten Kate, J. Abildskov, G. Sin, and R. Gani, “Group-contribution+ (GC+) based estimation of properties of pure components: Improved property estimation and uncertainty analysis,”
Fluid Phase Equilibria , vol. 321, pp. 25–43, May 2012. [42] A. S. Hukkerikar, S. Kalakul, B. Sarup, D. M. Young, G. Sin, and R. Gani, “Estimation of Environment-Related Properties of Chemicals for Design of Sustainable Processes: Development of Group-Contribution+ (GC+) Property Models and Uncertainty Analysis,”
Journal of Chemical Information and Modeling , vol. 52, no. 11, pp. 2823–2839, Nov. 2012. [43] S. Jhamb, X. Liang, R. Gani, and A. S. Hukkerikar, “Estimation of physical properties of amino acids by group-contribution method,”
Chemical Engineering Science , vol. 175, pp. 148–161, Jan. 2018. [44] O. A. Perederic, L. P. Cunico, S. Kalakul, B. Sarup, J. M. Woodley, G. M. Kontogeorgis, and R. Gani, “Systematic identification method for data analysis and phase equilibria modelling for lipids systems,”
The Journal of Chemical Thermodynamics , vol. 121, pp. 153–169, Jun. 2018. [45] N. Trinajstic,
Chemical Graph Theory , 2nd ed. New York: Routledge, 2018. [46] J. Devillers and A. Balaban,
Topological Indices and Related Descriptors in QSAR and QSPR , illustrated ed. CRC Press, 2000. [47] J.-L. Faulon, D. P. Visco, and R. S. Pophale, “The Signature Molecular Descriptor. 1. Using Extended Valence Sequences in QSAR and QSPR Studies,”
Journal of Chemical Information and Computer Sciences , vol. 43, no. 3, pp. 707–720, May 2003. [48] H. Wiener, “Structural Determination of Paraffin Boiling Points,”
Journal of the American Chemical Society , vol. 69, no. 1, pp. 17–20, Jan. 1947. [49] M. Randic, “Characterization of molecular branching,”
Journal of the American Chemical Society , vol. 97, no. 23, pp. 6609–6615, Nov. 1975. [50] W. D. Seider, D. R. Lewin, S. Widagdo, K. M. Ng, R. Gani, and J. D. Seader, “Molecular and Mixture Design,” in
Product and Process Design Principles: Synthesis, Analysis, and Evaluation , Wiley, 2016, pp. 79–109. [51] A. Fredenslund, R. L. Jones, and J. M. Prausnitz, “Group-contribution estimation of activity coefficients in nonideal liquid mixtures,”
AIChE Journal , vol. 21, no. 6, pp. 1086–1099, Nov. 1975. [52] W. G. Chapman, K. E. Gubbins, G. Jackson, and M. Radosz, “SAFT: Equation-of-state solution model for associating fluids,”
Fluid Phase Equilibria , vol. 52, pp. 31–38, Dec. 1989. [53] A. Klamt, V. Jonas, T. Bürger, and J. C. W. Lohrenz, “Refinement and Parametrization of COSMO-RS,”
The Journal of Physical Chemistry A , vol. 102, no. 26, pp. 5074–5085, Jun. 1998. [54] K. G. Joback and G. Stephanopoulos, “Searching Spaces of Discrete Solutions: The Design of Molecules Possessing Desired Physical Properties,” in
Advances in Chemical Engineering , 1995, pp. 257–311. [55] P. M. Harper and R. Gani, “A multi-step and multi-level approach for computer aided molecular design,” in
Computers and Chemical Engineering , 2000, vol. 24, no. 2–7, pp. 677–683. [56] A. P. Samudra and N. V. Sahinidis, “Optimization-based framework for computer-aided molecular design,”
AIChE Journal , vol. 59, no. 10, pp. 3686–3701, Oct. 2013. [57] A. P. Duvedi and L. E. K. Achenie, “Designing environmentally safe refrigerants using mathematical programming,”
Chemical Engineering Science , vol. 51, no. 15, pp. 3727–3739, Aug. 1996. [58] M. Sinha, L. E. K. Achenie, and G. M. Ostrovsky, “Environmentally benign solvent design by global optimization,”
Computers & Chemical Engineering , vol. 23, no. 10, pp. 1381–1394, Dec. 1999. [59] N. V. Sahinidis, M. Tawarmalani, and M. Yu, “Design of alternative refrigerants via global optimization,”
AIChE Journal , vol. 49, no. 7, pp. 1761–1775, Jul. 2003. [60] K. V. Camarda and C. D. Maranas, “Optimization in Polymer Design Using Connectivity Indices,”
Industrial & Engineering Chemistry Research , vol. 38, no. 5, pp. 1884–1892, May 1999. [61] L. Zhang, S. Cignitti, and R. Gani, “Generic mathematical programming formulation and solution for computer-aided molecular design,”
Computers & Chemical Engineering , vol. 78, pp. 79–84, Jul. 2015. [62] A. Duvedi and L. E. K. Achenie, “On the design of environmentally benign refrigerant mixtures: a mathematical programming approach,”
Computers & Chemical Engineering , vol. 21, no. 8, pp. 915–923, Apr. 1997. [63] M. Lampe, M. Stavrou, J. Schilling, E. Sauer, J. Gross, and A. Bardow, “Computer-aided molecular design in the continuous-molecular targeting framework using group-contribution PC-SAFT,”
Computers & Chemical Engineering , vol. 81, pp. 278–287, Oct. 2015. [64] C. D. Maranas, “Optimal Computer-Aided Molecular Design: A Polymer Design Case Study,”
Industrial & Engineering Chemistry Research , vol. 35, no. 10, pp. 3403–3414, Jan. 1996. [65] G. M. Ostrovsky, L. E. K. Achenie, and M. Sinha, “A reduced dimension branch-and-bound algorithm for molecular design,”
Computers & Chemical Engineering , vol. 27, no. 4, pp. 551–567, Apr. 2003. [66] S. Jonuzaj, A. Gupta, and C. S. Adjiman, “The design of optimal mixtures from atom groups using Generalized Disjunctive Programming,”
Computers & Chemical Engineering , vol. 116, pp. 401–421, Aug. 2018. [67] N. D. Austin, A. P. Samudra, N. V. Sahinidis, and D. W. Trahan, “Mixture design using derivative-free optimization in the space of individual component properties,”
AIChE Journal , vol. 62, no. 5, pp. 1514–1530, May 2016. [68] M. J. D. Powell, “UOBYQA: unconstrained optimization by quadratic approximation,”
Mathematical Programming , vol. 92, no. 3, pp. 555–582, May 2002. [69] W. Huyer and A. Neumaier, “SNOBFIT -- Stable Noisy Optimization by Branch and Fit,”
ACM Transactions on Mathematical Software , vol. 35, no. 2, pp. 1–25, Jul. 2008. [70] Y. Sun, N. V Sahinidis, A. Sundaram, and M.-S. Cheon, “Derivative-free optimization for chemical product design,”
Current Opinion in Chemical Engineering , vol. 27, pp. 98–106, Mar. 2020. [71] U. M. Diwekar and W. Xu, “Improved Genetic Algorithms for Deterministic Optimization and Optimization under Uncertainty. Part I. Algorithms Development,”
Industrial & Engineering Chemistry Research , vol. 44, no. 18, pp. 7132–7137, Aug. 2005. [72] T. Zhou, Y. Zhou, and K. Sundmacher, “A hybrid stochastic–deterministic optimization approach for integrated solvent and process design,”
Chemical Engineering Science , vol. 159, pp. 207–216, Feb. 2017. [73] R. H. Herring and M. R. Eden, “Evolutionary algorithm for de novo molecular design with multi-dimensional constraints,”
Computers & Chemical Engineering , vol. 83, pp. 267–277, Dec. 2015. [74] V. Venkatasubramanian, K. Chan, and J. M. Caruthers, “Evolutionary Design of Molecules with Desired Properties Using the Genetic Algorithm,”
Journal of Chemical Information and Modeling , vol. 35, no. 2, pp. 188–195, Mar. 1995. [75] B. Lin, S. Chavali, K. Camarda, and D. C. Miller, “Computer-aided molecular design using Tabu search,”
Computers & Chemical Engineering , vol. 29, no. 2, pp. 337–347, Jan. 2005. [76] S. E. McLeese, J. C. Eslick, N. J. Hoffmann, A. M. Scurto, and K. V. Camarda, “Design of ionic liquids via computational molecular design,”
Computers & Chemical Engineering , vol. 34, no. 9, pp. 1476–1480, Sep. 2010. [77] J. C. Eslick, Q. Ye, J. Park, E. M. Topp, P. Spencer, and K. V. Camarda, “A computational molecular design framework for crosslinked polymer networks,”
Computers & Chemical Engineering , vol. 33, no. 5, pp. 954–963, May 2009. [78] T. Rusu and O. M. Gogan, “Multiobjective Tabu search method for the optimization of block copolymers structure,” in
AIP Conference Proceedings , 2018, p. 020171. [79] N. D. Austin, N. V. Sahinidis, I. A. Konstantinov, and D. W. Trahan, “COSMO-based computer-aided molecular/mixture design: A focus on reaction solvents,”
AIChE Journal , vol. 64, no. 1, pp. 104–122, Jan. 2018. [80] V. Venkatasubramanian, K. Chan, and J. M. Caruthers, “Computer-aided molecular design using genetic algorithms,”
Computers & Chemical Engineering , vol. 18, no. 9, pp. 833–844, Sep. 1994. [81] I. Nowak,
Relaxation and Decomposition Methods for Mixed Integer Nonlinear Programming , vol. 152. Basel: Birkhäuser-Verlag, 2005. [82] A. T. Karunanithi, L. E. K. Achenie, and R. Gani, “A New Decomposition-Based Computer-Aided Molecular/Mixture Design Methodology for the Design of Optimal Solvents and Solvent Mixtures,”
Industrial & Engineering Chemistry Research , vol. 44, no. 13, pp. 4785–4797, Jun. 2005. [83] A. Bardow, K. Steur, and J. Gross, “Continuous-Molecular Targeting for Integrated Solvent and Process Design,”
Industrial & Engineering Chemistry Research , vol. 49, no. 6, pp. 2834–2840, Mar. 2010. [84] J. A. Klein, D. T. Wu, and R. Gani, “Computer aided mixture design with specified property constraints,”
Computers & Chemical Engineering , vol. 16, pp. S229–S236, May 1992. [85] M. R. Eden, S. B. Jørgensen, R. Gani, and M. M. El-Halwagi, “A novel framework for simultaneous separation process and product design,”
Chemical Engineering and Processing: Process Intensification , vol. 43, no. 5, pp. 595–608, May 2004. [86] S. Gopinath, G. Jackson, A. Galindo, and C. S. Adjiman, “Outer approximation algorithm with physical domain reduction for computer-aided molecular and separation process design,”
AIChE Journal , vol. 62, no. 9, pp. 3484–3504, Sep. 2016. [87] C. S. Adjiman, A. Galindo, and G. Jackson, “Molecules Matter,” in
Computer Aided Chemical Engineering , 2014, pp. 55–64. [88] S. Chai, Q. Liu, X. Liang, Y. Guo, S. Zhang, C. Xu, J. Du, Z. Yuan, L. Zhang, and R. Gani, “A grand product design model for crystallization solvent design,”
Computers & Chemical Engineering , vol. 135, p. 106764, Apr. 2020. [89] L. Zhang, H. Mao, Q. Liu, and R. Gani, “Chemical product design – recent advances and perspectives,”
Current Opinion in Chemical Engineering , vol. 27, pp. 22–34, Mar. 2020. [90] S. Kalakul, L. Zhang, H. A. Choudhury, N. O. Elbashir, M. R. Eden, and R. Gani, “ProCAPD – A Computer-Aided Model-Based Tool for Chemical Product Design and Analysis,” in
Computer Aided Chemical Engineering , 2018, pp. 469–474. [91] L. Ruddigkeit, R. van Deursen, L. C. Blum, and J.-L. Reymond, “Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17,”
Journal of Chemical Information and Modeling , vol. 52, no. 11, pp. 2864–2875, Nov. 2012. [92] M. Nakata and T. Shimazaki, “PubChemQC Project: A Large-Scale First-Principles Electronic Structure Database for Data-Driven Chemistry,”
Journal of Chemical Information and Modeling , vol. 57, no. 6, pp. 1300–1308, Jun. 2017. [93] R. Gani and E. A. Brignole, “Molecular design of solvents for liquid extraction based on UNIFAC,”
Fluid Phase Equilibria , vol. 13, pp. 331–340, Jan. 1983. [94] O. Odele and S. Macchietto, “Computer aided molecular design: a novel method for optimal solvent selection,”
Fluid Phase Equilibria , vol. 82, pp. 47–54, Feb. 1993. [95] R. Gani and A. Fredenslund, “Computer aided molecular and mixture design with specified property constraints,”
Fluid Phase Equilibria , vol. 82, pp. 39–46, Feb. 1993. [96] M. Lampe, M. Stavrou, H. M. Bücker, J. Gross, and A. Bardow, “Simultaneous Optimization of Working Fluid and Process for Organic Rankine Cycles Using PC-SAFT,”
Industrial & Engineering Chemistry Research , vol. 53, no. 21, pp. 8821–8830, May 2014. [97] M. Stavrou, M. Lampe, A. Bardow, and J. Gross, “Continuous Molecular Targeting–Computer-Aided Molecular Design (CoMT–CAMD) for Simultaneous Process and Solvent Design for CO2 Capture,”
Industrial & Engineering Chemistry Research , vol. 53, no. 46, pp. 18029–18041, Nov. 2014. [98] S. Bommareddy, N. G. Chemmangattuvalappil, C. C. Solvason, and M. R. Eden, “Simultaneous solution of process and molecular design problems using an algebraic approach,”
Computers & Chemical Engineering , vol. 34, no. 9, pp. 1481–1486, Sep. 2010. [99] A. Buxton, A. G. Livingston, and E. N. Pistikopoulos, “Optimal design of solvent blends for environmental impact minimization,”
AIChE Journal , vol. 45, no. 4, pp. 817–843, Apr. 1999. [100] J. Burger, V. Papaioannou, S. Gopinath, G. Jackson, A. Galindo, and C. S. Adjiman, “A hierarchical method to integrated solvent and process design of physical CO2 absorption using the SAFT-γ Mie approach,”
AIChE Journal , vol. 61, no. 10, pp. 3249–3269, Oct. 2015. [101] M. Hostrup, P. M. Harper, and R. Gani, “Design of environmentally benign processes: integration of solvent design and separation process synthesis,”
Computers & Chemical Engineering , vol. 23, no. 10, pp. 1395–1414, Dec. 1999. [102] K.-J. Kim and U. M. Diwekar, “Integrated Solvent Selection and Recycling for Continuous Processes,”
Industrial & Engineering Chemistry Research , vol. 41, no. 18, pp. 4479–4488, Sep. 2002. [103] A. I. Papadopoulos and P. Linke, “Multiobjective molecular design for integrated process-solvent systems synthesis,”
AIChE Journal , vol. 52, no. 3, pp. 1057–1070, Mar. 2006. [104] E. A. Brignole, S. Bottini, and R. Gani, “A strategy for the design and selection of solvents for separation processes.,”
Fluid Phase Equilibria , vol. 29, pp. 125–132, Oct. 1986. [105] R. Gani, B. Nielsen, and A. Fredenslund, “A group contribution approach to computer-aided molecular design,”
AIChE Journal , vol. 37, no. 9, pp. 1318–1332, Sep. 1991. [106] B. H. Gebreslassie and U. M. Diwekar, “Efficient ant colony optimization for computer aided molecular design: Case study solvent selection problem,”
Computers & Chemical Engineering , vol. 78, pp. 1–9, Jul. 2015. [107] P. M. Harper, R. Gani, P. Kolar, and T. Ishikawa, “Computer-aided molecular design with combined molecular modeling and group contribution,”
Fluid Phase Equilibria , vol. 158–160, pp. 337–347, Jun. 1999. [108] K.-J. Kim and U. M. Diwekar, “Efficient Combinatorial Optimization under Uncertainty. 2. Application to Stochastic Solvent Selection,”
Industrial & Engineering Chemistry Research , vol. 41, no. 5, pp. 1285–1296, Mar. 2002. [109] E. C. Marcoulaki and A. C. Kokossis, “Molecular design synthesis using stochastic optimisation as a tool for scoping and screening,”
Computers & Chemical Engineering , vol. 22, pp. S11–S18, Mar. 1998. [110] J. E. Ourique and A. Silva Telles, “Computer-aided molecular design with simulated annealing and molecular graphs,”
Computers & Chemical Engineering , vol. 22, pp. S615–S618, Mar. 1998. [111] J. Scheffczyk, L. Fleitmann, A. Schwarz, M. Lampe, A. Bardow, and K. Leonhard, “COSMO-CAMD: A framework for optimization-based computer-aided molecular design using COSMO-RS,”
Chemical Engineering Science , vol. 159, pp. 84–92, Feb. 2017. [112] W. M. Brown, S. Martin, M. D. Rintoul, and J.-L. Faulon, “Designing Novel Polymers with Targeted Properties Using the Signature Molecular Descriptor,”
Journal of Chemical Information and Modeling , vol. 46, no. 2, pp. 826–835, Mar. 2006. [113] N. Pavurala and L. E. K. Achenie, “A mechanistic approach for modeling oral drug delivery,”
Computers & Chemical Engineering , vol. 57, pp. 196–206, Oct. 2013. [114] Y. Wang and L. E. K. Achenie, “Computer aided solvent design for extractive fermentation,”
Fluid Phase Equilibria , vol. 201, no. 1, pp. 1–18, Aug. 2002. [115] R. Gani, C. Jiménez-González, and D. J. C. Constable, “Method for selection of solvents for promotion of organic reactions,”
Computers & Chemical Engineering , vol. 29, no. 7, pp. 1661–1676, Jun. 2005. [116]
M. Folić, C. S. Adjiman, and E. N. Pistikopoulos, “Design of solvents for optimal reaction rate constants,”
AIChE Journal , vol. 53, no. 5, pp. 1240–1256, May 2007. [117] M. Folić, C. S. Adjiman, and E. N. Pistikopoulos, “Computer-Aided Solvent Design for Reactions: Maximizing Product Formation,” Industrial & Engineering Chemistry Research , vol. 47, no. 15, pp. 5190–5202, Aug. 2008. [118] H. Struebing, Z. Ganase, P. G. Karamertzanis, E. Siougkrou, P. Haycock, P. M. Piccione, A. Armstrong, A. Galindo, and C. S. Adjiman, “Computer-aided molecular design of solvents for accelerated reaction kinetics,”
Nature Chemistry , vol. 5, no. 11, pp. 952–957, Nov. 2013. [119] T. Zhou, Z. Lyu, Z. Qi, and K. Sundmacher, “Robust design of optimal solvents for chemical reactions—A combined experimental and computational strategy,”
Chemical Engineering Science , vol. 137, pp. 613–625, Dec. 2015. [120] N. Churi and L. E. K. Achenie, “Novel Mathematical Programming Model for Computer Aided Molecular Design,”
Industrial & Engineering Chemistry Research , vol. 35, no. 10, pp. 3788–3794, Jan. 1996. [121] K. G. Joback, “Designing molecules possessing desired physical property values,” Massachusetts Institute of Technology, 1989. [122] A. Samudra and N. Sahinidis, “Design of Secondary Refrigerants,” in
Design for Energy and the Environment , Jun. 2009, pp. 879–886. [123] E. Conte, R. Gani, and K. M. Ng, “Design of formulated products: A systematic methodology,”
AIChE Journal , vol. 57, no. 9, pp. 2431–2449, Sep. 2011. [124] F. E. Pereira, E. Keskes, A. Galindo, G. Jackson, and C. S. Adjiman, “Integrated solvent and process design using a SAFT-VR thermodynamic description: High-pressure separation of carbon dioxide and methane,”
Computers & Chemical Engineering , vol. 35, no. 3, pp. 474–491, Mar. 2011. [125] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh, “Machine learning for molecular and materials science,”
Nature , vol. 559, no. 7715, pp. 547–555, Jul. 2018. [126] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning,”
Physical Review Letters , vol. 108, no. 5, p. 058301, Jan. 2012. [127] MDL Information Systems, Inc., “MACCS keys,” San Leandro, CA. [128] D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,”
Journal of Cheminformatics , vol. 7, no. 1, p. 23, Dec.
Journal of Cheminformatics , vol. 7, no. 1, p. 23, Dec. 1 2015. [130] W. Jin, R. Barzilay, and T. Jaakkola, “Junction Tree Variational Autoencoder for Molecular Graph Generation,” , Feb. 2018. [131] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley, “Molecular graph convolutions: moving beyond fingerprints,”
Journal of Computer-Aided Molecular Design , vol. 30, no. 8, pp. 595–608, Aug. 2016. [132] A. C. Mater and M. L. Coote, “Deep Learning in Chemistry,”
Journal of Chemical Information and Modeling , vol. 59, no. 6, pp. 2545–2559, Jun. 2019. [133] D. C. Elton, Z. Boukouvalas, M. D. Fuge, and P. W. Chung, “Deep learning for molecular design—a review of the state of the art,”
Molecular Systems Design & Engineering , vol. 4, no. 4, pp. 828–849, 2019. [134] G. Landrum, “RDKit: Open-source Cheminformatics,”
http://www.rdkit.org/ , 2006.
Journal of Cheminformatics , vol. 3, no. 1, p. 33, Dec. 2011. [136] B. P. Kelley, S. P. Brown, G. L. Warren, and S. W. Muchmore, “POSIT: Flexible Shape-Guided Docking For Pose Prediction,”
Journal of Chemical Information and Modeling , vol. 55, no. 8, pp. 1771–1780, Aug. 2015. [137] T. Sterling and J. J. Irwin, “ZINC 15 – Ligand Discovery for Everyone,”
Journal of Chemical Information and Modeling , vol. 55, no. 11, pp. 2324–2337, Nov. 2015. [138] G. Papadatos, M. Davies, N. Dedman, J. Chambers, A. Gaulton, J. Siddle, R. Koks, S. A. Irvine, J. Pettersson, N. Goncharoff, A. Hersey, and J. P. Overington, “SureChEMBL: a large-scale, chemically annotated patent document database,”
Nucleic Acids Research , vol. 44, no. D1, pp. D1220–D1228, Jan. 2016. [139] eMolecules, “eMolecules Database,”
La Jolla, CA , 2020. [140] A. Gaulton, A. Hersey, M. Nowotka, A. P. Bento, J. Chambers, D. Mendez, P. Mutowo, F. Atkinson, L. J. Bellis, E. Cibrián-Uhalte, M. Davies, N. Dedman, A. Karlsson, M. P. Magariños, J. P. Overington, G. Papadatos, I. Smit, and A. R. Leach, “The ChEMBL database in 2017,”
Nucleic Acids Research , vol. 45, no. D1, pp. D945–D954, Jan. 2017. [141] P. Banerjee, J. Erehman, B.-O. Gohlke, T. Wilhelm, R. Preissner, and M. Dunkel, “Super Natural II—a database of natural products,”
Nucleic Acids Research , vol. 43, no. D1, pp. D935–D939, Jan. 2015. [142] G. B. Goh, N. O. Hodas, and A. Vishnu, “Deep learning for computational chemistry,”
Journal of Computational Chemistry , vol. 38, no. 16, pp. 1291–1307, Jun. 2017. [143] E. Bjerrum and B. Sattarov, “Improving Chemical Autoencoder Latent Space and Molecular De Novo Generation Diversity with Heteroencoders,”
Biomolecules , vol. 8, no. 4, p. 131, Oct. 2018. [144] S. Heller, A. McNaught, S. Stein, D. Tchekhovskoi, and I. Pletnev, “InChI - the worldwide chemical structure identifier standard,”
Journal of Cheminformatics , vol. 5, no. 1, p. 7, Dec. 2013. [145] R. Winter, F. Montanari, F. Noé, and D.-A. Clevert, “Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations,”
Chemical Science , vol. 10, no. 6, pp. 1692–1701, 2019. [146] K. K. Yalamanchi, V. C. O. van Oudenhoven, F. Tutino, M. Monge-Palacios, A. Alshehri, X. Gao, and S. M. Sarathy, “Machine Learning To Predict Standard Enthalpy of Formation of Hydrocarbons,”
The Journal of Physical Chemistry A , vol. 123, no. 38, pp. 8305–8313, Sep. 2019. [147] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , Jun. 2016, pp. 770–778. [148] G. B. Goh, C. Siegel, A. Vishnu, N. O. Hodas, and N. Baker, “Chemception: A Deep Neural Network with Minimal Chemistry Knowledge Matches the Performance of Expert-developed QSAR/QSPR Models,”
Stat , Jun. 2017. [149] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia, “Learning Deep Generative Models of Graphs,”
CoRR , Mar. 2018. [150] N. De Cao and T. Kipf, “MolGAN: An implicit generative model for small molecular graphs,”
Stat , May 2018. [151] J. L. Durant, B. A. Leland, D. R. Henry, and J. G. Nourse, “Reoptimization of MDL keys for use in drug discovery,”
Journal of Chemical Information and Computer Sciences , 2002. [152] O. Méndez-Lucio, B. Baillif, D.-A. Clevert, D. Rouquié, and J. Wichard, “De novo generation of hit-like molecules from gene expression signatures using artificial intelligence,”
Nature Communications , vol. 11, no. 1, p. 10, Dec. 2020. [153] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, “Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error,”
Journal of Chemical Theory and Computation , vol. 13, no. 11, pp. 5255–5264, Nov. 2017. [154] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional Networks on Graphs for Learning Molecular Fingerprints,” in
Advances in Neural Information Processing Systems , Sep. 2015, pp. 2224–2232. [155] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, O. A. Von Lilienfeld, and K. R. Müller, “Learning invariant representations of molecules for atomization energy prediction,” in
Advances in Neural Information Processing Systems , 2012, pp. 440–448. [156] D. Kuzminykh, D. Polykovskiy, A. Kadurin, A. Zhebrak, I. Baskov, S. Nikolenko, R. Shayakhmetov, and A. Zhavoronkov, “3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks,”
Molecular Pharmaceutics , vol. 15, no. 10, pp. 4378–4385, Oct. 2018. [157] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley, “Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds,”
CoRR , Feb. 2018. [158] I. Goodfellow, Y. Bengio, and A. Courville,
Deep Learning . MIT Press, 2016. [159] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,”
IEEE Transactions on Neural Networks , vol. 5, no. 2, pp. 157–166, Mar. 1994. [160] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling,” arXiv , Dec. 2014. [161] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,”
Neural Computation , vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [162] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the Properties of Neural Machine Translation: Encoder–Decoder Approaches,” in
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation , 2014, pp. 103–111. [163] D. Neil, M. Segler, L. Guasch, M. Ahmed, D. Plumbley, M. Sellwood, and N. Brown, “Exploring deep recurrent models with reinforcement learning for molecule design,” 2018. [164] M. H. S. Segler, T. Kogej, C. Tyrchan, and M. P. Waller, “Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks,”
ACS Central Science , vol. 4, no. 1, pp. 120–131, Jan. 2018. [165] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, “Molecular de-novo design through deep reinforcement learning,”
Journal of Cheminformatics , vol. 9, no. 1, p. 48, Dec. 2017. [166] A. Joulin and T. Mikolov, “Inferring algorithmic patterns with stack-augmented recurrent nets,” in
Advances in Neural Information Processing Systems , 2015, pp. 190–198. [167] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” arXiv , Dec. 2013. [168] Y. Le Cun and F. Fogelman-Soulié, “Modèles connexionnistes de l’apprentissage,”
Intellectica. Revue de l’Association pour la Recherche Cognitive , vol. 2, no. 1, pp. 114–143, 1987. [169] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato, “Grammar Variational Autoencoder,” CoRR , Mar. 2017. [170] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-Supervised Learning with Deep Generative Models,”
CoRR , vol. abs/1406.5, 2014. [171] A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow, “Adversarial Autoencoders,”
CoRR , vol. abs/1511.0, 2015. [172] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” 2014. [173] H. Kajino, “Molecular Hypergraph Grammar with its Application to Molecular Optimization,”
CoRR , vol. abs/1809.0, Sep. 2018. [174] Q. Liu, M. Allamanis, M. Brockschmidt, and A. L. Gaunt, “Constrained graph variational autoencoders for molecule design,” 2018. [175] W. Jin, K. Yang, R. Barzilay, and T. Jaakkola, “Learning multimodal graph-to-graph translation for molecular optimization,” 2019. [176] S. Kang and K. Cho, “Conditional Molecular Design with Deep Generative Models,”
Journal of Chemical Information and Modeling , vol. 59, no. 1, pp. 43–52, Jan. 2019. [177] D. Polykovskiy, A. Zhebrak, D. Vetrov, Y. Ivanenkov, V. Aladinskiy, P. Mamoshina, M. Bozdaganyan, A. Aliper, A. Zhavoronkov, and A. Kadurin, “Entangled Conditional Adversarial Autoencoder for de Novo Drug Discovery,”
Molecular Pharmaceutics , vol. 15, no. 10, pp. 4398–4405, Oct. 2018. [178] D. Bouchacourt, R. Tomioka, and S. Nowozin, “Multi-level variational autoencoder: Learning disentangled representations from grouped observations,” 2018. [179] A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov, “druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico,”
Molecular Pharmaceutics , vol. 14, no. 9, pp. 3098–3104, Sep. 2017. [180] B. Samanta, A. De, G. Jana, P. K. Chattaraj, N. Ganguly, and M. Gomez-Rodriguez, “NeVAE: A Deep Generative Model for Molecular Graphs,”
CoRR , Feb. 2018. [181] J. You, B. Liu, R. Ying, V. Pande, and J. Leskovec, “Graph convolutional policy network for goal-directed molecular graph generation,” 2018. [182] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” 2016. [183] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,”
Stat , Jan. 2017. [184] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,”
CoRR , vol. abs/1609.0, 2016. [185] B. Sanchez-Lengeling, C. Outeiral, G. L. Guimaraes, and A. Aspuru-Guzik, “Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC),”
ChemRxiv , 2017. [186] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis, “Hybrid computing using a neural network with dynamic external memory,”
Nature , vol. 538, no. 7626, pp. 471–476, Oct. 2016. [187] E. Putin, A. Asadulaev, Y. Ivanenkov, V. Aladinskiy, B. Sanchez-Lengeling, A. Aspuru-Guzik, and A. Zhavoronkov, “Reinforced Adversarial Neural Computer for de Novo Molecular Design,”
Journal of Chemical Information and Modeling , vol. 58, no. 6, pp. 1194–1204, Jun. 2018. [188] E. Putin, A. Asadulaev, Q. Vanhaelen, Y. Ivanenkov, A. V. Aladinskaya, A. Aliper, and A. Zhavoronkov, “Adversarial Threshold Neural Computer for Molecular de Novo Design,”
Molecular Pharmaceutics , vol. 15, no. 10, pp. 4386–4397, Oct. 2018. [189] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. Farias, and A. Aspuru-Guzik, “Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models,” Stat , May 2017. [190]
Ł. Maziarka, A. Pocha, J. Kaczmarczyk, K. Rataj, T. Danel, and M. Warchoł, “Mol-CycleGAN: a generative model for molecular optimization,”
Journal of Cheminformatics , vol. 12, no. 1, p. 2, Dec. 2020. [191] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct. 2017, pp. 2242–2251. [192] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, “An Introduction to Deep Reinforcement Learning,”
Foundations and Trends® in Machine Learning , vol. 11, no. 3–4, pp. 219–354, 2018. [193] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,”
CoRR , vol. abs/1707.0, 2017. [194] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,”
Machine Learning , vol. 8, no. 3–4, pp. 229–256, May 1992. [195] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” 2016. [196] H. Ikebata, K. Hongo, T. Isomura, R. Maezono, and R. Yoshida, “Bayesian molecular design with a chemical language model,”
Journal of Computer-Aided Molecular Design , vol. 31, no. 4, pp. 379–391, Apr. 2017. [197] J. Lim, S. Ryu, J. W. Kim, and W. Y. Kim, “Molecular generative model based on conditional variational autoencoder for de novo molecular design,”
Journal of Cheminformatics , vol. 10, no. 1, p. 31, Dec. 2018. [198] R.-R. Griffiths and J. M. Hernández-Lobato, “Constrained Bayesian optimization for automatic chemical design using variational autoencoders,”
Chemical Science , vol. 11, no. 2, pp. 577–586, 2020. [199] D. Sculley, J. Snoek, A. Rahimi, and A. Wiltschko, “Winner’s curse? On pace, progress, and empirical rigor,” 2018. [200] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” 2018. [201] M. Lucic, K. Kurach, M. Michalski, O. Bousquet, and S. Gelly, “Are GANs created equal? A large-scale study,” 2018. [202] G. Melis, C. Dyer, and P. Blunsom, “On the state of the art of evaluation in neural language models,” 2018. [203] K. Preuer, P. Renz, T. Unterthiner, S. Hochreiter, and G. Klambauer, “Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery,”
Journal of Chemical Information and Modeling , vol. 58, no. 9, pp. 1736–1741, Sep. 2018. [204] D. Polykovskiy, A. Zhebrak, B. Sanchez-Lengeling, S. Golovanov, O. Tatanov, S. Belyaev, R. Kurbanov, A. Artamonov, V. Aladinskiy, M. Veselov, A. Kadurin, S. I. Nikolenko, A. Aspuru-Guzik, and A. Zhavoronkov, “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models,”
CoRR , vol. abs/1811.1, 2018. [205] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, “GuacaMol: Benchmarking Models for de Novo Molecular Design,”
Journal of Chemical Information and Modeling , vol. 59, no. 3, pp. 1096–1108, Mar. 2019. [206] J. Degen, C. Wegscheid-Gerlach, A. Zaliani, and M. Rarey, “On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces,”
ChemMedChem , vol. 3, no. 10, pp. 1503–1507, Oct. 2008. [207] G. W. Bemis and M. A. Murcko, “The Properties of Known Drugs. 1. Molecular Frameworks,”
Journal of Medicinal Chemistry , vol. 39, no. 15, pp. 2887–2893, Jan. 1996. [208] M. Benhenda, “ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity?,”
CoRR , Aug. 2017. [209] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” 2017. [210] S. Kullback and R. A. Leibler, “On Information and Sufficiency,”
The Annals of Mathematical Statistics , vol. 22, no. 1, pp. 79–86, Mar. 1951. [211] J. H. Jensen, “A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space,”
Chemical Science , vol. 10, no. 12, pp. 3567–3572, 2019. [212] B. Amos, L. Xu, and J. Z. Kolter, “Input Convex Neural Networks,” in
Proceedings of the 34th International Conference on Machine Learning , 2017, vol. 70, pp. 146–155. [213] L. Zhang, D. K. Babi, and R. Gani, “New Vistas in Chemical Product and Process Design,”
Annual Review of Chemical and Biomolecular Engineering , vol. 7, no. 1, pp. 557–582, Jun. 2016. [214] L. Zhang, H. Mao, L. Liu, J. Du, and R. Gani, “A machine learning based computer-aided molecular design/screening methodology for fragrance molecules,”
Computers & Chemical Engineering , vol. 115, pp. 295–308, 2018. [215] J. Noh, J. Kim, H. S. Stein, B. Sanchez-Lengeling, J. M. Gregoire, A. Aspuru-Guzik, and Y. Jung, “Inverse Design of Solid-State Materials via a Continuous Representation,”
Matter , vol. 1, no. 5, pp. 1370–1384, Nov. 2019. [216] Z. Yao, B. Sanchez-Lengeling, N. S. Bobbitt, B. J. Bucior, S. G. H. Kumar, S. P. Collins, T. Burns, T. K. Woo, O. Farha, R. Q. Snurr, and A. Aspuru-Guzik, “Inverse Design of Nanoporous Crystalline Reticular Materials with Deep Generative Models,”
ChemRxiv , 2020. [217] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, “Theory-Guided Data Science: A New Paradigm for Scientific Discovery from Data,”
IEEE Transactions on Knowledge and Data Engineering , vol. 29, no. 10, pp. 2318–2331, Oct. 2017. [218] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,”
Journal of Computational Physics , vol. 378, pp. 686–707, Feb. 2019. [219] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,”
Nature Communications , vol. 8, no. 1, p. 13890, Apr. 2017. [220] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem, “Challenging common assumptions in the unsupervised learning of disentangled representations,” in Proceedings of the 36th International Conference on Machine Learning , vol. 2019-June, pp. 7247–7283, 2019. [221] E. A. Carter, “Challenges in Modeling Materials Properties Without Experimental Input,”
Science , vol. 321, no. 5890, pp. 800–803, Aug. 2008. [222] C. Chen, Y. Zuo, W. Ye, X. Li, Z. Deng, and S. P. Ong, “A Critical Review of Machine Learning of Energy Materials,”
Advanced Energy Materials , vol. 10, no. 8, p. 1903242, Feb. 2020. [223] V. Kaushal, R. Iyer, S. Kothawade, R. Mahadev, K. Doctor, and G. Ramakrishnan, “Learning from less data: A unified data subset selection and active learning framework for computer vision,” 2019. [224] C. Chen, W. Ye, Y. Zuo, C. Zheng, and S. P. Ong, “Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals,”
Chemistry of Materials , vol. 31, no. 9, pp. 3564–3572, May 2019. [225] D. Lowe, “Chemical reactions from US patents (1976-Sep2016),” 2017. [226] F. H. Allen, “The Cambridge Structural Database: a quarter of a million crystal structures and rising,”
Acta Crystallographica Section B Structural Science , vol. 58, no. 3, pp. 380–388, Jun. 2002. [227] J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton, “Materials Design and Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials Database (OQMD),”
JOM , vol. 65, no. 11, pp. 1501–1509, Nov. 2013. [228] S. K. Burley, H. M. Berman, C. Bhikadiya, C. Bi, L. Chen, L. Di Costanzo, C. Christie, J. M. Duarte, S. Dutta, Z. Feng, S. Ghosh, D. S. Goodsell, R. K. Green, V. Guranovic, D. Guzenko, B. P. Hudson, Y. Liang, R. Lowe, E. Peisach, I. Periskova, A. Prlić, C. Randle, A.
Rose, P. Rose, R. Sala, M. Sekharan, C. Shao, L. Tan, Y. Tao, Y. Valasatava, M. Voigt, J. Westbrook, J. Woo, H. Yang, J. Young, M. Zhuravleva, and C. Zardecki, “Protein Data Bank: the single global archive for 3D macromolecular structure data,”
Nucleic Acids Research , vol. 47, no. D1, pp. D520–D528, Jan. 2019. [229] Y. Murakami, S. Omori, and K. Kinoshita, “NLDB: a database for 3D protein–ligand interactions in enzymatic reactions,”
Journal of Structural and Functional Genomics , vol. 17, no. 4, pp. 101–110, Dec. 2016. [230] S. Otsuka, I. Kuwajima, J. Hosoya, Y. Xu, and M. Yamazaki, “PoLyInfo: Polymer Database for Polymeric Materials Design,” in Proceedings of the 2011 International Conference on Emerging Intelligent Data and Web Technologies , Sep. 2011, pp. 22–29. [231] Q. Dong, C. D. Muzny, A. Kazakov, V. Diky, J. W. Magee, J. A. Widegren, R. D. Chirico, K. N. Marsh, and M. Frenkel, “ILThermo: A Free-Access Web Database for Thermodynamic Properties of Ionic Liquids,”
Journal of Chemical & Engineering Data , vol. 52, no. 4, pp. 1151–1159, Jul. 2007. [232] D. S. Wishart, Y. D. Feunang, A. C. Guo, E. J. Lo, A. Marcu, J. R. Grant, T. Sajed, D. Johnson, C. Li, Z. Sayeeda, N. Assempour, I. Iynkkaran, Y. Liu, A. Maciejewski, N. Gale, A. Wilson, L. Chin, R. Cummings, D. Le, D. Le, A. Pon, C. Knox, and M. Wilson, “DrugBank 5.0: a major update to the DrugBank database for 2018,”
Nucleic Acids Research , vol. 46, no. D1, pp. D1074–D1082, Jan. 2018. [233] R. Oughtred, C. Stark, B.-J. Breitkreutz, J. Rust, L. Boucher, C. Chang, N. Kolas, L. O’Donnell, G. Leung, R. McAdam, F. Zhang, S. Dolma, A. Willems, J. Coulombe-Huntington, A. Chatr-aryamontri, K. Dolinski, and M. Tyers, “The BioGRID interaction database: 2019 update,”
Nucleic Acids Research , vol. 47, no. D1, pp. D529–D541, Jan. 2019. [234] O. A. von Lilienfeld, “First principles view on chemical compound space: Gaining rigorous atomistic control of molecular properties,”
International Journal of Quantum Chemistry , vol. 113, no. 12, pp. 1676–1689, Jun. 2013. [235] A. Kumar, P. Sattigeri, and A. Balakrishnan, “Variational inference of disentangled latent concepts from unlabeled observations,” 2018. [236] F. Noé, S. Olsson, J. Köhler, and H. Wu, “Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning,”
Science , vol. 365, no. 6457, p. eaaw1147, Sep. 2019. [237] K. Madhawa, K. Ishiguro, K. Nakago, and M. Abe, “GraphNVP: An Invertible Flow Model for Generating Molecular Graphs,”
CoRR , May 2019. [238] D. Schwalbe-Koda and R. Gómez-Bombarelli, “Generative Models for Automatic Chemical Design,”
CoRR , vol. abs/1907.0, 2019. [239] R. Vaidyanathan and M. El-Halwagi, “Computer-Aided Synthesis of Polymers and Blends with Target Properties,”
Industrial & Engineering Chemistry Research , vol. 35, no. 2, pp. 627–634, Jan. 1996. [240]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I.
Polosukhin, “Attention is all you need,” 2017. [241] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” 2014. [242] I. Jonyer, L. B. Holder, and D. J. Cook, “MDL-Based Context-Free Graph Grammar Induction and Applications,”
International Journal on Artificial Intelligence Tools , vol. 13, no. 01, pp. 65–79, Mar. 2004. [243] S. F. Chen, “Bayesian grammar induction for language modeling,” in
Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics , 1995, pp. 228–235. [244] S. Subramanian, S. Rajeswar, F. Dutil, C. Pal, and A. Courville, “Adversarial Generation of Natural Language,” in
Proceedings of the 2nd Workshop on Representation Learning for NLP , 2017, pp. 241–251. [245] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing, “Toward controlled generation of text,” 2017. [246] C. Finn, P. F. Christiano, P. Abbeel, and S. Levine, “A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models,”
CoRR , vol. abs/1611.0, 2016. [247] D. Pfau and O. Vinyals, “Connecting Generative Adversarial Networks and Actor-Critic Methods,”
CoRR , vol. abs/1610.0, 2016. [248] H. Mossalam, Y. M. Assael, D. M. Roijers, and S. Whiteson, “Multi-Objective Deep Reinforcement Learning,”
CoRR , vol. abs/1610.0, 2016. [249] N. Anand and P. S. Huang, “Generative modeling for protein structures,” 2018. [250] E. C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, and G. M. Church, “Unified rational protein engineering with sequence-based deep representation learning,”
Nature Methods , vol. 16, no. 12, pp. 1315–1322, Dec. 2019. [251] F. Yu, X. Wu, Y. Sun, and L. Duan, “Exploiting Images for Video Recognition with Hierarchical Generative Adversarial Networks,” in
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence , Jul. 2018, pp. 1107–1113. [252] M. Segù, A. Loquercio, and D. Scaramuzza, “A General Framework for Uncertainty Estimation in Deep Learning,”
CoRR , vol. abs/1907.06890, 2019. [253] L. Hirschfeld, K. Swanson, K. Yang, R. Barzilay, and C. W. Coley, “Uncertainty Quantification Using Neural Networks for Molecular Property Prediction,” CoRR , vol. abs/2005.10036, 2020. [254] C. W. Coley, W. H. Green, and K. F. Jensen, “Machine Learning in Computer-Aided Synthesis Planning,”
Accounts of Chemical Research , vol. 51, no. 5, pp. 1281–1289, May 2018. [255] L. M. Roch, F. Häse, C. Kreisbeck, T. Tamayo-Mendoza, L. P. E. Yunker, J. E. Hein, and A. Aspuru-Guzik, “ChemOS: Orchestrating autonomous experimentation,”
Science Robotics , vol. 3, no. 19, 2018. [256] D. P. Tabor, L. M. Roch, S. K. Saikin, C. Kreisbeck, D. Sheberla, J. H. Montoya, S. Dwaraknath, M. Aykol, C. Ortiz, H. Tribukait, C. Amador-Bedolla, C. J. Brabec, B. Maruyama, K. A. Persson, and A. Aspuru-Guzik, “Accelerating the discovery of materials for clean energy in the era of smart automation,”
Nature Reviews Materials , vol. 3, no. 5, pp. 5–20, May 2018.