Autonomous discovery in the chemical sciences part I: Progress
Connor W. Coley∗†, Natalie S. Eyke∗, Klavs F. Jensen∗‡

Keywords: automation, chemoinformatics, drug discovery, machine learning, materials science

∗ Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139
† [email protected]
‡ [email protected]

Contents

5.3 Noniterative discovery of chemical processes
  5.3.1 Discovery of new synthetic pathways
  5.3.2 Discovering models of chemical reactivity
  5.3.3 Discovery of new chemical reactions from experimental screening
5.4 Iterative discovery of chemical processes
  5.4.1 Discovery of optimal synthesis conditions
  5.4.2 Discovery of new chemical reactions through an active search
5.5 Noniterative discovery of structure-property models
  5.5.1 Discovery of important molecular features
  5.5.2 Discovery of models for spectral analysis
  5.5.3 Discovery of potential energy surfaces and functionals
  5.5.4 Discovery of models for phase behavior
5.6 Noniterative discovery of new physical matter
  5.6.1 Discovery through brute-force experimentation
  5.6.2 Discovery through computational screening
  5.6.3 Discovery through molecular generation
5.7 Iterative discovery of new physical matter
  5.7.1 Discovery for pharmaceutical applications
  5.7.2 Discovery for materials applications
5.8 Brief summary of discovery in other domains

Abstract
This two-part review examines how automation has contributed to different aspects of discovery in the chemical sciences. In this first part, we describe a classification for discoveries of physical matter (molecules, materials, devices), processes, and models and how they are unified as search problems. We then introduce a set of questions and considerations relevant to assessing the extent of autonomy. Finally, we describe many case studies of discoveries accelerated by or resulting from computer assistance and automation from the domains of synthetic chemistry, drug discovery, inorganic chemistry, and materials science. These illustrate how rapid advancements in hardware automation and machine learning continue to transform the nature of experimentation and modelling.

Part two reflects on these case studies and identifies a set of open challenges for the field.
Introduction

The prospect of a robotic scientist has long been an object of curiosity, optimism, skepticism, and job-loss fear, depending on who is asked. As computing was becoming mainstream, excitement grew around the potential for logic and reasoning, the underpinnings of the scientific process, to be codified into computer programs; as hardware automation became more robust and cost effective, excitement grew around the potential for a universal synthesis platform to enhance the work of human chemists in the lab; and as data availability and statistical analysis/inference techniques improved, excitement grew around the potential for statistical models (machine learning included) to draw new insights from vast quantities of chemical information [1–7].

The confluence of these factors makes that prospect increasingly realistic. In organic chemistry, we have already seen proof-of-concept examples of the “robo-chemist” [8] able to intelligently select and conduct experiments [9–11]; there have even been strides made toward a universal synthesis platform [12–15], theoretically capable of executing most chemical processes but highly constrained in practice. While there have been fewer success stories in automating drug discovery holistically [16], the excitement around machine learning in this application space is especially apparent, with dozens of start-up companies promising to revolutionize the development of new medicines through artificial intelligence [17].

A more pessimistic view of automated discovery is that machines will never be able to make real “revolutions” in science because they necessarily operate within a specific set of instructions [18]. This attitude is exemplified by
Lady Lovelace’s objection: “The Analytical Engine has no pretensions to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths” [1]. Some have expressed a milder sentiment, perhaps in light of advances in computing, cautioning that an increasing reliance on robotic tools might reduce the odds of a serendipitous discovery [6]. Muggleton is more declarative, stating that “science is an essentially human activity that requires clarity both in the statement of hypotheses and their clear and undeniable refutation through experimentation” [19]. However, there is little disagreement that automation and computation in science have improved productivity through efficiency, reduction of error, and the ability to address large-scale problems [20].

In the remainder of Part 1, we will discuss the different types of discovery typically reported in the chemical sciences and how they can be unified as searches in a high-dimensional design space. Along with this definition comes a recommended set of questions to ask when evaluating the extent to which a discovery can be attributed to automation or autonomy. We will then discuss a number of case studies arranged in terms of the type of discovery being pursued and the nature of the approach used to do so. Part 2 will reflect on these case studies and make explicit what we believe to be the primary obstacles to autonomous discovery.
There is no single definition of what constitutes a scientific discovery. Valdés-Pérez defines discovery as “the generation of novel, interesting, plausible, and intelligible knowledge” [5]. Data-driven knowledge discovery, specifically, has been defined as the “nontrivial extraction of implicit, previously unknown, and potentially useful information” [21]. Each of these criteria, however, is inherently subjective. “Novel” is simultaneously ambiguous and considered distinct from “new”; it is generally meant to indicate some level of nonobviousness or, by one definition, a lack of predictability [22]. However, if we artificially limit what we consider to be known and demonstrate a successful extrapolation to a conclusion that really was known, it would be reasonable to argue that this does not constitute a discovery. This connects to the question of what it might mean for a discovery to be “interesting” or “useful”, for which we avoid providing a precise definition.

For the purposes of this review, we instead define three broad types of discoveries in the chemical sciences (Figure 1) and provide examples of each.
Physical matter.
Often, the ultimate result of a discovery campaign is the identification of a molecule (not discounting macromolecules), material, or device that achieves a desired function. This category encompasses most drug discovery efforts, where the output may be new chemical matter that could later become part of a therapeutic, as well as materials discovery for a wide array of applications.

Figure 1: The three broad categories of discovery described in this review: physical matter, processes, and models.
Processes.
Discoveries may also take the form of processes. These may be abstract, like the Haber-Bosch process, pasteurization, and directed evolution. They are more often concrete, like synthetic routes to organic molecules or a specific set of reaction conditions to achieve a chemical transformation.
Models.
Our definition of a model includes empirical models (such as those obtained through regression of experimental data), structure-function relationships, symbolic regressions, natural laws, and even conceptual models that provide mechanistic understanding. It is common for models to be part of the discovery of the other two types as surrogates for experiments, as will be seen in many examples below.

The most famous examples of scientific discoveries in chemistry tend to be natural laws or theories that are able to rationalize observed phenomena that previous theories could not: Mendeleev’s periodic table of the elements, Thomson’s discovery of the electron, Rutherford’s discovery of atomic nuclei, the Schrödinger equation, Kekulé’s structure of benzene, et cetera. In their time, these represented radical departures from previous frameworks. Though we do consider these to be models, identifying them through computational or algorithmic approaches would require substantially more open-ended hypothesis generation than what is currently possible.

Discovery as a search
We argue that the process of scientific discovery can always be thought of as a search problem, regardless of the nature of that discovery [7, 23, 24].

Molecular discovery is a search within “chemical space” [25–28], an enormous combinatorial design space of theoretically possible molecules. A common estimate of its size, considering only molecules of a limited size made up of CHONS atoms, is 10^60 [29], although for any one application or with reasonable restrictions (e.g., on drug-likeness or synthetic accessibility), the size of the relevant chemical space will be significantly smaller [30, 31]. Biological compounds exist in an even larger space if one considers that there are, e.g., 20^100 theoretically possible 100-peptide proteins using only canonical amino acids, although again the number that are foldable and biologically relevant will be significantly smaller. Materials discovery is another combinatorial design space, where structural composition must be defined by both discrete variables (e.g., elemental identities) and continuous variables (e.g., stoichiometric ratios) as well as processing conditions. The design space for a device is even larger, as it compounds the complexity of its constituent components with additional considerations about its geometry.

Discovering a chemical or physical process is the result of searching a design space defined by process variables and/or sequences of operations. For example, optimizing a chemical reaction for its yield might involve changing species’ concentrations, the reaction temperature, and the residence time [32].
It may also include selecting the identity of a catalyst as a discrete variable [33], or changing the order of addition [34]. A new research workflow can be thought of as the identification of actions to be taken and their timing, such as the development of split-and-pool combinatorial chemistry for diversity-oriented synthesis [35] or a screening and selection strategy for directed evolution [36].

The majority of models that are “discovered”, under our broad definition, are empirical relationships that come from data fitting. In these cases, the search space is well-defined once an input representation (e.g., a set of descriptors or parameters) and a model family (e.g., a linear model, a deep neural network) are selected. While this can present a massive search space when considering all possible values of all learned parameters (e.g., for deep learning regression techniques), the final model is often the result of a simplified, local search from a random initialization (e.g., using stochastic gradient descent). Symbolic regressions are searches in a combinatorial space of input variables and mathematical operations [37]. More abstract models, like mechanistic explanations of natural phenomena, exist in a high-dimensional hypothesis space that is difficult to formalize; automated discovery tools that are able to generate causal explanations do so using simplified terminology and well-defined ontologies [38].

In virtually every case of computer-assisted discovery, the actual search space is significantly larger than what the program or platform is allowed to explore.
We might decide to focus our attention on a specific set of compounds (e.g., a fixed scaffold), a specific class of materials (e.g., perovskites), a specific step in a catalyst synthesis process with a finite number of tunable process variables (e.g., the temperature and time of an annealing step), or a specific hypothesis structure (e.g., categorizing a ligand’s effect on a protein as an agonist, antagonist, promoter, etc.). Constraining the search space is one way of integrating domain expertise/intuition into the discovery process. Moreover, it can greatly simplify the search process and mitigate the practical challenges of automated validation and feedback.

The way that we navigate the search space in a discovery effort is often iterative. Classically, the discovery of physical matter, such as in lead optimization for drug discovery, is divided into stages of design, make, test. An analogous cycle for searching hypothesis space could be described as hypothesize, validate, revise beliefs.

Figure 2: Simplified schematic of a hypothesis-driven (or model-driven) discovery process. When not proceeding iteratively, new information is not used to revise our belief (current knowledge). Lowercase roman numerals (red) correspond to the questions for assessing autonomy in discovery.

This third step, test or revise beliefs, helps to explain the role of validation and feedback in discovery: experiments, physical or computational, serve to support or refute hypotheses. When information is imperfect or insufficient to lead to a confident prediction, it is important to collect new information to improve our understanding of the problem. This might mean taking an empirical regression fit to a small number of data points, evaluating our uncertainty, and performing follow-up experiments to reduce our uncertainty in regions where we would like to have a more confident prediction (Figure 2).
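As a minimal, hypothetical sketch of this idea (not taken from any system reviewed here), the snippet below fits an ensemble of least-squares lines to bootstrap resamples of a small data set and proposes the candidate input where the ensemble disagrees most, i.e., where a follow-up experiment would most reduce uncertainty. All names (`fit_line`, `propose_next_experiment`) are illustrative.

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def propose_next_experiment(xs, ys, candidates, n_boot=50, seed=0):
    """Pick the candidate input where bootstrap models disagree most."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    models = []
    for _ in range(n_boot):
        sample = [rng.choice(idx) for _ in idx]   # resample with replacement
        bx = [xs[i] for i in sample]
        by = [ys[i] for i in sample]
        if len(set(bx)) < 2:                      # degenerate resample, skip
            continue
        models.append(fit_line(bx, by))

    def spread(x):
        preds = [a * x + b for a, b in models]
        return max(preds) - min(preds)            # crude uncertainty proxy

    return max(candidates, key=spread)
```

Because the bootstrap models agree near the existing data and diverge away from it, the proposed experiment tends to land in the least-explored region of the design space.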
Purely virtual screening is not sufficient for drug discovery [39], where experimental validation continues to be essential [40]; Schneider and Clark describe experimental testing of drugs designed using de novo workflows as a “non-negotiable” criterion [41]. In the materials space as well, Halls and Tasaki propose a materials discovery scheme in which synthesis, characterization, and testing are critical components [42]. The scope of hypotheses that lend themselves to automated validation has limited the scope of discovery tasks that are able to be automated.

Consider a scenario where we have a large data set of molecular structures and a property of interest, like their in vitro binding affinity for a particular protein target. We can perform a statistical regression to correlate the two and represent our understanding of the structure-function landscape. Based on that model, we may propose a new structure, a compound not yet tested, that is predicted to have high activity. Whether that constitutes discovery of the compound is ambiguous. Using scientific publication as the bar, it is reasonable to expect a high degree of confidence, regardless of whether that confidence arises from a statistical analysis of existing data or from confirmation through acquisition of new data. Even with a highly accurate model, performing a large virtual screen could lead to thousands of false positive results [31]. For a philosophical discussion about the nature of knowledge and the need for confidence, correctness, and justification, see the description of the Gettier problem in ref. 43.

Figure 3: One way to visualize the discovery process. The goal definition will implicitly or explicitly define the search space within which we operate. Available structured information can be used to generate or refine a hypothesis within that search space.
Often, when we are doing more than pure data analysis, there will be an iterative process of information gathering prior to the final output, or interpretation.

We note here that this hypothesis-first approach to discovery (Figure 3) is more consistent with the philosophy of Popper [44]. This is in contrast to an observation- or experiment-first approach, which is more consistent with the philosophy of Bacon [45]; data mining studies tend to be Baconian [46]. In practice, when discovery proceeds iteratively, the distinction between the two is simply where one enters the cycle. Both are types of model-guided discovery, which is distinct from brute-force screening or approaches relying solely on serendipity that we discuss later.
Elements of autonomous discovery
It is impossible to imagine conducting research without some degree of machine assistance, defining “machine” broadly. We rely on computers to organize, analyze, and visualize data, and on analytical instruments to queue samples, perform complex measurements, and convert them into structured data. However, it is important to consider precisely what is facilitated by automation or computer assistance in terms of the broader discovery process. Many technologies (e.g., NMR sample changers) add a tremendous amount of convenience and reduce the manual burden of experimentation, but provide only a modest acceleration of discovery rather than a fundamental shift in the way we approach these problems. Considering the cognitive burden of experimental design and analysis connects to the distinction between autonomy and automation. A toy slot car that sets its own speed as it proceeds through a fixed track is qualitatively different from a self-driving car in the city, yet each successfully operates within its defined environment. Though there is no precise threshold between automation and autonomy, autonomy generally implies some degree of decision-making and adaptability in response to unexpected outcomes.
Here, we propose a set of questions to ask when evaluating the extent to which a discovery process or workflow is autonomous: (i) How broadly is the goal defined? (ii) How constrained is the search/design space? (iii) How are experiments for validation/feedback selected? (iv) How superior to a brute force search is navigation of the design space? (v) How are experiments for validation/feedback performed? (vi) How are results organized and interpreted? (vii) Does the discovery outcome contribute to broader scientific knowledge? These questions are mapped onto the schematic for hypothesis-driven discovery in Figure 2.

(i) How broadly is the goal defined?
While algorithms can be made to exhibit creativity (e.g., coming up with a unique strategy in Go or Chess [47, 48]), at some level, they do so for the sake of maximizing a human-defined objective. Is the goal defined at the highest level possible (e.g., find an effective therapeutic)? Or is it narrow (e.g., find a molecule that maximizes this black-box property for which we have an assay and preliminary data)? The higher the level at which the mission can be defined, the more compelling the discovery becomes. That requires platforms to understand what experiments can be performed and how they are useful for the task at hand.

(ii) How constrained is the search/design space?
An unconstrained search space is one that we operate in as human researchers. There are many ways in which humans can artificially constrain the search space available to an autonomous platform. A maximally constrained search space in the discovery of physical matter could be a (small) fixed list of candidates over which to screen. Limitations in the experimental and computational capabilities of an autonomous platform have the effect of constraining the search space as well; the scientific process itself has been described by some as a dual search in a hypothesis space and experimental space [24, 49]. How these constraints are defined influences the difficulty of the search process, the likelihood of success, and the significance of the discovery. The fewer the constraints placed on a platform, the greater the degree to which it can be said to be operating autonomously.

(iii) How are experiments for validation/feedback selected?
Unconstrained experimental design is a complex process requiring evaluation of local decisions as well as a global strategy for the overall timeframe, coherency, and scientific merit of a proposed experiment [50]. When operating within a restricted experimental space, design can be simplified to local decisions of specific implementation details without these high-level decisions. Cummings and Bruni define a taxonomy for human-automation collaboration in terms of the three primary roles played by a human or computer: moderator (of the overall decision-making process), generator (of feasible solutions), and decision-maker (of which action to take) [51]. Their levels of automation include ones where humans must take all decisions/actions, where the computer narrows down the selection, where the computer executes one option if the human approves, and where the computer executes automatically and informs the human if necessary. The second level is typical for the discovery of new physical matter, where computational design algorithms may propose compounds that must be subjected to a manual assessment of synthesizability before being manually synthesized. The smaller the search space and the cheaper the experiments, including considerations of time and risk of failure, the less human intervention is required in selecting experiments.

(iv) How superior to a brute force search is navigation of the design space?
This question seeks to identify the extent to which there is “intelligence” in the search strategy. Langley et al.’s notion of discovery as a heuristic search emphasizes this criterion [23]. Whether or not the strategy is more effective than a brute force search depends on the size of the space and how experiments are selected. For example, a high throughput screen of compounds from a fixed library is equivalent to a brute-force search. An active learning strategy designed to promote exploration might require only 20% of the experiments to find an optimal solution. When dealing with continuous (e.g., process variables) or virtually infinite (e.g., molecular structure) design spaces, it is not possible to meaningfully quantify the number of experiments in a brute-force search.

(v) How are experiments for validation/feedback performed?
Being able to automatically gather new information to support/refute a hypothesis is an important aspect of an automated discovery workflow. At one extreme, experiments are performed entirely by humans (regardless of how they are proposed); in the middle, experiments might be performed semi-automatically but require significant human set-up between experiments; at the other extreme, experiments can be performed entirely without human intervention. This question is tightly coupled to that of who chooses the experiments and the size of the search space. The narrower the experimental design space, the more likely it is that validation/feedback can be automated. In computational studies, it is relatively straightforward to automate simulations if we are willing to discard failures without manual inspection (e.g., DFT simulations that fail to converge).

(vi) How are results organized and interpreted?
In an iterative workflow, the results of information gathering (experiments, simulations) are organized as structured information and used to update our prior knowledge and revise our beliefs before the next round of experimental design. Provided that the experiments/simulations can be designed to produce information that is already in a compatible format (e.g., quantifying a reaction yield to build a model of yield as a function of process variables), this is simply a practical step toward closing the loop. In a few specialized workflows, experimental results naturally drive the selection of subsequent experiments, as in directed evolution and phage-assisted continuous evolution [52].

(vii) (optional) Does the outcome contribute to broader scientific knowledge?
Though not necessarily related to the concept of autonomy, this question speaks to impact and intelligibility. Does the outcome require extensive interpretation after the fact to evaluate how or what the platform has learned, or is it self-explanatory? Intelligibility is one of the criteria for discovery put forward by Valdés-Pérez [5], among others. Describing physical phenomena requires far less domain knowledge than does explaining those phenomena [53]. Especially in empirical modeling, there is often a dichotomy between models built for accurate prediction and models built for explanation [54, 55]. Turing made note of this at least as early as 1950, saying that “an important feature of a learning machine is that its teacher will often be very largely ignorant of quite what is going on inside, although he may still be able to some extent to predict his pupil’s behavior” [1]. The past few years have seen an interest in the transparency, interpretability, and explainability of machine learning models, not just their accuracy [56].

Several of these questions probe the extent to which discovery is “closed loop”, which implicitly assumes an iterative process of multiple hypothesize-test-revise cycles. Iterative refinement is crucial when operating inside poorly-explored design spaces (e.g., using an uncommon scaffold) or with new objective functions (e.g., maximizing binding to a new protein target in vitro). Most of the case studies described in the following sections are better described as “open loop” and involve only certain aspects of the workflow in Figure 2. For example, a common paradigm of computer-aided discovery is to define an objective function, perform a large-scale data mining study, propose solutions (new molecules, materials, and/or models), and manually validate a small number of those predictions. Waltz and Buchanan describe many early computational discovery programs as merely running calculations, rather than trying to close the loop [20].
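To make the hypothesize-test-revise cycle concrete, here is a deliberately simplified closed-loop search over a finite, numeric candidate set; the 1-nearest-neighbor “hypothesis” and all names used are our own illustrative choices, not the workflow of any study cited above.

```python
import random

def closed_loop_search(run_experiment, candidates, n_rounds=10, seed=0):
    """Minimal hypothesize-test-revise loop over a finite candidate set.

    run_experiment: black-box objective standing in for a physical assay.
    Each round: (hypothesize) score untested candidates with a trivial
    1-nearest-neighbor surrogate, (test) run the most promising one,
    (revise) add the result to the knowledge base.
    """
    rng = random.Random(seed)
    knowledge = {}                                  # candidate -> outcome
    for _ in range(n_rounds):
        untested = [c for c in candidates if c not in knowledge]
        if not untested:
            break
        if not knowledge:
            choice = rng.choice(untested)           # no data yet: explore
        else:
            def predicted(c):                       # surrogate "hypothesis"
                nearest = min(knowledge, key=lambda k: abs(k - c))
                return knowledge[nearest]
            choice = max(untested, key=predicted)
        knowledge[choice] = run_experiment(choice)  # validation/feedback
    best = max(knowledge, key=knowledge.get)
    return best, knowledge
```

With a toy objective such as `lambda x: -(x - 7) ** 2` over `range(21)`, the loop hill-climbs toward the optimum in far fewer evaluations than exhaustively testing all 21 candidates, illustrating question (iv) as well.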
A confluence of improved data availability, computing abilities, and experimental capabilities has brought us substantially closer to autonomous discovery (Figure 4). These improvements contribute to two categories of methodological progress: (1) techniques for navigating the search space more effectively, and (2) techniques for accelerating validation/feedback. Many machine learning techniques, for example, have been used to build empirical models within the search space to enable or accelerate the search; mapping the design space for a molecule, material, device, or process to relevant performance metrics is a prerequisite for any “rational design”.

Figure 4: The factors that have enabled autonomous discovery fall into one of three main categories.

As Claus and Underwood point out, effective discovery requires assimilation of knowledge contained in large quantities of data of diverse types [57]. The quantity of chemical property and process data available in journals, the patent literature, and online databases makes manual analysis impractical. Digitization of organic reaction information into computer-readable databases like Reaxys, SPRESI, CASREACT, and Lowe’s USPTO dataset has not just facilitated searching that information, but has enabled new analyses thereof [58]. Millions of bioactivity measurements are found in databases like ChEMBL and PubChem, not to mention the dozens of genomic, metabolomic, and proteomic databases that have emerged in the last decade [59, 60]. There are also many repositories for experimental and computational properties of materials, which have facilitated the construction of empirical models to predict new material performance [61, 62]. Gil et al. [63] discuss the utility of AI techniques in searching and synthesizing large amounts of information as part of “discovery informatics” [57, 64, 65]. Even now, an enormous amount of untapped information remains housed in laboratory notebooks and journal articles.
For such information to be directly usable, someone must undertake the challenge of compiling the data into an accessible, user-friendly format and overcome any intellectual property restrictions. Image and natural language processing techniques can make this task less burdensome; there is increasing interest in adapting such information extraction algorithms for use in chemistry [66–71].

Autonomous discovery systems rely on a variety of computational tools to generate hypotheses from data without human intervention. This includes both the software that makes the recommendations (e.g., proposes correlations, regresses models, selects experiments) as well as the underlying hardware that makes using the software tractable. Our discussion of the advances in this area focuses on software developments with an emphasis on machine learning algorithms, which have elicited cross-disciplinary excitement [72–75].

Typically, search domains that are of interest for discovery are characterized by high dimensionality (e.g., chemical space). In such domains, the patterns within the available data may be beyond the capacity of humans to infer a priori without years of intuition-building practice. Machine learning and pattern recognition algorithms can be used to discover these regularities automatically, e.g., by using the available data to parameterize a neural network model [76]. Varnek and Baskin [77] and Mitchell [78] provide overviews of machine learning techniques applicable to common cheminformatics problems, and brief tutorials can be found in a number of reviews [79–82]. It is becoming increasingly common to use machine learning to develop empirical quantitative structure-activity/property relationships (QSARs/QSPRs) to score molecules and guide virtual screening as part of broader discovery frameworks [83].
These models can be used to distinguish promising compounds from unpromising ones and prioritize molecules for synthesis and testing (validation), thus facilitating the extrapolation of information about existing molecules to novel molecules that exist only in silico [31].

Algorithms that enable efficient navigation of design spaces represent an important set of computing advances. Even with a model representing our belief about a physical structure-property relationship, an algorithmic framework is needed to apply that belief to experimental design. These frameworks include active learning strategies [84] that aim to maximize the accuracy of predictive models while minimizing the required training data, as well as goal-directed strategies such as Bayesian optimization [85] and genetic algorithms [86]. These iterative techniques can reduce the experimental burden associated with discovery in domains or search spaces where exhaustive testing is not practical.

Algorithms that are capable of directly proposing candidate molecules or materials (physical matter) as a form of experiment selection are worth special emphasis. Recently, deep generative models [87] such as generative adversarial networks (GANs) [88] and variational autoencoders (VAEs) [89] have attracted a great deal of interest, as they facilitate the creation of diverse molecular libraries without the impossible task of systematically enumerating all potential functional compounds [90–92]. Many case studies that leverage these and related frameworks for the discovery of physical matter are described later.

Experimental advances toward autonomous discovery include automation of well-established laboratory workflows (along with parallelization and miniaturization) as well as entirely novel synthetic and analytical methodologies.
Aspects of experimental validation (Figure 5) have existed in an automated format for decades (e.g., addition to and sampling from chemical reactors [93, 94]), and many of the requisite hardware units have been commercialized (e.g., liquid handling platforms and plate readers available through companies such as Beckman, Hamilton, BioTek, and Tecan). However, moving beyond piecemeal automation to the entire experimental burden of discovery workflows is challenging. Each process step, which may include synthesis, purification, assay preparation, and analysis, must be seamlessly integrated for the platform to operate without manual intervention; each interface presents new potential points of failure [95].

Figure 5: Generic workflow for experimental validation.

The complexity of software required for hardware automation ranges from sequencing commands from a fixed schedule [15], to real-time control and optimization [10], to higher-level scheduling and orchestration [96]; user interface driven software such as LabVIEW [97] can aid the creation of fit-for-purpose control systems with minimal programming experience. Although end-to-end automation of an experimental discovery workflow is uncommon, there are numerous benefits to be gained even from partial automation, chief among these being standardization and increased throughput [98].

In addition to automation, novel experimental methodologies have been developed that lend themselves particularly well to autonomous discovery workflows by facilitating the exploration of broad design spaces, a helpful feature that increases the likelihood of discovery [99]. These include synthesis-focused methodologies, such as DNA-encoded libraries [100] and diversity-oriented synthesis [35, 101], as well as analysis-focused methodologies, such as ambient mass spectrometry [102] and MISER for accelerating liquid chromatographic analysis [103].
The three categories of enabling factors described herein facilitate discovery in different ways: data is leveraged to create models that inform and predict, computational tools are used to create models from data and reason about which experiments to perform next, and physical (or computational) experiments validate hypotheses and facilitate refinement thereof. These factors can be strategically combined to give rise to different types of studies. For example, the experimental capabilities described here, in isolation, can be used for high-throughput, brute-force screening; computational tools can be used for data generation (through, e.g., DFT simulations); virtual screening is achieved through the combination of data and algorithms; and integration of all three is needed for fully autonomous discovery.
In this section, we summarize a series of case studies that demonstrate how automation and machine autonomy influence discovery in various research domains. The extent to which techniques in automation and computation have enabled each case varies. Some only benefit from automated laboratory hardware, others learn underlying trends from large or complex data, and still others use computational techniques to efficiently explore high dimensional design spaces.

Specifically, subsection 5.1 describes early computational reasoning frameworks; 5.2 describes the discovery of mechanistic models; 5.3 and 5.4 describe the noniterative and iterative discovery of chemical processes; 5.5 describes the noniterative discovery of property models; 5.6 and 5.7 describe the noniterative and iterative discovery of physical matter; finally, 5.8 provides a brief summary of a few tangentially-related domains.
There has been a long-standing fascination with the philosophical question of whether or not it is possible to codify and automate the process of discovery [104]. In the 1980s and 1990s, several programs were developed to mimic a codifiable approach to discovery and to reproduce specific quintessential discoveries of models, led by Langley and Zytkow [6]. These programs deal with questions of model induction and hypothesis generation (as a form of data analysis) rather than experimental selection and automated validation/feedback.

BACON is a rule-based framework introduced in 1978 to formalize the Baconian method of inductive reasoning to discover empirical laws, supplemented with data-driven heuristics [105]. BACON.4, a later iteration specifically designed for chemical problems, searched for arithmetic combinations of input variables to identify regularities in data (e.g., noting that pressure times volume is invariant for constant temperature in a closed gas system) [7]. This approach was able to recapitulate Ohm's law, Archimedes' law of displacement, Snell's law, conservation of momentum, gravitation, and Black's specific heat law [106]. The search for an empirical relationship was greatly simplified by excluding any irrelevant variables (i.e., all input variables were known to be important) and eliminating all measurement noise. Extensions of this approach included describing piecewise functions (FAHRENHEIT [107]) and coping with irrelevant observations and noise (ABACUS [108]). More recently, Schmidt and Lipson demonstrated that using a symbolic regression framework similar to BACON, it is possible to rediscover Hamiltonians, Lagrangians, and geometric conservation laws from empirical motion tracking data [37].
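The BACON-style search over arithmetic combinations of variables can be illustrated with a minimal, hypothetical sketch; the noiseless pressure-volume data and the small set of candidate combinations below are fabricated purely for illustration:

```python
def is_constant(values, tol=1e-9):
    """A combination is a candidate 'law' if it is invariant across all
    observations (the data here are noiseless, as in BACON's setting)."""
    return max(values) - min(values) < tol

# Fabricated pressure-volume observations for a closed gas at constant
# temperature, constructed to obey P*V = 24.
data = [(2.0, 12.0), (3.0, 8.0), (4.0, 6.0), (6.0, 4.0)]

# Try simple arithmetic combinations of the two variables and keep
# those that are invariant across all observations.
combos = {
    "P*V": lambda p, v: p * v,
    "P/V": lambda p, v: p / v,
    "P+V": lambda p, v: p + v,
    "P-V": lambda p, v: p - v,
}
laws = [name for name, f in combos.items()
        if is_constant([f(p, v) for p, v in data])]
print(laws)  # the product P*V is recovered as the invariant
```

BACON's actual search is recursive, combining raw variables and previously discovered terms until an invariant is found; this sketch shows only a single level of that search.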
Much like its predecessors, their program uses a two-part process of generating and scoring hypothesized analytical laws.

The STAHL program developed by Zytkow and Simon in the mid-1980s sought to automate the construction of compositional models to, e.g., rediscover Lavoisier's theory of oxygen [109]. It operates on a list of chemical reactions to produce a list of proposed chemical elements and the compounds they make up by making inferences like "A + B + C −→ B + D" ⇒ "D is composed of A and C". While the program was arguably successful in formalizing a specific form of scientific reasoning, its disregard for stoichiometry and phase changes, and its inability to consider uncertainty, weigh competing hypotheses, or request information, make such a logic framework highly limited in utility. The KEKADA program [110] was designed with those abilities in order to replicate the discovery of the Krebs cycle. Using seven heuristic operators (hypothesis proposers, problem generators, problem choosers, expectation setters, hypothesis generators, hypothesis modifiers, and confidence modifiers) and simulated experiments of metabolic reactions, KEKADA was able to rediscover the Krebs cycle from the same empirical data that would have been obtainable at the time.

The knowledge bases for these early programs were comprised of expert-defined relationships, rules, and heuristics designed to reflect prior knowledge and bring the programs up to the level of domain experts. Programs based entirely on user-defined axioms have proved successful in automatic theorem generation in graph theory [111]. However, these rules bring at least two drawbacks in the context of inductive reasoning. The first is that it is more difficult for experts to recapitulate their knowledge through rules than by providing examples from which an algorithm can generalize [112]. The second is that too stringent priors may restrict the model from deviating far enough from existing theory to make a substantial discovery, leaving it to merely "fill in the gaps" of what is known. Kulkarni and Simon argue that a lack of prior knowledge about allowed/disallowed reactions actually served to benefit Krebs, as a formally trained chemist might not have pursued a hypothesis that was, at the time, believed to be highly unlikely [110].

5.2 Discovery of mechanistic models
Computer assistance has proved useful in the exploration and simulation of reaction pathways [113–116]. The vast number of possible elementary reactions creates a combinatorial space of hypothesized pathways that is difficult to explore manually in an unbiased manner, making it a prime candidate for algorithmic approaches. One such approach, MECHEM, enumerates elementary reactions in catalytic reaction systems to identify series of mechanistic steps able to rationalize an observed global reaction [117–119]. Ismail et al. have demonstrated a similar approach to identifying multi-step reaction mechanisms for catalytic reactions using a ReaxFF potential energy surface [121] to guide the search toward kinetically-likely pathways [120]. In the absence of heuristics or calculations to drive the search, millions of possible elementary reactions can be generated even with species of just a few atoms [122].

The Reaction Mechanism Generator (RMG) fills a similar role in developing detailed kinetic mechanisms for combustion and pyrolysis processes [123]. Expert-defined reaction templates enumerate potential elementary reactions between a set of user-defined input molecules; rate constants for the forward and reverse reactions are estimated from a combination of first principles calculations (e.g., DFT) and group additivity rules regressed to experimental data. The ability to estimate kinetic and thermodynamic parameters enables the identification of new elementary reactions and pathways and, e.g., exploration of untested fuel additives' effects on ignition delay [124]. An earlier study by Broadbelt et al. used a similar approach to develop detailed kinetic models for pyrolysis reactions [125].

Mechanistic enumerations/searches have been applied extensively to the discovery of transition states and reaction channels [126–128]. These methods represent a search in the (3N − 6)-dimensional potential energy surface landscape implicitly defined by an N-atom pool of reacting species.
Approaches like Berny optimization [129] are used to identify transition state (TS) geometries for the purposes of estimating energetic barrier heights. Double-ended search methods like the freezing string method (FSM [130]) or growing string method (GSM [131]) require knowledge of the product structure and run iterative electronic structure calculations to identify a plausible reaction pathway; these can be applied to the discovery of new elementary reactions by systematically enumerating potential product species [132–134] (Figure 7). Single-ended search methods operate on reactant species only and perturb the geometry along reactive coordinates, including, e.g., the artificial force induced reaction method (AFIR [135]). An alternate approach to reaction discovery is by direct simulation of reactive mixtures using molecular dynamics (MD) [136–138]. Wang et al. describe the use of an "ab initio nanoreactor" to find unexpected products from similar starting materials to the Urey-Miller experiment on the origin of life [136]. Importantly, their approach does not require the use of heuristics to define reaction coordinates or enumeration rules to define possible products. Instead, molecules in an MD simulation are periodically pushed toward the center to impart kinetic energy and encourage collisions at a rate that enables the observation of rare events over tractable simulation timescales. In principle, these can be applied to the prospective prediction of novel reaction types and, ultimately, the development of new synthetic methodologies.

Figure 6: Discovery of detailed kinetic models through iterative selection of important elementary reaction steps. Figure reproduced from Broadbelt et al. [125].

Figure 7: Workflow for identification of reaction networks between known reactants R and known products P through combinatorial enumeration of possible mechanistic steps pruned by calculated transition state energies. Figure reproduced from Kim et al. [134].
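To illustrate how template-driven enumeration rapidly generates candidate steps, the following hypothetical sketch applies a single exchange-style "template" exhaustively to a small pool of two-fragment species; the species and template are invented for illustration and carry no chemical meaning:

```python
from itertools import combinations

def enumerate_exchange(pool):
    """Apply one exchange template AB + CD -> AD + CB to every pair of
    species in the pool; letters stand in for molecular fragments."""
    steps = []
    for m1, m2 in combinations(pool, 2):
        p1, p2 = m1[0] + m2[1], m2[0] + m1[1]  # swap second fragments
        if p1 not in (m1, m2):                 # skip degenerate identity steps
            steps.append((m1, m2, p1, p2))
    return steps

pool = ["AB", "CD", "EF"]
for step in enumerate_exchange(pool):
    print("%s + %s -> %s + %s" % step)
```

Even one template over n species yields O(n^2) candidate steps, and real generators apply many templates recursively to each newly formed species, which is why heuristic or energetic pruning is essential.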
5.3 Noniterative discovery of chemical processes

5.3.1 Discovery of new synthetic pathways
Synthetic pathways are a prerequisite for physically producing a molecule of interest, whether for experimental validation of a predicted property or for production at scale. Retrospective analyses of known single-step chemical reactions can yield hypothesized synthetic pathways as combinations thereof. Gothard et al. describe an analysis of a "Network of Organic Chemistry" (a copy of the Beilstein database with seven million reactions) for the discovery of one-pot reactions; their search space comprised any consecutive sequence of known reactions where the product of one is a reactant of another [139]. Candidate sequences were evaluated using eight filters, including tabulations of functional group cross-reactivity and of functional group compatibility under 97 categories of reaction conditions. Through application of these expert heuristics to millions of candidate sequences, the authors identified multi-step chemistries that could potentially be run without an intermediate purification, choosing a handful of such pathways for experimental validation. While their filters were all hand-encoded, data mining techniques can also be used to estimate functional group reactivity [140, 141]. Selecting pathways within a search space defined by combinations of known single-step reactions has taken on other forms as well, including the identification of cyclic pathways [142], the optimization of process cost [143], and the optimization of estimated process mass intensity [144].

Generating yet-unseen chemical reactions for a synthesis plan (a necessity for the synthesis of novel molecules) is a harder search problem than searching within a fixed reaction network [145]. Because the number of states in a naive retrosynthetic expansion will scale as b^d for branching factor b and depth d, guiding the search is an essential aspect of computer-aided synthesis planning (CASP) programs.
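The b^d scaling can be made concrete with a back-of-the-envelope sketch; the branching factor and depth below are illustrative only:

```python
def search_tree_size(b, d):
    """Number of nodes in a naive retrosynthetic search tree with
    branching factor b (disconnections per molecule) and depth d."""
    return sum(b ** k for k in range(d + 1))

# Even modest assumptions explode: 50 plausible disconnections per
# intermediate and 6-step routes give on the order of 1.6e10 states.
print(search_tree_size(50, 6))
```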
The breadth of the search depends on the coverage of the rule sets: abstracted enzymatic reactions tend to number in the hundreds [146], expert transformation rules often number in the dozens or hundreds [147, 148] but can extend into the tens of thousands in contemporary programs [149], and algorithmically-extracted templates generally number in the thousands to hundreds of thousands [150–153]. To the extent that reaction rules and synthetic strategies can be codified, synthesis planning is highly conducive to computational assistance [58, 154–158] (Figure 8). CASP approaches that generate retrosynthetic suggestions without the use of pre-extracted template libraries [159, 160] still result in a large search space of possible disconnections.

Even the earliest CASP programs emphasized the importance of navigating the search space of possible disconnections [154, 161]. The search in OCSS was guided by five subgoals for structural simplification: reduce internal connectivity, reduce molecular size, minimize functional groups, remove reactive or unstable functional groups, and simplify stereochemistry [161]. Starting material oriented retrosynthesis introduces additional constraints in the search, as the goal state is a specific starting material, rather than one of many from a database of available compounds [162]. It is only fairly recently that CASP tools have started to be used more widely for discovery of synthetic routes. Development is stymied by the complexities of validation and feedback, which can only occur by experimental implementation [163] or review by expert chemists [164].

Figure 8: Workflow used by the WODCA program for computer-aided synthesis planning. Figure reproduced from Ihlenfeldt and Gasteiger [154].

There are two main approaches to navigating the search space during retrosynthetic expansion to determine which disconnections are most promising: value functions and action policies.
Value functions estimate the synthetic complexity of reactant molecules as a proxy for how close they are to being purchasable [165–169]. Despite their limitations, these are widely used in virtual screening libraries as a rapid means of prioritizing compounds that appear more synthetically tractable. While even simple user-defined heuristics that attempt to break a molecule into the smallest possible fragments can be successful in planning full synthetic routes, learned value functions can offer some advantages in finding shorter pathways or being tailored to a user-defined cost function [170]. Action policies directly predict which transformation rule to apply based on literature precedents in a knowledge base; this can be accomplished through a simple nearest-neighbor strategy [171] or through a trained neural network model for classification [153]. The latter approach has been integrated into a Monte Carlo tree search framework to rapidly generate and explore the space of candidate pathways, resulting in recommendations that chemists considered equally plausible to literature pathways in a double-blind study [164]. Less common approaches to navigating the search space include proof-number search [172].

Reaction pathway discovery is relevant in synthetic biology and metabolic engineering contexts as well. For example, one study by Rangarajan et al. describes the application of the Rule Input Network Generator (RING, [174]) to identify plausible biosynthetic production pathways through heuristic-driven network generation and analysis [173]. Kim et al. review algorithms and heuristics used to explore metabolic networks and find optimal pathways [146]. A broader review of machine learning for biological networks can be found in ref. 175.

Identifying synthetic pathways is but one step toward fully automated synthesis. For any theoretical robo-chemist capable of synthesizing any molecule on demand [8, 14], these ideas must be able to be acted upon and executed in the laboratory.
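The value-function-guided search described above can be caricatured with a toy best-first planner; this is not a real CASP system: "molecules" are strings, a "disconnection" splits a string in two, and the value function is the total length of fragments not yet purchasable (lower is better):

```python
import heapq

# Toy stock of "purchasable" fragments; entirely illustrative.
PURCHASABLE = {"CC", "CO", "OC", "CN"}

def value(frags):
    """Crude synthetic-complexity value function: total length of
    non-purchasable fragments (zero means the route is solved)."""
    return sum(len(f) for f in frags if f not in PURCHASABLE)

def expand(frags):
    """All states reachable by disconnecting one open fragment once."""
    states = []
    for i, f in enumerate(frags):
        if f in PURCHASABLE or len(f) < 3:
            continue
        for cut in range(1, len(f)):
            states.append(frags[:i] + [f[:cut], f[cut:]] + frags[i + 1:])
    return states

def best_first_plan(target, max_iter=500):
    """Best-first search guided by the value function; returns a list of
    purchasable fragments, or None if no route is found."""
    frontier = [(value([target]), [target])]
    seen = set()
    for _ in range(max_iter):
        if not frontier:
            return None
        v, frags = heapq.heappop(frontier)
        if v == 0:
            return frags               # all fragments purchasable
        for state in expand(frags):
            key = tuple(sorted(state))
            if key not in seen:
                seen.add(key)
                heapq.heappush(frontier, (value(state), state))
    return None

print(best_first_plan("CCOC"))
```

An action-policy approach would instead rank which disconnection to try first at each node (e.g., with a trained classifier) rather than scoring the resulting states.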
5.3.2 Discovering models of chemical reactivity

Even without automated synthesis, hypothesized synthetic pathways are of little use without experimental validation. This requires additional models of chemical reactivity that can, among other things, propose suitable reaction conditions, estimate confidence in the reactions they propose, and have some notion of why one set of substrates might achieve a higher yield than others. Models for these tasks can be trained directly on experimental data using a variety of statistical techniques.

Given a combination of successful and unsuccessful reaction examples (i.e., high and low yielding), one can train a binary classifier model to predict whether a proposed set of reaction conditions will be successful [176]. The same task can also be treated as a regression of reaction yields, rather than as a classification, as a function of substrate descriptors; a virtual screen of known conditions as a fixed search space can then propose substrate-dependent optimal conditions [177]. When only successful reaction examples are present, one can treat the selection of reaction conditions as a recommendation problem comprising a classification subproblem (for reagent, catalyst, solvent identity) and a regression subproblem (temperature) under the assumption that the "true" published conditions are adequate. This was Gao et al.'s approach using the Reaxys database to produce a model that is able to propose conditions at the level of species identity and temperature based on reactant and product structures [178]. In the process of learning the relationship between reactants/products and suitable reaction conditions, the model learns a continuous embedding for chemicals that reflects their function in organic synthesis, similar to how semantic meaning is captured by word2vec models [179].
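As a minimal sketch of the binary success/failure formulation (not the model of ref. 176): the two condition descriptors and the labeled examples below are fabricated, and real models use far richer substrate and condition featurizations:

```python
import math

# Toy descriptors: (temperature/100, catalyst loading), with a fabricated
# rule that high loading and moderate temperature give success (label 1).
data = [(0.3, 0.8, 1), (0.4, 0.9, 1), (0.5, 0.7, 1), (0.2, 0.9, 1),
        (0.9, 0.2, 0), (1.0, 0.1, 0), (0.8, 0.3, 0), (1.1, 0.2, 0)]

w = [0.0, 0.0]   # weights for the two descriptors
b = 0.0          # bias

def predict(t, c):
    """Probability that conditions (t, c) give a successful reaction."""
    z = w[0] * t + w[1] * c + b
    return 1.0 / (1.0 + math.exp(-z))

# Plain stochastic gradient descent on the logistic log-loss.
for _ in range(2000):
    for t, c, y in data:
        g = predict(t, c) - y      # dLoss/dz for log-loss
        w[0] -= 0.1 * g * t
        w[1] -= 0.1 * g * c
        b -= 0.1 * g
```

A trained classifier of this kind can then serve as a filter in a virtual screen over a fixed library of candidate conditions.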
Formulating condition selection as a data-driven classification problem has also been used in a more focused manner as an alternative to expert recommender systems [180], e.g., to choose phosphine ligands for Buchwald-Hartwig aminations [144] or catalysts for deprotections [141]. In some cases, computational prediction of solvation free energies can meaningfully assist in the selection of reaction solvents [181]. To a first approximation, solvation energy can be estimated by a linear model describing potential solute-solvent interactions [182, 183]. When those interaction parameters can be predicted via DFT, one can estimate the performance of a large virtual set of solvents, e.g., to optimize the rate constant for a particular reaction of interest [184].

Similar models for a priori evaluation of reaction conditions can be found in materials applications. In one instance, Raccuglia et al. used a combination of 3955 reaction successes and failures from laboratory notebooks to train an SVM model to predict outcomes for the crystallization of vanadium selenites [185] (Figure 9). Recasting the model as a decision tree led to correlations that reflected expert intuition, which arguably contributed to the synthesis of five previously-unseen compounds [186]. A similar study applied a much smaller dataset of 54 conditions to predict whether a process would produce atomically precise gold nanocrystals, using a siamese neural network architecture to relate proposed conditions to precedents [187]. For larger scale analyses, the literature serves as an unstructured data source of inorganic reactions and has been used to populate a structured database of synthesis conditions and outcomes via natural language processing of over 640,000 manuscripts [188]; virtual screening and synthesis planning pipelines have been built on top of such data to help guide the experimental realization of computationally-proposed materials [69, 189–191].

Figure 9: Workflow used by Raccuglia et al.
for training an interpretable predictive model of the success/failure of vanadium selenite crystallization. Figure reproduced from Raccuglia et al. [185].

Anticipating the outcomes of organic reactions is a very different modelling task. The space of possible results is high dimensional (chemical space) rather than low dimensional (e.g., the phase of the resulting material or a boolean measure of success/failure). The ability to accurately predict reaction products would be powerful in combination with CASP to improve the likelihood that proposed reactions are experimentally realizable. The task of predicting reaction outcomes in silico has been approached through several heuristic and computational techniques over the years [192–197] but has seen renewed interest as a supervised learning problem as a result of increased data availability [158].

Segler and Waller treat reaction discovery as an edge prediction problem in a knowledge graph of known chemistry [198]. Specifically, they predict the products of bimolecular reactions through the application of algorithmically-extracted half reactions that similar substrates underwent. Novel combinations of half reactions that had not been observed previously could be accurately predicted, albeit with a modest rate of success. With a similar goal, Jacob and Lapkin build a stochastic block model (SBM) for the classification of reactions into true or false using reactions in Reaxys (true) and ones randomly generated from known chemicals (false) [199]. Other machine learning-based methods include ones that rank enumerated mechanistic [200–202] or pseudo-mechanistic [203] steps, score/rank reaction templates [153, 204], score/rank candidate products generated from reaction templates [205], propose reaction products as resulting from sets of graph edits [206, 207], and translate reactant SMILES strings to product SMILES strings using models built for natural language processing tasks [208–210].
These all formulate reaction prediction differently; for example, the model in ref. 207 learns to enumerate likely changes in bond order and learns to rank candidate products generated through combinatorial enumeration of those sub-reactions (Figure 10).

Figure 10: Workflow used by Coley et al. for predicting the products of organic reactions. Figure reproduced from Coley et al. [207].

The quantitative prediction of reaction outcomes is closer to a standard regression task; when only one chemical species is varied (a single substrate or a single catalyst), the problem is exactly that of developing a QSAR/QSPR model. The historical approach in physical organic chemistry is again the development of linear free energy relationships [182], for which group contribution approaches are particularly attractive due to their interpretability [211]. Hammett parameters are a classic example of correlating molecular structures with reactivity [212]. Computational prediction of organic reaction rates has been demonstrated using simple regressions on expert descriptors [213] and using structure-derived descriptors [214–218], with much of the latter work coming from Varnek and coworkers.

Even with increasingly powerful machine learning techniques to describe patterns in experimental data, computational chemistry has a significant role in developing predictive models of chemical reactivity [219]. Using informative electronic (e.g., Fukui functions [220, 221]) and steric (e.g., Sterimol [222, 223]) descriptors can help model generalization and performance, especially in low data environments. Given suitable descriptors and holding other process parameters constant, complex properties have been described with linear or nearly-linear models, e.g., catalyst performance and enantioselectivity [224–228] (Figure 11).
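The Hammett relationship, log10(k/k0) = ρσ, is a one-parameter linear free energy model. A sketch of fitting the reaction constant ρ follows; the σ values are standard para-substituent constants, while the rate ratios are synthetic data constructed to follow ρ = 2 exactly, purely to illustrate the regression:

```python
# Hammett linear free energy relationship: log10(k/k0) = rho * sigma.
sigmas = [-0.27, 0.0, 0.23, 0.78]          # p-OMe, H, p-Cl, p-NO2
log_k_ratios = [2.0 * s for s in sigmas]   # fabricated "measurements"

# Least-squares slope through the origin: rho = sum(x*y) / sum(x*x).
rho = (sum(s, )
       if False else
       sum(s * y for s, y in zip(sigmas, log_k_ratios))
       / sum(s * s for s in sigmas))
print(round(rho, 2))
```

A positive fitted ρ indicates a reaction accelerated by electron-withdrawing substituents; with real (noisy) data, the residuals of this fit also flag substrates that deviate from the linear relationship.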
Descriptors tailored to a specific reaction class can be effective representations for predicting regioselectivity [229] and yield [230] among other performance metrics, although they may not be broadly applicable across reaction and substrate classes. In principle, these descriptors could be calculated with greater universality than expert-selected ones already known to be relevant [231]. Similarly, selectivity in complex synthetic steps can be explained by expert-defined DFT calculations [232] that could, in principle, be automated.

Figure 11: Discovery of catalysts for enantioselective catalysis using a surrogate model trained on experimental ∆∆G values to screen unseen reaction conditions. Figure reproduced from Reid and Sigman [228].

If these models are truly describing the underlying patterns of chemical reactivity, they could be applied prospectively to the discovery of new synthetic methods. This is yet to be demonstrated. Time-split validations arguably demonstrate this generalization ability; however, a separate algorithm (a hypothesis generator) would be required to "steer" these models toward the combinations of reactants most likely to result in new chemistry.

5.3.3 Discovery of new chemical reactions from experimental screening

The discovery of new chemical reactions can widen synthetically-accessible chemical space and allow us to realize molecules that were previously difficult to access [233, 234]. The rise of combinatorial chemistry in the 1990s opened up new means of discovering new chemical reactions and functional physical matter through experimental screening [100, 235]. Low-volume liquid handling and rapid analysis by HPLC or ESI-MS or even fluorescent readouts [236] have enabled material-efficient reaction screening toward this end [237].
Microplate reaction screening has advanced to the point where it requires only nanomole quantities of material and achieves throughputs of thousands of reactions per hour [238, 239]; related technologies using continuous flow [240] and electrospray ionization [241] can achieve similar throughputs and material consumption.

These technologies have accelerated the rate at which candidate reactions (different substrates, conditions) can be tested, but still navigate a search space in a brute force manner. High throughput experimentation can be hypothesis-driven and used to investigate a narrower search space [242, 243] or be informed by mechanistic knowledge [244] and functional diversity [245, 246] (Figure 12), though this is less common in practice. Beyond improving the speed of experimentation and sensitivity of analysis, progress toward automated discovery of new chemical reactions has included developing new techniques for exploring the vast space of possible chemical reactions with fewer individual experiments: either by an active search (described later) or by pooling. Clever pooling strategies can allow for the simultaneous evaluation of multiple hypotheses through techniques like mass-encoded libraries [247], DNA-templated synthesis [248], and substrate combinations designed to enable straightforward deconvolution [249, 250] for "accelerated serendipity" [251]. Multicomponent reactions represent a particularly large space and have historically been discovered either through serendipity or pooled/combinatorial screening [252–254]. Collins et al. review screening approaches to reaction discovery and development [255].
5.4 Iterative discovery of chemical processes

5.4.1 Discovery of optimal synthesis conditions
Automatic discovery of optimal synthesis conditions is a task where closed-loop experimentation is frequently applied. With a platform able to perform reactions under a wide range of operating conditions and automatically analyze and interpret the outcomes, one can use an optimization algorithm to guide a search within a pre-defined process parameter space. As with many other examples of automated discoveries, the search space is highly constrained by expert human operators. Standard numerical optimization routines are often sufficient to explore the search space of interest when an expert is able to define a narrow range of conditions that will likely lead to promising results.

Figure 12: Chemistry informer approach to reaction screening to understand substrate compatibility, emphasizing the use of complex substrates to understand more complex chemical phenomena. Figure reproduced from Kutchukian et al. [245].

The earliest automated platforms for organic reactions used batch reactors and computer-controlled valves or pumps to automatically add reagents according to computer-selected experiments [93, 94]. Automated control of continuous process variables (e.g., residence time, temperature, reactant ratios) is simplified when using flow platforms that eliminate the need to physically replace or clean batch vessels. Due to the ease of sampling a crude product stream with an inline valve, flow platforms are frequently used to screen arrays of different process conditions to map out an experimental space and the corresponding parameter-performance relationship [256–259].
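Such a closed-loop optimization can be sketched in a few lines, with a fabricated single-variable yield surface standing in for the automated platform and a crude distance-based upper-confidence-bound rule standing in for the optimization routine (real systems use algorithms such as those discussed here, not this toy rule):

```python
import math

def run_experiment(temp):
    """Stand-in for the automated platform: an (unknown to the algorithm)
    yield surface peaking at 70 C. Entirely fabricated."""
    return math.exp(-((temp - 70.0) / 15.0) ** 2)

def select_next(candidates, observed, kappa=0.5):
    """Crude upper-confidence-bound rule: predicted yield (taken from the
    nearest observed condition) plus a bonus for unexplored regions."""
    def score(x):
        nearest = min(observed, key=lambda pt: abs(pt[0] - x))
        return nearest[1] + kappa * abs(nearest[0] - x) / 100.0
    return max(candidates, key=score)

candidates = list(range(20, 101))                  # 20-100 C search space
observed = [(30, run_experiment(30)), (90, run_experiment(90))]
for _ in range(15):                                # closed-loop iterations
    temp = select_next(candidates, observed)
    observed.append((temp, run_experiment(temp)))  # run and analyze
best_temp, best_yield = max(observed, key=lambda pt: pt[1])
print(best_temp, round(best_yield, 3))
```

The loop converges on the yield maximum with far fewer experiments than exhaustively screening all 81 candidate temperatures, which is the core appeal of closed-loop operation.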
Automated optimization of organic reactions in flow (Figure 13) has been extensively reviewed and is an excellent entry point for those interested in automated chemistry [9, 11, 260–266]. Optimization routines that have been widely employed include conjugate gradient [267], simplex [268], genetic algorithms (GAs) [269], Stable Noisy Optimization by Branch and Fit (SNOBFIT) [270], adaptive response surface methods [271], Bayesian optimization approaches [272], and reinforcement learning [273]. Optimizations over continuous variables generally use black box methods like genetic algorithms, Simplex, and SNOBFIT [32, 274–280], gradient-based methods like steepest descent and conjugate gradient [281], or explicit model-based methods like an adaptive response surface [33, 282, 283]. In a recent study, Bédard et al. describe a reconfigurable flow platform that uses the SNOBFIT algorithm to optimize several common organic transformations [284]. Discrete process variables can be varied through the use of selector valves or liquid handlers [240, 285] to also optimize, e.g., catalyst/ligand identities [33, 283, 286] and reaction solvent [287]. Despite their hype, "modern" machine learning approaches to reaction optimization [288, 289] have not demonstrated any clear advantages over previously-used statistical methods. One underexplored opportunity is to embed prior chemical knowledge into the model through pretraining; Zhou et al. do this not with chemical knowledge, but with knowledge about the geometry/roughness of the expected regression surface to improve the hill-climbing efficiency of a reinforcement learning optimization routine [288].

Figure 13: General platform schematic for the iterative optimization of synthetic processes in flow with respect to continuous variables (flowrates, temperature, pressure). Figure reproduced from Mateos et al. [265].

When the performance of a chemical process is measured by multiple objectives, it is important to understand their associated tradeoffs [290]. Rather than combining them into a single scalar metric to optimize over [10, 291, 292], one can optimize for knowledge of the Pareto front: settings of process variables where one performance metric cannot be increased without decreasing another [293, 294]. Multi-step reactions are particularly challenging to optimize, because the effects of changing one parameter can propagate through downstream process steps. They are typically broken up into individual synthetic steps to improve the tractability of the problem [295, 296] or optimized approximately through screening, rather than true closed-loop feedback [297].

Similar closed-loop optimizations have been demonstrated for materials-focused applications. Different properties of interest necessitate different analytical endpoints, but the overall workflow is the same. Optimization goals have included the emission intensity of quantum dots [298], the conversion and particle size resulting from a copolymerization [291], the identification of crystallization conditions for polyoxometalates [299, 300], the production of Bose-Einstein condensates [301], and the realization of a metal-organic framework (MOF) with high surface area [302]. The MOF synthesis optimization by Moosavi et al. is particularly noteworthy in that prior data on syntheses of other MOFs were used to estimate the relative importance of synthetic parameters to enable a maximally diverse initial design of experiments, jump-starting the phase of iterative empirical optimization [302].

The challenge for these discoveries is often practical, not methodological.
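The Pareto front discussed above is straightforward to extract from a set of completed experiments; a sketch with fabricated (yield, selectivity) outcomes for six condition settings:

```python
def pareto_front(points):
    """Return the non-dominated points when both objectives are to be
    maximized (e.g., yield and selectivity of a reaction)."""
    front = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

# Fabricated (yield, selectivity) outcomes; three settings trade off
# the two objectives, the other three are strictly worse than another run.
runs = [(0.9, 0.2), (0.7, 0.6), (0.5, 0.9),
        (0.4, 0.5), (0.6, 0.55), (0.3, 0.7)]
print(pareto_front(runs))
```

Multi-objective algorithms select each new experiment to refine this front rather than to climb a single scalarized objective.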
Experimental platforms must be able to analyze the relevant performance metrics and to control process variables across a search space that is broad enough to make computational assistance worthwhile. The ARES (autonomous research system) is an example of how complex instrumentation can enable optimizations of processes that are traditionally difficult to automate [303, 304]. ARES can perform up to 100 carbon nanotube growth experiments via chemical vapor deposition (CVD) per day under different temperatures, pressures, and gas compositions with real-time monitoring of growth rates using Raman spectroscopy. After fitting a random forest model with 84 expert-defined experiments as prior knowledge, a genetic algorithm was successfully applied to achieve a user-defined target growth rate through automated control of process conditions.

Iterative discovery of quantitative models of process performance (e.g., experimentation to estimate kinetic parameters) differs from optimization only in how experiments are selected. Instead of selecting experiments with the ultimate goal of maximizing yield or achieving optimal product properties, experiments can be selected to minimize uncertainty in regressed parameters or to discriminate between multiple hypothesized models [305]. The acquisition functions needed for these goals, which quantify how useful a proposed experiment would be, can be directly imported from work in statistics on parameter estimation and model discrimination [306]. There are still challenges for multi-step reactions, as deconvoluting the effects of kinetic parameters from individual steps may not be straightforward even when the rate laws are known [307].
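As a schematic illustration of model discrimination, not drawn from any of the cited studies, the sketch below selects the next experiment as the candidate condition at which two hypothesized rate laws disagree most; the rate laws, rate constants, and candidate concentrations are invented for illustration.

```python
# Sketch: select the experiment that best discriminates between two
# hypothesized rate laws (first-order vs. second-order in substrate).
# Model forms and parameter values are illustrative only.

def rate_first_order(conc, k=0.8):
    """Hypothesized model 1: rate = k * [S]."""
    return k * conc

def rate_second_order(conc, k=0.5):
    """Hypothesized model 2: rate = k * [S]^2."""
    return k * conc ** 2

def most_discriminating(candidates, model_a, model_b):
    """Pick the candidate condition where the two models disagree most."""
    return max(candidates, key=lambda c: abs(model_a(c) - model_b(c)))

candidate_concs = [0.1, 0.5, 1.0, 1.5, 2.0]  # mol/L, hypothetical
best = most_discriminating(candidate_concs, rate_first_order, rate_second_order)
print(best)  # -> 2.0, the condition expected to be most informative
```

More principled criteria (e.g., accounting for measurement noise or parameter uncertainty) follow the same pattern: score each candidate experiment by expected information gain and run the maximizer.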
There are far fewer examples of trying to discover new chemical reactions through active searches than through noniterative screening strategies. Amara et al. describe one example of discovering new reaction pathways in a catalytic reactor system by reformulating the problem as a reaction optimization [308]. Using a modified simplex algorithm, they were able to optimize the yield of then-uncharacterized side products; mechanistic pathways were proposed by experts based on evaluation of the conditions leading to different product distributions.

Granda et al. instead treat reaction discovery as a natural consequence of building a quantitative model of chemical reactivity [309] (Figure 14). Specifically, they describe a platform for evaluating the reactivity of two- and three-component reactions among a set of 18 hand-picked building block molecules (969 possible experiments) using two empirical models: one makes a boolean prediction of whether a reaction has taken place based on NMR, MS, and ATIR data before/after mixing; the second makes a boolean prediction of whether a given combination of substrates is likely to be reactive, using a one-hot representation for chemical species. The physical apparatus is reminiscent of MEDLEY, an automated reaction system employing computer-controlled pumps connected to a round-bottom flask [310]. However, the goal of reaction discovery and the training of this binary classifier are misaligned: the algorithmic exploration of the search space of possible reactions does not direct experiments toward those likely to lead to novel reactions; the reactions claimed to have been discovered were identified only through (error-prone [311]) manual analysis of product mixtures. Moreover, all 969 reaction combinations could have been performed in a miniaturized well-plate format while using less time and materials (cf. item iv of section 4.1), and the one-hot encoding of substrates precludes prediction of reactivity for unseen substrates.
An earlier version of this platform was used by the same group to explore the 64 possible pathways defined by a three-step synthesis with one of four reagents added at each step [312]. The “most reactive” pathway was found through a step-wise greedy search by identifying the reagent whose addition led to the largest change in the ATIR spectrum of the product mixture. While this too did not explicitly bias experiments towards new reactions, the concept of selecting experiments in a non-brute-force manner for reaction discovery is worth further investment.

Figure 14: Workflow for iteratively training a binary classifier of whether a reaction mixture is reactive, using experimental validation and feedback. Figure reproduced from Granda et al. [309].
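The step-wise greedy search just described can be sketched as follows; the `spectral_change` scores are invented stand-ins for comparing spectra before and after each reagent addition.

```python
# Sketch of a step-wise greedy search over a 4^3 = 64 pathway space:
# at each of three steps, pick the reagent whose addition produces the
# largest change in a measured spectrum. `spectral_change` is a toy
# stand-in for quantifying the difference between ATIR spectra.

import itertools

REAGENTS = ["A", "B", "C", "D"]

def spectral_change(pathway_so_far, reagent):
    # Hypothetical scoring; a real platform would compare measured spectra.
    scores = {"A": 0.2, "B": 0.9, "C": 0.5, "D": 0.1}
    return scores[reagent] / (1 + len(pathway_so_far))

def greedy_pathway(n_steps=3):
    pathway = []
    for _ in range(n_steps):
        best = max(REAGENTS, key=lambda r: spectral_change(pathway, r))
        pathway.append(best)
    return pathway

n_total = len(list(itertools.product(REAGENTS, repeat=3)))  # 64 by brute force
print(greedy_pathway())  # evaluates 4 reagents per step: 12 experiments, not 64
```

The greedy search trades completeness for efficiency: it can miss pathways whose value only becomes apparent after a low-scoring intermediate step, which is the usual caveat for step-wise selection.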
Noniterative discovery of structure-property models
Models capable of relating the structural and compositional features of a molecule or material to its properties are of substantial utility in discovery. These relationships are often learned directly from data, whether via standard multivariate regression or machine learning algorithms. To the extent that they are interpretable, they can yield insight into how the fundamental features of a chemical entity or system influence its properties or performance, thus informing design. Quantitative structure-activity/property relationships (QSARs/QSPRs) can act as our belief about a performance landscape (cf. “belief” in Figure 2) for the sake of a specific discovery task like the discovery of new physical matter. While there is only a weak distinction between developing a QSAR/QSPR model for its own sake or for the purposes of exploring a design space, this section will focus on studies where the primary discovery is of the model itself. General considerations and trends in QSAR/QSPR are discussed in refs. 313, 78, and 314.
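To make the baseline concrete: a minimal QSPR is an ordinary least-squares fit on a few hand-chosen descriptors, whose coefficients can be read directly as (crude) feature importances. The descriptors and training data below are invented, and the noise-free data let the fit recover the generating weights exactly.

```python
# Minimal QSPR sketch: ordinary least squares on two invented molecular
# descriptors (e.g., H-bond donor count and molecular weight / 100).
# With a linear model, the fitted coefficients double as a crude
# measure of how each descriptor influences the property.

def fit_two_descriptor_qspr(X, y):
    """Solve the 2x2 normal equations (X^T X) w = X^T y by hand."""
    a = sum(x1 * x1 for x1, _ in X)
    b = sum(x1 * x2 for x1, x2 in X)
    d = sum(x2 * x2 for _, x2 in X)
    p = sum(x1 * yi for (x1, _), yi in zip(X, y))
    q = sum(x2 * yi for (_, x2), yi in zip(X, y))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)

# Invented, noise-free training data: property = 2*donors + 0.5*(MW/100).
X = [(1.0, 1.5), (2.0, 2.5), (0.0, 3.0), (3.0, 1.0)]
y = [2 * x1 + 0.5 * x2 for x1, x2 in X]
w_donors, w_mw = fit_two_descriptor_qspr(X, y)
print(round(w_donors, 3), round(w_mw, 3))  # recovers 2.0 and 0.5
```

Nonlinear machine learning models generalize this template but give up the direct coefficient-as-importance reading, which is exactly the interpretability tension discussed below.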
Given a QSAR/QSPR model, one can investigate how the model perceives different structural attributes to reveal which are most informative of the prediction task. Substructure filters are commonly employed to process screening hits [315–317] and flag reactive or toxic functional groups [318–323]. This is a standard problem of interpretability that has received significant attention in the machine learning community [324]. When the form of the desired interpretation is restricted to molecular substructures, standard approaches for variable and feature selection can be applied when using a representation based on the presence/absence of certain substructures [325, 326]. Polishchuk provides a recent review of interpretability for QSAR/QSPR models, including this category [327].

An early attempt to correlate predicted function directly with structural attributes was PROGOL [328], an inductive logic programming algorithm. In its original demonstration, PROGOL identified a set of five criteria for determining whether a compound is likely to be mutagenic based on the presence of hypothesized toxicophores defined by connectivity and partial charge values; subsequent studies pursued similar explanations for carcinogenicity [329] and ACE inhibition activity [330], among others (Figure 15).

Figure 15: Workflow for the application of PROGOL [328] to the induction of structural alerts for carcinogenicity and the resulting rules. [x-y] indicates that the rule applied correctly x times and incorrectly y times in the dataset of 330 organic compounds. Figure reproduced from King and Srinivasan [329].

One approach to interpretability is to rely on few-parameter regressions with interpretable descriptors [331, 332], which can provide explanations as meaningful as the descriptors themselves. Decision trees provide a natural mode of assessing descriptor importance, though ensembling methods (e.g., random forest models) can obfuscate analysis [333, 334].
More general techniques exist to extract symbolic rules from trained machine learning models that are relatively agnostic to the type of model used [335–337]; ref. 185 provides an example of a decision tree extracted from an SVM model trained to predict the success of an inorganic synthesis procedure. There are numerous other examples of QSAR/QSPR studies that estimate descriptor importance in an attempt to rationalize predictions [144, 327, 338–341]. Other approaches instead aim to identify the training examples most relevant to a given test example [342, 343].

Visualizing explanations can be more intuitive than looking at quantitative feature importance metrics. One popular approach is to approximate a model by a fragment-contribution approach, looking at how the predicted property changes when part of the input molecule is masked [344–346]. If the value decreases when masking a certain substructure, that substructure is assumed to contribute positively to the property. This per-atom or per-substructure importance metric is usually an oversimplification of what is being learned, though sometimes it is exactly what is being learned [347]. The accuracy of machine learning models is usually at least partially attributable to the nonlinearities between the input featurization and the output property.

A natural application of data science techniques is the analysis of spectral data for computer-aided structural elucidation (CASE). The underlying function that maps a molecule or material to the results of an assay is no different from a standard structure-property relationship, except that the property might be high dimensional. CASE will become increasingly important for structure confirmation and quantitation as autonomous systems start to explore new areas of chemical and reactivity space.

The DENDRAL program is an early example of a program designed for structural elucidation of organic structures from mass spectrometry (MS) data [348, 349].
It cross-references the mass loss between peaks with a list of known fragments to identify the likely substituents of the original molecule, enumerates possible molecular structures, predicts the MS spectra of those candidate structures, and makes its final proposal based on consistency with the observed spectrum. DENDRAL proved useful in its ability to perform many rapid calculations (spectral simulations and matching), but still required expert heuristics to help explore the vast space of possible molecular structures, including a “badlist” to prune unrealistic structures. Hufsky and Böcker provide a recent review of computational analysis of MS fragmentation patterns [350]. Computational MS analysis continues to be an application of supervised learning approaches [351] and has seen renewed interest in the context of metabolomics [352–355].

Unsurprisingly, other types of analytical data are also commonly evaluated using computational or machine learning models, including UV circular dichroism to elucidate protein secondary structure [356]. The reverse problem of spectral prediction is also popular and includes techniques to predict NMR shifts [357, 358], IR spectra [359], and protein fluorescence [360]. Materials-focused studies have looked at predicting the optical properties of metal oxides [361] and analyzing microstructure from SEM data [362], among others. Tables 1 and 2 of ref. 363 summarize many early examples of applying neural networks to MS, NMR, IR, NIR, UV, and fluorescence spectra leading up to the mid-1990s. A more recent overview of learning structure-spectrum relationships and CASE can be found in ref. 364.
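The first step of the DENDRAL-style analysis described above, cross-referencing mass losses against known fragments, is easy to illustrate. The neutral-loss masses below are standard nominal values, but the peak list is invented.

```python
# Sketch of DENDRAL-style fragment inference: match the mass differences
# between successive MS peaks against a table of known neutral losses.
# The loss table uses standard nominal masses; the peak list is invented.

KNOWN_LOSSES = {15: "CH3", 17: "OH", 18: "H2O", 28: "CO", 31: "OCH3"}

def infer_losses(peaks, tolerance=0):
    """Assign each gap between successive (descending) peaks to a loss."""
    peaks = sorted(peaks, reverse=True)
    inferred = []
    for hi, lo in zip(peaks, peaks[1:]):
        gap = hi - lo
        for mass, name in KNOWN_LOSSES.items():
            if abs(gap - mass) <= tolerance:
                inferred.append(name)
    return inferred

# Invented spectrum: a molecular ion at m/z 122 losing CH3, then CO.
print(infer_losses([122, 107, 79]))  # -> ['CH3', 'CO']
```

The hard part of DENDRAL, and of modern CASE tools, comes after this step: enumerating candidate structures consistent with the inferred fragments and ranking them by simulated spectra.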
There is tremendous interest in using machine learning techniques to build surrogate models for computationally expensive ab initio calculations. Models can replace either the entire energy calculation [365–368] or specific parameterized components (e.g., functionals or correlation energies) [369–372]. A prominent example from Roitberg, Isayev, and coworkers is ANAKIN-ME or ANI (accurate neural network engine for molecular energies); ANI is a neural network surrogate model of an energy potential trained on roughly 60,000 DFT calculations [373], and its second generation, ANI-1ccx, is further refined on a smaller set of CCSD(T)/CBS calculations [374] (Figure 16). Active learning strategies can be used to strategically acquire costly training data when training such models [374–376]. The accurate prediction of electronic properties is directly useful for the discovery of organic electronic materials and is a central focus of the Harvard Clean Energy Project [377].

Figure 16: Workflow for refining a surrogate model of electronic structure calculations, originally trained on ωB97x/6-31G* data, on higher-quality CCSD(T)*/CBS data. Figure reproduced from Smith et al. [374].

The desire to create computationally inexpensive surrogate models has underpinned the development of classical force field models for molecular dynamics (MD) simulations [378–380]. Perhaps unsurprisingly, machine learning models can serve as drop-in replacements for heuristic force fields if so trained [381–383] and can assist in coarse-graining for larger-scale simulations [384, 385] or as a post-processing step to analyze simulation results [386, 387]. Structure-aided drug design relies on similar parametric functions for predicting protein-ligand binding. There are many molecular docking programs that propose and score different poses describing ligand interactions with protein targets [388–390].
Scoring functions, meant to provide quantitative measures that correlate with binding affinity, are ideal applications of machine learning techniques. Nonlinear statistical models can help bridge the divide between our pseudo-first-principles models of the underlying chemical interactions and the actual behavior we observe experimentally [390–397].
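A hedged stand-in for such a learned scoring function is sketched below: k-nearest-neighbor regression from docking-derived interaction features to measured affinity. The feature definitions and affinity values are invented, and real rescoring models use far richer featurizations.

```python
# Sketch of a learned scoring function: k-nearest-neighbor regression
# from docking-derived interaction features (here, H-bond count and a
# buried-surface-area proxy) to measured binding affinity (pKd).
# All features and affinities are invented for illustration.

def knn_predict(train, query, k=2):
    """Average affinity of the k training poses nearest in feature space."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda fx: dist(fx[0], query))[:k]
    return sum(aff for _, aff in nearest) / k

train = [
    ((3, 250.0), 7.2),  # (features, measured pKd)
    ((1, 120.0), 5.1),
    ((4, 300.0), 8.0),
    ((0, 80.0), 4.0),
]
print(knn_predict(train, (3, 260.0)))  # averages the two most similar poses
```

The nonlinearity here is trivial (local averaging), but the division of labor is representative: physics-based docking supplies the features, and a statistical model fitted to experimental affinities supplies the score.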
QSAR/QSPR models that describe a molecule or material’s phase behavior can aid computational design by predicting whether a proposed compound is physically realizable in its desired form. For hypothesized metallic alloys, for example, one can predict crystal structures [398–401] and phase behavior [402–407]. For organic molecules, one can similarly predict whether compounds are likely to crystallize easily [408] and their preferred processing-dependent polymorph [409]. Machine learning models can also reduce the number of evaluations required, e.g., for finding minimum energy configurations [400].
Noniterative discovery of new physical matter
The noniterative discovery of new physical matter is a common application of computational learning techniques or automated experimental platforms. This category encompasses experimentation strategies in which search spaces are predefined and explored exhaustively, as well as virtual screening with or without the use of a surrogate model to approximate a structure-function landscape (right half of Figure 17).

A quintessential paradigm in this category is the use of a large dataset from experiments or simulations to train a QSAR/QSPR model, often using some form of machine learning for nonlinear regression, which is then used to screen a large number of candidate compounds or materials. A handful of candidates may be selected for synthesis and validation of the prediction, but the results of that validation are not used to revise the model. This approach essentially constructs a fixed “map” with which to explore the search space and identify promising candidates. It leaves little room for serendipity, as compounds that are not predicted to be useful, even when accounting for uncertainty, are generally not tested, unless the algorithm is explicitly biased toward random exploration.

Figure 17: Taxonomy of strategies for the discovery of new physical matter.
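The fixed-“map” paradigm can be written out schematically: train a surrogate once, rank an enumerated candidate library, and send only the top of the ranking to validation, with no feedback. The 1-nearest-neighbor surrogate and single-descriptor data below are stand-ins chosen only to keep the sketch self-contained.

```python
# Sketch of the noniterative "fixed map" screening paradigm: a surrogate
# model is trained once on existing data, an enumerated candidate library
# is ranked, and only the top-k candidates go to synthesis/validation.
# Nothing is fed back into the model. The 1-nearest-neighbor surrogate
# and descriptor values are illustrative stand-ins.

def train_surrogate(data):
    def predict(x):
        # 1-nearest-neighbor: property of the closest training example
        nearest = min(data, key=lambda dx: abs(dx[0] - x))
        return nearest[1]
    return predict

training_data = [(0.2, 1.0), (0.5, 3.5), (0.9, 2.0)]  # (descriptor, property)
model = train_surrogate(training_data)

library = [0.1, 0.3, 0.45, 0.6, 0.85]        # enumerated candidates
ranked = sorted(library, key=model, reverse=True)
shortlist = ranked[:2]                        # sent for validation
print(shortlist)  # validation results are NOT used to update `model`
```

Contrast this with the iterative strategies discussed later in this review, where the validation results would be appended to `training_data` and the surrogate refit before the next selection.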
In several studies, the part of the discovery process that is automated is not the hypothesis (model building or selection of compounds to test) but the experiment (initial data generation or validation). Experimental automation addresses the practical challenge of validation but not the methodological challenge of how to guide the scientific process or constrain the search space. In general, brute-force experimentation is a productive discovery strategy only when the experimentation is high-throughput in nature. High-throughput experimentation platforms are capable of searching broad design spaces, which makes serendipitous discoveries more likely and places less emphasis on the experiment-selection faculties of the researchers. Note, however, that manual constraint of the design space remains a critical aspect of the process. An ideal high-throughput platform is high-throughput from end to end (synthesis to analysis), but many of the case studies described herein emphasize the development of tools that accelerate just one of these stages.

Among the more interesting developments in HTE for drug discovery are entirely novel methodologies that are uniquely suited to rapid data generation. Despite advances in achieving greater throughput with traditional HTE efforts [410, 411], the space that can be feasibly screened using single-compound-per-well synthesis approaches is often too small to provide many promising bioactive leads. DNA-encoded libraries (DELs), a concept introduced by Brenner and Lerner [100], enable synthesis of compounds for screening at rates of hundreds of compounds per well [412, 413]. An adaptation of the split-and-pool synthesis strategy [414], many modern DEL case studies report theoretical library complexities of hundreds of millions [415–417] or even billions [418–420] of compounds, exceeding the size of the search space of traditional HTE approaches by several orders of magnitude [421].
In light of these impressive synthesis rates, it should be noted that analysis and (if necessary) purification can be rate-limiting. Recent successes of this strategy include the identification of a series of receptor interacting protein 1 (RIP1) kinase inhibitors [418], which are implicated in multiple inflammatory diseases, and an inhibitor of soluble epoxide hydrolase (sEH) with relevance to several disease areas [422]. DNA-encoded chemistry is reviewed in ref. 421.

Another strategy that has proved useful in this area is diversity-oriented synthesis (DOS), which aims to generate structurally (and thereby functionally) diverse collections of small molecules [101]. These strategies may involve reacting a starting material with diverse arrays of reagents in series, or coupling an array of starting materials to one another across strategic functional groups [423]; multicomponent reactions are particularly useful for this application [424]. Tan [101], Spandl et al. [425], Galloway et al. [423], and Garcia-Castro et al. [426] review DOS strategies in detail and describe successful applications for the discovery of lead drug compounds and biological probes.

Other novel approaches to making HTE for drug discovery more efficient focus not on synthesizing larger and/or more diverse libraries, but on screening them more efficiently [427–431]. Development of information-rich, efficient assays is a complex challenge. If the disease target has already been identified and it is possible to prepare the target such that it can be sufficiently isolated, stabilized, and accurately dispensed, then in vitro biochemical assays, which may involve the assessment of target-ligand binding affinity or augmentation of enzymatic activity, are useful [432–434]. These assays can be easily miniaturized and serve as the workhorse of many screening campaigns.
These target-based assays tend to be efficient, but they assess activity against a single target in isolation and ignore the complexities of human physiology and polypharmacology. Cell-based assays can do a better job of capturing activity (because, for example, relevant cofactors are present) while also providing a measure of toxicity and other off-target effects. Despite the added complexity of automatically maintaining and dispensing cell populations, cell-based assays have been adequately automated and miniaturized for compatibility with the highest-density plates [435]. A variety of easily-automated well measurement tools have been developed for compatibility with cell-based assays, including fluorescent detection; the automation and miniaturization of this type of assay for compatibility with HTS has been well reviewed by An and Tolliday [436]. Recently, there has been an increased interest in phenotypic screening [437] along with an increased reliance on computational tools for high-throughput analysis [438, 439].

Many high-throughput synthesis and analysis tools have been developed to facilitate experimentation and discovery in materials science. Combinatorial synthesis methods that yield a single sample containing continuous composition gradients [440] are one notable example. In the seminal demonstration of this technique, Xiang et al. describe a parallelized synthesis method for superconducting copper oxide thin films that varies composition, stoichiometry, and deposition sequence to identify promising compositions [441]. As Senkov et al. point out, this type of experimentation can present unique obstacles to miniaturization in subdomains such as metal alloy design, where experiments of a certain scale are required to observe emergent macro- and mesoscale properties [442].
Since this early demonstration, combinatorial and other HTE methods have been developed in a wide array of materials subfields: to screen solid-state catalyst libraries [443, 444] as well as to discover cobalt-based MOFs [445], photosensitizers for catalyzing photoinduced water reduction [446], mixed metal oxide catalysts [447], adhesive coatings for automotive applications [448], polymers for gene delivery [449], ternary alloys that have high glass-forming ability [450], and others.

Development of high-throughput analytics is critical to avoid analytical bottlenecks; one method highly amenable to parallelization is infrared imaging [443, 451, 452], which has been used to screen arrays of heterogeneous catalysts [453]. Potyrailo and Takeuchi emphasize the diverse properties materials exhibit and the need for a correspondingly diverse array of characterization tools [454]. As an alternative to developing high-throughput characterization techniques, Sun et al. recently reported the rapid exploration of a 96-member library of perovskite-inspired compositions (Figure 18) in which they accelerated screening in part by replacing rate-limiting analytics with a cheaper analysis and a machine learning model [455]. Their study included the first reporting of four lead-free compositions in thin-film form. Reviews pertinent to high-throughput experimental screening for materials discovery are provided by Meier and Schubert [456], Zhao [457], Rajan [458], Hook et al. [459], Potyrailo et al. [460], and Green et al. [440].
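The proxy-measurement idea used to bypass rate-limiting analytics can be reduced to a calibration problem: measure both the cheap and the slow quantity on a small subset, fit a mapping from cheap to slow, and predict the slow measurement for the rest of the library. The one-variable linear model and all numbers below are invented for illustration and assume the two measurements correlate.

```python
# Sketch of replacing a rate-limiting measurement with a cheap proxy plus
# a calibration model: fit cheap -> slow on a calibration subset, then
# predict the slow measurement for the remaining library members.
# All numbers are invented; real models may need to be nonlinear.

def fit_line(pairs):
    """Least-squares slope and intercept for y = m*x + c."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return m, (sy - m * sx) / n

# Calibration subset: (cheap optical reading, slow reference measurement)
calibration = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
m, c = fit_line(calibration)

remaining_library = [0.5, 1.5, 3.0]  # cheap readings only
predicted = [m * x + c for x in remaining_library]
print(predicted)  # -> [2.0, 4.0, 7.0]
```

The payoff scales with the library: the slow measurement is performed only on the calibration subset, while every other sample costs only the cheap reading plus a model evaluation.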
There are a number of noniterative approaches to materials discovery [461] that rely on computational workflows like virtual screening or high-throughput simulation techniques [462–464]; Ong et al. developed a Python library (pymatgen) specifically designed to facilitate these workflows [465]. Curtarolo et al., Pyzer-Knapp et al., and Himanen et al. provide recent reviews on high-throughput virtual screening for materials discovery, specifically describing the process of generating large databases and identifying meaningful trends within them [62, 463, 466].

The predominant use of machine learning in computational materials discovery has been to fit surrogate models to existing (often experimental) data and screen a large design space [467–473]. To the extent that performance can be correlated to structure, these models can reveal opportunities for the design of new catalysts/ligands for organic synthesis [224, 226–228, 474] (Figure 11), metallic catalysts [475, 476], Heusler compounds [477], metal-organic frameworks (MOFs) [478], hybrid organic-inorganic perovskites [479], superhard materials [480], thermal materials [481], organic electronic materials [482–486], polymers for electronic applications [487, 488], porous crystalline materials for gas storage [489, 490], and reductive additives for battery electrolyte formulations [42]. Computational models have also been used to determine when calculations are likely to fail [491] and to identify associations between materials and specific property keywords through text mining [492].

Figure 18: Workflow for the (relatively) rapid screening of a combinatorial space of perovskite-like compositions. Figure reproduced from Sun et al. [455].

A few trends are apparent in the experimental validation of these frameworks. First, the confidence of computational predictions is intimately coupled to the quality and applicability of the model.
Second, experimental validation is often preceded by extensive manual filtering of the computationally prioritized compounds to take into account factors such as synthesizability, laboratory capabilities, and (human-)perceived suitability for the discovery objective [484, 493, 494]. As with organic molecules, there can be a misalignment between the compounds one would like to test (those computationally predicted to achieve a desired function) and what can be realized (synthesized) experimentally. In one example, as a conservative filter for synthesizability, Sumita et al. require that proposed molecules have at least one known synthetic route reported in SciFinder [495]. Third, there are often discrepancies between the predictions made by a surrogate model and the values determined experimentally, and in some cases the discrepancies are large enough to have a substantial bearing on the desired performance [493]. These latter two trends imply that the pertinent features of promising materials are often not fully captured by the algorithms developed to date. Thus, experimental validation is acutely relevant in this area.

Computational screening, with or without experimental validation, is also a common strategy for identifying promising therapeutic candidates. Many reviews on the use of virtual screening in drug discovery exist, including Schneider [39], Sliwoski et al. [496], Macalino et al. [497], Lavecchia [498], Wingert and Camacho [499], Zhang et al. [83], and Panteleev et al. [500]; these emphasize the use of machine learning methods to generate the surrogate QSAR/QSPR models that guide the VS process. Walters recently reviewed strategies for library enumeration in the drug discovery space [31].
These range from applying known reaction transformations to available molecules in order to define make-on-demand libraries [147, 501–506] to generative strategies, which are discussed later in the context of de novo design of singular lead compounds. Make-on-demand libraries have the advantage that candidates selected for follow-up experimental validation should be readily synthesizable, with some exceptions (10-20% of compounds, anecdotally) due to imperfect enumeration rules.

For drug discovery applications, virtual screening is often divided into two categories: structure-based [507–509] and ligand-based [510, 511]. Structure-based VS relies on scoring functions that relate information about a molecule and the target protein to the binding affinity between the two. Docking analysis is a common paradigm in structure-based VS. Many software packages for this purpose exist, including AutoDock [512], FlexX [513], GOLD [514], and Glide [515], and they have been extensively reviewed and compared [388, 516]. Ligand-based strategies, in contrast, make no direct consideration of the structure of the target protein. Instead, they rely on QSAR models and/or direct similarity assessments [517] that compare library compounds to a reference compound that exhibits desired properties. Many algorithms exist for making these similarity comparisons [518, 519].

Studies that validate virtual screening strategies by synthesizing and testing the compounds identified by their workflow include refs. 506, 520, and 521; we note that such validation is routine and expected in industrial drug discovery campaigns. In one example, Hoffer et al. combine virtual screening with partially-automated synthesis and testing in a workflow for hit-to-lead optimization that they call diversity-oriented target-focused synthesis, or DOTS [522, 523] (Figure 19).
The DOTS framework begins with a hit fragment around which a virtual library is enumerated through the in silico application of common synthetic reactions that combine the hit with commercially available building blocks. Next, the authors apply their docking program, S4MPLE [524], which uses an evolutionary algorithm for conformational sampling, to select the compounds with favorable target interactions. Finally, the high-priority set is synthesized using a Chemspeed robotic synthesis platform to carry out expert-defined syntheses and subjected to in vitro evaluation. In one case study exploring optimization of an inhibitor of one of the two bromodomains of the BRD4 protein (which is implicated in inflammatory and cardiovascular diseases and cancer), all seventeen of the high-priority compounds had higher pIC50 values than the initial hit.

Figure 19: Workflow for diversity-oriented target-focused synthesis (DOTS). Experimental steps are in orange; computational steps are in blue. Figure reproduced from Hoffer et al. [522].
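The ligand-based similarity assessments mentioned earlier typically reduce to fingerprint comparisons such as the Tanimoto coefficient. The sketch below uses plain Python sets of "on" bits as toy stand-ins for hashed substructure fingerprints; the compound names and bit assignments are invented.

```python
# Sketch of a ligand-based similarity screen: compare substructure
# fingerprints (here, sets of "on" bit indices) to a reference active
# compound using the Tanimoto coefficient (intersection over union).
# Bit assignments and compound names are invented.

def tanimoto(fp_a, fp_b):
    """Size of the shared bits divided by size of the combined bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

reference = {1, 4, 9, 12, 20}                  # known active compound
library = {
    "cpd-1": {1, 4, 9, 12, 21},
    "cpd-2": {2, 5, 7},
    "cpd-3": {1, 4, 9, 12, 20, 33},
}
hits = sorted(library, key=lambda n: tanimoto(library[n], reference),
              reverse=True)
print(hits[0])  # -> 'cpd-3', the most similar library compound
```

In practice, libraries of millions of compounds are screened this way with bit-packed fingerprints, and a similarity threshold (often around 0.85 for this family of fingerprints, though the appropriate cutoff is debated) defines the hit list.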
All of the case studies above examine search spaces defined by discrete sets of candidates. These candidate sets consist either of the compounds already existing in a database or library of interest, or they are somehow systematically enumerated. While some of these libraries are quite large (for example, the 170 million make-on-demand compounds from ref. 506 or the 11 billion in the REAL database [525]), their discrete nature constrains the search space. Computational techniques such as deep generative models, in which molecules are generated, manipulated, and/or optimized in a continuous latent space (Figure 20), have emerged as a means of overcoming the finiteness of discrete candidate sets (and, more specifically, as an alternative to earlier design approaches based, e.g., on genetic algorithms [526]). These models are predicated on the assumption that the generated compounds, by virtue of being drawn from the same distribution as the training molecules, will inherit the training molecules’ important properties, such as stability and synthesizability, while being biased toward a specific property of interest (e.g., bioactivity) [90, 92]. Experimental validation is uniquely relevant for these techniques since they are not based on first-principles calculations, interpretable QSARs, or well-vetted heuristics, but rather on neural models that create an obfuscated approximation of the distribution within chemical space and an underlying structure-function landscape.

Figure 20: Diagram of an autoencoder for molecular discovery. Figure reproduced from Gómez-Bombarelli et al. [527].

In an early example of the adaptation of deep generative networks to the pharmaceutical space, Kadurin et al. describe the development of an adversarial autoencoder (AAE), which wraps an autoencoder in the generative adversarial network (GAN) training framework [88], to identify antitumor agents based on existing MCF-7 cell line assay data [528] (see ref. 529 for the original implementation and ref.
530 for the improved technique, DruGAN). In another early example, Gómez-Bombarelli et al. applied a VAE operating on SMILES strings (following the decoding approach of Bowman et al. [531]) for the latent-space optimization of molecules with respect to druglikeness and synthetic accessibility metrics, demonstrating superior performance to random search and a genetic algorithm when initialized on low-performing molecules [527].

Generative models with RNN encoder-decoders have emerged as one of the major paradigms in de novo drug design [532, 533]. For example, Yuan et al. use a character-level RNN [535] to generate virtual libraries of SMILES strings [534]. By training their model on 25,000 known VEGFR-2 inhibitors, they were able to generate a library enriched with high-affinity ligands relative to target-agnostic screening libraries, as judged by a computational docking program. Five of the highest-affinity ligands were selected for synthesis and testing, and two were found to be more potent than vatalanib, a known inhibitor. Bjerrum and Threlfall adopted a similar approach using the ZINC12 database to train their model [536]. Their emphasis on evaluating the synthetic accessibility of the molecules that their model designed reflects the extent to which generative models have failed in this area historically. Combining this strategy with reinforcement learning to generate molecules that are similar to a seed compound [537], molecules that have high predicted bioactivity against a particular target [533, 537, 538], molecules that otherwise have desirable druglike properties (such as chemical beauty and Lipinski rule compliance) [539], and molecules that represent an internally diverse set [532] has proven fruitful, as have applications to peptide design [540, 541].
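The latent-space optimization idea underlying several of these studies can be caricatured without any neural network: treat the latent point as a continuous vector and hill-climb it against a property oracle. In a real generative model the oracle is the decoder followed by a learned property predictor; here it is an invented smooth function with a maximum at z = (1, -2).

```python
# Caricature of latent-space optimization: hill-climb a continuous
# latent vector against a property oracle. In a real VAE workflow the
# oracle would be decode(z) plus a learned property predictor; here it
# is an invented smooth function maximized at z = (1.0, -2.0).

def property_score(z):
    return -((z[0] - 1.0) ** 2 + (z[1] + 2.0) ** 2)  # toy objective

def hill_climb(z, step=0.5, iters=200):
    for _ in range(iters):
        candidates = [z]                      # staying put is allowed
        for i in (0, 1):
            for delta in (-step, step):
                zc = list(z)
                zc[i] += delta
                candidates.append(tuple(zc))
        z = max(candidates, key=property_score)
    return z

print(hill_climb((0.0, 0.0)))  # converges to (1.0, -2.0)
```

The appeal over discrete enumeration is that every intermediate point is itself a candidate: the walk through latent space can, in principle, pass through molecules absent from any enumerated library. The well-documented caveat is that decoded points are not guaranteed to be valid, stable, or synthesizable molecules.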
Transfer learning in the form of model pretraining has also been useful for overcoming the disadvantages inherent in low-data domains, for example, to design modulators of therapeutically relevant nuclear receptors [542, 543].

As Jin et al. point out, a failing of the SMILES string representation in the molecular generation context is that a single molecule can usually be mapped to several distinct, valid SMILES strings, which complicates the creation of a latent space that varies smoothly from one molecule to another, similar one. They contribute an alternative approach, the junction tree variational autoencoder, that generates molecular graphs rather than SMILES strings, demonstrating both the generation of a library of valid molecules and latent-space optimization of molecules according to a joint logP-synthetic accessibility objective function [544].

Several of the generative model case studies cited herein include experimental validation [532, 534, 540, 541, 543, 545, 546], although the validation was not automated and the sets of generated molecules often required extensive filtering before selection for synthesis. Schneider and Clark review fragment-based de novo drug discovery efforts that specifically include experimental validation [41]. They also highlight the fact that de novo efforts are often plagued by synthesizability issues and advocate for the incorporation of CASP software into the workflow to help address this. In lieu of experimentation, some studies validate the capabilities of generative models by comparing distributions of properties of the generated molecules to those of the training set [547, 548]. See [90–92] for detailed reviews of the methods for and applications of generative models in chemistry and molecular design.

5.7 Iterative discovery of new physical matter
One rarely has a perfect understanding of a structure-function landscape, particularly in discovery applications where data can be limited. In this section, we focus on case studies in which computer assistance is applied to at least the experimental selection aspect of an iterative discovery workflow (left half of Figure 17). In general, iterative discovery of new physical matter centers around a structure-function model that is used to reason about which experiments to perform. The results of the experiments are then used to devise the subsequent round and to update the structure-function model so that more accurate predictions can be made. Iterative strategies like active learning and Bayesian optimization can be used to augment the set of available data efficiently, focusing on informative and/or promising experiments within the search space [84, 85]. Other iterative strategies include evolutionary algorithms, which operate by mutating candidates directly based on validation data from an experiment, a simulation, or a surrogate model. Model-free strategies in which validation and feedback inherently drive experimental selection (e.g., directed evolution and continuous evolution techniques [52, 412]) are outside the scope of this review, but do represent an important class of autonomous experimental platforms.
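As a concrete sketch of such an iterative loop, the toy code below selects "experiments" from a discrete candidate pool using a simple nearest-neighbor surrogate and an upper-confidence-bound rule to balance exploration and exploitation. The objective function stands in for an experimental assay; none of the names or parameters correspond to any cited platform:

```python
import math
import random
import statistics

def knn_predict(labeled, x, k=3):
    """Mean and std of the k nearest labeled points: a toy surrogate model."""
    nearest = sorted(labeled, key=lambda xy: abs(xy[0] - x))[:k]
    ys = [y for _, y in nearest]
    mu = statistics.fmean(ys)
    sigma = statistics.pstdev(ys) if len(ys) > 1 else 1.0
    return mu, sigma

def select_next(pool, labeled, beta=1.0):
    """Upper-confidence-bound acquisition: exploit high mu, explore high sigma."""
    def ucb(x):
        mu, sigma = knn_predict(labeled, x)
        return mu + beta * sigma
    return max(pool, key=ucb)

def run_active_learning(objective, pool, n_init=3, budget=15, rng=random):
    pool = list(pool)
    labeled = []
    for x in rng.sample(pool, n_init):      # random initial "experiments"
        labeled.append((x, objective(x)))
        pool.remove(x)
    for _ in range(budget - n_init):        # iterative selection loop
        x = select_next(pool, labeled)
        labeled.append((x, objective(x)))   # run the (simulated) assay
        pool.remove(x)
    return max(labeled, key=lambda xy: xy[1])

# Toy structure-activity landscape over a pre-enumerated candidate library.
f = lambda x: math.exp(-(x - 2.7) ** 2)     # peak "activity" near x = 2.7
candidates = [i / 10 for i in range(100)]
random.seed(0)
best_x, best_y = run_active_learning(f, candidates)
```

The same skeleton accommodates other acquisition rules (pure greedy, pure uncertainty, expected improvement) by swapping out `ucb`.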
Given the complexity of the assays involved in characterizing new drug compounds, active learning is especially beneficial in the drug discovery domain to reduce the experimental burden and accelerate the search [549, 550]. Pool-based active learning (Figure 21) refers to the selection of candidates from a discrete, pre-enumerated set of options [84]. Active learning has been deployed in this setting both for information, to build accurate models of the activity of potential drug compounds against specified targets [551, 552], and for performance, to identify active compounds in as few experiments as possible [551, 553]. These aims are not mutually exclusive; in order to identify high-performing candidates, it is often the case that these algorithms must be designed to select both experiments where activity is expected to be highest and experiments that support overall model accuracy, i.e., effectively balancing exploration and exploitation [554–557].

A particularly noteworthy example of the pool-based active learning strategy is the platform Eve, which was designed to conduct autonomous, closed-loop hit identification [558, 559]. Experimentally, Eve is endowed with the capacity to rapidly screen predefined compound libraries against a variety of biological assays at a rate of >10,000 compounds per day. The platform uses collected data to create a surrogate model of the structure-activity landscape and then selects subsequent rounds of compounds with high predicted activity, selectivity, and/or prediction variance, rather than exhaustively exploring its search space. The authors created an econometric model of the drug discovery process that accounts for (a) the utility of a hit, (b) the utility of the reduction in experimental space that needs to be screened, and (c) the cost of missed hits, among other factors, and found that it is typically more economical to use active learning than brute-force screening. The success of Eve depends on access to a large library of compounds that contains at least one acceptably high-performing compound.

Figure 21: Workflow for pool-based active learning to identify compounds that bind strongly to proteins within an N compound × M protein space of interactions. Figure reproduced from Kangas et al. [554].

Increasing the size of the search space increases the likelihood that it contains a high-performing global optimum (although that optimum may be difficult to identify). In experimental settings, the ability to synthesize compounds on demand expands the search space beyond the set of in-stock compounds. Desai et al. describe a microfluidic platform able to produce compounds on demand via a one-step Sonogashira coupling [556]. Impressively, the platform integrates synthesis with online purification, dilution, and an activity assay against Abl1 and Abl2 kinases. A random forest model was created to approximate the structure-activity landscape and guide experiment selection. Experiments were chosen via one of two approaches, one greedy approach to maximize expected activity and one to explore undersampled chemical space, and the results were used to update the surrogate model [560]. Despite the expansion of the search space achieved by incorporating synthesis capabilities, the design space explored in this example remains extremely narrow; a brute-force search would have been tractable and potentially faster if parallelized. Still, this study is an excellent proof of concept for closed-loop synthesis, purification, and testing. More flexible synthesis-purification and synthesis-purification-testing platforms have been developed but have only been applied to open-loop discovery with manual compound design and synthesis planning [12, 427, 561, 562] (Figure 22). Table 2 in ref. 563 reviews some additional examples of integrated synthesis and testing.

Figure 22: Integrated platform for open-loop synthesis, purification, and testing from AbbVie.
Figure reproduced from Baranczak et al. [562].

Evolutionary strategies are another means of expanding the search space beyond a set of pre-enumerated candidates. Besnard et al. employ one such technique in which candidate compounds evolve as part of the iterative process [564]. Specifically, on each iteration, new candidates are evolved from the highest performers of the previous generation through the application of structural transformations from the medicinal chemistry literature. Here, the discovery workflow relies on the development of accurate QSAR models trained on ChEMBL data to ensure alignment between in silico performance and experimental activity. Firth et al. use a related technique that evolves molecules using a fragment replacement routine (RATS, rapid alignment of topological scaffolds), which enables a less constrained exploration of chemical space than does reaction enumeration; this approach was demonstrated on a surrogate multi-objective function (including a model of CDK2 activity) starting from a known active scaffold, with manual experimental validation of a small library of recommendations [565].

Genetic algorithms (GAs) are related to the approaches used by Besnard et al. and Firth et al. Candidate compounds (experiments) are proposed as mutations of a parent compound whose performance is known; allowed mutations serve as a constraint on the search space and the optimization trajectory [566] (see Figure 23 for an illustration of one implementation and the corresponding algorithm flowchart). In contrast to active learning, GAs tend to use static fitness functions for compound scoring (although a few rely on experimental outcomes instead [567]). In a very early example of iterative molecular optimization using a genetic algorithm, Weber et al. describe the identification of inhibitors of the serine protease thrombin; 16 generations led to the identification of sub-micromolar inhibitors [252].
The key to this approach is that the design space was defined by discrete substrate choices in a four-component Ugi-type reaction to ensure straightforward (albeit manual) synthesis and testing. Other iterative strategies for drug discovery include the in silico application of synthetic transformations to generate molecules that are scored based on their similarity to a target molecule and are, in principle, synthetically accessible [569].

GAs have been successfully used to conduct both single- and multi-objective optimizations that account for factors including target protein binding affinity [571, 572], cost and bioavailability [573], and similarity to a chosen compound [570, 573–575]. Many strategies are fragment-based, operating in the manner described above, although some also allow atom-level mutations [576]. Strategies that operate directly on molecular graphs have also been proposed [577], and in the case of peptide design, one can operate directly on sequences [578]. Despite the increased interest in deep learning techniques, genetic algorithms remain a powerful strategy for exploring chemical space [579]. A number of reviews describe drug discovery applications of GAs and evolutionary algorithms more generally [526, 580–583].

Iterative experimental design strategies are increasingly being adopted to guide discovery in the materials space. For example, Xue et al. use the Bayesian optimization framework EGO [585] to select experiments from a discretized composition space that ultimately led them to NiTi-based shape memory alloys delivering low thermal hysteresis [584]. Similar approaches have been used for the same [586] and other applications, to discover BaTiO3-based piezoelectric materials with large electrostrains [587] and to optimize melting temperature [588].
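EGO-style Bayesian optimization couples a probabilistic surrogate with an expected-improvement acquisition function. The sketch below illustrates that loop on a toy one-dimensional "composition" landscape; a simple distance-weighted surrogate stands in for the Gaussian process used in practice, and all names and the landscape are illustrative:

```python
import math

def surrogate(observations, x, length_scale=0.5):
    """Distance-weighted mean/uncertainty: a stand-in for a Gaussian process."""
    ws = [math.exp(-((xi - x) / length_scale) ** 2) for xi, _ in observations]
    mu = sum(w * yi for w, (_, yi) in zip(ws, observations)) / sum(ws)
    # Uncertainty grows as x moves away from all observed points.
    sigma = math.sqrt(max(1e-9, 1.0 - max(ws)))
    return mu, sigma

def expected_improvement(observations, x):
    """EI acquisition: how much do we expect x to beat the best point so far?"""
    best = max(y for _, y in observations)
    mu, sigma = surrogate(observations, x)
    z = (mu - best) / sigma
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

def bayes_opt(objective, grid, n_iter=10):
    # Seed with the two endpoints, then pick the EI-maximizing grid point.
    obs = [(grid[0], objective(grid[0])), (grid[-1], objective(grid[-1]))]
    for _ in range(n_iter):
        x = max(grid, key=lambda g: expected_improvement(obs, g))
        obs.append((x, objective(x)))
    return max(obs, key=lambda xy: xy[1])

# Toy property landscape over a discretized composition variable.
f = lambda x: -(x - 0.63) ** 2          # optimum near x = 0.63
grid = [i / 50 for i in range(51)]
x_best, y_best = bayes_opt(f, grid)
```

Because EI is near zero at already-observed points (where the surrogate uncertainty collapses), the loop naturally alternates between filling in unexplored regions and refining around the incumbent best.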
Compared to other domains, materials discovery tends to be relatively conducive to fully computational approaches since the properties we can calculate are more directly relevant to the functions we wish to optimize; as a result, several studies have used Bayesian optimization to select compounds for evaluation with calculations or simulations, rather than experiments, e.g., to optimize thermal conductivities [589] and elastic moduli [590]. Additionally, some instances of generative models (described in the noniterative discovery section above) incorporate Bayesian optimization to optimize compounds for desirable performance [527].

Related active learning strategies have been deployed in materials development as well. For example, Tran and Ulissi use DFT to validate candidates proposed through active learning from a fixed pool of 1,499 intermetallic compounds from the Materials Project, with the goal of identifying high-performance electrocatalysts for CO2 reduction and H2 evolution [591]. Gubaev et al. use active learning to efficiently create a DFT surrogate for predicting the convex hulls of metallic alloy systems, ultimately discovering previously unknown stable alloys [592]. Even iterative greedy searches have proven effective in prioritizing simulations of a fixed library of candidate materials for hydrogen storage [593].

Several recent studies combine GAs, surrogate models for electronic structure calculations, and active learning for the discovery of spin crossover complexes and transition metal catalysts [594, 595] (Figure 24). Treating the calculations as reliable measures of performance, these studies represent fully autonomous discovery within the space of organometallic complexes the genetic algorithm is able to explore.
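The generic loop underlying GA-based approaches like these can be sketched as follows. Here a candidate is a tuple of building-block choices (loosely inspired by the multicomponent-reaction libraries discussed above), and the fitness function is a toy stand-in for a measured or predicted property; the slot counts, target, and rates are all illustrative:

```python
import random

# Toy combinatorial design space: one building-block choice per slot.
N_SLOTS, N_CHOICES = 4, 10

def fitness(candidate):
    """Toy stand-in for an assay or simulation; higher is better (max = 0)."""
    target = (3, 1, 4, 1)  # arbitrary "best" combination of choices
    return -sum((a - b) ** 2 for a, b in zip(candidate, target))

def mutate(candidate, rng, rate=0.3):
    """Randomly resample some slots of a parent candidate."""
    return tuple(rng.randrange(N_CHOICES) if rng.random() < rate else g
                 for g in candidate)

def crossover(a, b, rng):
    """Pick each slot from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def evolve(pop_size=20, n_generations=16, n_parents=5, rng=None):
    rng = rng or random.Random(0)
    pop = [tuple(rng.randrange(N_CHOICES) for _ in range(N_SLOTS))
           for _ in range(pop_size)]
    for _ in range(n_generations):
        # Elitist selection: keep the best parents, breed the rest.
        parents = sorted(pop, key=fitness, reverse=True)[:n_parents]
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents),
                                     rng), rng)
                    for _ in range(pop_size - n_parents)]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Replacing `fitness` with an experimental readout recovers the Weber-style setup, while replacing it with a learned surrogate recovers the GA-plus-active-learning hybrids cited here.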
Among the many additional applications of GAs to materials discovery are polymer design [566], the identification of stable four-component alloys [596], promising polymers for OPVs [597], MOFs for carbon capture [598], and polymer dielectrics with user-defined dielectric constants and bandgaps [599]. An excellent review of applications of active learning and Bayesian optimization to materials development is provided in ref. 600.

The utility of computational validation for materials discovery is somewhat offset by the complexity of experimental validation. As stated above, the synthesis and analysis of materials and devices can be difficult to automate. A recent study by MacLeod et al. (Figure 25) serves as an excellent example of automating more than simple solution-phase chemistry [601]. Precursor solutions are spin cast into thin films, which are thermally annealed and analyzed for their optical and electronic properties. The platform, Ada, uses ChemOS [96] for hardware orchestration and Phoenics [292] for Bayesian optimization to explore a two-dimensional continuous search space of composition and annealing temperature; its objective is to optimize a pseudomobility metric that correlates with hole mobility. While what Ada measures is still a proxy for the true performance of a multilayer device, the ability to miniaturize and automate fabrication processes like thin film casting expands the scope of problems able to be tackled by autonomous platforms.

Figure 23: (Top) Schematic illustration of one approach to genetic algorithm-based drug design: predocked fragments (black) are linked to fragments (red) from a user-supplied list, with the target protein and its polar groups indicated in blue. (Bottom) The flow chart of the algorithm. Figure reproduced from Dey and Caflisch [570].

Figure 24: Workflow for the iterative design of transition metal complexes (TMCs) using a genetic algorithm and automated DFT calculations. Figure reproduced from Nandy et al.
[595].

Figure 25: The autonomous platform Ada for optimizing optoelectronic properties of spin cast thin films. Figure reproduced from MacLeod et al. [601].

5.8 Brief summary of discovery in other domains

There are many more attempts to automate aspects of the discovery process and incorporate machine learning into scientific workflows than can be mentioned here. Some pertinent to biology and human health are summarized below. A more comprehensive, collaborative review can be found in ref. 602.
To identify correlations from text mining:
The enormous size of the scientific literature makes it challenging to analyze holistically. Information retrieval tools are needed to bring together relevant pieces of information, either for automated analysis or for manual inspection. ARROWSMITH was an early system for the latter use [603]; it identified MEDLINE abstracts with overlapping terms that potentially indicate an implicit causal relationship. This approach was used to discover a number of testable hypotheses, including a link between magnesium and migraines [604]. Literature mining is an essential tool for organizing biological data and enabling computational studies [38, 605–609].
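The overlapping-terms idea can be illustrated with a toy sketch (this is not ARROWSMITH's actual implementation, and the mini-abstracts below are fabricated for illustration): terms that co-occur with one topic in one literature and with a second topic in another suggest an implicit link between the two topics.

```python
# Toy ARROWSMITH-style linking: shared terms between two disjoint literatures
# hint at a hidden connection between their topics (here, magnesium and migraine).
abstracts_a = [  # fabricated "abstracts" about topic A (magnesium)
    "magnesium deficiency alters vascular tone",
    "magnesium modulates calcium channel activity",
]
abstracts_c = [  # fabricated "abstracts" about topic C (migraine)
    "migraine attacks involve changes in vascular tone",
    "calcium channel blockers reduce migraine frequency",
]
stop = {"the", "in", "of", "and", "with", "alters", "involve", "reduce", "changes"}

def terms(abstracts):
    """Bag of content words across a set of abstracts."""
    return {w for doc in abstracts for w in doc.split()} - stop

# Bridging terms appear in both literatures and suggest an implicit A-C link.
bridging_terms = terms(abstracts_a) & terms(abstracts_c)
```

A real system would operate over MEDLINE-scale corpora with proper tokenization, term weighting, and expert curation of the candidate links.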
To identify trends in genomics data:
The vast quantity of structured genetic information brought about by the Human Genome Project and advances in DNA sequencing is well suited for data mining. As one example, probabilistic graph models can be built from genetic, protein-interaction, and metabolic pathway information to propose hypotheses for gene functions [610, 611]. Two practical introductions to machine learning for genomics can be found in ref. 612 and ref. 613.
To engineer new proteins:
Protein engineering through directed evolution requires navigating a high-dimensional and discontinuous structure-function landscape. Random mutagenesis navigates this space blindly, one step at a time, while site-directed mutagenesis requires knowledge of which amino acid positions to perturb. Supervised machine learning models and other statistical techniques can assist in the selection of mutants over multiple rounds of evolution, particularly when dealing with nonadditive effects, as reviewed by Yang et al. [614]. Computational techniques are used for protein engineering in many other ways [615], including protein structure prediction [616, 617].
To identify gene/enzyme relationships:
A high-profile example of an automated platform for molecular genetics is King et al.'s Adam [618–620]. In the original demonstration, Adam was made aware of the aromatic amino acid synthesis pathway in yeast, hypothesized which of 15 open reading frames (ORFs) encoded which enzyme, and selected growth experiments to perform (choosing one knockout mutant out of 15 options and one to two metabolites out of 9 options). While this almost represented a truly closed-loop system, there were still manual steps involved in transferring well plates between the liquid handler, incubator, and plate reader. Additionally, these experimental and hypothesis spaces are extraordinarily narrow: the model's accuracy using an active search strategy was 80.1%, compared to 74.0% for a naive method that chose the cheapest experiment yet to be performed. In ref. 620, King acknowledges the criticism that “the new scientific knowledge was implicit in the formulation of the problem, and is therefore not novel”.
In the first part of this review, we have defined three broad categories of discovery (physical matter, processes, and models) and suggested guidelines for evaluating the extent to which a scientific workflow can be described as autonomous: (i) How broadly is the goal defined? (ii) How constrained is the search/design space? (iii) How are experiments for validation/feedback selected? (iv) How superior to a brute-force search is navigation of the design space? (v) How are experiments for validation/feedback performed? (vi) How are results organized and interpreted? (vii) Does the discovery outcome contribute to broader scientific knowledge?

As illustrated by the case studies we have included, there has been substantial progress in developing methods that build toward autonomous discovery. Yet there are few examples of true closed-loop discovery for all but the narrowest design spaces, often a consequence of the complexity of automating experimental validation.

We continue to face both practical and methodological challenges in our quest for autonomous discovery. The second part of this review will reflect on a selection of case studies in terms of the questions we pose and then describe the remaining challenges where further development is required.
We thank Thomas Struble for providing comments on the manuscript and our other colleagues and collaborators for useful conversations around this topic. This work was supported by the Machine Learning for Pharmaceutical Discovery and Synthesis Consortium and the DARPA Make-It program under contract ARO W911NF-16-2-0023.
References

[1] A. M. Turing,
Mind , LIX , 433–460.[2] G. F. Bradshaw, P. W. Langley, H. A. Simon,
Science , , 971–975.[3] A. M. Turing, Computers & Thought , (Eds.: E. A. Feigenbaum, J. Feldman), MIT Press, Cambridge,MA, USA, , pp. 11–35.[4] P. Langley,
Int. J. Hum. Comput. Stud. , , 393–410.[5] R. E. Valdés-Pérez, Artif. Intell. , , 335–346.[6] A. Sparkes et al., Automated Experimentation , , 1.[7] P. D. Sozou, P. C. Lane, M. Addis, F. Gobet in Springer Handbook of Model-Based Science , SpringerHandbooks, Springer International Publishing, , pp. 719–734.[8] M. Peplow,
Nature , , 20–22.[9] C. Houben, A. A. Lapkin, Curr. Opin. Chem. Eng. , , 1–7.[10] D. E. Fitzpatrick, C. Battilocchio, S. V. Ley, Org. Process Res. Dev. , , 386–394.[11] B. J. Reizman, K. F. Jensen, Acc. Chem. Res. , , 1786–1796.[12] A. G. Godfrey, T. Masquelin, H. Hemmerle, Drug Discov. Today , , 795–802.[13] J. Li, S. G. Ballmer, E. P. Gillis, S. Fujii, M. J. Schmidt, A. M. E. Palazzolo, J. W. Lehmann, G. F. Morehouse, M. D. Burke, Science , , 1221–1226.[14] D. Lowe, Automated Chemistry: A Vision, .[15] S. Steiner et al., Science , , eaav2211.[16] G. Schneider, Nat. Rev. Drug Discov. , , 97–113.[17] S. Smith, 141 Startups Using Artificial Intelligence in Drug Discovery, https://blog.benchsci.com/startups-using-artificial-intelligence-in-drug-discovery (visited on 07/30/2019).[18] P. W. Anderson, E. Abrahams, Science , , 1515–1516.[19] S. H. Muggleton, Nature , , 409–410.[20] D. Waltz, B. G. Buchanan, Science , , 43–44.[21] A. Sharafi, Knowledge Discovery in Databases , Springer Fachmedien Wiesbaden, Cambridge, MA, USA, .[22] P. S. Gromski, A. B. Henson, J. M. Granda, L. Cronin,
Nat. Rev. Chem. , , 119–128.[23] P. Langley, H. A. Simon, G. L. Bradshaw, J. M. Zytkow, Scientific Discovery: Computational Explorations of the Creative Process , MIT Press, Cambridge, MA, USA, .[24] D. Klahr,
Cogn. Sci. , , 1–48.[25] T. I. Oprea, J. Gottfries, J. Comb. Chem. , , 157–166.[26] C. Lipinski, A. Hopkins, Nature , , 855–861.[27] C. M. Dobson, Nature , , 824–828.[28] J.-L. Reymond, Acc. Chem. Res. , , 722–730.[29] R. S. Bohacek, C. McMartin, W. C. Guida, Med. Res. Rev. , , 3–50.[30] K. L. M. Drew, H. Baiman, P. Khwaounjoo, B. Yu, J. Reynisson, J. Pharm. Pharmacol. , , 490–495.[31] W. P. Walters, J. Med. Chem. , , 1116–1124.[32] J. P. McMullen, K. F. Jensen, Org. Process Res. Dev. , , 1169–1176.[33] B. J. Reizman, Y.-M. Wang, S. L. Buchwald, K. F. Jensen, React. Chem. Eng. , , 658–666.[34] S. E. Denmark, B. L. Christenson, D. M. Coe, S. P. O’Connor, Tetrahedron Lett. , , 2215–2218.[35] S. L. Schreiber, Science , , 1964–1969.[36] F. H. Arnold, Acc. Chem. Res. , , 125–131.[37] M. Schmidt, H. Lipson, Science , , 81–85.[38] B. M. Gyori, J. A. Bachman, K. Subramanian, J. L. Muhlich, L. Galescu, P. K. Sorger, Mol. Syst. Biol. , , 954.[39] G. Schneider, Nat. Rev. Drug Discov. , , 273–276.[40] S. K. Saikin, C. Kreisbeck, D. Sheberla, J. S. Becker, A. Aspuru-Guzik, Expert Opin. Drug Discovery , , 1–4.[41] G. Schneider, D. E. Clark, Angew. Chem. Int. Ed. , , 10792–10803.[42] M. D. Halls, K. Tasaki, J. Power Sources , , 1472–1478.[43] E. L. Gettier, Analysis , , 121.[44] K. Popper, Conjectures and Refutations: The Growth of Scientific Knowledge , Routledge, .[45] F. Bacon,
Novum organum , Google-Books-ID: tH4_AAAAYAAJ, Clarendon Press, , 644 pp.[46] P. Giza,
J. Exp. Theor. Artif. Intell. , , 1053–1069.[47] D. Silver et al., Nature , , 354–359.[48] D. Silver et al., Science , , 1140–1144.[49] D. Klahr, A. Fay, K. Dunbar, Cognitive Psychology , , 111–146.[50] L. M. Baker, K. Dunbar in Proceedings of the 18th Annual Conference of the Cognitive ScienceSociety, Erlbaum, Mahwah, NJ, , pp. 154–159.[51] M. L. Cummings, S. Bruni in Springer Handbook of Automation , Springer Berlin Heidelberg, ,pp. 437–447.[52] M. S. Packer, D. R. Liu,
Nat. Rev. Genet. , , 379–394.[53] P. Langley, J. M. Zytkow, Artif. Intell. , , 283–312.[54] L. Breiman, Statist. Sci. , , 199–231.[55] G. Shmueli, Statist. Sci. , , 289–310.[56] R. Roscher, B. Bohn, M. F. Duarte, J. Garcke, arXiv preprint arXiv:1905.08883 .[57] B. L. Claus, D. J. Underwood, Drug Discov. Today , , 957–966.[58] W. A. Warr, Mol. Inf. , , 469–476.[59] A. Gaulton et al., Nucleic Acids Res. , , D1100–D1107.[60] D. J. Rigden, X. M. Fernández, Nucleic Acids Res. , , D1–D7.[61] J. Hill, G. Mulholland, K. Persson, R. Seshadri, C. Wolverton, B. Meredig, MRS Bull. , ,399–409.[62] L. Himanen, A. Geurts, A. S. Foster, P. Rinke, arXiv:1907.05644 [cond-mat physics:physics] .[63] Y. Gil, M. Greaves, J. Hendler, H. Hirsh, Science , , 171–172.[64] V. G. Honavar, Review of Policy Research , , 326–330.[65] Y. Gil, H. Hirsh in 2012 AAAI Fall Symposium Series, .[66] M. Krallinger, F. Leitner, O. Rabal, M. Vazquez, J. Oyarzabal, A. Valencia, J. Cheminform. , , S1.[67] M. C. Swain, J. M. Cole, J. Chem. Inf. Model. , , 1894–1904.[68] M. Krallinger, O. Rabal, A. Lourenço, J. Oyarzabal, A. Valencia, Chem. Rev. , , 7673–7761.[69] E. Kim, K. Huang, A. Saunders, A. McCallum, G. Ceder, E. Olivetti, Chem. Mater. , , 9436–9444.[70] Z. Zhai, D. Q. Nguyen, S. A. Akhondi, C. Thorne, C. Druckenbrodt, T. Cohn, M. Gregory, K.Verspoor, arXiv:1907.02679 [cs] .[71] S. Zheng, S. Dharssi, M. Wu, J. Li, Z. Lu in Methods in Molecular Biology , (Eds.: R. S. Larson, T. I.Oprea), Methods in Molecular Biology, Springer New York, New York, NY, , pp. 231–252.[72] M. Musib et al.,
Science , , 28–30.[73] H. Chen, O. Engkvist, Y. Wang, M. Olivecrona, T. Blaschke, Drug Discov. Today , , 1241–1250.[74] G. B. Goh, N. O. Hodas, A. Vishnu, J. Comput. Chem. , , 1291–1307.[75] J. Vamathevan et al., Nat. Rev. Drug Discov. , , 463–477.[76] C. M. Bishop, Pattern Recognition and Machine Learning , Springer Science & Business Media LLC, .[77] A. Varnek, I. Baskin,
J. Chem. Inf. Model. , , 1413–1437.[78] J. B. O. Mitchell, WIREs Comput. Mol. Sci. , , 468–481.[79] J. Luts, F. Ojeda, R. Van de Plas, B. De Moor, S. Van Huffel, J. A. Suykens, Anal. Chim. Acta , , 129–145.[80] M. Rupp, Int. J. Quantum Chem. , , 1058–1073.[81] T. Mueller, A. G. Kusne, R. Ramprasad, Rev. Comput. Chem. , , 186–273.[82] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, A. Walsh, Nature , , 547–555.[83] L. Zhang, J. Tan, D. Han, H. Zhu, Drug Discov. Today , , 1680–1685.[84] B. Settles, Synthesis Lectures on Artificial Intelligence and Machine Learning , , 1–114.[85] P. I. Frazier, arXiv preprint arXiv:1807.02811 .[86] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , The MIT Press, .[87] R. Salakhutdinov,
Annu. Rev. Stat. Appl. , , 361–385.[88] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengioin Advances in neural information processing systems, , pp. 2672–2680.[89] D. P. Kingma, M. Welling.[90] B. Sanchez-Lengeling, A. Aspuru-Guzik, Science , , 360–365.[91] D. Schwalbe-Koda, R. Gómez-Bombarelli, arXiv:1907.01632 [physics stat] .[92] D. C. Elton, Z. Boukouvalas, M. D. Fuge, P. W. Chung, arXiv:1903.04388 [physics stat] .[93] S. N. Deming, H. L. Pardue, Anal. Chem. , , 192–200.[94] H. Winicov, J. Schainbaum, J. Buckley, G. Longino, J. Hill, C. Berkoff, Anal. Chim. Acta , ,469–476.[95] J. Y. Pan, ACS Med. Chem. Lett. , , 703–707.[96] L. M. Roch, F. Häse, C. Kreisbeck, T. Tamayo-Mendoza, L. P. E. Yunker, J. E. Hein, A. Aspuru-Guzik, Sci. Robot. , , eaat5559.[97] C. Elliott, V. Vijayakumar, W. Zink, R. Hansen, JALA: Journal of the Association for LaboratoryAutomation , , 17–24.[98] T. Chapman, Nature , , 661–663.[99] T. Kodadek, Nat. Chem. Biol. , , 162–165.[100] S. Brenner, R. A. Lerner, PNAS , , 5381–5383.[101] D. S. Tan, Nat. Chem. Biol. , , 74–84.[102] R. G. Cooks, Science , , 1566–1570.[103] C. J. Welch, X. Gong, W. Schafer, E. C. Pratt, T. Brkovic, Z. Pirzada, J. F. Cuff, B. Kosjek, Tetra-hedron: Asymmetry , , 1674–1681.[104] H. A. Simon, P. W. Langley, G. L. Bradshaw, Synthese , , 1–27.[105] P. Langley, Proc. 2nd National Conference of the Canadian Society for Computational Studies ofIntelligence 1978 , 173–180.[106] P. Langley, G. L. Bradshaw, H. A. Simon in
Machine Learning , (Eds.: R. S. Michalski, J. G. Car-bonell, T. M. Mitchell), Symbolic Computation, Springer Berlin Heidelberg, Berlin, Heidelberg, ,pp. 307–329.[107] J. M. Zytkow in
Proceedings of the Fourth International Workshop on MACHINE LEARNING , (Ed.:P. Langley), Elsevier, , pp. 281–287.[108] B. C. Falkenhainer, R. S. Michalski,
Mach. Learn. , , 367–401.[109] J. M. Zytkow, H. A. Simon, Mach. Learn. , , 107–137.[110] D. Kulkarni, H. A. Simon, Cogn. Sci. , , 139–175.[111] S. Fajtlowicz in Graph Theory and Applications, Proceedings of the First Japan Conference on Graph Theory and Applications , (Eds.: J. Akiyama, Y. Egawa, H. Enomoto), Graph Theory and Applications, Elsevier, , pp. 113–118.[112] I. H. Witten, B. A. MacDonald,
Int. J. Man Mach. Stud. , , 171–196.[113] W. H. Green in Chemical Engineering Kinetics , (Ed.: G. B. Marin), Chemical Engineering Kinetics,Elsevier, , pp. 1–313.[114] G. N. Simm, M. Reiher,
J. Chem. Theory Comput. , , 5238–5248.[115] G. N. Simm, A. C. Vaucher, M. Reiher, J. Phys. Chem. A , , 385–399.[116] J. P. Unsleber, M. Reiher, arXiv:1906.10223 [physics] .[117] R. E. Valdés-Pérez, Catal. Lett. , , 79–87.[118] R. E. Valdés-Pérez, Artif. Intell. , , 247–280.[119] R. E. Valdés-Pérez, Artif. Intell. , , 191–201.[120] I. Ismail, H. B. V. A. Stuttaford-Fowler, C. Ochan Ashok, C. Robertson, S. Habershon, J. Phys.Chem. A , , 3407–3417.[121] T. P. Senftle et al., Npj Comput. Mater. , , 15011.[122] J. T. Margraf, K. Reuter, ACS Omega , , 3370–3379.[123] C. W. Gao, J. W. Allen, W. H. Green, R. H. West, Comput. Phys. Commun. , , 212–225.[124] P. Zhang, N. W. Yee, S. V. Filip, C. E. Hetrick, B. Yang, W. H. Green, Phys. Chem. Chem. Phys. , , 10637–10649.[125] L. J. Broadbelt, S. M. Stark, M. T. Klein, Ind. Eng. Chem. Res. , , 790–799.[126] P. M. Zimmerman, J. Comput. Chem. , , 1385–1392.[127] Z. W. Ulissi, A. J. Medford, T. Bligaard, J. K. Nørskov, Nat. Commun. , , 14621.[128] S. Maeda, Y. Harabuchi, J. Chem. Theory Comput. , , 2111–2115.[129] H. B. Schlegel, J. Comput. Chem. , , 214–218.[130] A. Behn, P. M. Zimmerman, A. T. Bell, M. Head-Gordon, J. Chem. Phys. , , 224108.[131] A. Goodrow, A. T. Bell, M. Head-Gordon, J. Chem. Phys. , , 244108.[132] Y. V. Suleimanov, W. H. Green, J. Chem. Theory Comput. , , 4248–4259.[133] C. A. Grambow, A. Jamal, Y.-P. Li, W. H. Green, J. Zádor, Y. V. Suleimanov, J. Am. Chem. Soc. , , 1035–1048.[134] Y. Kim, J. W. Kim, Z. Kim, W. Y. Kim, Chem. Sci. , , 825–835.[135] S. Maeda, T. Taketsugu, K. Morokuma, J. Comput. Chem. , , 166–173.[136] L.-P. Wang, A. Titov, R. McGibbon, F. Liu, V. S. Pande, T. J. Martínez, Nature Chem. , ,1044–1048.[137] L.-P. Wang, R. T. McGibbon, V. S. Pande, T. J. Martinez, J. Chem. Theory Comput. , ,638–649.[138] T. Lei, W. Guo, Q. Liu, H. Jiao, D.-B. Cao, B. Teng, Y.-W. Li, X. Liu, X.-D. Wen, J. Chem. TheoryComput. , , 3654–3665.[139] C. M. Gothard, S. Soh, N. A. Gothard, B. Kowalczyk, Y. Wei, B. 
Baytekin, B. A. Grzybowski, Angew. Chem. Int. Ed. , , 7922–7927.[140] S. Soh, Y. Wei, B. Kowalczyk, C. M. Gothard, B. Baytekin, N. Gothard, B. A. Grzybowski, Chem. Sci. , , 1497.[141] A. I. Lin, T. I. Madzhidov, O. Klimchuk, R. I. Nugmanov, I. S. Antipin, A. Varnek, J. Chem. Inf. Model. , , 2140–2148.[142] M. D. Bajczyk, P. Dittwald, A. Wołos, S. Szymkuć, B. A. Grzybowski, Angew. Chem. Int. Ed. Engl. , , 2367–2371.[143] A. A. Lapkin, P. K. Heer, P.-M. Jacob, M. Hutchby, W. Cunningham, S. D. Bull, M. G. Davidson, Faraday Discuss. , , 483–496.[144] J. Li, M. D. Eastgate, React. Chem. Eng. , DOI .[145] W. D. Smith, , 7.[146] S. M. Kim, M. I. Peña, M. Moll, G. N. Bennett, L. E. Kavraki,
Nelson, Nat. Rev. Chem. , , 174–183.[564] J. Besnard et al., Nature , , 215–220.[565] N. C. Firth, B. Atrash, N. Brown, J. Blagg, J. Chem. Inf. Model. , , 1169–1180.[566] V. Venkatasubramanian, K. Chan, J. Caruthers, Comput. Chem. Eng. , An International Journal ofComputer Applications in Chemical Engineering , , 833–844.[567] S. D. Pickett, D. V. S. Green, D. L. Hunt, D. A. Pardoe, I. Hughes, ACS Med. Chem. Lett. , ,28–33.[568] L. Weber, S. Wallbaum, C. Broger, K. Gubernator, Angew. Chem. Int. Ed. in English , ,2280–2282.[569] A. Button, D. Merk, J. A. Hiss, G. Schneider, Nat. Mach. Intell. , , 307–315.[570] F. Dey, A. Caflisch, J. Chem. Inf. Model. , , 679–690.[571] R. Wang, Y. Gao, L. Lai, J. Mol. Model. , , 498–516.[572] S. C.-H. Pegg, J. J. Haresco, I. D. Kuntz, J. Comput.-Aided Mol. Des. , , 911–933.[573] V. J. Gillet, P. Willett, P. J. Fleming, D. V. Green, J. Mol. Graphics Modell. , , 491–498.[574] R. P. Sheridan, S. K. Kearsley, J. Chem. Inf. Model. , , 310–320.[575] G. Schneider, M.-L. Lee, M. Stahl, P. Schneider, J. Comput.-Aided Mol. Des. , , 487–494.[576] D. Douguet, E. Thoreau, G. Grassy, J. Comput.-Aided Mol. Des. , , 449–466.[577] N. Brown, B. McKay, F. Gilardoni, J. Gasteiger, J. Chem. Inf. Comput. Sci. , , 1079–1087.[578] S. Kamphausen, N. Höltge, F. Wirsching, C. Morys-Wortmann, D. Riester, R. Goetz, M. Thürk, A.Schwienhorst, J. Comput.-Aided Mol. Des. , , 551–567.[579] J. H. Jensen, Chem. Sci. , , 3567–3572.[580] C. A. Nicolaou, N. Brown, Drug Discov. Today Technol. , , e427–e435.[581] D. E. Clark, D. R. Westhead, J. Comput-Aided Mol. Des. , , 337–358.[582] L. Terfloth, Drug Discov. Today , , 102–108.[583] C. A. Nicolaou, N. Brown, C. S. Pattichis, Curr. Opin. Drug Discov. Devel. , , 316–324.[584] D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue, T. Lookman, Nat. Commun. , ,1–9.[585] D. R. Jones, M. Schonlau, W. J. Welch, J. Global Optim. , , 455–492.[586] A. Solomou, G. Zhao, S. Boluki, J. K. Joy, X. Qian, I. Karaman, R. Arróyave, D. 
C. Lagoudas, Mater.Des. , , 810–827.[587] R. Yuan, Z. Liu, P. V. Balachandran, D. Xue, Y. Zhou, X. Ding, J. Sun, D. Xue, T. Lookman, Adv.Mater. , , 1702884.[588] A. Seko, T. Maekawa, K. Tsuda, I. Tanaka, Phys. Rev. B , , DOI .[589] A. Seko, A. Togo, H. Hayashi, K. Tsuda, L. Chaput, I. Tanaka, Phys. Rev. Lett. , , DOI .[590] P. V. Balachandran, D. Xue, J. Theiler, J. Hogden, T. Lookman, Sci. Rep. , , DOI .[591] K. Tran, Z. W. Ulissi, Nat. Catal. , , 696–703.[592] K. Gubaev, E. V. Podryabinkin, G. L. Hart, A. V. Shapeev, Comput. Mater. Sci. , , 148–156.[593] A. W. Thornton et al., Chem. Mater. , , 2844–2854.[594] J. P. Janet, L. Chan, H. J. Kulik, J. Phys. Chem. Lett. , , 1064–1071.[595] A. Nandy, C. Duan, J. P. Janet, S. Gugler, H. J. Kulik, Ind. Eng. Chem. Res. , , 13973–13986.69596] G. H. Jóhannesson, T. Bligaard, A. V. Ruban, H. L. Skriver, K. W. Jacobsen, J. K. Nørskov, Phys.Rev. Lett. , , 255506.[597] N. M. O’Boyle, C. M. Campbell, G. R. Hutchison, J. Phys. Chem. C , , 16200–16210.[598] Y. G. Chung et al., Sci. Adv. , , e1600909.[599] A. Mannodi-Kanakkithodi, G. Pilania, T. D. Huan, T. Lookman, R. Ramprasad, Sci. Rep. , ,DOI .[600] T. Lookman, P. V. Balachandran, D. Xue, R. Yuan, Npj Comput. Mater. , , 21.[601] B. P. MacLeod et al., arXiv:1906.05398 [cond-mat physics:physics] .[602] T. Ching et al., J. R. Soc. Interface , , 20170387.[603] D. R. Swanson, N. R. Smalheiser, Artif. Intell. , Scientific Discovery , , 183–203.[604] D. R. Swanson, Perspect. Biol. Med. , , 526–557.[605] H.-M. Müller, E. E. Kenny, P. W. Sternberg, PLOS Biol. , , e309.[606] M. Krallinger, A. Valencia, Genome Biol. , , 224.[607] L. J. Jensen, J. Saric, P. Bork, Nat. Rev. Genet. , , 119–129.[608] M. Krallinger, A. Valencia, L. Hirschman, Genome Biol. , , S8.[609] E. W. Sayers et al., Nucleic Acids Res. , , D23–D28.[610] S. M. Leach, H. Tipney, W. Feng, W. A. Baumgartner, P. Kasliwal, R. P. Schuyler, T. Williams,R. A. Spritz, L. Hunter, PLoS. Comput. Biol. , , e1000215.[611] L. Bell, R. 
Chowdhary, J. S. Liu, X. Niu, J. Zhang, PLoS One , , e21474.[612] M. W. Libbrecht, W. S. Noble, Nat. Rev. Genet. , , 321–332.[613] G. Eraslan, Ž. Avsec, J. Gagneur, F. J. Theis, Nat. Rev. Genet. , , 389–403.[614] K. K. Yang, Z. Wu, F. H. Arnold, Nature Methods , , 687.[615] R. Verma, U. Schwaneberg, D. Roccatano, Computational and Structural Biotechnology Journal , , e201209008.[616] R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. F. G. Green, C. Qin, A. Zidek, A. Nelson, A.Bridgland, H. Penedones, Annu. Rev. Biochem. , , 363–382.[617] M. AlQuraishi, Cell Systems , , 292–301.e3.[618] R. D. King, K. E. Whelan, F. M. Jones, P. G. K. Reiser, C. H. Bryant, S. H. Muggleton, D. B. Kell,S. G. Oliver, Nature , , 247–252.[619] R. D. King et al., Science , , 85–89.[620] R. D. King, V. Schuler Costa, C. Mellingwood, L. N. Soldatova, IEEE Technol. Soc. Mag. ,37