Autonomous discovery in the chemical sciences part II: Outlook
Connor W. Coley∗†, Natalie S. Eyke∗, Klavs F. Jensen∗‡

Keywords: automation, chemoinformatics, machine learning, drug discovery, materials science

∗ Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139
† [email protected]
‡ [email protected]

Contents

C: Establish open access databases with standardized data representations
C: Address the inconsistent quality of existing data
3.1.2 Building empirical models
C: Improve representations of molecules and materials
C: Improve empirical modeling performance in low-data environments
C: Incorporate physical invariance and equivariance properties
C: Unify and utilize heterogeneous datasets
C: Improve interpretability of machine learning models
3.2 Automated validation and feedback
3.2.1 Experimental validation
C: Expand the scope of automatable experiments
C: Facilitate integration through systems engineering
C: Automate the planning of multistep chemical syntheses
3.2.2 Computational validation
C: Accelerate code/software used for computational validation
C: Broaden capabilities / applicability of first-principles calculations
3.2.3 Shared challenges
C: Ensure that validation reflects the real application
C: Lower the cost of automated validation
C: Combine newly acquired data with prior literature data
3.3 Selection of experiments for validation and feedback
C: Quantify model uncertainty and domain of applicability
C: Quantify the tradeoff between experimental difficulty and information gain
C: Define discovery goals at a higher level
3.3.1 Proposing molecules and materials
C: Bias generative models towards synthetic accessibility
C: Benchmark problems for molecular generative models
3.4 Evaluation
C: Demonstrate extrapolative power of predictive models
C: Demonstrate design-make-test beyond proof-of-concept
C: Develop benchmark problems for discovery

Abstract
This two-part review examines how automation has contributed to different aspects of discovery in the chemical sciences. In this second part, we reflect on a selection of exemplary studies. It is increasingly important to articulate what the role of automation and computation has been in the scientific process and how that has or has not accelerated discovery. One can argue that even the best automated systems have yet to "discover" despite being incredibly useful as laboratory assistants. We must carefully consider how they have been and can be applied to future problems of chemical discovery in order to effectively design and interact with future autonomous platforms.

The majority of this article defines a large set of open research directions, including improving our ability to work with complex data, build empirical models, automate both physical and computational experiments for validation, select experiments, and evaluate whether we are making progress toward the ultimate goal of autonomous discovery. Addressing these practical and methodological challenges will greatly advance the extent to which autonomous systems can make meaningful discoveries.
In 2009, King et al. proposed a hypothetical, independent robot scientist that "automatically originates hypotheses to explain observations, devises experiments to test these hypotheses, physically runs the experiments by using laboratory robotics, interprets the results, and then repeats the cycle" [1]. To what extent have we closed the gap toward each of these workflow components, and what challenges remain?

The case studies in Part 1 illustrate many examples of the progress that has been made toward achieving machine autonomy in discovery. Several studies in particular, summarized in Table 1, represent what we consider to be exemplars of different discovery paradigms [2–15]. These include the successful execution of experimental and computational workflows as well as the successful implementation of automated experimental selection and belief revision. There are a great number of studies (some of which are described in Part 1) that follow the paradigm of (a) train a surrogate QSAR/QSPR model on existing data, (b) computationally design a molecule or material to optimize predicted performance, and (c) manually validate a few compounds. The table intentionally underrepresents such studies, as we believe iterative validation to be a distinguishing feature of autonomous workflows compared to "merely" automated calculations.

We encourage the reader to reflect on these case studies through the lens of the questions we proposed for assessing autonomous discovery: (i) How broadly is the goal defined? (ii) How constrained is the search/design space? (iii) How are experiments for validation/feedback selected? (iv) How superior to brute force search is navigation of the design space? (v) How are experiments for validation/feedback performed? (vi) How are results organized and interpreted? (vii) Does the discovery outcome contribute to broader scientific knowledge?

The goals of discovery are defined narrowly in most studies.
We are not able to request that a platform identify a good therapeutic, come up with an interesting material, uncover a new reaction, or propose an interesting model. Instead, in most studies described to date, an expert defines a specific scalar performance objective that an algorithm tries to optimize. Out of the examples in Table 1, Kangas et al. has the highest-level goals: in one of their active learning evaluations, the goal could be described as finding strong activity for any of the 20,000 compounds against any of the 177 assays. While Adam attempts to find relationships between genes and the enzymes they encode (discover a causal model), it does so from a very small pool of hypotheses for the sake of compatibility with existing deletion mutants for a single yeast strain.

The search spaces used in these studies vary widely in terms of the constraints that are imposed upon them. Some are restricted out of necessity to ensure validation is automatable (e.g., Eve, Adam, Desai et al., ARES) or convenient (e.g., Weber et al., Fang et al.). Others constrain the search space to a greater degree than automated validation requires. This includes reductions of 1) dimensionality, by holding process parameters constant (e.g., Reizman et al., Ada), or 2) the size of discrete candidate spaces (e.g., Gómez-Bombarelli et al., Thornton et al., Janet et al.). Computational studies that minimize constraints on their search (e.g., RMG, Segler et al.) do so under the assumption that the results of validation (e.g., simulation results, predictions from surrogate models) will be accurate across the full design space.

In all cases, human operators have implicitly or explicitly assumed that a good solution can be found in these restricted spaces. The extent to which domain expertise or prior knowledge is needed to establish the design space also varies. Molecular or materials design in relatively unbounded search spaces (e.g., Segler et al.) requires the least human input.
Fixed candidate libraries that are small (e.g., Weber et al., Desai et al.) or derived from an expert-defined focused enumeration (e.g., RMG, Gómez-Bombarelli et al., Janet et al.) require significant application-specific domain knowledge; larger fixed candidate libraries may be application-agnostic (e.g., diverse screening libraries in Eve, Fang et al., Thornton et al.). Limiting process parameters (e.g., Reizman et al., ARES, Ada) requires more general knowledge about typical parameter ranges where optima may lie.

The third question, regarding experiment selection, is one where the field has excelled. There are many frameworks for quantifying the value of an experiment in model-guided experimental design, both when optimizing for performance and when optimizing for information [17]. However, active learning with formal consideration of uncertainty from either a frequentist perspective [18] (e.g., Eve, Reizman et al.) or a Bayesian perspective [19, 20] (e.g., Ada) is less common than with ad hoc definitions of experimental diversity meant to encourage exploration (e.g., Desai et al., Kangas et al.). Both are less common than greedy selection criteria (e.g., ARES, RMG, Gómez-Bombarelli et al., Thornton et al.). Model-free experiment selection, including the use of genetic algorithms (e.g., Weber et al., Janet et al.), is also quite prevalent but requires some additional overhead from domain experts to determine allowable mutations within the design space. When validation is not automated (e.g., Fang et al. or Gómez-Bombarelli et al.'s experimental follow-up), the selection of experiments is best described as pseudo-greedy, where the top predicted candidates are manually evaluated for practical factors like synthesizability.

The magnitude of the benefit of computer-assisted experiment selection is a function of the size of the design space and, when applicable, the initialization required prior to the start of the iterative phase.
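Many of the selection strategies above reduce to ranking a candidate pool with a surrogate model and an uncertainty estimate. As a minimal, stdlib-only sketch (not taken from any of the cited studies; the function names and the bootstrap-ensemble choice are our own), an upper-confidence-bound criterion over a one-dimensional design variable might look like:

```python
import random
import statistics

def fit_line(points):
    """Least-squares slope and intercept for 1-D (x, y) data."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0:  # degenerate bootstrap sample: fall back to a flat model
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return slope, my - slope * mx

def select_next_experiment(candidates, observed, n_models=20, kappa=2.0, seed=0):
    """Upper-confidence-bound selection: fit an ensemble of linear
    surrogates to bootstrap resamples of the observed (x, y) pairs,
    then pick the candidate x maximizing mean + kappa * std of the
    ensemble predictions (favoring high *and* uncertain regions)."""
    rng = random.Random(seed)
    models = [fit_line([rng.choice(observed) for _ in observed])
              for _ in range(n_models)]

    def ucb(x):
        preds = [a * x + b for a, b in models]
        return statistics.mean(preds) + kappa * statistics.pstdev(preds)

    return max(candidates, key=ucb)
```

Swapping the acquisition function (pure mean for greedy selection, pure standard deviation for uncertainty sampling) recovers the other strategies discussed above without changing the surrounding loop.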
In many cases, a brute force exploration of the design space is not prohibitively expensive (e.g., Eve, Adam, Desai et al., Janet et al.), although this is harder to quantify when some of the design variables are continuous (e.g., Reizman et al., ARES, Ada). Other design spaces are infinite or virtually infinite (e.g., RMG, Segler et al.), which makes the notion of a brute force search ill-defined. Regardless of whether the full design space can be screened, we can still achieve a reduction in the number of experiments required to find high-performing candidates, perhaps by a modest factor of 2-10 (e.g., Eve, Desai et al., Gómez-Bombarelli et al., Janet et al.) or even by 2-3 orders of magnitude (e.g., Weber et al., Kangas et al., Thornton et al.). It is possible that the experimental efficiency of some of these studies could be improved by reducing the number of experiments needed to initialize the workflow (e.g., Eve, Gómez-Bombarelli et al.).

The manner in which experiments for validation are performed depends on the nature of the design space and, of course, whether experiments are physical or computational. Examples where validation is automated are intentionally overrepresented in this review; there are many more examples of partially-autonomous discovery in which models prioritize experiments that are manually performed (e.g., similar to Weber et al., Fang et al.). There are also cases where almost all aspects of experimentation are automated but a few manual operations remain (e.g., transferring well plates for Adam). In computational workflows, one can often construct pipelines to automate calculations (e.g., RMG, Gómez-Bombarelli et al., Segler et al., Thornton et al., Janet et al.). In experimental workflows, one can programmatically set process conditions with tailor-made platforms (e.g., Desai et al., ARES, Ada) or robotically perform assays using in-stock compounds (e.g., Eve, Adam) or ones synthesized on-demand (e.g., Desai et al.).
Pool-based active learning strategies lend themselves to retrospective validation, where an "experiment" simply reveals a previous result not known to the algorithm (e.g., Kangas et al.); this is trivially automated and thus attractive for method development. Note that in the workflow schematic for Kangas et al. in Table 1, we illustrate the revelation of PubChem measurements as experiments.

In iterative workflows, automating the organization and interpretation of results is a practical step toward automating the subsequent selection of experiments. When workflows only proceed through a few iterations of batched experiments, humans may remain in the loop to simplify the logistics of organizing results and initializing each round (e.g., Gómez-Bombarelli et al., Thornton et al.), but nothing fundamentally prevents this step from being automated. When many iterations are required or expected, it behooves us to ensure that the results of experiments can be directly interpreted; otherwise, tens (e.g., Desai et al., Reizman et al., Segler et al., Janet et al., Ada) or hundreds (e.g., Eve, ARES, Kangas et al.) of interventions by human operators would be required. This can be the case when iterative experimental design is used with manual experimentation and analysis (e.g., the 20 iterations of a genetic algorithm conducted manually by Weber et al.). In non-iterative workflows with manual validation (e.g., Fang et al.), there is little benefit to automating the interpretation of new data. Relative to automating experiments and data acquisition, automating the interpretation thereof is rarely an obstacle to autonomy.
Exceptions to this include cases where novel physical matter (a molecule or material) is synthesized and characterized (e.g., case studies in Part 1 related to experimental reaction discovery), where further advances in computer-aided structural elucidation (CASE) [21] are needed.

Our final question when assessing autonomy is whether the discovery outcome contributes to broader scientific knowledge. In Table 1, with the exception of Adam and RMG, we have focused on the discovery of physical matter or processes rather than models. The primary outcome of these discovery campaigns is the identification of a molecule, material, or set of process conditions that achieves or optimizes a human-defined performance objective. Workflows with model-based experimental designs (e.g., all but Weber et al. and Janet et al., who use genetic algorithms) have the secondary outcome of a surrogate model, which may or may not lend itself to interpretation. However, the point of this question is whether the contribution to broader scientific knowledge came directly from the autonomous platform, not through an ex post facto analysis by domain experts. These discoveries generally require manual interpretation, again excepting Adam, RMG, and similar platforms where what is discovered is part of a causal model.

Our first and last questions represent lofty goals in autonomous discovery: we specify high-level goals and receive human-interpretable, generalized conclusions beyond the identification of a molecule, material, device, process, or black box model. However, we have made tremendous progress in offloading both the manual and mental burden of navigating design spaces through computational experimental design and automated validation. We often impose constraints on design spaces to avoid unproductive exploration and focus the search on what we believe to be plausible candidates.
To widen a design space requires that experiments remain automatable (less of a challenge for computational experiments than for physical ones) but may decrease the need for subject matter expertise and may increase the odds that the platform identifies an unexpected or superior result. Well-established frameworks of active learning and Bayesian optimization have served us well for experimental selection, while new techniques in deep generative modeling have opened up opportunities for exploring virtually-infinite design spaces.

[Workflow schematic legend for Table 1: initial design → selected experiment → data → belief/model, with update and initialize arrows; initial designs may be algorithmic or expert-defined; experiments may be physical or computational; designs may be model-free or model-based; models may be initialized from existing data.]

Eve [2]. Discovery: bioactive, selective molecules. Initialization: 4,800 random compounds from design space. Design space: fixed library of 14,400 compounds. Data generation: automated measurement of yeast growth curves. Notes: compound screening from a fixed library with an active search.

Adam [3]. Discovery: gene-enzyme relationships. Initialization: random experiment. Design space: 15 open reading frame deletions. Data generation: automated auxotrophy experiments. Notes: narrow hypothesis space, but nearly-closed-loop experimentation.

Weber et al. [4]. Discovery: thrombin inhibitors. Initialization: 20 random compounds. Design space: virtual combinatorial compound library. Data generation: manual synthesis and inhibition assay. Notes: iterative optimization using a genetic algorithm; design space defined by 4-component Ugi reaction to ensure synthesizability.

Desai et al. [5]. Discovery: kinase inhibitors. Initialization: random compound. Design space: candidates in make-on-demand library. Data generation: automated microfluidic synthesis and biological testing. Notes: closed-loop synthesis and biological testing; narrow chemical design space.

Reizman et al. [6]. Discovery: reaction conditions. Initialization: algorithmic D-optimal design. Design space: concentration, temperature, time, 8 catalysts. Data generation: automated synthesis and yield quantitation. Notes: closed-loop reaction optimization through a screening phase and an iterative phase.

ARES [7, 16]. Discovery: carbon nanotube growth conditions. Initialization: 84 expert-defined experimental conditions. Design space: process conditions (temperature, pressures, and gas compositions). Data generation: automated nanotube growth and characterization. Notes: complex experimentation; uses RF model for regression.

Kangas et al. [8]. Discovery: bioactive compounds against many assays. Initialization: 384 random measurements from design space. Design space: 177 assays × 20,000 compounds. Data generation: simulated experiments by revealing PubChem measurements. Notes: validation of pool-based active learning framework through retrospective analysis; iterative batches of 384 experiments.

RMG [9]. Discovery: detailed gas-phase kinetic mechanisms. Initialization: reaction conditions, optionally seeded by known mechanism. Design space: elementary reactions following expert-defined reaction templates. Data generation: estimation of thermodynamic and kinetic parameters. Notes: iterative addition of hypothesized elementary reactions to a kinetic model based on simulations using intermediate models.

Fang et al. [10]. Discovery: neuroprotective compounds. Initialization: activity data from ChEMBL. Design space: in-house library of 28k candidates. Data generation: none. Notes: literature-trained QSAR model applied to noniterative virtual screening with manual in vitro validation.

Gómez-Bombarelli et al. [11]. Discovery: organic light-emitting diode molecules. Initialization: 40k random compounds from design space. Design space: virtual library of 1.6 M enumerated compounds. Data generation: DFT calculations. Notes: iteratively selected batches of 40k calculations; manually validated a small number of compounds experimentally.

Segler et al. [12]. Discovery: bioactive compounds. Initialization: 1.4M molecules from ChEMBL. Design space: all of chemical space. Data generation: surrogate QSAR/QSPR models of activity. Notes: iteratively refined generative LSTM model (pretrained on ChEMBL) on active molecules identified via sampling; an initial 100k + 8 rounds × 10k molecules.

Thornton et al. [13]. Discovery: hydrogen storage materials. Initialization: 200 human-chosen subset and 200 random subset of search space. Design space: 850k structures (Materials Genome). Data generation: grand canonical Monte Carlo simulations. Notes: few rounds of greedy optimization with batches of 1000 using surrogate QSPR model.

Janet et al. [14]. Discovery: spin-state splitting in inorganic complexes. Initialization: random complexes from design space. Design space: 708 ligand combinations.

Ada [15]. Discovery: organic hole transport materials. Initialization: algorithmic. Design space: dopant ratio and annealing time. Data generation: automated synthesis and analysis of thin film. Notes: complex experimentation successfully automated; simple design space.

Table 1: Selected examples of discovery accelerated by automation or computer assistance. The stages of the discovery workflow employed by each are shown as red arrows corresponding to the schematic above. Workflows may begin either with an initial set of experiments to run or by initializing a model with existing (external) data. Algorithmic initial designs include the selection of random experiments from the design space.
Challenges and trends
The capabilities required for autonomous discovery are coming together rapidly. This section emphasizes what we see as key remaining challenges associated with working with complex data, automating validation and feedback, selecting experiments, and evaluation.
The discovery of complex phenomena requires a tight connection between knowledge and data [22]. A 1991 article laments the "growing gap between data generation and data understanding" and the great potential for knowledge discovery from databases [23]. While we continue to generate new data at an increasing rate, we have also dramatically improved our ability to make sense of complex datasets through new algorithms and advances in computing power.

We intentionally use "complex data" rather than "big data"; the latter generally refers only to the size or volume of data, and not its content. Here, we mean "complex data" when it would be difficult or impossible for a human to identify the same relationships or conclusions as an algorithm. This may be due to the size of the dataset (e.g., millions of bioactivity measurements), the lack of structure (e.g., journal articles), or the dimensionality (e.g., a regression of multidimensional process parameters).

Complex datasets come in many forms and have inspired an array of different algorithms for making sense of them (and leveraging them for discovery). Unstructured data can be mined and converted into structured data [24, 25] or directly analyzed as text, e.g., to develop hypotheses about new functional materials [26]. Empirical models can be generated and used to draw inferences about factors that influence complex chemical reactivity [27–30]. Virtually any dataset of (input, output) pairs describing a performance metric of a molecule, material, or process can serve as the basis for supervised learning of a surrogate model. Likewise, unsupervised techniques can be used to infer the structure of complex datasets [31–33] and form the basis of deep generative models that propose new physical matter [34, 35].
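As a toy illustration of the unsupervised case, the following stdlib-only sketch (the function names and data are our own invention, not from the cited work) infers cluster structure in two-dimensional descriptor vectors with plain k-means:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, iters=25, seed=0):
    """Plain k-means: alternate assigning each point to its nearest
    center and moving each center to the centroid of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

The same assign-then-update loop underlies the more sophisticated clustering and embedding methods cited above; only the notion of distance and the update rule change.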
Many studies don’t develop substantially novel methods, but instead take advantage of new data resources.This is facilitated by the increasing availability of public databases. The PubChem database, maintained bythe NIH and currently the largest repository of open-access chemical information [36], has been leveragedby many studies for ligand-based drug discovery proofs and thus is a particularly noteworthy example of thevalue inherent in these curation efforts. Curation efforts spearheaded by government organizations as well asthose led by individual research groups can both be enormously impactful, whether through amalgamation9f large existing datasets (a greater strength of the broad collaborations) or the accumulation of high-quality,well-curated data (which efforts by individual research groups tend may be better suited for).Table 2 provides a list of some popular databases used for tasks related to chemical discovery. Additionaldatabases related to materials science can be found in refs. 37 and 38. Some related to drug discovery arecontained in refs. 39 and 40. Additional publications compare commercial screening libraries that can beuseful in experimental or computational workflows [41, 42].Table 2: Overview of some databases used to facilitate discovery in the chemical sciences. API: applicationprogramming interface
(Entries list name [refs] — description; approximate size; availability.)

Chemical structures
ZINC [43, 44] — commercially-available compounds; 35 M; Open
ChemSpider [45] — structures and misc. data; 67 M; API
SureChEMBL [46] — structures from the patent literature; 1.9 M; Open
Super Natural II [47] — natural product structures; 325 k; Open
SAVI [48] — enumerated synthetically-accessible structures and their building blocks; 283 M; Open
eMolecules [49] — commercially-available chemicals and prices; 5.9 M; Commercial
MolPort [50] — in-stock chemicals and prices; 7 M; On request
REAL (Enamine) [51] — enumerated synthetically-accessible structures; 11 B; Open
Chemspace [52] — in-stock chemicals and prices; 1 M; On request
GDB-11, GDB-13, GDB-17 [53–55] — exhaustively enumerated chemical structures; 26.4 M / 970 M / 166 B; Open
SCUBIDOO [56] — enumerated synthetically-accessible structures; > M; Open
CHIPMUNK [57] — enumerated synthetically-accessible structures; 95 M; Open

Biological data
PubChem [36] — compounds and properties, emphasis on bioassay results; 96 M; Open
ChEMBL [58, 59] — compounds and bioactivity measurements; 1.9 M; Open
ChEBI [60] — compounds and biological relevance; 56 k; Open
PDB [61] — biological macromolecular structures; 150 k; Open
PDBBind [62] — protein binding affinity; 20 k; Open
ProTherm [63] — thermodynamic data for proteins; 10 k; Open
LINCS [64] — cellular interactions and perturbations; varies; Open
SKEMPI [65] — energetics of mutant protein interactions; 7 k; Open
MoDEL [66] — MD trajectories of proteins; 1700; Open
GenBank [67] — species' nucleotide sequences; 400 k; Open
DrugBank [68] — drug compounds, associated chemical properties, and pharmacological information; 13 k; Open
BindingDB [69] — compounds and binding measurements; 750 k / 1.7 M; Open
CDD [70] — collaborative drug discovery database for neglected tropical diseases; Registration
ToxCast [71, 72] — compounds and cellular responses; Open
Tox21 [73, 74] — compounds and multiple bioassays; 14 k; Open

Chemical reactions
USPTO [75] — chemical reactions (patent literature); 3.3 M; Open
Pistachio [76] — chemical reactions (patent literature); 8.4 M; Commercial
Reaxys [77] — chemical reactions; > 10 M; Commercial
CASREACT [78] — chemical reactions; > 10 M; Commercial
SPRESI [79] — chemical reactions; 4.3 M; Commercial
Organic Reactions [80] — chemical reactions; 250 k; Commercial
EAWAG-BBD [81] — biocatalysis and biodegradation pathways; 219; Open
NIST Chemical Kinetics [82] — gas-phase chemical reactions; 38 k; Open
NMRShiftDB [83] — measured NMR spectra; 52 k; Open

Molecular properties
QM7/QM7b [84, 85] — electronic properties (DFT); 7200; Open
QM9 [86] — electronic properties (DFT); 134 k; Open
QM8 [87] — spectra and excited state properties; 22 k; Open
FreeSolv [88] — aqueous solvation energies; 642; Open
NIST Chemistry WebBook [89] — miscellaneous molecular properties; varies; Open

Materials
PoLyInfo [90] — polymer properties; 400 k; Open
COD [91–93] — crystal structures of organic, inorganic, metal-organic compounds and minerals; 410 k; Open
CoRE MOF [94] — properties of metal-organic frameworks; 5 k; Open
hMOF [95] — hypothetical metal-organic frameworks; 140 k; Open
CSD [96] — crystal structures; 1 M; API
ICSD [97] — crystal structure data for inorganic compounds; 180 k; Commercial
NOMAD [98] — total energy calculations; 50 M; Open
AFLOW [99, 100] — material compounds; calculated properties; 2.1 M / 282 M; Open
OQMD [101] — total energy calculations; 560 k; Open
Materials Project [102] — inorganic compounds and computed properties; 87 k; API
Computational Materials Repository [103, 104] — inorganic compounds and computed properties; varies; Open
Pearson's [105] — crystal structures; 319 k; Commercial
HOPV [106] — experimental photovoltaic data from literature, QM calculations; 350 k; Open

Journal articles
Crossref [107] — journal article metadata; 107 M; Open
PubMed [108] — biomedical citations; 29 M; Open
arXiv [109] — arXiv articles (from many domains); 1.6 M; Open
Wiley [110] — full articles; millions; API
Elsevier [111] — full articles; millions; API

Several factors have contributed to the greater wealth and accessibility of chemical databases that can be used to facilitate discovery.
The first of these has to do with hardware: automated experimentation, especially that which is high-throughput in nature, has allowed us to generate data at a faster pace. Second, the successes of computational tools at leveraging these large quantities of data have created a self-catalyzing phenomenon: as the capabilities of tools are more frequently and widely demonstrated, the incentive to collect and curate large datasets that can be used by these tools has grown.
Challenge: Establish open access databases with standardized data representations
Time invested in the creation of open databases of molecules, materials, and processes can have an outsized impact on discovery efforts that are able to make use of that data for supervised learning or screening.

Creating and maintaining these databases is not without its challenges. There's much to be done to capture and open-source the data generated by the scientific community. For cases where data must be protected by intellectual property agreements, we need software that can facilitate sharing between collaborators and guarantee privacy as needed [112, 113]. Even standardizing representations can be challenging, particularly for polymeric materials with stochastic structures and process-dependent attributes [114].

Government funding agencies in the EU and US are starting to prioritize accessibility of research results to the broader community [115]. Further evolution of the open data policies will accelerate discovery through broader analysis of data (crowdsourcing discovery [116–118]) and amalgamation of data for the purposes of machine learning. Best practices among experimentalists must begin to include the storage of experimental details and results in searchable, electronic repositories.

The data that exists in the literature that remains to be tapped by the curation efforts described above is vast. To access it, scientific researchers are gaining increasing interest in adapting information extraction techniques for use in chemistry [119–124]. Information extraction and natural language processing bring structure to unstructured data, e.g., published literature that presents information in text form.
Methods have evolved from identifying co-occurrences of specific words [125] to the formalization of domain-specific ontologies [126], learned word embeddings [26], knowledge graphs and databases [122], and causal models [25]. Learning from unstructured data presents additional challenges in terms of data set preparation and problem formulation, and is significantly less popular than working with pre-tabulated databases. Nevertheless, building knowledge graphs of chemical topics may eventually let us perform higher level reasoning [127] to identify and generalize from novel trends.
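The earliest of these approaches, co-occurrence counting, is simple enough to sketch in a few lines (a stdlib-only illustration; the terms and documents below are invented examples, not mined data):

```python
import itertools
from collections import Counter

def term_cooccurrence(documents, terms):
    """Count how often each pair of terms appears in the same document:
    the simplest co-occurrence signal used in early literature mining."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        present = sorted(t for t in terms if t in lowered)
        for pair in itertools.combinations(present, 2):
            counts[pair] += 1
    return counts
```

Ranking pairs by count (or by a normalized statistic such as pointwise mutual information) surfaces candidate relationships for a human, or a downstream model, to evaluate.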
Challenge: Address the inconsistent quality of existing data
Existing datasets may not contain all of the information needed for a given prediction task (i.e., the input is underspecified or the schema is missing a metric of interest). Even when the right fields are present, there may be missing or misentered data from automated information extraction pipelines or manual entry.

As Williams et al. point out, data curation (which involves evaluating the accuracy of data stored in repositories) before the data is used to create a model or otherwise draw conclusions is very important: data submitted to the PDB is independently validated before it is added to the database, whereas data added to PubChem undergoes no prerequisite curation or validation [128]. Lack of curation results in many problems associated with missing and/or misentered data. These issues plague databases including the PDB (misassigned electron density) and Reaxys (missing data). As analytical technology continues to improve, one can further ask how much we should bother relying on old data in lieu of generating new data that we trust more.

Existing database curation policies must account for the potential for error propagation and incorporate standardization procedures that can correct for errors when they arise [129, 130], for example by using ProsaII [131] to evaluate sequence-structure compatibility of PDB entries and identify errors [132]. While the type of crowdsourcing error correction exemplified by Venclovas et al. can be helpful, we argue that it shouldn't be relied upon [132]; curators should preemptively establish policies to help identify, control, and prevent errors.
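One preemptive policy is to validate every record against an explicit schema at ingestion time, before it can propagate into models. A minimal sketch (the field names and bounds are hypothetical, not drawn from any database discussed here):

```python
def validate_record(record, schema):
    """Flag curation issues for one database record. `schema` maps a
    field name to (expected_type, optional (low, high) bounds)."""
    issues = []
    for field, (ftype, bounds) in schema.items():
        value = record.get(field)
        if value is None:
            issues.append(f"missing field: {field}")
        elif not isinstance(value, ftype):
            issues.append(f"wrong type for {field}: {type(value).__name__}")
        elif bounds is not None and not bounds[0] <= value <= bounds[1]:
            issues.append(f"out-of-range {field}: {value}")
    return issues
```

Even checks this simple catch the most common entry errors (a reaction yield above 100%, a text string where a number belongs) at the point where they are cheapest to fix.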
Various statistical methods have been used for model-building for many years. Some of the most dramatic improvements to statistical learning have been in the area of machine learning. Machine learning is now the go-to approach for developing empirical models that describe nonlinear structure-function relationships, estimate the properties of new physical matter, and serve as surrogate models for expensive calculations or experiments. These types of empirical models act as a roadmap for many discovery efforts, so improvements here significantly impact computer-aided discovery, even when the full workflow is not automated. Packages like scikit-learn, Tensorflow, and Pytorch have lowered the barrier for implementing empirical models, and chemistry-specific packages like DeepChem [133] and ChemML [134] represent further attempts to streamline model training and deployment (with mixed success in adoption).
Challenge:
Improve representations of molecules and materials
A wide array of strategies for representing molecules and materials for the sake of empirical modeling have been developed, but certain aspects have yet to be adequately addressed by existing representation methods. Further, it is difficult to know which representation method will perform best for a given modeling objective.

In the wake of the 2012 ImageNet competition, in which a convolutional neural network dominated rule-based systems for image classification [135], there has been a shift in modeling philosophy to avoid human feature engineering, e.g., describing molecules by small numbers of expert descriptors, and instead learn suitable representations [136]. This is in part enabled by new network architectures, such as message passing networks particularly suited to embedding molecular structures [137–139]. There is no consensus as to when the aforementioned deep learning techniques should be applied over "shallower" learning techniques like RFs or SVMs with fixed representations [140]; which method performs best is task-dependent and determined empirically [133, 141], although some heuristics, e.g., regarding the fingerprint granularity needed for a particular materials modeling task, do exist [142]. Use of molecular descriptors may make generalization to new inputs more predictable [28] but limits the space of relationships able to be described by presupposing that the descriptors contain all information needed for the prediction task. Further, selecting features for low-dimensional descriptor-based representations requires expert-level domain knowledge.

In addition to a better understanding of why different techniques perform well in different settings, there is a need for the techniques themselves to better capture relevant information about input structures. Some common representations in empirical QSAR/QSPR modeling are listed in Table 3. However, there are several types of inputs that current representations are unable to describe adequately.
These include (a) polymers that are stochastically-generated ensembles of specific macromolecular structures, (b) heterogeneous materials with periodicity or order at multiple length scales, and (c) "2.5D" small molecules with defined stereochemistry but flexible 3D conformations. Descriptor-based representations serve as a catch-all, as they rely on experts to encode input molecules, materials, or structures as numerical objects.
Representation        Description
Descriptors           Vector of calculated properties
Fingerprints          Vector of presence/absence or count of structural features (many types)
Coulomb matrices      Matrix of electrostatic interactions between nuclei
Images                2D line drawings of chemical structures
SMILES                String defining small molecule connectivity (can be tokenized or adapted in various ways, e.g., SELFIES, DeepSMILES)
FASTA                 String for nucleotide or peptide sequences
Graphs                2D representation with connectivity information
Voxels                Discretized 3D or 4D representation of molecules
Spatial coordinates   3D representation with explicit coordinates for every atom
Table 3: Representations used in empirical QSAR/QSPR modeling.
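The fingerprint row in Table 3 can be made concrete with a toy sketch. Hashing character n-grams of a SMILES string is a deliberate simplification: real fingerprints (e.g., Morgan/ECFP) hash atom-centered substructures of the molecular graph, and the function names below are purely illustrative.

```python
import zlib

def toy_fingerprint(smiles: str, n_bits: int = 64, radius: int = 3) -> list:
    """Hash every character n-gram (n = 1..radius) of a SMILES string
    into a fixed-length bit vector. A stand-in for true substructure
    fingerprints, which enumerate atom-centered environments instead."""
    bits = [0] * n_bits
    for n in range(1, radius + 1):
        for i in range(len(smiles) - n + 1):
            h = zlib.crc32(smiles[i:i + n].encode())
            bits[h % n_bits] = 1
    return bits

def tanimoto(a: list, b: list) -> float:
    """Tanimoto similarity between two bit vectors: shared bits over union."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0
```

The fixed length and the Tanimoto comparison are the properties that matter for downstream modeling; the hashing scheme itself is interchangeable.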
Challenge:
Improve empirical modeling performance in low-data environments
Empirical modeling approaches must be validated on or extended to situations for which only tens of examples are available (e.g., a small number of hits from an experimental binding affinity assay).

Defining a meaningful feature representation is especially important when data is limited [143, 144]. Challenging discovery problems may be those for which little performance data is available and validation is expensive. For empirical models to be useful in these settings, they must be able to make reasonably accurate predictions with only tens of data points. QSAR/QSPR performance in low-data environments is understudied, with few papers explicitly examining low-data problems (e.g., fewer than 100 examples) [145–147]. The amount of data "required" to train a model is dependent on the complexity of the task, the true (unknown) mathematical relationship between the input representation and output, the size of the domain over which predictions are made, and the coverage of the training set within that space.
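To make the low-data setting concrete, a leave-one-out protocol with a nearest-neighbor baseline is one defensible way to evaluate a model on tens of examples; the sketch below is illustrative only, with hypothetical function names and a trivial one-dimensional descriptor space.

```python
import statistics

def one_nn_predict(train, query):
    """Predict by copying the label of the nearest training point
    (Euclidean distance in descriptor space)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    _, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

def loo_mae(data):
    """Leave-one-out mean absolute error: with only tens of labeled
    (descriptor, value) pairs, every point does double duty as a
    held-out test case."""
    errors = []
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        errors.append(abs(one_nn_predict(rest, x) - y))
    return statistics.mean(errors)
```

A baseline this simple is often competitive in the sub-100-example regime, which is itself a useful sanity check on more elaborate models.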
Challenge:
Incorporate physical invariance and equivariance properties
By ensuring that models are only sensitive to meaningful differences in input representations, one can more effectively learn an input-output relationship without requiring data augmentation to also learn input-input invariance or equivariance.

One potential way to improve low-data performance and generalization ability is to embed physical invariance or equivariance properties into models. Consider a model built to predict a physical property from a molecular structure: molecular embeddings from message passing neural networks are inherently invariant to atom ordering. In contrast, embeddings calculated from tensor operations on Coulomb matrices are not invariant. Sequence encoders using a SMILES string representation of a molecule have been shown to benefit from data augmentation strategies so the model can learn the chemical equivalence of multiple SMILES strings [148, 149]. There are strong parallels to image recognition tasks, where one may want an object recognition model to be invariant to translation, rotation, and scale. When using 3D representations of molecules with explicit atomic coordinates, it is preferable to use embedding architectures that are inherently rotationally invariant [150, 151] instead of relying on inefficient preprocessing steps of structure alignment [29] and/or rotational enumeration [152] for voxel representations, which still may lead to models that do not obey natural invariance or equivariance laws.
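The atom-ordering invariance of message passing embeddings comes from using a symmetric readout function. A minimal sketch (sum-pooling over toy per-atom feature vectors; the values are dyadic so floating-point sums are exact under any ordering):

```python
import itertools

def sum_pool(atom_features):
    """Permutation-invariant readout: sum each feature dimension over
    atoms, as in the readout step of a message passing network, so the
    molecular embedding does not depend on atom ordering."""
    return tuple(sum(col) for col in zip(*atom_features))

# toy per-atom feature vectors (dyadic values, so sums are exact)
atoms = [(1.0, 0.25), (0.5, 0.75), (0.0, 0.5)]

# all 6 atom orderings collapse to a single embedding;
# a concatenation readout would instead give 6 different vectors
embeddings = {sum_pool(list(p)) for p in itertools.permutations(atoms)}
```

Sum, mean, and max pooling all share this property; the choice among them is an architectural hyperparameter rather than a correctness issue.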
Challenge:
Unify and utilize heterogeneous datasets
Vast quantities of unlabeled or labeled data can be used as baseline knowledge for pretraining empirical models or in a multitask setting when tasks are sufficiently related.

When human researchers approach a discovery task, they do so equipped with an intuition and knowledge base built by taking courses, reading papers, running experiments, and so on. In computational discovery workflows with machine learning-based QSAR modeling, algorithms tend to focus only on the exact property or task and make little use of prior knowledge; domain-specific knowledge is embedded only via the input representation, model architecture, and constraints on the search space. Models are often trained from scratch on datasets that contain labeled (molecule, value) pairs.

Such isolated applications of machine learning to QSAR/QSPR modeling can be effective, but there is a potential benefit to multitask learning or transfer learning when predictions are sufficiently related [153–157]. Searls argues that drug discovery stands to benefit from integrating different datasets relating to various aspects of gene and protein functions [158]. As a simple example, one can consider that the prediction of phenotypes from suppressing specific protein sequences might benefit from knowledge of protein structure, given the connection between protein sequence → structure → function. For some therapeutic targets, there are dozens of databases known to be relevant that have not been meaningfully integrated [159]. Large-scale pretraining is a more general technique that can be used to learn an application-agnostic atom- or molecule-level representation prior to refinement on the actual QSAR task [160–163]. Performance on phenotypic assays has even been used directly as descriptors for molecules in other property prediction tasks [164], as has heterogeneous data on drug, protein, and drug-protein interactions [165].

Challenge:
Improve interpretability of machine learning models
Machine learning models are typically applied as black-box predictors with some minimal degree of ex post facto interpretation: analysis of descriptor importance, training example relevance, simplified decision trees, etc. Extracting explanations consistent with those used by human experts in the scientific literature requires the structure of the desired explanations to be considered and built into the modeling pipeline.

To the extent that existing autonomous discovery frameworks generate hypotheses that explain observations and interpret the results of experiments, they rarely do so in a way that is directly intelligible to humans, limiting the expansion of scientific knowledge that is derived from a campaign. In connection with this, many of the case studies from Part 1 focus on discoveries that are readily physically observable (identifying a new molecule that is active against a protein target, or a new material that can be used to improve energy capture) rather than something more abstract, such as answering a particular scientific question. We can probe model understanding by enumerating predictions for different inputs, but these are human-defined experiments to answer human-defined hypotheses (e.g., querying a reaction prediction model with substrates across a homologous series). Standard approaches to evaluating descriptor importance still require careful control experiments to ensure that the explanations we extract are not spurious, even if they align with human intuition [166]. We again refer readers to ref. 167 for a review of QSAR interpretability. Ref. 168 reviews additional aspects of explainable machine learning for scientific discovery.

Many of the challenges above can be approached by what Rueden et al. call informed machine learning: "the explicit incorporation of additional knowledge into machine learning models". The taxonomy they propose is reproduced in Figure 1.
In particular, several points relate to (a) the integration of natural sciences (laws) and intuition into representations and model architectures and (b) the integration of world knowledge through pretraining or multitask/transfer learning.
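As a toy caricature of (b), one can treat a statistic learned from a large related dataset as a prior and shrink sparse task-specific estimates toward it; the `strength` hyperparameter and function name below are hypothetical, chosen only to illustrate the more-data-less-shrinkage behavior.

```python
def shrinkage_predict(small_data, prior_mean, strength=5.0):
    """Toy 'transfer learning': with only a few task-specific labels,
    shrink the task mean toward a prior mean learned from a large
    related dataset. More task data -> less shrinkage; with no task
    data at all, fall back entirely on the prior."""
    n = len(small_data)
    if n == 0:
        return prior_mean
    task_mean = sum(small_data) / n
    w = n / (n + strength)  # weight on the task-specific evidence
    return w * task_mean + (1 - w) * prior_mean
```

Pretrained neural representations accomplish something analogous in a much richer function space: the prior lives in the weights rather than in a single summary statistic.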
Iterating between hypothesis generation and validation can be fundamental to the discovery process. One often needs to collect new data to refine or prove/disprove hypotheses. Sufficiently advanced automation can compensate for bad predictions by quickly falsifying hypotheses and identifying false positives [170] (i.e., being "fast to fail" [171]). The last several decades have brought significant advances in automation of small-scale screening, synthesis, and characterization, which facilitates validation via physical experiments, as well as advances in software for faster and more robust computational validation.

Figure 1: Taxonomy of informed machine learning proposed by Rueden et al. The incorporation of prior knowledge into machine learning modeling can take a number of forms. Figure reproduced from ref. 169.
Many of the case studies we present portray great strides in terms of the speed and scale of experimental validation. High-throughput and parallelized experimentation capabilities have been transformational in the biological space and are increasingly being imported into the chemistry space [172]. The adoption of HTE has simplified screening broad design spaces for new information [173–175]. Beyond brute-force experimentation, there are new types of experiments to accelerate the rate of data generation and hypothesis validation. These include split-and-pool techniques and other combinatorial methods to study multiple candidates simultaneously [176–178].
Challenge:
Expand the scope of automatable experiments
Whether an iterative discovery problem's hypotheses can be autonomously validated depends on whether the requisite experiments are amenable to automation.

If we are optimizing a complex design objective, such as in small molecule drug discovery, we benefit from having access to a large search space. Many syntheses and assays are compatible with a well-plate format and are routinely automated (e.g., Adam [3] and Eve [2]). Moving plates, aspirating/dispensing liquid, and heating/stirring are all routine tasks for automated platforms. Experiments requiring more complex operations may still be automatable, but require custom platforms, e.g., for the growth and characterization of nanotubes by ARES [7] or the deposition and characterization of thin films by Ada [15]. Dispensing and metering of solids is important for many applications but is challenging at milligram scales, though new strategies are emerging that may decrease the precision required for dosing solid reagents [179]. Indeed, the set of automatable experiments is ever-increasing, but a universal chemical synthesizer [180] remains elusive. The result of this gap is that design spaces may be constrained not only through prior knowledge (an intentional and useful narrowing of the space), but also by the capabilities of the available automated hardware. Characterizing the structure of physical matter is increasingly routine, but our ability to measure complex functions and connect them back to structure remains limited. Oliver et al. list several useful polymer characterization methods that have eluded full automation, such as differential scanning calorimetry and thermogravimetric analysis [175].
Challenge:
Facilitate integration through systems engineering
Scheduling, performing, and analyzing experiments can involve coordinating tasks between several independent pieces of hardware/software that must be physically and programmatically linked.

Expanding the scope of experimental platforms may require the integration of independent pieces of equipment at both the hardware and software level. The wide variety of necessary tasks (scheduling, error-handling, etc.) means that designing control systems for such highly-integrated platforms is an enormously complex task [181]. As a result, developing software for integration of an experimental platform [182] (Figure 2) can be a large contributor to the cost. The lack of a standard API and command set between different hardware providers means that each requires its own driver and software wrapper; this is particularly true for analytical equipment, which even lacks standardization in file formats for measured data. Programs like OVERLORD and Roch et al.'s ChemOS [183] are attempts to create higher-level controllers. Throughput-matching in sequential workflows is a challenge in general, requiring a plan for "parking" (and perhaps stabilizing) samples in the event of a bottleneck downstream. These practical issues must be resolved to benefit from increased integration and the ability to generate data.
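The "parking" idea can be sketched as a bounded buffer between a fast upstream module and a slower downstream one; this is an illustrative data structure, not any platform's actual API.

```python
from collections import deque

class ParkingBuffer:
    """Throughput matching between a fast upstream step and a slow
    downstream one: samples are 'parked' in a bounded FIFO buffer.
    When the buffer is full, the upstream module must pause (and
    perhaps stabilize the sample) until downstream catches up."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._queue = deque()

    def offer(self, sample) -> bool:
        """Try to park a sample; False signals upstream backpressure."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(sample)
        return True

    def take(self):
        """Hand the oldest parked sample to the downstream module."""
        return self._queue.popleft() if self._queue else None
```

Sizing the buffer is the interesting engineering decision: too small and the fast module idles, too large and parked samples may degrade before analysis.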
Challenge:
Automate the planning of multistep chemical syntheses
Many discovery tasks involve proposing new chemical matter; approaching these tasks with autonomous systems requires the ability to synthesize novel compounds on-demand.

A particularly challenging class of experiments is on-demand synthesis. The primary methodological/intellectual challenge for general-purpose automated synthesis is the planning of processes: designing multistep synthetic routes using available building blocks; selecting conditions for each reaction step, including quantities, temperature, and time; and automating intermediate and final purifications. If reduced to stirring, heating, and fluid transfer operations, chemical syntheses are straightforward to automate [184–186], and robotic platforms (Figure 3) are capable of executing a series of process steps if those steps are precisely planned [182, 187]. However, current CASP tools are unable to make directly implementable recommendations with this level of precision.

Figure 2: The workflow of automated synthesis, purification, and testing requires the scheduling of many independent operations handled by different pieces of hardware and software. Figure reproduced from Baranczak et al. [182].

There are two diverging philosophies on how to approach automated synthesis: (a) the development of general-purpose machines able to carry out most chemical reactions, or (b) the development of specialized machines to perform a few general-purpose reactions that are still able to produce most molecules. The references in the preceding paragraph follow the former approach. Burke and co-workers have advocated for the latter and propose using advanced MIDA boronate building blocks and a single reaction/purification strategy to simplify process design [188]. Peptide and nucleic acid synthesizers exemplify this notion of automating a small number of chemical transformations to produce candidates within a vast design space.
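The route-planning subproblem can be caricatured as recursive expansion of a target into precursors until everything is purchasable. The sketch below assumes a drastically simplified template dictionary mapping each product to one set of precursors; real CASP tools search over thousands of reaction templates (or learned graph models) and must rank among many competing disconnections.

```python
def plan_route(target, templates, stock, max_depth=5):
    """Toy retrosynthetic search. `templates` maps a product string to
    its precursor strings; `stock` is the set of purchasable building
    blocks. Returns an ordered list of (precursors, product) steps,
    or None if no route exists within `max_depth` disconnections."""
    if target in stock:
        return []  # already purchasable: nothing to synthesize
    if max_depth == 0 or target not in templates:
        return None  # out of search budget, or no known disconnection
    precursors = templates[target]
    route = []
    for p in precursors:
        sub_route = plan_route(p, templates, stock, max_depth - 1)
        if sub_route is None:
            return None  # one precursor is unmakeable: abandon this route
        route.extend(sub_route)
    route.append((tuple(precursors), target))  # make target last
    return route
```

Even this caricature exposes the two failure modes the text describes: a missing template (the model's coverage) and an exhausted depth budget (the combinatorics of multistep search).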
Many discoveries can be validated with high confidence through computational techniques alone. Where applicable, this can be extremely advantageous, because the logistics of the alternative (physical experiments) may be much more complex, e.g., relying on access to a large chemical inventory (of candidates or of precursors) to perform validation experiments within a large design space. An emblematic example of discoveries that can be validated through computation alone is that of physical matter whose desired function can be reliably estimated with first-principles calculations.

Figure 3: Rendering of Eli Lilly's first generation Automated Synthesis Laboratory (ASL) for automated synthesis and purification. Reproduced from ref. 187.
Challenge:
Accelerate code/software used for computational validation
Just as in physical experiments, there are practical challenges in computational experiments related to the throughput of high-fidelity calculations.

A unique feature of computational validation is the well-recognized tradeoff between speed and accuracy. Consider Lyu et al.'s virtual screen of 170 million compounds to identify binders to two protein targets through rigid-body docking [189]. The computational pipeline was fast, requiring only about one second per compound.

Challenge:
Broaden capabilities/applicability of first-principles calculations
Many properties of interest cannot be simulated accurately, forcing us to rely on experimental validation. Expanding the scope of what can be accurately modeled would open up additional applications for purely computational autonomous workflows. There are some tasks for which computational solutions exist but could be improved, including binding prediction through docking, reactive force field modeling, transition state searching, conformer generation, solvation energy prediction, and crystal structure prediction. There are other tasks with even fewer satisfactory approaches, including long-timescale molecular dynamics and multiscale modeling in materials. Some grand challenges in computational chemistry are discussed in refs. 196 and 197.
Challenge:
Ensure that validation reflects the real application
Computational or experimental validation that lends itself to automation is often a proxy for a more expensive evaluation. If the proxy and the true metric are misaligned, an autonomous platform will not be able to generate any useful results.

Ideally, there would be perfect alignment between the approaches to validation compatible with an autonomous system and the real task at hand. This is impossible for tasks like drug discovery, where imperfect in vitro assays are virtually required before evaluating in vivo performance during preclinical development. For other tasks, assays are simplified for the sake of automation or cost, e.g., Ada's measurement of the optoelectronic properties of a thin film as a proxy for hole mobility, itself a proxy for the efficiency of a multicomponent solar cell [15]. Assays used for validation in autonomous systems do not necessarily need to be high-throughput, just high-fidelity and automatable. Avoiding false results is critical, especially false negatives in design spaces where positive hits are sparse, such as in the discovery of ligands that bind strongly to a specific protein [198].
Challenge:
Lower the cost of automated validation
Relatively few things can be automated cheaply. This is especially true for problems requiring complex experimental procedures, e.g., multistep chemical synthesis.

While the equipment needed for a basic HTE setup is becoming increasingly accessible and compatible with the budget of many academic research groups [199, 200], we must increase the complexity of the automated platforms that are used for validation in order to increase the complexity of problems that can be addressed by autonomous workflows. Autonomous systems need not be high-throughput in nature, but, as we have mentioned several times throughout this review, accelerating search to facilitate exploration of ever-broader design spaces that we cannot explore manually should be one of the key goals/outcomes of the development of these types of platforms. As the community begins to undertake this challenge, it is imperative that we pay attention to affordability, lest we discourage or inhibit adoption. Homegrown systems can be made inexpensively through the integration of common hardware components and open-source languages for control [185, 201]. Miniaturization reduces material consumption costs, but can complicate system fabrication and maintenance. The decision to automate a workflow should be the result of a holistic evaluation of return on investment (ROI) [181].

The costs of computational assays are less of an impediment to autonomous discovery than experimental assays, given the accessibility of large-scale compute. Improving their accuracy is more of a priority. For example, the docking method used by Lyu et al. was sufficiently inexpensive to screen millions of compounds and obtain results that correlate with experimental binding affinity, but the majority of high-scoring compounds are false positives and the differentiation of top candidates is poor [189, 202].
Challenge:
Combine newly acquired data with prior literature data
Predictive models trained on existing data reflect beliefs about structure-property landscapes; when new data is acquired, that belief must be updated, preferably in a manner that reflects the relative confidence of the data sources.

A fundamental question yet to be addressed in studies combining data mining with automated validation is the following: how should new data acquired through experimental/computational validation be used to update models pretrained on literature data? The quintessential workflow for non-iterative data-driven discovery of physical matter includes (a) regressing a structure-property dataset, (b) proposing a new molecule, material, or device, and (c) validating a small number of those predictions. Incorporating this new data into the model should account for the fact that the new data may be generated under more controlled conditions or may be of higher fidelity than the literature data.

The nature of existing data can be different from what is newly acquired. For example, tabulated reaction data is available at the level of chemical species, temperature, time, intended major product, and yield. In the lab, we will know the conditions quantitatively (e.g., concentrations, order of addition), will have the opportunity to record additional factors (e.g., ambient temperature, humidity), and may be able to measure additional endpoints (e.g., identify side products). However, while we can more thoroughly evaluate different reaction conditions than what has been previously reported, the diversity of substrates reported in the literature exceeds what is practical to have in stock in any one laboratory; we must figure out how to meaningfully integrate the two. For discovery tasks that aim to optimize physical matter with standardized assays, where databases contain exactly what we would calculate or measure, this notion of complementarity is less applicable.
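When both sources report the same quantity, one minimal, hedged way to weight them by confidence is inverse-variance (precision) weighting, treating each source as an independent noisy estimate. The variances assigned to the literature and in-house sources are assumptions the practitioner must supply.

```python
def precision_weighted_mean(estimates):
    """estimates: list of (value, variance) pairs, e.g., a noisy
    literature value alongside a cleaner in-house measurement.
    Each source is weighted by its precision (1/variance); the
    combined estimate is returned as (mean, variance)."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    mean = sum(w * v for (v, _), w in zip(estimates, weights)) / total
    return mean, 1.0 / total  # combined variance is always <= the smallest input variance
```

This is the scalar special case of a Bayesian update with Gaussian likelihoods; the same principle (trust cleaner data more, but never discard the prior entirely) is what a full model-updating scheme would need to preserve.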
Excellent foundational work in statistics on (iterative) optimal experimental design strategies has been adapted to the domain of chemistry. Although iterative strategies often depend on manually-designed initializations and constrained search spaces, algorithms can be given the freedom to make decisions about which hypotheses to test. This flexibility makes iterative strategies inherently more relevant to autonomous discovery than noniterative ones.

A variety of algorithms exist for efficiently navigating design spaces and/or compound libraries (virtual or otherwise). Broadly speaking, these can be categorized as model-free (black-box optimizations, including evolutionary algorithms (EAs)) or model-based (using surrogate models for predicting performance and/or model uncertainty). The latter category includes uncertainty-guided experimental selection, where an acquisition function quantifies how useful a new experiment would be [17]; ref. 20 provides a tutorial on Bayesian optimization.
Challenge:
Quantify model uncertainty and domain of applicability
Active learning strategies are crucially dependent on quantifying uncertainty; doing so reliably in QSAR/QSPR modeling remains elusive, and current strategies cannot anticipate structure-activity cliffs or other rough features.

Accurate uncertainty quantification drives discovery by drawing attention to underexplored areas of a design space and helping to triage experiments, e.g., in combination with Bayesian optimization [203]. Statistical and probabilistic frameworks can account for uncertainty when analyzing data and selecting new experiments [203–207], but we must be able to meaningfully estimate our uncertainty to use them. Common frequentist methods for estimating uncertainty include model ensembling [208] and Monte Carlo (MC) dropout [209]; various Bayesian approaches, like the use of Gaussian process models, have been used as well [207, 210]. Not only is it difficult to generate meaningful outcomes with these methods, but they also tend to be computationally expensive (although MC dropout is generally less so than the others). In QSAR/QSPR, one often tries to define a domain of applicability (DOA) as a coarser version of uncertainty, where the DOA can be thought of as the input space for which the prediction and uncertainty estimation are meaningful [211–213].

There is little to no agreement on the correct way to estimate epistemic uncertainty (as opposed to aleatoric uncertainty, which arises from measurement noise). In drug discovery, activity cliffs [214] (sharp changes in binding affinity resulting from minor structural changes) are especially troublesome and call into question any attempt to directly connect structural similarity to functional similarity [215, 216]. Even functional descriptor-based representations are unlikely to capture all salient features.
Implicit or explicit assumptions must be made when choosing a representation and modeling technique, for example choosing an appropriate kernel and a prior on (or a hyperparameter controlling) the smoothness of the landscape in a Gaussian process model [217].
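As one concrete (and deliberately simplified) pattern, the mean and spread of an ensemble's predictions can stand in for the posterior mean and epistemic uncertainty in an expected improvement acquisition function:

```python
import math
import statistics

def expected_improvement(ensemble_preds, best_so_far):
    """Expected improvement (for maximization) of one candidate, using
    the mean/stdev of an ensemble's predictions as stand-ins for the
    posterior mean and epistemic uncertainty. Closed form assumes the
    predictive distribution is Gaussian."""
    mu = statistics.mean(ensemble_preds)
    sigma = statistics.stdev(ensemble_preds)
    if sigma == 0.0:
        return max(mu - best_so_far, 0.0)  # no uncertainty: pure exploitation
    z = (mu - best_so_far) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))       # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu - best_so_far) * cdf + sigma * pdf
```

Because the second term rewards spread, a candidate the ensemble disagrees about can outrank a candidate with a slightly higher but confidently-predicted mean, which is exactly the exploration behavior described above.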
Challenge:
Quantify the tradeoff between experimental difficulty and information gain
Experiment selection criteria should be able to account for the difficulty of an experiment, i.e., employ cost-sensitive active learning.

Experiment selection methods rarely account for the cost of an experiment in any quantitative way. Separately, experiment selection is occasionally biased by factors that are irrelevant to the hypothesis. If proposed experiments require the synthesis of several molecules (e.g., a compound library designed during lead optimization), an expert chemist will generally select those they determine to be easily synthesized, rather than those that are most informative. One must ask if it is worth spending weeks making a single compound that maximizes the expected improvement, or if there is a small analogue library that is easier to synthesize and, collectively, offers a similar probability of improvement. In this setting, there will almost always be a tradeoff between data that is fast and inexpensive to acquire and data that is most useful for the discovery. Understanding that tradeoff is essential for autonomous systems where experiments can have very different costs (e.g., selecting molecules to be synthesized) or likelihoods of success (e.g., electronic structure simulations prone to failure), in contrast to settings where experiments have similar costs (e.g., selecting virtual molecules for rigid-body docking). The situation becomes more complex for batched optimizations where, e.g., the cost of synthesizing 96 molecules in a parallel well-plate format is not merely the sum of their individual costs, but depends on overlap in the precursors and reaction conditions they employ.

Williams et al. provide one example of how to roughly quantify the value of active learning-based screening for Eve [2]. It is easy to imagine how one might augment this framework to account for cost as part of the experiment selection process. However, the utility calculation heuristics used by Williams et al.
would need to be substantially improved in order to be usefully applied to cases where the costs of experiments vary, which is the interesting setting here. To date, the experiments able to be conducted by a given automated or autonomous workflow are of comparable cost; the decision about whether that cost is reasonable is made by the human designer of the platform. The term for this in experimental design is cost-sensitive active learning [218].
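A crude starting point for cost sensitivity is to rank candidate experiments by expected information gain per unit cost; the data layout below is hypothetical, and a real batched setting would additionally need to model shared precursors and conditions rather than treating costs as independent.

```python
def rank_by_cost_adjusted_utility(candidates):
    """candidates: list of (name, expected_information_gain, cost)
    tuples. Rank by gain per unit cost, so a cheap analogue library
    can outrank one expensive bespoke compound even if the bespoke
    compound is individually more informative."""
    return sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
```

For example, a bespoke compound with gain 5.0 at cost 10.0 (ratio 0.5) ranks below an analogue library with gain 3.0 at cost 2.0 (ratio 1.5), which mirrors the weeks-of-synthesis tradeoff discussed above.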
Challenge:
Define discovery goals at a higher level
The ability of an automated system to make surprising or significant discoveries relies upon its ability to extrapolate and explore beyond what is known. This could be encouraged by defining broader objectives than is currently done.

In current data analyses, the structure of hypotheses tends to be prescribed: a mathematical function relating an expert-selected input to an expert-selected output, a correlative measure between two chemical terms, a causal model that describes a sequence of events. Ideally, we would be able to generate hypotheses from complex datasets in a more open-ended fashion, where we do not have to know exactly what we are looking for. Techniques in knowledge discovery [219], unsupervised learning [220], and novelty detection [221] are intended for just that purpose and may present a path toward more open-ended generation of scientific hypotheses (Figure 4).

Figure 4: Overview of the process of knowledge discovery from databases. Figure reproduced from Fayyad et al. [219].

Experimental design can also be given greater flexibility by defining broad goals for discovery (performance, novelty, etc.) and using computational frameworks to learn tradeoffs in reaching those goals, e.g., through reinforcement learning. Consider the goal of compound selection during a drug discovery campaign: to identify a molecule that ends up being suitable for clinical trials. In the earlier information-gathering stages, we don't necessarily need to select the highest-performing compounds, just the ones that provide information that lets us eventually identify them (i.e., a future reward).
More generally, the experiments proposed for validation and feedback in a discovery campaign should be selected to achieve a higher-order goal (eventually, finding the best candidate) rather than a narrow objective (maximizing performance within a designed compound library).

Open-ended inference is a general challenge in deep learning [222], as is achieving what we would call creativity in hypothesis generation [223]. At some level, in order to apply optimization strategies for experimental design or analysis, the goal of a discovery search must be reducible to a scalar objective function. We should strive to develop techniques for guided extrapolation toward the challenging-to-quantify goals that the field has used when defining discovery: novelty, interestingness, intelligibility, and utility.
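The exploration-exploitation tension behind "future reward" can be illustrated with an epsilon-greedy bandit over candidate compound series; this is a textbook sketch with hypothetical names, not a recommendation of a specific policy.

```python
import random

def epsilon_greedy(history, n_arms, epsilon=0.2, rng=random):
    """Toy experiment-selection policy: usually pick the arm (e.g., a
    candidate compound series) with the best observed mean reward, but
    with probability epsilon pick a random arm, so early greediness
    does not lock the campaign into a mediocre series.
    history: list of (arm, reward) pairs from past experiments."""
    if rng.random() < epsilon:
        return rng.randrange(n_arms)  # explore
    means = []
    for arm in range(n_arms):
        rewards = [r for a, r in history if a == arm]
        # untried arms get +inf so they are always tried at least once
        means.append(sum(rewards) / len(rewards) if rewards else float("inf"))
    return max(range(n_arms), key=lambda a: means[a])  # exploit
```

Richer reinforcement learning formulations replace the fixed epsilon with a learned tradeoff, which is exactly the flexibility argued for above.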
Strategies for selecting molecules and materials for validation in discovery workflows are worth additional discussion (Figure 5). Iterative strategies of the sort described above apply here, with active learning being useful for selecting compounds from a fixed virtual library and evolutionary/generative algorithms being useful for designing molecules on-the-fly. Generative models are a particularly attractive way to design molecules and materials with minimal human input, biased only by knowledge of the chemical space on which they are trained (Figure 6) [34, 35, 224].
Figure 5: Common sources of molecules from which to select those that fulfill some design objective. Molecules can be selected from (a) a fixed, known chemical space, (b) a make-on-demand library of synthesizable compounds, (c) an enumerated library (via systematic enumeration or evolutionary methods), and (d) molecules proposed de novo from a generative model. *An autoencoder architecture is shown as a representative type of generative model.

Figure 6: Schwalbe-Koda and Gómez-Bombarelli's timeline of generative model development for molecules (top to bottom). Figure reproduced from ref. 35. Figure subparts reproduced from refs. 225, 226, 227, 228, 229, and 230.
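As a minimal sketch of the first of these strategies, an uncertainty-based active-learning acquisition round over a fixed library reduces to a ranking step. The `predict_with_uncertainty` callable and the prediction values below are hypothetical placeholders for whatever surrogate model is in use.

```python
def select_batch(pool, predict_with_uncertainty, batch_size=2):
    """One active-learning acquisition round over a fixed virtual library:
    rank untested candidates by predictive uncertainty and return the
    batch expected to be most informative to test next."""
    ranked = sorted(pool,
                    key=lambda x: predict_with_uncertainty(x)[1],
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical (mean, uncertainty) predictions for four library members:
preds = {"mol_a": (0.9, 0.05), "mol_b": (0.4, 0.30),
         "mol_c": (0.5, 0.25), "mol_d": (0.7, 0.10)}
batch = select_batch(list(preds), preds.get)  # → ["mol_b", "mol_c"]
```

Pure uncertainty sampling is shown for brevity; in practice one would use an acquisition function (e.g., expected improvement) that trades off the predicted mean against the uncertainty.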
Challenge:
Bias generative models towards synthetic accessibility
Compared to fixed virtual libraries, a shortcoming of generative models is that the molecules or materials they propose may not be easily realizable.

Algorithms that can leverage existing data to suggest promising, as-yet-untested possibilities exist, but these do not yet function on the level of a human scientist, in part because they do not understand what experiments are possible. Generative models can concoct new molecules in some abstract latent space, but simplistic measures of synthesizability [231, 232] are not enough to steer the models toward accessible chemical space. Make-on-demand virtual libraries provide a distinct advantage in that one is more confident that a proposed molecule can be made in a short timeframe. Achieving that same confidence will be essential for the adoption of de novo methods, some of which are beginning to combine molecular generation and virtual enumeration [233]. Some applications of generative models, such as peptide design, do not suffer from this limitation because, to a first approximation, most peptides are equally synthesizable.
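One simple (if crude) way to bias generation toward accessible chemical space is rejection sampling against a synthetic-accessibility score. The scorer, candidate names, and threshold below are hypothetical stand-ins for heuristics like those of refs. 231 and 232, in which a lower score means easier to make.

```python
def sample_synthesizable(proposals, sa_score, threshold=4.0):
    """Draw candidates from a generative model (here, any iterator) and
    reject proposals whose synthetic-accessibility score exceeds a cutoff."""
    for candidate in proposals:
        if sa_score(candidate) <= threshold:
            return candidate
    return None  # generator exhausted without an accessible hit

# Hypothetical SA scores for three generated candidates:
scores = {"strained_cage": 8.7, "macrocycle": 6.1, "aryl_amide": 2.3}
hit = sample_synthesizable(iter(scores), scores.get)  # → "aryl_amide"
```

Post-hoc filtering of this kind does not fix the underlying model; the point made in the text stands that steering the generator itself toward synthesizable space, with confidence comparable to a make-on-demand library, remains open.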
Challenge:
Benchmark problems for molecular generative models
The current evaluations for generative models do not reflect the complexity of real discovery problems.

The explosion of techniques for molecular generation has outpaced our ability to meaningfully assess their performance. A metric introduced early on as a proxy objective is the "penalized logP" metric for molecular optimization. While not used for any actual discovery efforts, this heuristic function of estimated logP, synthetic accessibility, and a penalty for rings larger than six atoms was introduced for (and continues to be used for) benchmarking. The metric bears little resemblance to any multiobjective function one would use in practice. Only recently have more systematic benchmarks been introduced to cover a wider range of learning objectives: either maximizing a scalar objective or learning to mimic a distribution of training molecules. Two frameworks for such model comparisons include GuacaMol [234] and MOSES [235]. However, these do not consider the number of function evaluations required by each method and still represent simplistic goals. Optimization goals that better reflect the complexity of real discovery tasks might include binding or selectivity as predicted by docking scores [236].
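For reference, the penalized-logP objective in one of its commonly used forms combines exactly the three terms named above. The input values here are hypothetical, standing in for a computed logP estimate, a synthetic-accessibility score, and the size of the largest ring; published implementations vary in details such as normalization.

```python
def penalized_logp(est_logp, sa_score, largest_ring):
    """The 'penalized logP' proxy objective: estimated logP minus a
    synthetic-accessibility score minus a penalty for rings larger
    than six atoms (one common formulation)."""
    ring_penalty = max(0, largest_ring - 6)
    return est_logp - sa_score - ring_penalty

score = penalized_logp(est_logp=2.5, sa_score=3.0, largest_ring=8)  # → -2.5
```

The simplicity of this scalar is precisely the criticism: it collapses a genuinely multiobjective discovery problem into an additive heuristic that no practitioner would optimize directly.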
Challenge:
Demonstrate extrapolative power of predictive models
If the ultimate goal of computer-aided discovery is to generate new scientific knowledge, extrapolation beyond what is known is a necessity.

The majority of approaches to automated discovery of physical matter rely on predictive models to guide the selection of experiments. The most effective models to facilitate this process will be able to at least partially extrapolate from our current knowledge to new chemical matter, eliminating the need for brute-force experimentation. This extrapolative power, i.e., the ability of QSAR/QSPR models to generalize to design spaces they have not been trained on, should be prioritized as an evaluation metric during model development. The potential for algorithms to guide us toward novel areas of chemical or reactivity space was emphasized in a recent review by Gromski et al. [237].
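One practical way to measure extrapolation rather than interpolation is a group-based (e.g., scaffold) split, which holds out whole structural families instead of random molecules. This is a sketch under assumptions: `scaffold_of` is a placeholder for a real scaffold function (e.g., Bemis-Murcko), and the toy records use the first character as a mock scaffold key.

```python
from collections import defaultdict

def scaffold_split(records, scaffold_of, test_fraction=0.2):
    """Assign whole scaffold groups to train or test so that every test
    molecule comes from a scaffold the model has never seen."""
    groups = defaultdict(list)
    for rec in records:
        groups[scaffold_of(rec)].append(rec)
    n_train = len(records) - int(test_fraction * len(records))
    train, test = [], []
    # Largest families go to training; rare scaffolds end up held out.
    for group in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) < n_train else test).extend(group)
    return train, test

# Toy records where the first character plays the role of a scaffold key:
train, test = scaffold_split(["a1", "a2", "a3", "b1", "b2", "c1"],
                             scaffold_of=lambda s: s[0], test_fraction=0.2)
```

A model that scores well under a random split but poorly under this split is interpolating within known chemistry, which is exactly the failure mode the text argues evaluation should expose.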
Challenge:
Demonstrate design-make-test beyond proof-of-concept
All studies to date that demonstrate closed-loop design-make-test cycles have been proofs-of-concept limited to narrow search spaces, severely limiting their practical utility.

A compelling demonstration of autonomous discovery in chemistry would be the closed-loop design, synthesis, and testing of new molecules to optimize a certain property of interest or build a structure-property model. There has not been much progress since early proof-of-concept studies that could access only a limited chemical space [5, 238], despite significant advances in the requisite areas of molecular design, CASP, and automated synthesis. These constraints on the design space to ensure compatibility with automated validation prevent us from addressing many interesting questions and optimization objectives. Chow et al. describe several case studies where certain steps in the drug discovery process have been integrated with each other for increased efficiency, but acknowledge, as others have, that all stages must be automated and integrated for maximal efficiency [239–241].
Challenge:
Develop benchmark problems for discovery
Developing methods for autonomous discovery would benefit from a "sandbox" that does not eliminate all of the complexity of real domain applications.

There is no unified strategy for the use of existing data and the acquisition of new data for discovering functional physical matter, processes, or models. The existence of benchmarks would encourage method development and make it easier to evaluate when new techniques are an improvement over existing ones. We have such evaluations for purely computational problems like numerical optimization and transition state searching, but there are no realistic benchmarks upon which to test algorithms for autonomous discovery (e.g., hypothesis generation, experimental selection, etc.).

Vempaty et al. describe one way to evaluate knowledge discovery algorithms through a simplified "coupon-collector model"; this model assumes that domain knowledge is a set of elements to be identified through a noisy observation process [242], which represents a limited problem formulation. Even for subtasks with seemingly better-defined goals, like building empirical QSAR/QSPR models, there are no standard evaluations for assessing interpretability, uncertainty quantification, or generalizability. The field will need to collectively establish a set of problem formulations that describe many discovery tasks of interest to domain experts in order to benchmark components of autonomous discovery. Given the practical obstacles to validation through physical experiments, computational chemistry may be the right playground for advancing these techniques.

However, we do caution that an overemphasis on quantitative benchmarking can be detrimental. Language tasks have reached a point where the amount of compute required for competitive performance is inaccessible for all but the most well-resourced research groups [243].
Unless benchmarking controls what (open source) training data is permissible, a lack of access to compute and data may inadvertently discourage method development.
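The coupon-collector abstraction mentioned above has a closed-form expected cost: identifying all n elements takes n times the n-th harmonic number of draws, inflated by the reciprocal of the per-observation success probability under a simple noise model. The sketch below assumes independent, uniformly random observations, which is part of why the text calls this a limited problem formulation.

```python
from fractions import Fraction

def expected_observations(n, success_prob=1.0):
    """Expected number of observations needed to identify all n elements
    of the knowledge set when each draw reveals a uniformly random element
    and succeeds with probability success_prob: (n * H_n) / p."""
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
    return n * float(harmonic) / success_prob

# Collecting both elements of a two-element knowledge set takes three
# noiseless observations on average:
expected_observations(2)  # → 3.0
```

Real discovery tasks violate every assumption here (elements are not equally likely, observations are correlated, and the "set" is open-ended), which is what makes richer benchmark formulations necessary.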
The case studies in this article illustrate that computer assistance and automation have become ubiquitous parts of scientific discovery, both by reducing the manual effort required to complete certain tasks and by enabling entirely new approaches to discovery at an unprecedented throughput. But to what extent can the discovery itself be considered a direct result of automation or autonomy?

As summarized in our reflection on the case studies in Part 1, very few studies can claim to have achieved a high level of autonomy. In particular, researchers frequently gloss over the fact that specifying the discovery objective, defining the search space, and narrowing that space to the "relevant" space that is ultimately explored requires substantial manual input. While there will always be a need for subject matter experts in constructing these platforms and associated workflows, we hope it will be possible to endow autonomous platforms with sufficiently broad background knowledge and validation capabilities that this initial narrowing of the search space is less critical to their success.
The bar for what makes a noteworthy discovery is ever-increasing. Computer-aided structural elucidation, building structure-activity relationships, and automated reaction optimization are all discoveries under the definition we have presented here, but they are not perceived to be as significant as they were in the past. As computational algorithms become more flexible and adaptive in other contexts, and as the scope of automatable validation experiments expands, more and more workflows will appear routine.

We have intentionally avoided a precise definition of the degree of confidence required for a discovery without direct experimental observation of a desired physical property. This is because it varies widely by domain and is rapidly evolving as computational validation techniques and proxy assays become more accurate. A computational prediction of a new chemical reaction would likely not be considered a discovery under any circumstances without experimental validation. A computational prediction of a set of bioactive compounds might, but with a subjective threshold for the precision of its recommendations. Whether the computational workflow has directly made the discovery of a new compound might depend on whether all of the top n compounds were found to be active, or whether at least m of the top n were, and so on.

The current role of humans in computer-assisted discovery is clear. Langley writes of the "developer's role" in terms of high-level tasks: formulating the discovery problem, settling on an effective representation, and transforming the output into results meaningful to the community [244, 245]. Honavar includes mapping the current state of knowledge and generating/prioritizing research questions [246].

Alan Turing's imitation game ("the Turing test") asks whether a computer program can be made indistinguishable from a human conversationalist [247].
It is interesting to wonder if we can reach a point where autonomous platforms are able to report insights and interpretations that are indistinguishable (both in presentation and scientific merit) from what a human researcher might publish in a journal article. Among other things, this would require substantial advances in hypothesis generation, explainability, and scientific text generation. Machine-generated review articles and textbooks may be the first to pass this test [248]. Kitano's more ambitious grand challenge in his call-to-arms is to make a discovery in the biomedical sciences worthy of a Nobel Prize [170].

We do not want to overstress a direct analogy of the Turing test to autonomous discoveries, because the types of discoveries typically enabled by automation and computational techniques are often distinct from those made by hand. For the field to have the broadest shared capabilities, the best discovery platforms will excel at tasks that humans cannot easily or safely do. The scale of data generation, the size of a design space that can be searched, and the ability to define new experiments that account for enormous quantities of existing information make autonomous systems equipped to make discoveries in ways entirely distinct from humans.

Turing makes the point that the goals of machines and programs are distinct; that a human would lose in a race with an airplane does not mean we should slow down airplanes so their speeds are indistinguishable. Rephrased more recently by Steve Ley, "while people are always more important than machines, increasingly we think that it is foolish to do things machines can do better than ourselves" [249]. Particularly when faced with the grunt work of some manual experimentation, "leaving such things to machines frees us for still better tasks" (Derek Lowe) [180]. We should embrace the divergence of human versus machine tasks.
We join many others in touting the promise of autonomous or accelerated discovery [239, 241, 250–258]. Automation has brought increased productivity to the chemical sciences through efficiency, reproducibility, reduction in error, and the ability to cope with complex problems at scale; machine learning and data science have done likewise through the identification of highly nonlinear relationships, trends, and patterns in complex data.

The previous section identified a number of directions in which additional effort is required to capture the full value of that promise: creating and maintaining high-quality open access datasets; building interpretable, data-efficient, and generalizable empirical models; expanding the scope of automated experimental platforms, particularly for multistep chemical synthesis; improving the applicability and speed of automated computational validation; aligning automated validation with prior knowledge and what is needed for different discovery applications, ideally not at significant cost; improving uncertainty quantification and cost-sensitive active learning; enabling open-ended hypothesis generation for experimental selection; and explicitly incorporating synthesizability considerations into generative models and benchmarking on realistic tasks. Evaluation will require the creation of benchmark problems that, we argue, should focus on whether algorithms facilitate extrapolation to underexplored, large design spaces that are currently expensive or intractable to explore.

Numerous research initiatives are supporting work in these directions.
For example, the United States Department of Defense recently funded a multidisciplinary initiative to develop a Scientific Autonomous Reasoning Agent; the Defense Advanced Research Projects Agency (DARPA) has funded several programs relevant to autonomous discovery, including Data-Driven Discovery of Models, the Big Mechanism Project, Make-It, Accelerated Molecular Discovery, and Synergistic Discovery and Design; the Engineering and Physical Sciences Research Council (EPSRC) has an ongoing Dial-a-Molecule challenge that strives to debottleneck synthesis, and recently launched a Centre of Doctoral Training in Automated Chemical Synthesis Enabled by Digital Molecular Technologies; and the Materials Genome Initiative, the Materials Project, and Mission Innovation's Materials Acceleration Platform continue to bring sweeping changes to how data in materials science is collected, curated, and applied to discovery. Many more commercial efforts are underway as well, with significant investment from the pharmaceutical industry into the integration and digitization of their drug discovery workflows.

A 2004 perspective article by Glymour stated that we were in the midst of a revolution to automate scientific discovery [259]. Regardless of whether we were then, we certainly seem to be now.
We thank Thomas Struble for providing comments on the manuscript and our other colleagues and collaborators for useful conversations around this topic. This work was supported by the Machine Learning for Pharmaceutical Discovery and Synthesis Consortium and the DARPA Make-It program under contract ARO W911NF-16-2-0023.
References [1] R. D. King et al.,
Science , , 85–89.[2] K. Williams et al., J. R. Soc. Interface , , 20141289.[3] R. D. King, K. E. Whelan, F. M. Jones, P. G. K. Reiser, C. H. Bryant, S. H. Muggleton, D. B. Kell,S. G. Oliver, Nature , , 247–252.[4] L. Weber, S. Wallbaum, C. Broger, K. Gubernator, Angew. Chem. Int. Ed. in English , ,2280–2282.[5] B. Desai et al., J. Med. Chem. , , 3033–3047.[6] B. J. Reizman, Y.-M. Wang, S. L. Buchwald, K. F. Jensen, React. Chem. Eng. , , 658–666.327] P. Nikolaev, D. Hooper, N. Perea-López, M. Terrones, B. Maruyama, ACS Nano , , 10214–10222.[8] J. D. Kangas, A. W. Naik, R. F. Murphy, BMC Bioinf. , , 143.[9] C. W. Gao, J. W. Allen, W. H. Green, R. H. West, Comput. Phys. Commun. , , 212–225.[10] J. Fang, X. Pang, R. Yan, W. Lian, C. Li, Q. Wang, A.-L. Liu, G.-H. Du, RSC Adv. , , 9857–9871.[11] R. Gómez-Bombarelli et al., Nature Mater. , , 1120–1127.[12] M. H. S. Segler, T. Kogej, C. Tyrchan, M. P. Waller, ACS Cent. Sci. , , 120–131.[13] A. W. Thornton et al., Chem. Mater. , , 2844–2854.[14] J. P. Janet, L. Chan, H. J. Kulik, J. Phys. Chem. Lett. , , 1064–1071.[15] B. P. MacLeod et al., arXiv:1906.05398 [cond-mat physics:physics] .[16] P. Nikolaev, D. Hooper, F. Webber, R. Rao, K. Decker, M. Krein, J. Poleski, R. Barto, B. Maruyama, Npj Comput. Mater. , , 16031.[17] B. Settles, Synthesis Lectures on Artificial Intelligence and Machine Learning , , 1–114.[18] C. M. Anderson-Cook, C. M. Borror, D. C. Montgomery, J. Stat. Plan. Inference , , 629–641.[19] J. Mockus, V. Tiesis, A. Zilinskas, Towards global optimisation , , 117–129.[20] P. I. Frazier, arXiv preprint arXiv:1807.02811 .[21] J. Aires de Sousa in Applied Chemoinformatics , (Eds.: T. Engel, J. Gasteiger), Wiley-VCH VerlagGmbH & Co. KGaA, Weinheim, Germany, , pp. 133–163.[22] Y. Gil, H. Hirsh in 2012 AAAI Fall Symposium Series, .[23] A. Sharafi,
Knowledge Discovery in Databases , Springer Fachmedien Wiesbaden, Cambridge, MA,USA, .[24] E. Kim, K. Huang, A. Tomala, S. Matthews, E. Strubell, A. Saunders, A. McCallum, E. Olivetti,
Sci.Data , , 170127.[25] B. M. Gyori, J. A. Bachman, K. Subramanian, J. L. Muhlich, L. Galescu, P. K. Sorger, Mol. Syst.Biol. , , 954.[26] V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder,A. Jain, Nature , , 95–98.[27] P. Raccuglia, K. C. Elbert, P. D. F. Adler, C. Falk, M. B. Wenny, A. Mollo, M. Zeller, S. A. Friedler,J. Schrier, A. J. Norquist, Nature , , 73–76.[28] D. T. Ahneman, J. G. Estrada, S. Lin, S. D. Dreher, A. G. Doyle, Science , , 186–190.[29] A. F. Zahrt, J. J. Henle, B. T. Rose, Y. Wang, W. T. Darrow, S. E. Denmark, Science , ,eaau5631.[30] J. P. Reid, M. S. Sigman, Nature , , 343–348.[31] W. F. Reinhart, A. W. Long, M. P. Howard, A. L. Ferguson, A. Z. Panagiotopoulos, Soft Matter , , 4733–4745.[32] A. Mardt, L. Pasquali, H. Wu, F. Noé, Nat. Commun. , , 5.[33] A. Rives, S. Goyal, J. Meier, D. Guo, M. Ott, C. L. Zitnick, J. Ma, R. Fergus, bioRxiv , 622803.[34] D. C. Elton, Z. Boukouvalas, M. D. Fuge, P. W. Chung, arXiv:1903.04388 [physics stat] .[35] D. Schwalbe-Koda, R. Gómez-Bombarelli, arXiv:1907.01632 [physics stat] .[36] S. Kim et al., Nucleic Acids Res. , , D1102–D1109.[37] J. Hill, G. Mulholland, K. Persson, R. Seshadri, C. Wolverton, B. Meredig, MRS Bull. , ,399–409.[38] L. Himanen, A. Geurts, A. S. Foster, P. Rinke, arXiv:1907.05644 [cond-mat physics:physics] .3339] D. J. Rigden, X. M. Fernández, Nucleic Acids Res. , , D1–D7.[40] leejunhyun, The Databases for Drug Discovery (DDD), , https://github.com/LeeJunHyun/The-Databases-for-Drug-Discovery (visited on 07/26/2019).[41] M. Krier, G. Bret, D. Rognan, J. Chem. Inf. Model. , , 512–524.[42] S. R. Langdon, N. Brown, J. Blagg, J. Chem. Inf. Model. , , 2174–2185.[43] J. J. Irwin, B. K. Shoichet, J. Chem. Inf. Model. , , 177–182.[44] T. Sterling, J. J. Irwin, J. Chem. Inf. Model. , , 2324–2337.[45] ChemSpider | Search and share chemistry, (visited on 02/12/2019).[46] G. Papadatos et al., Nucleic Acids Res. , , D1220–D1228.[47] P. Banerjee, J. 
Erehman, B.-O. Gohlke, T. Wilhelm, R. Preissner, M. Dunkel, Nucleic Acids Res. , , D935–D939.[48] Synthetically Accessible Virtual Inventory (SAVI) Database Download Page, https://cactus.nci.nih.gov/download/savi%5C_download/ (visited on 02/12/2019).[49] eMolecules Database Download - eMolecules, (visited on 07/31/2019).[50] MolPort: Download Compound Database | Available Compounds, (visited on 07/31/2019).[51] REAL Compounds - Enamine, https : / / enamine . net / library - synthesis / real - compounds (visited on 07/25/2019).[52] Chemspace | Compound Libraries, https://chem-space.com/compounds (visited on 07/31/2019).[53] T. Fink, J.-L. Reymond, J. Chem. Inf. Model. , , 342–353.[54] L. C. Blum, J.-L. Reymond, J. Am. Chem. Soc. , , 8732–8733.[55] L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, J. Chem. Inf. Model. , , 2864–2875.[56] F. Chevillard, P. Kolb, J. Chem. Inf. Model. , , 1824–1835.[57] L. Humbeck, S. Weigang, T. Schäfer, P. Mutzel, O. Koch, ChemMedChem , , 532–539.[58] ChEMBL, (visited on 02/12/2019).[59] A. Gaulton et al., Nucleic Acids Res. , , D1100–D1107.[60] J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston,P. Mendes, C. Steinbeck, Nucleic Acids Res. , , D1214–D1219.[61] RCSB PDB: Homepage, (visited on 02/12/2019).[62] Welcome to PDBbind-CN Database, (visited on 02/12/2019).[63] M. M. Gromiha, H. Uedaira, J. An, S. Selvaraj, P. Prabakaran, A. Sarai, Nucleic Acids Res. , , 301–302.[64] A. B. Keenan et al., Cell Syst. , , 13–24.[65] J. Jankauskait˙e, B. Jiménez-García, J. Dapk¯unas, J. Fernández-Recio, I. H. Moal, Bioinformatics , , 462–469.[66] xMoDEL: Molecular Dynamics Libraries | Molecular Modeling and Bioinformatics Group, (visited on 02/12/2019).[67] D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, J. Ostell, K. D. Pruitt, E. W. Sayers, Nucleic Acids Res. , , D41–D47.[68] D. S. Wishart, Nucleic Acids Res. , , D668–D672.[69] M. K. Gilson, T. Liu, M. Baitaluk, G. Nicola, L. Hwang, J. 
Chong, Nucleic Acids Res. , ,D1045–D1053. 3470] S. Ekins, B. A. Bunin in In Silico Models for Drug Discovery , Springer, , pp. 139–154.[71] D. J. Dix, K. A. Houck, M. T. Martin, A. M. Richard, R. W. Setzer, R. J. Kavlock,
Toxicol. Sci. , , 5–12.[72] A. M. Richard et al., Chem. Res. Toxicol. , , 1225–1251.[73] R. R. Tice, C. P. Austin, R. J. Kavlock, J. R. Bucher, Environ. Health Perspect. , , 756–765.[74] Tox21 Data Browser, https://tripod.nih.gov/tox21/index (visited on 08/06/2019).[75] D. Lowe, Chemical reactions from US patents (1976-Sep2016), .[76] Pistachio, https://doi.org/10.1036/1097-8542.519800 (visited on 04/04/2019).[77] Reaxys, (visited on 02/12/2019).[78] Reactions - CASREACT - Answers to your chemical reaction questions | CAS, /support/documentation/reactions (visited on 02/12/2019).[79] InfoChem - SPRESI - Storage and retrieval of chemical structure and reaction information - infochem, (visited on 02/12/2019).[80] Databases - Librarians - Wiley Online Library, https : / / onlinelibrary . wiley . com / library -info/products/databases (visited on 02/12/2019).[81] J. Gao, L. B. M. Ellis, L. P. Wackett, Nucleic Acids Res. , , D488–D491.[82] NIST Chemical Kinetics Database, https://kinetics.nist.gov/kinetics/index.jsp (visited on07/31/2019).[83] C. Steinbeck, S. Kuhn, Phytochemistry , , 2711–2717.[84] M. Rupp, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld, Phys. Rev. Lett. , , 058301.[85] G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller,O. Anatole von Lilienfeld, New J. Phys. , , 095003.[86] R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Scientific Data , , 140022.[87] R. Ramakrishnan, M. Hartmann, E. Tapavicza, O. A. von Lilienfeld, J. Chem. Phys. , ,084111.[88] D. L. Mobley, J. P. Guthrie, J. Comput.-Aided Mol. Des. , , 711–720.[89] P. J. Linstrom, W. G. Mallard, J. Chem. Eng. Data , , 1059–1063.[90] S. Otsuka, I. Kuwajima, J. Hosoya, Y. Xu, M. Yamazaki in 2011 International Conference on EmergingIntelligent Data and Web Technologies, IEEE, , pp. 22–29.[91] A. Merkys, A. Vaitkus, J. Butkus, M. Okulič-Kazarinas, V. Kairys, S. Gražulis, J Appl Crystallogr , , 292–301.[92] S. Gražulis, A. Merkys, A. Vaitkus, M. 
Okulič-Kazarinas, J Appl Crystallogr , , 85–91.[93] S. Gražulis, A. Daškevič, A. Merkys, D. Chateigner, L. Lutterotti, M. Quirós, N. R. Serebryanaya,P. Moeck, R. T. Downs, A. Le Bail, Nucleic Acids Res , , D420–D427.[94] Y. G. Chung, J. Camp, M. Haranczyk, B. J. Sikora, W. Bury, V. Krungleviciute, T. Yildirim, O. K.Farha, D. S. Sholl, R. Q. Snurr, Chem. Mater. , , 6185–6192.[95] C. E. Wilmer, M. Leaf, C. Y. Lee, O. K. Farha, B. G. Hauser, J. T. Hupp, R. Q. Snurr, Nature Chem. , , 83–89.[96] C. R. Groom, I. J. Bruno, M. P. Lightfoot, S. C. Ward, Acta Cryst B , , 171–179.[97] G. Bergerhoff, R. Hundt, R. Sievers, I. D. Brown, J. Chem. Inf. Comput. Sci. , , 66–69.[98] NOMAD Repository, https://repository.nomad-coe.eu/ (visited on 02/12/2019).[99] S. Curtarolo et al., Comput. Mater. Sci. , , 218–226.[100] Aflow - Automatic - FLOW for Materials Discovery, http://aflowlib.org/ (visited on 02/12/2019).35101] S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl, C. Wolverton, NpjComput. Mater. , , 15010.[102] A. Jain et al., APL Materials , , 011002.[103] D. D. Landis, J. S. Hummelshoj, S. Nestorov, J. Greeley, M. Dulak, T. Bligaard, J. K. Norskov,K. W. Jacobsen, Comput. Sci. Eng. , , 51–57.[104] Projects — COMPUTATIONAL MATERIALS REPOSITORY, https : / / cmr . fysik . dtu . dk/ (visited on 02/12/2019).[105] Pearson’s Crystal Data, (visited on 02/12/2019).[106] S. A. Lopez, E. O. Pyzer-Knapp, G. N. Simm, T. Lutzow, K. Li, L. R. Seress, J. Hachmann, A.Aspuru-Guzik, Sci Data , , 160086.[107] R. Lammey, Sci. Ed. , , 22–27.[108] pubmeddev, Home - PubMed - NCBI, (visited on02/12/2019).[109] arXiv Bulk Data Access | arXiv e-print repository, https://arxiv.org/help/bulk_data (visited on08/02/2019).[110] Text and Data Mining Agreement - Wiley Online Library, http://olabout.wiley.com/WileyCDA/Section/id-826542.html (visited on 08/02/2019).[111] Text and data mining policy - Elsevier, (visited on 08/02/2019).[112] S. Ekins, A. M. Clark, S. J. Swamidass, N. 
Litterman, A. J. Williams, J. Comput.-Aided Mol. Des. , , 997–1008.[113] B. Hie, H. Cho, B. Berger, Science , , 347–350.[114] D. J. Audus, J. J. de Pablo, ACS Macro Lett. , , 1078–1082.[115] I. V. Tetko, O. Engkvist, U. Koch, J.-L. Reymond, H. Chen, Mol. Inf. , , 615–621.[116] S. Ekins, A. J. Williams, Pharm. Res. , , 393–395.[117] T. C. Norman, C. Bountra, A. M. Edwards, K. R. Yamamoto, S. H. Friend, Sci. Transl. Med. , , 88mr1–88mr1.[118] S. Ekins, A. M. Clark, A. J. Williams, Mol. Inf. , , 585–597.[119] M. Krallinger, F. Leitner, O. Rabal, M. Vazquez, J. Oyarzabal, A. Valencia, J. Cheminform. , , S1.[120] M. C. Swain, J. M. Cole, J. Chem. Inf. Model. , , 1894–1904.[121] M. Krallinger, O. Rabal, A. Lourenço, J. Oyarzabal, A. Valencia, Chem. Rev. , , 7673–7761.[122] E. Kim, K. Huang, A. Saunders, A. McCallum, G. Ceder, E. Olivetti, Chem. Mater. , , 9436–9444.[123] Z. Zhai, D. Q. Nguyen, S. A. Akhondi, C. Thorne, C. Druckenbrodt, T. Cohn, M. Gregory, K.Verspoor, arXiv:1907.02679 [cs] .[124] S. Zheng, S. Dharssi, M. Wu, J. Li, Z. Lu in Methods in Molecular Biology , (Eds.: R. S. Larson, T. I.Oprea), Methods in Molecular Biology, Springer New York, New York, NY, , pp. 231–252.[125] D. R. Swanson, N. R. Smalheiser,
Artif. Intell. , Scientific Discovery , , 183–203.[126] A. Gomez-Perez, M. Martinez-Romero, A. Rodriguez-Gonzalez, G. Vazquez, J. M. Vazquez-Naya,Ontologies in Medicinal Chemistry: Current Status and Future Challenges, en, Text, .[127] P. W. Battaglia et al., arXiv:1806.01261 [cs stat] .[128] A. J. Williams, S. Elkins, V. Tkachenko, C. Lipinski, A. Tropsha, Drug Discovery World , ,33–39.[129] M. Jaskolski, Acta Crystallogr D Biol Cryst , , 1865–1866.36130] H. Berman, G. J. Kleywegt, H. Nakamura, J. L. Markley, Acta Crystallogr D Biol Cryst , ,2297–2297.[131] M. J. Sippl, Proteins , , 355–362.[132] Č. Venclovas, K. Ginalski, C. Kang, Protein Sci. , , 1594–1602.[133] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, V. Pande, Chem. Sci. , , 513–530.[134] M. Haghighatlari, G. Vishwakarma, D. Altarawy, R. Subramanian, B. U. Kota, A. Sonpal, S. Setlur,J. Hachmann, , DOI .[135] A. Krizhevsky, I. Sutskever, G. E. Hinton, Commun. ACM , , 84–90.[136] P. Hop, B. Allgood, J. Yu, Mol. Pharmaceutics , , 4371–4377.[137] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, R. P.Adams in Advances in Neural Information Processing Systems 28 , (Eds.: C. Cortes, N. D. Lawrence,D. D. Lee, M. Sugiyama, R. Garnett), Curran Associates, Inc., , pp. 2224–2232.[138] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, P. Riley,
J. Comput.-Aided Mol. Des. , ,595–608.[139] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes,P. F. Riley, O. A. von Lilienfeld, arXiv preprint arXiv:1704.01212 , , 5255–5264.[140] V. Korolev, A. Mitrofanov, A. Korotcov, V. Tkachenko, arXiv:1906.06256 [physics] .[141] K. Yang et al., J. Chem. Inf. Model. , DOI .[142] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, C. Kim,
Npj Comput. Mater. ,DOI .[143] L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, M. Scheffler,