Discovery of Self-Assembling π-Conjugated Peptides by Active Learning-Directed Coarse-Grained Molecular Simulation
Kirill Shmilovich,† Rachael A. Mansbach,‡ Hythem Sidky,† Olivia E. Dunne,† Sayak Subhra Panda,¶,§ John D. Tovar,¶,§,‖ and Andrew L. Ferguson∗,†

†Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637, U.S.A.
‡Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
¶Department of Chemistry, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
§Institute of NanoBioTechnology, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
‖Department of Materials Science and Engineering, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
E-mail: [email protected] a r X i v : . [ q - b i o . B M ] J a n bstract Electronically-active organic molecules have demonstrated great promise as novel softmaterials for energy harvesting and transport. Self-assembled nanoaggregates formedfrom π -conjugated oligopeptides composed of an aromatic core flanked by oligopeptidewings offer emergent optoelectronic properties within a water soluble and biocompat-ible substrate. Nanoaggregate properties can be controlled by tuning core chemistryand peptide composition, but the sequence-structure-function relations remain poorlycharacterized. In this work, we employ coarse-grained molecular dynamics simula-tions within an active learning protocol employing deep representational learning andBayesian optimization to efficiently identify molecules capable of assembling pseudo-1Dnanoaggregates with good stacking of the electronically-active π -cores. We consider theDXXX-OPV3-XXXD oligopeptide family, where D is an Asp residue and OPV3 is anoligophenylene vinylene oligomer (1,4-distyrylbenzene), to identify the top performingXXX tripeptides within all 20 = 8,000 possible sequences. By direct simulation ofonly 2.3% of this space, we identify molecules predicted to exhibit superior assemblyrelative to those reported in prior work. Spectral clustering of the top candidates re-veals new design rules governing assembly. This work establishes new understanding ofDXXX-OPV3-XXXD assembly, identifies promising new candidates for experimentaltesting, and presents a computational design platform that can be generically extendedto other peptide-based and peptide-like systems. Introduction
Self-assembling π-conjugated peptides possessing a π-core flanked by peptide wings have emerged as a versatile building block for the bottom-up fabrication of bio-compatible nanoaggregates with engineered optoelectronic properties. Overlaps between π-orbitals in neighboring aromatic cores within supramolecular assemblies lead to the emergence of optical and electronic properties including fluorescence, electron/hole transport, and exciton splitting, and the flanking oligopeptide wings provide the capacity to operate in and interact with biological environments. These peptidic materials have proven readily synthesizable and responsive to external control mediated by pH, flow, light, salt concentration, and temperature, and have found a host of potential applications in the context of photovoltaic power generation, energy harvesting, and as organic transistors.
The structural and functional properties of the self-assembled nanoaggregates are governed by the molecular chemistry of the π-core and the amino acid sequence of the peptide wings.

The Asp-X-X-X-(oligophenylenevinylene)-X-X-X-Asp (DXXX-OPV3-XXXD) family represents one class of synthetic π-conjugated peptides possessing an oligophenylenevinylene π core, terminal Asp residues, and amino acid side chains, where X represents one of the 20 natural amino acids (Fig. 1a). To assure the molecules are head-to-tail invariant, the oligopeptide wings are constrained to be mirror-symmetric both in the identity of the amino acids and the N-to-C directionality, such that each molecule possesses two C-termini. The terminal residues are constrained to be Asp to endow each terminus of the molecule with two carboxyl groups and provide a pH trigger for assembly: at pH > 5 the four carboxyls are deprotonated, endowing the molecule with a (−4)e formal charge and disfavoring large-scale assembly, but at pH < 1 the residues protonate, the molecule becomes neutral, and large-scale aggregation proceeds. The DXXX-OPV3-XXXD family has attracted considerable experimental and computational attention in recent years due to its demonstrated capability to assemble into pseudo-1D optically and electronically active nanoaggregates whose structure and properties can be tuned through selection of the X residues.
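The combinatorics of the family described above can be made concrete with a short sketch. This is not code from the paper; the one-letter amino-acid alphabet and the `peptide_name` helper are illustrative conveniences of this example:

```python
from itertools import product

# One-letter codes for the 20 natural amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def peptide_name(xxx: str) -> str:
    """Illustrative naming helper: the right wing mirrors the left in both
    amino acid identity and N-to-C directionality, so 'FAG' yields
    DFAG-OPV3-GAFD and every molecule carries two C-termini."""
    return f"D{xxx}-OPV3-{xxx[::-1]}D"

# All possible XXX triplets: 20**3 = 8,000 candidate molecules.
candidates = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]

print(len(candidates))        # 8000
print(peptide_name("FAG"))    # DFAG-OPV3-GAFD
```

Enumerating the triplets this way also gives a fixed, reproducible ordering of the 8,000-member design space, which is convenient when bookkeeping which candidates have been simulated.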
Assembly in aqueous solvent under acidic conditions is driven by hydrophobic, π-π stacking, and hydrogen bonding interactions. The assembly of elongated peptides into linear aggregates with in-register stacking and alignment of the π-cores favors π orbital overlap, electronic delocalization along the backbone of the nanoaggregate, and the emergence of optical and electronic functionality such as well-defined absorption and emission spectra, HOMO/LUMO gaps, electron/hole conductivity, and exciton splitting capabilities (Fig. 1b,c).

The complete DXXX-OPV3-XXXD family comprises 20³ = 8,000 members corresponding to all possible permutations of the 20 natural amino acids within the unspecified XXX triplet. The vast size of this chemical space is both a blessing – the large palette of molecular chemistries provides enormous versatility in materials properties and the opportunity to tailor structure and function – and a curse – it is a challenge to identify promising candidates within this enormous space. Identifying the candidates capable of self-assembling into well-ordered optoelectronic nanoaggregates and divining the design precepts dictating the mechanism is a key goal in realizing these peptides as novel biocompatible optoelectronic materials.

Edisonian traversal of the large chemical space of DXXX-OPV3-XXXD molecules by trial-and-improvement experimentation is essentially intractable due to the high time and labor costs associated with peptide synthesis and testing. To date, no more than 13 members of the family have been experimentally synthesized and tested. Molecular simulation offers an alternative means to perform high-throughput virtual screening of chemical space to identify the most promising candidates for experimental testing. Since assembly proceeds on length scales of tens of nanometers and microsecond time scales, this has motivated the development of coarse-grained models explicitly parameterized against all-atom molecular simulations (Fig. 1d).
These models integrate out the electronic and atomistic degrees of freedom by lumping together small numbers of atoms into beads in order to furnish a molecular model that offers a judicious compromise between chemical realism and the computational efficiency required to directly simulate peptide assembly.

Figure 1: The DXXX-OPV3-XXXD system. (a) Chemical structure of the prototypical DXXX-OPV3-XXXD peptide monomer. The oligophenylenevinylene π core (OPV3) is flanked by oligopeptide wings (DXXX) that are mirror symmetric such that the identity of the amino acids is inverted and the molecule possesses two C-termini. The X residues are selected from the 20 natural amino acids such that the family comprises 20³ = 8,000 distinct molecules. (b) Molecular simulation snapshot of a self-assembled pseudo-1D nanoaggregate formed by the spontaneous association of DVAA-OPV3-VAAD peptide monomers into a linear stack. Good stacking between the π-cores (colored blue) favors π orbital overlap, electronic delocalization along the backbone of the nanoaggregate, and the emergence of optical and electronic functionality. (c) Experimental transmission electron microscopy (TEM) image of self-assembled fibrils formed by DFFG-OPV3-GFFD peptides in an acidic environment. Reprinted with permission from Ref. . Copyright (2014) American Chemical Society. (d) Illustration of the mapping from the DGAG-OPV3-GAGD all-atom structure to the coarse-grained representation at which the simulations in this work are conducted. The coarse-grained beads corresponding to groupings of neighboring atoms are labeled according to the Martini model employed in this work.

Exhaustive simulation of all 8,000 candidates within the DXXX-OPV3-XXXD family nevertheless remains computationally expensive. As we shall demonstrate, however, doing so is unnecessary to parameterize a reliable surrogate model of peptide function and to identify and validate the most promising candidates within the family.

Chemical intuition is extremely valuable in guiding the computational search through chemical space, but it can perform poorly in the limits of data paucity, where there are too few examples to infer patterns, and data abundance, where there are too many examples to parse effectively.
Further, inherent preconceptions and biases may push the search away from potentially profitable regions of chemical space and overlook patterns in the high-dimensional data that may reveal important determinants of molecular performance. Active learning (a.k.a. sequential learning, optimal experimental design), and more specifically Bayesian optimization, presents a systematic approach to guide traversal of chemical space by using information on all measurements conducted to date to inform the “next-best” measurement to conduct. In this manner, active learning predicts an optimal sequence in which to consider the molecular candidates in order to identify the optimal ones with minimal data collection effort. For this reason, active learning and allied approaches have been rapidly gaining traction in the materials discovery, engineering, and design communities, with these approaches being deployed, for example, in the experimental discovery of novel shape memory alloys, piezoelectrics, and high glass transition temperature polymers, and in the computational discovery of drugs and of magnetocaloric, superconducting, and thermoelectric materials.

Our primary goal is to efficiently identify members of the DXXX-OPV3-XXXD family that exhibit self-assembly into desired pseudo-1D nanoaggregates with good overlap between the π-conjugated cores and are thus most promising in displaying emergent optical and electronic functionality. We adopt a coarse-grained bead-level molecular simulation model as the engine for our high-throughput virtual screen and couple this with a deep learning-enabled active learning protocol to guide optimal traversal of chemical space. We identify and computationally validate the top performing constituents of the 8,000-member DXXX-OPV3-XXXD family after simulating only 2.3% of all possible molecules. This represents a massive saving over exhaustive sampling enabled by active learning.
The absence of any introduced human bias within the active learning protocol also proved to be valuable in identifying high-performing candidates incorporating methionine residues that were not previously considered. A post hoc analysis of the observed assembly pathways provides supporting mechanistic understanding of the self-assembly behavior and exposes practical precepts for molecular design. The rank-ordered list of DXXX-OPV3-XXXD molecules produced by our computational analysis provides a useful filtration of the design space, with the top-performing candidates offering a massively reduced candidate space for experimental synthesis and testing.

Methods

2.1 Coarse-grained molecular simulations

The DXXX-OPV3-XXXD peptides were modeled using a previously-developed coarse-grained potential based on the Martini potential.
Martini is a popular coarse-grained potential that lumps approximately four heavy atoms into each coarse-grained bead, has demonstrated great successes in modeling peptides, proteins, lipids, and carbohydrates, and offers a good compromise between chemical specificity and the computational efficiency necessary to probe the formation of large peptide aggregates. The potential was initially developed for DFAG-OPV3-GAFD by refitting the native Martini parameters for bonded interactions against all-atom simulation data. This bottom-up reparameterization of the bonded interactions greatly improved agreement between the coarse-grained and all-atom distribution functions, potentials of mean force (PMF) for monomer stretching and dimerization, and time-averaged contact maps. We generalize this model to the complete DXXX-OPV3-XXXD family by maintaining the same parameterization of the bonds, angles, and backbone dihedrals within the OPV3 core and employing default Martini parameters for the amino acid side chains and all non-bonded interactions.
An illustration of the all-atom to coarse-grained bead-level mapping for DGAG-OPV3-GAGD is provided in Fig. 1d. Calculation and comparison of the translational diffusion constants for the all-atom and coarse-grained models of DFAG-OPV3-GAFD showed these to be in agreement within error bars, indicating no significant discrepancy in the (translational) dynamical time scales between the two models and that no time scale corrections to the coarse-grained calculations are required.

Coarse-grained molecular dynamics simulations of peptide assembly were conducted using the Gromacs 2018.6 simulation suite. Initial system configurations for each DXXX-OPV3-XXXD considered were generated by randomly inserting 96 peptides into a 16.2 × 16.2 × 16.2 nm³ cubic simulation box with 3D periodic boundary conditions, corresponding to a concentration of approximately 35 mM. The amino acid residues are prepared in protonation states corresponding to pH 1 to mimic pH-triggered experimental assembly under acidic conditions. The coarse-grained peptides were then solvated to a density of 1.0 g/cm³ of water using the Martini non-polarizable water model. Steepest descent energy minimization was performed to eliminate high-energy overlaps by removing forces greater than 1,000 kJ/mol·nm. Initial particle velocities were assigned from a Maxwell-Boltzmann distribution at 298 K. All simulations were conducted in the NPT ensemble at 298 K and 1 bar using a velocity-rescaling thermostat and Parrinello-Rahman barostat. Equations of motion were numerically integrated using the leap-frog algorithm with a 5 fs time step and bond lengths fixed using the LINCS algorithm. Lennard-Jones interactions were smoothly shifted to zero at 1.1 nm and reaction-field electrostatics were employed using a relative electrostatic screening constant of 15 appropriate for the non-polarizable water model. An initial 100 ps equilibration run was conducted, after which time the temperature, pressure, density, and energy all stabilized.
This was followed by a 3 µs production run, after which time the structural evolution of the system as measured by graphical analysis of the self-assembled aggregate (see Section 2.2.1) was stationary in time. Simulation snapshots were harvested for analysis every 50 ps over the course of the production run. Calculations were predominantly conducted on single NVIDIA GeForce RTX 2080 Ti cards and achieved execution speeds of ∼ µs/day.

2.2 Active learning peptide discovery

An active learning protocol is employed to direct a principled traversal of the DXXX-OPV3-XXXD candidate space and minimize the number of coarse-grained simulations required to discover the highest-performing candidates.
The fundamental challenge is that evaluating the quality of each peptide by direct simulation is expensive, so we wish to identify the best peptide candidates in the fewest number of simulations. The procedure we employ is in large part inspired by and adapted from a pioneering deep representational active learning approach for molecular drug discovery developed by Gomez-Bombarelli et al. Our approach comprises four main steps and is illustrated schematically as an iterative active learning cycle in Fig. 2. The coarse-grained molecular simulation engine representing our measurement function within the protocol is described in Section 2.1, and we define our fitness function in Section 2.2.1. Appreciating that some of the more technical machine learning concepts may be foreign to some readers in the molecular modeling community, we expose these steps in the protocol in some detail along with their specific adaptations to our molecular system, but those readers familiar with variational autoencoders, Gaussian process regression, and Bayesian optimization may feel free to skim over Sections 2.2.2-2.2.6. All codes are developed in Python 3 making use of the Scikit-learn, NumPy, Keras, and ORCA libraries. Jupyter notebooks implementing our methods are hosted on GitHub (https://github.com/KirillShmilovich/ActiveLearningCG).

Figure 2: Active learning cycle for the data-driven discovery of optimally self-assembling DXXX-OPV3-XXXD peptides. The cycle contains four components. (1) Coarse-grained molecular simulations are performed on selected DXXX-OPV3-XXXD candidates and the quality of the self-assembled aggregates formed by molecule i is measured according to a scalar fitness f_i. (2) The DXXX-OPV3-XXXD family is projected from the high-dimensional chemical space of molecular structures into a low-dimensional latent space embedding E: DXXX-OPV3-XXXD_i → z_i ∈ R^d using a variational autoencoder (VAE). The dimensionality of the latent space is optimized during each cycle. (3) A Gaussian process regression (GPR) model is constructed over the VAE latent space linking the latent space coordinates of each DXXX-OPV3-XXXD family member to the scalar fitness function measuring the quality of their self-assembled aggregates f: f̂_i = f(z_i) = (f ∘ E)(DXXX-OPV3-XXXD_i). The GPR mapping f is retrained each cycle over all DXXX-OPV3-XXXD candidates that have been simulated to date and for which measures of the fitness function f_i, i ∈ sampled are available, and is then used to predict the fitness of unsimulated candidates f̂_j, j ∈ unsampled. (4) The predicted means and uncertainties for f̂_j, j ∈ unsampled furnished by the GPR surrogate model are combined within an active learning acquisition function to identify the “next-best” candidates for which to perform coarse-grained molecular simulations to drive sampling towards the most promising candidates. The loop is cycled until the GPR surrogate model no longer changes with additional data collection and can then be used to reliably identify the top candidates for computational validation.

2.2.1 Fitness function

To perform active learning discovery in our predefined chemical space we define a scalar-valued fitness function f: f_i = f(DXXX-OPV3-XXXD_i) that assigns a quality to each peptide in terms of its capacity to self-assemble into pseudo-1D nanoaggregates. Linear aggregates with good overlap between the π-conjugated cores are most promising in displaying emergent optical and electronic functionality and are therefore anticipated to possess the most desirable materials properties. We have previously employed DFT calculations to make direct predictions of optoelectronic properties, but the high computational cost of these calculations limits them to aggregates of small numbers of peptides (dimers and trimers) and requires omission of the flanking amino acid residues and solvent.
As such, these calculations are poorly suited to high-throughput virtual screening for large-scale aggregation behavior. Consequently, we define and optimize a structural measure of assembly quality in our coarse-grained molecular simulations as a proxy for optical and electronic functionality. This simplification massively expedites sampling in the full chemical space and provides a means to coarsely screen chemical space and focus a subsequent experimental search on the most promising candidates. Alternatively, this computational screen can be viewed as a preliminary filtration within the coarsest level of a nested hierarchy of increasingly expensive all-atom and/or electronic structure calculations.

In order to specify f we define a geometric criterion by which a pair of peptides are considered to form part of the same pseudo-1D nanoaggregate. To do so, we adopt a distance metric that we have previously employed to define clustering in DFAG-OPV3-GAFD assembly and asphaltene aggregation. This so-called “optical distance” metric is defined as the minimum center-of-mass distance between aromatic cores in molecules a and b,

d_ab^optical = min_{i ∈ core(a), j ∈ core(b)} r_ij,   (1)

where r_ij is the intermolecular center-of-mass distance between the aromatic rings i and j within the OPV3 cores, and the minimization proceeds over the three aromatic rings i ∈ core(a) in molecule a and the three aromatic rings j ∈ core(b) in molecule b. Pairs of molecules a and b which satisfy d_ab^optical < r_cut = 0. nm are considered to reside within the same cluster. The cutoff r_cut was tuned to the mean of the distribution of d_ab^optical collected over DFAG-OPV3-GAFD peptide dimers with good in-register stacking of the OPV3 cores. In contrast with other choices of peptide clustering metrics based, for example, on the overall center of mass or on the proximity of the peptide wings, the optical metric assures close intermolecular proximity of at least one pair of OPV3 aromatic rings in a pair of associated peptides. This close association promotes π electron overlap, electron delocalization, and the emergence of optoelectronic function, and it is for this reason that this metric is termed the optical distance metric.

Given this definition, a natural choice for the fitness f_i of molecule i is the number of such optical contacts in a self-assembled aggregate, since maximizing this value will promote electronic delocalization and the emergence of optoelectronic functionality. We evaluate the fitness function by representing the molecular system as a dynamically-evolving interaction graph in which the peptides compose the vertices V = {v_1, v_2, ..., v_N} and edges are assigned between pairs of vertices v_a and v_b if d_ab^optical < r_cut. An illustration of the evolution of the interaction graph over the course of a 3 µs simulation of DDAI-OPV3-IADD assembly is presented in Fig. 3. The number of vertices |V| = 96 is fixed by the number of peptides in the system. Maximization of the number of edges |E(t)| at time t is therefore equivalent to maximizing the mean degree of each vertex in the graph κ(t) = 2|E(t)|/|V|. As such, we adopt as our fitness function,

f_i = κ̄(t; DXXX-OPV3-XXXD_i) = 2 |E(t; DXXX-OPV3-XXXD_i)| / |V|,   (2)

where the time average denoted by the overbar is performed over the terminal 50 ns of the 3 µs production run.
Standard errors in the mean are estimated by block averaging the terminal 50 ns in five contiguous 10 ns blocks.

A potential criticism of the fitness function is that κ achieves a maximum for all-to-all connectivity of the graph, and its maximization would therefore appear to not necessarily favor pseudo-1D linear stacks. Mathematically this is true, but there are strong physical limitations on the maximum attainable value of κ since the excluded volume of the π cores allows them to form optical associations with only a limited number of partners. The largest value observed in all of our calculations is κ = 6.07 (cf. Table 1), and visual inspection of the terminal aggregates confirms that κ is positively correlated with the formation of elongated pseudo-1D nanoaggregates similar to those illustrated in the t = 3,000 ns panel of Fig. 3.
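The optical distance of Eqn. 1, the mean-degree fitness of Eqn. 2, and the block-averaged standard error can be sketched as follows. This is a minimal NumPy-only illustration rather than the paper's implementation: ring center-of-mass coordinates are assumed to be pre-extracted, periodic boundary conditions are ignored, and the cutoff is left as a parameter:

```python
import numpy as np

def optical_distance(cores_a, cores_b):
    """Eqn. 1: minimum center-of-mass distance between the three OPV3
    aromatic rings of molecules a and b. Inputs are (3, 3) arrays of
    ring center-of-mass xyz coordinates in nm (PBC ignored here)."""
    diffs = cores_a[:, None, :] - cores_b[None, :, :]   # (3, 3, 3)
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()

def mean_degree(core_coords, r_cut):
    """Eqn. 2 at a single snapshot: kappa = 2|E|/|V| for the interaction
    graph whose edges join molecule pairs with optical distance < r_cut.
    core_coords is an (N, 3, 3) array: N molecules x 3 rings x xyz."""
    n = len(core_coords)
    n_edges = sum(
        1
        for a in range(n)
        for b in range(a + 1, n)
        if optical_distance(core_coords[a], core_coords[b]) < r_cut
    )
    return 2.0 * n_edges / n

def block_sem(kappa_series, n_blocks=5):
    """Standard error of the mean kappa estimated by averaging over
    contiguous blocks of the terminal portion of the trajectory."""
    block_means = np.array(
        [b.mean() for b in np.array_split(np.asarray(kappa_series), n_blocks)]
    )
    return block_means.std(ddof=1) / np.sqrt(n_blocks)

# A toy linear stack of four molecules spaced 0.4 nm apart: only nearest
# neighbors fall within a 0.5 nm cutoff, giving 3 edges and kappa = 1.5.
stack = np.array([[[0.4 * i, 0.0, 0.0]] * 3 for i in range(4)])
print(mean_degree(stack, r_cut=0.5))   # 1.5
```

In the paper the interaction graphs are handled with NetworkX; the brute-force O(N²) pair loop shown here is adequate for the 96-peptide systems considered.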
Figure 3: Dynamical evolution of the self-assembled structures of DDAI-OPV3-IADD over the course of a 3 µs coarse-grained molecular simulation. Snapshots of the molecular simulation show molecules in which the OPV3 π cores are colored blue and the peptide wings faded grey; water is removed for clarity. The interaction graph corresponding to each snapshot is shown directly below each image. The vertices V corresponding to each peptide are colored red, and edges E between peptide pairs defined by d_ab^optical < r_cut (Eqn. 1) are colored grey. The average degree κ = 2|E|/|V| is reported at the bottom of each panel. At t = 0 ns, the 96 randomly placed peptides form essentially a monomeric dispersion, with the exception of five dimer pairs, and the system possesses a correspondingly low κ = 0.13. As the simulation progresses, the peptides self-assemble under the influence of hydrogen bonding, π-π stacking, and hydrophobic interactions into small (t = 50 ns, κ = 3.69) and then larger (t = 500 ns, κ = 5.67; t = 3,000 ns, κ = 6.08) aggregates with a commensurate increase in the mean degree κ. In this figure, and throughout the paper, molecular renderings are generated using VMD and interaction graphs produced using NetworkX.

2.2.2 Latent space embedding of chemical space

In Step 3 (Section 2.2.3) we describe our training of a Gaussian process regression (GPR) surrogate model to predict the fitness of candidate molecules that have not been simulated based on those that have. The predictions of this model are then used to perform active learning. We experimented with constructing the GPR directly over the chemical space of DXXX-OPV3-XXXD molecules by measuring pairwise distances between the XXX tripeptides using BLOSUM substitution matrices, but, following Gomez-Bombarelli et al., we found this approach to yield inferior surrogate models to those constructed over data-driven low-dimensional embeddings of the molecules generated using a variational autoencoder (VAE).
The low-dimensional VAE latent space also conveys advantages in that low-dimensional GPRs tend to be more robust, chemically similar molecules tend to be embedded proximately in the latent space, providing interpretability of the chemical space through dimensionality reduction, and the continuous and differentiable nature of the latent space makes it well-suited to global optimization.
We represent the DXXX-OPV3-XXXD molecules to the VAE only through the identity of the XXX tripeptide, since this is the only differentiating feature between molecules. We base this representation on the coarse-grained Martini model used to perform our molecular simulations. This representation comprises two components for each molecule i: (i) an adjacency matrix A_i, which captures the connectivity of beads within the tripeptide, and (ii) a one-hot encoded composition vector of bead types T_i specifying the identity of the Martini beads (Fig. 4). Since peptide sequences may contain varying numbers of coarse-grained beads, we standardize the size of the adjacency matrix A_i ∈ R^{15×15} to be sufficiently large to accommodate the largest tripeptide (Trp-Trp-Trp) and pad the array with zeroes for smaller molecules. A one-hot composition vector T_i of fixed length is sufficient to accommodate all tripeptide compositions considered. For each molecule i, the tuple (A_i, T_i) defines the input provided to the VAE.

Figure 4: Schematic of the representation of each XXX tripeptide to the VAE. The Martini representation of each tripeptide i, in this example Trp-Val-Tyr (WVY), is converted into an adjacency matrix A_i ∈ R^{15×15} specifying the connectivity of beads within the tripeptide and a one-hot composition vector T_i specifying the identity of the beads. The tuple (A_i, T_i) defines the input provided to the VAE. A pre-defined sequential numbering is employed for the beads in each amino acid. The adjacency matrix is padded with rows and columns of zeros to represent tripeptides containing fewer than the maximum number of 15 beads. The colored blocks in the adjacency matrix and composition vector correspond to the colors of the amino acids in the Martini molecule.

The architecture of the VAE is illustrated in Fig. 5. Given the two-part input (A_i, T_i) for molecule i, the encoder block processes this through two parallel networks to perform feature extraction from each input. The A_i resemble a small image, which motivates using a short series of convolutional layers to treat these inputs, whereas the binary T_i vectors are passed through a series of fully-connected dense layers. The features extracted by the encoder through the two parallel encoder networks are subsequently concatenated and used to generate the mean µ_i and standard deviation σ_i of a Gaussian distributed latent space embedding z_i ∼ N(µ_i, σ_i) ∈ R^d. The dimensionality of the latent space is treated as a hyperparameter that is optimized during each cycle of active learning and is found to lie in the range d ∈ [4, ]. The decoder then attempts to reconstruct (A_i, T_i) from the latent encoding z_i, again using two parallel networks. The part of the decoder predicting the reconstruction T̂_i is identical to the architecture of the encoder, whereas the part predicting the reconstruction Â_i is simply another series of fully-connected layers that is reshaped to match the size of the input. The overall action of the VAE is the functional composition of the encoder E: z_i = E(A_i, T_i) and decoder D: (Â_i, T̂_i) = D(z_i) blocks such that the total effect of the network is (Â_i, T̂_i) = (D ∘ E)(A_i, T_i). The VAE is trained by minimizing the VAE loss L_VAE composed of a reconstruction term
Figure 5: Architecture of the variational autoencoder (VAE) used to generate the DXXX-OPV3-XXXD latent space embedding. The VAE accepts as inputs adjacency matrix and composition vector tuples (A_i, T_i) and employs two parallel encoders to perform feature extraction and learn the mean µ_i and standard deviation σ_i of a Gaussian distributed latent space embedding z_i ∼ N(µ_i, σ_i) ∈ R^d. The decoder generates approximate reconstructions (Â_i, T̂_i) of the inputs from the latent space representation. The network is trained by minimizing a loss function balancing reconstruction accuracy and a regularization term constraining the latent space to follow a multidimensional Gaussian distribution (Eqn. 3). The dimensionality d of the latent space is treated as a hyperparameter that is optimized during each cycle of active learning.

L_Rec and a Kullback-Leibler (KL) divergence term L_KL,

L_VAE = L_Rec + L_KL,   (3)

L_Rec = E_i[BCE(Â_i, A_i) + BCE(T̂_i, T_i)]
      ≈ − Σ_{i ∈ mini-batch} [ Σ_j (A_{i,j} log Â_{i,j} + (1 − A_{i,j}) log(1 − Â_{i,j}))
      + Σ_j (T_{i,j} log T̂_{i,j} + (1 − T_{i,j}) log(1 − T̂_{i,j})) ],   (4)

L_KL = D_KL(z = E(A, T) ∥ N(0, I))
     ≈ Σ_{i ∈ mini-batch} [ −(1/2) Σ_{j=1}^{d} (1 + log(σ_{i,j}²) − σ_{i,j}² − µ_{i,j}²) ],   (5)

where BCE(x, y) is the binary cross entropy between the reconstruction x and ground truth y, D_KL(Q ∥ P) is the Kullback-Leibler divergence from P to Q, and I is the d-dimensional identity matrix. The reconstruction term L_Rec encourages the VAE to reconstruct the inputs through the low-dimensional latent space information bottleneck.
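For concreteness, the two loss terms of Eqns. 3-5 can be sketched in plain NumPy for a single example; the clipping inside the cross entropy is a numerical-safety addition of this sketch, not part of the equations:

```python
import numpy as np

def bce(x_hat, x, eps=1e-7):
    """Binary cross entropy summed over elements (the terms of Eqn. 4).
    Predictions are clipped away from {0, 1} for numerical safety."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

def kl_to_standard_normal(mu, sigma):
    """Eqn. 5 for one example: KL divergence from the diagonal Gaussian
    N(mu, sigma^2) to the standard normal prior N(0, I)."""
    return -0.5 * np.sum(1.0 + np.log(sigma ** 2) - sigma ** 2 - mu ** 2)

def vae_loss(A_hat, A, T_hat, T, mu, sigma):
    """Eqn. 3 for one example: reconstruction term plus KL regularizer."""
    return bce(A_hat, A) + bce(T_hat, T) + kl_to_standard_normal(mu, sigma)

# A perfectly reconstructed input whose latent code matches the prior
# incurs a (numerically) near-zero loss.
A = np.eye(4)
T = np.array([1.0, 0.0, 0.0])
mu, sigma = np.zeros(2), np.ones(2)
print(vae_loss(A, A, T, T, mu, sigma))   # ~0 (of order 1e-6 from clipping)
```

In the actual training, Keras evaluates these terms over mini-batches and backpropagates through the reparameterized sampling of z_i; the sketch above only illustrates the arithmetic of the loss.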
In contrast to a vanilla autoencoder, which only aims to minimize L_Rec, the KL divergence term L_KL is an effective regularizer which imposes a multivariate Gaussian prior on the latent space and prevents the VAE from essentially "memorizing" the data set and learning a trivial identity mapping through a disconnected latent space. Training is performed by passing tuples (A_i, T_i) through the VAE in mini-batches of size 32 and updating the network parameters with mini-batch gradient descent using the Adam optimizer. The VAE loss L_VAE is typically observed to plateau within 4,000 epochs. We note that the regularization introduced by the KL divergence term L_KL serves to prevent over-fitting and enables us to train over the full set of molecules to be embedded by the VAE.

We present in Fig. 6 an example of a d = 3 VAE latent space embedding of the DXXX-OPV3-XXXD family. We color each member of the family by the number of beads in the XXX tripeptide to show that the first dimension of the latent space, z₁, is approximately correlated with molecular size, possessing a Pearson correlation coefficient ρ(z₁, size) = 0.858 (p-value < 1 × 10⁻…). The other two dimensions in this example are also some functions of molecular composition and topology, but prove more challenging to correlate with physically interpretable observables. Physical interpretability of the latent space dimensions is a pleasing but not required property of the embedding. The primary purpose of the VAE embedding is to provide a smooth, low-dimensional molecular representation for the GPR surrogate model. We note that the latent space embedding could be shaped and made more interpretable by simultaneous training of a supervised regression model, as suggested by Gomez-Bombarelli et al.

Figure 6: Example of a d = 3 VAE latent space embedding of the DXXX-OPV3-XXXD family. The embedded molecules are colored according to the number of beads in the XXX tripeptide and selected molecules are visualized.
The first dimension of the latent space, z₁, is correlated with molecular size, ρ(z₁, size) = 0.858 (p-value < 1 × 10⁻…), providing a visual illustration that similar molecules are embedded close together in the VAE latent space.

Fitness measurements f_i, i ∈ sampled, are available for those molecules DXXX-OPV3-XXXD_i for which we have performed coarse-grained molecular simulation. Given these data we wish to predict the fitness of all remaining candidates f̂_j, j ∈ unsampled. This constitutes a supervised regression task where we wish to train a surrogate model f over a small number of training examples to predict the fitness of out-of-training examples as a function of their location in the VAE latent space: f: f̂_i = f(z_i) = (f ∘ E)(A_i, T_i). In this manner, the regression model "short circuits" expensive direct simulation prediction of fitness with a cheap surrogate model, and eliminates the need to perform exhaustive calculations over all molecules in the family. The quality of the model predictions depends on the number and chemical similarity of the training data: the model is expected to perform better with larger training sets and make more accurate predictions for out-of-training examples that are chemically similar to examples in the training set. As such, we expect the model to improve with additional cycles around the active learning loop. For the purposes of active learning (Section 2.2.4), it is also vital to perform uncertainty quantification on the model predictions so that we can direct sampling both towards the most promising candidates predicted by the model (exploitation) and towards undersampled areas where the model possesses the highest uncertainties (exploration). For this reason, we select Gaussian process regression (GPR) to construct our surrogate model f: f̂_i = f(z_i) as a flexible, non-parametric, Bayesian regression approach that comes with built-in uncertainty estimates.
The fundamental principle of GPR is to employ a Gaussian process to specify a Bayesian prior distribution over regression functions fitting the data, and then to compute the posterior distribution over those functions that are in agreement with the training data. The Gaussian process is fully specified by its mean function, which is typically defined to be zero, and its covariance function, for which we choose the popular squared exponential kernel,

k(z, z′) = exp(−γ ||z − z′||²), (6)

where z and z′ denote latent space vectors and the bandwidth of the kernel γ is a hyperparameter defining the characteristic length scale over which latent space vectors "see" one another. Under these choices, the predicted fitness f* = f(z*) for a new point z* outside of the training data is a Gaussian distributed random variable with,

f* ∼ N(μ_f*, σ²_f*), (7)
μ_f* = K(z*, Z) [K(Z, Z) + diag(σ²_f)]⁻¹ f,
σ²_f* = K(z*, z*) − K(z*, Z) [K(Z, Z) + diag(σ²_f)]⁻¹ K(z*, Z)ᵀ,

where f = [f₁, f₂, …, f_n]ᵀ is the vector of (noisy) measurements of fitness for the n training points Z = {z₁, z₂, …, z_n} computed in our coarse-grained molecular simulations, σ²_f = [σ²_f₁, σ²_f₂, …, σ²_fn]ᵀ are the associated variances of assumed i.i.d. Gaussian noise estimated by block averaging (Section 2.2.1), and the K matrices hold the covariances within and between the training data Z and new point z*,

K(Z, Z) = [ k(z₁, z₁) k(z₁, z₂) ⋯ k(z₁, z_n) ; ⋮ ⋱ ⋮ ; k(z_n, z₁) k(z_n, z₂) ⋯ k(z_n, z_n) ], (8)
K(z*, Z) = [ k(z*, z₁), k(z*, z₂), ⋯, k(z*, z_n) ], (9)
K(z*, z*) = k(z*, z*). (10)

The diag(σ²_f) terms account for the uncertainty inherent in our measurements of f through an assumed Gaussian noise model. These terms can also be conceived as a Tikhonov (a.k.a.
ridge or nugget) regularization of the K(Z, Z) matrix that stabilizes its matrix inverse and is particularly useful when this matrix is ill-conditioned due to the close proximity of two or more training points in Z. A corollary of this regularization is that the GPR posterior is not a perfect interpolator of the training data due to the presence of measurement noise, and we should anticipate residual discrepancies on the order of σ_f between the GPR predictions and our measurements of f. The predictive accuracy and robustness of the GPR is enhanced by the smooth, continuous, and low-dimensional nature of the VAE latent space, which embeds chemically similar points nearby one another and therefore promotes transfer of information to new out-of-training points based on chemically proximate training examples. The GPR prior and posterior are updated during each cycle of the active learning loop as additional training data are collected.

The final step in the cycle is to use the predictions of the surrogate GPR model to identify the next peptide candidates to simulate. We frame this active learning problem as a Bayesian optimization, where we have an expensive, non-differentiable, black-box function with noisy evaluations – the fitness of each molecule evaluated by coarse-grained molecular simulation – that we wish to optimize in the minimum number of evaluations. Bayesian optimization defines an acquisition function u that wraps around the current surrogate model to identify peptides with a high chance of being better than the current leader in the training data. We can represent optimization of the acquisition function as,

z† = argmax_z u(z | Z = {(z₁, f₁), (z₂, f₂), …, (z_n, f_n)}), (11)

where z† is the VAE latent space coordinates of the DXXX-OPV3-XXXD molecule that maximizes the acquisition function u, and the maximization is conditioned on the n samples {(z₁, f₁), (z₂, f₂), …, (z_n, f_n)} collected to date.
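Returning to the GPR posterior of Eqns. 6-10, a minimal, dependency-free sketch of the prediction step is given below, assuming the squared-exponential kernel and the diagonal noise regularization discussed above; the function names are illustrative, not from the original implementation:

```python
import math

def sq_exp_kernel(z1, z2, gamma=1.0):
    # Squared-exponential kernel k(z, z') = exp(-gamma * ||z - z'||^2)  (Eqn. 6)
    d2 = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-gamma * d2)

def solve(A, b):
    # Gaussian elimination with partial pivoting for the linear solve
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gpr_predict(Z, f, noise_var, z_star, gamma=1.0):
    # Posterior mean and variance (Eqn. 7) with diag(sigma_f^2) regularization
    n = len(Z)
    K = [[sq_exp_kernel(Z[i], Z[j], gamma) + (noise_var[i] if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [sq_exp_kernel(z_star, Z[i], gamma) for i in range(n)]
    alpha = solve(K, f)                      # [K + diag(sigma^2)]^{-1} f
    mu = sum(k_star[i] * alpha[i] for i in range(n))
    v = solve(K, k_star)                     # [K + diag(sigma^2)]^{-1} k_*
    var = sq_exp_kernel(z_star, z_star, gamma) - sum(k_star[i] * v[i] for i in range(n))
    return mu, var
```

Near a training point the posterior mean approaches the measured fitness and the variance collapses towards the noise level; far from all training points the prediction reverts to the zero-mean prior with unit variance.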
The surrogate model f enters the maximization through the choice of acquisition function, for which many choices are available. We employ the popular expected improvement (EI) acquisition function that provides a balanced trade-off between exploitation – selection of points where the surrogate model posterior mean μ_f(z) is large – and exploration – selection of points where the surrogate model posterior variance σ²_f(z) is large. Following Lizotte, the EI is defined as,

u(z | Z) = EI(z | Z) = (μ_f(z) − f(z⁺) − ξ) Φ(Z̃) + σ_f(z) φ(Z̃) if σ_f(z) > 0, and 0 if σ_f(z) = 0, (12)

Z̃ = (μ_f(z) − f(z⁺) − ξ) / σ_f(z) if σ_f(z) > 0, and 0 if σ_f(z) = 0, (13)

where f(z⁺), z⁺ ∈ Z = {z₁, z₂, …, z_n}, is the maximum fitness value among all n sampled candidates to date, Φ and φ are the cumulative distribution function and probability density function of the standard normal distribution, and the hyperparameter ξ controls the exploration-exploitation trade-off. The first term in Eqn. 12 promotes exploitation and the second promotes exploration: when ξ is small the EI will favor exploitation and select points with high posterior mean, while if ξ is large exploration is performed, selecting points with large posterior uncertainty.

Active learning typically proceeds by selecting a fixed value ξ = 0.01 of the exploration-exploitation trade-off, identifying the candidate that maximizes the EI, and then performing an expensive function evaluation (here a coarse-grained molecular simulation) for that candidate. We employ a slightly modified version of this approach that effectively integrates over ξ and performs active learning in batches, which has the advantages of (i) eliminating the sensitivity of the selection to the hyperparameter ξ, (ii) spreading the exploit-explore trade-off, and (iii) making more efficient use of parallel compute resources to conduct multiple simulations in parallel in the same wall clock time.
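The EI acquisition of Eqns. 12-13 can be sketched with only the standard library, building Φ from `math.erf`; the function names are illustrative:

```python
import math

def norm_pdf(x):
    # Standard normal probability density function, phi(x)
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    # Standard normal cumulative distribution function, Phi(x)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # Expected improvement (Eqns. 12-13): the first term exploits the
    # posterior mean, the second explores via the posterior std. deviation.
    if sigma == 0.0:
        return 0.0
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm_cdf(z) + sigma * norm_pdf(z)
```

The EI grows both with the posterior mean (exploitation) and with the posterior uncertainty (exploration), which is the balance the batch-selection procedure below exploits by sweeping ξ.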
Specifically, we maximize the EI acquisition function over the range log ξ ∈ [−…, …] and select up to four candidates over this range as the "next-best" candidates to simulate. Molecules that have already been sampled in preceding rounds are excluded from the pool of available candidates at each round. Where more than four candidates emerge from the EI maximization, we randomly select four members of this set. An example of this selection procedure is presented in Fig. 7. Coarse-grained molecular simulations of these optimal candidates are then performed to commence another round of the active learning cycle.

Figure 7: Active learning candidate selection. The expected improvement (EI) acquisition function (Eqn. 12) is evaluated for all unsampled members of the DXXX-OPV3-XXXD family at values of the exploit-explore hyperparameter ξ over the range log ξ ∈ [−…, …]. The blue points in the graph indicate which DXXX candidate maximizes the EI at each value of ξ. In this illustrative example there are four candidates – DGGG, DEGA, DGLI, and DALW – that maximize the EI over the range of ξ considered. The vertical line shows the recommended value of ξ = 0.01 suggested in the literature, which would result in the selection of only DEGA as the next molecule to simulate. In our approach, we select all four of the molecules DGGG, DEGA, DGLI, and DALW that maximize EI over the entire range of ξ considered as the next best candidates to simulate in parallel in the next round of coarse-grained molecular simulations.

The dimensionality d of the VAE latent space embedding and bandwidth γ of the GPR kernel are tunable hyperparameters to be optimized during each cycle of the active learning loop. We perform simultaneous tuning of d and γ during each round by creating 50 embeddings of all 8,000 DXXX-OPV3-XXXD molecules into the VAE latent space z_i ∼ N(μ_i, σ_i) ∈ R^d, each employing a different realization of random numbers to sample from the latent space Gaussian for each point, for d ∈ [3, 10].
We then optimize γ for each embedding over the range γ ∈ [0.001, 100] using a line search followed by Nelder-Mead optimization to maximize the GPR accuracy under cross-validation. We employ leave-one-out cross-validation (LOO-CV) for the first five cycles of the active learning, and then 5-fold CV for subsequent rounds due to the high cost of LOO-CV for larger quantities of samples. The best performing VAE embedding and associated optimal d and γ are adopted for the remainder of the current active learning cycle.

We cycle around the active learning loop until the GPR surrogate model no longer improves with the collection of additional training data. A number of stopping criteria for active learning have been proposed, but in this work we monitor and define convergence using the stabilizing predictions (SP) method, which evaluates performance based on unlabeled data, and the performance difference (PD) method, which considers the labeled examples. The SP method examines the predictions of consecutive models at each iteration of the active learning procedure on a randomly selected set of 500 points, called the stop set, which is held constant throughout the active learning. We measure the difference in the regression predictions between subsequent rounds using the average Bhattacharyya distance D_B between the posteriors of consecutive GPR models over the stop set. Large values of D_B indicate the model is continuing to update the GPR posterior, whereas small values indicate that the surrogate model predictions have stabilized.

The PD method is used to evaluate model performance by 5-fold CV of the R² score over the accumulated labeled samples collected to date within each round of active learning. A plateau in the R² score indicates that additional observations result in only marginal improvements to the GPR fit. A caution in assessing convergence using labeled data is that these data may not be representative of the data as a whole.
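Assuming univariate Gaussian GPR posteriors at each stop-set point, the SP convergence monitor reduces to the closed-form Bhattacharyya distance between Gaussians; the function names here are illustrative:

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    # Bhattacharyya distance between two univariate Gaussians:
    # D_B = (1/8)(mu1 - mu2)^2 / vbar + (1/2) ln(vbar / (sigma1 * sigma2)),
    # with vbar = (var1 + var2) / 2
    vbar = 0.5 * (var1 + var2)
    return (0.125 * (mu1 - mu2) ** 2 / vbar
            + 0.5 * math.log(vbar / math.sqrt(var1 * var2)))

def stop_set_distance(post_a, post_b):
    # Average D_B over the stop set; post_* are lists of (mean, variance)
    # pairs from two consecutive GPR models
    return sum(bhattacharyya_gaussian(m1, v1, m2, v2)
               for (m1, v1), (m2, v2) in zip(post_a, post_b)) / len(post_a)
```

Identical posteriors give a distance of zero, so a plateau of the average distance near zero signals that further rounds are no longer changing the surrogate model.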
These concerns are mitigated in our application since our initial data set comprises a set of randomly selected peptides to initialize the active learning procedure and we collect up to four new data points each round across the exploit-explore spectrum to assure broad sampling of chemical space.

2.3 Nonlinear manifold learning of assembly pathways
We employ diffusion maps as a manifold learning approach to identify the low-dimensionalassembly pathways by which the various DXXX-OPV3-XXXD molecules self-assemble intothe terminal aggregates. We have previously described the application of diffusion maps toself-assembling systems in Refs.
In brief, we compute a distance metric d ( i, j ) betweeneach pair of interaction graphs i and j harvested from each frame of each molecular simulationtrajectory. A number of graph kernels at varying levels of sophistication and abstraction havebeen proposed to measure the similarity between pairs of graphs. We follow the approachof Reinhart et al. who employed graphlet decompositions as a diffusion map distance metricto analyze colloidal crystallization. This approach featurizes a graph by enumerating alltopologically unique subgraphs (“graphlets”) with associated node permutations (“orbits”)within the network up to a certain subgraph size (usually up to five vertices), and creatinga vector of orbit counts for each vertex in our graph.
The vector of orbit counts at each vertex is reweighted to account for over-counting of the smaller graphlets contained in the larger ones (i.e., counts of graphlets comprising two vertices are necessarily contained in counts of graphlets comprising three or more vertices), averaged over all vertices in the graph, and normalized to unit length. This vector represents a featurization of the graph that is permutationally invariant to vertex labeling, and the L2-norm between pairs of vectors defines the graph kernel d(i, j) used to evaluate pairwise distances between our graphical representations of the configurational state of the molecular system.

Diffusion maps then proceed by applying a Gaussian kernel to construct the convolved similarity matrix,

A_{i,j} = exp(−(d(i, j)^α)² / ε), (14)

where the kernel bandwidth ε controls the hop size of the random walk and can be automatically tuned based on the distribution of the A_{i,j}. The use of the hyperparameter α ∈ (0, 1] was proposed by Wang et al. within a density-adaptive extension of diffusion maps that greatly improves the performance of diffusion maps in applications to systems with large differences in the density of points in the high-dimensional space. For α = 1 we recover standard diffusion maps; for α → 0 the pairwise distances become increasingly similar and large fluctuations in the density of points in the high-dimensional space are smoothed out. Adopting the tuning procedure proposed in Ref., we set α = 0.15.

The A matrix is row normalized to create the right stochastic Markov transition matrix,

M = D⁻¹A, (15)

where D is a diagonal matrix of the row sums of A,

D_{i,i} = Σ_{j=1}^{N} A_{i,j}. (16)

The matrix element M^t_{i,j} = p_t(i, j) can be interpreted as the probability p_t(i, j) of hopping from point i to point j in t steps of the discrete random walk. Diagonalization of M produces an ordered set of eigenvector-eigenvalue pairs {(ψ₁ = 1, λ₁ = 1), (ψ₂, λ₂), (ψ₃, λ₃), …
} with λ₁ = 1 ≥ λ₂ ≥ λ₃ ≥ …. The first pair (ψ₁ = 1, λ₁ = 1) is trivial and associated with the stationary distribution of the random walk. The higher order eigenvectors are associated with a hierarchy of increasingly fast relaxation modes of the random walk. Dimensionality reduction is achieved by identifying a gap in the eigenvalue spectrum after λ_{k+1} to resolve a subspace of slowly relaxing dynamical modes {ψ₂, ψ₃, …, ψ_{k+1}}. The diffusion map embedding is the projection of the i-th interaction graph into the i-th component of the top k non-trivial eigenvectors,

i ↦ (ψ₂(i), ψ₃(i), …, ψ_{k+1}(i)). (17)

We implement this formalism using the memory and compute efficient pivot diffusion map approach that reduces the scaling in the number of points N from O(N²) to O(N × n), where n ≪ N is the number of so-called "pivot points" employed. This approach enables the application of diffusion maps to large data sets by performing on-the-fly definition of the n pivot points defining an approximate spanning tree over the high-dimensional data, which are used to support interpolative embeddings of the remaining points.

The complete DXXX-OPV3-XXXD family comprises 20³ = 8,000 members generated by all permutations of placing each of the 20 natural amino acids within the XXX tripeptide. Prior to conducting active learning, we filtered this ensemble to eliminate a subset of candidates containing amino acids known and expected to produce undesired assembly behaviors. Specifically, we reduced our search space to the 11³ = 1,331 candidates in the set defined by X ∈ {Ala, Gly, Glu, Ile, Leu, Met, Phe, Trp, Tyr, Val, Asp} to avoid charged and/or polar amino acids expected to interfere with low-pH triggered assembly and to focus on those residues that have expressed good assembly behavior in previous experimental work.
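Under the density-adaptive kernel reconstructed in Eqn. 14, the construction of the Markov matrix M (Eqns. 14-16) reduces to a few lines; the sketch below operates on a toy pairwise-distance matrix, and the function name is illustrative:

```python
import math

def diffusion_transition_matrix(dists, alpha=0.15, eps=1.0):
    # Build the convolved similarity matrix A (Eqn. 14) from pairwise
    # graph distances d(i, j), then row-normalize to M = D^{-1} A (Eqn. 15)
    n = len(dists)
    A = [[math.exp(-(dists[i][j] ** alpha) ** 2 / eps) for j in range(n)]
         for i in range(n)]
    return [[A[i][j] / sum(A[i]) for j in range(n)] for i in range(n)]
```

Each row of M sums to one, which is exactly the property that makes the constant vector the trivial right eigenvector with λ₁ = 1.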
We perform active learning over DXXX-OPV3-XXXD sequences following the four-part protocol – molecular simulation, VAE latent space embedding, GPR surrogate model construction, optimal selection of next candidates – described in Section 2.2 and illustrated in Fig. 2. We seeded the search by conducting coarse-grained molecular dynamics simulations of 90 randomly selected members of the family using the simulation protocol detailed in Section 2.1. This initial broad sampling over the candidate space provides the GPR surrogate model with diverse training data that enables it to identify more- and less-promising regions of the latent space prior to making any predictions. We term this initial round of active learning Round 0. We conduct 25 additional rounds of active learning (Rounds 1-25), selecting up to four additional molecules for simulation during each pass. This resulted in sampling a total of N = 186 molecules (2.3% of the 8,000-member complete family; 14.0% of the 1,331-member chemically restricted family) and a cumulative 558 μs of simulation time. The particular candidates selected and sampled in each round are listed in Table S1 in the Supplementary Information.

Sampling was terminated by tracking the performance difference (PD) and stabilizing predictions (SP) methods (Section 2.2.6). The PD method 5-fold cross-validation R² score commences at a reasonably high value of ∼
68% – likely due to the relatively large and diverse N = 90 initial candidates considered, and plateaus to a quite high value of ∼
78% by Round 18 (Fig. 8a). The SP method reveals a Bhattacharyya distance between successive GPR posteriors of D_B > 10 over the first 13 rounds, indicating that the additional training data incorporated into the GPR surrogate model are substantially altering its predictions. After Round 14, the Bhattacharyya distance plateaus close to zero (Fig. 8b).
Figure 8: Tracking of active learning stop criteria. (a) The performance difference (PD) method 5-fold cross-validation R² score stabilizes at R² ∼
78% by Round 18. (b) The stabilizing predictions (SP) method Bhattacharyya distance between successive GPR posteriors plateaus close to zero.

We present in Table 1 the top performing molecules among the 186 that were simulated within our active learning protocol. We report their fitness f_i = κ_i corresponding to the mean number of π-core–π-core contacts per molecule in the terminal self-assembled aggregates (Section 2.2.1), the round of active learning in which they were sampled, and whether they have been previously explored by experiment or simulation. Additional molecules that have been previously identified as high performing by experiment and simulation are also presented for comparison. The list of all 1,331 DXXX-OPV3-XXXD molecules in the family with fitness predictions and rankings assigned by the terminal GPR surrogate model is presented in Table S2. There is very good agreement between the numerical simulation results and the GPR model predictions over the training set of 186 molecules for which measurement data exist: the calculated and predicted values of f_i = κ_i possess a Pearson correlation coefficient of ρ_Pearson = 0.90 (p-value = 4 × 10⁻…) and the calculated and predicted rankings possess a Spearman correlation coefficient of ρ_Spearman = 0.86 (p-value = 1 × 10⁻…). The agreement is not perfect due to our incorporation of uncertainty estimates for our noisy fitness measurements into GPR training, such that the model predictions fall within the error bars of our simulations (Section 2.2.3).

Trends apparent in the active learning ranking of the tripeptides in terms of amino acid composition and sequence are coincident with aspects of existing understanding, but also suggest new unexplored amino acid sequences as good putative candidates.
The bulky aromatic residues F, W, and Y tend to disfavor good assembly behaviors, with the large size of these residues impeding good side chain packing and obstructing co-facial stacking of the cores (particularly in the core-adjacent X₃ position of DX₁X₂X₃-OPV3-X₃X₂X₁D), and their aromatic character disrupting the formation of linear aggregates with in-register stacking of the π-cores by introducing favorable aromatic stacking between the π-cores and peptide wings. These trends are expressed in the low ranking of molecules containing bulky aromatic residues (e.g., DFGG (65), DFAV (85), DFAG (93), DIAG (102), DFAA (111), DFAF (147)) compared to those with smaller hydrophobic side chains (e.g., DVAG (19), DAAG (33), DGAG (45)). The active learning protocol also identifies as highly ranked a number of previously unknown candidates enriched in smaller hydrophobic residues. Interestingly, a number of highly ranked candidates contain an M residue in the X₃ position (e.g., DIAM (3), DGGM (11)). Methionine-containing DXXX-OPV3-XXXD molecules have been completely unexplored due, in part, to the expectation that a thioether group would likely disfavor hydrophobic association. Our calculations predict these candidates to possess excellent assembly behaviors and suggest them as novel molecules for experimental investigation.

Table 1: Top 15 DXXX-OPV3-XXXD molecules identified by the active learning protocol. Additional molecules previously studied in simulation and experiment are also shown for comparison.

Rank (out of 186)   Molecule (DXXX)   κ           Discovery round   Previously known?
1                   DEAA              6.06 ± …    …                 …
…                   …                 …           …                 …
33                  DAAG              5.62 ± …    …                 Sim (Ref. …)
45                  DGAG              5.54 ± …    …                 Sim (Ref. …); Exp (Ref. …)
65                  DFGG              5.33 ± …    …                 Sim (Ref. …)
85                  DFAV              5.09 ± …    …                 Sim (Ref. …)
93                  DFAG              4.98 ± …    …                 Sim (Ref. …); Exp (Ref. …)
102                 DIAG              4.86 ± …    …                 Sim (Ref. …)
111                 DFAA              4.78 ± …    …                 Sim (Ref. …)
147                 DFAF              4.29 ± …    …                 Sim (Ref. …)
The active learning protocol considers only the terminal 50 ns of the 3,000 ns coarse-grained molecular dynamics trajectories to identify DXXX-OPV3-XXXD molecules that form the desired pseudo-1D linear aggregates. Having completed the active learning process, we subsequently analyze the ensemble of N = 186 molecular simulation trajectories to provide molecular-level understanding of the assembly pathways and mechanisms and furnish design precepts for the observed assembly behaviors as a function of tripeptide sequence.

We hypothesize that the molecular assembly trajectories proceed through configurational phase space over a low-dimensional manifold. We determine this low-dimensional manifold by performing diffusion map manifold learning over the trajectory ensemble. Each frame of each molecular simulation is represented as an interaction graph with vertices V and edges E defined using the optical distance metric (Section 2.2.1). We subsample each trajectory, keeping every 20th point, and then run diffusion maps on the composite data set of 558,000 graphs as detailed in Section 2.3. Diffusion maps then produce a nonlinear projection of this graph ensemble into a low-dimensional space in which graphs sharing a similar structure of edges are embedded close together, and dissimilar graphs are embedded far apart. (We emphasize that this low-dimensional embedding represents a nonlinear manifold residing within the configurational space of interaction graphs and is completely independent from the VAE latent space embedding of the chemical space of XXX tripeptides.)
We trace assembly trajectories over this graph embedding to identify DXXX-OPV3-XXXD molecules that follow similar and dissimilar dynamical assembly pathways and terminal states.

The diffusion map eigenvalue spectrum possesses a spectral gap after the third non-trivial eigenvalue, motivating 3D embeddings into the three leading non-trivial eigenvectors {ψ₂, ψ₃, ψ₄}. Further, the projection into two of these eigenvectors defines a curved, relatively thin manifold indicating that these two embedding dimensions are correlated (Fig. S1). Accordingly, without too much loss of information we drop one of them and construct visually simpler 2D embeddings that we present in Fig. 9. In Fig. 9a we present the composite embedding of all 558,000 simulation snapshots. We find one of the retained embedding dimensions to be moderately strongly correlated with the average number of per-molecule π-core–π-core contacts κ (ρ = 0.78, p-value < 1 × 10⁻…), and the other with the mass-averaged cluster size of the system M_z (ρ = 0.62, p-value < 1 × 10⁻…). In Fig. 9b-d we highlight the assembly trajectories for three selected molecules: DVAA-OPV3-AAVD as a good assembler with κ = 5.85 ± …, DGEG-OPV3-GEGD as an intermediate assembler with κ = 5.02 ± …, and DWWI-OPV3-IWWD as a poor assembler with κ = 3.74 ± …. All trajectories commence in the region of the manifold corresponding to an approximate monomeric dispersion. Good assemblers such as DVAA-OPV3-AAVD follow pathways that travel along the lower perimeter of the manifold and terminate in its top-right corner. The configurations in the top-right corner comprise pseudo-1D aggregates with good in-register stacking between the π-cores and large values of κ. Intermediate assemblers such as DGEG-OPV3-GEGD follow similar pathways that traverse the left and bottom perimeter, but terminate in the bottom-right region of the manifold. This bottom-right region comprises loosely connected pseudo-1D aggregates which fail to form a globally connected pseudo-1D structure and possess intermediate values of κ.
Lastly, poor assemblers such as DWWI-OPV3-IWWD follow pathways that travel along the top of the manifold and terminate within its bulk, corresponding to disordered aggregates with poor in-register stacking and smaller κ.

Figure 9: Diffusion map embeddings of the N = 186 DXXX-OPV3-XXXD molecular simulation trajectories into the 2D diffusion map plane. (a) Composite embedding of all 558,000 simulation snapshots. Each point represents a snapshot from one of the simulation trajectories and points are colored by the total number of edges in the corresponding molecular interaction graph (Section 2.2.1). (b-d) Temporal assembly courses of selected molecules over the 2D manifold, where points are colored by simulation time: (b) DVAA-OPV3-AAVD (good assembler: κ = 5.85 ± …), (c) DGEG-OPV3-GEGD (intermediate assembler: κ = 5.02 ± …), and (d) DWWI-OPV3-IWWD (poor assembler: κ = 3.74 ± …).

The diffusion map embedding of the assembly trajectories presents a means to identify groups of molecules with similar assembly behaviors and extract design precepts to promote good assembly behavior. We map each of the N = 186 DXXX-OPV3-XXXD molecules in the diffusion map embedding to a single 3D point by averaging over the locations of the final 50 ns of simulation data in the space of the top three non-trivial diffusion map eigenvectors {ψ_i}, i = 2-4. We then perform agglomerative hierarchical clustering using Ward's method. We cut the resulting dendrogram to partition the molecules into three clusters (Fig. S2), and illustrate the clustering of the N = 186 molecules within the 2D diffusion map embedding in Fig. 10.
The three clusters reveal a natural categorization into good, intermediate, and poor assemblers: (i) the green cluster of points in the top-right of the embedding comprises the good assemblers that form pseudo-1D linear stacks, (ii) the red cluster located in the bottom-right of the manifold contains intermediate assemblers that form loosely connected small linear aggregates, and (iii) the orange cluster located in the bulk of the manifold comprises poor assemblers that form disordered and disconnected clusters with poor π-core stacking. We then propagated the cluster labels defined over these N = 186 molecules to the remaining (1,331 − 186) = 1,145 molecules by performing a nearest-neighbor assignment based on distances within the VAE latent space in the terminal round of active learning (Section 2.2.2). A listing of the cluster assignments of each of the 1,331 molecules is provided in Table S3.

Our classification of the 1,331 molecules allows us to perform a statistical analysis of the enrichment or depletion of amino acid residues in good assemblers relative to intermediate or poor assemblers at each of the three X_i positions in the DX₁X₂X₃-OPV3-X₃X₂X₁D sequence (Fig. 11). A fuller analysis would account for the complete tripeptide sequence to consider the effects of interactions with the other amino acids, but this simpler one-body analysis is both interpretable and illuminating. Drawing a significance cutoff at 1.5× enrichment or depletion (p-value = 5 × 10⁻…, one-tailed Fisher's exact test), within good assemblers at the X₁ position we observe significant enrichment in {A, G, I, L, V} residues and depletion in {F, W}. At X₂, we observe an enrichment in {G, I, L} and impoverishment in {F, V, W}. Finally, X₃ is enriched in {G, I, L, V} and impoverished in {W, Y}.

First considering the depleted amino acids, the largest hydrophobic residue W is disfavored in good assemblers at all positions.
This can be understood as these bulky aromatic side chains possessing favorable π-stacking interactions with the π-cores, thereby disrupting π-core–π-core stacking. The W residue is most strongly disfavored in the core-adjacent X₃ position, consistent with prior experimental work in which smaller UV-vis spectral shifts were observed upon assembly for molecules containing aromatic residues. Of the remaining two aromatic amino acids, F is similarly disfavored, albeit not to the same degree, but the picture for Y is surprisingly nuanced. Y is moderately disfavored at one position and strongly disfavored at a second, but at the third it is neither favored nor disfavored. The latter observation was unanticipated, and we currently lack an understanding for why this should be so.

Figure 10: Spectral clustering of the N = 186 sampled molecules projected into the top three non-trivial diffusion map eigenvectors {ψ_i}, i = 2-4. Points are clustered based on agglomerative hierarchical clustering and cutting the resulting dendrogram at the level of three clusters. Grey points represent instantaneous snapshots harvested from all molecular simulations projected into the 2D diffusion map plane. Colored points represent the average coordinates over the terminal 50 ns of simulation for each of the N = 186 molecules. Points are color coded by cluster: green (good assemblers), red (intermediate assemblers), and orange (poor assemblers).
This analysis illuminates how location within the tripeptide acts in concert with the inherent physicochemical attributes of an amino acid to modulate its effect on assembly.

Regarding the enriched amino acids, the smaller hydrophobic residues G, I, and L are strongly favored at all positions, with I particularly favored at X₃. This preference can be understood as the smaller aliphatic residues enabling closer packing between the peptide wings than their bulkier counterparts, while their absence of aromatic character reduces interference with the co-facial stacking of the π-cores. Residue A is moderately favored at X₁ and X₂, but neither favored nor disfavored at X₃. Conversely, V is moderately to strongly favored at X₁ and X₃, but moderately disfavored at X₂.

Finally, there are no strong preferences for residues D, E, and M at any of the three positions, with the exception of a moderate favorability for M at the position closest to the π-core.

Figure 11: Residue enrichment analysis of each position Xᵢ in DX₁X₂X₃-OPV3-X₃X₂X₁D within molecules classified as good assemblers relative to those classified as intermediate or poor assemblers. Each panel plots the fractional representation of each residue at position Xᵢ among intermediate or poor assemblers against its fractional representation among good assemblers. Good assemblers are enriched in amino acids residing below the dashed black line and depleted in those residing above it. Dashed green and red lines show the boundaries for 1.5× and 2× differential enrichment and depletion.

The primary goal of this work was to employ molecular simulation to identify members of the DXXX-OPV3-XXXD oligopeptide family exhibiting promising assembly into pseudo-1D nanoaggregates with good optoelectronic properties, and to discover design precepts for the good assemblers. Trial-and-error exploration of the full chemical space is computationally and experimentally intractable, motivating our use of techniques from optimal experimental design and deep representational learning to efficiently traverse the space of XXX tripeptide sequences and minimize the number of expensive molecular simulations required to identify the top candidates. Employing a combination of coarse-grained molecular simulation, variational autoencoders, Gaussian process regression, and Bayesian optimization, we defined an iterative active learning protocol that constructs surrogate models of assembly behavior from the simulation data collected to date and uses these models to optimally direct the next round of simulations. The loop is terminated when the surrogate model ceases to improve with additional simulation data and we can reliably predict the top-performing molecules. Using this platform, we computed a converged rank ordering of the DXXX-OPV3-XXXD oligopeptides in terms of assembly quality after directly simulating only 2.3% of all possible oligopeptide sequences. The calculated ranking is consistent with existing understanding of what constitutes good and bad sequences for assembly, but also reveals promising new candidate molecules as superior assemblers that have not previously been considered.
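The fit-select-simulate loop described above can be sketched with a toy one-dimensional stand-in for the VAE latent space. Everything below is an illustrative assumption rather than the parameters used in this work: the quality landscape is synthetic, scikit-learn's Gaussian process regressor substitutes for the full surrogate model, and expected improvement serves as the acquisition function.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Toy stand-in for the latent coordinates of all candidate molecules
latent = np.linspace(-3, 3, 200).reshape(-1, 1)

def quality(z):
    """Hypothetical assembly-quality landscape (illustration only)."""
    return np.exp(-z**2) + 0.3 * np.sin(3 * z)

# Seed round: a few randomly "simulated" molecules
idx = list(rng.choice(len(latent), 5, replace=False))

for round_ in range(10):
    X, y = latent[idx], quality(latent[idx]).ravel()
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  alpha=1e-6, normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(latent, return_std=True)
    # Expected improvement over the best quality observed so far
    best = y.max()
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (mu - best) / sigma
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[idx] = -np.inf               # never re-select a simulated molecule
    idx.append(int(np.argmax(ei)))  # "simulate" the most promising candidate

print("molecules simulated:", len(idx))
```

In the actual protocol, evaluating `quality` corresponds to running an expensive coarse-grained molecular dynamics simulation, which is why minimizing the number of selections matters.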
Our ranked list presents an inexpensive filtration of the complete DXXX-OPV3-XXXD sequence space to direct expensive experimental synthesis and characterization efforts towards the most promising candidate molecules.

A subsequent analysis of the molecular simulation trajectories reveals a low-dimensional manifold within the high-dimensional configurational space over which assembly proceeds. Clustering of the simulation trajectories within this space reveals a natural partitioning of the DXXX-OPV3-XXXD family into good, intermediate, and poor assemblers. Statistical analysis of these classes reveals that the good assemblers are enriched in small and intermediate-sized hydrophobic residues and depleted in large aromatic residues, and that Asp, Glu, and Met do not strongly influence the quality of assembly. The one exception to the latter result is that Met in the position closest to the π-core does moderately to strongly favor assembly. These design precepts provide understanding of the rankings established by the active learning protocol.

In sum, this work offers a comprehensive investigation of the assembly landscape of the DXXX-OPV3-XXXD family of π-conjugated peptides using Bayesian optimal experimental design to guide expensive coarse-grained molecular simulations over microsecond time scales. Our calculations efficiently furnish a rank ordering of the DXXX-OPV3-XXXD family and identify a small number of top-performing candidates. While these predictions are only as good as the accuracy of the (coarse-grained) molecular model, they are consistent with existing physicochemical understanding and can be viewed as a coarse computational filtration of the complete sequence space that can guide subsequent computation and experiment towards the most promising candidates.
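The clustering and label-propagation procedure recapped above — cut an agglomerative hierarchical clustering dendrogram at three clusters over the simulated molecules, then assign each unsimulated molecule the label of its nearest simulated neighbor — can be sketched as follows. The two-dimensional coordinates are synthetic stand-ins for the learned latent space, and the scikit-learn estimators are assumed rather than the authors' exact tooling.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Toy stand-in for per-molecule coordinates of the 186 simulated molecules
# in a learned low-dimensional space (three well-separated groups).
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
simulated = np.vstack([c + 0.3 * rng.standard_normal((62, 2)) for c in centers])

# Cut the hierarchical clustering dendrogram at three clusters
# (good / intermediate / poor assemblers).
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(simulated)

# Propagate cluster labels to unsimulated molecules by nearest neighbor
# in the same space.
unsimulated = np.array([[0.2, 0.1], [4.8, -0.2], [0.1, 5.3]])
propagated = KNeighborsClassifier(n_neighbors=1).fit(simulated, labels).predict(unsimulated)
print(propagated)
```

Ward linkage is one reasonable choice for compact clusters; with the real data, `simulated` would hold the 186 classified molecules and `unsimulated` the remaining 1,145.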
Ongoing experimental work will attempt to synthesize and test the optoelectronic properties of the candidates identified in this work, while future computational studies will generalize the approach to D(X)ₙ-Π-(X)ₙD molecules by extending the considered chemical space to include different Π cores, such as perylene diimide (PDI) or oligothiophene (OT), and by varying the length of the peptide wings. Our platform is also generically extensible to the design of other peptide and peptide-like systems, including antimicrobial peptides, cell-penetrating peptides, intrinsically disordered proteins, and peptoids, wherever the efficient traversal of chemical space, identification of small numbers of top-performing candidates, and exposure of comprehensible design precepts are prioritized.

Data Availability

The coarse-grained molecular simulation trajectories of the self-assembly of the 186 DXXX-OPV3-XXXD molecules conducted in this work are hosted for free public download at the Materials Data Facility, a project affiliated with the NIST Center for Hierarchical Materials Design, at http://dx.doi.org/10.18126/xqiz-hzc2. Python 3 Jupyter notebooks implementing our active search procedures are available on GitHub at https://github.com/KirillShmilovich/ActiveLearningCG.

Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant Nos. DMR-1841807 and DMR-1728947, and by a National Science Foundation Graduate Research Fellowship to K.S. under Grant No. DGE-1746045. This work was completed in part with resources provided by the University of Chicago Research Computing Center. We gratefully acknowledge computing time on the University of Chicago high-performance GPU-based cyberinfrastructure (Grant No. DMR-1828629). We thank Dr. Ben Blaiszik for his assistance in hosting our simulation trajectories on the Materials Data Facility.